Springer Texts in Statistics

Advisors: George Casella, Stephen Fienberg, Ingram Olkin

Alfred: Elements of Statistics for the Life and Social Sciences
Athreya and Lahiri: Measure Theory and Probability Theory
Berger: An Introduction to Probability and Stochastic Processes
Bilodeau and Brenner: Theory of Multivariate Statistics
Blom: Probability and Statistics: Theory and Applications
Brockwell and Davis: Introduction to Time Series and Forecasting, Second Edition
Carmona: Statistical Analysis of Financial Data in S-Plus
Chow and Teicher: Probability Theory: Independence, Interchangeability, Martingales, Third Edition
Christensen: Advanced Linear Modeling: Multivariate, Time Series, and Spatial Data; Nonparametric Regression and Response Surface Maximization, Second Edition
Christensen: Log-Linear Models and Logistic Regression, Second Edition
Christensen: Plane Answers to Complex Questions: The Theory of Linear Models, Third Edition
Creighton: A First Course in Probability Models and Statistical Inference
Davis: Statistical Methods for the Analysis of Repeated Measurements
Dean and Voss: Design and Analysis of Experiments
du Toit, Steyn, and Stumpf: Graphical Exploratory Data Analysis
Durrett: Essentials of Stochastic Processes
Edwards: Introduction to Graphical Modelling, Second Edition
Finkelstein and Levin: Statistics for Lawyers
Flury: A First Course in Multivariate Statistics
Ghosh, Delampady and Samanta: An Introduction to Bayesian Analysis: Theory and Methods
Gut: Probability: A Graduate Course
Heiberger and Holland: Statistical Analysis and Data Display: An Intermediate Course with Examples in S-PLUS, R, and SAS
Jobson: Applied Multivariate Data Analysis, Volume I: Regression and Experimental Design
Jobson: Applied Multivariate Data Analysis, Volume II: Categorical and Multivariate Methods
Kalbfleisch: Probability and Statistical Inference, Volume I: Probability, Second Edition
Kalbfleisch: Probability and Statistical Inference, Volume II: Statistical Inference, Second Edition
Karr: Probability
Keyfitz: Applied Mathematical Demography, Second Edition
Kiefer: Introduction to Statistical Inference
Kokoska and Nevison: Statistical Tables and Formulae
Kulkarni: Modeling, Analysis, Design, and Control of Stochastic Systems
Lange: Applied Probability
Lange: Optimization
Lehmann: Elements of Large-Sample Theory
(continued after index)

Krishna B. Athreya Soumendra N. Lahiri

Measure Theory and Probability Theory

Krishna B. Athreya
Department of Mathematics and Department of Statistics
Iowa State University
Ames, IA 50011
[email protected]

Soumendra N. Lahiri
Department of Statistics
Iowa State University
Ames, IA 50011
[email protected]

Editorial Board

George Casella
Department of Statistics
University of Florida
Gainesville, FL 32611-8545
USA

Stephen Fienberg
Department of Statistics
Carnegie Mellon University
Pittsburgh, PA 15213-3890
USA

Ingram Olkin
Department of Statistics
Stanford University
Stanford, CA 94305
USA

Library of Congress Control Number: 2006922767 ISBN-10: 0-387-32903-X ISBN-13: 978-0387-32903-1

e-ISBN: 0-387-35434-4

Printed on acid-free paper. ©2006 Springer Science+Business Media, LLC All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. Printed in the United States of America. 9 8 7 6 5 4 3 2 1 springer.com

(MVY)

Dedicated to our wives Krishna S. Athreya and Pubali Banerjee and to the memory of Uma Mani Athreya and Narayani Ammal

Preface

This book arose out of two graduate courses that the authors have taught during the past several years: the first on measure theory, followed by the second on advanced probability theory. The traditional approach to a first course in measure theory, such as in Royden (1988), is to teach the Lebesgue measure on the real line, then the differentiation theorems of Lebesgue and $L^p$-spaces on $\mathbb{R}$, and to treat general measures at the end of the course, with one main application to the construction of product measures. This approach does have the pedagogic advantage of seeing one concrete case first before going to the general one. But it also has the disadvantage of making many students’ perspective on measure theory somewhat narrow. It leads them to think only in terms of the Lebesgue measure on the real line and to believe that measure theory is intimately tied to the topology of the real line. As students of statistics, probability, physics, engineering, economics, and biology know very well, there are mass distributions that are typically nonuniform, and hence it is useful to gain a general perspective.

This book attempts to provide that general perspective right from the beginning. The opening chapter gives an informal introduction to measure and integration theory. It shows that the notions of a σ-algebra of sets and countable additivity of a set function are dictated by certain very natural approximation procedures from practical applications and that they are not just some abstract ideas. Next, the general extension theorem of Caratheodory is presented in Chapter 1. As immediate examples, the construction of the large class of Lebesgue-Stieltjes measures on the real line and Euclidean spaces is discussed, as are measures on finite and countable spaces. Concrete examples such as the classical Lebesgue measure and various probability distributions on the real line are provided. This is further developed in Chapter 6, leading to the construction of measures on sequence spaces (i.e., sequences of random variables) via Kolmogorov’s consistency theorem.

After providing a fairly comprehensive treatment of measure and integration theory in the first part (Introduction and Chapters 1–5), the focus moves on to probability theory in the second part (Chapters 6–13). The feature that distinguishes probability theory from measure theory, namely, the notion of independence and dependence of random variables (i.e., measurable functions), is carefully developed first. Then the laws of large numbers are taken up. This is followed by convergence in distribution and the central limit theorems. Next the notion of conditional expectation and probability is developed, followed by discrete parameter martingales. Although the development of these topics is based on a rigorous measure theoretic foundation, the heuristic and intuitive backgrounds of the results are emphasized throughout. Along the way, some applications of the results from probability theory to proving classical results in analysis are given. These include, for example, the density of normal numbers on (0,1) and the Weierstrass approximation theorem. These are intended to emphasize the benefits of studying both areas in a rigorous and combined fashion.

The approach to conditional expectation is via the mean square approximation of the “unknown” given the “known” and then a careful approximation for the $L^1$-case. This is a natural and intuitive approach and is preferred over the “black box” approach based on the Radon-Nikodym theorem.

The final part of the book provides a basic outline of a number of special topics.
These include Markov chains, including Markov chain Monte Carlo (MCMC), Poisson processes, Brownian motion, bootstrap theory, mixing processes, and branching processes. The first two parts can be used for a two-semester sequence, and the last part could serve as a starting point for a seminar course on special topics.

This book presents the basic material on measure and integration theory and probability theory in a self-contained and step-by-step manner. It is hoped that students will find it accessible, informative, and useful and also that they will be motivated to master the details by carefully working out the text material as well as the large number of exercises. The authors hope that the presentation here is found to be clear and comprehensive without being intimidating.

Here is a quick summary of the various chapters of the book. After giving an informal introduction to the ideas of measure and integration theory, the construction of measures starting with set functions on a small class of sets is taken up in Chapter 1, where the Caratheodory extension theorem is proved and then applied to construct Lebesgue-Stieltjes measures. Integration theory is taken up in Chapter 2, where all the basic convergence theorems, including the MCT, Fatou, DCT, BCT, Egorov’s, and Scheffe’s, are proved. Included here are also the notion of uniform integrability and the classical approximation theorem of Lusin and its use in $L^p$-approximation by smooth functions. The third chapter presents basic inequalities for $L^p$-spaces, the Riesz-Fischer theorem, and the elementary theory of Banach and Hilbert spaces. Chapter 4 deals with Radon-Nikodym theory via the Riesz representation on $L^2$-spaces and its application to differentiation theorems on the real line as well as to signed measures. Chapter 5 deals with product measures and the Fubini-Tonelli theorems. Two constructions of the product measure are presented: one using the extension theorem and another via iterated integrals. This is followed by a discussion of convolutions, Laplace transforms, Fourier series, and Fourier transforms.

Kolmogorov’s consistency theorem for the construction of stochastic processes is taken up in Chapter 6, followed by the notion of independence in Chapter 7. The laws of large numbers are presented in a unified manner in Chapter 8, where the classical Kolmogorov’s strong law as well as Etemadi’s strong law are presented, followed by the Marcinkiewicz-Zygmund laws. There are also sections on renewal theory and ergodic theorems. The notion of weak convergence of probability measures on $\mathbb{R}$ is taken up in Chapter 9, and Chapter 10 introduces characteristic functions (Fourier transforms of probability measures), the inversion formula, and the Levy-Cramer continuity theorem. Chapter 11 is devoted to the central limit theorem and its extensions to stable and infinitely divisible laws. Chapter 12 discusses conditional expectation and probability, where an $L^2$-approach followed by an approximation to $L^1$ is presented. Discrete time martingales are introduced in Chapter 13, where the basic inequalities as well as convergence results are developed. Some applications to random walks are indicated as well.

Chapter 14 discusses discrete time Markov chains with a discrete state space first.
This is followed by discrete time Markov chains with general state spaces, where the regeneration approach for Harris chains is carefully explained and is used to derive the basic limit theorems via the iid cycles approach. There are also discussions of Feller Markov chains on Polish spaces and Markov chain Monte Carlo methods. An elementary treatment of Brownian motion is presented in Chapter 15 along with a treatment of continuous time jump Markov chains. Chapters 16–18 provide brief outlines, respectively, of the bootstrap theory, mixing processes, and branching processes. There is an Appendix that reviews basic material on elementary set theory, real and complex numbers, and metric spaces.

Here are some suggestions on how to use the book.

1. For a one-semester course on real analysis (i.e., measure and integration theory), material up to Chapter 5 and the Appendix should provide adequate coverage, with Chapter 6 being optional.

2. A one-semester course on advanced probability theory for those with the necessary measure theory background could be based on Chapters 6–13 with a selection of topics from Chapters 14–18.


3. A one-semester course on a combined treatment of measure theory and probability theory could be built around Chapters 1 and 2, Sections 3.1–3.2 of Chapter 3, all of Chapter 4 (Section 4.2 optional), Sections 5.1 and 5.2 of Chapter 5, Chapters 6 and 7, and Sections 8.1, 8.2, 8.3 (Sections 8.5 and 8.6 optional) of Chapter 8. Such a course could be followed by another that includes some coverage of Chapters 9–12 before moving on to other areas such as mathematical statistics or martingales and financial mathematics. This will be particularly useful for graduate programs in statistics.

4. A one-semester course on an introduction to stochastic processes or a seminar on special topics could be based on Chapters 14–18.

A word on the numbering system used in the book. Statements of results (i.e., Theorems, Corollaries, Lemmas, and Propositions) are numbered consecutively within each section, in the format a.b.c, where a is the chapter number, b is the section number, and c is the counter. Definitions, Examples, and Remarks are numbered individually within each section, also in the form a.b.c, as above. Sections are referred to as a.b, where a is the chapter number and b is the section number. Equation numbers appear on the right, in the form (b.c), where b is the section number and c is the equation number. Equations in a given chapter a are referred to as (b.c) within the chapter but as (a.b.c) outside chapter a. Problems are listed at the end of each chapter in the form a.c, where a is the chapter number and c is the problem number.

In the writing of this book, material from existing books such as Apostol (1974), Billingsley (1995), Chow and Teicher (2001), Chung (1974), Durrett (2004), Royden (1988), and Rudin (1976, 1987) has been freely used. The authors owe a great debt to these books.

The authors have used this material for courses taught over several years and have benefited greatly from suggestions for improvement from students and colleagues at Iowa State University, Cornell University, the Indian Institute of Science, and the Indian Statistical Institute. We are grateful to them. Our special thanks go to Dean Issacson, Ken Koehler, and Justin Peters at Iowa State University for their administrative support of this long project. Krishna Athreya would also like to thank Cornell University for its support.

We are most indebted to Sharon Shepard, who typed and retyped this book several times, patiently putting up with our never-ending “final” versions. Without her patient and generous help, this book could not have been written. We are also grateful to Denise Riker, who typed portions of an earlier version of this book. John Kimmel of Springer got the book reviewed at various stages. The referee reports were very helpful and encouraging. Our grateful thanks go to both John Kimmel and the referees.


We have tried hard to make this book free of mathematical and typographical errors and of misleading or ambiguous statements, but we are aware that many remain that we have not caught. We will be most grateful to receive corrections and suggestions for improvement. They can be e-mailed to us at [email protected] or [email protected].

On a personal note, we would like to thank our families for their patience and support. Krishna Athreya would like to record his profound gratitude to his maternal granduncle, the late Shri K. Venkatarama Iyer, who opened the door to mathematical learning for him at a crucial stage in high school, to the late Professor D. Basu of the Indian Statistical Institute, who taught him to think probabilistically, and to Professor Samuel Karlin of Stanford University for initiating him into research in mathematics.

K. B. Athreya
S. N. Lahiri
May 12, 2006

Contents

Preface

Measures and Integration: An Informal Introduction

1 Measures
  1.1 Classes of sets
  1.2 Measures
  1.3 The extension theorems and Lebesgue-Stieltjes measures
    1.3.1 Caratheodory extension of measures
    1.3.2 Lebesgue-Stieltjes measures on $\mathbb{R}$
    1.3.3 Lebesgue-Stieltjes measures on $\mathbb{R}^2$
    1.3.4 More on extension of measures
  1.4 Completeness of measures
  1.5 Problems

2 Integration
  2.1 Measurable transformations
  2.2 Induced measures, distribution functions
    2.2.1 Generalizations to higher dimensions
  2.3 Integration
  2.4 Riemann and Lebesgue integrals
  2.5 More on convergence
  2.6 Problems

3 $L^p$-Spaces
  3.1 Inequalities
  3.2 $L^p$-Spaces
    3.2.1 Basic properties
    3.2.2 Dual spaces
  3.3 Banach and Hilbert spaces
    3.3.1 Banach spaces
    3.3.2 Linear transformations
    3.3.3 Dual spaces
    3.3.4 Hilbert space
  3.4 Problems

4 Differentiation
  4.1 The Lebesgue-Radon-Nikodym theorem
  4.2 Signed measures
  4.3 Functions of bounded variation
  4.4 Absolutely continuous functions on $\mathbb{R}$
  4.5 Singular distributions
    4.5.1 Decomposition of a cdf
    4.5.2 Cantor ternary set
    4.5.3 Cantor ternary function
  4.6 Problems

5 Product Measures, Convolutions, and Transforms
  5.1 Product spaces and product measures
  5.2 Fubini-Tonelli theorems
  5.3 Extensions to products of higher orders
  5.4 Convolutions
    5.4.1 Convolution of measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$
    5.4.2 Convolution of sequences
    5.4.3 Convolution of functions in $L^1(\mathbb{R})$
    5.4.4 Convolution of functions and measures
  5.5 Generating functions and Laplace transforms
  5.6 Fourier series
  5.7 Fourier transforms on $\mathbb{R}$
  5.8 Plancherel transform
  5.9 Problems

6 Probability Spaces
  6.1 Kolmogorov’s probability model
  6.2 Random variables and random vectors
  6.3 Kolmogorov’s consistency theorem
  6.4 Problems

7 Independence
  7.1 Independent events and random variables
  7.2 Borel-Cantelli lemmas, tail σ-algebras, and Kolmogorov’s zero-one law
  7.3 Problems

8 Laws of Large Numbers
  8.1 Weak laws of large numbers
  8.2 Strong laws of large numbers
  8.3 Series of independent random variables
  8.4 Kolmogorov and Marcinkiewicz-Zygmund SLLNs
  8.5 Renewal theory
    8.5.1 Definitions and basic properties
    8.5.2 Wald’s equation
    8.5.3 The renewal theorems
    8.5.4 Renewal equations
    8.5.5 Applications
  8.6 Ergodic theorems
    8.6.1 Basic definitions and examples
    8.6.2 Birkhoff’s ergodic theorem
  8.7 Law of the iterated logarithm
  8.8 Problems

9 Convergence in Distribution
  9.1 Definitions and basic properties
  9.2 Vague convergence, Helly-Bray theorems, and tightness
  9.3 Weak convergence on metric spaces
  9.4 Skorohod’s theorem and the continuous mapping theorem
  9.5 The method of moments and the moment problem
    9.5.1 Convergence of moments
    9.5.2 The method of moments
    9.5.3 The moment problem
  9.6 Problems

10 Characteristic Functions
  10.1 Definition and examples
  10.2 Inversion formulas
  10.3 Levy-Cramer continuity theorem
  10.4 Extension to $\mathbb{R}^k$
  10.5 Problems

11 Central Limit Theorems
  11.1 Lindeberg-Feller theorems
  11.2 Stable distributions
  11.3 Infinitely divisible distributions
  11.4 Refinements and extensions of the CLT
    11.4.1 The Berry-Esseen theorem
    11.4.2 Edgeworth expansions
    11.4.3 Large deviations
    11.4.4 The functional central limit theorem
    11.4.5 Empirical process and Brownian bridge
  11.5 Problems

12 Conditional Expectation and Conditional Probability
  12.1 Conditional expectation: Definitions and examples
  12.2 Convergence theorems
  12.3 Conditional probability
  12.4 Problems

13 Discrete Parameter Martingales
  13.1 Definitions and examples
  13.2 Stopping times and optional stopping theorems
  13.3 Martingale convergence theorems
  13.4 Applications of martingale methods
    13.4.1 Supercritical branching processes
    13.4.2 Investment sequences
    13.4.3 A conditional Borel-Cantelli lemma
    13.4.4 Decomposition of probability measures
    13.4.5 Kakutani’s theorem
    13.4.6 de Finetti’s theorem
  13.5 Problems

14 Markov Chains and MCMC
  14.1 Markov chains: Countable state space
    14.1.1 Definition
    14.1.2 Examples
    14.1.3 Existence of a Markov chain
    14.1.4 Limit theory
  14.2 Markov chains on a general state space
    14.2.1 Basic definitions
    14.2.2 Examples
    14.2.3 Chapman-Kolmogorov equations
    14.2.4 Harris irreducibility, recurrence, and minorization
    14.2.5 The minorization theorem
    14.2.6 The fundamental regeneration theorem
    14.2.7 Limit theory for regenerative sequences
    14.2.8 Limit theory of Harris recurrent Markov chains
    14.2.9 Markov chains on metric spaces
  14.3 Markov chain Monte Carlo (MCMC)
    14.3.1 Introduction
    14.3.2 Metropolis-Hastings algorithm
    14.3.3 The Gibbs sampler
  14.4 Problems

15 Stochastic Processes
  15.1 Continuous time Markov chains
    15.1.1 Definition
    15.1.2 Kolmogorov’s differential equations
    15.1.3 Examples
    15.1.4 Limit theorems
  15.2 Brownian motion
    15.2.1 Construction of SBM
    15.2.2 Basic properties of SBM
    15.2.3 Some related processes
    15.2.4 Some limit theorems
    15.2.5 Some sample path properties of SBM
    15.2.6 Brownian motion and martingales
    15.2.7 Some applications
    15.2.8 The Black-Scholes formula for stock price option
  15.3 Problems

16 Limit Theorems for Dependent Processes
  16.1 A central limit theorem for martingales
  16.2 Mixing sequences
    16.2.1 Mixing coefficients
    16.2.2 Coupling and covariance inequalities
  16.3 Central limit theorems for mixing sequences
  16.4 Problems

17 The Bootstrap
  17.1 The bootstrap method for independent variables
    17.1.1 A description of the bootstrap method
    17.1.2 Validity of the bootstrap: Sample mean
    17.1.3 Second order correctness of the bootstrap
    17.1.4 Bootstrap for lattice distributions
    17.1.5 Bootstrap for heavy tailed random variables
  17.2 Inadequacy of resampling single values under dependence
  17.3 Block bootstrap
  17.4 Properties of the MBB
    17.4.1 Consistency of MBB variance estimators
    17.4.2 Consistency of MBB cdf estimators
    17.4.3 Second order properties of the MBB
  17.5 Problems

18 Branching Processes
  18.1 Bienyeme-Galton-Watson branching process
  18.2 BGW process: Multitype case
  18.3 Continuous time branching processes
  18.4 Embedding of Urn schemes in continuous time branching processes
  18.5 Problems

A Advanced Calculus: A Review
  A.1 Elementary set theory
    A.1.1 Set operations
    A.1.2 The principle of induction
    A.1.3 Equivalence relations
  A.2 Real numbers, continuity, differentiability, and integration
    A.2.1 Real numbers
    A.2.2 Sequences, series, limits, limsup, liminf
    A.2.3 Continuity and differentiability
    A.2.4 Riemann integration
  A.3 Complex numbers, exponential and trigonometric functions
  A.4 Metric spaces
    A.4.1 Basic definitions
    A.4.2 Continuous functions
    A.4.3 Compactness
    A.4.4 Sequences of functions and uniform convergence
  A.5 Problems

B List of Abbreviations and Symbols
  B.1 Abbreviations
  B.2 Symbols

References

Author Index

Subject Index

Measures and Integration: An Informal Introduction

For many students who are learning measure and integration theory for the first time, the notions of a $\sigma$-algebra of subsets of a set $\Omega$, countable additivity of a set function $\lambda$, measurability of a function, the definition of an integral, and the interchange of limits and integration are not easy to understand and often seem unintuitive. The goals of this informal introduction to the subject are (1) to show that the notions of $\sigma$-algebra and countable additivity are logical consequences of certain natural approximation procedures, and (2) to show that the dividends from assuming these two properties are great: they lead to a nice and natural theory that is also very powerful for the handling of limits. Of course, as the saying goes, the devil is in the details. After this informal introduction, the necessary details are given in the next few sections. It is hoped that, after this heuristic explanation of the subject, students will be motivated to master the details.

What is Measure Theory? A simple answer is that it is a theory about the distribution of mass over a set $S$. If the mass is uniformly distributed and $S$ is a Euclidean space $\mathbb{R}^k$, it is the theory of Lebesgue measure on $\mathbb{R}^k$ (i.e., length in $\mathbb{R}$, area in $\mathbb{R}^2$, volume in $\mathbb{R}^3$, etc.). Probability theory is concerned with the case when $S$ is the sample space of a random experiment and the total mass is one.

Consider the following example. Imagine an open field $S$ and a snowy night. At daybreak one goes to the field to measure the amount of snow in as many of the subsets of $S$ as possible. Suppose now that one has the tools to measure the snow exactly on a class of subsets, such as triangles, rectangles, circular shapes, elliptic shapes, etc., no matter how small. It is natural to try to approximate oddly-shaped regions by combinations of these “standard shapes,” and then to use a limiting process to obtain a measure for the oddly-shaped regions. Let $\mathcal{B}$ denote the class of subsets of $S$ whose measure is obtained this way and let $\lambda(B)$ denote the amount of snow in each $B \in \mathcal{B}$. Call $\mathcal{B}$ the class of all (snow) measurable sets and $\lambda(B)$ the measure (of snow) on $B$ for each $B \in \mathcal{B}$. It is reasonable to expect that the following properties of $\mathcal{B}$ and $\lambda(\cdot)$ hold:

Properties of $\mathcal{B}$

(i) $A \in \mathcal{B} \Rightarrow A^c \in \mathcal{B}$ (i.e., if one can measure the amount of snow on $A$ and knows the total amount on $S$, then one knows the amount of snow on $A^c$).

(ii) $A_1, A_2 \in \mathcal{B} \Rightarrow A_1 \cup A_2 \in \mathcal{B}$ (i.e., if one can measure the amount of snow on $A_1$ and $A_2$, then one can do the same for $A_1 \cup A_2$).

(iii) If $\{A_n : n \ge 1\} \subset \mathcal{B}$ and $A_n \subset A_{n+1}$ for all $n \ge 1$, then $\lim_{n\to\infty} A_n \equiv \bigcup_{n=1}^{\infty} A_n \in \mathcal{B}$ (i.e., if one can measure the amount of snow on $A_n$ for each $n \ge 1$ in an increasing sequence of sets, then one can do so on the limit of the $A_n$).

(iv) $\mathcal{C} \subset \mathcal{B}$, where $\mathcal{C}$ is the class of nice sets such as triangles, squares, etc., that one started with.

Properties of $\lambda(\cdot)$

(i) $\lambda(A) \ge 0$ for $A \in \mathcal{B}$ (i.e., the amount of snow on any set is nonnegative!).

(ii) If $A_1, A_2 \in \mathcal{B}$ and $A_1 \cap A_2 = \emptyset$, then $\lambda(A_1 \cup A_2) = \lambda(A_1) + \lambda(A_2)$ (i.e., the amounts of snow on two disjoint sets simply add up! This property of $\lambda$ is referred to as finite additivity).

(iii) If $\{A_n : n \ge 1\} \subset \mathcal{B}$ are such that $A_n \subset A_{n+1}$ for all $n$, then $\lambda(\lim_{n\to\infty} A_n) = \lambda\bigl(\bigcup_{n=1}^{\infty} A_n\bigr) = \lim_{n\to\infty} \lambda(A_n)$ (i.e., if we can approximate a set $A$ by an increasing sequence of sets $\{A_n\}_{n\ge 1}$ from $\mathcal{B}$, then $\lambda(A) = \lim_{n\to\infty} \lambda(A_n)$. This property of $\lambda$ is referred to as monotone continuity from below, or m.c.f.b. in short).
This last assumption (iii) is what guarantees that different approximations lead to consistent limits. Thus, if there are two increasing sequences $\{A_n\}_{n\ge1}$ and $\{A'_n\}_{n\ge1}$ having the same limit A but $\{\lambda(A_n)\}_{n\ge1}$ and $\{\lambda(A'_n)\}_{n\ge1}$ have different limits, then the approximating procedures are not consistent.


It turns out that the above set of reasonable and natural assumptions leads to a very rich and powerful theory that is widely applicable. A triplet $(S, \mathcal{B}, \lambda)$ that satisfies the above two sets of assumptions is called a measure space. The assumptions on $\mathcal{B}$ and λ are equivalent to the following:

On $\mathcal{B}$

B(i): $\emptyset$, the empty set, lies in $\mathcal{B}$.

B(ii): $A \in \mathcal{B} \Rightarrow A^c \in \mathcal{B}$ (same as (i) before).

B(iii): $A_1, A_2, \ldots \in \mathcal{B} \Rightarrow \bigcup_i A_i \in \mathcal{B}$ (combines (ii) and (iii) above; closure under countable unions).

On λ

λ(i): $\lambda(\cdot) \ge 0$ (same as (i) before) and $\lambda(\emptyset) = 0$.

λ(ii): $\lambda\big(\bigcup_{n\ge1} A_n\big) = \sum_{n=1}^{\infty} \lambda(A_n)$ if $\{A_n\}_{n\ge1} \subset \mathcal{B}$ are pairwise disjoint, i.e., $A_i \cap A_j = \emptyset$ for $i \ne j$ (countable additivity).

Any collection $\mathcal{B}$ of subsets of S that satisfies B(i), B(ii), B(iii) above is called a σ-algebra. Any set function λ on a σ-algebra $\mathcal{B}$ that satisfies λ(i) and λ(ii) above is called a measure. Thus, a measure space is a triplet $(S, \mathcal{B}, \lambda)$ where S is a nonempty set, $\mathcal{B}$ is a σ-algebra of subsets of S, and λ is a measure on $\mathcal{B}$. Notice that the σ-algebra structure on $\mathcal{B}$ and the countable additivity of λ are necessary consequences of the very natural assumptions (i), (ii), and (iii) on $\mathcal{B}$ and λ defined at the beginning.

It is not often the case that one is given $\mathcal{B}$ and λ explicitly. Typically, one starts with a small collection $\mathcal{C}$ of subsets of S that have properties resembling intervals or rectangles and a set function λ on $\mathcal{C}$. Then, $\mathcal{B}$ is the smallest σ-algebra containing $\mathcal{C}$, obtained from $\mathcal{C}$ by various operations such as countable unions, intersections, and their limits. The key properties on $\mathcal{C}$ that one needs are:

(i) $A, B \in \mathcal{C} \Rightarrow A \cap B \in \mathcal{C}$ (e.g., the intersection of intervals is an interval).

(ii) $A \in \mathcal{C} \Rightarrow A^c$ is a finite union of sets from $\mathcal{C}$ (e.g., the complement of an interval is the union of two intervals or an interval itself).

A collection $\mathcal{C}$ satisfying (i) and (ii) is called a semialgebra. The function λ on $\mathcal{B}$ is an extension of λ on $\mathcal{C}$. For this extension to be a measure on $\mathcal{B}$, the conditions needed are

(i) $\lambda(A) \ge 0$ for all $A \in \mathcal{C}$;

(ii) if $A_1, A_2, \ldots \in \mathcal{C}$ are pairwise disjoint and $A = \bigcup_{n\ge1} A_n \in \mathcal{C}$, then $\lambda(A) = \sum_{n=1}^{\infty} \lambda(A_n)$.


There is a result, known as the extension theorem, that says that given such a pair $(\mathcal{C}, \lambda)$, it is possible to extend λ to $\mathcal{B}$, the smallest σ-algebra containing $\mathcal{C}$, such that $(S, \mathcal{B}, \lambda)$ is a measure space. Actually, it does more. It constructs a σ-algebra $\mathcal{B}^*$ larger than $\mathcal{B}$ and a measure $\lambda^*$ on $\mathcal{B}^*$ such that $(S, \mathcal{B}^*, \lambda^*)$ is a larger measure space, $\lambda^*$ coincides with λ on $\mathcal{C}$, and it provides nice approximation theorems. For example, the following approximation result is available: If $B \in \mathcal{B}^*$ with $\lambda^*(B) < \infty$, then for every $\epsilon > 0$, B can be approximated by a finite union of sets from $\mathcal{C}$, i.e., there exist sets $A_1, \ldots, A_k \in \mathcal{C}$ with $k < \infty$ such that $\lambda^*(A \triangle B) < \epsilon$, where $A \equiv \bigcup_{i=1}^{k} A_i$ and $A \triangle B = (A \cap B^c) \cup (A^c \cap B)$ is the symmetric difference between A and B. That is, in principle, every (measurable) set B of finite measure (i.e., B belonging to $\mathcal{B}^*$ with $\lambda^*(B) < \infty$) is nearly a finite union of (elementary) sets that belong to $\mathcal{C}$. For example, if $S = \mathbb{R}$ and $\mathcal{C}$ is the class of intervals, then every measurable set of finite measure is nearly a finite union of disjoint bounded open intervals.

The following are some concrete examples of the above extension procedure.

Theorem: (Lebesgue-Stieltjes measures on $\mathbb{R}$). Let $F : \mathbb{R} \to \mathbb{R}$ satisfy

(i) $x_1 < x_2 \Rightarrow F(x_1) \le F(x_2)$ (nondecreasing);

(ii) $F(x) = F(x+) \equiv \lim_{y \downarrow x} F(y)$ for all $x \in \mathbb{R}$ (i.e., $F(\cdot)$ is right continuous).

Let $\mathcal{C}$ be the class of sets of the form $(a, b]$ or $(b, \infty)$, $-\infty \le a < b < \infty$. Then, there exists a measure $\mu_F$ defined on $\mathcal{B} \equiv \mathcal{B}(\mathbb{R})$, the smallest σ-algebra generated by $\mathcal{C}$, such that

$\mu_F((a, b]) = F(b) - F(a)$ for all $-\infty < a < b < \infty$.

The σ-algebra $\mathcal{B} \equiv \mathcal{B}(\mathbb{R})$ is called the Borel σ-algebra of $\mathbb{R}$.

Corollary: There exists a measure m on $\mathcal{B}(\mathbb{R})$ such that m(I) = the length of I, for any interval I.

Proof: Take $F(x) \equiv x$ in the above theorem. □

This measure is called the Lebesgue measure on $\mathbb{R}$.

Corollary: There exists a measure λ on $\mathcal{B}(\mathbb{R})$ such that

$\lambda((a, b]) = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2}\, dx.$

Proof: Take $F(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-u^2/2}\, du$, $x \in \mathbb{R}$. □


This measure is called the standard normal probability measure on $\mathbb{R}$.

Theorem: (Lebesgue-Stieltjes measures on $\mathbb{R}^2$). Let $F : \mathbb{R}^2 \to \mathbb{R}$ be a function satisfying the following:

(i) (Monotonicity) For $x = (x_1, x_2)$, $y = (y_1, y_2)$ with $x_i \le y_i$ for $i = 1, 2$,

$\Delta F(x, y) \equiv F(y_1, y_2) - F(x_1, y_2) - F(y_1, x_2) + F(x_1, x_2) \ge 0.$

(ii) (Continuity from above) $F(x) = \lim_{y_i \downarrow x_i,\, i=1,2} F(y)$ for all $x \in \mathbb{R}^2$.

Let $\mathcal{C}$ be the class of all rectangles of the form $(a, b] \equiv (a_1, b_1] \times (a_2, b_2]$ with $a = (a_1, a_2)$, $b = (b_1, b_2) \in \mathbb{R}^2$. Then there exists a measure $\mu_F$, defined on the σ-algebra $\mathcal{B} \equiv \mathcal{B}(\mathbb{R}^2)$ generated by $\mathcal{C}$, such that $\mu_F((a, b]) = \Delta F(a, b)$.

The above theorems have a converse that says that every measure on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$ that is finite on bounded sets arises from some function F (called a distribution function) and is, therefore, a Lebesgue-Stieltjes measure.

Here is another simple example of a measure space (with discrete S).

Example: Let $S = \{s_1, s_2, \ldots, s_k\}$, $k \le \infty$, and let $\mathcal{B} = \mathcal{P}(S)$, the power set of S, i.e., the collection of all possible subsets of S. Let $p_1, p_2, \ldots$ be nonnegative numbers. Let

$\lambda(A) \equiv \sum_{1 \le i \le k} p_i I_A(s_i),$

where $I_A$ is the indicator function of the set A, defined by $I_A(s) = 1$ if $s \in A$ and 0 otherwise. It is easy to verify that $(S, \mathcal{B}, \lambda)$ is a measure space and also that every measure λ on $\mathcal{B}$ arises this way.

What is Integration Theory? In short, it is a theory about weighted sums of functions on a set S when the weights are specified by a mass distribution λ. Here is a more detailed answer. Let $(S, \mathcal{B}, \lambda)$ be a measure space. Suppose $f : S \to \mathbb{R}$ is a simple function, i.e., f is such that f(S) is a finite set $\{a_1, a_2, \ldots, a_k\}$. It is reasonable to define the weighted sum of f with respect to λ as $\sum_{i=1}^{k} a_i \lambda(A_i)$, where $A_i = f^{-1}\{a_i\}$. Of course, for this to be well defined, one needs $A_i$ to be in $\mathcal{B}$ and $\lambda(A_i) < \infty$ for all i such that $a_i \ne 0$. Notice that the quantity $\sum_{i=1}^{k} a_i \lambda(A_i)$ remains the same whether the $a_i$'s are distinct or not. Call this the integral of f with respect to λ and denote


this by $\int f\, d\lambda$. If f and g are simple, then for $\alpha, \beta \in \mathbb{R}$, $\int (\alpha f + \beta g)\, d\lambda = \alpha \int f\, d\lambda + \beta \int g\, d\lambda$.

Now how should one define $\int f\, d\lambda$ (the integral of f with respect to λ) for a nonsimple f? The answer, of course, is to “approximate” by simple functions. Let f be a nonnegative function. To define the integral of f, one would like to approximate f by simple functions. It turns out that a necessary and sufficient condition for this is that for any $a \in \mathbb{R}$, the set $\{s : f(s) \le a\}$ is in $\mathcal{B}$. Such a function f is called measurable with respect to $\mathcal{B}$, or $\mathcal{B}$-measurable, or simply measurable (if $\mathcal{B}$ is kept fixed throughout). Let f be a nonnegative $\mathcal{B}$-measurable function. Then there exists a sequence $\{f_n\}_{n\ge1}$ of simple nonnegative functions such that for each $s \in S$, $\{f_n(s)\}_{n\ge1}$ is a nondecreasing sequence converging to f(s). It is now natural to define the weighted sum of f with respect to λ, i.e., the integral of f with respect to λ, denoted by $\int f\, d\lambda$, as

$\int f\, d\lambda = \lim_{n\to\infty} \int f_n\, d\lambda.$
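The approximation step just described can be sketched numerically. The text only asserts that a nondecreasing sequence of simple functions exists; the dyadic choice $f_n = \min\{\lfloor 2^n f \rfloor / 2^n,\, n\}$ used below is one standard construction and is an assumption of this sketch, as are the helper names `dyadic_approx` and `integral_discrete` and the particular discrete measure (three atoms with arbitrary weights).

```python
import math

def dyadic_approx(f, n):
    """Simple function f_n = min(floor(2^n f) / 2^n, n); nondecreasing in n."""
    def f_n(s):
        return min(math.floor(f(s) * 2**n) / 2**n, n)
    return f_n

def integral_discrete(g, points, weights):
    """Integral of g w.r.t. the discrete measure putting weights[i] at points[i]."""
    return sum(w * g(s) for s, w in zip(points, weights))

# Hypothetical discrete measure and a nonnegative function to integrate.
points = [0.3, 1.7, 2.5]
weights = [0.5, 0.25, 2.0]
f = lambda s: s * s

approximations = [
    integral_discrete(dyadic_approx(f, n), points, weights) for n in range(1, 12)
]
exact = integral_discrete(f, points, weights)
```

Here `approximations` is a nondecreasing list that approaches `exact` from below, which is the behaviour the definition of $\int f\, d\lambda$ relies on.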

An immediate question is: Is the right side the same for all such approximating sequences $\{f_n\}_{n\ge1}$? The answer is yes; it is guaranteed by the very natural assumption imposed on λ that it is finitely additive and monotone continuous from below, i.e., properties (ii) and (iii) of $\lambda(\cdot)$ above (or equivalently, that λ is countably additive, i.e., λ(ii)). One can strengthen this to a result known as the monotone convergence theorem, a key result that in turn leads to two other major convergence results.

The monotone convergence theorem (MCT): Let $(S, \mathcal{B}, \lambda)$ be a measure space and let $f_n : S \to \mathbb{R}_+$, $n \ge 1$, be a sequence of nonnegative $\mathcal{B}$-measurable functions (not necessarily simple) such that for all $s \in S$,

(i) $f_n(s) \le f_{n+1}(s)$ for all $n \ge 1$, and

(ii) $\lim_{n\to\infty} f_n(s) = f(s)$.

Then f is $\mathcal{B}$-measurable and

$\int f\, d\lambda = \lim_{n\to\infty} \int f_n\, d\lambda.$

This says that the integral and the limit can be interchanged for monotone nondecreasing nonnegative $\mathcal{B}$-measurable functions. Note that if $f_n = I_{A_n}$, the indicator function of a set $A_n$, and if $A_n \subset A_{n+1}$ for each n, then the MCT is the same as m.c.f.b. (cf. property (iii) of $\lambda(\cdot)$). Thus, the very natural assumption of m.c.f.b. yields a basic convergence result that makes the integration theory so elegant and powerful.

To extend the definition of $\int f\, d\lambda$ to a real valued, $\mathcal{B}$-measurable function $f : S \to \mathbb{R}$, one uses the simple idea that f can be decomposed as $f = f^+ - f^-$, where $f^+(s) = \max\{f(s), 0\}$ and $f^-(s) = \max\{-f(s), 0\}$, $s \in S$. Since both $f^+$ and $f^-$ are nonnegative and $\mathcal{B}$-measurable, $\int f^+\, d\lambda$ and


$\int f^-\, d\lambda$ are both well defined. Now set

$\int f\, d\lambda = \int f^+\, d\lambda - \int f^-\, d\lambda,$

provided at least one of the two terms on the right is finite. The function f is said to be integrable with respect to (w.r.t.) λ if both $\int f^+\, d\lambda$ and $\int f^-\, d\lambda$ are finite or, equivalently, if $\int |f|\, d\lambda < \infty$. The following is a consequence of the MCT.

Fatou's lemma: Let $\{f_n\}_{n\ge1}$ be a sequence of nonnegative $\mathcal{B}$-measurable functions on a measure space $(S, \mathcal{B}, \lambda)$. Then

$\int \liminf_{n\to\infty} f_n\, d\lambda \le \liminf_{n\to\infty} \int f_n\, d\lambda.$

This in turn leads to

(Lebesgue's) dominated convergence theorem (DCT): Let $\{f_n\}_{n\ge1}$ be a sequence of $\mathcal{B}$-measurable functions from a measure space $(S, \mathcal{B}, \lambda)$ to $\mathbb{R}$ and let g be a $\mathcal{B}$-measurable nonnegative integrable function on $(S, \mathcal{B}, \lambda)$. Suppose that for each s in S,

(i) $|f_n(s)| \le g(s)$ for all $n \ge 1$, and

(ii) $\lim_{n\to\infty} f_n(s) = f(s)$.

Then, f is integrable and

$\lim_{n\to\infty} \int f_n\, d\lambda = \int f\, d\lambda = \int \lim_{n\to\infty} f_n\, d\lambda.$
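A short worked example may help show why the domination hypothesis (i) cannot simply be dropped. The example is a standard one chosen here for illustration and is not taken from the text: $S = (0,1)$ with Lebesgue measure m on its Borel sets, and $f_n = n\, I_{(0,1/n)}$.

```latex
% Illustrative example (an assumption of this sketch, not from the source text):
% S = (0,1), \lambda = m (Lebesgue measure), f_n = n I_{(0,1/n)}.
\begin{align*}
  f_n(s) &= n\, I_{(0,1/n)}(s) \longrightarrow 0 = f(s)
      \quad\text{for every } s \in (0,1),\\
  \int f_n \, dm &= n \cdot m\big((0,1/n)\big) = 1
      \quad\text{for all } n \ge 1,\\
  \int \liminf_{n\to\infty} f_n \, dm &= 0 \;<\; 1
      = \liminf_{n\to\infty} \int f_n \, dm ,\\
  \int \sup_{n\ge 1} f_n \, dm
      &= \sum_{k=1}^{\infty} k\Big(\frac{1}{k}-\frac{1}{k+1}\Big)
      = \sum_{k=1}^{\infty} \frac{1}{k+1} = \infty .
\end{align*}
```

The third line shows that the inequality in Fatou's lemma can be strict. The last line uses the fact that $\sup_n f_n = k$ on $[1/(k+1), 1/k)$; since any g with $|f_n| \le g$ for all n must satisfy $g \ge \sup_n f_n$, no integrable dominating function exists, and indeed $\lim_n \int f_n\, dm = 1 \ne 0 = \int f\, dm$.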

Thus some very natural assumptions on B and λ lead to an interesting measure and integration theory that is quite general and that allows the interchange of limits and integrals under fairly general conditions. A systematic treatment of the measure and integration theory is given in the next ﬁve chapters.

1 Measures

Section 1.1 deals with algebraic operations on subsets of a given nonempty set Ω. Section 1.2 treats nonnegative set functions on classes of sets and deﬁnes the notion of a measure on an algebra. Section 1.3 treats the extension theorem, and Section 1.4 deals with completeness of measures.

1.1 Classes of sets

Let Ω be a nonempty set and $\mathcal{P}(\Omega) \equiv \{A : A \subset \Omega\}$ be the power set of Ω, i.e., the class of all subsets of Ω.

Definition 1.1.1: A collection of sets $\mathcal{F} \subset \mathcal{P}(\Omega)$ is called an algebra if (a) $\Omega \in \mathcal{F}$, (b) $A \in \mathcal{F}$ implies $A^c \in \mathcal{F}$, and (c) $A, B \in \mathcal{F}$ implies $A \cup B \in \mathcal{F}$ (i.e., closure under pairwise unions).

Thus, an algebra is a class of sets containing Ω that is closed under complementation and pairwise (and hence finite) unions. It is easy to see that one can equivalently define an algebra by requiring that properties (a), (b) hold and that the property

(c)′ $A, B \in \mathcal{F} \Rightarrow A \cap B \in \mathcal{F}$

holds (i.e., closure under finite intersections).


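Definition 1.1.2: A class $\mathcal{F} \subset \mathcal{P}(\Omega)$ is called a σ-algebra if it is an algebra and if it satisfies

(d) $A_n \in \mathcal{F}$ for $n \ge 1 \Rightarrow \bigcup_{n\ge1} A_n \in \mathcal{F}$.

Thus, a σ-algebra is a class of subsets of Ω that contains Ω and is closed under complementation and countable unions. As pointed out in the introductory chapter, a σ-algebra can be alternatively defined as an algebra that is closed under monotone unions, as the following shows.

Proposition 1.1.1: Let $\mathcal{F} \subset \mathcal{P}(\Omega)$. Then $\mathcal{F}$ is a σ-algebra if and only if $\mathcal{F}$ is an algebra and satisfies

$A_n \in \mathcal{F},\ A_n \subset A_{n+1}$ for all $n \Rightarrow \bigcup_{n\ge1} A_n \in \mathcal{F}$.

Proof: The ‘only if’ part is obvious. For the ‘if’ part, let $\{B_n\}_{n=1}^{\infty} \subset \mathcal{F}$. Then, since $\mathcal{F}$ is an algebra, $A_n \equiv \bigcup_{j=1}^{n} B_j \in \mathcal{F}$ for all n. Further, $A_n \subset A_{n+1}$ for all n and $\bigcup_{n\ge1} B_n = \bigcup_{n\ge1} A_n$. Since by hypothesis $\bigcup_n A_n \in \mathcal{F}$, it follows that $\bigcup_n B_n \in \mathcal{F}$. □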

Here are some examples of algebras and σ-algebras.

Example 1.1.1: Let $\Omega = \{a, b, c, d\}$. Consider the classes $\mathcal{F}_1 = \{\Omega, \emptyset, \{a\}\}$ and $\mathcal{F}_2 = \{\Omega, \emptyset, \{a\}, \{b, c, d\}\}$. Then, $\mathcal{F}_2$ is an algebra (and also a σ-algebra), but $\mathcal{F}_1$ is not an algebra, since $\{a\}^c \notin \mathcal{F}_1$.

Example 1.1.2: Let Ω be any nonempty set and let $\mathcal{F}_3 = \mathcal{P}(\Omega) \equiv \{A : A \subset \Omega\}$, the power set of Ω, and $\mathcal{F}_4 = \{\Omega, \emptyset\}$. Then, it is easy to check that both $\mathcal{F}_3$ and $\mathcal{F}_4$ are σ-algebras. The latter σ-algebra is often called the trivial σ-algebra on Ω (Problem 1.1).

From the definition it is clear that any σ-algebra is also an algebra, and thus $\mathcal{F}_2, \mathcal{F}_3, \mathcal{F}_4$ are examples of algebras, too. The following is an example of an algebra that is not a σ-algebra.

Example 1.1.3: Let Ω be a nonempty set, and let |A| denote the number of elements of a set $A \subset \Omega$. Define

$\mathcal{F}_5 = \{A \subset \Omega : \text{either } |A| \text{ is finite or } |A^c| \text{ is finite}\}.$


Then, note that (i) $\Omega \in \mathcal{F}_5$ (since $|\Omega^c| = |\emptyset| = 0$), and (ii) $A \in \mathcal{F}_5$ implies $A^c \in \mathcal{F}_5$ (if $|A| < \infty$, then $|(A^c)^c| = |A| < \infty$, and if $|A^c| < \infty$, then $A^c \in \mathcal{F}_5$ trivially). Next, suppose that $A, B \in \mathcal{F}_5$. If either $|A| < \infty$ or $|B| < \infty$, then $|A \cap B| \le \min\{|A|, |B|\} < \infty$, so that $A \cap B \in \mathcal{F}_5$. On the other hand, if both $|A^c| < \infty$ and $|B^c| < \infty$, then $|(A \cap B)^c| = |A^c \cup B^c| \le |A^c| + |B^c| < \infty$, implying that $A \cap B \in \mathcal{F}_5$. Thus, $\mathcal{F}_5$ is closed under finite intersections (property (c)′), and $\mathcal{F}_5$ is an algebra. However, if $|\Omega| = \infty$, then $\mathcal{F}_5$ is not a σ-algebra. To see this, suppose that $|\Omega| = \infty$ and $\{\omega_1, \omega_2, \ldots\} \subset \Omega$. Then, by definition, $A_i = \{\omega_i\} \in \mathcal{F}_5$ for all $i \ge 1$, but $A \equiv \bigcup_{i=1}^{\infty} A_{2i-1} = \{\omega_1, \omega_3, \ldots\} \notin \mathcal{F}_5$, since $|A| = |A^c| = \infty$.

Example 1.1.4: Let Ω be a nonempty set and let

$\mathcal{F}_6 = \{A \subset \Omega : A \text{ is countable or } A^c \text{ is countable}\}.$

Then, it is easy to show that $\mathcal{F}_6$ is a σ-algebra (Problem 1.3).

Suppose $\{\mathcal{F}_\theta : \theta \in \Theta\}$ is a family of σ-algebras on Ω. From the definition, it follows that the intersection $\bigcap_{\theta \in \Theta} \mathcal{F}_\theta$ is a σ-algebra, no matter how large the index set Θ is (Problem 1.4). However, the union of two σ-algebras may not even be an algebra (Problem 1.5). For the development of measure theory and probability theory, the concept of a σ-algebra plays a crucial role. In many instances, given an arbitrary collection of subsets of Ω, one would like to extend it to a possibly larger class that is a σ-algebra. This leads to the following definition.

Definition 1.1.3: If $\mathcal{A}$ is a class of subsets of Ω, then the σ-algebra generated by $\mathcal{A}$, denoted by $\sigma\langle\mathcal{A}\rangle$, is defined as

$\sigma\langle\mathcal{A}\rangle = \bigcap_{\mathcal{F} \in \mathcal{I}(\mathcal{A})} \mathcal{F},$

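where $\mathcal{I}(\mathcal{A}) \equiv \{\mathcal{F} : \mathcal{A} \subset \mathcal{F} \text{ and } \mathcal{F} \text{ is a σ-algebra on } \Omega\}$ is the collection of all σ-algebras containing the class $\mathcal{A}$. Note that since the power set $\mathcal{P}(\Omega)$ contains $\mathcal{A}$ and is itself a σ-algebra, the collection $\mathcal{I}(\mathcal{A})$ is not empty and hence, the intersection in the above definition is well defined.

Example 1.1.5: In the setup of Example 1.1.1, $\sigma\langle\mathcal{F}_1\rangle = \mathcal{F}_2$ (why?).

A particularly useful class of σ-algebras are those generated by open sets of a topological space. These are called Borel σ-algebras. A topological space is a pair $(S, \mathcal{T})$ where S is a nonempty set and $\mathcal{T}$ is a collection of subsets of S such that (i) $S \in \mathcal{T}$, (ii) $O_1, O_2 \in \mathcal{T} \Rightarrow O_1 \cap O_2 \in \mathcal{T}$, and (iii) $\{O_\alpha : \alpha \in I\} \subset \mathcal{T} \Rightarrow \bigcup_{\alpha \in I} O_\alpha \in \mathcal{T}$. Elements of $\mathcal{T}$ are called open sets.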
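For a finite Ω the claims of Examples 1.1.1 and 1.1.5 can be checked by brute force. The following is a minimal sketch, not a general algorithm: it relies on the fact that on a finite set the closure of a class under complements and pairwise unions is already the generated σ-algebra. The function names `is_algebra` and `generated_sigma_algebra` are hypothetical helpers introduced only for this illustration.

```python
from itertools import combinations

def is_algebra(omega, family):
    """Check conditions (a)-(c) of Definition 1.1.1 on a finite set omega."""
    omega = frozenset(omega)
    fam = {frozenset(a) for a in family}
    if omega not in fam:
        return False
    if any(omega - a not in fam for a in fam):
        return False
    return all(a | b in fam for a, b in combinations(fam, 2))

def generated_sigma_algebra(omega, family):
    """Close `family` under complements and pairwise unions until stable.

    On a finite omega this closure is the smallest sigma-algebra containing
    the family.
    """
    omega = frozenset(omega)
    fam = {frozenset(a) for a in family} | {omega}
    while True:
        new = {omega - a for a in fam} | {a | b for a in fam for b in fam}
        if new <= fam:
            return fam
        fam |= new

OMEGA = {"a", "b", "c", "d"}
F1 = [OMEGA, set(), {"a"}]
F2 = [OMEGA, set(), {"a"}, {"b", "c", "d"}]

f1_is_algebra = is_algebra(OMEGA, F1)          # {a}^c is missing from F1
f2_is_algebra = is_algebra(OMEGA, F2)
sigma_F1 = generated_sigma_algebra(OMEGA, F1)  # adds {b, c, d} and stops
```

Running the closure on `F1` adds only the complement `{b, c, d}`, which is one way to see why the generated σ-algebra coincides with `F2`.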


A metric space is a pair (S, d) where S is a nonempty set and d is a function from $S \times S$ to $\mathbb{R}_+$ satisfying (i) $d(x, y) = d(y, x)$ for all x, y in S, (ii) $d(x, y) = 0$ iff $x = y$, and (iii) $d(x, z) \le d(x, y) + d(y, z)$ for all x, y, z in S. Property (iii) is called the triangle inequality. The function d is called a metric on S (see A.4). Any Euclidean space $\mathbb{R}^n$ ($1 \le n < \infty$) is a metric space under any one of the following metrics:

(a) For $1 \le p < \infty$, $d_p(x, y) = \big(\sum_{i=1}^{n} |x_i - y_i|^p\big)^{1/p}$.

(b) $d_\infty(x, y) = \max_{1 \le i \le n} |x_i - y_i|$.

(c) For $0 < p < 1$, $d_p(x, y) = \sum_{i=1}^{n} |x_i - y_i|^p$.

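A metric space (S, d) is a topological space where a set O is open if for all $x \in O$, there is an $\epsilon > 0$ such that $B(x, \epsilon) \equiv \{y : d(y, x) < \epsilon\} \subset O$.

Definition 1.1.4: The Borel σ-algebra on a topological space S (in particular, on a metric space or a Euclidean space) is defined as the σ-algebra generated by the collection of open sets in S.

Example 1.1.6: Let $\mathcal{B}(\mathbb{R}^k)$ denote the Borel σ-algebra on $\mathbb{R}^k$, $1 \le k < \infty$. Then, $\mathcal{B}(\mathbb{R}^k) \equiv \sigma\langle\{A : A \text{ is an open subset of } \mathbb{R}^k\}\rangle$ is also generated by each of the following classes of sets:

$\mathcal{O}_1 = \{(a_1, b_1) \times \cdots \times (a_k, b_k) : -\infty \le a_i < b_i \le \infty,\ 1 \le i \le k\};$

$\mathcal{O}_2 = \{(-\infty, x_1) \times \cdots \times (-\infty, x_k) : x_1, \ldots, x_k \in \mathbb{R}\};$

$\mathcal{O}_3 = \{(a_1, b_1) \times \cdots \times (a_k, b_k) : a_i, b_i \in \mathbb{Q},\ a_i < b_i,\ 1 \le i \le k\};$

$\mathcal{O}_4 = \{(-\infty, x_1) \times \cdots \times (-\infty, x_k) : x_1, \ldots, x_k \in \mathbb{Q}\},$

where $\mathbb{Q}$ denotes the set of all rational numbers. To show this, note that $\sigma\langle\mathcal{O}_i\rangle \subset \mathcal{B}(\mathbb{R}^k)$ for $i = 1, 2, 3, 4$, and hence, it is enough to show that $\sigma\langle\mathcal{O}_i\rangle \supset \mathcal{B}(\mathbb{R}^k)$. Let $\mathcal{G}$ be a σ-algebra containing $\mathcal{O}_3$. Observe that given any open set $A \subset \mathbb{R}^k$, there exists a sequence of sets $\{B_n\}_{n\ge1}$ in $\mathcal{O}_3$ such that $A = \bigcup_{n\ge1} B_n$ (Problem 1.9). Since $\mathcal{G}$ is a σ-algebra and $B_n \in \mathcal{G}$ for all $n \ge 1$, $A \in \mathcal{G}$. Thus, $\mathcal{G}$ is a σ-algebra containing all open subsets of $\mathbb{R}^k$, and hence $\mathcal{G} \supset \mathcal{B}(\mathbb{R}^k)$. Hence, it follows that

$\mathcal{B}(\mathbb{R}^k) \supset \sigma\langle\mathcal{O}_1\rangle \supset \sigma\langle\mathcal{O}_3\rangle = \bigcap_{\mathcal{G} : \mathcal{G} \supset \mathcal{O}_3} \mathcal{G} \supset \mathcal{B}(\mathbb{R}^k).$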


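Next note that any interval $(a, b) \subset \mathbb{R}$ can be expressed in terms of half spaces of the form $(-\infty, x)$, $x \in \mathbb{R}$, as

$(a, b) = \bigcup_{n=1}^{\infty} \big[(-\infty, b) \setminus (-\infty, a + n^{-1})\big],$

where for any two sets A and B, $A \setminus B = \{x : x \in A,\ x \notin B\}$. It is not difficult to show that this implies that $\sigma\langle\mathcal{O}_i\rangle = \mathcal{B}(\mathbb{R}^k)$ for $i = 2, 4$ (Problem 1.10).

Example 1.1.7: Let Ω be a nonempty set with $|\Omega| = \infty$ and let $\mathcal{F}_5$ and $\mathcal{F}_6$ be as in Examples 1.1.3 and 1.1.4. Then $\mathcal{F}_6 = \sigma\langle\mathcal{F}_5\rangle$. To see this, note that $\mathcal{F}_6$ is a σ-algebra containing $\mathcal{F}_5$, so that $\sigma\langle\mathcal{F}_5\rangle \subset \mathcal{F}_6$. To prove the reverse inclusion, let $\mathcal{G}$ be a σ-algebra containing $\mathcal{F}_5$. It is enough to show that $\mathcal{F}_6 \subset \mathcal{G}$. Let $A \in \mathcal{F}_6$. If A is countable, say $A = \{\omega_1, \omega_2, \ldots\}$, then $A_i \equiv \{\omega_i\} \in \mathcal{F}_5 \subset \mathcal{G}$ for all $i \ge 1$ and hence $A = \bigcup_{i=1}^{\infty} A_i \in \mathcal{G}$. On the other hand, if $A^c$ is countable, then by the above argument, $A^c \in \mathcal{G} \Rightarrow A \in \mathcal{G}$.

Definition 1.1.5: A class $\mathcal{C}$ of subsets of Ω is a π-system or a π-class if $A, B \in \mathcal{C} \Rightarrow A \cap B \in \mathcal{C}$.

Example 1.1.8: The class $\mathcal{C}$ of intervals in $\mathbb{R}$ is a π-system whereas the class of all open discs in $\mathbb{R}^2$ is not.

Definition 1.1.6: A class $\mathcal{L}$ of subsets of Ω is a λ-system or a λ-class if (i) $\Omega \in \mathcal{L}$, (ii) $A, B \in \mathcal{L}$, $A \subset B \Rightarrow B \setminus A \in \mathcal{L}$, and (iii) $A_n \in \mathcal{L}$, $A_n \subset A_{n+1}$ for all $n \ge 1 \Rightarrow \bigcup_{n\ge1} A_n \in \mathcal{L}$.

Example 1.1.9: Every σ-algebra $\mathcal{F}$ is a λ-system. But an algebra need not be a λ-system.

It is easily checked that if $\mathcal{L}_1$ and $\mathcal{L}_2$ are λ-systems, then $\mathcal{L}_1 \cap \mathcal{L}_2$ is also a λ-system. Recall that $\sigma\langle\mathcal{B}\rangle$, the σ-algebra generated by $\mathcal{B}$, is the intersection of all σ-algebras containing $\mathcal{B}$ and is also the smallest σ-algebra containing $\mathcal{B}$. Similarly, for any $\mathcal{B} \subset \mathcal{P}(\Omega)$, the λ-system generated by $\mathcal{B}$, denoted by $\lambda\langle\mathcal{B}\rangle$, is defined as the intersection of all λ-systems containing $\mathcal{B}$. It is the smallest λ-system containing $\mathcal{B}$.

Theorem 1.1.2: (The π-λ theorem). If $\mathcal{C}$ is a π-system, then $\lambda\langle\mathcal{C}\rangle = \sigma\langle\mathcal{C}\rangle$.

Proof: For any $\mathcal{C}$, $\sigma\langle\mathcal{C}\rangle$ is a λ-system and $\sigma\langle\mathcal{C}\rangle$ contains $\mathcal{C}$. Thus, $\lambda\langle\mathcal{C}\rangle \subset \sigma\langle\mathcal{C}\rangle$ for any $\mathcal{C}$. Hence, it suffices to show that if $\mathcal{C}$ is a π-system, then $\lambda\langle\mathcal{C}\rangle$ is a σ-algebra. Since $\lambda\langle\mathcal{C}\rangle$ is a λ-system, it is closed under complementation and monotone increasing unions. By Proposition 1.1.1, it is enough to show that it is closed under intersection. Let

$\lambda_1(\mathcal{C}) \equiv \{A : A \in \lambda\langle\mathcal{C}\rangle,\ A \cap B \in \lambda\langle\mathcal{C}\rangle \text{ for all } B \in \mathcal{C}\}.$

Then, $\lambda_1(\mathcal{C})$ is a λ-system and, $\mathcal{C}$ being a π-system, $\lambda_1(\mathcal{C}) \supset \mathcal{C}$. Therefore, $\lambda_1(\mathcal{C}) \supset \lambda\langle\mathcal{C}\rangle$. But $\lambda_1(\mathcal{C}) \subset \lambda\langle\mathcal{C}\rangle$. So $\lambda_1(\mathcal{C}) = \lambda\langle\mathcal{C}\rangle$.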


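Next, let

$\lambda_2(\mathcal{C}) \equiv \{A : A \in \lambda\langle\mathcal{C}\rangle,\ A \cap B \in \lambda\langle\mathcal{C}\rangle \text{ for all } B \in \lambda\langle\mathcal{C}\rangle\}.$

Then $\lambda_2(\mathcal{C})$ is a λ-system and by the previous step $\mathcal{C} \subset \lambda_2(\mathcal{C}) \subset \lambda\langle\mathcal{C}\rangle$. Hence, it follows that $\lambda_2(\mathcal{C}) = \lambda\langle\mathcal{C}\rangle$, i.e., $\lambda\langle\mathcal{C}\rangle$ is closed under intersection. This completes the proof of the theorem. □

Corollary 1.1.3: If $\mathcal{C}$ is a π-system and $\mathcal{L}$ is a λ-system containing $\mathcal{C}$, then $\mathcal{L} \supset \sigma\langle\mathcal{C}\rangle$.

Remark 1.1.1: There are several equivalent definitions of λ-systems; see, for example, Billingsley (1995). A closely related concept is that of a monotone class; see, for example, Chung (1974).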

1.2 Measures

A set function is an extended real valued function defined on a class of subsets of a set Ω. Measures are nonnegative set functions that, intuitively speaking, measure the content of a subset of Ω. As explained in Section 2 of the introductory chapter, a measure has to satisfy certain natural requirements, such as the measure of the union of a countable collection of disjoint sets is the sum of the measures of the individual sets. Formally, one has the following definition.

Definition 1.2.1: Let Ω be a nonempty set and $\mathcal{F}$ be an algebra on Ω. Then, a set function µ on $\mathcal{F}$ is called a measure if

(a) $\mu(A) \in [0, \infty]$ for all $A \in \mathcal{F}$;

(b) $\mu(\emptyset) = 0$;

(c) for any disjoint collection of sets $A_1, A_2, \ldots \in \mathcal{F}$ with $\bigcup_{n\ge1} A_n \in \mathcal{F}$,

$\mu\big(\bigcup_{n\ge1} A_n\big) = \sum_{n=1}^{\infty} \mu(A_n).$

As discussed in Section 2 of the introductory chapter, these conditions on µ are equivalent to finite additivity and monotone continuity from below.

Proposition 1.2.1: Let Ω be a nonempty set and $\mathcal{F}$ be an algebra of subsets of Ω and µ be a set function on $\mathcal{F}$ with values in $[0, \infty]$ and with $\mu(\emptyset) = 0$. Then, µ is a measure iff µ satisfies

(iii)a: (finite additivity) for all $A_1, A_2 \in \mathcal{F}$ with $A_1 \cap A_2 = \emptyset$, $\mu(A_1 \cup A_2) = \mu(A_1) + \mu(A_2)$, and

(iii)b: (monotone continuity from below or, m.c.f.b., in short) for any collection $\{A_n\}_{n\ge1}$ of sets in $\mathcal{F}$ such that $A_n \subset A_{n+1}$ for all $n \ge 1$ and $\bigcup_{n\ge1} A_n \in \mathcal{F}$,

$\mu\big(\bigcup_{n\ge1} A_n\big) = \lim_{n\to\infty} \mu(A_n).$

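Proof: Let µ be a measure on $\mathcal{F}$. Since µ satisfies condition (c) of Definition 1.2.1 (countable additivity, referred to as (iii) in this proof), taking $A_3, A_4, \ldots$ to be $\emptyset$ yields (iii)a. This implies that for A and B in $\mathcal{F}$, $A \subset B \Rightarrow \mu(B) = \mu(A) + \mu(B \setminus A) \ge \mu(A)$, i.e., µ is monotone. To establish (iii)b, note that if $\mu(A_n) = \infty$ for some $n = n_0$, then $\mu(A_n) = \infty$ for all $n \ge n_0$ and $\mu\big(\bigcup_{n\ge1} A_n\big) = \infty$, and (iii)b holds in this case. Hence, suppose that $\mu(A_n) < \infty$ for all $n \ge 1$. Setting $B_n = A_n \setminus A_{n-1}$ for $n \ge 1$ (with $A_0 = \emptyset$), by (iii)a, $\mu(B_n) = \mu(A_n) - \mu(A_{n-1})$. Since $\{B_n\}_{n\ge1}$ is a disjoint collection of sets in $\mathcal{F}$ with $\bigcup_{n\ge1} B_n = \bigcup_{n\ge1} A_n$, by (iii)

$\mu\big(\bigcup_{n\ge1} A_n\big) = \mu\big(\bigcup_{n\ge1} B_n\big) = \sum_{n=1}^{\infty} \mu(B_n) = \lim_{N\to\infty} \sum_{n=1}^{N} [\mu(A_n) - \mu(A_{n-1})] = \lim_{N\to\infty} \mu(A_N),$

and so (iii)b holds also in this case.

Conversely, let µ satisfy $\mu(\emptyset) = 0$ and (iii)a and (iii)b. Let $\{A_n\}_{n\ge1}$ be a disjoint collection of sets in $\mathcal{F}$ with $\bigcup_{i\ge1} A_i \in \mathcal{F}$. Let $C_n = \bigcup_{j=1}^{n} A_j$ for $n \ge 1$. Since $\mathcal{F}$ is an algebra, $C_n \in \mathcal{F}$ for all $n \ge 1$. Also, $C_n \subset C_{n+1}$ for all $n \ge 1$, and $\bigcup_{n\ge1} C_n = \bigcup_{j\ge1} A_j$. By (iii)b,

$\mu\big(\bigcup_{j\ge1} A_j\big) = \mu\big(\bigcup_{n\ge1} C_n\big) = \lim_{n\to\infty} \mu(C_n) = \lim_{n\to\infty} \sum_{j=1}^{n} \mu(A_j) \ \text{(by (iii)a)} = \sum_{j=1}^{\infty} \mu(A_j).$

Thus, (iii) holds. □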

Remark 1.2.1: The definition of a measure given in Definition 1.2.1 is valid when $\mathcal{F}$ is a σ-algebra. However, very often one may start with a measure on an algebra $\mathcal{A}$ and then extend it to a measure on the σ-algebra $\sigma\langle\mathcal{A}\rangle$. This is why the definition of a measure on an algebra is given here. In the same vein, one may begin with a definition of a measure on a class of subsets of Ω that form only a semialgebra (cf. Definition 1.3.1). As described in the introductory chapter, such preliminary collections of sets are “nice” sets for which the measure may be defined easily, and the extension to a σ-algebra containing these sets may be necessary if one is interested in more general sets. This topic is discussed in greater detail in the next section.


Definition 1.2.2: A measure µ is called finite or infinite according as $\mu(\Omega) < \infty$ or $\mu(\Omega) = \infty$. A finite measure with $\mu(\Omega) = 1$ is called a probability measure. A measure µ on a σ-algebra $\mathcal{F}$ is called σ-finite if there exists a countable collection of sets $A_1, A_2, \ldots \in \mathcal{F}$, not necessarily disjoint, such that

(a) $\bigcup_{n\ge1} A_n = \Omega$ and (b) $\mu(A_n) < \infty$ for all $n \ge 1$.

Here are some examples of measures.

Example 1.2.1: (The counting measure). Let Ω be a nonempty set and $\mathcal{F}_3 = \mathcal{P}(\Omega)$ be the set of all subsets of Ω (cf. Example 1.1.2). Define

$\mu(A) = |A|, \quad A \in \mathcal{F}_3,$

where |A| denotes the number of elements in A. It is easy to check that µ satisfies the requirements (a)–(c) of a measure. This measure µ is called the counting measure on Ω. Note that µ is finite iff Ω is finite and it is σ-finite if Ω is countably infinite.

Example 1.2.2: (Discrete probability measures). Let $\omega_1, \omega_2, \ldots \in \Omega$ and $p_1, p_2, \ldots \in [0, 1]$ be such that $\sum_{i=1}^{\infty} p_i = 1$. Define for any $A \subset \Omega$

$P(A) = \sum_{i=1}^{\infty} p_i I_A(\omega_i),$

where $I_A(\cdot)$ denotes the indicator function of a set A, defined by $I_A(\omega) = 0$ or 1 according as $\omega \notin A$ or $\omega \in A$. For any disjoint collection of sets $A_1, A_2, \ldots \in \mathcal{P}(\Omega)$,

$P\big(\bigcup_{i=1}^{\infty} A_i\big) = \sum_{j=1}^{\infty} p_j I_{\bigcup_{i=1}^{\infty} A_i}(\omega_j) = \sum_{j=1}^{\infty} p_j \sum_{i=1}^{\infty} I_{A_i}(\omega_j) = \sum_{i=1}^{\infty} \sum_{j=1}^{\infty} p_j I_{A_i}(\omega_j) = \sum_{i=1}^{\infty} P(A_i),$

where interchanging the order of summation is permissible since the summands are nonnegative. This shows that P is a probability measure on $\mathcal{P}(\Omega)$.
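The computation in Example 1.2.2 can be mirrored in a few lines of code. This is only a sketch under simplifying assumptions: the family of atoms is finite (four hypothetical atoms `w1`, ..., `w4` with masses summing to one) and the disjoint collection has three members, so it illustrates additivity rather than proving the countable case. The helper name `make_discrete_measure` is introduced here for illustration.

```python
from fractions import Fraction

def make_discrete_measure(atoms, masses):
    """Return the set function P(A) = sum_i p_i * I_A(omega_i)."""
    def P(A):
        return sum(p for w, p in zip(atoms, masses) if w in A)
    return P

# Hypothetical atoms and masses; exact rationals avoid rounding issues.
atoms = ["w1", "w2", "w3", "w4"]
masses = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
P = make_discrete_measure(atoms, masses)

# Disjoint sets; A3 also contains a point that carries no mass.
A1, A2, A3 = {"w1"}, {"w2", "w4"}, {"w3", "not an atom"}
total = P(A1 | A2 | A3)
parts = P(A1) + P(A2) + P(A3)
```

Both `total` and `parts` equal one here, and points of Ω that are not among the atoms contribute nothing, exactly as the indicator $I_A(\omega_i)$ dictates.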


Example 1.2.3: (Lebesgue-Stieltjes measures on $\mathbb{R}$). As mentioned in the previous chapter (cf. Section 2), a large class of measures on the Borel σ-algebra $\mathcal{B}(\mathbb{R})$ of subsets of $\mathbb{R}$, known as the Lebesgue-Stieltjes measures, arise from nondecreasing right continuous functions $F : \mathbb{R} \to \mathbb{R}$. For each such F, the corresponding measure $\mu_F$ satisfies $\mu_F((a, b]) = F(b) - F(a)$ for all $-\infty < a < b < \infty$. The construction of these $\mu_F$'s via the extension theorem will be discussed in the next section. Also, note that if $A_n = (-n, n)$, $n = 1, 2, \ldots$, then $\mathbb{R} = \bigcup_{n\ge1} A_n$ and $\mu_F(A_n) < \infty$ for each $n \ge 1$ (such measures are called Radon measures) and thus, $\mu_F$ is necessarily σ-finite.

Proposition 1.2.2: Let µ be a measure on an algebra $\mathcal{F}$, and let $A, B, A_1, \ldots, A_k \in \mathcal{F}$, $1 \le k < \infty$. Then,

(i) (Monotonicity) $\mu(A) \le \mu(B)$ if $A \subset B$;

(ii) (Finite subadditivity) $\mu(A_1 \cup \ldots \cup A_k) \le \mu(A_1) + \ldots + \mu(A_k)$;

(iii) (Inclusion-exclusion formula) If $\mu(A_i) < \infty$ for all $i = 1, \ldots, k$, then

$\mu(A_1 \cup \ldots \cup A_k) = \sum_{i=1}^{k} \mu(A_i) - \sum_{1 \le i < j \le k} \mu(A_i \cap A_j) + \cdots + (-1)^{k-1} \mu(A_1 \cap \ldots \cap A_k).$

[…] $\epsilon > 0$ is arbitrary, this yields $\mu(A) \le \mu^*(A)$.

Thus, given a measure µ on a semialgebra $\mathcal{C} \subset \mathcal{P}(\Omega)$, there is a complete measure space $(\Omega, \mathcal{M}_{\mu^*}, \mu^*)$ such that $\mathcal{M}_{\mu^*} \supset \mathcal{C}$ and $\mu^*$ restricted to $\mathcal{C}$ equals µ. For this reason, $\mu^*$ is called an extension of µ. The measure space $(\Omega, \mathcal{M}_{\mu^*}, \mu^*)$ is called the Caratheodory extension of µ. Since $\mathcal{M}_{\mu^*}$ is a σ-algebra and contains $\mathcal{C}$, $\mathcal{M}_{\mu^*}$ must contain $\sigma\langle\mathcal{C}\rangle$, the σ-algebra generated by $\mathcal{C}$, and thus, $(\Omega, \sigma\langle\mathcal{C}\rangle, \mu^*)$ is also a measure space. However, the latter may not be complete (see Section 1.4).

Now the above method is applied to the construction of Lebesgue-Stieltjes measures on $\mathbb{R}$ and $\mathbb{R}^2$.

1.3.2 Lebesgue-Stieltjes measures on R

Let $F : \mathbb{R} \to \mathbb{R}$ be nondecreasing. For $x \in \mathbb{R}$, let $F(x+) \equiv \lim_{y \downarrow x} F(y)$ and $F(x-) \equiv \lim_{y \uparrow x} F(y)$. Set $F(\infty) = \lim_{x \uparrow \infty} F(x)$ and $F(-\infty) = \lim_{x \downarrow -\infty} F(x)$. Let

$\mathcal{C} \equiv \{(a, b] : -\infty \le a \le b < \infty\} \cup \{(a, \infty) : -\infty \le a < \infty\}. \quad (3.7)$

Define

$\mu_F((a, b]) = F(b+) - F(a+), \qquad \mu_F((a, \infty)) = F(\infty) - F(a+). \quad (3.8)$

Then, it may be verified that (i) $\mathcal{C}$ is a semialgebra; (ii) $\mu_F$ is a measure on $\mathcal{C}$. (For (ii), one needs to use the Heine-Borel theorem. See Problems 1.22 and 1.23.) Let $(\mathbb{R}, \mathcal{M}_{\mu_F^*}, \mu_F^*)$ be the Caratheodory extension of $\mu_F$, i.e., the measure space constructed as in the above two theorems.
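To make (3.8) concrete, here is a minimal numerical sketch for two continuous choices of F, for which $F(x+) = F(x)$ and the formula reduces to $F(b) - F(a)$. The function names `std_normal_cdf` and `mu_F` are illustrative helpers, not notation from the text, and the normal distribution function is written in terms of `math.erf` from the Python standard library.

```python
import math

def std_normal_cdf(x):
    """Standard normal distribution function, expressed through math.erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def mu_F(F, a, b):
    """Mass assigned to (a, b] by a continuous nondecreasing F: F(b) - F(a).

    For a discontinuous F one would need the right limits F(b+) and F(a+)
    of (3.8) instead; that case is not handled in this sketch.
    """
    return F(b) - F(a)

# F(x) = x recovers the length of the interval (Lebesgue measure).
length = mu_F(lambda x: x, 2.0, 5.0)

# The normal distribution function gives the familiar one-sigma probability.
one_sigma = mu_F(std_normal_cdf, -1.0, 1.0)

# Finite additivity on the semialgebra: (-1, 1] = (-1, 0.25] U (0.25, 1].
split = mu_F(std_normal_cdf, -1.0, 0.25) + mu_F(std_normal_cdf, 0.25, 1.0)
```

The last two quantities agree up to floating-point rounding, which is the finite-additivity property that a measure on $\mathcal{C}$ must have when one half-open interval is cut into two.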


Definition 1.3.7: Let $F : \mathbb{R} \to \mathbb{R}$ be nondecreasing. The (measure) space $(\mathbb{R}, \mathcal{M}_{\mu_F^*}, \mu_F^*)$ is called a Lebesgue-Stieltjes measure space and $\mu_F^*$ is the Lebesgue-Stieltjes measure generated by F.

Since $\sigma\langle\mathcal{C}\rangle = \mathcal{B}(\mathbb{R})$, the class of all Borel sets of $\mathbb{R}$, every Lebesgue-Stieltjes measure $\mu_F^*$ is also a measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Note also that $\mu_F^*$ is finite on bounded intervals. Conversely, given any Radon measure µ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, i.e., a measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ that is finite on bounded intervals, set

$F(x) = \begin{cases} \mu((0, x]) & \text{if } x > 0, \\ 0 & \text{if } x = 0, \\ -\mu((x, 0]) & \text{if } x < 0. \end{cases}$

Then $\mu_F = \mu$ on $\mathcal{C}$. By the uniqueness of the extension (discussed later in this section, see also Theorem 1.2.4), it follows that $\mu_F^*$ coincides with µ on $\mathcal{B}(\mathbb{R})$. Thus, every Radon measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is necessarily a Lebesgue-Stieltjes measure.

Definition 1.3.8: (Lebesgue measure on $\mathbb{R}$). When $F(x) \equiv x$, $x \in \mathbb{R}$, the measure $\mu_F^*$ is called the Lebesgue measure and the σ-algebra $\mathcal{M}_{\mu_F^*}$ is called the class of Lebesgue measurable sets.

The Lebesgue measure will be denoted by $m(\cdot)$ or $\mu_L(\cdot)$. Given below are some important results on $m(\cdot)$.

(i) It follows from equation (3.1) that $\mu_F^*(\{x\}) = F(x+) - F(x-)$, which equals 0 if F is continuous at x. Thus $m(\{x\}) \equiv 0$ on $\mathbb{R}$.

(ii) By countable additivity of $m(\cdot)$, $m(A) = 0$ for any countable set A.

(iii) (Cantor set). There exists an uncountable set C such that $m(C) = 0$. An example is the Cantor set, constructed as follows: Start with $I_0 = [0, 1]$. Delete the open middle third, i.e., $(\frac{1}{3}, \frac{2}{3})$. Next, from the closed intervals $I_{11} = [0, \frac{1}{3}]$ and $I_{12} = [\frac{2}{3}, 1]$ delete the open middle thirds, i.e., $(\frac{1}{9}, \frac{2}{9})$ and $(\frac{7}{9}, \frac{8}{9})$, respectively. Repeat this process of deleting the middle third from each of the remaining closed intervals. Thus at stage n there will be $2^n$ new closed intervals and $2^{n-1}$ deleted open intervals, each of length $\frac{1}{3^n}$. Let $U_n$ denote the union of all the deleted open intervals at the nth stage. Then $\{U_n\}_{n\ge1}$ are disjoint open sets. Let $U \equiv \bigcup_{n\ge1} U_n$. By countable additivity,

$m(U) = \sum_{n=1}^{\infty} m(U_n) = \sum_{n=1}^{\infty} \frac{2^{n-1}}{3^n} = 1.$

Let $C \equiv [0, 1] - U$. Since U is open and $[0, 1]$ is closed, C is nonempty. It can be shown that $C \equiv \{x : x = \sum_{i=1}^{\infty} \frac{a_i}{3^i},\ a_i = 0 \text{ or } 2\}$ (Problem

1.3 The extension theorems and Lebesgue-Stieltjes measures


1.33). Thus C can be mapped in a one-to-one manner onto the set of all sequences $\{\delta_i\}_{i\ge1}$ such that $\delta_i = 0$ or 2. But this set is uncountable. Since $m([0, 1]) = 1$, it follows that $m(C) = 0$. For more properties of the Cantor set, see Rudin (1976) and Chapter 4.

(iv) $m(\cdot)$ is invariant under reflection and translation. That is, for any E in $\mathcal{B}(\mathbb{R})$, $m(-E) = m(E)$ and $m(E + c) = m(E)$ for all c in $\mathbb{R}$, where $-E = \{-x : x \in E\}$ and $E + c = \{y : y = x + c,\ x \in E\}$. This follows from Theorem 1.2.4 and the fact that the claim holds for intervals (cf. Problem 2.48).

(v) There exists a subset $A \subset \mathbb{R}$ such that $A \notin \mathcal{M}_m$. That is, there exists a non-Lebesgue measurable set. The proof of this requires the use of the axiom of choice (cf. A.1). For a proof see Royden (1988).
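The bookkeeping in item (iii) can be checked with exact rational arithmetic. The sketch below is illustrative only: the helper names `removed_length` and `remaining_intervals` are introduced here, and the code works with finitely many stages, so it shows the partial sums approaching 1 rather than the limit itself.

```python
from fractions import Fraction

def removed_length(stages):
    """Total length deleted in the first `stages` steps: sum of 2^(n-1) / 3^n."""
    return sum(Fraction(2 ** (n - 1), 3 ** n) for n in range(1, stages + 1))

def remaining_intervals(stages):
    """Closed intervals left after `stages` deletions of open middle thirds."""
    intervals = [(Fraction(0), Fraction(1))]
    for _ in range(stages):
        next_intervals = []
        for lo, hi in intervals:
            third = (hi - lo) / 3
            # Keep the left and right closed thirds, drop the open middle third.
            next_intervals += [(lo, lo + third), (hi - third, hi)]
        intervals = next_intervals
    return intervals

left = remaining_intervals(5)
remaining = sum(hi - lo for lo, hi in left)
```

After n stages the deleted length is $1 - (2/3)^n$ and the $2^n$ surviving closed intervals carry the complementary length $(2/3)^n$, which tends to 0 and is consistent with $m(C) = 0$.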

1.3.3 Lebesgue-Stieltjes measures on R2

Let $F : \mathbb{R}^2 \to \mathbb{R}$ satisfy

$F(a_2, b_2) - F(a_2, b_1) - F(a_1, b_2) + F(a_1, b_1) \ge 0, \quad (3.9)$

and

$F(a_2, b_2) - F(a_1, b_1) \ge 0, \quad (3.10)$

for all $a_1 \le a_2$, $b_1 \le b_2$. Extend F to $\bar{\mathbb{R}}^2$ by an appropriate limiting procedure. Let

$\mathcal{C}_2 \equiv \{I_1 \times I_2 : I_1, I_2 \in \mathcal{C}\}, \quad (3.11)$

where $\mathcal{C}$ is as in (3.7). Next, for $I_1 = (a_1, a_2]$, $I_2 = (b_1, b_2]$, $-\infty < a_1, a_2, b_1, b_2 < \infty$, set

$\mu_F(I_1 \times I_2) \equiv F(a_2+, b_2+) - F(a_2+, b_1+) - F(a_1+, b_2+) + F(a_1+, b_1+), \quad (3.12)$

where for any $a, b \in \mathbb{R}$, $F(a+, b+)$ is defined as

$F(a+, b+) \equiv \lim_{a' \downarrow a,\ b' \downarrow b} F(a', b').$

Note that by (3.10), the limit exists and hence, $F(a+, b+)$ is well defined. Further, by (3.9), the right side of (3.12) is nonnegative. Next extend the definition of $\mu_F$ to unbounded sets in $\mathcal{C}_2$ by the limiting procedure:

$\mu_F(I_1 \times I_2) = \lim_{n\to\infty} \mu_F\big((I_1 \times I_2) \cap J_n\big), \quad (3.13)$

where $J_n = (-n, n] \times (-n, n]$. Then it may be verified (Problems 1.24, 1.25) that


(i) $\mathcal{C}_2$ is a semialgebra;

(ii) $\mu_F$ is a measure on $\mathcal{C}_2$.

Let $(\mathbb{R}^2, \mathcal{M}_{\mu_F^*}, \mu_F^*)$ be the measure space constructed as in the above two theorems. The measure $\mu_F^*$ is called the Lebesgue-Stieltjes measure generated by F and $\mathcal{M}_{\mu_F^*}$ a Lebesgue-Stieltjes σ-algebra. Again, in this case, $\mathcal{M}_{\mu_F^*}$ includes the σ-algebra $\sigma\langle\mathcal{C}_2\rangle \equiv \mathcal{B}(\mathbb{R}^2)$ and so $(\mathbb{R}^2, \mathcal{B}(\mathbb{R}^2), \mu_F^*)$ is also a measure space. If $F(a, b) = ab$, then $\mu_F^*$ is called the Lebesgue measure on $\mathbb{R}^2$. A similar construction holds for any $\mathbb{R}^k$, $k < \infty$.

1.3.4 More on extension of measures

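Next the uniqueness of the extension $\mu^*$ and approximation of the $\mu^*$-measure of a set in $\mathcal{M}_{\mu^*}$ by that of a set from the algebra $\mathcal{A} \equiv \mathcal{A}(\mathcal{C})$ are considered. As in the case of measures defined on an algebra, a measure µ on a semialgebra $\mathcal{C} \subset \mathcal{P}(\Omega)$ is said to be σ-finite if there exists a countable collection $\{A_n\}_{n\ge1} \subset \mathcal{C}$ such that (i) $\mu(A_n) < \infty$ for each $n \ge 1$ and (ii) $\bigcup_{n\ge1} A_n = \Omega$. The following approximation result holds.

Theorem 1.3.4: Let $A \in \mathcal{M}_{\mu^*}$ and $\mu^*(A) < \infty$. Then, for each $\epsilon > 0$, there exist $B_1, B_2, \ldots, B_k \in \mathcal{C}$, $k < \infty$, with $B_i \cap B_j = \emptyset$ for $1 \le i \ne j \le k$, such that

$\mu^*\big(A \,\triangle\, \bigcup_{j=1}^{k} B_j\big) < \epsilon,$

where for any two sets $E_1$ and $E_2$, $E_1 \triangle E_2$ is the symmetric difference of $E_1$ and $E_2$, defined by $E_1 \triangle E_2 \equiv (E_1 \cap E_2^c) \cup (E_1^c \cap E_2)$.

Proof: By definition of $\mu^*$, $\mu^*(A) < \infty$ implies that for every $\epsilon > 0$, there exist $\{B_n\}_{n\ge1} \subset \mathcal{C}$ such that $A \subset \bigcup_{n\ge1} B_n$ and

$\mu^*(A) \le \sum_{n=1}^{\infty} \mu(B_n) \le \mu^*(A) + \epsilon/2 < \infty.$

Since $B_n \in \mathcal{C}$, $B_n^c$ is a finite union of disjoint sets from $\mathcal{C}$. W.l.o.g., it can be assumed that $\{B_n\}_{n\ge1}$ are disjoint. (Otherwise, one can consider the sequence $B_1, B_2 \cap B_1^c, B_3 \cap B_2^c \cap B_1^c, \ldots$.) Next, $\sum_{n=1}^{\infty} \mu(B_n) < \infty$ implies that for every $\epsilon > 0$, there exists $k \in \mathbb{N}$ such that $\sum_{n=k+1}^{\infty} \mu(B_n) < \frac{\epsilon}{2}$. Since both A and $\bigcup_{j=1}^{k} B_j$ are subsets of $\bigcup_{n\ge1} B_n$,

$\mu^*\big(A \,\triangle\, \bigcup_{j=1}^{k} B_j\big) \le \mu^*\big(A^c \cap \bigcup_{j=1}^{\infty} B_j\big) + \mu^*\big(\bigcup_{j=k+1}^{\infty} B_j\big).$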


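But since $\mu^*$ is a measure on $(\Omega, \mathcal{M}_{\mu^*})$, $\mu^*\big(\bigcup_{n\ge1} B_n\big) = \mu^*(A) + \mu^*\big(\big(\bigcup_{n\ge1} B_n\big) \cap A^c\big)$. Further, since $\mu^*(A) < \infty$,

$$\begin{aligned}
\mu^*\Big(\Big(\bigcup_{n\ge1} B_n\Big) \cap A^c\Big)
&= \mu^*\Big(\bigcup_{n\ge1} B_n\Big) - \mu^*(A) \\
&\le \sum_{n=1}^{\infty} \mu^*(B_n) - \mu^*(A) && \text{(since $\mu^*$ is countably subadditive)} \\
&= \sum_{n=1}^{\infty} \mu(B_n) - \mu^*(A) && \text{(since $\mu^* = \mu$ on $\mathcal{C}$)}
\end{aligned}$$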

0. Choose $c > a$ and $d_n > b_n$, $n \ge 1$, such that $F(c) - F(a) < \eta$, $[c, b] \subset \bigcup_{n\ge1} (a_n, d_n)$ and $F(d_n) - F(b_n) < \eta/2^n$ for all $n \in \mathbb{N}$. Next, apply the Heine-Borel theorem to the interval $[c, b]$ and the open cover $\{(a_n, d_n)\}_{n\ge1}$ and extract a finite cover $\{(a_i, d_i)\}_{i=1}^{k}$ for $[c, b]$. W.l.o.g., assume that $c \in (a_1, d_1)$ and $b \in (a_k, d_k)$. Now verify that

$$F(b) - F(a) \le \sum_{j=1}^{k} \big(F(b_j) - F(a_j)\big) + 2\eta \le \sum_{j=1}^{\infty} \big(F(b_j) - F(a_j)\big) + 2\eta.$$

1.23 Extend the above arguments to the case when $(a, b]$ and $(a_i, b_i]$, $i \ge 1$, are not necessarily bounded intervals.

1.24 Verify that $\mathcal{C}_2$, defined in (3.11), is a semialgebra.

1.25 (a) Verify that the limit in (3.13) exists.
(b) Extend the arguments in Problems 1.22 and 1.23 to verify that $\mu_F$ of (3.12) and (3.13) is a measure on $\mathcal{C}_2$.

1.26 Establish Theorem 1.3.6 by completing the following:
(a) Suppose first that $\nu(\Omega) < \infty$. Verify that $\mathcal{L} \equiv \{A : A \in \sigma\langle\mathcal{C}\rangle,\ \mu^*(A) = \nu(A)\}$ is a λ-system and use the π-λ theorem.
(b) Extend the above to the σ-finite case.

1.27 Prove Corollary 1.3.5 for Lebesgue measure $m(\cdot)$.

1.28 Let $F$ be a discrete distribution function, i.e., $F$ is of the form

$$F(x) \equiv \sum_{j=1}^{\infty} a_j I(x_j \le x), \quad x \in \mathbb{R},$$

where $0 < a_j < \infty$, $\sum_{j\ge1} a_j = 1$, $x_j \in \mathbb{R}$, $j \ge 1$. Show that $\mathcal{M}_{\mu_F^*} = \mathcal{P}(\mathbb{R})$.

(Hint: Show that $\mu_F^*(A^c) = 0$, where $A \equiv \{x_j : j \ge 1\}$, and use the fact that for any $B \subset \mathbb{R}$, $B \cap A \in \mathcal{B}(\mathbb{R})$.)

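1.29 Let

$$F(x) = \begin{cases} 0 & \text{for } x < 0 \\ x & \text{for } 0 \le x \le 1 \\ 1 & \text{for } x > 1. \end{cases}$$

Show that $\mathcal{M}_{\mu_F^*} \equiv \{A : A \in \mathcal{P}(\mathbb{R}),\ A \cap [0, 1] \in \mathcal{M}\}$, where $\mathcal{M}$ is the σ-algebra of Lebesgue measurable sets as in Definition 1.3.7.

1.30 Let $F(\cdot) = \frac12 \Phi(\cdot) + \frac12 F_P(\cdot)$ where $\Phi(\cdot)$ is the standard normal cdf, i.e., $\Phi(x) \equiv \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-u^2/2}\,du$ and $F_P(x) \equiv \sum_{k=0}^{\infty} e^{-2} \frac{2^k}{k!} I_{(-\infty, x]}(k)$, $x \in \mathbb{R}$. Let $F_1 = \Phi$, $F_2 = F_P$ and $F_3 = F$. Let $A_1 = (0, 1)$, $A_2 = \{x : x \in \mathbb{R},\ \sin x \in (0, \frac12)\}$, $A_3 = \{x : \text{for some integers } a_0, a_1, \ldots, a_k,\ k < \infty,\ \sum_{i=0}^{k} a_i x^i = 0\}$, the set of all algebraic numbers. Compute $\mu_{F_i}(A_j)$, $1 \le i, j \le 3$.

1.31 Let $\mu$ be a measure on a semialgebra $\mathcal{C} \subset \mathcal{P}(\Omega)$ where $\Omega$ is a nonempty set. Let $\mu^*$ be the outer measure generated by $\mu$ and let $\mathcal{M}_{\mu^*}$ be the σ-algebra of $\mu^*$-measurable sets as defined in Theorem 1.3.3.

(a) Show that for all $A \subset \Omega$, there exists a $B \in \sigma\langle\mathcal{C}\rangle$ such that $A \subset B$ and $\mu^*(A) = \mu^*(B)$. (Hint: If $\mu^*(A) = \infty$, take $B$ to be $\Omega$. If $\mu^*(A) < \infty$, use the definition of $\mu^*$ to show that for each $n \ge 1$, there exists $\{B_{nj}\}_{j\ge1} \subset \mathcal{C}$ such that $A \subset B_n \equiv \bigcup_{j\ge1} B_{nj}$ and $\mu^*(A) \le \sum_{j=1}^{\infty} \mu(B_{nj}) < \mu^*(A) + \frac1n$. Take $B = \bigcap_{n\ge1} B_n$.)

(b) Show that for all $A \in \mathcal{M}_{\mu^*}$ with $\mu^*(A) < \infty$, there exists $B \in \sigma\langle\mathcal{C}\rangle$ such that $A \subset B$ and $\mu^*(B \setminus A) = 0$. (Hint: Use (a) and the relation $B = A \cup (B \setminus A)$ with $A$ and $B \setminus A = B \cap A^c$ in $\mathcal{M}_{\mu^*}$.)

(c) Show that if $\mu$ is σ-finite (i.e., there exist sets $\Omega_n$, $n \ge 1$, in $\mathcal{C}$ with $\mu(\Omega_n) < \infty$ for all $n \ge 1$ and $\bigcup_{n\ge1} \Omega_n = \Omega$), then in (b), the hypothesis that $\mu^*(A) < \infty$ can be dropped. (Hint: Assume w.l.o.g. that $\{\Omega_n\}_{n\ge1}$ are disjoint. Apply (b) to $\{A_n \equiv A \cap \Omega_n\}_{n\ge1}$.)

(d) Show that if $\mu$ is σ-finite, then $A \in \mathcal{M}_{\mu^*}$ iff there exist sets $B_1, B_2 \in \sigma\langle\mathcal{C}\rangle$ such that $B_1 \subset A \subset B_2$ and $\mu^*(B_2 \setminus B_1) = 0$. (Hint: Apply (c) to both $A$ and $A^c$.) This shows that $\mathcal{M}_{\mu^*}$ is the completion of $\sigma\langle\mathcal{C}\rangle$ w.r.t. $\mu^*$.

1.32 (An outline of a proof of Corollary 1.3.5). Let $(\mathbb{R}, \mathcal{M}_{\mu_F^*}, \mu_F^*)$ be a Lebesgue-Stieltjes measure space generated by a right continuous and nondecreasing function $F : \mathbb{R} \to \mathbb{R}$.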


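(a) Show that $A \in \mathcal{M}_{\mu_F^*}$ iff there exist Borel sets $B_1, B_2 \in \mathcal{B}(\mathbb{R})$ such that $B_1 \subset A \subset B_2$ and $\mu_F^*(B_2 \setminus B_1) = 0$. (Hint: Take $\mathcal{C}$ to be the semialgebra $\mathcal{C} = \{(a, b] : -\infty \le a \le b < \infty\} \cup \{(b, \infty) : -\infty \le b < \infty\}$ and apply Problem 1.31 (d).)

(b) Let $A \in \mathcal{M}_{\mu_F^*}$ with $\mu_F^*(A) < \infty$. Show that for any $\epsilon > 0$, there exist a finite number of bounded open intervals $I_j$, $j = 1, 2, \ldots, k$, such that $\mu_F^*\big(A \,\triangle\, \bigcup_{j=1}^{k} I_j\big) < \epsilon$.
(Outline: Claim: For any $B \in \mathcal{C}$ with $\mu_F(B) < \infty$, there exists an open interval $I$ such that $\mu_F^*(I \triangle B) < \epsilon$. To see this, note that if $B = (a, b]$, $-\infty \le a < b < \infty$, then one may choose $b' > b$ such that $F(b') - F(b) < \epsilon$. Now, with $I = (a, b')$, $\mu_F^*(I \triangle B) = \mu_F^*((b, b')) = \mu_F((b, b')) \le F(b') - F(b) < \epsilon$. If $B = (b, \infty)$ and $\mu_F(B) < \infty$, then there exists $b' > b$ such that $F(\infty) - F(b'-) < \epsilon$. Hence, with $I = (b, b')$, $\mu_F^*(I \triangle B) = \mu_F^*([b', \infty)) = F(\infty) - F(b'-) < \epsilon$. This proves the claim. Next, by Theorem 1.3.4, for all $\epsilon > 0$, there exist $B_1, B_2, \ldots, B_k \in \mathcal{C}$ such that $\mu_F^*\big(A \,\triangle\, \bigcup_{j=1}^{k} B_j\big) < \epsilon/2$. For each $B_j$, find $I_j$, a bounded open interval such that $\mu_F^*(B_j \triangle I_j) < \epsilon/2^{j+1}$. Since $(A_1 \cup A_2) \triangle (C_1 \cup C_2) \subset (A_1 \triangle C_1) \cup (A_2 \triangle C_2)$ for any $A_1, A_2, C_1, C_2 \subset \Omega$, it follows that

$$\mu_F^*\Big(\bigcup_{j=1}^{k} B_j \,\triangle\, \bigcup_{j=1}^{k} I_j\Big) \le \sum_{j=1}^{k} \mu_F^*(B_j \triangle I_j) < \frac{\epsilon}{2}.$$

Hence, $\mu_F^*\big(A \,\triangle\, \big[\bigcup_{j=1}^{k} I_j\big]\big) < \epsilon$.)

(c) Let $A \in \mathcal{M}_{\mu_F^*}$ with $\mu_F^*(A) < \infty$. Show that for every $\epsilon > 0$, there exists an open set $O$ such that $A \subset O$ and $\mu_F^*(O \setminus A) < \epsilon$.
(Hint: By definition of $\mu_F^*$, for every $\epsilon > 0$, there exist $\{B_j\}_{j\ge1} \subset \mathcal{C}$ such that $A \subset \bigcup_{j\ge1} B_j$ and $\mu_F^*(A) \le \sum_{j=1}^{\infty} \mu_F(B_j) \le \mu_F^*(A) + \epsilon$. Now as in (b), there exist open intervals $I_j$ such that $B_j \subset I_j$ and $\mu_F^*(I_j \setminus B_j) < \epsilon/2^j$ for all $j \ge 1$. Then $A \subset \bigcup_{j=1}^{\infty} B_j \subset \bigcup_{j=1}^{\infty} I_j \equiv O$. Also, $\mu_F^*(O) = \mu_F^*(A) + \mu_F^*(O \setminus A)$ $\Rightarrow$ $\mu_F^*(O \setminus A) = \mu_F^*(O) - \mu_F^*(A) < 2\epsilon$ (since $\mu_F^*(O) \le \sum_{j=1}^{\infty} \mu_F^*(I_j) < \sum_{j=1}^{\infty} \mu_F(B_j) + \epsilon \le \mu_F^*(A) + 2\epsilon < \infty$).)

(d) Extend (c) to all $A \in \mathcal{M}_{\mu_F^*}$. (Hint: Let $A_i = A \cap [i, i + 1]$, $i \in \mathbb{Z}$. Apply (c) to $A_i$ with $\epsilon_i = \epsilon/2^{|i|+1}$ and take unions.)

(e) Show that for all $A \in \mathcal{M}_{\mu_F^*}$ and for all $\epsilon > 0$, there exist a closed set $C$ and an open set $O$ such that $C \subset A \subset O$ and $\mu_F^*(O \setminus A) \le \epsilon$, $\mu_F^*(A \setminus C) < \epsilon$. (Hint: Apply (d) to $A$ and $A^c$.)

(f) Show that for all $A \in \mathcal{M}_{\mu_F^*}$ with $\mu_F^*(A) < \infty$ and for all $\epsilon > 0$, there exist a closed and bounded set $F \subset A$ such that $\mu_F^*(A \setminus F) < \epsilon$ and an open set $O$ with $A \subset O$ such that $\mu_F^*(O \setminus A) < \epsilon$. (Hint: Apply (d) to $A \cap [-M, M]$ where $M$ is chosen so that $\mu_F^*(A \cap [-M, M]^c) < \epsilon$. Why is this possible?)

Remark: Thus for all $A \in \mathcal{M}_{\mu_F^*}$ with $\mu_F^*(A) < \infty$ and for all $\epsilon > 0$, there exist a compact set $K \subset A$ and an open set $O \supset A$ such that $\mu_F^*(A \setminus K) < \epsilon$ and $\mu_F^*(O \setminus A) < \epsilon$. The first property is called inner regularity of $\mu_F^*$ and the second property is called outer regularity of $\mu_F^*$.

(g) Show that for all $A \in \mathcal{M}_{\mu_F^*}$ with $\mu_F^*(A) < \infty$ and for all $\epsilon > 0$, there exists a continuous function $g_\epsilon$ with compact support (i.e., $g_\epsilon(x)$ is zero for $|x|$ large) such that $\mu_F^*(A \,\triangle\, g_\epsilon^{-1}\{1\}) < \epsilon$.
(Hint: For any bounded open interval $(a, b)$, let $\eta > 0$ be such that $\mu_F((a, a + \eta]) + \mu_F([b - \eta, b)) < \epsilon$. Next define

$$g_\epsilon(x) = \begin{cases} 1 & \text{if } a + \eta \le x \le b - \eta \\ 0 & \text{if } x \notin (a, b) \\ \text{linear} & \text{over } [a, a + \eta] \cup [b - \eta, b]. \end{cases}$$

Then $g_\epsilon$ is continuous with compact support. Also, $g_\epsilon^{-1}\{1\} = [a + \eta, b - \eta]$ and $(a, b) \,\triangle\, g_\epsilon^{-1}\{1\} = (a, a + \eta) \cup (b - \eta, b)$. So $\mu_F\{(a, b) \,\triangle\, g_\epsilon^{-1}\{1\}\} < \epsilon$, proving the claim for $A = (a, b)$. The general case follows from (b).)

(h) Show that for all $A \in \mathcal{M}_{\mu_F^*}$ and for all $\epsilon > 0$, there exists a continuous function $g_\epsilon$ (not necessarily with compact support) such that $\mu_F^*(A \,\triangle\, g_\epsilon^{-1}\{1\}) < \epsilon$ (i.e., drop the condition $\mu_F^*(A) < \infty$).
(Hint: Let $A_k = A \cap [k, k + 1]$, $k \in \mathbb{Z}$. Find $g_k : \mathbb{R} \to \mathbb{R}$ continuous with support in $\big(k - \frac18, k + \frac98\big)$ such that $\mu_F^*(I_{A_k} \ne g_k) < \epsilon/2^{|k|+1}$. Let $g = \sum_{k \in \mathbb{Z}} g_k$. Note that for any $x \in \mathbb{R}$, at most two $g_k(x) \ne 0$ and so $g$ is continuous. Also, $\mu_F^*(I_A \ne g) \le \sum_{k \in \mathbb{Z}} \mu_F^*(I_{A_k} \ne g_k) < \epsilon$.)

1.33 Let $C$ be the Cantor set in $[0, 1]$ as defined in Section 1.3.2.
(a) Show that

$$C = \Big\{x : x = \sum_{i=1}^{\infty} \frac{a_i}{3^i},\ a_i \in \{0, 2\}\Big\}$$

and hence that $C$ is uncountable.
(b) Show that $C + C \equiv \{x + y : x, y \in C\} = [0, 2]$.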

2 Integration

2.1 Measurable transformations

Oftentimes, one is not interested in the full details of a measure space $(\Omega, \mathcal{F}, \mu)$ but only in certain functions defined on $\Omega$. For example, if $\Omega$ represents the outcomes of 10 tosses of a fair coin, one may only be interested in knowing the number of heads in the 10 tosses. It turns out that to assign measures (probabilities) to sets (events) involving such functions, one can allow only certain functions (called measurable functions) that satisfy some 'natural' restrictions, specified in the following definitions.

Definition 2.1.1: Let $\Omega$ be a nonempty set and let $\mathcal{F}$ be a σ-algebra on $\Omega$. Then the pair $(\Omega, \mathcal{F})$ is called a measurable space. If $\mu$ is a measure on $(\Omega, \mathcal{F})$, then the triplet $(\Omega, \mathcal{F}, \mu)$ is called a measure space. If in addition, $\mu$ is a probability measure, then $(\Omega, \mathcal{F}, \mu)$ is called a probability space.

Definition 2.1.2:
(a) Let $(\Omega, \mathcal{F})$ be a measurable space. Then a function $f : \Omega \to \mathbb{R}$ is called $\langle\mathcal{F}, \mathcal{B}(\mathbb{R})\rangle$-measurable (or $\mathcal{F}$-measurable) if for each $a$ in $\mathbb{R}$,

$$f^{-1}((-\infty, a]) \equiv \{\omega : f(\omega) \le a\} \in \mathcal{F}. \tag{1.1}$$

(b) Let $(\Omega, \mathcal{F}, P)$ be a probability space. Then a function $X : \Omega \to \mathbb{R}$ is called a random variable if the event

$$X^{-1}((-\infty, a]) \equiv \{\omega : X(\omega) \le a\} \in \mathcal{F}$$

for each $a$ in $\mathbb{R}$, i.e., a random variable is a real valued $\mathcal{F}$-measurable function on a probability space $(\Omega, \mathcal{F}, P)$.

It will be shown later that condition (1.1) on $f$ is equivalent to the stronger condition that $f^{-1}(A) \in \mathcal{F}$ for all Borel sets $A \in \mathcal{B}(\mathbb{R})$. Since for any Borel set $A \in \mathcal{B}(\mathbb{R})$, $f^{-1}(A)$ is a member of the underlying σ-algebra $\mathcal{F}$, one can assign a measure to the set $f^{-1}(A)$ using a measure $\mu$ on $(\Omega, \mathcal{F})$. Note that for an arbitrary function $T$ from $\Omega$ to $\mathbb{R}$, $T^{-1}(A)$ need not be a member of $\mathcal{F}$ and hence such an assignment may not be possible. Thus, condition (1.1) on real valued mappings is a 'natural' requirement while dealing with measure spaces. The following definition generalizes (1.1) to maps between two measurable spaces.

Definition 2.1.3: Let $(\Omega_i, \mathcal{F}_i)$, $i = 1, 2$, be measurable spaces. Then, a mapping $T : \Omega_1 \to \Omega_2$ is called measurable with respect to the σ-algebras $\langle\mathcal{F}_1, \mathcal{F}_2\rangle$ (or $\langle\mathcal{F}_1, \mathcal{F}_2\rangle$-measurable) if

$$T^{-1}(A) \in \mathcal{F}_1 \quad \text{for all } A \in \mathcal{F}_2.$$

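Thus, $X$ is a random variable on a probability space $(\Omega, \mathcal{F}, P)$ iff $X$ is $\langle\mathcal{F}, \mathcal{B}(\mathbb{R})\rangle$-measurable. Some examples of measurable transformations are given below.

Example 2.1.1: Let $\Omega = \{a, b, c, d\}$, $\mathcal{F}_2 = \{\Omega, \emptyset, \{a\}, \{b, c, d\}\}$ and let $\mathcal{F}_3$ = the set of all subsets of $\Omega$. Define the mappings $T_i : \Omega \to \Omega$, $i = 1, 2$, by $T_1(\omega) \equiv a$ for $\omega \in \Omega$ and

$$T_2(\omega) = \begin{cases} a & \text{if } \omega = a, b \\ c & \text{if } \omega = c, d. \end{cases}$$

Then, $T_1$ is $\langle\mathcal{F}_2, \mathcal{F}_3\rangle$-measurable since for any $A \in \mathcal{F}_3$, $T_1^{-1}(A) = \Omega$ or $\emptyset$ according as $a \in A$ or $a \notin A$. By similar arguments, it follows that $T_2$ is $\langle\mathcal{F}_3, \mathcal{F}_2\rangle$-measurable. However, $T_2$ is not $\langle\mathcal{F}_2, \mathcal{F}_3\rangle$-measurable since $T_2^{-1}(\{a\}) = \{a, b\} \notin \mathcal{F}_2$.

As this simple example shows, measurability of a given mapping critically depends on the σ-algebras on its domain and range spaces. In general, if $T$ is $\langle\mathcal{F}_1, \mathcal{F}_2\rangle$-measurable, then $T$ is $\langle\tilde{\mathcal{F}}_1, \mathcal{F}_2\rangle$-measurable for any σ-algebra $\tilde{\mathcal{F}}_1 \supset \mathcal{F}_1$ and $T$ is $\langle\mathcal{F}_1, \tilde{\mathcal{F}}_2\rangle$-measurable for any σ-algebra $\tilde{\mathcal{F}}_2 \subset \mathcal{F}_2$.

Example 2.1.2: Let $T : \mathbb{R} \to \mathbb{R}$ be defined as

$$T(x) = \begin{cases} \sin 2x & \text{if } x > 0 \\ 1 + \cos x & \text{if } x \le 0. \end{cases}$$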


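Is $T$ measurable w.r.t. the Borel σ-algebras $\langle\mathcal{B}(\mathbb{R}), \mathcal{B}(\mathbb{R})\rangle$? If one is to apply the definition directly, one must check that $T^{-1}(A) \in \mathcal{B}(\mathbb{R})$ for all $A \in \mathcal{B}(\mathbb{R})$. However, finding $T^{-1}(A)$ for all Borel sets $A$ is not an easy task. In many instances like this one, verification of the measurability property of a given mapping by directly using the definition can be difficult. In such situations, one may use some easy-to-verify sufficient conditions. Some results of this type are given below.

Proposition 2.1.1: Let $(\Omega_i, \mathcal{F}_i)$, $i = 1, 2, 3$, be measurable spaces.
(i) Suppose that $\mathcal{F}_2 = \sigma\langle\mathcal{A}\rangle$ for some class of subsets $\mathcal{A}$ of $\Omega_2$. If $T : \Omega_1 \to \Omega_2$ is such that $T^{-1}(A) \in \mathcal{F}_1$ for all $A \in \mathcal{A}$, then $T$ is $\langle\mathcal{F}_1, \mathcal{F}_2\rangle$-measurable.
(ii) Suppose that $T_1 : \Omega_1 \to \Omega_2$ is $\langle\mathcal{F}_1, \mathcal{F}_2\rangle$-measurable and $T_2 : \Omega_2 \to \Omega_3$ is $\langle\mathcal{F}_2, \mathcal{F}_3\rangle$-measurable. Let $T = T_2 \circ T_1 : \Omega_1 \to \Omega_3$ denote the composition of $T_1$ and $T_2$, defined by $T(\omega_1) = T_2(T_1(\omega_1))$, $\omega_1 \in \Omega_1$. Then, $T$ is $\langle\mathcal{F}_1, \mathcal{F}_3\rangle$-measurable.

Proof: (i) Define the collection of sets $\mathcal{F} = \{A \in \mathcal{F}_2 : T^{-1}(A) \in \mathcal{F}_1\}$. Then,
(a) $T^{-1}(\Omega_2) = \Omega_1 \in \mathcal{F}_1 \Rightarrow \Omega_2 \in \mathcal{F}$.
(b) If $A \in \mathcal{F}$, then $T^{-1}(A) \in \mathcal{F}_1 \Rightarrow (T^{-1}(A))^c \in \mathcal{F}_1 \Rightarrow T^{-1}(A^c) = (T^{-1}(A))^c \in \mathcal{F}_1$, implying $A^c \in \mathcal{F}$.
(c) If $A_1, A_2, \ldots \in \mathcal{F}$, then $T^{-1}(A_i) \in \mathcal{F}_1$ for all $i \ge 1$. Since $\mathcal{F}_1$ is a σ-algebra, $T^{-1}\big(\bigcup_{n\ge1} A_n\big) = \bigcup_{n\ge1} T^{-1}(A_n) \in \mathcal{F}_1$. Thus, $\bigcup_{n\ge1} A_n \in \mathcal{F}$. (See also Problem 2.1 on de Morgan's laws.)

Hence, by (a), (b), (c), $\mathcal{F}$ is a σ-algebra and by hypothesis $\mathcal{A} \subset \mathcal{F}$. Hence, $\mathcal{F}_2 = \sigma\langle\mathcal{A}\rangle \subset \mathcal{F} \subset \mathcal{F}_2$. Thus, $\mathcal{F} = \mathcal{F}_2$ and $T$ is $\langle\mathcal{F}_1, \mathcal{F}_2\rangle$-measurable.

(ii) Let $A \in \mathcal{F}_3$. Then, $T_2^{-1}(A) \in \mathcal{F}_2$, since $T_2$ is $\langle\mathcal{F}_2, \mathcal{F}_3\rangle$-measurable. Also, by the $\langle\mathcal{F}_1, \mathcal{F}_2\rangle$-measurability of $T_1$, $T^{-1}(A) = T_1^{-1}(T_2^{-1}(A)) \in \mathcal{F}_1$, showing that $T$ is $\langle\mathcal{F}_1, \mathcal{F}_3\rangle$-measurable. □

Proposition 2.1.2: For any $k, p \in \mathbb{N}$, if $f : \mathbb{R}^p \to \mathbb{R}^k$ is continuous, then $f$ is $\langle\mathcal{B}(\mathbb{R}^p), \mathcal{B}(\mathbb{R}^k)\rangle$-measurable.

Proof: Let $\mathcal{A} = \{A : A \text{ is an open set in } \mathbb{R}^k\}$. Then, by the continuity of $f$, $f^{-1}(A)$ is open and hence, is in $\mathcal{B}(\mathbb{R}^p)$ (cf. Section A.4). Thus, $f^{-1}(A) \in \mathcal{B}(\mathbb{R}^p)$ for all $A \in \mathcal{A}$. Since $\mathcal{B}(\mathbb{R}^k) = \sigma\langle\mathcal{A}\rangle$, by Proposition 2.1.1 (i), $f$ is $\langle\mathcal{B}(\mathbb{R}^p), \mathcal{B}(\mathbb{R}^k)\rangle$-measurable. □

Although the converse to the above proposition is not true, a result due to Lusin says that except on a set of small measure, a measurable $f$ coincides with a continuous function. This is stronger than the statement that except on a set of small measure, $f$ is close to a continuous function. For the statement and proof of Lusin's theorem, see Theorem 2.5.12.

Proposition 2.1.3: Let $f_1, \ldots, f_k$ ($k \in \mathbb{N}$) be $\langle\mathcal{F}, \mathcal{B}(\mathbb{R})\rangle$-measurable transformations from $\Omega$ to $\mathbb{R}$. Then,
(i) $f = (f_1, \ldots, f_k)$ is $\langle\mathcal{F}, \mathcal{B}(\mathbb{R}^k)\rangle$-measurable.
(ii) $g = f_1 + \ldots + f_k$ is $\langle\mathcal{F}, \mathcal{B}(\mathbb{R})\rangle$-measurable.
(iii) $h \equiv \prod_{i=1}^{k} f_i$ is $\langle\mathcal{F}, \mathcal{B}(\mathbb{R})\rangle$-measurable.
(iv) Let $p \in \mathbb{N}$ and let $\psi : \mathbb{R}^k \to \mathbb{R}^p$ be continuous. Then, $\xi \equiv \psi \circ f$ is $\langle\mathcal{F}, \mathcal{B}(\mathbb{R}^p)\rangle$-measurable, where $f = (f_1, \ldots, f_k)$.

Proof: To prove (i), note that for any rectangle $R = (a_1, b_1) \times \ldots \times (a_k, b_k)$,

$$\begin{aligned}
f^{-1}(R) &= \{\omega \in \Omega : a_1 < f_1(\omega) < b_1, \ldots, a_k < f_k(\omega) < b_k\} \\
&= \bigcap_{i=1}^{k} \{\omega \in \Omega : a_i < f_i(\omega) < b_i\} \\
&= \bigcap_{i=1}^{k} f_i^{-1}((a_i, b_i)) \in \mathcal{F},
\end{aligned}$$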

since each $f_i$ is $\langle\mathcal{F}, \mathcal{B}(\mathbb{R})\rangle$-measurable. Hence, by Proposition 2.1.1 (i), $f$ is $\langle\mathcal{F}, \mathcal{B}(\mathbb{R}^k)\rangle$-measurable. To prove (ii), note that the function $g_1(x) \equiv x_1 + \ldots + x_k$, $x = (x_1, \ldots, x_k) \in \mathbb{R}^k$, is continuous on $\mathbb{R}^k$, and hence, by Proposition 2.1.2, is $\langle\mathcal{B}(\mathbb{R}^k), \mathcal{B}(\mathbb{R})\rangle$-measurable. Since $g = g_1 \circ f$, $g$ is $\langle\mathcal{F}, \mathcal{B}(\mathbb{R})\rangle$-measurable, by Proposition 2.1.1 (ii). The proofs of (iii) and (iv) are similar to that of (ii) and hence, are omitted. □

Corollary 2.1.4: The collection of $\langle\mathcal{F}, \mathcal{B}(\mathbb{R})\rangle$-measurable functions from $\Omega$ to $\mathbb{R}$ is closed under pointwise addition and multiplication as well as under scalar multiplication.

The proof of Corollary 2.1.4 is omitted. In view of the above, writing the function $T$ of Example 2.1.2 as $T(x) = (\sin 2x) I_{(0,\infty)}(x) + (1 + \cos x) I_{(-\infty,0]}(x)$, $x \in \mathbb{R}$, the $\langle\mathcal{B}(\mathbb{R}), \mathcal{B}(\mathbb{R})\rangle$-measurability of $T$ follows. Note that here $T$ is not continuous over $\mathbb{R}$ but only piecewise continuous (see also Problem 2.2).


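Next, measurability of the limit of a sequence of measurable functions is considered. Let $\bar{\mathbb{R}} = \mathbb{R} \cup \{+\infty, -\infty\}$ denote the extended real line and let $\mathcal{B}(\bar{\mathbb{R}}) \equiv \sigma\langle\mathcal{B}(\mathbb{R}) \cup \{+\infty\} \cup \{-\infty\}\rangle$ denote the extended Borel σ-algebra on $\bar{\mathbb{R}}$.

Proposition 2.1.5: For each $n \in \mathbb{N}$, let $f_n : \Omega \to \bar{\mathbb{R}}$ be a $\langle\mathcal{F}, \mathcal{B}(\bar{\mathbb{R}})\rangle$-measurable function.
(i) Then, each of the functions $\sup_{n \in \mathbb{N}} f_n$, $\inf_{n \in \mathbb{N}} f_n$, $\limsup_{n\to\infty} f_n$, and $\liminf_{n\to\infty} f_n$ is $\langle\mathcal{F}, \mathcal{B}(\bar{\mathbb{R}})\rangle$-measurable.
(ii) The set $A \equiv \{\omega : \lim_{n\to\infty} f_n(\omega) \text{ exists and is finite}\}$ lies in $\mathcal{F}$ and the function $h \equiv (\lim_{n\to\infty} f_n) \cdot I_A$ is $\langle\mathcal{F}, \mathcal{B}(\bar{\mathbb{R}})\rangle$-measurable.

Proof: (i) Let $g = \sup_{n\ge1} f_n$. To show that $g$ is $\langle\mathcal{F}, \mathcal{B}(\bar{\mathbb{R}})\rangle$-measurable, it is enough to show that $\{\omega : g(\omega) \le r\} \in \mathcal{F}$ for all $r \in \mathbb{R}$ (cf. Problem 2.4). Now, for any $r \in \mathbb{R}$,

$$\{\omega : g(\omega) \le r\} = \bigcap_{n=1}^{\infty} \{\omega : f_n(\omega) \le r\} = \bigcap_{n=1}^{\infty} f_n^{-1}((-\infty, r]) \in \mathcal{F},$$

since $f_n^{-1}((-\infty, r]) \in \mathcal{F}$ for all $n \ge 1$, by the measurability of $f_n$. Next note that $\inf_{n\ge1} f_n = -\sup_{n\ge1}(-f_n)$ and hence, $\inf_{n\ge1} f_n$ is $\langle\mathcal{F}, \mathcal{B}(\bar{\mathbb{R}})\rangle$-measurable. To prove the measurability of $\limsup_{n\to\infty} f_n$, define the functions $g_m = \sup_{n\ge m} f_n$, $m \ge 1$. Then, $g_m$ is $\langle\mathcal{F}, \mathcal{B}(\bar{\mathbb{R}})\rangle$-measurable for each $m \ge 1$ and since $g_m$ is nonincreasing in $m$, $\inf_{m\ge1} g_m \equiv \limsup_{n\to\infty} f_n$ is also $\langle\mathcal{F}, \mathcal{B}(\bar{\mathbb{R}})\rangle$-measurable. A similar argument works for $\liminf_{n\to\infty} f_n$.

(ii) Let $h_1 = \limsup_{n\to\infty} f_n$ and $h_2 = \liminf_{n\to\infty} f_n$, and define $\tilde{h}_i = h_i I_{\mathbb{R}}(h_i)$, $i = 1, 2$. Note that $\tilde{h}_1 - \tilde{h}_2$ is $\langle\mathcal{F}, \mathcal{B}(\mathbb{R})\rangle$-measurable. Hence,

$$\begin{aligned}
\{\omega : \lim_{n\to\infty} f_n(\omega) \text{ exists and is finite}\}
&= \{\omega : -\infty < \limsup_{n\to\infty} f_n(\omega) = \liminf_{n\to\infty} f_n(\omega) < \infty\} \\
&= \{\omega : -\infty < h_2(\omega) = h_1(\omega) < \infty\} \\
&= \{\omega : \tilde{h}_1(\omega) = \tilde{h}_2(\omega)\} \cap \{\omega : h_1(\omega) < \infty,\ h_2(\omega) > -\infty\} \\
&= (\tilde{h}_1 - \tilde{h}_2)^{-1}(\{0\}) \cap \{\omega : h_1(\omega) < \infty,\ h_2(\omega) > -\infty\} \in \mathcal{F}.
\end{aligned}$$

Finally, note that $h = h_1 I_A$. □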


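Definition 2.1.4: Let $\{f_\lambda : \lambda \in \Lambda\}$ be a family of mappings from $\Omega_1$ into $\Omega_2$ and let $\mathcal{F}_2$ be a σ-algebra on $\Omega_2$. Then, $\sigma\langle\{f_\lambda^{-1}(A) : A \in \mathcal{F}_2,\ \lambda \in \Lambda\}\rangle$ is called the σ-algebra generated by $\{f_\lambda : \lambda \in \Lambda\}$ (w.r.t. $\mathcal{F}_2$) and is denoted by $\sigma\langle\{f_\lambda : \lambda \in \Lambda\}\rangle$.

Note that $\sigma\langle\{f_\lambda : \lambda \in \Lambda\}\rangle$ is the smallest σ-algebra on $\Omega_1$ that makes all $f_\lambda$'s measurable w.r.t. $\mathcal{F}_2$ on $\Omega_2$.

Example 2.1.3: Let $f = I_A$ for some set $A \subset \Omega_1$ and $\Omega_2 = \mathbb{R}$ and $\mathcal{F}_2 = \mathcal{B}(\mathbb{R})$. Then, $\sigma\langle\{f\}\rangle = \sigma\langle\{A\}\rangle = \{\Omega_1, \emptyset, A, A^c\}$.

Example 2.1.4: Let $\Omega_1 = \mathbb{R}^k$, $\Omega_2 = \mathbb{R}$, $\mathcal{F}_2 = \mathcal{B}(\mathbb{R})$, and for $1 \le i \le k$, let $f_i : \Omega_1 \to \Omega_2$ be defined as

$$f_i(x_1, \ldots, x_k) = x_i, \quad (x_1, \ldots, x_k) \in \Omega_1 = \mathbb{R}^k.$$

Then, $\sigma\langle\{f_i : 1 \le i \le k\}\rangle = \mathcal{B}(\mathbb{R}^k)$. To show this, note that any measurable rectangle $A_1 \times \ldots \times A_k$ can be written as $A_1 \times \ldots \times A_k = \bigcap_{i=1}^{k} f_i^{-1}(A_i)$ and hence $A_1 \times \ldots \times A_k \in \sigma\langle\{f_i : 1 \le i \le k\}\rangle$ for all $A_1, \ldots, A_k \in \mathcal{B}(\mathbb{R})$. Since $\mathcal{B}(\mathbb{R}^k)$ is generated by the collection of all measurable rectangles, $\mathcal{B}(\mathbb{R}^k) \subset \sigma\langle\{f_i : 1 \le i \le k\}\rangle$. Conversely, for any $A \in \mathcal{B}(\mathbb{R})$ and for any $1 \le i \le k$, $f_i^{-1}(A) = \mathbb{R} \times \ldots \times A \times \ldots \times \mathbb{R}$ (with $A$ in the $i$th position) is in $\mathcal{B}(\mathbb{R}^k)$. Therefore, $\sigma\langle\{f_i : 1 \le i \le k\}\rangle = \sigma\langle\{f_i^{-1}(A) : A \in \mathcal{B}(\mathbb{R}),\ 1 \le i \le k\}\rangle \subset \mathcal{B}(\mathbb{R}^k)$. Hence, $\sigma\langle\{f_i : 1 \le i \le k\}\rangle = \mathcal{B}(\mathbb{R}^k)$.

Proposition 2.1.6: Let $\{f_\lambda : \lambda \in \Lambda\}$ be an uncountable collection of maps from $\Omega_1$ to $\Omega_2$. Then for any $B \in \sigma\langle\{f_\lambda : \lambda \in \Lambda\}\rangle$, there exists a countable set $\Lambda_B \subset \Lambda$ such that $B \in \sigma\langle\{f_\lambda : \lambda \in \Lambda_B\}\rangle$.

Proof: The proof of this result is left as an exercise (Problem 2.5). □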
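When $\Omega$ is finite, Definitions 2.1.3 and 2.1.4 can be checked by brute force. The following Python sketch is an illustration only: the helper names `generate_sigma_algebra`, `preimage`, and `is_measurable` are hypothetical, and the closure loop relies on $\Omega$ being finite, so that closure under complements and finite unions already yields a σ-algebra. It reproduces the conclusions of Examples 2.1.1 and 2.1.3.

```python
from itertools import combinations


def generate_sigma_algebra(omega, generators):
    """Smallest family containing the generators that is closed under
    complement and union. Valid as a sigma-algebra only for finite omega."""
    omega = frozenset(omega)
    sets = {frozenset(), omega} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        current = list(sets)
        for a in current:
            if omega - a not in sets:
                sets.add(omega - a)
                changed = True
        for a, b in combinations(current, 2):
            if a | b not in sets:
                sets.add(a | b)
                changed = True
    return sets


def preimage(T, omega, A):
    """T^{-1}(A) for a map T given as a dict on omega."""
    return frozenset(w for w in omega if T[w] in A)


def is_measurable(T, omega, F_domain, F_range):
    """Check T^{-1}(A) in F_domain for every A in F_range (Definition 2.1.3)."""
    return all(preimage(T, omega, A) in F_domain for A in F_range)


# Example 2.1.1
omega = {"a", "b", "c", "d"}
F2 = generate_sigma_algebra(omega, [{"a"}])                # {Omega, empty, {a}, {b,c,d}}
F3 = generate_sigma_algebra(omega, [{w} for w in omega])   # power set of Omega
T1 = {w: "a" for w in omega}
T2 = {"a": "a", "b": "a", "c": "c", "d": "c"}

t1_ok = is_measurable(T1, omega, F2, F3)      # T1 is <F2, F3>-measurable
t2_ok = is_measurable(T2, omega, F3, F2)      # T2 is <F3, F2>-measurable
t2_fails = is_measurable(T2, omega, F2, F3)   # fails: T2^{-1}({a}) = {a, b}

# Example 2.1.3: the sigma-algebra generated by an indicator function
A = {"a", "b"}
indicator = {w: int(w in A) for w in omega}
sigma_f = generate_sigma_algebra(omega, [preimage(indicator, omega, {1})])
```

The same function `T2` is measurable or not depending only on which σ-algebras are placed on the domain and range, which is the point of Example 2.1.1.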

2.2 Induced measures, distribution functions

Suppose $X$ is a random variable defined on a probability space $(\Omega, \mathcal{F}, P)$. Then $P$ governs the probabilities assigned to events like $X^{-1}([a, b])$, $-\infty < a < b < \infty$. Since $X$ takes values in the real line, one should be able to express such probabilities only as a function of the set $[a, b]$. Clearly, since $X$ is $\langle\mathcal{F}, \mathcal{B}(\mathbb{R})\rangle$-measurable, $X^{-1}(A) \in \mathcal{F}$ for all $A \in \mathcal{B}(\mathbb{R})$ and the function

$$P_X(A) \equiv P(X^{-1}(A)) \tag{2.1}$$

is a set function defined on $\mathcal{B}(\mathbb{R})$. Is this a (probability) measure on $\mathcal{B}(\mathbb{R})$? The following proposition answers the question more generally.

Proposition 2.2.1: Let $(\Omega_i, \mathcal{F}_i)$, $i = 1, 2$, be measurable spaces and let $T : \Omega_1 \to \Omega_2$ be a $\langle\mathcal{F}_1, \mathcal{F}_2\rangle$-measurable mapping from $\Omega_1$ to $\Omega_2$. Then, for any measure $\mu$ on $(\Omega_1, \mathcal{F}_1)$, the set function $\mu T^{-1}$, defined by

$$\mu T^{-1}(A) \equiv \mu(T^{-1}(A)), \quad A \in \mathcal{F}_2, \tag{2.2}$$

is a measure on $\mathcal{F}_2$.

Proof: It is easy to check that $\mu T^{-1}$ satisfies the three conditions for being a measure. The details are left as an exercise (cf. Problem 2.9). □

Definition 2.2.1: The measure $\mu T^{-1}$ is called the measure induced by $T$ (or the induced measure of $T$) on $\mathcal{F}_2$.

In particular, if $\mu(\Omega_1) = 1$, then $\mu T^{-1}(\Omega_2) = 1$. Hence, the set function $P_X$ defined in (2.1) is indeed a probability measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$.

Definition 2.2.2: For a random variable $X$ defined on a probability space $(\Omega, \mathcal{F}, P)$, the probability distribution of $X$ (or the law of $X$), denoted by $P_X$ (say), is the induced measure of $X$ under $P$ on $\mathbb{R}$, as defined in (2.2).

In introductory courses on probability and statistics, one defines probabilities of events like '$X \in [a, b]$' by using the probability mass function for discrete random variables and the probability density function for 'continuous' random variables. The measure-theoretic definition above allows one to treat both these cases as well as the case of 'mixed' distributions under a unified framework.

Definition 2.2.3: The cumulative distribution function (or cdf in short) of a random variable $X$ is defined as

$$F_X(x) \equiv P_X((-\infty, x]), \quad x \in \mathbb{R}. \tag{2.3}$$
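The induced measure (2.1) and the cdf (2.3) can be computed explicitly for the coin-tossing example that opens Section 2.1. The sketch below is a hedged illustration: it assumes $\Omega = \{0, 1\}^{10}$ with the uniform probability on all $2^{10}$ outcomes, takes $X$ to be the number of heads, and uses hypothetical names (`P_X`, `cdf`) that do not come from the text.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

n = 10
omega = list(product((0, 1), repeat=n))           # all 2^10 toss sequences
P = {w: Fraction(1, 2 ** n) for w in omega}       # fair coin: uniform P


def X(w):
    """Number of heads in the outcome w."""
    return sum(w)


# Induced measure P_X({v}) = P(X^{-1}({v})), as in (2.1)
P_X = defaultdict(Fraction)
for w, p in P.items():
    P_X[X(w)] += p


def cdf(x):
    """F_X(x) = P_X((-inf, x]), as in (2.3)."""
    return sum(p for v, p in P_X.items() if v <= x)
```

Here `P_X` lives on the ten-plus-one possible values of $X$ rather than on the 1024 points of $\Omega$, which is exactly the reduction the induced measure is meant to provide.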

Proposition 2.2.2: Let $F$ be the cdf of a random variable $X$.
(i) For $x_1 < x_2$, $F(x_1) \le F(x_2)$ (i.e., $F$ is nondecreasing on $\mathbb{R}$).
(ii) For $x$ in $\mathbb{R}$, $F(x) = \lim_{y \downarrow x} F(y)$ (i.e., $F$ is right continuous on $\mathbb{R}$).
(iii) $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to +\infty} F(x) = 1$.

Proof: For $x_1 < x_2$, $(-\infty, x_1] \subset (-\infty, x_2]$. Since $P_X$ is a measure on $\mathcal{B}(\mathbb{R})$, $F(x_1) = P_X((-\infty, x_1]) \le P_X((-\infty, x_2]) = F(x_2)$, proving (i).

To prove (ii), let $x_n \downarrow x$. Then, the sets $(-\infty, x_n] \downarrow (-\infty, x]$, and $P_X((-\infty, x_1]) = P(X \le x_1) \le 1$. Hence, using the monotone continuity from above of the measure $P_X$ (m.c.f.a.) (cf. Proposition 1.2.3), one gets

$$\lim_{n\to\infty} F(x_n) = \lim_{n\to\infty} P_X((-\infty, x_n]) = P_X((-\infty, x]) = F(x).$$

Next consider part (iii). Note that if $x_n \downarrow -\infty$ and $y_n \uparrow \infty$, then $(-\infty, x_n] \downarrow \emptyset$ and $(-\infty, y_n] \uparrow (-\infty, \infty)$. Hence, part (iii) follows from the m.c.f.a. and the m.c.f.b. properties of $P_X$ (cf. Propositions 1.2.1 and 1.2.3). □

Definition 2.2.4: A function $F : \mathbb{R} \to \mathbb{R}$ satisfying (i), (ii), and (iii) of Proposition 2.2.2 is called a cumulative distribution function (or cdf for short).

Thus, given a random variable $X$, its cdf $F_X$ satisfies properties (i), (ii), (iii) of Proposition 2.2.2. Conversely, given a cdf $F$, one can construct a probability space $(\Omega, \mathcal{F}, P)$ and a random variable $X$ on it such that the cdf of $X$ is $F$. Indeed, given a cdf $F$, note that by Theorem 1.3.3 and Definition 1.3.7, there exists a (Lebesgue-Stieltjes) probability measure $\mu_F$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such that $\mu_F((-\infty, x]) = F(x)$. Now define $X$ to be the identity map on $\mathbb{R}$, i.e., let $X(x) \equiv x$ for all $x \in \mathbb{R}$. Then, $X$ is a random variable on the probability space $(\mathbb{R}, \mathcal{B}(\mathbb{R}), \mu_F)$ with probability distribution $P_X = \mu_F$ and cdf $F_X = F$.

In addition to (i), (ii) and (iii) of Proposition 2.2.2, it is easy to verify that for any $x$ in $\mathbb{R}$,

$$P(X < x) = F_X(x-) \equiv \lim_{y \uparrow x} F_X(y)$$

and hence

$$P(X = x) = F_X(x) - F_X(x-). \tag{2.4}$$

Thus, the function $F_X(\cdot)$ has a jump at $x$ iff $P(X = x) > 0$. Since a monotone function from $\mathbb{R}$ to $\mathbb{R}$ can have only jump discontinuities and only a countable number of them (cf. Problem 2.11), for any random variable $X$, the set $\{a \in \mathbb{R} : P(X = a) > 0\}$ is countable. This leads to the following definitions.

Definition 2.2.5:
(a) A random variable $X$ is called discrete if there exists a countable set $A \subset \mathbb{R}$ such that $P(X \in A) = 1$.
(b) A random variable $X$ is called continuous if $P(X = x) = 0$ for all $x \in \mathbb{R}$.

Note that $X$ is continuous iff $F_X$ is continuous on all of $\mathbb{R}$ and $X$ is discrete iff the sum of all the jumps of its cdf $F_X$ is one. It may also be noted that if $F_X$ is a step function, then $X$ is discrete but not conversely. For example, consider the case where the set $A$ in the above definition is the set of all rational numbers.

It turns out that a given cdf may be written as a weighted sum of a discrete cdf and a continuous cdf. Let $F$ be a cdf. Let $A \equiv \{x : p(x) \equiv F(x) - F(x-) > 0\}$. As remarked earlier, $A$ is at most countable. Write $\alpha = \sum_{y \in A} p(y)$ and let $\tilde{F}_d(x) = \sum_{y \in A} p(y) I_{(-\infty, x]}(y)$, and $\tilde{F}_c(x) = F(x) - \tilde{F}_d(x)$. It is easy to verify that $\tilde{F}_c(\cdot)$ is continuous on $\mathbb{R}$. If $\alpha = 0$, then $F(x) = \tilde{F}_c(x)$ and $F$ is continuous. If $\alpha = 1$, then $F = \tilde{F}_d$ and $F$ is discrete. If $0 < \alpha < 1$, $F(\cdot)$ can be written as

$$F(x) = \alpha F_d(x) + (1 - \alpha) F_c(x), \tag{2.5}$$

where $F_d(x) \equiv \alpha^{-1} \tilde{F}_d(x)$ and $F_c(x) \equiv (1 - \alpha)^{-1} \tilde{F}_c(x)$ are both cdfs, with $F_d$ being discrete and $F_c$ being continuous. For a further decomposition of $F_c(\cdot)$ into absolutely continuous and singular continuous components, see Chapter 4.
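The decomposition (2.5) can be traced on a small mixed distribution. The sketch below rests on stated assumptions: the cdf is a made-up example with jumps of size 1/4 at 0 and at 1 plus half the Uniform(0, 2) cdf, and the jump set `jumps` is supplied by hand rather than located numerically from $F$. All function names are hypothetical.

```python
from fractions import Fraction as Fr

# Assumed jump set A and jump sizes p(y); in general these come from F(y) - F(y-).
jumps = {Fr(0): Fr(1, 4), Fr(1): Fr(1, 4)}


def F(x):
    """Illustrative mixed cdf: two atoms plus (1/2) * Uniform(0, 2) cdf."""
    x = Fr(x)
    continuous_part = Fr(1, 2) * min(max(x / 2, Fr(0)), Fr(1))
    discrete_part = sum((p for y, p in jumps.items() if y <= x), Fr(0))
    return continuous_part + discrete_part


alpha = sum(jumps.values())                       # total jump mass


def F_d_tilde(x):
    return sum((p for y, p in jumps.items() if y <= Fr(x)), Fr(0))


def F_c_tilde(x):
    return F(x) - F_d_tilde(x)


def F_d(x):
    return F_d_tilde(x) / alpha                   # discrete cdf


def F_c(x):
    return F_c_tilde(x) / (1 - alpha)             # continuous cdf


grid = [Fr(-1), Fr(0), Fr(1, 2), Fr(1), Fr(3, 2), Fr(2), Fr(3)]
identity_holds = all(F(x) == alpha * F_d(x) + (1 - alpha) * F_c(x) for x in grid)
```

Exact rational arithmetic is used so that the identity (2.5) can be checked with equality rather than a tolerance.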

2.2.1 Generalizations to higher dimensions

Induced distributions of random vectors and the associated cdfs are briefly considered in this section. Let $X = (X_1, X_2, \ldots, X_k)$ be a $k$-dimensional random vector defined on a probability space $(\Omega, \mathcal{F}, P)$. The probability distribution $P_X$ of $X$ is the induced probability measure on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$, defined by (cf. (2.1))

$$P_X(A) \equiv P(X^{-1}(A)), \quad A \in \mathcal{B}(\mathbb{R}^k). \tag{2.6}$$

The cdf $F_X$ of $X$ is now defined by

$$F_X(x) = P(X \le x), \quad x \in \mathbb{R}^k, \tag{2.7}$$

where for any $x = (x_1, x_2, \ldots, x_k)$ and $y = (y_1, y_2, \ldots, y_k)$ in $\mathbb{R}^k$, $x \le y$ means that $x_i \le y_i$ for all $i = 1, \ldots, k$. The extension of Proposition 2.2.2 to the $k$-dimensional case is notationally involved. Here, an analog of Proposition 2.2.2 for the bivariate case, i.e., for $k = 2$, is stated.

Proposition 2.2.3: Let $F$ be the cdf of a bivariate random vector $X = (X_1, X_2)$.
(i) Then, for any $x = (x_1, x_2) \le y = (y_1, y_2)$,

$$F(y_1, y_2) - F(x_1, y_2) - F(y_1, x_2) + F(x_1, x_2) \ge 0. \tag{2.8}$$

(ii) For any $x = (x_1, x_2) \in \mathbb{R}^2$,

$$\lim_{y_1 \downarrow x_1,\ y_2 \downarrow x_2} F(y_1, y_2) = F(x_1, x_2),$$

i.e., $F$ is right continuous on $\mathbb{R}^2$.
(iii) $\lim_{x_1 \to -\infty} F(x_1, a) = \lim_{x_2 \to -\infty} F(a, x_2) = 0$ for all $a \in \mathbb{R}$; $\lim_{x_1 \to \infty,\ x_2 \to \infty} F(x_1, x_2) = 1$.
(iv) For any $a \in \mathbb{R}$, $\lim_{y \uparrow \infty} F(a, y) = P(X_1 \le a)$ and $\lim_{y \uparrow \infty} F(y, a) = P(X_2 \le a)$.

Proof: Clearly,

$$\begin{aligned}
0 &\le P(X \in (x_1, y_1] \times (x_2, y_2]) \\
&= P(x_1 < X_1 \le y_1,\ x_2 < X_2 \le y_2) \\
&= P(X_1 \le y_1,\ x_2 < X_2 \le y_2) - P(X_1 \le x_1,\ x_2 < X_2 \le y_2) \\
&= P(X_1 \le y_1, X_2 \le y_2) - P(X_1 \le y_1, X_2 \le x_2) - [P(X_1 \le x_1, X_2 \le y_2) - P(X_1 \le x_1, X_2 \le x_2)] \\
&= F(y_1, y_2) - F(y_1, x_2) - F(x_1, y_2) + F(x_1, x_2).
\end{aligned}$$

This proves (i). To prove (ii), note that for any sequence $y_{in} \downarrow x_i$, $i = 1, 2$, the sets $A_n = (-\infty, y_{1n}] \times (-\infty, y_{2n}] \downarrow A \equiv (-\infty, x_1] \times (-\infty, x_2]$. Hence, by the m.c.f.a. property of a probability measure, $F(y_{1n}, y_{2n}) = P(X \in A_n) \downarrow P(X \in A) = F(x_1, x_2)$.

For (iii), note that $(-\infty, x_{1n}] \times (-\infty, a] \downarrow \emptyset$ for any sequence $x_{1n} \downarrow -\infty$ and for any $a \in \mathbb{R}$. Hence, again by the m.c.f.a. property,

$$F(x_{1n}, a) \downarrow 0 \quad \text{as } n \to \infty.$$

By similar arguments, $F(a, x_{2n}) \downarrow 0$ whenever $x_{2n} \downarrow -\infty$. To prove the last relation in (iii), apply the m.c.f.b. property to the sets $(-\infty, x_{1n}] \times (-\infty, x_{2n}] \uparrow \mathbb{R}^2$ for $x_{1n} \uparrow \infty$, $x_{2n} \uparrow \infty$. The proof of part (iv) is similar. □

Note that any function satisfying properties (i), (ii), (iii) of Proposition 2.2.3 determines a probability measure uniquely. This follows from the discussions in Section 1.3, as (1.3.9) and (1.3.10) follow from (i) and (iii) (Problem 2.12).

For a general $k \ge 1$, an analog of property (i) above is cumbersome to write down explicitly. Indeed, for any $x \le y$, now a sum involving $2^k$ terms must be nonnegative. However, properties (ii), (iii), and (iv) can be extended in an obvious way to the $k$-dimensional case. See Problem 2.13 for a precise statement. Also, for a general $k \ge 1$, functions satisfying the properties listed in Problem 2.13 uniquely determine a probability measure on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$.
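The $2^k$-term sum mentioned above is the alternating sum of $F$ over the corners of the rectangle $(x_1, y_1] \times \ldots \times (x_k, y_k]$, with sign $(-1)^m$ where $m$ is the number of coordinates taken from $x$; for $k = 2$ it reduces to (2.8). The following Python sketch is an assumption-laden illustration: `rect_prob` is a hypothetical helper, and `F_unif` is the cdf of $k$ independent Uniform(0, 1) coordinates, chosen only because the answer is then the volume of the rectangle.

```python
from itertools import product


def rect_prob(F, x, y):
    """P(x < X <= y) from the cdf F via the 2^k-term alternating corner sum."""
    k = len(x)
    total = 0.0
    for choice in product((0, 1), repeat=k):      # 1 -> use y_i, 0 -> use x_i
        corner = [y[i] if c else x[i] for i, c in enumerate(choice)]
        sign = (-1) ** (k - sum(choice))          # one factor of -1 per x_i used
        total += sign * F(corner)
    return total


def F_unif(v):
    """cdf of independent Uniform(0, 1) coordinates (an assumed example)."""
    out = 1.0
    for t in v:
        out *= min(max(t, 0.0), 1.0)
    return out


p2 = rect_prob(F_unif, (0.2, 0.1), (0.5, 0.4))            # area 0.3 * 0.3
p3 = rect_prob(F_unif, (0.0, 0.0, 0.0), (0.5, 0.5, 0.5))  # volume 0.5 ** 3
```

For $k = 2$ the four terms are $F(y_1, y_2) - F(x_1, y_2) - F(y_1, x_2) + F(x_1, x_2)$, matching part (i) of Proposition 2.2.3.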

2.3 Integration

Let $(\Omega, \mathcal{F}, \mu)$ be a measure space and $f : \Omega \to \mathbb{R}$ be a measurable function. The goal of this section is to define the integral of $f$ with respect to the measure $\mu$ and establish some basic convergence results. The integral of a nonnegative function taking only finitely many values is defined first, which is then extended to all nonnegative measurable functions by approximation from below. Finally, the integral of an arbitrary measurable function is defined using the decomposition of the function into its positive and negative parts.

Definition 2.3.1: A function $f : \Omega \to \bar{\mathbb{R}} \equiv [-\infty, \infty]$ is called simple if there exist a finite set (of distinct elements) $\{c_1, \ldots, c_k\} \subset \bar{\mathbb{R}}$ and sets $A_1, \ldots, A_k \in \mathcal{F}$, $k \in \mathbb{N}$, such that $f$ can be written as

$$f = \sum_{i=1}^{k} c_i I_{A_i}. \tag{3.1}$$

Definition 2.3.2: (The integral of a simple nonnegative function). Let $f : \Omega \to \bar{\mathbb{R}}_+ \equiv [0, \infty]$ be a simple nonnegative function on $(\Omega, \mathcal{F}, \mu)$ with the representation (3.1). The integral of $f$ w.r.t. $\mu$, denoted by $\int f\,d\mu$, is defined as

$$\int f\,d\mu \equiv \sum_{i=1}^{k} c_i \mu(A_i). \tag{3.2}$$

Here and in the following, the relation $0 \cdot \infty = 0$ is adopted as a convention. It may be verified that the value of the integral in (3.2) does not depend on the representation of $f$. That is, if $f$ can be expressed as $f = \sum_{j=1}^{l} d_j I_{B_j}$ for some $d_1, \ldots, d_l \in \bar{\mathbb{R}}_+$ (not necessarily distinct) and for some sets $B_1, \ldots, B_l \in \mathcal{F}$, then $\sum_{i=1}^{k} c_i \mu(A_i) = \sum_{j=1}^{l} d_j \mu(B_j)$, so that the value of the integral remains unchanged (Problem 2.17). Also note that for the $f$ in Definition 2.3.2,

$$0 \le \int f\,d\mu \le \infty.$$
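Definition (3.2), the convention $0 \cdot \infty = 0$, and the independence of the integral from the chosen representation can be checked on a small example. The sketch below makes several assumptions that are not in the text: the measure is counting measure on a finite set, a simple function is passed as a list of `(c_i, A_i)` pairs, and the names `integrate_simple` and `counting_measure` are hypothetical.

```python
import math


def integrate_simple(representation, mu):
    """Sum of c_i * mu(A_i) as in (3.2), with the convention 0 * inf = 0."""
    total = 0.0
    for c, A in representation:
        m = mu(A)
        if c == 0 or m == 0:
            continue                  # 0 * inf is treated as 0, never as nan
        total += c * m
    return total


def counting_measure(A):
    return float(len(A))


# f = 2 on {0, 1} and 5 on {2}, written in two different ways
rep1 = [(2.0, frozenset({0, 1})), (5.0, frozenset({2}))]
rep2 = [(2.0, frozenset({0})), (2.0, frozenset({1})),
        (3.0, frozenset({2})), (2.0, frozenset({2}))]

value1 = integrate_simple(rep1, counting_measure)
value2 = integrate_simple(rep2, counting_measure)

# The convention 0 * inf = 0 in both directions
zero_on_infinite_set = integrate_simple([(0.0, frozenset({0}))], lambda A: math.inf)
infinite_on_null_set = integrate_simple([(math.inf, frozenset())], counting_measure)
```

Without the explicit `continue`, floating-point arithmetic would evaluate `0.0 * math.inf` as `nan`, so the convention has to be coded rather than inherited from the number system.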

The following proposition is an easy consequence of the definition and the above remark.

Proposition 2.3.1: Let $f$ and $g$ be two simple nonnegative functions on $(\Omega, \mathcal{F}, \mu)$. Then
(i) (Linearity) For $\alpha \ge 0$, $\beta \ge 0$, $\int (\alpha f + \beta g)\,d\mu = \alpha \int f\,d\mu + \beta \int g\,d\mu$.
(ii) (Monotonicity) If $f \ge g$ a.e. $(\mu)$, i.e., $\mu(\{\omega : \omega \in \Omega,\ f(\omega) < g(\omega)\}) = 0$, then $\int f\,d\mu \ge \int g\,d\mu$.
(iii) If $f = g$ a.e. $(\mu)$, i.e., if $\mu(\{\omega : \omega \in \Omega,\ f(\omega) \ne g(\omega)\}) = 0$, then $\int f\,d\mu = \int g\,d\mu$.

Definition 2.3.3: (The integral of a nonnegative measurable function). Let $f : \Omega \to \bar{\mathbb{R}}_+$ be a nonnegative measurable function on $(\Omega, \mathcal{F}, \mu)$. The integral of $f$ with respect to $\mu$, also denoted by $\int f\,d\mu$, is defined as

$$\int f\,d\mu = \lim_{n\to\infty} \int f_n\,d\mu, \tag{3.3}$$

where $\{f_n\}_{n\ge1}$ is any sequence of nonnegative simple functions such that $f_n(\omega) \uparrow f(\omega)$ for all $\omega$.

Note that by Proposition 2.3.1 (ii), the sequence $\{\int f_n\,d\mu\}_{n\ge1}$ is nondecreasing, and hence the right side of (3.3) is well defined. That the right side of (3.3) is the same for all such approximating sequences of functions needs to be established and is the content of the following proposition. The proof of this proposition exploits in a crucial way the m.c.f.b. and the finite additivity of the set function $\mu$ (or, equivalently, the countable additivity of $\mu$).

Proposition 2.3.2: Let $\{f_n\}_{n\ge1}$ and $\{g_n\}_{n\ge1}$ be two sequences of simple nonnegative measurable functions on $(\Omega, \mathcal{F}, \mu)$ to $\bar{\mathbb{R}}_+$ such that as $n \to \infty$, $f_n(\omega) \uparrow f(\omega)$ and $g_n(\omega) \uparrow f(\omega)$ for all $\omega \in \Omega$. Then

$$\lim_{n\to\infty} \int f_n\,d\mu = \lim_{n\to\infty} \int g_n\,d\mu. \tag{3.4}$$

Proof: Fix $N \in \mathbb{N}$ and $0 < \rho < 1$. It will now be shown that

$$\lim_{n\to\infty} \int f_n\,d\mu \ge \rho \int g_N\,d\mu. \tag{3.5}$$

Suppose that $g_N$ has the representation $g_N \equiv \sum_{i=1}^{k} d_i I_{B_i}$. Let $D_n = \{\omega \in \Omega : f_n(\omega) \ge \rho g_N(\omega)\}$, $n \ge 1$. Since $f_n(\omega) \uparrow f(\omega)$ for all $\omega$, $D_n \uparrow D \equiv \{\omega : f(\omega) \ge \rho g_N(\omega)\}$ (Problem 2.18 (b)). Also since $g_N(\omega) \le f(\omega)$ and $0 < \rho < 1$, $D = \Omega$. Now writing $f_n = f_n I_{D_n} + f_n I_{D_n^c}$, it follows from Proposition 2.3.1 that

$$\int f_n\,d\mu \ge \int f_n I_{D_n}\,d\mu \ge \rho \int g_N I_{D_n}\,d\mu = \rho \sum_{i=1}^{k} d_i \mu(B_i \cap D_n). \tag{3.6}$$

By the m.c.f.b. property, for each $1 \le i \le k$, $\mu(B_i \cap D_n) \uparrow \mu(B_i \cap \Omega) = \mu(B_i)$ as $n \to \infty$. Since the sequence $\{\int f_n\,d\mu\}_{n\ge1}$ is nondecreasing, taking limits in (3.6) yields (3.5). Next, letting $\rho \uparrow 1$ yields $\lim_{n\to\infty} \int f_n\,d\mu \ge \int g_N\,d\mu$ for each $N \in \mathbb{N}$ and hence,

$$\lim_{n\to\infty} \int f_n\,d\mu \ge \lim_{n\to\infty} \int g_n\,d\mu.$$

By symmetry, (3.4) follows and hence, the proof is complete. □

Remark 2.3.1: It is easy to verify that Proposition 2.3.2 remains valid if $\{f_n\}_{n\ge1}$ and $\{g_n\}_{n\ge1}$ increase to $f$ a.e. ($\mu$).

Given a nonnegative measurable function $f$, one can always construct a nondecreasing sequence $\{f_n\}_{n\ge1}$ of nonnegative simple functions such that $f_n(\omega) \uparrow f(\omega)$ for all $\omega \in \Omega$ in the following manner. Let $\{\delta_n\}_{n\ge1}$ be a sequence of positive real numbers and let $\{N_n\}_{n\ge1}$ be a sequence of positive integers such that as $n \to \infty$, $\delta_n \downarrow 0$, $N_n \uparrow \infty$ and $N_n\delta_n \uparrow \infty$. Further, suppose that the sequence $P_n \equiv \{j\delta_n : j = 0, 1, 2, \ldots, N_n\}$ is nested, i.e., $P_n \subset P_{n+1}$ for each $n \ge 1$. Now set
$$f_n(\omega) = \begin{cases} j\delta_n & \text{if } j\delta_n \le f(\omega) < (j+1)\delta_n,\ j = 0, 1, 2, \ldots, (N_n - 1), \\ N_n\delta_n & \text{if } f(\omega) \ge N_n\delta_n. \end{cases} \tag{3.7}$$
A specific choice of $\delta_n$ and $N_n$ is given by $\delta_n = 2^{-n}$, $N_n = n2^n$. Thus, with the above choice of $\{f_n\}_{n\ge1}$ in the definition of the Lebesgue integral $\int f\,d\mu$ in (3.3), the range of $f$ is subdivided into intervals of decreasing lengths. This is in contrast to the definition of the Riemann integral of $f$ over a bounded interval, which is defined via subdividing the domain of $f$ into finer subintervals.

Remark 2.3.2: In some cases it may be more appropriate to choose the approximating sequence $\{f_n\}_{n\ge1}$ in a manner different from (3.7). For example, let $\Omega = \{\omega_i : i \ge 1\}$ be a countable set, $\mathcal{F} = \mathcal{P}(\Omega)$, the power set of $\Omega$, and let $\mu$ be a measure on $(\Omega, \mathcal{F})$. Then any function $f : \Omega \to \mathbb{R}_+ \equiv [0, \infty)$ is measurable and the integral $\int f\,d\mu$ coincides with the sum $\sum_{i=1}^{\infty} f(\omega_i)\mu(\{\omega_i\})$, as can be seen by choosing the approximating sequence $\{f_n\}_{n\ge1}$ as
$$f_n(\omega_i) = \begin{cases} f(\omega_i) & \text{for } i = 1, 2, \ldots, n, \\ 0 & \text{for } i > n. \end{cases}$$

Remark 2.3.3: The integral of a nonnegative measurable function can be alternatively defined as
$$\int f\,d\mu = \sup\Big\{ \int g\,d\mu : g \text{ nonnegative and simple},\ g \le f \Big\}.$$
The equivalence of this to (3.3) is seen as follows. Clearly the right side above, say, $M$ is greater than or equal to $\int f\,d\mu$ as in (3.3). Conversely, there exists a sequence $\{g_n\}_{n\ge1}$ of simple nonnegative functions with $g_n \le f$ for all $n \ge 1$ such that $\lim_n \int g_n\,d\mu$ equals the supremum $M$ defined above. Now set $h_n = \max\{g_j : 1 \le j \le n\}$, $n \ge 1$. It can be verified that for each $n \ge 1$, $h_n$ is nonnegative, simple, and satisfies $h_n \uparrow f$, and also that $\int h_n\,d\mu$ converges to $M$ (Problem 2.19 (b)).

Corollary 2.3.3: Let $f$ and $g$ be two nonnegative measurable functions on $(\Omega, \mathcal{F}, \mu)$. Then, the conclusions of Proposition 2.3.1 remain valid for such $f$ and $g$.

Proof: This follows from Proposition 2.3.1 for nonnegative simple functions and Definition 2.3.3. □

The definition of the integral $\int f\,d\mu$ of a nonnegative measurable function $f$ in (3.3) makes it possible to interchange limits and integration in a fairly routine manner. In particular, the following key result is a direct consequence of the definition.

Theorem 2.3.4: (The monotone convergence theorem or MCT). Let $\{f_n\}_{n\ge1}$ and $f$ be nonnegative measurable functions on $(\Omega, \mathcal{F}, \mu)$ such that $f_n \uparrow f$ a.e. ($\mu$). Then
$$\int f\,d\mu = \lim_{n\to\infty} \int f_n\,d\mu. \tag{3.8}$$
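The dyadic construction (3.7) and the limit in (3.8) can be checked by hand in a simple case. The following is a minimal sketch, assuming the illustrative choice $f(x) = x$ on $[0, 1]$ with Lebesgue measure (this example and the helper names `dyadic_approx` and `integral_of_approx` are not from the text); it uses $\delta_n = 2^{-n}$ and $N_n = n2^n$ and computes $\int f_n\,dm$ exactly.

```python
from fractions import Fraction

def dyadic_approx(value, n):
    """n-th simple approximation (3.7) with delta_n = 2**-n and N_n = n * 2**n,
    evaluated at a point where f takes the nonnegative value `value`."""
    delta = Fraction(1, 2**n)
    cap = n  # N_n * delta_n = n
    if value >= cap:
        return Fraction(cap)
    # j * delta_n with j = floor(value / delta_n)
    return (value // delta) * delta

def integral_of_approx(n):
    """Integral of f_n over [0, 1] for the assumed example f(x) = x.
    f_n is constant on each cell [j*delta, (j+1)*delta), which has measure delta;
    the single point x = 1 has measure zero and is ignored."""
    delta = Fraction(1, 2**n)
    return sum(dyadic_approx(j * delta, n) * delta for j in range(2**n))

# Integrals of f_1, ..., f_7; they increase towards the integral of f, which is 1/2.
integrals = [integral_of_approx(n) for n in range(1, 8)]
for n, value in enumerate(integrals, start=1):
    print(n, value)
```

In this example the integrals equal $(1 - 2^{-n})/2$, so they increase strictly to $\int f\,dm = 1/2$, as the MCT predicts.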

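Remark 2.3.4: The important difference between (3.4) and (3.8) is that in (3.8), the $f_n$'s need not be simple.

Proof: It is similar to the proof of Proposition 2.3.2. Let $\{g_n\}_{n\ge1}$ be a sequence of nonnegative simple functions on $(\Omega, \mathcal{F}, \mu)$ such that $g_n(\omega) \uparrow f(\omega)$ for all $\omega$. By hypothesis, there exists a set $A \in \mathcal{F}$ such that $\mu(A^c) = 0$ and for $\omega$ in $A$, $f_n(\omega) \uparrow f(\omega)$. Fix $k \in \mathbb{N}$ and $0 < \rho < 1$. Let $D_n = \{\omega : \omega \in A,\ f_n(\omega) \ge \rho g_k(\omega)\}$, $n \ge 1$. Then, $D_n \uparrow D \equiv \{\omega : \omega \in A,\ f(\omega) \ge \rho g_k(\omega)\}$. Since $g_k(\omega) \le f(\omega)$ for all $\omega$, it follows that $D = A$. Now, by Corollary 2.3.3,
$$\int f_n\,d\mu \ge \int f_n I_{D_n}\,d\mu \ge \rho \int g_k I_{D_n}\,d\mu \quad \text{for all } n \ge 1.$$
By m.c.f.b.,
$$\int g_k I_{D_n}\,d\mu \uparrow \int g_k I_A\,d\mu = \int g_k\,d\mu \quad \text{as } n \to \infty,$$
yielding
$$\liminf_{n\to\infty} \int f_n\,d\mu \ge \rho \int g_k\,d\mu$$
for all $0 < \rho < 1$ and all $k \in \mathbb{N}$. Letting $\rho \uparrow 1$ first and then $k \uparrow \infty$, from (3.3) one gets
$$\liminf_{n\to\infty} \int f_n\,d\mu \ge \int f\,d\mu.$$
By monotonicity (Corollary 2.3.3),
$$\int f_n\,d\mu \le \int f\,d\mu \quad \text{for all } n \ge 1,$$
and so the proof is complete. □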

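Corollary 2.3.5: Let $\{h_n\}_{n\ge1}$ be a sequence of nonnegative measurable functions on a measure space $(\Omega, \mathcal{F}, \mu)$. Then
$$\int \Big( \sum_{n=1}^{\infty} h_n \Big) d\mu = \sum_{n=1}^{\infty} \int h_n\,d\mu.$$

Proof: Let $f_n = \sum_{i=1}^{n} h_i$, $n \ge 1$, and let $f = \sum_{i=1}^{\infty} h_i$. Then, $0 \le f_n \uparrow f$. By the MCT,
$$\int f_n\,d\mu \uparrow \int f\,d\mu.$$
But by Corollary 2.3.3,
$$\int f_n\,d\mu = \sum_{i=1}^{n} \int h_i\,d\mu.$$
Hence, the result follows. □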

Corollary 2.3.6: Let $f$ be a nonnegative measurable function on a measure space $(\Omega, \mathcal{F}, \mu)$. For $A \in \mathcal{F}$, let $\nu(A) \equiv \int f I_A\,d\mu$. Then, $\nu$ is a measure on $(\Omega, \mathcal{F})$.

Proof: Let $\{A_n\}_{n\ge1}$ be a sequence of disjoint sets in $\mathcal{F}$. Let $h_n = f I_{A_n}$ for $n \ge 1$. Then by Corollary 2.3.5,
$$\nu\Big( \bigcup_{n\ge1} A_n \Big) = \int f I_{[\bigcup_{n\ge1} A_n]}\,d\mu = \int f \cdot \sum_{n=1}^{\infty} I_{A_n}\,d\mu = \int \sum_{n=1}^{\infty} h_n\,d\mu = \sum_{n=1}^{\infty} \int h_n\,d\mu = \sum_{n=1}^{\infty} \nu(A_n).$$
□

Remark 2.3.5: Notice that $\mu(A) = 0 \Rightarrow \nu(A) = 0$. In this case $\nu$ is said to be dominated by or absolutely continuous with respect to $\mu$. The Radon-Nikodym theorem (see Chapter 4) provides a converse to this. That is, if $\nu$ and $\mu$ are two measures on a measurable space $(\Omega, \mathcal{F})$ such that $\nu$ is dominated by $\mu$ and $\mu$ is $\sigma$-finite, then there exists a nonnegative measurable function $f$ such that $\nu(A) = \int f I_A\,d\mu$ for all $A$ in $\mathcal{F}$. This $f$ is called a Radon-Nikodym derivative (or a density) of $\nu$ with respect to $\mu$ and is denoted by $\frac{d\nu}{d\mu}$.

Theorem 2.3.7: (Fatou's lemma). Let $\{f_n\}_{n\ge1}$ be a sequence of nonnegative measurable functions on $(\Omega, \mathcal{F}, \mu)$. Then
$$\liminf_{n\to\infty} \int f_n\,d\mu \ge \int \liminf_{n\to\infty} f_n\,d\mu. \tag{3.9}$$

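Proof: Let $g_n(\omega) \equiv \inf\{f_j(\omega) : j \ge n\}$. Then $\{g_n\}_{n\ge1}$ is a sequence of nonnegative, nondecreasing measurable functions on $(\Omega, \mathcal{F}, \mu)$ such that $g_n(\omega) \uparrow g(\omega) \equiv \liminf_{n\to\infty} f_n(\omega)$. By the MCT,
$$\int g_n\,d\mu \uparrow \int g\,d\mu.$$
But by monotonicity
$$\int f_n\,d\mu \ge \int g_n\,d\mu \quad \text{for each } n \ge 1,$$
and hence, (3.9) follows. □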
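A small exact computation shows how the two sides of (3.9) can be compared in practice. The sketch below assumes an illustrative sequence that is not taken from the text: on $[0, 1]$ with Lebesgue measure, $f_n$ is the indicator of $[0, 1/2]$ for even $n$ and of $(1/2, 1]$ for odd $n$. The names `CELL_MEASURE`, `f`, and `integral` are hypothetical helpers.

```python
from fractions import Fraction

# Step functions on [0, 1] that are constant on the two cells [0, 1/2] and (1/2, 1],
# stored as (value on first cell, value on second cell).
CELL_MEASURE = (Fraction(1, 2), Fraction(1, 2))

def f(n):
    """Assumed sequence: indicator of [0, 1/2] for even n, of (1/2, 1] for odd n."""
    return (1, 0) if n % 2 == 0 else (0, 1)

def integral(step):
    """Lebesgue integral of a two-cell step function."""
    return sum(v * m for v, m in zip(step, CELL_MEASURE))

# The sequence has period 2, so the pointwise liminf equals the minimum over one period.
liminf_f = tuple(min(f(0)[c], f(1)[c]) for c in range(2))

lhs = integral(liminf_f)                     # integral of liminf f_n
rhs = min(integral(f(0)), integral(f(1)))    # liminf of the integrals
print(lhs, rhs)
```

Here the integral of the pointwise $\liminf$ is $0$ while every $\int f_n\,dm$ equals $1/2$, so the two sides of (3.9) differ in this assumed example.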

Remark 2.3.6: In (3.9), the inequality can be strict. For example, take $f_n = I_{[n,\infty)}$, $n \ge 1$, on the measure space $(\mathbb{R}, \mathcal{B}(\mathbb{R}), m)$ where $m$ is the Lebesgue measure. For another example, consider $f_n = nI_{[0,\frac{1}{n}]}$, $n \ge 1$, on the finite measure space $([0, 1], \mathcal{B}([0, 1]), m)$.

Definition 2.3.4: (The integral of a measurable function). Let $f$ be a real valued measurable function on a measure space $(\Omega, \mathcal{F}, \mu)$. Let $f^+ = f I_{\{f \ge 0\}}$ and $f^- = -f I_{\{f < 0\}}$ [...] $\mu(\{|f| > K\}) = 0$ for some $K \in (0, \infty)$.

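The following is an extension of Proposition 2.3.1 to integrable functions.

Proposition 2.3.8: Let $f, g \in L^1(\Omega, \mathcal{F}, \mu)$. Then
(i) $\int (\alpha f + \beta g)\,d\mu = \alpha \int f\,d\mu + \beta \int g\,d\mu$ for any $\alpha, \beta \in \mathbb{R}$.
(ii) $f \ge g$ a.e. ($\mu$) $\Rightarrow \int f\,d\mu \ge \int g\,d\mu$.
(iii) $f = g$ a.e. ($\mu$) $\Rightarrow \int f\,d\mu = \int g\,d\mu$.

Proof: It is easy to verify (Problem 2.32) that if $h = h_1 - h_2$ where $h_1$ and $h_2$ are nonnegative functions in $L^1(\Omega, \mathcal{F}, \mu)$, then $h$ is also in $L^1(\Omega, \mathcal{F}, \mu)$ and
$$\int h\,d\mu = \int h_1\,d\mu - \int h_2\,d\mu. \tag{3.11}$$
Note that $h \equiv \alpha f + \beta g$ can be written as
$$\begin{aligned} h &= (\alpha^+ - \alpha^-)(f^+ - f^-) + (\beta^+ - \beta^-)(g^+ - g^-) \\ &= (\alpha^+ f^+ + \alpha^- f^- + \beta^+ g^+ + \beta^- g^-) - (\alpha^+ f^- + \alpha^- f^+ + \beta^+ g^- + \beta^- g^+) \\ &= h_1 - h_2, \quad \text{say.} \end{aligned}$$
Since $f, g \in L^1(\Omega, \mathcal{F}, \mu)$, it follows that $h_1$ and $h_2 \in L^1(\Omega, \mathcal{F}, \mu)$. Further, they are nonnegative and by (3.11), $h \in L^1(\Omega, \mathcal{F}, \mu)$ and $\int h\,d\mu = \int h_1\,d\mu - \int h_2\,d\mu$. Now apply Proposition 2.3.1 to each of the terms on the right side and regroup the terms to get
$$\int h\,d\mu = \alpha \int f\,d\mu + \beta \int g\,d\mu.$$
Proofs of (ii) and (iii) are left as an exercise. □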

Remark 2.3.8: By Proposition 2.3.8, if $f$ and $g \in L^1(\Omega, \mathcal{F}, \mu)$, then so does $\alpha f + \beta g$ for all $\alpha, \beta \in \mathbb{R}$. Thus, $L^1(\Omega, \mathcal{F}, \mu)$ is a vector space over $\mathbb{R}$. Further, if one sets
$$\|f\|_1 \equiv \int |f|\,d\mu,$$
and identifies a function $f$ with its equivalence class under the relation $f \sim g$ iff $f = g$ a.e. ($\mu$), then $\|\cdot\|_1$ defines a norm on $L^1(\Omega, \mathcal{F}, \mu)$ and makes it a normed linear space (cf. Chapter 3). A similar remark also holds for $L^p(\Omega, \mathcal{F}, \mu)$ for $1 < p \le \infty$.

Next note that by Proposition 2.3.8, if $f = 0$ a.e. ($\mu$), then $\int f\,d\mu = 0$. However, the converse is not true. But if $f$ is nonnegative a.e. ($\mu$), then the converse is true as shown below.

Proposition 2.3.9: Let $f$ be a measurable function on $(\Omega, \mathcal{F}, \mu)$ and let $f$ be nonnegative a.e. ($\mu$). Then $\int f\,d\mu = 0$ iff $f = 0$ a.e. ($\mu$).

Proof: It is enough to prove the "only if" part. Let $D = \{\omega : f(\omega) > 0\}$ and $D_n = \{\omega : f(\omega) > \frac{1}{n}\}$, $n \ge 1$. Then $D = \bigcup_{n\ge1} D_n$. Since $f \ge f I_{D_n}$ a.e. ($\mu$),
$$0 = \int f\,d\mu \ge \int f I_{D_n}\,d\mu \ge \frac{1}{n}\mu(D_n) \ \Rightarrow\ \mu(D_n) = 0 \quad \text{for each } n \ge 1.$$
Also $D_n \uparrow D$ and so by m.c.f.b.,
$$\mu(D) = \lim_{n\to\infty} \mu(D_n) = 0.$$
Hence, Proposition 2.3.9 follows. □

A dual to the above proposition is the next one.

Proposition 2.3.10: Let $f \in L^1(\Omega, \mathcal{F}, \mu)$. Then, $|f| < \infty$ a.e. ($\mu$).

Proof: Let $C_n = \{\omega : |f(\omega)| > n\}$, $n \ge 1$ and let $C = \{\omega : |f(\omega)| = \infty\}$. Then $C_n \downarrow C$ and
$$\int |f|\,d\mu \ge \int |f| I_{C_n}\,d\mu \ge n\mu(C_n) \ \Rightarrow\ \mu(C_n) \le \frac{\int |f|\,d\mu}{n}.$$
Since $\int |f|\,d\mu < \infty$, $\lim_{n\to\infty} \mu(C_n) = 0$. Hence, by m.c.f.a., $\mu(C) = \lim_{n\to\infty} \mu(C_n) = 0$. □

The next result is a useful convergence theorem for integrals.

Theorem 2.3.11: (The extended dominated convergence theorem or EDCT). Let $(\Omega, \mathcal{F}, \mu)$ be a measure space and let $f_n, g_n : \Omega \to \mathbb{R}$ be measurable functions such that $|f_n| \le g_n$ a.e. ($\mu$) for all $n \ge 1$. Suppose that
(i) $g_n \to g$ a.e. ($\mu$) and $f_n \to f$ a.e. ($\mu$);
(ii) $g_n, g \in L^1(\Omega, \mathcal{F}, \mu)$ and $\int |g_n|\,d\mu \to \int |g|\,d\mu$ as $n \to \infty$.
Then, $f \in L^1(\Omega, \mathcal{F}, \mu)$,
$$\lim_{n\to\infty} \int f_n\,d\mu = \int f\,d\mu \quad \text{and} \quad \lim_{n\to\infty} \int |f_n - f|\,d\mu = 0. \tag{3.12}$$

Two important special cases of Theorem 2.3.11 will be stated next. When $g_n = g$ for all $n \ge 1$, one has the standard version of the dominated convergence theorem.

Corollary 2.3.12: (Lebesgue's dominated convergence theorem, or DCT). If $|f_n| \le g$ a.e. ($\mu$) for all $n \ge 1$, $\int g\,d\mu < \infty$ and $f_n \to f$ a.e. ($\mu$), then $f \in L^1(\Omega, \mathcal{F}, \mu)$,
$$\lim_{n\to\infty} \int f_n\,d\mu = \int f\,d\mu \quad \text{and} \quad \lim_{n\to\infty} \int |f_n - f|\,d\mu = 0. \tag{3.13}$$

Corollary 2.3.13: (The bounded convergence theorem, or BCT). Let $\mu(\Omega) < \infty$. If there exists a $0 < k < \infty$ such that $|f_n| \le k$ a.e. ($\mu$) and $f_n \to f$ a.e. ($\mu$), then
$$\lim_{n\to\infty} \int f_n\,d\mu = \int f\,d\mu \quad \text{and} \quad \lim_{n\to\infty} \int |f_n - f|\,d\mu = 0. \tag{3.14}$$

Proof: Take $g(\omega) \equiv k$ for all $\omega \in \Omega$ in the previous corollary. □

Proof of Theorem 2.3.11: By Fatou's lemma,
$$\int |f|\,d\mu \le \liminf_{n\to\infty} \int |f_n|\,d\mu \le \liminf_{n\to\infty} \int |g_n|\,d\mu = \int |g|\,d\mu < \infty.$$

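Hence, $f$ is integrable. For proving the second part, let $h_n = f_n + g_n$ and $\gamma_n = g_n - f_n$, $n \ge 1$. Then, $\{h_n\}_{n\ge1}$ and $\{\gamma_n\}_{n\ge1}$ are sequences of nonnegative integrable functions. By Fatou's lemma and (ii),
$$\begin{aligned} \int (f + g)\,d\mu &= \int \liminf_{n\to\infty} h_n\,d\mu \\ &\le \liminf_{n\to\infty} \int h_n\,d\mu \\ &= \liminf_{n\to\infty} \Big[ \int g_n\,d\mu + \int f_n\,d\mu \Big] \\ &= \int g\,d\mu + \liminf_{n\to\infty} \int f_n\,d\mu. \end{aligned}$$
Similarly,
$$\int (g - f)\,d\mu \le \int g\,d\mu - \limsup_{n\to\infty} \int f_n\,d\mu.$$
By Proposition 2.3.8, $\int (g \pm f)\,d\mu = \int g\,d\mu \pm \int f\,d\mu$. Hence,
$$\int f\,d\mu \le \liminf_{n\to\infty} \int f_n\,d\mu \quad \text{and} \quad \limsup_{n\to\infty} \int f_n\,d\mu \le \int f\,d\mu,$$
yielding that $\lim_{n\to\infty} \int f_n\,d\mu = \int f\,d\mu$. For the last part, apply the above argument to $f_n$ and $g_n$ replaced by $\tilde f_n \equiv |f - f_n|$ and $\tilde g_n \equiv g_n + |f|$, respectively. □

Theorem 2.3.14: (An approximation theorem). Let $\mu_F$ be a Lebesgue-Stieltjes measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Let $f \in L^p(\mathbb{R}, \mathcal{B}(\mathbb{R}), \mu_F)$, $0 < p < \infty$. Then, for any $\delta > 0$, there exist a step function $h$ and a continuous function $g$ with compact support (i.e., $g$ vanishes outside a bounded interval) such that
$$\int |f - h|^p\,d\mu < \delta, \tag{3.15}$$
$$\int |f - g|^p\,d\mu < \delta, \tag{3.16}$$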

where a step function $h$ is a function of the form $h = \sum_{i=1}^{k} c_i I_{A_i}$ with $k < \infty$, $c_1, c_2, \ldots, c_k \in \mathbb{R}$ and $A_1, A_2, \ldots, A_k$ being bounded disjoint intervals.

Proof: Let $f_n(\cdot) = f(\cdot)I_{B_n}(\cdot)$ where $B_n = \{x : |x| \le n,\ |f(x)| \le n\}$. By the DCT, for every $\epsilon > 0$, there exists an $N_\epsilon$ such that for all $n \ge N_\epsilon$,
$$\int |f - f_n|^p\,d\mu_F < \epsilon. \tag{3.17}$$
Since $|f_{N_\epsilon}(\cdot)| \le N_\epsilon$ on $[-N_\epsilon, N_\epsilon]$, for any $\eta > 0$, there exists a simple function $\tilde f$ such that (cf. (3.7))
$$\sup\{|f_{N_\epsilon}(x) - \tilde f(x)| : x \in \mathbb{R}\} < \eta. \tag{3.18}$$
Next, using Problem 1.32 (b), one can show that for any $\eta > 0$, there exists a step function $h$ (depending on $\eta$) such that
$$\int |\tilde f - h|^p\,d\mu_F < \eta. \tag{3.19}$$
Since $f - h = f - f_{N_\epsilon} + f_{N_\epsilon} - \tilde f + \tilde f - h$,
$$|f - h|^p \le C_p \big( |f - f_{N_\epsilon}|^p + |f_{N_\epsilon} - \tilde f|^p + |\tilde f - h|^p \big),$$
where $C_p$ is a constant depending only on $p$. This in turn yields, from (3.17)–(3.19),
$$\int |f - h|^p\,d\mu_F \le C_p \big( \epsilon + (\mu_F\{x : |x| \le N_\epsilon\})\eta^p + \eta \big). \tag{3.20}$$
Given $\delta > 0$, choose $\epsilon > 0$ first and then $\eta > 0$ such that the right side of (3.20) above is less than $\delta$.

Next, given any step function $h$ and $\eta > 0$ there exists a continuous function $g$ with compact support such that $\mu_F\{x : h(x) \ne g(x)\} < \eta$ (cf. Problem 1.32 (g)). Now (3.16) follows from (3.15). □

Remark 2.3.9: The approximation (3.16) remains valid if $g$ is restricted to the class of all infinitely differentiable functions with compact support. Further it remains valid for $0 < p < \infty$ for Lebesgue-Stieltjes measures on any Euclidean space.

Remark 2.3.10: The above approximation theorem fails for $p = \infty$. For example, consider the function $f(x) \equiv 1$ in $L^\infty(m)$.

2.4 Riemann and Lebesgue integrals

Let $f$ be a real valued bounded function on a bounded interval $[a, b]$. Recall the definition of the Riemann integral. Let $P = \{x_0, x_1, \ldots, x_n\}$ be a finite partition of $[a, b]$, i.e., $x_0 = a < x_1 < x_2 < \cdots < x_{n-1} < x_n = b$, and let $\Delta = \Delta(P) \equiv \max\{(x_{i+1} - x_i) : 0 \le i \le n - 1\}$ be the diameter of $P$. Let $M_i = \sup\{f(x) : x_i \le x \le x_{i+1}\}$ and $m_i = \inf\{f(x) : x_i \le x \le x_{i+1}\}$, $i = 0, 1, \ldots, n - 1$.

Definition 2.4.1: The upper- and lower-Riemann sums of $f$ w.r.t. the partition $P$ are, respectively, defined as
$$U(f, P) \equiv \sum_{i=0}^{n-1} M_i \cdot (x_{i+1} - x_i) \tag{4.1}$$
and
$$L(f, P) \equiv \sum_{i=0}^{n-1} m_i \cdot (x_{i+1} - x_i). \tag{4.2}$$

It is easy to verify that if $Q = \{y_0, y_1, \ldots, y_k\}$ is another partition satisfying $P \subset Q$, then $U(f, P) \ge U(f, Q) \ge L(f, Q) \ge L(f, P)$. Let $\mathcal{P}$ denote the collection of all finite partitions of $[a, b]$.

Definition 2.4.2: The upper-Riemann integral $\overline{\int} f$ is defined as
$$\overline{\int} f = \inf_{P \in \mathcal{P}} U(f, P) \tag{4.3}$$
and the lower-Riemann integral $\underline{\int} f$, by
$$\underline{\int} f = \sup_{P \in \mathcal{P}} L(f, P). \tag{4.4}$$

It can be shown (cf. Problem 2.23) that if $\{P_n\}_{n\ge1}$ is any sequence of partitions such that $\Delta(P_n) \to 0$ as $n \to \infty$ and $P_n \subset P_{n+1}$ for each $n \ge 1$, then $U(f, P_n) \downarrow \overline{\int} f$ and $L(f, P_n) \uparrow \underline{\int} f$.

Definition 2.4.3: $f$ is said to be Riemann integrable if
$$\overline{\int} f = \underline{\int} f. \tag{4.5}$$
The common value is denoted by $\int_{[a,b]} f$.
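The sums (4.1) and (4.2) and their behavior under refinement can be computed exactly in a simple case. The following is a minimal sketch, assuming the illustrative integrand $f(x) = x^2$ on $[0, 1]$ and nested dyadic partitions; neither is from the text. Because this $f$ is nondecreasing, the supremum and infimum on each subinterval sit at the right and left endpoints, which is the assumption that makes the helper `upper_lower_sums` valid.

```python
from fractions import Fraction

def upper_lower_sums(f, partition):
    """U(f, P) and L(f, P) as in (4.1)-(4.2), ASSUMING f is nondecreasing, so that
    the sup on [x_i, x_{i+1}] is f(x_{i+1}) and the inf is f(x_i)."""
    cells = list(zip(partition, partition[1:]))
    upper = sum(f(b) * (b - a) for a, b in cells)
    lower = sum(f(a) * (b - a) for a, b in cells)
    return upper, lower

def dyadic_partition(n):
    """Nested partitions P_n = {j / 2**n : j = 0, ..., 2**n} of [0, 1]."""
    return [Fraction(j, 2**n) for j in range(2**n + 1)]

def square(x):
    return x * x

# (U(f, P_n), L(f, P_n)) for n = 1, ..., 6.
sums = [upper_lower_sums(square, dyadic_partition(n)) for n in range(1, 7)]
for n, (upper, lower) in enumerate(sums, start=1):
    print(n, upper, lower, upper - lower)
```

In this example the upper sums decrease, the lower sums increase, and their gap is $2^{-n}$, so both squeeze the common value $1/3$.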

Fix a sequence $\{P_n\}_{n\ge1}$ of partitions such that $P_n \subset P_{n+1}$ for all $n \ge 1$ and $\Delta(P_n) \to 0$ as $n \to \infty$. Let $P_n = \{x_{n0} = a < x_{n1} < x_{n2} < \cdots < x_{nk_n} = b\}$. For $i = 0, 1, \ldots, k_n - 1$, let
$$\begin{aligned} \phi_n(x) &\equiv \sup\{f(t) : x_{ni} \le t \le x_{n,i+1}\}, \quad x \in [x_{ni}, x_{n,i+1}), \\ \psi_n(x) &\equiv \inf\{f(t) : x_{ni} \le t \le x_{n,i+1}\}, \quad x \in [x_{ni}, x_{n,i+1}), \end{aligned}$$
and let $\phi_n(b) = \psi_n(b) = 0$. Then, $\phi_n$ and $\psi_n$ are step functions on $[a, b]$ and hence, are Borel measurable. Further, since $f$ is bounded, so are $\phi_n$ and $\psi_n$ and hence are integrable on $[a, b]$ w.r.t. the Lebesgue measure $m$. The Lebesgue integrals of $\phi_n$ and $\psi_n$ are given by $\int_{[a,b]} \phi_n\,dm = U(f, P_n)$ and $\int_{[a,b]} \psi_n\,dm = L(f, P_n)$. It can be shown (Problem 2.24) that for all $x \notin \bigcup_{n\ge1} P_n$, as $n \to \infty$,
$$\phi_n(x) \to \phi(x) \equiv \lim_{\delta \downarrow 0} \sup\{f(y) : |y - x| < \delta\} \tag{4.6}$$
and
$$\psi_n(x) \to \psi(x) \equiv \lim_{\delta \downarrow 0} \inf\{f(y) : |y - x| < \delta\}. \tag{4.7}$$
Thus, $\phi$ and $\psi$, being limits of Borel measurable functions (except possibly on a countable set), are Borel measurable. By the BCT (Corollary 2.3.13),
$$\overline{\int} f = \lim_{n\to\infty} \int \phi_n\,dm = \int \phi\,dm$$
and
$$\underline{\int} f = \lim_{n\to\infty} \int \psi_n\,dm = \int \psi\,dm.$$
Thus, $f$ is Riemann integrable on $[a, b]$, iff $\int \phi\,dm = \int \psi\,dm$, iff $\int (\phi - \psi)\,dm = 0$. Since $\phi(x) \ge f(x) \ge \psi(x)$ for all $x$, this holds iff $\phi = f = \psi$ a.e. ($m$). It can be shown that $f$ is continuous at $x_0$ iff $\phi(x_0) = f(x_0) = \psi(x_0)$ (Problem 2.8). Summarizing the above discussion, one gets the following theorem.

Theorem 2.4.1: Let $f$ be a bounded function on a bounded interval $[a, b]$. Then $f$ is Riemann integrable on $[a, b]$ iff $f$ is continuous a.e. ($m$) on $[a, b]$. In this case, $f$ is Lebesgue integrable on $[a, b]$ and the Lebesgue integral $\int_{[a,b]} f\,dm$ equals the Riemann integral $\int_{[a,b]} f$, i.e., the two integrals coincide.

It should be noted that Lebesgue integrability need not imply Riemann integrability. For example, consider $f(x) \equiv I_{Q_1}(x)$ where $Q_1$ is the set of rationals in $[0, 1]$ (Problem 2.25).

The functions $\phi$ and $\psi$ defined in (4.6) and (4.7) above are called, respectively, the upper and the lower envelopes of the function $f$. They are semicontinuous in the sense that for each $\alpha \in \mathbb{R}$, the sets $\{x : \phi(x) < \alpha\}$ and $\{x : \psi(x) > \alpha\}$ are open (cf. Problem 2.8).

Remark 2.4.1: The key difference in the definitions of Riemann and Lebesgue integrals is that in the former the domain of $f$ is partitioned while in the latter the range of $f$ is partitioned.

2.5 More on convergence

Let $\{f_n\}_{n\ge1}$ and $f$ be measurable functions from a measure space $(\Omega, \mathcal{F}, \mu)$ to $\bar{\mathbb{R}}$, the set of extended real numbers. There are several notions of convergence of $\{f_n\}_{n\ge1}$ to $f$. The following two have been discussed earlier.

Definition 2.5.1: $\{f_n\}_{n\ge1}$ converges to $f$ pointwise if
$$\lim_{n\to\infty} f_n(\omega) = f(\omega) \quad \text{for all } \omega \text{ in } \Omega.$$

Definition 2.5.2: $\{f_n\}_{n\ge1}$ converges to $f$ almost everywhere ($\mu$), denoted by $f_n \to f$ a.e. ($\mu$), if there exists a set $B$ in $\mathcal{F}$ such that $\mu(B) = 0$ and
$$\lim_{n\to\infty} f_n(\omega) = f(\omega) \quad \text{for all } \omega \in B^c. \tag{5.1}$$

Now consider some more notions of convergence.

Definition 2.5.3: $\{f_n\}_{n\ge1}$ converges to $f$ in measure (w.r.t. $\mu$), denoted by $f_n \longrightarrow^m f$, if for each $\epsilon > 0$,
$$\lim_{n\to\infty} \mu(\{|f_n - f| > \epsilon\}) = 0. \tag{5.2}$$

to f in Lp (µ), Deﬁnition 2.5.4: Let 0 K}) = 0 .

(5.6)

(5.7)

Definition 2.5.7: $\{f_n\}_{n\ge1}$ converges to $f$ nearly uniformly ($\mu$) if for every $\epsilon > 0$, there exists a set $A \in \mathcal{F}$ such that $\mu(A) < \epsilon$ and on $A^c$, $f_n \to f$ uniformly, i.e., $\sup\{|f_n(\omega) - f(\omega)| : \omega \in A^c\} \to 0$ as $n \to \infty$.

The notion of convergence in Definition 2.5.7 is also called almost uniform convergence in some books (cf. Royden (1988)). The sequence $f_n \equiv nI_{[0,1/n]}$ on $(\Omega = [0, 1], \mathcal{B}([0, 1]), m)$ converges to $f \equiv 0$ nearly uniformly but not in $L^\infty(m)$.

When $\mu$ is a probability measure, there is another useful notion of convergence, known as convergence in distribution, that is defined in terms of the induced measures $\{\mu f_n^{-1}\}_{n\ge1}$ and $\mu f^{-1}$. This notion of convergence will be treated in detail in Chapter 9.
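For the sequence $f_n = nI_{[0,1/n]}$ on $[0, 1]$ with $f \equiv 0$, the quantities appearing in the different notions of convergence have closed forms, and it can help to tabulate them. The sketch below is only an illustration; the function names are hypothetical, and the exceptional set is assumed to be $A = [0, \delta]$.

```python
from fractions import Fraction

def measure_exceeding(n, eps):
    """m({x in [0, 1] : f_n(x) > eps}) for f_n = n * I_[0, 1/n]."""
    return Fraction(1, n) if n > eps else Fraction(0)

def l1_norm(n):
    """Integral of |f_n - 0| over [0, 1]: height n times width 1/n."""
    return n * Fraction(1, n)

def sup_off_set(n, delta):
    """sup of |f_n| over the complement of the assumed set A = [0, delta];
    it vanishes as soon as 1/n <= delta."""
    return 0 if Fraction(1, n) <= delta else n

for n in (10, 100, 1000):
    print(n, measure_exceeding(n, Fraction(1, 2)), l1_norm(n), sup_off_set(n, Fraction(1, 100)))
```

The measure of $\{f_n > \epsilon\}$ shrinks like $1/n$ and the supremum off $A$ is eventually $0$, while $\int |f_n|\,dm$ stays equal to $1$ and the supremum of $f_n$ over all of $[0, 1]$ is $n$.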

Next, the connections between some of these notions of convergence are explored.

Theorem 2.5.1: Suppose that $\mu(\Omega) < \infty$. Then, $f_n \to f$ a.e. ($\mu$) implies $f_n \longrightarrow^m f$.

The proof is left as an exercise (Problem 2.26). The hypothesis that '$\mu(\Omega) < \infty$' in Theorem 2.5.1 cannot be dispensed with as seen by taking $f_n = I_{[n,\infty)}$ on $\mathbb{R}$ with Lebesgue measure. Also, $f_n \longrightarrow^m f$ does not imply $f_n \to f$ a.e. ($\mu$) (Problem 2.46), but the following holds.

Theorem 2.5.2: Let $f_n \longrightarrow^m f$. Then, there exists a subsequence $\{n_k\}_{k\ge1}$ such that $f_{n_k} \to f$ a.e. ($\mu$).

Proof: Since $f_n \longrightarrow^m f$, for each integer $k \ge 1$, there exists an $n_k$ such that for all $n \ge n_k$
$$\mu(\{|f_n - f| > 2^{-k}\}) < 2^{-k}. \tag{5.8}$$
W.l.o.g., assume that $n_{k+1} > n_k$ for all $k \ge 1$. Let $A_k = \{|f_{n_k} - f| > 2^{-k}\}$. By Corollary 2.3.5,
$$\int \Big( \sum_{k=1}^{\infty} I_{A_k} \Big) d\mu = \sum_{k=1}^{\infty} \int I_{A_k}\,d\mu = \sum_{k=1}^{\infty} \mu(A_k),$$
which, by (5.8), is finite. Hence, by Proposition 2.3.10, $\sum_{k=1}^{\infty} I_{A_k} < \infty$ a.e. ($\mu$). Now observe that $\sum_{k=1}^{\infty} I_{A_k}(\omega) < \infty \Rightarrow |f_{n_k}(\omega) - f(\omega)| \le 2^{-k}$ for all $k$ large $\Rightarrow \lim_{k\to\infty} f_{n_k}(\omega) = f(\omega)$. Thus, $f_{n_k} \to f$ a.e. ($\mu$). □

Remark 2.5.1: From the above result it follows that the extended dominated convergence theorem (Theorem 2.3.11) remains valid if convergence a.e. of $\{f_n\}_{n\ge1}$ and of $\{g_n\}_{n\ge1}$ are replaced by convergence in measure for both (Problem 2.37).

Theorem 2.5.3: Let $\{f_n\}_{n\ge1}$, $f$ be measurable functions on a measure space $(\Omega, \mathcal{F}, \mu)$. Let $f_n \longrightarrow^{L^p} f$ for some $0 < p < \infty$. Then $f_n \longrightarrow^m f$.

Proof: For each $\epsilon > 0$, let $A_n = \{|f_n - f| \ge \epsilon\}$, $n \ge 1$. Then
$$\int |f_n - f|^p\,d\mu \ge \int_{A_n} |f_n - f|^p\,d\mu \ge \epsilon^p \mu(A_n).$$
Since $f_n \to f$ in $L^p$, $\int |f_n - f|^p\,d\mu \to 0$ and hence, $\mu(A_n) \to 0$. □

It turns out that $f_n \longrightarrow^m f$ need not imply $f_n \longrightarrow^{L^p} f$, even if $\{f_n : n \ge 1\} \cup \{f\}$ is contained in $L^p(\Omega, \mathcal{F}, \mu)$. For example, let $f_n = nI_{[0,\frac{1}{n}]}$ and $f \equiv 0$ on the Lebesgue space $([0, 1], \mathcal{B}([0, 1]), m)$, where $m$ is the Lebesgue measure. Then $f_n \longrightarrow^m f$ but $\int |f_n - f|\,dm \equiv 1$ for all $n \ge 1$. However, under some additional conditions, convergence in measure does imply convergence in $L^p$. Here are two results in this direction.

Theorem 2.5.4: (Scheffe's theorem). Let $\{f_n\}_{n\ge1}$, $f$ be a collection of nonnegative measurable functions on a measure space $(\Omega, \mathcal{F}, \mu)$. Let $f_n \to f$ a.e. ($\mu$), $\int f_n\,d\mu \to \int f\,d\mu$ and $\int f\,d\mu < \infty$. Then
$$\lim_{n\to\infty} \int |f_n - f|\,d\mu = 0.$$

Proof: Let $g_n = f - f_n$, $n \ge 1$. Since $f_n \to f$ a.e. ($\mu$), both $g_n^+$ and $g_n^-$ go to zero a.e. ($\mu$). Further, $0 \le g_n^+ \le f$ and by hypothesis $\int f\,d\mu < \infty$. Thus, by the DCT, it follows that
$$\int g_n^+\,d\mu \to 0.$$
Next, note that by hypothesis, $\int g_n\,d\mu \to 0$. Thus, $\int g_n^-\,d\mu = \int g_n^+\,d\mu - \int g_n\,d\mu \to 0$ and hence, $\int |g_n|\,d\mu = \int g_n^+\,d\mu + \int g_n^-\,d\mu \to 0$. □

Corollary 2.5.5: Let $\{f_n\}_{n\ge1}$, $f$ be probability density functions on a measure space $(\Omega, \mathcal{F}, \mu)$. That is, for all $n \ge 1$, $\int f_n\,d\mu = \int f\,d\mu = 1$ and $f_n, f \ge 0$ a.e. ($\mu$). If $f_n \to f$ a.e. ($\mu$), then
$$\lim_{n\to\infty} \int |f_n - f|\,d\mu = 0.$$

Remark 2.5.2: The above theorem and the corollary remain valid if the convergence of $f_n$ to $f$ a.e. ($\mu$) is replaced by $f_n \longrightarrow^m f$.

Corollary 2.5.6: Let $\{p_{nk}\}_{k\ge1}$, $n = 1, 2, \ldots$ and $\{p_k\}_{k\ge1}$ be sequences of nonnegative numbers satisfying $\sum_{k=1}^{\infty} p_{nk} = 1 = \sum_{k=1}^{\infty} p_k$. Let $p_{nk} \to p_k$ as $n \to \infty$ for each $k \ge 1$. Then $\sum_{k=1}^{\infty} |p_{nk} - p_k| \to 0$.

Proof: Apply Corollary 2.5.5 with $\mu$ = the counting measure on $(\mathbb{N}, \mathcal{P}(\mathbb{N}))$. □

A more general result in this direction that does not require $f_n$, $f$ to be nonnegative involves the concept of uniform integrability. Let $\{f_\lambda : \lambda \in \Lambda\}$ be a collection of functions in $L^1(\Omega, \mathcal{F}, \mu)$. Then for each $\lambda \in \Lambda$, by the DCT and the integrability of $f_\lambda$,
$$a_\lambda(t) \equiv \int_{\{|f_\lambda| > t\}} |f_\lambda|\,d\mu \to 0 \quad \text{as } t \to \infty. \tag{5.9}$$
The notion of uniform integrability requires that the integrals $a_\lambda(t)$ go to zero uniformly in $\lambda \in \Lambda$ as $t \to \infty$.

Definition 2.5.8: The collection of functions $\{f_\lambda : \lambda \in \Lambda\}$ in $L^1(\Omega, \mathcal{F}, \mu)$ is uniformly integrable (or UI, in short) if
$$\sup_{\lambda \in \Lambda} a_\lambda(t) \to 0 \quad \text{as } t \to \infty. \tag{5.10}$$
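The uniformity required in (5.10) can be seen numerically by comparing two families of spikes on $[0, 1]$ with Lebesgue measure. The sketch below is an illustration under stated assumptions: the families $f_n = nI_{[0,1/n]}$ and $g_n = \sqrt{n}\,I_{[0,1/n]}$, the helper names, and the truncation of the supremum to $n \le 10{,}000$ are all choices made here, so the output only approximates $\sup_n a_n(t)$ for moderate $t$.

```python
import math

def tail_integral(height, width, t):
    """a(t) = integral of |f| over {|f| > t} for f = height * I_[0, width]."""
    return height * width if height > t else 0.0

def sup_tail(family, t, n_max=10_000):
    """Maximum over n = 1..n_max of a_n(t); the cutoff n_max is an assumption of this sketch."""
    return max(tail_integral(*family(n), t) for n in range(1, n_max + 1))

def spike(n):
    """(height, width) of f_n = n * I_[0, 1/n]."""
    return float(n), 1.0 / n

def mild(n):
    """(height, width) of g_n = sqrt(n) * I_[0, 1/n]."""
    return math.sqrt(n), 1.0 / n

for t in (5.0, 10.0, 50.0):
    print(t, sup_tail(spike, t), sup_tail(mild, t))
```

For the first family the supremum stays near $1$ for every $t$ in range, so (5.10) fails, while for the second it is at most $1/t$ and tends to $0$.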

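The following proposition summarizes some of the main properties of UI families of functions.

Proposition 2.5.7: Let $\{f_\lambda : \lambda \in \Lambda\}$ be a collection of $\mu$-integrable functions on $(\Omega, \mathcal{F}, \mu)$.
(i) If $\Lambda$ is finite, then $\{f_\lambda : \lambda \in \Lambda\}$ is UI.
(ii) If $K \equiv \sup\{\int |f_\lambda|^{1+\epsilon}\,d\mu : \lambda \in \Lambda\} < \infty$ for some $\epsilon > 0$, then $\{f_\lambda : \lambda \in \Lambda\}$ is UI.
(iii) If $|f_\lambda| \le g$ a.e. ($\mu$) and $\int g\,d\mu < \infty$, then $\{f_\lambda : \lambda \in \Lambda\}$ is UI.
(iv) If $\{f_\lambda : \lambda \in \Lambda\}$ and $\{g_\gamma : \gamma \in \Gamma\}$ are UI, then so is $\{f_\lambda + g_\gamma : \lambda \in \Lambda, \gamma \in \Gamma\}$.
(v) If $\{f_\lambda : \lambda \in \Lambda\}$ is UI and $\mu(\Omega) < \infty$, then
$$\sup_{\lambda \in \Lambda} \int |f_\lambda|\,d\mu < \infty. \tag{5.11}$$

Proof: By hypothesis, $a_\lambda(t) \equiv \int_{\{|f_\lambda| > t\}} |f_\lambda|\,d\mu \to 0$ as $t \to \infty$ for each $\lambda$. If $\Lambda$ is finite this implies that $\sup\{a_\lambda(t) : \lambda \in \Lambda\} \to 0$ as $t \to \infty$. This proves (i).

To prove (ii), note that since $1 < |f_\lambda|/t$ on the set $\{|f_\lambda| > t\}$,
$$\sup_{\lambda \in \Lambda} a_\lambda(t) \le \sup_{\lambda \in \Lambda} \int_{\{|f_\lambda| > t\}} |f_\lambda| \big( |f_\lambda|/t \big)^{\epsilon}\,d\mu \le K t^{-\epsilon} \to 0 \quad \text{as } t \to \infty.$$

For (iii), note that for each $t \in \mathbb{R}$, the function $h_t(x) \equiv x I_{(t,\infty)}(x)$, $x \in \mathbb{R}$ is nondecreasing. Hence, by the integrability of $g$,
$$\sup_{\lambda \in \Lambda} a_\lambda(t) = \sup_{\lambda \in \Lambda} \int h_t(|f_\lambda|)\,d\mu \le \int h_t(g)\,d\mu = \int_{\{g > t\}} g\,d\mu \to 0 \quad \text{as } t \to \infty.$$

To prove (iv), for $t > 0$, let $a(t) = \sup_{\lambda \in \Lambda} \int h_t(|f_\lambda|)\,d\mu$ and $b(t) = \sup_{\gamma \in \Gamma} \int h_t(|g_\gamma|)\,d\mu$. Then, for any $\lambda \in \Lambda$ and $\gamma \in \Gamma$,
$$\begin{aligned} \int_{\{|f_\lambda + g_\gamma| > t\}} |f_\lambda + g_\gamma|\,d\mu &= \int h_t\big( |f_\lambda + g_\gamma| \big)\,d\mu \\ &\le \int h_t\big( 2\max\{|f_\lambda|, |g_\gamma|\} \big)\,d\mu \\ &= \int h_t\big( 2|f_\lambda| \big) I\big( |f_\lambda| \ge |g_\gamma| \big)\,d\mu + \int h_t\big( 2|g_\gamma| \big) I\big( |f_\lambda| < |g_\gamma| \big)\,d\mu \\ &\le 2\int_{\{|f_\lambda| > t/2\}} |f_\lambda|\,d\mu + 2\int_{\{|g_\gamma| > t/2\}} |g_\gamma|\,d\mu \\ &\le 2[a(t/2) + b(t/2)]. \end{aligned}$$
By hypothesis, both $a(t)$ and $b(t) \to 0$ as $t \to \infty$, thus proving (iv).

Next consider (v). Since $\{f_\lambda : \lambda \in \Lambda\}$ is UI, there exists a $T > 0$ such that
$$\sup_{\lambda \in \Lambda} \int h_T\big( |f_\lambda| \big)\,d\mu \le 1.$$
Hence,
$$\sup_{\lambda \in \Lambda} \int |f_\lambda|\,d\mu = \sup_{\lambda \in \Lambda} \Big[ \int_{\{|f_\lambda| \le T\}} |f_\lambda|\,d\mu + \int h_T(|f_\lambda|)\,d\mu \Big] \le T\mu(\Omega) + 1 < \infty.$$
This completes the proof of the proposition. □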

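Remark 2.5.3: In the above proposition, part (ii) can be improved as follows: Let $\phi : \mathbb{R}_+ \to \mathbb{R}_+$ be nondecreasing and $\frac{\phi(x)}{x} \uparrow \infty$ as $x \uparrow \infty$. If $\sup_{\lambda \in \Lambda} \int \phi(|f_\lambda|)\,d\mu < \infty$, then $\{f_\lambda : \lambda \in \Lambda\}$ is UI (Problem 2.27). A converse to this result is true. That is, if $\{f_\lambda : \lambda \in \Lambda\}$ is UI then there exists such a function $\phi$. Some examples of such $\phi$'s are $\phi(x) = x^k$, $k > 1$, $\phi(x) = x(\log x)^\beta I(x > 1)$, $\beta > 0$, and $\phi(x) = \exp(\beta x)$, $\beta > 0$.

In part (v) of Proposition 2.5.7, (5.11) does not imply UI. For example, consider the sequence of functions $f_n = nI_{[0,\frac{1}{n}]}$, $n = 1, 2, \ldots$ on $[0, 1]$. On the other hand, (5.11) with an additional condition becomes necessary and sufficient for UI.

Proposition 2.5.8: Let $f \in L^1(\Omega, \mathcal{F}, \mu)$. Then for every $\epsilon > 0$, there exists a $\delta > 0$ such that $\mu(A) < \delta \Rightarrow \int_A |f|\,d\mu < \epsilon$.

Proof: Fix $\epsilon > 0$. By the DCT, there exists a $t > 0$ such that $\int_{\{|f| > t\}} |f|\,d\mu < \epsilon/2$. Hence, for any $A \in \mathcal{F}$ with $\mu(A) \le \delta \equiv \frac{\epsilon}{2t}$,
$$\begin{aligned} \int_A |f|\,d\mu &\le \int_{A \cap \{|f| \le t\}} |f|\,d\mu + \int_{\{|f| > t\}} |f|\,d\mu \\ &\le t\mu(A) + \int_{\{|f| > t\}} |f|\,d\mu \\ &\le \epsilon, \end{aligned}$$
proving the claim. □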

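The above proposition shows that for every $f \in L^1(\Omega, \mathcal{F}, \mu)$, the measure (cf. Corollary 2.3.6)
$$\nu_{|f|}(A) \equiv \int_A |f|\,d\mu \tag{5.12}$$
on $(\Omega, \mathcal{F})$ satisfies the condition that $\nu_{|f|}(A)$ is small if $\mu(A)$ is small, i.e., for every $\epsilon > 0$, there exists a $\delta > 0$ such that $\mu(A) < \delta \Rightarrow \nu_{|f|}(A) < \epsilon$. This property is referred to as the absolute continuity of the measure $\nu_{|f|}$ w.r.t. $\mu$.

Definition 2.5.9: Given a family $\{f_\lambda : \lambda \in \Lambda\} \subset L^1(\Omega, \mathcal{F}, \mu)$, the measures $\{\nu_{|f_\lambda|} : \lambda \in \Lambda\}$ as defined in (5.12) above are uniformly absolutely continuous w.r.t. $\mu$ (or u.a.c. ($\mu$), in short) if for every $\epsilon > 0$, there exists a $\delta > 0$ such that
$$\mu(A) < \delta \ \Rightarrow\ \sup\big\{ \nu_{|f_\lambda|}(A) : \lambda \in \Lambda \big\} < \epsilon.$$

Theorem 2.5.9: Let $\{f_\lambda : \lambda \in \Lambda\} \subset L^1(\Omega, \mathcal{F}, \mu)$ and $\mu(\Omega) < \infty$. Then, $\{f_\lambda : \lambda \in \Lambda\}$ is UI iff $\sup_{\lambda \in \Lambda} \int |f_\lambda|\,d\mu < \infty$ and $\{\nu_{|f_\lambda|}(\cdot) : \lambda \in \Lambda\}$ is u.a.c. ($\mu$).

Proof: Let $\{f_\lambda : \lambda \in \Lambda\}$ be UI. Then, since $\mu(\Omega) < \infty$, $L^1$ boundedness of $\{f_\lambda : \lambda \in \Lambda\}$ follows from Proposition 2.5.7 (v). To establish u.a.c. ($\mu$), fix $\epsilon > 0$. By UI, there exists an $N$ such that
$$\sup_{\lambda \in \Lambda} \int_{\{|f_\lambda| > N\}} |f_\lambda|\,d\mu < \epsilon/2.$$
Let $\delta = \frac{\epsilon}{2N}$ and let $A \in \mathcal{F}$ be such that $\mu(A) < \delta$. Then, as in the proof of Proposition 2.5.8 above,
$$\sup_{\lambda \in \Lambda} \int_A |f_\lambda|\,d\mu \le N\mu(A) + \epsilon/2 < \epsilon.$$
Conversely, suppose $\{f_\lambda : \lambda \in \Lambda\}$ is $L^1$ bounded and u.a.c. ($\mu$). Then, for every $\epsilon > 0$, there exists a $\delta_\epsilon > 0$ such that
$$\sup_{\lambda \in \Lambda} \int_A |f_\lambda|\,d\mu < \epsilon \quad \text{if } \mu(A) < \delta_\epsilon. \tag{5.13}$$
Also, for any nonnegative $f$ in $L^1(\Omega, \mathcal{F}, \mu)$ and $t > 0$,
$$\int f\,d\mu \ge \int_{\{f \ge t\}} f\,d\mu \ge t\mu(\{f \ge t\}),$$
which implies that
$$\mu(\{f \ge t\}) \le \frac{\int f\,d\mu}{t}.$$
(This is known as Markov's inequality; see Chapter 3.) Hence, it follows that
$$\sup_{\lambda \in \Lambda} \mu(\{|f_\lambda| \ge t\}) \le \sup_{\lambda \in \Lambda} \int |f_\lambda|\,d\mu \Big/ t. \tag{5.14}$$
Now given $\epsilon > 0$, choose $T_\epsilon$ such that $\sup_{\lambda \in \Lambda} \int |f_\lambda|\,d\mu / T_\epsilon < \delta_\epsilon$ where $\delta_\epsilon$ is as in (5.13). Then, by (5.14), it follows that
$$\sup_{\lambda \in \Lambda} \int_{\{|f_\lambda| \ge T_\epsilon\}} |f_\lambda|\,d\mu < \epsilon,$$
i.e., $\{f_\lambda : \lambda \in \Lambda\}$ is UI. □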

Theorem 2.5.10: Let $(\Omega, \mathcal{F}, \mu)$ be a measure space with $\mu(\Omega) < \infty$, and let $\{f_n : n \ge 1\} \subset L^1(\Omega, \mathcal{F}, \mu)$ be such that $f_n \to f$ a.e. ($\mu$) and $f$ is $\langle \mathcal{F}, \mathcal{B}(\mathbb{R}) \rangle$-measurable. If $\{f_n : n \ge 1\}$ is UI, then $f$ is integrable and
$$\lim_{n\to\infty} \int |f_n - f|\,d\mu = 0.$$

Remark 2.5.4: In view of Proposition 2.5.7 (iii), Theorem 2.5.10 yields convergence of $\int f_n\,d\mu$ to $\int f\,d\mu$ under weaker conditions than the DCT, provided $\mu(\Omega) < \infty$. However, even under the restriction $\mu(\Omega) < \infty$, UI of $\{f_n\}_{n\ge1}$ is a sufficient, but not a necessary condition for convergence of $\int f_n\,d\mu$ to $\int f\,d\mu$ (Problem 2.28). In the special case where the $f_n$'s are nonnegative (and $\mu(\Omega) < \infty$), $\int f_n\,d\mu \to \int f\,d\mu < \infty$ if and only if $\{f_n\}_{n\ge1}$ are UI (Problem 2.29). On the other hand, when $\mu(\Omega) = +\infty$, UI is no longer sufficient to guarantee the convergence of $\int f_n\,d\mu$ to $\int f\,d\mu$ (Problem 2.30). Thus, the notion of UI is useful mainly for finite measures and, in particular, probability measures.

Proof: By Proposition 2.5.7 and Fatou's lemma,
$$\int |f|\,d\mu \le \liminf_{n\to\infty} \int |f_n|\,d\mu \le \sup_{n\ge1} \int |f_n|\,d\mu < \infty$$
and hence $f$ is integrable. Next for $n \in \mathbb{N}$, $t \in (0, \infty)$, define the functions $g_n = |f_n - f|$, $\underline{g}_{n,t} = g_n I(|g_n| \le t)$ and $\bar{g}_{n,t} = g_n I(|g_n| > t)$. Since $\{f_n\}_{n\ge1}$ is UI and $f$ is integrable, by Proposition 2.5.7 (iv), $\{g_n\}_{n\ge1}$ is UI. Hence, for any $\epsilon > 0$, there exists $t_\epsilon > 0$ such that
$$\sup_{n\ge1} \int \bar{g}_{n,t}\,d\mu < \epsilon \quad \text{for all } t \ge t_\epsilon. \tag{5.15}$$
Next note that for any $t > 0$, since $f_n \to f$ a.e. ($\mu$), $\underline{g}_{n,t} \to 0$ a.e. ($\mu$), and $|\underline{g}_{n,t}| \le t$ and $\int t\,d\mu = t\mu(\Omega) < \infty$. Hence, by the DCT,
$$\lim_{n\to\infty} \int \underline{g}_{n,t}\,d\mu = 0 \quad \text{for all } t > 0. \tag{5.16}$$
By (5.15) and (5.16) with $t = t_\epsilon$, we get
$$0 \le \limsup_{n\to\infty} \int |f_n - f|\,d\mu \le \sup_{n\ge1} \int \bar{g}_{n,t}\,d\mu + \lim_{n\to\infty} \int \underline{g}_{n,t}\,d\mu \le \epsilon$$
for all $\epsilon > 0$. Hence, $\int |f_n - f|\,d\mu \to 0$ as $n \to \infty$. □

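The next result concerns connections between the notions of almost everywhere convergence and almost uniform convergence.

Theorem 2.5.11: (Egorov's theorem). Let $f_n \to f$ a.e. ($\mu$) and $\mu(\Omega) < \infty$. Then $f_n \to f$ nearly uniformly ($\mu$) as in Definition 2.5.7.

Proof: For $j, n, r \in \mathbb{N}$, define the sets
$$\begin{aligned} A_{jr} &= \{\omega : |f_j(\omega) - f(\omega)| \ge r^{-1}\}, \\ B_{nr} &= \bigcup_{j \ge n} A_{jr}, \quad C_r = \bigcap_{n \ge 1} B_{nr}, \\ D &= \bigcup_{r \ge 1} C_r. \end{aligned}$$
It is easy to verify that $D$ is the set of points where $f_n$ does not converge to $f$. That is, $D = \{\omega : f_n(\omega) \not\to f(\omega)\}$. By hypothesis $\mu(D) = 0$. This implies $\mu(C_r) = 0$ for all $r \ge 1$. Since $B_{nr} \downarrow C_r$ as $n \to \infty$ and $\mu(\Omega) < \infty$, by m.c.f.a., $\mu(C_r) = 0 \Rightarrow \lim_{n\to\infty} \mu(B_{nr}) = 0$. So for all $r \ge 1$, $\epsilon > 0$, there exists $k_r \in \mathbb{N}$ such that $\mu(B_{k_r r}) < \frac{\epsilon}{2^r}$. Let $A = \bigcup_{r\ge1} B_{k_r r}$. Then $\mu(A) < \epsilon$ and $A^c = \bigcap_{r\ge1} B_{k_r r}^c$. Also, for each $r \in \mathbb{N}$, $A^c \subset B_{k_r r}^c$. So for any $n \ge 1$,
$$\begin{aligned} \sup\{|f_n(\omega) - f(\omega)| : \omega \in A^c\} &\le \sup\{|f_n(\omega) - f(\omega)| : \omega \in B_{k_r r}^c\} \quad \text{for all } r \in \mathbb{N} \\ &\le \frac{1}{r} \quad \text{if } n \ge k_r. \end{aligned}$$
That is, $\sup\{|f_n(\omega) - f(\omega)| : \omega \in A^c\} \to 0$ as $n \to \infty$. □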
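Egorov's theorem can be watched at work on a standard example. The sketch below assumes the illustrative sequence $f_n(x) = x^n$ on $[0, 1]$ with Lebesgue measure, which converges to $0$ for every $x < 1$ but not uniformly on $[0, 1)$; the exceptional set is taken to be $A = (1 - \epsilon, 1]$, and the helper names are hypothetical.

```python
def sup_on_complement(n, eps):
    """sup of |x**n - 0| over A^c = [0, 1 - eps], attained at the right endpoint."""
    return (1.0 - eps) ** n

def first_index_below(tol, eps):
    """Smallest n with sup over [0, 1 - eps] of x**n strictly below tol."""
    n = 1
    while sup_on_complement(n, eps) >= tol:
        n += 1
    return n

for eps in (0.1, 0.01):
    print(eps, [sup_on_complement(n, eps) for n in (10, 100, 1000)], first_index_below(0.5, eps))
```

For each fixed $\epsilon$ the supremum over $A^c$ decays geometrically, although the index needed to reach a given tolerance grows as $\epsilon$ shrinks.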

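Theorem 2.5.12: (Lusin's theorem). Let $F : \mathbb{R} \to \mathbb{R}$ be a nondecreasing function and let $\mu = \mu_F^*$ be the corresponding Lebesgue-Stieltjes measure on $(\mathbb{R}, \mathcal{M}_{\mu_F^*})$. Let $f : \mathbb{R} \to \bar{\mathbb{R}}$ be $\langle \mathcal{M}_{\mu_F^*}, \mathcal{B}(\bar{\mathbb{R}}) \rangle$-measurable. Let $\mu(\{x : |f(x)| = \infty\}) = 0$. Then for every $\epsilon > 0$, there exists a continuous function $g : \mathbb{R} \to \mathbb{R}$ such that $\mu(\{x : f(x) \ne g(x)\}) < \epsilon$.

Proof: Fix $-\infty < a < b < \infty$. Since $\mu([a, b]) < \infty$ and $\mu(\{x : |f(x)| = \infty\}) = 0$, for each $\epsilon > 0$, there exists $K \in (0, \infty)$ such that
$$\mu(\{x : a \le x \le b,\ |f(x)| > K\}) < \frac{\epsilon}{2}.$$
Let $f_K(x) = f(x)I_{[a,b]}(x)I_{\{|f| \le K\}}(x)$. Then $f_K : \mathbb{R} \to \mathbb{R}$ is bounded, $\langle \mathcal{M}_{\mu_F^*}, \mathcal{B}(\mathbb{R}) \rangle$-measurable and zero for $|x| > K$. Consider now the following claim: For every $\epsilon > 0$, there exists a continuous function $g : [a, b] \to \mathbb{R}$ such that
$$\mu(\{x : f_K(x) \ne g(x),\ a \le x \le b\}) < \frac{\epsilon}{2}. \tag{5.17}$$
Clearly this implies that
$$\mu(\{x : a \le x \le b,\ f(x) \ne g(x)\}) < \epsilon. \tag{5.18}$$
Fix $\delta > 0$. Now for each $n \in \mathbb{Z}$, take $[a, b] = [n, n + 1]$, $\epsilon = \frac{\delta}{2^{|n|+2}}$, apply (5.18) and call the resulting continuous function $g_n$. Let $\tilde g$ be a continuous function from $\mathbb{R} \to \mathbb{R}$ such that $\mu(\{x : n \le x \le n + 1,\ \tilde g(x) \ne g_n(x)\}) < \frac{\delta}{2^{|n|+2}}$. This can be done by setting $\tilde g(x) = g_n(x)$ for $n \le x \le (n + 1) - \delta_n$ and linear on $[(n + 1) - \delta_n, n + 1]$ for some $0 < \delta_n < \frac{\delta}{2^{|n|+3}}$. Then
$$\begin{aligned} \mu(\{x : f(x) \ne \tilde g(x)\}) &\le \sum_{n=-\infty}^{\infty} \mu(\{x : n \le x \le n + 1,\ f(x) \ne \tilde g(x)\}) \\ &\le \sum_{n=-\infty}^{\infty} \mu(\{x : n \le x \le n + 1,\ f(x) \ne g_n(x)\}) \\ &\quad + \sum_{n=-\infty}^{\infty} \mu(\{x : n \le x \le n + 1,\ g_n(x) \ne \tilde g(x)\}) \end{aligned}$$
[...] $> 0$, there exists a simple function $s(x) \equiv \sum_{i=1}^{k} c_i I_{A_i}(x)$, with $A_i \subset [a, b]$, $A_i \in \mathcal{M}_{\mu_F^*}$, $\{A_i : 1 \le i \le k\}$ disjoint, $\mu(A_i) < \infty$, and $c_i \in \mathbb{R}$ for $i = 1, \ldots, k$, such that $|f(x) - s(x)| < \frac{\epsilon}{4}$ for all $a \le x \le b$. By Theorem 1.3.4, for each $A_i$ and $\eta > 0$, there exist a finite number of open disjoint intervals $I_{ij} = (a_{ij}, b_{ij})$, $j = 1, \ldots, n_i$ such that
$$\mu\Big( A_i \,\triangle\, \bigcup_{j=1}^{n_i} I_{ij} \Big) < \frac{\eta}{2k}.$$
Now as in Problem 1.32 (g), there exists a continuous function $g_{ij}$ such that
$$\mu\big( g_{ij}^{-1}\{1\} \,\triangle\, I_{ij} \big) < \frac{\eta}{k n_i}, \quad j = 1, 2, \ldots, n_i,\ i = 1, 2, \ldots, k.$$
Let $g_i \equiv \sum_{j=1}^{n_i} g_{ij}$, $1 \le i \le k$. Then $\mu(A_i \,\triangle\, g_i^{-1}\{1\}) < \frac{\eta}{k}$. Let $g = \sum_{i=1}^{k} c_i g_i$. Then $\mu(\{s \ne g\}) < \eta$. Hence for every $\epsilon > 0$, $\eta > 0$, there is a continuous function $g_{\epsilon,\eta} : [a, b] \to \mathbb{R}$ such that
$$\mu(\{x : a \le x \le b,\ |f_K(x) - g_{\epsilon,\eta}(x)| > \epsilon\}) < \eta.$$
Now for each $n \ge 1$, let $h_n(\cdot) \equiv g_{\frac{1}{2^n},\frac{1}{2^n}}(\cdot)$. Then, $\mu(A_n) < \frac{1}{2^n}$, where $A_n \equiv \{x : a \le x \le b,\ |f_K(x) - h_n(x)| > \frac{1}{2^n}\}$, and hence
$$\sum_{n=1}^{\infty} \mu(A_n) < \infty.$$
By the MCT, this implies that $\int_{[a,b]} \sum_{n=1}^{\infty} I_{A_n}\,d\mu < \infty$ and hence
$$\sum_{n=1}^{\infty} I_{A_n} < \infty \quad \text{a.e. } \mu.$$
Thus $h_n \to f_K$ a.e. $\mu$ on $[a, b]$. By Egorov's theorem, for any $\epsilon > 0$, there is a set $A_\epsilon \in \mathcal{B}([a, b])$ such that
$$\mu(A_\epsilon^c) < \epsilon/2 \quad \text{and} \quad h_n \to f_K \text{ uniformly on } A_\epsilon.$$
By the inner regularity (Corollary 1.3.5) of $\mu$, there is a compact set $D \subset A_\epsilon$ such that $\mu(A_\epsilon \setminus D) < \epsilon/2$. Since $h_n \to f_K$ uniformly on $A_\epsilon$, $f_K$ is continuous on $A_\epsilon$ and hence on $D$. It can be shown that there exists a continuous function $g : [a, b] \to \mathbb{R}$ such that $g = f_K$ on $D$ (Problem 2.8 (e)). A more general result extending a continuous function defined on a closed set to the whole space is known as Tietze's extension theorem (see Munkres (1975)). Thus $\mu(\{x : a \le x \le b,\ f_K(x) \ne g(x)\}) < \epsilon$. This completes the proof of (5.17) and hence that of the proposition. □

Remark 2.5.5: (Littlewood's principles). As pointed out in Section 1.3, Theorems 1.3.4, 2.5.11, and 2.5.12 constitute J. E. Littlewood's three principles: every $\mathcal{M}_{\mu_F^*}$-measurable set is nearly a finite union of intervals; every a.e. convergent sequence is nearly uniformly convergent; and every $\mathcal{M}_{\mu_F^*}$-measurable function is nearly continuous.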

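2.6 Problems

2.1 Let $\Omega_i$, $i = 1, 2$ be two nonempty sets and $T : \Omega_1 \to \Omega_2$ be a map. Then for any collection $\{A_\alpha : \alpha \in I\}$ of subsets of $\Omega_2$, show that
$$T^{-1}\Big( \bigcup_{\alpha \in I} A_\alpha \Big) = \bigcup_{\alpha \in I} T^{-1}(A_\alpha)$$
and
$$T^{-1}\Big( \bigcap_{\alpha \in I} A_\alpha \Big) = \bigcap_{\alpha \in I} T^{-1}(A_\alpha).$$
Further, $\big( T^{-1}(A) \big)^c = T^{-1}(A^c)$ for all $A \subset \Omega_2$. (These are known as de Morgan's laws.)

2.2 Let $\{A_i\}_{i\ge1}$ be a collection of disjoint sets in a measurable space $(\Omega, \mathcal{F})$.
(a) Let $\{g_i\}_{i\ge1}$ be a collection of $\langle \mathcal{F}, \mathcal{B}(\mathbb{R}) \rangle$-measurable functions from $\Omega$ to $\mathbb{R}$. Show that $\sum_{i=1}^{\infty} I_{A_i} g_i$ converges on $\mathbb{R}$ and is $\langle \mathcal{F}, \mathcal{B}(\mathbb{R}) \rangle$-measurable.
(b) Let $\mathcal{G} \equiv \sigma\{A_i : i \ge 1\}$. Show that $h : \Omega \to \mathbb{R}$ is $\langle \mathcal{G}, \mathcal{B}(\mathbb{R}) \rangle$-measurable iff $h(\cdot)$ is constant on each $A_i$.

2.3 Let $f, g : \Omega \to \mathbb{R}$ be $\langle \mathcal{F}, \mathcal{B}(\mathbb{R}) \rangle$-measurable. Set
$$h(\omega) = \frac{f(\omega)}{g(\omega)} I(g(\omega) \ne 0), \quad \omega \in \Omega.$$
Verify that $h$ is $\langle \mathcal{F}, \mathcal{B}(\mathbb{R}) \rangle$-measurable.

2.4 Let $g : \Omega \to \bar{\mathbb{R}}$ be such that for every $r \in \mathbb{R}$, $g^{-1}((-\infty, r]) \in \mathcal{F}$. Show that $g$ is $\langle \mathcal{F}, \mathcal{B}(\bar{\mathbb{R}}) \rangle$-measurable.

2.5 Prove Proposition 2.1.6. (Hint: Show that
$$\sigma\{f_\lambda : \lambda \in \Lambda\} = \bigcup_{L \in \mathcal{C}} \sigma\{f_\lambda : \lambda \in L\}$$
where $\mathcal{C}$ is the collection of all countable subsets of $\Lambda$.)

2.6 Let $X_i$, $i = 1, 2, 3$ be random variables on a probability space $(\Omega, \mathcal{F}, P)$. Consider the random equation (in $t \in \mathbb{R}$):
$$X_1(\omega)t^2 + X_2(\omega)t + X_3(\omega) = 0. \tag{6.1}$$
(a) Show that $A \equiv \{\omega \in \Omega : \text{Equation (6.1) has two distinct real roots}\} \in \mathcal{F}$.
(b) Let $T_1(\omega)$ and $T_2(\omega)$ denote the two roots of (6.1) on $A$. Let
$$f_i(\omega) = \begin{cases} T_i(\omega) & \text{on } A, \\ 0 & \text{on } A^c, \end{cases} \quad i = 1, 2.$$
Show that $(f_1, f_2)$ is $\langle \mathcal{F}, \mathcal{B}(\mathbb{R}^2) \rangle$-measurable.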

2.7 Let $M \equiv ((X_{ij}))$, $1 \le i, j \le k$, be a (random) matrix of random variables $X_{ij}$ defined on a probability space $(\Omega, \mathcal{F}, P)$.
(a) Show that $Y_1 \equiv \det(M)$ (the determinant of $M$) and $Y_2 \equiv \mathrm{tr}(M)$ (the trace of $M$) are both $\langle \mathcal{F}, \mathcal{B}(\mathbb{R}) \rangle$-measurable.
(b) Show also that $Y_3 \equiv$ the largest eigenvalue of $M'M$ is $\langle \mathcal{F}, \mathcal{B}(\mathbb{R}) \rangle$-measurable, where $M'$ is the transpose of $M$.
(Hint: Use the result that $Y_3 = \sup_{x \ne 0} \frac{(x'M'Mx)}{(x'x)}$.)

2.8 Let $f : \mathbb{R} \to \mathbb{R}$. Let $\bar f(x) = \inf_{\delta > 0} \sup_{|y - x| < \delta} f(y)$ and $\underline{f}(x) = \sup_{\delta > 0} \inf_{|y - x| < \delta} f(y)$ [...]
(b) Show that for any $t > 0$,
$$\{x : \bar f(x) - \underline{f}(x) < t\} \equiv \bigcup_{r \in \mathbb{Q}} \{x : \bar f(x) < t + r,\ \underline{f}(x) > r\}$$
and hence is open.
(c) Show that $f$ is continuous at some $x_0$ in $\mathbb{R}$ iff $\bar f(x_0) = \underline{f}(x_0)$.
(d) Show that the set $C_f \equiv \{x : f(\cdot) \text{ is continuous at } x\}$ is a $G_\delta$ set, i.e., an intersection of a countable number of open sets, and hence, $C_f$ is a Borel set.
(e) Let $D$ be a closed set in $\mathbb{R}$. Let $f : D \to \mathbb{R}$ be continuous on $D$. Show that there exists a $g : \mathbb{R} \to \mathbb{R}$ continuous such that $g = f$ on $D$. (Hint: Note that $D^c$ is open in $\mathbb{R}$ and hence it can be expressed as a countable union of disjoint open intervals $\{I_j = (a_j, b_j) : 1 \le j \le k \le \infty\}$. Note that $a_j, b_j \in D$ for all $j$ except for possibly the $j$'s for which $a_j = -\infty$ or $b_j = +\infty$. Let
$$g(x) \equiv \begin{cases} f(x) & \text{if } x \in D, \\ f(a_j) + \dfrac{(x - a_j)}{(b_j - a_j)}\big( f(b_j) - f(a_j) \big) & \text{if } x \in (a_j, b_j),\ a_j, b_j \in D, \\ f(b_j) & \text{if } a_j = -\infty,\ x \in (a_j, b_j), \\ f(a_j) & \text{if } b_j = \infty,\ x \in (a_j, b_j). \end{cases}$$
Now verify that $g$ has the required properties.)


2. Integration

2.9 Prove Proposition 2.2.1 using Problem 2.1.

2.10 (a) Show that for any x ∈ R and any random variable X with cdf $F_X(\cdot)$, $P(X < x) = F_X(x-) \equiv \lim_{y \uparrow x} F_X(y)$.
(b) Show that $F_c(\cdot)$ in (2.5) is continuous.

2.11 Let F : R → R be nondecreasing.
(a) Show that for x ∈ R, $F(x-) \equiv \lim_{y \uparrow x} F(y)$ and $F(x+) = \lim_{y \downarrow x} F(y)$ exist and satisfy $F(x-) \le F(x) \le F(x+)$.
(b) Let $D \equiv \{x : F(x+) - F(x-) > 0\}$. Show that D is at most countable. (Hint: Show that
$$D = \bigcup_{n \ge 1} \bigcup_{r \ge 1} D_{n,r},$$
where $D_{n,r} = \{x : |x| \le n,\ F(x+) - F(x-) > \tfrac{1}{r}\}$ and that each $D_{n,r}$ is finite.)

2.12 Suppose that (i) and (iii) of Proposition 2.2.3 hold. Show that for any $a_1$ in R and $-\infty < a_2 \le b_2 < \infty$, $F(a_2, a_1) \le F(b_2, a_1)$ and $F(a_1, a_2) \le F(a_1, b_2)$ (i.e., F is monotone coordinatewise).

2.13 Let $F : \mathbb{R}^k \to \mathbb{R}$ be such that:
(a) for $x_1 = (x_{11}, x_{12}, \ldots, x_{1k})$ and $x_2 = (x_{21}, x_{22}, \ldots, x_{2k})$ with $x_{1i} \le x_{2i}$ for i = 1, 2, ..., k,
$$\Delta F(x_1, x_2) \equiv \sum_{a \in A} (-1)^{s(a)} F(a) \ge 0,$$
where $A \equiv \{a = (a_1, a_2, \ldots, a_k) : a_i \in \{x_{1i}, x_{2i}\},\ i = 1, 2, \ldots, k\}$ and for a in A, $s(a) = |\{i : a_i = x_{1i},\ i = 1, 2, \ldots, k\}|$ is the number of indices i for which $a_i = x_{1i}$.
(b) For each i = 1, 2, ..., k,
$$\lim_{x_{i1} \downarrow -\infty} F(x_i) = 0.$$
Let $\mathcal{C}_k$ be the semialgebra of sets of the form $\{A : A = A_1 \times \ldots \times A_k,\ A_i \in \mathcal{C} \text{ for all } 1 \le i \le k\}$ where $\mathcal{C}$ is the semialgebra in R defined in (3.7). Set $\mu_F(A) \equiv \Delta F(x_1, x_2)$ if $A = (x_{11}, x_{21}] \times (x_{12}, x_{22}] \times \ldots \times (x_{1k}, x_{2k}]$ is bounded and set $\mu_F(A) = \lim_{n \to \infty} \mu(A \cap J_n)$, where $J_n = (-n, n]^k$.
(i) Show that F is coordinatewise monotone, i.e., if $x = (x_1, \ldots, x_k)$, $y = (y_1, \ldots, y_k)$ and $x_i \le y_i$ for every i = 1, ..., k, then $F(y) \ge F(x)$.


(ii) Show that C is a semialgebra and $\mu_F$ is a measure on C by using the Heine-Borel theorem as in Problems 1.22 and 1.23.

2.14 Let m(·) denote the Lebesgue measure on (R, B(R)). Let T : R → R be the map $T(x) = x^2$. Evaluate the induced measure $mT^{-1}(A)$ of the set A, where
(a) A = [0, t], t > 0.
(b) A = (−∞, 0).
(c) A = {1, 2, 3, ...}.
(d) $A = \bigcup_{i=1}^{\infty} \big(i^2, (i + \tfrac{1}{i^2})^2\big)$.
(e) $A = \bigcup_{i=1}^{\infty} \big(i^2, (i + \tfrac{1}{i})^2\big)$.

2.15 Consider the probability space ((0, 1), B((0, 1)), m), where m(·) is the Lebesgue measure.
(a) Let $Y_1$ be the random variable $Y_1(x) \equiv \sin 2\pi x$ for x ∈ (0, 1). Find the cdf of $Y_1$.
(b) Let $Y_2$ be the random variable $Y_2(x) \equiv \log x$ for x ∈ (0, 1). Find the cdf of $Y_2$.
(c) Let F : R → R be a cdf. For 0 < x < 1, let
$$F_1^{-1}(x) = \inf\{y : y \in \mathbb{R},\ F(y) \ge x\},$$
$$F_2^{-1}(x) = \sup\{y : y \in \mathbb{R},\ F(y) \le x\}.$$
Let $Z_i$ be the random variable defined by $Z_i = F_i^{-1}(x)$, 0 < x < 1, i = 1, 2.
(i) Find the cdf of $Z_i$, i = 1, 2. (Hint: Verify using the right continuity of F that for any 0 < x < 1, t ∈ R, $F(t) \ge x \Leftrightarrow F_1^{-1}(x) \le t$.)
(ii) Show also that $F_1^{-1}(\cdot)$ is left continuous and $F_2^{-1}(\cdot)$ is right continuous.

2.16 (a) Let $(\Omega, \mathcal{F}_1, \mu)$ be a σ-finite measure space. Let T : Ω → R be $\langle \mathcal{F}, \mathcal{B}(\mathbb{R}) \rangle$-measurable. Show, by an example, that the induced measure $\mu T^{-1}$ need not be σ-finite.
(b) Let $(\Omega_i, \mathcal{F}_i)$ be measurable spaces for i = 1, 2 and let $T : \Omega_1 \to \Omega_2$ be $\langle \mathcal{F}_1, \mathcal{F}_2 \rangle$-measurable. Show that any measure µ on $(\Omega_1, \mathcal{F}_1)$ is σ-finite if $\mu T^{-1}$ is σ-finite on $(\Omega_2, \mathcal{F}_2)$.


2.17 Let (Ω, F, µ) be a measure space and let f : Ω → [0, ∞] be such that it admits two representations
$$f = \sum_{i=1}^{k} c_i I_{A_i} \quad \text{and} \quad f = \sum_{j=1}^{\ell} d_j I_{B_j},$$
where $c_i, d_j \in [0, \infty]$, and $A_i$ and $B_j \in \mathcal{F}$ for all i, j. Show that
$$\sum_{i=1}^{k} c_i \mu(A_i) = \sum_{j=1}^{\ell} d_j \mu(B_j).$$
(Hint: Express $A_i$ and $B_j$ as finite unions of a common collection of disjoint sets in F.)

2.18 (a) Prove Proposition 2.3.1.
(b) In the proof of Proposition 2.3.2, verify that $D_n \uparrow D$.
(c) Verify Remark 2.3.1. (Hint: Let
$$A_n = \{\omega : f_{n+1}(\omega) \ge f_n(\omega),\ g_{n+1}(\omega) \ge g_n(\omega)\},$$
$$A = \bigcap_{n \ge 1} A_n \cap \Big\{\omega : \lim_{n \to \infty} g_n(\omega) = g(\omega),\ \lim_{n \to \infty} f_n(\omega) = f(\omega)\Big\}$$
and $\tilde{f}_n = f_n I_A$, $\tilde{g}_n = g_n I_A$. Verify that $\mu(A^c) = 0$ and apply Proposition 2.3.2 to $\{\tilde{f}_n\}_{n \ge 1}$ and $\{\tilde{g}_n\}_{n \ge 1}$.)

2.19 (a) Verify that $f_n(\cdot)$ defined in (3.7) satisfies $f_n(\omega) \uparrow f(\omega)$ for all ω in Ω.
(b) Verify that the sequence $\{h_n\}_{n \ge 1}$ of Remark 2.3.3 satisfies $\lim_{n \to \infty} h_n = f$ a.e. (µ), and $\lim_{n \to \infty} \int h_n\, d\mu = M$.

2.20 Apply Corollary 2.3.5 to show that for any collection $\{a_{ij} : i, j \in \mathbb{N}\}$ of nonnegative numbers,
$$\sum_{i=1}^{\infty} \Big( \sum_{j=1}^{\infty} a_{ij} \Big) = \sum_{j=1}^{\infty} \Big( \sum_{i=1}^{\infty} a_{ij} \Big).$$

2.21 Let g : R → R.
(a) Recall that $\lim_{t \to \infty} g(t) = L$ for some L in R if for every ε > 0, there exists a T < ∞ such that $t \ge T \Rightarrow |g(t) - L| < \epsilon$. Show that $\lim_{t \to \infty} g(t) = L$ for some L in R iff $\lim_{n \to \infty} g(t_n) = L$ for every sequence $\{t_n\}_{n \ge 1}$ with $\lim_{n \to \infty} t_n = \infty$.


(b) Formulate and prove a similar result when $\lim_{t \to a} g(t) = L$ for some a, L ∈ R.

2.22 Let $\{f_t : t \in \mathbb{R}\} \subset L^1(\Omega, \mathcal{F}, \mu)$.
(a) (The continuous version of the MCT). Suppose that $f_t \uparrow f$ as t ↑ ∞ a.e. (µ) and for each t, $f_t \ge 0$ a.e. (µ). Show that $\int f_t\, d\mu \uparrow \int f\, d\mu$.
(b) (The continuous version of the DCT). Suppose there exists a nonnegative $g \in L^1(\Omega, \mathcal{F}, \mu)$ such that for each t, $|f_t| \le g$ a.e. (µ) and as t → ∞, $f_t \to f$ a.e. (µ). Then $f \in L^1(\Omega, \mathcal{F}, \mu)$ and $\int |f_t - f|\, d\mu \to 0$ and hence, $\int f_t\, d\mu \to \int f\, d\mu$, as t → ∞.

2.23 Let f : [a, b] → R be bounded where −∞ < a < b < ∞. Let $\{P_n\}_{n \ge 1}$ be a sequence of partitions such that $\Delta(P_n) \to 0$. Show that as n → ∞,
$$U(P_n, f) \to \overline{\int} f \quad \text{and} \quad L(P_n, f) \to \underline{\int} f,$$
where $\overline{\int} f$ and $\underline{\int} f$ are as defined in (4.3) and (4.4), respectively.
(Hint: Given ε > 0, fix a partition $P = \{x_0 = a < x_1 < \ldots < x_k = b\}$ such that $\overline{\int} f < U(P, f) < \overline{\int} f + \epsilon$. Let $\delta = \min_{0 \le i \le k-1} (x_{i+1} - x_i)$. Choose n large such that the diameter $\Delta(P_n) < \delta$. Verify that $U(P_n, f) < U(P, f) + kB\Delta(P_n)$ where $B = \sup\{|f(x)| : a \le x \le b\}$ and conclude that $\lim_n U(P_n, f) \le \overline{\int} f + \epsilon$.)

2.24 Establish (4.6) and (4.7). (Hint: Show that for every x and any ε > 0, $\phi_n(x) \le \phi(x) + \epsilon$ for all n large and that for $x \notin \bigcup_{n \ge 1} P_n$, $\phi_n(x) \ge \phi(x)$ for all n.)

2.25 If $f(x) = I_{Q_1}(x)$ where $Q_1 = \mathbb{Q} \cap [0, 1]$, $\mathbb{Q}$ being the set of all rationals, then show that for any partition P, U(P, f) = 1 and L(P, f) = 0.

2.26 Establish Theorem 2.5.1. (Hint: Verify that
$$D \equiv \{\omega : f_j(\omega) \not\to f(\omega)\} = \bigcup_{r \ge 1} \bigcap_{n \ge 1} \bigcup_{j \ge n} A_{jr},$$
where $A_{jr} = \{|f_j - f| > \tfrac{1}{r}\}$. Show that since µ(D) = 0 and µ(Ω) < ∞, $\mu(D_{rn}) \to 0$ as n → ∞ for each r ∈ N, where $D_{rn} = \bigcup_{j \ge n} A_{jr}$.)


2.27 Let $\phi : \mathbb{R}_+ \to \mathbb{R}_+$ be nondecreasing and $\frac{\phi(x)}{x} \uparrow \infty$ as x ↑ ∞. Also, let $\{f_\lambda : \lambda \in \Lambda\}$ be a subset of $L^1(\Omega, \mathcal{F}, \mu)$. Show that if $\sup_{\lambda \in \Lambda} \int \phi(|f_\lambda|)\, d\mu < \infty$, then $\{f_\lambda : \lambda \in \Lambda\}$ is UI.

2.28 Let µ be the Lebesgue measure on ([−1, 1], B([−1, 1])). For n ≥ 1, define $f_n(x) = n I_{(0, n^{-1})}(x) - n I_{(-n^{-1}, 0)}(x)$ and f(x) ≡ 0 for x ∈ [−1, 1]. Show that $f_n \to f$ a.e. (µ) and $\int f_n\, d\mu \to \int f\, d\mu$ but $\{f_n\}_{n \ge 1}$ is not UI.

2.29 Let $\{f_n : n \ge 1\} \cup \{f\} \subset L^1(\Omega, \mathcal{F}, \mu)$.
(a) Show that $\int |f_n - f|\, d\mu \to 0$ iff $f_n \stackrel{m}{\longrightarrow} f$ and $\int |f_n|\, d\mu \to \int |f|\, d\mu$.
(b) Show further that if µ(Ω) < ∞ then the above two are equivalent to $f_n \stackrel{m}{\longrightarrow} f$ and $\{f_n\}$ UI.

2.30 For n ≥ 1, let $f_n(x) = n^{-1/2} I_{(0,n)}(x)$, x ∈ R, and let f(x) = 0, x ∈ R. Let m denote the Lebesgue measure on (R, B(R)). Show that $f_n \to f$ a.e. (m) and $\{f_n\}_{n \ge 1}$ is UI, but $\int f_n\, dm \not\to \int f\, dm$.

2.31 (Computing integrals w.r.t. the Lebesgue measure). Let $f \in L^1(\mathbb{R}, \mathcal{M}_m, m)$ where $(\mathbb{R}, \mathcal{M}_m, m)$ is the real line with Lebesgue σ-algebra and Lebesgue measure, i.e., $m = \mu_F^*$ where F(x) ≡ x. The definition of $\int f\, dm$ as $\int f^+\, dm - \int f^-\, dm$ involves computing $\int f^+\, dm$ and $\int f^-\, dm$, which in turn is given in terms of approximating by integrals of simple nonnegative functions. This is not a very practical procedure. For f that happens to be continuous a.e. and bounded on finite intervals, one can compute the Riemann integral of f over finite intervals and pass to the limit. Justify the following steps:
(a) Let f be continuous a.e. and bounded on finite intervals and $f \in L^1(\mathbb{R}, \mathcal{M}_m, m)$. Show that for −∞ < a < b < ∞, $f \in L^1([a, b], \mathcal{M}_m, m)$ and
$$\int_{[a,b]} f\, dm = \int_{[a,b]} f(x)\, dx,$$
where the right side denotes the Riemann integral of f on [a, b].
(b) If, in addition, $f \in L^1(\mathbb{R}, \mathcal{M}_m, m)$, then
$$\int_{\mathbb{R}} f\, dm = \lim_{\substack{a \to -\infty \\ b \to +\infty}} \int_{[a,b]} f\, dm.$$
(c) If f is continuous a.e. and $f \in L^1(\mathbb{R}, \mathcal{M}_m, m)$, then
$$\int_{\mathbb{R}} f\, dm = \lim_{\substack{a \to -\infty \\ b \to +\infty \\ c \to \infty}} \int_{[a,b]} \phi_c(f)\, dm,$$
where $\phi_c(f) = f(x) I(|f(x)| \le c) + c I(f(x) > c) - c I(f(x) < -c)$.
(d) Apply the above procedure to compute $\int_{\mathbb{R}} f\, dm$ for
(i) $f(x) = \frac{1}{1 + x^2}$,
(ii) $f(x) = e^{-x^2/2}$,
(iii) $f(x) = e^{-x^2/2 + x}$.

2.32 (a) Let a ∈ R. Show that if $a_1, a_2$ are nonnegative such that $a = a_1 - a_2$ then $a_1 \ge a^+$, $a_2 \ge a^-$ and $a_1 - a^+ = a_2 - a^-$.
(b) Let $f = f_1 - f_2$ where $f_1, f_2$ are nonnegative and are in $L^1(\Omega, \mathcal{F}, \mu)$. Show that $f \in L^1(\Omega, \mathcal{F}, \mu)$ and $\int f\, d\mu = \int f_1\, d\mu - \int f_2\, d\mu$.

2.33 Let (Ω, F, µ) be a measure space. Let f : Ω × (a, b) → R be such that for each a < t < b, $f(\cdot, t) \in L^1(\Omega, \mathcal{F}, \mu)$.
(a) Suppose for each a < t < b,
(i) $\lim_{h \to 0} f(\omega, t + h) = f(\omega, t)$ a.e. (µ),
(ii) $\sup_{|h| \le 1} |f(\omega, t + h)| \le g_1(\omega, t)$, where $g_1(\cdot, t) \in L^1(\Omega, \mathcal{F}, \mu)$.
Show that $\phi(t) \equiv \int_\Omega f(\omega, t)\, d\mu$ is continuous on (a, b).
(b) Suppose for each a < t < b,
(i) $\lim_{h \to 0} \dfrac{f(\omega, t + h) - f(\omega, t)}{h} = g_2(\omega, t)$ exists a.e. (µ),
(ii) $\sup_{0 < |h| \le 1} \left| \dfrac{f(\omega, t + h) - f(\omega, t)}{h} \right| \le G(\omega, t)$ a.e. (µ),
(iii) $G(\omega, t) \in L^1(\Omega, \mathcal{F}, \mu)$.
Show that $\phi(t) \equiv \int_\Omega f(\omega, t)\, d\mu$ is differentiable on (a, b). (Hint: Use the continuous version of DCT (cf. Problem 2.22).)

2.34 Let $A \equiv ((a_{ij}))$ be an infinite matrix of real numbers. Suppose that for each j, $\lim_{i \to \infty} a_{ij} = a_j$ exists in R and $\sup_i |a_{ij}| \le b_j$, where $\sum_{j=1}^{\infty} b_j < \infty$.
(a) Show by an application of the DCT that
$$\lim_{i \to \infty} \sum_{j=1}^{\infty} |a_{ij} - a_j| = 0.$$

(b) Show the same directly, i.e., without using the DCT.


2.35 Using the above problem or otherwise, show that for any sequence $\{x_n\}_{n \ge 1}$ with $\lim_{n \to \infty} x_n = x$ in R,
$$\lim_{n \to \infty} \Big(1 + \frac{x_n}{n}\Big)^n = \sum_{j=0}^{\infty} \frac{x^j}{j!} \equiv \exp(x).$$
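The limit in Problem 2.35 can be observed numerically before it is proved. The sketch below is only an illustration, not a solution: the sequence $x_n = x + 1/n$ and the function names are hypothetical choices made for this example.

```python
import math


def power_sequence(x, n):
    """Return (1 + x_n / n) ** n for the illustrative choice x_n = x + 1/n."""
    x_n = x + 1.0 / n  # assumption: any sequence converging to x could be used here
    return (1.0 + x_n / n) ** n


def truncated_exp_series(x, terms=60):
    """Partial sum of the series sum_{j >= 0} x**j / j!."""
    return sum(x ** j / math.factorial(j) for j in range(terms))


if __name__ == "__main__":
    x = 1.5
    for n in (10, 1_000, 100_000):
        print(n, power_sequence(x, n), truncated_exp_series(x), math.exp(x))
```

As n grows, the first printed column approaches the other two, which agree with each other up to rounding.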

2.36 (a) Let $\{f_n\}_{n \ge 1} \subset L^1(\Omega, \mathcal{F}, \mu)$ such that $f_n \to 0$ in $L^1(\mu)$. Show that $\{f_n\}_{n \ge 1}$ is UI.
(b) Let $\{f_n\}_{n \ge 1} \subset L^p(\Omega, \mathcal{F}, \mu)$, 0 < p < ∞, such that µ(Ω) < ∞, $\{|f_n|^p\}_{n \ge 1}$ is UI and $f_n \stackrel{m}{\longrightarrow} f$. Show that $f \in L^p(\mu)$ and $f_n \to f$ in $L^p(\mu)$.

2.37 (a) Show that for a sequence of real numbers $\{x_n\}_{n \ge 1}$, $\lim_{n \to \infty} x_n = x \in \bar{\mathbb{R}}$ holds iff every subsequence $\{x_{n_j}\}_{j \ge 1}$ of $\{x_n\}_{n \ge 1}$ has a further subsequence $\{x_{n_{j_k}}\}_{k \ge 1}$ such that $\lim_{k \to \infty} x_{n_{j_k}} = x$.
(b) Use (a) and Theorem 2.5.2 to show that the extended DCT (Theorem 2.3.11) is valid if the a.e. convergence of $\{f_n\}_{n \ge 1}$ and $\{g_n\}_{n \ge 1}$ is replaced by convergence in measure.

2.38 Let $(\mathbb{R}, \mathcal{M}_{\mu_F^*}, \mu_F^*)$ be a Lebesgue-Stieltjes measure space generated by a nondecreasing F : R → R. Let $f : \mathbb{R} \to \bar{\mathbb{R}}$ be $\mathcal{M}_{\mu_F^*}$-measurable such that |f| < ∞ a.e. ($\mu_F^*$). Show that for every k ∈ N and for every ε, η ∈ (0, ∞), there exists a continuous function g : R → R such that g(x) = 0 for |x| > k and $\mu_F^*(\{x : |x| \le k,\ |f(x) - g(x)| > \eta\}) < \epsilon$. (Hint: Complete the following:
Step I: For all ε > 0, there exists $M_{k,\epsilon} \in (0, \infty)$ such that $\mu_F^*(\{x : |x| \le k,\ |f(x)| > M_{k,\epsilon}\}) < \epsilon$.
Step II: For η > 0, there exists a simple function s(·) such that $\mu_F^*(\{x : |x| \le k,\ |f(x)| \le M_{k,\epsilon},\ |f(x) - s(x)| > \eta\}) = 0$.
Step III: For δ > 0, there exists a continuous function g(·) such that g ≡ 0 for |x| > k and $\mu_F^*(\{x : |x| \le k,\ s(x) \ne g(x)\}) < \delta$.)

2.39 Recall from Corollary 2.3.6 that for $g \in L^1(\Omega, \mathcal{F}, \mu)$ and nonnegative, $\nu_g(A) \equiv \int_A g\, d\mu$ is a measure. Next, for any F-measurable function h, show that $h \in L^1(\nu_g)$ iff $h \cdot g \in L^1(\mu)$ and
$$\int h\, d\nu_g = \int h g\, d\mu.$$
(Hint: Verify first for h simple and nonnegative, next for h nonnegative, and finally for any h.)

2.40 Prove the BCT, Corollary 2.3.13, using Egorov's theorem (Theorem 2.5.11).

2.41 Deduce the DCT from the BCT with the notation as in Corollary 2.3.12. (Hint: Apply the BCT to the measure space $(\Omega, \mathcal{F}, \nu_g)$ and functions $h_n = \frac{f_n}{g} I(g > 0)$, $h = \frac{f}{g} I(g > 0)$ where $\nu_g$ is as in Problem 2.39.)


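2.42 (Change of variables formula). Let $(\Omega_i, \mathcal{F}_i)$, i = 1, 2 be two measurable spaces. Let $f : \Omega_1 \to \Omega_2$ be $\langle \mathcal{F}_1, \mathcal{F}_2 \rangle$-measurable, $h : \Omega_2 \to \mathbb{R}$ be $\langle \mathcal{F}_2, \mathcal{B}(\mathbb{R}) \rangle$-measurable, and $\mu_1$ be a measure on $(\Omega_1, \mathcal{F}_1)$. Show that $g \equiv h \circ f$, i.e., $g(\omega) \equiv h(f(\omega))$ for $\omega \in \Omega_1$, is in $L^1(\mu_1)$ iff $h(\cdot) \in L^1(\Omega_2, \mathcal{F}_2, \mu_2)$ where $\mu_2 = \mu_1 f^{-1}$ iff $I(\cdot) \in L^1(\mathbb{R}, \mathcal{B}(\mathbb{R}), \mu_3 \equiv \mu_2 h^{-1})$ where I(·) is the identity function in R, i.e., I(x) ≡ x for all x ∈ R, and also that
$$\int_{\Omega_1} g\, d\mu_1 = \int_{\Omega_2} h\, d\mu_2 = \int_{\mathbb{R}} x\, d\mu_3.$$

2.43 Let $\phi(x) \equiv \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ be the standard N(0, 1) pdf on R. Let $\{(\mu_n, \sigma_n)\}_{n \ge 1}$ be a sequence in $\mathbb{R} \times \mathbb{R}_+$. Suppose $\mu_n \to \mu$ and $\sigma_n \to \sigma$ as n → ∞. Let $f_n(x) = \frac{1}{\sigma_n} \phi\big(\frac{x - \mu_n}{\sigma_n}\big)$, $f(x) = \frac{1}{\sigma} \phi\big(\frac{x - \mu}{\sigma}\big)$ and $\nu_n(A) = \int_A f_n\, dm$, $\nu(A) = \int_A f\, dm$ for any Borel set A in R, where m(·) is the Lebesgue measure on R. Using Scheffe's theorem, verify that, as n → ∞, $\nu_n(\cdot) \to \nu(\cdot)$ uniformly on B(R) and that for any h : R → R, bounded and Borel measurable, $\int h\, d\nu_n \to \int h\, d\nu$.

2.44 Let $f_n(x) = c_n \big(1 - \frac{x}{n}\big)^n I_{[0,n]}(x)$, x ∈ R, n ≥ 1.
(a) Find $c_n$ such that $\int f_n\, dm = 1$.
(b) Show that $\lim_{n \to \infty} f_n(x) \equiv f(x)$ exists for all x in R and that f is a pdf.
(c) For A ∈ B(R), let
$$\nu_n(A) \equiv \int_A f_n\, dm \quad \text{and} \quad \nu(A) = \int_A f\, dm.$$
Show that $\nu_n \to \nu$ uniformly on B(R).

2.45 Let (Ω, F, µ) be a measure space and f : Ω → R be F-measurable. Suppose that $\int_\Omega |f|\, d\mu < \infty$ and D is a closed set in R such that for all B ∈ F with µ(B) > 0,
$$\frac{1}{\mu(B)} \int_B f\, d\mu \in D.$$
Show that f(ω) ∈ D a.e. µ. (Hint: Show that for x ∉ D, there exists r > 0 such that µ{ω : |f(ω) − x| < r} = 0.)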


2.46 Find a sequence of nonnegative continuous functions $\{f_n\}_{n \ge 1}$ on [0, 1] such that $\int_{[0,1]} f_n\, dm \to 0$ but $\{f_n(x)\}_{n \ge 1}$ does not converge for any x. (Hint: Let for m ≥ 1, 1 ≤ k ≤ m, $g_{m,k} = I_{\left[\frac{k-1}{m}, \frac{k}{m}\right]}$ and $\{f_n\}_{n \ge 1}$ be a reordering of $\{g_{n,k} : 1 \le k \le n,\ n \ge 1\}$.)
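The indicator sequence in the hint to Problem 2.46 can be explored with a few lines of code. The sketch below is an illustration only: it enumerates the intervals row by row (one possible reordering, chosen here as an assumption), and it works with the indicators themselves, which are not continuous, so it does not by itself answer the problem. All function names are hypothetical.

```python
from fractions import Fraction


def typewriter_interval(n):
    """Endpoints of the n-th interval [(k-1)/m, k/m], n = 1, 2, 3, ...,
    in the row-by-row order (m, k) = (1,1), (2,1), (2,2), (3,1), ..."""
    m = 1
    while n > m:  # row m contains exactly m intervals
        n -= m
        m += 1
    return Fraction(n - 1, m), Fraction(n, m)


def f(n, x):
    """Value at x of the indicator of the n-th interval."""
    lo, hi = typewriter_interval(n)
    return 1 if lo <= x <= hi else 0


def integral(n):
    """Lebesgue integral of the n-th indicator over [0, 1]: the interval length."""
    lo, hi = typewriter_interval(n)
    return hi - lo
```

For a fixed x, `f(n, x)` keeps returning both 0 and 1 as n increases, while `integral(n)` equals 1/m on row m and so tends to 0.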

2.47 Let $\{f_n\}_{n \ge 1}$ be a sequence of continuous functions from [0, 1] to [0, 1] such that $f_n(x) \to 0$ as n → ∞ for all 0 ≤ x ≤ 1. Show that $\int_0^1 f_n(x)\, dx \to 0$ as n → ∞ (where the integral is the Riemann integral) by two methods: one by using BCT and one without using BCT. Show also that if µ is a finite measure on ([0, 1], B([0, 1])), then $\int_{[0,1]} f_n\, d\mu \to 0$ as n → ∞.

2.48 (Invariance of Lebesgue measure under translation and reflection.) Let m(·) be Lebesgue measure on (R, B(R)).
(a) For any E ∈ B(R) and c ∈ R, define −E ≡ {x : −x ∈ E} and E + c ≡ {y : y = x + c, x ∈ E}. Show that m(−E) = m(E) and m(E + c) = m(E) for all E ∈ B(R) and c ∈ R.
(b) For any $f \in L^1(\mathbb{R}, \mathcal{B}(\mathbb{R}), m)$ and c ∈ R, let $\tilde{f}(x) \equiv f(-x)$ and $f_c(x) \equiv f(x + c)$. Show that
$$\int \tilde{f}\, dm = \int f_c\, dm = \int f\, dm.$$

3 Lp-Spaces

3.1 Inequalities

This section contains a number of useful inequalities.

Theorem 3.1.1: (Markov's inequality). Let f be a nonnegative measurable function on a measure space (Ω, F, µ). Then for any 0 < t < ∞,
$$\mu(\{f \ge t\}) \le \frac{\int f\, d\mu}{t}. \tag{1.1}$$

Proof: Since f is nonnegative,
$$\int f\, d\mu \ge \int_{\{f \ge t\}} f\, d\mu \ge t\, \mu(\{f \ge t\}).$$
□
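Markov's inequality (1.1) can be checked directly on a finite measure space. The following is a minimal sketch, assuming a measure on the points 0, 1, ..., n−1 given by a list of nonnegative weights; the function names and the data layout are hypothetical choices for this illustration.

```python
def integral(f_values, weights):
    """Integral of f with respect to the measure that puts mass weights[i] on point i."""
    return sum(f * w for f, w in zip(f_values, weights))


def measure_of_superlevel_set(f_values, weights, t):
    """mu({f >= t}) for the same discrete measure."""
    return sum(w for f, w in zip(f_values, weights) if f >= t)


def markov_bound(f_values, weights, t):
    """Right side of (1.1): (integral of f) / t, for t > 0 and f nonnegative."""
    return integral(f_values, weights) / t
```

For example, with `f_values = [0, 1, 2, 5]` and `weights = [0.5, 1, 2, 0.25]`, comparing `measure_of_superlevel_set` with `markov_bound` for several t shows the left side of (1.1) never exceeding the right side. Note that the measure need not be a probability measure.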

Corollary 3.1.2: Let X be a random variable on a probability space (Ω, F, P). Then, for r > 0, t > 0,
$$P(|X| \ge t) \le \frac{E|X|^r}{t^r}.$$

Proof: Since $\{|X| \ge t\} = \{|X|^r \ge t^r\}$ for all t > 0, r > 0, this follows from (1.1). □

Corollary 3.1.3: (Chebychev's inequality). Let X be a random variable with $EX^2 < \infty$, E(X) = µ and Var(X) = $\sigma^2$. Then for any 0 < k < ∞,
$$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}.$$

Proof: Follows from Corollary 3.1.2 with X replaced by X − µ and with r = 2, t = kσ. □

Corollary 3.1.4: Let $\phi : \mathbb{R}_+ \to \mathbb{R}_+$ be nondecreasing. Then for any random variable X and 0 < t < ∞,
$$P(|X| \ge t) \le \frac{E\phi(|X|)}{\phi(t)}.$$

Proof: Use (1.1) and the fact that |X| ≥ t ⇒ φ(|X|) ≥ φ(t). □

Corollary 3.1.5: (Cramer's inequality). For any random variable X and t > 0,
$$P(X \ge t) \le \inf_{\theta > 0} \frac{E(e^{\theta X})}{e^{\theta t}}.$$

Proof: For t > 0, θ > 0,
$$P(X \ge t) = P\big(e^{\theta X} \ge e^{\theta t}\big) \le \frac{E(e^{\theta X})}{e^{\theta t}},$$
by (1.1). □
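As an illustration of Corollary 3.1.5, take X to be standard normal, for which $E(e^{\theta X}) = e^{\theta^2/2}$. The bound then becomes $\inf_{\theta > 0} e^{\theta^2/2 - \theta t} = e^{-t^2/2}$, attained at θ = t. The sketch below minimizes over a grid of θ values and compares the result with the exact tail. The choice of distribution, the grid parameters, and the function names are assumptions made for this example only.

```python
import math


def normal_tail(t):
    """Exact P(X >= t) for X ~ N(0, 1)."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def cramer_bound(t, grid_size=2000, theta_max=10.0):
    """Minimize E(exp(theta X)) / exp(theta t) = exp(theta**2 / 2 - theta * t)
    over a uniform grid of theta values in (0, theta_max]."""
    thetas = (theta_max * (i + 1) / grid_size for i in range(grid_size))
    return min(math.exp(0.5 * th * th - th * t) for th in thetas)
```

For t between 0.5 and 3, `cramer_bound(t)` is close to `math.exp(-t * t / 2)` and lies above `normal_tail(t)`, as the corollary requires.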

Definition 3.1.1: A function φ : (a, b) → R is called convex if for all 0 ≤ λ ≤ 1, a < x ≤ y < b,
$$\phi(\lambda x + (1 - \lambda) y) \le \lambda \phi(x) + (1 - \lambda) \phi(y). \tag{1.2}$$

Geometrically, this means that for the graph of y = φ(x) on (a, b), for each fixed t ∈ (0, ∞), the chord over the interval (x, x + t) turns in the counterclockwise direction as x increases. More precisely, the following result holds.

Proposition 3.1.6: Let φ : (a, b) → R. Then φ is convex iff for all $a < x_1 < x_2 < x_3 < b$,
$$\frac{\phi(x_2) - \phi(x_1)}{x_2 - x_1} \le \frac{\phi(x_3) - \phi(x_2)}{x_3 - x_2}, \tag{1.3}$$
which is equivalent to
$$\frac{\phi(x_2) - \phi(x_1)}{x_2 - x_1} \le \frac{\phi(x_3) - \phi(x_1)}{(x_3 - x_1)} \le \frac{\phi(x_3) - \phi(x_2)}{x_3 - x_2}. \tag{1.4}$$

Proof: Let φ be convex and $a < x_1 < x_2 < x_3 < b$. Then one can write $x_2 = \lambda x_1 + (1 - \lambda) x_3$ with $\lambda = \frac{(x_3 - x_2)}{(x_3 - x_1)}$. So by (1.2),
$$\phi(x_2) = \phi(\lambda x_1 + (1 - \lambda) x_3) \le \lambda \phi(x_1) + (1 - \lambda) \phi(x_3) = \frac{(x_3 - x_2)}{(x_3 - x_1)} \phi(x_1) + \frac{(x_2 - x_1)}{(x_3 - x_1)} \phi(x_3),$$
which is equivalent to (1.3). Also, since
$$\frac{\phi(x_3) - \phi(x_1)}{(x_3 - x_1)} = \lambda\, \frac{\phi(x_3) - \phi(x_2)}{(x_3 - x_2)} + (1 - \lambda)\, \frac{\phi(x_2) - \phi(x_1)}{(x_2 - x_1)},$$
(1.4) follows from (1.3). Conversely, suppose (1.4) holds for all $a < x_1 < x_2 < x_3 < b$. Then given a < x < y < b and 0 < λ < 1, set $x_1 = x$, $x_2 = \lambda x + (1 - \lambda) y$, $x_3 = y$ and apply (1.4) to verify (1.2). □

The following properties of a convex function are direct consequences of (1.3). The proof is left as an exercise (Problem 3.1).

Proposition 3.1.7: Let φ : (a, b) → R be convex. Then,
(i) For each x ∈ (a, b),
$$\phi'_+(x) \equiv \lim_{y \downarrow x} \frac{\phi(y) - \phi(x)}{(y - x)}, \qquad \phi'_-(x) \equiv \lim_{y \uparrow x} \frac{\phi(y) - \phi(x)}{(y - x)}$$
exist and are finite.
(ii) Further, $\phi'_-(\cdot) \le \phi'_+(\cdot)$ and both are nondecreasing on (a, b).
(iii) $\phi'(\cdot)$ exists except on the countable set of discontinuity points of $\phi'_+$ and $\phi'_-$.
(iv) For any a < c < d < b, φ is Lipschitz on [c, d], i.e., there exists a constant K < ∞ such that
$$|\phi(x) - \phi(y)| \le K|x - y| \quad \text{for all} \quad c \le x, y \le d.$$
(v) For any a < c, x < b,
$$\phi(x) - \phi(c) \ge \phi'_+(c)(x - c) \quad \text{and} \quad \phi(x) - \phi(c) \ge \phi'_-(c)(x - c). \tag{1.5}$$

By the mean value theorem, a sufficient condition for (1.3) and hence, for the convexity of φ is that φ be differentiable on (a, b) and $\phi'$ be nondecreasing. A further sufficient condition for this is that φ be twice differentiable on (a, b) and $\phi''$ be nonnegative. This is stated as

Proposition 3.1.8: Let φ be twice differentiable on (a, b) and $\phi''$ be nonnegative on (a, b). Then φ is convex on (a, b).

Example 3.1.1: The following functions are convex in the given intervals:
(a) $\phi(x) = |x|^p$, p ≥ 1, (−∞, ∞).
(b) $\phi(x) = e^x$, (−∞, ∞).
(c) $\phi(x) = -\log x$, (0, ∞).
(d) $\phi(x) = x \log x$, (0, ∞).
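The chord-slope condition (1.3) gives a simple numerical test that the functions of Example 3.1.1 behave as convex functions should. The sketch below checks (1.3) for every triple taken from a small set of sample points. Passing the check on finitely many points is evidence, not a proof, and the sample points, the tolerance, and the names are hypothetical choices for this illustration.

```python
import math
from itertools import combinations


def slope(phi, x, y):
    """Slope of the chord of phi over [x, y]."""
    return (phi(y) - phi(x)) / (y - x)


def satisfies_chord_condition(phi, points, tol=1e-12):
    """Check (1.3) for every triple x1 < x2 < x3 drawn from `points`."""
    pts = sorted(points)
    return all(
        slope(phi, x1, x2) <= slope(phi, x2, x3) + tol
        for x1, x2, x3 in combinations(pts, 3)
    )


def check_examples():
    """Run the check on the four functions of Example 3.1.1 (with p = 1.5 in (a))."""
    examples = {
        "|x|^1.5": (lambda x: abs(x) ** 1.5, [-2.0, -1.0, -0.25, 0.0, 0.5, 1.0, 3.0]),
        "exp(x)": (math.exp, [-3.0, -1.0, 0.0, 0.5, 2.0]),
        "-log x": (lambda x: -math.log(x), [0.1, 0.5, 1.0, 2.0, 5.0]),
        "x log x": (lambda x: x * math.log(x), [0.1, 0.5, 1.0, 2.0, 5.0]),
    }
    return {name: satisfies_chord_condition(phi, pts) for name, (phi, pts) in examples.items()}
```

A function that is not convex, such as the sine function on [0, 3], fails the same check.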

Remark 3.1.1: By Proposition 3.1.7 (iii), the convexity of φ implies that $\phi'$ exists except on a countable set. For example, the function φ(x) = |x| is convex on R; it is differentiable at all x ≠ 0. Similarly, it is easy to construct a piecewise linear convex function φ with a countable number of points where φ is not differentiable.

The following is an important inequality for convex functions.

Theorem 3.1.9: (Jensen's inequality). Let f be a measurable function on a probability space (Ω, F, P) with P(f ∈ (a, b)) = 1 for some interval (a, b), −∞ ≤ a < b ≤ ∞ and let φ : (a, b) → R be convex. Then
$$\phi\Big(\int f\, dP\Big) \le \int \phi(f)\, dP, \tag{1.6}$$
provided $\int |f|\, dP < \infty$ and $\int |\phi(f)|\, dP < \infty$.

Remark 3.1.2: In terms of random variables, this says that for any random variable X on a probability space (Ω, F, P) with P(X ∈ (a, b)) = 1 and for any function φ that is convex on (a, b),
$$\phi(EX) \le E\phi(X), \tag{1.7}$$
provided E|X| < ∞ and E|φ(X)| < ∞, where for any Borel measurable function h, $Eh(X) \equiv \int h(X)\, dP$.

Proof of Theorem 3.1.9: Let $c = \int f\, dP$. Applying (1.5), one gets
$$Y(\omega) \equiv \phi(f(\omega)) - \phi(c) - \phi'_+(c)(f(\omega) - c) \ge 0 \quad \text{a.e. } (P), \tag{1.8}$$
which, when integrated, yields $\int Y(\omega)\, P(d\omega) \ge 0$. Since $\int (f(\omega) - c)\, P(d\omega) = 0$, (1.6) follows. □

Remark 3.1.3: Suppose that equality holds in (1.6). Then, it follows that $\int Y(\omega)\, P(d\omega) = 0$. By (1.8), this implies
$$\phi(f(\omega)) - \phi(c) = \phi'_+(c)(f(\omega) - c) \quad \text{a.e. } (P).$$
Thus, if φ is a strictly convex function (i.e., strict inequality holds in (1.2) for all x, y ∈ (a, b) and 0 < λ < 1), then equality holds in (1.6) iff f(ω) = c a.e. (P).

The following are easy consequences of Jensen's inequality (Problem 3.3).

Proposition 3.1.10: Let k ≥ 1 be an integer.


(i) Let $a_1, a_2, \ldots, a_k$ be real and $p_1, p_2, \ldots, p_k$ be positive numbers such that $\sum_{i=1}^{k} p_i = 1$. Then
$$\sum_{i=1}^{k} p_i \exp(a_i) \ge \exp\Big(\sum_{i=1}^{k} p_i a_i\Big). \tag{1.9}$$
(ii) Let $b_1, b_2, \ldots, b_k$ be nonnegative numbers and $p_1, p_2, \ldots, p_k$ be as in (i). Then
$$\sum_{i=1}^{k} p_i b_i \ge \prod_{i=1}^{k} b_i^{p_i}, \tag{1.10}$$
and in particular,
$$\frac{1}{k} \sum_{i=1}^{k} b_i \ge \Big(\prod_{i=1}^{k} b_i\Big)^{\frac{1}{k}}, \tag{1.11}$$
i.e., the arithmetic mean of $b_1, \ldots, b_k$ is greater than or equal to the geometric mean of the $b_i$'s. Further, equality holds in (1.10) iff $b_1 = b_2 = \ldots = b_k$.
(iii) For any a, b real and 1 ≤ p < ∞,
$$|a + b|^p \le 2^{p-1}(|a|^p + |b|^p). \tag{1.12}$$
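The three parts of Proposition 3.1.10 are all instances of Jensen's inequality for a probability measure on finitely many points, and each can be checked numerically. The sketch below computes the gap (right side subtracted from left side, or the reverse, so that a nonnegative value means the inequality holds). The function names and the sample inputs suggested afterwards are hypothetical choices for this illustration.

```python
import math


def jensen_gap(phi, a, p):
    """sum_i p_i phi(a_i) - phi(sum_i p_i a_i); nonnegative for convex phi, cf. (1.9)."""
    mean = sum(pi * ai for pi, ai in zip(p, a))
    return sum(pi * phi(ai) for pi, ai in zip(p, a)) - phi(mean)


def am_gm_gap(b, p):
    """Weighted arithmetic mean minus weighted geometric mean, cf. (1.10)."""
    am = sum(pi * bi for pi, bi in zip(p, b))
    gm = math.prod(bi ** pi for pi, bi in zip(p, b))
    return am - gm


def power_bound_gap(a, b, p):
    """2**(p-1) * (|a|**p + |b|**p) - |a + b|**p, cf. (1.12), for p >= 1."""
    return 2.0 ** (p - 1) * (abs(a) ** p + abs(b) ** p) - abs(a + b) ** p
```

With weights `p = [0.2, 0.3, 0.5]`, the call `am_gm_gap([1.0, 4.0, 9.0], p)` is strictly positive, while `am_gm_gap([2.0, 2.0, 2.0], p)` is zero up to rounding, matching the equality statement in (ii).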

Inequality (1.10) is useful in establishing the following:

Theorem 3.1.11: (Hölder's inequality). Let (Ω, F, µ) be a measure space. Let 1 < p < ∞, $f \in L^p(\Omega, \mathcal{F}, \mu)$ and $g \in L^q(\Omega, \mathcal{F}, \mu)$ where $q = \frac{p}{(p-1)}$. Then
$$\int |fg|\, d\mu \le \Big(\int |f|^p\, d\mu\Big)^{\frac{1}{p}} \Big(\int |g|^q\, d\mu\Big)^{\frac{1}{q}}, \tag{1.13}$$
i.e.,
$$\|fg\|_1 \le \|f\|_p \|g\|_q.$$
If $\|fg\|_1 \ne 0$, then equality holds in (1.13) iff $|f|^p = c|g|^q$ a.e. (µ) for some constant c ∈ (0, ∞).

Proof: W.l.o.g. assume that $\int |f|^p\, d\mu > 0$ and $\int |g|^q\, d\mu > 0$. Fix ω ∈ Ω. Let $p_1 = \frac{1}{p}$, $p_2 = \frac{1}{q}$, $b_1 = c_1 |f(\omega)|^p$, and $b_2 = c_2 |g(\omega)|^q$, where $c_1 = (\int |f|^p\, d\mu)^{-1}$ and $c_2 = (\int |g|^q\, d\mu)^{-1}$. Then applying (1.10) with k = 2 yields
$$\frac{c_1}{p} |f(\omega)|^p + \frac{c_2}{q} |g(\omega)|^q \ge c_1^{\frac{1}{p}} c_2^{\frac{1}{q}} |f(\omega) g(\omega)|. \tag{1.14}$$
Integrating w.r.t. µ yields
$$1 \ge c_1^{\frac{1}{p}} c_2^{\frac{1}{q}} \int |f(\omega) g(\omega)|\, d\mu(\omega),$$
which is equivalent to (1.13). Next, equality in (1.13) implies equality in (1.14) a.e. (µ). Since 1 < p < ∞, by the last part of Proposition 3.1.10 (ii), this implies that $b_1 = b_2$ a.e. (µ), i.e., $|f(\omega)|^p = c_2 c_1^{-1} |g(\omega)|^q$ a.e. (µ). □

Remark 3.1.4: (Hölder's inequality for p = 1, q = ∞). Let $f \in L^1(\Omega, \mathcal{F}, \mu)$ and $g \in L^\infty(\Omega, \mathcal{F}, \mu)$. Then $|fg| \le |f|\, \|g\|_\infty$ a.e. (µ) and hence,
$$\|fg\|_1 \equiv \int |fg|\, d\mu \le \|f\|_1 \|g\|_\infty.$$
If equality holds in the above inequality, then $|f|(\|g\|_\infty - |g|) = 0$ a.e. (µ) and hence, $|g| = \|g\|_\infty$ on the set {|f| > 0} a.e. (µ).

The next two corollaries follow directly from Theorem 3.1.11. The proof is left as an exercise (Problem 3.9).

Corollary 3.1.12: (Cauchy-Schwarz inequality). Let $f, g \in L^2(\Omega, \mathcal{F}, \mu)$. Then
$$\int |fg|\, d\mu \le \Big(\int |f|^2\, d\mu\Big)^{\frac{1}{2}} \Big(\int |g|^2\, d\mu\Big)^{\frac{1}{2}}, \tag{1.15}$$
i.e.,
$$\|fg\|_1 \le \|f\|_2 \|g\|_2.$$

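Corollary 3.1.13: Let k ∈ N. Let $a_1, a_2, \ldots, a_k$, $b_1, b_2, \ldots, b_k$ be real numbers and $c_1, c_2, \ldots, c_k$ be positive real numbers.
(i) Then, for any 1 < p < ∞,
$$\sum_{i=1}^{k} |a_i b_i| c_i \le \Big(\sum_{i=1}^{k} |a_i|^p c_i\Big)^{\frac{1}{p}} \Big(\sum_{i=1}^{k} |b_i|^q c_i\Big)^{\frac{1}{q}}, \tag{1.16}$$
where $q = \frac{p}{p-1}$.
(ii)
$$\sum_{i=1}^{k} |a_i b_i| c_i \le \Big(\sum_{i=1}^{k} |a_i|^2 c_i\Big)^{\frac{1}{2}} \Big(\sum_{i=1}^{k} |b_i|^2 c_i\Big)^{\frac{1}{2}}. \tag{1.17}$$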
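Inequality (1.16) is Hölder's inequality for the measure that puts mass $c_i$ on the point i, so it is easy to evaluate both sides. The sketch below does this and can also be used to see the equality case $|a_i|^p = c|b_i|^q$ from Theorem 3.1.11. The function names are hypothetical, introduced only for this illustration.

```python
def weighted_norm(a, c, p):
    """(sum_i |a_i|**p * c_i) ** (1/p) for finite p >= 1."""
    return sum(abs(ai) ** p * ci for ai, ci in zip(a, c)) ** (1.0 / p)


def holder_sides(a, b, c, p):
    """Return (left side, right side) of (1.16) with q = p / (p - 1), 1 < p < infinity."""
    q = p / (p - 1.0)
    lhs = sum(abs(ai * bi) * ci for ai, bi, ci in zip(a, b, c))
    rhs = weighted_norm(a, c, p) * weighted_norm(b, c, q)
    return lhs, rhs
```

Taking p = 2 gives the Cauchy-Schwarz form (1.17). Choosing `b = [abs(ai) ** (p - 1) for ai in a]` makes $|b_i|^q = |a_i|^p$, and the two returned values then agree up to rounding.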

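Next, as an application of Hölder's inequality, one gets

Theorem 3.1.14: (Minkowski's inequality). Let 1 < p < ∞ and $f, g \in L^p(\Omega, \mathcal{F}, \mu)$. Then
$$\Big(\int |f + g|^p\, d\mu\Big)^{\frac{1}{p}} \le \Big(\int |f|^p\, d\mu\Big)^{\frac{1}{p}} + \Big(\int |g|^p\, d\mu\Big)^{\frac{1}{p}},$$
i.e.,
$$\|f + g\|_p \le \|f\|_p + \|g\|_p. \tag{1.18}$$

Proof: Let $h_1 = |f + g|$, $h_2 = |f + g|^{p-1}$. Then by (1.12), $|f + g|^p \le 2^{p-1}(|f|^p + |g|^p)$, implying that $h_1 \in L^p(\Omega, \mathcal{F}, \mu)$ and $h_2 \in L^q(\Omega, \mathcal{F}, \mu)$, where $q = \frac{p}{(p-1)}$. Since $|f + g|^p = h_1 h_2 \le |f| h_2 + |g| h_2$, by Hölder's inequality,
$$\int |f + g|^p\, d\mu \le \Big(\int |f|^p\, d\mu\Big)^{\frac{1}{p}} \Big(\int h_2^q\, d\mu\Big)^{\frac{1}{q}} + \Big(\int |g|^p\, d\mu\Big)^{\frac{1}{p}} \Big(\int h_2^q\, d\mu\Big)^{\frac{1}{q}}. \tag{1.19}$$
But $h_2^q = |f + g|^p$ and so (1.19) yields (1.18). □

Remark 3.1.5: Inequality (1.18) holds for both p = 1 and p = ∞.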
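Minkowski's inequality can be checked on a finite measure space, and the same computation shows why the restriction p ≥ 1 matters. In the sketch below, `p_norm` always uses the exponent 1/p, which for p < 1 is an assumption of this example and differs from the convention with exponent min{1/p, 1}. The names and the weighted-list representation of the measure are hypothetical.

```python
def p_norm(f, mu, p):
    """(sum_i |f_i|**p * mu_i) ** (1/p) for finite p > 0; for p = inf, the largest
    |f_i| over points of positive mass (the essential supremum)."""
    if p == float("inf"):
        return max(abs(fi) for fi, mi in zip(f, mu) if mi > 0)
    return sum(abs(fi) ** p * mi for fi, mi in zip(f, mu)) ** (1.0 / p)


def minkowski_gap(f, g, mu, p):
    """||f||_p + ||g||_p - ||f + g||_p; nonnegative whenever 1 <= p <= inf."""
    total = [fi + gi for fi, gi in zip(f, g)]
    return p_norm(f, mu, p) + p_norm(g, mu, p) - p_norm(total, mu, p)
```

For f = (1, 0), g = (0, 1) and counting measure, `minkowski_gap` is 0 at p = 1 and positive at p = 2 and p = ∞, but at p = 0.5 it equals 2 − 4 = −2, so the triangle inequality fails for the 1/p-power expression when p < 1.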

3.2 Lp-Spaces

3.2.1 Basic properties

Let (Ω, F, µ) be a measure space. Recall the definition of $L^p(\Omega, \mathcal{F}, \mu)$, 0 < p ≤ ∞, from Section 2.5 as the set of all measurable functions f on (Ω, F, µ) such that $\|f\|_p < \infty$, where for 0 < p < ∞,
$$\|f\|_p = \Big(\int |f|^p\, d\mu\Big)^{\min\{\frac{1}{p}, 1\}}$$
and for p = ∞,
$$\|f\|_\infty \equiv \sup\{k : \mu(\{|f| > k\}) > 0\}$$
(called the essential supremum of f). In this section and elsewhere, $L^p(\mu)$ denotes $L^p(\Omega, \mathcal{F}, \mu)$. The following proposition shows that $L^p(\mu)$ is a vector space over R.

Proposition 3.2.1: For 0 < p ≤ ∞,
$$f, g \in L^p(\mu),\ a, b \in \mathbb{R} \ \Rightarrow\ af + bg \in L^p(\mu). \tag{2.1}$$

Proof: Case 1: 0 < p ≤ 1. For any two positive numbers x, y,
$$\Big(\frac{x}{x + y}\Big)^p + \Big(\frac{y}{x + y}\Big)^p \ge \frac{x}{x + y} + \frac{y}{x + y} = 1.$$
Hence, for all x, y ∈ (0, ∞),
$$(x + y)^p \le x^p + y^p. \tag{2.2}$$
It is easy to check that (2.2) continues to hold if x, y ∈ [0, ∞). This yields $|af + bg|^p \le |a|^p |f|^p + |b|^p |g|^p$, which, in turn, yields (2.1) by integration.

Case 2: 1 < p < ∞. By (1.12), $|af + bg|^p \le 2^{p-1}(|af|^p + |bg|^p)$. Integrating both sides of the above inequality yields (2.1).

Case 3: p = ∞. By definition, there exist constants $K_1 < \infty$ and $K_2 < \infty$ such that $\mu(\{|f| > K_1\}) = 0 = \mu(\{|g| > K_2\})$. This implies that µ({|af + bg| > K}) = 0 for any $K > |a| K_1 + |b| K_2$. Hence, $af + bg \in L^\infty(\mu)$ and $\|af + bg\|_\infty \le |a|\, \|f\|_\infty + |b|\, \|g\|_\infty$. □

Recall that a set S with a function d : S × S → [0, ∞) is called a metric space if for all x, y, z ∈ S,
(i) d(x, y) = d(y, x) (symmetry)
(ii) d(x, y) ≤ d(x, z) + d(y, z) (triangle inequality)
(iii) d(x, y) = 0 iff x = y
and the function d(·, ·) is called a metric on S. Some examples are
(a) $S = \mathbb{R}^k$ with $d(x, y) = \big(\sum_{i=1}^{k} |x_i - y_i|^2\big)^{\frac{1}{2}}$;
(b) S = C[0, 1], the space of all continuous functions on [0, 1] with d(f, g) = sup{|f(x) − g(x)| : 0 ≤ x ≤ 1};
(c) S = a nonempty set, and d(x, y) = 1 if x ≠ y and 0 if x = y. (This d(·, ·) is called the discrete metric on S.)

The $L^p$-norm $\|\cdot\|_p$ can be used to introduce a distance notion in $L^p(\mu)$ for 0 < p ≤ ∞.

Definition 3.2.1: For $f, g \in L^p(\mu)$, 0 < p ≤ ∞, let
$$d_p(f, g) \equiv \|f - g\|_p. \tag{2.3}$$

Note that, for any $f, g, h \in L^p(\mu)$ and 1 ≤ p ≤ ∞,
(i) $d_p(f, g) = d_p(g, f) \ge 0$ (nonnegativity and symmetry), and
(ii) $d_p(f, h) \le d_p(f, g) + d_p(g, h)$ (triangle inequality),


which follows by Minkowski's inequality (Theorem 3.1.14) for 1 ≤ p < ∞, and by Proposition 3.2.1 for p = ∞. However, $d_p(f, g) = 0$ implies only that f = g a.e. (µ). Thus, $d_p(\cdot, \cdot)$ of (2.3) satisfies conditions (i) and (ii) of being a metric and it satisfies condition (iii) as well, provided any two functions f and g that agree a.e. (µ) are regarded as the same element of $L^p(\mu)$. This leads to the following:

Definition 3.2.2: For f, g in $L^p(\mu)$, f is called equivalent to g and is written as f ∼ g, if f = g a.e. (µ).

It is easy to verify that the relation ∼ of Definition 3.2.2 is an equivalence relation, i.e., it satisfies
(i) f ∼ f for all f in $L^p(\mu)$ (reflexive)
(ii) f ∼ g ⇒ g ∼ f (symmetry)
(iii) f ∼ g, g ∼ h ⇒ f ∼ h (transitive).

This equivalence relation ∼ divides $L^p(\mu)$ into disjoint equivalence classes such that in each class all elements are equivalent. The notion of distance between these classes may be defined as follows:
$$d_p([f], [g]) \equiv d_p(f, g),$$
where [f] and [g] denote, respectively, the equivalence classes of functions containing f and g. It can be verified that this is a metric on the set of equivalence classes. In what follows, the equivalence class [f] is identified with the element f. With this identification, $(L^p(\mu), d_p(\cdot, \cdot))$ becomes a metric space for 1 ≤ p ≤ ∞.

Remark 3.2.1: For 0 < p < 1, if one defines
$$d_p(f, g) \equiv \int |f - g|^p\, d\mu, \tag{2.4}$$
then $(L^p(\mu), d_p)$ becomes a metric space (with the same identification as above of functions with their equivalence classes). The triangle inequality follows from (2.2).

Recall that a metric space (S, d) is called complete if every Cauchy sequence in (S, d) converges to an element in S, i.e., if $\{x_n\}_{n \ge 1}$ is a sequence in S such that for every ε > 0, there exists an N such that $n, m \ge N \Rightarrow d(x_n, x_m) \le \epsilon$, then there exists an element x in (S, d) such that $\lim_{n \to \infty} d(x_n, x) = 0$. The next step is to establish the completeness of $L^p(\mu)$.

Theorem 3.2.2: For 0 < p ≤ ∞, $(L^p(\mu), d_p(\cdot, \cdot))$ is complete, where $d_p$ is as in (2.3).


Proof: Let $\{f_n\}_{n \ge 1}$ be a Cauchy sequence in $L^p(\mu)$ for 0 < p < ∞. The main steps in the proof are as follows:
(I) there exists a subsequence $\{n_k\}_{k \ge 1}$ such that $\{f_{n_k}\}_{k \ge 1}$ converges a.e. (µ) to a limit function f;
(II) $\lim_{k \to \infty} d_p(f_{n_k}, f) = 0$;
(III) $\lim_{n \to \infty} d_p(f_n, f) = 0$.

Step (I): Let $\{\epsilon_k\}_{k \ge 1}$ and $\{\delta_k\}_{k \ge 1}$ be sequences of positive numbers decreasing to zero. Since $\{f_n\}_{n \ge 1}$ is Cauchy, for each k ≥ 1, there exists an integer $n_k$ such that
$$\int |f_n - f_m|^p\, d\mu \le \epsilon_k \quad \text{for all } n, m \ge n_k. \tag{2.5}$$
W.l.o.g., let $n_{k+1} > n_k$ for each k ≥ 1. Then, by Markov's inequality (Theorem 3.1.1),
$$\mu(\{|f_{n_{k+1}} - f_{n_k}| \ge \delta_k\}) \le \delta_k^{-p} \int |f_{n_{k+1}} - f_{n_k}|^p\, d\mu \le \delta_k^{-p} \epsilon_k. \tag{2.6}$$
Let $A_k = \{|f_{n_{k+1}} - f_{n_k}| \ge \delta_k\}$, k = 1, 2, ... and $A = \limsup_{k \to \infty} A_k \equiv \bigcap_{j=1}^{\infty} \bigcup_{k \ge j} A_k$. If $\{\epsilon_k\}_{k \ge 1}$ and $\{\delta_k\}_{k \ge 1}$ satisfy
$$\sum_{k=1}^{\infty} \delta_k^{-p} \epsilon_k < \infty, \tag{2.7}$$
then by (2.6), $\sum_{k=1}^{\infty} \mu(A_k) < \infty$ and hence, as in the proof of Theorem 2.5.2, µ(A) = 0. Note that for ω in $A^c$, $|f_{n_{k+1}}(\omega) - f_{n_k}(\omega)| < \delta_k$ for all k large. Thus, if $\sum_{k=1}^{\infty} \delta_k < \infty$, then for ω in $A^c$, $\{f_{n_k}(\omega)\}_{k \ge 1}$ is a Cauchy sequence in R and hence, it converges to some f(ω) in R. Setting f(ω) = 0 for ω ∈ A, one gets
$$\lim_{k \to \infty} f_{n_k} = f \quad \text{a.e. } (\mu).$$
A choice of $\{\epsilon_k\}_{k \ge 1}$ and $\{\delta_k\}_{k \ge 1}$ such that $\sum_{k=1}^{\infty} \delta_k < \infty$ and (2.7) holds is given by $\epsilon_k = 2^{-(p+1)k}$ and $\delta_k = 2^{-k}$. This completes Step (I).

Step (II): By Fatou's lemma, part (I), and (2.5), for any k ≥ 1 fixed,
$$\epsilon_k \ge \liminf_{j \to \infty} \int |f_{n_k} - f_{n_{k+j}}|^p\, d\mu \ge \int |f_{n_k} - f|^p\, d\mu.$$
Since $f_{n_k} \in L^p(\mu)$, this shows that $f \in L^p(\mu)$. Now, on letting k → ∞, (II) follows.

Step (III): By the triangle inequality, for any k ≥ 1 fixed,
$$d_p(f_n, f) \le d_p(f_n, f_{n_k}) + d_p(f_{n_k}, f).$$
By (2.5) and (II), for $n \ge n_k$, the right side above is $\le 2\tilde{\epsilon}_k$, where $\tilde{\epsilon}_k = \epsilon_k$ if 0 < p < 1 and $\tilde{\epsilon}_k = \epsilon_k^{1/p}$ if 1 ≤ p < ∞ (recall that $d_p(f, g) = \int |f - g|^p\, d\mu$ for 0 < p < 1 and $d_p(f, g) = (\int |f - g|^p\, d\mu)^{1/p}$ for 1 ≤ p < ∞). Now letting k → ∞, (III) follows. The proof of Theorem 3.2.2 is complete for 0 < p < ∞. The case p = ∞ is left as an exercise (Problem 3.14). □
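The choice $\epsilon_k = 2^{-(p+1)k}$, $\delta_k = 2^{-k}$ made in Step (I) works because each term of the series in (2.7) reduces to $\delta_k^{-p} \epsilon_k = 2^{pk} \cdot 2^{-(p+1)k} = 2^{-k}$, so both series are geometric with sum 1. The short sketch below confirms this numerically for a few values of p. The function name and the truncation at 50 terms are hypothetical choices for this illustration.

```python
def step_one_partial_sums(p, terms=50):
    """Partial sums of sum_k delta_k**(-p) * eps_k (the series in (2.7)) and of
    sum_k delta_k, for eps_k = 2**(-(p+1)k) and delta_k = 2**(-k)."""
    ks = range(1, terms + 1)
    eps = [2.0 ** (-(p + 1) * k) for k in ks]
    delta = [2.0 ** (-k) for k in ks]
    series_27 = sum(d ** (-p) * e for d, e in zip(delta, eps))
    series_delta = sum(delta)
    return series_27, series_delta
```

Both returned values are within rounding error of 1 for any p in (0, ∞), so the two summability requirements of Step (I) hold simultaneously.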

3.2.2 Dual spaces

Let 1 ≤ p < ∞. Let $g \in L^q(\mu)$, where $q = \frac{p}{(p-1)}$ if 1 < p < ∞ and q = ∞ if p = 1. Let
$$T_g(f) = \int fg\, d\mu, \quad f \in L^p(\mu). \tag{2.8}$$
By Hölder's inequality, $\int |fg|\, d\mu < \infty$ and so $T_g(\cdot)$ is well defined. Clearly $T_g$ is linear, i.e.,
$$T_g(\alpha_1 f_1 + \alpha_2 f_2) = \alpha_1 T_g(f_1) + \alpha_2 T_g(f_2) \tag{2.9}$$
for all $\alpha_1, \alpha_2 \in \mathbb{R}$ and $f_1, f_2 \in L^p(\mu)$.

Definition 3.2.3:
(a) A function $T : L^p(\mu) \to \mathbb{R}$ that satisfies (2.9) is called a linear functional.
(b) A linear functional T on $L^p(\mu)$ is called bounded if there is a constant c ∈ (0, ∞) such that
$$|T(f)| \le c \|f\|_p \quad \text{for all } f \in L^p(\mu).$$
(c) The norm of a bounded linear functional T on $L^p(\mu)$ is defined as
$$\|T\| = \sup\{|Tf| : f \in L^p(\mu),\ \|f\|_p = 1\}.$$

By Hölder's inequality (cf. Theorem 3.1.11 and Remark 3.1.4),
$$|T_g(f)| \le \|g\|_q \|f\|_p \quad \text{for all } f \in L^p(\mu),$$
and hence, $T_g$ is a bounded linear functional on $L^p(\mu)$. This implies that if $d_p(f_n, f) \to 0$, then
$$|T_g(f_n) - T_g(f)| \le \|g\|_q\, d_p(f_n, f) \to 0,$$
i.e., $T_g$ is continuous on the metric space $(L^p(\mu), d_p)$.


Definition 3.2.4: The set of all continuous linear functionals on $L^p(\mu)$ is called the dual space of $L^p(\mu)$ and is denoted by $(L^p(\mu))^*$.

In the next section, it will be shown that continuity of a linear functional on $L^p(\mu)$ implies boundedness. A natural question is whether every continuous linear functional T on $L^p(\mu)$ coincides with $T_g$ for some g in $L^q(\mu)$. The answer is "yes" for 1 ≤ p < ∞, as shown by the following result.

Theorem 3.2.3: (Riesz representation theorem). Let 1 ≤ p < ∞. Let $T : L^p(\mu) \to \mathbb{R}$ be linear and continuous. Then, there exists a g in $L^q(\mu)$ such that $T = T_g$, i.e.,
$$T(f) = T_g(f) \equiv \int fg\, d\mu \quad \text{for all } f \in L^p(\mu), \tag{2.10}$$
where $q = \frac{p}{p-1}$ for 1 < p < ∞ and q = ∞ if p = 1.

Remark 3.2.2: Such a representation is not valid for p = ∞. That is, there exist continuous linear functionals T on $L^\infty(\mu)$ for which there is no $g \in L^1(\mu)$ such that $T(f) = \int fg\, d\mu$ for all $f \in L^\infty(\mu)$, provided µ is not concentrated on a finite set $\{\omega_1, \omega_2, \ldots, \omega_k\} \subset \Omega$. For a proof of Theorem 3.2.3 and the above remark, see Royden (1988) or Rudin (1987).

Next consider the mapping from $(L^p(\mu))^*$ to $L^q(\mu)$ defined by $\phi(T_g) = g$, where $T_g$ is as defined in (2.10). Then, φ is linear, i.e., $\phi(\alpha_1 T_1 + \alpha_2 T_2) = \alpha_1 \phi(T_1) + \alpha_2 \phi(T_2)$ for all $\alpha_1, \alpha_2 \in \mathbb{R}$ and $T_1, T_2 \in (L^p(\mu))^*$. Further,
$$\|\phi(T)\|_q = \|T\| \quad \text{for all } T \in (L^p(\mu))^*.$$
Thus, φ preserves the vector space structure of $(L^p(\mu))^*$ and the norm. For this reason, it is called an isometry between $(L^p(\mu))^*$ and $L^q(\mu)$.
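The norm identity $\|T_g\| = \|g\|_q$ can be seen concretely when Ω is finite and 1 < p < ∞. Hölder's inequality gives $|T_g(f)| \le \|g\|_q \|f\|_p$, and the function $f = \mathrm{sign}(g)|g|^{q-1}$ attains equality, since then $|f|^p = |g|^q$. The sketch below evaluates both sides. The helper names and the weighted-list representation of µ are hypothetical, introduced only for this illustration.

```python
import math


def p_norm(f, mu, p):
    """(sum_i |f_i|**p * mu_i) ** (1/p) for 1 <= p < infinity."""
    return sum(abs(fi) ** p * mi for fi, mi in zip(f, mu)) ** (1.0 / p)


def T(g, f, mu):
    """T_g(f): the integral of f * g with respect to the discrete measure mu."""
    return sum(gi * fi * mi for gi, fi, mi in zip(g, f, mu))


def extremal_f(g, q):
    """f = sign(g) * |g|**(q - 1), for which |T_g(f)| = ||g||_q * ||f||_p."""
    return [math.copysign(abs(gi) ** (q - 1), gi) for gi in g]
```

With p = 3 and q = 1.5, dividing `T(g, extremal_f(g, q), mu)` by `p_norm(extremal_f(g, q), mu, p)` returns `p_norm(g, mu, q)` up to rounding, while any other f gives a ratio no larger in absolute value.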

3.3 Banach and Hilbert spaces 3.3.1

Banach spaces

If $(\Omega, \mathcal{F}, \mu)$ is a measure space, it was seen in the previous section that the space $L^p(\Omega, \mathcal{F}, \mu)$ of equivalence classes of functions $f$ with $\int_\Omega |f|^p\,d\mu < \infty$ is a vector space over $\mathbb{R}$ for all $1 \le p < \infty$, and that for $p \ge 1$, $\|f\|_p \equiv \big(\int |f|^p\,d\mu\big)^{1/p}$ satisfies

(i) $\|f + g\|_p \le \|f\|_p + \|g\|_p$,
(ii) $\|\alpha f\|_p = |\alpha|\,\|f\|_p$ for every $\alpha \in \mathbb{R}$,
(iii) $\|f\|_p = 0$ iff $f = 0$ a.e. $(\mu)$.

The Euclidean space $\mathbb{R}^k$ for any $k \in \mathbb{N}$ is also a vector space. Note that for $p \ge 1$, setting $\|x\|_p \equiv \big(\sum_{i=1}^k |x_i|^p\big)^{1/p}$ if $x = (x_1, x_2, \ldots, x_k)$, $(\mathbb{R}^k, \|\cdot\|_p)$ may be identified with a special case of $L^p(\Omega, \mathcal{F}, \mu)$, where $\Omega \equiv \{1, 2, \ldots, k\}$, $\mathcal{F} = \mathcal{P}(\Omega)$ and $\mu$ is the counting measure. Generalizing the above examples leads to the notion of a normed vector space (also called normed linear space).

Recall that a vector space $V$ over $\mathbb{R}$ is a nonempty set with a binary operation $+$, a function from $V \times V$ to $V$ (called addition), and scalar multiplication by the real numbers, i.e., a function from $\mathbb{R} \times V \to V$, $(\alpha, v) \mapsto \alpha v$, satisfying

(i) $v_1, v_2 \in V \Rightarrow v_1 + v_2 = v_2 + v_1 \in V$.
(ii) $v_1, v_2, v_3 \in V \Rightarrow (v_1 + v_2) + v_3 = v_1 + (v_2 + v_3)$.
(iii) There exists an element $\theta$, called the zero vector, in $V$ such that $v + \theta = v$ for all $v$ in $V$.
(iv) $\alpha \in \mathbb{R}$, $v \in V \Rightarrow \alpha v \in V$.
(v) $\alpha \in \mathbb{R}$, $v_1, v_2 \in V \Rightarrow \alpha(v_1 + v_2) = \alpha v_1 + \alpha v_2$.
(vi) $\alpha_1, \alpha_2 \in \mathbb{R}$, $v \in V \Rightarrow (\alpha_1 + \alpha_2)v = \alpha_1 v + \alpha_2 v$ and $\alpha_1(\alpha_2 v) = (\alpha_1 \alpha_2)v$.
(vii) $v \in V \Rightarrow 0v = \theta$ and $1v = v$.

Note that from conditions (vi) and (vii) above, it follows that for any $v$ in $V$, $v + (-1)v = 0 \cdot v = \theta$. Thus for any $v$ in $V$, $(-1)v$ is the additive inverse and is denoted by $-v$. Conditions (i), (ii), and (iii) are called respectively commutativity, associativity, and the existence of an additive identity. Thus $V$ under the operation $+$ is an Abelian (i.e., commutative) group.

Definition 3.3.1: A function $f$ from $V$ to $\mathbb{R}_+$ denoted by $f(v) \equiv \|v\|$ is called a norm if

(a) $v_1, v_2 \in V \Rightarrow \|v_1 + v_2\| \le \|v_1\| + \|v_2\|$ (triangle inequality)
(b) $\alpha \in \mathbb{R}$, $v \in V \Rightarrow \|\alpha v\| = |\alpha|\,\|v\|$ (scalar homogeneity)

(c) $\|v\| = 0$ iff $v = \theta$.

A vector space $V$ with a norm $\|\cdot\|$ defined on it is called a normed vector space or normed linear space and is denoted as $(V, \|\cdot\|)$. Let $d(v_1, v_2) \equiv \|v_1 - v_2\|$, $v_1, v_2 \in V$. Then from the definition of $\|\cdot\|$, it follows that $d$ is a metric on $V$, i.e., $(V, d)$ is a metric space. Recall that a metric space $(S, d)$ is called complete if every Cauchy sequence $\{x_n\}_{n \ge 1}$ in $S$ converges to an element $x$ in $S$.

Definition 3.3.2: A Banach space is a complete normed linear space $(V, \|\cdot\|)$.

It was shown by S. Banach of Poland that all $L^p(\Omega, \mathcal{B}, \mu)$ spaces are Banach spaces, provided $p \ge 1$, and in particular, all Euclidean spaces are Banach spaces. An example of a different kind is the space $C[0, 1]$ of all real valued continuous functions on $[0, 1]$ with the usual operations of pointwise addition and scalar multiplication, i.e., $(f + g)(x) = f(x) + g(x)$ and $(\alpha f)(x) = \alpha \cdot f(x)$ for all $\alpha \in \mathbb{R}$, $0 \le x \le 1$, $f, g \in C[0, 1]$, where the norm (called the supnorm) is defined by $\|f\| = \sup\{|f(x)| : 0 \le x \le 1\}$. The verification of the fact that $C[0, 1]$ with the supnorm is a Banach space is left as an exercise (Problem 3.22). The space $\mathcal{P}$ of all polynomials on $[0, 1]$ is also a normed linear space under the above norm but $(\mathcal{P}, \|\cdot\|)$ is not complete (Problem 3.23). However, for each $n \in \mathbb{N}$, the space $\mathcal{P}_n$ of all polynomials on $[0, 1]$ of degree $\le n$ is a Banach space under the supnorm (Problem 3.26).

Definition 3.3.3: Let $V$ be a vector space. A subset $W \subset V$ is called a subspace of $V$ if $v_1, v_2 \in W$, $\alpha_1, \alpha_2 \in \mathbb{R} \Rightarrow \alpha_1 v_1 + \alpha_2 v_2 \in W$. If $(V, \|\cdot\|)$ is a normed vector space and $W$ is a subspace of $V$, then $(W, \|\cdot\|)$ is also a normed vector space. If $W$ is closed in $(V, \|\cdot\|)$, then $W$ is called a closed subspace of $V$.

Remark 3.3.1: If $(V, \|\cdot\|)$ is a Banach space and $W$ is a closed subspace of $V$, then $(W, \|\cdot\|)$ is also a Banach space.
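As a small numerical illustration of the norm axioms on $\mathbb{R}^k$ with counting measure, the following sketch checks the triangle inequality for a few sample vectors and a few values of $p \ge 1$, and shows that it can fail for $p = 1/2$. The helper names `p_norm` and `triangle_holds` and the sample vectors are illustrative assumptions, not part of the text; a finite check of this kind is of course not a proof.

```python
import itertools

def p_norm(x, p):
    """Return (sum_i |x_i|^p)^(1/p) for a finite sequence x and p > 0."""
    return sum(abs(t) ** p for t in x) ** (1.0 / p)

def triangle_holds(x, y, p, tol=1e-12):
    """Check ||x + y||_p <= ||x||_p + ||y||_p up to a rounding tolerance."""
    s = [a + b for a, b in zip(x, y)]
    return p_norm(s, p) <= p_norm(x, p) + p_norm(y, p) + tol

# Hypothetical sample vectors in R^2.
vectors = [(1.0, 0.0), (0.0, 1.0), (3.0, -4.0), (-2.0, 0.5)]

# For p >= 1 the triangle inequality holds on every sampled pair.
holds_for_p_ge_1 = all(
    triangle_holds(x, y, p)
    for p in (1, 2, 3)
    for x, y in itertools.product(vectors, repeat=2)
)

# For p = 1/2 the unit coordinate vectors give (1 + 1)^2 = 4 > 1 + 1.
fails_for_half = not triangle_holds((1.0, 0.0), (0.0, 1.0), 0.5)
```

This is consistent with the restriction $p \ge 1$ in the statement that $\|\cdot\|_p$ is a norm.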

3.3.2 Linear transformations

Let $(V_i, \|\cdot\|_i)$, $i = 1, 2$, be two normed linear spaces over $\mathbb{R}$.

Definition 3.3.4: A function $T$ from $V_1$ to $V_2$ is called a linear transformation or linear operator if $\alpha_1, \alpha_2 \in \mathbb{R}$, $x, y \in V_1 \Rightarrow T(\alpha_1 x + \alpha_2 y) = \alpha_1 T(x) + \alpha_2 T(y)$.

Definition 3.3.5: A linear operator $T$ from $(V_1, \|\cdot\|_1)$ to $(V_2, \|\cdot\|_2)$ is called bounded if $\|T\| \equiv \sup\{\|Tx\|_2 : \|x\|_1 < 1\} < \infty$, i.e., the image of the unit ball in $(V_1, \|\cdot\|_1)$ is contained in a ball of finite radius centered at the zero in $V_2$.

Here is a summary of some important results on this topic. By linearity of $T$, $T\big(\frac{x}{\|x\|_1}\big) = \frac{1}{\|x\|_1} T(x)$ for any $x \ne 0$. It follows that $T$ is bounded iff there exists $k < \infty$ such that for any $x \in V_1$, $\|Tx\|_2 \le k\|x\|_1$. Clearly, then $k$ can be taken to be $\|T\|$. Also by linearity, if $T$ is bounded, then $\|Tx_1 - Tx_2\| = \|T(x_1 - x_2)\| \le \|T\|\,\|x_1 - x_2\|$ and so the map $T$ is continuous (indeed, uniformly so). It turns out that if a linear operator $T$ is continuous at some $x_0$ in $V_1$, then $T$ is continuous on all of $V_1$ and is bounded (Problem 3.28 (a)).

Now let $B(V_1, V_2)$ be the space of all bounded linear operators from $(V_1, \|\cdot\|_1)$ to $(V_2, \|\cdot\|_2)$. For $T_1, T_2 \in B(V_1, V_2)$, $\alpha_1, \alpha_2$ in $\mathbb{R}$, let $(\alpha_1 T_1 + \alpha_2 T_2)$ be defined by $(\alpha_1 T_1 + \alpha_2 T_2)(x) \equiv \alpha_1 T_1(x) + \alpha_2 T_2(x)$ for all $x$ in $V_1$. Then it can be verified that $(\alpha_1 T_1 + \alpha_2 T_2)$ also belongs to $B(V_1, V_2)$ and

$$\|T\| \equiv \sup\{\|Tx\|_2 : \|x\|_1 \le 1\} \tag{3.1}$$

is a norm on $B(V_1, V_2)$. Thus $(B(V_1, V_2), \|\cdot\|)$ is also a normed linear space. If $(V_2, \|\cdot\|_2)$ is complete, then it can be shown that $(B(V_1, V_2), \|\cdot\|)$ is also a Banach space (Problem 3.28 (b)). In particular, if $(V_2, \|\cdot\|_2)$ is the real line, the space $(B(V_1, \mathbb{R}), \|\cdot\|)$ is a Banach space.
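To make the operator norm in (3.1) concrete, here is a minimal sketch for a linear map on $\mathbb{R}^2$ given by a matrix, with the supnorm on both the domain and the range. Because $x \mapsto \|Ax\|_\infty$ is convex, its supremum over the cube $\{\|x\|_\infty \le 1\}$ is attained at a vertex, so it suffices to check the sign vectors. The matrix `A` and the function names are illustrative assumptions.

```python
import itertools

def sup_norm(v):
    """Max-norm of a finite vector."""
    return max(abs(t) for t in v)

def apply(A, x):
    """Matrix-vector product, with A given as a list of rows."""
    return [sum(a * t for a, t in zip(row, x)) for row in A]

def operator_norm_sup(A):
    """sup{ ||Ax||_inf : ||x||_inf <= 1 }, evaluated at the vertices of the cube."""
    k = len(A[0])
    vertices = itertools.product((-1.0, 1.0), repeat=k)
    return max(sup_norm(apply(A, s)) for s in vertices)

# Hypothetical example operator on R^2.
A = [[1.0, -2.0], [3.0, 0.5]]
norm_A = operator_norm_sup(A)

# For the supnorm this agrees with the largest absolute row sum.
max_row_sum = max(sum(abs(a) for a in row) for row in A)

# The bound ||Ax|| <= ||A|| ||x|| on one sample point.
x = (0.3, -0.7)
bound_ok = sup_norm(apply(A, x)) <= norm_A * sup_norm(x) + 1e-12
```

The agreement with the maximal absolute row sum is specific to the supnorm; other norms on $\mathbb{R}^k$ give different values of $\|T\|$ for the same matrix.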

3.3.3 Dual spaces

Definition 3.3.6: The space of all bounded linear functions from $(V_1, \|\cdot\|_1)$ to $\mathbb{R}$ (also called bounded linear functionals), denoted by $V_1^*$, is called the dual space of $V_1$.

Thus, for any normed linear space $(V_1, \|\cdot\|_1)$ (that need not be complete), the dual space $(V_1^*, \|\cdot\|)$ is always a Banach space, where $\|T\| \equiv \sup\{|Tx| : \|x\|_1 < 1\}$ for $T \in V_1^*$. If $(V_1, \|\cdot\|_1) = L^p(\Omega, \mathcal{F}, \mu)$ for some measure space $(\Omega, \mathcal{F}, \mu)$ and $1 \le p < \infty$, by the Riesz representation theorem (see Theorem 3.2.3), the dual space may be identified with $L^q(\Omega, \mathcal{F}, \mu)$ where $q$ is the conjugate of $p$, i.e., $\frac{1}{p} + \frac{1}{q} = 1$. However, as pointed out earlier in Section 3.2, this is not true for $p = \infty$. That is, the dual of $L^\infty(\Omega, \mathcal{F}, \mu)$ is not $L^1(\Omega, \mathcal{F}, \mu)$ unless $(\Omega, \mathcal{F}, \mu)$ is a measure space where $\Omega$ is a finite set $\{w_1, w_2, \ldots, w_k\}$ and $\mathcal{F} = \mathcal{P}(\Omega)$. An example for the $p = \infty$ case can be constructed for the space $\ell_\infty$ of all bounded sequences of real numbers (cf. Royden (1988)). The representation of the dual space of the Banach space $C[0, 1]$ with supnorm is in terms of finite signed measures (cf. Section 4.2).

Theorem 3.3.1: (Riesz). Let $T : C[0, 1] \to \mathbb{R}$ be linear and bounded. Then there exist two finite measures $\mu_1$ and $\mu_2$ on $[0, 1]$ such that for any $f \in C[0, 1]$,

$$T(f) = \int f\,d\mu_1 - \int f\,d\mu_2.$$

For a proof see Royden (1988) or Rudin (1987) (see also Problem 3.27).

3.3.4 Hilbert space

A vector space $V$ over $\mathbb{R}$ is called a real innerproduct space if there exists a function $f : V \times V \to \mathbb{R}$, denoted by $f(x, y) \equiv \langle x, y \rangle$ (and called the innerproduct), that satisfies

(i) $\langle x, y \rangle = \langle y, x \rangle$ for all $x, y \in V$,
(ii) (linearity) $\langle \alpha_1 x_1 + \alpha_2 x_2, y \rangle = \alpha_1 \langle x_1, y \rangle + \alpha_2 \langle x_2, y \rangle$ for all $\alpha_1, \alpha_2 \in \mathbb{R}$, $x_1, x_2, y \in V$,
(iii) $\langle x, x \rangle \ge 0$ for all $x \in V$ and $\langle x, x \rangle = 0$ iff $x = \theta$, the zero vector of $V$.

Using the fact that the quadratic function $\varphi(t) = \langle x + ty, x + ty \rangle = \langle x, x \rangle + 2t\langle x, y \rangle + t^2 \langle y, y \rangle$ is nonnegative for all $t \in \mathbb{R}$, one gets the Cauchy-Schwarz inequality

$$|\langle x, y \rangle| \le \sqrt{\langle x, x \rangle \langle y, y \rangle} \quad \text{for all } x, y \in V.$$

Now setting $\|x\| = \sqrt{\langle x, x \rangle}$ and using the Cauchy-Schwarz inequality, one verifies that $\|x\|$ is a norm on $V$ and thus $(V, \|\cdot\|)$ is a normed linear space. Further, the function $\langle x, y \rangle$ from $V \times V$ to $\mathbb{R}$ is continuous (Problem 3.29) under the norm $\|(x_1, x_2)\| = \|x_1\| + \|x_2\|$, $(x_1, x_2) \in V \times V$.

Definition 3.3.7: Let $(V, \langle \cdot, \cdot \rangle)$ be a real innerproduct space. It is called a Hilbert space if $(V, \|\cdot\|)$ is a Banach space, i.e., if it is complete.

It was seen in Section 3.2 that for any measure space $(\Omega, \mathcal{F}, \mu)$, the space $L^2(\Omega, \mathcal{F}, \mu)$ of all equivalence classes of functions $f : \Omega \to \mathbb{R}$ satisfying $\int |f|^2\,d\mu < \infty$ is a complete innerproduct space with the innerproduct $\langle f, g \rangle = \int fg\,d\mu$ and hence a Hilbert space. It turns out that every Hilbert space $H$ is an $L^2(\Omega, \mathcal{F}, \mu)$ for some $(\Omega, \mathcal{F}, \mu)$. (The axiom of choice or its equivalent, Hausdorff's maximality principle, is required for a proof of this. See Rudin (1987).) This is in contrast to the Banach space case, where every $L^p(\Omega, \mathcal{F}, \mu)$ with $p \ge 1$ is a Banach space but not conversely, i.e., a Banach space need not be an $L^p(\Omega, \mathcal{F}, \mu)$.

Next, for each $x$ in a Hilbert space $H$, let $T_x : H \to \mathbb{R}$ be defined by $T_x(y) = \langle x, y \rangle$. By the defining properties of $\langle x, y \rangle$ and the Cauchy-Schwarz inequality, it is easy to verify that $T_x$ is a bounded linear function on $H$, i.e.,

$$T_x(\alpha_1 y_1 + \alpha_2 y_2) = \alpha_1 T_x(y_1) + \alpha_2 T_x(y_2) \quad \text{for all } \alpha_1, \alpha_2 \in \mathbb{R},\ y_1, y_2 \in H \tag{3.2}$$

and

$$|T_x(y)| \le \|x\|\,\|y\| \quad \text{for all } y \in H. \tag{3.3}$$

Thus $T_x \in H^*$, the dual space. It is an important result (see Theorem 3.3.3 below) that every $T \in H^*$ is equal to $T_x$ for some $x$ in $H$ and $\|T\| = \|x\|$. Thus $H^*$ can be identified with $H$.


Definition 3.3.8: Let $(V, \langle \cdot, \cdot \rangle)$ be an inner product space. Two vectors $x, y$ in $V$ are said to be orthogonal, written as $x \perp y$, if $\langle x, y \rangle = 0$. A collection $B \subset V$ is called orthogonal if $x, y \in B$, $x \ne y \Rightarrow \langle x, y \rangle = 0$. The collection $B$ is called orthonormal if it is orthogonal and in addition for all $x$ in $B$, $\|x\| = 1$.

Note that if $x \perp y$, then $\|x - y\|^2 = \langle x - y, x - y \rangle = \langle x, x \rangle + \langle y, y \rangle = \|x\|^2 + \|y\|^2$ and so if $B$ is an orthonormal set, then $x, y \in B \Rightarrow$ either $x = y$ or $\|x - y\| = \sqrt{2}$. Thus, if $V$ is separable under the metric $d(x, y) = \|x - y\|$ (i.e., there exists a countable set $D \subset V$ such that for every $x$ in $V$ and $\epsilon > 0$, there exists a $d \in D$ such that $\|x - d\| < \epsilon$) and if $B \subset V$ is an orthonormal system, then the open balls $S_b$ of radius $\frac{1}{2\sqrt{2}}$ around each $b \in B$ are such that the sets $\{S_b \cap D : b \in B\}$ are disjoint and nonempty. Thus $B$ is countable.

Now let $(V, \langle \cdot, \cdot \rangle)$ be a separable innerproduct space and $B \subset V$ be an orthonormal system.

Definition 3.3.9: The Fourier coefficients of a vector $x$ in $V$ with respect to an orthonormal set $B$ is the set $\{\langle x, b \rangle : b \in B\}$.

Since $V$ is separable, $B$ is countable. Let $B = \{b_i : i \in \mathbb{N}\}$. For a given $x \in V$, let $c_i = \langle x, b_i \rangle$, $i \ge 1$. Let $x_n \equiv \sum_{i=1}^n c_i b_i$, $n \in \mathbb{N}$. The sequence $\{x_n\}_{n \ge 1}$ is called the partial sum sequence of the Fourier expansion of the vector $x$ w.r.t. the orthonormal set $B$. A natural question is: when does $\{x_n\}_{n \ge 1}$ converge to $x$? By the linearity property in the definition of the innerproduct $\langle \cdot, \cdot \rangle$, it follows that

$$0 \le \|x - x_n\|^2 = \langle x - x_n, x - x_n \rangle = \langle x, x \rangle - 2\langle x, x_n \rangle + \langle x_n, x_n \rangle$$

and

$$\langle x, x_n \rangle = \sum_{i=1}^n c_i \langle x, b_i \rangle = \sum_{i=1}^n c_i^2.$$

Since $\{b_i\}_{i \ge 1}$ are orthonormal,

$$\|x_n\|^2 = \langle x_n, x_n \rangle = \sum_{i=1}^n c_i^2 = \langle x, x_n \rangle.$$

Thus,

$$0 \le \|x - x_n\|^2 = \|x\|^2 - \|x_n\|^2 = \|x\|^2 - \sum_{i=1}^n c_i^2,$$

leading to

Proposition 3.3.2: (Bessel's inequality). Let $\{b_i\}_{i \ge 1}$ be orthonormal in an innerproduct space $(V, \langle \cdot, \cdot \rangle)$. Then, for any $x$ in $V$,

$$\sum_{i=1}^\infty \langle x, b_i \rangle^2 \le \|x\|^2. \tag{3.4}$$
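The identity $\|x - x_n\|^2 = \|x\|^2 - \sum_{i=1}^n c_i^2$ behind Bessel's inequality can be checked numerically in $\mathbb{R}^3$ with the usual dot product. In the sketch below, the orthonormal pair `b1`, `b2` and the vector `x` are illustrative assumptions; the pair does not span $\mathbb{R}^3$, so the inequality is strict.

```python
import math

def inner(x, y):
    """Standard dot product on R^k."""
    return sum(a * b for a, b in zip(x, y))

r = 1.0 / math.sqrt(2.0)
b1 = (r, r, 0.0)    # an orthonormal pair in R^3 (hypothetical choice)
b2 = (r, -r, 0.0)
x = (1.0, 2.0, 3.0)

# Fourier coefficients c_i = <x, b_i> and the partial sum x_n = sum_i c_i b_i.
coeffs = [inner(x, b) for b in (b1, b2)]
x_n = [sum(c * b[j] for c, b in zip(coeffs, (b1, b2))) for j in range(3)]

# ||x - x_n||^2 should equal ||x||^2 - sum_i c_i^2, and both are >= 0.
diff = [a - b for a, b in zip(x, x_n)]
residual_sq = inner(diff, diff)
bessel_gap = inner(x, x) - sum(c * c for c in coeffs)
```

Here the gap equals the squared component of `x` along the third coordinate axis, which is the direction missed by the orthonormal pair.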


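Now let $(V, \langle \cdot, \cdot \rangle)$ be a Hilbert space. Since for $m > n$,

$$\|x_n - x_m\|^2 = \sum_{i=n+1}^m \langle x, b_i \rangle^2,$$

it follows from Bessel's inequality that $\{x_n\}_{n \ge 1}$ is a Cauchy sequence, and since $V$ is complete, there is a $y$ in $V$ such that $x_n \to y$. This implies (by the continuity of $\langle x, y \rangle$) that $\langle x, b_i \rangle = \lim_{n \to \infty} \langle x_n, b_i \rangle = \langle y, b_i \rangle \Rightarrow \langle x - y, b_i \rangle = 0$ for all $i \ge 1$. Thus, it follows that $\langle x, b_i \rangle = \langle y, b_i \rangle$ for all $i \ge 1$. The last relation implies $y = x$ iff the set $\{b_i\}_{i \ge 1}$ satisfies the property that

$$\langle z, b_i \rangle = 0 \ \text{ for all } i \ge 1 \ \Rightarrow\ z = 0. \tag{3.5}$$

This property is called the completeness of $B$. Thus $B \equiv \{b_i\}_{i \ge 1}$ is a complete orthonormal set for a Hilbert space $H \equiv (V, \langle \cdot, \cdot \rangle)$ iff for every vector $x$,

$$\sum_{i=1}^\infty c_i^2 = \|x\|^2, \tag{3.6}$$

where $c_i = \langle x, b_i \rangle$, $i \ge 1$, which in turn holds iff

$$\Big\|x - \sum_{i=1}^n c_i b_i\Big\| \to 0 \quad \text{as } n \to \infty. \tag{3.7}$$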

Conversely, if $\{c_i\}_{i \ge 1}$ is a sequence of real numbers such that $\sum_{i=1}^\infty c_i^2 < \infty$, then the sequence $\{x_n \equiv \sum_{i=1}^n c_i b_i\}_{n \ge 1}$ is Cauchy and hence converges to an $x$ in $V$. Thus the Hilbert space $H$ can be identified with the space $\ell_2$ of all square summable sequences $\big\{\{c_i\}_{i \ge 1} : \sum_{i=1}^\infty c_i^2 < \infty\big\}$, in the sense that the map $\varphi : x \to \{c_i\}_{i \ge 1}$, where $c_i = \langle x, b_i \rangle$, $i \ge 1$, preserves the algebraic structure as well as the innerproduct, i.e., $\varphi$ is a linear operator from $H$ to $\ell_2$ and $\langle \varphi(x), \varphi(y) \rangle = \langle x, y \rangle$ for all $x, y \in H$. Such a $\varphi$ is called an isometric isomorphism between $H$ and $\ell_2$. Note also that $\ell_2$ is simply $L^2(\Omega, \mathcal{F}, \mu)$ where $\Omega \equiv \mathbb{N}$, $\mathcal{F} = \mathcal{P}(\mathbb{N})$, and $\mu$ is the counting measure.

It can be shown (using the axiom of choice) that every separable Hilbert space does possess a complete orthonormal system, i.e., an orthonormal basis. Next some examples are given. Here, unless otherwise indicated, $H$ denotes the Hilbert space and $B$ denotes an orthonormal basis of $H$.

Example 3.3.1:

(a) $H \equiv \ell_2 = \{(x_1, x_2, \ldots) : x_i \in \mathbb{R},\ \sum_{i=1}^\infty x_i^2 < \infty\}$. $B \equiv \{e_i : i \ge 1\}$ where $e_i \equiv (0, 0, \ldots, 1, 0, \ldots)$ with 1 in the $i$th position and 0 elsewhere.

(b) $H \equiv L^2([0, 2\pi], \mathcal{B}([0, 2\pi]), \mu)$ where $\mu(A) = \frac{1}{2\pi} m(A)$, $m(\cdot)$ being Lebesgue measure. $B \equiv \{\cos nx : n = 0, 1, 2, \ldots\} \cup \{\sin nx : n = 1, 2, \ldots\}$. (For a proof, see Chapter 5.)


(c) Let $H \equiv L^2(\mathbb{R}, \mathcal{B}(\mathbb{R}), \mu)$ where $\mu$ is a finite measure such that $\int |x|^k\,d\mu < \infty$ for all $k = 1, 2, \ldots$. Let $B_1 \equiv \{1, x, x^2, \ldots\}$ and $B$ be the orthonormal set generated by applying the Gram-Schmidt procedure to $B_1$ (see Problem 3.31). It can be shown that $B$ is a basis for $H$ (Problem 3.39). When $\mu$ is the standard normal distribution, the elements of $B$ are called Hermite polynomials.
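For the standard normal case, the Gram-Schmidt step in part (c) can be carried out exactly, since the inner product of two polynomials only needs the moments $E X^k$, which are $0$ for odd $k$ and $(k-1)!!$ for even $k$. The sketch below orthogonalizes $1, x, x^2, x^3$ without normalizing, so it returns monic orthogonal polynomials together with their squared norms; dividing each polynomial by the square root of its squared norm gives the orthonormal elements. The representation (coefficient lists, constant term first) and all function names are assumptions made for illustration.

```python
from fractions import Fraction

def normal_moment(k):
    """E[X^k] for X ~ N(0, 1): 0 for odd k, (k - 1)!! for even k."""
    if k % 2 == 1:
        return Fraction(0)
    result = Fraction(1)
    for j in range(1, k, 2):
        result *= j
    return result

def inner(p, q):
    """<p, q> = E[p(X) q(X)] for polynomials given as coefficient lists."""
    return sum(a * b * normal_moment(i + j)
               for i, a in enumerate(p) for j, b in enumerate(q))

def gram_schmidt_monomials(n):
    """Orthogonalize 1, x, ..., x^(n-1); returns monic orthogonal polynomials."""
    basis = []
    for deg in range(n):
        p = [Fraction(0)] * deg + [Fraction(1)]      # the monomial x^deg
        for q in basis:
            coef = inner(p, q) / inner(q, q)         # projection coefficient
            q_padded = q + [Fraction(0)] * (len(p) - len(q))
            p = [a - coef * b for a, b in zip(p, q_padded)]
        basis.append(p)
    return basis

polys = gram_schmidt_monomials(4)
norms_sq = [inner(p, p) for p in polys]
```

With this input the procedure yields $1$, $x$, $x^2 - 1$, $x^3 - 3x$ with squared norms $1, 1, 2, 6$, so the first three orthonormal elements are $1$, $x$, and $(x^2 - 1)/\sqrt{2}$.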

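For one more example, i.e., Haar functions, see Problem 3.40.

Theorem 3.3.3: (Riesz representation). Let $H$ be a separable Hilbert space. Then every bounded linear functional $T : H \to \mathbb{R}$ can be represented as $T \equiv T_{x_0}$ for some $x_0 \in H$, where $T_{x_0}(y) \equiv \langle y, x_0 \rangle$.

Proof: Let $B = \{b_i\}_{i \ge 1}$ be an orthonormal basis for $H$. Let $c_i \equiv T(b_i)$, $i \ge 1$. Then, for $n \ge 1$,

$$\begin{aligned}
\sum_{i=1}^n c_i^2 &= \sum_{i=1}^n c_i T(b_i) = T\Big(\sum_{i=1}^n c_i b_i\Big) \quad \text{(by the linearity of } T\text{)} \\
&\le \|T\|\,\Big\|\sum_{i=1}^n c_i b_i\Big\| = \|T\| \Big(\sum_{i=1}^n c_i^2\Big)^{1/2} \\
\Rightarrow\ \sum_{i=1}^n c_i^2 &\le \|T\|^2 \\
\Rightarrow\ \sum_{i=1}^\infty c_i^2 &\le \|T\|^2 < \infty.
\end{aligned}$$

Thus $\{x_n \equiv \sum_{i=1}^n c_i b_i\}_{n \ge 1}$ is Cauchy in $H$ and hence converges to an $x_0$ in $H$. By the continuity of $T$, for any $y$, $Ty = \lim_{n \to \infty} T y_n$, where $y_n \equiv \sum_{i=1}^n \langle y, b_i \rangle b_i$, $n \ge 1$. But

$$T y_n = \sum_{i=1}^n \langle y, b_i \rangle c_i \ \text{ (by the linearity of } T\text{)} = \Big\langle y, \sum_{i=1}^n c_i b_i \Big\rangle = \langle y, x_n \rangle.$$

Again by continuity of $\langle y, x \rangle$, it follows that $Ty = \langle y, x_0 \rangle$. □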
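The construction in the proof, $c_i = T(b_i)$ and $x_0 = \sum_i c_i b_i$, can be traced in a finite-dimensional case where no limiting argument is needed. The sketch below uses $\mathbb{R}^2$ with the dot product, a rotated orthonormal basis, and a hypothetical functional `T`; all of these are illustrative assumptions rather than objects from the text.

```python
import math

def inner(x, y):
    """Standard dot product on R^2."""
    return sum(a * b for a, b in zip(x, y))

def T(y):
    """An illustrative bounded linear functional on R^2 (hypothetical choice)."""
    return 3.0 * y[0] - y[1]

r = 1.0 / math.sqrt(2.0)
basis = [(r, r), (r, -r)]          # an orthonormal basis of R^2

# c_i = T(b_i) and the representer x0 = sum_i c_i b_i, as in the proof.
c = [T(b) for b in basis]
x0 = tuple(sum(ci * b[j] for ci, b in zip(c, basis)) for j in range(2))

# T(y) should agree with <y, x0> for any y; here one sample point.
y = (0.7, -1.3)
agreement_error = abs(T(y) - inner(y, x0))

# In this example sum_i c_i^2 equals ||x0||^2.
sum_c_sq = sum(ci * ci for ci in c)
```

The representer does not depend on which orthonormal basis is used: here it comes out as the coefficient vector $(3, -1)$ of `T`, up to rounding.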

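A sufficient condition for an $L^2(\Omega, \mathcal{F}, \mu)$ to be separable is that there exists an at most countable family $\mathcal{A} \equiv \{A_j\}_{j \ge 1}$ of sets in $\mathcal{F}$ such that $\mathcal{F} = \sigma\langle \mathcal{A} \rangle$ and $\mu(A_j) > 0$ for each $j$. This holds for any $\sigma$-finite measure $\mu$ on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$ (Problem 3.38).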


Remark 3.3.2: Assuming the axiom of choice, the Riesz representation theorem remains valid for any Hilbert space, separable or not (Problem 3.43).

3.4 Problems

3.1 Prove Proposition 3.1.7. (Hint: Use (1.4) repeatedly.)

3.2 Let $(\Omega, \mathcal{F}, \mu)$ be a measure space with $\mu(\Omega) \le 1$ and $f : \Omega \to (a, b) \subset \mathbb{R}$ be in $L^1(\Omega, \mathcal{F}, \mu)$. Let $\phi : (a, b) \to \mathbb{R}$ be convex. Show that if $c \equiv \int f\,d\mu \in (a, b)$ and $\phi(f) \in L^1(\Omega, \mathcal{F}, \mu)$ and $c\,\phi'_+(c) \ge 0$, then

$$\mu(\Omega)\,\phi\Big(\int f\,d\mu\Big) \le \int \phi(f)\,d\mu.$$

3.3 Prove Proposition 3.1.10. (Hint: Apply Jensen's inequality with $\Omega \equiv \{1, 2, \ldots, k\}$, $\mathcal{F} = \mathcal{P}(\Omega)$, $P(\{i\}) = p_i$, $f(i) = a_i$, $i = 1, 2, \ldots, k$, and $\phi(x) = e^x$ to get (i). Deduce (ii) from (i) and Remark 3.1.3. For (iii), consider $\phi(x) = |x|^p$.)

3.4 Give an example of a convex function $\phi$ on $(0, 1)$ with a finite number of points where it is not differentiable. Can this be extended to the countable case? Uncountable case? (Hint: Note that $\phi'_+(\cdot)$ and $\phi'_-(\cdot)$ are both monotone and hence have at most a countable number of discontinuity points.)

3.5 Let $\phi : (a, b) \to \mathbb{R}$ be convex.

(a) Using the definition and induction, show that

$$\phi\Big(\sum_{i=1}^n p_i x_i\Big) \le \sum_{i=1}^n p_i \phi(x_i)$$

for any $n \ge 2$, $x_1, x_2, \ldots, x_n$ in $(a, b)$ and $\{p_1, p_2, \ldots, p_n\}$, a probability distribution.

(b) Use (a) to prove Jensen's inequality for any bounded $\phi$.

3.6 Show that a function $\phi : \mathbb{R} \to \mathbb{R}$ is convex iff

$$\phi\Big(\int_{[0,1]} f\,dm\Big) \le \int_{[0,1]} \phi(f)\,dm$$

for every bounded Borel measurable function $f : [0, 1] \to \mathbb{R}$, where $m(\cdot)$ is the Lebesgue measure.


3.7 Let $\phi$ be convex on $(a, b)$ and $\psi : \mathbb{R} \to \mathbb{R}$ be convex and nondecreasing. Show that $\psi \circ \phi$ is convex on $(a, b)$.

3.8 Let $X$ be a nonnegative random variable on some probability space.

(a) Show that $(EX)\big(E\frac{1}{X}\big) \ge 1$. What does this say about the correlation between $X$ and $\frac{1}{X}$?

(b) Let $f, g : \mathbb{R}_+ \to \mathbb{R}_+$ be Borel measurable and such that $f(x)g(x) \ge 1$ for all $x$ in $\mathbb{R}_+$. Show that $Ef(X)\,Eg(X) \ge 1$.

3.9 Prove Corollary 3.1.13 using Hölder's inequality applied to an appropriate measure space.

3.10 Extend Hölder's inequality as follows. Let $1 < p_i < \infty$, and $f_i \in L^{p_i}(\Omega, \mathcal{F}, \mu)$, $i = 1, 2, \ldots, k$. Suppose $\sum_{i=1}^k \frac{1}{p_i} = 1$. Show that $\int \prod_{i=1}^k f_i\,d\mu \le \prod_{i=1}^k \|f_i\|_{p_i}$. (Hint: Use Proposition 3.1.10 (ii).)

3.11 Verify Minkowski's inequality for $p = 1$ and $p = \infty$.

3.12 (a) Find $(\Omega, \mathcal{F}, \mu)$, $0 < p < 1$, $f, g \in L^p(\Omega, \mathcal{F}, \mu)$ such that

$$\Big(\int |f + g|^p\,d\mu\Big)^{1/p} > \Big(\int |f|^p\,d\mu\Big)^{1/p} + \Big(\int |g|^p\,d\mu\Big)^{1/p}.$$

(b) Prove (1.18) for $0 < p < 1$ with $\|f\|_p = \int |f|^p\,d\mu$.

3.13 Let $(\Omega, \mathcal{F}, \mu)$ be a measure space. Let $\{A_k\}_{k \ge 1} \subset \mathcal{F}$ and $\sum_{k=1}^\infty \mu(A_k) < \infty$. Show that $\mu\big(\overline{\lim}_{k \to \infty} A_k\big) = 0$, where $\overline{\lim}_{k \to \infty} A_k = \bigcap_{n=1}^\infty \bigcup_{j \ge n} A_j = \{\omega : \omega \in A_j \text{ for infinitely many } j \ge 1\}$.

3.14 Establish Theorem 3.2.2 for $p = \infty$. (Hint: For each $k \ge 1$, choose $n_k \uparrow$ such that $\|f_{n_{k+1}} - f_{n_k}\|_\infty < 2^{-k}$. Show that there exists a set $A$ with $\mu(A^c) = 0$ and for $\omega$ in $A$, $|f_{n_{k+1}}(\omega) - f_{n_k}(\omega)| < 2^{-k}$ for all $k \ge 1$, and now proceed as in the proof for the case $0 < p < \infty$.)

3.15 Let $f, g \in L^p(\Omega, \mathcal{F}, \mu)$, $0 < p < 1$. Show that $d(f, g) = \int |f - g|^p\,d\mu$ is a metric and $(L^p(\Omega, \mathcal{F}, \mu), d)$ is complete.

3.16 Let $(\Omega, \mathcal{F}, \mu)$ be a measure space and $f : \Omega \to \mathbb{R}$ be $\mathcal{F}$-measurable. Let $A_f = \{p : 0 < p < \infty,\ \int |f|^p\,d\mu < \infty\}$.

(a) Show that $p_1, p_2 \in A_f$, $p_1 < p_2$ implies $[p_1, p_2] \subset A_f$. (Hint: Use $\int_{|f| \ge 1} |f|^p\,d\mu \le \int_{|f| \ge 1} |f|^{p_2}\,d\mu$ and $\int_{|f| \le 1} |f|^p\,d\mu \le \int_{|f| \le 1} |f|^{p_1}\,d\mu$ for any $p_1 < p < p_2$.)

(b) Let $\psi(p) = \log \int |f|^p\,d\mu$ for $p \in A_f$. By (a), it is known that $A_f$ is connected, i.e., it is an interval. Show that $\psi$ is convex in the open interior of $A_f$. (Hint: Use Hölder's inequality.)

(c) Give examples to show that $A_f$ could be a closed interval, an open interval, and semi-open intervals.

(d) If $0 < p_1 < p < p_2$, show that $\|f\|_p \le \max\{\|f\|_{p_1}, \|f\|_{p_2}\}$. (Hint: Use (b).)

(e) Show that if $\int |f|^r\,d\mu < \infty$ for some $0 < r < \infty$, then $\|f\|_p \to \|f\|_\infty$ as $p \to \infty$. (Hint: Show first that for any $K > 0$, $\mu(|f| > K) > 0 \Rightarrow \liminf_{p \to \infty} \|f\|_p \ge K$. If $\|f\|_\infty < \infty$ and $\mu(\Omega) < \infty$, use the fact that $\|f\|_p \le \|f\|_\infty (\mu(\Omega))^{1/p}$ and reduce the general case under the hypothesis that $\int |f|^p\,d\mu < \infty$ for some $p$ to this case.)

3.17 Let $X$ be a random variable on a probability space $(\Omega, \mathcal{F}, \mu)$. Recall that $Eh(X) = \int h(X)\,d\mu$ if $h(X) \in L^1(\Omega, \mathcal{F}, \mu)$.

(a) Show that $(E|X|^{p_1})^{1/p_1} \le (E|X|^{p_2})^{1/p_2}$ for any $0 < p_1 < p_2 < \infty$.

(b) Show that '=' holds in (a) iff $|X|$ is a constant a.e. $(\mu)$.

(c) Show that if $E|X| < \infty$, then $E|\log |X|| < \infty$ and $E|X|^r < \infty$ for all $0 < r < 1$, and $\frac{1}{r} \log(E|X|^r) \to E \log |X|$ as $r \to 0$.

3.18 Let $X$ be a nonnegative random variable.

(a) Show that $EX \log X \ge (EX)(E \log X)$.

(b) Show that $\sqrt{1 + (EX)^2} \le E\big(\sqrt{1 + X^2}\big) \le 1 + EX$.

3.19 Let $\Omega = \mathbb{N}$, $\mathcal{F} = \mathcal{P}(\mathbb{N})$, and let $\mu$ be the counting measure. Denote $L^p(\Omega, \mathcal{F}, \mu)$ for this case by $\ell_p$.

(a) Show that $\ell_p$ is the set of all sequences $\{x_n\}_{n \ge 1}$ such that $\sum_{n=1}^\infty |x_n|^p < \infty$.

(b) For the following sequences, find all $p > 0$ such that they belong to $\ell_p$:

(i) $x_n \equiv \frac{1}{n}$, $n \ge 1$.
(ii) $x_n = \frac{1}{n(\log(n+1))^2}$, $n \ge 1$.


3.20 For $1 \le p < \infty$, prove the Riesz representation theorem for $\ell_p$. That is, show that if $T$ is a bounded linear functional from $\ell_p \to \mathbb{R}$, then there exists a $y = \{y_i\}_{i \ge 1} \in \ell_q$ such that for any $x = \{x_i\}_{i \ge 1}$ in $\ell_p$, $T(x) = \sum_{i=1}^\infty x_i y_i$. (Hint: Let $y_i = T(e_i)$ where $e_i = \{e_i(j)\}_{j \ge 1}$, $e_i(j) = 1$ if $i = j$, $0$ if $i \ne j$. Use the fact $|T(x)| \le \|T\|\,\|x\|_p$ to show that for each $n \in \mathbb{N}$, $\big(\sum_{i=1}^n |y_i|^q\big) \le \|T\|^q < \infty$.)

3.21 Let $\Omega = \mathbb{R}$, $\mathcal{F} = \mathcal{B}(\mathbb{R})$, $\mu = \mu_F$ where $F$ is a cdf on $\mathbb{R}$. If $f(x) \equiv x^2$, find $A_f = \{p : 0 < p < \infty,\ f \in L^p(\mathbb{R}, \mathcal{B}(\mathbb{R}), \mu_F)\}$ for the following cases:

(a) $F(x) = \Phi(x)$, the $N(0, 1)$ cdf, i.e., $\Phi(x) \equiv \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x e^{-u^2/2}\,du$, $x \in \mathbb{R}$.

(b) $F(x) = \frac{1}{\pi} \int_{-\infty}^x \frac{1}{1 + u^2}\,du$, $x \in \mathbb{R}$.

3.22 Show that $C[0, 1]$ with the supnorm (i.e., with $\|f\| = \sup\{|f(x)| : 0 \le x \le 1\}$) is a Banach space. (Hint: To verify completeness, let $\{f_n\}_{n \ge 1}$ be a Cauchy sequence in $C[0, 1]$. Show that for each $0 \le x \le 1$, $\{f_n(x)\}_{n \ge 1}$ is a Cauchy sequence in $\mathbb{R}$. Let $f(x) = \lim_{n \to \infty} f_n(x)$. Now show that $\sup\{|f_n(x) - f(x)| : 0 \le x \le 1\} \le \overline{\lim}_{m \to \infty} \|f_n - f_m\|$. Conclude that $f_n$ converges to $f$ uniformly on $[0, 1]$ and that $f \in C[0, 1]$.)

3.23 Show that the space $(\mathcal{P}, \|\cdot\|)$ of all polynomials on $[0, 1]$ with the supnorm is a normed linear space that is not complete. (Hint: Let $g(t) = \frac{1}{1 - t/2}$ for $0 \le t \le 1$. Find a sequence of polynomials $\{f_n\}_{n \ge 1}$ in $\mathcal{P}$ that converge to $g$ in supnorm.)

3.24 Show that the function $f(v) \equiv \|v\|$ from a normed linear space $(V, \|\cdot\|)$ to $\mathbb{R}_+$ is continuous.

3.25 Let $(V, \|\cdot\|)$ be a normed linear space. Let $S = \{v : \|v\| < 1\}$. Show that $S$ is an open set in $V$.

3.26 Show that the space $\mathcal{P}_k$ of all polynomials on $[0, 1]$ of degree $\le k$ is a Banach space under the supnorm, i.e., under $\|f\| = \sup\{|f(x)| : 0 \le x \le 1\}$. (Hint: Let $p_n(x) = \sum_{j=0}^k a_{nj} x^j$ be a sequence of elements in $\mathcal{P}_k$ that converge in supnorm to some $f(\cdot)$. Show that $\{a_{n1}\}_{n \ge 1}$ converges and recursively, $\{a_{ni}\}_{n \ge 1}$ converges for each $i$.)

3.27 Let $\mu$ be a finite measure on $[0, 1]$. Verify that $T_\mu(f) \equiv \int f\,d\mu$ is a bounded linear functional on $C[0, 1]$ and that $\|T_\mu\| = \mu([0, 1])$.


3.28 Let $(V_i, \|\cdot\|_i)$, $i = 1, 2$, be two normed linear spaces over $\mathbb{R}$.

(a) Let $T : V_1 \to V_2$ be a linear operator. Show that if for some $x_0$, $\|Tx - Tx_0\| \to 0$ as $x \to x_0$, then $T$ is continuous on $V_1$ and hence bounded.

(b) Show that if $(V_2, \|\cdot\|_2)$ is complete, then $B(V_1, V_2) \equiv \{T \mid T : V_1 \to V_2,\ T \text{ linear and bounded}\}$ is complete under the operator norm defined in (3.1).

In the following, $H$ will denote a real Hilbert space.

3.29 (a) Use the Cauchy-Schwarz inequality to show that the function $f(x, y) = \langle x, y \rangle$ is continuous from $H \times H \to \mathbb{R}$.

(b) (Parallelogram law). Show that in an innerproduct space $(V, \langle \cdot, \cdot \rangle)$, for any $x, y \in V$,

$$\|x + y\|^2 + \|x - y\|^2 = 2(\|x\|^2 + \|y\|^2),$$

where $\|x\|^2 = \langle x, x \rangle$.

3.30 (a) Let $\{Q_n(x)\}_{n \ge 0}$ be defined on $[0, 2\pi]$ by

$$Q_n(x) = c_n \Big(\frac{1 + \cos x}{2}\Big)^n,$$

where $c_n$ is such that

$$\frac{1}{2\pi} \int_0^{2\pi} Q_n(x)\,dx = 1.$$

Clearly, $Q_n(\cdot) \ge 0$.

(i) Verify that for each $\delta > 0$, $\sup\{Q_n(x) : \delta \le x \le 2\pi - \delta\} \to 0$ as $n \to \infty$.

(ii) Use this to show that if $f \in C[0, 2\pi]$ and if

$$P_n(t) \equiv \frac{1}{2\pi} \int_0^{2\pi} Q_n(t - s) f(s)\,ds, \quad n \ge 0, \tag{4.1}$$

then $P_n \to f$ uniformly on $[0, 2\pi]$.

(b) Use this to give a proof of the completeness of the class $\mathcal{C}$ of trigonometric functions.

(c) Show that if $f \in L^1([0, 2\pi])$, then $P_n(\cdot)$ converges to $f$ in $L^1([0, 2\pi])$.


(d) Let $\{\mu_n(\cdot)\}_{n \ge 1}$ be a sequence of probability measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such that for each $\delta > 0$, $\mu_n(\{x : |x| \ge \delta\}) \to 0$ as $n \to \infty$. Let $f : \mathbb{R} \to \mathbb{R}$ be Borel measurable. Let $f_n(x) \equiv \int f(x - y)\,\mu_n(dy)$, $n \ge 1$. Assuming that $f_n(\cdot)$ is well defined and Borel measurable, show that

(i) $f(\cdot)$ continuous at $x_0$ and bounded $\Rightarrow f_n(x_0) \to f(x_0)$.
(ii) $f(\cdot)$ uniformly continuous and bounded on $\mathbb{R} \Rightarrow f_n \to f$ uniformly on $\mathbb{R}$.
(iii) $f \in L^p(\mathbb{R}, \mathcal{B}(\mathbb{R}), m)$, $0 < p < \infty$, $m(\cdot)$ = Lebesgue measure $\Rightarrow \int |f_n - f|^p\,dm \to 0$.
(iv) Show that (iii) $\Rightarrow$ (c).

3.31 (Gram-Schmidt procedure). Let $B \equiv \{b_n : n \in \mathbb{N}\}$ be a collection of nonzero vectors in $H$. Set

$$e_1 = \frac{b_1}{\|b_1\|}, \qquad \tilde{e}_2 = b_2 - \langle b_2, e_1 \rangle e_1, \qquad e_2 = \frac{\tilde{e}_2}{\|\tilde{e}_2\|} \ \ (\text{provided } \|\tilde{e}_2\| \ne 0),$$

and so on. If $\|\tilde{e}_n\| = 0$ for some $n \in \mathbb{N}$, then delete $b_n$. Let $E \equiv \{e_j : 1 \le j < k\}$, $k \le \infty$, be the collection of vectors reached this way.

(a) Show that $E$ is an orthonormal system.

(b) Let $H_B$ denote the closed linear subspace generated by $B$, i.e.,

$$H_B \equiv \Big\{x : x \in H, \text{ there exists } x_n \text{ of the form } \sum_{j=1}^n a_j b_j,\ a_j \in \mathbb{R}, \text{ such that } \|x_n - x\| \to 0\Big\}.$$

Show that $H_B$ is a Hilbert space and $E$ is a basis for $H_B$.

3.32 Let $H = L^2(\mathbb{R}, \mathcal{B}(\mathbb{R}), \mu)$, where $\mu$ is a probability measure. Let $B \equiv \{1, x, x^2, \ldots\}$. Assume that $\int |x|^k\,d\mu < \infty$ for all $k \in \mathbb{N}$. Apply the Gram-Schmidt procedure in Problem 3.31 to the set $B$ for the following cases and evaluate $e_1, e_2, e_3$.

(a) $\mu$ = Uniform $[0, 1]$ distribution.
(b) $\mu$ = standard normal distribution.
(c) $\mu$ = Exponential (1) distribution.

The orthonormal basis $E$ obtained this way is called Orthogonal Polynomials w.r.t. the given measure. (See Szego (1939).)


3.33 Let $B \subset H$ be an orthonormal system. Show that for any $x$ in $H$, $\{b : \langle x, b \rangle \ne 0\}$ is at most countable. (Hint: Show first that if $\{y_\alpha : \alpha \in I\}$ is a collection of nonnegative real numbers such that for some $C < \infty$, $\sum_{\alpha \in F} y_\alpha \le C$ for all $F \subset I$, $F$ finite, then $\{\alpha : y_\alpha > 0\}$ is at most countable, and apply this to the Bessel inequality.)

3.34 Let $B \subset H$. Define $B^\perp \equiv \{x : x \in H,\ \langle x, b \rangle = 0 \text{ for all } b \in B\}$. Show that $B^\perp$ is a closed subspace of $H$.

3.35 Let $B \subset H$ be a closed subspace of $H$.

(a) Using the fact that every Hilbert space admits an orthonormal basis, show that every $x$ in $H$ can be uniquely decomposed as

$$x = y + z \tag{4.2}$$

where $y \in B$ and $z \in B^\perp$ and $\|x\|^2 = \|y\|^2 + \|z\|^2$.

(b) Let $P_B : H \to B$ be defined by $P_B x = y$ where $x$ admits the decomposition in (4.2) above. Verify that $P_B$ is a bounded linear operator from $H$ to $B$ and is of norm 1 if $B$ has at least one nonzero vector. (The operator $P_B$ is called the projection onto $B$.)

(c) Verify that $P_B(P_B x) = P_B x$ for all $x$ in $H$.

3.36 Let $H$ be separable and $\{x_n\}_{n \ge 1} \subset H$ be such that $\{\|x_n\|\}_{n \ge 1}$ is bounded by some $C < \infty$. Show that there exist a subsequence $\{x_{n_j}\}_{j \ge 1} \subset \{x_n\}_{n \ge 1}$ and an $x_0$ in $H$, such that for every $y$ in $H$, $\langle x_{n_j}, y \rangle \to \langle x_0, y \rangle$. (Hint: Fix an orthonormal basis $B \equiv \{b_n\}_{n \ge 1} \subset H$. Let $a_{ni} = \langle x_n, b_i \rangle$, $n \ge 1$, $i \ge 1$. Using $\sum_{i=1}^\infty a_{ni}^2 \le C$ for all $n$ and the Bolzano-Weierstrass property, show that

(a) there exists $\{n_j\}_{j \ge 1}$ such that $\lim_{j \to \infty} a_{n_j i} = a_i$ exists for all $i \ge 1$, $\sum_{i=1}^\infty a_i^2 < \infty$,
(b) $\lim_{n \to \infty} \sum_{i=1}^n a_i b_i \equiv x_0$ exists in $H$,
(c) $\langle x_{n_j}, y \rangle \to \langle x_0, y \rangle$ for all $y$ in $H$.)

3.37 Let $(V, \langle \cdot, \cdot \rangle)$ be an innerproduct space. Verify that $\langle \cdot, \cdot \rangle$ is bilinear, i.e., for $\alpha_1, \alpha_2, \beta_1, \beta_2 \in \mathbb{R}$, $x_1, x_2, y_1, y_2$ in $V$,

$$\langle \alpha_1 x_1 + \alpha_2 x_2, \beta_1 y_1 + \beta_2 y_2 \rangle = \alpha_1 \beta_1 \langle x_1, y_1 \rangle + \alpha_1 \beta_2 \langle x_1, y_2 \rangle + \alpha_2 \beta_1 \langle x_2, y_1 \rangle + \alpha_2 \beta_2 \langle x_2, y_2 \rangle.$$

State and prove an extension to more than two vectors.


3.38 Let $(\Omega, \mathcal{F}, \mu)$ be a measure space. Suppose that there exists an at most countable family $\mathcal{A} \equiv \{A_j\}_{j \ge 1} \subset \mathcal{F}$ such that $\mathcal{F} = \sigma\langle \mathcal{A} \rangle$ and $\mu(A_j) > 0$ for each $j \ge 1$. Then show that for $0 < p < \infty$, $L^p(\Omega, \mathcal{F}, \mu)$ is separable. (Hint: Show first that for any $A \in \mathcal{F}$ with $\mu(A) < \infty$, and $\epsilon > 0$, there exists a countable subcollection $\mathcal{A}_1$ of $\mathcal{A}$ such that $\mu(A \,\triangle\, B) < \epsilon$ where $B = \bigcup\{A_j : A_j \in \mathcal{A}_1\}$. Now consider the class of functions $\{\sum_{i=1}^n c_i I_{A_i},\ n \ge 1,\ A_i \in \mathcal{A},\ c_i \in \mathbb{Q}\}$.)

3.39 Show that $B$ in Example 3.3.1 (c) is a basis for $H$. (Hint: Using Theorem 2.3.14 prove that the set of all polynomials is dense in $H$.)

3.40 (Haar functions). For $x$ in $\mathbb{R}$ let $h(x) = I_{[0,1/2)}(x) - I_{[1/2,1)}(x)$. Let $h_{00}(x) \equiv I_{[0,1)}(x)$ and for $k \ge 1$, $0 \le j < 2^{k-1}$, let $h_{kj}(x) \equiv 2^{\frac{k-1}{2}} h(2^{k-1} x - j)$, $0 \le x < 1$.

(a) Verify that the family $\{h_{kj}(\cdot),\ k \ge 0,\ 0 \le j < 2^{k-1}\}$ is an orthonormal set in $L^2([0, 1], \mathcal{B}([0, 1]), m)$, where $m(\cdot)$ is Lebesgue measure.

(b) Verify that this family is complete by completing the following two proofs:

(i) Show that for the indicator function $f$ of a dyadic interval of the form $\big[\frac{k}{2^n}, \frac{\ell}{2^n}\big)$, $k < \ell$, the following identity holds:

$$\int f^2\,dm = \frac{\ell - k}{2^n} = \sum_{k,j} \Big(\int f h_{kj}\,dm\Big)^2.$$

Now use the fact that the linear combinations of such $f$'s are dense in $L^2[0, 1]$.

(ii) For each $f \in L^2([0, 1], \mathcal{B}([0, 1]), m)$ such that $f$ is orthogonal to the Haar functions, $F(t) \equiv \int_{[0,t]} f\,dm$, $0 \le t \le 1$, is continuous and satisfies $F\big(\frac{j}{2^n}\big) = 0$ for all $0 \le j \le 2^n$, $n = 1, 2, \ldots$, and hence $F \equiv 0$, implying $f = 0$ a.e.

3.41 Let $H$ be a Hilbert space over $\mathbb{R}$ and $M$ be a closed subspace of $H$. Let $v_0 \in H$. Show that

$$\min\{\|v - v_0\| : v \in M\} = \max\{\langle v_0, u \rangle : u \in M^\perp,\ \|u\| = 1\},$$

where $M^\perp$ is the orthogonal complement of $M$, i.e., $M^\perp \equiv \{u : \langle v, u \rangle = 0 \text{ for all } v \in M\}$. (Hint: Use Problem 3.35 (a).)


3.42 Let $B$ be an orthonormal set in a Hilbert space $H$.

(a) (i) Show that for any $x$ in $H$ and any finite set $\{b_i : 1 \le i \le k\} \subset B$, $k < \infty$,

$$\sum_{i=1}^k \langle x, b_i \rangle^2 \le \|x\|^2.$$

(ii) Conclude that for all $x$ in $H$, $B_x \equiv \{b : \langle x, b \rangle \ne 0,\ b \in B\}$ is at most countable.

(b) Show that the following are equivalent:

(i) $B$ is complete, i.e., $x \in H$, $\langle x, b \rangle = 0$ for all $b \in B \Rightarrow x = 0$.
(ii) For all $x \in B$, there exists an at most countable set $B_x \equiv \{b_i : i \ge 1\}$ such that $\|x\|^2 = \sum_{i=1}^\infty \langle x, b_i \rangle^2$.
(iii) For all $x \in B$, $\epsilon > 0$, there exists a finite set $\{b_1, b_2, \ldots, b_k\} \subset B$ such that $\big\|x - \sum_{i=1}^k \langle x, b_i \rangle b_i\big\| < \epsilon$.
(iv) If $B \subset B^1$, $B^1$ an orthonormal set in $H \Rightarrow B = B^1$.

3.43 Extend Theorem 3.3.3 to any Hilbert space assuming that the axiom of choice holds. (Hint: Using the axiom of choice or its equivalent, the Hausdorff maximality principle, it can be shown that every Hilbert space $H$ admits an orthonormal basis $B$ (see Rudin (1987)). Now let $T$ be a bounded linear functional from $H$ to $\mathbb{R}$. Let $f(b) \equiv T(b)$ for $b$ in $B$. Verify that $\sum_{i=1}^k |f(b_i)|^2 \le \|T\|^2$ for every finite collection $\{b_i : 1 \le i \le k\} \subset B$. Conclude that $D \equiv \{b : f(b) \ne 0\}$ is countable. Let $x_0 \equiv \sum_{b \in D} f(b)\,b$. Now use the proof of Theorem 3.3.3 to show that $T(x) \equiv \langle x, x_0 \rangle$ for all $x$ in $H$.)

3.44 Let $(V, \|\cdot\|)$ be a normed linear space. Let $\{T_n\}_{n \ge 1}$ and $T$ be bounded linear operators from $V$ to $V$. The sequence $\{T_n\}_{n \ge 1}$ is said to converge

(a) weakly to $T$ if for each $w$ in $V^*$, the dual of $V$, and each $v$ in $V$, $w(T_n(v)) \to w(T(v))$,
(b) strongly if for each $v$ in $V$, $\|T_n v - T v\| \to 0$,
(c) uniformly if $\sup\{\|T_n v - T v\| : \|v\| \le 1\} \to 0$.


Let $V_p = L^p(\mathbb{R}, \mathcal{B}(\mathbb{R}), \mu_L)$, $1 \le p \le \infty$. Let $\{h_n\}_{n \ge 1} \subset \mathbb{R}$ be such that $h_n \ne 0$, $h_n \to 0$ as $n \to \infty$. Let $(T_n f)(\cdot) \equiv f(\cdot + h_n)$, $T f(\cdot) \equiv f(\cdot)$. Verify that

(i) $\{T_n\}_{n \ge 1}$ and $T$ are bounded linear operators on $V_p$, $1 \le p \le \infty$,
(ii) for $1 \le p < \infty$, $\{T_n\}_{n \ge 1}$ converges to $T$ weakly,
(iii) for $1 \le p < \infty$, $\{T_n\}$ converges to $T$ strongly,
(iv) for $1 \le p < \infty$, $\{T_n\}$ does not converge to $T$ uniformly, by showing that for all $n$, $\|T_n - T\| = 1$,
(v) for $p = \infty$, $T_n$ does not converge weakly to $T$.

4 Diﬀerentiation

4.1 The Lebesgue-Radon-Nikodym theorem

Definition 4.1.1: Let $(\Omega, \mathcal{F})$ be a measurable space and let $\mu$ and $\nu$ be two measures on $(\Omega, \mathcal{F})$. The measure $\mu$ is said to be dominated by $\nu$ or absolutely continuous w.r.t. $\nu$, written as $\mu \ll \nu$, if

$$\nu(A) = 0 \Rightarrow \mu(A) = 0 \quad \text{for all } A \in \mathcal{F}. \tag{1.1}$$

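Example 4.1.1: Let $m$ be the Lebesgue measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ and let $\mu$ be the standard normal distribution, i.e.,

$$\mu(A) \equiv \int_A \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\,m(dx), \quad A \in \mathcal{B}(\mathbb{R}).$$

Then $m(A) = 0 \Rightarrow \mu(A) = 0$ and hence $\mu \ll m$.

Example 4.1.2: Let $\mathbb{Z}_+ \equiv \{0, 1, 2, \ldots\}$ denote the set of all nonnegative integers. Let $\nu$ be the counting measure on $\Omega = \mathbb{Z}_+$ and $\mu$ be the Poisson ($\lambda$) distribution for $0 < \lambda < \infty$, i.e., $\nu(A)$ = number of elements in $A$ and

$$\mu(A) = \sum_{j \in A} \frac{e^{-\lambda} \lambda^j}{j!}$$

for all $A \in \mathcal{P}(\Omega)$, the power set of $\Omega$. Since $\nu(A) = 0 \Leftrightarrow A = \emptyset \Leftrightarrow \mu(A) = 0$, it follows that $\mu \ll \nu$ and $\nu \ll \mu$.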


Example 4.1.3: Let $f$ be a nonnegative measurable function on a measure space $(\Omega, \mathcal{F}, \nu)$. Let

$$\mu(A) \equiv \int_A f\,d\nu \quad \text{for all } A \in \mathcal{F}. \tag{1.2}$$

Then, $\mu$ is a measure on $(\Omega, \mathcal{F})$ and $\nu(A) = 0 \Rightarrow \mu(A) = 0$ for all $A \in \mathcal{F}$ and hence $\mu \ll \nu$.

The Radon-Nikodym theorem is a sort of converse to Example 4.1.3. It says that if $\mu$ and $\nu$ are $\sigma$-finite measures (see Section 1.2) on a measurable space $(\Omega, \mathcal{F})$ and if $\mu \ll \nu$, then there is a nonnegative measurable function $f$ on $(\Omega, \mathcal{F})$ such that (1.2) holds.

Definition 4.1.2: Let $(\Omega, \mathcal{F})$ be a measurable space and let $\mu$ and $\nu$ be two measures on $(\Omega, \mathcal{F})$. Then, $\mu$ is called singular w.r.t. $\nu$, written as $\mu \perp \nu$, if there exists a set $B \in \mathcal{F}$ such that

$$\mu(B) = 0 \quad \text{and} \quad \nu(B^c) = 0. \tag{1.3}$$

Note that $\mu$ is singular w.r.t. $\nu$ implies that $\nu$ is singular w.r.t. $\mu$. Thus, the notion of singularity between two measures $\mu$ and $\nu$ is symmetric but that of absolute continuity is not. Note also that if $\mu$ and $\nu$ are mutually singular and $B$ satisfies (1.3), then for all $A \in \mathcal{F}$,

$$\mu(A) = \mu(A \cap B^c) \quad \text{and} \quad \nu(A) = \nu(A \cap B). \tag{1.4}$$

Example 4.1.4: Let $\mu$ be the Lebesgue measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ and $\nu$ be defined as $\nu(A)$ = # elements in $A \cap \mathbb{Z}$, where $\mathbb{Z}$ is the set of integers. Then

$$\nu(\mathbb{Z}^c) = 0 \quad \text{and} \quad \mu(\mathbb{Z}) = 0$$

and hence (1.3) holds with $B = \mathbb{Z}$. Thus $\mu$ and $\nu$ are mutually singular. Another example is the pair $m$ and $\mu_c$ on $[0, 1]$, where $\mu_c$ is the Lebesgue-Stieltjes measure generated by the Cantor function (cf. Section 4.5) and $m$ is the Lebesgue measure.

Example 4.1.5: Let $\mu$ be the Lebesgue measure restricted to $(-\infty, 0]$ and $\nu$ be the Exponential(1) distribution. That is, for any $A \in \mathcal{B}(\mathbb{R})$,

$$\mu(A) = \text{the Lebesgue measure of } A \cap (-\infty, 0]; \qquad \nu(A) = \int_{A \cap (0, \infty)} e^{-x}\,dx.$$

Then, $\mu((0, \infty)) = 0$ and $\nu((-\infty, 0]) = 0$, and (1.3) holds with $B = (0, \infty)$.


Suppose that $\mu$ and $\nu$ are two finite measures on a measurable space $(\Omega, \mathcal{F})$. H. Lebesgue showed that $\mu$ can be decomposed as a sum of two measures, i.e., $\mu = \mu_a + \mu_s$ where $\mu_a \ll \nu$ and $\mu_s \perp \nu$. The next theorem is the main result of this section and it combines the above decomposition result of Lebesgue with the Radon-Nikodym theorem mentioned earlier.

Theorem 4.1.1: Let $(\Omega, \mathcal{F})$ be a measurable space and let $\mu_1$ and $\mu_2$ be two $\sigma$-finite measures on $(\Omega, \mathcal{F})$.

(i) (The Lebesgue decomposition theorem). The measure $\mu_1$ can be uniquely decomposed as

$$\mu_1 = \mu_{1a} + \mu_{1s} \tag{1.5}$$

where $\mu_{1a}$ and $\mu_{1s}$ are $\sigma$-finite measures on $(\Omega, \mathcal{F})$ such that $\mu_{1a} \ll \mu_2$ and $\mu_{1s} \perp \mu_2$.

(ii) (The Radon-Nikodym theorem). There exists a nonnegative measurable function $h$ on $(\Omega, \mathcal{F})$ such that

$$\mu_{1a}(A) = \int_A h\,d\mu_2 \quad \text{for all } A \in \mathcal{F}. \tag{1.6}$$

Proof: Case 1: Suppose that µ1 and µ2 are ﬁnite measures. Let µ be the measure µ = µ1 + µ2 and let H = L2 (µ). Deﬁne a linear function T on H by T (f ) =

f dµ1 .

(1.7)

Then, by the Cauchy-Schwarz inequality applied to the functions f and g ≡ 1, |T (f )|

12 12 µ1 (Ω) f dµ1

12 12 f dµ µ1 (Ω) .

≤ ≤

2

2

This shows that T is a bounded linear functional on H with T ≤ M ≡ 1 (µ1 (Ω)) 2 . By the Riesz representation theorem (cf. Theorem 3.3.3 and Remark 3.3.2), there exists a g ∈ L2 (µ) such that T (f ) = f gdµ (1.8)

116

4. Diﬀerentiation

for all $f \in L^2(\mu)$. Let $f = I_A$ for $A$ in $\mathcal{F}$. Then, (1.7) and (1.8) yield

$$\mu_1(A) = T(I_A) = \int_A g\,d\mu.$$

But $0 \le \mu_1(A) \le \mu(A)$ for all $A \in \mathcal{F}$. Hence the function $g$ in $L^2(\mu)$ satisfies

$$0 \le \int_A g\,d\mu \le \mu(A) \quad \text{for all } A \in \mathcal{F}. \tag{1.9}$$

Let $A_1 = \{0 \le g < 1\}$, $A_2 = \{g = 1\}$, $A_3 = \{g \notin [0, 1]\}$. Then (1.9) implies that $\mu(A_3) = 0$ (see Problem 4.1). Now define the measures $\mu_{1a}(\cdot)$ and $\mu_{1s}(\cdot)$ by

$$\mu_{1a}(A) \equiv \mu_1(A \cap A_1), \qquad \mu_{1s}(A) \equiv \mu_1(A \cap A_2), \quad A \in \mathcal{F}. \tag{1.10}$$

Next it will be shown that $\mu_{1a} \ll \mu_2$ and $\mu_{1s} \perp \mu_2$, thus establishing (1.5). By (1.7) and (1.8), for all $f \in H$,

$$\int f\,d\mu_1 = \int f g\,d\mu = \int f g\,d\mu_1 + \int f g\,d\mu_2 \;\Rightarrow\; \int f(1 - g)\,d\mu_1 = \int f g\,d\mu_2. \tag{1.11}$$

Setting $f = I_{A_2}$ yields $0 = \mu_2(A_2)$. From (1.10), since $\mu_{1s}(A_2^c) = 0$, it follows that $\mu_{1s} \perp \mu_2$. Now fix $n \ge 1$ and $A \in \mathcal{F}$. Let $f = I_{A \cap A_1}(1 + g + \ldots + g^{n-1})$. Then (1.11) implies that

$$\int_{A \cap A_1} (1 - g^n)\,d\mu_1 = \int_{A \cap A_1} g(1 + g + \ldots + g^{n-1})\,d\mu_2.$$

Now letting $n \to \infty$, and using the MCT on both sides, yields

$$\mu_{1a}(A) = \int_A I_{A_1} \frac{g}{1 - g}\,d\mu_2. \tag{1.12}$$

Setting $h \equiv \frac{g}{1-g} I_{A_1}$ completes the proof of (1.5) and (1.6).
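The construction in Case 1 can be traced by hand on a finite set, where every integral is a finite sum. The following is a minimal sketch under that assumption: the function name `lebesgue_decomposition` and the example masses are hypothetical, and $g$ is simply the ratio $\mu_1(\{\omega\})/\mu(\{\omega\})$ at each point with $\mu(\{\omega\}) > 0$.

```python
from fractions import Fraction

def lebesgue_decomposition(mu1, mu2):
    """Decompose mu1 w.r.t. mu2 on a finite set, mimicking Case 1 of the proof.

    mu1, mu2: dicts mapping each point to its finite nonnegative mass.
    Returns (mu1a, mu1s, h) as dicts over the union of the supports.
    """
    points = sorted(set(mu1) | set(mu2))
    mu1a, mu1s, h = {}, {}, {}
    for w in points:
        m1 = Fraction(mu1.get(w, 0))
        m2 = Fraction(mu2.get(w, 0))
        total = m1 + m2                      # mu = mu1 + mu2
        # g = d(mu1)/d(mu); on a mu-null point g is arbitrary, take 0.
        g = m1 / total if total != 0 else Fraction(0)
        if g < 1:                            # w lies in A1 = {0 <= g < 1}
            mu1a[w], mu1s[w] = m1, Fraction(0)
            h[w] = g / (1 - g)               # h = g / (1 - g) on A1
        else:                                # w lies in A2 = {g = 1}
            mu1a[w], mu1s[w] = Fraction(0), m1
            h[w] = Fraction(0)
    return mu1a, mu1s, h

# Hypothetical example: mu2 puts no mass on "b" and "c".
mu1 = {"a": 2, "b": 3, "c": 5}
mu2 = {"a": 4, "b": 0, "d": 1}
mu1a, mu1s, h = lebesgue_decomposition(mu1, mu2)
```

In this example the singular part of `mu1` sits on the points `"b"` and `"c"`, which carry no `mu2` mass, while on `"a"` the density is `h["a"] = 1/2`, so that `h["a"] * mu2["a"]` recovers `mu1a["a"]`.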

Case 2: Now suppose that $\mu_1$ and $\mu_2$ are σ-finite. Then there exists a countable partition $\{D_n\}_{n \ge 1} \subset \mathcal{F}$ of $\Omega$ such that $\mu_1(D_n)$ and $\mu_2(D_n)$ are both finite for all $n \ge 1$. Let $\mu_1^{(n)}(\cdot) \equiv \mu_1(\cdot \cap D_n)$ and $\mu_2^{(n)}(\cdot) \equiv \mu_2(\cdot \cap D_n)$. Then applying 'Case 1' to $\mu_1^{(n)}$ and $\mu_2^{(n)}$ for each $n \ge 1$, one gets measures $\mu_{1a}^{(n)}$, $\mu_{1s}^{(n)}$ and a function $h_n$ such that

$$\mu_1^{(n)}(\cdot) \equiv \mu_{1a}^{(n)}(\cdot) + \mu_{1s}^{(n)}(\cdot) \tag{1.13}$$

where, for $A$ in $\mathcal{F}$, $\mu_{1a}^{(n)}(A) = \int_A h_n\,d\mu_2^{(n)} = \int_A h_n I_{D_n}\,d\mu_2$ and $\mu_{1s}^{(n)}(\cdot) \perp \mu_2^{(n)}$. Since $\mu_1(\cdot) = \sum_{n=1}^{\infty} \mu_1^{(n)}(\cdot)$, it follows from (1.13) that

$$\mu_1(\cdot) = \mu_{1a}(\cdot) + \mu_{1s}(\cdot), \tag{1.14}$$

where $\mu_{1a}(A) \equiv \sum_{n=1}^{\infty} \mu_{1a}^{(n)}(A)$ and $\mu_{1s}(\cdot) = \sum_{n=1}^{\infty} \mu_{1s}^{(n)}(\cdot)$. By the MCT,

$$\mu_{1a}(A) = \int_A h\,d\mu_2, \quad A \in \mathcal{F},$$

where $h \equiv \sum_{n=1}^{\infty} h_n I_{D_n}$. Clearly, $\mu_{1a} \ll \mu_2$. The verification of the singularity of $\mu_{1s}$ and $\mu_2$ is left as an exercise (Problem 4.2).

It remains to prove the uniqueness of the decomposition. Let $\mu_1 = \mu_a + \mu_s$ and $\mu_1 = \mu_a' + \mu_s'$ be two decompositions of $\mu_1$ where $\mu_a$ and $\mu_a'$ are absolutely continuous w.r.t. $\mu_2$ and $\mu_s$ and $\mu_s'$ are singular w.r.t. $\mu_2$. By definition, there exist sets $B$ and $B'$ in $\mathcal{F}$ such that

$$\mu_2(B) = 0, \quad \mu_2(B') = 0, \quad \text{and} \quad \mu_s(B^c) = 0, \quad \mu_s'(B'^c) = 0.$$

Let $D = B \cup B'$. Then $\mu_2(D) = 0$ and $\mu_s(D^c) \le \mu_s(B^c) = 0$. Similarly, $\mu_s'(D^c) \le \mu_s'(B'^c) = 0$. Also $\mu_2(D) = 0$ implies $\mu_a(D) = 0 = \mu_a'(D)$. Thus for any $A \in \mathcal{F}$,

$$\mu_a(A) = \mu_a(A \cap D^c) \quad \text{and} \quad \mu_a'(A) = \mu_a'(A \cap D^c).$$

Also

$$\mu_s(A \cap D^c) \le \mu_s(A \cap B^c) = 0, \qquad \mu_s'(A \cap D^c) \le \mu_s'(A \cap B'^c) = 0.$$

Thus, $\mu_1(A \cap D^c) = \mu_a(A \cap D^c) + \mu_s(A \cap D^c) = \mu_a(A \cap D^c) = \mu_a(A)$ and $\mu_1(A \cap D^c) = \mu_a'(A \cap D^c) + \mu_s'(A \cap D^c) = \mu_a'(A \cap D^c) = \mu_a'(A)$. Hence, $\mu_a(A) = \mu_1(A \cap D^c) = \mu_a'(A)$ for every $A \in \mathcal{F}$. That is, $\mu_a = \mu_a'$ and hence, $\mu_s = \mu_s'$. □

Remark 4.1.1: In Theorem 4.1.1, the hypothesis of σ-finiteness cannot be dropped. For example, let $\mu$ be the Lebesgue measure and $\nu$ be the counting measure on $[0, 1]$. Then $\mu \ll \nu$ but there does not exist a nonnegative $\mathcal{F}$-measurable function $h$ such that $\mu(A) = \int_A h\,d\nu$. To see this, if possible, suppose that for some $h \in L^1(\nu)$, $\mu(A) = \int_A h\,d\nu$ for all $A \in \mathcal{F}$. Note that $\mu([0, 1]) = 1$ implies that $\int_{[0,1]} h\,d\nu < \infty$ and hence, that $B \equiv \{x : h(x) > 0\}$ is countable (Problem 4.3). But $\mu$ being the Lebesgue measure, $\mu(B) = 0$ and $\mu(B^c) = 1$. Since by definition, $h \equiv 0$ on $B^c$, this implies $1 = \mu(B^c) = \int_{B^c} h\,d\nu = 0$, leading to a contradiction. However, if $\nu$ is σ-finite and $\mu \ll \nu$ ($\mu$ not necessarily σ-finite), then the Radon-Nikodym theorem holds, i.e., there exists a nonnegative $\mathcal{F}$-measurable function $h$ such that

$$\mu(A) = \int_A h\,d\nu \quad \text{for all } A \in \mathcal{F}.$$


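For a proof, see Royden (1988), Chapter 11.

Definition 4.1.3: Let $\mu$ and $\nu$ be measures on a measurable space $(\Omega, \mathcal{F})$ and let $h$ be a nonnegative measurable function such that

$$\mu(A) = \int_A h\,d\nu \quad \text{for all } A \in \mathcal{F}.$$

Then $h$ is called the Radon-Nikodym derivative of $\mu$ w.r.t. $\nu$ and is written as

$$\frac{d\mu}{d\nu} = h.$$

If $\mu(\Omega) < \infty$ and there exist two nonnegative $\mathcal{F}$-measurable functions $h_1$ and $h_2$ such that

$$\mu(A) = \int_A h_1\,d\nu = \int_A h_2\,d\nu$$

for all $A \in \mathcal{F}$, then $h_1 = h_2$ a.e. $(\nu)$ and thus the Radon-Nikodym derivative $\frac{d\mu}{d\nu}$ is unique up to equivalence a.e. $(\nu)$. This also extends to the case when $\mu$ is σ-finite.

The following proposition is easy to verify (cf. Problem 4.4).

Proposition 4.1.2: Let $\nu, \mu, \mu_1, \mu_2, \ldots$ be σ-finite measures on a measurable space $(\Omega, \mathcal{F})$.

(i) If $\mu_1 \ll \mu_2$ and $\mu_2 \ll \mu_3$, then $\mu_1 \ll \mu_3$ and

$$\frac{d\mu_1}{d\mu_3} = \frac{d\mu_1}{d\mu_2}\,\frac{d\mu_2}{d\mu_3} \quad \text{a.e. } (\mu_3).$$

(ii) Suppose that $\mu_1$ and $\mu_2$ are dominated by $\mu_3$. Then for any $\alpha, \beta \ge 0$, $\alpha\mu_1 + \beta\mu_2$ is dominated by $\mu_3$ and

$$\frac{d(\alpha\mu_1 + \beta\mu_2)}{d\mu_3} = \alpha\frac{d\mu_1}{d\mu_3} + \beta\frac{d\mu_2}{d\mu_3} \quad \text{a.e. } (\mu_3).$$

(iii) If $\mu \ll \nu$ and $\frac{d\mu}{d\nu} > 0$ a.e. $(\nu)$, then $\nu \ll \mu$ and

$$\frac{d\nu}{d\mu} = \Big(\frac{d\mu}{d\nu}\Big)^{-1} \quad \text{a.e. } (\mu).$$

(iv) Let $\{\mu_n\}_{n \ge 1}$ be a sequence of measures and $\{\alpha_n\}_{n \ge 1}$ be a sequence of positive real numbers, i.e., $\alpha_n > 0$ for all $n \ge 1$. Define $\mu = \sum_{n=1}^{\infty} \alpha_n \mu_n$.

(a) Then, $\mu \ll \nu$ iff $\mu_n \ll \nu$ for each $n \ge 1$ and in this case,

$$\frac{d\mu}{d\nu} = \sum_{n=1}^{\infty} \alpha_n \frac{d\mu_n}{d\nu} \quad \text{a.e. } (\nu).$$

(b) $\mu \perp \nu$ iff $\mu_n \perp \nu$ for all $n \ge 1$.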

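4.2 Signed measures

Let $\mu_1$ and $\mu_2$ be two finite measures on a measurable space $(\Omega, \mathcal{F})$. Let

$$\nu(A) \equiv \mu_1(A) - \mu_2(A), \quad \text{for all } A \in \mathcal{F}. \tag{2.1}$$

Then $\nu : \mathcal{F} \to \mathbb{R}$ satisfies the following:

(i) $\nu(\emptyset) = 0$.

(ii) For any $\{A_n\}_{n \ge 1} \subset \mathcal{F}$ with $A_i \cap A_j = \emptyset$ for $i \ne j$, with $A \equiv \bigcup_{n \ge 1} A_n$, and with $\sum_{i=1}^{\infty} |\nu(A_i)| < \infty$,

$$\nu(A) = \sum_{i=1}^{\infty} \nu(A_i). \tag{2.2}$$

(iii) Let

$$\|\nu\| \equiv \sup\Big\{ \sum_{i=1}^{\infty} |\nu(A_i)| : \{A_n\}_{n \ge 1} \subset \mathcal{F},\ A_i \cap A_j = \emptyset \text{ for } i \ne j,\ \bigcup_{n \ge 1} A_n = \Omega \Big\}. \tag{2.3}$$

Then, $\|\nu\|$ is finite.

Note that (iii) holds because $\|\nu\| \le \mu_1(\Omega) + \mu_2(\Omega) < \infty$.

Definition 4.2.1: A set function $\nu : \mathcal{F} \to \mathbb{R}$ satisfying (i), (ii), and (iii) above is called a finite signed measure.

The above example shows that the difference of two finite measures is a finite signed measure. It will be shown below that every finite signed measure can be expressed as the difference of two finite measures.

Proposition 4.2.1: Let $\nu$ be a finite signed measure on $(\Omega, \mathcal{F})$. Let

$$|\nu|(A) \equiv \sup\Big\{ \sum_{n=1}^{\infty} |\nu(A_n)| : \{A_n\}_{n \ge 1} \subset \mathcal{F},\ A_i \cap A_j = \emptyset \text{ for } i \ne j,\ \bigcup_{n \ge 1} A_n = A \Big\}. \tag{2.4}$$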


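Then $|\nu|(\cdot)$ is a finite measure on $(\Omega, \mathcal{F})$.

Proof: That $|\nu|(\Omega) < \infty$ follows from part (iii) of the definition. Thus it is enough to verify that $|\nu|$ is countably additive. Let $\{A_n\}_{n \ge 1}$ be a countable family of disjoint sets in $\mathcal{F}$. Let $A = \bigcup_{n \ge 1} A_n$. By the definition of $|\nu|$, for all $\epsilon > 0$ and $n \in \mathbb{N}$, there exists a countable family $\{A_{nj}\}_{j \ge 1}$ of disjoint sets in $\mathcal{F}$ with $A_n = \bigcup_{j \ge 1} A_{nj}$ such that $\sum_{j=1}^{\infty} |\nu(A_{nj})| > |\nu|(A_n) - \frac{\epsilon}{2^n}$. Hence,

$$\sum_{n=1}^{\infty} \sum_{j=1}^{\infty} |\nu(A_{nj})| > \sum_{n=1}^{\infty} |\nu|(A_n) - \epsilon.$$

Note that $\{A_{nj}\}_{n \ge 1, j \ge 1}$ is a countable family of disjoint sets in $\mathcal{F}$ such that $A = \bigcup_{n \ge 1} A_n = \bigcup_{n \ge 1} \bigcup_{j \ge 1} A_{nj}$. It follows from the definition of $|\nu|$ that

$$|\nu|(A) \ge \sum_{n=1}^{\infty} \sum_{j=1}^{\infty} |\nu(A_{nj})| > \sum_{n=1}^{\infty} |\nu|(A_n) - \epsilon.$$

Since this is true for all $\epsilon > 0$, it follows that

$$|\nu|(A) \ge \sum_{n=1}^{\infty} |\nu|(A_n). \tag{2.5}$$

To get the opposite inequality, let $\{B_j\}_{j \ge 1}$ be a countable family of disjoint sets in $\mathcal{F}$ such that $\bigcup_{j \ge 1} B_j = A = \bigcup_{n \ge 1} A_n$. Since $B_j = B_j \cap A = \bigcup_{n \ge 1} (B_j \cap A_n)$ and $\nu$ satisfies (2.2),

$$\nu(B_j) = \sum_{n=1}^{\infty} \nu(B_j \cap A_n) \quad \text{for all } j \ge 1.$$

Thus

$$\sum_{j=1}^{\infty} |\nu(B_j)| \le \sum_{j=1}^{\infty} \sum_{n=1}^{\infty} |\nu(B_j \cap A_n)| = \sum_{n=1}^{\infty} \sum_{j=1}^{\infty} |\nu(B_j \cap A_n)|. \tag{2.6}$$

Note that for each $A_n$, $\{B_j \cap A_n\}_{j \ge 1}$ is a countable family of disjoint sets in $\mathcal{F}$ such that $A_n = \bigcup_{j \ge 1} (B_j \cap A_n)$. Hence from (2.4), it follows that $|\nu|(A_n) \ge \sum_{j=1}^{\infty} |\nu(B_j \cap A_n)|$ and hence, $\sum_{n=1}^{\infty} |\nu|(A_n) \ge \sum_{n=1}^{\infty} \sum_{j=1}^{\infty} |\nu(B_j \cap A_n)|$. From (2.6), it follows that $\sum_{n=1}^{\infty} |\nu|(A_n) \ge \sum_{j=1}^{\infty} |\nu(B_j)|$. This being true for every such family $\{B_j\}_{j \ge 1}$, it follows from (2.4) that

$$|\nu|(A) \le \sum_{i=1}^{\infty} |\nu|(A_i) \tag{2.7}$$

and with (2.5), this completes the proof. □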

Definition 4.2.2: The measure $|\nu|$ defined by (2.4) is called the total variation measure of the signed measure $\nu$.

Next, define the set functions

$$\nu^+ \equiv \frac{|\nu| + \nu}{2}, \qquad \nu^- \equiv \frac{|\nu| - \nu}{2}. \tag{2.8}$$

It can be verified that $\nu^+$ and $\nu^-$ are both finite measures on $(\Omega, \mathcal{F})$.

Definition 4.2.3: The measures $\nu^+$ and $\nu^-$ are called the positive and negative variation measures of the signed measure $\nu$, respectively.

It follows from (2.8) that

$$\nu = \nu^+ - \nu^-. \tag{2.9}$$

Thus every finite signed measure $\nu$ on $(\Omega, \mathcal{F})$ is the difference of two finite measures, as claimed earlier. Note that both $\nu^+$ and $\nu^-$ are dominated by $|\nu|$ and all three measures are finite. By the Radon-Nikodym theorem (Theorem 4.1.1), there exist functions $h_1$ and $h_2$ in $L^1(\Omega, \mathcal{F}, |\nu|)$ such that

$$\frac{d\nu^+}{d|\nu|} = h_1 \quad \text{and} \quad \frac{d\nu^-}{d|\nu|} = h_2. \tag{2.10}$$

This and (2.9) imply that for any $A$ in $\mathcal{F}$,

$$\nu(A) = \int_A h_1\,d|\nu| - \int_A h_2\,d|\nu| = \int_A h\,d|\nu|, \tag{2.11}$$

where $h = h_1 - h_2$. Thus every finite signed measure $\nu$ on $(\Omega, \mathcal{F})$ can be expressed as

$$\nu(A) = \int_A f\,d\mu, \quad A \in \mathcal{F} \tag{2.12}$$

for some ﬁnite measure µ on (Ω, F) and some f ∈ L1 (Ω, F, µ). Conversely, it is easy to verify that a set function ν deﬁned by (2.12) for some ﬁnite measure µ on (Ω, F) and some f ∈ L1 (Ω, F, µ) is a ﬁnite signed measure (cf. Problem 4.6). This leads to the following: Theorem 4.2.2: (i) A set function ν on a measurable space (Ω, F) is a ﬁnite signed measure iﬀ there exist two ﬁnite measures µ1 and µ2 on (Ω, F) such that ν = µ1 − µ2 .


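(ii) A set function $\nu$ on a measurable space $(\Omega, \mathcal{F})$ is a finite signed measure iff there exist a finite measure $\mu$ on $(\Omega, \mathcal{F})$ and an $f \in L^1(\Omega, \mathcal{F}, \mu)$ such that for all $A$ in $\mathcal{F}$,

$$\nu(A) = \int_A f\,d\mu.$$

Definition 4.2.4: Let $\nu$ be a finite signed measure on a measurable space $(\Omega, \mathcal{F})$. A set $A \in \mathcal{F}$ is called a positive set for $\nu$ if for any $B \subset A$, $B \in \mathcal{F}$, $\nu(B) \ge 0$. A set $A \in \mathcal{F}$ is called a negative set for $\nu$ if for any $B \subset A$ with $B \in \mathcal{F}$, $\nu(B) \le 0$.

Let $h$ be as in (2.11). Let

$$\Omega^+ = \{\omega : h(\omega) \ge 0\} \quad \text{and} \quad \Omega^- = \{\omega : h(\omega) < 0\}. \tag{2.13}$$

From (2.11), it follows that for all $A$ in $\mathcal{F}$, $\nu(A \cap \Omega^+) \ge 0$ and $\nu(A \cap \Omega^-) \le 0$. Thus $\Omega^+$ is a positive set and $\Omega^-$ is a negative set for $\nu$. Furthermore, $\Omega^+ \cup \Omega^- = \Omega$ and $\Omega^+ \cap \Omega^- = \emptyset$. Summarizing this discussion, one gets the following theorem.

Theorem 4.2.3: (Hahn decomposition theorem). Let $\nu$ be a finite signed measure on a measurable space $(\Omega, \mathcal{F})$. Then there exist a positive set $\Omega^+$ and a negative set $\Omega^-$ for $\nu$ such that $\Omega = \Omega^+ \cup \Omega^-$ and $\Omega^+ \cap \Omega^- = \emptyset$.

Let $\Omega^+$ and $\Omega^-$ be as in (2.13). It can be verified (Problem 4.8) that for any $B \in \mathcal{F}$, if $B \subset \Omega^+$, then $\nu(B) = |\nu|(B)$. By (2.11), this implies that for all $A$ in $\mathcal{F}$,

$$\int_{A \cap \Omega^+} h\,d|\nu| = |\nu|(A \cap \Omega^+).$$

It follows that $h = 1$ a.e. $(|\nu|)$ on $\Omega^+$. Similarly, $h = -1$ a.e. $(|\nu|)$ on $\Omega^-$. Thus, the measures $\nu^+$ and $\nu^-$, defined in (2.8), reduce to

$$\nu^+(A) = \int_A \frac{(1 + h)}{2}\,d|\nu| = \int_{A \cap \Omega^+} \frac{(1 + h)}{2}\,d|\nu| + \int_{A \cap \Omega^-} \frac{(1 + h)}{2}\,d|\nu| = |\nu|(A \cap \Omega^+),$$

and similarly

$$\nu^-(A) = |\nu|(A \cap \Omega^-).$$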
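On a finite set the supremum in (2.4) runs over finitely many partitions, so the total variation and the formulas (2.8) can be checked by brute force. The following is a small sketch under that assumption; the helper names (`set_partitions`, `total_variation`, `nu_plus`, `nu_minus`) and the point masses are hypothetical, chosen only for illustration.

```python
from fractions import Fraction

def set_partitions(items):
    """Yield every partition of the list `items` into nonempty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or into a new block of its own.
        yield [[first]] + partition

def nu_of(nu, block):
    """nu(block) for a signed measure given by its point masses."""
    return sum(nu[w] for w in block)

def total_variation(nu, A):
    """|nu|(A) as the supremum in (2.4), by brute force over partitions of A."""
    return max(sum(abs(nu_of(nu, block)) for block in p)
               for p in set_partitions(list(A)))

def nu_plus(nu, A):
    return (total_variation(nu, A) + nu_of(nu, A)) / 2   # first formula in (2.8)

def nu_minus(nu, A):
    return (total_variation(nu, A) - nu_of(nu, A)) / 2   # second formula in (2.8)

# Hypothetical signed measure on a four-point set.
nu = {"w1": Fraction(3), "w2": Fraction(-1), "w3": Fraction(-2), "w4": Fraction(1, 2)}
omega = list(nu)
omega_plus = [w for w in omega if nu[w] >= 0]   # a positive set for nu
```

Here the supremum is attained by the partition into singletons, and `nu_plus(nu, omega)` agrees with `total_variation(nu, omega_plus)`, matching $\nu^+(A) = |\nu|(A \cap \Omega^+)$ for $A = \Omega$.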

Note that ν + and ν − are both ﬁnite measures that are mutually singular. This particular decomposition of ν as ν = ν+ − ν−


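is known as the Jordan decomposition of $\nu$. It will now be shown that this decomposition is minimal and that it is unique in the class of signed measures with mutually singular components.

Suppose there exist two finite measures $\mu_1$ and $\mu_2$ on $(\Omega, \mathcal{F})$ such that $\nu = \mu_1 - \mu_2$. For any $A \in \mathcal{F}$,

$$\nu^+(A) = \nu(A \cap \Omega^+) = \mu_1(A \cap \Omega^+) - \mu_2(A \cap \Omega^+) \le \mu_1(A \cap \Omega^+) \le \mu_1(A)$$

and

$$\nu^-(A) = -\nu(A \cap \Omega^-) = \mu_2(A \cap \Omega^-) - \mu_1(A \cap \Omega^-) \le \mu_2(A \cap \Omega^-) \le \mu_2(A).$$

Thus, $\nu^+ \le \mu_1$ and $\nu^- \le \mu_2$. Clearly, since both $\mu_1$ and $\nu^+$ are finite measures on $(\Omega, \mathcal{F})$, $\mu_1 - \nu^+$ is also a finite measure. Similarly, $\mu_2 - \nu^-$ is also a finite measure. Also, since $\mu_1 - \mu_2 = \nu = \nu^+ - \nu^-$, it follows that $\mu_1 - \nu^+ = \mu_2 - \nu^- = \lambda$, say. Thus, for any decomposition of $\nu$ as $\mu_1 - \mu_2$ with $\mu_1, \mu_2$ finite measures, it holds that $\mu_1 = \nu^+ + \lambda$ and $\mu_2 = \nu^- + \lambda$, where $\lambda$ is a measure on $(\Omega, \mathcal{F})$. Thus, $\nu = \nu^+ - \nu^-$ is a minimal decomposition in the sense that in this case $\lambda = 0$.

Now suppose $\mu_1$ and $\mu_2$ are mutually singular, i.e., there exist $\Omega_1, \Omega_2 \in \mathcal{F}$ such that $\Omega_1 \cap \Omega_2 = \emptyset$, $\Omega_1 \cup \Omega_2 = \Omega$, and $\mu_1(\Omega_2) = 0 = \mu_2(\Omega_1)$. Since $\mu_1 \ge \lambda$ and $\mu_2 \ge \lambda$, it follows that $\lambda(\Omega_2) = 0 = \lambda(\Omega_1)$. Thus $\lambda = 0$ and $\mu_1 = \nu^+$ and $\mu_2 = \nu^-$. Summarizing the above discussion yields:

Theorem 4.2.4: Let $\nu$ be a finite signed measure on a measurable space $(\Omega, \mathcal{F})$ and let $\mu_1$ and $\mu_2$ be two finite measures on $(\Omega, \mathcal{F})$ such that $\nu = \mu_1 - \mu_2$. Then there exists a finite measure $\lambda$ such that $\mu_1 = \nu^+ + \lambda$ and $\mu_2 = \nu^- + \lambda$ with $\lambda = 0$ iff $\mu_1$ and $\mu_2$ are mutually singular.

Let

$$S \equiv \{\nu : \nu \text{ is a finite signed measure on } (\Omega, \mathcal{F})\}.$$

Also, for any $\alpha \in \mathbb{R}$, let $\alpha^+ = \max(\alpha, 0)$ and $\alpha^- = \max(-\alpha, 0)$. Note that for $\nu_1, \nu_2$ in $S$ and $\alpha_1, \alpha_2 \in \mathbb{R}$,

$$\begin{aligned}
\alpha_1\nu_1 + \alpha_2\nu_2 &= (\alpha_1^+ - \alpha_1^-)(\nu_1^+ - \nu_1^-) + (\alpha_2^+ - \alpha_2^-)(\nu_2^+ - \nu_2^-) \\
&= (\alpha_1^+\nu_1^+ + \alpha_1^-\nu_1^- + \alpha_2^+\nu_2^+ + \alpha_2^-\nu_2^-) - (\alpha_1^+\nu_1^- + \alpha_1^-\nu_1^+ + \alpha_2^+\nu_2^- + \alpha_2^-\nu_2^+) \\
&= \lambda_1 - \lambda_2, \quad \text{say},
\end{aligned}$$

where $\lambda_1$ and $\lambda_2$ are both finite measures. It now follows from Theorem 4.2.2 that $\alpha_1\nu_1 + \alpha_2\nu_2 \in S$. Thus, $S$ is a vector space over $\mathbb{R}$. Now it will be shown that $\|\nu\| \equiv |\nu|(\Omega)$ is a norm on $S$ and that $(S, \|\cdot\|)$ is a Banach space.

Definition 4.2.5: For a finite signed measure $\nu$ on a measurable space $(\Omega, \mathcal{F})$, the total variation norm $\|\nu\|$ is defined by $\|\nu\| \equiv |\nu|(\Omega)$.

Proposition 4.2.5: Let $S \equiv \{\nu : \nu \text{ a finite signed measure on } (\Omega, \mathcal{F})\}$. Then, $\|\nu\| \equiv |\nu|(\Omega)$, $\nu \in S$, is a norm on $S$.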


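Proof: Let $\nu_1, \nu_2 \in S$, $\alpha_1, \alpha_2 \in \mathbb{R}$ and $\lambda = \alpha_1\nu_1 + \alpha_2\nu_2$. For any $A \in \mathcal{F}$ and any disjoint $\{A_i\}_{i \ge 1} \subset \mathcal{F}$ with $A = \bigcup_{i \ge 1} A_i$,

$$|\lambda(A_i)| \le |\alpha_1||\nu_1(A_i)| + |\alpha_2||\nu_2(A_i)| \quad \text{for all } i \ge 1$$

$$\Rightarrow \quad \sum_{i \ge 1} |\lambda(A_i)| \le |\alpha_1| \sum_{i \ge 1} |\nu_1(A_i)| + |\alpha_2| \sum_{i \ge 1} |\nu_2(A_i)| \le |\alpha_1||\nu_1|(A) + |\alpha_2||\nu_2|(A).$$

Taking supremum over all $\{A_i\}_{i \ge 1}$ yields $|\lambda|(A) \le |\alpha_1||\nu_1|(A) + |\alpha_2||\nu_2|(A)$, i.e., $|\lambda|(\cdot) \le |\alpha_1||\nu_1|(\cdot) + |\alpha_2||\nu_2|(\cdot)$, and therefore

$$\|\lambda\| \equiv |\lambda|(\Omega) \le |\alpha_1||\nu_1|(\Omega) + |\alpha_2||\nu_2|(\Omega) = |\alpha_1|\|\nu_1\| + |\alpha_2|\|\nu_2\|.$$

Taking $\alpha_1 = \alpha_2 = 1$ yields $\|\nu_1 + \nu_2\| \le \|\nu_1\| + \|\nu_2\|$, i.e., the triangle inequality holds. Next taking $\alpha_2 = 0$ yields $\|\alpha_1\nu_1\| \le |\alpha_1|\|\nu_1\|$. To get the opposite inequality, note that for $\alpha_1 \ne 0$, $\nu_1 = \frac{1}{\alpha_1}\alpha_1\nu_1$ and so $\|\nu_1\| \le \big|\frac{1}{\alpha_1}\big|\|\alpha_1\nu_1\|$. Hence, $|\alpha_1|\|\nu_1\| \le \|\alpha_1\nu_1\|$. Thus, for any $\alpha_1 \ne 0$, $\|\alpha_1\nu_1\| = |\alpha_1|\|\nu_1\|$. Finally, $\|\nu\| = 0 \Rightarrow |\nu|(\Omega) = 0 \Rightarrow |\nu|(A) = 0$ for all $A \in \mathcal{F} \Rightarrow \nu(A) = 0$ for all $A \in \mathcal{F}$, i.e., $\nu$ is the zero measure. Thus $\|\cdot\|$ is a norm on $S$. □

Proposition 4.2.6: $(S, \|\cdot\|)$ is complete.

Proof: Let $\{\nu_n\}_{n \ge 1}$ be a Cauchy sequence in $(S, \|\cdot\|)$. Note that for each $A \in \mathcal{F}$, $|\nu_n(A) - \nu_m(A)| \le |\nu_n - \nu_m|(A) \le \|\nu_n - \nu_m\|$. Hence, for each $A \in \mathcal{F}$, $\{\nu_n(A)\}_{n \ge 1}$ is a Cauchy sequence in $\mathbb{R}$ and hence

$$\nu(A) \equiv \lim_{n \to \infty} \nu_n(A) \quad \text{exists.}$$

It will be shown that $\nu(\cdot)$ is a finite signed measure and $\|\nu_n - \nu\| \to 0$ as $n \to \infty$. Let $\{A_i\}_{i \ge 1} \subset \mathcal{F}$, $A_i \cap A_j = \emptyset$ for $i \ne j$, and $A = \bigcup_{i \ge 1} A_i$. Let $x_n \equiv \{\nu_n(A_i)\}_{i \ge 1}$, $n \ge 1$, and let $x_0 \equiv \{\nu(A_i)\}_{i \ge 1}$. Note that each $x_n \in \ell^1$ where $\ell^1 \equiv \{x : x = \{x_i\}_{i \ge 1},\ x_i \in \mathbb{R},\ \sum_{i \ge 1} |x_i| < \infty\}$. For $x \in \ell^1$, let $\|x\|_1 = \sum_{i=1}^{\infty} |x_i|$. Then

$$\|x_n - x_m\|_1 = \sum_{i \ge 1} |\nu_n(A_i) - \nu_m(A_i)| \le |\nu_n - \nu_m|(A) \le |\nu_n - \nu_m|(\Omega) \to 0 \quad \text{as } n, m \to \infty.$$

But $\ell^1$ is complete under $\|\cdot\|_1$. So there exists $x^* \in \ell^1$ such that $\|x_n - x^*\|_1 \to 0$. Since $x_{ni} \equiv \nu_n(A_i) \to \nu(A_i)$ for all $i \ge 1$, it follows that $x_i^* = \nu(A_i)$ for all $i \ge 1$ and that $\sum_{i=1}^{\infty} |\nu(A_i)| < \infty$. Also, for all $n \ge 1$, $\nu_n(A) = \sum_{i=1}^{\infty} \nu_n(A_i)$. Since $\sum_{i \ge 1} |\nu_n(A_i) - \nu(A_i)| \to 0$, $\nu_n(A) \equiv \sum_{i \ge 1} \nu_n(A_i) \to \sum_{i \ge 1} \nu(A_i)$ as $n \to \infty$. But $\nu_n(A) \to \nu(A)$. Thus, $\nu(A) = \sum_{i=1}^{\infty} \nu(A_i)$. Also for any countable partition $\{A_i\}_{i \ge 1} \subset \mathcal{F}$ of $\Omega$,

$$\sum_{i=1}^{\infty} |\nu(A_i)| = \lim_{n \to \infty} \sum_{i=1}^{\infty} |\nu_n(A_i)| \le \lim_{n \to \infty} \|\nu_n\| < \infty.$$

Thus $|\nu|(\Omega) < \infty$ and hence, $\nu \in S$. Finally,

$$\|\nu_n - \nu\| = \sup\Big\{ \sum_{i=1}^{\infty} |\nu_n(A_i) - \nu(A_i)| : \{A_i\}_{i \ge 1} \subset \mathcal{F} \text{ is a disjoint partition of } \Omega \Big\}.$$

But for every countable partition $\{A_i\}_{i \ge 1} \subset \mathcal{F}$,

$$\sum_{i=1}^{\infty} |\nu_n(A_i) - \nu(A_i)| = \lim_{m \to \infty} \sum_{i=1}^{\infty} |\nu_n(A_i) - \nu_m(A_i)| \le \lim_{m \to \infty} \|\nu_n - \nu_m\|.$$

Thus, $\|\nu_n - \nu\| \le \lim_{m \to \infty} \|\nu_n - \nu_m\|$ and hence, $\lim_{n \to \infty} \|\nu_n - \nu\| \le \lim_{n \to \infty} \lim_{m \to \infty} \|\nu_n - \nu_m\| = 0$. Hence, $\nu_n \to \nu$ in $S$. □

Definition 4.2.6: (Integration w.r.t. signed measures). Let $\mu$ be a finite signed measure on a measurable space $(\Omega, \mathcal{F})$ and $|\mu|$ be its total variation measure as in Definition 4.2.2. Then, for any $f \in L^1(\Omega, \mathcal{F}, |\mu|)$, $\int f\,d\mu$ is defined as

$$\int f\,d\mu = \int f\,d\mu^+ - \int f\,d\mu^-,$$

where $\mu^+$ and $\mu^-$ are the positive and negative variations of $\mu$ as defined in (2.8).

Proposition 4.2.7: Let $\mu$ be a signed measure on a measurable space $(\Omega, \mathcal{F})$. Let $\mu = \lambda_1 - \lambda_2$ where $\lambda_1$ and $\lambda_2$ are finite measures. Let $f \in L^1(\Omega, \mathcal{F}, \lambda_1 + \lambda_2)$. Then $f \in L^1(\Omega, \mathcal{F}, |\mu|)$ and

$$\int f\,d\mu = \int f\,d\lambda_1 - \int f\,d\lambda_2. \tag{2.14}$$

Proof: Left as an exercise (Problem 4.13).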

4.3 Functions of bounded variation

From the construction of the Lebesgue-Stieltjes measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ discussed in Chapter 1, it is seen that to every nondecreasing right continuous function $F : \mathbb{R} \to \mathbb{R}$, there is a (Radon) measure $\mu_F$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such that $\mu_F((a, b]) = F(b) - F(a)$ for all $a < b$ and conversely. If $\mu_1$ and $\mu_2$ are two Radon measures, generated by nondecreasing right continuous functions $F_1$ and $F_2$ respectively, and $\mu = \mu_1 - \mu_2$, let

$$G(x) \equiv \begin{cases} \mu((0, x]) & \text{for } x > 0, \\ -\mu((x, 0]) & \text{for } x < 0, \\ 0 & \text{for } x = 0. \end{cases}$$

Since $\mu((0, x]) = (F_1(x) - F_1(0)) - (F_2(x) - F_2(0))$ for $x > 0$ and $-\mu((x, 0]) = -(F_1(0) - F_1(x)) + (F_2(0) - F_2(x))$ for $x < 0$, this gives

$$G(x) = \begin{cases} F_1(x) - F_2(x) - (F_1(0) - F_2(0)) & \text{for } x > 0, \\ F_1(x) - F_2(x) - (F_1(0) - F_2(0)) & \text{for } x < 0, \\ 0 & \text{for } x = 0. \end{cases}$$

Thus to every finite signed measure $\mu$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, there corresponds a function $G(\cdot)$ that is the difference of two right continuous nondecreasing and bounded functions. The converse is also easy to establish. A characterization of such a function $G(\cdot)$ without any reference to measures is given below.

Definition 4.3.1: Let $f : [a, b] \to \mathbb{R}$, where $-\infty < a < b < \infty$. Then for any partition $Q = \{a = x_0 < x_1 < x_2 < \ldots < x_n = b\}$, $n \in \mathbb{N}$, the positive, negative and total variations of $f$ with respect to $Q$ are respectively defined as

$$P(f, Q) \equiv \sum_{i=1}^{n} (f(x_i) - f(x_{i-1}))^+,$$

$$N(f, Q) \equiv \sum_{i=1}^{n} (f(x_i) - f(x_{i-1}))^-,$$

$$T(f, Q) \equiv \sum_{i=1}^{n} |f(x_i) - f(x_{i-1})|.$$

It is easy to verify that (i) if $f$ is nondecreasing, then

$$P(f, Q) = T(f, Q) = f(b) - f(a) \quad \text{and} \quad N(f, Q) = 0$$

and that (ii) for any $f$, $P(f, Q) + N(f, Q) = T(f, Q)$.

Definition 4.3.2: Let $f : [a, b] \to \mathbb{R}$, where $-\infty < a < b < \infty$. The positive, negative and total variations of $f$ over $[a, b]$ are respectively defined as

$$P(f, [a, b]) \equiv \sup_Q P(f, Q),$$

$$N(f, [a, b]) \equiv \sup_Q N(f, Q),$$

$$T(f, [a, b]) \equiv \sup_Q T(f, Q),$$


where the supremum in each case is taken over all finite partitions $Q$ of $[a, b]$.

Definition 4.3.3: Let $f : [a, b] \to \mathbb{R}$, where $-\infty < a < b < \infty$. Then, $f$ is said to be of bounded variation on $[a, b]$ if $T(f, [a, b]) < \infty$. The set of all such functions is denoted by $BV[a, b]$.

As remarked earlier, if $f$ is nondecreasing, then $T(f, Q) = f(b) - f(a)$ for each $Q$ and hence $T(f, [a, b]) = f(b) - f(a)$. It follows that if $f = f_1 - f_2$, where both $f_1$ and $f_2$ are nondecreasing, then $f \in BV[a, b]$. A natural question is whether the converse is true. The answer is yes, as shown by the following result.

Theorem 4.3.1: Let $f \in BV[a, b]$. Let $f_1(x) \equiv P(f, [a, x])$ and $f_2(x) \equiv N(f, [a, x])$. Then $f_1$ and $f_2$ are nondecreasing in $[a, b]$ and for all $a \le x \le b$,

$$f(x) = f_1(x) - f_2(x).$$

Proof: That $f_1$ and $f_2$ are nondecreasing follows from the definition. It is enough to verify that if $f \in BV[a, b]$, then $f(b) - f(a) = P(f, [a, b]) - N(f, [a, b])$, as this can be applied to $[a, x]$ for $a \le x < b$. For each finite partition $Q$ of $[a, b]$,

$$f(b) - f(a) = \sum_{i=1}^{n} (f(x_i) - f(x_{i-1})) = P(f, Q) - N(f, Q).$$

Thus $P(f, Q) = f(b) - f(a) + N(f, Q)$. By taking supremum over all finite partitions $Q$, it follows that $P(f, [a, b]) = f(b) - f(a) + N(f, [a, b])$. If $f \in BV[a, b]$, this yields $f(b) - f(a) = P(f, [a, b]) - N(f, [a, b])$. □

Remark 4.3.1: Since $T(f, Q) = P(f, Q) + N(f, Q) = 2P(f, Q) - (f(b) - f(a))$, it follows that if $f \in BV[a, b]$, then

$$T(f, [a, b]) = 2P(f, [a, b]) - (f(b) - f(a)) = P(f, [a, b]) + N(f, [a, b]).$$

Corollary 4.3.2: A function $f \in BV[a, b]$ iff there exists a finite signed measure $\mu$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such that $f(x) = \mu([a, x])$, $a \le x \le b$.

The proof of this corollary is left as an exercise.
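The partition sums of Definition 4.3.1 and the two identities used in the proof of Theorem 4.3.1 are easy to check numerically. The sketch below is only an illustration: the function name `variations`, the cubic test function and the dyadic partition are assumptions, and exact rational arithmetic is used so that the identities hold without rounding.

```python
from fractions import Fraction

def variations(f, partition):
    """Return (P, N, T) of f w.r.t. the partition a = x_0 < x_1 < ... < x_n = b."""
    increments = [f(x1) - f(x0) for x0, x1 in zip(partition, partition[1:])]
    P = sum(max(d, 0) for d in increments)    # positive variation P(f, Q)
    N = sum(max(-d, 0) for d in increments)   # negative variation N(f, Q)
    T = sum(abs(d) for d in increments)       # total variation T(f, Q)
    return P, N, T

def f(x):
    """Hypothetical test function on [0, 2]; it is not monotone."""
    return x * (x - 1) * (x - 2)

def cube(x):
    """A nondecreasing test function."""
    return x ** 3

Q = [Fraction(i, 8) for i in range(17)]       # partition of [0, 2] with mesh 1/8
P, N, T = variations(f, Q)
Pc, Nc, Tc = variations(cube, Q)
```

For any partition one has `P - N == f(b) - f(a)` and `P + N == T`, and for the nondecreasing function `cube` the sums reduce to `Pc == Tc == cube(b) - cube(a)` and `Nc == 0`.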


Remark 4.3.2: Some observations on functions of bounded variation are listed below.

(a) Let $f = I_{\mathbb{Q}}$ where $\mathbb{Q}$ is the set of rationals. Then for any $[a, b]$, $a < b$, $P(f, [a, b]) = N(f, [a, b]) = \infty$ and so $f \notin BV[a, b]$. This holds for $f = I_D$ for any set $D$ such that both $D$ and $D^c$ are dense in $\mathbb{R}$.

(b) Let $f$ be Lipschitz on $[a, b]$. That is, $|f(x) - f(y)| \le K|x - y|$ for all $x, y$ in $[a, b]$ where $K \in (0, \infty)$ is a constant. Then, $f \in BV[a, b]$.

(c) Let $f$ be differentiable in $(a, b)$ and continuous on $[a, b]$ and $f'(\cdot)$ be bounded in $(a, b)$. Then by the mean value theorem, $f$ is Lipschitz and hence, $f$ is in $BV[a, b]$.

(d) Let $f(x) = x^2 \sin\frac{1}{x}$, $0 < x \le 1$, and let $f(0) = 0$. Then $f$ is continuous on $[0, 1]$, differentiable on $(0, 1)$ with $f'$ bounded on $(0, 1)$, and hence $f \in BV[0, 1]$.

(e) Let $g(x) = x^2 \sin\frac{1}{x^2}$, $0 < x \le 1$, $g(0) = 0$. Then $g$ is continuous on $[0, 1]$, differentiable on $(0, 1)$ but $g'$ is not bounded on $(0, 1)$. This by itself does not imply that $g \notin BV[0, 1]$, since being Lipschitz is only a sufficient condition. But it turns out that $g \notin BV[0, 1]$. To see this, let $x_n = \big((2n+1)\frac{\pi}{2}\big)^{-1/2}$, $n = 0, 1, 2, \ldots$. Then $g(x_n) = (-1)^n x_n^2$, so that

$$\sum_{i=1}^{n} |g(x_i) - g(x_{i-1})| \ge \sum_{i=1}^{n} \frac{1}{(2i+1)\frac{\pi}{2}}$$

and hence $T(g, [0, 1]) = \infty$.

(f) It is known (see Royden (1988), Chapter 4) that if $f : [a, b] \to \mathbb{R}$ is nondecreasing, then $f$ is differentiable a.e. $(m)$ on $(a, b)$ and $\int_{[a,b]} f'\,dm \le f(b) - f(a)$, where $(m)$ denotes the Lebesgue measure. This implies that if $f \in BV[a, b]$, then $f$ is differentiable a.e. $(m)$ on $(a, b)$ and so, $\int_{[a,b]} |f'|\,dm \le T(f, [a, b])$.
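The divergence claimed in part (e) can be observed numerically. The following sketch, with hypothetical helper names, evaluates the sums $\sum_{i=1}^{n} |g(x_i) - g(x_{i-1})|$ along the points $x_n = ((2n+1)\pi/2)^{-1/2}$ and compares them with the harmonic-type lower bound; it uses floating point, so the comparison allows a small tolerance.

```python
import math

def g(x):
    """g(x) = x^2 sin(1/x^2) for x > 0 and g(0) = 0."""
    return x * x * math.sin(1.0 / (x * x)) if x > 0 else 0.0

def x_point(n):
    """x_n = ((2n + 1) * pi / 2) ** (-1/2), so that 1 / x_n^2 = (2n + 1) * pi / 2."""
    return 1.0 / math.sqrt((2 * n + 1) * math.pi / 2)

def variation_sum(n):
    """Sum of |g(x_i) - g(x_{i-1})| over i = 1, ..., n."""
    return sum(abs(g(x_point(i)) - g(x_point(i - 1))) for i in range(1, n + 1))

def lower_bound(n):
    """Sum of 1 / ((2i + 1) * pi / 2) over i = 1, ..., n; diverges like log n."""
    return sum(1.0 / ((2 * i + 1) * math.pi / 2) for i in range(1, n + 1))

sums = {n: (variation_sum(n), lower_bound(n)) for n in (10, 100, 1000)}
```

The sums keep growing as `n` increases, roughly like a multiple of `log n`, which is consistent with $T(g, [0, 1]) = \infty$.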

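4.4 Absolutely continuous functions on R

Definition 4.4.1: A function $F : \mathbb{R} \to \mathbb{R}$ is absolutely continuous (a.c.) if for all $\epsilon > 0$, there exists $\delta > 0$ such that if $I_j = [a_j, b_j]$, $j = 1, 2, \ldots, k$ ($k \in \mathbb{N}$) are disjoint and $\sum_{j=1}^{k} (b_j - a_j) < \delta$, then $\sum_{j=1}^{k} |F(b_j) - F(a_j)| < \epsilon$.

By the mean value theorem, it follows that if $F$ is differentiable and $F'(\cdot)$ is bounded, then $F$ is a.c. Also note that $F$ is a.c. implies $F$ is uniformly continuous.

Definition 4.4.2: A function $F : [a, b] \to \mathbb{R}$ is absolutely continuous if the function $\tilde{F}$, defined by

$$\tilde{F}(x) = \begin{cases} F(x) & \text{if } a \le x \le b, \\ F(a) & \text{if } x < a, \\ F(b) & \text{if } x > b, \end{cases}$$

is absolutely continuous.

Thus $F(x) = x$ is a.c. on $\mathbb{R}$. Any polynomial is a.c. on any bounded interval but not necessarily on all of $\mathbb{R}$. For example, $F(x) = x^2$ is a.c. on any bounded interval but not a.c. on $\mathbb{R}$, since it is not uniformly continuous on $\mathbb{R}$.

The main result of this section is the following result due to H. Lebesgue, known as the fundamental theorem of Lebesgue integral calculus.

Theorem 4.4.1: A function $F : [a, b] \to \mathbb{R}$ is absolutely continuous iff there is a function $f : [a, b] \to \mathbb{R}$ such that $f$ is Lebesgue measurable and integrable w.r.t. $m$ and such that

$$F(x) = F(a) + \int_{[a,x]} f\,dm, \quad \text{for all } a \le x \le b \tag{4.1}$$

where $m$ is the Lebesgue measure.

Proof: First consider the "if part." Suppose that (4.1) holds. Since $\int_{[a,b]} |f|\,dm < \infty$, for any $\epsilon > 0$, there exists a $\delta > 0$ such that (cf. Proposition 2.5.8)

$$m(A) < \delta \;\Rightarrow\; \int_A |f|\,dm < \epsilon. \tag{4.2}$$

Thus, if $I_j = (a_j, b_j) \subset [a, b]$, $j = 1, 2, \ldots, k$, are disjoint and such that $\sum_{j=1}^{k} (b_j - a_j) < \delta$, then

$$\sum_{j=1}^{k} |F(b_j) - F(a_j)| \le \sum_{j=1}^{k} \int_{I_j} |f|\,dm = \int_{\bigcup_{j=1}^{k} I_j} |f|\,dm < \epsilon.$$

Thus, $F$ is a.c.

Next consider the "only if part." An a.c. function on $[a, b]$ is of bounded variation, and its positive and negative variations are again a.c., so it suffices to treat the case where $F$ is a.c. and nondecreasing. Let $\mu_F$ denote the Lebesgue-Stieltjes measure generated by $\tilde{F}$ of Definition 4.4.2. Fix $\epsilon > 0$. Let $\delta > 0$ be chosen so that

$$(a_j, b_j) \subset [a, b],\ j = 1, 2, \ldots, k, \quad \sum_{j=1}^{k} (b_j - a_j) < \delta \;\Rightarrow\; \sum_{j=1}^{k} |F(b_j) - F(a_j)| < \epsilon.$$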


Let $A \in \mathcal{M}_m$, $A \subset (a, b)$, and $m(A) = 0$. Then, there exists a countable collection of disjoint open intervals $\{I_j = (a_j, b_j) : I_j \subset [a, b]\}_{j \ge 1}$ such that

$$A \subset \bigcup_{j \ge 1} I_j \quad \text{and} \quad \sum_{j \ge 1} (b_j - a_j) < \delta.$$

Thus

$$\mu_F\Big(A \cap \bigcup_{j=1}^{k} I_j\Big) \le \mu_F\Big(\bigcup_{j=1}^{k} I_j\Big) \le \sum_{j=1}^{k} \mu_F(I_j) = \sum_{j=1}^{k} (F(b_j) - F(a_j)) < \epsilon$$

for all $k \in \mathbb{N}$. Since $A \subset \bigcup_{j \ge 1} I_j$, by the m.c.f.b. property of $\mu_F(\cdot)$, $\mu_F(A) = \lim_{k \to \infty} \mu_F(A \cap \bigcup_{j=1}^{k} I_j) \le \epsilon$. This being true for any $\epsilon > 0$, it follows that $\mu_F(A) = 0$. Since $F$ is continuous, $\mu_F(\{a, b\}) = 0$ and hence $\mu_F((a, b)^c) = 0$. Thus, $\mu_F \ll m$, i.e., $\mu_F$ is dominated by $m$. Now, by the Radon-Nikodym theorem (cf. Theorem 4.1.1 (ii)), there exists a nonnegative measurable function $f$ such that $A \in \mathcal{M}_m$ implies that $\mu_F(A) = \int_{A \cap [a,b]} f\,dm$ and, in particular, for $a \le x \le b$,

$$\mu_F([a, x]) = F(x) - F(a) = \int_{[a,x]} f\,dm,$$

i.e., (4.1) holds. □

The representation (4.1) of an absolutely continuous $F$ can be strengthened as follows:

Theorem 4.4.2: Let $F : \mathbb{R} \to \mathbb{R}$ satisfy (4.1). Then $F$ is differentiable a.e. $(m)$ and $F'(\cdot) = f(\cdot)$ a.e. $(m)$.

For a proof of this result, see Royden (1988), Chapter 4.

The relation between the notion of absolute continuity of a distribution function $F : \mathbb{R} \to \mathbb{R}$ and that of the associated Lebesgue-Stieltjes measure $\mu_F$ w.r.t. Lebesgue measure $m$ will be discussed now. Let $F : \mathbb{R} \to \mathbb{R}$ be a distribution function, i.e., $F$ is nondecreasing and right continuous. Let $\mu_F$ be the associated Lebesgue-Stieltjes measure such that $\mu_F((a, b]) = F(b) - F(a)$ for all $-\infty < a < b < \infty$. Recall that $F$ is said to be absolutely continuous on an interval $[a, b]$ if for each $\epsilon > 0$, there exists a $\delta > 0$ such that for any finite collection of intervals $I_j = (a_j, b_j)$, $j = 1, 2, \ldots, n$, contained in $[a, b]$,

$$\sum_{j=1}^{n} (b_j - a_j) < \delta \;\Rightarrow\; \sum_{j=1}^{n} (F(b_j) - F(a_j)) < \epsilon.$$


Recall also that $\mu_F$ is absolutely continuous w.r.t. the Lebesgue measure $m$ if for $A \in \mathcal{B}(\mathbb{R})$, $m(A) = 0 \Rightarrow \mu_F(A) = 0$. A natural question is that if $F$ is absolutely continuous on every interval $[a, b] \subset \mathbb{R}$, is $\mu_F$ absolutely continuous w.r.t. $m$ and conversely? The answer is yes.

Theorem 4.4.3: Let $F : \mathbb{R} \to \mathbb{R}$ be a nondecreasing function and let $\mu_F$ be the associated Lebesgue-Stieltjes measure. Then $F$ is absolutely continuous on $[a, b]$ for all $-\infty < a < b < \infty$ iff $\mu_F \ll m$ where $m$ is the Lebesgue measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$.

Proof: Suppose that $\mu_F \ll m$. Then by Theorem 4.1.1, there exists a nonnegative measurable function $h$ such that

$$\mu_F(A) = \int_A h\,dm \quad \text{for all } A \text{ in } \mathcal{B}(\mathbb{R}).$$

Hence, for any $a < b$ in $\mathbb{R}$ and $a < x < b$,

$$F(x) - F(a) \equiv \mu_F((a, x]) = \int_{(a,x]} h\,dm.$$

This implies the absolute continuity of $F$ on $[a, b]$ as shown in Theorem 4.4.1. Conversely, if $F$ is absolutely continuous on $[a, b]$ for all $-\infty < a < b < \infty$, then as shown in the proof of the "only if" part of Theorem 4.4.1, for all $-\infty < a < b < \infty$, $\mu_F(A \cap [a, b]) = 0$ if $m(A \cap [a, b]) = 0$. Thus, if $m(A) = 0$, then for all $-\infty < a < b < \infty$, $m(A \cap [a, b]) = 0$ and hence $\mu_F(A \cap [a, b]) = 0$ and hence $\mu_F(A) = 0$, i.e., $\mu_F \ll m$. □

Recall that a measure $\mu$ on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$ is a Radon measure if $\mu(A) < \infty$ for every bounded Borel set $A$. In the following, let $m(\cdot)$ denote the Lebesgue measure on $\mathbb{R}^k$.

Definition 4.4.3: A Radon measure $\mu$ on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$ is differentiable at $x \in \mathbb{R}^k$ with derivative $(D\mu)(x)$ if for any $\epsilon > 0$, there is a $\delta > 0$ such that

$$\Big| \frac{\mu(A)}{m(A)} - (D\mu)(x) \Big| < \epsilon$$

for every open ball $A$ such that $x \in A$ and diam.$(A) \equiv \sup\{\|x - y\| : x, y \in A\}$, the diameter of $A$, is less than $\delta$.

Theorem 4.4.4: Let $\mu$ be a Radon measure on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$. Then

(i) $\mu$ is differentiable a.e. $(m)$, $D\mu(\cdot)$ is Lebesgue measurable, and $\ge 0$ a.e. $(m)$ and for all bounded Borel sets $A \in \mathcal{B}(\mathbb{R}^k)$,

$$\int_A D\mu(\cdot)\,dm \le \mu(A).$$


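(ii) Let $\mu_a(A) \equiv \int_A D\mu(\cdot)\,dm$, $A \in \mathcal{B}(\mathbb{R}^k)$. Let $\mu_s(\cdot)$ be the unique measure on $\mathcal{B}(\mathbb{R}^k)$ such that for all bounded Borel sets $A$

$$\mu_s(A) = \mu(A) - \mu_a(A).$$

Then

$$\mu_s \perp m \quad \text{and} \quad D\mu_s(\cdot) = 0 \quad \text{a.e. } (m).$$

For a proof, see Rudin (1987).

Remark 4.4.1: By the uniqueness of the Lebesgue decomposition, it follows that a Radon measure $\mu$ on $\mathcal{B}(\mathbb{R}^k)$ is $\perp m$ iff $D\mu(\cdot) = 0$ a.e. $(m)$ and is $\ll m$ iff $\mu(A) = \int_A D\mu(\cdot)\,dm$ for all $A \in \mathcal{B}(\mathbb{R}^k)$.

Let $f : \mathbb{R}^k \to \mathbb{R}_+$ be integrable w.r.t. $m$ on bounded sets. Let $\mu(A) \equiv \int_A f\,dm$ for $A \in \mathcal{B}(\mathbb{R}^k)$. Then $\mu(\cdot)$ is a Radon measure that is $\ll m$ and hence by Theorem 4.4.4

$$D\mu(x) = f(x) \quad \text{for almost all } x\ (m).$$

That is, for almost all $x$ $(m)$, for each $\epsilon > 0$, there is a $\delta > 0$ such that

$$\Big| \frac{1}{m(A)} \int_A f\,dm - f(x) \Big| < \epsilon$$

for all open balls $A$ such that $x \in A$ and diam.$(A) < \delta$. It turns out that a stronger result holds.

Theorem 4.4.5: For almost all $x$ $(m)$, for each $\epsilon > 0$, there is a $\delta > 0$ such that

$$\frac{1}{m(A)} \int_A |f - f(x)|\,dm < \epsilon$$

for all open balls $A$ such that $x \in A$ and diam.$(A) < \delta$ (see Problems 4.23, 4.24).

Theorem 4.4.6: (Change of variables formula in $\mathbb{R}^k$, $k > 1$). Let $V$ be an open set in $\mathbb{R}^k$. Let $T \equiv (T_1, T_2, \ldots, T_k)$ be a map from $\mathbb{R}^k \to \mathbb{R}^k$ such that for each $i$, $T_i : \mathbb{R}^k \to \mathbb{R}$ and $\frac{\partial T_i(\cdot)}{\partial x_j}$ exists on $V$ for all $1 \le i, j \le k$. Suppose that the Jacobian $J_T(\cdot) \equiv \det\Big(\Big(\frac{\partial T_i(\cdot)}{\partial x_j}\Big)\Big)$ is continuous and positive on $V$. Suppose further that $T(V)$ is a bounded open set $W$ in $\mathbb{R}^k$ and that $T$ is $(1-1)$ and $T^{-1}$ is continuous. Then

(i) For all Borel sets $E \subset V$, $T(E)$ is a Borel set $\subset W$.

(ii) $\nu(\cdot) \equiv m(T(\cdot))$ is a measure on $\mathcal{B}(V)$ and $\nu \ll m$ with

$$\frac{d\nu}{dm}(\cdot) = J_T(\cdot).$$

(iii) For any $h \in L^1(W, m)$

$$\int_W h\,dm = \int_V h\big(T(\cdot)\big) J_T(\cdot)\,dm.$$

(iv) $\lambda(\cdot) \equiv mT^{-1}(\cdot)$ is a measure on $\mathcal{B}(W)$ and $\lambda \ll m$ with

$$\frac{d\lambda}{dm}(\cdot) = \big|J_T\big(T^{-1}(\cdot)\big)\big|^{-1}.$$

(v) For any $\mu \ll m$ on $\mathcal{B}(V)$ the measure $\psi(\cdot) \equiv \mu T^{-1}(\cdot)$ is dominated by $m$ with

$$\frac{d\psi}{dm}(\cdot) = \frac{d\mu}{dm}\big(T^{-1}(\cdot)\big) \Big(J_T\big(T^{-1}(\cdot)\big)\Big)^{-1} \quad \text{on } W.$$

For a proof see Rudin (1987), Chapter 7.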
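Part (ii) of Theorem 4.4.6 can be checked exactly in the simplest case of a linear map on the plane, where $T$ sends rectangles to parallelograms whose area is given by the shoelace formula. The sketch below is an illustration only: the map $T(x, y) = (2x + y,\ x + 3y)$ on $V = (0, 1)^2$, the rectangle $E$ and the helper names are all hypothetical choices.

```python
from fractions import Fraction

def T(x, y):
    """Hypothetical linear map T(x, y) = (2x + y, x + 3y)."""
    return (2 * x + y, x + 3 * y)

JACOBIAN = 5  # det [[2, 1], [1, 3]] = 2*3 - 1*1, constant and positive on V

def shoelace_area(vertices):
    """Area of a simple polygon from its vertices listed in order."""
    total = 0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        total += x0 * y1 - x1 * y0
    return abs(total) / 2

def image_area(a, b, c, d):
    """m(T(E)) for the rectangle E = (a, b) x (c, d); T(E) is a parallelogram."""
    corners = [(a, c), (b, c), (b, d), (a, d)]
    return shoelace_area([T(x, y) for x, y in corners])

E = (Fraction(0), Fraction(1, 2), Fraction(0), Fraction(1, 3))
lhs = image_area(*E)                              # nu(E) = m(T(E))
rhs = JACOBIAN * (E[1] - E[0]) * (E[3] - E[2])    # integral of J_T over E
```

Both sides equal $5/6$ here, in agreement with $\nu(E) = \int_E J_T\,dm$; for a nonlinear $T$ the same identity holds but the image area would have to be computed by integration rather than by the shoelace formula.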

4.5 Singular distributions

4.5.1 Decomposition of a cdf

Recall that a cumulative distribution function (cdf) on $\mathbb{R}$ is a function $F : \mathbb{R} \to [0, 1]$ such that it is nondecreasing, right continuous, $F(-\infty) = 0$, $F(\infty) = 1$. In Section 2.2, it was shown that any cdf $F$ on $\mathbb{R}$ can be written as

$$F = \alpha F_d + (1 - \alpha) F_c, \tag{5.1}$$

where $F_d$ and $F_c$ are discrete and continuous cdfs respectively. In this section, the cdf $F_c$ will be further decomposed into singular continuous and absolutely continuous cdfs.

Definition 4.5.1: A cdf $F$ is singular if $F' \equiv 0$ almost everywhere w.r.t. the Lebesgue measure on $\mathbb{R}$.

Example 4.5.1: The cdfs of Binomial, Poisson, or any integer valued random variables are singular.

It is known (cf. Royden (1988), Chapter 5) that a monotone function $F : \mathbb{R} \to \mathbb{R}$ is differentiable almost everywhere w.r.t. the Lebesgue measure and its derivative $F'$ satisfies

$$\int_a^b F'(x)\,dx \le F(b) - F(a), \tag{5.2}$$

for any $-\infty < a < b < \infty$.

For $x \in \mathbb{R}$, let $\tilde{F}_{ac}(x) \equiv \int_{-\infty}^{x} F_c'(t)\,dt$ and $\tilde{F}_{sc}(x) \equiv F_c(x) - \tilde{F}_{ac}(x)$. If $\tilde{\beta} \equiv \int_{-\infty}^{\infty} F_c'(t)\,dt = \tilde{F}_{ac}(\infty) = 0$, then $F_c'(t) = 0$ a.e. and so $F_c$ is singular continuous. If $\tilde{\beta} = 1$, then $F_c = \tilde{F}_{ac}$ and so, $F_c$ is absolutely continuous. If $0 < \alpha < 1$ and $0 < \tilde{\beta} < 1$, then $F$ can be written as

$$F = \alpha F_d + \beta F_{ac} + \gamma F_{sc}, \tag{5.3}$$

where $\beta = (1 - \alpha)\tilde{\beta}$, $\gamma = (1 - \alpha)(1 - \tilde{\beta})$, $F_{ac} = \tilde{\beta}^{-1}\tilde{F}_{ac}$, $F_{sc} = (1 - \tilde{\beta})^{-1}\tilde{F}_{sc}$, and $F_d$ is as in (5.1). Note that $F_d$, $F_{ac}$, $F_{sc}$ are all cdfs and $\alpha, \beta, \gamma$ are nonnegative numbers adding up to 1. Summarizing the above discussions, one has the following:

Proposition 4.5.1: Given any cdf $F$, there exist nonnegative constants $\alpha, \beta, \gamma$ and cdfs $F_d$, $F_{ac}$, $F_{sc}$ satisfying (a) $\alpha + \beta + \gamma = 1$, and (b) $F_d$ is discrete, $F_{ac}$ is absolutely continuous, $F_{sc}$ is singular continuous, such that the decomposition (5.3) holds.

It can be shown that the constants $\alpha$, $\beta$, and $\gamma$ are uniquely determined, and that when $0 < \alpha < 1$, the decomposition (5.1) is unique, and that when $0 < \alpha, \beta, \gamma < 1$, the decomposition (5.3) is unique. The decomposition (5.3) also has a probabilistic interpretation. Any random variable $X$ can be realized as a randomized choice over three random variables $X_d$, $X_{ac}$, and $X_{sc}$ having cdfs $F_d$, $F_{ac}$, and $F_{sc}$, respectively, and with corresponding randomization probabilities $\alpha$, $\beta$, and $\gamma$. For more details see Problem 6.15 in Chapter 6.

4.5.2

Cantor ternary set

Recall the construction of the Cantor set from Section 1.3. Let I0 = [0, 1] denote the unit interval. If one deletes the open " middle # third of" I0 ,# then one gets two disjoint closed intervals I11 = 0, 13 and I12 = 23 , 1 . Proceeding similarly with" the# closed intervals " 2 1 # I11 and " 2 I712#, 1 one gets " four # disjoint intervals I21 = 0, 9 , I22 = 9 , 3 , I23 = 3 , 9 , I24 = 89 , 1 , and so on. Thus, at each step, deleting the open middle third of the closed intervals constructed in the previous step, one is left with 2n 2n disjoint closed intervals each of length 3−n after n steps. Let Cn = j=1 Inj ∞ and C = n=1 Cn . By construction Cn+1 ⊂ Cn for each n and Cn ’s are closed sets. With m(·) ndenoting Lebesgue measure, one has m(C0 ) = 1 and m(Cn ) = 2n 3−n = 23 . ∞ Deﬁnition 4.5.2: The set C ≡ n=1 Cn is called the Cantor ternary set or simply the Cantor set. n Since m(C0 ) = 1, by m.c.f.a. m(C) = limn→∞ m(Cn ) = limn→∞ 23 = 0. Thus, the Cantor set C has zero Lebesgue measure. Next, let U1 = U11 = 13 , 23 be the deleted interval at the ﬁrst stage, U2 = U21 ∪ U22 =

4.5 Singular distributions

$(\frac{1}{9}, \frac{2}{9}) \cup (\frac{7}{9}, \frac{8}{9})$ be the union of the deleted intervals at the second stage, and similarly, $U_n = \bigcup_{j=1}^{2^{n-1}} U_{nj}$ at stage $n$. Thus $C^c = U = \bigcup_{n=1}^{\infty} \bigcup_{j=1}^{2^{n-1}} U_{nj}$ is open and $m(C^c) = 1$. Since $C \cup C^c = [0,1]$ and $C^c$ is open, it follows that $C$ is nonempty. In fact, $C$ is uncountably infinite as will be shown now. To do this, one needs the concept of $p$-nary expansion of numbers in $[0,1]$.

Fix a positive integer $p > 1$. For each $x$ in $[0,1)$, let $a_1(x) = \lfloor px \rfloor$ where $\lfloor t \rfloor = n$ if $n \le t < n+1$. Thus $a_1(x) \le px < a_1(x) + 1$ and $a_1(x) \in \{0, 1, \ldots, p-1\}$, i.e., $\frac{a_1(x)}{p} \le x < \frac{a_1(x)}{p} + \frac{1}{p}$. Thus, if $\frac{k}{p} \le x < \frac{k+1}{p}$ for some $k = 0, 1, 2, \ldots, p-1$, then $a_1(x) = k$. Next, let $x_1 \equiv x - \frac{a_1(x)}{p}$ and $a_2(x) = \lfloor p^2 x_1 \rfloor$. Then, $x_1 \in [0, \frac{1}{p})$ and

$$\frac{a_2(x)}{p^2} \le x_1 = x - \frac{a_1(x)}{p} < \frac{a_2(x)}{p^2} + \frac{1}{p^2}$$

and $a_2(x) \in \{0, 1, 2, \ldots, p-1\}$. Next, let $0 \le x_2 \equiv x - \frac{a_1(x)}{p} - \frac{a_2(x)}{p^2}$, $a_3(x) = \lfloor p^3 x_2 \rfloor$, and so on. After $k$ such iterations one gets

$$0 \le x - \sum_{i=1}^{k} \frac{a_i(x)}{p^i} < \frac{1}{p^k}.$$
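The digit recursion just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pnary_digits` is hypothetical, and exact rational arithmetic (`fractions.Fraction`) is used so that the bound $0 \le x - \sum_{i \le k} a_i(x)/p^i < p^{-k}$ can be checked without rounding error.

```python
from fractions import Fraction
from math import floor


def pnary_digits(x, p, k):
    """Return the first k p-nary digits a_1(x), ..., a_k(x) of x in [0, 1)
    together with the remainder x_k = x - sum_{i<=k} a_i(x) / p**i.

    Follows the recursion in the text: a_i(x) = floor(p**i * x_{i-1}),
    with x_0 = x and x_i = x_{i-1} - a_i(x) / p**i.
    """
    remainder = Fraction(x)
    digits = []
    for i in range(1, k + 1):
        a_i = floor(p ** i * remainder)      # digit in {0, 1, ..., p-1}
        digits.append(a_i)
        remainder -= Fraction(a_i, p ** i)   # what is left after i digits
    return digits, remainder


# Ternary (p = 3) expansion of 1/4: only the digits 0 and 2 appear,
# which is consistent with 1/4 belonging to the Cantor set.
digits, rem = pnary_digits(Fraction(1, 4), 3, 6)
print(digits, rem)
```

For $p = 3$ the digits of a point of the Cantor set can be chosen from $\{0, 2\}$, which is why ternary expansions are the natural tool for counting its points.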

0} is countable. (Hint: Let $B_n = \{x : h(x) > \frac{1}{n}\}$. Show that $B_n$ is a finite set for each $n \in \mathbb{N}$.)

4.4 Prove Proposition 4.1.2.

4.5 Find the Lebesgue decomposition of $\mu$ w.r.t. $\nu$ and the Radon-Nikodym derivative $\frac{d\mu_a}{d\nu}$ in the following cases, where $\mu_a$ is the absolutely continuous component of $\mu$ w.r.t. $\nu$.
(a) $\mu = N(0,1)$, $\nu = \text{Exponential}(1)$.
(b) $\mu = \text{Exponential}(1)$, $\nu = N(0,1)$.
(c) $\mu = \mu_1 + \mu_2$, where $\mu_1 = N(0,1)$, $\mu_2 = \text{Poisson}(1)$ and $\nu = \text{Cauchy}(0,1)$.
(d) $\mu = \mu_1 + \mu_2$, $\nu = \text{Geometric}(p)$, $0 < p < 1$, where $\mu_1 = N(0,1)$ and $\mu_2 = \text{Poisson}(1)$.
(e) $\mu = \mu_1 + \mu_2$, $\nu = \nu_1 + \nu_2$ where $\mu_1 = N(0,1)$, $\mu_2 = \text{Poisson}(1)$, $\nu_1 = \text{Cauchy}(0,1)$ and $\nu_2 = \text{Geometric}(p)$, $0 < p < 1$.
(f) $\mu = \text{Binomial}(10, 1/2)$, $\nu = \text{Poisson}(1)$.
The measures referred to above are defined in Tables 4.6.1 and 4.6.2, given at the end of this section.


4. Diﬀerentiation

4.6 Let $(\Omega, \mathcal{F}, \mu)$ be a measure space and $f \in L^1(\Omega, \mathcal{F}, \mu)$. Let $\nu_f(A) \equiv \int_A f\,d\mu$ for all $A \in \mathcal{F}$.
(a) Show that $\nu_f$ is a finite signed measure.
(b) Show that $\|\nu_f\| = \int_\Omega |f|\,d\mu$ and for $A \in \mathcal{F}$, $\nu_f^+(A) = \int_A f^+\,d\mu$, $\nu_f^-(A) = \int_A f^-\,d\mu$, and $|\nu_f|(A) = \int_A |f|\,d\mu$.

4.7 (a) Let $\mu_1$ and $\mu_2$ be two finite measures such that both are dominated by a $\sigma$-finite measure $\nu$. Show that the total variation measure of the signed measure $\mu \equiv \mu_1 - \mu_2$ is given by $|\mu|(A) = \int_A |h_1 - h_2|\,d\nu$ where for $i = 1, 2$, $h_i = \frac{d\mu_i}{d\nu}$.
(b) Conclude that if $\mu_1$ and $\mu_2$ are two measures on a countable set $\Omega \equiv \{\omega_i\}_{i \ge 1}$ with $\mathcal{F} \equiv \mathcal{P}(\Omega)$, then $|\mu|(A) = \sum_{\omega_i \in A} |\mu_1(\omega_i) - \mu_2(\omega_i)|$.
(c) Show that if $\mu_n$ is the Binomial$(n, p_n)$ measure and $\mu$ is the Poisson$(\lambda)$ measure, $0 < \lambda < \infty$, then as $n \to \infty$, $|\mu_n - \mu|(\cdot) \to 0$ uniformly on $\mathcal{P}(\mathbb{Z}_+)$ iff $np_n \to \lambda$. (Hint: Show that for each $i \in \mathbb{Z}_+ \equiv \{0, 1, 2, \ldots\}$, $\mu_n(\{i\}) \to \mu(\{i\})$ and use Scheffe's theorem.)

4.8 Let $\nu$ be a finite signed measure on a measurable space $(\Omega, \mathcal{F})$ and let $|\nu|$ be the total variation measure corresponding to $\nu$. Show that for any $B \in \mathcal{F}$, $B \subset \Omega^+$, $|\nu|(B) = \nu(B)$, where $\Omega^+$ is as defined in (2.13). (Hint: For any set $A \subset \Omega^+$, $\nu(A) = \int_A h\,d|\nu| = \int_{A \cap \Omega^+} h\,d|\nu| \ge 0$.)

4.9 Show that the Banach space $S$ of finite signed measures on $(\mathbb{N}, \mathcal{P}(\mathbb{N}))$ is isomorphic to $\ell^1$, the Banach space of absolutely convergent sequences $\{x_n\}_{n \ge 1}$ in $\mathbb{R}$.

4.10 Let $\mu_1$ and $\mu_2$ be two probability measures on $(\Omega, \mathcal{F})$.
(a) Show that $\|\mu_1 - \mu_2\| = 2 \sup\{|\mu_1(A) - \mu_2(A)| : A \in \mathcal{F}\}$. (Hint: For any $A \in \mathcal{F}$, $\{A, A^c\}$ is a partition of $\Omega$ and so $\|\mu_1 - \mu_2\| \ge |\mu_1(A) - \mu_2(A)| + |\mu_1(A^c) - \mu_2(A^c)| = 2|\mu_1(A) - \mu_2(A)|$,

4.6 Problems


since $\mu_1$ and $\mu_2$ are probability measures. For the opposite inequality, use the Hahn decomposition of $\Omega$ w.r.t. $\mu_1 - \mu_2$ and the fact $\|\mu_1 - \mu_2\| = |(\mu_1 - \mu_2)(\Omega^+)| + |(\mu_1 - \mu_2)(\Omega^-)|$.)
(b) Show that $\|\mu_1 - \mu_2\|$ is also equal to $\sup\left\{\left|\int f\,d\mu_1 - \int f\,d\mu_2\right| : f \in B(\Omega, \mathbb{R})\right\}$ where $B(\Omega, \mathbb{R})$ is the collection of all $\mathcal{F}$-measurable functions from $\Omega$ to $\mathbb{R}$ such that $\sup\{|f(\omega)| : \omega \in \Omega\} \le 1$.

4.11 Let $(\Omega, \mathcal{F})$ be a measurable space.
(a) Let $\{\mu_n\}_{n \ge 1}$ be a sequence of finite measures on $(\Omega, \mathcal{F})$. Show that there exists a probability measure $\lambda$ such that $\mu_n \ll \lambda$. (Hint: Consider $\lambda(\cdot) = \sum_{n=1}^{\infty} \frac{1}{2^n} \frac{\mu_n(\cdot)}{\mu_n(\Omega)}$.)
(b) Extend (a) to the case where $\{\mu_n\}_{n \ge 1}$ are $\sigma$-finite.
(c) Conclude that for any sequence $\{\nu_n\}_{n \ge 1}$ of finite signed measures on $(\Omega, \mathcal{F})$, there exists a probability measure $\lambda$ such that $|\nu_n| \ll \lambda$ for all $n \ge 1$.

4.12 Let $\{\mu_n\}_{n \ge 1}$ be a sequence of finite measures on a measurable space $(\Omega, \mathcal{F})$. Show that there exists a finite measure $\mu$ on $(\Omega, \mathcal{F})$ such that $\|\mu_n - \mu\| \to 0$ iff there is a finite measure $\lambda$ dominating $\mu$ and $\mu_n$, $n \ge 1$, such that the Radon-Nikodym derivatives $f_n \equiv \frac{d\mu_n}{d\lambda} \to f \equiv \frac{d\mu}{d\lambda}$ in measure on $(\Omega, \mathcal{F}, \lambda)$ and $\mu_n(\Omega) \to \mu(\Omega)$.

4.13 (a) Let $\mu_1$ and $\mu_2$ be two finite measures on $(\Omega, \mathcal{F})$. Let $\mu_1 = \mu_{1a} + \mu_{1s}$ be the Lebesgue-Radon-Nikodym decomposition of $\mu_1$ w.r.t. $\mu_2$ as in Theorem 4.1.1. Show that if $\mu = \mu_1 - \mu_2$, then for all $A \in \mathcal{F}$, $|\mu|(A) = \int_A |h - 1|\,d\mu_2 + \mu_{1s}(A)$ where $h = \frac{d\mu_{1a}}{d\mu_2}$ is the Radon-Nikodym derivative of $\mu_{1a}$ w.r.t. $\mu_2$. Conclude that if $\mu_1 \perp \mu_2$, then $|\mu|(\cdot) = \mu_1(\cdot) + \mu_2(\cdot)$ and if $\mu_1 \ll \mu_2$, then $|\mu|(A) = \int_A \left|\frac{d\mu_1}{d\mu_2} - 1\right| d\mu_2$.
(b) Compute $|\mu|(\cdot)$, $\|\mu\|$ if $\mu = \mu_1 - \mu_2$ for the following cases: (i) $\mu_1 = N(0,1)$, $\mu_2 = N(1,1)$; (ii) $\mu_1 = \text{Cauchy}(0,1)$, $\mu_2 = N(0,1)$; (iii) $\mu_1 = N(0,1)$, $\mu_2 = \text{Poisson}(\lambda)$.
(c) Establish Proposition 4.2.7.


4.14 Give another proof of the completeness of $(S, \|\cdot\|)$ by verifying the following steps.
(a) For any sequence $\{\nu_n\}_{n \ge 1}$ in $S$, there is a finite measure $\lambda$ and $\{f_n\}_{n \ge 1} \subset L^1(\Omega, \mathcal{F}, \lambda)$ such that $\nu_n(A) = \int_A f_n\,d\lambda$ for all $A \in \mathcal{F}$, for all $n \ge 1$.
(b) $\{\nu_n\}_{n \ge 1}$ Cauchy in $S$ is the same as $\{f_n\}_{n \ge 1}$ Cauchy in $L^1(\Omega, \mathcal{F}, \lambda)$ and hence, the completeness of $(S, \|\cdot\|)$ follows from the completeness of $L^1(\Omega, \mathcal{F}, \lambda)$.

4.15 Let $f, g \in BV[a,b]$.
(a) Show that $P(f+g; [a,b]) \le P(f; [a,b]) + P(g; [a,b])$ and that the same is true for $N(\cdot;\cdot)$ and $T(\cdot;\cdot)$.
(b) Show that for any $c \in \mathbb{R}$, $P(cf; [a,b]) = |c| P(f; [a,b])$ and do the same for $N(\cdot;\cdot)$ and $T(\cdot;\cdot)$.
(c) For any $a < c < b$, $P(f; [a,b]) = P(f; [a,c]) + P(f; [c,b])$.

4.16 Let $\{f_n\}_{n \ge 1} \subset BV[a,b]$ and let $\lim_n f_n(x) = f(x)$ for all $x$ in $[a,b]$. Show that $P(f; [a,b]) \le \liminf_{n\to\infty} P(f_n; [a,b])$ and do the same for $N(\cdot;\cdot)$ and $T(\cdot;\cdot)$.

4.17 Let $f \in BV[a,b]$. Show that $f$ is continuous except on an at most countable set.

4.18 Let $F : [a,b] \to \mathbb{R}$ be a.c. Show that it is of bounded variation. (Hint: By the definition of a.c., for $\epsilon = 1$, there is a $\delta_1 > 0$ such that $\sum_{j=1}^{k} |a_j - b_j| < \delta_1 \Rightarrow \sum_{j=1}^{k} |F(a_j) - F(b_j)| < 1$. Let $M$ be an integer $> \frac{b-a}{\delta_1} + 1$. Show that $T(F, [a,b]) \le M$.)

4.19 Let $F$ be an absolutely continuous nondecreasing function on $\mathbb{R}$. Let $\mu_F$ be the Lebesgue-Stieltjes measure corresponding to $F$. Show that for any $h \in L^1(\mathbb{R}, \mathcal{M}_{\mu_F}, \mu_F)$,

$$\int_{\mathbb{R}} h\,d\mu_F = \int_{\mathbb{R}} h f\,dm$$

where $f$ is a nonnegative measurable function such that $F(b) - F(a) = \int_{[a,b]} f\,dm$ for any $a < b$.


4.20 Let $F : [a,b] \to \mathbb{R}$ be absolutely continuous with $F'(\cdot) > 0$ a.e. on $[a,b]$, where $-\infty < a < b < \infty$. Let $F(a) = c$ and $F(b) = d$. Let $m(\cdot)$ denote the Lebesgue measure on $\mathbb{R}$. Show the following:
(a) (Change of variables formula). For any $g : [c,d] \to \mathbb{R}$ that is Lebesgue measurable and integrable w.r.t. $m$,

$$\int_{[c,d]} g\,dm = \int_{[a,b]} g(F)\,F'\,dm.$$

(b) For any Borel set $E \subset [a,b]$, $F(E)$ is also a Borel set.
(c) $\nu(\cdot) \equiv m(F(\cdot))$ is a measure on $\mathcal{B}([a,b])$ and $\nu \ll m$ with $\frac{d\nu}{dm}(\cdot) = F'(\cdot)$.
(d) $\lambda(\cdot) \equiv mF^{-1}(\cdot)$ is a measure on $\mathcal{B}([c,d])$ and $\lambda \ll m$ with $\frac{d\lambda}{dm}(\cdot) = \left(F'(F^{-1}(\cdot))\right)^{-1}$.
(e) For any measure $\mu \ll m$ on $\mathcal{B}([a,b])$ the measure $\psi(\cdot) \equiv \mu F^{-1}(\cdot)$ is dominated by $m$ with $\frac{d\psi}{dm} = \frac{d\mu}{dm}\left(F^{-1}(\cdot)\right) \left(F'(F^{-1}(\cdot))\right)^{-1}$.
(f) Establish (a) assuming that $g$ and $F'$ are both continuous, noting that both integrals reduce to Riemann integrals.
(Hint: (i) Verify (a) for $g = I_{[\alpha,\beta]}$, $c < \alpha < \beta < d$ and approximate by step functions. (ii) Show that $F$ is (1-1) and $F^{-1}(\cdot)$ is continuous and hence Borel measurable. (iii) Show that $\nu(\cdot) = \mu_F$, the Lebesgue-Stieltjes measure corresponding to $F$. (iv) Use the fact that for any $c \le \alpha \le \beta \le d$,

$$\psi([\alpha,\beta]) = \mu([\gamma,\delta]) = \int_{[\gamma,\delta]} g\,dm = \int_{[\gamma,\delta]} \frac{g\left(F^{-1}(F(\cdot))\right)}{F'\left(F^{-1}(F(\cdot))\right)} F'(\cdot)\,dm = \int_{[\alpha,\beta]} \frac{g\left(F^{-1}(\cdot)\right)}{F'\left(F^{-1}(\cdot)\right)}\,dm$$

by (a), where $\gamma = F^{-1}(\alpha)$, $\delta = F^{-1}(\beta)$, and $g = \frac{d\mu}{dm}$.)


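4.21 Let $F : \mathbb{R} \to \mathbb{R}$ be absolutely continuous on every finite interval.
(a) Show that the $f$ in (4.1) can be chosen independently of the interval $[a,b]$.
(b) Further, if $f$ is integrable over $\mathbb{R}$, then $\lim_{x\to-\infty} F(x) \equiv F(-\infty)$ and $\lim_{x\to\infty} F(x) \equiv F(\infty)$ exist and $F(x) = F(-\infty) + \int_{(-\infty,x)} f\,d\mu_L$ for all $x$ in $\mathbb{R}$.
(c) Give an example where $F : \mathbb{R} \to \mathbb{R}$ is a.c., but $f$ is not integrable over $\mathbb{R}$.

4.22 Let $F : \mathbb{R} \to \mathbb{R}$ be absolutely continuous on bounded intervals. Let $\{I_j : 1 \le j \le k \le \infty\}$ be a collection of disjoint intervals such that $\bigcup_{j=1}^{k} I_j \equiv \mathbb{R}$ and on each $I_j$, $F'(\cdot) > 0$ a.e. or $F'(\cdot) < 0$ a.e. w.r.t. $m$.
(a) Show that for any $h \in L^1(\mathbb{R}, m)$,

$$\int_{\mathbb{R}} h\,dm = \int_{\mathbb{R}} h\left(F(\cdot)\right) |F'(\cdot)|\,dm.$$

(b) Show that if $\mu$ is a measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ dominated by $m$ then the measure $\mu F^{-1}(\cdot)$ is also dominated by $m$ and

$$\frac{d\mu F^{-1}}{dm}(y) = \sum_{x_j \in D(y)} \frac{f(x_j)}{|F'(x_j)|}$$

where $f(\cdot) = \frac{d\mu}{dm}$ and $D(y) = \{x_j : x_j \in I_j, F(x_j) = y\}$.
(c) Let $\mu$ be the $N(0,1)$ measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, i.e.,

$$\frac{d\mu}{dm}(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}, \quad -\infty < x < \infty.$$

Let $F(x) = x^2$. Find $\frac{d\mu F^{-1}}{dm}$.

4.23 Let $f : \mathbb{R} \to \mathbb{R}$ be integrable w.r.t. $m$ on bounded intervals. Show that for almost all $x_0$ in $\mathbb{R}$ (w.r.t. $m$),

$$\lim_{a \uparrow x_0,\; b \downarrow x_0} \frac{1}{(b-a)} \int_a^b |f(x) - f(x_0)|\,dx = 0.$$

(Hint: For each rational $r$, by Theorem 4.4.4,

$$\lim_{a \uparrow x_0,\; b \downarrow x_0} \frac{1}{(b-a)} \int_a^b |f(x) - r|\,dx = |f(x_0) - r|$$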


a.e. $(m)$. Let $A_r$ denote the set of $x_0$ for which this fails to hold. Let $A = \bigcup_{r \in \mathbb{Q}} A_r$. Then $m(A) = 0$. For any $x_0 \notin A$ and any $\epsilon > 0$, choose a rational $r$ such that $|f(x_0) - r| < \epsilon$ and now show that

$$\limsup_{a \uparrow x_0,\; b \downarrow x_0} \frac{1}{(b-a)} \int_a^b |f(x) - f(x_0)|\,dx < 2\epsilon.)$$

4.24 Use the hint to the above problem to establish Theorem 4.4.5.

4.25 Let $(\Omega, \mathcal{B})$ be a measurable space. Let $\{\mu_n\}_{n \ge 1}$ and $\mu$ be $\sigma$-finite measures on $(\Omega, \mathcal{B})$. Let for each $n \ge 1$, $\mu_n = \mu_{na} + \mu_{ns}$ be the Lebesgue decomposition of $\mu_n$ w.r.t. $\mu$ with $\mu_{na} \ll \mu$ and $\mu_{ns} \perp \mu$. Let $\lambda = \sum_{n \ge 1} \mu_n$, $\lambda_a = \sum_{n \ge 1} \mu_{na}$, $\lambda_s = \sum_{n \ge 1} \mu_{ns}$. Show that $\lambda_a \ll \mu$ and $\lambda_s \perp \mu$ and that $\lambda = \lambda_a + \lambda_s$ is the Lebesgue decomposition of $\lambda$ w.r.t. $\mu$.

4.26 Let $\{\mu_n\}_{n \ge 1}$ be Radon measures on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$ and $m$ be the Lebesgue measure on $\mathbb{R}^k$. Show that if $\lambda = \sum_{n=1}^{\infty} \mu_n$ is also a Radon measure, then

$$D\lambda = \sum_{n=1}^{\infty} D\mu_n \quad \text{a.e. } (m).$$

(Hint: Use Theorem 4.4.4 and the uniqueness of the Lebesgue decomposition.)

4.27 Let $F_n$, $n \ge 1$ be a sequence of nondecreasing functions from $\mathbb{R} \to \mathbb{R}$. Let $F(x) = \sum_{n \ge 1} F_n(x) < \infty$ for all $x \in \mathbb{R}$. Show that $F(\cdot)$ is nondecreasing and

$$F'(\cdot) = \sum_{n \ge 1} F_n'(\cdot) \quad \text{a.e. } (m).$$

4.28 Let $E$ be a Lebesgue measurable set in $\mathbb{R}$. The metric density of $E$ at $x$ is defined as

$$D_E(x) \equiv \lim_{\delta \downarrow 0} \frac{m\left(E \cap (x-\delta, x+\delta)\right)}{2\delta}$$

if it exists. Show that $D_E(\cdot) = I_E(\cdot)$ a.e. $m$. (Hint: Consider the measure $\lambda_E(\cdot) \equiv m(E \cap \cdot)$ on the Lebesgue $\sigma$-algebra. Show that $\lambda_E \ll m$ and find $\lambda_E'(\cdot)$ (cf. Definition 4.4.2).)

4.29 Let $F, G : [a,b] \to \mathbb{R}$ be both absolutely continuous. Show that $H = FG$ is also absolutely continuous on $[a,b]$ and that

$$\int_{[a,b]} F\,dG + \int_{[a,b]} G\,dF = F(b)G(b) - F(a)G(a).$$


4.30 Let $(\Omega, \mathcal{F}, \mu)$ be a finite measure space. Fix $1 \le p < \infty$. Let $T : L^p(\mu) \to \mathbb{R}$ be a bounded linear functional as defined in (3.2.10) (cf. Section 3.2). Complete the following outline of a proof of Theorem 3.2.3 (Riesz representation theorem).
(a) Let $\nu(A) \equiv T(I_A)$, $A \in \mathcal{F}$. Verify that $\nu(\cdot)$ is a signed measure on $(\Omega, \mathcal{F})$.
(b) Verify that $|\nu| \ll \mu$.
(c) Let $g \equiv \frac{d\nu}{d\mu}$. Show that $g \in L^q(\mu)$ where $q = \frac{p}{p-1}$ if $1 < p < \infty$, and $q = \infty$ if $p = 1$.

1 0, there is an $f \in C[0, 2\pi]$ and an $m \ge 1$ such that $\|g - D_m(f, \cdot)\|_2 < 2\epsilon$. That is, the set $\mathcal{T}$ of all finite linear combinations of the functions in the class $\mathcal{T}_0$ is dense in $L^2[0, 2\pi]$. Further, it is easy to verify that $h_1, h_2 \in \mathcal{T}_0$, $h_1 \ne h_2$ implies

$$\int_0^{2\pi} h_1(x) h_2(x)\,dx = 0,$$


5. Product Measures, Convolutions, and Transforms

i.e., $\mathcal{T}_0$ is an orthogonal family. Since $\mathcal{T}_0$ is orthogonal and $\mathcal{T}$ is dense in $L^2[0, 2\pi]$, $\mathcal{T}_0$ is complete. □

Definition 5.6.2: A function in $\mathcal{T}$ is called a trigonometric polynomial.

Thus, the above theorem says that trigonometric polynomials are dense in $L^2[0, 2\pi]$. Completeness of $\mathcal{T}$ in $L^2[0, 2\pi]$ and the results of Section 3.3 lead to

Theorem 5.6.4: Let $f \in L^2[0, 2\pi]$. Let $\{(a_n, b_n), n = 0, 1, 2, \ldots\}$ and $\{s_n(f, \cdot)\}_{n \ge 0}$ be the associated Fourier coefficient sequences and partial sum sequence of the Fourier series for $f$ as in Definition 5.6.1. Then
(i) $s_n(f, \cdot) \to f$ in $L^2[0, 2\pi]$,
(ii) $\sum_{n=0}^{\infty} (\breve{a}_n^2 + \breve{b}_n^2) = \frac{1}{2\pi} \int_0^{2\pi} |f|^2\,dx$ where for $n \ge 0$, $\breve{a}_n = a_n / c_n$, with $c_n^2 = \frac{1}{2\pi} \int_0^{2\pi} (\cos nx)^2\,dx = \frac{1}{2}$ for $n \ge 1$ (and $c_0^2 = 1$), and for $n \ge 1$, $\breve{b}_n = b_n / d_n$ with $d_n^2 = \frac{1}{2\pi} \int_0^{2\pi} (\sin nx)^2\,dx = \frac{1}{2}$.
(iii) Further, if $f, g \in L^2[0, 2\pi]$, then

$$\frac{1}{2\pi} \int_0^{2\pi} f g\,dx = a_0 \alpha_0 + \sum_{n=1}^{\infty} \left( \frac{a_n \alpha_n}{c_n^2} + \frac{b_n \beta_n}{d_n^2} \right),$$

where $\{(a_n, b_n) : n = 0, 1, 2, \ldots\}$ and $\{(\alpha_n, \beta_n) : n = 0, 1, 2, \ldots\}$ are, respectively, the Fourier coefficients of $f$ and $g$.

Clearly (ii) above is a restatement of Bessel's equality. Assertion (iii) is known as the Parseval identity. As for convergence pointwise or almost everywhere, A. N. Kolmogorov showed in 1926 (see Körner (1989)) that there exists an $f \in L^1[0, 2\pi]$ such that $\limsup_{n\to\infty} |s_n(f, x)| = \infty$ everywhere on $[0, 2\pi]$. This led to the belief that for $f \in C[0, 2\pi]$, the mean square convergence of (i) in Theorem 5.6.4 cannot be improved upon. But L. Carleson showed in 1964 (see Körner (1989)) that for $f$ in $L^2[0, 2\pi]$, $s_n(f, \cdot) \to f(\cdot)$ almost everywhere. Finally, turning to $L^1[0, 2\pi]$, one has the following:

Theorem 5.6.5: Let $f \in L^1[0, 2\pi]$. Let $\{(a_n, b_n) : n \ge 0\}$ be as in (6.1) and satisfy $\sum_{n=0}^{\infty} (|a_n| + |b_n|) < \infty$. Let $s_n(f, \cdot)$ be as in (6.2). Then $s_n(f, \cdot)$ converges uniformly on $[0, 2\pi]$ and the limit coincides with $f$ almost everywhere.

Proof: Note that $\sum_{n=0}^{\infty} (|a_n| + |b_n|) < \infty$ implies that the sequence $\{s_n(f, \cdot)\}_{n \ge 0}$ is a Cauchy sequence in the Banach space $C[0, 2\pi]$ with the sup-norm. Thus, there exists a $g$ in $C[0, 2\pi]$ such that $s_n(f, \cdot) \to g$ uniformly on $[0, 2\pi]$. It is easy to check that this implies that $g$ and $f$ have the same Fourier coefficients. Set $h = g - f$. Then $h \in L^1[0, 2\pi]$ and the Fourier coefficients of $h$ are all zero. This implies that $h$ is orthogonal to the members of the class $\mathcal{T}$, which in turn yields that $h$ is orthogonal to all

5.6 Fourier series


continuous functions in $C[0, 2\pi]$, i.e.,

$$\int_0^{2\pi} h(x) k(x)\,dx = 0$$

for all $k \in C[0, 2\pi]$. Since $h \in L^1[0, 2\pi]$ and for any interval $A \subset [0, 2\pi]$, there exists a sequence $\{k_n\}_{n \ge 1}$ of uniformly bounded continuous functions, such that $k_n \to I_A$ a.e. $(m)$, by the DCT,

$$\int h(x) I_A(x)\,dx = \lim_{n\to\infty} \int h(x) k_n(x)\,dx = 0.$$

This in turn implies that $h = 0$ a.e., i.e., $g = f$ a.e. □

Remark 5.6.1: If $f \in L^2[0, 2\pi]$, the Fourier coefficients $\{a_n, b_n\}$ are square summable and hence go to zero as $n \to \infty$. What if $f \in L^1[0, 2\pi]$? If $f \in L^1[0, 2\pi]$, one can assert the following:

Theorem 5.6.6: (Riemann-Lebesgue lemma). Let $f \in L^1[0, 2\pi]$. Then

$$\lim_{n\to\infty} \int_0^{2\pi} f(x) \cos nx\,dx = 0 = \lim_{n\to\infty} \int_0^{2\pi} f(x) \sin nx\,dx.$$

Proof: The lemma holds if $f = I_A$ for any interval $A \subset [0, 2\pi]$ and since step functions (i.e., linear combinations of indicator functions of intervals) are dense in $L^1[0, 2\pi]$, the lemma is proved. □

It can be shown that the mapping $f \to \{(a_n, b_n)\}_{n \ge 0}$ from $L^1[0, 2\pi]$ to bivariate sequences that go to zero as $n \to \infty$ is one-to-one but not onto (Rudin (1987), Chapter 5).

Remark 5.6.2: (The complex case). Let $T \equiv \{z : z = e^{\iota\theta}, 0 \le \theta \le 2\pi\}$ be the unit circle in the complex plane $\mathbb{C}$. Every function $g : T \to \mathbb{C}$ can be identified with a function $f$ on $\mathbb{R}$ by $f(t) = g(e^{\iota t})$. Clearly, $f(\cdot)$ is periodic on $\mathbb{R}$ with period $2\pi$. In the rest of this section, for $0 < p < \infty$, $L^p(T)$ will stand for the collection of all Borel measurable functions $f : [0, 2\pi] \to \mathbb{C}$ such that $\int_{[0,2\pi]} |f|^p\,dm < \infty$ where $m(\cdot)$ is the Lebesgue measure. A trigonometric polynomial is a function of the form

$$f(\cdot) \equiv \sum_{n=-k}^{k} \alpha_n e^{\iota n x} \equiv a_0 + \sum_{n=1}^{k} (a_n \cos nx + b_n \sin nx),$$

$k < \infty$, where $\{\alpha_n\}$, $\{a_n\}_{n \ge 0}$, and $\{b_n\}_{n \ge 0}$ are sequences of complex numbers. The completeness of the trigonometric polynomials proved in Theorem 5.6.3 implies that the family $\{e^{\iota n x} : n = 0, \pm 1, \pm 2, \ldots\}$ is a complete orthonormal basis for $L^2(T)$, which is a complex Hilbert space.


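Thus Theorem 5.6.4 carries over to this case.

Theorem 5.6.7:
(i) Let $f \in L^2(T)$. Then,

$$\sum_{n=-k}^{k} \alpha_n e^{\iota n x} \to f \quad \text{in } L^2(T)$$

where $\alpha_n \equiv \frac{1}{2\pi} \int_0^{2\pi} f(x) e^{-\iota n x}\,dx$, $n \in \mathbb{Z}$. Further,

$$\sum_{n=-\infty}^{\infty} |\alpha_n|^2 = \frac{1}{2\pi} \int_0^{2\pi} |f|^2\,dm.$$

(ii) For any sequence $\{\alpha_n\}_{n \in \mathbb{Z}}$ of complex numbers such that $\sum_{n=-\infty}^{\infty} |\alpha_n|^2 < \infty$, the sequence $\left\{ f_k(x) \equiv \sum_{n=-k}^{k} \alpha_n e^{\iota n x} \right\}_{k \ge 1}$ converges in $L^2(T)$ to a unique $f$ such that

$$\alpha_n = \frac{1}{2\pi} \int_0^{2\pi} f(x) e^{-\iota n x}\,dx.$$

(iii) For any $f, g \in L^2(T)$,

$$\sum_{n=-\infty}^{\infty} \alpha_n \bar{\beta}_n = \frac{1}{2\pi} \int_0^{2\pi} f(x) \overline{g(x)}\,dx$$

where

$$\alpha_n = \hat{f}(n) = \frac{1}{2\pi} \int_0^{2\pi} f(x) e^{-\iota n x}\,dx, \qquad \beta_n = \hat{g}(n) = \frac{1}{2\pi} \int_0^{2\pi} g(x) e^{-\iota n x}\,dx, \quad n \in \mathbb{Z}.$$

Further,

$$\sum_{n=-\infty}^{\infty} |\alpha_n \beta_n| < \infty.$$

(iv) $L^2(T)$ is isomorphic to $\ell^2(\mathbb{Z})$, the Hilbert space of all square summable sequences of complex numbers on $\mathbb{Z}$.

Similarly, Theorem 5.6.5 carries over to the complex case.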
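The identity in Theorem 5.6.7 (i) can be checked numerically for a trigonometric polynomial. The following is a minimal sketch, not part of the text: the helper names (`trig_poly`, `mean_over_circle`, `fourier_coefficient`) and the sample coefficients are illustrative assumptions. It relies on the fact that an equispaced Riemann sum with $N$ nodes integrates $e^{\iota m x}$ exactly over $[0, 2\pi]$ whenever $|m| < N$, so for a polynomial of degree $k$ any $N > 2k$ gives the coefficients and the energy up to rounding error.

```python
import cmath
import math


def trig_poly(coeffs, x):
    """Evaluate f(x) = sum_n alpha_n * exp(i*n*x); coeffs maps n -> alpha_n."""
    return sum(a * cmath.exp(1j * n * x) for n, a in coeffs.items())


def mean_over_circle(func, num_points):
    """Approximate (1/2pi) * integral of func over [0, 2pi] by an
    equispaced Riemann sum (exact for frequencies below num_points)."""
    nodes = (2.0 * math.pi * j / num_points for j in range(num_points))
    return sum(func(x) for x in nodes) / num_points


def fourier_coefficient(func, n, num_points):
    """alpha_n = (1/2pi) * integral of func(x) * exp(-i*n*x) dx."""
    return mean_over_circle(lambda x: func(x) * cmath.exp(-1j * n * x), num_points)


# Illustrative coefficients (degree k = 3), so N = 16 > 2k nodes suffice.
alpha = {-3: 0.5j, -1: 2.0, 0: -1.0, 2: 1.0 + 1.0j}
f = lambda x: trig_poly(alpha, x)

energy = mean_over_circle(lambda x: abs(f(x)) ** 2, 16)   # (1/2pi) * int |f|^2
sum_sq = sum(abs(a) ** 2 for a in alpha.values())         # sum |alpha_n|^2
print(energy, sum_sq)
```

The same helpers recover each $\alpha_n$ from $f$, which is the finite-dimensional shadow of the isomorphism in part (iv).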


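Theorem 5.6.8: Let $f \in L^1(T)$. Suppose

$$\sum_{n=-\infty}^{\infty} |\hat{f}(n)| < \infty$$

where

$$\hat{f}(n) = \frac{1}{2\pi} \int_0^{2\pi} f(x) e^{-\iota n x}\,dx, \quad n \in \mathbb{Z}.$$

Then

$$s_n(f, x) \equiv \sum_{j=-n}^{n} \hat{f}(j) e^{\iota j x}$$

converges uniformly on $[0, 2\pi]$ and the limit coincides with $f$ a.e., and hence $f$ is equal a.e. to a continuous function.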

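5.7 Fourier transforms on R

In this section and in Section 5.8, let $L^p(\mathbb{R})$ stand for

$$L^p(\mathbb{R}) \equiv \left\{ f : f : \mathbb{R} \to \mathbb{C}, \text{ Borel measurable}, \int_{\mathbb{R}} |f|^p\,dm < \infty \right\} \tag{7.1}$$

where $m(\cdot)$ is Lebesgue measure. Also, for $f \in L^1(\mathbb{R})$, $\int f\,dm$ will often be written as $\int f(x)\,dx$. Let

$$C_0 \equiv \left\{ f : f : \mathbb{R} \to \mathbb{C}, \text{ continuous and } \lim_{|x|\to\infty} f(x) = 0 \right\}. \tag{7.2}$$

Definition 5.7.1: For $f \in L^1(\mathbb{R})$, $t \in \mathbb{R}$,

$$\hat{f}(t) \equiv \int f(x) e^{-\iota t x}\,dx \tag{7.3}$$

is called the Fourier transform of $f$.

Proposition 5.7.1: Let $f \in L^1(\mathbb{R})$ and $\hat{f}(\cdot)$ be as in (7.3). Then
(i) $\hat{f}(\cdot) \in C_0$.
(ii) If $f_a(x) \equiv f(x - a)$, $a \in \mathbb{R}$, then $\hat{f}_a(t) = e^{-\iota t a} \hat{f}(t)$, $t \in \mathbb{R}$.

Proof: (i) For any $t \in \mathbb{R}$, $t_n \to t \Rightarrow e^{-\iota t_n x} f(x) \to e^{-\iota t x} f(x)$ for all $x \in \mathbb{R}$ and since $|e^{-\iota t_n x} f(x)| \le |f(x)|$ for all $n$ and $x$, by the DCT $\hat{f}(t_n) \to \hat{f}(t)$. To show that $\hat{f}(t) \to 0$ as $|t| \to \infty$, the same proof as that of Theorem 5.6.6 works. Thus, it holds if $f = I_{[a,b]}$, for $a, b \in \mathbb{R}$, $a < b$ and since the step functions are dense in $L^1(\mathbb{R})$, it holds for all $f \in L^1(\mathbb{R})$.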


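(ii) This is a consequence of the translation invariance of $m(\cdot)$, i.e., $m(A + a) = m(A)$ for all $A \in \mathcal{B}(\mathbb{R})$, $a \in \mathbb{R}$. □

The continuity of $\hat{f}(\cdot)$ can be strengthened to differentiability if, in addition to $f \in L^1(\mathbb{R})$, $x f(x) \in L^1(\mathbb{R})$. More generally, if $f \in L^1(\mathbb{R})$ and $x^k f(x) \in L^1(\mathbb{R})$ for some $k \ge 1$, then $\hat{f}(\cdot)$ is differentiable $k$ times with all derivatives $\hat{f}^{(r)}(t) \to 0$ as $|t| \to \infty$ for $r \le k$ (Problem 5.22).

Proposition 5.7.2: Let $f, g \in L^1(\mathbb{R})$ and $f * g$ be their convolution as defined in (4.7). Then

$$\widehat{f * g} = \hat{f}\,\hat{g}. \tag{7.4}$$

Proof:

$$\begin{aligned} \widehat{f * g}(t) &= \int_{\mathbb{R}} e^{-\iota t x} \left( \int_{\mathbb{R}} f(x - u) g(u)\,du \right) dx \\ &= \int_{\mathbb{R}} \left( \int_{\mathbb{R}} e^{-\iota t (x - u)} f(x - u)\, e^{-\iota t u} g(u)\,du \right) dx \\ &= \int_{\mathbb{R}} (f_t * g_t)(x)\,dx \end{aligned} \tag{7.5}$$

where $f_t(x) = e^{-\iota t x} f(x)$, $g_t(x) = e^{-\iota t x} g(x)$. Thus, by Proposition 5.4.4,

$$\widehat{f * g}(t) = \left( \int_{\mathbb{R}} f_t(x)\,dx \right) \left( \int_{\mathbb{R}} g_t(x)\,dx \right) = \hat{f}(t)\,\hat{g}(t).$$

□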

The process of recovering $f$ from $\hat{f}$ (i.e., that of finding an inversion formula) can be developed along the lines of Fejer's theorem (Theorem 5.6.1).

Theorem 5.7.3: (Fejer's theorem). Let $f \in L^1(\mathbb{R})$, $\hat{f}(\cdot)$ be as in (7.3) and

$$S_T(f, x) \equiv \frac{1}{2\pi} \int_{-T}^{T} \hat{f}(t) e^{\iota t x}\,dt, \quad T \ge 0, \tag{7.6}$$

$$D_R(f, x) \equiv \frac{1}{R} \int_0^R S_T(f, x)\,dT, \quad R > 0. \tag{7.7}$$

(i) If $f$ is continuous at $x_0$ and $f$ is bounded on $\mathbb{R}$, then

$$\lim_{R\to\infty} D_R(f, x_0) = f(x_0). \tag{7.8}$$

(ii) If $f$ is uniformly continuous and bounded on $\mathbb{R}$, then

$$\lim_{R\to\infty} D_R(f, \cdot) = f(\cdot) \quad \text{uniformly on } \mathbb{R}. \tag{7.9}$$


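(iii) As $R \to \infty$,

$$D_R(f, \cdot) \to f(\cdot) \quad \text{in } L^1(\mathbb{R}). \tag{7.10}$$

(iv) If $f \in L^p(\mathbb{R})$, $1 \le p < \infty$, then as $R \to \infty$,

$$D_R(f, \cdot) \to f(\cdot) \quad \text{in } L^p(\mathbb{R}). \tag{7.11}$$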

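Corollary 5.7.4: (Uniqueness theorem). If $f$ and $g \in L^1(\mathbb{R})$ and $\hat{f}(\cdot) = \hat{g}(\cdot)$, then $f = g$ a.e. $(m)$.

Proof: Let $h = f - g$. Then $h \in L^1(\mathbb{R})$ and $\hat{h}(\cdot) \equiv 0$. Thus, $S_T(h, \cdot) \equiv 0$ and $D_R(h, \cdot) \equiv 0$ where $S_T(h, \cdot)$ and $D_R(h, \cdot)$ are as in (7.6) and (7.7). Hence by Theorem 5.7.3 (iii), $h = 0$ a.e. $(m)$, i.e., $f = g$ a.e. $(m)$. □

Corollary 5.7.5: (Inversion formula). Let $f \in L^1(\mathbb{R})$ and $\hat{f} \in L^1(\mathbb{R})$. Then

$$f(x) = \frac{1}{2\pi} \int_{\mathbb{R}} \hat{f}(t) e^{\iota t x}\,dt \quad \text{a.e. } (m). \tag{7.12}$$

Proof: Since $\hat{f} \in L^1(\mathbb{R})$, by the DCT,

$$S_T(f, x) \to \frac{1}{2\pi} \int_{\mathbb{R}} \hat{f}(t) e^{\iota t x}\,dt \quad \text{as } T \to \infty$$

for all $x$ in $\mathbb{R}$ and hence $D_R(f, x)$ has the same limit as $R \to \infty$. Now (7.12) follows from (7.10). □

The following results, i.e., Lemma 5.7.6 and Lemma 5.7.7, are needed for the proof of Theorem 5.7.3. The first one is an analog of Lemma 5.6.2.

Lemma 5.7.6: (Fejer). For $R > 0$, let

$$K_R(x) \equiv \frac{1}{2\pi} \frac{1}{R} \int_0^R \left( \int_{-T}^{T} e^{\iota t x}\,dt \right) dT. \tag{7.13}$$

Then
(i)

$$K_R(x) = \begin{cases} \dfrac{1}{\pi} \dfrac{(1 - \cos Rx)}{R x^2}, & x \ne 0 \\[2mm] \dfrac{R}{2\pi}, & x = 0, \end{cases} \tag{7.14}$$

and hence $K_R(\cdot) \ge 0$.
(ii) For $\delta > 0$,

$$\int_{|x| \ge \delta} K_R(x)\,dx \to 0 \quad \text{as } R \to \infty. \tag{7.15}$$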


(iii)

$$\int_{-\infty}^{\infty} K_R(x)\,dx = 1. \tag{7.16}$$
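The three properties in Lemma 5.7.6 are what make $K_R$ an approximate identity, and they are easy to inspect numerically. The sketch below is an illustration only: it assumes the closed form $K_R(x) = (1 - \cos Rx)/(\pi R x^2)$ obtained in the proof of (i), uses a plain midpoint rule, and truncates the real line to $[-L, L]$ with an arbitrary cutoff $L$ (the function names and the cutoff are assumptions, not part of the text).

```python
import math


def fejer_kernel(x, R):
    """K_R(x) = (1 - cos(R x)) / (pi R x^2) for x != 0 and R / (2 pi) at x = 0."""
    if abs(x) < 1e-8:
        return R / (2.0 * math.pi)
    return (1.0 - math.cos(R * x)) / (math.pi * R * x * x)


def midpoint_rule(func, a, b, n):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / n
    return h * sum(func(a + (i + 0.5) * h) for i in range(n))


def mass_outside(delta, R, cutoff=1000.0, n=200_000):
    """Approximate the integral of K_R over delta <= |x| <= cutoff.
    K_R is even, so twice the integral over [delta, cutoff] is used."""
    return 2.0 * midpoint_rule(lambda x: fejer_kernel(x, R), delta, cutoff, n)


total = mass_outside(0.0, R=5.0)    # close to 1, up to the truncated tail
tail_5 = mass_outside(1.0, R=5.0)   # mass away from the origin ...
tail_50 = mass_outside(1.0, R=50.0) # ... shrinks as R grows
print(total, tail_5, tail_50)
```

Since $0 \le 1 - \cos Rx \le 2$, the mass outside $[-\delta, \delta]$ is at most $4/(\pi R \delta)$, which is the quantitative content of (7.15).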

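Proof: (i) $K_R(0) = \frac{1}{2\pi R} \int_0^R (2T)\,dT = \frac{R}{2\pi}$. For $x \ne 0$ and $R > 0$,

$$K_R(x) = \frac{1}{2\pi R} \int_0^R \frac{2 \sin Tx}{x}\,dT = \frac{1}{2\pi R} \frac{2(1 - \cos Rx)}{x^2}.$$

(ii) For $\delta > 0$, since $0 \le 1 - \cos Rx \le 2$,

$$0 \le \int_{|x| \ge \delta} K_R(x)\,dx \le \frac{4}{2\pi R} \int_{|x| \ge \delta} \frac{1}{x^2}\,dx = \frac{4}{\pi R} \frac{1}{\delta} \to 0 \quad \text{as } R \to \infty.$$

(iii)

$$\int_{-\infty}^{\infty} K_R(x)\,dx = \frac{2}{\pi} \frac{1}{R} \int_0^{\infty} \frac{1 - \cos Rx}{x^2}\,dx = \frac{2}{\pi} \int_0^{\infty} \frac{1 - \cos u}{u^2}\,du.$$

Now

$$\begin{aligned} \int_0^{\infty} \frac{1 - \cos u}{u^2}\,du &= \lim_{L\to\infty} \int_0^L \frac{1 - \cos u}{u^2}\,du \quad \text{(by the MCT)} \\ &= \lim_{L\to\infty} \int_0^L \frac{1}{u^2} \left( \int_0^u \sin x\,dx \right) du \\ &= \lim_{L\to\infty} \int_0^L \sin x \left( \int_x^L \frac{1}{u^2}\,du \right) dx \quad \text{(by Fubini's theorem)} \\ &= \lim_{L\to\infty} \left[ \int_0^L \frac{\sin x}{x}\,dx - \frac{1}{L} \int_0^L \sin x\,dx \right] \\ &= \lim_{L\to\infty} \int_0^L \frac{\sin x}{x}\,dx \quad \text{since } \left| \int_0^L \sin x\,dx \right| \le 2. \end{aligned}$$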


Thus,

$$\int_0^{\infty} \frac{1 - \cos u}{u^2}\,du = \int_0^{\infty} \frac{\sin x}{x}\,dx = \frac{\pi}{2} \tag{7.17}$$

(cf. Problem 5.9). Hence (iii) follows. □

Lemma 5.7.7: Let $f \in L^p(\mathbb{R})$, $0 < p < \infty$. Then

$$\int_{-\infty}^{\infty} |f(x - u) - f(x)|^p\,dx \to 0 \quad \text{as } |u| \to 0. \tag{7.18}$$

Proof: The lemma holds if $f \in C_K$, i.e., if $f$ is continuous on $\mathbb{R}$ (with values in $\mathbb{C}$) and vanishes outside a bounded interval. By Theorem 2.3.14, such functions are dense in $L^p(\mathbb{R})$. So given $f \in L^p(\mathbb{R})$, $0 < p < \infty$ and $\epsilon > 0$, let $g \in C_K$ be such that $\int |f - g|^p\,dm < \epsilon$. For any $0 < p < \infty$, there is a $0 < c_p < \infty$ such that for all $x, y, z \in (0, \infty)$, $|x + y + z|^p \le c_p (|x|^p + |y|^p + |z|^p)$. Then,

$$\begin{aligned} \int |f(x - u) - f(x)|^p\,dx &\le c_p \int \left( |f(x - u) - g(x - u)|^p + |g(x - u) - g(x)|^p + |g(x) - f(x)|^p \right) dx \\ &\le c_p \left( 2\epsilon + \int |g(x - u) - g(x)|^p\,dx \right). \end{aligned}$$

So

$$\limsup_{|u| \to 0} \int |f(x - u) - f(x)|^p\,dx \le c_p\, 2\epsilon.$$

Since $\epsilon > 0$ is arbitrary, the lemma is proved. □

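Proof of Theorem 5.7.3: From (7.7),

$$\begin{aligned} D_R(f, x) &\equiv \frac{1}{2\pi R} \int_0^R \int_{-T}^{T} e^{\iota t x} \left( \int_{-\infty}^{\infty} e^{-\iota t y} f(y)\,dy \right) dt\,dT \\ &= \frac{1}{2\pi} \frac{1}{R} \int_0^R \int_{-T}^{T} \int_{-\infty}^{\infty} e^{\iota t u} f(x - u)\,du\,dt\,dT. \end{aligned}$$

Now Fubini's theorem yields

$$\begin{aligned} D_R(f, x) &= \int_{-\infty}^{\infty} f(x - u) \left[ \frac{1}{2\pi} \frac{1}{R} \int_0^R \int_{-T}^{T} e^{\iota t u}\,dt\,dT \right] du \\ &= \int_{-\infty}^{\infty} f(x - u) K_R(u)\,du \end{aligned} \tag{7.19}$$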


where $K_R(\cdot)$ is as in (7.13). Now let $f$ be continuous at $x_0$ and bounded on $\mathbb{R}$ by $M_f$. Fix $\epsilon > 0$ and choose $\delta > 0$ such that $|x - x_0| < \delta \Rightarrow |f(x) - f(x_0)| < \epsilon$. From (7.16) and (7.19),

$$D_R(f, x_0) - f(x_0) = \int \left( f(x_0 - u) - f(x_0) \right) K_R(u)\,du$$

implying $|D_R(f, x_0) - f(x_0)| \le$

|u| 0}, $i = 1, 2$.
(a) Show that $D_1 \cup D_2$ is countable.
(b) Let $\phi_i(x) = \mu_i(\{x\})$, $x \in \mathbb{R}$, $i = 1, 2$. Show that $\phi_i$ is Borel measurable for $i = 1, 2$.
(c) Show that $\int \phi_1\,d\mu_2 = \sum_{z \in D_1 \cap D_2} \phi_1(z) \phi_2(z)$.
(d) Deduce (2.11) from (c).

5.4 Extend Theorem 5.2.3 as follows. Let $F_1$, $F_2$ be two nondecreasing right continuous functions on $[a,b]$. Then

$$\int_{(a,b]} F_1\,dF_2 + \int_{(a,b]} F_2\,dF_1 = F_1(b)F_2(b) - F_1(a)F_2(a) + \int_{(a,b]} \phi_1\,d\mu_2,$$

where $\phi_1$ is as in Problem 5.3.

5.5 Let $F_i : \mathbb{R} \to \mathbb{R}$ be nondecreasing and right continuous, $i = 1, 2$. Show that if $\lim_{b \uparrow \infty} F_1(b)F_2(b) = \lambda_1$ and $\lim_{a \downarrow -\infty} F_1(a)F_2(a) = \lambda_2$ exist and are finite, then

$$\int_{\mathbb{R}} F_1\,dF_2 + \int_{\mathbb{R}} F_2\,dF_1 = \lambda_1 - \lambda_2 + \int_{\mathbb{R}} \phi_1\,d\mu_2$$

where $\phi_1$ is as in Problem 5.3.


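5.6 Let $(\Omega, \mathcal{F}, \mu)$ be a $\sigma$-finite measure space and $f$ be a nonnegative measurable function. Then, $\int_\Omega f\,d\mu = \int_{[0,\infty)} \mu(\{f \ge t\})\,dt$. (Hint: Consider the product space of $(\Omega, \mathcal{F}, \mu)$ with $(\mathbb{R}_+, \mathcal{B}(\mathbb{R}_+), m)$ and apply Tonelli's theorem to the function $g(\omega, t) = I(f(\omega) \ge t)$, after showing that $g$ is $(\mathcal{F} \times \mathcal{B}(\mathbb{R}_+))$-measurable.)

5.7 Let $(\Omega, \mathcal{F}, P)$ be a probability space and $X : \Omega \to \mathbb{R}_+$ be a random variable.
(a) Show that for any $h : \mathbb{R}_+ \to \mathbb{R}_+$ that is absolutely continuous,

$$\int_\Omega h(X)\,dP = h(0) + \int_{[0,\infty)} h'(t) P(X \ge t)\,dt = h(0) + \int_{(0,\infty)} h'(t) P(X > t)\,dt.$$

(b) Show that for any $0 < p < \infty$,

$$\int_\Omega X^p\,dP = \int_{[0,\infty)} p t^{p-1} P(X \ge t)\,dt.$$

(c) Show that for any $0 < p < \infty$,

$$\int_\Omega X^{-p}\,dP = \frac{1}{\Gamma(p)} \int_{[0,\infty)} \psi_X(t) t^{p-1}\,dt,$$

where $\Gamma(p) = \int_{[0,\infty)} e^{-t} t^{p-1}\,dt$, $p > 0$, and $\psi_X(t) = \int_\Omega e^{-tX}\,dP$, $t \in \mathbb{R}_+$.
(Hint: (a) Apply Tonelli's theorem to the function $f(t, \omega) \equiv h'(t) I(X(\omega) \ge t)$ on the product measure space $([0,\infty) \times \Omega, \mathcal{B}([0,\infty)) \times \mathcal{F}, m \times P)$, where $m$ is Lebesgue measure on $(\mathbb{R}_+, \mathcal{B}(\mathbb{R}_+))$.)

5.8 Let $g : \mathbb{R}_+ \to \mathbb{R}_+$ and $f : \mathbb{R}^2 \to \mathbb{R}_+$ be Borel measurable. Let $A = \{(x, y) : x \ge 0, 0 \le y \le g(x)\}$.
(a) Show that $A \in \mathcal{B}(\mathbb{R}^2)$.
(b) Show that

$$\int_{\mathbb{R}_+} \left( \int_{[0, g(x)]} f(x, y)\,m(dy) \right) m(dx) = \int f I_A\,dm^{(2)},$$

where $m^{(2)}$ is Lebesgue measure on $(\mathbb{R}^2, \mathcal{B}(\mathbb{R}^2))$ and $m(\cdot)$ is Lebesgue measure on $\mathbb{R}$.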
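The tail formula in Problem 5.7 (b) can be sanity-checked exactly for a random variable with finitely many nonnegative values. The sketch below is only an illustration of the identity, not a solution of the problem: the function names and the example distribution are assumptions, and the integer exponent $p$ is chosen so that `Fraction` arithmetic stays exact. It uses the fact that $t \mapsto P(X \ge t)$ is a step function, so $\int p t^{p-1} P(X \ge t)\,dt$ reduces to a finite sum of terms $P(X \ge x)\,(x^p - x_{\text{prev}}^p)$.

```python
from fractions import Fraction


def moment_direct(pmf, p):
    """E[X^p] computed as sum_x x^p * P(X = x)."""
    return sum(prob * Fraction(x) ** p for x, prob in pmf.items())


def moment_via_tail(pmf, p):
    """E[X^p] computed as the integral of p t^(p-1) P(X >= t) over [0, inf).

    Between consecutive support points prev < x the tail P(X >= t) is
    constant and equal to P(X >= x), and the integral of p t^(p-1) over
    (prev, x] is x^p - prev^p.
    """
    total = Fraction(0)
    prev = Fraction(0)
    for x in sorted(pmf):
        tail = sum(prob for y, prob in pmf.items() if y >= x)
        total += tail * (Fraction(x) ** p - prev ** p)
        prev = Fraction(x)
    return total


# Hypothetical example: X uniform on {1, 2, 3}, second moment.
pmf = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}
print(moment_direct(pmf, 2), moment_via_tail(pmf, 2))
```

Both routes give the same rational number, which is the discrete shadow of the Tonelli argument suggested in the hint.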

5.9 Problems


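(c) If $g$ is continuous and strictly increasing show that the two integrals in (b) equal

$$\int_{\mathbb{R}_+} \left( \int_{[g^{-1}(y), \infty)} f(x, y)\,m(dx) \right) m(dy).$$

5.9 (a) For $1 < A < \infty$, let

$$h_A(t) = \int_0^A e^{-xt} \sin x\,dx, \quad t \ge 0.$$

Use integration by parts to show that

$$|h_A(t)| \le \frac{1 + e^{-t}}{1 + t^2} \quad \text{and} \quad h_A(t) \to \frac{1}{1 + t^2} \quad \text{as } A \to \infty.$$

(b) Show using Fubini's theorem that for $0 < A < \infty$,

$$\int_0^{\infty} h_A(t)\,dt = \int_0^A \frac{\sin x}{x}\,dx.$$

(c) Conclude using the DCT that

$$\lim_{A\to\infty} \int_0^A \frac{\sin x}{x}\,dx = \int_0^{\infty} \frac{1}{1 + t^2}\,dt.$$

(d) Using Theorem 4.4.1 and the fact that $\phi(x) \equiv \tan x$ is a (1-1) strictly monotone map from $(0, \frac{\pi}{2})$ to $(0, \infty)$ having the inverse map $\psi(\cdot)$ with derivative $\psi'(t) \equiv \frac{1}{1 + t^2}$, $0 < t < \infty$, conclude that

$$\int_0^{\infty} \frac{1}{1 + t^2}\,dt = \frac{\pi}{2}.$$

5.10 Show that $I \equiv \int_0^{\infty} e^{-x^2/2}\,dx = \sqrt{\frac{\pi}{2}}$. (Hint: By Tonelli's theorem, $I^2 = \int_0^{\infty} \int_0^{\infty} e^{-\frac{(x^2 + y^2)}{2}}\,dx\,dy$. Now use the change of variables $x = r \cos\theta$, $y = r \sin\theta$, $0 < r < \infty$, $0 < \theta < \frac{\pi}{2}$.)

5.11 Let $\mu$ be a finite measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Let $f, g : \mathbb{R} \to \mathbb{R}_+$ be nondecreasing. Show that

$$\mu(\mathbb{R}) \int f g\,d\mu \ge \left( \int f\,d\mu \right) \left( \int g\,d\mu \right).$$

(Hint: Consider $h(x_1, x_2) = \left( f(x_1) - f(x_2) \right) \left( g(x_1) - g(x_2) \right)$ on $\mathbb{R}^2$ and integrate w.r.t. $\mu \times \mu$.)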
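The two classical values appearing in Problems 5.9 and 5.10, $\int_0^\infty \frac{\sin x}{x}\,dx = \frac{\pi}{2}$ and $\int_0^\infty e^{-x^2/2}\,dx = \sqrt{\pi/2}$, can be checked numerically before attempting the proofs. The sketch below is a rough numerical check under stated assumptions, not a derivation: it uses a midpoint rule, truncates the improper integrals at arbitrary cutoffs, and the function names are illustrative. The first integral converges only conditionally, so the truncated value differs from $\pi/2$ by a term of order $1/A$.

```python
import math


def midpoint_rule(func, a, b, n):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / n
    return h * sum(func(a + (i + 0.5) * h) for i in range(n))


def sine_integral(A, n=200_000):
    """Approximate the integral of sin(x)/x over [0, A].
    Midpoints never hit x = 0, so no special case is needed."""
    return midpoint_rule(lambda x: math.sin(x) / x, 0.0, A, n)


def half_gaussian_integral(upper=12.0, n=20_000):
    """Approximate the integral of exp(-x^2/2) over [0, upper];
    the neglected tail beyond 12 is far below double precision."""
    return midpoint_rule(lambda x: math.exp(-x * x / 2.0), 0.0, upper, n)


print(sine_integral(1000.0), math.pi / 2.0)
print(half_gaussian_integral(), math.sqrt(math.pi / 2.0))
```

The slow $1/A$ convergence of the first integral is the reason Problem 5.9 routes the argument through $h_A$ and the DCT rather than through absolute integrability.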


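5.12 Let $\mu$ and $\lambda$ be $\sigma$-finite measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Recall that

$$\nu(A) \equiv (\mu * \lambda)(A) \equiv \int\!\!\int I_A(x + y)\,d\mu\,d\lambda, \quad A \in \mathcal{B}(\mathbb{R}).$$

(a) Show that for any Borel measurable $f : \mathbb{R} \to \mathbb{R}_+$, $f(x + y)$ is $\langle \mathcal{B}(\mathbb{R}) \times \mathcal{B}(\mathbb{R}), \mathcal{B}(\mathbb{R}) \rangle$-measurable from $\mathbb{R} \times \mathbb{R} \to \mathbb{R}$ and

$$\int f\,d\nu = \int\!\!\int f(x + y)\,d\mu\,d\lambda.$$

(b) Show that $\nu(A) = \int_{\mathbb{R}} \mu(A - t)\,\lambda(dt)$, $A \in \mathcal{B}(\mathbb{R})$.
(c) Suppose there exist countable sets $B_\lambda$, $B_\mu$ such that $\mu(B_\mu^c) = 0 = \lambda(B_\lambda^c)$. Show that there exists a countable set $B_\nu$ such that $\nu(B_\nu^c) = 0$.
(d) Suppose that $\mu(\{x\}) = 0$ for all $x$ in $\mathbb{R}$. Show that $\nu(\{x\}) = 0$ for all $x$ in $\mathbb{R}$.
(e) Suppose that $\mu \ll m$ with $\frac{d\mu}{dm} = h$. Show that $\nu \ll m$ and find $\frac{d\nu}{dm}$ in terms of $h$, $\mu$ and $\lambda$.
(f) Suppose that $\mu \ll m$ and $\lambda \ll m$. Show that

$$\frac{d\nu}{dm} = \frac{d\mu}{dm} * \frac{d\lambda}{dm}.$$

5.13 (Convolution of cdfs). Let $F_i$, $i = 1, 2$ be cdfs on $\mathbb{R}$. Recall that a cdf $F$ on $\mathbb{R}$ is a function from $\mathbb{R} \to \mathbb{R}_+$ such that it is nondecreasing, right continuous with $F(x) \to 0$ as $x \to -\infty$ and $F(x) \to 1$ as $x \to \infty$.
(a) Show that $(F_1 * F_2)(x) \equiv \int_{\mathbb{R}} F_1(x - u)\,dF_2(u)$ is well defined and is a cdf on $\mathbb{R}$.
(b) Show also that $(F_1 * F_2)(\cdot) = (F_2 * F_1)(\cdot)$.
(c) Suppose $t \in \mathbb{R}$ is such that $\int e^{tx}\,dF_i(x) < \infty$ for $i = 1, 2$. Show that

$$\int e^{tx}\,d(F_1 * F_2)(x) = \left( \int e^{tx}\,dF_1(x) \right) \left( \int e^{tx}\,dF_2(x) \right).$$

5.14 Let $f, g \in L^1(\mathbb{R}, \mathcal{B}(\mathbb{R}), m)$.
(a) Show that if $f$ is continuous and bounded on $\mathbb{R}$, then so is $f * g$.
(b) Show that if $f$ is differentiable with a bounded derivative on $\mathbb{R}$, then so is $f * g$.
(Hint: Use the DCT.)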


5.15 Let $f \in L^1(\mathbb{R})$, $g \in L^p(\mathbb{R})$, $1 \le p \le \infty$.
(a) Show that if $1 \le p < \infty$, then for all $x$ in $\mathbb{R}$,

$$\left( \int |f(u) g(x - u)|\,du \right)^p \le \left( \int |g(x - u)|^p |f(u)|\,du \right) \left( \int |f|\,dm \right)^{p-1}$$

and hence that

$$(f * g)(x) \equiv \int f(u) g(x - u)\,du$$

is well defined a.e. $(m)$ and $\|f * g\|_p \le \|f\|_1 \|g\|_p$ with "=" holding iff either $f = 0$ a.e. or $g = 0$ a.e.
(b) Show that if $p = \infty$ then $\|f * g\|_\infty \le \|f\|_1 \|g\|_\infty$ and "=" can hold for some nonzero $f$ and $g$.
(Hint for (a): Use Jensen's inequality with probability measure $d\mu = \frac{|f|\,dm}{\|f\|_1}$ if $\|f\|_1 > 0$.)

5.16 Let $1 \le p \le \infty$ and $\frac{1}{q} = 1 - \frac{1}{p}$. Let $f \in L^p(\mathbb{R})$, $g \in L^q(\mathbb{R})$.
(a) Show that $f * g$ is well defined and uniformly continuous.
(b) Show that if $1 < p < \infty$, $\lim_{|x| \to \infty} (f * g)(x) = 0$.
(Hint: For (a) use Hölder's inequality and Lemma 5.7.7. For (b) approximate $g$ by simple functions.)

5.17 Let $g : \mathbb{R} \to \mathbb{R}$ be infinitely differentiable and be zero outside a bounded interval.
(a) Let $f : \mathbb{R} \to \mathbb{R}$ be Borel measurable and $\int_A |f|\,dm < \infty$ for all bounded intervals $A$ in $\mathbb{R}$. Show that $f * g$ is well defined and infinitely differentiable.
(b) Show that for any $f \in L^1(\mathbb{R})$, there exists a sequence $\{g_n\}_{n \ge 1}$ of such functions such that $f * g_n \to f$ in $L^1(\mathbb{R})$.

5.18 For $f \in L^1(\mathbb{R})$, let $f_\sigma = f * \phi_\sigma$ where $\phi_\sigma(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{x^2}{2\sigma^2}}$, $0 < \sigma < \infty$, $x \in \mathbb{R}$.


(a) Show that $f_\sigma$ is infinitely differentiable.
(b) Show that if $f$ is continuous and zero outside a bounded interval, then $f_\sigma$ converges to $f$ uniformly on $\mathbb{R}$ as $\sigma \to 0$.
(c) Show that if $f \in L^p(\mathbb{R})$, $1 \le p < \infty$, then $f_\sigma \to f$ in $L^p(\mathbb{R})$ as $\sigma \to 0$.

5.19 Let $f \in L^p(\mathbb{R})$, $1 \le p < \infty$ and $h(x) = \int_{A + x} f(u)\,du$ where $A$ is a bounded Borel set and $A + x \equiv \{y : y = a + x, a \in A\}$.
(a) Show that $h = f * g$ for some $g$ bounded and with bounded support.
(b) Show that $h(\cdot)$ is continuous and that $\lim_{|x| \to \infty} h(x) = 0$.
(Hint: For 1 1. (Hint: Consider $g(x) = \frac{1}{2\pi} \int e^{-\iota t x} \hat{f}(t)\,dt$.)

5.23 Let $f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}$, $x \in \mathbb{R}$.
(a) Show that $\hat{f}(\cdot)$ is real valued, differentiable and satisfies the ordinary differential equation $\hat{f}'(t) + t \hat{f}(t) = 0$, $t \in \mathbb{R}$ and $\hat{f}(0) = 1$. Find $\hat{f}(t)$.


(b) For $\mu$ in $\mathbb{R}$, $\sigma > 0$, let

$$f_{\mu,\sigma}(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2}.$$

Find $\hat{f}_{\mu,\sigma}$ and verify that for any $(\mu_i, \sigma_i)$, $i = 1, 2$,

$$f_{\mu_1,\sigma_1} * f_{\mu_2,\sigma_2} = f_{\mu_1 + \mu_2,\, \sqrt{\sigma_1^2 + \sigma_2^2}}.$$

(Hint for (b): Use Fourier transforms and uniqueness.)

5.24 (Rate of convergence of Fourier series). Consider the function $h(x) = \frac{\pi}{2} - |x|$ in $-\pi \le x \le \pi$.
(a) Find $\hat{h}(n) \equiv \frac{1}{2\pi} \int_{-\pi}^{\pi} e^{-\iota n x} h(x)\,dx$, $n = 0, \pm 1, \pm 2, \ldots$.
(b) Show that $\sum_{n=-\infty}^{\infty} |\hat{h}(n)| < \infty$.
(c) Show that $S_n(h, x) \equiv \sum_{j=-n}^{n} \hat{h}(j) e^{\iota j x}$ converges to $h(x)$ uniformly on $[-\pi, \pi]$.
(d) Verify that

$$\sup\{|S_n(h, x) - h(x)| : -\pi \le x \le \pi\} \le \frac{2}{\pi} \frac{1}{(n - 1)}, \quad n \ge 2$$

and

$$|S_n(h, 0) - h(0)| \ge \frac{2}{\pi} \frac{1}{(n + 2)}.$$

(Remark: This example shows that the Fourier series of a function can converge very slowly, such as in this example where the rate of decay is $\frac{1}{n}$.)

5.25 Using Fejer's theorem (Theorem 5.6.1) prove Weierstrass' theorem on uniform approximation of a continuous function on a bounded closed interval by a polynomial. (Hint: Show that on bounded intervals a trigonometric polynomial can be approximated uniformly by a polynomial using the power series representation of sine and cosine functions (see Section A.3).)

5.26 Evaluate

$$\lim_{n\to\infty} \int_{-n}^{n} \frac{\sin \lambda x}{x} e^{\iota t x}\,dx, \quad 0 < \lambda < \infty, \; 0 < t < \infty.$$

(Hint: For $0 < \lambda < \infty$, $\frac{\sin \lambda x}{x} \in L^2(\mathbb{R})$ and it is the Fourier transform of $f(t) = I_{[-\lambda,\lambda]}(\cdot)$. Now apply Plancherel theorem. Alternatively use Fubini's theorem and the fact that $\lim_{n\to\infty} \int_{-n}^{n} \frac{\sin y}{y}\,dy$ exists in $\mathbb{R}$.)
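The slow $1/n$ rate in Problem 5.24 can be observed numerically. The sketch below is illustrative only and makes one assumption that the problem asks the reader to derive: integrating by parts gives $\hat{h}(j) = \frac{2}{\pi j^2}$ for odd $j$ and $\hat{h}(j) = 0$ for even $j$ (including $j = 0$). The function names are hypothetical. Because $\hat{h}(-j) = \hat{h}(j)$ is real, the symmetric partial sum reduces to a cosine series.

```python
import math


def h(x):
    """h(x) = pi/2 - |x| on [-pi, pi]."""
    return math.pi / 2.0 - abs(x)


def h_hat(j):
    """Fourier coefficient of h (assumed closed form, see lead-in):
    2 / (pi j^2) for odd j, and 0 for even j."""
    if j % 2 == 0:
        return 0.0
    return 2.0 / (math.pi * j * j)


def partial_sum(n, x):
    """S_n(h, x) = sum_{j=-n}^{n} h_hat(j) e^{i j x}, written as a
    cosine series using h_hat(-j) = h_hat(j)."""
    total = h_hat(0)
    for j in range(1, n + 1):
        total += 2.0 * h_hat(j) * math.cos(j * x)
    return total


for n in (2, 10, 50):
    err = abs(partial_sum(n, 0.0) - h(0.0))
    lower = 2.0 / (math.pi * (n + 2))
    upper = 2.0 / (math.pi * (n - 1))
    print(n, lower, err, upper)
```

The printed error at $x = 0$ sits between the two bounds of part (d), so it decays like $1/n$ and no faster.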


5.27 Find an example of a function $f \in L^2(\mathbb{R}) \cap (L^1(\mathbb{R}))^c$ such that its Plancherel transform $\hat{f} \in L^1(\mathbb{R})$. (Hint: Examine Problem 5.26.)

6 Probability Spaces

6.1 Kolmogorov's probability model

Probability theory provides a mathematical model for random phenomena, i.e., those involving uncertainty. First one identifies the set $\Omega$ of possible outcomes of the (random) experiment associated with the phenomenon. This set $\Omega$ is called the sample space, and an individual element $\omega$ of $\Omega$ is called a sample point. Even though the outcome is not predictable ahead of time, one is interested in the "chances" of some particular statement being valid for the resulting outcome. The set of $\omega$'s for which a given statement is valid is called an event. Thus, an event is a subset of $\Omega$. One then identifies a class $\mathcal{F}$ of events, i.e., a class $\mathcal{F}$ of subsets of $\Omega$ (not necessarily all of $\mathcal{P}(\Omega)$, the power set of $\Omega$), and then a set function $P$ on $\mathcal{F}$ such that for $A$ in $\mathcal{F}$, $P(A)$ represents the "chance" of the event $A$ happening. Thus, it is reasonable to impose the following conditions on $\mathcal{F}$ and $P$:

(i) $A \in \mathcal{F} \Rightarrow A^c \in \mathcal{F}$ (i.e., if one can define the probability of an event $A$, then the probability of $A$ not happening is also well defined).

(ii) $A_1, A_2 \in \mathcal{F} \Rightarrow A_1 \cup A_2 \in \mathcal{F}$ (i.e., if one can define the probabilities of $A_1$ and $A_2$, then the probability of at least one of $A_1$ or $A_2$ happening is also well defined).

(iii) for all $A$ in $\mathcal{F}$, $0 \le P(A) \le 1$, $P(\emptyset) = 0$, and $P(\Omega) = 1$.


(iv) $A_1, A_2 \in \mathcal{F}$, $A_1 \cap A_2 = \emptyset \Rightarrow P(A_1 \cup A_2) = P(A_1) + P(A_2)$ (i.e., if $A_1$ and $A_2$ are mutually exclusive events, then the probability of at least one of the two happening is simply the sum of the probabilities).

The above conditions imply that $\mathcal{F}$ is an algebra and $P$ is a finitely additive set function. Next, as explained in Section 1.2, it is natural to require that $\mathcal{F}$ be closed under monotone increasing unions and $P$ be monotone continuous from below. That is, if $\{A_n\}_{n \ge 1}$ is a sequence of events in $\mathcal{F}$ such that $A_n$ implies $A_{n+1}$ (i.e., $A_n \subset A_{n+1}$) for all $n \ge 1$, then the probability of at least one of the $A_n$'s happening is well defined and is the limit of the corresponding probabilities. In other words, the following condition on $\mathcal{F}$ and $P$ must hold in addition to (i)–(iv) above:

(v) $A_n \in \mathcal{F}$, $A_n \subset A_{n+1}$ for all $n = 1, 2, \ldots$ $\Rightarrow$ $\bigcup_{n \ge 1} A_n \in \mathcal{F}$ and $P(A_n) \uparrow P\left(\bigcup_{n \ge 1} A_n\right)$.

As noted in Section 1.2, conditions (i)–(v) imply that $(\Omega, \mathcal{F}, P)$ is a measure space, i.e., $\mathcal{F}$ is a σ-algebra and $P$ is a measure on $\mathcal{F}$ with $P(\Omega) = 1$. That is, $(\Omega, \mathcal{F}, P)$ is a probability space. This is known as Kolmogorov’s probability model for random phenomena (see Kolmogorov (1956), Parthasarathy (2005)). Here are some examples.

Example 6.1.1: (Finite sample spaces). Let $\Omega \equiv \{\omega_1, \omega_2, \ldots, \omega_k\}$, $1 \le k < \infty$, $\mathcal{F} \equiv \mathcal{P}(\Omega)$, the power set of $\Omega$, and $P(A) = \sum_{i=1}^{k} p_i I_A(\omega_i)$, where $\{p_i\}_{i=1}^{k}$ are such that $p_i \ge 0$ and $\sum_{i=1}^{k} p_i = 1$. This is a probability model for random experiments with finitely many possible outcomes.

An important application of this probability model is finite population sampling. Let $\{U_1, U_2, \ldots, U_N\}$ be a finite population of $N$ units or objects. These could be individuals in a city, counties in a state, etc. In a typical sample survey procedure, one chooses a subset of size $n$ ($1 \le n \le N$) from this population. Let $\Omega$ denote the collection of all possible subsets of size $n$. Here $k = \binom{N}{n}$, each $\omega_i$ is a sample of size $n$, and $p_i$ is the selection probability of $\omega_i$. The assignment of $\{p_i\}_{i=1}^{k}$ is determined by a given sampling scheme. For example, in simple random sampling without replacement, $p_i = \frac{1}{k}$ for $i = 1, 2, \ldots, k$. Other examples include coin tossing, rolling of dice, bridge hands, and acceptance sampling in statistical quality control (Feller (1968)).

Example 6.1.2: (Countably infinite sample spaces). Let $\Omega \equiv \{\omega_1, \omega_2, \ldots\}$ be a countable set, $\mathcal{F} = \mathcal{P}(\Omega)$, and $P(A) \equiv \sum_{i=1}^{\infty} p_i I_A(\omega_i)$, where $\{p_i\}_{i=1}^{\infty}$ satisfy $p_i \ge 0$ and $\sum_{i=1}^{\infty} p_i = 1$. It is easy to verify that $(\Omega, \mathcal{F}, P)$ is a probability space. This is a probability model for random experiments with a countably infinite number of outcomes. For example, the experiment of tossing a coin until a “head” is produced leads to such a probability space.
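The finite population sampling model of Example 6.1.1 can be made concrete with a small sketch. The following Python fragment is only an illustration: the unit labels and the helper names `srs_probability_space` and `prob` are hypothetical, and simple random sampling without replacement is assumed, so every sample gets probability $1/\binom{N}{n}$.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

def srs_probability_space(units, n):
    """Return (omega, p): all size-n subsets of `units` and their
    selection probabilities under simple random sampling without replacement."""
    omega = [frozenset(s) for s in combinations(units, n)]
    k = len(omega)                      # k = C(N, n)
    p = {w: Fraction(1, k) for w in omega}
    return omega, p

def prob(event, p):
    """P(A) = sum of p_i over the sample points w_i for which event(w_i) holds."""
    return sum((pi for w, pi in p.items() if event(w)), Fraction(0))

# Hypothetical population of N = 5 units, samples of size n = 2.
units = ["U1", "U2", "U3", "U4", "U5"]
omega, p = srs_probability_space(units, 2)

# Probability that unit U1 is included in the sample.
inclusion_U1 = prob(lambda w: "U1" in w, p)
```

Here the event "U1 is selected" is a subset of $\Omega$ described by a predicate, and its probability comes out as $n/N$, which agrees with the usual inclusion probability for this sampling scheme.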


Example 6.1.3: (Uncountable sample spaces).

(a) (Random variables). Let $\Omega = \mathbb{R}$, $\mathcal{F} = \mathcal{B}(\mathbb{R})$, $P = \mu_F$, the Lebesgue-Stieltjes measure corresponding to a cdf $F$, i.e., corresponding to a function $F : \mathbb{R} \to \mathbb{R}$ that is nondecreasing, right continuous, and satisfies $F(-\infty) = 0$, $F(+\infty) = 1$. See Section 1.3. This serves as a model for a single random variable $X$.

(b) (Random vectors). Let $\Omega = \mathbb{R}^k$, $\mathcal{F} = \mathcal{B}(\mathbb{R}^k)$, $P = \mu_F$, the Lebesgue-Stieltjes measure corresponding to a (multidimensional) cdf $F$ on $\mathbb{R}^k$, where $k \in \mathbb{N}$. See Section 1.3. This is a model for a random vector $(X_1, X_2, \ldots, X_k)$.

(c) (Random sequences). Let $\Omega = \mathbb{R}^\infty \equiv \mathbb{R} \times \mathbb{R} \times \cdots$ be the set of all sequences $\{x_n\}_{n \ge 1}$ of real numbers. Let $\mathcal{C}$ be the class of all finite dimensional sets of the form $A \times \mathbb{R} \times \mathbb{R} \times \cdots$, where $A \in \mathcal{B}(\mathbb{R}^k)$ for some $1 \le k < \infty$. Let $\mathcal{F}$ be the σ-algebra generated by $\mathcal{C}$. For each $1 \le k < \infty$, let $\mu_k$ be a probability measure on $\mathcal{B}(\mathbb{R}^k)$ such that $\mu_{k+1}(A \times \mathbb{R}) = \mu_k(A)$ for all $A \in \mathcal{B}(\mathbb{R}^k)$. Then there exists a probability measure $\mu$ on $\mathcal{F}$ such that $\mu(A \times \mathbb{R} \times \mathbb{R} \times \cdots) = \mu_k(A)$ if $A \in \mathcal{B}(\mathbb{R}^k)$. (This will be shown later as a special case of Kolmogorov’s consistency theorem in Section 6.3.) This will be a model for a sequence $\{X_n\}_{n \ge 1}$ of random variables such that for each $k$, $1 \le k < \infty$, the distribution of $(X_1, X_2, \ldots, X_k)$ is $\mu_k$.

6.2 Random variables and random vectors

Recall the following definitions introduced earlier in Sections 2.1 and 2.2.

Definition 6.2.1: Let $(\Omega, \mathcal{F}, P)$ be a probability space and $X : \Omega \to \mathbb{R}$ be $\langle \mathcal{F}, \mathcal{B}(\mathbb{R}) \rangle$-measurable, that is, $X^{-1}(A) \in \mathcal{F}$ for all $A \in \mathcal{B}(\mathbb{R})$. Then, $X$ is called a random variable on $(\Omega, \mathcal{F}, P)$.

Recall that $X : \Omega \to \mathbb{R}$ is $\langle \mathcal{F}, \mathcal{B}(\mathbb{R}) \rangle$-measurable iff for all $x \in \mathbb{R}$, $\{\omega : X(\omega) \le x\} \in \mathcal{F}$.

Definition 6.2.2: Let $X$ be a random variable on $(\Omega, \mathcal{F}, P)$. Let

$F_X(x) \equiv P(\{\omega : X(\omega) \le x\}), \quad x \in \mathbb{R}.$  (2.1)

Then $F_X(\cdot)$ is called the cumulative distribution function (cdf) of $X$.

Definition 6.2.3: Let $X$ be a random variable on $(\Omega, \mathcal{F}, P)$. Let

$P_X(A) \equiv P(X^{-1}(A))$ for all $A \in \mathcal{B}(\mathbb{R})$.  (2.2)

Then the probability measure $P_X$ is called the probability distribution of $X$.


Note that $P_X$ is the measure induced by $X$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ under $P$ and that the Lebesgue-Stieltjes measure $\mu_{F_X}$ on $\mathcal{B}(\mathbb{R})$ corresponding to the cdf $F_X$ of $X$ is the same as $P_X$.

Definition 6.2.4: Let $(\Omega, \mathcal{F}, P)$ be a probability space, $k \in \mathbb{N}$, and $X : \Omega \to \mathbb{R}^k$ be $\langle \mathcal{F}, \mathcal{B}(\mathbb{R}^k) \rangle$-measurable, i.e., $X^{-1}(A) \in \mathcal{F}$ for all $A \in \mathcal{B}(\mathbb{R}^k)$. Then $X$ is called a ($k$-dimensional) random vector on $(\Omega, \mathcal{F}, P)$.

Let $X = (X_1, X_2, \ldots, X_k)$ be a random vector with components $X_i$, $i = 1, 2, \ldots, k$. Then each $X_i$ is a random variable on $(\Omega, \mathcal{F}, P)$. This follows from the fact that the coordinate projection maps from $\mathbb{R}^k$ to $\mathbb{R}$, given by $\pi_i(x_1, x_2, \ldots, x_k) \equiv x_i$, $1 \le i \le k$, are continuous and hence Borel measurable. Conversely, if for $1 \le i \le k$, $X_i$ is a random variable on $(\Omega, \mathcal{F}, P)$, then $X = (X_1, X_2, \ldots, X_k)$ is a random vector (cf. Proposition 2.1.3).

Definition 6.2.5: Let $X$ be a $k$-dimensional random vector on $(\Omega, \mathcal{F}, P)$ for some $k \in \mathbb{N}$. Let

$F_X(x) \equiv P(\{\omega : X_1(\omega) \le x_1, X_2(\omega) \le x_2, \ldots, X_k(\omega) \le x_k\})$  (2.3)

for $x = (x_1, x_2, \ldots, x_k) \in \mathbb{R}^k$. Then $F_X(\cdot)$ is called the joint cumulative distribution function (joint cdf) of the random vector $X$.

Definition 6.2.6: Let $X$ be a $k$-dimensional random vector on $(\Omega, \mathcal{F}, P)$ for some $k \in \mathbb{N}$. Let

$P_X(A) = P(X^{-1}(A))$ for all $A \in \mathcal{B}(\mathbb{R}^k)$.  (2.4)

The probability measure $P_X$ is called the (joint) probability distribution of $X$.

As in the case $k = 1$, the Lebesgue-Stieltjes measure $\mu_{F_X}$ on $\mathcal{B}(\mathbb{R}^k)$ corresponding to the joint cdf $F_X$ is the same as $P_X$.

Next, let $X = (X_1, X_2, \ldots, X_k)$ be a random vector. Let $Y = (X_{i_1}, X_{i_2}, \ldots, X_{i_r})$ for some $1 \le i_1 < i_2 < \cdots < i_r \le k$ and some $1 \le r \le k$. Then, $Y$ is also a random vector. Further, the joint cdf of $Y$ can be obtained from $F_X$ by setting the components $x_j$, $j \notin \{i_1, i_2, \ldots, i_r\}$, equal to $\infty$. Similarly, the probability distribution $P_Y$ can be obtained from $P_X$ as an induced measure from the projection map $\pi(x) = (x_{i_1}, x_{i_2}, \ldots, x_{i_r})$, $x \in \mathbb{R}^k$. For example, if $(i_1, i_2, \ldots, i_r) = (1, 2, \ldots, r)$, $r \le k$, then

$F_Y(y_1, \ldots, y_r) = F_X(y_1, \ldots, y_r, \infty, \ldots, \infty), \quad (y_1, \ldots, y_r) \in \mathbb{R}^r,$

and

$P_Y(A) = P_X(A \times \mathbb{R}^{k-r}), \quad A \in \mathcal{B}(\mathbb{R}^r).$
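The two ways of passing from the distribution of $X$ to that of a subvector $Y$ can be checked numerically on a discrete example. The sketch below assumes a small, made-up joint pmf for $X = (X_1, X_2, X_3)$; the names `joint`, `induced`, and `cdf` are illustrative, and coordinates are indexed from 0 as is usual in Python.

```python
from fractions import Fraction
from collections import defaultdict

# Hypothetical joint pmf of X = (X1, X2, X3) on a small finite support.
joint = {
    (0, 0, 1): Fraction(1, 8),
    (0, 1, 0): Fraction(1, 4),
    (1, 0, 0): Fraction(1, 8),
    (1, 1, 1): Fraction(1, 2),
}

def induced(pmf, idx):
    """Distribution of Y = (X_i : i in idx): the measure induced by the
    projection map x -> (x_i : i in idx)."""
    out = defaultdict(Fraction)
    for x, px in pmf.items():
        out[tuple(x[i] for i in idx)] += px
    return dict(out)

def cdf(pmf, point):
    """Joint cdf at `point`; float('inf') is allowed as a coordinate."""
    return sum((px for x, px in pmf.items()
                if all(xi <= ti for xi, ti in zip(x, point))), Fraction(0))

INF = float("inf")
p_y = induced(joint, (0, 1))            # Y = (X1, X2)

# F_Y(y1, y2) computed from P_Y and from F_X(y1, y2, infinity).
lhs = cdf(p_y, (0, 1))
rhs = cdf(joint, (0, 1, INF))
```

Both routes give the same value, in line with the identity $F_Y(y_1, y_2) = F_X(y_1, y_2, \infty)$.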


Definition 6.2.7: Let $X = (X_1, X_2, \ldots, X_k)$ be a random vector on $(\Omega, \mathcal{F}, P)$. Then, for each $i = 1, \ldots, k$, the cdf $F_{X_i}$ and the probability distribution $P_{X_i}$ of the random variable $X_i$ are called the marginal cdf and the marginal probability distribution of $X_i$, respectively.

It is clear that the distribution of $X$ determines the marginal distribution $P_{X_i}$ of $X_i$ for all $i = 1, 2, \ldots, k$. However, the marginal distributions $\{P_{X_i} : i = 1, 2, \ldots, k\}$ do not uniquely determine the joint distribution $P_X$ without additional conditions, such as independence (see Problem 6.1).

Definition 6.2.8: Let $X$ be a random variable on $(\Omega, \mathcal{F}, P)$. The expected value of $X$, denoted by $EX$ or $E(X)$, is defined as

$EX = \int_\Omega X \, dP,$  (2.5)

provided the integral is well defined, that is, at least one of the two quantities $\int X^+ \, dP$ and $\int X^- \, dP$ is finite.

If $X$ is a random variable on $(\Omega, \mathcal{F}, P)$ and $h : \mathbb{R} \to \mathbb{R}$ is Borel measurable, then $Y = h(X)$ is also a random variable on $(\Omega, \mathcal{F}, P)$. The expected value of $Y$ may be computed as follows.

Proposition 6.2.1: (The change of variable formula). Let $X$ be a random variable on $(\Omega, \mathcal{F}, P)$ and $h : \mathbb{R} \to \mathbb{R}$ be Borel measurable. Let $Y = h(X)$. Then

(i) $\int_\Omega |Y| \, dP = \int_{\mathbb{R}} |h(x)| \, P_X(dx) = \int_{\mathbb{R}} |y| \, P_Y(dy).$

(ii) If $\int_\Omega |Y| \, dP < \infty$, then

$\int_\Omega Y \, dP = \int_{\mathbb{R}} h(x) \, P_X(dx) = \int_{\mathbb{R}} y \, P_Y(dy).$  (2.6)

Proof: If $h = I_A$ for $A$ in $\mathcal{B}(\mathbb{R})$, the proposition follows from the definition of $P_X$. By linearity, this extends to a nonnegative simple function $h$ and, by the MCT, to any nonnegative measurable $h$, and hence to any measurable $h$. □

Remark 6.2.1: Proposition 6.2.1 shows that the expectation of $Y$ can be computed in three different ways, i.e., by integrating $Y$ with respect to $P$ on $\Omega$, or by integrating $h(x)$ on $\mathbb{R}$ with respect to the probability distribution $P_X$ of the random variable $X$, or by integrating $y$ on $\mathbb{R}$ with respect to the probability distribution $P_Y$ of the random variable $Y$.

Remark 6.2.2: If the function $h$ is nonnegative, then the relation $EY = \int_{\mathbb{R}} h(x) \, P_X(dx)$ is valid even if $EY = \infty$.
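The three ways of computing $EY$ described in Remark 6.2.1 can be compared directly on a finite sample space, where every integral is a finite sum. The following sketch makes several assumptions for illustration only: $\Omega = \{H, T\}^3$, $P(H) = 1/3$, $X$ is the number of heads, and $h(x) = (x - 1)^2$.

```python
from fractions import Fraction
from itertools import product
from collections import defaultdict

p = Fraction(1, 3)                       # assumed P(H); any value in (0, 1) works
omega = list(product("HT", repeat=3))    # sample space {H, T}^3
P = {w: p ** w.count("H") * (1 - p) ** w.count("T") for w in omega}

X = lambda w: w.count("H")               # random variable on Omega
h = lambda x: (x - 1) ** 2               # Borel function h (illustrative choice)
Y = lambda w: h(X(w))                    # Y = h(X)

def induced(Z):
    """Distribution of Z under P, i.e. P_Z({z}) = P(Z = z)."""
    dist = defaultdict(Fraction)
    for w, pw in P.items():
        dist[Z(w)] += pw
    return dict(dist)

P_X, P_Y = induced(X), induced(Y)

e_omega = sum(Y(w) * pw for w, pw in P.items())      # integral of Y dP over Omega
e_px = sum(h(x) * px for x, px in P_X.items())       # integral of h(x) P_X(dx)
e_py = sum(y * py for y, py in P_Y.items())          # integral of y P_Y(dy)
```

All three sums agree, as (2.6) asserts. In practice the middle form is usually the convenient one, since it needs only the distribution of $X$.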


Definition 6.2.9: For any positive integer $n$, the $n$th moment $\mu_n$ of a random variable $X$ is defined by

$\mu_n \equiv EX^n,$  (2.7)

provided the expectation is well defined.

Definition 6.2.10: The variance of a random variable $X$ is defined as $\mathrm{Var}(X) = E(X - EX)^2$, provided $EX^2 < \infty$.

Definition 6.2.11: The moment generating function (mgf) of a random variable $X$ is defined by

$M_X(t) \equiv E(e^{tX})$ for all $t \in \mathbb{R}$.  (2.8)

Since $e^{tX}$ is always nonnegative, $E(e^{tX})$ is well defined but could be infinity. Proposition 6.2.1 gives a way of computing the moments and the mgf of $X$ without explicitly computing the distribution of $X^k$ or $e^{tX}$. As an illustration, consider the case of a random variable $X$ defined on the probability space $(\Omega, \mathcal{F}, P)$ with $\Omega = \{H, T\}^n$, $n \in \mathbb{N}$, $\mathcal{F}$ = the power set of $\Omega$, and $P$ = the probability distribution defined by

$P(\{\omega\}) = p^{X(\omega)} q^{n - X(\omega)},$

where $0 < p < 1$, $q = 1 - p$, and $X(\omega)$ = the number of $H$’s in $\omega$. By the change of variable formula, the mgf of $X$ is given by

$M_X(t) \equiv \int e^{tx} \, P_X(dx) = \sum_{r=0}^{n} e^{tr} \binom{n}{r} p^r q^{n-r} = (p e^t + q)^n,$

since $P_X$, the probability distribution of $X$, is supported on $\{0, 1, 2, \ldots, n\}$ with $P_X(\{r\}) = \binom{n}{r} p^r q^{n-r}$. Note that $P_X$ is the Binomial $(n, p)$ distribution. Here, $M_X(t)$ is computed using the distribution of $X$, i.e., using the middle term in (2.6) only.

The connection between the mgf $M_X(\cdot)$ and the moments of a random variable $X$ is given in the following propositions.

Proposition 6.2.2: Let $X$ be a nonnegative random variable and $t \ge 0$. Then

$M_X(t) \equiv E(e^{tX}) = \sum_{n=0}^{\infty} \frac{t^n \mu_n}{n!},$  (2.9)

where $\mu_n$ is as in (2.7).

Proof: Since $e^{tX} = \sum_{n=0}^{\infty} \frac{t^n X^n}{n!}$ and $X$ is nonnegative, (2.9) follows from the MCT. □
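The binomial mgf computation above can be checked numerically. The sketch below evaluates $E(e^{tX})$ in three ways: as a sum over all of $\Omega = \{H, T\}^n$, as a sum against $P_X$, and from the closed form $(pe^t + q)^n$. The function names and the particular values of $n$, $p$, and $t$ are assumptions made for the illustration.

```python
from itertools import product
from math import comb, exp, isclose

def mgf_on_omega(n, p, t):
    """E(e^{tX}) as a sum over every outcome in {H, T}^n."""
    q = 1 - p
    total = 0.0
    for w in product("HT", repeat=n):
        x = w.count("H")                         # X(w) = number of heads
        total += exp(t * x) * p**x * q**(n - x)  # e^{tX(w)} P({w})
    return total

def mgf_via_distribution(n, p, t):
    """E(e^{tX}) as a sum against P_X, the Binomial(n, p) distribution."""
    q = 1 - p
    return sum(exp(t * r) * comb(n, r) * p**r * q**(n - r) for r in range(n + 1))

def mgf_closed_form(n, p, t):
    """The closed form (p e^t + q)^n."""
    return (p * exp(t) + (1 - p)) ** n

n, p, t = 6, 0.3, 0.7                            # illustrative values
values = (mgf_on_omega(n, p, t),
          mgf_via_distribution(n, p, t),
          mgf_closed_form(n, p, t))
```

The first sum has $2^n$ terms while the second has only $n + 1$, which is the practical content of using the middle term of (2.6).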


Proposition 6.2.3: Let $X$ be a random variable and let $M_X(t)$ be finite for all $|t| < \epsilon$, for some $\epsilon > 0$. Then

(i) $E|X|^n < \infty$ for all $n \ge 1$,

(ii) $M_X(t) = \sum_{n=0}^{\infty} \frac{t^n \mu_n}{n!}$ for all $|t| < \epsilon$,

(iii) $M_X(\cdot)$ is infinitely differentiable on $(-\epsilon, +\epsilon)$ and for $r \in \mathbb{N}$, the $r$th derivative of $M_X(\cdot)$ is

$M_X^{(r)}(t) = \sum_{n=0}^{\infty} \frac{t^n \mu_{n+r}}{n!} = E(e^{tX} X^r)$ for $|t| < \epsilon$.  (2.10)

In particular,

$M_X^{(r)}(0) = \mu_r = EX^r.$  (2.11)

Proof: Since $M_X(t) < \infty$ for all $|t| < \epsilon$,

$E(e^{|tX|}) \le E(e^{tX}) + E(e^{-tX}) < \infty$ for $|t| < \epsilon$.  (2.12)

Also, $e^{|tX|} \ge \frac{|t|^n |X|^n}{n!}$ for all $n \in \mathbb{N}$ and hence, (i) follows by choosing a $t$ in $(0, \epsilon)$. Next note that $\left| \sum_{j=0}^{n} \frac{(tx)^j}{j!} \right| \le e^{|tx|}$ for all $x$ in $\mathbb{R}$ and all $n \in \mathbb{N}$. Hence, by (2.12) and the DCT, (ii) follows. Turning to (iii), since $M_X(\cdot)$ admits a power series expansion convergent in $|t| < \epsilon$, it is infinitely differentiable in $|t| < \epsilon$ and the derivatives of $M_X(\cdot)$ can be found by term-by-term differentiation of the power series (see Rudin (1976), Chapter 9). Hence,

$M_X^{(r)}(t) = \frac{d^r}{dt^r} \sum_{n=0}^{\infty} \frac{t^n \mu_n}{n!} = \sum_{n=0}^{\infty} \frac{\mu_n}{n!} \frac{d^r (t^n)}{dt^r} = \sum_{n=r}^{\infty} \mu_n \frac{t^{n-r}}{(n-r)!} = \sum_{n=0}^{\infty} \frac{t^n \mu_{n+r}}{n!}.$

The verification of the second equality in (2.10) is left as an exercise (see Problem 6.4). □

Remark 6.2.3: If the mgf $M_X(\cdot)$ is finite for $|t| < \epsilon$ for some $\epsilon > 0$, then by part (ii) of the above proposition, $M_X(t)$ has a power series expansion


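in $t$ around 0 and $\frac{\mu_n}{n!}$ is simply the coefficient of $t^n$. For example, if $X$ has a $N(0, 1)$ distribution, then for all $t \in \mathbb{R}$,

$M_X(t) = \int_{-\infty}^{\infty} e^{tx} \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \, dx = e^{t^2/2} = \sum_{k=0}^{\infty} \frac{(t^2)^k}{k!} \frac{1}{2^k}.$  (2.13)

Thus,

$\mu_n = \begin{cases} 0 & \text{if } n \text{ is odd}, \\ \dfrac{(2k)!}{k! \, 2^k} & \text{if } n = 2k, \ k = 1, 2, \ldots \end{cases}$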
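As a sanity check on the moments read off from (2.13), the following sketch compares them with an independent route. The recursion $\mu_n = (n - 1)\mu_{n-2}$ used here is not taken from the text; it is the standard identity obtained by integrating by parts against the $N(0,1)$ density, and is included only as a cross-check. The function names are illustrative.

```python
from fractions import Fraction
from math import factorial

def moment_from_series(n):
    """mu_n = n! * (coefficient of t^n in e^{t^2/2} = sum_k t^{2k} / (k! 2^k))."""
    if n % 2 == 1:
        return Fraction(0)               # no odd powers of t appear in the series
    k = n // 2
    return Fraction(factorial(n), factorial(k) * 2 ** k)

def moment_by_recursion(n):
    """Cross-check (assumption, not from the text): mu_n = (n - 1) * mu_{n-2},
    with mu_0 = 1 and mu_1 = 0."""
    if n == 0:
        return Fraction(1)
    if n == 1:
        return Fraction(0)
    return (n - 1) * moment_by_recursion(n - 2)

first_moments = [moment_from_series(n) for n in range(1, 9)]
```

The even moments produced this way are 1, 3, 15, 105 for $n = 2, 4, 6, 8$, and both routes agree for every $n$ tested.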

Remark 6.2.4: If $M_X(t)$ is finite for $|t| < \epsilon$ for some $\epsilon > 0$, then all the moments $\{\mu_n\}_{n \ge 1}$ of $X$ are determined and also its probability distribution. However, in general, the sequence $\{\mu_n\}_{n \ge 1}$ of moments of $X$ need not determine the distribution of $X$ uniquely.

Table 6.2.1 gives the mean, variance, and the mgf of a number of standard probability distributions on the real line.

For future reference, some of the inequalities established in Section 3.1 are specialized for random variables and collected below without proofs.

Proposition 6.2.4: (Markov’s inequality). Let $X$ be a random variable on $(\Omega, \mathcal{F}, P)$. Then for any $\phi : \mathbb{R}_+ \to \mathbb{R}_+$ nondecreasing and any $t > 0$ with $\phi(t) > 0$,

$P(|X| \ge t) \le \frac{E\,\phi(|X|)}{\phi(t)}.$  (2.14)

In particular,

(i) for $r > 0$, $t > 0$,

$P(X \ge t) \le P(|X| \ge t) \le \frac{E|X|^r}{t^r},$  (2.15)

(ii) for any $t \ge 0$,

$P(|X| \ge t) \le \frac{E(e^{\theta |X|})}{e^{\theta t}}$

for any $\theta > 0$ and hence

$P(|X| \ge t) \le \inf_{\theta > 0} \frac{E(e^{\theta |X|})}{e^{\theta t}}.$  (2.16)

Proposition 6.2.5: (Chebychev’s inequality). Let $X$ be a random variable with $EX^2 < \infty$, $EX = \mu$, $\mathrm{Var}(X) = \sigma^2$. Then for any $k > 0$,

$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}.$  (2.17)
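To see how tight (or loose) the bounds (2.15) and (2.17) can be, one may compare them with exact tail probabilities for a distribution with finite support. The sketch below assumes a Binomial(20, 0.3) random variable purely as an example; the helper names `tail`, `markov_bound`, and `chebychev` are hypothetical.

```python
from math import comb, sqrt

n, p = 20, 0.3                                   # assumed Binomial(n, p) example
pmf = {r: comb(n, r) * p**r * (1 - p)**(n - r) for r in range(n + 1)}
mu = sum(r * w for r, w in pmf.items())
sigma = sqrt(sum((r - mu) ** 2 * w for r, w in pmf.items()))

def tail(t):
    """Exact P(|X| >= t)."""
    return sum(w for r, w in pmf.items() if abs(r) >= t)

def markov_bound(t, r):
    """Right side of (2.15): E|X|^r / t^r."""
    return sum(abs(x) ** r * w for x, w in pmf.items()) / t ** r

def chebychev(k):
    """Return (exact P(|X - mu| >= k*sigma), the bound 1/k^2) as in (2.17)."""
    exact = sum(w for r, w in pmf.items() if abs(r - mu) >= k * sigma)
    return exact, 1 / k ** 2

exact_2sigma, bound_2sigma = chebychev(2.0)
```

For this example the exact probabilities lie well below the bounds, which reflects that (2.15) and (2.17) hold for every distribution with the required moments and are not tailored to any particular one.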


TABLE 6.2.1. Mean, variance, mgf of the distributions listed in Tables 4.6.1 and 4.6.2.

Distribution    Mean    Variance    mgf M(t)

Bernoulli (p), 0 …

… 0, then equality holds in (2.20) iff $P(Y = aX + b) = 1$ for some constants $a, b$ in $\mathbb{R}$ (Problem 6.6).

Proposition 6.2.9: (Minkowski’s inequality). Let $X$ and $Y$ be random variables on $(\Omega, \mathcal{F}, P)$ such that $E|X|^p < \infty$, $E|Y|^p < \infty$ for some $1 \le p < \infty$. Then

$(E|X + Y|^p)^{1/p} \le (E|X|^p)^{1/p} + (E|Y|^p)^{1/p}.$  (2.21)

Definition 6.2.12: (Product moments of random vectors). Let $X = (X_1, X_2, \ldots, X_k)$ be a random vector. The product moment of order $r = (r_1, r_2, \ldots, r_k)$, with $r_i$ being a nonnegative integer for each $i$, is defined as

$\mu_r \equiv \mu_{r_1, r_2, \ldots, r_k} \equiv E(X_1^{r_1} X_2^{r_2} \cdots X_k^{r_k}),$  (2.22)

provided $E|X_1^{r_1} \cdots X_k^{r_k}| < \infty$.

The joint moment generating function (joint mgf) of a random vector $X = (X_1, X_2, \ldots, X_k)$ is defined by

$M_{X_1, \ldots, X_k}(t_1, t_2, \ldots, t_k) \equiv E(e^{t_1 X_1 + t_2 X_2 + \cdots + t_k X_k}),$  (2.23)

for all $t_1, t_2, \ldots, t_k$ in $\mathbb{R}$.

As in the case of a random variable, if the joint mgf $M_{X_1, X_2, \ldots, X_k}(t_1, \ldots, t_k)$ is finite for all $(t_1, t_2, \ldots, t_k)$ with $|t_i| < \epsilon$ for all $i = 1, 2, \ldots, k$ for some $\epsilon > 0$, then an analog of Proposition 6.2.3 holds. For example, the following assertions are valid (cf. Problem 6.4):

(i) $E|X_i|^n < \infty$ for all $i = 1, 2, \ldots, k$ and $n \ge 1$.

(ii) For $t = (t_1, \ldots, t_k) \in \mathbb{R}^k$ and $r = (r_1, r_2, \ldots, r_k) \in \mathbb{Z}_+^k$, let

$t^r = t_1^{r_1} t_2^{r_2} \cdots t_k^{r_k}$, $r! = r_1! \, r_2! \cdots r_k!$, and $\mu_r = E X_1^{r_1} X_2^{r_2} \cdots X_k^{r_k}.$  (2.24)

Then,

$M_X(t_1, \ldots, t_k) = \sum_{r \in \mathbb{Z}_+^k} \frac{t^r}{r!} \mu_r$  (2.25)

for all $t = (t_1, t_2, \ldots, t_k) \in (-\epsilon, +\epsilon)^k$.

(iii) For any $r = (r_1, \ldots, r_k) \in \mathbb{Z}_+^k$,

$\left. \frac{d^r}{dt^r} M_X(t) \right|_{t=0} = \mu_r,$  (2.26)

where $\frac{d^r}{dt^r} = \frac{\partial^{r_1}}{\partial t_1^{r_1}} \frac{\partial^{r_2}}{\partial t_2^{r_2}} \cdots \frac{\partial^{r_k}}{\partial t_k^{r_k}}$.

6.3 Kolmogorov’s consistency theorem

In the previous section, the case of a single random variable and that of a finite dimensional random vector were discussed. The goal of this section is to discuss infinite families of random variables such as a random sequence $\{X_n\}_{n \ge 1}$ or a random function $\{X(t) : 0 \le t < T\}$, $0 \le T \le \infty$. For example, $X_n$ could be the population size of the $n$th generation of a randomly evolving biological population, and $X(t)$ could be the temperature at time $t$ in a chemical reaction over a period $[0, T]$. An example from the modeling of spatial random phenomena is a collection $\{X(s) : s \in S\}$ of random variables $X(s)$, where $S$ is a specified region such as the U.S., and $X(s)$ is the amount of rainfall at location $s \in S$ during a specified month.

Let $(\Omega, \mathcal{F}, P)$ be a probability space and $\{X_\alpha : \alpha \in A\}$ be a collection of random variables defined on $(\Omega, \mathcal{F}, P)$, where $A$ is a nonempty set. Then for any $(\alpha_1, \alpha_2, \ldots, \alpha_k) \in A^k$, $1 \le k < \infty$, the random vector $(X_{\alpha_1}, X_{\alpha_2}, \ldots, X_{\alpha_k})$ has a joint probability distribution $\mu_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}$ over $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$.

Definition 6.3.1: A (real valued) stochastic process with index set $A$ is a family $\{X_\alpha : \alpha \in A\}$ of random variables defined on a probability space $(\Omega, \mathcal{F}, P)$.

Example 6.3.1: (Examples of stochastic processes). Let $\Omega = [0, 1]$, $\mathcal{F} = \mathcal{B}([0, 1])$, $P$ = the Lebesgue measure on $[0, 1]$. Let $A_1 = \{1, 2, 3, \ldots\}$, $A_2 = [0, T]$, $0 < T < \infty$. For $\omega \in \Omega$, $n \in A_1$, $t \in A_2$, let

$X_n(\omega) = \sin 2\pi n \omega$
$Y_t(\omega) = \sin 2\pi t \omega$
$Z_n(\omega) = n$th digit in the decimal expansion of $\omega$
$V_{n,t}(\omega) = X_n^2(\omega) + Y_t^2(\omega).$

Then $\{X_n : n \in A_1\}$, $\{Z_n : n \in A_1\}$, $\{V_{n,t} : (n, t) \in A_1 \times A_2\}$, $\{Y_t : t \in A_2\}$ are all stochastic processes.
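The processes of Example 6.3.1 are all deterministic functions of a single sample point $\omega \in [0, 1]$, which a short sketch makes explicit. Everything below is illustrative: the function names mirror the example, `sample_path` is a hypothetical helper, and $\omega$ is passed as a `Fraction` so that the decimal digit $Z_n(\omega)$ is computed exactly (for numbers with two decimal expansions, the terminating one is used).

```python
from fractions import Fraction
from math import sin, pi, floor

def X(n, w):
    """X_n(w) = sin(2 pi n w)."""
    return sin(2 * pi * n * float(w))

def Y(t, w):
    """Y_t(w) = sin(2 pi t w)."""
    return sin(2 * pi * t * float(w))

def Z(n, w):
    """Z_n(w) = nth digit in the decimal expansion of w (exact for Fraction input)."""
    return floor(Fraction(w) * 10 ** n) % 10

def V(n, t, w):
    """V_{n,t}(w) = X_n(w)^2 + Y_t(w)^2."""
    return X(n, w) ** 2 + Y(t, w) ** 2

def sample_path(process, index_values, w):
    """For a fixed sample point w, evaluate the process along the given indices."""
    return [process(a, w) for a in index_values]

w = Fraction(1, 4)                        # one fixed sample point in [0, 1]
digits = sample_path(Z, [1, 2, 3], w)     # decimal digits of 0.250...
```

Fixing $\omega$ and letting the index vary gives one realized function on the index set; the randomness enters only through the choice of $\omega$ under Lebesgue measure.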


Note that a real valued stochastic process $\{X_\alpha : \alpha \in A\}$ may also be viewed as a random real valued function on the set $A$ by the identification $\omega \to f(\omega, \cdot)$, where $f(\omega, \alpha) = X_\alpha(\omega)$ for $\alpha$ in $A$.

Definition 6.3.2: The family $\{\mu_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}(\cdot) \equiv P((X_{\alpha_1}, \ldots, X_{\alpha_k}) \in \cdot) : (\alpha_1, \alpha_2, \ldots, \alpha_k) \in A^k, 1 \le k < \infty\}$ of probability distributions is called the family of finite dimensional distributions (fdds) associated with the stochastic process $\{X_\alpha : \alpha \in A\}$.

This family of finite dimensional distributions satisfies the following consistency conditions: For any $(\alpha_1, \alpha_2, \ldots, \alpha_k) \in A^k$, $2 \le k < \infty$, and any $B_1, B_2, \ldots, B_k$ in $\mathcal{B}(\mathbb{R})$,

C1: $\mu_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}(B_1 \times \cdots \times B_{k-1} \times \mathbb{R}) = \mu_{(\alpha_1, \alpha_2, \ldots, \alpha_{k-1})}(B_1 \times \cdots \times B_{k-1})$;

C2: For any permutation $(i_1, i_2, \ldots, i_k)$ of $(1, 2, \ldots, k)$, $\mu_{(\alpha_{i_1}, \alpha_{i_2}, \ldots, \alpha_{i_k})}(B_{i_1} \times B_{i_2} \times \cdots \times B_{i_k}) = \mu_{(\alpha_1, \ldots, \alpha_k)}(B_1 \times B_2 \times \cdots \times B_k)$.

To verify C1, note that

$\mu_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}(B_1 \times B_2 \times \cdots \times B_{k-1} \times \mathbb{R})$
$= P(X_{\alpha_1} \in B_1, X_{\alpha_2} \in B_2, \ldots, X_{\alpha_{k-1}} \in B_{k-1}, X_{\alpha_k} \in \mathbb{R})$
$= P(X_{\alpha_1} \in B_1, X_{\alpha_2} \in B_2, \ldots, X_{\alpha_{k-1}} \in B_{k-1})$
$= \mu_{(\alpha_1, \alpha_2, \ldots, \alpha_{k-1})}(B_1 \times B_2 \times \cdots \times B_{k-1}).$

Similarly, to verify C2, note that

$\mu_{(\alpha_{i_1}, \alpha_{i_2}, \ldots, \alpha_{i_k})}(B_{i_1} \times B_{i_2} \times \cdots \times B_{i_k})$
$= P(X_{\alpha_{i_1}} \in B_{i_1}, X_{\alpha_{i_2}} \in B_{i_2}, \ldots, X_{\alpha_{i_k}} \in B_{i_k})$
$= P(X_{\alpha_1} \in B_1, X_{\alpha_2} \in B_2, \ldots, X_{\alpha_k} \in B_k)$
$= \mu_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}(B_1 \times B_2 \times \cdots \times B_k).$

A natural question is the following: given a family of probability distributions $Q_A \equiv \{\nu_{(\alpha_1, \alpha_2, \ldots, \alpha_k)} : (\alpha_1, \alpha_2, \ldots, \alpha_k) \in A^k, 1 \le k < \infty\}$ on finite dimensional Euclidean spaces, does there exist a real valued stochastic process $\{X_\alpha : \alpha \in A\}$ such that its family of finite dimensional distributions coincides with $Q_A$? Kolmogorov (1956) showed that if $Q_A$ satisfies C1 and C2, then such a stochastic process does exist. This is known as Kolmogorov’s consistency theorem (also known as Kolmogorov’s existence theorem).

Theorem 6.3.1: (Kolmogorov’s consistency theorem). Let $A$ be a nonempty set. Let $Q_A \equiv \{\nu_{(\alpha_1, \alpha_2, \ldots, \alpha_k)} : (\alpha_1, \alpha_2, \ldots, \alpha_k) \in A^k, 1 \le k < \infty\}$ be a family of probability distributions such that for each $(\alpha_1, \alpha_2, \ldots, \alpha_k) \in A^k$, $1 \le k < \infty$,


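(i) $\nu_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}$ is a probability distribution on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$,

(ii) C1 and C2 hold, i.e., for all $B_1, B_2, \ldots, B_k \in \mathcal{B}(\mathbb{R})$, $2 \le k < \infty$,

$\nu_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}(B_1 \times B_2 \times \cdots \times B_{k-1} \times \mathbb{R}) = \nu_{(\alpha_1, \alpha_2, \ldots, \alpha_{k-1})}(B_1 \times B_2 \times \cdots \times B_{k-1})$  (3.1)

and for any permutation $(i_1, i_2, \ldots, i_k)$ of $(1, 2, \ldots, k)$,

$\nu_{(\alpha_{i_1}, \alpha_{i_2}, \ldots, \alpha_{i_k})}(B_{i_1} \times B_{i_2} \times \cdots \times B_{i_k}) = \nu_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}(B_1 \times B_2 \times \cdots \times B_k).$  (3.2)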

Then, there exists a probability space $(\Omega, \mathcal{F}, P)$ and a stochastic process $X_A \equiv \{X_\alpha : \alpha \in A\}$ on $(\Omega, \mathcal{F}, P)$ such that $Q_A$ is the family of finite dimensional distributions associated with $X_A$.

Remark 6.3.1: Thus the above theorem says that given the family $Q_A$ satisfying conditions (i) and (ii), there exists a real valued function $f$ on $A \times \Omega$ such that for each $\omega$, $f(\cdot, \omega)$ is a function on $A$ and for each $(\alpha_1, \alpha_2, \ldots, \alpha_k) \in A^k$, the vector $(f(\alpha_1, \omega), f(\alpha_2, \omega), \ldots, f(\alpha_k, \omega))$ is a random vector with probability distribution $\nu_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}$. This random function point of view is useful in dealing with functionals of the form $M(\omega) \equiv \sup\{f(\alpha, \omega) : \alpha \in A\}$. For example, if $A = \{1, 2, \ldots\}$, then one might consider functionals such as $\lim_{n \to \infty} f(n, \omega)$, $\lim_{n \to \infty} \frac{1}{n} \sum_{j=1}^{n} f(j, \omega)$, $\sum_{j=1}^{\infty} f(j, \omega)$, etc. Since the random functionals are not fully determined by $f(\alpha, \omega)$ for finitely many $\alpha$’s, it is not possible to compute probabilities of events defined in terms of these functionals from the knowledge of the finite dimensional distribution of $(f(\alpha_1, \omega), \ldots, f(\alpha_k, \omega))$ for a given $(\alpha_1, \ldots, \alpha_k)$, no matter how large $k$ is. Kolmogorov’s consistency theorem allows one to compute these probabilities given all finite dimensional distributions (provided that the functionals satisfy appropriate measurability conditions).

Given a probability measure $\mu$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, now consider the problem of constructing a probability space $(\Omega, \mathcal{F}, P)$ and a random variable $X$ on it with distribution $\mu$. A natural solution is to set the sample space $\Omega$ to be $\mathbb{R}$, the σ-algebra $\mathcal{F}$ to be $\mathcal{B}(\mathbb{R})$, the probability measure $P$ to be $\mu$, and the random variable $X$ to be the identity map $X(\omega) \equiv \omega$. Similarly, given a probability measure $\mu$ on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$, one can set the sample space $\Omega$ to be $\mathbb{R}^k$, the σ-algebra $\mathcal{F}$ to be $\mathcal{B}(\mathbb{R}^k)$, the probability measure $P$ to be $\mu$, and the random vector $X$ to be the identity map.

Arguing in the same fashion, given a family $Q_A$ of finite dimensional distributions with index set $A$, to construct a stochastic process $\{X_\alpha : \alpha \in A\}$ with index set $A$ on some probability space $(\Omega, \mathcal{F}, P)$, it is natural to set the sample space $\Omega$ to be $\mathbb{R}^A$, the collection of all real valued functions on $A$, $\mathcal{F}$ to be a suitable σ-algebra that includes all finite dimensional events,


$P$ to be an appropriate probability measure that yields $Q_A$, and $X$ to be the identity map. These considerations lead to the following definitions.

Definition 6.3.3: Let $A$ be a nonempty set. Then $\mathbb{R}^A \equiv \{f \mid f : A \to \mathbb{R}\}$, the collection of all real valued functions on $A$.

If $A$ is a finite set $\{a_1, a_2, \ldots, a_k\}$, then $\mathbb{R}^A$ can be identified with $\mathbb{R}^k$ by associating each $f \in \mathbb{R}^A$ with the vector $(f(a_1), f(a_2), \ldots, f(a_k))$ in $\mathbb{R}^k$. If $A$ is a countably infinite set $\{a_1, a_2, a_3, \ldots\}$, then $\mathbb{R}^A$ can be similarly identified with $\mathbb{R}^\infty$, the set of all sequences $\{x_1, x_2, x_3, \ldots\}$ of real numbers. If $A$ is the interval $[0, 1]$, then $\mathbb{R}^A$ is the collection of all real valued functions on $[0, 1]$.

Definition 6.3.4: Let $A$ be a nonempty set. A subset $C \subset \mathbb{R}^A$ is called a finite dimensional cylinder set (fdcs) if there exists a finite subset $A_1 \subset A$, say, $A_1 \equiv \{\alpha_1, \alpha_2, \ldots, \alpha_k\}$, $1 \le k < \infty$, and a Borel set $B$ in $\mathcal{B}(\mathbb{R}^k)$ such that $C = \{f : f \in \mathbb{R}^A$ and $(f(\alpha_1), f(\alpha_2), \ldots, f(\alpha_k)) \in B\}$. The set $B$ is called a base for $C$. The collection of all finite dimensional cylinder sets will be denoted by $\mathcal{C}$.

The name cylinder is motivated by the following example:

Example 6.3.2: Let $A = \{1, 2, 3\}$ and $C = \{(x_1, x_2, x_3) : x_1^2 + x_2^2 \le 1\}$. Then $C$ is a cylinder (in the usual sense of the English word), but with infinite height and depth. According to Definition 6.3.4, $C$ is also a cylinder in $\mathbb{R}^3$ with the closed unit disc in $\mathbb{R}^2$ as its base.

Examples 6.3.3 and 6.3.4 below are examples of fdcs, whereas Example 6.3.5 is an example of a set that is not a fdcs.

Example 6.3.3: Let $A = \{1, 2\}$ and $C = \{(x_1, x_2) : |\sin 2\pi x_1| \le \frac{1}{\sqrt{2}}\}$.

Example 6.3.4: Let $A = \{1, 2, 3, \ldots\}$ and $C = \{(x_1, x_2, x_3, \ldots) : \frac{x_{17}^2}{4} + \frac{x_{30}^2}{10} - \frac{x_{42}^2}{5} \le 10\}$.
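Example 6.3.5: Let $A = \{1, 2, 3, \ldots\}$ and $D = \{(x_1, x_2, x_3, \ldots) : x_j \in \mathbb{R}$ for all $j \ge 1$ and $\lim_{n \to \infty} \frac{1}{n} \sum_{j=1}^{n} x_j$ exists$\}$. Then $D$ is not a finite dimensional cylinder set (Problem 6.8).

Proposition 6.3.2: Let $A$ be a nonempty set and $\mathcal{C}$ be the collection of all finite dimensional cylinder sets in $\mathbb{R}^A$. Then $\mathcal{C}$ is an algebra.

Proof: Let $C_1, C_2 \in \mathcal{C}$ and let

$C_1 = \{f : f \in \mathbb{R}^A$ and $(f(\alpha_1), f(\alpha_2), \ldots, f(\alpha_k)) \in B_1\}$
$C_2 = \{f : f \in \mathbb{R}^A$ and $(f(\beta_1), f(\beta_2), \ldots, f(\beta_j)) \in B_2\}$

for some $A_1 = \{\alpha_1, \alpha_2, \ldots, \alpha_k\} \subset A$, $A_2 = \{\beta_1, \beta_2, \ldots, \beta_j\} \subset A$, $B_1 \in \mathcal{B}(\mathbb{R}^k)$, $B_2 \in \mathcal{B}(\mathbb{R}^j)$, $1 \le k < \infty$, $1 \le j < \infty$. Let $A_3 = A_1 \cup A_2 = \{\gamma_1, \gamma_2, \ldots, \gamma_\ell\}$, where without loss of generality $(\gamma_1, \gamma_2, \ldots, \gamma_k) = (\alpha_1, \alpha_2, \ldots, \alpha_k)$ and $(\gamma_{\ell-j+1}, \ldots, \gamma_{\ell-1}, \gamma_\ell) = (\beta_1, \beta_2, \ldots, \beta_j)$. Then $C_1$ and $C_2$ may be expressed as

$C_1 = \{f : f \in \mathbb{R}^A$ and $(f(\gamma_1), f(\gamma_2), \ldots, f(\gamma_\ell)) \in \tilde{B}_1\}$
$C_2 = \{f : f \in \mathbb{R}^A$ and $(f(\gamma_1), \ldots, f(\gamma_\ell)) \in \tilde{B}_2\}$

where $\tilde{B}_1 = B_1 \times \mathbb{R}^{\ell-k}$ and $\tilde{B}_2 = \mathbb{R}^{\ell-j} \times B_2$. Thus, $C_1 \cup C_2 = \{f : f \in \mathbb{R}^A$ and $(f(\gamma_1), \ldots, f(\gamma_\ell)) \in \tilde{B}_1 \cup \tilde{B}_2\}$. Since both $\tilde{B}_1$ and $\tilde{B}_2$ lie in $\mathcal{B}(\mathbb{R}^\ell)$, $C_1 \cup C_2 \in \mathcal{C}$. Next note that $C_1^c = \{f : f \in \mathbb{R}^A$ and $(f(\alpha_1), \ldots, f(\alpha_k)) \in B_1^c\}$. Since $B_1^c \in \mathcal{B}(\mathbb{R}^k)$, it follows that $C_1^c \in \mathcal{C}$. Thus, $\mathcal{C}$ is an algebra. □

Remark 6.3.2: If $A$ is a finite nonempty set, the collection $\mathcal{C}$ is also a σ-algebra.

Definition 6.3.5: Let $A$ be a nonempty set. Let $\mathcal{R}^A$ be the σ-algebra generated by the collection $\mathcal{C}$. Then $\mathcal{R}^A$ is called the product σ-algebra on $\mathbb{R}^A$.

Remark 6.3.3: If $A = \{1, 2, 3, \ldots\} \equiv \mathbb{N}$ and $\mathbb{R}^{\mathbb{N}}$ is identified with the set $\mathbb{R}^\infty$ of all sequences of real numbers, then the product σ-algebra $\mathcal{R}^{\mathbb{N}}$ coincides with the Borel σ-algebra $\mathcal{B}(\mathbb{R}^\infty)$ on $\mathbb{R}^\infty$ under the metric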

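$d(x, y) = \sum_{j=1}^{\infty} \frac{1}{2^j} \frac{|x_j - y_j|}{1 + |x_j - y_j|}$  (3.3)

for $x = (x_1, x_2, \ldots)$, $y = (y_1, y_2, \ldots)$ in $\mathbb{R}^\infty$ (Problem 6.9).

Definition 6.3.6: Let $A$ be a nonempty set. For any $(\alpha_1, \alpha_2, \ldots, \alpha_k) \in A^k$, $1 \le k < \infty$, the projection map $\pi_{(\alpha_1, \ldots, \alpha_k)}$ from $\mathbb{R}^A$ to $\mathbb{R}^k$ is defined by

$\pi_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}(f) = (f(\alpha_1), f(\alpha_2), \ldots, f(\alpha_k)).$  (3.4)

In particular, for $\alpha \in A$,

$\pi_\alpha(f) = f(\alpha)$  (3.5)

is called a co-ordinate map. The projection map $\pi_{A_1}$ for an arbitrary subset $A_1 \subset A$ may be similarly defined.

The next proposition follows from the definition of $\mathcal{R}^A$.

Proposition 6.3.3: (i) For each $\alpha \in A$, the map $\pi_\alpha$ from $\mathbb{R}^A$ to $\mathbb{R}$ is $\langle \mathcal{R}^A, \mathcal{B}(\mathbb{R}) \rangle$-measurable. (ii) For any $(\alpha_1, \alpha_2, \ldots, \alpha_k) \in A^k$, $1 \le k < \infty$, the map $\pi_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}$ from $\mathbb{R}^A$ to $\mathbb{R}^k$ is $\langle \mathcal{R}^A, \mathcal{B}(\mathbb{R}^k) \rangle$-measurable.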
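The algebra property of finite dimensional cylinder sets (Proposition 6.3.2) rests on re-expressing two cylinders over a common finite index set. The sketch below illustrates that idea only; it is not a measure-theoretic implementation. The class `Cylinder`, the helpers `lift`, `union`, and `complement`, and the representation of a base set as a Boolean predicate are all assumptions made for the example, and a function $f \in \mathbb{R}^A$ is represented by any mapping that can be evaluated at the finitely many indices involved.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Mapping, Tuple

@dataclass(frozen=True)
class Cylinder:
    """C = {f : (f(a_1), ..., f(a_k)) in B}, with the base B given as a predicate."""
    index: Tuple[Hashable, ...]
    base: Callable[[Tuple[float, ...]], bool]

    def __contains__(self, f: Mapping) -> bool:
        # Membership depends on f only through the projection onto `index`.
        return self.base(tuple(f[a] for a in self.index))

def lift(c: Cylinder, gamma: Tuple[Hashable, ...]) -> Cylinder:
    """Re-express c over a larger index tuple gamma; the new base ignores
    the coordinates of gamma that are not in c.index."""
    pos = [gamma.index(a) for a in c.index]
    return Cylinder(gamma, lambda x: c.base(tuple(x[i] for i in pos)))

def union(c1: Cylinder, c2: Cylinder) -> Cylinder:
    """Union of two cylinders, expressed over the union of their index sets."""
    gamma = c1.index + tuple(a for a in c2.index if a not in c1.index)
    l1, l2 = lift(c1, gamma), lift(c2, gamma)
    return Cylinder(gamma, lambda x: l1.base(x) or l2.base(x))

def complement(c: Cylinder) -> Cylinder:
    """Complement of a cylinder: same indices, complemented base."""
    return Cylinder(c.index, lambda x: not c.base(x))

# Cylinder of Example 6.3.2 and a second, hypothetical cylinder on coordinate 3.
c1 = Cylinder((1, 2), lambda x: x[0] ** 2 + x[1] ** 2 <= 1)
c2 = Cylinder((3,), lambda x: x[0] > 5)
f = {1: 0.5, 2: 0.5, 3: 100.0}
g = {1: 2.0, 2: 2.0, 3: 0.0}
```

The union is again described by finitely many coordinates and a base in the larger Euclidean space, which is the content of the proof of Proposition 6.3.2. The ordering of the combined index tuple here differs from the convention in that proof, which does not affect the resulting set.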


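Proof of Theorem 6.3.1: Let $\Omega = \mathbb{R}^A$ and $\mathcal{F} \equiv \mathcal{R}^A$. Define a set function $P$ on $\mathcal{C}$ by

$P(C) = \nu_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}(B)$  (3.6)

for a $C$ in $\mathcal{C}$ with representation

$C = \{\omega : \omega \in \mathbb{R}^A, (\omega(\alpha_1), \omega(\alpha_2), \ldots, \omega(\alpha_k)) \in B\}.$  (3.7)

The main steps in the proof are (i) to show that $P(C)$ as defined in (3.6) is independent of the representation (3.7) of $C$, and (ii) to show that $P(\cdot)$ is countably additive on $\mathcal{C}$. Then, by the Caratheodory extension theorem (Theorem 1.3.3), there exists a unique extension of $P$ (also denoted by $P$) to $\mathcal{F}$ such that $(\Omega, \mathcal{F}, P)$ is a probability space. Defining $X_\alpha(\omega) \equiv \pi_\alpha(\omega) = \omega(\alpha)$ for $\alpha$ in $A$ yields a stochastic process $\{X_\alpha : \alpha \in A\}$ on the probability space $(\mathbb{R}^A, \mathcal{R}^A, P) \equiv (\Omega, \mathcal{F}, P)$ with the family $Q_A$ as its set of finite dimensional distributions. Hence, it remains to establish (i) and (ii).

Let $C \in \mathcal{C}$ admit two representations:

$C \equiv \{\omega : (\omega(\alpha_1), \omega(\alpha_2), \ldots, \omega(\alpha_k)) \in B_1\} \equiv \pi^{-1}_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}(B_1)$

and

$C \equiv \{\omega : (\omega(\beta_1), \omega(\beta_2), \ldots, \omega(\beta_j)) \in B_2\} \equiv \pi^{-1}_{(\beta_1, \beta_2, \ldots, \beta_j)}(B_2)$

for some $A_1 = \{\alpha_1, \alpha_2, \ldots, \alpha_k\} \subset A$, $1 \le k < \infty$, and some $A_2 = \{\beta_1, \beta_2, \ldots, \beta_j\} \subset A$, $1 \le j < \infty$, $B_1 \in \mathcal{B}(\mathbb{R}^k)$ and $B_2 \in \mathcal{B}(\mathbb{R}^j)$. Let $A_3 = A_1 \cup A_2 = \{\gamma_1, \gamma_2, \ldots, \gamma_\ell\}$ and w.l.o.g., let $(\gamma_1, \gamma_2, \ldots, \gamma_k) = (\alpha_1, \alpha_2, \ldots, \alpha_k)$ and $(\gamma_{\ell-j+1}, \gamma_{\ell-j+2}, \ldots, \gamma_{\ell-1}, \gamma_\ell) = (\beta_1, \beta_2, \ldots, \beta_j)$. Then $C$ may be represented as

$C = \pi^{-1}_{(\gamma_1, \gamma_2, \ldots, \gamma_\ell)}(\tilde{B}_1) = \pi^{-1}_{(\gamma_1, \gamma_2, \ldots, \gamma_\ell)}(\tilde{B}_2),$

where $\tilde{B}_1 = B_1 \times \mathbb{R}^{\ell-k}$ and $\tilde{B}_2 = \mathbb{R}^{\ell-j} \times B_2$. Note that $(\omega(\gamma_1), \ldots, \omega(\gamma_\ell)) \in \tilde{B}_1$ iff $\omega \in C$ iff $(\omega(\gamma_1), \ldots, \omega(\gamma_\ell)) \in \tilde{B}_2$ and thus

$\tilde{B}_1 = \tilde{B}_2.$  (3.8)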


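Next by the first consistency condition (3.1) and induction,

$\nu_{(\gamma_1, \gamma_2, \ldots, \gamma_\ell)}(\tilde{B}_1) = \nu_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}(B_1).$  (3.9)

Also by (3.2), for $B_2$ of the form $B_{21} \times B_{22} \times \cdots \times B_{2j}$ with $B_{2i} \in \mathcal{B}(\mathbb{R})$ for all $1 \le i \le j$,

$\nu_{(\gamma_1, \gamma_2, \ldots, \gamma_\ell)}(\mathbb{R}^{\ell-j} \times B_2) = \nu_{(\gamma_{\ell-j+1}, \ldots, \gamma_\ell, \gamma_1, \gamma_2, \ldots, \gamma_{\ell-j})}(B_2 \times \mathbb{R}^{\ell-j}).$

Now note that (a) $\nu_{(\gamma_1, \gamma_2, \ldots, \gamma_\ell)}(\mathbb{R}^{\ell-j} \times B)$ and $\nu_{(\gamma_{\ell-j+1}, \ldots, \gamma_\ell, \gamma_1, \gamma_2, \ldots, \gamma_{\ell-j})}(B \times \mathbb{R}^{\ell-j})$, considered as set functions defined for $B \in \mathcal{B}(\mathbb{R}^j)$, are probability measures on $\mathcal{B}(\mathbb{R}^j)$, (b) they coincide on the class $\Gamma$ of sets of the form $B = B_{21} \times B_{22} \times \cdots \times B_{2j}$ with $B_{2i} \in \mathcal{B}(\mathbb{R})$ for all $i$, and (c) the class $\Gamma$ is a π-class and it generates $\mathcal{B}(\mathbb{R}^j)$. Hence, by the uniqueness theorem (Theorem 1.3.6),

$\nu_{(\gamma_1, \gamma_2, \ldots, \gamma_\ell)}(\mathbb{R}^{\ell-j} \times B) = \nu_{(\gamma_{\ell-j+1}, \ldots, \gamma_\ell, \gamma_1, \gamma_2, \ldots, \gamma_{\ell-j})}(B \times \mathbb{R}^{\ell-j})$  (3.10)

for all $B \in \mathcal{B}(\mathbb{R}^j)$. Again by (3.1) and induction,

$\nu_{(\gamma_{\ell-j+1}, \ldots, \gamma_\ell, \gamma_1, \gamma_2, \ldots, \gamma_{\ell-j})}(B_2 \times \mathbb{R}^{\ell-j}) = \nu_{(\gamma_{\ell-j+1}, \ldots, \gamma_\ell)}(B_2) = \nu_{(\beta_1, \beta_2, \ldots, \beta_j)}(B_2).$  (3.11)

Since $\tilde{B}_2 = \mathbb{R}^{\ell-j} \times B_2$, by (3.10) and (3.11),

$\nu_{(\gamma_1, \gamma_2, \ldots, \gamma_\ell)}(\tilde{B}_2) = \nu_{(\beta_1, \beta_2, \ldots, \beta_j)}(B_2).$

Now from (3.8) and (3.9) it follows that

$\nu_{(\alpha_1, \ldots, \alpha_k)}(B_1) = \nu_{(\gamma_1, \gamma_2, \ldots, \gamma_\ell)}(\tilde{B}_1) = \nu_{(\gamma_1, \gamma_2, \ldots, \gamma_\ell)}(\tilde{B}_2) = \nu_{(\beta_1, \beta_2, \ldots, \beta_j)}(B_2),$

thus establishing (i). To establish (ii), it needs to be shown that

(ii)a $P(C_1 \cup C_2) = P(C_1) + P(C_2)$ if $C_1, C_2 \in \mathcal{C}$ and $C_1 \cap C_2 = \emptyset$.

(ii)b $C_n \in \mathcal{C}$, $C_n \supset C_{n+1}$ for all $n$, $\bigcap_{n \ge 1} C_n = \emptyset \Rightarrow P(C_n) \downarrow 0$.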


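Let $C_1 = \pi^{-1}_{(\alpha_1, \ldots, \alpha_k)}(B_1)$ and $C_2 = \pi^{-1}_{(\beta_1, \ldots, \beta_j)}(B_2)$ for $B_1 \in \mathcal{B}(\mathbb{R}^k)$, $B_2 \in \mathcal{B}(\mathbb{R}^j)$, $\{\alpha_1, \ldots, \alpha_k\} \subset A$ and $\{\beta_1, \ldots, \beta_j\} \subset A$, $1 \le j, k < \infty$. As in the proof of Proposition 6.3.2, $C_1$ and $C_2$ may be represented as

$C_i = \pi^{-1}_{(\gamma_1, \gamma_2, \ldots, \gamma_\ell)}(\tilde{B}_i), \quad i = 1, 2,$

where $\tilde{B}_i \in \mathcal{B}(\mathbb{R}^\ell)$. Since $C_1$ and $C_2$ are disjoint by hypothesis, it follows that $\tilde{B}_1$ and $\tilde{B}_2$ are disjoint. Also, since

$P(C_i) = \nu_{(\gamma_1, \gamma_2, \ldots, \gamma_\ell)}(\tilde{B}_i), \quad i = 1, 2,$

and $\nu_{(\gamma_1, \ldots, \gamma_\ell)}(\cdot)$ is a measure on $\mathcal{B}(\mathbb{R}^\ell)$, it follows that

$P(C_1 \cup C_2) = \nu_{(\gamma_1, \ldots, \gamma_\ell)}(\tilde{B}_1 \cup \tilde{B}_2) = \nu_{(\gamma_1, \ldots, \gamma_\ell)}(\tilde{B}_1) + \nu_{(\gamma_1, \ldots, \gamma_\ell)}(\tilde{B}_2) = P(C_1) + P(C_2),$

thus proving (ii)a.

To prove (ii)b, note that for any sequence $\{C_n\}_{n \ge 1} \subset \mathcal{C}$, there exists a countable set $A_1 = \{\alpha_1, \alpha_2, \ldots, \alpha_n, \ldots\}$, an increasing sequence $\{k_n\}_{n \ge 1}$ of positive integers, and a sequence of Borel sets $\{B_n\}_{n \ge 1}$ such that $B_n \in \mathcal{B}(\mathbb{R}^{k_n})$ and $C_n = \pi^{-1}_{(\alpha_1, \alpha_2, \ldots, \alpha_{k_n})}(B_n)$ for all $n \in \mathbb{N}$. Now suppose that $\{C_n\}_{n \ge 1}$ is decreasing. It will be shown that if $\lim_{n \to \infty} P(C_n) = \delta > 0$, then $\bigcap_{n \ge 1} C_n \ne \emptyset$. For each $n$, by the regularity of measures (Corollary 1.3.5), there exists a compact set $G_n \subset B_n$ such that $\nu_{(\alpha_1, \ldots, \alpha_{k_n})}(B_n \setminus G_n)$ …

… 0, $H_n \subset C_n$, and $P(C_n \setminus H_n) < \frac{\delta}{2}$, it follows that $P(H_n) > \frac{\delta}{2}$ for all $n \ge 1$. This implies $H_n \ne \emptyset$ for each $n$. It will now be shown that $\bigcap_{n \ge 1} H_n \ne \emptyset$. Let $\{\omega_n\}_{n \ge 1}$ be a sequence of elements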


from $\Omega = \mathbb{R}^A$ such that for each $n$, $\omega_n \in H_n$. Then, since $\{H_n\}_{n \ge 1}$ is a decreasing sequence, for each $1 \le j < \infty$, $\omega_n \in H_j$ for $n \ge j$. This implies that the vector $(\omega_n(\alpha_1), \omega_n(\alpha_2), \ldots, \omega_n(\alpha_{k_j})) \in G_j$ for all $n \ge j$. Since $G_1$ is compact, there exists a subsequence $\{n_{1i}\}_{i \ge 1}$ such that $\lim_{i \to \infty} \omega_{n_{1i}}(\alpha_1) = \omega(\alpha_1)$ exists. Next, since $G_2$ is compact, there exists a further subsequence $\{n_{2i}\}_{i \ge 1}$ of $\{n_{1i}\}_{i \ge 1}$ such that $\lim_{i \to \infty} \omega_{n_{2i}}(\alpha_2) = \omega(\alpha_2)$ exists. Proceeding this way and applying the usual ‘diagonal method,’ a subsequence $\{n_i\}_{i \ge 1}$ is obtained such that $\lim_{i \to \infty} \omega_{n_i}(\alpha_j) = \omega(\alpha_j)$ for all $1 \le j < \infty$. Let $\omega(\alpha) = 0$ for $\alpha \notin \{\alpha_1, \alpha_2, \ldots\}$. Since for each $j$, $G_j$ is compact, $(\omega(\alpha_1), \omega(\alpha_2), \ldots, \omega(\alpha_{k_j})) \in G_j$ and hence $\omega \in H_j$. Thus, $\omega \in \bigcap_{j \ge 1} H_j \subset \bigcap_{j \ge 1} C_j$, implying $\bigcap_{j \ge 1} C_j \ne \emptyset$. The proof of the theorem is now complete. □

When the index set $A$ is countable and identified with the set $\mathbb{N} \equiv \{1, 2, 3, \ldots\}$, it is possible to give a simpler formulation of the consistency conditions.

Theorem 6.3.4: Let $\{\mu_n\}_{n \ge 1}$ be a sequence of probability measures such that

(i) for each $n \in \mathbb{N}$, $\mu_n$ is a probability measure on $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$,

(ii) for each $n \in \mathbb{N}$, $\mu_{n+1}(B \times \mathbb{R}) = \mu_n(B)$ for all $B \in \mathcal{B}(\mathbb{R}^n)$.

Then there exists a stochastic process $\{X_n : n \ge 1\}$ on a probability space $(\Omega, \mathcal{F}, P)$ with $\Omega = \mathbb{R}^\infty$, $\mathcal{F} = \mathcal{B}(\mathbb{R}^\infty)$ such that for each $n \ge 1$, the probability distribution $P_{(X_1, X_2, \ldots, X_n)}$ of the random vector $(X_1, X_2, \ldots, X_n)$ is $\mu_n$.

Proof: For any $\{i_1, i_2, \ldots, i_k\} \subset \mathbb{N}$, let $j_1 < j_2 < \cdots < j_k$ be the increasing rearrangement of $i_1, i_2, \ldots, i_k$. Then there exists a permutation $(r_1, r_2, \ldots, r_k)$ of $(1, 2, \ldots, k)$ such that $j_1 = i_{r_1}, j_2 = i_{r_2}, \ldots, j_k = i_{r_k}$. Now define

$\nu_{(j_1, j_2, \ldots, j_k)}(\cdot) \equiv \mu_{j_k} \pi^{-1}_{j_1, j_2, \ldots, j_k}(\cdot),$

where $\pi_{j_1, j_2, \ldots, j_k}(x_1, \ldots, x_{j_k}) = (x_{j_1}, x_{j_2}, \ldots, x_{j_k})$ for all $(x_1, x_2, \ldots, x_{j_k}) \in \mathbb{R}^{j_k}$. Next define

$\nu_{(i_1, i_2, \ldots, i_k)}(B_1 \times B_2 \times \cdots \times B_k) \equiv \nu_{(j_1, j_2, \ldots, j_k)}(B_{r_1} \times B_{r_2} \times \cdots \times B_{r_k}),$

where $B_i \in \mathcal{B}(\mathbb{R})$ for all $i$, $1 \le i \le k$. It can be verified that this family of finite dimensional distributions

$Q_{\mathbb{N}} \equiv \{\nu_{(i_1, i_2, \ldots, i_k)}(\cdot) : \{i_1, i_2, \ldots, i_k\} \subset \mathbb{N}, 1 \le k < \infty\}$  (3.12)

satisfies the consistency conditions (3.1) and (3.2) of Theorem 6.3.1 and hence the assertion follows. □


Example 6.3.6: (Sequence of independent random variables). Let $\{F_n\}_{n\ge 1}$ be a sequence of cdfs on $\mathbb{R}$. Consider the problem of constructing a sequence $\{X_n\}_{n\ge 1}$ of random variables on a probability space $(\Omega, \mathcal{F}, P)$ such that (i) for each $n \in \mathbb{N}$, $X_n$ has cdf $F_n$ and (ii) for any $n \in \mathbb{N}$ and any $\{i_1, i_2, \ldots, i_n\} \subset \mathbb{N}$, the random variables $\{X_{i_1}, X_{i_2}, \ldots, X_{i_n}\}$ are independent, i.e.,

$$P(X_{i_1} \le x_1, X_{i_2} \le x_2, \ldots, X_{i_n} \le x_n) = \prod_{j=1}^{n} F_{i_j}(x_j) \tag{3.13}$$

for all $x_1, x_2, \ldots, x_n$ in $\mathbb{R}$. This problem can be solved by using Theorem 6.3.4. Let $\mu_n$ be the Lebesgue-Stieltjes probability measure on $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$ corresponding to the distribution function

$$F_{1,2,\ldots,n}(x_1, x_2, \ldots, x_n) \equiv \prod_{j=1}^{n} F_j(x_j), \qquad x_1, \ldots, x_n \in \mathbb{R}.$$

It is easy to verify that the family $\{\mu_n : n \ge 1\}$ satisfies (i) and (ii) of Theorem 6.3.4. Hence, there exist a probability measure $P$ on the sequence space $\Omega \equiv \mathbb{R}^\infty$ equipped with the σ-algebra $\mathcal{F} \equiv \mathcal{B}(\mathbb{R}^\infty)$ and random variables $X_n(\omega) \equiv \pi_n(\omega) \equiv \omega(n)$, for $\omega = (\omega(1), \omega(2), \ldots)$ in $\mathbb{R}^\infty$, $n \ge 1$, such that (3.13) holds.

Example 6.3.7: (Family of independent random variables). Given a family $\{F_\alpha : \alpha \in A\}$ of cdfs on $\mathbb{R}$ for some index set $A$, a construction similar to Example 6.3.6, but using Theorem 6.3.1, yields the existence of a real valued stochastic process $\{X_\alpha : \alpha \in A\}$ such that for any $\{\alpha_1, \alpha_2, \ldots, \alpha_n\} \subset A$, $1 \le n < \infty$, the random variables $\{X_{\alpha_1}, X_{\alpha_2}, \ldots, X_{\alpha_n}\}$ are independent, i.e., (3.13) holds.

Example 6.3.8: (Markov chains). Let $Q = ((q_{ij}))$ be a $k \times k$ stochastic matrix for some $1 < k < \infty$. That is, (a) for all $1 \le i, j \le k$, $q_{ij} \ge 0$ and (b) for each $1 \le i \le k$, $\sum_{j=1}^{k} q_{ij} = 1$. Let $p = (p_1, p_2, \ldots, p_k)$ be a probability vector, i.e., for all $i$, $p_i \ge 0$, and $\sum_{i=1}^{k} p_i = 1$. Consider the problem of constructing a sequence $\{X_n\}_{n\ge 1}$ of random variables such that for each $n \in \mathbb{N}$,

$$P(X_1 = j_1, X_2 = j_2, \ldots, X_n = j_n) = p_{j_1} q_{j_1 j_2} \cdots q_{j_{n-1} j_n} \quad \text{for } 1 \le j_i \le k,\ i = 1, 2, \ldots, n. \tag{3.14}$$
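The path probabilities on the right side of (3.14), and condition (ii) of Theorem 6.3.4 for them, can be checked numerically on a small case. The following is a minimal sketch, not part of the original text: the two-state matrix `Q`, the vector `p`, and the helper names `mu`, `is_consistent`, and `total_mass` are all illustrative assumptions, and the states are labeled 0 and 1 rather than 1 and 2.

```python
from fractions import Fraction
from itertools import product

# Hypothetical two-state example (states 0, 1 stand for states 1, 2 of the text).
p = [Fraction(1, 3), Fraction(2, 3)]        # initial probability vector
Q = [[Fraction(1, 2), Fraction(1, 2)],
     [Fraction(1, 4), Fraction(3, 4)]]      # stochastic matrix: each row sums to 1


def mu(path):
    """Right side of (3.14): p_{j1} q_{j1 j2} ... q_{j(n-1) jn} for a tuple of states."""
    prob = p[path[0]]
    for a, b in zip(path, path[1:]):
        prob *= Q[a][b]
    return prob


def is_consistent(n):
    """Check mu_{n+1}({path} x S) == mu_n({path}) for every path of length n."""
    k = len(p)
    return all(
        sum(mu(path + (j,)) for j in range(k)) == mu(path)
        for path in product(range(k), repeat=n)
    )


def total_mass(n):
    """Total mass of mu_n over all paths of length n."""
    return sum(mu(path) for path in product(range(len(p)), repeat=n))
```

Exact rational arithmetic is used so that the consistency identity is tested as an equality rather than up to rounding. Checking singletons suffices here because every subset of a finite path space is a finite disjoint union of singletons.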


Let $\mu_n$ be the discrete probability distribution determined by the right side of (3.14), that is, $\mu_n(\{(j_1, j_2, \ldots, j_n)\}) = p_{j_1} q_{j_1 j_2} \cdots q_{j_{n-1} j_n}$ for all $(j_1, \ldots, j_n)$ such that $1 \le j_i \le k$ for all $1 \le i \le n$. It is easy to verify that $\{\mu_n\}_{n\ge 1}$ satisfies the conditions of Theorem 6.3.4 and hence there exists a sequence $\{X_n\}_{n\ge 1}$ of random variables satisfying (3.14). It may be verified that (3.14) is equivalent to

$$P(X_{n+1} = j_{n+1} \mid X_1 = j_1, \ldots, X_n = j_n) = q_{j_n j_{n+1}} = P(X_{n+1} = j_{n+1} \mid X_n = j_n) \tag{3.15}$$

for all $n \ge 1$, $1 \le j_i \le k$, $i = 1, 2, \ldots, n+1$, provided $P(X_1 = j_1, \ldots, X_n = j_n) > 0$, and $P(X_1 = j) = p_j$ for $1 \le j \le k$. This says that the conditional distribution of $X_{n+1}$ given $X_1, X_2, \ldots, X_n$ depends only on $X_n$. This property is known as the Markov property, and the sequence $\{X_n\}_{n\ge 1}$ is called a Markov chain with state space $S \equiv \{1, 2, \ldots, k\}$ and time homogeneous transition probability matrix $((q_{ij}))$. When the state space $S = \{1, 2, \ldots\}$, the above construction goes over with minor notational modifications.

Next consider the case $S = \mathbb{R}$. A function $Q : \mathbb{R} \times \mathcal{B}(\mathbb{R}) \to [0, 1]$ is called a probability transition function if (i) for each $x$ in $\mathbb{R}$, $Q(x, \cdot)$ is a probability measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ and (ii) for each $B$ in $\mathcal{B}(\mathbb{R})$, $Q(\cdot, B)$ is a Borel measurable function on $\mathbb{R}$. Let $\mu$ be a probability distribution on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Using Theorem 6.3.4, it can be shown that there exists a stochastic process $\{X_n\}_{n\ge 1}$ such that

$$P(X_1 \in B_1, X_2 \in B_2, \ldots, X_n \in B_n) = \int_{B_1} \int_{B_2} \cdots \int_{B_{n-1}} Q(x_{n-1}, B_n)\, Q(x_{n-2}, dx_{n-1}) \cdots Q(x_1, dx_2)\, \mu(dx_1), \tag{3.16}$$

where the right side of (3.16) is a well-defined probability measure on $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$ (Problem 6.18). Such a sequence $\{X_n\}_{n\ge 1}$ is called a Markov chain with state space $\mathbb{R}$, initial distribution $\mu$, and transition probability function $Q$. For more on Markov chains, see Chapter 14.

Example 6.3.9: (Gaussian processes). Let $A$ be a nonempty set and $\{X_\alpha : \alpha \in A\}$ be a stochastic process. Such a process is called Gaussian if for $\{\alpha_1, \alpha_2, \ldots, \alpha_k\} \subset A$ and real numbers $t_1, t_2, \ldots, t_k$, the random variable $\sum_{i=1}^{k} t_i X_{\alpha_i}$ has a univariate normal distribution (with possibly zero variance). For such a process, the functions $\mu(\alpha) \equiv EX_\alpha$ and σ(α, β) ≡


$\mathrm{Cov}(X_\alpha, X_\beta)$ are called the mean and covariance functions, respectively. Since $\mathrm{Var}\big(\sum_{i=1}^{k} t_i X_{\alpha_i}\big) \ge 0$, it follows that for any $t_1, t_2, \ldots, t_k$,

$$\sum_{i=1}^{k} \sum_{j=1}^{k} t_i t_j\, \sigma(\alpha_i, \alpha_j) \ge 0. \tag{3.17}$$

This property of the covariance function $\sigma(\cdot, \cdot)$ is called nonnegative definiteness. A natural question is: Given functions $\mu : A \to \mathbb{R}$ and $\sigma : A \times A \to \mathbb{R}$ such that $\sigma$ is symmetric and satisfies (3.17), does there exist a Gaussian process $\{X_\alpha : \alpha \in A\}$ with $\mu(\cdot)$ and $\sigma(\cdot, \cdot)$ as its mean and covariance functions, respectively? The answer is yes and it follows from Theorem 6.3.1 by defining the family $Q_A$ of finite dimensional distributions as follows. Let $\nu_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}$ be the unique probability distribution on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$ with the moment generating function

$$M_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}(s_1, s_2, \ldots, s_k) = \exp\Big( \sum_{i=1}^{k} s_i \mu(\alpha_i) + \frac{1}{2} \sum_{i=1}^{k} \sum_{j=1}^{k} s_i s_j\, \sigma(\alpha_i, \alpha_j) \Big) \tag{3.18}$$

for $s_1, s_2, \ldots, s_k$ in $\mathbb{R}$. If the matrix $\Sigma \equiv ((\sigma(\alpha_i, \alpha_j)))$, $1 \le i, j \le k$, is positive definite, i.e., it is such that in (3.17) equality holds iff $t_i = 0$ for all $i$, then $\nu_{(\alpha_1, \ldots, \alpha_k)}(\cdot)$ can be shown to be a probability measure that is absolutely continuous w.r.t. $m_k$, the Lebesgue measure on $\mathbb{R}^k$, with density

$$\frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\Big( -\frac{1}{2} \sum_{i=1}^{k} \sum_{j=1}^{k} \big(x_i - \mu(\alpha_i)\big)\, \tilde{\sigma}_{ij}\, \big(x_j - \mu(\alpha_j)\big) \Big),$$

where $\tilde{\Sigma} \equiv ((\tilde{\sigma}_{ij})) = \Sigma^{-1}$, the inverse of $\Sigma$, and $|\Sigma|$ = the determinant of $\Sigma$. The verification of conditions (3.1) and (3.2) for this family is left as an exercise (Problem 6.12).

Remark 6.3.4: Kolmogorov's consistency theorem (Theorem 6.3.1) remains valid when the real line $\mathbb{R}$ is replaced by a complete separable metric space $S$. More specifically, let $A$ be a nonempty set and for $\{\alpha_1, \alpha_2, \ldots, \alpha_k\} \subset A$, $1 \le k < \infty$, let $\nu_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}(\cdot)$ be a probability measure on $(S^k, \mathcal{B}(S^k))$. If the family $Q_A \equiv \{\nu_{(\alpha_1, \alpha_2, \ldots, \alpha_k)} : \{\alpha_1, \alpha_2, \ldots, \alpha_k\} \subset A,\ 1 \le k < \infty\}$ satisfies the natural analogs of (3.1) and (3.2), then there exist a probability measure $P$ on $(\Omega \equiv S^A, \mathcal{F} \equiv (\mathcal{B}(S))^A)$ and an $S$-valued stochastic process $\{X_\alpha : \alpha \in A\}$ on $(\Omega, \mathcal{F}, P)$ such that $\nu_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}(\cdot) = P(X_{\alpha_1}, X_{\alpha_2}, \ldots, X_{\alpha_k})^{-1}(\cdot)$. Here $S^A$ is the set of all $S$-valued functions on $A$, $(\mathcal{B}(S))^A$ is the σ-algebra generated by the cylinder sets of the form $C = \{f : f : A \to S,\ f(\alpha_i) \in B_i,\ i = 1, 2, \ldots, k\}$ where $\{\alpha_1, \alpha_2, \ldots, \alpha_k\} \subset A$, $B_i \in \mathcal{B}(S)$, $1 \le i \le k$, $1 \le k < \infty$, and also $X_\alpha(\omega)$ is the projection map $X_\alpha(\omega) \equiv \omega(\alpha)$. The main step in the


proof of Theorem 6.3.1 was to establish the countable additivity of the set function $P$ on the algebra $\mathcal{C}$ of finite dimensional cylinder sets. This in turn depended upon the fact that any probability measure $\mu$ on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$ for $1 \le k < \infty$ is regular, i.e., for every Borel set $B$ in $\mathcal{B}(\mathbb{R}^k)$ and for every $\epsilon > 0$, there exists a compact set $G \subset B$ such that $\mu(B \setminus G) < \epsilon$. If $S$ is a Polish space, then any probability measure on $(S^k, (\mathcal{B}(S))^k)$, $1 \le k < \infty$, is regular (see Billingsley (1968)), and hence, the main steps in the proof of Theorem 6.3.1 go through in this case.

Remark 6.3.5: (Limitations of Theorem 6.3.1). In this construction, $\Omega = \mathbb{R}^A$ is rather large and the σ-algebra $\mathcal{F} \equiv (\mathcal{B}(\mathbb{R}))^A$ is not large enough to include many events of interest when the index set $A$ is uncountable. In fact, it can be shown that $\mathcal{F}$ coincides with the class of all sets $G \subset \Omega$ that depend only on a countable number of coordinates of $\omega$. More precisely, the following holds.

Proposition 6.3.5: The σ-algebra

$$\mathcal{F} = \{G : G = \pi_{A_1}^{-1}(B) \text{ for some } B \text{ in } \mathcal{B}(\mathbb{R}^\infty) \text{ and } A_1 \subset A,\ A_1 \text{ countable}\}. \tag{3.19}$$

Proof: Verify that the right side of (3.19) is a σ-algebra containing the class $\mathcal{C}$ of cylinder sets and also that it is contained in $\mathcal{F}$. □

For example, if $A = [0, 1]$, then the set $C[0, 1]$ of all continuous functions from $[0, 1] \to \mathbb{R}$ is not a member of $\mathcal{F} \equiv (\mathcal{B}(\mathbb{R}))^A$. Similarly, if $M(\omega) \equiv \sup\{|\omega(\alpha)| : \alpha \in [0, 1]\}$, then the set $\{M(\omega) \le 1\}$ is not in $\mathcal{F} = (\mathcal{B}(\mathbb{R}))^{[0,1]}$. When $A$ is an interval in $\mathbb{R}$, this difficulty can be overcome in several ways. One approach, pioneered by J.L. Doob, is the notion of separable stochastic processes (Doob (1953)). Another approach, pioneered by Kolmogorov and Skorohod, is to restrict $\Omega$ to the class of all continuous functions or functions that are right continuous and have left limits (Billingsley (1968)). For more on stochastic processes, see Chapter 15.

Independent Random Experiments

If $E_1$ and $E_2$ are two random experiments with associated probability spaces $(\Omega_1, \mathcal{F}_1, P_1)$ and $(\Omega_2, \mathcal{F}_2, P_2)$, it is possible to model the experiment of performing both $E_1$ and $E_2$ independently by the product probability space $(\Omega_1 \times \Omega_2, \mathcal{F}_1 \times \mathcal{F}_2, P_1 \times P_2)$ (see Chapter 5). The same idea carries over to an arbitrary collection $\{E_\alpha : \alpha \in A\}$ of random experiments. It is possible to think of a grand experiment $E$ in which all the $E_\alpha$'s are independent components by considering the product probability space

$$(\times_{\alpha \in A}\, \Omega_\alpha,\ \times_{\alpha \in A}\, \mathcal{F}_\alpha,\ \times_{\alpha \in A}\, P_\alpha) \equiv (\Omega, \mathcal{F}, P) \tag{3.20}$$

where $(\Omega_\alpha, \mathcal{F}_\alpha, P_\alpha)$ is the probability space corresponding to $E_\alpha$. Here $\Omega \equiv \times_{\alpha \in A}\, \Omega_\alpha$ is the collection of all functions $\omega$ on $A$ such that $\omega(\alpha) \in \Omega_\alpha$,


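$\mathcal{F} \equiv \times_{\alpha \in A}\, \mathcal{F}_\alpha$ is the σ-algebra generated by finite dimensional cylinder sets of the form

$$C = \{\omega : \omega(\alpha_i) \in B_{\alpha_i},\ i = 1, 2, \ldots, k\}, \tag{3.21}$$

$1 \le k < \infty$, $\{\alpha_1, \alpha_2, \ldots, \alpha_k\} \subset A$, $B_{\alpha_i} \in \mathcal{F}_{\alpha_i}$, and $P \equiv \times_{\alpha \in A}\, P_\alpha$ is the probability measure on $\mathcal{F}$ such that for $C$ of (3.21),

$$P(C) = \prod_{i=1}^{k} P_{\alpha_i}(B_{\alpha_i}). \tag{3.22}$$

The proof of the existence of such a $P$ on $\mathcal{F}$ is an application of the extension theorem (Theorem 1.3.3). The verification of countable additivity on the class $\mathcal{C}$ of cylinder sets is not difficult. See Kolmogorov (1956).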

6.4 Problems

6.1 Let $\mu_1 = \mu_2$ be the probability distribution on $\Omega = \{1, 2\}$ with $\mu_1(\{1\}) = 1/2$. Find two distinct probability distributions on $\Omega \times \Omega$ with $\mu_1$ and $\mu_2$ as the set of marginals.

6.2 Let $\Omega = (0, 1)$, $\mathcal{F} = \mathcal{B}((0, 1))$ and $P$ be the Lebesgue measure on $(0, 1)$. Let $X(\omega) = -\log \omega$, $h(x) = x^2$ and $Y = h(X)$. Find $P_X$ and $P_Y$ and evaluate $EY$ by applying the change of variables formula (Proposition 6.2.1).

6.3 In the change of variables formula, one of the three integrals is usually easier to evaluate than the other two. In this problem, in part (a), the first integral is easier to evaluate than the other two while in part (b), the second one is easier.

(a) Let $Z \sim N(0, 1)$, $X = Z^2$, and $Y = e^{-X}$.
(i) Find the distributions $P_X$ and $P_Y$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$.
(ii) Compute the integrals
$$\int_{\mathbb{R}} e^{-z^2} \phi(z)\, dz, \qquad \int_{\mathbb{R}} e^{-x}\, P_X(dx) \qquad \text{and} \qquad \int_{\mathbb{R}} y\, P_Y(dy),$$
where $\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$, $-\infty < z < \infty$. Verify that all three integrals agree.

(b) Let $X_1, X_2, \ldots, X_k$ be iid $N(0, 1)$ random variables. Let $Y = (X_1 + \cdots + X_k)$ and $Z = Y^2$.
(i) Find the distributions of $Y$ and $Z$.
(ii) Evaluate
$$\int_{\mathbb{R}^k} (x_1 + \cdots + x_k)^2\, dP_{X_1, \ldots, X_k}(x_1, \ldots, x_k), \qquad \int_{\mathbb{R}} y^2\, P_Y(dy), \qquad \text{and} \qquad \int_{\mathbb{R}_+} z\, P_Z(dz).$$


(c) Let $X_1, X_2, \ldots, X_k$ be independent Binomial $(n_i, p)$, $i = 1, 2, \ldots, k$ random variables. Let $Y = (X_1 + \cdots + X_k)$.
(i) Find the distribution $P_Y$ of $Y$.
(ii) Evaluate $\int_{\mathbb{R}^k} (x_1 + \cdots + x_k)\, dP_{X_1, \ldots, X_k}(x_1, \ldots, x_k)$ and $\int_{\mathbb{R}} y\, P_Y(dy)$.

6.4 Let $X$ be a random variable such that $M_X(t) \equiv E(e^{tX}) < \infty$ for $|t| < \epsilon$ for some $\epsilon > 0$.
(a) Show that $E(e^{tX} |X|^r) < \infty$ for all $r > 0$ and $|t| < \epsilon$.
(b) Show that $M_X^{(r)}(t)$, the $r$th derivative of $M_X(t)$ for $r \in \mathbb{N}$, satisfies $M_X^{(r)}(t) = E(e^{tX} X^r)$ for $|t| < \epsilon$.
(c) Verify (2.25).
(Hint: (a) First show that for $t_1 \in (-\epsilon, \epsilon)$, there exists a $t_2 \in (-\epsilon, \epsilon)$ such that $|t_1| < |t_2| < \epsilon$ and for some $C < \infty$, $e^{t_1 x} |x|^r \le C e^{|t_2 x|}$ for all $x$ in $\mathbb{R}$. (b) Verify that for all $x \in \mathbb{R}$, $|e^x - 1| \le |x| e^{|x|}$. Now use (a) and the DCT to show that $M_X(t)$ is differentiable and $M_X^{(1)}(t) = E(e^{tX} X)$ for all $|t| < \epsilon$. Now complete the proof by induction.)

6.5 Let $X$ be a random variable.
(a) Show that $\phi(r) \equiv (E|X|^r)^{1/r}$ is nondecreasing on $(0, \infty)$.
(b) Show that $\phi(r) \equiv \log E|X|^r$ is convex in $(0, r_0)$ if $E|X|^{r_0} < \infty$.
(c) Let $M = \sup\{x : P(|X| > x) > 0\}$. Show that
(i) $\lim_{r \uparrow \infty} \phi(r) = M$ (with $\phi$ as in part (a)),
(ii) $\lim_{n \to \infty} \dfrac{E|X|^{n+1}}{E|X|^n} = M$.
(Hint: For $M < \infty$, note that $E|X|^r \ge (M - \epsilon)^r P(|X| > M - \epsilon)$ for any $\epsilon > 0$.)

6.6 Show that if equality holds in (2.20), then there exist constants $a$ and $b$ such that $P(Y = aX + b) = 1$. (Hint: Show that there exists a constant $a$ such that $\mathrm{Var}(Y - aX) = 0$.)

6.7 Determine $C$ and its base $B$ explicitly in Examples 6.3.3 and 6.3.4.

6.8 (a) Show that $D$ in Example 6.3.5 is not a finite dimensional cylinder set. (Hint: Note that $\lim_{n\to\infty} \frac{1}{n} \sum_{j=1}^{n} x_j$ is not determined by the values of finitely many $x_i$'s.)


(b) Find three other such examples of sets $D$ in $\mathbb{R}^\infty$ that are not finite dimensional cylinder sets.

6.9 Establish the assertion in Remark 6.3.3 by completing the following steps:
(a) Show that the coordinate map $f_n(x) \equiv x_n$ from $\mathbb{R}^\infty$ to $\mathbb{R}$ is continuous under the metric $d$ of (3.3). (Conclude, using Example 1.1.6, that $\mathcal{R}^{\mathbb{N}} \subset \mathcal{B}(\mathbb{R}^\infty)$.)
(b) Let $\mathcal{C}_1 \equiv \{A : A = (a_1, b_1) \times \cdots \times (a_k, b_k) \times \mathbb{R}^\infty,\ -\infty \le a_i < b_i \le \infty,\ 1 \le i \le k, \text{ for some } k < \infty\}$ and $\mathcal{C}_2 \equiv \{A : A \text{ is an open ball in } (\mathbb{R}^\infty, d)\}$. Show that $\sigma\langle\mathcal{C}_2\rangle \subset \sigma\langle\mathcal{C}_1\rangle$.
(c) Show that $\sigma\langle\mathcal{C}_2\rangle = \mathcal{B}(\mathbb{R}^\infty)$ by showing that every open set in $(\mathbb{R}^\infty, d)$ is a countable union of open balls.

6.10 Show that the family $Q_{\mathbb{N}}$ defined in (3.12) satisfies the consistency conditions (3.1) and (3.2) of Theorem 6.3.1.

6.11 Verify that the family of finite dimensional distributions defined by the right side of (3.14) satisfies the conditions of Theorem 6.3.4.

6.12 Verify that the family of distributions defined in (3.18) satisfies conditions (3.1) and (3.2) of Theorem 6.3.1.
(Hint: Use the fact that for any $k \ge 1$, any $\mu = (\mu_1, \mu_2, \ldots, \mu_k) \in \mathbb{R}^k$, and any nonnegative definite $k \times k$ matrix $\Sigma \equiv ((\sigma_{ij}))_{k \times k}$, there is a unique probability distribution $\nu$ such that for any $s = (s_1, s_2, \ldots, s_k)$ in $\mathbb{R}^k$,
$$\int_{\mathbb{R}^k} \exp\Big( \sum_{i=1}^{k} s_i x_i \Big)\, \nu(dx) = \exp\Big( \sum_{i=1}^{k} s_i \mu_i + \frac{1}{2} \sum_{i=1}^{k} \sum_{j=1}^{k} s_i s_j \sigma_{ij} \Big).$$
Observe that this implies that for $s = (s_1, s_2, \ldots, s_k)$ in $\mathbb{R}^k$, the induced distribution (under $\nu$) on $\mathbb{R}$ by the map $g(x) = \sum_{i=1}^{k} s_i x_i$ from $\mathbb{R}^k \to \mathbb{R}$ is univariate normal with mean $\sum_{i=1}^{k} s_i \mu_i$ and variance $\sum_{i=1}^{k} \sum_{j=1}^{k} s_i s_j \sigma_{ij}$.)

6.13 Show that the set $D \equiv C[0, 1]$ of continuous functions from $[0, 1]$ to $\mathbb{R}$ is not a member of the σ-algebra $\mathcal{F} \equiv (\mathcal{B}(\mathbb{R}))^{[0,1]}$.
(Hint: If $D \in \mathcal{F}$, then by Proposition 6.3.5, $D$ is of the form $\pi_{A_1}^{-1}(B)$ for some $B$ in $\mathcal{B}(\mathbb{R}^\infty)$, where $A_1 \subset [0, 1]$ is countable. Show that for any such $A_1$ and $B$, there exist functions $f : [0, 1] \to \mathbb{R}$ such that $f \in \pi_{A_1}^{-1}(B)$ but $f$ is not continuous on $[0, 1]$.)

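6.14 Show that $K \equiv \big\{\omega : \omega \in \mathbb{R}^{[0,1]},\ \sup_{0 \le \alpha \le 1} |\omega(\alpha)| < 1\big\}$ is not in $\mathcal{F} \equiv (\mathcal{B}(\mathbb{R}))^{[0,1]}$.
(Hint: Observe that $\sup_{0 \le \alpha \le 1} |\omega(\alpha)|$ is not determined by the values of $\omega(\alpha)$ for countably many $\alpha$'s.)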

6.15 Let $\{\mu_i\}_{i\ge 1}$ be a sequence of probability distributions on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ and let $\mu$ be a probability distribution on $\mathbb{N}$ with $p_i \equiv \mu(\{i\})$, $i \ge 1$.
(a) Verify that $\nu(\cdot) \equiv \sum_{i\ge 1} p_i \mu_i(\cdot)$ is a probability distribution on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$.
(b) (i) Show that there exist a probability space $(\Omega, \mathcal{F}, P)$ and a collection of independent random variables $\{J, X_1, X_2, \ldots\}$ on $(\Omega, \mathcal{F}, P)$ such that for each $i \ge 1$, $X_i$ has distribution $\mu_i$ and $J \sim \mu$.
(ii) Let $Y = X_J$, i.e., $Y(\omega) \equiv X_{J(\omega)}(\omega)$. Show that $Y$ is a random variable on $(\Omega, \mathcal{F}, P)$ and $Y \sim \nu$.

6.16 Let $F$ be a cdf on $\mathbb{R}$ and let $F$ be decomposed as $F = \alpha F_d + \beta F_{ac} + \gamma F_{sc}$ where $\alpha, \beta, \gamma \in [0, 1]$ and $\alpha + \beta + \gamma = 1$ and $F_d$, $F_{ac}$, $F_{sc}$ are discrete, absolutely continuous, and singular continuous cdfs on $\mathbb{R}$ (cf. (4.5.3)). Show that there exist independent random variables $X_1, X_2, X_3$ and $J$ on some probability space such that $X_1 \sim F_d$, $X_2 \sim F_{ac}$, $X_3 \sim F_{sc}$, $P(J = 1) = \alpha$, $P(J = 2) = \beta$, $P(J = 3) = \gamma$ and $X_J \sim F$, where $\sim$ means "has cdf".

6.17 Let $\mu$ be a probability measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. For each $x$ in $\mathbb{R}$, let $F(x, \cdot)$ be a cdf on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Let $\psi(x, t) \equiv \inf\{y : F(x, y) \ge t\}$, for $x$ in $\mathbb{R}$, $0 < t < 1$. Assume that $\psi(\cdot, \cdot) : \mathbb{R} \times (0, 1) \to \mathbb{R}$ is measurable. Let $X$ and $U$ be independent random variables on some probability space $(\Omega, \mathcal{F}, P)$ such that $X \sim \mu$ and $U \sim$ uniform $(0, 1)$.
(a) Show that $Y = \psi(X, U)$ is a random variable.
(b) Show that $P(Y \le y) = \int_{\mathbb{R}} F(x, y)\, \mu(dx)$.
(The distribution of $Y$ is called a mixture of distributions with $\mu$ as the mixing distribution. This is of relevance in Bayesian statistical inference.)

6.18 (a) Let $(S_i, \mathcal{S}_i)$, $i = 1, 2$ be two measurable spaces. Let $\mu$ be a probability measure on $(S_1, \mathcal{S}_1)$ and let $Q : S_1 \times \mathcal{S}_2 \to [0, 1]$ be such that for each $x$ in $S_1$, $Q(x, \cdot)$ is a probability measure on $(S_2, \mathcal{S}_2)$ and for each $B$ in $\mathcal{S}_2$, $Q(\cdot, B)$ is $\mathcal{S}_1$-measurable. Define
$$\nu(B_1 \times B_2) \equiv \int_{B_1} Q(x, B_2)\, \mu(dx)$$


on $\mathcal{C} \equiv \{B_1 \times B_2 : B_i \in \mathcal{S}_i,\ i = 1, 2\}$. Show that $\nu$ can be extended to be a probability measure on $\sigma\langle\mathcal{C}\rangle \equiv \mathcal{S}_1 \times \mathcal{S}_2$.
(b) Let $\mu$ and $Q$ be as in Example 6.3.8 (cf. (3.16)). For each $n \ge 1$ let $\nu_n$ be a set function defined by the recursive scheme
$$\nu_1(\cdot) = \mu(\cdot), \qquad \nu_{n+1}(A \times B) = \int_A Q(x, B)\, \nu_n(dx), \quad A \in \mathcal{B}(\mathbb{R}^n),\ B \in \mathcal{B}(\mathbb{R}).$$
Show that for each $n$, $\nu_n$ can be extended to be a probability measure on $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$. (Thus the right side of (3.16) is defined to be $\nu_n(B_1 \times B_2 \times \cdots \times B_n)$.)

6.19 (Bayesian paradigm). Consider the setup in Problem 6.18 (a). Let $\lambda(B) \equiv \nu(S_1 \times B) = \int_{S_1} Q(x, B)\, \mu(dx)$ for all $B$ in $\mathcal{S}_2$.
(a) Verify that $\lambda$ is a probability measure on $(S_2, \mathcal{S}_2)$.
(b) Now fix $B_1$ in $\mathcal{S}_1$. Show that there exists a function $\tilde{Q}(\cdot, B_1) : S_2 \to [0, 1]$ that is $\langle\mathcal{S}_2, \mathcal{B}(\mathbb{R})\rangle$-measurable such that
$$\nu(B_1 \times B_2) = \int_{B_2} \tilde{Q}(x, B_1)\, \lambda(dx).$$
(Hint: Apply the Radon-Nikodym theorem to the pair $\nu(B_1 \times \cdot)$ and $\lambda(\cdot)$.)
(c) Let $\Omega = S_1 \times S_2$, $\mathcal{F} = \sigma\langle\mathcal{C}\rangle$. For $\omega = (s_1, s_2)$, let $\theta(\omega) = s_1$ and $X(\omega) = s_2$. Think of $\theta$ as the parameter, $X$ as the data, $Q(\theta, \cdot)$ as the distribution of $X$ given $\theta$, $\mu(\cdot)$ as the prior distribution of $\theta$ and $\tilde{Q}(x, B_1)$ as the posterior probability that $\theta$ is in $B_1$ given the data $X = x$. Compute $\tilde{Q}(x, B_1)$ when $(S_i, \mathcal{S}_i) = (\mathbb{R}, \mathcal{B}(\mathbb{R}))$, $i = 1, 2$, $\mu(\cdot) \sim N(0, 1)$, $Q(\theta, \cdot) \sim N(\theta, 1)$.

6.20 Let $X$ be a random variable on some probability space $(\Omega, \mathcal{F}, P)$. Recall that a random variable $X$ is
(a) discrete if there is a finite or countable set $D \equiv \{a_j : 1 \le j \le k \le \infty\}$ such that $P(X \in D) = 1$,
(b) continuous if for every $x \in \mathbb{R}$, $P(X = x) = 0$ or equivalently the cdf $F_X(\cdot)$ is continuous on all of $\mathbb{R}$,
(c) absolutely continuous if there exists a nonnegative Borel measurable function $f_X(\cdot)$ on $\mathbb{R}$ such that for any $-\infty < a < b < \infty$,
$$P(a < X \le b) = \int_{(a, b]} f_X(\cdot)\, dm$$
or equivalently the induced measure $P X^{-1} \ll m$,


(d) singular if $P X^{-1} \perp m$ or equivalently $F_X'(\cdot) = 0$ a.e. $m$,
(e) singular continuous if it is singular and continuous.

Let $g : \mathbb{R} \to \mathbb{R}$ be Borel measurable and $Y = g(X)$.
(a) Show that if $X$ is discrete then so is $Y$ but not conversely.
(b) Show that if $X$ is continuous and $g$ is (1-1) on the range of $X$, then $Y$ is continuous.
(c) Show that if $X$ is absolutely continuous with pdf $f_X(\cdot)$ and $g$ is absolutely continuous on bounded intervals such that $g'(\cdot) > 0$ a.e. ($m$), then $Y$ is also absolutely continuous with pdf
$$f_Y(y) = \frac{f_X\big(g^{-1}(y)\big)}{g'\big(g^{-1}(y)\big)}.$$
(d) Let $X$ be as in (c) above. Suppose $g$ is absolutely continuous on bounded intervals and there exist disjoint intervals $\{I_j\}_{1 \le j \le k}$, $1 \le k \le \infty$, such that $\bigcup_{1 \le j \le k} I_j = \mathbb{R}$ and for each $j$, either $g'(\cdot) > 0$ a.e. ($m$) on $I_j$ or $g'(\cdot) < 0$ a.e. ($m$) on $I_j$. Show that $Y$ is also absolutely continuous with pdf
$$f_Y(y) = \sum_{x_j \in D(y)} \frac{f_X(x_j)}{|g'(x_j)|}$$
where $D(y) \equiv \{x_j : x_j \in I_j,\ g(x_j) = y\}$.
(e) Use (c) and (d) to compute the pdf of $Y$ when
(i) $X \sim N(0, 1)$, $g(x) = e^x$.
(ii) $X \sim N(0, 1)$, $g(x) = x^2$.
(iii) $X \sim N(0, 1)$, $g(x) = \sin 2\pi x$.
(iv) $X \sim \exp(1)$, $g(x) = e^{-x}$.

6.21 (Simple random sampling without replacement). Let $S \equiv \{1, 2, \ldots, m\}$, $1 < m < \infty$. Fix $1 \le n \le m$. Choose an element $X_1$ from $S$ such that the probability that $X_1 = j$ is $\frac{1}{m}$ for all $j \in S$. Next, choose an element $X_2$ from $S - \{X_1\}$ such that the probability that $X_2 = j$ is $\frac{1}{m-1}$ for $j \in S - \{X_1\}$. Continue this procedure for $n$ steps. Write the outcome as the ordered vector $\omega \equiv (X_1, X_2, \ldots, X_n)$.
(a) Identify the sample space $\Omega$, the σ-algebra $\mathcal{F}$ and the probability measure $P$ for this experiment.
(b) Show that for any permutation $\sigma$ of $\{1, 2, \ldots, n\}$, the random vector $Y_\sigma = (X_{\sigma(1)}, X_{\sigma(2)}, \ldots, X_{\sigma(n)})$ has the same distribution as $(X_1, X_2, \ldots, X_n)$.


(c) Conclude that $\{X_i\}_{1 \le i \le n}$ are identically distributed and that $EX_i$, $\mathrm{Cov}(X_i, X_j)$, $i \ne j$ are independent of $i$ and $j$ and compute them.
(d) Answer the same questions (a)-(c) if the sampling is changed to with replacement, i.e., at each stage $i$, $P(X_i = j) = \frac{1}{m}$ for all $j \in S$.
(e) In (d), let $D$ be the number of distinct units in the sample. Find $E(D)$ and $\mathrm{Var}(D)$.

6.22 Let $X$ be a nonnegative random variable. Show that
$$\sqrt{1 + (EX)^2} \le E\sqrt{1 + X^2} \le 1 + EX.$$
(Note that $f(x) \equiv \sqrt{1 + x^2}$ is convex on $[0, \infty)$ and bounded by $1 + x$.)

6.23 Let $X$ and $Y$ be nonnegative random variables defined on a probability space $(\Omega, \mathcal{F}, P)$. Suppose $X \cdot Y \ge 1$ w.p. 1. Show that $EX \cdot EY \ge 1$. (Hint: Use Cauchy-Schwarz on $\sqrt{X}\sqrt{Y}$.)

6.24 Let $\mu$ be a probability measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Show that there is a random variable $X$ on the Lebesgue space $([0, 1], \mathcal{B}([0, 1]), m)$ such that $m X^{-1} \equiv \mu$ where $m$ is the Lebesgue measure. Extend this to $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$, where $k$ is an integer $> 1$. (Note: This is true for any Polish space, i.e., a complete separable metric space, see Billingsley (1968).)

7 Independence

7.1 Independent events and random variables

Although a probability space is nothing more than a measure space with the measure of the whole space equal to one, probability theory is not merely a subset of measure theory. A distinguishing and fundamental feature of probability theory is the notion of independence.

Definition 7.1.1: Let $(\Omega, \mathcal{F}, P)$ be a probability space and $\{B_1, B_2, \ldots, B_n\} \subset \mathcal{F}$ be a finite collection of events.

(i) $B_1, B_2, \ldots, B_n$ are called independent w.r.t. $P$ if

$$P\Big( \bigcap_{j=1}^{k} B_{i_j} \Big) = \prod_{j=1}^{k} P(B_{i_j}) \tag{1.1}$$

for all $\{i_1, i_2, \ldots, i_k\} \subset \{1, 2, \ldots, n\}$, $1 \le k \le n$.

(ii) $B_1, B_2, \ldots, B_n$ are called pairwise independent w.r.t. $P$ if $P(B_i \cap B_j) = P(B_i) P(B_j)$ for all $i, j$, $i \ne j$.

Note that a collection $B_1, B_2, \ldots, B_n$ of events may be independent with respect to one probability measure $P$ but not with respect to another measure $P'$. Note also that pairwise independence does not imply independence (Problem 7.1).

Definition 7.1.2: Let $(\Omega, \mathcal{F}, P)$ be a probability space. A collection of events $\{B_\alpha, \alpha \in A\} \subset \mathcal{F}$ is called independent w.r.t. $P$ if for every finite


subcollection $\{\alpha_1, \alpha_2, \ldots, \alpha_k\} \subset A$, $1 \le k < \infty$,

$$P\Big( \bigcap_{i=1}^{k} B_{\alpha_i} \Big) = \prod_{i=1}^{k} P(B_{\alpha_i}). \tag{1.2}$$
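The remark after Definition 7.1.1 that pairwise independence does not imply independence can be checked on a standard small example. The sketch below is illustrative only and is not taken from the text: it assumes two fair coin tosses with the events $A$ (first toss is a head), $B$ (second toss is a head) and $C$ (both tosses agree), and the helper names `prob` and `factorizes` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product

# Sample space: two fair coin tosses, all four outcomes equally likely (an assumed example).
omega = list(product("HT", repeat=2))


def prob(event):
    """Probability of an event (a set of outcomes) under the uniform measure."""
    return Fraction(len(event), len(omega))


A = {w for w in omega if w[0] == "H"}      # first toss is a head
B = {w for w in omega if w[1] == "H"}      # second toss is a head
C = {w for w in omega if w[0] == w[1]}     # both tosses agree
events = [A, B, C]


def factorizes(subcollection):
    """True if P(intersection) equals the product of the probabilities, as in (1.1)."""
    intersection = set(omega)
    product_of_probs = Fraction(1)
    for event in subcollection:
        intersection &= event
        product_of_probs *= prob(event)
    return prob(intersection) == product_of_probs


# Pairwise independence: (1.1) for every pair only.
pairwise = all(factorizes(pair) for pair in combinations(events, 2))
# Independence: (1.1) for every subcollection of size 2 and 3.
independent = all(
    factorizes(sub) for k in (2, 3) for sub in combinations(events, k)
)
```

Here every pair factorizes, but $P(A \cap B \cap C) = 1/4$ while $P(A)P(B)P(C) = 1/8$, so the triple fails (1.1) for $k = 3$.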

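Definition 7.1.3: Let $(\Omega, \mathcal{F}, P)$ be a probability space. Let $A$ be a nonempty set. For each $\alpha$ in $A$, let $\mathcal{G}_\alpha \subset \mathcal{F}$ be a collection of events. Then the family $\{\mathcal{G}_\alpha : \alpha \in A\}$ is called independent w.r.t. $P$ if for every choice of $B_\alpha$ in $\mathcal{G}_\alpha$ for $\alpha$ in $A$, the collection of events $\{B_\alpha : \alpha \in A\}$ is independent w.r.t. $P$ as in Definition 7.1.2.

Definition 7.1.4: Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $\{X_\alpha : \alpha \in A\}$ be a collection of random variables on $(\Omega, \mathcal{F}, P)$. Then the collection $\{X_\alpha : \alpha \in A\}$ is called independent w.r.t. $P$ if the family of σ-algebras $\{\sigma\langle X_\alpha\rangle : \alpha \in A\}$ is independent w.r.t. $P$, where $\sigma\langle X_\alpha\rangle$ is the σ-algebra generated by $X_\alpha$, i.e.,

$$\sigma\langle X_\alpha\rangle \equiv \{X_\alpha^{-1}(B) : B \in \mathcal{B}(\mathbb{R})\}. \tag{1.3}$$

Note that the collection $\{X_\alpha : \alpha \in A\}$ is independent iff for any $\{\alpha_1, \alpha_2, \ldots, \alpha_k\} \subset A$, and $B_i \in \mathcal{B}(\mathbb{R})$, for $i = 1, 2, \ldots, k$, $1 \le k < \infty$,

$$P(X_{\alpha_i} \in B_i,\ i = 1, 2, \ldots, k) = \prod_{i=1}^{k} P(X_{\alpha_i} \in B_i). \tag{1.4}$$

It turns out that if (1.4) holds for all $B_i$ of the form $B_i = (-\infty, x_i]$, $x_i \in \mathbb{R}$, then it holds for all $B_i \in \mathcal{B}(\mathbb{R})$, $i = 1, 2, \ldots, k$. This follows from the proposition below.

Proposition 7.1.1: Let $(\Omega, \mathcal{F}, P)$ be a probability space. Let $A$ be a nonempty set. Let $\mathcal{G}_\alpha \subset \mathcal{F}$ be a π-system for each $\alpha$ in $A$. Let $\{\mathcal{G}_\alpha : \alpha \in A\}$ be independent w.r.t. $P$. Then the family of σ-algebras $\{\sigma\langle\mathcal{G}_\alpha\rangle : \alpha \in A\}$ is also independent w.r.t. $P$.

Proof: Fix $2 \le k < \infty$, $\{\alpha_1, \alpha_2, \ldots, \alpha_k\} \subset A$, $B_i \in \mathcal{G}_{\alpha_i}$, $i = 1, 2, \ldots, k-1$. Let

$$\mathcal{L} \equiv \Big\{ B : B \in \sigma\langle\mathcal{G}_{\alpha_k}\rangle,\ P(B_1 \cap \cdots \cap B_{k-1} \cap B) = \Big( \prod_{i=1}^{k-1} P(B_i) \Big) P(B) \Big\}. \tag{1.5}$$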


It is easy to verify that $\mathcal{L}$ is a λ-system. By hypothesis, $\mathcal{L}$ contains the π-system $\mathcal{G}_{\alpha_k}$. Hence by the π-λ theorem (cf. Theorem 1.1.2), $\mathcal{L} = \sigma\langle\mathcal{G}_{\alpha_k}\rangle$. Iterating the above argument $k$ times completes the proof. □

Corollary 7.1.2: A collection $\{X_\alpha : \alpha \in A\}$ of random variables on a probability space $(\Omega, \mathcal{F}, P)$ is independent w.r.t. $P$ iff for any $\{\alpha_1, \alpha_2, \ldots, \alpha_k\} \subset A$ and any $x_1, x_2, \ldots, x_k$ in $\mathbb{R}$, the joint cdf $F_{\alpha_1, \alpha_2, \ldots, \alpha_k}$ of $(X_{\alpha_1}, X_{\alpha_2}, \ldots, X_{\alpha_k})$ is the product of the marginal cdfs $F_{\alpha_i}$, i.e.,

$$F_{\alpha_1, \alpha_2, \ldots, \alpha_k}(x_1, x_2, \ldots, x_k) \equiv P(X_{\alpha_i} \le x_i,\ i = 1, 2, \ldots, k) = \prod_{i=1}^{k} P(X_{\alpha_i} \le x_i) = \prod_{i=1}^{k} F_{\alpha_i}(x_i). \tag{1.6}$$

Proof: For the 'if' part let $\mathcal{G}_\alpha \equiv \{X_\alpha^{-1}(-\infty, x] : x \in \mathbb{R}\}$, $\alpha \in A$. Now apply Proposition 7.1.1. The 'only if' part is easy. □

Remark 7.1.1: If the probability distribution of $(X_{\alpha_1}, X_{\alpha_2}, \ldots, X_{\alpha_k})$ is absolutely continuous w.r.t. the Lebesgue measure $m_k$ on $\mathbb{R}^k$, then (1.6) and hence the independence of $\{X_{\alpha_1}, X_{\alpha_2}, \ldots, X_{\alpha_k}\}$ is equivalent to the condition that

$$f_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}(x_1, x_2, \ldots, x_k) = \prod_{i=1}^{k} f_{\alpha_i}(x_i), \tag{1.7}$$

a.e. ($m_k$), where $f_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}$ is the joint density of $(X_{\alpha_1}, X_{\alpha_2}, \ldots, X_{\alpha_k})$, and $f_{\alpha_i}$ is the marginal density of $X_{\alpha_i}$, $i = 1, 2, \ldots, k$. See Problem 7.18.

Proposition 7.1.3: Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $\{X_1, X_2, \ldots, X_k\}$, $2 \le k < \infty$, be a collection of random variables on $(\Omega, \mathcal{F}, P)$.

(i) Then $\{X_1, X_2, \ldots, X_k\}$ is independent iff

$$E \prod_{i=1}^{k} h_i(X_i) = \prod_{i=1}^{k} E h_i(X_i) \tag{1.8}$$

for all bounded Borel measurable functions $h_i : \mathbb{R} \to \mathbb{R}$, $i = 1, 2, \ldots, k$.

(ii) If $X_1, X_2$ are independent and $E|X_1| < \infty$, $E|X_2| < \infty$, then

$$E|X_1 X_2| < \infty \quad \text{and} \quad EX_1 X_2 = EX_1\, EX_2. \tag{1.9}$$

Proof: (i) If (1.8) holds, then taking $h_i = I_{B_i}$ with $B_i \in \mathcal{B}(\mathbb{R})$, $i = 1, 2, \ldots, k$ yields the independence of $\{X_1, X_2, \ldots, X_k\}$. Conversely,


if $\{X_1, X_2, \ldots, X_k\}$ are independent, then (1.8) holds for $h_i = I_{B_i}$ for $B_i \in \mathcal{B}(\mathbb{R})$, $i = 1, 2, \ldots, k$, and hence for simple functions $\{h_1, h_2, \ldots, h_k\}$. Now (1.8) follows from the BCT.

(ii) Note that by the change of variable formula (Proposition 6.2.1)

$$E|X_1 X_2| = \int_{\mathbb{R}^2} |x_1 x_2|\, dP_{X_1, X_2}(x_1, x_2), \qquad E|X_i| = \int_{\mathbb{R}} |x_i|\, dP_{X_i}(x_i), \quad i = 1, 2,$$

where $P_{X_1, X_2}$ is the joint distribution of $(X_1, X_2)$ and $P_{X_i}$ is the marginal distribution of $X_i$, $i = 1, 2$. Also, by the independence of $X_1$ and $X_2$, $P_{X_1, X_2}$ is equal to the product measure $P_{X_1} \times P_{X_2}$. Hence, by Tonelli's theorem,

$$E|X_1 X_2| = \int_{\mathbb{R}^2} |x_1 x_2|\, dP_{X_1, X_2}(x_1, x_2) = \int_{\mathbb{R}^2} |x_1 x_2|\, dP_{X_1}(x_1)\, dP_{X_2}(x_2) = \Big( \int_{\mathbb{R}} |x_1|\, dP_{X_1}(x_1) \Big) \Big( \int_{\mathbb{R}} |x_2|\, dP_{X_2}(x_2) \Big) = E|X_1|\, E|X_2| < \infty.$$

Now using Fubini's theorem, one gets (1.9). □

Remark 7.1.2: Note that the converse to (ii) above need not hold. That is, if X1 and X2 are two random variables such that E|X1 | < ∞, E|X2 | < ∞, E|X1 X2 | < ∞, and EX1 X2 = EX1 EX2 , then X1 and X2 need not be independent.
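A concrete instance of Remark 7.1.2 can be computed exactly. The following sketch is an illustration under an assumed example, not one given in the text: $X_1$ is uniform on $\{-1, 0, 1\}$ and $X_2 = X_1^2$; the helper `expect` is a hypothetical name.

```python
from fractions import Fraction

# Assumed example: X1 uniform on {-1, 0, 1} and X2 = X1**2 (a function of X1).
support = [-1, 0, 1]
weight = Fraction(1, 3)


def expect(f):
    """E f(X1) for X1 uniform on the three-point support."""
    return sum(weight * f(x) for x in support)


e_x1 = expect(lambda x: x)                 # E X1
e_x2 = expect(lambda x: x * x)             # E X2 = E X1^2
e_x1x2 = expect(lambda x: x * x * x)       # E X1 X2 = E X1^3

# Probabilities of the events {X1 = 0}, {X2 = 0} and their intersection.
p_x1_zero = expect(lambda x: 1 if x == 0 else 0)
p_x2_zero = expect(lambda x: 1 if x * x == 0 else 0)
p_both_zero = expect(lambda x: 1 if (x == 0 and x * x == 0) else 0)
```

In this example $EX_1X_2 = 0 = EX_1\,EX_2$, yet $P(X_1 = 0, X_2 = 0) = 1/3 \ne 1/9 = P(X_1 = 0)P(X_2 = 0)$, so (1.4) fails and the two variables are not independent.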

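7.2 Borel-Cantelli lemmas, tail σ-algebras, and Kolmogorov's zero-one law

In this section some basic results on classes of independent events are established. These will play an important role in proving laws of large numbers in Chapter 8.

Definition 7.2.1: Let $(\Omega, \mathcal{F})$ be a measurable space and $\{A_n\}_{n\ge 1}$ be a sequence of sets in $\mathcal{F}$. Then

$$\limsup_{n\to\infty} A_n \equiv \overline{\lim}\, A_n \equiv \bigcap_{k=1}^{\infty} \bigcup_{n \ge k} A_n \tag{2.1}$$

$$\liminf_{n\to\infty} A_n \equiv \underline{\lim}\, A_n \equiv \bigcup_{k=1}^{\infty} \bigcap_{n \ge k} A_n. \tag{2.2}$$

Proposition 7.2.1: Both $\overline{\lim}\, A_n$ and $\underline{\lim}\, A_n \in \mathcal{F}$ and

$$\overline{\lim}\, A_n = \{\omega : \omega \in A_n \text{ for infinitely many } n\},$$
$$\underline{\lim}\, A_n = \{\omega : \omega \in A_n \text{ for all but a finite number of } n\}.$$

Proof: Since $\{A_n\}_{n\ge 1} \subset \mathcal{F}$ and $\mathcal{F}$ is a σ-algebra, $B_k = \bigcup_{n \ge k} A_n \in \mathcal{F}$ for each $k \in \mathbb{N}$ and hence $\overline{\lim}\, A_n \equiv \bigcap_{k=1}^{\infty} B_k \in \mathcal{F}$. Next,

$\omega \in \overline{\lim}\, A_n$
⟺ $\omega \in B_k$ for all $k = 1, 2, \ldots$
⟺ for each $k$, there exists $n_k \ge k$ such that $\omega \in A_{n_k}$
⟺ $\omega \in A_n$ for infinitely many $n$.

The proof for $\underline{\lim}\, A_n$ is similar. □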
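The set-theoretic definitions (2.1) and (2.2) can be evaluated directly for a periodic sequence of finite sets, where truncating the infinite unions and intersections loses nothing. The sketch below is only an illustration: the sequence `A`, the truncation levels `K` and `N`, and the function names are assumptions, and the truncation is exact here only because the sequence has period 2 and `N - K` exceeds that period.

```python
def A(n):
    """Assumed periodic example: A_n = {0, 1} for n odd, {1, 2} for n even."""
    return {0, 1} if n % 2 == 1 else {1, 2}


def lim_sup(sets, K, N):
    """Truncated version of (2.1): intersection over k <= K of the union over k <= n <= N."""
    result = None
    for k in range(1, K + 1):
        union = set().union(*(sets(n) for n in range(k, N + 1)))
        result = union if result is None else result & union
    return result


def lim_inf(sets, K, N):
    """Truncated version of (2.2): union over k <= K of the intersection over k <= n <= N."""
    result = set()
    for k in range(1, K + 1):
        result |= set.intersection(*(sets(n) for n in range(k, N + 1)))
    return result
```

For this sequence, `lim_sup(A, 10, 20)` collects the points lying in infinitely many of the sets, and `lim_inf(A, 10, 20)` the points lying in all but finitely many, in line with Proposition 7.2.1.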

In probability theory, $\overline{\lim}\, A_n$ is referred to as the event that "$A_n$ happens infinitely often (i.o.)" and $\underline{\lim}\, A_n$ as the event that "all but finitely many $A_n$'s happen."

Example 7.2.1: Let $\Omega = \mathbb{R}$, $\mathcal{F} = \mathcal{B}(\mathbb{R})$, and let

$$A_n = \begin{cases} \big[0, \tfrac{1}{n}\big] & \text{for } n \text{ odd} \\ \big[1 - \tfrac{1}{n}, 1\big] & \text{for } n \text{ even.} \end{cases}$$

Then $\overline{\lim}\, A_n = \{0, 1\}$, $\underline{\lim}\, A_n = \emptyset$.

The following result on the probabilities of $\overline{\lim}\, A_n$ and $\underline{\lim}\, A_n$ is very useful in probability theory.

Theorem 7.2.2: Let $(\Omega, \mathcal{F}, P)$ be a probability space and $\{A_n\}_{n\ge 1}$ be a sequence of events in $\mathcal{F}$. Then

(a) (The first Borel-Cantelli lemma). If $\sum_{n=1}^{\infty} P(A_n) < \infty$, then $P(\overline{\lim}\, A_n) = 0$.

(b) (The second Borel-Cantelli lemma). If $\sum_{n=1}^{\infty} P(A_n) = \infty$ and $\{A_n\}_{n\ge 1}$ are pairwise independent, then $P(\overline{\lim}\, A_n) = 1$.

Remark 7.2.1: This result is also called a zero-one law as it asserts that for pairwise independent events $\{A_n\}_{n\ge 1}$, $P(\overline{\lim}\, A_n) = 0$ or $1$ according to $\sum_{n=1}^{\infty} P(A_n) < \infty$ or equal to $\infty$.


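Proof:

(a) Let $Z_n \equiv \sum_{j=1}^{n} I_{A_j}$. Then $Z_n \uparrow Z \equiv \sum_{j=1}^{\infty} I_{A_j}$ and by the MCT, $EZ_n \equiv \sum_{j=1}^{n} P(A_j) \uparrow EZ$. Thus, $\sum_{j=1}^{\infty} P(A_j) < \infty \Rightarrow EZ < \infty \Rightarrow Z < \infty$ w.p. 1 $\Rightarrow P(Z = \infty) = 0$. But the event $\overline{\lim}\, A_n = \{Z = \infty\}$ and so (a) follows.

(b) Without loss of generality, assume $P(A_j) > 0$ for some $j$. Let $J_n = Z_n / EZ_n$ for $n \ge j$, where $Z_n$ is as above. Then, $EJ_n = 1$ and by the pairwise independence of $\{A_n\}_{n\ge 1}$, the variance of $J_n$ is

$$\mathrm{Var}(J_n) = \frac{\sum_{j=1}^{n} P(A_j)(1 - P(A_j))}{(EZ_n)^2} \le \frac{1}{EZ_n}.$$

If $\sum_{j=1}^{\infty} P(A_j) = \infty$, then $EZ_n = \sum_{j=1}^{n} P(A_j) \uparrow \infty$, by the MCT. Thus $EJ_n \equiv 1$, $\mathrm{Var}(J_n) \to 0$ as $n \to \infty$. By Chebychev's inequality, for all $\epsilon > 0$,

$$P(|J_n - 1| > \epsilon) \le \frac{\mathrm{Var}(J_n)}{\epsilon^2} \to 0 \quad \text{as } n \to \infty.$$

Thus, $J_n \to 1$ in probability and hence there exists a subsequence $\{n_k\}_{k\ge 1}$ such that $J_{n_k} \to 1$ w.p. 1 (cf. Theorem 2.5.2). Since $EZ_{n_k} \uparrow \infty$, this implies that $Z_{n_k} \to \infty$ w.p. 1. But $\{Z_n\}_{n\ge 1}$ is nondecreasing in $n$ and hence $Z_n \uparrow \infty$ w.p. 1. Now since $\overline{\lim}\, A_n = \{Z = \infty\}$, it follows that $P(\overline{\lim}\, A_n) = P(Z = \infty) = 1$. □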
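The dichotomy in Theorem 7.2.2 can be made concrete with exact arithmetic on two assumed choices of probabilities. The sketch below is illustrative and makes a stronger assumption than the theorem needs: the events are taken to be fully independent (not just pairwise), so that the probability that none of $A_k, \ldots, A_N$ occurs is a product. The function names are hypothetical.

```python
from fractions import Fraction


def prob_none_occur(k, N):
    """P(no A_n occurs for k <= n <= N), assuming independent events with P(A_n) = 1/n.

    The product of (1 - 1/n) telescopes to (k - 1)/N, which tends to 0 as N grows,
    so with probability one some A_n with n >= k occurs, for every k.
    """
    out = Fraction(1)
    for n in range(k, N + 1):
        out *= 1 - Fraction(1, n)
    return out


def tail_sum(k, N):
    """Union bound on P(some A_n occurs, k <= n <= N) when P(A_n) = 1/n**2.

    The full tail sum is below 1/(k - 1), so it tends to 0 as k grows.
    """
    return sum(Fraction(1, n * n) for n in range(k, N + 1))
```

With $P(A_n) = 1/n$ the series diverges and every tail union has probability one, matching part (b); with $P(A_n) = 1/n^2$ the series converges and the tail unions have probability tending to zero, matching part (a).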

Proposition 7.2.3: Let $\{X_n\}_{n\ge 1}$ be a sequence of random variables on some probability space $(\Omega, \mathcal{F}, P)$.

(a) If $\sum_{n=1}^{\infty} P(|X_n| > \epsilon) < \infty$ for each $\epsilon > 0$, then $P(\lim_{n\to\infty} X_n = 0) = 1$.

(b) If $\{X_n\}_{n\ge 1}$ are pairwise independent and $P(\lim_{n\to\infty} X_n = 0) = 1$, then $\sum_{n=1}^{\infty} P(|X_n| > \epsilon) < \infty$ for each $\epsilon > 0$.

Proof:

(a) Fix $\epsilon > 0$. Let $A_n = \{|X_n| > \epsilon\}$, $n \ge 1$. Then $\sum_{n=1}^{\infty} P(A_n) < \infty \Rightarrow P(\overline{\lim}\, A_n) = 0$, by the first Borel-Cantelli lemma (Theorem 7.2.2 (a)). But

$$(\overline{\lim}\, A_n)^c = \{\omega : \text{there exists } n(\omega) < \infty \text{ such that for all } n \ge n(\omega),\ \omega \notin A_n\}$$
$$= \{\omega : \text{there exists } n(\omega) < \infty \text{ such that } |X_n(\omega)| \le \epsilon \text{ for all } n \ge n(\omega)\} = B_\epsilon, \text{ say.}$$

Thus, $\sum_{n=1}^{\infty} P(A_n) < \infty \Rightarrow P(B_\epsilon) = 1$. Let $B = \bigcap_{r=1}^{\infty} B_{1/r}$. Now note that

$$\Big\{\omega : \lim_{n\to\infty} |X_n(\omega)| = 0\Big\} = \bigcap_{r=1}^{\infty} B_{1/r}.$$

Since $P(B^c) \le \sum_{r=1}^{\infty} P(B_{1/r}^c) = 0$, $P(B) = 1$.

(b) Let $\{X_n\}_{n\ge 1}$ be pairwise independent and $\sum_{n=1}^{\infty} P(|X_n| > \epsilon_0) = \infty$ for some $\epsilon_0 > 0$. Let $A_n = \{|X_n| > \epsilon_0\}$. Since $\{X_n\}_{n\ge 1}$ are pairwise independent, so are $\{A_n\}_{n\ge 1}$. By the second Borel-Cantelli lemma, $P(\overline{\lim}\, A_n) = 1$. But $\omega \in \overline{\lim}\, A_n \Rightarrow \limsup_{n\to\infty} |X_n| \ge \epsilon_0$ and hence $P(\lim_{n\to\infty} |X_n| = 0) = 0$. This contradicts the hypothesis that $P(\limsup_{n\to\infty} |X_n| = 0) = 1$. □

Definition 7.2.2: The tail σ-algebra of a sequence of random variables $\{X_n\}_{n\ge 1}$ on a probability space $(\Omega, \mathcal{F}, P)$ is

$$\mathcal{T} = \bigcap_{n=1}^{\infty} \sigma\langle\{X_j : j \ge n\}\rangle$$

and any $A \in \mathcal{T}$ is called a tail event. Further, any $\mathcal{T}$-measurable random variable is called a tail random variable (w.r.t. $\{X_n\}_{n\ge 1}$).

Tail events are determined by the behavior of the sequence $\{X_n\}_{n\ge 1}$ for large $n$ and they remain unchanged if any finite subcollection of the $X_n$'s are dropped or replaced by another finite set of random variables. Events such as $\{\limsup_{n\to\infty} X_n < x\}$ or $\{\lim_{n\to\infty} X_n = x\}$, $x \in \mathbb{R}$, belong to $\mathcal{T}$. A remarkable result of Kolmogorov is that for any sequence of independent random variables, any tail event has probability zero or one.

Theorem 7.2.4: (Kolmogorov's 0-1 law). Let $\{X_n\}_{n\ge 1}$ be a sequence of independent random variables on a probability space $(\Omega, \mathcal{F}, P)$ and let $\mathcal{T}$ be the tail σ-algebra of $\{X_n\}_{n\ge 1}$. Then $P(A) = 0$ or $1$ for all $A \in \mathcal{T}$.

Remark 7.2.2: Note that in Proposition 7.2.3, the event $A \equiv \{\lim_{n\to\infty} X_n = 0\}$ belongs to $\mathcal{T}$, and hence, by the above theorem, $P(A) = 0$ or $1$. Thus, proving that $P(A) = 1$ is equivalent to proving $P(A) \ne 0$. Kolmogorov's 0-1 law only restricts the possible values of the probabilities of tail events like $A$ to 0 or 1, while the Borel-Cantelli lemmas (Theorem 7.2.2) provide a tool for ascertaining whether the value is 0 or 1. On the other hand, note that Theorem 7.2.2 requires only pairwise independence of $\{A_n\}_{n\ge 1}$ but Kolmogorov's 0-1 law requires the full independence of the sequence $\{X_n\}_{n\ge 1}$.


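Proof: For $n \ge 1$, define the σ-algebras $\mathcal{F}_n$ and $\mathcal{T}_n$ by $\mathcal{F}_n = \sigma\langle\{X_1, \ldots, X_n\}\rangle$ and $\mathcal{T}_n = \sigma\langle\{X_{n+1}, X_{n+2}, \ldots\}\rangle$. Since $X_n$, $n \ge 1$, are independent, $\mathcal{F}_n$ is independent of $\mathcal{T}_n$ for all $n \ge 1$. Since, for each $n$, $\mathcal{T} = \bigcap_{m=n}^{\infty} \mathcal{T}_m$ is a sub-σ-algebra of $\mathcal{T}_n$, this implies $\mathcal{F}_n$ is independent of $\mathcal{T}$ for all $n \ge 1$, and hence $\mathcal{A} \equiv \bigcup_{n=1}^{\infty} \mathcal{F}_n$ is independent of $\mathcal{T}$. It is easy to check that $\mathcal{A}$ is an algebra (and hence a π-system). Hence, by Proposition 7.1.1, $\sigma\langle\mathcal{A}\rangle$ is independent of $\mathcal{T}$. Since $\mathcal{T}$ is also a sub-σ-algebra of $\sigma\langle\mathcal{A}\rangle = \sigma\langle\{X_n : n \ge 1\}\rangle$, this implies $\mathcal{T}$ is independent of itself. Hence, for any $B \in \mathcal{T}$,

\[ P(B \cap B) = P(B) \cdot P(B), \]

which implies $P(B) = 0$ or $1$. □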

Definition 7.2.3: Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $X : \Omega \to \bar{\mathbb{R}}$ be a $\langle\mathcal{F}, \mathcal{B}(\bar{\mathbb{R}})\rangle$-measurable mapping. (Recall the definition of $\mathcal{B}(\bar{\mathbb{R}})$ from (2.1.4).) Then $X$ is called an extended real-valued random variable or an $\bar{\mathbb{R}}$-valued random variable.

Corollary 7.2.5: Let $\mathcal{T}$ be the tail σ-algebra of a sequence of independent random variables $\{X_n\}_{n\ge1}$ on $(\Omega, \mathcal{F}, P)$ and let $X$ be a $\langle\mathcal{T}, \mathcal{B}(\bar{\mathbb{R}})\rangle$-measurable $\bar{\mathbb{R}}$-valued random variable from $\Omega$ to $\bar{\mathbb{R}}$. Then, there exists $c \in \bar{\mathbb{R}}$ such that $P(X = c) = 1$.

Proof: If $P(X \le x) = 0$ for all $x \in \mathbb{R}$, then $P(X = +\infty) = 1$. Hence, suppose that $B \equiv \{x \in \mathbb{R} : P(X \le x) \ne 0\} \ne \emptyset$. Since $\{X \le x\} \in \mathcal{T}$ for all $x \in \mathbb{R}$, $P(X \le x) = 1$ for all $x \in B$. Define $c = \inf\{x : x \in B\}$. Check that $P(X = c) = 1$. □

An immediate implication of Corollary 7.2.5 is that for any sequence of independent random variables $\{X_n\}_{n\ge1}$, the $\bar{\mathbb{R}}$-valued random variables $\limsup_{n\to\infty} X_n$ and $\liminf_{n\to\infty} X_n$ are degenerate, i.e., they are constants w.p. 1.

Example 7.2.2: Let $\{X_n\}_{n\ge1}$ be a sequence of independent random variables on $(\Omega, \mathcal{F}, P)$ with $EX_n = 0$, $EX_n^2 = 1$ for all $n \ge 1$. Let $S_n = X_1 + \ldots + X_n$, $n \ge 1$, and $\Phi(x) = \int_{-\infty}^{x} (\sqrt{2\pi})^{-1} \exp(-y^2/2)\,dy$, $x \in \mathbb{R}$. If $P(S_n \le \sqrt{n}\,x) \to \Phi(x)$ for all $x \in \mathbb{R}$, then

\[ \limsup_{n\to\infty} \frac{S_n}{\sqrt{n}} = +\infty \quad \text{a.s.} \qquad (2.3) \]

To show this, let $S = \limsup_{n\to\infty} S_n/\sqrt{n}$. First it will be shown that $S$ is $\langle\mathcal{T}, \mathcal{B}(\bar{\mathbb{R}})\rangle$-measurable. For any $m \ge 1$, define the variables $T_{m,n} = (X_{m+1} + \ldots + X_n)/\sqrt{n}$ and $S_{m,n} = (X_1 + \ldots + X_m)/\sqrt{n}$, $n > m$. Note that for any fixed $m \ge 1$, $T_{m,n}$ is $\sigma\langle X_{m+1}, \ldots\rangle$-measurable and $S_{m,n}(\omega) \to 0$ as $n \to \infty$ for all $\omega \in \Omega$. Hence, for any $m \ge 1$,

\[ S = \limsup_{n\to\infty} (S_{m,n} + T_{m,n}) = \limsup_{n\to\infty} T_{m,n} \]

is $\sigma\langle X_{m+1}, X_{m+2}, \ldots\rangle$-measurable. Thus, $S$ is measurable with respect to $\mathcal{T} = \bigcap_{m=1}^{\infty} \sigma\langle X_{m+1}, X_{m+2}, \ldots\rangle$. Hence, by Theorem 7.2.4, $P(S = +\infty) \in \{0, 1\}$. If possible, now suppose that $P(S = +\infty) = 0$. Then, by Corollary 7.2.5, there exists $c \in [-\infty, \infty)$ such that $P(S = c) = 1$. Let $A_n = \{S_n > \sqrt{n}\,x\}$, $n \ge 1$, with $x = c + 1$. Then,

\[
\begin{aligned}
0 < 1 - \Phi(x) &= \lim_{n\to\infty} P(A_n) \\
&\le \lim_{n\to\infty} P\Big(\bigcup_{m\ge n} A_m\Big) \\
&= P\Big(\bigcap_{n=1}^{\infty} \bigcup_{m=n}^{\infty} A_m\Big) \\
&= P\Big(\frac{S_n}{\sqrt{n}} > x \ \text{i.o.}\Big) \\
&\le P(S \ge c + 1) = 0.
\end{aligned}
\]

This shows that $P(S = +\infty)$ must be 1. Also see Problem 7.16.

Remark 7.2.3: It will be shown in Chapter 11 that if $\{X_i\}_{i\ge1}$ are independent and identically distributed (iid) random variables with $EX_1 = 0$ and $EX_1^2 = 1$, then

\[ P\Big(\frac{S_n}{\sqrt{n}} \le x\Big) \to \Phi(x) \quad \text{for all } x \in \mathbb{R}. \]

(This is known as the central limit theorem.) Indeed, a stronger result known as the law of the iterated logarithm holds, which says that for such $\{X_i\}_{i\ge1}$,

\[ \limsup_{n\to\infty} \frac{S_n}{\sqrt{2n \log\log n}} = +1 \quad \text{w.p. 1.} \]

7.3 Problems

7.1 Give an example of three events $A_1, A_2, A_3$ on some probability space such that they are pairwise independent but not independent.
(Hint: Consider iid random variables $X_1, X_2, X_3$ with $P(X_1 = 1) = \frac{1}{2} = P(X_1 = 0)$ and the events $A_1 = \{X_1 = X_2\}$, $A_2 = \{X_2 = X_3\}$, $A_3 = \{X_3 = X_1\}$.)

7.2 Let $\{X_\alpha : \alpha \in A\}$ be a collection of independent random variables on some probability space $(\Omega, \mathcal{F}, P)$. For any subset $B \subset A$, let $X_B \equiv \{X_\alpha : \alpha \in B\}$.
(a) Let $B$ be a nonempty proper subset of $A$. Show that the collections $X_B$ and $X_{B^c}$ are independent, i.e., the σ-algebras $\sigma\langle X_B\rangle$ and $\sigma\langle X_{B^c}\rangle$ are independent w.r.t. $P$.
(b) Let $\{B_\gamma : \gamma \in \Gamma\}$ be a partition of $A$ by nonempty proper subsets $B_\gamma$. Show that the σ-algebras in the family $\{\sigma\langle X_{B_\gamma}\rangle : \gamma \in \Gamma\}$ are independent w.r.t. $P$.

7.3 Let $X_1, X_2$ be iid standard exponential random variables, i.e.,

\[ P(X_1 \in A) = \int_{A \cap (0,\infty)} e^{-x}\,dx, \quad A \in \mathcal{B}(\mathbb{R}). \]

Let $Y_1 = \min(X_1, X_2)$ and $Y_2 = \max(X_1, X_2) - Y_1$. Show that $Y_1$ and $Y_2$ are independent. Generalize this to the case of three iid standard exponential random variables.

7.4 Let $\Omega = (0, 1)$, $\mathcal{F} = \mathcal{B}((0, 1))$, the Borel σ-algebra on $(0,1)$, and $P$ be the Lebesgue measure on $(0, 1)$. For each $\omega \in (0, 1)$, let $\omega = \sum_{i=1}^{\infty} \frac{X_i(\omega)}{2^i}$ be the nonterminating binary expansion of $\omega$.
(a) Show that $\{X_i\}_{i\ge1}$ are iid Bernoulli($\frac{1}{2}$) random variables, i.e., $P(X_1 = 0) = \frac{1}{2} = P(X_1 = 1)$.
(Hint: Let $s_i \in \{0, 1\}$, $i = 1, 2, \ldots, k$, $k \in \mathbb{N}$. Show that the set $\{\omega : 0 < \omega < 1, X_i(\omega) = s_i, 1 \le i \le k\}$ is an interval of length $2^{-k}$.)
(b) Show that

\[ Y_1 = \sum_{i=1}^{\infty} \frac{X_{2i-1}}{2^i} \qquad (3.1) \]
\[ Y_2 = \sum_{i=1}^{\infty} \frac{X_{2i}}{2^i} \qquad (3.2) \]

are independent Uniform (0,1) random variables.
(c) Using the fact that the set $\mathbb{N} \times \mathbb{N}$ of lattice points $(m, n)$ is in one-to-one correspondence with $\mathbb{N}$ itself, construct a sequence $\{Y_i\}_{i\ge1}$ of iid Uniform (0,1) random variables such that for each $j$, $Y_j$ is a function of $\{X_i\}_{i\ge1}$.


(d) For any cdf $F$, show that the random variable $X(\omega) \equiv F^{-1}(\omega)$ has cdf $F$, where

\[ F^{-1}(u) = \inf\{x : F(x) \ge u\} \quad \text{for } 0 < u < 1. \qquad (3.3) \]

(e) Let $\{F_i\}_{i\ge1}$ be a sequence of cdfs on $\mathbb{R}$. Using (c), construct a sequence $\{Z_i\}_{i\ge1}$ of independent random variables on $(\Omega, \mathcal{F}, P)$ such that $Z_i$ has cdf $F_i$, $i \ge 1$.
(f) Show that the cdf of the random variable $W \equiv \sum_{i=1}^{\infty} \frac{2X_i}{3^i}$ is the Cantor function (cf. Section 4.5).
(g) Let $p > 1$ be a positive integer. For each $\omega \in (0, 1)$ let $\omega \equiv \sum_{i=1}^{\infty} \frac{V_i(\omega)}{p^i}$ be the nonterminating $p$-nary expansion of $\omega$. Show that $\{V_i\}_{i\ge1}$ are iid and determine the distribution of $V_1$.

7.5 Let $\{X_i\}_{i\ge1}$ be a Markov chain with state space $S = \{0, 1\}$ and transition probability matrix

\[ Q = \begin{pmatrix} q_0 & p_0 \\ p_1 & q_1 \end{pmatrix}, \quad \text{where } p_i = 1 - q_i,\ 0 < q_i < 1,\ i = 0, 1. \]

Let $\tau_1 = \min\{j : X_j = 0\}$ and $\tau_{k+1} = \min\{j : j > \tau_k, X_j = 0\}$, $k = 1, 2, \ldots$. Note that $\tau_k$ is the time of the $k$th visit to the state 0.
(a) Show that $\{\tau_{k+1} - \tau_k : k \ge 1\}$ are iid random variables and independent of $\tau_1$.
(b) Show that $P_i(\tau_1 < \infty) = 1$ and hence $P_i(\tau_k < \infty) = 1$ for all $k \ge 2$, $i = 0, 1$, where $P_i$ denotes the probability distribution with $X_1 = i$ w.p. 1.
(Hint: Show that $\sum_{k=1}^{\infty} P(\tau_1 > k \mid X_1 = i) < \infty$ for $i = 0, 1$ and use the Borel-Cantelli lemma.)
(c) Show also that $E_i(e^{\theta_0 \tau_1}) < \infty$ for some $\theta_0 > 0$, $i = 0, 1$, where $E_i$ denotes the expectation under $P_i$.

7.6 Let $X_1$ and $X_2$ be independent random variables.
(a) Show that for any $p > 0$,

\[ E|X_1 + X_2|^p < \infty \quad \text{iff} \quad E|X_1|^p < \infty,\ E|X_2|^p < \infty. \]

Show that this is false if $X_1$ and $X_2$ are not independent.
(Hint: Use Fubini's theorem to conclude that $E|X_1 + X_2|^p < \infty$ implies that $E|X_1 + x_2|^p < \infty$ for some $x_2$ and hence $E|X_1|^p < \infty$.)


(b) Show that if $E(X_1^2 + X_2^2) < \infty$, then

\[ \mathrm{Var}(X_1 + X_2) = \mathrm{Var}(X_1) + \mathrm{Var}(X_2). \qquad (3.4) \]

Show by an example that (3.4) need not imply the independence of $X_1$ and $X_2$. Show also that if $X_1$ and $X_2$ take only two values each and (3.4) holds, then $X_1$ and $X_2$ are independent.

7.7 Let $X_1$ and $X_2$ be two random variables on a probability space $(\Omega, \mathcal{F}, P)$.
(a) Show that, if

\[ P\big(X_1 \in (a_1, b_1),\ X_2 \in (a_2, b_2)\big) = P\big(X_1 \in (a_1, b_1)\big)\, P\big(X_2 \in (a_2, b_2)\big) \qquad (3.5) \]

for all $a_1, b_1, a_2, b_2$ in a dense set $D$ in $\mathbb{R}$, then $X_1$ and $X_2$ are independent.
(Hint: Show that (3.5) implies that the joint cdf of $(X_1, X_2)$ is the product of the marginal cdfs of $X_1$ and $X_2$ and use Corollary 7.1.2.)
(b) Let $f_i : \mathbb{R} \to \mathbb{R}$, $i = 1, 2$, be two one-one functions such that both $f_i$ and $f_i^{-1}$ are Borel measurable, $i = 1, 2$. Show that $X_1$ and $X_2$ are independent iff $f_1(X_1)$ and $f_2(X_2)$ are independent. Conclude that $X_1$ and $X_2$ are independent iff $e^{X_1}$ and $e^{X_2}$ are independent.

7.8 (a) Let $X_1$ and $X_2$ be two independent bounded random variables. Show that

\[ E\big(p_1(X_1)\, p_2(X_2)\big) = \big(E p_1(X_1)\big)\big(E p_2(X_2)\big) \qquad (3.6) \]

where $p_1(\cdot)$ and $p_2(\cdot)$ are polynomials.
(b) Show that if $X_1$ and $X_2$ are bounded random variables and (3.6) holds for all polynomials $p_1(\cdot)$ and $p_2(\cdot)$, then $X_1$ and $X_2$ are independent.
(Hint: Use the facts that (i) continuous functions on a bounded closed interval $[a, b]$ can be approximated uniformly by polynomials, and (ii) for any interval $(c, d) \subset [a, b]$, any random variable $X$ and $\epsilon > 0$, there exists a continuous function $f$ on $[a, b]$ such that $E|f(X) - I_{(c,d)}(X)| < \epsilon$, provided $P(X = c \text{ or } d) = 0$.)

7.9 Let $\{X_n\}_{n\ge1}$ be a sequence of iid random variables on a probability space $(\Omega, \mathcal{F}, P)$. Let $R = R(\omega)$ be the radius of convergence of the power series $\sum_{n=1}^{\infty} X_n r^n$.


(a) Show that $R$ is a tail random variable.
(Hint: Note that $R = \dfrac{1}{\limsup_{n\to\infty} |X_n|^{1/n}}$.)
(b) Show that if $E(\log |X_1|)^+ = \infty$, then $R = 0$ w.p. 1, and if $E(\log |X_1|)^+ < \infty$, then $R \ge 1$ w.p. 1.
(Hint: Apply the Borel-Cantelli lemmas to $A_n = \{|X_n| > \lambda^n\}$ for each $\lambda > 1$.)

7.10 Let $\{A_n\}_{n\ge1}$ be a sequence of events in $(\Omega, \mathcal{F}, P)$ such that

\[ \sum_{n=1}^{\infty} P(A_n \cap A_{n+1}^c) < \infty \]

and $\lim_{n\to\infty} P(A_n) = 0$. Show that

\[ P\big(\limsup_{n\to\infty} A_n\big) = 0. \]

Show also that $\lim_{n\to\infty} P(A_n) = 0$ can be replaced by $\lim_{n\to\infty} P\big(\bigcap_{j\ge n} A_j\big) = 0$.
(Hint: Let $B_n = A_n \cap A_{n+1}^c$, $n \ge 1$, $B = \limsup B_n$, $A = \limsup A_n$. Show that $A \cap B^c \subset \liminf A_n$.)

7.11 For any nonnegative random variable $X$, show that $E|X| < \infty$ iff $\sum_{n=1}^{\infty} P(|X| > \epsilon n) < \infty$ for every $\epsilon > 0$.

7.12 Let $\{X_i\}_{i\ge1}$ be a sequence of pairwise independent and identically distributed random variables.
(a) Show that $\lim_{n\to\infty} \frac{X_n}{n} = 0$ w.p. 1 iff $E|X_1| < \infty$.
(Hint: $E|X_1| < \infty \iff \sum_{n=1}^{\infty} P(|X_n| > \epsilon n) < \infty$ for all $\epsilon > 0$.)
(b) Show that $E(\log |X_1|)^+ < \infty$ iff $|X_n|^{1/n} \to 1$ w.p. 1.

7.13 Let $\{X_i\}_{i\ge1}$ be a sequence of identically distributed random variables and let $M_n = \max\{|X_j| : 1 \le j \le n\}$.
(a) If $E|X_1|^\alpha < \infty$ for some $\alpha \in (0, \infty)$, then show that

\[ \frac{M_n}{n^{1/\alpha}} \to 0 \quad \text{w.p. 1.} \qquad (3.7) \]

(Hint: Fix $\epsilon > 0$. Let $A_n = \{|X_n| > \epsilon n^{1/\alpha}\}$. Apply the first Borel-Cantelli lemma.)
(b) Show that if $\{X_i\}_{i\ge1}$ are iid satisfying (3.7) for some $\alpha > 0$, then $E|X_1|^\alpha < \infty$.
(Hint: Apply the second Borel-Cantelli lemma.)

7.14 Let $X_1$ and $X_2$ be independent random variables with distributions $\mu_1$ and $\mu_2$. Let $Y = X_1 + X_2$.
(a) Show that the distribution $\mu$ of $Y$ is the convolution $\mu_1 * \mu_2$ as defined by

\[ (\mu_1 * \mu_2)(A) = \int_{\mathbb{R}} \mu_1(A - x)\, \mu_2(dx) \]

(cf. Problem 5.12).
(b) Show that if $X_1$ has a continuous distribution then so does $Y$.
(c) Show that if $X_1$ has an absolutely continuous distribution then so does $Y$ and that the density function of $Y$ is given by

\[ \frac{d\mu}{dm}(x) \equiv f_Y(x) = \int f_{X_1}(x - u)\, \mu_2(du), \]

where $f_{X_1}(x) \equiv \frac{d\mu_1}{dm}(x)$, the probability density of $X_1$.
(d) $Y$ has a discrete distribution iff both $X_1$ and $X_2$ are discrete.

7.15 (AR(1) series). Let $\{X_n\}_{n\ge0}$ be a sequence of random variables such that for some $\rho \in \mathbb{R}$,

\[ X_{n+1} = \rho X_n + \epsilon_{n+1}, \quad n \ge 0, \]

where $\{\epsilon_n\}_{n\ge1}$ are independent and independent of $X_0$.
(a) Show that if $|\rho| < 1$ and $E(\log |\epsilon_1|)^+ < \infty$, then

\[ \hat{X}_n \equiv \sum_{j=0}^{n} \rho^j \epsilon_{j+1} \quad \text{converges w.p. 1} \]

to a limit $\hat{X}_\infty$, say.


(b) Show that under the hypothesis of (a), for any bounded continuous function $h : \mathbb{R} \to \mathbb{R}$ and for any distribution of $X_0$,

\[ Eh(X_n) \to Eh(\hat{X}_\infty). \]

(Hint: Show that for each $n \ge 1$, $X_n - \rho^n X_0$ and $\hat{X}_n$ have the same distribution.)

7.16 Establish the following generalization of Example 7.2.2. Let $\{X_n\}_{n\ge1}$ be a sequence of independent random variables on some probability space $(\Omega, \mathcal{F}, P)$. Suppose there exist sequences $\{a_n\}_{n\ge1}$, $\{x_n\}_{n\ge1}$ such that $a_n \uparrow \infty$, $x_n \uparrow \infty$ and for each $k < \infty$, $\lim_n P(S_n \le a_n x_k) \equiv F(x_k)$ exists and is $< 1$. Show that $\limsup_{n\to\infty} \frac{S_n}{a_n} = +\infty$ a.s.

7.17 (a) Let $\{X_i\}_{i=1}^n$ be random variables on a probability space $(\Omega, \mathcal{F}, P)$ and let $P(X_1, X_2, \ldots, X_n)^{-1}(\cdot)$ be dominated by the product measure $\mu \times \mu \times \cdots \times \mu$, where $\mu$ is a σ-finite measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, with Radon-Nikodym derivative $f(x_1, x_2, \ldots, x_n)$. Show that $\{X_i\}_{i=1}^n$ are independent w.r.t. $P$ iff $f(x_1, x_2, \ldots, x_n) \equiv \prod_{i=1}^{n} h_i(x_i)$ for all $(x_1, x_2, \ldots, x_n) \in \mathbb{R}^n$, where for each $i$, $h_i : \mathbb{R} \to \mathbb{R}$ is Borel measurable.
(b) Use (a) to show that if $(X_1, X_2)$ has an absolutely continuous distribution with density $f(x_1, x_2)$, then $X_1$ and $X_2$ are independent iff $f(x_1, x_2) = f_1(x_1) f_2(x_2)$, where $f_i(\cdot)$ is the density of $X_i$.
(c) Using (a) or otherwise, conclude that if $X_i$, $i = 1, 2$, are both discrete random variables, then $X_1$ and $X_2$ are independent iff $P(X_1 = a, X_2 = b) = P(X_1 = a) P(X_2 = b)$ for all $a$ and $b$.

7.18 Let $\{X_n\}_{n\ge1}$ be a sequence of independent random variables such that for $n \ge 1$, $P(X_n = 1) = \frac{1}{n} = 1 - P(X_n = 0)$. Show that $X_n \longrightarrow_p 0$ but not w.p. 1.

7.19 Let $(\Omega, \mathcal{F}, P)$ be a probability space.
(a) Suppose there exist events $A_1, A_2, \ldots, A_k$ that are independent with $0 < P(A_i) < 1$, $i = 1, 2, \ldots, k$. Show that $|\Omega| \ge 2^k$, where for any set $A$, $|A|$ is the number of elements in $A$.
(b) Let $\{X_i\}_{i=1}^k$ be independent random variables such that $X_i$ takes $n_i$ distinct values with positive probability. Show that $|\Omega| \ge \prod_{i=1}^{k} n_i$.


(c) Show that there exists a probability space $(\Omega, \mathcal{F}, P)$ such that $|\Omega| = 2^k$ and $k$ independent events $A_1, A_2, \ldots, A_k$ in $\mathcal{F}$ such that $0 < P(A_i) < 1$, $i = 1, 2, \ldots, k$.

7.20 (a) Let $\Omega \equiv \{(x_1, x_2) : x_1, x_2 \in \mathbb{R},\ x_1^2 + x_2^2 \le 1\}$ be the unit disc in $\mathbb{R}^2$. Let $\mathcal{F} \equiv \mathcal{B}(\Omega)$, the Borel σ-algebra in $\Omega$, and $P$ = normalized Lebesgue measure, i.e., $P(A) \equiv \frac{m(A)}{\pi}$, $A \in \mathcal{F}$. For $\omega = (x_1, x_2)$ let $X_1(\omega) = x_1$, $X_2(\omega) = x_2$, and let $(R(\omega), \theta(\omega))$ be the polar representation of $\omega$. Show that the random variables $R$ and $\theta$ are independent but $X_1$ and $X_2$ are not.
(b) Formulate and establish an extension of the above to the unit sphere in $\mathbb{R}^3$.

7.21 Let $X_1, X_2, X_3$ be iid random variables such that $P(X_1 = x) = 0$ for all $x \in \mathbb{R}$.
(a) Show that for any permutation $\sigma$ of $(1, 2, 3)$,

\[ P\big(X_{\sigma(1)} > X_{\sigma(2)} > X_{\sigma(3)}\big) = \frac{1}{3!}. \]

(b) Show that for any $i = 1, 2, 3$,

\[ P\Big(X_i = \max_{1\le j\le 3} X_j\Big) = \frac{1}{3}. \]

(c) State and prove a generalization of (a) and (b) to random variables $\{X_i : 1 \le i \le n\}$ such that the joint distribution of $(X_1, X_2, \ldots, X_n)$ is the same as that of $(X_{\sigma(1)}, X_{\sigma(2)}, \ldots, X_{\sigma(n)})$ for any permutation $\sigma$ of $\{1, 2, \ldots, n\}$ and $P(X_1 = x) = 0$ for all $x \in \mathbb{R}$.

7.22 Let $f, g : \mathbb{R} \to \mathbb{R}$ be monotone nondecreasing. Show that for any random variable $X$,

\[ E f(X) g(X) \ge E f(X)\, E g(X), \]

provided all the expectations exist.
(Hint: Let $Y$ be independent of $X$ with the same distribution. Note that $Z = \big(f(X) - f(Y)\big)\big(g(X) - g(Y)\big) \ge 0$ w.p. 1.)

7.23 Let $X_1, X_2, \ldots, X_n$ be random variables on some probability space $(\Omega, \mathcal{F}, P)$. Show that if $P(X_1, X_2, \ldots, X_n)^{-1}(\cdot) \ll m_n$, the Lebesgue measure on $\mathbb{R}^n$, then for each $i$, $P X_i^{-1}(\cdot) \ll m$. Give an example to show that the converse is not true. Show also that if $P(X_1, X_2, \ldots, X_n)^{-1}(\cdot) \ll m_n$, then $\{X_1, X_2, \ldots, X_n\}$ are independent iff $f_{(X_1, X_2, \ldots, X_n)}(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f_{X_i}(x_i)$, where the $f$'s are the respective pdfs.

8 Laws of Large Numbers

When measuring a physical quantity such as the mass of an object, it is commonly believed that the average of several measurements is more reliable than a single one. Similarly, in applications of statistical inference when estimating a population mean $\mu$, a random sample $\{X_1, X_2, \ldots, X_n\}$ of size $n$ is drawn from the population, and the sample average $\bar{X}_n \equiv \frac{1}{n} \sum_{i=1}^{n} X_i$ is used as an estimator for the parameter $\mu$. This is based on the idea that as $n$ gets large, $\bar{X}_n$ will be close to $\mu$ in some suitable sense. In many time-evolving physical systems $\{f(t) : 0 \le t < \infty\}$, where $f(t)$ is an element in the phase space $S$, "time averages" of the form $\frac{1}{T} \int_0^T h(f(t))\,dt$ (where $h$ is a bounded function on $S$) converge, as $T$ gets large, to the "space average" of the form $\int_S h(x)\,\pi(dx)$ for some appropriate measure $\pi$ on $S$. The above three are examples of a general phenomenon known as the law of large numbers. This chapter is devoted to a systematic development of this topic for sequences of independent random variables and also to some important refinements of the law of large numbers.

8.1 Weak laws of large numbers

Let $\{Z_n\}_{n\ge1}$ be a sequence of random variables on a probability space $(\Omega, \mathcal{F}, P)$. Recall that the sequence $\{Z_n\}_{n\ge1}$ is said to converge in probability to a random variable $Z$ if for each $\epsilon > 0$,

\[ \lim_{n\to\infty} P(|Z_n - Z| \ge \epsilon) = 0. \qquad (1.1) \]

This is written as $Z_n \longrightarrow_p Z$. The sequence $\{Z_n\}_{n\ge1}$ is said to converge with probability one or almost surely (a.s.) to $Z$ if there exists a set $A$ in $\mathcal{F}$ such that $P(A) = 1$ and for all $\omega$ in $A$,

\[ \lim_{n\to\infty} Z_n(\omega) = Z(\omega). \qquad (1.2) \]

This is written as $Z_n \to Z$ w.p. 1 or $Z_n \to Z$ a.s.

Definition 8.1.1: A sequence $\{X_n\}_{n\ge1}$ of random variables on a probability space $(\Omega, \mathcal{F}, P)$ is said to obey the weak law of large numbers (WLLN) with normalizing sequences of real numbers $\{a_n\}_{n\ge1}$ and $\{b_n\}_{n\ge1}$ if

\[ \frac{S_n - a_n}{b_n} \longrightarrow_p 0 \quad \text{as } n \to \infty, \qquad (1.3) \]

where $S_n = \sum_{i=1}^{n} X_i$ for $n \ge 1$.

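The following theorem says that if $\{X_n\}_{n\ge1}$ is a sequence of iid random variables with $EX_1^2 < \infty$, then it obeys the weak law of large numbers with $a_n = nEX_1$ and $b_n = n$.

Theorem 8.1.1: Let $\{X_n\}_{n\ge1}$ be a sequence of iid random variables such that $EX_1^2 < \infty$. Then

\[ \bar{X}_n \equiv \frac{X_1 + \ldots + X_n}{n} \longrightarrow_p EX_1. \qquad (1.4) \]

Proof: By Chebychev's inequality, for any $\epsilon > 0$,

\[ P(|\bar{X}_n - EX_1| > \epsilon) \le \frac{\mathrm{Var}(\bar{X}_n)}{\epsilon^2} = \frac{1}{\epsilon^2} \cdot \frac{\sigma^2}{n}, \qquad (1.5) \]

where $\sigma^2 = \mathrm{Var}(X_1)$. Since $\frac{\sigma^2}{n\epsilon^2} \to 0$ as $n \to \infty$, (1.4) follows. □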
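As a numerical illustration of (1.5), the following sketch compares the exact deviation probability with the Chebychev bound in the special case of iid Bernoulli($p$) variables, where the sample mean has a scaled binomial distribution. The function names and the parameter values $p = 0.3$, $\epsilon = 0.05$ are illustrative choices, not part of the text.

```python
from math import comb


def exact_deviation_prob(n, p, eps):
    """Exact P(|sample mean - p| > eps) for n iid Bernoulli(p) variables."""
    total = 0.0
    for k in range(n + 1):
        # The sample mean equals k/n with the binomial probability below.
        if abs(k / n - p) > eps:
            total += comb(n, k) * p**k * (1 - p) ** (n - k)
    return total


def chebyshev_bound(n, p, eps):
    """Right side of (1.5): sigma^2 / (n * eps^2), with sigma^2 = p(1 - p)."""
    return p * (1 - p) / (n * eps**2)


if __name__ == "__main__":
    p, eps = 0.3, 0.05  # illustrative values
    for n in (50, 200, 800):
        print(n, exact_deviation_prob(n, p, eps), chebyshev_bound(n, p, eps))
```

Both columns tend to zero as $n$ grows; the bound is typically far from tight, which is consistent with its use here only to establish convergence in probability.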

Corollary 8.1.2: Let $\{X_n\}_{n\ge1}$ be a sequence of iid Bernoulli($p$) random variables, i.e., $P(X_1 = 1) = p = 1 - P(X_1 = 0)$. Let

\[ \hat{p}_n = \frac{\#\{i : 1 \le i \le n,\ X_i = 1\}}{n}, \quad n \ge 1, \qquad (1.6) \]

where for a finite set $A$, $\#A$ denotes the number of elements in $A$. Then $\hat{p}_n \longrightarrow_p p$.

Proof: Check that $EX_1 = p$ and $\hat{p}_n = \bar{X}_n$. □

This says that one can estimate the probability $p$ of getting a "head" of a coin by tossing it $n$ times and calculating the proportion of "heads." This is also the basis of public opinion polls. Since the proof of Theorem 8.1.1 depended only on Chebychev's inequality, the following generalization is immediate (Problem 8.1).

Theorem 8.1.3: Let $\{X_n\}_{n\ge1}$ be a sequence of random variables on a probability space such that
(i) $EX_n^2 < \infty$ for all $n \ge 1$,
(ii) $EX_i X_j = (EX_i)(EX_j)$ for all $i \ne j$ (i.e., $\{X_n\}_{n\ge1}$ are uncorrelated),
(iii) $\frac{1}{n^2} \sum_{i=1}^{n} \sigma_i^2 \to 0$ as $n \to \infty$, where $\sigma_i^2 = \mathrm{Var}(X_i)$, $i \ge 1$.
Then

\[ \bar{X}_n - \bar{\mu}_n \longrightarrow_p 0, \qquad (1.7) \]

where $\bar{\mu}_n \equiv \frac{1}{n} \sum_{i=1}^{n} EX_i$.

Corollary 8.1.4: Let $\{X_n\}_{n\ge1}$ satisfy (i) and (ii) of the above theorem and let the sequence $\{\sigma_n^2\}_{n\ge1}$ be bounded. Let $\bar{\mu}_n \equiv \frac{1}{n} \sum_{i=1}^{n} EX_i \to \mu$ as $n \to \infty$. Then $\bar{X}_n \longrightarrow_p \mu$.

An Application to Real Analysis

Let $f : [0, 1] \to \mathbb{R}$ be a continuous function. K. Weierstrass showed that $f$ can be approximated uniformly over $[0, 1]$ by polynomials. S.N. Bernstein constructed a special class of such polynomials. A proof of Bernstein's result using the WLLN (Theorem 8.1.1) is given below.

Theorem 8.1.5: Let $f : [0, 1] \to \mathbb{R}$ be a continuous function. Let

\[ B_{n,f}(x) \equiv \sum_{r=0}^{n} f\Big(\frac{r}{n}\Big) \binom{n}{r} x^r (1 - x)^{n-r}, \quad 0 \le x \le 1, \qquad (1.8) \]

be the Bernstein polynomial of order $n$ for the function $f$. Then

\[ \lim_{n\to\infty} \sup\big\{|f(x) - B_{n,f}(x)| : 0 \le x \le 1\big\} = 0. \]

Proof: Since $f$ is continuous on the closed and bounded interval $[0, 1]$, it is uniformly continuous, and hence for any $\epsilon > 0$, there exists a $\delta_\epsilon > 0$ such that

\[ |x - y| < \delta_\epsilon \Rightarrow |f(x) - f(y)| < \epsilon. \qquad (1.9) \]

Fix $x$ in $[0, 1]$. Let $\{X_n\}_{n\ge1}$ be a sequence of iid Bernoulli($x$) random variables. Let $\hat{p}_n$ be as in (1.6). Then $B_{n,f}(x) = Ef(\hat{p}_n)$. Hence,

\[
\begin{aligned}
|f(x) - B_{n,f}(x)| &\le E|f(\hat{p}_n) - f(x)| \\
&= E\big[|f(\hat{p}_n) - f(x)|\, I(|\hat{p}_n - x| < \delta_\epsilon)\big] + E\big[|f(\hat{p}_n) - f(x)|\, I(|\hat{p}_n - x| \ge \delta_\epsilon)\big] \\
&\le \epsilon + 2\|f\|\, P(|\hat{p}_n - x| \ge \delta_\epsilon),
\end{aligned}
\]

where $\|f\| = \sup\{|f(x)| : 0 \le x \le 1\}$. But by Chebychev's inequality,

\[ P(|\hat{p}_n - x| \ge \delta_\epsilon) \le \frac{1}{\delta_\epsilon^2} \mathrm{Var}(\hat{p}_n) = \frac{x(1 - x)}{n\delta_\epsilon^2} \le \frac{1}{4n\delta_\epsilon^2} \quad \text{for all } 0 \le x \le 1. \]

Thus, $\sup\{|f(x) - B_{n,f}(x)| : 0 \le x \le 1\} \le \epsilon + 2\|f\| \frac{1}{4n\delta_\epsilon^2}$. Letting $n \to \infty$ first and then $\epsilon \downarrow 0$ completes the proof. □
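The uniform convergence in Theorem 8.1.5 can be observed numerically. The sketch below evaluates the Bernstein polynomial (1.8) directly and approximates the sup-norm error on a finite grid; the grid size and the test function $|x - \frac{1}{2}|$ (continuous but not differentiable at $\frac{1}{2}$) are illustrative assumptions, and the grid maximum is only a lower bound for the true supremum.

```python
from math import comb


def bernstein(f, n, x):
    """Evaluate B_{n,f}(x) = sum_r f(r/n) C(n,r) x^r (1-x)^(n-r), as in (1.8)."""
    return sum(
        f(r / n) * comb(n, r) * x**r * (1 - x) ** (n - r) for r in range(n + 1)
    )


def sup_error(f, n, grid_size=201):
    """Maximum of |f - B_{n,f}| over an evenly spaced grid on [0, 1]."""
    xs = [i / (grid_size - 1) for i in range(grid_size)]
    return max(abs(f(x) - bernstein(f, n, x)) for x in xs)


def kink(x):
    """Illustrative test function: continuous, not differentiable at 1/2."""
    return abs(x - 0.5)


if __name__ == "__main__":
    for n in (10, 40, 160):
        print(n, sup_error(kink, n))
```

For this test function the error at $x = \frac{1}{2}$ equals $E|\hat{p}_n - \frac{1}{2}|$, which decays at the rate $n^{-1/2}$, so the convergence is visible but slow.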

8.2 Strong laws of large numbers

Definition 8.2.1: A sequence $\{X_n\}_{n\ge1}$ of random variables on a probability space $(\Omega, \mathcal{F}, P)$ is said to obey the strong law of large numbers (SLLN) with normalizing sequences of real numbers $\{a_n\}_{n\ge1}$ and $\{b_n\}_{n\ge1}$ if

\[ \frac{S_n - a_n}{b_n} \to 0 \quad \text{as } n \to \infty \quad \text{w.p. 1,} \qquad (2.1) \]

where $S_n = \sum_{i=1}^{n} X_i$ for $n \ge 1$.

The following theorem says that if $\{X_n\}_{n\ge1}$ is a sequence of iid random variables with $EX_1^4 < \infty$, then the strong law of large numbers holds with $a_n = nEX_1$ and $b_n = n$. This result is referred to as Borel's SLLN.

Theorem 8.2.1: (Borel's SLLN). Let $\{X_n\}_{n\ge1}$ be a sequence of iid random variables such that $EX_1^4 < \infty$. Then

\[ \bar{X}_n \equiv \frac{X_1 + X_2 + \ldots + X_n}{n} \to EX_1 \quad \text{w.p. 1.} \qquad (2.2) \]

Proof: Fix $\epsilon > 0$ and let $A_n \equiv \{|\bar{X}_n - EX_1| \ge \epsilon\}$, $n \ge 1$. To establish (2.2), by Proposition 7.2.3 (a), it suffices to show that

\[ \sum_{n=1}^{\infty} P(A_n) < \infty. \qquad (2.3) \]

By Markov's inequality,

\[ P(A_n) \le \frac{E|\bar{X}_n - EX_1|^4}{\epsilon^4}. \qquad (2.4) \]

Let $Y_i = X_i - EX_1$ for $i \ge 1$. Since the $X_i$'s are independent, it is easy to check that

\[
\begin{aligned}
E|\bar{X}_n - EX_1|^4 &= \frac{1}{n^4} E\Big(\sum_{i=1}^{n} Y_i\Big)^4 \\
&= \frac{1}{n^4} \big[ nEY_1^4 + 3n(n - 1)(EY_1^2)^2 \big] \\
&= O(n^{-2}).
\end{aligned}
\]

By (2.4) this implies (2.3). □
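The fourth-moment identity used in the proof can be checked exactly in a simple special case. The sketch below assumes, purely for illustration, that the centered variables $Y_i$ are iid signs taking the values $\pm 1$ with probability $\frac{1}{2}$ each (so $EY_1^2 = EY_1^4 = 1$), and enumerates all $2^n$ outcomes for small $n$; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def fourth_moment_of_sum(n):
    """Exact E[(Y_1 + ... + Y_n)^4] for iid signs Y_i = +1 or -1, each w.p. 1/2."""
    total = sum(sum(signs) ** 4 for signs in product((-1, 1), repeat=n))
    return Fraction(total, 2**n)


def moment_formula(n, m4=1, m2=1):
    """n*E[Y^4] + 3n(n-1)*(E[Y^2])^2, the expression appearing in the proof."""
    return n * m4 + 3 * n * (n - 1) * m2**2


if __name__ == "__main__":
    for n in range(1, 11):
        print(n, fourth_moment_of_sum(n), moment_formula(n))
```

In this case the common value is $3n^2 - 2n$, so dividing by $n^4$ gives the $O(n^{-2})$ rate that makes the series in (2.3) converge.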

The following two results are easy consequences of the above theorem.

Corollary 8.2.2: Let $\{X_n\}_{n\ge1}$ be a sequence of iid random variables that are bounded, i.e., there exists a $C < \infty$ such that $P(|X_1| \le C) = 1$. Then

\[ \bar{X}_n \to EX_1 \quad \text{w.p. 1.} \]

Corollary 8.2.3: Let $\{X_n\}_{n\ge1}$ be a sequence of iid Bernoulli($p$) random variables. Then

\[ \hat{p}_n \equiv \frac{\#\{i : 1 \le i \le n,\ X_i = 1\}}{n} \to p \quad \text{w.p. 1.} \qquad (2.5) \]

An application of the above result yields the following theorem on the uniform convergence of the empirical cdf to the true cdf.

Theorem 8.2.4: (Glivenko-Cantelli). Let $\{X_n\}_{n\ge1}$ be a sequence of iid random variables with a common cdf $F(\cdot)$. Let $F_n(\cdot)$, the empirical cdf based on $\{X_1, X_2, \ldots, X_n\}$, be defined by

\[ F_n(x) \equiv \frac{1}{n} \sum_{j=1}^{n} I(X_j \le x), \quad x \in \mathbb{R}. \qquad (2.6) \]

Then,

\[ \tilde{\Delta}_n \equiv \sup_x |F_n(x) - F(x)| \to 0 \quad \text{w.p. 1.} \qquad (2.7) \]
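The statistic in (2.7) can be computed exactly from the order statistics, since $F_n$ is a step function that jumps only at the sample points. The following sketch does this for the Uniform(0,1) cdf $F(x) = x$ using pseudorandom samples; the choice of distribution, the seed, and the function names are illustrative assumptions, and a single simulated path is of course only suggestive of almost sure convergence.

```python
import random


def sup_distance_uniform(sample):
    """sup_x |F_n(x) - x| for a sample compared with the Uniform(0,1) cdf."""
    xs = sorted(sample)
    n = len(xs)
    # At the i-th order statistic (0-based) F_n jumps from i/n to (i+1)/n,
    # so the supremum is attained just before or at one of these jumps.
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


def simulate(n, seed=0):
    """Sup distance for n pseudorandom Uniform(0,1) draws with a fixed seed."""
    rng = random.Random(seed)
    return sup_distance_uniform([rng.random() for _ in range(n)])


if __name__ == "__main__":
    for n in (100, 1000, 10000, 100000):
        print(n, simulate(n))
```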

Remark 8.2.1: Note that by applying Corollary 8.2.3 to the sequence of Bernoulli random variables $\{Y_n \equiv I(X_n \le x)\}_{n\ge1}$, one may conclude that $F_n(x) \to F(x)$ w.p. 1 for each fixed $x$. So the main thrust of this theorem is the uniform convergence on $\mathbb{R}$ of $F_n$ to $F$ w.p. 1. It can be shown that (2.7) holds for sequences $\{X_n\}_{n\ge1}$ that are identically distributed and only pairwise independent. The proof is based on Etemadi's SLLN (Theorem 8.2.7) below.

The proof of Theorem 8.2.4 makes use of the following two lemmas.

Lemma 8.2.5: (Scheffé's theorem: A generalized version). Let $(\Omega, \mathcal{F}, \mu)$ be a measure space and $\{f_n\}_{n\ge1}$ and $f$ be nonnegative $\mu$-integrable functions such that as $n \to \infty$, (i) $f_n \to f$ a.e. ($\mu$) and (ii) $\int f_n\,d\mu \to \int f\,d\mu$. Then $\int |f - f_n|\,d\mu \to 0$ as $n \to \infty$.

Proof: See Theorem 2.5.4. □

For any bounded monotone function $H : \mathbb{R} \to \mathbb{R}$, define

\[ H(\infty) \equiv \lim_{x\uparrow\infty} H(x), \qquad H(-\infty) \equiv \lim_{x\downarrow-\infty} H(x). \]

Lemma 8.2.6: (Pólya's theorem). Let $\{G_n\}_{n\ge1}$ and $G$ be a collection of bounded nondecreasing functions on $\mathbb{R} \to \mathbb{R}$ such that $G(\cdot)$ is continuous on $\mathbb{R}$ and $G_n(x) \to G(x)$ for all $x$ in $D \cup \{-\infty, +\infty\}$, where $D$ is dense in $\mathbb{R}$. Then $\Delta_n \equiv \sup\{|G_n(x) - G(x)| : x \in \mathbb{R}\} \to 0$. That is, $G_n \to G$ uniformly on $\mathbb{R}$.

Proof: Fix $\epsilon > 0$. By the definitions of $G(\infty)$ and $G(-\infty)$, there exist $C_1$ and $C_2$ in $D$ such that

\[ G(C_1) - G(-\infty) < \epsilon \quad \text{and} \quad G(\infty) - G(C_2) < \epsilon. \qquad (2.8) \]

Since $G(\cdot)$ is continuous, it is uniformly continuous on $[C_1, C_2]$ and so there exists a $\delta > 0$ such that

\[ x, y \in [C_1, C_2],\ |x - y| < \delta \Rightarrow |G(x) - G(y)| < \epsilon. \qquad (2.9) \]

Also, there exist points $a_1 = C_1 < a_2 < \ldots < a_k = C_2$, $1 < k < \infty$, in $D$ such that $\max\{(a_{i+1} - a_i) : 1 \le i \le k - 1\} < \delta$. Let $a_0 = -\infty$, $a_{k+1} = \infty$. By the convergence of $G_n(\cdot)$ to $G(\cdot)$ on $D \cup \{-\infty, \infty\}$,

\[ \Delta_{n1} \equiv \max\{|G_n(a_i) - G(a_i)| : 0 \le i \le k + 1\} \to 0 \qquad (2.10) \]

as $n \to \infty$. Now note that for any $x$ in $[a_i, a_{i+1}]$, $1 \le i \le k - 1$, by the monotonicity of $G_n(\cdot)$ and $G(\cdot)$, and by (2.9) and (2.10),

\[
\begin{aligned}
G_n(x) - G(x) &\le G_n(a_{i+1}) - G(a_i) \\
&\le G_n(a_{i+1}) - G(a_{i+1}) + G(a_{i+1}) - G(a_i) \\
&\le \Delta_{n1} + \epsilon,
\end{aligned}
\]

and similarly, $G_n(x) - G(x) \ge -\Delta_{n1} - \epsilon$. Thus

\[ \sup\{|G_n(x) - G(x)| : a_1 \le x \le a_k\} \le \Delta_{n1} + \epsilon. \qquad (2.11) \]

For $x < a_1$, by (2.8) and (2.10),

\[
\begin{aligned}
|G_n(x) - G(x)| &\le |G_n(x) - G_n(-\infty)| + |G_n(-\infty) - G(-\infty)| + |G(-\infty) - G(x)| \\
&\le \big(G_n(a_1) - G_n(-\infty)\big) + |G_n(-\infty) - G(-\infty)| + \epsilon \\
&\le |G_n(a_1) - G(a_1)| + |G(a_1) - G(-\infty)| + 2|G_n(-\infty) - G(-\infty)| + \epsilon \\
&\le 3\Delta_{n1} + 2\epsilon.
\end{aligned}
\]

Similarly, for $x > a_k$, $|G_n(x) - G(x)| \le 3\Delta_{n1} + 2\epsilon$. Combining the above with (2.11) yields $\Delta_n \le 3\Delta_{n1} + 2\epsilon$. By (2.10),

\[ \limsup_{n\to\infty} \Delta_n \le 2\epsilon, \]

and $\epsilon > 0$ being arbitrary, the proof is complete. □

Proof of Theorem 8.2.4: First note that $\tilde{\Delta}_n = \sup_{x\in\mathbb{Q}} |F_n(x) - F(x)|$ and hence, it is a random variable. Let $B \equiv \{b_j : j \in J\}$ be the set of jump discontinuity points of $F$ with the corresponding jump sizes $\{p_j : j \in J\}$, where $J$ is a subset of $\mathbb{N}$. Let $p = \sum_{j\in J} p_j$. Note that

\[
\begin{aligned}
F_n(x) &= \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x) \\
&= \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x,\ X_i \in B) + \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x,\ X_i \notin B) \\
&= F_{nd}(x) + F_{nc}(x), \ \text{say.}
\end{aligned}
\qquad (2.12)
\]

Then, $F_{nd}(x) = \sum_{j\in J} \hat{p}_{nj} I(b_j \le x)$, where

\[ \hat{p}_{nj} = \frac{\#\{i : 1 \le i \le n,\ X_i = b_j\}}{n}. \]

Let $\hat{p}_n = \sum_{j\in J} \hat{p}_{nj} = \frac{1}{n} \cdot \#\{i : 1 \le i \le n,\ X_i \in B\}$. By Corollary 8.2.3, for each $j \in J$, $\hat{p}_{nj} \to p_j$ w.p. 1 and $\hat{p}_n \to p$ w.p. 1. Since $B$ is countable, there exists a set $A_0$ in $\mathcal{F}$ such that $P(A_0) = 1$ and for all $\omega$ in $A_0$, $\hat{p}_{nj} \to p_j$ for all $j \in J$ and $\sum_{j\in J} \hat{p}_{nj} = \hat{p}_n \to p = \sum_{j\in J} p_j$. By Lemma 8.2.5 (applied with $\mu$ being the counting measure on the set $J$), it follows that for $\omega$ in $A_0$,

\[ \sum_{j\in J} |\hat{p}_{nj} - p_j| \to 0. \qquad (2.13) \]

Let $F_d(x) \equiv \sum_{j\in J} p_j I(b_j \le x)$, $x \in \mathbb{R}$. Then,

\[ \sup_{x\in\mathbb{R}} |F_{nd}(x) - F_d(x)| \le \sum_{j\in J} |\hat{p}_{nj} - p_j|, \qquad (2.14) \]

which $\to 0$ as $n \to \infty$ for all $\omega$ in $A_0$, by (2.13).

Next, let $F_c(x) \equiv F(x) - F_d(x)$, $x \in \mathbb{R}$. Then, it is easy to check that $F_c(\cdot)$ is continuous and nondecreasing on $\mathbb{R}$, $F_c(-\infty) = 0$ and $F_c(\infty) = 1 - p$. Again, by Corollary 8.2.3, there exists a set $A_1$ in $\mathcal{F}$ such that $P(A_1) = 1$ and for all $\omega$ in $A_1$, $F_{nc}(x) \to F_c(x)$ for all rational $x$ in $\mathbb{R}$ and $F_{nc}(\infty) \equiv 1 - \hat{p}_n \to 1 - p = F_c(\infty)$. Also, $F_{nc}(-\infty) = 0 = F_c(-\infty)$. So by Lemma 8.2.6, with $D = \mathbb{Q}$, for $\omega$ in $A_1$,

\[ \sup_{x\in\mathbb{R}} |F_{nc}(x) - F_c(x)| \to 0 \quad \text{as } n \to \infty. \qquad (2.15) \]

Since $P(A_0 \cap A_1) = 1$, the theorem follows from (2.12)–(2.15). □

Borel's SLLN for iid random variables requires that $E|X_1|^4 < \infty$. Kolmogorov (1956) improved on this significantly by using his "3-series" theorem and reduced the moment condition to $E|X_1| < \infty$. More recently, Etemadi (1981) improved this further by assuming only that the $\{X_n\}_{n\ge1}$ are pairwise independent and identically distributed with $E|X_1| < \infty$. More precisely, he proved the following.

Theorem 8.2.7: (Etemadi's SLLN). Let $\{X_n\}_{n\ge1}$ be a sequence of pairwise independent and identically distributed random variables with $E|X_1| < \infty$. Then

\[ \bar{X}_n \to EX_1 \quad \text{w.p. 1.} \qquad (2.16) \]

Proof: The main steps in the proof are
(I) reduction to the nonnegative case,
(II) proof of convergence of $\bar{Y}_n$ along a geometrically growing subsequence using the Borel-Cantelli lemma and Chebychev's inequality, where $\bar{Y}_n$ is the average of certain truncated versions of $X_1, \ldots, X_n$, and extending the convergence from the geometric subsequence to the full sequence.

Step I: Since the $\{X_n\}_{n\ge1}$ are pairwise independent and identically distributed with $E|X_1| < \infty$, it follows that $\{X_n^+\}_{n\ge1}$ and $\{X_n^-\}_{n\ge1}$ are both sequences of pairwise independent and identically distributed nonnegative random variables with $EX_1^+ < \infty$ and $EX_1^- < \infty$. Also, since

\[ \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i = \frac{1}{n} \sum_{i=1}^{n} X_i^+ - \frac{1}{n} \sum_{i=1}^{n} X_i^-, \]

it is enough to prove the theorem under the assumption that the $X_i$'s are nonnegative.

Step II: Now let the $X_i$'s be nonnegative and let $Y_i = X_i I(X_i \le i)$, $i \ge 1$. Then,

\[
\begin{aligned}
\sum_{i=1}^{\infty} P(X_i \ne Y_i) &= \sum_{i=1}^{\infty} P(X_i > i) = \sum_{i=1}^{\infty} P(X_1 > i) \\
&\le \sum_{i=1}^{\infty} \int_{i-1}^{i} P(X_1 > t)\,dt \\
&= \int_0^{\infty} P(X_1 > t)\,dt = EX_1 < \infty.
\end{aligned}
\]

Hence, by the Borel-Cantelli lemma, $P(X_i \ne Y_i \text{ infinitely often}) = 0$. This implies that w.p. 1, $X_i = Y_i$ for all but finitely many $i$'s, and hence it suffices to show that

\[ \bar{Y}_n \equiv \frac{1}{n} \sum_{i=1}^{n} Y_i \to EX_1 \quad \text{w.p. 1.} \qquad (2.17) \]

Next, $EY_i = E(X_i I(X_i \le i)) = E(X_1 I(X_1 \le i)) \to EX_1$ (by the MCT) and hence

\[ E\bar{Y}_n = \frac{1}{n} \sum_{i=1}^{n} EY_i \to EX_1 \quad \text{as } n \to \infty. \qquad (2.18) \]

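Suppose for the moment that for each fixed $1 < \rho < \infty$, it is shown that

\[ \bar{Y}_{n_k} \to EX_1 \quad \text{as } k \to \infty \quad \text{w.p. 1,} \qquad (2.19) \]

where $n_k = \lfloor \rho^k \rfloor$ = the greatest integer less than or equal to $\rho^k$, $k \in \mathbb{N}$. Then, since the $Y_i$'s are nonnegative, for any $n$ and $k$ satisfying $\rho^k \le n < \rho^{k+1}$, one gets

\[
\begin{aligned}
& \frac{1}{n} \sum_{i=1}^{n_k} Y_i \le \bar{Y}_n = \frac{1}{n} \sum_{j=1}^{n} Y_j \le \frac{1}{n} \sum_{i=1}^{n_{k+1}} Y_i \\
\Longrightarrow\ & \frac{n_k}{n} \bar{Y}_{n_k} \le \bar{Y}_n \le \frac{n_{k+1}}{n} \bar{Y}_{n_{k+1}} \\
\Longrightarrow\ & \frac{1}{\rho} \bar{Y}_{n_k} \le \bar{Y}_n \le \rho \bar{Y}_{n_{k+1}}.
\end{aligned}
\]

From (2.19), it follows that

\[ \frac{1}{\rho} EX_1 \le \liminf_{n\to\infty} \bar{Y}_n \le \limsup_{n\to\infty} \bar{Y}_n \le \rho EX_1 \quad \text{w.p. 1.} \]

Since this is true for each $1 < \rho < \infty$, by taking $\rho = 1 + \frac{1}{r}$ for $r = 1, 2, \ldots$, it follows that

\[ EX_1 \le \liminf_{n\to\infty} \bar{Y}_n \le \limsup_{n\to\infty} \bar{Y}_n \le EX_1 \quad \text{w.p. 1,} \]

establishing (2.17). It now remains to prove (2.19). By (2.18), it is enough to show that

\[ \bar{Y}_{n_k} - E\bar{Y}_{n_k} \to 0 \quad \text{as } k \to \infty, \ \text{w.p. 1.} \qquad (2.20) \]

By Chebychev's inequality and the pairwise independence of the variables $\{Y_n\}_{n\ge1}$, for any $\epsilon > 0$,

\[
\begin{aligned}
P(|\bar{Y}_{n_k} - E\bar{Y}_{n_k}| > \epsilon) &\le \frac{1}{\epsilon^2} \mathrm{Var}(\bar{Y}_{n_k}) = \frac{1}{\epsilon^2} \frac{1}{n_k^2} \sum_{i=1}^{n_k} \mathrm{Var}(Y_i) \\
&\le \frac{1}{\epsilon^2} \frac{1}{n_k^2} \sum_{i=1}^{n_k} EY_i^2.
\end{aligned}
\]

Thus,

\[
\begin{aligned}
\sum_{k=1}^{\infty} P(|\bar{Y}_{n_k} - E\bar{Y}_{n_k}| > \epsilon) &\le \frac{1}{\epsilon^2} \sum_{k=1}^{\infty} \frac{1}{n_k^2} \sum_{i=1}^{n_k} EY_i^2 \\
&= \frac{1}{\epsilon^2} \sum_{i=1}^{\infty} EY_i^2 \sum_{k : n_k \ge i} \frac{1}{n_k^2}.
\end{aligned}
\qquad (2.21)
\]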

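Since $n_k = \lfloor \rho^k \rfloor > \rho^{k-1}$ for $1 < \rho < \infty$,

\[ \sum_{k : n_k \ge i} \frac{1}{n_k^2} \le \sum_{k : \rho^{k-1} \ge i} \frac{1}{\rho^{2(k-1)}} \le \frac{C_1}{i^2} \qquad (2.22) \]

for some constant $C_1$, $0 < C_1 < \infty$. Next, since the $X_i$'s are identically distributed,

\[
\begin{aligned}
\sum_{i=1}^{\infty} \frac{EY_i^2}{i^2} &= \sum_{i=1}^{\infty} \frac{EX_1^2 I(0 \le X_1 \le i)}{i^2} \\
&= \sum_{i=1}^{\infty} \sum_{j=1}^{i} \frac{EX_1^2 I(j - 1 < X_1 \le j)}{i^2} \\
&= \sum_{j=1}^{\infty} EX_1^2 I(j - 1 < X_1 \le j) \sum_{i=j}^{\infty} i^{-2} \\
&\le \sum_{j=1}^{\infty} j\, EX_1 I(j - 1 < X_1 \le j) \cdot C_2\, j^{-1} \\
&= C_2\, EX_1 < \infty,
\end{aligned}
\qquad (2.23)
\]

for some constant $C_2$, $0 < C_2 < \infty$. Now (2.21)–(2.23) imply that

\[ \sum_{k=1}^{\infty} P(|\bar{Y}_{n_k} - E\bar{Y}_{n_k}| > \epsilon) < \infty \]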

for each $\epsilon > 0$. By the Borel-Cantelli lemma and Proposition 7.2.3 (a), (2.20) follows and the proof is complete. □

The following corollary is immediate from the above theorem.

Corollary 8.2.8: (Extension to the vector case). Let $\{X_n = (X_{n1}, \ldots, X_{nk})\}_{n\ge1}$ be a sequence of $k$-dimensional random vectors defined on a probability space $(\Omega, \mathcal{F}, P)$ such that for each $i$, $1 \le i \le k$, the sequence $\{X_{ni}\}_{n\ge1}$ is pairwise independent and identically distributed with $E|X_{1i}| < \infty$. Let $\mu = (EX_{11}, EX_{12}, \ldots, EX_{1k})$ and $f : \mathbb{R}^k \to \mathbb{R}$ be continuous at $\mu$. Then
(i) $\bar{X}_n \equiv (\bar{X}_{n1}, \bar{X}_{n2}, \ldots, \bar{X}_{nk}) \to \mu$ w.p. 1, where $\bar{X}_{ni} = \frac{1}{n} \sum_{j=1}^{n} X_{ji}$ for $1 \le i \le k$.
(ii) $f(\bar{X}_n) \to f(\mu)$ w.p. 1.

Example 8.2.1: Let $(X_n, Y_n)$, $n = 1, 2, \ldots$ be a sequence of bivariate iid random vectors with $EX_1^2 < \infty$, $EY_1^2 < \infty$. Then the sample correlation coefficient $\hat{\rho}_n$, defined by

\[ \hat{\rho}_n \equiv \frac{\frac{1}{n} \sum_{i=1}^{n} X_i Y_i - \bar{X}_n \bar{Y}_n}{\sqrt{\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2 \cdot \frac{1}{n} \sum_{i=1}^{n} (Y_i - \bar{Y}_n)^2}}, \]

is a strongly consistent estimator of the population correlation coefficient $\rho$, defined by

\[ \rho = \frac{\mathrm{Cov}(X_1, Y_1)}{\sqrt{\mathrm{Var}(X_1)\mathrm{Var}(Y_1)}}, \]

i.e., $\hat{\rho}_n \to \rho$ w.p. 1. This follows from the above corollary by taking $f : \mathbb{R}^5 \to \mathbb{R}$ to be

\[
f(t_1, t_2, t_3, t_4, t_5) =
\begin{cases}
\dfrac{t_5 - t_1 t_2}{\sqrt{(t_3 - t_1^2)(t_4 - t_2^2)}}, & \text{for } t_3 > t_1^2,\ t_4 > t_2^2, \\
0, & \text{otherwise,}
\end{cases}
\]

and the vector $(X_{n1}, X_{n2}, \ldots, X_{n5})$ to be

\[ X_{n1} = X_n,\quad X_{n2} = Y_n,\quad X_{n3} = X_n^2,\quad X_{n4} = Y_n^2,\quad X_{n5} = X_n Y_n. \]

Corollary 8.2.9: (Extension to the pairwise m-dependent case). Let $\{X_n\}_{n\ge1}$ be a sequence of random variables on a probability space $(\Omega, \mathcal{F}, P)$ such that for an integer $m$, $1 \le m < \infty$, and for each $i$, $1 \le i \le m$, the random variables $\{X_i, X_{i+m}, X_{i+2m}, \ldots\}$ are identically distributed and pairwise independent with $E|X_i| < \infty$. Then

\[ \bar{X}_n \to \frac{1}{m} \sum_{i=1}^{m} EX_i \quad \text{w.p. 1.} \]

The proof is left as an exercise (Problem 8.2). For an application of the above result to a discussion on normal numbers, see Problem 8.15.

Example 8.2.2: (IID Monte Carlo). Let $(S, \mathcal{S}, \pi)$ be a probability space, $f \in L^1(S, \mathcal{S}, \pi)$ and $\lambda = \int_S f\,d\pi$. Let $\{X_n\}_{n\ge1}$ be a sequence of iid $S$-valued random variables with distribution $\pi$. Then, the IID Monte Carlo approximation to $\lambda$ is defined as

\[ \hat{\lambda}_n \equiv \frac{1}{n} \sum_{i=1}^{n} f(X_i). \]

Note that by the SLLN, $\hat{\lambda}_n \to \lambda$ w.p. 1. An extension of this to the case where $\{X_i\}_{i\ge1}$ is a Markov chain, known as Markov chain Monte Carlo (MCMC), is discussed in Chapter 14.
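A minimal sketch of the IID Monte Carlo approximation of Example 8.2.2 is given below. It assumes, for illustration only, that $S = (0, 1)$, $\pi$ is the uniform distribution and $f(x) = x^2$, so that $\lambda = \frac{1}{3}$; the function name `monte_carlo`, the sampler interface and the seed are hypothetical choices.

```python
import random


def monte_carlo(f, sampler, n, seed=0):
    """Average of f over n iid draws; sampler(rng) returns one draw from pi."""
    rng = random.Random(seed)
    return sum(f(sampler(rng)) for _ in range(n)) / n


def square(x):
    """Illustrative integrand f(x) = x^2, whose integral over (0, 1) is 1/3."""
    return x * x


def uniform_draw(rng):
    """One draw from the Uniform(0,1) distribution."""
    return rng.random()


if __name__ == "__main__":
    for n in (100, 10000, 1000000):
        print(n, monte_carlo(square, uniform_draw, n))
```

Any other distribution $\pi$ can be substituted by supplying a different sampler, provided the draws are iid and $f$ is $\pi$-integrable.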


Let $\{X_n\}_{n\ge 1}$ be a sequence of independent random variables on a probability space $(\Omega, \mathcal{F}, P)$. The goal of this section is to investigate the convergence of the infinite series $\sum_{n=1}^\infty X_n$, i.e., that of the partial sum sequence $S_n = \sum_{i=1}^n X_i$, $n \ge 1$. The main result of this section is Kolmogorov's 3-series theorem (Theorem 8.3.5). The following two inequalities play a fundamental role in the proof of this theorem and also have other important applications.

Theorem 8.3.1: Let $\{X_j : 1 \le j \le n\}$ be a collection of independent random variables. Let $S_i = \sum_{j=1}^i X_j$ for $1 \le i \le n$.

(i) (Kolmogorov's first inequality). Suppose that $EX_j = 0$ and $EX_j^2 < \infty$, $1 \le j \le n$. Then, for $0 < \lambda < \infty$,
$$P\Big(\max_{1\le i\le n} |S_i| \ge \lambda\Big) \le \frac{\mathrm{Var}(S_n)}{\lambda^2}. \tag{3.1}$$

(ii) (Kolmogorov's second inequality). Suppose that there exists a constant $C \in (0, \infty)$ such that $P(|X_j - EX_j| \le C) = 1$ for $1 \le j \le n$. Then, for any $0 < \lambda < \infty$,
$$P\Big(\max_{1\le i\le n} |S_i| \le \lambda\Big) \le \frac{(2C + 4\lambda)^2}{\mathrm{Var}(S_n)}.$$

Proof: Let $A = \{\max_{1\le i\le n} |S_i| \ge \lambda\}$ and let
$$A_1 = \{|S_1| \ge \lambda\}, \qquad A_j = \{|S_1| < \lambda, |S_2| < \lambda, \ldots, |S_{j-1}| < \lambda, |S_j| \ge \lambda\}$$
for $j = 2, \ldots, n$. Then $A_1, \ldots, A_n$ are disjoint, $\bigcup_{j=1}^n A_j = A$ and $P(A) = \sum_{j=1}^n P(A_j)$. Since $EX_j = 0$ for all $j$,
$$\begin{aligned}
\mathrm{Var}(S_n) = ES_n^2 &\ge E(S_n^2 I_A) = \sum_{j=1}^n E(S_n^2 I_{A_j}) \\
&= \sum_{j=1}^n E\Big[\big((S_n - S_j)^2 + S_j^2 + 2(S_n - S_j)S_j\big) I_{A_j}\Big] \\
&\ge \sum_{j=1}^n E(S_j^2 I_{A_j}) + 2\sum_{j=1}^{n-1} E\big[(S_n - S_j) S_j I_{A_j}\big].
\end{aligned} \tag{3.2}$$
Note that since $\{X_1, \ldots, X_n\}$ are independent, $(S_n - S_j) \equiv \sum_{i=j+1}^n X_i$ and $S_j I_{A_j}$ are independent for $1 \le j \le n-1$. Hence,
$$E\big[(S_n - S_j) S_j I_{A_j}\big] = E(S_n - S_j)\,E(S_j I_{A_j}) = 0.$$
Also on $A_j$, $S_j^2 \ge \lambda^2$. Therefore, by (3.2),
$$\mathrm{Var}(S_n) \ge \sum_{j=1}^n \lambda^2 P(A_j) = \lambda^2 P(A).$$
This establishes (i). For a proof of (ii), see Chung (1974), p. 117. □
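As a rough numerical sanity check of Kolmogorov's first inequality (3.1), the sketch below estimates $P(\max_{i\le n}|S_i| \ge \lambda)$ by simulation and compares it with $\mathrm{Var}(S_n)/\lambda^2$. The distribution (Uniform(-1, 1), so $\mathrm{Var}(X_j) = 1/3$), the values $n = 50$, $\lambda = 6$, and the function names are illustrative assumptions, not part of the text.

```python
import random


def max_abs_partial_sum(n, rng):
    """Return max_{1<=i<=n} |S_i| for one path of Uniform(-1, 1) increments."""
    s, m = 0.0, 0.0
    for _ in range(n):
        s += rng.uniform(-1.0, 1.0)  # EX_j = 0, Var(X_j) = 1/3
        m = max(m, abs(s))
    return m


def kolmogorov_check(n=50, lam=6.0, trials=5000, seed=1):
    """Return (empirical P(max|S_i| >= lam), the bound Var(S_n) / lam^2)."""
    rng = random.Random(seed)
    hits = sum(max_abs_partial_sum(n, rng) >= lam for _ in range(trials))
    return hits / trials, (n / 3.0) / lam ** 2


empirical, bound = kolmogorov_check()
print(empirical, bound)
```

The empirical frequency should fall below the bound; the gap indicates how conservative (3.1) is for this particular distribution.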

Remark 8.3.1: Recall that Chebychev's inequality asserts that for $\lambda > 0$, $P(|S_n| \ge \lambda) \le \mathrm{Var}(S_n)/\lambda^2$, and thus Kolmogorov's first inequality is significantly stronger. Kolmogorov's first inequality has an extension known as Doob's maximal inequality to a class of dependent random variables, called martingales (see Chapter 13).

The next inequality is due to P. Levy.

Definition 8.3.1: For any random variable $X$, a real number $c$ is called a median of $X$ if
$$P(X < c) \le \frac{1}{2} \le P(X \le c). \tag{3.3}$$
Such a $c$ always exists. It can be verified that $c_0 \equiv \inf\{x : P(X \le x) \ge \frac{1}{2}\}$ is a median. Note that if $c$ is a median of $X$ and $\alpha$ is a real number, then $\alpha c$ is a median of $\alpha X$ and $\alpha + c$ is a median of $\alpha + X$. Further, if $P(|X| \ge \alpha) < \frac{1}{2}$ for some $\alpha > 0$, then any median $c$ of $X$ satisfies $|c| \le \alpha$ (Problem 8.4).

Theorem 8.3.2: (Levy's inequality). Let $X_j$, $j = 1, \ldots, n$ be independent random variables. Let $S_j = \sum_{i=1}^j X_i$, and let $c_{j,n}$ be a median of $(S_j - S_n)$ for $1 \le j \le n$, where $c_{n,n}$ is set equal to 0. Then, for any $0 < \lambda < \infty$,

(i) $P\big(\max_{1\le j\le n} (S_j - c_{j,n}) \ge \lambda\big) \le 2P(S_n \ge \lambda)$;

(ii) $P\big(\max_{1\le j\le n} |S_j - c_{j,n}| \ge \lambda\big) \le 2P(|S_n| \ge \lambda)$.

Proof: Let
$$\begin{aligned}
A_j &= \{S_j - S_n \le c_{j,n}\} \quad \text{for } 1 \le j \le n, \\
B &= \Big\{\max_{1\le j\le n} (S_j - c_{j,n}) \ge \lambda\Big\}, \\
B_1 &= \{S_1 - c_{1,n} \ge \lambda\}, \\
B_j &= \{S_i - c_{i,n} < \lambda \text{ for } 1 \le i \le j-1,\ S_j - c_{j,n} \ge \lambda\}
\end{aligned}$$
for $j = 2, \ldots, n$. Then $B_1, \ldots, B_n$ are disjoint and $\bigcup_{j=1}^n B_j = B$. Since $X_1, \ldots, X_n$ are independent, $A_j$ and $B_j$ are independent for each $j = 1, 2, \ldots, n$. Also, for each $j$, $P(A_j) \ge \frac{1}{2}$ by the definition of a median, $A_j = \{S_j - c_{j,n} \le S_n\}$, and hence on $A_j \cap B_j$, $S_n \ge \lambda$ holds. Thus,
$$P(S_n \ge \lambda) \ge \sum_{j=1}^n P(A_j \cap B_j) = \sum_{j=1}^n P(A_j)P(B_j) \ge \frac{1}{2} P\Big(\bigcup_{j=1}^n B_j\Big) = \frac{1}{2} P(B),$$
proving part (i). Now part (ii) follows by applying part (i) to both $\{X_i\}_{i=1}^n$ and $\{-X_i\}_{i=1}^n$. □

Recall that if $\{Y_n\}_{n\ge 1}$ is a sequence of random variables, then convergence of $\{Y_n\}_{n\ge 1}$ w.p. 1 implies that $\{Y_n\}_{n\ge 1}$ converges in probability as well. A remarkable result of P. Levy is that if $\{S_n\}_{n\ge 1}$ is the sequence of partial sums of independent random variables and $\{S_n\}_{n\ge 1}$ converges in probability, then $\{S_n\}_{n\ge 1}$ must converge w.p. 1 as well. The proof of this uses Levy's inequality proved above.

Theorem 8.3.3: Let $\{X_n\}_{n\ge 1}$ be a sequence of independent random variables. Let $S_n = \sum_{j=1}^n X_j$ for $1 \le n < \infty$ and let $\{S_n\}_{n\ge 1}$ converge in probability to a random variable $S$. Then $S_n \to S$ w.p. 1.

Proof: Recall that a sequence $\{x_n\}_{n\ge 1}$ of real numbers converges iff it is Cauchy iff $\delta_n \equiv \sup\{|x_k - x_\ell| : k, \ell \ge n\} \to 0$ as $n \to \infty$. Let
$$\tilde\Delta_n \equiv \sup\{|S_k - S_\ell| : k, \ell \ge n\} \quad \text{and} \quad \Delta_n \equiv \sup\{|S_k - S_n| : k \ge n\}.$$
Then, $\tilde\Delta_n \le 2\Delta_n$ and $\tilde\Delta_n$ is decreasing in $n$. Suppose it is shown that
$$\Delta_n \longrightarrow_p 0. \tag{3.4}$$
Then, $\tilde\Delta_n \longrightarrow_p 0$ and hence there is a subsequence $\{n_k\}_{k\ge 1}$ such that $\tilde\Delta_{n_k} \to 0$ as $k \to \infty$ w.p. 1. Since $\tilde\Delta_n$ is decreasing in $n$, this implies that $\tilde\Delta_n \to 0$ w.p. 1. Thus it suffices to establish (3.4). Fix $0 < \epsilon < 1$. Let
$$S_{n,\ell} = S_{n+\ell} - S_n \ \text{ for } \ell \ge 1, \qquad \Delta_{n,k} = \max\{|S_{n,\ell}| : 1 \le \ell \le k\}, \ k \ge 1.$$
Note that for each $n \ge 1$, $\{\Delta_{n,k}\}_{k\ge 1}$ is a nondecreasing sequence, $\lim_{k\to\infty} \Delta_{n,k} = \Delta_n$ and hence, for any $n \ge 1$,
$$P(\Delta_n > \epsilon) = \lim_{k\to\infty} P(\Delta_{n,k} > \epsilon). \tag{3.5}$$
Levy's inequality (Theorem 8.3.2) will now be used to bound $P(\Delta_{n,k} > \epsilon)$ uniformly in $k$. Since $S_n \longrightarrow_p S$, for any $\eta > 0$, there exists an $n_0 \ge 1$ such that for all $n \ge n_0$,
$$P(|S_n - S| > \eta) < \eta.$$
This implies that for all $k \ge \ell \ge n_0$,
$$P(|S_k - S_\ell| > 2\eta) < 2\eta. \tag{3.6}$$
If $0 < \eta < \frac{1}{4}$, then the medians of $S_k - S_\ell$ for $k \ge \ell \ge n_0$ are bounded by $2\eta$. Hence, for $n \ge n_0$ and $k \ge 1$, applying Levy's inequality (i.e., the above theorem) to $\{X_i : n+1 \le i \le n+k\}$,
$$\begin{aligned}
P(\Delta_{n,k} > \epsilon) &= P\Big(\max_{1\le j\le k} |S_{n,j}| > \epsilon\Big) \\
&\le P\Big(\max_{1\le j\le k} |S_{n,j} - c_{n+j,n+k}| \ge \epsilon - 2\eta\Big) \\
&\le 2P(|S_{n,k}| \ge \epsilon - 2\eta).
\end{aligned}$$
Now, choosing $0 < \eta < \epsilon/4$, (3.6) yields
$$P(\Delta_{n,k} > \epsilon) < 4\eta < \epsilon \quad \text{for all } n \ge n_0,\ k \ge 1.$$
Then, by (3.5), $P(\Delta_n > \epsilon) \le \epsilon$ for all $n \ge n_0$. Hence, (3.4) holds. □

The following result on convergence of infinite series of independent random variables is an immediate consequence of the above theorem.

Theorem 8.3.4: (Khinchine-Kolmogorov's 1-series theorem). Let $\{X_n\}_{n\ge 1}$ be a sequence of independent random variables on a probability space $(\Omega, \mathcal{F}, P)$ such that $EX_n = 0$ for all $n \ge 1$ and $\sum_{n=1}^\infty EX_n^2 < \infty$. Then $S_n \equiv \sum_{j=1}^n X_j$ converges in mean square and almost surely, as $n \to \infty$.

Proof: For any $n, k \in \mathbb{N}$,
$$E(S_n - S_{n+k})^2 = \mathrm{Var}(S_n - S_{n+k}) = \sum_{j=n+1}^{n+k} \mathrm{Var}(X_j) = \sum_{j=n+1}^{n+k} EX_j^2,$$
by independence. Since $\sum_{n=1}^\infty EX_n^2 < \infty$, $\{S_n\}_{n\ge 1}$ is a Cauchy sequence in $L^2(\Omega, \mathcal{F}, P)$ and hence converges in mean square to some $S$ in $L^2(\Omega, \mathcal{F}, P)$. This implies that $S_n \longrightarrow_p S$, and by the above theorem $S_n \to S$ w.p. 1. □

Remark 8.3.2: It is possible to give another proof of the above theorem using Kolmogorov's inequality. See Problem 8.5.

Theorem 8.3.5: (Kolmogorov's 3-series theorem). Let $\{X_n\}_{n\ge 1}$ be a sequence of independent random variables on a probability space $(\Omega, \mathcal{F}, P)$ and let $S_n = \sum_{i=1}^n X_i$, $n \ge 1$. Then the sequence $\{S_n\}_{n\ge 1}$ converges w.p. 1 iff the following 3 series converge for some $0 < c < \infty$:

(i) $\sum_{i=1}^\infty P(|X_i| > c) < \infty$,

(ii) $\sum_{i=1}^\infty E(Y_i)$ converges,

(iii) $\sum_{i=1}^\infty \mathrm{Var}(Y_i) < \infty$,

where $Y_i = X_i I(|X_i| \le c)$, $i \ge 1$.

Proof: (Sufficiency). By (i) and the Borel-Cantelli lemma, $P(X_i \ne Y_i \text{ i.o.}) = P(|X_i| > c \text{ i.o.}) = 0$. Hence $\{S_n\}_{n\ge 1}$ converges w.p. 1 iff $\{T_n\}_{n\ge 1}$ converges w.p. 1, where $T_n = \sum_{i=1}^n Y_i$, $n \ge 1$. By (iii) and the 1-series theorem, the sequence $\{\sum_{i=1}^n (Y_i - EY_i)\}_{n\ge 1}$ converges w.p. 1. Hence, by (ii), $\{T_n\}_{n\ge 1}$ converges w.p. 1 and hence $\{S_n\}_{n\ge 1}$ converges w.p. 1.

(Necessity). Suppose $\{S_n\}_{n\ge 1}$ converges w.p. 1. Fix $0 < c < \infty$ and let $Y_i = X_i I(|X_i| \le c)$, $i \ge 1$. Since $\{S_n\}_{n\ge 1}$ converges w.p. 1, $X_n \to 0$ w.p. 1. Hence, w.p. 1, $|X_i| \le c$ for all but a finite number of $i$'s. If $A_i \equiv \{X_i \ne Y_i\} = \{|X_i| > c\}$, then by the second Borel-Cantelli lemma,
$$\sum_{i=1}^\infty P(A_i) < \infty, \quad \text{establishing (i).}$$
To establish (ii) and (iii), the following construction and the second inequality of Kolmogorov will be used. Without loss of generality, assume that there is another sequence $\{\tilde X_n\}_{n\ge 1}$ of random variables on the same probability space $(\Omega, \mathcal{F}, P)$ such that (a) $\{\tilde X_n\}_{n\ge 1}$ are independent, (b) $\{\tilde X_n\}_{n\ge 1}$ is independent of $\{X_n\}_{n\ge 1}$, and (c) for each $n \ge 1$, $X_n =_d \tilde X_n$, i.e., $X_n$ and $\tilde X_n$ have the same distribution. Let
$$\tilde Y_i = \tilde X_i I(|\tilde X_i| \le c), \qquad Z_i = Y_i - \tilde Y_i, \quad i \ge 1,$$
$$T_n \equiv \sum_{i=1}^n Y_i, \qquad \tilde T_n \equiv \sum_{i=1}^n \tilde Y_i, \qquad \text{and} \qquad R_n \equiv \sum_{i=1}^n Z_i, \quad n \ge 1.$$
Since $\{S_n \equiv \sum_{i=1}^n X_i\}_{n\ge 1}$ converges w.p. 1, and $X_i = Y_i$ for all but a finite number of $i$, $\{T_n\}_{n\ge 1}$ converges w.p. 1. Since $\{Y_i\}_{i\ge 1}$ and $\{\tilde Y_i\}_{i\ge 1}$ have the same distribution on $\mathbb{R}^\infty$, $\{\tilde T_n\}_{n\ge 1}$ converges w.p. 1. Thus the difference sequence $\{R_n\}_{n\ge 1}$ converges w.p. 1. Next, note that $\{Z_n\}_{n\ge 1}$ are independent random variables with mean 0 and $\{Z_n\}_{n\ge 1}$ are uniformly bounded by $2c$. Applying Kolmogorov's second inequality (Theorem 8.3.1 (ii), with $C = 2c$) to $\{Z_j : m < j \le m+n\}$ yields
$$P\Big(\max_{m<j\le m+n} |R_j - R_m| \le \epsilon\Big) \le \frac{(4c + 4\epsilon)^2}{\sum_{i=m+1}^{m+n} \mathrm{Var}(Z_i)} \tag{3.7}$$
for all $m \ge 1$, $n \ge 1$, $0 < \epsilon < \infty$.

Let $\Delta_m \equiv \sup_{j>m} |R_j - R_m|$. Let $n \to \infty$ in (3.7) to conclude that
$$P(\Delta_m \le \epsilon) \le \frac{(4c + 4\epsilon)^2}{\sum_{i=m+1}^\infty \mathrm{Var}(Z_i)}.$$
Now suppose (iii) does not hold. Then, since $Y_i$ and $\tilde Y_i$ are iid, $\mathrm{Var}(Z_i) = 2\mathrm{Var}(Y_i)$ for all $i \ge 1$, and thus $\sum_{i=m+1}^\infty \mathrm{Var}(Z_i) = \infty$ and hence $P(\Delta_m \le \epsilon) = 0$ for each $m \ge 1$, $0 < \epsilon < \infty$. This implies that $P(\Delta_m > \epsilon) = 1$ for each $\epsilon > 0$ and hence that $\Delta_m = \infty$ w.p. 1 for all $m \ge 1$. This contradicts the convergence w.p. 1 of the sequence $\{R_n\}_{n\ge 1}$. Hence (iii) holds. By the 1-series theorem, $\{\sum_{i=1}^n (Y_i - EY_i)\}_{n\ge 1}$ converges w.p. 1. Since $\{\sum_{i=1}^n Y_i\}_{n\ge 1}$ converges w.p. 1, $\sum_{i=1}^\infty EY_i$ converges, establishing (ii). This completes the proof of the necessity part and of the theorem. □

Remark 8.3.3: To go from the convergence w.p. 1 of $\{R_n\}_{n\ge 1}$ to (iii), it suffices to show that if (iii) fails, then for each $0 < A < \infty$, $P(|R_n| \le A) \to 0$ as $n \to \infty$. This can be established without the use of (3.7) but using the central limit theorem (to be proved later in Chapter 11), which shows that if $\mathrm{Var}(R_n) \to \infty$, then
$$P\bigg(\frac{R_n}{\sqrt{\mathrm{Var}(R_n)}} \le x\bigg) \to \Phi(x) \equiv \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x e^{-t^2/2}\,dt,$$
for all $x$ in $\mathbb{R}$. (Also see Billingsley (1995), p. 290.)
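To illustrate the 1-series theorem (Theorem 8.3.4), consider the random-sign series $\sum_j \varepsilon_j / j$ with iid fair signs $\varepsilon_j$: here $EX_j = 0$ and $\sum_j EX_j^2 = \sum_j j^{-2} < \infty$, so the partial sums should settle down. The sketch below is an assumed example (the series, the cutoffs, and the function names are not from the text); it tracks how much the partial sums move after index $N$.

```python
import random


def random_sign_series(n_terms, seed=2):
    """Partial sums S_n = sum_{j<=n} eps_j / j with iid fair signs eps_j."""
    rng = random.Random(seed)
    s, sums = 0.0, []
    for j in range(1, n_terms + 1):
        s += rng.choice((-1.0, 1.0)) / j
        sums.append(s)
    return sums


def tail_variance(n, m):
    """Var(S_m - S_n) = sum_{j=n+1}^m 1/j^2, by independence."""
    return sum(1.0 / (j * j) for j in range(n + 1, m + 1))


sums = random_sign_series(200_000)
N = 10_000
# Largest movement of the partial sums after index N (sums[N - 1] is S_N).
oscillation = max(abs(s - sums[N - 1]) for s in sums[N:])
print(tail_variance(N, 200_000), oscillation)
```

The tail variance is below $1/N$, and by Kolmogorov's first inequality the oscillation after $N$ is correspondingly small with high probability, which is the Cauchy behavior used in the proof.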

8.4 Kolmogorov and Marcinkiewicz-Zygmund SLLNs

For a sequence of independent and identically distributed random variables $\{X_n\}_{n\ge 1}$, Kolmogorov showed that $\{X_n\}_{n\ge 1}$ obeys the SLLN with $b_n = n$ iff $E|X_1| < \infty$. Marcinkiewicz and Zygmund generalized this result and proved a class of SLLNs for $\{X_n\}_{n\ge 1}$ when $E|X_1|^p < \infty$ for some $p \in (0, 2)$. The proof uses Kolmogorov's 3-series theorem and some results from real analysis. This approach is to be contrasted with Etemadi's proof of the SLLN, which uses a decomposition of the random variables $\{X_n\}_{n\ge 1}$ into positive and negative parts and uses monotonicity of the sum to establish almost sure convergence along a subsequence by an application of the Borel-Cantelli lemma. The alternative approach presented in this section is also useful for proving SLLNs for sums of independent random variables that are not necessarily identically distributed. The next three are preparatory results for Theorem 8.4.4.

Lemma 8.4.1: (Abel's summation formula). Let $\{a_n\}_{n\ge 1}$ and $\{b_n\}_{n\ge 1}$ be two sequences of real numbers. Then, for all $n \ge 2$,
$$\sum_{j=1}^n a_j b_j = A_n b_n - \sum_{j=1}^{n-1} A_j (b_{j+1} - b_j) \tag{4.1}$$
where $A_k = \sum_{j=1}^k a_j$, $k \ge 1$.

Proof: Let $A_0 = 0$. Then, $a_j = A_j - A_{j-1}$, $j \ge 1$. Hence,
$$\begin{aligned}
\sum_{j=1}^n a_j b_j &= \sum_{j=1}^n (A_j - A_{j-1}) b_j = \sum_{j=1}^n A_j b_j - \sum_{j=1}^n A_{j-1} b_j \\
&= \sum_{j=1}^n A_j b_j - \sum_{j=1}^{n-1} A_j b_{j+1},
\end{aligned}$$
yielding (4.1). □
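Abel's summation formula (4.1) is easy to confirm on a small example. The following sketch evaluates both sides with exact integer arithmetic; the two sequences and the name `abel_sides` are arbitrary illustrative choices.

```python
from itertools import accumulate


def abel_sides(a, b):
    """Return (left side, right side) of (4.1) for equal-length lists a, b (n >= 2)."""
    n = len(a)
    A = list(accumulate(a))  # A[k - 1] = a_1 + ... + a_k
    lhs = sum(x * y for x, y in zip(a, b))
    rhs = A[n - 1] * b[n - 1] - sum(A[j] * (b[j + 1] - b[j]) for j in range(n - 1))
    return lhs, rhs


lhs, rhs = abel_sides([3, -1, 4, 1, -5], [2, 7, 1, 8, 2])
print(lhs, rhs)  # both sides equal 1 for these sequences
```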

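Lemma 8.4.2: (Kronecker's lemma). Let $\{a_n\}_{n\ge 1}$ and $\{b_n\}_{n\ge 1}$ be sequences of real numbers such that $0 < b_n \uparrow \infty$ as $n \to \infty$ and $\sum_{j=1}^\infty a_j$ converges. Then,
$$\frac{1}{b_n} \sum_{j=1}^n a_j b_j \longrightarrow 0 \quad \text{as} \quad n \to \infty. \tag{4.2}$$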
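Before the proof, a quick numerical illustration of (4.2) may help. The sketch assumes $a_j = 1/j^2$ (a convergent series) and $b_j = j$, in which case the normalized sum equals $H_n/n$ with $H_n$ the $n$th harmonic number; these sequences and the helper name are assumptions made only for the example.

```python
def kronecker_average(a, b, n):
    """(1/b_n) * sum_{j=1}^n a_j b_j for 1-based sequences given as functions."""
    return sum(a(j) * b(j) for j in range(1, n + 1)) / b(n)


def a(j):
    return 1.0 / (j * j)  # sum of a_j converges


def b(j):
    return float(j)  # 0 < b_n increasing to infinity


values = [kronecker_average(a, b, n) for n in (10, 100, 1000)]
print(values)
```

The printed values decrease toward 0, slowly (like $\log n / n$), as the lemma predicts.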

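Proof: Let $A_k = \sum_{j=1}^k a_j$, $A \equiv \sum_{j=1}^\infty a_j = \lim_{k\to\infty} A_k$ and $R_k = A - A_k$, $k \ge 1$. Then, by Lemma 8.4.1, for $n \ge 2$,
$$\begin{aligned}
\sum_{j=1}^n a_j b_j &= A_n b_n - \sum_{j=1}^{n-1} A_j (b_{j+1} - b_j) \\
&= A_n b_n - \sum_{j=1}^{n-1} (A - R_j)(b_{j+1} - b_j) \\
&= A_n b_n - A \sum_{j=1}^{n-1} (b_{j+1} - b_j) + \sum_{j=1}^{n-1} R_j (b_{j+1} - b_j) \\
&= A_n b_n - A b_n + A b_1 + \sum_{j=1}^{n-1} R_j (b_{j+1} - b_j) \\
&= -R_n b_n + A b_1 + \sum_{j=1}^{n-1} R_j (b_{j+1} - b_j).
\end{aligned} \tag{4.3}$$
Since $\sum_{n=1}^\infty a_n$ converges, $R_n \to 0$ as $n \to \infty$. Hence, given any $\epsilon > 0$, there exists $N = N_\epsilon > 1$ such that $|R_n| \le \epsilon$ for all $n \ge N$. Since $0 < b_n \uparrow \infty$, for all $n > N$,
$$\begin{aligned}
\bigg| b_n^{-1} \sum_{j=1}^{n-1} R_j (b_{j+1} - b_j) \bigg| &\le b_n^{-1} \sum_{j=1}^{N-1} |R_j|\,|b_{j+1} - b_j| + \epsilon\, b_n^{-1} \sum_{j=N}^{n-1} (b_{j+1} - b_j) \\
&\le b_n^{-1} \sum_{j=1}^{N-1} |R_j|\,|b_{j+1} - b_j| + \epsilon.
\end{aligned}$$
Now letting $n \to \infty$ and then letting $\epsilon \downarrow 0$ yields
$$\limsup_{n\to\infty} \bigg| b_n^{-1} \sum_{j=1}^{n-1} R_j (b_{j+1} - b_j) \bigg| = 0.$$
Hence, (4.2) follows from (4.3). □

Lemma 8.4.3: For any random variable $X$,
$$\sum_{n=1}^\infty P(|X| > n) \le E|X| \le \sum_{n=0}^\infty P(|X| > n). \tag{4.4}$$

Proof: For $n \ge 1$, let $A_n = \{n-1 < |X| \le n\}$. Define the random variables
$$Y = \sum_{n=1}^\infty (n-1) I_{A_n} \quad \text{and} \quad Z = \sum_{n=1}^\infty n I_{A_n}.$$
Then, it is clear that $Y \le |X| \le Z$, so that
$$EY \le E|X| \le EZ. \tag{4.5}$$
Note that
$$\begin{aligned}
EY &= \sum_{n=1}^\infty (n-1) P(A_n) = \sum_{n=2}^\infty \sum_{j=1}^{n-1} P(A_n) \\
&= \sum_{j=1}^\infty \sum_{n=j+1}^\infty P(n-1 < |X| \le n) = \sum_{j=1}^\infty P(|X| > j).
\end{aligned}$$
Similarly, one can show that $EZ = \sum_{j=0}^\infty P(|X| > j)$. Hence, (4.4) follows. □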
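The two-sided bound (4.4) can be checked directly when the tail probabilities are known in closed form. The sketch below assumes $X \sim$ Exponential(1), for which $P(|X| > n) = e^{-n}$ and $E|X| = 1$; the distribution and the helper name are illustrative assumptions.

```python
import math


def tail_sum_bounds(tail, terms=200):
    """Return (sum_{n>=1} tail(n), sum_{n>=0} tail(n)), truncated after `terms` terms."""
    lower = sum(tail(n) for n in range(1, terms))
    upper = tail(0) + lower
    return lower, upper


# Assumed example: X ~ Exponential(1), so P(|X| > n) = exp(-n) and E|X| = 1.
lower, upper = tail_sum_bounds(lambda n: math.exp(-n))
print(lower, 1.0, upper)
```

Here the lower sum is $1/(e-1) \approx 0.58$ and the upper sum is $e/(e-1) \approx 1.58$, bracketing $E|X| = 1$ as (4.4) requires.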

Theorem 8.4.4: (Marcinkiewicz-Zygmund SLLNs). Let $\{X_n\}_{n\ge 1}$ be a sequence of identically distributed random variables and let $p \in (0, 2)$. Write $S_n = \sum_{i=1}^n X_i$, $n \ge 1$.

(i) If $\{X_n\}_{n\ge 1}$ are pairwise independent and
$$\frac{S_n - nc}{n^{1/p}} \quad \text{converges w.p. 1} \tag{4.6}$$
for some $c \in \mathbb{R}$, then $E|X_1|^p < \infty$.

(ii) Conversely, if $E|X_1|^p < \infty$ and $\{X_n\}_{n\ge 1}$ are independent, then (4.6) holds with $c = EX_1$ for $p \in [1, 2)$ and with any $c \in \mathbb{R}$ for $p \in (0, 1)$.

Corollary 8.4.5: (Kolmogorov's SLLN). Let $\{X_n\}_{n\ge 1}$ be a sequence of iid random variables. Then,
$$\frac{S_n - nc}{n} \to 0 \quad \text{w.p. 1}$$
for some $c \in \mathbb{R}$ iff $E|X_1| < \infty$, in which case, $c = EX_1$.

Thus, Kolmogorov's SLLN corresponds to the special case $p = 1$ of Theorem 8.4.4. Note that compared with the WLLN and Borel's SLLN of Sections 8.1 and 8.2, Kolmogorov's SLLN presents a significant improvement in the moment condition, i.e., it assumes the finiteness of only the first absolute moment. Further, both Kolmogorov's SLLN and the Marcinkiewicz-Zygmund SLLN are proved under minimal moment conditions, since the corresponding moment conditions are shown to be necessary.

Proof of Theorem 8.4.4: (i) Suppose that (4.6) holds for some $c \in \mathbb{R}$. Then,
$$\begin{aligned}
\frac{X_n}{n^{1/p}} &= \frac{S_n - S_{n-1}}{n^{1/p}} \\
&= \frac{S_n - nc}{n^{1/p}} - \frac{S_{n-1} - (n-1)c}{n^{1/p}} + \frac{c}{n^{1/p}} \\
&\to 0 \quad \text{as } n \to \infty, \text{ a.s.}
\end{aligned}$$
Hence, $P(|X_n / n^{1/p}| > 1 \text{ i.o.}) = 0$. By the second Borel-Cantelli lemma and by the pairwise independence of $\{X_n\}_{n\ge 1}$, this implies
$$\sum_{n=1}^\infty P\bigg(\frac{|X_n|}{n^{1/p}} > 1\bigg) < \infty, \quad \text{i.e.,} \quad \sum_{n=1}^\infty P\big(|X_1|^p > n\big) < \infty.$$
Hence, by Lemma 8.4.3, $E|X_1|^p < \infty$.

To prove (ii), suppose that $E|X_1|^p < \infty$ for some $p \in (0, 2)$. For $1 \le p < 2$, w.l.o.g. assume that $EX_1 = 0$. Next, define the variables $Z_n = X_n I(|X_n|^p \le n)$, $n \ge 1$. Then, by Lemma 8.4.3,
$$\sum_{n=1}^\infty P(X_n \ne Z_n) = \sum_{n=1}^\infty P(|X_n|^p > n) = \sum_{n=1}^\infty P(|X_1|^p > n) \le E|X_1|^p < \infty.$$
Hence, by the Borel-Cantelli lemma,
$$P(X_n \ne Z_n \text{ i.o.}) = 0. \tag{4.7}$$
Note that, in view of (4.7), (4.6) holds with $c = 0$ if and only if
$$n^{-1/p} \sum_{i=1}^n Z_i \to 0 \quad \text{as} \quad n \to \infty, \quad \text{w.p. 1.} \tag{4.8}$$
Note that for any $j \in \mathbb{N}$, $\theta > 1$ and $\beta \in (-\infty, 0) \setminus \{-1\}$,
$$\begin{aligned}
\sum_{n=j}^\infty n^{-\theta} &\le j^{-\theta} + \sum_{n=j+1}^\infty \int_{n-1}^n x^{-\theta}\,dx \\
&= j^{-\theta} + \frac{1}{\theta - 1} \cdot j^{-(\theta-1)} \\
&\le \frac{\theta}{\theta - 1} \cdot j^{-(\theta-1)}
\end{aligned} \tag{4.9}$$
and similarly,
$$\begin{aligned}
\sum_{n=1}^j n^\beta &\le \big[\beta + j^{(\beta+1)}\big] / (\beta + 1) \\
&\le \frac{\beta}{\beta + 1} I(\beta < -1) + \frac{j^{\beta+1}}{\beta + 1} I(-1 < \beta < 0).
\end{aligned} \tag{4.10}$$
Now,
$$\begin{aligned}
\sum_{n=1}^\infty \mathrm{Var}(Z_n / n^{1/p}) &\le \sum_{n=1}^\infty EX_1^2 I(|X_1|^p \le n) \cdot n^{-2/p} \\
&= \sum_{n=1}^\infty \sum_{j=1}^n EX_1^2 I(j-1 < |X_1|^p \le j) \cdot n^{-2/p} \\
&= \sum_{j=1}^\infty \bigg( \sum_{n=j}^\infty n^{-2/p} \bigg) \cdot EX_1^2 I(j-1 < |X_1|^p \le j) \\
&\le \frac{2}{2-p} \sum_{j=1}^\infty j^{-(\frac{2}{p} - 1)} \cdot EX_1^2 I(j-1 < |X_1|^p \le j) \quad \text{(by (4.9))} \\
&\le \frac{2}{2-p} \sum_{j=1}^\infty j^{-(\frac{2}{p} - 1)} \cdot E|X_1|^p I(j-1 < |X_1|^p \le j) \cdot (j^{1/p})^{2-p} \\
&= \frac{2}{2-p} E|X_1|^p < \infty.
\end{aligned}$$
Hence, by Theorem 8.3.4, $\sum_{n=1}^\infty (Z_n - EZ_n)/n^{1/p}$ converges w.p. 1. By Kronecker's lemma (viz. Lemma 8.4.2),
$$n^{-1/p} \sum_{j=1}^n (Z_j - EZ_j) \to 0 \quad \text{as} \quad n \to \infty, \quad \text{w.p. 1.} \tag{4.11}$$
Now consider the case $p = 1$. In this case, $E|X_1| < \infty$ and by the DCT, $EZ_n = EX_1 I(|X_1| \le n) \to EX_1 = 0$ as $n \to \infty$. Hence, $n^{-1} \sum_{i=1}^n EZ_i \to 0$. Part (ii) of the theorem now follows from (4.8) and (4.11) for $p = 1$.

Next consider the case $p \in (0, 2)$, $p \ne 1$. Using (4.9) and (4.10), one can show (cf. Problem 8.12) that
$$n^{-1/p} \sum_{j=1}^n EZ_j \to 0 \quad \text{as} \quad n \to \infty. \tag{4.12}$$
Hence, by (4.8), (4.11), and (4.12), one gets (4.6) with $c = 0$ for $p \in (0, 2) \setminus \{1\}$. Finally, note that for $p \in (0, 1)$, and for any $c \in \mathbb{R}$,
$$\frac{S_n - nc}{n^{1/p}} = \frac{S_n}{n^{1/p}} - \frac{nc}{n^{1/p}} \to 0 \quad \text{as } n \to \infty, \text{ a.s.,}$$
whenever $S_n / n^{1/p} \to 0$ as $n \to \infty$, w.p. 1. Hence, (4.6) holds with an arbitrary $c \in \mathbb{R}$ for $p \in (0, 1)$. This completes the proof of part (ii) for $p \in (0, 2) \setminus \{1\}$ and hence of the theorem. □

The next result gives a SLLN for independent random variables that are not necessarily identically distributed.

Theorem 8.4.6: Let $\{X_n\}_{n\ge 1}$ be a sequence of independent random variables. If $\sum_{n=1}^\infty E|X_n|^{\alpha_n} / n^{\alpha_n} < \infty$ for some $\alpha_n \in [1, 2]$, $n \ge 1$, then
$$n^{-1} \sum_{j=1}^n (X_j - EX_j) \to 0 \quad \text{as} \quad n \to \infty, \quad \text{w.p. 1.} \tag{4.13}$$

Proof: W.l.o.g. suppose that $EX_n = 0$ for all $n \ge 1$. Let $Y_n = X_n I(|X_n| \le n)/n$. Note that $|EY_n| = |n^{-1}(EX_n - EX_n I(|X_n| > n))| = n^{-1} |EX_n I(|X_n| > n)|$, $n \ge 1$. Since $1 \le \alpha_n \le 2$,
$$\sum_{n=1}^\infty \{P(|X_n| > n) + |EY_n|\} \le 2 \sum_{n=1}^\infty n^{-1} E|X_n| I(|X_n| > n) \le 2 \sum_{n=1}^\infty E|X_n|^{\alpha_n} / n^{\alpha_n} < \infty$$
and
$$\sum_{n=1}^\infty \mathrm{Var}(Y_n) \le \sum_{n=1}^\infty n^{-2} EX_n^2 I(|X_n| \le n) \le \sum_{n=1}^\infty n^{-\alpha_n} E|X_n|^{\alpha_n} < \infty.$$
Hence, by Kolmogorov's 3-series theorem, $\sum_{n=1}^\infty (X_n / n)$ converges w.p. 1. Now the theorem follows from Lemma 8.4.2. □

Corollary 8.4.7: Let $\{X_n\}_{n\ge 1}$ be a sequence of independent random variables such that for some $\alpha \in [1, 2]$, $\sum_{n=1}^\infty (n^{-\alpha} E|X_n|^\alpha) < \infty$. Then (4.13) holds.
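The normalization $n^{1/p}$ in the Marcinkiewicz-Zygmund SLLN can be explored by simulation. The sketch below uses an assumed example that is not from the text: iid Uniform(-1, 1) variables (so $EX_1 = 0$ and $E|X_1|^p < \infty$ for every $p$) with $p = 1.5$, and it evaluates $S_n / n^{1/p}$ at two sample sizes. The function name is hypothetical.

```python
import random


def mz_normalized_sum(n, p, sampler, seed=3):
    """Return S_n / n**(1/p), where S_n = X_1 + ... + X_n and X_i = sampler(rng)."""
    rng = random.Random(seed)
    s = 0.0
    for _ in range(n):
        s += sampler(rng)
    return s / n ** (1.0 / p)


def uniform_centered(rng):
    return rng.uniform(-1.0, 1.0)  # EX_1 = 0, all moments finite


# p = 1.5 lies in [1, 2), so Theorem 8.4.4 (ii) applies with c = EX_1 = 0.
ratios = [mz_normalized_sum(n, 1.5, uniform_centered) for n in (1_000, 200_000)]
print(ratios)
```

For this light-tailed choice the ratio has standard deviation of order $n^{1/2 - 1/p}$, so it shrinks slowly; heavier-tailed samplers with $E|X_1|^p < \infty$ could be substituted to probe the boundary of the moment condition.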

8.5 Renewal theory

8.5.1 Definitions and basic properties

Let $\{X_n\}_{n\ge 0}$ be a sequence of nonnegative random variables that are independent and, for $i \ge 1$, identically distributed with cdf $F$. Let $S_n = \sum_{i=0}^n X_i$ for $n \ge 0$. Imagine a system where a component in operation at time $t = 0$ lasts $X_0$ units of time and then is replaced by a new one that lasts $X_1$ units of time, which, at failure, is replaced by yet another new one that lasts $X_2$ units of time and so on. The sequence $\{S_n\}_{n\ge 0}$ represents the sequence of epochs when 'renewal' takes place and is called a renewal sequence. Assume that $P(X_1 = 0) < 1$. Then, since $P(X_1 < \infty) = 1$, it follows that for each $n$, $P(S_n < \infty) = 1$ and $\lim_{n\to\infty} S_n = \infty$ w.p. 1 (Problem 8.16). Now define the counting process $\{N(t) : t \ge 0\}$ by the relation
$$N(t) = k \quad \text{if} \quad S_{k-1} \le t < S_k \quad \text{for } k = 0, 1, 2, \ldots \tag{5.1}$$
where $S_{-1} = 0$. Thus $N(t)$ counts the number of renewals up to time $t$.

Definition 8.5.1: The stochastic process $\{N(t) : t \ge 0\}$ is called a renewal process with lifetime distribution $F$. The renewal sequence $\{S_n\}_{n\ge 0}$ and the renewal process $\{N(t) : t \ge 0\}$ are called nondelayed or standard if $X_0$ has the same distribution as $X_1$ and are called delayed otherwise.

Since $P(X_1 \ge 0) = 1$, $\{S_n\}_{n\ge 0}$ is nondecreasing in $n$ and for each $t \ge 0$, the event $\{N(t) = k\} = \{S_{k-1} \le t < S_k\}$ belongs to the $\sigma$-algebra $\sigma\{X_j : 0 \le j \le k\}$ and hence $N(t)$ is a random variable. Using the nontriviality hypothesis that $P(X_1 = 0) < 1$, it is shown below that for each $t > 0$, the random variable $N(t)$ has finite moments of all orders.

Proposition 8.5.1: Let $P(X_1 = 0) < 1$. Then there exist $0 < \lambda < 1$ (not depending on $t$) and a constant $C(t) \in (0, \infty)$ such that
$$P(N(t) > k) \le C(t)\lambda^k \quad \text{for all} \quad k > 0. \tag{5.2}$$

Proof: For $t > 0$, $k \in \mathbb{N}$, and $\theta > 0$,
$$\begin{aligned}
P(N(t) > k) &= P(S_k \le t) \\
&= P\big(e^{-\theta S_k} \ge e^{-\theta t}\big) \\
&\le e^{\theta t} E\big(e^{-\theta S_k}\big) \quad \text{(by Markov's inequality)} \\
&= e^{\theta t} E\big(e^{-\theta X_0}\big) \Big(E\big(e^{-\theta X_1}\big)\Big)^k.
\end{aligned}$$
By the BCT, $\lim_{\theta\uparrow\infty} E(e^{-\theta X_1}) = P(X_1 = 0) < 1$. Hence, there exists a $\theta$ large such that $\lambda \equiv E(e^{-\theta X_1})$ is less than one, thus completing the proof. □

Corollary 8.5.2: There exists an $s_0 > 0$ such that the moment generating function (m.g.f.) $E(e^{sN(t)}) < \infty$ for all $s < s_0$ and $t \ge 0$.

Proof: From (5.2), for any $t > 0$, it follows that $P(N(t) = k) = O(\lambda^k)$ as $k \to \infty$ for some $0 < \lambda < 1$ and hence $E(e^{sN(t)}) = \sum_{k=0}^\infty (e^s)^k P(N(t) = k) < \infty$ for any $s$ such that $e^s \lambda < 1$, i.e., for all $s < s_0 \equiv -\log \lambda$. □

From (5.1), it follows that for $t > 0$,
$$S_{N(t)-1} \le t < S_{N(t)} \ \Rightarrow\ \frac{S_{N(t)-1}}{(N(t)-1)} \cdot \frac{N(t)-1}{N(t)} \le \frac{t}{N(t)} \le \frac{S_{N(t)}}{N(t)}. \tag{5.3}$$
Let $A$ be the event that $\frac{S_n}{n} \to EX_1$ as $n \to \infty$ and let $B$ be the event that $N(t) \to \infty$ as $t \to \infty$. Since $P(S_n < \infty) = 1$ for each $n$, it follows that $P(B) = 1$. Also, by the SLLN, $P(A) = 1$. On the event $C = A \cap B$, it holds that
$$\frac{S_{N(t)}}{N(t)} \to EX_1 \quad \text{as } t \to \infty.$$
This together with (5.3) yields the following result.

Proposition 8.5.3: Suppose that $P(X_1 = 0) < 1$. Then,
$$\lim_{t\to\infty} \frac{N(t)}{t} = \frac{1}{EX_1} \quad \text{w.p. 1.} \tag{5.4}$$
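Proposition 8.5.3 lends itself to a direct simulation. The sketch below counts renewals according to (5.1) for a nondelayed process; the Uniform(0, 2) lifetime distribution (mean 1), the horizon $t = 10{,}000$, and the function name are assumptions chosen only for illustration.

```python
import random


def renewal_count(t, lifetime, rng):
    """N(t) = #{n >= 0 : S_n <= t}, with S_n = X_0 + ... + X_n, as in (5.1)."""
    count, s = 0, lifetime(rng)  # s holds S_count
    while s <= t:
        count += 1
        s += lifetime(rng)
    return count


def uniform_lifetime(rng):
    return rng.uniform(0.0, 2.0)  # assumed lifetime law with EX_1 = 1


rng = random.Random(4)
t = 10_000.0
rate = renewal_count(t, uniform_lifetime, rng) / t
print(rate)  # should be close to 1 / EX_1 = 1
```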

Definition 8.5.2: The function $U(t) \equiv EN(t)$ for the nondelayed process is called the renewal function.

An explicit expression for $U(\cdot)$ is given by (5.13) below. Next consider the convergence of $EN(t)/t$. By (5.4) and Fatou's lemma, one gets
$$\liminf_{t\to\infty} \frac{EN(t)}{t} \ge \frac{1}{EX_1}. \tag{5.5}$$
It turns out that the $\liminf_{t\to\infty}$ in (5.5) can be replaced by $\lim_{t\to\infty}$ and $\ge$ by equality. To do this it suffices to show that the family $\{\frac{N(t)}{t} : t \ge k\}$ is uniformly integrable for some $k < \infty$. This can be done by showing that $E(\frac{N(t)}{t})^2$ is bounded in $t$ (see Chung (1974), Chapter 5). An alternate approach is to bound the $\limsup$. For this one can use an identity known as Wald's equation (see also Chapter 13).

8.5.2 Wald's equation

Let $\{X_j\}_{j\ge 1}$ be independent random variables with $EX_j = 0$ for all $j \ge 1$. Also, let $S_0 = 0$, $S_n = \sum_{j=1}^n X_j$, $n \ge 1$.

Definition 8.5.3: A positive integer valued random variable $N$ is called a stopping time with respect to $\{X_j\}_{j\ge 1}$ if for every $j \ge 1$, the event $\{N = j\} \in \sigma\{X_1, \ldots, X_j\}$. A stopping time $N$ is called bounded if there exists a $K < \infty$ such that $P(N \le K) = 1$.

Example 8.5.1: $N \equiv \min\{n : \sum_{j=1}^n X_j \ge 25\}$ is a stopping time w.r.t. $\{X_j\}_{j\ge 1}$, but $M \equiv \max\{n : \sum_{j=1}^n X_j \ge 25\}$ is not.

Proposition 8.5.4: Let $\{X_j\}_{j\ge 1}$ be independent random variables with $EX_j = 0$. Let $N$ be a bounded stopping time w.r.t. $\{X_j\}_{j\ge 1}$. Then
$$E(|S_N|) < \infty \quad \text{and} \quad ES_N = 0.$$

Proof: Let $K \in \mathbb{N}$ be such that $P(N \le K) = 1$. Then $|S_N| \le \sum_{j=1}^K |X_j|$ and hence $E|S_N| < \infty$. Next, $S_N = \sum_{j=1}^K X_j I(N \ge j)$ and hence
$$ES_N = \sum_{j=1}^K E\big(X_j I(N \ge j)\big).$$
But the event $\{N \ge j\} = \{N \le j-1\}^c \in \sigma\{X_1, X_2, \ldots, X_{j-1}\}$. Since $X_j$ is independent of $\sigma\{X_1, X_2, \ldots, X_{j-1}\}$,
$$E\big(X_j I(N \ge j)\big) = 0 \quad \text{for } 1 \le j \le K.$$
Thus $ES_N = 0$. □

Corollary 8.5.5: Let $\{X_j\}_{j\ge 1}$ be iid random variables with $E|X_1| < \infty$. Let $N$ be a bounded stopping time w.r.t. $\{X_j\}_{j\ge 1}$. Then
$$ES_N = (EN)EX_1.$$

Corollary 8.5.6: Let $\{X_j\}_{j\ge 1}$ be iid nonnegative random variables with $E|X_1| < \infty$. Let $N$ be a stopping time w.r.t. $\{X_j\}_{j\ge 1}$. Then
$$ES_N = (EN)EX_1.$$

Proof: Let $N_k = N \wedge k$, $k = 1, 2, \ldots$. Then $N_k$ is a bounded stopping time. By Corollary 8.5.5,
$$E(S_{N_k}) = (EN_k)EX_1.$$
Let $k \uparrow \infty$. Then $0 \le S_{N_k} \uparrow S_N$ and $N_k \uparrow N$. By the MCT, $ES_{N_k} \uparrow ES_N$ and $EN_k \uparrow EN$. □

Theorem 8.5.7: (Wald's equation). Let $\{X_j\}_{j\ge 1}$ be iid random variables with $E|X_1| < \infty$. Let $N$ be a stopping time w.r.t. $\{X_j\}_{j\ge 1}$ such that $EN < \infty$. Then
$$ES_N = (EN)EX_1.$$

Proof: Let $T_n = \sum_{j=1}^n |X_j|$, $n \ge 1$. Let $N_k = N \wedge k$, $k = 1, 2, \ldots$. Then by Corollary 8.5.5,
$$E(S_{N_k}) = (EN_k)EX_1.$$
Also, $|S_{N_k}| \le T_{N_k}$ and $ET_{N_k} = (EN_k)E|X_1|$. Further, as $k \to \infty$,
$$N_k \to N, \quad S_{N_k} \to S_N, \quad T_{N_k} \to T_N, \quad \text{and} \quad ET_{N_k} \to ET_N = (EN)E|X_1| < \infty.$$
So, by the extended DCT (Theorem 2.3.11),
$$ES_{N_k} \to ES_N, \quad \text{i.e.,} \quad (EN_k)EX_1 \to ES_N, \quad \text{i.e.,} \quad ES_N = (EN)EX_1. \quad \Box$$
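Wald's equation can be checked numerically with the first-passage stopping time of Example 8.5.1. The sketch below assumes iid Uniform(0, 1) steps (so $EX_1 = 1/2$) and estimates $ES_N$ and $(EN)EX_1$ from the same simulated paths; the step distribution, trial count, and function names are illustrative assumptions.

```python
import random


def first_passage(level, rng):
    """Return (N, S_N) for N = min{n : X_1 + ... + X_n >= level}, X_j ~ Uniform(0, 1)."""
    n, s = 0, 0.0
    while s < level:
        s += rng.random()
        n += 1
    return n, s


def wald_check(level=25.0, trials=20_000, seed=5):
    """Return Monte Carlo estimates of E S_N and of (E N) * E X_1, with E X_1 = 1/2."""
    rng = random.Random(seed)
    total_n, total_s = 0, 0.0
    for _ in range(trials):
        n, s = first_passage(level, rng)
        total_n += n
        total_s += s
    return total_s / trials, (total_n / trials) * 0.5


mean_s, wald_rhs = wald_check()
print(mean_s, wald_rhs)
```

The two estimates should agree up to Monte Carlo error, and both exceed the level 25 because $S_N$ overshoots it. The analogous check with $M$ from Example 8.5.1 is not meaningful, since $M$ is not a stopping time.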

8.5.3 The renewal theorems

In this section, two versions of the renewal theorem will be proved. For this, the notation and concepts introduced in Sections 8.5.1 and 8.5.2 will be used without further explanation. Note that for each $t > 0$ and $j = 0, 1, 2, \ldots$, the event $\{N(t) = j\} = \{S_{j-1} \le t < S_j\}$ belongs to $\sigma\{X_0, \ldots, X_j\}$. Thus, by Wald's equation (Theorem 8.5.7 above),
$$E(S_{N(t)}) = \big(EN(t)\big) EX_1 + EX_0.$$
Let $m \in (0, \infty)$ and $\tilde X_i = \min\{X_i, m\}$, $i \ge 0$. Let $\{\tilde S_n\}_{n\ge 0}$ and $\{\tilde N(t)\}_{t\ge 0}$ be the associated renewal sequence and renewal process, respectively. Again, by Wald's equation,
$$E\tilde S_{\tilde N(t)} = \big(E\tilde N(t)\big) E\tilde X_1 + E\tilde X_0.$$
But since $\tilde S_{\tilde N(t)-1} \le t < \tilde S_{\tilde N(t)}$, it follows that $\tilde S_{\tilde N(t)} \le t + m$ and hence
$$\big(E\tilde N(t)\big) E\tilde X_1 + E\tilde X_0 \le t + m.$$
This yields
$$\limsup_{t\to\infty} \frac{E\tilde N(t)}{t} \le \frac{1}{E\tilde X_1}.$$
Clearly, for all $t > 0$, $\tilde N(t) \ge N(t)$ and hence
$$\limsup_{t\to\infty} \frac{EN(t)}{t} \le \frac{1}{E\tilde X_1}. \tag{5.6}$$
Since this is true for each $m \in (0, \infty)$ and by the MCT, $E\tilde X_1 \to EX_1$ as $m \to \infty$, it follows that
$$\limsup_{t\to\infty} \frac{EN(t)}{t} \le \frac{1}{EX_1}.$$
Combining this with (5.5) leads to the following result.

Theorem 8.5.8: (The weak renewal theorem). Let $\{N(t) : t \ge 0\}$ be a renewal process with distribution $F$. Let $\mu = \int_{[0,\infty)} x\,dF(x) \in (0, \infty)$. Then,
$$\lim_{t\to\infty} \frac{EN(t)}{t} = \frac{1}{\mu}. \tag{5.7}$$
The above result is also valid when $\mu = \infty$, when $\frac{1}{\mu}$ is interpreted as zero.

Definition 8.5.4: A random variable $X$ (and its probability distribution) is called arithmetic (or lattice) if there exist $a \in \mathbb{R}$ and $d > 0$ such that $\frac{X - a}{d}$ is integer valued. The largest such $d$ is called the span of (the distribution of) $X$.

Definition 8.5.5: A random variable $X$ (and its distribution) is called nonarithmetic (or nonlattice) if it is not arithmetic.

The weak renewal theorem (Theorem 8.5.8) implies that $EN(t) = t/\mu + o(t)$ as $t \to \infty$. This suggests that
$$E\big(N(t+h) - N(t)\big) = (t+h)/\mu - t/\mu + o(t) = h/\mu + o(t).$$
A strengthening of the above result is as follows.

Theorem 8.5.9: (The strong renewal theorem). Let $\{N(t) : t \ge 0\}$ be a renewal process with a nonarithmetic distribution $F$ with a finite positive mean $\mu$. Then, for each $h > 0$,
$$\lim_{t\to\infty} E\big(N(t+h) - N(t)\big) = \frac{h}{\mu}. \tag{5.8}$$

Remark 8.5.1: Since
$$N(t) = \sum_{j=0}^{k-1} \big(N(t-j) - N(t-j-1)\big) + N(t-k)$$
where $k \le t < k+1$, the weak renewal theorem follows from the strong renewal theorem.

The following are the "arithmetic versions" of Theorems 8.5.8 and 8.5.9. Let $\{X_i\}_{i\ge 0}$ be independent positive integer valued random variables such that $\{X_i\}_{i\ge 1}$ are iid with distribution $\{p_j\}_{j\ge 1}$. Let $S_n = \sum_{j=0}^n X_j$, $n \ge 0$, $S_{-1} = 0$. Let $N_n = k$ if $S_{k-1} \le n < S_k$, $k = 0, 1, 2, \ldots$. Let
$$u_n = P(\text{there is a renewal at time } n) = P(S_k = n \text{ for some } k \ge 0).$$

Theorem 8.5.10: Let $\mu = \sum_{j=1}^\infty j p_j \in (0, \infty)$. Then
$$\frac{1}{n} \sum_{j=0}^n u_j \to \frac{1}{\mu} \quad \text{as} \quad n \to \infty. \tag{5.9}$$

Theorem 8.5.11: Let $\mu = \sum_{j=1}^\infty j p_j \in (0, \infty)$ and g.c.d. $\{k : p_k > 0\} = 1$. Then
$$u_n \to \frac{1}{\mu} \quad \text{as} \quad n \to \infty. \tag{5.10}$$

For proofs of Theorems 8.5.9 and 8.5.11, see Feller (1966) for an analytic proof or Lindvall (1992) for a proof using the coupling method. The proof of Theorem 8.5.10 is similar to that of Theorem 8.5.8.

8.5.4 Renewal equations

The above strong renewal theorems have many applications. These are via what are known as renewal equations. Let $F(\cdot)$ be a cdf such that $F(0) = 0$. Let $\mathcal{B}_0 \equiv \{f \mid f : [0, \infty) \to \mathbb{R},\ f$ is Borel measurable and bounded on bounded intervals$\}$.

Definition 8.5.6: A function $a(\cdot)$ is said to satisfy the renewal equation with distribution $F(\cdot)$ and forcing function $b(\cdot) \in \mathcal{B}_0$ if $a \in \mathcal{B}_0$ and
$$a(t) = b(t) + \int_{(0,t]} a(t-u)\,dF(u) \quad \text{for } t \ge 0. \tag{5.11}$$

Theorem 8.5.12: Let $F$ be a cdf such that $F(0) = 0$ and let $b(\cdot) \in \mathcal{B}_0$. Then there is a unique solution $a_0(\cdot) \in \mathcal{B}_0$ to (5.11) given by
$$a_0(t) = \int_{[0,t]} b(t-u)\,U(du) \tag{5.12}$$
where $U(\cdot)$ is the Lebesgue-Stieltjes measure induced by the nondecreasing function
$$U(t) \equiv \sum_{n=0}^\infty F^{(n)}(t), \tag{5.13}$$
with $F^{(n)}(\cdot)$, $n \ge 0$ being defined by the relations
$$F^{(n)}(t) = \int_{(0,t]} F^{(n-1)}(t-u)\,dF(u), \quad t \in \mathbb{R},\ n \ge 1, \qquad F^{(0)}(t) = \begin{cases} 1 & \text{if } t \ge 0, \\ 0 & \text{if } t < 0. \end{cases}$$

It will be shown below that the function $U(\cdot)$ defined in (5.13) is the renewal function $EN(t)$ as in Definition 8.5.2.

Proof: For any function $b \in \mathcal{B}_0$ and any nondecreasing right continuous function $G : [0, \infty) \to \mathbb{R}$, let
$$(b * G)(t) \equiv \int_{[0,t]} b(t-u)\,dG(u).$$
Then since $F(0) = 0$, the equation (5.11) can be rewritten as
$$a = b + a * F. \tag{5.14}$$
Let $\{X_i\}_{i\ge 1}$ be iid random variables with cdf $F$. Then it is easy to verify that $F^{(n)}(t) = P(S_n \le t)$, where $S_0 = 0$, and $S_n = \sum_{i=1}^n X_i$ for $n \ge 1$. Let $\{N(t) : t \ge 0\}$ be as defined by (5.1). Then, for $t \in (0, \infty)$,
$$EN(t) = \sum_{j=1}^\infty P\big(N(t) \ge j\big) = \sum_{j=1}^\infty P(S_{j-1} \le t) = \sum_{n=0}^\infty F^{(n)}(t) = U(t).$$
By Proposition 8.5.1, $U(t) < \infty$ for all $t > 0$ and is nondecreasing. Since $b \in \mathcal{B}_0$, for each $0 < t < \infty$, $a_0$ defined by (5.12) is well-defined. By definition $a_0 = b * U$ and by (5.13), $a_0$ satisfies (5.14) and hence (5.11). If $a_1$ and $a_2$ from $\mathcal{B}_0$ are two solutions to (5.14), then $\tilde a \equiv a_1 - a_2$ satisfies
$$\tilde a = \tilde a * F$$
and hence
$$\tilde a = \tilde a * F^{(n)} \quad \text{for all } n \ge 1.$$
This implies
$$M(t) \equiv \sup\{|\tilde a(u)| : 0 \le u \le t\} \le M(t) F^{(n)}(t).$$
But $F^{(n)}(t) \to 0$ as $n \to \infty$. Hence $|\tilde a| = 0$ on $(0, t]$ for each $t$. Thus $a_0 = b * U$ is the unique solution to (5.11). □

The discrete or arithmetic analog of the renewal equation (5.11) is as follows. Let $\{X_i\}_{i\ge 1}$ be iid positive integer valued random variables with distribution $\{p_j\}_{j\ge 1}$. Let $S_0 = 0$, and $S_n = \sum_{i=1}^n X_i$ for $n \ge 1$. Let $u_n = P(S_j = n \text{ for some } j \ge 0)$. Then, $u_0 = 1$ and $u_n$ satisfies $u_n = \sum_{j=1}^n p_j u_{n-j}$ for $n \ge 1$. For any sequence $\{b_j\}_{j\ge 0}$, the equation
$$a_n = b_n + \sum_{j=1}^n a_{n-j} p_j, \quad n = 0, 1, 2, \ldots \tag{5.15}$$
is called the discrete renewal equation. As in the general case, it can be shown (Problem 8.17 (a)) that the unique solution to (5.15) is given by
$$a_n = \sum_{j=0}^n b_{n-j} u_j. \tag{5.16}$$
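The relation between (5.15) and (5.16) can be verified exactly on a small case. In the sketch below, the lifetime distribution $p_1 = p_2 = 1/2$ and the forcing sequence $b_j = 2^{-j}$ are assumed examples, and the function names are hypothetical; rational arithmetic avoids rounding questions.

```python
from fractions import Fraction


def renewal_sequence(p, n_max):
    """u_0 = 1 and u_n = sum_{j=1}^n p_j u_{n-j}; p maps j >= 1 to p_j."""
    u = [Fraction(1)]
    for n in range(1, n_max + 1):
        u.append(sum(p.get(j, 0) * u[n - j] for j in range(1, n + 1)))
    return u


def solve_by_recursion(b, p):
    """a_n computed from the discrete renewal equation (5.15)."""
    a = []
    for n in range(len(b)):
        a.append(b[n] + sum(a[n - j] * p.get(j, 0) for j in range(1, n + 1)))
    return a


def solve_by_convolution(b, u):
    """a_n computed from the closed form (5.16)."""
    return [sum(b[n - j] * u[j] for j in range(n + 1)) for n in range(len(b))]


p = {1: Fraction(1, 2), 2: Fraction(1, 2)}      # assumed lifetime distribution
b = [Fraction(1, 2 ** j) for j in range(12)]    # assumed forcing sequence
u = renewal_sequence(p, len(b) - 1)
print(solve_by_recursion(b, p) == solve_by_convolution(b, u))
```

Both routes give the same sequence term by term, consistent with the uniqueness claim for (5.15).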

The following convergence results are easy to establish from Theorem 8.5.11 (Problem 8.17 (b)).

Theorem 8.5.13: (The key renewal theorem, discrete case). Let $\{p_j\}_{j\ge 1}$ be aperiodic, i.e., g.c.d. $\{k : p_k > 0\} = 1$ and $\mu \equiv \sum_{j=1}^\infty j p_j \in (0, \infty)$. Let $\{u_n\}_{n\ge 0}$ be the renewal sequence associated with $\{p_j\}_{j\ge 1}$. That is, $u_0 = 1$ and $u_n = \sum_{j=1}^n p_j u_{n-j}$ for $n \ge 1$. Let $\{b_j\}_{j\ge 0}$ be such that $\sum_{j=0}^\infty |b_j| < \infty$. Let $\{a_n\}_{n\ge 0}$ satisfy $a_0 = b_0$ and
$$a_n = b_n + \sum_{j=1}^n a_{n-j} p_j, \quad n \ge 1. \tag{5.17}$$
Then $a_n = \sum_{j=0}^n b_j u_{n-j}$, $n \ge 0$ and
$$\lim_{n\to\infty} a_n = \frac{1}{\mu} \sum_{j=0}^\infty b_j.$$
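The limits in Theorems 8.5.11 and 8.5.13 can be observed numerically. The sketch below again assumes $p_1 = p_2 = 1/2$ (aperiodic, $\mu = 3/2$) and $b_j = 2^{-j}$ (so $\sum_j b_j = 2$), which are illustrative choices rather than examples from the text.

```python
def renewal_sequence(p, n_max):
    """u_0 = 1 and u_n = sum_{j=1}^n p_j u_{n-j}; p maps j >= 1 to p_j."""
    u = [1.0]
    for n in range(1, n_max + 1):
        u.append(sum(p.get(j, 0.0) * u[n - j] for j in range(1, n + 1)))
    return u


p = {1: 0.5, 2: 0.5}                      # assumed aperiodic distribution
mu = sum(j * pj for j, pj in p.items())   # mean lifetime, here 1.5
n = 60
u = renewal_sequence(p, n)
b = [0.5 ** j for j in range(n + 1)]      # assumed summable forcing sequence
a_n = sum(b[j] * u[n - j] for j in range(n + 1))
limit = sum(b) / mu
print(u[n], 1.0 / mu, a_n, limit)
```

For this distribution $u_n - 1/\mu$ decays geometrically, so $u_{60}$ and $a_{60}$ already agree with $1/\mu = 2/3$ and $\mu^{-1}\sum_j b_j = 4/3$ to many decimal places.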

The nonarithmetic analog of the above is as follows.

Definition 8.5.7: A function $b(\cdot) \in \mathcal{B}_0$ is directly Riemann integrable (dri) on $[0, \infty)$ iff (i) for all $h > 0$, $\sum_{n=0}^\infty \sup\{|b(u)| : nh \le u \le (n+1)h\} < \infty$, and (ii) $\lim_{h\to 0} \sum_{n=0}^\infty h\big(\overline{m}_n(h) - \underline{m}_n(h)\big) = 0$, where
$$\overline{m}_n(h) = \sup\{b(u) : nh \le u \le (n+1)h\}, \qquad \underline{m}_n(h) = \inf\{b(u) : nh \le u \le (n+1)h\}.$$

Theorem 8.5.14: (The key renewal theorem, nonarithmetic case). Let $F(\cdot)$ be a nonarithmetic distribution with $F(0) = 0$ and $\mu = \int_{[0,\infty)} u\,dF(u) < \infty$. Let $U(\cdot) = \sum_{n=0}^\infty F^{(n)}(\cdot)$ be the renewal function associated with $F$. Let $b(\cdot) \in \mathcal{B}_0$ be directly Riemann integrable. Then the unique solution to the renewal equation
$$a = b + a * F \tag{5.18}$$
is given by $a = b * U$ and
$$\lim_{t\to\infty} a(t) = \frac{c(b)}{\mu} \tag{5.19}$$
where $c(b) \equiv \lim_{h\to 0} \sum_{n=0}^\infty h\,\overline{m}_n(h)$.

Remark 8.5.2: A sufficient condition for $b(\cdot)$ to be dri is that it is Riemann integrable on bounded intervals and that there exists a nonincreasing integrable function $h(\cdot)$ on $[0, \infty)$ and a constant $C$ such that $|b(\cdot)| \le Ch(\cdot)$ (Problem 8.18 (b)).
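A closed-form instance may make Theorems 8.5.12 and 8.5.14 more concrete. The worked example below is an illustration under assumed choices that do not appear in the text: exponential lifetimes with rate $\lambda > 0$ and forcing function $b(t) = e^{-\beta t}$ with $\beta > 0$ (nonincreasing and integrable, hence dri by Remark 8.5.2).

```latex
% Assumed example: F(x) = 1 - e^{-\lambda x}, x \ge 0, so F is nonarithmetic,
% F(0) = 0 and \mu = 1/\lambda.
% For n \ge 1, F^{(n)}(t) = P(S_n \le t) = P(\mathrm{Poisson}(\lambda t) \ge n), hence
\[
  U(t) = \sum_{n=0}^{\infty} F^{(n)}(t)
       = 1 + \sum_{n=1}^{\infty} P(\mathrm{Poisson}(\lambda t) \ge n)
       = 1 + \lambda t , \qquad t \ge 0 ,
\]
% i.e. U(du) = \delta_0(du) + \lambda\, du on [0, \infty).
% With b(t) = e^{-\beta t}, the solution a = b * U of (5.18) is
\[
  a(t) = b(t) + \lambda \int_0^t e^{-\beta (t-u)} \, du
       = e^{-\beta t} + \frac{\lambda}{\beta}\bigl(1 - e^{-\beta t}\bigr)
  \;\longrightarrow\; \frac{\lambda}{\beta}
  \quad \text{as } t \to \infty ,
\]
% which agrees with (5.19), since c(b) = \int_0^\infty e^{-\beta u}\, du = 1/\beta
% and c(b)/\mu = \lambda/\beta.
```

In this special case the limit in (5.19) can thus be read off directly, and the convergence is exponentially fast.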

8.5.5

Applications

Here are two important applications of the above two theorems to a class of stochastic processes known as regenerative processes. Deﬁnition 8.5.8: (a) A sequence of random variables {Yn }n≥0 is called regenerative if there exists a renewal sequence {Tj }j≥0 such that the random cyclesand cycle length variables ηj = {Yi : Tj ≤ i < Tj+1 }, Tj+1 − Tj for j = 0, 1, 2, . . . are iid. (b) A stochastic process {Y (t) : t ≥ 0} is called regenerative if there exists a renewal sequence {Tj }j≥0 such that the random cycles and

8.5 Renewal theory

269

cycle length variables ηj ≡ {Y (t) : Tj ≤ t < Tj+1 , Tj+1 − Tj } for j = 0, 1, 2, . . . are iid. (c) In both (a) and (b), the sequence {Tj }j≥0 are called the regeneration times. Example 8.5.2: Let {Yn }n≥0 be a countable state space Markov chain (see Chapter 14) that is irreducible and recurrent. Fix a state ∆. Let T0

=

min{n : n > 0, Yn = ∆}

Tj+1

=

min{n : n > Tj , Yn = ∆}, n ≥ 0.

Then {Yn }n≥0 is regenerative (Problem 8.19). Example 8.5.3: Let {Y (t) : t ≥ 0} be a continuous time Markov chain (see Chapter 14) with a countable state space that is irreducible and recurrent. Fix a state ∆. Let T0 Tj+1

=

inf{t : t > 0, Y (t) = ∆}

=

inf{t : t > Tj , Y (t) = ∆}.

Then {Y (t) : t ≥ 0} is regenerative (Problem 8.19). Theorem 8.5.15: Let {Yn }n≥0 be a regenerative sequence of random variables with some state space (S, S) where S is a σ-algebra on S with regeneration times {Tj }j≥0 . Let f : S → R be bounded and S, B(R)-measurable. Let ≡ Ef (Yn+T0 ),

an bn

≡ Ef (YT0 +n )I(T1 > T0 + n).

(5.20)

Let µ = E(T1 − T0 ) ∈ (0, ∞) and g.c.d. {j : pj ≡ P (T1 − T0 = j) > 0} = 1. Then (i) an → f (y)π(dy) where π(A) ≡

1 µ

E

S

, A ∈ S.

T1 −1 j=T0 IA (Yj )

(ii) In particular, P (Yn ∈ ·) − π(·) → 0

as

n → ∞,

(5.21)

where · denotes the total variation norm. Proof: By the regenerative property, {an }n≥1 satisﬁes the renewal equation n an = bn + an−j pj j=0

270

8. Laws of Large Numbers

and hence, part (i) of the theorem follows from Theorem 8.5.13 and the ∞ fact n=0 bn = µπ(A). prove (ii) note that a ˜n ≡ Ef (Yn ) = E(f (Yn )I(T0 > n)) + To n a P (T = j) and by DCT limn→∞ a ˜n = limn→∞ an . 0 j=0 n−j It is not diﬃcult to show that for any two probability measures µ and ν on (S, S), the total variation norm * * * * µ − ν = sup * f dµ − f dν * : f ∈ B(S, R) where B(S, R) = {f : f : S → R, F measurable, sup{|f (s)| : s ∈ S} ≤ 1} (Problem 4.10 (b)). Thus, P (Yn+T0 ∈ ·) − π(·) * * * * ≤ sup *Ef (Yn0 +T ) − f dπ * : f ∈ B(S, R) .

(5.22)

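Now, for any $f \in B(S, \mathbb{R})$ and any integer $K \ge 1$, from Theorem 8.5.13,
$$\Big|Ef(Y_{T_0+n}) - \int f\, d\pi\Big| \le \sum_{j=0}^{K} |b_j|\, \Big|u_{n-j} - \frac{1}{\mu}\Big| + 2\sum_{j=K+1}^{\infty} P(T_1 - T_0 > j) \equiv \delta_n, \text{ say} \tag{5.23}$$
where $\{b_j\}$ is defined in (5.20). Since $E(T_1 - T_0) < \infty$, given $\varepsilon > 0$, there exists a $K$ such that
$$\sum_{j=K+1}^{\infty} P(T_1 - T_0 > j) < \varepsilon/2.$$
By Theorem 8.5.11, $u_n \to \frac{1}{\mu}$. Thus, in (5.23), $\limsup_{n\to\infty} \delta_n \le \varepsilon$ and so from (5.22), (ii) follows. □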
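The two conclusions of Theorem 8.5.15 can be illustrated on the simplest instance of Example 8.5.2. The sketch below is only an illustration under stated assumptions: a two-state chain with transition matrix $[[1-a, a], [b, 1-b]]$, regeneration state $\Delta = 0$, and hypothetical helper names (`n_step_law`, `cycle_estimate`) that do not come from the text. It computes the total variation distance in (5.21) exactly for a few $n$, and estimates $\pi(\{1\})$ by the cycle formula in part (i) from simulated regeneration cycles.

```python
import random

def stationary_two_state(a, b):
    """Stationary law of the chain with P = [[1-a, a], [b, 1-b]]."""
    return (b / (a + b), a / (a + b))

def n_step_law(a, b, n, start=0):
    """Distribution of Y_n given Y_0 = start, by iterating the one-step update."""
    law = [0.0, 0.0]
    law[start] = 1.0
    for _ in range(n):
        law = [law[0] * (1 - a) + law[1] * b,
               law[0] * a + law[1] * (1 - b)]
    return law

def total_variation(p, q):
    """Total variation norm in the sup-over-|f|<=1 form: sum of |p_i - q_i|."""
    return sum(abs(x - y) for x, y in zip(p, q))

def cycle_estimate(a, b, n_cycles, seed=0):
    """Estimate pi({1}) as (visits to state 1 within cycles) / (total cycle length),
    using Delta = 0 as the regeneration state (an assumption of this sketch)."""
    rng = random.Random(seed)
    visits_to_1 = 0
    total_length = 0
    for _ in range(n_cycles):
        state = 0                      # each cycle starts at Delta = 0
        while True:
            total_length += 1
            visits_to_1 += (state == 1)
            u = rng.random()
            if state == 0:
                state = 1 if u < a else 0
            else:
                state = 0 if u < b else 1
            if state == 0:             # back at Delta: the cycle ends
                break
    return visits_to_1 / total_length

a, b = 0.3, 0.2
pi = stationary_two_state(a, b)
tv = [total_variation(n_step_law(a, b, n), pi) for n in (1, 5, 20, 80)]
pi1_hat = cycle_estimate(a, b, n_cycles=20000)
```

Here the g.c.d. condition holds because a cycle of length one has probability $1 - a > 0$. The list `tv` decreases toward zero, and `pi1_hat` should be close to $a/(a+b)$, up to simulation error.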

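Theorem 8.5.16: Let $\{Y(t) : t \ge 0\}$ be a regenerative stochastic process with state space $(S, \mathcal{S})$ where $\mathcal{S}$ is a σ-algebra on $S$. Let $f : S \to \mathbb{R}$ be bounded and $\langle \mathcal{S}, \mathcal{B}(\mathbb{R})\rangle$-measurable. Let

$$a(t) \equiv Ef(Y_{T_0+t}), \quad t \ge 0, \qquad b(t) \equiv Ef(Y_{T_0+t})\, I(T_1 > T_0 + t), \quad t \ge 0.$$

Let $\mu = E(T_1 - T_0) \in (0, \infty)$ and the distribution of $T_1 - T_0$ be nonarithmetic. Then

(i) $a(t) \to \int_S f(y)\,\pi(dy)$ where
$$\pi(A) = \frac{1}{\mu}\, E\Big(\int_{T_0}^{T_1} I_A(Y(u))\, du\Big), \quad A \in \mathcal{S}.$$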


(ii) In particular,
$$\|P(Y_t \in \cdot) - \pi(\cdot)\| \to 0 \quad\text{as}\quad t \to \infty \tag{5.24}$$
where $\|\cdot\|$ is the total variation norm.

The proof of this is similar to that of the previous theorem but uses Theorem 8.5.14. □

8.6 Ergodic theorems

8.6.1 Basic definitions and examples

The law of large numbers proved in Section 8.2 states that if $\{X_i\}_{i\ge 1}$ are pairwise independent and identically distributed and if $h(\cdot)$ is a Borel measurable function, then the time average converges to the space average, i.e.,

$$\frac{1}{n}\sum_{i=1}^{n} h(X_i) \to Eh(X_1) \quad\text{w.p. 1} \tag{6.1}$$

as $n \to \infty$, provided $E|h(X_1)| < \infty$. The goal of this section is to investigate how far the independence assumption can be relaxed.

Definition 8.6.1: (Stationary sequences). A sequence of random variables $\{X_i\}_{i\ge 1}$ on a probability space $(\Omega, \mathcal{F}, P)$ is called strictly stationary if for each $k \ge 1$ the joint distribution of $(X_{i+j} : j = 1, 2, \ldots, k)$ is the same for all $i \ge 0$.

Example 8.6.1: $\{X_i\}_{i\ge 1}$ iid.

Example 8.6.2: Let $\{X_i\}_{i\ge 1}$ be iid. Fix $1 \le \ell < \infty$. Let $h : \mathbb{R}^{\ell} \to \mathbb{R}$ be a Borel function and $Y_i = h(X_i, X_{i+1}, \ldots, X_{i+\ell-1})$, $i \ge 1$. Then $\{Y_i\}_{i\ge 1}$ is strictly stationary.

Example 8.6.3: Let $\{X_i\}_{i\ge 1}$ be a Markov chain with a stationary distribution $\pi$. If $X_1 \sim \pi$ then $\{X_i\}_{i\ge 1}$ is strictly stationary (see Chapter 14).

It will be shown that if $\{X_i\}_{i\ge 1}$ is a strictly stationary sequence that is not a mixture of two other strictly stationary sequences, then (6.1) holds. This is known as the ergodic theorem (Theorem 8.6.1 below).

Definition 8.6.2: (Measure preserving transformations). Let $(\Omega, \mathcal{F}, P)$ be a probability space and $T : \Omega \to \Omega$ be $\langle \mathcal{F}, \mathcal{F}\rangle$ measurable. Then, $T$ is


called $P$-preserving (or simply measure preserving on $(\Omega, \mathcal{F}, P)$) if for all $A \in \mathcal{F}$, $P(T^{-1}(A)) = P(A)$. That is, the random point $T(\omega)$ has the same distribution as $\omega$.

Let $X$ be a real valued random variable on $(\Omega, \mathcal{F}, P)$. Let $X_i \equiv X(T^{(i-1)}(\omega))$ where $T^{(0)}(\omega) = \omega$, $T^{(i)}(\omega) = T(T^{(i-1)}(\omega))$, $i \ge 1$. Then $\{X_i\}_{i\ge 1}$ is a strictly stationary sequence.

It turns out that every strictly stationary sequence arises this way. Let $\{X_i\}_{i\ge 1}$ be a strictly stationary sequence defined on some probability space $(\Omega, \mathcal{F}, P)$. Let $\tilde P$ be the probability measure induced by $\tilde X \equiv \{X_i(\omega)\}_{i\ge 1}$ on $\tilde\Omega \equiv \mathbb{R}^{\infty}$, $\tilde{\mathcal{F}} \equiv \mathcal{B}(\mathbb{R}^{\infty})$, where $\mathbb{R}^{\infty}$ is the space of all sequences of real numbers and $\mathcal{B}(\mathbb{R}^{\infty})$ is the σ-algebra generated by finite dimensional cylinder sets of the form $\{x : (x_j : j = 1, 2, \ldots, k) \in A_k\}$, $1 \le k < \infty$, $A_k \in \mathcal{B}(\mathbb{R}^k)$. Let $T : \mathbb{R}^{\infty} \to \mathbb{R}^{\infty}$ be the unilateral (one sided) shift to the right, i.e., $T(x_i)_{i\ge 1} = (x_i)_{i\ge 2}$. Then $T$ is measure preserving on $(\tilde\Omega, \tilde{\mathcal{F}}, \tilde P)$. Let $Y_1(\tilde\omega) = x_1$, and $Y_i(\tilde\omega) = Y_1(T^{i-1}\tilde\omega) = x_i$ for $i \ge 2$ if $\tilde\omega = (x_1, x_2, x_3, \ldots)$. Then $\{Y_i\}_{i\ge 1}$ is a strictly stationary sequence on $(\tilde\Omega, \tilde{\mathcal{F}}, \tilde P)$ and has the same distribution as $\{X_i\}_{i\ge 1}$.

Example 8.6.4: Let $\Omega = [0, 1]$, $\mathcal{F} = \mathcal{B}([0, 1])$, $P$ = Lebesgue measure. Let $T\omega \equiv 2\omega \bmod 1$, i.e.,

$$T\omega = \begin{cases} 2\omega & \text{if } 0 \le \omega < \tfrac12 \\ 2\omega - 1 & \text{if } \tfrac12 \le \omega < 1 \\ 0 & \text{if } \omega = 1. \end{cases}$$

Then $T$ is measure preserving since $P(\{\omega : a < T\omega < b\}) = (b - a)$ for all $0 < a < b < 1$ (Problem 8.20). This example is an equivalent version of the iid sequence $\{\delta_i\}_{i\ge 1}$ of Bernoulli (1/2) random variables. To see this, let $\omega = \sum_{i=1}^{\infty} \frac{\delta_i(\omega)}{2^i}$ be the binary expansion of $\omega$. Then $\{\delta_i\}_{i\ge 1}$ is iid Bernoulli (1/2) and $T\omega = 2\omega \bmod 1 = \sum_{i=2}^{\infty} \frac{\delta_i(\omega)}{2^{i-1}}$ (cf. Problem 7.4). Thus $T$ corresponds to the unilateral shift to the right on the iid sequence $\{\delta_i\}_{i\ge 1}$. For this reason, $T$ is called the Bernoulli shift.

Example 8.6.5: (Rotation). Let $\Omega = \{(x, y) : x^2 + y^2 = 1\}$ be the unit circle. Fix $\theta_0$ in $[0, 2\pi)$. If $\omega = (\cos\theta, \sin\theta)$, $\theta$ in $[0, 2\pi)$, set $T\omega = \big(\cos(\theta + \theta_0), \sin(\theta + \theta_0)\big)$. That is, $T$ rotates any point $\omega$ on $\Omega$ counterclockwise through an angle $\theta_0$. Then $T$ is measure preserving w.r.t. the Uniform distribution on $[0, 2\pi]$.

Definition 8.6.3: Let $(\Omega, \mathcal{F}, P)$ be a probability space and $T : \Omega \to \Omega$ be a $\langle \mathcal{F}, \mathcal{F}\rangle$ measurable map. A set $A \in \mathcal{F}$ is $T$-invariant if $A = T^{-1}A$. A set $A \in \mathcal{F}$ is almost $T$-invariant w.r.t. $P$ if $P(A \,\triangle\, T^{-1}A) = 0$, where $A_1 \,\triangle\, A_2 = (A_1 \cap A_2^c) \cup (A_1^c \cap A_2)$ is the symmetric difference of $A_1$ and $A_2$.
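The two claims in Example 8.6.4, that $T$ acts as a shift on binary digits and that it preserves the length of intervals, can be checked with exact rational arithmetic. The following is a minimal sketch; the helper names (`doubling_map`, `binary_digits`, `preimage_length`) and the test point $5/13$ are choices made for this illustration and are not from the text.

```python
from fractions import Fraction

def doubling_map(w):
    """T(w) = 2w mod 1 on [0, 1], with T(1) = 0 as in Example 8.6.4."""
    if w == 1:
        return Fraction(0)
    return 2 * w if w < Fraction(1, 2) else 2 * w - 1

def binary_digits(w, count):
    """First `count` binary digits delta_1, delta_2, ... of w in [0, 1)."""
    digits = []
    for _ in range(count):
        w *= 2
        d = 1 if w >= 1 else 0
        digits.append(d)
        w -= d
    return digits

def preimage_length(a, b):
    """Lebesgue measure of T^{-1}((a, b)) = (a/2, b/2) U ((a+1)/2, (b+1)/2)."""
    return (b / 2 - a / 2) + ((b + 1) / 2 - (a + 1) / 2)

w = Fraction(5, 13)                       # not a dyadic rational, so the expansion is unique
digits_w = binary_digits(w, 12)
digits_Tw = binary_digits(doubling_map(w), 11)
shift_ok = digits_Tw == digits_w[1:]      # digits of T(w) are the digits of w shifted by one
measure_ok = preimage_length(Fraction(1, 5), Fraction(3, 4)) == Fraction(3, 4) - Fraction(1, 5)
```

Both `shift_ok` and `measure_ok` evaluate to `True`. This checks one point and one interval only; the general statement is the content of Problem 8.20.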


It can be shown that $A$ is almost $T$-invariant w.r.t. $P$ iff there exists a set $A'$ that is $T$-invariant and $P(A \,\triangle\, A') = 0$ (Problem 8.21). Examples of $T$-invariant sets are $A_1 = \{\omega : T^j\omega \in A_0 \text{ for infinitely many } j \ge 1\}$ where $A_0 \in \mathcal{F}$; $A_2 = \big\{\omega : \frac{1}{n}\sum_{j=1}^{n} h(T^j\omega) \text{ converges as } n \to \infty\big\}$ where $h : \Omega \to \mathbb{R}$ is an $\mathcal{F}$-measurable function. On the other hand, the event $\{x : x_1 \le 0\}$ is not shift invariant in $(\mathbb{R}^{\infty}, \mathcal{B}(\mathbb{R}^{\infty}))$, nor is it almost shift invariant if $\tilde P$ corresponds to the iid case with a nondegenerate distribution.

The collection $\mathcal{I}$ of $T$-invariant sets is a σ-algebra and is called the invariant σ-algebra. A function $h : \Omega \to \mathbb{R}$ is $\mathcal{I}$-measurable iff $h(\omega) = h(T\omega)$ for all $\omega$ (Problem 8.22).

Definition 8.6.4: A measure preserving transformation $T$ on a probability space $(\Omega, \mathcal{F}, P)$ is ergodic or irreducible (w.r.t. $P$) if $A$ is $T$-invariant implies $P(A) = 0$ or 1.

Definition 8.6.5: A stationary sequence of random variables $\{X_i\}_{i\ge 1}$ is ergodic if the unilateral shift $T$ is ergodic on the sequence space $(\mathbb{R}^{\infty}, \mathcal{B}(\mathbb{R}^{\infty}), \tilde P)$ where $\tilde P$ is the measure on $\mathbb{R}^{\infty}$ induced by $\{X_i\}_{i\ge 1}$.

Example 8.6.6: Consider the above sequence space. Then $A \in \tilde{\mathcal{F}}$ being invariant with respect to the unilateral shift implies that $A$ is in the tail σ-algebra $\mathcal{T} \equiv \bigcap_{n=1}^{\infty} \sigma(\tilde X_j(\omega), j \ge n)$ (Problem 8.23). If $\{X_i\}_{i\ge 1}$ are independent then by Kolmogorov's zero-one law, $A \in \mathcal{T}$ implies $P(A) = 0$ or 1. Thus, if $\{X_i\}_{i\ge 1}$ are iid then the sequence is ergodic. On the other hand, mixtures of iid sequences are not ergodic, as seen below.

Example 8.6.7: Let $\{X_i\}_{i\ge 1}$ and $\{Y_i\}_{i\ge 1}$ be two iid sequences with different distributions. Let $\delta$ be Bernoulli $(p)$, $0 < p < 1$, and independent of both $\{X_i\}_{i\ge 1}$ and $\{Y_i\}_{i\ge 1}$. Let $Z_i \equiv \delta X_i + (1 - \delta)Y_i$, $i \ge 1$. Then $\{Z_i\}_{i\ge 1}$ is a stationary sequence and is not ergodic (Problem 8.24).

The above example can be extended to mixtures of irreducible positive recurrent discrete state space Markov chains (Problem 8.25 (a)). Another example is Example 8.6.5, i.e., rotation of the circle when $\theta_0$ is a rational multiple of $2\pi$ (Problem 8.25 (b)).

Remark 8.6.1: There is a simple example of a measure preserving transformation $T$ that is ergodic but for which $T^2$ is not. Let $\Omega = \{\omega_1, \omega_2\}$, $\omega_1 \ne \omega_2$. Let $T\omega_1 = \omega_2$, $T\omega_2 = \omega_1$, and let $P$ be the distribution $P(\{\omega_1\}) = P(\{\omega_2\}) = \frac12$. Then $T$ is ergodic but $T^2$ is not (Problem 8.26).
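Since the space in Remark 8.6.1 has only four subsets, Definitions 8.6.3 and 8.6.4 can be checked by brute force. The sketch below enumerates every subset, keeps those with $A = T^{-1}A$, and tests whether each has probability 0 or 1. The labels `1` and `2` for $\omega_1, \omega_2$ and the function names are illustrative assumptions only.

```python
from fractions import Fraction
from itertools import combinations

omega = (1, 2)                              # stand-ins for omega_1, omega_2
P = {1: Fraction(1, 2), 2: Fraction(1, 2)}
T = {1: 2, 2: 1}                            # T swaps the two points
T2 = {w: T[T[w]] for w in omega}            # T^2 is the identity map

def subsets(points):
    """All subsets of `points` as frozensets."""
    return [frozenset(c) for r in range(len(points) + 1)
            for c in combinations(points, r)]

def invariant_sets(mapping):
    """All A with A equal to its preimage under `mapping`."""
    result = []
    for A in subsets(omega):
        preimage = frozenset(w for w in omega if mapping[w] in A)
        if preimage == A:
            result.append(A)
    return result

def is_ergodic(mapping):
    """True if every invariant set has probability 0 or 1."""
    return all(sum(P[w] for w in A) in (0, 1) for A in invariant_sets(mapping))

ergodic_T = is_ergodic(T)      # only the empty set and Omega are invariant
ergodic_T2 = is_ergodic(T2)    # every subset is invariant, including {omega_1}
```

The invariant sets of $T$ are $\emptyset$ and $\Omega$, so `ergodic_T` is `True`; under $T^2$ the singleton $\{\omega_1\}$ is invariant with probability $\frac12$, so `ergodic_T2` is `False`.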


8.6.2 Birkhoff's ergodic theorem

Theorem 8.6.1: Let $(\Omega, \mathcal{F}, P)$ be a probability space, $T : \Omega \to \Omega$ be a measure preserving ergodic map on $(\Omega, \mathcal{F}, P)$ and $X \in L^1(\Omega, \mathcal{F}, P)$. Then

$$\frac{1}{n}\sum_{j=0}^{n-1} X(T^j\omega) \to EX \equiv \int_{\Omega} X\, dP \tag{6.2}$$

w.p. 1 and in $L^1$ as $n \to \infty$.

Remark 8.6.2: A more general version holds without the assumption of $T$ being ergodic. In this case, the right side of (6.2) is a random variable $Y(\omega)$ that is $T$-invariant, i.e., $Y(\omega) = Y(T(\omega))$ w.p. 1, and satisfies $\int_A X\, dP = \int_A Y\, dP$ for all $T$-invariant sets $A$. This $Y$ is called the conditional expectation of $X$ given $\mathcal{I}$, the σ-algebra of invariant sets (Chapter 13). For a proof of this version, see Durrett (2004).

The proof of Theorem 8.6.1 depends on the following inequality.

Lemma 8.6.2: (Maximal ergodic inequality). Let $T$ be measure preserving on a probability space $(\Omega, \mathcal{F}, P)$ and $X \in L^1(\Omega, \mathcal{F}, P)$. Let $S_0(\omega) = 0$, $S_n(\omega) = \sum_{j=0}^{n-1} X(T^j\omega)$, $n \ge 1$, $M_n(\omega) = \max\{S_j(\omega) : 0 \le j \le n\}$. Then
$$E\big(X(\omega)\, I(M_n(\omega) > 0)\big) \ge 0.$$

Proof: By definition of $M_n(\omega)$, $S_j(\omega) \le M_n(\omega)$ for $1 \le j \le n$. Thus
$$X(\omega) + M_n(T\omega) \ge X(\omega) + S_j(T\omega) = S_{j+1}(\omega).$$
Also, since $M_n(T\omega) \ge 0$,
$$X(\omega) \ge X(\omega) - M_n(T\omega) = S_1(\omega) - M_n(T\omega).$$
Thus $X(\omega) \ge \max\{S_j(\omega) : 1 \le j \le n\} - M_n(T\omega)$. For $\omega$ such that $M_n(\omega) > 0$, $M_n(\omega) = \max\{S_j(\omega) : 1 \le j \le n\}$ and hence $X(\omega) \ge M_n(\omega) - M_n(T\omega)$. Also, since $X \in L^1(\Omega, \mathcal{F}, P)$ it follows that $M_n \in L^1(\Omega, \mathcal{F}, P)$ for all $n \ge 1$. Taking expectations yields

$$\begin{aligned}
E\big(X(\omega)\, I(M_n(\omega) > 0)\big) &\ge E\big((M_n(\omega) - M_n(T\omega))\, I(M_n(\omega) > 0)\big) \\
&\ge E\big((M_n(\omega) - M_n(T\omega))\, I(M_n(\omega) \ge 0)\big) \quad (\text{since } M_n(T\omega) \ge 0) \\
&= E\big(M_n(\omega) - M_n(T\omega)\big) = 0,
\end{aligned}$$
since $T$ is measure preserving. □
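Lemma 8.6.2 makes no ergodicity or mean-zero assumption, so it can be sanity-checked on any finite measure preserving system. The sketch below assumes $\Omega = \{0, \ldots, m-1\}$ with the uniform distribution and the cyclic shift $T\omega = \omega + 1 \bmod m$ (a bijection, hence measure preserving), and evaluates $E\big(X\, I(M_n > 0)\big)$ exactly for randomly chosen integer-valued $X$. The function name `maximal_expectation` and the random test ranges are illustrative choices.

```python
import random
from fractions import Fraction

def maximal_expectation(x, n):
    """E[X I(M_n > 0)] on Omega = {0,...,m-1}, uniform P, T(w) = w + 1 mod m."""
    m = len(x)
    total = 0
    for w in range(m):
        partial, running_max = 0, 0          # S_0 = 0, so M_n >= 0
        for j in range(n):
            partial += x[(w + j) % m]        # S_{j+1}(w) = S_j(w) + X(T^j w)
            running_max = max(running_max, partial)
        if running_max > 0:
            total += x[w]
    return Fraction(total, m)

rng = random.Random(1)
values = []
for _ in range(200):
    m = rng.randint(1, 8)
    x = [rng.randint(-5, 5) for _ in range(m)]
    n = rng.randint(1, 12)
    values.append(maximal_expectation(x, n))
all_nonnegative = all(v >= 0 for v in values)
```

For instance, with $X = (-3, 1)$ and $n = 2$ only the second point has $M_2 > 0$, giving the value $\frac12$. A check of this kind does not replace the proof; it only confirms the inequality on the sampled cases.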


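Remark 8.6.3: Note that the measure preserving property of $T$ is used only at the last step.

Proof of Theorem 8.6.1: W.l.o.g. assume that $EX = 0$. Let $Z(\omega) \equiv \limsup_{n\to\infty} \frac{S_n(\omega)}{n}$. Fix $\varepsilon > 0$ and set $A_\varepsilon \equiv \{\omega : Z(\omega) > \varepsilon\}$. It will be shown that $P(A_\varepsilon) = 0$. Clearly, $A_\varepsilon$ is $T$-invariant. Since $T$ is ergodic, $P(A_\varepsilon) = 0$ or 1. Suppose $P(A_\varepsilon) = 1$. Let $Y(\omega) = X(\omega) - \varepsilon$. Let $M_{n,Y}(\omega) \equiv \max\{S_{j,Y}(\omega) : 0 \le j \le n\}$ where $S_{0,Y}(\omega) \equiv 0$, $S_{j,Y}(\omega) \equiv \sum_{k=0}^{j-1} Y(T^k\omega)$, $j \ge 1$. Then by Lemma 8.6.2 applied to $Y(\omega)$,
$$E\big(Y(\omega)\, I(M_{n,Y}(\omega) > 0)\big) \ge 0.$$
But $B_n \equiv \{\omega : M_{n,Y}(\omega) > 0\} = \{\omega : \sup_{1\le j\le n} \frac{1}{j} S_{j,Y}(\omega) > 0\}$. Clearly, $B_n \uparrow B \equiv \{\omega : \sup_{1\le j<\infty} \frac{1}{j} S_{j,Y}(\omega) > 0\}$. Since $\frac{1}{j} S_{j,Y}(\omega) = \frac{1}{j} S_j(\omega) - \varepsilon$ for $j \ge 1$, $B \supset A_\varepsilon$ and since $P(A_\varepsilon) = 1$, it follows that $P(B) = 1$. Also $|Y| \le |X| + \varepsilon \in L^1(\Omega, \mathcal{F}, P)$. So by the dominated convergence theorem,
$$0 \le E(Y I_{B_n}) \to E(Y I_B) = EY = 0 - \varepsilon < 0,$$
which is a contradiction. Thus $P(A_\varepsilon) = 0$. This being true for every $\varepsilon > 0$, it follows that $P\big(\limsup_{n\to\infty} \frac{S_n(\omega)}{n} \le 0\big) = 1$. Applying this to $-X(\omega)$ yields
$$P\Big(\liminf_{n\to\infty} \frac{S_n(\omega)}{n} \ge 0\Big) = 1$$
and hence $P\big(\lim_{n\to\infty} \frac{S_n(\omega)}{n} = 0\big) = 1$.

To prove $L^1$-convergence, note that applying the above to $X^+$ and $X^-$ yields
$$f_n(\omega) \equiv \frac{1}{n}\sum_{i=1}^{n} X^+(T^i\omega) \to EX^+ \quad\text{w.p. 1.}$$
Since $T$ is measure preserving, $\int f_n(\omega)\, dP(\omega) = EX^+$ for all $n$. So by Scheffe's theorem (Lemma 8.2.5), $\int |f_n(\omega) - EX^+|\, dP \to 0$, i.e., $E\big|\frac{1}{n}\sum_{i=1}^{n} X^+(T^i\omega) - EX^+\big| \to 0$. Similarly, $E\big|\frac{1}{n}\sum_{i=1}^{n} X^-(T^i\omega) - EX^-\big| \to 0$. This yields $L^1$ convergence. □

Corollary 8.6.3: Let $\{X_i\}_{i\ge 1}$ be a stationary ergodic sequence of $\mathbb{R}^k$ valued random variables on some probability space $(\Omega, \mathcal{F}, P)$. Let $h : \mathbb{R}^k \to \mathbb{R}$ be Borel measurable and let $E|h(X_1, X_2, \ldots, X_k)| < \infty$. Then
$$\frac{1}{n}\sum_{i=1}^{n} h(X_i, X_{i+1}, \ldots, X_{i+k-1}) \to Eh(X_1, X_2, \ldots, X_k) \quad\text{w.p. 1.}$$

Proof: Consider the probability space $\tilde\Omega = (\mathbb{R}^k)^{\infty}$, $\tilde{\mathcal{F}} \equiv \mathcal{B}\big((\mathbb{R}^k)^{\infty}\big)$ and $\tilde P$ the probability measure induced by the map $\omega \to (X_i(\omega))_{i\ge 1}$ and the unilateral shift map $\tilde T$ on $\tilde\Omega$ defined by $\tilde T(x_i)_{i\ge 1} = (x_i)_{i\ge 2}$. Then $\tilde T$ is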


measure preserving and ergodic. So the corollary follows from Theorem 8.6.1. □

Remark 8.6.4: This corollary is useful in statistical time series analysis. If $\{X_i\}_{i\ge 1}$ is a real valued stationary ergodic sequence, then the mean $m \equiv EX_1$, variance $\mathrm{Var}(X_1)$, and covariance $\mathrm{Cov}(X_1, X_2)$ can all be estimated consistently by the corresponding sample functions

$$\frac{1}{n}\sum_{i=1}^{n} X_i, \qquad \frac{1}{n}\sum_{i=1}^{n} X_i^2 - \Big(\frac{1}{n}\sum_{i=1}^{n} X_i\Big)^2, \qquad\text{and}\qquad \frac{1}{n}\sum_{i=1}^{n} X_i X_{i+1} - \Big(\frac{1}{n}\sum_{i=1}^{n} X_i\Big)^2.$$

Further, the joint distribution of $(X_1, X_2, \ldots, X_k)$ for any $k \ge 1$ can be estimated consistently by the corresponding empirical measure, i.e., by $L_n(A_1, A_2, \ldots, A_k) \equiv \frac{1}{n}\sum_{i=1}^{n} I(X_{i+j} \in A_j,\ j = 1, 2, \ldots, k)$, which converges to $P(X_1 \in A_1, X_2 \in A_2, \ldots, X_k \in A_k)$ w.p. 1, where $A_i \in \mathcal{B}(\mathbb{R})$, $i = 1, 2, \ldots, k$.

The next three results (Theorems 8.6.4–8.6.6) are consequences and extensions of the ergodic theorem, Theorem 8.6.1. For proofs, see Durrett (2004). The first one is the following result on the behavior of the log-likelihood function of a stationary ergodic sequence of random variables with a finite range.

Theorem 8.6.4: (Shannon-McMillan-Breiman theorem). Let $\{X_i\}_{i\ge 1}$ be a stationary ergodic sequence of random variables with values in a finite set $S \equiv \{a_1, a_2, \ldots, a_k\}$. For each $n$, $x_1, x_2, \ldots, x_n$ in $S$, let

$$p(x_n \mid x_{n-1}, x_{n-2}, \ldots, x_1) = P(X_n = x_n \mid X_j = x_j, 1 \le j \le n-1) \equiv \frac{P(X_j = x_j : 1 \le j \le n)}{P(X_j = x_j : 1 \le j \le n-1)}$$

whenever the denominator is positive and let $p(x_1, x_2, \ldots, x_n) = P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)$. Then

$$\lim_{n\to\infty} \frac{1}{n}\log p(X_1, X_2, \ldots, X_n) = -H \quad\text{exists w.p. 1}$$

where $H \equiv \lim_{n\to\infty} E\big(-\log p(X_n \mid X_{n-1}, X_{n-2}, \ldots, X_1)\big)$ is called the entropy rate of $\{X_i\}_{i\ge 1}$.
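A numerical illustration of Theorem 8.6.4 is easiest in the iid special case, where the joint probability factorizes as $p(x_1, \ldots, x_n) = \prod_i p_{x_i}$ and the entropy rate reduces to $\sum_j (-\log p_j) p_j$. The sketch below assumes a three-point distribution `p` and a sample size chosen only for this illustration, uses natural logarithms, and compares $\frac{1}{n}\log p(X_1, \ldots, X_n)$ with $-H$.

```python
import math
import random

def entropy(p):
    """H = sum_j (-log p_j) p_j, natural logarithm."""
    return sum(-q * math.log(q) for q in p if q > 0)

def normalized_log_likelihood(sample, p):
    """(1/n) log p(X_1,...,X_n) for an iid sequence, where the joint pmf is a product."""
    return sum(math.log(p[x]) for x in sample) / len(sample)

p = [0.2, 0.3, 0.5]                      # hypothetical distribution on {a_1, a_2, a_3}
rng = random.Random(0)
sample = rng.choices(range(len(p)), weights=p, k=200000)
H = entropy(p)
value = normalized_log_likelihood(sample, p)   # should be close to -H
```

For this `p`, $H \approx 1.0297$, and `value` differs from $-H$ only by simulation error. For a dependent stationary ergodic sequence the product form is not available and the conditional probabilities in the theorem must be used instead.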

Remark 8.6.5: In the iid case this is a consequence of the strong law of large numbers, and $H$ can be identified as $\sum_{j=1}^{k} (-\log p_j)\, p_j$ where $p_j = P(X_1 = a_j)$, $1 \le j \le k$. This is called the Kolmogorov-Shannon entropy of the distribution $\{p_j : 1 \le j \le k\}$. If $\{X_i\}_{i\ge 1}$ is a stationary ergodic Markov chain, then again it is a consequence of the strong law of large numbers, and $H$ can be identified with

$$E\big(-\log p(X_2 \mid X_1)\big) = \sum_{i=1}^{k} \pi_i \sum_{j=1}^{k} (-\log p_{ij})\, p_{ij}$$

where $\pi \equiv \{\pi_i : 1 \le i \le k\}$ is the stationary distribution and $P \equiv ((p_{ij}))$ is the transition probability matrix of the Markov chain $\{X_i\}_{i\ge 1}$. See Problem 8.27.

A more general version of the ergodic Theorem 8.6.1 is the following.

Theorem 8.6.5: (Kingman's subadditive ergodic theorem). Let $\{X_{m,n} : 0 \le m < n\}_{n\ge 1}$ be a collection of random variables such that

(i) $X_{0,m} + X_{m,n} \ge X_{0,n}$ for all $0 \le m < n$, $n \ge 1$.

(ii) For all $k \ge 1$, $\{X_{nk,(n+1)k}\}_{n\ge 1}$ is a stationary sequence.

(iii) The sequence $\{X_{m,m+k}, k \ge 1\}$ has a distribution that does not depend on $m \ge 0$.

(iv) $EX_{0,1}^{+} < \infty$ and for all $n$, $\frac{EX_{0,n}}{n} \ge \gamma_0$, where $\gamma_0 > -\infty$.

Then

(i) $\lim_{n\to\infty} \frac{EX_{0,n}}{n} = \inf_{n\ge 1} \frac{EX_{0,n}}{n} \equiv \gamma$.

(ii) $\lim_{n\to\infty} \frac{X_{0,n}}{n} \equiv X$ exists w.p. 1 and in $L^1$, and $EX = \gamma$.

(iii) If $\{X_{nk,(n+1)k}\}_{n\ge 1}$ is ergodic for each $k \ge 1$, then $X \equiv \gamma$ w.p. 1.

A nice application of this is a result on products of random matrices.

Theorem 8.6.6: Let $\{A_i\}_{i\ge 1}$ be a stationary sequence of $k \times k$ random matrices with nonnegative entries. Let $\alpha_{m,n}(i, j)$ be the $(i, j)$th entry in $A_{m+1} \cdots A_n$. Suppose $E|\log \alpha_{1,2}(i, j)| < \infty$ for all $i, j$. Then

(i) $\lim_{n\to\infty} \frac{1}{n}\log \alpha_{0,n}(i, j) = \eta$ exists w.p. 1.

(ii) For any $m$, $\lim_{n\to\infty} \frac{1}{n}\log \|A_{m+1} \cdots A_n\| = \eta$ w.p. 1, where for any $k \times k$ matrix $B \equiv ((b_{ij}))$, $\|B\| = \max\big\{\sum_{j=1}^{k} |b_{ij}| : 1 \le i \le k\big\}$.


Remark 8.6.6: A concept related to ergodicity is that of mixing. A measure preserving transformation $T$ on a probability space $(\Omega, \mathcal{F}, P)$ is mixing if for all $A, B \in \mathcal{F}$,
$$\lim_{n\to\infty} \big|P(A \cap T^{-n}B) - P(A)P(T^{-n}B)\big| = 0.$$

A stationary sequence of random variables $\{X_i\}_{i\ge 1}$ is mixing if the unilateral shift on the sequence space $\mathbb{R}^{\infty}$ induced by $\{X_i\}_{i\ge 1}$ is mixing. If $T$ is mixing and $A$ is $T$-invariant, then taking $B = A$ in the above yields $P(A) = P^2(A)$, i.e., $P(A) = 0$ or 1. Thus, if $T$ is mixing, then $T$ is ergodic. Conversely, if $T$ is ergodic, then by Theorem 8.6.1, for any $B$ in $\mathcal{F}$,
$$\frac{1}{n}\sum_{j=1}^{n} I_B(T^j\omega) \to P(B) \quad\text{w.p. 1.}$$

Integrating both sides over $A$ w.r.t. $P$ yields $\frac{1}{n}\sum_{j=1}^{n} P(A \cap T^{-j}B) \to P(A)P(B)$, i.e., $T$ is mixing in an average sense, i.e., the Cesàro sense. A sufficient condition for a stationary sequence to be mixing is that the tail σ-algebra be trivial. If $\{X_i\}_{i\ge 1}$ is a stationary irreducible Markov chain with a countable state space, then it is mixing iff it is aperiodic. For proofs of the above results, see Durrett (2004).
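The gap between mixing and Cesàro mixing in Remark 8.6.6 already shows up for the two-point swap of Remark 8.6.1, which is ergodic. The following sketch, with the two points labeled `0` and `1` and $A = B = \{0\}$ as illustrative choices, computes $P(A \cap T^{-n}B)$ exactly for $n = 1, \ldots, 40$ together with the Cesàro average.

```python
from fractions import Fraction

P = {0: Fraction(1, 2), 1: Fraction(1, 2)}
T = {0: 1, 1: 0}                            # swap of the two points

def preimage(B, n):
    """T^{-n} B: the points w with T^n(w) in B."""
    result = set()
    for w in P:
        image = w
        for _ in range(n):
            image = T[image]
        if image in B:
            result.add(w)
    return result

def prob(A):
    """Probability of a subset A of {0, 1}."""
    return sum((P[w] for w in A), Fraction(0))

A = B = {0}
terms = [prob(A & preimage(B, n)) for n in range(1, 41)]   # P(A & T^{-n}B), n = 1..40
cesaro = sum(terms, Fraction(0)) / len(terms)
```

The terms alternate between $0$ (odd $n$) and $\frac12$ (even $n$), so they do not converge to $P(A)P(B) = \frac14$ and $T$ is not mixing, while `cesaro` equals $\frac14$ exactly.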

8.7 Law of the iterated logarithm

Let $\{X_n\}_{n\ge 1}$ be a sequence of iid random variables with $EX_1 = 0$, $EX_1^2 = 1$. The SLLN asserts that the sample mean $\bar X_n = \frac{1}{n}\sum_{i=1}^{n} X_i \to 0$ w.p. 1. The central limit theorem (to be proved later) asserts that for all $-\infty < a < b < \infty$, $P(a \le \sqrt{n}\,\bar X_n \le b) \to \Phi(b) - \Phi(a)$ where $\Phi(\cdot)$ is the standard Normal cdf. This suggests that $S_n = \sum_{i=1}^{n} X_i$ is of the order of magnitude $\sqrt{n}$ for large $n$. This raises the question of how large $S_n/\sqrt{n}$ gets as a function of $n$. It turns out that $S_n$ is of the order $\sqrt{2n\log\log n}$, i.e., $S_n/\sqrt{n}$ is of the order $\sqrt{2\log\log n}$. More precisely, the following holds:

Theorem 8.7.1: (Law of the iterated logarithm). Let $\{X_i(\omega)\}_{i\ge 1}$ be iid random variables on a probability space $(\Omega, \mathcal{F}, P)$ with mean zero and variance one. Let $S_0(\omega) = 0$, $S_n(\omega) = \sum_{i=1}^{n} X_i(\omega)$, $n \ge 1$. For each $\omega$, let $A(\omega)$ be the set of limit points of
$$\Big\{\frac{S_n(\omega)}{\sqrt{2n\log\log n}}\Big\}_{n\ge 3}.$$
Then $P\{\omega : A(\omega) = [-1, +1]\} = 1$. For a proof, see Durrett (2004).
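The normalization in Theorem 8.7.1 can be explored by simulation. The sketch below assumes a symmetric $\pm 1$ random walk (mean zero, variance one) and a fixed seed, and records $S_n/\sqrt{2n\log\log n}$ for $3 \le n \le 10^5$; the function name `lil_ratios` and all parameters are choices made for this illustration.

```python
import math
import random

def lil_ratios(n_max, seed=0, n_min=3):
    """S_n / sqrt(2 n log log n) for n_min <= n <= n_max, for a +/-1 random walk."""
    rng = random.Random(seed)
    s = 0
    ratios = []
    for n in range(1, n_max + 1):
        s += 1 if rng.random() < 0.5 else -1
        if n >= n_min:                      # log log n > 0 requires n >= 3
            ratios.append(s / math.sqrt(2 * n * math.log(math.log(n))))
    return ratios

ratios = lil_ratios(100000)
tail = ratios[10000:]                       # discard the early, noisier part of the path
largest, smallest = max(tail), min(tail)
```

Since the theorem is a statement about limit points as $n \to \infty$, and $\log\log n$ grows extremely slowly, a finite run like this can only be suggestive: `largest` and `smallest` indicate the range visited by one path, not the limit set $[-1, +1]$ itself.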


A deep generalization of the above was obtained by Strassen (1964).

Theorem 8.7.2: Under the setup of Theorem 8.7.1, the following holds: Let $Y_n\big(\frac{j}{n}; \omega\big) = \frac{S_j(\omega)}{\sqrt{2n\log\log n}}$, $j = 0, 1, 2, \ldots, n$, and let $Y_n(t, \omega)$ be the function obtained by linearly interpolating the above values on $[0, 1]$. For each $\omega$, let $B(\omega)$ be the set of limit points of $\{Y_n(\cdot, \omega)\}_{n\ge 1}$ in the function space $C[0, 1]$ of all continuous functions on $[0, 1]$ with the supnorm. Then
$$P\{\omega : B(\omega) = K\} = 1$$
where $K \equiv \big\{f : f : [0, 1] \to \mathbb{R},\ f \text{ is continuously differentiable},\ f(0) = 0 \text{ and } \int_0^1 (f'(t))^2\, dt \le 1\big\}$.

8.8 Problems

8.1 Prove Theorem 8.1.3 and Corollary 8.1.4. (Hint: Use Chebychev's inequality.)

8.2 Let $\{X_n\}_{n\ge 1}$ be a sequence of random variables on a probability space $(\Omega, \mathcal{F}, P)$ such that for some $m \in \mathbb{N}$ and for each $i = 1, \ldots, m$, $\{X_i, X_{i+m}, X_{i+2m}, \ldots\}$ are identically distributed and pairwise independent. Furthermore, suppose that $E(|X_1| + \cdots + |X_m|) < \infty$. Show that
$$\bar X_n \longrightarrow \frac{1}{m}\sum_{i=1}^{m} EX_i, \quad\text{w.p. 1.}$$
(Hint: Reduce the problem to nonnegative $X_n$'s and apply Theorem 8.2.7 for each $i = 1, \ldots, m$.)

8.3 Let $f$ be a bounded measurable function on [0,1] that is continuous at $\frac12$. Evaluate
$$\lim_{n\to\infty} \int_0^1\int_0^1 \cdots \int_0^1 f\Big(\frac{x_1 + x_2 + \cdots + x_n}{n}\Big)\, dx_1\, dx_2 \ldots dx_n.$$

8.4 Show that if $P(|X| > \alpha) < \frac12$ for some real number $\alpha$, then any median of $X$ must lie in the interval $[-\alpha, \alpha]$.

8.5 Prove Theorem 8.3.4 using Kolmogorov's first inequality (Theorem 8.3.1 (a)). (Hint: Apply Theorem 8.3.1 to $\Delta_{n,k}$ defined in the proof of Theorem 8.3.3 to establish (3.4).)

8.6 Let $\{X_n\}_{n\ge 1}$ be a sequence of iid random variables with $E|X_1|^{\alpha} < \infty$ for some $\alpha > 0$. Derive a necessary and sufficient condition on $\alpha$ for almost sure convergence of the series $\sum_{n=1}^{\infty} X_n \sin 2\pi n t$ for all $t \in (0, 1)$.


8.7 Show that for any given sequence of random variables $\{X_n\}_{n\ge 1}$, there exists a sequence of real numbers $\{a_n\}_{n\ge 1} \subset (0, \infty)$ such that $\frac{X_n}{a_n} \to 0$ w.p. 1.

8.8 Let $\{X_n\}_{n\ge 1}$ be a sequence of independent random variables with
$$P(X_n = 2) = P(X_n = n^{\beta}) = a_n, \qquad P(X_n = a_n) = 1 - 2a_n$$
for some $a_n \in (0, \frac13)$ and $\beta \in \mathbb{R}$. Show that $\sum_{n=1}^{\infty} X_n$ converges if and only if $\sum_{n=1}^{\infty} a_n < \infty$.

8.9 Let $\{X_n\}_{n\ge 1}$ be a sequence of iid random variables with $E|X_1|^p = \infty$ for some $p \in (0, 2)$. Then $P\big(\limsup_{n\to\infty} |n^{-1/p}\sum_{i=1}^{n} X_i| = \infty\big) = 1$.

8.10 For any random variable $X$ and any $r \in (0, \infty)$, $E|X|^r < \infty$ iff $\sum_{n=1}^{\infty} n^{r-1}(\log n)^r P(|X| > n\log n) < \infty$. (Hint: Check that $\sum_{n=1}^{m} n^{r-1}(\log n)^r \sim r^{-1} m^r (\log m)^r$ as $m \to \infty$.)

8.11 Let $\{X_n\}_{n\ge 1}$ be a sequence of independent random variables with $EX_n = 0$, $EX_n^2 = \sigma_n^2$, $s_n^2 = \sum_{j=1}^{n} \sigma_j^2 \to \infty$. Then, show that for any $a > \frac12$,
$$s_n^{-2}(\log s_n^2)^{-a}\sum_{i=1}^{n} X_i \to 0 \quad\text{w.p. 1.}$$

8.12 Show that for $p \in (0, 2)$, $p \ne 1$, (4.12) holds.
(Hint: For $p \in (1, 2)$,
$$\sum_{n=1}^{\infty} |EZ_n/n^{1/p}| \le \sum_{n=1}^{\infty} E|X_1| I(|X_1|^p > n)\, n^{-1/p} = \sum_{j=1}^{\infty}\Big(\sum_{n=1}^{j} n^{-1/p}\Big) \cdot E|X_1| I(j < |X_1|^p \le j+1) \le \frac{p}{p-1} E|X_1|^p < \infty,$$
by (4.10). For $p \in (0, 1)$,
$$\sum_{n=1}^{\infty} |EZ_n/n^{1/p}| \le \sum_{j=1}^{\infty}\Big(\sum_{n=j}^{\infty} n^{-1/p}\Big) E|X_1| I(j-1 < |X_1|^p \le j) \le \frac{1}{1-p} E|X_1|^p,$$
by (4.9).)

8.13 Let $Y_i = x_i\beta + \epsilon_i$, $i \ge 1$, where $\{\epsilon_n\}_{n\ge 1}$ is a sequence of iid random variables, $\{x_n\}_{n\ge 1}$ is a sequence of constants, and $\beta \in \mathbb{R}$ is a constant (the regression parameter). Let $\hat\beta_n = \sum_{i=1}^{n} x_i Y_i \big/ \sum_{i=1}^{n} x_i^2$ denote the least squares estimator of $\beta$. Let $n^{-1}\sum_{i=1}^{n} x_i^2 \to c \in (0, \infty)$ and $E\epsilon_1 = 0$.

(a) If $E|\epsilon_1|^{1+\delta} < \infty$ for some $\delta \in (0, \infty)$, then show that
$$\hat\beta_n \longrightarrow \beta \quad\text{as}\quad n \to \infty, \quad\text{w.p. 1.} \tag{8.1}$$

(b) Suppose $\sup\{|x_i| : i \ge 1\} < \infty$ and $E|\epsilon_1| < \infty$. Show that (8.1) holds.


8.14 (Strongly consistent estimation.) Let $\{X_i\}_{i\ge 1}$ be random variables on some probability space $(\Omega, \mathcal{F}, P)$ such that (i) for some integer $m \ge 1$ the collections $\{X_i : i \le n\}$ and $\{X_i : i \ge n + m\}$ are independent for each $n \ge 1$, and (ii) the distribution of $\{X_{i+j} ; 0 \le j \le k\}$ is independent of $i$, for all $k \ge 0$.

(a) Show that for every $\ell \ge 1$ and $h : \mathbb{R}^{\ell} \to \mathbb{R}$ with $E|h(X_1, X_2, \ldots, X_{\ell})| < \infty$, there are functions $\{f_n : \mathbb{R}^n \to \mathbb{R}\}_{n\ge 1}$ such that $f_n(X_1, X_2, \ldots, X_n) \to \lambda \equiv Eh(X_1, X_2, \ldots, X_{\ell})$ w.p. 1. In this case, one says $\lambda$ is estimable from $\{X_i\}_{i\ge 1}$ in a strongly consistent manner.

(b) Now suppose the distribution $\mu(\cdot)$ of $X_1$ is a mixture of the form $\mu = \sum_{i=1}^{k} \alpha_i\mu_i$. Suppose there exist disjoint Borel sets $\{A_i\}_{1\le i\le k}$ in $\mathbb{R}$ such that $\mu_i(A_i) = 1$ for each $i$. Show that all the $\alpha_i$'s as well as $\lambda_i \equiv \int h_i(x)\, d\mu_i$, where $h_i \in L^1(\mu_i)$, are estimable from $\{X_i\}_{i\ge 1}$ in a strongly consistent manner.

8.15 (Normal numbers). Recall that in Section 4.5 it was shown that for any positive integer $p > 1$ and for any $0 \le \omega \le 1$, it is possible to write $\omega$ as
$$\omega = \sum_{i=1}^{\infty} \frac{X_i(\omega)}{p^i} \tag{8.2}$$
where for each $i$, $X_i(\omega) \in \{0, 1, 2, \ldots, p-1\}$. Recall also that such an expansion is unique except for $\omega$ of the form $q/p^n$, $q = 1, 2, \ldots, p^n - 1$, $n \ge 1$, in which case there are exactly two expansions, one of which is recurring. In what follows, for such $\omega$'s the recurrent expansion will be the one used in (8.2). A number $\omega$ in [0,1] is called normal w.r.t. the integer $p$ if for every finite pattern $a_1 a_2 \ldots a_k$, where $k \ge 1$ is a positive integer and $a_i \in \{0, 1, 2, \ldots, p-1\}$ for $1 \le i \le k$, the relative frequency $\frac{1}{n}\sum_{i=1}^{n} \delta_i(\omega)$, where
$$\delta_i(\omega) = \begin{cases} 1 & \text{if } X_{i+j}(\omega) = a_{j+1},\ j = 0, 1, 2, \ldots, k-1 \\ 0 & \text{otherwise,} \end{cases}$$
converges to $p^{-k}$ as $n \to \infty$. A number $\omega$ in [0,1] is called absolutely normal if it is normal w.r.t. $p$ for every integer $p > 1$. Show that the set $A$ of all numbers $\omega$ in [0,1] that are absolutely normal has Lebesgue measure one. (Hint: Note that in (8.2), the functions $\{X_i(\omega)\}_{i\ge 1}$ are iid random variables. Now use Problem 8.14 repeatedly.)

8.16 Show that for the renewal sequence $\{S_n\}_{n=0}^{\infty}$, if $P(X_1 > 0) > 0$, then $\lim_{n\to\infty} S_n = \infty$ w.p. 1.


8.17 (a) Show that $\{a_n\}_{n\ge 0}$ of (5.16) is the unique solution to (5.15) by using generating functions (cf. Section 5.5).

(b) Deduce Theorems 8.5.13 and 8.5.14 from Theorems 8.5.11 and 8.5.12, respectively. (Hint: For Theorem 8.5.13 use the DCT, and for Theorem 8.5.14, show first that
$$\sum_{n=0}^{k} \underline{m}_n(h)\big[U((n+1)h) - U(nh)\big] \le a(kh) \le \sum_{n=0}^{k} \overline{m}_n(h)\big[U((n+1)h) - U(nh)\big].)$$

8.18 (a) Let $b(\cdot) : [0, \infty) \to \mathbb{R}$ be dri. Show that $b(\cdot)$ is Riemann integrable on every bounded interval. Conclude that if $b(\cdot)$ is dri it must be continuous almost everywhere w.r.t. Lebesgue measure.

(b) Let $b(\cdot) : [0, \infty) \to \mathbb{R}$ be Riemann integrable on $[0, K]$ for each $K < \infty$. Let $h(\cdot) : [0, \infty) \to \mathbb{R}_+$ be nonincreasing and integrable w.r.t. Lebesgue measure and $|b(\cdot)| \le h(\cdot)$ on $[0, \infty)$. Show that $b(\cdot)$ is dri.

8.19 Verify that the sequence $\{Y_n\}_{n\ge 0}$ in Example 8.5.2 and the process $\{Y(t) : t \ge 0\}$ in Example 8.5.3 are both regenerative.

8.20 Show that the map $T$ in Example 8.6.4 in Section 8.6 is measure preserving. (Hint: Show that for $0 < a < b < 1$, $P(\{\omega : T\omega \in (a, b)\}) = (b - a)$.)

8.21 Let $T$ be a measure preserving map on a probability space $(\Omega, \mathcal{F}, P)$. Show that $A$ is almost $T$-invariant w.r.t. $P$ iff there exists a set $A_1$ such that $A_1 = T^{-1}A_1$ and $P(A \,\triangle\, A_1) = 0$. (Hint: Consider $A_1 = \bigcup_{n=0}^{\infty} T^{-n}A$.)

8.22 Show that a function $h : \Omega \to \mathbb{R}$ is $\mathcal{I}$-measurable iff $h(\omega) = h(T\omega)$ for all $\omega$, where $\mathcal{I}$ is the σ-algebra of $T$-invariant sets.

8.23 Consider the sequence space $(\mathbb{R}^{\infty}, \mathcal{B}(\mathbb{R}^{\infty}))$. Show that $A \in \mathcal{B}(\mathbb{R}^{\infty})$ being invariant w.r.t. the unilateral shift $T$ implies that $A$ is in the tail σ-algebra.

8.24 In Example 8.6.7 of Section 8.6, show that $\{Z_i\}_{i\ge 1}$ is a stationary sequence that is not ergodic. (Hint: Assuming it is ergodic, derive a contradiction using the ergodic Theorem 8.6.1.)


8.25 (a) Extend Example 8.6.7 to the Markov chain case with two disjoint irreducible positive recurrent subsets.

(b) Show that in Example 8.6.5, if $\theta_0$ is a rational multiple of $2\pi$, then $T$ is not ergodic.

8.26 (a) Verify that in Remark 8.6.1, $T$ is ergodic but $T^2$ is not.

(b) Construct a Markov chain with four states for which $T$ is ergodic but $T^2$ is not.

8.27 In Remark 8.6.5, prove the Shannon-McMillan-Breiman theorem directly for the Markov chain case. (Hint: Express $p(X_1, X_2, \ldots, X_n)$ as $\big(\prod_{i=1}^{n-1} p_{X_i X_{i+1}}\big)\, p(X_1)$.)

8.28 Let $\{X_i\}_{i\ge 1}$ be iid Bernoulli (1/2) random variables. Let
$$W_1 = \sum_{i=1}^{\infty} \frac{2X_{2i}}{4^i}, \qquad W_2 = \sum_{i=1}^{\infty} \frac{X_{2i-1}}{4^i}.$$

(a) Show that $W_1$ and $W_2$ are independent.

(b) Let $A_1 = \{\omega : \omega \in (0, 1)$ such that in the expansion of $\omega$ in base 4 only the digits 0 and 2 appear$\}$ and $A_2 = \{\omega : \omega \in (0, 1)$ such that in the expansion of $\omega$ in base 4 only the digits 0 and 1 appear$\}$. Show that $m(A_1) = m(A_2) = 0$ where $m(\cdot)$ is Lebesgue measure and hence that the distributions of $W_1$ and $W_2$ are singular w.r.t. $m(\cdot)$.

(c) Let $W \equiv W_1 + W_2$. Then show that $W$ has the uniform (0,1) distribution.

(Hint: For (b) use the SLLN.)

Remark: This example shows that the convolution of two singular probability measures can be absolutely continuous w.r.t. Lebesgue measure.

8.29 Let $\{X_n\}_{n\ge 1}$ be a sequence of pairwise independent and identically distributed random variables with $P(X_1 \le x) = F(x)$, $x \in \mathbb{R}$. Fix $0 < p < 1$. Suppose that $F(\zeta_p + \varepsilon) > p$ for all $\varepsilon > 0$ where
$$\zeta_p = F^{-1}(p) \equiv \inf\{x : F(x) \ge p\}.$$
Show that $\hat\zeta_n \equiv F_n^{-1}(p) \equiv \inf\{x : F_n(x) \ge p\}$ converges to $\zeta_p$ w.p. 1, where $F_n(x) \equiv n^{-1}\sum_{i=1}^{n} I(X_i \le x)$, $x \in \mathbb{R}$, is the empirical distribution function of $X_1, \ldots, X_n$.


8.30 Let $\{X_i\}_{i\ge 1}$ be random variables such that $EX_i^2 < \infty$ for all $i \ge 1$. Suppose $\frac{1}{n}\sum_{i=1}^{n} EX_i \to 0$ and $a_n \equiv \frac{1}{n^2}\sum_{j=0}^{n} (n - j)v(j) \to 0$ as $n \to \infty$, where $v(j) = \sup_i |\mathrm{Cov}(X_i, X_{i+j})|$.

(a) Show that $\bar X_n \longrightarrow_p 0$.

(b) Suppose further that $\sum_{n=1}^{\infty} a_n < \infty$. Show that $\bar X_n \to 0$ w.p. 1.

(c) Show that as $n \to \infty$, $v(n) \to 0$ implies $a_n \to 0$ but the converse need not hold.

8.31 Let $\{X_i\}_{i\ge 1}$ be iid random variables with cdf $F(\cdot)$. Let $F_n(x) \equiv \frac{1}{n}\sum_{i=1}^{n} I(X_i \le x)$ be the empirical cdf. Suppose $x_n \to x_0$ and $F(\cdot)$ is continuous at $x_0$. Show that $F_n(x_n) \to F(x_0)$ w.p. 1.

8.32 Let $p$ be a positive integer $> 1$. Let $\{\delta_i\}_{i\ge 1}$ be iid random variables with distribution $P(\delta_1 = j) = p_j$, $0 \le j \le p-1$, $p_j \ge 0$, $\sum_{0}^{p-1} p_j = 1$. Let $X = \sum_{i=1}^{\infty} \frac{\delta_i}{p^i}$. Show that

(a) $P(X \in (0, 1)) = 1$.

(b) $F_X(x) \equiv P(X \le x)$ is continuous and strictly increasing in (0,1) if $0 < p_j < 1$ for any $0 \le j \le p-1$.

(c) $F_X(\cdot)$ is absolutely continuous iff $p_j = \frac{1}{p}$ for all $0 \le j \le p-1$, in which case $F_X(x) \equiv x$, $0 \le x \le 1$.

8.33 (Random AR-series). Let $\{X_n\}_{n\ge 0}$ be a sequence of random variables such that
$$X_{n+1} = \rho_{n+1}X_n + \epsilon_{n+1}, \quad n \ge 0,$$
where the sequence $\{(\rho_n, \epsilon_n)\}_{n\ge 1}$ is iid and independent of $X_0$.

(a) Show that if $E(\log|\rho_1|) < 0$ and $E(\log|\epsilon_1|)^+ < \infty$ then
$$\hat X_n \equiv \sum_{j=0}^{n} \rho_1\rho_2 \ldots \rho_j\, \epsilon_{j+1}$$
converges w.p. 1.

(b) Show that under the hypothesis of (a), for any bounded continuous function $h : \mathbb{R} \to \mathbb{R}$ and for any distribution of $X_0$,
$$Eh(X_n) \to Eh(\hat X_{\infty}).$$
(Hint: Show by the SLLN that there is a $0 < \lambda < 1$ such that $\rho_1\rho_2 \cdots \rho_j = O(\lambda^j)$ w.p. 1 as $j \to \infty$ and by Borel-Cantelli $|\epsilon_j| = O(\lambda'^{\,j})$ for some $\lambda' > 0$ such that $\lambda\lambda' < 1$.)

8.34 (Iterated random functions). Let $(S, \rho)$ be a complete separable metric space. Let $(G, \mathcal{G})$ be a measurable space. Let $f : G \times S \to S$ be


a $\langle \mathcal{G} \times \mathcal{B}(S), \mathcal{B}(S)\rangle$ measurable function. Let $(\Omega, \mathcal{F}, P)$ be a probability space and $\{\theta_i\}_{i\ge 1}$ be iid $G$-valued random variables on $(\Omega, \mathcal{F}, P)$. Let $X_0$ be an $S$-valued random variable on $(\Omega, \mathcal{F}, P)$ independent of $\{\theta_i\}_{i\ge 1}$. Define $\{X_n\}_{n\ge 0}$ by the random iteration scheme,
$$X_0(x, \omega) \equiv x, \qquad X_{n+1}(x, \omega) = f\big(\theta_{n+1}(\omega), X_n(x, \omega)\big), \quad n \ge 0.$$

(a) Show that for each $n \ge 0$, the map $X_n : S \times \Omega \to S$ is $\langle \mathcal{B}(S) \times \mathcal{F}, \mathcal{B}(S)\rangle$ measurable.

(b) Let $f_n(x) \equiv f_n(x, \omega) \equiv f(\theta_n(\omega), x)$. Let $\hat X_n(x, \omega) = f_1(f_2(\ldots f_n(x)))$. Show that for each $x$ and $n$, $\hat X_n(x, \omega)$ and $X_n(x, \omega)$ have the same distribution.

(c) Now assume that for all $\omega$, $f(\theta_1(\omega), x)$ is Lipschitz from $S$ to $S$, i.e.,
$$\ell_i(\omega) \equiv \sup_{x \ne y} \frac{d\big(f(\theta_i(\omega), x), f(\theta_i(\omega), y)\big)}{d(x, y)} < \infty.$$
Show that $\ell_i(\omega)$ is a random variable on $(\Omega, \mathcal{F}, P)$, i.e., that $\ell_i(\cdot) : \Omega \to \mathbb{R}_+$ is $\langle \mathcal{F}, \mathcal{B}(\mathbb{R})\rangle$ measurable.

(d) Suppose that $E|\log \ell_1(\omega)| < \infty$ and $E\log \ell_1(\omega) < 0$, $E|\log d(f(\theta_1, x), x)| < \infty$ for all $x$. Show that $\lim_n \hat X_n(x, \omega) = \hat X_{\infty}(\omega)$ exists w.p. 1 and is independent of $x$ w.p. 1. (Hint: Use Borel-Cantelli to show that for each $x$, $\{\hat X_n(x, \omega)\}_{n\ge 1}$ is Cauchy in $(S, \rho)$.)

(e) Under the hypothesis in (d) show that for any bounded continuous $h : S \to \mathbb{R}$ and for any $x \in S$, $\lim_{n\to\infty} Eh(X_n(x, \omega)) = Eh(\hat X_{\infty}(\omega))$.

(f) Deduce the results in Problems 7.15 and 8.33 as special cases.

8.35 (Extension of Glivenko-Cantelli (Theorem 8.2.4) to the multivariate case). Let $\{X_n\}_{n\ge 1}$ be a sequence of pairwise independent and identically distributed random vectors taking values in $\mathbb{R}^k$ with cdf $F(x) \equiv P(X_{11} \le x_1, X_{12} \le x_2, \ldots, X_{1k} \le x_k)$ where $X_1 = (X_{11}, X_{12}, \ldots, X_{1k})$ and $x = (x_1, x_2, \ldots, x_k) \in \mathbb{R}^k$. Let $F_n(x) \equiv \frac{1}{n}\sum_{i=1}^{n} I(X_i \le x)$ be the empirical cdf based on $\{X_i\}_{1\le i\le n}$. Show that $\sup\{|F_n(x) - F(x)| : x \in \mathbb{R}^k\} \to 0$ w.p. 1. (Hint: First prove an extension of Pólya's theorem (Lemma 8.2.6) to the multivariate case.)

9 Convergence in Distribution

9.1 Definitions and basic properties

In this section, the notion of 'convergence in distribution' of a sequence of random variables is discussed. The importance and usefulness of this notion lie in the following observation: if a sequence of random variables $X_n$ converges in distribution to a random variable $X$, then one may approximate the probabilities $P(X_n \in A)$ by $P(X \in A)$ for large $n$ for a large class of sets $A \in \mathcal{B}(\mathbb{R})$. In many situations, exact evaluation of $P(X_n \in A)$ is a more difficult task than the evaluation of $P(X \in A)$. As a result, one may work with the limiting value $P(X \in A)$ instead of $P(X_n \in A)$, when $n$ is large.

As an example, consider the following problem from statistical inference. Let $Y_1, Y_2, \ldots$ be a collection of iid random variables with a finite second moment. Suppose that one is interested in finding the observed level of significance or the p-value for a statistical test of the hypotheses $H_0 : \mu = 0$ against an alternative $H_1 : \mu \ne 0$ about the population mean $\mu$. If the test statistic $\bar Y_n = n^{-1}\sum_{i=1}^{n} Y_i$ is used and the test rejects $H_0$ for large values of $|\sqrt{n}\,\bar Y_n|$, then the p-value of the test can be found using the function $\psi_n(a) \equiv P_0(|\sqrt{n}\,\bar Y_n| > a)$, $a \in [0, \infty)$, where $P_0$ denotes the joint distribution of $\{Y_n\}_{n\ge 1}$ under $\mu = 0$. Note that here, finding $\psi_n(\cdot)$ is difficult, as it depends on the joint distribution of $Y_1, \ldots, Y_n$. If, however, one knows that under $\mu = 0$, $\sqrt{n}\,\bar Y_n$ converges in distribution to a normal random variable $Z$ (which is in fact guaranteed by the central limit theorem, see Chapter 11), then one may approximate $\psi_n(a)$ by $P(|Z| > a)$, which can be found, e.g., by using a table of normal probabilities.


The formal definition of 'convergence in distribution' is given below.

Definition 9.1.1: Let $X_n$, $n \ge 0$ be a collection of random variables and let $F_n$ denote the cdf of $X_n$, $n \ge 0$. Then, $\{X_n\}_{n\ge 1}$ is said to converge in distribution to $X_0$, written as $X_n \longrightarrow_d X_0$, if
$$\lim_{n\to\infty} F_n(x) = F_0(x) \quad\text{for every}\quad x \in C(F_0) \tag{1.1}$$
where $C(F_0) = \{x \in \mathbb{R} : F_0 \text{ is continuous at } x\}$.

Definition 9.1.2: Let $\{\mu_n\}_{n\ge 0}$ be probability measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Then $\{\mu_n\}_{n\ge 1}$ is said to converge to $\mu_0$ weakly or in distribution, denoted by $\mu_n \longrightarrow_d \mu_0$, if (1.1) holds with $F_n(x) \equiv \mu_n\big((-\infty, x]\big)$, $x \in \mathbb{R}$, $n \ge 0$.

Unlike the notions of convergence in probability and convergence almost surely, the notion of convergence in distribution does not require that the random variables $X_n$, $n \ge 0$ be defined on a common probability space. Indeed, for each $n \ge 0$, $X_n$ may be defined on a different probability space $(\Omega_n, \mathcal{F}_n, P_n)$ and $\{X_n\}_{n\ge 1}$ may converge in distribution to $X_0$. In such a context, the notions of convergence of $\{X_n\}_{n\ge 1}$ to $X_0$ in probability or almost surely are not well defined. Definition 9.1.1 requires only the cdfs of the $X_n$'s to converge to that of $X_0$ at each $x \in C(F_0) \subset \mathbb{R}$, but does not require the (almost sure or in probability) convergence of the random variables $X_n$ themselves.

Example 9.1.1: For $n \ge 1$, let $X_n \sim$ Uniform $(0, \frac{1}{n})$, i.e., $X_n$ has the cdf
$$F_n(x) = \begin{cases} 0 & \text{if } x \le 0 \\ nx & \text{if } 0 < x < \frac{1}{n} \\ 1 & \text{if } x \ge \frac{1}{n} \end{cases}$$
and let $X_0$ be the degenerate random variable taking the value 0 with probability 1, i.e., the cdf of $X_0$ is
$$F_0(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{if } x \ge 0. \end{cases}$$
Note that the function $F_0(x)$ is discontinuous only at $x = 0$. Hence, $C(F_0) = \mathbb{R}\setminus\{0\}$. It is easy to verify that for every $x \ne 0$,
$$F_n(x) \to F_0(x) \quad\text{as}\quad n \to \infty.$$

Hence, $X_n \longrightarrow^{d} X_0$.

Example 9.1.2: Let $\{a_n\}_{n \geq 1}$ and $\{b_n\}_{n \geq 1}$ be sequences of real numbers such that $0 < b_n < \infty$ for all $n \geq 1$. Let $X_n \sim N(a_n, b_n)$, $n \geq 1$. Then, the cdf of $X_n$ is given by

$$F_n(x) = \Phi\Big(\frac{x - a_n}{b_n}\Big), \quad x \in \mathbb{R}, \tag{1.2}$$

where $\Phi(x) = \int_{-\infty}^{x} \phi(t)\,dt$ and $\phi(x) = (2\pi)^{-1/2} \exp(-x^2/2)$, $x \in \mathbb{R}$. If $X_0 \sim N(a_0, b_0)$ for some $a_0 \in \mathbb{R}$, $b_0 \in [0, \infty)$, then using (1.2), one can show that $X_n \longrightarrow^{d} X_0$ if and only if $a_n \to a_0$ and $b_n \to b_0$ as $n \to \infty$ (Problem 9.8).

Next some simple implications of Definition 9.1.1 are considered.

Proposition 9.1.1: If $X_n \longrightarrow_{p} X_0$, then $X_n \longrightarrow^{d} X_0$.

Proof: Let $F_n$ denote the cdf of $X_n$, $n \geq 0$. Fix $x \in C(F_0)$. Then, for any $\epsilon > 0$,

$$P(X_n \leq x) \leq P(X_0 \leq x + \epsilon) + P(X_n \leq x, X_0 > x + \epsilon) \leq P(X_0 \leq x + \epsilon) + P(|X_n - X_0| > \epsilon) \tag{1.3}$$

and similarly,

$$P(X_n \leq x) \geq P(X_0 \leq x - \epsilon) - P(|X_n - X_0| > \epsilon). \tag{1.4}$$

Hence, by (1.3) and (1.4),

$$F_0(x - \epsilon) - P(|X_n - X_0| > \epsilon) \leq F_n(x) \leq F_0(x + \epsilon) + P(|X_n - X_0| > \epsilon).$$

Since $X_n \longrightarrow_{p} X_0$, letting $n \to \infty$, one gets

$$F_0(x - \epsilon) \leq \liminf_{n \to \infty} F_n(x) \leq \limsup_{n \to \infty} F_n(x) \leq F_0(x + \epsilon) \tag{1.5}$$

for all $\epsilon \in (0, \infty)$. Note that as $x \in C(F_0)$, $F_0(x-) = F_0(x)$. Hence, letting $\epsilon \downarrow 0$ in (1.5), one has $\lim_{n \to \infty} F_n(x) = F_0(x)$. This proves the result. □

As pointed out before, the converse of Proposition 9.1.1 is false in general. The following is a partial converse. The proof follows from the definitions of convergence in probability and convergence in distribution and is left as an exercise (Problem 9.1).

Proposition 9.1.2: If $X_n \longrightarrow^{d} X_0$ and $P(X_0 = c) = 1$ for some $c \in \mathbb{R}$, then $X_n \longrightarrow_{p} c$.

Theorem 9.1.3: Let $X_n$, $n \geq 0$ be a collection of random variables with respective cdfs $F_n$, $n \geq 0$. Then, $X_n \longrightarrow^{d} X_0$ if and only if there exists a dense set $D$ in $\mathbb{R}$ such that

$$\lim_{n \to \infty} F_n(x) = F_0(x) \quad \text{for all } x \in D. \tag{1.6}$$

Proof: Since $C(F_0)^c$ has at most countably many points, the ‘only if’ part follows. To prove the ‘if’ part, suppose that (1.6) holds. Fix $x \in C(F_0)$. Then, there exist sequences $\{x_n\}_{n \geq 1}$, $\{y_n\}_{n \geq 1}$ in $D$ such that $x_n \uparrow x$ and $y_n \downarrow x$ as $n \to \infty$. Hence, for any $k, n \in \mathbb{N}$,

$$F_n(x_k) \leq F_n(x) \leq F_n(y_k).$$

By (1.6), for every $k \in \mathbb{N}$,

$$F_0(x_k) = \lim_{n \to \infty} F_n(x_k) \leq \liminf_{n \to \infty} F_n(x) \leq \limsup_{n \to \infty} F_n(x) \leq \lim_{n \to \infty} F_n(y_k) = F_0(y_k). \tag{1.7}$$

Since $x \in C(F_0)$, $\lim_{k \to \infty} F_0(x_k) = F_0(x) = \lim_{k \to \infty} F_0(y_k)$. Hence, by (1.7), $\lim_{n \to \infty} F_n(x)$ exists and equals $F_0(x)$. This completes the proof of Theorem 9.1.3. □

Theorem 9.1.4: (Pólya's theorem). Let $X_n$, $n \geq 0$ be random variables with respective cdfs $F_n$, $n \geq 0$. If $X_n \longrightarrow^{d} X_0$ and $F_0$ is continuous on $\mathbb{R}$, then

$$\sup_{x \in \mathbb{R}} \big|F_n(x) - F_0(x)\big| \to 0 \quad \text{as } n \to \infty.$$
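The role of the continuity of $F_0$ in Pólya's theorem can be illustrated numerically. The sketch below is an illustration under stated assumptions rather than a proof: it approximates the supremum by a maximum over a finite grid, uses the hypothetical sequence $X_n = Z + 1/n$ with $Z \sim N(0,1)$ for the continuous case, and reuses the cdfs of Example 9.1.1 for the discontinuous case. All function names are made up for this example.

```python
import math


def std_normal_cdf(x):
    """Standard normal cdf Phi(x)."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


# Grid on [-5, 5] with step 0.001; i / 1000 makes 0.0 an exact grid point.
GRID = [i / 1000 for i in range(-5000, 5001)]


def sup_distance(cdf_n, cdf_0):
    """Approximate sup_x |F_n(x) - F_0(x)| by a maximum over GRID."""
    return max(abs(cdf_n(x) - cdf_0(x)) for x in GRID)


def shifted_normal_cdf(n):
    """cdf of Z + 1/n with Z ~ N(0, 1); the limit cdf Phi is continuous."""
    return lambda x: std_normal_cdf(x - 1.0 / n)


def uniform_cdf(n):
    """cdf of Uniform(0, 1/n), as in Example 9.1.1."""
    return lambda x: min(max(n * x, 0.0), 1.0)


def point_mass_cdf(x):
    """cdf of the degenerate distribution at 0, which is discontinuous at 0."""
    return 1.0 if x >= 0 else 0.0
```

With these definitions, `sup_distance(shifted_normal_cdf(n), std_normal_cdf)` shrinks as `n` grows, whereas `sup_distance(uniform_cdf(n), point_mass_cdf)` stays at 1 for every `n` because of the jump of $F_0$ at 0, even though $X_n \longrightarrow^{d} X_0$ in Example 9.1.1.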

Proof: This is a special case of Lemma 8.2.6 and uses the following proposition. □

Proposition 9.1.5: If a cdf $F$ is continuous on $\mathbb{R}$, then it is uniformly continuous on $\mathbb{R}$.

The proof of Proposition 9.1.5 is left as an exercise (Problem 9.2).

Theorem 9.1.6: (Slutsky's theorem). Let $\{X_n\}_{n \geq 1}$ and $\{Y_n\}_{n \geq 1}$ be two sequences of random variables such that for each $n \geq 1$, $(X_n, Y_n)$ is defined on a probability space $(\Omega_n, \mathcal{F}_n, P_n)$. If $X_n \longrightarrow^{d} X$ and $Y_n \longrightarrow_{p} a$ for some $a \in \mathbb{R}$, then

(i) $X_n + Y_n \longrightarrow^{d} X + a$,

(ii) $X_n Y_n \longrightarrow^{d} aX$, and

(iii) $X_n / Y_n \longrightarrow^{d} X/a$, provided $a \neq 0$.

Proof: Only a proof of part (i) is given here. The other parts may be proved similarly. Let $F_0$ denote the cdf of $X$. Then, the cdf of $X + a$ is given by $F(x) = F_0(x - a)$, $x \in \mathbb{R}$. Fix $x \in C(F)$. Then, $x - a \in C(F_0)$. For any $\epsilon > 0$ (as in the derivations of (1.3) and (1.4)),

$$P(X_n + Y_n \leq x) \leq P(|Y_n - a| > \epsilon) + P(X_n + a - \epsilon \leq x) \tag{1.8}$$

and

$$P(X_n + Y_n \leq x) \geq P(X_n + a + \epsilon \leq x) - P(|Y_n - a| > \epsilon). \tag{1.9}$$

Now fix $\epsilon > 0$ such that $x - a - \epsilon$, $x - a + \epsilon \in C(F_0)$. This is possible since $\mathbb{R} \setminus C(F_0)$ is countable. Then, from (1.8) and (1.9), it follows that

$$\limsup_{n \to \infty} P(X_n + Y_n \leq x) \leq \lim_{n \to \infty} \big[ P(|Y_n - a| > \epsilon) + P(X_n \leq x - a + \epsilon) \big] = F_0(x - a + \epsilon) \tag{1.10}$$

and similarly,

$$\liminf_{n \to \infty} P(X_n + Y_n \leq x) \geq F_0(x - a - \epsilon). \tag{1.11}$$

Now letting $\epsilon \to 0+$ in such a way that $x - a \pm \epsilon \in C(F_0)$, from (1.10) and (1.11), it follows that

$$F_0((x - a)-) \leq \liminf_{n \to \infty} P(X_n + Y_n \leq x) \leq \limsup_{n \to \infty} P(X_n + Y_n \leq x) \leq F_0(x - a).$$

Since $x - a \in C(F_0)$, (i) is proved. □
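A standard use of part (iii) of Slutsky's theorem is the studentized mean. The following is a minimal simulation sketch under illustrative assumptions: the data are taken to be iid Uniform$(-1, 1)$ with standard deviation $\sigma$, $X_n = \sqrt{n}\,\bar{Y}_n/\sigma$ converges in distribution to $N(0,1)$ by the central limit theorem, and $Y_n = s_n/\sigma \longrightarrow_{p} 1$ by the law of large numbers, where $s_n$ is the sample standard deviation. The function names are hypothetical.

```python
import math
import random


def std_normal_cdf(x):
    """Standard normal cdf Phi(x)."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def studentized_mean(sample):
    """Return X_n / Y_n with X_n = sqrt(n) * mean / sigma and Y_n = s_n / sigma.

    sigma cancels in the ratio, so only the sample mean and the sample
    standard deviation s_n are needed.
    """
    n = len(sample)
    mean = sum(sample) / n
    var = sum((y - mean) ** 2 for y in sample) / (n - 1)
    return math.sqrt(n) * mean / math.sqrt(var)


def empirical_cdf_at(x, n, reps=4000, seed=1):
    """Monte Carlo estimate of P(X_n / Y_n <= x) for iid Uniform(-1, 1) data."""
    rng = random.Random(seed)
    count = 0
    for _ in range(reps):
        sample = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        if studentized_mean(sample) <= x:
            count += 1
    return count / reps
```

For moderately large `n`, `empirical_cdf_at(x, n)` should be close to `std_normal_cdf(x)`, up to Monte Carlo error, which is what part (iii) with $a = 1$ suggests.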

9.2 Vague convergence, Helly-Bray theorems, and tightness

One version of the Bolzano-Weierstrass theorem from real analysis states that if $A \subset [0, 1]$ is an infinite set, then there exists a sequence $\{x_n\}_{n \geq 1} \subset A$ such that $\lim_{n \to \infty} x_n \equiv x$ exists in $[0, 1]$. Note that $x$ need not be in $A$ unless $A$ is closed. There is an analog of this for sub-probability measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, i.e., for measures $\mu$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such that $\mu(\mathbb{R}) \leq 1$. First, one needs a definition of convergence of sub-probability measures.

Definition 9.2.1: Let $\{\mu_n\}_{n \geq 1}$, $\mu$ be sub-probability measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Then $\{\mu_n\}_{n \geq 1}$ is said to converge to $\mu$ vaguely, denoted by $\mu_n \longrightarrow^{v} \mu$, if there exists a set $D \subset \mathbb{R}$ such that $D$ is dense in $\mathbb{R}$ and

$$\mu_n((a, b]) \to \mu((a, b]) \quad \text{as } n \to \infty \quad \text{for all } a, b \in D. \tag{2.1}$$

Example 9.2.1: Let $\{X_n\}_{n \geq 1}$, $X$ be random variables such that $X_n$ converges to $X$ in distribution, i.e.,

$$F_n(x) \equiv P(X_n \leq x) \to F(x) \equiv P(X \leq x) \tag{2.2}$$

for all $x \in C(F)$, the set of continuity points of $F$. Since the complement of $C(F)$ is at most countable, (2.2) implies that $\mu_n \longrightarrow^{v} \mu$, where $\mu_n(\cdot) \equiv P(X_n \in \cdot)$ and $\mu(\cdot) \equiv P(X \in \cdot)$.

Remark 9.2.1: It follows from above that if $\{\mu_n\}_{n \geq 1}$, $\mu$ are probability measures, then

$$\mu_n \longrightarrow^{d} \mu \;\Rightarrow\; \mu_n \longrightarrow^{v} \mu. \tag{2.3}$$


Conversely, it is not difficult to show (Problem 9.4) that if $\mu_n \longrightarrow^{v} \mu$ and $\mu_n$ and $\mu$ are probability measures, then $\mu_n \longrightarrow^{d} \mu$.

Example 9.2.2: Let $\mu_n$ be the probability measure corresponding to the Uniform distribution on $[-n, n]$, $n \geq 1$. It is easy to show that $\mu_n \longrightarrow^{v} \mu_0$, where $\mu_0$ is the measure that assigns zero mass to all Borel sets. This shows that if $\mu_n \longrightarrow^{v} \mu$, then $\mu_n(\mathbb{R})$ need not converge to $\mu(\mathbb{R})$. But if $\mu_n(\mathbb{R})$ does converge to $\mu(\mathbb{R})$ and $\mu(\mathbb{R}) > 0$ and if $\mu_n \longrightarrow^{v} \mu$, then it can be shown that $\mu_n' \longrightarrow^{d} \mu'$, where $\mu_n' = \mu_n / \mu_n(\mathbb{R})$ and $\mu' = \mu / \mu(\mathbb{R})$.

Theorem 9.2.1: (Helly's selection theorem). Let $A$ be an infinite collection of sub-probability measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Then, there exist a sequence $\{\mu_n\}_{n \geq 1} \subset A$ and a sub-probability measure $\mu$ such that $\mu_n \longrightarrow^{v} \mu$.

Proof: Let $D \equiv \{r_n\}_{n \geq 1}$ be a countable dense set in $\mathbb{R}$ (for example, one may take $D = \mathbb{Q}$, the set of rationals, or $D = D_d$, the set of all dyadic rationals of the form $\{j/2^n : j \text{ an integer}, n \text{ a positive integer}\}$). For each $x$, let $A(x) \equiv \{\mu((-\infty, x]) : \mu \in A\}$. Then $A(x) \subset [0, 1]$ and so by the Bolzano-Weierstrass theorem applied to the set $A(r_1)$, one gets a sequence $\{\mu_{1n}\}_{n \geq 1} \subset A$ such that $\lim_{n \to \infty} F_{1n}(r_1) \equiv F(r_1)$ exists, where $F_{1i}(x) \equiv \mu_{1i}((-\infty, x])$, $x \in \mathbb{R}$. Next, applying the Bolzano-Weierstrass theorem to $\{F_{1n}(r_2)\}_{n \geq 1}$ yields a further subsequence $\{\mu_{2n}\}_{n \geq 1} \subset \{\mu_{1n}\}_{n \geq 1} \subset A$ such that $\lim_{n \to \infty} F_{2n}(r_2) \equiv F(r_2)$ exists, where $F_{2i}(x) \equiv \mu_{2i}((-\infty, x])$, $i \geq 1$. Continuing this way, one obtains a sequence of nested subsequences $\{\mu_{jn}\}_{n \geq 1}$, $j = 1, 2, \ldots$ such that for each $j$, $\lim_{n \to \infty} F_{jn}(r_j) \equiv F(r_j)$ exists. In particular, for the diagonal subsequence $\{\mu_{nn}\}_{n \geq 1}$,

$$\lim_{n \to \infty} F_{nn}(r_j) = F(r_j) \tag{2.4}$$

exists for all $j$. Now set

$$\tilde{F}(x) \equiv \inf\{F(r) : r > x, r \in D\}. \tag{2.5}$$

Then, $\tilde{F}(\cdot)$ is a nondecreasing right continuous function on $\mathbb{R}$ (Problem 9.5) and it equals $F(\cdot)$ on $D$. Let $\mu$ be the Lebesgue-Stieltjes measure generated by $\tilde{F}$. Since $F_{nn}(x) \leq 1$ for all $n$ and $x$, it follows that $\tilde{F}(x) \leq 1$ for all $x$ and hence that $\mu$ is a sub-probability measure. Suppose it is shown that (2.4) also implies that

$$\lim_{n \to \infty} F_{nn}(x) = \tilde{F}(x) \tag{2.6}$$

for all $x \in C_{\tilde{F}}$, the set of continuity points of $\tilde{F}$. Then, for all $a, b \in C_{\tilde{F}}$, $\mu_{nn}((a, b]) \equiv F_{nn}(b) - F_{nn}(a) \to \tilde{F}(b) - \tilde{F}(a) = \mu((a, b])$ and hence $\mu_{nn} \longrightarrow^{v} \mu$.

To establish (2.6), fix $x \in C_{\tilde{F}}$ and $\epsilon > 0$. Then, there is a $\delta > 0$ such that for all $x - \delta < y < x + \delta$, $\tilde{F}(x) - \epsilon < \tilde{F}(y) < \tilde{F}(x) + \epsilon$. This implies that there exist $x - \delta < r < x < r' < x + \delta$, $r, r' \in D$ and $\tilde{F}(x) - \epsilon < F(r) \leq \tilde{F}(x) \leq F(r') < \tilde{F}(x) + \epsilon$. Since $F_{nn}(r) \leq F_{nn}(x) \leq F_{nn}(r')$, it follows that

$$\tilde{F}(x) - \epsilon \leq \liminf_{n \to \infty} F_{nn}(x) \leq \limsup_{n \to \infty} F_{nn}(x) \leq \tilde{F}(x) + \epsilon,$$

establishing (2.6). □

Next, some characterization results on vague convergence and convergence in distribution will be established. These can then be used to define the notions of convergence of sub-probability measures on more general metric spaces.

Theorem 9.2.2: (The first Helly-Bray theorem or the Helly-Bray theorem for vague convergence). Let $\{\mu_n\}_{n \geq 1}$ and $\mu$ be sub-probability measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Then $\mu_n \longrightarrow^{v} \mu$ iff

$$\int f\,d\mu_n \to \int f\,d\mu \tag{2.7}$$

for all $f \in C_0(\mathbb{R}) \equiv \{g \mid g : \mathbb{R} \to \mathbb{R} \text{ is continuous and } \lim_{|x| \to \infty} g(x) = 0\}$.

Proof: Let $\mu_n \longrightarrow^{v} \mu$ and let $f \in C_0(\mathbb{R})$. Given $\epsilon > 0$, choose $K$ large such that $|f(x)| < \epsilon$ for $|x| > K$. Since $\mu_n \longrightarrow^{v} \mu$, there exists a dense set $D \subset \mathbb{R}$ such that $\mu_n((a, b]) \to \mu((a, b])$ for all $a, b \in D$. Now choose $a, b \in D$ such that $a < -K$ and $b > K$. Since $f$ is uniformly continuous on $[a, b]$ and $D$ is dense in $\mathbb{R}$, there exist points $x_0 = a < x_1 < x_2 < \cdots < x_m = b$ in $D$ such that $\sup_{x_i \leq x \leq x_{i+1}} |f(x) - f(x_i)| < \epsilon$ for all $0 \leq i < m$. Now

$$\int f\,d\mu_n = \int_{(-\infty, a]} f\,d\mu_n + \sum_{i=0}^{m-1} \int_{(x_i, x_{i+1}]} f\,d\mu_n + \int_{(b, \infty)} f\,d\mu_n$$

and so

$$\Big| \int f\,d\mu_n - \sum_{i=0}^{m-1} f(x_i)\,\mu_n((x_i, x_{i+1}]) \Big| < 2\epsilon + \epsilon \cdot \mu_n((a, b]) \leq 3\epsilon.$$

A similar approximation holds for $\int f\,d\mu$. Since $\mu_n$, $\mu$ are sub-probability measures, it follows that

$$\Big| \int f\,d\mu_n - \int f\,d\mu \Big| < 6\epsilon + \|f\| \sum_{i=0}^{m-1} \big| \mu_n((x_i, x_{i+1}]) - \mu((x_i, x_{i+1}]) \big|,$$

where $\|f\| = \sup\{|f(x)| : x \in \mathbb{R}\}$. Letting $n \to \infty$ and noting that $\mu_n \longrightarrow^{v} \mu$ and $\{x_i\}_{i=0}^{m} \subset D$, one gets

$$\limsup_{n \to \infty} \Big| \int f\,d\mu_n - \int f\,d\mu \Big| \leq 6\epsilon.$$


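Since $\epsilon > 0$ is arbitrary, (2.7) follows and the proof of the “only if” part is complete.

To prove the converse, let $D$ be the set of points $\{x : \mu(\{x\}) = 0\}$. Fix $a, b \in D$, $a < b$. Let $\epsilon > 0$. Let $f_1$ be the function defined by

$$f_1(x) = \begin{cases} 1 & \text{if } a \leq x \leq b \\ 0 & \text{if } x < a - \epsilon \text{ or } x > b + \epsilon \\ \text{linear} & \text{on } a - \epsilon \leq x < a, \; b \leq x \leq b + \epsilon. \end{cases}$$

Then, $f_1 \in C_0(\mathbb{R})$ and by (2.7), $\int f_1\,d\mu_n \to \int f_1\,d\mu$. But $\mu_n((a, b]) \leq \int f_1\,d\mu_n$ and $\int f_1\,d\mu \leq \mu((a - \epsilon, b + \epsilon])$. Thus, $\limsup_{n \to \infty} \mu_n((a, b]) \leq \mu((a - \epsilon, b + \epsilon])$. Letting $\epsilon \downarrow 0$ and noting that $a, b \in D$, one gets

$$\limsup_{n \to \infty} \mu_n((a, b]) \leq \mu((a, b]). \tag{2.8}$$

A similar argument with $f_2 = 1$ on $[a + \epsilon, b - \epsilon]$, $f_2 = 0$ for $x \leq a$ and $x \geq b$, and $f_2$ linear in between, yields

$$\liminf_{n \to \infty} \mu_n((a, b]) \geq \mu((a, b]).$$

This with (2.8) completes the proof of the “if” part. □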

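Theorem 9.2.3: (The second Helly-Bray theorem or the Helly-Bray theorem for weak convergence). Let $\{\mu_n\}_{n \geq 1}$, $\mu$ be probability measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Then, $\mu_n \longrightarrow^{d} \mu$ iff

$$\int f\,d\mu_n \to \int f\,d\mu \tag{2.9}$$

for all $f \in C_B(\mathbb{R}) \equiv \{g \mid g : \mathbb{R} \to \mathbb{R}, \; g \text{ is continuous and bounded}\}$.

Proof: Let $\mu_n \longrightarrow^{d} \mu$. Let $\epsilon > 0$ and $f \in C_B(\mathbb{R})$ be given. Choose $K$ large such that $\mu((-K, K]) > 1 - \epsilon$. Also, choose $a < -K$ and $b > K$ such that $\mu(\{a\}) = \mu(\{b\}) = 0$, i.e., $a, b \in D \equiv \{x : \mu(\{x\}) = 0\}$. Let $a = x_0 < x_1 < \cdots < x_m = b$ be chosen so that $x_0, \ldots, x_m \in D$ and

$$\sup_{x_i \leq x \leq x_{i+1}} |f(x) - f(x_i)| < \epsilon$$

for all $i = 0, \ldots, m - 1$. Since

$$\int f\,d\mu_n - \int f\,d\mu = \Big( \int_{(-\infty, a]} f\,d\mu_n - \int_{(-\infty, a]} f\,d\mu \Big) + \sum_{i=0}^{m-1} \Big( \int_{(x_i, x_{i+1}]} f\,d\mu_n - \int_{(x_i, x_{i+1}]} f\,d\mu \Big) + \Big( \int_{(b, \infty)} f\,d\mu_n - \int_{(b, \infty)} f\,d\mu \Big),$$

and since replacing $f$ by $f(x_i)$ on each $(x_i, x_{i+1}]$ changes each of the two integrals over $(a, b]$ by at most $\epsilon$, it follows that

$$\Big| \int f\,d\mu_n - \int f\,d\mu \Big| \leq 2\epsilon + \|f\| \Big[ \mu_n((-\infty, a]) + \mu((-\infty, a]) + \sum_{i=0}^{m-1} \big| \mu_n((x_i, x_{i+1}]) - \mu((x_i, x_{i+1}]) \big| + \mu_n((b, \infty)) + \mu((b, \infty)) \Big].$$

Since $a, b, x_0, x_1, \ldots, x_m \in D$,

$$\limsup_{n \to \infty} \Big| \int f\,d\mu_n - \int f\,d\mu \Big| \leq 2\epsilon + \|f\| \cdot 2\big(1 - \mu((a, b])\big) \leq 2\epsilon + 2\|f\|\epsilon.$$

Since $\epsilon > 0$ is arbitrary, the “only if” part is proved.

Next consider the “if” part. Since $C_0(\mathbb{R}) \subset C_B(\mathbb{R})$, (2.9) and Theorem 9.2.2 imply that $\mu_n \longrightarrow^{v} \mu$. As noted in Remark 9.2.1, if $\{\mu_n\}_{n \geq 1}$, $\mu$ are probability measures, then $\mu_n \longrightarrow^{v} \mu$ iff $\mu_n \longrightarrow^{d} \mu$. So the proof is complete. □

Definition 9.2.2: (a) A sequence of probability measures $\{\mu_n\}_{n \geq 1}$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is called tight if for any $\epsilon > 0$, there exists $M = M_\epsilon \in (0, \infty)$ such that

$$\sup_{n \geq 1} \mu_n\big([-M, M]^c\big) < \epsilon. \tag{2.10}$$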

(b) A sequence of random variables $\{X_n\}_{n \geq 1}$ is called tight or stochastically bounded if the sequence of probability distributions $\{\mu_n\}_{n \geq 1}$ of $\{X_n\}_{n \geq 1}$ is tight, i.e., given any $\epsilon > 0$, there exists $M = M_\epsilon \in (0, \infty)$ such that

$$\sup_{n \geq 1} P(|X_n| > M) < \epsilon. \tag{2.11}$$

Remark 9.2.3: In Definition 9.2.2 (b), the random variables $X_n$, $n \geq 1$ need not be defined on a common probability space. If $X_n$ is defined on a probability space $(\Omega_n, \mathcal{F}_n, P_n)$, $n \geq 1$, then (2.11) needs to be replaced by

$$\sup_{n \geq 1} P_n(|X_n| > M) < \epsilon.$$

Example 9.2.3: Let $X_n \sim$ Uniform$(n, n + 1)$. Then, for any given $M \in (0, \infty)$,

$$P(|X_n| > M) \geq P(X_n > M) = 1 \quad \text{for all } n > M.$$

Consequently, for any $M \in (0, \infty)$,

$$\sup_{n \geq 1} P(|X_n| > M) = 1$$

and the sequence $\{X_n\}_{n \geq 1}$ cannot be stochastically bounded.


Example 9.2.4: For $n \geq 1$, let

$$X_n \sim \text{Uniform}(a_n, 2 + a_n), \tag{2.12}$$

where $a_n = (-1)^n$. Then, $\{X_n\}_{n \geq 1}$ is stochastically bounded. Indeed, $|X_n| \leq 3$ for all $n \geq 1$ and therefore, for any $\epsilon > 0$, (2.11) holds with $M = 3$. Note that in this example, the sequence $\{X_n\}_{n \geq 1}$ does not converge in distribution to a random variable $X$. From (2.12), it follows that as $k \to \infty$,

$$X_{2k} \longrightarrow^{d} \text{Uniform}(1, 3), \quad X_{2k-1} \longrightarrow^{d} \text{Uniform}(-1, 1). \tag{2.13}$$
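The tail probabilities in Examples 9.2.3 and 9.2.4 can be computed exactly, which makes the contrast between the two sequences concrete. The sketch below is illustrative only: the helper names are hypothetical, and the supremum over $n \geq 1$ is approximated by a maximum over the finite horizon $n \leq N$, which is an assumption of the example.

```python
from fractions import Fraction


def prob_abs_exceeds(lo, hi, M):
    """Return P(|X| > M) exactly for X ~ Uniform(lo, hi), with lo < hi."""
    lo, hi, M = Fraction(lo), Fraction(hi), Fraction(M)
    # Length of the part of (lo, hi) that lies inside [-M, M].
    inside = max(Fraction(0), min(hi, M) - max(lo, -M))
    return 1 - inside / (hi - lo)


def sup_tail(intervals, M):
    """Maximum of P(|X_n| > M) over the listed Uniform(lo, hi) laws."""
    return max(prob_abs_exceeds(lo, hi, M) for lo, hi in intervals)


N = 200  # finite horizon standing in for the supremum over n >= 1

# Example 9.2.3: X_n ~ Uniform(n, n + 1); mass escapes to +infinity.
escaping = [(n, n + 1) for n in range(1, N + 1)]

# Example 9.2.4: X_n ~ Uniform(a_n, 2 + a_n) with a_n = (-1)^n.
oscillating = [((-1) ** n, 2 + (-1) ** n) for n in range(1, N + 1)]
```

For any `M` below the horizon, `sup_tail(escaping, M)` equals 1, while `sup_tail(oscillating, 3)` equals 0, in line with the two examples.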

Examples 9.2.3 and 9.2.4 highlight two important characteristics of a tight sequence of random variables or probability measures. First, the notion of tightness of probability measures or random variables is analogous to the notion of boundedness of a sequence of real numbers. For a bounded sequence of real numbers $\{x_n\}_{n \geq 1}$, all the $x_n$'s must lie in a bounded interval $[-M, M]$, $M \in (0, \infty)$. For a sequence of random variables $\{X_n\}_{n \geq 1}$, the condition of tightness requires that given $\epsilon > 0$ arbitrarily small, there exists an $M = M_\epsilon$ in $(0, \infty)$ such that for each $n$, $X_n$ lies in $[-M, M]$ with probability at least $1 - \epsilon$. Thus, for a tight sequence of random variables, no positive mass can escape to $\pm\infty$, which is contrary to what happens with the random variables $\{X_n\}_{n \geq 1}$ of Example 9.2.3.

The second property, illustrated by Example 9.2.4, is that like a bounded sequence of real numbers, a tight or stochastically bounded sequence of random variables may not converge in distribution, but has one or more convergent subsequences (cf. (2.13)). Indeed, the notion of tightness can be characterized by this property, as shown by the following result. For consistency with the other results in this section, it is stated in terms of probability measures instead of random variables.

Theorem 9.2.4: Let $\{\mu_n\}_{n \geq 1}$ be a sequence of probability measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. The sequence $\{\mu_n\}_{n \geq 1}$ is tight iff given any subsequence $\{\mu_{n_i}\}_{i \geq 1}$ of $\{\mu_n\}_{n \geq 1}$, there exists a further subsequence $\{\mu_{m_i}\}_{i \geq 1}$ of $\{\mu_{n_i}\}_{i \geq 1}$ and a probability measure $\mu$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such that

$$\mu_{m_i} \longrightarrow^{d} \mu \quad \text{as } i \to \infty. \tag{2.14}$$

Proof: Suppose that $\{\mu_n\}_{n \geq 1}$ is tight. Given any subsequence $\{\mu_{n_i}\}_{i \geq 1}$ of $\{\mu_n\}_{n \geq 1}$, by Helly's selection theorem (Theorem 9.2.1), there exists a sub-probability measure $\mu$ and a further subsequence $\{\mu_{m_i}\}_{i \geq 1}$ of $\{\mu_{n_i}\}_{i \geq 1}$ such that

$$\mu_{m_i} \longrightarrow^{v} \mu. \tag{2.15}$$

Next, fix $\epsilon \in (0, 1)$. Since $\{\mu_n\}_{n \geq 1}$ is tight, there exists $M \in (0, \infty)$ such that

$$\sup_{n \geq 1} \mu_n\big([-M, M]^c\big) < \epsilon. \tag{2.16}$$


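By (2.15) and (2.16), there exist $a, b \in D$, $a < -M$, $b > M$ such that

$$\mu((a, b]) = \lim_{i \to \infty} \mu_{m_i}((a, b]) \geq \liminf_{i \to \infty} \mu_{m_i}([-M, M]) = 1 - \limsup_{i \to \infty} \mu_{m_i}\big([-M, M]^c\big) \geq 1 - \epsilon.$$

Since $\epsilon \in (0, 1)$ is arbitrary, this shows that $\mu$ is a probability measure and hence, the ‘only if’ part is proved.

Next, consider the ‘if’ part. Suppose $\{\mu_n\}_{n \geq 1}$ is not tight. Then, there exists $\epsilon_0 \in (0, 1)$ such that for all $M \in (0, \infty)$,

$$\sup_{n \geq 1} \mu_n\big([-M, M]^c\big) > \epsilon_0.$$

Hence, for each $k \in \mathbb{N}$, there exists $n_k \in \mathbb{N}$ such that

$$\mu_{n_k}\big([-k, k]^c\big) \geq \epsilon_0. \tag{2.17}$$

Since any finite collection of probability measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is tight, it follows that $\{\mu_{n_k} : k \in \mathbb{N}\}$ is a countably infinite set. Hence, by the hypothesis, there exist a subsequence $\{\mu_{m_i}\}_{i \geq 1}$ in $\{\mu_{n_k} : k \in \mathbb{N}\}$ and a probability measure $\mu$ such that

$$\mu_{m_i} \longrightarrow^{d} \mu \quad \text{as } i \to \infty. \tag{2.18}$$

Let $a, b \in \mathbb{R}$ be such that $\mu(\{a\}) = 0 = \mu(\{b\})$ and $\mu((a, b]^c) < \epsilon_0/2$. By (2.18), there exists $i_0 \geq 1$ such that for all $i \geq i_0$,

$$\mu_{m_i}\big((a, b]^c\big) < \mu\big((a, b]^c\big) + \epsilon_0/2 < \epsilon_0.$$

Since $(a, b]^c \supset [-k, k]^c$ for all $k > \max\{|a|, |b|\}$ and $\{\mu_{m_i} : i \geq i_0\} \subset \{\mu_{n_k} : k \in \mathbb{N}\}$, this contradicts (2.17). Hence, $\{\mu_n\}_{n \geq 1}$ is tight. □

Theorem 9.2.5: Let $\{\mu_n\}_{n \geq 1}$, $\mu$ be probability measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. If $\mu_n \longrightarrow^{d} \mu$, then $\{\mu_n\}_{n \geq 1}$ is tight.

Proof: Fix $\epsilon \in (0, \infty)$. Then, there exist $a, b \in \mathbb{R}$ such that $\mu(\{a\}) = 0 = \mu(\{b\})$ and $\mu((a, b]^c) < \epsilon/2$. Since $\mu_n \longrightarrow^{d} \mu$, there exists $n_0 \geq 1$ such that for all $n \geq n_0$,

$$\big| \mu_n((a, b]) - \mu((a, b]) \big| < \epsilon/2.$$

Thus, for all $n \geq n_0$,

$$\mu_n\big((a, b]^c\big) \leq \mu\big((a, b]^c\big) + \epsilon/2 < \epsilon. \tag{2.19}$$

Also, for each $i = 1, \ldots, n_0$, there exists $M_i \in (0, \infty)$ such that

$$\mu_i\big([-M_i, M_i]^c\big) < \epsilon. \tag{2.20}$$

Let $M = \max\{M_i : 0 \leq i \leq n_0\}$, where $M_0 = \max\{|a|, |b|\}$. Then by (2.19) and (2.20),

$$\sup_{n \geq 1} \mu_n\big([-M, M]^c\big) < \epsilon.$$

Thus, $\{\mu_n\}_{n \geq 1}$ is tight. □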

An easy consequence of Theorems 9.2.4 and 9.2.5 is the following characterization of weak convergence.

Theorem 9.2.6: Let $\{\mu_n\}_{n \geq 1}$ be a sequence of probability measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Then $\mu_n \longrightarrow^{d} \mu$ iff $\{\mu_n\}_{n \geq 1}$ is tight and all weakly convergent subsequences of $\{\mu_n\}_{n \geq 1}$ converge to the same limiting probability measure $\mu$.

Proof: If $\mu_n \longrightarrow^{d} \mu$, then any weakly convergent subsequence of $\{\mu_n\}_{n \geq 1}$ converges to $\mu$ and by Theorem 9.2.5, $\{\mu_n\}_{n \geq 1}$ is tight. Hence, the ‘only if’ part follows.

To prove the ‘if’ part, suppose that $\{\mu_n\}_{n \geq 1}$ is tight and that all weakly convergent subsequences of $\{\mu_n\}_{n \geq 1}$ converge to $\mu$. Let $\{F_n\}_{n \geq 1}$ and $F$ denote the cdfs corresponding to $\{\mu_n\}_{n \geq 1}$ and $\mu$, respectively. If possible, suppose that $\{\mu_n\}_{n \geq 1}$ does not converge in distribution to $\mu$. Then, by definition, there exists $x_0 \in \mathbb{R}$ with $\mu(\{x_0\}) = 0$ such that $F_n(x_0)$ does not converge to $F(x_0)$ as $n \to \infty$. Then, there exist $\epsilon_0 \in (0, 1)$ and a subsequence $\{n_i\}_{i \geq 1}$ such that

$$\big| F_{n_i}(x_0) - F(x_0) \big| \geq \epsilon_0 \quad \text{for all } i \geq 1. \tag{2.21}$$

Since $\{\mu_n\}_{n \geq 1}$ is tight, there exist a subsequence $\{m_i\}_{i \geq 1} \subset \{n_i\}_{i \geq 1}$ and a probability measure $\mu_0$ such that

$$\mu_{m_i} \longrightarrow^{d} \mu_0 \quad \text{as } i \to \infty. \tag{2.22}$$

By hypothesis, $\mu_0 = \mu$. Hence $\mu_0(\{x_0\}) = \mu(\{x_0\}) = 0$ and by (2.22),

$$F_{m_i}(x_0) \to F(x_0) \quad \text{as } i \to \infty,$$

contradicting (2.21). Therefore, $\mu_n \longrightarrow^{d} \mu$. □

For another proof of the ‘if’ part, see Problem 9.6. Note that by Slutsky’s theorem, if Xn −→d X and Yn −→p 0, then Xn Yn −→p 0. The following result gives a reﬁnement of this. Proposition 9.2.7: If {Xn }n≥1 is stochastically bounded and Yn −→p 0, then Xn Yn −→p 0. The proof is left as an exercise (Problem 9.7).


9.3 Weak convergence on metric spaces

The Helly-Bray theorems proved above suggest the following definitions of vague convergence and convergence in distribution for measures on metric spaces. Recall that $(S, d)$ is called a metric space if $S$ is a nonempty set and $d$ is a function from $S \times S \to [0, \infty)$ such that

(i) $d(x, y) = d(y, x)$ for all $x, y \in S$,

(ii) $d(x, y) = 0$ iff $x = y$ for all $x, y \in S$,

(iii) $d(x, z) \leq d(x, y) + d(y, z)$ for all $x, y, z \in S$.

A common example of a metric space is given by $S = \mathbb{R}^k$ and $d(x, y)$, the Euclidean distance. A set $G \subset S$ is open if for all $x \in G$, there exists an $\epsilon > 0$ such that for any $y$ in $S$, $d(x, y) < \epsilon \Rightarrow y \in G$. The set $B(x, \epsilon) = \{y : d(x, y) < \epsilon\}$ is called the open ball of radius $\epsilon$ with center at $x$, $x \in S$, $\epsilon > 0$. Recall that $f : S \to \mathbb{R}$ is continuous if $f^{-1}((a, b))$ is open for every $-\infty < a < b < \infty$. A family $\mathcal{G}$ of open sets in $S$ is called an open cover for a set $B \subset S$ if for each $x \in B$, there exists a $G \in \mathcal{G}$ such that $x \in G$. A set $K \subset S$ is called compact if given any open cover $\mathcal{G}$ for $K$, there is a finite subfamily $\mathcal{G}_1 \subset \mathcal{G}$ such that $\mathcal{G}_1$ is an open cover for $K$. Let $\mathcal{S}$ be the Borel σ-algebra on $S$, i.e., let $\mathcal{S}$ be the σ-algebra generated by the open sets in $S$. A measure on the measurable space $(S, \mathcal{S})$ is often simply referred to as a measure on $(S, d)$.

Definition 9.3.1: Let $\{\mu_n\}_{n \geq 1}$ and $\mu$ be sub-probability measures on a metric space $(S, d)$, i.e., $\{\mu_n\}_{n \geq 1}$ and $\mu$ are measures on $(S, \mathcal{S})$ such that $\mu_n(S) \leq 1$ for all $n \geq 1$ and $\mu(S) \leq 1$. Then $\{\mu_n\}_{n \geq 1}$ converges vaguely to $\mu$ (written as $\mu_n \longrightarrow^{v} \mu$) if

$$\int f\,d\mu_n \to \int f\,d\mu \tag{3.1}$$

for all $f \in C_0(S)$, where $C_0(S) \equiv \{f \mid f : S \to \mathbb{R}, \; f$ is continuous and for every $\epsilon > 0$, there exists a compact set $K$ such that $|f(x)| < \epsilon$ for all $x \notin K\}$.

Definition 9.3.2: Let $\{\mu_n\}_{n \geq 1}$ and $\mu$ be probability measures on a metric space $(S, d)$. Then $\{\mu_n\}_{n \geq 1}$ converges in distribution or converges weakly to $\mu$ (written as $\mu_n \longrightarrow^{d} \mu$) if

$$\int f\,d\mu_n \to \int f\,d\mu \tag{3.2}$$

for all $f \in C_B(S) \equiv \{f \mid f : S \to \mathbb{R}, \; f \text{ is continuous and bounded}\}$.


Recall that a sequence $\{x_n\}_{n \geq 1}$ in a metric space $(S, d)$ is called Cauchy if for every $\epsilon > 0$, there exists $N_\epsilon$ such that $n, m > N_\epsilon \Rightarrow d(x_n, x_m) < \epsilon$. A metric space $(S, d)$ is complete if every Cauchy sequence $\{x_n\}_{n \geq 1}$ in $S$ converges in $S$, i.e., given a Cauchy sequence $\{x_n\}_{n \geq 1}$, there exists an $x$ in $S$ such that $d(x_n, x) \to 0$ as $n \to \infty$.

Example 9.3.1: For any $k \in \mathbb{N}$, $\mathbb{R}^k$ with the Euclidean metric is complete, but the set of all rational vectors $\mathbb{Q}^k$ with the Euclidean metric $d(x, y) \equiv \|x - y\|$ is not complete. The set $C[0, 1]$ of all continuous functions on $[0, 1]$ is complete with the supremum metric $d(f, g) = \sup\{|f(u) - g(u)| : 0 \leq u \leq 1\}$, but the set of all polynomials on $[0, 1]$ is not complete under the same metric.

Recall that a set $D$ is called dense in $(S, d)$ if $B(x, \epsilon) \cap D \neq \emptyset$ for all $x \in S$ and for all $\epsilon > 0$, where $B(x, \epsilon)$ is the open ball with center at $x$ and radius $\epsilon$. Also, $(S, d)$ is called separable if there exists a countable dense set $D \subset S$.

Definition 9.3.3: A metric space $(S, d)$ is called Polish if it is complete and separable.

Example 9.3.2: All Euclidean spaces with the Euclidean metric as well as with the $L^p$ metric for $1 \leq p \leq \infty$ are complete. The space $C[0, 1]$ of continuous functions on $[0, 1]$ with the supremum metric is complete. All $L^p$-spaces over measure spaces with a σ-finite measure and a countably generated σ-algebra, $1 \leq p \leq \infty$, are complete (cf. Chapter 3).

The following theorem gives several equivalent conditions for weak convergence of probability measures on a Polish space.

Theorem 9.3.1: Let $(S, d)$ be Polish and $\{\mu_n\}_{n \geq 1}$, $\mu$ be probability measures. Then the following are equivalent:

(i) $\mu_n \longrightarrow^{d} \mu$.

(ii) For any open set $G$, $\liminf_{n \to \infty} \mu_n(G) \geq \mu(G)$.

(iii) For any closed set $C$, $\limsup_{n \to \infty} \mu_n(C) \leq \mu(C)$.

(iv) For all $B \in \mathcal{S}$ such that $\mu(\partial B) = 0$, $\lim_{n \to \infty} \mu_n(B) = \mu(B)$, where $\partial B$ is the boundary of $B$, i.e., $\partial B = \{x : \text{for all } \epsilon > 0, \; B(x, \epsilon) \cap B \neq \emptyset, \; B(x, \epsilon) \cap B^c \neq \emptyset\}$.

(v) For every uniformly continuous and bounded function $f : S \to \mathbb{R}$, $\int f\,d\mu_n \to \int f\,d\mu$.


The proof uses the following fact.

Proposition 9.3.2: For every open set $G$ in a metric space $(S, d)$, there exists a sequence $\{f_n\}_{n \geq 1}$ of bounded continuous functions from $S$ to $[0, 1]$ such that as $n \uparrow \infty$, $f_n(x) \uparrow I_G(x)$ for all $x \in S$.

Proof: Let $G_n \equiv \{x : d(x, G^c) > \frac{1}{n}\}$, where for any set $A$ in $(S, d)$, $d(x, A) \equiv \inf\{d(x, y) : y \in A\}$. Then since $G$ is open, $d(x, G^c) > 0$ for all $x$ in $G$. Thus $G_n \uparrow G$. For each $n \geq 1$, let

$$f_n(x) \equiv \frac{d(x, G^c)}{d(x, G^c) + d(x, G_n)}, \quad x \in S. \tag{3.3}$$

Check that (Problem 9.10) for each $n$, $f_n(x)$ is continuous on $S$, $f_n(x) = 1$ on $G_n$ and $0$ on $G^c$, and $0 \leq f_n(x) \leq 1$ for all $x$ in $S$. Further, $f_n(\cdot) \uparrow I_G(\cdot)$. □

Proof of Theorem 9.3.1: (i) ⇒ (ii): Let $G$ be open. Choose $\{f_n\}_{n \geq 1}$ as in Proposition 9.3.2. Then for $j \in \mathbb{N}$,

$$\mu_n(G) \geq \int f_j\,d\mu_n \;\Rightarrow\; \liminf_{n \to \infty} \mu_n(G) \geq \liminf_{n \to \infty} \int f_j\,d\mu_n = \int f_j\,d\mu \quad \text{(by (i))}.$$

But $\lim_{j \to \infty} \int f_j\,d\mu = \mu(G)$, by the bounded convergence theorem. Hence (ii) holds.

(ii) ⇔ (iii): Suppose (ii) holds. Let $C$ be closed. Then $G = C^c$ is open. So by (ii),

$$\liminf_{n \to \infty} \mu_n(C^c) \geq \mu(C^c) \;\Rightarrow\; \limsup_{n \to \infty} \mu_n(C) \leq \mu(C),$$

since $\mu_n$ and $\mu$ are probability measures. Thus, (iii) holds. Similarly, (iii) ⇒ (ii).

(iii) ⇒ (iv): For any $B \in \mathcal{S}$, let $B^0$ and $\bar{B}$ denote, respectively, the interior and the closure of $B$. That is, $B^0 = \{y : B(y, \epsilon) \subset B \text{ for some } \epsilon > 0\}$ and $\bar{B} = \{y : \text{for some } \{x_n\}_{n \geq 1} \subset B, \; \lim_{n \to \infty} x_n = y\}$. Then, for any $n \geq 1$,

$$\mu_n(B^0) \leq \mu_n(B) \leq \mu_n(\bar{B})$$

and by (ii) and (iii),

$$\mu(B^0) \leq \liminf_{n \to \infty} \mu_n(B) \leq \limsup_{n \to \infty} \mu_n(B) \leq \mu(\bar{B}).$$

But $\partial B = \bar{B} \setminus B^0$ and so $\mu(\partial B) = 0$ implies $\mu(B^0) = \mu(\bar{B})$. Thus, $\lim_{n \to \infty} \mu_n(B) = \mu(B)$.

(iv) ⇒ (v): This will be proved for the case where $S$ is the real line. For the general Polish case, see Billingsley (1968). Let $F(x) \equiv \mu((-\infty, x])$ and $F_n(x) \equiv \mu_n((-\infty, x])$, $x \in \mathbb{R}$, $n \geq 1$. Let $x$ be a continuity point of $F$. Then $\mu(\{x\}) = 0$. Since $\partial B = \{x\}$ for $B = (-\infty, x]$, by (iv),

$$F_n(x) = \mu_n((-\infty, x]) \to \mu((-\infty, x]) = F(x).$$


Thus, $\mu_n \longrightarrow^{d} \mu$. By Theorem 9.2.3, (i) holds and hence (v) holds.

(v) ⇒ (i): Note that in the proof of Theorem 9.2.2, the approximating functions $f_1$ and $f_2$ were both uniformly continuous. Hence, the assertion follows from Theorem 9.2.2 and Remark 9.2.1. This completes the proof of Theorem 9.3.1. □

The following example shows that the inequality can be strict in (ii) and (iii) of the above theorem.

Example 9.3.3: Let $X$ be a random variable. Set $X_n = X + \frac{1}{n}$, $Y_n = X - \frac{1}{n}$, $n \geq 1$. Since $X_n$ and $Y_n$ both converge to $X$ w.p. 1, the distributions $\mu_n$ of $X_n$ and $\nu_n$ of $Y_n$ converge to the distribution $\mu$ of $X$. Now suppose that there is a value $x_0$ such that $P(X = x_0) > 0$. Then,

$$\mu_n((-\infty, x_0)) \equiv P(X_n < x_0) = P\big(X < x_0 - \tfrac{1}{n}\big) \to P(X < x_0) = \mu((-\infty, x_0)),$$

$$\mu_n((-\infty, x_0]) = P(X_n \leq x_0) = P\big(X \leq x_0 - \tfrac{1}{n}\big) \to P(X < x_0) < \mu((-\infty, x_0]),$$

and

$$\nu_n((-\infty, x_0)) \equiv P(Y_n < x_0) = P\big(X < x_0 + \tfrac{1}{n}\big) \to P(X \leq x_0) > P(X < x_0).$$

Note that $\mu_n$ and $\nu_n$ both converge in distribution to $\mu$. However, for the closed set $(-\infty, x_0]$,

$$\limsup_{n \to \infty} \mu_n((-\infty, x_0]) < \mu((-\infty, x_0])$$

and for the open set $(-\infty, x_0)$,

$$\liminf_{n \to \infty} \nu_n((-\infty, x_0)) > \mu((-\infty, x_0)).$$
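The strict inequalities of Example 9.3.3 can be checked exactly for a concrete discrete law. The sketch below assumes, purely for illustration, that $X$ takes the values $-1, 0, 1$ with probabilities $1/4, 1/2, 1/4$ and that $x_0 = 0$; this choice of distribution and the helper names are hypothetical and not from the text.

```python
from fractions import Fraction

# Hypothetical discrete law for X with an atom at x0 = 0.
ATOMS = {
    Fraction(-1): Fraction(1, 4),
    Fraction(0): Fraction(1, 2),
    Fraction(1): Fraction(1, 4),
}
X0 = Fraction(0)


def prob(shift, event):
    """Return P(event(X + shift)) for the discrete X defined by ATOMS."""
    return sum(p for x, p in ATOMS.items() if event(x + shift))


def mu_n_closed(n):
    """mu_n((-inf, x0]) = P(X + 1/n <= x0)."""
    return prob(Fraction(1, n), lambda v: v <= X0)


def nu_n_open(n):
    """nu_n((-inf, x0)) = P(X - 1/n < x0)."""
    return prob(Fraction(-1, n), lambda v: v < X0)


mu_closed = prob(Fraction(0), lambda v: v <= X0)  # mu((-inf, x0])
mu_open = prob(Fraction(0), lambda v: v < X0)     # mu((-inf, x0))
```

Here `mu_n_closed(n)` stays strictly below `mu_closed` and `nu_n_open(n)` stays strictly above `mu_open` for every `n`, which is the strictness asserted for the closed set $(-\infty, x_0]$ and the open set $(-\infty, x_0)$.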

Remark 9.3.1: Convergent sequences of probability distributions arise in a natural way in parametric families in mathematical statistics. For example, let $\mu(\cdot; \theta)$ denote the normal distribution with mean $\theta$ and variance 1. Then, $\theta_n \to \theta \Rightarrow \mu_n(\cdot) \equiv N(\theta_n, 1) \longrightarrow^{d} N(\theta, 1) \equiv \mu(\cdot)$. Similarly, let $\theta = (\lambda, \Sigma)$, where $\lambda \in \mathbb{R}^k$ and $\Sigma$ is a $k \times k$ positive definite matrix. Let $\mu(\cdot; \theta)$ be the $k$-variate normal distribution with mean $\lambda$ and variance covariance matrix $\Sigma$. Then, $\mu(\cdot; \theta)$ is continuous in $\theta$ in the sense that if $\theta_n \to \theta$ in the Euclidean metric, then $\mu(\cdot; \theta_n) \longrightarrow^{d} \mu(\cdot; \theta)$. Most parametric families in mathematical statistics possess this continuity property.

Definition 9.3.4: Let $\{\mu_n\}_{n \geq 1}$ be a sequence of probability measures on $(S, \mathcal{S})$, where $S$ is a Polish space and $\mathcal{S}$ is the Borel σ-algebra on $S$. Then $\{\mu_n\}_{n \geq 1}$ is called tight if for any $\epsilon > 0$, there exists a compact set $K$ such that

$$\sup_{n \geq 1} \mu_n(K^c) < \epsilon. \tag{3.4}$$

A sequence of $S$-valued random variables $\{X_n\}_{n \geq 1}$ is called tight or stochastically bounded if the sequence $\{\mu_{X_n}\}_{n \geq 1}$ is tight, where $\mu_{X_n}$ is the probability distribution of $X_n$ on $(S, \mathcal{S})$. If $S = \mathbb{R}^k$, $k \in \mathbb{N}$, and $\{X_n\}_{n \geq 1}$ is a sequence of $k$-dimensional random vectors, then, by Definition 9.3.4, $\{X_n\}_{n \geq 1}$ is tight if and only if for every $\epsilon > 0$, there exists $M \in (0, \infty)$ such that

$$\sup_{n \geq 1} P(\|X_n\| > M) < \epsilon, \tag{3.5}$$

where $\|\cdot\|$ denotes the usual Euclidean norm on $\mathbb{R}^k$. Furthermore, if $X_n = (X_{n1}, \ldots, X_{nk})$, $n \geq 1$, then the tightness of $\{X_n\}_{n \geq 1}$ is equivalent to the tightness of the $k$ sequences of random variables $\{X_{nj}\}_{n \geq 1}$, $j = 1, \ldots, k$ (Problem 9.9). An analog of Theorem 9.2.4 holds for probability measures on $(S, \mathcal{S})$ when $S$ is Polish.

Theorem 9.3.3: (Prohorov-Varadarajan theorem). Let $\{\mu_n\}_{n \geq 1}$ be a sequence of probability measures on $(S, \mathcal{S})$, where $S$ is a Polish space and $\mathcal{S}$ is the Borel σ-algebra on $S$. Then, $\{\mu_n\}_{n \geq 1}$ is tight iff given any subsequence $\{\mu_{n_i}\}_{i \geq 1} \subset \{\mu_n\}_{n \geq 1}$, there exist a further subsequence $\{\mu_{m_i}\}_{i \geq 1}$ of $\{\mu_{n_i}\}_{i \geq 1}$ and a probability measure $\mu$ on $(S, \mathcal{S})$ such that

$$\mu_{m_i} \longrightarrow^{d} \mu \quad \text{as } i \to \infty. \tag{3.6}$$

For a proof of this result, see Section 1.6 of Billingsley (1968). This result is useful for proving weak convergence in function spaces (e.g., see Chapter 11 where a functional central limit theorem is stated).

9.4 Skorohod's theorem and the continuous mapping theorem

If $\{X_n\}_{n \geq 1}$ is a sequence of random variables that converge to a random variable $X$ in probability, then $X_n$ does converge in distribution to $X$ (cf. Proposition 9.1.1). Here is another proof of this fact using Theorem 9.2.3. Let $f : \mathbb{R} \to \mathbb{R}$ be bounded and continuous. Then $X_n \to X$ in probability implies that $f(X_n) \to f(X)$ in probability (Problem 9.13) and by the BCT,

$$\int f\,d\mu_n = Ef(X_n) \to Ef(X) = \int f\,d\mu,$$

where $\mu_n(\cdot) = P(X_n \in \cdot)$, $n \geq 1$ and $\mu(\cdot) = P(X \in \cdot)$. Hence, $\mu_n \longrightarrow^{d} \mu$. In particular, it follows that if $X_n \to X$ w.p. 1, then $X_n \longrightarrow^{d} X$. Skorohod's theorem is a sort of converse to this. If $\mu_n \longrightarrow^{d} \mu$, then there exist random variables $X_n$, $n \geq 1$ and $X$ such that $X_n$ has distribution $\mu_n$, $n \geq 1$, $X$ has distribution $\mu$, and $X_n \to X$ w.p. 1.

Theorem 9.4.1: (Skorohod's theorem). Let $\{\mu_n\}_{n \geq 1}$, $\mu$ be probability measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such that $\mu_n \longrightarrow^{d} \mu$. Let

$$X_n(\omega) \equiv \sup\{t : \mu_n((-\infty, t]) < \omega\},$$

$$X(\omega) \equiv \sup\{t : \mu((-\infty, t]) < \omega\}$$

for $0 < \omega < 1$. Then, $X_n$ and $X$ are random variables on $((0, 1), \mathcal{B}((0, 1)), m)$, where $m$ is the Lebesgue measure. Furthermore, $X_n$ has distribution $\mu_n$, $n \geq 1$, $X$ has distribution $\mu$, and $X_n \to X$ w.p. 1.

Proof: For any cdf $F(\cdot)$, let $F^{-1}(u) \equiv \sup\{t : F(t) < u\}$. Then for any $u \in (0, 1)$ and $t \in \mathbb{R}$, it can be verified that $F^{-1}(u) \leq t \Rightarrow F(t) \geq u \Rightarrow F^{-1}(u) \leq t$ and hence, if $U$ is a Uniform$(0, 1)$ random variable (Problem 9.11),

$$P(F^{-1}(U) \leq t) = P(U \leq F(t)) = F(t),$$

implying that $F^{-1}(U)$ has cdf $F(\cdot)$. This shows that $X_n$, $n \geq 1$ and $X$ have the asserted distributions. It remains to show that

$$X_n(\omega) \to X(\omega) \quad \text{w.p. 1.}$$

Fix $\omega \in (0, 1)$ and let $y < X(\omega)$ be such that $\mu(\{y\}) = 0$. Now $y < X(\omega) \Rightarrow \mu((-\infty, y]) < \omega$. Since $\mu_n \longrightarrow^{d} \mu$ and $\mu(\{y\}) = 0$, $\mu_n((-\infty, y]) \to \mu((-\infty, y])$ and so $\mu_n((-\infty, y]) < \omega$ for large $n$. This implies that $X_n(\omega) \geq y$ for large $n$ and hence $\liminf_{n \to \infty} X_n(\omega) \geq y$. Since this is true for all $y < X(\omega)$ with $\mu(\{y\}) = 0$, and since the set of all such $y$'s is dense in $\mathbb{R}$, it follows that

$$\liminf_{n \to \infty} X_n(\omega) \geq X(\omega) \quad \text{for all } \omega \text{ in } (0, 1).$$

Next fix $\epsilon > 0$ and $y > X(\omega + \epsilon)$ with $\mu(\{y\}) = 0$. Then $\mu((-\infty, y]) \geq \omega + \epsilon$. Since $\mu(\{y\}) = 0$, $\mu_n((-\infty, y]) \to \mu((-\infty, y])$. Thus, for large $n$, $\mu_n((-\infty, y]) \geq \omega$. This implies that $X_n(\omega) \leq y$ for large $n$ and hence that $\limsup_{n \to \infty} X_n(\omega) \leq y$. Since this is true for all $y > X(\omega + \epsilon)$ with $\mu(\{y\}) = 0$, it follows that

$$\limsup_{n \to \infty} X_n(\omega) \leq X(\omega + \epsilon) \quad \text{for every } \epsilon > 0$$

and hence that

$$\limsup_{n \to \infty} X_n(\omega) \leq X(\omega+).$$

Thus it has been shown that for all $0 < \omega < 1$,

$$X(\omega) \leq \liminf_{n \to \infty} X_n(\omega) \leq \limsup_{n \to \infty} X_n(\omega) \leq X(\omega+).$$

Since $X(\omega)$ is a nondecreasing function on $(0, 1)$, it has at most a countable number of discontinuities and so

$$\lim_{n \to \infty} X_n(\omega) = X(\omega) \quad \text{w.p. 1.}$$

□
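The construction in Skorohod's theorem can be traced by hand for distributions with finitely many atoms. The sketch below is an illustration under stated assumptions: it takes the hypothetical sequence $\mu_n =$ Bernoulli$(\frac{1}{2} + \frac{1}{2n})$, which converges in distribution to $\mu =$ Bernoulli$(\frac{1}{2})$, and evaluates $F^{-1}(u) = \sup\{t : F(t) < u\}$ exactly. The function names are made up for this example.

```python
from fractions import Fraction


def quantile(atoms, u):
    """F^{-1}(u) = sup{t : F(t) < u} for a finitely supported distribution.

    `atoms` is a list of (value, probability) pairs sorted by value.
    For a step cdf this supremum is the smallest atom whose cumulative
    probability is at least u.
    """
    cumulative = Fraction(0)
    for value, p in atoms:
        cumulative += p
        if cumulative >= u:
            return value
    return atoms[-1][0]


def bernoulli(p):
    """Atoms of the Bernoulli(p) law: mass 1 - p at 0 and mass p at 1."""
    return [(0, 1 - p), (1, p)]


def X_n(n, omega):
    """Skorohod representation of mu_n = Bernoulli(1/2 + 1/(2n))."""
    return quantile(bernoulli(Fraction(1, 2) + Fraction(1, 2 * n)), omega)


def X(omega):
    """Skorohod representation of the limit mu = Bernoulli(1/2)."""
    return quantile(bernoulli(Fraction(1, 2)), omega)
```

In this example $X_n(\omega) \to X(\omega)$ for every $\omega \neq \frac{1}{2}$, while at $\omega = \frac{1}{2}$, a discontinuity point of $X$, one has $X_n(\frac{1}{2}) = 1$ for all $n$ but $X(\frac{1}{2}) = 0$. This single point has Lebesgue measure zero, consistent with convergence w.p. 1.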

An immediate consequence of the above theorem is the continuity of convergence in distribution under continuous transformations.

Theorem 9.4.2: (The continuous mapping theorem). Let $\{X_n\}_{n \geq 1}$, $X$ be random variables such that $X_n \longrightarrow^{d} X$. Let $f : \mathbb{R} \to \mathbb{R}$ be Borel measurable such that $P(X \in D_f) = 0$, where $D_f$ is the set of discontinuities of $f$. Then $f(X_n) \longrightarrow^{d} f(X)$. In particular, this holds if $f : \mathbb{R} \to \mathbb{R}$ is continuous.

Remark 9.4.1: It can be shown that for any $f : \mathbb{R} \to \mathbb{R}$, the set $D_f = \{x : f \text{ is discontinuous at } x\} \in \mathcal{B}(\mathbb{R})$ (Problem 9.12). Thus, $\{X \in D_f\} \in \mathcal{F}$, and $P(X \in D_f)$ is well defined.

Proof: By Skorohod's theorem, there exist random variables $\{\tilde{X}_n\}_{n \geq 1}$, $\tilde{X}$ defined on the Lebesgue space $(\Omega = (0, 1), \mathcal{B}((0, 1)), m = \text{Lebesgue measure})$ such that $\tilde{X}_n =_d X_n$ for $n \geq 1$, $\tilde{X} =_d X$, and

$$\tilde{X}_n \to \tilde{X} \quad \text{w.p. 1.}$$

Let $A = \{\omega : \tilde{X}_n(\omega) \to \tilde{X}(\omega)\}$ and $B = \{\omega : \tilde{X}(\omega) \notin D_f\}$. Then, $P(A) = 1 = P(B)$ and so, for $\omega \in A \cap B$,

$$f(\tilde{X}_n(\omega)) \to f(\tilde{X}(\omega)).$$

Thus, $f(\tilde{X}_n) \to f(\tilde{X})$ w.p. 1 and hence $f(X_n) \longrightarrow^{d} f(X)$. □

Another easy consequence of Skorohod's theorem is the Helly-Bray Theorem 9.2.3. Since $\tilde{X}_n \to \tilde{X}$ w.p. 1 and $f$ is a bounded continuous function, $f(\tilde{X}_n) \to f(\tilde{X})$ w.p. 1 and so by the bounded convergence theorem,

$$Ef(\tilde{X}_n) \to Ef(\tilde{X}).$$

Since $\tilde{X}_n =_d X_n$ for $n \geq 1$ and $\tilde{X} =_d X$, this is the same as saying that

$$Ef(X_n) \to Ef(X).$$

That is, $\int f\,d\mu_n \to \int f\,d\mu$, where $\mu_n(\cdot) = P(X_n \in \cdot)$, $n \geq 1$ and $\mu(\cdot) = P(X \in \cdot)$.

Remark 9.4.2: Skorohod's theorem is valid for any Polish space. Suppose that $S$ is a Polish space and $\{\mu_n\}_{n \geq 1}$ and $\mu$ are probability measures on $(S, \mathcal{S})$, where $\mathcal{S}$ is the Borel σ-algebra on $S$, such that $\mu_n \longrightarrow^{d} \mu$. Then there exist random variables $X_n$ and $X$ defined on the Lebesgue space $((0, 1), \mathcal{B}((0, 1)), m = \text{the Lebesgue measure})$ such that for all $n \geq 1$, $X_n$ has distribution $\mu_n$, $X$ has distribution $\mu$, and $X_n \to X$ w.p. 1. For a proof, see Billingsley (1968).

9.5 The method of moments and the moment problem

9.5.1 Convergence of moments

Let {Xn }n≥1 and X be random variables such that Xn converges to X in distribution. Suppose for some k > 0, $E|X_n|^k < \infty$ for each n ≥ 1. A natural question is: When does this imply $E|X|^k < \infty$ and $\lim_{n\to\infty} E|X_n|^k = E|X|^k$? By Skorohod’s theorem, one can assume w.l.o.g. that Xn → X w.p. 1. Then the results from Section 2.5 yield the following.

Theorem 9.5.1: Let {Xn }n≥1 and X be a collection of random variables such that Xn −→d X. Then, for each 0 < k < ∞, the following are equivalent:

(i) $E|X_n|^k < \infty$ for each n ≥ 1, $E|X|^k < \infty$ and $E|X_n|^k \to E|X|^k$.

(ii) $\{|X_n|^k\}_{n\ge 1}$ are uniformly integrable, i.e., for every ε > 0, there exists an M ∈ (0, ∞) such that
$$\sup_{n\ge 1} E\big(|X_n|^k I(|X_n| > M)\big) < \epsilon.$$

Remark 9.5.1: Recall that a sufficient condition for the uniform integrability of $\{|X_n|^k\}_{n\ge 1}$ is that
$$\sup_{n\ge 1} E|X_n|^{\ell} < \infty \quad\text{for some } \ell \in (k, \infty).$$

Example 9.5.1: Let Xn have the distribution P (Xn = 0) = 1 − 1/n, P (Xn = n) = 1/n for n = 1, 2, . . .. Then Xn −→d 0 but EXn = 1 does not go to 0. Note that {Xn }n≥1 is not uniformly integrable here.
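The quantities in Example 9.5.1 can be computed exactly. The following sketch uses exact rational arithmetic and hypothetical helper names (they are not from the text). It shows that P(Xn > ε) = 1/n tends to 0 while EXn stays at 1, and that the truncated means E(Xn I(Xn > M)) equal 1 for every n > M, so the supremum over n never becomes small.

```python
from fractions import Fraction

def mean(n: int) -> Fraction:
    """E[Xn] for P(Xn = n) = 1/n, P(Xn = 0) = 1 - 1/n."""
    return Fraction(n) * Fraction(1, n)

def prob_exceeds(n: int, eps: Fraction) -> Fraction:
    """P(Xn > eps) for eps > 0: only the atom at n can exceed eps."""
    return Fraction(1, n) if n > eps else Fraction(0)

def tail_mean(n: int, M: int) -> Fraction:
    """E[Xn * I(Xn > M)], the quantity in the uniform integrability condition."""
    return mean(n) if n > M else Fraction(0)

if __name__ == "__main__":
    print(mean(10**6))                                # stays equal to 1
    print(prob_exceeds(10**6, Fraction(1, 2)))        # 1/1000000
    print(max(tail_mean(n, 1000) for n in range(1, 2001)))
```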

Remark 9.5.2: In Theorem 9.5.1, under hypothesis (ii), it follows that
$$E|X_n|^{\ell} \to E|X|^{\ell} \quad\text{for all real numbers } \ell \in (0, k)$$
and
$$EX_n^p \to EX^p \quad\text{for all positive integers } p,\ 0 < p \le k.$$

9.5.2 The method of moments

Suppose that {Xn }n≥1 are random variables such that $\lim_{n\to\infty} EX_n^k = m_k < \infty$ exists for all integers k = 0, 1, 2, . . .. Does there exist a random variable X such that Xn −→d X? The answer is ‘yes’ provided that the moments {mk }k≥1 determine the distribution of the random variable X uniquely.

Theorem 9.5.2: (Fréchet-Shohat theorem). Let {Xn }n≥1 be a sequence of random variables such that for each k ∈ N, $\lim_{n\to\infty} EX_n^k \equiv m_k$ exists and is finite. If the sequence {mk }k≥1 uniquely determines the distribution of a random variable X, then Xn −→d X.

Proof: Suppose that for some subsequence {nj }j≥1 , the probability distributions $\{\mu_{n_j}\}_{j\ge 1}$ of $\{X_{n_j}\}_{j\ge 1}$ converge vaguely to some µ. Since $\{EX_{n_j}^2\}_{j\ge 1}$ is a bounded sequence, $\{\mu_{n_j}\}_{j\ge 1}$ is tight. Hence µ must be a probability distribution and by Theorem 9.5.1, the moments of µ must coincide with {mk }k≥1 . Since the sequence {mk }k≥1 determines the distribution uniquely, µ is unique and is the unique vague limit point of {µn }n≥1 and by Theorem 9.2.6, µn −→d µ. So if X is a random variable with distribution µ, then Xn −→d X. □

The above “method of moments” used to be a tool for proving convergence in distribution, e.g., for proving asymptotic normality of the Binomial (n, p) distribution. Since it requires existence of all moments, this method is too restrictive and is of limited use. However, the question of when the moments determine a distribution is an interesting one and is discussed next.
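As a small worked instance of the method of moments, consider the setting of Problem 9.26: Xn = Yn/n with Yn uniform on {1, ..., n}. The sketch below (the function name is hypothetical) computes EXn^k exactly and compares it with 1/(k + 1), the k-th moment of Uniform(0, 1). Since Uniform(0, 1) has bounded support, its moment sequence determines it, so convergence of all moments would give Xn −→d X by Theorem 9.5.2. The code checks only finitely many n and k and is not a proof.

```python
from fractions import Fraction

def moment(n: int, k: int) -> Fraction:
    """Exact k-th moment of Xn = Yn / n, Yn uniform on {1, ..., n}."""
    return sum(Fraction(j, n) ** k for j in range(1, n + 1)) / n

if __name__ == "__main__":
    for k in (1, 2, 3, 4):
        limit = Fraction(1, k + 1)  # k-th moment of Uniform(0, 1)
        gaps = [float(abs(moment(n, k) - limit)) for n in (10, 100, 1000)]
        print(k, gaps)
```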

9.5.3 The moment problem

Suppose {mk }k≥1 is a sequence of real numbers such that there is at least one probability measure µ on (R, B(R)) such that for all k ∈ N, $m_k = \int x^k \mu(dx)$. Does the sequence {mk }k≥1 determine µ uniquely? This is a part of the Hamburger moment problem, which includes seeking conditions under which a given sequence {mk }k≥1 is the moment sequence of a probability distribution. The answer to the uniqueness question posed above is ‘no,’ as the following example shows.

Example 9.5.2: Let Y be a standard normal random variable and let X = exp(Y ). Then X is said to have the log-normal distribution (which is a misnomer as a more appropriate name would be something like exponormal). Then X has the probability density function
$$f(x) = \begin{cases} \dfrac{1}{\sqrt{2\pi}}\,\dfrac{1}{x}\exp\big(-[\log x]^2/2\big), & x > 0 \\ 0 & \text{otherwise.} \end{cases} \tag{5.1}$$
Consider now the family of functions fα (x) = f (x)(1 + α sin(2π log x)) with |α| ≤ 1. It is clear that fα (x) ≥ 0. Further, it is not difficult to check that for any α ∈ [−1, 1], $\int x^r f_\alpha(x)dx = \int x^r f(x)dx$ for all r = 0, 1, 2, . . .. Thus, the sequence of moments $m_k \equiv \int x^k f(x)dx$ does not determine the log-normal distribution (5.1).

A sufficient condition for uniqueness is Carleman’s condition:
$$\sum_{k=1}^{\infty} m_{2k}^{-1/2k} = \infty. \tag{5.2}$$
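Both Example 9.5.2 and Carleman’s condition (5.2) can be probed numerically. The sketch below is an illustration under stated assumptions, with hypothetical function names. After the substitution y = log x, the perturbation term becomes ∫ e^{ry} (2π)^{−1/2} e^{−y²/2} sin(2πy) dy, which is approximated here by a trapezoid rule on a truncated range. The partial sums in (5.2) use the standard closed forms m_{2k} = (2k)!/(2^k k!) for the standard normal and m_{2k} = exp(2k²) for the log-normal. A finite partial sum cannot establish divergence; it only suggests the contrast between the two cases.

```python
import math

def perturbation_integral(r: int, half_width: float = 12.0, step: float = 1e-3) -> float:
    """Trapezoid approximation of the integral of x^r f(x) sin(2*pi*log x) over (0, inf),
    computed in the variable y = log x on [r - half_width, r + half_width]."""
    lo = r - half_width
    n = int(round(2 * half_width / step))
    total = 0.0
    for i in range(n + 1):
        y = lo + i * step
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * math.exp(r * y - 0.5 * y * y) * math.sin(2 * math.pi * y)
    return total * step / math.sqrt(2 * math.pi)

def lognormal_moment(r: int) -> float:
    """E[exp(rY)] for Y ~ N(0, 1)."""
    return math.exp(r * r / 2)

def log_m2k_normal(k: int) -> float:
    """log of the 2k-th moment of N(0, 1), i.e. log((2k)! / (2^k k!))."""
    return math.lgamma(2 * k + 1) - k * math.log(2) - math.lgamma(k + 1)

def log_m2k_lognormal(k: int) -> float:
    """log of the 2k-th moment of the log-normal distribution (5.1)."""
    return 2.0 * k * k

def carleman_partial_sum(log_m2k, K: int) -> float:
    """Partial sum over k = 1..K of m_{2k}^(-1/(2k))."""
    return sum(math.exp(-log_m2k(k) / (2 * k)) for k in range(1, K + 1))

if __name__ == "__main__":
    for r in range(5):
        print(r, perturbation_integral(r) / lognormal_moment(r))
    for K in (10, 100, 10000):
        print(K, carleman_partial_sum(log_m2k_normal, K),
              carleman_partial_sum(log_m2k_lognormal, K))
```

For the log-normal case each term equals e^{−k}, so the series converges to 1/(e − 1) and (5.2) fails, in line with the non-uniqueness shown in the example. The normal partial sums keep growing, roughly like a multiple of √K.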

For a proof, see Feller (1966) or Shohat and Tamarkin (1943).

Remark 9.5.3: A special case of the above is when
$$\limsup_{k\to\infty} m_{2k}^{1/2k} = r \in [0, \infty). \tag{5.3}$$
In particular, if {mk }k≥1 is a moment sequence, then within the class of probability distributions µ that have bounded support and have {mk }k≥1 as their moment sequence, µ is uniquely determined. This is so since if M ≡ sup{x : µ([−x, x]) < 1}, then (Problem 9.27)
$$m_{2k}^{1/2k} \to M \quad\text{as } k \to \infty. \tag{5.4}$$

More generally, if µ is a probability distribution on R such that $\int e^{tx}\,d\mu(x) < \infty$ for all |t| < δ for some δ > 0, then all its moments are finite and (5.2) holds and hence µ is uniquely determined by its moments (Problem 9.28). In particular, the normal and Gamma distributions are determined by their moments.

Remark 9.5.4: If {mk }k≥1 is a moment sequence of a distribution µ concentrated on [0, ∞), the problem of determining µ uniquely is known as the Stieltjes moment problem. If X is a random variable with distribution µ, let $Y = \delta\sqrt{X}$, where δ is independent of X and takes two values {−1, +1} with equal probability. Then Y has a symmetric distribution and for all k ≥ 1, $E|Y|^{2k} = E|X|^k$. The distribution of Y is uniquely determined (and hence that of $\sqrt{X}$ and hence that of X) if $\limsup\,(EY^{2k})^{1/2k}$ [...]

9.6 Problems

[...] 0, there exists M ∈ (0, ∞) such that F (−x) + 1 − F (x) < ε for all x > M , and (ii) F is uniformly continuous on [−M, M ].)

9.3 Prove parts (ii) and (iii) of Theorem 9.1.6.

9.4 Let {µn }n≥1 , µ be probability measures on (R, B(R)) such that µn −→v µ. Show that µn −→d µ.

9.5 Show that the function F˜ (·), defined in (2.5), is nondecreasing and right continuous and that the function F (x) ≡ inf{F (r) : r ≥ x, r ∈ D} is nondecreasing and left continuous.

9.6 Give another proof of the ‘if’ part of Theorem 9.2.6 by using Theorem 9.2.1 and showing that for any f : R → R continuous and bounded and any subsequence {ni }i≥1 , there exists a further subsequence {mj }j≥1 such that $a_{m_j} = \int f\,d\mu_{m_j} \to a = \int f\,d\mu$ and hence, $a_n \equiv \int f\,d\mu_n \to a$.

9.7 If {Xn }n≥1 is stochastically bounded and Yn −→p 0, then show that Xn Yn −→p 0.


9.8 (a) Let Xn ∼ N (an , bn ) for n ≥ 0, where bn > 0 for n ≥ 1, b0 ∈ [0, ∞) and an ∈ R for all n ≥ 0. (i) Show that if an → a0 , bn → b0 as n → ∞, then Xn −→d X0 . (ii) Show that if Xn −→d X0 as n → ∞, then an → a0 and bn → b0 . (Hint: First show that {bn }n≥1 is bounded and then that {an }n≥1 is bounded and ﬁnally, that a0 and b0 are the only limit points of {an }n≥1 and {bn }n≥1 , respectively.) (b) For n ≥ 1, let Xn ∼ N (an , Σn ) where an ∈ Rk and Σn is a k × k positive deﬁnite matrix, k ∈ N. Then, {Xn }n≥1 is stochastically bounded if and only if {an }n≥1 and {Σn }n≥1 are bounded. 9.9 Let {Xjn }n≥1 , j = 1, . . . , k, k ∈ N be sequences of random variables. Let Xn = (X1n , . . . , Xkn ), n ≥ 1. Show that the sequence of random vectors {Xn }n≥1 is tight in Rk iﬀ for each 1 ≤ j ≤ k, the sequence of random variables {Xjn }n≥1 is tight in R. 9.10 Let (S, d) be a metric space. (a) For any set A ⊂ S, let d(x, A) ≡ inf{d(x, y) : y ∈ A}. Show that for each A, d(·, A) is continuous on S. (b) Let fn (·) be as in (3.3). Show that fn (·) is continuous on S and fn (·) ↑ IG (·). (Hint: Note that d(x, Gc ) + d(x, Gn ) > 0 for all x in S. ) 9.11 For any cdf F , let F −1 (u) ≡ sup{t : F (t) < u}, 0 < u < 1. Show that for any 0 < u0 < 1 and t0 in R, F −1 (u0 ) ≤ t0 ⇔ F (t0 ) ≥ u0 . (Hint: For ⇒, use the right continuity of F and for ⇐, use the deﬁnition of sup.) 9.12 For a function f : Rk → R (k ∈ N), deﬁne Df = {x ∈ Rk : f is discontinuous at x}. Show that Df ∈ B(Rk ). 9.13 If Xn −→p X and f : R → R is continuous, then f (Xn ) −→p f (X). 9.14 (The Delta method). Let {Xn }n≥1 be a sequence of random variables and {an }n≥1 ⊂ (0, ∞) be a sequence of constants such that an → ∞ as n → ∞ and an (Xn − θ) −→d Z


for some random variable Z and for some θ ∈ R. Let H : R → R be a function that is differentiable at θ with derivative c. Show that
$$a_n\big(H(X_n) - H(\theta)\big) \longrightarrow^{d} cZ.$$
(Hint: By Taylor’s expansion, for any x ∈ R, H(x) = H(θ) + c(x − θ) + R(x)(x − θ) where R(x) → 0 as x → θ. Now use Problem 9.7 and Slutsky’s theorem.)

9.15 Let X be a random variable with P (X = c) > 0 for some c ∈ R. Give examples of two sequences {Xn }n≥1 and {Yn }n≥1 satisfying Xn −→d X and Yn −→d X such that
$$\lim_{n\to\infty} P(X_n \le c) = P(X \le c) \quad\text{but}\quad \lim_{n\to\infty} P(Y_n \le c) \ne P(X \le c).$$
(Hint: Take Xn =d X, n ≥ 1 and Yn =d X + 1/n, n ≥ 1, say.)

9.16 Let {µn }n≥1 , µ be probability measures on (R, B(R)) such that $\int f\,d\mu_n \to \int f\,d\mu$ for all f ∈ F for some collection F of functions from R to R specified below. Does µn −→d µ if
(a) F = {f | f : R → R, f is bounded and continuously differentiable on R with a bounded derivative}?
(b) F = {f | f : R → R, f is bounded and infinitely differentiable on R}?
(c) F ≡ {f | f is a polynomial with real coefficients} and $\int |x|^k \mu(dx) + \int |x|^k \mu_n(dx) < \infty$ for all n, k ∈ N?

9.17 For any two cdfs F , G on R, define
$$d_L(F, G) = \inf\{\epsilon > 0 : G(x - \epsilon) - \epsilon < F(x) < G(x + \epsilon) + \epsilon \ \text{for all } x \in R\}. \tag{6.1}$$
Verify that dL defines a metric on the collection of all probability distributions on (R, B(R)). The metric dL is called the Levy metric.

9.18 Let {µn }n≥1 , µ be probability measures on (R, B(R)), with the corresponding cdfs {Fn }n≥1 and F . Show that µn −→d µ iff dL (Fn , F ) → 0 as n → ∞.


9.19 (a) Show that for any two cdfs F , G on R,
$$d_L(F, G) \le d_K(F, G), \tag{6.2}$$
where
$$d_K(F, G) = \sup_{x\in R} |F(x) - G(x)| \tag{6.3}$$
(dK is called the Kolmogorov distance or metric between F and G).
(b) Give examples where (i) equality holds in (6.2), and (ii) strict inequality holds in (6.2).

9.20 Let {µn }n≥1 , µ be probability measures on (R, B(R)) such that µn −→d µ. Let {fa : a ∈ R} be a collection of bounded functions from R → R such that $\mu(D_{f_a}) = 0$ for all a ∈ R and |fa (x) − fb (x)| ≤ h(x)|b − a| for all a, b ∈ R and for some h : R → (0, ∞) with µ(Dh ) = 0 and $\int |h|\,d\mu < \infty$. Show that
$$\sup_{a\in R}\Big|\int f_a\,d\mu_n - \int f_a\,d\mu\Big| \to 0 \quad\text{as } n \to \infty.$$

9.21 Let {Xn }n≥1 , X be k-dimensional random vectors such that Xn −→d X. Let {An }n≥1 be a sequence of r × k matrices of real numbers and {bn }n≥1 ⊂ R^r , r ∈ N. Define $Y_n = A_n X_n + b_n$ and $Z_n = A_n X_n X_n^T$, where $X_n^T$ denotes the transpose of Xn . Suppose that An → A and bn → b. Show that (a) Yn −→d Y , where Y =d AX + b, (b) Zn −→d Z, where $Z =_d AXX^T$. (Note: Here convergence in distribution of a sequence of r × k matrix-valued random variables may be interpreted by considering the corresponding rk-dimensional random vectors obtained by concatenating the rows of the r × k matrix side-by-side and using the definition of convergence in distribution for random vectors.)

9.22 Let µn , µ be probability measures on a countable set D ≡ {aj }j≥1 ⊂ R. Let pnj = µn ({aj }), j ≥ 1, n ≥ 1 and pj = µ({aj }). Show that, as n → ∞, µn −→d µ iff for all j, pnj → pj iff $\sum_j |p_{nj} - p_j| \to 0$.

9.23 Let Xn ∼ Binomial(n, pn ), n ≥ 1. Suppose npn → λ, 0 < λ < ∞. Show that Xn −→d X, where X ∼ Poisson(λ).

9.24 (a) Let Xn ∼ Geo(pn ), i.e. $P(X_n = r) = q_n^{r-1} p_n$, r ≥ 1, where 0 < pn < 1 and qn = 1 − pn . Show that as n → ∞ if pn → 0 then
$$p_n X_n \longrightarrow^{d} X, \tag{6.4}$$
where X ∼ Exponential (1).


(b) Fix a positive integer k. Let for n ≥ 1,
$$p_{nr} = \binom{r-1}{k-1} p_n^{k} q_n^{r-k}, \quad r \ge k,$$
where 0 < pn < 1, qn = 1 − pn . (i) Verify that for each n, {pnr }r≥k is a probability distribution, i.e., $\sum_{r=k}^{\infty} p_{nr} = 1$. (ii) Let Yn be a random variable with distribution P (Yn = r) = pnr , r ≥ k. Show that as n → ∞ if pn → 0 then {pn Yn }n≥1 converges in distribution and identify the limit.

9.25 Let {Fn }n≥1 and {Gn }n≥1 be two sequences of cdfs on R such that, as n → ∞, Fn −→d F , Gn −→d G where F and G are cdfs on R. (a) Show that for each n ≥ 1,
$$H_n(x) \equiv (F_n * G_n)(x) \equiv \int_R F_n(x - y)\,dG_n(y)$$
is a cdf on R. (b) Show that, as n → ∞, Hn −→d H where H = F ∗ G, by direct calculation and by Skorohod’s theorem (i.e., Theorem 9.4.1) and Problem 7.14.

9.26 Let Yn have discrete uniform distribution on the integers {1, 2, . . . , n}. Let Xn ≡ Yn /n and let X be a Uniform (0,1) random variable. Show that Xn −→d X using three different methods as follows: (a) Helly-Bray theorem, (b) the method of moments, (c) using the cdfs.

9.27 Establish (5.4) in Remark 9.5.3. (Hint: Show that for any ε > 0, $m_{2k}^{1/2k} \ge (M - \epsilon)\big(\mu(\{x : |x| > M - \epsilon\})\big)^{1/2k}$.)

9.28 Let µ be a probability distribution on R such that $\phi(t_0) \equiv \int e^{t_0|x|}\,d\mu(x) < \infty$ for some t0 > 0. Show that Carleman’s condition (5.2) is satisfied. (Hint: Show that by Cramer’s inequality (Corollary 3.1.5)
$$m_{2k} \le 2k\,\phi(t_0)\int_0^{\infty} x^{2k-1} e^{-t_0 x}\,dx = \phi(t_0)\,2k\,t_0^{-2k}\,(2k - 1)!$$
and then use Stirling’s approximation: ‘$n! \sim \sqrt{2\pi}\,n^{n+1/2}e^{-n}$ as n → ∞’ (Feller (1968)).)


9.29 (Continuity theorem for mgfs). Let {Xn }n≥1 and X be random variables such that for some δ > 0, the mgfs $M_{X_n}(t) \equiv E(e^{tX_n})$ and $M_X(t) \equiv E(e^{tX})$ are finite for all |t| < δ. Further, let $M_{X_n}(t) \to M_X(t)$ for all |t| < δ. Show that Xn −→d X. (Hint: Show first that {Xn }n≥1 is tight and use the fact that by Remark 9.5.3, the distribution of X is determined by MX (·).)

9.30 Let Xn ∼ Binomial(n, pn ). Suppose npn → ∞. Let $Y_n = \dfrac{X_n - np_n}{\sqrt{np_n(1 - p_n)}}$, n ≥ 1. Show that Yn −→d Y , where Y ∼ N (0, 1). (Hint: Use Problem 9.29.)

9.31 Use the continuity theorem for mgfs to establish (6.4) and the convergence in distribution of {pn Yn }n≥1 in Problem 9.24 (b)(ii).

9.32 Let {Xj , Vj : j ≥ 1} be a collection of random variables on some probability space (Ω, F, P ) such that P (Vj ∈ N) = 1 for all j, Vj → ∞ w.p. 1 and Xj −→d X. Suppose that for each j, the random variable Vj is independent of the sequence {Xn }n≥1 . Show that $X_{V_j} \longrightarrow^{d} X$. (Hint: Verify that for any bounded continuous function h : R → R,
$$\big|Eh(X_{V_j}) - Eh(X)\big| \le 2\|h\|\,P(V_j \le N) + \Delta_N\,P(V_j > N),$$
where $\Delta_N = \sup_{k>N}\big|Eh(X_k) - Eh(X)\big|$ and $\|h\| = \sup\{|h(x)| : x \in R\}$.)

9.33 Let Xn −→d X and xn → x as n → ∞. If P (X = x) = 0, then show that P (Xn ≤ xn ) → P (X ≤ x).

9.34 (Weyl’s equi-distribution property). Let 0 < α < 1 be an irrational number. Let µn (·) be the measure defined by $\mu_n(A) \equiv \frac{1}{n}\sum_{j=0}^{n-1} I_A(j\alpha \bmod 1)$, A ∈ B([0, 1]). Show that µn −→d Uniform (0,1). (Hint: Verify that $\int f\,d\mu_n \to \int_0^1 f(x)\,dx$ for all f of the form $f(x) = e^{\iota 2\pi kx}$, k ∈ Z and then approximate a bounded continuous function f by trigonometric polynomials (cf. Section 5.6).)

9.35 (a) Let {Xi }i≥1 be iid random variables with Uniform (0,1) distribution. Let Mn = max1≤i≤n Xi . Show that n(1 − Mn ) −→d Exponential (1). (b) Let {Xi }i≥1 be iid random variables such that λ ≡ sup{x : P (X1 ≤ x) < 1} < ∞, P (X1 = λ) = 0, and $P(\lambda - x < X_1 < \lambda) \sim cx^{\alpha}L(x)$ as x ↓ 0 where α > 0, c > 0, and L(·) is slowly varying at 0, i.e., $\lim_{x\downarrow 0} L(cx)/L(x) = 1$ for all 0 < c < ∞. Let Mn = max1≤i≤n Xi . Show that $Y_n \equiv n^{1/\alpha}(\lambda - M_n)$ converges in distribution as n → ∞ and identify the limit.


9.36 Let {Xi }i≥1 be iid positive random variables such that P (X1 < x) ∼ cxα L(x) as x ↓ 0, where c, α and L(·) are as in Problem 9.35. Let X1n ≡ min1≤i≤n Xi . Find {an }n≥1 ⊂ R+ such that Zn ≡ an X1n converges in distribution to a nondegenerate limit and identify the distribution. Specialize this to the cases where X1 has a pdf fX (·) such that (a) limx↓0 fX (x) = fX (0+) exists and is positive, (b) X1 has a Beta (a, b) distribution.

10 Characteristic Functions

10.1 Definition and examples

Characteristic functions play an important role in studying (asymptotic) distributional properties of random variables, particularly for sums of independent random variables. The main uses of characteristic functions are (1) to characterize the probability distribution of a given random variable, and (2) to establish convergence in distribution of a sequence of random variables and to identify the limit distribution.

Definition 10.1.1: (i) The characteristic function of a random variable X is defined as
$$\phi_X(t) = E\exp(\iota tX), \quad t \in R, \tag{1.1}$$
where $\iota = \sqrt{-1}$.

(ii) The characteristic function of a probability measure µ on (R, B(R)) is defined as
$$\hat\mu(t) = \int \exp(\iota tx)\,\mu(dx), \quad t \in R. \tag{1.2}$$

(iii) Let F be a cdf on R. Then, the characteristic function of F is defined as $\hat\mu_F(\cdot)$, where µF is the Lebesgue-Stieltjes measure corresponding to F .


Note that the integrands in (1.1) and (1.2) are complex valued. Here and elsewhere, for any f1 , f2 ∈ L1 (Ω, F, µ), the integral of (f1 + ιf2 ) with respect to µ is defined as
$$\int (f_1 + \iota f_2)\,d\mu = \int f_1\,d\mu + \iota\int f_2\,d\mu. \tag{1.3}$$
Thus, the characteristic function of X is given by φX (t) = (E cos tX) + ι(E sin tX), t ∈ R. Since the functions cos tx and sin tx are bounded for every t ∈ R, φX (t) is well defined for all t ∈ R. Furthermore, φX (0) = 1 and for any t ∈ R,
$$|\phi_X(t)| = \big[(E\cos tX)^2 + (E\sin tX)^2\big]^{1/2} \le \big[E(\cos tX)^2 + E(\sin tX)^2\big]^{1/2} \le 1. \tag{1.4}$$

If equality holds in (1.4), i.e., if |φX (t0 )| = 1 for some t0 ≠ 0, then the random variable is necessarily discrete, as shown by the following proposition.

Proposition 10.1.1: Let X be a random variable with characteristic function φX (·). Then the following are equivalent:

(i) |φX (t0 )| = 1 for some t0 ≠ 0.

(ii) There exist a ∈ R, h ≠ 0 such that
$$P\big(X \in \{a + jh : j \in Z\}\big) = 1. \tag{1.5}$$

Proof: Suppose that (i) holds. Since |φX (t0 )| = 1, there exists a0 ∈ R such that $\phi_X(t_0) = e^{\iota a_0}$, i.e., $e^{-\iota a_0}\phi_X(t_0) = 1$. Let a = a0 /t0 . Since the characteristic function of (X − a) is given by $e^{-\iota at}\phi_X(t)$, it follows that E exp(ιt0 (X − a)) = 1. Equating the real parts, one gets
$$E\cos t_0(X - a) = 1. \tag{1.6}$$
Since | cos θ| ≤ 1 for all θ and cos θ = 1 if and only if θ = 2πn for some n ∈ Z, (1.6) implies that
$$P\big(t_0(X - a) \in \{2\pi j : j \in Z\}\big) = 1. \tag{1.7}$$
Therefore, (ii) holds with $h = 2\pi/|t_0|$ and with a = a0 /t0 . For the converse, note that with pj = P (X = a + jh), j ∈ Z,
$$\phi_X(t) = \sum_{j\in Z}\exp\big(\iota t(a + jh)\big)\,p_j, \quad t \in R,$$
and hence $\big|\phi_X(2\pi/h)\big| = 1$. □
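The converse direction of Proposition 10.1.1 is easy to check numerically. The sketch below uses an arbitrary lattice distribution chosen for illustration (the values of a, h and the probabilities are assumptions, as is the function name). It evaluates φX directly from the series and confirms that the modulus is 1 at t = 2π/h but strictly less than 1 at a generic t.

```python
import cmath
import math

def lattice_cf(t: float, a: float, h: float, probs: dict) -> complex:
    """phi_X(t) = sum_j exp(i t (a + j h)) p_j, where probs[j] = P(X = a + j h)."""
    return sum(p * cmath.exp(1j * t * (a + j * h)) for j, p in probs.items())

# Illustrative lattice distribution: support {a + j h : j in {-1, 0, 3}}.
A, H = 0.5, 2.0
PROBS = {-1: 0.2, 0: 0.5, 3: 0.3}

if __name__ == "__main__":
    print(abs(lattice_cf(2 * math.pi / H, A, H, PROBS)))  # modulus 1 up to rounding
    print(abs(lattice_cf(1.0, A, H, PROBS)))              # strictly below 1
```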

Definition 10.1.2: A random variable X satisfying (1.5) for some a ∈ R and h > 0 is called a lattice random variable. In this case, the distribution of X is also called lattice or arithmetic. If X is a nondegenerate lattice random variable, then the largest h > 0 for which (1.5) holds is called the span (of the probability distribution or of the characteristic function) of X.

An inspection of the proof of Proposition 10.1.1 shows that for a lattice random variable X with span h > 0, its characteristic function satisfies the relation
$$\big|\phi_X(2\pi j/h)\big| = 1 \quad\text{for all } j \in Z. \tag{1.8}$$
In particular, this implies that $\limsup_{|t|\to\infty}|\phi_X(t)| = 1$. The next result shows that characteristic functions of random variables with absolutely continuous cdfs exhibit a very different limit behavior.

Proposition 10.1.2: Let X be a random variable with cdf F and characteristic function φX . If F is absolutely continuous, then
$$\lim_{|t|\to\infty}|\phi_X(t)| = 0. \tag{1.9}$$

Proof: Since F is absolutely continuous, the probability distribution µX of X has a density, say f , w.r.t. the Lebesgue measure m on R, and
$$\phi_X(t) = \int \exp(\iota tx)f(x)\,dx, \quad t \in R.$$
Fix ε ∈ (0, ∞). Since f ∈ L1 (R, B(R), m), by Theorem 2.3.14, there exists a step function $f_\epsilon = \sum_{j=1}^{k} c_j I_{(a_j, b_j)}$ with 1 ≤ k < ∞ and aj , bj , cj ∈ R for j = 1, . . . , k, such that
$$\int |f - f_\epsilon|\,dm < \epsilon/2. \tag{1.10}$$
Next note that for any t ≠ 0,
$$\Big|\int \exp(\iota tx)f_\epsilon(x)\,dx\Big| = \Big|\sum_{j=1}^{k} c_j\int_{a_j}^{b_j}\exp(\iota tx)\,dx\Big| \le \sum_{j=1}^{k}|c_j|\,\frac{2}{|t|}. \tag{1.11}$$
Hence, by (1.10) and (1.11), it follows that
$$|\phi_X(t)| = \Big|\int \exp(\iota tx)f(x)\,dx\Big| \le \int |f - f_\epsilon|\,dx + \Big|\int \exp(\iota tx)f_\epsilon(x)\,dx\Big| < \epsilon$$
for all $|t| > t_\epsilon$, where $t_\epsilon = 4\sum_{j=1}^{k}|c_j|/\epsilon$. Thus (1.9) holds. □
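The decay in (1.9) and the bound (1.11) can be seen in the simplest case, where the density is itself a step function. The sketch below assumes X ∼ Uniform(0, 1), so f = I_(0,1) with k = 1 and c1 = 1, and uses the closed form φX(t) = (e^{ιt} − 1)/(ιt). The function name is hypothetical.

```python
import cmath

def uniform_cf(t: float) -> complex:
    """Characteristic function of Uniform(0, 1): (exp(i t) - 1) / (i t), with value 1 at t = 0."""
    if t == 0:
        return 1.0 + 0j
    return (cmath.exp(1j * t) - 1) / (1j * t)

if __name__ == "__main__":
    for t in (1.0, 10.0, 100.0, 1e3, 1e4):
        # Bound (1.11) with a single step of height c_1 = 1 gives |phi(t)| <= 2 / |t|.
        print(t, abs(uniform_cf(t)), 2 / t)
```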

Note that the above proof shows that for any f ∈ L1 (m), the Fourier transform $\hat f(t) \equiv \int e^{\iota tx}f(x)\,dx$, t ∈ R, satisfies $\lim_{|t|\to\infty}\hat f(t) = 0$. This is known as the Riemann-Lebesgue lemma (cf. Proposition 5.7.1). Next, some basic results on smoothness properties of the characteristic function are presented.

Proposition 10.1.3: Let X be a random variable with characteristic function φX (·). Then, φX (·) is uniformly continuous on R.

Proof: For t, h ∈ R,
$$\big|\phi_X(t + h) - \phi_X(t)\big| = \big|E\big[\exp(\iota(t + h)X) - \exp(\iota tX)\big]\big| = \big|E\exp(\iota tX)\cdot(e^{\iota hX} - 1)\big| \le E\big|e^{\iota hX} - 1\big| \equiv E\Delta(h), \text{ say,}$$
where ∆(h) ≡ | exp(ιhX) − 1|. It is easy to check that |∆(h)| ≤ 2 and $\lim_{h\to 0}\Delta(h) = 0$ w.p. 1 (in fact, everywhere). Hence, by the BCT, E∆(h) → 0 as h → 0. Therefore,
$$\lim_{h\to 0}\,\sup_{t\in R}\big|\phi_X(t + h) - \phi_X(t)\big| \le \lim_{h\to 0}E\Delta(h) = 0 \tag{1.12}$$
and hence, φX (·) is uniformly continuous on R. □

Theorem 10.1.4: Let X be a random variable with characteristic function φX (·). If $E|X|^r < \infty$ for some r ∈ N, then φX (·) is r-times continuously differentiable and
$$\phi_X^{(r)}(t) = E(\iota X)^r\exp(\iota tX), \quad t \in R. \tag{1.13}$$
For proving the theorem, the following bound on the function exp(ιx) is useful.

Lemma 10.1.5: For any x ∈ R, r ∈ N,
$$\Big|\exp(\iota x) - \sum_{k=0}^{r-1}\frac{(\iota x)^k}{k!}\Big| \le \min\Big\{\frac{|x|^r}{r!},\ \frac{2|x|^{r-1}}{(r - 1)!}\Big\}. \tag{1.14}$$
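Before turning to the proof, the bound (1.14) can be sanity-checked on a grid. The sketch below (function names are hypothetical) compares the Taylor remainder of exp(ιx) with the minimum of the two bounds for r = 1, ..., 6 and x in [−20, 20]. A grid check is of course no substitute for the proof that follows.

```python
import cmath
import math

def taylor_remainder(x: float, r: int) -> float:
    """|exp(i x) - sum_{k=0}^{r-1} (i x)^k / k!|."""
    partial = sum((1j * x) ** k / math.factorial(k) for k in range(r))
    return abs(cmath.exp(1j * x) - partial)

def lemma_bound(x: float, r: int) -> float:
    """Right side of (1.14): min(|x|^r / r!, 2 |x|^(r-1) / (r-1)!)."""
    first = abs(x) ** r / math.factorial(r)
    second = 2 * abs(x) ** (r - 1) / math.factorial(r - 1)
    return min(first, second)

if __name__ == "__main__":
    for x in (0.1, 1.0, 5.0, 20.0):
        print(x, [(taylor_remainder(x, r), lemma_bound(x, r)) for r in (1, 2, 3)])
```

In the printed values the first bound is the smaller one for small |x| and the second takes over for large |x|.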

Proof: Note that for any x ∈ R and for any r ∈ N,
$$\frac{d^r}{dx^r}[\exp(\iota x)] = \frac{d^r}{dx^r}\cos x + \iota\frac{d^r}{dx^r}\sin x = \iota^r\exp(\iota x). \tag{1.15}$$
Hence, by (1.15) and Taylor’s expansion (applied to the functions sin x and cos x of a real variable x), for any x ∈ R and r ∈ N,
$$\exp(\iota x) = \sum_{k=0}^{r-1}\frac{(\iota x)^k}{k!} + \frac{(\iota x)^r}{(r - 1)!}\int_0^1 (1 - u)^{r-1}\exp(\iota ux)\,du. \tag{1.16}$$
Hence, for any x ∈ R and any r ∈ N,
$$\Big|\exp(\iota x) - \sum_{k=0}^{r-1}\frac{(\iota x)^k}{k!}\Big| \le \frac{|x|^r}{r!}. \tag{1.17}$$
Also, for r ≥ 2, using (1.17) with r replaced by r − 1, one gets
$$\Big|\exp(\iota x) - \sum_{k=0}^{r-1}\frac{(\iota x)^k}{k!}\Big| \le \Big|\exp(\iota x) - \sum_{k=0}^{r-2}\frac{(\iota x)^k}{k!}\Big| + \frac{|x|^{r-1}}{(r - 1)!} \le \frac{2|x|^{r-1}}{(r - 1)!}. \tag{1.18}$$

Hence, by (1.17) and (1.18), (1.14) holds for all x ∈ R, r ∈ N with r ≥ 2. For r = 1, (1.14) follows from (1.17) and the bound ‘supx | exp(ιx) − 1| ≤ 2.’ □

Lemma 10.1.5 gives two upper bounds on the difference between the function exp(ιx) and its (r − 1)th order Taylor’s expansion around x = 0. For small values of |x|, the first bound, i.e., $|x|^r/r!$, is more accurate, whereas for large values of |x|, the other bound, i.e., $2|x|^{r-1}/(r - 1)!$, is more accurate.

Proof of Theorem 10.1.4: Let µ denote the probability distribution of X on (R, B(R)). Suppose that E|X| < ∞. First it will be shown that φX (·) is differentiable with $\phi_X^{(1)}(t) = E\{\iota X\exp(\iota tX)\}$, t ∈ R. Fix t ∈ R. For any h ∈ R, h ≠ 0,
$$h^{-1}[\phi_X(t + h) - \phi_X(t)] = \int_R \exp(\iota tx)\,\frac{\exp(\iota hx) - 1}{h}\,\mu(dx)$$
$$= \int_R \exp(\iota tx)\Big[\frac{\exp(\iota hx) - 1}{h} - \iota x\Big]\mu(dx) + \int_R \iota x\exp(\iota tx)\,\mu(dx)$$
$$\equiv \int_R \psi_h(x)\,\mu(dx) + \int_R \iota x\exp(\iota tx)\,\mu(dx), \text{ say.} \tag{1.19}$$
By Lemma 10.1.5 (with r = 2),
$$|\psi_h(x)| \le \min\Big\{\frac{|h||x|^2}{2},\ 2|x|\Big\} \quad\text{for all } x \in R,\ h \ne 0. \tag{1.20}$$
Hence, $\lim_{h\to 0}\psi_h(x) = 0$ for each x ∈ R. Also, |ψh (x)| ≤ 2|x| and $\int |x|\,\mu(dx) = E|X| < \infty$. Hence, by the DCT,
$$\lim_{h\to 0}\int \psi_h(x)\,\mu(dx) = 0$$
and therefore, from (1.19), it follows that φX (·) is differentiable at t with $\phi_X^{(1)}(t) = \int \iota x\exp(\iota tx)\,\mu(dx) = E\{\iota X\exp(\iota tX)\}$. Next suppose that the assertion of the theorem is true for some r ∈ N. To prove it for r + 1, note that for t ∈ R and h ≠ 0,
$$h^{-1}\big[\phi_X^{(r)}(t + h) - \phi_X^{(r)}(t)\big] - E(\iota X)^{r+1}\exp(\iota tX) = \int (\iota x)^r\psi_h(x)\,\mu(dx), \tag{1.21}$$

where ψh (x) is as in (1.19). Now using the bound (1.20) on ψh (x), the DCT, and the condition $E|X|^{r+1} < \infty$, one can show that the right side of (1.21) goes to zero as h → 0. By induction, this completes the proof of the theorem. □

Proposition 10.1.6: Let X and Y be two independent random variables. Then
$$\phi_{X+Y}(t) = \phi_X(t)\cdot\phi_Y(t), \quad t \in R. \tag{1.22}$$
Proof: Follows from (1.3), Proposition 7.1.3, and the independence of X and Y . □

For a complex number z = a + ιb, a, b ∈ R, let $\bar z = a - \iota b$ denote the complex conjugate of z and let
$$\mathrm{Re}(z) = a \quad\text{and}\quad \mathrm{Im}(z) = b \tag{1.23}$$
respectively denote the real and the imaginary parts of z.

Corollary 10.1.7: Let X be a random variable with characteristic function φX . Then, $\bar\phi_X$, $|\phi_X|^2$ and Re(φX ) are characteristic functions, where Re(φX )(t) = Re(φX (t)), t ∈ R.

Proof: $\bar\phi_X(t) = E\exp(-\iota tX) = E\exp(\iota t(-X))$, t ∈ R ⇒ $\bar\phi_X$ is the characteristic function of −X. Next, let Y be an independent copy of X. Then, by (1.22), $\phi_{X-Y}(t) = |\phi_X(t)|^2$, t ∈ R. Finally, $\mathrm{Re}(\phi_X)(t) = \frac{1}{2}\big(\phi_X(t) + \bar\phi_X(t)\big) = \int \exp(\iota tx)\,\mu(dx)$, t ∈ R, where $\mu(A) = 2^{-1}\big[P(X \in A) + P(-X \in A)\big]$, A ∈ B(R). □

Definition 10.1.3: A function φ : R → C, the set of complex numbers, is said to be nonnegative definite if for any k ∈ N, t1 , t2 , . . . , tk ∈ R, α1 , α2 , . . . , αk ∈ C,
$$\sum_{i=1}^{k}\sum_{j=1}^{k}\alpha_i\bar\alpha_j\,\phi(t_i - t_j) \ge 0. \tag{1.24}$$

Proposition 10.1.7: Let φ(·) be the characteristic function of a random variable X. Then φ is nonnegative definite.

Proof: Check that for k, {ti }, {αi } as in Definition 10.1.3,
$$\sum_{i=1}^{k}\sum_{j=1}^{k}\alpha_i\bar\alpha_j\,\phi(t_i - t_j) = E\Big|\sum_{j=1}^{k}\alpha_j e^{\iota t_jX}\Big|^2.$$
□

A converse to the above is known as the Bochner-Khinchine theorem, which states that if φ : R → C is nonnegative definite, continuous, and φ(0) = 1, then φ is the characteristic function of a random variable X. For a proof, see Chung (1974). Another criterion for a function φ : R → C to be a characteristic function is due to Pólya. For a proof, see Chung (1974).

Proposition 10.1.8: (Pólya’s criterion). Let φ : R → C satisfy φ(0) = 1, φ(t) ≥ 0, φ(t) = φ(−t) for all t ∈ R and φ(·) is nonincreasing and convex on [0, ∞). Then φ is a characteristic function.
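The function φ(t) = exp(−|t|) satisfies the hypotheses of Pólya’s criterion (φ(0) = 1, nonnegative, even, nonincreasing and convex on [0, ∞)), so by Proposition 10.1.7 the quadratic form in (1.24) should never be negative. The sketch below spot-checks this with randomly drawn points ti and complex weights αi. The sampling ranges, seed, and function names are arbitrary illustrative choices, and a random search can only fail to find a counterexample; it does not prove nonnegative definiteness.

```python
import math
import random

def quadratic_form(phi, ts, alphas) -> complex:
    """sum_i sum_j alpha_i * conj(alpha_j) * phi(t_i - t_j), as in (1.24)."""
    total = 0j
    for t_i, a_i in zip(ts, alphas):
        for t_j, a_j in zip(ts, alphas):
            total += a_i * a_j.conjugate() * phi(t_i - t_j)
    return total

def polya_phi(t: float) -> float:
    """exp(-|t|): equals 1 at 0, is even, nonincreasing and convex on [0, inf)."""
    return math.exp(-abs(t))

def smallest_form(trials: int = 200, k: int = 5, seed: int = 0) -> float:
    """Smallest real part of the quadratic form over random draws of (t, alpha)."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        ts = [rng.uniform(-5, 5) for _ in range(k)]
        alphas = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(k)]
        worst = min(worst, quadratic_form(polya_phi, ts, alphas).real)
    return worst

if __name__ == "__main__":
    print(smallest_form())
```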

10.2 Inversion formulas

Let F be a cdf and φ be its characteristic function. In this section, two inversion formulas to get the cdf F from φ are presented. The first one is from Feller (1966), and the second one is more standard. Unless otherwise mentioned, for the rest of this section, X will be a random variable with cdf F and characteristic function φX and N a standard normal random variable independent of X.

Lemma 10.2.1: Let g : R → R be a Borel measurable bounded function vanishing outside a bounded set and let ε ∈ (0, ∞). Then
$$Eg(X + \epsilon N) = \frac{1}{2\pi}\int\!\!\int g(x)\,\phi_X(t)\,e^{-\iota tx}e^{-\epsilon^2t^2/2}\,dt\,dx. \tag{2.1}$$

Proof: The integrand on the right is bounded by $e^{-\epsilon^2t^2/2}|g(x)|$ and so is integrable on R × R with respect to the Lebesgue measure on (R², B(R²)). Further, $\phi_X(t) = \int e^{\iota ty}\,dF(y)$ and $e^{-t^2/2} = \frac{1}{\sqrt{2\pi}}\int e^{\iota tx}e^{-x^2/2}\,dx$, t ∈ R. By repeated applications of Fubini’s theorem and the above two identities, the right side of (2.1) becomes
$$\int g(x)\,\frac{1}{\sqrt{2\pi}}\int\Big[\frac{1}{\sqrt{2\pi}}\int e^{\iota t(y-x)}e^{-\epsilon^2t^2/2}\,dt\Big]dF(y)\,dx$$
[set s = εt]
$$= \int g(x)\,\frac{1}{\sqrt{2\pi}\,\epsilon}\int\Big[\frac{1}{\sqrt{2\pi}}\int e^{\iota s(y-x)/\epsilon}e^{-s^2/2}\,ds\Big]dF(y)\,dx$$
$$= \int g(x)\Big[\frac{1}{\sqrt{2\pi}\,\epsilon}\int e^{-\frac{(y-x)^2}{2\epsilon^2}}\,dF(y)\Big]dx.$$
Since X and N are independent and N has an absolutely continuous distribution w.r.t. the Lebesgue measure, X + εN also has an absolutely continuous distribution with density
$$f_{X+\epsilon N}(x) = \frac{1}{\sqrt{2\pi}\,\epsilon}\int e^{-\frac{(y-x)^2}{2\epsilon^2}}\,dF(y), \quad x \in R.$$
Thus, the right side of (2.1) reduces to $\int g(x)f_{X+\epsilon N}(x)\,dx = Eg(X + \epsilon N)$. □

Corollary 10.2.2: Let g : R → R be continuous and let g(x) = 0 for all |x| > K, for some K, 0 < K < ∞. Then
$$Eg(X) = \int g(x)\,dF(x) = \lim_{\epsilon\to 0+}\frac{1}{2\pi}\int\!\!\int g(x)\,e^{-\iota tx}\phi_X(t)\,e^{-\epsilon^2t^2/2}\,dt\,dx. \tag{2.2}$$
Proof: This follows from Lemma 10.2.1, the fact that X + εN → X w.p. 1 as ε → 0, and the BCT. □

Corollary 10.2.3: (Feller’s inversion formula). Let a and b, −∞ < a < b < ∞, be two continuity points of F . Then
$$F(b) - F(a) = \lim_{\epsilon\to 0+}\int_a^b\frac{1}{2\pi}\Big[\int e^{-\iota tx}\phi_X(t)\,e^{-\epsilon^2t^2/2}\,dt\Big]dx. \tag{2.3}$$
Proof: This follows from Lemma 10.2.1 and Theorem 9.4.2, since the function g(x) = 1 for a ≤ x ≤ b and 0 otherwise is continuous except at a and b, which are continuity points of F . □

Corollary 10.2.4: If φX (t) is integrable w.r.t. the Lebesgue measure m on R, then F is absolutely continuous with density w.r.t. m, given by
$$f(x) = \frac{1}{2\pi}\int e^{-\iota tx}\phi_X(t)\,dt, \quad x \in R. \tag{2.4}$$
Proof: If φX is integrable, then
$$\frac{1}{2\pi}\int \phi_X(t)\,e^{-\iota tx}e^{-\epsilon^2t^2/2}\,dt$$
is bounded by $(2\pi)^{-1}\int |\phi_X(t)|\,dt$ for all x ∈ R, and it converges to $(2\pi)^{-1}\int e^{-\iota tx}\phi_X(t)\,dt$ as ε → 0+ for each x ∈ R. Hence, by the BCT and Corollary 10.2.3, for any a, b, −∞ < a < b < ∞, that are continuity points of F ,
$$F(b) - F(a) = \int_a^b\Big[\frac{1}{2\pi}\int \phi_X(t)\,e^{-\iota tx}\,dt\Big]dx.$$
Since F has at most countably many discontinuity points and F is right continuous, the above relation holds for all −∞ < a < b < ∞. □

Remark 10.2.1: The integrability of φX in Corollary 10.2.4 is only a sufficient condition. The standard exponential distribution has characteristic function $(1 - \iota t)^{-1}$, which is not integrable, but the distribution is absolutely continuous.

Corollary 10.2.5: (Uniqueness). The characteristic function φX determines F uniquely.

Proof: Since a cdf F is uniquely determined by its values on the set of its continuity points, this corollary follows from Corollary 10.2.3. □

A more standard inversion formula is the following.

Theorem 10.2.6: Let F be a cdf on R and $\phi(t) \equiv \int e^{\iota tx}\,dF(x)$, t ∈ R, be its characteristic function.

(i) For any a < b, a, b ∈ R, that are continuity points of F ,
$$\lim_{T\to\infty}\frac{1}{2\pi}\int_{-T}^{T}\frac{e^{-\iota ta} - e^{-\iota tb}}{\iota t}\,\phi(t)\,dt = \mu_F((a, b)), \tag{2.5}$$
where µF is the Lebesgue-Stieltjes measure generated by F .

(ii) For any a ∈ R,
$$\mu_F(\{a\}) = \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T}e^{-\iota ta}\phi(t)\,dt. \tag{2.6}$$

A multivariate extension of part (i) and its proof are given in Section 10.4. See also Problem 10.4. For a proof of part (ii), see Problem 10.5 or see Chung (1974) or Durrett (2004).

Remark 10.2.2: (Inversion formula for integer valued random variables). If X is integer valued with pk = P (X = k), k ∈ Z, then its characteristic function is the Fourier series
$$\phi(t) = \sum_{k\in Z}p_k e^{\iota tk}, \quad t \in R. \tag{2.7}$$
Since $\int_{-\pi}^{\pi}e^{\iota tj}\,dt = 2\pi$ if j = 0 and = 0 otherwise, multiplying both sides of (2.7) by $e^{-\iota tk}$ and integrating over t ∈ (−π, π) and using DCT, one gets
$$p_k = \frac{1}{2\pi}\int_{-\pi}^{\pi}\phi(t)\,e^{-\iota tk}\,dt, \quad k \in Z. \tag{2.8}$$
As a corollary to part (ii) of Theorem 10.2.6, one can deduce a criterion for a distribution to be continuous. Let µ be a probability distribution and let {pj } be its atoms, if any. Let $\alpha = \sum_j p_j^2$. Let X and Y be two independent random variables with distribution µ and characteristic function φ(·). Then Z = X − Y has characteristic function $|\phi(\cdot)|^2$ and by Theorem 10.2.6, part (ii),
$$P(Z = 0) = \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T}|\phi(t)|^2\,dt.$$
But P (Z = 0) = α. Hence, it follows that
$$\sum_{j\in Z}p_j^2 = \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T}|\phi(t)|^2\,dt. \tag{2.9}$$

Corollary 10.2.7: A distribution is continuous iff
$$\lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T}|\phi(t)|^2\,dt = 0. \tag{2.10}$$
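Formulas (2.8) and (2.9) can be checked on a concrete integer valued distribution. The sketch below assumes X ∼ Binomial(5, 0.3) purely for illustration, with hypothetical function names. The integrals over (−π, π) are replaced by equispaced Riemann sums, which are exact up to rounding for trigonometric polynomials of degree below the grid size. For an integer valued X the function |φ|² is 2π-periodic, so the limit in (2.9) equals its average over one period, which is what the second function computes.

```python
import cmath
import math

def binomial_pmf(n: int, p: float) -> list:
    """P(X = k), k = 0..n, for X ~ Binomial(n, p)."""
    return [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

def cf(t: float, pmf: list) -> complex:
    """Fourier series (2.7): phi(t) = sum_k p_k exp(i t k)."""
    return sum(pk * cmath.exp(1j * t * k) for k, pk in enumerate(pmf))

def invert(k: int, pmf: list, grid: int = 512) -> float:
    """Riemann-sum version of (2.8): p_k = (1/2pi) * integral of phi(t) exp(-i t k) over (-pi, pi)."""
    h = 2 * math.pi / grid
    total = sum(
        cf(-math.pi + j * h, pmf) * cmath.exp(-1j * (-math.pi + j * h) * k)
        for j in range(grid)
    )
    return (total * h / (2 * math.pi)).real

def mean_square_modulus(pmf: list, grid: int = 512) -> float:
    """Average of |phi(t)|^2 over one period, the limit appearing in (2.9)."""
    h = 2 * math.pi / grid
    return sum(abs(cf(-math.pi + j * h, pmf)) ** 2 for j in range(grid)) * h / (2 * math.pi)

if __name__ == "__main__":
    pmf = binomial_pmf(5, 0.3)
    print([round(invert(k, pmf), 6) for k in range(6)])
    print(mean_square_modulus(pmf), sum(p * p for p in pmf))
```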

Some consequences of the uniqueness result (cf. Corollary 10.2.5) are the following.

Corollary 10.2.8: For a random variable X, X and −X have the same distribution iff the characteristic function φX (t) of X is real valued for all t ∈ R.

Proof: If φX (t) is real, then
$$\phi_X(t) = \int (\cos tx)\,dF(x) \quad\text{for all } t \in R,$$
where F is the cdf of X. So
$$\phi_X(t) = \phi_X(-t) = E(e^{-\iota tX}) = E(e^{\iota t(-X)}). \tag{2.11}$$
Since the characteristic function of −X coincides with φX (t), the ‘if part’ follows. To prove the ‘only if’ part, suppose that X and −X have the same distribution. Then as in (2.11),
$$\phi_X(t) = \phi_{-X}(t) = \phi_X(-t) = \overline{\phi_X(t)},$$
where for any complex number z = a + ιb, a, b ∈ R, $\bar z \equiv a - \iota b$ denotes its conjugate. Hence, φX (t) is real for all t ∈ R. □

Example 10.2.1: The standard Cauchy distribution has density
$$f(x) = \frac{1}{\pi}\,\frac{1}{1 + x^2}, \quad -\infty < x < \infty. \tag{2.12}$$
Its characteristic function is given by
$$\phi(t) = \frac{1}{\pi}\int \frac{e^{\iota tx}}{1 + x^2}\,dx = e^{-|t|}, \quad t \in R. \tag{2.13}$$
To see this, let X1 and X2 be two independent copies of the standard exponential distribution. Since $\phi_{X_1}(t) = (1 - \iota t)^{-1}$, t ∈ R, Y ≡ X1 − X2 has characteristic function
$$\phi_Y(t) = |\phi_{X_1}(t)|^2 = (1 + t^2)^{-1}, \quad t \in R.$$
Since φY is integrable, the density of Y is
$$f_Y(y) = \frac{1}{2\pi}\int \frac{1}{1 + u^2}\,e^{-\iota uy}\,du, \quad y \in R.$$
But by the convolution formula,
$$f_Y(y) = \int_{x > \max(0, -y)} e^{-x}e^{-(y+x)}\,dx = \int_0^{\infty}e^{-x}e^{-(y+x)}\,1_{(0,\infty)}(y + x)\,dx = \frac{1}{2}e^{-|y|}, \quad y \in R.$$
So
$$\frac{1}{\pi}\int \frac{e^{\iota uy}}{1 + u^2}\,du = e^{-|y|}, \quad y \in R,$$
proving (2.13).
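The argument in Example 10.2.1 rests on the pair fY(y) = ½ e^{−|y|} and φY(t) = (1 + t²)^{−1}. That pairing is easy to confirm numerically, because the density decays exponentially: by symmetry, φY(t) = ∫₀^∞ cos(ty) e^{−y} dy. The sketch below (function name, truncation point, and step size are illustrative choices) approximates this integral with a trapezoid rule.

```python
import math

def laplace_cf(t: float, upper: float = 40.0, step: float = 1e-3) -> float:
    """Trapezoid approximation of the integral of cos(t y) exp(-y) over [0, upper],
    which equals the characteristic function of the density exp(-|y|) / 2
    up to a truncation error of order exp(-upper)."""
    n = int(round(upper / step))
    total = 0.0
    for i in range(n + 1):
        y = i * step
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * math.cos(t * y) * math.exp(-y)
    return total * step

if __name__ == "__main__":
    for t in (0.0, 0.5, 1.0, 2.0, 5.0):
        print(t, laplace_cf(t), 1 / (1 + t * t))
```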

10.3 Levy-Cramer continuity theorem

Characteristic functions are very useful in determining distributions, moments, and establishing various identities involving distributions. But by far their most important use is in establishing convergence in distribution. This is the content of a continuity theorem established by Paul Levy and H. Cramer. It says that the map ψ taking a distribution F to its characteristic function φ is a homeomorphism. That is, if Fn −→d F , then φn → φ and conversely. Here, the notion of convergence of φn to φ is that of uniform convergence on bounded intervals. The following result deals with the ‘if’ part.

Theorem 10.3.1: Let Fn , n ≥ 1 and F be cdfs with characteristic functions φn , n ≥ 1 and φ, respectively. Let Fn −→d F . Then, for each 0 < K < ∞,
$$\sup_{|t|\le K}|\phi_n(t) - \phi(t)| \to 0 \quad\text{as } n \to \infty.$$
That is, φn converges to φ uniformly on bounded intervals.

Proof: By Skorohod’s theorem, there exist random variables Xn , X defined on the Lebesgue space ([0, 1], B([0, 1]), m), where m(·) is the Lebesgue measure, such that Xn ∼ Fn , X ∼ F and Xn → X w.p. 1. Now, for any t ∈ R and ε > 0,
$$|\phi_n(t) - \phi(t)| = \big|E\big(e^{\iota tX_n} - e^{\iota tX}\big)\big| \le E\big|1 - e^{\iota t(X - X_n)}\big| \le E\big|1 - e^{\iota t(X - X_n)}\big|\,1(|X - X_n| \le \epsilon) + 2P(|X_n - X| > \epsilon),$$
since $|1 - e^{\iota\theta}| \le 2$ for all real θ. Hence,
$$\sup_{|t|\le K}|\phi_n(t) - \phi(t)| \le \sup_{|u|\le K\epsilon}|1 - e^{\iota u}| + 2P(|X_n - X| > \epsilon).$$
Given K and δ > 0, choose ε ∈ (0, ∞) small such that
$$\sup_{|u|\le K\epsilon}|1 - e^{\iota u}| < \delta.$$
Since for all ε > 0, P (|Xn − X| > ε) → 0 as n → ∞, it follows that
$$\lim_{n\to\infty}\,\sup_{|t|\le K}|\phi_n(t) - \phi(t)| = 0.$$
□

The Levy-Cramer theorem is a converse to the above theorem. That is, if φn → φ uniformly on bounded intervals, then Fn −→d F . Actually, it is a stronger result than this converse. It says that it is enough to know that φn converges pointwise to a limit φ that is continuous at 0. Then φ is the characteristic function of some distribution F and Fn −→d F . The key to this is that under the given hypotheses, the family {Fn }n≥1 is tight.


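The next result relates the tail behavior of a probability measure to the behavior of its characteristic function near the origin, which in turn will be used to establish the tightness of $\{F_n\}_{n \ge 1}$.

Lemma 10.3.2: Let $\mu$ be a probability measure on $\mathbb{R}$ with characteristic function $\phi$. Then, for each $\delta > 0$,
\[
\mu\big(\{x : |x|\delta \ge 2\}\big) \le \frac{1}{\delta}\int_{-\delta}^{\delta} (1 - \phi(u))\,du.
\]

Proof: Fix $\delta \in (0,\infty)$. Then, using Fubini's theorem and the fact that $1 - \frac{\sin x}{x} \ge 0$ for all $x$, one gets
\begin{align*}
\int_{-\delta}^{\delta} (1 - \phi(u))\,du
&= \int \Big(\int_{-\delta}^{\delta} (1 - e^{\iota ux})\,du\Big)\mu(dx) \\
&= \int \Big(2\delta - \frac{2\sin \delta x}{x}\Big)\mu(dx) \\
&= 2\delta \int \Big(1 - \frac{\sin \delta x}{x\delta}\Big)\mu(dx) \\
&\ge 2\delta \int_{\{x : |x\delta| \ge 2\}} \Big(1 - \frac{1}{|x\delta|}\Big)\mu(dx) \\
&\ge \delta\,\mu\big(\{x : |x|\delta \ge 2\}\big). \qquad \Box
\end{align*}

Lemma 10.3.3: Let $\{\mu_n\}_{n \ge 1}$ be a sequence of probability measures with characteristic functions $\{\phi_n\}_{n \ge 1}$. Let $\lim_{n \to \infty} \phi_n(t) \equiv \phi(t)$ exist for $|t| \le \delta_0$ for some $\delta_0 > 0$. Let $\phi(\cdot)$ be continuous at 0. Then $\{\mu_n\}_{n \ge 1}$ is tight.

Proof: For any $0 < \delta < \delta_0$, by the BCT,
\[
\frac{1}{\delta}\int_{-\delta}^{\delta} (1 - \phi_n(t))\,dt \to \frac{1}{\delta}\int_{-\delta}^{\delta} [1 - \phi(t)]\,dt.
\]
Also, by continuity of $\phi$ at 0,
\[
\frac{1}{\delta}\int_{-\delta}^{\delta} [1 - \phi(t)]\,dt \to 0 \quad \text{as } \delta \to 0.
\]
Thus, given $\epsilon > 0$, there exist a $\delta_\epsilon \in (0, \delta_0)$ and an $M_\epsilon \in (0,\infty)$ such that for all $n \ge M_\epsilon$,
\[
\frac{1}{\delta_\epsilon}\int_{-\delta_\epsilon}^{\delta_\epsilon} (1 - \phi_n(t))\,dt < \epsilon.
\]
By Lemma 10.3.2, this implies that for all $n \ge M_\epsilon$,
\[
\mu_n\Big(\Big\{x : |x| \ge \frac{2}{\delta_\epsilon}\Big\}\Big) < \epsilon.
\]
Now choose $K_\epsilon > \frac{2}{\delta_\epsilon}$ such that
\[
\mu_j\big(\{x : |x| \ge K_\epsilon\}\big) < \epsilon \quad \text{for } 1 \le j \le M_\epsilon.
\]
Then,
\[
\sup_{n \ge 1} \mu_n\big(\{x : |x| \ge K_\epsilon\}\big) < \epsilon
\]
and hence, $\{\mu_n\}_{n \ge 1}$ is tight. $\Box$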
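As an illustration of Lemma 10.3.2, the inequality can be evaluated in closed form when $\mu$ is the standard normal distribution, whose characteristic function is $e^{-u^2/2}$. This is only a sketch for one particular $\mu$; the function names `tail_mass_normal` and `cf_bound_normal` are hypothetical labels introduced here.

```python
import math

def tail_mass_normal(delta):
    """Left side of Lemma 10.3.2 for mu = N(0, 1): mu{x : |x| * delta >= 2}.

    Uses P(|Z| >= c) = erfc(c / sqrt(2)) with c = 2 / delta.
    """
    return math.erfc(2.0 / (delta * math.sqrt(2.0)))

def cf_bound_normal(delta):
    """Right side: (1/delta) * int_{-delta}^{delta} (1 - exp(-u^2/2)) du.

    Uses int_{-delta}^{delta} exp(-u^2/2) du = sqrt(2*pi) * erf(delta/sqrt(2)).
    """
    return 2.0 - math.sqrt(2.0 * math.pi) * math.erf(delta / math.sqrt(2.0)) / delta

for delta in (0.1, 0.5, 1.0, 2.0, 5.0, 10.0):
    print(delta, tail_mass_normal(delta), cf_bound_normal(delta))
```

In each row the first value should not exceed the second, and both tend to 0 as $\delta \to 0$, which is the mechanism behind the tightness argument of Lemma 10.3.3.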

Theorem 10.3.4: (Levy-Cramer continuity theorem). Let $\{\mu_n\}_{n \ge 1}$ be a sequence of probability measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ with characteristic functions $\{\phi_n\}_{n \ge 1}$. Let $\lim_{n \to \infty} \phi_n(t) \equiv \phi(t)$ exist for all $t \in \mathbb{R}$ and let $\phi$ be continuous at 0. Then $\phi$ is the characteristic function of a probability measure $\mu$ and $\mu_n \xrightarrow{d} \mu$.

Proof: By Lemma 10.3.3, $\{\mu_n\}_{n \ge 1}$ is tight. Let $\{\mu_{n_j}\}_{j \ge 1}$ be any subsequence of $\{\mu_n\}_{n \ge 1}$ that converges vaguely to a limit $\mu$. By tightness, $\mu$ is a probability measure and by Theorem 10.3.1, $\lim_{j \to \infty} \phi_{n_j}(t)$ is the characteristic function of $\mu$. That is, $\phi$ is the characteristic function of $\mu$. Since $\phi$ determines $\mu$ uniquely, all vague limit points of $\{\mu_n\}_{n \ge 1}$ coincide with $\mu$ and hence by Theorem 9.2.6, $\mu_n \xrightarrow{d} \mu$. $\Box$

This theorem will be used extensively in the next chapter on central limit theorems. For the moment, some easy applications are given.

Example 10.3.1: (Convergence of Binomial to Poisson). Let $\{X_n\}_{n \ge 1}$ be a sequence of random variables such that $X_n \sim \text{Binomial}(N_n, p_n)$ for all $n \ge 1$. Suppose that as $n \to \infty$, $N_n \to \infty$, $p_n \to 0$ and $N_n p_n \to \lambda$, $\lambda \in (0,\infty)$. Then
\[
X_n \xrightarrow{d} X \quad \text{where } X \sim \text{Poisson}(\lambda). \tag{3.1}
\]
To prove (3.1), note that the characteristic function $\phi_n$ of $X_n$ is
\begin{align*}
\phi_n(t) &= \big(p_n e^{\iota t} + 1 - p_n\big)^{N_n} \\
&= \big(1 + p_n(e^{\iota t} - 1)\big)^{N_n} \\
&= \Big(1 + \frac{N_n p_n}{N_n}(e^{\iota t} - 1)\Big)^{N_n}, \quad t \in \mathbb{R}.
\end{align*}
Next recall the fact that if $\{z_n\}_{n \ge 1}$ is a sequence of complex numbers such that $\lim_{n \to \infty} z_n = z$ exists, then
\[
(1 + n^{-1} z_n)^n \to e^{z} \quad \text{as } n \to \infty. \tag{3.2}
\]
So $\phi_n(t) \to e^{\lambda(e^{\iota t} - 1)}$ for all $t \in \mathbb{R}$. Since $\phi(t) \equiv e^{\lambda(e^{\iota t} - 1)}$, $t \in \mathbb{R}$ is the characteristic function of a Poisson($\lambda$) random variable, (3.1) follows.
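The convergence of characteristic functions in Example 10.3.1 can be observed numerically. The sketch below takes the special case $p_n = \lambda/N_n$ (an assumption made here for concreteness) and reports the largest gap between the Binomial and Poisson characteristic functions on a grid of $t$ values; the names `binom_cf`, `poisson_cf`, and `max_gap` are illustrative.

```python
import cmath

def binom_cf(t, N, p):
    """Characteristic function of Binomial(N, p): (p e^{it} + 1 - p)^N."""
    return (p * cmath.exp(1j * t) + 1 - p) ** N

def poisson_cf(t, lam):
    """Characteristic function of Poisson(lam): exp(lam (e^{it} - 1))."""
    return cmath.exp(lam * (cmath.exp(1j * t) - 1))

def max_gap(N, lam, half_width=5.0, points=101):
    """Largest |binom_cf - poisson_cf| over an evenly spaced grid in [-half_width, half_width]."""
    step = 2 * half_width / (points - 1)
    grid = [-half_width + k * step for k in range(points)]
    return max(abs(binom_cf(t, N, lam / N) - poisson_cf(t, lam)) for t in grid)

for N in (10, 100, 10_000):
    print(N, max_gap(N, 2.5))
```

The printed gaps should shrink as $N$ grows, consistent with uniform convergence on bounded intervals (Theorem 10.3.1).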

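A direct proof of (3.1) consists of showing that for each $j = 0, 1, 2, \ldots$,
\[
P(X_n = j) \equiv \binom{N_n}{j} p_n^{j} (1 - p_n)^{N_n - j} \to P(X = j) = \frac{e^{-\lambda}\lambda^{j}}{j!}.
\]

Example 10.3.2: (Convergence of Binomial to Normal). Let $X_n \sim \text{Binomial}(N_n, p_n)$ for all $n \ge 1$. Suppose that as $n \to \infty$, $N_n \to \infty$ and $s_n^2 \equiv N_n p_n (1 - p_n) \to \infty$. Then
\[
Z_n \equiv \frac{X_n - N_n p_n}{s_n} \xrightarrow{d} N(0,1). \tag{3.3}
\]
To prove (3.3), note that the characteristic function $\phi_n$ of $Z_n$ is
\begin{align*}
\phi_n(t) &= \Big[p_n \exp\big(\iota t(1 - p_n)/s_n\big) + (1 - p_n)\exp\big(-\iota t p_n/s_n\big)\Big]^{N_n} \\
&\equiv \Big(1 + \frac{z_n(t)}{N_n}\Big)^{N_n}, \quad \text{say},
\end{align*}
where $z_n(t) = N_n\big[p_n e^{\iota t(1 - p_n)/s_n} + (1 - p_n)e^{-\iota t p_n/s_n} - 1\big]$. By (3.2), it suffices to show that for all $t \in \mathbb{R}$,
\[
z_n(t) \to -\frac{t^2}{2} \quad \text{as } n \to \infty.
\]
By Lemma 10.1.5, for any real $x$,
\[
\Big|e^{\iota x} - 1 - \iota x - \frac{(\iota x)^2}{2}\Big| \le \frac{|x|^3}{3!}. \tag{3.4}
\]
Since $s_n \to \infty$, for any $t \in \mathbb{R}$, with $p_n(t) \equiv t p_n/s_n$ and $q_n(t) \equiv t(1 - p_n)/s_n$, one has
\begin{align*}
z_n(t) &= N_n\Big[p_n \exp\big(\iota t(1 - p_n)/s_n\big) + (1 - p_n)\exp\big(-\iota t p_n/s_n\big) - 1\Big] \\
&= N_n\Big[p_n\big(e^{\iota q_n(t)} - 1 - \iota q_n(t)\big) + (1 - p_n)\big(e^{-\iota p_n(t)} - 1 + \iota p_n(t)\big)\Big] \\
&= N_n\Big[\frac{p_n}{2}(\iota q_n(t))^2 + \frac{1 - p_n}{2}(\iota p_n(t))^2\Big] + N_n\,O\Big(\frac{p_n(1 - p_n)|t|^3}{s_n^3}\Big) \\
&= -\frac{t^2}{2} + o(1) \quad \text{as } n \to \infty.
\end{align*}
Here the second equality uses $p_n q_n(t) - (1 - p_n)p_n(t) = 0$. This is known as the DeMoivre-Laplace CLT in the case $N_n = n$, $p_n = p$, $0 < p < 1$. The original proof was based on Stirling's approximation.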
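The limit $z_n(t) \to -t^2/2$ used in Example 10.3.2 can be checked numerically. The sketch below evaluates $z_n(t)$ directly from its definition for fixed $p$ and growing $N$ (a special case chosen here for illustration); the function name `z_n` mirrors the notation of the example but is otherwise a hypothetical helper.

```python
import cmath
import math

def z_n(t, N, p):
    """z_n(t) = N [ p e^{i t (1-p)/s} + (1-p) e^{-i t p/s} - 1 ], with s^2 = N p (1-p)."""
    s = math.sqrt(N * p * (1 - p))
    return N * (p * cmath.exp(1j * t * (1 - p) / s)
                + (1 - p) * cmath.exp(-1j * t * p / s)
                - 1)

t, p = 1.5, 0.3
for N in (100, 10_000, 1_000_000):
    # The error is of order |t|^3 / s_n, in line with the O(.) term above.
    print(N, z_n(t, N, p), -t * t / 2)
```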


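Example 10.3.3: (Convergence of Poisson to Normal). Let $\{X_n\}_{n \ge 1}$ be a sequence of random variables such that for $n \ge 1$, $X_n \sim \text{Poisson}(\lambda_n)$, $\lambda_n \in (0,\infty)$. Let $Y_n = \frac{X_n - \lambda_n}{\sqrt{\lambda_n}}$, $n \ge 1$. If $\lambda_n \to \infty$ as $n \to \infty$, then
\[
Y_n \xrightarrow{d} N(0,1). \tag{3.5}
\]
To prove (3.5), note that the characteristic function $\phi_n$ of $Y_n$ is
\begin{align*}
\phi_n(t) &= \exp\big(-\iota t\sqrt{\lambda_n}\big)\exp\Big(\lambda_n\big(\exp(\iota t/\sqrt{\lambda_n}) - 1\big)\Big) \\
&= \exp\Big(\lambda_n\big(\exp(\iota t/\sqrt{\lambda_n}) - 1 - \iota t/\sqrt{\lambda_n}\big)\Big), \quad t \in \mathbb{R}.
\end{align*}
Now using (3.4) again it is easy to show that for each $t \in \mathbb{R}$,
\[
\lambda_n\Big(\exp\Big(\frac{\iota t}{\sqrt{\lambda_n}}\Big) - 1 - \frac{\iota t}{\sqrt{\lambda_n}}\Big) \to -\frac{t^2}{2} \quad \text{as } n \to \infty.
\]
Hence, (3.5) follows.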

10.4 Extension to $\mathbb{R}^k$

Definition 10.4.1: (a) Let $X = (X_1, \ldots, X_k)$ be a $k$-dimensional random vector ($k \in \mathbb{N}$). The characteristic function of $X$ is defined as
\[
\phi_X(t) = E\exp(\iota t \cdot X) = E\exp\Big(\iota \sum_{j=1}^{k} t_j X_j\Big), \tag{4.1}
\]
$t = (t_1, \ldots, t_k) \in \mathbb{R}^k$, where $t \cdot x = \sum_{j=1}^{k} t_j x_j$ denotes the inner product of the two vectors $t = (t_1, \ldots, t_k)$, $x = (x_1, \ldots, x_k) \in \mathbb{R}^k$.

(b) For a probability measure $\mu$ on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$, its characteristic function is defined as
\[
\phi(t) = \int_{\mathbb{R}^k} \exp(\iota t \cdot x)\,\mu(dx). \tag{4.2}
\]

Note that for a linear combination $L \equiv a_1 X_1 + \cdots + a_k X_k$, $a_1, \ldots, a_k \in \mathbb{R}$, of a set of random variables $X_1, \ldots, X_k$, all defined on a common probability space, the characteristic functions of $L$ and $X = (X_1, \ldots, X_k)$ are related by the identity
\[
\phi_L(\lambda) = E\exp\Big(\iota\lambda \sum_{j=1}^{k} a_j X_j\Big) = \phi_X(\lambda a), \quad \lambda \in \mathbb{R}, \tag{4.3}
\]

where $a = (a_1, \ldots, a_k)$. Thus, the characteristic function of a random vector $X = (X_1, \ldots, X_k)$ is determined by the characteristic functions of all its linear combinations and vice versa. It will now be shown that as in the one-dimensional case, the characteristic function of a random vector $X$ uniquely determines its probability distribution. The following is a multivariate version of Theorem 10.2.6.

Theorem 10.4.1: Let $X = (X_1, \ldots, X_k)$ be a random vector with characteristic function $\phi_X(\cdot)$ and let $A = (a_1, b_1] \times \cdots \times (a_k, b_k]$ be a rectangle in $\mathbb{R}^k$ with $-\infty < a_i < b_i < \infty$ for all $i = 1, \ldots, k$. If $P(X \in \partial A) = 0$, then
\[
P(X \in A) = \lim_{T \to \infty} \frac{1}{(2\pi)^k} \int_{-T}^{T} \cdots \int_{-T}^{T} \Big(\prod_{j=1}^{k} h_j(t_j)\Big)\,\phi_X(t_1, \ldots, t_k)\,dt_1 \ldots dt_k, \tag{4.4}
\]
where $\partial A$ denotes the boundary of $A$ and where $h_j(t_j) \equiv \big[\exp(-\iota t_j a_j) - \exp(-\iota t_j b_j)\big](\iota t_j)^{-1}$ for $t_j \ne 0$ and $h_j(0) = (b_j - a_j)$, $1 \le j \le k$.

Proof: Consider the product space $\Omega = [-T, T]^k \times \mathbb{R}^k$ with the corresponding Borel $\sigma$-algebra $\mathcal{F} = \mathcal{B}([-T, T]^k) \times \mathcal{B}(\mathbb{R}^k)$ and the product measure $\mu = \mu_1 \times \mu_2$, where $\mu_1$ is the Lebesgue measure on $([-T, T]^k, \mathcal{B}([-T, T]^k))$ and $\mu_2$ is the probability distribution of $X$ on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$. Since the function
\[
f(t, x) \equiv \prod_{j=1}^{k} h_j(t_j)\exp(\iota t \cdot x),
\]
$(t, x) \in \Omega$ is integrable w.r.t. the product measure $\mu$, by Fubini's theorem,
\begin{align*}
I_T &\equiv \int_{-T}^{T} \cdots \int_{-T}^{T} \Big(\prod_{j=1}^{k} h_j(t_j)\Big)\,\phi_X(t_1, \ldots, t_k)\,dt_1 \ldots dt_k \\
&= \int_{\mathbb{R}^k} \Big[\int_{-T}^{T} \cdots \int_{-T}^{T} \prod_{j=1}^{k} \big\{h_j(t_j)\exp(\iota t_j x_j)\big\}\,dt_1 \ldots dt_k\Big]\mu_2(dx) \\
&= \int_{\mathbb{R}^k} \prod_{j=1}^{k} \Big[\int_{-T}^{T} \frac{\exp(\iota t_j(x_j - a_j)) - \exp(\iota t_j(x_j - b_j))}{\iota t_j}\,dt_j\Big]\mu_2(dx) \\
&= \int_{\mathbb{R}^k} \prod_{j=1}^{k} \Big[2\int_{0}^{T} \frac{\sin t_j(x_j - a_j)}{t_j}\,dt_j - 2\int_{0}^{T} \frac{\sin t_j(x_j - b_j)}{t_j}\,dt_j\Big]\mu_2(dx), \tag{4.5}
\end{align*}

using (1.3) and the fact that $\frac{\sin\theta}{\theta}$ and $\frac{\cos\theta}{\theta}$ are respectively even and odd functions of $\theta$. It can be shown that (Problem 10.8)
\[
\lim_{T \to \infty} \int_{0}^{T} \frac{\sin t}{t}\,dt = \pi/2. \tag{4.6}
\]
Hence, by the change of variables theorem, it follows that, for any $c \in \mathbb{R}$,
\[
\lim_{T \to \infty} \int_{0}^{T} \frac{\sin tc}{t}\,dt =
\begin{cases}
0 & \text{if } c = 0 \\
\pi/2 & \text{if } c > 0 \\
-\pi/2 & \text{if } c < 0
\end{cases} \tag{4.7}
\]
and
\[
\sup_{T > 0,\,c \in \mathbb{R}} \Big|\int_{0}^{T} \frac{\sin tc}{t}\,dt\Big| = \sup_{T > 0} \Big|\int_{0}^{T} \frac{\sin u}{u}\,du\Big| \equiv K < \infty. \tag{4.8}
\]
This implies that as $T \to \infty$, the integrand in (4.5) converges to the function $\prod_{j=1}^{k} g_j(x_j)$ for each $x \in \mathbb{R}^k$, where
\[
g_j(y) =
\begin{cases}
\pi & \text{if } y \in \{a_j, b_j\} \\
2\pi & \text{if } y \in (a_j, b_j) \\
0 & \text{if } y \in (-\infty, a_j) \cup (b_j, \infty).
\end{cases} \tag{4.9}
\]
Hence, by (4.5), (4.8), (4.9), and the BCT,
\[
\lim_{T \to \infty} I_T = \int_{\mathbb{R}^k} \prod_{j=1}^{k} g_j(x_j)\,\mu_2(dx).
\]
By the boundary condition $P(X \in \partial A) = 0$, the right side above equals $(2\pi)^k P(X \in (a_1, b_1) \times \cdots \times (a_k, b_k))$, proving the theorem. $\Box$

Remark 10.4.1: The inversion formula (2.3) can also be extended to the multivariate case.

Corollary 10.4.2: A probability measure on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$ is uniquely determined by its characteristic function.

Proof: Let $\mu$ and $\nu$ be probability measures on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$ with the same characteristic function $\phi(\cdot)$, i.e.,
\[
\phi(t) = \int \exp(\iota t \cdot x)\,\mu(dx) = \int \exp(\iota t \cdot x)\,\nu(dx),
\]

$t \in \mathbb{R}^k$. Let $\mathcal{A} = \{A : A = (a_1, b_1] \times \cdots \times (a_k, b_k],\ -\infty < a_i < b_i < \infty,\ i = 1, \ldots, k,\ \mu(\partial A) = 0 = \nu(\partial A)\}$. It is easy to verify that $\mathcal{A}$ is a $\pi$-class. Since there are only countably many rectangles $(a_1, b_1] \times \cdots \times (a_k, b_k]$ with $\mu(\partial A) + \nu(\partial A) \ne 0$, $\mathcal{A}$ generates $\mathcal{B}(\mathbb{R}^k)$. But, by Theorem 10.4.1,
\[
\mu(A) = \nu(A) = \lim_{T \to \infty} (2\pi)^{-k} \int_{-T}^{T} \cdots \int_{-T}^{T} \Big(\prod_{j=1}^{k} h_j(t_j)\Big)\,\phi(t_1, \ldots, t_k)\,dt_1 \ldots dt_k
\]
for all $A \in \mathcal{A}$. Hence, by Theorem 1.2.4, $\mu(B) = \nu(B)$ for all $B \in \mathcal{B}(\mathbb{R}^k)$, i.e., $\mu = \nu$. $\Box$

Corollary 10.4.3: A probability measure $\mu$ on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$ is determined by its values assigned to the collection of half-spaces $\mathcal{H} \equiv \{H : H = \{x \in \mathbb{R}^k : a \cdot x \le c\},\ a \in \mathbb{R}^k,\ c \in \mathbb{R}\}$.

Proof: Let $X$ be the identity mapping on $\mathbb{R}^k$. Then, for any $H = \{x \in \mathbb{R}^k : a \cdot x \le c\}$, $\{X \in H\} = \{a \cdot X \le c\}$. Thus, the values $\{\mu(H) : H \in \mathcal{H}\}$ determine the probability distributions (and hence, the characteristic functions) of all linear combinations of $X$. Consequently, by (4.3), they determine the characteristic function of $X$. By Corollary 10.4.2, this determines $\mu$ uniquely. $\Box$

Theorem 10.4.4: Let $\{X_n\}_{n \ge 1}$, $X$ be $k$-dimensional random vectors. Then $X_n \xrightarrow{d} X$ iff
\[
\phi_{X_n}(t) \to \phi_X(t) \quad \text{for all } t \in \mathbb{R}^k. \tag{4.10}
\]

Proof: Suppose that $X_n \xrightarrow{d} X$. Then, (4.10) follows from the continuous mapping theorem for weak convergence (cf. Theorem 9.4.2). Conversely, suppose (4.10) holds. Let $X_n^{(j)}$ and $X^{(j)}$ denote the $j$th components of $X_n$ and $X$, respectively, $j = 1, \ldots, k$. By (4.10), for any $j \in \{1, \ldots, k\}$,
\[
\lim_{n \to \infty} E\exp(\iota\lambda X_n^{(j)}) = E\exp(\iota\lambda X^{(j)}) \quad \text{for all } \lambda \in \mathbb{R}.
\]
Hence, by Theorem 10.3.4,
\[
X_n^{(j)} \xrightarrow{d} X^{(j)} \quad \text{for all } j = 1, \ldots, k. \tag{4.11}
\]
This implies that the sequence of random vectors $\{X_n\}_{n \ge 1}$ is tight (Problem 9.9). Hence, by Theorem 9.3.3, given any subsequence $\{n_i\}_{i \ge 1}$, there exists a further subsequence $\{n_i'\}_{i \ge 1} \subset \{n_i\}_{i \ge 1}$ and a random vector $X_0$ such that $X_{n_i'} \xrightarrow{d} X_0$ as $i \to \infty$. By the 'only if' part, this implies
\[
\phi_{X_{n_i'}}(t) \to \phi_{X_0}(t) \quad \text{as } i \to \infty,
\]
for all $t \in \mathbb{R}^k$. Thus, $\phi_{X_0}(\cdot) = \phi_X(\cdot)$ and by the uniqueness of characteristic functions, $X_0 =^d X$. Thus, all convergent subsequences of $\{X_n\}_{n \ge 1}$ have the same limit. By arguments similar to the proof of Theorem 9.2.6, $X_n \xrightarrow{d} X$. This completes the proof of the theorem. $\Box$

Theorem 10.4.4 shows that as in the one-dimensional case, the (pointwise) convergence of the characteristic functions of a sequence of $k$-dimensional random vectors $\{X_n\}_{n \ge 1}$ to a given characteristic function is equivalent to convergence in distribution of the sequence $\{X_n\}_{n \ge 1}$. Since the characteristic function of a random vector is determined by the characteristic functions of all its linear combinations, this suggests that one may also be able to establish convergence in distribution of a sequence of random vectors by considering the convergence of the sequences of linear combinations that are one-dimensional random variables. This is indeed true as shown by the following result.

Theorem 10.4.5: (Cramer-Wold device). Let $\{X_n\}_{n \ge 1}$ be a sequence of $k$-dimensional random vectors and let $X$ be a $k$-dimensional random vector. Then, $X_n \xrightarrow{d} X$ iff for all $a \in \mathbb{R}^k$,
\[
a \cdot X_n \xrightarrow{d} a \cdot X. \tag{4.12}
\]

Proof: Suppose $X_n \xrightarrow{d} X$. Then, for any $a \in \mathbb{R}^k$, the function $h(x) = a \cdot x$, $x \in \mathbb{R}^k$ is continuous on $\mathbb{R}^k$. Hence, (4.12) follows from Theorem 9.4.2. Conversely, suppose that (4.12) holds for all $a \in \mathbb{R}^k$. By (4.3) and Theorem 10.3.1, this implies that as $n \to \infty$,
\[
\phi_{X_n}(a) = \phi_{a \cdot X_n}(1) \to \phi_{a \cdot X}(1) = \phi_X(a),
\]
for all $a \in \mathbb{R}^k$. Hence, by Theorem 10.4.4, $X_n \xrightarrow{d} X$. $\Box$
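Identity (4.3), on which the Cramer-Wold argument rests, can be verified directly for a random vector with finitely many values. The sketch below is illustrative only: the probability mass function `pmf`, the coefficient vector `a`, and all helper names are hypothetical examples, not taken from the text.

```python
import cmath

def cf_vector(pmf, t):
    """phi_X(t) = E exp(i t.X) for a finitely supported random vector X."""
    return sum(p * cmath.exp(1j * sum(tj * xj for tj, xj in zip(t, x)))
               for x, p in pmf.items())

def linear_combination_pmf(pmf, a):
    """Distribution of L = a.X, obtained by pooling the mass of equal values."""
    out = {}
    for x, p in pmf.items():
        value = sum(aj * xj for aj, xj in zip(a, x))
        out[value] = out.get(value, 0.0) + p
    return out

def cf_scalar(pmf1, lam):
    """phi_L(lam) = E exp(i lam L) for a finitely supported random variable L."""
    return sum(p * cmath.exp(1j * lam * v) for v, p in pmf1.items())

pmf = {(0, 1): 0.2, (1, 1): 0.3, (2, -1): 0.5}  # hypothetical distribution on R^2
a = (1.5, -2.0)
pmf_L = linear_combination_pmf(pmf, a)
for lam in (-2.0, 0.3, 1.0, 4.0):
    lhs = cf_scalar(pmf_L, lam)                         # phi_L(lam)
    rhs = cf_vector(pmf, tuple(lam * aj for aj in a))   # phi_X(lam a)
    print(lam, abs(lhs - rhs))
```

The printed differences should be zero up to floating-point rounding.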

Recall that a set of random variables $X_1, \ldots, X_k$ defined on a common probability space are independent iff the joint cdf of $X_1, \ldots, X_k$ is the product of the marginal cdfs of the $X_i$'s. A similar characterization of independence can be given in terms of the characteristic functions, as shown by the following result. The proof is left as an exercise (Problem 10.16).

Proposition 10.4.6: Let $X_1, \ldots, X_k$ ($k \in \mathbb{N}$) be a collection of random variables defined on a common probability space. Then, $X_1, \ldots, X_k$ are independent iff
\[
\phi_{(X_1, \ldots, X_k)}(t_1, \ldots, t_k) = \prod_{j=1}^{k} \phi_{X_j}(t_j) \quad \text{for all } t_1, \ldots, t_k \in \mathbb{R}.
\]
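The factorization criterion of Proposition 10.4.6 can be illustrated on finitely supported distributions. The following sketch compares the joint characteristic function with the product of the marginal ones on a grid; the two example distributions `indep` and `dep` and all function names are hypothetical choices made here. A grid check of this kind can only illustrate the criterion, since the proposition requires the identity for all $t$.

```python
import cmath
from itertools import product

def joint_cf(pmf, t):
    """Joint characteristic function of a finitely supported random vector."""
    return sum(p * cmath.exp(1j * sum(tj * xj for tj, xj in zip(t, x)))
               for x, p in pmf.items())

def marginal(pmf, j):
    """Marginal distribution of the j-th coordinate."""
    out = {}
    for x, p in pmf.items():
        out[x[j]] = out.get(x[j], 0.0) + p
    return out

def cf_1d(pmf1, t):
    return sum(p * cmath.exp(1j * t * v) for v, p in pmf1.items())

def factorization_gap(pmf, grid):
    """Largest |joint cf - product of marginal cfs| over grid^k."""
    k = len(next(iter(pmf)))
    margs = [marginal(pmf, j) for j in range(k)]
    gap = 0.0
    for t in product(grid, repeat=k):
        prod_cf = 1
        for j in range(k):
            prod_cf *= cf_1d(margs[j], t[j])
        gap = max(gap, abs(joint_cf(pmf, t) - prod_cf))
    return gap

grid = [0.5 * m for m in range(-6, 7)]
# Independent coordinates: the joint pmf is a product of marginals.
indep = {(x, y): px * py
         for x, px in {0: 0.3, 1: 0.7}.items()
         for y, py in {-1: 0.5, 2: 0.5}.items()}
# Fully dependent coordinates: X1 = X2.
dep = {(0, 0): 0.5, (1, 1): 0.5}
print(factorization_gap(indep, grid), factorization_gap(dep, grid))
```

The first gap should vanish up to rounding, while the second is far from zero.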


10.5 Problems

10.1 Let $\{X_n\}_{n \ge 1}$ and $\{Y_n\}_{n \ge 1}$ be two sequences of random variables such that for each $n \ge 1$, $X_n$ and $Y_n$ are defined on a common probability space and $X_n$ and $Y_n$ are independent. If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{d} Y$, then show that
\[
X_n + Y_n \xrightarrow{d} X_0 + Y_0 \tag{5.1}
\]
where $X_0 =^d X$, $Y_0 =^d Y$ (cf. Section 2.2) and $X_0$ and $Y_0$ are independent. Show by an example that (5.1) is false without the independence hypothesis.

10.2 Give an example of a nonlattice discrete distribution $F$ on $\mathbb{R}$ supported by only a three point set.

10.3 Let $F$ be an absolutely continuous cdf on $\mathbb{R}$ with density $f$ and with characteristic function $\phi$. Show that if $f$ has a derivative $f^{(1)} \in L^1(\mathbb{R})$, then
\[
\lim_{|t| \to \infty} |t\phi(t)| = 0.
\]
Generalize this result when $f$ is $r$-times differentiable and the $j$th derivative $f^{(j)}$ lies in $L^1(\mathbb{R})$ for $j = 1, \ldots, r$.

10.4 Let $F$ be a cdf on $\mathbb{R}$ with characteristic function $\phi$. Show that for any $a < b$, $a, b \in \mathbb{R}$,
\[
\lim_{T \to \infty} \frac{1}{2\pi}\int_{-T}^{T} \big[\exp(-\iota ta) - \exp(-\iota tb)\big](\iota t)^{-1}\phi(t)\,dt = \mu_F((a, b)) + \frac{1}{2}\mu_F(\{a, b\}), \tag{5.2}
\]
where $\mu_F$ denotes the Lebesgue-Stieltjes measure corresponding to $F$. (Hint: Use (4.7) and the arguments in the proof of Theorem 10.4.1.)

10.5 Let $\phi$ be a characteristic function of a cdf $F$ and let $\mu_F$ denote the Lebesgue-Stieltjes measure corresponding to $F$.

(a) Show that for any $a \in \mathbb{R}$ and $T \in (0,\infty)$,
\[
\int_{-T}^{T} \exp(-\iota ta)\phi(t)\,dt = 2T\mu_F(\{a\}) + \int_{\{x \ne a\}} \frac{\exp(\iota T(x - a)) - \exp(-\iota T(x - a))}{\iota(x - a)}\,\mu_F(dx). \tag{5.3}
\]

(b) Conclude from (5.3) that for any $a \in \mathbb{R}$,
\[
F(a) - F(a-) = \lim_{T \to \infty} \frac{1}{2T}\int_{-T}^{T} \exp(-\iota ta)\phi(t)\,dt. \tag{5.4}
\]

10.6 Let $F$ be a cdf on $\mathbb{R}$ with characteristic function $\phi$. If $|\phi| \in L^2(\mathbb{R})$, then show that $F$ is continuous. (Hint: Use Corollary 10.1.7.)

10.7 Let $\{F_n\}_{n \ge 1}$, $F$ be cdfs with characteristic functions $\{\phi_n\}_{n \ge 1}$, $\phi$, respectively. Suppose that $F_n \xrightarrow{d} F$.

(a) Give an example to show that $\phi_n$ may not converge to $\phi$ uniformly over all of $\mathbb{R}$. (Hint: Try $\phi_n(t) \equiv e^{-t^2/n}$.)

(b) Let $\{\mu_n\}_{n \ge 1}$ and $\mu$ denote the Lebesgue-Stieltjes measures corresponding to $\{F_n\}_{n \ge 1}$ and $F$, respectively. Suppose that $\{\mu_n\}_{n \ge 1}$ and $\mu$ are dominated by a $\sigma$-finite measure $\lambda$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ with Radon-Nikodym derivatives $f_n = \frac{d\mu_n}{d\lambda}$, $n \ge 1$ and $f = \frac{d\mu}{d\lambda}$. If $f_n \to f$ a.e. ($\lambda$), then show that
\[
\sup_{t \in \mathbb{R}} |\phi_n(t) - \phi(t)| \to 0 \quad \text{as } n \to \infty. \tag{5.5}
\]

10.8 Let $G(x, a) = (1 + a^2)^{-1}\big(1 - e^{-ax}\{a\sin x + \cos x\}\big)$, $x \in \mathbb{R}$, $a \in \mathbb{R}$.

(a) Show that for any $a > 0$, $x_0 \ge 0$,
\[
\int_{0}^{x_0} (\sin x)\,e^{-ax}\,dx = G(x_0, a). \tag{5.6}
\]
(Hint: Consider the derivatives of the left and the right sides of (5.6) w.r.t. $x_0$.)

(b) Use Fubini's theorem to justify that for all $T > 0$,
\[
\int_{0}^{T}\int_{0}^{\infty} (\sin x)\,e^{-ax}\,da\,dx = \int_{0}^{\infty}\int_{0}^{T} (\sin x)\,e^{-ax}\,dx\,da. \tag{5.7}
\]

(c) Use (5.6), (5.7) and the identity that for $x > 0$, $\int_{0}^{\infty} e^{-ax}\,da = \frac{1}{x}$ to conclude that for any $T > 0$,
\[
\int_{0}^{T} \frac{\sin x}{x}\,dx = \int_{0}^{\infty} G(T, a)\,da. \tag{5.8}
\]

(d) Use the DCT and the fact that $\int_{0}^{\infty} (1 + a^2)^{-1}\,da = \frac{\pi}{2}$ to conclude that the limit of the right side of (5.8) exists and equals $\frac{\pi}{2}$.


10.9 Let $F_1$, $F_2$, and $F_3$ be three cdfs on $\mathbb{R}$. Then show by an example that $F_1 * F_2 = F_1 * F_3$ does not imply that $F_2 = F_3$. Here $F_i * F_j$ denotes the convolution of $F_i$ and $F_j$, $1 \le i, j \le 3$. (Hint: For $F_1$, consider a cdf whose characteristic function $\phi$ has a bounded support.)

10.10 Let $\mu$ be a probability measure on $\mathbb{R}$ with characteristic function $\phi$. Prove that
\[
\frac{1}{\pi}\int_{-\infty}^{\infty} [1 - \mathrm{Re}(\phi(t))]\,t^{-2}\,dt = \int |x|\,\mu(dx).
\]

10.11 Let $\phi$ be the characteristic function of a random variable $X$. If $|\phi(t)| = 1 = |\phi(\alpha t)|$ for some $t \ne 0$ and $\alpha \in \mathbb{R}$ irrational, then there exists $x_0 \in \mathbb{R}$ such that $P(X = x_0) = 1$. (Hint: Use Proposition 10.1.1.)

10.12 Show that for any characteristic function $\phi$, $\{t \in \mathbb{R} : |\phi(t)| = 1\}$ is either $\{0\}$ or countably infinite or all of $\mathbb{R}$.

10.13 Let $\{X_n\}_{n \ge 1}$ be a sequence of iid random variables with a nondegenerate distribution $F$. Suppose that there exist $a_n \in (0,\infty)$ and $b_n \in \mathbb{R}$ such that
\[
a_n^{-1}\Big(\sum_{j=1}^{n} X_j - b_n\Big) \xrightarrow{d} Z \tag{5.9}
\]
for some nondegenerate random variable $Z$.

(a) Show that
\[
a_n \to \infty \quad \text{as } n \to \infty.
\]
(Hint: If $a_n \to a \in \mathbb{R}$, then $E\exp(\iota taZ) = \lim_{n \to \infty} E\exp\big(\iota t\big(\sum_{j=1}^{n} X_j - b_n\big)\big) = 0$ for all except countably many $t \in \mathbb{R}$, which leads to a contradiction.)

(b) Show that as $n \to \infty$,
\[
b_n - b_{n-1} = o(a_n) \quad \text{and} \quad \frac{a_n}{a_{n-1}} \to 1.
\]
(Hint: Use (a) to show that $\big(\sum_{j=1}^{n-1} X_j - b_n\big)/a_n \xrightarrow{d} Z$ and by (5.9), $\big(\sum_{j=1}^{n-1} X_j - b_{n-1}\big)/a_{n-1} \xrightarrow{d} Z$.)


10.14 Show that for every $T \in (0,\infty)$, there exist two distinct characteristic functions $\phi_{1T}$ and $\phi_{2T}$ satisfying
\[
\phi_{1T}(t) = \phi_{2T}(t) \quad \text{for all } |t| \le T.
\]
(Hint: Let $\phi_1(t) = e^{-|t|}$, $t \in \mathbb{R}$ and for any $T \in (0,\infty)$, define an even function $\phi_{2T}(\cdot)$ by
\[
\phi_{2T}(t) =
\begin{cases}
\phi_1(t) & \text{for } 0 \le t \le T \\
\phi_1(T) + (t - T)(-\phi_1(T)) & \text{for } T \le t \le T + 1 \\
0 & \text{for } t > T + 1.
\end{cases}
\]
Now use Pólya's criterion.)

10.15 Show that $\phi_\alpha(t) = \exp(-|t|^\alpha)$, $t \in \mathbb{R}$, $\alpha \in (0,\infty)$ is a characteristic function for $0 \le \alpha \le 2$.

10.16 Prove Proposition 10.4.6. (Hint: The 'only if' part follows from (4.2) and Proposition 7.1.3. The 'if part' follows by using the inversion formulas of Theorems 10.4.1 and 10.2.6 and the characterization of independence in terms of cdfs (Corollary 7.1.2).)

10.17 Let $\{X_n\}_{n \ge 0}$ be a collection of random variables with characteristic functions $\{\phi_n\}_{n \ge 0}$. Suppose that $\int |\phi_n(t)|\,dt < \infty$ for all $n \ge 0$ and that $\phi_n(\cdot) \to \phi_0(\cdot)$ in $L^1(\mathbb{R})$ as $n \to \infty$. Show that
\[
\sup_{B \in \mathcal{B}(\mathbb{R})} \big|P(X_n \in B) - P(X_0 \in B)\big| \to 0
\]
as $n \to \infty$.

10.18 Let $\phi(\cdot)$ be a characteristic function on $\mathbb{R}$ such that $\phi(t) \to 0$ as $|t| \to \infty$. Let $X$ be a random variable with $\phi$ as its characteristic function. For each $n \ge 1$, let $X_n = \frac{k}{n}$ if $\frac{k}{n} \le X < \frac{k+1}{n}$, $k = 0, \pm 1, \pm 2, \ldots$. Show that if $\phi_n(t) \equiv E(e^{\iota tX_n})$, then $\phi_n(t) \to \phi(t)$ for each $t \in \mathbb{R}$ but for each $n \ge 1$, $\sup\{|\phi_n(t) - \phi(t)| : t \in \mathbb{R}\} = 1$.

10.19 Let $\{\delta_i\}_{i \ge 1}$ be iid random variables with distribution $P(\delta_1 = 1) = P(\delta_1 = -1) = 1/2$. Let $X_n = \sum_{i=1}^{n} \frac{\delta_i}{2^i}$ and $X = \lim_{n \to \infty} X_n$.

(a) Find the characteristic function of $X_n$.

(b) Show that the characteristic function of $X$ is $\phi_X(t) \equiv \frac{\sin t}{t}$.

10.20 Let $\{X_k\}_{k \ge 1}$ be iid random variables with pdf $f(x) = \frac{1}{2}e^{-|x|}$, $x \in \mathbb{R}$. Show that $\sum_{k=1}^{\infty} \frac{1}{k}X_k$ converges w.p. 1 and compute its characteristic function. (Hint: Note that the characteristic function of the standard Cauchy (0,1) distribution is $e^{-|t|}$.)

10.21 Establish an extension of formula (2.3) to the multivariate case.

11 Central Limit Theorems

11.1 Lindeberg-Feller theorems

The central limit theorem (CLT) is one of the oldest and most useful results in probability theory. Empirical findings in applied sciences, dating back to the 17th century, showed that the averages of laboratory measurements on various physical quantities tended to have a bell-shaped distribution. The CLT provides a theoretical justification for this observation. Roughly speaking, it says that under some mild conditions, the average of a large number of iid random variables is approximately normally distributed. A version of this result for 0–1 valued random variables was proved by DeMoivre and Laplace in the early 18th century. An extension of this result to the averages of iid random variables with a finite second moment was done in the early 20th century. In this section, a more general set up is considered, namely, that of the limit behavior of the row sums of a triangular array of independent random variables.

Definition 11.1.1: For each $n \ge 1$, let $\{X_{n1}, \ldots, X_{nr_n}\}$ be a collection of random variables defined on a probability space $(\Omega_n, \mathcal{F}_n, P_n)$ such that $X_{n1}, \ldots, X_{nr_n}$ are independent. Then, $\{X_{nj} : 1 \le j \le r_n\}_{n \ge 1}$ is called a triangular array of independent random variables.

Let $\{X_{nj} : 1 \le j \le r_n\}_{n \ge 1}$ be a triangular array of independent random variables. Define the row sums
\[
S_n = \sum_{j=1}^{r_n} X_{nj}, \quad n \ge 1. \tag{1.1}
\]
Suppose that $EX_{nj}^2 < \infty$ for all $j, n$. Write $s_n^2 = \mathrm{Var}(S_n) = \sum_{j=1}^{r_n} \mathrm{Var}(X_{nj})$, $n \ge 1$. The following condition, introduced by Lindeberg, plays an important role in establishing convergence of $\frac{S_n - ES_n}{s_n}$ to a standard normal random variable in distribution.

Definition 11.1.2: Let $\{X_{nj} : 1 \le j \le r_n\}_{n \ge 1}$ be a triangular array of independent random variables such that
\[
EX_{nj} = 0, \quad 0 < EX_{nj}^2 \equiv \sigma_{nj}^2 < \infty \quad \text{for all } 1 \le j \le r_n,\ n \ge 1. \tag{1.2}
\]
Then, $\{X_{nj} : 1 \le j \le r_n\}_{n \ge 1}$ is said to satisfy the Lindeberg condition if for every $\epsilon > 0$,
\[
\lim_{n \to \infty} s_n^{-2} \sum_{j=1}^{r_n} EX_{nj}^2\,I(|X_{nj}| > \epsilon s_n) = 0, \tag{1.3}
\]
where $s_n^2 = \sum_{j=1}^{r_n} \sigma_{nj}^2$, $n \ge 1$.
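To make the Lindeberg condition (1.3) concrete, the sketch below evaluates the Lindeberg sum exactly for rows of finitely supported, mean-zero random variables. Everything here is an illustrative assumption: each variable is represented as a dictionary mapping values to probabilities, `rademacher_row` is a row of $n$ symmetric $\pm n^{-1/2}$ variables, and `dominated_row` is a hypothetical row in which one term carries about half of the total variance.

```python
import math

def lindeberg_sum(row, eps):
    """s_n^{-2} * sum_j E[X_nj^2 I(|X_nj| > eps * s_n)] for one row.

    `row` is a list of pmfs (dict: value -> probability), each with mean zero.
    """
    s2 = sum(p * x * x for pmf in row for x, p in pmf.items())
    s = math.sqrt(s2)
    tail = sum(p * x * x for pmf in row for x, p in pmf.items()
               if abs(x) > eps * s)
    return tail / s2

def rademacher_row(n):
    """n independent variables taking the values +-1/sqrt(n) with probability 1/2 each."""
    v = 1.0 / math.sqrt(n)
    return [{-v: 0.5, v: 0.5} for _ in range(n)]

def dominated_row(n):
    """Hypothetical row: one +-1 variable plus n-1 variables taking +-1/sqrt(n)."""
    v = 1.0 / math.sqrt(n)
    return [{-1.0: 0.5, 1.0: 0.5}] + [{-v: 0.5, v: 0.5} for _ in range(n - 1)]

eps = 0.05
for n in (100, 10_000):
    print(n, lindeberg_sum(rademacher_row(n), eps), lindeberg_sum(dominated_row(n), eps))
```

For the first array the sum drops to 0 once $n^{-1/2} \le \epsilon$, so (1.3) holds; for the second it stays near $1/2$, so (1.3) fails.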

Example 11.1.1: Let $\{X_n\}_{n \ge 1}$ be a sequence of iid random variables with $EX_1 = \mu$ and $\mathrm{Var}(X_1) = \sigma^2 \in (0,\infty)$. Consider the centered and scaled sample mean
\[
T_n = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma}, \quad n \ge 1, \tag{1.4}
\]
where $\bar{X}_n = n^{-1}\sum_{j=1}^{n} X_j$. Note that $T_n$ can be written as the row sum of a triangular array of independent random variables:
\[
T_n = \sum_{j=1}^{n} X_{nj}, \tag{1.5}
\]
where $X_{nj} = (X_j - \mu)/\{\sigma\sqrt{n}\}$, $1 \le j \le n$, $n \ge 1$. Clearly, $\{X_{nj} : 1 \le j \le n\}_{n \ge 1}$ satisfies (1.2) with $\sigma_{nj}^2 = EX_{nj}^2 = \frac{1}{n\sigma^2}\mathrm{Var}(X_1) = 1/n$ for all $1 \le j \le n$, and hence, $s_n^2 = \sum_{j=1}^{n} \sigma_{nj}^2 = 1$ for all $n \ge 1$. Now, for any $\epsilon > 0$,
\begin{align*}
s_n^{-2}\sum_{j=1}^{n} EX_{nj}^2\,I(|X_{nj}| > \epsilon s_n)
&= \sum_{j=1}^{n} E\Big(\frac{X_j - \mu}{\sigma\sqrt{n}}\Big)^2 I\Big(\Big|\frac{X_j - \mu}{\sigma\sqrt{n}}\Big| > \epsilon\Big) \\
&= n\,\frac{1}{n\sigma^2}\,E(X_1 - \mu)^2\,I\big(|X_1 - \mu| > \epsilon\sigma\sqrt{n}\big) \\
&= \sigma^{-2}E(X_1 - \mu)^2\,I\big(|X_1 - \mu| > \epsilon\sigma\sqrt{n}\big) \to 0 \quad \text{as } n \to \infty,
\end{align*}
by the DCT, since $E(X_1 - \mu)^2 < \infty$. Thus, the triangular array $\{X_{nj} : 1 \le j \le n\}$ of (1.5) satisfies the Lindeberg condition (1.3).

The main result of this section is the following CLT for scaled row sums of a triangular array of independent random variables.

Theorem 11.1.1: (Lindeberg CLT). Let $\{X_{nj} : 1 \le j \le r_n\}_{n \ge 1}$ be a triangular array of independent random variables satisfying (1.2) and the Lindeberg condition (1.3). Then,
\[
\frac{S_n}{s_n} \xrightarrow{d} N(0,1) \tag{1.6}
\]
where $S_n = \sum_{j=1}^{r_n} X_{nj}$ and $s_n^2 = \mathrm{Var}(S_n) = \sum_{j=1}^{r_n} \sigma_{nj}^2$.

As a direct consequence of Theorem 11.1.1 and Example 11.1.1, one gets the more familiar version of the CLT for the sample mean of iid random variables.

Corollary 11.1.2: (CLT for iid random variables). Let $\{X_n\}_{n \ge 1}$ be a sequence of iid random variables with $EX_1 = \mu$ and $\mathrm{Var}(X_1) = \sigma^2 \in (0,\infty)$. Then,
\[
\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2), \tag{1.7}
\]
where $\bar{X}_n = n^{-1}\sum_{j=1}^{n} X_j$, $n \ge 1$.
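A simple special case makes the iid CLT visible at the level of characteristic functions. If the $X_j$ take the values $\pm 1$ with probability $1/2$ each (so $\mu = 0$, $\sigma = 1$), then each $X_j$ has characteristic function $\cos t$ and the normalized sum $n^{-1/2}\sum_{j=1}^{n} X_j$ has characteristic function $\cos(t/\sqrt{n})^n$, which should approach $e^{-t^2/2}$. The sketch below checks this numerically; the function name `rademacher_sum_cf` is a hypothetical label.

```python
import math

def rademacher_sum_cf(t, n):
    """Characteristic function of n^{-1/2} (X_1 + ... + X_n) for iid +-1 signs."""
    return math.cos(t / math.sqrt(n)) ** n

t = 2.0
for n in (4, 100, 10_000, 1_000_000):
    print(n, rademacher_sum_cf(t, n), math.exp(-t * t / 2))
```

By Theorem 10.3.4, this pointwise convergence is exactly what is needed for convergence in distribution to $N(0,1)$.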

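For proving the theorem, the following simple inequality will be used.

Lemma 11.1.3: For any $m \in \mathbb{N}$ and for any complex numbers $z_1, \ldots, z_m$, $\omega_1, \ldots, \omega_m$, with $|z_j| \le 1$, $|\omega_j| \le 1$ for all $j = 1, \ldots, m$,
\[
\Big|\prod_{j=1}^{m} z_j - \prod_{j=1}^{m} \omega_j\Big| \le \sum_{j=1}^{m} |z_j - \omega_j|. \tag{1.8}
\]

Proof: Inequality (1.8) follows from the identity
\begin{align*}
\prod_{j=1}^{m} z_j - \prod_{j=1}^{m} \omega_j
&= \Big(\prod_{j=1}^{m} z_j - \prod_{j=1}^{m-1} z_j\,\omega_m\Big)
+ \Big(\prod_{j=1}^{m-1} z_j\,\omega_m - \prod_{j=1}^{m-2} z_j\,\omega_{m-1}\omega_m\Big) \\
&\quad + \cdots + \Big(z_1\prod_{j=2}^{m} \omega_j - \prod_{j=1}^{m} \omega_j\Big).
\end{align*}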
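Inequality (1.8) is easy to probe numerically. The sketch below draws complex numbers of modulus at most 1 with a seeded generator and records the largest value of (left side minus right side) over many trials; the helper names and the sampling scheme are illustrative assumptions, and a random search of this kind illustrates the lemma rather than proving it.

```python
import cmath
import math
import random

def product_of(values):
    out = 1
    for v in values:
        out *= v
    return out

def lemma_sides(z, w):
    """Return (left side, right side) of inequality (1.8)."""
    lhs = abs(product_of(z) - product_of(w))
    rhs = sum(abs(zj - wj) for zj, wj in zip(z, w))
    return lhs, rhs

def random_unit_disc(rng, m):
    """m complex numbers with modulus at most 1 (hypothetical sampling scheme)."""
    return [rng.random() * cmath.exp(1j * rng.uniform(0.0, 2.0 * math.pi))
            for _ in range(m)]

rng = random.Random(0)
worst = -math.inf
for _ in range(1000):
    m = rng.randint(1, 8)
    lhs, rhs = lemma_sides(random_unit_disc(rng, m), random_unit_disc(rng, m))
    worst = max(worst, lhs - rhs)
print(worst)  # should not be positive, up to rounding
```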

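$\Box$

Proof of Theorem 11.1.1: W.l.o.g., suppose that $s_n^2 = 1$ for all $n \ge 1$. (Otherwise, setting $\tilde{X}_{nj} \equiv X_{nj}/s_n$, $1 \le j \le r_n$, $n \ge 1$, it is easy to check that for the triangular array $\{\tilde{X}_{nj} : 1 \le j \le r_n\}_{n \ge 1}$, the variance of the $n$th row sum $\tilde{s}_n^2 \equiv \sum_{j=1}^{r_n} \mathrm{Var}(\tilde{X}_{nj}) = 1$ for all $n \ge 1$, the Lindeberg condition holds, and $\tilde{s}_n^{-1}\sum_{j=1}^{r_n} \tilde{X}_{nj} \xrightarrow{d} N(0,1)$ iff (1.6) holds.) Then, by Theorem 10.3.4, it is enough to show that
\[
\lim_{n \to \infty} E\exp(\iota tS_n) = e^{-t^2/2} \quad \text{for all } t \in \mathbb{R}. \tag{1.9}
\]
For any $\epsilon > 0$,
\begin{align*}
\Delta_n &\equiv \max\{EX_{nj}^2 : 1 \le j \le r_n\} \\
&\le \max\{EX_{nj}^2\,I(|X_{nj}| > \epsilon) + EX_{nj}^2\,I(|X_{nj}| \le \epsilon) : 1 \le j \le r_n\} \\
&\le \sum_{j=1}^{r_n} EX_{nj}^2\,I(|X_{nj}| > \epsilon) + \epsilon^2 \\
&= o(1) + \epsilon^2 \quad \text{as } n \to \infty, \text{ by the Lindeberg condition (1.3)}.
\end{align*}
Hence,
\[
\Delta_n \to 0 \quad \text{as } n \to \infty. \tag{1.10}
\]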

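Fix $t \in \mathbb{R}$. Let $\phi_{nj}(\cdot)$ denote the characteristic function of $X_{nj}$, $1 \le j \le r_n$, $n \ge 1$. Note that by (1.10), there exists $n_0 \in \mathbb{N}$ such that for all $n \ge n_0$, $I_{1n} \equiv \max\{|1 - t^2\sigma_{nj}^2/2| : 1 \le j \le r_n\} \le 1$. Next, noting that $s_n^2 = \sum_{j=1}^{r_n} \sigma_{nj}^2 = 1$, by Lemma 11.1.3, for all $n \ge n_0$,
\begin{align*}
\big|E\exp(\iota tS_n) - e^{-t^2/2}\big|
&\le \Big|\prod_{j=1}^{r_n} \phi_{nj}(t) - \prod_{j=1}^{r_n}\Big(1 - \frac{t^2\sigma_{nj}^2}{2}\Big)\Big|
+ \Big|\prod_{j=1}^{r_n}\Big(1 - \frac{t^2\sigma_{nj}^2}{2}\Big) - \prod_{j=1}^{r_n}\exp(-t^2\sigma_{nj}^2/2)\Big| \\
&\le \sum_{j=1}^{r_n}\Big|\phi_{nj}(t) - \Big[1 - \frac{t^2\sigma_{nj}^2}{2}\Big]\Big|
+ \sum_{j=1}^{r_n}\Big|\exp(-t^2\sigma_{nj}^2/2) - \Big[1 - \frac{t^2\sigma_{nj}^2}{2}\Big]\Big| \\
&\equiv I_{2n} + I_{3n}, \quad \text{say}. \tag{1.11}
\end{align*}
It will now be shown that
\[
\lim_{n \to \infty} I_{kn} = 0 \quad \text{for } k = 2, 3.
\]
First consider $I_{2n}$. Since $\big|\exp(\iota x) - [1 + \iota x + (\iota x)^2/2]\big| \le \min\{|x|^3/3!, |x|^2\}$ for all $x \in \mathbb{R}$ (cf. Lemma 10.1.5) and $EX_{nj} = 0$ for all $1 \le j \le r_n$, for any $\epsilon > 0$, by the Lindeberg condition, one gets
\begin{align*}
I_{2n} &\equiv \sum_{j=1}^{r_n}\Big|\phi_{nj}(t) - \Big[1 - \frac{t^2\sigma_{nj}^2}{2}\Big]\Big| \\
&= \sum_{j=1}^{r_n}\Big|E\exp(\iota tX_{nj}) - \Big[1 + \iota tEX_{nj} + \frac{(\iota t)^2}{2!}EX_{nj}^2\Big]\Big| \\
&\le \sum_{j=1}^{r_n} E\min\Big\{\frac{|tX_{nj}|^3}{3!}, |tX_{nj}|^2\Big\} \\
&\le \sum_{j=1}^{r_n} E|tX_{nj}|^3\,I(|X_{nj}| \le \epsilon) + \sum_{j=1}^{r_n} E(tX_{nj})^2\,I(|X_{nj}| > \epsilon) \\
&\le |t|^3\epsilon\sum_{j=1}^{r_n} EX_{nj}^2 + t^2\sum_{j=1}^{r_n} EX_{nj}^2\,I(|X_{nj}| > \epsilon) \\
&\le |t|^3\epsilon + t^2 \cdot o(1) \quad \text{as } n \to \infty. \tag{1.12}
\end{align*}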

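Since $\epsilon \in (0,\infty)$ is arbitrary, $I_{2n} \to 0$ as $n \to \infty$. Next, consider $I_{3n}$. Note that for any $x \in \mathbb{R}$,
\[
|e^{x} - 1 - x| = \Big|\sum_{k=2}^{\infty} x^k/k!\Big| \le x^2\sum_{k=2}^{\infty}\frac{|x|^{k-2}}{k!} \le x^2 e^{|x|}.
\]
Hence, using (1.10) and the fact that $s_n^2 = 1$, one gets
\begin{align*}
I_{3n} &\le \sum_{j=1}^{r_n}\Big(\frac{t^2\sigma_{nj}^2}{2}\Big)^2\exp(t^2\sigma_{nj}^2/2) \\
&\le t^4\exp(t^2\Delta_n/2)\sum_{j=1}^{r_n}\sigma_{nj}^2 \cdot \Delta_n \\
&= t^4\exp(t^2\Delta_n/2)\,\Delta_n \to 0 \quad \text{as } n \to \infty. \tag{1.13}
\end{align*}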

Now (1.9) follows from (1.11), (1.12), and (1.13). This completes the proof of the theorem. $\Box$

Oftentimes, verification of the Lindeberg condition (1.3) becomes difficult as one has to find the truncated second moments of the $X_{nj}$'s. A simpler sufficient condition for the CLT is provided by Lyapounov's condition.

Definition 11.1.3: A triangular array $\{X_{nj} : 1 \le j \le r_n\}_{n \ge 1}$ of independent random variables satisfying (1.2) is said to satisfy Lyapounov's condition if there exists a $\delta \in (0,\infty)$ such that
\[
\lim_{n \to \infty} s_n^{-(2+\delta)}\sum_{j=1}^{r_n} E|X_{nj}|^{2+\delta} = 0, \tag{1.14}
\]
where $s_n^2 = \sum_{j=1}^{r_n} EX_{nj}^2$.

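Note that by Markov's inequality, if a triangular array $\{X_{nj} : 1 \le j \le r_n\}_{n \ge 1}$ satisfies Lyapounov's condition (1.14), then for any $\epsilon \in (0,\infty)$,
\begin{align*}
s_n^{-2}\sum_{j=1}^{r_n} EX_{nj}^2\,I(|X_{nj}| > \epsilon s_n)
&\le s_n^{-2}\sum_{j=1}^{r_n} E|X_{nj}|^2\big(|X_{nj}|/(\epsilon s_n)\big)^{\delta} \\
&= \epsilon^{-\delta}\,s_n^{-(2+\delta)}\sum_{j=1}^{r_n} E|X_{nj}|^{2+\delta} \to 0 \quad \text{as } n \to \infty.
\end{align*}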

Thus, $\{X_{nj} : 1 \le j \le r_n\}_{n \ge 1}$ satisfies the Lindeberg condition (1.3). This observation leads to the following result.

Corollary 11.1.4: (Lyapounov's CLT). Let $\{X_{nj} : 1 \le j \le r_n\}_{n \ge 1}$ be a triangular array of independent random variables satisfying (1.2) and Lyapounov's condition (1.14). Then, (1.6) holds, i.e.,
\[
\frac{S_n}{s_n} \xrightarrow{d} N(0,1).
\]

It is clear that Lyapounov's condition is only a sufficient but not a necessary condition for the validity of the CLT. In contrast, under some regularity conditions on the triangular array $\{X_{nj} : 1 \le j \le r_n\}_{n \ge 1}$, which essentially say that the individual random variables $X_{nj}$ are 'uniformly small', the Lindeberg condition is also a necessary condition for the CLT. This converse is due to W. Feller.

Theorem 11.1.5: (Feller's theorem). Let $\{X_{nj} : 1 \le j \le r_n\}_{n \ge 1}$ be a triangular array of independent random variables satisfying (1.2) such that for any $\epsilon > 0$,
\[
\lim_{n \to \infty}\max_{1 \le j \le r_n} P(|X_{nj}| > \epsilon s_n) = 0, \tag{1.15}
\]
where $s_n^2 = \sum_{j=1}^{r_n} EX_{nj}^2$. Let $S_n = \sum_{j=1}^{r_n} X_{nj}$. If, in addition,
\[
\frac{S_n}{s_n} \xrightarrow{d} N(0,1), \tag{1.16}
\]
then $\{X_{nj} : 1 \le j \le r_n\}_{n \ge 1}$ satisfies the Lindeberg condition.

A triangular array $\{X_{nj} : 1 \le j \le r_n\}_{n \ge 1}$ satisfying (1.15) is called a null array. Thus, the converse of Theorem 11.1.1 holds for null arrays. It may be noted that there exist non-null arrays for which (1.16) holds but the Lindeberg condition fails (Problem 11.9).

Proof: W.l.o.g., suppose that $s_n^2 = 1$ for all $n \ge 1$. Next fix $\epsilon \in (0,\infty)$. Then, setting $t_0 = 4/\epsilon$, and noting that $1 = \sum_{j=1}^{r_n} EX_{nj}^2$ and $\cos x \ge 1 - x^2/2$ for all $x \in \mathbb{R}$, one gets
\begin{align*}
\sum_{j=1}^{r_n} E\big(\cos t_0X_{nj} - 1\big) + \frac{t_0^2}{2}
&= \sum_{j=1}^{r_n} E\Big(\cos t_0X_{nj} - 1 + \frac{t_0^2X_{nj}^2}{2}\Big) \\
&\ge \sum_{j=1}^{r_n} E\Big(\cos t_0X_{nj} - 1 + \frac{t_0^2X_{nj}^2}{2}\Big)I(|X_{nj}| > \epsilon) \\
&\ge \sum_{j=1}^{r_n} E\Big(\frac{t_0^2X_{nj}^2}{2} - 2\Big)I(|X_{nj}| > \epsilon) \\
&\ge \Big(\frac{t_0^2}{2} - \frac{2}{\epsilon^2}\Big)\sum_{j=1}^{r_n} EX_{nj}^2\,I(|X_{nj}| > \epsilon) \\
&= \frac{6}{\epsilon^2}\sum_{j=1}^{r_n} EX_{nj}^2\,I(|X_{nj}| > \epsilon).
\end{align*}

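Hence, the Lindeberg condition would hold if it is shown that for all $t \in \mathbb{R}$,
\begin{align*}
&\sum_{j=1}^{r_n} E\big(\cos tX_{nj} - 1\big) + \frac{t^2}{2} \to 0 \quad \text{as } n \to \infty \\
\Leftrightarrow\ &\exp\Big(\sum_{j=1}^{r_n} E\big(\cos tX_{nj} - 1\big)\Big) \to e^{-t^2/2} \quad \text{as } n \to \infty. \tag{1.17}
\end{align*}
Let $\phi_{nj}(t) = E\exp(\iota tX_{nj})$, $t \in \mathbb{R}$ denote the characteristic function of $X_{nj}$, $1 \le j \le r_n$, $n \ge 1$. Note that $E\cos tX_{nj} = \mathrm{Re}(\phi_{nj}(t))$, where recall that for any complex number $z$, $\mathrm{Re}(z)$ denotes the real part of $z$, i.e., $\mathrm{Re}(z) = a$ if $z = a + \iota b$, $a, b \in \mathbb{R}$. Since the function $h(z) = |z|$ is continuous on $\mathbb{C}$ and $|\exp(\phi_{nj}(t))| = \exp(E\cos tX_{nj})$, it follows that (1.17) holds if, for all $t \in \mathbb{R}$,
\[
\exp\Big(\sum_{j=1}^{r_n}(\phi_{nj}(t) - 1)\Big) \to e^{-t^2/2} \quad \text{as } n \to \infty.
\]
However, by (1.16),
\[
E\exp(\iota tS_n) = \prod_{j=1}^{r_n}\phi_{nj}(t) \to e^{-t^2/2} \quad \text{for all } t \in \mathbb{R}.
\]
Hence, it is enough to show that
\[
I_{1n}(t) \equiv \Big|\exp\Big(\sum_{j=1}^{r_n}[\phi_{nj}(t) - 1]\Big) - \prod_{j=1}^{r_n}\phi_{nj}(t)\Big| \to 0 \quad \text{as } n \to \infty \quad \text{for all } t \in \mathbb{R}. \tag{1.18}
\]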


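Note that for any $\epsilon \in (0,\infty)$, by (1.15) and the inequality $|e^{\iota x} - 1| \le \min\{2, |x|\}$ for all $x \in \mathbb{R}$, one has
\[
|\phi_{nj}(t) - 1| = \big|E\exp(\iota tX_{nj}) - 1\big| \le E\min\{|tX_{nj}|, 2\} \le 2P(|X_{nj}| > \epsilon) + |t|\epsilon
\]
uniformly in $j = 1, \ldots, r_n$. Hence, letting $n \to \infty$ and then $\epsilon \downarrow 0$, by (1.15), one gets
\[
I_{2n}(t) \equiv \max_{1 \le j \le r_n}|\phi_{nj}(t) - 1| = o(1) \quad \text{as } n \to \infty \tag{1.19}
\]
for all $t \in \mathbb{R}$. Further, by the inequality $|e^{\iota x} - 1 - \iota x| \le |x|^2/2$, $x \in \mathbb{R}$,
\[
\sum_{j=1}^{r_n}|\phi_{nj}(t) - 1| = \sum_{j=1}^{r_n}\big|E\exp(\iota tX_{nj}) - 1 - E(\iota tX_{nj})\big| \le \frac{t^2}{2}\sum_{j=1}^{r_n} EX_{nj}^2 = \frac{t^2}{2}s_n^2 = \frac{t^2}{2} \tag{1.20}
\]
uniformly in $t \in \mathbb{R}$, $n \ge 1$. Now fix $t \in \mathbb{R}$. Then, by (1.19), there exists $n_0 \in \mathbb{N}$ such that for all $n \ge n_0$, $\max_{1 \le j \le r_n}|\phi_{nj}(t) - 1| \le 1$. Hence, by the arguments in the proof of Lemma 11.1.3, and by the inequalities $|e^{z}| \le \sum_{k=0}^{\infty}|z|^k/k! = e^{|z|}$ and $|e^{z} - 1 - z| = \big|\sum_{k=2}^{\infty} z^k/k!\big| \le |z|^2\exp(|z|)$, $z \in \mathbb{C}$, for all $n \ge n_0$, one has
\begin{align*}
I_{1n}(t) &= \Big|\prod_{j=1}^{r_n}\exp\big([\phi_{nj}(t) - 1]\big) - \prod_{j=1}^{r_n}\phi_{nj}(t)\Big| \\
&\le \sum_{j=1}^{r_n}\big|\exp\big([\phi_{nj}(t) - 1]\big) - \phi_{nj}(t)\big| \cdot \prod_{k=1}^{r_n - j}\big|\exp\big([\phi_{nk}(t) - 1]\big)\big| \\
&\le \sum_{j=1}^{r_n}\big|\exp\big([\phi_{nj}(t) - 1]\big) - \phi_{nj}(t)\big| \cdot \exp\Big(\sum_{j=1}^{r_n}\big|\phi_{nj}(t) - 1\big|\Big) \\
&\le \sum_{j=1}^{r_n}\big|\exp\big([\phi_{nj}(t) - 1]\big) - \phi_{nj}(t)\big| \cdot \exp\Big(\frac{t^2}{2}\Big) \\
&= \sum_{j=1}^{r_n}\big|\exp\big([\phi_{nj}(t) - 1]\big) - 1 - [\phi_{nj}(t) - 1]\big| \cdot \exp\Big(\frac{t^2}{2}\Big) \\
&\le \sum_{j=1}^{r_n}\big|\phi_{nj}(t) - 1\big|^2 \cdot \exp\Big(1 + \frac{t^2}{2}\Big) \\
&\le \max_{1 \le j \le r_n}|\phi_{nj}(t) - 1|\,\sum_{j=1}^{r_n}|\phi_{nj}(t) - 1|\,\exp\Big(1 + \frac{t^2}{2}\Big) \\
&\le I_{2n}(t) \cdot \frac{t^2}{2} \cdot \exp\Big(1 + \frac{t^2}{2}\Big) \to 0 \quad \text{as } n \to \infty,
\end{align*}
by (1.19) and (1.20). Hence, (1.18) holds. This completes the proof of the theorem. $\Box$

The following example is an application of the Lindeberg CLT for proving asymptotic normality of the least squares estimator of a regression parameter.

Example 11.1.2: Let
\[
Y_j = x_j\beta + \epsilon_j, \quad j = 1, 2, \ldots \tag{1.21}
\]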

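be a simple linear regression model, where $\{x_n\}_{n \ge 1}$ is a given sequence of real numbers, $\beta \in \mathbb{R}$ is the regression parameter and $\{\epsilon_n\}_{n \ge 1}$ is a sequence of iid random variables with $E\epsilon_1 = 0$ and $E\epsilon_1^2 \equiv \sigma^2 \in (0,\infty)$. The least squares estimator of $\beta$ based on $Y_1, \ldots, Y_n$ is given by
\[
\hat{\beta}_n = \sum_{j=1}^{n} x_jY_j/a_n^2, \quad n \ge 1,
\]
where $a_n^2 = \sum_{j=1}^{n} x_j^2$. Suppose that the sequence $\{x_n\}_{n \ge 1}$ satisfies
\[
\max_{1 \le j \le n}\{|x_j|/a_n\} \to 0 \quad \text{as } n \to \infty. \tag{1.22}
\]
Then,
\[
a_n(\hat{\beta}_n - \beta) \xrightarrow{d} N(0, \sigma^2). \tag{1.23}
\]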

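To prove (1.23), note that by (1.21),

\[ a_n (\hat\beta_n - \beta) = a_n \Big[ \sum_{j=1}^{n} x_j Y_j - a_n^2 \beta \Big] \frac{1}{a_n^2} = a_n^{-1} \sum_{j=1}^{n} x_j \epsilon_j \equiv \sum_{j=1}^{n} X_{nj}, \quad \text{say,} \tag{1.24} \]

where $X_{nj} = x_j \epsilon_j / a_n$, $1 \le j \le n$, $n \ge 1$. Note that $EX_{nj} = 0$, $EX_{nj}^2 < \infty$ and $s_n^2 \equiv \sum_{j=1}^{n} EX_{nj}^2 = \sum_{j=1}^{n} x_j^2 E\epsilon_j^2 / a_n^2 = \sigma^2$. Thus, $\{X_{nj} : 1 \le j \le n\}_{n \ge 1}$ is a triangular array of independent random variables satisfying (1.2). Next, let $m_n = \max\{|x_j|/a_n : 1 \le j \le n\}$, $n \ge 1$. Then, by (1.22), for any $\delta \in (0, \infty)$,

\begin{align*}
s_n^{-2} \sum_{j=1}^{n} EX_{nj}^2 I(|X_{nj}| > \delta s_n)
&= \sigma^{-2} a_n^{-2} \sum_{j=1}^{n} x_j^2 E\epsilon_j^2 I(|x_j \epsilon_j / a_n| > \delta\sigma) \\
&\le \sigma^{-2} a_n^{-2} \sum_{j=1}^{n} x_j^2 \cdot E\epsilon_1^2 I(m_n \cdot |\epsilon_1| > \delta\sigma) \\
&= \sigma^{-2} E\epsilon_1^2 I(|\epsilon_1| > \delta\sigma \cdot m_n^{-1}) \to 0 \quad \text{as } n \to \infty
\end{align*}

by the DCT.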

Thus, $\{X_{nj} : 1 \le j \le n\}_{n \ge 1}$ satisfies the Lindeberg condition (1.3) and hence, by Theorem 11.1.1,

\[ \frac{\sum_{j=1}^{n} X_{nj}}{\sigma} \longrightarrow^d N(0, 1), \]

which, in view of (1.24), implies (1.23).

The next result gives a multivariate generalization of Theorem 11.1.1.

Theorem 11.1.6: (A multivariate version of the Lindeberg CLT). For each $n \ge 1$, let $\{X_{nj} : 1 \le j \le r_n\}$ be a collection of independent $k$-dimensional random vectors satisfying

\[ EX_{nj} = 0, \ 1 \le j \le r_n \quad \text{and} \quad \sum_{j=1}^{r_n} EX_{nj} X_{nj}' = I_k, \]

where $I_k$ denotes the identity matrix of order $k$ and for any vector $x$, $x'$ denotes its transpose. Suppose that for every $\epsilon \in (0, \infty)$,

\[ \lim_{n \to \infty} \sum_{j=1}^{r_n} E\|X_{nj}\|^2 I(\|X_{nj}\| > \epsilon) = 0. \]

Then,

\[ \sum_{j=1}^{r_n} X_{nj} \longrightarrow^d N(0, I_k). \]

The proof is a consequence of Theorem 11.1.1 and the Cramer-Wold device (cf. Theorem 10.4.5) and is left as an exercise (Problem 11.17).
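The conclusion (1.23) of Example 11.1.2 can be checked by simulation. The following is a minimal sketch, not part of the original text, under illustrative assumptions: the design points are taken to be $x_j = \sqrt{j}$ (so that $\max_j |x_j|/a_n = \sqrt{2/(n+1)} \to 0$, i.e., (1.22) holds), the errors are uniform on $(-\sqrt{3}, \sqrt{3})$ (mean 0, variance $\sigma^2 = 1$), and the function name `simulate_scaled_errors` is hypothetical.

```python
import math
import random
import statistics


def simulate_scaled_errors(n, beta=2.0, reps=2000, seed=0):
    """Return `reps` draws of a_n * (beta_hat_n - beta) for the model (1.21).

    Assumptions (illustrative only): x_j = sqrt(j), errors uniform on
    (-sqrt(3), sqrt(3)), so that sigma^2 = 1 and (1.22) holds.
    """
    rng = random.Random(seed)
    xs = [math.sqrt(j) for j in range(1, n + 1)]
    a2 = sum(x * x for x in xs)          # a_n^2 = sum of x_j^2
    a = math.sqrt(a2)
    half_width = math.sqrt(3.0)
    draws = []
    for _ in range(reps):
        s = 0.0
        for x in xs:
            eps = rng.uniform(-half_width, half_width)
            y = x * beta + eps           # Y_j = x_j * beta + eps_j
            s += x * y
        beta_hat = s / a2                # least squares estimator
        draws.append(a * (beta_hat - beta))
    return draws


if __name__ == "__main__":
    vals = simulate_scaled_errors(200)
    print("sample mean    :", statistics.fmean(vals))
    print("sample variance:", statistics.pvariance(vals))
```

If (1.23) holds, the sample mean should be close to 0 and the sample variance close to $\sigma^2 = 1$; a histogram of the draws would be expected to look approximately normal.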

11.2 Stable distributions

If $\{X_n\}_{n \ge 1}$ is a sequence of iid $N(\mu, \sigma^2)$ random variables, then for each $k \ge 1$, $S_k \equiv \sum_{i=1}^{k} X_i$ has a $N(k\mu, k\sigma^2)$ distribution. Similarly, if $\{X_n\}_{n \ge 1}$ is a sequence of iid Cauchy $(\mu, \sigma)$ random variables, then for each $k \ge 1$, $S_k \equiv \sum_{i=1}^{k} X_i$ has a Cauchy $(k\mu, k\sigma)$ distribution. Thus, in both cases, for each $k \ge 1$, there exist constants $a_k$ and $b_k$ such that $S_k$ has the same distribution as $a_k X_1 + b_k$ (Problem 11.21).

Definition 11.2.1: A nondegenerate random variable $X$ is called stable if the above property holds, i.e., for each $k \in \mathbb{N}$, there exist constants $a_k$ and $b_k$ such that

\[ S_k =^d a_k X_1 + b_k, \tag{2.1} \]

where $X_1, X_2, \ldots$ are iid random variables with the same distribution as $X$, and $S_k = \sum_{i=1}^{k} X_i$. In this case, the distribution $F_X$ of $X$ is called a stable distribution.

There are two characterizations of stable distributions.

Theorem 11.2.1: A nondegenerate distribution $F$ is stable iff there exists a sequence of iid random variables $\{Y_n\}_{n \ge 1}$ and constants $\{a_n\}_{n \ge 1}$ and $\{b_n\}_{n \ge 1}$ such that $\big( \sum_{i=1}^{n} Y_i - b_n \big) / a_n$ converges in distribution to $F$.

Theorem 11.2.2: A nondegenerate distribution $F$ is stable iff its characteristic function $\phi(t)$ admits the representation

\[ \phi(t) = \exp\big( \iota t c - b|t|^{\alpha} (1 + \iota\lambda\, \mathrm{sgn}(t)\, \omega_{\alpha}(t)) \big) \tag{2.2} \]

where $\iota = \sqrt{-1}$, $-1 \le \lambda \le 1$, $0 < \alpha \le 2$, $0 \le b < \infty$, and the functions $\omega_{\alpha}(t)$ and $\mathrm{sgn}(\cdot)$ are defined as

\[ \omega_{\alpha}(t) = \begin{cases} \tan\frac{\pi\alpha}{2} & \text{if } \alpha \ne 1 \\ \frac{2}{\pi} \log|t| & \text{if } \alpha = 1 \end{cases} \tag{2.3} \]

and

\[ \mathrm{sgn}(t) = \begin{cases} 1 & \text{if } t > 0 \\ -1 & \text{if } t < 0 \\ 0 & \text{if } t = 0. \end{cases} \]

…

\[ f(x) = \frac{1}{\sqrt{2\pi}\, x^{3/2}} \exp\Big( -\frac{1}{2x} \Big), \quad x > 0. \tag{2.4} \]


For an explicit expression for the density of $F$ in other cases, see Feller (1966), Section 17.6.

Definition 11.2.2: The parameter $\alpha$ in (2.2) is called the index of the stable distribution.

Remark 11.2.4: The parameter $\lambda$ is related to the behavior of the ratio of the right tail of the distribution to the left tail through the relation

\[ \lim_{x \to \infty} \frac{1 - F(x)}{F(-x)} = \frac{1 + \lambda}{1 - \lambda}, \tag{2.5} \]

where for $\lambda = 1$, the ratio on the right side of (2.5) is defined to be $+\infty$.

Definition 11.2.3: A function $L : (0, \infty) \to (0, \infty)$ is called slowly varying at $\infty$ if

\[ \lim_{x \to \infty} \frac{L(cx)}{L(x)} = 1 \quad \text{for all } 0 < c < \infty. \tag{2.6} \]

A function $f : (0, \infty) \to (0, \infty)$ is called regularly varying at $\infty$ with index $\alpha \in \mathbb{R}$, $\alpha \ne 0$, if $f(x) = x^{\alpha} L(x)$ for all $x \in (0, \infty)$, where $L(\cdot)$ is slowly varying at $\infty$.

The functions $L_1(t) = \log t$, $L_2(t) = \log(\log t)$, $L_3(t) = (\log t)^2$ are slowly varying at $\infty$, but the function $L_4(t) = t^p$ is not so for $p \ne 0$.

There is a companion result to Theorem 11.2.1 giving necessary and sufficient conditions for convergence of normalized sums of iid random variables to a stable distribution.

Theorem 11.2.3: Let $F$ be a nondegenerate stable distribution with index $\alpha$, $0 < \alpha < 2$. Then in order that a sequence $\{Y_n\}_{n \ge 1}$ of iid random variables admits sequences of constants $\{a_n\}_{n \ge 1}$ and $\{b_n\}_{n \ge 1}$ such that

\[ \frac{\sum_{i=1}^{n} Y_i - b_n}{a_n} \longrightarrow^d F, \tag{2.7} \]

it is necessary and sufficient that

\[ \lim_{x \to \infty} \frac{P(Y_1 > x)}{P(|Y_1| > x)} \equiv \theta \in [0, 1] \quad \text{exists} \tag{2.8} \]

and

\[ P(|Y_1| > x) = x^{-\alpha} L(x), \tag{2.9} \]

where $L(\cdot)$ is a slowly varying function at $\infty$. If (2.8) and (2.9) hold, then the normalizing constants $\{a_n\}_{n \ge 1}$ and $\{b_n\}_{n \ge 1}$ may be chosen to satisfy

\[ n a_n^{-\alpha} L(a_n) \to 1 \quad \text{and} \quad b_n = n E Y_1 I(|Y_1| \le a_n). \tag{2.10} \]
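As a rough numerical illustration of the normalization (2.10), consider a symmetric Pareto-type variable with $P(|Y_1| > x) = x^{-\alpha}$ for $x \ge 1$, so that $L \equiv 1$, $\theta = 1/2$, $a_n = n^{1/\alpha}$ and, by symmetry, $b_n = 0$. The sketch below is an assumption-laden example rather than part of the original text (the choice $\alpha = 1.5$ and the function name `scaled_abs_sum_median` are hypothetical). It estimates the median of $|S_n|/a_n$ for two sample sizes; if the normalization is right, the two values should be of comparable size.

```python
import random
import statistics


def scaled_abs_sum_median(n, alpha=1.5, reps=1000, seed=1):
    """Median of |S_n| / n^(1/alpha) over `reps` replications.

    Y_i is symmetric with P(|Y_i| > x) = x^(-alpha) for x >= 1
    (an illustrative choice with L identically 1 in (2.9)).
    """
    rng = random.Random(seed)
    a_n = n ** (1.0 / alpha)             # solves n * a_n^(-alpha) = 1
    values = []
    for _ in range(reps):
        s = 0.0
        for _ in range(n):
            u = 1.0 - rng.random()       # uniform on (0, 1]
            magnitude = u ** (-1.0 / alpha)
            s += magnitude if rng.random() < 0.5 else -magnitude
        values.append(abs(s) / a_n)
    return statistics.median(values)


if __name__ == "__main__":
    for n in (50, 400):
        print(n, scaled_abs_sum_median(n, seed=n))
```

With the classical scaling $\sqrt{n}$ in place of $n^{1/\alpha}$, the same quantity would be expected to grow with $n$, since $EY_1^2 = \infty$ here.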

Remark 11.2.5: The analog of Theorem 11.2.3 for the case $\alpha = 2$, i.e., for the normal distribution, is the following.

Theorem 11.2.4: Let $\{Y_n\}_{n \ge 1}$ be iid random variables. In order that there exist constants $\{a_n\}_{n \ge 1}$ and $\{b_n\}_{n \ge 1}$ such that

\[ \frac{\sum_{i=1}^{n} Y_i - b_n}{a_n} \longrightarrow^d N(0, 1), \tag{2.11} \]

it is necessary and sufficient that

\[ \frac{x^2 P(|Y_1| > x)}{E Y_1^2 I(|Y_1| \le x)} \to 0 \quad \text{as } x \to \infty. \tag{2.12} \]

Remark 11.2.6: Note that condition (2.12) holds if $EY_1^2 < \infty$. However, if $P(|Y_1| > x) \sim C/x^2$ as $x \to \infty$, then $EY_1^2 = \infty$ and the classical CLT (cf. Corollary 11.1.2) fails. However, in this case, (2.12) holds and $\sum_{i=1}^{n} Y_i$ is asymptotically normal with a suitable centering and scaling (different from $\sqrt{n}$) (Problem 11.20).

Here, only the proof of Theorem 11.2.1 will be given. Further, a proof of Theorem 11.2.3, sufficiency part, is also outlined. For the rest, see Feller (1966) or Gnedenko and Kolmogorov (1968).

For proving Theorem 11.2.1, the following result is needed.

Theorem 11.2.5: (Khinchine's theorem on convergence of types). Let $\{W_n\}_{n \ge 1}$ be a sequence of random variables such that for some sequences $\{\alpha_n\}_{n \ge 1} \subset [0, \infty)$ and $\{\beta_n\}_{n \ge 1} \subset \mathbb{R}$, both $W_n$ and $\alpha_n W_n + \beta_n$ converge in distribution to nondegenerate distributions $G$ and $H$ on $\mathbb{R}$, respectively. Then $\lim_{n \to \infty} \alpha_n = \alpha$ and $\lim_{n \to \infty} \beta_n = \beta$ exist with $0 < \alpha < \infty$ and $-\infty < \beta < \infty$.

Proof: Let $\{W_n'\}_{n \ge 1}$ be a sequence of random variables such that for each $n \ge 1$, $W_n$ and $W_n'$ have the same distribution and $W_n$ and $W_n'$ are independent. Then $Y_n \equiv W_n - W_n'$ and $Z_n \equiv \alpha_n (W_n - W_n')$ both converge in distribution to nondegenerate limits, say $\tilde G$ and $\tilde H$. Indeed, $\tilde G = G * \bar G$ and $\tilde H = H * \bar H$, where $*$ denotes convolution and $\bar G$, $\bar H$ denote the distributions of $-W$ and $-V$ for $W \sim G$ and $V \sim H$. This implies that $\{\alpha_n\}_{n \ge 1}$ cannot have $0$ or $\infty$ as limit points. Also, if $0 < \alpha \le \alpha' < \infty$ are two limit points of $\{\alpha_n\}_{n \ge 1}$, then $\tilde H(x) = \tilde G(x/\alpha) = \tilde G(x/\alpha')$ for all $x$. Since $\tilde G(\cdot)$ is nondegenerate, $\alpha$ must equal $\alpha'$ and so $\lim_{n \to \infty} \alpha_n$ exists in $(0, \infty)$. This implies that $\lim_{n \to \infty} \beta_n$ exists in $\mathbb{R}$. □

Proof of Theorem 11.2.1: The 'only if' part follows from the definition of $F$ being stable, since one can take $\{Y_n\}_{n \ge 1}$ to be iid with distribution $F$.

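For the 'if' part, let $\{Y_n\}_{n \ge 1}$ be iid random variables such that there exist constants $\{a_n\}_{n \ge 1}$ and $\{b_n\}_{n \ge 1}$ such that as $n \to \infty$,

\[ \frac{\sum_{i=1}^{n} Y_i - b_n}{a_n} \longrightarrow^d F. \]

To show that $F$ is stable, fix an integer $r \ge 1$. Let $\{X_n\}_{n \ge 1}$ be iid random variables with distribution $F$. Then as $k \to \infty$,

\[ \frac{\sum_{i=1}^{kr} Y_i - b_{kr}}{a_{kr}} \longrightarrow^d X_1. \]

Also, the left side above equals

\[ \sum_{j=0}^{r-1} \frac{\sum_{i=jk+1}^{(j+1)k} Y_i - b_k}{a_k} \cdot \frac{a_k}{a_{kr}} + \frac{r b_k - b_{kr}}{a_{kr}} = \alpha_{kr} \sum_{j=0}^{r-1} \eta_{jk} + \beta_{kr}, \quad \text{say,} \]

where $\alpha_{kr} = a_k / a_{kr}$, $\eta_{jk} = \big( \sum_{i=jk+1}^{(j+1)k} Y_i - b_k \big) / a_k$ and $\beta_{kr} = (r b_k - b_{kr}) / a_{kr}$. Since $\{\eta_{jk} : j = 0, 1, \ldots, r-1\}$ are independent and for each $j$, $\eta_{jk} \longrightarrow^d X_{j+1}$ as $k \to \infty$, it follows that as $k \to \infty$,

\[ W_k = \sum_{j=0}^{r-1} \eta_{jk} \longrightarrow^d \sum_{j=0}^{r-1} X_{j+1} = \sum_{j=1}^{r} X_j. \]

Also, as $k \to \infty$,

\[ \alpha_{kr} W_k + \beta_{kr} \longrightarrow^d X_1. \]

Since $F$ is nondegenerate, both $X_1$ and $\sum_{j=1}^{r} X_j$ are nondegenerate random variables. Thus, as $k \to \infty$, $W_k$ and $\alpha_{kr} W_k + \beta_{kr}$ converge in distribution to nondegenerate random variables. Thus, by Khinchine's theorem on convergence of types proved above, it follows that $\alpha_{kr} \to \alpha_r$ and $\beta_{kr} \to \beta_r$, with $0 < \alpha_r < \infty$ and $-\infty < \beta_r < \infty$. This yields that for each $r \in \mathbb{N}$, $\sum_{j=1}^{r} X_j$ has the same distribution as $\frac{1}{\alpha_r}(X_1 - \beta_r)$, i.e., $X_1$ is stable. □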

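Proof of the sufficiency part of Theorem 11.2.3: (Outline). The proof is based on the continuity theorem. The characteristic function of $T_n \equiv (S_n - b_n)/a_n$ is

\[ \phi_n(t) = E\big( e^{\iota t (S_n - b_n)/a_n} \big) = \Big[ \phi\Big( \frac{t}{a_n} \Big) e^{-\iota t b_n'/a_n} \Big]^n \equiv \Big( 1 + \frac{1}{n} h_n(t) \Big)^n, \]

where $b_n' = b_n/n$ and $h_n(t) = n \big( \phi(t/a_n) e^{-\iota t b_n'/a_n} - 1 \big)$. Let $G(\cdot)$ be the cdf of $Y_1$. Then

\[ h_n(t) = n \int \big( e^{\iota t (y - b_n')/a_n} - 1 \big)\, dG(y) = \int \big( e^{\iota t x} - 1 \big)\, \mu_n(dx), \]

where $\mu_n(A) = nP(Y_1 \in b_n' + a_n A)$, $A \in \mathcal{B}(\mathbb{R})$. If $A = (u, v]$, $0 < u < v < \infty$, then

\[ nP(Y_1 \in b_n' + a_n A) = nP(a_n u + b_n' < Y_1 \le a_n v + b_n') = nP(Y_1 > a_n u + b_n') - nP(Y_1 > a_n v + b_n'). \]

By (2.8)–(2.10),

\[ nP(Y_1 > a_n x) = \frac{P(Y_1 > a_n x)}{P(Y_1 > a_n)} \cdot nP(Y_1 > a_n) \to \theta x^{-\alpha} \quad \text{for } x > 0. \]

By using (2.10), it can be shown that

\[ \frac{b_n'}{a_n} = \frac{E Y_1 I(|Y_1| \le a_n)}{a_n} = O\Big( \frac{a_n^{1-\alpha} L(a_n)}{a_n} \Big) = o(1) \quad \text{as } n \to \infty. \]

Hence, it follows that

\[ nP(Y_1 > a_n u + b_n') - nP(Y_1 > a_n v + b_n') \to \theta (u^{-\alpha} - v^{-\alpha}). \]

Similarly, for $A = (-v, -u]$,

\[ nP(Y_1 \in b_n' + a_n A) \to (1 - \theta)(u^{-\alpha} - v^{-\alpha}). \]

This suggests that $h_n(t)$ should approach

\[ \theta\alpha \int_0^{\infty} (e^{\iota t x} - 1)\, x^{-(\alpha+1)}\, dx + (1 - \theta)\alpha \int_{-\infty}^{0} (e^{\iota t x} - 1)\, |x|^{-(\alpha+1)}\, dx. \]

But there are integrability problems for $|x|^{-(\alpha+1)}$ near 0 and so a more careful analysis is needed. It can be shown that

\[ \lim_{n \to \infty} h_n(t) = \iota t c + \theta\alpha \int_0^{\infty} \Big( e^{\iota t x} - 1 - \frac{\iota t x}{1 + x^2} \Big) x^{-(\alpha+1)}\, dx + (1 - \theta)\alpha \int_{-\infty}^{0} \Big( e^{\iota t x} - 1 - \frac{\iota t x}{1 + x^2} \Big) |x|^{-(\alpha+1)}\, dx, \]

where $c$ is a constant. The right side is continuous at $t = 0$ and so, the result follows by the continuity theorem. For details, see Feller (1966). □

Remark 11.2.7: By the necessity part of Theorem 11.2.3, every stable distribution $F$ must satisfy

\[ 1 - F(x) = \theta x^{-\alpha} L(x), \qquad F(-x) = (1 - \theta) x^{-\alpha} L(x) \]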

for large $x$, where $0 \le \theta \le 1$, $L(\cdot)$ is slowly varying at $\infty$ and $0 < \alpha < 2$. This implies that $F$ has finite absolute moments of order $p$ for every $p < \alpha$. Distributions satisfying the above tail condition are called heavy tailed and arise in many applications. The Pareto distribution in economics is an example of a heavy tailed distribution.

Remark 11.2.8: One way to generate heavy tailed distributions is as follows. If $Y$ is a positive random variable such that there exist $0 < c < \infty$ and $0 < p < \infty$ satisfying

\[ P(Y < y) \sim c y^p \quad \text{as } y \downarrow 0, \]

then the random variable $X = Y^{-q}$ has the property

\[ P(X > x) = P(Y < x^{-1/q}) \sim c x^{-p/q} \quad \text{as } x \to \infty. \]

If $p < 2q$, then $X$ has heavy tails. Thus if $\{Y_n\}_{n \ge 1}$ are iid Gamma(1,2), then $n^{-1} \sum_{i=1}^{n} Y_i^{-1}$ converges in distribution to a one sided Cauchy distribution (Problem 11.15).

Definition 11.2.4: Let $F$ and $G$ be two probability distributions on $\mathbb{R}$. Then $G$ is said to belong to the domain of attraction of $F$ if there exist a sequence of iid random variables $\{Y_n\}_{n \ge 1}$ with distribution $G$ and constants $\{a_n\}_{n \ge 1}$ and $\{b_n\}_{n \ge 1}$ such that

\[ \frac{\sum_{i=1}^{n} Y_i - b_n}{a_n} \longrightarrow^d F. \]

Theorem 11.2.1 says that the only nondegenerate distributions $F$ that admit a nonempty domain of attraction are the stable distributions.
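The construction in Remark 11.2.8 is easy to check empirically. The sketch below is illustrative and not from the original text: it assumes $Y$ is uniform on $(0, 1]$, so that $P(Y < y) = y$ for small $y$ (that is, $c = 1$ and $p = 1$), and the function name `empirical_tail` is hypothetical. For $X = Y^{-q}$ the remark then predicts $P(X > x) \approx x^{-1/q}$.

```python
import random


def empirical_tail(x, q=1.0, n=200_000, seed=0):
    """Monte Carlo estimate of P(Y^(-q) > x) for Y uniform on (0, 1].

    With P(Y < y) = y (c = 1, p = 1 in Remark 11.2.8), the predicted
    tail is x^(-1/q) for x >= 1.
    """
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n):
        y = 1.0 - rng.random()           # uniform on (0, 1], never zero
        if y ** (-q) > x:
            exceed += 1
    return exceed / n


if __name__ == "__main__":
    for q, x in ((1.0, 10.0), (2.0, 100.0)):
        print(q, x, empirical_tail(x, q=q), x ** (-1.0 / q))
```

In both cases shown, $p = 1 < 2q$, so the resulting $X$ is heavy tailed in the sense of the remark.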

11.3 Infinitely divisible distributions

Definition 11.3.1: A random variable $X$ (and its distribution) is called infinitely divisible if for each integer $k \in \mathbb{N}$, there exist iid random variables $X_{k1}, X_{k2}, \ldots, X_{kk}$ such that $\sum_{j=1}^{k} X_{kj}$ has the same distribution as $X$.

Examples include constants (degenerate distributions), normal, Poisson, Cauchy, and Gamma distributions. But distributions with bounded support cannot be infinitely divisible unless they are degenerate. In fact, if $X$ is infinitely divisible satisfying $P(|X| \le M) = 1$ for some $M < \infty$, then the $X_{ki}$'s in the above definition must satisfy $P(|X_{k1}| \le M/k) = 1$ and so $\mathrm{Var}(X_{k1}) \le EX_{k1}^2 \le M^2/k^2$, implying $\mathrm{Var}(X) = k \mathrm{Var}(X_{k1}) \le M^2/k$ for each $k \ge 1$. Hence $\mathrm{Var}(X)$ must be zero, and the random variable $X$ is a constant w.p. 1.

The following results are easy to establish.


Theorem 11.3.1: (a) If $X$ and $Y$ are independent and infinitely divisible, then $X + Y$ is also infinitely divisible. (b) If $X_n$ is infinitely divisible for each $n \in \mathbb{N}$ and $X_n \longrightarrow^d X$, then $X$ is infinitely divisible.

Proof: (a) Follows from the definition. (b) For each $k \ge 1$ and $n \ge 1$, there exist iid random variables $X_{nk1}, X_{nk2}, \ldots, X_{nkk}$ such that $X_n$ and $\sum_{j=1}^{k} X_{nkj}$ have the same distribution. Now fix $k \ge 1$. Then for any $y > 0$,

\[ \big[ P(X_{nk1} > y) \big]^k = P(X_{nkj} > y \text{ for all } j = 1, \ldots, k) \le P(X_n > ky) \]

and similarly,

\[ \big[ P(X_{nk1} < -y) \big]^k \le P(X_n \le -ky). \]

Since $X_n \longrightarrow^d X$, the distributions of $\{X_n\}_{n \ge 1}$ are tight and so are those of $\{X_{nk1}\}_{n=1}^{\infty}$. So if $F_k$ is a weak limit point of $\{X_{nk1}\}_{n=1}^{\infty}$ and if $\{Y_{kj}\}_{j=1}^{k}$ are iid with distribution $F_k$, then $X$ and $\sum_{j=1}^{k} Y_{kj}$ have the same distribution and so $X$ is infinitely divisible. □

A large class of infinitely divisible distributions is generated by the compound Poisson family.

Definition 11.3.2: Let $\{Y_n\}_{n \ge 1}$ be iid random variables and let $N$ be a Poisson $(\lambda)$ random variable, independent of the $\{Y_n\}_{n \ge 1}$. The random variable $X \equiv \sum_{i=1}^{N} Y_i$ is said to have a compound Poisson distribution.

Theorem 11.3.2: A compound Poisson distribution is infinitely divisible.

Proof: Let $X$ be a random variable as in Definition 11.3.2. For each $k \ge 1$, let $\{N_i\}_{i=1}^{k}$ be iid Poisson random variables with mean $\lambda/k$ that are independent of $\{Y_n\}_{n \ge 1}$. Let

\[ X_{kj} = \sum_{i=T_j+1}^{T_{j+1}} Y_i, \quad 1 \le j \le k, \]

where $T_1 = 0$, $T_j = \sum_{i=1}^{j-1} N_i$, $2 \le j \le k+1$. Then $\{X_{kj}\}_{j=1}^{k}$ are iid, and $\sum_{j=1}^{k} X_{kj}$ and $X$ are identically distributed, and so $X$ is infinitely divisible. □

Although the converse to the above is not valid, it is known that every infinitely divisible distribution is the limit of a sequence of centered and scaled compound Poisson distributions. This is a consequence of a deep result, stated below, giving an explicit formula for the characteristic function of an infinitely divisible distribution. For a proof of this result, see Feller (1966) and Chung (1974) or Gnedenko and Kolmogorov (1968).

Theorem 11.3.3: (Levy-Khinchine representation theorem). Let $X$ be an infinitely divisible random variable. Then its characteristic function $\phi(t) \equiv E(e^{\iota t X})$ is of the form

\[ \phi(t) = \exp\Big\{ \iota t c - \beta \frac{t^2}{2} + \int_{\mathbb{R}} \Big( e^{\iota t x} - 1 - \frac{\iota t x}{1 + x^2} \Big) \mu(dx) \Big\}, \]

where $c \in \mathbb{R}$, $\beta \ge 0$ and $\mu$ is a measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such that $\mu(\{0\}) = 0$, $\int_{|x| \le 1} x^2 \mu(dx) < \infty$ and $\mu(\{x : |x| > 1\}) < \infty$.

Corollary 11.3.4: Stable distributions are infinitely divisible.

Proof: The normal distribution corresponds to the case $\mu(\cdot) \equiv 0$ and $\beta > 0$. For nonnormal stable laws with index $\alpha < 2$, set $\beta = 0$ and $\mu(dx) = \theta x^{-(\alpha+1)}\, dx$ for $x > 0$ and $(1 - \theta)|x|^{-(\alpha+1)}\, dx$ for $x < 0$. □

Corollary 11.3.5: Every infinitely divisible distribution is the limit of centered and scaled compound Poisson distributions.

Proof: Since the normal distribution can be obtained as a (weak) limit of centered and scaled Poisson distributions, it is enough to consider the case when $\beta = 0$, $c = 0$. Let $\mu_n(A) = \mu(A \cap \{x : |x| > n^{-1}\})$, $A \in \mathcal{B}(\mathbb{R})$, and let

\[ \phi_n(t) = \exp\Big\{ \int \Big( e^{\iota t x} - 1 - \frac{\iota t x}{1 + x^2} \Big) \mu_n(dx) \Big\} = \exp\Big\{ \lambda_n \int (e^{\iota t x} - 1)\, \tilde\mu_n(dx) - \iota t c_n \Big\}, \]

where $\tilde\mu_n(A) = \mu_n(A)/\mu_n(\mathbb{R})$, $A \in \mathcal{B}(\mathbb{R})$, $\lambda_n = \mu_n(\mathbb{R})$, and $c_n = \lambda_n \int \frac{x}{1 + x^2}\, \tilde\mu_n(dx)$. Thus, $\phi_n(\cdot)$ is a compound Poisson characteristic function centered at $c_n$, with Poisson parameter $\lambda_n$ and with the compounding distribution $\tilde\mu_n$. By the DCT, $\phi_n(t) \to \phi(t)$ for each $t \in \mathbb{R}$. Hence by the Levy-Cramer continuity theorem, the result follows. □

Another characterization of infinitely divisible distributions is similar to that of stable distributions. Recall that a stable distribution is one that is the limit of normalized sums of iid random variables and conversely.

Theorem 11.3.6: A random variable $X$ is infinitely divisible iff it is the limit in distribution of a sequence $\{X_n\}_{n \ge 1}$ where for each $n$, $X_n$ is the sum of $n$ iid random variables $\{X_{nj}\}_{j=1}^{n}$. Thus $X$ is infinitely divisible iff it is the limit in distribution of the row sums of a triangular array of random variables where in each row, all the random variables are iid.
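The decomposition used in the proof of Theorem 11.3.2 can be illustrated by simulation. The following sketch is not from the original text and rests on illustrative assumptions: the summands $Y_i$ are taken to be Exp(1), the Poisson variates are generated by a simple multiplication method suitable only for small means, and all function names are hypothetical. It compares a compound Poisson($\lambda$) sample with the sum of $k$ independent compound Poisson($\lambda/k$) pieces; both should have mean $\lambda EY_1 = \lambda$ and variance $\lambda EY_1^2 = 2\lambda$.

```python
import math
import random
import statistics


def poisson(rng, lam):
    """Poisson(lam) variate by multiplying uniforms (fine for small lam)."""
    limit = math.exp(-lam)
    count, prod = 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count


def compound_poisson(rng, lam):
    """X = Y_1 + ... + Y_N with N ~ Poisson(lam), Y_i ~ Exp(1) (assumed)."""
    return sum(rng.expovariate(1.0) for _ in range(poisson(rng, lam)))


def split_sum(rng, lam, k):
    """Sum of k independent compound Poisson(lam / k) pieces."""
    return sum(compound_poisson(rng, lam / k) for _ in range(k))


def moments(sample):
    return statistics.fmean(sample), statistics.pvariance(sample)


if __name__ == "__main__":
    rng = random.Random(0)
    lam, k, reps = 3.0, 5, 20_000
    direct = [compound_poisson(rng, lam) for _ in range(reps)]
    pieces = [split_sum(rng, lam, k) for _ in range(reps)]
    print("direct:", moments(direct))
    print("pieces:", moments(pieces))
```

Agreement of the first two moments is of course only a sanity check; a fuller comparison would look at the empirical distribution functions of the two samples.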

Proof: The 'only if' part follows from the definition. For the 'if' part, fix $k \ge 1$. Then $X_{k \cdot n}$ can be written as

\[ X_{k \cdot n} = \sum_{j=1}^{k} Y_{jn}, \]

where $Y_{jn} = \sum_{r=(j-1)n+1}^{jn} X_{k \cdot n, r}$, $j = 1, 2, \ldots, k$. By hypothesis, $X_{k \cdot n} \longrightarrow^d X$. Now, $\{Y_{jn}\}_{j=1}^{k}$ are iid and it can be shown, as in the proof of Theorem 11.3.1, that for each $i = 1, \ldots, k$, $\{Y_{in}\}_{n=1}^{\infty}$ are tight and hence converge in distribution to a limit $Y_i$ through a subsequence, and that $X$ and $\sum_{i=1}^{k} Y_i$ have the same distribution. Thus, $X$ is infinitely divisible. □

11.4 Refinements and extensions of the CLT

This section is devoted to studying some refinements and generalizations of the basic CLT results, such as the rate of convergence in the CLT, Edgeworth expansions and large deviations for sums of iid random variables, and also a generalization of the basic CLT to a functional version.

11.4.1 The Berry-Esseen theorem

Let $X_1, X_2, \ldots$ be a sequence of iid random variables with $EX_1 = \mu$ and $\mathrm{Var}(X_1) = \sigma^2 \in (0, \infty)$. Then, Corollary 11.1.2 and Polya's theorem imply that

\[ \Delta_n \equiv \sup_{x \in \mathbb{R}} \Big| P\Big( \frac{S_n - n\mu}{\sigma\sqrt{n}} \le x \Big) - \Phi(x) \Big| \to 0 \quad \text{as } n \to \infty, \tag{4.1} \]

where $S_n = X_1 + \cdots + X_n$, $n \ge 1$, and $\Phi(\cdot)$ is the cdf of the $N(0, 1)$ distribution. A natural question that arises in this context is "how fast does $\Delta_n$ go to zero?" Berry (1941) and Esseen (1942) independently proved that $\Delta_n = O(n^{-1/2})$ as $n \to \infty$, provided $E|X_1|^3 < \infty$. This result is referred to as the Berry-Esseen theorem.

Theorem 11.4.1: (The Berry-Esseen theorem). Let $\{X_n\}_{n \ge 1}$ be a sequence of iid random variables with $EX_1 = \mu$, $\mathrm{Var}(X_1) = \sigma^2 \in (0, \infty)$ and $E|X_1|^3 < \infty$. Then, for all $n \ge 1$,

\[ \Delta_n \equiv \sup_{x \in \mathbb{R}} \Big| P\Big( \frac{S_n - n\mu}{\sigma\sqrt{n}} \le x \Big) - \Phi(x) \Big| \le C \cdot \frac{E|X_1 - \mu|^3}{\sigma^3 \sqrt{n}}, \tag{4.2} \]

where $C \in (0, \infty)$ is a constant.

The value of the constant $C \in (0, \infty)$ does not depend on $n$ or on any characteristics of the distribution of $X_1$. Indeed, the proof of Theorem 11.4.1 below shows that

\[ C \le \sqrt{\frac{2}{\pi}} \cdot \Big[ \frac{5}{2} + \frac{12}{\pi} \Big] < 5.05. \]
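For a concrete feel for (4.2), $\Delta_n$ can be computed exactly when the $X_i$ are Bernoulli($p$), since $S_n$ is then binomial and the supremum is attained at a jump point of its cdf. The sketch below is an illustration only, not part of the original text: the function names are hypothetical, the constant $C = 5.05$ is simply the bound quoted above, and the direct evaluation of binomial probabilities in floating point is intended for moderate $n$ (up to a few hundred).

```python
import math


def normal_cdf(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def kolmogorov_distance_bernoulli(n, p=0.5):
    """Exact Delta_n of (4.1) for iid Bernoulli(p) summands."""
    sigma = math.sqrt(p * (1.0 - p))
    cdf = 0.0
    worst = 0.0
    for k in range(n + 1):
        x = (k - n * p) / (sigma * math.sqrt(n))    # standardized jump point
        target = normal_cdf(x)
        worst = max(worst, abs(cdf - target))       # left limit at the jump
        cdf += math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
        worst = max(worst, abs(cdf - target))       # value at the jump
    return worst


def berry_esseen_bound(n, p=0.5, C=5.05):
    """Right side of (4.2) with the constant bound quoted in the text."""
    sigma = math.sqrt(p * (1.0 - p))
    rho = p * (1.0 - p) * ((1.0 - p) ** 2 + p**2)   # E|X_1 - p|^3
    return C * rho / (sigma**3 * math.sqrt(n))


if __name__ == "__main__":
    for n in (25, 100, 400):
        d = kolmogorov_distance_bernoulli(n)
        print(n, d, d * math.sqrt(n), berry_esseen_bound(n))
```

For $p = 1/2$ the product $\sqrt{n}\,\Delta_n$ stays roughly constant as $n$ grows, which is consistent with the $O(n^{-1/2})$ rate, while the bound (4.2) is considerably larger than the actual distance.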


The following result plays an important role in the proof of Theorem 11.4.1. Lemma 11.4.2: (A smoothing inequality). Let F be a cdf on R with xdF (x) = 0 and characteristic function ζ(t) = exp(ιtx)dF (x), t ∈ R. Let G : R function with → R be a diﬀerentiable derivative g such that lim|x|→∞ F (x) − G(x) = 0. Suppose that (1 + |x|)|g(x)|dx < ∞, ∞ r x g(x)dx = 0 for r = 0, 1 and |g(x)| ≤ C0 for all x ∈ R, for some −∞ C0 ∈ (0, ∞). Then, for any T ∈ (0, ∞), * * 1 T |ζ(t) − ξ(t)| 24C0 * * sup *F (x) − G(x)* ≤ dt + π −T |t| πT x∈R where ξ(t) =

∞ −∞

(4.3)

exp(ιtx)g(x)dx, t ∈ R.

For a proof of Lemma 11.4.2, see Feller (1966).

The next lemma deals with an expansion of the logarithm of the characteristic function of $X$ in a neighborhood of zero. Let $z = re^{i\theta}$, $r \in (0, \infty)$, $\theta \in [0, 2\pi)$ be the polar representation of a nonzero complex number $z$. Then, the (principal branch of the) complex logarithm of $z$ is defined as

\[ \log z = \log r + i\theta. \tag{4.4} \]

The function $\log z$ is infinitely differentiable on the set $\{z \in \mathbb{C} : z = re^{i\theta}, r \in (0, \infty), 0 \le \theta < 2\pi\}$ and has a convergent Taylor's series expansion around 1 on the unit disc:

\[ \log(1 + z) = \sum_{k=1}^{\infty} (-1)^{k-1} z^k / k \quad \text{for } |z| < 1. \tag{4.5} \]

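Lemma 11.4.3: Let $Y$ be a random variable with $EY = 0$, $\tilde\sigma^2 = EY^2 \in (0, \infty)$, $\tilde\rho = E|Y|^3 < \infty$ and characteristic function $\phi_Y(t) = E\exp(\iota t Y)$, $t \in \mathbb{R}$. Then, for all $t \in \big[ -\frac{1}{\tilde\sigma}, \frac{1}{\tilde\sigma} \big]$,

\[ \Big| \log\phi_Y(t) + \frac{t^2 \tilde\sigma^2}{2} \Big| \le \frac{5}{12} |t|^3 \tilde\rho \tag{4.6} \]

and

\[ \Big| \log\phi_Y(t) - \Big[ \frac{(\iota t)^2}{2!} \tilde\sigma^2 + \frac{(\iota t)^3}{3!} EY^3 \Big] \Big| \le E\min\Big\{ \frac{|tY|^3}{3}, \frac{(tY)^4}{24} \Big\} + \frac{t^4 \tilde\sigma^4}{4}. \tag{4.7} \]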

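Proof: Note that by Lemma 10.1.5,

\[ |\phi_Y(t) - 1| = \big| E[\exp(\iota t Y) - 1 - \iota t Y] \big| \le \frac{t^2 EY^2}{2} \le \frac{1}{2} \tag{4.8} \]

whenever $|t| \le \tilde\sigma^{-1}$. In particular, $\log\phi_Y(t)$ is well defined for all $t \in [-\tilde\sigma^{-1}, \tilde\sigma^{-1}]$. By (4.5), (4.8), and Lemma 10.1.5, for $|t| \le \tilde\sigma^{-1}$,

\begin{align*}
\Big| \log\phi_Y(t) + \frac{t^2 \tilde\sigma^2}{2} \Big|
&= \Big| \log\big( 1 + [\phi_Y(t) - 1] \big) + \frac{t^2 \tilde\sigma^2}{2} \Big| \\
&\le \Big| \phi_Y(t) - 1 + \frac{t^2 \tilde\sigma^2}{2} \Big| + \sum_{k=2}^{\infty} |\phi_Y(t) - 1|^k / k \\
&\le E\Big[ (tY)^2 \wedge \frac{|tY|^3}{3!} \Big] + \frac{1}{2} \Big( \frac{t^2 \tilde\sigma^2}{2} \Big)^2 \sum_{k=2}^{\infty} \Big( \frac{1}{2} \Big)^{k-2} \\
&\le \frac{|t|^3 \tilde\rho}{6} + \frac{t^4 \tilde\sigma^4}{4}.
\end{align*}

Now using the bounds $|t\tilde\sigma| \le 1$ and $\tilde\sigma^3 = (EY^2)^{3/2} \le E|Y|^3 = \tilde\rho$, one gets (4.6). The proof of (4.7) is similar and hence, it is left as an exercise (Problem 11.27). □

Proof of Theorem 11.4.1: W.l.o.g., set $\mu = 0$ and $\sigma = 1$. Then, $X_1, X_2, \ldots$ are iid zero mean, unit variance random variables. Let $X =^d X_1$, $\rho = E|X|^3$ and let $\phi_X(\cdot)$ denote the characteristic function of $X$. It is easy to check that the conditions of Lemma 11.4.2 hold with $F(x) = P\big( \frac{S_n}{\sqrt{n}} \le x \big)$, $G(x) = \Phi(x)$, $x \in \mathbb{R}$, and $C_0 = \frac{1}{\sqrt{2\pi}}$. Hence, by Lemma 11.4.2, with $T = \sqrt{n}/\rho$,

\[ \Delta_n \le \frac{1}{\pi} \int_{-T}^{T} \frac{\big| \phi_X^n\big( \frac{t}{\sqrt{n}} \big) - e^{-t^2/2} \big|}{|t|}\, dt + \frac{24\rho}{\pi\sqrt{2\pi n}}. \tag{4.9} \]

By Lemma 11.4.3 (with $Y = \frac{X_1 - \mu}{\sigma}$ and $t$ replaced by $\frac{t}{\sqrt{n}}$),

\[ r_n(t) \equiv \Big| n \log\phi_X\Big( \frac{t}{\sqrt{n}} \Big) + \frac{t^2}{2} \Big| = n \Big| \log\phi_X\Big( \frac{t}{\sqrt{n}} \Big) + \Big( \frac{t}{\sqrt{n}} \Big)^2 \frac{\sigma^2}{2} \Big| \le \frac{5}{12} \cdot \frac{\rho |t|^3}{\sqrt{n}} \tag{4.10} \]

for all $|t| \le \sqrt{n}$, $n \ge 1$.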


√ Since ρ = E|X1 |3 ≥ (EX12 )3/2 = σ 3 = 1, |T | ≤ n. Hence, using the inequality |ez − 1| ≤ |z|e|z| for all z ∈ C and (4.10), one gets * * t 2 * * n − e−t /2 * *φX √ n * * t2 t t2 * * − 1* · exp − + = * exp n · log φX √ 2 2 n t2 ≤ |rn (t)| exp |rn (t)| · exp − 2 t2 5ρ|t| 5ρ 3 √ |t| exp − 1− √ ≤ 2 12 n 6 n 2 t 5ρ √ |t|3 exp − (4.11) ≤ 12 12 n √ t2 ∞ 2 √ ≤ 1, i.e., for all |t| ≤ T , n ≥ 1. Since dt = 6 2π, t exp − 12 for all ρ|t| −∞ n 0 " # 2 the theorem follows from (4.9) and (4.11) with C = π2 52 + 12 π . A striking feature of Theorem 11.4.1 is that the upper bound on ∆n in (4.2) is valid for all n ≥ 1. Also, under the conditions of Theorem 11.4.1, the rate O( √1n ) in (4.2) is the best possible in the sense that there exist random variables for which ∆n is bounded below by a constant multiple of −nµ √1 (cf. Problem 11.29). Edgeworth expansions of the cdf of Sn √ , to be n σ n developed in the next section, can be used to show that for certain random variables X1 satisfying additional moment and symmetry conditions, ∆n may go to zero at a faster rate. (For example, consider X1 ∼ N (µ, σ 2 ).) For iid sequences {Xn }n≥1 with E|X1 |2+δ < ∞ for some δ ∈ (0, 1], Theorem 11.4.1 can be strengthened to show that ∆n decreases at the rate O(n−δ/2 ) as n → ∞ (cf. Chow and Teicher (1997), Chapter 9).

11.4.2 Edgeworth expansions

Recall from Chapter 10 that a random variable $X_1$ is called lattice if there exist $a \in \mathbb{R}$ and $h \in (0, \infty)$ such that

\[ P\big( X_1 \in \{a + ih : i \in \mathbb{Z}\} \big) = 1. \tag{4.12} \]

The largest $h$ satisfying (4.12) is called the span of (the distribution of) $X_1$. A random variable $X_1$ is called nonlattice if it is not a lattice random variable. From Proposition 10.1.1, it follows that $X_1$ is nonlattice iff

\[ \big| E\exp(\iota t X_1) \big| < 1 \quad \text{for all } t \ne 0. \tag{4.13} \]

The next result gives an Edgeworth expansion for the cdf of $\frac{S_n - n\mu}{\sigma\sqrt{n}}$ with an error of order $o(n^{-1/2})$ for nonlattice random variables.

Theorem 11.4.4: Let $\{X_n\}_{n \ge 1}$ be a sequence of iid random variables with $EX_1 = \mu$, $\mathrm{Var}(X_1) = \sigma^2 \in (0, \infty)$ and $E|X_1|^3 < \infty$. Suppose, in addition, that $X_1$ is nonlattice, i.e., it satisfies (4.13). Then,

\[ \sup_{x \in \mathbb{R}} \Big| P\Big( \frac{S_n - n\mu}{\sigma\sqrt{n}} \le x \Big) - \Big[ \Phi(x) - \frac{1}{\sqrt{n}} \cdot \frac{\mu_3}{6\sigma^3} (x^2 - 1)\phi(x) \Big] \Big| = o(n^{-1/2}) \quad \text{as } n \to \infty, \tag{4.14} \]

where $\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$, $x \in \mathbb{R}$, and $\mu_3 = E(X_1 - \mu)^3$.
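A small numerical experiment can make the gain from the correction term in (4.14) visible. The sketch below is an illustration under stated assumptions, not part of the original text: the summands are taken to be Exp(1) (so $\mu = \sigma = 1$, $\mu_3 = 2$, and the distribution is nonlattice), for which $S_n$ has an Erlang cdf that can be evaluated exactly; the supremum is approximated on a finite grid, and the function names are hypothetical.

```python
import math


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def erlang_cdf(n, s):
    """P(S_n <= s) for S_n a sum of n iid Exp(1) variables."""
    if s <= 0.0:
        return 0.0
    term = math.exp(-s)                  # Poisson(s) probability of 0
    total = term
    for k in range(1, n):
        term *= s / k                    # Poisson(s) probability of k
        total += term
    return 1.0 - total


def sup_errors(n, grid_points=801):
    """Grid approximation of the sup-norm errors of the normal and the
    one-term Edgeworth approximations to the cdf of (S_n - n)/sqrt(n)."""
    mu3 = 2.0                            # E(X_1 - 1)^3 for Exp(1)
    worst_normal = 0.0
    worst_edgeworth = 0.0
    for i in range(grid_points):
        x = -4.0 + 8.0 * i / (grid_points - 1)
        exact = erlang_cdf(n, n + x * math.sqrt(n))
        normal = normal_cdf(x)
        edgeworth = normal - mu3 / (6.0 * math.sqrt(n)) * (x * x - 1.0) * normal_pdf(x)
        worst_normal = max(worst_normal, abs(exact - normal))
        worst_edgeworth = max(worst_edgeworth, abs(exact - edgeworth))
    return worst_normal, worst_edgeworth


if __name__ == "__main__":
    for n in (10, 40, 160):
        print(n, sup_errors(n))
```

For skewed summands such as these, the Edgeworth-corrected approximation would be expected to have a visibly smaller error than the plain normal approximation at each $n$.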

The function

\[ e_{n,2}(x) \equiv \Phi(x) - \frac{1}{\sqrt{n}} \cdot \frac{\mu_3}{6\sigma^3} (x^2 - 1)\phi(x), \quad x \in \mathbb{R}, \tag{4.15} \]

is called a second order Edgeworth expansion for $T_n \equiv \frac{S_n - n\mu}{\sigma\sqrt{n}}$. The above theorem shows that the cdf of the normalized sum $T_n$ can be approximated by the second order Edgeworth expansion with accuracy $o(n^{-1/2})$. It can be shown that if $E|X_1|^4 < \infty$ and $X_1$ satisfies Cramer's condition:

\[ \limsup_{|t| \to \infty} \big| E\exp(\iota t X_1) \big| < 1, \tag{4.16} \]

then the bound on the right side of (4.14) can be improved to $O(n^{-1})$. Note that for a symmetric random variable $X_1$, having a finite fourth moment and satisfying (4.16), the second term in $e_{n,2}(x)$ is zero and the rate of normal approximation becomes $O(n^{-1})$. Higher order Edgeworth expansions for $T_n$ can be derived using (4.16) and arguments similar to those in the proof of Theorem 11.4.4, but the form of the expansion becomes more complicated. See Petrov (1975), Bhattacharya and Rao (1986), and Hall (1992) for detailed accounts of the Edgeworth expansion theory.

Proof of Theorem 11.4.4: W.l.o.g., let $\mu = 0$ and $\sigma = 1$. In Lemma 11.4.2, take $F(x) = P(T_n \le x)$ and $G(x) = e_{n,2}(x)$, $x \in \mathbb{R}$. Then, it is easy to verify that the conditions of Lemma 11.4.2 hold with $g(x) = g_n(x) \equiv \phi(x) + \frac{\mu_3}{6\sqrt{n}} (x^3 - 3x)\phi(x)$, $x \in \mathbb{R}$. Using repeated differentiation on both sides of the identity (inversion formula)

\[ \frac{e^{-x^2/2}}{\sqrt{2\pi}} = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-\iota t x} \cdot e^{-t^2/2}\, dt, \quad x \in \mathbb{R}, \]

one can show that

\[ -(x^3 - 3x) \frac{e^{-x^2/2}}{\sqrt{2\pi}} = \frac{d^3}{dx^3} \Big[ \frac{e^{-x^2/2}}{\sqrt{2\pi}} \Big] = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-\iota t x} (-\iota t)^3 e^{-t^2/2}\, dt, \quad x \in \mathbb{R}. \]

Hence,

\[ \xi_n(t) \equiv \xi(t) = \int e^{\iota t x} g_n(x)\, dx = e^{-t^2/2} \Big[ 1 + \frac{\mu_3}{6\sqrt{n}} (\iota t)^3 \Big], \quad t \in \mathbb{R}. \tag{4.17} \]


√ Next, let ∈ (0, 1) be given. Then, set T = c n where c ≡ c = 24 · sup

3 |µ3 | 3 |x − 3x| φ(x) : x ∈ R . 1+ 6

Then, by Lemma 11.4.2 and (4.17), ∆n,2

* * * Sn − nµ * * √ ≡ sup *P ≤ x − en,2 (x)** σ n x∈R * * √ * * t n 1 c n *φX √n − ξ(t)* ≤ dt + √ , √ π −c n |t| n

(4.18)

where φX (t) = E exp(ιtX), t ∈ R and X =d X1 . Let ρ = E|X1 |3 . Let M ∈ (1, ∞) be such that E|X1 |3 I(|X1 | > M ) ≤ /2. Then, setting δ = 2M ρ , it √ |t| 4√ 3 follows that E|X1 | n I(|X1 | ≤ M ) ≤ M δE|X1 | ≤ /2 for all |t| ≤ δ n. √ Hence, for all |t| ≤ δ n, by (4.7) of Lemma 11.4.3, * * t (ιt)2 * µ3 ιt 3 ** * √ rn,2 (t) ≡ n* log φX √ + − * 2n 6 n n |X |3 |X |4 |t| * t *3 t4 * * 1 1 √ , ≤ n · * √ * E min + 2 3 24 4n n n |t| |t|3 √ E |X1 |3 I(|X1 | > M ) + E |X1 |4 √ I(|X1 | ≤ M ) ≤ 3 n n |t| 3 +√ n 4 |t|3 (4.19) ≤ √ . n Also, note that for any complex numbers z, w, |ez − 1 − w|

≤ ≤

|ez − ew | + |ew − 1 − w| # " 1 |z − w| + |w|2 exp |z| ∨ |w| . 2

(4.20)

√ Hence, by (4.10), (4.19), and (4.20), it follows that for all |t| ≤ δ n, * t * * n * − ξn (t)* *φX √ n * * t t2 * 2 * µ3 − 1 − √ (ιt)3 **e−t /2 = ** exp n log φX √ + 2 n 6 n *2 * 2 * µ 1 ** µ3 * * 3 * ≤ rn,2 (t) + * √ (ιt)3 * exp rn (t) ∨ * √ t3 * e−t /2 . 2 6 n 6 n

11.4 Reﬁnements and extensions of the CLT

√ 5 ≤ 12 |t|2 for all |t| ≤ δ n, δ√n **φn ( √t ) − ξ (t)** n X n dt √ |t| −δ n δ √n t2 µ23 2 5 √ exp − ≤ · |t| + dt · t √ 72n 12 n −δ n ≤ C1 · √ n

Since rn (t) ∨

367

|µ3 t|3 √ 6 n

(4.21)

for some C1 ∈ (0, ∞). By (4.13), * t *n √ √ * * sup *φX √ * : δ n < |t| < c n ≤ θn n for some θ ∈ (0, 1). Hence, * * n t *φ √ − ξn (t)* X n

dt |t| 2 ρ 1 2θn log(c/δ) + √ e−t /2 1 + √ |t|3 dt, √ δ n |t|>δ n n

√ √ δ n x) assert that a similar behavior holds for many distributions P (X other than the normal distribution on the (negative) logarithmic scale. The main result of this section is the following. Theorem 11.4.6: Let {Xn }n≥1 be a sequence of iid nondegenerate random variables with φ(t) ≡ EetX1 < ∞ for all t > 0. (4.27) Let µ = EX1 . Then, for all x ∈ (µ, θ) ¯ n ≥ x) = −γ(x), lim n−1 log P (X

n→∞

(4.28)

where γ(x) = sup{tx − log φ(t)} and θ = sup{x ∈ R : P (X1 ≤ x) < 1}. (4.29) t>0

Note that under (4.27), EX1+ < ∞ and hence, µ ≡ EX1 is well deﬁned, and µ ∈ [−∞, ∞). For proving the theorem, the following results are needed. Lemma 11.4.7: Let X1 be a nondegenerate random variable satisfying (4.27). Let µ = EX1 and let γ(x), θ be as in (4.29). Then,

(i) the function $\phi(t)$ is infinitely differentiable on $(0, \infty)$ with $\phi^{(r)}(t)$, the $r$th derivative of $\phi(t)$, being given by

\[ \phi^{(r)}(t) = E\big( X_1^r e^{tX_1} \big), \quad t \in (0, \infty), \ r \in \mathbb{N}, \tag{4.30} \]

(ii) $\lim_{t \downarrow 0} \phi(t) = 1$, $\lim_{t \downarrow 0} \phi^{(1)}(t) = \mu$, and $\lim_{t \to \infty} \frac{\phi^{(1)}(t)}{\phi(t)} = \theta$,

(iii) for every $x \in (\mu, \theta)$, there exists a unique solution $a_x \in \mathbb{R}$ to the equation

\[ x = \phi^{(1)}(a_x) / \phi(a_x) \tag{4.31} \]

such that $\gamma(x) = x a_x - \log\phi(a_x)$.

Proof: Let $F$ denote the cdf of $X_1$. (i) Note that for any $h \ne 0$,

\[ h^{-1} \big[ \phi(t + h) - \phi(t) \big] = \int_{-\infty}^{\infty} \frac{e^{hx} - 1}{h} \cdot e^{tx}\, dF(x). \]

As $h \to 0$, the integrand converges to $x e^{tx}$ for all $x, t$. Also, $\big| \frac{e^{hx} - 1}{h} \big| \le \sum_{k=1}^{\infty} |h^{k-1} x^k| / k! \le |x| e^{|hx|}$ for all $h, x$. Hence, for any $x \in \mathbb{R}$, $t \in (0, \infty)$, and $0 < |h| < t/2$, the integrand is bounded above by

\begin{align*}
|x| e^{|hx|} e^{tx} &= |x| e^{(t - |h|)x} I_{(-\infty, 0)}(x) + |x| e^{(t + |h|)x} I_{(0, \infty)}(x) \\
&\le |x| e^{-t|x|/2} I_{(-\infty, 0)}(x) + |x| e^{3tx} I_{(0, \infty)}(x) \\
&\equiv g(x), \quad \text{say.} \tag{4.32}
\end{align*}

Since $\int g(x)\, dF(x) < \infty$, by the DCT, it follows that

\[ \lim_{h \to 0} \frac{\phi(t + h) - \phi(t)}{h} \quad \text{exists and equals} \quad \int x e^{tx}\, dF(x) \]

for all $t \in (0, \infty)$. Thus, $\phi(t)$ is differentiable on $(0, \infty)$ with $\phi^{(1)}(t) = EX_1 e^{tX_1}$, $t \in (0, \infty)$. Now, using induction and similar arguments, one can complete the proof of part (i) (Problem 11.34).

Next consider (ii). Since $e^{tx} \le I_{(-\infty, 0]}(x) + e^x I_{(0, \infty)}(x)$ for all $x \in \mathbb{R}$, $t \in (0, 1)$, by the DCT, the first relation follows. For the second, note that

\[ |x| e^{tx} I_{(-\infty, 0)}(x) \uparrow |x| I_{(-\infty, 0]}(x) \quad \text{as } t \downarrow 0 \tag{4.33} \]

and $|x| e^{tx} \le |x| e^x$ for all $0 < t \le 1$, $x > 0$. Hence, applying the MCT for $x \in (-\infty, 0]$ and the DCT for $x \in (0, \infty)$, one obtains the second limit. Derivation of the third limit is left as an exercise (Problem 11.35).


To prove part (iii), ﬁx x ∈ (µ, θ) and let γ(t) = tx − log φ(t), t ≥ 0. Then, for t ∈ (0, ∞), φ(1) (t) φ(t) (2) φ (t) φ(1) (t) 2 γ (2) (t) = − = Var(Yt ), (4.34) φ(t) φ(t) y where Yt is a random variable with cdf P (Yt ≤ y) = −∞ etu dF (x)/φ(t), y ∈ R. Since X1 is nondegenerate, so is Yt (for any t ≥ 0) and hence, Var(Yt ) > 0. As a consequence, the second derivative of the function γ(t) is positive. And the minimum of γ(t) over (0, ∞) is attained by a solution to the equation γ (1) (t) = 0, i.e., by t = ax satisfying (4.30). That such a solution exists and is unique follows from part (ii) and the facts that x > µ, γ (1) (t)

φ(1) (0+) φ(0)

=

x−

= µ (by (ii)), and that

φ(1) (t) φ(t)

is continuous and strictly increasing

on (0, ∞) (as for any t ∈ (0, ∞), the derivative of γ

(2)

φ(1) (t) φ(t)

(t), which is positive by (4.34)). This proves part (iii).

coincides with 2

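Lemma 11.4.8: Let $\{X_n\}_{n \ge 1}$ be as in Theorem 11.4.6. For $t \in (0, \infty)$, let $\{Y_{t,n}\}_{n \ge 1}$ be a sequence of iid random variables with cdf

\[ P(Y_{t,1} \le y) = \int_{-\infty}^{y} e^{tu}\, dF(u) / \phi(t), \quad y \in \mathbb{R}, \]

where $F$ is the cdf of $X_1$. Let $\nu_n$ and $\lambda_n$ denote the probability distributions of $S_n \equiv X_1 + \cdots + X_n$ and $T_{n,t} = Y_{t,1} + \cdots + Y_{t,n}$, $n \ge 1$. Then, for each $n \ge 1$,

\[ \nu_n \ll \lambda_n \quad \text{and} \quad \frac{d\nu_n}{d\lambda_n}(x) = e^{-tx} \phi(t)^n, \quad x \in \mathbb{R}. \tag{4.35} \]

Proof: The proof is by induction. Clearly, the assertion holds for $n = 1$. Next, suppose that (4.35) is true for some $r \in \mathbb{N}$ and let $n = r + 1$. Then, for any $A \in \mathcal{B}(\mathbb{R})$,

\begin{align*}
\nu_n(A) &= P(X_1 + \cdots + X_n \in A) \\
&= \int_{-\infty}^{\infty} P(X_1 + \cdots + X_{n-1} \in A - x)\, dF(x) \\
&= \int_{-\infty}^{\infty} \int_{A - x} \frac{d\nu_{n-1}}{d\lambda_{n-1}}(u)\, d\lambda_{n-1}(u)\, dF(x) \\
&= \int_{-\infty}^{\infty} \int_{A - x} e^{-tu} \phi(t)^{n-1}\, d\lambda_{n-1}(u)\, dF(x) \\
&= [\phi(t)]^n \int_{-\infty}^{\infty} \int_{A - x} e^{-t(u + x)}\, d\lambda_{n-1}(u)\, d\lambda_1(x) \\
&= [\phi(t)]^n \int_{A} e^{-tv}\, (\lambda_{n-1} * \lambda_1)(dv),
\end{align*}

where $*$ denotes convolution. Since $\lambda_{n-1} * \lambda_1 = \lambda_n$, the result follows. □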

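Proof of Theorem 11.4.6: Fix $x \in (\mu, \theta)$. Note that by Markov's inequality, for any $t > 0$, $n \ge 1$,

\[ P(\bar X_n \ge x) = P\big( e^{t\bar X_n} \ge e^{tx} \big) \le e^{-tx} E\big( e^{t\bar X_n} \big) = \exp\big( -tx + n \log\phi(t/n) \big). \]

Hence,

\begin{align*}
& n^{-1} \log P(\bar X_n \ge x) \le -x \cdot \frac{t}{n} + \log\phi\Big( \frac{t}{n} \Big) \quad \text{for all } t > 0, \ n \ge 1 \\
\Rightarrow \ & \limsup_{n \to \infty} n^{-1} \log P(\bar X_n \ge x) \le \inf_{t > 0} \{ -xt + \log\phi(t) \} = -\gamma(x). \tag{4.36}
\end{align*}

This yields the upper bound. Next it will be shown that

\[ \liminf_{n \to \infty} n^{-1} \log P(\bar X_n \ge x) \ge -\gamma(x). \tag{4.37} \]

To that end, let $\{Y_{t,n}\}_{n \ge 1}$, $\nu_n$, and $\lambda_n$ be as in Lemma 11.4.8. Also, let $a_x$ be as in (4.31). Then, for any $y > x$, $t \in (a_x, \infty)$, and $n \ge 1$, by Lemma 11.4.8,

\begin{align*}
P(\bar X_n \ge x) &= \nu_n\big( [nx, \infty) \big) \\
&= \int_{[nx, \infty)} e^{-tu} \phi(t)^n\, d\lambda_n(u) \\
&\ge \int_{[nx, ny]} e^{-tu} \phi(t)^n\, d\lambda_n(u) \\
&\ge \phi(t)^n e^{-tny} \lambda_n\big( [nx, ny] \big). \tag{4.38}
\end{align*}

Note that $EY_{t,1} = \int u\, d\lambda_1(u) = \phi^{(1)}(t)/\phi(t)$. Since $\phi^{(1)}(\cdot)/\phi(\cdot)$ is strictly increasing and continuous on $(0, \infty)$, given $y > x$, there exists a $t = t_y \in (a_x, \infty)$ such that

\[ y > \frac{\phi^{(1)}(t)}{\phi(t)} > \frac{\phi^{(1)}(a_x)}{\phi(a_x)} = x. \tag{4.39} \]

By the WLLN, for any $y > x$ and $t$ satisfying (4.39),

\[ \lambda_n\big( [nx, ny] \big) = P\Big( x \le \frac{Y_{t,1} + \cdots + Y_{t,n}}{n} \le y \Big) \to 1 \quad \text{as } n \to \infty. \]

Hence, from (4.38), it follows that

\[ \liminf_{n \to \infty} n^{-1} \log P(\bar X_n \ge x) \ge -ty + \log\phi(t) \]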


for all y > x and all t ∈ (ax , ∞) satisfying (4.39). Now, letting t ↓ ax ﬁrst and then y ↓ x, one gets (4.37). This completes the proof of Theorem 11.4.6. 2 Remark 11.4.1: If (4.27) holds and θ < ∞, then " # ¯ n ≥ θ = P (X1 = θ) n P X so that

¯ n ≥ θ = log P (X1 = θ). lim n−1 log P X

n→∞

In this case, (4.28) holds for x = θ with γ(θ) = − log P (X1 = θ). For x > θ, (4.28) holds with γ(x) = +∞. Remark 11.4.2: Suppose that there exists a t0 ∈ (0, ∞) such that, instead of (4.27), the following condition holds: φ(t)

=

+∞ for all t > t0

< ∞

for all t ∈ (0, t0 ),

and φ (t)/φ(t) increases to a ﬁnite limit θ0 as t ↑ t0 . Then, θ must be +∞. In this case, it can be shown that (4.28) holds for all x ∈ (µ, θ0 ) (with the given deﬁnition of γ(x)) and that (4.28) holds for all x ∈ [θ0 , ∞), with γ(x) ≡ t0 x − log φ(t0 ). See Theorem 9.6, Chapter 1, Durrett (2004).
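As a numerical illustration of the upper bound (4.36) and of Remark 11.4.1, the following sketch takes $X_1$ to be Bernoulli(p), an assumed example that is not worked in the text. For this law $\phi(t) = 1 - p + pe^t$, θ = 1, and the supremum of $tx - \log\phi(t)$ has the closed form coded in `gamma_bernoulli` for p < x < 1. The exact binomial tail is computed on the log scale to avoid underflow.

```python
import math

def gamma_bernoulli(x, p):
    """sup_t {t*x - log phi(t)} for Bernoulli(p); closed form valid for p < x < 1."""
    return x * math.log(x / p) + (1 - x) * math.log((1 - x) / (1 - p))

def log_tail(n, x, p):
    """Exact value of n^{-1} log P(sample mean of n Bernoulli(p) variables >= x)."""
    k0 = math.ceil(n * x - 1e-9)   # smallest count k with k/n >= x (guarding against rounding)
    logs = [
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1 - p)
        for k in range(k0, n + 1)
    ]
    m = max(logs)                  # log-sum-exp for numerical stability
    return (m + math.log(sum(math.exp(v - m) for v in logs))) / n

p, x = 0.5, 0.7
rate = gamma_bernoulli(x, p)
gaps = {n: -rate - log_tail(n, x, p) for n in (10, 100, 1000)}
boundary = log_tail(50, 1.0, p)    # Remark 11.4.1: equals log P(X_1 = theta) with theta = 1
```

Each entry of `gaps` is nonnegative, which is the Markov-inequality bound for fixed n, and the gap shrinks as n grows, in line with (4.37).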

11.4.4 The functional central limit theorem

Let $\{X_i\}_{i\ge1}$ be iid random variables with $EX_1 = 0$, $EX_1^2 = 1$. Let $S_0 = 0$, $S_n = \sum_{i=1}^n X_i$, n ≥ 1. The central limit theorem says that as n → ∞,

$$W_n \equiv \frac{S_n}{\sqrt{n}} \longrightarrow^{d} N(0, 1).$$

Now let Wn and for any

j n

≤t

C0 ) = α and rejects the hypothesis that F is the distribution of {Xi }i≥1 if KS(Fn ) > C0 and accepts it, otherwise.

11.5 Problems

11.1 Show that the triangular array $\{X_{nj} : 1 \le j \le n\}_{n\ge1}$, with $X_{nj}$ as in (1.24), is a null array, i.e., satisfies (1.15) iff (1.22) holds.

11.2 Construct an example of a triangular array $\{X_{nj} : 1 \le j \le r_n\}_{n\ge1}$ of independent random variables such that for any $1 \le j \le r_n$, n ≥ 1, $E|X_{nj}|^{\alpha} = \infty$ for all α ∈ (0, ∞), but there exist sequences $\{a_n\}_{n\ge1} \subset (0, \infty)$ and $\{b_n\}_{n\ge1} \subset \mathbb{R}$ such that

$$\frac{S_n - b_n}{a_n} \longrightarrow^{d} N(0, 1).$$

11.3 Let $\{X_n\}_{n\ge1}$ be a sequence of independent random variables with

$$P(X_n = \pm 1) = \frac{1}{2} - \frac{1}{2\sqrt{n}}, \quad P(X_n = \pm n^2) = \frac{1}{2\sqrt{n}}, \quad n \ge 1.$$

Find constants $\{a_n\}_{n\ge1} \subset (0, \infty)$ and $\{b_n\}_{n\ge1} \subset \mathbb{R}$ such that

$$\frac{\sum_{j=1}^n X_j - b_n}{a_n} \longrightarrow^{d} N(0, 1).$$

11.4 Let $\{X_n\}_{n\ge1}$ be a sequence of independent random variables such that for some $\alpha \ge \frac12$,

$$P(X_n = \pm n^{\alpha}) = \frac{n^{1-2\alpha}}{2} \quad \text{and} \quad P(X_n = 0) = 1 - n^{1-2\alpha}, \quad n \ge 1.$$

Let $S_n = \sum_{j=1}^n X_j$ and $s_n^2 = \mathrm{Var}(S_n)$.

(a) Show that for all $\alpha \in [\frac12, 1)$,

$$\frac{S_n}{s_n} \longrightarrow^{d} N(0, 1). \tag{5.1}$$

(b) Show that (5.1) fails for α ∈ [1, ∞).

(c) Show that for α > 1, $\{S_n\}_{n\ge1}$ converges to a random variable S w.p. 1 and that $s_n \to \infty$.


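11.5 Let $\{X_{nj} : 1 \le j \le r_n\}_{n\ge1}$ be a triangular array of independent zero mean random variables satisfying the Lindeberg condition. Show that

$$E\Big(\max_{1\le j\le r_n} \frac{X_{nj}^2}{s_n^2}\Big) \to 0 \quad \text{as } n \to \infty,$$

where $s_n^2 = \sum_{j=1}^{r_n} \mathrm{Var}(X_{nj})$, n ≥ 1.

11.6 Let $\{X_n\}_{n\ge1}$ be a sequence of random variables. Let $S_n = \sum_{j=1}^n X_j$ and $s_n^2 = \sum_{j=1}^n EX_j^2 < \infty$, n ≥ 1. If $s_n^2 \to \infty$ then show that

$$\lim_{n\to\infty} s_n^{-2} \sum_{j=1}^n EX_j^2\, I(|X_j| > \epsilon s_n) = 0 \quad \text{for all } \epsilon > 0$$

$$\iff \lim_{n\to\infty} s_n^{-2} \sum_{j=1}^n EX_j^2\, I(|X_j| > \epsilon s_j) = 0 \quad \text{for all } \epsilon > 0.$$

(Hint: Verify that for any δ > 0,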

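j:sj 2, show that

$$\lim_{n\to\infty} s_n^{-r} \sum_{j=1}^n E|X_j|^r\, I(|X_j| > \epsilon s_n) = 0 \quad \text{for all } \epsilon \in (0, \infty)$$

$$\iff \lim_{n\to\infty} s_n^{-r} \sum_{j=1}^n E|X_j|^r = 0, \tag{5.2}$$

where $s_n^2 = \sum_{j=1}^n EX_j^2$.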

11.8 Let $\{X_n\}_{n\ge1}$ be a sequence of zero mean independent random variables satisfying (5.2) for r = 4.

(a) Show that

$$\lim_{n\to\infty} E\big(s_n^{-1} S_n\big)^k = EZ^k$$

for all k = 2, 3, 4, where Z ∼ N(0, 1).

(b) Show that $\big\{(S_n/s_n)^4\big\}_{n\ge1}$ is uniformly integrable.

(c) Show that $\lim_{n\to\infty} E h(S_n/s_n) = E h(Z)$ where $h(\cdot) : \mathbb{R} \to \mathbb{R}$ is continuous and $h(x) = O(|x|^4)$ as |x| → ∞.

11.9 Let $\{X_n\}_{n\ge1}$ be a sequence of independent random variables such that

$$P(X_n = \pm 1) = \frac14, \quad P(X_n = \pm n) = \frac{1}{4n^2} \quad \text{and} \quad P(X_n = 0) = \frac12\Big(1 - \frac{1}{n^2}\Big), \quad n \ge 1.$$


(a) Show that the triangular array $\{X_{nj} : 1 \le j \le n\}_{n\ge1}$ with $X_{nj} = X_j/\sqrt{n}$, 1 ≤ j ≤ n, n ≥ 1 does not satisfy the Lindeberg condition.

(b) Show that there exists σ ∈ (0, ∞) such that

$$\frac{S_n}{\sqrt{n}} \longrightarrow^{d} N(0, \sigma^2).$$

Find $\sigma^2$.

11.10 Let $\{X_j\}_{j\ge1}$ be independent random variables such that $X_j$ has Uniform [−j, j] distribution. Show that the Lindeberg-Feller condition holds for the triangular array $X_{nj} = X_j/n^{3/2}$, 1 ≤ j ≤ n, n ≥ 1.

11.11 (CLT for random sums). Let $\{X_i\}_{i\ge1}$ be iid random variables with $EX_1 = 0$, $EX_1^2 = 1$. Let $\{N_n\}_{n\ge1}$ be a sequence of positive integer valued random variables such that $N_n/n \longrightarrow^{p} c$, 0 < c < ∞. Show that

$$\frac{S_{N_n}}{\sqrt{N_n}} \longrightarrow^{d} N(0, 1).$$

(Hint: Use Kolmogorov's first inequality (cf. 8.3.1) to show that

$$P\Big(\frac{|S_{N_n} - S_{nc}|}{\sqrt{n}} > \lambda,\ |N_n - nc| < n\epsilon\Big) \le \frac{\epsilon}{\lambda^2}$$

for any ε > 0, λ > 0.)

11.12 Let {N(t) : t ≥ 0} be the renewal process as defined in (5.1) of Section 8.5. Assume $EX_1 = \mu \in (0, \infty)$, $EX_1^2 < \infty$. Show that

$$\frac{N(t) - t/\mu}{\sqrt{t}} \longrightarrow^{d} N(0, \sigma^2) \tag{5.3}$$

for some $0 < \sigma^2 < \infty$. Find $\sigma^2$. (Hint: Use

$$\frac{S_{N(t)} - N(t)\mu}{\sqrt{N(t)}} \le \frac{t - N(t)\mu}{\sqrt{N(t)}} \le \frac{S_{N(t)+1} - (N(t)+1)\mu}{\sqrt{N(t)}} + \frac{\mu}{\sqrt{N(t)}},$$

and the fact that $N(t)/t \to 1/\mu$ w.p. 1.)

11.13 Let {N(t) : t ≥ 0} be as in the above problem. Give another proof of (5.3) by using the CLT for $\{S_n\}_{n\ge1}$ and the relation $P(N(t) > n) = P(S_n < t)$ for all t, n.

11.14 Let $\{X_j\}_{j\ge1}$ be iid random variables with distribution $P(X_1 = 1) = 1/2 = P(X_1 = -1)$. Show that there exist positive integer valued random variables $\{r_k\}_{k\ge1}$ such that $r_k \to \infty$ w.p. 1, but $S_{r_k}/\sqrt{r_k}$ does not converge in distribution. (Hint: Let $r_1 = \min\{n : S_n/\sqrt{n} > 1\}$ and for k ≥ 1, define recursively $r_{k+1} = \min\{n : n > r_k,\ S_n/\sqrt{n} > k + 1\}$.)


11.15 (CLT for sample quantiles). Let $\{X_i\}_{i\ge1}$ be iid random variables. Let 0 < p < 1 and let $Y_n \equiv F_n^{-1}(p) = \inf\{x : F_n(x) \ge p\}$, where $F_n(x) \equiv n^{-1}\sum_{i=1}^n I(X_i \le x)$ is the empirical cdf based on $X_1, X_2, \ldots, X_n$. Assume that the cdf $F(x) \equiv P(X_1 \le x)$ is differentiable at $F^{-1}(p) \equiv \inf\{x : F(x) \ge p\}$ and that $\lambda_p \equiv F'\big(F^{-1}(p)\big) > 0$. Then show that

$$\sqrt{n}\,\big(Y_n - F^{-1}(p)\big) \longrightarrow^{d} N(0, \sigma^2),$$

where $\sigma^2 = p(1-p)/\lambda_p^2$. (Hint: Use the identity $P(Y_n \le x) = P(F_n(x) \ge p)$ for all x and p.)

11.16 (A coupon collector's problem). For each n ∈ N, let $\{X_{ni}\}_{i\ge1}$ be iid random variables such that $P(X_{n1} = j) = \frac1n$, 1 ≤ j ≤ n. Let $T_{n0} = 1$, $T_{n1} = \min\{j : j > 1,\ X_j \ne X_1\}$, and

$$T_{n(i+1)} = \min\big\{j : j > T_{ni},\ X_j \notin \{X_{T_{nk}} : 0 \le k \le i\}\big\}.$$

That is, $T_{ni}$ is the first time the sample has (i + 1) distinct elements. Suppose $k_n \uparrow \infty$ such that $k_n/n \to \theta$, 0 < θ < 1. Show that for some $a_n$, $b_n$,

$$\frac{T_{n,k_n} - a_n}{b_n} \longrightarrow^{d} N(0, 1).$$

(Hint: Let $Y_{nj} = T_{nj} - T_{n(j-1)}$, j = 1, 2, . . . , (n − 1). Show that for each n, $\{Y_{nj}\}_{j=1,2,\ldots}$ are independent with $Y_{nj}$ having a geometric distribution with parameter $1 - \frac{j}{n}$. Now apply Lyapounov's criterion to the triangular array $\{Y_{nj} : 1 \le j \le k_n\}$.)

11.17 Prove Theorem 11.1.6.

11.18 Let $\{X_n\}_{n\ge1}$ be a sequence of iid random variables with $EX_n = 0$ and $EX_n^2 = \sigma^2 \in (0, \infty)$. Let $S_n = \sum_{j=1}^n X_j$, n ≥ 1. For each k ∈ N, find the limit distribution of the k-dimensional vector(s)

(a) $\Big(\dfrac{S_n}{\sqrt{n}}, \dfrac{S_{2n} - S_n}{\sqrt{n}}, \ldots, \dfrac{S_{nk} - S_{n(k-1)}}{\sqrt{n}}\Big)$,

(b) $\Big(\dfrac{S_{na_1}}{\sqrt{n}}, \dfrac{S_{na_2}}{\sqrt{n}}, \ldots, \dfrac{S_{na_k}}{\sqrt{n}}\Big)$, where $0 < a_1 < a_2 < \cdots < a_k < \infty$ are given real numbers,

(c) $\Big(\dfrac{S_{2n}}{\sqrt{n}}, \dfrac{S_{3n} - S_n}{\sqrt{n}}, \ldots, \dfrac{S_{(k+1)n} - S_{(k-1)n}}{\sqrt{n}}\Big)$.

11.19 For any random variable X, show that $EX^2 < \infty$ implies

$$\frac{y^2\, P(|X| > y)}{E\big(X^2 I(|X| \le y)\big)} \to 0 \quad \text{as } y \to \infty.$$

Give an example to show that the converse is false. (Hint: Consider a random variable X with pdf $f(x) = c_1 |x|^{-3}$ for |x| > 2.)


11.20 Let $\{X_n\}_{n\ge1}$ be a sequence of iid random variables with common distribution

$$P(X_1 \in A) = \int_A |x|^{-3}\, I(|x| > 1)\, dx, \quad A \in \mathcal{B}(\mathbb{R}).$$

Find sequences $\{a_n\}_{n\ge1} \subset (0, \infty)$ and $\{b_n\}_{n\ge1} \subset \mathbb{R}$ such that

$$\frac{S_n - b_n}{a_n} \longrightarrow^{d} N(0, 1)$$

where $S_n = \sum_{j=1}^n X_j$, n ≥ 1.

11.21 Show using characteristic functions that if $X_1, X_2, \ldots, X_k$ are iid Cauchy (µ, σ²) random variables, then $S_k \equiv \sum_{i=1}^k X_i$ has a Cauchy (kµ, kσ) distribution.

11.22 Show that if a random variable $Y_1$ has pdf f as in (2.4), then (2.9) holds with α = 1/2.

11.23 If $\{Y_n\}_{n\ge1}$ are iid Gamma (1, 2), then $n^{-1}\sum_{i=1}^n Y_i^{-1} \longrightarrow^{d} W$, where W has pdf $f_W(w) \equiv \dfrac{2}{\pi}\,\dfrac{1}{1+w^2} \cdot I_{(0,\infty)}(w)$.

11.24 Let X be a nonnegative random variable such that $P(X \le x) \sim x^{\alpha} L(x)$ as x ↓ 0 for some α > 0 and L(·) slowly varying at 0. Let $Y = X^{-\beta}$, β > 0. Show that

$$P(Y > y) \sim y^{-\gamma}\, \tilde L(y) \quad \text{as } y \uparrow \infty$$

for some γ > 0 and $\tilde L(\cdot)$ slowly varying at ∞.

11.25 Let $\{X_i\}_{i\ge1}$ be iid Beta (m, n) random variables. Let $Y_i = X_i^{-\beta}$, β > 0, i ≥ 1. Show that there exist sequences $\{a_n\}_{n\ge1}$ and $\{b_n\}_{n\ge1}$ such that $\dfrac{\sum_{i=1}^n Y_i - a_n}{b_n} \longrightarrow^{d}$ a stable law of order γ for some γ in (0, 2].

11.26 Let $\{X_i\}_{i\ge1}$ be iid Uniform [0, 1] random variables.

(a) Show that for each $0 < \beta < \frac12$, there exist constants µ and σ² such that

$$\frac{1}{\sigma\sqrt{n}}\Big(\sum_{i=1}^n X_i^{-\beta} - n\mu\Big) \longrightarrow^{d} N(0, 1).$$

(b) Show that for each $\frac12 < \beta < 1$, there exist a constant 0 < γ < 2 and sequences $\{a_n\}_{n\ge1}$ and $\{b_n\}_{n\ge1}$ such that

$$\frac{1}{b_n}\Big(\sum_{i=1}^n X_i^{-\beta} - a_n\Big) \longrightarrow^{d} \text{a stable law of order } \gamma.$$


11.27 Prove (4.7). (Hint: Use (4.5) and Lemma 10.1.5.)

11.28 (a) Show that the Gamma (α, β) distribution is infinitely divisible, 0 < α, β < ∞.

(b) Let µ be a finite measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Show that

$$\phi(t) \equiv \exp\Big(\int (e^{\iota t u} - 1)\, \mu(du)\Big)$$

is the characteristic function of an infinitely divisible distribution.

11.29 Let $\{X_n\}_{n\ge1}$ be iid random variables with $P(X_1 = 0) = P(X_1 = 1) = \frac12$. Show that there exists a constant $C_1 \in (0, \infty)$ such that

$$\Delta_n \ge C_1 n^{-1/2} \quad \text{for all } n \ge 1,$$

where $\Delta_n$ is as in (4.1).

11.30 Let $X_1$ be a random variable such that the absolutely continuous component $\beta F_{ac}(\cdot)$ in the decomposition (4.5.3) of the cdf F of $X_1$ is nonzero. Show that $X_1$ satisfies Cramer's condition (4.13). (Hint: Use the Riemann-Lebesgue lemma.)

11.31 (Berry-Esseen theorem for sample quantile). Let $\{X_n\}_{n\ge1}$ be a collection of iid random variables with cdf F(·). Let 0 < p < 1 and $Y_n = F_n^{-1}(p)$, where $F_n(x) = n^{-1}\sum_{i=1}^n I(X_i \le x)$, x ∈ R. Suppose that F(·) is twice differentiable in a neighborhood of $\xi_p \equiv F^{-1}(p)$ and $F'(\xi_p) \in (0, \infty)$. Show that

$$\sup_{x\in\mathbb{R}} \Big| P\big(\sqrt{n}\,(Y_n - \xi_p)/\sigma_p \le x\big) - \Phi(x) \Big| = O(n^{-1/2}) \quad \text{as } n \to \infty,$$

where $\sigma_p^2 = p(1-p)/\big(F'(\xi_p)\big)^2$. (Hint: Use the identity $P(Y_n \le x) = P(F_n(x) \ge p)$ for all x and p, apply Theorem 11.4.1 to $\sqrt{n}\, F_n(x)$ for $\sqrt{n}\,|x - \xi_p| \le \log n$, and use monotonicity of cdfs for $\sqrt{n}\,|x - \xi_p| > \log n$. See Lahiri (1992) for more details. Also, see Reiss (1974) for a different proof.)

11.32 (A moderate deviation bound). Let $\{X_n\}_{n\ge1}$ be a sequence of iid random variables with $EX_1 = \mu$, $\mathrm{Var}(X_1) = \sigma^2 \in (0, \infty)$ and $E|X_1|^3 < \infty$. Show that

$$P\Big(\sqrt{n}\,\big|\bar X_n - \mu\big| > \sigma\sqrt{\log n}\Big) = O(n^{-1/2}) \quad \text{as } n \to \infty.$$

(Hint: Apply Theorem 11.4.1.) It can be shown that the bound on the right side is indeed $o(n^{-1/2})$ as n → ∞. For a more general version of this result, see Götze and Hipp (1978).


11.33 Show that the values of the functions $e_{n,2}(x)$ of (4.14) and $\tilde e_{n,2}(x)$ of (4.24) are not necessarily nonnegative for all x ∈ R.

11.34 Complete the proof of Lemma 11.4.7 (i). (Hint: Suppose that for some r ∈ N, φ is r-times differentiable with its rth derivative given by (4.30). Then, for t ∈ (0, ∞),

$$h^{-1}\big[\phi^{(r)}(t+h) - \phi^{(r)}(t)\big] = \int_{-\infty}^{\infty} \frac{e^{hx} - 1}{h} \cdot x^r e^{tx}\, F(dx)$$

and the integrand is bounded by the integrable function $|x|^r g(x)$ for all x ∈ R, 0 < |h| < t/2, where g(·) is as in (4.32). Now apply the DCT.)

11.35 Under the conditions of Lemma 11.4.7, show that

$$\lim_{t\to\infty} \frac{\phi^{(1)}(t)}{\phi(t)} = \theta.$$

(Hint: Consider the cases 'θ ∈ R' and 'θ = ∞' separately.)

11.36 Find the function γ(x) of (4.28) in each of the following cases: (a) $X_1 \sim N(\mu, \sigma^2)$, (b) $X_1 \sim$ Gamma (α, β), (c) $X_1 \sim$ Uniform (0, 1).

11.37 Verify that the functions $T_i$, i = 1, 2, 3 defined by (4.42) are continuous on C[0, 1].

12 Conditional Expectation and Conditional Probability

12.1 Conditional expectation: Definitions and examples

This section motivates the definition of conditional expectation for random variables with a finite variance through a mean square error prediction problem. The definition is then extended to integrable random variables by an approximation argument (cf. Definition 12.1.3). The more standard approach of proving the existence of conditional expectation by the use of the Radon-Nikodym theorem is also outlined.

Let (X, Y) be a bivariate random vector. A standard problem in regression analysis is to predict Y having observed X, that is, to find a function h(X) that predicts Y. A common criterion for measuring the accuracy of such a predictor is the mean squared error $E(Y - h(X))^2$. Under the assumption that $E|Y|^2 < \infty$, it can be shown that there exists a unique $h_0(X)$ that minimizes the mean squared error.

Theorem 12.1.1: Let (X, Y) be a bivariate random vector. Let $EY^2 < \infty$. Then there exists a Borel measurable function $h_0 : \mathbb{R} \to \mathbb{R}$ with $E\big(h_0(X)\big)^2 < \infty$, such that

$$E\big(Y - h_0(X)\big)^2 = \inf\big\{E\big(Y - h(X)\big)^2 : h(X) \in H_0\big\}, \tag{1.1}$$

where $H_0 = \big\{h(X) \mid h : \mathbb{R} \to \mathbb{R}$ is Borel measurable and $E(h(X))^2 < \infty\big\}$.


Proof: Let H be the space of all Borel measurable functions of (X, Y) that have a finite second moment. Let $H_0$ be the subspace of all Borel measurable functions of X that have a finite second moment. It is known that $H_0$ is a closed subspace of H (Problem 12.1) and that for any Z in H, there exists a unique $Z_0$ in $H_0$ such that

$$E(Z - Z_0)^2 = \min\big\{E(Z - Z_1)^2 : Z_1 \in H_0\big\}.$$

Further, $Z_0$ is the unique random variable (up to equivalence w.p. 1) such that

$$E(Z - Z_0) Z_1 = 0 \quad \text{for all } Z_1 \in H_0. \tag{1.2}$$

A proof of this fact is given at the end of this section in Theorem 12.1.6. If one takes Z to be Y, then this $Z_0$ is the desired $h_0(X)$. □

Remark 12.1.1: The random variable $Z_0$ in (1.2) is known as the projection of Y onto $H_0$. It is known that for any random variable Y with $EY^2 < \infty$, the constant c that minimizes $E(Y - c)^2$ over all c ∈ R is c = EY, the expected value of Y. By analogy with this, one is led to the following definition.

Definition 12.1.1: For any bivariate random vector (X, Y) with $EY^2 < \infty$, the conditional expectation of Y given X, denoted as E(Y|X), is the function $h_0(X)$ of Theorem 12.1.1.

Note that $h_0(X)$ is determined up to equivalence w.p. 1. Any such $h_0(X)$ is called a version of E(Y|X). From (1.2) in the proof of Theorem 12.1.1, by taking Z = Y, $Z_1 = I_B(X)$, one finds that $Z_0 = h_0(X)$ satisfies

$$E\,Y I_A = E\,h_0(X) I_A \tag{1.3}$$

for every event A of the form $X^{-1}(B)$ where B ∈ B(R). Conversely, it can be shown that (1.3) implies (1.2), by the usual approximation procedure (Problem 12.1). From (1.3), the function $h_0(X)$ is determined w.p. 1. So one can take (1.3) to be the definition of $h_0(X)$. In statistics, the function E(Y|X) is called the regression of Y on X.

The function $h_0(x)$ can be determined explicitly in the following two special cases. If X is a discrete random variable with values $x_1, x_2, \ldots$, then (1.3) implies, by taking $A = \{X = x_i\}$, that

$$h_0(x_i) = \frac{E\big(Y I(X = x_i)\big)}{P(X = x_i)}, \quad i = 1, 2, \ldots. \tag{1.4}$$

Similarly, if (X, Y) has an absolutely continuous distribution with joint probability density f(x, y), it can be shown that w.p. 1, $E(Y|X) = h_0(X)$, where

$$h_0(x) = \frac{\int y f(x, y)\, dy}{f_X(x)} \tag{1.5}$$

if $f_X(x) > 0$ and 0 otherwise, where $f_X(x) = \int f(x, y)\, dy$ is the probability density function of X (Problem 12.2).

The definition of E(Y|X) can be generalized to the case when X is a vector and more generally, as follows.

Theorem 12.1.2: Let (Ω, F, P) be a probability space and G ⊂ F be a σ-algebra. Let $H \equiv L^2(\Omega, \mathcal{F}, P)$ and $H_0 = L^2(\Omega, \mathcal{G}, P)$. Then for any Y ∈ H, there exists a $Z_0 \in H_0$ such that

$$E(Y - Z_0)^2 = \inf\{E(Y - Z)^2 : Z \in H_0\} \tag{1.6}$$

and this $Z_0$ is determined w.p. 1 by the condition

$$E(Y I_A) = E(Z_0 I_A) \quad \text{for all } A \in \mathcal{G}. \tag{1.7}$$

The proof is similar to that of Theorem 12.1.1.

Definition 12.1.2: The random variable $Z_0$ in (1.7) is called the conditional expectation of Y given G and is written as E(Y|G).

When G = σ⟨X⟩, the σ-algebra generated by a random variable X, E(Y|G) reduces to E(Y|X) in Definition 12.1.1. The following properties of E(Y|G) are easy to verify by using the defining equation (1.7) (Problem 12.3).

Proposition 12.1.3: Let Y and G be as in Theorem 12.1.2.

(i) Y ≥ 0 w.p. 1 ⇒ E(Y|G) ≥ 0 w.p. 1.

(ii) $Y_1, Y_2 \in H$ ⇒ $E\big((\alpha Y_1 + \beta Y_2) \mid \mathcal{G}\big) = \alpha E(Y_1|\mathcal{G}) + \beta E(Y_2|\mathcal{G})$ for any α, β ∈ R.

(iii) $Y_1 \ge Y_2$ w.p. 1 ⇒ $E(Y_1|\mathcal{G}) \ge E(Y_2|\mathcal{G})$ w.p. 1.

Using a natural approximation procedure it is possible to extend the definition of E(Y|G) to all random variables with just the first moment, i.e., E|Y| < ∞. This is done in the following result.

Theorem 12.1.4: Let (Ω, F, P) be a probability space and G ⊂ F be a sub-σ-algebra. Let Y : Ω → R be an F-measurable random variable with E|Y| < ∞. Then there exists a random variable $Z_0$ : Ω → R that is G-measurable, satisfies $E|Z_0| < \infty$, and is uniquely determined (up to equivalence w.p. 1) by

$$E(Y I_A) = E(Z_0 I_A) \quad \text{for all } A \in \mathcal{G}. \tag{1.8}$$

Proof: Since Y can be written as $Y = Y^+ - Y^-$, it is enough to consider the case Y ≥ 0 w.p. 1. Let $Y_n = \min\{Y, n\}$ for n = 1, 2, . . .. Then $EY_n^2 < \infty$


and by Theorem 12.1.2, $Z_n \equiv E(Y_n|\mathcal{G})$ is well defined, it is G-measurable, and satisfies

$$E(Y_n I_A) = E(Z_n I_A) \quad \text{for all } A \in \mathcal{G}. \tag{1.9}$$

Since $0 \le Y_n \le Y_{n+1}$, by Proposition 12.1.3, $0 \le Z_n \le Z_{n+1}$ w.p. 1 and so there exists a set B ∈ G such that P(B) = 0 and on $B^c$, $\{Z_n\}_{n\ge1}$ is nondecreasing and nonnegative. Let $Z_0 = \lim_{n\to\infty} Z_n$ on $B^c$ and 0 on B. Then $Z_0$ is G-measurable. Applying the MCT to both sides of (1.9), one gets

$$E(Y I_A) = E(Z_0 I_A) \quad \text{for all } A \in \mathcal{G}.$$

This proves the existence of a G-measurable $Z_0$ satisfying (1.8). The uniqueness follows from the fact that if $Z_0$ and $Z_0'$ are G-measurable with $E|Z_0| < \infty$, $E|Z_0'| < \infty$ and

$$E Z_0 I_A = E Z_0' I_A \quad \text{for all } A \in \mathcal{G},$$

then $Z_0 = Z_0'$ w.p. 1 (Problem 12.3). □
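The defining identity (1.8) is easy to see on a finite sample space. The following sketch assumes a σ-algebra G generated by a finite partition with blocks of positive probability (an illustrative setting, not one treated separately in the text); there a G-measurable function is constant on each block, and the block-wise weighted average of Y satisfies (1.8). Exact rational arithmetic is used so that the identity can be checked with equality. The sample space, the variable Y, and the partition are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def cond_exp(prob, y, blocks):
    """E(Y | G) on a finite sample space, with G generated by the partition `blocks`."""
    z = {}
    for block in blocks:
        mass = sum(prob[w] for w in block)                # P(block), assumed positive
        avg = sum(prob[w] * y[w] for w in block) / mass   # weighted average of Y on the block
        for w in block:
            z[w] = avg
    return z

def expect(prob, f, event):
    """E(f I_A) for an event A given as a collection of sample points."""
    return sum(prob[w] * f[w] for w in event)

# Hypothetical example: six equally likely points, Y(w) = w^2, three blocks.
prob = {w: Fraction(1, 6) for w in range(1, 7)}
y = {w: Fraction(w * w) for w in prob}
blocks = [(1, 2), (3, 4, 5), (6,)]
z0 = cond_exp(prob, y, blocks)

# Every A in G is a union of blocks; check (1.8) on all 2^3 of them.
events = [sum(c, ()) for r in range(len(blocks) + 1) for c in combinations(blocks, r)]
identity_holds = all(expect(prob, y, a) == expect(prob, z0, a) for a in events)
```

Here `z0` is constant on each block, which is what G-measurability means in this setting.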

Remark 12.1.2: An alternative to the proof of Theorem 12.1.4 above leading to the definition of E(Y|G) is via the Radon-Nikodym theorem. Here is an outline of this proof. Let Y be a nonnegative random variable with E|Y| < ∞. Set $\mu(A) \equiv E(Y I_A)$ for all A ∈ G. Then µ is a measure on (Ω, G) and it is dominated by $P_{\mathcal{G}}$, the restriction of P to G. By the Radon-Nikodym theorem, there is a G-measurable function Z such that

$$E(Y I_A) = \mu(A) = \int_A Z\, dP_{\mathcal{G}} = E Z I_A.$$

Extension to the case when Y is real-valued with E|Y| < ∞ is via the decomposition $Y = Y^+ - Y^-$.

Remark 12.1.3: The arguments in the proof of Theorem 12.1.4 (and Problem 12.3) show that the conclusion of the theorem holds for any nonnegative random variable Y for which EY may or may not be finite.

Definition 12.1.3: Let Y be an F-measurable random variable on a probability space (Ω, F, P) such that either Y is nonnegative or E|Y| < ∞. A random variable $Z_0$ that is G-measurable and satisfies (1.8) is called the conditional expectation of Y given G and is written as E(Y|G).

The following are some important consequences of (1.8):

(i) If Y is G-measurable then E(Y|G) = Y.

(ii) If G = F, then E(Y|G) = Y.

(iii) If G = {∅, Ω}, then E(Y|G) = EY.

(iv) By taking A to be Ω in (1.8), $EY = E\big(E(Y|\mathcal{G})\big)$.

Furthermore, Proposition 12.1.3 extends to this case. When G = σ⟨X⟩ with X discrete, (1.4) holds provided E|Y| < ∞.

Part (iv) is useful in computing EY without explicitly determining the distribution of Y. Suppose E(Y|X) = m(X) and Em(X) is easy to compute but finding the distribution of Y is not so easy. Then EY can still be computed as Em(X). For example, let (X, Y) have a bivariate distribution with pdf

$$f_{X,Y}(x, y) = \begin{cases} \dfrac{1}{\sqrt{2\pi}}\, \dfrac{1}{\sqrt{2\pi}\,|x|}\, e^{-\frac{(y-x)^2}{2x^2}}\, e^{-\frac{(x-1)^2}{2}} & \text{if } x \ne 0, \\ 0 & \text{if } x = 0, \end{cases} \quad (x, y) \in \mathbb{R}^2.$$

In this case, evaluating $f_Y(y)$ is not easy. On the other hand, it can be verified that for each x,

$$m(x) \equiv \int y\, \frac{f_{X,Y}(x, y)}{f_X(x)}\, dy = x,$$

and that $f_X(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(x-1)^2}{2}}$. Thus, EY = EX = 1. For more examples of this type, see Problem 12.29.

The next proposition lists some useful properties of the conditional expectation.

Proposition 12.1.5: Let (Ω, F, P) be a probability space and let Y be an F-measurable random variable with E|Y| < ∞. Let $\mathcal{G}_1 \subset \mathcal{G}_2 \subset \mathcal{F}$ be two sub-σ-algebras contained in F.

(i) Then

$$E(Y|\mathcal{G}_1) = E\big(E(Y|\mathcal{G}_2) \mid \mathcal{G}_1\big). \tag{1.10}$$

(ii) For any bounded $\mathcal{G}_1$-measurable random variable U,

$$E(Y U|\mathcal{G}_1) = U\, E(Y|\mathcal{G}_1). \tag{1.11}$$

Proof: (i) Let $A \in \mathcal{G}_1$, $Z_1 = E(Y|\mathcal{G}_1)$, and $Z_2 = E(Y|\mathcal{G}_2)$. Then $E(Y I_A) = E(Z_1 I_A)$ by the definition of $Z_1$. Since $\mathcal{G}_1 \subset \mathcal{G}_2$, $A \in \mathcal{G}_2$ and again by the definition of $Z_2$, $E(Y I_A) = E(Z_2 I_A)$. Thus,

$$E(Z_2 I_A) = E(Z_1 I_A) \quad \text{for all } A \in \mathcal{G}_1$$

and by the definition of $E(Z_2|\mathcal{G}_1)$, it follows that $Z_1 = E(Z_2|\mathcal{G}_1)$, proving (i).


(ii) Let $Z_1 = E(Y|\mathcal{G}_1)$. If $U = I_B$ for some $B \in \mathcal{G}_1$, then for any $A \in \mathcal{G}_1$, $A \cap B \in \mathcal{G}_1$ and by (1.8),

$$E\,Y I_B I_A = E\,Y I_{A\cap B} = E(Z_1 I_{A\cap B}) = E(Z_1 I_B \cdot I_A).$$

So in this case $E(Y U|\mathcal{G}_1) = Z_1 U$. By linearity (Proposition 12.1.3 (ii)), it extends to all U that are simple and $\mathcal{G}_1$-measurable. For any bounded $\mathcal{G}_1$-measurable U, there exists a sequence of bounded, $\mathcal{G}_1$-measurable, and simple random variables $\{U_n\}_{n\ge1}$ that converge to U uniformly. Hence, for any $A \in \mathcal{G}_1$ and for n ≥ 1,

$$E\,Y U_n I_A = E\,Z_1 U_n I_A.$$

The bounded convergence theorem applied to both sides yields $E\,Y U I_A = E\,Z_1 U I_A$. Since $Z_1$ and U are both $\mathcal{G}_1$-measurable, (ii) follows. □
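Both parts of Proposition 12.1.5 can be verified exactly on a finite sample space. The sketch below assumes two nested σ-algebras generated by a fine partition and a coarser one (each coarse block is a union of fine blocks), non-uniform point masses, and a variable U that is constant on the coarse blocks; all of these choices are hypothetical and serve only as an illustration.

```python
from fractions import Fraction

def cond_exp(prob, y, blocks):
    """E(Y | G) on a finite sample space, with G generated by the partition `blocks`."""
    z = {}
    for block in blocks:
        mass = sum(prob[w] for w in block)
        avg = sum(prob[w] * y[w] for w in block) / mass
        for w in block:
            z[w] = avg
    return z

prob = {w: Fraction(w + 1, 36) for w in range(8)}   # masses 1/36, ..., 8/36 sum to 1
y = {w: Fraction(w * w) for w in prob}
fine = [(0, 1), (2, 3), (4, 5), (6, 7)]             # generates G2
coarse = [(0, 1, 2, 3), (4, 5, 6, 7)]               # generates G1, contained in G2

z1 = cond_exp(prob, y, coarse)                      # E(Y | G1)
z2 = cond_exp(prob, y, fine)                        # E(Y | G2)
tower_holds = cond_exp(prob, z2, coarse) == z1      # (1.10)

u = {w: Fraction(2) if w < 4 else Fraction(-3) for w in prob}   # bounded and G1-measurable
yu = {w: y[w] * u[w] for w in prob}
pull_out_holds = cond_exp(prob, yu, coarse) == {w: u[w] * z1[w] for w in prob}   # (1.11)
```

Conditioning first on the finer partition and then on the coarser one gives the same result as conditioning on the coarser one directly, and a factor that is constant on the coarse blocks passes through the conditional expectation.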

Remark 12.1.4: If the random variable U in Proposition 12.1.5 is $\mathcal{G}_1$-measurable and E|YU| < ∞, then part (ii) of the proposition holds. The proof needs a more careful approximation (see Billingsley (1995), pp. 447).

An Approximation Theorem

Theorem 12.1.6: Let H be a real Hilbert space and M be a nonempty closed convex subset of H. Then for every v ∈ H, there is a unique $u_0 \in M$ such that

$$\|v - u_0\| = \inf\{\|v - u\| : u \in M\} \tag{1.12}$$

where $\|x\|^2 = \langle x, x\rangle$, with $\langle x, y\rangle$ denoting the inner-product in H.

Proof: Let $\delta = \inf\{\|v - u\| : u \in M\}$. Then, δ ∈ [0, ∞). By definition, there exists a sequence $\{u_n\}_{n\ge1} \subset M$ such that $\|v - u_n\| \to \delta$. Also note that in any inner-product space, the parallelogram law holds, i.e., for any x, y ∈ H,

$$\|x + y\|^2 + \|x - y\|^2 = 2\big(\|x\|^2 + \|y\|^2\big).$$

Thus

$$\|2v - (u_n + u_m)\|^2 + \|u_n - u_m\|^2 = 2\big(\|v - u_n\|^2 + \|v - u_m\|^2\big). \tag{1.13}$$

Since M is convex, $\frac{u_n + u_m}{2} \in M$, implying that $\big\|v - \frac{u_n + u_m}{2}\big\|^2 \ge \delta^2$. This, with (1.13), implies that

$$\limsup_{m,n\to\infty} \|u_n - u_m\|^2 = 0,$$

making $\{u_n\}_{n\ge1}$ a Cauchy sequence. Since H is a Hilbert space, there exists a $u_0 \in H$ such that $\{u_n\}_{n\ge1}$ converges to $u_0$. Also, since M is closed, $u_0 \in M$. Since $\|v - u_n\| \to \delta$, it follows that $\|v - u_0\| = \delta$.

To show the uniqueness, let $u_0' \in M$ also satisfy $\|v - u_0'\| = \delta$. Then as in (1.13),

$$\Big\|v - \frac{u_0 + u_0'}{2}\Big\|^2 + \Big\|\frac{u_0 - u_0'}{2}\Big\|^2 = \delta^2,$$

and since the first term on the left is at least $\delta^2$, this implies $\|u_0 - u_0'\| = 0$.

Remark 12.1.5: The above theorem holds if M is a closed subspace of H.
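A finite-dimensional instance of Theorem 12.1.6 can be sketched as follows. The example assumes H = R^4 with the Euclidean inner product and M the box [−1, 1]^4, which is closed and convex; for this particular M the nearest point is obtained by clipping each coordinate, a fact used here without proof. The point v, the box, and the random comparison points are hypothetical choices for illustration.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def project_box(v, lo=-1.0, hi=1.0):
    """Nearest point of the box [lo, hi]^d to v, by coordinatewise clipping."""
    return [min(max(c, lo), hi) for c in v]

v = [2.0, -0.5, 0.3, -3.0]
u0 = project_box(v)
delta = dist(v, u0)                       # the infimum in (1.12) for this M

# No sampled point of M should be closer to v than u0.
rng = random.Random(1)
candidates = [[rng.uniform(-1.0, 1.0) for _ in v] for _ in range(1000)]
is_minimal = all(dist(v, u) >= delta - 1e-12 for u in candidates)

# Parallelogram law for one random pair x, y.
x = [rng.gauss(0, 1) for _ in v]
y = [rng.gauss(0, 1) for _ in v]
sq = lambda z: sum(c * c for c in z)      # squared norm
lhs = sq([a + b for a, b in zip(x, y)]) + sq([a - b for a, b in zip(x, y)])
rhs = 2 * (sq(x) + sq(y))
```

The sampling check is only a sanity test, not a proof of minimality; the theorem itself guarantees both existence and uniqueness of $u_0$.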

12.2 Convergence theorems

From Proposition 12.1.3, it is seen that E(Y|G) is monotone and linear in Y, suggesting that it behaves like an ordinary expectation. A natural question is whether under appropriate conditions, the basic convergence results extend to conditional expectations (CE). The answer is 'yes,' as shown by the following results.

Theorem 12.2.1: (Monotone convergence theorem for CE). Let (Ω, F, P) be a probability space and G ⊂ F be a sub-σ-algebra of F. Let $\{Y_n\}_{n\ge1}$ be a sequence of nonnegative F-measurable random variables such that $0 \le Y_n \le Y_{n+1}$ w.p. 1. Let $Y \equiv \lim_{n\to\infty} Y_n$ w.p. 1. Then

$$\lim_{n\to\infty} E(Y_n|\mathcal{G}) = E(Y|\mathcal{G}) \quad \text{w.p. 1.} \tag{2.1}$$

Proof: By Proposition 12.1.3 (i), $Z_n \equiv E(Y_n|\mathcal{G})$ is monotone nondecreasing in n, w.p. 1, and so $Z \equiv \lim_{n\to\infty} Z_n$ exists w.p. 1. By the MCT, for all A ∈ G,

$$E(Y I_A) = \lim_{n\to\infty} E\,Y_n I_A = \lim_{n\to\infty} E(Z_n I_A) = E(Z I_A).$$

Thus, Z = E(Y|G) w.p. 1, proving (2.1). □

Theorem 12.2.2: (Fatou's lemma for CE). Let $\{Y_n\}_{n\ge1}$ be a sequence of nonnegative random variables on a probability space (Ω, F, P) and let G be a sub-σ-algebra of F. Then

$$\liminf_{n\to\infty} E(Y_n|\mathcal{G}) \ge E\big(\liminf_{n\to\infty} Y_n \mid \mathcal{G}\big). \tag{2.2}$$

Proof: Let $\tilde Y_n = \inf_{j\ge n} Y_j$. Then $\{\tilde Y_n\}_{n\ge1}$ is a sequence of nonnegative nondecreasing random variables and $\lim_{n\to\infty} \tilde Y_n = \liminf_{n\to\infty} Y_n$. By the previous theorem,

$$\lim_{n\to\infty} E(\tilde Y_n|\mathcal{G}) = E\big(\liminf_{n\to\infty} Y_n \mid \mathcal{G}\big). \tag{2.3}$$

Also, since $\tilde Y_n \le Y_j$ for each j ≥ n,

$$E(\tilde Y_n|\mathcal{G}) \le E(Y_j|\mathcal{G}) \quad \text{for each } j \ge n \text{ w.p. 1}$$

implying that $E(\tilde Y_n|\mathcal{G}) \le \inf_{j\ge n} E(Y_j|\mathcal{G})$ w.p. 1. The right side converges to $\liminf_{n\to\infty} E(Y_n|\mathcal{G})$ w.p. 1. Now (2.2) follows from (2.3). □

It is easy to deduce from Fatou's lemma the following result (Problem 12.4).

Theorem 12.2.3: (Dominated convergence theorem for CE). Let $\{Y_n\}_{n\ge1}$ and Y be random variables on a probability space (Ω, F, P) and let G be a sub-σ-algebra of F. Suppose that $\lim_{n\to\infty} Y_n = Y$ w.p. 1 and that there exists a random variable Z such that $|Y_n| \le Z$ w.p. 1 and EZ < ∞. Then

$$\lim_{n\to\infty} E(Y_n|\mathcal{G}) = E(Y|\mathcal{G}) \quad \text{w.p. 1.} \tag{2.4}$$

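Theorem 12.2.4: (Jensen's inequality for CE). Let φ : (a, b) → R be convex for some −∞ ≤ a < b ≤ ∞. Let Y be a random variable on a probability space (Ω, F, P) such that P(Y ∈ (a, b)) = 1 and E|φ(Y)| < ∞. Let G be a sub-σ-algebra of F. Then

$$\phi\big(E(Y|\mathcal{G})\big) \le E\big(\phi(Y)|\mathcal{G}\big). \tag{2.5}$$

Proof: By the convexity of φ on (a, b), for any c, x ∈ (a, b),

$$\phi(x) - \phi(c) - (x - c)\,\phi_-'(c) \ge 0, \tag{2.6}$$

where $\phi_-'(c)$ is the left derivative of φ at c. Taking c = E(Y|G) and x = Y in (2.6), one gets

$$Z \equiv \phi(Y) - \phi\big(E(Y|\mathcal{G})\big) - \big(Y - E(Y|\mathcal{G})\big)\,\phi_-'\big(E(Y|\mathcal{G})\big) \ge 0. \tag{2.7}$$

Since $E\big(\phi(E(Y|\mathcal{G})) \mid \mathcal{G}\big) = \phi\big(E(Y|\mathcal{G})\big)$ and, by (1.11),

$$E\Big[\big(Y - E(Y|\mathcal{G})\big)\,\phi_-'\big(E(Y|\mathcal{G})\big) \,\Big|\, \mathcal{G}\Big] = \phi_-'\big(E(Y|\mathcal{G})\big)\, E\big[\big(Y - E(Y|\mathcal{G})\big) \mid \mathcal{G}\big] = 0,$$

and since E(Z|G) ≥ 0 by (2.7), it follows that

$$E\big(\phi(Y)|\mathcal{G}\big) \ge \phi\big(E(Y|\mathcal{G})\big).$$

□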
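Inequality (2.5) can be checked exactly on a finite sample space. The sketch below assumes a σ-algebra generated by a two-block partition of six equally likely points and three convex functions that can be evaluated in rational arithmetic (square, absolute value, positive part); these choices are hypothetical and only illustrate the statement of Theorem 12.2.4.

```python
from fractions import Fraction

def cond_exp(prob, y, blocks):
    """E(Y | G) on a finite sample space, with G generated by the partition `blocks`."""
    z = {}
    for block in blocks:
        mass = sum(prob[w] for w in block)
        avg = sum(prob[w] * y[w] for w in block) / mass
        for w in block:
            z[w] = avg
    return z

prob = {w: Fraction(1, 6) for w in range(6)}
y = dict(zip(range(6), map(Fraction, [-3, 1, 4, -2, 0, 5])))
blocks = [(0, 1, 2), (3, 4, 5)]

convex = {
    "square": lambda v: v * v,
    "abs": abs,
    "positive part": lambda v: max(v, Fraction(0)),
}

m = cond_exp(prob, y, blocks)                                     # E(Y | G)
jensen_gap = {}
for name, phi in convex.items():
    lhs = {w: phi(m[w]) for w in prob}                            # phi(E(Y | G))
    rhs = cond_exp(prob, {w: phi(y[w]) for w in prob}, blocks)    # E(phi(Y) | G)
    jensen_gap[name] = {w: rhs[w] - lhs[w] for w in prob}         # nonnegative by (2.5)
```

For the square function the gap is the block-wise variance of Y, so it vanishes only on blocks where Y is constant.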

The following inequalities are a direct consequence of Theorem 12.2.4.

Corollary 12.2.5: Let Y be a random variable on a probability space (Ω, F, P) and let G be a sub-σ-algebra of F.

(i) If $EY^2 < \infty$, then $E(Y^2|\mathcal{G}) \ge \big(E(Y|\mathcal{G})\big)^2$.

(ii) If $E|Y|^p < \infty$ for some p ∈ [1, ∞), then $E(|Y|^p|\mathcal{G}) \ge |E(Y|\mathcal{G})|^p$.

Definition 12.2.1: Let $EY^2 < \infty$. The conditional variance of Y given G, denoted by Var(Y|G), is defined as

$$\mathrm{Var}(Y|\mathcal{G}) = E(Y^2|\mathcal{G}) - \big(E(Y|\mathcal{G})\big)^2. \tag{2.8}$$

This leads to the following formula for a decomposition of the variance of Y, known as the Analysis of Variance formula.

Theorem 12.2.6: Let $EY^2 < \infty$. Then

$$\mathrm{Var}(Y) = \mathrm{Var}\big(E(Y|\mathcal{G})\big) + E\big(\mathrm{Var}(Y|\mathcal{G})\big). \tag{2.9}$$

Proof: $\mathrm{Var}(Y) = E(Y - EY)^2$. But $Y - EY = Y - E(Y|\mathcal{G}) + E(Y|\mathcal{G}) - EY$. Also by (1.11),

$$\begin{aligned}
E\Big[\big(Y - E(Y|\mathcal{G})\big)\big(E(Y|\mathcal{G}) - EY\big) \,\Big|\, \mathcal{G}\Big] &= \big(E(Y|\mathcal{G}) - EY\big)\, E\big[\big(Y - E(Y|\mathcal{G})\big) \mid \mathcal{G}\big] \\
&= \big(E(Y|\mathcal{G}) - EY\big) \cdot 0 = 0.
\end{aligned}$$

Thus, $E\big[\big(Y - E(Y|\mathcal{G})\big)\big(E(Y|\mathcal{G}) - EY\big)\big] = 0$ and so

$$E(Y - EY)^2 = E\big(Y - E(Y|\mathcal{G})\big)^2 + E\big(E(Y|\mathcal{G}) - EY\big)^2. \tag{2.10}$$

Now, noting that $E\big[Y E(Y|\mathcal{G})\big] = E\big[E\big(Y E(Y|\mathcal{G}) \mid \mathcal{G}\big)\big] = E\big[\big(E(Y|\mathcal{G})\big)^2\big]$, one gets

$$\begin{aligned}
E\big(Y - E(Y|\mathcal{G})\big)^2 &= EY^2 - 2E\big[Y E(Y|\mathcal{G})\big] + E\big(E(Y|\mathcal{G})\big)^2 \\
&= EY^2 - E\big(E(Y|\mathcal{G})\big)^2 \\
&= E\big[E(Y^2|\mathcal{G}) - \big(E(Y|\mathcal{G})\big)^2\big] = E\big(\mathrm{Var}(Y|\mathcal{G})\big).
\end{aligned}$$

Also, since $E\big[E(Y|\mathcal{G})\big] = EY$,

$$E\big(E(Y|\mathcal{G}) - EY\big)^2 = \mathrm{Var}\big(E(Y|\mathcal{G})\big).$$

Hence, (2.9) follows from (2.10). □

Remark 12.2.1: E(Var(Y|G)) is called the variance within and Var(E(Y|G)) is the variance between. The above proof also shows that

$$E(Y - Z)^2 = E\big(Y - E(Y|\mathcal{G})\big)^2 + E\big(E(Y|\mathcal{G}) - Z\big)^2 \tag{2.11}$$

for any random variable Z that is G-measurable. This is used to prove the Rao-Blackwell theorem in mathematical statistics (Lehmann and Casella (1998)) (Problem 12.27).
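The decomposition (2.9) can be illustrated by simulation on the bivariate example of Section 12.1, where X ∼ N(1, 1) and E(Y|X) = X. The sketch below additionally assumes, reading the conditional density off the joint pdf given there, that Y given X = x is N(x, x²), so that Var(Y|X) = X². Under that reading, Var(E(Y|X)) = Var(X) = 1 and E Var(Y|X) = EX² = 2, which gives Var(Y) = 3. The sample size and seed are arbitrary.

```python
import random

def simulate(n=200_000, seed=0):
    """Monte Carlo estimates of EY, Var(Y), and the two terms on the right of (2.9)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(1.0, 1.0)       # X ~ N(1, 1)
        y = rng.gauss(x, abs(x))      # given X = x, Y ~ N(x, x^2) (assumed reading of the pdf)
        xs.append(x)
        ys.append(y)
    mean = lambda v: sum(v) / len(v)
    var = lambda v: mean([a * a for a in v]) - mean(v) ** 2
    return {
        "EY": mean(ys),
        "VarY": var(ys),
        "between": var(xs),                     # estimates Var(E(Y|X)) = Var(X)
        "within": mean([x * x for x in xs]),    # estimates E Var(Y|X) = E X^2
    }

result = simulate()
```

The estimate of Var(Y) should be close to the sum of the "between" and "within" estimates, up to Monte Carlo error, even though the marginal density of Y is never computed.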


12.3 Conditional probability

Let (Ω, F, P) be a probability space and let G ⊂ F be a sub-σ-algebra.

Definition 12.3.1: For B ∈ F, the conditional probability of B given G, denoted by P(B|G), is defined as

$$P(B|\mathcal{G}) = E(I_B|\mathcal{G}). \tag{3.1}$$

Thus Z ≡ P(B|G) is a G-measurable function such that

$$P(A \cap B) = E(Z I_A) \quad \text{for all } A \in \mathcal{G}. \tag{3.2}$$

Since 0 ≤ P(A ∩ B) ≤ P(A) for all A ∈ G, it follows that 0 ≤ P(B|G) ≤ 1 w.p. 1. It is easy to check that w.p. 1

$$P(\Omega|\mathcal{G}) = 1 \quad \text{and} \quad P(\emptyset|\mathcal{G}) = 0.$$

Also, by linearity, if $B_1, B_2 \in \mathcal{F}$, $B_1 \cap B_2 = \emptyset$, then

$$P(B_1 \cup B_2|\mathcal{G}) = P(B_1|\mathcal{G}) + P(B_2|\mathcal{G}) \quad \text{w.p. 1.}$$

This suggests that w.p. 1, P(B|G) is countably additive as a set function in B. That is, there exists a set $A_0 \in \mathcal{G}$ such that $P(A_0) = 0$ and for all $\omega \notin A_0$, the map B → P(B|G)(ω) is countably additive. However, this is not true in general. For a given collection $\{B_n\}_{n\ge1}$ of disjoint sets in F, there is an exceptional set $A_0$ such that $P(A_0) = 0$ and for $\omega \notin A_0$,

$$P\Big(\bigcup_{n\ge1} B_n \,\Big|\, \mathcal{G}\Big)(\omega) = \sum_{n=1}^{\infty} P(B_n|\mathcal{G})(\omega).$$

However, this $A_0$ depends on $\{B_n\}_{n\ge1}$ and as the collection varies, these exceptional sets can be an uncountable collection whose union may not be contained in a set of probability zero.

Definition 12.3.2: Let (Ω, F, P) be a probability space and G be a sub-σ-algebra of F. A function µ : F × Ω → [0, 1] is called a regular conditional probability on F given G if

(i) for all B ∈ F, µ(B, ω) = P(B|G) w.p. 1, and

(ii) for all ω ∈ Ω, µ(·, ω) is a probability measure on (Ω, F).

If a regular conditional probability (r.c.p.) µ(·, ·) exists on F given G, then the conditional expectation of Y given G can be computed as

$$E(Y|\mathcal{G})(\omega) = \int Y(\omega')\, \mu(d\omega', \omega) \quad \text{w.p. 1}$$

for all Y such that E|Y| < ∞. The proof of this is via standard approximation using simple random variables (Problem 12.15). A sufficient condition for the existence of r.c.p. is provided by the following result.

Theorem 12.3.1: Let (Ω, F, P) be a probability space. Let S be a Polish space and $\mathcal{S}$ be its Borel σ-algebra. Let X be an S-valued random variable on (Ω, F). Then for any σ-algebra G ⊂ F, there is a regular conditional probability on σ⟨X⟩ given G, where $\sigma\langle X\rangle = \{X^{-1}(D) : D \in \mathcal{S}\}$.

Proof: (for S = R). Let $Q = \{r_j\}$ be the set of rationals. Let $F(r_j, \omega) = P(X \le r_j|\mathcal{G})(\omega)$ w.p. 1. Then, there is a set $A_0 \in \mathcal{G}$ such that $P(A_0) = 0$ and for $\omega \notin A_0$, F(r, ω) is monotone nondecreasing on Q. For x ∈ R, set

$$F(x, \omega) \equiv \begin{cases} \sup\{F(r, \omega) : r \le x\} & \text{if } \omega \notin A_0 \\ F_0(x) & \text{if } \omega \in A_0, \end{cases}$$

where $F_0(x)$ is a fixed cdf (say, $F_0 = \Phi$, the standard normal cdf). Then, F(x, ω) is a cdf in x for each ω and for each x, F(x, ·) is G-measurable. Let µ(B, ω) be the Lebesgue-Stieltjes measure induced by F(·, ω). Then it can be checked using the π − λ theorem (Theorem 1.1.2) that µ(·, ·) is a regular conditional probability on σ⟨X⟩ given G (Problem 12.16). □

Remark 12.3.1: When F = σ⟨X⟩, the regular conditional probability on F given G is also called the regular conditional probability distribution of X given G.

Remark 12.3.2: For a proof for the general Polish case, see Durrett (2004) and Parthasarathy (1967).
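On a finite sample space a regular conditional probability can be written down directly. The sketch below assumes that G is generated by a finite partition with blocks of positive probability (a hypothetical setting chosen for illustration) and takes µ(B, ω) = P(B ∩ A(ω))/P(A(ω)), where A(ω) is the block containing ω. It then checks that µ(·, ω) is a probability measure for every ω, that µ(B, ·) satisfies (3.2), and that integrating Y against µ(·, ω) reproduces the block averages, i.e., E(Y|G).

```python
from fractions import Fraction

def make_rcp(prob, blocks):
    """Return mu(event, w) = P(event ∩ A(w)) / P(A(w)), with A(w) the block containing w."""
    block_of = {w: b for b in blocks for w in b}
    def mu(event, w):
        block = block_of[w]
        mass = sum(prob[v] for v in block)       # P(A(w)), assumed positive
        return sum(prob[v] for v in block if v in event) / mass
    return mu

prob = {w: Fraction(w + 1, 21) for w in range(6)}   # masses 1/21, ..., 6/21 sum to 1
blocks = [(0, 1), (2, 3, 4), (5,)]
y = {w: Fraction(w * w) for w in prob}
mu = make_rcp(prob, blocks)

omega = set(prob)
is_probability = all(mu(omega, w) == 1 and mu(set(), w) == 0 for w in prob)

# (3.2) on the generating blocks: P(A ∩ B) = E(mu(B, .) I_A) for the event B = {1, 3, 5}.
b = {1, 3, 5}
matches_cond_prob = all(
    sum(prob[w] for w in a if w in b) == sum(prob[w] * mu(b, w) for w in a)
    for a in blocks
)

# E(Y | G)(w) as the integral of Y against mu(., w).
ce = {w: sum(y[v] * mu({v}, w) for v in prob) for w in prob}
```

In this setting no exceptional null set is needed, because the same formula works for every ω.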

12.4 Problems

12.1 Let (X, Y) be a bivariate random vector with $EY^2 < \infty$. Let $H = L^2(\mathbb{R}^2, \mathcal{B}(\mathbb{R}^2), P_{X,Y})$ and $H_0 = \{h(X) \mid h : \mathbb{R} \to \mathbb{R}$ is Borel measurable and $Eh(X)^2 < \infty\}$. Suppose that for some $h(X) \in H_0$,

$$E\,Y I_A = E\,h(X) I_A \quad \text{for all } A \in \sigma\langle X\rangle.$$

Show that $E(Y - h(X)) Z_1 = 0$ for all $Z_1 \in H_0$. Show also that $H_0$ is a closed subspace of H. (Hint: For any $Z_1 \in H_0$, there exists a sequence of simple random variables $\{W_n\}_{n\ge1} \subset H_0$ such that $|W_n| \le |Z_1|$ and $W_n \to Z_1$ a.s. Now, apply the DCT. For the second part, use the fact that f : Ω → R is σ⟨X⟩-measurable iff there is a Borel measurable function h : R → R such that f = h(X).)


12.2 Let (X, Y) be a bivariate random vector that has an absolutely continuous distribution on $(\mathbb{R}^2, \mathcal{B}(\mathbb{R}^2))$ w.r.t. the Lebesgue measure with density f(x, y). Suppose that E|Y| < ∞. Show that a version of E(Y|X) is given by $h_0(X)$, where, with $f_X(x) = \int f(x, y)\, dy$,

$$h_0(x) = \begin{cases} \dfrac{\int y f(x, y)\, dy}{f_X(x)} & \text{if } f_X(x) > 0 \\ 0 & \text{otherwise.} \end{cases}$$

(Hint: Verify (1.8) for all A ∈ σ⟨X⟩.)

12.3 Let $Z_1$ and $Z_2$ be two random variables on a probability space (Ω, G, P).

(a) Suppose that $E|Z_1| < \infty$, $E|Z_2| < \infty$ and

$$E Z_1 I_A = E Z_2 I_A \quad \text{for all } A \in \mathcal{G}. \tag{4.1}$$

Show that $P(Z_1 = Z_2) = 1$.

(b) Suppose that $Z_1$ and $Z_2$ are nonnegative and (4.1) holds. Show that $P(Z_1 = Z_2) = 1$.

(c) Prove Proposition 12.1.3.

(Hint: (a) Consider (4.1) with $A_1 = \{Z_1 - Z_2 > 0\}$ and $A_2 = \{Z_1 - Z_2 < 0\}$ and conclude that $P(A_1) = 0 = P(A_2)$. (b) Let $A_{1n} = \{Z_1 \le n, Z_2 \le n, Z_1 - Z_2 > 0\}$ and $A_{2n} = \{Z_1 \le n, Z_2 \le n, Z_1 - Z_2 < 0\}$, n ≥ 1. Then, by (4.1), $P(A_{1n}) = 0 = P(A_{2n})$ for all n ≥ 1. But $A_i = \bigcup_{n\ge1} A_{in}$, i = 1, 2, where the $A_i$'s are as above.)

12.4 Prove Theorem 12.2.3.

12.5 Let $X_i$ be a $k_i$-dimensional random vector, $k_i \in \mathbb{N}$, i = 1, 2 such that $X_1$ and $X_2$ are independent. Let $h : \mathbb{R}^{k_1+k_2} \to [0, \infty)$ be a Borel measurable function. Show that

$$E\big(h(X_1, X_2) \mid X_1\big) = g(X_1) \tag{4.2}$$

where $g(x) = E h(x, X_2)$, $x \in \mathbb{R}^{k_1}$. Show that (4.2) is also valid for a real valued h with $E|h(X_1, X_2)| < \infty$.

(Hint: Let $k = k_1 + k_2$, $\Omega = \mathbb{R}^k$, $\mathcal{F} = \mathcal{B}(\mathbb{R}^k)$, $P = P_{X_1} \times P_{X_2}$. Verify (1.8) for all $A \in \{A_1 \times \mathbb{R}^{k_2} : A_1 \in \mathcal{B}(\mathbb{R}^{k_1})\} \equiv \sigma\langle X_1\rangle$.)

12.6 Let X be a random variable on a probability space (Ω, F, P) with $EX^2 < \infty$ and let G ⊂ F be a sub-σ-field.

(a) Show that for any A ∈ G,

$$\Big|\int_A E(X|\mathcal{G})\, dP\Big| \le \Big[\int_A E(X^2|\mathcal{G})\, dP\Big]^{1/2}. \tag{4.3}$$

(b) Show that (4.3) is valid for all A ∈ F.

12.7 Let $f : (\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k), P) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ be an integrable function where

$$P(A) = 2^{-k} \int_A \exp\Big(-\sum_{i=1}^k |x_i|\Big)\, dx_1 \ldots dx_k, \quad A \in \mathcal{B}(\mathbb{R}^k).$$

For each of the following cases, find a version of E(f|G) and justify your answer:

(a) $\mathcal{G} = \sigma\langle\{A \in \mathcal{B}(\mathbb{R}^k) : A = -A\}\rangle$,

(b) $\mathcal{G} = \sigma\langle\{(j_1, j_1 + 1] \times \cdots \times (j_k, j_k + 1] : j_1, \ldots, j_k \in \mathbb{Z}\}\rangle$,

(c) $\mathcal{G} = \sigma\langle\{B \times \{0\} : B \in \mathcal{B}(\mathbb{R}^{k-1})\}\rangle$.

12.8 Let (Ω, F, P) be a probability space and $\mathcal{G} = \{\emptyset, B, B^c, \Omega\}$ for some B ∈ F with P(B) ∈ (0, 1). Determine P(A|G) for A ∈ F.

12.9 Let $\{X_n : n \in \mathbb{Z}\}$ be a collection of independent random variables with $E|X_0| < \infty$. Show that

(a) $E(X_0 \mid X_1, \ldots, X_n) = EX_0$ for any n ∈ N,

(b) $E(X_0 \mid X_{-n}, \ldots, X_{-1}) = EX_0$ for any n ∈ N,

(c) $E(X_0 \mid X_1, X_2, \ldots) = EX_0 = E(X_0 \mid \ldots, X_{-2}, X_{-1})$.

12.10 Let X be a random variable on a probability space (Ω, F, P) with E|X| < ∞ and let C be a π-system such that σ⟨C⟩ = G ⊂ F. Suppose that there exists a G-measurable function f : Ω → R such that

$$\int_A f\, dP = \int_A X\, dP \quad \text{for all } A \in \mathcal{C}.$$

Show that f = E(X|G).

12.11 Let X and Y be integrable random variables on (Ω, F, P) and let C be a semi-algebra, C ⊂ F. Suppose that $\int_A X\, dP \le \int_A Y\, dP$ for all A ∈ C. Show that E(X|G) ≤ E(Y|G) where G = σ⟨C⟩.

12.12 Let $X, Y \in L^2(\Omega, \mathcal{F}, P)$. If E(X|Y) = Y and E(Y|X) = X, then P(X = Y) = 1. (Hint: Show that $E(X - Y)^2 = EX^2 - EY^2$.)

12.13 Let $\{X_n\}_{n\ge1}$, X be a collection of random variables on (Ω, F, P) and let G be a sub-σ-algebra of F. If $\lim_{n\to\infty} E|X_n - X|^r = 0$ for some r ≥ 1, then

$$\lim_{n\to\infty} E\big|E(X_n|\mathcal{G}) - E(X|\mathcal{G})\big|^r = 0.$$


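12.14 Let $X, Y \in L^2(\Omega, \mathcal{F}, P)$ and let $\mathcal{G}$ be a sub-σ-algebra of $\mathcal{F}$. Show that $E\{Y E(X|\mathcal{G})\} = E\{X E(Y|\mathcal{G})\}$.

12.15 Let $Y$ be an integrable random variable on $(\Omega, \mathcal{F}, P)$ and let $\mu$ be a r.c.p. on $\mathcal{F}$ given $\mathcal{G}$. Show that
$$h(\omega) \equiv \int Y(\omega_1)\,\mu(d\omega_1, \omega), \quad \omega \in \Omega,$$
is a version of $E(Y|\mathcal{G})$. (Hint: Prove this first for $Y = I_A$, $A \in \mathcal{F}$. Extend to simple functions by linearity. Use the DCT for CE for the general case.)

12.16 Complete the proof of Theorem 12.3.1 for $S = \mathbb{R}$.

12.17 Let $(\Omega, \mathcal{F}, P)$ be a probability space, $\mathcal{G}$ be a sub-σ-algebra of $\mathcal{F}$, and let $\{A_n\}_{n \ge 1} \subset \mathcal{F}$ be a collection of disjoint sets. Show that
$$P\Big( \bigcup_{n \ge 1} A_n \,\Big|\, \mathcal{G} \Big) = \sum_{n=1}^{\infty} P(A_n|\mathcal{G}).$$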

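Definition 12.4.1: Let $\mathcal{G}$ be a σ-algebra and let $\{\mathcal{G}_\lambda : \lambda \in \Lambda\}$ be a collection of subsets of $\mathcal{F}$ in a probability space $(\Omega, \mathcal{F}, P)$. Then, $\{\mathcal{G}_\lambda : \lambda \in \Lambda\}$ is called conditionally independent given $\mathcal{G}$ if for any $\lambda_1, \ldots, \lambda_k \in \Lambda$, $k \in \mathbb{N}$,
$$P\big(A_1 \cap \cdots \cap A_k \mid \mathcal{G}\big) = \prod_{i=1}^{k} P(A_i|\mathcal{G})$$
for all $A_1 \in \mathcal{G}_{\lambda_1}, \ldots, A_k \in \mathcal{G}_{\lambda_k}$. A collection of random variables $\{X_\lambda : \lambda \in \Lambda\}$ on $(\Omega, \mathcal{F}, P)$ is called conditionally independent given $\mathcal{G}$ if $\{\sigma\langle X_\lambda\rangle : \lambda \in \Lambda\}$ is conditionally independent given $\mathcal{G}$.

12.18 Let $\mathcal{G}_1, \mathcal{G}_2, \mathcal{G}_3$ be three sub-σ-algebras of $\mathcal{F}$. Recall that $\mathcal{G}_i \vee \mathcal{G}_j = \sigma\langle\mathcal{G}_i \cup \mathcal{G}_j\rangle$, $1 \le i \ne j \le 3$. Show that $\mathcal{G}_1$ and $\mathcal{G}_2$ are conditionally independent given $\mathcal{G}_3$ iff
$$P(A_1 \mid \mathcal{G}_2 \vee \mathcal{G}_3) = P(A_1 \mid \mathcal{G}_3) \quad \text{for all } A_1 \in \mathcal{G}_1,$$
iff $E(X \mid \mathcal{G}_2 \vee \mathcal{G}_3) = E(X \mid \mathcal{G}_3)$ for every $X \in L^1(\Omega, \mathcal{G}_1 \vee \mathcal{G}_3, P)$.

12.19 Let $\mathcal{G}_1, \mathcal{G}_2, \mathcal{G}_3$ be sub-σ-algebras of $\mathcal{F}$. Show that if $\mathcal{G}_1 \vee \mathcal{G}_3$ is independent of $\mathcal{G}_2$, then $\mathcal{G}_1$ and $\mathcal{G}_2$ are conditionally independent given $\mathcal{G}_3$.

12.20 Give an example where $E\big(E(Y|X_1) \mid X_2\big) \ne E\big(E(Y|X_2) \mid X_1\big)$.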


12.21 Let $X$ be an Exponential(1) random variable. For $t > 0$, let $Y_1 = \min\{X, t\}$ and $Y_2 = \max\{X, t\}$. Find $E(X|Y_i)$, $i = 1, 2$. (Hint: Verify that $\sigma\langle Y_1\rangle$ is the σ-algebra generated by the collection $\{X^{-1}(A) : A \in \mathcal{B}(\mathbb{R}), A \subset [0, t)\} \cup \{X^{-1}[t, \infty)\}$.)

12.22 Let $(X, Y)$ be a bivariate random vector with a joint pdf $f(x, y)$ w.r.t. the Lebesgue measure. Show that $E(X \mid X + Y) = h(X + Y)$ where
$$h(z) = \frac{\int x f(x, z - x)\,dx}{\int f(x, z - x)\,dx}\; I_{(0, \infty)}\Big( \int f(x, z - x)\,dx \Big).$$

12.23 Let $\{X_i\}_{i \ge 1}$ be iid random variables with $E|X_1| < \infty$. Show that for any $n \ge 1$,
$$E\big( X_1 \mid (X_1 + X_2 + \cdots + X_n) \big) = \frac{X_1 + X_2 + \cdots + X_n}{n}.$$
(Hint: Show that $E\big(X_i \mid (X_1 + \cdots + X_n)\big)$ is the same for all $1 \le i \le n$.)

Definition 12.4.2: A finite collection of random variables $\{X_i : 1 \le i \le n\}$ on a probability space $(\Omega, \mathcal{F}, P)$ is said to be exchangeable if for any permutation $(i_1, i_2, \ldots, i_n)$ of $(1, 2, \ldots, n)$, the joint distribution of $(X_{i_1}, X_{i_2}, \ldots, X_{i_n})$ is the same as that of $(X_1, X_2, \ldots, X_n)$. A sequence of random variables $\{X_i\}_{i \ge 1}$ on a probability space $(\Omega, \mathcal{F}, P)$ is said to be exchangeable if for any finite $n$, the collection $\{X_i : 1 \le i \le n\}$ is exchangeable.

12.24 Let $\{X_i : 1 \le i \le n + 1\}$ be a finite collection of random variables such that, conditional on $X_{n+1}$, $\{X_1, X_2, \ldots, X_n\}$ are iid. Show that $\{X_i : 1 \le i \le n\}$ is exchangeable.

12.25 Let $\{X_i : 1 \le i \le n\}$ be exchangeable. Suppose $E|X_1| < \infty$. Show that
$$E\big( X_1 \mid (X_1 + \cdots + X_n) \big) = \frac{X_1 + X_2 + \cdots + X_n}{n}.$$

12.26 Let $(X_1, X_2, X_3)$ be random variables such that
$$P(X_2 \in \cdot \mid X_1) = p_1(X_1, \cdot) \quad \text{and} \quad P(X_3 \in \cdot \mid X_1, X_2) = p_2(X_2, \cdot),$$
where for each $i = 1, 2$, $p_i(x, \cdot)$ is a probability transition function on $\mathbb{R}$ as defined in Example 6.3.8. Suppose $p_i(x, \cdot)$ admits a pdf $f_i(x, \cdot)$, $i = 1, 2$. Show that $P(X_1 \in \cdot \mid X_2, X_3) = P(X_1 \in \cdot \mid X_2)$. (This says that if $\{X_1, X_2, X_3\}$ has the Markov property, then so does $\{X_3, X_2, X_1\}$.)


12.27 (Rao-Blackwell theorem). Let $Y \in L^2(\Omega, \mathcal{F}, P)$ and $\mathcal{G} \subset \mathcal{F}$ be a sub-σ-algebra. Show that there exists $Z \in L^2(\Omega, \mathcal{G}, P)$ such that $EZ = EY$ and $\mathrm{Var}(Z) \le \mathrm{Var}(Y)$. (Hint: Consider $Z = E(Y|\mathcal{G})$.)

12.28 Let $(X, Y)$ have an absolutely continuous bivariate distribution with density $f_{X,Y}(x, y)$. Show that there is a regular conditional probability on $\sigma\langle Y\rangle$ given $\sigma\langle X\rangle$ and that this probability measure induces an absolutely continuous distribution on $\mathbb{R}$. Find its density.

12.29 Suppose, in the above problem,
$$f_{X,Y}(x, y) = \frac{1}{\sigma(x)}\, \phi\Big( \frac{y - m(x)}{\sigma(x)} \Big)\, g(x)$$
where $m(\cdot)$, $\sigma(\cdot)$, $\phi(\cdot)$, and $g(\cdot)$ are all Borel measurable functions on $\mathbb{R}$ to $\mathbb{R}$ with $\sigma$, $\phi$, and $g$ being nonnegative and $\phi$ and $g$ being probability densities. (a) Find the marginal probability densities $f_X(\cdot)$ and $f_Y(\cdot)$ of $X$ and $Y$, respectively. Set up the integrals for $EX$ and $EY$. (b) Using the conditioning argument in Proposition 12.1.5, show that
$$EY = \int m(x) g(x)\,dx + \Big( \int u \phi(u)\,du \Big) \Big( \int \sigma(x) g(x)\,dx \Big)$$
(assuming that all the integrals are well defined). (c) Find similar expressions for $EY^2$ and $E(e^{tY})$.

12.30 Let $X, Y, Z \in L^1(\Omega, \mathcal{F}, P)$. Suppose that $E(X|Y) = Z$, $E(Y|Z) = X$, $E(Z|X) = Y$. Show that $X = Y = Z$ w.p. 1.

12.31 Let $X, Y \in L^2(\Omega, \mathcal{F}, P)$. Suppose $E|Y|^4 < \infty$. Show that
$$\min\big\{ E|X - (a + bY + cY^2)|^2 : a, b, c \in \mathbb{R} \big\} = \max\big\{ E(XZ) : Z \in L^2(\Omega, \mathcal{F}, P),\ EZ = 0,\ EZY = 0,\ EZY^2 = 0,\ EZ^2 = 1 \big\}.$$

12.32 Let $X \in L^2(\Omega, \mathcal{F}, P)$ and $\mathcal{G}$ be a sub-σ-algebra of $\mathcal{F}$. (a) Show that
$$\min\big\{ E(X - Y)^2 : Y \in L^2(\Omega, \mathcal{G}, P) \big\} = \max\big\{ (EXZ)^2 : EZ^2 = 1,\ E(Z|\mathcal{G}) = 0 \big\}.$$
(b) Find a random variable $Z$ such that $E(Z|\mathcal{G}) = 0$ w.p. 1 and $\rho \equiv \mathrm{corr}(X, Z)$ is maximized.

13 Discrete Parameter Martingales

13.1 Definitions and examples

This section deals with a class of stochastic processes called martingales. Martingales arise in a natural way in many problems in probability and statistics. They provide a more general framework than the case of independent random variables, one in which results like the SLLN, the CLT, and other convergence theorems can still be established. Much of the discrete parameter martingale theory was developed by the great American mathematician J. L. Doob, whose book (Doob (1953)) has been very influential.

Definition 13.1.1: Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $N = \{1, \ldots, n_0\}$ be a nonempty subset of $\mathbb{N} = \{1, 2, \ldots\}$, $n_0 \le \infty$.

(a) A collection $\{\mathcal{F}_n : n \in N\}$ of sub-σ-algebras of $\mathcal{F}$ is called a filtration if $\mathcal{F}_n \subset \mathcal{F}_{n+1}$ for all $1 \le n < n_0$.

(b) A collection of random variables $\{X_n : n \in N\}$ is said to be adapted to the filtration $\{\mathcal{F}_n : n \in N\}$ if $X_n$ is $\mathcal{F}_n$-measurable for all $n \in N$.

(c) Given a filtration $\{\mathcal{F}_n : n \in N\}$ and random variables $\{X_n : n \in N\}$, the collection $\{(X_n, \mathcal{F}_n) : n \in N\}$ is called a martingale if (i) $\{X_n : n \in N\}$ is adapted to $\{\mathcal{F}_n : n \in N\}$, (ii) $E|X_n| < \infty$ for all $n \in N$, and (iii) for all $1 \le n < n_0$,
$$E(X_{n+1}|\mathcal{F}_n) = X_n. \tag{1.1}$$


When $N = \mathbb{N}$, there is no maximum element in $N$. In this case, Definition 13.1.1 is to be interpreted by setting $n_0 = +\infty$ in parts (a) and (c) (iii). A similar convention applies to Definition 13.1.2 below. Also, recall that equalities and inequalities involving conditional expectations are interpreted as being valid w.p. 1. If $\{(X_n, \mathcal{F}_n) : n \in N\}$ is a martingale, then $\{X_n : n \in N\}$ is also said to be a martingale w.r.t. (the filtration) $\{\mathcal{F}_n : n \in N\}$. Also $\{X_n : n \in N\}$ is called a martingale if it is a martingale w.r.t. some filtration. Observe that if $\{X_n : n \in N\}$ is a martingale w.r.t. any given filtration $\{\mathcal{F}_n : n \in N\}$, it is also a martingale w.r.t. the natural filtration $\{\mathcal{X}_n : n \in N\}$, where $\mathcal{X}_n = \sigma\{X_1, \ldots, X_n\}$, $n \in N$. Clearly, $\{X_n : n \in N\}$ is adapted to $\{\mathcal{X}_n : n \in N\}$. To see that $E(X_{n+1}|\mathcal{X}_n) = X_n$ for all $1 \le n < n_0$, note that $\mathcal{X}_n \subset \mathcal{F}_n$ for all $n \in N$ and hence,
$$E(X_{n+1}|\mathcal{X}_n) = E\big( E(X_{n+1}|\mathcal{F}_n) \mid \mathcal{X}_n \big) = E(X_n|\mathcal{X}_n) = X_n. \tag{1.2}$$

Thus, $\{(X_n, \mathcal{X}_n) : n \in N\}$ is a martingale.

A classic interpretation of martingales in the context of gambling is given as follows. Let $X_n$ represent the fortune of a gambler at the end of the $n$th play and let $\mathcal{F}_n$ be the information available to the gambler up to and including the $n$th play. Then, $\mathcal{F}_n$ contains the knowledge of all events like $\{X_j \le r\}$ for $r \in \mathbb{R}$, $j \le n$, making $X_n$ measurable w.r.t. $\mathcal{F}_n$. Condition (iii) in Definition 13.1.1 (c) says that given all the information up until the end of the $n$th play, the expected fortune of the gambler at the end of the $(n + 1)$th play remains unchanged. Thus a martingale represents a fair game. In situations where the game puts the gambler in a favorable or unfavorable position, one may express that by suitably modifying condition (iii), yielding what are known as sub- and super-martingales, respectively.

Definition 13.1.2: Let $\{\mathcal{F}_n : n \in N\}$ be a filtration and $\{X_n : n \in N\}$ be a collection of random variables in $L^1(\Omega, \mathcal{F}, P)$ adapted to $\{\mathcal{F}_n : n \in N\}$. Then $\{(X_n, \mathcal{F}_n) : n \in N\}$ is called a sub-martingale if
$$E(X_{n+1}|\mathcal{F}_n) \ge X_n \quad \text{for all } 1 \le n < n_0, \tag{1.3}$$
and a super-martingale if
$$E(X_{n+1}|\mathcal{F}_n) \le X_n \quad \text{for all } 1 \le n < n_0. \tag{1.4}$$

Suppose that $\{(X_n, \mathcal{F}_n) : n \in N\}$ is a sub-martingale. Then $A \in \mathcal{F}_n$ implies that $A \in \mathcal{F}_{n+1} \subset \cdots \subset \mathcal{F}_{n+k}$ for every $k \ge 1$, $n + k \in N$ and hence, by (1.3),
$$\int_A X_n\,dP \le \int_A E(X_{n+1}|\mathcal{F}_n)\,dP = \int_A X_{n+1}\,dP \le \cdots \le \int_A X_{n+k}\,dP. \tag{1.5}$$

Therefore, $E(X_{n+k}|\mathcal{F}_n) \ge X_n$ and, by taking $A = \Omega$ in (1.5), $EX_{n+k} \ge EX_n$. Thus, the expected values of a sub-martingale are nondecreasing. For a martingale, by (1.2), equality holds at every step of (1.5) and hence,
$$E(X_{n+k}|\mathcal{F}_n) = X_n, \quad EX_{n+k} = EX_n \tag{1.6}$$
for all $k \ge 1$, $n, n + k \in N$. Thus, in a fair game, the expected fortune of the gambler remains constant over time. Here are some examples.

Example 13.1.1: (Random walk). Let $Z_1, Z_2, \ldots$ be a sequence of iid random variables on a probability space $(\Omega, \mathcal{F}, P)$ with finite mean $\mu = EZ_1$ and let $\mathcal{F}_n = \sigma\langle Z_1, \ldots, Z_n\rangle$, $n \ge 1$. Let $X_n = Z_1 + \cdots + Z_n$, $n \ge 1$. Then, for all $n \ge 1$, $\sigma\langle X_n\rangle \subset \mathcal{F}_n$ and $E|X_n| < \infty$. Also,
$$E(X_{n+1}|\mathcal{F}_n) = E\big( (Z_1 + \cdots + Z_{n+1}) \mid Z_1, \ldots, Z_n \big) = Z_1 + \cdots + Z_n + EZ_{n+1} \ \text{(by independence)} = X_n + \mu,$$
so that $E(X_{n+1}|\mathcal{F}_n) = X_n$ if $\mu = 0$ and $E(X_{n+1}|\mathcal{F}_n) > X_n$ if $\mu > 0$

n} ∈ Fn for all n ≥ 1. However, the condition ‘{T ≥ n} ∈ Fn

for all n ≥ 1’

(2.5)

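does not always imply that $T$ is a stopping time w.r.t. $\{\mathcal{F}_n\}_{n \ge 1}$ (cf. Problem 13.7). Note that for a stopping time $T$ w.r.t. $\{\mathcal{F}_n\}_{n \ge 1}$,
$$\{T \ge n\} = \{T \le n - 1\}^c \in \mathcal{F}_{n-1} \quad \text{for } n \ge 2,$$
and $\{T \ge 1\} = \Omega$. Since $\mathcal{F}_{n-1} \subset \mathcal{F}_n$ for all $n \ge 2$, (2.5) is a weaker requirement than $T$ being a stopping time w.r.t. $\{\mathcal{F}_n\}_{n \ge 1}$.

Proposition 13.2.1: Let $T$ be a stopping time w.r.t. $\{\mathcal{F}_n\}_{n \ge 1}$ and let $\mathcal{F}_\infty$ be as in (2.3). Define
$$\mathcal{F}_T = \{A \in \mathcal{F}_\infty : A \cap \{T = n\} \in \mathcal{F}_n \ \text{for all } n \ge 1\}. \tag{2.6}$$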

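Then, $\mathcal{F}_T$ is a σ-algebra.

Proof: Left as an exercise (Problem 13.8). □

If $T$ is a stopping time w.r.t. $\{\mathcal{F}_n\}_{n \ge 1}$, then for any $m \in \mathbb{N}$, $\{T = m\} \cap \{T = n\} = \emptyset \in \mathcal{F}_n$ for all $n \ne m$ and $\{T = m\} \cap \{T = n\} = \{T = m\} \in \mathcal{F}_m$ for $n = m$. Thus, $\{T = m\} \in \mathcal{F}_T$ for all $m \ge 1$ and hence, $\sigma\langle T\rangle \subset \mathcal{F}_T$. But the reverse inclusion may not hold as shown below.

Example 13.2.1: Let $T \equiv m$ for some given integer $m \ge 1$. Then, $T$ is a stopping time w.r.t. any filtration $\{\mathcal{F}_n\}_{n \ge 1}$. For this $T$,
$$A \in \mathcal{F}_T \Rightarrow A \cap \{T = m\} \in \mathcal{F}_m \Rightarrow A \in \mathcal{F}_m,$$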


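so that $\mathcal{F}_T \subset \mathcal{F}_m$. Conversely, suppose $A \in \mathcal{F}_m$. Then, $A \cap \{T = n\} = \emptyset \in \mathcal{F}_n$ for all $n \ne m$, and $A \cap \{T = m\} = A \in \mathcal{F}_m$ for $n = m$. Thus, $\mathcal{F}_m = \mathcal{F}_T$. But $\sigma\langle T\rangle = \{\Omega, \emptyset\}$.

Example 13.2.2: Let $\{X_n\}_{n \ge 1}$ be a sequence of random variables adapted to a filtration $\{\mathcal{F}_n\}_{n \ge 1}$ and let $\{B_n\}_{n \ge 1}$ be a sequence of Borel sets in $\mathbb{R}$. Define the random variable
$$T = \inf\{n \ge 1 : X_n \in B_n\}. \tag{2.7}$$
Then, $T(\omega) < \infty$ if $X_n(\omega) \in B_n$ for some $n \in \mathbb{N}$ and $T(\omega) = +\infty$ if $X_n(\omega) \notin B_n$ for all $n \in \mathbb{N}$. Since, for any $n \ge 1$,
$$\{T = n\} = \{X_1 \notin B_1, \ldots, X_{n-1} \notin B_{n-1}, X_n \in B_n\} \in \mathcal{F}_n,$$
$T$ is a stopping time w.r.t. $\{\mathcal{F}_n\}_{n \ge 1}$. Now, define a new random variable $X_T$ by
$$X_T = \begin{cases} X_m & \text{if } T = m \\ \limsup_{n \to \infty} X_n & \text{if } T = \infty. \end{cases} \tag{2.8}$$
Then, $X_T \in \bar{\mathbb{R}}$ and for any $n \ge 1$ and $r \in \mathbb{R}$, $\{X_T \le r\} \cap \{T = n\} = \{X_n \le r\} \cap \{T = n\} \in \mathcal{F}_n$. Also, $\{X_T = \pm\infty\} \cap \{T = n\} = \{X_n = \pm\infty\} \cap \{T = n\} = \emptyset$ for all $n \ge 1$. Hence, it follows that $X_T$ is $\langle\mathcal{F}_T, \mathcal{B}(\bar{\mathbb{R}})\rangle$-measurable.

Example 13.2.3: Let $\{Y_n\}_{n \ge 1}$ be a sequence of iid random variables with $EY_1 = \mu$. Let $X_n = (Y_1 + \cdots + Y_n)$, $n \ge 1$, denote the random walk corresponding to $\{Y_n\}_{n \ge 1}$. For $x > 0$, let
$$T_x = \inf\{n \ge 1 : X_n > n\mu + x\sqrt{n}\}. \tag{2.9}$$
Then, $T_x$ is the first time the sequence of partial sums $\{X_n\}_{n \ge 1}$ exceeds the level $n\mu + x\sqrt{n}$ and is a special case of (2.7) with $B_n = (n\mu + x\sqrt{n}, \infty)$, $n \ge 1$. Consequently, $T_x$ is a stopping time w.r.t. $\mathcal{F}_n = \sigma\langle Y_1, Y_2, \ldots, Y_n\rangle$, $n \ge 1$. Note that if $EY_1^2 < \infty$, by the law of iterated logarithm (cf. 8.7), with $\sigma^2 = \mathrm{Var}(Y_1)$,
$$\limsup_{n \to \infty} \frac{X_n - n\mu}{\sqrt{2\sigma^2 n \log\log n}} = 1 \quad \text{w.p. 1},$$
i.e.,
$$X_n > n\mu + C\sqrt{n \log\log n} \quad \text{infinitely often w.p. 1}$$
for some constant $C > 0$. Thus, $P(T_x < \infty) = 1$ and hence, $T_x$ is a finite stopping time. This random variable $T_x$ arises in sequential probability ratio tests (SPRT) for testing hypotheses on the mean of a (normal) population. See Woodroofe (1982), Chapter 3.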
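The first-entrance time in (2.7) can be computed path by path. The following Python sketch is only an illustration under stated assumptions: the function names `first_hitting_time`, `partial_sums`, and `crossing` are hypothetical, the path is a short hand-picked sequence of steps, and the sets $B_n$ are encoded by a predicate. It uses the boundary of (2.9) with $\mu = 0$ and $x = 1$, and shows that the value of $T$ is decided by the path up to time $T$ alone, which is the stopping time property.

```python
import math
from typing import Callable, List, Sequence


def first_hitting_time(path: Sequence[float],
                       in_set: Callable[[int, float], bool]) -> float:
    """Return T = inf{n >= 1 : x_n in B_n} for a finite path x_1, ..., x_N.

    `in_set(n, x)` is a hypothetical predicate encoding the set B_n.
    If the path never enters B_n, math.inf is returned (inf of the empty set).
    """
    for n, x in enumerate(path, start=1):
        if in_set(n, x):
            return n
    return math.inf


def partial_sums(steps: Sequence[float]) -> List[float]:
    """Random-walk positions X_n = Y_1 + ... + Y_n."""
    total, out = 0.0, []
    for y in steps:
        total += y
        out.append(total)
    return out


def crossing(n: int, s: float) -> bool:
    """B_n = (sqrt(n), infinity): the boundary of (2.9) with mu = 0 and x = 1."""
    return s > math.sqrt(n)


steps = [1, -1, 1, 1, 1, 1, -1]          # assumed illustrative steps
path = partial_sums(steps)               # positions 1, 0, 1, 2, 3, 4, 3
T = first_hitting_time(path, crossing)   # first n with X_n > sqrt(n)

# Changing the path after time T does not change T.
T_modified = first_hitting_time(path[:5] + [100.0, -100.0], crossing)
```

Because the loop returns at the first index where the predicate holds, the event $\{T = n\}$ is a function of $x_1, \ldots, x_n$ only.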


Definition 13.2.2: Let $\{\mathcal{F}_n\}_{n \ge 0}$ be a filtration in a probability space $(\Omega, \mathcal{F}, P)$. A betting sequence w.r.t. $\{\mathcal{F}_n\}_{n \ge 0}$ is a sequence $\{H_n\}_{n \ge 1}$ of nonnegative random variables such that for each $n \ge 1$, $H_n$ is $\mathcal{F}_{n-1}$-measurable.

The following result says that there is no betting scheme that can beat a gambling system, i.e., convert a fair one into a favorable one or the other way around.

Theorem 13.2.2: (Betting theorem). Let $\{\mathcal{F}_n\}_{n \ge 0}$ be a filtration in a probability space. Let $\{H_n\}_{n \ge 1}$ be a betting sequence w.r.t. $\{\mathcal{F}_n\}_{n \ge 0}$. For an adapted sequence $\{X_n, \mathcal{F}_n\}_{n \ge 0}$ let $\{Y_n\}_{n \ge 0}$ be defined by $Y_0 = X_0$, $Y_n = Y_0 + \sum_{j=1}^{n} (X_j - X_{j-1}) H_j$, $n \ge 1$. Let $E|(X_j - X_{j-1}) H_j| < \infty$ for $j \ge 1$. Then,

(i) $\{X_n, \mathcal{F}_n\}_{n \ge 0}$ a martingale ⇒ $\{Y_n, \mathcal{F}_n\}_{n \ge 0}$ is also a martingale,

(ii) $\{X_n, \mathcal{F}_n\}_{n \ge 0}$ a sub-martingale ⇒ $\{Y_n, \mathcal{F}_n\}_{n \ge 0}$ is also a sub-martingale.

Proof: Clearly, for all $n \ge 1$, $E|Y_n| < \infty$ and $Y_n$ is $\mathcal{F}_n$-measurable. Further,
$$E\big(Y_{n+1} \mid \mathcal{F}_n\big) = Y_n + E\big( (X_{n+1} - X_n) H_{n+1} \mid \mathcal{F}_n \big) = Y_n + H_{n+1} E\big( (X_{n+1} - X_n) \mid \mathcal{F}_n \big)$$
since $H_{n+1}$ is $\mathcal{F}_n$-measurable. Since $H_{n+1} \ge 0$, the theorem now follows from the defining properties of $\{X_n, \mathcal{F}_n\}_{n \ge 0}$. □

The above theorem leads to the following results known as Doob's optional stopping theorems.

Theorem 13.2.3: (Doob's optional stopping theorem I). Let $\{X_n, \mathcal{F}_n\}_{n \ge 0}$ be a sub-martingale. Let $T$ be a stopping time w.r.t. $\{\mathcal{F}_n\}_{n \ge 0}$. Let $\tilde{X}_n \equiv X_{T \wedge n}$, $n \ge 0$. Then $\{\tilde{X}_n, \mathcal{F}_n\}_{n \ge 0}$ is also a sub-martingale and hence $E\tilde{X}_n \ge EX_0$ for all $n \ge 1$.

Proof: For any $A \in \mathcal{B}(\mathbb{R})$ and $n \ge 0$,
$$\tilde{X}_n^{-1}(A) = \{\omega : \tilde{X}_n \in A\} = \bigcup_{j=1}^{n} \{\omega : X_j \in A, T = j\} \cup \{\omega : X_n \in A, T > n\}.$$
Since $T$ is a stopping time w.r.t. $\{\mathcal{F}_n\}_{n \ge 0}$, the right side above belongs to $\mathcal{F}_n$ for each $n \ge 0$. Next, $|\tilde{X}_n| \le \sum_{j=1}^{n} |X_j|$ and hence $E|\tilde{X}_n| < \infty$. Finally, let $H_j = 1$ if $j \le T$ and $0$ if $j > T$. Since for all $j \ge 1$, $\{\omega : H_j = 1\} = \{\omega : T \le j - 1\}^c \in \mathcal{F}_{j-1}$, $\{H_j\}_{j \ge 1}$ is a betting sequence w.r.t. $\{\mathcal{F}_n\}_{n \ge 0}$. Also, $\tilde{X}_n = X_0 + \sum_{j=1}^{n} (X_j - X_{j-1}) H_j$. Now the betting theorem (Theorem 13.2.2) implies the present theorem. □
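The betting theorem and Theorem 13.2.3 can be checked exactly on a small case. The Python sketch below is an illustration under assumptions that are not in the text: the underlying martingale is the symmetric simple random walk with $X_0 = 0$, the betting sequence is a hypothetical "double the stake after a loss" rule, and the stopping time is the first visit to $+1$. All names (`expected_value`, `doubling_fortune`, `stopped_walk`) are made up for the example. Expectations are computed by enumerating all $2^n$ equally likely step sequences with exact rational arithmetic.

```python
from fractions import Fraction
from itertools import product


def expected_value(n, payoff):
    """Exact mean of payoff(steps) over all 2**n equally likely +/-1 step sequences."""
    paths = list(product((-1, 1), repeat=n))
    return sum(Fraction(payoff(p)) for p in paths) / len(paths)


def doubling_fortune(steps):
    """Y_n = sum_j H_j (X_j - X_{j-1}) with X_0 = Y_0 = 0.

    H_1 = 1; the stake doubles after a loss and resets to 1 after a win, so
    H_j is a function of the steps before play j (F_{j-1}-measurable).
    """
    fortune, stake = 0, 1
    for s in steps:
        fortune += stake * s
        stake = 1 if s == 1 else 2 * stake
    return fortune


def stopped_walk(steps):
    """X_{T ^ n}, where T is the first time the walk reaches +1."""
    x = 0
    for s in steps:
        x += s
        if x == 1:
            break
    return x


n = 10
mean_betting = expected_value(n, doubling_fortune)   # E Y_n
mean_stopped = expected_value(n, stopped_walk)       # E X_{T ^ n}
# P(T > n): the walk has not reached +1 within n plays.
prob_not_stopped = expected_value(n, lambda p: 0 if stopped_walk(p) == 1 else 1)
```

Both means come out equal to $EX_0 = 0$ for every horizon $n$, even though the stopped walk equals $1$ on the event $\{T \le n\}$; the remaining probability mass on $\{T > n\}$ carries negative values that balance it.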


Remark 13.2.1: If $\{X_n, \mathcal{F}_n\}_{n \ge 0}$ is a martingale, then both $\{X_n, \mathcal{F}_n\}_{n \ge 0}$ and $\{-X_n, \mathcal{F}_n\}_{n \ge 0}$ are sub-martingales, and hence the above theorem implies that if $\{X_n, \mathcal{F}_n\}_{n \ge 0}$ is a martingale, then so is $\{\tilde{X}_n, \mathcal{F}_n\}_{n \ge 0}$, and hence $E\tilde{X}_n = EX_{T \wedge n} = EX_0 = EX_n$ for all $n \ge 1$. This suggests the question: if $P(T < \infty) = 1$, then on letting $n \to \infty$, does $E\tilde{X}_n \to EX_T$? Consider the following example. Let $\{X_n\}_{n \ge 0}$ denote the symmetric simple random walk on the integers with $X_0 = 0$. Let $T = \inf\{n : n \ge 1, X_n = 1\}$. Then $P(T < \infty) = 1$ and $E\tilde{X}_n = EX_{T \wedge n} = EX_0 = 0$, but $X_T = 1$ w.p. 1 and hence $E\tilde{X}_n \not\to EX_T = 1$. So, clearly some additional hypothesis is needed.

Theorem 13.2.4: (Doob's optional stopping theorem II). Let $\{X_n, \mathcal{F}_n\}_{n \ge 0}$ be a martingale. Let $T$ be a stopping time w.r.t. $\{\mathcal{F}_n\}_{n \ge 0}$. Suppose $P(T < \infty) = 1$ and there is a $0 < K < \infty$ such that for all $n \ge 0$, $|X_{T \wedge n}| \le K$ w.p. 1. Then $EX_T = EX_0$.

Proof: Since $P(T < \infty) = 1$, $X_{T \wedge n} \to X_T$ w.p. 1 and $|X_T| \le K < \infty$ and hence $E|X_T| < \infty$. Thus, $E|X_T - X_{T \wedge n}| \le 2K P(T > n) \to 0$. Since $EX_{T \wedge n} = EX_0$ for all $n$ by Remark 13.2.1, it follows that $EX_T = EX_0$. □

Remark 13.2.2: Since $X_{T \wedge n} = X_T I(T \le n) + X_n I(T > n)$ and
$$EX_T I(T \le n) = \sum_{j=0}^{n} E(X_j : T = j) = \sum_{j=0}^{n} E(X_n : T = j) = E(X_n : T \le n),$$
it follows that if $E|X_n| I(T > n) \to 0$ as $n \to \infty$ and $P(T < \infty) = 1$ then $EX_T = EX_0$. A stronger version of Doob's optional stopping theorem is given below in Theorem 13.2.6.

Proposition 13.2.5: Let $S$ and $T$ be two stopping times w.r.t. $\{\mathcal{F}_n\}_{n \ge 1}$ with $S \le T$. Then, $\mathcal{F}_S \subset \mathcal{F}_T$.

Proof: For any $A \in \mathcal{F}_S$ and $n \ge 1$,
$$A \cap \{T = n\} = A \cap \{T = n\} \cap \{S \le n\} = \bigcup_{k=1}^{n} \big( A \cap \{S = k\} \big) \cap \{T = n\} \in \mathcal{F}_n,$$


since $A \cap \{S = k\} \in \mathcal{F}_k$ for all $1 \le k \le n$. Thus, $A \in \mathcal{F}_T$, proving the result. □

Theorem 13.2.6: (Doob's optional stopping theorem III). Let $\{X_n, \mathcal{F}_n\}_{n \ge 1}$ be a sub-martingale and $S$ and $T$ be two finite stopping times w.r.t. $\{\mathcal{F}_n\}_{n \ge 1}$ such that $S \le T$. If $X_S$ and $X_T$ are integrable and if
$$\liminf_{n \to \infty} E|X_n| I(T > n) = 0, \tag{2.10}$$
then
$$E(X_T|\mathcal{F}_S) \ge X_S \quad \text{a.s.} \tag{2.11}$$
If, in addition, $\{X_n\}_{n \ge 1}$ is a martingale, then equality holds in (2.11).

Thus, Theorem 13.2.6 shows that if a martingale (or a sub-martingale) is stopped at random time points $S$ and $T$ with $S \le T$, then under very mild conditions, $\{(X_S, \mathcal{F}_S), (X_T, \mathcal{F}_T)\}$ continues to have the martingale (sub-martingale, respectively) property.

Proof: To show (2.11), it is enough to show that
$$\int_A (X_T - X_S)\,dP \ge 0 \quad \text{for all } A \in \mathcal{F}_S. \tag{2.12}$$
Fix $A \in \mathcal{F}_S$. Let $\{n_k\}_{k \ge 1}$ be a subsequence along which the "lim inf" is attained in (2.10). Let $T_k = \min\{T, n_k\}$ and $S_k = \min\{S, n_k\}$, $k \ge 1$. The proof of (2.12) involves showing that
$$\int_A (X_{T_k} - X_{S_k})\,dP \ge 0 \quad \text{for all } k \ge 1 \tag{2.13}$$
and
$$\lim_{k \to \infty} \int_A \big[ (X_T - X_S) - (X_{T_k} - X_{S_k}) \big]\,dP = 0. \tag{2.14}$$
Consider (2.13). Since $S_k \le T_k \le n_k$,
$$X_{T_k} - X_{S_k} = \sum_{n = S_k + 1}^{T_k} (X_n - X_{n-1}) = \sum_{n=2}^{n_k} (X_n - X_{n-1}) I(S_k + 1 \le n \le T_k). \tag{2.15}$$
Note that for all $2 \le n \le n_k$, $\{T_k \ge n\} = \{T_k \le n - 1\}^c = \{T \le n - 1\}^c \in \mathcal{F}_{n-1}$ and $\{S_k + 1 \le n\} = \{S_k \le n - 1\} = \{S \le n - 1\} \in \mathcal{F}_{n-1}$. Also, since $A \in \mathcal{F}_S$, $B_n \equiv A \cap \{S_k + 1 \le n \le T_k\} = (A \cap \{S_k + 1 \le n\}) \cap \{T_k \ge n\} \in \mathcal{F}_{n-1}$ for all $2 \le n \le n_k$. Hence, by the sub-martingale property of


$\{X_n\}_{n \ge 1}$, from (2.15),
$$\int_A (X_{T_k} - X_{S_k})\,dP = \sum_{n=2}^{n_k} \int_{A \cap \{S_k + 1 \le n \le T_k\}} (X_n - X_{n-1})\,dP = \sum_{n=2}^{n_k} \int_{B_n} \big[ E(X_n|\mathcal{F}_{n-1}) - X_{n-1} \big]\,dP \ge 0. \tag{2.16}$$
This proves (2.13). To prove (2.14), note that by (2.10) and the integrability of $X_S$ and $X_T$ and the DCT,
$$\lim_{k \to \infty} \Big| \int_A \big[ (X_T - X_S) - (X_{T_k} - X_{S_k}) \big]\,dP \Big| \le \lim_{k \to \infty} \int_A \big[ |X_T - X_{T_k}| + |X_S - X_{S_k}| \big]\,dP$$
$$\le \lim_{k \to \infty} \Big[ \int_{\{T > n_k\}} (|X_T| + |X_{n_k}|)\,dP + \int_{\{S > n_k\}} (|X_S| + |X_{n_k}|)\,dP \Big]$$
$$\le \lim_{k \to \infty} \Big[ \int_{\{T > n_k\}} |X_T|\,dP + 2\int_{\{T > n_k\}} |X_{n_k}|\,dP + \int_{\{S > n_k\}} |X_S|\,dP \Big] = 0,$$
since $\{S > n_k\} \subset \{T > n_k\}$ and $\{T > n_k\} \downarrow \emptyset$ as $k \to \infty$. This proves the theorem for the case when $\{X_n\}_{n \ge 1}$ is a sub-martingale. When $\{X_n\}_{n \ge 1}$ is a martingale, equality holds in the last line of (2.16), which implies equality in (2.13) and hence, in (2.12). This completes the proof. □

Remark 13.2.3: If there exists a $t_0 < \infty$ such that $P(T \le t_0) = 1$, then (2.10) holds.

Corollary 13.2.7: Let $\{X_n, \mathcal{F}_n\}_{n \ge 1}$ be a sub-martingale and let $T$ be a finite stopping time w.r.t. $\{\mathcal{F}_n\}_{n \ge 1}$ such that $E|X_T| < \infty$ and (2.10) holds. Then,
$$EX_T \ge EX_1. \tag{2.17}$$
If, in addition, $\{X_n\}_{n \ge 1}$ is a martingale, then equality holds in (2.17).

Proof: Follows from Theorem 13.2.6 by setting $S \equiv 1$. □

Corollary 13.2.8: Let $\{X_n, \mathcal{F}_n\}_{n \ge 1}$ be a sub-martingale. Let $\{T_n\}_{n \ge 1}$ be a sequence of stopping times such that (i) for all $n \ge 1$, $T_n \le T_{n+1}$ w.p. 1, (ii) for all $n \ge 1$, there exists a nonrandom $t_n \in (0, \infty)$ such that $P(T_n \le t_n) = 1$.


Let $\mathcal{G}_n \equiv \mathcal{F}_{T_n}$, $Y_n \equiv X_{T_n}$, $n \ge 1$. Then $\{Y_n, \mathcal{G}_n\}_{n \ge 1}$ is a sub-martingale. If $\{X_n, \mathcal{F}_n\}_{n \ge 1}$ is a martingale, then $\{Y_n, \mathcal{G}_n\}_{n \ge 1}$ is a martingale.

Proof: Use Theorem 13.2.6 and Remark 13.2.3. □

Corollary 13.2.9: Let $\{X_n, \mathcal{F}_n\}_{n \ge 1}$ be a sub-martingale. Let $T$ be a stopping time. Let $T_n = \min\{T, n\}$, $n \ge 1$. Then $\{X_{T_n}, \mathcal{F}_{T_n}\}_{n \ge 1}$ is a sub-martingale.

Note that this is a stronger version of Theorem 13.2.3 as $\mathcal{F}_{T_n} \subset \mathcal{F}_n$ for all $n \ge 1$.

Theorem 13.2.10: (Doob's maximal inequality). Let $\{X_n, \mathcal{F}_n\}_{n \ge 1}$ be a sub-martingale and let $M_m = \max\{X_1, \ldots, X_m\}$, $m \in \mathbb{N}$. Then, for any $m \in \mathbb{N}$ and $x \in (0, \infty)$,
$$P(M_m > x) \le \frac{E X_m^+ I(M_m > x)}{x} \le \frac{E X_m^+}{x}. \tag{2.18}$$

Proof: Fix $m \ge 1$, $x > 0$. Define a random variable $S$ by
$$S = \begin{cases} \inf\{k : 1 \le k \le m, X_k > x\} & \text{on } A \\ m & \text{on } A^c \end{cases}$$
where $A = \{X_k > x \text{ for some } 1 \le k \le m\} = \{M_m > x\}$. Then it is easy to check that $S$ is a stopping time w.r.t. $\{\mathcal{F}_n\}_{n \ge 1}$ and $S \le m$. Set $T \equiv m$. Then (2.10) holds and hence, by Theorem 13.2.6, $\{(X_S, \mathcal{F}_S), (X_m, \mathcal{F}_m)\}$ is a sub-martingale. Note that $A = \{M_m > x\} = \bigcup_{k=1}^{m} \{M_m > x, S = k\} = \bigcup_{k=1}^{m} \{X_S > x, S = k\} = \{X_S > x\} \in \mathcal{F}_S$. Hence, by Markov's inequality,
$$P(A) = P(X_S > x) \le \frac{1}{x} \int_{\{X_S > x\}} X_S\,dP \le \frac{1}{x} \int_A X_m\,dP \le \frac{1}{x} \int_A X_m^+\,dP \le \frac{E X_m^+}{x}.$$
□

Remark 13.2.4: An alternative proof of (2.18) is as follows. Let $A_1 = \{X_1 > x\}$, $A_k = \{X_1 \le x, X_2 \le x, \ldots, X_{k-1} \le x, X_k > x\}$ for $k = 2, \ldots, m$. Then $\bigcup_{i=1}^{m} A_i = A \equiv \{M_m > x\}$ and $A_k \in \mathcal{F}_k$ for all $k$. Now for $x > 0$,
$$P(M_m > x) = \sum_{k=1}^{m} P(A_k) \le \frac{1}{x} \sum_{k=1}^{m} E(X_k I_{A_k}).$$


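By the sub-martingale property of $\{X_n, \mathcal{F}_n\}_{n \ge 1}$,
$$E(X_k I_{A_k}) \le E(X_m I_{A_k}) \quad \text{for } k \le m.$$
Thus
$$P(M_m > x) \le \frac{1}{x} \sum_{k=1}^{m} E\big(X_m I_{A_k}\big) = \frac{1}{x} E(X_m I_A) \le \frac{1}{x} E\big(X_m^+ I_A\big) \le \frac{1}{x} E\big(X_m^+\big).$$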
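Both inequalities in (2.18) can be verified exactly on a small example. The Python sketch below assumes, purely for illustration, the sub-martingale $X_n = |S_n|$ built from a symmetric simple random walk $S_n$ (convex function of a martingale); the function name `maximal_inequality_terms` is hypothetical. It enumerates all $2^m$ paths and returns the three quantities appearing in (2.18) as exact fractions.

```python
from fractions import Fraction
from itertools import product


def maximal_inequality_terms(m, x):
    """Return (P(M_m > x), E[X_m^+ I(M_m > x)] / x, E[X_m^+] / x) for X_n = |S_n|.

    S_n is the symmetric simple random walk; all 2**m paths are equally likely.
    """
    paths = list(product((-1, 1), repeat=m))
    weight = Fraction(1, len(paths))
    prob = Fraction(0)    # P(M_m > x)
    mid = Fraction(0)     # E[X_m^+ I(M_m > x)]
    right = Fraction(0)   # E[X_m^+]
    for steps in paths:
        s, running_max = 0, 0
        for step in steps:
            s += step
            running_max = max(running_max, abs(s))
        x_m = abs(s)      # X_m equals X_m^+ because X_m >= 0 here
        right += weight * x_m
        if running_max > x:
            prob += weight
            mid += weight * x_m
    return prob, mid / x, right / x


results = {x: maximal_inequality_terms(8, x) for x in (1, 2, 3)}
```

For each level $x$ the returned triple is nondecreasing from left to right, as (2.18) asserts; the middle term is typically much closer to the probability than the cruder bound $EX_m^+/x$.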

Theorem 13.2.11: (Doob's $L^p$-maximal inequality for sub-martingales). Let $\{X_n, \mathcal{F}_n\}_{n \ge 1}$ be a sub-martingale and let $M_n = \max\{X_j : 1 \le j \le n\}$. Then, for any $p \in (1, \infty)$,
$$E(M_n^+)^p \le \Big( \frac{p}{p - 1} \Big)^p E(X_n^+)^p \le \infty. \tag{2.19}$$

Proof: If $E(X_n^+)^p = \infty$, then (2.19) holds trivially. Let $E(X_n^+)^p < \infty$. For $p > 1$, $\phi(x) = (x^+)^p$ is a convex nondecreasing function on $\mathbb{R}$. Hence, $\{(X_n^+)^p, \mathcal{F}_n\}_{n \ge 1}$ is a sub-martingale and $E(X_j^+)^p \le E(X_n^+)^p < \infty$ for all $j \le n$. Since $(M_n^+)^p \le \sum_{j=1}^{n} (X_j^+)^p$, this implies that $E(M_n^+)^p < \infty$. For any nonnegative random variable $Y$ and $p > 0$, by Tonelli's theorem,
$$EY^p = pE\Big( \int_0^Y x^{p-1}\,dx \Big) = pE\Big( \int_0^\infty x^{p-1} I(Y > x)\,dx \Big) = \int_0^\infty p x^{p-1} P(Y > x)\,dx.$$
Thus,
$$E(M_n^+)^p = \int_0^\infty p x^{p-1} P(M_n^+ > x)\,dx = \int_0^\infty p x^{p-1} P(M_n > x)\,dx.$$
By Theorem 13.2.10, for $x > 0$,
$$P(M_n > x) \le \frac{1}{x} E\big( X_n^+ I(M_n > x) \big),$$


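and hence
$$E(M_n^+)^p \le \int_0^\infty p x^{p-2} E\big( X_n^+ I(M_n > x) \big)\,dx = \frac{p}{p - 1} E\big( X_n^+ (M_n^+)^{p-1} \big) \le \frac{p}{p - 1} \big( E(X_n^+)^p \big)^{1/p} \big( E(M_n^+)^{(p-1)q} \big)^{1/q}$$
(by Hölder's inequality) where $q$ is the conjugate of $p$, i.e., $q = \frac{p}{p - 1}$. Since $(p - 1)q = p$ and $1 - 1/q = 1/p$, dividing by $\big( E(M_n^+)^p \big)^{1/q}$ gives
$$\big( E(M_n^+)^p \big)^{1/p} \le \frac{p}{p - 1} \big( E(X_n^+)^p \big)^{1/p},$$
proving (2.19). □

Corollary 13.2.12: Let $\{X_n, \mathcal{F}_n\}_{n \ge 1}$ be a martingale and let $\tilde{M}_n = \sup\{|X_j| : 1 \le j \le n\}$. Then, for $p \in (1, \infty)$,
$$E\tilde{M}_n^p \le \Big( \frac{p}{p - 1} \Big)^p E|X_n|^p.$$

Proof: Since $\{|X_n|, \mathcal{F}_n\}_{n \ge 1}$ is a sub-martingale, this follows from Theorem 13.2.11. □

Theorem 13.2.13: (Doob's $L \log L$ maximal inequality for sub-martingales). Let $\{X_n, \mathcal{F}_n\}_{n \ge 1}$ be a sub-martingale and $M_n = \max\{X_j : 1 \le j \le n\}$. Then
$$EM_n^+ \le \frac{e}{e - 1} \big( 1 + E(X_n^+ \log X_n^+) \big), \tag{2.20}$$
where $0 \log 0$ is interpreted as $0$.

Proof: As in the proof of Theorem 13.2.11,
$$EM_n^+ = \int_0^\infty P(M_n^+ > x)\,dx \le 1 + \int_1^\infty \frac{1}{x} E\big( X_n^+ I(M_n^+ > x) \big)\,dx = 1 + E(X_n^+ \log M_n^+). \tag{2.21}$$
For $x > 0$, $y > 0$,
$$x \log y = x \log x + x \log \frac{y}{x}.$$
Now $x \log \frac{y}{x} = y \phi\big( \frac{x}{y} \big)$ where $\phi(t) \equiv -t \log t$, $t > 0$. It can be verified that $\phi(t)$ attains its maximum $\frac{1}{e}$ at $t = \frac{1}{e}$. Thus
$$x \log y \le x \log x + \frac{y}{e}.$$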


So
$$EX_n^+ \log M_n^+ \le EX_n^+ \log X_n^+ + \frac{EM_n^+}{e}. \tag{2.22}$$
If $EX_n^+ \log X_n^+ = \infty$, (2.20) is trivially true. If $EX_n^+ \log X_n^+ < \infty$, then as in the proof of Theorem 13.2.11, it can be shown that $EM_n^+ < \infty$. Hence, the theorem follows from (2.21) and (2.22). □

A special case of Theorem 13.2.10 is the maximal inequality of Kolmogorov (cf. Section 8.3) as shown by the following example.

Example 13.2.4: Let $\{Y_n\}_{n \ge 1}$ be a sequence of independent random variables with $EY_n = 0$ and $E|Y_n|^\alpha < \infty$ for all $n \ge 1$ for some $\alpha \in (1, \infty)$. Let $S_n = Y_1 + \cdots + Y_n$, $n \ge 1$. Then, $\phi(x) \equiv |x|^\alpha$, $x > 0$, is a convex function, and hence, by Proposition 13.1.1, $X_n \equiv \phi(|S_n|)$, $n \ge 1$, is a sub-martingale w.r.t. $\mathcal{F}_n = \sigma\{Y_1, \ldots, Y_n\}$, $n \ge 1$. Now, by Theorem 13.2.10, for any $x > 0$ and $m \ge 1$,
$$P\Big( \max_{1 \le n \le m} |S_n| > x \Big) = P\Big( \max_{1 \le n \le m} X_n > x^\alpha \Big) \le x^{-\alpha} EX_m^+ \le x^{-\alpha} E|S_m|^\alpha. \tag{2.23}$$
Kolmogorov's inequality corresponds to the case where $\alpha = 2$.

Another application of the optional stopping theorem yields the following useful result.

Theorem 13.2.14: (Wald's lemmas). Let $\{Y_n\}_{n \ge 1}$ be a sequence of iid random variables and let $\{\mathcal{F}_n\}_{n \ge 1}$ be a filtration such that
$$\text{(i) } Y_n \text{ is } \mathcal{F}_n\text{-measurable and (ii) } \mathcal{F}_n \text{ and } \sigma\{Y_k : k \ge n + 1\} \text{ are independent for all } n \ge 1. \tag{2.24}$$
Also, let $T$ be a finite stopping time w.r.t. $\{\mathcal{F}_n\}_{n \ge 1}$ and $E|T| < \infty$. Let $S_n = Y_1 + \cdots + Y_n$, $n \ge 1$. Then,

(a) $E|Y_1| < \infty$ implies
$$ES_T = (EY_1)(ET). \tag{2.25}$$

(b) $EY_1^2 < \infty$ implies
$$E(S_T - T\,EY_1)^2 = \mathrm{Var}(Y_1) E(T). \tag{2.26}$$
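Wald's identities (2.25) and (2.26) can be checked exactly when $T$ is bounded. The Python sketch below rests on assumptions chosen only for illustration: the steps take the hypothetical values $-1$ and $2$ with probability $1/2$ each, $\mathcal{F}_n = \sigma\{Y_1, \ldots, Y_n\}$ (so (2.24) holds), and $T$ is the first time the partial sum reaches the level 3, capped at a horizon of 8 plays. The names `VALUES`, `HORIZON`, and `stop` are made up for the example.

```python
from fractions import Fraction
from itertools import product

VALUES = (Fraction(-1), Fraction(2))   # assumed step values, each with probability 1/2
HORIZON = 8                            # assumed cap, so that T is bounded


def stop(steps, level=3):
    """Return (T, S_T) with T = min{n : S_n >= level}, capped at len(steps)."""
    s = Fraction(0)
    for n, y in enumerate(steps, start=1):
        s += y
        if s >= level:
            return n, s
    return len(steps), s


mean_y = sum(VALUES) / len(VALUES)
var_y = sum((v - mean_y) ** 2 for v in VALUES) / len(VALUES)

# Enumerate all equally likely step sequences; T only looks at steps up to T.
paths = list(product(VALUES, repeat=HORIZON))
weight = Fraction(1, len(paths))
stopped = [stop(p) for p in paths]

e_t = sum(weight * t for t, _ in stopped)                          # E T
e_st = sum(weight * s for _, s in stopped)                         # E S_T
e_sq = sum(weight * (s - t * mean_y) ** 2 for t, s in stopped)     # E (S_T - T EY_1)^2
```

With exact rational arithmetic, `e_st` equals `mean_y * e_t` and `e_sq` equals `var_y * e_t`, matching (2.25) and (2.26) for this bounded stopping time.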

Proof: W.l.o.g., suppose that $EY_1 = 0$. Then, $\{S_n, \mathcal{F}_n\}_{n \ge 1}$ is a martingale. By Corollary 13.2.7, (2.25) would follow if one showed that (2.10) holds with $X_n = S_n$ and that $E|S_T| < \infty$. Since $|S_n| \le \sum_{i=1}^{n} |Y_i| \le \sum_{i=1}^{T} |Y_i|$ on the set $\{T \ge n\}$, both these conditions would hold if $E\big( \sum_{i=1}^{T} |Y_i| \big) < \infty$. Now,


by the MCT and the independence of $Y_i$ and $\{T \ge i\} = \{T \le i - 1\}^c \in \mathcal{F}_{i-1}$ for $i \ge 2$ and the fact that $\{T \ge 1\} = \Omega$, it follows that
$$E\Big( \sum_{i=1}^{T} |Y_i| \Big) = E\Big( \sum_{i=1}^{\infty} |Y_i| I(i \le T) \Big) = \sum_{i=1}^{\infty} E|Y_i| I(i \le T) = E|Y_1| \sum_{i=1}^{\infty} P(T \ge i) = E|Y_1| E|T| < \infty. \tag{2.27}$$
This proves (a). To prove (b), set $\sigma^2 = \mathrm{Var}(Y_1)$ and note that $EY_1 = 0$ ⇒ $\{S_n^2 - n\sigma^2, \mathcal{F}_n\}_{n \ge 1}$ is a martingale. Let $T_n = T \wedge n$, $n \ge 1$. Then, $T_n$ is a bounded stopping time w.r.t. $\{\mathcal{F}_n\}_{n \ge 1}$ and hence, by Theorem 13.2.6,
$$E[S_{T_n}^2 - (ET_n)\sigma^2] = E(S_1^2 - \sigma^2) = 0 \quad \text{for all } n \ge 1. \tag{2.28}$$
Thus, (2.26) holds with $T$ replaced by $T_n$. Since $T$ is a finite stopping time, $T_n \uparrow T < \infty$ w.p. 1 and therefore, $S_{T_n} \to S_T$ as $n \to \infty$, w.p. 1. Now applying Fatou's lemma and the MCT, from (2.28), one gets
$$ES_T^2 \le \liminf_{n \to \infty} ES_{T_n}^2 = \liminf_{n \to \infty} (ET_n)\sigma^2 = (ET)\sigma^2. \tag{2.29}$$
Also, note that for any $n \ge 1$,
$$E(S_T^2 - S_{T_n}^2) = E(S_T^2 - S_n^2) I(T > n) = E[(S_T - S_n)^2 + 2S_n(S_T - S_n)] I(T > n)$$
$$\ge 2ES_n(S_T - S_n) I(T > n) = 2ES_n(S_{T_{1n}} - S_n) = 2E[S_n \cdot E\{(S_{T_{1n}} - S_n)|\mathcal{F}_n\}], \tag{2.30}$$
where $T_{1n} = \max\{T, n\}$. Since $ET_{1n} \le ET + n < \infty$, and $\{T_{1n} > k\} = \{T > k\}$ for all $k > n$, the conditions of Theorem 13.2.6 hold with $X_n = S_n$, $S = n$ and $T = T_{1n}$. Hence, $E(S_{T_{1n}} - S_n|\mathcal{F}_n) = 0$ a.s. and by (2.30), $ES_T^2 \ge ES_{T_n}^2$ for all $n \ge 1$. Now letting $n \to \infty$ and using (2.28), one gets $ES_T^2 \ge (ET)\sigma^2$, as in (2.29). This completes the proof of (b). □

This section is concluded with the statement of an inequality relating the $p$th moment of a martingale to the $(p/2)$th moment of its squared variation.

Theorem 13.2.15: (Burkholder's inequality). Let $\{X_n, \mathcal{F}_n\}_{n \ge 1}$ be a martingale sequence. Let $\xi_j = X_j - X_{j-1}$, $j \ge 1$, with $X_0 = 0$. Then for any


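$p \in [2, \infty)$, there exist positive constants $A_p$ and $B_p$ such that
$$E|X_n|^p \le A_p E\Big( \sum_{i=1}^{n} \xi_i^2 \Big)^{p/2}$$
and
$$E|X_n|^p \le B_p \Big\{ E\Big( \sum_{i=1}^{n} E\big( \xi_i^2 | \mathcal{F}_{i-1} \big) \Big)^{p/2} + \sum_{i=1}^{n} E|\xi_i|^p \Big\}.$$
For a proof, see Chow and Teicher (1997).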

13.3 Martingale convergence theorems

The martingale (or sub- or super-martingale) property of a sequence of random variables $\{X_n\}_{n \ge 1}$ implies, under some mild additional conditions, a remarkable regularity, namely, that $\{X_n\}_{n \ge 1}$ converges w.p. 1 as $n \to \infty$. For example, any nonnegative super-martingale converges w.p. 1. Also any sub-martingale $\{X_n\}_{n \ge 1}$ for which $\{E|X_n|\}_{n \ge 1}$ is bounded converges w.p. 1. Further, if $\{E|X_n|^p\}_{n \ge 1}$ is bounded for some $p \in (1, \infty)$, then $X_n$ converges w.p. 1 and in $L^p$ as well. The proof of these assertions depends crucially on an ingenious inequality due to Doob. Recall that one way to prove that a sequence of real numbers $\{x_n\}_{n \ge 1}$ converges as $n \to \infty$ is to show that it does not oscillate too much as $n \to \infty$. That is, for all $a < b$, the number of times the sequence goes from below $a$ to above $b$ is finite. This number is referred to as the number of upcrossings from $a$ to $b$. Doob's upcrossing lemma (see Theorem 13.3.1 below) shows that for a sub-martingale, the mean of the upcrossings can be bounded above.

First, a formal definition of upcrossings of a given sequence $\{x_j : 1 \le j \le n\}$ of real numbers from level $a$ to level $b$ with $a < b$ is given. Let
$$N_1 = \min\{j : 1 \le j \le n, x_j \le a\}, \qquad N_2 = \min\{j : N_1 < j \le n, x_j \ge b\},$$
and define recursively,
$$N_{2k-1} = \min\{j : N_{2k-2} < j \le n, x_j \le a\}, \qquad N_{2k} = \min\{j : N_{2k-1} < j \le n, x_j \ge b\}.$$
If any of these sets on the right side is empty, all subsequent ones will be empty as well and the corresponding $N_k$'s will not be well defined. If $N_1$ or $N_2$ is not well defined, then set $U\big( \{x_j\}_{j=1}^{n}; a, b \big)$, the number of upcrossings of the interval $(a, b)$ by $\{x_j\}_{j=1}^{n}$, equal to zero. Otherwise let $N_\ell$ be the last


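one that is well defined. Set $U\big( \{x_j\}_{j=1}^{n}; a, b \big) = \frac{\ell}{2}$ if $\ell$ is even and $\frac{\ell - 1}{2}$ if $\ell$ is odd, where $\ell$ is the index of the last well-defined $N_\ell$.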
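The recursive definition of upcrossings translates directly into a single pass over the sequence. The Python function below is a minimal sketch (the name `count_upcrossings` is hypothetical): it waits for a value at or below $a$, then for a later value at or above $b$, counts one upcrossing, and repeats.

```python
def count_upcrossings(xs, a, b):
    """Number of upcrossings of (a, b) by the finite sequence xs, with a < b.

    Mirrors the recursive definition: N_1 is the first index with x_j <= a,
    N_2 the first later index with x_j >= b, and so on. Each completed pair
    (N_{2k-1}, N_{2k}) contributes one upcrossing.
    """
    if not a < b:
        raise ValueError("need a < b")
    count = 0
    below = False   # True once a value <= a has been seen since the last upcrossing
    for x in xs:
        if not below:
            if x <= a:
                below = True
        elif x >= b:
            count += 1
            below = False
    return count


example = count_upcrossings([0, 2, 0, 2, 1], a=0.5, b=1.5)
```

Since $a < b$, a single value cannot be both at or below $a$ and at or above $b$, so the strict inequalities $N_{2k-1} < N_{2k}$ in the definition are respected automatically.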

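Theorem 13.3.1: (Doob's upcrossing lemma). Let $\{X_j, \mathcal{F}_j\}_{j=1}^{n}$ be a sub-martingale and let $a < b$ be real numbers. Let $U_n \equiv U\big( \{X_j\}_{j=1}^{n}; a, b \big)$. Then
$$EU_n \le \frac{E(X_n - a)^+ - E(X_1 - a)^+}{(b - a)} \le \frac{EX_n^+ + |a|}{(b - a)}. \tag{3.1}$$

Proof: Consider first the special case when $X_j \ge 0$ w.p. 1 for all $j \ge 1$ and $a = 0$. Let $\tilde{N}_0 = 1$. Let
$$\tilde{N}_j = \begin{cases} N_j & \text{if } j = 2k,\ k \le U_n \text{ or if } j = 2k - 1,\ k \le U_n, \\ n & \text{otherwise.} \end{cases}$$
If $j$ is odd and $j + 1 \le 2U_n$, then $X_{\tilde{N}_{j+1}} \ge b > 0$ and $X_{\tilde{N}_j} = 0$ (since $X_{\tilde{N}_j} \le a = 0$ and $X_{\tilde{N}_j} \ge 0$). If $j$ is odd and $j + 1 \ge 2U_n + 2$, then $X_{\tilde{N}_{j+1}} = X_n = X_{\tilde{N}_j}$. Thus $\sum_{j \text{ odd}} (X_{\tilde{N}_{j+1}} - X_{\tilde{N}_j}) \ge bU_n$. It is easy to check that $\{\tilde{N}_j\}_{j=1}^{n}$ are stopping times. By Theorem 13.2.6,
$$E(X_{\tilde{N}_{j+1}} - X_{\tilde{N}_j}) \ge 0 \quad \text{for } j = 1, 2, \ldots, n.$$
Thus,
$$E(X_n - X_1) = E\Big( \sum_{j=0}^{n-1} (X_{\tilde{N}_{j+1}} - X_{\tilde{N}_j}) \Big) \ge bEU_n + E\Big( \sum_{j \text{ even}} (X_{\tilde{N}_{j+1}} - X_{\tilde{N}_j}) \Big) \ge bEU_n. \tag{3.2}$$
Hence, both inequalities of (3.1) hold for the special case. Now for the general case, let $Y_j \equiv (X_j - a)^+$, $1 \le j \le n$. Then $\{Y_j, \mathcal{F}_j\}_{j=1}^{n}$ is a nonnegative sub-martingale and $U\big( \{Y_j\}_{j=1}^{n}; 0, b - a \big) \equiv U\big( \{X_j\}_{j=1}^{n}; a, b \big)$. Thus, from (3.2),
$$EU_n \le \frac{E(Y_n - Y_1)}{(b - a)} = \frac{E\big( (X_n - a)^+ \big) - E\big( (X_1 - a)^+ \big)}{(b - a)},$$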


proving the first inequality of (3.1). The second inequality follows by noting that $(x - a)^+ \le x^+ + |a|$ for any $x, a \in \mathbb{R}$. □

The first convergence theorem is an easy consequence of the above theorem.

Theorem 13.3.2: Let $\{X_n, \mathcal{F}_n\}_{n \ge 1}$ be a sub-martingale such that
$$\sup_{n \ge 1} EX_n^+ < \infty.$$
Then $\{X_n\}_{n \ge 1}$ converges to a finite limit $X_\infty$ w.p. 1 and $E|X_\infty| < \infty$.

Proof: Let
$$A = \{\omega : \liminf_{n \to \infty} X_n < \limsup_{n \to \infty} X_n\},$$
and for $a < b$, let
$$A(a, b) = \{\omega : \liminf_{n \to \infty} X_n < a < b < \limsup_{n \to \infty} X_n\}.$$
Then, $A = \bigcup A(a, b)$ where the union is taken over all rationals $a, b$ such that $a < b$. To establish convergence of $\{X_n\}_{n \ge 1}$ it suffices to show that $P(A(a, b)) = 0$ for each $a < b$, as this implies $P(A) = 0$. Fix $a < b$ and let $U_n = U\big( \{X_j\}_{j=1}^{n}; a, b \big)$. For $\omega \in A(a, b)$, $U_n \to \infty$ as $n \to \infty$. On the other hand, by the upcrossing lemma
$$EU_n \le \frac{EX_n^+ + |a|}{(b - a)}$$
and by hypothesis, $\sup_{n \ge 1} EX_n^+ < \infty$, implying that
$$\sup_{n \ge 1} EU_n < \infty.$$
By the MCT, $E\big[ \lim_{n \to \infty} U_n \big] = \lim_{n \to \infty} EU_n$, and hence
$$\lim_{n \to \infty} U_n < \infty \quad \text{w.p. 1}.$$
Thus, $P(A(a, b)) = 0$ for all $a < b$, and hence $\lim_{n \to \infty} X_n = X_\infty$ exists w.p. 1. By Fatou's lemma
$$E|X_\infty| \le \liminf_{n \to \infty} E|X_n| \le \sup_{n \ge 1} E|X_n|.$$
But $E|X_n| = 2E(X_n^+) - EX_n \le 2EX_n^+ - EX_1$, as $\{X_n, \mathcal{F}_n\}_{n \ge 1}$ a sub-martingale implies $EX_n \ge EX_1$. Thus, $\sup_{n \ge 1} EX_n^+ < \infty$ implies $\sup_{n \ge 1} E|X_n| < \infty$. So, $E|X_\infty| < \infty$ and hence $|X_\infty| < \infty$ w.p. 1. □

Corollary 13.3.3: Let $\{X_n, \mathcal{F}_n\}_{n \ge 1}$ be a nonnegative super-martingale. Then $\{X_n\}_{n \ge 1}$ converges to a finite limit w.p. 1.


Proof: Since $\{-X_n, \mathcal{F}_n\}_{n\ge 1}$ is a nonpositive sub-martingale, $\sup_{n\ge 1} E(-X_n)^+ = 0 < \infty$. By Theorem 13.3.2, $\{-X_n\}_{n\ge 1}$ converges to a finite limit w.p. 1. □

Corollary 13.3.4: Every nonnegative martingale converges w.p. 1.

A natural question is whether a sub-martingale that converges w.p. 1 to a finite limit also converges in $L^1$ or in $L^p$ for $p > 1$. It turns out that if a sub-martingale is $L^p$ bounded for some $p > 1$, then it converges in $L^p$. But this is false for $p = 1$, as the following examples show.

Example 13.3.1: (Gambler's ruin problem). Let $\{S_n\}_{n\ge 1}$ be the simple symmetric random walk, i.e., $S_n = \sum_{i=1}^n \xi_i$, $n \ge 1$, where $\{\xi_n\}_{n\ge 1}$ is a sequence of iid random variables with $P(\xi_1 = 1) = \frac{1}{2} = P(\xi_1 = -1)$. Let $N = \inf\{n : n \ge 1, S_n = 1\}$. As noted earlier, $N$ is a finite stopping time and $\{S_n\}_{n\ge 1}$ is a martingale. Let $X_n = S_{N\wedge n}$, $n \ge 1$. Then by the optional sampling theorem, $\{X_n\}_{n\ge 1}$ is a martingale. Clearly, $\lim_{n\to\infty} X_n \equiv X_\infty = S_N = 1$ exists w.p. 1. But $EX_n \equiv 0$ while $EX_\infty = 1$ and so $X_n$ does not converge to $X_\infty$ in $L^1$.

Example 13.3.2: Suppose that $\{\xi_n\}_{n\ge 1}$ is a sequence of iid nonnegative random variables with $E\xi_1 = 1$. Let $X_n = \prod_{i=1}^n \xi_i$, $n \ge 1$. Then $\{X_n\}_{n\ge 1}$ is a nonnegative martingale and hence converges w.p. 1 to $X_\infty$, say. If $P(\xi_1 = 1) < 1$, it can be shown that $X_\infty = 0$ w.p. 1. Thus, $X_n \not\to X_\infty$ in $L^1$. In particular, $\{X_n\}_{n\ge 1}$ is not UI (Problem 13.19).

Example 13.3.3: If $\{Z_n\}_{n\ge 0}$ is a branching process with offspring distribution $\{p_j\}_{j\ge 0}$ and mean $m = \sum_{j=1}^\infty j p_j$, then $X_n \equiv Z_n/m^n$ (cf. (1.9)) is a nonnegative martingale and hence $\lim_n X_n = X_\infty$ exists w.p. 1. It is known that $X_n$ converges to $X_\infty$ in $L^1$ iff $m > 1$ and $\sum_{j=1}^\infty (j \log j)\, p_j < \infty$ (cf. Chapter 18). See also Athreya and Ney (2004).

Theorem 13.3.5: Let $\{X_n, \mathcal{F}_n\}_{n\ge 1}$ be a sub-martingale. Then the following are equivalent:
(i) There exists a random variable $X_\infty$ in $L^1$ such that $X_n \to X_\infty$ in $L^1$.
(ii) $\{X_n\}_{n\ge 1}$ is uniformly integrable.

Proof: Clearly, (i) ⇒ (ii) for any sequence of integrable random variables $\{X_n\}_{n\ge 1}$. Conversely, if (ii) holds, then $\{E|X_n|\}_{n\ge 1}$ is bounded and hence by Theorem 13.3.2, $X_n \to X_\infty$ w.p. 1 and by uniform integrability, $X_n \to X_\infty$ in $L^1$, i.e., (i) holds. □
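The failure of $L^1$ convergence in Example 13.3.1 can be checked by exact arithmetic. The following is a minimal sketch, not part of the original text: it assumes the walk starts at $S_0 = 0$, and the function names are illustrative. It computes the exact law of $X_n = S_{N\wedge n}$ and shows that $EX_n = 0$ and $E|X_n - 1| = 1$ for every $n$, even though $P(X_n = 1) \to 1$.

```python
from fractions import Fraction

def stopped_walk_law(n):
    """Exact law of X_n = S_{min(N, n)} for the symmetric simple random walk,
    where N is the first time the walk hits +1 (assumes S_0 = 0)."""
    law = {0: Fraction(1)}
    for _ in range(n):
        new = {}
        for pos, prob in law.items():
            if pos == 1:
                # Already stopped at +1: the stopped walk stays there.
                new[pos] = new.get(pos, 0) + prob
            else:
                # Otherwise take a fair +1 / -1 step.
                for step in (1, -1):
                    new[pos + step] = new.get(pos + step, 0) + prob / 2
        law = new
    return law

def mean(law):
    return sum(pos * prob for pos, prob in law.items())

def l1_distance_to_one(law):
    # E|X_n - X_infinity| with X_infinity = 1.
    return sum(abs(pos - 1) * prob for pos, prob in law.items())

if __name__ == "__main__":
    for n in (1, 10, 100):
        law = stopped_walk_law(n)
        print(n, mean(law), l1_distance_to_one(law), float(law.get(1, 0)))
```

The mass that has not yet been absorbed at $+1$ sits far to the left, which is exactly what uniform integrability rules out.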

13.3 Martingale convergence theorems


Remark 13.3.1: Let (ii) of Theorem 13.3.5 hold. For any $A \in \mathcal{F}_n$ and $m > n$, by the sub-martingale property $E(X_n I_A) \le E(X_m I_A)$. By uniform integrability, for any $A \in \mathcal{F}$,
$$E X_n I_A \to E X_\infty I_A \quad \text{as } n \to \infty.$$
This implies that $\{X_n, \mathcal{F}_n\}_{n\ge 1} \cup \{X_\infty, \mathcal{F}_\infty\}$ is a sub-martingale, where $\mathcal{F}_\infty = \sigma\langle \bigcup_{n\ge 1} \mathcal{F}_n \rangle$. That is, the sub-martingale is closable at right. Further, $EX_n \to EX_\infty$. Conversely, it can be shown that if there exists a random variable $X_\infty$, measurable w.r.t. $\mathcal{F}_\infty$, such that (a) $E|X_\infty| < \infty$, (b) $\{X_n, \mathcal{F}_n\}_{n\ge 1} \cup \{X_\infty, \mathcal{F}_\infty\}$ is a sub-martingale, and (c) $EX_n \to EX_\infty$, then $\{X_n\}_{n\ge 1}$ is uniformly integrable and (i) of Theorem 13.3.5 holds.

Corollary 13.3.6: If $\{X_n, \mathcal{F}_n\}_{n\ge 1}$ is a martingale, then it is closable at right iff $\{X_n\}_{n\ge 1}$ is uniformly integrable iff $X_n$ converges in $L^1$.

This follows from the previous remark since for a martingale, $EX_n$ is constant for $1 \le n \le \infty$.

Remark 13.3.2: A sufficient condition for a sequence $\{X_n\}_{n\ge 1}$ of random variables to be uniformly integrable is that there exists a random variable $M$ such that $EM < \infty$ and $|X_n| \le M$ w.p. 1 for all $n \ge 1$. Suppose that $\{X_n\}_{n\ge 1}$ is a nonnegative sub-martingale and $M = \sup_{n\ge 1} X_n = \lim_{n\to\infty} M_n$ where $M_n = \sup_{1\le j\le n} X_j$. By the MCT, $EM = \lim_{n\to\infty} EM_n$. But by Doob's $L \log L$ maximal inequality (Theorem 13.2.13),
$$EM_n \le \frac{e}{e-1}\Bigl(1 + E\bigl(X_n (\log X_n)^+\bigr)\Bigr)$$
for all $n \ge 1$. Thus, if $\{X_n\}_{n\ge 1}$ is a nonnegative sub-martingale and $\sup_{n\ge 1} E(X_n (\log X_n)^+) < \infty$, then $EM < \infty$ and hence $\{X_n\}_{n\ge 1}$ is uniformly integrable and converges w.p. 1 and in $L^1$. Similarly, if $\{X_n\}_{n\ge 1}$ is a martingale such that $\sup_{n\ge 1} E(|X_n| (\log |X_n|)^+) < \infty$, then $\{X_n\}_{n\ge 1}$ is uniformly integrable.

$L^1$ Convergence of the Doob Martingale

Definition 13.3.1: Let $X$ be a random variable on a probability space $(\Omega, \mathcal{F}, P)$ and $\{\mathcal{F}_n\}_{n\ge 1}$ a filtration in $\mathcal{F}$. Let $E|X| < \infty$ and $X_n \equiv E(X|\mathcal{F}_n)$, $n \ge 1$. Then $\{X_n, \mathcal{F}_n\}_{n\ge 1}$ is called a Doob martingale (cf. Example 13.1.3).

For a Doob martingale, $E|X_n| \le E|X|$ and it can be shown that $\{X_n\}_{n\ge 1}$ is uniformly integrable (Problem 13.20). Hence, $\lim_{n\to\infty} X_n$ exists w.p. 1 and in $L^1$, and equals $E(X|\mathcal{F}_\infty)$, where $\mathcal{F}_\infty = \sigma\langle \bigcup_{n\ge 1} \mathcal{F}_n \rangle$. This may be summarized as:

Theorem 13.3.7: Let $\{\mathcal{F}_n\}_{n\ge 1}$ be a filtration and let $X$ be an $\mathcal{F}_\infty$-measurable random variable with $E|X| < \infty$. Then
$$E(X|\mathcal{F}_n) \to X \quad \text{w.p. 1 and in } L^1.$$
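As a concrete illustration of Theorem 13.3.7, one may take $\Omega = [0, 1)$ with Lebesgue measure and the dyadic filtration of Problem 13.2(b). The sketch below is an assumption-laden example rather than anything from the text: the choice $X(x) = x^2$, the grid of evaluation points, and the function names are all illustrative. On each dyadic interval $E(X|\mathcal{F}_n)$ is the average of $X$ over that interval, and the code checks numerically that these averages approach $X$.

```python
def dyadic_conditional_expectation(n, x):
    """E(X | F_n)(x) for X(x) = x**2 on [0, 1) with Lebesgue measure, where
    F_n is generated by the dyadic intervals [(j-1)/2**n, j/2**n)."""
    width = 2.0 ** -n
    a = int(x / width) * width      # left endpoint of the interval containing x
    b = a + width
    # Average of t**2 over [a, b): (b**3 - a**3) / (3 * (b - a)).
    return (b ** 3 - a ** 3) / (3 * width)

def sup_error(n, grid=1000):
    """Largest value of |E(X | F_n) - X| over a grid of points in [0, 1)."""
    points = [(k + 0.5) / grid for k in range(grid)]
    return max(abs(dyadic_conditional_expectation(n, x) - x ** 2) for x in points)

if __name__ == "__main__":
    for n in (1, 2, 5, 10):
        print(n, sup_error(n))
```

For this smooth $X$ the error is at most $2 \cdot 2^{-n}$, so the convergence here is even uniform; the theorem itself only promises convergence w.p. 1 and in $L^1$.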

Corollary 13.3.8: Let $\{\mathcal{F}_n\}_{n\ge 1}$ be a filtration and $\mathcal{F}_\infty = \sigma\langle \bigcup_{n\ge 1} \mathcal{F}_n \rangle$.
(i) For any $A \in \mathcal{F}_\infty$, one has $P(A|\mathcal{F}_n) \to I_A$ w.p. 1.
(ii) For any random variable $X$ with $E|X| < \infty$, $E(X|\mathcal{F}_n) \to E(X|\mathcal{F}_\infty)$ w.p. 1.

Proof: Take $X = I_A$ for (i) and in Theorem 13.3.7, replace $X$ by $E(X|\mathcal{F}_\infty)$ for (ii). □

Kolmogorov's zero-one law (Theorem 7.2.4) is an easy consequence of this. If $\{\xi_n\}_{n\ge 1}$ are independent random variables, $A$ is a tail event, and $\mathcal{F}_n \equiv \sigma\langle \xi_j : 1 \le j \le n \rangle$, then $P(A|\mathcal{F}_n) = P(A)$ for each $n$ and hence $P(A) = I_A$ w.p. 1, i.e., $P(A) = 0$ or 1.

Theorem 13.3.9: ($L^p$ convergence of sub-martingales, $p > 1$). Let $\{X_n, \mathcal{F}_n\}_{n\ge 1}$ be a nonnegative sub-martingale. Let $1 < p < \infty$ and $\sup_{n\ge 1} E|X_n|^p < \infty$. Then $\lim_{n\to\infty} X_n = X_\infty$ exists w.p. 1 and in $L^p$, and $\{(X_n, \mathcal{F}_n)\}_{n\ge 1} \cup \{X_\infty, \mathcal{F}_\infty\}$ is an $L^p$-bounded sub-martingale.

Proof: By Doob's maximal $L^p$ inequality (Theorem 13.2.11), for any $n \ge 1$,
$$EM_n^p \le \Bigl(\frac{p}{p-1}\Bigr)^p EX_n^p \le \Bigl(\frac{p}{p-1}\Bigr)^p \sup_{m\ge 1} EX_m^p, \tag{3.3}$$
where $M_n = \max\{X_j : 1 \le j \le n\}$. Let $M = \lim_{n\to\infty} M_n$. Then (3.3) yields
$$EM^p < \infty.$$
This makes $\{|X_n|^p\}_{n\ge 1}$ uniformly integrable. Also $\sup_{n\ge 1} E|X_n|^p < \infty$ and $p > 1$ ⇒ $\sup_n E|X_n| < \infty$ and hence, $\lim_{n\to\infty} X_n = X_\infty$ exists w.p. 1 as a


finite limit. The uniform integrability of $\{|X_n|^p\}_{n\ge 1}$ implies $L^p$ convergence (cf. Problem 2.36). The closability also follows as in Remark 13.3.1. □

Corollary 13.3.10: Let $\{X_n, \mathcal{F}_n\}_{n\ge 1}$ be a martingale. Let $1 < p < \infty$ and $\sup_{n\ge 1} E|X_n|^p < \infty$. Then the conclusions of Theorem 13.3.9 hold.

Proof: Since $\{Y_n \equiv |X_n|, \mathcal{F}_n\}_{n\ge 1}$ is a nonnegative sub-martingale, Theorem 13.3.9 applies. □

Reversed Martingales

Definition 13.3.2: Let $\{X_n, \mathcal{F}_n\}_{n\le -1}$ be an adapted family with $(\Omega, \mathcal{F}, P)$ as the underlying probability space, i.e., for $n < m \le -1$, $\mathcal{F}_n \subset \mathcal{F}_m \subset \mathcal{F}$ and $X_n$ is $\mathcal{F}_n$-measurable for each $n \le -1$. Such a sequence is called a reversed martingale if
(i) $E|X_n| < \infty$ for all $n \le -1$,
(ii) $E(X_{n+1}|\mathcal{F}_n) = X_n$ for all $n \le -2$.

The definitions of reversed sub- and super-martingales are similar. Reversed martingales are well behaved since they are closed at right.

Theorem 13.3.11: Let $\{X_n, \mathcal{F}_n\}_{n\le -1}$ be a reversed martingale. Then
(a) $\lim_{n\to-\infty} X_n = X_{-\infty}$ exists w.p. 1 and in $L^1$,
(b) $X_{-\infty} = E(X_{-1}|\mathcal{F}_{-\infty})$, where $\mathcal{F}_{-\infty} \equiv \bigcap_{n\le -1} \mathcal{F}_n$.

Proof: Fix $a < b$. For $n \le -1$, let $U_n$ be the number of $(a, b)$ upcrossings of $\{X_j : n \le j \le -1\}$. Then by Doob's upcrossing lemma (Theorem 13.3.1),
$$EU_n \le \frac{E(X_{-1} - a)^+}{b - a}.$$
Let $U = \lim_{n\to-\infty} U_n$. Letting $n \to -\infty$, by the MCT, it follows that $EU < \infty$. Thus, $U < \infty$ w.p. 1. This being true for every $a < b$, one may conclude as in Theorem 13.3.2 that $P(\limsup_{n\to-\infty} X_n > \liminf_{n\to-\infty} X_n) = 0$. So $\lim_{n\to-\infty} X_n = X_{-\infty}$ exists w.p. 1. Also, since $X_n = E(X_{-1}|\mathcal{F}_n)$, by Jensen's inequality, $\{X_n\}_{n\le -1}$ is uniformly integrable. So $X_n \to X_{-\infty}$ in $L^1$, proving (a). To prove (b), note that for any $A \in \mathcal{F}_{-\infty}$, by uniform integrability,
$$\int_A X_{-\infty}\, dP = \lim_{n\to-\infty} \int_A X_n\, dP = \int_A X_{-1}\, dP, \quad \text{by the martingale property.}$$


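□

Corollary 13.3.12: (The Strong Law of Large Numbers for iid random variables). Let $\{\xi_n\}_{n\ge 1}$ be a sequence of iid random variables with $E|\xi_1| < \infty$. Then, $n^{-1} \sum_{i=1}^n \xi_i \to E\xi_1$ as $n \to \infty$, w.p. 1.

Proof: For $k \ge 1$, let $S_k = \xi_1 + \cdots + \xi_k$ and $\mathcal{F}_{-k} = \sigma\{S_k, \xi_{k+1}, \xi_{k+2}, \ldots\}$. Let $X_n \equiv E(\xi_1|\mathcal{F}_n)$, $n \le -1$. By the independence of $\{\xi_i\}_{i\ge 1}$, for any $n \le -1$, with $k = -n$, $X_n = E(\xi_1|\sigma\langle S_k \rangle)$. Also, by symmetry, for any $k \ge 1$, $E(\xi_1|\sigma\langle S_k \rangle) = E(\xi_j|\sigma\langle S_k \rangle)$ for $1 \le j \le k$. Thus,
$$X_n = \frac{1}{k} \sum_{j=1}^k E(\xi_j|\sigma\langle S_k \rangle) = \frac{S_k}{k}, \quad \text{for all } k = -n \ge 1.$$
It is easy to check that $\{X_n, \mathcal{F}_n\}_{n\le -1}$ is a reversed martingale and so by Theorem 13.3.11,
$$\lim_{n\to-\infty} X_n = \lim_{k\to\infty} \frac{S_k}{k} \quad \text{exists w.p. 1 and in } L^1.$$
Since $\lim_{k\to\infty} S_k/k$ is a tail random variable, it is a constant by Kolmogorov's zero-one law, and by $L^1$ convergence this constant must equal $E\xi_1$. □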

13.4 Applications of martingale methods

13.4.1 Supercritical branching processes

Recall Example 13.1.5 on branching processes. Assume that it is supercritical, i.e., $\mu = E\xi_{11} > 1$ and that $\sigma^2 = \mathrm{Var}(\xi_{11}) < \infty$.

Proposition 13.4.1: Let $X_n = \mu^{-n} Z_n$ be the martingale defined in (1.9). Then, $\{X_n\}_{n\ge 1}$ is an $L^2$-bounded martingale.

Proof: Let $v_n = \mathrm{Var}(X_n)$, $n \ge 1$. Then
$$v_{n+1} = \mathrm{Var}(E(X_{n+1}|\mathcal{F}_n)) + E(\mathrm{Var}(X_{n+1}|\mathcal{F}_n)) = \mathrm{Var}(X_n) + \frac{E(Z_n \sigma^2)}{\mu^{2(n+1)}} = v_n + \sigma^2 \mu^{-2} \mu^{-n}, \quad n \ge 1.$$
Thus, $v_{n+1} = v_1 + \sigma^2 \mu^{-2} \sum_{j=1}^n \mu^{-j}$. Since $\mu > 1$, $\{v_n\}_{n\ge 1}$ is bounded. Now since $EX_n \equiv 1$, $\{X_n\}_{n\ge 1}$ is $L^2$-bounded. □

A direct consequence of Proposition 13.4.1 and Corollary 13.3.10 is that $\lim_{n\to\infty} X_n = X_\infty$ exists w.p. 1 and in mean-square.
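The variance recursion in the proof of Proposition 13.4.1 is easy to iterate numerically. The sketch below is illustrative only. It assumes $Z_0 = 1$ (so that $v_0 = 0$ and the recursion may be started at $n = 0$), and the parameter values $\mu = 1.5$, $\sigma^2 = 2$ are arbitrary. Under that assumption the geometric series gives the limiting variance $\sigma^2 / (\mu(\mu - 1))$.

```python
def variances(mu, sigma2, n_max):
    """v_n = Var(mu**-n * Z_n) from v_{n+1} = v_n + sigma2 * mu**-2 * mu**-n,
    started at v_0 = 0 (assumes Z_0 = 1, so X_0 = 1 is deterministic)."""
    v = [0.0]
    for n in range(n_max):
        v.append(v[-1] + sigma2 * mu ** -2 * mu ** -n)
    return v

def limiting_variance(mu, sigma2):
    # sigma2 * mu**-2 * sum_{n >= 0} mu**-n = sigma2 / (mu * (mu - 1)) for mu > 1.
    return sigma2 / (mu * (mu - 1))

if __name__ == "__main__":
    v = variances(1.5, 2.0, 60)
    print(v[1], v[10], v[60], limiting_variance(1.5, 2.0))
```

The increments decay like $\mu^{-n}$, which is why supercriticality ($\mu > 1$) is what makes $\{v_n\}$ bounded.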

13.4.2 Investment sequences

Let $X_n$ be the value of a portfolio at (the end of) the $n$th period. Suppose the returns on the investment are random and satisfy
$$E(X_{n+1}|X_0, X_1, \ldots, X_n) \le \rho_{n+1} X_n, \quad n \ge 1,$$
where $\rho_{n+1}$ is a strictly positive random variable that is $\mathcal{F}_n$-measurable, where $\mathcal{F}_n \equiv \sigma\langle X_1, X_2, \ldots, X_n \rangle$. Let $\rho_1 \equiv 1$ and
$$Z_n = \frac{X_n}{\prod_{j=1}^n \rho_j}, \quad n \ge 1.$$
Then, $\{Z_n, \mathcal{F}_n\}_{n\ge 1}$ is a nonnegative super-martingale and hence, it converges w.p. 1 to a limit $Z$, with $EZ \le EZ_1 = EX_1$. This implies that $\{X_n\}_{n\ge 1}$ converges w.p. 1 on the event $A \equiv \{\prod_{j=1}^n \rho_j \text{ converges}\}$.

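13.4.3 A conditional Borel-Cantelli lemma

Let $\{A_n\}_{n\ge 1}$ be a sequence of events in a probability space $(\Omega, \mathcal{F}, P)$ and let $\{\mathcal{F}_n\}_{n\ge 1}$ be a filtration in $\mathcal{F}$. Let $A_n \in \mathcal{F}_n$ for all $n \ge 1$ and $p_n = P(A_n|\mathcal{F}_{n-1})$, $n \ge 1$, where $\mathcal{F}_0$ is the trivial $\sigma$-algebra $\equiv \{\Omega, \emptyset\}$. Let $\delta_n = I_{A_n}$, and $X_n = \sum_{j=1}^n (\delta_j - p_j)$, $n \ge 1$. Then $\{X_n, \mathcal{F}_n\}_{n\ge 1}$ is a martingale. Let $V_j = \mathrm{Var}(\delta_j|\mathcal{F}_{j-1}) = p_j(1 - p_j)$, $j \ge 1$, $s_n = \sum_{j=1}^n V_j$ and $\tilde{s}_n = \max\{s_n, 1\}$, $n \ge 1$. Since $V_n$ is $\mathcal{F}_{n-1}$-measurable, so are $s_n$ and $\tilde{s}_n$. Let $Y_n = \sum_{j=1}^n (\delta_j - p_j)/\tilde{s}_j$, $n \ge 1$. Then, $\{Y_n, \mathcal{F}_n\}_{n\ge 1}$ is a martingale. Clearly, $EY_n = 0$ and by the martingale property
$$EY_n^2 = \mathrm{Var}(Y_n) = \sum_{j=1}^n \mathrm{Var}\Bigl(\frac{\delta_j - p_j}{\tilde{s}_j}\Bigr) = \sum_{j=1}^n E\Bigl(\frac{V_j}{\tilde{s}_j^2}\Bigr) = E\Bigl(\sum_{j=1}^n \frac{V_j}{\tilde{s}_j^2}\Bigr).$$
But $V_1 = s_1$ and $V_j = s_j - s_{j-1}$ for $j \ge 2$, and $\tilde{s}_j \ge \max\{t, 1\}$ for all $t \le s_j$, so (with $s_0 \equiv 0$)
$$\frac{V_j}{\tilde{s}_j^2} \le \int_{s_{j-1}}^{s_j} \frac{dt}{\max\{t, 1\}^2}$$
and hence,
$$\sum_{j=1}^\infty \frac{V_j}{\tilde{s}_j^2} \le \int_0^1 dt + \int_1^\infty \frac{1}{t^2}\, dt = 2.$$
So $\sup_{n\ge 1} EY_n^2 \le 2$. Thus, $\{Y_n\}_{n\ge 1}$ converges to some $Y$ w.p. 1 and in $L^2$.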


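If $s_n \to \infty$, then by Kronecker's lemma (cf. Lemma 8.4.2),
$$\frac{1}{s_n} \sum_{j=1}^n (\delta_j - p_j) \to 0 \quad \Rightarrow \quad \frac{\sum_{j=1}^n p_j}{s_n} \Bigl( \frac{\sum_{j=1}^n \delta_j}{\sum_{j=1}^n p_j} - 1 \Bigr) \to 0.$$
But $\sum_{j=1}^n p_j \ge s_n$ and hence
$$\frac{\sum_{j=1}^n \delta_j}{\sum_{j=1}^n p_j} \to 1 \quad \text{w.p. 1 on the event } B \equiv \{s_n \to \infty\}. \tag{4.1}$$
Next it is claimed that on $B^c \equiv \{\lim_{n\to\infty} s_n < \infty\}$, $\lim_{n\to\infty} X_n = X$ exists and is finite w.p. 1. To prove the claim, fix $0 < t < \infty$. Let $N_t = \inf\{n : s_{n+1} > t\}$. Since $s_{n+1}$ is $\mathcal{F}_n$-measurable, $N_t$ is a stopping time and by the optional stopping theorem I (Theorem 13.2.3), $\{Z_n \equiv X_{N_t \wedge n}\}_{n\ge 1}$ is a martingale. By Doob's $L^2$-maximal inequality,
$$E\Bigl(\sup_{1\le j\le n} Z_j^2\Bigr) \le 4E(Z_n^2).$$
Also it is easy to verify that $\{X_n^2 - s_n\}_{n\ge 1}$ is a martingale and by the optional sampling theorem (Theorem 13.2.4),
$$E(X_{n\wedge N_t}^2 - s_{n\wedge N_t}) = 0.$$
Thus, $EZ_n^2 = E s_{n\wedge N_t} \le t$ and hence for each $t$, $\lim_{n\to\infty} Z_n$ exists w.p. 1 and in $L^2$. Thus, $\lim_{n\to\infty} X_{N_t \wedge n}$ exists w.p. 1 for each $t$. But, on $B^c$, $N_t = \infty$ for all large $t$. So $\lim_{n\to\infty} X_n = X$ exists w.p. 1 on $B^c$. This proves the claim.

It follows that on $B^c \cap \{\sum_{j=1}^\infty p_j = \infty\}$,
$$\frac{\sum_{j=1}^n \delta_j}{\sum_{j=1}^n p_j} - 1 = \frac{X_n}{\sum_{j=1}^n p_j} \to 0.$$
Also, since $B \equiv \{s_n \to \infty\}$ is a subset of $\{\sum_{j=1}^\infty p_j = \infty\}$ and it has been shown in (4.1) that
$$\frac{\sum_{j=1}^n \delta_j}{\sum_{j=1}^n p_j} \to 1 \quad \text{w.p. 1 on } B,$$
it follows that
$$\frac{\sum_{j=1}^n \delta_j}{\sum_{j=1}^n p_j} \to 1 \quad \text{w.p. 1 on } \Bigl\{\sum_{j=1}^\infty p_j = \infty\Bigr\}.$$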


Summarizing the above, one gets the following result.

Theorem 13.4.2: (A conditional Borel-Cantelli lemma). Let $\{A_n\}_{n\ge 1}$ be a sequence of events in a probability space $(\Omega, \mathcal{F}, P)$ and $\{\mathcal{F}_n\}_{n\ge 1}$ be a filtration such that $A_n \in \mathcal{F}_n$, for all $n \ge 1$. Let $p_n = P(A_n|\mathcal{F}_{n-1})$ for $n \ge 2$, $p_1 = P(A_1)$. Then on the event $B_0 \equiv \{\sum_{j=1}^\infty p_j = \infty\}$,
$$\frac{\sum_{j=1}^n I_{A_j}}{\sum_{j=1}^n p_j} \to 1 \quad \text{w.p. 1},$$
and in particular, infinitely many $A_n$'s happen w.p. 1 on $B_0$.
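A small simulation makes Theorem 13.4.2 concrete for events that are genuinely dependent on the past. The setup below is a hypothetical example, not one from the text: the conditional probability of $A_n$ is taken to be $p_n = 1/(1 + K_{n-1})$, where $K_{n-1}$ is the number of earlier events that occurred. Since $K_{n-1} \le n - 1$, one has $p_n \ge 1/n$, so $\sum_j p_j = \infty$ on every path and the theorem predicts that the ratio tends to 1.

```python
import random

def simulate_ratio(n_steps, seed=0):
    """Simulate events A_n with P(A_n | F_{n-1}) = 1 / (1 + K_{n-1}), where
    K_{n-1} counts the events that occurred before time n, and return
    (sum of indicators) / (sum of conditional probabilities)."""
    rng = random.Random(seed)
    hits = 0          # K_n: number of events that have occurred so far
    prob_sum = 0.0    # p_1 + ... + p_n
    for _ in range(n_steps):
        p = 1.0 / (1 + hits)      # conditional probability given the past
        prob_sum += p
        if rng.random() < p:
            hits += 1
    return hits / prob_sum

if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        print(n, simulate_ratio(n))
```

Convergence is slow here because $\sum_{j\le n} p_j$ grows only like $\sqrt{2n}$, so the printed ratios should be read as a rough illustration rather than a verification.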

13.4.4 Decomposition of probability measures

The almost sure convergence of a nonnegative martingale yields the following theorem on the Lebesgue decomposition of two probability measures on a given measurable space.

Theorem 13.4.3: Let $(\Omega, \mathcal{F})$ be a measurable space and $\{\mathcal{F}_n\}_{n\ge 1}$ be a filtration with $\mathcal{F}_n \subset \mathcal{F}$ for all $n \ge 1$. Let $P$ and $Q$ be two probability measures on $(\Omega, \mathcal{F})$ such that for each $n \ge 1$, $P_n \equiv$ the restriction of $P$ to $\mathcal{F}_n$ is absolutely continuous w.r.t. $Q_n \equiv$ the restriction of $Q$ to $\mathcal{F}_n$, with the Radon-Nikodym derivative $X_n = \frac{dP_n}{dQ_n}$. Let $\mathcal{F}_\infty \equiv \sigma\langle \bigcup_{n\ge 1} \mathcal{F}_n \rangle$ and $X \equiv \limsup_{n\to\infty} X_n$. Then for any $A \in \mathcal{F}_\infty$,
$$P(A) = \int_A X\, dQ + P(A \cap (X = \infty)) \equiv P_a(A) + P_s(A), \text{ say,} \tag{4.2}$$
and $P_a \ll Q$ and $P_s \perp Q$.

Proof: For $1 \le k \le n$, let $M_{k,n} = \max_{k\le j\le n} X_j$. Let $M_k = \sup_{n\ge k} M_{k,n} = \lim_{n\to\infty} M_{k,n}$. Then $X \equiv \limsup_{n\to\infty} X_n = \lim_{k\to\infty} M_k$. Fix $1 \le k_0, N < \infty$ and $A \in \mathcal{F}_{k_0}$. Then for $n \ge k \ge k_0$, $B_{k,n} \equiv A \cap \{M_{k,n} \le N\} \in \mathcal{F}_n$ and hence
$$P(B_{k,n}) = \int X_n I_{B_{k,n}}\, dQ. \tag{4.3}$$
As $n \to \infty$, $M_{k,n} \uparrow M_k$ and so $I_{B_{k,n}} \downarrow I_{B_k}$, where $B_k \equiv A \cap \{M_k \le N\} = \bigcap_{n\ge k} B_{k,n}$. Also, since $\{X_n, \mathcal{F}_n\}_{n\ge 1}$ is a nonnegative martingale under the probability measure $Q$, $\lim_{n\to\infty} X_n$ exists w.p. 1 (under $Q$) and hence coincides with $X$. Thus, by the bounded convergence theorem applied to (4.3),
$$P(B_k) = \int X I_{B_k}\, dQ.$$


Now let N → ∞ and use the MCT to conclude that P (A ∩ (Mk < ∞)) = XI{Mk 0 for all n and i. Let F = σ n≥1 An . Let


$X_n \equiv \sum_{i=1}^{k_n} \frac{P(A_{ni})}{Q(A_{ni})} I_{A_{ni}}$. Then $\{X_n, \mathcal{F}_n\}_{n\ge 1}$ is a martingale on $(\Omega, \mathcal{F}, Q)$ and $P$ satisfies the decomposition (4.2).

The proof of Corollary 13.4.6 is left as an exercise (Problem 13.22).

Remark 13.4.3: This yields the Lebesgue decomposition of $P$ w.r.t. $Q$ when $\mathcal{F}$ is countably generated, i.e., when there exists a countable collection $\mathcal{A}$ of subsets of $\Omega$ such that $\mathcal{F} = \sigma\langle \mathcal{A} \rangle$. In particular, this holds if $\Omega = \mathbb{R}^k$ and $\mathcal{F} \equiv \mathcal{B}(\mathbb{R}^k)$ for $k \in \mathbb{N}$.

13.4.5 Kakutani's theorem

Theorem 13.4.7: (Kakutani's theorem). Let $P$ and $Q$ be the probability distributions on $(\mathbb{R}^\infty, \mathcal{B}(\mathbb{R}^\infty))$ of the sequences of independent random variables $\{X_j\}_{j\ge 1}$ and $\{Y_j\}_{j\ge 1}$, respectively. Let for each $j \ge 1$, the distribution of $X_j$ be dominated by that of $Y_j$. Then either
$$P \ll Q \quad \text{or} \quad P \perp Q. \tag{4.5}$$

Proof: Let $f_j$ be the density of $\lambda_j$ w.r.t. $\mu_j$ where $\lambda_j(\cdot) = P(X_j \in \cdot)$ and $\mu_j(\cdot) = Q(Y_j \in \cdot)$. Let $\Omega = \mathbb{R}^\infty$, $\mathcal{F} = (\mathcal{B}(\mathbb{R}))^\infty$. Then $P = \prod_{j\ge 1} \lambda_j$, $Q = \prod_{j\ge 1} \mu_j$. Let $\xi_n(\omega) \equiv \omega(n)$, the $n$th co-ordinate of $\omega = (\omega_1, \omega_2, \ldots) \in \Omega$, and $\mathcal{F}_n \equiv \sigma\langle \xi_j : 1 \le j \le n \rangle$. Also let $P_n$ be the restriction of $P$ to $\mathcal{F}_n$ and $Q_n$ be that of $Q$ to $\mathcal{F}_n$. Then $P_n \ll Q_n$ with probability density
$$L_n = \frac{dP_n}{dQ_n} = \prod_{j=1}^n f_j(\xi_j).$$
Since $\{\limsup_{n\to\infty} L_n < \infty\}$ is a tail event, by the independence of $\{\xi_j\}_{j\ge 1}$ under $P$ and Kolmogorov's zero-one law, $P(\limsup_{n\to\infty} L_n < \infty) = 0$ or 1. Now, by Corollary 13.4.4, (4.5) follows. □

Remark 13.4.4: It can be shown that $P \ll Q$ or $P \perp Q$ according as $\prod_{j=1}^\infty E\sqrt{f_j(Y_j)} > 0$ or $= 0$. For a proof, see Durrett (2004).

Remark 13.4.5: If $\{X_j\}_{j\ge 1}$ are iid and $\{Y_j\}_{j\ge 1}$ are also iid, then $P = Q$ or $P \perp Q$. This is because $f_j = f_1$ for all $j$ and $E_Q\sqrt{f_1} \le (E_Q f_1)^{1/2} \le 1$ and so either $E_Q\sqrt{f_1} < 1$ or $E_Q\sqrt{f_1} = 1$. In the latter case $f_1 \equiv 1$, since $E_Q(\sqrt{f_1})^2 = 1 = E_Q(\sqrt{f_1})$ ⇒ $\mathrm{Var}_Q(\sqrt{f_1}) = 0$.

Remark 13.4.6: The above result can be extended to Markov chains. Let $P$ and $Q$ be two irreducible stochastic matrices on a countable set and let $Q$ be positive recurrent. Also, let $P_{x_0}$ denote the distribution of a Markov chain $\{X_n\}_{n\ge 1}$ starting at $x_0$ and with transition probability $P$, and similarly, let $Q_{y_0}$ denote the distribution of a Markov chain $\{Y_n\}_{n\ge 1}$


starting at $y_0$ and with transition probability $Q$. Then either
$$P_{x_0} \perp Q_{y_0} \quad \text{or} \quad P = Q. \tag{4.6}$$
The proof of this is left as an exercise (Problem 13.23).
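The dichotomy of Theorem 13.4.7 and the criterion of Remark 13.4.4 can be explored numerically in the simplest discrete case. The following sketch is only an illustration under stated assumptions: $X_j \sim$ Bernoulli($a_j$) and $Y_j \sim$ Bernoulli($1/2$), so that $E\sqrt{f_j(Y_j)} = \sqrt{a_j/2} + \sqrt{(1 - a_j)/2}$. The sequences $a_j$ and the function names are hypothetical choices.

```python
import math

def hellinger_affinity(a, b):
    """E_Q sqrt(f) when lambda = Bernoulli(a), mu = Bernoulli(b), f = d(lambda)/d(mu):
    sum over x in {0, 1} of sqrt(lambda(x) * mu(x))."""
    return math.sqrt(a * b) + math.sqrt((1 - a) * (1 - b))

def affinity_product(a_seq, b=0.5):
    """Partial product of E_Q sqrt(f_j(Y_j)) over the given sequence of a_j."""
    prod = 1.0
    for a in a_seq:
        prod *= hellinger_affinity(a, b)
    return prod

if __name__ == "__main__":
    n = 10_000
    # iid case with a_j = 0.6 != 0.5: the product tends to 0 (P and Q singular).
    print(affinity_product([0.6] * n))
    # a_j -> 1/2 fast enough that sum (a_j - 1/2)**2 < infinity: product stays positive.
    print(affinity_product([0.5 + 0.4 / j for j in range(1, n + 1)]))
```

In the second case each factor is roughly $1 - (a_j - 1/2)^2/2$, which is why square-summability of $a_j - 1/2$ keeps the infinite product positive.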

13.4.6 de Finetti's theorem

Let $\{X_n\}_{n\ge 1}$ be a sequence of exchangeable random variables on a probability space $(\Omega, \mathcal{F}, P)$, i.e., for each $n \ge 1$, the distribution of $(X_{\sigma(1)}, X_{\sigma(2)}, \ldots, X_{\sigma(n)})$ is the same as that of $(X_1, X_2, \ldots, X_n)$ where $(\sigma(1), \sigma(2), \ldots, \sigma(n))$ is a permutation of $(1, 2, \ldots, n)$. Then there is a $\sigma$-algebra $\mathcal{G} \subset \mathcal{F}$ such that for each $n \ge 1$,
$$P(X_i \in B_i, i = 1, 2, \ldots, n \mid \mathcal{G}) = \prod_{i=1}^n P(X_i \in B_i \mid \mathcal{G}) \tag{4.7}$$
for all $B_1, \ldots, B_n \in \mathcal{B}(\mathbb{R})$. This is known as de Finetti's theorem. For a proof, see Durrett (2004) and Chow and Teicher (1997). This theorem says that conditioned on $\mathcal{G}$ the $\{X_i\}_{i\ge 1}$ are iid random variables with distribution $P(X_1 \in \cdot \mid \mathcal{G})$. The converse to this result, i.e., that if (4.7) holds for some $\sigma$-algebra $\mathcal{G} \subset \mathcal{F}$, then the sequence $\{X_i\}_{i\ge 1}$ is exchangeable, is not difficult to verify (Problem 13.26).
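The converse direction can be sanity-checked on a toy model. The sketch below is a hypothetical example, not a proof and not from the text: it takes a random parameter $\theta$ with two equally likely values and, given $\theta$, iid Bernoulli($\theta$) variables (a structure of the kind described around (4.7), with $\mathcal{G}$ generated by $\theta$). It verifies with exact arithmetic that the joint probabilities are invariant under permutations, while the variables are not independent.

```python
from fractions import Fraction
from itertools import permutations, product

def mixture_prob(xs, thetas, weights):
    """P(X_1 = x_1, ..., X_n = x_n) when theta takes the value thetas[k] with
    probability weights[k] and, given theta, the X_i are iid Bernoulli(theta)."""
    total = Fraction(0)
    for theta, w in zip(thetas, weights):
        p = w
        for x in xs:
            p *= theta if x == 1 else 1 - theta
        total += p
    return total

def is_exchangeable(n, thetas, weights):
    """Check permutation invariance of the joint law of (X_1, ..., X_n)."""
    for xs in product((0, 1), repeat=n):
        base = mixture_prob(xs, thetas, weights)
        if any(mixture_prob(perm, thetas, weights) != base
               for perm in permutations(xs)):
            return False
    return True

if __name__ == "__main__":
    thetas = [Fraction(1, 4), Fraction(3, 4)]
    weights = [Fraction(1, 2), Fraction(1, 2)]
    print(is_exchangeable(4, thetas, weights))
    print(mixture_prob((1, 1), thetas, weights), mixture_prob((1,), thetas, weights) ** 2)
```

The second printed line compares $P(X_1 = 1, X_2 = 1)$ with $P(X_1 = 1)^2$, showing that exchangeability here does not come from independence.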

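13.5 Problems

13.1 Let $\Omega$ be a nonempty set and let $\{A_j\}_{j\ge 1}$ be a countable partition of $\Omega$. For $n \ge 1$, let $\mathcal{F}_n$ = $\sigma$-algebra generated by $\{A_j\}_{j=1}^n$.
(a) Show that $\{\mathcal{F}_n\}_{n\ge 1}$ is a filtration.
(b) Find $\mathcal{F}_\infty = \sigma\langle \bigcup_{n\ge 1} \mathcal{F}_n \rangle$.

13.2 Let $\Omega$ be a nonempty set. For each $n \ge 1$, let $\pi_n \equiv \{A_{nj} : j = 1, 2, \ldots, k_n\}$ be a partition of $\Omega$. Suppose that for each $n$ and $j$, $A_{nj}$ is a union of sets of $\pi_{n+1}$. Let $\mathcal{F}_n \equiv \sigma\langle \pi_n \rangle$ for $n \ge 1$.
(a) Show that $\{\mathcal{F}_n\}_{n\ge 1}$ is a filtration.
(b) Suppose $\Omega = [0, 1)$ and $\pi_n \equiv \bigl\{ \bigl[\frac{j-1}{2^n}, \frac{j}{2^n}\bigr) : j = 1, 2, \ldots, 2^n \bigr\}$. Show that $\mathcal{F}_\infty = \sigma\langle \bigcup_{n\ge 1} \mathcal{F}_n \rangle$ is the Borel $\sigma$-algebra $\mathcal{B}([0, 1))$.

13.3 Let $\{(Y_n, \mathcal{F}_n) : n \ge 1\}$ and $\{(\tilde{Y}_n, \mathcal{F}_n) : n \ge 1\}$ be as in Example 13.1.2. Verify that $\{(Y_n, \mathcal{F}_n) : n \ge 1\}$ is a sub-martingale and $\{(\tilde{Y}_n, \tilde{\mathcal{F}}_n) : n \ge 1\}$ is a martingale.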


13.4 Give an example of a random variable $T$ and two filtrations $\{\mathcal{F}_n\}_{n\ge 1}$ and $\{\mathcal{G}_n\}_{n\ge 1}$ such that $T$ is a stopping time w.r.t. the filtration $\{\mathcal{F}_n\}_{n\ge 1}$ but not w.r.t. $\{\mathcal{G}_n\}_{n\ge 1}$.

13.5 Let $T_1$ and $T_2$ be stopping times w.r.t. a filtration $\{\mathcal{F}_n\}_{n\ge 1}$. Verify that $\min(T_1, T_2)$, $\max(T_1, T_2)$, $T_1 + T_2$, and $T_1^2$ are stopping times w.r.t. $\{\mathcal{F}_n\}_{n\ge 1}$. Give an example to show that $\sqrt{T_1}$ and $T_1 - 1$ need not be stopping times w.r.t. $\{\mathcal{F}_n\}_{n\ge 1}$.

13.6 Let $T$ be a random variable taking values in $\{1, 2, 3, \ldots\}$. Show that there is a filtration $\{\mathcal{F}_n\}_{n\ge 1}$ w.r.t. which $T$ is a stopping time.

13.7 Let $\{\mathcal{F}_n\}_{n\ge 1}$ be a filtration.
(a) Show that $T$ is a stopping time w.r.t. $\{\mathcal{F}_n\}_{n\ge 1}$ iff $\{T \le n\} \in \mathcal{F}_n$ for all $n \ge 1$.
(b) Show by an example that if a random variable $T$ satisfies $\{T \ge n\} \in \mathcal{F}_n$ for all $n \ge 1$, it need not be a stopping time w.r.t. $\{\mathcal{F}_n\}_{n\ge 1}$. (Hint: Consider a $T$ of the form $T = \inf\{k : k \ge 1, X_{k+1} \in A\}$ and $\mathcal{F}_n = \sigma\langle X_j : j \le n \rangle$.)

13.8 Show that FT deﬁned in (2.6) is a σ-algebra. 13.9 Let {Xn }n≥1 be a sequence of random variables. Let Gn = σ{Xj : 1 ≤ j ≤ n}. Let {Fn }n≥1 be a ﬁltration such that Gn ⊂ Fn for each n ≥ 1. (a) Show that if {Xn , Fn }n≥1 is a martingale, then so is {Xn , Gn }n≥1 . (b) Show by an example that the converse need not be true. (c) Let {Xn , Fn }n≥1 be a martingale. Let 1 ≤ k1 < k2 < k3 · · · be a sequence of integers. Let Yn ≡ Xkn , Hn ≡ Fkn , n ≥ 1. Show that {Yn , Hn }n≥1 is also a martingale. 13.10 A branching random walk is a branching process and a random walk associated with it. Individuals reproduce according to a branching process and the oﬀspring move away from the parent a random distance. If Xn ≡ {xn1 , xn2 , . . . xnZn } denotes the position vector of the Zn individuals in the nth generation and the individual at location xni produces ρni oﬀspring, then each of them chooses a new position by moving a random distance from xni and these are assumed to be iid. Let ηnij be the random distance moved by the jth oﬀspring of the


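individual at $x_{ni}$. Then the position vector of the $(n + 1)$st generation is given by
$$X_{n+1} = \bigl\{ \{x_{ni} + \eta_{nij}\}_{j=1}^{\rho_{ni}},\ i = 1, 2, \ldots, Z_n \bigr\} \equiv \{x_{n+1,k} : k = 1, 2, \ldots, Z_{n+1}\}, \text{ say},$$
where
$$Z_{n+1} = \text{population size of the } (n+1)\text{st generation} = \sum_{i=1}^{Z_n} \rho_{ni}.$$
Let the offspring distribution be $\{p_k\}_{k\ge 0}$ and the jump size distribution be denoted by $F(\cdot)$. Assume that the $\eta$'s are real valued and also that the collection $\{\rho_{ni}\}_{i\ge 1, n\ge 0}$, $\{\eta_{nij}\}_{i\ge 1, j\ge 1, n\ge 0}$ are all independent with the $\rho$'s being iid with distribution $\{p_k\}_{k\ge 0}$ and the $\eta$'s iid with distribution $F$. Fix $\theta \in \mathbb{R}$. For $n \ge 0$, let
$$Z_n(\theta) \equiv \sum_{i=1}^{Z_n} e^{\theta x_{ni}} \quad \text{and} \quad Y_n(\theta) = Z_n(\theta) \bigl(\rho\, \phi(\theta)\bigr)^{-n},$$
where $\rho = \sum_{k=0}^\infty k p_k$, $\phi(\theta) = E(e^{\theta \eta_{111}}) = \int e^{\theta x}\, dF(x)$. Assume $0 < \phi(\theta) < \infty$, $0 < \rho < \infty$.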

0, $\lambda_0 > 0$,
$$E\bigl(|X_n| I(|X_n| > \lambda)\bigr) \le E\bigl(|X| I(|X_n| > \lambda)\bigr) \le E\bigl(|X| I(|X| > \lambda_0)\bigr) + \lambda_0 P(|X_n| > \lambda).)$$

13.21 Consider the following urn scheme due to Pólya. Let an urn contain $w_0$ white and $b_0$ black balls at time $n = 0$. A ball is drawn from the urn at random. It is returned to the urn with one more ball of the color drawn. Repeat this procedure for all $n \ge 1$. Let $W_n$ and $B_n$ denote the number of white and black balls in the urn after $n$ draws. Let $Z_n = \frac{W_n}{W_n + B_n}$, $n \ge 0$. Let $\mathcal{F}_n = \sigma\langle Z_0, Z_1, \ldots, Z_n \rangle$.
(a) Show that $\{(Z_n, \mathcal{F}_n)\}_{n\ge 0}$ is a martingale.
(b) Conclude that $Z_n$ converges w.p. 1 and in $L^1$ to a random variable $Z$.
(c) Show that for any $k \in \mathbb{N}$, $EZ_n^k$ converges as $n \to \infty$ and evaluate the limit. Deduce that $Z$ has Beta$(w_0, b_0)$ distribution, i.e., its pdf is
$$f_Z(z) \equiv \frac{(w_0 + b_0 - 1)!}{(w_0 - 1)!\,(b_0 - 1)!}\, z^{w_0 - 1} (1 - z)^{b_0 - 1} I_{[0,1]}(z).$$
(d) Generalize (a) to the case when at the $n$th stage a random number $\alpha_n$ of balls of the color drawn are added where $\{\alpha_n\}_{n\ge 1}$ is any sequence of nonnegative integer valued random variables.

13.22 Prove Corollary 13.4.6. (Hint: Argue as in Example 13.1.7.)

13.23 Prove the last equation (4.6) of Section 13.4. (Hint: Show using the strong law for the $Q$ chain that under $Q$, the martingale $X_n$ converges to 0 w.p. 1, where $X_n$ is the Radon-Nikodym derivative of $P_{x_0}((X_0, \ldots, X_n) \in \cdot)$ w.r.t. $Q_{x_0}((X_0, \ldots, X_n) \in \cdot)$.)


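13.24 Let $\{\mathcal{F}_n\}_{n\ge 0}$ be a filtration $\subset \mathcal{F}$ where $(\Omega, \mathcal{F}, P)$ is a probability space. Let $\{Y_n\}_{n\ge 0} \subset L^1(\Omega, \mathcal{F}, P)$. Suppose
$$Z \equiv \sup_{n\ge 1} |Y_n| \in L^1(\Omega, \mathcal{F}, P) \quad \text{and} \quad \lim_{n\to\infty} Y_n \equiv Y \text{ exists w.p. 1.}$$
Show that $E(Y_n|\mathcal{F}_n) \to E(Y|\mathcal{F}_\infty)$ w.p. 1. (Hint: Fix $m \ge 1$ and let $V_m = \sup_{n\ge m} |Y_n - Y|$. Show that
$$\limsup_{n} E(|Y_n - Y| \mid \mathcal{F}_n) \le \lim_{n\to\infty} E(V_m|\mathcal{F}_n) = E(V_m|\mathcal{F}_\infty).$$
Now show that $E(V_m|\mathcal{F}_\infty) \to 0$ as $m \to \infty$.)

13.25 Let $\{X_t, \mathcal{F}_t : t \in I \equiv \mathbb{Q} \cap (0, 1)\}$ be a martingale, i.e., for all $t_1 < t_2$ in $I$, $E(X_{t_2}|\mathcal{F}_{t_1}) = X_{t_1}$. Show that for each $t$ in $I$
$$\lim_{s\uparrow t,\, s\in I} X_s \quad \text{and} \quad \lim_{s\downarrow t,\, s\in I} X_s$$
both exist w.p. 1 and in $L^1$ and equal $X_t$ w.p. 1.

13.26 Let $\{X_i\}_{i\ge 1}$ be random variables on a probability space $(\Omega, \mathcal{F}, P)$. Suppose for some $\sigma$-algebra $\mathcal{G} \subset \mathcal{F}$ (4.7) holds. Show that $\{X_i\}_{i\ge 1}$ are exchangeable.

13.27 Let $\{X_n\}_{n\ge 0}$, $\{Y_n\}_{n\ge 0}$ be martingales in $L^2(\Omega, \mathcal{F}, P)$ w.r.t. the same filtration $\{\mathcal{F}_n\}_{n\ge 0}$. Let $X_0 = Y_0 = 0$. Show that
$$E(X_n Y_n) = \sum_{k=1}^n E(X_k - X_{k-1})(Y_k - Y_{k-1}), \quad n \ge 1,$$
and, in particular,
$$E(X_n^2) = \sum_{k=1}^n E(X_k - X_{k-1})^2.$$

13.28 Let $\{X_n, \mathcal{F}_n\}_{n\ge 1}$ be a martingale in $L^2(\Omega, \mathcal{F}, P)$. Suppose $0 < b_n \uparrow \infty$ is such that $\sum_{j=2}^\infty \frac{E(X_j - X_{j-1})^2}{b_j^2} < \infty$. Show that $\frac{X_n}{b_n} \to 0$ w.p. 1. (Hint: Consider the sequence $Y_n \equiv \sum_{j=2}^n \frac{(X_j - X_{j-1})}{b_j}$, $n \ge 2$. Verify that $\{Y_n, \mathcal{F}_n\}_{n\ge 2}$ is an $L^2$-bounded martingale and use Kronecker's lemma (cf. Chapter 8).)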


13.29 Let $f \in L^1([0, 1], \mathcal{B}([0, 1]), m)$ where $m(\cdot)$ is Lebesgue measure on $[0, 1]$. Let $\{H_k(\cdot)\}_{k\ge 1}$ be the Haar functions defined by
$$H_1(t) \equiv 1, \qquad H_2(t) \equiv \begin{cases} 1 & 0 \le t < \frac{1}{2} \\ -1 & \frac{1}{2} \le t < 1, \end{cases}$$
$$H_{2^n+1}(t) = \begin{cases} 2^{n/2} & 0 \le t < 2^{-(n+1)} \\ -2^{n/2} & 2^{-(n+1)} \le t < 2^{-n} \\ 0 & \text{otherwise}, \end{cases} \qquad n = 1, 2, \ldots,$$
$$H_{2^n+j}(t) = H_{2^n+1}\Bigl(t - \frac{j-1}{2^n}\Bigr), \quad j = 1, 2, \ldots, 2^n.$$
Let $a_k \equiv \int_0^1 f(t) H_k(t)\, dt$, $k = 1, 2, \ldots$.
(a) Verify that $\{X_n(t) \equiv \sum_{k=1}^n a_k H_k(t)\}_{n\ge 1}$ is a martingale w.r.t. the natural filtration.
(b) Show that $X_n$ converges w.p. 1 and in $L^1$ to $f$.

13.30 Let $\{X_n\}_{n\ge 1}$ be a sequence of nonnegative random variables on some probability space $(\Omega, \mathcal{F}, P)$ such that $E(X_{n+1}|\mathcal{F}_n) \le X_n + Y_n$, where $\mathcal{F}_n \equiv \sigma\langle X_1, \ldots, X_n \rangle$ and $\{Y_n\}_{n\ge 1}$ is a sequence of nonnegative constants such that $\sum_{n=1}^\infty Y_n < \infty$. Show that $\{X_n\}_{n\ge 1}$ converges w.p. 1.

13.31 Let $\{\tau_j\}_{j\ge 1}$ be independent exponential random variables with $\lambda_j = E\tau_j$, $j \ge 1$, such that $\sum_{j=1}^\infty \lambda_j^2 < \infty$. Let $T_0 = 0$, $T_n = \sum_{j=1}^n \tau_j$, $n \ge 1$, $s_n = \sum_{j=1}^n \lambda_j$. Show that $\{X_n \equiv T_n - s_n\}_{n\ge 1}$ converges w.p. 1 and in mean square. (Hint: Show that $\{X_n\}_{n\ge 1}$ is an $L^2$-bounded martingale.)

14 Markov Chains and MCMC

14.1 Markov chains: Countable state space

14.1.1 Definition

Let $S = \{a_j : j = 1, 2, \ldots, K\}$, $K \le \infty$, be a finite or countable set. Let $P = ((p_{ij}))_{K\times K}$ be a stochastic matrix, i.e., $p_{ij} \ge 0$ and, for every $i$, $\sum_{j=1}^K p_{ij} = 1$, and let $\mu = \{\mu_j : 1 \le j \le K\}$ be a probability distribution, i.e., $\mu_j \ge 0$ for all $j$ and $\sum_{j=1}^K \mu_j = 1$.

Definition 14.1.1: A sequence $\{X_n\}_{n=0}^\infty$ of $S$-valued random variables on some probability space $(\Omega, \mathcal{F}, P)$ is called a Markov chain with stationary transition probabilities $P = ((p_{ij}))$, initial distribution $\mu$, and state space $S$ if
(i) $X_0 \sim \mu$, i.e., $P(X_0 = a_j) = \mu_j$ for all $j$, and
(ii) $P(X_{n+1} = a_j \mid X_n = a_i, X_{n-1} = a_{i_{n-1}}, \ldots, X_0 = a_{i_0}) = p_{ij}$ for all $a_i, a_j, a_{i_{n-1}}, \ldots, a_{i_0} \in S$ and $n = 0, 1, 2, \ldots$,
i.e., the sequence is memoryless: given $X_n$, $X_{n+1}$ is independent of $\{X_j : j \le n - 1\}$. More generally, given the present ($X_n$), the past ($\{X_j : j \le n - 1\}$) and the future ($\{X_j : j > n\}$) are stochastically independent (Problem 14.1).

A few questions arise:

Question 1: Does such a sequence $\{X_n\}_{n=0}^\infty$ exist for every $\mu$ and $P$, and


if so, how does one generate them? The answer is yes. There are two approaches, namely, (i) using Kolmogorov's consistency theorem and (ii) an iid random iteration scheme.

Question 2: How does one describe the finite time behavior, i.e., the joint distribution of $(X_0, X_1, \ldots, X_n)$ for any $n \in \mathbb{N}$? One may use the Markov property repeatedly to obtain the joint distribution.

Question 3: What can one say about the long-term behavior? One can ask questions like:
(a) Does the trajectory $n \to X_n$ converge as $n \to \infty$?
(b) Does the distribution of $X_n$ converge as $n \to \infty$?
(c) Do the laws of large numbers hold for a suitable class of functions $f$, i.e., do the limits $\lim_{n\to\infty} \frac{1}{n} \sum_{j=1}^n f(X_j)$ exist w.p. 1?
(d) Do stationary distributions exist? (A probability distribution $\pi = \{\pi_i\}_{i\in S}$ is called a stationary distribution for a Markov chain $\{X_n\}_{n\ge 0}$ if, whenever $X_0$ has distribution $\pi$, $X_n$ also has distribution $\pi$ for all $n \ge 1$.)

The keys to answering these questions are the concepts of communication, irreducibility, aperiodicity, and most importantly recurrence. The main tools are the laws of large numbers, renewal theory, and coupling.

14.1.2 Examples

Example 14.1.1: (IID sequence). Let $\{X_n\}_{n=0}^\infty$ be a sequence of iid $S$-valued random variables with distribution $\mu = \{\mu_j\}$. Then $\{X_n\}_{n=0}^\infty$ is a Markov chain with initial distribution $\mu$ and transition probabilities given by $p_{ij} = \mu_j$ for all $i$, i.e., all rows of $P$ are identical. It is also easy to prove the converse, i.e., if all rows of $P$ are identical, then $\{X_n\}_{n=1}^\infty$ are iid and independent of $X_0$. To answer Question 3 in this case, note that $P[X_n = j] = \mu_j$ for all $n$ and thus $X_n$ converges in distribution. But the trajectories do not converge. However, the law of large numbers holds and $\mu$ is the unique stationary distribution.

Example 14.1.2: (Random walks). Let $S = \mathbb{Z}$, the set of integers. Let $\{\epsilon_n\}_{n\ge 1}$ be iid with distribution $\{p_j\}_{j\in\mathbb{Z}}$, i.e., $P[\epsilon_1 = j] = p_j$ for $j \in \mathbb{Z}$. Let $X_0$ be a $\mathbb{Z}$-valued random variable independent of $\{\epsilon_n\}_{n\ge 1}$. Then, define for $n \ge 0$,
$$X_{n+1} = X_n + \epsilon_{n+1} = X_{n-1} + \epsilon_n + \epsilon_{n+1} = \cdots = X_0 + \sum_{j=1}^{n+1} \epsilon_j.$$


In this case, with probability one, the trajectories of $X_n$ go to $+\infty$ (respectively, $-\infty$), if $E(\epsilon_1) > 0$ (respectively $< 0$). If $E(\epsilon_1) = 0$, then the trajectories fluctuate infinitely often provided $p_0 \ne 1$.

Example 14.1.3: (Branching processes). Let $S = \mathbb{Z}_+ = \{0, 1, 2, \ldots\}$. Let $\{p_j\}_{j=0}^\infty$ be a probability distribution. Let $\{\xi_{ni}\}_{i\in\mathbb{N}, n\in\mathbb{Z}_+}$ be iid random variables with distribution $\{p_j\}_{j=0}^\infty$. Let $Z_0$ be a $\mathbb{Z}_+$-valued random variable independent of $\{\xi_{ni}\}$. Let
$$Z_{n+1} = \sum_{i=1}^{Z_n} \xi_{ni} \quad \text{for } n \ge 0.$$
If $p_0 = 0$ and $p_1 < 1$, then $Z_n \to \infty$ w.p. 1. If $p_0 > 0$, then $P[Z_n \to \infty] + P[Z_n \to 0] = 1$. Also, $P[Z_n \to 0 \mid Z_0 = 1] = q$ is the smallest solution in $[0, 1]$ to the equation
$$q = f(q) = \sum_{j=0}^\infty p_j q^j.$$
So $q = 1$ iff $m \equiv \sum_{j=1}^\infty j p_j = f'(1) \le 1$ (see Chapter 18 also).
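The extinction probability $q$ of Example 14.1.3 can be approximated by iterating the generating function from 0, since $f^{(n)}(0) = P[Z_n = 0 \mid Z_0 = 1]$ increases to the smallest fixed point of $f$ in $[0, 1]$. The sketch below assumes an offspring distribution with finite support, passed as a list `p` with `p[j]` $= p_j$; the two distributions used are arbitrary illustrations.

```python
def extinction_probability(p, iterations=1000):
    """Smallest root in [0, 1] of q = f(q) = sum_j p[j] * q**j, found by
    iterating q <- f(q) from q = 0 (the iterates increase to that root)."""
    q = 0.0
    for _ in range(iterations):
        q = sum(pj * q ** j for j, pj in enumerate(p))
    return q

def offspring_mean(p):
    """m = sum_j j * p[j] = f'(1)."""
    return sum(j * pj for j, pj in enumerate(p))

if __name__ == "__main__":
    supercritical = [0.25, 0.25, 0.5]   # m = 1.25 > 1, so q < 1
    subcritical = [0.5, 0.25, 0.25]     # m = 0.75 <= 1, so q = 1
    for p in (supercritical, subcritical):
        print(offspring_mean(p), extinction_probability(p))
```

For the first distribution the fixed-point equation reduces to $2q^2 - 3q + 1 = 0$, whose roots are $1/2$ and $1$, so the iteration settles at $q = 1/2$.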

Example 14.1.4: (Birth and death chains). Again take $S = \mathbb{Z}_+$. Define $P$ by
$$p_{i,i+1} = \alpha_i, \quad p_{i,i-1} = \beta_i = 1 - \alpha_i, \quad \text{for } i \ge 1,$$
$$p_{0,1} = \alpha_0, \quad p_{0,0} = \beta_0 = 1 - \alpha_0.$$
From state $i \ge 1$, the population increases by one with probability $\alpha_i$ and decreases by one with probability $1 - \alpha_i$.

Example 14.1.5: (Iterated function systems). Let $G := \{h_i : h_i : S \to S, i = 1, 2, \ldots, L\}$, $L \le \infty$. Let $\mu = \{p_i\}_{i=1}^L$ be a probability distribution. Let $\{f_n\}_{n=1}^\infty$ be iid, such that $P(f_n = h_i) = p_i$, $1 \le i \le L$. Let $X_0$ be an $S$-valued random variable independent of $\{f_n\}_{n=1}^\infty$. Then, the iid random iteration scheme
$$X_1 = f_1(X_0), \quad X_2 = f_2(X_1), \quad \ldots, \quad X_{n+1} = f_{n+1}(X_n) = f_{n+1}\bigl(f_n(\cdots(f_1(X_0))\cdots)\bigr)$$
is a Markov chain with transition probability matrix
$$p_{ij} = P(f_1(i) = j) = \sum_{r=1}^L p_r I(h_r(i) = j).$$

Remark 14.1.1: Any discrete state space Markov chain can be generated in this way (see II in 14.1.3 below).

14.1.3 Existence of a Markov chain

I. Kolmogorov's approach. Let $\Omega = S^{\mathbb{Z}_+} = \{\omega : \omega \equiv \{x_n\}_{n=0}^\infty, x_n \in S \text{ for all } n\}$ be the set of all sequences $\{x_n\}_{n=0}^\infty$ with values in $S$. Let $\mathcal{F}_0$ consist of all finite dimensional subsets of $\Omega$ of the form
$$A = \{\omega : \omega = \{x_n\}_{n=0}^\infty, x_j = a_j, 0 \le j \le m\},$$
where $m < \infty$ and $a_j \in S$ for all $j = 0, 1, \ldots, m$. Let $\mathcal{F}$ be the $\sigma$-algebra generated by $\mathcal{F}_0$. Fix $\mu$ and $P$. For $A$ as above let
$$\lambda_{\mu,P}(A) := \mu_{a_0} p_{a_0 a_1} p_{a_1 a_2} \cdots p_{a_{m-1} a_m}.$$
Then it can be shown, using the extension theorem from Chapter 2 or Kolmogorov's consistency theorem of Chapter 6, that $\lambda_{\mu,P}$ can be extended to be a probability measure on $\mathcal{F}$. Let $X_n(\omega) = x_n$, if $\omega = \{x_n\}_{n=0}^\infty$, be the coordinate projection. Then $\{X_n\}_{n=0}^\infty$ will be a sequence of $S$-valued random variables on $(\Omega, \mathcal{F}, \lambda_{\mu,P})$, such that it is a Markov chain with initial distribution $\mu$ and transition probability $P$. A typical element $\omega = \{x_n\}_{n=0}^\infty$ of $\Omega$ is called a sample path or a sample trajectory. The following are examples of events (sets) in $\mathcal{F}$, which are not finite-dimensional:
$$A_1 = \Bigl\{\omega : \lim_{n\to\infty} \frac{1}{n} \sum_{j=1}^n h(x_j) \text{ exists}\Bigr\} \text{ for a given } h : S \to \mathbb{R},$$
$$A_2 = \bigl\{\omega : \text{the set of limit points of } \{x_n\}_{n=0}^\infty = \{a, b\}\bigr\}.$$
Thus, it is essential to go to $(\Omega, \mathcal{F}, \lambda_{\mu,P})$ to discuss the events involving asymptotic (long term) behavior, i.e., as $n \to \infty$.

II. IIDRM approach (iteration of iid random maps). Let $P = ((p_{ij}))_{K\times K}$ be a stochastic matrix. Let $f : S \times [0, 1] \to S$ be
$$f(a_i, u) = \begin{cases} a_1 & \text{if } 0 \le u < p_{i1} \\ a_2 & \text{if } p_{i1} \le u < p_{i1} + p_{i2} \\ \ \vdots & \\ a_j & \text{if } p_{i1} + p_{i2} + \cdots + p_{i(j-1)} \le u < p_{i1} + p_{i2} + \cdots + p_{ij} \\ \ \vdots & \\ a_K & \text{if } p_{i1} + p_{i2} + \cdots + p_{i(K-1)} \le u \le 1. \end{cases} \tag{1.1}$$
Let $U_1, U_2, \ldots$ be iid Uniform $[0, 1]$ random variables. Let $f_n(\cdot) := f(\cdot, U_n)$. Then for each $n$, $f_n$ maps $S$ to $S$. Also $\{f_n\}_{n=1}^\infty$ are iid.
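The map $f$ of (1.1) translates directly into code: scan the cumulative sums of row $i$ and return the first state whose cumulative probability exceeds $u$. The sketch below is illustrative and makes a few assumptions of its own: the state space is finite, states are labelled $0, \ldots, K-1$ instead of $a_1, \ldots, a_K$, and the uniform variables come from Python's `random.Random`.

```python
import random
from itertools import accumulate

def random_map(P, i, u):
    """The map f(a_i, u) of (1.1) with states labelled 0, ..., K-1: return the
    first j whose cumulative row sum p_i0 + ... + p_ij exceeds u."""
    for j, cum in enumerate(accumulate(P[i])):
        if u < cum:
            return j
    return len(P[i]) - 1    # guards against round-off when u is very close to 1

def simulate_chain(P, x0, n_steps, rng):
    """Iterate X_{n+1} = f(X_n, U_{n+1}) with iid Uniform[0, 1) draws U_n."""
    path = [x0]
    for _ in range(n_steps):
        path.append(random_map(P, path[-1], rng.random()))
    return path

if __name__ == "__main__":
    P = [[0.5, 0.5, 0.0],
         [0.25, 0.5, 0.25],
         [0.0, 0.0, 1.0]]
    print(simulate_chain(P, 0, 20, random.Random(1)))
```

Feeding the current state and a fresh uniform draw into the same deterministic map at each step is exactly the iid random iteration scheme of Example 14.1.5.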


Let $X_0$ be independent of $\{U_i\}_{i=1}^\infty$ and $X_0 \sim \mu$. Then the sequence $\{X_n\}_{n=0}^\infty$ defined by
$$X_{n+1} = f_{n+1}(X_n) = f(X_n, U_{n+1})$$
is a Markov chain with initial distribution $\mu$ and transition probability $P$. The underlying probability space on which $X_0$ and $\{U_i\}_{i=1}^\infty$ are defined can be taken to be the Lebesgue space $([0, 1], \mathcal{B}([0, 1]), m)$, where $m$ is the Lebesgue measure.

Finite Time Behavior of $\{X_n\}$

For each $n \in \mathbb{N}$,
$$P(X_0 = a_0, X_1 = a_1, \ldots, X_n = a_n) = \Bigl[\prod_{j=1}^n P(X_j = a_j \mid X_{j-1} = a_{j-1}, \ldots, X_0 = a_0)\Bigr] P(X_0 = a_0) = \Bigl[\prod_{j=1}^n p_{a_{j-1} a_j}\Bigr] \mu_{a_0}. \tag{1.2}$$
Thus, the joint distribution for any finite $n$ is determined by $\mu$ and $P$. In particular,
$$P(X_n = a_n \mid X_0 = a_0) = \sum \prod_{j=1}^n p_{a_{j-1} a_j} = (P^n)_{a_0 a_n}, \tag{1.3}$$
where the sum in the middle term runs over all $a_1, a_2, \ldots, a_{n-1}$ and $P^n$ is the $n$th power of $P$. So the behavior of the distribution of $X_n$ can be studied via that of $P^n$ for large $n$. But this analytic approach is not as comprehensive as the probabilistic one, via the concept of recurrence, which will be described next.
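Identity (1.3) can be verified by brute force on a small chain. The sketch below is an illustration with a hypothetical three-state matrix and states labelled $0, 1, 2$. It sums the path products over all intermediate states and compares the result with the corresponding entry of $P^n$, using exact rational arithmetic.

```python
from fractions import Fraction
from itertools import product

def mat_mul(A, B):
    size = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def mat_power(P, n):
    """n-th power of the square matrix P (identity for n = 0)."""
    size = len(P)
    result = [[Fraction(int(i == j)) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result

def path_sum(P, a0, an, n):
    """Sum over a_1, ..., a_{n-1} of prod_{j=1}^n p_{a_{j-1} a_j}, as in (1.3)."""
    total = Fraction(0)
    for middle in product(range(len(P)), repeat=n - 1):
        path = (a0, *middle, an)
        term = Fraction(1)
        for s, t in zip(path, path[1:]):
            term *= P[s][t]
        total += term
    return total

if __name__ == "__main__":
    F = Fraction
    P = [[F(1, 2), F(1, 2), F(0)],
         [F(1, 4), F(1, 2), F(1, 4)],
         [F(0), F(1, 3), F(2, 3)]]
    print(path_sum(P, 0, 2, 4), mat_power(P, 4)[0][2])
```

The path sum has $K^{n-1}$ terms, so it is only practical for tiny examples; matrix powers are the efficient route.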

14.1.4 Limit theory

Let $\{X_n\}_{n=0}^\infty$ be a Markov chain with state space $S = \{1, 2, \ldots, K\}$, $K \le \infty$, and transition probability matrix $P = ((p_{ij}))_{K\times K}$.

Definition 14.1.2: (Hitting times). For any set $A \subset S$, the hitting time $T_A$ is defined as
$$T_A = \inf\{n : X_n \in A, n \ge 1\},$$
i.e., it is the first time after 0 that the chain enters $A$. The random variable $T_A$ is also called the first passage time for $A$ or the first entrance time for $A$. Note that $T_A$ is a stopping time (cf. Chapter 13) w.r.t. the filtration $\{\mathcal{F}_n \equiv \sigma\{X_j : 0 \le j \le n\}\}_{n\ge 0}$.


By convention, inf ∅ = ∞. If A = {i}, then write T{i} = Ti for notational simplicity. A concept of fundamental importance is recurrence of a state. Deﬁnition 14.1.3: (Recurrence). A state i is recurrent or transient according as Pi [Ti < ∞] = 1 or < 1, where Pi denotes the probability distribution of {Xn }∞ n=0 with X0 = i with probability 1. Thus i is recurrent iﬀ fii ≡ P Xn = i for some 1 ≤ n < ∞ | X0 = i = 1. (1.4) Deﬁnition 14.1.4: (Null and positive recurrence). A recurrent state i is called null recurrent if Ei (Ti ) = ∞ and positive recurrent if Ei (Ti ) < ∞, where Ei refers to expectation w.r.t. the probability distribution Pi . Example 14.1.6: (Frog in the well). Let S = {1, 2, . . .} and P = ((pij )) be given by pi,i+1 = αi

and pi,1 = 1 − αi

for all i ≥ 1, 0 < αi < 1.

Then $P_1[T_1 > r] = \alpha_1 \alpha_2 \cdots \alpha_r$. So $P_1[T_1 < \infty] = 1$ iff $\prod_{i=1}^{r} \alpha_i \to 0$ as $r \to \infty$ iff $\sum_{i=1}^{\infty} (1 - \alpha_i) = \infty$. Further, 1 is positive recurrent iff $\sum_{r=1}^{\infty} \prod_{i=1}^{r} \alpha_i < \infty$. Thus, if $\alpha_i = 1 - \frac{1}{2i^2}$, then 1 is transient; but if $\alpha_i \equiv \alpha$, $0 < \alpha < 1$ for all $i$, then 1 is positive recurrent. If $\alpha_i = 1 - \frac{1}{ci}$, $c > 1$, then 1 is null recurrent (Problem 14.2).

Example 14.1.7: (Simple random walk). Let $S = \mathbb{Z}$, the set of all integers, $p_{i,i+1} = p$, $p_{i,i-1} = q$, $0 < p = 1 - q < 1$. Then it can be shown by using the SLLN that for $p \ne \frac{1}{2}$, 0 is transient. But it is harder to show that for $p = \frac{1}{2}$, 0 is null recurrent (see Corollary 14.1.5 below).

The next result says that after each return to $i$, the Markov chain starts afresh.

Proposition 14.1.1: (The strong Markov property). For any $i \in S$ and any initial distribution $\mu$ of $X_0$ and any $k < \infty$, $a_1, \ldots, a_k$ in $S$,
$$P_\mu(X_{T_i + j} = a_j,\ j = 1, 2, \ldots, k,\ T_i < \infty) = P_\mu(T_i < \infty)\, P_i(X_j = a_j,\ j = 1, 2, \ldots, k).$$

Proof: For any $n \in \mathbb{N}$,
$$\begin{aligned}
P_\mu(X_{T_i + j} = a_j,\ 1 \le j \le k,\ T_i = n)
&= P_\mu(X_{n+j} = a_j,\ 1 \le j \le k,\ X_n = i,\ X_r \ne i,\ 1 \le r \le n-1) \\
&= P_\mu(X_{n+j} = a_j,\ 1 \le j \le k \mid X_n = i,\ X_r \ne i,\ 1 \le r \le n-1) \\
&\qquad \cdot P_\mu(X_n = i,\ X_r \ne i,\ 1 \le r \le n-1) \\
&= P_i(X_j = a_j,\ 1 \le j \le k)\, P_\mu(X_n = i,\ X_r \ne i,\ 1 \le r \le n-1) \\
&= P_i(X_j = a_j,\ 1 \le j \le k)\, P_\mu(T_i = n).
\end{aligned}$$
Adding both sides over $n$ yields the result. □
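The three regimes of the frog in the well (Example 14.1.6) can be checked numerically from the formula $P_1[T_1 > r] = \alpha_1 \cdots \alpha_r$ together with the tail-sum identity $E_1(T_1) = \sum_{r \ge 0} P_1[T_1 > r]$. The following is a minimal sketch, not part of the original text: the function names, the truncation level `R`, and the choice $c = 2$ in the null recurrent case are assumptions made only for illustration.

```python
def survival_probabilities(alpha, r_max):
    """Return [P_1(T_1 > r) for r = 0, ..., r_max], using P_1(T_1 > r) = alpha(1) * ... * alpha(r)."""
    probs = [1.0]
    for i in range(1, r_max + 1):
        probs.append(probs[-1] * alpha(i))
    return probs


def summarize(alpha, r_max):
    """Return (P_1(T_1 > r_max), partial sum of E_1(T_1) = sum_{r >= 0} P_1(T_1 > r))."""
    probs = survival_probabilities(alpha, r_max)
    return probs[-1], sum(probs)


R = 100_000  # truncation level (an arbitrary choice for this illustration)
transient = summarize(lambda i: 1 - 1 / (2 * i * i), R)   # alpha_i = 1 - 1/(2 i^2)
positive = summarize(lambda i: 0.5, R)                    # alpha_i = 1/2 for all i
null_small = summarize(lambda i: 1 - 1 / (2 * i), R // 4) # alpha_i = 1 - 1/(c i) with c = 2
null_large = summarize(lambda i: 1 - 1 / (2 * i), R)

print("transient:         ", transient)               # survival probability stays away from 0
print("positive recurrent:", positive)                # survival -> 0, partial sums converge
print("null recurrent:    ", null_small, null_large)  # survival -> 0, partial sums keep growing
```

In the transient case the product stabilises at a positive value, which approximates $P_1[T_1 = \infty]$. In the null recurrent case the survival probability tends to 0 while the partial sums of $E_1(T_1)$ continue to grow with the truncation level.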

The strong Markov property leads to the important and useful technique of breaking up the time evolution of a Markov chain into iid cycles. This, combined with the law of large numbers, yields the basic convergence results.

Definition 14.1.5: (IID cycles). Let $i$ be a state. Let $T_i^{(0)} = 0$ and
$$T_i^{(k+1)} = \begin{cases} \inf\{n : n > T_i^{(k)},\ X_n = i\}, & \text{if } T_i^{(k)} < \infty, \\ \infty, & \text{if } T_i^{(k)} = \infty, \end{cases} \tag{1.5}$$
i.e., for $k = 0, 1, 2, \ldots$, $T_i^{(k)}$ are the successive return times to state $i$.

Proposition 14.1.2: Let $i$ be a recurrent state. Then $P_i\big(T_i^{(k)} < \infty\big) = 1$ for all $k \ge 1$.

Proof: By definition of recurrence, the claim is true for $k = 1$. If it is true for $k > 1$, then $P_i(T_i^{(k+1)}$ […] $0$.

A pair of states $i$, $j$ are said to communicate if $i \to j$ and $j \to i$, i.e., if there exist $n \ge 1$, $m \ge 1$ such that $p_{ij}^{(n)} > 0$, $p_{ji}^{(m)} > 0$.

Definition 14.1.7: (Irreducibility). A Markov chain with state space $S \equiv \{1, 2, \ldots, K\}$, $K \le \infty$ and transition probability matrix $P \equiv ((p_{ij}))$ is irreducible if for each $i$, $j$ in $S$, $i$ and $j$ communicate.

Definition 14.1.8: A state $i$ is absorbing if $p_{ii} = 1$.

It is easy to show that if $i$ is absorbing and $j \to i$, then $j$ is transient (Problem 14.4).

Proposition 14.1.7: (Solidarity property). Let $i$ be recurrent and $i \to j$. Then $f_{ji} = 1$ and $j$ is recurrent, where $f_{ji} \equiv P(T_i < \infty \mid X_0 = j)$.

Proof: By the (strong) Markov property,
$$\begin{aligned}
1 - f_{ii} &= P(T_i = \infty \mid X_0 = i) \\
&\ge P(T_j < \infty,\ T_i = \infty \mid X_0 = i) \\
&= P(T_i = \infty \mid X_0 = j)\, P(T_j < T_i \mid X_0 = i) \\
&= (1 - f_{ji})\, f^*_{ij},
\end{aligned}$$
where $f^*_{ij} = P(\text{visiting } j \text{ before visiting } i \mid X_0 = i)$. (Intuitively speaking, one possibility of not returning to $i$, starting from $i$, is to visit $j$ and then not return to $i$.) Now $i$ recurrent and $i \to j$ yield $1 - f_{ii} = 0$ and $f^*_{ij} > 0$ (Problem 14.4) and so $1 - f_{ji} = 0$, i.e., $f_{ji} = 1$. Thus, starting from $j$, the chain visits $i$ w.p. 1. From $i$, it keeps returning to $i$ infinitely often. In each of these excursions, the probability $f^*_{ij}$ of visiting $j$ is positive and since there are an infinite number of such excursions and they are iid, the chain does visit $j$ in one of these excursions w.p. 1. That is, $j$ is recurrent. □

An alternate proof using Corollary 14.1.5 is also possible (Problem 14.5).

Proposition 14.1.8: In a finite state space irreducible Markov chain all states are recurrent.

Proof: By Corollary 14.1.6, there is at least one state $i_0$ that is recurrent. By irreducibility and solidarity, this implies all states are recurrent.

Remark 14.1.2: A stronger result holds, namely, that for a finite state space irreducible Markov chain, all states are positive recurrent (Problem 14.6).


Theorem 14.1.9: (A law of large numbers). Let $i$ be positive recurrent. Let
$$N_n(j) = \#\{k : 0 \le k \le n,\ X_k = j\}, \quad j \in S \tag{1.9}$$
be the number of visits to $j$ during $\{0, 1, \ldots, n\}$. Let $L_n(j) \equiv \frac{N_n(j)}{n+1}$ be the empirical distribution at time $n$. Let $X_0 = i$, with probability 1. Then
$$L_n(j) \to \frac{V_{ij}}{E_i(T_i)}, \quad \text{w.p. 1} \tag{1.10}$$
where $V_{ij} = E_i\Big(\sum_{k=0}^{T_i - 1} \delta_{X_k, j}\Big)$ is the mean number of visits to $j$ during $\{0, 1, \ldots, T_i - 1\}$ starting from $i$. In particular, $L_n(i) \to \frac{1}{E_i T_i}$, w.p. 1.

Proof: For each $n$, let $k \equiv k(n)$ be such that $T_i^{(k)} \le n < T_i^{(k+1)}$. Then,
$$N_{T_i^{(k)}}(j) \le N_n(j) \le N_{T_i^{(k+1)}}(j),$$
i.e.,
$$\sum_{r=0}^{k} \xi_r \le N_n(j) \le \sum_{r=0}^{k+1} \xi_r,$$
where $\xi_r = \#\{l : T_i^{(r)} \le l < T_i^{(r+1)},\ X_l = j\}$ is the number of visits to $j$ during the $r$th excursion. Since $V_{ij} \equiv E_i \xi_1 \le E_i T_i < \infty$, by the SLLN, with probability 1,
$$\frac{1}{k(n)} \sum_{r=0}^{k(n)} \xi_r \to E_i(\xi_1) = V_{ij}$$
and $\frac{1}{k} T_i^{(k)} \to E_i(T_i)$. Since $n$ lies between $T_i^{(k(n))}$ and $T_i^{(k(n)+1)}$, it follows that $\frac{n}{k(n)} \to E_i T_i$. Since
$$\frac{k}{n} \cdot \frac{1}{k} \sum_{r=0}^{k} \xi_r \le \frac{N_n(j)}{n} \le \frac{k+1}{n} \cdot \frac{1}{k+1} \sum_{r=0}^{k+1} \xi_r,$$
it follows that
$$L_n(j) = \frac{n}{n+1} \cdot \frac{N_n(j)}{n} \to \frac{E_i(\xi_1)}{E_i(T_i)}.$$
□
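The law of large numbers in Theorem 14.1.9 can be illustrated by simulation. The sketch below is not from the text: the three-state transition matrix, the seed, and the function names are hypothetical choices. It compares the empirical distribution $L_n(j)$ with the reciprocal of the observed mean gap between successive visits to $j$, which estimates $1/E_j(T_j)$.

```python
import random


def simulate_chain(P, start, n_steps, rng):
    """Simulate X_0, ..., X_{n_steps} for a finite chain with transition matrix P (a list of rows)."""
    states = range(len(P))
    path = [start]
    for _ in range(n_steps):
        path.append(rng.choices(states, weights=P[path[-1]])[0])
    return path


def empirical_distribution(path, n_states):
    """Return L_n(j) = N_n(j) / (n + 1) for each state j."""
    counts = [0] * n_states
    for x in path:
        counts[x] += 1
    return [c / len(path) for c in counts]


def mean_return_time(path, i):
    """Average gap between successive visits to i along the path, an estimate of E_i(T_i)."""
    visits = [t for t, x in enumerate(path) if x == i]
    gaps = [b - a for a, b in zip(visits, visits[1:])]
    return sum(gaps) / len(gaps)


# Hypothetical three-state chain (states 0, 1, 2), chosen only for illustration.
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
rng = random.Random(0)
path = simulate_chain(P, start=0, n_steps=100_000, rng=rng)
L = empirical_distribution(path, len(P))
for j in range(len(P)):
    print(j, round(L[j], 4), round(1 / mean_return_time(path, j), 4))
```

For this particular matrix the balance equations give the stationary distribution $(\frac14, \frac12, \frac14)$, so both printed columns should be close to those values.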

Note that the above proof works for any initial distribution µ such that Pµ (Ti < ∞) = 1 and further, the limit of Ln (j) is independent of any such µ. Thus, if (S, P ) is irreducible and one state i is positive recurrent, then for any initial distribution µ, Pµ (Ti < ∞) = 1. Note ﬁnally that the proof in Theorem 14.1.9 can be adapted to yield a criterion for transience, null recurrence, and positive recurrence of a state. Thus, the following holds. Corollary 14.1.10: Fix a state i. Then

(i) $i$ is transient iff $\lim_{n \to \infty} N_n(i)$ exists and is finite w.p. 1 for any initial distribution iff
$$\lim_{n \to \infty} E_i N_n(i) = \sum_{k=0}^{\infty} p_{ii}^{(k)} < \infty.$$

(ii) $i$ is null recurrent iff $\sum_{k=0}^{\infty} p_{ii}^{(k)} = \infty$ and
$$\lim_{n \to \infty} E_i L_n(i) = \lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n} p_{ii}^{(k)} = 0.$$

(iii) $i$ is positive recurrent iff
$$\lim_{n \to \infty} E_i L_n(i) = \lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n} p_{ii}^{(k)} > 0.$$

(iv) Let $(S, P)$ be irreducible and let one state $i$ be positive recurrent. Then, for any $j$ and any initial distribution $\mu$,
$$L_n(j) \to \frac{1}{E_j T_j} \in (0, \infty) \quad \text{w.p. 1}.$$

Thus, for the symmetric simple random walk on the integers
$$p_{00}^{(2k)} \sim \frac{c}{\sqrt{k}} \quad \text{as } k \to \infty,\ 0 < c < \infty$$
and hence
$$\sum_{n=0}^{\infty} p_{00}^{(n)} = \infty \quad \text{and} \quad \frac{1}{n} \sum_{k=0}^{n} p_{00}^{(k)} \to 0.$$
Thus 0 is null recurrent.

It is not difficult to show that if $j$ leads to $i$, i.e., if $f_{ji} \equiv P_j(T_i < \infty)$ is positive, then the number $\xi_1$ of visits to $j$ between consecutive visits to $i$ has all moments (Problem 14.9). Using the SLLN, Theorem 14.1.9 can be extended as follows to cover both null and positive recurrent cases.

Theorem 14.1.11: Let $i$ be a recurrent state. Then, for any $j$ and any initial distribution $\mu$ such that $P_\mu(T_i < \infty) = 1$,
$$L_n(j) \equiv \frac{1}{n+1} \sum_{k=0}^{n} \delta_{X_k j} \to \pi_{ij} \equiv \frac{V_{ij}}{E_i T_i}, \quad \text{w.p. 1} \tag{1.11}$$
as $n \to \infty$, where $V_{ij} < \infty$. (If $E_i T_i = \infty$, then $\pi_{ij} = 0$ for all $j$.)
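The null recurrence of the symmetric simple random walk, noted just before Theorem 14.1.11, can be checked with exact return probabilities. The sketch below is an illustration only: it uses the standard formula $p_{00}^{(2k)} = \binom{2k}{k} 4^{-k}$, which is not stated in the text, and the standard value $c = 1/\sqrt{\pi}$ obtained from Stirling's formula; the function name and the cutoff `K` are hypothetical.

```python
import math


def return_probabilities(k_max):
    """Return [p_00^(2k) for k = 0, ..., k_max] for the symmetric simple random walk.

    Uses p_00^(2k) = C(2k, k) / 4^k through the ratio p_00^(2k+2) / p_00^(2k) = (2k + 1) / (2k + 2).
    """
    probs = [1.0]
    for k in range(k_max):
        probs.append(probs[-1] * (2 * k + 1) / (2 * k + 2))
    return probs


K = 100_000  # cutoff, an arbitrary choice for this illustration
p_even = return_probabilities(K)
partial_sum = sum(p_even)            # sum_{n=0}^{2K} p_00^(n); the odd-step terms vanish
cesaro = partial_sum / (2 * K + 1)   # (1/(n+1)) sum_{k=0}^{n} p_00^(k) with n = 2K
scaled = p_even[K] * math.sqrt(K)    # sqrt(k) * p_00^(2k), which should settle near c

print(partial_sum, cesaro, scaled, 1 / math.sqrt(math.pi))
```

The partial sums keep growing with the cutoff while the Cesàro averages shrink toward 0, which is the pattern described in Corollary 14.1.10 (ii).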


Corollary 14.1.12: Let $(S, P)$ be irreducible and let one state be recurrent. Then, for any $j$ and any initial distribution,
$$L_n(j) \to c_j \quad \text{as } n \to \infty, \quad \text{w.p. 1}, \tag{1.12}$$
where $c_j = 1/E_j T_j$ if $E_j T_j < \infty$ and $c_j = 0$ otherwise.

The Basic Ergodic Theorem

Taking expectation in Theorem 14.1.11 and using the bounded convergence theorem leads to

Corollary 14.1.13: Let $i$ be recurrent. Then, for any initial distribution $\mu$ with $P_\mu(T_i < \infty) = 1$,
$$E_\mu(L_n(j)) = \frac{1}{n+1} \sum_{k=0}^{n} P_\mu(X_k = j) \to \frac{V_{ij}}{E_i(T_i)} := \pi_{ij} \quad \text{as } n \to \infty. \tag{1.13}$$

Theorem 14.1.14: Let $i$ be positive recurrent. Let $\pi_j := \pi_{ij}$ for $j \in S$. Then $\{\pi_j\}_{j \in S}$ is a stationary distribution for $P$, i.e., $\sum_j \pi_j = 1$ and $\sum_{l \in S} \pi_l p_{lj} = \pi_j$, for all $j$.

Proof:
$$\begin{aligned}
\frac{1}{n+1} \sum_{k=0}^{n} p_{ij}^{(k)}
&= \frac{1}{n+1} \delta_{ij} + \frac{1}{n+1} \sum_{k=1}^{n} p_{ij}^{(k)} \\
&= \frac{1}{n+1} \delta_{ij} + \frac{1}{n+1} \sum_{k=1}^{n} \sum_{l} p_{il}^{(k-1)} p_{lj} \\
&= \frac{1}{n+1} \delta_{ij} + \sum_{l} \Big( \frac{1}{n+1} \sum_{k=1}^{n} p_{il}^{(k-1)} \Big) p_{lj}.
\end{aligned}$$
Taking limit as $n \to \infty$ and using Fatou's lemma yields
$$\pi_j \ge \sum_{l} \pi_l p_{lj}. \tag{1.14}$$
If strict inequality were to hold for some $j_0$, then adding over $j$ would yield
$$\sum_{j} \pi_j > \sum_{j} \sum_{l} \pi_l p_{lj} = \sum_{l} \pi_l \sum_{j} p_{lj} = \sum_{l} \pi_l.$$
Since $\sum_{j \in S} V_{ij} = E_i T_i$, $\sum_j \pi_j = 1$, so there cannot be a strict inequality in (1.14) for any $j$. □

Therefore, the following has been established.


Theorem 14.1.15: Let $i$ be a positive recurrent state. Let
$$\pi_j := \frac{E_i\big(\sum_{k=0}^{T_i - 1} \delta_{X_k, j}\big)}{E_i(T_i)}.$$
Then

(i) $\{\pi_j\}_{j \in S}$ is a stationary distribution.

(ii) For any $j$ and any initial distribution $\mu$ with $P_\mu(T_i < \infty) = 1$,

(a) $L_n(j) \equiv \frac{1}{n+1} \#\{k : X_k = j,\ 0 \le k \le n\} \to \pi_j$ w.p. 1 $(P_\mu)$,

(b) $\frac{1}{n+1} \sum_{k=0}^{n} P_\mu(X_k = j) \to \pi_j$.

In particular, if $j = i$, we have
$$\frac{1}{n+1} \sum_{k=0}^{n} p_{ii}^{(k)} \to \pi_i = \frac{1}{E_i(T_i)}. \tag{1.15}$$

Now let $i$ be a positive recurrent state and $j$ be such that $i \to j$. Then $V_{ij} > 0$ and by the solidarity property, $f_{ji} = 1$ and $j$ is recurrent. Now taking $\mu = \delta_j$ in (ii) above and using Corollary 14.1.13 leads to the conclusions that
$$\frac{1}{n+1} \sum_{k=0}^{n} p_{jj}^{(k)} \to \pi_j > 0 \tag{1.16}$$
and
$$\pi_j = \frac{1}{E_j T_j}. \tag{1.17}$$

Thus, $j$ is positive recurrent. Now Theorem 14.1.15 leads to the basic ergodic theorem for Markov chains.

Theorem 14.1.16: Let $(S, P)$ be irreducible and let one state be positive recurrent. Then

(i) all states are positive recurrent,

(ii) $\pi \equiv \{\pi_j \equiv (E_j T_j)^{-1} : j \in S\}$ is a stationary distribution for $(S, P)$,

(iii) for any initial distribution $\mu$ and any $j \in S$,

(a) $\frac{1}{n+1} \sum_{k=0}^{n} P_\mu(X_k = j) \to \pi_j$, i.e., $\pi$ is the unique limiting distribution (in the average sense) and hence the unique stationary distribution,

(b) $L_n(j) \equiv \frac{1}{n+1} \sum_{k=0}^{n} \delta_{X_k j} \to \pi_j$ w.p. 1 $(P_\mu)$.
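For a finite irreducible chain, the identity $\pi_j = (E_j T_j)^{-1}$ in Theorem 14.1.16 (ii) can be verified exactly. The sketch below is illustrative and not from the text: it computes $E_j T_j$ by first-step analysis (a standard technique assumed here), solves $\pi P = \pi$ with exact fractions, and uses a hypothetical three-state matrix; all function names are made up for this example.

```python
from fractions import Fraction


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination over exact fractions (A square and nonsingular)."""
    n = len(A)
    M = [row[:] + [b[r]] for r, row in enumerate(A)]
    for c in range(n):
        pivot = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[pivot] = M[pivot], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                M[r] = [vr - M[r][c] * vc for vr, vc in zip(M[r], M[c])]
    return [M[r][n] for r in range(n)]


def mean_return_time(P, j):
    """E_j(T_j) by first-step analysis: m_k = E_k(T_j) solves m_k = 1 + sum_{l != j} p_kl m_l, k != j."""
    others = [k for k in range(len(P)) if k != j]
    A = [[(1 if k == l else 0) - P[k][l] for l in others] for k in others]
    m = solve(A, [Fraction(1)] * len(others))
    return 1 + sum(P[j][k] * mk for k, mk in zip(others, m))


def stationary_distribution(P):
    """Solve pi P = pi with sum(pi) = 1; one balance equation is replaced by the normalisation."""
    n = len(P)
    A = [[P[l][j] - (1 if l == j else 0) for l in range(n)] for j in range(n - 1)]
    A.append([Fraction(1)] * n)
    return solve(A, [Fraction(0)] * (n - 1) + [Fraction(1)])


F = Fraction
# Hypothetical irreducible three-state chain, chosen only for illustration.
P = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 4), F(1, 2), F(1, 4)],
     [F(0), F(1, 2), F(1, 2)]]
pi = stationary_distribution(P)
for j in range(len(P)):
    print(j, pi[j], mean_return_time(P, j))
```

Exact fractions are used so that the products $\pi_j \, E_j T_j$ can be compared with 1 without rounding error.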

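There is a converse to the above result. To develop this, note first that if $j$ is a transient state, then the total number $N_j$ of visits to $j$ is finite w.p. 1 (for any initial distribution $\mu$) and hence $L_n(j) \to 0$ w.p. 1 and taking expectations, for any $i$,
$$\frac{1}{n+1} \sum_{k=0}^{n} p_{ij}^{(k)} \to 0 \quad \text{as } n \to \infty.$$

Now, suppose that $\pi \equiv \{\pi_j : j \in S\}$ is a stationary distribution for $(S, P)$. Then, for all $j$, $\pi_j = \sum_{i \in S} \pi_i p_{ij}$ and hence
$$\pi_j = \sum_{i \in S,\, r \in S} \pi_r p_{ri} p_{ij} = \sum_{r \in S} \pi_r p_{rj}^{(2)}$$
and by induction,
$$\pi_j = \sum_{i \in S} \pi_i p_{ij}^{(k)} \quad \text{for all } k \ge 0$$
implying
$$\pi_j = \sum_{i \in S} \pi_i \Big( \frac{1}{n+1} \sum_{k=0}^{n} p_{ij}^{(k)} \Big).$$
Now if $j$ is transient then for any $i$,
$$\frac{1}{n+1} \sum_{k=0}^{n} p_{ij}^{(k)} \to 0 \quad \text{as } n \to \infty$$
and so by the bounded convergence theorem
$$\pi_j = \lim_{n \to \infty} \sum_{i \in S} \pi_i \Big( \frac{1}{n+1} \sum_{k=0}^{n} p_{ij}^{(k)} \Big) = \sum_{i \in S} \pi_i \cdot 0 = 0.$$
Thus, $\pi_j > 0$ implies $j$ is recurrent. For $j$ recurrent, it follows from arguments similar to those used to establish Theorem 14.1.15 that
$$\frac{1}{n} \sum_{k=0}^{n} p_{ij}^{(k)} \to f_{ij} \frac{1}{E_j T_j}.$$
Thus, $\pi_j = \big(\sum_{i \in S} \pi_i f_{ij}\big) \frac{1}{E_j T_j}$. But $\sum_{i \in S} \pi_i f_{ij} \ge \pi_j f_{jj} = \pi_j > 0$. So $E_j T_j < \infty$, i.e., $j$ is positive recurrent. Summarizing the above discussion leads to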


Proposition 14.1.17: Let $\pi \equiv \{\pi_j : j \in S\}$ be a stationary distribution for $(S, P)$. Then, $\pi_j > 0$ implies that $j$ is positive recurrent.

It is now possible to state a converse to Theorem 14.1.16.

Theorem 14.1.18: Let $(S, P)$ be irreducible and admit a stationary distribution $\pi \equiv \{\pi_j : j \in S\}$. Then,

(i) all states are positive recurrent,

(ii) $\pi \equiv \{\pi_j : \pi_j = \frac{1}{E_j T_j},\ j \in S\}$ is the unique stationary distribution,

(iii) for any initial distribution $\mu$ and for all $j \in S$,

(a) $\frac{1}{n+1} \sum_{k=0}^{n} P_\mu(X_k = j) \to \pi_j$,

(b) $\frac{1}{n+1} \sum_{k=0}^{n} \delta_{X_k j} \to \pi_j$ w.p. 1 $(P_\mu)$.

In summary, for an irreducible Markov chain $(S, P)$ with a countable state space, a stationary distribution $\pi$ exists iff all states are positive recurrent. In that case, $\pi$ is unique and, for any initial distribution $\mu$, the distribution at time $n$ converges to $\pi$ in the Cesàro sense (i.e., on average) and the law of large numbers (LLN) holds. For the finite state space case, irreducibility suffices (Problem 14.6). If $h : S \to \mathbb{R}$ is a function such that $\sum_{j \in S} |h(j)| \pi_j < \infty$, then the LLN can be strengthened to
$$\frac{1}{n+1} \sum_{k=0}^{n} h(X_k) \to \sum_{j \in S} h(j) \pi_j \quad \text{w.p. 1 } (P_\mu)$$
for any $\mu$ (Problem 14.10). In particular, if $A$ is any subset of $S$, then
$$L_n(A) \equiv \frac{1}{n+1} \sum_{k=0}^{n} I_A(X_k) \to \pi(A) \equiv \sum_{j \in A} \pi_j$$
w.p. 1 $(P_\mu)$ for any $\mu$.

An important question that remains is whether the convergence of $P_\mu(X_n = j)$ to $\pi_j$ can be strengthened from the average sense to full, i.e., from the convergence to $\pi_j$ of $\frac{1}{n+1} \sum_{k=0}^{n} P_\mu(X_k = j)$ to the convergence to $\pi_j$ of $P_\mu(X_n = j)$ as $n \to \infty$. For this, the additional hypothesis needed is aperiodicity.

Definition 14.1.9: For any state $i$, the period $d_i$ of the state $i$ is the $\text{g.c.d.}\{n : n \ge 1,\ p_{ii}^{(n)} > 0\}$.


Further, $i$ is called aperiodic if $d_i = 1$.

Example 14.1.8: Let $S = \{0, 1, 2\}$ and
$$P = \begin{pmatrix} 0 & 1 & 0 \\ \frac{1}{2} & 0 & \frac{1}{2} \\ 0 & 1 & 0 \end{pmatrix}.$$
Then $d_i = 2$ for all $i$. Note that in this example, since $(S, P)$ is finite and irreducible, it has a unique stationary distribution, given by $\pi = (\frac{1}{4}, \frac{1}{2}, \frac{1}{4})$ and
$$\frac{1}{n+1} \sum_{k=0}^{n} p_{00}^{(k)} \to \frac{1}{4} \quad \text{as } n \to \infty$$
but $p_{00}^{(2n+1)} = 0$ for each $n$ and $p_{00}^{(2n)} \to \frac{1}{2}$ as $n \to \infty$. This suggests that aperiodicity will be needed. It turns out that if $(S, P)$ is irreducible, the period $d_i$ is the same for all $i$ (Problem 14.5).

Theorem 14.1.19: Let $(S, P)$ be irreducible, positive recurrent and aperiodic (i.e., $d_i = 1$ for all $i$). Let $\{X_n\}_{n \ge 0}$ be a $(S, P)$ Markov chain. Then, for any initial distribution $\mu$, $\lim_{n \to \infty} P_\mu(X_n = i) \equiv \pi_i$ exists for all $i$, where $\pi \equiv \{\pi_j : j \in S\}$ is the unique stationary distribution.

There are many proofs known for this, and two of them are outlined below. The first uses the discrete renewal theorem and the second uses a coupling argument.

Proof 1: (via the discrete renewal theorem). Fix a state $i$. Recall that $p_{ii}^{(n)} = P(X_n = i \mid X_0 = i)$, $n \ge 0$, and $f_{ii}^{(n)} = P(T_i = n \mid X_0 = i)$, $n \ge 1$. Using the Markov property, for $n \ge 1$,
$$\begin{aligned}
p_{ii}^{(n)} = P(X_n = i \mid X_0 = i)
&= \sum_{k=1}^{n} P(X_n = i,\ T_i = k \mid X_0 = i) \\
&= \sum_{k=1}^{n} P(T_i = k \mid X_0 = i)\, P(X_n = i \mid X_k = i) \\
&= \sum_{k=1}^{n} f_{ii}^{(k)} p_{ii}^{(n-k)}.
\end{aligned}$$
Let $a_n \equiv p_{ii}^{(n)}$, $n \ge 0$, $p_n \equiv f_{ii}^{(n)}$, $n \ge 1$. Then $\{p_n\}_{n \ge 1}$ is a probability distribution and $\{a_n\}_{n \ge 0}$ satisfies the discrete renewal equation
$$a_n = b_n + \sum_{k=0}^{n} a_{n-k} p_k, \quad n \ge 0$$


where $b_n = \delta_{n0}$ and $p_0 = 0$. It can be shown that $d_i = 1$ iff $\text{g.c.d.}\{k : k \ge 1,\ p_k > 0\} = 1$. Further, $E_i T_i = \sum_{k=1}^{\infty} k p_k < \infty$, by the assumption of positive recurrence. Now it follows from the discrete renewal theorem (see Section 8.5) that
$$\lim_{n \to \infty} a_n \text{ exists and equals } \frac{\sum_{j=0}^{\infty} b_j}{\sum_{k=1}^{\infty} k p_k} = \frac{1}{E_i T_i} = \pi_i.$$
□

Proof 2: (Using coupling arguments). Let $\{X_n\}_{n \ge 0}$ and $\{Y_n\}_{n \ge 0}$ be independent $(S, P)$ Markov chains such that $Y_0$ has distribution $\pi$ and $X_0$ has distribution $\mu$. Then $\{Z_n = (X_n, Y_n)\}_{n \ge 0}$ is a Markov chain with state space $S \times S$ and transition probability $P \times P \equiv ((p_{(i,j),(k,\ell)} = p_{ik} p_{j\ell}))$. Further, it can be shown (see Hoel, Port and Stone (1972)) that

(a) $\{\pi_{i,j} \equiv \pi_i \pi_j : (i, j) \in S \times S\}$ is a stationary distribution for $\{Z_n\}$,

(b) since $(S, P)$ is irreducible and aperiodic, the pair $(S \times S, P \times P)$ is irreducible.

Since $(S \times S, P \times P)$ is irreducible and admits a stationary distribution, it is necessarily recurrent and so from any initial distribution the first passage time $T_D$ for the diagonal $D \equiv \{(i, i) : i \in S\}$ is finite with probability one. Thus, $T_c \equiv \min\{n : n \ge 1,\ X_n = Y_n\}$ is finite w.p. 1. The random variable $T_c$ is called the coupling time. Let
$$\tilde{X}_n = \begin{cases} X_n, & n \le T_c, \\ Y_n, & n > T_c. \end{cases}$$
Then, it can be verified that $\{X_n\}_{n \ge 0}$ and $\{\tilde{X}_n\}_{n \ge 0}$ are identically distributed Markov chains. Thus
$$P(X_n = j) = P(\tilde{X}_n = j) = P(\tilde{X}_n = j,\ T_c < n) + P(\tilde{X}_n = j,\ T_c \ge n)$$
and
$$P(Y_n = j) = P(Y_n = j,\ T_c < n) + P(Y_n = j,\ T_c \ge n)$$
implying that
$$\big| P(X_n = j) - P(Y_n = j) \big| \le 2 P(T_c \ge n).$$
Since $P(T_c \ge n) \to 0$ as $n \to \infty$ and by the stationarity of $\pi$, $P(Y_n = j) = \pi_j$ for all $n$ and $j$, it follows that for any $j$,
$$\lim_{n \to \infty} P(X_n = j) \text{ exists and } = \pi_j.$$


□

In order to obtain results on rates of convergence for $|P(X_n = j) - \pi_j|$, one needs more assumptions on the distribution of return time $T_i$ or the coupling time $T_c$. For results on this, the books of Hoel et al. (1972), Meyn and Tweedie (1993), and Lindvall (1992) are good sources. It can be shown that in the irreducible case if for some $i$, $P_i(T_i > n) = O(\lambda_1^n)$ for some $0 < \lambda_1 < 1$, then $\sum_{j \in S} |P_i(X_n = j) - \pi_j| = O(\lambda_2^n)$ for some $\lambda_1 < \lambda_2 < 1$. In particular, this geometric convergence holds for the finite state space irreducible case.

The main results of this section are summarized below.

Theorem 14.1.20: Let $\{X_n\}_{n \ge 0}$ be a Markov chain with a countable state space $S = \{0, 1, 2, \ldots, K\}$, $K \le \infty$ and transition probability matrix $P \equiv ((p_{ij}))$. Let $(S, P)$ be irreducible. Then,

(a) All states are recurrent iff for some $i$ in $S$,
$$\sum_{n=1}^{\infty} p_{ii}^{(n)} = \infty.$$

(b) All states are positive recurrent iff for some $i$ in $S$,
$$\lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n} p_{ii}^{(k)} \text{ exists and is strictly positive.}$$

(c) There exists a stationary probability distribution $\pi$ iff there exists a positive recurrent state.

(d) If there exists a stationary distribution $\pi \equiv \{\pi_j : j \in S\}$, then

(i) it is unique, all states are positive recurrent and for all $j \in S$, $\pi_j = (E_j T_j)^{-1}$,

(ii) for all $i, j \in S$,
$$\frac{1}{n+1} \sum_{k=0}^{n} p_{ij}^{(k)} \to \pi_j \quad \text{as } n \to \infty,$$

(iii) for any initial distribution and any $j \in S$,
$$\frac{1}{n+1} \sum_{k=0}^{n} \delta_{X_k j} \to \pi_j \quad \text{w.p. 1},$$

(iv) if $\sum_{j \in S} |h(j)| \pi_j < \infty$, then
$$\frac{1}{n+1} \sum_{k=0}^{n} h(X_k) \to \sum_{j \in S} h(j) \pi_j \quad \text{w.p. 1},$$


(v) if, in addition, $d_i = 1$ for some $i \in S$, then $d_j = 1$ for all $j \in S$ and for all $i, j$,
$$p_{ij}^{(n)} \to \pi_j \quad \text{as } n \to \infty.$$
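The difference between convergence in the average sense, as in part (d)(ii), and full convergence, as in part (d)(v), can be seen on the period-2 chain of Example 14.1.8. The sketch below is an illustration only: it computes $p_{00}^{(n)}$ exactly by repeated matrix multiplication and, as a hypothetical aperiodic comparison that is not in the text, also for the "lazy" chain $(I + P)/2$, which has the same stationary distribution.

```python
from fractions import Fraction


def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][l] * B[l][j] for l in range(n)) for j in range(n)] for i in range(n)]


def n_step_probabilities(P, i, j, n_max):
    """Return [p_ij^(n) for n = 0, ..., n_max] by repeated multiplication."""
    n = len(P)
    power = [[Fraction(int(r == c)) for c in range(n)] for r in range(n)]  # P^0 = identity
    out = []
    for _ in range(n_max + 1):
        out.append(power[i][j])
        power = mat_mul(power, P)
    return out


half = Fraction(1, 2)
P_periodic = [[0, 1, 0], [half, 0, half], [0, 1, 0]]  # Example 14.1.8, period 2
# Hypothetical aperiodic modification (I + P) / 2: stay put with probability 1/2.
P_lazy = [[half * (int(r == c) + P_periodic[r][c]) for c in range(3)] for r in range(3)]

N = 40
periodic = n_step_probabilities(P_periodic, 0, 0, N)
lazy = n_step_probabilities(P_lazy, 0, 0, N)
print([str(p) for p in periodic[:7]])     # alternates between 0 and 1/2 after n = 0
print(float(sum(periodic) / (N + 1)))     # Cesaro average, close to pi_0 = 1/4
print(float(lazy[N]))                     # the aperiodic chain converges without averaging
```

For the periodic chain only the averages approach $\pi_0 = \frac14$, while for the lazy chain the $n$-step probabilities themselves do.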

14.2 Markov chains on a general state space

14.2.1 Basic definitions

Let $\{X_n\}_{n \ge 0}$ be a sequence of random variables with values in some space $S$ that is not necessarily finite or countable. The Markov property says that conditioned on $X_n, X_{n-1}, \ldots, X_0$, the distribution of $X_{n+1}$ depends only on $X_n$ and not on the past, i.e., $\{X_j : j \le n - 1\}$. When $S$ is not countable, to make this notion of Markov property precise, one needs the following setup.

Let $(S, \mathcal{S})$ be a measurable space. Let $(\Omega, \mathcal{F}, P)$ be a probability space and $\{X_n(\omega)\}_{n \ge 0}$ be a sequence of maps from $\Omega$ to $S$ such that for each $n$, $X_n$ is $(\mathcal{F}, \mathcal{S})$-measurable. Let $\mathcal{F}_n \equiv \sigma\{X_j : 0 \le j \le n\}$ be the sub-σ-algebra of $\mathcal{F}$ generated by $\{X_j : 0 \le j \le n\}$. In what follows, for any sub-σ-algebra $\mathcal{Y}$ of $\mathcal{F}$, let $P(\cdot \mid \mathcal{Y})$ denote the conditional probability given $\mathcal{Y}$, as defined in Chapter 12.

Definition 14.2.1: The sequence $\{X_n\}_{n \ge 0}$ is a Markov chain if for all $A \in \mathcal{S}$,
$$P(X_{n+1} \in A \mid \mathcal{F}_n) = P(X_{n+1} \in A \mid \sigma\langle X_n \rangle) \quad \text{w.p. 1, for all } n \ge 0, \tag{2.1}$$
for any initial distribution of $X_0$, where $\sigma\langle X_n \rangle$ is the sub-σ-algebra of $\mathcal{F}$ generated by $X_n$.

It is easy to verify that (2.1) holds for all $A \in \mathcal{S}$ iff for any bounded measurable $h$ from $(S, \mathcal{S})$ to $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$,
$$E\big(h(X_{n+1}) \mid \mathcal{F}_n\big) = E\big(h(X_{n+1}) \mid \sigma\langle X_n \rangle\big) \quad \text{w.p. 1 for all } n \ge 0 \tag{2.2}$$
for any initial distribution of $X_0$.

Another equivalent formulation, which makes the Markov property symmetric with respect to time, says that given the present, the past and future are independent.

Proposition 14.2.1: A sequence of random variables $\{X_n\}_{n \ge 0}$ satisfies (2.1) iff for any $\{A_j\}_{0}^{n+k} \subset \mathcal{S}$,
$$\begin{aligned}
&P\big(X_j \in A_j,\ j = 0, 1, 2, \ldots, n-1, n+1, \ldots, n+k \mid \sigma\langle X_n \rangle\big) \\
&\quad = P\big(X_j \in A_j,\ j = 0, 1, 2, \ldots, n-1 \mid \sigma\langle X_n \rangle\big) \times P\big(X_j \in A_j,\ j = n+1, \ldots, n+k \mid \sigma\langle X_n \rangle\big)
\end{aligned}$$


w.p. 1.

The proof is somewhat involved but not difficult. The countable state space case is easier (Problem 14.1). An important tool for studying Markov chains is the notion of a transition probability function.

Definition 14.2.2: A function $P : S \times \mathcal{S} \to [0, 1]$ is called a transition probability function on $S$ if

(i) for all $x$ in $S$, $P(x, \cdot)$ is a probability measure on $(S, \mathcal{S})$,

(ii) for all $A \in \mathcal{S}$, $P(\cdot, A)$ is an $\mathcal{S}$-measurable function from $(S, \mathcal{S}) \to [0, 1]$.

Under some general conditions guaranteeing the existence of regular conditional probabilities, the right side of (2.1) can be expressed as $P_n(X_n, A)$, where $P_n(\cdot, \cdot)$ is a transition probability function on $S$. In such a case, yet another formulation of the Markov property is in terms of the joint distributions of $\{X_0, X_1, \ldots, X_n\}$ for any finite $n$.

Proposition 14.2.2: A sequence $\{X_n\}_{n \ge 0}$ satisfies (2.1) iff for any $n \in \mathbb{N}$ and $A_0, A_1, \ldots, A_n \in \mathcal{S}$,
$$P(X_j \in A_j,\ j = 0, 1, 2, \ldots, n) = \int_{A_0} \cdots \int_{A_{n-2}} \int_{A_{n-1}} P_{n-1}(x_{n-1}, A_n)\, P_{n-2}(x_{n-2}, dx_{n-1}) \cdots P_0(x_0, dx_1)\, \mu_0(dx_0),$$
where $\mu_0(A) = P(X_0 \in A)$, $A \in \mathcal{S}$.

The proof is by induction and left as an exercise (Problem 14.16). In what follows, it will be assumed that such transition functions exist.

Definition 14.2.3: A sequence of $S$-valued random variables $\{X_n\}_{n \ge 0}$ is called a Markov chain with transition function $P(\cdot, \cdot)$ if (2.1) holds and the right side equals $P(X_n, A)$ for all $n \in \mathbb{N}$.

14.2.2 Examples

Example 14.2.1: (IID sequence). Let $\{X_n\}_{n \ge 0}$ be iid $S$-valued random variables with distribution $\mu$. Then $\{X_n\}_{n \ge 0}$ is a Markov chain with transition function $P(x, A) \equiv \mu(A)$ and initial distribution $\mu$.

Example 14.2.2: ((Additive) random walk in $\mathbb{R}^k$). Let $\{\eta_j\}_{j \ge 1}$ be iid $\mathbb{R}^k$-valued random variables with distribution $\nu$. Let $X_0$ be an $\mathbb{R}^k$-valued random variable independent of $\{\eta_j\}_{j \ge 1}$ and with distribution $\mu$. Let
$$X_{n+1} = X_n + \eta_{n+1} = X_0 + \sum_{j=1}^{n+1} \eta_j, \quad n \ge 0.$$
Then $\{X_n\}_{n \ge 0}$ is an $\mathbb{R}^k$-valued Markov chain with transition function $P(x, A) \equiv \nu(A - x)$ and initial distribution $\mu$.

Example 14.2.3: (Multiplicative random walk on $\mathbb{R}_+$). Let $\{\eta_n\}_{n \ge 1}$ be iid nonnegative random variables with distribution $\nu$ and $X_0$ be a nonnegative random variable with distribution $\mu$ and independent of $\{\eta_n\}_{n \ge 1}$. Let $X_{n+1} = X_n \eta_{n+1}$, $n \ge 0$. Then $\{X_n\}_{n \ge 0}$ is a Markov chain with state space $\mathbb{R}_+$ and transition function
$$P(x, A) = \begin{cases} \nu(x^{-1} A), & \text{if } x > 0, \\ I_A(0), & \text{if } x = 0, \end{cases}$$
and initial distribution $\mu$. This is a model for the value of a stock portfolio subject to random growth rates. Clearly, the above iteration scheme leads to $X_n = X_0 \cdot \prod_{i=1}^{n} \eta_i$, which when normalized appropriately leads to what is known as the geometric Brownian motion model in the financial mathematics literature.

Example 14.2.4: (AR(1) time series). Let $\rho \in \mathbb{R}$ and $\nu$ be a probability measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Let $\{\eta_j\}_{j \ge 1}$ be iid with distribution $\nu$ and $X_0$ be a random variable independent of $\{\eta_j\}_{j \ge 1}$ and with distribution $\mu$. Let $X_{n+1} = \rho X_n + \eta_{n+1}$, $n \ge 0$. Then $\{X_n\}_{n \ge 0}$ is an $\mathbb{R}$-valued Markov chain with transition function $P(x, A) \equiv \nu(A - \rho x)$ and initial distribution $\mu$.

Example 14.2.5: (Random AR(1) vector time series). Let $\{(A_i, b_i)\}_{i \ge 1}$ be iid such that $A_i$ is a $k \times k$ matrix and $b_i$ is a $k \times 1$ vector. Let $\mu$ be a probability measure on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$. Let $X_0$ be an $\mathbb{R}^k$-valued random variable independent of $\{(A_i, b_i)\}_{i \ge 1}$ and with distribution $\mu$. Let $X_{n+1} = A_{n+1} X_n + b_{n+1}$, $n \ge 0$. Then $\{X_n\}_{n \ge 0}$ is an $\mathbb{R}^k$-valued Markov chain with transition function $P(x, B) \equiv P(A_1 x + b_1 \in B)$ and initial distribution $\mu$.

Example 14.2.6: (Waiting time chain). Let $\{\eta_i\}_{i \ge 1}$ be iid real valued random variables with distribution $\nu$ and $X_0$ be independent of $\{\eta_i\}_{i \ge 1}$ with distribution $\mu$. Let
$$X_{n+1} = \max\{X_n + \eta_{n+1},\ 0\}.$$
Then $\{X_n\}_{n \ge 0}$ is a nonnegative valued Markov chain with transition function $P(x, A) \equiv P(\max\{x + \eta_1, 0\} \in A)$ and initial distribution $\mu$. In the


queuing theory context, if $\eta_n$ represents the difference between the $n$th interarrival time and service time, then $X_n$ represents the waiting time at the $n$th arrival.

All the above are special cases of the following:

Example 14.2.7: (Iterated function system (IFS)). Let $(S, \mathcal{S})$ be a measurable space. Let $(\Omega, \mathcal{F}, P)$ be a probability space. Let $\{f_i(x, \omega)\}_{i \ge 1}$ be such that for each $i$, $f_i : S \times \Omega \to S$ is $(\mathcal{S} \times \mathcal{F}, \mathcal{S})$-measurable and the stochastic processes $\{f_i(\cdot, \omega)\}_{i \ge 1}$ are iid. Let $X_0$ be an $S$-valued random variable on $(\Omega, \mathcal{F}, P)$ with distribution $\mu$ and independent of $\{f_i(\cdot, \cdot)\}_{i \ge 1}$. Let $X_{n+1}(\omega) = f_{n+1}(X_n(\omega), \omega)$, $n \ge 0$. Then $\{X_n\}_{n \ge 0}$ is an $S$-valued Markov chain with transition function $P(x, A) \equiv P(f_1(x, \omega) \in A)$ and initial distribution $\mu$.

It turns out that when $S$ is a Polish space with $\mathcal{S}$ as the Borel σ-algebra and $P(\cdot, \cdot)$ is a transition function on $S$, any $S$-valued Markov chain $\{X_n\}_{n \ge 0}$ with transition function $P(\cdot, \cdot)$ can be generated by an IFS as in Example 14.2.7. For a proof, see Kifer (1988) and Athreya and Stenflo (2003).

When $\{f_i\}_{i \ge 1}$ are iid such that $f_1$ has only finitely many choices $\{h_j\}_{j=1}^{k}$, where each $h_j$ is an affine contraction on $\mathbb{R}^p$, then the Markov chain $\{X_n\}$ converges in distribution to some $\pi(\cdot)$. Further, the limit point set of $\{X_n\}$ coincides w.p. 1 with the support $M$ of the limit distribution $\pi(\cdot)$. This has been exploited by Barnsley and others to solve the inverse problem: given a compact set $M$ in $\mathbb{R}^p$, find an IFS $\{h_j\}_{j=1}^{k}$ of affine contractions so that by generating the Markov chain $\{X_n\}$, one can get an approximate picture of $M$. This is called data compression or image generation by Markov chain Monte Carlo. See Barnsley (1992) for details on this. More generally, when $\{f_i\}$ are Lipschitz maps, the following holds.

Theorem 14.2.3: Let $\{f_i(\cdot, \cdot)\}_{i \ge 1}$ be iid Lipschitz maps on $S$. Assume

(i) $E|\log s(f_1)| < \infty$ and $E \log s(f_1) < 0$, where
$$s(f_1) = \sup_{x \ne y} \frac{d\big(f_1(x, \omega), f_1(y, \omega)\big)}{d(x, y)}$$
and $d(\cdot, \cdot)$ is the metric on $S$, and

(ii) for some $x_0$, $E\big(\log d(f_1(x_0, \omega), x_0)\big)^+ < \infty$.

Then,

(i) $\hat{X}_n(x, \omega) \equiv f_1\big(f_2(\ldots f_{n-1}(f_n(x, \omega), \omega) \ldots)\big)$ converges w.p. 1 to a random variable $\hat{X}(\omega)$ that is independent of $x$,

(ii) for all $x$, $X_n(x, \omega) \equiv f_n\big(f_{n-1} \ldots (f_1(x, \omega), \omega) \ldots\big)$ converges in distribution to $\hat{X}(\omega)$.

That (ii) is a consequence of (i) is clear since for each $n$, $x$, $X_n(x, \omega)$ and $\hat{X}_n(x, \omega)$ are identically distributed. The proof of (i) involves showing that $\{\hat{X}_n\}$ is a Cauchy sequence in $S$ w.p. 1 (Problem 14.17).
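The contrast between the backward iterates $\hat{X}_n$ and the forward iterates $X_n$ in Theorem 14.2.3 can be seen numerically. The following sketch is an illustration under stated assumptions, not the book's construction: it takes the affine maps $f_i(x) = \rho x + \eta_i$ of the AR(1) example with $\rho = \frac12$ and $\eta_i$ uniform on $(-1, 1)$, so that $s(f_i) = \rho < 1$; the seed and all names are hypothetical.

```python
import random


def forward(maps, x):
    """X_n(x) = f_n(f_{n-1}(... f_1(x) ...)): apply f_1 first."""
    for f in maps:
        x = f(x)
    return x


def backward(maps, x):
    """X-hat_n(x) = f_1(f_2(... f_n(x) ...)): apply f_n first."""
    for f in reversed(maps):
        x = f(x)
    return x


def make_affine(rho, eta):
    """Return the map x -> rho * x + eta."""
    return lambda x: rho * x + eta


rng = random.Random(1)
rho = 0.5  # Lipschitz constant of every map, so E log s(f_1) = log(0.5) < 0
maps = [make_affine(rho, rng.uniform(-1.0, 1.0)) for _ in range(60)]

back_50_a = backward(maps[:50], 0.0)
back_60_a = backward(maps[:60], 0.0)
back_60_b = backward(maps[:60], 10.0)
fwd_50 = forward(maps[:50], 0.0)
fwd_60 = forward(maps[:60], 0.0)
print(back_50_a, back_60_a, back_60_b)  # backward iterates settle down, whatever the start
print(fwd_50, fwd_60)                   # forward iterates keep moving along the path
```

For these maps $\hat{X}_n(x) = \sum_{j=1}^{n} \rho^{j-1} \eta_j + \rho^n x$, so the backward iterates form a Cauchy sequence along each sample path, while the forward iterates converge only in distribution.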

14.2.3 Chapman-Kolmogorov equations

Let $P(\cdot, \cdot)$ be a transition function on $(S, \mathcal{S})$. Define a sequence of functions $\{P^{(n)}(\cdot, \cdot)\}_{n \ge 0}$ by the iteration scheme
$$P^{(n+1)}(x, A) = \int_S P^{(n)}(y, A)\, P(x, dy), \quad n \ge 0, \tag{2.3}$$
where $P^{(0)}(x, A) \equiv I_A(x)$. It can be verified by induction that for each $n$, $P^{(n)}(\cdot, \cdot)$ is a transition probability function.

Definition 14.2.4: $P^{(n)}(\cdot, \cdot)$ defined by (2.3) is called the $n$-step transition function generated by $P(\cdot, \cdot)$.

It is easy to show by induction that if $X_0 = x$ w.p. 1, then
$$P(X_n \in A) = P^{(n)}(x, A) \quad \text{for all } n \ge 0 \tag{2.4}$$
(Problem 14.18). This leads to the Chapman-Kolmogorov equations.

Proposition 14.2.4: Let $P(\cdot, \cdot)$ be a transition probability function on $(S, \mathcal{S})$ and let $P^{(n)}(\cdot, \cdot)$ be defined by (2.3). Then for any $n, m \ge 0$,
$$P^{(n+m)}(x, A) = \int_S P^{(n)}(y, A)\, P^{(m)}(x, dy). \tag{2.5}$$

Proof: The analytic verification is straightforward by induction. One can verify this probabilistically using the Markov property. Indeed, from (2.4) the left side of (2.5) is
$$\begin{aligned}
P_x(X_{n+m} \in A) &= E_x\big(P(X_{n+m} \in A \mid \mathcal{F}_m)\big) \\
&= E_x\big(P^{(n)}(X_m, A)\big) \quad \text{(by Markov property)} \\
&= \text{right-hand side of (2.5)},
\end{aligned}$$
where $E_x$, $P_x$ denote expectation and probability distribution of $\{X_n\}_{n \ge 0}$ when $X_0 = x$ w.p. 1. □

From the above, one sees that the study of the limit behavior of the distribution of $X_n$ as $n \to \infty$ can be reduced analytically to the study of the $n$-step transition probabilities. This in turn can be done in terms of


the operator $P$ on the Banach space $B(S, \mathbb{R})$ of bounded measurable real valued functions from $S$ to $\mathbb{R}$ (with sup norm), defined by
$$(Ph)(x) \equiv E_x h(X_1) \equiv \int h(y)\, P(x, dy). \tag{2.6}$$
It is easy to verify that $P$ is a positive bounded linear operator on $B(S, \mathbb{R})$ of norm one. The Chapman-Kolmogorov equation (2.4) is equivalent to saying that $E_x h(X_n) = (P^n h)(x)$. Thus, analytically the study of the limit distribution of $\{X_n\}_{n \ge 0}$ can be reduced to that of the sequence $\{P^n\}_{n \ge 0}$ of powers of the operator $P$. However, probabilistic approaches via the notion of Harris irreducibility and recurrence when applicable and via the notion of Feller continuity when $S$ is a Polish space are more fruitful and will be developed below.
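When $S$ is finite, the integrals in (2.3), (2.5), and (2.6) reduce to sums, so the Chapman-Kolmogorov equations become statements about matrix products. The sketch below is a finite-state illustration only; the transition matrix, the function `h`, and the helper names are hypothetical and not taken from the text.

```python
from fractions import Fraction


def mat_mul(A, B):
    """(AB)(x, z) = sum_y A(x, y) B(y, z): the finite-state version of integrating against A(x, dy)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def n_step(P, n):
    """P^(n) built from the recursion (2.3): P^(0) = I and P^(n+1) = P P^(n)."""
    size = len(P)
    result = [[Fraction(int(x == y)) for y in range(size)] for x in range(size)]
    for _ in range(n):
        result = mat_mul(P, result)
    return result


def apply_operator(P, h):
    """(P h)(x) = sum_y h(y) P(x, y): the finite-state version of (2.6)."""
    return [sum(p * hy for p, hy in zip(row, h)) for row in P]


F = Fraction
# Hypothetical three-state transition matrix, chosen only for illustration.
P = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 4), F(1, 2), F(1, 4)],
     [F(0), F(1, 2), F(1, 2)]]
h = [F(1), F(-2), F(5)]  # an arbitrary bounded function on S = {0, 1, 2}

lhs = n_step(P, 5)                        # P^(n+m) with n = 3, m = 2
rhs = mat_mul(n_step(P, 2), n_step(P, 3)) # sum_y P^(m)(x, y) P^(n)(y, A)
iterated = h
for _ in range(5):
    iterated = apply_operator(P, iterated)  # P^5 h, one application at a time
print(lhs == rhs, iterated == apply_operator(lhs, h))
```

The second comparison checks that applying the operator five times to `h` agrees with integrating `h` against the five-step transition function, which is the finite-state form of $E_x h(X_n) = (P^n h)(x)$.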

14.2.4 Harris irreducibility, recurrence, and minorization

14.2.4.1 Definition of irreducibility

Recall that a Markov chain $\{X_n\}_{n \ge 0}$ with a discrete state space $S$ and transition probability matrix $P \equiv ((p_{ij}))$ is irreducible if for every $i$, $j$ in $S$, $i$ leads to $j$, i.e.,
$$P(X_n = j \text{ for some } n \ge 1 \mid X_0 = i) \equiv P_i(T_j < \infty) \equiv f_{ij} > 0,$$
where $T_j = \min\{n : n \ge 1,\ X_n = j\}$ is the time of first visit to $j$. To generalize this to the case of general state spaces, one starts with the notion of first entrance time or hitting time (also called the first passage time).

Definition 14.2.5: For any $A \in \mathcal{S}$, the first entrance time to $A$ is defined as
$$T_A \equiv \begin{cases} \min\{n : n \ge 1,\ X_n \in A\}, \\ \infty \quad \text{if } X_n \notin A \text{ for any } n \ge 1. \end{cases}$$
Since the event $\{T_A = 1\} = \{X_1 \in A\}$ and for $k \ge 2$, $\{T_A = k\} = \{X_1 \notin A,\ X_2 \notin A, \ldots, X_{k-1} \notin A,\ X_k \in A\}$ is an element of $\mathcal{F}_k = \sigma\{X_j : j \le k\}$, $T_A$ is a stopping time w.r.t. the filtration $\{\mathcal{F}_n\}_{n \ge 1}$ (cf. Chapter 13).

Definition 14.2.6: Let $\phi$ be a nonzero σ-finite measure on $(S, \mathcal{S})$. The Markov chain $\{X_n\}_{n \ge 0}$ (or equivalently, its transition function $P(\cdot, \cdot)$) is said to be $\phi$-irreducible (or irreducible in the sense of Harris with reference measure $\phi$) if for any $A$ in $\mathcal{S}$,
$$\phi(A) > 0 \Rightarrow L(x, A) \equiv P_x(T_A < \infty) > 0 \tag{2.7}$$
for all $x$ in $S$. This says that if a set $A$ in $\mathcal{S}$ is considered important by the measure $\phi$ (i.e., $\phi(A) > 0$), then it is also considered important by the chain $\{X_n\}_{n \ge 0}$ starting from


any $x$ in $S$. If $G(x, A) \equiv \sum_{n=1}^{\infty} P^n(x, A)$ is the Green's function associated with $P$, then (2.7) is equivalent to
$$\phi(A) > 0 \Rightarrow G(x, A) > 0 \quad \text{for all } x \in S, \tag{2.8}$$
i.e., $\phi(\cdot)$ is dominated by $G(x, \cdot)$ for all $x$ in $S$.

14.2.4.2 Examples

Example 14.2.7: If $S$ is countable and $\phi$ is the counting measure on $S$, then the irreducibility of a Markov chain $\{X_n\}_{n \ge 0}$ with state space $S$ is the same as $\phi$-irreducibility.

Example 14.2.8: If $\{X_n\}_{n \ge 0}$ are iid with distribution $\nu$, then it is $\nu$-irreducible.

Example 14.2.9: It can be verified that the random walk with jump distribution $\nu$ (Example 14.2.2) is $\phi$-irreducible with the Lebesgue measure as $\phi$ if $\nu$ has a nonzero absolutely continuous component with a positive density on some open interval (Problem 14.19 (a)).

Example 14.2.10: The AR(1) with $\eta_1$ having a nontrivial absolutely continuous component can be shown to be $\phi$-irreducible for some $\phi$. On the other hand the AR(1) chain
$$X_{n+1} = \frac{X_n}{2} + \frac{\eta_{n+1}}{2}, \quad n \ge 0,$$
where $\{\eta_n\}_{n \ge 1}$ are iid Bernoulli($\frac{1}{2}$) random variables, is not $\phi$-irreducible for any $\phi$. In general, if $\{X_n\}_{n \ge 0}$ is a Markov chain that has a discrete distribution for each $n$ and has a limit distribution that is nonatomic, then $\{X_n\}_{n \ge 0}$ cannot be $\phi$-irreducible for any $\phi$ (Problem 14.19 (b)). The waiting time chain (Example 14.2.6) is irreducible w.r.t. $\phi \equiv \delta_0$, the delta measure at 0, if $P(\eta_1 < 0) > 0$ (Problem 14.20).

It can be shown that if $\{X_n\}_{n \ge 0}$ is $\phi$-irreducible for some σ-finite $\phi$, then there exists a probability measure $\psi$ such that $\{X_n\}_{n \ge 0}$ is $\psi$-irreducible and it is maximal in the sense that if $\{X_n\}_{n \ge 0}$ is $\tilde{\phi}$-irreducible for some $\tilde{\phi}$, then $\tilde{\phi}$ is dominated by $\psi$. See Nummelin (1984).

14.2.4.3 Harris recurrence

A Markov chain $\{X_n\}_{n \ge 0}$ that is Harris irreducible with reference measure $\phi$ is said to be Harris recurrent if
$$A \in \mathcal{S},\ \phi(A) > 0 \Rightarrow P_x(T_A < \infty) = 1 \quad \text{for all } x \text{ in } S. \tag{2.9}$$
Recall that irreducibility requires only that $P_x(T_A < \infty)$ be $> 0$. When $S$ is countable and $\phi$ is the counting measure, this reduces to the usual notion of irreducibility and recurrence. If $S$ is not countable but has


a singleton ∆ such that Px (T∆ < ∞) = 1 for all x in S, then the chain {Xn }n≥0 is Harris recurrent with respect to the measure φ(·) ≡ δ∆ (·), the delta measure at ∆. The waiting time chain (Example 14.2.6) has such a ∆ in 0 if Eη1 < 0 (Problem 14.20). If such a recurrent singleton ∆ exists, then the sample paths of {Xn }n≥0 can be broken into iid excursions by looking at the chain between consecutive returns to ∆. This in turn will allow a complete extension of the basic limit theory from the countable case to this special case. In general, such a singleton may not exist. For example, for the AR(1) sequence with {ηi }i≥1 having a continuous distribution, Px (Xn = x for some n ≥ 1) = 0 for all x. However, it turns out that for Harris recurrent chains, such a singleton can be constructed via the regeneration theorem below established independently by Athreya and Ney (1978) and Nummelin (1978).
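The role of a recurrent singleton can be illustrated with the waiting time chain of Example 14.2.6. The sketch below is not from the text: it assumes, purely for illustration, that $\eta_n$ is uniform on $(-1.5, 0.5)$, so that $E\eta_1 = -0.5 < 0$, and it splits a simulated path at its successive visits to $\Delta = 0$; the seed and function names are hypothetical.

```python
import random


def waiting_time_chain(n_steps, rng, x0=0.0):
    """Simulate X_{n+1} = max(X_n + eta_{n+1}, 0) with eta uniform on (-1.5, 0.5), so E(eta) = -0.5."""
    path = [x0]
    for _ in range(n_steps):
        path.append(max(path[-1] + rng.uniform(-1.5, 0.5), 0.0))
    return path


def excursion_lengths(path):
    """Lengths of the stretches between successive visits to the state 0 (visits at times n >= 1)."""
    visits = [t for t, x in enumerate(path) if t >= 1 and x == 0.0]
    return [b - a for a, b in zip(visits, visits[1:])]


rng = random.Random(2)
path = waiting_time_chain(10_000, rng)
lengths = excursion_lengths(path)
print(len(lengths), sum(lengths) / len(lengths), max(lengths))
```

Each stretch between two consecutive visits to 0 is one excursion, and by the strong Markov property these excursions are iid, so the printed average length is an estimate of the mean return time to 0.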

14.2.5 The minorization theorem

A remarkable result of the subject is that when $\mathcal{S}$ is countably generated, a Harris recurrent chain can be embedded in a chain that has a recurrent singleton. This is achieved via the minorization theorem and the fundamental regeneration theorem below.

Theorem 14.2.5: (The minorization theorem). Let $(S, \mathcal{S})$ be such that $\mathcal{S}$ is countably generated. Let $\{X_n\}_{n \ge 0}$ be a Markov chain with state space $(S, \mathcal{S})$ and transition function $P(\cdot, \cdot)$ such that it is Harris irreducible with reference measure $\phi(\cdot)$. Then the following minorization hypothesis holds.

(i) (Hypothesis M). For every $B_0 \in \mathcal{S}$ such that $\phi(B_0) > 0$, there exists a set $A_0 \subset B_0$, an integer $n_0 \ge 1$, a constant $0 < \alpha < 1$, and a probability measure $\nu$ on $(S, \mathcal{S})$ such that $\phi(A_0) > 0$ and for all $x$ in $A_0$,
$$P^{n_0}(x, A) \ge \alpha \nu(A) \quad \text{for all } A \in \mathcal{S}.$$

(ii) (The C-set lemma). For any set $B_0 \in \mathcal{S}$ such that $\phi(B_0) > 0$, there exists a set $A_0 \subset B_0$, an $n_0 \ge 1$, a constant $0 < \alpha < 1$ such that for $x$, $y$ in $A_0$,
$$p^{n_0}(x, y) \ge \alpha,$$
where $p^{n_0}(x, \cdot)$ is the Radon-Nikodym derivative of the absolutely continuous component of $P^{n_0}(x, \cdot)$ w.r.t. $\phi(\cdot)$.

The proof of the C-set lemma is a nice application of the martingale convergence theorem (see Orey (1971)). The proof of Theorem 14.2.5 (i) using the C-set lemma is easy and is left as an exercise (Problem 14.21).

14.2.6 The fundamental regeneration theorem

Theorem 14.2.6: Let $\{X_n\}_{n \ge 0}$ be a Markov chain with state space $(S, \mathcal{S})$ and transition function $P(\cdot, \cdot)$. Suppose there exist a set $A_0 \in \mathcal{S}$, a constant $0 < \alpha < 1$, and a probability measure $\nu(\cdot)$ on $(S, \mathcal{S})$ such that for all $x$ in $A_0$,
\[ P(x, A) \ge \alpha \nu(A) \quad \text{for all } A \in \mathcal{S}. \tag{2.10} \]
Suppose, in addition, that for all $x$ in $S$,
\[ P_x(T_{A_0} < \infty) = 1. \tag{2.11} \]
Then, for any initial distribution $\mu$, there exists a sequence of random times $\{T_i\}_{i \ge 1}$ such that under $P_\mu$, the sequence of excursions
\[ \eta_j \equiv \{X_{T_j + r},\ 0 \le r < T_{j+1} - T_j,\ T_{j+1} - T_j\}, \quad j = 1, 2, \ldots \]
are iid with $X_{T_j} \sim \nu(\cdot)$.

Proof: For $x$ in $A_0$, let
\[ Q(x, \cdot) = \frac{P(x, \cdot) - \alpha \nu(\cdot)}{1 - \alpha}. \tag{2.12} \]

Then, (2.10) implies that for $x$ in $A_0$, $Q(x, \cdot)$ is a probability measure on $(S, \mathcal{S})$. For each $x$ in $A_0$ and $n \ge 0$, let $\eta_{n+1}$, $\delta_{n+1}$ and $Y_{n+1,x}$ be independent random variables such that $P(\eta_{n+1} \in \cdot) = \nu(\cdot)$, $\delta_{n+1}$ is Bernoulli($\alpha$), and $P(Y_{n+1,x} \in \cdot) = Q(x, \cdot)$. Then given $X_n = x$ in $A_0$, $X_{n+1}$ can be chosen to be
\[ X_{n+1} = \begin{cases} \eta_{n+1} & \text{if } \delta_{n+1} = 1 \\ Y_{n+1,x} & \text{if } \delta_{n+1} = 0 \end{cases} \tag{2.13} \]
to ensure that $X_{n+1}$ has distribution $P(x, \cdot)$. Indeed, for $x$ in $A_0$,
\[ P(\eta_{n+1} \in \cdot,\ \delta_{n+1} = 1) + P(Y_{n+1,x} \in \cdot,\ \delta_{n+1} = 0) = \alpha \nu(\cdot) + (1 - \alpha) Q(x, \cdot) = P(x, \cdot). \]
Thus, each time the chain enters $A_0$, there is a probability $\alpha$ that the position next time will have distribution $\nu(\cdot)$, independent of $x \in A_0$ as well as of all past history, i.e., that of starting afresh with distribution $\nu(\cdot)$.

Now if (2.11) also holds, then for any $x$ in $S$, by the Markov property, the chain enters $A_0$ infinitely often w.p. 1 ($P_x$). Let $\tau_1 < \tau_2 < \tau_3 < \cdots$ denote the times of successive visits to $A_0$. Since $X_{\tau_i} \in A_0$, by the above construction (cf. (2.13)), there is a probability $\alpha > 0$ that $X_{\tau_i + 1}$ will be distributed as $\nu(\cdot)$, completely independent of $X_j$ for $j \le \tau_i$ and $\tau_i$. By comparison with coin tossing, this implies that for any $x$, w.p. 1 ($P_x$), there is a finite index $i_0$ such that $\delta_{\tau_{i_0} + 1} = 1$ and hence $X_{\tau_{i_0} + 1}$ will be distributed as $\nu(\cdot)$, independent of all history of the chain $\{X_n\}_{n \ge 0}$ including that of the $\delta_{\tau_i + 1}$ and $Y_{\tau_i + 1}$, $i \le i_0$, up to the time $\tau_{i_0}$. That is, at $\tau_{i_0} + 1$ the chain starts afresh with distribution $\nu(\cdot)$. Thus,


it follows that for any initial distribution $\mu$, there is a random time $T$ such that $X_T$ is distributed as $\nu(\cdot)$ and is independent of all history up to $T - 1$. More precisely, for any $\mu$, $P_\mu(T < \infty) = 1$ and
\[ P_\mu(X_j \in A_j,\ j = 0, 1, \ldots, n + k,\ T = n + 1) = P_\mu(X_j \in A_j,\ j = 0, 1, \ldots, n,\ T = n + 1) \times P_\nu(X_0 \in A_{n+1}, X_1 \in A_{n+2}, \ldots, X_{k-1} \in A_{n+k}). \]
Since this is true for any $\mu$, it is true for $\mu = \nu$ and hence the theorem follows. □

A consequence of the above theorem is the following.

Theorem 14.2.7: Suppose in Theorem 14.2.6, instead of (2.10) and (2.11), the following holds. There exists an $n_0 \ge 1$ such that for all $x$ in $A_0$, $A$ in $\mathcal{S}$,
\[ P^{n_0}(x, A) \ge \alpha \nu(A) \tag{2.14} \]
and for all $x$ in $A_0$,
\[ P_x(X_{n n_0} \in A_0 \text{ for some } n \ge 1) = 1 \tag{2.15} \]
and
\[ P_x(T_{A_0} < \infty) = 1 \quad \text{for all } x \text{ in } S. \tag{2.16} \]
Let $Y_n \equiv X_{n n_0}$, $n \ge 0$ (where $n n_0$ stands for the product of $n$ and $n_0$). Then, for any initial distribution $\mu$, there exist random times $\{T_i\}_{i \ge 1}$ such that under $P_\mu$, the sequence
\[ \eta_j \equiv \{Y_{T_j + r},\ 0 \le r < T_{j+1} - T_j,\ T_{j+1} - T_j\}, \quad j = 1, 2, \ldots \]
are iid with $Y_{T_j} \sim \nu(\cdot)$.

Proof: For any initial distribution $\mu$ such that $\mu(A_0) = 1$, the theorem follows from Theorem 14.2.6 since (2.14) and (2.15) are the same as (2.10) and (2.11) for the transition function $P^{n_0}(\cdot, \cdot)$ and the chain $\{Y_n\}_{n \ge 0}$. By (2.16), for any other $\mu$, $P_\mu(T_{A_0} < \infty) = 1$. □

Given a realization of the Markov chain $\{Y_n \equiv X_{n n_0}\}_{n \ge 0}$, it is possible to construct a realization of the full Markov chain $\{X_n\}_{n \ge 0}$ by "filling the gaps" $\{X_j : k n_0 + 1 \le j \le (k + 1) n_0 - 1\}$ as follows: Given $X_{k n_0} = x$, $X_{(k+1) n_0} = y$, generate an observation from the conditional distribution of $(X_1, X_2, \ldots, X_{n_0 - 1})$, given $X_0 = x$, $X_{n_0} = y$. This leads to the following.

Theorem 14.2.8: Under the setup of Theorem 14.2.7, the "excursions"
\[ \Big\{ \tilde\eta_j \equiv \big\{ X_{n_0 T_j + k} : 0 \le k < n_0 (T_{j+1} - T_j),\ T_{j+1} - T_j \big\} \Big\}_{j=1}^{\infty} \]
are identically distributed and are one-dependent, i.e., for each $r \ge 1$, the collections $\{\tilde\eta_1, \tilde\eta_2, \ldots, \tilde\eta_r\}$ and $\{\tilde\eta_{r+2}, \tilde\eta_{r+3}, \ldots\}$ are independent.


Proof: Note that applying the regeneration method to the sequence $\{Y_n\}_{n \ge 0}$ and then "filling the gaps" leads to the common portion $\{X_{(T_j - 1) n_0 + r} : 0 \le r \le n_0\}$, generated given the values $X_{(T_j - 1) n_0}$ and $X_{T_j n_0}$. This makes two successive excursions $\tilde\eta_{j-1}$ and $\tilde\eta_j$ dependent. But the Markov property renders $\tilde\eta_j$ and $\tilde\eta_{j+2}$ independent. This yields the one-dependence of $\{\tilde\eta_j\}_{j \ge 1}$. □

By the C-set lemma and the minorization Theorem 14.2.5, φ-recurrence yields the hypothesis of Theorem 14.2.7.

Theorem 14.2.9: Let $\{X_n\}_{n \ge 0}$ be a φ-recurrent Markov chain with state space $(S, \mathcal{S})$, where $\mathcal{S}$ is countably generated. Then there exist an $A_0$ in $\mathcal{S}$, $n_0 \ge 1$, $0 < \alpha < 1$ and a probability measure $\nu$ such that (2.14), (2.15), and (2.16) hold and hence, the conclusions of Theorem 14.2.8 hold.

Thus, φ-recurrence implies that the Markov chain $\{X_n\}_{n \ge 0}$ is regenerative (defined fully below). This makes the law of large numbers for iid random variables available to such chains. The limit theory of regenerative sequences developed in Section 8.5 is reviewed below and, by the above results, such a theory will hold for φ-recurrent chains.
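The splitting construction (2.12)-(2.13) in the proof of Theorem 14.2.6 can be sketched on a finite state space. The following is only an illustration under stated assumptions: the 3-state matrix `P`, the choice $A_0 = S$, and the helper names `minorize`, `residual_kernel`, and `split_chain` are hypothetical and not from the text. Here $\alpha$ is taken as the sum of the column minima of `P` and $\nu$ as the normalized column minima, which is one simple way to obtain (2.10) when every row is strictly positive.

```python
import random

# Hypothetical 3-state transition matrix; row x is P(x, .).
P = [
    [0.5, 0.3, 0.2],
    [0.2, 0.5, 0.3],
    [0.3, 0.2, 0.5],
]

def minorize(P):
    """Return (alpha, nu) with P[x][y] >= alpha * nu[y] for all x, taking A0 = S."""
    col_min = [min(row[y] for row in P) for y in range(len(P))]
    alpha = sum(col_min)
    nu = [m / alpha for m in col_min]
    return alpha, nu

def residual_kernel(P, alpha, nu):
    """Q(x, .) = (P(x, .) - alpha * nu(.)) / (1 - alpha), as in (2.12)."""
    # max(0.0, ...) only clips floating-point round-off below zero.
    return [[max(0.0, (p - alpha * v) / (1.0 - alpha)) for p, v in zip(row, nu)]
            for row in P]

def draw(weights, rng):
    """Sample an index from a probability vector."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

def split_chain(P, x0, n_steps, rng):
    """Simulate X_0..X_n via (2.13) and record the regeneration times."""
    alpha, nu = minorize(P)
    Q = residual_kernel(P, alpha, nu)
    path, regen_times = [x0], []
    for n in range(1, n_steps + 1):
        if rng.random() < alpha:       # delta_n = 1: start afresh from nu
            path.append(draw(nu, rng))
            regen_times.append(n)
        else:                          # delta_n = 0: move according to Q(x, .)
            path.append(draw(Q[path[-1]], rng))
    return path, regen_times

path, regen_times = split_chain(P, x0=0, n_steps=20_000, rng=random.Random(1))
```

Because $A_0 = S$ in this sketch, every step is a potential regeneration, so the gaps between successive entries of `regen_times` are geometric with parameter $\alpha$; with a proper subset $A_0$ one would only flip the $\delta$ coin when the current state lies in $A_0$.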

14.2.7 Limit theory for regenerative sequences

Definition 14.2.7: Let $(\Omega, \mathcal{F}, P)$ be a probability space and $(S, \mathcal{S})$ be a measurable space. A sequence of random variables $\{X_n\}_{n \ge 0}$ defined on $(\Omega, \mathcal{F}, P)$ with values in $(S, \mathcal{S})$ is called regenerative if there exists a sequence of random times $0 < T_1 < T_2 < T_3 < \cdots$ such that the excursions $\eta_j \equiv \{X_n : T_j \le n < T_{j+1},\ T_{j+1} - T_j\}$, $j \ge 1$, are iid, i.e.,
\[ P\big(T_{j+1} - T_j = k_j,\ X_{T_j + \ell} \in A_{\ell,j},\ 0 \le \ell < k_j,\ j = 1, 2, \ldots, r\big) = \prod_{j=1}^{r} P\big(T_2 - T_1 = k_j,\ X_{T_1 + \ell} \in A_{\ell,j},\ 0 \le \ell < k_j\big) \tag{2.17} \]
for all $k_1, k_2, \ldots, k_r \in \mathbb{N}$ and $A_{\ell,j} \in \mathcal{S}$, $0 \le \ell < k_j$, $j = 1, \ldots, r$.

Example 14.2.11: Any Markov chain $\{X_n\}_{n \ge 0}$ with a countable state space $S$ that is irreducible and recurrent is regenerative with $\{T_i\}_{i \ge 1}$ being the times of successive returns to a given state $\Delta$.

Example 14.2.12: Any Harris recurrent chain satisfying the minorization condition (2.10) is regenerative by Theorem 14.2.6.

Example 14.2.13: The waiting time chain (Example 14.2.6) with $E\eta_1 < 0$ is regenerative with $\{T_i\}_{i \ge 1}$ being the times of successive returns of $\{X_n\}_{n \ge 0}$ to zero.


Example 14.2.14: An example of a non-Markov sequence $\{X_n\}_{n \ge 0}$ that is regenerative is a semi-Markov chain. Let $\{y_n\}_{n \ge 0}$ be a Harris recurrent Markov chain satisfying (2.10). Given $\{y_n = a_n\}_{n \ge 0}$, let $\{L_n\}_{n \ge 0}$ be independent positive integer valued random variables. Set
\[ X_j = \begin{cases} y_0 & 0 \le j < L_0 \\ y_1 & L_0 \le j < L_0 + L_1 \\ y_2 & L_0 + L_1 \le j < L_0 + L_1 + L_2 \\ \ \vdots \end{cases} \]
Then $\{X_n\}_{n \ge 0}$ is called a semi-Markov chain with embedded Markov chain $\{y_n\}_{n \ge 0}$ and sojourn times $\{L_n\}_{n \ge 0}$. It is regenerative if $\{T_i\}_{i \ge 1}$ are defined by $T_i = \sum_{j=0}^{N_i - 1} L_j$, where $\{N_i\}_{i \ge 1}$ are the successive regeneration times for $\{y_n\}$ as in Theorem 14.2.7.

Theorem 14.2.10: Let $\{X_n\}_{n \ge 0}$ be a regenerative sequence with regeneration times $\{T_i\}_{i \ge 1}$. Let
\[ \tilde\pi(A) \equiv E\Big( \sum_{j=T_1}^{T_2 - 1} I_A(X_j) \Big) \quad \text{for } A \in \mathcal{S}. \]
Suppose $\tilde\pi(S) \equiv E(T_2 - T_1) < \infty$. Let $\pi(\cdot) = \tilde\pi(\cdot)/\tilde\pi(S)$. Then

(i) $\frac{1}{n} \sum_{j=0}^{n} f(X_j) \to \int f \, d\pi$ w.p. 1 for any $f \in L^1(S, \mathcal{S}, \pi)$.

(ii) $\mu_n(\cdot) \equiv \frac{1}{n} \sum_{j=0}^{n} P(X_j \in \cdot) \to \pi(\cdot)$ in total variation.

(iii) If the distribution of $T_2 - T_1$ is aperiodic, then $P(X_n \in \cdot) \to \pi(\cdot)$ in total variation.

Proof: To prove (i) it suffices to consider nonnegative $f$. For each $n$, let $N_n = k$ if $T_k \le n < T_{k+1}$. Let
\[ Y_i = \sum_{j=T_i}^{T_{i+1} - 1} f(X_j),\ i \ge 1, \quad \text{and} \quad Y_0 = \sum_{j=0}^{T_1 - 1} f(X_j). \]
Then
\[ Y_0 + \sum_{i=1}^{N_n - 1} Y_i \le \sum_{i=0}^{n} f(X_i) \le Y_0 + \sum_{i=1}^{N_n} Y_i. \tag{2.18} \]
By the SLLN, $\frac{1}{N_n} \sum_{i=1}^{N_n - 1} Y_i$ and $\frac{1}{N_n} \sum_{i=1}^{N_n} Y_i$ converge to $EY_1$ w.p. 1 and $\frac{N_n}{n} \to \big(E(T_2 - T_1)\big)^{-1}$. It follows from (2.18) that
\[ \lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n} f(X_i) = \frac{EY_1}{E(T_2 - T_1)} = \frac{\int f \, d\tilde\pi}{\tilde\pi(S)} = \int f \, d\pi, \]


establishing (i). By taking $f = I_A$ and using the BCT, one concludes from (i) that $\mu_n(A) \to \pi(A)$ for every $A$ in $\mathcal{S}$. Since $\mu_n$ and $\pi$ are probability measures, this implies that $\mu_n \to \pi$ in total variation, proving (ii). To prove (iii), note that for any bounded measurable $f$, $a_n \equiv E f(X_{T_1 + n})$ satisfies
\[ \begin{aligned} a_n &= E\big[ f(X_{T_1 + n}) I(T_2 - T_1 > n) \big] + \sum_{r=1}^{n} E\big[ f(X_{T_1 + n}) I(T_2 - T_1 = r) \big] \\ &= b_n + \sum_{r=1}^{n} E\big[ f(X_{T_2 + n - r}) \big] P(T_2 - T_1 = r) \\ &= b_n + \sum_{r=1}^{n} a_{n-r} p_r, \end{aligned} \]
where $b_n \equiv E\big[ f(X_{T_1 + n}) I(T_2 - T_1 > n) \big]$ and $p_r = P(T_2 - T_1 = r)$. Now by the discrete renewal theorem from Section 8.5 (which applies since $E(T_2 - T_1) < \infty$ and $T_2 - T_1$ has an aperiodic distribution), (iii) follows. □

Remark 14.2.1: Since the strong law is valid for any $m$-dependent ($m < \infty$) and stationary sequence of random variables, Theorem 14.2.10 is valid even if the excursions $\{\eta_j\}_{j \ge 1}$ are $m$-dependent.
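Part (i) of Theorem 14.2.10 suggests estimating $\pi(A)$ by the ratio of the time spent in $A$ over complete excursions to the total length of those excursions. The sketch below illustrates this for a hypothetical two-state chain, using returns to state 0 as regeneration times (as in Example 14.2.11). The matrix `P` and the function names are assumptions made for the illustration; for this `P` the stationary distribution is $(5/6, 1/6)$.

```python
import random

# Hypothetical two-state chain; its stationary distribution is (5/6, 1/6).
P = [[0.9, 0.1],
     [0.5, 0.5]]

def simulate(P, x0, n_steps, rng):
    """Simulate X_0, ..., X_{n_steps} from the transition matrix P."""
    path = [x0]
    for _ in range(n_steps):
        path.append(rng.choices(range(len(P)), weights=P[path[-1]], k=1)[0])
    return path

def regenerative_estimate(path, state, delta=0):
    """Ratio estimate of pi({state}) from the excursions between visits to delta.

    The regeneration times T_1 < T_2 < ... are the times n > 0 with X_n = delta.
    Only complete excursions, i.e. the window [T_1, T_last), are used.
    """
    times = [n for n, x in enumerate(path) if n > 0 and x == delta]
    if len(times) < 2:
        raise ValueError("need at least two regeneration times")
    window = path[times[0]:times[-1]]
    visits = sum(1 for x in window if x == state)
    return visits / (times[-1] - times[0])

path = simulate(P, x0=0, n_steps=200_000, rng=random.Random(2))
estimate = regenerative_estimate(path, state=1)   # should be near 1/6
```

Discarding the initial segment before $T_1$ and the incomplete last excursion mirrors the sandwich (2.18): those two pieces are negligible after division by $n$.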

14.2.8 Limit theory of Harris recurrent Markov chains

The minorization theorem, the fundamental regeneration theorem, and the limit theorem for regenerative sequences, i.e., Theorems 14.2.5, 14.2.6, 14.2.7, and 14.2.10, are the essential components of a limit theory for Harris recurrent Markov chains that parallels the limit theory for discrete state space irreducible recurrent Markov chains.

Definition 14.2.8: A probability measure $\pi$ on $(S, \mathcal{S})$ is called stationary for a transition function $P(\cdot, \cdot)$ if
\[ \pi(A) = \int_S P(x, A) \, \pi(dx) \quad \text{for all } A \in \mathcal{S}. \]
Note that if $X_0 \sim \pi$, then $X_n \sim \pi$ for all $n \ge 1$, justifying the term "stationary."

Theorem 14.2.11: Let $\{X_n\}_{n \ge 0}$ be a Harris recurrent Markov chain with state space $(S, \mathcal{S})$ and transition function $P(\cdot, \cdot)$. Let $\mathcal{S}$ be countably generated. Suppose there exists a stationary probability measure $\pi$. Then,

(i) $\pi$ is unique.


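(ii) (The law of large numbers). For all $f \in L^1(S, \mathcal{S}, \pi)$, for all $x \in S$,
\[ \frac{1}{n} \sum_{j=0}^{n-1} f(X_j) \to \int f \, d\pi \quad \text{w.p. 1 } (P_x). \]

(iii) (Convergence of n-step probabilities). For all $x \in S$,
\[ \mu_{n,x}(\cdot) \equiv \frac{1}{n} \sum_{j=0}^{n-1} P_x(X_j \in \cdot) \to \pi(\cdot) \quad \text{in total variation.} \]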

Proof: By Harris recurrence and the minorization Theorem 14.2.5, there exist a set $A_0 \in \mathcal{S}$, a constant $0 < \alpha < 1$, an integer $n_0 \ge 1$, and a probability measure $\nu$ such that for all $x \in A_0$, $A \in \mathcal{S}$,
\[ P^{n_0}(x, A) \ge \alpha \nu(A) \tag{2.19} \]
and for all $x$ in $S$,
\[ P_x(T_{A_0} < \infty) = 1. \tag{2.20} \]
For simplicity of exposition, assume that $n_0 = 1$. (The general case $n_0 > 1$ can be reduced to this by considering the transition function $P^{n_0}$.) Let the sequence $\{\eta_n, \delta_n, Y_{n,x}\}_{n \ge 1}$ and the regeneration times $\{T_i\}_{i \ge 1}$ be as in Theorem 14.2.6. Recall that the first regeneration time $T_1$ can be defined as
\[ T_1 = \min\{n : n > 0,\ X_{n-1} \in A_0,\ \delta_n = 1\} \tag{2.21} \]
and the succeeding ones by
\[ T_{i+1} = \min\{n : n > T_i,\ X_{n-1} \in A_0,\ \delta_n = 1\}, \tag{2.22} \]
and that $X_{T_i}$ are distributed as $\nu$ independent of the past. Let for $n \ge 1$, $N_n = k$ if $T_k \le n < T_{k+1}$. By Harris recurrence, for all $x$ in $S$, $N_n \to \infty$ w.p. 1 ($P_x$) and by the SLLN, for all $x$ in $S$,
\[ \frac{N_n}{n} \to \frac{1}{E_\nu T_1} \quad \text{w.p. 1 } (P_x) \]
and hence, by the BCT,
\[ E_x \frac{N_n}{n} \to \frac{1}{E_\nu T_1}. \]
On the other hand, for any $k \ge 1$, $x \in S$,
\[ P_x(\text{a regeneration occurs at } k) = P_x(X_{k-1} \in A_0,\ \delta_k = 1) = P_x(X_{k-1} \in A_0)\, \alpha. \]
Thus
\[ E_x \frac{N_n}{n} = \frac{1}{n} \sum_{k=1}^{n} P_x(X_{k-1} \in A_0)\, \alpha \]


and hence, for all $x$ in $S$,
\[ \frac{1}{n} \sum_{j=0}^{n-1} P_x(X_j \in A_0) \to \frac{1}{\alpha E_\nu T_1}. \]
Now let $\pi$ be a stationary measure for $P(\cdot, \cdot)$. Then $\pi(A_0) = \int P_x(X_j \in A_0) \, \pi(dx)$ for all $j = 0, 1, 2, \ldots$ and hence
\[ n \pi(A_0) = \sum_{j=0}^{n-1} \int P_x(X_j \in A_0) \, \pi(dx). \tag{2.23} \]
Since $G(x, A_0) \equiv \sum_{j=0}^{\infty} P_x(X_j \in A_0) > 0$ for all $x$ in $S$, by Harris recurrence (Harris irreducibility will do for this), it follows that $\pi(A_0) > 0$. Dividing both sides of (2.23) by $n$ and letting $n \to \infty$ yield
\[ \pi(A_0) = \frac{1}{\alpha E_\nu T_1} \]
and hence that $E_\nu T_1 < \infty$. Since $E_\nu T_1 \equiv E(T_2 - T_1) < \infty$, by Theorem 14.2.10, for all $x$ in $S$, $A \in \mathcal{S}$,
\[ \frac{1}{n} \sum_{j=0}^{n-1} P_x(X_j \in A) \to \frac{E_\nu\big( \sum_{j=0}^{T_1 - 1} I_A(X_j) \big)}{E_\nu(T_1)}. \]
Integrating the left side with respect to $\pi$ yields that for any $A \in \mathcal{S}$,
\[ \pi(A) = \frac{E_\nu\big( \sum_{j=0}^{T_1 - 1} I_A(X_j) \big)}{E_\nu(T_1)}, \]
thus establishing the uniqueness of $\pi$, i.e., establishing (i) of Theorem 14.2.11. The other two parts follow from the regeneration Theorem 14.2.6 and the limit Theorem 14.2.10. □

Remark 14.2.2: Under the assumption $n_0 = 1$ that was made at the beginning of the proof, it also follows that
\[ P_x(X_j \in \cdot) \to \pi(\cdot) \tag{2.24} \]
in total variation. This also holds if the g.c.d. of the $n_0$'s for which there exist $A_0, \alpha, \nu$ satisfying (2.19) is one.

Remark 14.2.3: A necessary and sufficient condition for the existence of a stationary distribution for a Harris recurrent chain is that there exists a set $\{A_0, \alpha, \nu, n_0\}$ satisfying (2.19) and (2.20) and $E_\nu T_{A_0} < \infty$.


A more general result than Theorem 14.2.11 is the following, which was motivated by applications to Markov chain Monte Carlo methods.

Theorem 14.2.12: Let $\{X_n\}_{n \ge 0}$ be a Markov chain with state space $(S, \mathcal{S})$ and transition function $P(\cdot, \cdot)$. Suppose (2.19) holds for some $(A_0, \alpha, \nu, n_0)$. Suppose $\pi$ is a stationary probability measure for $P(\cdot, \cdot)$ such that
\[ \pi\big( \{x : P_x(T_{A_0} < \infty) > 0\} \big) = 1. \tag{2.25} \]
Then, for $\pi$-almost all $x$,

(i) $P_x(T_{A_0} < \infty) = 1$.

(ii) $\mu_{n,x}(\cdot) = \frac{1}{n} \sum_{j=0}^{n-1} P_x(X_j \in \cdot) \to \pi(\cdot)$ in total variation.

(iii) For any $f \in L^1(S, \mathcal{S}, \pi)$, $\frac{1}{n} \sum_{j=0}^{n} f(X_j) \to \int f \, d\pi$ w.p. 1 ($P_x$).

(iv) $\frac{1}{n} \sum_{j=0}^{n} E_x f(X_j) \to \int f \, d\pi$.

If, further, the g.c.d. $\{m$ : there exists $\alpha_m > 0$ such that for all $x$ in $A_0$, $P^m(x, \cdot) \ge \alpha_m \nu(\cdot)\} = 1$, then (ii) can be strengthened to $P_x(X_n \in \cdot) \to \pi(\cdot)$ in total variation.

The key difference between Theorems 14.2.11 and 14.2.12 is that the latter does not require Harris recurrence, which is often difficult to verify. On the other hand, the conclusions of Theorem 14.2.12 are valid only for $\pi$-almost all $x$, unlike for all $x$ in Theorem 14.2.11. In MCMC applications, the existence of a stationary measure is given (as it is the 'target distribution') and the minorization condition is easier to verify, as is the milder form of irreducibility condition (2.25). (Harris irreducibility will require $P_x(T_{A_0} < \infty) > 0$ for all $x$ in $S$.) For a proof of Theorem 14.2.12 and applications to MCMC, see Athreya, Doss and Sethuraman (1996).

Example 14.2.15: (AR(1) time series) (Example 14.2.4). Suppose $\eta_1$ has an absolutely continuous component and that $|\rho| < 1$. Then the chain is Harris recurrent and admits a stationary probability distribution $\pi(\cdot)$, namely that of $\sum_{j=0}^{\infty} \rho^j \eta_j$, and $P_x(X_n \in \cdot) \to \pi(\cdot)$ in total variation for any $x$.

Example 14.2.16: (Waiting time chain) (Example 14.2.6). If $E\eta_1 < 0$, the state 0 is recurrent and hence the Markov chain is Harris recurrent. Also a stationary distribution $\pi$ does exist. It is known that $\pi$ is the same as the distribution of $M_\infty \equiv \sup_{j \ge 0} S_j$, where $S_0 = 0$, $S_j = \sum_{i=1}^{j} \eta_i$, $j \ge 1$, $\{\eta_i\}_{i \ge 1}$ being iid.
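The law of large numbers for the AR(1) chain of Example 14.2.15 can be checked by simulation. The sketch below assumes, purely for illustration, that the innovations are standard normal, so that the stationary distribution is $N(0, 1/(1 - \rho^2))$; the function name `ar1_path`, the value $\rho = 0.5$, and the deliberately remote starting point are hypothetical choices.

```python
import random

def ar1_path(rho, x0, n_steps, rng):
    """X_{n+1} = rho * X_n + eta_{n+1}, with eta iid N(0, 1) (an assumed noise law)."""
    x, path = x0, [x0]
    for _ in range(n_steps):
        x = rho * x + rng.gauss(0.0, 1.0)
        path.append(x)
    return path

rho = 0.5
path = ar1_path(rho, x0=10.0, n_steps=200_000, rng=random.Random(0))

# Time average of f(x) = x^2 along the path versus its stationary mean.
second_moment = sum(x * x for x in path) / len(path)
stationary_var = 1.0 / (1.0 - rho ** 2)
```

Starting far from the bulk of $\pi$ is intentional: the limit in Theorem 14.2.11 (ii) holds for every initial point, and the influence of $X_0 = 10$ on the time average dies out as $n$ grows.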

14.2.9 Markov chains on metric spaces

14.2.9.1 Feller continuity

Let $(S, d)$ be a metric space and $\mathcal{S}$ be the Borel σ-algebra in $S$. Let $P(\cdot, \cdot)$ be a transition function. Let $\{X_n\}_{n \ge 0}$ be a Markov chain with state space $(S, \mathcal{S})$ and transition function $P(\cdot, \cdot)$.

Definition 14.2.9: The transition function $P(\cdot, \cdot)$ is called Feller continuous (or simply Feller) if $x_n \to x$ in $S$ implies $P(x_n, \cdot) \to^d P(x, \cdot)$, i.e.,
\[ (Pf)(x_n) \equiv \int f(y) P(x_n, dy) \to (Pf)(x) \equiv \int f(y) P(x, dy) \]
for all bounded continuous $f : S \to \mathbb{R}$. In terms of the Markov chain, this says $E\big( f(X_1) \mid X_0 = x_n \big) \to E\big( f(X_1) \mid X_0 = x \big)$ if $x_n \to x$.

Example 14.2.17: Let $(\Omega, \mathcal{F}, P)$ be a probability space and $h : S \times \Omega \to S$ be jointly measurable and $h(\cdot, \omega)$ be continuous w.p. 1. Let $P(x, A) \equiv P(h(x, \omega) \in A)$ for $x \in S$, $A \in \mathcal{S}$. Then $P(\cdot, \cdot)$ is a Feller continuous transition function. Indeed, for any bounded continuous $f : S \to \mathbb{R}$,
\[ (Pf)(x) \equiv \int f(y) P(x, dy) = E f\big( h(x, \omega) \big). \]
Now, $x_n \to x$ implies $h(x_n, \omega) \to h(x, \omega)$ w.p. 1, which implies $f\big(h(x_n, \omega)\big) \to f\big(h(x, \omega)\big)$ w.p. 1 (by continuity of $f$), which in turn implies $E f\big(h(x_n, \omega)\big) \to E f\big(h(x, \omega)\big)$ (by the bounded convergence theorem). The first five examples of Section 14.2.4 fall in this category. If $h$ is discontinuous, then $P(\cdot, \cdot)$ need not be Feller (Problem 14.22). That $P(\cdot, \cdot)$ is a transition function requires only that $h(\cdot, \cdot)$ be jointly measurable (Problem 14.23).

14.2.9.2 Stationary measures

A general method of finding a stationary measure for a Feller transition function $P(\cdot, \cdot)$ is to consider weak or vague limits of the occupation measures $\mu_{n,\lambda}(A) \equiv \frac{1}{n} \sum_{j=0}^{n-1} P_\lambda(X_j \in A)$, where $\lambda$ is the initial distribution.

Theorem 14.2.13: Fix an initial distribution $\lambda$. Suppose a probability measure $\mu$ is a weak limit point of $\{\mu_{n,\lambda}\}_{n \ge 1}$. That is, for some $n_1 < n_2 < n_3 < \cdots$, $\mu_{n_k,\lambda} \to^d \mu$. Assume $P(\cdot, \cdot)$ is Feller. Then $\mu$ is a stationary probability measure for $P(\cdot, \cdot)$.

Proof: Let $f : S \to \mathbb{R}$ be continuous and bounded. Then
\[ \int f(y) \, \mu_{n_k,\lambda}(dy) \to \int f(y) \, \mu(dy). \]


But the left side equals
\[ \begin{aligned} \frac{1}{n_k} \sum_{j=0}^{n_k - 1} E_\lambda f(X_j) &= \frac{1}{n_k} E_\lambda f(X_0) + \frac{1}{n_k} \sum_{j=1}^{n_k - 1} E_\lambda f(X_j) \\ &= \frac{1}{n_k} E_\lambda f(X_0) + \frac{1}{n_k} \sum_{j=1}^{n_k - 1} E_\lambda (Pf)(X_{j-1}) \\ &\qquad \text{(since by the Markov property, for } j \ge 1,\ E_\lambda f(X_j) = E_\lambda (Pf)(X_{j-1})\text{)} \\ &= \frac{1}{n_k} E_\lambda f(X_0) + \frac{1}{n_k} \sum_{j=0}^{n_k - 1} E_\lambda (Pf)(X_j) - \frac{1}{n_k} E_\lambda (Pf)(X_{n_k - 1}). \end{aligned} \]
The first and third terms on the right side go to zero since $f$ is bounded and $n_k \to \infty$. The second term goes to $\int (Pf)(y) \, \mu(dy)$ since by the Feller property $Pf$ is a bounded continuous function. Thus,
\[ \begin{aligned} \int_S f(y) \, \mu(dy) &= \int_S (Pf)(y) \, \mu(dy) \\ &= \int_S \Big( \int_S f(z) P(y, dz) \Big) \mu(dy) \\ &= \int_S f(z) \, (\mu P)(dz) \end{aligned} \]
where
\[ \mu P(A) \equiv \int_S P(y, A) \, \mu(dy). \]
This being true for all bounded continuous $f$, it follows that $\mu = \mu P$, i.e., $\mu$ is stationary for $P(\cdot, \cdot)$. □

A more general result is the following.

Theorem 14.2.14: Let $\lambda$ be an initial distribution. Let $\mu$ be a subprobability measure (i.e., $\mu(S) \le 1$) such that for some $n_1 < n_2 < n_3 < \cdots$, $\{\mu_{n_k,\lambda}\}$ converges vaguely to $\mu$, i.e., $\int f \, d\mu_{n_k,\lambda} \to \int f \, d\mu$ for all $f : S \to \mathbb{R}$ continuous with compact support. Suppose there exists an approximate identity $\{g_n\}_{n \ge 1}$ for $S$, i.e., for all $n$, $g_n$ is a continuous function from $S$ to $[0, 1]$ with compact support and for every $x$ in $S$, $g_n(x) \uparrow 1$ as $n \to \infty$. Then $\mu$ is stationary for $P(\cdot, \cdot)$, i.e.,
\[ \mu(A) = \int_S P(x, A) \, \mu(dx) \quad \text{for all } A \in \mathcal{S}. \]


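For a proof, see Athreya (2004). If $S = \mathbb{R}^k$ for some $k < \infty$, $S$ admits an approximate identity. Conditions to ensure that there is a vague limit point $\mu$ such that $\mu(S) > 0$ are provided by the following.

Theorem 14.2.15: Suppose there exist a set $A_0 \in \mathcal{S}$, a function $V : S \to [0, \infty)$ and numbers $0 < \alpha, M < \infty$ such that
\[ (PV)(x) \equiv E_x V(X_1) \le V(x) - \alpha \quad \text{for } x \notin A_0 \tag{2.26} \]
and
\[ \sup_{x \in A_0} \big( E_x V(X_1) - V(x) \big) \equiv M < \infty. \tag{2.27} \]
Then, for any initial distribution $\lambda$,
\[ \liminf_{n \to \infty} \mu_{n,\lambda}(A_0) \ge \frac{\alpha}{\alpha + M}. \tag{2.28} \]

Proof: For $j \ge 1$,
\[ \begin{aligned} E_\lambda V(X_j) - E_\lambda V(X_{j-1}) &= E_\lambda \big( PV(X_{j-1}) - V(X_{j-1}) \big) \\ &= E_\lambda \big( PV(X_{j-1}) - V(X_{j-1}) \big) I_{A_0}(X_{j-1}) + E_\lambda \big( PV(X_{j-1}) - V(X_{j-1}) \big) I_{A_0^c}(X_{j-1}) \\ &\le M P_\lambda(X_{j-1} \in A_0) - \alpha P_\lambda(X_{j-1} \notin A_0). \end{aligned} \]
Adding over $j = 1, 2, \ldots, n$ yields
\[ \frac{1}{n} \big( E_\lambda V(X_n) - E_\lambda V(X_0) \big) \le -\alpha + (\alpha + M) \mu_{n,\lambda}(A_0). \]
Since $V(\cdot) \ge 0$, letting $n \to \infty$ yields (2.28). □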
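The bound (2.28) can be checked exactly on a small chain. The sketch below uses a hypothetical reflected random walk on $\{0, \ldots, N\}$ that moves up with probability `p_up` and down otherwise, held at the two ends; none of these names or values come from the text. With $V(x) = x$ and $A_0 = \{0\}$, the drift outside $A_0$ is at most $-(1 - 2 p_{up})$, so (2.26) holds with $\alpha = 1 - 2 p_{up}$, and (2.27) holds with $M = p_{up}$.

```python
def step_distribution(dist, p_up):
    """Push a distribution on {0,...,N} one step through the reflected walk."""
    N = len(dist) - 1
    new = [0.0] * (N + 1)
    for x, mass in enumerate(dist):
        new[min(x + 1, N)] += mass * p_up          # up-move, held at N
        new[max(x - 1, 0)] += mass * (1.0 - p_up)  # down-move, held at 0
    return new

def occupation_of_zero(start, N, p_up, n):
    """mu_{n,lambda}({0}) = (1/n) * sum_{j<n} P_lambda(X_j = 0), lambda = delta_start."""
    dist = [0.0] * (N + 1)
    dist[start] = 1.0
    total = 0.0
    for _ in range(n):
        total += dist[0]
        dist = step_distribution(dist, p_up)
    return total / n

N, p_up = 10, 0.3
alpha = 1.0 - 2.0 * p_up      # drift constant in (2.26) for V(x) = x
M = p_up                      # sup over A0 = {0} in (2.27)
bound = alpha / (alpha + M)   # right side of (2.28), here 4/7

mu_from_top = occupation_of_zero(start=N, N=N, p_up=p_up, n=5000)
mu_from_zero = occupation_of_zero(start=0, N=N, p_up=p_up, n=5000)
```

For this walk the stationary mass at 0 is only slightly above $4/7$, so the inequality (2.28) is nearly an equality in the limit; at finite $n$ the occupation measure started from $N$ can sit a little below the bound, which is consistent with (2.28) being a statement about the lim inf.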

Definition 14.2.10: A metric space $(S, d)$ has the vague compactness property if given any collection $\{\mu_\alpha : \alpha \in I\}$ of subprobability measures, there is a sequence $\{\alpha_j\} \subset I$ such that $\mu_{\alpha_j}$ converges vaguely to a subprobability measure $\mu$.

It is known by Helly's selection theorem that all Euclidean spaces have this property. It is also known that any Polish space, i.e., a complete, separable metric space, has this property (see Billingsley (1968)). Combining the above two results yields the following:

Theorem 14.2.16: Let $P(\cdot, \cdot)$ be a Feller transition function on a metric space $(S, d)$ that admits an approximate identity and has the vague compactness property. Suppose there exist a closed set $A_0$, a function $V : S \to [0, \infty)$, and numbers $0 < \alpha, M < \infty$ such that (2.26) and (2.27) hold. Then there exists a stationary probability measure $\mu$ for $P(\cdot, \cdot)$.

Proof: Fix an initial distribution $\lambda$. Then the family $\{\mu_{n,\lambda}\}_{n \ge 1}$ has a subsequence $\{\mu_{n_k,\lambda}\}_{k \ge 1}$ and a subprobability measure $\mu$ such that $\mu_{n_k,\lambda} \to \mu$ vaguely. Since $A_0$ is closed, this implies
\[ \limsup_{k \to \infty} \mu_{n_k,\lambda}(A_0) \le \mu(A_0). \]
Thus
\[ \mu(A_0) \ge \limsup_{k \to \infty} \mu_{n_k,\lambda}(A_0) \ge \liminf_{k \to \infty} \mu_{n_k,\lambda}(A_0) \ge \frac{\alpha}{\alpha + M}, \]
by Theorem 14.2.15. This yields that $\mu(S) > 0$. By Theorem 14.2.14, $\mu$ is stationary for $P$. So $\tilde\mu(\cdot) \equiv \mu(\cdot)/\mu(S)$ is a probability measure that is stationary for $P$. □

Example 14.2.18: Consider a Markov chain generated by the iteration of iid random logistic maps
\[ X_{n+1} = C_{n+1} X_n (1 - X_n), \quad n \ge 0, \]
with $S = [0, 1]$, $\{C_n\}_{n \ge 1}$ iid with values in $[0, 4]$ and $X_0$ independent of $\{C_n\}_{n \ge 1}$. Assume $E \log C_1 > 0$ and $E|\log(4 - C_1)| < \infty$. Then there exists a stationary probability measure $\pi$ such that $\pi\big((0, 1)\big) = 1$. This follows from Theorem 14.2.16 by showing that if $V(x) = |\log x|$, then there exist $A_0 = [a, b] \subset (0, 1)$ and constants $0 < \alpha, M < \infty$ such that (2.26) and (2.27) hold. For details, see Athreya (2004).

14.2.9.3 Convergence questions

If $\{X_n\}_{n \ge 0}$ is a Feller Markov chain (i.e., its transition function $P(\cdot, \cdot)$ is Feller continuous), what can one say about the convergence of the distribution of $X_n$ as $n \to \infty$? If $P(\cdot, \cdot)$ admits a unique stationary probability measure $\pi$ and the family $\{\mu_{n,\lambda} : n \ge 1\}$ is tight for a given $\lambda$, then one can conclude from Theorem 14.2.13 that every weak limit point of this family has to be $\pi$ and hence $\pi$ is the only weak limit point and that for this $\lambda$, $\mu_{n,\lambda} \to^d \pi$. To go from this to the convergence of $P_\lambda(X_n \in \cdot)$ to $\pi(\cdot)$, one needs extra conditions to rule out periodic behavior. Since the occupation measure $\mu_{n,\lambda}(A)$ is the mean of the empirical measure
\[ L_n(A) \equiv \frac{1}{n} \sum_{j=0}^{n-1} I_A(X_j), \]
a natural question is what can one say about the convergence of the empirical measure? This is important for the statistical estimation of $\pi$. When the chain is Harris recurrent, it was shown in the previous section that for each $x$ and for each $A \in \mathcal{S}$,
\[ L_n(A) \to \pi(A) \quad \text{w.p. 1 } (P_x). \]


For a Feller chain admitting a unique stationary measure π, one can appeal to the ergodic theorem to conclude that for each A in S, Ln (A) → π(A) w.p. 1 (Px ) for π-almost all x in S. Further, if S is Polish, then one can show that for π-almost all x, Ln (·) −→d π(·) w.p. 1 (Px ).
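The empirical measure $L_n$ is easy to compute along a simulated path. The sketch below iterates the random logistic maps of Example 14.2.18 under the assumption, made only for this illustration, that $C_n$ is uniform on $[1.5, 3.5]$; for this law $E \log C_1 > 0$ and $E|\log(4 - C_1)| < \infty$. The function names and the interval $[0.2, 0.8]$ are hypothetical.

```python
import random

def logistic_path(x0, n_steps, c_low, c_high, rng):
    """Iterate X_{n+1} = C_{n+1} X_n (1 - X_n) with C_n iid Uniform[c_low, c_high]."""
    x, path = x0, [x0]
    for _ in range(n_steps):
        c = rng.uniform(c_low, c_high)
        x = c * x * (1.0 - x)
        path.append(x)
    return path

def empirical_measure(path, a, b):
    """L_n([a, b]): the fraction of the path that lies in [a, b]."""
    return sum(1 for x in path if a <= x <= b) / len(path)

path = logistic_path(x0=0.3, n_steps=10_000, c_low=1.5, c_high=3.5,
                     rng=random.Random(3))
mass_middle = empirical_measure(path, 0.2, 0.8)
```

With this particular range for $C_n$ the path stays in a compact subinterval of $(0, 1)$, which is in line with a stationary measure that gives full mass to $(0, 1)$; the simulation is a plausibility check, not a proof of convergence of $L_n$.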

14.3 Markov chain Monte Carlo (MCMC)

14.3.1 Introduction

Let $\pi$ be a probability measure on a measurable space $(S, \mathcal{S})$. Let $h(\cdot) : S \to \mathbb{R}$ be $\mathcal{S}$-measurable with $\int |h| \, d\pi < \infty$ and let $\lambda = \int h \, d\pi$. The effort in the computation of $\lambda$ depends on the complexity of the function $h(\cdot)$ as well as that of the measure $\pi$. Clearly, a first approach is to go back to the definition of $\int h \, d\pi$ and use numerical approximation such as approximating $h(\cdot)$ by a sequence of simple functions and evaluating $\pi(\cdot)$ on the sets involved in these simple functions. However, in many situations this may not be feasible, especially if the measure $\pi(\cdot)$ is specified only up to a constant that is not easy to evaluate. Such is often the case in Bayesian statistics where $\pi$ is the posterior distribution $\pi_{\theta|X}$ of the parameter $\theta$ given the data $X$, whose density is proportional to $f(X|\theta)\nu(d\theta)$, $f(X|\theta)$ being the density of $X$ given $\theta$ and $\nu(d\theta)$ the prior distribution of $\theta$. In such situations, objects of interest are the posterior mean, variance, and other moments as well as the posterior probability of $\theta$ being in some set of interest. In these problems, the main difficulty lies in the evaluation of the normalizing constant $C(X) = \int f(X|\theta) \nu(d\theta)$. However, it may be possible to generate a sequence of random variables $\{X_n\}_{n \ge 1}$ such that the distribution of $X_n$ gets close to $\pi$ in a suitable sense and a law of large numbers asserting that
\[ \frac{1}{n} \sum_{i=1}^{n} h(X_i) \to \lambda = \int h \, d\pi \]
holds for a large class of $h$ such that $\int |h| \, d\pi < \infty$.

A method that has become very useful in Bayesian statistics in the past twenty years or so (with the advent of high speed computing) is that of generating a Markov chain $\{X_n\}_{n \ge 1}$ with stationary distribution $\pi$. This method has its origins in the important paper of Metropolis, Rosenbluth, Rosenbluth, Teller and Teller (1953). For the adaptation of this method to image processing problems, see Geman and Geman (1984). This method is now known as Markov chain Monte Carlo, or MCMC for short. For the basic limit theory of Markov chains, see Section 14.2. For proofs of the claims in the rest of this section and further details on MCMC, see the recent book of Robert and Casella (1999). In the rest of this section two of the widely used MCMC algorithms are discussed. These are the Metropolis-Hastings algorithm and the Gibbs sampler.

14.3.2 Metropolis-Hastings algorithm

Let $\pi$ be a probability measure on a measurable space $(S, \mathcal{S})$. Let $\pi$ be dominated by a σ-finite measure $\mu$ with density $f(\cdot)$. Let for each $x$, $q(y|x)$ be a probability density in $y$ w.r.t. $\mu$. That is, $q(y|x)$ is jointly measurable as a function from $(S \times S, \mathcal{S} \times \mathcal{S}) \to \mathbb{R}_+$ and for each $x$, $\int_S q(y|x) \mu(dy) = 1$. Such a distribution $q(\cdot|\cdot)$ is called the instrumental or proposal distribution. The Metropolis-Hastings algorithm generates a Markov chain $\{X_n\}$ using the densities $f(\cdot)$ and $q(\cdot|\cdot)$ in two steps as follows:

Step 1: Given $X_n = x$, first generate a random variable $Y_n$ with density $q(\cdot|x)$.

Step 2: Then, set $X_{n+1} = Y_n$ with probability $p(x, Y_n)$ and $X_{n+1} = X_n$ with probability $1 - p(x, Y_n)$, where
\[ p(x, y) \equiv \min\Big\{ \frac{f(y)}{f(x)} \frac{q(x|y)}{q(y|x)},\ 1 \Big\}. \tag{3.1} \]

Thus, the value $Y_n$ is "accepted" as $X_{n+1}$ with probability $p(x, Y_n)$ and if rejected the chain stays where it was, i.e., at $X_n$. Implicit in the above definition is that the state space of the Markov chain $\{X_n\}$ is simply the set $A_f \equiv \{x : f(x) > 0\}$. It is also assumed that for all $x, y$ in $A_f$, $q(y|x) > 0$. The transition function $P(x, A)$ for this Markov chain is given by
\[ P(x, A) = I_A(x)(1 - r(x)) + \int_A p(x, y) q(y|x) \mu(dy) \tag{3.2} \]
where
\[ r(x) = \int_S p(x, y) q(y|x) \mu(dy). \]

It turns out that the measure $\pi(\cdot)$ is a stationary measure for this Markov chain $\{X_n\}$. Indeed, for any $A \in \mathcal{S}$,
\[ \begin{aligned} \int_S P(x, A) \pi(dx) &= \int_S P(x, A) f(x) \mu(dx) \\ &= \int_S I_A(x)(1 - r(x)) f(x) \mu(dx) + \int_S \Big( \int_A p(x, y) q(y|x) \mu(dy) \Big) f(x) \mu(dx). \end{aligned} \tag{3.3} \]
By definition of $p(x, y)$, the identity
\[ q(y|x) f(x) p(x, y) = p(y, x) q(x|y) f(y) \tag{3.4} \]
holds for all $x, y$. Thus the second integral in (3.3) (using Tonelli's theorem) is
\[ \int_A \Big( \int_S p(y, x) q(x|y) \mu(dx) \Big) f(y) \mu(dy) = \int_A r(y) f(y) \mu(dy). \]
Thus, the right side of (3.3) is
\[ \int_S I_A(x) f(x) \mu(dx) \equiv \pi(A), \]

verifying stationarity. From the results of Section 14.2, it follows that if the transition function $P(\cdot, \cdot)$ is Harris irreducible w.r.t. some reference measure $\varphi$, then (since it admits $\pi$ as a stationary measure) the law of large numbers holds, i.e., for any $h \in L^1(\pi)$,
\[ \frac{1}{n} \sum_{j=0}^{n-1} h(X_j) \to \int h \, d\pi \quad \text{as } n \to \infty \]
w.p. 1 for any initial distribution. Thus, a good MCMC approximation to $\lambda = \int h \, d\pi$ is $\hat\lambda_n \equiv \frac{1}{n} \sum_{j=0}^{n-1} h(X_j)$. A sufficient condition for irreducibility is that $q(y|x) > 0$ for all $(x, y)$ in $A_f \times A_f$. Summarizing the above discussion leads to

Theorem 14.3.1: Let $\pi$ be a probability measure on a measurable space $(S, \mathcal{S})$ with probability density $f(\cdot)$ w.r.t. a σ-finite measure $\mu$. Let $A_f = \{x : f(x) > 0\}$. Let $q(y|x)$ be a measurable function from $A_f \times A_f \to (0, \infty)$ such that $\int_S q(y|x) \mu(dy) = 1$ for all $x$ in $A_f$. Let $\{X_n\}_{n \ge 0}$ be a Markov chain generated by the Metropolis-Hastings algorithm (3.1). Then, for any $h \in L^1(\pi)$,
\[ \frac{1}{n} \sum_{j=0}^{n-1} h(X_j) \to \int h \, d\pi \quad \text{as } n \to \infty \text{ w.p. 1} \tag{3.5} \]

for any (initial) distribution of $X_0$.

The Metropolis-Hastings algorithm does not need the full knowledge of the target density $f(\cdot)$ of $\pi(\cdot)$. The function $f(\cdot)$ enters the algorithm only through the function $p(x, y)$, which involves only the knowledge of $f(y)/f(x)$ and $q(\cdot|\cdot)$, and hence this algorithm can be implemented even if $f$ is known only up to a multiplicative constant. This is often the case in Bayesian statistics. Also, the choice of $q(x|y)$ depends on $f(\cdot)$ only through the condition that $q(x|y) > 0$ on $A_f \times A_f$. Thus, the Metropolis-Hastings algorithm has wide applicability. Two special cases of this algorithm are given below.

14.3.2.1 Independent Metropolis-Hastings

Let $q(y|x) \equiv g(y)$ where $g(\cdot)$ is a probability density such that $g(y) > 0$ if $f(y) > 0$.


Suppose $\sup\big\{ \frac{f(y)}{g(y)} : f(y) > 0 \big\} \equiv M < \infty$. Then, in addition to the law of large numbers (3.5) of Theorem 14.3.1, it holds that for any initial value of $X_0$,
\[ \big\| P(X_n \in \cdot) - \pi(\cdot) \big\| \le 2 \Big( 1 - \frac{1}{M} \Big)^n \]
where $\| \cdot \|$ is the total variation norm. Thus, the distribution of $\{X_n\}$ converges in total variation at a geometric rate. For a proof, see Robert and Casella (1999).

14.3.2.2 Random-walk Metropolis-Hastings

Here the state space is the real line or a subset of some Euclidean space. Let $q(y|x) = g(y - x)$ where $g(\cdot)$ is a probability density such that $g(y - x) > 0$ for all $x, y$ such that $f(x) > 0$, $f(y) > 0$. This ensures irreducibility and hence the law of large numbers (3.5) holds. A sufficient condition for geometric convergence of the distribution of $\{X_n\}$ in the real line case is the following:

(a) The density $f(\cdot)$ is symmetric about 0 and is asymptotically log concave, i.e., it holds that for some $\alpha > 0$ and $x_0 > 0$,
\[ \log f(x) - \log f(y) \ge \alpha |y - x| \quad \text{for all } y < x < -x_0 \text{ or } x_0 < x < y. \]

(b) The density function $g(\cdot)$ is positive and symmetric.

For further special cases and more results, see Robert and Casella (1999).
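On a finite state space with counting measure as $\mu$, the transition function (3.2) is a matrix, so the stationarity of $\pi$ and the geometric bound for the independent Metropolis-Hastings chain can be checked exactly. The sketch below is an illustration under stated assumptions: the unnormalized weights, the uniform proposal `g`, and the helper names `mh_kernel`, `mh_chain`, and `tv_norm` are hypothetical. Only the ratios of `weights` are used, which reflects the remark that $f$ need only be known up to a multiplicative constant.

```python
import random

def accept_prob(weights, g, x, y):
    """p(x, y) from (3.1) with q(y|x) = g(y); weights is an unnormalized f."""
    return min(weights[y] * g[x] / (weights[x] * g[y]), 1.0)

def mh_kernel(weights, g):
    """Exact transition matrix of the chain on {0,...,K-1}, following (3.2)."""
    K = len(weights)
    P = [[0.0] * K for _ in range(K)]
    for x in range(K):
        for y in range(K):
            if y != x:
                P[x][y] = g[y] * accept_prob(weights, g, x, y)
        P[x][x] = 1.0 - sum(P[x])   # a rejected proposal leaves the chain at x
    return P

def mh_chain(weights, g, x0, n_steps, rng):
    """Simulate the chain: propose Y ~ g, accept with probability p(x, Y)."""
    x, path = x0, [x0]
    for _ in range(n_steps):
        y = rng.choices(range(len(g)), weights=g, k=1)[0]
        if rng.random() < accept_prob(weights, g, x, y):
            x = y
        path.append(x)
    return path

def tv_norm(p, q):
    """Total variation norm of p - q, taken here as the sum of |p_i - q_i|."""
    return sum(abs(a - b) for a, b in zip(p, q))

weights = [1.0, 2.0, 3.0, 4.0]             # hypothetical unnormalized target
g = [0.25, 0.25, 0.25, 0.25]               # hypothetical independent proposal
pi = [w / sum(weights) for w in weights]   # normalized only for checking
P = mh_kernel(weights, g)
M = max(p / q for p, q in zip(pi, g))      # sup f/g, here 1.6
path = mh_chain(weights, g, x0=0, n_steps=50_000, rng=random.Random(4))
```

The sampler `mh_chain` never touches `pi`; the normalized vector is computed only so that the matrix identity $\pi P = \pi$ and the bound $2(1 - 1/M)^n$ can be compared against the exact $n$-step distributions.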

14.3.3 The Gibbs sampler

Suppose $\pi$ is the probability distribution of a bivariate random vector $(X, Y)$. A Markov chain $\{Z_n\}_{n \ge 0}$ can be generated with $\pi$ as its stationary distribution using only the families of conditional distributions $Q(\cdot, y)$ of $X \mid Y = y$ and $P(\cdot, x)$ of $Y \mid X = x$ for all $x, y$ generated by the joint distribution $\pi$ of $(X, Y)$. This Markov chain is known as the Gibbs sampler. The algorithm is as follows:

Step 1: Start with some initial value $X_0 = x_0$. Generate $Y_0$ according to the conditional distribution $P(Y_0 \in \cdot \mid X_0 = x_0) = P(\cdot, x_0)$.

Step 2: Next, generate $X_1$ according to the conditional distribution $P(X_1 \in \cdot \mid Y_0 = y_0) = Q(\cdot, y_0)$.

Step 3: Now generate $Y_1$ as in Step 1 but with conditioning value $X_1$.

Step 4: Now generate $X_2$ as in Step 2 but with conditioning value $Y_1$, and so on.


Thus, starting from $X_0$, one generates successively $Y_0, X_1, Y_1, X_2, Y_2, \ldots$. Clearly, the sequences $\{X_n\}_{n \ge 0}$, $\{Y_n\}_{n \ge 0}$ and $\{Z_n \equiv (X_n, Y_n)\}_{n \ge 0}$ are all Markov chains. It is also easy to verify that the marginal distribution $\pi_X$ of $X$, the marginal distribution $\pi_Y$ of $Y$, and the distribution $\pi$ are, respectively, stationary for the $\{X_n\}$, $\{Y_n\}$, and $\{Z_n\}$ chains. Indeed, if $X_0 \sim \pi_X$, then $Y_0 \sim \pi_Y$ and hence $X_1 \sim \pi_X$. Similarly one can verify the other two claims. Recall that a sufficient condition for the law of large numbers (3.5) to hold is irreducibility. A sufficient condition for irreducibility in turn is that the chain $\{Z_n\}_{n \ge 0}$ has a transition function $R(z, \cdot)$ that, for each $z = (x, y)$, is absolutely continuous with respect to some fixed dominating measure on $\mathbb{R}^2$.

The above algorithm is easily generalized to cover the $k$-variate case ($k > 2$). Let $(X_1, X_2, \ldots, X_k)$ be a random vector with distribution $\pi$. For any vector $x = (x_1, x_2, \ldots, x_k)$ let $x_{(i)} = (x_1, x_2, \ldots, x_{i-1}, x_{i+1}, \ldots, x_k)$ and $P_i(\cdot \mid x_{(i)})$ be the conditional distribution of $X_i$ given $X_{(i)} = x_{(i)}$. Now generate a Markov chain $Z_n \equiv (Z_{n1}, Z_{n2}, \ldots, Z_{nk})$, $n \ge 0$, as follows:

Step 1: Start with some initial value $Z_{0j} = z_{0j}$, $j = 1, 2, \ldots, k - 1$. Generate $Z_{0k}$ from the conditional distribution $P_k(\cdot \mid X_j = z_{0j},\ j = 1, 2, \ldots, k - 1)$.

Step 2: Next, generate $Z_{11}$ from the conditional distribution $P_1(\cdot \mid X_j = z_{0j},\ j = 2, \ldots, k - 1,\ X_k = Z_{0k})$.

Step 3: Next, generate $Z_{12}$ from the conditional distribution $P_2(\cdot \mid X_1 = Z_{11},\ X_j = z_{0j},\ j = 3, \ldots, k - 1,\ X_k = Z_{0k})$ and so on until $(Z_{11}, Z_{12}, \ldots, Z_{1,k-1})$ is generated.

Now go back to Step 1 to generate $Z_{1k}$ and repeat Steps 2 and 3 and so on. This sequence $\{Z_n\}_{n \ge 0}$ is called the Gibbs sampler Markov chain for the distribution $\pi$. A sufficient condition for irreducibility given earlier for the 2-variate case carries over to the $k$-variate case. For more on the Gibbs sampler, see Robert and Casella (1999).
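The bivariate algorithm can be sketched for a discrete joint distribution, where the two conditional families are obtained by normalizing the rows and columns of the joint probability table. The table `joint` and the function names below are hypothetical and serve only as an illustration of the alternating draws.

```python
import random

# Hypothetical joint pmf of (X, Y): rows are x in {0, 1}, columns are y in {0, 1, 2}.
joint = [[0.10, 0.20, 0.10],
         [0.25, 0.05, 0.30]]

def cond_y_given_x(joint, x):
    """P(., x): the conditional pmf of Y given X = x."""
    total = sum(joint[x])
    return [p / total for p in joint[x]]

def cond_x_given_y(joint, y):
    """Q(., y): the conditional pmf of X given Y = y."""
    col = [row[y] for row in joint]
    total = sum(col)
    return [p / total for p in col]

def draw(pmf, rng):
    """Sample an index from a probability vector."""
    return rng.choices(range(len(pmf)), weights=pmf, k=1)[0]

def gibbs(joint, x0, n_steps, rng):
    """Return Z_0, ..., Z_{n_steps} with Z_n = (X_n, Y_n), alternating the two draws."""
    x = x0
    y = draw(cond_y_given_x(joint, x), rng)        # Step 1: Y_0 given X_0
    path = [(x, y)]
    for _ in range(n_steps):
        x = draw(cond_x_given_y(joint, y), rng)    # X_{n+1} given Y_n
        y = draw(cond_y_given_x(joint, x), rng)    # Y_{n+1} given X_{n+1}
        path.append((x, y))
    return path

path = gibbs(joint, x0=0, n_steps=100_000, rng=random.Random(5))
freq_12 = sum(1 for z in path if z == (1, 2)) / len(path)   # compare with joint[1][2]
```

Since every cell of this table is positive, the chain $\{Z_n\}$ is irreducible, and the empirical frequency of each cell should approach the corresponding entry of `joint`, in line with (3.5).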

14.4 Problems

14.1 (a) Show using Definition 14.1.1 that when the state space $S$ is countable, for any $n$, conditioned on $\{X_n = a_n\}$, the events $\{X_{n+j} = a_{n+j},\ 1 \le j \le k\}$ and $\{X_j = a_j : 0 \le j \le n - 1\}$ are independent for all choices of $k$ and $\{a_j\}_{j=0}^{n+k}$. Thus, conditioned on the "present" $\{X_n = a_n\}$, the "past" $\{X_j : j \le n - 1\}$ and "future" $\{X_j : j \ge n + 1\}$ are two families of independent random variables with respect to the conditional probability measure $P(\cdot \mid X_n = a_n)$, provided $P(X_n = a_n) > 0$.


(b) Prove Proposition 14.2.2 using induction on $n$ (cf. Chapter 6).

14.2 In Example 14.1.1 (Frog in the well), verify that
(a) if $\alpha_i \equiv 1 - \frac{1}{ci}$, $c > 1$, $i \ge 1$, then 1 is null recurrent,
(b) if $\alpha_i \equiv \alpha$, $0 < \alpha < 1$, then 1 is positive recurrent, and
(c) if $\alpha_i \equiv 1 - \frac{1}{2i^2}$, then 1 is transient.

14.3 Consider SSRW in $Z^2$ where the transition probabilities are $p_{(i,j)(i',j')} = \frac14$ each if $(i', j') \in \{(i+1, j), (i-1, j), (i, j+1), (i, j-1)\}$ and zero otherwise. Verify that for $n = 2k$
\[ p^{(2k)}_{(0,0),(0,0)} = \binom{2k}{k}^2 \frac{1}{4^{2k}} \sim \frac{1}{\pi k} \]
and conclude that $(0,0)$ is null recurrent. Extend this calculation to SSRW in $Z^3$ and conclude that $(0,0,0)$ is transient.

14.4 Show that if $i$ is absorbing and $j \to i$, then $j$ is transient by showing that if $j \to i$, then $f^*_{ji} \equiv P(T_i < T_j \mid X_0 = j) > 0$ and $1 - f_{jj} \ge f^*_{ji}$.

14.5 (a) Let $i$ be recurrent and $i \to j$. Show that $j$ is recurrent using Corollary 14.1.5. (Hint: Show that there exist $n_0$ and $m_0$ such that for all $n$, $p^{(n_0+n+m_0)}_{jj} \ge p^{(n_0)}_{ji}\, p^{(n)}_{ii}\, p^{(m_0)}_{ij}$ with $p^{(n_0)}_{ji} > 0$ and $p^{(m_0)}_{ij} > 0$.)
(b) Let $i$ and $j$ communicate. Show that $d_i = d_j$.

14.6 Show that in a finite state space irreducible Markov chain $(S, P)$, all states are positive recurrent by showing
(a) that for any $i, j$ in $S$, there exists $r \le K$ such that $p^{(r)}_{ij} > 0$, where $K$ is the number of states in $S$,
(b) for any $i$ in $S$, there exist $0 < \alpha < 1$ and $c < \infty$ such that $P_i(T_i > n) \le c\alpha^n$.
Give an alternate proof by showing that if $S$ is finite, then for any initial distribution $\mu$, the sequence of occupation measures
\[ \mu_n(\cdot) \equiv \frac{1}{n+1} \sum_{j=0}^{n} P_\mu(X_j \in \cdot) \]
has a subsequence that converges to a probability distribution $\pi$ that is stationary for $(S, P)$.

14.7 Prove Theorem 14.1.3 using the Markov property and induction.


14.8 Adapt the proof of Theorem 14.1.9 to show that for any $i, j$,
\[ \frac{1}{n} \sum_{k=1}^{n} p^{(k)}_{ij} \to \frac{f_{ij}}{E_j T_j} \]
if $j$ is positive recurrent and 0 otherwise. Conclude that in a finite state space case, there must be at least one state that is positive recurrent.

14.9 If $j \to i$ then $\zeta_1 \equiv \sum_{r=0}^{T_i - 1} \delta_{X_r j}$, the number of visits to $j$ before visiting $i$, satisfies $P_i(\zeta_1 > n) \le c\alpha^n$ for some $0 < c < \infty$, $0 < \alpha < 1$ and all $n \ge 1$.

14.10 Adapt the proof of Theorem 14.1.9 to establish the following laws of large numbers. Let $(S, P)$ be irreducible and positive recurrent with stationary distribution $\pi$.
(a) Let $h : S \to R$ be such that $\sum_{j \in S} |h(j)| \pi_j < \infty$. Then, for any initial distribution $\mu$,
\[ \frac{1}{n+1} \sum_{j=0}^{n} h(X_j) \to \sum_{j \in S} h(j)\pi_j \quad \text{w.p. 1} \]
by first verifying that
\[ E_i \Big| \sum_{j=0}^{T_i - 1} h(X_j) \Big| < \infty. \]
(b) Let $g : S \times S \to R$ be such that $\sum_{i,j \in S} |g(i, j)| \pi_i p_{ij} < \infty$. Then, for any initial distribution $\mu$,
\[ \frac{1}{n+1} \sum_{j=0}^{n} g(X_j, X_{j+1}) \to \sum_{i,j \in S} g(i, j) \pi_i p_{ij} \quad \text{w.p. 1.} \]
(c) Fix two disjoint subsets $A$ and $B$ in $S$. Evaluate the long run proportion of transitions from $A$ to $B$.
(d) Extend (b) to conclude that the tail sequence $Z_n \equiv \{X_{n+j} : j \ge 0\}$ of the Markov chain $\{X_n\}_{n\ge 0}$ converges as $n \to \infty$ in the sense of finite dimensional distributions to the strictly stationary sequence $\{X_n\}_{n\ge 0}$ which is the Markov chain $(S, P)$ with initial distribution $\pi$.

14.11 Let $\{X_n\}_{n\ge 0}$ be a Markov chain that is irreducible and has at least two states. Show that w.p. 1 the trajectories $\{X_n\}$ do not converge, i.e., w.p. 1, $\lim_{n\to\infty} X_n$ does not exist. (Hint: Show that w.p. 1, the set of limit points of the set $\{X_n : n \ge 0\}$ coincides with $S$.)


14.12 Let $\{X_n\}_{n\ge 0}$ be a Markov chain with state space $S$ and transition probability matrix $P \equiv ((p_{ij}))$. A probability distribution $\pi \equiv \{\pi_j : j \in S\}$ is said to satisfy the condition of detailed balance or time reversibility with respect to $(S, P)$ if for all $i, j$, $\pi_i p_{ij} = \pi_j p_{ji}$.
(a) Show that such a $\pi$ is necessarily a stationary distribution.
(b) For the birth and death chain (Example 14.1.4), find a condition in terms of the birth and death rates $\{\alpha_i, \beta_i\}_{i\ge 0}$ for the existence of a probability distribution $\pi$ that satisfies the condition of detailed balance.

14.13 (Absorption probabilities and times). Let 0 be an absorbing state. For any $i \ne 0$, let $\theta_i = f_{i0} \equiv P_i(T_0 < \infty)$ and $\eta_i = E_i T_0$. Show using the Markov property that for every $i \ne 0$,
\[ \theta_i = p_{i0} + \sum_{j \ne 0} \theta_j p_{ij}, \qquad \eta_i = 1 + \sum_{j \ne 0} \eta_j p_{ij}. \]
Apply this to the Gambler's ruin problem with $S = \{0, 1, 2, \ldots, N\}$, $N < \infty$ and $p_{00} = 1$, $p_{NN} = 1$, $p_{i,i+1} = p$, $p_{i,i-1} = 1 - p$, $0 < p < 1$, $1 \le i \le N-1$, and find the probability and expected waiting time for ruin (absorption at 0) starting from an initial fortune of $i$, $1 \le i \le N-1$.

14.14 (Renewal theory via Markov chains). Let $\{X_j\}_{j\ge 1}$ be iid positive integer valued random variables. Let $S_0 = 0$, $S_n = \sum_{j=1}^{n} X_j$, $n \ge 1$, let $N(n) = k$ if $S_k \le n < S_{k+1}$, $k = 0, 1, 2, \ldots$, be the number of renewals up to time $n$, and let $A_n = n - S_{N(n)}$ be the age of the current unit at time $n$.
(a) Show that $\{A_n\}_{n\ge 0}$ is a Markov chain and find its state space $S$ and transition probabilities.
(b) Assuming that $EX_1 < \infty$, verify that
\[ \pi_j = \frac{P(X_1 > j)}{EX_1}, \quad j = 0, 1, 2, \ldots \]
is the unique stationary distribution.
(c) Assuming that $X_1$ has an aperiodic distribution and Theorem 14.1.18 holds, show that the discrete renewal theorem holds.

14.15 Prove Proposition 14.2.1 for the countable state space case.

14.16 Prove Proposition 14.2.2.


14.17 Establish assertion (i) of Theorem 14.2.3.
(Hint: Show that
\[ d\big(\hat X_n(x, \omega), \hat X_{n+1}(x, \omega)\big) \le \Big( \prod_{i=1}^{n} s(f_i) \Big)\, d\big(x, f_{n+1}(x, \omega)\big) \]
and use Borel-Cantelli to show that the right side is $O(\lambda^n)$ w.p. 1 for some $0 < \lambda < 1$, and show similarly that $d\big(\hat X_n(x, \omega), \hat X_n(y, \omega)\big) = O(\lambda^n)$ w.p. 1 for any $x, y$.)

14.18 Show that if $P(\cdot, \cdot)$ is the transition function of a Markov chain $\{X_n\}_{n\ge 0}$, then for any $n \ge 0$, $P_x(X_n \in A) = P^{(n)}(x, A)$, where $P^{(n)}(\cdot, \cdot)$ is defined by the iteration
\[ P^{(n+1)}(x, A) = \int_S P^{(n)}(y, A)\, P(x, dy), \]
with $P^{(0)}(x, A) = I_A(x)$.

(Hint: Use induction and Markov property.) 14.19 (a) Let {Xn }n≥0 be a random walk deﬁned by the iteration scheme Xn+1 = Xn +ηn+1 where {ηn }n≥1 are iid random variables independent of X0 . Assume that ν(·) = P (η1 ∈ ·) has an absolutely continuous component with a density that is strictly positive a.e. on an open interval around 0. Show that {Xn }n≥0 is Harris irreducible w.r.t. the Lebesgue measure on R. Show that if in addition Eη1 = 0, then {Xn } is Harris recurrent as well. (b) Use Theorem 14.2.11 to establish the second claim in Example 14.2.10. 14.20 Show that the waiting time chain (Example 14.2.6) deﬁned by Xn+1 = max{Xn + ηn+1 , 0}, where {ηn }n≥1 are iid is irreducible with reference measure φ(·) ≡ δ0 (·), the delta measure at 0, provided P (η1 < 0) > 0. Show further that it is φ recurrent if Eη1 < 0. 14.21 Prove Theorem 14.2.5 (i) using the C-set lemma. 14.22 Find a h : [0, 1] × [0, 1] → [0, 1] such that h(x, y) is discontinuous in x for almost all y in [0, 1] and conclude that the function P (x, A) = P h(x, Y ) ∈ A where Y is a uniform [0,1] random variable need not be Feller. 14.23 Let (Ω, F, P ) be a probability space and (S, S) a measurable space. Let h : S × Ω → S be jointly measurable. Show that P (x, A) ≡ P (h(x, ω) ∈ A) is a transition function. 14.24 (a) Let {Xn }n≥0 be an irreducible Markov chain with state space S ≡ {0, 1, 2, . . .}. Suppose V : S → [0, ∞) is such that for some K < ∞, Ex V (X1 ) ≤ V (x) for all x > K and that limx→∞ V (x) = ∞. Show that {Xn }n≥0 is recurrent.


(Hint: Let $\{\tilde X_n\}_{n\ge 0}$ be a Markov chain with state space $S \equiv \{0, 1, 2, \ldots\}$ and transition probabilities same as that of $\{X_n\}_{n\ge 0}$ except that the states $\{0, 1, 2, \ldots, K\}$ are absorbing. Verify that $\{V(\tilde X_n)\}_{n\ge 0}$ is a nonnegative super-martingale and hence that $\{\tilde X_n\}_{n\ge 0}$ is bounded w.p. 1. Now conclude that there must exist a state $x$ that gets visited infinitely often by $\{X_n\}_{n\ge 0}$.)
(b) Consider the reflecting nonhomogeneous random walk on $S \equiv \{0, 1, 2, \ldots\}$ such that
\[ p_{ij} = \begin{cases} p_i & \text{if } j = i+1 \\ 1 - p_i & \text{if } j = i-1 \end{cases} \]
with $p_0 = 1$, $0 < p_i \le q_i$ (where $q_i \equiv 1 - p_i$) for all $i \ge k_0$ and some $1 \le k_0 < \infty$, and $0 < p_i < 1$ for all $i \ge 1$. Show that $\{X_n\}_{n\ge 0}$ is irreducible and recurrent.

14.25 Let $\{X_n\}_{n\ge 0}$ be an irreducible and recurrent Markov chain with a countable state space $S$. Let $V : S \to R_+$ be such that $E_x V(X_1) \le V(x)$ for all $x$ in $S$. Show that $V(\cdot)$ is constant on $S$.

14.26 Let $\{C_n\}_{n\ge 1}$ be iid random variables with values in $[0, 4]$. Let $\{X_n\}_{n\ge 0}$ be a Markov chain with values in $[0, 1]$ defined by the random iteration scheme $X_{n+1} = C_{n+1} X_n (1 - X_n)$, $n \ge 0$.
(a) Show that if $E \log C_1 < 0$ then $X_n = O(\lambda^n)$ w.p. 1 for some $0 < \lambda < 1$.
(b) Show also that if $E \log C_1 < 0$ and $0 < \mathrm{Var}(\log C_1) < \infty$ then there exist sequences $\{a_n\}_{n\ge 1}$ and $\{b_n\}_{n\ge 1}$ such that
\[ \frac{\log X_n - a_n}{b_n} \longrightarrow^d N(0, 1). \]

15 Stochastic Processes

This chapter gives a brief discussion of two special classes of real valued stochastic processes $\{X(t) : t \ge 0\}$ in continuous time $[0, \infty)$. These are continuous time Markov chains with a discrete state space (including Poisson processes) and Brownian motion. These are very useful in many areas of applications such as queuing theory and mathematical finance.

15.1 Continuous time Markov chains

15.1.1 Definition

Consider a physical system that can be in one of a finite or countable number of states $\{0, 1, 2, \ldots, K\}$, $K \le \infty$. Assume that the system evolves in continuous time in the following manner. In each state the system stays a random length of time that is exponentially distributed and then jumps to a new state with a probability distribution that depends only on the current state and not on the past history. Thus, if the state of the system at the time of the $n$th transition is denoted by $y_n$, $n = 0, 1, 2, \ldots$, then $\{y_n\}_{n\ge 0}$ is a Markov chain with state space $S \equiv \{0, 1, 2, \ldots, K\}$, $K \le \infty$, and some transition probability matrix $P \equiv ((p_{ij}))$. If $y_n = i_n$, then the system stays in $i_n$ a random length of time $L_n$, called the sojourn time, such that conditional on $\{y_n = i_n\}_{n\ge 0}$, $\{L_n\}_{n\ge 0}$ are independent exponential random variables with $L_n$ having mean $\lambda_{i_n}^{-1}$. Now set the state of the system $X(t)$ at time $t \ge 0$ by
\[ X(t) = \begin{cases} y_0 & 0 \le t < L_0 \\ y_1 & L_0 \le t < L_0 + L_1 \\ \ \vdots & \\ y_n & L_0 + L_1 + \cdots + L_{n-1} \le t < L_0 + L_1 + \cdots + L_n. \end{cases} \tag{1.1} \]
Then $\{X(t) : t \ge 0\}$ is called a continuous time Markov chain with state space $S$, jump probabilities $P \equiv ((p_{ij}))$, waiting time parameters $\{\lambda_i : i \in S\}$, and embedded Markov chain $\{y_n\}_{n\ge 0}$. To make sure that there are only finitely many transitions in finite time, i.e.,
\[ \sum_{i=0}^{\infty} L_i = \infty \quad \text{w.p. 1,} \]
one needs to impose the nonexplosion condition
\[ \sum_{n=0}^{\infty} \frac{1}{\lambda_{y_n}} = \infty \quad \text{w.p. 1} \tag{1.2} \]

(Problem 15.1). Clearly, a sufficient condition for this is that $\lambda_i < \infty$ for all $i \in S$ and $\{y_n\}_{n\ge 0}$ is an irreducible recurrent Markov chain. It can be verified using the "memorylessness" property of the exponential distribution (Problem 15.2) that $\{X(t) : t \ge 0\}$ has the Markov property, i.e., for any $0 \le t_1 \le t_2 \le t_3 \le \cdots \le t_r < \infty$ and any states $i_1, i_2, \ldots, i_r$ in $S$,
\[ P\big(X(t_r) = i_r \mid X(t_j) = i_j,\ 1 \le j \le r-1\big) = P\big(X(t_r) = i_r \mid X(t_{r-1}) = i_{r-1}\big). \tag{1.3} \]
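The construction (1.1) translates directly into a simulation: run the embedded chain and hold each state for an exponential sojourn time. The sketch below is only illustrative; the function names and the two-state example are assumptions made here, not part of the text.

```python
import random

def simulate_ctmc(P, lam, y0, t_end, seed=0):
    """Simulate X(t) on [0, t_end) from jump matrix P and rates lam.

    P[i][j] is the jump probability p_ij of the embedded chain and lam[i]
    is the waiting time parameter of state i (mean sojourn 1 / lam[i]).
    Returns the list of (jump_time, new_state) pairs, starting with (0.0, y0).
    """
    rng = random.Random(seed)
    t, state = 0.0, y0
    path = [(t, state)]
    while True:
        t += rng.expovariate(lam[state])      # sojourn time L_n in the current state
        if t >= t_end:
            return path
        # next state of the embedded chain, drawn from row P[state]
        state = rng.choices(range(len(P)), weights=P[state])[0]
        path.append((t, state))

def state_at(path, t):
    """Value at time t of the right-continuous step function in (1.1)."""
    current = path[0][1]
    for jump_time, state in path:
        if jump_time > t:
            break
        current = state
    return current

def occupation_fraction(path, t_end, target):
    """Fraction of [0, t_end] that the path spends in state `target`."""
    total = 0.0
    for k, (start, state) in enumerate(path):
        end = path[k + 1][0] if k + 1 < len(path) else t_end
        if state == target:
            total += end - start
    return total / t_end

# Hypothetical two-state example: the chain alternates between 0 and 1.
P = [[0.0, 1.0], [1.0, 0.0]]
lam = [1.0, 2.0]
path = simulate_ctmc(P, lam, y0=0, t_end=5000.0)
frac0 = occupation_fraction(path, 5000.0, 0)
```

In this example the mean sojourn times are $1$ and $1/2$, so over a long horizon `frac0` should be close to $1/(1 + 1/2) = 2/3$.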

15.1.2 Kolmogorov's differential equations

The functions
\[ p_{ij}(t) \equiv P\big(X(t) = j \mid X(0) = i\big) \tag{1.4} \]
are called transition functions. To determine these functions from the jump probabilities $\{p_{ij}\}$ and the waiting time parameters $\{\lambda_i\}$, one uses the Chapman-Kolmogorov equations
\[ p_{ij}(t + s) = \sum_{k \in S} p_{ik}(t)\, p_{kj}(s), \quad t, s \ge 0, \tag{1.5} \]
which are an immediate consequence of the Markov property (1.3) and the definition (1.4). In addition to (1.5), one has the continuity condition
\[ \lim_{t \downarrow 0} p_{ij}(t) = \delta_{ij}. \tag{1.6} \]


Under the nonexplosion hypothesis (1.2), it can be shown (Chung (1967), Feller (1966), Karlin and Taylor (1975)) that the $p_{ij}(t)$ are differentiable as functions of $t$ and satisfy Kolmogorov's forward and backward differential equations
\[ p'_{ij}(t) = \sum_{k} p_{ik}(t)\, p'_{kj}(0) \quad \text{(forward)} \tag{1.7a} \]
\[ p'_{ij}(t) = \sum_{k} p'_{ik}(0)\, p_{kj}(t) \quad \text{(backward)} \tag{1.7b} \]
Further, $a_{kj} \equiv p'_{kj}(0)$ can be shown to be $\lambda_k p_{kj}$ for $k \ne j$ and $-\lambda_k$ for $k = j$. The matrix $A \equiv ((a_{ij}))$ is called the infinitesimal matrix or generator of the process $\{X(t) : t \ge 0\}$. If the state space $S$ is finite, i.e., $K < \infty$, then $P(t) \equiv ((p_{ij}(t)))$ can be shown to be
\[ P(t) = \exp(At) \equiv \sum_{n=0}^{\infty} \frac{A^n}{n!}\, t^n. \tag{1.8} \]

15.1.3 Examples
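As a first numerical illustration, the exponential formula (1.8) can be evaluated for a small finite chain by truncating the series. The sketch below uses plain Python lists; the helper names, the number of terms, and the two-state example are assumptions made for illustration only, and a truncated series is adequate only when the entries of $At$ are moderate in size.

```python
import math

def mat_mul(X, Y):
    """Product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def generator(P, lam):
    """A = (a_ij) with a_ij = lam_i * p_ij for i != j and a_ii = -lam_i."""
    n = len(P)
    return [[-lam[i] if i == j else lam[i] * P[i][j] for j in range(n)]
            for i in range(n)]

def transition_matrix(A, t, terms=60):
    """Truncated series sum_{m < terms} (A t)^m / m! approximating exp(At)."""
    n = len(A)
    At = [[A[i][j] * t for j in range(n)] for i in range(n)]
    term = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # (At)^0 / 0!
    total = [row[:] for row in term]
    for m in range(1, terms):
        term = mat_mul(term, At)
        term = [[x / m for x in row] for row in term]      # now (At)^m / m!
        total = [[total[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return total

# Hypothetical two-state chain with rates 1 (from 0 to 1) and 2 (from 1 to 0).
A = generator([[0.0, 1.0], [1.0, 0.0]], [1.0, 2.0])
Pt = transition_matrix(A, 0.7)
closed_form_p00 = 2.0 / 3.0 + math.exp(-3.0 * 0.7) / 3.0
```

For this two-state generator the forward equations can be solved by hand, giving $p_{00}(t) = \frac{2}{3} + \frac{1}{3} e^{-3t}$, which is what `closed_form_p00` encodes for comparison with `Pt[0][0]`.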

Example 15.1.1: (Birth and death process). Here
\[ p_{i,i+1} = \frac{\alpha_i}{\alpha_i + \beta_i}, \quad i \ge 0, \qquad p_{i,i-1} = \frac{\beta_i}{\alpha_i + \beta_i}, \quad i \ge 1, \qquad \lambda_i = \alpha_i + \beta_i, \quad i \ge 0, \]
where $\{\alpha_i, \beta_i\}_{i\ge 0}$ are nonnegative numbers with $\alpha_i$ being the birth rate and $\beta_i$ being the death rate. This has the meaning that given $X(t) = i$, for small $h > 0$, $X(t + h)$ goes up to $(i + 1)$ with probability $\alpha_i h + o(h)$ or goes down to $(i - 1)$ with probability $\beta_i h + o(h)$ or stays at $i$ with probability $1 - (\alpha_i + \beta_i)h + o(h)$. In this case the forward and backward equations become
\[ p'_{ij}(t) = \alpha_{j-1}\, p_{i,j-1}(t) + \beta_{j+1}\, p_{i,j+1}(t) - (\alpha_j + \beta_j)\, p_{ij}(t), \]
\[ p'_{ij}(t) = \alpha_i\, p_{i+1,j}(t) + \beta_i\, p_{i-1,j}(t) - (\alpha_i + \beta_i)\, p_{ij}(t) \]
with initial condition $p_{ij}(0) = \delta_{ij}$.

(a) Pure birth process. A special case of the above is when $\beta_i \equiv 0$ for all $i$. Here $p_{ij}(t) = 0$ if $j < i$, $X(t)$ is a nondecreasing function of $t$, and the jumps are of size one. A further special case of this is when $\alpha_i \equiv \alpha$ for all $i$. In this case, the process waits in each state a random length of time with mean $\alpha^{-1}$ and jumps one step higher. It can be verified that in this case, the solution of Kolmogorov's differential equations (1.7a) and (1.7b) is given by
\[ p_{ij}(t) = e^{-\alpha t}\, \frac{(\alpha t)^{j-i}}{(j - i)!}, \quad j \ge i. \tag{1.9} \]

From this it is easy to conclude that $\{X(t) : t \ge 0\}$ is a Levy process, i.e., it has stationary and independent increments: for $0 = t_0 \le t_1 \le t_2 \le \cdots \le t_r < \infty$, the increments $Y_j = X(t_j) - X(t_{j-1})$, $j = 1, 2, \ldots, r$, are independent and the distribution of $Y_j$ depends only on $(t_j - t_{j-1})$. Further, in this case, $X(t) - X(0)$ has a Poisson distribution with mean $\alpha t$. This $\{X(t) : t \ge 0\}$ is called a Poisson process with intensity parameter $\alpha$.

Another special case is the linear birth and death process. Here $\alpha_i = i\alpha$, $\beta_i = i\beta$ for $i = 0, 1, 2, \ldots$. The pure death process has parameters $\alpha_i \equiv 0$ for $i \ge 0$.

A number of queuing processes can be modeled as a birth and death process and more generally as a continuous time Markov chain. For example, an M/M/s queuing system is one in which customers arrive at a service facility at the jump times of a Poisson process (with parameter $\alpha$) and there are $s$ servers with service time at each server being exponential with the same mean ($= \beta^{-1}$). The number $X(t)$ of customers in the system at time $t$ evolves as a birth and death process with parameters $\alpha_i \equiv \alpha$ for $i \ge 0$ and $\beta_i = i\beta$ for $0 \le i \le s$, $= s\beta$ for $i > s$.

Example 15.1.2: (Markov branching processes). Here $X(t)$ is the population size in a process where each particle lives a random length of time with exponential distribution with mean $\alpha^{-1}$ and on death creates a random number of new particles with offspring distribution $\{p_j\}_{j\ge 0}$, and all particles evolve independently of each other. This implies that $\lambda_i = i\alpha$, $i \ge 0$, $p_{ij} = p_{j-i+1}$ for $j \ge i - 1$ and $= 0$ for $j < i - 1$, $i \ge 1$, and $p_{00} = 1$. Thus 0 is an absorbing barrier. The random variable $T \equiv \inf\{t : t > 0, X(t) = 0\}$ is called the extinction time. It can be shown that
\[ \sum_{j=0}^{\infty} p_{ij}(t)\, s^j = \Big( \sum_{j=0}^{\infty} p_{1j}(t)\, s^j \Big)^i \quad \text{for } i \ge 0,\ 0 \le s \le 1, \tag{1.10} \]
and also that
\[ F(s, t) \equiv \sum_{j=0}^{\infty} p_{1j}(t)\, s^j \]
satisfies the differential equations
\[ \frac{\partial F(s, t)}{\partial t} = u(s)\, \frac{\partial}{\partial s} F(s, t) \quad \text{(forward equation)} \tag{1.11} \]
\[ \frac{\partial F(s, t)}{\partial t} = u\big(F(s, t)\big) \quad \text{(backward equation)} \tag{1.12} \]
with $F(s, 0) \equiv s$, where
\[ u(s) \equiv \alpha \Big( \sum_{j=0}^{\infty} p_j s^j - s \Big). \tag{1.13} \]
Further, if $q \equiv P(T < \infty \mid X(0) = 1)$ is the extinction probability, then $q$ is the smallest solution in $[0, 1]$ of the equation $q = \sum_{j=0}^{\infty} p_j q^j$ (cf. Chapter 18). (See Athreya and Ney (2004), Chapter III, p. 106.)

Example 15.1.3: (Compound Poisson processes). Let $\{L_i\}_{i\ge 0}$ and $\{\xi_i\}_{i\ge 1}$ be two independent sequences of random variables such that $\{L_i\}_{i\ge 0}$ are iid exponential with mean $\alpha^{-1}$ and $\{\xi_i\}_{i\ge 1}$ are iid integer valued random variables with distribution $\{p_j\}$. Let
\[ X(t) = \begin{cases} 0 & 0 \le t < L_0 \\ 1 & L_0 \le t < L_0 + L_1 \\ \ \vdots & \\ k & L_0 + \cdots + L_{k-1} \le t < L_0 + \cdots + L_k \\ \ \vdots & \end{cases} \]
and let
\[ Y(t) = \sum_{i=1}^{X(t)} \xi_i, \quad t \ge 0. \tag{1.14} \]
Then $\{Y(t) : t \ge 0\}$ is a continuous time Markov chain with state space $S \equiv \{0, \pm 1, \pm 2, \ldots\}$ and jump probabilities $p_{ij} = P(\xi_1 = j - i) = p_{j-i}$. It is also a Levy process. It is called a compound Poisson process with jump rate $\alpha$ and jump distribution $\{p_j\}$. If $p_1 \equiv 1$ this reduces to the Poisson process case.

15.1.4 Limit theorems

To investigate what happens to pij (t) ≡ P (X(t) = j | X(0) = i) as t → ∞, one needs to assume that the embedded chain {yn }n≥0 is irreducible and recurrent. This implies that for any i0 the random variable T = min{t : t > L0 , X(t) = i0 } is ﬁnite w.p. 1. Further, the process, starting from i0 , returns to i0 inﬁnitely often and hence by the Markov property is regenerative in the sense that the excursions between consecutive returns to i0 are iid. One can use this, laws of large numbers and renewal theory (cf. Section 8.5) to arrive at the following:


Theorem 15.1.1: Let $P = \{p_{ij}\}$ be irreducible and recurrent and $0 < \lambda_i < \infty$ for all $i$ in $S$. Let there exist a probability distribution $\{\pi_i\}$ such that
\[ \sum_{i \in S} \pi_i a_{ij} = 0 \quad \text{for all } j, \tag{1.15} \]
where
\[ a_{ij} = \begin{cases} \lambda_i p_{ij} & i \ne j \\ -\lambda_i & i = j. \end{cases} \]
Then
(i) $\{\pi_j\}$ is stationary for $\{p_{ij}(t)\}$, i.e., for all $t \ge 0$, $\sum_{i \in S} \pi_i p_{ij}(t) = \pi_j$ for all $j$,
(ii) for all $i, j$,
\[ \lim_{t \to \infty} p_{ij}(t) = \pi_j, \tag{1.16} \]
and hence $\{\pi_j\}$ is the unique stationary distribution,
(iii) for any function $h : S \to R$ such that $\sum_{j \in S} |h(j)| \pi_j < \infty$,
\[ \lim_{t \to \infty} \frac{1}{t} \int_0^t h\big(X(u)\big)\, du = \sum_{j \in S} h(j) \pi_j \quad \text{w.p. 1} \tag{1.17} \]
for any initial distribution of $X(0)$.

Note that (1.16) holds without any assumption of aperiodicity on $P \equiv ((p_{ij}))$. A sufficient condition for a probability distribution $\{\pi_j\}$ to be a stationary distribution is the so-called detailed balance condition
\[ \pi_k a_{kj} = \pi_j a_{jk}. \tag{1.18} \]
One can use this for birth and death chains on a finite state space $S \equiv \{0, 1, 2, \ldots, N\}$, $N < \infty$, to conclude that the stationary distribution is given by
\[ \pi_n = \frac{\alpha_{n-1} \alpha_{n-2} \cdots \alpha_0}{\beta_n \beta_{n-1} \cdots \beta_1}\, \pi_0 \tag{1.19} \]
provided $\alpha_i > 0$ for all $0 \le i \le N-1$, $\beta_i > 0$ for all $1 \le i \le N$ and $\alpha_N = 0$, $\beta_0 = 0$. A necessary and sufficient condition for equilibrium, i.e., the existence of a stationary distribution, when $N = \infty$ is
\[ \sum_{n=1}^{\infty} \frac{\alpha_{n-1} \cdots \alpha_0}{\beta_n \cdots \beta_1} < \infty. \tag{1.20} \]


This yields in the M/M/s case with arrival rate $\alpha$ and service rate $\beta$ (i.e., $\alpha_i \equiv \alpha$ for $i \ge 0$, $\beta_i \equiv i\beta$ for $1 \le i \le s$, $= s\beta$ for $i > s$) the necessary and sufficient condition for equilibrium that the traffic intensity
\[ \rho \equiv \frac{\alpha}{s\beta} < 1, \tag{1.21} \]
i.e., the mean number of arrivals per unit time be less than the mean number of persons served per unit time. For further discussion and results, see the books of Karlin and Taylor (1975) and Durrett (2001).
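The product formula (1.19) is easy to evaluate numerically. The following sketch computes the stationary distribution of a finite birth and death chain and applies it to an M/M/s queue; truncating the queue at a finite `capacity` is an assumption made here only to keep the state space finite, and all function names are hypothetical.

```python
def birth_death_stationary(alpha, beta):
    """Stationary distribution (1.19) on {0, ..., N}.

    alpha[i] is the birth rate in state i (i = 0, ..., N-1) and beta[i] is
    the death rate in state i + 1, i.e. beta[i] stands for beta_{i+1}.
    """
    weights = [1.0]                            # pi_0 / pi_0
    for a, b in zip(alpha, beta):
        weights.append(weights[-1] * a / b)    # pi_n / pi_0 from pi_{n-1} / pi_0
    total = sum(weights)
    return [w / total for w in weights]

def mms_rates(arrival, service, servers, capacity):
    """Birth and death rates of an M/M/s queue truncated at `capacity`."""
    alpha = [arrival] * capacity
    beta = [min(i, servers) * service for i in range(1, capacity + 1)]
    return alpha, beta

def traffic_intensity(arrival, service, servers):
    """rho = alpha / (s * beta), as in (1.21)."""
    return arrival / (servers * service)

alpha, beta = mms_rates(arrival=1.0, service=2.0, servers=1, capacity=200)
pi = birth_death_stationary(alpha, beta)
rho = traffic_intensity(1.0, 2.0, 1)
```

With one server and $\rho = 1/2$ the weights in (1.19) are $\rho^n$, so `pi[n]` is essentially $(1 - \rho)\rho^n$; the truncation at 200 customers changes this only negligibly because $\rho < 1$.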

15.2 Brownian motion

Definition 15.2.1: A real valued stochastic process $\{B(t) : t \ge 0\}$ is called standard Brownian motion (SBM) if it satisfies
(i) $B(0) = 0$,
(ii) $B(t)$ has $N(0, t)$ distribution, for each $t \ge 0$,
(iii) it is a Levy process, i.e., it has stationary independent increments.

It follows that $\{B(t) : t \ge 0\}$ is a Gaussian process (i.e., the finite dimensional distributions are Gaussian) with mean function $m(t) \equiv 0$ and covariance function $c(s, t) = \min(s, t)$. It can be shown that the trajectories are continuous w.p. 1. Thus, Brownian motion is a Gaussian process, has continuous trajectories and has stationary independent increments (and hence is Markovian). These features make it a very useful building block for modeling many real world phenomena such as the movement of pollen (which was studied by the botanist Robert Brown, and hence the name Brownian motion), the movement of a tagged particle in a liquid subject to the bombardment of the molecules of the liquid (studied by Einstein and Smoluchowski), and the fluctuations in stock market prices (studied by the French economist Bachelier).

15.2.1 Construction of SBM

Let $\{\eta_i\}_{i\ge 1}$ be iid $N(0, 1)$ random variables on some probability space $(\Omega, \mathcal{F}, P)$. Let $\{\phi_i(\cdot)\}_{i\ge 1}$ be the sequence of Haar functions on $[0, 1]$ defined by the doubly indexed collection
\[ H_{00}(t) \equiv 1, \qquad H_{11}(t) = \begin{cases} 1 & \text{on } [0, \tfrac12) \\ -1 & \text{on } [\tfrac12, 1] \end{cases} \]
and for $n \ge 1$
\[ H_{n,j}(t) = \begin{cases} 2^{\frac{n-1}{2}} & \text{for } t \text{ in } \big[\frac{j-1}{2^n}, \frac{j}{2^n}\big) \\ -2^{\frac{n-1}{2}} & \text{for } t \text{ in } \big[\frac{j}{2^n}, \frac{j+1}{2^n}\big) \\ 0 & \text{otherwise,} \end{cases} \qquad j = 1, 3, \ldots, 2^n - 1. \]
It is known that this family is a complete orthonormal basis for $L^2([0, 1])$. Let
\[ B_N(t, \omega) \equiv \sum_{i=1}^{N} \eta_i(\omega) \int_0^t \phi_i(u)\, du. \tag{2.1} \]
Then, for each $N$, $\{B_N(t, \omega) : 0 \le t \le 1\}$ is a Gaussian process on $(\Omega, \mathcal{F}, P)$ with mean function $m_N(t) \equiv 0$ and covariance function $c_N(s, t) = \sum_{i=1}^{N} \big(\int_0^t \phi_i(u)\, du\big)\big(\int_0^s \phi_i(u)\, du\big)$ and the property that the trajectories $t \to B_N(t, \omega)$ are continuous in $t$ for each $\omega$ in $\Omega$. It can be shown (Problem 15.11) that w.p. 1 the sequence $\{B_N(\cdot, \omega)\}_{N\ge 1}$ is a Cauchy sequence in the Banach space $C[0, 1]$ of continuous real valued functions on $[0, 1]$ with the supremum metric. Hence, $\{B_N(\cdot, \omega)\}_{N\ge 1}$ converges w.p. 1 to a limit element $B(\cdot, \omega)$ which will be a Gaussian process with continuous trajectories and mean and covariance functions $m(t) \equiv 0$ and
\[ c(s, t) = \sum_{i=1}^{\infty} \Big(\int_0^t \phi_i(u)\, du\Big)\Big(\int_0^s \phi_i(u)\, du\Big) = \int_0^1 I_{[0,t]}(u)\, I_{[0,s]}(u)\, du = \min(s, t), \]
respectively. (See Section 2.3 of Karatzas and Shreve (1991).) Thus,
\[ B(t, \omega) \equiv \sum_{i=1}^{\infty} \eta_i(\omega) \int_0^t \phi_i(u)\, du \tag{2.2} \]
is a well-defined stochastic process for $0 \le t \le 1$ that has all the properties claimed above and is called SBM on $[0, 1]$. Let $\{B^{(j)}(t, \omega) : 0 \le t \le 1\}_{j\ge 1}$ be iid copies of $\{B(t, \omega) : 0 \le t \le 1\}$ as defined above. Now set
\[ B(t, \omega) \equiv \begin{cases} B^{(1)}(t, \omega), & 0 \le t \le 1 \\ B^{(1)}(1, \omega) + B^{(2)}(t - 1, \omega), & 1 \le t \le 2 \\ \ \vdots & \\ B(n, \omega) + B^{(n+1)}(t - n, \omega), & n \le t \le n + 1,\ n = 1, 2, \ldots \end{cases} \tag{2.3} \]
Then $\{B(t, \omega) : t \ge 0\}$ satisfies
(i) $B(0, \omega) = 0$,
(ii) $t \to B(t, \omega)$ is continuous in $t$ for all $\omega$,
(iii) it is Gaussian with mean function $m(t) \equiv 0$ and covariance function $c(s, t) \equiv \min(s, t)$,
i.e., it is SBM on $[0, \infty)$. From now on the symbol $\omega$ may be suppressed.
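The convergence of the partial covariance sums $c_N(s, t)$ to $\min(s, t)$ can be checked numerically. The sketch below assumes the Haar functions are indexed as $H_{00}$ together with $H_{n,j}$ for $n \ge 1$ and odd $j \le 2^n - 1$, and uses the fact that the integral of $H_{n,j}$ over $[0, t]$ is a tent-shaped (Schauder) function. The function names are hypothetical.

```python
def schauder(n, j, t):
    """Integral over [0, t] of the Haar function H_{n,j}, n >= 1, j odd."""
    h = 2.0 ** ((n - 1) / 2.0)               # height of H_{n,j} on its support
    left, mid, right = (j - 1) / 2 ** n, j / 2 ** n, (j + 1) / 2 ** n
    if t <= left:
        return 0.0
    if t <= mid:
        return h * (t - left)                # rising part of the tent
    if t <= right:
        return h * (right - t)               # falling part of the tent
    return 0.0                               # the +h and -h pieces cancel

def partial_covariance(s, t, levels):
    """Covariance sum using H_00 and all H_{n,j} with 1 <= n <= levels."""
    total = s * t                            # H_00 = 1 integrates to t
    for n in range(1, levels + 1):
        for j in range(1, 2 ** n, 2):        # j = 1, 3, ..., 2^n - 1
            total += schauder(n, j, s) * schauder(n, j, t)
    return total

c_dyadic = partial_covariance(0.25, 0.75, levels=2)
c_general = partial_covariance(0.3, 0.3, levels=12)
```

For dyadic arguments the sum is already exact after finitely many levels, since the indicator of $[0, t]$ then lies in the span of the Haar functions used; for other arguments the error shrinks as `levels` grows.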

15.2.2 Basic properties of SBM

(i) Scaling properties. Fix $c > 0$ and set
\[ B_c(t) \equiv \frac{1}{\sqrt{c}}\, B(ct), \quad t \ge 0. \tag{2.4} \]
Then $\{B_c(t)\}_{t\ge 0}$ is also an SBM. This is easily verified by noting that $B_c(0) = 0$, $B_c(t) \sim N(0, t)$, $\mathrm{Cov}(B_c(t), B_c(s)) = \frac{1}{c} \min\{ct, cs\} = \min(t, s)$, and that $\{B_c(\cdot)\}$ is a Levy process and the trajectories are continuous w.p. 1.

(ii) Reflection. If $\{B(\cdot)\}$ is SBM, then so is $\{-B(\cdot)\}$. This follows from the symmetry of the mean zero Gaussian distribution.

(iii) Time inversion. Let
\[ \tilde B(t) = \begin{cases} t B(\tfrac{1}{t}) & \text{for } t > 0 \\ 0 & \text{for } t = 0. \end{cases} \tag{2.5} \]
Then $\{\tilde B(t) : t \ge 0\}$ is also an SBM. The facts that $\{\tilde B(t) : t > 0\}$ is a Gaussian process with mean and covariance function same as SBM and that the trajectories are continuous in the open interval $(0, \infty)$ are straightforward to verify. It only remains to verify that $\lim_{t \to 0} \tilde B(t) = 0$ w.p. 1. Fix $0 < t_1 < t_2$. Then $\{\tilde B(t) : t_1 \le t \le t_2\}$ is a Gaussian process with mean function 0 and covariance function $\min(s, t)$ and has continuous trajectories, i.e., it has the same distribution as $\{B(t) : t_1 \le t \le t_2\}$. Thus $\tilde X_1 \equiv \sup\{|\tilde B(t)| : t_1 \le t \le t_2\}$ has the same distribution as $X_1 \equiv \sup\{|B(t)| : t_1 \le t \le t_2\}$. Since both converge as $t_1 \downarrow 0$ to $\tilde X_2(t_2) \equiv \sup\{|\tilde B(t)| : 0 < t \le t_2\}$ and $X_2(t_2) \equiv \sup\{|B(t)| : 0 < t \le t_2\}$, respectively, these two have the same distribution. Again, since $\tilde X_2(t_2)$ and $X_2(t_2)$ both converge as $t_2 \downarrow 0$ to $\tilde X_2 \equiv \limsup_{t \downarrow 0} |\tilde B(t)|$ and $X_2 \equiv \limsup_{t \downarrow 0} |B(t)|$, respectively, $\tilde X_2$ and $X_2$ have the same distribution. But $X_2 = 0$ w.p. 1 since $B(t)$ is continuous in $[0, \infty)$. Thus $\tilde X_2 = 0$ w.p. 1, i.e., $\lim_{t \to 0} \tilde B(t) = 0$ w.p. 1.

(iv) Translation invariance (after a fixed time $t_0$). Fix $t_0 > 0$ and set
\[ B_{t_0}(t) = B(t + t_0) - B(t_0), \quad t \ge 0. \tag{2.6} \]
Then $\{B_{t_0}(t)\}_{t\ge 0}$ is also an SBM. This follows from the stationary independent increments property.

(v) Translation invariance (after a stopping time $T_0$). A random variable $T(\omega)$ with values in $[0, \infty)$ is called a stopping time w.r.t. the SBM $\{B(t) : t \ge 0\}$ if for each $t$ in $[0, \infty)$ the event $\{T \le t\}$ is in the $\sigma$-algebra $\mathcal{F}_t \equiv \sigma(B(s) : s \le t)$ generated by the trajectory $B(s)$ for $0 \le s \le t$. Examples of stopping times are
\[ T_a = \min\{t : t \ge 0,\ B(t) \ge a\} \tag{2.7} \]
for $0 < a < \infty$, and
\[ T_{a,b} = \min\{t : t > 0,\ B(t) \notin (a, b)\} \tag{2.8} \]
where $a < 0 < b$. Let $T_0$ be a stopping time w.r.t. SBM $\{B(t) : t \ge 0\}$. Let
\[ B_{T_0}(t) \equiv B(T_0 + t) - B(T_0), \quad t \ge 0. \tag{2.9} \]
Then $\{B_{T_0}(t)\}_{t\ge 0}$ is again an SBM. Here is an outline of the proof.
(a) The case of deterministic $T_0$ is covered by (iv) above.
(b) If $T_0$ takes only countably many values, say $\{a_j\}_{j\ge 1}$, then it is not difficult to show that conditioned on the event $T_0 = a_j$, the process $\{B(T_0 + t) - B(T_0) : t \ge 0\}$ is SBM. Thus the unconditional distribution of $\{B_{T_0}(t) : t \ge 0\}$ is again that of an SBM.
(c) Next, given a general stopping time $T_0$, one can approximate it by a sequence $T_n$ of stopping times where for each $n$, $T_n$ is discrete. By continuity of trajectories, $\{B_{T_0}(t) : t \ge 0\}$ has the same distribution as the limit of $\{B_{T_n}(t) : t \ge 0\}$ as $n \to \infty$.

A consequence of the above two properties is that SBM has the Markov and the strong Markov properties. That is, for each fixed $t_0$, the distribution of $B(t)$, $t \ge t_0$, given $B(s) : s \le t_0$ depends only on $B(t_0)$ (Markov property), and for each stopping time $T_0$, the distribution of $B(t) : t \ge T_0$ given $B(s) : s \le T_0$ depends only on $B(T_0)$ (strong Markov property).

(vi) The reflection principle. Fix $a > 0$ and let $T_a = \inf\{t : B(t) \ge a\}$ where $\{B(t) : t \ge 0\}$ is SBM. For any $t > 0$, $a > 0$,
\[ P(T_a \le t) = P(T_a \le t,\ B(t) > a) + P(T_a \le t,\ B(t) < a). \]
Now, by continuity of the trajectory, $B(T_a) = a$ on $\{T_a \le t\}$. Thus
\[ \begin{aligned} P\big(T_a \le t,\ B(t) < a\big) &= P\big(T_a \le t,\ B(t) < B(T_a)\big) \\ &= P\big(T_a \le t,\ B(t) - B(T_a) < 0\big) \\ &= P\big(T_a \le t,\ B(t) - B(T_a) > 0\big) \\ &= P\big(T_a \le t,\ B(t) > a\big). \end{aligned} \]

To see this note that by (v), $\{B(T_a + h) - B(T_a) : h \ge 0\}$ is independent of $T_a$ and has the same distribution as an SBM, and hence $\{-(B(T_a + h) - B(T_a)) : h \ge 0\}$ is also independent of $T_a$ and has the same distribution as an SBM. Thus,
\[ P(T_a \le t) = 2P\big(T_a \le t,\ B(t) > a\big) = 2P\big(B(t) > a\big) = 2\Big(1 - \Phi\Big(\frac{a}{\sqrt{t}}\Big)\Big) \tag{2.10} \]
where $\Phi(\cdot)$ is the standard $N(0, 1)$ cdf. The above argument is known as the reflection principle as it asserts that the path
\[ \tilde B(t) \equiv \begin{cases} B(t), & t \le T_a \\ B(T_a) - \big(B(t) - B(T_a)\big), & t > T_a \end{cases} \tag{2.11} \]
obtained by reflecting the original path on the line $y = a$ from the point $(T_a, a)$ for $t > T_a$ yields a path that has the same distribution as the original path. Thus the probability density function of $T_a$ is
\[ f_{T_a}(t) = 2\phi\Big(\frac{a}{\sqrt{t}}\Big)\, \frac{a}{2}\, \frac{1}{t^{3/2}} = \frac{a}{\sqrt{2\pi}}\, \frac{1}{t^{3/2}}\, e^{-\frac{a^2}{2t}}, \tag{2.12} \]
where $\phi$ is the standard normal density, implying that $E T_a^p < \infty$ for $p < 1/2$ and $E T_a^p = \infty$ for $p \ge 1/2$. Also, by the strong Markov property the process $\{T_a : a \ge 0\}$ is a process with stationary independent increments, i.e., a Levy process. It is also a stable process of order $1/2$.

One can use this calculation of $P(T_a \le t)$ to show that the probability that the SBM crosses zero in the interval $(t_1, t_2)$ is $\frac{2}{\pi} \arccos \sqrt{t_1 / t_2} = 1 - \frac{2}{\pi} \arcsin \sqrt{t_1 / t_2}$ (Problem 15.12). If $M(t) \equiv \sup\{B(s) : 0 \le s \le t\}$, then for $a > 0$
\[ P\big(M(t) > a\big) = P(T_a \le t) = 2P\big(B(t) > a\big) = P\big(|B(t)| > a\big). \tag{2.13} \]
It follows that $M(t)$ has the same distribution as $|B(t)|$ and hence has finite moments of all orders. In fact, $E\big(e^{\theta M(t)}\big) < \infty$ for all $\theta > 0$.
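The distribution function (2.10) and the density (2.12) of the hitting time $T_a$ can be checked against each other numerically. The sketch below integrates the density with a simple midpoint rule; the function names and the step count are choices made here for illustration only.

```python
import math

def normal_cdf(x):
    """Standard normal cdf, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def hitting_cdf(a, t):
    """P(T_a <= t) = 2 * (1 - Phi(a / sqrt(t))), as in (2.10)."""
    return 2.0 * (1.0 - normal_cdf(a / math.sqrt(t)))

def hitting_density(a, t):
    """Density of T_a as in (2.12)."""
    return a / math.sqrt(2.0 * math.pi) * t ** -1.5 * math.exp(-a * a / (2.0 * t))

def integrate_density(a, t, steps=20_000):
    """Midpoint rule for the integral of the density over (0, t]."""
    h = t / steps
    return h * sum(hitting_density(a, (k + 0.5) * h) for k in range(steps))

cdf_value = hitting_cdf(1.0, 2.0)
integral_value = integrate_density(1.0, 2.0)
```

The two values should agree up to the discretization error of the midpoint rule. The midpoint rule is convenient here because it never evaluates the density at $t = 0$.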

15.2.3 Some related processes

(i) Let $\{B(t) : t \ge 0\}$ be an SBM. For $\mu$ in $(-\infty, \infty)$ and $\sigma > 0$, the process $B_{\mu,\sigma}(t) \equiv \mu t + \sigma B(t)$, $t \ge 0$, is called Brownian motion with constant drift $\mu$ and constant diffusion $\sigma$.

(ii) Let $B_0(t) = B(t) - tB(1)$, $0 \le t \le 1$. The process $\{B_0(t) : 0 \le t \le 1\}$ is called the Brownian bridge. It is a Gaussian process with mean function 0 and covariance function $\min(s, t) - st$ and has continuous trajectories that vanish both at 0 and 1.

(iii) Let $Y(t) = e^{-t} B(e^{2t})$, $-\infty < t < \infty$. Then $\{Y(t) : -\infty < t < \infty\}$ is a Gaussian process with mean function 0 and covariance function $c(s, t) = e^{-(s+t)} e^{2s} = e^{s-t}$ if $s < t$. This process is called the Ornstein-Uhlenbeck process. It is to be noted that for each $t$, $Y(t) \sim N(0, 1)$ and in fact $\{Y(t) : -\infty < t < \infty\}$ is a strictly stationary process and is a Markov process as well.
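The two covariance functions quoted above follow in one line each from $\mathrm{Cov}(B(s), B(t)) = \min(s, t)$. A worked version, written out here as a reading aid, is:

```latex
% Brownian bridge: for 0 <= s <= t <= 1, expand by bilinearity.
\begin{align*}
\mathrm{Cov}\big(B_0(s), B_0(t)\big)
  &= \mathrm{Cov}\big(B(s) - sB(1),\; B(t) - tB(1)\big) \\
  &= \min(s,t) - t\min(s,1) - s\min(1,t) + st\,\mathrm{Var}\big(B(1)\big) \\
  &= s - st - st + st \;=\; \min(s,t) - st .
\end{align*}
% Ornstein-Uhlenbeck process: for s < t, so that e^{2s} < e^{2t}.
\begin{align*}
\mathrm{Cov}\big(Y(s), Y(t)\big)
  &= e^{-s} e^{-t}\, \mathrm{Cov}\big(B(e^{2s}), B(e^{2t})\big)
   = e^{-(s+t)} \min\big(e^{2s}, e^{2t}\big)
   = e^{-(s+t)} e^{2s} = e^{s-t} .
\end{align*}
```

Since the Ornstein-Uhlenbeck covariance depends on $s$ and $t$ only through $t - s$, and the process is Gaussian with mean zero, strict stationarity follows.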

15.2.4 Some limit theorems

Let $\{\xi_i\}_{i\ge 1}$ be iid random variables with $E\xi_1 = 0$, $E\xi_1^2 = 1$. Let $S_0 = 0$, $S_n = \sum_{i=1}^{n} \xi_i$, $n \ge 1$. Let $B_n(j/n) = \frac{1}{\sqrt{n}} S_j$, $j = 0, 1, 2, \ldots, n$, and let $\{B_n(t) : 0 \le t \le 1\}$ be obtained by linear interpolation from the values at $j/n$ for $j = 0, 1, 2, \ldots, n$. Then for each $n$, $\{B_n(t) : 0 \le t \le 1\}$ is a random continuous trajectory and hence is a random element of the metric space of continuous real valued functions on $[0, 1]$ that are zero at zero with the metric
\[ \rho(f, g) \equiv \sup\{|f(t) - g(t)| : 0 \le t \le 1\}. \tag{2.14} \]
Let $\mu_n \equiv P B_n^{-1}$ be the induced probability measure on $C[0, 1]$. The following is a generalization of the central limit theorem as noted in Chapter 11.

Theorem 15.2.1: (Donsker). In the space $(C[0, 1], \rho)$ the sequence of probability measures $\{\mu_n \equiv P B_n^{-1}\}_{n\ge 1}$ converges weakly to $\mu$, the probability distribution of the SBM. That is, for any bounded continuous function $h$ from $C[0, 1]$ to $R$, $\int h\, d\mu_n \to \int h\, d\mu$.

For a proof, see Billingsley (1968).

Corollary 15.2.2: For any continuous functional $T$ on $(C[0, 1], \rho)$ to $R^k$, $k < \infty$, the distribution of $T(B_n)$ converges to that of $T(B)$. In particular, the joint distribution of $\big( \max_{0 \le j \le n} \frac{S_j}{\sqrt{n}},\ \max_{0 \le j \le n} \frac{|S_j|}{\sqrt{n}} \big)$ converges weakly to that of $\big( \max_{0 \le u \le 1} B(u),\ \max_{0 \le u \le 1} |B(u)| \big)$.
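The first coordinate of the corollary can be illustrated by simulation: for a simple $\pm 1$ random walk, the distribution of $\max_{0 \le j \le n} S_j / \sqrt{n}$ should be close, for large $n$, to that of $\max_{0 \le u \le 1} B(u)$, whose tail is $2(1 - \Phi(a))$ by (2.13) with $t = 1$. The sketch below is a rough Monte Carlo check; the step distribution, sample sizes and function names are assumptions made here.

```python
import math
import random

def scaled_walk_maximum(n, rng):
    """max_{0 <= j <= n} S_j / sqrt(n) for a simple +-1 random walk."""
    s, best = 0, 0
    for _ in range(n):
        s += 1 if rng.random() < 0.5 else -1   # xi_i = +-1 has mean 0, variance 1
        best = max(best, s)
    return best / math.sqrt(n)

def estimate_exceedance(a, n, reps, seed=0):
    """Monte Carlo estimate of P(max_j S_j / sqrt(n) > a)."""
    rng = random.Random(seed)
    hits = sum(scaled_walk_maximum(n, rng) > a for _ in range(reps))
    return hits / reps

def limit_exceedance(a):
    """P(max_{0 <= u <= 1} B(u) > a) = 2 * (1 - Phi(a)) = 1 - erf(a / sqrt(2))."""
    return 1.0 - math.erf(a / math.sqrt(2.0))

estimate = estimate_exceedance(a=1.0, n=400, reps=2000)
limit = limit_exceedance(1.0)
```

For moderate $n$ the estimate typically sits a little below the limiting value, because the walk moves on a lattice; the gap shrinks as $n$ grows.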


There are similar limit theorems asserting the convergence of the empirical processes to the Brownian bridge with applications to the limit distribution of the Kolmogorov-Smirnov statistics (see Billingsley (1968)).

Theorem 15.2.3: (Laws of large numbers).
\[ \lim_{t \to \infty} \frac{B(t)}{t} = 0 \quad \text{w.p. 1.} \tag{2.15} \]

Proof: By the time inversion property (2.5), $\lim_{t \to 0} \tilde B(t) = 0$ w.p. 1. But
\[ \lim_{t \to 0} \tilde B(t) = \lim_{t \to 0} t B(1/t) = \lim_{\tau \to \infty} \frac{B(\tau)}{\tau}. \qquad \Box \]

Theorem 15.2.4: (Kallianpur-Robbins). Let $f : R \to R$ be integrable with respect to Lebesgue measure. Then
\[ \frac{1}{\sqrt{t}} \int_0^t f\big(B(u)\big)\, du \longrightarrow^d \Big( \int_{-\infty}^{\infty} f(u)\, du \Big) Z \tag{2.16} \]
where $Z$ is a random variable with density $\frac{1}{\pi \sqrt{z(1-z)}}$ in $[0, 1]$.

This is a special case of the Darling-Kac formula for Markov processes that can be established here using the regenerative property of SBM due to the fact that starting from 0, SBM will hit level 1 at some time $T_1$ and from there hit level 0 at a later time $\tau_1$. And this can be repeated to produce a sequence of times $0, \tau_1, \tau_2, \ldots$ such that the excursions $\{B(t) : \tau_i \le t < \tau_{i+1}\}_{i\ge 1}$ are iid. The sequence $\{\tau_i\}_{i\ge 1}$ is a renewal sequence with life time distribution $\tau_1$ having a regularly varying tail of order $1/2$ and hence infinite mean. One can appeal now to results from renewal theory to complete the proof (see Feller (1966) and Athreya (1986)).

15.2.5

Some sample path properties of SBM

The sample paths $t \mapsto B(t,\omega)$ of SBM are continuous w.p. 1. It turns out that they are no smoother than this: they are not differentiable, nor are they of bounded variation on finite intervals. It will be shown now that w.p. 1 Brownian sample paths are nowhere differentiable and that the quadratic variation over any finite interval is finite and nonrandom. (See also Karatzas and Shreve (1991).)

(i) Nondifferentiability of $B(\cdot,\omega)$ in [0,1]

Let
$$A_{n,k} = \Big\{\omega : \sup_{|t-s|\le 3/n} \frac{|B(t,\omega)-B(s,\omega)|}{|t-s|} \le k \ \text{ for some } 0 \le s \le 1\Big\}.$$
Let $Z_{r,n} = \big|B\big(\tfrac{r+1}{n}\big) - B\big(\tfrac{r}{n}\big)\big|$, $r = 0, 1, 2, \ldots, n-1$. Let
$$B_{n,k} = \Big\{\omega : \max\big(Z_{r,n}, Z_{r+1,n}, Z_{r+2,n}\big) \le \frac{6k}{n} \ \text{ for some } r\Big\}.$$
It can be verified that $A_{n,k} \subset B_{n,k}$. Now
$$P(B_{n,k}) \le \sum_{r=0}^{n-1} P\Big(\max\big(Z_{r,n}, Z_{r+1,n}, Z_{r+2,n}\big) \le \frac{6k}{n}\Big) \le n\Big(P\Big(|Z_{0,n}| \le \frac{6k}{n}\Big)\Big)^3 = n\Big(P\Big(\frac{|Z_{0,n}|}{1/\sqrt{n}} \le \frac{6k}{\sqrt{n}}\Big)\Big)^3 \le n\Big(\frac{\text{Const}}{\sqrt{n}}\Big)^3 \to 0 \quad\text{as } n \to \infty,$$
since $B(1/n)\big/(1/\sqrt{n}) \sim N(0,1)$. Thus for each $k < \infty$, $P(A_{n,k}) \le \text{Const}/\sqrt{n}$. This implies
$$\sum_{n=1}^{\infty} P(A_{n^3,k}) < \infty.$$
So by the Borel-Cantelli lemma, only finitely many $A_{n^3,k}$ can happen w.p. 1. The event $A \equiv \{\omega : B(t,\omega)$ is differentiable for at least one $t$ in $[0,1]\}$ is contained in $C \equiv \bigcup_{k\ge 1}\{\omega : \omega \in A_{n^3,k}$ for infinitely many $n\}$ and so $P(A) \le P(C) = 0$.

(ii) Finite quadratic variation of SBM

Let $\eta_{n,j} = B(j2^{-n}) - B\big((j-1)2^{-n}\big)$, $j = 1, 2, \ldots, 2^n$, and
$$\Delta_n \equiv \sum_{j=1}^{2^n} \eta_{n,j}^2. \tag{2.17}$$
Then
$$E\Delta_n = \sum_{j=1}^{2^n} \frac{1}{2^n} = 1.$$
Also, by independence and stationarity of increments,
$$\mathrm{Var}(\Delta_n) = \sum_{j=1}^{2^n} \mathrm{Var}(\eta_{n,j}^2) \le \sum_{j=1}^{2^n} E\eta_{n,j}^4 = \sum_{j=1}^{2^n} \frac{3}{2^{2n}} = \frac{3}{2^n}.$$
Thus $P(|\Delta_n - 1| > \epsilon) \le \mathrm{Var}(\Delta_n)/\epsilon^2$ for any $\epsilon > 0$, and the right side is summable in $n$. This implies, by Borel-Cantelli, $\Delta_n \to 1$ w.p. 1. By definition the quadratic variation is
$$\Delta \equiv \sup\Big\{\sum_{j=1}^{n} \big|B(t_j,\omega) - B(t_{j-1},\omega)\big|^2 : \text{all finite partitions } (t_0, t_1, \ldots, t_n) \text{ of } [0,1]\Big\}. \tag{2.18}$$
It is easy to verify that $\Delta = \lim_n \Delta_n$. Thus $\Delta = 1$ w.p. 1. It follows that w.p. 1 the Brownian motion paths are not of bounded variation (for a continuous function of bounded variation, $\Delta_n \to 0$). By the scaling property of SBM, it follows that the quadratic variation of SBM over $[0,t]$ is $t$ w.p. 1 for any $t > 0$.
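The convergence of the dyadic sums of squared increments to 1 can be observed numerically. The following is a minimal sketch, not part of the text's argument: the function name `dyadic_quadratic_variation` and the seed are illustrative choices, and for simplicity each level draws fresh independent increments rather than refining a single path, so it illustrates the distribution of the sum at each level and not the almost sure limit along one path.

```python
import math
import random

def dyadic_quadratic_variation(level, rng):
    """Sum of squared SBM increments over the 2**level dyadic subintervals of [0, 1].

    Each increment is simulated as an independent N(0, 2**-level) variable.
    """
    step_sd = math.sqrt(2.0 ** (-level))
    return sum(rng.gauss(0.0, step_sd) ** 2 for _ in range(2 ** level))

rng = random.Random(0)  # arbitrary seed, chosen only for reproducibility
for level in (4, 8, 12, 14):
    # The mean is 1 and the variance is of order 2**-level, so the
    # printed values should cluster ever more tightly around 1.
    print(level, dyadic_quadratic_variation(level, rng))
```

Under these assumptions the printed values should approach 1 as the level grows, in line with the variance bound of order $2^{-n}$.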

15.2.6

Brownian motion and martingales

There are three natural martingales associated with Brownian motion.

Theorem 15.2.5: Let $\{B(t) : t \ge 0\}$ be SBM. Then
(i) (Linear martingale) $\{B(t) : t \ge 0\}$ is a martingale.
(ii) (Quadratic martingale) $\{B^2(t) - t : t \ge 0\}$ is a martingale.
(iii) (Exponential martingale) For any real $\theta$, $\{e^{\theta B(t) - \frac{\theta^2}{2}t} : t \ge 0\}$ is a martingale.

Proof: (i) and (ii). Since $B(t) \sim N(0,t)$, $E|B(t)| < \infty$ and $E|B(t)|^2 < \infty$. By the stationary independent increments property, for any $t \ge 0$, $s \ge 0$,
$$E\big(B(t+s) \mid B(u) : u \le t\big) = E\big(B(t+s) - B(t) \mid B(u) : u \le t\big) + B(t) = 0 + B(t) = B(t),$$
establishing (i). Next,
$$E\big(B^2(t+s) \mid B(u) : u \le t\big) = E\big((B(t+s) - B(t))^2 \mid B(u) : u \le t\big) + B^2(t) + 2E\big((B(t+s) - B(t))B(t) \mid B(u) : u \le t\big) = s + B^2(t) + 0,$$
and hence $E\big(B^2(t+s) - (t+s) \mid B(u) : u \le t\big) = B^2(t) - t$, establishing (ii).

(iii)
$$E\big(e^{\theta B(t+s) - \frac{\theta^2}{2}(t+s)} \mid B(u) : u \le t\big) = E\big(e^{\theta(B(t+s) - B(t)) - \frac{\theta^2}{2}s} \mid B(u) : u \le t\big)\, e^{\theta B(t) - \frac{\theta^2}{2}t}.$$
Again by using the fact that $B(t+s) - B(t)$ given $\{B(u) : u \le t\}$ is $N(0,s)$, the first term on the right side becomes 1, proving (iii). □
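The last step of (iii) rests on the moment generating function of a normal variable. A short worked computation, written here for a generic $Z \sim N(0,s)$ (the symbol $Z$ is introduced only for this illustration), runs as follows:

```latex
\begin{align*}
E\,e^{\theta Z}
  &= \int_{-\infty}^{\infty} e^{\theta z}\,\frac{1}{\sqrt{2\pi s}}\,e^{-z^2/(2s)}\,dz \\
  &= e^{\theta^2 s/2}\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi s}}\,e^{-(z-\theta s)^2/(2s)}\,dz
   = e^{\theta^2 s/2},
\end{align*}
% completing the square: theta z - z^2/(2s) = theta^2 s/2 - (z - theta s)^2/(2s).
% Hence E exp(theta Z - theta^2 s/2) = 1, which is the factor that "becomes 1" in (iii).
```

Taking $Z = B(t+s) - B(t)$, which is independent of $\{B(u) : u \le t\}$, gives the conditional expectation used in the proof.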


15.2.7

Some applications

The martingales in Theorem 15.2.5 combined with the optional stopping theorems of Chapter 13 yield the following applications.

(i) Exit probabilities

Let $B(\cdot)$ be SBM. Fix $a < 0 < b$. Let $T_{a,b} = \min\{t : t > 0, B(t) = a \text{ or } b\}$. From Theorem 15.2.5 (i) and the optional sampling theorem, for any $t > 0$,
$$EB(T_{a,b} \wedge t) = EB(0) = 0. \tag{2.19}$$
Also, by continuity, $B(T_{a,b} \wedge t) \to B(T_{a,b})$ as $t \to \infty$. By the bounded convergence theorem, this implies
$$EB(T_{a,b}) = 0, \tag{2.20}$$
i.e., $ap + b(1-p) = 0$ where $p = P(T_a < T_b) = P(B(\cdot)$ reaches $a$ before $b)$. Thus, $p = \frac{b}{b-a}$.

(ii) Mean exit time

From Theorem 15.2.5 (ii) and the optional sampling theorem, $E\big(B^2(T_{a,b} \wedge t) - (T_{a,b} \wedge t)\big) = 0$, i.e.,
$$EB^2(T_{a,b} \wedge t) = E(T_{a,b} \wedge t). \tag{2.21}$$
By using the bounded convergence theorem on the left and the monotone convergence theorem on the right, one may conclude $EB^2(T_{a,b}) = ET_{a,b}$, i.e.,
$$ET_{a,b} = pa^2 + (1-p)b^2 = a^2\frac{b}{b-a} + b^2\frac{(-a)}{b-a} = -ab. \tag{2.22}$$
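The two formulas just derived, $p = b/(b-a)$ and $ET_{a,b} = -ab$, can be checked with exact rational arithmetic. The sketch below is illustrative only; the function names are hypothetical and the inputs are assumed to be integers or fractions with $a < 0 < b$.

```python
from fractions import Fraction

def exit_probability_lower(a, b):
    """p = P(T_a < T_b) for SBM started at 0, assuming a < 0 < b."""
    if not a < 0 < b:
        raise ValueError("need a < 0 < b")
    return Fraction(b) / (Fraction(b) - Fraction(a))

def mean_exit_time(a, b):
    """E T_{a,b} computed as p*a^2 + (1-p)*b^2, the middle expression in (2.22)."""
    p = exit_probability_lower(a, b)
    return p * Fraction(a) ** 2 + (1 - p) * Fraction(b) ** 2

a, b = -1, 3
p = exit_probability_lower(a, b)
print(p)                     # probability of reaching a before b
print(a * p + b * (1 - p))   # E B(T_{a,b}), which (2.20) says is 0
print(mean_exit_time(a, b))  # should agree with -a*b
```

Using `Fraction` avoids rounding, so the identity $pa^2 + (1-p)b^2 = -ab$ can be compared with `==` rather than a tolerance.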

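(iii) The distribution of $T_{a,b}$

From Theorem 15.2.5 (iii) and the optional sampling theorem, for any $t > 0$,
$$E\,e^{\theta B(T_{a,b} \wedge t) - \frac{\theta^2}{2}(T_{a,b} \wedge t)} = 1.$$
By the bounded convergence theorem, this implies
$$E\,e^{\theta B(T_{a,b}) - \frac{\theta^2}{2}T_{a,b}} = 1. \tag{2.23}$$
In particular, if $b = -a$ this reduces to
$$1 = E\big(e^{\theta a - \frac{\theta^2}{2}T_{a,-a}} : T_a < T_{-a}\big) + E\big(e^{-\theta a}e^{-\frac{\theta^2}{2}T_{a,-a}} : T_a > T_{-a}\big) = \frac{1}{2}\big(e^{\theta a} + e^{-\theta a}\big)\,E\big(e^{-\frac{\theta^2}{2}T_{a,-a}}\big),$$
since by symmetry
$$E\big(e^{-\frac{\theta^2}{2}T_{a,-a}} : T_a < T_{-a}\big) = E\big(e^{-\frac{\theta^2}{2}T_{a,-a}} : T_a > T_{-a}\big).$$
Thus, for $\lambda \ge 0$,
$$E\big(e^{-\lambda T_{a,-a}}\big) = 2\big(e^{\sqrt{2\lambda}\,a} + e^{-\sqrt{2\lambda}\,a}\big)^{-1}. \tag{2.24}$$
Similarly, it can be shown that for $\lambda \ge 0$, $a > 0$,
$$E\big(e^{-\lambda T_a}\big) = e^{-\sqrt{2\lambda}\,a}. \tag{2.25}$$

15.2.8

The Black-Scholes formula for stock price option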

Let $X(t)$ denote the price of one unit of a stock S at time $t$. Due to fluctuations in the market place, it is natural to postulate that $\{X(t) : t \ge 0\}$ is a stochastic process. To build an appropriate model consider the discrete time case first. If $X_n$ denotes the unit price at time $n$, it is natural to postulate that $X_{n+1} = X_n y_{n+1}$ where $y_{n+1}$ represents the effects of the market fluctuation in the time interval $[n, n+1)$. This leads to the formula $X_n = X_0 y_1 y_2 \cdots y_n$. If one assumes that $\{y_n\}_{n\ge 1}$ are sufficiently independent, then the exponent in
$$X_n = X_0\, e^{\,n\mu + \sum_{i=1}^{n}(\log y_i - \mu)}$$
is, by the central limit theorem, approximately Gaussian, leading one to consider a model of the form
$$X(t) = X(0)\,e^{\mu t + \sigma B(t)} \tag{2.26}$$
where $\{B(t) : t \ge 0\}$ is SBM. Thus, $\{\log X(t) - \log X(0) : t \ge 0\}$ is postulated to be a Brownian motion with drift $\mu$ and diffusion $\sigma$. In the language of finance, $\mu$ is called the growth rate and $\sigma$ the volatility rate.

The so-called European option allows one to buy the stock at a future time $t_0$ for a unit price of $K$ dollars that is fixed at time 0. If $X(t_0) < K$ then one has the option of not buying, whereas if $X(t_0) \ge K$, then one can buy it at $K$ dollars and sell it immediately at the market price $X(t_0)$ and realize a profit of $X(t_0) - K$. Thus the net revenue from this option is
$$\tilde{X}(t_0) = \begin{cases} 0 & \text{if } X(t_0) \le K \\ X(t_0) - K & \text{if } X(t_0) > K. \end{cases} \tag{2.27}$$


Since the value of money depreciates over time, say at rate $r$, the net revenue's value at time 0 is $\tilde{X}(t_0)e^{-t_0 r}$. So a fair price for this European option is
$$p_0 = E\big(\tilde{X}(t_0)e^{-t_0 r}\big) = E\big((X(t_0) - K)^+ e^{-t_0 r}\big). \tag{2.28}$$
Here the constants $\mu, \sigma, K, t_0, r$ are assumed known. The goal is to compute $p_0$. This becomes feasible if one makes the natural assumption of no arbitrage. That is, the discounted value of the stock, i.e., $X(t)e^{-rt}$, evolves as a martingale. This is a reasonable assumption, as otherwise (if holding the stock is advantageous) everybody will want to take advantage of it and start buying the stock, thereby driving the price up and making it unprofitable. Thus, in effect, this assumption says that $X(t)e^{-rt} \equiv X(0)e^{\mu t + \sigma B(t) - rt}$ evolves as a martingale. But recall that if $B(\cdot)$ is an SBM then for any real $\theta$, $e^{\theta B(t) - \frac{\theta^2}{2}t}$ evolves as a martingale. Thus, $\mu, \sigma, r$ should satisfy the condition $-\frac{\sigma^2}{2} = (\mu - r)$. With this assumption, the fair price for this European option with $\mu, \sigma, r, K, t_0$ given is
$$p_0 = E\Big(e^{-t_0 r}\big(X_0 e^{\sigma B(t_0) + \mu t_0} - K\big)^+\Big) = e^{-t_0 r}\,\frac{1}{\sqrt{2\pi t_0}} \int_{\{y \,:\, X_0 e^{\sigma y + \mu t_0} > K\}} \big(X_0 e^{\sigma y + \mu t_0} - K\big)\, e^{-\frac{y^2}{2t_0}}\,dy. \tag{2.29}$$
This is known as the Black-Scholes formula. For more detailed discussions on Brownian motion, including the development of Ito stochastic integration and diffusion processes via a martingale formulation, the books of Stroock and Varadhan (1979) and Karlin and Taylor (1975) should be consulted. See also Karatzas and Shreve (1991).
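The integral in (2.29) can be evaluated numerically and compared with the familiar closed form $X_0\Phi(d_1) - Ke^{-rt_0}\Phi(d_2)$ to which it reduces when $\mu = r - \sigma^2/2$. The sketch below is illustrative: the function names, the midpoint rule, the truncation of the upper limit at 12 standard deviations, and the sample parameter values are all choices made for this example, not part of the text.

```python
import math

def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def price_closed_form(x0, k, sigma, r, t0):
    """Closed form of (2.29) under the condition mu = r - sigma**2 / 2."""
    d1 = (math.log(x0 / k) + (r + 0.5 * sigma ** 2) * t0) / (sigma * math.sqrt(t0))
    d2 = d1 - sigma * math.sqrt(t0)
    return x0 * normal_cdf(d1) - k * math.exp(-r * t0) * normal_cdf(d2)

def price_by_integration(x0, k, sigma, r, t0, steps=20000):
    """Midpoint-rule approximation of the integral in (2.29)."""
    mu = r - 0.5 * sigma ** 2                        # martingale condition
    lower = (math.log(k / x0) - mu * t0) / sigma     # payoff is positive iff y > lower
    upper = max(lower, sigma * t0) + 12.0 * math.sqrt(t0)  # tail beyond this is negligible
    h = (upper - lower) / steps
    total = 0.0
    for i in range(steps):
        y = lower + (i + 0.5) * h
        payoff = x0 * math.exp(sigma * y + mu * t0) - k
        total += payoff * math.exp(-y * y / (2.0 * t0))
    return math.exp(-r * t0) * total * h / math.sqrt(2.0 * math.pi * t0)

# Sample inputs (hypothetical): X0 = K = 100, sigma = 0.2, r = 0.05, t0 = 1.
print(price_closed_form(100.0, 100.0, 0.2, 0.05, 1.0))
print(price_by_integration(100.0, 100.0, 0.2, 0.05, 1.0))
```

The two printed numbers should agree to several decimal places; any remaining gap reflects the quadrature step and the truncation, not the formula.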

15.3 Problems

15.1 Let $\{L_j\}_{j\ge 0}$ be as in Section 15.1.1. Show that for any $\theta \ge 0$
(a) $E\big(e^{-\theta\sum_{j=0}^{n} L_j}\big) = E\Big(\prod_{j=0}^{n} \frac{\lambda_{y_j}}{\theta + \lambda_{y_j}}\Big)$,
(b) $E\big(e^{-\theta\sum_{j=0}^{\infty} L_j}\big) = 0$ for all $\theta > 0$ iff $\sum_{j=0}^{\infty} \frac{1}{\lambda_{y_j}} = \infty$ w.p. 1, assuming $0 < \lambda_i < \infty$ for all $i$.

15.2 Let $L$ be an exponential random variable. Verify that for any $x > 0$, $u > 0$, $P(L > x + u \mid L > x) = P(L > u)$. (This is referred to as the “lack of memory” property.)

15.3 Solve the Kolmogorov forward and backward equations for the following special cases of birth and death processes:
(a) Poisson process: $\alpha_i \equiv \alpha$, $\beta_i \equiv 0$,
(b) Yule process: $\alpha_i \equiv i\alpha$, $\beta_i \equiv 0$,
(c) On-off process: $\alpha_0 = \alpha$, $\alpha_i = 0$, $i \ge 1$, $\beta_1 = \beta$, $\beta_i = 0$, $i = 0, 2, \ldots$,
(d) M/M/1 queue: $\alpha_i = \alpha$, $i \ge 0$, $\beta_i = \beta$, $i \ge 1$, $\beta_0 = 0$,
(e) M/M/s queue: $\alpha_i = \alpha$, $i \ge 0$, $\beta_i = i\beta$ for $1 \le i \le s$ and $\beta_i = s\beta$ for $i > s$, $\beta_0 = 0$,
(f) Pure death process: $\beta_i \equiv \beta$, $i \ge 1$, $\beta_0 = 0$, $\alpha_i = 0$, $i \ge 0$.

15.4 Find the stationary distributions, when they exist, for the processes in Problem 15.3.

15.5 Consider two independent M/M/1 queues, each with arrival rate $\lambda$ and service rate $\mu$ (Case I), and one M/M/1 queue with arrival rate $2\lambda$ and service rate $2\mu$ (Case II). Assume $\lambda < \mu$. Show that in the stationary state the mean number in the system in Case I is larger than in Case II and their ratio approaches 2 as $\rho = \frac{\lambda}{\mu} \uparrow 1$.

15.6 Show that for any finite state space irreducible CTMC $\{X(t) : t \ge 0\}$ with all $\lambda_i \in (0, \infty)$, there is a unique stationary distribution.

15.7 (M/M/∞ queue). This is a birth and death process such that $\alpha_n \equiv \alpha$, $\beta_n = n\beta$, $n \ge 0$, $0 < \alpha, \beta < \infty$. Show that this process has a stationary distribution that is Poisson with mean $\rho = \frac{\lambda}{\mu}$, where $\lambda \equiv \alpha$ is the arrival rate and $\mu \equiv \beta$ is the service rate per customer.

15.8 (a) Let $\{X(t)\}_{t\ge 0}$ be a Poisson process with rate $\lambda$. Let $L$ be an exponential random variable with mean $\mu^{-1}$ and independent of $\{X(t)\}_{t\ge 0}$. Let $N(t) = X(t + L) - X(t)$. Find the distribution of $N(t)$.
(b) Let $\{Y(t)\}_{t\ge 0}$ be also a Poisson process with rate $\mu$ and independent of $\{X(t)\}_{t\ge 0}$ in (a). Let $T$ and $T'$ be two successive ‘event epochs’ for the $\{Y(t)\}_{t\ge 0}$ process. Let $N = X(T') - X(T)$. Find the distribution of $N$.
(c) Let $\{X(t)\}_{t\ge 0}$ be as in (a). Let $\tau_0 = 0 < \tau_1 < \tau_2 < \cdots$ be the successive event epochs of $\{X(t)\}_{t\ge 0}$. Find the joint distribution of $(\tau_1, \tau_2, \ldots, \tau_n)$ conditioned on the event $\{N(t) = 1\}$ for some $0 < t < \infty$.


15.9 Let $\{X(t) : t \ge 0\}$ be a Poisson process with rate $\lambda$. Suppose at each event epoch of the Poisson process an experiment is performed that results in one of $k$ possible outcomes $\{a_i : 1 \le i \le k\}$ with probability distribution $\{p_i : 1 \le i \le k\}$. Let $X_i(t)$ = the number of outcomes $a_i$ in $[0, t]$. Assume the experiments are iid. Show that $\{X_i(t) : t \ge 0\}$, $1 \le i \le k$, are independent Poisson processes with rate $\lambda p_i$ for $1 \le i \le k$.

15.10 Let $\{X(t) : t \ge 0\}$ be a Poisson process with rate $\lambda$, $0 < \lambda < \infty$. Let $\{\xi_i\}_{i\ge 1}$ be a sequence of iid random variables independent of $\{X(t) : t \ge 0\}$ with values in a measurable space $(S, \mathcal{S})$. For each $A \in \mathcal{S}$ define
$$N(A, t) \equiv \sum_{j=1}^{N(t)} I(\xi_j \in A), \quad t \ge 0.$$
(a) Verify that for each $A \in \mathcal{S}$, $\{N(A, t)\}_{t\ge 0}$ is a Poisson process and find its rate.
(b) Show that if $A_1, A_2 \in \mathcal{S}$, $A_1 \cap A_2 = \emptyset$, then the two Poisson processes $\{N(A_i, t)\}_{t\ge 0}$, $i = 1, 2$ are independent.
(c) Show that for each $t > 0$, $\{N(A, t) : A \in \mathcal{S}\}$ is a Poisson random field on $S$, i.e., for each $A$, $N(A, t)$ is Poisson and for $A_1, A_2, \ldots, A_k$ pairwise disjoint elements of $\mathcal{S}$, $\{N(A_i, t)\}_{1}^{k}$ are independent.
(d) Show that $\{N(\cdot, t)\}_{t\ge 0}$ is a process with stationary independent increments that is Poisson random measure valued.

15.11 Let $B_N(\cdot, \omega)$ be as in (2.1). Show that $\{B_N(\cdot, \omega)\}_{N\ge 1}$ is Cauchy in the Banach space $C[0, 1]$ with sup norm by completing the following steps:
(a) If $\xi_{nj}(t, \omega) = Z_{nj}(\omega)S_{nj}(t)$ then
(i) $\|\xi_{nj}(\cdot, \omega)\| \equiv \sup\{|\xi_{nj}(t, \omega)| : 0 \le t \le 1\} = |Z_{nj}(\omega)|\,2^{-\frac{(n+1)}{2}}$,
(ii) $\sup\Big\{\Big|\sum_{j=1}^{2^n - 1} \xi_{nj}(t, \omega)\Big| : 0 \le t \le 1\Big\} = \big(\max\{|Z_{nj}(\omega)| : 1 \le j \le 2^n - 1\}\big)\,2^{-\frac{(n+1)}{2}}$,
(b) for any sequence $\{\eta_i\}_{i\ge 1}$ of random variables with $\sup_i E(e^{\eta_i}) < \infty$, w.p. 1, $\eta_i \le 2\log i$ for all large $i$,
(c) w.p. 1 there is a $C < \infty$ such that $\max\{|Z_{nj}(\omega)| : 1 \le j \le 2^n - 1\} \le Cn$,
(d) $\sum_{n=1}^{\infty} \Big\|\sum_{j=1}^{2^n - 1} \xi_{nj}(\cdot, \omega)\Big\| < \infty$ w.p. 1.

15.12 Show that if $B(\cdot)$ is SBM, then for $0 < t_1 < t_2$,
$$P\big(B(t) = 0 \text{ for some } t \text{ in } (t_1, t_2)\big) = 1 - \frac{2}{\pi}\arcsin\sqrt{\frac{t_1}{t_2}}.$$
(Hint: Conditioned on $B(t_1) = x \ne 0$, the required probability equals $P(T_{|x|} \le t_2 - t_1) = 2\Big(1 - \Phi\Big(\frac{|x|}{\sqrt{t_2 - t_1}}\Big)\Big)$ and hence the unconditional probability is $E\,2\Big(1 - \Phi\Big(\frac{|B(t_1)|}{\sqrt{t_2 - t_1}}\Big)\Big)$.)

15.13 Use the reflection principle to find $P(M(t) \ge x, B(t) \le y)$ for $x > y$ where $M(t) = \max\{B(u) : 0 \le u \le t\}$ and $B(\cdot)$ is SBM.

15.14 For $a < 0 < b < c$ find $P(T_b < T_a < T_c)$ where $T_x = \min\{t : t > 0, B(t) = x\}$ and $B(\cdot)$ is SBM.

15.15 Let $B_0(t) \equiv B(t) - tB(1)$, $0 \le t \le 1$ (where $\{B(t) : t \ge 0\}$ is SBM) be the Brownian bridge. Find the distribution of $X(t) \equiv (1 + t)B_0\big(\frac{t}{1+t}\big)$, $t \ge 0$. (Hint: $X$ is a Gaussian process. Find its mean and covariance functions.)

15.16 Let $B(\cdot)$ be SBM. Let $M_n = \sup\{|B(t) - B(n)| : n - 1 \le t \le n\}$, $n = 1, 2, \ldots$.
(a) Show that $\frac{M_n}{n} \to 0$ w.p. 1 as $n \to \infty$. (Hint: Show $\{M_n\}_{n\ge 1}$ are iid and $EM_1 < \infty$.)
(b) Using this show that $\frac{B(t)}{t} \to 0$ w.p. 1 as $t \to \infty$ and give another proof of the time inversion result 15.2.3.

15.17 Use the exponential martingale to find $E(e^{-\lambda T})$ where $T = \inf\{t : t \ge 0, B(t) \ge \alpha + \beta t\}$, $\lambda > 0$, $\alpha > 0$, $\beta > 0$ and $B(\cdot)$ is SBM.

15.18 Let $\{Y(t) : -\infty < t < \infty\}$ be the Ornstein-Uhlenbeck process as defined in 15.2.4. Let $f : \mathbb{R} \to \mathbb{R}$ be Borel measurable and $E|f(Z)| < \infty$ where $Z \sim N(0, 1)$. Evaluate $\lim_{t\to\infty} \frac{1}{t}\int_0^t f\big(Y(u)\big)\,du$. (Hint: Show that $Y(\cdot)$ is a regenerative stochastic process.)

16 Limit Theorems for Dependent Processes

16.1 A central limit theorem for martingales

Let $\{X_n\}_{n\ge 1}$ be a sequence of random variables on $(\Omega, \mathcal{F}, P)$, and let $\{\mathcal{F}_n\}_{n\ge 1}$ be a filtration, i.e., a sequence of σ-algebras on $\Omega$ such that $\mathcal{F}_n \subset \mathcal{F}_{n+1} \subset \mathcal{F}$ for all $n \ge 1$. From Chapter 13, recall that $\{X_n, \mathcal{F}_n\}_{n\ge 1}$ is called a martingale if $X_n$ is $\mathcal{F}_n$-measurable for each $n \ge 1$ and $E(X_{n+1} \mid \mathcal{F}_n) = X_n$ for each $n \ge 1$. Given a martingale $\{X_n, \mathcal{F}_n\}_{n\ge 1}$, define
$$Y_1 = X_1 - EX_1, \qquad Y_n = X_n - X_{n-1}, \quad n \ge 2.$$
Note that each $Y_n$ is $\mathcal{F}_n$-measurable and
$$E(Y_n \mid \mathcal{F}_{n-1}) = 0 \quad\text{for all } n \ge 1, \tag{1.1}$$
where $\mathcal{F}_0 = \{\Omega, \emptyset\}$.

Definition 16.1.1: Let $\{Y_n\}_{n\ge 1}$ be a collection of random variables on a probability space $(\Omega, \mathcal{F}, P)$ and let $\{\mathcal{F}_n\}_{n\ge 1}$ be a filtration. Then, $\{Y_n, \mathcal{F}_n\}_{n\ge 1}$ is called a martingale difference array (mda) if $Y_n$ is $\mathcal{F}_n$-measurable for each $n \ge 1$ and (1.1) holds.

For example, if $\{Y_n\}_{n\ge 1}$ is a sequence of zero mean independent random variables, then $\{Y_n, \mathcal{F}_n\}_{n\ge 1}$ is a mda w.r.t. the natural filtration $\mathcal{F}_n = \sigma\langle Y_1, \ldots, Y_n\rangle$, $n \ge 1$. Other examples of mda's can be constructed from the examples given in Chapter 13. The main result of this section shows that for square-integrable mda's satisfying a Lindeberg-type condition, the CLT holds. For more on limit theorems for mda's, see Hall and Heyde (1980).

Theorem 16.1.1: For each $n \ge 1$, let $\{Y_{ni}, \mathcal{F}_{ni}\}_{i\ge 1}$ be a mda on $(\Omega, \mathcal{F}, P)$ with $EY_{ni}^2 < \infty$ for all $i \ge 1$ and let $\tau_n$ be a finite stopping time w.r.t. $\{\mathcal{F}_{ni}\}_{i\ge 1}$. Suppose that for some constant $\sigma^2 \in (0, \infty)$,
$$\sum_{i=1}^{\tau_n} E\big(Y_{ni}^2 \mid \mathcal{F}_{n,i-1}\big) \longrightarrow_p \sigma^2 \quad\text{as } n \to \infty \tag{1.2}$$
and that for each $\epsilon > 0$,
$$\Delta_n(\epsilon) \equiv \sum_{i=1}^{\tau_n} E\big(Y_{ni}^2 I(|Y_{ni}| > \epsilon) \mid \mathcal{F}_{n,i-1}\big) \longrightarrow_p 0 \quad\text{as } n \to \infty. \tag{1.3}$$
Then,
$$\sum_{i=1}^{\tau_n} Y_{ni} \longrightarrow_d N(0, \sigma^2). \tag{1.4}$$

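Proof: First the theorem will be proved under the additional condition that
$$\tau_n = m_n \ \text{ for all } n \ge 1 \text{ for some nonrandom sequence of positive integers } \{m_n\}_{n\ge 1} \tag{1.5}$$
and that for some $c \in (0, \infty)$,
$$\sum_{i=1}^{m_n} E\big(Y_{ni}^2 \mid \mathcal{F}_{n,i-1}\big) \le c \quad\text{w.p. 1.} \tag{1.6}$$
Let $\sigma_{ni}^2 = E\big(Y_{ni}^2 \mid \mathcal{F}_{n,i-1}\big)$, $i \ge 1$, $n \ge 1$. Also, write $m$ for $m_n$ to ease the notation. Since $\sigma_{ni}^2$ is $\mathcal{F}_{n,i-1}$-measurable, for any $t \in \mathbb{R}$,
$$\begin{aligned}
&\Big|E\exp\Big(\iota t\sum_{j=1}^{m} Y_{nj}\Big) - \exp\big(-\sigma^2 t^2/2\big)\Big| \\
&\quad\le \Big|E\exp\Big(\iota t\sum_{j=1}^{m} Y_{nj}\Big) - E\exp\Big(\iota t\sum_{j=1}^{m-1} Y_{nj}\Big)\exp\big(-t^2\sigma_{nm}^2/2\big)\Big| \\
&\qquad + \cdots + \Big|E\Big[\exp\big(\iota tY_{n1}\big)\exp\Big(-\sum_{j=2}^{m} t^2\sigma_{nj}^2/2\Big) - \exp\Big(-\sum_{j=1}^{m} t^2\sigma_{nj}^2/2\Big)\Big]\Big| \\
&\qquad + \Big|E\exp\Big(-\sum_{j=1}^{m} t^2\sigma_{nj}^2/2\Big) - \exp\big(-t^2\sigma^2/2\big)\Big| \\
&\quad\le \sum_{k=1}^{m} E\Big|E\big(\exp(\iota tY_{nk}) \mid \mathcal{F}_{n,k-1}\big) - \exp\big(-t^2\sigma_{nk}^2/2\big)\Big| + \Big|E\exp\Big(-\sum_{j=1}^{m} t^2\sigma_{nj}^2/2\Big) - \exp\big(-t^2\sigma^2/2\big)\Big| \\
&\quad\equiv I_{1n} + I_{2n}, \ \text{say.}
\end{aligned} \tag{1.7}$$
By (1.2), (1.5), and the BCT,
$$I_{2n} \to 0 \quad\text{as } n \to \infty. \tag{1.8}$$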

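To estimate $I_{1n}$, note that for any $1 \le k \le m$,
$$E\big(\exp(\iota tY_{nk}) \mid \mathcal{F}_{n,k-1}\big) = 1 + \iota tE\big(Y_{nk} \mid \mathcal{F}_{n,k-1}\big) + \frac{(\iota t)^2}{2}E\big(Y_{nk}^2 \mid \mathcal{F}_{n,k-1}\big) + \theta_{nk}(t) = 1 - \frac{t^2}{2}\sigma_{nk}^2 + \theta_{nk}(t) \tag{1.9}$$
and
$$\exp\big(-t^2\sigma_{nk}^2/2\big) = 1 - \frac{t^2}{2}\sigma_{nk}^2 + \gamma_{nk}, \ \text{say.} \tag{1.10}$$
It is easy to verify that
$$|\theta_{nk}| \le E\Big(\min\Big\{(tY_{nk})^2, \frac{|tY_{nk}|^3}{6}\Big\} \,\Big|\, \mathcal{F}_{n,k-1}\Big)$$
and
$$|\gamma_{nk}| \le \big(t^2\sigma_{nk}^2\big)^2\exp\big(t^2\sigma_{nk}^2/2\big)/8.$$
Hence, by (1.3), (1.6), (1.9), and (1.10), for any $\epsilon$ in (0,1),
$$\begin{aligned}
I_{1n} &\le \sum_{k=1}^{m} E\big(|\theta_{nk}| + |\gamma_{nk}|\big) \\
&\le t^2\sum_{k=1}^{m} E\Big|E\big(Y_{nk}^2 I(|Y_{nk}| > \epsilon) \mid \mathcal{F}_{n,k-1}\big)\Big| + |t|^3\epsilon\sum_{k=1}^{m} E\Big|E\big(Y_{nk}^2 \mid \mathcal{F}_{n,k-1}\big)\Big| + \sum_{k=1}^{m} E\,t^4\sigma_{nk}^4\exp(t^2c/2) \\
&\le t^2E\Delta_n(\epsilon) + |t|^3\cdot\epsilon\cdot E\Big(\sum_{k=1}^{m}\sigma_{nk}^2\Big) + t^4\exp(t^2c/2)\,E\Big[\Big(\sum_{k=1}^{m}\sigma_{nk}^2\Big)\Big(\max_{1\le k\le m}\sigma_{nk}^2\Big)\Big].
\end{aligned}$$
Note that for any $\epsilon > 0$,
$$E\Big(\max_{1\le k\le m}\sigma_{nk}^2\Big) \le \epsilon^2 + E\Big(\max_{1\le k\le m} E\big(Y_{nk}^2 I(|Y_{nk}| > \epsilon) \mid \mathcal{F}_{n,k-1}\big)\Big) \le \epsilon^2 + E\Delta_n(\epsilon).$$
Hence, by (1.3), (1.6), and the BCT, for any $\epsilon \in (0, 1)$,
$$\limsup_{n\to\infty} I_{1n} \le \limsup_{n\to\infty}\Big[t^2E\Delta_n(\epsilon) + |t|^3\epsilon\cdot c + t^4ce^{t^2c/2}\big(\epsilon^2 + E\Delta_n(\epsilon)\big)\Big] \le c_1(t)\,\epsilon$$
for some $c_1(t) \in (0, \infty)$, not depending on $\epsilon$. This implies that
$$\lim_{n\to\infty} I_{1n} = 0. \tag{1.11}$$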

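Clearly (1.7), (1.8), and (1.11) yield (1.4), whenever (1.5) and (1.6) are true.

Next, suppose that condition (1.6) is not assumed a priori but (1.5) holds true. Fix $c > \sigma^2$ and define the sets $B_{nk} = \big\{\sum_{i=1}^{k}\sigma_{ni}^2 \le c\big\}$, and the variables $\check{Y}_{nk} = Y_{nk}I_{B_{nk}}$, $k \ge 1$, $n \ge 1$. Note that $B_{nk} \in \mathcal{F}_{n,k-1}$ and hence, $E\big(\check{Y}_{nk} \mid \mathcal{F}_{n,k-1}\big) = I_{B_{nk}}E\big(Y_{nk} \mid \mathcal{F}_{n,k-1}\big) = 0$, and
$$\check{\sigma}_{nk}^2 \equiv E\big(\check{Y}_{nk}^2 \mid \mathcal{F}_{n,k-1}\big) = I_{B_{nk}}\sigma_{nk}^2, \tag{1.12}$$
for all $k \ge 1$. In particular, $\{\check{Y}_{nk}, \mathcal{F}_{nk}\}$ is a mda. Since $B_{n,k-1} \supset B_{nk}$ for all $k$, by the definition of the sets $B_{nk}$, it follows that
$$\begin{aligned}
\sum_{k=1}^{m}\check{\sigma}_{nk}^2 &= \sum_{k=1}^{m}\sigma_{nk}^2 I_{B_{nk}} \\
&= \Big(\sum_{k=1}^{m}\sigma_{nk}^2\Big)I_{B_{nm}} + \Big(\sum_{k=1}^{m-1}\sigma_{nk}^2\Big)\big(I_{B_{n,m-1}} - I_{B_{nm}}\big) + \cdots + \sigma_{n1}^2\big(I_{B_{n1}} - I_{B_{n2}}\big) \\
&\le cI_{B_{nm}} + c\big(I_{B_{n,m-1}} - I_{B_{nm}}\big) + \cdots + c\big(I_{B_{n1}} - I_{B_{n2}}\big) \le c.
\end{aligned} \tag{1.13}$$
Thus, the mda $\{\check{Y}_{nk}, \mathcal{F}_{nk}\}_{k\ge 1}$ satisfies (1.6). Next note that by (1.2) and (1.5),
$$P\big(B_{nm}^c\big) \to 0 \quad\text{as } n \to \infty. \tag{1.14}$$
Also, by (1.12), $\sum_{k=1}^{m}\check{\sigma}_{nk}^2 = \sum_{k=1}^{m}\sigma_{nk}^2$ on $B_{nm}$. Hence, it follows that
$$\sum_{k=1}^{m}\check{\sigma}_{nk}^2 \longrightarrow_p \sigma^2,$$
i.e., the mda $\{\check{Y}_{nk}, \mathcal{F}_{nk}\}$ satisfies (1.2). Further, the inequality “$|\check{Y}_{nk}| \le |Y_{nk}|$” and the fact that the function $h(x) = x^2I(|x| > \epsilon)$, $x > 0$ is nondecreasing jointly imply that (1.3) holds for $\{\check{Y}_{nk}, \mathcal{F}_{nk}\}$. Hence, by the case already proved,
$$\sum_{k=1}^{m}\check{Y}_{nk} \longrightarrow_d N(0, \sigma^2). \tag{1.15}$$
But $\sum_{k=1}^{m}\check{Y}_{nk} = \sum_{k=1}^{m}Y_{nk}$ on $B_{nm}$. Hence, by (1.14),
$$\sum_{k=1}^{m}Y_{nk} \longrightarrow_d N(0, \sigma^2), \tag{1.16}$$
and therefore, the CLT holds without the restriction in (1.6).

Next consider relaxing the restrictions in (1.5) (and (1.6)). Since $P(\tau_n < \infty) = 1$, there exist positive integers $m_n$ such that (Problem 16.2)
$$P\big(\tau_n > m_n\big) \to 0 \quad\text{as } n \to \infty. \tag{1.17}$$
Next define
$$\tilde{Y}_{nk} = Y_{nk}I(\tau_n \ge k), \quad k \ge 1, \ n \ge 1. \tag{1.18}$$
It is easy to check (Problem 16.3) that $\{\tilde{Y}_{nk}, \mathcal{F}_{nk}\}$ is a mda, and that $\{\tilde{Y}_{nk}, \mathcal{F}_{nk}\}$ satisfies (1.2) and (1.3) with $\tau_n$ replaced by $m_n$ (Problem 16.4). Hence, by the previous case already proved,
$$\sum_{k=1}^{m_n}\tilde{Y}_{nk} \longrightarrow_d N(0, \sigma^2).$$
Next note that (cf. (4.1), Problem 16.4),
$$\sum_{k=1}^{\tau_n}Y_{nk} - \sum_{k=1}^{m_n}\tilde{Y}_{nk} \longrightarrow_p 0 \quad\text{as } n \to \infty. \tag{1.19}$$
Hence, (1.4) holds and the proof of the theorem is complete. □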

16.2 Mixing sequences

This section deals with a class of dependent processes, called the mixing processes, where the degree of dependence decreases as the distance (in time) between two given sets of random variables goes to infinity. The ‘degree of dependence’ is measured by various mixing coefficients, which are defined in Section 16.2.1 below. Some basic properties of the mixing coefficients are presented in Section 16.2.2. Limit theorems for sums of mixing random variables are given in Section 16.2.3.

16.2.1

Mixing coeﬃcients

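Definition 16.2.1: Let $(\Omega, \mathcal{F}, P)$ be a probability space and $\mathcal{G}_1$, $\mathcal{G}_2$ be sub-σ-algebras of $\mathcal{F}$.

(a) The α-mixing or strong mixing coefficient between $\mathcal{G}_1$ and $\mathcal{G}_2$ is defined as
$$\alpha(\mathcal{G}_1, \mathcal{G}_2) \equiv \sup\big\{|P(A \cap B) - P(A)P(B)| : A \in \mathcal{G}_1, B \in \mathcal{G}_2\big\}. \tag{2.1}$$
(b) The β-mixing coefficient or the coefficient of absolute regularity between $\mathcal{G}_1$ and $\mathcal{G}_2$ is defined as
$$\beta(\mathcal{G}_1, \mathcal{G}_2) \equiv \sup\frac{1}{2}\sum_{i=1}^{k}\sum_{j=1}^{\ell}\big|P(A_i \cap B_j) - P(A_i)P(B_j)\big|, \tag{2.2}$$
where the supremum is taken over all finite partitions $\{A_1, \ldots, A_k\}$ and $\{B_1, \ldots, B_\ell\}$ of $\Omega$ by sets $A_i \in \mathcal{G}_1$ and $B_j \in \mathcal{G}_2$, $1 \le i \le k$, $1 \le j \le \ell$, $\ell, k \in \mathbb{N}$.
(c) The ρ-mixing coefficient or the coefficient of maximal correlation between $\mathcal{G}_1$ and $\mathcal{G}_2$ is defined as
$$\rho(\mathcal{G}_1, \mathcal{G}_2) \equiv \sup\big\{\rho_{X_1,X_2} : X_i \in L^2(\Omega, \mathcal{G}_i, P), i = 1, 2\big\} \tag{2.3}$$
where $\rho_{X_1,X_2} \equiv \frac{\mathrm{Cov}(X_1, X_2)}{\sqrt{\mathrm{Var}(X_1)\mathrm{Var}(X_2)}}$ is the correlation coefficient of $X_1$ and $X_2$.

It is easy to check (Problem 16.5 (a) and (d)) that all three mixing coefficients take values in the interval [0, 1] and that
$$\rho(\mathcal{G}_1, \mathcal{G}_2) = \sup\big\{|EX_1X_2| : X_i \in L^2(\Omega, \mathcal{G}_i, P), EX_i = 0, EX_i^2 = 1, i = 1, 2\big\}.$$
When the σ-algebras $\mathcal{G}_1$ and $\mathcal{G}_2$ are independent, these coefficients equal zero, and vice versa. Thus, nonzero values of the mixing coefficients give various measures of the degree of dependence between $\mathcal{G}_1$ and $\mathcal{G}_2$. It is easy to check (Problem 16.5 (c)) that
$$\alpha(\mathcal{G}_1, \mathcal{G}_2) \le \beta(\mathcal{G}_1, \mathcal{G}_2) \quad\text{and}\quad \alpha(\mathcal{G}_1, \mathcal{G}_2) \le \rho(\mathcal{G}_1, \mathcal{G}_2). \tag{2.4}$$
However, no ordering between $\beta(\mathcal{G}_1, \mathcal{G}_2)$ and $\rho(\mathcal{G}_1, \mathcal{G}_2)$ exists, in general (Problem 16.6). There are two other mixing coefficients that are also often used in the literature. These are given by the φ-mixing coefficient
$$\phi(\mathcal{G}_1, \mathcal{G}_2) \equiv \sup\big\{|P(A) - P(A \mid B)| : B \in \mathcal{G}_1, P(B) > 0, A \in \mathcal{G}_2\big\}, \tag{2.5}$$
and the Ψ-mixing coefficient
$$\Psi(\mathcal{G}_1, \mathcal{G}_2) \equiv \sup_{A \in \mathcal{G}_1^*, B \in \mathcal{G}_2^*}\frac{|P(A \cap B) - P(A)P(B)|}{P(A)P(B)}, \tag{2.6}$$
where $P(A \mid B) = P(A \cap B)/P(B)$ for $P(B) > 0$, and where $\mathcal{G}_i^* = \{A : A \in \mathcal{G}_i, P(A) > 0\}$, $i = 1, 2$. It is easy to check that $\Psi(\mathcal{G}_1, \mathcal{G}_2) \ge \phi(\mathcal{G}_1, \mathcal{G}_2) \ge \beta(\mathcal{G}_1, \mathcal{G}_2)$.

Definition 16.2.2: Let $\{X_i\}_{i\in\mathbb{Z}}$ be a (doubly-infinite) sequence of random variables on a probability space $(\Omega, \mathcal{F}, P)$. Then, the strong- or α-mixing coefficient of $\{X_i\}_{i\in\mathbb{Z}}$, denoted by $\alpha_X(\cdot)$, is defined by
$$\alpha_X(n) \equiv \sup_{i\in\mathbb{Z}}\alpha\big(\sigma\langle X_j : j \le i, j \in \mathbb{Z}\rangle, \ \sigma\langle X_j : j \ge i + n, j \in \mathbb{Z}\rangle\big), \quad n \ge 1, \tag{2.7}$$
where the $\alpha(\cdot, \cdot)$ on the right side of (2.7) is as defined in (2.1). The process $\{X_i\}_{i\in\mathbb{Z}}$ is called strongly mixing or α-mixing if
$$\lim_{n\to\infty}\alpha_X(n) = 0. \tag{2.8}$$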
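When the two σ-algebras are generated by random variables taking finitely many values, the coefficients of Definition 16.2.1 can be computed by enumeration. The sketch below is an illustration under that finiteness assumption: the function names and the sample joint distribution are hypothetical, the suprema in (2.1) and (2.5) are taken over all subsets of values, and (2.2) is evaluated at the partitions into atoms, which is where the supremum is attained for finite σ-algebras (by the triangle inequality).

```python
from fractions import Fraction
from itertools import chain, combinations

def subsets(items):
    """All subsets of a finite collection, as tuples."""
    items = list(items)
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

def mixing_coefficients(joint):
    """Return (alpha, beta, phi) between sigma<X> and sigma<Y>.

    joint[i][j] = P(X = x_i, Y = y_j); phi conditions on events of X,
    matching the roles of G1 and G2 in (2.5).
    """
    rows, cols = range(len(joint)), range(len(joint[0]))
    px = [sum(joint[i][j] for j in cols) for i in rows]
    py = [sum(joint[i][j] for i in rows) for j in cols]
    alpha = phi = Fraction(0)
    for a in subsets(rows):
        pa = sum(px[i] for i in a)
        for b in subsets(cols):
            pb = sum(py[j] for j in b)
            pab = sum(joint[i][j] for i in a for j in b)
            alpha = max(alpha, abs(pab - pa * pb))
            if pa > 0:
                phi = max(phi, abs(pb - pab / pa))
    beta = sum(abs(joint[i][j] - px[i] * py[j]) for i in rows for j in cols) / 2
    return alpha, beta, phi

# Hypothetical example: two positively dependent binary variables.
joint = [[Fraction(2, 5), Fraction(1, 10)],
         [Fraction(1, 10), Fraction(2, 5)]]
alpha, beta, phi = mixing_coefficients(joint)
print(alpha, beta, phi)
```

For this example the output respects the orderings $\alpha \le \beta \le \phi$ stated above. The enumeration is exponential in the number of values, so it is only practical for very small state spaces.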

The other mixing coefficients of $\{X_i\}_{i\in\mathbb{Z}}$ (e.g., $\beta_X(\cdot)$, $\rho_X(\cdot)$, etc.) are defined similarly. For a one-sided sequence $\{X_i\}_{i\ge 1}$, the α-mixing coefficient of $\{X_i\}_{i\ge 1}$ is defined by replacing $\mathbb{Z}$ on the right side of (2.7) by $\mathbb{N}$ on all three occurrences. A similar modification is needed for the other mixing coefficients. When there is no chance of confusion, the coefficients $\alpha_X(\cdot), \beta_X(\cdot), \ldots$, etc., will be written as $\alpha(\cdot), \beta(\cdot), \ldots$, etc., to ease the notation. Another important notion of ‘weak’ dependence is given by the following:

Definition 16.2.3: Let $m \in \mathbb{Z}_+$ be an integer and $\{X_i\}_{i\in\mathbb{Z}}$ be a collection of random variables on $(\Omega, \mathcal{F}, P)$. Then, $\{X_i\}_{i\in\mathbb{Z}}$ is called m-dependent if for every $k \in \mathbb{Z}$, $\{X_i : i \le k, i \in \mathbb{Z}\}$ and $\{X_i : i > k + m, i \in \mathbb{Z}\}$ are independent.

Example 16.2.1: If $\{\epsilon_i\}_{i\in\mathbb{Z}}$ is a collection of independent random variables and $X_i = (\epsilon_i + \epsilon_{i+1})$, $i \in \mathbb{Z}$, then $\{\epsilon_i\}_{i\in\mathbb{Z}}$ is 0-dependent and $\{X_i\}_{i\in\mathbb{Z}}$ is 1-dependent. It is easy to see that if $\{X_i\}_{i\in\mathbb{Z}}$ is m-dependent for some $m \in \mathbb{Z}_+$, then $\alpha_X^*(n) = 0$ for all $n > m$, where $\alpha_X^* \in \{\alpha_X, \beta_X, \rho_X, \phi_X, \Psi_X\}$. Therefore, m-dependence of $\{X_i\}_{i\in\mathbb{Z}}$ implies that the process $\{X_i\}_{i\in\mathbb{Z}}$ is $\alpha_X^*$-mixing. In this sense, the condition of m-dependence is the strongest and the condition of α-mixing is the weakest among all weak dependence conditions introduced here.

Example 16.2.2: Let $\{\epsilon_i\}_{i\in\mathbb{Z}}$ be a collection of iid random variables with $E\epsilon_1 = 0$, $E\epsilon_1^2 < \infty$ and let
$$X_n = \sum_{i\in\mathbb{Z}} a_i\epsilon_{n-i}, \quad n \in \mathbb{Z} \tag{2.9}$$
where $a_i \in \mathbb{R}$ and $a_i = O(\exp(-ci))$ as $i \to \infty$, $c \in (0, \infty)$. If $\epsilon_1$ has an integrable characteristic function, then $\{X_i\}_{i\in\mathbb{Z}}$ is strongly mixing (Chanda (1974), Gorodetskii (1977), Withers (1981), Athreya and Pantula (1986)).

Example 16.2.3: Let $\{X_i\}_{i\in\mathbb{Z}}$ be a zero mean stationary Gaussian process. Suppose that $\{X_i\}_{i\in\mathbb{Z}}$ has spectral density $f : (-\pi, \pi) \to [0, \infty)$, i.e.,
$$EX_0X_k = \int_{-\pi}^{\pi} e^{\iota kx}f(x)\,dx, \quad k \in \mathbb{Z}. \tag{2.10}$$
Then, $\alpha_X(n) \le \rho_X(n) \le 2\pi\alpha_X(n)$, $n \ge 1$ and, therefore, $\{X_i\}_{i\in\mathbb{Z}}$ is α-mixing iff it is ρ-mixing (Ibragimov and Rozanov (1978), Chapter 4). Further, $\{X_i\}_{i\in\mathbb{Z}}$ is α-mixing iff the spectral density $f$ admits the representation
$$f(t) = \big|p(e^{\iota t})\big|^2\exp\big(u(e^{\iota t}) + \tilde{v}(e^{\iota t})\big), \tag{2.11}$$
where $p(\cdot)$ is a polynomial, $u$ and $v$ are continuous real-valued functions on the unit circle in the complex plane, and $\tilde{v}$ is the conjugate function of $v$. It is also known that if the Gaussian process $\{X_i\}_{i\in\mathbb{Z}}$ is φ-mixing, then it is necessarily m-dependent for some $m \in \mathbb{Z}_+$. Thus, for Gaussian processes, the condition of α-mixing is as strong as ρ-mixing and the conditions of φ-mixing and Ψ-mixing are equivalent to m-dependence. See Ibragimov and Rozanov (1978) for more details.

16.2.2

Coupling and covariance inequalities

The mixing coefficients can be seen as measures of deviations from independence. The idea of coupling is to construct independent copies of a given pair of random vectors on a suitable probability space such that the Euclidean distance between these copies admits a bound in terms of the mixing coefficient between the (σ-algebras generated by the) random vectors. Thus, coupling gives a geometric interpretation of the mixing coefficients. The first result is for β-mixing random vectors.

Theorem 16.2.1: (Berbee's theorem). Let $(X, Y)$ be a random vector on a probability space $(\Omega_0, \mathcal{F}_0, P_0)$ such that $X$ takes values in $\mathbb{R}^d$ and $Y$ in $\mathbb{R}^s$, $d, s \in \mathbb{N}$. Then, there exist an enlarged probability space $(\Omega, \mathcal{F}, P)$ and an s-dimensional random vector $Y^*$ such that
(i) $(X, Y, Y^*)$ are defined on $(\Omega, \mathcal{F}, P)$,
(ii) $Y^*$ is independent of $X$ under $P$ and $(X, Y)$ have the same distribution under $P$ and $P_0$,
(iii) $P(Y \ne Y^*) = \beta(\sigma\langle X\rangle, \sigma\langle Y\rangle)$.

Proof: See Corollary 4.2.5 of Berbee (1979).

A weaker version of the above result is available for α-mixing random variables, where the difference between $Y$ and its independent copy admits a bound in terms of the α-mixing coefficient.

Theorem 16.2.2: (Bradley's theorem). In Theorem 16.2.1, assume $s = 1$ and $0 < E|Y|^\gamma < \infty$ for some $0 < \gamma < \infty$. Then, for all $0 < y \le (E|Y|^\gamma)^{1/\gamma}$,
$$P\big(|Y - Y^*| \ge y\big) \le 18\big(\alpha(\sigma\langle X\rangle, \sigma\langle Y\rangle)\big)^{2\gamma/(1+2\gamma)}\big(E|Y|^\gamma\big)^{1/(1+2\gamma)}y^{-\gamma/(1+2\gamma)}. \tag{2.12}$$

Proof: See Theorem 3 of Bradley (1983).

Next, some bounds on the covariance between mixing random variables are established. These will be useful for deriving limit theorems for sums of mixing random variables. For a random variable $X$, define the function
$$Q_X(u) = \inf\{t : P(|X| > t) \le u\}, \quad u \in (0, 1). \tag{2.13}$$
Thus, $Q_X(u)$ is the quantile function of $|X|$ at $(1 - u)$.

Theorem 16.2.3: (Rio's inequality). Let $X$ and $Y$ be two random variables with $\int_0^1 Q_X(u)Q_Y(u)\,du < \infty$. Then,
$$\big|\mathrm{Cov}(X, Y)\big| \le 2\int_0^{2\alpha} Q_X(u)Q_Y(u)\,du \tag{2.14}$$
where $\alpha = \alpha(\sigma\langle X\rangle, \sigma\langle Y\rangle)$.

Proof: By Tonelli's theorem,
$$EX^+Y^+ = E\Big(\int_0^{X^+}du\int_0^{Y^+}dv\Big) = E\int_0^\infty\int_0^\infty I(X > u)I(Y > v)\,du\,dv = \int_0^\infty\int_0^\infty P(X > u, Y > v)\,du\,dv$$
and similarly, $EX^+ = \int_0^\infty P(X > u)\,du$. Hence, by (2.1), it follows that
$$\big|\mathrm{Cov}(X^+, Y^+)\big| = \Big|\int_0^\infty\int_0^\infty\big[P(X > u, Y > v) - P(X > u)P(Y > v)\big]\,du\,dv\Big| \le \int_0^\infty\int_0^\infty\min\big\{\alpha, P(X > u), P(Y > v)\big\}\,du\,dv. \tag{2.15}$$
Next note that for any real numbers $a, b, c, d$,
$$(\alpha \wedge a \wedge c) + (\alpha \wedge a \wedge d) + (\alpha \wedge b \wedge c) + (\alpha \wedge b \wedge d) \le [2(\alpha \wedge a)] \wedge (c + d) + [2(\alpha \wedge b)] \wedge (c + d) \le 2[2\alpha \wedge (a + b) \wedge (c + d)]. \tag{2.16}$$
Now using (2.15), (2.16), and the identity $\mathrm{Cov}(X, Y) = \mathrm{Cov}(X^+, Y^+) + \mathrm{Cov}(X^-, Y^-) - \mathrm{Cov}(X^+, Y^-) - \mathrm{Cov}(X^-, Y^+)$, one gets
$$\big|\mathrm{Cov}(X, Y)\big| \le 2\int_0^\infty\int_0^\infty\min\big\{2\alpha, P(|X| > u), P(|Y| > v)\big\}\,du\,dv. \tag{2.17}$$
Hence, it is enough to show that the right sides of (2.14) and (2.17) agree. To that end, let $U$ be a Uniform (0,1) random variable and define $(W_1, W_2) = (0, 0)I(U \ge 2\alpha) + \big(Q_X(U), Q_Y(U)\big)I(U < 2\alpha)$. Then
$$EW_1W_2 = \int_0^{2\alpha}Q_X(u)Q_Y(u)\,du. \tag{2.18}$$
On the other hand, noting that $Q_X(a) > t$ iff $P(|X| > t) > a$, one has
$$\begin{aligned}
EW_1W_2 &= \int_0^\infty\int_0^\infty P\big(W_1 > u, W_2 > v\big)\,du\,dv \\
&= \int_0^\infty\int_0^\infty P\big(U < 2\alpha, Q_X(U) > u, Q_Y(U) > v\big)\,du\,dv \\
&= \int_0^\infty\int_0^\infty\min\big\{2\alpha, P(|X| > u), P(|Y| > v)\big\}\,du\,dv.
\end{aligned}$$

Hence, the theorem follows from (2.17), (2.18), and the above identity. □

Corollary 16.2.4: Let $X$ and $Y$ be two random variables with $\alpha(\sigma\langle X\rangle, \sigma\langle Y\rangle) = \alpha \in [0, 1]$.

(i) (Davydov's inequality). Suppose that $E|X|^p < \infty$, $E|Y|^q < \infty$ for some $p, q \in (1, \infty)$ with $\frac{1}{p} + \frac{1}{q} < 1$. Then, $E|XY| < \infty$ and
$$\big|\mathrm{Cov}(X, Y)\big| \le 2r(2\alpha)^{1/r}\big(E|X|^p\big)^{1/p}\big(E|Y|^q\big)^{1/q}, \tag{2.19}$$
where $\frac{1}{r} = 1 - \big(\frac{1}{p} + \frac{1}{q}\big)$.

(ii) If $P(|X| \le c_1) = 1 = P(|Y| \le c_2)$ for some constants $c_1, c_2 \in (0, \infty)$, then
$$\big|\mathrm{Cov}(X, Y)\big| \le 4c_1c_2\alpha. \tag{2.20}$$

Proof: Let $a = (E|X|^p)^{1/p}$ and $b = (E|Y|^q)^{1/q}$. W.l.o.g., suppose that $a, b \in (0, \infty)$. Then, by Markov's inequality, for any $0 < u < 1$,
$$P\big(|X| > au^{-1/p}\big) \le \frac{E|X|^p}{(au^{-1/p})^p} = u$$
and hence, $Q_X(u) \le au^{-1/p}$. Similarly, $Q_Y(u) \le bu^{-1/q}$, $0 < u < 1$. Hence, by Theorem 16.2.3,
$$\big|\mathrm{Cov}(X, Y)\big| \le 2\int_0^{2\alpha}ab\,u^{-1/p-1/q}\,du = 2ab\,(2\alpha)^{1-1/p-1/q}\Big/\Big(1 - \frac{1}{p} - \frac{1}{q}\Big),$$
which is equivalent to (2.19). The proof of (2.20) is a direct consequence of Rio's inequality and the bounds $Q_X(u) \le c_1$ and $Q_Y(u) \le c_2$ for all $0 < u < 1$. □
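The bound (2.20) can be checked by hand on a small example. The following sketch uses a hypothetical pair of $\pm 1$-valued random variables (so $c_1 = c_2 = 1$) with an assumed joint distribution. For two-valued variables the only nontrivial events in $\sigma\langle X\rangle$ are $\{X = 1\}$ and $\{X = -1\}$, so the supremum in (2.1) reduces to a maximum over the four cells.

```python
from fractions import Fraction

# Assumed joint law of (X, Y) on {-1, +1} x {-1, +1}; the values are illustrative.
joint = {
    (1, 1): Fraction(2, 5),
    (-1, -1): Fraction(2, 5),
    (1, -1): Fraction(1, 10),
    (-1, 1): Fraction(1, 10),
}

def expectation(f):
    """E f(X, Y) under the joint law above."""
    return sum(p * f(x, y) for (x, y), p in joint.items())

cov = expectation(lambda x, y: x * y) - expectation(lambda x, y: x) * expectation(lambda x, y: y)

# Marginal probabilities P(X = x) and P(Y = y).
px = {x: sum(p for (u, _), p in joint.items() if u == x) for x in (-1, 1)}
py = {y: sum(p for (_, v), p in joint.items() if v == y) for y in (-1, 1)}

# alpha from (2.1): maximum of |P(A and B) - P(A)P(B)| over the four cells.
alpha = max(abs(joint[(x, y)] - px[x] * py[y]) for x in (-1, 1) for y in (-1, 1))

bound = 4 * 1 * 1 * alpha  # right side of (2.20) with c1 = c2 = 1
print(cov, alpha, bound)
```

In this example the covariance and the bound coincide, which suggests that the constant 4 in (2.20) cannot be improved in general.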

16.3 Central limit theorems for mixing sequences

In this section, CLTs for sequences of random variables satisfying different mixing conditions are proved.

Proposition 16.3.1: Let $\{X_i\}_{i\in\mathbb{Z}}$ be a collection of random variables with strong mixing coefficient $\alpha(\cdot)$.

(i) Suppose that $\sum_{n=1}^{\infty}\alpha(n) < \infty$ and for some $c \in (0, \infty)$, $P(|X_i| \le c) = 1$ for all $i$. Then,
$$\sum_{n=1}^{\infty}\mathrm{Cov}(X_1, X_{n+1}) \ \text{converges absolutely.} \tag{3.1}$$
(ii) Suppose that $\sum_{n=1}^{\infty}\alpha(n)^{\delta/(2+\delta)} < \infty$ and $\sup_{i\in\mathbb{Z}}E|X_i|^{2+\delta} < \infty$ for some $\delta \in (0, \infty)$. Then, (3.1) holds.

Proof: A direct consequence of Corollary 16.2.4. □

Next suppose that the collection of random variables $\{X_i\}_{i\in\mathbb{Z}}$ is stationary and that $\mathrm{Var}(X_1) + \sum_{n=1}^{\infty}\big|\mathrm{Cov}(X_1, X_{1+n})\big| < \infty$. Then by the DCT,
$$\begin{aligned}
n\mathrm{Var}(\bar{X}_n) &= n^{-1}\mathrm{Var}\Big(\sum_{i=1}^{n}X_i\Big) \\
&= n^{-1}\Big[\sum_{i=1}^{n}\mathrm{Var}(X_i) + 2\sum_{1\le i<j\le n}\mathrm{Cov}(X_i, X_j)\Big] \\
&= n^{-1}\Big[n\mathrm{Var}(X_1) + 2\sum_{i=1}^{n-1}\sum_{k=1}^{n-i}\mathrm{Cov}(X_i, X_{i+k})\Big] \\
&= n^{-1}\Big[n\mathrm{Var}(X_1) + 2\sum_{k=1}^{n-1}(n - k)\mathrm{Cov}(X_1, X_{1+k})\Big] \\
&\longrightarrow \sigma_\infty^2 \equiv \mathrm{Var}(X_1) + 2\sum_{k=1}^{\infty}\mathrm{Cov}(X_1, X_{1+k}) \quad\text{as } n \to \infty.
\end{aligned} \tag{3.2}$$
In particular, under the conditions of part (i) or part (ii) of Proposition 16.3.1, $\lim_{n\to\infty}\mathrm{Var}(\sqrt{n}\,\bar{X}_n)$ exists and equals $\sigma_\infty^2$.
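The finite-$n$ expression in (3.2) is easy to evaluate for a given autocovariance function. The sketch below is illustrative: `scaled_variance_of_mean` is a hypothetical helper, and the sample autocovariance corresponds to the 1-dependent sequence $X_i = \epsilon_i + \epsilon_{i+1}$ of Example 16.2.1 under the added assumption $\mathrm{Var}(\epsilon_i) = 1$, for which $\sigma_\infty^2 = 2 + 2\cdot 1 = 4$.

```python
from fractions import Fraction

def scaled_variance_of_mean(autocov, n):
    """n * Var(X-bar_n) for a stationary sequence, using the last finite-n line of (3.2).

    autocov(k) must return Cov(X_1, X_{1+k}) as an integer or Fraction.
    """
    total = n * autocov(0) + 2 * sum((n - k) * autocov(k) for k in range(1, n))
    return Fraction(total, n)

def one_dependent_autocov(k):
    """Autocovariance of X_i = e_i + e_{i+1} with iid e_i of unit variance."""
    return {0: 2, 1: 1}.get(k, 0)

for n in (1, 10, 100, 1000):
    # The values equal 4 - 2/n and increase to sigma_infinity^2 = 4.
    print(n, scaled_variance_of_mean(one_dependent_autocov, n))
```

For this autocovariance the output equals $4 - 2/n$, which makes the rate of convergence in (3.2) explicit.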

2 > 0. Indeed, it is not diﬃcult In general, it is not guaranteed that σ∞ to construct an example of a stationary strong mixing sequence {Xn }n≥1 2 such that σ∞ = 0 (Problem 16.8). However, in addition to the conditions √ ¯ of 2 Proposition 16.3.1, if it is assumed that σ∞ > 0, then a CLT for n(X n− EX1 ) holds in the stationary case; see Corollary 16.3.3 and 16.3.6 below. A classical method of proving the CLT (and other limit theorems) for mixing random variables is based on the idea of blocking, introduced by S. N. Bernstein. Intuitively, the ‘blocking’ approach can be described as n follows: Suppose, µ = EX1 = 0. First, write the sum i=1 Xi in terms of alternating sums of ‘big blocks’ Bi ’s (of length ‘p’ say) and ‘little blocks’ Li ’s (of length ‘q’ say) as n i=1

Xi

=

X1 + · · · + Xp + Xp+1 + · · · + Xp+q + Xp+q+1 + · · · + X2p+q + · · ·

= B1 + L1 + B2 + L2 + · · · + (BK + LK ) + Rn , where the last term Rn is the excess (if any) over the last complete pair of big- and little-blocks (BK , LK ). Next, group together the Bi ’s and Li ’s to write n K K √ 1 1 1 √ Xi = √ Bj + √ Lj + Rn / n. (3.3) n i=1 n j=1 n j=1 K If q p n, then, the number of Xi ’s in j=1 Lj and in Rn are of smaller order than n, the total number of Xi ’s. Using this, one can show that the contribution of the last two terms in (3.3) to the limit is negligible, i.e.,

K 1 √ Lj + Rn −→p 0. n j=1 K To handle the ﬁrst term, √1n j=1 Bi , note that the Bj ’s are functions of disjoint collections of Xj ’s that are separated by a distance of q or more.

16.3 Central limit theorems for mixing sequences

521

By letting q → ∞ suitably and using the mixing condition, one can replace the Bj ’s by their independent copies, and appeal to the Lindeberg CLT for sums of independent random variables to conclude that 1 2 √ Bj −→d N 0, σ∞ . n j=1 K
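The index bookkeeping behind the blocking decomposition can be made concrete. The sketch below is illustrative (the function name `bernstein_blocks` and the sample sizes are assumptions): it splits the indices $1, \ldots, n$ into $K$ alternating big blocks of length $p$ and little blocks of length $q$, followed by the remainder that makes up $R_n$.

```python
def bernstein_blocks(n, p, q):
    """Split indices 1..n into K big blocks (length p), K little blocks (length q),
    and a remainder, where K is the number of complete (big, little) pairs."""
    if p <= 0 or q <= 0:
        raise ValueError("block lengths must be positive")
    k = n // (p + q)
    big = [list(range(j * (p + q) + 1, j * (p + q) + p + 1)) for j in range(k)]
    little = [list(range(j * (p + q) + p + 1, (j + 1) * (p + q) + 1)) for j in range(k)]
    remainder = list(range(k * (p + q) + 1, n + 1))
    return big, little, remainder

big, little, remainder = bernstein_blocks(n=23, p=5, q=2)
print(big)        # index sets of B_1, ..., B_K
print(little)     # index sets of L_1, ..., L_K
print(remainder)  # indices contributing to R_n
```

Consecutive big blocks are separated by exactly $q$ indices, which is the gap the mixing condition exploits, and the little blocks together with the remainder contain at most $Kq + (p + q - 1)$ indices.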

Although the blocking approach is described here for stationary random variables, with minor modifications, it is applicable to certain nonstationary sequences as shown below.

Theorem 16.3.2: Let $\{X_n\}_{n\ge 1}$ be a sequence of random variables (not necessarily stationary) with strong mixing coefficient $\alpha(\cdot)$. Suppose that there exist constants $\sigma_0^2, c \in (0, \infty)$ such that
$$P(|X_i| \le c) = 1 \quad\text{for all } i \in \mathbb{N}, \tag{3.4}$$
$$\gamma_n \equiv \sup_{j\ge 1}\Big|n^{-1}\mathrm{Var}\Big(\sum_{i=j}^{j+n-1}X_i\Big) - \sigma_0^2\Big| \to 0 \quad\text{as } n \to \infty, \tag{3.5}$$
and that
$$\sum_{n=1}^{\infty}\alpha(n) < \infty. \tag{3.6}$$
Then,
$$\sqrt{n}\big(\bar{X}_n - \bar{\mu}_n\big) \longrightarrow_d N(0, \sigma_0^2) \quad\text{as } n \to \infty \tag{3.7}$$
where $\bar{\mu}_n = E\bar{X}_n$, and $\bar{X}_n = n^{-1}\sum_{i=1}^{n}X_i$, $n \ge 1$.

An important special case of Theorem 16.3.2 is the following:

Corollary 16.3.3: If $\{X_n\}_{n\ge 1}$ is a sequence of stationary bounded random variables with $\sum_{n=1}^{\infty}\alpha(n) < \infty$, and if $\sigma_\infty^2$ of (3.2) is positive, then, with $\mu = EX_1$,
$$\sqrt{n}\big(\bar{X}_n - \mu\big) \longrightarrow_d N(0, \sigma_\infty^2) \quad\text{as } n \to \infty. \tag{3.8}$$

Proof: For stationary random variables, (3.5) holds with $\sigma_0^2 = \sigma_\infty^2$ (cf. (3.2)). Hence, the Corollary follows from Theorem 16.3.2. □

For proving the theorem, the following auxiliary result will be used.

Lemma 16.3.4: Suppose that the conditions of Theorem 16.3.2 hold. Then,
$$\sup_{m\ge 1}E\Big(\sum_{i=m}^{m+n-1}(X_i - EX_i)\Big)^4 = o(n^3) \quad\text{as } n \to \infty.$$


Proof: W.l.o.g., let $EX_i = 0$ for all $i$. Note that for any $m \in \mathbb{N}$,
$$E\Big(\sum_{j=m}^{n+m-1}X_j\Big)^4 = \sum_{j}EX_j^4 + 6\sum_{i<j}EX_i^2X_j^2 + 4\sum_{i\ne j}EX_i^3X_j + 6\sum_{i\ne j\ne k}EX_i^2X_jX_k + \sum_{i\ne j\ne k\ne\ell}EX_iX_jX_kX_\ell \equiv I_{1n} + \cdots + I_{5n}, \ \text{say,} \tag{3.9}$$
where the indices $i, j, k, \ell$ in the above sums lie between $m$ and $m + n - 1$. By (3.4),
$$|I_{1n}| + |I_{2n}| + |I_{3n}| \le n\cdot c^4 + 7n(n-1)c^4 \le 7n^2c^4. \tag{3.10}$$
By Corollary 16.2.4 (ii), noting that $EX_i = 0$ for all $i$,
$$|I_{4n}| \le 12\sum_{i<j<k}\Big(\big|EX_i^2X_jX_k\big| + \big|EX_iX_j^2X_k\big| + \big|EX_iX_jX_k^2\big|\Big) \ \ldots$$

… $> 0] = 1 - \exp[(-2\mu/\sigma^2)x]$, $0 < x < \infty$.
(b) $\lim_t P\big[\sup_x|A(x, t) - A(x)| > \epsilon \mid Z(t) > 0\big] = 0$ for any $\epsilon > 0$, where $A(\cdot, t)$ and $A(x)$ are as in Theorem 18.3.2 (d) with $\alpha = 0$.

Theorem 18.3.4: (Subcritical case). Let $m < 1$. Then for any initial $Z_0 \ne 0$, $G(\cdot)$ nonlattice (cf. Chapter 10),
$$\lim_{t\to\infty}P[Z(t) = j \mid Z(t) > 0] = \pi_j \tag{3.5}$$
exists for all $j \ge 1$ and $\sum_{j=1}^{\infty}\pi_j = 1$.


18.4 Embedding of Urn schemes in continuous time branching processes

It turns out that many urn schemes can be embedded in continuous time branching processes. The case of Pólya's urn is discussed below. Recall that Pólya's urn scheme is the following. Let an urn have an initial composition of $R_0$ red and $B_0$ black balls. A draw consists of taking a ball at random from the urn, noting its color, and returning it to the urn with one more ball of the color drawn. Let $(R_n, B_n)$ denote the composition after $n$ draws. Clearly, $R_n + B_n = R_0 + B_0 + n$ for all $n \ge 0$ and $\{(R_n, B_n)\}_{n\ge 0}$ is a Markov chain.

Let $\{Z_i(t) : t \ge 0\}$, $i = 1, 2$ be two independent continuous time branching processes with unit exponential life times and offspring distribution of binary splitting, i.e., $p_2 = 1$, and $Z_1(0) = R_0$, $Z_2(0) = B_0$. Let $\tau_0 = 0 < \tau_1 < \tau_2 < \ldots < \tau_n < \ldots$ denote the successive times of death of an individual in the combined population. Then the sequence $\{(Z_1(\tau_n), Z_2(\tau_n))\}_{n\ge 0}$ has the same distribution as $\{(R_n, B_n)\}_{n\ge 0}$. To establish this claim, by the Markov property of $\{(Z_1(t), Z_2(t))\}_{t\ge 0}$, it suffices to show that $(Z_1(\tau_1), Z_2(\tau_1))$ has the same distribution as $(R_1, B_1)$. It is easy to show that if $\eta_i$, $i = 1, 2, \ldots, n$ are independent exponential random variables with parameters $\lambda_i$, $i = 1, 2, \ldots, n$, then $\eta \equiv \min\{\eta_i : 1 \le i \le n\}$ is an Exponential$\big(\sum_{i=1}^{n} \lambda_i\big)$ random variable and $P(\eta = \eta_i) = \lambda_i \big/ \sum_{j=1}^{n} \lambda_j$ (Problem 18.8). This, in turn, leads to the fact that at time $\tau_1$, the probability that a split takes place in $\{Z_1(t) : t \ge 0\}$ is $\frac{Z_1(0)}{Z_1(0) + Z_2(0)}$. At this split, the parent is lost but is replaced by two new individuals, resulting in a net addition of one more individual, establishing the claim.

The same reasoning yields the embedding of the following general urn scheme. Let $X_n = (X_{n1}, \ldots, X_{nk})$ be the vector of the composition of an urn at time $n$, where $X_{ni}$ is the number of balls of color $i$. Assume that given $(X_0, X_1, \ldots, X_n)$, $X_{n+1}$ is generated as follows. Pick a ball at random from the urn. If it happens to be of color $i$, then return it to the urn along with a random number $\zeta_{ij}$ of balls of color $j = 1, 2, \ldots, k$, where the joint distribution of $\zeta_i \equiv (\zeta_{i1}, \zeta_{i2}, \ldots, \zeta_{ik})$ depends on $i$, $i = 1, 2, \ldots, k$. Now set $X_{n+1} = X_n + \zeta_i$.

The embedding is done as follows. Consider a continuous time multitype branching process $\{Z(t) : t \ge 0\}$ with Exponential(1) lifetimes in which the offspring distribution of the $i$th type is the same as that of $\tilde\zeta_i \equiv \zeta_i + \delta_i$, where $\zeta_i$ is as above and $\delta_i$ is the $i$th unit vector. For $i = 1, 2, \ldots, k$, let $\{Z_i(t) : t \ge 0\}$ be a branching process that evolves as $\{Z(t) : t \ge 0\}$ above but has initial size $Z_i(0) \equiv (0, 0, \ldots, X_{0i}, 0, \ldots, 0)$. Let $0 = \tau_0 < \tau_1 < \tau_2 < \ldots$ denote the times at which deaths occur in the process obtained by pooling all the $k$ processes. Then $\{Z_i(\tau_n) : i = 1, 2, \ldots, k\}$, $n = 0, 1, 2, \ldots$ has the same distribution as $(X_{ni}, i = 1, 2, \ldots, k)$, $n \ge 0$.


This embedding has been used to prove limit theorems for urn models. See Athreya and Ney (2004), Chapter 5, for details. For applications to clinical trials, see Rosenberger (2002).
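The embedding of Pólya's urn can be explored by simulation. The following is a minimal sketch, not the construction used in the references: the function names and the seed are illustrative choices, and the embedded version relies on the memoryless property to redraw fresh exponential clocks after every death, with the minimum of $z$ unit exponential lifetimes drawn directly as an Exponential($z$) variable.

```python
import random


def polya_direct(r0, b0, n_draws, rng):
    """Run the urn directly: draw a ball, return it with one more of its color."""
    r, b = r0, b0
    for _ in range(n_draws):
        if rng.random() < r / (r + b):
            r += 1
        else:
            b += 1
    return r, b


def polya_embedded(r0, b0, n_splits, rng):
    """Observe two binary-splitting processes at the successive death times."""
    z1, z2 = r0, b0
    for _ in range(n_splits):
        # Time to the next death in each population; an empty one never splits.
        t1 = rng.expovariate(z1) if z1 > 0 else float("inf")
        t2 = rng.expovariate(z2) if z2 > 0 else float("inf")
        if t1 < t2:
            z1 += 1  # the parent dies and leaves two offspring: net +1
        else:
            z2 += 1
    return z1, z2


rng = random.Random(12345)
reps, n = 2000, 50
mean_direct = sum(polya_direct(2, 3, n, rng)[0] for _ in range(reps)) / reps
mean_embedded = sum(polya_embedded(2, 3, n, rng)[0] for _ in range(reps)) / reps
print(mean_direct, mean_embedded)  # both should be near 2 + 50 * 2/5 = 22
```

Agreement of the two sample means is only a sanity check on first moments; it does not by itself verify equality in distribution.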

18.5 Problems

18.1 Show that for any probability distribution $\{p_j\}_{j\ge 0}$, $f(s) = \sum_{j=0}^{\infty} p_j s^j$ is convex in $[0,1]$. Show also that there exists a $q \in [0, 1)$ such that $f(q) = q$ iff $m = f'(1\cdot) > 1$.

18.2 Assume $\sum_{j=1}^{\infty} j^2 p_j < \infty$.

(a) Let $v_n = V(Z_n \mid Z_0 = 1)$. Show that $v_{n+1} = V(Z_n m \mid Z_0 = 1) + E(Z_n \sigma^2 \mid Z_0 = 1)$, where $m = E(Z_1 \mid Z_0 = 1)$ and $\sigma^2 = V(Z_1 \mid Z_0 = 1)$, and hence $v_{n+1} = m^2 v_n + \sigma^2 m^n$.

(b) Conclude from (a) that $\sup_n EW_n^2 < \infty$, where $W_n = Z_n/m^n$.

(c) Using the fact that $\{W_n\}_{n\ge 0}$ is a martingale, show that if $\sum j^2 p_j < \infty$ then $\{W_n\}$ converges w.p. 1 and in $L^2$ to a random variable $W$ such that $E(W \mid Z_0 = 1) = 1$.

18.3 By definition, the sequence $\{Z_n\}_{n\ge 0}$ of population sizes satisfies the random iteration scheme
\[ Z_{n+1} = \sum_{i=1}^{Z_n} \zeta_{ni}, \]
where $\{\zeta_{ni}, i = 1, 2, \ldots, n = 1, 2, \ldots\}$ is a doubly infinite sequence of iid random variables with distribution $\{p_j\}$.

(a) (Independence of lines of descent). Establish the property that for any $k \ge 0$, if $Z_0 = k$ then $\{Z_n\}_{n\ge 0}$ has the same distribution as $\big\{\sum_{j=1}^{k} Z_n^{(j)}\big\}_{n\ge 0}$, where $\{Z_n^{(j)}\}_{n\ge 0}$, $j \ge 1$ are iid copies of $\{Z_n\}_{n\ge 0}$ with $Z_0 = 1$.

(b) In the context of Theorem 18.1.2, show that if $Z_0 = 1$ then $W \equiv \lim W_n$ can be represented as
\[ W = \frac{1}{m} \sum_{j=1}^{Z_1} W^{(j)}, \]
where $Z_1, W^{(j)}$, $j = 1, 2, \ldots$ are all independent, with $Z_1$ having distribution $\{p_j\}_{j\ge 0}$ and $\{W^{(j)}\}_{j\ge 1}$ iid with the same distribution as $W$.


(c) Let α ≡ aj ∈D P (W = aj ) where D ≡ {aj } is the set of values such that P (W = aj ) > 0. Show using (b) that α = f (α) and conclude that if α < 1, then α = q and hence that if α < 1, then the distribution of W conditional on W > 0 must be continuous. (d) Let β be the singular component of the distribution of W in its Lebesgue decomposition. Show using (b) that β satisﬁes β ≤ f (β) and hence that if β < 1, then β = P (W = 0) and the distribution of W conditional on W > 0 must be absolutely continuous. (e) Let p0 = 0. Show that the distribution of W is of the pure type, i.e., it is either purely discrete, purely singular continuous, or purely absolutely continuous. 18.4 (a) Show using Problem 18.3 (b) that if W has a lattice distribution with span d, then d must satisfy d = md and hence d = ∞. Conclude that if P (W = 0) < 1, then the distribution of W on {W > 0} must be nonlattice. (b) Let p0 = 0 and P (W = 0) = 0. Use (a) to conclude that the characteristic function φ(t) ≡ E(eιtW ) of W satisﬁes sup1≤|t|≤m |φ(t)| < 1. (c) Let p0 = 0. Show that for any 0 ≤ s0 < 1, > 0, f (n) (s0 ) = 0(n ). n−1 (Hint: By the mean value theorem, f (n) (s) = j=0 f (fj (s)). Now use f (0) = p0 , fj (s) → 0 as j → ∞.) (d) Let p0 = 0, P (W = 0) = 0. Show that for any n ≥ 1, ∞ φ(mn t) = f (n) (φ(t)) and hence −∞ |φ(u)|du < ∞. Conclude that the distribution of W is absolutely continuous. 18.5 In the multitype case for the martingale deﬁned in (2.1), show that 2 < ∞ for all i, j where Ei denotes {Wn }n≥0 is L2 bounded if Ei Z1j expectation when one starts with an individual of type i. 18.6 Let m(·) satisfy the integral equation (3.3). (a) Show that mα (t) ≡ m(t)e−αt satisﬁes the renewal equation −αt mα (·) = 1 − G(t) e + mα (t − u)dGα (u) (0,t]

where $G_\alpha(t) \equiv m \int_0^t e^{-\alpha u}\, dG(u)$, $t \ge 0$.

(b) Use the key renewal theorem of Section 8.5 to conclude that limt→∞ mα (t) exists and identify the limit.


(c) Assuming $\sum_{j=1}^{\infty} j^2 p_j < \infty$, show using the key renewal theorem of Section 8.5 that $\{W(t) : t \ge 0\}$ of Theorem 18.3.2 is $L^2$ bounded.

18.7 Consider an M/G/1 queue with Poisson arrivals and general service time. Let $Z_1$ be the number of customers that arrive during the service time of the first customer. Call these first generation customers. Let $Z_2$ be the number of customers that arrive during the time it takes to serve all the $Z_1$ customers. Call these second generation customers. For $n \ge 1$, let $Z_{n+1}$ denote the number of customers that arrive during the time it takes to serve all $Z_n$ of the $n$th generation customers.

(a) Show that $\{Z_n\}_{n\ge 0}$ is a BGW branching process as in Section 18.1.

(b) Find the offspring distribution $\{p_j\}_{0}^{\infty}$ and its mean $m$ in terms of the rate parameter $\lambda$ of the Poisson arrival process and the service time distribution $G(\cdot)$.

(c) Show that the queue size goes to $\infty$ with positive probability iff $m > 1$.

(d) Set up a functional equation for the moment generating function of the busy period $U$, i.e., the time interval between when the first service starts and when the server is idle for the first time.

18.8 Let $\{\eta_i : i = 1, 2, \ldots, n\}$ be independent exponential random variables with $E\eta_i = \lambda_i^{-1}$, $i = 1, 2, \ldots, n$. Let $\eta \equiv \min\{\eta_i : 1 \le i \le n\}$. Show that $\eta$ has an exponential distribution with $E\eta = \big(\sum_{i=1}^{n} \lambda_i\big)^{-1}$ and that $P(\eta = \eta_j) = \lambda_j \big/ \sum_{i=1}^{n} \lambda_i$.

18.9 Using the embedding outlined in Section 18.4 for the Pólya urn scheme, show that $Y_n \equiv \frac{R_n}{R_n + B_n} \to Y$ w.p. 1 and that $Y$ can be represented as
\[ Y = \frac{\sum_{i=1}^{R_0} X_i}{\sum_{j=1}^{R_0 + B_0} X_j}, \]
where $\{X_i\}_{i\ge 1}$ are iid exponential (1) random variables. Conclude that $Y$ has Beta $(R_0, B_0)$ distribution.

Appendix A Advanced Calculus: A Review

This Appendix is a brief review of elementary set theory, real numbers, limits, sequences and series, continuity, diﬀerentiability, Riemann integration, complex numbers, exponential and trigonometric functions, and metric spaces. For proofs and further details, see Rudin (1976) and Royden (1988).

A.1 Elementary set theory This section reviews the following: sets, set operations, product sets (ﬁnite and inﬁnite), equivalence relation, axiom of choice, countability, and uncountability. Deﬁnition A.1.1: A set is a collection of objects. It is typically deﬁned as a collection of objects with a common deﬁning property. For example, the collection of even integers can be written as E ≡ {n : n is an even integer}. In general, a set Ω with deﬁning property p is written as Ω = {ω : ω has property p}. The individual elements are denoted by the small letters ω, a, x, s, t, etc., and the sets by capital letters Ω, A, X, S, T , etc. Example A.1.1: The closed interval [0, 1] ≡ {x : x a real number, 0 ≤ x ≤ 1}.


Example A.1.2: The set of positive rationals $\equiv \{x : x = \frac{m}{n},\ m \text{ and } n \text{ positive integers}\}$.

Example A.1.3: The set of polynomials in $x$ of degree 10 $\equiv \{P(x) : P(x) = \sum_{j=0}^{10} a_j x^j,\ a_j \text{ real},\ j = 0, \ldots, 10,\ a_{10} \ne 0\}$.

Example A.1.4: The set of all polynomials in $x$ $\equiv \{P(x) : P(x) = \sum_{j=0}^{n} a_j x^j,\ n \text{ a nonnegative integer},\ a_j \text{ real},\ j = 0, 1, 2, \ldots, n\}$.

A.1.1 Set operations

Definition A.1.2: Let $A$ be a set. A set $B$ is called a subset of $A$ and written as $B \subset A$ if every element of $B$ is also an element of $A$. Two sets $A$ and $B$ are the same and written as $A = B$ if each is a subset of the other. A subset $A \subset \Omega$ is called empty and denoted by $\emptyset$ if there exists no $\omega$ in $\Omega$ such that $\omega \in A$. Using the mathematical notation $\in$ and $\Rightarrow$, one writes $B \subset A$ if $x \in B \Rightarrow x \in A$. Here '$\in$' means "belongs to" and $\Rightarrow$ means "implies."

Example A.1.5: Let $\mathbb{N}$ be the set of natural numbers, i.e., $\mathbb{N} = \{1, 2, 3, \ldots\}$. Let $E$ be the set of even natural numbers, i.e., $E = \{n : n \in \mathbb{N},\ n = 2k \text{ for some } k \in \mathbb{N}\}$. Then $E \subset \mathbb{N}$.

Example A.1.6: Let $A = [0, 1]$ and $B$ be the set of $x$ in $A$ such that $x^2 < \frac14$. Then $B = \{x : 0 \le x < \frac12\} \subset A$.

Definition A.1.3: (Intersection and union). Let $A_1, A_2$ be subsets of a set $\Omega$. Then $A_1$ union $A_2$, written as $A_1 \cup A_2$, is the set defined by $A_1 \cup A_2 = \{\omega : \omega \in A_1 \text{ or } \omega \in A_2 \text{ or both}\}$. Similarly, $A_1$ intersection $A_2$, written as $A_1 \cap A_2$, is the set defined by $A_1 \cap A_2 = \{\omega : \omega \in A_1 \text{ and } \omega \in A_2\}$.

Example A.1.7: Let $\Omega \equiv \mathbb{N} \equiv \{1, 2, 3, \ldots\}$,
\[ A_1 = \{\omega : \omega = 3k \text{ for some } k \in \mathbb{N}\} \quad \text{and} \quad A_2 = \{\omega : \omega = 5k \text{ for some } k \in \mathbb{N}\}. \]
Then $A_1 \cup A_2 = \{\omega : \omega \text{ is divisible by at least one of the two integers 3 and 5}\}$ and $A_1 \cap A_2 = \{\omega : \omega \text{ is divisible by both 3 and 5}\}$.
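Example A.1.7 can be reproduced on a finite window of the natural numbers. The sketch below is only an illustration: the cutoff 60 is an arbitrary choice made so that the sets are finite, and Python's built-in set operators stand in for union and intersection.

```python
# Finite stand-in for Omega = N: the integers 1, ..., 60 (arbitrary cutoff).
omega = set(range(1, 61))

a1 = {w for w in omega if w % 3 == 0}  # multiples of 3
a2 = {w for w in omega if w % 5 == 0}  # multiples of 5

union = a1 | a2          # divisible by at least one of 3 and 5
intersection = a1 & a2   # divisible by both 3 and 5, i.e., by 15

print(sorted(intersection))
print(len(union))
```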


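Definition A.1.4: Let $\Omega$ and $I$ be nonempty sets. Let $\{A_\alpha : \alpha \in I\}$ be a collection of subsets of $\Omega$. Then $I$ is called the index set. The union of $\{A_\alpha : \alpha \in I\}$ is defined as
\[ \bigcup_{\alpha\in I} A_\alpha \equiv \{\omega : \omega \in A_\alpha \text{ for some } \alpha \in I\}. \]
The intersection of $\{A_\alpha : \alpha \in I\}$ is defined as
\[ \bigcap_{\alpha\in I} A_\alpha \equiv \{\omega : \omega \in A_\alpha \text{ for every } \alpha \in I\}. \]

Definition A.1.5: (Complement of a set). Let $A \subset \Omega$. Then the complement of the set $A$, written as $A^c$ (or $\tilde A$), is defined by $A^c \equiv \{\omega : \omega \notin A\}$.

Example A.1.8: If $\Omega = \mathbb{N}$ and $A$ is the set of all integers that are divisible by 2, then $A^c$ is the set of all odd integers, i.e., $A^c = \{1, 3, 5, 7, \ldots\}$.

Proposition A.1.1: (DeMorgan's law). For any collection $\{A_\alpha : \alpha \in I\}$ of subsets of $\Omega$, $(\cup_{\alpha\in I} A_\alpha)^c = \cap_{\alpha\in I} A_\alpha^c$ and $(\cap_{\alpha\in I} A_\alpha)^c = \cup_{\alpha\in I} A_\alpha^c$.

Proof: To show that two sets $A$ and $B$ are the same, it suffices to show that $\omega \in A \Rightarrow \omega \in B$ and $\omega \in B \Rightarrow \omega \in A$. Let $\omega \in (\cup_{\alpha\in I} A_\alpha)^c$. Then
\[ \omega \notin \bigcup_{\alpha\in I} A_\alpha \ \Rightarrow\ \omega \notin A_\alpha \text{ for any } \alpha \in I \ \Rightarrow\ \omega \in A_\alpha^c \text{ for each } \alpha \in I \ \Rightarrow\ \omega \in \bigcap_{\alpha\in I} A_\alpha^c. \]
Thus $(\cup_{\alpha\in I} A_\alpha)^c \subset \cap_{\alpha\in I} A_\alpha^c$. The opposite inclusion and the second identity are similarly proved. $\Box$

Definition A.1.6: (Product sets). Let $\Omega_1$ and $\Omega_2$ be two nonempty sets. Then the product set of $\Omega_1$ and $\Omega_2$, denoted by $\Omega \equiv \Omega_1 \times \Omega_2$, consists of all ordered pairs $(\omega_1, \omega_2)$ such that $\omega_1 \in \Omega_1$, $\omega_2 \in \Omega_2$. Note that if $\Omega_1 = \Omega_2$ and $\omega_1 \ne \omega_2$, then the pair $(\omega_1, \omega_2)$ is not the same as $(\omega_2, \omega_1)$, i.e., the order is important.

Example A.1.9: $\Omega_1 = [0, 1]$, $\Omega_2 = [2, 3]$. Then $\Omega_1 \times \Omega_2 = \{(x, y) : 0 \le x \le 1,\ 2 \le y \le 3\}$.

Definition A.1.7: (Finite products). If $\Omega_i$, $i = 1, 2, \ldots, k$ are nonempty sets, then $\Omega = \Omega_1 \times \Omega_2 \times \ldots \times \Omega_k$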


is the set of all ordered $k$ vectors $(\omega_1, \omega_2, \ldots, \omega_k)$ where $\omega_i \in \Omega_i$. If $\Omega_i = \Omega_1$ for all $1 \le i \le k$, then $\Omega_1 \times \Omega_2 \times \ldots \times \Omega_k$ is written as $\Omega_1^{(k)}$ or $\Omega_1^k$.

Definition A.1.8: (Infinite products). Let $\{\Omega_\alpha : \alpha \in I\}$ be an infinite collection of nonempty sets. Then $\times_{\alpha\in I}\, \Omega_\alpha$, the product set, is defined as $\{f : f \text{ is a function defined on } I \text{ such that for each } \alpha,\ f(\alpha) \in \Omega_\alpha\}$. If $\Omega_\alpha = \Omega$ for all $\alpha \in I$, then $\times_{\alpha\in I}\, \Omega_\alpha$ is also written as $\Omega^I$.

It is a basic axiom of set theory, known as the axiom of choice (A.C.), that this space is nonempty. That is, given an arbitrary collection of nonempty sets, it is possible to form a parliament with one representative from each set. For a long time it was thought this should follow from the other axioms of set theory. But it is shown in Cohen (1966) that it is an independent axiom. That is, both the A.C. and its negation are consistent with the rest of the axioms of set theory. There are several equivalent versions of A.C. These are Zorn's lemma, Hausdorff's maximality principle, the 'Principle of Well Ordering,' and Tukey's lemma. For a proof of these equivalences, see Hewitt and Stromberg (1965).

Definition A.1.9: (Functions, countability and uncountability). A function $f$ is a correspondence between the elements of a set $X$ and another set $Y$ and is written as $f : X \to Y$. It satisfies the condition that for each $x$, there is a unique $y$ in $Y$ that corresponds to it and is denoted as $y = f(x)$. The set $X$ is called the domain of $f$ and the set $f(X)$, defined as $f(X) \equiv \{y : \text{there exists } x \text{ in } X \text{ such that } f(x) = y\}$, is called the range of $f$. It is possible that many $x$'s may correspond to the same $y$ and also there may exist $y$ in $Y$ for which there is no $x$ such that $f(x) = y$. If $f(X)$ is all of $Y$, then the map is called onto. If for each $y$ in $f(X)$, there is a unique $x$ in $X$ such that $f(x) = y$, then $f$ is called (1–1) or one-to-one. If $f$ is one-to-one and onto, then $X$ and $Y$ are said to have the same cardinality.
Definition A.1.10: Let $f : X \to Y$ be (1–1) and onto. Then, for each $y$ in $Y$, there is a unique element $x$ in $X$ such that $f(x) = y$. This $x$ is denoted as $f^{-1}(y)$. Note that in this case, $g(y) \equiv f^{-1}(y)$ is a (1–1) onto map from $Y$ to $X$ and is called the inverse of $f$.

Example A.1.10: Let $X = \mathbb{N} \equiv \{1, 2, 3, \ldots\}$. Let $Y = \{n : n = 2k,\ k \in \mathbb{N}\}$ be the set of even integers. Then the map $f(x) = 2x$ is a (1–1) onto map from $X$ to $Y$.

Example A.1.11: Let $X$ be $\mathbb{N}$ and let $P$ be the set of all prime numbers. Then $X$ and $P$ have the same cardinality.


Definition A.1.11: A set $X$ is finite if there exists $n \in \mathbb{N}$ such that $X$ and $Y \equiv \{1, 2, \ldots, n\}$ have the same cardinality, i.e., there exists a (1–1) onto map from $Y$ to $X$. A set $X$ is countable if $X$ and $\mathbb{N}$ have the same cardinality, i.e., there exists a (1–1) onto map from $\mathbb{N}$ to $X$. A set $X$ is uncountable if it is neither finite nor countable.

Example A.1.12: The set $\{0, 1, 2, \ldots, 9\}$ is finite, the set $\mathbb{N}^k$ ($k \in \mathbb{N}$) is countable and $\mathbb{N}^{\mathbb{N}}$ is uncountable (Problem A.6).

Definition A.1.12: Let $\Omega$ be a nonempty set. Then the power set of $\Omega$, denoted by $\mathcal{P}(\Omega)$, is the collection of all subsets of $\Omega$, i.e., $\mathcal{P}(\Omega) \equiv \{A : A \subset \Omega\}$.

Remark A.1.1: $\mathcal{P}(\mathbb{N})$ is an uncountable set (Problem A.5).
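The countability of $\mathbb{N}^2$ claimed in Example A.1.12 can be made concrete with an explicit (1–1) onto map. The sketch below uses the Cantor pairing function, shifted so that it maps $\mathbb{N} \times \mathbb{N}$ onto $\mathbb{N} = \{1, 2, \ldots\}$; this particular map is one standard choice and is not taken from the text. The names `pair` and `unpair` are illustrative.

```python
from math import isqrt


def pair(a: int, b: int) -> int:
    """Cantor pairing, shifted to map N x N onto N = {1, 2, ...}."""
    i, j = a - 1, b - 1                      # work with zero-based indices
    return (i + j) * (i + j + 1) // 2 + j + 1


def unpair(z: int) -> tuple[int, int]:
    """Inverse of pair: recover (a, b) from its code z."""
    z0 = z - 1
    w = (isqrt(8 * z0 + 1) - 1) // 2         # index of the anti-diagonal i + j = w
    j = z0 - w * (w + 1) // 2                # position along that anti-diagonal
    i = w - j
    return i + 1, j + 1


# The first few codes enumerate the anti-diagonals of N x N.
print([unpair(z) for z in range(1, 7)])
```

Iterating the construction, for example $(a, b, c) \mapsto$ `pair(pair(a, b), c)`, gives a (1–1) onto map from $\mathbb{N}^3$ to $\mathbb{N}$, and similarly for any $k$.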

A.1.2 The principle of induction

The set N of natural numbers has the well ordering property that every nonempty subset A of N has a smallest element s such that (i) s ∈ A and (ii) a ∈ A ⇒ a ≥ s. This property is one of the basic postulates in the deﬁnition of N. The principle of induction is a consequence of the well ordering property. It says the following: Let {P (n) : n ∈ N} be a collection of propositions (or statements). Suppose that (i) P (1) is true. (ii) For each n ∈ N, P (n) true ⇒ P (n + 1) true. Then, P (n) is true for all n ∈ N. See Problem A.9 for some examples.

A.1.3 Equivalence relations

Deﬁnition A.1.13: (a) Let Ω be a nonempty set. Let G be a nonempty subset of Ω × Ω. Write x ∼ y if (x, y) ∈ G and call it a relation deﬁned by G. (b) A relation deﬁned by G is an equivalence relation if (i) (reﬂexive) for all x in Ω, x ∼ x, i.e., (x, x) ∈ G; (ii) (symmetric) x ∼ y ⇒ y ∼ x, i.e., (x, y) ∈ G ⇒ (y, x) ∈ G; (iii) (transitive) x ∼ y, y ∼ z ⇒ x ∼ z, i.e., (x, y) ∈ G, (y, z) ∈ G ⇒ (x, z) ∈ G.


Example A.1.13: Let $\Omega = \mathbb{Z}$, the set of all integers, and $G = \{(m, n) : m - n \text{ is divisible by } 3\}$. Thus, $m \sim n$ if $(m - n)$ is a multiple of 3. It is easy to verify that this is an equivalence relation.

Definition A.1.14: (Equivalence classes). Let $\Omega$ be a nonempty set. Let $G$ define an equivalence relation on $\Omega$. For each $x$ in $\Omega$, the set $[x] \equiv \{y : x \sim y\}$ is called the equivalence class generated by $x$.

Proposition A.1.2: Let $\mathcal{C}$ be the set of all equivalence classes in $\Omega$ generated by an equivalence relation defined by $G$. Then

(i) $C_1, C_2 \in \mathcal{C} \Rightarrow C_1 = C_2$ or $C_1 \cap C_2 = \emptyset$.

(ii) $\bigcup_{C\in\mathcal{C}} C = \Omega$.

Proof: (i) Suppose $C_1 \cap C_2 \ne \emptyset$. Then there exist $x_1, x_2, y$ such that $C_1 = [x_1]$, $C_2 = [x_2]$ and $y \in C_1 \cap C_2$. This implies $x_1 \sim y$, $x_2 \sim y$. But by symmetry $y \sim x_2$ and this implies by transitivity that $x_1 \sim x_2$, i.e., $x_2 \in C_1$, implying $C_2 \subset C_1$. Similarly, $C_1 \subset C_2$, i.e., $C_1 = C_2$.

(ii) For each $x$ in $\Omega$, $(x, x) \in G$ and so $[x]$ is not empty and $x \in [x]$. $\Box$

The above proposition says that every equivalence relation on Ω leads to a decomposition of Ω into equivalence classes that are disjoint and whose union is all of Ω. In the example given above, the set Z of all integers can be decomposed to three equivalence classes Cj ≡ {n : n = 3m + j for some m ∈ Z}, j = 0, 1, 2.
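The decomposition into equivalence classes can be computed directly on a finite set. The following sketch assumes only that the supplied relation really is an equivalence relation; the helper name `equivalence_classes` and the window $\{-6, \ldots, 6\}$ of integers are illustrative choices.

```python
def equivalence_classes(elements, related):
    """Partition elements into classes of the equivalence relation `related`."""
    classes = []
    for x in elements:
        for cls in classes:
            # By symmetry and transitivity, one representative per class suffices.
            if related(x, cls[0]):
                cls.append(x)
                break
        else:
            classes.append([x])  # x starts a new class
    return classes


# m ~ n iff m - n is divisible by 3, restricted to the integers -6, ..., 6.
classes = equivalence_classes(range(-6, 7), lambda m, n: (m - n) % 3 == 0)
for cls in classes:
    print(cls)
```

The three lists printed are the restrictions of $C_0$, $C_1$, $C_2$ to the chosen window; they are pairwise disjoint and their union is the whole window.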

A.2 Real numbers, continuity, differentiability, and integration

A.2.1 Real numbers

This section reviews the following: integers, rationals, real numbers; algebraic, order, and completeness axioms; Archimedean property, denseness of rationals. There are at least two approaches to deﬁning the real number system. Approach 1. Start with the natural numbers N, construct the set Z of all integers (N ∪ {0} ∪ (−N)), and next, the set Q of rationals and then the set R of real numbers either as the set of all Cauchy sequences of rationals or as Dedekind cuts. The step going from Q to R via Cauchy sequences is also available for completing any incomplete metric space (see Section A.4).


Approach 2. Define the set of real numbers $\mathbb{R}$ as a set that satisfies three sets of axioms. The first set is algebraic, involving addition and multiplication. The second set is on ordering that, with the first, makes $\mathbb{R}$ an ordered field (see Royden (1988) for a definition). The third set is a single axiom known as the completeness axiom. Thus $\mathbb{R}$ is defined as a complete ordered field.

The algebraic axioms say that there are two binary operations known as addition ($+$) and multiplication ($\cdot$) that render $\mathbb{R}$ a field. See Royden (1988) for the nine axioms for this set. The order axiom says that there is a set $P \subset \mathbb{R}$, to be called positive numbers, such that

(i) $x, y \in P \Rightarrow x \cdot y \in P$, $x + y \in P$,

(ii) $x \in P \Rightarrow -x \notin P$,

(iii) $x \in \mathbb{R} \Rightarrow x = 0$ or $x \in P$ or $-x \in P$.

The set $\mathbb{Q}$ of rational numbers is an ordered field (i.e., it satisfies the algebraic and order axioms). But $\mathbb{Q}$ does not satisfy the completeness axiom (see below). Given $P$, one can define an order on $\mathbb{R}$ by defining $x < y$ (read $x$ less than $y$) to mean $y - x \in P$. Since for all $x, y$ in $\mathbb{R}$, $(x - y)$ is either 0 or $(x - y) \in P$ or $(y - x) \in P$, it follows that for all $x, y$ in $\mathbb{R}$, either $x = y$ or $x < y$ or $x > y$. This is called total or linear order.

Definition A.2.1: (Upper and lower bounds). (a) Let $A \subset \mathbb{R}$. A real number $M$ is an upper bound for $A$ if $a \in A \Rightarrow a \le M$, and $m$ is a lower bound for $A$ if $a \in A \Rightarrow a \ge m$. (b) The supremum of a set $A$, denoted by $\sup A$ or the least upper bound (l.u.b.) of $A$, is defined by the following conditions: (i) $x \in A \Rightarrow x \le \sup A$, (ii) $K < \sup A \Rightarrow$ there exists $x \in A$ such that $K < x$.

The completeness axiom says that if $A \subset \mathbb{R}$ has an upper bound $M$ in $\mathbb{R}$, then there exists an $\tilde M$ in $\mathbb{R}$ such that $\tilde M = \sup A$. That is, every set $A$ that is bounded above in $\mathbb{R}$ has a l.u.b. in $\mathbb{R}$. The ordered field of rationals $\mathbb{Q}$ does not possess this property. One well-known example is the set $A = \{r : r \in \mathbb{Q},\ r^2 < 2\}$. Then $A$ is bounded above in $\mathbb{Q}$ but has no l.u.b. in $\mathbb{Q}$ (Problem A.11). Next some consequences of the completeness axiom are discussed.
Proposition A.2.1: (Axiom of Eudoxus and Archimedes (AOE)). For all x in R, there exists a natural number n such that n > x.


Proof: If $x \le 1$, take $n = 2$. If $x > 1$, let $S_x \equiv \{k : k \in \mathbb{N},\ k \le x\}$. Then $S_x$ is not empty and is bounded above. By the completeness axiom, there is a real number $y$ that is the l.u.b. of $S_x$. Thus $y - \frac12$ is not an upper bound for $S_x$ and so there exists $k_0 \in S_x$ such that $y - \frac12 < k_0$. This implies that $(k_0 + 1) > y - \frac12 + 1 = y + \frac12 > y$ and so $(k_0 + 1) \notin S_x$. By the linear order in $\mathbb{R}$, $(k_0 + 1) > x$ and so $(k_0 + 1)$ is the desired integer. $\Box$

Corollary A.2.2: For any $x, y \in \mathbb{R}$ with $x < y$, there is an $r$ in $\mathbb{Q}$ such that $x < r < y$.

Proof: Let $z = (y - x)^{-1}$. Then there is an integer $k$ such that $0 < z < k$ (by AOE). Again by AOE, there is a positive integer $n$ such that $n > yk$. Let $S = \{n : n \in \mathbb{N},\ n > yk\}$. Since $S \ne \emptyset$, it has a smallest element (by the well ordering property of $\mathbb{N}$), say, $p$. Then $p - 1 < yk < p$, i.e., $\frac{p-1}{k} < y < \frac{p}{k}$. Since $\frac1k < \frac1z = (y - x)$ and $\frac{p}{k} > y$, it follows that $\frac{p-1}{k} > x$. Now take $r = \frac{p-1}{k}$. $\Box$

Remark A.2.1: This property is often stated as: The set $\mathbb{Q}$ of rationals is dense in the set $\mathbb{R}$ of real numbers.

Definition A.2.2: The set $\bar{\mathbb{R}}$ of extended real numbers is the set consisting of $\mathbb{R}$ and two elements identified as $+\infty$ (plus infinity) and $-\infty$ (negative infinity) with the following definition of addition ($+$) and multiplication ($\cdot$). For any $x$ in $\mathbb{R}$, $x + \infty = \infty$, $x - \infty = -\infty$, $x \cdot \infty = \infty$ if $x > 0$, $x \cdot \infty = -\infty$ if $x < 0$, $0 \cdot \infty = 0$, $\infty + \infty = \infty$, $-\infty - \infty = -\infty$, $\infty \cdot (\pm\infty) = \pm\infty$, $(-\infty) \cdot (\pm\infty) = \mp\infty$. But $\infty - \infty$ is not defined. The order property on $\bar{\mathbb{R}}$ is defined by extending that on $\mathbb{R}$ with the additional condition $x \in \mathbb{R} \Rightarrow -\infty < x < +\infty$. Finally, if $A \subset \mathbb{R}$ does not have an upper bound in $\mathbb{R}$, then $\sup A$ is defined as $+\infty$ and if $A \subset \mathbb{R}$ does not have a lower bound in $\mathbb{R}$, then $\inf A$ is defined as $-\infty$.
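The proof of Corollary A.2.2 is constructive, and its recipe can be carried out in exact arithmetic. The sketch below follows the same idea with one small variation that should be read as an assumption of this illustration rather than as the text's argument: the numerator is taken to be the largest integer strictly below $yk$, which also covers negative $y$ and the case where $yk$ is itself an integer. The function name `rational_between` is hypothetical.

```python
from fractions import Fraction
from math import ceil, floor


def rational_between(x: Fraction, y: Fraction) -> Fraction:
    """Return a rational r with x < r < y, for rational endpoints x < y."""
    if not x < y:
        raise ValueError("need x < y")
    k = floor(1 / (y - x)) + 1   # integer k > z = 1/(y - x), so 1/k < y - x
    n = ceil(y * k) - 1          # largest integer strictly below y * k
    # n/k < y by construction, and n/k >= y - 1/k > y - (y - x) = x.
    return Fraction(n, k)


print(rational_between(Fraction(1, 3), Fraction(1, 2)))
```

The endpoints are taken rational here only so that the computation is exact; the corollary itself concerns arbitrary real $x < y$.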

A.2.2 Sequences, series, limits, limsup, liminf

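Definition A.2.3: Let $\{x_n\}_{n\ge 1}$ be a sequence of real numbers.

(i) For a real number $a$, $\lim_{n\to\infty} x_n = a$ if for every $\varepsilon > 0$, there exists a positive integer $N_\varepsilon$ such that $n \ge N_\varepsilon \Rightarrow |x_n - a| < \varepsilon$.

(ii) $\lim_{n\to\infty} x_n = \infty$ if for any $K$ in $\mathbb{R}$, there exists an integer $N_K$ such that $n \ge N_K \Rightarrow x_n > K$.

(iii) $\lim_{n\to\infty} x_n = -\infty$ if $\lim_{n\to\infty} (-x_n) = \infty$.

(iv) $\limsup_{n\to\infty} x_n \equiv \overline{\lim}_{n\to\infty}\, x_n = \inf_{n\ge 1} \big(\sup_{j\ge n} x_j\big)$.

(v) $\liminf_{n\to\infty} x_n \equiv \underline{\lim}_{n\to\infty}\, x_n = \sup_{n\ge 1} \big(\inf_{j\ge n} x_j\big)$.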
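The inner quantities $\sup_{j\ge n} x_j$ and $\inf_{j\ge n} x_j$ in the definitions of limsup and liminf can be tabulated numerically. The sketch below uses the illustrative sequence $x_n = (-1)^n (1 + 1/n)$, which has no limit but has limsup $1$ and liminf $-1$. A computer can only scan a finite stretch of the tail; the `horizon` parameter is an assumption of this sketch, harmless here because the tail sup and inf of this particular sequence are attained within the first two terms of each tail.

```python
def tail_sup_inf(x, n, horizon):
    """Return (max, min) of x(j) over n <= j <= horizon, a finite proxy for the tail."""
    tail = [x(j) for j in range(n, horizon + 1)]
    return max(tail), min(tail)


def x(n):
    return (-1) ** n * (1 + 1 / n)


horizon = 10_000
for n in (1, 10, 100, 1000):
    print(n, tail_sup_inf(x, n, horizon))
```

As $n$ grows, the first coordinate decreases toward $1$ and the second increases toward $-1$, in line with (iv) and (v).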


Definition A.2.4: (Cauchy sequences). A sequence $\{x_n\}_{n\ge 1} \subset \mathbb{R}$ is called a Cauchy sequence if for every $\varepsilon > 0$, there is $N_\varepsilon$ such that $n, m \ge N_\varepsilon \Rightarrow |x_m - x_n| < \varepsilon$.

Proposition A.2.3: If $\{x_n\}_{n\ge 1} \subset \mathbb{R}$ is convergent in $\mathbb{R}$ (i.e., $\lim_{n\to\infty} x_n = a$ exists in $\mathbb{R}$), then $\{x_n\}_{n\ge 1}$ is Cauchy. Conversely, if $\{x_n\}_{n\ge 1} \subset \mathbb{R}$ is Cauchy, then there exists an $a \in \mathbb{R}$ such that $\lim_{n\to\infty} x_n = a$.

The proof is based on the use of the l.u.b. axiom (Problem A.14).

Definition A.2.5: Let $\{x_n\}_{n\ge 1}$ be a sequence of real numbers. For $n \ge 1$, $s_n \equiv \sum_{j=1}^{n} x_j$ is called the $n$th partial sum of the series $\sum_{j=1}^{\infty} x_j$. The series $\sum_{j=1}^{\infty} x_j$ is said to converge to $s$ in $\mathbb{R}$ if $\lim_{n\to\infty} s_n = s$. If $\lim_{n\to\infty} s_n = \pm\infty$, then the series $\sum_{j=1}^{\infty} x_j$ is said to diverge to $\pm\infty$. Note that if $x_j \ge 0$ for all $j$, then either $\lim_{n\to\infty} s_n = s \in \mathbb{R}$, or $\lim_{n\to\infty} s_n = \infty$.

Example A.2.1: (Geometric series). Fix $0 < r < 1$. Let $x_n = r^n$, $n \ge 0$. Then $s_n = 1 + r + \ldots + r^n = \frac{1 - r^{n+1}}{1 - r}$ and $\sum_{j=0}^{\infty} r^j$ converges to $s = \frac{1}{1-r}$.

Example A.2.2: Consider the series $\sum_{j=1}^{\infty} \frac{1}{j^p}$, $0 < p < \infty$. It can be shown that this converges for $p > 1$ and diverges to $\infty$ for $0 < p \le 1$.

Definition A.2.6: The series $\sum_{j=1}^{\infty} x_j$ converges absolutely if the series $\sum_{j=1}^{\infty} |x_j|$ converges in $\mathbb{R}$.

There exist series $\sum_{j=1}^{\infty} x_j$ that converge but not absolutely. For example, $\sum_{j=1}^{\infty} \frac{(-1)^j}{j}$. For further material on convergence properties of series, such as tests for convergence, rates of convergence, etc., see Rudin (1976).

Definition A.2.7: (Power series). Let $\{a_n\}_{n\ge 0}$ be a sequence of real numbers. For $x \in \mathbb{R}$, the series $\sum_{n=0}^{\infty} a_n x^n$ is called a power series. If the series $\sum_{n=0}^{\infty} a_n x^n$ converges for all $x$ in $B \subset \mathbb{R}$, the power series $\sum_{n=0}^{\infty} a_n x^n$ is said to be convergent on $B$.

Proposition A.2.4: Let $\{a_n\}_{n\ge 0}$ be a sequence of real numbers. Let $\rho = \big(\limsup_{n\to\infty} |a_n|^{1/n}\big)^{-1}$. Then

(i) $|x| < \rho \Rightarrow \sum_{n=0}^{\infty} |a_n x^n|$ converges.

(ii) $|x| > \rho \Rightarrow \sum_{n=0}^{\infty} |a_n x^n|$ diverges to $+\infty$.

Proof of this is left as an exercise (Problem A.15).

Definition A.2.8: $\rho \equiv \big(\limsup_{n\to\infty} |a_n|^{1/n}\big)^{-1}$ is called the radius of convergence of the power series $\sum_{n=0}^{\infty} a_n x^n$.

A.2.3 Continuity and differentiability

Definition A.2.9: Let $f : A \to \mathbb{R}$, $A \subset \mathbb{R}$. Then

(a) $f$ is continuous at $x_0$ in $A$ if for every $\varepsilon > 0$, there exists a $\delta > 0$ such that $x \in A$, $|x - x_0| < \delta$, implies $|f(x) - f(x_0)| < \varepsilon$. (Here, $\delta$ may depend on $\varepsilon$ and $x_0$.)

(b) $f$ is continuous on $B \subset A$ if it is continuous at every $x_0$ in $B$.

(c) $f$ is uniformly continuous on $B \subset A$ if for every $\varepsilon > 0$, there exists a $\delta_\varepsilon > 0$ such that $\sup\{|f(x) - f(y)| : x, y \in B,\ |x - y| < \delta_\varepsilon\} < \varepsilon$.

Some properties of continuous functions are listed below.

Proposition A.2.5:

(i) (Sums, products, and ratios of continuous functions). Let $f, g : A \to \mathbb{R}$, $A \subset \mathbb{R}$. Let $f$ and $g$ be continuous on $B \subset A$. Then (a) $f + g$, $f - g$, $\alpha \cdot f$ for any $\alpha \in \mathbb{R}$ are all continuous on $B$. (b) $f(x)/g(x)$ is continuous at $x_0$ in $B$, provided $g(x_0) \ne 0$.

(ii) (Continuous functions on a closed bounded interval). Let $f$ be continuous on a closed and bounded interval $[a, b]$. Then (a) $f$ is bounded, i.e., $\sup\{|f(x)| : a \le x \le b\} < \infty$, (b) it achieves its maximum and minimum, i.e., there exist $x_0, y_0$ in $[a, b]$ such that $f(x_0) \ge f(x) \ge f(y_0)$ for all $x$ in $[a, b]$, and $f$ attains all values in $[f(y_0), f(x_0)]$, i.e., for all $\ell \in [f(y_0), f(x_0)]$, there exists $z \in [a, b]$ such that $f(z) = \ell$. Thus, $f$ maps bounded closed intervals onto bounded closed intervals. (c) $f$ is uniformly continuous on $[a, b]$.

(iii) (Composition of functions). Let $f : A \to \mathbb{R}$, $g : B \to \mathbb{R}$ be continuous on $A$ and $B$, respectively. Let $f(A) \subset B$, i.e., for any $x$ in $A$, $f(x) \in B$. Let $h(x) = g(f(x))$ for $x$ in $A$. Then $h : A \to \mathbb{R}$ is continuous.

(iv) (Uniform limits of continuous functions). Let $\{f_n\}_{n\ge 1}$ be a sequence of functions continuous on $A$ to $\mathbb{R}$, $A \subset \mathbb{R}$. If $\sup\{|f_n(x) - f(x)| : x \in A\} \to 0$ as $n \to \infty$ for some $f : A \to \mathbb{R}$, i.e., $f_n$ converges to $f$ uniformly on $A$, then $f$ is continuous on $A$.

Remark A.2.2: The function $f(x) \equiv x$ is clearly continuous on $\mathbb{R}$. Now by Proposition A.2.5 (i) and (iv), it follows that all polynomials are continuous on $\mathbb{R}$, and hence, so are their uniform limits.
Weierstrass’ approximation theorem is a sort of converse to this. That is, every continuous function on a closed and bounded interval is the uniform limit of polynomials. More precisely, one has the following:


Theorem A.2.6: Let $f : [a, b] \to \mathbb{R}$ be continuous. Then for any $\varepsilon > 0$ there is a polynomial $p(x) = \sum_{j=0}^{n} a_j x^j$, $a_j \in \mathbb{R}$, $j = 0, 1, 2, \ldots, n$, such that $\sup\{|f(x) - p(x)| : x \in [a, b]\} < \varepsilon$.

It should be noted that a power series $A(x) \equiv \sum_{n=0}^{\infty} a_n x^n$ is the uniform limit of polynomials on $[-\lambda, \lambda]$ for any $0 < \lambda < \rho \equiv \big(\limsup_{n\to\infty} |a_n|^{1/n}\big)^{-1}$ and hence is continuous on $(-\rho, \rho)$.

Definition A.2.10: Let $f : (a, b) \to \mathbb{R}$, $(a, b) \subset \mathbb{R}$. The function $f$ is said to be differentiable at $x_0 \in (a, b)$ if
\[ \lim_{h\to 0} \frac{f(x_0 + h) - f(x_0)}{h} \equiv f'(x_0) \quad \text{exists in } \mathbb{R}. \]
A function is differentiable in $(a, b)$ if it is differentiable at each $x$ in $(a, b)$.

Some important consequences of differentiability are listed below.

Proposition A.2.7: Let $f, g : (a, b) \to \mathbb{R}$, $(a, b) \subset \mathbb{R}$. Then

(i) $f$ differentiable at $x_0$ in $(a, b)$ implies $f$ is continuous at $x_0$.

(ii) (Mean value theorem). $f$ differentiable on $(a, b)$, $f$ continuous on $[a, b]$ implies that for some $a < c < b$, $f(b) - f(a) = (b - a) f'(c)$.

(iii) (Maxima and minima). $f$ differentiable at $x_0$ and for some $\delta > 0$, $f(x) \le f(x_0)$ for all $x \in (x_0 - \delta, x_0 + \delta)$ implies that $f'(x_0) = 0$.

(iv) (Sums, products and ratios). $f, g$ differentiable at $x_0$ implies that for any $\alpha, \beta$ in $\mathbb{R}$, $(\alpha f + \beta g)$ and $fg$ are differentiable at $x_0$ with
\[ (\alpha f + \beta g)'(x_0) = \alpha f'(x_0) + \beta g'(x_0), \]
\[ (fg)'(x_0) = f'(x_0) g(x_0) + f(x_0) g'(x_0), \]
and if $g(x_0) \ne 0$, then $f/g$ is differentiable at $x_0$ with
\[ (f/g)'(x_0) = \frac{f'(x_0) g(x_0) - f(x_0) g'(x_0)}{(g(x_0))^2}. \]

(v) (Chain rule). If $f$ is differentiable at $x_0$ and $g$ is differentiable at $f(x_0)$, then $h(x) \equiv g(f(x))$ is differentiable at $x_0$ with $h'(x_0) = g'(f(x_0)) f'(x_0)$.

(vi) (Differentiability of power series). Let $A(x) \equiv \sum_{n=0}^{\infty} a_n x^n$ be a power series with radius of convergence $\rho \equiv \big(\limsup_{n\to\infty} |a_n|^{1/n}\big)^{-1} > 0$. Then $A(\cdot)$ is differentiable infinitely many times on $(-\rho, \rho)$ and for $x$ in $(-\rho, \rho)$,
\[ \frac{d^k A(x)}{dx^k} = \sum_{n=k}^{\infty} n(n-1)\cdots(n-k+1)\, a_n x^{n-k}, \quad k \ge 1. \]
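Termwise differentiation of a power series can be checked numerically in a case where the sum is known in closed form. The sketch below takes the illustrative choice $a_n = 1$ for all $n$, so that $A(x) = 1/(1-x)$ on $(-1, 1)$ and $d^k A(x)/dx^k = k!/(1-x)^{k+1}$. The truncation level `terms` and the function name `series_derivative` are assumptions of this sketch.

```python
from math import factorial, perm


def series_derivative(coeff, x, k, terms=200):
    """Truncated k-th termwise derivative: sum of n(n-1)...(n-k+1) a_n x^(n-k)."""
    return sum(perm(n, k) * coeff(n) * x ** (n - k) for n in range(k, terms))


x0 = 0.5
for k in (1, 2, 3):
    approx = series_derivative(lambda n: 1.0, x0, k)
    exact = factorial(k) / (1 - x0) ** (k + 1)
    print(k, approx, exact)
```

Here `math.perm(n, k)` computes the falling factorial $n(n-1)\cdots(n-k+1)$. The truncated sums agree with the closed form to many digits because $|x_0|$ is well inside the radius of convergence.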


Remark A.2.3: It should be noted that the converse to (i) in the above proposition does not hold. For example, the function $f(x) = |x|$ is continuous at $x_0 = 0$ but is not differentiable at $x_0$. Indeed, Weierstrass showed that there exists a function $f : [0, 1] \to \mathbb{R}$ such that it is continuous on $[0, 1]$ but is not differentiable at any $x$ in $(0, 1)$. Also note that the mean value theorem implies that if $f'(\cdot) \ge 0$ on $(a, b)$, then $f$ is nondecreasing on $(a, b)$.

Definition A.2.11: (Taylor series). Let $f$ be a map from $I \equiv (a - \eta, a + \eta)$ to $\mathbb{R}$ for some $a \in \mathbb{R}$, $\eta > 0$. Suppose $f$ is $n$ times differentiable in $I$, for each $n \ge 1$. Let $a_n = \frac{f^{(n)}(a)}{n!}$. Then the power series $\sum_{n=0}^{\infty} a_n (x - a)^n = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!} (x - a)^n$ is called the Taylor series of $f$ at $a$.

Remark A.2.4: Let $f$ be as in Definition A.2.11. Taylor's remainder theorem says that for any $x$ in $I$ and any $n \ge 1$, if $f$ is $(n + 1)$ times differentiable in $I$, then
\[ \Big| f(x) - \sum_{j=0}^{n} a_j (x - a)^j \Big| \le \frac{|f^{(n+1)}(y_n)|}{(n+1)!}\, |x - a|^{n+1} \]
for some $y_n$ in $I$. Thus, if for some $\varepsilon > 0$, $\sup_{|y-a| < \,\cdots} |f^{(k)}(y)| \equiv \lambda_k$ satisfies …

… $> 0,\ \cos t = 0\}$. Set $\pi = 2t_0$. Since $\cos \frac{\pi}{2} = 0$, (iv) implies $\sin \frac{\pi}{2} = 1$ and hence that $e^{\iota \pi/2} = \iota$.

(vi) Clearly, $e^{\iota \pi/2} = \iota$ implies that $e^{\iota\pi} = -1$ and $e^{\iota 2\pi} = 1$ and $e^{\iota 2\pi k} = 1$ for all integers $k$. Since $e^{\iota 2\pi} = 1$, it follows that $e^z = e^{z + \iota 2\pi}$ for all $z \in \mathbb{C}$. $\Box$

It is now possible to prove various results involving $\pi$ that one learns in calculus from the above definition. For example, that the arc length of the unit circle $\{z : |z| = 1\}$ is $2\pi$ and that $\int_{-\infty}^{\infty} \frac{1}{1 + x^2}\, dx = \pi$, etc. (Problems A.19 and A.20).

The following assertions about $e^z$ can be proved with some more effort.

Theorem A.3.2:

(i) $e^z = 1$ iff $z = 2\pi\iota k$ for some integer $k$.

(ii) The map $t \to e^{\iota t}$ from $\mathbb{R}$ is onto the unit circle.

(iii) For any $\omega \in \mathbb{C}$, $\omega \ne 0$, there is a $z \in \mathbb{C}$ such that $\omega = e^z$.


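For a proof of this theorem as well as more details on Theorem A.3.1, see Rudin (1987).

Theorem A.3.3: (Orthogonality of $\{e^{\iota 2\pi n t}\}_{n\in\mathbb{Z}}$). For any $n \in \mathbb{Z}$,
\[ \int_0^1 e^{\iota 2\pi n t}\, dt = 0 \ \text{ if } n \ne 0 \quad \text{and} \quad 1 \ \text{ if } n = 0. \]

Proof: Since $(e^{\iota t})' = \iota e^{\iota t}$, $(e^{\iota 2\pi n t})' = \iota 2\pi n\, e^{\iota 2\pi n t}$, $n \in \mathbb{Z}$, and so for $n \ne 0$,
\[ \int_0^1 e^{\iota 2\pi n t}\, dt = \frac{1}{\iota 2\pi n} \int_0^1 \big(e^{\iota 2\pi n t}\big)'\, dt = \frac{1}{\iota 2\pi n} \big(e^{\iota 2\pi n} - 1\big) = 0. \]

Corollary A.3.4: The family $\{\cos 2\pi n t : n = 0, 1, 2, \ldots\} \cup \{\sin 2\pi n t : n = 1, 2, \ldots\}$ are orthogonal in $L^2[0, 1]$ (Problem A.22), i.e., for any two $f, g$ in this family, $\int_0^1 f(x) g(x)\, dx = 0$ for $f \ne g$.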
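The orthogonality relation of Theorem A.3.3 is easy to confirm numerically. The sketch below approximates the integral over $[0, 1]$ by a midpoint Riemann sum; the number of subintervals `steps` and the function name `integral_exp` are illustrative choices, and the result is exact only up to floating-point rounding.

```python
import cmath


def integral_exp(n, steps=1000):
    """Midpoint-rule approximation of the integral of exp(i 2 pi n t) over [0, 1]."""
    h = 1.0 / steps
    total = sum(cmath.exp(2j * cmath.pi * n * (m + 0.5) * h) for m in range(steps))
    return h * total


for n in (-2, 0, 1, 3):
    print(n, abs(integral_exp(n)))  # close to 1 for n = 0 and close to 0 otherwise
```

For $n \ne 0$ with $|n|$ smaller than `steps`, the midpoint sum is a sum of equally spaced points on the unit circle and therefore vanishes apart from rounding error.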

A.4 Metric spaces

A.4.1 Basic definitions

This section reviews the following: metric spaces, Cauchy sequences, completeness, functions, continuity, compactness, convergence of sequences of functions, and uniform convergence.

Definition A.4.1: Let $S$ be a nonempty set. Let $d : S \times S \to \mathbb{R}_+ = [0, \infty)$ be such that

(i) $d(x, y) = d(y, x)$ for any $x, y$ in $S$.

(ii) $d(x, z) \le d(x, y) + d(y, z)$ for any $x, y, z$ in $S$.

(iii) $d(x, y) = 0$ iff $x = y$.

Such a $d$ is called a metric on $S$ and the pair $(S, d)$ a metric space. Property (ii) is called the triangle inequality.

Example A.4.1: Let $\mathbb{R}^k \equiv \{(x_1, \ldots, x_k) : x_i \in \mathbb{R},\ 1 \le i \le k\}$ be the $k$-dimensional Euclidean space. For $1 \le p < \infty$ and $x = (x_1, \ldots, x_k)$, $y = (y_1, y_2, \ldots, y_k) \in \mathbb{R}^k$, let
\[ d_p(x, y) = \Big( \sum_{i=1}^{k} |x_i - y_i|^p \Big)^{1/p}, \]
and $d_\infty(x, y) = \max\{|x_i - y_i| : 1 \le i \le k\}$. It can be shown that $d_p(\cdot, \cdot)$ is a metric on $\mathbb{R}^k$ for all $1 \le p \le \infty$ (Problem A.24).

Definition A.4.2: A sequence $\{x_n\}_{n\ge 1}$ in a metric space $(S, d)$ converges to an $x$ in $S$ if for every $\varepsilon > 0$, there is an $N_\varepsilon$ such that $n \ge N_\varepsilon \Rightarrow d(x_n, x) < \varepsilon$, and this is written as $\lim_{n\to\infty} x_n = x$.

Definition A.4.3: A sequence $\{x_n\}_{n\ge 1}$ in a metric space $(S, d)$ is Cauchy if for all $\varepsilon > 0$, there exists $N_\varepsilon$ such that $n, m \ge N_\varepsilon \Rightarrow d(x_n, x_m) < \varepsilon$.

Definition A.4.4: A metric space $(S, d)$ is complete if every Cauchy sequence $\{x_n\}_{n\ge 1}$ converges to some $x$ in $S$.

Example A.4.2: (a) Let $S = \mathbb{Q}$, the set of rationals, and $d(x, y) \equiv |x - y|$. Then $(\mathbb{Q}, d)$ is a metric space that is not complete. (b) Let $S = \mathbb{R}$ and $d(x, y) = |x - y|$. Then $(\mathbb{R}, d)$ is complete (cf. Proposition A.2.3). (c) Let $S = \mathbb{R}^k$. Then $(\mathbb{R}^k, d_p)$ is complete for every $1 \le p \le \infty$, where $d_p$ is as in Example A.4.1.

Remark A.4.1: (Completion of an incomplete metric space). Let $(S, d)$ be a metric space. Let $\tilde S$ be the set of all Cauchy sequences in $S$. Identify each $x$ in $S$ with the Cauchy sequence $\{x_n = x\}_{n\ge 1}$. Define a function from $\tilde S \times \tilde S$ to $\mathbb{R}_+$ by
\[ \tilde d\big(\{x_n\}_{n\ge 1}, \{y_n\}_{n\ge 1}\big) = \limsup_{n\to\infty} d(x_n, y_n). \]

It is easy to verify that $\tilde d$ is symmetric and satisfies the triangle inequality. Define $s_1 = \{x_n\}_{n\ge 1}$ and $s_2 = \{y_n\}_{n\ge 1}$ to be equivalent (write $\{x_n\} \sim \{y_n\}$) if $\tilde d(s_1, s_2) = 0$. Let $\bar S$ be the set of all equivalence classes in $\tilde S$ and define $\bar d(c_1, c_2) \equiv \tilde d(s_1, s_2)$, where $c_1, c_2$ are equivalence classes and $s_1, s_2$ are arbitrary elements of $c_1$ and $c_2$, respectively. It can now be verified that $(\bar S, \bar d)$ is a complete metric space and (S, d) is embedded in $(\bar S, \bar d)$ by identifying each x in S with the equivalence class containing the sequence $\{x_n = x\}_{n\ge 1}$.

Definition A.4.5: A metric space (S, d) is separable if there exists a subset D ⊂ S that is countable and dense in S, i.e., for each x in S and ε > 0, there is a y in D such that d(x, y) < ε.

Example A.4.3: By the Archimedean property, Q is dense in R. Similarly $\mathbb{Q}^k$, the set of all k-vectors with components from Q, is dense in $\mathbb{R}^k$.
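As an illustration of the metrics $d_p$ of Example A.4.1, here is a small sketch that evaluates $d_p$ for several p and spot-checks the triangle inequality on a handful of points. It is not from the original text; the function names `d_p` and `triangle_inequality_holds`, the sample points, and the tolerance are assumptions made for the example, and a finite spot check is of course not a proof.

```python
import itertools
import math


def d_p(x, y, p):
    """Distance d_p(x, y) on R^k; p = math.inf gives the max metric d_infinity."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == math.inf:
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


def triangle_inequality_holds(points, p, tol=1e-12):
    """Spot-check d(x, z) <= d(x, y) + d(y, z) over all triples of the given points."""
    return all(
        d_p(x, z, p) <= d_p(x, y, p) + d_p(y, z, p) + tol
        for x, y, z in itertools.product(points, repeat=3)
    )


if __name__ == "__main__":
    x, y = (0.0, 0.0), (3.0, 4.0)
    # d_p(x, y) decreases toward d_infinity(x, y) as p grows.
    for p in (1, 2, 10, 50, math.inf):
        print(p, d_p(x, y, p))
    grid = [(a, b) for a in (-1.0, 0.0, 2.5) for b in (-2.0, 0.5, 3.0)]
    print(all(triangle_inequality_holds(grid, p) for p in (1, 2, 3, math.inf)))
```

The printed values for large p approach the value for p = ∞, which is the behavior Problem A.24 (b) asks one to establish.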


Definition A.4.6: A metric space (S, d) is called Polish if it is complete and separable.

Example A.4.4: $(\mathbb{R}^k, d_p)$ in Example A.4.2 is Polish.

A.4.2 Continuous functions

Let (S, d) and (T, ρ) be two metric spaces. Let f : S → T be a map from S to T.

Definition A.4.7:
(a) f is continuous at p in S if for each ε > 0, there exists δ > 0 such that d(x, p) < δ ⇒ ρ(f(x), f(p)) < ε. (Here the δ may depend on ε and p.)
(b) f is continuous on a set B ⊂ S if it is continuous at every p ∈ B.
(c) f is uniformly continuous on B if for each ε > 0, there exists δ > 0 such that for each pair x, y in B, d(x, y) < δ ⇒ ρ(f(x), f(y)) < ε.

Definition A.4.8: Let (S, d) be a metric space.
(a) A set O ⊂ (S, d) is open if x ∈ O ⇒ there exists δ > 0 such that d(x, y) < δ ⇒ y ∈ O. That is, at every point x in O, an open ball $B_x(\delta) \equiv \{y : d(x, y) < \delta\}$ of positive radius δ is a subset of O.
(b) A set C ⊂ (S, d) is closed if $C^c$ is open.

Theorem A.4.1: Let (S, d) and (T, ρ) be metric spaces. A map f : S → T is continuous on S iff for each O open in T, $f^{-1}(O)$ is open in S.

The proof is left as an exercise (Problem A.28).
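The remark in Definition A.4.7 (a) that δ may depend on both ε and p can be made concrete with f(x) = x² on R (compare Problem A.33). For p ≥ 0, the largest admissible δ is $\sqrt{p^2 + \varepsilon} - p$, which shrinks to 0 as p grows, so no single δ serves all of R, although one does on any bounded set. The sketch below is illustrative only; the names `largest_delta` and `delta_works` are hypothetical helpers, and the check samples finitely many points rather than proving the implication.

```python
import math


def largest_delta(p, eps):
    """Largest delta with |x - p| < delta => |x**2 - p**2| < eps, for p >= 0.

    Algebraically equal to sqrt(p**2 + eps) - p, written in a form that
    avoids cancellation for large p.
    """
    return eps / (math.sqrt(p * p + eps) + p)


def delta_works(p, eps, delta, samples=1000):
    """Check the epsilon-delta implication at sample points of (p - delta, p + delta)."""
    for k in range(1, samples):
        x = p - delta + 2.0 * delta * k / samples
        if abs(x * x - p * p) >= eps:
            return False
    return True


if __name__ == "__main__":
    eps = 0.1
    for p in (0.0, 1.0, 10.0, 1000.0):
        delta = largest_delta(p, eps)
        # delta shrinks as p grows: continuity at each p, but not uniformly on R.
        print(p, delta, delta_works(p, eps, delta))
```

On a bounded set such as [0, M], the value `largest_delta(M, eps)` works at every point, which is the uniform continuity asserted in Problem A.33.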

A.4.3 Compactness

Definition A.4.9: A collection of open sets $\{O_\alpha : \alpha \in I\}$ is an open cover for a set B ⊂ (S, d) if for each x ∈ B, there exists α ∈ I such that $x \in O_\alpha$.

Example A.4.5: Let B = (0, 1). Then the collection $\{(\alpha - \frac{\alpha}{2},\ \alpha + \frac{1-\alpha}{2}) : \alpha \in \mathbb{Q} \cap (0, 1)\}$ is an open cover for B.

Definition A.4.10: Let (S, d) be a metric space. A set K ⊂ S is called compact if given any open cover $\{O_\alpha : \alpha \in I\}$ for K, there exists a finite subcollection $\{O_{\alpha_i} : \alpha_i \in I,\ i = 1, 2, \ldots, n,\ n < \infty\}$ that is an open cover for K.

Example A.4.6: The set B = (0, 1) is not compact, as the open cover in Example A.4.5 above does not admit a finite subcover.


The next result is the well-known Heine-Borel theorem.

Theorem A.4.2:
(i) For any −∞ < a < b < ∞, the closed interval [a, b] is compact in R.
(ii) Any K ⊂ R is compact iff it is bounded and closed.

For a proof, see Rudin (1976). From Theorem A.4.1, it is seen that the inverse image of an open set under a continuous function is open, but the forward image may not have this property. However, the following is true.

Theorem A.4.3: Let (S, d) and (T, ρ) be two metric spaces and let f : (S, d) → (T, ρ) be continuous. Let K ⊂ S be compact. Then f(K) is compact.

The proof is left as an exercise (Problem A.35).

A.4.4 Sequences of functions and uniform convergence

Definition A.4.11: Let (S, d) and (T, ρ) be two metric spaces and let $\{f_n\}_{n\ge 1}$ be a sequence of functions from (S, d) to (T, ρ). The sequence $\{f_n\}_{n\ge 1}$ is said to:
(a) converge pointwise to f on a set A ⊂ S if $\lim_{n\to\infty} f_n(x) = f(x)$ for each x in A;
(b) converge uniformly to f on a set A ⊂ S if for each ε > 0, there exists $N_\varepsilon > 0$ (depending on ε and A) such that $n \ge N_\varepsilon \Rightarrow \rho(f_n(x), f(x)) < \varepsilon$ for all x in A.

A consequence of uniform convergence is the preservation of the continuity property.

Theorem A.4.4: Let (S, d) and (T, ρ) be two metric spaces and let $\{f_n\}_{n\ge 1}$ be a sequence of functions from (S, d) to (T, ρ). For each n ≥ 1, let $f_n$ be continuous on A ⊂ S. Let $\{f_n\}_{n\ge 1}$ converge to f uniformly on A. Then f is continuous on A.

Proof: The proof is based on the “break up into three parts” idea. By the triangle inequality,

$$\rho(f(x), f(y)) \le \rho(f(x), f_n(x)) + \rho(f_n(x), f_n(y)) + \rho(f_n(y), f(y)).$$

Fix x in A. By the uniform convergence on A, $\sup\{\rho(f_n(u), f(u)) : u \in A\} \to 0$ as n → ∞. So for each ε > 0, there exists $N_\varepsilon < \infty$ such that $n \ge N_\varepsilon \Rightarrow \rho(f_n(u), f(u)) < \frac{\varepsilon}{3}$ for all u in A. Now since $f_{N_\varepsilon}(\cdot)$ is continuous on A, there exists a δ > 0 (depending on $N_\varepsilon$ and x) such that $d(x, y) < \delta,\ y \in A \Rightarrow \rho(f_{N_\varepsilon}(y), f_{N_\varepsilon}(x)) < \frac{\varepsilon}{3}$. Thus, taking $n = N_\varepsilon$ in the triangle inequality, $y \in A,\ d(x, y) < \delta \Rightarrow \rho(f(x), f(y)) < \frac{2\varepsilon}{3} + \frac{\varepsilon}{3} = \varepsilon$.
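The difference between pointwise and uniform convergence in Definition A.4.11 can be seen with $f_n(x) = x^n$ and the zero function as limit (the setting of Problem A.36). The following sketch is an illustration under stated assumptions, not part of the original text: `sup_distance_to_zero` approximates a supremum on a finite grid, and `witness_point` is a hypothetical helper returning a point of (0, 1) where $f_n$ equals 1/2.

```python
def f(n, x):
    """The function f_n(x) = x**n."""
    return x ** n


def sup_distance_to_zero(n, a, b, num_points=10001):
    """Grid approximation of sup{|f_n(x) - 0| : a <= x <= b}."""
    step = (b - a) / (num_points - 1)
    return max(abs(f(n, a + k * step)) for k in range(num_points))


def witness_point(n):
    """A point x_n in (0, 1) with f_n(x_n) = 1/2, so sup over (0, 1) is at least 1/2."""
    return 0.5 ** (1.0 / n)


if __name__ == "__main__":
    for n in (1, 10, 50, 200):
        # On [-0.9, 0.9] the sup distance tends to 0: uniform convergence.
        print(n, sup_distance_to_zero(n, -0.9, 0.9))
        # On (0, 1) there is always a point where f_n is 1/2: not uniform.
        x_n = witness_point(n)
        print(n, x_n, f(n, x_n))
```

Because the witness points stay inside (0, 1) while $f_n$ remains at 1/2 there, no $N_\varepsilon$ can work for ε < 1/2 on all of (0, 1), even though $f_n(x) \to 0$ at each fixed x.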


A.5 Problems

A.1 Express the following sets in the form {x : x has property p}.
(a) The set A of all integers which when divided by 7 leave a remainder ≤ 3.
(b) The set B of all functions from [0, 1] to R with at most two discontinuity points.
(c) The set C of all students at a given university who are graduate students with at least one course in mathematics at the graduate level.
(d) The set D of all algebraic numbers. (A number x is called an algebraic number if it is the root of a polynomial with rational coefficients.)
(e) The set E of all possible sequences whose elements are either 0 or 1.

A.2 Give an example of sets $A_1, A_2$ such that $A_1 \cap A_2 = A_1 \cup A_2$.

A.3 Let I = [0, 1], Ω = R and for α ∈ R, $A_\alpha = (\alpha - 1, \alpha + 1)$, the open interval {x : α − 1 < x < α + 1}.
(a) Show that $\cup_{\alpha\in I} A_\alpha = (-1, 2)$ and $\cap_{\alpha\in I} A_\alpha = (0, 1)$.
(b) Suppose J = {x : x ∈ I, x is rational}. Find $\cup_{x\in J} A_x$ and $\cap_{x\in J} A_x$.

A.4 With Ω ≡ N ≡ {1, 2, 3, . . .}, find $A^c$ in the following cases:
(a) A = {ω : ω is divisible by 2 or 3 or both}. If $\omega \in A^c$, what can be said about its prime factors?
(b) A = {ω : ω is divisible by 15 and 16}.
(c) A = {ω : ω is a perfect square}.

A.5 Show that $X \equiv \{0, 1\}^{\mathbb{N}}$, the set of all sequences $\{\omega_i\}_{i\in\mathbb{N}}$ where each $\omega_i \in \{0, 1\}$, is uncountable. Conclude that P(N) is uncountable.

A.6 Show that if $\Omega_i$ is countable for each i ∈ N, then for each k ∈ N, $\times_{i=1}^{k} \Omega_i$ is countable and $\cup_{i\in\mathbb{N}} \Omega_i$ is also countable, but $\times_{i\in\mathbb{N}} \Omega_i$ is not countable.

A.7 Show that the set of all polynomials in x with integer coefficients is countable.

A.8 Show that the well ordering property implies the principle of induction.

A.9 Apply the principle of induction to establish the following:

(a) For each n ∈ N, $\sum_{j=1}^{n} j^2 = \frac{n(n+1)(2n+1)}{6}$.
(b) For each n ∈ N, $x_1, x_2, \ldots, x_k \in \mathbb{R}$,
(i) (The binomial formula). $(x_1 + x_2)^n = \sum_{r=0}^{n} \binom{n}{r} x_1^{r} x_2^{n-r}$.
(ii) (The multinomial formula). $(x_1 + x_2 + \ldots + x_k)^n = \sum \frac{n!}{r_1!\, r_2! \cdots r_k!}\, x_1^{r_1} x_2^{r_2} \cdots x_k^{r_k}$, where the summation extends over all $(r_1, r_2, \ldots, r_k)$ such that $r_i \in \mathbb{N}$, $0 \le r_i \le n$, $\sum_{i=1}^{k} r_i = n$.

A.10 Verify that on R, the relation x ∼ y if x − y is rational is an equivalence relation but the relation x ∼ y if x − y is irrational is not.

A.11 Show that the set $A = \{r : r \in \mathbb{Q},\ r^2 < 2\}$ is bounded above in Q but has no l.u.b. in Q.

A.12 Show that for any two sequences $\{x_n\}_{n\ge 1}$, $\{y_n\}_{n\ge 1} \subset \mathbb{R}$,

$$\liminf_{n\to\infty} x_n + \liminf_{n\to\infty} y_n \le \liminf_{n\to\infty} (x_n + y_n) \le \limsup_{n\to\infty} (x_n + y_n) \le \limsup_{n\to\infty} x_n + \limsup_{n\to\infty} y_n.$$

A.13 Verify that $\lim_{n\to\infty} x_n = a \in \mathbb{R}$ iff $\liminf_{n\to\infty} x_n = \limsup_{n\to\infty} x_n = a$.

A.14 Establish Proposition A.2.3. (Hint: First show that a Cauchy sequence is bounded and then show that $\liminf_{n\to\infty} x_n = \limsup_{n\to\infty} x_n$.)

A.15 (a) Prove Proposition A.2.4 by comparison with the geometric series.
(b) Show that for integer k ≥ 1, the power series $\sum_{n=k}^{\infty} n(n-1)\cdots(n-k+1)\, a_n x^{n-k}$ has the same radius of convergence as $\sum_{n=0}^{\infty} a_n x^n$.

A.16 Show that the series $\sum_{j=2}^{\infty} \frac{1}{j(\log j)^p}$ converges for p > 1 and diverges for p ≤ 1.

A.17 Find the radius of convergence, ρ, for the power series $A(x) \equiv \sum_{n=0}^{\infty} a_n x^n$ where
(a) $a_n = \frac{n}{n+1}$, n ≥ 0.
(b) $a_n = n^p$, n ≥ 0, p ∈ R.
(c) $a_n = \frac{1}{n!}$, n ≥ 0 (where 0! = 1).

A.18 (a) Find the Taylor series at a = 0 for the function $f(x) = \frac{1}{1-x}$ in I ≡ (−1, +1) and show that it converges to f(x) on I.
(b) Find the Taylor series of $1 + x + x^2$ in I = (1, 3), centered at 2.
(c) Let

$$f(x) = \begin{cases} e^{-1/x^2} & \text{if } |x| < 1,\ x \neq 0, \\ 0 & \text{if } x = 0. \end{cases}$$

(i) Show that f is infinitely differentiable at 0 and compute $f^{(j)}(0)$ for all j ≥ 1.
(ii) Show that the Taylor series at a = 0 converges but not to f on (−1, 1).

A.19 Let S = {z : z ∈ C, |z| = 1} be the unit circle. Using the parameterization $t \mapsto e^{\iota t} = (\cos t + \iota \sin t)$ from [0, 2π] to S, show that the arc length of S (i.e., the circumference of the unit circle) is 2π.

A.20 Set $\varphi(t) = \frac{\sin t}{\cos t}$ for $-\frac{\pi}{2} < t < \frac{\pi}{2}$. Verify that $\varphi' = 1 + \varphi^2$ and that $\varphi : (-\frac{\pi}{2}, \frac{\pi}{2}) \to (-\infty, \infty)$ is strictly monotone increasing and onto. Conclude that

$$\int_{-\infty}^{\infty} \frac{1}{1 + x^2}\,dx = \int_{-\pi/2}^{\pi/2} \frac{\varphi'(t)}{1 + (\varphi(t))^2}\,dt = \pi.$$

A.21 Using the property that $e^{\iota\pi/2} = \iota$, verify that for all t in R,

$$\cos(\tfrac{\pi}{2} - t) = \sin t, \qquad \sin(\tfrac{\pi}{2} - t) = \cos t,$$
$$\cos(\pi + t) = -\cos t, \qquad \sin(\pi + t) = -\sin t,$$
$$\cos(2\pi + t) = \cos t, \qquad \sin(2\pi + t) = \sin t.$$

Also show that cos t is a strictly decreasing map from [0, π] onto [−1, 1] and that sin t is a strictly increasing map from $[-\frac{\pi}{2}, \frac{\pi}{2}]$ onto [−1, 1].

A.22 Using (i) of Theorem A.3.1, express $\cos(t_1 + t_2)$, $\sin(t_1 + t_2)$ in terms of $\cos t_i$, $\sin t_i$, i = 1, 2, and in turn use this to prove Corollary A.3.4 from Theorem A.3.3.

A.23 Verify that $p_n(z) \equiv (1 + \frac{z}{n})^n$ converges to $e^z$ uniformly on bounded sets in C.

A.24 (a) Verify that for p = 1, p = 2 and p = ∞, $d_p$ is a metric on $\mathbb{R}^k$.
(b) Show that for fixed x and y, $\varphi(p) \equiv d_p(x, y)$ is continuous in p on [1, ∞].

(c) Draw the open unit ball $B_p \equiv \{x : x \in \mathbb{R}^2,\ d_p(x, 0) < 1\}$ in $\mathbb{R}^2$ for p = 1, 2 and ∞.

A.25 Let S = C[0, 1] be the set of all real valued continuous functions on [0, 1]. Now let

$$d_1(f, g) = \int_0^1 |f(x) - g(x)|\,dx \quad \text{(area metric)},$$
$$d_2(f, g) = \Big(\int_0^1 |f(x) - g(x)|^2\,dx\Big)^{1/2} \quad \text{(least square metric)},$$
$$d_\infty(f, g) = \sup\{|f(x) - g(x)| : 0 \le x \le 1\} \quad \text{(sup metric)}.$$

Show that all these are metrics on S.

A.26 Let $S = \mathbb{R}^\infty \equiv \{\{x_n\}_{n\ge 1} : x_n \in \mathbb{R} \text{ for all } n \ge 1\}$ be the space of all sequences of real numbers. Let $d(\{x_n\}_{n\ge 1}, \{y_n\}_{n\ge 1}) = \sum_{j=1}^{\infty} \Big(\frac{|x_j - y_j|}{1 + |x_j - y_j|}\Big) \frac{1}{2^j}$. Show that (S, d) is a Polish space.

A.27 If $s_k = \{x_{kn}\}_{n\ge 1}$ and $s = \{x_n\}_{n\ge 1}$ are elements of $S = \mathbb{R}^\infty$ as in Problem A.26, verify that as k → ∞, $s_k \to s$ iff $x_{kn} \to x_n$ for all n ≥ 1.

A.28 Establish Theorem A.4.1.

A.29 Let S = C[0, 1] and $d_p(f, g) \equiv \Big(\int_0^1 |f(t) - g(t)|^p\,dt\Big)^{1/p}$ for 1 ≤ p < ∞ and $d_\infty(f, g) = \sup\{|f(t) - g(t)| : t \in [0, 1]\}$.
(a) Let f(t) ≡ 1. Let $f_n(t) \equiv 1$ for $0 \le t \le 1 - \frac{1}{n}$, and $f_n(t) = n(1 - t)$ for $1 - \frac{1}{n} \le t \le 1$. Show that $d_p(f_n, f) \to 0$ for 1 ≤ p < ∞ but $d_\infty(f_n, f) \not\to 0$.
(b) Fix f ∈ C[0, 1]. Let $g_n(t) = f(t)$, $0 \le t \le 1 - \frac{1}{n}$, and $g_n(t) = f(1 - \frac{1}{n}) + \big(f(1) - f(1 - \frac{1}{n})\big)\, n\,(t + \frac{1}{n} - 1)$, $1 - \frac{1}{n} \le t \le 1$. Show that $d_p(g_n, f) \to 0$ for all 1 ≤ p ≤ ∞.

A.30 Show that if $\{x_n\}_{n\ge 1}$ is a convergent sequence in a metric space (S, d), then it is Cauchy.

A.31 Verify (b) of Example A.4.2 from the axioms of real numbers (cf. Proposition A.2.3). Verify (c) of the same example from (b).

A.32 Let S = C[0, 1] and d be the supremum metric, i.e., d(f, g) = sup{|f(x) − g(x)| : 0 ≤ x ≤ 1}. By approximating any continuous function with piecewise linear functions with rational end points and rational values, show that (S, d) is Polish, i.e., it is complete and separable.


A.33 Show that the function $f(x) = x^2$ is continuous on R, uniformly so on any bounded set B ⊂ R but not uniformly on R.

A.34 Show that unions of open sets are open and the intersection of any two open sets is open. Give an example to show that the intersection of an infinite number of open sets need not be open.

A.35 Prove Theorem A.4.3.

A.36 Let $f_n(x) = x^n$ and g(x) ≡ 0 on R. Show that $\{f_n\}_{n\ge 1}$ converges pointwise to g on (−1, 1), uniformly on [a, b] for −1 < a < b < 1, but not uniformly on (0, 1).

A.37 Let $\{f_n\}_{n\ge 1}$, f ∈ C[0, 1]. Let $\{f_n\}_{n\ge 1}$ converge to f uniformly on [0, 1]. Show that $\lim_{n\to\infty} \int_0^1 |f_n(x) - f(x)|\,dx = 0$ and $\lim_{n\to\infty} \int_0^1 f_n(x)\,dx = \int_0^1 f(x)\,dx$.

A.38 Give a proof of Proposition A.2.6 (vi) (term by term differentiability of a power series) using Proposition A.2.7 (iv) (the fundamental theorem of Riemann integration).

Appendix B List of Abbreviations and Symbols

B.1 Abbreviations

a.c.        absolutely continuous (functions)
a.e.        almost everywhere
AR(1)       autoregressive process of order one
a.s.        almost sure(ly)
BCT         bounded convergence theorem
BGW         Bienaymé-Galton-Watson
cdf         cumulative distribution function
CE          conditional expectation
CLT         central limit theorem
CTMC        continuous time Markov chain
DCT         dominated convergence theorem
EDCT        extended dominated convergence theorem
fdds        finite dimensional distributions
iff         if and only if
IFS         iterated function system
iid         independent and identically distributed
IIIDRM      iterations of iid random maps
i.o.        infinitely often
LIL         law of the iterated logarithm
LLN         laws of large numbers
MBB         moving block bootstrap
m.c.f.a.    monotone continuity from above
m.c.f.b.    monotone continuity from below
MCMC        Markov chain Monte Carlo
MCT         monotone convergence theorem
o.n.b.      orthonormal basis
r.c.p.      regular conditional probability
SBM         standard Brownian motion
SLLN        strong law of large numbers
s.o.c.      second order correctness
SSRW        simple symmetric random walk
UI          uniform integrability
WLLN        weak law of large numbers
w.p. 1      with probability one
w.r.t.      with respect to
w.l.o.g.    without loss of generality

B.2 Symbols

$\mu \ll \nu$        absolute continuity of a measure
$\longrightarrow^d$        convergence in distribution
$\longrightarrow_p$        convergence in probability
$(\cdot) * (\cdot)$        convolution of measures, functions, etc.
$(\cdot)^*$        extension of a measure
$a \sim b$        a and b are equivalent (under an equivalence relation)
$a_n \sim b_n$        $\frac{a_n}{b_n} \to 1$ as n → ∞
$\lfloor a \rfloor$        the integer part of a, i.e., $\lfloor a \rfloor = k$ if k ≤ a < k + 1, k ∈ Z, a ∈ R
$\lceil a \rceil$        the smallest integer not less than a, i.e., $\lceil a \rceil = k + 1$ if k < a ≤ k + 1, k ∈ Z, a ∈ R
$\bar A$        closure of A
$A^c$        complement of a set A
$\partial A$        boundary of A
$A \,\triangle\, B$        symmetric difference of two sets A and B, i.e., $A \,\triangle\, B = (A \cap B^c) \cup (A^c \cap B)$
$\mathcal{B}(S)$        Borel σ-algebra on a metric space S such as $S = \mathbb{R}, \mathbb{R}^k, \mathbb{R}^\infty$
$B(S, \mathbb{R})$        ≡ {f | f : S → R, F-measurable, sup{|f(s)| : s ∈ S} ≤ 1}
$B(x, \epsilon)$, $B_x(\epsilon)$        open ball of radius ε with center at x in a metric space (S, d), i.e., {y : d(x, y) < ε}
$\mathbb{C}$        the set of all complex numbers
$C[a, b]$        = {f | f : [a, b] → R, f continuous}
$C_B(\mathbb{R})$        ≡ {f | f : R → R, f bounded and continuous}
$C_c(\mathbb{R})$        ≡ {f | f : R → R, continuous and f ≡ 0 outside a bounded interval}
$C_0(\mathbb{R})$        ≡ {f | f : R → R, continuous and $\lim_{|x|\to\infty} f(x) = 0$}
$C_0(S)$        = {f | f : S → R, f continuous and for every ε > 0, there exists a compact set $K_\epsilon$ such that |f(x)| < ε for $x \notin K_\epsilon$}
$C(F)$, $C_F$        the set of all continuity points of a cdf F
$\delta_{ij}$        Kronecker delta, i.e., $\delta_{ij} = 1$ if i = j and = 0 if i ≠ j
$\delta_x$        the probability distribution putting mass one at x
$\frac{d\mu}{d\nu}$        Radon-Nikodym derivative of µ w.r.t. ν
$E(Y | \mathcal{G})$        conditional expectation of Y given G
$H^\perp$        orthogonal complement of a subspace H of a Hilbert space
$\iota$        $\sqrt{-1}$
$I_A(\cdot)$        the indicator function of a set A
$\mathbb{I}_k$        the identity matrix of order k
$\lambda\langle\mathcal{A}\rangle$        λ-class generated by a class of sets A
$L^p(\Omega, \mathcal{F}, \mu)$        = {f | f : Ω → 𝔽, F-measurable, $\int |f|^p\,d\mu < \infty$}, with 𝔽 = R or C (𝔽 = C in Sections 5.6, 5.7 only)
$L^p(\mathbb{R})$        = $L^p(\mathbb{R}, \mathcal{B}(\mathbb{R}), m)$
$m$        the Lebesgue measure
$\mu_F$        Lebesgue-Stieltjes measure corresponding to F
$\mu \perp \nu$        singularity of measures µ and ν
$\mathbb{N}$        the set of natural numbers
$\emptyset$        the null set
$\Phi(\cdot)$        standard normal cdf, i.e., $\Phi(x) \equiv \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-u^2/2}\,du$, −∞ < x < ∞
$P(A | \mathcal{G})$        probability of A given G
$P_\lambda(\cdot)$        probability distribution of a Markov chain with initial distribution λ
$\mathcal{P}(\Omega)$        the power set of Ω = {A : A ⊂ Ω}
$P_x(\cdot)$        same as $P_\lambda$ with $\lambda = \delta_x$
$(\Omega, \mathcal{F}, P)$        generic probability space
$(\Omega, \mathcal{F}, \mu)$        generic measure space
$\mathbb{Q}$        the set of all rationals
$\mathbb{R}$        the set of real numbers, (−∞, ∞)
$\mathbb{R}_+$        the set of nonnegative real numbers, [0, ∞)
$\bar{\mathbb{R}}$        the set of all extended real numbers, [−∞, ∞]
$\bar{\mathbb{R}}_+$        = [0, ∞]
$(S, d)$        a metric space S with a metric d
$\sigma\langle\mathcal{A}\rangle$        σ-algebra generated by a class of sets A
$\sigma\langle\{f_a : a \in A\}\rangle$        σ-algebra generated by a collection of mappings $\{f_a : a \in A\}$
$\mathcal{T}$        tail σ-algebra
$|z|$        = $\sqrt{a^2 + b^2}$, the absolute value of a complex number z = a + ιb, a, b ∈ R
$\mathrm{Re}(z)$        = a, the real part of a complex number z = a + ιb, a, b ∈ R
$\mathrm{Im}(z)$        = b, the imaginary part of a complex number z = a + ιb, a, b ∈ R
$\mathbb{Z}$        the set of all integers = {0, ±1, ±2, . . .}
$\mathbb{Z}_+$        the set of all nonnegative integers = {0, 1, 2, . . .}

References

Arcones, M. A. and Giné, E. (1989), ‘The bootstrap of the mean with arbitrary bootstrap sample size’, Ann. Inst. H. Poincaré Probab. Statist. 25(4), 457–481.

Arcones, M. A. and Giné, E. (1991), ‘Additions and correction to: “The bootstrap of the mean with arbitrary bootstrap sample size” [Ann. Inst. H. Poincaré Probab. Statist. 25(4) (1989), 457–481]’, Ann. Inst. H. Poincaré Probab. Statist. 27(4), 583–595.

Athreya, K. B. (1986), ‘Darling and Kac revisited’, Sankhyā A 48(3), 255–266.

Athreya, K. B. (1987a), ‘Bootstrap of the mean in the infinite variance case’, Ann. Statist. 15(2), 724–731.

Athreya, K. B. (1987b), Bootstrap of the mean in the infinite variance case, in ‘Proceedings of the 1st World Congress of the Bernoulli Society’, Vol. 2, VNU Sci. Press, Utrecht, pp. 95–98.

Athreya, K. B. (2000), ‘Change of measures for Markov chains and the l log l theorem for branching processes’, Bernoulli 6, 323–338.

Athreya, K. B. (2004), ‘Stationary measures for some Markov chain models in ecology and economics’, Econom. Theory 23(1), 107–122.

Athreya, K. B., Doss, H. and Sethuraman, J. (1996), ‘On the convergence of the Markov chain simulation method’, Ann. Statist. 24(1), 69–100.

Athreya, K. B. and Jagers, P., eds (1997), Classical and Modern Branching Processes, Vol. 84 of The IMA Volumes in Mathematics and its Applications, Springer-Verlag, New York.

Athreya, K. B. and Ney, P. (1978), ‘A new approach to the limit theory of recurrent Markov chains’, Trans. Amer. Math. Soc. 245, 493–501.

Athreya, K. B. and Ney, P. E. (2004), Branching Processes, Dover Publications, Inc, Mineola, NY. (Reprint of Band 196, Grundlehren der Mathematischen Wissenschaften, Springer-Verlag, Berlin).

Athreya, K. B. and Pantula, S. G. (1986), ‘Mixing properties of Harris chains and autoregressive processes’, J. Appl. Probab. 23(4), 880–892.

Athreya, K. B. and Stenflo, O. (2003), ‘Perfect sampling for Doeblin chains’, Sankhyā A 65(4), 763–777.

Bahadur, R. R. (1966), ‘A note on quantiles in large samples’, Ann. Math. Statist. 37, 577–580.

Barnsley, M. F. (1992), Fractals Everywhere, 2nd edn, Academic Press, New York.

Berbee, H. C. P. (1979), Random Walks with Stationary Increments and Renewal Theory, Mathematical Centre, Amsterdam.

Berry, A. C. (1941), ‘The accuracy of the Gaussian approximation to the sum of independent variates’, Trans. Amer. Math. Soc. 48, 122–136.

Bhatia, R. (2003), Fourier Series, 2nd edn, Hindustan Book Agency, New Delhi, India.

Bhattacharya, R. N. and Rao, R. R. (1986), Normal Approximation and Asymptotic Expansions, Robert E. Krieger, Melbourne, FL.

Billingsley, P. (1968), Convergence of Probability Measures, John Wiley, New York.

Billingsley, P. (1995), Probability and Measure, 3rd edn, John Wiley, New York.

Bradley, R. C. (1983), ‘Approximation theorems for strongly mixing random variables’, Michigan Math. J. 30(1), 69–81.

Brillinger, D. R. (1975), Time Series. Data Analysis and Theory, Holt, Rinehart and Winston, Inc, New York.

Carlstein, E. (1986), ‘The use of subseries values for estimating the variance of a general statistic from a stationary sequence’, Ann. Statist. 14(3), 1171–1179.

Chanda, K. C. (1974), ‘Strong mixing properties of linear stochastic processes’, J. Appl. Probab. 11, 401–408.

Chow, Y.-S. and Teicher, H. (1997), Probability Theory: Independence, Interchangeability, Martingales, Springer-Verlag, New York.

Chung, K. L. (1967), Markov Chains with Stationary Transition Probabilities, 2nd edn, Springer-Verlag, New York.

Chung, K. L. (1974), A Course in Probability Theory, 2nd edn, Academic Press, New York.

Cohen, P. (1966), Set Theory and the Continuum Hypothesis, Benjamin, New York.

Doob, J. L. (1953), Stochastic Processes, John Wiley, New York.

Doukhan, P., Massart, P. and Rio, E. (1994), ‘The functional central limit theorem for strongly mixing processes’, Ann. Inst. H. Poincaré Probab. Statist. 30, 63–82.

Durrett, R. (2001), Essentials of Stochastic Processes, Springer-Verlag, New York.

Durrett, R. (2004), Probability: Theory and Examples, 3rd edn, Duxbury Press, San Jose, CA.

Efron, B. (1979), ‘Bootstrap methods: Another look at the jackknife’, Ann. Statist. 7(1), 1–26.

Esseen, C.-G. (1942), ‘Rate of convergence in the central limit theorem’, Ark. Mat. Astr. Fys. 28A(9).

Esseen, C.-G. (1945), ‘Fourier analysis of distribution functions. A mathematical study of the Laplace-Gaussian law’, Acta Math. 77, 1–125.

Etemadi, N. (1981), ‘An elementary proof of the strong law of large numbers’, Z. Wahrsch. Verw. Gebiete 55(1), 119–122.

Feller, W. (1966), An Introduction to Probability Theory and Its Applications, Vol. II, John Wiley, New York.

Feller, W. (1968), An Introduction to Probability Theory and Its Applications, Vol. I, 3rd edn, John Wiley, New York.

Geman, S. and Geman, D. (1984), ‘Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images’, IEEE Trans. Pattern Analysis Mach. Intell. 6, 721–741.

Giné, E. and Zinn, J. (1989), ‘Necessary conditions for the bootstrap of the mean’, Ann. Statist. 17(2), 684–691.

Gnedenko, B. V. and Kolmogorov, A. N. (1968), Limit Distributions for Sums of Independent Random Variables, Revised edn, Addison-Wesley, Reading, MA.

Gorodetskii, V. V. (1977), ‘On the strong mixing property for linear sequences’, Theory Probab. 22, 411–413.

Götze, F. and Hipp, C. (1978), ‘Asymptotic expansions in the central limit theorem under moment conditions’, Z. Wahrsch. Verw. Gebiete 42, 67–87.

Götze, F. and Hipp, C. (1983), ‘Asymptotic expansions for sums of weakly dependent random vectors’, Z. Wahrsch. Verw. Gebiete 64, 211–239.

Hall, P. (1985), ‘Resampling a coverage pattern’, Stochastic Process. Appl. 20, 231–246.

Hall, P. (1992), The Bootstrap and Edgeworth Expansion, Springer-Verlag, New York.

Hall, P. G. and Heyde, C. C. (1980), Martingale Limit Theory and Its Applications, Academic Press, New York.

Hall, P., Horowitz, J. L. and Jing, B.-Y. (1995), ‘On blocking rules for the bootstrap with dependent data’, Biometrika 82, 561–574.

Herrndorf, N. (1983), ‘Stationary strongly mixing sequences not satisfying the central limit theorem’, Ann. Probab. 11, 809–813.

Hewitt, E. and Stromberg, K. (1965), Real and Abstract Analysis, Springer-Verlag, New York.

Hoel, P. G., Port, S. C. and Stone, C. J. (1972), Introduction to Stochastic Processes, Houghton-Mifflin, Boston, MA.

Ibragimov, I. A. and Rozanov, Y. A. (1978), Gaussian Random Processes, Springer-Verlag, Berlin.

Karatzas, I. and Shreve, S. E. (1991), Brownian Motion and Stochastic Calculus, 2nd edn, Springer-Verlag, New York.

Karlin, S. and Taylor, H. M. (1975), A First Course in Stochastic Processes, Academic Press, New York.

Kifer, Y. (1988), Random Perturbations of Dynamical Systems, Birkhäuser, Boston, MA.

Kolmogorov, A. N. (1956), Foundations of the Theory of Probability, 2nd edn, Chelsea, New York.

Körner, T. W. (1989), Fourier Analysis, Cambridge University Press, New York.

Künsch, H. R. (1989), ‘The jackknife and the bootstrap for general stationary observations’, Ann. Statist. 17, 1217–1261.

Lahiri, S. N. (1991), ‘Second order optimality of stationary bootstrap’, Statist. Probab. Lett. 11, 335–341.

Lahiri, S. N. (1992), ‘Edgeworth expansions for m-estimators of a regression parameter’, J. Multivariate Analysis 43, 125–132.

Lahiri, S. N. (1994), ‘Rates of bootstrap approximation for the mean of lattice variables’, Sankhyā A 56, 77–89.

Lahiri, S. N. (1996), ‘Asymptotic expansions for sums of random vectors under polynomial mixing rates’, Sankhyā A 58, 206–225.

Lahiri, S. N. (2001), ‘Effects of block lengths on the validity of block resampling methods’, Probab. Theory Related Fields 121, 73–97.

Lahiri, S. N. (2003), Resampling Methods for Dependent Data, Springer-Verlag, New York.

Lehmann, E. L. and Casella, G. (1998), Theory of Point Estimation, Springer-Verlag, New York.

Lindvall, T. (1992), Lectures on Coupling Theory, John Wiley, New York.

Liu, R. Y. and Singh, K. (1992), Moving blocks jackknife and bootstrap capture weak dependence, in R. Lepage and L. Billard, eds, ‘Exploring the Limits of the Bootstrap’, John Wiley, New York, pp. 225–248.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. (1953), ‘Equations of state calculations by fast computing machines’, J. Chem. Physics 21, 1087–1092.

Meyn, S. P. and Tweedie, R. L. (1993), Markov Chains and Stochastic Stability, Springer-Verlag, New York.

Munkres, J. R. (1975), Topology, A First Course, Prentice Hall, Englewood Cliffs, NJ.

Nummelin, E. (1978), ‘A splitting technique for Harris recurrent Markov chains’, Z. Wahrsch. Verw. Gebiete 43(4), 309–318.

Nummelin, E. (1984), General Irreducible Markov Chains and Nonnegative Operators, Cambridge University Press, Cambridge.

Orey, S. (1971), Limit Theorems for Markov Chain Transition Probabilities, Van Nostrand Reinhold, London.

Parthasarathy, K. R. (1967), Probability Measures on Metric Spaces, Academic Press, San Diego, CA.

Parthasarathy, K. R. (2005), Introduction to Probability and Measure, Vol. 33 of Texts and Readings in Mathematics, Hindustan Book Agency, New Delhi, India.

Peligrad, M. (1982), ‘Invariance principles for mixing sequences of random variables’, Ann. Probab. 10(4), 968–981.

Petrov, V. V. (1975), Sums of Independent Random Variables, Springer-Verlag, New York.

Reiss, R.-D. (1974), ‘On the accuracy of the normal approximation for quantiles’, Ann. Probab. 2, 741–744.

Robert, C. P. and Casella, G. (1999), Monte Carlo Statistical Methods, Springer-Verlag, New York.

Rosenberger, W. F. (2002), ‘Urn models and sequential design’, Sequential Anal. 21(1–2), 1–41.

Royden, H. L. (1988), Real Analysis, 3rd edn, Macmillan Publishing Co., New York.

Rudin, W. (1976), Principles of Mathematical Analysis, International Series in Pure and Applied Mathematics, 3rd edn, McGraw-Hill Book Co., New York.

Rudin, W. (1987), Real and Complex Analysis, 3rd edn, McGraw-Hill Book Co., New York.

Shohat, J. A. and Tamarkin, J. D. (1943), The problem of moments, in ‘American Mathematical Society Mathematical Surveys’, Vol. II, American Mathematical Society, New York.

Singh, K. (1981), ‘On the asymptotic accuracy of Efron’s bootstrap’, Ann. Statist. 9, 1187–1195.

Strassen, V. (1964), ‘An invariance principle for the law of the iterated logarithm’, Z. Wahrsch. Verw. Gebiete 3, 211–226.

Stroock, D. W. and Varadhan, S. (1979), Multidimensional Diffusion Processes, Band 233, Grundlehren der Mathematischen Wissenschaften, Springer-Verlag, Berlin.

Szego, G. (1939), Orthogonal Polynomials, Vol. 23 of American Mathematical Society Colloquium Publications, American Mathematical Society, Providence, RI.

Withers, C. S. (1981), ‘Conditions for linear processes to be strong-mixing’, Z. Wahrsch. Verw. Gebiete 57, 477–480.

Woodroofe, M. (1982), Nonlinear Renewal Theory in Sequential Analysis, SIAM, Philadelphia, PA.



Krishna B. Athreya Soumendra N. Lahiri

Measure Theory and Probability Theory

Krishna B. Athreya Department of Mathematics and Department of Statistics Iowa State University Ames, IA 50011 [email protected]

Editorial Board George Casella Department of Statistics University of Florida Gainesville, FL 32611-8545 USA

Soumendra N. Lahiri Department of Statistics Iowa State University Ames, IA 50011 [email protected]

Stephen Fienberg Department of Statistics Carnegie Mellon University Pittsburgh, PA 15213-3890 USA

Ingram Olkin Department of Statistics Stanford University Stanford, CA 94305 USA

Library of Congress Control Number: 2006922767 ISBN-10: 0-387-32903-X ISBN-13: 978-0387-32903-1

e-ISBN: 0-387-35434-4

Printed on acid-free paper. ©2006 Springer Science+Business Media, LLC All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. Printed in the United States of America. 9 8 7 6 5 4 3 2 1 springer.com

(MVY)

Dedicated to our wives Krishna S. Athreya and Pubali Banerjee and to the memory of Uma Mani Athreya and Narayani Ammal

Preface

This book arose out of two graduate courses that the authors have taught during the past several years, the first on measure theory and the second on advanced probability theory. The traditional approach to a first course in measure theory, such as in Royden (1988), is to teach the Lebesgue measure on the real line, then the differentiation theorems of Lebesgue and Lp-spaces on R, and to treat general measures at the end of the course, with one main application to the construction of product measures. This approach does have the pedagogic advantage of seeing one concrete case first before going to the general one. But it also has the disadvantage of making many students’ perspective on measure theory somewhat narrow. It leads them to think only in terms of the Lebesgue measure on the real line and to believe that measure theory is intimately tied to the topology of the real line. As students of statistics, probability, physics, engineering, economics, and biology know very well, there are mass distributions that are typically nonuniform, and hence it is useful to gain a general perspective.

This book attempts to provide that general perspective right from the beginning. The opening chapter gives an informal introduction to measure and integration theory. It shows that the notions of σ-algebra of sets and countable additivity of a set function are dictated by certain very natural approximation procedures from practical applications and that they are not just some abstract ideas. Next, the general extension theorem of Carathéodory is presented in Chapter 1. As immediate examples, the construction of the large class of Lebesgue-Stieltjes measures on the real line and Euclidean spaces is discussed, as are measures on finite and countable


spaces. Concrete examples such as the classical Lebesgue measure and various probability distributions on the real line are provided. This is further developed in Chapter 6, leading to the construction of measures on sequence spaces (i.e., sequences of random variables) via Kolmogorov’s consistency theorem.

After providing a fairly comprehensive treatment of measure and integration theory in the first part (Introduction and Chapters 1–5), the focus moves on to probability theory in the second part (Chapters 6–13). The feature that distinguishes probability theory from measure theory, namely, the notion of independence and dependence of random variables (i.e., measurable functions), is carefully developed first. Then the laws of large numbers are taken up. This is followed by convergence in distribution and the central limit theorems. Next the notion of conditional expectation and probability is developed, followed by discrete parameter martingales. Although the development of these topics is based on a rigorous measure theoretic foundation, the heuristic and intuitive backgrounds of the results are emphasized throughout. Along the way, some applications of the results from probability theory to proving classical results in analysis are given. These include, for example, the density of normal numbers on (0,1) and the Weierstrass approximation theorem. These are intended to emphasize the benefits of studying both areas in a rigorous and combined fashion. The approach to conditional expectation is via the mean square approximation of the “unknown” given the “known” and then a careful approximation for the L1-case. This is a natural and intuitive approach and is preferred over the “black box” approach based on the Radon-Nikodym theorem.

The final part of the book provides a basic outline of a number of special topics.
These include Markov chains including Markov chain Monte Carlo (MCMC), Poisson processes, Brownian motion, bootstrap theory, mixing processes, and branching processes. The ﬁrst two parts can be used for a two-semester sequence, and the last part could serve as a starting point for a seminar course on special topics. This book presents the basic material on measure and integration theory and probability theory in a self-contained and step-by-step manner. It is hoped that students will ﬁnd it accessible, informative, and useful and also that they will be motivated to master the details by carefully working out the text material as well as the large number of exercises. The authors hope that the presentation here is found to be clear and comprehensive without being intimidating. Here is a quick summary of the various chapters of the book. After giving an informal introduction to the ideas of measure and integration theory, the construction of measures starting with set functions on a small class of sets is taken up in Chapter 1 where the Caratheodory extension theorem is proved and then applied to construct Lebesgue-Stieltjes measures. Integration theory is taken up in Chapter 2 where all the basic convergence theorems including the MCT, Fatou, DCT, BCT, Egorov’s, and Scheﬀe’s are


proved. Included here are also the notion of uniform integrability and the classical approximation theorem of Lusin and its use in Lp -approximation by smooth functions. The third chapter presents basic inequalities for Lp spaces, the Riesz-Fischer theorem, and elementary theory of Banach and Hilbert spaces. Chapter 4 deals with Radon-Nikodym theory via the Riesz representation on L2 -spaces and its application to diﬀerentiation theorems on the real line as well as to signed measures. Chapter 5 deals with product measures and the Fubini-Tonelli theorems. Two constructions of the product measure are presented: one using the extension theorem and another via iterated integrals. This is followed by a discussion on convolutions, Laplace transforms, Fourier series, and Fourier transforms. Kolmogorov’s consistency theorem for the construction of stochastic processes is taken up in Chapter 6 followed by the notion of independence in Chapter 7. The laws of large numbers are presented in a uniﬁed manner in Chapter 8 where the classical Kolmogorov’s strong law as well as Etemadi’s strong law are presented followed by Marcinkiewicz-Zygmund laws. There are also sections on renewal theory and ergodic theorems. The notion of weak convergence of probability measures on R is taken up in Chapter 9, and Chapter 10 introduces characteristic functions (Fourier transform of probability measures), the inversion formula, and the Levy-Cramer continuity theorem. Chapter 11 is devoted to the central limit theorem and its extensions to stable and inﬁnitely divisible laws. Chapter 12 discusses conditional expectation and probability where an L2 -approach followed by an approximation to L1 is presented. Discrete time martingales are introduced in Chapter 13 where the basic inequalities as well as convergence results are developed. Some applications to random walks are indicated as well. Chapter 14 discusses discrete time Markov chains with a discrete state space ﬁrst. 
This is followed by discrete time Markov chains with general state spaces, where the regeneration approach for Harris chains is carefully explained and is used to derive the basic limit theorems via the iid cycles approach. There are also discussions of Feller Markov chains on Polish spaces and Markov chain Monte Carlo methods. An elementary treatment of Brownian motion is presented in Chapter 15 along with a treatment of continuous time jump Markov chains. Chapters 16–18 provide brief outlines, respectively, of mixing processes, the bootstrap theory, and branching processes. There is an Appendix that reviews basic material on elementary set theory, real and complex numbers, and metric spaces.

Here are some suggestions on how to use the book.

1. For a one-semester course on real analysis (i.e., measure and integration theory), material up to Chapter 5 and the Appendix should provide adequate coverage, with Chapter 6 being optional.

2. A one-semester course on advanced probability theory for those with the necessary measure theory background could be based on Chapters 6–13 with a selection of topics from Chapters 14–18.

3. A one-semester course on a combined treatment of measure theory and probability theory could be built around Chapters 1, 2, Sections 3.1–3.2 of Chapter 3, all of Chapter 4 (Section 4.2 optional), Sections 5.1 and 5.2 of Chapter 5, Chapters 6, 7, and Sections 8.1, 8.2, 8.3 (Sections 8.5 and 8.6 optional) of Chapter 8. Such a course could be followed by another that includes some coverage of Chapters 9–12 before moving on to other areas such as mathematical statistics or martingales and financial mathematics. This will be particularly useful for graduate programs in statistics.

4. A one-semester course on an introduction to stochastic processes or a seminar on special topics could be based on Chapters 14–18.

A word on the numbering system used in the book. Statements of results (i.e., Theorems, Corollaries, Lemmas, and Propositions) are numbered consecutively within each section, in the format a.b.c, where a is the chapter number, b is the section number, and c is the counter. Definitions, Examples, and Remarks are numbered individually within each section, also of the form a.b.c, as above. Sections are referred to as a.b, where a is the chapter number and b is the section number. Equation numbers appear on the right, in the form (b.c), where b is the section number and c is the equation number. Equations in a given chapter a are referred to as (b.c) within the chapter but as (a.b.c) outside chapter a. Problems are listed at the end of each chapter in the form a.c, where a is the chapter number and c is the problem number.

In the writing of this book, material from existing books such as Apostol (1974), Billingsley (1995), Chow and Teicher (2001), Chung (1974), Durrett (2004), Royden (1988), and Rudin (1976, 1987) has been freely used. The authors owe a great debt to these books.
The authors have used this material for courses taught over several years and have beneﬁted greatly from suggestions for improvement from students and colleagues at Iowa State University, Cornell University, the Indian Institute of Science, and the Indian Statistical Institute. We are grateful to them. Our special thanks go to Dean Issacson, Ken Koehler, and Justin Peters at Iowa State University for their administrative support of this long project. Krishna Athreya would also like to thank Cornell University for its support. We are most indebted to Sharon Shepard who typed and retyped several times this book, patiently putting up with our never-ending “ﬁnal” versions. Without her patient and generous help, this book could not have been written. We are also grateful to Denise Riker who typed portions of an earlier version of this book. John Kimmel of Springer got the book reviewed at various stages. The referee reports were very helpful and encouraging. Our grateful thanks to both John Kimmel and the referees.


We have tried hard to make this book free of mathematical and typographical errors and misleading or ambiguous statements, but we are aware that there will still be many such remaining that we have not caught. We will be most grateful to receive such corrections and suggestions for improvement. They can be e-mailed to us at [email protected] or [email protected] On a personal note, we would like to thank our families for their patience and support. Krishna Athreya would like to record his profound gratitude to his maternal granduncle, the late Shri K. Venkatarama Iyer, who opened the door to mathematical learning for him at a crucial stage in high school, to the late Professor D. Basu of the Indian Statistical Institute who taught him to think probabilistically, and to Professor Samuel Karlin of Stanford University for initiating him into research in mathematics. K. B. Athreya S. N. Lahiri May 12, 2006

Contents

Preface

Measures and Integration: An Informal Introduction

1 Measures
  1.1 Classes of sets
  1.2 Measures
  1.3 The extension theorems and Lebesgue-Stieltjes measures
    1.3.1 Caratheodory extension of measures
    1.3.2 Lebesgue-Stieltjes measures on $\mathbb{R}$
    1.3.3 Lebesgue-Stieltjes measures on $\mathbb{R}^2$
    1.3.4 More on extension of measures
  1.4 Completeness of measures
  1.5 Problems

2 Integration
  2.1 Measurable transformations
  2.2 Induced measures, distribution functions
    2.2.1 Generalizations to higher dimensions
  2.3 Integration
  2.4 Riemann and Lebesgue integrals
  2.5 More on convergence
  2.6 Problems

3 $L^p$-Spaces
  3.1 Inequalities
  3.2 $L^p$-Spaces
    3.2.1 Basic properties
    3.2.2 Dual spaces
  3.3 Banach and Hilbert spaces
    3.3.1 Banach spaces
    3.3.2 Linear transformations
    3.3.3 Dual spaces
    3.3.4 Hilbert space
  3.4 Problems

4 Differentiation
  4.1 The Lebesgue-Radon-Nikodym theorem
  4.2 Signed measures
  4.3 Functions of bounded variation
  4.4 Absolutely continuous functions on $\mathbb{R}$
  4.5 Singular distributions
    4.5.1 Decomposition of a cdf
    4.5.2 Cantor ternary set
    4.5.3 Cantor ternary function
  4.6 Problems

5 Product Measures, Convolutions, and Transforms
  5.1 Product spaces and product measures
  5.2 Fubini-Tonelli theorems
  5.3 Extensions to products of higher orders
  5.4 Convolutions
    5.4.1 Convolution of measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$
    5.4.2 Convolution of sequences
    5.4.3 Convolution of functions in $L^1(\mathbb{R})$
    5.4.4 Convolution of functions and measures
  5.5 Generating functions and Laplace transforms
  5.6 Fourier series
  5.7 Fourier transforms on $\mathbb{R}$
  5.8 Plancherel transform
  5.9 Problems

6 Probability Spaces
  6.1 Kolmogorov’s probability model
  6.2 Random variables and random vectors
  6.3 Kolmogorov’s consistency theorem
  6.4 Problems

7 Independence
  7.1 Independent events and random variables
  7.2 Borel-Cantelli lemmas, tail σ-algebras, and Kolmogorov’s zero-one law
  7.3 Problems

8 Laws of Large Numbers
  8.1 Weak laws of large numbers
  8.2 Strong laws of large numbers
  8.3 Series of independent random variables
  8.4 Kolmogorov and Marcinkiewicz-Zygmund SLLNs
  8.5 Renewal theory
    8.5.1 Definitions and basic properties
    8.5.2 Wald’s equation
    8.5.3 The renewal theorems
    8.5.4 Renewal equations
    8.5.5 Applications
  8.6 Ergodic theorems
    8.6.1 Basic definitions and examples
    8.6.2 Birkhoff’s ergodic theorem
  8.7 Law of the iterated logarithm
  8.8 Problems

9 Convergence in Distribution
  9.1 Definitions and basic properties
  9.2 Vague convergence, Helly-Bray theorems, and tightness
  9.3 Weak convergence on metric spaces
  9.4 Skorohod’s theorem and the continuous mapping theorem
  9.5 The method of moments and the moment problem
    9.5.1 Convergence of moments
    9.5.2 The method of moments
    9.5.3 The moment problem
  9.6 Problems

10 Characteristic Functions
  10.1 Definition and examples
  10.2 Inversion formulas
  10.3 Levy-Cramer continuity theorem
  10.4 Extension to $\mathbb{R}^k$
  10.5 Problems

11 Central Limit Theorems
  11.1 Lindeberg-Feller theorems
  11.2 Stable distributions
  11.3 Infinitely divisible distributions
  11.4 Refinements and extensions of the CLT
    11.4.1 The Berry-Esseen theorem
    11.4.2 Edgeworth expansions
    11.4.3 Large deviations
    11.4.4 The functional central limit theorem
    11.4.5 Empirical process and Brownian bridge
  11.5 Problems

12 Conditional Expectation and Conditional Probability
  12.1 Conditional expectation: Definitions and examples
  12.2 Convergence theorems
  12.3 Conditional probability
  12.4 Problems

13 Discrete Parameter Martingales
  13.1 Definitions and examples
  13.2 Stopping times and optional stopping theorems
  13.3 Martingale convergence theorems
  13.4 Applications of martingale methods
    13.4.1 Supercritical branching processes
    13.4.2 Investment sequences
    13.4.3 A conditional Borel-Cantelli lemma
    13.4.4 Decomposition of probability measures
    13.4.5 Kakutani’s theorem
    13.4.6 de Finetti’s theorem
  13.5 Problems

14 Markov Chains and MCMC
  14.1 Markov chains: Countable state space
    14.1.1 Definition
    14.1.2 Examples
    14.1.3 Existence of a Markov chain
    14.1.4 Limit theory
  14.2 Markov chains on a general state space
    14.2.1 Basic definitions
    14.2.2 Examples
    14.2.3 Chapman-Kolmogorov equations
    14.2.4 Harris irreducibility, recurrence, and minorization
    14.2.5 The minorization theorem
    14.2.6 The fundamental regeneration theorem
    14.2.7 Limit theory for regenerative sequences
    14.2.8 Limit theory of Harris recurrent Markov chains
    14.2.9 Markov chains on metric spaces
  14.3 Markov chain Monte Carlo (MCMC)
    14.3.1 Introduction
    14.3.2 Metropolis-Hastings algorithm
    14.3.3 The Gibbs sampler
  14.4 Problems

15 Stochastic Processes
  15.1 Continuous time Markov chains
    15.1.1 Definition
    15.1.2 Kolmogorov’s differential equations
    15.1.3 Examples
    15.1.4 Limit theorems
  15.2 Brownian motion
    15.2.1 Construction of SBM
    15.2.2 Basic properties of SBM
    15.2.3 Some related processes
    15.2.4 Some limit theorems
    15.2.5 Some sample path properties of SBM
    15.2.6 Brownian motion and martingales
    15.2.7 Some applications
    15.2.8 The Black-Scholes formula for stock price option
  15.3 Problems

16 Limit Theorems for Dependent Processes
  16.1 A central limit theorem for martingales
  16.2 Mixing sequences
    16.2.1 Mixing coefficients
    16.2.2 Coupling and covariance inequalities
  16.3 Central limit theorems for mixing sequences
  16.4 Problems

17 The Bootstrap
  17.1 The bootstrap method for independent variables
    17.1.1 A description of the bootstrap method
    17.1.2 Validity of the bootstrap: Sample mean
    17.1.3 Second order correctness of the bootstrap
    17.1.4 Bootstrap for lattice distributions
    17.1.5 Bootstrap for heavy tailed random variables
  17.2 Inadequacy of resampling single values under dependence
  17.3 Block bootstrap
  17.4 Properties of the MBB
    17.4.1 Consistency of MBB variance estimators
    17.4.2 Consistency of MBB cdf estimators
    17.4.3 Second order properties of the MBB
  17.5 Problems

18 Branching Processes
  18.1 Bienaymé-Galton-Watson branching process
  18.2 BGW process: Multitype case
  18.3 Continuous time branching processes
  18.4 Embedding of Urn schemes in continuous time branching processes
  18.5 Problems

A Advanced Calculus: A Review
  A.1 Elementary set theory
    A.1.1 Set operations
    A.1.2 The principle of induction
    A.1.3 Equivalence relations
  A.2 Real numbers, continuity, differentiability, and integration
    A.2.1 Real numbers
    A.2.2 Sequences, series, limits, limsup, liminf
    A.2.3 Continuity and differentiability
    A.2.4 Riemann integration
  A.3 Complex numbers, exponential and trigonometric functions
  A.4 Metric spaces
    A.4.1 Basic definitions
    A.4.2 Continuous functions
    A.4.3 Compactness
    A.4.4 Sequences of functions and uniform convergence
  A.5 Problems

B List of Abbreviations and Symbols
  B.1 Abbreviations
  B.2 Symbols

References

Author Index

Subject Index

Measures and Integration: An Informal Introduction

For many students who are learning measure and integration theory for the first time, the notions of a σ-algebra of subsets of a set $\Omega$, countable additivity of a set function $\lambda$, measurability of a function, the definition of an integral, and the interchange of limits and integration are not easy to understand and often seem unintuitive. The goals of this informal introduction to the subject are to show (1) that the notions of σ-algebra and countable additivity are logical consequences of certain natural approximation procedures, and (2) that the dividends for the assumption of these two properties are great, leading to a nice and natural theory that is also very powerful for the handling of limits. Of course, as the saying goes, the devil is in the details. After this informal introduction, the necessary details are given in the next few sections. It is hoped that after this heuristic explanation of the subject, the motivation for and the process of mastering the details on the part of the students will be forthcoming.

What is Measure Theory? A simple answer is that it is a theory about the distribution of mass over a set $S$. If the mass is uniformly distributed and $S$ is a Euclidean space $\mathbb{R}^k$, it is the theory of Lebesgue measure on $\mathbb{R}^k$ (i.e., length in $\mathbb{R}$, area in $\mathbb{R}^2$, volume in $\mathbb{R}^3$, etc.). Probability theory is concerned with the case when $S$ is the sample space of a random experiment and the total mass is one.

Consider the following example. Imagine an open field $S$ and a snowy night. At daybreak one goes to the field to measure the amount of snow in as many of the subsets of $S$ as


possible. Suppose now that one has the tools to measure the snow exactly on a class of subsets, such as triangles, rectangles, circular shapes, elliptic shapes, etc., no matter how small. It is natural to try to approximate oddly shaped regions by combinations of these “standard shapes” and then use a limiting process to obtain a measure for the oddly shaped regions. Let $\mathcal{B}$ denote the class of subsets of $S$ whose measure is obtained this way and let $\lambda(B)$ denote the amount of snow in each $B \in \mathcal{B}$. Call $\mathcal{B}$ the class of all (snow) measurable sets and $\lambda(B)$ the measure (of snow) on $B$ for each $B \in \mathcal{B}$. It is reasonable to expect that the following properties of $\mathcal{B}$ and $\lambda(\cdot)$ hold:

Properties of $\mathcal{B}$

(i) $A \in \mathcal{B} \Rightarrow A^c \in \mathcal{B}$ (i.e., if one can measure the amount of snow on $A$ and knows the total amount on $S$, then one knows the amount of snow on $A^c$).

(ii) $A_1, A_2 \in \mathcal{B} \Rightarrow A_1 \cup A_2 \in \mathcal{B}$ (i.e., if one can measure the amount of snow on $A_1$ and $A_2$, then one can do the same for $A_1 \cup A_2$).

(iii) If $\{A_n : n \ge 1\} \subset \mathcal{B}$ and $A_n \subset A_{n+1}$ for all $n \ge 1$, then $\lim_{n\to\infty} A_n \equiv \bigcup_{n=1}^{\infty} A_n \in \mathcal{B}$ (i.e., if one can measure the amount of snow on $A_n$ for each $n \ge 1$ in an increasing sequence of sets, then one can do so on the limit of $A_n$).

(iv) $\mathcal{C} \subset \mathcal{B}$, where $\mathcal{C}$ is the class of nice sets such as triangles, squares, etc., that one started with.

Properties of $\lambda(\cdot)$

(i) $\lambda(A) \ge 0$ for $A \in \mathcal{B}$ (i.e., the amount of snow on any set is nonnegative!).

(ii) If $A_1, A_2 \in \mathcal{B}$ and $A_1 \cap A_2 = \emptyset$, then $\lambda(A_1 \cup A_2) = \lambda(A_1) + \lambda(A_2)$ (i.e., the amounts of snow on two disjoint sets simply add up!). This property of $\lambda$ is referred to as finite additivity.

(iii) If $\{A_n : n \ge 1\} \subset \mathcal{B}$ are such that $A_n \subset A_{n+1}$ for all $n$, then $\lambda(\lim_{n\to\infty} A_n) = \lambda\big(\bigcup_{n=1}^{\infty} A_n\big) = \lim_{n\to\infty} \lambda(A_n)$ (i.e., if we can approximate a set $A$ by an increasing sequence of sets $\{A_n\}_{n\ge 1}$ from $\mathcal{B}$, then $\lambda(A) = \lim_{n\to\infty} \lambda(A_n)$). This property of $\lambda$ is referred to as monotone continuity from below, or m.c.f.b. in short.
This last assumption (iii) is what guarantees that different approximations lead to consistent limits. Thus, if there were two increasing sequences $\{A_n\}_{n\ge 1}$ and $\{A_n'\}_{n\ge 1}$ having the same limit $A$ but with $\{\lambda(A_n)\}_{n\ge 1}$ and $\{\lambda(A_n')\}_{n\ge 1}$ having different limits, then the approximating procedures would not be consistent.
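Finite additivity and m.c.f.b. can be checked by hand on a small discrete example. The following Python sketch is only an illustration and is not taken from the text: the choice of mass $2^{-n}$ at each point $n = 1, 2, \ldots$ and the function name `lam` are assumptions made here. Exact rational arithmetic is used so that the equalities hold without rounding.

```python
from fractions import Fraction


def lam(A):
    """Hypothetical measure: the point n carries mass 2**-n, so lam(A) is the sum over n in A."""
    return sum((Fraction(1, 2**n) for n in A), Fraction(0))


# Finite additivity: the masses of two disjoint sets add up.
A1, A2 = {1, 3}, {2, 5}
additive = lam(A1 | A2) == lam(A1) + lam(A2)

# Monotone continuity from below: A_n = {1, ..., n} increases to {1, 2, ...},
# whose total mass is 1, and lam(A_n) = 1 - 2**-n increases to that limit.
masses = [lam(range(1, n + 1)) for n in range(1, 11)]
gaps = [1 - m for m in masses]  # distance from the limiting mass 1

print(additive, masses[-1], gaps[-1])
```

Any other increasing sequence with the same union, for instance the sets $\{1, \ldots, 2n\}$, gives masses that approach the same limit, which is the consistency the last assumption is meant to ensure.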


It turns out that the above set of reasonable and natural assumptions leads to a very rich and powerful theory that is widely applicable. A triplet $(S, \mathcal{B}, \lambda)$ that satisfies the above two sets of assumptions is called a measure space. The assumptions on $\mathcal{B}$ and $\lambda$ are equivalent to the following:

On $\mathcal{B}$

B(i): $\emptyset$, the empty set, lies in $\mathcal{B}$.

B(ii): $A \in \mathcal{B} \Rightarrow A^c \in \mathcal{B}$ (same as (i) before).

B(iii): $A_1, A_2, \ldots \in \mathcal{B} \Rightarrow \bigcup_i A_i \in \mathcal{B}$ (combines (ii) and (iii) above; closure under countable unions).

On $\lambda$

λ(i): $\lambda(\cdot) \ge 0$ (same as (i) before) and $\lambda(\emptyset) = 0$.

λ(ii): $\lambda\big(\bigcup_{n\ge 1} A_n\big) = \sum_{n=1}^{\infty} \lambda(A_n)$ if $\{A_n\}_{n\ge 1} \subset \mathcal{B}$ are pairwise disjoint, i.e., $A_i \cap A_j = \emptyset$ for $i \ne j$ (countable additivity).

Any collection $\mathcal{B}$ of subsets of $S$ that satisfies B(i), B(ii), B(iii) above is called a σ-algebra. Any set function $\lambda$ on a σ-algebra $\mathcal{B}$ that satisfies λ(i) and λ(ii) above is called a measure. Thus, a measure space is a triplet $(S, \mathcal{B}, \lambda)$ where $S$ is a nonempty set, $\mathcal{B}$ is a σ-algebra of subsets of $S$, and $\lambda$ is a measure on $\mathcal{B}$. Notice that the σ-algebra structure on $\mathcal{B}$ and the countable additivity of $\lambda$ are necessary consequences of the very natural assumptions (i), (ii), and (iii) on $\mathcal{B}$ and $\lambda$ defined at the beginning.

It is not often the case that one is given $\mathcal{B}$ and $\lambda$ explicitly. Typically, one starts with a small collection $\mathcal{C}$ of subsets of $S$ that have properties resembling intervals or rectangles and a set function $\lambda$ on $\mathcal{C}$. Then, $\mathcal{B}$ is the smallest σ-algebra containing $\mathcal{C}$, obtained from $\mathcal{C}$ by various operations such as countable unions, intersections, and their limits. The key properties on $\mathcal{C}$ that one needs are:

(i) $A, B \in \mathcal{C} \Rightarrow A \cap B \in \mathcal{C}$ (e.g., the intersection of intervals is an interval).

(ii) $A \in \mathcal{C} \Rightarrow A^c$ is a finite union of sets from $\mathcal{C}$ (e.g., the complement of an interval is the union of two intervals or an interval itself).

A collection $\mathcal{C}$ satisfying (i) and (ii) is called a semialgebra. The function $\lambda$ on $\mathcal{B}$ is an extension of $\lambda$ on $\mathcal{C}$. For this extension to be a measure on $\mathcal{B}$, the conditions needed are

(i) $\lambda(A) \ge 0$ for all $A \in \mathcal{C}$;

(ii) if $A_1, A_2, \ldots \in \mathcal{C}$ are pairwise disjoint and $A = \bigcup_{n\ge 1} A_n \in \mathcal{C}$, then $\lambda(A) = \sum_{n=1}^{\infty} \lambda(A_n)$.


There is a result, known as the extension theorem, that says that given such a pair $(\mathcal{C}, \lambda)$, it is possible to extend $\lambda$ to $\mathcal{B}$, the smallest σ-algebra containing $\mathcal{C}$, such that $(S, \mathcal{B}, \lambda)$ is a measure space. Actually, it does more. It constructs a σ-algebra $\mathcal{B}^*$ larger than $\mathcal{B}$ and a measure $\lambda^*$ on $\mathcal{B}^*$ such that $(S, \mathcal{B}^*, \lambda^*)$ is a larger measure space, $\lambda^*$ coincides with $\lambda$ on $\mathcal{C}$, and it provides nice approximation theorems. For example, the following approximation result is available: If $B \in \mathcal{B}^*$ with $\lambda^*(B) < \infty$, then for every $\epsilon > 0$, $B$ can be approximated by a finite union of sets from $\mathcal{C}$, i.e., there exist sets $A_1, \ldots, A_k \in \mathcal{C}$ with $k < \infty$ such that $\lambda^*(A \,\triangle\, B) < \epsilon$, where $A \equiv \bigcup_{i=1}^{k} A_i$ and $A \,\triangle\, B = (A \cap B^c) \cup (A^c \cap B)$ is the symmetric difference between $A$ and $B$. That is, in principle, every (measurable) set $B$ of finite measure (i.e., $B$ belonging to $\mathcal{B}^*$ with $\lambda^*(B) < \infty$) is nearly a finite union of (elementary) sets that belong to $\mathcal{C}$. For example, if $S = \mathbb{R}$ and $\mathcal{C}$ is the class of intervals, then every measurable set of finite measure is nearly a finite union of disjoint bounded open intervals.

The following are some concrete examples of the above extension procedure.

Theorem: (Lebesgue-Stieltjes measures on $\mathbb{R}$). Let $F : \mathbb{R} \to \mathbb{R}$ satisfy
(i) $x_1 < x_2 \Rightarrow F(x_1) \le F(x_2)$ (nondecreasing);
(ii) $F(x) = F(x+) \equiv \lim_{y \downarrow x} F(y)$ for all $x \in \mathbb{R}$ (i.e., $F(\cdot)$ is right continuous).
Let $\mathcal{C}$ be the class of sets of the form $(a, b]$ or $(b, \infty)$, $-\infty \le a < b < \infty$. Then, there exists a measure $\mu_F$ defined on $\mathcal{B} \equiv \mathcal{B}(\mathbb{R})$, the σ-algebra generated by $\mathcal{C}$, such that

$\mu_F((a, b]) = F(b) - F(a)$ for all $-\infty < a < b < \infty$.

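The σ-algebra $\mathcal{B} \equiv \mathcal{B}(\mathbb{R})$ is called the Borel σ-algebra of $\mathbb{R}$.

Corollary: There exists a measure $m$ on $\mathcal{B}(\mathbb{R})$ such that $m(I)$ = the length of $I$, for any interval $I$.

Proof: Take $F(x) \equiv x$ in the above theorem.

This measure is called the Lebesgue measure on $\mathbb{R}$.

Corollary: There exists a measure $\lambda$ on $\mathcal{B}(\mathbb{R})$ such that

$\lambda((a, b]) = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2}\,dx.$

Proof: Take $F(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}\,du$, $x \in \mathbb{R}$. □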


This measure is called the standard normal probability measure on $\mathbb{R}$.

Theorem: (Lebesgue-Stieltjes measures on $\mathbb{R}^2$). Let $F : \mathbb{R}^2 \to \mathbb{R}$ be a function satisfying the following:
(i) (Monotonicity) For $x = (x_1, x_2)$, $y = (y_1, y_2)$ with $x_i \le y_i$ for $i = 1, 2$,
$\Delta F(x, y) \equiv F(y_1, y_2) - F(x_1, y_2) - F(y_1, x_2) + F(x_1, x_2) \ge 0.$
(ii) (Continuity from above) $F(x) = \lim_{y_i \downarrow x_i,\, i=1,2} F(y)$ for all $x \in \mathbb{R}^2$.

Let $\mathcal{C}$ be the class of all rectangles of the form $(a, b] \equiv (a_1, b_1] \times (a_2, b_2]$ with $a = (a_1, a_2)$, $b = (b_1, b_2) \in \mathbb{R}^2$. Then there exists a measure $\mu_F$, defined on the σ-algebra $\mathcal{B} \equiv \mathcal{B}(\mathbb{R}^2)$ generated by $\mathcal{C}$, such that $\mu_F((a, b]) = \Delta F(a, b)$.

The above theorems have a converse that says that every measure on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$ that is finite on bounded sets arises from some function $F$ (called a distribution function) and is, therefore, a Lebesgue-Stieltjes measure.

Here is another simple example of a measure space (with discrete $S$).

Example: Let $S = \{s_1, s_2, \ldots, s_k\}$, $k \le \infty$, and let $\mathcal{B} = \mathcal{P}(S)$, the power set of $S$, i.e., the collection of all possible subsets of $S$. Let $p_1, p_2, \ldots$ be nonnegative numbers. Let

$\lambda(A) \equiv \sum_{1 \le i \le k} p_i I_A(s_i),$

where $I_A$ is the indicator function of the set $A$, defined by $I_A(s) = 1$ if $s \in A$ and $0$ otherwise. It is easy to verify that $(S, \mathcal{B}, \lambda)$ is a measure space and also that every measure $\lambda$ on $\mathcal{B}$ arises this way.

What is Integration Theory?

In short, it is a theory about weighted sums of functions on a set $S$ when the weights are specified by a mass distribution $\lambda$. Here is a more detailed answer. Let $(S, \mathcal{B}, \lambda)$ be a measure space. Suppose $f : S \to \mathbb{R}$ is a simple function, i.e., $f$ is such that $f(S)$ is a finite set $\{a_1, a_2, \ldots, a_k\}$. It is reasonable to define the weighted sum of $f$ with respect to $\lambda$ as $\sum_{i=1}^{k} a_i \lambda(A_i)$, where $A_i = f^{-1}\{a_i\}$. Of course, for this to be well defined, one needs $A_i$ to be in $\mathcal{B}$ and $\lambda(A_i) < \infty$ for all $i$ such that $a_i \ne 0$. Notice that the quantity $\sum_{i=1}^{k} a_i \lambda(A_i)$ remains the same whether the $a_i$'s are distinct or not. Call this the integral of $f$ with respect to $\lambda$ and denote


this by $\int f\,d\lambda$. If $f$ and $g$ are simple, then for $\alpha, \beta \in \mathbb{R}$, $\int (\alpha f + \beta g)\,d\lambda = \alpha \int f\,d\lambda + \beta \int g\,d\lambda$. Now how should one define $\int f\,d\lambda$ (the integral of $f$ with respect to $\lambda$) for a nonsimple $f$? The answer, of course, is to "approximate" by simple functions. Let $f$ be a nonnegative function. To define the integral of $f$, one would like to approximate $f$ by simple functions. It turns out that a necessary and sufficient condition for this is that for any $a \in \mathbb{R}$, the set $\{s : f(s) \le a\}$ is in $\mathcal{B}$. Such a function $f$ is called measurable with respect to $\mathcal{B}$, or $\mathcal{B}$-measurable, or simply measurable (if $\mathcal{B}$ is kept fixed throughout). Let $f$ be a nonnegative $\mathcal{B}$-measurable function. Then there exists a sequence $\{f_n\}_{n \ge 1}$ of simple nonnegative functions such that for each $s \in S$, $\{f_n(s)\}_{n \ge 1}$ is a nondecreasing sequence converging to $f(s)$. It is now natural to define the weighted sum of $f$ with respect to $\lambda$, i.e., the integral of $f$ with respect to $\lambda$, denoted by $\int f\,d\lambda$, as

$\int f\,d\lambda = \lim_{n \to \infty} \int f_n\,d\lambda.$
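The approximation step can be tried out numerically. The following is a minimal sketch, assuming a hypothetical four-point measure space with point masses `weights` and using the usual dyadic staircase $f_n = \min\{\lfloor 2^n f \rfloor / 2^n,\, n\}$ as the approximating simple functions; the names `dyadic_approx` and `integrate_simple` are illustrative and not from the text.

```python
import math

def dyadic_approx(f, n):
    """Return the simple function f_n = min(floor(2^n f) / 2^n, n).

    For nonnegative f, f_n(s) is nondecreasing in n and converges to f(s).
    """
    def f_n(s):
        return min(math.floor(f(s) * 2**n) / 2**n, n)
    return f_n

def integrate_simple(g, weights):
    """Integral of g w.r.t. the discrete measure lambda({s}) = weights[s]."""
    return sum(g(s) * w for s, w in weights.items())

# Hypothetical finite measure space: S = {0, 1, 2, 3} with these point masses.
weights = {0: 0.5, 1: 1.0, 2: 0.25, 3: 2.0}
f = lambda s: math.sqrt(s)  # a nonnegative function on S

# Integrals of the approximating simple functions f_1, f_2, ..., f_29.
approx = [integrate_simple(dyadic_approx(f, n), weights) for n in range(1, 30)]
# On a finite S the function f is itself simple, so its integral is a finite sum.
exact = integrate_simple(f, weights)
```

The list `approx` is nondecreasing and approaches `exact`, which is the behavior the definition of $\int f\,d\lambda$ as a limit relies on.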

An immediate question is: Is the right side the same for all such approximating sequences $\{f_n\}_{n \ge 1}$? The answer is yes; it is guaranteed by the very natural assumption imposed on $\lambda$ that it is finitely additive and monotone continuous from below, i.e., assumptions (ii) and (iii) on $\lambda$ (or equivalently, that $\lambda$ is countably additive, i.e., λ(ii)). One can strengthen this to a result known as the monotone convergence theorem, a key result that in turn leads to two other major convergence results.

The monotone convergence theorem (MCT): Let $(S, \mathcal{B}, \lambda)$ be a measure space and let $f_n : S \to \mathbb{R}_+$, $n \ge 1$, be a sequence of nonnegative $\mathcal{B}$-measurable functions (not necessarily simple) such that for all $s \in S$,

(i) $f_n(s) \le f_{n+1}(s)$ for all $n \ge 1$, and
(ii) $\lim_{n \to \infty} f_n(s) = f(s)$.

Then $f$ is $\mathcal{B}$-measurable and

$\int f\,d\lambda = \lim_{n \to \infty} \int f_n\,d\lambda.$

This says that the integral and the limit can be interchanged for monotone nondecreasing nonnegative $\mathcal{B}$-measurable functions. Note that if $f_n = I_{A_n}$, the indicator function of a set $A_n$, and if $A_n \subset A_{n+1}$ for each $n$, then the MCT is the same as m.c.f.b. (cf. property λ(ii)). Thus, the very natural assumption of m.c.f.b. yields a basic convergence result that makes the integration theory so elegant and powerful. To extend the definition of $\int f\,d\lambda$ to a real valued, $\mathcal{B}$-measurable function $f : S \to \mathbb{R}$, one uses the simple idea that $f$ can be decomposed as $f = f^+ - f^-$, where $f^+(s) = \max\{f(s), 0\}$ and $f^-(s) = \max\{-f(s), 0\}$, $s \in S$. Since both $f^+$ and $f^-$ are nonnegative and $\mathcal{B}$-measurable, $\int f^+\,d\lambda$ and


$\int f^-\,d\lambda$ are both well defined. Now set

$\int f\,d\lambda = \int f^+\,d\lambda - \int f^-\,d\lambda,$

provided at least one of the two terms on the right is finite. The function $f$ is said to be integrable with respect to (w.r.t.) $\lambda$ if both $\int f^+\,d\lambda$ and $\int f^-\,d\lambda$ are finite or, equivalently, if $\int |f|\,d\lambda < \infty$. The following is a consequence of the MCT.

Fatou's lemma: Let $\{f_n\}_{n \ge 1}$ be a sequence of nonnegative $\mathcal{B}$-measurable functions on a measure space $(S, \mathcal{B}, \lambda)$. Then

$\int \liminf_{n \to \infty} f_n\,d\lambda \le \liminf_{n \to \infty} \int f_n\,d\lambda.$

This in turn leads to (Lebesgue's) dominated convergence theorem (DCT): Let $\{f_n\}_{n \ge 1}$ be a sequence of $\mathcal{B}$-measurable functions from a measure space $(S, \mathcal{B}, \lambda)$ to $\mathbb{R}$ and let $g$ be a $\mathcal{B}$-measurable nonnegative integrable function on $(S, \mathcal{B}, \lambda)$. Suppose that for each $s$ in $S$,

(i) $|f_n(s)| \le g(s)$ for all $n \ge 1$, and
(ii) $\lim_{n \to \infty} f_n(s) = f(s)$.

Then, $f$ is integrable and

$\lim_{n \to \infty} \int f_n\,d\lambda = \int f\,d\lambda = \int \lim_{n \to \infty} f_n\,d\lambda.$
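A standard illustration, not taken from the text above, of why the domination condition (i) cannot simply be dropped: take $S = (0, 1)$ with Lebesgue measure $m$ and the spike functions $f_n = n\, I_{(0, 1/n)}$. The symbols $f_n$ and $m$ below are assumptions of this sketch only.

```latex
\begin{align*}
  f_n(s) &= n\, I_{(0,1/n)}(s) \;\longrightarrow\; 0
     && \text{for every fixed } s \in (0,1), \\
  \int f_n \, dm &= n \cdot m\bigl((0, 1/n)\bigr) = 1
     && \text{for every } n \ge 1, \\
  \lim_{n\to\infty} \int f_n \, dm &= 1 \;\ne\; 0 = \int \lim_{n\to\infty} f_n \, dm, \\
  \sup_{n \ge 1} f_n(s) &\ge \frac{1}{s} - 1,
     && \int_0^1 \Bigl(\frac{1}{s} - 1\Bigr)\, ds = \infty .
\end{align*}
```

The last line shows that no integrable $g$ can dominate every $f_n$, so the hypotheses of the DCT fail, while Fatou's lemma still holds with strict inequality ($0 < 1$).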

Thus some very natural assumptions on B and λ lead to an interesting measure and integration theory that is quite general and that allows the interchange of limits and integrals under fairly general conditions. A systematic treatment of the measure and integration theory is given in the next ﬁve chapters.

1 Measures

Section 1.1 deals with algebraic operations on subsets of a given nonempty set Ω. Section 1.2 treats nonnegative set functions on classes of sets and deﬁnes the notion of a measure on an algebra. Section 1.3 treats the extension theorem, and Section 1.4 deals with completeness of measures.

1.1 Classes of sets

Let $\Omega$ be a nonempty set and $\mathcal{P}(\Omega) \equiv \{A : A \subset \Omega\}$ be the power set of $\Omega$, i.e., the class of all subsets of $\Omega$.

Definition 1.1.1: A collection of sets $\mathcal{F} \subset \mathcal{P}(\Omega)$ is called an algebra if (a) $\Omega \in \mathcal{F}$, (b) $A \in \mathcal{F}$ implies $A^c \in \mathcal{F}$, and (c) $A, B \in \mathcal{F}$ implies $A \cup B \in \mathcal{F}$ (i.e., closure under pairwise unions).

Thus, an algebra is a class of sets containing $\Omega$ that is closed under complementation and pairwise (and hence finite) unions. It is easy to see that one can equivalently define an algebra by requiring that properties (a), (b) hold and that the property

(c)′ $A, B \in \mathcal{F} \Rightarrow A \cap B \in \mathcal{F}$

holds (i.e., closure under finite intersections).


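Definition 1.1.2: A class $\mathcal{F} \subset \mathcal{P}(\Omega)$ is called a σ-algebra if it is an algebra and if it satisfies

(d) $A_n \in \mathcal{F}$ for $n \ge 1 \Rightarrow \bigcup_{n \ge 1} A_n \in \mathcal{F}$.

Thus, a σ-algebra is a class of subsets of $\Omega$ that contains $\Omega$ and is closed under complementation and countable unions. As pointed out in the introductory chapter, a σ-algebra can be alternatively defined as an algebra that is closed under monotone unions, as the following shows.

Proposition 1.1.1: Let $\mathcal{F} \subset \mathcal{P}(\Omega)$. Then $\mathcal{F}$ is a σ-algebra if and only if $\mathcal{F}$ is an algebra and satisfies

$A_n \in \mathcal{F},\ A_n \subset A_{n+1}$ for all $n \Rightarrow \bigcup_{n \ge 1} A_n \in \mathcal{F}$.

Proof: The 'only if' part is obvious. For the 'if' part, let $\{B_n\}_{n=1}^{\infty} \subset \mathcal{F}$. Then, since $\mathcal{F}$ is an algebra, $A_n \equiv \bigcup_{j=1}^{n} B_j \in \mathcal{F}$ for all $n$. Further, $A_n \subset A_{n+1}$ for all $n$ and $\bigcup_{n \ge 1} B_n = \bigcup_{n \ge 1} A_n$. Since by hypothesis $\bigcup_n A_n \in \mathcal{F}$, $\bigcup_n B_n \in \mathcal{F}$. □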

Here are some examples of algebras and σ-algebras.

Example 1.1.1: Let $\Omega = \{a, b, c, d\}$. Consider the classes $\mathcal{F}_1 = \{\Omega, \emptyset, \{a\}\}$ and $\mathcal{F}_2 = \{\Omega, \emptyset, \{a\}, \{b, c, d\}\}$. Then, $\mathcal{F}_2$ is an algebra (and also a σ-algebra), but $\mathcal{F}_1$ is not an algebra, since $\{a\}^c \notin \mathcal{F}_1$.

Example 1.1.2: Let $\Omega$ be any nonempty set and let

$\mathcal{F}_3 = \mathcal{P}(\Omega) \equiv \{A : A \subset \Omega\}$, the power set of $\Omega$,

and $\mathcal{F}_4 = \{\Omega, \emptyset\}$. Then, it is easy to check that both $\mathcal{F}_3$ and $\mathcal{F}_4$ are σ-algebras. The latter σ-algebra is often called the trivial σ-algebra on $\Omega$ (Problem 1.1).

From the definition it is clear that any σ-algebra is also an algebra and thus $\mathcal{F}_2$, $\mathcal{F}_3$, $\mathcal{F}_4$ are examples of algebras, too. The following is an example of an algebra that is not a σ-algebra.

Example 1.1.3: Let $\Omega$ be a nonempty set, and let $|A|$ denote the number of elements of a set $A \subset \Omega$. Define

$\mathcal{F}_5 = \{A \subset \Omega : \text{either } |A| \text{ is finite or } |A^c| \text{ is finite}\}.$


Then, note that (i) $\Omega \in \mathcal{F}_5$ (since $|\Omega^c| = |\emptyset| = 0$), and (ii) $A \in \mathcal{F}_5$ implies $A^c \in \mathcal{F}_5$ (if $|A| < \infty$, then $|(A^c)^c| = |A| < \infty$, and if $|A^c| < \infty$, then $A^c \in \mathcal{F}_5$ trivially). Next, suppose that $A, B \in \mathcal{F}_5$. If either $|A| < \infty$ or $|B| < \infty$, then $|A \cap B| \le \min\{|A|, |B|\} < \infty$, so that $A \cap B \in \mathcal{F}_5$. On the other hand, if both $|A^c| < \infty$ and $|B^c| < \infty$, then $|(A \cap B)^c| = |A^c \cup B^c| \le |A^c| + |B^c| < \infty$, implying that $A \cap B \in \mathcal{F}_5$. Thus, $\mathcal{F}_5$ is closed under finite intersections and hence is an algebra. However, if $|\Omega| = \infty$, then $\mathcal{F}_5$ is not a σ-algebra. To see this, suppose that $|\Omega| = \infty$ and $\{\omega_1, \omega_2, \ldots\} \subset \Omega$. Then, by definition, $A_i = \{\omega_i\} \in \mathcal{F}_5$ for all $i \ge 1$, but $A \equiv \bigcup_{i=1}^{\infty} A_{2i-1} = \{\omega_1, \omega_3, \ldots\} \notin \mathcal{F}_5$, since $|A| = |A^c| = \infty$.

Example 1.1.4: Let $\Omega$ be a nonempty set and let $\mathcal{F}_6 = \{A \subset \Omega : A \text{ is countable or } A^c \text{ is countable}\}$. Then, it is easy to show that $\mathcal{F}_6$ is a σ-algebra (Problem 1.3).

Suppose $\{\mathcal{F}_\theta : \theta \in \Theta\}$ is a family of σ-algebras on $\Omega$. From the definition, it follows that the intersection $\bigcap_{\theta \in \Theta} \mathcal{F}_\theta$ is a σ-algebra, no matter how large the index set $\Theta$ is (Problem 1.4). However, the union of two σ-algebras may not even be an algebra (Problem 1.5). For the development of measure theory and probability theory, the concept of a σ-algebra plays a crucial role. In many instances, given an arbitrary collection of subsets of $\Omega$, one would like to extend it to a possibly larger class that is a σ-algebra. This leads to the following definition.

Definition 1.1.3: If $\mathcal{A}$ is a class of subsets of $\Omega$, then the σ-algebra generated by $\mathcal{A}$, denoted by $\sigma\langle\mathcal{A}\rangle$, is defined as

$\sigma\langle\mathcal{A}\rangle = \bigcap_{\mathcal{F} \in \mathcal{I}(\mathcal{A})} \mathcal{F},$

where $\mathcal{I}(\mathcal{A}) \equiv \{\mathcal{F} : \mathcal{A} \subset \mathcal{F} \text{ and } \mathcal{F} \text{ is a σ-algebra on } \Omega\}$ is the collection of all σ-algebras containing the class $\mathcal{A}$. Note that since the power set $\mathcal{P}(\Omega)$ contains $\mathcal{A}$ and is itself a σ-algebra, the collection $\mathcal{I}(\mathcal{A})$ is not empty and hence, the intersection in the above definition is well defined.

Example 1.1.5: In the setup of Example 1.1.1, $\sigma\langle\mathcal{F}_1\rangle = \mathcal{F}_2$ (why?).

A particularly useful class of σ-algebras are those generated by open sets of a topological space. These are called Borel σ-algebras. A topological space is a pair $(S, \mathcal{T})$ where $S$ is a nonempty set and $\mathcal{T}$ is a collection of subsets of $S$ such that (i) $S \in \mathcal{T}$, (ii) $O_1, O_2 \in \mathcal{T} \Rightarrow O_1 \cap O_2 \in \mathcal{T}$, and (iii) $\{O_\alpha : \alpha \in I\} \subset \mathcal{T} \Rightarrow \bigcup_{\alpha \in I} O_\alpha \in \mathcal{T}$. Elements of $\mathcal{T}$ are called open sets.
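On a finite Ω, the definitions of an algebra and of the generated σ-algebra can be checked by brute force. The following is a minimal sketch for the classes of Example 1.1.1; the helper names `is_algebra` and `generated_sigma_algebra` are illustrative, and the closure loop relies on Ω being finite (where algebras and σ-algebras coincide).

```python
from itertools import combinations

def is_algebra(omega, family):
    """Check properties (a), (b), (c) of Definition 1.1.1 on a finite omega."""
    om = frozenset(omega)
    fam = {frozenset(a) for a in family}
    if om not in fam:                                   # (a) Omega in F
        return False
    if any(om - a not in fam for a in fam):             # (b) complements
        return False
    return all(a | b in fam for a, b in combinations(fam, 2))  # (c) unions

def generated_sigma_algebra(omega, family):
    """Smallest sigma-algebra on a FINITE omega containing family.

    Repeatedly adds complements and pairwise unions until nothing new appears.
    """
    om = frozenset(omega)
    fam = {frozenset(a) for a in family} | {om}
    while True:
        new = {om - a for a in fam} | {a | b for a in fam for b in fam}
        if new <= fam:
            return fam
        fam |= new

omega = {"a", "b", "c", "d"}
F1 = [omega, set(), {"a"}]
F2 = [omega, set(), {"a"}, {"b", "c", "d"}]

f1_is_algebra = is_algebra(omega, F1)      # False: {a}^c is missing
f2_is_algebra = is_algebra(omega, F2)      # True
sigma_F1 = generated_sigma_algebra(omega, F1)
```

Here `sigma_F1` consists of exactly the four sets of `F2`, which is one way to see the claim of Example 1.1.5.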


A metric space is a pair $(S, d)$ where $S$ is a nonempty set and $d$ is a function from $S \times S$ to $\mathbb{R}_+$ satisfying (i) $d(x, y) = d(y, x)$ for all $x, y$ in $S$, (ii) $d(x, y) = 0$ iff $x = y$, and (iii) $d(x, z) \le d(x, y) + d(y, z)$ for all $x, y, z$ in $S$. Property (iii) is called the triangle inequality. The function $d$ is called a metric on $S$ (see A.4). Any Euclidean space $\mathbb{R}^n$ ($1 \le n < \infty$) is a metric space under any one of the following metrics:

(a) For $1 \le p < \infty$, $d_p(x, y) = \bigl(\sum_{i=1}^{n} |x_i - y_i|^p\bigr)^{1/p}$.

(b) $d_\infty(x, y) = \max_{1 \le i \le n} |x_i - y_i|$.

(c) For $0 < p < 1$, $d_p(x, y) = \sum_{i=1}^{n} |x_i - y_i|^p$.
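The three families of metrics are easy to compute, and the reason case (c) omits the $1/p$ root can be seen on a small example. The following is a minimal sketch; the function names `d_p` and `d_inf` and the test points are illustrative choices.

```python
import random

def d_p(x, y, p):
    """Metrics (a) and (c): the 1/p root is taken only when p >= 1."""
    s = sum(abs(a - b) ** p for a, b in zip(x, y))
    return s ** (1.0 / p) if p >= 1 else s

def d_inf(x, y):
    """Metric (b): the largest coordinatewise distance."""
    return max(abs(a - b) for a, b in zip(x, y))

def triangle_holds(d, points, tol=1e-9):
    """Check d(x, z) <= d(x, y) + d(y, z) on all triples from points."""
    return all(d(x, z) <= d(x, y) + d(y, z) + tol
               for x in points for y in points for z in points)

random.seed(0)
pts = [tuple(random.uniform(-1, 1) for _ in range(3)) for _ in range(12)]
ok = {p: triangle_holds(lambda u, v, p=p: d_p(u, v, p), pts)
      for p in (0.5, 1, 2, 3)}
ok_inf = triangle_holds(d_inf, pts)

# With p = 1/2, taking the 1/p root would break the triangle inequality:
x, y, z = (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)
rooted = lambda u, v: d_p(u, v, 0.5) ** 2      # NOT a metric
violates = rooted(x, z) > rooted(x, y) + rooted(y, z)   # 4 > 1 + 1
```

A random check such as `triangle_holds` is only a sanity test, not a proof; the counterexample in the last lines is exact.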

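A metric space $(S, d)$ is a topological space where a set $O$ is open if for all $x \in O$, there is an $\epsilon > 0$ such that $B(x, \epsilon) \equiv \{y : d(y, x) < \epsilon\} \subset O$.

Definition 1.1.4: The Borel σ-algebra on a topological space $S$ (in particular, on a metric space or a Euclidean space) is defined as the σ-algebra generated by the collection of open sets in $S$.

Example 1.1.6: Let $\mathcal{B}(\mathbb{R}^k)$ denote the Borel σ-algebra on $\mathbb{R}^k$, $1 \le k < \infty$. Then, $\mathcal{B}(\mathbb{R}^k) \equiv \sigma\langle\{A : A \text{ is an open subset of } \mathbb{R}^k\}\rangle$ is also generated by each of the following classes of sets:

$\mathcal{O}_1 = \{(a_1, b_1) \times \cdots \times (a_k, b_k) : -\infty \le a_i < b_i \le \infty,\ 1 \le i \le k\};$
$\mathcal{O}_2 = \{(-\infty, x_1) \times \cdots \times (-\infty, x_k) : x_1, \ldots, x_k \in \mathbb{R}\};$
$\mathcal{O}_3 = \{(a_1, b_1) \times \cdots \times (a_k, b_k) : a_i, b_i \in \mathbb{Q},\ a_i < b_i,\ 1 \le i \le k\};$
$\mathcal{O}_4 = \{(-\infty, x_1) \times \cdots \times (-\infty, x_k) : x_1, \ldots, x_k \in \mathbb{Q}\},$

where $\mathbb{Q}$ denotes the set of all rational numbers. To show this, note that $\sigma\langle\mathcal{O}_i\rangle \subset \mathcal{B}(\mathbb{R}^k)$ for $i = 1, 2, 3, 4$, and hence, it is enough to show that $\sigma\langle\mathcal{O}_i\rangle \supset \mathcal{B}(\mathbb{R}^k)$. Let $\mathcal{G}$ be a σ-algebra containing $\mathcal{O}_3$. Observe that given any open set $A \subset \mathbb{R}^k$, there exists a sequence of sets $\{B_n\}_{n \ge 1}$ in $\mathcal{O}_3$ such that $A = \bigcup_{n \ge 1} B_n$ (Problem 1.9). Since $\mathcal{G}$ is a σ-algebra and $B_n \in \mathcal{G}$ for all $n \ge 1$, $A \in \mathcal{G}$. Thus, $\mathcal{G}$ is a σ-algebra containing all open subsets of $\mathbb{R}^k$, and hence $\mathcal{G} \supset \mathcal{B}(\mathbb{R}^k)$. Hence, it follows that

$\mathcal{B}(\mathbb{R}^k) \supset \sigma\langle\mathcal{O}_1\rangle \supset \sigma\langle\mathcal{O}_3\rangle = \bigcap_{\mathcal{G} : \mathcal{G} \supset \mathcal{O}_3} \mathcal{G} \supset \mathcal{B}(\mathbb{R}^k).$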


Next note that any interval $(a, b) \subset \mathbb{R}$ can be expressed in terms of half spaces of the form $(-\infty, x)$, $x \in \mathbb{R}$, as

$(a, b) = \bigcup_{n=1}^{\infty} \bigl[(-\infty, b) \setminus (-\infty, a + n^{-1})\bigr],$

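where for any two sets $A$ and $B$, $A \setminus B = \{x : x \in A,\ x \notin B\}$. It is not difficult to show that this implies that $\sigma\langle\mathcal{O}_i\rangle = \mathcal{B}(\mathbb{R}^k)$ for $i = 2, 4$ (Problem 1.10).

Example 1.1.7: Let $\Omega$ be a nonempty set with $|\Omega| = \infty$ and let $\mathcal{F}_5$ and $\mathcal{F}_6$ be as in Examples 1.1.3 and 1.1.4. Then $\mathcal{F}_6 = \sigma\langle\mathcal{F}_5\rangle$. To see this, note that $\mathcal{F}_6$ is a σ-algebra containing $\mathcal{F}_5$, so that $\sigma\langle\mathcal{F}_5\rangle \subset \mathcal{F}_6$. To prove the reverse inclusion, let $\mathcal{G}$ be a σ-algebra containing $\mathcal{F}_5$. It is enough to show that $\mathcal{F}_6 \subset \mathcal{G}$. Let $A \in \mathcal{F}_6$. If $A$ is countable, say $A = \{\omega_1, \omega_2, \ldots\}$, then $A_i \equiv \{\omega_i\} \in \mathcal{F}_5 \subset \mathcal{G}$ for all $i \ge 1$ and hence $A = \bigcup_{i=1}^{\infty} A_i \in \mathcal{G}$. On the other hand, if $A^c$ is countable, then by the above argument, $A^c \in \mathcal{G} \Rightarrow A \in \mathcal{G}$.

Definition 1.1.5: A class $\mathcal{C}$ of subsets of $\Omega$ is a π-system or a π-class if $A, B \in \mathcal{C} \Rightarrow A \cap B \in \mathcal{C}$.

Example 1.1.8: The class $\mathcal{C}$ of intervals in $\mathbb{R}$ is a π-system whereas the class of all open discs in $\mathbb{R}^2$ is not.

Definition 1.1.6: A class $\mathcal{L}$ of subsets of $\Omega$ is a λ-system or a λ-class if (i) $\Omega \in \mathcal{L}$, (ii) $A, B \in \mathcal{L}$, $A \subset B \Rightarrow B \setminus A \in \mathcal{L}$, and (iii) $A_n \in \mathcal{L}$, $A_n \subset A_{n+1}$ for all $n \ge 1 \Rightarrow \bigcup_{n \ge 1} A_n \in \mathcal{L}$.

Example 1.1.9: Every σ-algebra $\mathcal{F}$ is a λ-system. But an algebra need not be a λ-system.

It is easily checked that if $\mathcal{L}_1$ and $\mathcal{L}_2$ are λ-systems, then $\mathcal{L}_1 \cap \mathcal{L}_2$ is also a λ-system. Recall that $\sigma\langle\mathcal{B}\rangle$, the σ-algebra generated by $\mathcal{B}$, is the intersection of all σ-algebras containing $\mathcal{B}$ and is also the smallest σ-algebra containing $\mathcal{B}$. Similarly, for any $\mathcal{B} \subset \mathcal{P}(\Omega)$, the λ-system generated by $\mathcal{B}$, denoted by $\lambda\langle\mathcal{B}\rangle$, is defined as the intersection of all λ-systems containing $\mathcal{B}$. It is the smallest λ-system containing $\mathcal{B}$.

Theorem 1.1.2: (The π-λ theorem). If $\mathcal{C}$ is a π-system, then $\lambda\langle\mathcal{C}\rangle = \sigma\langle\mathcal{C}\rangle$.

Proof: For any $\mathcal{C}$, $\sigma\langle\mathcal{C}\rangle$ is a λ-system and $\sigma\langle\mathcal{C}\rangle$ contains $\mathcal{C}$. Thus, $\lambda\langle\mathcal{C}\rangle \subset \sigma\langle\mathcal{C}\rangle$ for any $\mathcal{C}$. Hence, it suffices to show that if $\mathcal{C}$ is a π-system, then $\lambda\langle\mathcal{C}\rangle$ is a σ-algebra. Since $\lambda\langle\mathcal{C}\rangle$ is a λ-system, it is closed under complementation and monotone increasing unions. By Proposition 1.1.1, it is enough to show that it is closed under intersection. Let $\lambda_1(\mathcal{C}) \equiv \{A : A \in \lambda\langle\mathcal{C}\rangle,\ A \cap B \in \lambda\langle\mathcal{C}\rangle \text{ for all } B \in \mathcal{C}\}$. Then, $\lambda_1(\mathcal{C})$ is a λ-system and, $\mathcal{C}$ being a π-system, $\lambda_1(\mathcal{C}) \supset \mathcal{C}$. Therefore, $\lambda_1(\mathcal{C}) \supset \lambda\langle\mathcal{C}\rangle$. But $\lambda_1(\mathcal{C}) \subset \lambda\langle\mathcal{C}\rangle$. So $\lambda_1(\mathcal{C}) = \lambda\langle\mathcal{C}\rangle$.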


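Next, let $\lambda_2(\mathcal{C}) \equiv \{A : A \in \lambda\langle\mathcal{C}\rangle,\ A \cap B \in \lambda\langle\mathcal{C}\rangle \text{ for all } B \in \lambda\langle\mathcal{C}\rangle\}$. Then $\lambda_2(\mathcal{C})$ is a λ-system and by the previous step $\mathcal{C} \subset \lambda_2(\mathcal{C}) \subset \lambda\langle\mathcal{C}\rangle$. Hence, it follows that $\lambda_2(\mathcal{C}) = \lambda\langle\mathcal{C}\rangle$, i.e., $\lambda\langle\mathcal{C}\rangle$ is closed under intersection. This completes the proof of the theorem. □

Corollary 1.1.3: If $\mathcal{C}$ is a π-system and $\mathcal{L}$ is a λ-system containing $\mathcal{C}$, then $\mathcal{L} \supset \sigma\langle\mathcal{C}\rangle$.

Remark 1.1.1: There are several equivalent definitions of λ-systems; see, for example, Billingsley (1995). A closely related concept is that of a monotone class; see, for example, Chung (1974).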
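The role of the π-system hypothesis can be seen on a four-point set. The following is a minimal sketch that computes the generated λ-system and the generated σ-algebra by brute-force closure; the helper names are illustrative, and the closure steps assume Ω is finite (so that increasing unions add nothing beyond what is already present).

```python
def close(omega, family, step):
    """Iterate a closure step until the family of subsets stops growing."""
    om = frozenset(omega)
    fam = {frozenset(a) for a in family} | {om}
    while True:
        new = step(om, fam)
        if new <= fam:
            return fam
        fam |= new

def sigma_step(om, fam):
    # Complements and pairwise unions (enough for a sigma-algebra on finite omega).
    return {om - a for a in fam} | {a | b for a in fam for b in fam}

def lambda_step(om, fam):
    # Proper differences B \ A with A a subset of B (Definition 1.1.6 (ii)).
    return {b - a for a in fam for b in fam if a <= b}

omega = {1, 2, 3, 4}
pi_system = [{1, 2}, {2, 3}, {2}]   # closed under intersection
not_pi = [{1, 2}, {2, 3}]           # the intersection {2} is missing

lam_pi = close(omega, pi_system, lambda_step)
sig_pi = close(omega, pi_system, sigma_step)
lam_np = close(omega, not_pi, lambda_step)
sig_np = close(omega, not_pi, sigma_step)
```

For the π-system the two closures agree (both are the full power set of 16 sets), as Theorem 1.1.2 asserts. For `not_pi` the generated λ-system has only 6 sets, namely Ω, ∅, {1,2}, {3,4}, {2,3}, {1,4}, while the generated σ-algebra still has 16.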

1.2 Measures

A set function is an extended real valued function defined on a class of subsets of a set $\Omega$. Measures are nonnegative set functions that, intuitively speaking, measure the content of a subset of $\Omega$. As explained in Section 2 of the introductory chapter, a measure has to satisfy certain natural requirements, such as the measure of the union of a countable collection of disjoint sets is the sum of the measures of the individual sets. Formally, one has the following definition.

Definition 1.2.1: Let $\Omega$ be a nonempty set and $\mathcal{F}$ be an algebra on $\Omega$. Then, a set function $\mu$ on $\mathcal{F}$ is called a measure if
(a) $\mu(A) \in [0, \infty]$ for all $A \in \mathcal{F}$;
(b) $\mu(\emptyset) = 0$;
(c) for any disjoint collection of sets $A_1, A_2, \ldots \in \mathcal{F}$ with $\bigcup_{n \ge 1} A_n \in \mathcal{F}$,

$\mu\Bigl(\bigcup_{n \ge 1} A_n\Bigr) = \sum_{n=1}^{\infty} \mu(A_n).$

As discussed in Section 2 of the introductory chapter, these conditions on $\mu$ are equivalent to finite additivity and monotone continuity from below.

Proposition 1.2.1: Let $\Omega$ be a nonempty set, $\mathcal{F}$ be an algebra of subsets of $\Omega$, and $\mu$ be a set function on $\mathcal{F}$ with values in $[0, \infty]$ and with $\mu(\emptyset) = 0$. Then, $\mu$ is a measure iff $\mu$ satisfies

(iii)a: (finite additivity) for all $A_1, A_2 \in \mathcal{F}$ with $A_1 \cap A_2 = \emptyset$, $\mu(A_1 \cup A_2) = \mu(A_1) + \mu(A_2)$, and

(iii)b: (monotone continuity from below or, m.c.f.b., in short) for any collection $\{A_n\}_{n \ge 1}$ of sets in $\mathcal{F}$ such that $A_n \subset A_{n+1}$ for all $n \ge 1$ and $\bigcup_{n \ge 1} A_n \in \mathcal{F}$,

$\mu\Bigl(\bigcup_{n \ge 1} A_n\Bigr) = \lim_{n \to \infty} \mu(A_n).$

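Proof: Let $\mu$ be a measure on $\mathcal{F}$. Since $\mu$ satisfies (iii), taking $A_3, A_4, \ldots$ to be $\emptyset$ yields (iii)a. This implies that for $A$ and $B$ in $\mathcal{F}$, $A \subset B \Rightarrow \mu(B) = \mu(A) + \mu(B \setminus A) \ge \mu(A)$, i.e., $\mu$ is monotone. To establish (iii)b, note that if $\mu(A_n) = \infty$ for some $n = n_0$, then $\mu(A_n) = \infty$ for all $n \ge n_0$ and $\mu(\bigcup_{n \ge 1} A_n) = \infty$, and (iii)b holds in this case. Hence, suppose that $\mu(A_n) < \infty$ for all $n \ge 1$. Setting $B_n = A_n \setminus A_{n-1}$ for $n \ge 1$ (with $A_0 = \emptyset$), by (iii)a, $\mu(B_n) = \mu(A_n) - \mu(A_{n-1})$. Since $\{B_n\}_{n \ge 1}$ is a disjoint collection of sets in $\mathcal{F}$ with $\bigcup_{n \ge 1} B_n = \bigcup_{n \ge 1} A_n$, by (iii),

$\mu\Bigl(\bigcup_{n \ge 1} A_n\Bigr) = \mu\Bigl(\bigcup_{n \ge 1} B_n\Bigr) = \sum_{n=1}^{\infty} \mu(B_n) = \lim_{N \to \infty} \sum_{n=1}^{N} [\mu(A_n) - \mu(A_{n-1})] = \lim_{N \to \infty} \mu(A_N),$

and so (iii)b holds also in this case.

Conversely, let $\mu$ satisfy $\mu(\emptyset) = 0$ and (iii)a and (iii)b. Let $\{A_n\}_{n \ge 1}$ be a disjoint collection of sets in $\mathcal{F}$ with $\bigcup_{i \ge 1} A_i \in \mathcal{F}$. Let $C_n = \bigcup_{j=1}^{n} A_j$ for $n \ge 1$. Since $\mathcal{F}$ is an algebra, $C_n \in \mathcal{F}$ for all $n \ge 1$. Also, $C_n \subset C_{n+1}$ for all $n \ge 1$ and $\bigcup_{n \ge 1} C_n = \bigcup_{j \ge 1} A_j$. Hence, by (iii)b,

$\mu\Bigl(\bigcup_{j \ge 1} A_j\Bigr) = \mu\Bigl(\bigcup_{n \ge 1} C_n\Bigr) = \lim_{n \to \infty} \mu(C_n) = \lim_{n \to \infty} \sum_{j=1}^{n} \mu(A_j) \ \text{(by (iii)a)} = \sum_{j=1}^{\infty} \mu(A_j).$

Thus, (iii) holds. □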

Remark 1.2.1: The definition of a measure given in Definition 1.2.1 is valid when $\mathcal{F}$ is a σ-algebra. However, very often one may start with a measure on an algebra $\mathcal{A}$ and then extend it to a measure on the σ-algebra $\sigma\langle\mathcal{A}\rangle$. This is why the definition of a measure on an algebra is given here. In the same vein, one may begin with a definition of a measure on a class of subsets of $\Omega$ that form only a semialgebra (cf. Definition 1.3.1). As described in the introductory chapter, such preliminary collections of sets are "nice" sets for which the measure may be defined easily, and the extension to a σ-algebra containing these sets may be necessary if one is interested in more general sets. This topic is discussed in greater detail in the next section.


Definition 1.2.2: A measure $\mu$ is called finite or infinite according as $\mu(\Omega) < \infty$ or $\mu(\Omega) = \infty$. A finite measure with $\mu(\Omega) = 1$ is called a probability measure. A measure $\mu$ on a σ-algebra $\mathcal{F}$ is called σ-finite if there exists a countable collection of sets $A_1, A_2, \ldots \in \mathcal{F}$, not necessarily disjoint, such that

(a) $\bigcup_{n \ge 1} A_n = \Omega$ and (b) $\mu(A_n) < \infty$ for all $n \ge 1$.

Here are some examples of measures.

Example 1.2.1: (The counting measure). Let $\Omega$ be a nonempty set and $\mathcal{F}_3 = \mathcal{P}(\Omega)$ be the set of all subsets of $\Omega$ (cf. Example 1.1.2). Define

$\mu(A) = |A|, \quad A \in \mathcal{F}_3,$

where $|A|$ denotes the number of elements in $A$. It is easy to check that $\mu$ satisfies the requirements (a)–(c) of a measure. This measure $\mu$ is called the counting measure on $\Omega$. Note that $\mu$ is finite iff $\Omega$ is finite and it is σ-finite if $\Omega$ is countably infinite.

Example 1.2.2: (Discrete probability measures). Let $\omega_1, \omega_2, \ldots \in \Omega$ and $p_1, p_2, \ldots \in [0, 1]$ be such that $\sum_{i=1}^{\infty} p_i = 1$. Define for any $A \subset \Omega$

$P(A) = \sum_{i=1}^{\infty} p_i I_A(\omega_i),$

where $I_A(\cdot)$ denotes the indicator function of a set $A$, defined by $I_A(\omega) = 0$ or $1$ according as $\omega \notin A$ or $\omega \in A$. For any disjoint collection of sets $A_1, A_2, \ldots \in \mathcal{P}(\Omega)$,

$P\Bigl(\bigcup_{i=1}^{\infty} A_i\Bigr) = \sum_{j=1}^{\infty} p_j I_{\bigcup_{i=1}^{\infty} A_i}(\omega_j) = \sum_{j=1}^{\infty} p_j \sum_{i=1}^{\infty} I_{A_i}(\omega_j) = \sum_{i=1}^{\infty} \sum_{j=1}^{\infty} p_j I_{A_i}(\omega_j) = \sum_{i=1}^{\infty} P(A_i),$

where interchanging the order of summation is permissible since the summands are nonnegative. This shows that $P$ is a probability measure on $\mathcal{P}(\Omega)$.
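The definition of $P$ in Example 1.2.2 translates directly into code. The following is a minimal sketch with a hypothetical set of eleven atoms $\omega_i \in \{0, 1, \ldots, 10\}$ and exact rational weights; the names `make_discrete_measure` and `atoms` are illustrative.

```python
from fractions import Fraction

def make_discrete_measure(atoms):
    """atoms maps each omega_i to its weight p_i.

    Returns P with P(A) = sum of p_i * I_A(omega_i) for any set A.
    """
    def P(A):
        return sum((p for w, p in atoms.items() if w in A), Fraction(0))
    return P

# Hypothetical weights: p_k = 2^{-k} for k = 1..10, remaining mass at omega = 0.
atoms = {k: Fraction(1, 2**k) for k in range(1, 11)}
atoms[0] = 1 - sum(atoms.values())
P = make_discrete_measure(atoms)

evens = {k for k in atoms if k % 2 == 0}
odds = {k for k in atoms if k % 2 == 1}
total = P(evens) + P(odds)   # additivity over a disjoint split of the atoms
```

Points of $A$ that are not atoms contribute nothing, so for instance `P({1, 99})` equals `P({1})`.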


Example 1.2.3: (Lebesgue-Stieltjes measures on $\mathbb{R}$). As mentioned in the previous chapter (cf. Section 2), a large class of measures on the Borel σ-algebra $\mathcal{B}(\mathbb{R})$ of subsets of $\mathbb{R}$, known as the Lebesgue-Stieltjes measures, arise from nondecreasing right continuous functions $F : \mathbb{R} \to \mathbb{R}$. For each such $F$, the corresponding measure $\mu_F$ satisfies $\mu_F((a, b]) = F(b) - F(a)$ for all $-\infty < a < b < \infty$. The construction of these $\mu_F$'s via the extension theorem will be discussed in the next section. Also, note that if $A_n = (-n, n)$, $n = 1, 2, \ldots$, then $\mathbb{R} = \bigcup_{n \ge 1} A_n$ and $\mu_F(A_n) < \infty$ for each $n \ge 1$ (such measures are called Radon measures) and thus, $\mu_F$ is necessarily σ-finite.

Proposition 1.2.2: Let $\mu$ be a measure on an algebra $\mathcal{F}$, and let $A, B, A_1, \ldots, A_k \in \mathcal{F}$, $1 \le k < \infty$. Then,
(i) (Monotonicity) $\mu(A) \le \mu(B)$ if $A \subset B$;
(ii) (Finite subadditivity) $\mu(A_1 \cup \ldots \cup A_k) \le \mu(A_1) + \ldots + \mu(A_k)$;
(iii) (Inclusion-exclusion formula) If $\mu(A_i) < \infty$ for all $i = 1, \ldots, k$, then

$\mu(A_1 \cup \ldots \cup A_k) = \sum_{i=1}^{k} \mu(A_i) - \sum_{1 \le i < j \le k} \mu(A_i \cap A_j) + \cdots + (-1)^{k-1} \mu(A_1 \cap \ldots \cap A_k).$

[...] $\epsilon > 0$ is arbitrary, this yields $\mu(A) \le \mu^*(A)$.

Thus, given a measure $\mu$ on a semialgebra $\mathcal{C} \subset \mathcal{P}(\Omega)$, there is a complete measure space $(\Omega, \mathcal{M}_{\mu^*}, \mu^*)$ such that $\mathcal{M}_{\mu^*} \supset \mathcal{C}$ and $\mu^*$ restricted to $\mathcal{C}$ equals $\mu$. For this reason, $\mu^*$ is called an extension of $\mu$. The measure space $(\Omega, \mathcal{M}_{\mu^*}, \mu^*)$ is called the Caratheodory extension of $\mu$. Since $\mathcal{M}_{\mu^*}$ is a σ-algebra and contains $\mathcal{C}$, $\mathcal{M}_{\mu^*}$ must contain $\sigma\langle\mathcal{C}\rangle$, the σ-algebra generated by $\mathcal{C}$, and thus, $(\Omega, \sigma\langle\mathcal{C}\rangle, \mu^*)$ is also a measure space. However, the latter may not be complete (see Section 1.4).

Now the above method is applied to the construction of Lebesgue-Stieltjes measures on $\mathbb{R}$ and $\mathbb{R}^2$.

1.3.2 Lebesgue-Stieltjes measures on R

Let $F : \mathbb{R} \to \mathbb{R}$ be nondecreasing. For $x \in \mathbb{R}$, let $F(x+) \equiv \lim_{y \downarrow x} F(y)$ and $F(x-) \equiv \lim_{y \uparrow x} F(y)$. Set $F(\infty) = \lim_{x \uparrow \infty} F(x)$ and $F(-\infty) = \lim_{x \downarrow -\infty} F(x)$. Let

$\mathcal{C} \equiv \{(a, b] : -\infty \le a \le b < \infty\} \cup \{(a, \infty) : -\infty \le a < \infty\}. \quad (3.7)$

Define

$\mu_F((a, b]) = F(b+) - F(a+),$
$\mu_F((a, \infty)) = F(\infty) - F(a+). \quad (3.8)$

Then, it may be verified that (i) $\mathcal{C}$ is a semialgebra; (ii) $\mu_F$ is a measure on $\mathcal{C}$. (For (ii), one needs to use the Heine-Borel theorem. See Problems 1.22 and 1.23.) Let $(\mathbb{R}, \mathcal{M}_{\mu_F^*}, \mu_F^*)$ be the Caratheodory extension of $\mu_F$, i.e., the measure space constructed as in the above two theorems.
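For a right continuous $F$ one has $F(x+) = F(x)$, so the formula in (3.8) for bounded intervals reduces to a difference of two function values. The following is a minimal sketch under that assumption; the function name `mu_F` and the three sample choices of $F$ are illustrative.

```python
import math

def mu_F(F, a, b):
    """mu_F((a, b]) = F(b) - F(a), assuming F is nondecreasing and right continuous."""
    if a > b:
        raise ValueError("need a <= b")
    return F(b) - F(a)

identity = lambda x: x                                       # gives interval length
Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))   # standard normal cdf
step = lambda x: 1.0 if x >= 0 else 0.0                      # unit jump at 0

length = mu_F(identity, 1.0, 3.0)      # 2.0
central = mu_F(Phi, -1.96, 1.96)       # roughly 0.95
mass_at_zero = mu_F(step, -1.0, 0.0)   # (-1, 0] contains the jump point
no_mass = mu_F(step, 0.0, 1.0)         # (0, 1] does not contain it
```

The last two lines show why half-open intervals $(a, b]$ pair naturally with right continuity: the jump of `step` at 0 is counted in $(-1, 0]$ and not in $(0, 1]$.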


Definition 1.3.7: Let $F : \mathbb{R} \to \mathbb{R}$ be nondecreasing. The (measure) space $(\mathbb{R}, \mathcal{M}_{\mu_F^*}, \mu_F^*)$ is called a Lebesgue-Stieltjes measure space and $\mu_F^*$ is the Lebesgue-Stieltjes measure generated by $F$.

Since $\sigma\langle\mathcal{C}\rangle = \mathcal{B}(\mathbb{R})$, the class of all Borel sets of $\mathbb{R}$, every Lebesgue-Stieltjes measure $\mu_F^*$ is also a measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Note also that $\mu_F^*$ is finite on bounded intervals. Conversely, given any Radon measure $\mu$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, i.e., a measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ that is finite on bounded intervals, set

$F(x) = \begin{cases} \mu((0, x]) & \text{if } x > 0 \\ 0 & \text{if } x = 0 \\ -\mu((x, 0]) & \text{if } x < 0. \end{cases}$

Then $\mu_F = \mu$ on $\mathcal{C}$. By the uniqueness of the extension (discussed later in this section, see also Theorem 1.2.4), it follows that $\mu_F^*$ coincides with $\mu$ on $\mathcal{B}(\mathbb{R})$. Thus, every Radon measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is necessarily a Lebesgue-Stieltjes measure.

Definition 1.3.8: (Lebesgue measure on $\mathbb{R}$). When $F(x) \equiv x$, $x \in \mathbb{R}$, the measure $\mu_F^*$ is called the Lebesgue measure and the σ-algebra $\mathcal{M}_{\mu_F^*}$ is called the class of Lebesgue measurable sets. The Lebesgue measure will be denoted by $m(\cdot)$ or $\mu_L(\cdot)$.

Given below are some important results on $m(\cdot)$.

(i) It follows from equation (3.1) that $\mu_F^*(\{x\}) = F(x+) - F(x-)$, and hence $\mu_F^*(\{x\}) = 0$ if $F$ is continuous at $x$. Thus $m(\{x\}) \equiv 0$ on $\mathbb{R}$.

(ii) By countable additivity of $m(\cdot)$, $m(A) = 0$ for any countable set $A$.

(iii) (Cantor set). There exists an uncountable set $C$ such that $m(C) = 0$. An example is the Cantor set constructed as follows: Start with $I_0 = [0, 1]$. Delete the open middle third, i.e., $(\frac{1}{3}, \frac{2}{3})$. Next, from the closed intervals $I_{11} = [0, \frac{1}{3}]$ and $I_{12} = [\frac{2}{3}, 1]$ delete the open middle thirds, i.e., $(\frac{1}{9}, \frac{2}{9})$ and $(\frac{7}{9}, \frac{8}{9})$, respectively. Repeat this process of deleting the middle third from each of the remaining closed intervals. Thus at stage $n$ there will be $2^n$ new closed intervals and $2^{n-1}$ deleted open intervals, each of length $\frac{1}{3^n}$. Let $U_n$ denote the union of all the deleted open intervals at the $n$th stage. Then $\{U_n\}_{n \ge 1}$ are disjoint open sets. Let $U \equiv \bigcup_{n \ge 1} U_n$. By countable additivity,

$m(U) = \sum_{n=1}^{\infty} m(U_n) = \sum_{n=1}^{\infty} \frac{2^{n-1}}{3^n} = 1.$

Let $C \equiv [0, 1] - U$. Since $U$ is open and $[0, 1]$ is closed, $C$ is nonempty. It can be shown that $C \equiv \{x : x = \sum_{i=1}^{\infty} \frac{a_i}{3^i},\ a_i = 0 \text{ or } 2\}$ (Problem

1.3 The extension theorems and Lebesgue-Stieltjes measures


1.33). Thus $C$ can be mapped in a one-to-one manner onto the set of all sequences $\{\delta_i\}_{i \ge 1}$ such that $\delta_i = 0$ or $2$. But this set is uncountable. Since $m([0, 1]) = 1$, it follows that $m(C) = 0$. For more properties of the Cantor set, see Rudin (1976) and Chapter 4.

(iv) $m(\cdot)$ is invariant under reflection and translation. That is, for any $E$ in $\mathcal{B}(\mathbb{R})$, $m(-E) = m(E)$ and $m(E + c) = m(E)$ for all $c$ in $\mathbb{R}$, where $-E = \{-x : x \in E\}$ and $E + c = \{y : y = x + c,\ x \in E\}$. This follows from Theorem 1.2.4 and the fact that the claim holds for intervals (cf. Problem 2.48).

(v) There exists a subset $A \subset \mathbb{R}$ such that $A \notin \mathcal{M}_m$. That is, there exists a non-Lebesgue measurable set. The proof of this requires the use of the axiom of choice (cf. A.1). For a proof see Royden (1988).
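The bookkeeping in the Cantor set construction of item (iii) can be checked with exact rational arithmetic. The following is a minimal sketch that carries out the first ten stages; the function name `cantor_stage` and the variables are illustrative.

```python
from fractions import Fraction

def cantor_stage(intervals):
    """Remove the open middle third from each closed interval (l, r).

    Returns (kept closed intervals, removed open intervals) as endpoint pairs.
    """
    kept, removed = [], []
    for l, r in intervals:
        third = (r - l) / 3
        kept += [(l, l + third), (r - third, r)]
        removed.append((l + third, r - third))
    return kept, removed

closed = [(Fraction(0), Fraction(1))]
removed_total = Fraction(0)
removed_by_stage = []
history = []   # (number removed at stage n, closed intervals left, total removed)
for n in range(1, 11):
    closed, removed = cantor_stage(closed)
    removed_by_stage.append(removed)
    removed_total += sum(r - l for l, r in removed)
    history.append((len(removed), len(closed), removed_total))
```

After stage $n$ the total removed length is $\sum_{j=1}^{n} 2^{j-1}/3^{j} = 1 - (2/3)^n$, which tends to 1, in agreement with $m(U) = 1$ and $m(C) = 0$.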

1.3.3 Lebesgue-Stieltjes measures on R2

Let $F : \mathbb{R}^2 \to \mathbb{R}$ satisfy

$F(a_2, b_2) - F(a_2, b_1) - F(a_1, b_2) + F(a_1, b_1) \ge 0, \quad (3.9)$

and

$F(a_2, b_2) - F(a_1, b_1) \ge 0, \quad (3.10)$

for all $a_1 \le a_2$, $b_1 \le b_2$. Extend $F$ to $\bar{\mathbb{R}}^2$ by an appropriate limiting procedure. Let

$\mathcal{C}_2 \equiv \{I_1 \times I_2 : I_1, I_2 \in \mathcal{C}\}, \quad (3.11)$

where $\mathcal{C}$ is as in (3.7). Next, for $I_1 = (a_1, a_2]$, $I_2 = (b_1, b_2]$, $-\infty < a_1, a_2, b_1, b_2 < \infty$, set

$\mu_F(I_1 \times I_2) \equiv F(a_2+, b_2+) - F(a_2+, b_1+) - F(a_1+, b_2+) + F(a_1+, b_1+), \quad (3.12)$

where for any $a, b \in \mathbb{R}$, $F(a+, b+)$ is defined as

$F(a+, b+) \equiv \lim_{a' \downarrow a,\ b' \downarrow b} F(a', b').$

Note that by (3.10), the limit exists and hence, $F(a+, b+)$ is well defined. Further, by (3.9), the right side of (3.12) is nonnegative. Next extend the definition of $\mu_F$ to unbounded sets in $\mathcal{C}_2$ by the limiting procedure:

$\mu_F(I_1 \times I_2) = \lim_{n \to \infty} \mu_F\bigl((I_1 \times I_2) \cap J_n\bigr), \quad (3.13)$

where $J_n = (-n, n] \times (-n, n]$. Then it may be verified (Problems 1.24, 1.25) that


(i) C2 is a semialgebra (ii) µF is a measure on C2 . Let (R2 , Mµ∗F , µ∗F ) be the measure space constructed as in the above two theorems. The measure µ∗F is called the Lebesgue-Stieltjes measure generated by F and Mµ∗F a Lebesgue-Stieltjes σ-algebra. Again, in this case, Mµ∗ includes the σ-algebra σC ≡ B(R2 ) and so (R2 , B(R2 ), µ∗F ) is also a measure space. If F (a, b) = ab, then µF is called the Lebesgue measure on R2 . A similar construction holds for any Rk , k < ∞.
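For a continuous $F$ the one-sided limits can be dropped, and the measure of a bounded rectangle is the four-corner difference that appears on the left side of (3.9). The following is a minimal sketch under that continuity assumption; `rect_measure` and the sample functions are illustrative names.

```python
def rect_measure(F, a1, a2, b1, b2):
    """Measure of (a1, a2] x (b1, b2] for a continuous F, via the
    four-corner difference F(a2,b2) - F(a2,b1) - F(a1,b2) + F(a1,b1)."""
    if a1 > a2 or b1 > b2:
        raise ValueError("need a1 <= a2 and b1 <= b2")
    return F(a2, b2) - F(a2, b1) - F(a1, b2) + F(a1, b1)

area_F = lambda a, b: a * b      # F(a, b) = ab, the Lebesgue measure case
additive_F = lambda a, b: a + b  # four-corner difference vanishes identically

area = rect_measure(area_F, 1, 3, 2, 5)        # (3 - 1) * (5 - 2) = 6
zero = rect_measure(additive_F, 1, 3, 2, 5)    # 0
```

With $F(a, b) = ab$ the result is the usual area of the rectangle, while $F(a, b) = a + b$ satisfies (3.9) with equality everywhere and so assigns measure zero to every bounded rectangle.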

1.3.4 More on extension of measures

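Next the uniqueness of the extension $\mu^*$ and approximation of the $\mu^*$-measure of a set in $\mathcal{M}_{\mu^*}$ by that of a set from the algebra $\mathcal{A} \equiv \mathcal{A}(\mathcal{C})$ are considered. As in the case of measures defined on an algebra, a measure $\mu$ on a semialgebra $\mathcal{C} \subset \mathcal{P}(\Omega)$ is said to be σ-finite if there exists a countable collection $\{A_n\}_{n \ge 1} \subset \mathcal{C}$ such that (i) $\mu(A_n) < \infty$ for each $n \ge 1$ and (ii) $\bigcup_{n \ge 1} A_n = \Omega$. The following approximation result holds.

Theorem 1.3.4: Let $A \in \mathcal{M}_{\mu^*}$ and $\mu^*(A) < \infty$. Then, for each $\epsilon > 0$, there exist $B_1, B_2, \ldots, B_k \in \mathcal{C}$, $k < \infty$, with $B_i \cap B_j = \emptyset$ for $1 \le i \ne j \le k$, such that

$\mu^*\Bigl(A \,\triangle\, \bigcup_{j=1}^{k} B_j\Bigr) < \epsilon,$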

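where for any two sets $E_1$ and $E_2$, $E_1 \,\triangle\, E_2$ is the symmetric difference of $E_1$ and $E_2$, defined by $E_1 \,\triangle\, E_2 \equiv (E_1 \cap E_2^c) \cup (E_1^c \cap E_2)$.

Proof: By definition of $\mu^*$, $\mu^*(A) < \infty$ implies that for every $\epsilon > 0$, there exist $\{B_n\}_{n \ge 1} \subset \mathcal{C}$ such that $A \subset \bigcup_{n \ge 1} B_n$ and

$\mu^*(A) \le \sum_{n=1}^{\infty} \mu(B_n) \le \mu^*(A) + \epsilon/2 < \infty.$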

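Since $B_n \in \mathcal{C}$, $B_n^c$ is a finite union of disjoint sets from $\mathcal{C}$. W.l.o.g., it can be assumed that $\{B_n\}_{n \ge 1}$ are disjoint. (Otherwise, one can consider the sequence $B_1,\ B_2 \cap B_1^c,\ B_3 \cap B_2^c \cap B_1^c, \ldots$.) Next, $\sum_{n=1}^{\infty} \mu(B_n) < \infty$ implies that for every $\epsilon > 0$, there exists $k \in \mathbb{N}$ such that $\sum_{n=k+1}^{\infty} \mu(B_n) < \frac{\epsilon}{2}$. Since both $A$ and $\bigcup_{j=1}^{k} B_j$ are subsets of $\bigcup_{n \ge 1} B_n$,

$\mu^*\Bigl(A \,\triangle\, \bigcup_{j=1}^{k} B_j\Bigr) \le \mu^*\Bigl(A^c \cap \bigcup_{j=1}^{\infty} B_j\Bigr) + \mu^*\Bigl(\bigcup_{j=k+1}^{\infty} B_j\Bigr).$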


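But since $\mu^*$ is a measure on $(\Omega, \mathcal{M}_{\mu^*})$, $\mu^*\bigl(\bigcup_{n \ge 1} B_n\bigr) = \mu^*(A) + \mu^*\bigl((\bigcup_{n \ge 1} B_n) \cap A^c\bigr)$. Further, since $\mu^*(A) < \infty$,

$\mu^*\Bigl(\bigcup_{n \ge 1} B_n \cap A^c\Bigr) = \mu^*\Bigl(\bigcup_{n \ge 1} B_n\Bigr) - \mu^*(A)$
$\quad \le \sum_{n=1}^{\infty} \mu^*(B_n) - \mu^*(A)$ (since $\mu^*$ is countably subadditive)
$\quad = \sum_{n=1}^{\infty} \mu(B_n) - \mu^*(A)$ (since $\mu^* = \mu$ on $\mathcal{C}$)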

[...] $\eta > 0$. Choose $c > a$ and $d_n > b_n$, $n \ge 1$, such that $F(c) - F(a) < \eta$, $[c, b] \subset \bigcup_{n \ge 1} (a_n, d_n)$, and $F(d_n) - F(b_n) < \eta/2^n$ for all $n \in \mathbb{N}$. Next, apply the Heine-Borel theorem to the interval $[c, b]$ and the open cover $\{(a_n, d_n)\}_{n \ge 1}$ and extract a finite cover $\{(a_i, d_i)\}_{i=1}^{k}$ for $[c, b]$. W.l.o.g., assume that $c \in (a_1, d_1)$ and $b \in (a_k, d_k)$. Now verify that

$F(b) - F(a) \le \sum_{j=1}^{k} (F(b_j) - F(a_j)) + 2\eta \le \sum_{j=1}^{\infty} (F(b_j) - F(a_j)) + 2\eta.$

1.23 Extend the above arguments to the case when $(a, b]$ and $(a_i, b_i]$, $i \ge 1$, are not necessarily bounded intervals.

1.24 Verify that $\mathcal{C}_2$, defined in (3.11), is a semialgebra.

1.25 (a) Verify that the limit in (3.13) exists.
(b) Extend the arguments in Problems 1.22 and 1.23 to verify that $\mu_F$ of (3.12) and (3.13) is a measure on $\mathcal{C}_2$.

1.26 Establish Theorem 1.3.6 by completing the following:
(a) Suppose first that $\nu(\Omega) < \infty$. Verify that $\mathcal{L} \equiv \{A : A \in \sigma\langle\mathcal{C}\rangle,\ \mu^*(A) = \nu(A)\}$ is a λ-system and use the π-λ theorem.
(b) Extend the above to the σ-finite case.

1.27 Prove Corollary 1.3.5 for Lebesgue measure $m(\cdot)$.

1.28 Let $F$ be a discrete distribution function, i.e., $F$ is of the form

$F(x) \equiv \sum_{j=1}^{\infty} a_j I(x_j \le x), \quad x \in \mathbb{R},$

where $0 < a_j < \infty$, $\sum_{j \ge 1} a_j = 1$, $x_j \in \mathbb{R}$, $j \ge 1$. Show that $\mathcal{M}_{\mu_F^*} = \mathcal{P}(\mathbb{R})$.

(Hint: Show that $\mu_F^*(A^c) = 0$, where $A \equiv \{x_j : j \ge 1\}$, and use the fact that for any $B \subset \mathbb{R}$, $B \cap A \in \mathcal{B}(\mathbb{R})$.)

1.5 Problems

1.29 Let

\[
F(x) = \begin{cases} 0 & \text{for } x < 0 \\ x & \text{for } 0 \le x \le 1 \\ 1 & \text{for } x > 1. \end{cases}
\]

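Show that $\mathcal{M}_{\mu_F^*} \equiv \{A : A \in \mathcal{P}(\mathbb{R}),\ A \cap [0, 1] \in \mathcal{M}\}$, where $\mathcal{M}$ is the $\sigma$-algebra of Lebesgue measurable sets as in Definition 1.3.7.

1.30 Let $F(\cdot) = \frac{1}{2}\Phi(\cdot) + \frac{1}{2}F_P(\cdot)$ where $\Phi(\cdot)$ is the standard normal cdf, i.e., $\Phi(x) \equiv \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}\, du$ and $F_P(x) \equiv \sum_{k=0}^{\infty} e^{-2}\frac{2^k}{k!}\, I_{(-\infty, x]}(k)$, $x \in \mathbb{R}$. Let $F_1 = \Phi$, $F_2 = F_P$ and $F_3 = F$. Let $A_1 = (0, 1)$, $A_2 = \{x : x \in \mathbb{R},\ \sin x \in (0, \frac{1}{2})\}$, $A_3 = \{x : \text{for some integers } a_0, a_1, \ldots, a_k,\ k < \infty,\ \sum_{i=0}^{k} a_i x^i = 0\}$, the set of all algebraic numbers. Compute $\mu_{F_i}(A_j)$, $1 \le i, j \le 3$.

1.31 Let $\mu$ be a measure on a semialgebra $\mathcal{C} \subset \mathcal{P}(\Omega)$ where $\Omega$ is a nonempty set. Let $\mu^*$ be the outer measure generated by $\mu$ and let $\mathcal{M}_{\mu^*}$ be the $\sigma$-algebra of $\mu^*$-measurable sets as defined in Theorem 1.3.3.
(a) Show that for all $A \subset \Omega$, there exists a $B \in \sigma\langle\mathcal{C}\rangle$ such that $A \subset B$ and $\mu^*(A) = \mu^*(B)$. (Hint: If $\mu^*(A) = \infty$, take $B$ to be $\Omega$. If $\mu^*(A) < \infty$, use the definition of $\mu^*$ to show that for each $n \ge 1$, there exists $\{B_{nj}\}_{j\ge 1} \subset \mathcal{C}$ such that $A \subset B_n \equiv \bigcup_{j\ge 1} B_{nj}$ and $\mu^*(A) \le \sum_{j=1}^{\infty} \mu(B_{nj}) < \mu^*(A) + \frac{1}{n}$. Take $B = \bigcap_{n\ge 1} B_n$.)
(b) Show that for all $A \in \mathcal{M}_{\mu^*}$ with $\mu^*(A) < \infty$, there exists $B \in \sigma\langle\mathcal{C}\rangle$ such that $A \subset B$ and $\mu^*(B \setminus A) = 0$. (Hint: Use (a) and the relation $B = A \cup (B \setminus A)$ with $A$ and $B \setminus A = B \cap A^c$ in $\mathcal{M}_{\mu^*}$.)
(c) Show that if $\mu$ is $\sigma$-finite (i.e., there exist sets $\Omega_n$, $n \ge 1$, in $\mathcal{C}$ with $\mu(\Omega_n) < \infty$ for all $n \ge 1$ and $\bigcup_{n\ge 1}\Omega_n = \Omega$), then in (b), the hypothesis that $\mu^*(A) < \infty$ can be dropped. (Hint: Assume w.l.o.g. that $\{\Omega_n\}_{n\ge 1}$ are disjoint. Apply (b) to $\{A_n \equiv A \cap \Omega_n\}_{n\ge 1}$.)
(d) Show that if $\mu$ is $\sigma$-finite, then $A \in \mathcal{M}_{\mu^*}$ iff there exist sets $B_1, B_2 \in \sigma\langle\mathcal{C}\rangle$ such that $B_1 \subset A \subset B_2$ and $\mu^*(B_2 \setminus B_1) = 0$. (Hint: Apply (c) to both $A$ and $A^c$.) This shows that $\mathcal{M}_{\mu^*}$ is the completion of $\sigma\langle\mathcal{C}\rangle$ w.r.t. $\mu^*$.

1.32 (An outline of a proof of Corollary 1.3.5). Let $(\mathbb{R}, \mathcal{M}_{\mu_F^*}, \mu_F^*)$ be a Lebesgue-Stieltjes measure space generated by a right continuous and nondecreasing function $F : \mathbb{R} \to \mathbb{R}$.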


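(a) Show that $A \in \mathcal{M}_{\mu_F^*}$ iff there exist Borel sets $B_1$ and $B_2 \in \mathcal{B}(\mathbb{R})$ such that $B_1 \subset A \subset B_2$ and $\mu_F^*(B_2 \setminus B_1) = 0$. (Hint: Take $\mathcal{C}$ to be the semialgebra $\mathcal{C} = \{(a, b] : -\infty \le a \le b < \infty\} \cup \{(b, \infty) : -\infty \le b < \infty\}$ and apply Problem 1.31 (d).)

(b) Let $A \in \mathcal{M}_{\mu_F^*}$ with $\mu_F^*(A) < \infty$. Show that for any $\epsilon > 0$, there exist a finite number of bounded open intervals $I_j$, $j = 1, 2, \ldots, k$, such that $\mu_F^*\big(A \,\triangle\, \bigcup_{j=1}^{k} I_j\big) < \epsilon$.
(Outline: Claim: For any $B \in \mathcal{C}$ with $\mu_F(B) < \infty$, there exists an open interval $I$ such that $\mu_F^*(I \,\triangle\, B) < \epsilon$. To see this, note that if $B = (a, b]$, $-\infty \le a < b < \infty$, then one may choose $b' > b$ such that $F(b') - F(b) < \epsilon$. Now, with $I = (a, b')$, $\mu_F^*(I \,\triangle\, B) = \mu_F^*((b, b')) = \mu_F((b, b')) \le F(b') - F(b) < \epsilon$. If $B = (b, \infty)$ and $\mu_F(B) < \infty$, then there exists $b' > b$ such that $F(\infty) - F(b'-) < \epsilon$. Hence, with $I = (b, b')$, $\mu_F^*(I \,\triangle\, B) = \mu_F^*([b', \infty)) = F(\infty) - F(b'-) < \epsilon$. This proves the claim. Next, by Theorem 1.3.4, for all $\epsilon > 0$, there exist $B_1, B_2, \ldots, B_k \in \mathcal{C}$ such that $\mu_F^*\big(A \,\triangle\, \bigcup_{j=1}^{k} B_j\big) < \epsilon/2$. For each $B_j$, find $I_j$, a bounded open interval such that $\mu_F^*(B_j \,\triangle\, I_j) < \epsilon/2^{j+1}$. Since $(A_1 \cup A_2) \,\triangle\, (C_1 \cup C_2) \subset (A_1 \,\triangle\, C_1) \cup (A_2 \,\triangle\, C_2)$ for any $A_1, A_2, C_1, C_2 \subset \Omega$, it follows that

\[
\mu_F^*\Big(\bigcup_{j=1}^{k} B_j \,\triangle\, \bigcup_{j=1}^{k} I_j\Big) \le \sum_{j=1}^{k} \mu_F^*(B_j \,\triangle\, I_j) < \frac{\epsilon}{2}.
\]

Hence, $\mu_F^*\big(A \,\triangle\, \big[\bigcup_{j=1}^{k} I_j\big]\big) < \epsilon$.)

(c) Let $A \in \mathcal{M}_{\mu_F^*}$ with $\mu_F^*(A) < \infty$. Show that for every $\epsilon > 0$, there exists an open set $O$ such that $A \subset O$ and $\mu_F^*(O \setminus A) < \epsilon$.
(Hint: By definition of $\mu_F^*$, for every $\epsilon > 0$, there exist $\{B_j\}_{j\ge 1} \subset \mathcal{C}$ such that $A \subset \bigcup_{j\ge 1} B_j$ and $\mu_F^*(A) \le \sum_{j=1}^{\infty} \mu_F(B_j) \le \mu_F^*(A) + \epsilon$. Now as in (b), there exist open intervals $I_j$ such that $B_j \subset I_j$ and $\mu_F^*(I_j \setminus B_j) < \epsilon/2^j$ for all $j \ge 1$. Then $A \subset \bigcup_{j=1}^{\infty} B_j \subset \bigcup_{j=1}^{\infty} I_j \equiv O$. Also, $\mu_F^*(O) = \mu_F^*(A) + \mu_F^*(O \setminus A)$ implies $\mu_F^*(O \setminus A) = \mu_F^*(O) - \mu_F^*(A) < 2\epsilon$ (since $\mu_F^*(O) \le \sum_{j=1}^{\infty} \mu_F^*(I_j) < \sum_{j=1}^{\infty} \mu_F(B_j) + \epsilon \le \mu_F^*(A) + 2\epsilon < \infty$).)

(d) Extend (c) to all $A \in \mathcal{M}_{\mu_F^*}$. (Hint: Let $A_i = A \cap [i, i + 1]$, $i \in \mathbb{Z}$. Apply (c) to $A_i$ with $\epsilon_i = \epsilon/2^{|i|+1}$ and take unions.)

(e) Show that for all $A \in \mathcal{M}_{\mu_F^*}$ and for all $\epsilon > 0$, there exist a closed set $C$ and an open set $O$ such that $C \subset A \subset O$ and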


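$\mu_F^*(O \setminus A) \le \epsilon$, $\mu_F^*(A \setminus C) < \epsilon$. (Hint: Apply (d) to $A$ and $A^c$.)

(f) Show that for all $A \in \mathcal{M}_{\mu_F^*}$ with $\mu_F^*(A) < \infty$ and for all $\epsilon > 0$, there exist a closed and bounded set $F \subset A$ such that $\mu_F^*(A \setminus F) < \epsilon$ and an open set $O$ with $A \subset O$ such that $\mu_F^*(O \setminus A) < \epsilon$. (Hint: Apply (d) to $A \cap [-M, M]$ where $M$ is chosen so that $\mu_F^*(A \cap [-M, M]^c) < \epsilon$. Why is this possible?)

Remark: Thus for all $A \in \mathcal{M}_{\mu_F^*}$ with $\mu_F^*(A) < \infty$ and for all $\epsilon > 0$, there exist a compact set $K \subset A$ and an open set $O \supset A$ such that $\mu_F^*(A \setminus K) < \epsilon$ and $\mu_F^*(O \setminus A) < \epsilon$. The first property is called inner regularity of $\mu_F^*$ and the second property is called outer regularity of $\mu_F^*$.

(g) Show that for all $A \in \mathcal{M}_{\mu_F^*}$ with $\mu_F^*(A) < \infty$ and for all $\epsilon > 0$, there exists a continuous function $g_\epsilon$ with compact support (i.e., $g_\epsilon(x)$ is zero for $|x|$ large) such that $\mu_F^*(A \,\triangle\, g_\epsilon^{-1}\{1\}) < \epsilon$.
(Hint: For any bounded open interval $(a, b)$, let $\eta > 0$ be such that $\mu_F((a, a + \eta]) + \mu_F([b - \eta, b)) < \epsilon$. Next define

\[
g_\epsilon(x) = \begin{cases} 1 & \text{if } a + \eta \le x \le b - \eta \\ 0 & \text{if } x \notin (a, b) \\ \text{linear} & \text{over } [a, a + \eta] \cup [b - \eta, b]. \end{cases}
\]

Then $g_\epsilon$ is continuous with compact support. Also, $g_\epsilon^{-1}\{1\} = [a + \eta, b - \eta]$ and $(a, b) \,\triangle\, g_\epsilon^{-1}\{1\} = (a, a + \eta) \cup (b - \eta, b)$. So $\mu_F\{(a, b) \,\triangle\, g_\epsilon^{-1}\{1\}\} < \epsilon$, proving the claim for $A = (a, b)$. The general case follows from (b).)

(h) Show that for all $A \in \mathcal{M}_{\mu_F^*}$ and for all $\epsilon > 0$, there exists a continuous function $g_\epsilon$ (not necessarily with compact support) such that $\mu_F^*(A \,\triangle\, g_\epsilon^{-1}\{1\}) < \epsilon$ (i.e., drop the condition $\mu_F^*(A) < \infty$).
(Hint: Let $A_k = A \cap [k, k + 1]$, $k \in \mathbb{Z}$. Find $g_k : \mathbb{R} \to \mathbb{R}$ continuous with support in $[k - \frac{1}{8}, k + \frac{9}{8}]$ such that $\mu_F^*(I_{A_k} \ne g_k) < \epsilon/2^{|k|+1}$. Let $g = \sum_{k\in\mathbb{Z}} g_k$. Note that for any $x \in \mathbb{R}$, at most two $g_k(x) \ne 0$ and so $g$ is continuous. Also, $\mu_F^*(I_A \ne g) \le \sum_{k\in\mathbb{Z}} \mu_F^*(I_{A_k} \ne g_k) < \epsilon$.)

1.33 Let $C$ be the Cantor set in $[0,1]$ as defined in Section 1.3.2.
(a) Show that

\[
C = \Big\{x : x = \sum_{i=1}^{\infty} \frac{a_i}{3^i},\ a_i \in \{0, 2\}\Big\}
\]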


and hence that C is uncountable. (b) Show that C + C ≡ {x + y : x, y ∈ C} = [0, 2].
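The identity in Problem 1.33 (b) can be explored at a finite level before attempting a proof. The sketch below is only an illustration under stated assumptions: it works with the points of $C$ whose ternary expansion stops after $n$ digits, stores them as integer numerators over $3^n$, and checks that their pairwise sums hit every point of the grid $\{0, 2/3^n, 4/3^n, \ldots, 2 - 2/3^n\}$. The function names are hypothetical, and a finite check of this kind is not a substitute for the limiting argument the problem asks for.

```python
from itertools import product


def cantor_level_numerators(n):
    """Numerators k of the points k / 3**n whose n ternary digits all lie in {0, 2}."""
    return {
        sum(d * 3 ** (n - 1 - i) for i, d in enumerate(digits))
        for digits in product((0, 2), repeat=n)
    }


def sums_cover_grid(n):
    """True if the pairwise sums equal {0, 2, 4, ..., 2 * (3**n - 1)} (numerators over 3**n)."""
    pts = cantor_level_numerators(n)
    sums = {x + y for x in pts for y in pts}
    return sums == set(range(0, 2 * (3 ** n - 1) + 1, 2))
```

The grid spacing $2/3^n$ shrinks to zero as $n$ grows, which is the finite-level shadow of $C + C = [0, 2]$.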

2 Integration

2.1 Measurable transformations

Oftentimes, one is not interested in the full details of a measure space $(\Omega, \mathcal{F}, \mu)$ but only in certain functions defined on $\Omega$. For example, if $\Omega$ represents the outcomes of 10 tosses of a fair coin, one may only be interested in knowing the number of heads in the 10 tosses. It turns out that to assign measures (probabilities) to sets (events) involving such functions, one can allow only certain functions (called measurable functions) that satisfy some 'natural' restrictions, specified in the following definitions.

Definition 2.1.1: Let $\Omega$ be a nonempty set and let $\mathcal{F}$ be a $\sigma$-algebra on $\Omega$. Then the pair $(\Omega, \mathcal{F})$ is called a measurable space. If $\mu$ is a measure on $(\Omega, \mathcal{F})$, then the triplet $(\Omega, \mathcal{F}, \mu)$ is called a measure space. If in addition, $\mu$ is a probability measure, then $(\Omega, \mathcal{F}, \mu)$ is called a probability space.

Definition 2.1.2: (a) Let $(\Omega, \mathcal{F})$ be a measurable space. Then a function $f : \Omega \to \mathbb{R}$ is called $\langle\mathcal{F}, \mathcal{B}(\mathbb{R})\rangle$-measurable (or $\mathcal{F}$-measurable) if for each $a$ in $\mathbb{R}$

\[
f^{-1}((-\infty, a]) \equiv \{\omega : f(\omega) \le a\} \in \mathcal{F}. \tag{1.1}
\]

(b) Let $(\Omega, \mathcal{F}, P)$ be a probability space. Then a function $X : \Omega \to \mathbb{R}$ is called a random variable, if the event $X^{-1}((-\infty, a]) \equiv \{\omega : X(\omega) \le a\} \in \mathcal{F}$


for each $a$ in $\mathbb{R}$, i.e., a random variable is a real valued $\mathcal{F}$-measurable function on a probability space $(\Omega, \mathcal{F}, P)$.

It will be shown later that condition (1.1) on $f$ is equivalent to the stronger condition that $f^{-1}(A) \in \mathcal{F}$ for all Borel sets $A \in \mathcal{B}(\mathbb{R})$. Since for any Borel set $A \in \mathcal{B}(\mathbb{R})$, $f^{-1}(A)$ is a member of the underlying $\sigma$-algebra $\mathcal{F}$, one can assign a measure to the set $f^{-1}(A)$ using a measure $\mu$ on $(\Omega, \mathcal{F})$. Note that for an arbitrary function $T : \Omega \to \mathbb{R}$, $T^{-1}(A)$ need not be a member of $\mathcal{F}$ and hence such an assignment may not be possible. Thus, condition (1.1) on real valued mappings is a 'natural' requirement while dealing with measure spaces. The following definition generalizes (1.1) to maps between two measurable spaces.

Definition 2.1.3: Let $(\Omega_i, \mathcal{F}_i)$, $i = 1, 2$, be measurable spaces. Then, a mapping $T : \Omega_1 \to \Omega_2$ is called measurable with respect to the $\sigma$-algebras $\langle\mathcal{F}_1, \mathcal{F}_2\rangle$ (or $\langle\mathcal{F}_1, \mathcal{F}_2\rangle$-measurable) if

\[
T^{-1}(A) \in \mathcal{F}_1 \quad \text{for all } A \in \mathcal{F}_2.
\]
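When both spaces are finite, the condition in Definition 2.1.3 can be checked by brute force: compute every preimage and test membership. The sketch below is a minimal illustration, not part of the text's development; the helper names (`power_set`, `preimage`, `is_measurable`) and the constants are hypothetical. It uses the four-point space and the two $\sigma$-algebras that appear in Example 2.1.1, renamed `F_SMALL` and `F_POWER` to avoid clashing with the indices in the definition.

```python
from itertools import combinations


def power_set(omega):
    """All subsets of a finite set, as frozensets."""
    items = sorted(omega)
    return {
        frozenset(c)
        for r in range(len(items) + 1)
        for c in combinations(items, r)
    }


def preimage(T, domain, A):
    """T^{-1}(A) = {w in domain : T(w) in A}."""
    return frozenset(w for w in domain if T(w) in A)


def is_measurable(T, domain, F1, F2):
    """Check that T^{-1}(A) lies in F1 for every A in F2 (Definition 2.1.3)."""
    F1 = {frozenset(s) for s in F1}
    return all(preimage(T, domain, A) in F1 for A in F2)


OMEGA = frozenset("abcd")
F_SMALL = {frozenset(), OMEGA, frozenset("a"), frozenset("bcd")}
F_POWER = power_set(OMEGA)


def T1(w):
    """Constant map sending every point to 'a'."""
    return "a"


def T2(w):
    """Sends a, b to 'a' and c, d to 'c'."""
    return "a" if w in "ab" else "c"
```

With these definitions, `is_measurable(T2, OMEGA, F_SMALL, F_POWER)` is false because the preimage of `{"a"}` is `{"a", "b"}`, which is not in `F_SMALL`; enlarging the domain $\sigma$-algebra to `F_POWER` makes every map measurable.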

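Thus, $X$ is a random variable on a probability space $(\Omega, \mathcal{F}, P)$ iff $X$ is $\langle\mathcal{F}, \mathcal{B}(\mathbb{R})\rangle$-measurable. Some examples of measurable transformations are given below.

Example 2.1.1: Let $\Omega = \{a, b, c, d\}$, $\mathcal{F}_2 = \{\Omega, \emptyset, \{a\}, \{b, c, d\}\}$ and let $\mathcal{F}_3$ = the set of all subsets of $\Omega$. Define the mappings $T_i : \Omega \to \Omega$, $i = 1, 2$, by $T_1(\omega) \equiv a$ for $\omega \in \Omega$ and

\[
T_2(\omega) = \begin{cases} a & \text{if } \omega = a, b \\ c & \text{if } \omega = c, d. \end{cases}
\]

Then, $T_1$ is $\langle\mathcal{F}_2, \mathcal{F}_3\rangle$-measurable since for any $A \in \mathcal{F}_3$, $T_1^{-1}(A) = \Omega$ or $\emptyset$ according as $a \in A$ or $a \notin A$. By similar arguments, it follows that $T_2$ is $\langle\mathcal{F}_3, \mathcal{F}_2\rangle$-measurable. However, $T_2$ is not $\langle\mathcal{F}_2, \mathcal{F}_3\rangle$-measurable since $T_2^{-1}(\{a\}) = \{a, b\} \notin \mathcal{F}_2$.

As this simple example shows, measurability of a given mapping critically depends on the $\sigma$-algebras on its domain and range spaces. In general, if $T$ is $\langle\mathcal{F}_1, \mathcal{F}_2\rangle$-measurable, then $T$ is $\langle\tilde{\mathcal{F}}_1, \mathcal{F}_2\rangle$-measurable for any $\sigma$-algebra $\tilde{\mathcal{F}}_1 \supset \mathcal{F}_1$ and $T$ is $\langle\mathcal{F}_1, \tilde{\mathcal{F}}_2\rangle$-measurable for any $\tilde{\mathcal{F}}_2 \subset \mathcal{F}_2$.

Example 2.1.2: Let $T : \mathbb{R} \to \mathbb{R}$ be defined as

\[
T(x) = \begin{cases} \sin 2x & \text{if } x > 0 \\ 1 + \cos x & \text{if } x \le 0. \end{cases}
\]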


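Is $T$ measurable w.r.t. the Borel $\sigma$-algebras $\langle\mathcal{B}(\mathbb{R}), \mathcal{B}(\mathbb{R})\rangle$? If one is to apply the definition directly, one must check that $T^{-1}(A) \in \mathcal{B}(\mathbb{R})$ for all $A \in \mathcal{B}(\mathbb{R})$. However, finding $T^{-1}(A)$ for all Borel sets $A$ is not an easy task. In many instances like this one, verification of the measurability property of a given mapping by directly using the definition can be difficult. In such situations, one may use some easy-to-verify sufficient conditions. Some results of this type are given below.

Proposition 2.1.1: Let $(\Omega_i, \mathcal{F}_i)$, $i = 1, 2, 3$, be measurable spaces.

(i) Suppose that $\mathcal{F}_2 = \sigma\langle\mathcal{A}\rangle$ for some class of subsets $\mathcal{A}$ of $\Omega_2$. If $T : \Omega_1 \to \Omega_2$ is such that $T^{-1}(A) \in \mathcal{F}_1$ for all $A \in \mathcal{A}$, then $T$ is $\langle\mathcal{F}_1, \mathcal{F}_2\rangle$-measurable.

(ii) Suppose that $T_1 : \Omega_1 \to \Omega_2$ is $\langle\mathcal{F}_1, \mathcal{F}_2\rangle$-measurable and $T_2 : \Omega_2 \to \Omega_3$ is $\langle\mathcal{F}_2, \mathcal{F}_3\rangle$-measurable. Let $T = T_2 \circ T_1 : \Omega_1 \to \Omega_3$ denote the composition of $T_1$ and $T_2$, defined by $T(\omega_1) = T_2(T_1(\omega_1))$, $\omega_1 \in \Omega_1$. Then, $T$ is $\langle\mathcal{F}_1, \mathcal{F}_3\rangle$-measurable.

Proof: (i) Define the collection of sets $\mathcal{F} \equiv \{A \in \mathcal{F}_2 : T^{-1}(A) \in \mathcal{F}_1\}$. Then,

(a) $T^{-1}(\Omega_2) = \Omega_1 \in \mathcal{F}_1 \Rightarrow \Omega_2 \in \mathcal{F}$.

(b) If $A \in \mathcal{F}$, then $T^{-1}(A) \in \mathcal{F}_1 \Rightarrow (T^{-1}(A))^c \in \mathcal{F}_1 \Rightarrow T^{-1}(A^c) = (T^{-1}(A))^c \in \mathcal{F}_1$, implying $A^c \in \mathcal{F}$.

(c) If $A_1, A_2, \ldots \in \mathcal{F}$, then $T^{-1}(A_i) \in \mathcal{F}_1$ for all $i \ge 1$. Since $\mathcal{F}_1$ is a $\sigma$-algebra, $T^{-1}\big(\bigcup_{n\ge 1} A_n\big) = \bigcup_{n\ge 1} T^{-1}(A_n) \in \mathcal{F}_1$. Thus, $\bigcup_{n\ge 1} A_n \in \mathcal{F}$. (See also Problem 2.1 on de Morgan's laws.)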

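Hence, by (a), (b), (c), $\mathcal{F}$ is a $\sigma$-algebra and by hypothesis $\mathcal{A} \subset \mathcal{F}$. Hence, $\mathcal{F}_2 = \sigma\langle\mathcal{A}\rangle \subset \mathcal{F} \subset \mathcal{F}_2$. Thus, $\mathcal{F} = \mathcal{F}_2$ and $T$ is $\langle\mathcal{F}_1, \mathcal{F}_2\rangle$-measurable.

(ii) Let $A \in \mathcal{F}_3$. Then, $T_2^{-1}(A) \in \mathcal{F}_2$, since $T_2$ is $\langle\mathcal{F}_2, \mathcal{F}_3\rangle$-measurable. Also, by the $\langle\mathcal{F}_1, \mathcal{F}_2\rangle$-measurability of $T_1$, $T^{-1}(A) = T_1^{-1}(T_2^{-1}(A)) \in \mathcal{F}_1$, showing that $T$ is $\langle\mathcal{F}_1, \mathcal{F}_3\rangle$-measurable. □

Proposition 2.1.2: For any $k, p \in \mathbb{N}$, if $f : \mathbb{R}^p \to \mathbb{R}^k$ is continuous, then $f$ is $\langle\mathcal{B}(\mathbb{R}^p), \mathcal{B}(\mathbb{R}^k)\rangle$-measurable.

Proof: Let $\mathcal{A} = \{A : A \text{ is an open set in } \mathbb{R}^k\}$. Then, by the continuity of $f$, $f^{-1}(A)$ is open and hence, is in $\mathcal{B}(\mathbb{R}^p)$ (cf. Section A.4). Thus, $f^{-1}(A) \in \mathcal{B}(\mathbb{R}^p)$ for all $A \in \mathcal{A}$. Since $\mathcal{B}(\mathbb{R}^k) = \sigma\langle\mathcal{A}\rangle$, by Proposition 2.1.1 (i), $f$ is $\langle\mathcal{B}(\mathbb{R}^p), \mathcal{B}(\mathbb{R}^k)\rangle$-measurable. □

Although the converse to the above proposition is not true, a result due to Lusin says that except on a set of small measure, $f$ coincides with a continuous function. This is stronger than the statement that except on a set of small measure, $f$ is close to a continuous function. For the statement and proof of Lusin's theorem, see Theorem 2.5.12.

Proposition 2.1.3: Let $f_1, \ldots, f_k$ ($k \in \mathbb{N}$) be $\langle\mathcal{F}, \mathcal{B}(\mathbb{R})\rangle$-measurable transformations from $\Omega$ to $\mathbb{R}$. Then,

(i) $f = (f_1, \ldots, f_k)$ is $\langle\mathcal{F}, \mathcal{B}(\mathbb{R}^k)\rangle$-measurable.

(ii) $g = f_1 + \ldots + f_k$ is $\langle\mathcal{F}, \mathcal{B}(\mathbb{R})\rangle$-measurable.

(iii) $h \equiv \prod_{i=1}^{k} f_i$ is $\langle\mathcal{F}, \mathcal{B}(\mathbb{R})\rangle$-measurable.

(iv) Let $p \in \mathbb{N}$ and let $\psi : \mathbb{R}^k \to \mathbb{R}^p$ be continuous. Then, $\xi \equiv \psi \circ f$ is $\langle\mathcal{F}, \mathcal{B}(\mathbb{R}^p)\rangle$-measurable, where $f = (f_1, \ldots, f_k)$.

Proof: To prove (i), note that for any rectangle $R = (a_1, b_1) \times \ldots \times (a_k, b_k)$,

\begin{align*}
f^{-1}(R) &= \{\omega \in \Omega : a_1 < f_1(\omega) < b_1, \ldots, a_k < f_k(\omega) < b_k\} \\
&= \bigcap_{i=1}^{k} \{\omega \in \Omega : a_i < f_i(\omega) < b_i\} \\
&= \bigcap_{i=1}^{k} f_i^{-1}((a_i, b_i)) \in \mathcal{F},
\end{align*}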

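since each $f_i$ is $\langle\mathcal{F}, \mathcal{B}(\mathbb{R})\rangle$-measurable. Since the open rectangles generate $\mathcal{B}(\mathbb{R}^k)$, it follows from Proposition 2.1.1 (i) that $f$ is $\langle\mathcal{F}, \mathcal{B}(\mathbb{R}^k)\rangle$-measurable. To prove (ii), note that the function $g_1(x) \equiv x_1 + \ldots + x_k$, $x = (x_1, \ldots, x_k) \in \mathbb{R}^k$, is continuous on $\mathbb{R}^k$, and hence, by Proposition 2.1.2, is $\langle\mathcal{B}(\mathbb{R}^k), \mathcal{B}(\mathbb{R})\rangle$-measurable. Since $g = g_1 \circ f$, $g$ is $\langle\mathcal{F}, \mathcal{B}(\mathbb{R})\rangle$-measurable, by Proposition 2.1.1 (ii). The proofs of (iii) and (iv) are similar to that of (ii) and hence, are omitted. □

Corollary 2.1.4: The collection of $\langle\mathcal{F}, \mathcal{B}(\mathbb{R})\rangle$-measurable functions from $\Omega$ to $\mathbb{R}$ is closed under pointwise addition and multiplication as well as under scalar multiplication.

The proof of Corollary 2.1.4 is omitted. In view of the above, writing the function $T$ of Example 2.1.2 as $T(x) = (\sin 2x)\, I_{(0,\infty)}(x) + (1 + \cos x)\, I_{(-\infty,0]}(x)$, $x \in \mathbb{R}$, the $\langle\mathcal{B}(\mathbb{R}), \mathcal{B}(\mathbb{R})\rangle$-measurability of $T$ follows. Note that here $T$ is not continuous over $\mathbb{R}$ but only piecewise continuous (see also Problem 2.2).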


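Next, measurability of the limit of a sequence of measurable functions is considered. Let $\bar{\mathbb{R}} = \mathbb{R} \cup \{+\infty, -\infty\}$ denote the extended real line and let $\mathcal{B}(\bar{\mathbb{R}}) \equiv \sigma\langle\mathcal{B}(\mathbb{R}) \cup \{+\infty\} \cup \{-\infty\}\rangle$ denote the extended Borel $\sigma$-algebra on $\bar{\mathbb{R}}$.

Proposition 2.1.5: For each $n \in \mathbb{N}$, let $f_n : \Omega \to \bar{\mathbb{R}}$ be a $\langle\mathcal{F}, \mathcal{B}(\bar{\mathbb{R}})\rangle$-measurable function.

(i) Then, each of the functions $\sup_{n\in\mathbb{N}} f_n$, $\inf_{n\in\mathbb{N}} f_n$, $\limsup_{n\to\infty} f_n$, and $\liminf_{n\to\infty} f_n$ is $\langle\mathcal{F}, \mathcal{B}(\bar{\mathbb{R}})\rangle$-measurable.

(ii) The set $A \equiv \{\omega : \lim_{n\to\infty} f_n(\omega) \text{ exists and is finite}\}$ lies in $\mathcal{F}$ and the function $h \equiv (\lim_{n\to\infty} f_n) \cdot I_A$ is $\langle\mathcal{F}, \mathcal{B}(\mathbb{R})\rangle$-measurable.

Proof: (i) Let $g = \sup_{n\ge 1} f_n$. To show that $g$ is $\langle\mathcal{F}, \mathcal{B}(\bar{\mathbb{R}})\rangle$-measurable, it is enough to show that $\{\omega : g(\omega) \le r\} \in \mathcal{F}$ for all $r \in \mathbb{R}$ (cf. Problem 2.4). Now, for any $r \in \mathbb{R}$,

\begin{align*}
\{\omega : g(\omega) \le r\} &= \bigcap_{n=1}^{\infty} \{\omega : f_n(\omega) \le r\} \\
&= \bigcap_{n=1}^{\infty} f_n^{-1}((-\infty, r]) \in \mathcal{F},
\end{align*}

since $f_n^{-1}((-\infty, r]) \in \mathcal{F}$ for all $n \ge 1$, by the measurability of $f_n$. Next note that $\inf_{n\ge 1} f_n = -\sup_{n\ge 1}(-f_n)$ and hence, $\inf_{n\ge 1} f_n$ is $\langle\mathcal{F}, \mathcal{B}(\bar{\mathbb{R}})\rangle$-measurable. To prove the measurability of $\limsup_{n\to\infty} f_n$, define the functions $g_m = \sup_{n\ge m} f_n$, $m \ge 1$. Then, $g_m$ is $\langle\mathcal{F}, \mathcal{B}(\bar{\mathbb{R}})\rangle$-measurable for each $m \ge 1$ and since $g_m$ is nonincreasing in $m$, $\inf_{m\ge 1} g_m \equiv \limsup_{n\to\infty} f_n$ is also $\langle\mathcal{F}, \mathcal{B}(\bar{\mathbb{R}})\rangle$-measurable. A similar argument works for $\liminf_{n\to\infty} f_n$.

(ii) Let $h_1 = \limsup_{n\to\infty} f_n$ and $h_2 = \liminf_{n\to\infty} f_n$, and define $\tilde{h}_i = h_i\, I_{\mathbb{R}}(h_i)$, $i = 1, 2$. Note that $\tilde{h}_1 - \tilde{h}_2$ is $\langle\mathcal{F}, \mathcal{B}(\mathbb{R})\rangle$-measurable. Hence,

\begin{align*}
\{\omega : \lim_{n\to\infty} f_n(\omega) \text{ exists and is finite}\}
&= \{\omega : -\infty < \limsup_{n\to\infty} f_n(\omega) = \liminf_{n\to\infty} f_n(\omega) < \infty\} \\
&= \{\omega : -\infty < h_2(\omega) = h_1(\omega) < \infty\} \\
&= \{\omega : \tilde{h}_1(\omega) = \tilde{h}_2(\omega)\} \cap \{\omega : h_1(\omega) < \infty,\ h_2(\omega) > -\infty\} \\
&= (\tilde{h}_1 - \tilde{h}_2)^{-1}(\{0\}) \cap \{\omega : h_1(\omega) < \infty,\ h_2(\omega) > -\infty\} \in \mathcal{F}.
\end{align*}

Finally, note that $h = h_1 I_A$. □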


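Definition 2.1.4: Let $\{f_\lambda : \lambda \in \Lambda\}$ be a family of mappings from $\Omega_1$ into $\Omega_2$ and let $\mathcal{F}_2$ be a $\sigma$-algebra on $\Omega_2$. Then, $\sigma\langle\{f_\lambda^{-1}(A) : A \in \mathcal{F}_2,\ \lambda \in \Lambda\}\rangle$ is called the $\sigma$-algebra generated by $\{f_\lambda : \lambda \in \Lambda\}$ (w.r.t. $\mathcal{F}_2$) and is denoted by $\sigma\langle\{f_\lambda : \lambda \in \Lambda\}\rangle$.

Note that $\sigma\langle\{f_\lambda : \lambda \in \Lambda\}\rangle$ is the smallest $\sigma$-algebra on $\Omega_1$ that makes all $f_\lambda$'s measurable w.r.t. $\mathcal{F}_2$ on $\Omega_2$.

Example 2.1.3: Let $f = I_A$ for some set $A \subset \Omega_1$ and $\Omega_2 = \mathbb{R}$ and $\mathcal{F}_2 = \mathcal{B}(\mathbb{R})$. Then, $\sigma\langle\{f\}\rangle = \sigma\langle\{A\}\rangle = \{\Omega_1, \emptyset, A, A^c\}$.

Example 2.1.4: Let $\Omega_1 = \mathbb{R}^k$, $\Omega_2 = \mathbb{R}$, $\mathcal{F}_2 = \mathcal{B}(\mathbb{R})$, and for $1 \le i \le k$, let $f_i : \Omega_1 \to \Omega_2$ be defined as

\[
f_i(x_1, \ldots, x_k) = x_i, \quad (x_1, \ldots, x_k) \in \Omega_1 = \mathbb{R}^k.
\]

Then, $\sigma\langle\{f_i : 1 \le i \le k\}\rangle = \mathcal{B}(\mathbb{R}^k)$. To show this, note that any measurable rectangle $A_1 \times \ldots \times A_k$ can be written as $A_1 \times \ldots \times A_k = \bigcap_{i=1}^{k} f_i^{-1}(A_i)$ and hence $A_1 \times \ldots \times A_k \in \sigma\langle\{f_i : 1 \le i \le k\}\rangle$ for all $A_1, \ldots, A_k \in \mathcal{B}(\mathbb{R})$. Since $\mathcal{B}(\mathbb{R}^k)$ is generated by the collection of all measurable rectangles, $\mathcal{B}(\mathbb{R}^k) \subset \sigma\langle\{f_i : 1 \le i \le k\}\rangle$. Conversely, for any $A \in \mathcal{B}(\mathbb{R})$ and for any $1 \le i \le k$, $f_i^{-1}(A) = \mathbb{R} \times \ldots \times A \times \ldots \times \mathbb{R}$ (with $A$ in the $i$th position) is in $\mathcal{B}(\mathbb{R}^k)$. Therefore, $\sigma\langle\{f_i : 1 \le i \le k\}\rangle = \sigma\langle\{f_i^{-1}(A) : A \in \mathcal{B}(\mathbb{R}),\ 1 \le i \le k\}\rangle \subset \mathcal{B}(\mathbb{R}^k)$. Hence, $\sigma\langle\{f_i : 1 \le i \le k\}\rangle = \mathcal{B}(\mathbb{R}^k)$.

Proposition 2.1.6: Let $\{f_\lambda : \lambda \in \Lambda\}$ be an uncountable collection of maps from $\Omega_1$ to $\Omega_2$. Then for any $B \in \sigma\langle\{f_\lambda : \lambda \in \Lambda\}\rangle$, there exists a countable set $\Lambda_B \subset \Lambda$ such that $B \in \sigma\langle\{f_\lambda : \lambda \in \Lambda_B\}\rangle$.

Proof: The proof of this result is left as an exercise (Problem 2.5). □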
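For a single map on a finite set, the generated $\sigma$-algebra of Definition 2.1.4 can be listed explicitly. The sketch below rests on an assumption stated here and in the code: when $f$ takes finitely many real values, every subset of its range is a Borel set, so the preimages of those subsets already form a $\sigma$-algebra and hence equal $\sigma\langle\{f\}\rangle$. The function name `generated_sigma_algebra` is hypothetical. Applied to an indicator function, it reproduces the four sets of Example 2.1.3.

```python
from itertools import combinations


def generated_sigma_algebra(f, omega):
    """Sigma-algebra generated by f on a finite set omega.

    Assumes f takes finitely many real values, so each subset S of the
    range is Borel and f^{-1}(S) belongs to the generated sigma-algebra.
    """
    values = sorted({f(w) for w in omega})
    sets = set()
    for r in range(len(values) + 1):
        for S in combinations(values, r):
            sets.add(frozenset(w for w in omega if f(w) in S))
    return sets


OMEGA_1 = frozenset(range(6))
A = frozenset({1, 4})


def indicator_A(w):
    """I_A(w): 1 if w is in A, else 0."""
    return 1 if w in A else 0
```

A map with $m$ distinct values produces $2^m$ sets, so the generated $\sigma$-algebra records exactly which values of $f$ can be told apart.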

2.2 Induced measures, distribution functions

Suppose $X$ is a random variable defined on a probability space $(\Omega, \mathcal{F}, P)$. Then $P$ governs the probabilities assigned to events like $X^{-1}([a, b])$, $-\infty < a < b < \infty$. Since $X$ takes values in the real line, one should be able to express such probabilities only as a function of the set $[a, b]$. Clearly, since $X$ is $\langle\mathcal{F}, \mathcal{B}(\mathbb{R})\rangle$-measurable, $X^{-1}(A) \in \mathcal{F}$ for all $A \in \mathcal{B}(\mathbb{R})$ and the function

\[
P_X(A) \equiv P(X^{-1}(A)) \tag{2.1}
\]

is a set function defined on $\mathcal{B}(\mathbb{R})$. Is this a (probability) measure on $\mathcal{B}(\mathbb{R})$? The following proposition answers the question more generally.

Proposition 2.2.1: Let $(\Omega_i, \mathcal{F}_i)$, $i = 1, 2$, be measurable spaces and let $T : \Omega_1 \to \Omega_2$ be a $\langle\mathcal{F}_1, \mathcal{F}_2\rangle$-measurable mapping from $\Omega_1$ to $\Omega_2$. Then, for any measure $\mu$ on $(\Omega_1, \mathcal{F}_1)$, the set function $\mu T^{-1}$, defined by

\[
\mu T^{-1}(A) \equiv \mu(T^{-1}(A)), \quad A \in \mathcal{F}_2, \tag{2.2}
\]

is a measure on $\mathcal{F}_2$.

Proof: It is easy to check that $\mu T^{-1}$ satisfies the three conditions for being a measure. The details are left as an exercise (cf. Problem 2.9).

Definition 2.2.1: The measure $\mu T^{-1}$ is called the measure induced by $T$ (or the induced measure of $T$) on $\mathcal{F}_2$.

In particular, if $\mu(\Omega_1) = 1$, then $\mu T^{-1}(\Omega_2) = 1$. Hence, the set function $P_X$ defined in (2.1) is indeed a probability measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$.

Definition 2.2.2: For a random variable $X$ defined on a probability space $(\Omega, \mathcal{F}, P)$, the probability distribution of $X$ (or the law of $X$), denoted by $P_X$ (say), is the induced measure of $X$ under $P$ on $\mathbb{R}$, as defined in (2.2).

In introductory courses on probability and statistics, one defines probabilities of events like '$X \in [a, b]$' by using the probability mass function for discrete random variables and the probability density function for 'continuous' random variables. The measure-theoretic definition above allows one to treat both these cases as well as the case of 'mixed' distributions under a unified framework.

Definition 2.2.3: The cumulative distribution function (or cdf in short) of a random variable $X$ is defined as

\[
F_X(x) \equiv P_X((-\infty, x]), \quad x \in \mathbb{R}. \tag{2.3}
\]
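The induced measure (2.2) and the cdf (2.3) are easy to compute when $\Omega$ is finite. The following sketch takes the coin-tossing setting mentioned at the start of Section 2.1: $\Omega$ is the set of outcomes of 10 tosses with the uniform probability, and $X$ counts heads. The representation of a measure as a dictionary of point masses and the names `induced_measure` and `cdf` are assumptions made for illustration only.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product


def induced_measure(mu, T):
    """Point masses of mu T^{-1}: the mass of each value t is mu({w : T(w) = t})."""
    out = defaultdict(Fraction)
    for w, mass in mu.items():
        out[T(w)] += mass
    return dict(out)


def cdf(law, x):
    """F_X(x) = P_X((-inf, x]) for a law given by finitely many point masses."""
    return sum(mass for value, mass in law.items() if value <= x)


# Uniform probability on the 2**10 outcomes of 10 tosses of a fair coin.
OUTCOMES = list(product("HT", repeat=10))
P = {w: Fraction(1, len(OUTCOMES)) for w in OUTCOMES}


def number_of_heads(w):
    """The random variable X: number of heads in the outcome w."""
    return w.count("H")


LAW_X = induced_measure(P, number_of_heads)
```

Here `LAW_X` carries all the information about $X$ that probability statements need; the underlying space of 1024 outcomes no longer appears.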

Proposition 2.2.2: Let $F$ be the cdf of a random variable $X$.

(i) For $x_1 < x_2$, $F(x_1) \le F(x_2)$ (i.e., $F$ is nondecreasing on $\mathbb{R}$).

(ii) For $x$ in $\mathbb{R}$, $F(x) = \lim_{y\downarrow x} F(y)$ (i.e., $F$ is right continuous on $\mathbb{R}$).

(iii) $\lim_{x\to-\infty} F(x) = 0$ and $\lim_{x\to+\infty} F(x) = 1$.

Proof: For $x_1 < x_2$, $(-\infty, x_1] \subset (-\infty, x_2]$. Since $P_X$ is a measure on $\mathcal{B}(\mathbb{R})$, $F(x_1) = P_X((-\infty, x_1]) \le P_X((-\infty, x_2]) = F(x_2)$, proving (i).


To prove (ii), let $x_n \downarrow x$. Then, the sets $(-\infty, x_n] \downarrow (-\infty, x]$, and $P_X((-\infty, x_1]) = P(X \le x_1) \le 1$. Hence, using the monotone continuity from above of the measure $P_X$ (m.c.f.a.) (cf. Proposition 1.2.3), one gets

\[
\lim_{n\to\infty} F(x_n) = \lim_{n\to\infty} P_X((-\infty, x_n]) = P_X((-\infty, x]) = F(x).
\]

Next consider part (iii). Note that if $x_n \downarrow -\infty$ and $y_n \uparrow \infty$, then $(-\infty, x_n] \downarrow \emptyset$ and $(-\infty, y_n] \uparrow (-\infty, \infty)$. Hence, part (iii) follows from the m.c.f.a. and the m.c.f.b. properties of $P_X$ (cf. Propositions 1.2.1 and 1.2.3). □

Definition 2.2.4: A function $F : \mathbb{R} \to \mathbb{R}$ satisfying (i), (ii), and (iii) of Proposition 2.2.2 is called a cumulative distribution function (or cdf for short).

Thus, given a random variable $X$, its cdf $F_X$ satisfies properties (i), (ii), (iii) of Proposition 2.2.2. Conversely, given a cdf $F$, one can construct a probability space $(\Omega, \mathcal{F}, P)$ and a random variable $X$ on it such that the cdf of $X$ is $F$. Indeed, given a cdf $F$, note that by Theorem 1.3.3 and Definition 1.3.7, there exists a (Lebesgue-Stieltjes) probability measure $\mu_F$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such that $\mu_F((-\infty, x]) = F(x)$. Now define $X$ to be the identity map on $\mathbb{R}$, i.e., let $X(x) \equiv x$ for all $x \in \mathbb{R}$. Then, $X$ is a random variable on the probability space $(\mathbb{R}, \mathcal{B}(\mathbb{R}), \mu_F)$ with probability distribution $P_X = \mu_F$ and cdf $F_X = F$.

In addition to (i), (ii) and (iii) of Proposition 2.2.2, it is easy to verify that for any $x$ in $\mathbb{R}$,

\[
P(X < x) = F_X(x-) \equiv \lim_{y\uparrow x} F_X(y)
\]

and hence

\[
P(X = x) = F_X(x) - F_X(x-). \tag{2.4}
\]

Thus, the function $F_X(\cdot)$ has a jump at $x$ iff $P(X = x) > 0$. Since a monotone function from $\mathbb{R}$ to $\mathbb{R}$ can have only jump discontinuities and only a countable number of them (cf. Problem 2.11), for any random variable $X$, the set $\{a \in \mathbb{R} : P(X = a) > 0\}$ is countable. This leads to the following definitions.

Definition 2.2.5:

(a) A random variable $X$ is called discrete if there exists a countable set $A \subset \mathbb{R}$ such that $P(X \in A) = 1$.

(b) A random variable $X$ is called continuous if $P(X = x) = 0$ for all $x \in \mathbb{R}$.

Note that $X$ is continuous iff $F_X$ is continuous on all of $\mathbb{R}$ and $X$ is discrete iff the sum of all the jumps of its cdf $F_X$ is one. It may also be noted that if $F_X$ is a step function, then $X$ is discrete but not conversely. For example, consider the case where the set $A$ in the above definition is the set of all rational numbers.

It turns out that a given cdf may be written as a weighted sum of a discrete cdf and a continuous cdf. Let $F$ be a cdf. Let $A \equiv \{x : p(x) \equiv F(x) - F(x-) > 0\}$. As remarked earlier, $A$ is at most countable. Write $\alpha = \sum_{y\in A} p(y)$ and let $\tilde{F}_d(x) = \sum_{y\in A} p(y)\, I_{(-\infty,x]}(y)$, and $\tilde{F}_c(x) = F(x) - \tilde{F}_d(x)$. It is easy to verify that $\tilde{F}_c(\cdot)$ is continuous on $\mathbb{R}$. If $\alpha = 0$, then $F(x) = \tilde{F}_c(x)$ and $F$ is continuous. If $\alpha = 1$, then $F = \tilde{F}_d$ and $F$ is discrete. If $0 < \alpha < 1$, $F(\cdot)$ can be written as

\[
F(x) = \alpha F_d(x) + (1 - \alpha) F_c(x), \tag{2.5}
\]

where $F_d(x) \equiv \alpha^{-1}\tilde{F}_d(x)$ and $F_c(x) \equiv (1 - \alpha)^{-1}\tilde{F}_c(x)$ are both cdfs, with $F_d$ being discrete and $F_c$ being continuous. For a further decomposition of $F_c(\cdot)$ into absolutely continuous and singular continuous components, see Chapter 4.
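The decomposition (2.5) can be traced on a concrete mixed cdf. The sketch below is an illustration under assumptions that are not in the text: the jump locations and sizes are supplied by the caller rather than detected, $0 < \alpha < 1$ is taken for granted, and the example cdf (an atom of mass $1/4$ at $0$ mixed with a uniform distribution on $[0, 1]$) is chosen only because its components are easy to recognize. All names are hypothetical.

```python
from fractions import Fraction


def decompose(F, jumps):
    """Split a cdf F as alpha * F_d + (1 - alpha) * F_c, following (2.5).

    jumps maps each jump point y to its size p(y) = F(y) - F(y-).
    Assumes 0 < alpha < 1, where alpha is the total jump mass.
    """
    alpha = sum(jumps.values())

    def F_d_tilde(x):
        # Unnormalized discrete part: total jump mass at points y <= x.
        return sum(p for y, p in jumps.items() if y <= x)

    def F_d(x):
        return F_d_tilde(x) / alpha

    def F_c(x):
        return (F(x) - F_d_tilde(x)) / (1 - alpha)

    return alpha, F_d, F_c


def mixed_cdf(x):
    """Mass 1/4 at 0, remaining mass 3/4 spread uniformly over [0, 1]."""
    if x < 0:
        return Fraction(0)
    if x < 1:
        return Fraction(1, 4) + Fraction(3, 4) * x
    return Fraction(1)


ALPHA, F_D, F_C = decompose(mixed_cdf, {Fraction(0): Fraction(1, 4)})
```

For this example `F_D` is the cdf of the point mass at $0$ and `F_C` is the uniform cdf on $[0, 1]$, so the two pieces recovered by the formula match the ingredients used to build the mixture.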

2.2.1 Generalizations to higher dimensions

Induced distributions of random vectors and the associated cdfs are briefly considered in this section. Let $X = (X_1, X_2, \ldots, X_k)$ be a $k$-dimensional random vector defined on a probability space $(\Omega, \mathcal{F}, P)$. The probability distribution $P_X$ of $X$ is the induced probability measure on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$, defined by (cf. (2.1))

\[
P_X(A) \equiv P(X^{-1}(A)), \quad A \in \mathcal{B}(\mathbb{R}^k). \tag{2.6}
\]

The cdf $F_X$ of $X$ is now defined by

\[
F_X(x) = P(X \le x), \quad x \in \mathbb{R}^k, \tag{2.7}
\]

where for any $x = (x_1, x_2, \ldots, x_k)$ and $y = (y_1, y_2, \ldots, y_k)$ in $\mathbb{R}^k$, $x \le y$ means that $x_i \le y_i$ for all $i = 1, \ldots, k$. The extension of Proposition 2.2.2 to the $k$-dimensional case is notationally involved. Here, an analog of Proposition 2.2.2 for the bivariate case, i.e., for $k = 2$, is stated.

Proposition 2.2.3: Let $F$ be the cdf of a bivariate random vector $X = (X_1, X_2)$.

(i) Then, for any $x = (x_1, x_2) \le y = (y_1, y_2)$, $F(y_1, y_2) - F(x_1, y_2) - F(y_1, x_2) + F(x_1, x_2) \ge 0$.

(ii) For any $x = (x_1, x_2) \in \mathbb{R}^2$,

\[
\lim_{y_1\downarrow x_1,\, y_2\downarrow x_2} F(y_1, y_2) = F(x_1, x_2), \tag{2.8}
\]

i.e., $F$ is right continuous on $\mathbb{R}^2$.

(iii) $\lim_{x_1\to-\infty} F(x_1, a) = \lim_{x_2\to-\infty} F(a, x_2) = 0$ for all $a \in \mathbb{R}$; $\lim_{x_1\to\infty,\, x_2\to\infty} F(x_1, x_2) = 1$.

(iv) For any $a \in \mathbb{R}$, $\lim_{y\uparrow\infty} F(a, y) = P(X_1 \le a)$ and $\lim_{y\uparrow\infty} F(y, a) = P(X_2 \le a)$.

Proof: Clearly,

\begin{align*}
0 &\le P(X \in (x_1, y_1] \times (x_2, y_2]) \\
&= P(x_1 < X_1 \le y_1,\ x_2 < X_2 \le y_2) \\
&= P(X_1 \le y_1,\ x_2 < X_2 \le y_2) - P(X_1 \le x_1,\ x_2 < X_2 \le y_2) \\
&= P(X_1 \le y_1, X_2 \le y_2) - P(X_1 \le y_1, X_2 \le x_2) - [P(X_1 \le x_1, X_2 \le y_2) - P(X_1 \le x_1, X_2 \le x_2)] \\
&= F(y_1, y_2) - F(y_1, x_2) - F(x_1, y_2) + F(x_1, x_2).
\end{align*}

This proves (i). To prove (ii), note that for any sequence $y_{in} \downarrow x_i$, $i = 1, 2$, the sets $A_n = (-\infty, y_{1n}] \times (-\infty, y_{2n}] \downarrow A \equiv (-\infty, x_1] \times (-\infty, x_2]$. Hence, by the m.c.f.a. property of a probability measure, $F(y_{1n}, y_{2n}) = P(X \in A_n) \downarrow P(X \in A) = F(x_1, x_2)$.

For (iii), note that $(-\infty, x_{1n}] \times (-\infty, a] \downarrow \emptyset$ for any sequence $x_{1n} \downarrow -\infty$ and for any $a \in \mathbb{R}$. Hence, again by the m.c.f.a. property, $F(x_{1n}, a) \downarrow 0$ as $n \to \infty$. By similar arguments, $F(a, x_{2n}) \downarrow 0$ whenever $x_{2n} \downarrow -\infty$. To prove the last relation in (iii), apply the m.c.f.b. property to the sets $(-\infty, x_{1n}] \times (-\infty, x_{2n}] \uparrow \mathbb{R}^2$ for $x_{1n} \uparrow \infty$, $x_{2n} \uparrow \infty$. The proof of part (iv) is similar. □

Note that any function satisfying properties (i), (ii), (iii) of Proposition 2.2.3 determines a probability measure uniquely. This follows from the discussions in Section 1.3, as (1.3.9) and (1.3.10) follow from (i) and (iii) (Problem 2.12).

For a general $k \ge 1$, an analog of property (i) above is cumbersome to write down explicitly. Indeed, for any $x \le y$, now a sum involving $2^k$ terms must be nonnegative. However, properties (ii), (iii), and (iv) can be extended in an obvious way to the $k$-dimensional case. See Problem 2.13 for a precise statement. Also, for a general $k \ge 1$, functions satisfying the properties listed in Problem 2.13 uniquely determine a probability measure on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$.
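Property (i) of Proposition 2.2.3 says that the four-term combination of cdf values is the probability of the rectangle $(x_1, y_1] \times (x_2, y_2]$. The sketch below evaluates that combination for two functions chosen for illustration and not taken from the text: the cdf of two independent uniform variables on $[0, 1]$, and a function that is nondecreasing in each coordinate yet assigns negative "mass" to a rectangle, so it cannot be a bivariate cdf. All names are hypothetical.

```python
from fractions import Fraction


def rectangle_mass(F, x, y):
    """F(y1, y2) - F(x1, y2) - F(y1, x2) + F(x1, x2) for x <= y coordinatewise."""
    (x1, x2), (y1, y2) = x, y
    return F(y1, y2) - F(x1, y2) - F(y1, x2) + F(x1, x2)


def clip01(t):
    """Clamp t to the interval [0, 1]."""
    return min(max(t, Fraction(0)), Fraction(1))


def F_indep_uniform(x1, x2):
    """Cdf of (U1, U2) with U1, U2 independent and uniform on [0, 1]."""
    return clip01(x1) * clip01(x2)


def G_not_a_cdf(x1, x2):
    """Nondecreasing in each coordinate, but fails property (i)."""
    return 1 if x1 + x2 >= 1 else 0
```

For the uniform pair, the rectangle $(1/4, 1/2] \times (0, 1/2]$ receives mass $1/8$, its area. For `G_not_a_cdf`, the rectangle $(0, 1] \times (0, 1]$ receives $-1$, which shows that coordinatewise monotonicity alone does not give property (i).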

2.3 Integration

Let $(\Omega, \mathcal{F}, \mu)$ be a measure space and $f : \Omega \to \mathbb{R}$ be a measurable function. The goal of this section is to define the integral of $f$ with respect to the measure $\mu$ and establish some basic convergence results. The integral of a nonnegative function taking only finitely many values is defined first, which is then extended to all nonnegative measurable functions by approximation from below. Finally, the integral of an arbitrary measurable function is defined using the decomposition of the function into its positive and negative parts.

Definition 2.3.1: A function $f : \Omega \to \bar{\mathbb{R}} \equiv [-\infty, \infty]$ is called simple if there exist a finite set (of distinct elements) $\{c_1, \ldots, c_k\} \subset \bar{\mathbb{R}}$ and sets $A_1, \ldots, A_k \in \mathcal{F}$, $k \in \mathbb{N}$, such that $f$ can be written as

\[
f = \sum_{i=1}^{k} c_i I_{A_i}. \tag{3.1}
\]

Definition 2.3.2: (The integral of a simple nonnegative function). Let $f : \Omega \to \bar{\mathbb{R}}_+ \equiv [0, \infty]$ be a simple nonnegative function on $(\Omega, \mathcal{F}, \mu)$ with the representation (3.1). The integral of $f$ w.r.t. $\mu$, denoted by $\int f\, d\mu$, is defined as

\[
\int f\, d\mu \equiv \sum_{i=1}^{k} c_i\, \mu(A_i). \tag{3.2}
\]

Here and in the following, the relation $0 \cdot \infty = 0$ is adopted as a convention. It may be verified that the value of the integral in (3.2) does not depend on the representation of $f$. That is, if $f$ can be expressed as $f = \sum_{j=1}^{l} d_j I_{B_j}$ for some $d_1, \ldots, d_l \in \bar{\mathbb{R}}_+$ (not necessarily distinct) and for some sets $B_1, \ldots, B_l \in \mathcal{F}$, then $\sum_{i=1}^{k} c_i\, \mu(A_i) = \sum_{j=1}^{l} d_j\, \mu(B_j)$, so that the value of the integral remains unchanged (Problem 2.17). Also note that for the $f$ in Definition 2.3.2,

\[
0 \le \int f\, d\mu \le \infty.
\]

The following proposition is an easy consequence of the definition and the above remark.

Proposition 2.3.1: Let $f$ and $g$ be two simple nonnegative functions on $(\Omega, \mathcal{F}, \mu)$. Then

(i) (Linearity) For $\alpha \ge 0$, $\beta \ge 0$, $\int (\alpha f + \beta g)\, d\mu = \alpha \int f\, d\mu + \beta \int g\, d\mu$.

(ii) (Monotonicity) If $f \ge g$ a.e. $(\mu)$, i.e., $\mu(\{\omega : \omega \in \Omega,\ f(\omega) < g(\omega)\}) = 0$, then $\int f\, d\mu \ge \int g\, d\mu$.


(iii) If $f = g$ a.e. $(\mu)$, i.e., if $\mu(\{\omega : \omega \in \Omega,\ f(\omega) \ne g(\omega)\}) = 0$, then $\int f\, d\mu = \int g\, d\mu$.

Definition 2.3.3: (The integral of a nonnegative measurable function). Let $f : \Omega \to \bar{\mathbb{R}}_+$ be a nonnegative measurable function on $(\Omega, \mathcal{F}, \mu)$. The integral of $f$ with respect to $\mu$, also denoted by $\int f\, d\mu$, is defined as

\[
\int f\, d\mu = \lim_{n\to\infty} \int f_n\, d\mu, \tag{3.3}
\]

where $\{f_n\}_{n\ge 1}$ is any sequence of nonnegative simple functions such that $f_n(\omega) \uparrow f(\omega)$ for all $\omega$.

Note that by Proposition 2.3.1 (ii), the sequence $\{\int f_n\, d\mu\}_{n\ge 1}$ is nondecreasing, and hence the right side of (3.3) is well defined. That the right side of (3.3) is the same for all such approximating sequences of functions needs to be established and is the content of the following proposition. The proof of this proposition exploits in a crucial way the m.c.f.b. and the finite additivity of the set function $\mu$ (or, equivalently, the countable additivity of $\mu$).

Proposition 2.3.2: Let $\{f_n\}_{n\ge 1}$ and $\{g_n\}_{n\ge 1}$ be two sequences of simple nonnegative measurable functions on $(\Omega, \mathcal{F}, \mu)$ to $\bar{\mathbb{R}}_+$ such that as $n \to \infty$, $f_n(\omega) \uparrow f(\omega)$ and $g_n(\omega) \uparrow f(\omega)$ for all $\omega \in \Omega$. Then

\[
\lim_{n\to\infty} \int f_n\, d\mu = \lim_{n\to\infty} \int g_n\, d\mu. \tag{3.4}
\]

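Proof: Fix $N \in \mathbb{N}$ and $0 < \rho < 1$. It will now be shown that

\[
\lim_{n\to\infty} \int f_n\, d\mu \ge \rho \int g_N\, d\mu. \tag{3.5}
\]

Suppose that $g_N$ has the representation $g_N \equiv \sum_{i=1}^{k} d_i I_{B_i}$. Let $D_n = \{\omega \in \Omega : f_n(\omega) \ge \rho\, g_N(\omega)\}$, $n \ge 1$. Since $f_n(\omega) \uparrow f(\omega)$ for all $\omega$, $D_n \uparrow D \equiv \{\omega : f(\omega) \ge \rho\, g_N(\omega)\}$ (Problem 2.18 (b)). Also since $g_N(\omega) \le f(\omega)$ and $0 < \rho < 1$, $D = \Omega$. Now writing $f_n = f_n I_{D_n} + f_n I_{D_n^c}$, it follows from Proposition 2.3.1 that

\[
\int f_n\, d\mu \ge \int f_n I_{D_n}\, d\mu \ge \rho \int g_N I_{D_n}\, d\mu = \rho \sum_{i=1}^{k} d_i\, \mu(B_i \cap D_n). \tag{3.6}
\]

By the m.c.f.b. property, for each $i = 1, \ldots, k$, $\mu(B_i \cap D_n) \uparrow \mu(B_i \cap \Omega) = \mu(B_i)$ as $n \to \infty$. Since the sequence $\{\int f_n\, d\mu\}_{n\ge 1}$ is nondecreasing, taking limits in (3.6) yields (3.5). Next, letting $\rho \uparrow 1$ yields $\lim_{n\to\infty} \int f_n\, d\mu \ge \int g_N\, d\mu$ for each $N \in \mathbb{N}$ and hence,

\[
\lim_{n\to\infty} \int f_n\, d\mu \ge \lim_{n\to\infty} \int g_n\, d\mu.
\]

By symmetry, (3.4) follows and hence, the proof is complete. □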

Remark 2.3.1: It is easy to verify that Proposition 2.3.2 remains valid if $\{f_n\}_{n\ge 1}$ and $\{g_n\}_{n\ge 1}$ increase to $f$ a.e. $(\mu)$.

Given a nonnegative measurable function $f$, one can always construct a nondecreasing sequence $\{f_n\}_{n\ge 1}$ of nonnegative simple functions such that $f_n(\omega) \uparrow f(\omega)$ for all $\omega \in \Omega$ in the following manner. Let $\{\delta_n\}_{n\ge 1}$ be a sequence of positive real numbers and let $\{N_n\}_{n\ge 1}$ be a sequence of positive integers such that as $n \to \infty$, $\delta_n \downarrow 0$, $N_n \uparrow \infty$ and $N_n \delta_n \uparrow \infty$. Further, suppose that the sequence $P_n \equiv \{j\delta_n : j = 0, 1, 2, \ldots, N_n\}$ is nested, i.e., $P_n \subset P_{n+1}$ for each $n \ge 1$. Now set

\[
f_n(\omega) = \begin{cases} j\delta_n & \text{if } j\delta_n \le f(\omega) < (j + 1)\delta_n,\ j = 0, 1, 2, \ldots, (N_n - 1) \\ N_n \delta_n & \text{if } f(\omega) \ge N_n \delta_n. \end{cases} \tag{3.7}
\]

A specific choice of $\delta_n$ and $N_n$ is given by $\delta_n = 2^{-n}$, $N_n = n 2^n$. Thus, with the above choice of $\{f_n\}_{n\ge 1}$ in the definition of the Lebesgue integral $\int f\, d\mu$ in (3.3), the range of $f$ is subdivided into intervals of decreasing lengths. This is in contrast to the definition of the Riemann integral of $f$ over a bounded interval, which is defined via subdividing the domain of $f$ into finer subintervals.

Remark 2.3.2: In some cases it may be more appropriate to choose the approximating sequence $\{f_n\}_{n\ge 1}$ in a manner different from (3.7). For example, let $\Omega = \{\omega_i : i \ge 1\}$ be a countable set, $\mathcal{F} = \mathcal{P}(\Omega)$, the power set of $\Omega$, and let $\mu$ be a measure on $(\Omega, \mathcal{F})$. Then any function $f : \Omega \to \mathbb{R}_+ \equiv [0, \infty)$ is measurable and the integral $\int f\, d\mu$ coincides with the sum $\sum_{i=1}^{\infty} f(\omega_i)\, \mu(\{\omega_i\})$, as can be seen by choosing the approximating sequence $\{f_n\}_{n\ge 1}$ as

\[
f_n(\omega_i) = \begin{cases} f(\omega_i) & \text{for } i = 1, 2, \ldots, n \\ 0 & \text{for } i > n. \end{cases}
\]

Remark 2.3.3: The integral of a nonnegative measurable function can be alternatively defined as

\[
\int f\, d\mu = \sup\Big\{\int g\, d\mu : g \text{ nonnegative and simple},\ g \le f\Big\}.
\]

The equivalence of this to (3.3) is seen as follows. Clearly the right side above, say, $M$, is greater than or equal to $\int f\, d\mu$ as in (3.3). Conversely, there exists a sequence $\{g_n\}_{n\ge 1}$ of simple nonnegative functions with $g_n \le f$ for all $n \ge 1$ such that $\lim_n \int g_n\, d\mu$ equals the supremum $M$ defined above. Now set $h_n = \max\{g_j : 1 \le j \le n\}$, $n \ge 1$. Now it can be verified that for each $n \ge 1$, $h_n$ is nonnegative, simple, and satisfies $h_n \uparrow f$ and also that $\int h_n\, d\mu$ converges to $M$ (Problem 2.19 (b)).

Corollary 2.3.3: Let $f$ and $g$ be two nonnegative measurable functions on $(\Omega, \mathcal{F}, \mu)$. Then, the conclusions of Proposition 2.3.1 remain valid for such $f$ and $g$.

Proof: This follows from Proposition 2.3.1 for nonnegative simple functions and Definition 2.3.3. □

The definition of the integral $\int f\, d\mu$ of a nonnegative measurable function $f$ in (3.3) makes it possible to interchange limits and integration in a fairly routine manner. In particular, the following key result is a direct consequence of the definition.

Theorem 2.3.4: (The monotone convergence theorem or MCT). Let $\{f_n\}_{n\ge 1}$ and $f$ be nonnegative measurable functions on $(\Omega, \mathcal{F}, \mu)$ such that $f_n \uparrow f$ a.e. $(\mu)$. Then

\[
\int f\, d\mu = \lim_{n\to\infty} \int f_n\, d\mu. \tag{3.8}
\]

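Remark 2.3.4: The important difference between (3.4) and (3.8) is that in (3.8), the $f_n$'s need not be simple.

Proof: It is similar to the proof of Proposition 2.3.2. Let $\{g_n\}_{n\ge 1}$ be a sequence of nonnegative simple functions on $(\Omega, \mathcal{F}, \mu)$ such that $g_n(\omega) \uparrow f(\omega)$ for all $\omega$. By hypothesis, there exists a set $A \in \mathcal{F}$ such that $\mu(A^c) = 0$ and for $\omega$ in $A$, $f_n(\omega) \uparrow f(\omega)$. Fix $k \in \mathbb{N}$ and $0 < \rho < 1$. Let $D_n = \{\omega : \omega \in A,\ f_n(\omega) \ge \rho\, g_k(\omega)\}$, $n \ge 1$. Then, $D_n \uparrow D \equiv \{\omega : \omega \in A,\ f(\omega) \ge \rho\, g_k(\omega)\}$. Since $g_k(\omega) \le f(\omega)$ for all $\omega$, it follows that $D = A$. Now, by Corollary 2.3.3,

\[
\int f_n\, d\mu \ge \int f_n I_{D_n}\, d\mu \ge \rho \int g_k I_{D_n}\, d\mu \quad \text{for all } n \ge 1.
\]

By m.c.f.b., $\int g_k I_{D_n}\, d\mu \uparrow \int g_k I_A\, d\mu = \int g_k\, d\mu$ as $n \to \infty$, yielding

\[
\liminf_{n\to\infty} \int f_n\, d\mu \ge \rho \int g_k\, d\mu
\]

for all $0 < \rho < 1$ and all $k \in \mathbb{N}$. Letting $\rho \uparrow 1$ first and then $k \uparrow \infty$, from (3.3) one gets

\[
\liminf_{n\to\infty} \int f_n\, d\mu \ge \int f\, d\mu.
\]

By monotonicity (Corollary 2.3.3),

\[
\int f_n\, d\mu \le \int f\, d\mu \quad \text{for all } n \ge 1
\]

and so the proof is complete. □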

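Corollary 2.3.5: Let $\{h_n\}_{n\ge 1}$ be a sequence of nonnegative measurable functions on a measure space $(\Omega, \mathcal{F}, \mu)$. Then

$$\int \Big(\sum_{n=1}^{\infty} h_n\Big)\,d\mu = \sum_{n=1}^{\infty} \int h_n\,d\mu.$$

Proof: Let $f_n = \sum_{i=1}^{n} h_i$, $n \ge 1$, and let $f = \sum_{i=1}^{\infty} h_i$. Then, $0 \le f_n \uparrow f$. By the MCT,

$$\int f_n\,d\mu \uparrow \int f\,d\mu.$$

But by Corollary 2.3.3,

$$\int f_n\,d\mu = \sum_{i=1}^{n} \int h_i\,d\mu.$$

Hence, the result follows. □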

Corollary 2.3.6: Let $f$ be a nonnegative measurable function on a measure space $(\Omega, \mathcal{F}, \mu)$. For $A \in \mathcal{F}$, let $\nu(A) \equiv \int f I_A\,d\mu$. Then, $\nu$ is a measure on $(\Omega, \mathcal{F})$.

Proof: Let $\{A_n\}_{n\ge 1}$ be a sequence of disjoint sets in $\mathcal{F}$. Let $h_n = f I_{A_n}$ for $n \ge 1$. Then by Corollary 2.3.5,

$$\nu\Big(\bigcup_{n\ge 1} A_n\Big) = \int f I_{[\bigcup_{n\ge 1} A_n]}\,d\mu = \int f \cdot \sum_{n=1}^{\infty} I_{A_n}\,d\mu = \int \sum_{n=1}^{\infty} h_n\,d\mu = \sum_{n=1}^{\infty} \int h_n\,d\mu = \sum_{n=1}^{\infty} \nu(A_n).$$

□

Remark 2.3.5: Notice that $\mu(A) = 0 \Rightarrow \nu(A) = 0$. In this case $\nu$ is said to be dominated by or absolutely continuous with respect to $\mu$. The Radon-Nikodym theorem (see Chapter 4) provides a converse to this. That is, if $\nu$ and $\mu$ are two measures on a measurable space $(\Omega, \mathcal{F})$ such that $\nu$ is dominated by $\mu$ and $\mu$ is σ-finite, then there exists a nonnegative measurable function $f$ such that $\nu(A) = \int f I_A\,d\mu$ for all $A$ in $\mathcal{F}$. This $f$ is called a Radon-Nikodym derivative (or a density) of $\nu$ with respect to $\mu$ and is denoted by $\frac{d\nu}{d\mu}$.

Theorem 2.3.7: (Fatou's lemma). Let $\{f_n\}_{n\ge 1}$ be a sequence of nonnegative measurable functions on $(\Omega, \mathcal{F}, \mu)$. Then

$$\liminf_{n\to\infty} \int f_n\,d\mu \ge \int \liminf_{n\to\infty} f_n\,d\mu. \tag{3.9}$$

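Proof: Let $g_n(\omega) \equiv \inf\{f_j(\omega) : j \ge n\}$. Then $\{g_n\}_{n\ge 1}$ is a nondecreasing sequence of nonnegative measurable functions on $(\Omega, \mathcal{F}, \mu)$ such that $g_n(\omega) \uparrow g(\omega) \equiv \liminf_{n\to\infty} f_n(\omega)$. By the MCT,

$$\int g_n\,d\mu \uparrow \int g\,d\mu.$$

But by monotonicity

$$\int f_n\,d\mu \ge \int g_n\,d\mu \quad \text{for each } n \ge 1,$$

and hence, (3.9) follows. □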
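The inequality (3.9) need not be an equality. A minimal sketch, assuming the sequence $f_n = nI_{[0,1/n]}$ on $[0, 1]$ with Lebesgue measure and using exact rational arithmetic: every $f_n$ has integral 1, while the tail infimum $g_n$ from the proof vanishes at each fixed $x > 0$. The function names and the finite tail range standing in for the infimum over all $j \ge n$ are illustrative choices.

```python
from fractions import Fraction

def f(n, x):
    """f_n(x) = n on [0, 1/n] and 0 elsewhere."""
    return n if 0 <= x <= Fraction(1, n) else 0

def integral(n):
    """Lebesgue integral of f_n over [0, 1]: height n times length 1/n."""
    return n * Fraction(1, n)

def tail_inf(x, start, stop):
    """inf{f_j(x) : start <= j < stop}, a finite stand-in for g_start(x)."""
    return min(f(j, x) for j in range(start, stop))

points = [Fraction(k, 100) for k in range(1, 101)]       # x = 0.01, ..., 1
integrals = [integral(n) for n in range(1, 50)]          # each equals 1
tail_values = [tail_inf(x, 200, 400) for x in points]    # each equals 0
```

So the left side of (3.9) is 1 while the right side is 0 for this sequence.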

Remark 2.3.6: In (3.9), the inequality can be strict. For example, take $f_n = I_{[n,\infty)}$, $n \ge 1$, on the measure space $(\mathbb{R}, \mathcal{B}(\mathbb{R}), m)$ where $m$ is the Lebesgue measure. For another example, consider $f_n = nI_{[0,\frac{1}{n}]}$, $n \ge 1$, on the finite measure space $([0, 1], \mathcal{B}([0, 1]), m)$.

Definition 2.3.4: (The integral of a measurable function). Let f be a real valued measurable function on a measure space (Ω, F, µ). Let f + = f I{f ≥0} and f − = −f I{f K}) = 0

for some

K ∈ (0, ∞) .

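The following is an extension of Proposition 2.3.1 to integrable functions.

Proposition 2.3.8: Let $f, g \in L^1(\Omega, \mathcal{F}, \mu)$. Then
(i) $\int (\alpha f + \beta g)\,d\mu = \alpha \int f\,d\mu + \beta \int g\,d\mu$ for any $\alpha, \beta \in \mathbb{R}$.
(ii) $f \ge g$ a.e. ($\mu$) $\Rightarrow \int f\,d\mu \ge \int g\,d\mu$.
(iii) $f = g$ a.e. ($\mu$) $\Rightarrow \int f\,d\mu = \int g\,d\mu$.

Proof: It is easy to verify (Problem 2.32) that if $h = h_1 - h_2$ where $h_1$ and $h_2$ are nonnegative functions in $L^1(\Omega, \mathcal{F}, \mu)$, then $h$ is also in $L^1(\Omega, \mathcal{F}, \mu)$ and

$$\int h\,d\mu = \int h_1\,d\mu - \int h_2\,d\mu. \tag{3.11}$$

Note that $h \equiv \alpha f + \beta g$ can be written as

$$\begin{aligned} h &= (\alpha^+ - \alpha^-)(f^+ - f^-) + (\beta^+ - \beta^-)(g^+ - g^-) \\ &= (\alpha^+ f^+ + \alpha^- f^- + \beta^+ g^+ + \beta^- g^-) - (\alpha^+ f^- + \alpha^- f^+ + \beta^+ g^- + \beta^- g^+) \\ &= h_1 - h_2, \quad \text{say.} \end{aligned}$$

Since $f, g \in L^1(\Omega, \mathcal{F}, \mu)$, it follows that $h_1$ and $h_2 \in L^1(\Omega, \mathcal{F}, \mu)$. Further, they are nonnegative and by (3.11), $h \in L^1(\Omega, \mathcal{F}, \mu)$ and $\int h\,d\mu = \int h_1\,d\mu - \int h_2\,d\mu$. Now apply Proposition 2.3.1 to each of the terms on the right side and regroup the terms to get $\int h\,d\mu = \alpha \int f\,d\mu + \beta \int g\,d\mu$. Proofs of (ii) and (iii) are left as an exercise. □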

Remark 2.3.8: By Proposition 2.3.8, if $f$ and $g \in L^1(\Omega, \mathcal{F}, \mu)$, then so does $\alpha f + \beta g$ for all $\alpha, \beta \in \mathbb{R}$. Thus, $L^1(\Omega, \mathcal{F}, \mu)$ is a vector space over $\mathbb{R}$. Further, if one sets

$$\|f\|_1 \equiv \int |f|\,d\mu,$$

and identifies a function $f$ with its equivalence class under the relation $f \sim g$ iff $f = g$ a.e. ($\mu$), then $\|\cdot\|_1$ defines a norm on $L^1(\Omega, \mathcal{F}, \mu)$ and makes it a normed linear space (cf. Chapter 3). A similar remark also holds for $L^p(\Omega, \mathcal{F}, \mu)$ for $1 < p \le \infty$.

Next note that by Proposition 2.3.8, if $f = 0$ a.e. ($\mu$), then $\int f\,d\mu = 0$. However, the converse is not true. But if $f$ is nonnegative a.e. ($\mu$), then the converse is true as shown below.

Proposition 2.3.9: Let $f$ be a measurable function on $(\Omega, \mathcal{F}, \mu)$ and let $f$ be nonnegative a.e. ($\mu$). Then $\int f\,d\mu = 0$ iff $f = 0$ a.e. ($\mu$).

Proof: It is enough to prove the "only if" part. Let $D = \{\omega : f(\omega) > 0\}$ and $D_n = \{\omega : f(\omega) > \frac{1}{n}\}$, $n \ge 1$. Then $D = \bigcup_{n\ge 1} D_n$. Since $f \ge f I_{D_n}$ a.e. ($\mu$),

$$0 = \int f\,d\mu \ge \int f I_{D_n}\,d\mu \ge \frac{1}{n}\mu(D_n) \ \Rightarrow\ \mu(D_n) = 0 \quad \text{for each } n \ge 1.$$

Also $D_n \uparrow D$ and so by m.c.f.b.,

$$\mu(D) = \lim_{n\to\infty} \mu(D_n) = 0.$$

Hence, Proposition 2.3.9 follows. □

A dual to the above proposition is the next one.

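Proposition 2.3.10: Let $f \in L^1(\Omega, \mathcal{F}, \mu)$. Then, $|f| < \infty$ a.e. ($\mu$).

Proof: Let $C_n = \{\omega : |f(\omega)| > n\}$, $n \ge 1$ and let $C = \{\omega : |f(\omega)| = \infty\}$. Then $C_n \downarrow C$ and

$$\int |f|\,d\mu \ge \int |f| I_{C_n}\,d\mu \ge n\mu(C_n) \ \Rightarrow\ \mu(C_n) \le \frac{\int |f|\,d\mu}{n}.$$

Since $\int |f|\,d\mu < \infty$, $\lim_{n\to\infty} \mu(C_n) = 0$. Hence, by m.c.f.a., $\mu(C) = \lim_{n\to\infty} \mu(C_n) = 0$. □

The next result is a useful convergence theorem for integrals.

Theorem 2.3.11: (The extended dominated convergence theorem or EDCT). Let $(\Omega, \mathcal{F}, \mu)$ be a measure space and let $f_n, g_n : \Omega \to \mathbb{R}$ be $\langle \mathcal{F}, \mathcal{B}(\mathbb{R}) \rangle$-measurable functions such that $|f_n| \le g_n$ a.e. ($\mu$) for all $n \ge 1$. Suppose that
(i) $g_n \to g$ a.e. ($\mu$) and $f_n \to f$ a.e. ($\mu$);
(ii) $g_n, g \in L^1(\Omega, \mathcal{F}, \mu)$ and $\int |g_n|\,d\mu \to \int |g|\,d\mu$ as $n \to \infty$.
Then, $f \in L^1(\Omega, \mathcal{F}, \mu)$,

$$\lim_{n\to\infty} \int f_n\,d\mu = \int f\,d\mu \quad \text{and} \quad \lim_{n\to\infty} \int |f_n - f|\,d\mu = 0. \tag{3.12}$$

Two important special cases of Theorem 2.3.11 will be stated next. When $g_n = g$ for all $n \ge 1$, one has the standard version of the dominated convergence theorem.

Corollary 2.3.12: (Lebesgue's dominated convergence theorem, or DCT). If $|f_n| \le g$ a.e. ($\mu$) for all $n \ge 1$, $\int g\,d\mu < \infty$ and $f_n \to f$ a.e. ($\mu$), then $f \in L^1(\Omega, \mathcal{F}, \mu)$,

$$\lim_{n\to\infty} \int f_n\,d\mu = \int f\,d\mu \quad \text{and} \quad \lim_{n\to\infty} \int |f_n - f|\,d\mu = 0. \tag{3.13}$$

Corollary 2.3.13: (The bounded convergence theorem, or BCT). Let $\mu(\Omega) < \infty$. If there exists a $0 < k < \infty$ such that $|f_n| \le k$ a.e. ($\mu$) and $f_n \to f$ a.e. ($\mu$), then

$$\lim_{n\to\infty} \int f_n\,d\mu = \int f\,d\mu \quad \text{and} \quad \lim_{n\to\infty} \int |f_n - f|\,d\mu = 0. \tag{3.14}$$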

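Proof: Take $g(\omega) \equiv k$ for all $\omega \in \Omega$ in the previous corollary. □

Proof of Theorem 2.3.11: By Fatou's lemma,

$$\int |f|\,d\mu \le \liminf_{n\to\infty} \int |f_n|\,d\mu \le \liminf_{n\to\infty} \int |g_n|\,d\mu = \int |g|\,d\mu < \infty.$$

Hence, $f$ is integrable. For proving the second part, let $h_n = f_n + g_n$ and $\gamma_n = g_n - f_n$, $n \ge 1$. Then, $\{h_n\}_{n\ge 1}$ and $\{\gamma_n\}_{n\ge 1}$ are sequences of nonnegative integrable functions. By Fatou's lemma and (ii),

$$\begin{aligned} \int (f + g)\,d\mu &= \int \liminf_{n\to\infty} h_n\,d\mu \\ &\le \liminf_{n\to\infty} \int h_n\,d\mu \\ &= \liminf_{n\to\infty} \Big[\int g_n\,d\mu + \int f_n\,d\mu\Big] \\ &= \int g\,d\mu + \liminf_{n\to\infty} \int f_n\,d\mu. \end{aligned}$$

Similarly,

$$\int (g - f)\,d\mu \le \int g\,d\mu - \limsup_{n\to\infty} \int f_n\,d\mu.$$

By Proposition 2.3.8, $\int (g \pm f)\,d\mu = \int g\,d\mu \pm \int f\,d\mu$. Hence,

$$\int f\,d\mu \le \liminf_{n\to\infty} \int f_n\,d\mu \quad \text{and} \quad \limsup_{n\to\infty} \int f_n\,d\mu \le \int f\,d\mu,$$

yielding that $\lim_{n\to\infty} \int f_n\,d\mu = \int f\,d\mu$. For the last part, apply the above argument to $f_n$ and $g_n$ replaced by $\tilde{f}_n \equiv |f - f_n|$ and $\tilde{g}_n \equiv g_n + |f|$, respectively. □

Theorem 2.3.14: (An approximation theorem). Let $\mu_F$ be a Lebesgue-Stieltjes measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Let $f \in L^p(\mathbb{R}, \mathcal{B}(\mathbb{R}), \mu_F)$, $0 < p < \infty$. Then, for any $\delta > 0$, there exist a step function $h$ and a continuous function $g$ with compact support (i.e., $g$ vanishes outside a bounded interval) such that

$$\int |f - h|^p\,d\mu_F < \delta, \tag{3.15}$$

$$\int |f - g|^p\,d\mu_F < \delta, \tag{3.16}$$

where a step function $h$ is a function of the form $h = \sum_{i=1}^{k} c_i I_{A_i}$ with $k < \infty$, $c_1, c_2, \ldots, c_k \in \mathbb{R}$ and $A_1, A_2, \ldots, A_k$ being bounded disjoint intervals.

Proof: Let $f_n(\cdot) = f(\cdot)I_{B_n}(\cdot)$ where $B_n = \{x : |x| \le n,\ |f(x)| \le n\}$. By the DCT, for every $\epsilon > 0$, there exists an $N_\epsilon$ such that for all $n \ge N_\epsilon$,

$$\int |f - f_n|^p\,d\mu_F < \epsilon. \tag{3.17}$$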

Since $|f_{N_\epsilon}(\cdot)| \le N_\epsilon$ on $[-N_\epsilon, N_\epsilon]$, for any $\eta > 0$, there exists a simple function $\tilde{f}$ such that

$$\sup\{|f_{N_\epsilon}(x) - \tilde{f}(x)| : x \in \mathbb{R}\} < \eta \quad \text{(cf. (3.7))}. \tag{3.18}$$

Next, using Problem 1.32 (b), one can show that for any $\eta > 0$, there exists a step function $h$ (depending on $\eta$) such that

$$\int |\tilde{f} - h|^p\,d\mu_F < \eta. \tag{3.19}$$

Since $f - h = f - f_{N_\epsilon} + f_{N_\epsilon} - \tilde{f} + \tilde{f} - h$,

$$|f - h|^p \le C_p\big(|f - f_{N_\epsilon}|^p + |f_{N_\epsilon} - \tilde{f}|^p + |\tilde{f} - h|^p\big),$$

where $C_p$ is a constant depending only on $p$. This in turn yields, from (3.17)–(3.19),

$$\int |f - h|^p\,d\mu_F \le C_p\big(\epsilon + \mu_F(\{x : |x| \le N_\epsilon\})\,\eta^p + \eta\big). \tag{3.20}$$

Given $\delta > 0$, choose $\epsilon > 0$ first and then $\eta > 0$ such that the right side of (3.20) above is less than $\delta$. Next, given any step function $h$ and $\eta > 0$ there exists a continuous function $g$ with compact support such that $\mu_F\{x : h(x) \ne g(x)\} < \eta$ (cf. Problem 1.32 (g)). Now (3.16) follows from (3.15). □

Remark 2.3.9: The approximation (3.16) remains valid if $g$ is restricted to the class of all infinitely differentiable functions with compact support. Further it remains valid for $0 < p < \infty$ for Lebesgue-Stieltjes measures on any Euclidean space.

Remark 2.3.10: The above approximation theorem fails for $p = \infty$. For example, consider the function $f(x) \equiv 1$ in $L^\infty(m)$.

2.4 Riemann and Lebesgue integrals

Let $f$ be a real valued bounded function on a bounded interval $[a, b]$. Recall the definition of the Riemann integral. Let $P = \{x_0, x_1, \ldots, x_n\}$ be a finite partition of $[a, b]$, i.e., $x_0 = a < x_1 < x_2 < \cdots < x_{n-1} < x_n = b$ and $\Delta = \Delta(P) \equiv \max\{(x_{i+1} - x_i) : 0 \le i \le n - 1\}$ be the diameter of $P$. Let $M_i = \sup\{f(x) : x_i \le x \le x_{i+1}\}$ and $m_i = \inf\{f(x) : x_i \le x \le x_{i+1}\}$, $i = 0, 1, \ldots, n - 1$.

Definition 2.4.1: The upper- and lower-Riemann sums of $f$ w.r.t. the partition $P$ are, respectively, defined as

$$U(f, P) \equiv \sum_{i=0}^{n-1} M_i \cdot (x_{i+1} - x_i) \tag{4.1}$$

and

$$L(f, P) \equiv \sum_{i=0}^{n-1} m_i \cdot (x_{i+1} - x_i). \tag{4.2}$$

It is easy to verify that if $Q = \{y_0, y_1, \ldots, y_k\}$ is another partition satisfying $P \subset Q$, then $U(f, P) \ge U(f, Q) \ge L(f, Q) \ge L(f, P)$. Let $\mathcal{P}$ denote the collection of all finite partitions of $[a, b]$.

Definition 2.4.2: The upper-Riemann integral $\overline{\int} f$ is defined as

$$\overline{\int} f = \inf_{P \in \mathcal{P}} U(f, P) \tag{4.3}$$

and the lower-Riemann integral $\underline{\int} f$, by

$$\underline{\int} f = \sup_{P \in \mathcal{P}} L(f, P). \tag{4.4}$$

It can be shown (cf. Problem 2.23) that if $\{P_n\}_{n\ge 1}$ is any sequence of partitions such that $\Delta(P_n) \to 0$ as $n \to \infty$ and $P_n \subset P_{n+1}$ for each $n \ge 1$, then $U(f, P_n) \downarrow \overline{\int} f$ and $L(f, P_n) \uparrow \underline{\int} f$.

Definition 2.4.3: $f$ is said to be Riemann integrable if

$$\overline{\int} f = \underline{\int} f. \tag{4.5}$$

The common value is denoted by $\int_{[a,b]} f$.

Fix a sequence $\{P_n\}_{n\ge 1}$ of partitions such that $P_n \subset P_{n+1}$ for all $n \ge 1$ and $\Delta(P_n) \to 0$ as $n \to \infty$. Let $P_n = \{x_{n0} = a < x_{n1} < x_{n2} < \cdots < x_{nk_n} = b\}$. For $i = 0, 1, \ldots, k_n - 1$, let

$$\begin{aligned} \phi_n(x) &\equiv \sup\{f(t) : x_{ni} \le t \le x_{n,i+1}\}, \quad x \in [x_{ni}, x_{n,i+1}), \\ \psi_n(x) &\equiv \inf\{f(t) : x_{ni} \le t \le x_{n,i+1}\}, \quad x \in [x_{ni}, x_{n,i+1}), \end{aligned}$$

and let $\phi_n(b) = \psi_n(b) = 0$. Then, $\phi_n$ and $\psi_n$ are step functions on $[a, b]$ and hence, are Borel measurable. Further, since $f$ is bounded, so are $\phi_n$ and $\psi_n$ and hence are integrable on $[a, b]$ w.r.t. the Lebesgue measure $m$. The Lebesgue integrals of $\phi_n$ and $\psi_n$ are given by $\int_{[a,b]} \phi_n\,dm = U(f, P_n)$ and $\int_{[a,b]} \psi_n\,dm = L(f, P_n)$. It can be shown (Problem 2.24) that for all $x \notin \bigcup_{n\ge 1} P_n$, as $n \to \infty$,

$$\phi_n(x) \to \phi(x) \equiv \lim_{\delta \downarrow 0} \sup\{f(y) : |y - x| < \delta\} \tag{4.6}$$

and

$$\psi_n(x) \to \psi(x) \equiv \lim_{\delta \downarrow 0} \inf\{f(y) : |y - x| < \delta\}. \tag{4.7}$$

Thus, $\phi$ and $\psi$, being limits of Borel measurable functions (except possibly on a countable set), are Borel measurable. By the BCT (Corollary 2.3.13),

$$\overline{\int} f = \lim_{n\to\infty} \int \phi_n\,dm = \int \phi\,dm$$

and

$$\underline{\int} f = \lim_{n\to\infty} \int \psi_n\,dm = \int \psi\,dm.$$

Thus, $f$ is Riemann integrable on $[a, b]$, iff $\int \phi\,dm = \int \psi\,dm$, iff $\int (\phi - \psi)\,dm = 0$. Since $\phi(x) \ge f(x) \ge \psi(x)$ for all $x$, this holds iff $\phi = f = \psi$ a.e. ($m$). It can be shown that $f$ is continuous at $x_0$ iff $\phi(x_0) = f(x_0) = \psi(x_0)$ (Problem 2.8). Summarizing the above discussion, one gets the following theorem.

Theorem 2.4.1: Let $f$ be a bounded function on a bounded interval $[a, b]$. Then $f$ is Riemann integrable on $[a, b]$ iff $f$ is continuous a.e. ($m$) on $[a, b]$. In this case, $f$ is Lebesgue integrable on $[a, b]$ and the Lebesgue integral $\int_{[a,b]} f\,dm$ equals the Riemann integral $\int_{[a,b]} f$, i.e., the two integrals coincide.

It should be noted that Lebesgue integrability need not imply Riemann integrability. For example, consider $f(x) \equiv I_{Q_1}(x)$ where $Q_1$ is the set of rationals in $[0, 1]$ (Problem 2.25).

The functions $\phi$ and $\psi$ defined in (4.6) and (4.7) above are called, respectively, the upper and the lower envelopes of the function $f$. They are semicontinuous in the sense that for each $\alpha \in \mathbb{R}$, the sets $\{x : \phi(x) < \alpha\}$ and $\{x : \psi(x) > \alpha\}$ are open (cf. Problem 2.8).

Remark 2.4.1: The key difference in the definitions of Riemann and Lebesgue integrals is that in the former the domain of $f$ is partitioned while in the latter the range of $f$ is partitioned.
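The behavior of the upper and lower sums (4.1) and (4.2) under nested refinement can be sketched with exact arithmetic. This is a minimal illustration, assuming an increasing integrand (so that the sup and inf on each cell sit at the right and left endpoints) and dyadic partitions of $[0, 1]$; the choice $f(x) = x^2$ and the function names are hypothetical.

```python
from fractions import Fraction

def riemann_sums(f, a, b, level):
    """Upper and lower Riemann sums of an increasing function f over the
    partition of [a, b] into 2**level equal cells."""
    cells = 2 ** level
    width = (b - a) / cells
    upper = lower = Fraction(0)
    for i in range(cells):
        left = a + i * width
        right = left + width
        lower += f(left) * width     # inf over the cell (f increasing)
        upper += f(right) * width    # sup over the cell (f increasing)
    return upper, lower

def square(x):
    return x * x

# Partitions are nested because each level halves every cell of the previous one.
sums = [riemann_sums(square, Fraction(0), Fraction(1), n) for n in range(1, 9)]
```

For this integrand the gap $U(f, P_n) - L(f, P_n)$ telescopes to $(f(1) - f(0)) \cdot 2^{-n}$, so both sums squeeze the common value $1/3$.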

2.5 More on convergence

Let $\{f_n\}_{n\ge 1}$ and $f$ be measurable functions from a measure space $(\Omega, \mathcal{F}, \mu)$ to $\bar{\mathbb{R}}$, the set of extended real numbers. There are several notions of convergence of $\{f_n\}_{n\ge 1}$ to $f$. The following two have been discussed earlier.

Definition 2.5.1: $\{f_n\}_{n\ge 1}$ converges to $f$ pointwise if

$$\lim_{n\to\infty} f_n(\omega) = f(\omega) \quad \text{for all } \omega \text{ in } \Omega.$$

Definition 2.5.2: $\{f_n\}_{n\ge 1}$ converges to $f$ almost everywhere ($\mu$), denoted by $f_n \to f$ a.e. ($\mu$), if there exists a set $B$ in $\mathcal{F}$ such that $\mu(B) = 0$ and

$$\lim_{n\to\infty} f_n(\omega) = f(\omega) \quad \text{for all } \omega \in B^c. \tag{5.1}$$

Now consider some more notions of convergence.

Definition 2.5.3: $\{f_n\}_{n\ge 1}$ converges to $f$ in measure (w.r.t. $\mu$), denoted by $f_n \longrightarrow^m f$, if for each $\epsilon > 0$,

$$\lim_{n\to\infty} \mu\big(\{|f_n - f| > \epsilon\}\big) = 0. \tag{5.2}$$

to f in Lp (µ), Deﬁnition 2.5.4: Let 0 K}) = 0 .

(5.6)

(5.7)

Definition 2.5.7: $\{f_n\}_{n\ge 1}$ converges to $f$ nearly uniformly ($\mu$) if for every $\epsilon > 0$, there exists a set $A \in \mathcal{F}$ such that $\mu(A) < \epsilon$ and on $A^c$, $f_n \to f$ uniformly, i.e., $\sup\{|f_n(\omega) - f(\omega)| : \omega \in A^c\} \to 0$ as $n \to \infty$.

The notion of convergence in Definition 2.5.7 is also called almost uniform convergence in some books (cf. Royden (1988)). The sequence $f_n \equiv nI_{[0,1/n]}$ on $(\Omega = [0, 1], \mathcal{B}([0, 1]), m)$ converges to $f \equiv 0$ nearly uniformly but not in $L^\infty(m)$.

When $\mu$ is a probability measure, there is another useful notion of convergence, known as convergence in distribution, that is defined in terms of the induced measures $\{\mu f_n^{-1}\}_{n\ge 1}$ and $\mu f^{-1}$. This notion of convergence will be treated in detail in Chapter 9.

Next, the connections between some of these notions of convergence are explored.

Theorem 2.5.1: Suppose that $\mu(\Omega) < \infty$. Then, $f_n \to f$ a.e. ($\mu$) implies $f_n \longrightarrow^m f$.

The proof is left as an exercise (Problem 2.26). The hypothesis that '$\mu(\Omega) < \infty$' in Theorem 2.5.1 cannot be dispensed with as seen by taking $f_n = I_{[n,\infty)}$ on $\mathbb{R}$ with Lebesgue measure. Also, $f_n \longrightarrow^m f$ does not imply $f_n \to f$ a.e. ($\mu$) (Problem 2.46), but the following holds.

Theorem 2.5.2: Let $f_n \longrightarrow^m f$. Then, there exists a subsequence $\{n_k\}_{k\ge 1}$ such that $f_{n_k} \to f$ a.e. ($\mu$).

Proof: Since $f_n \longrightarrow^m f$, for each integer $k \ge 1$, there exists an $n_k$ such that for all $n \ge n_k$

$$\mu\big(\{|f_n - f| > 2^{-k}\}\big) < 2^{-k}. \tag{5.8}$$

W.l.o.g., assume that $n_{k+1} > n_k$ for all $k \ge 1$. Let $A_k = \{|f_{n_k} - f| > 2^{-k}\}$. By Corollary 2.3.5,

$$\int \Big(\sum_{k=1}^{\infty} I_{A_k}\Big)\,d\mu = \sum_{k=1}^{\infty} \int I_{A_k}\,d\mu = \sum_{k=1}^{\infty} \mu(A_k),$$

which, by (5.8), is finite. Hence, by Proposition 2.3.10, $\sum_{k=1}^{\infty} I_{A_k} < \infty$ a.e. ($\mu$). Now observe that $\sum_{k=1}^{\infty} I_{A_k}(\omega) < \infty \Rightarrow |f_{n_k}(\omega) - f(\omega)| \le 2^{-k}$ for all $k$ large $\Rightarrow \lim_{k\to\infty} f_{n_k}(\omega) = f(\omega)$. Thus, $f_{n_k} \to f$ a.e. ($\mu$). □

Remark 2.5.1: From the above result it follows that the extended dominated convergence theorem (Theorem 2.3.11) remains valid if convergence a.e. of $\{f_n\}_{n\ge 1}$ and of $\{g_n\}_{n\ge 1}$ are replaced by convergence in measure for both (Problem 2.37).

Theorem 2.5.3: Let $\{f_n\}_{n\ge 1}$, $f$ be measurable functions on a measure space $(\Omega, \mathcal{F}, \mu)$. Let $f_n \longrightarrow^{L^p} f$ for some $0 < p < \infty$. Then $f_n \longrightarrow^m f$.

Proof: For each $\epsilon > 0$, let $A_n = \{|f_n - f| \ge \epsilon\}$, $n \ge 1$. Then

$$\int |f_n - f|^p\,d\mu \ge \int_{A_n} |f_n - f|^p\,d\mu \ge \epsilon^p \mu(A_n).$$

Since $f_n \to f$ in $L^p$, $\int |f_n - f|^p\,d\mu \to 0$ and hence, $\mu(A_n) \to 0$. □
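The bound $\epsilon^p \mu(A_n) \le \int |f_n - f|^p\,d\mu$ used in the proof can be checked in closed form on a simple case. The sketch below assumes $f_n(x) = x^n$ and $f \equiv 0$ on $[0, 1]$ with Lebesgue measure, for which $\int |f_n|^p\,dm = 1/(np + 1)$ and $m\{x : x^n \ge \epsilon\} = 1 - \epsilon^{1/n}$; the function names are illustrative.

```python
def lp_distance_to_power_p(n, p):
    """Integral of |x**n|**p over [0, 1] w.r.t. Lebesgue measure: 1 / (n*p + 1)."""
    return 1.0 / (n * p + 1)

def measure_of_exceedance(n, eps):
    """m{x in [0, 1] : x**n >= eps} = 1 - eps**(1/n), valid for 0 < eps <= 1."""
    return 1.0 - eps ** (1.0 / n)

def bound_holds(n, p, eps):
    """Check eps**p * m(A_n) <= integral of |f_n - f|**p for this example."""
    return eps ** p * measure_of_exceedance(n, eps) <= lp_distance_to_power_p(n, p)

checks = [bound_holds(n, p, eps)
          for n in (1, 5, 50, 500)
          for p in (0.5, 1.0, 2.0)
          for eps in (0.1, 0.5, 0.9)]
```

Both sides tend to zero as $n$ grows, which is the content of the theorem in this example.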

It turns out that $f_n \longrightarrow^m f$ need not imply $f_n \longrightarrow^{L^p} f$, even if $\{f_n : n \ge 1\} \cup \{f\}$ is contained in $L^p(\Omega, \mathcal{F}, \mu)$. For example, let $f_n = nI_{[0,\frac{1}{n}]}$ and $f \equiv 0$ on the Lebesgue space $([0, 1], \mathcal{B}([0, 1]), m)$, where $m$ is the Lebesgue measure. Then $f_n \longrightarrow^m f$ but $\int |f_n - f|\,dm = 1$ for all $n \ge 1$. However, under some additional conditions, convergence in measure does imply convergence in $L^p$. Here are two results in this direction.

Theorem 2.5.4: (Scheffe's theorem). Let $\{f_n\}_{n\ge 1}$, $f$ be a collection of nonnegative measurable functions on a measure space $(\Omega, \mathcal{F}, \mu)$. Let $f_n \to f$ a.e. ($\mu$), $\int f_n\,d\mu \to \int f\,d\mu$ and $\int f\,d\mu < \infty$. Then

$$\lim_{n\to\infty} \int |f_n - f|\,d\mu = 0.$$

Proof: Let $g_n = f - f_n$, $n \ge 1$. Since $f_n \to f$ a.e. ($\mu$), both $g_n^+$ and $g_n^-$ go to zero a.e. ($\mu$). Further, $0 \le g_n^+ \le f$ and by hypothesis $\int f\,d\mu < \infty$. Thus, by the DCT, it follows that

$$\int g_n^+\,d\mu \to 0.$$

Next, note that by hypothesis, $\int g_n\,d\mu \to 0$. Thus, $\int g_n^-\,d\mu = \int g_n^+\,d\mu - \int g_n\,d\mu \to 0$ and hence, $\int |g_n|\,d\mu = \int g_n^+\,d\mu + \int g_n^-\,d\mu \to 0$. □

Corollary 2.5.5: Let $\{f_n\}_{n\ge 1}$, $f$ be probability density functions on a measure space $(\Omega, \mathcal{F}, \mu)$. That is, for all $n \ge 1$, $\int f_n\,d\mu = \int f\,d\mu = 1$ and $f_n, f \ge 0$ a.e. ($\mu$). If $f_n \to f$ a.e. ($\mu$), then

$$\lim_{n\to\infty} \int |f_n - f|\,d\mu = 0.$$
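Corollary 2.5.5 can be illustrated with densities relative to the counting measure on $\{0, 1, 2, \ldots\}$. The sketch below assumes the standard example of Binomial($n$, $\lambda/n$) probabilities converging pointwise to Poisson($\lambda$) probabilities; the truncation point `k_max`, the value $\lambda = 2$, and the function names are choices made for this illustration.

```python
import math

def binomial_pmf(n, p, k):
    """P(Bin(n, p) = k), computed on the log scale to avoid overflow."""
    if k > n:
        return 0.0
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)

def poisson_pmf(lam, k):
    """P(Poisson(lam) = k), computed on the log scale."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def l1_distance(n, lam, k_max=200):
    """Sum over k <= k_max of |binomial pmf - Poisson pmf|; the mass beyond
    k_max is negligible for the parameters used below (an assumption)."""
    p = lam / n
    return sum(abs(binomial_pmf(n, p, k) - poisson_pmf(lam, k))
               for k in range(k_max + 1))

# Pointwise convergence of the pmfs forces the L1 distance to shrink.
distances = {n: l1_distance(n, 2.0) for n in (10, 100, 1000)}
```

Here the integral $\int |f_n - f|\,d\mu$ is simply the sum of absolute differences of the two probability mass functions.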

Remark 2.5.2: The above theorem and the corollary remain valid if the convergence of $f_n$ to $f$ a.e. ($\mu$) is replaced by $f_n \longrightarrow^m f$.

Corollary 2.5.6: Let $\{p_{nk}\}_{k\ge 1}$, $n = 1, 2, \ldots$ and $\{p_k\}_{k\ge 1}$ be sequences of nonnegative numbers satisfying $\sum_{k=1}^{\infty} p_{nk} = 1 = \sum_{k=1}^{\infty} p_k$. Let $p_{nk} \to p_k$ as $n \to \infty$ for each $k \ge 1$. Then $\sum_{k=1}^{\infty} |p_{nk} - p_k| \to 0$.

Proof: Apply Corollary 2.5.5 with $\mu$ = the counting measure on $(\mathbb{N}, \mathcal{P}(\mathbb{N}))$. □

A more general result in this direction that does not require $f_n$, $f$ to be nonnegative involves the concept of uniform integrability. Let $\{f_\lambda : \lambda \in \Lambda\}$ be a collection of functions in $L^1(\Omega, \mathcal{F}, \mu)$. Then for each $\lambda \in \Lambda$, by the DCT and the integrability of $f_\lambda$,

$$a_\lambda(t) \equiv \int_{\{|f_\lambda| > t\}} |f_\lambda|\,d\mu \to 0 \quad \text{as } t \to \infty. \tag{5.9}$$

The notion of uniform integrability requires that the integrals aλ (t) go to zero uniformly in λ ∈ Λ as t → ∞.

Definition 2.5.8: The collection of functions $\{f_\lambda : \lambda \in \Lambda\}$ in $L^1(\Omega, \mathcal{F}, \mu)$ is uniformly integrable (or UI, in short) if

$$\sup_{\lambda \in \Lambda} a_\lambda(t) \to 0 \quad \text{as } t \to \infty. \tag{5.10}$$
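The difference between (5.9) holding for each member and (5.10) holding uniformly can be seen on two spike families on $([0, 1], \mathcal{B}([0, 1]), m)$. This is a sketch under stated assumptions: the families $f_n = nI_{[0,1/n]}$ and $f_n = \sqrt{n}\,I_{[0,1/n]}$ are chosen for illustration, their tail integrals are written in closed form, and the supremum over all $n$ is replaced by a maximum over $n \le$ `n_max`, which is exact here only as long as `n_max` exceeds $t^2$.

```python
import math

def a_spike(n, t):
    """a_n(t) for f_n = n * 1_[0, 1/n]: the whole mass 1 lies above level t once n > t."""
    return 1.0 if n > t else 0.0

def a_sqrt_spike(n, t):
    """a_n(t) for f_n = sqrt(n) * 1_[0, 1/n]: mass n**-0.5 lies above t once sqrt(n) > t."""
    return n ** -0.5 if math.sqrt(n) > t else 0.0

def sup_over_family(a, t, n_max=20000):
    """max over 1 <= n <= n_max of a(n, t), a finite stand-in for the sup in (5.10)."""
    return max(a(n, t) for n in range(1, n_max + 1))

levels = [1, 2, 5, 10, 50, 100]
not_ui = [sup_over_family(a_spike, t) for t in levels]        # stays at 1
is_ui = [sup_over_family(a_sqrt_spike, t) for t in levels]    # falls below 1/t
```

Each individual $a_n(t)$ vanishes for large $t$ in both families; only the second family has the uniform decay required by Definition 2.5.8.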

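The following proposition summarizes some of the main properties of UI families of functions.

Proposition 2.5.7: Let $\{f_\lambda : \lambda \in \Lambda\}$ be a collection of $\mu$-integrable functions on $(\Omega, \mathcal{F}, \mu)$.
(i) If $\Lambda$ is finite, then $\{f_\lambda : \lambda \in \Lambda\}$ is UI.
(ii) If $K \equiv \sup\{\int |f_\lambda|^{1+\epsilon}\,d\mu : \lambda \in \Lambda\} < \infty$ for some $\epsilon > 0$, then $\{f_\lambda : \lambda \in \Lambda\}$ is UI.
(iii) If $|f_\lambda| \le g$ a.e. ($\mu$) and $\int g\,d\mu < \infty$, then $\{f_\lambda : \lambda \in \Lambda\}$ is UI.
(iv) If $\{f_\lambda : \lambda \in \Lambda\}$ and $\{g_\gamma : \gamma \in \Gamma\}$ are UI, then so is $\{f_\lambda + g_\gamma : \lambda \in \Lambda, \gamma \in \Gamma\}$.
(v) If $\{f_\lambda : \lambda \in \Lambda\}$ is UI and $\mu(\Omega) < \infty$, then

$$\sup_{\lambda \in \Lambda} \int |f_\lambda|\,d\mu < \infty. \tag{5.11}$$

Proof: By hypothesis, $a_\lambda(t) \equiv \int_{\{|f_\lambda| > t\}} |f_\lambda|\,d\mu \to 0$ as $t \to \infty$ for each $\lambda$. If $\Lambda$ is finite this implies that $\sup\{a_\lambda(t) : \lambda \in \Lambda\} \to 0$ as $t \to \infty$. This proves (i).

To prove (ii), note that since $1 < |f_\lambda|/t$ on the set $\{|f_\lambda| > t\}$,

$$\sup_{\lambda \in \Lambda} a_\lambda(t) \le \sup_{\lambda \in \Lambda} \int_{\{|f_\lambda| > t\}} |f_\lambda| \big(|f_\lambda|/t\big)^{\epsilon}\,d\mu \le K t^{-\epsilon} \to 0 \quad \text{as } t \to \infty.$$

For (iii), note that for each $t \in \mathbb{R}$, the function $h_t(x) \equiv xI_{(t,\infty)}(x)$, $x \in \mathbb{R}$ is nondecreasing. Hence, by the integrability of $g$,

$$\sup_{\lambda \in \Lambda} a_\lambda(t) = \sup_{\lambda \in \Lambda} \int h_t(|f_\lambda|)\,d\mu \le \int h_t(g)\,d\mu = \int_{\{g > t\}} g\,d\mu \to 0 \quad \text{as } t \to \infty.$$

To prove (iv), for $t > 0$, let $a(t) = \sup_{\lambda \in \Lambda} \int h_t(|f_\lambda|)\,d\mu$ and $b(t) = \sup_{\gamma \in \Gamma} \int h_t(|g_\gamma|)\,d\mu$. Then, for any $\lambda \in \Lambda$, and $\gamma \in \Gamma$,

$$\begin{aligned} \int_{\{|f_\lambda + g_\gamma| > t\}} |f_\lambda + g_\gamma|\,d\mu &= \int h_t\big(|f_\lambda + g_\gamma|\big)\,d\mu \\ &\le \int h_t\big(2\max\{|f_\lambda|, |g_\gamma|\}\big)\,d\mu \\ &= \int h_t\big(2|f_\lambda|\big) I\big(|f_\lambda| \ge |g_\gamma|\big)\,d\mu + \int h_t\big(2|g_\gamma|\big) I\big(|f_\lambda| < |g_\gamma|\big)\,d\mu \\ &\le 2\int_{\{|f_\lambda| > t/2\}} |f_\lambda|\,d\mu + 2\int_{\{|g_\gamma| > t/2\}} |g_\gamma|\,d\mu \\ &\le 2[a(t/2) + b(t/2)]. \end{aligned}$$

By hypothesis, both $a(t)$ and $b(t) \to 0$ as $t \to \infty$, thus proving (iv).

Next consider (v). Since $\{f_\lambda : \lambda \in \Lambda\}$ is UI, there exists a $T > 0$ such that

$$\sup_{\lambda \in \Lambda} \int h_T\big(|f_\lambda|\big)\,d\mu \le 1.$$

Hence,

$$\sup_{\lambda \in \Lambda} \int |f_\lambda|\,d\mu = \sup_{\lambda \in \Lambda} \Big[\int_{\{|f_\lambda| \le T\}} |f_\lambda|\,d\mu + \int h_T(|f_\lambda|)\,d\mu\Big] \le T\mu(\Omega) + 1 < \infty.$$

This completes the proof of the proposition. □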

Remark 2.5.3: In the above proposition, part (ii) can be improved as follows: Let $\phi : \mathbb{R}_+ \to \mathbb{R}_+$ be nondecreasing and $\frac{\phi(x)}{x} \uparrow \infty$ as $x \uparrow \infty$. If $\sup_{\lambda \in \Lambda} \int \phi(|f_\lambda|)\,d\mu < \infty$, then $\{f_\lambda : \lambda \in \Lambda\}$ is UI (Problem 2.27). A converse to this result is true. That is, if $\{f_\lambda : \lambda \in \Lambda\}$ is UI then there exists such a function $\phi$. Some examples of such $\phi$'s are $\phi(x) = x^k$, $k > 1$, $\phi(x) = x(\log x)^\beta I(x > 1)$, $\beta > 0$, and $\phi(x) = \exp(\beta x)$, $\beta > 0$.

In part (v) of Proposition 2.5.7, (5.11) does not imply UI. For example, consider the sequence of functions $f_n = nI_{[0,\frac{1}{n}]}$, $n = 1, 2, \ldots$ on $[0, 1]$. On the other hand, (5.11) with an additional condition becomes necessary and sufficient for UI.

Proposition 2.5.8: Let $f \in L^1(\Omega, \mathcal{F}, \mu)$. Then for every $\epsilon > 0$, there exists a $\delta > 0$ such that $\mu(A) < \delta \Rightarrow \int_A |f|\,d\mu < \epsilon$.

Proof: Fix $\epsilon > 0$. By the DCT, there exists a $t > 0$ such that $\int_{\{|f| > t\}} |f|\,d\mu < \epsilon/2$. Hence, for any $A \in \mathcal{F}$ with $\mu(A) \le \delta \equiv \frac{\epsilon}{2t}$,

$$\begin{aligned} \int_A |f|\,d\mu &\le \int_{A \cap \{|f| \le t\}} |f|\,d\mu + \int_{\{|f| > t\}} |f|\,d\mu \\ &\le t\mu(A) + \int_{\{|f| > t\}} |f|\,d\mu \\ &< \epsilon, \end{aligned}$$

proving the claim. □

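The above proposition shows that for every $f \in L^1(\Omega, \mathcal{F}, \mu)$, the measure (cf. Corollary 2.3.6)

$$\nu_{|f|}(A) \equiv \int_A |f|\,d\mu \tag{5.12}$$

on $(\Omega, \mathcal{F})$ satisfies the condition that $\nu_{|f|}(A)$ is small if $\mu(A)$ is small, i.e., for every $\epsilon > 0$, there exists a $\delta > 0$ such that $\mu(A) < \delta \Rightarrow \nu_{|f|}(A) < \epsilon$. This property is referred to as the absolute continuity of the measure $\nu_{|f|}$ w.r.t. $\mu$.

Definition 2.5.9: Given a family $\{f_\lambda : \lambda \in \Lambda\} \subset L^1(\Omega, \mathcal{F}, \mu)$, the measures $\{\nu_{|f_\lambda|} : \lambda \in \Lambda\}$ as defined in (5.12) above are uniformly absolutely continuous w.r.t. $\mu$ (or u.a.c. ($\mu$), in short) if for every $\epsilon > 0$, there exists a $\delta > 0$ such that

$$\mu(A) < \delta \ \Rightarrow\ \sup\{\nu_{|f_\lambda|}(A) : \lambda \in \Lambda\} < \epsilon.$$

Theorem 2.5.9: Let $\{f_\lambda : \lambda \in \Lambda\} \subset L^1(\Omega, \mathcal{F}, \mu)$ and $\mu(\Omega) < \infty$. Then, $\{f_\lambda : \lambda \in \Lambda\}$ is UI iff $\sup_{\lambda \in \Lambda} \int |f_\lambda|\,d\mu < \infty$ and $\{\nu_{|f_\lambda|}(\cdot) : \lambda \in \Lambda\}$ is u.a.c. ($\mu$).

Proof: Let $\{f_\lambda : \lambda \in \Lambda\}$ be UI. Then, since $\mu(\Omega) < \infty$, $L^1$ boundedness of $\{f_\lambda : \lambda \in \Lambda\}$ follows from Proposition 2.5.7 (v). To establish u.a.c. ($\mu$), fix $\epsilon > 0$. By UI, there exists an $N$ such that

$$\sup_{\lambda \in \Lambda} \int_{\{|f_\lambda| > N\}} |f_\lambda|\,d\mu < \epsilon/2.$$

Let $\delta = \frac{\epsilon}{2N}$ and let $A \in \mathcal{F}$ be such that $\mu(A) < \delta$. Then, as in the proof of Proposition 2.5.8 above,

$$\sup_{\lambda \in \Lambda} \int_A |f_\lambda|\,d\mu \le N\mu(A) + \epsilon/2 < \epsilon.$$

Conversely, suppose $\{f_\lambda : \lambda \in \Lambda\}$ is $L^1$ bounded and u.a.c. ($\mu$). Then, for every $\epsilon > 0$, there exists a $\delta_\epsilon > 0$ such that

$$\sup_{\lambda \in \Lambda} \int_A |f_\lambda|\,d\mu < \epsilon \quad \text{if } \mu(A) < \delta_\epsilon. \tag{5.13}$$

Also, for any nonnegative $f$ in $L^1(\Omega, \mathcal{F}, \mu)$ and $t > 0$,

$$\int f\,d\mu \ge \int_{\{f \ge t\}} f\,d\mu \ge t\mu\big(\{f \ge t\}\big),$$

which implies that

$$\mu\big(\{f \ge t\}\big) \le \frac{\int f\,d\mu}{t}.$$

(This is known as Markov's inequality; see Chapter 3.) Hence, it follows that

$$\sup_{\lambda \in \Lambda} \mu\big(\{|f_\lambda| \ge t\}\big) \le \Big(\sup_{\lambda \in \Lambda} \int |f_\lambda|\,d\mu\Big) \Big/ t. \tag{5.14}$$

Now given $\epsilon > 0$, choose $T_\epsilon$ such that $\big(\sup_{\lambda \in \Lambda} \int |f_\lambda|\,d\mu\big)/T_\epsilon < \delta_\epsilon$ where $\delta_\epsilon$ is as in (5.13). Then, by (5.14), it follows that

$$\sup_{\lambda \in \Lambda} \int_{\{|f_\lambda| \ge T_\epsilon\}} |f_\lambda|\,d\mu < \epsilon,$$

i.e., $\{f_\lambda : \lambda \in \Lambda\}$ is UI. □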

Theorem 2.5.10: Let $(\Omega, \mathcal{F}, \mu)$ be a measure space with $\mu(\Omega) < \infty$, and let $\{f_n : n \ge 1\} \subset L^1(\Omega, \mathcal{F}, \mu)$ be such that $f_n \to f$ a.e. ($\mu$) and $f$ is $\langle \mathcal{F}, \mathcal{B}(\mathbb{R}) \rangle$-measurable. If $\{f_n : n \ge 1\}$ is UI, then $f$ is integrable and

$$\lim_{n\to\infty} \int |f_n - f|\,d\mu = 0.$$

Remark 2.5.4: In view of Proposition 2.5.7 (iii), Theorem 2.5.10 yields convergence of $\int f_n\,d\mu$ to $\int f\,d\mu$ under weaker conditions than the DCT, provided $\mu(\Omega) < \infty$. However, even under the restriction $\mu(\Omega) < \infty$, UI of $\{f_n\}_{n\ge 1}$ is a sufficient, but not a necessary condition for convergence of $\int f_n\,d\mu$ to $\int f\,d\mu$ (Problem 2.28). In the special case where the $f_n$'s are nonnegative (and $\mu(\Omega) < \infty$), $\int f_n\,d\mu \to \int f\,d\mu < \infty$ if and only if $\{f_n\}_{n\ge 1}$ are UI (Problem 2.29). On the other hand, when $\mu(\Omega) = +\infty$, UI is no longer sufficient to guarantee the convergence of $\int f_n\,d\mu$ to $\int f\,d\mu$ (Problem 2.30). Thus, the notion of UI is useful mainly for finite measures and, in particular, probability measures.

Proof: By Proposition 2.5.7 and Fatou's lemma,

$$\int |f|\,d\mu \le \liminf_{n\to\infty} \int |f_n|\,d\mu \le \sup_{n\ge 1} \int |f_n|\,d\mu < \infty$$

and hence $f$ is integrable. Next for $n \in \mathbb{N}$, $t \in (0, \infty)$, define the functions $g_n = |f_n - f|$, $\underline{g}_{n,t} = g_n I(|g_n| \le t)$ and $\bar{g}_{n,t} = g_n I(|g_n| > t)$. Since $\{f_n\}_{n\ge 1}$ is UI and $f$ is integrable, by Proposition 2.5.7 (iv), $\{g_n\}_{n\ge 1}$ is UI. Hence, for any $\epsilon > 0$, there exists $t_\epsilon > 0$ such that

$$\sup_{n\ge 1} \int \bar{g}_{n,t}\,d\mu < \epsilon \quad \text{for all } t \ge t_\epsilon. \tag{5.15}$$

Next note that for any $t > 0$, since $f_n \to f$ a.e. ($\mu$), $\underline{g}_{n,t} \to 0$ a.e. ($\mu$), and $|\underline{g}_{n,t}| \le t$ and $\int t\,d\mu = t\mu(\Omega) < \infty$. Hence, by the DCT,

$$\lim_{n\to\infty} \int \underline{g}_{n,t}\,d\mu = 0 \quad \text{for all } t > 0. \tag{5.16}$$

By (5.15) and (5.16) with $t = t_\epsilon$, we get

$$0 \le \limsup_{n\to\infty} \int |f_n - f|\,d\mu \le \sup_{n\ge 1} \int \bar{g}_{n,t_\epsilon}\,d\mu + \lim_{n\to\infty} \int \underline{g}_{n,t_\epsilon}\,d\mu \le \epsilon$$

for all $\epsilon > 0$. Hence, $\int |f_n - f|\,d\mu \to 0$ as $n \to \infty$. □

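The next result concerns connections between the notions of almost everywhere convergence and almost uniform convergence.

Theorem 2.5.11: (Egorov's theorem). Let $f_n \to f$ a.e. ($\mu$) and $\mu(\Omega) < \infty$. Then $f_n \to f$ nearly uniformly ($\mu$) as in Definition 2.5.7.

Proof: For $j, n, r \in \mathbb{N}$, define the sets

$$\begin{aligned} A_{jr} &= \{\omega : |f_j(\omega) - f(\omega)| \ge r^{-1}\}, \\ B_{nr} &= \bigcup_{j\ge n} A_{jr}, \qquad C_r = \bigcap_{n\ge 1} B_{nr}, \\ D &= \bigcup_{r\ge 1} C_r. \end{aligned}$$

It is easy to verify that $D$ is the set of points where $f_n$ does not converge to $f$. That is, $D = \{\omega : f_n(\omega) \not\to f(\omega)\}$. By hypothesis $\mu(D) = 0$. This implies $\mu(C_r) = 0$ for all $r \ge 1$. Since $B_{nr} \downarrow C_r$ as $n \to \infty$ and $\mu(\Omega) < \infty$, by m.c.f.a., $\mu(C_r) = 0 \Rightarrow \lim_{n\to\infty} \mu(B_{nr}) = 0$. So for all $r \ge 1$, $\epsilon > 0$, there exists $k_r \in \mathbb{N}$ such that $\mu(B_{k_r r}) < \frac{\epsilon}{2^r}$. Let $A = \bigcup_{r\ge 1} B_{k_r r}$. Then $\mu(A) < \epsilon$ and $A^c = \bigcap_{r\ge 1} B^c_{k_r r}$. Also, for each $r \in \mathbb{N}$, $A^c \subset B^c_{k_r r}$. So for any $n \ge 1$,

$$\begin{aligned} \sup\{|f_n(\omega) - f(\omega)| : \omega \in A^c\} &\le \sup\{|f_n(\omega) - f(\omega)| : \omega \in B^c_{k_r r}\} \quad \text{for all } r \in \mathbb{N} \\ &\le \frac{1}{r} \quad \text{if } n \ge k_r. \end{aligned}$$

That is, $\sup\{|f_n(\omega) - f(\omega)| : \omega \in A^c\} \to 0$ as $n \to \infty$. □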
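A concrete instance of nearly uniform convergence is easy to compute. The sketch below assumes the sequence $f_n(x) = x^n$ on $[0, 1]$ with Lebesgue measure, which converges to 0 a.e. but not uniformly on $[0, 1)$; removing the set $A = (1 - \epsilon, 1]$ of measure $\epsilon$ restores uniform convergence. The function name and the sample values of $\epsilon$ and $n$ are illustrative.

```python
def sup_on_interval(n, right_end):
    """sup of |x**n - 0| over [0, right_end]; x**n is increasing on [0, 1],
    so the supremum is attained at the right endpoint."""
    return right_end ** n

# On A^c = [0, 1 - eps] with eps = 0.01 the sup error decays geometrically.
uniform_error = [sup_on_interval(n, 0.99) for n in (10, 100, 1000)]

# Points arbitrarily close to 1 keep the sup over [0, 1) near 1 for every fixed n.
near_one = sup_on_interval(1000, 1.0 - 1e-12)
```

The exceptional set here can be written down explicitly; in the proof above it is assembled from the sets $B_{k_r r}$ instead.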

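Theorem 2.5.12: (Lusin's theorem). Let $F : \mathbb{R} \to \mathbb{R}$ be a nondecreasing function and let $\mu = \mu^*_F$ be the corresponding Lebesgue-Stieltjes measure on $(\mathbb{R}, \mathcal{M}_{\mu^*_F})$. Let $f : \mathbb{R} \to \mathbb{R}$ be $\langle \mathcal{M}_{\mu^*_F}, \mathcal{B}(\mathbb{R}) \rangle$-measurable. Let $\mu(\{x : |f(x)| = \infty\}) = 0$. Then for every $\epsilon > 0$, there exists a continuous function $g : \mathbb{R} \to \mathbb{R}$ such that $\mu(\{x : f(x) \ne g(x)\}) < \epsilon$.

Proof: Fix $-\infty < a < b < \infty$. Since $\mu([a, b]) < \infty$ and $\mu(\{x : |f(x)| = \infty\}) = 0$, for each $\epsilon > 0$, there exists $K \in (0, \infty)$ such that

$$\mu\big(\{x : a \le x \le b,\ |f(x)| > K\}\big) < \frac{\epsilon}{2}.$$

Let $f_K(x) = f(x)I_{[a,b]}(x)I_{\{|f| \le K\}}(x)$. Then $f_K : \mathbb{R} \to \mathbb{R}$ is bounded, $\langle \mathcal{M}_{\mu^*_F}, \mathcal{B}(\mathbb{R}) \rangle$-measurable and zero for $|x| > K$. Consider now the following claim: For every $\epsilon > 0$, there exists a continuous function $g : [a, b] \to \mathbb{R}$ such that

$$\mu\big(\{x : f_K(x) \ne g(x),\ a \le x \le b\}\big) < \frac{\epsilon}{2}. \tag{5.17}$$

Clearly this implies that

$$\mu\big(\{x : a \le x \le b,\ f(x) \ne g(x)\}\big) < \epsilon. \tag{5.18}$$

Fix $\delta > 0$. Now for each $n \in \mathbb{Z}$, take $[a, b] = [n, n + 1]$, $\epsilon = \frac{\delta}{2^{|n|+2}}$, apply (5.18) and call the resulting continuous function $g_n$. Let $\tilde{g}$ be a continuous function from $\mathbb{R} \to \mathbb{R}$ such that $\mu(\{x : n \le x \le n + 1,\ \tilde{g}(x) \ne g_n(x)\}) < \frac{\delta}{2^{|n|+2}}$. This can be done by setting $\tilde{g}(x) = g_n(x)$ for $n \le x \le (n + 1) - \delta_n$ and linear on $[(n + 1) - \delta_n, n + 1]$ for some $0 < \delta_n < \frac{\delta}{2^{|n|+3}}$. Then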

µ {x : f (x) = g˜(x)} ≤

n=−∞ ∞

≤

µ {x : n ≤ x ≤ n + 1, f (x) = g˜(x)} µ {x : n ≤ x ≤ n + 1, f (x) = gn (x)}

n=−∞ ∞

+

0, there exists a simple function k s(x) ≡ i=1 ci IAi (x), with Ai ⊂ [a, b], Ai ∈ Mµ∗F , {Ai : 1 ≤ i ≤ k} are disjoint, µ(Ai ) < ∞, and ci ∈ R for i = 1, . . . , k, such that |f (x)−s(x)| < 4 for all a ≤ x ≤ b. By Theorem 1.3.4, for each Ai and η > 0, there exist a ﬁnite number of open disjoint intervals Iij = (aij , bij ), j = 1, . . . ni such that

ni

η . Iij < µ Ai 2k j=1 Now as in Problem 1.32 (g), there exists a continuous function gij such that η −1 µ gij {1} Iij < , j = 1, 2, . . . , ni , i = 1, 2, . . . , k. kni Let gi ≡

ni j=1

gij , 1 ≤ i ≤ k.


k Then µ(Ai gi−1 {1}) < kη . Let g = i=1 ci gi . Then µ({s = g}) < η. Hence for every > 0, η > 0, there is a continuous function g,η : [a, b] → R such that µ x : a ≤ x ≤ b, |fK (x) − g,η (x)| > < η. Now for each n ≥ 1, let hn (·) ≡ g 21n , 21n (·) Then, µ(An )

n . 2 and hence ∞

µ(An ) < ∞.

n=1

By the MCT, this implies that ∞

[a,b]

∞

n=1 IAn

IAn < ∞ a.e.

dµ < ∞ and hence

µ.

n=1

Thus $h_n \to f_K$ a.e. $\mu$ on $[a, b]$. By Egorov's theorem for any $\epsilon > 0$, there is a set $A_\epsilon \in \mathcal{B}([a, b])$ such that

$$\mu(A_\epsilon^c) < \epsilon/2 \quad \text{and} \quad h_n \to f_K \text{ uniformly on } A_\epsilon.$$

By the inner regularity (Corollary 1.3.5) of $\mu$, there is a compact set $D \subset A_\epsilon$ such that $\mu(A_\epsilon \setminus D) < \epsilon/2$. Since $h_n \to f_K$ uniformly on $A_\epsilon$, $f_K$ is continuous on $A_\epsilon$ and hence on $D$. It can be shown that there exists a continuous function $g : [a, b] \to \mathbb{R}$ such that $g = f_K$ on $D$ (Problem 2.8 (e)). A more general result extending a continuous function defined on a closed set to the whole space is known as Tietze's extension theorem (see Munkres (1975)). Thus $\mu(\{x : a \le x \le b,\ f_K(x) \ne g(x)\}) < \epsilon$. This completes the proof of (5.17) and hence that of the theorem. □

Remark 2.5.5: (Littlewood's principles). As pointed out in Section 1.3, Theorems 1.3.4, 2.5.11, and 2.5.12 constitute J. E. Littlewood's three principles: every $\mathcal{M}_{\mu^*_F}$-measurable set is nearly a finite union of intervals; every a.e. convergent sequence is nearly uniformly convergent; and every $\mathcal{M}_{\mu^*_F}$-measurable function is nearly continuous.

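2.6 Problems

2.1 Let $\Omega_i$, $i = 1, 2$ be two nonempty sets and $T : \Omega_1 \to \Omega_2$ be a map. Then for any collection $\{A_\alpha : \alpha \in I\}$ of subsets of $\Omega_2$, show that

$$T^{-1}\Big(\bigcup_{\alpha \in I} A_\alpha\Big) = \bigcup_{\alpha \in I} T^{-1}(A_\alpha)$$

and

$$T^{-1}\Big(\bigcap_{\alpha \in I} A_\alpha\Big) = \bigcap_{\alpha \in I} T^{-1}(A_\alpha).$$

Further, $\big(T^{-1}(A)\big)^c = T^{-1}(A^c)$ for all $A \subset \Omega_2$. (These are known as de Morgan's laws.)

2.2 Let $\{A_i\}_{i\ge 1}$ be a collection of disjoint sets in a measurable space $(\Omega, \mathcal{F})$.
(a) Let $\{g_i\}_{i\ge 1}$ be a collection of $\langle \mathcal{F}, \mathcal{B}(\mathbb{R}) \rangle$-measurable functions from $\Omega$ to $\mathbb{R}$. Show that $\sum_{i=1}^{\infty} I_{A_i} g_i$ converges on $\mathbb{R}$ and is $\langle \mathcal{F}, \mathcal{B}(\mathbb{R}) \rangle$-measurable.
(b) Let $\mathcal{G} \equiv \sigma\{A_i : i \ge 1\}$. Show that $h : \Omega \to \mathbb{R}$ is $\langle \mathcal{G}, \mathcal{B}(\mathbb{R}) \rangle$-measurable iff $h(\cdot)$ is constant on each $A_i$.

2.3 Let $f, g : \Omega \to \mathbb{R}$ be $\langle \mathcal{F}, \mathcal{B}(\mathbb{R}) \rangle$-measurable. Set

$$h(\omega) = \frac{f(\omega)}{g(\omega)} I\big(g(\omega) \ne 0\big), \quad \omega \in \Omega.$$

Verify that $h$ is $\langle \mathcal{F}, \mathcal{B}(\mathbb{R}) \rangle$-measurable.

2.4 Let $g : \Omega \to \bar{\mathbb{R}}$ be such that for every $r \in \mathbb{R}$, $g^{-1}((-\infty, r]) \in \mathcal{F}$. Show that $g$ is $\langle \mathcal{F}, \mathcal{B}(\bar{\mathbb{R}}) \rangle$-measurable.

2.5 Prove Proposition 2.1.6. (Hint: Show that

$$\sigma\{f_\lambda : \lambda \in \Lambda\} = \bigcup_{L \in \mathcal{C}} \sigma\{f_\lambda : \lambda \in L\}$$

where $\mathcal{C}$ is the collection of all countable subsets of $\Lambda$.)

2.6 Let $X_i$, $i = 1, 2, 3$ be random variables on a probability space $(\Omega, \mathcal{F}, P)$. Consider the random equation (in $t \in \mathbb{R}$):

$$X_1(\omega)t^2 + X_2(\omega)t + X_3(\omega) = 0. \tag{6.1}$$

(a) Show that $A \equiv \{\omega \in \Omega : \text{Equation (6.1) has two distinct real roots}\} \in \mathcal{F}$.
(b) Let $T_1(\omega)$ and $T_2(\omega)$ denote the two roots of (6.1) on $A$. Let

$$f_i(\omega) = \begin{cases} T_i(\omega) & \text{on } A, \\ 0 & \text{on } A^c, \end{cases} \quad i = 1, 2.$$

Show that $(f_1, f_2)$ is $\langle \mathcal{F}, \mathcal{B}(\mathbb{R}^2) \rangle$-measurable.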


2.7 Let $M \equiv ((X_{ij}))$, $1 \le i, j \le k$, be a (random) matrix of random variables $X_{ij}$ defined on a probability space $(\Omega, \mathcal{F}, P)$.
(a) Show that $Y_1 \equiv \det(M)$ (the determinant of $M$) and $Y_2 \equiv \mathrm{tr}(M)$ (the trace of $M$) are both $\langle \mathcal{F}, \mathcal{B}(\mathbb{R}) \rangle$-measurable.
(b) Show also that $Y_3 \equiv$ the largest eigenvalue of $M'M$ is $\langle \mathcal{F}, \mathcal{B}(\mathbb{R}) \rangle$-measurable, where $M'$ is the transpose of $M$. (Hint: Use the result that $Y_3 = \sup_{x \ne 0} \frac{x'M'Mx}{x'x}$.)

2.8 Let f : R → R. Let f¯(x) = inf δ>0 sup|y−x|0 inf |y−x| 0, {x : f¯(x) − f (x) < t} ≡

{x : f¯(x) < t + r, f (x) > r}

r∈Q

and hence is open.
(c) Show that $f$ is continuous at some $x_0$ in $\mathbb{R}$ iff $\bar{f}(x_0) = \underline{f}(x_0)$.
(d) Show that the set $C_f \equiv \{x : f(\cdot) \text{ is continuous at } x\}$ is a $G_\delta$ set, i.e., an intersection of a countable number of open sets, and hence, $C_f$ is a Borel set.
(e) Let $D$ be a closed set in $\mathbb{R}$. Let $f : D \to \mathbb{R}$ be continuous on $D$. Show that there exists a $g : \mathbb{R} \to \mathbb{R}$ continuous such that $g = f$ on $D$. (Hint: Note that $D^c$ is open in $\mathbb{R}$ and hence it can be expressed as a countable union of disjoint open intervals $\{I_j = (a_j, b_j) : 1 \le j \le k \le \infty\}$. Note that $a_j, b_j \in D$ for all $j$ except for possibly the $j$'s for which $a_j = -\infty$ or $b_j = +\infty$. Let

$$g(x) \equiv \begin{cases} f(x) & \text{if } x \in D, \\ f(a_j) + \frac{(x - a_j)}{(b_j - a_j)}\big(f(b_j) - f(a_j)\big) & \text{if } x \in (a_j, b_j),\ a_j, b_j \in D, \\ f(b_j) & \text{if } a_j = -\infty,\ x \in (a_j, b_j), \\ f(a_j) & \text{if } b_j = \infty,\ x \in (a_j, b_j). \end{cases}$$

Now verify that $g$ has the required properties.)


2.9 Prove Proposition 2.2.1 using Problem 2.1.

2.10 (a) Show that for any $x \in \mathbb{R}$ and any random variable $X$ with cdf $F_X(\cdot)$, $P(X < x) = F_X(x-) \equiv \lim_{y \uparrow x} F_X(y)$.
(b) Show that $F_c(\cdot)$ in (2.5) is continuous.

2.11 Let $F : \mathbb{R} \to \mathbb{R}$ be nondecreasing.
(a) Show that for $x \in \mathbb{R}$, $F(x-) \equiv \lim_{y \uparrow x} F(y)$ and $F(x+) = \lim_{y \downarrow x} F(y)$ exist and satisfy $F(x-) \le F(x) \le F(x+)$.
(b) Let $D \equiv \{x : F(x+) - F(x-) > 0\}$. Show that $D$ is at most countable. (Hint: Show that

$$D = \bigcup_{n\ge 1} \bigcup_{r\ge 1} D_{n,r},$$

where $D_{n,r} = \{x : |x| \le n,\ F(x+) - F(x-) > \frac{1}{r}\}$ and that each $D_{n,r}$ is finite.)

2.12 Suppose that (i) and (iii) of Proposition 2.2.3 hold. Show that for any $a_1$ in $\mathbb{R}$ and $-\infty < a_2 \le b_2 < \infty$, $F(a_2, a_1) \le F(b_2, a_1)$ and $F(a_1, a_2) \le F(a_1, b_2)$ (i.e., $F$ is monotone coordinatewise).

2.13 Let $F : \mathbb{R}^k \to \mathbb{R}$ be such that:
(a) for $x_1 = (x_{11}, x_{12}, \ldots, x_{1k})$ and $x_2 = (x_{21}, x_{22}, \ldots, x_{2k})$ with $x_{1i} \le x_{2i}$ for $i = 1, 2, \ldots, k$,

$$\Delta F(x_1, x_2) \equiv \sum_{a \in A} (-1)^{s(a)} F(a) \ge 0,$$

where $A \equiv \{a = (a_1, a_2, \ldots, a_k) : a_i \in \{x_{1i}, x_{2i}\},\ i = 1, 2, \ldots, k\}$ and for $a$ in $A$, $s(a) = |\{i : a_i = x_{1i},\ i = 1, 2, \ldots, k\}|$ is the number of indices $i$ for which $a_i = x_{1i}$.
(b) For each $i = 1, 2, \ldots, k$,

$$\lim_{x_{i1} \downarrow -\infty} F(x_i) = 0.$$

Let $\mathcal{C}_k$ be the semialgebra of sets of the form $\{A : A = A_1 \times \ldots \times A_k,\ A_i \in \mathcal{C} \text{ for all } 1 \le i \le k\}$ where $\mathcal{C}$ is the semialgebra in $\mathbb{R}$ defined in (3.7). Set $\mu_F(A) \equiv \Delta F(x_1, x_2)$ if $A = (x_{11}, x_{21}] \times (x_{12}, x_{22}] \times \ldots \times (x_{1k}, x_{2k}]$ is bounded and set $\mu_F(A) = \lim_{n\to\infty} \mu_F(A \cap J_n)$, where $J_n = (-n, n]^k$.
(i) Show that $F$ is coordinatewise monotone, i.e., if $x = (x_1, \ldots, x_k)$, $y = (y_1, \ldots, y_k)$ and $x_i \le y_i$ for every $i = 1, \ldots, k$, then $F(y) \ge F(x)$.

2.6 Problems


(ii) Show that $\mathcal{C}_k$ is a semialgebra and $\mu_F$ is a measure on $\mathcal{C}_k$ by using the Heine-Borel theorem as in Problems 1.22 and 1.23.

2.14 Let $m(\cdot)$ denote the Lebesgue measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Let $T : \mathbb{R} \to \mathbb{R}$ be the map $T(x) = x^2$. Evaluate the induced measure $mT^{-1}(A)$ of the set $A$, where
(a) $A = [0, t]$, $t > 0$.
(b) $A = (-\infty, 0)$.
(c) $A = \{1, 2, 3, \ldots\}$.
(d) $A = \bigcup_{i=1}^{\infty} \bigl(i^2, (i + \tfrac{1}{i^2})^2\bigr)$.
(e) $A = \bigcup_{i=1}^{\infty} \bigl(i^2, (i + \tfrac{1}{i})^2\bigr)$.

2.15 Consider the probability space $\bigl((0, 1), \mathcal{B}((0, 1)), m\bigr)$, where $m(\cdot)$ is the Lebesgue measure.
(a) Let $Y_1$ be the random variable $Y_1(x) \equiv \sin 2\pi x$ for $x \in (0, 1)$. Find the cdf of $Y_1$.
(b) Let $Y_2$ be the random variable $Y_2(x) \equiv \log x$ for $x \in (0, 1)$. Find the cdf of $Y_2$.
(c) Let $F : \mathbb{R} \to \mathbb{R}$ be a cdf. For $0 < x < 1$, let
\[
F_1^{-1}(x) = \inf\{y : y \in \mathbb{R},\ F(y) \ge x\}, \qquad
F_2^{-1}(x) = \sup\{y : y \in \mathbb{R},\ F(y) \le x\}.
\]
Let $Z_i$ be the random variable defined by
\[
Z_i(x) = F_i^{-1}(x), \quad 0 < x < 1, \quad i = 1, 2.
\]
(i) Find the cdf of $Z_i$, $i = 1, 2$. (Hint: Verify using the right continuity of $F$ that for any $0 < x < 1$, $t \in \mathbb{R}$, $F(t) \ge x \Leftrightarrow F_1^{-1}(x) \le t$.)
(ii) Show also that $F_1^{-1}(\cdot)$ is left continuous and $F_2^{-1}(\cdot)$ is right continuous.

2.16 (a) Let $(\Omega, \mathcal{F}, \mu)$ be a $\sigma$-finite measure space. Let $T : \Omega \to \mathbb{R}$ be $\langle \mathcal{F}, \mathcal{B}(\mathbb{R}) \rangle$-measurable. Show, by an example, that the induced measure $\mu T^{-1}$ need not be $\sigma$-finite.
(b) Let $(\Omega_i, \mathcal{F}_i)$ be measurable spaces for $i = 1, 2$ and let $T : \Omega_1 \to \Omega_2$ be $\langle \mathcal{F}_1, \mathcal{F}_2 \rangle$-measurable. Show that any measure $\mu$ on $(\Omega_1, \mathcal{F}_1)$ is $\sigma$-finite if $\mu T^{-1}$ is $\sigma$-finite on $(\Omega_2, \mathcal{F}_2)$.
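The equivalence in the hint to Problem 2.15(c)(i) can be checked numerically on a simple step cdf. The sketch below is only an illustration under an assumed example (the cdf of a fair die); the names `cdf`, `quantile` and `equivalence_holds` are hypothetical helpers and not notation from the text.

```python
from bisect import bisect_left
from fractions import Fraction

# Illustrative assumption: X is a fair die, so F jumps by 1/6 at 1, ..., 6.
ATOMS = [1, 2, 3, 4, 5, 6]
CUM = [Fraction(k, 6) for k in range(1, 7)]  # F evaluated at each atom


def cdf(t):
    """Right-continuous cdf F(t) = P(X <= t)."""
    return Fraction(sum(1 for a in ATOMS if a <= t), 6)


def quantile(x):
    """F_1^{-1}(x) = inf{y : F(y) >= x}, for 0 < x < 1."""
    # The infimum is attained at the first atom whose cumulative mass reaches x.
    return ATOMS[bisect_left(CUM, x)]


def equivalence_holds(x, t):
    """True when 'F(t) >= x' and 'F_1^{-1}(x) <= t' agree."""
    return (cdf(t) >= x) == (quantile(x) <= t)
```

Exact rational arithmetic is used so that the comparison at the jump points of $F$ is not affected by rounding.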


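2.17 Let $(\Omega, \mathcal{F}, \mu)$ be a measure space and let $f : \Omega \to [0, \infty]$ be such that it admits two representations
\[
f = \sum_{i=1}^{k} c_i I_{A_i} \quad \text{and} \quad f = \sum_{j=1}^{\ell} d_j I_{B_j},
\]
where $c_i, d_j \in [0, \infty]$, and $A_i$ and $B_j \in \mathcal{F}$ for all $i, j$. Show that
\[
\sum_{i=1}^{k} c_i \mu(A_i) = \sum_{j=1}^{\ell} d_j \mu(B_j).
\]
(Hint: Express $A_i$ and $B_j$ as finite unions of a common collection of disjoint sets in $\mathcal{F}$.)

2.18 (a) Prove Proposition 2.3.1.
(b) In the proof of Proposition 2.3.2, verify that $D_n \uparrow D$.
(c) Verify Remark 2.3.1. (Hint: Let
\[
A_n = \{\omega : f_{n+1}(\omega) \ge f_n(\omega),\ g_{n+1}(\omega) \ge g_n(\omega)\},
\]
\[
A = \bigcap_{n \ge 1} A_n \cap \Bigl\{\omega : \lim_{n \to \infty} g_n(\omega) = g(\omega),\ \lim_{n \to \infty} f_n(\omega) = f(\omega)\Bigr\},
\]
and $\tilde f_n = f_n I_A$, $\tilde g_n = g_n I_A$. Verify that $\mu(A^c) = 0$ and apply Proposition 2.3.2 to $\{\tilde f_n\}_{n \ge 1}$ and $\{\tilde g_n\}_{n \ge 1}$.)

2.19 (a) Verify that $f_n(\cdot)$ defined in (3.7) satisfies $f_n(\omega) \uparrow f(\omega)$ for all $\omega$ in $\Omega$.
(b) Verify that the sequence $\{h_n\}_{n \ge 1}$ of Remark 2.3.3 satisfies $\lim_{n \to \infty} h_n = f$ a.e. $(\mu)$, and $\lim_{n \to \infty} \int h_n\,d\mu = M$.

2.20 Apply Corollary 2.3.5 to show that for any collection $\{a_{ij} : i, j \in \mathbb{N}\}$ of nonnegative numbers,
\[
\sum_{i=1}^{\infty} \Bigl( \sum_{j=1}^{\infty} a_{ij} \Bigr) = \sum_{j=1}^{\infty} \Bigl( \sum_{i=1}^{\infty} a_{ij} \Bigr).
\]

2.21 Let $g : \mathbb{R} \to \mathbb{R}$.
(a) Recall that $\lim_{t \to \infty} g(t) = L$ for some $L$ in $\mathbb{R}$ if for every $\epsilon > 0$, there exists a $T < \infty$ such that $t \ge T \Rightarrow |g(t) - L| < \epsilon$. Show that $\lim_{t \to \infty} g(t) = L$ for some $L$ in $\mathbb{R}$ iff $\lim_{n \to \infty} g(t_n) = L$ for every sequence $\{t_n\}_{n \ge 1}$ with $\lim_{n \to \infty} t_n = \infty$.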


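(b) Formulate and prove a similar result when $\lim_{t \to a} g(t) = L$ for some $a, L \in \mathbb{R}$.

2.22 Let $\{f_t : t \in \mathbb{R}\} \subset L^1(\Omega, \mathcal{F}, \mu)$.
(a) (The continuous version of the MCT). Suppose that $f_t \uparrow f$ as $t \uparrow \infty$ a.e. $(\mu)$ and for each $t$, $f_t \ge 0$ a.e. $(\mu)$. Show that $\int f_t\,d\mu \uparrow \int f\,d\mu$.
(b) (The continuous version of the DCT). Suppose there exists a nonnegative $g \in L^1(\Omega, \mathcal{F}, \mu)$ such that for each $t$, $|f_t| \le g$ a.e. $(\mu)$ and as $t \to \infty$, $f_t \to f$ a.e. $(\mu)$. Then $f \in L^1(\Omega, \mathcal{F}, \mu)$ and $\int |f_t - f|\,d\mu \to 0$ and hence, $\int f_t\,d\mu \to \int f\,d\mu$, as $t \to \infty$.

2.23 Let $f : [a, b] \to \mathbb{R}$ be bounded where $-\infty < a < b < \infty$. Let $\{P_n\}_{n \ge 1}$ be a sequence of partitions such that $\Delta(P_n) \to 0$. Show that as $n \to \infty$,
\[
U(P_n, f) \to \overline{\int} f \quad \text{and} \quad L(P_n, f) \to \underline{\int} f,
\]
where $\overline{\int} f$ and $\underline{\int} f$ are as defined in (4.3) and (4.4), respectively.
(Hint: Given $\epsilon > 0$, fix a partition $P = \{x_0 = a < x_1 < \ldots < x_k = b\}$ such that $\overline{\int} f < U(P, f) < \overline{\int} f + \epsilon$. Let $\delta = \min_{0 \le i \le k-1} (x_{i+1} - x_i)$. Choose $n$ large such that the diameter $\Delta(P_n) < \delta$. Verify that $U(P_n, f) < U(P, f) + kB\Delta(P_n)$ where $B = \sup\{|f(x)| : a \le x \le b\}$ and conclude that $\limsup_n U(P_n, f) \le \overline{\int} f + \epsilon$.)

2.24 Establish (4.6) and (4.7). (Hint: Show that for every $x$ and any $\epsilon > 0$, $\phi_n(x) \le \phi(x) + \epsilon$ for all $n$ large and that for $x \notin \bigcup_{n \ge 1} P_n$, $\phi_n(x) \ge \phi(x)$ for all $n$.)

2.25 If $f(x) = I_{Q_1}(x)$ where $Q_1 = \mathbb{Q} \cap [0, 1]$, $\mathbb{Q}$ being the set of all rationals, then show that for any partition $P$, $U(P, f) = 1$ and $L(P, f) = 0$.

2.26 Establish Theorem 2.5.1. (Hint: Verify that
\[
D \equiv \{\omega : f_j(\omega) \not\to f(\omega)\} = \bigcup_{r \ge 1} \bigcap_{n \ge 1} \bigcup_{j \ge n} A_{jr},
\]
where $A_{jr} = \{|f_j - f| > \tfrac{1}{r}\}$. Show that since $\mu(D) = 0$ and $\mu(\Omega) < \infty$, $\mu(D_{rn}) \to 0$ as $n \to \infty$ for each $r \in \mathbb{N}$, where $D_{rn} = \bigcup_{j \ge n} A_{jr}$.)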


2.27 Let $\phi : \mathbb{R}_+ \to \mathbb{R}_+$ be nondecreasing and $\frac{\phi(x)}{x} \uparrow \infty$ as $x \uparrow \infty$. Also, let $\{f_\lambda : \lambda \in \Lambda\}$ be a subset of $L^1(\Omega, \mathcal{F}, \mu)$. Show that if $\sup_{\lambda \in \Lambda} \int \phi(|f_\lambda|)\,d\mu < \infty$, then $\{f_\lambda : \lambda \in \Lambda\}$ is UI.

2.28 Let $\mu$ be the Lebesgue measure on $([-1, 1], \mathcal{B}([-1, 1]))$. For $n \ge 1$, define $f_n(x) = n I_{(0, n^{-1})}(x) - n I_{(-n^{-1}, 0)}(x)$ and $f(x) \equiv 0$ for $x \in [-1, 1]$. Show that $f_n \to f$ a.e. $(\mu)$ and $\int f_n\,d\mu \to \int f\,d\mu$ but $\{f_n\}_{n \ge 1}$ is not UI.

2.29 Let $\{f_n : n \ge 1\} \cup \{f\} \subset L^1(\Omega, \mathcal{F}, \mu)$.
(a) Show that $\int |f_n - f|\,d\mu \to 0$ iff $f_n \longrightarrow^m f$ and $\int |f_n|\,d\mu \to \int |f|\,d\mu$.
(b) Show further that if $\mu(\Omega) < \infty$ then the above two are equivalent to $f_n \longrightarrow^m f$ and $\{f_n\}$ UI.

2.30 For $n \ge 1$, let $f_n(x) = n^{-1/2} I_{(0,n)}(x)$, $x \in \mathbb{R}$, and let $f(x) = 0$, $x \in \mathbb{R}$. Let $m$ denote the Lebesgue measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Show that $f_n \to f$ a.e. $(m)$ and $\{f_n\}_{n \ge 1}$ is UI, but $\int f_n\,dm \not\to \int f\,dm$.

2.31 (Computing integrals w.r.t. the Lebesgue measure). Let $f \in L^1(\mathbb{R}, \mathcal{M}_m, m)$ where $(\mathbb{R}, \mathcal{M}_m, m)$ is the real line with Lebesgue $\sigma$-algebra and Lebesgue measure, i.e., $m = \mu_F^*$ where $F(x) \equiv x$. The definition of $\int f\,dm$ as $\int f^+\,dm - \int f^-\,dm$ involves computing $\int f^+\,dm$ and $\int f^-\,dm$, which in turn is given in terms of approximating by integrals of simple nonnegative functions. This is not a very practical procedure. For $f$ that happens to be continuous a.e. and bounded on finite intervals, one can compute the Riemann integral of $f$ over finite intervals and pass to the limit. Justify the following steps:
(a) Let $f$ be continuous a.e. and bounded on finite intervals and $f \in L^1(\mathbb{R}, \mathcal{M}_m, m)$. Show that for $-\infty < a < b < \infty$, $f \in L^1([a, b], \mathcal{M}_m, m)$ and
\[
\int_{[a,b]} f\,dm = \int_{[a,b]} f(x)\,dx,
\]
where the right side denotes the Riemann integral of $f$ on $[a, b]$.
(b) If, in addition, $f \in L^1(\mathbb{R}, \mathcal{M}_m, m)$, then
\[
\int_{\mathbb{R}} f\,dm = \lim_{\substack{a \to -\infty \\ b \to +\infty}} \int_{[a,b]} f\,dm.
\]
(c) If $f$ is continuous a.e. and $f \in L^1(\mathbb{R}, \mathcal{M}_m, m)$, then
\[
\int_{\mathbb{R}} f\,dm = \lim_{\substack{a \to -\infty \\ b \to +\infty \\ c \to \infty}} \int_{[a,b]} \phi_c(f)\,dm,
\]
where $\phi_c(f) = f(x) I(|f(x)| \le c) + c I(f(x) > c) - c I(f(x) < -c)$.

(d) Apply the above procedure to compute $\int_{\mathbb{R}} f\,dm$ for
(i) $f(x) = \dfrac{1}{1 + x^2}$,
(ii) $f(x) = e^{-x^2/2}$,
(iii) $f(x) = e^{-x^2/2 + x}$.

2.32 (a) Let $a \in \mathbb{R}$. Show that if $a_1, a_2$ are nonnegative such that $a = a_1 - a_2$ then $a_1 \ge a^+$, $a_2 \ge a^-$ and $a_1 - a^+ = a_2 - a^-$.
(b) Let $f = f_1 - f_2$ where $f_1, f_2$ are nonnegative and are in $L^1(\Omega, \mathcal{F}, \mu)$. Show that $f \in L^1(\Omega, \mathcal{F}, \mu)$ and $\int f\,d\mu = \int f_1\,d\mu - \int f_2\,d\mu$.

2.33 Let $(\Omega, \mathcal{F}, \mu)$ be a measure space. Let $f : \Omega \times (a, b) \to \mathbb{R}$ be such that for each $a < t < b$, $f(\cdot, t) \in L^1(\Omega, \mathcal{F}, \mu)$.
(a) Suppose for each $a < t < b$,
(i) $\lim_{h \to 0} f(\omega, t + h) = f(\omega, t)$ a.e. $(\mu)$.
(ii) $\sup_{|h| \le 1} |f(\omega, t + h)| \le g_1(\omega, t)$, where $g_1(\cdot, t) \in L^1(\Omega, \mathcal{F}, \mu)$.
Show that $\phi(t) \equiv \int_\Omega f(\omega, t)\,d\mu$ is continuous on $(a, b)$.
(b) Suppose for each $a < t < b$,
(i) $\lim_{h \to 0} \dfrac{f(\omega, t + h) - f(\omega, t)}{h} = g_2(\omega, t)$ exists a.e. $(\mu)$,
(ii) $\sup_{0 < |h| \le 1} \left| \dfrac{f(\omega, t + h) - f(\omega, t)}{h} \right| \le G(\omega, t)$ a.e. $(\mu)$,
(iii) $G(\cdot, t) \in L^1(\Omega, \mathcal{F}, \mu)$.
Show that $\phi(t) \equiv \int_\Omega f(\omega, t)\,d\mu$ is differentiable on $(a, b)$. (Hint: Use the continuous version of DCT (cf. Problem 2.22).)

2.34 Let $A \equiv (a_{ij})$ be an infinite matrix of real numbers. Suppose that for each $j$, $\lim_{i \to \infty} a_{ij} = a_j$ exists in $\mathbb{R}$ and $\sup_i |a_{ij}| \le b_j$, where $\sum_{j=1}^{\infty} b_j < \infty$.
(a) Show by an application of the DCT that
\[
\lim_{i \to \infty} \sum_{j=1}^{\infty} |a_{ij} - a_j| = 0.
\]
(b) Show the same directly, i.e., without using the DCT.


2.35 Using the above problem or otherwise, show that for any sequence $\{x_n\}_{n \ge 1}$ with $\lim_{n \to \infty} x_n = x$ in $\mathbb{R}$,
\[
\lim_{n \to \infty} \Bigl(1 + \frac{x_n}{n}\Bigr)^n = \sum_{j=0}^{\infty} \frac{x^j}{j!} \equiv \exp(x).
\]

2.36 (a) Let $\{f_n\}_{n \ge 1} \subset L^1(\Omega, \mathcal{F}, \mu)$ be such that $f_n \to 0$ in $L^1(\mu)$. Show that $\{f_n\}_{n \ge 1}$ is UI.
(b) Let $\{f_n\}_{n \ge 1} \subset L^p(\Omega, \mathcal{F}, \mu)$, $0 < p < \infty$, be such that $\mu(\Omega) < \infty$, $\{|f_n|^p\}_{n \ge 1}$ is UI and $f_n \longrightarrow^m f$. Show that $f \in L^p(\mu)$ and $f_n \to f$ in $L^p(\mu)$.

2.37 (a) Show that for a sequence of real numbers $\{x_n\}_{n \ge 1}$, $\lim_{n \to \infty} x_n = x \in \bar{\mathbb{R}}$ holds iff every subsequence $\{x_{n_j}\}_{j \ge 1}$ of $\{x_n\}_{n \ge 1}$ has a further subsequence $\{x_{n_{j_k}}\}_{k \ge 1}$ such that $\lim_{k \to \infty} x_{n_{j_k}} = x$.
(b) Use (a) and Theorem 2.5.2 to show that the extended DCT (Theorem 2.3.11) is valid if the a.e. convergence of $\{f_n\}_{n \ge 1}$ and $\{g_n\}_{n \ge 1}$ is replaced by convergence in measure.

2.38 Let $(\mathbb{R}, \mathcal{M}_{\mu_F^*}, \mu_F^*)$ be a Lebesgue-Stieltjes measure space generated by a nondecreasing $F : \mathbb{R} \to \mathbb{R}$. Let $f : \mathbb{R} \to \bar{\mathbb{R}}$ be $\mathcal{M}_{\mu_F^*}$-measurable such that $|f| < \infty$ a.e. $(\mu_F^*)$. Show that for every $k \in \mathbb{N}$ and for every $\epsilon, \eta \in (0, \infty)$, there exists a continuous function $g : \mathbb{R} \to \mathbb{R}$ such that $g(x) = 0$ for $|x| > k$ and $\mu_F^*(\{x : |x| \le k,\ |f(x) - g(x)| > \eta\}) < \epsilon$. (Hint: Complete the following:
Step I: For all $\epsilon > 0$, there exists $M_{k,\epsilon} \in (0, \infty)$ such that $\mu_F^*(\{x : |x| \le k,\ |f(x)| > M_{k,\epsilon}\}) < \epsilon$.
Step II: For $\eta > 0$, there exists a simple function $s(\cdot)$ such that $\mu_F^*(\{x : |x| \le k,\ |f(x)| \le M_{k,\epsilon},\ |f(x) - s(x)| > \eta\}) = 0$.
Step III: For $\delta > 0$, there exists a continuous function $g(\cdot)$ such that $g \equiv 0$ for $|x| > k$ and $\mu_F^*(\{x : |x| \le k,\ s(x) \ne g(x)\}) < \delta$.)

2.39 Recall from Corollary 2.3.6 that for $g \in L^1(\Omega, \mathcal{F}, \mu)$ and nonnegative, $\nu_g(A) \equiv \int_A g\,d\mu$ is a measure. Next, for any $\mathcal{F}$-measurable function $h$, show that $h \in L^1(\nu_g)$ iff $h \cdot g \in L^1(\mu)$ and
\[
\int h\,d\nu_g = \int h g\,d\mu.
\]
(Hint: Verify first for $h$ simple and nonnegative, next for $h$ nonnegative, and finally for any $h$.)

2.40 Prove the BCT, Corollary 2.3.13, using Egorov's theorem (Theorem 2.5.11).

2.41 Deduce the DCT from the BCT with the notation as in Corollary 2.3.12. (Hint: Apply the BCT to the measure space $(\Omega, \mathcal{F}, \nu_g)$ and functions $h_n = \frac{f_n}{g} I(g > 0)$, $h = \frac{f}{g} I(g > 0)$ where $\nu_g$ is as in Problem 2.39.)


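2.42 (Change of variables formula). Let $(\Omega_i, \mathcal{F}_i)$, $i = 1, 2$ be two measurable spaces. Let $f : \Omega_1 \to \Omega_2$ be $\langle \mathcal{F}_1, \mathcal{F}_2 \rangle$-measurable, $h : \Omega_2 \to \mathbb{R}$ be $\langle \mathcal{F}_2, \mathcal{B}(\mathbb{R}) \rangle$-measurable, and $\mu_1$ be a measure on $(\Omega_1, \mathcal{F}_1)$. Show that $g \equiv h \circ f$, i.e., $g(\omega) \equiv h(f(\omega))$ for $\omega \in \Omega_1$, is in $L^1(\mu_1)$ iff $h(\cdot) \in L^1(\Omega_2, \mathcal{F}_2, \mu_2)$ where $\mu_2 = \mu_1 f^{-1}$ iff $I(\cdot) \in L^1(\mathbb{R}, \mathcal{B}(\mathbb{R}), \mu_3 \equiv \mu_2 h^{-1})$ where $I(\cdot)$ is the identity function in $\mathbb{R}$, i.e., $I(x) \equiv x$ for all $x \in \mathbb{R}$, and also that
\[
\int_{\Omega_1} g\,d\mu_1 = \int_{\Omega_2} h\,d\mu_2 = \int_{\mathbb{R}} x\,d\mu_3.
\]

2.43 Let $\phi(x) \equiv \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ be the standard $N(0, 1)$ pdf on $\mathbb{R}$. Let $\{(\mu_n, \sigma_n)\}_{n \ge 1}$ be a sequence in $\mathbb{R} \times \mathbb{R}_+$. Suppose $\mu_n \to \mu$ and $\sigma_n \to \sigma$ as $n \to \infty$. Let $f_n(x) = \frac{1}{\sigma_n} \phi\bigl(\frac{x - \mu_n}{\sigma_n}\bigr)$, $f(x) = \frac{1}{\sigma} \phi\bigl(\frac{x - \mu}{\sigma}\bigr)$, and $\nu_n(A) = \int_A f_n\,dm$, $\nu(A) = \int_A f\,dm$ for any Borel set $A$ in $\mathbb{R}$, where $m(\cdot)$ is the Lebesgue measure on $\mathbb{R}$. Using Scheffe's theorem, verify that, as $n \to \infty$, $\nu_n(\cdot) \to \nu(\cdot)$ uniformly on $\mathcal{B}(\mathbb{R})$ and that for any $h : \mathbb{R} \to \mathbb{R}$, bounded and Borel measurable, $\int h\,d\nu_n \to \int h\,d\nu$.

2.44 Let $f_n(x) = c_n \bigl(1 - \frac{x}{n}\bigr)^n I_{[0,n]}(x)$, $x \in \mathbb{R}$, $n \ge 1$.
(a) Find $c_n$ such that $\int f_n\,dm = 1$.
(b) Show that $\lim_{n \to \infty} f_n(x) \equiv f(x)$ exists for all $x$ in $\mathbb{R}$ and that $f$ is a pdf.
(c) For $A \in \mathcal{B}(\mathbb{R})$, let
\[
\nu_n(A) \equiv \int_A f_n\,dm \quad \text{and} \quad \nu(A) = \int_A f\,dm.
\]
Show that $\nu_n \to \nu$ uniformly on $\mathcal{B}(\mathbb{R})$.

2.45 Let $(\Omega, \mathcal{F}, \mu)$ be a measure space and $f : \Omega \to \mathbb{R}$ be $\mathcal{F}$-measurable. Suppose that $\int_\Omega |f|\,d\mu < \infty$ and $D$ is a closed set in $\mathbb{R}$ such that for all $B \in \mathcal{F}$ with $\mu(B) > 0$,
\[
\frac{1}{\mu(B)} \int_B f\,d\mu \in D.
\]
Show that $f(\omega) \in D$ a.e. $(\mu)$. (Hint: Show that for $x \notin D$, there exists $r > 0$ such that $\mu\{\omega : |f(\omega) - x| < r\} = 0$.)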


2.46 Find a sequence of nonnegative continuous functions $\{f_n\}_{n \ge 1}$ on $[0, 1]$ such that $\int_{[0,1]} f_n\,dm \to 0$ but $\{f_n(x)\}_{n \ge 1}$ does not converge for any $x$. (Hint: Let for $m \ge 1$, $1 \le k \le m$, $g_{m,k} = I_{\left[\frac{k-1}{m}, \frac{k}{m}\right]}$ and $\{f_n\}_{n \ge 1}$ be a reordering of $\{g_{n,k} : 1 \le k \le n,\ n \ge 1\}$.)

2.47 Let $\{f_n\}_{n \ge 1}$ be a sequence of continuous functions from $[0, 1]$ to $[0, 1]$ such that $f_n(x) \to 0$ as $n \to \infty$ for all $0 \le x \le 1$. Show that $\int_0^1 f_n(x)\,dx \to 0$ as $n \to \infty$ (where the integral is the Riemann integral) by two methods: one by using BCT and one without using BCT. Show also that if $\mu$ is a finite measure on $([0, 1], \mathcal{B}([0, 1]))$, then $\int_{[0,1]} f_n\,d\mu \to 0$ as $n \to \infty$.

2.48 (Invariance of Lebesgue measure under translation and reflection.) Let $m(\cdot)$ be Lebesgue measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$.
(a) For any $E \in \mathcal{B}(\mathbb{R})$ and $c \in \mathbb{R}$, define $-E \equiv \{x : -x \in E\}$ and $E + c \equiv \{y : y = x + c,\ x \in E\}$. Show that $m(-E) = m(E)$ and $m(E + c) = m(E)$ for all $E \in \mathcal{B}(\mathbb{R})$ and $c \in \mathbb{R}$.
(b) For any $f \in L^1(\mathbb{R}, \mathcal{B}(\mathbb{R}), m)$ and $c \in \mathbb{R}$, let $\tilde f(x) \equiv f(-x)$ and $f_c(x) \equiv f(x + c)$. Show that
\[
\int \tilde f\,dm = \int f_c\,dm = \int f\,dm.
\]

3 Lp-Spaces

3.1 Inequalities

This section contains a number of useful inequalities.

Theorem 3.1.1: (Markov's inequality). Let $f$ be a nonnegative measurable function on a measure space $(\Omega, \mathcal{F}, \mu)$. Then for any $0 < t < \infty$,
\[
\mu(\{f \ge t\}) \le \frac{\int f\,d\mu}{t}. \tag{1.1}
\]
Proof: Since $f$ is nonnegative,
\[
\int f\,d\mu \ge \int_{\{f \ge t\}} f\,d\mu \ge t\,\mu(\{f \ge t\}). \qquad \Box
\]

Corollary 3.1.2: Let $X$ be a random variable on a probability space $(\Omega, \mathcal{F}, P)$. Then, for $r > 0$, $t > 0$,
\[
P(|X| \ge t) \le \frac{E|X|^r}{t^r}.
\]
Proof: Since $\{|X| \ge t\} = \{|X|^r \ge t^r\}$ for all $t > 0$, $r > 0$, this follows from (1.1). $\Box$

Corollary 3.1.3: (Chebychev's inequality). Let $X$ be a random variable with $EX^2 < \infty$, $E(X) = \mu$ and $\mathrm{Var}(X) = \sigma^2$. Then for any $0 < k < \infty$,
\[
P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}.
\]


Proof: Follows from Corollary 3.1.2 with $X$ replaced by $X - \mu$ and with $r = 2$, $t = k\sigma$. $\Box$

Corollary 3.1.4: Let $\phi : \mathbb{R}_+ \to \mathbb{R}_+$ be nondecreasing. Then for any random variable $X$ and $0 < t < \infty$,
\[
P(|X| \ge t) \le \frac{E\phi(|X|)}{\phi(t)}.
\]
Proof: Use (1.1) and the fact that $|X| \ge t \Rightarrow \phi(|X|) \ge \phi(t)$. $\Box$

Corollary 3.1.5: (Cramer's inequality). For any random variable $X$ and $t > 0$,
\[
P(X \ge t) \le \inf_{\theta > 0} \frac{E(e^{\theta X})}{e^{\theta t}}.
\]
Proof: For $t > 0$, $\theta > 0$,
\[
P(X \ge t) = P\bigl(e^{\theta X} \ge e^{\theta t}\bigr) \le \frac{E(e^{\theta X})}{e^{\theta t}},
\]
by (1.1). $\Box$
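The tail bounds above can be compared with an exact tail probability in a small case. The following is a minimal sketch under an assumed example (a fair die, which is not taken from the text); all function names are hypothetical, and the Cramer bound is only minimized over a finite grid of $\theta$, so it is an upper bound on the infimum in Corollary 3.1.5 rather than the infimum itself.

```python
import math
from fractions import Fraction

VALUES = [1, 2, 3, 4, 5, 6]   # assumed example: a fair die
PROB = Fraction(1, 6)


def expect(h):
    """E h(X) for the fair die."""
    return sum(PROB * h(x) for x in VALUES)


def tail(t):
    """P(X >= t)."""
    return sum((PROB for x in VALUES if x >= t), Fraction(0))


def moment_bound(t, r=1):
    """Right side of Corollary 3.1.2, E|X|^r / t^r (r a positive integer here)."""
    return expect(lambda x: Fraction(abs(x)) ** r) / Fraction(t) ** r


def chebyshev_sides(k):
    """Return (P(|X - mu| >= k sigma), 1/k^2), comparing squares to stay exact."""
    mu = expect(lambda x: Fraction(x))
    var = expect(lambda x: (x - mu) ** 2)
    lhs = sum((PROB for x in VALUES if (x - mu) ** 2 >= k * k * var), Fraction(0))
    return lhs, 1 / Fraction(k) ** 2


def cramer_bound(t, thetas):
    """Minimum of E exp(theta (X - t)) over a finite grid of theta > 0."""
    return min(sum(math.exp(th * (x - t)) for x in VALUES) / 6 for th in thetas)
```

For instance, `tail(5)` can be compared with `moment_bound(5)` and `moment_bound(5, 2)` to see how a higher moment tightens the bound in this example.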

Definition 3.1.1: A function $\phi : (a, b) \to \mathbb{R}$ is called convex if for all $0 \le \lambda \le 1$, $a < x \le y < b$,
\[
\phi(\lambda x + (1 - \lambda) y) \le \lambda \phi(x) + (1 - \lambda) \phi(y). \tag{1.2}
\]
Geometrically, this means that for the graph of $y = \phi(x)$ on $(a, b)$, for each fixed $t \in (0, \infty)$, the chord over the interval $(x, x + t)$ turns in the counterclockwise direction as $x$ increases. More precisely, the following result holds.

Proposition 3.1.6: Let $\phi : (a, b) \to \mathbb{R}$. Then $\phi$ is convex iff for all $a < x_1 < x_2 < x_3 < b$,
\[
\frac{\phi(x_2) - \phi(x_1)}{x_2 - x_1} \le \frac{\phi(x_3) - \phi(x_2)}{x_3 - x_2}, \tag{1.3}
\]
which is equivalent to
\[
\frac{\phi(x_2) - \phi(x_1)}{x_2 - x_1} \le \frac{\phi(x_3) - \phi(x_1)}{x_3 - x_1} \le \frac{\phi(x_3) - \phi(x_2)}{x_3 - x_2}. \tag{1.4}
\]
Proof: Let $\phi$ be convex and $a < x_1 < x_2 < x_3 < b$. Then one can write $x_2 = \lambda x_1 + (1 - \lambda) x_3$ with $\lambda = \frac{(x_3 - x_2)}{(x_3 - x_1)}$. So by (1.2),
\[
\phi(x_2) = \phi(\lambda x_1 + (1 - \lambda) x_3) \le \lambda \phi(x_1) + (1 - \lambda) \phi(x_3)
= \frac{(x_3 - x_2)}{(x_3 - x_1)} \phi(x_1) + \frac{(x_2 - x_1)}{(x_3 - x_1)} \phi(x_3),
\]


which is equivalent to (1.3). Also, since
\[
\frac{\phi(x_3) - \phi(x_1)}{(x_3 - x_1)} = \lambda\, \frac{\phi(x_3) - \phi(x_2)}{(x_3 - x_2)} + (1 - \lambda)\, \frac{\phi(x_2) - \phi(x_1)}{(x_2 - x_1)},
\]
(1.4) follows from (1.3). Conversely, suppose (1.4) holds for all $a < x_1 < x_2 < x_3 < b$. Then given $a < x < y < b$ and $0 < \lambda < 1$, set $x_1 = x$, $x_2 = \lambda x + (1 - \lambda) y$, $x_3 = y$ and apply (1.4) to verify (1.2). $\Box$

The following properties of a convex function are direct consequences of (1.3). The proof is left as an exercise (Problem 3.1).

Proposition 3.1.7: Let $\phi : (a, b) \to \mathbb{R}$ be convex. Then,
(i) For each $x \in (a, b)$,
\[
\phi'_+(x) \equiv \lim_{y \downarrow x} \frac{\phi(y) - \phi(x)}{(y - x)}, \qquad
\phi'_-(x) \equiv \lim_{y \uparrow x} \frac{\phi(y) - \phi(x)}{(y - x)}
\]
exist and are finite.
(ii) Further, $\phi'_-(\cdot) \le \phi'_+(\cdot)$ and both are nondecreasing on $(a, b)$.
(iii) $\phi'(\cdot)$ exists except on the countable set of discontinuity points of $\phi'_+$ and $\phi'_-$.
(iv) For any $a < c < d < b$, $\phi$ is Lipschitz on $[c, d]$, i.e., there exists a constant $K < \infty$ such that
\[
|\phi(x) - \phi(y)| \le K|x - y| \quad \text{for all} \quad c \le x, y \le d.
\]
(v) For any $a < c, x < b$,
\[
\phi(x) - \phi(c) \ge \phi'_+(c)(x - c) \quad \text{and} \quad \phi(x) - \phi(c) \ge \phi'_-(c)(x - c). \tag{1.5}
\]

By the mean value theorem, a sufficient condition for (1.3) and hence, for the convexity of $\phi$, is that $\phi$ be differentiable on $(a, b)$ and $\phi'$ be nondecreasing. A further sufficient condition for this is that $\phi$ be twice differentiable on $(a, b)$ and $\phi''$ be nonnegative. This is stated as

Proposition 3.1.8: Let $\phi$ be twice differentiable on $(a, b)$ and $\phi''$ be nonnegative on $(a, b)$. Then $\phi$ is convex on $(a, b)$.

Example 3.1.1: The following functions are convex in the given intervals:
(a) $\phi(x) = |x|^p$, $p \ge 1$, on $(-\infty, \infty)$.
(b) $\phi(x) = e^x$, on $(-\infty, \infty)$.


(c) $\phi(x) = -\log x$, on $(0, \infty)$.
(d) $\phi(x) = x \log x$, on $(0, \infty)$.
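The chord-slope characterization (1.4) of Proposition 3.1.6 can be spot-checked numerically for the functions of Example 3.1.1. The sketch below is illustrative only: the grids, the exponent $p = 1.5$, the tolerance and the helper names are assumptions made here, and a finite grid check does not prove convexity.

```python
import math


def slope(phi, u, v):
    """Slope of the chord of phi over [u, v]."""
    return (phi(v) - phi(u)) / (v - u)


def chord_slopes_ordered(phi, x1, x2, x3, tol=1e-9):
    """Check (1.4) for x1 < x2 < x3, up to a small floating-point tolerance."""
    s12 = slope(phi, x1, x2)
    s13 = slope(phi, x1, x3)
    s23 = slope(phi, x2, x3)
    return s12 <= s13 + tol and s13 <= s23 + tol


# The four functions of Example 3.1.1, each with an assumed test interval.
EXAMPLES = {
    "|x|^p, p = 1.5": (lambda x: abs(x) ** 1.5, (-2.0, 2.0)),
    "exp(x)": (math.exp, (-2.0, 2.0)),
    "-log x": (lambda x: -math.log(x), (0.1, 4.0)),
    "x log x": (lambda x: x * math.log(x), (0.1, 4.0)),
}


def check_all(n=12):
    """Test (1.4) on every increasing triple of an (n + 1)-point grid."""
    results = {}
    for name, (phi, (lo, hi)) in EXAMPLES.items():
        grid = [lo + (hi - lo) * i / n for i in range(n + 1)]
        results[name] = all(
            chord_slopes_ordered(phi, grid[i], grid[j], grid[k])
            for i in range(n + 1)
            for j in range(i + 1, n + 1)
            for k in range(j + 1, n + 1)
        )
    return results
```

A non-convex function such as $\sin$ on $(0, \pi)$ fails the same check, e.g. at the triple $(0.5, 1.5, 2.5)$.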

Remark 3.1.1: By Proposition 3.1.7 (iii), the convexity of $\phi$ implies that $\phi'$ exists except on a countable set. For example, the function $\phi(x) = |x|$ is convex on $\mathbb{R}$; it is differentiable at all $x \ne 0$. Similarly, it is easy to construct a piecewise linear convex function $\phi$ with a countable number of points where $\phi$ is not differentiable.

The following is an important inequality for convex functions.

Theorem 3.1.9: (Jensen's inequality). Let $f$ be a measurable function on a probability space $(\Omega, \mathcal{F}, P)$ with $P(f \in (a, b)) = 1$ for some interval $(a, b)$, $-\infty \le a < b \le \infty$, and let $\phi : (a, b) \to \mathbb{R}$ be convex. Then
\[
\phi\Bigl(\int f\,dP\Bigr) \le \int \phi(f)\,dP, \tag{1.6}
\]
provided $\int |f|\,dP < \infty$ and $\int |\phi(f)|\,dP < \infty$.

Remark 3.1.2: In terms of random variables, this says that for any random variable $X$ on a probability space $(\Omega, \mathcal{F}, P)$ with $P(X \in (a, b)) = 1$ and for any function $\phi$ that is convex on $(a, b)$,
\[
\phi(EX) \le E\phi(X), \tag{1.7}
\]
provided $E|X| < \infty$ and $E|\phi(X)| < \infty$, where for any Borel measurable function $h$, $Eh(X) \equiv \int h(X)\,dP$.

Proof of Theorem 3.1.9: Let $c = \int f\,dP$. Applying (1.5), one gets
\[
Y(\omega) \equiv \phi(f(\omega)) - \phi(c) - \phi'_+(c)(f(\omega) - c) \ge 0 \quad \text{a.e. } (P), \tag{1.8}
\]
which, when integrated, yields $\int Y(\omega) P(d\omega) \ge 0$. Since $\int (f(\omega) - c) P(d\omega) = 0$, (1.6) follows. $\Box$

Remark 3.1.3: Suppose that equality holds in (1.6). Then, it follows that $\int Y(\omega) P(d\omega) = 0$. By (1.8), this implies
\[
\phi(f(\omega)) - \phi(c) = \phi'_+(c)(f(\omega) - c) \quad \text{a.e. } (P).
\]
Thus, if $\phi$ is a strictly convex function (i.e., strict inequality holds in (1.2) for all $x, y \in (a, b)$, $x \ne y$, and $0 < \lambda < 1$), then equality holds in (1.6) iff $f(\omega) = c$ a.e. $(P)$.

The following are easy consequences of Jensen's inequality (Problem 3.3).

Proposition 3.1.10: Let $k \ge 1$ be an integer.


(i) Let $a_1, a_2, \ldots, a_k$ be real and $p_1, p_2, \ldots, p_k$ be positive numbers such that $\sum_{i=1}^{k} p_i = 1$. Then
\[
\sum_{i=1}^{k} p_i \exp(a_i) \ge \exp\Bigl(\sum_{i=1}^{k} p_i a_i\Bigr). \tag{1.9}
\]
(ii) Let $b_1, b_2, \ldots, b_k$ be nonnegative numbers and $p_1, p_2, \ldots, p_k$ be as in (i). Then
\[
\sum_{i=1}^{k} p_i b_i \ge \prod_{i=1}^{k} b_i^{p_i}, \tag{1.10}
\]
and in particular,
\[
\frac{1}{k} \sum_{i=1}^{k} b_i \ge \Bigl(\prod_{i=1}^{k} b_i\Bigr)^{\frac{1}{k}}, \tag{1.11}
\]
i.e., the arithmetic mean of $b_1, \ldots, b_k$ is greater than or equal to the geometric mean of the $b_i$'s. Further, equality holds in (1.10) iff $b_1 = b_2 = \ldots = b_k$.
(iii) For any $a, b$ real and $1 \le p < \infty$,
\[
|a + b|^p \le 2^{p-1} (|a|^p + |b|^p). \tag{1.12}
\]
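Inequalities (1.10) and (1.12) are easy to probe numerically. The sketch below is only an illustration: the helper names are hypothetical, the weights are assumed to be positive and to sum to one as in part (i), and floating-point arithmetic means equality cases are only reproduced up to rounding.

```python
import math


def weighted_am_gm_gap(b, p):
    """Left side minus right side of (1.10): sum p_i b_i - prod b_i^{p_i}."""
    arithmetic = sum(pi * bi for pi, bi in zip(p, b))
    geometric = math.prod(bi ** pi for pi, bi in zip(p, b))
    return arithmetic - geometric


def power_sum_gap(a, b, p):
    """Right side minus left side of (1.12): 2^{p-1}(|a|^p + |b|^p) - |a + b|^p."""
    return 2 ** (p - 1) * (abs(a) ** p + abs(b) ** p) - abs(a + b) ** p
```

Both gaps should be nonnegative; the first vanishes (up to rounding) when all the $b_i$ coincide, and the second vanishes when $a = b$.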

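Inequality (1.10) is useful in establishing the following:

Theorem 3.1.11: (Hölder's inequality). Let $(\Omega, \mathcal{F}, \mu)$ be a measure space. Let $1 < p < \infty$, $f \in L^p(\Omega, \mathcal{F}, \mu)$ and $g \in L^q(\Omega, \mathcal{F}, \mu)$ where $q = \frac{p}{(p-1)}$. Then
\[
\int |fg|\,d\mu \le \Bigl(\int |f|^p\,d\mu\Bigr)^{\frac{1}{p}} \Bigl(\int |g|^q\,d\mu\Bigr)^{\frac{1}{q}}, \tag{1.13}
\]
i.e.,
\[
\|fg\|_1 \le \|f\|_p \|g\|_q.
\]
If $\|fg\|_1 \ne 0$, then equality holds in (1.13) iff $|f|^p = c|g|^q$ a.e. $(\mu)$ for some constant $c \in (0, \infty)$.

Proof: W.l.o.g. assume that $\int |f|^p\,d\mu > 0$ and $\int |g|^q\,d\mu > 0$. Fix $\omega \in \Omega$. Let $p_1 = \frac{1}{p}$, $p_2 = \frac{1}{q}$, $b_1 = c_1 |f(\omega)|^p$, and $b_2 = c_2 |g(\omega)|^q$, where $c_1 = (\int |f|^p\,d\mu)^{-1}$ and $c_2 = (\int |g|^q\,d\mu)^{-1}$. Then applying (1.10) with $k = 2$ yields
\[
\frac{c_1}{p} |f(\omega)|^p + \frac{c_2}{q} |g(\omega)|^q \ge c_1^{\frac{1}{p}} c_2^{\frac{1}{q}} |f(\omega) g(\omega)|. \tag{1.14}
\]
Integrating w.r.t. $\mu$ yields
\[
1 \ge c_1^{\frac{1}{p}} c_2^{\frac{1}{q}} \int |f(\omega) g(\omega)|\,d\mu(\omega),
\]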


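which is equivalent to (1.13). Next, equality in (1.13) implies equality in (1.14) a.e. $(\mu)$. Since $1 < p < \infty$, by the last part of Proposition 3.1.10 (ii), this implies that $b_1 = b_2$ a.e. $(\mu)$, i.e., $|f(\omega)|^p = c_2 c_1^{-1} |g(\omega)|^q$ a.e. $(\mu)$. $\Box$

Remark 3.1.4: (Hölder's inequality for $p = 1$, $q = \infty$). Let $f \in L^1(\Omega, \mathcal{F}, \mu)$ and $g \in L^\infty(\Omega, \mathcal{F}, \mu)$. Then $|fg| \le |f|\,\|g\|_\infty$ a.e. $(\mu)$ and hence,
\[
\|fg\|_1 \equiv \int |fg|\,d\mu \le \|f\|_1 \|g\|_\infty.
\]
If equality holds in the above inequality, then $|f|(\|g\|_\infty - |g|) = 0$ a.e. $(\mu)$ and hence, $|g| = \|g\|_\infty$ on the set $\{|f| > 0\}$ a.e. $(\mu)$.

The next two corollaries follow directly from Theorem 3.1.11. The proof is left as an exercise (Problem 3.9).

Corollary 3.1.12: (Cauchy-Schwarz inequality). Let $f, g \in L^2(\Omega, \mathcal{F}, \mu)$. Then
\[
\int |fg|\,d\mu \le \Bigl(\int |f|^2\,d\mu\Bigr)^{\frac{1}{2}} \Bigl(\int |g|^2\,d\mu\Bigr)^{\frac{1}{2}}, \tag{1.15}
\]
i.e.,
\[
\|fg\|_1 \le \|f\|_2 \|g\|_2.
\]

Corollary 3.1.13: Let $k \in \mathbb{N}$. Let $a_1, a_2, \ldots, a_k$, $b_1, b_2, \ldots, b_k$ be real numbers and $c_1, c_2, \ldots, c_k$ be positive real numbers.
(i) Then, for any $1 < p < \infty$,
\[
\sum_{i=1}^{k} |a_i b_i| c_i \le \Bigl(\sum_{i=1}^{k} |a_i|^p c_i\Bigr)^{\frac{1}{p}} \Bigl(\sum_{i=1}^{k} |b_i|^q c_i\Bigr)^{\frac{1}{q}}, \tag{1.16}
\]
where $q = \frac{p}{p-1}$.
(ii)
\[
\sum_{i=1}^{k} |a_i b_i| c_i \le \Bigl(\sum_{i=1}^{k} |a_i|^2 c_i\Bigr)^{\frac{1}{2}} \Bigl(\sum_{i=1}^{k} |b_i|^2 c_i\Bigr)^{\frac{1}{2}}. \tag{1.17}
\]

Next, as an application of Hölder's inequality, one gets

Theorem 3.1.14: (Minkowski's inequality). Let $1 < p < \infty$ and $f, g \in L^p(\Omega, \mathcal{F}, \mu)$. Then
\[
\Bigl(\int |f + g|^p\,d\mu\Bigr)^{\frac{1}{p}} \le \Bigl(\int |f|^p\,d\mu\Bigr)^{\frac{1}{p}} + \Bigl(\int |g|^p\,d\mu\Bigr)^{\frac{1}{p}}, \tag{1.18}
\]
i.e.,
\[
\|f + g\|_p \le \|f\|_p + \|g\|_p.
\]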


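Proof: Let $h_1 = |f + g|$, $h_2 = |f + g|^{p-1}$. Then by (1.12), $|f + g|^p \le 2^{p-1}(|f|^p + |g|^p)$, implying that $h_1 \in L^p(\Omega, \mathcal{F}, \mu)$ and $h_2 \in L^q(\Omega, \mathcal{F}, \mu)$, where $q = \frac{p}{(p-1)}$. Since $|f + g|^p = h_1 h_2 \le |f| h_2 + |g| h_2$, by Hölder's inequality,
\[
\int |f + g|^p\,d\mu \le \Bigl(\int |f|^p\,d\mu\Bigr)^{\frac{1}{p}} \Bigl(\int h_2^q\,d\mu\Bigr)^{\frac{1}{q}} + \Bigl(\int |g|^p\,d\mu\Bigr)^{\frac{1}{p}} \Bigl(\int h_2^q\,d\mu\Bigr)^{\frac{1}{q}}. \tag{1.19}
\]
But $h_2^q = |f + g|^p$. If $\int |f + g|^p\,d\mu = 0$, (1.18) is trivial; otherwise, dividing both sides of (1.19) by $(\int |f + g|^p\,d\mu)^{\frac{1}{q}}$ and using $1 - \frac{1}{q} = \frac{1}{p}$ yields (1.18). $\Box$

Remark 3.1.5: Inequality (1.18) holds for both $p = 1$ and $p = \infty$.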
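On a finite set with positive weights $c_i$, as in Corollary 3.1.13, both sides of Hölder's and Minkowski's inequalities can be evaluated directly. The sketch below is a numerical illustration under that finite, weighted setting; the function names are hypothetical and not part of the text.

```python
def weighted_norm(a, c, p):
    """(sum |a_i|^p c_i)^(1/p): the L^p norm for the measure with weights c_i."""
    return sum(abs(ai) ** p * ci for ai, ci in zip(a, c)) ** (1.0 / p)


def holder_sides(a, b, c, p):
    """Return (left, right) sides of (1.16) with q = p / (p - 1)."""
    q = p / (p - 1)
    left = sum(abs(ai * bi) * ci for ai, bi, ci in zip(a, b, c))
    right = weighted_norm(a, c, p) * weighted_norm(b, c, q)
    return left, right


def minkowski_sides(a, b, c, p):
    """Return (left, right) sides of (1.18) for the weighted counting measure."""
    left = weighted_norm([ai + bi for ai, bi in zip(a, b)], c, p)
    right = weighted_norm(a, c, p) + weighted_norm(b, c, p)
    return left, right
```

Taking $p = 2$ and $b = a$ gives the equality case of (1.17), since then $|a_i|^2$ is a constant multiple of $|b_i|^2$.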

3.2 Lp-Spaces

3.2.1 Basic properties

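Let $(\Omega, \mathcal{F}, \mu)$ be a measure space. Recall the definition of $L^p(\Omega, \mathcal{F}, \mu)$, $0 < p \le \infty$, from Section 2.5 as the set of all measurable functions $f$ on $(\Omega, \mathcal{F}, \mu)$ such that $\|f\|_p < \infty$, where for $0 < p < \infty$,
\[
\|f\|_p = \Bigl(\int |f|^p\,d\mu\Bigr)^{\min\{\frac{1}{p}, 1\}}
\]
and for $p = \infty$,
\[
\|f\|_\infty \equiv \sup\{k : \mu(\{|f| > k\}) > 0\}
\]
(called the essential supremum of $f$). In this section and elsewhere, $L^p(\mu)$ denotes $L^p(\Omega, \mathcal{F}, \mu)$. The following proposition shows that $L^p(\mu)$ is a vector space over $\mathbb{R}$.

Proposition 3.2.1: For $0 < p \le \infty$,
\[
f, g \in L^p(\mu),\ a, b \in \mathbb{R} \ \Rightarrow\ af + bg \in L^p(\mu). \tag{2.1}
\]
Proof: Case 1: $0 < p \le 1$. For any two positive numbers $x, y$,
\[
\Bigl(\frac{x}{x + y}\Bigr)^p + \Bigl(\frac{y}{x + y}\Bigr)^p \ge \frac{x}{x + y} + \frac{y}{x + y} = 1.
\]
Hence, for all $x, y \in (0, \infty)$,
\[
(x + y)^p \le x^p + y^p. \tag{2.2}
\]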


It is easy to check that (2.2) continues to hold if $x, y \in [0, \infty)$. This yields $|af + bg|^p \le |a|^p |f|^p + |b|^p |g|^p$, which, in turn, yields (2.1) by integration.
Case 2: $1 < p < \infty$. By (1.12), $|af + bg|^p \le 2^{p-1}(|af|^p + |bg|^p)$. Integrating both sides of the above inequality yields (2.1).
Case 3: $p = \infty$. By definition, there exist constants $K_1 < \infty$ and $K_2 < \infty$ such that $\mu(\{|f| > K_1\}) = 0 = \mu(\{|g| > K_2\})$. This implies that $\mu(\{|af + bg| > K\}) = 0$ for any $K > |a| K_1 + |b| K_2$. Hence, $af + bg \in L^\infty(\mu)$ and $\|af + bg\|_\infty \le |a|\,\|f\|_\infty + |b|\,\|g\|_\infty$. $\Box$

Recall that a set $S$ with a function $d : S \times S \to [0, \infty]$ is called a metric space if for all $x, y, z \in S$,
(i) $d(x, y) = d(y, x)$ (symmetry)
(ii) $d(x, y) \le d(x, z) + d(y, z)$ (triangle inequality)
(iii) $d(x, y) = 0$ iff $x = y$
and the function $d(\cdot, \cdot)$ is called a metric on $S$. Some examples are
(a) $S = \mathbb{R}^k$ with $d(x, y) = \bigl(\sum_{i=1}^{k} |x_i - y_i|^2\bigr)^{\frac{1}{2}}$;
(b) $S = C[0, 1]$, the space of all continuous functions on $[0, 1]$ with $d(f, g) = \sup\{|f(x) - g(x)| : 0 \le x \le 1\}$;
(c) $S$ = a nonempty set, and $d(x, y) = 1$ if $x \ne y$ and $0$ if $x = y$. (This $d(\cdot, \cdot)$ is called the discrete metric on $S$.)

The $L^p$-norm $\|\cdot\|_p$ can be used to introduce a distance notion in $L^p(\mu)$ for $0 < p \le \infty$.

Definition 3.2.1: For $f, g \in L^p(\mu)$, $0 < p \le \infty$, let
\[
d_p(f, g) \equiv \|f - g\|_p. \tag{2.3}
\]
Note that, for any $f, g, h \in L^p(\mu)$ and $1 \le p \le \infty$,
(i) $d_p(f, g) = d_p(g, f) \ge 0$ (nonnegativity and symmetry), and
(ii) $d_p(f, h) \le d_p(f, g) + d_p(g, h)$ (triangle inequality),


which follows by Minkowski's inequality (Theorem 3.1.14) for $1 \le p < \infty$, and by Proposition 3.2.1 for $p = \infty$. However, $d_p(f, g) = 0$ implies only that $f = g$ a.e. $(\mu)$. Thus, $d_p(\cdot, \cdot)$ of (2.3) satisfies conditions (i) and (ii) of being a metric and it satisfies condition (iii) as well, provided any two functions $f$ and $g$ that agree a.e. $(\mu)$ are regarded as the same element of $L^p(\mu)$. This leads to the following:

Definition 3.2.2: For $f, g$ in $L^p(\mu)$, $f$ is called equivalent to $g$, written $f \sim g$, if $f = g$ a.e. $(\mu)$.

It is easy to verify that the relation $\sim$ of Definition 3.2.2 is an equivalence relation, i.e., it satisfies
(i) $f \sim f$ for all $f$ in $L^p(\mu)$ (reflexive)
(ii) $f \sim g \Rightarrow g \sim f$ (symmetry)
(iii) $f \sim g$, $g \sim h \Rightarrow f \sim h$ (transitive).

This equivalence relation $\sim$ divides $L^p(\mu)$ into disjoint equivalence classes such that in each class all elements are equivalent. The notion of distance between these classes may be defined as follows:
\[
d_p([f], [g]) \equiv d_p(f, g),
\]
where $[f]$ and $[g]$ denote, respectively, the equivalence classes of functions containing $f$ and $g$. It can be verified that this is a metric on the set of equivalence classes. In what follows, the equivalence class $[f]$ is identified with the element $f$. With this identification, $(L^p(\mu), d_p(\cdot, \cdot))$ becomes a metric space for $1 \le p \le \infty$.

Remark 3.2.1: For $0 < p < 1$, if one defines
\[
d_p(f, g) \equiv \int |f - g|^p\,d\mu, \tag{2.4}
\]
then $(L^p(\mu), d_p)$ becomes a metric space (with the same identification as above of functions with their equivalence classes). The triangle inequality follows from (2.2).

Recall that a metric space $(S, d)$ is called complete if every Cauchy sequence in $(S, d)$ converges to an element in $S$, i.e., if $\{x_n\}_{n \ge 1}$ is a sequence in $S$ such that for every $\epsilon > 0$, there exists an $N_\epsilon$ such that $n, m \ge N_\epsilon \Rightarrow d(x_n, x_m) \le \epsilon$, then there exists an element $x$ in $S$ such that $\lim_{n \to \infty} d(x_n, x) = 0$. The next step is to establish the completeness of $L^p(\mu)$.

Theorem 3.2.2: For $0 < p \le \infty$, $(L^p(\mu), d_p(\cdot, \cdot))$ is complete, where $d_p$ is as in (2.3).


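Proof: Let $\{f_n\}_{n \ge 1}$ be a Cauchy sequence in $L^p(\mu)$ for $0 < p < \infty$. The main steps in the proof are as follows:
(I) there exists a subsequence $\{n_k\}_{k \ge 1}$ such that $\{f_{n_k}\}_{k \ge 1}$ converges a.e. $(\mu)$ to a limit function $f$;
(II) $\lim_{k \to \infty} d_p(f_{n_k}, f) = 0$;
(III) $\lim_{n \to \infty} d_p(f_n, f) = 0$.

Step (I): Let $\{\epsilon_k\}_{k \ge 1}$ and $\{\delta_k\}_{k \ge 1}$ be sequences of positive numbers decreasing to zero. Since $\{f_n\}_{n \ge 1}$ is Cauchy, for each $k \ge 1$, there exists an integer $n_k$ such that
\[
\int |f_n - f_m|^p\,d\mu \le \epsilon_k \quad \text{for all } n, m \ge n_k. \tag{2.5}
\]
W.l.o.g., let $n_{k+1} > n_k$ for each $k \ge 1$. Then, by Markov's inequality (Theorem 3.1.1),
\[
\mu(\{|f_{n_{k+1}} - f_{n_k}| \ge \delta_k\}) \le \delta_k^{-p} \int |f_{n_{k+1}} - f_{n_k}|^p\,d\mu \le \delta_k^{-p} \epsilon_k. \tag{2.6}
\]
Let $A_k = \{|f_{n_{k+1}} - f_{n_k}| \ge \delta_k\}$, $k = 1, 2, \ldots$ and $A = \limsup_{k \to \infty} A_k \equiv \bigcap_{j=1}^{\infty} \bigcup_{k \ge j} A_k$. If $\{\epsilon_k\}_{k \ge 1}$ and $\{\delta_k\}_{k \ge 1}$ satisfy
\[
\sum_{k=1}^{\infty} \delta_k^{-p} \epsilon_k < \infty, \tag{2.7}
\]
then by (2.6), $\sum_{k=1}^{\infty} \mu(A_k) < \infty$ and hence, as in the proof of Theorem 2.5.2, $\mu(A) = 0$. Note that for $\omega$ in $A^c$, $|f_{n_{k+1}}(\omega) - f_{n_k}(\omega)| < \delta_k$ for all $k$ large. Thus, if $\sum_{k=1}^{\infty} \delta_k < \infty$, then for $\omega$ in $A^c$, $\{f_{n_k}(\omega)\}_{k \ge 1}$ is a Cauchy sequence in $\mathbb{R}$ and hence, it converges to some $f(\omega)$ in $\mathbb{R}$. Setting $f(\omega) = 0$ for $\omega \in A$, one gets
\[
\lim_{k \to \infty} f_{n_k} = f \quad \text{a.e. } (\mu).
\]
A choice of $\{\epsilon_k\}_{k \ge 1}$ and $\{\delta_k\}_{k \ge 1}$ such that $\sum_{k=1}^{\infty} \delta_k < \infty$ and (2.7) holds is given by $\epsilon_k = 2^{-(p+1)k}$ and $\delta_k = 2^{-k}$. This completes Step (I).

Step (II): By Fatou's lemma, part (I), and (2.5), for any $k \ge 1$ fixed,
\[
\epsilon_k \ge \liminf_{j \to \infty} \int |f_{n_k} - f_{n_{k+j}}|^p\,d\mu \ge \int |f_{n_k} - f|^p\,d\mu.
\]
Since $f_{n_k} \in L^p(\mu)$, this shows that $f \in L^p(\mu)$. Now, on letting $k \to \infty$, (II) follows.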


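Step (III): By the triangle inequality, for any $k \ge 1$ fixed, $d_p(f_n, f) \le d_p(f_n, f_{n_k}) + d_p(f_{n_k}, f)$. By (2.5) and (II), for $n \ge n_k$, the right side above is $\le 2\tilde\epsilon_k$, where, in view of the exponent $\min\{\frac{1}{p}, 1\}$ in the definition of $\|\cdot\|_p$, $\tilde\epsilon_k = \epsilon_k$ if $0 < p < 1$ and $\tilde\epsilon_k = \epsilon_k^{1/p}$ if $1 \le p < \infty$. Now letting $k \to \infty$, (III) follows. The proof of Theorem 3.2.2 is complete for $0 < p < \infty$. The case $p = \infty$ is left as an exercise (Problem 3.14). $\Box$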
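Step (I) passes to a subsequence because an $L^p$-Cauchy sequence need not converge pointwise. A standard illustration is the "typewriter" sequence of indicators $g_{m,k} = I_{[(k-1)/m,\, k/m]}$ from the hint to Problem 2.46, sketched below under Lebesgue measure on $[0, 1]$; the function names and the ordering by rows are choices made here for illustration.

```python
from fractions import Fraction


def typewriter(n_rows):
    """Yield the index pairs (m, k) in the order (1,1), (2,1), (2,2), (3,1), ..."""
    for m in range(1, n_rows + 1):
        for k in range(1, m + 1):
            yield m, k


def g(m, k, x):
    """Indicator of the interval [(k-1)/m, k/m] evaluated at x."""
    return 1 if Fraction(k - 1, m) <= x <= Fraction(k, m) else 0


def l1_norm(m, k):
    """Integral of g_{m,k} over [0, 1]: the length 1/m of the interval."""
    return Fraction(1, m)


def values_at(x, n_rows):
    """Values of the reordered sequence at the point x, row by row."""
    return [g(m, k, x) for m, k in typewriter(n_rows)]
```

The norms $1/m$ tend to zero, yet at a fixed point such as $x = 1/3$ every row $m \ge 3$ contains both the values 0 and 1, so the full sequence does not converge there. The subsequence $g_{m,1}$, by contrast, converges to 0 at every $x > 0$.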

3.2.2 Dual spaces

Let $1 \le p < \infty$. Let $g \in L^q(\mu)$, where $q = \frac{p}{(p-1)}$ if $1 < p < \infty$ and $q = \infty$ if $p = 1$. Let
\[
T_g(f) = \int fg\,d\mu, \quad f \in L^p(\mu). \tag{2.8}
\]
By Hölder's inequality, $\int |fg|\,d\mu < \infty$ and so $T_g(\cdot)$ is well defined. Clearly $T_g$ is linear, i.e.,
\[
T_g(\alpha_1 f_1 + \alpha_2 f_2) = \alpha_1 T_g(f_1) + \alpha_2 T_g(f_2) \tag{2.9}
\]
for all $\alpha_1, \alpha_2 \in \mathbb{R}$ and $f_1, f_2 \in L^p(\mu)$.

Definition 3.2.3:
(a) A function $T : L^p(\mu) \to \mathbb{R}$ that satisfies (2.9) is called a linear functional.
(b) A linear functional $T$ on $L^p(\mu)$ is called bounded if there is a constant $c \in (0, \infty)$ such that
\[
|T(f)| \le c \|f\|_p \quad \text{for all } f \in L^p(\mu).
\]
(c) The norm of a bounded linear functional $T$ on $L^p(\mu)$ is defined as
\[
\|T\| = \sup\bigl\{|Tf| : f \in L^p(\mu),\ \|f\|_p = 1\bigr\}.
\]
By Hölder's inequality (cf. Theorem 3.1.11 and Remark 3.1.4),
\[
|T_g(f)| \le \|g\|_q \|f\|_p \quad \text{for all } f \in L^p(\mu),
\]
and hence, $T_g$ is a bounded linear functional on $L^p(\mu)$. This implies that if $d_p(f_n, f) \to 0$, then
\[
|T_g(f_n) - T_g(f)| \le \|g\|_q\, d_p(f_n, f) \to 0,
\]
i.e., $T_g$ is continuous on the metric space $(L^p(\mu), d_p)$.


Definition 3.2.4: The set of all continuous linear functionals on $L^p(\mu)$ is called the dual space of $L^p(\mu)$ and is denoted by $(L^p(\mu))^*$.

In the next section, it will be shown that continuity of a linear functional on $L^p(\mu)$ implies boundedness. A natural question is whether every continuous linear functional $T$ on $L^p(\mu)$ coincides with $T_g$ for some $g$ in $L^q(\mu)$. The answer is "yes" for $1 \le p < \infty$, as shown by the following result.

Theorem 3.2.3: (Riesz representation theorem). Let $1 \le p < \infty$. Let $T : L^p(\mu) \to \mathbb{R}$ be linear and continuous. Then, there exists a $g$ in $L^q(\mu)$ such that $T = T_g$, i.e.,
\[
T(f) = T_g(f) \equiv \int fg\,d\mu \quad \text{for all } f \in L^p(\mu), \tag{2.10}
\]
where $q = \frac{p}{p-1}$ for $1 < p < \infty$ and $q = \infty$ if $p = 1$.

Remark 3.2.2: Such a representation is not valid for $p = \infty$. That is, there exist continuous linear functionals $T$ on $L^\infty(\mu)$ for which there is no $g \in L^1(\mu)$ such that $T(f) = \int fg\,d\mu$ for all $f \in L^\infty(\mu)$, provided $\mu$ is not concentrated on a finite set $\{\omega_1, \omega_2, \ldots, \omega_k\} \subset \Omega$. For a proof of Theorem 3.2.3 and the above remark, see Royden (1988) or Rudin (1987).

Next consider the mapping from $(L^p(\mu))^*$ to $L^q(\mu)$ defined by $\phi(T_g) = g$, where $T_g$ is as defined in (2.10). Then, $\phi$ is linear, i.e.,
\[
\phi(\alpha_1 T_1 + \alpha_2 T_2) = \alpha_1 \phi(T_1) + \alpha_2 \phi(T_2)
\]
for all $\alpha_1, \alpha_2 \in \mathbb{R}$ and $T_1, T_2 \in (L^p(\mu))^*$. Further,
\[
\|\phi(T)\|_q = \|T\| \quad \text{for all } T \in (L^p(\mu))^*.
\]
Thus, $\phi$ preserves the vector space structure of $(L^p(\mu))^*$ and the norm. For this reason, it is called an isometry between $(L^p(\mu))^*$ and $L^q(\mu)$.
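The norm identity $\|T_g\| = \|g\|_q$ can be seen concretely when $\mu$ is counting measure on a finite set and $1 < p < \infty$. The sketch below is only an illustration under that assumption: the extremal choice $f = \mathrm{sign}(g)\,|g|^{q-1}$, normalized in $L^p$, is the standard one, while the function names are hypothetical and $g$ is assumed to be nonzero.

```python
import math


def lp_norm(v, p):
    """(sum |v_i|^p)^(1/p): the L^p norm under counting measure on a finite set."""
    return sum(abs(t) ** p for t in v) ** (1.0 / p)


def T(g, f):
    """T_g(f) = sum f_i g_i, the finite-set analogue of (2.8)."""
    return sum(fi * gi for fi, gi in zip(f, g))


def extremal_f(g, p):
    """f = sign(g)|g|^(q-1), scaled so that ||f||_p = 1 (g assumed nonzero)."""
    q = p / (p - 1)
    raw = [math.copysign(abs(t) ** (q - 1), t) for t in g]
    scale = lp_norm(raw, p)
    return [t / scale for t in raw]
```

With this choice $T_g(f)$ equals $\|g\|_q$ up to rounding, so the bound $|T_g(f)| \le \|g\|_q \|f\|_p$ is attained on the unit sphere of $L^p$ in this finite setting.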

3.3 Banach and Hilbert spaces

3.3.1 Banach spaces

If $(\Omega, \mathcal{F}, \mu)$ is a measure space, it was seen in the previous section that the space $L^p(\Omega, \mathcal{F}, \mu)$ of equivalence classes of functions $f$ with $\int_\Omega |f|^p\,d\mu < \infty$ is a vector space over $\mathbb{R}$ for all $1 \le p < \infty$, and that for $p \ge 1$, $\|f\|_p \equiv (\int |f|^p\,d\mu)^{1/p}$ satisfies


(i) $\|f + g\| \le \|f\| + \|g\|$,
(ii) $\|\alpha f\| = |\alpha|\,\|f\|$ for every $\alpha \in \mathbb{R}$,
(iii) $\|f\| = 0$ iff $f = 0$ a.e. $(\mu)$.

The Euclidean space $\mathbb{R}^k$ for any $k \in \mathbb{N}$ is also a vector space. Note that for $p \ge 1$, setting $\|x\|_p \equiv (\sum_{i=1}^{k} |x_i|^p)^{1/p}$ if $x = (x_1, x_2, \ldots, x_k)$, $(\mathbb{R}^k, \|\cdot\|_p)$ may be identified with a special case of $L^p(\Omega, \mathcal{F}, \mu)$, where $\Omega \equiv \{1, 2, \ldots, k\}$, $\mathcal{F} = \mathcal{P}(\Omega)$ and $\mu$ is the counting measure. Generalizing the above examples leads to the notion of a normed vector space (also called normed linear space).

Recall that a vector space $V$ over $\mathbb{R}$ is a nonempty set with a binary operation $+$, a function from $V \times V$ to $V$ (called addition), and scalar multiplication by the real numbers, i.e., a function from $\mathbb{R} \times V \to V$, $(\alpha, v) \to \alpha v$, satisfying
(i) $v_1, v_2 \in V \Rightarrow v_1 + v_2 = v_2 + v_1 \in V$.
(ii) $v_1, v_2, v_3 \in V \Rightarrow (v_1 + v_2) + v_3 = v_1 + (v_2 + v_3)$.
(iii) There exists an element $\theta$, called the zero vector, in $V$ such that $v + \theta = v$ for all $v$ in $V$.
(iv) $\alpha \in \mathbb{R}$, $v \in V \Rightarrow \alpha v \in V$.
(v) $\alpha \in \mathbb{R}$, $v_1, v_2 \in V \Rightarrow \alpha(v_1 + v_2) = \alpha v_1 + \alpha v_2$.
(vi) $\alpha_1, \alpha_2 \in \mathbb{R}$, $v \in V \Rightarrow (\alpha_1 + \alpha_2) v = \alpha_1 v + \alpha_2 v$ and $\alpha_1(\alpha_2 v) = (\alpha_1 \alpha_2) v$.
(vii) $v \in V \Rightarrow 0v = \theta$ and $1v = v$.

Note that from conditions (vi) and (vii) above, it follows that for any $v$ in $V$, $v + (-1)v = 0 \cdot v = \theta$. Thus for any $v$ in $V$, $(-1)v$ is the additive inverse and is denoted by $-v$. Conditions (i), (ii), and (iii) are called respectively commutativity, associativity, and the existence of an additive identity. Thus $V$ under the operation $+$ is an Abelian (i.e., commutative) group.

Definition 3.3.1: A function $f$ from $V$ to $\mathbb{R}_+$ denoted by $f(v) \equiv \|v\|$ is called a norm if
(a) $v_1, v_2 \in V \Rightarrow \|v_1 + v_2\| \le \|v_1\| + \|v_2\|$ (triangle inequality)
(b) $\alpha \in \mathbb{R}$, $v \in V \Rightarrow \|\alpha v\| = |\alpha|\,\|v\|$ (scalar homogeneity)
(c) $\|v\| = 0$ iff $v = \theta$.

A vector space $V$ with a norm $\|\cdot\|$ defined on it is called a normed vector space or normed linear space and is denoted as $(V, \|\cdot\|)$. Let $d(v_1, v_2) \equiv \|v_1 - v_2\|$, $v_1, v_2 \in V$. Then from the definition of $\|\cdot\|$, it follows that $d$ is a metric on $V$, i.e., $(V, d)$ is a metric space. Recall that a metric space $(S, d)$

is called complete if every Cauchy sequence $\{x_n\}_{n \ge 1}$ in $S$ converges to an element $x$ in $S$.

Definition 3.3.2: A Banach space is a complete normed linear space $(V, \|\cdot\|)$.

It was shown by S. Banach of Poland that all $L^p(\Omega, \mathcal{B}, \mu)$ spaces are Banach spaces, provided $p \ge 1$, and in particular, all Euclidean spaces are Banach spaces. An example of a different kind is the space $C[0, 1]$ of all real valued continuous functions on $[0, 1]$ with the usual operations of pointwise addition and scalar multiplication, i.e., $(f + g)(x) = f(x) + g(x)$ and $(\alpha f)(x) = \alpha \cdot f(x)$ for all $\alpha \in \mathbb{R}$, $0 \le x \le 1$, $f, g \in C[0, 1]$, where the norm (called the supnorm) is defined by $\|f\| = \sup\{|f(x)| : 0 \le x \le 1\}$. The verification of the fact that $C[0, 1]$ with the supnorm is a Banach space is left as an exercise (Problem 3.22). The space $\mathcal{P}$ of all polynomials on $[0, 1]$ is also a normed linear space under the above norm, but $(\mathcal{P}, \|\cdot\|)$ is not complete (Problem 3.23). However, for each $n \in \mathbb{N}$, the space $\mathcal{P}_n$ of all polynomials on $[0, 1]$ of degree $\le n$ is a Banach space under the supnorm (Problem 3.26).

Definition 3.3.3: Let $V$ be a vector space. A subset $W \subset V$ is called a subspace of $V$ if $v_1, v_2 \in W$, $\alpha_1, \alpha_2 \in \mathbb{R} \Rightarrow \alpha_1 v_1 + \alpha_2 v_2 \in W$. If $(V, \|\cdot\|)$ is a normed vector space and $W$ is a subspace of $V$, then $(W, \|\cdot\|)$ is also a normed vector space. If $W$ is closed in $(V, \|\cdot\|)$, then $W$ is called a closed subspace of $V$.

Remark 3.3.1: If $(V, \|\cdot\|)$ is a Banach space and $W$ is a closed subspace of $V$, then $(W, \|\cdot\|)$ is also a Banach space.
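The norm axioms of Definition 3.3.1 can be checked numerically for the $p$-norm on $\mathbb{R}^k$. The following is a minimal sketch, not part of the original text: the helper names `p_norm` and `check_norm_axioms` are hypothetical, and the random test only gives evidence for the axioms, it does not prove them. It also shows why the restriction $p \ge 1$ matters, since the triangle inequality fails for $p = 1/2$.

```python
import random


def p_norm(x, p):
    """Return (sum_i |x_i|^p)^(1/p) for a finite sequence x and p > 0."""
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)


def check_norm_axioms(p, k=5, trials=1000, seed=0, tol=1e-9):
    """Randomly test the triangle inequality and scalar homogeneity on R^k."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.uniform(-1, 1) for _ in range(k)]
        y = [rng.uniform(-1, 1) for _ in range(k)]
        a = rng.uniform(-3, 3)
        s = [xi + yi for xi, yi in zip(x, y)]
        # (a) triangle inequality
        if p_norm(s, p) > p_norm(x, p) + p_norm(y, p) + tol:
            return False
        # (b) scalar homogeneity
        scaled = [a * xi for xi in x]
        if abs(p_norm(scaled, p) - abs(a) * p_norm(x, p)) > tol:
            return False
    return True


# For p < 1 the triangle inequality fails: take the unit vectors e1, e2 in R^2.
e1, e2 = [1.0, 0.0], [0.0, 1.0]
lhs = p_norm([1.0, 1.0], 0.5)            # (1 + 1)^2 = 4
rhs = p_norm(e1, 0.5) + p_norm(e2, 0.5)  # 1 + 1 = 2
```

Here `lhs > rhs`, so $(\sum_i |x_i|^p)^{1/p}$ is not a norm on $\mathbb{R}^2$ when $p = 1/2$.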

3.3.2 Linear transformations

Let $(V_i, \|\cdot\|_i)$, $i = 1, 2$, be two normed linear spaces over $\mathbb{R}$.

Definition 3.3.4: A function $T$ from $V_1$ to $V_2$ is called a linear transformation or linear operator if $\alpha_1, \alpha_2 \in \mathbb{R}$, $x, y \in V_1 \Rightarrow T(\alpha_1 x + \alpha_2 y) = \alpha_1 T(x) + \alpha_2 T(y)$.

Definition 3.3.5: A linear operator $T$ from $(V_1, \|\cdot\|_1)$ to $(V_2, \|\cdot\|_2)$ is called bounded if $\|T\| \equiv \sup\{\|Tx\|_2 : \|x\|_1 < 1\} < \infty$, i.e., the image of the unit ball in $(V_1, \|\cdot\|_1)$ is contained in a ball of finite radius centered at the zero in $V_2$.

Here is a summary of some important results on this topic. By linearity of $T$, $T\!\left( \frac{x}{\|x\|_1} \right) = \frac{1}{\|x\|_1} T(x)$ for any $x \ne 0$. It follows that $T$ is bounded iff there exists $k < \infty$ such that for any $x \in V_1$, $\|Tx\|_2 \le k \|x\|_1$. Clearly, then $k$ can be taken to be $\|T\|$. Also by linearity, if $T$ is bounded, then $\|Tx_1 - Tx_2\|_2 = \|T(x_1 - x_2)\|_2 \le \|T\| \, \|x_1 - x_2\|_1$ and so the map $T$ is continuous

(indeed, uniformly so). It turns out that if a linear operator $T$ is continuous at some $x_0$ in $V_1$, then $T$ is continuous on all of $V_1$ and is bounded (Problem 3.28 (a)).

Now let $B(V_1, V_2)$ be the space of all bounded linear operators from $(V_1, \|\cdot\|_1)$ to $(V_2, \|\cdot\|_2)$. For $T_1, T_2 \in B(V_1, V_2)$, $\alpha_1, \alpha_2$ in $\mathbb{R}$, let $(\alpha_1 T_1 + \alpha_2 T_2)$ be defined by $(\alpha_1 T_1 + \alpha_2 T_2)(x) \equiv \alpha_1 T_1(x) + \alpha_2 T_2(x)$ for all $x$ in $V_1$. Then it can be verified that $(\alpha_1 T_1 + \alpha_2 T_2)$ also belongs to $B(V_1, V_2)$ and

$$\|T\| \equiv \sup\{\|Tx\|_2 : \|x\|_1 \le 1\} \tag{3.1}$$

is a norm on $B(V_1, V_2)$. Thus $(B(V_1, V_2), \|\cdot\|)$ is also a normed linear space. If $(V_2, \|\cdot\|_2)$ is complete, then it can be shown that $(B(V_1, V_2), \|\cdot\|)$ is also a Banach space (Problem 3.28 (b)). In particular, if $(V_2, \|\cdot\|_2)$ is the real line, the space $(B(V_1, \mathbb{R}), \|\cdot\|)$ is a Banach space.
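As a concrete instance of (3.1), consider a matrix acting on $\mathbb{R}^k$ with the sup norm on both the domain and the range. In that setting the operator norm is the largest absolute row sum, a standard fact that is not stated in the text above. The sketch below, with hypothetical helper names and an arbitrarily chosen matrix `A`, computes this value and confirms that the supremum over the unit ball is attained at a vertex $(\pm 1, \ldots, \pm 1)$.

```python
import itertools
import random


def sup_norm(v):
    """Sup norm on R^k: the largest absolute coordinate."""
    return max(abs(t) for t in v)


def apply(matrix, x):
    """Apply the linear operator x -> matrix * x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in matrix]


def operator_norm_sup(matrix):
    """Operator norm from (R^k, sup norm) to (R^m, sup norm): max absolute row sum."""
    return max(sum(abs(a) for a in row) for row in matrix)


# An illustrative 2 x 3 matrix (an assumption made for this example only).
A = [[1.0, -2.0, 0.5],
     [0.0, 3.0, -1.0]]

bound = operator_norm_sup(A)

# The supremum in (3.1) is attained at a vertex of the unit ball.
attained = max(
    sup_norm(apply(A, list(signs)))
    for signs in itertools.product([-1.0, 1.0], repeat=3)
)

# Spot check of ||Ax|| <= ||A|| ||x|| on random vectors.
rng = random.Random(1)
samples = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
inequality_holds = all(
    sup_norm(apply(A, x)) <= bound * sup_norm(x) + 1e-12 for x in samples
)
```

With this matrix, `bound` and `attained` agree, which is the finite-dimensional form of the statement that $k$ in $\|Tx\|_2 \le k \|x\|_1$ can be taken to be $\|T\|$.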

3.3.3 Dual spaces

Definition 3.3.6: The space of all bounded linear functions from $(V_1, \|\cdot\|_1)$ to $\mathbb{R}$ (also called bounded linear functionals), denoted by $V_1^*$, is called the dual space of $V_1$.

Thus, for any normed linear space $(V_1, \|\cdot\|_1)$ (that need not be complete), the dual space $(V_1^*, \|\cdot\|)$ is always a Banach space, where $\|T\| \equiv \sup\{|Tx| : \|x\|_1 < 1\}$ for $T \in V_1^*$. If $(V_1, \|\cdot\|_1) = L^p(\Omega, \mathcal{F}, \mu)$ for some measure space $(\Omega, \mathcal{F}, \mu)$ and $1 \le p < \infty$, by the Riesz representation theorem (see Theorem 3.2.3), the dual space may be identified with $L^q(\Omega, \mathcal{F}, \mu)$, where $q$ is the conjugate of $p$, i.e., $\frac{1}{p} + \frac{1}{q} = 1$. However, as pointed out earlier in Section 3.2, this is not true for $p = \infty$. That is, the dual of $L^\infty(\Omega, \mathcal{F}, \mu)$ is not $L^1(\Omega, \mathcal{F}, \mu)$ unless $(\Omega, \mathcal{F}, \mu)$ is a measure space where $\Omega$ is a finite set $\{w_1, w_2, \ldots, w_k\}$ and $\mathcal{F} = \mathcal{P}(\Omega)$. An example for the $p = \infty$ case can be constructed for the space $\ell_\infty$ of all bounded sequences of real numbers (cf. Royden (1988)).

The representation of the dual space of the Banach space $C[0, 1]$ with supnorm is in terms of finite signed measures (cf. Section 4.2).

Theorem 3.3.1: (Riesz). Let $T : C[0, 1] \to \mathbb{R}$ be linear and bounded. Then there exist two finite measures $\mu_1$ and $\mu_2$ on $[0, 1]$ such that for any $f \in C[0, 1]$,

$$T(f) = \int f \, d\mu_1 - \int f \, d\mu_2.$$

For a proof see Royden (1988) or Rudin (1987) (see also Problem 3.27).

3.3.4 Hilbert space

A vector space $V$ over $\mathbb{R}$ is called a real innerproduct space if there exists a function $f : V \times V \to \mathbb{R}$, denoted by $f(x, y) \equiv \langle x, y \rangle$ (and called the innerproduct), that satisfies

(i) $\langle x, y \rangle = \langle y, x \rangle$ for all $x, y \in V$,

(ii) (linearity) $\langle \alpha_1 x_1 + \alpha_2 x_2, y \rangle = \alpha_1 \langle x_1, y \rangle + \alpha_2 \langle x_2, y \rangle$ for all $\alpha_1, \alpha_2 \in \mathbb{R}$, $x_1, x_2, y \in V$,

(iii) $\langle x, x \rangle \ge 0$ for all $x \in V$ and $\langle x, x \rangle = 0$ iff $x = \theta$, the zero vector of $V$.

Using the fact that the quadratic function $\varphi(t) = \langle x + ty, x + ty \rangle = \langle x, x \rangle + 2t \langle x, y \rangle + t^2 \langle y, y \rangle$ is nonnegative for all $t \in \mathbb{R}$, one gets the Cauchy-Schwarz inequality

$$|\langle x, y \rangle| \le \sqrt{\langle x, x \rangle} \sqrt{\langle y, y \rangle} \quad \text{for all } x, y \in V.$$

Now setting $\|x\| = \sqrt{\langle x, x \rangle}$ and using the Cauchy-Schwarz inequality, one verifies that $\|x\|$ is a norm on $V$ and thus $(V, \|\cdot\|)$ is a normed linear space. Further, the function $\langle x, y \rangle$ from $V \times V$ to $\mathbb{R}$ is continuous (Problem 3.29) under the norm $\|(x_1, x_2)\| = \|x_1\| + \|x_2\|$, $(x_1, x_2) \in V \times V$.

Definition 3.3.7: Let $(V, \langle \cdot, \cdot \rangle)$ be a real innerproduct space. It is called a Hilbert space if $(V, \|\cdot\|)$ is a Banach space, i.e., if it is complete.

It was seen in Section 3.2 that for any measure space $(\Omega, \mathcal{F}, \mu)$, the space $L^2(\Omega, \mathcal{F}, \mu)$ of all equivalence classes of functions $f : \Omega \to \mathbb{R}$ satisfying $\int |f|^2 \, d\mu < \infty$ is a complete innerproduct space with the innerproduct $\langle f, g \rangle = \int f g \, d\mu$ and hence a Hilbert space. It turns out that every Hilbert space $H$ is an $L^2(\Omega, \mathcal{F}, \mu)$ for some $(\Omega, \mathcal{F}, \mu)$. (The axiom of choice or its equivalent, the Hausdorff maximality principle, is required for a proof of this. See Rudin (1987).) This is in contrast to the Banach space case, where every $L^p(\Omega, \mathcal{F}, \mu)$ with $p \ge 1$ is a Banach space but not conversely, i.e., every Banach space need not be an $L^p(\Omega, \mathcal{F}, \mu)$.

Next, for each $x$ in a Hilbert space $H$, let $T_x : H \to \mathbb{R}$ be defined by $T_x(y) = \langle x, y \rangle$. By the defining properties of $\langle x, y \rangle$ and the Cauchy-Schwarz inequality, it is easy to verify that $T_x$ is a bounded linear function on $H$, i.e.,

$$T_x(\alpha_1 y_1 + \alpha_2 y_2) = \alpha_1 T_x(y_1) + \alpha_2 T_x(y_2) \quad \text{for all } \alpha_1, \alpha_2 \in \mathbb{R}, \ y_1, y_2 \in H \tag{3.2}$$

and

$$|T_x(y)| \le \|x\| \, \|y\| \quad \text{for all } y \in H. \tag{3.3}$$

Thus $T_x \in H^*$, the dual space. It is an important result (see Theorem 3.3.3 below) that every $T \in H^*$ is equal to $T_x$ for some $x$ in $H$ and $\|T\| = \|x\|$. Thus $H^*$ can be identified with $H$.
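The step from the nonnegative quadratic $\varphi(t)$ to the Cauchy-Schwarz inequality is only indicated above. One standard way to fill it in, given here as a sketch rather than as the authors' own argument, is to evaluate $\varphi$ at its minimizer:

```latex
% Case <y,y> = 0: then y = \theta by (iii), so both sides of the inequality are 0.
% Case <y,y> > 0: evaluate \varphi at t_0 = -<x,y>/<y,y>.
\begin{aligned}
0 \le \varphi(t_0)
  &= \langle x, x \rangle + 2 t_0 \langle x, y \rangle + t_0^2 \langle y, y \rangle \\
  &= \langle x, x \rangle - \frac{2 \langle x, y \rangle^2}{\langle y, y \rangle}
     + \frac{\langle x, y \rangle^2}{\langle y, y \rangle} \\
  &= \langle x, x \rangle - \frac{\langle x, y \rangle^2}{\langle y, y \rangle},
\end{aligned}
\qquad\text{hence}\qquad
\langle x, y \rangle^2 \le \langle x, x \rangle \, \langle y, y \rangle .
```

Taking square roots gives the stated form of the inequality.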

Definition 3.3.8: Let $(V, \langle \cdot, \cdot \rangle)$ be an inner product space. Two vectors $x, y$ in $V$ are said to be orthogonal, written as $x \perp y$, if $\langle x, y \rangle = 0$. A collection $B \subset V$ is called orthogonal if $x, y \in B$, $x \ne y \Rightarrow \langle x, y \rangle = 0$. The collection $B$ is called orthonormal if it is orthogonal and in addition for all $x$ in $B$, $\|x\| = 1$.

Note that if $x \perp y$, then $\|x - y\|^2 = \langle x - y, x - y \rangle = \langle x, x \rangle + \langle y, y \rangle = \|x\|^2 + \|y\|^2$, and so if $B$ is an orthonormal set, then $x, y \in B \Rightarrow$ either $x = y$ or $\|x - y\| = \sqrt{2}$. Thus, if $V$ is separable under the metric $d(x, y) = \|x - y\|$ (i.e., there exists a countable set $D \subset V$ such that for every $x$ in $V$ and $\epsilon > 0$, there exists a $d \in D$ such that $\|x - d\| < \epsilon$) and if $B \subset V$ is an orthonormal system, then the open balls $S_b$ of radius $\frac{1}{2\sqrt{2}}$ around each $b \in B$ are such that the sets $\{S_b \cap D : b \in B\}$ are disjoint and nonempty. Thus $B$ is countable. Now let $(V, \langle \cdot, \cdot \rangle)$ be a separable innerproduct space and $B \subset V$ be an orthonormal system.

Definition 3.3.9: The Fourier coefficients of a vector $x$ in $V$ with respect to an orthonormal set $B$ is the set $\{\langle x, b \rangle : b \in B\}$.

Since $V$ is separable, $B$ is countable. Let $B = \{b_i : i \in \mathbb{N}\}$. For a given $x \in V$, let $c_i = \langle x, b_i \rangle$, $i \ge 1$. Let $x_n \equiv \sum_{i=1}^n c_i b_i$, $n \in \mathbb{N}$. The sequence $\{x_n\}_{n \ge 1}$ is called the partial sum sequence of the Fourier expansion of the vector $x$ w.r.t. the orthonormal set $B$. A natural question is: when does $\{x_n\}_{n \ge 1}$ converge to $x$? By the linearity property in the definition of the innerproduct $\langle \cdot, \cdot \rangle$, it follows that

$$0 \le \|x - x_n\|^2 = \langle x - x_n, x - x_n \rangle = \langle x, x \rangle - 2 \langle x, x_n \rangle + \langle x_n, x_n \rangle$$

and

$$\langle x, x_n \rangle = \sum_{i=1}^n c_i \langle x, b_i \rangle = \sum_{i=1}^n c_i^2.$$

Since $\{b_i\}_{i \ge 1}$ are orthonormal,

$$\|x_n\|^2 = \langle x_n, x_n \rangle = \sum_{i=1}^n c_i^2 = \langle x, x_n \rangle.$$

Thus,

$$0 \le \|x - x_n\|^2 = \|x\|^2 - \|x_n\|^2 = \|x\|^2 - \sum_{i=1}^n c_i^2,$$

leading to

Proposition 3.3.2: (Bessel's inequality). Let $\{b_i\}_{i \ge 1}$ be orthonormal in an innerproduct space $(V, \langle \cdot, \cdot \rangle)$. Then, for any $x$ in $V$,

$$\sum_{i=1}^\infty \langle x, b_i \rangle^2 \le \|x\|^2. \tag{3.4}$$
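Bessel's inequality and the identity $\|x - x_n\|^2 = \|x\|^2 - \sum_{i=1}^n c_i^2$ behind it can be illustrated in $\mathbb{R}^4$ with an orthonormal set that is not a basis. This is a minimal sketch; the vectors `B` and `x` are arbitrary choices made for the example and do not come from the text.

```python
import math


def inner(x, y):
    """Euclidean inner product on R^k."""
    return sum(a * b for a, b in zip(x, y))


# Two orthonormal vectors in R^4 (not a complete system).
s = 1 / math.sqrt(2)
B = [[s, s, 0.0, 0.0],
     [0.0, 0.0, s, -s]]
x = [3.0, 1.0, 2.0, -4.0]

# Fourier coefficients c_i = <x, b_i>.
coeffs = [inner(x, b) for b in B]
bessel_lhs = sum(c * c for c in coeffs)   # sum of c_i^2
norm_sq = inner(x, x)                     # ||x||^2

# Partial sum x_n = sum_i c_i b_i and the squared residual ||x - x_n||^2.
x_n = [sum(c * b[j] for c, b in zip(coeffs, B)) for j in range(4)]
diff = [a - b for a, b in zip(x, x_n)]
residual_sq = inner(diff, diff)
```

The inequality is strict in this example because `B` is not complete: the residual `x - x_n` is a nonzero vector orthogonal to both elements of `B`.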

Now let $(V, \langle \cdot, \cdot \rangle)$ be a Hilbert space. Since for $m > n$,

$$\|x_n - x_m\|^2 = \sum_{i=n+1}^m \langle x, b_i \rangle^2,$$

it follows from Bessel's inequality that $\{x_n\}_{n \ge 1}$ is a Cauchy sequence, and since $V$ is complete, there is a $y$ in $V$ such that $x_n \to y$. This implies (by the continuity of $\langle x, y \rangle$) that $\langle x, b_i \rangle = \lim_{n \to \infty} \langle x_n, b_i \rangle = \langle y, b_i \rangle \Rightarrow \langle x - y, b_i \rangle = 0$ for all $i \ge 1$. Thus, it follows that $\langle x, b_i \rangle = \langle y, b_i \rangle$ for all $i \ge 1$. The last relation implies $y = x$ iff the set $\{b_i\}_{i \ge 1}$ satisfies the property that

$$\langle z, b_i \rangle = 0 \ \text{ for all } i \ge 1 \ \Rightarrow \ z = 0. \tag{3.5}$$

This property is called the completeness of $B$. Thus $B \equiv \{b_i\}_{i \ge 1}$ is a complete orthonormal set for a Hilbert space $H \equiv (V, \langle \cdot, \cdot \rangle)$ iff for every vector $x$,

$$\sum_{i=1}^\infty c_i^2 = \|x\|^2, \tag{3.6}$$

where $c_i = \langle x, b_i \rangle$, $i \ge 1$, which in turn holds iff

$$\Big\| x - \sum_{i=1}^n c_i b_i \Big\| \to 0 \quad \text{as } n \to \infty. \tag{3.7}$$

Conversely, if $\{c_i\}_{i \ge 1}$ is a sequence of real numbers such that $\sum_{i=1}^\infty c_i^2 < \infty$, then the sequence $\{x_n \equiv \sum_{i=1}^n c_i b_i\}_{n \ge 1}$ is Cauchy and hence converges to an $x$ in $V$. Thus the Hilbert space $H$ can be identified with the space $\ell_2$ of all square summable sequences $\big\{ \{c_i\}_{i \ge 1} : \sum_{i=1}^\infty c_i^2 < \infty \big\}$, in the sense that the map $\varphi : x \mapsto \{c_i\}_{i \ge 1}$, where $c_i = \langle x, b_i \rangle$, $i \ge 1$, preserves the algebraic structure as well as the innerproduct, i.e., $\varphi$ is a linear operator from $H$ to $\ell_2$ and $\langle \varphi(x), \varphi(y) \rangle = \langle x, y \rangle$ for all $x, y \in H$. Such a $\varphi$ is called an isometric isomorphism between $H$ and $\ell_2$. Note also that $\ell_2$ is simply $L^2(\Omega, \mathcal{F}, \mu)$ where $\Omega \equiv \mathbb{N}$, $\mathcal{F} = \mathcal{P}(\mathbb{N})$, and $\mu$ is the counting measure.

It can be shown (using the axiom of choice) that every separable Hilbert space does possess a complete orthonormal system, i.e., an orthonormal basis. Next some examples are given. Here, unless otherwise indicated, $H$ denotes the Hilbert space and $B$ denotes an orthonormal basis of $H$.

Example 3.3.1:

(a) $H \equiv \ell_2 = \{(x_1, x_2, \ldots) : x_i \in \mathbb{R}, \ \sum_{i=1}^\infty x_i^2 < \infty\}$. $B \equiv \{e_i : i \ge 1\}$, where $e_i \equiv (0, 0, \ldots, 1, 0, \ldots)$ with 1 in the $i$th position and 0 elsewhere.

(b) $H \equiv L^2([0, 2\pi], \mathcal{B}([0, 2\pi]), \mu)$, where $\mu(A) = \frac{1}{2\pi} m(A)$, $m(\cdot)$ being Lebesgue measure. $B \equiv \{\cos nx : n = 0, 1, 2, \ldots\} \cup \{\sin nx : n = 1, 2, \ldots\}$. (For a proof, see Chapter 5.)

(c) Let $H \equiv L^2(\mathbb{R}, \mathcal{B}(\mathbb{R}), \mu)$, where $\mu$ is a finite measure such that $\int |x|^k \, d\mu < \infty$ for all $k = 1, 2, \ldots$. Let $B_1 \equiv \{1, x, x^2, \ldots\}$ and $B$ be the orthonormal set generated by applying the Gram-Schmidt procedure to $B_1$ (see Problem 3.31). It can be shown that $B$ is a basis for $H$ (Problem 3.39). When $\mu$ is the standard normal distribution, the elements of $B$ are called Hermite polynomials.
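Example (c) can be made concrete for the standard normal distribution. The sketch below is an illustration under stated assumptions, not the text's own construction: it represents polynomials as coefficient lists, computes $\langle p, q \rangle = E[p(Z) q(Z)]$ exactly from the moments $E[Z^k]$ of $Z \sim N(0, 1)$, and runs the Gram-Schmidt orthogonalization on $1, x, x^2, x^3$ without the final normalization. The function names are hypothetical.

```python
from fractions import Fraction


def normal_moment(k):
    """E[Z^k] for Z ~ N(0, 1): 0 for odd k, (k - 1)!! for even k."""
    if k % 2 == 1:
        return Fraction(0)
    result = Fraction(1)
    for j in range(1, k, 2):
        result *= j
    return result


def inner(p, q):
    """<p, q> = E[p(Z) q(Z)] for coefficient lists (lowest degree first)."""
    return sum(a * b * normal_moment(i + j)
               for i, a in enumerate(p) for j, b in enumerate(q))


def gram_schmidt_monomials(n):
    """Orthogonalize 1, x, ..., x^n; the results are left unnormalized."""
    basis = []
    for d in range(n + 1):
        v = [Fraction(0)] * d + [Fraction(1)]          # the monomial x^d
        for e in basis:
            coef = inner(v, e) / inner(e, e)           # projection coefficient
            padded = e + [Fraction(0)] * (len(v) - len(e))
            v = [a - coef * b for a, b in zip(v, padded)]
        basis.append(v)
    return basis


polys = gram_schmidt_monomials(3)
norms_sq = [inner(p, p) for p in polys]
```

The output corresponds to $1$, $x$, $x^2 - 1$, $x^3 - 3x$ with squared norms $1, 1, 2, 6$; dividing each polynomial by the square root of its squared norm gives the orthonormal elements of $B$.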

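For one more example, i.e., Haar functions, see Problem 3.40.

Theorem 3.3.3: (Riesz representation). Let $H$ be a separable Hilbert space. Then every bounded linear functional $T : H \to \mathbb{R}$ can be represented as $T \equiv T_{x_0}$ for some $x_0 \in H$, where $T_{x_0}(y) \equiv \langle y, x_0 \rangle$.

Proof: Let $B = \{b_i\}_{i \ge 1}$ be an orthonormal basis for $H$. Let $c_i \equiv T(b_i)$, $i \ge 1$. Then, for $n \ge 1$,

$$
\begin{aligned}
\sum_{i=1}^n c_i^2 &= \sum_{i=1}^n c_i T(b_i) = T\Big( \sum_{i=1}^n c_i b_i \Big) \quad \text{(by the linearity of } T\text{)} \\
\Rightarrow \ \Big| \sum_{i=1}^n c_i^2 \Big| &\le \|T\| \, \Big\| \sum_{i=1}^n c_i b_i \Big\| = \|T\| \Big( \sum_{i=1}^n c_i^2 \Big)^{1/2} \\
\Rightarrow \ \sum_{i=1}^n c_i^2 &\le \|T\|^2 \\
\Rightarrow \ \sum_{i=1}^\infty c_i^2 &< \infty.
\end{aligned}
$$

Thus $\{x_n \equiv \sum_{i=1}^n c_i b_i\}_{n \ge 1}$ is Cauchy in $H$ and hence converges to an $x_0$ in $H$. By the continuity of $T$, for any $y$, $Ty = \lim_{n \to \infty} T y_n$, where $y_n \equiv \sum_{i=1}^n \langle y, b_i \rangle b_i$, $n \ge 1$. But, by the linearity of $T$,

$$T y_n = \sum_{i=1}^n \langle y, b_i \rangle c_i = \Big\langle y, \sum_{i=1}^n c_i b_i \Big\rangle = \langle y, x_n \rangle.$$

Again by continuity of $\langle y, x \rangle$, it follows that $Ty = \langle y, x_0 \rangle$. $\Box$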
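The construction in the proof, $x_0 = \sum_i T(b_i) b_i$, can be traced in a finite-dimensional case. The following sketch uses $H = \mathbb{R}^2$, a rotated orthonormal basis, and a functional `T` chosen purely for illustration; none of these specifics come from the text.

```python
from fractions import Fraction as Fr


def inner(x, y):
    """Euclidean inner product on R^2."""
    return sum(a * b for a, b in zip(x, y))


def T(y):
    """A hypothetical bounded linear functional on R^2 (illustration only)."""
    return 2 * y[0] - y[1]


# An orthonormal basis of R^2 that is not the standard one.
basis = [[Fr(3, 5), Fr(4, 5)], [Fr(-4, 5), Fr(3, 5)]]

# c_i = T(b_i) and x_0 = sum_i c_i b_i, as in the proof of Theorem 3.3.3.
c = [T(b) for b in basis]
x0 = [sum(ci * b[j] for ci, b in zip(c, basis)) for j in range(2)]

# T(y) should equal <y, x_0> for every y; check a few vectors.
test_vectors = [[Fr(1), Fr(0)], [Fr(0), Fr(1)], [Fr(7, 3), Fr(-5, 2)]]
represented = all(T(y) == inner(y, x0) for y in test_vectors)
```

In this example $\sum_i c_i^2 = \|x_0\|^2 = 5$, in line with the bound $\sum_i c_i^2 \le \|T\|^2$ used in the proof.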

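A sufficient condition for an $L^2(\Omega, \mathcal{F}, \mu)$ to be separable is that there exists an at most countable family $\mathcal{A} \equiv \{A_j\}_{j \ge 1}$ of sets in $\mathcal{F}$ such that $\mathcal{F} = \sigma\langle \mathcal{A} \rangle$ and $\mu(A_j) > 0$ for each $j$. This holds for any $\sigma$-finite measure $\mu$ on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$ (Problem 3.38).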

Remark 3.3.2: Assuming the axiom of choice, the Riesz representation theorem remains valid for any Hilbert space, separable or not (Problem 3.43).

3.4 Problems

3.1 Prove Proposition 3.1.7. (Hint: Use (1.4) repeatedly.)

3.2 Let $(\Omega, \mathcal{F}, \mu)$ be a measure space with $\mu(\Omega) \le 1$ and $f : \Omega \to (a, b) \subset \mathbb{R}$ be in $L^1(\Omega, \mathcal{F}, \mu)$. Let $\varphi : (a, b) \to \mathbb{R}$ be convex. Show that if $c \equiv \int f \, d\mu \in (a, b)$ and $\varphi(f) \in L^1(\Omega, \mathcal{F}, \mu)$ and $c \, \varphi'_+(c) \ge 0$, then

$$\mu(\Omega) \, \varphi\Big( \int f \, d\mu \Big) \le \int \varphi(f) \, d\mu.$$

3.3 Prove Proposition 3.1.10. (Hint: Apply Jensen's inequality with $\Omega \equiv \{1, 2, \ldots, k\}$, $\mathcal{F} = \mathcal{P}(\Omega)$, $P(\{i\}) = p_i$, $f(i) = a_i$, $i = 1, 2, \ldots, k$, and $\varphi(x) = e^x$ to get (i). Deduce (ii) from (i) and Remark 3.1.3. For (iii), consider $\varphi(x) = |x|^p$.)

3.4 Give an example of a convex function $\varphi$ on $(0, 1)$ with a finite number of points where it is not differentiable. Can this be extended to the countable case? Uncountable case? (Hint: Note that $\varphi'_+(\cdot)$ and $\varphi'_-(\cdot)$ are both monotone and hence have at most a countable number of discontinuity points.)

3.5 Let $\varphi : (a, b) \to \mathbb{R}$ be convex.

(a) Using the definition and induction, show that

$$\varphi\Big( \sum_{i=1}^n p_i x_i \Big) \le \sum_{i=1}^n p_i \varphi(x_i)$$

for any $n \ge 2$, $x_1, x_2, \ldots, x_n$ in $(a, b)$ and $\{p_1, p_2, \ldots, p_n\}$, a probability distribution.

(b) Use (a) to prove Jensen's inequality for any bounded $\varphi$.

3.6 Show that a function $\varphi : \mathbb{R} \to \mathbb{R}$ is convex iff

$$\varphi\Big( \int_{[0,1]} f \, dm \Big) \le \int_{[0,1]} \varphi(f) \, dm$$

for every bounded Borel measurable function $f : [0, 1] \to \mathbb{R}$, where $m(\cdot)$ is the Lebesgue measure.

3.7 Let $\varphi$ be convex on $(a, b)$ and $\psi : \mathbb{R} \to \mathbb{R}$ be convex and nondecreasing. Show that $\psi \circ \varphi$ is convex on $(a, b)$.

3.8 Let $X$ be a nonnegative random variable on some probability space.

(a) Show that $(EX)\big(E \frac{1}{X}\big) \ge 1$. What does this say about the correlation between $X$ and $\frac{1}{X}$?

(b) Let $f, g : \mathbb{R}_+ \to \mathbb{R}_+$ be Borel measurable and such that $f(x) g(x) \ge 1$ for all $x$ in $\mathbb{R}_+$. Show that $Ef(X) \, Eg(X) \ge 1$.

3.9 Prove Corollary 3.1.13 using Hölder's inequality applied to an appropriate measure space.

3.10 Extend Hölder's inequality as follows. Let $1 < p_i < \infty$ and $f_i \in L^{p_i}(\Omega, \mathcal{F}, \mu)$, $i = 1, 2, \ldots, k$. Suppose $\sum_{i=1}^k \frac{1}{p_i} = 1$. Show that $\int \prod_{i=1}^k f_i \, d\mu \le \prod_{i=1}^k \|f_i\|_{p_i}$. (Hint: Use Proposition 3.1.10 (ii).)

3.11 Verify Minkowski's inequality for $p = 1$ and $p = \infty$.

3.12 (a) Find $(\Omega, \mathcal{F}, \mu)$, $0 < p < 1$, $f, g \in L^p(\Omega, \mathcal{F}, \mu)$ such that

$$\Big( \int |f + g|^p \, d\mu \Big)^{1/p} > \Big( \int |f|^p \, d\mu \Big)^{1/p} + \Big( \int |g|^p \, d\mu \Big)^{1/p}.$$

(b) Prove (1.18) for $0 < p < 1$ with $\|f\|_p = \int |f|^p \, d\mu$.

3.13 Let $(\Omega, \mathcal{F}, \mu)$ be a measure space. Let $\{A_k\}_{k \ge 1} \subset \mathcal{F}$ and $\sum_{k=1}^\infty \mu(A_k) < \infty$. Show that $\mu\big( \limsup_{k \to \infty} A_k \big) = 0$, where $\limsup_{k \to \infty} A_k = \bigcap_{n=1}^\infty \bigcup_{j \ge n} A_j = \{\omega : \omega \in A_j \text{ for infinitely many } j \ge 1\}$.

3.14 Establish Theorem 3.2.2 for $p = \infty$. (Hint: For each $k \ge 1$, choose $n_k \uparrow$ such that $\|f_{n_{k+1}} - f_{n_k}\|_\infty < 2^{-k}$. Show that there exists a set $A$ with $\mu(A^c) = 0$ and for $\omega$ in $A$, $|f_{n_{k+1}}(\omega) - f_{n_k}(\omega)| < 2^{-k}$ for all $k \ge 1$, and now proceed as in the proof for the case $0 < p < \infty$.)

3.15 Let $f, g \in L^p(\Omega, \mathcal{F}, \mu)$, $0 < p < 1$. Show that $d(f, g) = \int |f - g|^p \, d\mu$ is a metric and $(L^p(\Omega, \mathcal{F}, \mu), d)$ is complete.

3.16 Let $(\Omega, \mathcal{F}, \mu)$ be a measure space and $f : \Omega \to \mathbb{R}$ be $\mathcal{F}$-measurable. Let $A_f = \{p : 0 < p < \infty, \ \int |f|^p \, d\mu < \infty\}$.

(a) Show that $p_1, p_2 \in A_f$, $p_1 < p_2$ implies $[p_1, p_2] \subset A_f$. (Hint: Use $\int_{|f| \ge 1} |f|^p \, d\mu \le \int_{|f| \ge 1} |f|^{p_2} \, d\mu$ and $\int_{|f| \le 1} |f|^p \, d\mu \le \int_{|f| \le 1} |f|^{p_1} \, d\mu$ for any $p_1 < p < p_2$.)

(b) Let $\psi(p) = \log \int |f|^p \, d\mu$ for $p \in A_f$. By (a), it is known that $A_f$ is connected, i.e., it is an interval. Show that $\psi$ is convex in the open interior of $A_f$. (Hint: Use Hölder's inequality.)

(c) Give examples to show that $A_f$ could be a closed interval, an open interval, and semi-open intervals.

(d) If $0 < p_1 < p < p_2$, show that $\|f\|_p \le \max\{\|f\|_{p_1}, \|f\|_{p_2}\}$. (Hint: Use (b).)

(e) Show that if $\int |f|^r \, d\mu < \infty$ for some $0 < r < \infty$, then $\|f\|_p \to \|f\|_\infty$ as $p \to \infty$. (Hint: Show first that for any $K > 0$, $\mu(|f| > K) > 0 \Rightarrow \liminf_{p \to \infty} \|f\|_p \ge K$. If $\|f\|_\infty < \infty$ and $\mu(\Omega) < \infty$, use the fact that $\|f\|_p \le \|f\|_\infty (\mu(\Omega))^{1/p}$ and reduce the general case under the hypothesis that $\int |f|^p \, d\mu < \infty$ for some $p$ to this case.)

3.17 Let $X$ be a random variable on a probability space $(\Omega, \mathcal{F}, \mu)$. Recall that $Eh(X) = \int h(X) \, d\mu$ if $h(X) \in L^1(\Omega, \mathcal{F}, \mu)$.

(a) Show that $(E|X|^{p_1})^{1/p_1} \le (E|X|^{p_2})^{1/p_2}$ for any $0 < p_1 < p_2 < \infty$.

(b) Show that '=' holds in (a) iff $|X|$ is a constant a.e. $(\mu)$.

(c) Show that if $E|X| < \infty$, then $E|\log |X|| < \infty$ and $E|X|^r < \infty$ for all $0 < r < 1$, and $\frac{1}{r} \log(E|X|^r) \to E \log |X|$ as $r \to 0$.

3.18 Let $X$ be a nonnegative random variable.

(a) Show that $E X \log X \ge (EX)(E \log X)$.

(b) Show that $\sqrt{1 + (EX)^2} \le E\big( \sqrt{1 + X^2} \big) \le 1 + EX$.

3.19 Let $\Omega = \mathbb{N}$, $\mathcal{F} = \mathcal{P}(\mathbb{N})$, and let $\mu$ be the counting measure. Denote $L^p(\Omega, \mathcal{F}, \mu)$ for this case by $\ell_p$.

(a) Show that $\ell_p$ is the set of all sequences $\{x_n\}_{n \ge 1}$ such that $\sum_{n=1}^\infty |x_n|^p < \infty$.

(b) For the following sequences, find all $p > 0$ such that they belong to $\ell_p$:

(i) $x_n \equiv \frac{1}{n}$, $n \ge 1$.

(ii) $x_n = \frac{1}{n (\log(n+1))^2}$, $n \ge 1$.

3.20 For $1 \le p < \infty$, prove the Riesz representation theorem for $\ell_p$. That is, show that if $T$ is a bounded linear functional from $\ell_p \to \mathbb{R}$, then there exists a $y = \{y_i\}_{i \ge 1} \in \ell_q$ such that for any $x = \{x_i\}_{i \ge 1}$ in $\ell_p$, $T(x) = \sum_{i=1}^\infty x_i y_i$. (Hint: Let $y_i = T(e_i)$, where $e_i = \{e_i(j)\}_{j \ge 1}$, $e_i(j) = 1$ if $i = j$, $0$ if $i \ne j$. Use the fact $|T(x)| \le \|T\| \, \|x\|_p$ to show that for each $n \in \mathbb{N}$, $\big( \sum_{i=1}^n |y_i|^q \big) \le \|T\|^q < \infty$.)

3.21 Let $\Omega = \mathbb{R}$, $\mathcal{F} = \mathcal{B}(\mathbb{R})$, $\mu = \mu_F$, where $F$ is a cdf on $\mathbb{R}$. If $f(x) \equiv x^2$, find $A_f = \{p : 0 < p < \infty, \ f \in L^p(\mathbb{R}, \mathcal{B}(\mathbb{R}), \mu_F)\}$ for the following cases:

(a) $F(x) = \Phi(x)$, the $N(0, 1)$ cdf, i.e., $\Phi(x) \equiv \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x e^{-u^2/2} \, du$, $x \in \mathbb{R}$.

(b) $F(x) = \frac{1}{\pi} \int_{-\infty}^x \frac{1}{1 + u^2} \, du$, $x \in \mathbb{R}$.

3.22 Show that $C[0, 1]$ with the supnorm (i.e., with $\|f\| = \sup\{|f(x)| : 0 \le x \le 1\}$) is a Banach space. (Hint: To verify completeness, let $\{f_n\}_{n \ge 1}$ be a Cauchy sequence in $C[0, 1]$. Show that for each $0 \le x \le 1$, $\{f_n(x)\}_{n \ge 1}$ is a Cauchy sequence in $\mathbb{R}$. Let $f(x) = \lim_{n \to \infty} f_n(x)$. Now show that $\sup\{|f_n(x) - f(x)| : 0 \le x \le 1\} \le \lim_{m \to \infty} \|f_n - f_m\|$. Conclude that $f_n$ converges to $f$ uniformly on $[0, 1]$ and that $f \in C[0, 1]$.)

3.23 Show that the space $(\mathcal{P}, \|\cdot\|)$ of all polynomials on $[0, 1]$ with the supnorm is a normed linear space that is not complete. (Hint: Let $g(t) = \frac{1}{1 - t/2}$ for $0 \le t \le 1$. Find a sequence of polynomials $\{f_n\}_{n \ge 1}$ in $\mathcal{P}$ that converge to $g$ in supnorm.)

3.24 Show that the function $f(v) \equiv \|v\|$ from a normed linear space $(V, \|\cdot\|)$ to $\mathbb{R}_+$ is continuous.

3.25 Let $(V, \|\cdot\|)$ be a normed linear space. Let $S = \{v : \|v\| < 1\}$. Show that $S$ is an open set in $V$.

3.26 Show that the space $\mathcal{P}_k$ of all polynomials on $[0, 1]$ of degree $\le k$ is a Banach space under the supnorm, i.e., under $\|f\| = \sup\{|f(x)| : 0 \le x \le 1\}$. (Hint: Let $p_n(x) = \sum_{j=0}^k a_{nj} x^j$ be a sequence of elements in $\mathcal{P}_k$ that converge in supnorm to some $f(\cdot)$. Show that $\{a_{n1}\}_{n \ge 1}$ converges and recursively, $\{a_{ni}\}_{n \ge 1}$ converges for each $i$.)

3.27 Let $\mu$ be a finite measure on $[0, 1]$. Verify that $T_\mu(f) \equiv \int f \, d\mu$ is a bounded linear functional on $C[0, 1]$ and that $\|T_\mu\| = \mu([0, 1])$.

3.28 Let $(V_i, \|\cdot\|_i)$, $i = 1, 2$, be two normed linear spaces over $\mathbb{R}$.

(a) Let $T : V_1 \to V_2$ be a linear operator. Show that if for some $x_0$, $\|Tx - Tx_0\| \to 0$ as $x \to x_0$, then $T$ is continuous on $V_1$ and hence bounded.

(b) Show that if $(V_2, \|\cdot\|_2)$ is complete, then $B(V_1, V_2) \equiv \{T \mid T : V_1 \to V_2, \ T \text{ linear and bounded}\}$ is complete under the operator norm defined in (3.1).

In the following, $H$ will denote a real Hilbert space.

3.29 (a) Use the Cauchy-Schwarz inequality to show that the function $f(x, y) = \langle x, y \rangle$ is continuous from $H \times H \to \mathbb{R}$.

(b) (Parallelogram law). Show that in an innerproduct space $(V, \langle \cdot, \cdot \rangle)$, for any $x, y \in V$,

$$\|x + y\|^2 + \|x - y\|^2 = 2(\|x\|^2 + \|y\|^2),$$

where $\|x\|^2 = \langle x, x \rangle$.

3.30 (a) Let $\{Q_n(x)\}_{n \ge 0}$ be defined on $[0, 2\pi]$ by

$$Q_n(x) = c_n \Big( \frac{1 + \cos x}{2} \Big)^n,$$

where $c_n$ is such that

$$\frac{1}{2\pi} \int_0^{2\pi} Q_n(x) \, dx = 1.$$

Clearly, $Q_n(\cdot) \ge 0$.

(i) Verify that for each $\delta > 0$,

$$\sup\{Q_n(x) : \delta \le x \le 2\pi - \delta\} \to 0 \quad \text{as } n \to \infty.$$

(ii) Use this to show that if $f \in C[0, 2\pi]$ and if

$$P_n(t) \equiv \frac{1}{2\pi} \int_0^{2\pi} Q_n(t - s) f(s) \, ds, \quad n \ge 0, \tag{4.1}$$

then $P_n \to f$ uniformly on $[0, 2\pi]$.

(b) Use this to give a proof of the completeness of the class $\mathcal{C}$ of trigonometric functions.

(c) Show that if $f \in L^1([0, 2\pi])$, then $P_n(\cdot)$ converges to $f$ in $L^1([0, 2\pi])$.

(d) Let $\{\mu_n(\cdot)\}_{n \ge 1}$ be a sequence of probability measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such that for each $\delta > 0$, $\mu_n(\{x : |x| \ge \delta\}) \to 0$ as $n \to \infty$. Let $f : \mathbb{R} \to \mathbb{R}$ be Borel measurable. Let $f_n(x) \equiv \int f(x - y) \, \mu_n(dy)$, $n \ge 1$. Assuming that $f_n(\cdot)$ is well defined and Borel measurable, show that

(i) $f(\cdot)$ continuous at $x_0$ and bounded $\Rightarrow f_n(x_0) \to f(x_0)$.

(ii) $f(\cdot)$ uniformly continuous and bounded on $\mathbb{R} \Rightarrow f_n \to f$ uniformly on $\mathbb{R}$.

(iii) $f \in L^p(\mathbb{R}, \mathcal{B}(\mathbb{R}), m)$, $0 < p < \infty$, $m(\cdot)$ = Lebesgue measure $\Rightarrow \int |f_n - f|^p \, dm \to 0$.

(iv) Show that (iii) $\Rightarrow$ (c).

3.31 (Gram-Schmidt procedure). Let $B \equiv \{b_n : n \in \mathbb{N}\}$ be a collection of nonzero vectors in $H$. Set

$$e_1 = \frac{b_1}{\|b_1\|}, \qquad \tilde{e}_2 = b_2 - \langle b_2, e_1 \rangle e_1, \qquad e_2 = \frac{\tilde{e}_2}{\|\tilde{e}_2\|} \quad (\text{provided } \|\tilde{e}_2\| \ne 0),$$

and so on. If $\tilde{e}_n = 0$ for some $n \in \mathbb{N}$, then delete $b_n$. Let $E \equiv \{e_j : 1 \le j < k\}$, $k \le \infty$, be the collection of vectors reached this way.

(a) Show that $E$ is an orthonormal system.

(b) Let $H_B$ denote the closed linear subspace generated by $B$, i.e.,

$$H_B \equiv \Big\{ x : x \in H, \text{ there exists } x_n \text{ of the form } \sum_{j=1}^n a_j b_j, \ a_j \in \mathbb{R}, \text{ such that } \|x_n - x\| \to 0 \Big\}.$$

Show that $H_B$ is a Hilbert space and $E$ is a basis for $H_B$.

3.32 Let $H = L^2(\mathbb{R}, \mathcal{B}(\mathbb{R}), \mu)$, where $\mu$ is a probability measure. Let $B \equiv \{1, x, x^2, \ldots\}$. Assume that $\int |x|^k \, d\mu < \infty$ for all $k \in \mathbb{N}$. Apply the Gram-Schmidt procedure in Problem 3.31 to the set $B$ for the following cases and evaluate $e_1, e_2, e_3$.

(a) $\mu$ = Uniform $[0, 1]$ distribution.

(b) $\mu$ = standard normal distribution.

(c) $\mu$ = Exponential (1) distribution.

The orthonormal basis $E$ obtained this way is called Orthogonal Polynomials w.r.t. the given measure. (See Szego (1939).)

3.33 Let $B \subset H$ be an orthonormal system. Show that for any $x$ in $H$, $\{b : \langle x, b \rangle \ne 0\}$ is at most countable. (Hint: Show first that if $\{y_\alpha : \alpha \in I\}$ is a collection of nonnegative real numbers such that for some $C < \infty$, $\sum_{\alpha \in F} y_\alpha \le C$ for all $F \subset I$, $F$ finite, then $\{\alpha : y_\alpha > 0\}$ is at most countable, and apply this to the Bessel inequality.)

3.34 Let $B \subset H$. Define $B^\perp \equiv \{x : x \in H, \ \langle x, b \rangle = 0 \text{ for all } b \in B\}$. Show that $B^\perp$ is a closed subspace of $H$.

3.35 Let $B \subset H$ be a closed subspace of $H$.

(a) Using the fact that every Hilbert space admits an orthonormal basis, show that every $x$ in $H$ can be uniquely decomposed as

$$x = y + z, \tag{4.2}$$

where $y \in B$ and $z \in B^\perp$ and $\|x\|^2 = \|y\|^2 + \|z\|^2$.

(b) Let $P_B : H \to B$ be defined by $P_B x = y$, where $x$ admits the decomposition in (4.2) above. Verify that $P_B$ is a bounded linear operator from $H$ to $B$ and is of norm 1 if $B$ has at least one nonzero vector. (The operator $P_B$ is called the projection onto $B$.)

(c) Verify that $P_B(P_B x) = P_B x$ for all $x$ in $H$.

3.36 Let $H$ be separable and $\{x_n\}_{n \ge 1} \subset H$ be such that $\{\|x_n\|\}_{n \ge 1}$ is bounded by some $C < \infty$. Show that there exist a subsequence $\{x_{n_j}\}_{j \ge 1} \subset \{x_n\}_{n \ge 1}$ and an $x_0$ in $H$ such that for every $y$ in $H$, $\langle x_{n_j}, y \rangle \to \langle x_0, y \rangle$. (Hint: Fix an orthonormal basis $B \equiv \{b_n\}_{n \ge 1} \subset H$. Let $a_{ni} = \langle x_n, b_i \rangle$, $n \ge 1$, $i \ge 1$. Using $\sum_{i=1}^\infty a_{ni}^2 \le C^2$ for all $n$ and the Bolzano-Weierstrass property, show that

(a) there exists $\{n_j\}_{j \ge 1}$ such that $\lim_{j \to \infty} a_{n_j i} = a_i$ exists for all $i \ge 1$, and $\sum_{i=1}^\infty a_i^2 < \infty$,

(b) $\lim_{n \to \infty} \sum_{i=1}^n a_i b_i \equiv x_0$ exists in $H$,

(c) $\langle x_{n_j}, y \rangle \to \langle x_0, y \rangle$ for all $y$ in $H$.)

3.37 Let $(V, \langle \cdot, \cdot \rangle)$ be an innerproduct space. Verify that $\langle \cdot, \cdot \rangle$ is bilinear, i.e., for $\alpha_1, \alpha_2, \beta_1, \beta_2 \in \mathbb{R}$, $x_1, x_2, y_1, y_2$ in $V$,

$$\langle \alpha_1 x_1 + \alpha_2 x_2, \beta_1 y_1 + \beta_2 y_2 \rangle = \alpha_1 \beta_1 \langle x_1, y_1 \rangle + \alpha_1 \beta_2 \langle x_1, y_2 \rangle + \alpha_2 \beta_1 \langle x_2, y_1 \rangle + \alpha_2 \beta_2 \langle x_2, y_2 \rangle.$$

State and prove an extension to more than two vectors.

3.38 Let $(\Omega, \mathcal{F}, \mu)$ be a measure space. Suppose that there exists an at most countable family $\mathcal{A} \equiv \{A_j\}_{j \ge 1} \subset \mathcal{F}$ such that $\mathcal{F} = \sigma\langle \mathcal{A} \rangle$ and $\mu(A_j) > 0$ for each $j \ge 1$. Then show that for $0 < p < \infty$, $L^p(\Omega, \mathcal{F}, \mu)$ is separable. (Hint: Show first that for any $A \in \mathcal{F}$ with $\mu(A) < \infty$ and $\epsilon > 0$, there exists a countable subcollection $\mathcal{A}_1$ of $\mathcal{A}$ such that $\mu(A \,\triangle\, B) < \epsilon$, where $B = \bigcup \{A_j : A_j \in \mathcal{A}_1\}$. Now consider the class of functions $\{\sum_{i=1}^n c_i I_{A_i}, \ n \ge 1, \ A_i \in \mathcal{A}, \ c_i \in \mathbb{Q}\}$.)

3.39 Show that $B$ in Example 3.3.1 (c) is a basis for $H$. (Hint: Using Theorem 2.3.14, prove that the set of all polynomials is dense in $H$.)

3.40 (Haar functions). For $x$ in $\mathbb{R}$ let $h(x) = I_{[0,1/2)}(x) - I_{[1/2,1)}(x)$. Let $h_{00}(x) \equiv I_{[0,1)}(x)$ and for $k \ge 1$, $0 \le j < 2^{k-1}$, let $h_{kj}(x) \equiv 2^{\frac{k-1}{2}} h(2^{k-1} x - j)$, $0 \le x < 1$.

(a) Verify that the family $\{h_{kj}(\cdot), \ k \ge 0, \ 0 \le j < 2^{k-1}\}$ is an orthonormal set in $L^2([0, 1], \mathcal{B}([0, 1]), m)$, where $m(\cdot)$ is Lebesgue measure.

(b) Verify that this family is complete by completing the following two proofs:

(i) Show that for the indicator function $f$ of a dyadic interval of the form $\big[ \frac{k}{2^n}, \frac{\ell}{2^n} \big)$, $k < \ell$, the following identity holds:

$$\int f^2 \, dm = \frac{\ell - k}{2^n} = \sum_{k,j} \Big( \int f h_{kj} \, dm \Big)^2.$$

Now use the fact that the linear combinations of such $f$'s are dense in $L^2[0, 1]$.

(ii) For each $f \in L^2([0, 1], \mathcal{B}([0, 1]), m)$ such that $f$ is orthogonal to the Haar functions, $F(t) \equiv \int_{[0,t]} f \, dm$, $0 \le t \le 1$, is continuous and satisfies $F\big( \frac{j}{2^n} \big) = 0$ for all $0 \le j \le 2^n$, $n = 1, 2, \ldots$, and hence $F \equiv 0$, implying $f = 0$ a.e.

3.41 Let $H$ be a Hilbert space over $\mathbb{R}$ and $M$ be a closed subspace of $H$. Let $v_0 \in H$. Show that

$$\min\{\|v - v_0\| : v \in M\} = \max\{\langle v_0, u \rangle : u \in M^\perp, \ \|u\| = 1\},$$

where $M^\perp$ is the orthogonal complement of $M$, i.e., $M^\perp \equiv \{u : \langle v, u \rangle = 0 \text{ for all } v \in M\}$. (Hint: Use Problem 3.35 (a).)

3.42 Let $B$ be an orthonormal set in a Hilbert space $H$.

(a) (i) Show that for any $x$ in $H$ and any finite set $\{b_i : 1 \le i \le k\} \subset B$, $k < \infty$,

$$\sum_{i=1}^k \langle x, b_i \rangle^2 \le \|x\|^2.$$

(ii) Conclude that for all $x$ in $H$, $B_x \equiv \{b : \langle x, b \rangle \ne 0, \ b \in B\}$ is at most countable.

(b) Show that the following are equivalent:

(i) $B$ is complete, i.e., $x \in H$, $\langle x, b \rangle = 0$ for all $b \in B \Rightarrow x = 0$.

(ii) For all $x \in H$, there exists an at most countable set $B_x \equiv \{b_i : i \ge 1\}$ such that $\|x\|^2 = \sum_{i=1}^\infty \langle x, b_i \rangle^2$.

(iii) For all $x \in H$, $\epsilon > 0$, there exists a finite set $\{b_1, b_2, \ldots, b_k\} \subset B$ such that

$$\Big\| x - \sum_{i=1}^k \langle x, b_i \rangle b_i \Big\| < \epsilon.$$

(iv) If $B \subset B_1$, $B_1$ an orthonormal set in $H \Rightarrow B = B_1$.

3.43 Extend Theorem 3.3.3 to any Hilbert space assuming that the axiom of choice holds. (Hint: Using the axiom of choice or its equivalent, the Hausdorff maximality principle, it can be shown that every Hilbert space $H$ admits an orthonormal basis $B$ (see Rudin (1987)). Now let $T$ be a bounded linear functional from $H$ to $\mathbb{R}$. Let $f(b) \equiv T(b)$ for $b$ in $B$. Verify that $\sum_{i=1}^k |f(b_i)|^2 \le \|T\|^2$ for all finite collections $\{b_i : 1 \le i \le k\} \subset B$. Conclude that $D \equiv \{b : f(b) \ne 0\}$ is countable. Let $x_0 \equiv \sum_{b \in D} f(b) b$. Now use the proof of Theorem 3.3.3 to show that $T(x) \equiv \langle x, x_0 \rangle$ for all $x$ in $H$.)

3.44 Let $(V, \|\cdot\|)$ be a normed linear space. Let $\{T_n\}_{n \ge 1}$ and $T$ be bounded linear operators from $V$ to $V$. The sequence $\{T_n\}_{n \ge 1}$ is said to converge

(a) weakly to $T$ if for each $w$ in $V^*$, the dual of $V$, and each $v$ in $V$, $w(T_n(v)) \to w(T(v))$,

(b) strongly if for each $v$ in $V$, $\|T_n v - T v\| \to 0$,

(c) uniformly if $\sup\{\|T_n v - T v\| : \|v\| \le 1\} \to 0$.

Let $V_p = L^p(\mathbb{R}, \mathcal{B}(\mathbb{R}), \mu_L)$, $1 \le p \le \infty$. Let $\{h_n\}_{n \ge 1} \subset \mathbb{R}$ be such that $h_n \ne 0$, $h_n \to 0$ as $n \to \infty$. Let $(T_n f)(\cdot) \equiv f(\cdot + h_n)$, $T f(\cdot) \equiv f(\cdot)$. Verify that

(i) $\{T_n\}_{n \ge 1}$ and $T$ are bounded linear operators on $V_p$, $1 \le p \le \infty$,

(ii) for $1 \le p < \infty$, $\{T_n\}_{n \ge 1}$ converges to $T$ weakly,

(iii) for $1 \le p < \infty$, $\{T_n\}$ converges to $T$ strongly,

(iv) for $1 \le p < \infty$, $\{T_n\}$ does not converge to $T$ uniformly, by showing that for all $n$, $\|T_n - T\| = 1$,

(v) for $p = \infty$, $T_n$ does not converge weakly to $T$.

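4 Differentiation

4.1 The Lebesgue-Radon-Nikodym theorem

Definition 4.1.1: Let $(\Omega, \mathcal{F})$ be a measurable space and let $\mu$ and $\nu$ be two measures on $(\Omega, \mathcal{F})$. The measure $\mu$ is said to be dominated by $\nu$ or absolutely continuous w.r.t. $\nu$, written as $\mu \ll \nu$, if

$$\nu(A) = 0 \ \Rightarrow \ \mu(A) = 0 \quad \text{for all } A \in \mathcal{F}. \tag{1.1}$$

Example 4.1.1: Let $m$ be the Lebesgue measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ and let $\mu$ be the standard normal distribution, i.e.,

$$\mu(A) \equiv \int_A \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \, m(dx), \quad A \in \mathcal{B}(\mathbb{R}).$$

Then $m(A) = 0 \Rightarrow \mu(A) = 0$ and hence $\mu \ll m$.

Example 4.1.2: Let $\mathbb{Z}_+ \equiv \{0, 1, 2, \ldots\}$ denote the set of all nonnegative integers. Let $\nu$ be the counting measure on $\Omega = \mathbb{Z}_+$ and $\mu$ be the Poisson ($\lambda$) distribution for $0 < \lambda < \infty$, i.e., $\nu(A)$ = number of elements in $A$ and

$$\mu(A) = \sum_{j \in A} \frac{e^{-\lambda} \lambda^j}{j!}$$

for all $A \in \mathcal{P}(\Omega)$, the power set of $\Omega$. Since $\nu(A) = 0 \Leftrightarrow A = \emptyset \Leftrightarrow \mu(A) = 0$, it follows that $\mu \ll \nu$ and $\nu \ll \mu$.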

Example 4.1.3: Let $f$ be a nonnegative measurable function on a measure space $(\Omega, \mathcal{F}, \nu)$. Let

$$\mu(A) \equiv \int_A f \, d\nu \quad \text{for all } A \in \mathcal{F}. \tag{1.2}$$

Then, $\mu$ is a measure on $(\Omega, \mathcal{F})$ and $\nu(A) = 0 \Rightarrow \mu(A) = 0$ for all $A \in \mathcal{F}$, and hence $\mu \ll \nu$.

The Radon-Nikodym theorem is a sort of converse to Example 4.1.3. It says that if $\mu$ and $\nu$ are $\sigma$-finite measures (see Section 1.2) on a measurable space $(\Omega, \mathcal{F})$ and if $\mu \ll \nu$, then there is a nonnegative measurable function $f$ on $(\Omega, \mathcal{F})$ such that (1.2) holds.

Definition 4.1.2: Let $(\Omega, \mathcal{F})$ be a measurable space and let $\mu$ and $\nu$ be two measures on $(\Omega, \mathcal{F})$. Then, $\mu$ is called singular w.r.t. $\nu$, written as $\mu \perp \nu$, if there exists a set $B \in \mathcal{F}$ such that

$$\mu(B) = 0 \quad \text{and} \quad \nu(B^c) = 0. \tag{1.3}$$

Note that $\mu$ is singular w.r.t. $\nu$ implies that $\nu$ is singular w.r.t. $\mu$. Thus, the notion of singularity between two measures $\mu$ and $\nu$ is symmetric but that of absolute continuity is not. Note also that if $\mu$ and $\nu$ are mutually singular and $B$ satisfies (1.3), then for all $A \in \mathcal{F}$,

$$\mu(A) = \mu(A \cap B^c) \quad \text{and} \quad \nu(A) = \nu(A \cap B). \tag{1.4}$$

Example 4.1.4: Let $\mu$ be the Lebesgue measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ and $\nu$ be defined as $\nu(A)$ = # elements in $A \cap \mathbb{Z}$, where $\mathbb{Z}$ is the set of integers. Then

$$\nu(\mathbb{Z}^c) = 0 \quad \text{and} \quad \mu(\mathbb{Z}) = 0$$

and hence (1.3) holds with $B = \mathbb{Z}$. Thus $\mu$ and $\nu$ are mutually singular. Another example is the pair $m$ and $\mu_c$ on $[0, 1]$, where $\mu_c$ is the Lebesgue-Stieltjes measure generated by the Cantor function (cf. Section 4.5) and $m$ is the Lebesgue measure.

Example 4.1.5: Let $\mu$ be the Lebesgue measure restricted to $(-\infty, 0]$ and $\nu$ be the Exponential(1) distribution. That is, for any $A \in \mathcal{B}(\mathbb{R})$,

$$\mu(A) = \text{the Lebesgue measure of } A \cap (-\infty, 0]; \qquad \nu(A) = \int_{A \cap (0, \infty)} e^{-x} \, dx.$$

Then, $\mu((0, \infty)) = 0$ and $\nu((-\infty, 0]) = 0$, and (1.3) holds with $B = (0, \infty)$.

Suppose that $\mu$ and $\nu$ are two finite measures on a measurable space $(\Omega, \mathcal{F})$. H. Lebesgue showed that $\mu$ can be decomposed as a sum of two measures, i.e., $\mu = \mu_a + \mu_s$, where $\mu_a \ll \nu$ and $\mu_s \perp \nu$. The next theorem is the main result of this section and it combines the above decomposition result of Lebesgue with the Radon-Nikodym theorem mentioned earlier.

Theorem 4.1.1: Let $(\Omega, \mathcal{F})$ be a measurable space and let $\mu_1$ and $\mu_2$ be two $\sigma$-finite measures on $(\Omega, \mathcal{F})$.

(i) (The Lebesgue decomposition theorem). The measure $\mu_1$ can be uniquely decomposed as

$$\mu_1 = \mu_{1a} + \mu_{1s}, \tag{1.5}$$

where $\mu_{1a}$ and $\mu_{1s}$ are $\sigma$-finite measures on $(\Omega, \mathcal{F})$ such that $\mu_{1a} \ll \mu_2$ and $\mu_{1s} \perp \mu_2$.

(ii) (The Radon-Nikodym theorem). There exists a nonnegative measurable function $h$ on $(\Omega, \mathcal{F})$ such that

$$\mu_{1a}(A) = \int_A h \, d\mu_2 \quad \text{for all } A \in \mathcal{F}. \tag{1.6}$$

Proof: Case 1: Suppose that $\mu_1$ and $\mu_2$ are finite measures. Let $\mu$ be the measure $\mu = \mu_1 + \mu_2$ and let $H = L^2(\mu)$. Define a linear function $T$ on $H$ by

$$T(f) = \int f \, d\mu_1. \tag{1.7}$$

Then, by the Cauchy-Schwarz inequality applied to the functions $f$ and $g \equiv 1$,

$$|T(f)| \le \Big( \int f^2 \, d\mu_1 \Big)^{1/2} \big( \mu_1(\Omega) \big)^{1/2} \le \Big( \int f^2 \, d\mu \Big)^{1/2} \big( \mu_1(\Omega) \big)^{1/2}.$$

This shows that $T$ is a bounded linear functional on $H$ with $\|T\| \le M \equiv (\mu_1(\Omega))^{1/2}$. By the Riesz representation theorem (cf. Theorem 3.3.3 and Remark 3.3.2), there exists a $g \in L^2(\mu)$ such that

$$T(f) = \int f g \, d\mu \tag{1.8}$$

for all f ∈ L2(µ). Let f = I_A for A in F. Then, (1.7) and (1.8) yield

$$ \mu_1(A) = T(I_A) = \int_A g\,d\mu. $$

But 0 ≤ µ1(A) ≤ µ(A) for all A ∈ F. Hence the function g in L2(µ) satisfies

$$ 0 \le \int_A g\,d\mu \le \mu(A) \quad \text{for all } A \in \mathcal{F}. \tag{1.9} $$

Let A1 = {0 ≤ g < 1}, A2 = {g = 1}, A3 = {g ∉ [0, 1]}. Then (1.9) implies that µ(A3) = 0 (see Problem 4.1). Now define the measures µ1a(·) and µ1s(·) by

$$ \mu_{1a}(A) \equiv \mu_1(A \cap A_1), \qquad \mu_{1s}(A) \equiv \mu_1(A \cap A_2), \quad A \in \mathcal{F}. \tag{1.10} $$

Next it will be shown that µ1a ≪ µ2 and µ1s ⊥ µ2, thus establishing (1.5). By (1.7) and (1.8), for all f ∈ H,

$$ \int f\,d\mu_1 = \int f g\,d\mu = \int f g\,d\mu_1 + \int f g\,d\mu_2 \;\Rightarrow\; \int f(1-g)\,d\mu_1 = \int f g\,d\mu_2. \tag{1.11} $$

Setting f = I_{A2} yields 0 = µ2(A2). From (1.10), since µ1s(A2^c) = 0, it follows that µ1s ⊥ µ2. Now fix n ≥ 1 and A ∈ F. Let f = I_{A∩A1}(1 + g + . . . + g^{n−1}). Then (1.11) implies that

$$ \int_{A \cap A_1} (1 - g^n)\,d\mu_1 = \int_{A \cap A_1} g\,(1 + g + \ldots + g^{n-1})\,d\mu_2. $$

Since 0 ≤ g < 1 on A1, the integrand on the left increases to 1 and the integrand on the right increases to g/(1 − g) as n → ∞. Letting n → ∞ and using the MCT on both sides yields

$$ \mu_{1a}(A) = \int_A I_{A_1}\,\frac{g}{1-g}\,d\mu_2. \tag{1.12} $$

Setting h ≡ (g/(1 − g)) I_{A1} completes the proof of (1.5) and (1.6).
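The construction in Case 1 can be traced by hand on a finite sample space, where every measure is a table of point masses and every integral is a finite sum. The following is a minimal sketch under that assumption; the function name `lebesgue_decomposition` and the dictionary representation are illustrative choices, not part of the text. It computes g = dµ1/d(µ1 + µ2) pointwise, splits the space into A1 = {g < 1} and A2 = {g = 1}, and sets h = g/(1 − g) on A1.

```python
from fractions import Fraction

def lebesgue_decomposition(mu1, mu2):
    """Decompose mu1 w.r.t. mu2 on a finite set, following Case 1 of the proof.

    mu1, mu2: dicts mapping each point of the same finite sample space to a
    finite nonnegative mass. Returns (mu1a, mu1s, h) as dicts of point values.
    """
    mu1a, mu1s, h = {}, {}, {}
    for w in mu1:
        m1, m2 = Fraction(mu1[w]), Fraction(mu2[w])
        total = m1 + m2                      # mu = mu1 + mu2
        # g = d(mu1)/d(mu); on a mu-null point g is arbitrary, take g = 0.
        g = m1 / total if total != 0 else Fraction(0)
        if g < 1:                            # w is in A1 = {0 <= g < 1}
            mu1a[w], mu1s[w] = m1, Fraction(0)
            h[w] = g / (1 - g)               # h = g / (1 - g) on A1
        else:                                # w is in A2 = {g = 1}, so mu2(w) = 0
            mu1a[w], mu1s[w] = Fraction(0), m1
            h[w] = Fraction(0)
    return mu1a, mu1s, h

# Hypothetical example: mu2 puts no mass on "b", so mu1 is singular there.
mu1 = {"a": 1, "b": 2, "c": 3}
mu2 = {"a": 2, "b": 0, "c": 1}
mu1a, mu1s, h = lebesgue_decomposition(mu1, mu2)
```

In this example the point "b" carries all of the singular part, and on the remaining points h multiplied by the mass of µ2 recovers the mass of µ1, which is (1.6) for singletons.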

Case 2: Now suppose that µ1 and µ2 are σ-finite. Then there exists a countable partition $\{D_n\}_{n\ge 1} \subset \mathcal{F}$ of Ω such that µ1(Dn) and µ2(Dn) are both finite for all n ≥ 1. Let $\mu_1^{(n)}(\cdot) \equiv \mu_1(\cdot \cap D_n)$ and $\mu_2^{(n)}(\cdot) \equiv \mu_2(\cdot \cap D_n)$. Then applying 'Case 1' to $\mu_1^{(n)}$ and $\mu_2^{(n)}$ for each n ≥ 1, one gets measures $\mu_{1a}^{(n)}$, $\mu_{1s}^{(n)}$ and a function $h_n$ such that

$$ \mu_1^{(n)}(\cdot) \equiv \mu_{1a}^{(n)}(\cdot) + \mu_{1s}^{(n)}(\cdot) \tag{1.13} $$

where, for A in F, $\mu_{1a}^{(n)}(A) = \int_A h_n\,d\mu_2^{(n)} = \int_A h_n I_{D_n}\,d\mu_2$ and $\mu_{1s}^{(n)}(\cdot) \perp \mu_2^{(n)}$. Since $\mu_1(\cdot) = \sum_{n=1}^\infty \mu_1^{(n)}(\cdot)$, it follows from (1.13) that

$$ \mu_1(\cdot) = \mu_{1a}(\cdot) + \mu_{1s}(\cdot), \tag{1.14} $$

where $\mu_{1a}(A) \equiv \sum_{n=1}^\infty \mu_{1a}^{(n)}(A)$ and $\mu_{1s}(\cdot) = \sum_{n=1}^\infty \mu_{1s}^{(n)}(\cdot)$. By the MCT,

$$ \mu_{1a}(A) = \int_A h\,d\mu_2, \quad A \in \mathcal{F}, $$

where $h \equiv \sum_{n=1}^\infty h_n I_{D_n}$. Clearly, µ1a ≪ µ2. The verification of the singularity of µ1s and µ2 is left as an exercise (Problem 4.2).

It remains to prove the uniqueness of the decomposition. Let $\mu_1 = \mu_a + \mu_s$ and $\mu_1 = \mu_a' + \mu_s'$ be two decompositions of µ1, where $\mu_a$ and $\mu_a'$ are absolutely continuous w.r.t. µ2 and $\mu_s$ and $\mu_s'$ are singular w.r.t. µ2. By definition, there exist sets $B$ and $B'$ in F such that

$$ \mu_2(B) = 0, \quad \mu_2(B') = 0, \quad \text{and} \quad \mu_s(B^c) = 0, \quad \mu_s'(B'^c) = 0. $$

Let $D = B \cup B'$. Then $\mu_2(D) = 0$ and $\mu_s(D^c) \le \mu_s(B^c) = 0$. Similarly, $\mu_s'(D^c) \le \mu_s'(B'^c) = 0$. Also $\mu_2(D) = 0$ implies $\mu_a(D) = 0 = \mu_a'(D)$. Thus for any A ∈ F,

$$ \mu_a(A) = \mu_a(A \cap D^c) \quad \text{and} \quad \mu_a'(A) = \mu_a'(A \cap D^c). $$

Also $\mu_s(A \cap D^c) \le \mu_s(A \cap B^c) = 0$ and $\mu_s'(A \cap D^c) \le \mu_s'(A \cap B'^c) = 0$. Thus,

$$ \mu_1(A \cap D^c) = \mu_a(A \cap D^c) + \mu_s(A \cap D^c) = \mu_a(A \cap D^c) = \mu_a(A) $$

and

$$ \mu_1(A \cap D^c) = \mu_a'(A \cap D^c) + \mu_s'(A \cap D^c) = \mu_a'(A \cap D^c) = \mu_a'(A). $$

Hence, $\mu_a(A) = \mu_1(A \cap D^c) = \mu_a'(A)$ for every A ∈ F. That is, $\mu_a = \mu_a'$ and hence, $\mu_s = \mu_s'$. □

Remark 4.1.1: In Theorem 4.1.1, the hypothesis of σ-finiteness cannot be dropped. For example, let µ be the Lebesgue measure and ν be the counting measure on [0, 1]. Then µ ≪ ν but there does not exist a nonnegative F-measurable function h such that $\mu(A) = \int_A h\,d\nu$. To see this, suppose, if possible, that for some h ∈ L1(ν), $\mu(A) = \int_A h\,d\nu$ for all A ∈ F. Note that µ([0, 1]) = 1 implies that $\int_{[0,1]} h\,d\nu < \infty$ and hence, that B ≡ {x : h(x) > 0} is countable (Problem 4.3). But µ being the Lebesgue measure, µ(B) = 0 and µ(B^c) = 1. Since by definition, h ≡ 0 on B^c, this implies $1 = \mu(B^c) = \int_{B^c} h\,d\nu = 0$, leading to a contradiction. However, if ν is σ-finite and µ ≪ ν (µ not necessarily σ-finite), then the Radon-Nikodym theorem holds, i.e., there exists a nonnegative F-measurable function h such that

$$ \mu(A) = \int_A h\,d\nu \quad \text{for all } A \in \mathcal{F}. $$


For a proof, see Royden (1988), Chapter 11.

Definition 4.1.3: Let µ and ν be measures on a measurable space (Ω, F) and let h be a nonnegative measurable function such that

$$ \mu(A) = \int_A h\,d\nu \quad \text{for all } A \in \mathcal{F}. $$

Then h is called the Radon-Nikodym derivative of µ w.r.t. ν and is written as

$$ \frac{d\mu}{d\nu} = h. $$

If µ(Ω) < ∞ and there exist two nonnegative F-measurable functions h1 and h2 such that

$$ \mu(A) = \int_A h_1\,d\nu = \int_A h_2\,d\nu $$

for all A ∈ F, then h1 = h2 a.e. (ν) and thus the Radon-Nikodym derivative dµ/dν is unique up to equivalence a.e. (ν). This also extends to the case when µ is σ-finite. The following proposition is easy to verify (cf. Problem 4.4).

Proposition 4.1.2: Let ν, µ, µ1, µ2, . . . be σ-finite measures on a measurable space (Ω, F).

(i) If µ1 ≪ µ2 and µ2 ≪ µ3, then µ1 ≪ µ3 and

$$ \frac{d\mu_1}{d\mu_3} = \frac{d\mu_1}{d\mu_2}\,\frac{d\mu_2}{d\mu_3} \quad \text{a.e. } (\mu_3). $$

(ii) Suppose that µ1 and µ2 are dominated by µ3. Then for any α, β ≥ 0, αµ1 + βµ2 is dominated by µ3 and

$$ \frac{d(\alpha\mu_1 + \beta\mu_2)}{d\mu_3} = \alpha\,\frac{d\mu_1}{d\mu_3} + \beta\,\frac{d\mu_2}{d\mu_3} \quad \text{a.e. } (\mu_3). $$

(iii) If µ ≪ ν and dµ/dν > 0 a.e. (ν), then ν ≪ µ and

$$ \frac{d\nu}{d\mu} = \Big(\frac{d\mu}{d\nu}\Big)^{-1} \quad \text{a.e. } (\mu). $$

(iv) Let {µn}n≥1 be a sequence of measures and {αn}n≥1 be a sequence of positive real numbers, i.e., αn > 0 for all n ≥ 1. Define $\mu = \sum_{n=1}^\infty \alpha_n \mu_n$.

(a) Then, µ ≪ ν iff µn ≪ ν for each n ≥ 1 and in this case,

$$ \frac{d\mu}{d\nu} = \sum_{n=1}^\infty \alpha_n\,\frac{d\mu_n}{d\nu} \quad \text{a.e. } (\nu). $$

(b) µ ⊥ ν iff µn ⊥ ν for all n ≥ 1.
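Parts (i) to (iii) of Proposition 4.1.2 can be checked directly on a finite set, where a Radon-Nikodym derivative is a ratio of point masses. The sketch below assumes exactly that setting; the helper `rn_derivative` and the three example measures are hypothetical and chosen only so that the domination hypotheses hold.

```python
from fractions import Fraction

def rn_derivative(mu, nu):
    """Point values of dmu/dnu on a finite set, assuming mu << nu."""
    deriv = {}
    for w in nu:
        if nu[w] == 0:
            if mu[w] != 0:
                raise ValueError("mu is not dominated by nu")
            deriv[w] = Fraction(0)          # any value works on a nu-null point
        else:
            deriv[w] = Fraction(mu[w]) / Fraction(nu[w])
    return deriv

# Hypothetical measures with mu1 << mu2 << mu3 and d(mu2)/d(mu3) > 0.
mu1 = {"x": 1, "y": 2, "z": 0}
mu2 = {"x": 2, "y": 4, "z": 1}
mu3 = {"x": 4, "y": 1, "z": 5}

d13 = rn_derivative(mu1, mu3)
d12 = rn_derivative(mu1, mu2)
d23 = rn_derivative(mu2, mu3)
d32 = rn_derivative(mu3, mu2)

# (i) chain rule, (ii) linearity with alpha = 2, beta = 3, (iii) inversion.
chain_ok = all(d13[w] == d12[w] * d23[w] for w in mu3)
combo = {w: 2 * mu1[w] + 3 * mu2[w] for w in mu3}
linear_ok = all(rn_derivative(combo, mu3)[w] == 2 * d13[w] + 3 * d23[w] for w in mu3)
inverse_ok = all(d32[w] == 1 / d23[w] for w in mu3)
```

Exact rational arithmetic is used so that the identities hold with equality rather than up to rounding.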

4.2 Signed measures

Let µ1 and µ2 be two finite measures on a measurable space (Ω, F). Let

$$ \nu(A) \equiv \mu_1(A) - \mu_2(A), \quad \text{for all } A \in \mathcal{F}. \tag{2.1} $$

Then ν : F → R satisfies the following:

(i) ν(∅) = 0.

(ii) For any {An}n≥1 ⊂ F with Ai ∩ Aj = ∅ for i ≠ j, and with $A = \bigcup_{n\ge 1} A_n$, $\sum_{i=1}^\infty |\nu(A_i)| < \infty$ and

$$ \nu(A) = \sum_{i=1}^\infty \nu(A_i). \tag{2.2} $$

(iii) Let

$$ \|\nu\| \equiv \sup\Big\{ \sum_{i=1}^\infty |\nu(A_i)| : \{A_n\}_{n\ge 1} \subset \mathcal{F},\ A_i \cap A_j = \emptyset \text{ for } i \ne j,\ \bigcup_{n\ge 1} A_n = \Omega \Big\}. \tag{2.3} $$

Then, ‖ν‖ is finite.

Note that (iii) holds because ‖ν‖ ≤ µ1(Ω) + µ2(Ω) < ∞.

Definition 4.2.1: A set function ν : F → R satisfying (i), (ii), and (iii) above is called a finite signed measure.

The above example shows that the difference of two finite measures is a finite signed measure. It will be shown below that every finite signed measure can be expressed as the difference of two finite measures.

Proposition 4.2.1: Let ν be a finite signed measure on (Ω, F). Let

$$ |\nu|(A) \equiv \sup\Big\{ \sum_{n=1}^\infty |\nu(A_n)| : \{A_n\}_{n\ge 1} \subset \mathcal{F},\ A_i \cap A_j = \emptyset \text{ for } i \ne j,\ \bigcup_{n\ge 1} A_n = A \Big\}. \tag{2.4} $$


Then |ν|(·) is a finite measure on (Ω, F).

Proof: That |ν|(Ω) < ∞ follows from part (iii) of the definition. Thus it is enough to verify that |ν| is countably additive. Let {An}n≥1 be a countable family of disjoint sets in F. Let $A = \bigcup_{n\ge 1} A_n$. By the definition of |ν|, for all ε > 0 and n ∈ N, there exists a countable family {Anj}j≥1 of disjoint sets in F with $A_n = \bigcup_{j\ge 1} A_{nj}$ such that $\sum_{j=1}^\infty |\nu(A_{nj})| > |\nu|(A_n) - \epsilon/2^n$. Hence,

$$ \sum_{n=1}^\infty \sum_{j=1}^\infty |\nu(A_{nj})| > \sum_{n=1}^\infty |\nu|(A_n) - \epsilon. $$

Note that {Anj}n≥1,j≥1 is a countable family of disjoint sets in F such that $A = \bigcup_{n\ge 1} A_n = \bigcup_{n\ge 1}\bigcup_{j\ge 1} A_{nj}$. It follows from the definition of |ν| that

$$ |\nu|(A) \ge \sum_{n=1}^\infty \sum_{j=1}^\infty |\nu(A_{nj})| > \sum_{n=1}^\infty |\nu|(A_n) - \epsilon. $$

Since this is true for all ε > 0, it follows that

$$ |\nu|(A) \ge \sum_{n=1}^\infty |\nu|(A_n). \tag{2.5} $$

To get the opposite inequality, let {Bj}j≥1 be a countable family of disjoint sets in F such that $\bigcup_{j\ge 1} B_j = A = \bigcup_{n\ge 1} A_n$. Since $B_j = B_j \cap A = \bigcup_{n\ge 1} (B_j \cap A_n)$ and ν satisfies (2.2),

$$ \nu(B_j) = \sum_{n=1}^\infty \nu(B_j \cap A_n) \quad \text{for all } j \ge 1. $$

Thus

$$ \sum_{j=1}^\infty |\nu(B_j)| \le \sum_{j=1}^\infty \sum_{n=1}^\infty |\nu(B_j \cap A_n)| = \sum_{n=1}^\infty \sum_{j=1}^\infty |\nu(B_j \cap A_n)|. \tag{2.6} $$

Note that for each An, {Bj ∩ An}j≥1 is a countable family of disjoint sets in F such that $A_n = \bigcup_{j\ge 1} (B_j \cap A_n)$. Hence from (2.4), it follows that $|\nu|(A_n) \ge \sum_{j=1}^\infty |\nu(B_j \cap A_n)|$ and hence, $\sum_{n=1}^\infty |\nu|(A_n) \ge \sum_{n=1}^\infty \sum_{j=1}^\infty |\nu(B_j \cap A_n)|$. From (2.6), it follows that $\sum_{n=1}^\infty |\nu|(A_n) \ge \sum_{j=1}^\infty |\nu(B_j)|$. This being true for every such family {Bj}j≥1, it follows from (2.4) that

$$ |\nu|(A) \le \sum_{i=1}^\infty |\nu|(A_i) \tag{2.7} $$

and with (2.5), this completes the proof. □

Definition 4.2.2: The measure |ν| defined by (2.4) is called the total variation measure of the signed measure ν.

Next, define the set functions

$$ \nu^+ \equiv \frac{|\nu| + \nu}{2}, \qquad \nu^- \equiv \frac{|\nu| - \nu}{2}. \tag{2.8} $$

It can be verified that ν+ and ν− are both finite measures on (Ω, F).

Definition 4.2.3: The measures ν+ and ν− are called the positive and negative variation measures of the signed measure ν, respectively.

It follows from (2.8) that

$$ \nu = \nu^+ - \nu^-. \tag{2.9} $$

Thus every finite signed measure ν on (Ω, F) is the difference of two finite measures, as claimed earlier. Note that both ν+ and ν− are dominated by |ν| and all three measures are finite. By the Radon-Nikodym theorem (Theorem 4.1.1), there exist functions h1 and h2 in L1(Ω, F, |ν|) such that

$$ \frac{d\nu^+}{d|\nu|} = h_1 \quad \text{and} \quad \frac{d\nu^-}{d|\nu|} = h_2. \tag{2.10} $$

This and (2.9) imply that for any A in F,

$$ \nu(A) = \int_A h_1\,d|\nu| - \int_A h_2\,d|\nu| = \int_A h\,d|\nu|, \tag{2.11} $$

where h = h1 − h2. Thus every finite signed measure ν on (Ω, F) can be expressed as

$$ \nu(A) = \int_A f\,d\mu, \quad A \in \mathcal{F} \tag{2.12} $$

for some finite measure µ on (Ω, F) and some f ∈ L1(Ω, F, µ). Conversely, it is easy to verify that a set function ν defined by (2.12) for some finite measure µ on (Ω, F) and some f ∈ L1(Ω, F, µ) is a finite signed measure (cf. Problem 4.6). This leads to the following:

Theorem 4.2.2:

(i) A set function ν on a measurable space (Ω, F) is a finite signed measure iff there exist two finite measures µ1 and µ2 on (Ω, F) such that ν = µ1 − µ2.

(ii) A set function ν on a measurable space (Ω, F) is a finite signed measure iff there exist a finite measure µ on (Ω, F) and an f ∈ L1(Ω, F, µ) such that for all A in F,

$$ \nu(A) = \int_A f\,d\mu. $$

Definition 4.2.4: Let ν be a finite signed measure on a measurable space (Ω, F). A set A ∈ F is called a positive set for ν if for any B ⊂ A with B ∈ F, ν(B) ≥ 0. A set A ∈ F is called a negative set for ν if for any B ⊂ A with B ∈ F, ν(B) ≤ 0.

Let h be as in (2.11). Let

$$ \Omega^+ = \{\omega : h(\omega) \ge 0\} \quad \text{and} \quad \Omega^- = \{\omega : h(\omega) < 0\}. \tag{2.13} $$

From (2.11), it follows that for all A in F, ν(A ∩ Ω+) ≥ 0 and ν(A ∩ Ω−) ≤ 0. Thus Ω+ is a positive set and Ω− is a negative set for ν. Furthermore, Ω+ ∪ Ω− = Ω and Ω+ ∩ Ω− = ∅. Summarizing this discussion, one gets the following theorem.

Theorem 4.2.3: (Hahn decomposition theorem). Let ν be a finite signed measure on a measurable space (Ω, F). Then there exist a positive set Ω+ and a negative set Ω− for ν such that Ω = Ω+ ∪ Ω− and Ω+ ∩ Ω− = ∅.

Let Ω+ and Ω− be as in (2.13). It can be verified (Problem 4.8) that for any B ∈ F, if B ⊂ Ω+, then ν(B) = |ν|(B). By (2.11), this implies that for all A in F,

$$ \int_{A \cap \Omega^+} h\,d|\nu| = |\nu|(A \cap \Omega^+). $$

It follows that h = 1 a.e. (|ν|) on Ω+. Similarly, h = −1 a.e. (|ν|) on Ω−. Thus, the measures ν+ and ν−, defined in (2.8), reduce to

$$ \nu^+(A) = \int_A \frac{1+h}{2}\,d|\nu| = \int_{A \cap \Omega^+} \frac{1+h}{2}\,d|\nu| + \int_{A \cap \Omega^-} \frac{1+h}{2}\,d|\nu| = |\nu|(A \cap \Omega^+), $$

and similarly

$$ \nu^-(A) = |\nu|(A \cap \Omega^-). $$
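On a finite set with every subset measurable, the supremum in (2.4) is attained by the partition into singletons, so |ν| is simply the sum of the absolute point masses. Under that assumption, the following sketch computes the total variation, the positive and negative variations via (2.8), and a Hahn decomposition. The function name `hahn_jordan` and the example signed measure are illustrative only.

```python
from fractions import Fraction

def hahn_jordan(nu):
    """Total variation, nu+, nu- and a Hahn decomposition on a finite set.

    nu: dict mapping each point to its (possibly negative) signed mass.
    """
    point = {w: Fraction(v) for w, v in nu.items()}
    total_var = {w: abs(v) for w, v in point.items()}             # |nu|({w})
    nu_plus = {w: (total_var[w] + point[w]) / 2 for w in point}   # (2.8)
    nu_minus = {w: (total_var[w] - point[w]) / 2 for w in point}  # (2.8)
    # h = d(nu)/d|nu| is the sign of the point mass; on |nu|-null points
    # h is arbitrary and is taken to be 0 here, which places them in Omega+.
    omega_plus = {w for w, v in point.items() if v >= 0}
    omega_minus = set(point) - omega_plus
    return total_var, nu_plus, nu_minus, omega_plus, omega_minus

# Hypothetical signed measure on four points.
nu = {"a": 3, "b": -2, "c": 0, "d": -1}
total_var, nu_plus, nu_minus, omega_plus, omega_minus = hahn_jordan(nu)
norm = sum(total_var.values())    # |nu|(Omega)
```

In this example ν+ lives on Ω+ and ν− lives on Ω−, so their point masses are never both positive at the same point.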

Note that ν+ and ν− are both finite measures that are mutually singular. This particular decomposition of ν as

$$ \nu = \nu^+ - \nu^- $$

is known as the Jordan decomposition of ν. It will now be shown that this decomposition is minimal and that it is unique in the class of signed measures with mutually singular components. Suppose there exist two finite measures µ1 and µ2 on (Ω, F) such that ν = µ1 − µ2. For any A ∈ F,

$$ \nu^+(A) = \nu(A \cap \Omega^+) = \mu_1(A \cap \Omega^+) - \mu_2(A \cap \Omega^+) \le \mu_1(A \cap \Omega^+) \le \mu_1(A) $$

and

$$ \nu^-(A) = -\nu(A \cap \Omega^-) = \mu_2(A \cap \Omega^-) - \mu_1(A \cap \Omega^-) \le \mu_2(A \cap \Omega^-) \le \mu_2(A). $$

Thus, ν+ ≤ µ1 and ν− ≤ µ2. Clearly, since both µ1 and ν+ are finite measures on (Ω, F), µ1 − ν+ is also a finite measure. Similarly, µ2 − ν− is also a finite measure. Also, since µ1 − µ2 = ν = ν+ − ν−, it follows that µ1 − ν+ = µ2 − ν− = λ, say. Thus, for any decomposition of ν as µ1 − µ2 with µ1, µ2 finite measures, it holds that µ1 = ν+ + λ and µ2 = ν− + λ, where λ is a measure on (Ω, F). Thus, ν = ν+ − ν− is a minimal decomposition in the sense that in this case λ = 0. Now suppose µ1 and µ2 are mutually singular, i.e., there exist Ω1, Ω2 ∈ F such that Ω1 ∩ Ω2 = ∅, Ω1 ∪ Ω2 = Ω, and µ1(Ω2) = 0 = µ2(Ω1). Since µ1 ≥ λ and µ2 ≥ λ, it follows that λ(Ω2) = 0 = λ(Ω1). Thus λ = 0 and µ1 = ν+ and µ2 = ν−. Summarizing the above discussion yields:

Theorem 4.2.4: Let ν be a finite signed measure on a measurable space (Ω, F) and let µ1 and µ2 be two finite measures on (Ω, F) such that ν = µ1 − µ2. Then there exists a finite measure λ such that µ1 = ν+ + λ and µ2 = ν− + λ with λ = 0 iff µ1 and µ2 are mutually singular.

Let S ≡ {ν : ν is a finite signed measure on (Ω, F)}. Also, for any α ∈ R, let α+ = max(α, 0) and α− = max(−α, 0). Note that for ν1, ν2 in S and α1, α2 ∈ R,

$$ \begin{aligned} \alpha_1\nu_1 + \alpha_2\nu_2 &= (\alpha_1^+ - \alpha_1^-)(\nu_1^+ - \nu_1^-) + (\alpha_2^+ - \alpha_2^-)(\nu_2^+ - \nu_2^-) \\ &= (\alpha_1^+\nu_1^+ + \alpha_1^-\nu_1^- + \alpha_2^+\nu_2^+ + \alpha_2^-\nu_2^-) - (\alpha_1^+\nu_1^- + \alpha_1^-\nu_1^+ + \alpha_2^+\nu_2^- + \alpha_2^-\nu_2^+) \\ &= \lambda_1 - \lambda_2, \quad \text{say}, \end{aligned} $$

where λ1 and λ2 are both finite measures. It now follows from Theorem 4.2.2 that α1ν1 + α2ν2 ∈ S. Thus, S is a vector space over R. Now it will be shown that ‖ν‖ ≡ |ν|(Ω) is a norm on S and that (S, ‖·‖) is a Banach space.

Definition 4.2.5: For a finite signed measure ν on a measurable space (Ω, F), the total variation norm ‖ν‖ is defined by ‖ν‖ ≡ |ν|(Ω).

Proposition 4.2.5: Let S ≡ {ν : ν a finite signed measure on (Ω, F)}. Then, ‖ν‖ ≡ |ν|(Ω), ν ∈ S is a norm on S.


Proof: Let ν1, ν2 ∈ S, α1, α2 ∈ R and λ = α1ν1 + α2ν2. For any A ∈ F and any disjoint {Ai}i≥1 ⊂ F with $A = \bigcup_{i\ge 1} A_i$,

$$ |\lambda(A_i)| \le |\alpha_1||\nu_1(A_i)| + |\alpha_2||\nu_2(A_i)| \quad \text{for all } i \ge 1 $$

$$ \Rightarrow\ \sum_{i\ge 1} |\lambda(A_i)| \le |\alpha_1| \sum_{i\ge 1} |\nu_1(A_i)| + |\alpha_2| \sum_{i\ge 1} |\nu_2(A_i)| \le |\alpha_1||\nu_1|(A) + |\alpha_2||\nu_2|(A). $$

Taking supremum over all {Ai}i≥1 yields

$$ |\lambda|(A) \le |\alpha_1||\nu_1|(A) + |\alpha_2||\nu_2|(A), \quad \text{i.e.,} \quad |\lambda|(\cdot) \le |\alpha_1||\nu_1|(\cdot) + |\alpha_2||\nu_2|(\cdot) $$

$$ \Rightarrow\ \|\lambda\| \equiv |\lambda|(\Omega) \le |\alpha_1||\nu_1|(\Omega) + |\alpha_2||\nu_2|(\Omega) = |\alpha_1|\|\nu_1\| + |\alpha_2|\|\nu_2\|. $$

Taking α1 = α2 = 1 yields ‖ν1 + ν2‖ ≤ ‖ν1‖ + ‖ν2‖, i.e., the triangle inequality holds. Next taking α2 = 0 yields ‖α1ν1‖ ≤ |α1|‖ν1‖. To get the opposite inequality, note that for α1 ≠ 0, $\nu_1 = \frac{1}{\alpha_1}\alpha_1\nu_1$ and so $\|\nu_1\| \le |\frac{1}{\alpha_1}|\,\|\alpha_1\nu_1\|$. Hence, |α1|‖ν1‖ ≤ ‖α1ν1‖. Thus, for any α1 ≠ 0, ‖α1ν‖ = |α1|‖ν‖. Finally, ‖ν‖ = 0 ⇒ |ν|(Ω) = 0 ⇒ |ν|(A) = 0 for all A ∈ F ⇒ ν(A) = 0 for all A ∈ F, i.e., ν is the zero measure. Thus ‖·‖ is a norm on S. □

Proposition 4.2.6: (S, ‖·‖) is complete.

Proof: Let {νn}n≥1 be a Cauchy sequence in (S, ‖·‖). Note that for each A ∈ F, |νn(A) − νm(A)| ≤ |νn − νm|(A) ≤ ‖νn − νm‖. Hence, for each A ∈ F, {νn(A)}n≥1 is a Cauchy sequence in R and hence

$$ \nu(A) \equiv \lim_{n\to\infty} \nu_n(A) \quad \text{exists.} $$

It will be shown that ν(·) is a finite signed measure and ‖νn − ν‖ → 0 as n → ∞. Let {Ai}i≥1 ⊂ F, Ai ∩ Aj = ∅ for i ≠ j, and $A = \bigcup_{i\ge 1} A_i$. Let xn ≡ {νn(Ai)}i≥1, n ≥ 1, and let x0 ≡ {ν(Ai)}i≥1. Note that each xn ∈ ℓ1 where $\ell_1 \equiv \{x : x = \{x_i\}_{i\ge 1},\ x_i \in \mathbb{R},\ \sum_{i\ge 1} |x_i| < \infty\}$. For x ∈ ℓ1, let $\|x\|_1 = \sum_{i=1}^\infty |x_i|$. Then $\|x_n - x_m\|_1 = \sum_{i\ge 1} |\nu_n(A_i) - \nu_m(A_i)| \le |\nu_n - \nu_m|(A) \le |\nu_n - \nu_m|(\Omega) \to 0$ as n, m → ∞. But ℓ1 is complete under ‖·‖1. So there exists x* ∈ ℓ1 such that ‖xn − x*‖1 → 0. Since xni ≡ νn(Ai) → ν(Ai) for all i ≥ 1, it follows that x*i = ν(Ai) for all i ≥ 1 and that $\sum_{i=1}^\infty |\nu(A_i)| < \infty$. Also, for all n ≥ 1, $\nu_n(A) = \sum_{i=1}^\infty \nu_n(A_i)$. Since $\sum_{i\ge 1} |\nu_n(A_i) - \nu(A_i)| \to 0$, $\nu_n(A) \equiv \sum_{i\ge 1} \nu_n(A_i) \to \sum_{i\ge 1} \nu(A_i)$ as n → ∞. But νn(A) → ν(A). Thus, $\nu(A) = \sum_{i=1}^\infty \nu(A_i)$. Also for any countable partition {Ai}i≥1 ⊂ F of Ω,

$$ \sum_{i=1}^\infty |\nu(A_i)| = \lim_{n\to\infty} \sum_{i=1}^\infty |\nu_n(A_i)| \le \lim_{n\to\infty} \|\nu_n\| < \infty. $$

Thus |ν|(Ω) < ∞ and hence, ν ∈ S. Finally,

$$ \|\nu_n - \nu\| = \sup\Big\{ \sum_{i=1}^\infty |\nu_n(A_i) - \nu(A_i)| : \{A_i\}_{i\ge 1} \subset \mathcal{F} \text{ is a disjoint partition of } \Omega \Big\}. $$

But for every countable partition {Ai}i≥1 ⊂ F,

$$ \sum_{i=1}^\infty |\nu_n(A_i) - \nu(A_i)| = \lim_{m\to\infty} \sum_{i=1}^\infty |\nu_n(A_i) - \nu_m(A_i)| \le \lim_{m\to\infty} \|\nu_n - \nu_m\|. $$

Thus, $\|\nu_n - \nu\| \le \lim_{m\to\infty} \|\nu_n - \nu_m\|$ and hence, $\lim_{n\to\infty} \|\nu_n - \nu\| \le \lim_{n\to\infty}\lim_{m\to\infty} \|\nu_n - \nu_m\| = 0$. Hence, νn → ν in S. □

Definition 4.2.6: (Integration w.r.t. signed measures). Let µ be a finite signed measure on a measurable space (Ω, F) and |µ| be its total variation measure as in Definition 4.2.2. Then, for any f ∈ L1(Ω, F, |µ|), ∫ f dµ is defined as

$$ \int f\,d\mu = \int f\,d\mu^+ - \int f\,d\mu^-, $$

where µ+ and µ− are the positive and negative variations of µ as defined in (2.8).

Proposition 4.2.7: Let µ be a finite signed measure on a measurable space (Ω, F). Let µ = λ1 − λ2 where λ1 and λ2 are finite measures. Let f ∈ L1(Ω, F, λ1 + λ2). Then f ∈ L1(Ω, F, |µ|) and

$$ \int f\,d\mu = \int f\,d\lambda_1 - \int f\,d\lambda_2. \tag{2.14} $$

Proof: Left as an exercise (Problem 4.13).

4.3 Functions of bounded variation

From the construction of the Lebesgue-Stieltjes measures on (R, B(R)) discussed in Chapter 1, it is seen that to every nondecreasing right continuous function F : R → R, there is a (Radon) measure µF on (R, B(R)) such that µF((a, b]) = F(b) − F(a) for all a < b and conversely. If µ1 and µ2 are two Radon measures, generated by F1 and F2 respectively, and µ = µ1 − µ2, let

$$ G(x) \equiv \begin{cases} \mu((0, x]) & \text{for } x > 0, \\ -\mu((x, 0]) & \text{for } x < 0, \\ 0 & \text{for } x = 0 \end{cases} \;=\; \begin{cases} F_1(x) - F_2(x) - (F_1(0) - F_2(0)) & \text{for } x > 0, \\ -\big[(F_1(0) - F_2(0)) - (F_1(x) - F_2(x))\big] & \text{for } x < 0, \\ 0 & \text{for } x = 0. \end{cases} $$

Thus to every finite signed measure µ on (R, B(R)), there corresponds a function G(·) that is the difference of two right continuous nondecreasing and bounded functions. The converse is also easy to establish. A characterization of such a function G(·) without any reference to measures is given below.

Definition 4.3.1: Let f : [a, b] → R, where −∞ < a < b < ∞. Then for any partition Q = {a = x0 < x1 < x2 < . . . < xn = b}, n ∈ N, the positive, negative and total variations of f with respect to Q are respectively defined as

$$ P(f, Q) \equiv \sum_{i=1}^n \big(f(x_i) - f(x_{i-1})\big)^+, \quad N(f, Q) \equiv \sum_{i=1}^n \big(f(x_i) - f(x_{i-1})\big)^-, \quad T(f, Q) \equiv \sum_{i=1}^n |f(x_i) - f(x_{i-1})|. $$

It is easy to verify that (i) if f is nondecreasing, then

$$ P(f, Q) = T(f, Q) = f(b) - f(a) \quad \text{and} \quad N(f, Q) = 0 $$

and that (ii) for any f, P(f, Q) + N(f, Q) = T(f, Q).

Definition 4.3.2: Let f : [a, b] → R, where −∞ < a < b < ∞. The positive, negative and total variations of f over [a, b] are respectively defined as

$$ P(f, [a, b]) \equiv \sup_Q P(f, Q), \quad N(f, [a, b]) \equiv \sup_Q N(f, Q), \quad T(f, [a, b]) \equiv \sup_Q T(f, Q), $$


where the supremum in each case is taken over all finite partitions Q of [a, b].

Definition 4.3.3: Let f : [a, b] → R, where −∞ < a < b < ∞. Then, f is said to be of bounded variation on [a, b] if T(f, [a, b]) < ∞. The set of all such functions is denoted by BV[a, b].

As remarked earlier, if f is nondecreasing, then T(f, Q) = f(b) − f(a) for each Q and hence T(f, [a, b]) = f(b) − f(a). It follows that if f = f1 − f2, where both f1 and f2 are nondecreasing, then f ∈ BV[a, b]. A natural question is whether the converse is true. The answer is yes, as shown by the following result.

Theorem 4.3.1: Let f ∈ BV[a, b]. Let f1(x) ≡ P(f, [a, x]) and f2(x) ≡ N(f, [a, x]). Then f1 and f2 are nondecreasing in [a, b] and for all a ≤ x ≤ b, f(b) − f(a) = P(f, [a, b]) − N(f, [a, b]) holds with b replaced by x, i.e.,

$$ f(x) - f(a) = f_1(x) - f_2(x). $$

Proof: That f1 and f2 are nondecreasing follows from the definition. It is enough to verify that if f ∈ BV[a, b], then f(b) − f(a) = P(f, [a, b]) − N(f, [a, b]), as this can be applied to [a, x] for a ≤ x < b. For each finite partition Q of [a, b],

$$ f(b) - f(a) = \sum_{i=1}^n \big(f(x_i) - f(x_{i-1})\big) = P(f, Q) - N(f, Q). $$

Thus P(f, Q) = f(b) − f(a) + N(f, Q). By taking supremum over all finite partitions Q, it follows that P(f, [a, b]) = f(b) − f(a) + N(f, [a, b]). If f ∈ BV[a, b], this yields f(b) − f(a) = P(f, [a, b]) − N(f, [a, b]). □

Remark 4.3.1: Since T(f, Q) = P(f, Q) + N(f, Q) = 2P(f, Q) − (f(b) − f(a)), it follows that if f ∈ BV[a, b], then

$$ T(f, [a, b]) = 2P(f, [a, b]) - (f(b) - f(a)) = P(f, [a, b]) + N(f, [a, b]). $$

Corollary 4.3.2: A function f ∈ BV[a, b] iff there exists a finite signed measure µ on (R, B(R)) such that f(x) = µ([a, x]), a ≤ x ≤ b.

The proof of this corollary is left as an exercise.
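The partition sums of Definition 4.3.1 and the identities used in the proof of Theorem 4.3.1 are easy to compute for a concrete function. The sketch below is an illustration only: the function `variations` and the example f(x) = (x − 1)^2 on [0, 3] are not from the text. The partition contains the turning point x = 1, so for this particular f the sums already equal the variations over [0, 3].

```python
def variations(f, partition):
    """Return (P, N, T) of f with respect to a partition x0 < x1 < ... < xn."""
    diffs = [f(b) - f(a) for a, b in zip(partition, partition[1:])]
    positive = sum(max(d, 0) for d in diffs)    # P(f, Q)
    negative = sum(max(-d, 0) for d in diffs)   # N(f, Q)
    total = sum(abs(d) for d in diffs)          # T(f, Q)
    return positive, negative, total

def f(x):
    return (x - 1) ** 2

Q = [0, 1, 2, 3]
P, N, T = variations(f, Q)
```

Here P − N reproduces f(3) − f(0) and P + N reproduces T, matching the two identities P(f, Q) − N(f, Q) = f(b) − f(a) and P(f, Q) + N(f, Q) = T(f, Q).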


Remark 4.3.2: Some observations on functions of bounded variation are listed below.

(a) Let f = I_Q where Q is the set of rationals. Then for any [a, b], a < b, P(f, [a, b]) = N(f, [a, b]) = ∞ and so f ∉ BV[a, b]. This holds for f = I_D for any set D such that both D and D^c are dense in R.

(b) Let f be Lipschitz on [a, b]. That is, |f(x) − f(y)| ≤ K|x − y| for all x, y in [a, b] where K ∈ (0, ∞) is a constant. Then, f ∈ BV[a, b].

(c) Let f be differentiable in (a, b) and continuous on [a, b] and f′(·) be bounded in (a, b). Then by the mean value theorem, f is Lipschitz and hence, f is in BV[a, b].

(d) Let $f(x) = x^2 \sin\frac{1}{x}$, 0 < x ≤ 1, and let f(0) = 0. Then f is continuous on [0, 1], differentiable on (0, 1) with f′ bounded on (0, 1), and hence f ∈ BV[0, 1].

(e) Let $g(x) = x^2 \sin\frac{1}{x^2}$, 0 < x ≤ 1, g(0) = 0. Then g is continuous on [0, 1], differentiable on (0, 1) but g′ is not bounded on (0, 1). This by itself does not imply that g ∉ BV[0, 1], since being Lipschitz is only a sufficient condition. But it turns out that g ∉ BV[0, 1]. To see this, let $x_n = \big((2n+1)\frac{\pi}{2}\big)^{-1/2}$, n = 0, 1, 2, . . .. Then

$$ \sum_{i=1}^n |g(x_i) - g(x_{i-1})| \ge \sum_{i=1}^n \frac{1}{(2i+1)\frac{\pi}{2}} $$

and hence T(g, [0, 1]) = ∞.

(f) It is known (see Royden (1988), Chapter 4) that if f : [a, b] → R is nondecreasing, then f is differentiable a.e. (m) on (a, b) and $\int_{[a,b]} f'\,dm \le f(b) - f(a)$, where (m) denotes the Lebesgue measure. This implies that if f ∈ BV[a, b], then f is differentiable a.e. (m) on (a, b) and so, $\int_{[a,b]} |f'|\,dm \le T(f, [a, b])$.
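The divergence claimed in part (e) can be observed numerically. The following sketch, an illustration in floating point rather than a proof, evaluates the sums of |g(x_i) − g(x_{i−1})| along the points x_n = ((2n+1)π/2)^{−1/2} and compares them with the harmonic-type lower bound from the remark. The helper names `partial_variation` and `lower_bound` are hypothetical.

```python
import math

def g(x):
    """g(x) = x^2 sin(1/x^2) for x > 0 and g(0) = 0."""
    return x * x * math.sin(1.0 / (x * x)) if x > 0 else 0.0

def partial_variation(n):
    """Sum of |g(x_i) - g(x_{i-1})| for i = 1..n along x_i = ((2i+1)pi/2)^(-1/2)."""
    xs = [((2 * i + 1) * math.pi / 2) ** -0.5 for i in range(n + 1)]
    return sum(abs(g(xs[i]) - g(xs[i - 1])) for i in range(1, n + 1))

def lower_bound(n):
    """The lower bound sum_{i=1}^n 1 / ((2i+1) pi/2) from Remark 4.3.2 (e)."""
    return sum(1.0 / ((2 * i + 1) * math.pi / 2) for i in range(1, n + 1))

sums = {n: (partial_variation(n), lower_bound(n)) for n in (10, 100, 1000)}
```

Both columns grow without bound as n increases, roughly like a constant times log n, which is consistent with T(g, [0, 1]) = ∞.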

4.4 Absolutely continuous functions on R

Definition 4.4.1: A function F : R → R is absolutely continuous (a.c.) if for all ε > 0, there exists δ > 0 such that if Ij = [aj, bj], j = 1, 2, . . . , k (k ∈ N) are disjoint and $\sum_{j=1}^k (b_j - a_j) < \delta$, then $\sum_{j=1}^k |F(b_j) - F(a_j)| < \epsilon$.

By the mean value theorem, it follows that if F is differentiable and F′(·) is bounded, then F is a.c. Also note that F is a.c. implies F is uniformly continuous.


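Definition 4.4.2: A function F : [a, b] → R is absolutely continuous if the function F̃, defined by

$$ \tilde F(x) = \begin{cases} F(x) & \text{if } a \le x \le b, \\ F(a) & \text{if } x < a, \\ F(b) & \text{if } x > b, \end{cases} $$

is absolutely continuous.

Thus F(x) = x is a.c. on R. Any polynomial is a.c. on any bounded interval but not necessarily on all of R. For example, F(x) = x^2 is a.c. on any bounded interval but not a.c. on R, since it is not uniformly continuous on R.

The main result of this section is the following result due to H. Lebesgue, known as the fundamental theorem of Lebesgue integral calculus.

Theorem 4.4.1: A function F : [a, b] → R is absolutely continuous iff there is a function f : [a, b] → R such that f is Lebesgue measurable and integrable w.r.t. m and such that

$$ F(x) = F(a) + \int_{[a,x]} f\,dm, \quad \text{for all } a \le x \le b \tag{4.1} $$

where m is the Lebesgue measure.

Proof: First consider the "if part." Suppose that (4.1) holds. Since $\int_{[a,b]} |f|\,dm < \infty$, for any ε > 0, there exists a δ > 0 such that (cf. Proposition 2.5.8)

$$ m(A) < \delta \ \Rightarrow\ \int_A |f|\,dm < \epsilon. \tag{4.2} $$

Thus, if Ij = (aj, bj) ⊂ [a, b], j = 1, 2, . . . , k are disjoint and such that $\sum_{j=1}^k (b_j - a_j) < \delta$, then

$$ \sum_{j=1}^k |F(b_j) - F(a_j)| \le \sum_{j=1}^k \int_{I_j} |f|\,dm = \int_{\bigcup_{j=1}^k I_j} |f|\,dm < \epsilon. $$

Thus, F is a.c.

Next consider the "only if part." An a.c. function on [a, b] is of bounded variation and its positive and negative variations are again a.c., so it suffices to treat the case where F is a.c. and nondecreasing. Let µF be the Lebesgue-Stieltjes measure generated by F̃. Fix ε > 0. Let δ > 0 be chosen so that

$$ (a_j, b_j) \subset [a, b],\ j = 1, 2, \ldots, k, \quad \sum_{j=1}^k (b_j - a_j) < \delta \ \Rightarrow\ \sum_{j=1}^k |F(b_j) - F(a_j)| < \epsilon. $$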


Let A ∈ Mm, A ⊂ (a, b), and m(A) = 0. Then, there exists a countable collection of disjoint open intervals {Ij = (aj, bj) : Ij ⊂ [a, b]}j≥1 such that

$$ A \subset \bigcup_{j\ge 1} I_j \quad \text{and} \quad \sum_{j\ge 1} (b_j - a_j) < \delta. $$

Thus

$$ \mu_F\Big(A \cap \bigcup_{j=1}^k I_j\Big) \le \mu_F\Big(\bigcup_{j=1}^k I_j\Big) \le \sum_{j=1}^k \mu_F(I_j) = \sum_{j=1}^k \big(F(b_j) - F(a_j)\big) < \epsilon $$

for all k ∈ N. Since $A \subset \bigcup_{j\ge 1} I_j$, by the m.c.f.b. property of µF(·), $\mu_F(A) = \lim_{k\to\infty} \mu_F\big(A \cap \bigcup_{j=1}^k I_j\big) \le \epsilon$. This being true for any ε > 0, it follows that µF(A) = 0. Since F is continuous, µF({a, b}) = 0 and hence µF((a, b)^c) = 0. Thus, µF ≪ m, i.e., µF is dominated by m. Now, by the Radon-Nikodym theorem (cf. Theorem 4.1.1 (ii)), there exists a nonnegative measurable function f such that A ∈ Mm implies that $\mu_F(A) = \int_{A \cap [a,b]} f\,dm$ and, in particular, for a ≤ x ≤ b,

$$ \mu_F([a, x]) = F(x) - F(a) = \int_{[a,x]} f\,dm, $$

i.e., (4.1) holds. □
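Absolute continuity in the sense of Definition 4.4.1 is strictly stronger than continuity, and the Cantor function (cf. Section 4.5) is the standard example of a continuous nondecreasing function that fails it. The sketch below, which uses exact rational arithmetic, takes the 2^n closed intervals left after n steps of the Cantor construction: their total length (2/3)^n can be made smaller than any δ, yet the increments of the Cantor function over them always add up to 1. The helper names `cantor_function` and `cantor_intervals` are illustrative, and the evaluation routine is only meant for rationals whose denominator is a power of 3, such as the interval endpoints used here.

```python
from fractions import Fraction

def cantor_function(x):
    """Cantor function at a rational x in [0, 1] with power-of-3 denominator."""
    x = Fraction(x)
    value, weight = Fraction(0), Fraction(1)
    for _ in range(200):                      # guard against non-terminating input
        if x == 0:
            return value
        if x == 1:
            return value + weight
        if x < Fraction(1, 3):                # left third: rescale, halve weight
            x, weight = 3 * x, weight / 2
        elif x > Fraction(2, 3):              # right third: add half, rescale
            value, x, weight = value + weight / 2, 3 * x - 2, weight / 2
        else:                                 # closed middle third: constant value
            return value + weight / 2
    raise ValueError("x does not terminate; expected a power-of-3 denominator")

def cantor_intervals(n):
    """The 2^n disjoint closed intervals remaining after n deletion steps."""
    intervals = [(Fraction(0), Fraction(1))]
    for _ in range(n):
        refined = []
        for a, b in intervals:
            third = (b - a) / 3
            refined += [(a, a + third), (b - third, b)]
        intervals = refined
    return intervals

n = 5
intervals = cantor_intervals(n)
total_length = sum(b - a for a, b in intervals)
total_increment = sum(cantor_function(b) - cantor_function(a) for a, b in intervals)
```

With n = 5 the intervals have total length (2/3)^5 while the increments still sum to 1, so no δ can work for ε < 1; by Theorem 4.4.1 the Cantor function therefore admits no representation of the form (4.1).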

The representation (4.1) of an absolutely continuous F can be strengthened as follows:

Theorem 4.4.2: Let F : R → R satisfy (4.1). Then F is differentiable a.e. (m) and F′(·) = f(·) a.e. (m).

For a proof of this result, see Royden (1988), Chapter 4.

The relation between the notion of absolute continuity of a distribution function F : R → R and that of the associated Lebesgue-Stieltjes measure µF w.r.t. Lebesgue measure m will be discussed now. Let F : R → R be a distribution function, i.e., F is nondecreasing and right continuous. Let µF be the associated Lebesgue-Stieltjes measure such that µF((a, b]) = F(b) − F(a) for all −∞ < a < b < ∞. Recall that F is said to be absolutely continuous on an interval [a, b] if for each ε > 0, there exists a δ > 0 such that for any finite collection of intervals Ij = (aj, bj), j = 1, 2, . . . , n, contained in [a, b],

$$ \sum_{j=1}^n (b_j - a_j) < \delta \ \Rightarrow\ \sum_{j=1}^n \big(F(b_j) - F(a_j)\big) < \epsilon. $$


Recall also that µF is absolutely continuous w.r.t. the Lebesgue measure m if for A ∈ B(R), m(A) = 0 ⇒ µF(A) = 0. A natural question is that if F is absolutely continuous on every interval [a, b] ⊂ R, is µF absolutely continuous w.r.t. m and conversely? The answer is yes.

Theorem 4.4.3: Let F : R → R be a nondecreasing function and let µF be the associated Lebesgue-Stieltjes measure. Then F is absolutely continuous on [a, b] for all −∞ < a < b < ∞ iff µF ≪ m where m is the Lebesgue measure on (R, B(R)).

Proof: Suppose that µF ≪ m. Then by Theorem 4.1.1, there exists a nonnegative measurable function h such that

$$ \mu_F(A) = \int_A h\,dm \quad \text{for all } A \text{ in } \mathcal{B}(\mathbb{R}). $$

Hence, for any a < b in R and a < x < b,

$$ F(x) - F(a) \equiv \mu_F((a, x]) = \int_{(a,x]} h\,dm. $$

This implies the absolute continuity of F on [a, b] as shown in Theorem 4.4.1. Conversely, if F is absolutely continuous on [a, b] for all −∞ < a < b < ∞, then as shown in the proof of the "only if" part of Theorem 4.4.1, for all −∞ < a < b < ∞, µF(A ∩ [a, b]) = 0 if m(A ∩ [a, b]) = 0. Thus, if m(A) = 0, then for all −∞ < a < b < ∞, m(A ∩ [a, b]) = 0 and hence µF(A ∩ [a, b]) = 0 and hence µF(A) = 0, i.e., µF ≪ m. □

Recall that a measure µ on (R^k, B(R^k)) is a Radon measure if µ(A) < ∞ for every bounded Borel set A. In the following, let m(·) denote the Lebesgue measure on R^k.

Definition 4.4.3: A Radon measure µ on (R^k, B(R^k)) is differentiable at x ∈ R^k with derivative (Dµ)(x) if for any ε > 0, there is a δ > 0 such that

$$ \Big| \frac{\mu(A)}{m(A)} - (D\mu)(x) \Big| < \epsilon $$

for every open ball A such that x ∈ A and diam.(A) ≡ sup{‖x − y‖ : x, y ∈ A}, the diameter of A, is less than δ.

Theorem 4.4.4: Let µ be a Radon measure on (R^k, B(R^k)). Then

(i) µ is differentiable a.e. (m), Dµ(·) is Lebesgue measurable, and ≥ 0 a.e. (m) and for all bounded Borel sets A ∈ B(R^k),

$$ \int_A D\mu(\cdot)\,dm \le \mu(A). $$


(ii) Let $\mu_a(A) \equiv \int_A D\mu(\cdot)\,dm$, A ∈ B(R^k). Let µs(·) be the unique measure on B(R^k) such that for all bounded Borel sets A

$$ \mu_s(A) = \mu(A) - \mu_a(A). $$

Then

$$ \mu_s \perp m \quad \text{and} \quad D\mu_s(\cdot) = 0 \quad \text{a.e. } (m). $$

For a proof, see Rudin (1987).

Remark 4.4.1: By the uniqueness of the Lebesgue decomposition, it follows that a Radon measure µ on B(R^k) is ⊥ m iff Dµ(·) = 0 a.e. (m) and is ≪ m iff $\mu(A) = \int_A D\mu(\cdot)\,dm$ for all A ∈ B(R^k).

Let f : R^k → R+ be integrable w.r.t. m on bounded sets. Let $\mu(A) \equiv \int_A f\,dm$ for A ∈ B(R^k). Then µ(·) is a Radon measure that is ≪ m and hence by Theorem 4.4.4

$$ D\mu(x) = f(x) \quad \text{for almost all } x\ (m). $$

That is, for almost all x (m), for each ε > 0, there is a δ > 0 such that

$$ \Big| \frac{1}{m(A)} \int_A f\,dm - f(x) \Big| < \epsilon $$

for all open balls A such that x ∈ A and diam.(A) < δ. It turns out that a stronger result holds.

Theorem 4.4.5: For almost all x (m), for each ε > 0, there is a δ > 0 such that

$$ \frac{1}{m(A)} \int_A |f - f(x)|\,dm < \epsilon $$

for all open balls A such that x ∈ A and diam.(A) < δ (see Problems 4.23, 4.24).

Theorem 4.4.6: (Change of variables formula in R^k, k > 1). Let V be an open set in R^k. Let T ≡ (T1, T2, . . . , Tk) be a map from R^k → R^k such that for each i, Ti : R^k → R and $\frac{\partial T_i(\cdot)}{\partial x_j}$ exists on V for all 1 ≤ i, j ≤ k. Suppose that the Jacobian $J_T(\cdot) \equiv \det\big(\frac{\partial T_i(\cdot)}{\partial x_j}\big)$ is continuous and positive on V. Suppose further that T(V) is a bounded open set W in R^k and that T is (1 − 1) and T^{−1} is continuous. Then

(i) For every Borel set E ⊂ V, T(E) is a Borel set ⊂ W.

(ii) ν(·) ≡ m(T(·)) is a measure on B(V) and ν ≪ m with

$$ \frac{d\nu}{dm}(\cdot) = J_T(\cdot). $$

(iii) For any h ∈ L1(W, m),

$$ \int_W h\,dm = \int_V h\big(T(\cdot)\big)\,J_T(\cdot)\,dm. $$

(iv) λ(·) ≡ mT^{−1}(·) is a measure on B(W) and λ ≪ m with

$$ \frac{d\lambda}{dm} = \big| J_T\big(T^{-1}(\cdot)\big) \big|^{-1}. $$

(v) For any µ ≪ m on B(V) the measure ψ(·) ≡ µT^{−1}(·) is dominated by m with

$$ \frac{d\psi}{dm}(\cdot) = \frac{d\mu}{dm}\big(T^{-1}(\cdot)\big)\,\Big( J_T\big(T^{-1}(\cdot)\big) \Big)^{-1} \quad \text{on } W. $$

For a proof see Rudin (1987), Chapter 7.
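Part (iii) of Theorem 4.4.6 can be sanity-checked numerically on a simple example. The sketch below assumes V = (0, 1) × (0, 1) and the map T(u, v) = (u^2, v), which is one-to-one from V onto W = (0, 1) × (0, 1), has the continuous inverse (x, y) ↦ (√x, y), and has Jacobian 2u > 0 on V. The map, the integrand h(x, y) = x + y and the midpoint-rule helper are all illustrative choices, not taken from the text, and the midpoint rule only approximates the Lebesgue integrals.

```python
def midpoint_integral(func, n=200):
    """Midpoint-rule approximation of the integral of func over (0,1) x (0,1)."""
    step = 1.0 / n
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += func((i + 0.5) * step, (j + 0.5) * step)
    return total * step * step

def T(u, v):
    """Hypothetical change of variables from V onto W."""
    return (u * u, v)

def jacobian(u, v):
    """Determinant of the matrix of partial derivatives of T at (u, v)."""
    return 2.0 * u

def h(x, y):
    return x + y

lhs = midpoint_integral(h)                                          # integral of h over W
rhs = midpoint_integral(lambda u, v: h(*T(u, v)) * jacobian(u, v))  # integral of h(T) J_T over V
```

Both sides approximate the exact common value 1: the left side is the integral of x + y over the unit square, and the right side is the integral of (u^2 + v) · 2u, which splits as 1/2 + 1/2.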

4.5 Singular distributions

4.5.1 Decomposition of a cdf

Recall that a cumulative distribution function (cdf) on R is a function F : R → [0, 1] such that it is nondecreasing, right continuous, F(−∞) = 0, F(∞) = 1. In Section 2.2, it was shown that any cdf F on R can be written as

$$ F = \alpha F_d + (1 - \alpha) F_c, \tag{5.1} $$

where Fd and Fc are discrete and continuous cdfs respectively. In this section, the cdf Fc will be further decomposed into a singular continuous cdf and an absolutely continuous cdf.

Definition 4.5.1: A cdf F is singular if F′ ≡ 0 almost everywhere w.r.t. the Lebesgue measure on R.

Example 4.5.1: The cdfs of Binomial, Poisson, or any integer valued random variables are singular.

It is known (cf. Royden (1988), Chapter 5) that a monotone function F : R → R is differentiable almost everywhere w.r.t. the Lebesgue measure and its derivative F′ satisfies

$$ \int_a^b F'(x)\,dx \le F(b) - F(a), \tag{5.2} $$

for any −∞ < a < b < ∞.


For x ∈ R, let $\tilde F_{ac}(x) \equiv \int_{-\infty}^x F_c'(t)\,dt$ and $\tilde F_{sc}(x) \equiv F_c(x) - \tilde F_{ac}(x)$. If $\tilde\beta \equiv \int_{-\infty}^\infty F_c'(t)\,dt = \tilde F_{ac}(\infty) = 0$, then $F_c'(t) = 0$ a.e. and so Fc is singular continuous. If $\tilde\beta = 1$, then $F_c = \tilde F_{ac}$ and so, Fc is absolutely continuous. If 0 < α < 1 and $0 < \tilde\beta < 1$, then F can be written as

$$ F = \alpha F_d + \beta F_{ac} + \gamma F_{sc}, \tag{5.3} $$

where $\beta = (1 - \alpha)\tilde\beta$, $\gamma = (1 - \alpha)(1 - \tilde\beta)$, $F_{ac} = \tilde\beta^{-1} \tilde F_{ac}$, $F_{sc} = (1 - \tilde\beta)^{-1} \tilde F_{sc}$, and Fd is as in (5.1). Note that Fd, Fac, Fsc are all cdfs and α, β, γ are nonnegative numbers adding up to 1. Summarizing the above discussions, one has the following:

Proposition 4.5.1: Given any cdf F, there exist nonnegative constants α, β, γ and cdfs Fd, Fac, Fsc satisfying (a) α + β + γ = 1, and (b) Fd is discrete, Fac is absolutely continuous, Fsc is singular continuous, such that the decomposition (5.3) holds.

It can be shown that the constants α, β, and γ are uniquely determined, and that when 0 < α < 1, the decomposition (5.1) is unique, and that when 0 < α, β, γ < 1, the decomposition (5.3) is unique. The decomposition (5.3) also has a probabilistic interpretation. Any random variable X can be realized as a randomized choice over three random variables Xd, Xac, and Xsc having cdfs Fd, Fac, and Fsc, respectively, and with corresponding randomization probabilities α, β, and γ. For more details see Problem 6.15 in Chapter 6.

4.5.2

Cantor ternary set

Recall the construction of the Cantor set from Section 1.3. Let I0 = [0, 1] denote the unit interval. If one deletes the open " middle # third of" I0 ,# then one gets two disjoint closed intervals I11 = 0, 13 and I12 = 23 , 1 . Proceeding similarly with" the# closed intervals " 2 1 # I11 and " 2 I712#, 1 one gets " four # disjoint intervals I21 = 0, 9 , I22 = 9 , 3 , I23 = 3 , 9 , I24 = 89 , 1 , and so on. Thus, at each step, deleting the open middle third of the closed intervals constructed in the previous step, one is left with 2n 2n disjoint closed intervals each of length 3−n after n steps. Let Cn = j=1 Inj ∞ and C = n=1 Cn . By construction Cn+1 ⊂ Cn for each n and Cn ’s are closed sets. With m(·) ndenoting Lebesgue measure, one has m(C0 ) = 1 and m(Cn ) = 2n 3−n = 23 . ∞ Deﬁnition 4.5.2: The set C ≡ n=1 Cn is called the Cantor ternary set or simply the Cantor set. n Since m(C0 ) = 1, by m.c.f.a. m(C) = limn→∞ m(Cn ) = limn→∞ 23 = 0. Thus, the Cantor set C has zero Lebesgue measure. Next, let U1 = U11 = 13 , 23 be the deleted interval at the ﬁrst stage, U2 = U21 ∪ U22 =

4.5 Singular distributions

135

1

7 8 ∪ 9 , 9 be the union of the deleted intervals at the second stage, and 2n−1 ∞ 2n−1 similarly, Un = j=1 Unj at stage n. Thus C c = U = n=1 j=1 Unj is open and m(C c ) = 1. Since C ∪ C c = [0, 1] and C c is open, it follows that C is nonempty. In fact, C is uncountably inﬁnite as will be shown now. To do this, one needs the concept of p-nary expansion of numbers in [0,1]. Fix a positive integer p > 1. For each x in [0,1), let a1 (x) = px where t = n if n ≤ t < n + 1. Thus a1 (x) ≤ px < a1 (x) + 1 and a1 (x) ∈ {0, 1, . . . , p − 1}, i.e., a1p(x) ≤ x < a1p(x) + p1 . Thus, if kp ≤ x < k+1 p 2 9, 9

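for some $k = 0, 1, 2, \ldots, p-1$, then $a_1(x) = k$. Next, let $x_1 \equiv x - \frac{a_1(x)}{p}$ and $a_2(x) = \lfloor p^2 x_1 \rfloor$. Then, $x_1 \in [0, \frac1p)$ and
$$\frac{a_2(x)}{p^2} \le x_1 = x - \frac{a_1(x)}{p} < \frac{a_2(x)}{p^2} + \frac{1}{p^2}$$
and $a_2(x) \in \{0, 1, 2, \ldots, p-1\}$. Next, let $0 \le x_2 \equiv x - \frac{a_1(x)}{p} - \frac{a_2(x)}{p^2}$, $a_3(x) = \lfloor p^3 x_2 \rfloor$, and so on. After $k$ such iterations one gets
$$0 \le x - \sum_{i=1}^{k} \frac{a_i(x)}{p^i}$$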
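The digit construction and the intervals making up $C_n$ can be checked numerically. The following is a minimal sketch, not part of the original text; the names `pnary_digits` and `cantor_intervals` are illustrative choices. Exact rational arithmetic is used so that the floor operations are unaffected by rounding, and the remainder $x - \sum_{i \le k} a_i(x)p^{-i}$ is seen to lie in $[0, p^{-k})$.

```python
from fractions import Fraction

def pnary_digits(x, p, k):
    """Return the first k digits a_1(x), ..., a_k(x) of x in [0, 1) in base p."""
    digits = []
    r = Fraction(x)                        # after i steps, r = p^i * x_i
    for _ in range(k):
        r *= p
        a = r.numerator // r.denominator   # floor of r
        digits.append(a)
        r -= a                             # remainder stays in [0, 1)
    return digits

def cantor_intervals(n):
    """Return the 2^n closed intervals whose union is C_n."""
    intervals = [(Fraction(0), Fraction(1))]
    for _ in range(n):
        nxt = []
        for a, b in intervals:
            third = (b - a) / 3
            nxt.append((a, a + third))     # left closed third
            nxt.append((b - third, b))     # right closed third
        intervals = nxt
    return intervals

x = Fraction(1, 4)
digits = pnary_digits(x, 3, 6)             # ternary digits of 1/4
approx = sum(Fraction(a, 3 ** (i + 1)) for i, a in enumerate(digits))
measure = sum(b - a for a, b in cantor_intervals(4))   # m(C_4)
```

For $x = \frac14$ and $p = 3$ only the digits 0 and 2 occur, and the total length of the intervals in $C_4$ equals $(\frac23)^4$.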

0} is countable. (Hint: Let $B_n = \{x : h(x) > \frac1n\}$. Show that $B_n$ is a finite set for each $n \in \mathbb{N}$.)

4.4 Prove Proposition 4.1.2.

4.5 Find the Lebesgue decomposition of $\mu$ w.r.t. $\nu$ and the Radon-Nikodym derivative $\frac{d\mu_a}{d\nu}$ in the following cases, where $\mu_a$ is the absolutely continuous component of $\mu$ w.r.t. $\nu$.
(a) $\mu = N(0, 1)$, $\nu = \text{Exponential}(1)$.
(b) $\mu = \text{Exponential}(1)$, $\nu = N(0, 1)$.
(c) $\mu = \mu_1 + \mu_2$, where $\mu_1 = N(0, 1)$, $\mu_2 = \text{Poisson}(1)$ and $\nu = \text{Cauchy}(0, 1)$.
(d) $\mu = \mu_1 + \mu_2$, $\nu = \text{Geometric}(p)$, $0 < p < 1$, where $\mu_1 = N(0, 1)$ and $\mu_2 = \text{Poisson}(1)$.
(e) $\mu = \mu_1 + \mu_2$, $\nu = \nu_1 + \nu_2$ where $\mu_1 = N(0, 1)$, $\mu_2 = \text{Poisson}(1)$, $\nu_1 = \text{Cauchy}(0, 1)$ and $\nu_2 = \text{Geometric}(p)$, $0 < p < 1$.
(f) $\mu = \text{Binomial}(10, 1/2)$, $\nu = \text{Poisson}(1)$.
The measures referred to above are defined in Tables 4.6.1 and 4.6.2, given at the end of this section.


4. Diﬀerentiation

4.6 Let $(\Omega, \mathcal{F}, \mu)$ be a measure space and $f \in L^1(\Omega, \mathcal{F}, \mu)$. Let $\nu_f(A) \equiv \int_A f\,d\mu$ for all $A \in \mathcal{F}$.
(a) Show that $\nu_f$ is a finite signed measure.
(b) Show that $\|\nu_f\| = \int_\Omega |f|\,d\mu$ and for $A \in \mathcal{F}$, $\nu_f^+(A) = \int_A f^+\,d\mu$, $\nu_f^-(A) = \int_A f^-\,d\mu$, and $|\nu_f|(A) = \int_A |f|\,d\mu$.

4.7 (a) Let $\mu_1$ and $\mu_2$ be two finite measures such that both are dominated by a σ-finite measure $\nu$. Show that the total variation measure of the signed measure $\mu \equiv \mu_1 - \mu_2$ is given by $|\mu|(A) = \int_A |h_1 - h_2|\,d\nu$ where for $i = 1, 2$, $h_i = \frac{d\mu_i}{d\nu}$.
(b) Conclude that if $\mu_1$ and $\mu_2$ are two measures on a countable set $\Omega \equiv \{\omega_i\}_{i \ge 1}$ with $\mathcal{F} \equiv \mathcal{P}(\Omega)$, then $|\mu|(A) = \sum_{\omega_i \in A} |\mu_1(\{\omega_i\}) - \mu_2(\{\omega_i\})|$.
(c) Show that if $\mu_n$ is the Binomial$(n, p_n)$ measure and $\mu$ is the Poisson$(\lambda)$ measure, $0 < \lambda < \infty$, then as $n \to \infty$, $|\mu_n - \mu|(\cdot) \to 0$ uniformly on $\mathcal{P}(\mathbb{Z}_+)$ iff $np_n \to \lambda$. (Hint: Show that for each $i \in \mathbb{Z}_+ \equiv \{0, 1, 2, \ldots\}$, $\mu_n(\{i\}) \to \mu(\{i\})$ and use Scheffe's theorem.)

4.8 Let $\nu$ be a finite signed measure on a measurable space $(\Omega, \mathcal{F})$ and let $|\nu|$ be the total variation measure corresponding to $\nu$. Show that for any $B \in \mathcal{F}$, $B \subset \Omega^+$, $|\nu|(B) = \nu(B)$, where $\Omega^+$ is as defined in (2.13). (Hint: For any set $A \subset \Omega^+$, $\nu(A) = \int_A h\,d|\nu| = \int_{A \cap \Omega^+} h\,d|\nu| \ge 0$.)
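As a numerical illustration of the formula in Problem 4.7(b) and of the convergence asserted in Problem 4.7(c), the following sketch computes $|\mu_n - \mu|(\mathbb{Z}_+)$ for $\mu_n = \text{Binomial}(n, \lambda/n)$ and $\mu = \text{Poisson}(\lambda)$. It is an illustration under stated assumptions, not a proof: the function names are hypothetical, the choice $\lambda = 2$ is arbitrary, and floating-point arithmetic is used throughout.

```python
import math

def binomial_pmf(n, p):
    """Point masses of Binomial(n, p) on {0, ..., n}, via the ratio recursion."""
    pmf = [(1.0 - p) ** n]
    for i in range(n):
        pmf.append(pmf[-1] * (n - i) / (i + 1) * p / (1.0 - p))
    return pmf

def poisson_pmf(lam, upto):
    """Point masses of Poisson(lam) on {0, ..., upto}."""
    pmf = [math.exp(-lam)]
    for i in range(upto):
        pmf.append(pmf[-1] * lam / (i + 1))
    return pmf

def total_variation(n, lam):
    """|mu_n - mu|(Z_+) for mu_n = Binomial(n, lam/n) and mu = Poisson(lam)."""
    b = binomial_pmf(n, lam / n)
    q = poisson_pmf(lam, n)
    on_support = sum(abs(bi - qi) for bi, qi in zip(b, q))
    poisson_tail = max(0.0, 1.0 - sum(q))   # Poisson mass on {n+1, n+2, ...}
    return on_support + poisson_tail

tv = {n: total_variation(n, 2.0) for n in (10, 100, 1000)}
```

The values in `tv` shrink as $n$ grows, consistent with $np_n = \lambda$ for every $n$.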

4.9 Show that the Banach space $S$ of finite signed measures on $(\mathbb{N}, \mathcal{P}(\mathbb{N}))$ is isomorphic to $\ell^1$, the Banach space of absolutely convergent sequences $\{x_n\}_{n \ge 1}$ in $\mathbb{R}$.

4.10 Let $\mu_1$ and $\mu_2$ be two probability measures on $(\Omega, \mathcal{F})$.
(a) Show that $\|\mu_1 - \mu_2\| = 2 \sup\{|\mu_1(A) - \mu_2(A)| : A \in \mathcal{F}\}$. (Hint: For any $A \in \mathcal{F}$, $\{A, A^c\}$ is a partition of $\Omega$ and so
$$\|\mu_1 - \mu_2\| \ge |\mu_1(A) - \mu_2(A)| + |\mu_1(A^c) - \mu_2(A^c)| = 2|\mu_1(A) - \mu_2(A)|,$$

4.6 Problems


since $\mu_1$ and $\mu_2$ are probability measures. For the opposite inequality, use the Hahn decomposition of $\Omega$ w.r.t. $\mu_1 - \mu_2$ and the fact $\|\mu_1 - \mu_2\| = |(\mu_1 - \mu_2)(\Omega^+)| + |(\mu_1 - \mu_2)(\Omega^-)|$.)
(b) Show that $\|\mu_1 - \mu_2\|$ is also equal to
$$\sup\Big\{ \Big| \int f\,d\mu_1 - \int f\,d\mu_2 \Big| : f \in B(\Omega, \mathbb{R}) \Big\}$$
where $B(\Omega, \mathbb{R})$ is the collection of all $\mathcal{F}$-measurable functions from $\Omega$ to $\mathbb{R}$ such that $\sup\{|f(\omega)| : \omega \in \Omega\} \le 1$.

4.11 Let $(\Omega, \mathcal{F})$ be a measurable space.
(a) Let $\{\mu_n\}_{n \ge 1}$ be a sequence of finite measures on $(\Omega, \mathcal{F})$. Show that there exists a probability measure $\lambda$ such that $\mu_n \ll \lambda$ for all $n \ge 1$. (Hint: Consider $\lambda(\cdot) = \sum_{n=1}^{\infty} \frac{1}{2^n} \frac{\mu_n(\cdot)}{\mu_n(\Omega)}$.)
(b) Extend (a) to the case where $\{\mu_n\}_{n \ge 1}$ are σ-finite.
(c) Conclude that for any sequence $\{\nu_n\}_{n \ge 1}$ of finite signed measures on $(\Omega, \mathcal{F})$, there exists a probability measure $\lambda$ such that $|\nu_n| \ll \lambda$ for all $n \ge 1$.

4.12 Let $\{\mu_n\}_{n \ge 1}$ be a sequence of finite measures on a measurable space $(\Omega, \mathcal{F})$. Show that there exists a finite measure $\mu$ on $(\Omega, \mathcal{F})$ such that $\|\mu_n - \mu\| \to 0$ iff there is a finite measure $\lambda$ dominating $\mu$ and $\mu_n$, $n \ge 1$, such that the Radon-Nikodym derivatives $f_n \equiv \frac{d\mu_n}{d\lambda} \to f \equiv \frac{d\mu}{d\lambda}$ in measure on $(\Omega, \mathcal{F}, \lambda)$ and $\mu_n(\Omega) \to \mu(\Omega)$.

4.13 (a) Let $\mu_1$ and $\mu_2$ be two finite measures on $(\Omega, \mathcal{F})$. Let $\mu_1 = \mu_{1a} + \mu_{1s}$ be the Lebesgue-Radon-Nikodym decomposition of $\mu_1$ w.r.t. $\mu_2$ as in Theorem 4.1.1. Show that if $\mu = \mu_1 - \mu_2$, then for all $A \in \mathcal{F}$,
$$|\mu|(A) = \int_A |h - 1|\,d\mu_2 + \mu_{1s}(A)$$
where $h = \frac{d\mu_{1a}}{d\mu_2}$ is the Radon-Nikodym derivative of $\mu_{1a}$ w.r.t. $\mu_2$. Conclude that if $\mu_1 \perp \mu_2$, then $|\mu|(\cdot) = \mu_1(\cdot) + \mu_2(\cdot)$ and if $\mu_1 \ll \mu_2$, then $|\mu|(A) = \int_A \big| \frac{d\mu_1}{d\mu_2} - 1 \big|\,d\mu_2$.
(b) Compute $|\mu|(\cdot)$, $\|\mu\|$ if $\mu = \mu_1 - \mu_2$ for the following cases: (i) $\mu_1 = N(0, 1)$, $\mu_2 = N(1, 1)$; (ii) $\mu_1 = \text{Cauchy}(0, 1)$, $\mu_2 = N(0, 1)$; (iii) $\mu_1 = N(0, 1)$, $\mu_2 = \text{Poisson}(\lambda)$.
(c) Establish Proposition 4.2.7.


4.14 Give another proof of the completeness of $(S, \|\cdot\|)$ by verifying the following steps.
(a) For any sequence $\{\nu_n\}_{n \ge 1}$ in $S$, there is a finite measure $\lambda$ and $\{f_n\}_{n \ge 1} \subset L^1(\Omega, \mathcal{F}, \lambda)$ such that $\nu_n(A) = \int_A f_n\,d\lambda$ for all $A \in \mathcal{F}$, for all $n \ge 1$.
(b) $\{\nu_n\}_{n \ge 1}$ Cauchy in $S$ is the same as $\{f_n\}_{n \ge 1}$ Cauchy in $L^1(\Omega, \mathcal{F}, \lambda)$ and hence, the completeness of $(S, \|\cdot\|)$ follows from the completeness of $L^1(\Omega, \mathcal{F}, \lambda)$.

4.15 Let $f, g \in BV[a, b]$.
(a) Show that $P(f + g; [a, b]) \le P(f; [a, b]) + P(g; [a, b])$ and that the same is true for $N(\cdot; \cdot)$ and $T(\cdot; \cdot)$.
(b) Show that for any $c \in \mathbb{R}$, $P(cf; [a, b]) = |c| P(f; [a, b])$ and do the same for $N(\cdot; \cdot)$ and $T(\cdot; \cdot)$.
(c) For any $a < c < b$, $P(f; [a, b]) = P(f; [a, c]) + P(f; [c, b])$.

4.16 Let $\{f_n\}_{n \ge 1} \subset BV[a, b]$ and let $\lim_n f_n(x) = f(x)$ for all $x$ in $[a, b]$. Show that $P(f; [a, b]) \le \liminf_{n \to \infty} P(f_n; [a, b])$ and do the same for $N(\cdot; \cdot)$ and $T(\cdot; \cdot)$.

4.17 Let $f \in BV[a, b]$. Show that $f$ is continuous except on an at most countable set.

4.18 Let $F : [a, b] \to \mathbb{R}$ be a.c. Show that it is of bounded variation. (Hint: By the definition of a.c., for $\epsilon = 1$, there is a $\delta_1 > 0$ such that $\sum_{j=1}^{k} |a_j - b_j| < \delta_1 \Rightarrow \sum_{j=1}^{k} |F(a_j) - F(b_j)| < 1$. Let $M$ be an integer $> \frac{b-a}{\delta_1} + 1$. Show that $T(F, [a, b]) \le M$.)

4.19 Let $F$ be an absolutely continuous nondecreasing function on $\mathbb{R}$. Let $\mu_F$ be the Lebesgue-Stieltjes measure corresponding to $F$. Show that for any $h \in L^1(\mathbb{R}, \mathcal{M}_{\mu_F}, \mu_F)$,
$$\int_{\mathbb{R}} h\,d\mu_F = \int_{\mathbb{R}} h f\,dm$$
where $f$ is a nonnegative measurable function such that $F(b) - F(a) = \int_{[a,b]} f\,dm$ for any $a < b$.


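4.20 Let $F : [a, b] \to \mathbb{R}$ be absolutely continuous with $F'(\cdot) > 0$ a.e. on $[a, b]$, where $-\infty < a < b < \infty$. Let $F(a) = c$ and $F(b) = d$. Let $m(\cdot)$ denote the Lebesgue measure on $\mathbb{R}$. Show the following:
(a) (Change of variables formula). For any $g : [c, d] \to \mathbb{R}$ that is Lebesgue measurable and integrable w.r.t. $m$,
$$\int_{[c,d]} g\,dm = \int_{[a,b]} g(F) F'\,dm.$$
(b) For any Borel set $E \subset [a, b]$, $F(E)$ is also a Borel set.
(c) $\nu(\cdot) \equiv m(F(\cdot))$ is a measure on $\mathcal{B}([a, b])$ and $\nu \ll m$ with $\frac{d\nu}{dm}(\cdot) = F'(\cdot)$.
(d) $\lambda(\cdot) \equiv mF^{-1}(\cdot)$ is a measure on $\mathcal{B}([c, d])$ and $\lambda \ll m$ with $\frac{d\lambda}{dm}(\cdot) = \big(F'(F^{-1}(\cdot))\big)^{-1}$.
(e) For any measure $\mu \ll m$ on $\mathcal{B}([a, b])$ the measure $\psi(\cdot) \equiv \mu F^{-1}(\cdot)$ is dominated by $m$ with
$$\frac{d\psi}{dm} = \frac{d\mu}{dm}\big(F^{-1}(\cdot)\big) \big(F'(F^{-1}(\cdot))\big)^{-1}.$$
(f) Establish (a) assuming that $g$ and $F'$ are both continuous, noting that both integrals reduce to Riemann integrals.
(Hint: (i) Verify (a) for $g = I_{[\alpha, \beta]}$, $c < \alpha < \beta < d$, and approximate by step functions. (ii) Show that $F$ is (1−1) and $F^{-1}(\cdot)$ is continuous and hence Borel measurable. (iii) Show that $\nu(\cdot) = \mu_F$, the Lebesgue-Stieltjes measure corresponding to $F$. (iv) Use the fact that for any $c \le \alpha \le \beta \le d$,
$$\psi([\alpha, \beta]) = \mu([\gamma, \delta]), \quad \text{where } \gamma = F^{-1}(\alpha),\ \delta = F^{-1}(\beta),$$
$$= \int_{[\gamma, \delta]} g\,dm, \quad \text{where } g = \frac{d\mu}{dm},$$
$$= \int_{[\gamma, \delta]} \frac{g\big(F^{-1}(F(\cdot))\big)}{F'\big(F^{-1}(F(\cdot))\big)} F'(\cdot)\,dm = \int_{[\alpha, \beta]} \frac{g\big(F^{-1}(\cdot)\big)}{F'\big(F^{-1}(\cdot)\big)}\,dm \quad \text{by (a).)}$$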


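4.21 Let $F : \mathbb{R} \to \mathbb{R}$ be absolutely continuous on every finite interval.
(a) Show that the $f$ in (4.1) can be chosen independently of the interval $[a, b]$.
(b) Further, if $f$ is integrable over $\mathbb{R}$, then $\lim_{x \to -\infty} F(x) \equiv F(-\infty)$ and $\lim_{x \to \infty} F(x) \equiv F(\infty)$ exist and $F(x) = F(-\infty) + \int_{(-\infty, x)} f\,d\mu_L$ for all $x$ in $\mathbb{R}$.
(c) Give an example where $F : \mathbb{R} \to \mathbb{R}$ is a.c., but $f$ is not integrable over $\mathbb{R}$.

4.22 Let $F : \mathbb{R} \to \mathbb{R}$ be absolutely continuous on bounded intervals. Let $\{I_j : 1 \le j \le k \le \infty\}$ be a collection of disjoint intervals such that $\bigcup_{j=1}^{k} I_j \equiv \mathbb{R}$ and on each $I_j$, $F'(\cdot) > 0$ a.e. or $F'(\cdot) < 0$ a.e. w.r.t. $m$.
(a) Show that for any $h \in L^1(\mathbb{R}, m)$,
$$\int_{\mathbb{R}} h\,dm = \int_{\mathbb{R}} h\big(F(\cdot)\big) |F'(\cdot)|\,dm.$$
(b) Show that if $\mu$ is a measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ dominated by $m$, then the measure $\mu F^{-1}(\cdot)$ is also dominated by $m$ and
$$\frac{d\mu F^{-1}}{dm}(y) = \sum_{x_j \in D(y)} \frac{f(x_j)}{|F'(x_j)|}$$
where $f(\cdot) = \frac{d\mu}{dm}$ and $D(y) = \{x_j : x_j \in I_j, F(x_j) = y\}$.
(c) Let $\mu$ be the $N(0, 1)$ measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, i.e.,
$$\frac{d\mu}{dm}(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}, \quad -\infty < x < \infty.$$
Let $F(x) = x^2$. Find $\frac{d\mu F^{-1}}{dm}$.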

4.23 Let $f : \mathbb{R} \to \mathbb{R}$ be integrable w.r.t. $m$ on bounded intervals. Show that for almost all $x_0$ in $\mathbb{R}$ (w.r.t. $m$),
$$\lim_{a \uparrow x_0,\, b \downarrow x_0} \frac{1}{b - a} \int_a^b |f(x) - f(x_0)|\,dx = 0.$$
(Hint: For each rational $r$, by Theorem 4.4.4,
$$\lim_{a \uparrow x_0,\, b \downarrow x_0} \frac{1}{b - a} \int_a^b |f(x) - r|\,dx = |f(x_0) - r|$$
a.e. ($m$). Let $A_r$ denote the set of $x_0$ for which this fails to hold. Let $A = \bigcup_{r \in \mathbb{Q}} A_r$. Then $m(A) = 0$. For any $x_0 \notin A$ and any $\epsilon > 0$, choose a rational $r$ such that $|f(x_0) - r| < \epsilon$ and now show that
$$\limsup_{a \uparrow x_0,\, b \downarrow x_0} \frac{1}{b - a} \int_a^b |f(x) - f(x_0)|\,dx < 2\epsilon.)$$

4.24 Use the hint to the above problem to establish Theorem 4.4.5.

4.25 Let $(\Omega, \mathcal{B})$ be a measurable space. Let $\{\mu_n\}_{n \ge 1}$ and $\mu$ be σ-finite measures on $(\Omega, \mathcal{B})$. For each $n \ge 1$, let $\mu_n = \mu_{na} + \mu_{ns}$ be the Lebesgue decomposition of $\mu_n$ w.r.t. $\mu$ with $\mu_{na} \ll \mu$ and $\mu_{ns} \perp \mu$. Let $\lambda = \sum_{n \ge 1} \mu_n$, $\lambda_a = \sum_{n \ge 1} \mu_{na}$, $\lambda_s = \sum_{n \ge 1} \mu_{ns}$. Show that $\lambda_a \ll \mu$ and $\lambda_s \perp \mu$ and that $\lambda = \lambda_a + \lambda_s$ is the Lebesgue decomposition of $\lambda$ w.r.t. $\mu$.

4.26 Let $\{\mu_n\}_{n \ge 1}$ be Radon measures on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$ and $m$ be the Lebesgue measure on $\mathbb{R}^k$. Show that if $\lambda = \sum_{n=1}^{\infty} \mu_n$ is also a Radon measure, then
$$D\lambda = \sum_{n=1}^{\infty} D\mu_n \quad \text{a.e. } (m).$$
(Hint: Use Theorem 4.4.4 and the uniqueness of the Lebesgue decomposition.)

4.27 Let $F_n$, $n \ge 1$, be a sequence of nondecreasing functions from $\mathbb{R} \to \mathbb{R}$. Let $F(x) = \sum_{n \ge 1} F_n(x) < \infty$ for all $x \in \mathbb{R}$. Show that $F(\cdot)$ is nondecreasing and
$$F'(\cdot) = \sum_{n \ge 1} F_n'(\cdot) \quad \text{a.e. } (m).$$

4.28 Let $E$ be a Lebesgue measurable set in $\mathbb{R}$. The metric density of $E$ at $x$ is defined as
$$D_E(x) \equiv \lim_{\delta \downarrow 0} \frac{m\big(E \cap (x - \delta, x + \delta)\big)}{2\delta}$$
if it exists. Show that $D_E(\cdot) = I_E(\cdot)$ a.e. $m$. (Hint: Consider the measure $\lambda_E(\cdot) \equiv m(E \cap \cdot)$ on the Lebesgue σ-algebra. Show that $\lambda_E \ll m$ and find $\lambda_E'(\cdot)$ (cf. Definition 4.4.2).)

4.29 Let $F, G : [a, b] \to \mathbb{R}$ be both absolutely continuous. Show that $H = FG$ is also absolutely continuous on $[a, b]$ and that
$$\int_{[a,b]} F\,dG + \int_{[a,b]} G\,dF = F(b)G(b) - F(a)G(a).$$


4.30 Let $(\Omega, \mathcal{F}, \mu)$ be a finite measure space. Fix $1 \le p < \infty$. Let $T : L^p(\mu) \to \mathbb{R}$ be a bounded linear functional as defined in (3.2.10) (cf. Section 3.2). Complete the following outline of a proof of Theorem 3.2.3 (Riesz representation theorem).
(a) Let $\nu(A) \equiv T(I_A)$, $A \in \mathcal{F}$. Verify that $\nu(\cdot)$ is a signed measure on $(\Omega, \mathcal{F})$.
(b) Verify that $|\nu| \ll \mu$.
(c) Let $g \equiv \frac{d\nu}{d\mu}$. Show that $g \in L^q(\mu)$ where $q = \frac{p}{p-1}$ for $1 < p < \infty$ and $q = \infty$ if $p = 1$.

0, there is an $f \in C[0, 2\pi]$ and an $m \ge 1$ such that $\|g - D_m(f, \cdot)\|_2 < 2\epsilon$. That is, the set $\mathcal{T}$ of all finite linear combinations of the functions in the class $\mathcal{T}_0$ is dense in $L^2[0, 2\pi]$. Further, it is easy to verify that $h_1, h_2 \in \mathcal{T}_0$, $h_1 \ne h_2$ implies
$$\int_0^{2\pi} h_1(x) h_2(x)\,dx = 0,$$


5. Product Measures, Convolutions, and Transforms

i.e., $\mathcal{T}_0$ is an orthogonal family. Since $\mathcal{T}_0$ is orthogonal and $\mathcal{T}$ is dense in $L^2[0, 2\pi]$, $\mathcal{T}_0$ is complete. □

Definition 5.6.2: A function in $\mathcal{T}$ is called a trigonometric polynomial.

Thus, the above theorem says that trigonometric polynomials are dense in $L^2[0, 2\pi]$. Completeness of $\mathcal{T}$ in $L^2[0, 2\pi]$ and the results of Section 3.3 lead to

Theorem 5.6.4: Let $f \in L^2[0, 2\pi]$. Let $\{(a_n, b_n), n = 0, 1, 2, \ldots\}$ and $\{s_n(f, \cdot)\}_{n \ge 0}$ be the associated Fourier coefficient sequences and partial sum sequence of the Fourier series for $f$ as in Definition 5.6.1. Then
(i) $s_n(f, \cdot) \to f$ in $L^2[0, 2\pi]$,
(ii) $\sum_{n=0}^{\infty} (\breve{a}_n^2 + \breve{b}_n^2) = \frac{1}{2\pi} \int_0^{2\pi} |f|^2\,dx$, where for $n \ge 0$, $\breve{a}_n = a_n / c_n$ with $c_n^2 = \frac{1}{2\pi} \int_0^{2\pi} (\cos nx)^2\,dx = \frac12$ for $n \ge 1$ (and $c_0^2 = 1$), and for $n \ge 1$, $\breve{b}_n = b_n / d_n$ with $d_n^2 = \frac{1}{2\pi} \int_0^{2\pi} (\sin nx)^2\,dx = \frac12$.
(iii) Further, if $f, g \in L^2[0, 2\pi]$, then
$$\frac{1}{2\pi} \int_0^{2\pi} f g\,dx = a_0 \alpha_0 + \sum_{n=1}^{\infty} \Big( \frac{a_n \alpha_n}{c_n^2} + \frac{b_n \beta_n}{d_n^2} \Big),$$
where $\{(a_n, b_n) : n = 0, 1, 2, \ldots\}$ and $\{(\alpha_n, \beta_n) : n = 0, 1, 2, \ldots\}$ are, respectively, the Fourier coefficients of $f$ and $g$.

Clearly (ii) above is a restatement of Bessel's equality. Assertion (iii) is known as the Parseval identity. As for convergence pointwise or almost everywhere, A. N. Kolmogorov showed in 1926 (see Körner (1989)) that there exists an $f \in L^1[0, 2\pi]$ such that $\lim_{n \to \infty} |s_n(f, x)| = \infty$ everywhere on $[0, 2\pi]$. This led to the belief that for $f \in C[0, 2\pi]$, the mean square convergence of (i) in Theorem 5.6.4 cannot be improved upon. But L. Carleson showed in 1964 (see Körner (1989)) that for $f$ in $L^2[0, 2\pi]$, $s_n(f, \cdot) \to f(\cdot)$ almost everywhere. Finally, turning to $L^1[0, 2\pi]$, one has the following:

Theorem 5.6.5: Let $f \in L^1[0, 2\pi]$. Let $\{(a_n, b_n) : n \ge 0\}$ be as in (6.1) and satisfy $\sum_{n=0}^{\infty} (|a_n| + |b_n|) < \infty$. Let $s_n(f, \cdot)$ be as in (6.2). Then $s_n(f, \cdot)$ converges uniformly on $[0, 2\pi]$ and the limit coincides with $f$ almost everywhere.

Proof: Note that $\sum_{n=0}^{\infty} (|a_n| + |b_n|) < \infty$ implies that the sequence $\{s_n(f, \cdot)\}_{n \ge 0}$ is a Cauchy sequence in the Banach space $C[0, 2\pi]$ with the sup-norm. Thus, there exists a $g$ in $C[0, 2\pi]$ such that $s_n(f, \cdot) \to g$ uniformly on $[0, 2\pi]$. It is easy to check that this implies that $g$ and $f$ have the same Fourier coefficients. Set $h = g - f$. Then $h \in L^1[0, 2\pi]$ and the Fourier coefficients of $h$ are all zero. This implies that $h$ is orthogonal to the members of the class $\mathcal{T}$, which in turn yields that $h$ is orthogonal to all

5.6 Fourier series


continuous functions in $C[0, 2\pi]$, i.e.,
$$\int_0^{2\pi} h(x) k(x)\,dx = 0$$
for all $k \in C[0, 2\pi]$. Since $h \in L^1[0, 2\pi]$ and for any interval $A \subset [0, 2\pi]$ there exists a sequence $\{k_n\}_{n \ge 1}$ of uniformly bounded continuous functions such that $k_n \to I_A$ a.e. ($m$), by the DCT,
$$\int h(x) I_A(x)\,dx = \lim_{n \to \infty} \int h(x) k_n(x)\,dx = 0.$$
This in turn implies that $h = 0$ a.e., i.e., $g = f$ a.e. □
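The first step of the proof, that absolutely summable coefficients make the partial sums Cauchy in the sup-norm, can be seen numerically. The sketch below is an illustration only; it assumes the particular coefficients $a_k = 1/k^2$, $b_k = 0$ (chosen here, not taken from the text) and evaluates the supremum on a finite grid rather than on all of $[0, 2\pi]$.

```python
import math

def partial_sum(n, x):
    """s_n(x) = sum_{k=1}^n cos(kx) / k^2, i.e. a_k = 1/k^2 and b_k = 0."""
    return sum(math.cos(k * x) / k ** 2 for k in range(1, n + 1))

def tail(n, m):
    """sum_{k=n+1}^m 1/k^2, an upper bound for sup_x |s_m(x) - s_n(x)|."""
    return sum(1.0 / k ** 2 for k in range(n + 1, m + 1))

grid = [2.0 * math.pi * j / 200 for j in range(201)]   # points in [0, 2*pi]
n, m = 50, 400
sup_diff = max(abs(partial_sum(m, x) - partial_sum(n, x)) for x in grid)
bound = tail(n, m)
```

Since $|\cos kx| \le 1$, the grid supremum `sup_diff` cannot exceed `bound`, and `bound` tends to zero as $n \to \infty$ uniformly in $m$.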

Remark 5.6.1: If $f \in L^2[0, 2\pi]$, the Fourier coefficients $\{a_n, b_n\}$ are square summable and hence go to zero as $n \to \infty$. What if $f \in L^1[0, 2\pi]$? If $f \in L^1[0, 2\pi]$, one can assert the following:

Theorem 5.6.6: (Riemann-Lebesgue lemma). Let $f \in L^1[0, 2\pi]$. Then
$$\lim_{n \to \infty} \int_0^{2\pi} f(x) \cos nx\,dx = 0 = \lim_{n \to \infty} \int_0^{2\pi} f(x) \sin nx\,dx.$$

Proof: The lemma holds if $f = I_A$ for any interval $A \subset [0, 2\pi]$ and since step functions (i.e., linear combinations of indicator functions of intervals) are dense in $L^1[0, 2\pi]$, the lemma is proved. □

It can be shown that the mapping $f \to \{(a_n, b_n)\}_{n \ge 0}$ from $L^1[0, 2\pi]$ to bivariate sequences that go to zero as $n \to \infty$ is one-to-one but not onto (Rudin (1987), Chapter 5).

Remark 5.6.2: (The complex case). Let $T \equiv \{z : z = e^{\iota\theta}, 0 \le \theta \le 2\pi\}$ be the unit circle in the complex plane $\mathbb{C}$. Every function $g : T \to \mathbb{C}$ can be identified with a function $f$ on $\mathbb{R}$ by $f(t) = g(e^{\iota t})$. Clearly, $f(\cdot)$ is periodic on $\mathbb{R}$ with period $2\pi$. In the rest of this section, for $0 < p < \infty$, $L^p(T)$ will stand for the collection of all Borel measurable functions $f : [0, 2\pi] \to \mathbb{C}$ such that $\int_{[0, 2\pi]} |f|^p\,dm < \infty$ where $m(\cdot)$ is the Lebesgue measure. A trigonometric polynomial is a function of the form
$$f(\cdot) \equiv \sum_{n=-k}^{k} \alpha_n e^{\iota n x} \equiv a_0 + \sum_{n=1}^{k} (a_n \cos nx + b_n \sin nx),$$
$k < \infty$, where $\{\alpha_n\}$, $\{a_n\}_{n \ge 0}$, and $\{b_n\}_{n \ge 0}$ are sequences of complex numbers. The completeness of the trigonometric polynomials proved in Theorem 5.6.3 implies that the family $\{e^{\iota n x} : n = 0, \pm 1, \pm 2, \ldots\}$ is a complete orthonormal basis for $L^2(T)$, which is a complex Hilbert space.


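Thus Theorem 5.6.4 carries over to this case.

Theorem 5.6.7: (i) Let $f \in L^2(T)$. Then,
$$\sum_{n=-k}^{k} \alpha_n e^{\iota n x} \to f \quad \text{in } L^2(T)$$
where $\alpha_n \equiv \frac{1}{2\pi} \int_0^{2\pi} f(x) e^{-\iota n x}\,dx$, $n \in \mathbb{Z}$. Further,
$$\sum_{n=-\infty}^{\infty} |\alpha_n|^2 = \frac{1}{2\pi} \int_0^{2\pi} |f|^2\,dm.$$
(ii) For any sequence $\{\alpha_n\}_{n \in \mathbb{Z}}$ of complex numbers such that $\sum_{n=-\infty}^{\infty} |\alpha_n|^2 < \infty$, the sequence $\big\{ f_k(x) \equiv \sum_{n=-k}^{k} \alpha_n e^{\iota n x} \big\}_{k \ge 1}$ converges in $L^2(T)$ to a unique $f$ such that
$$\alpha_n = \frac{1}{2\pi} \int_0^{2\pi} f(x) e^{-\iota n x}\,dx.$$
(iii) For any $f, g \in L^2(T)$,
$$\sum_{n=-\infty}^{\infty} \alpha_n \bar{\beta}_n = \frac{1}{2\pi} \int_0^{2\pi} f(x) \overline{g(x)}\,dx$$
where
$$\alpha_n = \hat{f}(n) = \frac{1}{2\pi} \int_0^{2\pi} f(x) e^{-\iota n x}\,dx, \qquad \beta_n = \hat{g}(n) = \frac{1}{2\pi} \int_0^{2\pi} g(x) e^{-\iota n x}\,dx, \quad n \in \mathbb{Z}.$$
Further,
$$\sum_{n=-\infty}^{\infty} |\alpha_n \beta_n| < \infty.$$
(iv) $L^2(T)$ is isomorphic to $\ell^2(\mathbb{Z})$, the Hilbert space of all square summable sequences of complex numbers on $\mathbb{Z}$.

Similarly, Theorem 5.6.5 carries over to the complex case.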
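The identity $\sum_n |\alpha_n|^2 = \frac{1}{2\pi}\int_0^{2\pi} |f|^2\,dm$ of Theorem 5.6.7 (i) can be checked on a trigonometric polynomial. The sketch below is an illustration under stated assumptions: the test function `f` and the helper `coefficient` are hypothetical choices, and the integrals are replaced by $N$-point Riemann sums, which are exact (up to rounding) for trigonometric polynomials of degree less than $N$.

```python
import cmath
import math

def coefficient(f, n, N=256):
    """alpha_n = (1/2pi) * integral over [0, 2pi] of f(x) e^{-i n x} dx,
    approximated by an N-point Riemann sum."""
    total = 0j
    for j in range(N):
        x = 2.0 * math.pi * j / N
        total += f(x) * cmath.exp(-1j * n * x)
    return total / N

def f(x):
    # trigonometric polynomial with alpha_0 = 1, alpha_1 = 2, alpha_{-3} = 0.5i
    return 1.0 + 2.0 * cmath.exp(1j * x) + 0.5j * cmath.exp(-3j * x)

alphas = {n: coefficient(f, n) for n in range(-5, 6)}
sum_sq = sum(abs(a) ** 2 for a in alphas.values())          # sum of |alpha_n|^2
mean_sq = sum(abs(f(2.0 * math.pi * j / 256)) ** 2 for j in range(256)) / 256
```

Both `sum_sq` and `mean_sq` come out as $1 + 4 + \frac14$, the sum of the squared moduli of the three nonzero coefficients.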


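Theorem 5.6.8: Let $f \in L^1(T)$. Suppose
$$\sum_{n=-\infty}^{\infty} |\hat{f}(n)| < \infty$$
where
$$\hat{f}(n) = \frac{1}{2\pi} \int_0^{2\pi} f(x) e^{-\iota n x}\,dx, \quad n \in \mathbb{Z}.$$
Then
$$s_n(f, x) \equiv \sum_{j=-n}^{n} \hat{f}(j) e^{\iota j x}$$
converges uniformly on $[0, 2\pi]$ and the limit coincides with $f$ a.e., and hence $f$ coincides a.e. with a continuous function.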

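5.7 Fourier transforms on R

In this section and in Section 5.8, let $L^p(\mathbb{R})$ stand for
$$L^p(\mathbb{R}) \equiv \Big\{ f : f : \mathbb{R} \to \mathbb{C}, \text{ Borel measurable}, \int_{\mathbb{R}} |f|^p\,dm < \infty \Big\} \tag{7.1}$$
where $m(\cdot)$ is Lebesgue measure. Also, for $f \in L^1(\mathbb{R})$, $\int f\,dm$ will often be written as $\int f(x)\,dx$. Let
$$C_0 \equiv \{ f : f : \mathbb{R} \to \mathbb{C}, \text{ continuous and } \lim_{|x| \to \infty} f(x) = 0 \}. \tag{7.2}$$

Definition 5.7.1: For $f \in L^1(\mathbb{R})$, $t \in \mathbb{R}$,
$$\hat{f}(t) \equiv \int f(x) e^{-\iota t x}\,dx \tag{7.3}$$
is called the Fourier transform of $f$.

Proposition 5.7.1: Let $f \in L^1(\mathbb{R})$ and $\hat{f}(\cdot)$ be as in (7.3). Then
(i) $\hat{f}(\cdot) \in C_0$.
(ii) If $f_a(x) \equiv f(x - a)$, $a \in \mathbb{R}$, then $\hat{f}_a(t) = e^{-\iota t a} \hat{f}(t)$, $t \in \mathbb{R}$.

Proof: (i) For any $t \in \mathbb{R}$, $t_n \to t \Rightarrow e^{-\iota t_n x} f(x) \to e^{-\iota t x} f(x)$ for all $x \in \mathbb{R}$ and since $|e^{-\iota t_n x} f(x)| \le |f(x)|$ for all $n$ and $x$, by the DCT $\hat{f}(t_n) \to \hat{f}(t)$. To show that $\hat{f}(t) \to 0$ as $|t| \to \infty$, the same proof as that of Theorem 5.6.6 works. Thus, it holds if $f = I_{[a,b]}$, for $a, b \in \mathbb{R}$, $a < b$ and since the step functions are dense in $L^1(\mathbb{R})$, it holds for all $f \in L^1(\mathbb{R})$.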


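(ii) This is a consequence of the translation invariance of $m(\cdot)$, i.e., $m(A + a) = m(A)$ for all $A \in \mathcal{B}(\mathbb{R})$, $a \in \mathbb{R}$. □

The continuity of $\hat{f}(\cdot)$ can be strengthened to differentiability if, in addition to $f \in L^1(\mathbb{R})$, $x f(x) \in L^1(\mathbb{R})$. More generally, if $f \in L^1(\mathbb{R})$ and $x^k f(x) \in L^1(\mathbb{R})$ for some $k \ge 1$, then $\hat{f}(\cdot)$ is differentiable $k$-times with all derivatives $\hat{f}^{(r)}(t) \to 0$ as $|t| \to \infty$ for $r \le k$ (Problem 5.22).

Proposition 5.7.2: Let $f, g \in L^1(\mathbb{R})$ and $f * g$ be their convolution as defined in (4.7). Then
$$\widehat{f * g} = \hat{f} \hat{g}. \tag{7.4}$$

Proof:
$$\widehat{f * g}(t) = \int_{\mathbb{R}} e^{-\iota t x} \Big( \int_{\mathbb{R}} f(x - u) g(u)\,du \Big) dx = \int_{\mathbb{R}} \Big( \int_{\mathbb{R}} e^{-\iota t (x - u)} f(x - u) e^{-\iota t u} g(u)\,du \Big) dx = \int_{\mathbb{R}} (f_t * g_t)(x)\,dx \tag{7.5}$$
where $f_t(x) = e^{-\iota t x} f(x)$, $g_t(x) = e^{-\iota t x} g(x)$. Thus, by Proposition 5.4.4,
$$\widehat{f * g}(t) = \Big( \int_{\mathbb{R}} f_t(x)\,dx \Big) \Big( \int_{\mathbb{R}} g_t(x)\,dx \Big) = \hat{f}(t) \hat{g}(t).$$
□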
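Identity (7.4) can be checked numerically on a simple pair of functions. The sketch below is an illustration under stated assumptions: it takes $f = g = I_{[0,1]}$ (a choice made here, not in the text), whose convolution is the triangle function $\max(0, 1 - |x - 1|)$ supported on $[0, 2]$, and approximates the Fourier integrals by a midpoint rule; the helper names are hypothetical.

```python
import cmath

def fourier(f, t, a, b, n=20000):
    """Midpoint-rule approximation of the integral of f(x) e^{-i t x} over [a, b]."""
    h = (b - a) / n
    total = 0j
    for k in range(n):
        x = a + (k + 0.5) * h
        total += f(x) * cmath.exp(-1j * t * x)
    return total * h

def box(x):
    """Indicator function of [0, 1]."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def tri(x):
    """(box * box)(x): the length of the overlap of [0, 1] and [x - 1, x]."""
    return max(0.0, 1.0 - abs(x - 1.0))

ts = [0.5, 2.0, 7.3]
lhs = [fourier(tri, t, 0.0, 2.0) for t in ts]        # transform of f * g
rhs = [fourier(box, t, 0.0, 1.0) ** 2 for t in ts]   # product of the transforms
```

The entries of `lhs` and `rhs` agree up to quadrature error at each of the chosen values of $t$.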

The process of recovering $f$ from $\hat{f}$ (i.e., that of finding an inversion formula) can be developed along the lines of Fejér's theorem (Theorem 5.6.1).

Theorem 5.7.3: (Fejér's theorem). Let $f \in L^1(\mathbb{R})$, $\hat{f}(\cdot)$ be as in (7.3) and
$$S_T(f, x) \equiv \frac{1}{2\pi} \int_{-T}^{T} \hat{f}(t) e^{\iota t x}\,dt, \quad T \ge 0, \tag{7.6}$$
$$D_R(f, x) \equiv \frac{1}{R} \int_0^R S_T(f, x)\,dT, \quad R > 0. \tag{7.7}$$
(i) If $f$ is continuous at $x_0$ and $f$ is bounded on $\mathbb{R}$, then
$$\lim_{R \to \infty} D_R(f, x_0) = f(x_0). \tag{7.8}$$
(ii) If $f$ is uniformly continuous and bounded on $\mathbb{R}$, then
$$\lim_{R \to \infty} D_R(f, \cdot) = f(\cdot) \quad \text{uniformly on } \mathbb{R}. \tag{7.9}$$
(iii) As $R \to \infty$,
$$D_R(f, \cdot) \to f(\cdot) \quad \text{in } L^1(\mathbb{R}). \tag{7.10}$$
(iv) If $f \in L^p(\mathbb{R})$, $1 \le p < \infty$, then as $R \to \infty$,
$$D_R(f, \cdot) \to f(\cdot) \quad \text{in } L^p(\mathbb{R}). \tag{7.11}$$

Corollary 5.7.4: (Uniqueness theorem). If $f$ and $g \in L^1(\mathbb{R})$ and $\hat{f}(\cdot) = \hat{g}(\cdot)$, then $f = g$ a.e. ($m$).

Proof: Let $h = f - g$. Then $h \in L^1(\mathbb{R})$ and $\hat{h}(\cdot) \equiv 0$. Thus, $S_T(h, \cdot) \equiv 0$ and $D_R(h, \cdot) \equiv 0$ where $S_T(h, \cdot)$ and $D_R(h, \cdot)$ are as in (7.6) and (7.7). Hence by Theorem 5.7.3 (iii), $h = 0$ a.e. ($m$), i.e., $f = g$ a.e. ($m$). □

Corollary 5.7.5: (Inversion formula). Let $f \in L^1(\mathbb{R})$ and $\hat{f} \in L^1(\mathbb{R})$. Then
$$f(x) = \frac{1}{2\pi} \int_{\mathbb{R}} \hat{f}(t) e^{\iota t x}\,dt \quad \text{a.e. } (m). \tag{7.12}$$

Proof: Since $\hat{f} \in L^1(\mathbb{R})$, by the DCT,
$$S_T(f, x) \to \frac{1}{2\pi} \int_{\mathbb{R}} \hat{f}(t) e^{\iota t x}\,dt \quad \text{as } T \to \infty$$
for all $x$ in $\mathbb{R}$ and hence $D_R(f, x)$ has the same limit as $R \to \infty$. Now (7.12) follows from (7.10). □

The following results, i.e., Lemma 5.7.6 and Lemma 5.7.7, are needed for the proof of Theorem 5.7.3. The first one is an analog of Lemma 5.6.2.

Lemma 5.7.6: (Fejér). For $R > 0$, let
$$K_R(x) \equiv \frac{1}{2\pi} \frac{1}{R} \int_0^R \int_{-T}^{T} e^{\iota t x}\,dt\,dT. \tag{7.13}$$
Then
(i)
$$K_R(x) = \begin{cases} \dfrac{1}{\pi} \dfrac{1 - \cos Rx}{R x^2}, & x \ne 0 \\[2mm] \dfrac{R}{2\pi}, & x = 0, \end{cases} \tag{7.14}$$
and hence $K_R(\cdot) \ge 0$.
(ii) For $\delta > 0$,
$$\int_{|x| \ge \delta} K_R(x)\,dx \to 0 \quad \text{as } R \to \infty. \tag{7.15}$$
(iii)
$$\int_{-\infty}^{\infty} K_R(x)\,dx = 1. \tag{7.16}$$
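The three properties of $K_R$ can be checked numerically. The following is a minimal sketch, not part of the original text: it assumes the closed form carries the factor $1/R$ coming from (7.13), uses the fact that the inner integral $\int_{-T}^{T} e^{\iota t x}\,dt$ equals $2\sin(Tx)/x$ for $x \ne 0$, and truncates the integral in (7.16) to a hypothetical window $[-L, L]$, so a small tail is lost.

```python
import math

def fejer_kernel(x, R):
    """Closed form of K_R(x), including the factor 1/R from (7.13)."""
    if x == 0.0:
        return R / (2.0 * math.pi)
    return (1.0 - math.cos(R * x)) / (math.pi * R * x * x)

def fejer_kernel_quadrature(x, R, n=4000):
    """K_R(x) for x != 0 computed from (7.13) by a midpoint rule in T."""
    h = R / n
    total = 0.0
    for k in range(n):
        T = (k + 0.5) * h
        total += 2.0 * math.sin(T * x) / x   # inner integral over t in [-T, T]
    return total * h / (2.0 * math.pi * R)

def integral_of_kernel(R, L=200.0, n=200000):
    """Midpoint-rule approximation of the integral of K_R over [-L, L]."""
    h = 2.0 * L / n
    return sum(fejer_kernel(-L + (k + 0.5) * h, R) for k in range(n)) * h

def tail_mass(R, delta, L=200.0, n=200000):
    """Midpoint-rule approximation of the integral of K_R over delta <= |x| <= L."""
    h = (L - delta) / n
    return 2.0 * sum(fejer_kernel(delta + (k + 0.5) * h, R) for k in range(n)) * h
```

For example, `integral_of_kernel(5.0)` is close to 1, and `tail_mass(R, 0.5)` decreases as `R` increases, in line with (7.15) and (7.16).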

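Proof: (i) $K_R(0) = \frac{1}{2\pi R} \int_0^R (2T)\,dT = \frac{R}{2\pi}$. For $x \ne 0$ and $R > 0$,
$$K_R(x) = \frac{1}{2\pi R} \int_0^R \frac{2 \sin Tx}{x}\,dT = \frac{1}{2\pi R} \frac{2(1 - \cos Rx)}{x^2}.$$
(ii) For $\delta > 0$,
$$0 \le \int_{|x| \ge \delta} K_R(x)\,dx \le \frac{2}{2\pi R} \int_{|x| \ge \delta} \frac{1}{x^2}\,dx = \frac{2}{\pi R} \frac{1}{\delta} \to 0 \quad \text{as } R \to \infty.$$
(iii)
$$\int_{-\infty}^{\infty} K_R(x)\,dx = \frac{2}{\pi} \frac{1}{R} \int_0^{\infty} \frac{1 - \cos Rx}{x^2}\,dx = \frac{2}{\pi} \int_0^{\infty} \frac{1 - \cos u}{u^2}\,du.$$
Now
$$\int_0^{\infty} \frac{1 - \cos u}{u^2}\,du = \lim_{L \to \infty} \int_0^L \frac{1 - \cos u}{u^2}\,du \quad \text{(by the MCT)}$$
$$= \lim_{L \to \infty} \int_0^L \frac{1}{u^2} \Big( \int_0^u \sin x\,dx \Big) du$$
$$= \lim_{L \to \infty} \int_0^L \sin x \Big( \int_x^L \frac{1}{u^2}\,du \Big) dx \quad \text{(by Fubini's theorem)}$$
$$= \lim_{L \to \infty} \Big[ \int_0^L \frac{\sin x}{x}\,dx - \frac{1}{L} \int_0^L \sin x\,dx \Big]$$
$$= \lim_{L \to \infty} \int_0^L \frac{\sin x}{x}\,dx \quad \text{since } \Big| \int_0^L \sin x\,dx \Big| \le 2.$$
Thus,
$$\int_0^{\infty} \frac{1 - \cos u}{u^2}\,du = \int_0^{\infty} \frac{\sin x}{x}\,dx = \frac{\pi}{2} \tag{7.17}$$
(cf. Problem 5.9). Hence (iii) follows. □

Lemma 5.7.7: Let $f \in L^p(\mathbb{R})$, $0 < p < \infty$. Then
$$\int_{-\infty}^{\infty} |f(x - u) - f(x)|^p\,dx \to 0 \quad \text{as } |u| \to 0. \tag{7.18}$$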

Proof: The lemma holds if $f \in C_K$, i.e., if $f$ is continuous on $\mathbb{R}$ (with values in $\mathbb{C}$) and vanishes outside a bounded interval. By Theorem 2.3.14, such functions are dense in $L^p(\mathbb{R})$. So given $f \in L^p(\mathbb{R})$, $0 < p < \infty$ and $\epsilon > 0$, let $g \in C_K$ be such that $\int |f - g|^p\,dm < \epsilon$. For any $0 < p < \infty$, there is a $0 < c_p < \infty$ such that for all $x, y, z \in (0, \infty)$, $|x + y + z|^p \le c_p (|x|^p + |y|^p + |z|^p)$. Then,
$$\int |f(x - u) - f(x)|^p\,dx \le c_p \int \big( |f(x - u) - g(x - u)|^p + |g(x - u) - g(x)|^p + |g(x) - f(x)|^p \big)\,dx \le c_p \Big( 2\epsilon + \int |g(x - u) - g(x)|^p\,dx \Big).$$
So
$$\limsup_{|u| \to 0} \int |f(x - u) - f(x)|^p\,dx \le c_p\, 2\epsilon.$$
Since $\epsilon > 0$ is arbitrary, the lemma is proved. □

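Proof of Theorem 5.7.3: From (7.7)
$$D_R(f, x) \equiv \frac{1}{2\pi R} \int_0^R \int_{-T}^{T} e^{\iota t x} \Big( \int_{-\infty}^{\infty} e^{-\iota t y} f(y)\,dy \Big) dt\,dT = \frac{1}{2\pi} \frac{1}{R} \int_0^R \int_{-T}^{T} \int_{-\infty}^{\infty} e^{\iota t u} f(x - u)\,du\,dt\,dT.$$
Now Fubini's theorem yields
$$D_R(f, x) = \int_{-\infty}^{\infty} f(x - u) \Big( \frac{1}{2\pi} \frac{1}{R} \int_0^R \int_{-T}^{T} e^{\iota t u}\,dt\,dT \Big) du = \int_{-\infty}^{\infty} f(x - u) K_R(u)\,du \tag{7.19}$$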


where $K_R(\cdot)$ is as in (7.13). Now let $f$ be continuous at $x_0$ and bounded on $\mathbb{R}$ by $M_f$. Fix $\epsilon > 0$ and choose $\delta > 0$ such that $|x - x_0| < \delta \Rightarrow |f(x) - f(x_0)| < \epsilon$. From (7.16) and (7.19),
$$D_R(f, x_0) - f(x_0) = \int \big( f(x_0 - u) - f(x_0) \big) K_R(u)\,du$$
implying
$$|D_R(f, x_0) - f(x_0)| \le$$

|u| 0}, i = 1, 2.
(a) Show that $D_1 \cup D_2$ is countable.
(b) Let $\phi_i(x) = \mu_i(\{x\})$, $x \in \mathbb{R}$, $i = 1, 2$. Show that $\phi_i$ is Borel measurable for $i = 1, 2$.
(c) Show that $\int \phi_1\,d\mu_2 = \sum_{z \in D_1 \cap D_2} \phi_1(z) \phi_2(z)$.
(d) Deduce (2.11) from (c).

5.4 Extend Theorem 5.2.3 as follows. Let $F_1$, $F_2$ be two nondecreasing right continuous functions on $[a, b]$. Then
$$\int_{(a,b]} F_1\,dF_2 + \int_{(a,b]} F_2\,dF_1 = F_1(b) F_2(b) - F_1(a) F_2(a) + \int_{(a,b]} \phi_1\,d\mu_2,$$
where $\phi_1$ is as in Problem 5.3.

5.5 Let $F_i : \mathbb{R} \to \mathbb{R}$ be nondecreasing and right continuous, $i = 1, 2$. Show that if $\lim_{b \uparrow \infty} F_1(b) F_2(b) = \lambda_1$ and $\lim_{a \downarrow -\infty} F_1(a) F_2(a) = \lambda_2$ exist and are finite, then
$$\int_{\mathbb{R}} F_1\,dF_2 + \int_{\mathbb{R}} F_2\,dF_1 = \lambda_1 - \lambda_2 + \int_{\mathbb{R}} \phi_1\,d\mu_2$$
where $\phi_1$ is as in Problem 5.3.


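5.6 Let $(\Omega, \mathcal{F}, \mu)$ be a σ-finite measure space and $f$ be a nonnegative measurable function. Then, $\int_\Omega f\,d\mu = \int_{[0, \infty)} \mu(\{f \ge t\})\,dt$. (Hint: Consider the product space of $(\Omega, \mathcal{F}, \mu)$ with $(\mathbb{R}_+, \mathcal{B}(\mathbb{R}_+), m)$ and apply Tonelli's theorem to the function $g(\omega, t) = I(f(\omega) \ge t)$, after showing that $g$ is $\mathcal{F} \times \mathcal{B}(\mathbb{R}_+)$-measurable.)

5.7 Let $(\Omega, \mathcal{F}, P)$ be a probability space and $X : \Omega \to \mathbb{R}_+$ be a random variable.
(a) Show that for any $h : \mathbb{R}_+ \to \mathbb{R}_+$ that is absolutely continuous,
$$\int_\Omega h(X)\,dP = h(0) + \int_{[0, \infty)} h'(t) P(X \ge t)\,dt = h(0) + \int_{(0, \infty)} h'(t) P(X > t)\,dt.$$
(b) Show that for any $0 < p < \infty$,
$$\int_\Omega X^p\,dP = \int_{[0, \infty)} p t^{p-1} P(X \ge t)\,dt.$$
(c) Show that for any $0 < p < \infty$,
$$\int_\Omega X^{-p}\,dP = \frac{1}{\Gamma(p)} \int_{[0, \infty)} \psi_X(t) t^{p-1}\,dt,$$
where $\Gamma(p) = \int_{[0, \infty)} e^{-t} t^{p-1}\,dt$, $p > 0$, and $\psi_X(t) = \int_\Omega e^{-tX}\,dP$, $t \in \mathbb{R}_+$.
(Hint: (a) Apply Tonelli's theorem to the function $f(t, \omega) \equiv h'(t) I(X(\omega) \ge t)$ on the product measure space $([0, \infty) \times \Omega, \mathcal{B}([0, \infty)) \times \mathcal{F}, m \times P)$, where $m$ is Lebesgue measure on $(\mathbb{R}_+, \mathcal{B}(\mathbb{R}_+))$.)

5.8 Let $g : \mathbb{R}_+ \to \mathbb{R}_+$ and $f : \mathbb{R}^2 \to \mathbb{R}_+$ be Borel measurable. Let $A = \{(x, y) : x \ge 0, 0 \le y \le g(x)\}$.
(a) Show that $A \in \mathcal{B}(\mathbb{R}^2)$.
(b) Show that
$$\int_{\mathbb{R}_+} \Big( \int_{[0, g(x)]} f(x, y)\,m(dy) \Big) m(dx) = \int f I_A\,dm^{(2)},$$
where $m^{(2)}$ is Lebesgue measure on $(\mathbb{R}^2, \mathcal{B}(\mathbb{R}^2))$ and $m(\cdot)$ is Lebesgue measure on $\mathbb{R}$.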

5.9 Problems


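(c) If $g$ is continuous and strictly increasing, show that the two integrals in (b) equal
$$\int_{\mathbb{R}_+} \Big( \int_{[g^{-1}(y), \infty)} f(x, y)\,m(dx) \Big) m(dy).$$

5.9 (a) For $1 < A < \infty$, let
$$h_A(t) = \int_0^A e^{-xt} \sin x\,dx, \quad t \ge 0.$$
Use integration by parts to show that
$$|h_A(t)| \le \frac{1}{1 + t^2} + e^{-t} \quad \text{and} \quad h_A(t) \to \frac{1}{1 + t^2} \quad \text{as } A \to \infty.$$
(b) Show using Fubini's theorem that for $0 < A < \infty$,
$$\int_0^{\infty} h_A(t)\,dt = \int_0^A \frac{\sin x}{x}\,dx.$$
(c) Conclude using the DCT that
$$\lim_{A \to \infty} \int_0^A \frac{\sin x}{x}\,dx = \int_0^{\infty} \frac{1}{1 + t^2}\,dt.$$
(d) Using Theorem 4.4.1 and the fact that $\phi(x) \equiv \tan x$ is a (1–1) strictly monotone map from $(0, \frac{\pi}{2})$ to $(0, \infty)$ having the inverse map $\psi(\cdot)$ with derivative $\psi'(t) \equiv \frac{1}{1 + t^2}$, $0 < t < \infty$, conclude that
$$\int_0^{\infty} \frac{1}{1 + t^2}\,dt = \frac{\pi}{2}.$$

5.10 Show that $I \equiv \int_0^{\infty} e^{-x^2/2}\,dx = \sqrt{\frac{\pi}{2}}$. (Hint: By Tonelli's theorem, $I^2 = \int_0^{\infty} \int_0^{\infty} e^{-\frac{x^2 + y^2}{2}}\,dx\,dy$. Now use the change of variables $x = r \cos\theta$, $y = r \sin\theta$, $0 < r < \infty$, $0 < \theta < \frac{\pi}{2}$.)

5.11 Let $\mu$ be a finite measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Let $f, g : \mathbb{R} \to \mathbb{R}_+$ be nondecreasing. Show that
$$\mu(\mathbb{R}) \int f g\,d\mu \ge \Big( \int f\,d\mu \Big) \Big( \int g\,d\mu \Big).$$
(Hint: Consider $h(x_1, x_2) = \big( f(x_1) - f(x_2) \big) \big( g(x_1) - g(x_2) \big)$ on $\mathbb{R}^2$ and integrate w.r.t. $\mu \times \mu$.)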


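5.12 Let $\mu$ and $\lambda$ be σ-finite measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Recall that
$$\nu(A) \equiv (\mu * \lambda)(A) \equiv \int\!\!\int I_A(x + y)\,d\mu\,d\lambda, \quad A \in \mathcal{B}(\mathbb{R}).$$
(a) Show that for any Borel measurable $f : \mathbb{R} \to \mathbb{R}_+$, $f(x + y)$ is $\langle \mathcal{B}(\mathbb{R}) \times \mathcal{B}(\mathbb{R}), \mathcal{B}(\mathbb{R}) \rangle$-measurable from $\mathbb{R} \times \mathbb{R} \to \mathbb{R}$ and
$$\int f\,d\nu = \int\!\!\int f(x + y)\,d\mu\,d\lambda.$$
(b) Show that
$$\nu(A) = \int_{\mathbb{R}} \mu(A - t)\,\lambda(dt), \quad A \in \mathcal{B}(\mathbb{R}).$$
(c) Suppose there exist countable sets $B_\lambda$, $B_\mu$ such that $\mu(B_\mu^c) = 0 = \lambda(B_\lambda^c)$. Show that there exists a countable set $B_\nu$ such that $\nu(B_\nu^c) = 0$.
(d) Suppose that $\mu(\{x\}) = 0$ for all $x$ in $\mathbb{R}$. Show that $\nu(\{x\}) = 0$ for all $x$ in $\mathbb{R}$.
(e) Suppose that $\mu \ll m$ with $\frac{d\mu}{dm} = h$. Show that $\nu \ll m$ and find $\frac{d\nu}{dm}$ in terms of $h$, $\mu$ and $\lambda$.
(f) Suppose that $\mu \ll m$ and $\lambda \ll m$. Show that
$$\frac{d\nu}{dm} = \frac{d\mu}{dm} * \frac{d\lambda}{dm}.$$

5.13 (Convolution of cdfs). Let $F_i$, $i = 1, 2$, be cdfs on $\mathbb{R}$. Recall that a cdf $F$ on $\mathbb{R}$ is a function from $\mathbb{R} \to \mathbb{R}_+$ such that it is nondecreasing, right continuous with $F(x) \to 0$ as $x \to -\infty$ and $F(x) \to 1$ as $x \to \infty$.
(a) Show that $(F_1 * F_2)(x) \equiv \int_{\mathbb{R}} F_1(x - u)\,dF_2(u)$ is well defined and is a cdf on $\mathbb{R}$.
(b) Show also that $(F_1 * F_2)(\cdot) = (F_2 * F_1)(\cdot)$.
(c) Suppose $t \in \mathbb{R}$ is such that $\int e^{tx}\,dF_i(x) < \infty$ for $i = 1, 2$. Show that
$$\int e^{tx}\,d(F_1 * F_2)(x) = \Big( \int e^{tx}\,dF_1(x) \Big) \Big( \int e^{tx}\,dF_2(x) \Big).$$

5.14 Let $f, g \in L^1(\mathbb{R}, \mathcal{B}(\mathbb{R}), m)$.
(a) Show that if $f$ is continuous and bounded on $\mathbb{R}$, then so is $f * g$.
(b) Show that if $f$ is differentiable with a bounded derivative on $\mathbb{R}$, then so is $f * g$.
(Hint: Use the DCT.)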


5.15 Let $f \in L^1(\mathbb{R})$, $g \in L^p(\mathbb{R})$, $1 \le p \le \infty$.
(a) Show that if $1 \le p < \infty$, then for all $x$ in $\mathbb{R}$,
$$\Big| \int |f(u) g(x - u)|\,du \Big|^p \le \Big( \int |g(x - u)|^p |f(u)|\,du \Big) \Big( \int |f|\,dm \Big)^{p-1}$$
and hence that
$$(f * g)(x) \equiv \int f(u) g(x - u)\,du$$
is well defined a.e. ($m$) and $\|f * g\|_p \le \|f\|_1 \|g\|_p$ with "=" holding iff either $f = 0$ a.e. or $g = 0$ a.e.
(b) Show that if $p = \infty$ then $\|f * g\|_\infty \le \|f\|_1 \|g\|_\infty$ and "=" can hold for some nonzero $f$ and $g$.
(Hint for (a): Use Jensen's inequality with probability measure $d\mu = \frac{|f|\,dm}{\|f\|_1}$ if $\|f\|_1 > 0$.)

5.16 Let $1 \le p \le \infty$ and $\frac{1}{q} = 1 - \frac{1}{p}$. Let $f \in L^p(\mathbb{R})$, $g \in L^q(\mathbb{R})$.
(a) Show that $f * g$ is well defined and uniformly continuous.
(b) Show that if $1 < p < \infty$,
$$\lim_{|x| \to \infty} (f * g)(x) = 0.$$
(Hint: For (a) use Hölder's inequality and Lemma 5.7.7. For (b) approximate $g$ by simple functions.)

5.17 Let $g : \mathbb{R} \to \mathbb{R}$ be infinitely differentiable and be zero outside a bounded interval.
(a) Let $f : \mathbb{R} \to \mathbb{R}$ be Borel measurable and $\int_A |f|\,dm < \infty$ for all bounded intervals $A$ in $\mathbb{R}$. Show that $f * g$ is well defined and infinitely differentiable.
(b) Show that for any $f \in L^1(\mathbb{R})$, there exists a sequence $\{g_n\}_{n \ge 1}$ of such functions such that $f * g_n \to f$ in $L^1(\mathbb{R})$.

5.18 For $f \in L^1(\mathbb{R})$, let $f_\sigma = f * \phi_\sigma$ where $\phi_\sigma(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{x^2}{2\sigma^2}}$, $0 < \sigma < \infty$, $x \in \mathbb{R}$.


(a) Show that $f_\sigma$ is infinitely differentiable.
(b) Show that if $f$ is continuous and zero outside a bounded interval, then $f_\sigma$ converges to $f$ uniformly on $\mathbb{R}$ as $\sigma \to 0$.
(c) Show that if $f \in L^p(\mathbb{R})$, $1 \le p < \infty$, then $f_\sigma \to f$ in $L^p(\mathbb{R})$ as $\sigma \to 0$.

5.19 Let $f \in L^p(\mathbb{R})$, $1 \le p < \infty$ and $h(x) = \int_{A + x} f(u)\,du$ where $A$ is a bounded Borel set and $A + x \equiv \{y : y = a + x, a \in A\}$.
(a) Show that $h = f * g$ for some $g$ bounded and with bounded support.
(b) Show that $h(\cdot)$ is continuous and that $\lim_{|x| \to \infty} h(x) = 0$.
(Hint: For 1 1. (Hint: Consider $g(x) = \frac{1}{2\pi} \int e^{-\iota t x} \hat{f}(t)\,dt$.)

5.23 Let $f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}$, $x \in \mathbb{R}$.
(a) Show that $\hat{f}(\cdot)$ is real valued, differentiable and satisfies the ordinary differential equation
$$\hat{f}'(t) + t \hat{f}(t) = 0, \quad t \in \mathbb{R},$$
and $\hat{f}(0) = 1$. Find $\hat{f}(t)$.


(b) For $\mu$ in $\mathbb{R}$, $\sigma > 0$, let
$$f_{\mu, \sigma}(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac12 \left( \frac{x - \mu}{\sigma} \right)^2}.$$
Find $\hat{f}_{\mu, \sigma}$ and verify that for any $(\mu_i, \sigma_i)$, $i = 1, 2$,
$$f_{\mu_1, \sigma_1} * f_{\mu_2, \sigma_2} = f_{\mu_1 + \mu_2, \sqrt{\sigma_1^2 + \sigma_2^2}}.$$
(Hint for (b): Use Fourier transforms and uniqueness.)

5.24 (Rate of convergence of Fourier series). Consider the function $h(x) = \frac{\pi}{2} - |x|$ in $-\pi \le x \le \pi$.
(a) Find $\hat{h}(n) \equiv \frac{1}{2\pi} \int_{-\pi}^{\pi} e^{-\iota n x} h(x)\,dx$, $n = 0, \pm 1, \pm 2, \ldots$.
(b) Show that $\sum_{n=-\infty}^{\infty} |\hat{h}(n)| < \infty$.
(c) Show that $S_n(h, x) \equiv \sum_{j=-n}^{+n} \hat{h}(j) e^{\iota j x}$ converges to $h(x)$ uniformly on $[-\pi, \pi]$.
(d) Verify that
$$\sup\{ |S_n(h, x) - h(x)| : -\pi \le x \le \pi \} \le \frac{2}{\pi} \frac{1}{(n - 1)}, \quad n \ge 2,$$
and
$$|S_n(h, 0) - h(0)| \ge \frac{2}{\pi} \frac{1}{(n + 2)}.$$
(Remark: This example shows that the Fourier series of a function can converge very slowly, as in this example where the rate of decay is $\frac{1}{n}$.)

5.25 Using Fejér's theorem (Theorem 5.6.1) prove Weierstrass' theorem on uniform approximation of a continuous function on a bounded closed interval by a polynomial. (Hint: Show that on bounded intervals a trigonometric polynomial can be approximated uniformly by a polynomial using the power series representation of sine and cosine functions (see Section A.3).)

5.26 Evaluate
$$\lim_{n \to \infty} \int_{-n}^{n} \frac{\sin \lambda x}{x} e^{\iota t x}\,dx, \quad 0 < \lambda < \infty,\ 0 < t < \infty.$$
(Hint: For $0 < \lambda < \infty$, $\frac{\sin \lambda x}{x} \in L^2(\mathbb{R})$ and it is the Fourier transform of $f(t) = I_{[-\lambda, \lambda]}(\cdot)$. Now apply Plancherel theorem. Alternatively use Fubini's theorem and the fact that $\lim_{n \to \infty} \int_{-n}^{n} \frac{\sin y}{y}\,dy$ exists in $\mathbb{R}$.)


5.27 Find an example of a function $f \in L^2(\mathbb{R}) \cap (L^1(\mathbb{R}))^c$ such that its Plancherel transform $\hat{f} \in L^1(\mathbb{R})$. (Hint: Examine Problem 5.26.)

6 Probability Spaces

6.1 Kolmogorov's probability model

Probability theory provides a mathematical model for random phenomena, i.e., those involving uncertainty. First one identifies the set $\Omega$ of possible outcomes of the (random) experiment associated with the phenomenon. This set $\Omega$ is called the sample space, and an individual element $\omega$ of $\Omega$ is called a sample point. Even though the outcome is not predictable ahead of time, one is interested in the "chances" of some particular statement being valid for the resulting outcome. The set of $\omega$'s for which a given statement is valid is called an event. Thus, an event is a subset of $\Omega$. One then identifies a class $\mathcal{F}$ of events, i.e., a class $\mathcal{F}$ of subsets of $\Omega$ (not necessarily all of $\mathcal{P}(\Omega)$, the power set of $\Omega$), and then a set function $P$ on $\mathcal{F}$ such that for $A$ in $\mathcal{F}$, $P(A)$ represents the "chance" of the event $A$ happening. Thus, it is reasonable to impose the following conditions on $\mathcal{F}$ and $P$:

(i) $A \in \mathcal{F} \Rightarrow A^c \in \mathcal{F}$ (i.e., if one can define the probability of an event $A$, then the probability of $A$ not happening is also well defined).

(ii) $A_1, A_2 \in \mathcal{F} \Rightarrow A_1 \cup A_2 \in \mathcal{F}$ (i.e., if one can define the probabilities of $A_1$ and $A_2$, then the probability of at least one of $A_1$ or $A_2$ happening is also well defined).

(iii) for all $A$ in $\mathcal{F}$, $0 \le P(A) \le 1$, $P(\emptyset) = 0$, and $P(\Omega) = 1$.


(iv) $A_1, A_2 \in \mathcal{F}$, $A_1 \cap A_2 = \emptyset \Rightarrow P(A_1 \cup A_2) = P(A_1) + P(A_2)$ (i.e., if $A_1$ and $A_2$ are mutually exclusive events, then the probability of at least one of the two happening is simply the sum of the probabilities).

The above conditions imply that $\mathcal{F}$ is an algebra and $P$ is a finitely additive set function. Next, as explained in Section 1.2, it is natural to require that $\mathcal{F}$ be closed under monotone increasing unions and $P$ be monotone continuous from below. That is, if $\{A_n\}_{n \ge 1}$ is a sequence of events in $\mathcal{F}$ such that $A_n$ implies $A_{n+1}$ (i.e., $A_n \subset A_{n+1}$) for all $n \ge 1$, then the probability of at least one of the $A_n$’s happening is well defined and is the limit of the corresponding probabilities. In other words, the following conditions on $\mathcal{F}$ and $P$ must hold in addition to (i)–(iv) above:

(v) $A_n \in \mathcal{F}$, $A_n \subset A_{n+1}$ for all $n = 1, 2, \ldots$ $\Rightarrow$ $\bigcup_{n \ge 1} A_n \in \mathcal{F}$ and $P(A_n) \uparrow P\big(\bigcup_{n \ge 1} A_n\big)$.

As noted in Section 1.2, conditions (i)–(v) imply that $(\Omega, \mathcal{F}, P)$ is a measure space, i.e., $\mathcal{F}$ is a σ-algebra and $P$ is a measure on $\mathcal{F}$ with $P(\Omega) = 1$. That is, $(\Omega, \mathcal{F}, P)$ is a probability space. This is known as Kolmogorov’s probability model for random phenomena (see Kolmogorov (1956), Parthasarathy (2005)). Here are some examples.

Example 6.1.1: (Finite sample spaces). Let $\Omega \equiv \{\omega_1, \omega_2, \ldots, \omega_k\}$, $1 \le k < \infty$, $\mathcal{F} \equiv \mathcal{P}(\Omega)$, the power set of $\Omega$, and $P(A) = \sum_{i=1}^{k} p_i I_A(\omega_i)$ where $\{p_i\}_{i=1}^{k}$ are such that $p_i \ge 0$ and $\sum_{i=1}^{k} p_i = 1$. This is a probability model for random experiments with finitely many possible outcomes.

An important application of this probability model is finite population sampling. Let $\{U_1, U_2, \ldots, U_N\}$ be a finite population of $N$ units or objects. These could be individuals in a city, counties in a state, etc. In a typical sample survey procedure, one chooses a subset of size $n$ ($1 \le n \le N$) from this population. Let $\Omega$ denote the collection of all possible subsets of size $n$. Here $k = \binom{N}{n}$, each $\omega_i$ is a sample of size $n$ and $p_i$ is the selection probability of $\omega_i$. The assignment of $\{p_i\}_{i=1}^{k}$ is determined by a given sampling scheme. For example, in simple random sampling without replacement, $p_i = \frac{1}{k}$ for $i = 1, 2, \ldots, k$. Other examples include coin tossing, rolling of dice, bridge hands, and acceptance sampling in statistical quality control (Feller (1968)).

Example 6.1.2: (Countably infinite sample spaces). Let $\Omega \equiv \{\omega_1, \omega_2, \ldots\}$ be a countable set, $\mathcal{F} = \mathcal{P}(\Omega)$, and $P(A) \equiv \sum_{i=1}^{\infty} p_i I_A(\omega_i)$ where $\{p_i\}_{i=1}^{\infty}$ satisfy $p_i \ge 0$ and $\sum_{i=1}^{\infty} p_i = 1$. It is easy to verify that $(\Omega, \mathcal{F}, P)$ is a probability space. This is a probability model for random experiments with a countably infinite number of outcomes. For example, the experiment of tossing a coin until a “head” is produced leads to such a probability space.
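The finite population sampling model of Example 6.1.1 can be made concrete with a short sketch. The population labels, the sample size, and the helper names `srs_without_replacement` and `prob` are illustrative assumptions, not part of the text; the code only enumerates $\Omega$, assigns $p_i = 1/k$, and evaluates $P(A) = \sum_i p_i I_A(\omega_i)$ exactly.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def srs_without_replacement(units, n):
    """Return (omega, p): all size-n subsets of `units` and p_i = 1/k for each."""
    omega = list(combinations(units, n))      # the k = C(N, n) sample points
    k = len(omega)
    p = {w: Fraction(1, k) for w in omega}    # equal selection probabilities
    return omega, p


def prob(event, p):
    """P(A) = sum_i p_i I_A(omega_i), with the event A given as a predicate."""
    return sum((p_i for w, p_i in p.items() if event(w)), Fraction(0))


units = ["U1", "U2", "U3", "U4", "U5"]        # hypothetical population, N = 5
omega, p = srs_without_replacement(units, 2)  # samples of size n = 2

print(len(omega), comb(5, 2))                 # both count the sample points
print(prob(lambda w: True, p))                # P(Omega)
print(prob(lambda w: "U1" in w, p))           # chance that unit U1 is selected
```

Under this scheme the chance that a given unit is selected works out to $n/N$, which the last line can be used to check for small populations.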


Example 6.1.3: (Uncountable sample spaces).

(a) (Random variables). Let $\Omega = \mathbb{R}$, $\mathcal{F} = \mathcal{B}(\mathbb{R})$, $P = \mu_F$, the Lebesgue-Stieltjes measure corresponding to a cdf $F$, i.e., corresponding to a function $F : \mathbb{R} \to \mathbb{R}$ that is nondecreasing, right continuous, and satisfies $F(-\infty) = 0$, $F(+\infty) = 1$. See Section 1.3. This serves as a model for a single random variable $X$.

(b) (Random vectors). Let $\Omega = \mathbb{R}^k$, $\mathcal{F} = \mathcal{B}(\mathbb{R}^k)$, $P = \mu_F$, the Lebesgue-Stieltjes measure corresponding to a (multidimensional) cdf $F$ on $\mathbb{R}^k$ where $k \in \mathbb{N}$. See Section 1.3. This is a model for a random vector $(X_1, X_2, \ldots, X_k)$.

(c) (Random sequences). Let $\Omega = \mathbb{R}^\infty \equiv \mathbb{R} \times \mathbb{R} \times \cdots$ be the set of all sequences $\{x_n\}_{n \ge 1}$ of real numbers. Let $\mathcal{C}$ be the class of all finite dimensional sets of the form $A \times \mathbb{R} \times \mathbb{R} \times \cdots$, where $A \in \mathcal{B}(\mathbb{R}^k)$ for some $1 \le k < \infty$. Let $\mathcal{F}$ be the σ-algebra generated by $\mathcal{C}$. For each $1 \le k < \infty$, let $\mu_k$ be a probability measure on $\mathcal{B}(\mathbb{R}^k)$ such that $\mu_{k+1}(A \times \mathbb{R}) = \mu_k(A)$ for all $A \in \mathcal{B}(\mathbb{R}^k)$. Then there exists a probability measure $\mu$ on $\mathcal{F}$ such that $\mu(A \times \mathbb{R} \times \mathbb{R} \times \cdots) = \mu_k(A)$ if $A \in \mathcal{B}(\mathbb{R}^k)$. (This will be shown later as a special case of Kolmogorov’s consistency theorem in Section 6.3.) This will be a model for a sequence $\{X_n\}_{n \ge 1}$ of random variables such that for each $k$, $1 \le k < \infty$, the distribution of $(X_1, X_2, \ldots, X_k)$ is $\mu_k$.

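6.2 Random variables and random vectors

Recall the following definitions introduced earlier in Sections 2.1 and 2.2.

Definition 6.2.1: Let $(\Omega, \mathcal{F}, P)$ be a probability space and $X : \Omega \to \mathbb{R}$ be $\langle \mathcal{F}, \mathcal{B}(\mathbb{R}) \rangle$-measurable, that is, $X^{-1}(A) \in \mathcal{F}$ for all $A \in \mathcal{B}(\mathbb{R})$. Then, $X$ is called a random variable on $(\Omega, \mathcal{F}, P)$.

Recall that $X : \Omega \to \mathbb{R}$ is $\langle \mathcal{F}, \mathcal{B}(\mathbb{R}) \rangle$-measurable iff for all $x \in \mathbb{R}$, $\{\omega : X(\omega) \le x\} \in \mathcal{F}$.

Definition 6.2.2: Let $X$ be a random variable on $(\Omega, \mathcal{F}, P)$. Let
\[ F_X(x) \equiv P(\{\omega : X(\omega) \le x\}), \quad x \in \mathbb{R}. \tag{2.1} \]
Then $F_X(\cdot)$ is called the cumulative distribution function (cdf) of $X$.

Definition 6.2.3: Let $X$ be a random variable on $(\Omega, \mathcal{F}, P)$. Let
\[ P_X(A) \equiv P(X^{-1}(A)) \quad \text{for all } A \in \mathcal{B}(\mathbb{R}). \tag{2.2} \]
Then the probability measure $P_X$ is called the probability distribution of $X$.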


Note that $P_X$ is the measure induced by $X$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ under $P$ and that the Lebesgue-Stieltjes measure $\mu_{F_X}$ on $\mathcal{B}(\mathbb{R})$ corresponding to the cdf $F_X$ of $X$ is the same as $P_X$.

Definition 6.2.4: Let $(\Omega, \mathcal{F}, P)$ be a probability space, $k \in \mathbb{N}$ and $X : \Omega \to \mathbb{R}^k$ be $\langle \mathcal{F}, \mathcal{B}(\mathbb{R}^k) \rangle$-measurable, i.e., $X^{-1}(A) \in \mathcal{F}$ for all $A \in \mathcal{B}(\mathbb{R}^k)$. Then $X$ is called a ($k$-dimensional) random vector on $(\Omega, \mathcal{F}, P)$.

Let $X = (X_1, X_2, \ldots, X_k)$ be a random vector with components $X_i$, $i = 1, 2, \ldots, k$. Then each $X_i$ is a random variable on $(\Omega, \mathcal{F}, P)$. This follows from the fact that the coordinate projection maps from $\mathbb{R}^k$ to $\mathbb{R}$, given by $\pi_i(x_1, x_2, \ldots, x_k) \equiv x_i$, $1 \le i \le k$, are continuous and hence are Borel measurable. Conversely, if for $1 \le i \le k$, $X_i$ is a random variable on $(\Omega, \mathcal{F}, P)$, then $X = (X_1, X_2, \ldots, X_k)$ is a random vector (cf. Proposition 2.1.3).

Definition 6.2.5: Let $X$ be a $k$-dimensional random vector on $(\Omega, \mathcal{F}, P)$ for some $k \in \mathbb{N}$. Let
\[ F_X(x) \equiv P(\{\omega : X_1(\omega) \le x_1, X_2(\omega) \le x_2, \ldots, X_k(\omega) \le x_k\}) \tag{2.3} \]
for $x = (x_1, x_2, \ldots, x_k) \in \mathbb{R}^k$. Then $F_X(\cdot)$ is called the joint cumulative distribution function (joint cdf) of the random vector $X$.

Definition 6.2.6: Let $X$ be a $k$-dimensional random vector on $(\Omega, \mathcal{F}, P)$ for some $k \in \mathbb{N}$. Let
\[ P_X(A) = P(X^{-1}(A)) \quad \text{for all } A \in \mathcal{B}(\mathbb{R}^k). \tag{2.4} \]
The probability measure $P_X$ is called the (joint) probability distribution of $X$.

As in the case $k = 1$, the Lebesgue-Stieltjes measure $\mu_{F_X}$ on $\mathcal{B}(\mathbb{R}^k)$ corresponding to the joint cdf $F_X$ is the same as $P_X$.

Next, let $X = (X_1, X_2, \ldots, X_k)$ be a random vector. Let $Y = (X_{i_1}, X_{i_2}, \ldots, X_{i_r})$ for some $1 \le i_1 < i_2 < \cdots < i_r \le k$ and some $1 \le r \le k$. Then, $Y$ is also a random vector. Further, the joint cdf of $Y$ can be obtained from $F_X$ by setting the components $x_j$, $j \notin \{i_1, i_2, \ldots, i_r\}$, equal to $\infty$. Similarly, the probability distribution $P_Y$ can be obtained from $P_X$ as an induced measure from the projection map $\pi(x) = (x_{i_1}, x_{i_2}, \ldots, x_{i_r})$, $x \in \mathbb{R}^k$. For example, if $(i_1, i_2, \ldots, i_r) = (1, 2, \ldots, r)$, $r \le k$, then
\[ F_Y(y_1, \ldots, y_r) = F_X(y_1, \ldots, y_r, \infty, \ldots, \infty), \quad (y_1, \ldots, y_r) \in \mathbb{R}^r, \]
and
\[ P_Y(A) = P_X(A \times \mathbb{R}^{k-r}), \quad A \in \mathcal{B}(\mathbb{R}^r). \]
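For a discrete random vector the two marginalization recipes above (set the unused cdf arguments to $\infty$, or push $P_X$ through the projection) can be checked directly. The joint pmf below is a made-up example on $\{0,1\}^3$, and `marginal` and `cdf` are hypothetical helper names; coordinates are 0-based in the code.

```python
from fractions import Fraction
from itertools import product

INF = float("inf")

# Hypothetical joint pmf of X = (X1, X2, X3) on {0, 1}^3 (weights sum to 16).
joint = {x: Fraction(1 + x[0] + 2 * x[1] * x[2], 16)
         for x in product((0, 1), repeat=3)}


def marginal(pmf, idx):
    """Distribution of Y = (X_i : i in idx): P_X pushed through the projection."""
    out = {}
    for x, p in pmf.items():
        y = tuple(x[i] for i in idx)
        out[y] = out.get(y, Fraction(0)) + p
    return out


def cdf(pmf, point):
    """Joint cdf at `point`: total mass of atoms x with x_i <= point_i for all i."""
    return sum((p for x, p in pmf.items()
                if all(a <= b for a, b in zip(x, point))), Fraction(0))


p_y = marginal(joint, (0, 1))   # Y = (X1, X2)
for y in product((0, 1), repeat=2):
    # F_Y(y1, y2) agrees with F_X(y1, y2, infinity)
    assert cdf(p_y, y) == cdf(joint, y + (INF,))
print(p_y)
```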


Definition 6.2.7: Let $X = (X_1, X_2, \ldots, X_k)$ be a random vector on $(\Omega, \mathcal{F}, P)$. Then, for each $i = 1, \ldots, k$, the cdf $F_{X_i}$ and the probability distribution $P_{X_i}$ of the random variable $X_i$ are called the marginal cdf and the marginal probability distribution of $X_i$, respectively.

It is clear that the distribution of $X$ determines the marginal distribution $P_{X_i}$ of $X_i$ for all $i = 1, 2, \ldots, k$. However, the marginal distributions $\{P_{X_i} : i = 1, 2, \ldots, k\}$ do not uniquely determine the joint distribution $P_X$ without additional conditions, such as independence (see Problem 6.1).

Definition 6.2.8: Let $X$ be a random variable on $(\Omega, \mathcal{F}, P)$. The expected value of $X$, denoted by $EX$ or $E(X)$, is defined as
\[ EX = \int_\Omega X \, dP, \tag{2.5} \]
provided the integral is well defined. That is, at least one of the two quantities $\int X^+ \, dP$ and $\int X^- \, dP$ is finite.

If $X$ is a random variable on $(\Omega, \mathcal{F}, P)$ and $h : \mathbb{R} \to \mathbb{R}$ is Borel measurable, then $Y = h(X)$ is also a random variable on $(\Omega, \mathcal{F}, P)$. The expected value of $Y$ may be computed as follows.

Proposition 6.2.1: (The change of variable formula). Let $X$ be a random variable on $(\Omega, \mathcal{F}, P)$ and $h : \mathbb{R} \to \mathbb{R}$ be Borel measurable. Let $Y = h(X)$. Then

(i) $\int_\Omega |Y| \, dP = \int_{\mathbb{R}} |h(x)| \, P_X(dx) = \int_{\mathbb{R}} |y| \, P_Y(dy)$.

(ii) If $\int_\Omega |Y| \, dP < \infty$, then
\[ \int_\Omega Y \, dP = \int_{\mathbb{R}} h(x) \, P_X(dx) = \int_{\mathbb{R}} y \, P_Y(dy). \tag{2.6} \]

Proof: If $h = I_A$ for $A$ in $\mathcal{B}(\mathbb{R})$, the proposition follows from the definition of $P_X$. By linearity, this extends to a nonnegative and simple function $h$ and by the MCT, to any nonnegative measurable $h$, and hence to any measurable $h$. □

Remark 6.2.1: Proposition 6.2.1 shows that the expectation of $Y$ can be computed in three different ways, i.e., by integrating $Y$ with respect to $P$ on $\Omega$, or by integrating $h(x)$ on $\mathbb{R}$ with respect to the probability distribution $P_X$ of the random variable $X$, or by integrating $y$ on $\mathbb{R}$ with respect to the probability distribution $P_Y$ of the random variable $Y$.

Remark 6.2.2: If the function $h$ is nonnegative, then the relation $EY = \int_{\mathbb{R}} h(x) \, P_X(dx)$ is valid even if $EY = \infty$.
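The three ways of computing $EY$ described in Remark 6.2.1 can be compared on a finite probability space. The sketch below assumes two fair dice with $X$ the total and $h(x) = (x - 7)^2$; these choices and the helper name `induced` are illustrative only.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Omega: outcomes of two fair dice, P uniform (an illustrative choice).
P = {w: Fraction(1, 36) for w in product(range(1, 7), repeat=2)}


def X(w):
    return w[0] + w[1]          # random variable: the total of the two dice


def h(x):
    return (x - 7) ** 2         # Borel function, so Y = h(X)


def induced(measure, f):
    """Measure induced by f: mass of f^{-1}({v}) for every value v."""
    out = defaultdict(Fraction)
    for point, mass in measure.items():
        out[f(point)] += mass
    return dict(out)


P_X = induced(P, X)             # distribution of X on R
P_Y = induced(P_X, h)           # distribution of Y = h(X) on R

on_omega = sum(h(X(w)) * p for w, p in P.items())   # integral of Y dP
wrt_P_X = sum(h(x) * p for x, p in P_X.items())     # integral of h(x) P_X(dx)
wrt_P_Y = sum(y * p for y, p in P_Y.items())        # integral of y P_Y(dy)
print(on_omega, wrt_P_X, wrt_P_Y)
```

All three sums agree, as (2.6) asserts; here the common value is the variance of the total, since $EX = 7$.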


Definition 6.2.9: For any positive integer $n$, the $n$th moment $\mu_n$ of a random variable $X$ is defined by
\[ \mu_n \equiv EX^n, \tag{2.7} \]
provided the expectation is well defined.

Definition 6.2.10: The variance of a random variable $X$ is defined as $\mathrm{Var}(X) = E(X - EX)^2$, provided $EX^2 < \infty$.

Definition 6.2.11: The moment generating function (mgf) of a random variable $X$ is defined by
\[ M_X(t) \equiv E(e^{tX}) \quad \text{for all } t \in \mathbb{R}. \tag{2.8} \]

Since $e^{tX}$ is always nonnegative, $E(e^{tX})$ is well defined but could be infinity. Proposition 6.2.1 gives a way of computing the moments and the mgf of $X$ without explicitly computing the distribution of $X^k$ or $e^{tX}$. As an illustration, consider the case of a random variable $X$ defined on the probability space $(\Omega, \mathcal{F}, P)$ with $\Omega = \{H, T\}^n$, $n \in \mathbb{N}$, $\mathcal{F}$ = the power set of $\Omega$ and $P$ = the probability distribution defined by $P(\{\omega\}) = p^{X(\omega)} q^{n - X(\omega)}$ where $0 < p < 1$, $q = 1 - p$, and $X(\omega)$ = the number of $H$’s in $\omega$. By the change of variable formula, the mgf of $X$ is given by
\[ M_X(t) \equiv \int e^{tx} \, P_X(dx) = \sum_{r=0}^{n} e^{tr} \binom{n}{r} p^r q^{n-r} = (p e^t + q)^n, \]
since $P_X$, the probability distribution of $X$, is supported on $\{0, 1, 2, \ldots, n\}$ with $P_X(\{r\}) = \binom{n}{r} p^r q^{n-r}$. Note that $P_X$ is the Binomial $(n, p)$ distribution. Here, $M_X(t)$ is computed using the distribution of $X$, i.e., using the middle term in (2.6) only.

The connection between the mgf $M_X(\cdot)$ and the moments of a random variable $X$ is given in the following propositions.

Proposition 6.2.2: Let $X$ be a nonnegative random variable and $t \ge 0$. Then
\[ M_X(t) \equiv E(e^{tX}) = \sum_{n=0}^{\infty} \frac{t^n \mu_n}{n!} \tag{2.9} \]
where $\mu_n$ is as in (2.7).

Proof: Since $e^{tX} = \sum_{n=0}^{\infty} \frac{t^n X^n}{n!}$ and $X$ is nonnegative, (2.9) follows from the MCT. □
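The binomial illustration above can be checked numerically by computing $E(e^{tX})$ once on $\Omega = \{H, T\}^n$ and once from $P_X$, and comparing both with $(pe^t + q)^n$. The values of $n$, $p$, $t$ and the function names in this sketch are arbitrary assumptions made for the example.

```python
import math
from itertools import product


def mgf_on_omega(n, p, t):
    """Integrate e^{tX} over Omega = {H, T}^n with P({w}) = p^X(w) q^(n - X(w))."""
    q = 1 - p
    total = 0.0
    for w in product("HT", repeat=n):
        x = w.count("H")                       # X(w) = number of H's in w
        total += math.exp(t * x) * p ** x * q ** (n - x)
    return total


def mgf_via_distribution(n, p, t):
    """Integrate e^{tx} against P_X, the Binomial(n, p) distribution."""
    q = 1 - p
    return sum(math.exp(t * r) * math.comb(n, r) * p ** r * q ** (n - r)
               for r in range(n + 1))


def mgf_closed_form(n, p, t):
    return (p * math.exp(t) + (1 - p)) ** n


n, p, t = 4, 0.3, 0.7                          # illustrative values
print(mgf_on_omega(n, p, t), mgf_via_distribution(n, p, t), mgf_closed_form(n, p, t))
```

The first function sums over $2^n$ sample points while the second needs only $n + 1$ terms, which is the practical benefit of working with $P_X$.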


Proposition 6.2.3: Let $X$ be a random variable and let $M_X(t)$ be finite for all $|t| < \epsilon$, for some $\epsilon > 0$. Then

(i) $E|X|^n < \infty$ for all $n \ge 1$,

(ii) $M_X(t) = \sum_{n=0}^{\infty} \frac{t^n \mu_n}{n!}$ for all $|t| < \epsilon$,

(iii) $M_X(\cdot)$ is infinitely differentiable on $(-\epsilon, +\epsilon)$ and for $r \in \mathbb{N}$, the $r$th derivative of $M_X(\cdot)$ is
\[ M_X^{(r)}(t) = \sum_{n=0}^{\infty} \frac{t^n \mu_{n+r}}{n!} = E(e^{tX} X^r) \quad \text{for } |t| < \epsilon. \tag{2.10} \]
In particular,
\[ M_X^{(r)}(0) = \mu_r = EX^r. \tag{2.11} \]

Proof: Since $M_X(t) < \infty$ for all $|t| < \epsilon$,
\[ E(e^{|tX|}) \le E(e^{tX}) + E(e^{-tX}) < \infty \quad \text{for } |t| < \epsilon. \tag{2.12} \]
Also, $e^{|tX|} \ge \frac{|t|^n |X|^n}{n!}$ for all $n \in \mathbb{N}$ and hence, (i) follows by choosing a $t$ in $(0, \epsilon)$. Next note that $\big| \sum_{j=0}^{n} \frac{(tx)^j}{j!} \big| \le e^{|tx|}$ for all $x$ in $\mathbb{R}$ and all $n \in \mathbb{N}$. Hence, by (2.12) and the DCT, (ii) follows. Turning to (iii), since $M_X(\cdot)$ admits a power series expansion convergent in $|t| < \epsilon$, it is infinitely differentiable in $|t| < \epsilon$ and the derivatives of $M_X(\cdot)$ can be found by term-by-term differentiation of the power series (see Rudin (1976), Chapter 9). Hence,
\[ M_X^{(r)}(t) = \frac{d^r}{dt^r} \Big( \sum_{n=0}^{\infty} \frac{t^n \mu_n}{n!} \Big) = \sum_{n=0}^{\infty} \frac{\mu_n}{n!} \frac{d^r (t^n)}{dt^r} = \sum_{n=r}^{\infty} \mu_n \frac{t^{n-r}}{(n-r)!} = \sum_{n=0}^{\infty} \frac{t^n \mu_{n+r}}{n!}. \]
The verification of the second equality in (2.10) is left as an exercise (see Problem 6.4). □

Remark 6.2.3: If the mgf $M_X(\cdot)$ is finite for $|t| < \epsilon$ for some $\epsilon > 0$, then by part (ii) of the above proposition, $M_X(t)$ has a power series expansion in $t$ around 0 and $\frac{\mu_n}{n!}$ is simply the coefficient of $t^n$. For example, if $X$ has a $N(0, 1)$ distribution, then for all $t \in \mathbb{R}$,
\[ M_X(t) = \int_{-\infty}^{\infty} e^{tx} \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \, dx = e^{t^2/2} = \sum_{k=0}^{\infty} \frac{(t^2)^k}{k!} \frac{1}{2^k}. \tag{2.13} \]
Thus,
\[ \mu_n = \begin{cases} 0 & \text{if } n \text{ is odd} \\ \dfrac{(2k)!}{k! \, 2^k} & \text{if } n = 2k, \ k = 1, 2, \ldots \end{cases} \]
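The recipe of Remark 6.2.3 (read off $\mu_n$ as $n!$ times the coefficient of $t^n$) is easy to mechanize for the $N(0,1)$ case. The sketch below uses exact rational arithmetic; the function names are hypothetical.

```python
from fractions import Fraction
from math import factorial


def series_coefficient(n):
    """Coefficient of t^n in exp(t^2 / 2) = sum_k (t^2)^k / (k! 2^k), as in (2.13)."""
    if n % 2 == 1:
        return Fraction(0)                     # only even powers of t appear
    k = n // 2
    return Fraction(1, factorial(k) * 2 ** k)


def normal_moment(n):
    """mu_n = n! * (coefficient of t^n), following Remark 6.2.3."""
    return factorial(n) * series_coefficient(n)


moments = [normal_moment(n) for n in range(1, 9)]
print(moments)
```

The even moments produced this way satisfy $\mu_{2k} = (2k - 1)\,\mu_{2k-2}$, which gives a quick consistency check on the closed form $(2k)!/(k!\,2^k)$.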

Remark 6.2.4: If $M_X(t)$ is finite for $|t| < \epsilon$ for some $\epsilon > 0$, then all the moments $\{\mu_n\}_{n \ge 1}$ of $X$ are determined and also its probability distribution. However, in general, the sequence $\{\mu_n\}_{n \ge 1}$ of moments of $X$ need not determine the distribution of $X$ uniquely.

Table 6.2.1 gives the mean, variance, and the mgf of a number of standard probability distributions on the real line.

For future reference, some of the inequalities established in Section 3.1 are specialized for random variables and collected below without proofs.

Proposition 6.2.4: (Markov’s inequality). Let $X$ be a random variable on $(\Omega, \mathcal{F}, P)$. Then for any $\phi : \mathbb{R}_+ \to \mathbb{R}_+$ nondecreasing and any $t > 0$ with $\phi(t) > 0$,
\[ P(|X| \ge t) \le \frac{E\,\phi(|X|)}{\phi(t)}. \tag{2.14} \]
In particular,

(i) for $r > 0$, $t > 0$,
\[ P(X \ge t) \le P(|X| \ge t) \le \frac{E|X|^r}{t^r}, \tag{2.15} \]

(ii) for any $t \ge 0$,
\[ P(|X| \ge t) \le \frac{E(e^{\theta |X|})}{e^{\theta t}} \]
for any $\theta > 0$ and hence
\[ P(|X| \ge t) \le \inf_{\theta > 0} \frac{E(e^{\theta |X|})}{e^{\theta t}}. \tag{2.16} \]

Proposition 6.2.5: (Chebychev’s inequality). Let $X$ be a random variable with $EX^2 < \infty$, $EX = \mu$, $\mathrm{Var}(X) = \sigma^2$. Then for any $k > 0$,
\[ P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}. \tag{2.17} \]
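Inequalities (2.15) and (2.17) can be compared with exact tail probabilities for a small discrete distribution. The sketch assumes $X$ uniform on $\{1, \ldots, 6\}$; this distribution, the chosen values of $t$, $r$, $k$, and the helper names are illustrative assumptions.

```python
from fractions import Fraction

# X uniform on {1, ..., 6}: an illustrative distribution P_X.
P_X = {x: Fraction(1, 6) for x in range(1, 7)}


def expect(g, pmf):
    return sum(g(x) * p for x, p in pmf.items())


def tail(t, pmf):
    """P(|X| >= t)."""
    return sum((p for x, p in pmf.items() if abs(x) >= t), Fraction(0))


def markov_bound(t, r, pmf):
    """Right side of (2.15): E|X|^r / t^r, for integer r in this sketch."""
    return expect(lambda x: abs(x) ** r, pmf) / Fraction(t) ** r


def chebyshev_lhs(k, pmf):
    """P(|X - mu| >= k sigma), tested as (X - mu)^2 >= k^2 sigma^2 to stay exact."""
    mu = expect(lambda x: x, pmf)
    var = expect(lambda x: (x - mu) ** 2, pmf)
    return sum((p for x, p in pmf.items() if (x - mu) ** 2 >= k ** 2 * var),
               Fraction(0))


k = Fraction(6, 5)
print(tail(5, P_X), markov_bound(5, 1, P_X), markov_bound(5, 2, P_X))
print(chebyshev_lhs(k, P_X), 1 / k ** 2)
```

In this example both bounds hold with room to spare, which is typical: the inequalities are distribution-free and therefore rarely tight for a specific $P_X$.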


TABLE 6.2.1. Mean, variance, mgf of the distributions listed in Tables 4.6.1 and 4.6.2.

Distribution    Mean    Variance    mgf M(t)

Bernoulli (p), 0

0, then equality holds in (2.20) iff $P(Y = aX + b) = 1$ for some constants $a, b$ in $\mathbb{R}$ (Problem 6.6).

Proposition 6.2.9: (Minkowski’s inequality). Let $X$ and $Y$ be random variables on $(\Omega, \mathcal{F}, P)$ such that $E|X|^p < \infty$, $E|Y|^p < \infty$ for some $1 \le p < \infty$. Then
\[ (E|X + Y|^p)^{1/p} \le (E|X|^p)^{1/p} + (E|Y|^p)^{1/p}. \tag{2.21} \]

Definition 6.2.12: (Product moments of random vectors). Let $X = (X_1, X_2, \ldots, X_k)$ be a random vector. The product moment of order $r = (r_1, r_2, \ldots, r_k)$, with $r_i$ being a nonnegative integer for each $i$, is defined as
\[ \mu_r \equiv \mu_{r_1, r_2, \ldots, r_k} \equiv E(X_1^{r_1} X_2^{r_2} \cdots X_k^{r_k}), \tag{2.22} \]
provided $E|X_1^{r_1} \cdots X_k^{r_k}| < \infty$.

The joint moment generating function (joint mgf) of a random vector $X = (X_1, X_2, \ldots, X_k)$ is defined by
\[ M_{X_1, \ldots, X_k}(t_1, t_2, \ldots, t_k) \equiv E(e^{t_1 X_1 + t_2 X_2 + \cdots + t_k X_k}), \tag{2.23} \]
for all $t_1, t_2, \ldots, t_k$ in $\mathbb{R}$.

As in the case of a random variable, if the joint mgf $M_{X_1, X_2, \ldots, X_k}(t_1, \ldots, t_k)$ is finite for all $(t_1, t_2, \ldots, t_k)$ with $|t_i| < \epsilon$ for all $i = 1, 2, \ldots, k$ for some $\epsilon > 0$, then an analog of Proposition 6.2.3 holds. For example, the following assertions are valid (cf. Problem 6.4):

(i) $E|X_i|^n < \infty$ for all $i = 1, 2, \ldots, k$ and $n \ge 1$.

(ii) For $t = (t_1, \ldots, t_k) \in \mathbb{R}^k$ and $r = (r_1, r_2, \ldots, r_k) \in \mathbb{Z}_+^k$, let
\[ t^r = t_1^{r_1} t_2^{r_2} \cdots t_k^{r_k}, \quad r! = r_1! \, r_2! \cdots r_k!, \quad \text{and} \quad \mu_r = E X_1^{r_1} X_2^{r_2} \cdots X_k^{r_k}. \tag{2.24} \]

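Then,
\[ M_X(t_1, \ldots, t_k) = \sum_{r \in \mathbb{Z}_+^k} \frac{t^r \mu_r}{r!} \tag{2.25} \]
for all $t = (t_1, t_2, \ldots, t_k) \in (-\epsilon, +\epsilon)^k$.

(iii) For any $r = (r_1, \ldots, r_k) \in \mathbb{Z}_+^k$,
\[ \frac{d^r}{dt^r} M_X(t) \Big|_{t=0} = \mu_r, \tag{2.26} \]
where $\frac{d^r}{dt^r} = \frac{\partial^{r_1}}{\partial t_1^{r_1}} \frac{\partial^{r_2}}{\partial t_2^{r_2}} \cdots \frac{\partial^{r_k}}{\partial t_k^{r_k}}$.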

6.3 Kolmogorov’s consistency theorem

In the previous section, the case of a single random variable and that of a finite dimensional random vector were discussed. The goal of this section is to discuss infinite families of random variables such as a random sequence $\{X_n\}_{n \ge 1}$ or a random function $\{X(t) : 0 \le t < T\}$, $0 \le T \le \infty$. For example, $X_n$ could be the population size of the $n$th generation of a randomly evolving biological population, and $X(t)$ could be the temperature at time $t$ in a chemical reaction over a period $[0, T]$. An example from the modeling of spatial random phenomena is a collection $\{X(s) : s \in S\}$ of random variables $X(s)$ where $S$ is a specified region such as the U.S., and $X(s)$ is the amount of rainfall at location $s \in S$ during a specified month.

Let $(\Omega, \mathcal{F}, P)$ be a probability space and $\{X_\alpha : \alpha \in A\}$ be a collection of random variables defined on $(\Omega, \mathcal{F}, P)$, where $A$ is a nonempty set. Then for any $(\alpha_1, \alpha_2, \ldots, \alpha_k) \in A^k$, $1 \le k < \infty$, the random vector $(X_{\alpha_1}, X_{\alpha_2}, \ldots, X_{\alpha_k})$ has a joint probability distribution $\mu_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}$ over $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$.

Definition 6.3.1: A (real valued) stochastic process with index set $A$ is a family $\{X_\alpha : \alpha \in A\}$ of random variables defined on a probability space $(\Omega, \mathcal{F}, P)$.

Example 6.3.1: (Examples of stochastic processes). Let $\Omega = [0, 1]$, $\mathcal{F} = \mathcal{B}([0, 1])$, $P$ = the Lebesgue measure on $[0, 1]$. Let $A_1 = \{1, 2, 3, \ldots\}$, $A_2 = [0, T]$, $0 < T < \infty$. For $\omega \in \Omega$, $n \in A_1$, $t \in A_2$, let
\[ \begin{aligned} X_n(\omega) &= \sin 2\pi n \omega \\ Y_t(\omega) &= \sin 2\pi t \omega \\ Z_n(\omega) &= n\text{th digit in the decimal expansion of } \omega \\ V_{n,t}(\omega) &= X_n^2(\omega) + Y_t^2(\omega). \end{aligned} \]
Then $\{X_n : n \in A_1\}$, $\{Z_n : n \in A_1\}$, $\{V_{n,t} : (n, t) \in A_1 \times A_2\}$, $\{Y_t : t \in A_2\}$ are all stochastic processes.
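The processes of Example 6.3.1 are deterministic functions of $\omega$; all the randomness comes from drawing $\omega$ according to Lebesgue measure on $[0, 1]$. The sketch below evaluates them at a given $\omega$. Passing $\omega$ as an exact `Fraction`, so that decimal digits are not disturbed by floating point rounding, is an implementation choice of this example.

```python
import math
import random
from fractions import Fraction


def X(n, w):
    return math.sin(2 * math.pi * n * w)


def Y(t, w):
    return math.sin(2 * math.pi * t * w)


def Z(n, w):
    """n-th digit in the decimal expansion of w (w an exact Fraction in [0, 1))."""
    return int((w * 10 ** n) // 1) % 10


def V(n, t, w):
    return X(n, w) ** 2 + Y(t, w) ** 2


w = Fraction(314159, 1000000)                 # a fixed sample point
print([Z(n, w) for n in range(1, 7)])         # first six decimal digits of w

# One sample path of {X_n} and of {Z_n} at a randomly drawn sample point.
w_random = Fraction(random.random())          # exact value of a uniform float
print([round(X(n, w_random), 3) for n in range(1, 6)])
print([Z(n, w_random) for n in range(1, 6)])
```

Viewing the output for a fixed $\omega$ as a function of the index is the "random function" point of view: each draw of $\omega$ selects one whole path.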


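Note that a real valued stochastic process $\{X_\alpha : \alpha \in A\}$ may also be viewed as a random real valued function on the set $A$ by the identification $\omega \to f(\omega, \cdot)$, where $f(\omega, \alpha) = X_\alpha(\omega)$ for $\alpha$ in $A$.

Definition 6.3.2: The family $\{\mu_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}(\cdot) \equiv P((X_{\alpha_1}, \ldots, X_{\alpha_k}) \in \cdot) : (\alpha_1, \alpha_2, \ldots, \alpha_k) \in A^k, 1 \le k < \infty\}$ of probability distributions is called the family of finite dimensional distributions (fdds) associated with the stochastic process $\{X_\alpha : \alpha \in A\}$.

This family of finite dimensional distributions satisfies the following consistency conditions: For any $(\alpha_1, \alpha_2, \ldots, \alpha_k) \in A^k$, $2 \le k < \infty$, and any $B_1, B_2, \ldots, B_k$ in $\mathcal{B}(\mathbb{R})$,

C1: $\mu_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}(B_1 \times \cdots \times B_{k-1} \times \mathbb{R}) = \mu_{(\alpha_1, \alpha_2, \ldots, \alpha_{k-1})}(B_1 \times \cdots \times B_{k-1})$;

C2: For any permutation $(i_1, i_2, \ldots, i_k)$ of $(1, 2, \ldots, k)$,
\[ \mu_{(\alpha_{i_1}, \alpha_{i_2}, \ldots, \alpha_{i_k})}(B_{i_1} \times B_{i_2} \times \cdots \times B_{i_k}) = \mu_{(\alpha_1, \ldots, \alpha_k)}(B_1 \times B_2 \times \cdots \times B_k). \]

To verify C1, note that
\[ \begin{aligned} \mu_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}(B_1 \times B_2 \times \cdots \times B_{k-1} \times \mathbb{R}) &= P(X_{\alpha_1} \in B_1, X_{\alpha_2} \in B_2, \ldots, X_{\alpha_{k-1}} \in B_{k-1}, X_{\alpha_k} \in \mathbb{R}) \\ &= P(X_{\alpha_1} \in B_1, X_{\alpha_2} \in B_2, \ldots, X_{\alpha_{k-1}} \in B_{k-1}) \\ &= \mu_{(\alpha_1, \alpha_2, \ldots, \alpha_{k-1})}(B_1 \times B_2 \times \cdots \times B_{k-1}). \end{aligned} \]
Similarly, to verify C2, note that
\[ \begin{aligned} \mu_{(\alpha_{i_1}, \alpha_{i_2}, \ldots, \alpha_{i_k})}(B_{i_1} \times B_{i_2} \times \cdots \times B_{i_k}) &= P(X_{\alpha_{i_1}} \in B_{i_1}, X_{\alpha_{i_2}} \in B_{i_2}, \ldots, X_{\alpha_{i_k}} \in B_{i_k}) \\ &= P(X_{\alpha_1} \in B_1, X_{\alpha_2} \in B_2, \ldots, X_{\alpha_k} \in B_k) \\ &= \mu_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}(B_1 \times B_2 \times \cdots \times B_k). \end{aligned} \]

A natural question is the following: given a family of probability distributions $Q_A \equiv \{\nu_{(\alpha_1, \alpha_2, \ldots, \alpha_k)} : (\alpha_1, \alpha_2, \ldots, \alpha_k) \in A^k, 1 \le k < \infty\}$ on finite dimensional Euclidean spaces, does there exist a real valued stochastic process $\{X_\alpha : \alpha \in A\}$ such that its family of finite dimensional distributions coincides with $Q_A$? Kolmogorov (1956) showed that if $Q_A$ satisfies C1 and C2, then such a stochastic process does exist. This is known as Kolmogorov’s consistency theorem (also known as Kolmogorov’s existence theorem).

Theorem 6.3.1: (Kolmogorov’s consistency theorem). Let $A$ be a nonempty set. Let $Q_A \equiv \{\nu_{(\alpha_1, \alpha_2, \ldots, \alpha_k)} : (\alpha_1, \alpha_2, \ldots, \alpha_k) \in A^k, 1 \le k < \infty\}$ be a family of probability distributions such that for each $(\alpha_1, \alpha_2, \ldots, \alpha_k) \in A^k$, $1 \le k < \infty$,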


(i) $\nu_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}$ is a probability distribution on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$,

(ii) C1 and C2 hold, i.e., for all $B_1, B_2, \ldots, B_k \in \mathcal{B}(\mathbb{R})$, $2 \le k < \infty$,
\[ \nu_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}(B_1 \times B_2 \times \cdots \times B_{k-1} \times \mathbb{R}) = \nu_{(\alpha_1, \alpha_2, \ldots, \alpha_{k-1})}(B_1 \times B_2 \times \cdots \times B_{k-1}) \tag{3.1} \]
and for any permutation $(i_1, i_2, \ldots, i_k)$ of $(1, 2, \ldots, k)$,
\[ \nu_{(\alpha_{i_1}, \alpha_{i_2}, \ldots, \alpha_{i_k})}(B_{i_1} \times B_{i_2} \times \cdots \times B_{i_k}) = \nu_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}(B_1 \times B_2 \times \cdots \times B_k). \tag{3.2} \]

Then, there exists a probability space $(\Omega, \mathcal{F}, P)$ and a stochastic process $X_A \equiv \{X_\alpha : \alpha \in A\}$ on $(\Omega, \mathcal{F}, P)$ such that $Q_A$ is the family of finite dimensional distributions associated with $X_A$.

Remark 6.3.1: Thus the above theorem says that given the family $Q_A$ satisfying conditions (i) and (ii), there exists a real valued function $f$ on $A \times \Omega$ such that for each $\omega$, $f(\cdot, \omega)$ is a function on $A$ and for each $(\alpha_1, \alpha_2, \ldots, \alpha_k) \in A^k$, the vector $(f(\alpha_1, \omega), f(\alpha_2, \omega), \ldots, f(\alpha_k, \omega))$ is a random vector with probability distribution $\nu_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}$. This random function point of view is useful in dealing with functionals of the form $M(\omega) \equiv \sup\{f(\alpha, \omega) : \alpha \in A\}$. For example, if $A = \{1, 2, \ldots\}$, then one might consider functionals such as $\lim_{n \to \infty} f(n, \omega)$, $\lim_{n \to \infty} \frac{1}{n} \sum_{j=1}^{n} f(j, \omega)$, $\sum_{j=1}^{\infty} f(j, \omega)$, etc. Since the random functionals are not fully determined by $f(\alpha, \omega)$ for finitely many $\alpha$’s, it is not possible to compute probabilities of events defined in terms of these functionals from the knowledge of the finite dimensional distribution of $(f(\alpha_1, \omega), \ldots, f(\alpha_k, \omega))$ for a given $(\alpha_1, \ldots, \alpha_k)$, no matter how large $k$ is. Kolmogorov’s consistency theorem allows one to compute these probabilities given all finite dimensional distributions (provided that the functionals satisfy appropriate measurability conditions).

Given a probability measure $\mu$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, now consider the problem of constructing a probability space $(\Omega, \mathcal{F}, P)$ and a random variable $X$ on it with distribution $\mu$. A natural solution is to set the sample space $\Omega$ to be $\mathbb{R}$, the σ-algebra $\mathcal{F}$ to be $\mathcal{B}(\mathbb{R})$, the probability measure $P$ to be $\mu$, and the random variable $X$ to be the identity map $X(\omega) \equiv \omega$. Similarly, given a probability measure $\mu$ on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$, one can set the sample space $\Omega$ to be $\mathbb{R}^k$, the σ-algebra $\mathcal{F}$ to be $\mathcal{B}(\mathbb{R}^k)$, the probability measure $P$ to be $\mu$, and the random vector $X$ to be the identity map.
Arguing in the same fashion, given a family $Q_A$ of finite dimensional distributions with index set $A$, to construct a stochastic process $\{X_\alpha : \alpha \in A\}$ with index set $A$ on some probability space $(\Omega, \mathcal{F}, P)$, it is natural to set the sample space $\Omega$ to be $\mathbb{R}^A$, the collection of all real valued functions on $A$, $\mathcal{F}$ to be a suitable σ-algebra that includes all finite dimensional events, $P$ to be an appropriate probability measure that yields $Q_A$, and $X$ to be the identity map. These considerations lead to the following definitions.

Definition 6.3.3: Let $A$ be a nonempty set. Then $\mathbb{R}^A \equiv \{f \mid f : A \to \mathbb{R}\}$, the collection of all real valued functions on $A$.

If $A$ is a finite set $\{a_1, a_2, \ldots, a_k\}$, then $\mathbb{R}^A$ can be identified with $\mathbb{R}^k$ by associating each $f \in \mathbb{R}^A$ with the vector $(f(a_1), f(a_2), \ldots, f(a_k))$ in $\mathbb{R}^k$. If $A$ is a countably infinite set $\{a_1, a_2, a_3, \ldots\}$, then $\mathbb{R}^A$ can be similarly identified with $\mathbb{R}^\infty$, the set of all sequences $\{x_1, x_2, x_3, \ldots\}$ of real numbers. If $A$ is the interval $[0, 1]$, then $\mathbb{R}^A$ is the collection of all real valued functions on $[0, 1]$.

Definition 6.3.4: Let $A$ be a nonempty set. A subset $C \subset \mathbb{R}^A$ is called a finite dimensional cylinder set (fdcs) if there exists a finite subset $A_1 \subset A$, say, $A_1 \equiv \{\alpha_1, \alpha_2, \ldots, \alpha_k\}$, $1 \le k < \infty$, and a Borel set $B$ in $\mathcal{B}(\mathbb{R}^k)$ such that $C = \{f : f \in \mathbb{R}^A \text{ and } (f(\alpha_1), f(\alpha_2), \ldots, f(\alpha_k)) \in B\}$. The set $B$ is called a base for $C$. The collection of all finite dimensional cylinder sets will be denoted by $\mathcal{C}$.

The name cylinder is motivated by the following example:

Example 6.3.2: Let $A = \{1, 2, 3\}$ and $C = \{(x_1, x_2, x_3) : x_1^2 + x_2^2 \le 1\}$. Then $C$ is a cylinder (in the usual sense of the English word), but with infinite height and depth. According to Definition 6.3.4, $C$ is also a cylinder in $\mathbb{R}^3$ with the unit circle in $\mathbb{R}^2$ as its base.

Examples 6.3.3 and 6.3.4 below are examples of fdcs, whereas Example 6.3.5 is an example of a set that is not a fdcs.

Example 6.3.3: Let $A = \{1, 2\}$ and $C = \{(x_1, x_2) : |\sin 2\pi x_1| \le \frac{1}{\sqrt{2}}\}$.

Example 6.3.4: Let $A = \{1, 2, 3, \ldots\}$ and $C = \big\{(x_1, x_2, x_3, \ldots) : \frac{x_{17}^2}{4} + \frac{x_{30}^2}{10} - \frac{x_{42}^2}{5} \le 10\big\}$.
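A finite dimensional cylinder set in the sense of Definition 6.3.4 is determined by finitely many indices and a base $B$, so membership of $f \in \mathbb{R}^A$ can be decided by looking at finitely many values of $f$. The sketch below models this for the cylinder of Example 6.3.2; the class name `CylinderSet`, the representation of the base as a predicate, and the test functions are assumptions of this illustration.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class CylinderSet:
    indices: Tuple[int, ...]                    # (alpha_1, ..., alpha_k), a finite subset of A
    base: Callable[[Tuple[float, ...]], bool]   # indicator function of the base B in R^k

    def contains(self, f: Callable[[int], float]) -> bool:
        """f is in C iff (f(alpha_1), ..., f(alpha_k)) lies in B."""
        return self.base(tuple(f(a) for a in self.indices))


# Example 6.3.2: A = {1, 2, 3}, C = {x : x_1^2 + x_2^2 <= 1}, base = unit disc.
C = CylinderSet(indices=(1, 2), base=lambda v: v[0] ** 2 + v[1] ** 2 <= 1)


def f(a):
    return {1: 0.3, 2: 0.4, 3: 1e9}[a]          # an element of R^A


def g(a):
    return {1: 0.3, 2: 0.4, 3: -5.0}[a]         # differs from f only at index 3


def h(a):
    return {1: 1.0, 2: 1.0, 3: 0.0}[a]


print(C.contains(f), C.contains(g), C.contains(h))
```

Since `f` and `g` agree on the indices 1 and 2, they are either both in $C$ or both outside it; the third coordinate is unconstrained, which is the "infinite height" of the cylinder.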

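Example 6.3.5: Let $A = \{1, 2, 3, \ldots\}$ and $D = \{(x_1, x_2, x_3, \ldots) : x_j \in \mathbb{R} \text{ for all } j \ge 1 \text{ and } \lim_{n \to \infty} \frac{1}{n} \sum_{j=1}^{n} x_j \text{ exists}\}$. Then $D$ is not a finite dimensional cylinder set (Problem 6.8).

Proposition 6.3.2: Let $A$ be a nonempty set and $\mathcal{C}$ be the collection of all finite dimensional cylinder sets in $\mathbb{R}^A$. Then $\mathcal{C}$ is an algebra.

Proof: Let $C_1, C_2 \in \mathcal{C}$ and let
\[ \begin{aligned} C_1 &= \{f : f \in \mathbb{R}^A \text{ and } (f(\alpha_1), f(\alpha_2), \ldots, f(\alpha_k)) \in B_1\} \\ C_2 &= \{f : f \in \mathbb{R}^A \text{ and } (f(\beta_1), f(\beta_2), \ldots, f(\beta_j)) \in B_2\} \end{aligned} \]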


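for some $A_1 = \{\alpha_1, \alpha_2, \ldots, \alpha_k\} \subset A$, $A_2 = \{\beta_1, \beta_2, \ldots, \beta_j\} \subset A$, $B_1 \in \mathcal{B}(\mathbb{R}^k)$, $B_2 \in \mathcal{B}(\mathbb{R}^j)$, $1 \le k < \infty$, $1 \le j < \infty$. Let $A_3 = A_1 \cup A_2 = \{\gamma_1, \gamma_2, \ldots, \gamma_\ell\}$, where without loss of generality $(\gamma_1, \gamma_2, \ldots, \gamma_k) = (\alpha_1, \alpha_2, \ldots, \alpha_k)$ and $(\gamma_{\ell-j+1}, \ldots, \gamma_{\ell-1}, \gamma_\ell) = (\beta_1, \beta_2, \ldots, \beta_j)$. Then $C_1$ and $C_2$ may be expressed as
\[ \begin{aligned} C_1 &= \{f : f \in \mathbb{R}^A \text{ and } (f(\gamma_1), f(\gamma_2), \ldots, f(\gamma_\ell)) \in \tilde{B}_1\} \\ C_2 &= \{f : f \in \mathbb{R}^A \text{ and } (f(\gamma_1), \ldots, f(\gamma_\ell)) \in \tilde{B}_2\} \end{aligned} \]
where $\tilde{B}_1 = B_1 \times \mathbb{R}^{\ell-k}$ and $\tilde{B}_2 = \mathbb{R}^{\ell-j} \times B_2$. Thus, $C_1 \cup C_2 = \{f : f \in \mathbb{R}^A \text{ and } (f(\gamma_1), \ldots, f(\gamma_\ell)) \in \tilde{B}_1 \cup \tilde{B}_2\}$. Since both $\tilde{B}_1$ and $\tilde{B}_2$ lie in $\mathcal{B}(\mathbb{R}^\ell)$, $C_1 \cup C_2 \in \mathcal{C}$. Next note that $C_1^c = \{f : f \in \mathbb{R}^A \text{ and } (f(\alpha_1), \ldots, f(\alpha_k)) \in B_1^c\}$. Since $B_1^c \in \mathcal{B}(\mathbb{R}^k)$, it follows that $C_1^c \in \mathcal{C}$. Thus, $\mathcal{C}$ is an algebra. □

Remark 6.3.2: If $A$ is a finite nonempty set, the collection $\mathcal{C}$ is also a σ-algebra.

Definition 6.3.5: Let $A$ be a nonempty set. Let $\mathcal{R}^A$ be the σ-algebra generated by the collection $\mathcal{C}$. Then $\mathcal{R}^A$ is called the product σ-algebra on $\mathbb{R}^A$.

Remark 6.3.3: If $A = \{1, 2, 3, \ldots\} \equiv \mathbb{N}$ and $\mathbb{R}^{\mathbb{N}}$ is identified with the set $\mathbb{R}^\infty$ of all sequences of real numbers, then the product σ-algebra $\mathcal{R}^{\mathbb{N}}$ coincides with the Borel σ-algebra $\mathcal{B}(\mathbb{R}^\infty)$ on $\mathbb{R}^\infty$ under the metric
\[ d(x, y) = \sum_{j=1}^{\infty} \frac{1}{2^j} \frac{|x_j - y_j|}{1 + |x_j - y_j|} \tag{3.3} \]
for $x = (x_1, x_2, \ldots)$, $y = (y_1, y_2, \ldots)$ in $\mathbb{R}^\infty$ (Problem 6.9).

Definition 6.3.6: Let $A$ be a nonempty set. For any $(\alpha_1, \alpha_2, \ldots, \alpha_k) \in A^k$, $1 \le k < \infty$, the projection map $\pi_{(\alpha_1, \ldots, \alpha_k)}$ from $\mathbb{R}^A$ to $\mathbb{R}^k$ is defined by
\[ \pi_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}(f) = (f(\alpha_1), f(\alpha_2), \ldots, f(\alpha_k)). \tag{3.4} \]
In particular, for $\alpha \in A$,
\[ \pi_\alpha(f) = f(\alpha) \tag{3.5} \]
is called a co-ordinate map. The projection map $\pi_{A_1}$ for any arbitrary subset $A_1 \subset A$ may be similarly defined.

The next proposition follows from the definition of $\mathcal{R}^A$.

Proposition 6.3.3: (i) For each $\alpha \in A$, the map $\pi_\alpha$ from $\mathbb{R}^A$ to $\mathbb{R}$ is $\langle \mathcal{R}^A, \mathcal{B}(\mathbb{R}) \rangle$-measurable. (ii) For any $(\alpha_1, \alpha_2, \ldots, \alpha_k) \in A^k$, $1 \le k < \infty$, the map $\pi_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}$ from $\mathbb{R}^A$ to $\mathbb{R}^k$ is $\langle \mathcal{R}^A, \mathcal{B}(\mathbb{R}^k) \rangle$-measurable.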


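Proof of Theorem 6.3.1: Let $\Omega = \mathbb{R}^A$ and $\mathcal{F} \equiv \mathcal{R}^A$. Define a set function $P$ on $\mathcal{C}$ by
\[ P(C) = \nu_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}(B) \tag{3.6} \]
for a $C$ in $\mathcal{C}$ with representation
\[ C = \{\omega : \omega \in \mathbb{R}^A, \ (\omega(\alpha_1), \omega(\alpha_2), \ldots, \omega(\alpha_k)) \in B\}. \tag{3.7} \]
The main steps in the proof are to show that (i) $P(C)$ as defined in (3.6) is independent of the representation (3.7) of $C$, and (ii) $P(\cdot)$ is countably additive on $\mathcal{C}$. Next, by the Caratheodory extension theorem (Theorem 1.3.3), there exists a unique extension of $P$ (also denoted by $P$) to $\mathcal{F}$ such that $(\Omega, \mathcal{F}, P)$ is a probability space. Defining $X_\alpha(\omega) \equiv \pi_\alpha(\omega) = \omega(\alpha)$ for $\alpha$ in $A$ yields a stochastic process $\{X_\alpha : \alpha \in A\}$ on the probability space $(\mathbb{R}^A, \mathcal{R}^A, P) \equiv (\Omega, \mathcal{F}, P)$ with the family $Q_A$ as its set of finite dimensional distributions. Hence, it remains to establish (i) and (ii).

Let $C \in \mathcal{C}$ admit two representations:
\[ C \equiv \{\omega : (\omega(\alpha_1), \omega(\alpha_2), \ldots, \omega(\alpha_k)) \in B_1\} \equiv \pi^{-1}_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}(B_1) \]
and
\[ C \equiv \{\omega : (\omega(\beta_1), \omega(\beta_2), \ldots, \omega(\beta_j)) \in B_2\} \equiv \pi^{-1}_{(\beta_1, \beta_2, \ldots, \beta_j)}(B_2) \]
for some $A_1 = \{\alpha_1, \alpha_2, \ldots, \alpha_k\} \subset A$, $1 \le k < \infty$, and some $A_2 = \{\beta_1, \beta_2, \ldots, \beta_j\} \subset A$, $1 \le j < \infty$, $B_1 \in \mathcal{B}(\mathbb{R}^k)$ and $B_2 \in \mathcal{B}(\mathbb{R}^j)$. Let $A_3 = A_1 \cup A_2 = \{\gamma_1, \gamma_2, \ldots, \gamma_\ell\}$ and w.l.o.g., let $(\gamma_1, \gamma_2, \ldots, \gamma_k) = (\alpha_1, \alpha_2, \ldots, \alpha_k)$ and $(\gamma_{\ell-j+1}, \gamma_{\ell-j+2}, \ldots, \gamma_\ell) = (\beta_1, \beta_2, \ldots, \beta_j)$. Then $C$ may be represented as
\[ C = \pi^{-1}_{(\gamma_1, \gamma_2, \ldots, \gamma_\ell)}(\tilde{B}_1) = \pi^{-1}_{(\gamma_1, \gamma_2, \ldots, \gamma_\ell)}(\tilde{B}_2) \]
where $\tilde{B}_1 = B_1 \times \mathbb{R}^{\ell-k}$ and $\tilde{B}_2 = \mathbb{R}^{\ell-j} \times B_2$. Note that $(\omega(\gamma_1), \ldots, \omega(\gamma_\ell)) \in \tilde{B}_1$ iff $\omega \in C$ iff $(\omega(\gamma_1), \ldots, \omega(\gamma_\ell)) \in \tilde{B}_2$ and thus
\[ \tilde{B}_1 = \tilde{B}_2. \tag{3.8} \]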


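Next by the first consistency condition (3.1) and induction,
\[ \nu_{(\gamma_1, \gamma_2, \ldots, \gamma_\ell)}(\tilde{B}_1) = \nu_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}(B_1). \tag{3.9} \]
Also by (3.2), for $B_2$ of the form $B_{21} \times B_{22} \times \cdots \times B_{2j}$ with $B_{2i} \in \mathcal{B}(\mathbb{R})$ for all $1 \le i \le j$,
\[ \nu_{(\gamma_1, \gamma_2, \ldots, \gamma_\ell)}(\mathbb{R}^{\ell-j} \times B_2) = \nu_{(\gamma_{\ell-j+1}, \ldots, \gamma_\ell, \gamma_1, \gamma_2, \ldots, \gamma_{\ell-j})}(B_2 \times \mathbb{R}^{\ell-j}). \]
Now note that (a) $\nu_{(\gamma_1, \gamma_2, \ldots, \gamma_\ell)}(\mathbb{R}^{\ell-j} \times B)$ and $\nu_{(\gamma_{\ell-j+1}, \ldots, \gamma_\ell, \gamma_1, \gamma_2, \ldots, \gamma_{\ell-j})}(B \times \mathbb{R}^{\ell-j})$, considered as set functions defined for $B \in \mathcal{B}(\mathbb{R}^j)$, are probability measures on $\mathcal{B}(\mathbb{R}^j)$, (b) they coincide on the class $\Gamma$ of sets of the form $B = B_{21} \times B_{22} \times \cdots \times B_{2j}$ with $B_{2i} \in \mathcal{B}(\mathbb{R})$ for all $i$, and (c) the class $\Gamma$ is a π-class and it generates $\mathcal{B}(\mathbb{R}^j)$. Hence, by the uniqueness theorem (Theorem 1.3.6),
\[ \nu_{(\gamma_1, \gamma_2, \ldots, \gamma_\ell)}(\mathbb{R}^{\ell-j} \times B) = \nu_{(\gamma_{\ell-j+1}, \ldots, \gamma_\ell, \gamma_1, \gamma_2, \ldots, \gamma_{\ell-j})}(B \times \mathbb{R}^{\ell-j}) \tag{3.10} \]
for all $B \in \mathcal{B}(\mathbb{R}^j)$. Again by (3.1) and induction,
\[ \nu_{(\gamma_{\ell-j+1}, \ldots, \gamma_\ell, \gamma_1, \gamma_2, \ldots, \gamma_{\ell-j})}(B_2 \times \mathbb{R}^{\ell-j}) = \nu_{(\gamma_{\ell-j+1}, \ldots, \gamma_\ell)}(B_2) = \nu_{(\beta_1, \beta_2, \ldots, \beta_j)}(B_2). \tag{3.11} \]
Since $\tilde{B}_2 = \mathbb{R}^{\ell-j} \times B_2$, by (3.10) and (3.11),
\[ \nu_{(\gamma_1, \gamma_2, \ldots, \gamma_\ell)}(\tilde{B}_2) = \nu_{(\beta_1, \beta_2, \ldots, \beta_j)}(B_2). \]
Now from (3.8) and (3.9) it follows that
\[ \nu_{(\alpha_1, \ldots, \alpha_k)}(B_1) = \nu_{(\gamma_1, \gamma_2, \ldots, \gamma_\ell)}(\tilde{B}_1) = \nu_{(\gamma_1, \gamma_2, \ldots, \gamma_\ell)}(\tilde{B}_2) = \nu_{(\beta_1, \beta_2, \ldots, \beta_j)}(B_2), \]
thus establishing (i). To establish (ii), it needs to be shown that

(ii)$_a$ $P(C_1 \cup C_2) = P(C_1) + P(C_2)$ if $C_1, C_2 \in \mathcal{C}$ and $C_1 \cap C_2 = \emptyset$.

(ii)$_b$ $C_n \in \mathcal{C}$, $C_n \supset C_{n+1}$ for all $n$, $\bigcap_{n \ge 1} C_n = \emptyset \Rightarrow P(C_n) \downarrow 0$.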


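Let $C_1 = \pi^{-1}_{(\alpha_1, \ldots, \alpha_k)}(B_1)$ and $C_2 = \pi^{-1}_{(\beta_1, \ldots, \beta_j)}(B_2)$ for $B_1 \in \mathcal{B}(\mathbb{R}^k)$, $B_2 \in \mathcal{B}(\mathbb{R}^j)$, $\{\alpha_1, \ldots, \alpha_k\} \subset A$ and $\{\beta_1, \ldots, \beta_j\} \subset A$, $1 \le j, k < \infty$. As in the proof of Proposition 6.3.2, $C_1$ and $C_2$ may be represented as
\[ C_i = \pi^{-1}_{(\gamma_1, \gamma_2, \ldots, \gamma_\ell)}(\tilde{B}_i), \quad i = 1, 2, \]
where $\tilde{B}_i \in \mathcal{B}(\mathbb{R}^\ell)$. Since $C_1$ and $C_2$ are disjoint by hypothesis, it follows that $\tilde{B}_1$ and $\tilde{B}_2$ are disjoint. Also, since
\[ P(C_i) = \nu_{(\gamma_1, \gamma_2, \ldots, \gamma_\ell)}(\tilde{B}_i), \quad i = 1, 2, \]
and $\nu_{(\gamma_1, \ldots, \gamma_\ell)}(\cdot)$ is a measure on $\mathcal{B}(\mathbb{R}^\ell)$, it follows that
\[ P(C_1 \cup C_2) = \nu_{(\gamma_1, \ldots, \gamma_\ell)}(\tilde{B}_1 \cup \tilde{B}_2) = \nu_{(\gamma_1, \ldots, \gamma_\ell)}(\tilde{B}_1) + \nu_{(\gamma_1, \ldots, \gamma_\ell)}(\tilde{B}_2) = P(C_1) + P(C_2), \]
thus proving (ii)$_a$.

To prove (ii)$_b$, note that for any sequence $\{C_n\}_{n \ge 1} \subset \mathcal{C}$, there exists a countable set $A_1 = \{\alpha_1, \alpha_2, \ldots, \alpha_n, \ldots\}$, an increasing sequence $\{k_n\}_{n \ge 1}$ of positive integers and a sequence of Borel sets $\{B_n\}_{n \ge 1}$ such that $B_n \in \mathcal{B}(\mathbb{R}^{k_n})$ and $C_n = \pi^{-1}_{(\alpha_1, \alpha_2, \ldots, \alpha_{k_n})}(B_n)$ for all $n \in \mathbb{N}$. Now suppose that $\{C_n\}_{n \ge 1}$ is decreasing. It will be shown that if $\lim_{n \to \infty} P(C_n) = \delta > 0$, then $\bigcap_{n \ge 1} C_n \ne \emptyset$. For each $n$, by the regularity of measures (Corollary 1.3.5), there exists a compact set $G_n \subset B_n$ such that $\nu_{(\alpha_1, \ldots, \alpha_{k_n})}(B_n \setminus G_n)$

0, $H_n \subset C_n$, and $P(C_n \setminus H_n) < \frac{\delta}{2}$, it follows that $P(H_n) > \frac{\delta}{2}$ for all $n \ge 1$. This implies $H_n \ne \emptyset$ for each $n$. It will now be shown that $\bigcap_{n \ge 1} H_n \ne \emptyset$. Let $\{\omega_n\}_{n \ge 1}$ be a sequence of elements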


from $\Omega = \mathbb{R}^A$ such that for each $n$, $\omega_n \in H_n$. Then, since $\{H_n\}_{n \ge 1}$ is a decreasing sequence, for each $1 \le j < \infty$, $\omega_n \in H_j$ for $n \ge j$. This implies that the vector $(\omega_n(\alpha_1), \omega_n(\alpha_2), \ldots, \omega_n(\alpha_{k_j})) \in G_j$ for all $n \ge j$. Since $G_1$ is compact, there exists a subsequence $\{n_{1i}\}_{i \ge 1}$ such that $\lim_{i \to \infty} \omega_{n_{1i}}(\alpha_1) = \omega(\alpha_1)$ exists. Next, since $G_2$ is compact, there exists a further subsequence $\{n_{2i}\}_{i \ge 1}$ of $\{n_{1i}\}_{i \ge 1}$ such that $\lim_{i \to \infty} \omega_{n_{2i}}(\alpha_2) = \omega(\alpha_2)$ exists. Proceeding this way and applying the usual ‘diagonal method,’ a subsequence $\{n_i\}_{i \ge 1}$ is obtained such that $\lim_{i \to \infty} \omega_{n_i}(\alpha_j) = \omega(\alpha_j)$ for all $1 \le j < \infty$. Let $\omega(\alpha) = 0$ for $\alpha \notin \{\alpha_1, \alpha_2, \ldots\}$. Since for each $j$, $G_j$ is compact, $(\omega(\alpha_1), \omega(\alpha_2), \ldots, \omega(\alpha_{k_j})) \in G_j$ and hence $\omega \in H_j$. Thus, $\omega \in \bigcap_{j \ge 1} H_j \subset \bigcap_{j \ge 1} C_j$, implying $\bigcap_{j \ge 1} C_j \ne \emptyset$. The proof of the theorem is now complete. □

When the index set $A$ is countable and identified with the set $\mathbb{N} \equiv \{1, 2, 3, \ldots\}$, it is possible to give a simpler formulation of the consistency conditions.

Theorem 6.3.4: Let $\{\mu_n\}_{n \ge 1}$ be a sequence of probability measures such that

(i) for each $n \in \mathbb{N}$, $\mu_n$ is a probability measure on $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$,

(ii) for each $n \in \mathbb{N}$, $\mu_{n+1}(B \times \mathbb{R}) = \mu_n(B)$ for all $B \in \mathcal{B}(\mathbb{R}^n)$.

Then there exists a stochastic process $\{X_n : n \ge 1\}$ on a probability space $(\Omega, \mathcal{F}, P)$ with $\Omega = \mathbb{R}^\infty$, $\mathcal{F} = \mathcal{B}(\mathbb{R}^\infty)$ such that for each $n \ge 1$, the probability distribution $P_{(X_1, X_2, \ldots, X_n)}$ of the random vector $(X_1, X_2, \ldots, X_n)$ is $\mu_n$.

Proof: For any $\{i_1, i_2, \ldots, i_k\} \subset \mathbb{N}$, let $j_1 < j_2 < \cdots < j_k$ be the increasing rearrangement of $i_1, i_2, \ldots, i_k$. Then there exists a permutation $(r_1, r_2, \ldots, r_k)$ of $(1, 2, \ldots, k)$ such that $j_1 = i_{r_1}, j_2 = i_{r_2}, \ldots, j_k = i_{r_k}$. Now define
\[ \nu_{(j_1, j_2, \ldots, j_k)}(\cdot) \equiv \mu_{j_k} \pi^{-1}_{j_1, j_2, \ldots, j_k}(\cdot) \]
where $\pi_{j_1, j_2, \ldots, j_k}(x_1, \ldots, x_{j_k}) = (x_{j_1}, x_{j_2}, \ldots, x_{j_k})$ for all $(x_1, x_2, \ldots, x_{j_k}) \in \mathbb{R}^{j_k}$. Next define
\[ \nu_{(i_1, i_2, \ldots, i_k)}(B_1 \times B_2 \times \cdots \times B_k) \equiv \nu_{(j_1, j_2, \ldots, j_k)}(B_{r_1} \times B_{r_2} \times \cdots \times B_{r_k}) \]
where $B_i \in \mathcal{B}(\mathbb{R})$ for all $i$, $1 \le i \le k$. It can be verified that this family of finite dimensional distributions
\[ Q_{\mathbb{N}} \equiv \{\nu_{(i_1, i_2, \ldots, i_k)}(\cdot) : \{i_1, i_2, \ldots, i_k\} \subset \mathbb{N}, 1 \le k < \infty\} \tag{3.12} \]
satisfies the consistency conditions (3.1) and (3.2) of Theorem 6.3.1 and hence the assertion follows. □
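Condition (ii) of Theorem 6.3.4 says that dropping the last coordinate of $\mu_{n+1}$ must return $\mu_n$. For measures with finitely many atoms this can be tested directly on the probability mass functions. The sketch below does so for a product of Bernoulli distributions and for a deliberately inconsistent pair; the success probabilities $1/(j+1)$ and all helper names are assumptions made for the illustration.

```python
from fractions import Fraction
from itertools import product


def product_measure(n, success):
    """pmf of mu_n on {0,1}^n: independent coordinates, j-th one Bernoulli(success(j))."""
    pmf = {}
    for x in product((0, 1), repeat=n):
        mass = Fraction(1)
        for j, xj in enumerate(x, start=1):
            pj = success(j)
            mass *= pj if xj == 1 else 1 - pj
        pmf[x] = mass
    return pmf


def drop_last(pmf):
    """B -> mu_{n+1}(B x R), computed on atoms by summing out the last coordinate."""
    out = {}
    for x, mass in pmf.items():
        out[x[:-1]] = out.get(x[:-1], Fraction(0)) + mass
    return out


def is_consistent(mu_n, mu_next):
    lhs = drop_last(mu_next)
    return all(lhs.get(x, 0) == mu_n.get(x, 0) for x in set(lhs) | set(mu_n))


def success(j):
    return Fraction(1, j + 1)                  # hypothetical success probabilities


print(all(is_consistent(product_measure(n, success), product_measure(n + 1, success))
          for n in range(1, 5)))

# A pair that violates (ii): mu_2 is uniform but mu_1 is not its first marginal.
mu_1 = {(0,): Fraction(1, 3), (1,): Fraction(2, 3)}
mu_2 = {x: Fraction(1, 4) for x in product((0, 1), repeat=2)}
print(is_consistent(mu_1, mu_2))
```

Checking atoms suffices here only because the measures are finitely supported; for general $\mu_n$ the condition has to hold for all Borel sets $B$.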


Example 6.3.6: (Sequence of independent random variables). Let $\{F_n\}_{n\ge 1}$ be a sequence of cdfs on $\mathbb{R}$. Consider the problem of constructing a sequence $\{X_n\}_{n\ge 1}$ of random variables on a probability space $(\Omega, \mathcal{F}, P)$ such that (i) for each $n \in \mathbb{N}$, $X_n$ has cdf $F_n$ and (ii) for any $n \in \mathbb{N}$ and any $\{i_1, i_2, \ldots, i_n\} \subset \mathbb{N}$, the random variables $\{X_{i_1}, X_{i_2}, \ldots, X_{i_n}\}$ are independent, i.e.,

$$P(X_{i_1} \le x_1, X_{i_2} \le x_2, \ldots, X_{i_n} \le x_n) = \prod_{j=1}^{n} F_{i_j}(x_j) \tag{3.13}$$

for all $x_1, x_2, \ldots, x_n$ in $\mathbb{R}$. This problem can be solved by using Theorem 6.3.4. Let $\mu_n$ be the Lebesgue-Stieltjes probability measure on $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$ corresponding to the distribution function

$$F_{1,2,\ldots,n}(x_1, x_2, \ldots, x_n) \equiv \prod_{j=1}^{n} F_j(x_j), \qquad x_1, \ldots, x_n \in \mathbb{R}.$$

It is easy to verify that the family $\{\mu_n : n \ge 1\}$ satisfies (i) and (ii) of Theorem 6.3.4. Hence, there exist a probability measure $P$ on the sequence space $\Omega \equiv \mathbb{R}^\infty$ equipped with the σ-algebra $\mathcal{F} \equiv \mathcal{B}(\mathbb{R}^\infty)$ and random variables $X_n(\omega) \equiv \pi_n(\omega) \equiv \omega(n)$, for $\omega = (\omega(1), \omega(2), \ldots)$ in $\mathbb{R}^\infty$, $n \ge 1$, such that (3.13) holds.

Example 6.3.7: (Family of independent random variables). Given a family $\{F_\alpha : \alpha \in A\}$ of cdfs on $\mathbb{R}$ for some index set $A$, a construction similar to Example 6.3.6, but using Theorem 6.3.1, yields the existence of a real valued stochastic process $\{X_\alpha : \alpha \in A\}$ such that for any $\{\alpha_1, \alpha_2, \ldots, \alpha_n\} \subset A$, $1 \le n < \infty$, the random variables $\{X_{\alpha_1}, X_{\alpha_2}, \ldots, X_{\alpha_n}\}$ are independent, i.e., (3.13) holds.

Example 6.3.8: (Markov chains). Let $Q = ((q_{ij}))$ be a $k \times k$ stochastic matrix for some $1 < k < \infty$. That is, (a) for all $1 \le i, j \le k$, $q_{ij} \ge 0$ and (b) for each $1 \le i \le k$, $\sum_{j=1}^{k} q_{ij} = 1$. Let $p = (p_1, p_2, \ldots, p_k)$ be a probability vector, i.e., for all $i$, $p_i \ge 0$, and $\sum_{i=1}^{k} p_i = 1$. Consider the problem of constructing a sequence $\{X_n\}_{n\ge 1}$ of random variables such that for each $n \in \mathbb{N}$,

$$P(X_1 = j_1, X_2 = j_2, \ldots, X_n = j_n) = p_{j_1} q_{j_1 j_2} \cdots q_{j_{n-1} j_n} \tag{3.14}$$

for $1 \le j_i \le k$, $i = 1, 2, \ldots, n$.
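The consistency requirement (ii) of Theorem 6.3.4 for the measures determined by the right side of (3.14) can be checked numerically in a small case. The following is a minimal sketch, not part of the original text: the two-state matrix `Q`, the vector `p`, and the helper names `mu`, `is_consistent`, and `total_mass` are hypothetical choices made only for illustration. Since the measures are discrete, it is enough to check the condition on single paths.

```python
from fractions import Fraction
from itertools import product

# Hypothetical 2-state example (values chosen only for illustration).
p = [Fraction(1, 3), Fraction(2, 3)]              # initial probability vector
Q = [[Fraction(1, 2), Fraction(1, 2)],
     [Fraction(1, 4), Fraction(3, 4)]]            # stochastic matrix

def mu(path):
    """Right side of (3.14) for a path (j_1, ..., j_n), states coded 0..k-1."""
    prob = p[path[0]]
    for a, b in zip(path, path[1:]):
        prob *= Q[a][b]
    return prob

def is_consistent(n):
    """Check mu_{n+1}({path} x S) == mu_n({path}) for every path of length n."""
    states = range(len(p))
    return all(
        sum(mu(path + (j,)) for j in states) == mu(path)
        for path in product(states, repeat=n)
    )

def total_mass(n):
    """Total mass of mu_n, which should equal 1."""
    return sum(mu(path) for path in product(range(len(p)), repeat=n))
```

Exact rational arithmetic is used so that the equalities hold without rounding error. Summing over the last coordinate collapses because each row of `Q` sums to one, which is the reason the condition holds for any stochastic matrix.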


Let $\mu_n$ be the discrete probability distribution determined by the right side of (3.14), that is,

$$\mu_n(\{(j_1, j_2, \ldots, j_n)\}) = p_{j_1} q_{j_1 j_2} \cdots q_{j_{n-1} j_n}$$

for all $(j_1, \ldots, j_n)$ such that $1 \le j_i \le k$ for all $1 \le i \le n$. It is easy to verify that $\{\mu_n\}_{n\ge 1}$ satisfies the conditions of Theorem 6.3.4 and hence there exists a sequence $\{X_n\}_{n\ge 1}$ of random variables satisfying (3.14). It may be verified that (3.14) is equivalent to

$$P(X_{n+1} = j_{n+1} \mid X_1 = j_1, \ldots, X_n = j_n) = q_{j_n j_{n+1}} = P(X_{n+1} = j_{n+1} \mid X_n = j_n) \tag{3.15}$$

for all $n \ge 1$, $1 \le j_i \le k$, $i = 1, 2, \ldots, n+1$, provided $P(X_1 = j_1, \ldots, X_n = j_n) > 0$, together with $P(X_1 = j) = p_j$ for $1 \le j \le k$. This says that the conditional distribution of $X_{n+1}$ given $X_1, X_2, \ldots, X_n$ depends only on $X_n$. This property is known as the Markov property, and the sequence $\{X_n\}_{n\ge 1}$ is called a Markov chain with state space $S \equiv \{1, 2, \ldots, k\}$ and time homogeneous transition probability matrix $((q_{ij}))$. When the state space $S = \{1, 2, \ldots\}$, the above construction goes over with minor notational modifications.

Next consider the case $S = \mathbb{R}$. A function $Q : \mathbb{R} \times \mathcal{B}(\mathbb{R}) \to [0, 1]$ is called a probability transition function if (i) for each $x$ in $\mathbb{R}$, $Q(x, \cdot)$ is a probability measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ and (ii) for each $B$ in $\mathcal{B}(\mathbb{R})$, $Q(\cdot, B)$ is a Borel measurable function on $\mathbb{R}$. Let $\mu$ be a probability distribution on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Using Theorem 6.3.4, it can be shown that there exists a stochastic process $\{X_n\}_{n\ge 1}$ such that

$$P(X_1 \in B_1, X_2 \in B_2, \ldots, X_n \in B_n) = \int_{B_1} \int_{B_2} \cdots \int_{B_{n-1}} Q(x_{n-1}, B_n)\, Q(x_{n-2}, dx_{n-1}) \cdots Q(x_1, dx_2)\, \mu(dx_1), \tag{3.16}$$

where the right side of (3.16) is a well-defined probability measure on $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$ (Problem 6.18). Such a sequence $\{X_n\}_{n\ge 1}$ is called a Markov chain with state space $\mathbb{R}$, initial distribution $\mu$, and transition probability function $Q$. For more on Markov chains, see Chapter 14.

Example 6.3.9: (Gaussian processes). Let $A$ be a nonempty set and $\{X_\alpha : \alpha \in A\}$ be a stochastic process. Such a process is called Gaussian if for any $\{\alpha_1, \alpha_2, \ldots, \alpha_k\} \subset A$ and real numbers $t_1, t_2, \ldots, t_k$, the random variable $\sum_{i=1}^{k} t_i X_{\alpha_i}$ has a univariate normal distribution (with possibly zero variance). For such a process, the functions $\mu(\alpha) \equiv EX_\alpha$ and $\sigma(\alpha, \beta) \equiv \mathrm{Cov}(X_\alpha, X_\beta)$ are called the mean and covariance functions, respectively. Since $\mathrm{Var}\big(\sum_{i=1}^{k} t_i X_{\alpha_i}\big) \ge 0$, it follows that for any $t_1, t_2, \ldots, t_k$,

$$\sum_{i=1}^{k} \sum_{j=1}^{k} t_i t_j\, \sigma(\alpha_i, \alpha_j) \ge 0. \tag{3.17}$$

This property of the covariance function $\sigma(\cdot, \cdot)$ is called nonnegative definiteness. A natural question is: Given functions $\mu : A \to \mathbb{R}$ and $\sigma : A \times A \to \mathbb{R}$ such that $\sigma$ is symmetric and satisfies (3.17), does there exist a Gaussian process $\{X_\alpha : \alpha \in A\}$ with $\mu(\cdot)$ and $\sigma(\cdot, \cdot)$ as its mean and covariance functions, respectively? The answer is yes and it follows from Theorem 6.3.1 by defining the family $Q_A$ of finite dimensional distributions as follows. Let $\nu_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}$ be the unique probability distribution on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$ with the moment generating function

$$M_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}(s_1, s_2, \ldots, s_k) = \exp\Big( \sum_{i=1}^{k} s_i\, \mu(\alpha_i) + \frac{1}{2} \sum_{i=1}^{k} \sum_{j=1}^{k} s_i s_j\, \sigma(\alpha_i, \alpha_j) \Big) \tag{3.18}$$

for $s_1, s_2, \ldots, s_k$ in $\mathbb{R}$. If the matrix $\Sigma \equiv (\sigma(\alpha_i, \alpha_j))$, $1 \le i, j \le k$, is positive definite, i.e., it is such that in (3.17) equality holds iff $t_i = 0$ for all $i$, then $\nu_{(\alpha_1, \ldots, \alpha_k)}(\cdot)$ can be shown to be a probability measure that is absolutely continuous w.r.t. $m_k$, the Lebesgue measure on $\mathbb{R}^k$, with density

$$\frac{1}{(2\pi)^{k/2}\, |\Sigma|^{1/2}} \exp\Big( -\frac{1}{2} \sum_{i=1}^{k} \sum_{j=1}^{k} \big(x_i - \mu(\alpha_i)\big)\, \tilde{\sigma}_{ij}\, \big(x_j - \mu(\alpha_j)\big) \Big)$$

where $\tilde{\Sigma} \equiv (\tilde{\sigma}_{ij}) = \Sigma^{-1}$, the inverse of $\Sigma$, and $|\Sigma|$ = the determinant of $\Sigma$. The verification of conditions (3.1) and (3.2) for this family is left as an exercise (Problem 6.12).

Remark 6.3.4: Kolmogorov's consistency theorem (Theorem 6.3.1) remains valid when the real line $\mathbb{R}$ is replaced by a complete separable metric space $S$. More specifically, let $A$ be a nonempty set and for $\{\alpha_1, \alpha_2, \ldots, \alpha_k\} \subset A$, $1 \le k < \infty$, let $\nu_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}(\cdot)$ be a probability measure on $(S^k, \mathcal{B}(S^k))$. If the family $Q_A \equiv \{\nu_{(\alpha_1, \alpha_2, \ldots, \alpha_k)} : \{\alpha_1, \alpha_2, \ldots, \alpha_k\} \subset A, 1 \le k < \infty\}$ satisfies the natural analogs of (3.1) and (3.2), then there exists a probability measure $P$ on $(\Omega \equiv S^A, \mathcal{F} \equiv (\mathcal{B}(S))^A)$ and an $S$-valued stochastic process $\{X_\alpha : \alpha \in A\}$ on $(\Omega, \mathcal{F}, P)$ such that $\nu_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}(\cdot) = P(X_{\alpha_1}, X_{\alpha_2}, \ldots, X_{\alpha_k})^{-1}(\cdot)$. Here $S^A$ is the set of all $S$-valued functions on $A$, $(\mathcal{B}(S))^A$ is the σ-algebra generated by the cylinder sets of the form $C = \{f : f : A \to S, f(\alpha_i) \in B_i, i = 1, 2, \ldots, k\}$ where $\{\alpha_1, \alpha_2, \ldots, \alpha_k\} \subset A$, $B_i \in \mathcal{B}(S)$, $1 \le i \le k$, $1 \le k < \infty$, and $X_\alpha(\omega)$ is the projection map $X_\alpha(\omega) \equiv \omega(\alpha)$. The main step in the


proof of Theorem 6.3.1 was to establish the countable additivity of the set function $P$ on the algebra $\mathcal{C}$ of finite dimensional cylinder sets. This in turn depended upon the fact that any probability measure $\mu$ on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$ for $1 \le k < \infty$ is regular, i.e., for every Borel set $B$ in $\mathcal{B}(\mathbb{R}^k)$ and for every $\epsilon > 0$, there exists a compact set $G \subset B$ such that $\mu(B \setminus G) < \epsilon$. If $S$ is a Polish space (i.e., a complete separable metric space), then any probability measure on $(S^k, (\mathcal{B}(S))^k)$, $1 \le k < \infty$, is regular (see Billingsley (1968)), and hence, the main steps in the proof of Theorem 6.3.1 go through in this case.

Remark 6.3.5: (Limitations of Theorem 6.3.1). In this construction, $\Omega = \mathbb{R}^A$ is rather large and the σ-algebra $\mathcal{F} \equiv (\mathcal{B}(\mathbb{R}))^A$ is not large enough to include many events of interest when the index set $A$ is uncountable. In fact, it can be shown that $\mathcal{F}$ coincides with the class of all sets $G \subset \Omega$ that depend only on a countable number of coordinates of $\omega$. More precisely, the following holds.

Proposition 6.3.5: The σ-algebra

$$\mathcal{F} = \{G : G = \pi_{A_1}^{-1}(B) \text{ for some } B \text{ in } \mathcal{B}(\mathbb{R}^\infty) \text{ and } A_1 \subset A,\ A_1 \text{ countable}\}. \tag{3.19}$$

Proof: Verify that the right side of (3.19) is a σ-algebra containing the class $\mathcal{C}$ of cylinder sets and also that it is contained in $\mathcal{F}$. □

For example, if $A = [0, 1]$, then the set $C[0, 1]$ of all continuous functions from $[0, 1] \to \mathbb{R}$ is not a member of $\mathcal{F} \equiv (\mathcal{B}(\mathbb{R}))^A$. Similarly, if $M(\omega) \equiv \sup\{|\omega(\alpha)| : \alpha \in [0, 1]\}$, then the set $\{M(\omega) \le 1\}$ is not in $\mathcal{F} = (\mathcal{B}(\mathbb{R}))^{[0,1]}$. When $A$ is an interval in $\mathbb{R}$, this difficulty can be overcome in several ways. One approach pioneered by J.L. Doob is the notion of separable stochastic processes (Doob (1953)). Another approach pioneered by Kolmogorov and Skorohod is to restrict $\Omega$ to the class of all continuous functions or functions that are right continuous and have left limits (Billingsley (1968)). For more on stochastic processes, see Chapter 15.

Independent Random Experiments

If $E_1$ and $E_2$ are two random experiments with associated probability spaces $(\Omega_1, \mathcal{F}_1, P_1)$ and $(\Omega_2, \mathcal{F}_2, P_2)$, it is possible to model the experiment of performing both $E_1$ and $E_2$ independently by the product probability space $(\Omega_1 \times \Omega_2, \mathcal{F}_1 \times \mathcal{F}_2, P_1 \times P_2)$ (see Chapter 5). The same idea carries over to an arbitrary collection $\{E_\alpha : \alpha \in A\}$ of random experiments. It is possible to think of a grand experiment $E$ in which all the $E_\alpha$'s are independent components by considering the product probability space

$$(\times_{\alpha \in A}\, \Omega_\alpha,\ \times_{\alpha \in A}\, \mathcal{F}_\alpha,\ \times_{\alpha \in A}\, P_\alpha) \equiv (\Omega, \mathcal{F}, P) \tag{3.20}$$

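where $(\Omega_\alpha, \mathcal{F}_\alpha, P_\alpha)$ is the probability space corresponding to $E_\alpha$. Here $\Omega \equiv \times_{\alpha \in A}\, \Omega_\alpha$ is the collection of all functions $\omega$ on $A$ such that $\omega(\alpha) \in \Omega_\alpha$, $\mathcal{F} \equiv \times_{\alpha \in A}\, \mathcal{F}_\alpha$ is the σ-algebra generated by finite dimensional cylinder sets of the form

$$C = \{\omega : \omega(\alpha_i) \in B_{\alpha_i},\ i = 1, 2, \ldots, k\}, \tag{3.21}$$

$1 \le k < \infty$, $\{\alpha_1, \alpha_2, \ldots, \alpha_k\} \subset A$, $B_{\alpha_i} \in \mathcal{F}_{\alpha_i}$, and $P \equiv \times_{\alpha \in A}\, P_\alpha$ is the probability measure on $\mathcal{F}$ such that for $C$ of (3.21),

$$P(C) = \prod_{i=1}^{k} P_{\alpha_i}(B_{\alpha_i}). \tag{3.22}$$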

The proof of the existence of such a P on F is an application of the extension theorem (Theorem 1.3.3). The veriﬁcation of countable additivity on the class C of cylinder sets is not diﬃcult. See Kolmogorov (1956).
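For a finite collection of finite experiments, the product formula (3.22) can be checked by direct enumeration. The sketch below is only an illustration under stated assumptions: the component experiments (a fair coin, a fair die, and a second fair coin) and the helper names `product_prob`, `cylinder_prob`, and `marginal_prob` are hypothetical and do not come from the text.

```python
from fractions import Fraction
from itertools import product

# Hypothetical component experiments: a fair coin, a fair die, a fair coin.
coin = {"H": Fraction(1, 2), "T": Fraction(1, 2)}
die = {face: Fraction(1, 6) for face in range(1, 7)}
components = [coin, die, coin]          # E_1, E_2, E_3

def product_prob(outcome):
    """Probability of a single outcome omega = (omega_1, ..., omega_k)."""
    prob = Fraction(1)
    for space, w in zip(components, outcome):
        prob *= space[w]
    return prob

def cylinder_prob(constraints):
    """P(C) for C = {omega : omega_i in B_i for each i in constraints},
    computed by brute-force summation over the product space."""
    total = Fraction(0)
    for outcome in product(*components):
        if all(outcome[i] in B for i, B in constraints.items()):
            total += product_prob(outcome)
    return total

def marginal_prob(i, B):
    """P_i(B) in the i-th component space."""
    return sum(components[i][w] for w in B)
```

For instance, `cylinder_prob({0: {"H"}, 1: {2, 4, 6}})` restricts only the first two coordinates, and the unrestricted third coordinate sums out to one, as in (3.22).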

6.4 Problems

6.1 Let $\mu_1 = \mu_2$ be the probability distribution on $\Omega = \{1, 2\}$ with $\mu_1(\{1\}) = 1/2$. Find two distinct probability distributions on $\Omega \times \Omega$ with $\mu_1$ and $\mu_2$ as the set of marginals.

6.2 Let $\Omega = (0, 1)$, $\mathcal{F} = \mathcal{B}((0, 1))$ and $P$ be the Lebesgue measure on $(0, 1)$. Let $X(\omega) = -\log \omega$, $h(x) = x^2$ and $Y = h(X)$. Find $P_X$ and $P_Y$ and evaluate $EY$ by applying the change of variables formula (Proposition 6.2.1).

6.3 In the change of variables formula, one of the three integrals is usually easier to evaluate than the other two. In this problem, in part (a), the first integral is easier to evaluate than the other two while in part (b), the second one is easier.

(a) Let $Z \sim N(0, 1)$, $X = Z^2$, and $Y = e^{-X}$.
(i) Find the distributions $P_X$ and $P_Y$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$.
(ii) Compute the integrals $\int_{\mathbb{R}} e^{-z^2} \phi(z)\,dz$, $\int_{\mathbb{R}} e^{-x} P_X(dx)$ and $\int_{\mathbb{R}} y\, P_Y(dy)$, where $\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$, $-\infty < z < \infty$. Verify that all three integrals agree.

(b) Let $X_1, X_2, \ldots, X_k$ be iid $N(0, 1)$ random variables. Let $Y = (X_1 + \cdots + X_k)$ and $Z = Y^2$.
(i) Find the distributions of $Y$ and $Z$.
(ii) Evaluate $\int_{\mathbb{R}^k} (x_1 + \cdots + x_k)^2\, dP_{X_1, \ldots, X_k}(x_1, \ldots, x_k)$, $\int_{\mathbb{R}} y^2\, P_Y(dy)$, and $\int_{\mathbb{R}_+} z\, P_Z(dz)$.


(c) Let $X_1, X_2, \ldots, X_k$ be independent Binomial $(n_i, p)$, $i = 1, 2, \ldots, k$, random variables. Let $Y = (X_1 + \cdots + X_k)$.
(i) Find the distribution $P_Y$ of $Y$.
(ii) Evaluate $\int_{\mathbb{R}^k} (x_1 + \cdots + x_k)\, dP_{X_1, \ldots, X_k}(x_1, \ldots, x_k)$ and $\int_{\mathbb{R}} y\, P_Y(dy)$.

6.4 Let $X$ be a random variable such that $M_X(t) \equiv E(e^{tX}) < \infty$ for $|t| < \epsilon$ for some $\epsilon > 0$.
(a) Show that $E(e^{tX} |X|^r) < \infty$ for all $r > 0$ and $|t| < \epsilon$.
(b) Show that $M_X^{(r)}(t)$, the $r$th derivative of $M_X(t)$ for $r \in \mathbb{N}$, satisfies $M_X^{(r)}(t) = E(e^{tX} X^r)$ for $|t| < \epsilon$.
(c) Verify (2.25).
(Hint: (a) First show that for $t_1 \in (-\epsilon, \epsilon)$, there exists a $t_2 \in (-\epsilon, \epsilon)$ such that $|t_1| < |t_2| < \epsilon$ and for some $C < \infty$, $e^{t_1 x} |x|^r \le C e^{|t_2 x|}$ for all $x$ in $\mathbb{R}$. (b) Verify that for all $x \in \mathbb{R}$, $|e^x - 1| \le |x| e^{|x|}$. Now use (a) and the DCT to show that $M_X(t)$ is differentiable and $M_X^{(1)}(t) = E(e^{tX} X)$ for all $|t| < \epsilon$. Now complete the proof by induction.)

6.5 Let $X$ be a random variable.
(a) Show that $\phi(r) \equiv (E|X|^r)^{1/r}$ is nondecreasing on $(0, \infty)$.
(b) Show that $\phi(r) \equiv \log E|X|^r$ is convex in $(0, r_0)$ if $E|X|^{r_0} < \infty$.
(c) Let $M = \sup\{x : P(|X| > x) > 0\}$. Show that
(i) $\lim_{r \uparrow \infty} \phi(r) = M$.
(ii) $\lim_{n \to \infty} \dfrac{E|X|^{n+1}}{E|X|^n} = M$.
(Hint: For $M < \infty$, note that $E|X|^r \ge (M - \epsilon)^r P(|X| > M - \epsilon)$ for any $\epsilon > 0$.)

6.6 Show that if equality holds in (2.20), then there exist constants $a$ and $b$ such that $P(Y = aX + b) = 1$. (Hint: Show that there exists a constant $a$ such that $\mathrm{Var}(Y - aX) = 0$.)

6.7 Determine $C$ and its base $B$ explicitly in Examples 6.3.3 and 6.3.4.

6.8 (a) Show that $D$ in Example 6.3.5 is not a finite dimensional cylinder set. (Hint: Note that $\lim_{n \to \infty} \frac{1}{n} \sum_{j=1}^{n} x_j$ is not determined by the values of finitely many $x_i$'s.)


(b) Find three other such examples of sets $D$ in $\mathbb{R}^\infty$ that are not finite dimensional cylinder sets.

6.9 Establish the assertion in Remark 6.3.3 by completing the following steps:
(a) Show that the coordinate map $f_n(x) \equiv x_n$ from $\mathbb{R}^\infty$ to $\mathbb{R}$ is continuous under the metric $d$ of (3.3). (Conclude, using Example 1.1.6, that $\mathcal{R}^{N} \subset \mathcal{B}(\mathbb{R}^\infty)$.)
(b) Let $\mathcal{C}_1 \equiv \{A : A = (a_1, b_1) \times \cdots \times (a_k, b_k) \times \mathbb{R}^\infty, -\infty \le a_i < b_i \le \infty, 1 \le i \le k, \text{ for some } k < \infty\}$ and $\mathcal{C}_2 \equiv \{A : A \text{ is an open ball in } (\mathbb{R}^\infty, d)\}$. Show that $\sigma\langle \mathcal{C}_2 \rangle \subset \sigma\langle \mathcal{C}_1 \rangle$.
(c) Show that $\sigma\langle \mathcal{C}_2 \rangle = \mathcal{B}(\mathbb{R}^\infty)$ by showing that every open set in $(\mathbb{R}^\infty, d)$ is a countable union of open balls.

6.10 Show that the family $Q_{\mathbb{N}}$ defined in (3.12) satisfies the consistency conditions (3.1) and (3.2) of Theorem 6.3.1.

6.11 Verify that the family of finite dimensional distributions defined by the right side of (3.14) satisfies the conditions of Theorem 6.3.4.

6.12 Verify that the family of distributions defined in (3.18) satisfies conditions (3.1) and (3.2) of Theorem 6.3.1. (Hint: Use the fact that for any $k \ge 1$, any $\mu = (\mu_1, \mu_2, \ldots, \mu_k) \in \mathbb{R}^k$, and any nonnegative definite $k \times k$ matrix $\Sigma \equiv ((\sigma_{ij}))_{k \times k}$, there is a unique probability distribution $\nu$ such that for any $s = (s_1, s_2, \ldots, s_k)$ in $\mathbb{R}^k$,

$$\int_{\mathbb{R}^k} \exp\Big( \sum_{i=1}^{k} s_i x_i \Big)\, \nu(dx) = \exp\Big( \sum_{i=1}^{k} s_i \mu_i + \frac{1}{2} \sum_{i=1}^{k} \sum_{j=1}^{k} s_i s_j \sigma_{ij} \Big).$$

Observe that this implies that for $s = (s_1, s_2, \ldots, s_k)$ in $\mathbb{R}^k$, the induced distribution (under $\nu$) on $\mathbb{R}$ by the map $g(x) = \sum_{i=1}^{k} s_i x_i$ from $\mathbb{R}^k \to \mathbb{R}$ is univariate normal with mean $\sum_{i=1}^{k} s_i \mu_i$ and variance $\sum_{i=1}^{k} \sum_{j=1}^{k} s_i s_j \sigma_{ij}$.)

6.13 Show that the set $D \equiv C[0, 1]$ of continuous functions from $[0, 1]$ to $\mathbb{R}$ is not a member of the σ-algebra $\mathcal{F} \equiv (\mathcal{B}(\mathbb{R}))^{[0,1]}$. (Hint: If $D \in \mathcal{F}$, then by Proposition 6.3.5, $D$ is of the form $\pi_{A_1}^{-1}(B)$ for some $B$ in $\mathcal{B}(\mathbb{R}^\infty)$, where $A_1 \subset [0, 1]$ is countable. Show that for any such $A_1$ and $B$, there exist functions $f : [0, 1] \to \mathbb{R}$ such that $f \in \pi_{A_1}^{-1}(B)$ but $f$ is not continuous on $[0, 1]$.)

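6.14 Show that $K \equiv \big\{\omega : \omega \in \mathbb{R}^{[0,1]},\ \sup_{0 \le \alpha \le 1} |\omega(\alpha)| < 1\big\}$ is not in $\mathcal{F} \equiv (\mathcal{B}(\mathbb{R}))^{[0,1]}$. (Hint: Observe that $\sup_{0 \le \alpha \le 1} |\omega(\alpha)|$ is not determined by the values of $\omega(\alpha)$ for countably many $\alpha$'s.)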

6.15 Let $\{\mu_i\}_{i \ge 1}$ be a sequence of probability distributions on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ and let $\mu$ be a probability distribution on $\mathbb{N}$ with $p_i \equiv \mu(\{i\})$, $i \ge 1$.
(a) Verify that $\nu(\cdot) \equiv \sum_{i \ge 1} p_i \mu_i(\cdot)$ is a probability distribution on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$.
(b) (i) Show that there exists a probability space $(\Omega, \mathcal{F}, P)$ and a collection of independent random variables $\{J, X_1, X_2, \ldots\}$ on $(\Omega, \mathcal{F}, P)$ such that for each $i \ge 1$, $X_i$ has distribution $\mu_i$ and $J \sim \mu$.
(ii) Let $Y = X_J$, i.e., $Y(\omega) \equiv X_{J(\omega)}(\omega)$. Show that $Y$ is a random variable on $(\Omega, \mathcal{F}, P)$ and $Y \sim \nu$.

6.16 Let $F$ be a cdf on $\mathbb{R}$ and let $F$ be decomposed as $F = \alpha F_d + \beta F_{ac} + \gamma F_{sc}$ where $\alpha, \beta, \gamma \in [0, 1]$ and $\alpha + \beta + \gamma = 1$ and $F_d$, $F_{ac}$, $F_{sc}$ are discrete, absolutely continuous, and singular continuous cdfs on $\mathbb{R}$ (cf. (4.5.3)). Show that there exist independent random variables $X_1, X_2, X_3$ and $J$ on some probability space such that $X_1 \sim F_d$, $X_2 \sim F_{ac}$, $X_3 \sim F_{sc}$, $P(J = 1) = \alpha$, $P(J = 2) = \beta$, $P(J = 3) = \gamma$ and $X_J \sim F$, where $\sim$ means "has cdf".

6.17 Let $\mu$ be a probability measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. For each $x$ in $\mathbb{R}$, let $F(x, \cdot)$ be a cdf on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Let $\psi(x, t) \equiv \inf\{y : F(x, y) \ge t\}$, for $x$ in $\mathbb{R}$, $0 < t < 1$. Assume that $\psi(\cdot, \cdot) : \mathbb{R} \times (0, 1) \to \mathbb{R}$ is measurable. Let $X$ and $U$ be independent random variables on some probability space $(\Omega, \mathcal{F}, P)$ such that $X \sim \mu$ and $U \sim$ uniform $(0, 1)$.
(a) Show that $Y = \psi(X, U)$ is a random variable.
(b) Show that $P(Y \le y) = \int_{\mathbb{R}} F(x, y)\, \mu(dx)$.
(The distribution of $Y$ is called a mixture of distributions with $\mu$ as the mixing distribution. This is of relevance in Bayesian statistical inference.)

6.18 (a) Let $(S_i, \mathcal{S}_i)$, $i = 1, 2$, be two measurable spaces. Let $\mu$ be a probability measure on $(S_1, \mathcal{S}_1)$ and let $Q : S_1 \times \mathcal{S}_2 \to [0, 1]$ be such that for each $x$ in $S_1$, $Q(x, \cdot)$ is a probability measure on $(S_2, \mathcal{S}_2)$ and for each $B$ in $\mathcal{S}_2$, $Q(\cdot, B)$ is $\mathcal{S}_1$-measurable. Define

$$\nu(B_1 \times B_2) \equiv \int_{B_1} Q(x, B_2)\, \mu(dx)$$


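on $\mathcal{C} \equiv \{B_1 \times B_2 : B_i \in \mathcal{S}_i, i = 1, 2\}$. Show that $\nu$ can be extended to be a probability measure on $\sigma\langle \mathcal{C} \rangle \equiv \mathcal{S}_1 \times \mathcal{S}_2$.
(b) Let $\mu$ and $Q$ be as in Example 6.3.8 (cf. (3.16)). For each $n \ge 1$ let $\nu_n$ be a set function defined by the recursive scheme

$$\nu_1(\cdot) = \mu(\cdot), \qquad \nu_{n+1}(A \times B) = \int_A Q(x, B)\, \nu_n(dx), \quad A \in \mathcal{B}(\mathbb{R}^n),\ B \in \mathcal{B}(\mathbb{R}).$$

Show that for each $n$, $\nu_n$ can be extended to be a probability measure on $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$. (Thus the right side of (3.16) is defined to be $\nu_n(B_1 \times B_2 \times \cdots \times B_n)$.)

6.19 (Bayesian paradigm). Consider the setup in Problem 6.18 (a). Let $\lambda(B) \equiv \nu(S_1 \times B) = \int_{S_1} Q(x, B)\, \mu(dx)$ for all $B$ in $\mathcal{S}_2$.
(a) Verify that $\lambda$ is a probability measure on $(S_2, \mathcal{S}_2)$.
(b) Now fix $B_1$ in $\mathcal{S}_1$. Show that there exists a function $\tilde{Q}(x, B_1)$, $S_2 \to [0, 1]$, that is $\langle \mathcal{S}_2, \mathcal{B}(\mathbb{R}) \rangle$-measurable such that

$$\nu(B_1 \times B_2) = \int_{B_2} \tilde{Q}(x, B_1)\, \lambda(dx).$$

(Hint: Apply the Radon-Nikodym theorem to the pair $\nu(B_1 \times \cdot)$ and $\lambda(\cdot)$.)
(c) Let $\Omega = S_1 \times S_2$, $\mathcal{F} = \sigma\langle \mathcal{C} \rangle$. For $\omega = (s_1, s_2)$, let $\theta(\omega) = s_1$ and $X(\omega) = s_2$. Think of $\theta$ as the parameter, $X$ as the data, $Q(\theta, \cdot)$ as the distribution of $X$ given $\theta$, $\mu(\cdot)$ as the prior distribution of $\theta$ and $\tilde{Q}(x, B_1)$ as the posterior probability that $\theta$ is in $B_1$ given the data $X = x$. Compute $\tilde{Q}(x, B_1)$ when $(S_i, \mathcal{S}_i) = (\mathbb{R}, \mathcal{B}(\mathbb{R}))$, $i = 1, 2$, $\mu(\cdot) \sim N(0, 1)$, $Q(\theta, \cdot) \sim N(\theta, 1)$.

6.20 Let $X$ be a random variable on some probability space $(\Omega, \mathcal{F}, P)$. Recall that a random variable $X$ is
(a) discrete if there is a finite or countable set $D \equiv \{a_j : 1 \le j \le k \le \infty\}$ such that $P(X \in D) = 1$,
(b) continuous if for every $x \in \mathbb{R}$, $P(X = x) = 0$ or equivalently the cdf $F_X(\cdot)$ is continuous on all of $\mathbb{R}$,
(c) absolutely continuous if there exists a nonnegative Borel measurable function $f_X(\cdot)$ on $\mathbb{R}$ such that for any $-\infty < a < b < \infty$, $P(a < X \le b) = \int_{(a,b]} f_X(\cdot)\, dm$, or equivalently the induced measure $P X^{-1}$ is absolutely continuous w.r.t. $m$, i.e., $P X^{-1} \ll m$,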


(d) singular if $P X^{-1} \perp m$ or equivalently $F_X'(\cdot) = 0$ a.e. $m$,
(e) singular continuous if it is singular and continuous.
Let $g : \mathbb{R} \to \mathbb{R}$ be Borel measurable and $Y = g(X)$.
(a) Show that if $X$ is discrete then so is $Y$ but not conversely.
(b) Show that if $X$ is continuous and $g$ is (1–1) on the range of $X$, then $Y$ is continuous.
(c) Show that if $X$ is absolutely continuous with pdf $f_X(\cdot)$ and $g$ is absolutely continuous on bounded intervals such that $g'(\cdot) > 0$ a.e. $(m)$, then $Y$ is also absolutely continuous with pdf

$$f_Y(y) = \frac{f_X\big(g^{-1}(y)\big)}{g'\big(g^{-1}(y)\big)}.$$

(d) Let $X$ be as in (c) above. Suppose $g$ is absolutely continuous on bounded intervals and there exist disjoint intervals $\{I_j\}_{1 \le j \le k}$, $1 \le k \le \infty$, such that $\bigcup_{1 \le j \le k} I_j = \mathbb{R}$ and for each $j$, either $g'(\cdot) > 0$ a.e. $(m)$ on $I_j$ or $g'(\cdot) < 0$ a.e. $(m)$ on $I_j$. Show that $Y$ is also absolutely continuous with pdf

$$f_Y(y) = \sum_{x_j \in D(y)} \frac{f_X(x_j)}{|g'(x_j)|}$$

where $D(y) \equiv \{x_j : x_j \in I_j, g(x_j) = y\}$.
(e) Use (c) to compute the pdf of $Y$ when
(i) $X \sim N(0, 1)$, $g(x) = e^x$.
(ii) $X \sim N(0, 1)$, $g(x) = x^2$.
(iii) $X \sim N(0, 1)$, $g(x) = \sin 2\pi x$.
(iv) $X \sim \exp(1)$, $g(x) = e^{-x}$.

6.21 (Simple random sampling without replacement). Let $S \equiv \{1, 2, \ldots, m\}$, $1 < m < \infty$. Fix $1 \le n \le m$. Choose an element $X_1$ from $S$ such that the probability that $X_1 = j$ is $\frac{1}{m}$ for all $j \in S$. Next, choose an element $X_2$ from $S - \{X_1\}$ such that the probability that $X_2 = j$ is $\frac{1}{m-1}$ for $j \in S - \{X_1\}$. Continue this procedure for $n$ steps. Write the outcome as the ordered vector $\omega \equiv (X_1, X_2, \ldots, X_n)$.
(a) Identify the sample space $\Omega$, the σ-algebra $\mathcal{F}$ and the probability measure $P$ for this experiment.
(b) Show that for any permutation $\sigma$ of $\{1, 2, \ldots, n\}$, the random vector $Y_\sigma = (X_{\sigma(1)}, X_{\sigma(2)}, \ldots, X_{\sigma(n)})$ has the same distribution as $(X_1, X_2, \ldots, X_n)$.


(c) Conclude that $\{X_i\}_{1 \le i \le n}$ are identically distributed and that $EX_i$, $\mathrm{Cov}(X_i, X_j)$, $i \ne j$, are independent of $i$ and $j$ and compute them.
(d) Answer the same questions (a)–(c) if the sampling is changed to with replacement, i.e., at each stage $i$, $P(X_i = j) = \frac{1}{m}$ for all $j \in S$.
(e) In (d), let $D$ be the number of distinct units in the sample. Find $E(D)$ and $\mathrm{Var}(D)$.

6.22 Let $X$ be a nonnegative random variable. Show that

$$\sqrt{1 + (EX)^2} \le E\sqrt{1 + X^2} \le 1 + EX.$$

(Note that $f(x) \equiv \sqrt{1 + x^2}$ is convex on $[0, \infty)$ and bounded by $1 + x$.)

6.23 Let $X$ and $Y$ be nonnegative random variables defined on a probability space $(\Omega, \mathcal{F}, P)$. Suppose $X \cdot Y \ge 1$ w.p. 1. Show that $EX \cdot EY \ge 1$. (Hint: Use Cauchy-Schwarz on $\sqrt{X} \sqrt{Y}$.)

6.24 Let $\mu$ be a probability measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Show that there is a random variable $X$ on the Lebesgue space $([0, 1], \mathcal{B}([0, 1]), m)$ such that $m X^{-1} \equiv \mu$ where $m$ is the Lebesgue measure. Extend this to $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$, where $k$ is an integer $> 1$. (Note: This is true for any Polish space, i.e., a complete separable metric space, see Billingsley (1968).)

7 Independence

7.1 Independent events and random variables

Although a probability space is nothing more than a measure space with the measure of the whole space equal to one, probability theory is not merely a subset of measure theory. A distinguishing and fundamental feature of probability theory is the notion of independence.

Definition 7.1.1: Let $(\Omega, \mathcal{F}, P)$ be a probability space and $\{B_1, B_2, \ldots, B_n\} \subset \mathcal{F}$ be a finite collection of events. (i) $B_1, B_2, \ldots, B_n$ are called independent w.r.t. $P$, if

$$P\Big( \bigcap_{j=1}^{k} B_{i_j} \Big) = \prod_{j=1}^{k} P(B_{i_j}) \tag{1.1}$$

for all $\{i_1, i_2, \ldots, i_k\} \subset \{1, 2, \ldots, n\}$, $1 \le k \le n$. (ii) $B_1, B_2, \ldots, B_n$ are called pairwise independent w.r.t. $P$ if $P(B_i \cap B_j) = P(B_i) P(B_j)$ for all $i, j$, $i \ne j$.

Note that a collection $B_1, B_2, \ldots, B_n$ of events may be independent with respect to one probability measure $P$ but not with respect to another measure $P'$. Note also that pairwise independence does not imply independence (Problem 7.1).

Definition 7.1.2: Let $(\Omega, \mathcal{F}, P)$ be a probability space. A collection of events $\{B_\alpha, \alpha \in A\} \subset \mathcal{F}$ is called independent w.r.t. $P$ if for every finite subcollection $\{\alpha_1, \alpha_2, \ldots, \alpha_k\} \subset A$, $1 \le k < \infty$,

$$P\Big( \bigcap_{i=1}^{k} B_{\alpha_i} \Big) = \prod_{i=1}^{k} P(B_{\alpha_i}). \tag{1.2}$$
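The gap between pairwise independence and independence can be seen in a standard small example, sketched here under the assumption of two fair coin tosses; the events `A`, `B`, `C` and the helper `factorizes` are illustrative names that do not appear in the text. The code checks the product condition (1.1) for every subcollection of size two and then for all subcollections.

```python
from fractions import Fraction
from itertools import combinations, product

# Sample space: two fair coin tosses, each outcome has probability 1/4.
omega = list(product("HT", repeat=2))

def P(event):
    """Uniform probability of an event (a set of outcomes)."""
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] == "H"}     # first toss is heads
B = {w for w in omega if w[1] == "H"}     # second toss is heads
C = {w for w in omega if w[0] == w[1]}    # both tosses agree
events = [A, B, C]

def factorizes(sub):
    """Check P(intersection) == product of probabilities for a subcollection."""
    inter = set(omega)
    prod = Fraction(1)
    for e in sub:
        inter &= e
        prod *= P(e)
    return P(inter) == prod

pairwise = all(factorizes(pair) for pair in combinations(events, 2))
mutual = all(factorizes(sub)
             for k in range(2, len(events) + 1)
             for sub in combinations(events, k))
```

Here every pair intersects in the single outcome (H, H), which has probability 1/4 = (1/2)(1/2), but the triple intersection is the same outcome, whose probability 1/4 differs from (1/2)^3.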

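Definition 7.1.3: Let $(\Omega, \mathcal{F}, P)$ be a probability space. Let $A$ be a nonempty set. For each $\alpha$ in $A$, let $\mathcal{G}_\alpha \subset \mathcal{F}$ be a collection of events. Then the family $\{\mathcal{G}_\alpha : \alpha \in A\}$ is called independent w.r.t. $P$ if for every choice of $B_\alpha$ in $\mathcal{G}_\alpha$ for $\alpha$ in $A$, the collection of events $\{B_\alpha : \alpha \in A\}$ is independent w.r.t. $P$ as in Definition 7.1.2.

Definition 7.1.4: Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $\{X_\alpha : \alpha \in A\}$ be a collection of random variables on $(\Omega, \mathcal{F}, P)$. Then the collection $\{X_\alpha : \alpha \in A\}$ is called independent w.r.t. $P$ if the family of σ-algebras $\{\sigma\langle X_\alpha \rangle : \alpha \in A\}$ is independent w.r.t. $P$, where $\sigma\langle X_\alpha \rangle$ is the σ-algebra generated by $X_\alpha$, i.e.,

$$\sigma\langle X_\alpha \rangle \equiv \{X_\alpha^{-1}(B) : B \in \mathcal{B}(\mathbb{R})\}. \tag{1.3}$$

Note that the collection $\{X_\alpha : \alpha \in A\}$ is independent iff for any $\{\alpha_1, \alpha_2, \ldots, \alpha_k\} \subset A$, and $B_i \in \mathcal{B}(\mathbb{R})$, for $i = 1, 2, \ldots, k$, $1 \le k < \infty$,

$$P(X_{\alpha_i} \in B_i,\ i = 1, 2, \ldots, k) = \prod_{i=1}^{k} P(X_{\alpha_i} \in B_i). \tag{1.4}$$

It turns out that if (1.4) holds for all $B_i$ of the form $B_i = (-\infty, x_i]$, $x_i \in \mathbb{R}$, then it holds for all $B_i \in \mathcal{B}(\mathbb{R})$, $i = 1, 2, \ldots, k$. This follows from the proposition below.

Proposition 7.1.1: Let $(\Omega, \mathcal{F}, P)$ be a probability space. Let $A$ be a nonempty set. Let $\mathcal{G}_\alpha \subset \mathcal{F}$ be a π-system for each $\alpha$ in $A$. Let $\{\mathcal{G}_\alpha : \alpha \in A\}$ be independent w.r.t. $P$. Then the family of σ-algebras $\{\sigma\langle \mathcal{G}_\alpha \rangle : \alpha \in A\}$ is also independent w.r.t. $P$.

Proof: Fix $2 \le k < \infty$, $\{\alpha_1, \alpha_2, \ldots, \alpha_k\} \subset A$, $B_i \in \mathcal{G}_{\alpha_i}$, $i = 1, 2, \ldots, k - 1$. Let

$$\mathcal{L} \equiv \Big\{ B : B \in \sigma\langle \mathcal{G}_{\alpha_k} \rangle,\ P(B_1 \cap \cdots \cap B_{k-1} \cap B) = \Big( \prod_{i=1}^{k-1} P(B_i) \Big) P(B) \Big\}. \tag{1.5}$$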


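It is easy to verify that $\mathcal{L}$ is a λ-system. By hypothesis, $\mathcal{L}$ contains the π-system $\mathcal{G}_{\alpha_k}$. Hence by the π-λ theorem (cf. Theorem 1.1.2), $\mathcal{L} = \sigma\langle \mathcal{G}_{\alpha_k} \rangle$. Iterating the above argument $k$ times completes the proof. □

Corollary 7.1.2: A collection $\{X_\alpha : \alpha \in A\}$ of random variables on a probability space $(\Omega, \mathcal{F}, P)$ is independent w.r.t. $P$ iff for any $\{\alpha_1, \alpha_2, \ldots, \alpha_k\} \subset A$ and any $x_1, x_2, \ldots, x_k$ in $\mathbb{R}$, the joint cdf $F_{\alpha_1, \alpha_2, \ldots, \alpha_k}$ of $(X_{\alpha_1}, X_{\alpha_2}, \ldots, X_{\alpha_k})$ is the product of the marginal cdfs $F_{\alpha_i}$, i.e.,

$$F_{\alpha_1, \alpha_2, \ldots, \alpha_k}(x_1, x_2, \ldots, x_k) \equiv P(X_{\alpha_i} \le x_i,\ i = 1, 2, \ldots, k) = \prod_{i=1}^{k} P(X_{\alpha_i} \le x_i) = \prod_{i=1}^{k} F_{\alpha_i}(x_i). \tag{1.6}$$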

Proof: For the 'if' part let $\mathcal{G}_\alpha \equiv \{X_\alpha^{-1}(-\infty, x] : x \in \mathbb{R}\}$, $\alpha \in A$. Now apply Proposition 7.1.1. The 'only if' part is easy. □

Remark 7.1.1: If the probability distribution of $(X_{\alpha_1}, X_{\alpha_2}, \ldots, X_{\alpha_k})$ is absolutely continuous w.r.t. the Lebesgue measure $m_k$ on $\mathbb{R}^k$, then (1.6) and hence the independence of $\{X_{\alpha_1}, X_{\alpha_2}, \ldots, X_{\alpha_k}\}$ is equivalent to the condition that

$$f_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}(x_1, x_2, \ldots, x_k) = \prod_{i=1}^{k} f_{\alpha_i}(x_i), \tag{1.7}$$

a.e. $(m_k)$, where $f_{(\alpha_1, \alpha_2, \ldots, \alpha_k)}$ is the joint density of $(X_{\alpha_1}, X_{\alpha_2}, \ldots, X_{\alpha_k})$, and $f_{\alpha_i}$ is the marginal density of $X_{\alpha_i}$, $i = 1, 2, \ldots, k$. See Problem 7.18.

Proposition 7.1.3: Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $\{X_1, X_2, \ldots, X_k\}$, $2 \le k < \infty$, be a collection of random variables on $(\Omega, \mathcal{F}, P)$.

(i) Then $\{X_1, X_2, \ldots, X_k\}$ is independent iff

$$E \prod_{i=1}^{k} h_i(X_i) = \prod_{i=1}^{k} E h_i(X_i) \tag{1.8}$$

for all bounded Borel measurable functions $h_i : \mathbb{R} \to \mathbb{R}$, $i = 1, 2, \ldots, k$.

(ii) If $X_1, X_2$ are independent and $E|X_1| < \infty$, $E|X_2| < \infty$, then

$$E|X_1 X_2| < \infty \quad \text{and} \quad E X_1 X_2 = E X_1\, E X_2. \tag{1.9}$$

Proof: (i) If (1.8) holds, then taking $h_i = I_{B_i}$ with $B_i \in \mathcal{B}(\mathbb{R})$, $i = 1, 2, \ldots, k$, yields the independence of $\{X_1, X_2, \ldots, X_k\}$. Conversely, if $\{X_1, X_2, \ldots, X_k\}$ are independent, then (1.8) holds for $h_i = I_{B_i}$ for $B_i \in \mathcal{B}(\mathbb{R})$, $i = 1, 2, \ldots, k$, and hence for simple functions $\{h_1, h_2, \ldots, h_k\}$. Now (1.8) follows from the BCT.

(ii) Note that by the change of variable formula (Proposition 6.2.1)

$$E|X_1 X_2| = \int_{\mathbb{R}^2} |x_1 x_2|\, dP_{X_1, X_2}(x_1, x_2), \qquad E|X_i| = \int_{\mathbb{R}} |x_i|\, dP_{X_i}(x_i), \quad i = 1, 2,$$

where $P_{X_1, X_2}$ is the joint distribution of $(X_1, X_2)$ and $P_{X_i}$ is the marginal distribution of $X_i$, $i = 1, 2$. Also, by the independence of $X_1$ and $X_2$, $P_{X_1, X_2}$ is equal to the product measure $P_{X_1} \times P_{X_2}$. Hence, by Tonelli's theorem,

$$\begin{aligned}
E|X_1 X_2| &= \int_{\mathbb{R}^2} |x_1 x_2|\, dP_{X_1, X_2}(x_1, x_2) \\
&= \int_{\mathbb{R}^2} |x_1 x_2|\, dP_{X_1}(x_1)\, dP_{X_2}(x_2) \\
&= \Big( \int_{\mathbb{R}} |x_1|\, dP_{X_1}(x_1) \Big) \Big( \int_{\mathbb{R}} |x_2|\, dP_{X_2}(x_2) \Big) \\
&= E|X_1|\, E|X_2| < \infty.
\end{aligned}$$

Now using Fubini's theorem, one gets (1.9). □

Remark 7.1.2: Note that the converse to (ii) above need not hold. That is, if X1 and X2 are two random variables such that E|X1 | < ∞, E|X2 | < ∞, E|X1 X2 | < ∞, and EX1 X2 = EX1 EX2 , then X1 and X2 need not be independent.
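A standard counterexample for the converse can be verified by direct computation. In the sketch below, which is an illustration and not taken from the text, $X_1$ is assumed uniform on $\{-1, 0, 1\}$ and $X_2 = X_1^2$; the variable names are hypothetical.

```python
from fractions import Fraction

# X1 uniform on {-1, 0, 1}; X2 = X1**2 is a function of X1.
support = [-1, 0, 1]
prob = Fraction(1, 3)

E_X1 = sum(prob * x for x in support)
E_X2 = sum(prob * x**2 for x in support)
E_X1X2 = sum(prob * x * x**2 for x in support)

# Independence would require P(X1 = 0, X2 = 0) = P(X1 = 0) * P(X2 = 0).
joint_00 = prob            # {X1 = 0} and {X2 = 0} are the same event
product_00 = prob * prob   # P(X1 = 0) * P(X2 = 0)
```

The product moment factorizes because $E X_1 = 0$ and $E X_1^3 = 0$ by symmetry, while the joint probability of $\{X_1 = 0, X_2 = 0\}$ does not factorize.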

7.2 Borel-Cantelli lemmas, tail σ-algebras, and Kolmogorov's zero-one law

In this section some basic results on classes of independent events are established. These will play an important role in proving laws of large numbers in Chapter 8.

Definition 7.2.1: Let $(\Omega, \mathcal{F})$ be a measurable space and $\{A_n\}_{n \ge 1}$ be a sequence of sets in $\mathcal{F}$. Then

$$\limsup_{n \to \infty} A_n \equiv \overline{\lim}\, A_n \equiv \bigcap_{k=1}^{\infty} \bigcup_{n \ge k} A_n \tag{2.1}$$

$$\liminf_{n \to \infty} A_n \equiv \underline{\lim}\, A_n \equiv \bigcup_{k=1}^{\infty} \bigcap_{n \ge k} A_n. \tag{2.2}$$

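Proposition 7.2.1: Both $\overline{\lim}\, A_n$ and $\underline{\lim}\, A_n \in \mathcal{F}$ and

$$\overline{\lim}\, A_n = \{\omega : \omega \in A_n \text{ for infinitely many } n\},$$

$$\underline{\lim}\, A_n = \{\omega : \omega \in A_n \text{ for all but a finite number of } n\}.$$

Proof: Since $\{A_n\}_{n \ge 1} \subset \mathcal{F}$ and $\mathcal{F}$ is a σ-algebra, $B_k = \bigcup_{n \ge k} A_n \in \mathcal{F}$ for each $k \in \mathbb{N}$ and hence $\overline{\lim}\, A_n \equiv \bigcap_{k=1}^{\infty} B_k \in \mathcal{F}$. Next,

$$\begin{aligned}
\omega \in \overline{\lim}\, A_n &\iff \omega \in B_k \text{ for all } k = 1, 2, \ldots \\
&\iff \text{for each } k, \text{ there exists } n_k \ge k \text{ such that } \omega \in A_{n_k} \\
&\iff \omega \in A_n \text{ for infinitely many } n.
\end{aligned}$$

The proof for $\underline{\lim}\, A_n$ is similar. □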
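The "infinitely many $n$" characterization can be explored on a concrete sequence of sets. The sketch below assumes the sets $A_n = [0, 1/n]$ for odd $n$ and $A_n = [1 - 1/n, 1]$ for even $n$ (the sequence used in Example 7.2.1), and inspects membership only up to a finite horizon, so it illustrates the statement rather than proving it. The helper names `in_A`, `last_hit`, and `misses` are hypothetical.

```python
from fractions import Fraction

def in_A(n, x):
    """Membership of x in A_n = [0, 1/n] (n odd) or [1 - 1/n, 1] (n even)."""
    if n % 2 == 1:
        return 0 <= x <= Fraction(1, n)
    return 1 - Fraction(1, n) <= x <= 1

def last_hit(x, horizon):
    """Largest n <= horizon with x in A_n, or None if there is none."""
    hits = [n for n in range(1, horizon + 1) if in_A(n, x)]
    return hits[-1] if hits else None

def misses(x, horizon):
    """Number of n <= horizon with x not in A_n."""
    return sum(1 for n in range(1, horizon + 1) if not in_A(n, x))
```

For an interior point such as 1/2 the last hit stays fixed as the horizon grows, which is consistent with the point lying outside the limit superior. For the endpoints 0 and 1 both the hits and the misses keep accumulating, which is consistent with these points lying in the limit superior but not in the limit inferior.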

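In probability theory, $\overline{\lim}\, A_n$ is referred to as the event that "$A_n$ happens infinitely often (i.o.)" and $\underline{\lim}\, A_n$ as the event that "all but finitely many $A_n$'s happen."

Example 7.2.1: Let $\Omega = \mathbb{R}$, $\mathcal{F} = \mathcal{B}(\mathbb{R})$, and let

$$A_n = \begin{cases} \big[0, \frac{1}{n}\big] & \text{for } n \text{ odd} \\ \big[1 - \frac{1}{n}, 1\big] & \text{for } n \text{ even.} \end{cases}$$

Then $\overline{\lim}\, A_n = \{0, 1\}$, $\underline{\lim}\, A_n = \emptyset$.

The following result on the probabilities of $\overline{\lim}\, A_n$ and $\underline{\lim}\, A_n$ is very useful in probability theory.

Theorem 7.2.2: Let $(\Omega, \mathcal{F}, P)$ be a probability space and $\{A_n\}_{n \ge 1}$ be a sequence of events in $\mathcal{F}$. Then

(a) (The first Borel-Cantelli lemma). If $\sum_{n=1}^{\infty} P(A_n) < \infty$, then $P(\overline{\lim}\, A_n) = 0$.

(b) (The second Borel-Cantelli lemma). If $\sum_{n=1}^{\infty} P(A_n) = \infty$ and $\{A_n\}_{n \ge 1}$ are pairwise independent, then $P(\overline{\lim}\, A_n) = 1$.

Remark 7.2.1: This result is also called a zero-one law as it asserts that for pairwise independent events $\{A_n\}_{n \ge 1}$, $P(\overline{\lim}\, A_n) = 0$ or $1$ according to whether $\sum_{n=1}^{\infty} P(A_n)$ is finite or equal to $\infty$.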


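Proof:

(a) Let $Z_n \equiv \sum_{j=1}^{n} I_{A_j}$. Then $Z_n \uparrow Z \equiv \sum_{j=1}^{\infty} I_{A_j}$ and by the MCT, $E Z_n \equiv \sum_{j=1}^{n} P(A_j) \uparrow E Z$. Thus, $\sum_{j=1}^{\infty} P(A_j) < \infty \Rightarrow E Z < \infty \Rightarrow Z < \infty$ w.p. 1 $\Rightarrow P(Z = \infty) = 0$. But the event $\overline{\lim}\, A_n = \{Z = \infty\}$ and so (a) follows.

(b) Without loss of generality, assume $P(A_j) > 0$ for some $j$. Let $J_n = Z_n / E Z_n$ for $n \ge j$ where $Z_n$ is as above. Then, $E J_n = 1$ and by the pairwise independence of $\{A_n\}_{n \ge 1}$, the variance of $J_n$ is

$$\mathrm{Var}(J_n) = \frac{\sum_{j=1}^{n} P(A_j)(1 - P(A_j))}{(E Z_n)^2} \le \frac{1}{E Z_n}.$$

If $\sum_{j=1}^{\infty} P(A_j) = \infty$, then $E Z_n = \sum_{j=1}^{n} P(A_j) \uparrow \infty$, by the MCT. Thus $E J_n \equiv 1$, $\mathrm{Var}(J_n) \to 0$ as $n \to \infty$. By Chebychev's inequality, for all $\epsilon > 0$,

$$P(|J_n - 1| > \epsilon) \le \frac{\mathrm{Var}(J_n)}{\epsilon^2} \to 0 \quad \text{as } n \to \infty.$$

Thus, $J_n \to 1$ in probability and hence there exists a subsequence $\{n_k\}_{k \ge 1}$ such that $J_{n_k} \to 1$ w.p. 1 (cf. Theorem 2.5.2). Since $E Z_{n_k} \uparrow \infty$, this implies that $Z_{n_k} \to \infty$ w.p. 1. But $\{Z_n\}_{n \ge 1}$ is nondecreasing in $n$ and hence $Z_n \uparrow \infty$ w.p. 1. Now since $\overline{\lim}\, A_n = \{Z = \infty\}$, it follows that $P(\overline{\lim}\, A_n) = P(Z = \infty) = 1$. □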
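The dichotomy in Theorem 7.2.2 can be made concrete with exact arithmetic. The sketch below makes a stronger assumption than the theorem, namely that the events are fully independent, so that the probability that none of $A_k, \ldots, A_N$ occurs is the product of the $1 - P(A_n)$. The two probability sequences $1/n^2$ and $1/n$ and the helper names are illustrative choices, not part of the text.

```python
from fractions import Fraction

def prob_none(p, k, N):
    """P(no A_n occurs for k <= n <= N), assuming fully independent
    events with P(A_n) = p(n)."""
    prob = Fraction(1)
    for n in range(k, N + 1):
        prob *= 1 - p(n)
    return prob

def summable(n):
    """P(A_n) = 1/n^2, whose sum over n is finite."""
    return Fraction(1, n * n)

def divergent(n):
    """P(A_n) = 1/n, whose sum over n is infinite."""
    return Fraction(1, n)
```

With $P(A_n) = 1/n$ the product telescopes to $(k-1)/N$, which tends to 0 as $N \to \infty$ for every $k$, so some $A_n$ with $n \ge k$ occurs with probability one. With $P(A_n) = 1/n^2$ the product equals $(k-1)(N+1)/(kN)$, which tends to $(k-1)/k$, so the probability that some $A_n$ with $n \ge k$ occurs is $1/k$, and this tends to 0 as $k \to \infty$.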

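Proposition 7.2.3: Let $\{X_n\}_{n \ge 1}$ be a sequence of random variables on some probability space $(\Omega, \mathcal{F}, P)$.

(a) If $\sum_{n=1}^{\infty} P(|X_n| > \epsilon) < \infty$ for each $\epsilon > 0$, then $P(\lim_{n \to \infty} X_n = 0) = 1$.

(b) If $\{X_n\}_{n \ge 1}$ are pairwise independent and $P(\lim_{n \to \infty} X_n = 0) = 1$, then $\sum_{n=1}^{\infty} P(|X_n| > \epsilon) < \infty$ for each $\epsilon > 0$.

Proof:

(a) Fix $\epsilon > 0$. Let $A_n = \{|X_n| > \epsilon\}$, $n \ge 1$. Then $\sum_{n=1}^{\infty} P(A_n) < \infty \Rightarrow P(\overline{\lim}\, A_n) = 0$, by the first Borel-Cantelli lemma (Theorem 7.2.2 (a)). But

$$\begin{aligned}
(\overline{\lim}\, A_n)^c &= \{\omega : \text{there exists } n(\omega) < \infty \text{ such that for all } n \ge n(\omega),\ \omega \notin A_n\} \\
&= \{\omega : \text{there exists } n(\omega) < \infty \text{ such that } |X_n(\omega)| \le \epsilon \text{ for all } n \ge n(\omega)\} \\
&= B_\epsilon, \text{ say.}
\end{aligned}$$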

Thus, $\sum_{n=1}^{\infty} P(A_n) < \infty \Rightarrow P(B_\epsilon) = 1$. Let $B = \bigcap_{r=1}^{\infty} B_{1/r}$. Now note that

$$\Big\{\omega : \lim_{n \to \infty} |X_n(\omega)| = 0\Big\} = \bigcap_{r=1}^{\infty} B_{1/r}.$$

Since $P(B^c) \le \sum_{r=1}^{\infty} P(B_{1/r}^c) = 0$, $P(B) = 1$.

(b) Let $\{X_n\}_{n \ge 1}$ be pairwise independent and $\sum_{n=1}^{\infty} P(|X_n| > \epsilon_0) = \infty$ for some $\epsilon_0 > 0$. Let $A_n = \{|X_n| > \epsilon_0\}$. Since $\{X_n\}_{n \ge 1}$ are pairwise independent, so are $\{A_n\}_{n \ge 1}$. By the second Borel-Cantelli lemma $P(\overline{\lim}\, A_n) = 1$. But $\omega \in \overline{\lim}\, A_n \Rightarrow \limsup_{n \to \infty} |X_n| \ge \epsilon_0$ and hence $P(\lim_{n \to \infty} |X_n| = 0) = 0$. This contradicts the hypothesis that $P(\limsup_{n \to \infty} |X_n| = 0) = 1$. □

Definition 7.2.2: The tail σ-algebra of a sequence of random variables $\{X_n\}_{n \ge 1}$ on a probability space $(\Omega, \mathcal{F}, P)$ is

$$\mathcal{T} = \bigcap_{n=1}^{\infty} \sigma\langle \{X_j : j \ge n\} \rangle$$

and any $A \in \mathcal{T}$ is called a tail event. Further, any $\mathcal{T}$-measurable random variable is called a tail random variable (w.r.t. $\{X_n\}_{n \ge 1}$).

Tail events are determined by the behavior of the sequence $\{X_n\}_{n \ge 1}$ for large $n$ and they remain unchanged if any finite subcollection of the $X_n$'s are dropped or replaced by another finite set of random variables. Events such as $\{\limsup_{n \to \infty} X_n < x\}$ or $\{\lim_{n \to \infty} X_n = x\}$, $x \in \mathbb{R}$, belong to $\mathcal{T}$. A remarkable result of Kolmogorov is that for any sequence of independent random variables, any tail event has probability zero or one.

Theorem 7.2.4: (Kolmogorov's 0-1 law). Let $\{X_n\}_{n \ge 1}$ be a sequence of independent random variables on a probability space $(\Omega, \mathcal{F}, P)$ and let $\mathcal{T}$ be the tail σ-algebra of $\{X_n\}_{n \ge 1}$. Then $P(A) = 0$ or $1$ for all $A \in \mathcal{T}$.

Remark 7.2.2: Note that in Proposition 7.2.3, the event $A \equiv \{\lim_{n \to \infty} X_n = 0\}$ belongs to $\mathcal{T}$, and hence, by the above theorem, $P(A) = 0$ or $1$. Thus, proving that $P(A) = 1$ is equivalent to proving $P(A) \ne 0$. Kolmogorov's 0-1 law only restricts the possible values of the probabilities of tail events like $A$ to 0 or 1, while the Borel-Cantelli lemmas (Theorem 7.2.2) provide a tool for ascertaining whether the value is 0 or 1. On the other hand, note that Theorem 7.2.2 requires only pairwise independence of $\{A_n\}_{n \ge 1}$ but Kolmogorov's 0-1 law requires the full independence of the sequence $\{X_n\}_{n \ge 1}$.


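Proof: For $n \ge 1$, define the σ-algebras $\mathcal{F}_n$ and $\mathcal{T}_n$ by $\mathcal{F}_n = \sigma\langle\{X_1, \ldots, X_n\}\rangle$ and $\mathcal{T}_n = \sigma\langle\{X_{n+1}, X_{n+2}, \ldots\}\rangle$. Since $X_n$, $n \ge 1$, are independent, $\mathcal{F}_n$ is independent of $\mathcal{T}_n$ for all $n \ge 1$. Since, for each $n$, $\mathcal{T} = \bigcap_{m=n}^{\infty} \mathcal{T}_m$ is a sub-σ-algebra of $\mathcal{T}_n$, this implies that $\mathcal{F}_n$ is independent of $\mathcal{T}$ for all $n \ge 1$, and hence $\mathcal{A} \equiv \bigcup_{n=1}^{\infty} \mathcal{F}_n$ is independent of $\mathcal{T}$. It is easy to check that $\mathcal{A}$ is an algebra (and hence is a π-system). Hence, by Proposition 7.1.1, $\sigma\langle\mathcal{A}\rangle$ is independent of $\mathcal{T}$. Since $\mathcal{T}$ is also a sub-σ-algebra of $\sigma\langle\mathcal{A}\rangle = \sigma\langle\{X_n : n \ge 1\}\rangle$, this implies that $\mathcal{T}$ is independent of itself. Hence for any $B \in \mathcal{T}$,

\[ P(B \cap B) = P(B) \cdot P(B), \]

which implies $P(B) = 0$ or $1$. □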

Definition 7.2.3: Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $X : \Omega \to \bar{\mathbb{R}}$ be an $\langle \mathcal{F}, \mathcal{B}(\bar{\mathbb{R}})\rangle$-measurable mapping. (Recall the definition of $\mathcal{B}(\bar{\mathbb{R}})$ from (2.1.4).) Then $X$ is called an extended real-valued random variable or an $\bar{\mathbb{R}}$-valued random variable.

Corollary 7.2.5: Let $\mathcal{T}$ be the tail σ-algebra of a sequence of independent random variables $\{X_n\}_{n\ge 1}$ on $(\Omega, \mathcal{F}, P)$ and let $X$ be a $\langle \mathcal{T}, \mathcal{B}(\bar{\mathbb{R}})\rangle$-measurable $\bar{\mathbb{R}}$-valued random variable from $\Omega$ to $\bar{\mathbb{R}}$. Then, there exists $c \in \bar{\mathbb{R}}$ such that $P(X = c) = 1$.

Proof: If $P(X \le x) = 0$ for all $x \in \mathbb{R}$, then $P(X = +\infty) = 1$. Hence, suppose that $B \equiv \{x \in \mathbb{R} : P(X \le x) \ne 0\} \ne \emptyset$. Since $\{X \le x\} \in \mathcal{T}$ for all $x \in \mathbb{R}$, $P(X \le x) = 1$ for all $x \in B$. Define $c = \inf\{x : x \in B\}$. Check that $P(X = c) = 1$. □

An immediate implication of Corollary 7.2.5 is that for any sequence of independent random variables $\{X_n\}_{n\ge 1}$, the $\bar{\mathbb{R}}$-valued random variables $\limsup_{n\to\infty} X_n$ and $\liminf_{n\to\infty} X_n$ are degenerate, i.e., they are constants w.p. 1.

Example 7.2.2: Let $\{X_n\}_{n\ge 1}$ be a sequence of independent random variables on $(\Omega, \mathcal{F}, P)$ with $EX_n = 0$, $EX_n^2 = 1$ for all $n \ge 1$. Let $S_n = X_1 + \ldots + X_n$, $n \ge 1$, and $\Phi(x) = \int_{-\infty}^{x} (\sqrt{2\pi})^{-1} \exp(-y^2/2)\,dy$, $x \in \mathbb{R}$. If $P(S_n \le \sqrt{n}\,x) \to \Phi(x)$ for all $x \in \mathbb{R}$, then

\[ \limsup_{n\to\infty} \frac{S_n}{\sqrt{n}} = +\infty \quad \text{a.s.} \tag{2.3} \]

To show this, let $S = \limsup_{n\to\infty} S_n/\sqrt{n}$. First it will be shown that $S$ is $\langle \mathcal{T}, \mathcal{B}(\bar{\mathbb{R}})\rangle$-measurable. For any $m \ge 1$, define the variables $T_{m,n} = (X_{m+1} + \ldots + X_n)/\sqrt{n}$ and $S_{m,n} = (X_1 + \ldots + X_m)/\sqrt{n}$, $n > m$. Note that for any fixed $m \ge 1$, $T_{m,n}$ is $\sigma\langle X_{m+1}, \ldots\rangle$-measurable and $S_{m,n}(\omega) \to 0$ as $n \to \infty$ for all $\omega \in \Omega$. Hence, for any $m \ge 1$,

\[ S = \limsup_{n\to\infty} (S_{m,n} + T_{m,n}) = \limsup_{n\to\infty} T_{m,n} \]

is $\sigma\langle X_{m+1}, X_{m+2}, \ldots\rangle$-measurable. Thus, $S$ is measurable with respect to $\mathcal{T} = \bigcap_{m=1}^{\infty} \sigma\langle X_{m+1}, X_{m+2}, \ldots\rangle$. Hence, by Theorem 7.2.4, $P(S = +\infty) \in \{0, 1\}$. If possible, now suppose that $P(S = +\infty) = 0$. Then, by Corollary 7.2.5, there exists $c \in [-\infty, \infty)$ such that $P(S = c) = 1$. Let $A_n = \{S_n > \sqrt{n}\,x\}$, $n \ge 1$, with $x = c + 1$. Then,

\[
\begin{aligned}
0 < 1 - \Phi(x) &= \lim_{n\to\infty} P(A_n) \\
&\le \lim_{n\to\infty} P\Big(\bigcup_{m \ge n} A_m\Big) \\
&= P\Big(\bigcap_{n=1}^{\infty} \bigcup_{m=n}^{\infty} A_m\Big) \\
&= P\Big(\frac{S_n}{\sqrt{n}} > x \ \text{i.o.}\Big) \\
&\le P(S \ge c + 1) = 0.
\end{aligned}
\]

This shows that $P(S = +\infty)$ must be 1. Also see Problem 7.16.

Remark 7.2.3: It will be shown in Chapter 11 that if $\{X_i\}_{i\ge 1}$ are independent and identically distributed (iid) random variables with $EX_1 = 0$ and $EX_1^2 = 1$, then

\[ P\Big(\frac{S_n}{\sqrt{n}} \le x\Big) \to \Phi(x) \quad \text{for all } x \in \mathbb{R}. \]

(This is known as the central limit theorem.) Indeed, a stronger result known as the law of the iterated logarithm holds, which says that for such $\{X_i\}_{i\ge 1}$,

\[ \limsup_{n\to\infty} \frac{S_n}{\sqrt{2n \log\log n}} = +1 \quad \text{w.p. 1.} \]

7.3 Problems

7.1 Give an example of three events $A_1, A_2, A_3$ on some probability space such that they are pairwise independent but not independent.
(Hint: Consider iid random variables $X_1, X_2, X_3$ with $P(X_1 = 1) = \frac{1}{2} = P(X_1 = 0)$ and the events $A_1 = \{X_1 = X_2\}$, $A_2 = \{X_2 = X_3\}$, $A_3 = \{X_3 = X_1\}$.)

7.2 Let $\{X_\alpha : \alpha \in A\}$ be a collection of independent random variables on some probability space $(\Omega, \mathcal{F}, P)$. For any subset $B \subset A$, let $X_B \equiv \{X_\alpha : \alpha \in B\}$.
(a) Let $B$ be a nonempty proper subset of $A$. Show that the collections $X_B$ and $X_{B^c}$ are independent, i.e., the σ-algebras $\sigma\langle X_B\rangle$ and $\sigma\langle X_{B^c}\rangle$ are independent w.r.t. $P$.
(b) Let $\{B_\gamma : \gamma \in \Gamma\}$ be a partition of $A$ by nonempty proper subsets $B_\gamma$. Show that the family of σ-algebras $\{\sigma\langle X_{B_\gamma}\rangle : \gamma \in \Gamma\}$ is independent w.r.t. $P$.

7.3 Let $X_1, X_2$ be iid standard exponential random variables, i.e.,

\[ P(X_1 \in A) = \int_{A \cap (0,\infty)} e^{-x}\,dx, \quad A \in \mathcal{B}(\mathbb{R}). \]

Let $Y_1 = \min(X_1, X_2)$ and $Y_2 = \max(X_1, X_2) - Y_1$. Show that $Y_1$ and $Y_2$ are independent. Generalize this to the case of three iid standard exponential random variables.

7.4 Let $\Omega = (0, 1)$, $\mathcal{F} = \mathcal{B}((0, 1))$, the Borel σ-algebra on $(0, 1)$, and $P$ be the Lebesgue measure on $(0, 1)$. For each $\omega \in (0, 1)$, let $\omega = \sum_{i=1}^{\infty} \frac{X_i(\omega)}{2^i}$ be the nonterminating binary expansion of $\omega$.
(a) Show that $\{X_i\}_{i\ge 1}$ are iid Bernoulli($\frac{1}{2}$) random variables, i.e., $P(X_1 = 0) = \frac{1}{2} = P(X_1 = 1)$.
(Hint: Let $s_i \in \{0, 1\}$, $i = 1, 2, \ldots, k$, $k \in \mathbb{N}$. Show that the set $\{\omega : 0 < \omega < 1, X_i(\omega) = s_i, 1 \le i \le k\}$ is an interval of length $2^{-k}$.)
(b) Show that

\[ Y_1 = \sum_{i=1}^{\infty} \frac{X_{2i-1}}{2^i} \tag{3.1} \]
\[ Y_2 = \sum_{i=1}^{\infty} \frac{X_{2i}}{2^i} \tag{3.2} \]

are independent Uniform(0,1) random variables.
(c) Using the fact that the set $\mathbb{N} \times \mathbb{N}$ of lattice points $(m, n)$ is in one-to-one correspondence with $\mathbb{N}$ itself, construct a sequence $\{Y_i\}_{i\ge 1}$ of iid Uniform(0,1) random variables such that for each $j$, $Y_j$ is a function of $\{X_i\}_{i\ge 1}$.

(d) For any cdf $F$, show that the random variable $X(\omega) \equiv F^{-1}(\omega)$ has cdf $F$, where

\[ F^{-1}(u) = \inf\{x : F(x) \ge u\} \quad \text{for } 0 < u < 1. \tag{3.3} \]

(e) Let $\{F_i\}_{i\ge 1}$ be a sequence of cdfs on $\mathbb{R}$. Using (c), construct a sequence $\{Z_i\}_{i\ge 1}$ of independent random variables on $(\Omega, \mathcal{F}, P)$ such that $Z_i$ has cdf $F_i$, $i \ge 1$.
(f) Show that the cdf of the random variable $W \equiv \sum_{i=1}^{\infty} \frac{2X_i}{3^i}$ is the Cantor function (cf. Section 4.5).
(g) Let $p > 1$ be a positive integer. For each $\omega \in (0, 1)$ let $\omega \equiv \sum_{i=1}^{\infty} \frac{V_i(\omega)}{p^i}$ be the nonterminating $p$-nary expansion of $\omega$. Show that $\{V_i\}_{i\ge 1}$ are iid and determine the distribution of $V_1$.

7.5 Let $\{X_i\}_{i\ge 1}$ be a Markov chain with state space $S = \{0, 1\}$ and transition probability matrix

\[ Q = \begin{pmatrix} q_0 & p_0 \\ p_1 & q_1 \end{pmatrix}, \quad \text{where } p_i = 1 - q_i,\ 0 < q_i < 1,\ i = 0, 1. \]

Let $\tau_1 = \min\{j : X_j = 0\}$ and $\tau_{k+1} = \min\{j : j > \tau_k, X_j = 0\}$, $k = 1, 2, \ldots$. Note that $\tau_k$ is the time of the $k$th visit to the state 0.
(a) Show that $\{\tau_{k+1} - \tau_k : k \ge 1\}$ are iid random variables and independent of $\tau_1$.
(b) Show that $P_i(\tau_1 < \infty) = 1$ and hence $P_i(\tau_k < \infty) = 1$ for all $k \ge 2$, $i = 0, 1$, where $P_i$ denotes the probability distribution with $X_1 = i$ w.p. 1.
(Hint: Show that $\sum_{k=1}^{\infty} P(\tau_1 > k \mid X_1 = i) < \infty$ for $i = 0, 1$ and use the Borel-Cantelli lemma.)
(c) Show also that $E_i(e^{\theta_0 \tau_1}) < \infty$ for some $\theta_0 > 0$, $i = 0, 1$, where $E_i$ denotes the expectation under $P_i$.

7.6 Let $X_1$ and $X_2$ be independent random variables.
(a) Show that for any $p > 0$,

\[ E|X_1 + X_2|^p < \infty \quad \text{iff} \quad E|X_1|^p < \infty,\ E|X_2|^p < \infty. \]

Show that this is false if $X_1$ and $X_2$ are not independent.
(Hint: Use Fubini's theorem to conclude that $E|X_1 + X_2|^p < \infty$ implies that $E|X_1 + x_2|^p < \infty$ for some $x_2$ and hence $E|X_1|^p < \infty$.)

(b) Show that if $E(X_1^2 + X_2^2) < \infty$, then

\[ \mathrm{Var}(X_1 + X_2) = \mathrm{Var}(X_1) + \mathrm{Var}(X_2). \tag{3.4} \]

Show by an example that (3.4) need not imply the independence of $X_1$ and $X_2$. Show also that if $X_1$ and $X_2$ take only two values each and (3.4) holds, then $X_1$ and $X_2$ are independent.

7.7 Let $X_1$ and $X_2$ be two random variables on a probability space $(\Omega, \mathcal{F}, P)$.
(a) Show that, if

\[ P\big(X_1 \in (a_1, b_1), X_2 \in (a_2, b_2)\big) = P\big(X_1 \in (a_1, b_1)\big)\, P\big(X_2 \in (a_2, b_2)\big) \tag{3.5} \]

for all $a_1, b_1, a_2, b_2$ in a dense set $D$ in $\mathbb{R}$, then $X_1$ and $X_2$ are independent.
(Hint: Show that (3.5) implies that the joint cdf of $(X_1, X_2)$ is the product of the marginal cdfs of $X_1$ and $X_2$ and use Corollary 7.1.2.)
(b) Let $f_i : \mathbb{R} \to \mathbb{R}$, $i = 1, 2$, be two one-one functions such that both $f_i$ and $f_i^{-1}$ are Borel measurable, $i = 1, 2$. Show that $X_1$ and $X_2$ are independent iff $f_1(X_1)$ and $f_2(X_2)$ are independent. Conclude that $X_1$ and $X_2$ are independent iff $e^{X_1}$ and $e^{X_2}$ are independent.

7.8 (a) Let $X_1$ and $X_2$ be two independent bounded random variables. Show that

\[ E\big(p_1(X_1)\,p_2(X_2)\big) = \big(Ep_1(X_1)\big)\big(Ep_2(X_2)\big) \tag{3.6} \]

where $p_1(\cdot)$ and $p_2(\cdot)$ are polynomials.
(b) Show that if $X_1$ and $X_2$ are bounded random variables and (3.6) holds for all polynomials $p_1(\cdot)$ and $p_2(\cdot)$, then $X_1$ and $X_2$ are independent.
(Hint: Use the facts that (i) continuous functions on a bounded closed interval $[a, b]$ can be approximated uniformly by polynomials, and (ii) for any interval $(c, d) \subset [a, b]$, any random variable $X$ and $\epsilon > 0$, there exists a continuous function $f$ on $[a, b]$ such that $E|f(X) - I_{(c,d)}(X)| < \epsilon$, provided $P(X = c \text{ or } d) = 0$.)

7.9 Let $\{X_n\}_{n\ge 1}$ be a sequence of iid random variables on a probability space $(\Omega, \mathcal{F}, P)$. Let $R = R(\omega)$ be the radius of convergence of the power series $\sum_{n=1}^{\infty} X_n r^n$.

(a) Show that $R$ is a tail random variable.
(Hint: Note that $R = \dfrac{1}{\limsup_{n\to\infty} |X_n|^{1/n}}$.)
(b) Show that if $E(\log |X_1|)^+ = \infty$, then $R = 0$ w.p. 1, and if $E(\log |X_1|)^+ < \infty$, then $R \ge 1$ w.p. 1.
(Hint: Apply the Borel-Cantelli lemmas to $A_n = \{|X_n| > \lambda^n\}$ for each $\lambda > 1$.)

7.10 Let $\{A_n\}_{n\ge 1}$ be a sequence of events in $(\Omega, \mathcal{F}, P)$ such that

\[ \sum_{n=1}^{\infty} P(A_n \cap A_{n+1}^c) < \infty \]

and $\lim_{n\to\infty} P(A_n) = 0$. Show that

\[ P\big(\limsup_{n\to\infty} A_n\big) = 0. \]

Show also that $\lim_{n\to\infty} P(A_n) = 0$ can be replaced by $\lim_{n\to\infty} P\big(\bigcap_{j \ge n} A_j\big) = 0$.
(Hint: Let $B_n = A_n \cap A_{n+1}^c$, $n \ge 1$, $B = \limsup_n B_n$, $A = \limsup_n A_n$. Show that $A \cap B^c \subset \liminf_n A_n$.)

7.11 For any nonnegative random variable $X$, show that $E|X| < \infty$ iff $\sum_{n=1}^{\infty} P(|X| > \epsilon n) < \infty$ for every $\epsilon > 0$.

7.12 Let $\{X_i\}_{i\ge 1}$ be a sequence of pairwise independent and identically distributed random variables.
(a) Show that $\lim_{n\to\infty} \frac{X_n}{n} = 0$ w.p. 1 iff $E|X_1| < \infty$.
(Hint: $E|X_1| < \infty \iff \sum_{n=1}^{\infty} P(|X_n| > \epsilon n) < \infty$ for all $\epsilon > 0$.)
(b) Show that $E(\log |X_1|)^+ < \infty$ iff $|X_n|^{1/n} \to 1$ w.p. 1.

7.13 Let $\{X_i\}_{i\ge 1}$ be a sequence of identically distributed random variables and let $M_n = \max\{|X_j| : 1 \le j \le n\}$.
(a) If $E|X_1|^\alpha < \infty$ for some $\alpha \in (0, \infty)$, then show that

\[ \frac{M_n}{n^{1/\alpha}} \to 0 \quad \text{w.p. 1.} \tag{3.7} \]

(Hint: Fix $\epsilon > 0$. Let $A_n = \{|X_n| > \epsilon n^{1/\alpha}\}$. Apply the first Borel-Cantelli lemma.)
(b) Show that if $\{X_i\}_{i\ge 1}$ are iid satisfying (3.7) for some $\alpha > 0$, then $E|X_1|^\alpha < \infty$.
(Hint: Apply the second Borel-Cantelli lemma.)

7.14 Let $X_1$ and $X_2$ be independent random variables with distributions $\mu_1$ and $\mu_2$. Let $Y = X_1 + X_2$.
(a) Show that the distribution $\mu$ of $Y$ is the convolution $\mu_1 * \mu_2$ as defined by

\[ (\mu_1 * \mu_2)(A) = \int_{\mathbb{R}} \mu_1(A - x)\,\mu_2(dx) \]

(cf. Problem 5.12).
(b) Show that if $X_1$ has a continuous distribution then so does $Y$.
(c) Show that if $X_1$ has an absolutely continuous distribution then so does $Y$ and that the density function of $Y$ is given by

\[ \frac{d\mu}{dm}(x) \equiv f_Y(x) = \int f_{X_1}(x - u)\,\mu_2(du), \]

where $f_{X_1}(x) \equiv \frac{d\mu_1}{dm}(x)$ is the probability density of $X_1$.
(d) Show that $Y$ has a discrete distribution iff both $X_1$ and $X_2$ are discrete.

7.15 (AR(1) series). Let $\{X_n\}_{n\ge 0}$ be a sequence of random variables such that for some $\rho \in \mathbb{R}$,

\[ X_{n+1} = \rho X_n + \epsilon_{n+1}, \quad n \ge 0, \]

where $\{\epsilon_n\}_{n\ge 1}$ are independent and independent of $X_0$.
(a) Show that if $|\rho| < 1$ and $E(\log |\epsilon_1|)^+ < \infty$, then

\[ \hat{X}_n \equiv \sum_{j=0}^{n} \rho^j \epsilon_{j+1} \]

converges w.p. 1 to a limit $\hat{X}_\infty$, say.
(b) Show that under the hypothesis of (a), for any bounded continuous function $h : \mathbb{R} \to \mathbb{R}$ and for any distribution of $X_0$,

\[ Eh(X_n) \to Eh(\hat{X}_\infty). \]

(Hint: Show that for each $n \ge 1$, $X_n - \rho^n X_0$ and $\hat{X}_n$ have the same distribution.)

7.16 Establish the following generalization of Example 7.2.2. Let $\{X_n\}_{n\ge 1}$ be a sequence of independent random variables on some probability space $(\Omega, \mathcal{F}, P)$. Suppose there exist sequences $\{a_n\}_{n\ge 1}$, $\{x_n\}_{n\ge 1}$ such that $a_n \uparrow \infty$, $x_n \uparrow \infty$ and for each $k < \infty$, $\lim_n P(S_n \le a_n x_k) \equiv F(x_k)$ exists and is $< 1$. Show that $\limsup_{n\to\infty} \frac{S_n}{a_n} = +\infty$ a.s.

7.17 (a) Let $\{X_i\}_{i=1}^{n}$ be random variables on a probability space $(\Omega, \mathcal{F}, P)$ and let $P(X_1, X_2, \ldots, X_n)^{-1}(\cdot)$ be dominated by the product measure $\mu \times \mu \times \cdots \times \mu$, where $\mu$ is a σ-finite measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, with Radon-Nikodym derivative $f(x_1, x_2, \ldots, x_n)$. Show that $\{X_i\}_{i=1}^{n}$ are independent w.r.t. $P$ iff $f(x_1, x_2, \ldots, x_n) \equiv \prod_{i=1}^{n} h_i(x_i)$ for all $(x_1, x_2, \ldots, x_n) \in \mathbb{R}^n$, where for each $i$, $h_i : \mathbb{R} \to \mathbb{R}$ is Borel measurable.
(b) Use (a) to show that if $(X_1, X_2)$ has an absolutely continuous distribution with density $f(x_1, x_2)$, then $X_1$ and $X_2$ are independent iff $f(x_1, x_2) = f_1(x_1) f_2(x_2)$, where $f_i(\cdot)$ is the density of $X_i$.
(c) Using (a) or otherwise, conclude that if $X_i$, $i = 1, 2$, are both discrete random variables, then $X_1$ and $X_2$ are independent iff $P(X_1 = a, X_2 = b) = P(X_1 = a) P(X_2 = b)$ for all $a$ and $b$.

7.18 Let $\{X_n\}_{n\ge 1}$ be a sequence of independent random variables such that for $n \ge 1$, $P(X_n = 1) = \frac{1}{n} = 1 - P(X_n = 0)$. Show that $X_n \longrightarrow_p 0$ but not w.p. 1.

7.19 Let $(\Omega, \mathcal{F}, P)$ be a probability space.
(a) Suppose there exist events $A_1, A_2, \ldots, A_k$ that are independent with $0 < P(A_i) < 1$, $i = 1, 2, \ldots, k$. Show that $|\Omega| \ge 2^k$, where for any set $A$, $|A|$ is the number of elements in $A$.
(b) Let $\{X_i\}_{i=1}^{k}$ be independent random variables such that $X_i$ takes $n_i$ distinct values with positive probability. Show that $|\Omega| \ge \prod_{i=1}^{k} n_i$.

(c) Show that there exists a probability space $(\Omega, \mathcal{F}, P)$ such that $|\Omega| = 2^k$ and $k$ independent events $A_1, A_2, \ldots, A_k$ in $\mathcal{F}$ such that $0 < P(A_i) < 1$, $i = 1, 2, \ldots, k$.

7.20 (a) Let $\Omega \equiv \{(x_1, x_2) : x_1, x_2 \in \mathbb{R}, x_1^2 + x_2^2 \le 1\}$ be the unit disc in $\mathbb{R}^2$. Let $\mathcal{F} \equiv \mathcal{B}(\Omega)$, the Borel σ-algebra in $\Omega$, and $P$ = normalized Lebesgue measure, i.e., $P(A) \equiv \frac{m(A)}{\pi}$, $A \in \mathcal{F}$. For $\omega = (x_1, x_2)$ let $X_1(\omega) = x_1$, $X_2(\omega) = x_2$, and let $(R(\omega), \theta(\omega))$ be the polar representation of $\omega$. Show that the random variables $R$ and $\theta$ are independent but $X_1$ and $X_2$ are not.
(b) Formulate and establish an extension of the above to the unit sphere in $\mathbb{R}^3$.

7.21 Let $X_1, X_2, X_3$ be iid random variables such that $P(X_1 = x) = 0$ for all $x \in \mathbb{R}$.
(a) Show that for any permutation $\sigma$ of $(1, 2, 3)$,

\[ P\big(X_{\sigma(1)} > X_{\sigma(2)} > X_{\sigma(3)}\big) = \frac{1}{3!}. \]

(b) Show that for any $i = 1, 2, 3$,

\[ P\Big(X_i = \max_{1 \le j \le 3} X_j\Big) = \frac{1}{3}. \]

(c) State and prove a generalization of (a) and (b) to random variables $\{X_i : 1 \le i \le n\}$ such that the joint distribution of $(X_1, X_2, \ldots, X_n)$ is the same as that of $(X_{\sigma(1)}, X_{\sigma(2)}, \ldots, X_{\sigma(n)})$ for any permutation $\sigma$ of $\{1, 2, \ldots, n\}$ and $P(X_1 = x) = 0$ for all $x \in \mathbb{R}$.

7.22 Let $f, g : \mathbb{R} \to \mathbb{R}$ be monotone nondecreasing. Show that for any random variable $X$,

\[ Ef(X)g(X) \ge Ef(X)\,Eg(X), \]

provided all the expectations exist.
(Hint: Let $Y$ be independent of $X$ with the same distribution. Note that $Z = \big(f(X) - f(Y)\big)\big(g(X) - g(Y)\big) \ge 0$ w.p. 1.)

7.23 Let $X_1, X_2, \ldots, X_n$ be random variables on some probability space $(\Omega, \mathcal{F}, P)$. Show that if $P(X_1, X_2, \ldots, X_n)^{-1}(\cdot) \ll m_n$, the Lebesgue measure on $\mathbb{R}^n$, then for each $i$, $P X_i^{-1}(\cdot) \ll m$. Give an example to show that the converse is not true. Show also that if $P(X_1, X_2, \ldots, X_n)^{-1}(\cdot) \ll m_n$, then $\{X_1, X_2, \ldots, X_n\}$ are independent iff $f_{(X_1, X_2, \ldots, X_n)}(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f_{X_i}(x_i)$, where the $f$'s are the respective pdfs.

8 Laws of Large Numbers

When measuring a physical quantity such as the mass of an object, it is commonly believed that the average of several measurements is more reliable than a single one. Similarly, in applications of statistical inference when estimating a population mean $\mu$, a random sample $\{X_1, X_2, \ldots, X_n\}$ of size $n$ is drawn from the population, and the sample average $\bar{X}_n \equiv \frac{1}{n} \sum_{i=1}^{n} X_i$ is used as an estimator for the parameter $\mu$. This is based on the idea that as $n$ gets large, $\bar{X}_n$ will be close to $\mu$ in some suitable sense. In many time-evolving physical systems $\{f(t) : 0 \le t < \infty\}$, where $f(t)$ is an element in the phase space $S$, "time averages" of the form $\frac{1}{T} \int_0^T h(f(t))\,dt$ (where $h$ is a bounded function on $S$) converge, as $T$ gets large, to the "space average" of the form $\int_S h(x)\,\pi(dx)$ for some appropriate measure $\pi$ on $S$. The above three are examples of a general phenomenon known as the law of large numbers. This chapter is devoted to a systematic development of this topic for sequences of independent random variables and also to some important refinements of the law of large numbers.

8.1 Weak laws of large numbers

Let $\{Z_n\}_{n\ge 1}$ be a sequence of random variables on a probability space $(\Omega, \mathcal{F}, P)$. Recall that the sequence $\{Z_n\}_{n\ge 1}$ is said to converge in probability to a random variable $Z$ if for each $\epsilon > 0$,

\[ \lim_{n\to\infty} P(|Z_n - Z| \ge \epsilon) = 0. \tag{1.1} \]

This is written as $Z_n \longrightarrow_p Z$. The sequence $\{Z_n\}_{n\ge 1}$ is said to converge with probability one or almost surely (a.s.) to $Z$ if there exists a set $A$ in $\mathcal{F}$ such that $P(A) = 1$ and for all $\omega$ in $A$,

\[ \lim_{n\to\infty} Z_n(\omega) = Z(\omega). \tag{1.2} \]

This is written as $Z_n \to Z$ w.p. 1 or $Z_n \to Z$ a.s.

Definition 8.1.1: A sequence $\{X_n\}_{n\ge 1}$ of random variables on a probability space $(\Omega, \mathcal{F}, P)$ is said to obey the weak law of large numbers (WLLN) with normalizing sequences of real numbers $\{a_n\}_{n\ge 1}$ and $\{b_n\}_{n\ge 1}$ if

\[ \frac{S_n - a_n}{b_n} \longrightarrow_p 0 \quad \text{as } n \to \infty, \tag{1.3} \]

where $S_n = \sum_{i=1}^{n} X_i$ for $n \ge 1$.

The following theorem says that if $\{X_n\}_{n\ge 1}$ is a sequence of iid random variables with $EX_1^2 < \infty$, then it obeys the weak law of large numbers with $a_n = nEX_1$ and $b_n = n$.

Theorem 8.1.1: Let $\{X_n\}_{n\ge 1}$ be a sequence of iid random variables such that $EX_1^2 < \infty$. Then

\[ \bar{X}_n \equiv \frac{X_1 + \ldots + X_n}{n} \longrightarrow_p EX_1. \tag{1.4} \]

Proof: By Chebychev's inequality, for any $\epsilon > 0$,

\[ P(|\bar{X}_n - EX_1| > \epsilon) \le \frac{\mathrm{Var}(\bar{X}_n)}{\epsilon^2} = \frac{1}{\epsilon^2} \cdot \frac{\sigma^2}{n}, \tag{1.5} \]

where $\sigma^2 = \mathrm{Var}(X_1)$. Since $\frac{\sigma^2}{n\epsilon^2} \to 0$ as $n \to \infty$, (1.4) follows. □

Corollary 8.1.2: Let $\{X_n\}_{n\ge 1}$ be a sequence of iid Bernoulli($p$) random variables, i.e., $P(X_1 = 1) = p = 1 - P(X_1 = 0)$. Let

\[ \hat{p}_n = \frac{\#\{i : 1 \le i \le n, X_i = 1\}}{n}, \quad n \ge 1, \tag{1.6} \]

where for a finite set $A$, $\#A$ denotes the number of elements in $A$. Then $\hat{p}_n \longrightarrow_p p$.

Proof: Check that $EX_1 = p$ and $\hat{p}_n = \bar{X}_n$. □

This says that one can estimate the probability $p$ of getting a "head" of a coin by tossing it $n$ times and calculating the proportion of "heads." This is also the basis of public opinion polls. Since the proof of Theorem 8.1.1 depended only on Chebychev's inequality, the following generalization is immediate (Problem 8.1).
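The bound (1.5) can be checked numerically in the coin-tossing setting of Corollary 8.1.2. The following is a minimal sketch, not part of the original text: the function names `tail_prob` and `chebyshev_bound` and the choices $p = 0.3$, $\epsilon = 0.05$ are illustrative assumptions. It computes $P(|\hat{p}_n - p| > \epsilon)$ exactly from the Binomial($n$, $p$) distribution of $n\hat{p}_n$ and compares it with $p(1-p)/(n\epsilon^2)$.

```python
from math import exp, lgamma, log

def binom_pmf(n, k, p):
    """P(Binomial(n, p) = k), computed in log space to avoid overflow."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))
    return exp(log_pmf)

def tail_prob(n, p, eps):
    """Exact P(|p_hat_n - p| > eps), where n * p_hat_n ~ Binomial(n, p)."""
    return sum(binom_pmf(n, k, p) for k in range(n + 1) if abs(k / n - p) > eps)

def chebyshev_bound(n, p, eps):
    """Right side of (1.5) with sigma^2 = Var(X_1) = p(1 - p)."""
    return p * (1 - p) / (n * eps ** 2)

if __name__ == "__main__":
    p, eps = 0.3, 0.05  # illustrative values, not from the text
    for n in (100, 400, 1600):
        print(n, tail_prob(n, p, eps), chebyshev_bound(n, p, eps))
```

In this example the exact probability falls well below the Chebychev bound, which is consistent with the bound being sufficient for the WLLN but far from sharp.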

Theorem 8.1.3: Let $\{X_n\}_{n\ge 1}$ be a sequence of random variables on a probability space such that

(i) $EX_n^2 < \infty$ for all $n \ge 1$,

(ii) $EX_i X_j = (EX_i)(EX_j)$ for all $i \ne j$ (i.e., $\{X_n\}_{n\ge 1}$ are uncorrelated),

(iii) $\frac{1}{n^2} \sum_{i=1}^{n} \sigma_i^2 \to 0$ as $n \to \infty$, where $\sigma_i^2 = \mathrm{Var}(X_i)$, $i \ge 1$.

Then

\[ \bar{X}_n - \bar{\mu}_n \longrightarrow_p 0, \tag{1.7} \]

where $\bar{\mu}_n \equiv \frac{1}{n} \sum_{i=1}^{n} EX_i$.

Corollary 8.1.4: Let $\{X_n\}_{n\ge 1}$ satisfy (i) and (ii) of the above theorem and let the sequence $\{\sigma_n^2\}_{n\ge 1}$ be bounded. Let $\bar{\mu}_n \equiv \frac{1}{n} \sum_{i=1}^{n} EX_i \to \mu$ as $n \to \infty$. Then $\bar{X}_n \longrightarrow_p \mu$.

An Application to Real Analysis

Let $f : [0, 1] \to \mathbb{R}$ be a continuous function. K. Weierstrass showed that $f$ can be approximated uniformly over $[0, 1]$ by polynomials. S.N. Bernstein constructed a special class of such polynomials. A proof of Bernstein's result using the WLLN (Theorem 8.1.1) is given below.

Theorem 8.1.5: Let $f : [0, 1] \to \mathbb{R}$ be a continuous function. Let

\[ B_{n,f}(x) \equiv \sum_{r=0}^{n} f\Big(\frac{r}{n}\Big) \binom{n}{r} x^r (1 - x)^{n-r}, \quad 0 \le x \le 1, \tag{1.8} \]

be the Bernstein polynomial of order $n$ for the function $f$. Then

\[ \lim_{n\to\infty} \sup\big\{|f(x) - B_{n,f}(x)| : 0 \le x \le 1\big\} = 0. \]

Proof: Since $f$ is continuous on the closed and bounded interval $[0, 1]$, it is uniformly continuous and hence for any $\epsilon > 0$, there exists a $\delta_\epsilon > 0$ such that

\[ |x - y| < \delta_\epsilon \Rightarrow |f(x) - f(y)| < \epsilon. \tag{1.9} \]

Fix $x$ in $[0, 1]$. Let $\{X_n\}_{n\ge 1}$ be a sequence of iid Bernoulli($x$) random variables. Let $\hat{p}_n$ be as in (1.6). Then $B_{n,f}(x) = Ef(\hat{p}_n)$. Hence,

\[
\begin{aligned}
|f(x) - B_{n,f}(x)| &\le E|f(\hat{p}_n) - f(x)| \\
&= E\big\{|f(\hat{p}_n) - f(x)|\, I(|\hat{p}_n - x| < \delta_\epsilon)\big\} + E\big\{|f(\hat{p}_n) - f(x)|\, I(|\hat{p}_n - x| \ge \delta_\epsilon)\big\} \\
&\le \epsilon + 2\|f\|\, P(|\hat{p}_n - x| \ge \delta_\epsilon),
\end{aligned}
\]

where $\|f\| = \sup\{|f(x)| : 0 \le x \le 1\}$. But by Chebychev's inequality,

\[ P(|\hat{p}_n - x| \ge \delta_\epsilon) \le \frac{1}{\delta_\epsilon^2} \mathrm{Var}(\hat{p}_n) = \frac{x(1 - x)}{n\delta_\epsilon^2} \le \frac{1}{4n\delta_\epsilon^2} \quad \text{for all } 0 \le x \le 1. \]

Thus, $\sup\{|f(x) - B_{n,f}(x)| : 0 \le x \le 1\} \le \epsilon + 2\|f\| \frac{1}{4n\delta_\epsilon^2}$. Letting $n \to \infty$ first and then $\epsilon \downarrow 0$ completes the proof. □
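To see the uniform convergence in Theorem 8.1.5 at work, the Bernstein polynomial (1.8) can be evaluated directly. The sketch below is illustrative only: the test function $f(x) = |x - 1/2|$, the evaluation grid, and the helper names `bernstein` and `sup_error` are assumptions made here, and the maximum over a finite grid only approximates the supremum over $[0, 1]$.

```python
from math import comb

def bernstein(f, n, x):
    """B_{n,f}(x) = sum_r f(r/n) * C(n, r) * x^r * (1 - x)^(n - r), as in (1.8)."""
    return sum(f(r / n) * comb(n, r) * x ** r * (1 - x) ** (n - r)
               for r in range(n + 1))

def sup_error(f, n, grid_size=201):
    """Largest |f(x) - B_{n,f}(x)| over an evenly spaced grid on [0, 1]."""
    points = [i / (grid_size - 1) for i in range(grid_size)]
    return max(abs(f(x) - bernstein(f, n, x)) for x in points)

def kink(x):
    """Continuous on [0, 1] but not differentiable at 1/2."""
    return abs(x - 0.5)

if __name__ == "__main__":
    for n in (10, 40, 160):
        print(n, sup_error(kink, n))
```

Since $B_{n,f}(x) = Ef(\hat{p}_n)$, taking $f(x) = x$ returns $E\hat{p}_n = x$, which gives a quick check of the implementation.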

8.2 Strong laws of large numbers

Definition 8.2.1: A sequence $\{X_n\}_{n\ge 1}$ of random variables on a probability space $(\Omega, \mathcal{F}, P)$ is said to obey the strong law of large numbers (SLLN) with normalizing sequences of real numbers $\{a_n\}_{n\ge 1}$ and $\{b_n\}_{n\ge 1}$ if

\[ \frac{S_n - a_n}{b_n} \to 0 \quad \text{as } n \to \infty \quad \text{w.p. 1,} \tag{2.1} \]

where $S_n = \sum_{i=1}^{n} X_i$ for $n \ge 1$.

The following theorem says that if $\{X_n\}_{n\ge 1}$ is a sequence of iid random variables with $EX_1^4 < \infty$, then the strong law of large numbers holds with $a_n = nEX_1$ and $b_n = n$. This result is referred to as Borel's SLLN.

Theorem 8.2.1: (Borel's SLLN). Let $\{X_n\}_{n\ge 1}$ be a sequence of iid random variables such that $EX_1^4 < \infty$. Then

\[ \bar{X}_n \equiv \frac{X_1 + X_2 + \ldots + X_n}{n} \to EX_1 \quad \text{w.p. 1.} \tag{2.2} \]

Proof: Fix $\epsilon > 0$ and let $A_n \equiv \{|\bar{X}_n - EX_1| \ge \epsilon\}$, $n \ge 1$. To establish (2.2), by Proposition 7.2.3 (a), it suffices to show that

\[ \sum_{n=1}^{\infty} P(A_n) < \infty. \tag{2.3} \]

By Markov's inequality,

\[ P(A_n) \le \frac{E|\bar{X}_n - EX_1|^4}{\epsilon^4}. \tag{2.4} \]

Let $Y_i = X_i - EX_1$ for $i \ge 1$. Since the $X_i$'s are independent, it is easy to check that

\[
\begin{aligned}
E|\bar{X}_n - EX_1|^4 &= \frac{1}{n^4} E\Big(\sum_{i=1}^{n} Y_i\Big)^4 \\
&= \frac{1}{n^4} \Big[ nEY_1^4 + 3n(n - 1)(EY_1^2)^2 \Big] \\
&= O(n^{-2}).
\end{aligned}
\]

By (2.4) this implies (2.3). □
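The fourth-moment identity used in the proof above, $E(\sum_{i=1}^{n} Y_i)^4 = nEY_1^4 + 3n(n-1)(EY_1^2)^2$ for iid mean-zero $Y_i$, is left as "easy to check." As a sanity check, the sketch below verifies it by exact enumeration for small $n$. The two-point distribution ($Y = -1$ w.p. $2/3$, $Y = 2$ w.p. $1/3$) and the function names are assumptions chosen here only because they give $EY = 0$ with exact rational arithmetic.

```python
from fractions import Fraction
from itertools import product

# Hypothetical mean-zero two-point distribution used only for this check.
VALUES = (-1, 2)
PROBS = (Fraction(2, 3), Fraction(1, 3))

def moment(k):
    """E Y^k for the two-point distribution above."""
    return sum(p * v ** k for v, p in zip(VALUES, PROBS))

def fourth_moment_of_sum(n):
    """E (Y_1 + ... + Y_n)^4 by enumerating all outcomes of n iid copies."""
    total = Fraction(0)
    for outcome in product(range(len(VALUES)), repeat=n):
        prob = Fraction(1)
        s = 0
        for idx in outcome:
            prob *= PROBS[idx]
            s += VALUES[idx]
        total += prob * s ** 4
    return total

def formula(n):
    """n E Y^4 + 3 n (n - 1) (E Y^2)^2, the expression in the proof."""
    return n * moment(4) + 3 * n * (n - 1) * moment(2) ** 2

if __name__ == "__main__":
    for n in range(1, 7):
        print(n, fourth_moment_of_sum(n), formula(n))
```

The cross terms involving $EY_i$ vanish because the $Y_i$ are centered, which is why only the $n$ terms $EY_i^4$ and the $3n(n-1)$ terms $EY_i^2 EY_j^2$ ($i \ne j$) survive.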

The following two results are easy consequences of the above theorem.

Corollary 8.2.2: Let $\{X_n\}_{n\ge 1}$ be a sequence of iid random variables that are bounded, i.e., there exists a $C < \infty$ such that $P(|X_1| \le C) = 1$. Then

\[ \bar{X}_n \to EX_1 \quad \text{w.p. 1.} \]

Corollary 8.2.3: Let $\{X_n\}_{n\ge 1}$ be a sequence of iid Bernoulli($p$) random variables. Then

\[ \hat{p}_n \equiv \frac{\#\{i : 1 \le i \le n, X_i = 1\}}{n} \to p \quad \text{w.p. 1.} \tag{2.5} \]

An application of the above result yields the following theorem on the uniform convergence of the empirical cdf to the true cdf.

Theorem 8.2.4: (Glivenko-Cantelli). Let $\{X_n\}_{n\ge 1}$ be a sequence of iid random variables with a common cdf $F(\cdot)$. Let $F_n(\cdot)$, the empirical cdf based on $\{X_1, X_2, \ldots, X_n\}$, be defined by

\[ F_n(x) \equiv \frac{1}{n} \sum_{j=1}^{n} I(X_j \le x), \quad x \in \mathbb{R}. \tag{2.6} \]

Then,

\[ \tilde{\Delta}_n \equiv \sup_x |F_n(x) - F(x)| \to 0 \quad \text{w.p. 1.} \tag{2.7} \]

Remark 8.2.1: Note that by applying Corollary 8.2.3 to the sequence of Bernoulli random variables $\{Y_n \equiv I(X_n \le x)\}_{n\ge 1}$, one may conclude that $F_n(x) \to F(x)$ w.p. 1 for each fixed $x$. So the main thrust of this theorem is the uniform convergence on $\mathbb{R}$ of $F_n$ to $F$ w.p. 1. It can be shown that (2.7) holds for sequences $\{X_n\}_{n\ge 1}$ that are identically distributed and only pairwise independent. The proof is based on Etemadi's SLLN (Theorem 8.2.7) below.

The proof of Theorem 8.2.4 makes use of the following two lemmas.

Lemma 8.2.5: (Scheffe's theorem: A generalized version). Let $(\Omega, \mathcal{F}, \mu)$ be a measure space and $\{f_n\}_{n\ge 1}$ and $f$ be nonnegative $\mu$-integrable functions such that as $n \to \infty$, (i) $f_n \to f$ a.e. ($\mu$) and (ii) $\int f_n\,d\mu \to \int f\,d\mu$. Then $\int |f - f_n|\,d\mu \to 0$ as $n \to \infty$.

Proof: See Theorem 2.5.4. □

For any bounded monotone function $H : \mathbb{R} \to \mathbb{R}$, define

\[ H(\infty) \equiv \lim_{x \uparrow \infty} H(x), \quad H(-\infty) \equiv \lim_{x \downarrow -\infty} H(x). \]

Lemma 8.2.6: (Pólya's theorem). Let $\{G_n\}_{n\ge 1}$ and $G$ be a collection of bounded nondecreasing functions on $\mathbb{R} \to \mathbb{R}$ such that $G(\cdot)$ is continuous on $\mathbb{R}$ and $G_n(x) \to G(x)$ for all $x$ in $D \cup \{-\infty, +\infty\}$, where $D$ is dense in $\mathbb{R}$. Then $\Delta_n \equiv \sup\{|G_n(x) - G(x)| : x \in \mathbb{R}\} \to 0$. That is, $G_n \to G$ uniformly on $\mathbb{R}$.

Proof: Fix $\epsilon > 0$. By the definitions of $G(\infty)$ and $G(-\infty)$, there exist $C_1$ and $C_2$ in $D$ such that

\[ G(C_1) - G(-\infty) < \epsilon, \quad \text{and} \quad G(\infty) - G(C_2) < \epsilon. \tag{2.8} \]

Since $G(\cdot)$ is continuous, it is uniformly continuous on $[C_1, C_2]$ and so there exists a $\delta > 0$ such that

\[ x, y \in [C_1, C_2],\ |x - y| < \delta \Rightarrow |G(x) - G(y)| < \epsilon. \tag{2.9} \]

Also, there exist points $a_1 = C_1 < a_2 < \ldots < a_k = C_2$, $1 < k < \infty$, in $D$ such that

\[ \max\{(a_{i+1} - a_i) : 1 \le i \le k - 1\} < \delta. \]

Let $a_0 = -\infty$, $a_{k+1} = \infty$. By the convergence of $G_n(\cdot)$ to $G(\cdot)$ on $D \cup \{-\infty, \infty\}$,

\[ \Delta_{n1} \equiv \max\{|G_n(a_i) - G(a_i)| : 0 \le i \le k + 1\} \to 0 \tag{2.10} \]

as $n \to \infty$. Now note that for any $x$ in $[a_i, a_{i+1}]$, $1 \le i \le k - 1$, by the monotonicity of $G_n(\cdot)$ and $G(\cdot)$, and by (2.9) and (2.10),

\[
\begin{aligned}
G_n(x) - G(x) &\le G_n(a_{i+1}) - G(a_i) \\
&\le G_n(a_{i+1}) - G(a_{i+1}) + G(a_{i+1}) - G(a_i) \\
&\le \Delta_{n1} + \epsilon,
\end{aligned}
\]

and similarly, $G_n(x) - G(x) \ge -\Delta_{n1} - \epsilon$. Thus

\[ \sup\{|G_n(x) - G(x)| : a_1 \le x \le a_k\} \le \Delta_{n1} + \epsilon. \tag{2.11} \]

For $x < a_1$, by (2.8) and (2.10),

\[
\begin{aligned}
|G_n(x) - G(x)| &\le |G_n(x) - G_n(-\infty)| + |G_n(-\infty) - G(-\infty)| + |G(-\infty) - G(x)| \\
&\le \big(G_n(a_1) - G_n(-\infty)\big) + |G_n(-\infty) - G(-\infty)| + \epsilon \\
&\le |G_n(a_1) - G(a_1)| + |G(a_1) - G(-\infty)| + 2|G_n(-\infty) - G(-\infty)| + \epsilon \\
&\le 3\Delta_{n1} + 2\epsilon.
\end{aligned}
\]

Similarly, for $x > a_k$,

\[ |G_n(x) - G(x)| \le 3\Delta_{n1} + 2\epsilon. \]

Combining the above with (2.11) yields $\Delta_n \le 3\Delta_{n1} + 2\epsilon$. By (2.10),

\[ \limsup_{n\to\infty} \Delta_n \le 2\epsilon, \]

and $\epsilon > 0$ being arbitrary, the proof is complete. □

Proof of Theorem 8.2.4: First note that $\tilde{\Delta}_n = \sup_{x \in \mathbb{Q}} |F_n(x) - F(x)|$ and hence, it is a random variable. Let $B \equiv \{b_j : j \in J\}$ be the set of jump discontinuity points of $F$ with the corresponding jump sizes $\{p_j : j \in J\}$, where $J$ is a subset of $\mathbb{N}$. Let $p = \sum_{j \in J} p_j$. Note that

\[
\begin{aligned}
F_n(x) &= \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x) \\
&= \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x, X_i \in B) + \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x, X_i \notin B) \\
&= F_{nd}(x) + F_{nc}(x), \ \text{say.}
\end{aligned}
\tag{2.12}
\]

Then, $F_{nd}(x) = \sum_{j \in J} \hat{p}_{nj} I(b_j \le x)$, where

\[ \hat{p}_{nj} = \frac{\#\{i : 1 \le i \le n, X_i = b_j\}}{n}. \]

Let $\hat{p}_n = \sum_{j \in J} \hat{p}_{nj} = \frac{1}{n} \cdot \#\{i : 1 \le i \le n, X_i \in B\}$. By Corollary 8.2.3, for each $j \in J$, $\hat{p}_{nj} \to p_j$ w.p. 1 and $\hat{p}_n \to p$ w.p. 1. Since $B$ is countable, there exists a set $A_0$ in $\mathcal{F}$ such that $P(A_0) = 1$ and for all $\omega$ in $A_0$, $\hat{p}_{nj} \to p_j$ for all $j \in J$ and $\sum_{j \in J} \hat{p}_{nj} = \hat{p}_n \to p = \sum_{j \in J} p_j$.

By Lemma 8.2.5 (applied with $\mu$ being the counting measure on the set $J$), it follows that for $\omega$ in $A_0$,

\[ \sum_{j \in J} |\hat{p}_{nj} - p_j| \to 0. \tag{2.13} \]

Let $F_d(x) \equiv \sum_{j \in J} p_j I(b_j \le x)$, $x \in \mathbb{R}$. Then,

\[ \sup_{x \in \mathbb{R}} |F_{nd}(x) - F_d(x)| \le \sum_{j \in J} |\hat{p}_{nj} - p_j|, \tag{2.14} \]

which $\to 0$ as $n \to \infty$ for all $\omega$ in $A_0$, by (2.13).

Next, let $F_c(x) \equiv F(x) - F_d(x)$, $x \in \mathbb{R}$. Then, it is easy to check that $F_c(\cdot)$ is continuous and nondecreasing on $\mathbb{R}$, $F_c(-\infty) = 0$ and $F_c(\infty) = 1 - p$. Again, by Corollary 8.2.3, there exists a set $A_1$ in $\mathcal{F}$ such that $P(A_1) = 1$ and for all $\omega$ in $A_1$,

\[ F_{nc}(x) \to F_c(x) \quad \text{for all rational } x \text{ in } \mathbb{R} \]

and $F_{nc}(\infty) \equiv 1 - \hat{p}_n \to 1 - p = F_c(\infty)$. Also, $F_{nc}(-\infty) = 0 = F_c(-\infty)$. So by Lemma 8.2.6, with $D = \mathbb{Q}$, for $\omega$ in $A_1$,

\[ \sup_{x \in \mathbb{R}} |F_{nc}(x) - F_c(x)| \to 0 \quad \text{as } n \to \infty. \tag{2.15} \]

Since $P(A_0 \cap A_1) = 1$, the theorem follows from (2.12)–(2.15). □
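The statement (2.7) can be illustrated by simulation. The following sketch is an illustration under stated assumptions, not part of the proof: it takes $F$ to be the Uniform(0,1) cdf (so $F$ is continuous and the discrete part $F_d$ above is absent), uses Python's pseudo-random generator with a fixed seed, and introduces the helper names `sup_distance` and `uniform_cdf`. For a continuous $F$, the supremum of $|F_n - F|$ is attained at the order statistics, which is what the code exploits.

```python
import random

def sup_distance(sample, cdf):
    """sup_x |F_n(x) - F(x)| for a continuous cdf F.

    F_n jumps from i/n to (i + 1)/n at the (i + 1)-th smallest observation,
    so it suffices to compare F with both values at each order statistic.
    """
    xs = sorted(sample)
    n = len(xs)
    return max(
        max((i + 1) / n - cdf(x), cdf(x) - i / n)
        for i, x in enumerate(xs)
    )

def uniform_cdf(x):
    """cdf of the Uniform(0, 1) distribution."""
    return min(1.0, max(0.0, x))

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so that runs are reproducible
    for n in (100, 1_000, 10_000):
        sample = [rng.random() for _ in range(n)]
        print(n, sup_distance(sample, uniform_cdf))
```

For a cdf with jumps, the comparison at the order statistics alone is not sufficient, which mirrors the separate treatment of $F_{nd}$ and $F_{nc}$ in the proof.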

Borel's SLLN for iid random variables requires that $E|X_1|^4 < \infty$. Kolmogorov (1956) improved on this significantly by using his "3-series" theorem and reduced the moment condition to $E|X_1| < \infty$. More recently, N. Etemadi (1981) improved this further by assuming only that the $\{X_n\}_{n\ge 1}$ are pairwise independent and identically distributed with $E|X_1| < \infty$. More precisely, he proved the following.

Theorem 8.2.7: (Etemadi's SLLN). Let $\{X_n\}_{n\ge 1}$ be a sequence of pairwise independent and identically distributed random variables with $E|X_1| < \infty$. Then

\[ \bar{X}_n \to EX_1 \quad \text{w.p. 1.} \tag{2.16} \]

Proof: The main steps in the proof are

(I) reduction to the nonnegative case,

(II) proof of convergence of $\bar{Y}_n$ along a geometrically growing subsequence using the Borel-Cantelli lemma and Chebychev's inequality, where $\bar{Y}_n$ is the average of certain truncated versions of $X_1, \ldots, X_n$, and extending the convergence from the geometric subsequence to the full sequence.

Step I: Since the $\{X_n\}_{n\ge 1}$ are pairwise independent and identically distributed with $E|X_1| < \infty$, it follows that $\{X_n^+\}_{n\ge 1}$ and $\{X_n^-\}_{n\ge 1}$ are both sequences of pairwise independent and identically distributed nonnegative random variables with $EX_1^+ < \infty$ and $EX_1^- < \infty$. Also, since

\[ \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i = \frac{1}{n} \sum_{i=1}^{n} X_i^+ - \frac{1}{n} \sum_{i=1}^{n} X_i^-, \]

it is enough to prove the theorem under the assumption that the $X_i$'s are nonnegative.

Step II: Now let the $X_i$'s be nonnegative and let

\[ Y_i = X_i I(X_i \le i), \quad i \ge 1. \]

Then,

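\[
\begin{aligned}
\sum_{i=1}^{\infty} P(X_i \ne Y_i) &= \sum_{i=1}^{\infty} P(X_i > i) \\
&= \sum_{i=1}^{\infty} P(X_1 > i) \le \sum_{i=1}^{\infty} \int_{i-1}^{i} P(X_1 > t)\,dt \\
&= \int_0^{\infty} P(X_1 > t)\,dt \\
&= EX_1 < \infty.
\end{aligned}
\]

Hence, by the Borel-Cantelli lemma,

\[ P(X_i \ne Y_i, \ \text{infinitely often}) = 0. \]

This implies that w.p. 1, $X_i = Y_i$ for all but finitely many $i$'s and hence, it suffices to show that

\[ \bar{Y}_n \equiv \frac{1}{n} \sum_{i=1}^{n} Y_i \to EX_1 \quad \text{w.p. 1.} \tag{2.17} \]

Next, $EY_i = E(X_i I(X_i \le i)) = E(X_1 I(X_1 \le i)) \to EX_1$ (by the MCT) and hence

\[ E\bar{Y}_n = \frac{1}{n} \sum_{i=1}^{n} EY_i \to EX_1 \quad \text{as } n \to \infty. \tag{2.18} \]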
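The two facts just used, $\sum_{i \ge 1} P(X_1 > i) \le EX_1$ and $E\bar{Y}_n \to EX_1$, can be made concrete in a case where everything is available in closed form. The sketch below assumes $X_1$ is standard exponential (an illustrative choice, not from the text), for which $P(X_1 > i) = e^{-i}$, $EX_1 = 1$, and $E(X_1 I(X_1 \le i)) = 1 - (i + 1)e^{-i}$; the function names are hypothetical.

```python
from math import e, exp

def tail_sum(terms=200):
    """sum_{i=1}^{terms} P(X_1 > i) for X_1 ~ Exp(1), i.e. sum of e^{-i}."""
    return sum(exp(-i) for i in range(1, terms + 1))

def truncated_mean(i):
    """E[X_1 I(X_1 <= i)] = integral_0^i x e^{-x} dx = 1 - (i + 1) e^{-i}."""
    return 1.0 - (i + 1) * exp(-i)

def mean_of_ybar(n):
    """E(Ybar_n) = (1/n) * sum_{i=1}^{n} E Y_i, the quantity in (2.18)."""
    return sum(truncated_mean(i) for i in range(1, n + 1)) / n

if __name__ == "__main__":
    print("tail sum:", tail_sum(), " 1/(e-1):", 1 / (e - 1), " E X_1: 1")
    for n in (10, 100, 1000):
        print(n, mean_of_ybar(n))
```

Here the tail sum equals $1/(e - 1) \approx 0.58$, below $EX_1 = 1$ as the integral comparison predicts, and $E\bar{Y}_n$ increases to 1 at rate $O(1/n)$.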

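Suppose for the moment that for each fixed $1 < \rho < \infty$, it is shown that

\[ \bar{Y}_{n_k} \to EX_1 \quad \text{as } k \to \infty \quad \text{w.p. 1,} \tag{2.19} \]

where $n_k = \lfloor \rho^k \rfloor$ = the greatest integer less than or equal to $\rho^k$, $k \in \mathbb{N}$. Then, since the $Y_i$'s are nonnegative, for any $n$ and $k$ satisfying $\rho^k \le n < \rho^{k+1}$, one gets

\[
\begin{aligned}
& \frac{1}{n} \sum_{i=1}^{n_k} Y_i \le \bar{Y}_n = \frac{1}{n} \sum_{j=1}^{n} Y_j \le \frac{1}{n} \sum_{i=1}^{n_{k+1}} Y_i \\
\Longrightarrow\ & \frac{n_k}{n} \bar{Y}_{n_k} \le \bar{Y}_n \le \frac{n_{k+1}}{n} \bar{Y}_{n_{k+1}} \\
\Longrightarrow\ & \frac{1}{\rho} \bar{Y}_{n_k} \le \bar{Y}_n \le \rho \bar{Y}_{n_{k+1}}.
\end{aligned}
\]

From (2.19), it follows that

\[ \frac{1}{\rho} EX_1 \le \liminf_{n\to\infty} \bar{Y}_n \le \limsup_{n\to\infty} \bar{Y}_n \le \rho EX_1 \quad \text{w.p. 1.} \]

Since this is true for each $1 < \rho < \infty$, by taking $\rho = 1 + \frac{1}{r}$ for $r = 1, 2, \ldots$, it follows that

\[ EX_1 \le \liminf_{n\to\infty} \bar{Y}_n \le \limsup_{n\to\infty} \bar{Y}_n \le EX_1 \quad \text{w.p. 1,} \]

establishing (2.17). It now remains to prove (2.19). By (2.18), it is enough to show that

\[ \bar{Y}_{n_k} - E\bar{Y}_{n_k} \to 0 \quad \text{as } k \to \infty, \ \text{w.p. 1.} \tag{2.20} \]

By Chebychev's inequality and the pairwise independence of the variables $\{Y_n\}_{n\ge 1}$, for any $\epsilon > 0$,

\[
\begin{aligned}
P(|\bar{Y}_{n_k} - E\bar{Y}_{n_k}| > \epsilon) &\le \frac{1}{\epsilon^2} \mathrm{Var}(\bar{Y}_{n_k}) = \frac{1}{\epsilon^2} \frac{1}{n_k^2} \sum_{i=1}^{n_k} \mathrm{Var}(Y_i) \\
&\le \frac{1}{\epsilon^2} \frac{1}{n_k^2} \sum_{i=1}^{n_k} EY_i^2.
\end{aligned}
\]

Thus,

\[
\begin{aligned}
\sum_{k=1}^{\infty} P(|\bar{Y}_{n_k} - E\bar{Y}_{n_k}| > \epsilon) &\le \frac{1}{\epsilon^2} \sum_{k=1}^{\infty} \frac{1}{n_k^2} \sum_{i=1}^{n_k} EY_i^2 \\
&= \frac{1}{\epsilon^2} \sum_{i=1}^{\infty} EY_i^2 \Big( \sum_{k : n_k \ge i} \frac{1}{n_k^2} \Big).
\end{aligned}
\tag{2.21}
\]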

Since $n_k = \lfloor \rho^k \rfloor > \rho^{k-1}$ for $1 < \rho < \infty$,

\[ \sum_{k : n_k \ge i} \frac{1}{n_k^2} \le \sum_{k : \rho^{k-1} \ge i} \frac{1}{\rho^{2(k-1)}} \le \frac{C_1}{i^2} \tag{2.22} \]

for some constant $C_1$, $0 < C_1 < \infty$. Next, since the $X_i$'s are identically distributed,

\[
\begin{aligned}
\sum_{i=1}^{\infty} \frac{EY_i^2}{i^2} &= \sum_{i=1}^{\infty} \frac{EX_1^2 I(0 \le X_1 \le i)}{i^2} \\
&= \sum_{i=1}^{\infty} \sum_{j=1}^{i} \frac{EX_1^2 I(j - 1 < X_1 \le j)}{i^2} \\
&= \sum_{j=1}^{\infty} EX_1^2 I(j - 1 < X_1 \le j) \Big( \sum_{i=j}^{\infty} i^{-2} \Big) \\
&\le \sum_{j=1}^{\infty} j\, EX_1 I(j - 1 < X_1 \le j) \cdot C_2\, j^{-1} \\
&= C_2\, EX_1 < \infty,
\end{aligned}
\tag{2.23}
\]

for some constant $C_2$, $0 < C_2 < \infty$. Now (2.21)–(2.23) imply that

\[ \sum_{k=1}^{\infty} P(|\bar{Y}_{n_k} - E\bar{Y}_{n_k}| > \epsilon) < \infty \]

for each $\epsilon > 0$. By the Borel-Cantelli lemma and Proposition 7.2.3 (a), (2.20) follows and the proof is complete. □

The following corollary is immediate from the above theorem.

Corollary 8.2.8: (Extension to the vector case). Let $\{X_n = (X_{n1}, \ldots, X_{nk})\}_{n\ge 1}$ be a sequence of $k$-dimensional random vectors defined on a probability space $(\Omega, \mathcal{F}, P)$ such that for each $i$, $1 \le i \le k$, the sequence $\{X_{ni}\}_{n\ge 1}$ is pairwise independent and identically distributed with $E|X_{1i}| < \infty$. Let $\mu = (EX_{11}, EX_{12}, \ldots, EX_{1k})$ and $f : \mathbb{R}^k \to \mathbb{R}$ be continuous at $\mu$. Then

(i) $\bar{X}_n \equiv (\bar{X}_{n1}, \bar{X}_{n2}, \ldots, \bar{X}_{nk}) \to \mu$ w.p. 1, where $\bar{X}_{ni} = \frac{1}{n} \sum_{j=1}^{n} X_{ji}$ for $1 \le i \le k$.

(ii) $f(\bar{X}_n) \to f(\mu)$ w.p. 1.

Example 8.2.1: Let $(X_n, Y_n)$, $n = 1, 2, \ldots$ be a sequence of bivariate iid random vectors with $EX_1^2 < \infty$, $EY_1^2 < \infty$. Then the sample correlation

coefficient $\hat{\rho}_n$, defined by

\[ \hat{\rho}_n \equiv \frac{\frac{1}{n} \sum_{i=1}^{n} X_i Y_i - \bar{X}_n \bar{Y}_n}{\sqrt{\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2 \cdot \frac{1}{n} \sum_{i=1}^{n} (Y_i - \bar{Y}_n)^2}}, \]

is a strongly consistent estimator of the population correlation coefficient $\rho$, defined by

\[ \rho = \frac{\mathrm{Cov}(X_1, Y_1)}{\sqrt{\mathrm{Var}(X_1)\mathrm{Var}(Y_1)}}, \]

i.e., $\hat{\rho}_n \to \rho$ w.p. 1. This follows from the above corollary by taking $f : \mathbb{R}^5 \to \mathbb{R}$ to be

\[ f(t_1, t_2, t_3, t_4, t_5) = \begin{cases} \dfrac{t_5 - t_1 t_2}{\sqrt{(t_3 - t_1^2)(t_4 - t_2^2)}}, & \text{for } t_3 > t_1^2,\ t_4 > t_2^2, \\[2mm] 0, & \text{otherwise,} \end{cases} \]

and the vector $(X_{n1}, X_{n2}, \ldots, X_{n5})$ to be

\[ X_{n1} = X_n, \quad X_{n2} = Y_n, \quad X_{n3} = X_n^2, \quad X_{n4} = Y_n^2, \quad X_{n5} = X_n Y_n. \]

Corollary 8.2.9: (Extension to the pairwise m-dependent case). Let $\{X_n\}_{n\ge 1}$ be a sequence of random variables on a probability space $(\Omega, \mathcal{F}, P)$ such that for an integer $m$, $1 \le m < \infty$, and for each $i$, $1 \le i \le m$, the random variables $\{X_i, X_{i+m}, X_{i+2m}, \ldots\}$ are identically distributed and pairwise independent with $E|X_i| < \infty$. Then

\[ \bar{X}_n \to \frac{1}{m} \sum_{i=1}^{m} EX_i \quad \text{w.p. 1.} \]

The proof is left as an exercise (Problem 8.2). For an application of the above result to a discussion on normal numbers, see Problem 8.15.

Example 8.2.2: (IID Monte Carlo). Let $(S, \mathcal{S}, \pi)$ be a probability space, $f \in L^1(S, \mathcal{S}, \pi)$ and $\lambda = \int_S f\,d\pi$. Let $\{X_n\}_{n\ge 1}$ be a sequence of iid $S$-valued random variables with distribution $\pi$. Then, the IID Monte Carlo approximation to $\lambda$ is defined as

\[ \hat{\lambda}_n \equiv \frac{1}{n} \sum_{i=1}^{n} f(X_i). \]

Note that by the SLLN, $\hat{\lambda}_n \to \lambda$ w.p. 1.

An extension of this to the case where $\{X_i\}_{i\ge 1}$ is a Markov chain, known as Markov chain Monte Carlo (MCMC), is discussed in Chapter 14.
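The IID Monte Carlo estimator of Example 8.2.2 takes only a few lines of code. The sketch below is a minimal illustration under assumptions made here: $S = (0, 1)$, $\pi$ the Uniform(0,1) distribution, $f(x) = x^2$ (so that $\lambda = 1/3$), a seeded pseudo-random generator standing in for iid draws from $\pi$, and a hypothetical function name `iid_monte_carlo`.

```python
import random

def iid_monte_carlo(f, sampler, n):
    """lambda_hat_n = (1/n) * sum_{i=1}^{n} f(X_i), with X_i drawn by sampler()."""
    return sum(f(sampler()) for _ in range(n)) / n

if __name__ == "__main__":
    rng = random.Random(2024)  # fixed seed for reproducibility

    def square(x):
        return x * x

    # Target: lambda = integral of x^2 over (0, 1) = 1/3.
    for n in (100, 10_000, 1_000_000):
        print(n, iid_monte_carlo(square, rng.random, n))
```

The SLLN guarantees convergence but says nothing about the rate. When $f(X_1)$ has a finite variance, the error is typically of order $n^{-1/2}$, by the central limit theorem mentioned in Remark 7.2.3.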

8.3 Series of independent random variables

Let $\{X_n\}_{n\ge 1}$ be a sequence of independent random variables on a probability space $(\Omega, \mathcal{F}, P)$. The goal of this section is to investigate the convergence of the infinite series $\sum_{n=1}^{\infty} X_n$, i.e., that of the partial sum sequence, $S_n = \sum_{i=1}^{n} X_i$, $n \ge 1$. The main result of this section is Kolmogorov's 3-series theorem (Theorem 8.3.5). The following two inequalities play a fundamental role in the proof of this theorem and also have other important applications.

Theorem 8.3.1: Let $\{X_j : 1 \le j \le n\}$ be a collection of independent random variables. Let $S_i = \sum_{j=1}^{i} X_j$ for $1 \le i \le n$.

(i) (Kolmogorov's first inequality). Suppose that $EX_j = 0$ and $EX_j^2 < \infty$, $1 \le j \le n$. Then, for $0 < \lambda < \infty$,

\[ P\Big(\max_{1 \le i \le n} |S_i| \ge \lambda\Big) \le \frac{\mathrm{Var}(S_n)}{\lambda^2}. \tag{3.1} \]

(ii) (Kolmogorov's second inequality). Suppose that there exists a constant $C \in (0, \infty)$ such that $P(|X_j - EX_j| \le C) = 1$ for $1 \le j \le n$. Then, for any $0 < \lambda < \infty$,

\[ P\Big(\max_{1 \le i \le n} |S_i| \le \lambda\Big) \le \frac{(2C + 4\lambda)^2}{\mathrm{Var}(S_n)}. \]

Proof: Let $A = \{\max_{1 \le i \le n} |S_i| \ge \lambda\}$ and let

\[
\begin{aligned}
A_1 &= \{|S_1| \ge \lambda\}, \\
A_j &= \{|S_1| < \lambda, |S_2| < \lambda, \ldots, |S_{j-1}| < \lambda, |S_j| \ge \lambda\}
\end{aligned}
\]

for $j = 2, \ldots, n$. Then $A_1, \ldots, A_n$ are disjoint, $\bigcup_{j=1}^{n} A_j = A$ and $P(A) = \sum_{j=1}^{n} P(A_j)$. Since $EX_j = 0$ for all $j$,

\[
\begin{aligned}
\mathrm{Var}(S_n) = ES_n^2 &\ge E(S_n^2 I_A) = \sum_{j=1}^{n} E(S_n^2 I_{A_j}) \\
&= \sum_{j=1}^{n} E\Big[ \big( (S_n - S_j)^2 + S_j^2 + 2(S_n - S_j) S_j \big) I_{A_j} \Big] \\
&\ge \sum_{j=1}^{n} E(S_j^2 I_{A_j}) + 2 \sum_{j=1}^{n-1} E\big( (S_n - S_j) S_j I_{A_j} \big).
\end{aligned}
\tag{3.2}
\]

Note that since $\{X_1, \ldots, X_n\}$ are independent, $(S_n - S_j) \equiv \sum_{i=j+1}^{n} X_i$ and $S_j I_{A_j}$ are independent for $1 \le j \le n - 1$. Hence,

\[ E\big( (S_n - S_j) S_j I_{A_j} \big) = E(S_n - S_j)\, E(S_j I_{A_j}) = 0. \]


Also, on $A_j$, $S_j^2 \ge \lambda^2$. Therefore, by (3.2),
$$\mathrm{Var}(S_n) \ge \sum_{j=1}^n \lambda^2 P(A_j) = \lambda^2 P(A).$$
This establishes (i). For a proof of (ii), see Chung (1974), p. 117. □
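Kolmogorov's first inequality (3.1) can be checked empirically. The sketch below is an illustration only, under assumed choices that are not in the text: the $X_j$ are iid $\pm 1$ steps (mean 0, variance 1, so $\mathrm{Var}(S_n) = n$), with $n = 50$ and $\lambda = 15$. The helper name `max_abs_partial_sum` is hypothetical.

```python
import random

def max_abs_partial_sum(n, rng):
    """Return max_{1 <= i <= n} |S_i| for a walk with iid +/-1 steps."""
    s, m = 0, 0
    for _ in range(n):
        s += rng.choice((-1, 1))
        m = max(m, abs(s))
    return m

rng = random.Random(1)
n, lam, trials = 50, 15, 10_000
hits = sum(max_abs_partial_sum(n, rng) >= lam for _ in range(trials))
empirical = hits / trials      # Monte Carlo estimate of P(max_i |S_i| >= lam)
bound = n / lam ** 2           # Var(S_n) / lam^2, since Var(S_n) = n here
print(empirical, bound)
```

The empirical frequency should fall below the bound; in this setting the bound is far from tight, which is typical for second-moment inequalities.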

Remark 8.3.1: Recall that Chebychev's inequality asserts that for $\lambda > 0$, $P(|S_n| \ge \lambda) \le \frac{\mathrm{Var}(S_n)}{\lambda^2}$, and thus Kolmogorov's first inequality is significantly stronger. Kolmogorov's first inequality has an extension, known as Doob's maximal inequality, to a class of dependent random variables called martingales (see Chapter 13).

The next inequality is due to P. Levy.

Definition 8.3.1: For any random variable $X$, a real number $c$ is called a median of $X$ if
$$P(X < c) \le \frac{1}{2} \le P(X \le c). \tag{3.3}$$
Such a $c$ always exists. It can be verified that $c_0 \equiv \inf\{x : P(X \le x) \ge \frac12\}$ is a median. Note that if $c$ is a median of $X$ and $\alpha$ is a real number, then $\alpha c$ is a median of $\alpha X$ and $\alpha + c$ is a median of $\alpha + X$. Further, if $P(|X| \ge \alpha) < \frac12$ for some $\alpha > 0$, then any median $c$ of $X$ satisfies $|c| \le \alpha$ (Problem 8.4).

Theorem 8.3.2: (Levy's inequality). Let $X_j$, $j = 1, \ldots, n$, be independent random variables. Let $S_j = \sum_{i=1}^j X_i$, and let $c_{j,n}$ be a median of $(S_j - S_n)$ for $1 \le j \le n$, where $c_{n,n}$ is set equal to 0. Then, for any $0 < \lambda < \infty$,

(i) $P\big(\max_{1\le j\le n} (S_j - c_{j,n}) \ge \lambda\big) \le 2P(S_n \ge \lambda)$;

(ii) $P\big(\max_{1\le j\le n} |S_j - c_{j,n}| \ge \lambda\big) \le 2P(|S_n| \ge \lambda)$.

Proof: Let
$$A_j = \{S_j - S_n \le c_{j,n}\} \ \text{ for } 1 \le j \le n, \qquad B = \Big\{\max_{1\le j\le n} (S_j - c_{j,n}) \ge \lambda\Big\},$$
$$B_1 = \{S_1 - c_{1,n} \ge \lambda\}, \qquad B_j = \{S_i - c_{i,n} < \lambda \text{ for } 1 \le i \le j-1,\ S_j - c_{j,n} \ge \lambda\}$$
for $j = 2, \ldots, n$. Then $B_1, \ldots, B_n$ are disjoint and $\bigcup_{j=1}^n B_j = B$. Since $X_1, \ldots, X_n$ are independent, $A_j$ and $B_j$ are independent for each $j = 1, 2, \ldots, n$. Also for each $j$, $A_j = \{S_j - c_{j,n} \le S_n\}$, and hence on $A_j \cap B_j$, $S_n \ge \lambda$ holds. Thus,
$$P(S_n \ge \lambda) \ge \sum_{j=1}^n P(A_j \cap B_j) = \sum_{j=1}^n P(A_j)P(B_j) \ge \frac12 P\Big(\bigcup_{j=1}^n B_j\Big) = \frac12 P(B),$$
proving part (i). Now part (ii) follows by applying part (i) to both $\{X_i\}_{i=1}^n$ and $\{-X_i\}_{i=1}^n$. □

Recall that if $\{Y_n\}_{n\ge1}$ is a sequence of random variables, then convergence of $\{Y_n\}_{n\ge1}$ w.p. 1 implies that $\{Y_n\}_{n\ge1}$ converges in probability as well. A remarkable result of P. Levy is that if $\{S_n\}_{n\ge1}$ is the sequence of partial sums of independent random variables and $\{S_n\}_{n\ge1}$ converges in probability, then $\{S_n\}_{n\ge1}$ must converge w.p. 1 as well. The proof of this uses Levy's inequality proved above.

Theorem 8.3.3: Let $\{X_n\}_{n\ge1}$ be a sequence of independent random variables. Let $S_n = \sum_{j=1}^n X_j$ for $1 \le n < \infty$ and let $\{S_n\}_{n\ge1}$ converge in probability to a random variable $S$. Then $S_n \to S$ w.p. 1.

Proof: Recall that a sequence $\{x_n\}_{n\ge1}$ of real numbers converges iff it is Cauchy iff $\delta_n \equiv \sup\{|x_k - x_\ell| : k, \ell \ge n\} \to 0$ as $n \to \infty$. Let
$$\tilde\Delta_n \equiv \sup\{|S_k - S_\ell| : k, \ell \ge n\} \quad \text{and} \quad \Delta_n \equiv \sup\{|S_k - S_n| : k \ge n\}.$$

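Then, $\tilde\Delta_n \le 2\Delta_n$ and $\tilde\Delta_n$ is decreasing in $n$. Suppose it is shown that
$$\Delta_n \longrightarrow_p 0. \tag{3.4}$$
Then, $\tilde\Delta_n \longrightarrow_p 0$ and hence there is a subsequence $\{n_k\}_{k\ge1}$ such that $\tilde\Delta_{n_k} \to 0$ as $k \to \infty$ w.p. 1. Since $\tilde\Delta_n$ is decreasing in $n$, this implies that $\tilde\Delta_n \to 0$ w.p. 1. Thus it suffices to establish (3.4). Fix $0 < \epsilon < 1$. Let
$$S_{n,\ell} = S_{n+\ell} - S_n \ \text{ for } \ell \ge 1, \qquad \Delta_{n,k} = \max\{|S_{n,\ell}| : 1 \le \ell \le k\},\ k \ge 1.$$
Note that for each $n \ge 1$, $\{\Delta_{n,k}\}_{k\ge1}$ is a nondecreasing sequence, $\lim_{k\to\infty} \Delta_{n,k} = \Delta_n$ and hence, for any $n \ge 1$,
$$P(\Delta_n > \epsilon) = \lim_{k\to\infty} P(\Delta_{n,k} > \epsilon). \tag{3.5}$$
Levy's inequality (Theorem 8.3.2) will now be used to bound $P(\Delta_{n,k} > \epsilon)$ uniformly in $k$. Since $S_n \longrightarrow_p S$, for any $\eta > 0$, there exists an $n_0 \ge 1$ such that for all $n \ge n_0$,
$$P(|S_n - S| > \eta) < \eta.$$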


This implies that for all $k \ge \ell \ge n_0$,
$$P(|S_k - S_\ell| > 2\eta) < 2\eta. \tag{3.6}$$
If $0 < \eta < \frac14$, then the medians of $S_k - S_\ell$ for $k \ge \ell \ge n_0$ are bounded by $2\eta$. Hence, for $n \ge n_0$ and $k \ge 1$, applying Levy's inequality (i.e., the above theorem) to $\{X_i : n+1 \le i \le n+k\}$,
$$\begin{aligned}
P(\Delta_{n,k} > \epsilon) &= P\Big(\max_{1\le j\le k} |S_{n,j}| > \epsilon\Big) \\
&\le P\Big(\max_{1\le j\le k} |S_{n,j} - c_{n+j,n+k}| \ge \epsilon - 2\eta\Big) \\
&\le 2P(|S_{n,k}| \ge \epsilon - 2\eta).
\end{aligned}$$
Now, choosing $0 < \eta < \frac{\epsilon}{4}$, (3.6) yields $P(\Delta_{n,k} > \epsilon) < 4\eta < \epsilon$ for all $n \ge n_0$, $k \ge 1$. Then, by (3.5), $P(\Delta_n > \epsilon) \le \epsilon$ for all $n \ge n_0$. Hence, (3.4) holds. □

The following result on convergence of infinite series of independent random variables is an immediate consequence of the above theorem.

Theorem 8.3.4: (Khinchine-Kolmogorov's 1-series theorem). Let $\{X_n\}_{n\ge1}$ be a sequence of independent random variables on a probability space $(\Omega, \mathcal{F}, P)$ such that $EX_n = 0$ for all $n \ge 1$ and $\sum_{n=1}^\infty EX_n^2 < \infty$. Then $S_n \equiv \sum_{j=1}^n X_j$ converges in mean square and almost surely, as $n \to \infty$.

Proof: For any $n, k \in \mathbb{N}$,
$$E(S_n - S_{n+k})^2 = \mathrm{Var}(S_n - S_{n+k}) = \sum_{j=n+1}^{n+k} \mathrm{Var}(X_j) = \sum_{j=n+1}^{n+k} EX_j^2,$$
by independence. Since $\sum_{n=1}^\infty EX_n^2 < \infty$, $\{S_n\}_{n\ge1}$ is a Cauchy sequence in $L^2(\Omega, \mathcal{F}, P)$ and hence converges in mean square to some $S$ in $L^2(\Omega, \mathcal{F}, P)$. This implies that $S_n \longrightarrow_p S$, and by the above theorem $S_n \to S$ w.p. 1. □

Remark 8.3.2: It is possible to give another proof of the above theorem using Kolmogorov's inequality. See Problem 8.5.

Theorem 8.3.5: (Kolmogorov's 3-series theorem). Let $\{X_n\}_{n\ge1}$ be a sequence of independent random variables on a probability space $(\Omega, \mathcal{F}, P)$ and let $S_n = \sum_{i=1}^n X_i$, $n \ge 1$. Then the sequence $\{S_n\}_{n\ge1}$ converges w.p. 1 iff the following 3 series converge for some $0 < c < \infty$:

(i) $\sum_{i=1}^\infty P(|X_i| > c) < \infty$,

(ii) $\sum_{i=1}^\infty E(Y_i)$ converges,

(iii) $\sum_{i=1}^\infty \mathrm{Var}(Y_i) < \infty$,


where $Y_i = X_i I(|X_i| \le c)$, $i \ge 1$.

Proof: (Sufficiency). By (i) and the Borel-Cantelli lemma, $P(X_i \ne Y_i \text{ i.o.}) = P(|X_i| > c \text{ i.o.}) = 0$. Hence $\{S_n\}_{n\ge1}$ converges w.p. 1 iff $\{T_n\}_{n\ge1}$ converges w.p. 1, where $T_n = \sum_{i=1}^n Y_i$, $n \ge 1$. By (iii) and the 1-series theorem, the sequence $\{\sum_{i=1}^n (Y_i - EY_i)\}_{n\ge1}$ converges w.p. 1. Hence, by (ii), $\{T_n\}_{n\ge1}$ converges w.p. 1 and hence $\{S_n\}_{n\ge1}$ converges w.p. 1.

(Necessity). Suppose $\{S_n\}_{n\ge1}$ converges w.p. 1. Fix $0 < c < \infty$ and let $Y_i = X_i I(|X_i| \le c)$, $i \ge 1$. Since $\{S_n\}_{n\ge1}$ converges w.p. 1, $X_n \to 0$ w.p. 1. Hence, w.p. 1, $|X_i| \le c$ for all but a finite number of $i$'s. If $A_i \equiv \{X_i \ne Y_i\} = \{|X_i| > c\}$, then by the second Borel-Cantelli lemma,
$$\sum_{i=1}^\infty P(A_i) < \infty,$$
establishing (i).

To establish (ii) and (iii), the following construction and the second inequality of Kolmogorov will be used. Without loss of generality, assume that there is another sequence $\{\tilde X_n\}_{n\ge1}$ of random variables on the same probability space $(\Omega, \mathcal{F}, P)$ such that (a) $\{\tilde X_n\}_{n\ge1}$ are independent, (b) $\{\tilde X_n\}_{n\ge1}$ is independent of $\{X_n\}_{n\ge1}$, and (c) for each $n \ge 1$, $X_n =_d \tilde X_n$, i.e., $X_n$ and $\tilde X_n$ have the same distribution. Let
$$\tilde Y_i = \tilde X_i I(|\tilde X_i| \le c), \qquad Z_i = Y_i - \tilde Y_i, \quad i \ge 1,$$
$$T_n \equiv \sum_{i=1}^n Y_i, \qquad \tilde T_n \equiv \sum_{i=1}^n \tilde Y_i, \qquad \text{and} \qquad R_n \equiv \sum_{i=1}^n Z_i, \quad n \ge 1.$$
Since $\{S_n \equiv \sum_{i=1}^n X_i\}_{n\ge1}$ converges w.p. 1, and $X_i = Y_i$ for all but a finite number of $i$, $\{T_n\}_{n\ge1}$ converges w.p. 1. Since $\{Y_i\}_{i\ge1}$ and $\{\tilde Y_i\}_{i\ge1}$ have the same distribution on $\mathbb{R}^\infty$, $\{\tilde T_n\}_{n\ge1}$ converges w.p. 1. Thus the difference sequence $\{R_n\}_{n\ge1}$ converges w.p. 1. Next, note that $\{Z_n\}_{n\ge1}$ are independent random variables with mean 0 and $\{Z_n\}_{n\ge1}$ are uniformly bounded by $2c$. Applying Kolmogorov's second inequality (Theorem 8.3.1 (ii)) to $\{Z_j : m < j \le m+n\}$ yields
$$P\Big(\max_{m<j\le m+n} |R_j - R_m| \le \epsilon\Big) \le \frac{(2c + 4\epsilon)^2}{\sum_{i=m+1}^{m+n} \mathrm{Var}(Z_i)} \tag{3.7}$$
for all $m \ge 1$, $n \ge 1$, $0 < \epsilon < \infty$.


Let $\Delta_m \equiv \max_{m<j} |R_j - R_m|$. Let $n \to \infty$ in (3.7) to conclude that
$$P(\Delta_m \le \epsilon) \le \frac{(2c + 4\epsilon)^2}{\sum_{i=m+1}^\infty \mathrm{Var}(Z_i)}.$$
Now suppose (iii) does not hold. Then, since $Y_i$ and $\tilde Y_i$ are iid, $\mathrm{Var}(Z_i) = 2\mathrm{Var}(Y_i)$ for all $i \ge 1$, and thus $\sum_{i=m+1}^\infty \mathrm{Var}(Z_i) = \infty$ and hence $P(\Delta_m \le \epsilon) = 0$ for each $m \ge 1$, $0 < \epsilon < \infty$. This implies that $P(\Delta_m > \epsilon) = 1$ for each $\epsilon > 0$ and hence that $\Delta_m = \infty$ w.p. 1 for all $m \ge 1$. This contradicts the convergence w.p. 1 of the sequence $\{R_n\}_{n\ge1}$. Hence (iii) holds. By the 1-series theorem, $\{\sum_{i=1}^n (Y_i - EY_i)\}_{n\ge1}$ converges w.p. 1. Since $\{\sum_{i=1}^n Y_i\}_{n\ge1}$ converges w.p. 1, $\sum_{i=1}^\infty EY_i$ converges, establishing (ii). This completes the proof of the necessity part and of the theorem. □

Remark 8.3.3: To go from the convergence w.p. 1 of $\{R_n\}_{n\ge1}$ to (iii), it suffices to show that if (iii) fails, then for each $0 < A < \infty$, $P(|R_n| \le A) \to 0$ as $n \to \infty$. This can be established without the use of (3.7) but using the central limit theorem (to be proved later in Chapter 11), which shows that if $\mathrm{Var}(R_n) \to \infty$, then
$$P\bigg(\frac{R_n}{\sqrt{\mathrm{Var}(R_n)}} \le x\bigg) \to \Phi(x) \equiv \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x e^{-t^2/2}\,dt,$$
for all $x$ in $\mathbb{R}$. (Also see Billingsley (1995), p. 290.)
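As an illustration of the 1-series theorem (Theorem 8.3.4), consider the random-sign series $\sum_n \epsilon_n / n$ with iid signs $\epsilon_n = \pm 1$. This example is an assumption made here for illustration, not one from the text: $EX_n = 0$ and $\sum_n EX_n^2 = \sum_n n^{-2} < \infty$, so the partial sums should settle down. The sketch below prints the variance series and how far the partial sum moves between $N$ and $2N$ terms.

```python
import math
import random

rng = random.Random(2)
N = 5_000
# Assumed example: X_n = eps_n / n with eps_n = +/-1 equally likely,
# so E X_n = 0 and E X_n^2 = 1 / n^2.
partial, s = [], 0.0
for n in range(1, 2 * N + 1):
    s += rng.choice((-1.0, 1.0)) / n
    partial.append(s)

var_sum = sum(1.0 / n ** 2 for n in range(1, 2 * N + 1))   # partial sum of the variance series
tail_move = abs(partial[2 * N - 1] - partial[N - 1])        # |S_{2N} - S_N|
print(var_sum, math.pi ** 2 / 6, tail_move)
```

The quantity `tail_move` has variance $\sum_{n=N+1}^{2N} n^{-2}$, which is roughly $1/(2N)$, so it is small for large $N$; this mirrors the Cauchy-in-$L^2$ step in the proof.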

8.4 Kolmogorov and Marcinkiewicz-Zygmund SLLNs

For a sequence of independent and identically distributed random variables $\{X_n\}_{n\ge1}$, Kolmogorov showed that $\{X_n\}_{n\ge1}$ obeys the SLLN with $b_n = n$ iff $E|X_1| < \infty$. Marcinkiewicz and Zygmund generalized this result and proved a class of SLLNs for $\{X_n\}_{n\ge1}$ when $E|X_1|^p < \infty$ for some $p \in (0, 2)$. The proof uses Kolmogorov's 3-series theorem and some results from real analysis. This approach is to be contrasted with Etemadi's proof of the SLLN, which uses a decomposition of the random variables $\{X_n\}_{n\ge1}$ into positive and negative parts and uses monotonicity of the sum to establish almost sure convergence along a subsequence by an application of the Borel-Cantelli lemma. The alternative approach presented in this section is also useful for proving SLLNs for sums of independent random variables that are not necessarily identically distributed.

The next three are preparatory results for Theorem 8.4.4.

Lemma 8.4.1: (Abel's summation formula). Let $\{a_n\}_{n\ge1}$ and $\{b_n\}_{n\ge1}$ be two sequences of real numbers. Then, for all $n \ge 2$,
$$\sum_{j=1}^n a_j b_j = A_n b_n - \sum_{j=1}^{n-1} A_j (b_{j+1} - b_j) \tag{4.1}$$
where $A_k = \sum_{j=1}^k a_j$, $k \ge 1$.

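Proof: Let $A_0 = 0$. Then, $a_j = A_j - A_{j-1}$, $j \ge 1$. Hence,
$$\begin{aligned}
\sum_{j=1}^n a_j b_j &= \sum_{j=1}^n (A_j - A_{j-1}) b_j = \sum_{j=1}^n A_j b_j - \sum_{j=1}^n A_{j-1} b_j \\
&= \sum_{j=1}^n A_j b_j - \sum_{j=1}^{n-1} A_j b_{j+1},
\end{aligned}$$
yielding (4.1). □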
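Abel's summation formula (4.1) is a finite identity, so it can be verified directly on small sequences. The sketch below does this for two arbitrary integer sequences chosen here for illustration; the helper name `abel_rhs` is hypothetical.

```python
from itertools import accumulate

def abel_rhs(a, b):
    """Right side of (4.1): A_n b_n - sum_{j=1}^{n-1} A_j (b_{j+1} - b_j)."""
    A = list(accumulate(a))      # A[k-1] = a_1 + ... + a_k (0-based list storage)
    n = len(a)
    correction = sum(A[j] * (b[j + 1] - b[j]) for j in range(n - 1))
    return A[n - 1] * b[n - 1] - correction

a = [3, -1, 4, 1, -5, 9]
b = [2, 7, 1, 8, 2, 8]
lhs = sum(x * y for x, y in zip(a, b))   # sum_{j=1}^n a_j b_j
rhs = abel_rhs(a, b)
print(lhs, rhs)   # both equal 73
```

Integer inputs are used so that the two sides agree exactly, with no floating-point rounding.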

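Lemma 8.4.2: (Kronecker's lemma). Let $\{a_n\}_{n\ge1}$ and $\{b_n\}_{n\ge1}$ be sequences of real numbers such that $0 < b_n \uparrow \infty$ as $n \to \infty$ and $\sum_{j=1}^\infty a_j$ converges. Then,
$$\frac{1}{b_n} \sum_{j=1}^n a_j b_j \longrightarrow 0 \quad \text{as} \quad n \to \infty. \tag{4.2}$$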
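A numerical illustration of Kronecker's lemma, under choices assumed here and not taken from the text: $a_j = (-1)^{j+1}/\sqrt{j}$, whose series converges by the alternating series test, and $b_j = j$. The function name `kronecker_average` is hypothetical.

```python
import math

def kronecker_average(n):
    """(1/b_n) * sum_{j=1}^n a_j b_j with a_j = (-1)^(j+1) / sqrt(j) and b_j = j."""
    total = 0.0
    for j in range(1, n + 1):
        sign = 1.0 if j % 2 == 1 else -1.0
        a_j = sign / math.sqrt(j)
        total += a_j * j
    return total / n

values = {n: kronecker_average(n) for n in (100, 10_000, 100_000)}
print(values)
```

The weighted sum itself grows in absolute value like $\sqrt{n}/2$, but after division by $b_n = n$ the printed averages shrink toward 0, as (4.2) asserts.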

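Proof: Let $A_k = \sum_{j=1}^k a_j$, $A \equiv \sum_{j=1}^\infty a_j = \lim_{k\to\infty} A_k$ and $R_k = A - A_k$, $k \ge 1$. Then, by Lemma 8.4.1, for $n \ge 2$,
$$\begin{aligned}
\sum_{j=1}^n a_j b_j &= A_n b_n - \sum_{j=1}^{n-1} A_j (b_{j+1} - b_j) \\
&= A_n b_n - \sum_{j=1}^{n-1} (A - R_j)(b_{j+1} - b_j) \\
&= A_n b_n - A \sum_{j=1}^{n-1} (b_{j+1} - b_j) + \sum_{j=1}^{n-1} R_j (b_{j+1} - b_j) \\
&= A_n b_n - A b_n + A b_1 + \sum_{j=1}^{n-1} R_j (b_{j+1} - b_j) \\
&= -R_n b_n + A b_1 + \sum_{j=1}^{n-1} R_j (b_{j+1} - b_j).
\end{aligned} \tag{4.3}$$
Since $\sum_{n=1}^\infty a_n$ converges, $R_n \to 0$ as $n \to \infty$. Hence, given any $\epsilon > 0$, there exists $N = N_\epsilon > 1$ such that $|R_n| \le \epsilon$ for all $n \ge N$. Since $0 < b_n \uparrow \infty$, for all $n > N$,
$$\begin{aligned}
\bigg| b_n^{-1} \sum_{j=1}^{n-1} R_j (b_{j+1} - b_j) \bigg| &\le b_n^{-1} \sum_{j=1}^{N-1} |R_j|\,|b_{j+1} - b_j| + \epsilon\, b_n^{-1} \sum_{j=N}^{n-1} (b_{j+1} - b_j) \\
&\le b_n^{-1} \sum_{j=1}^{N-1} |R_j|\,|b_{j+1} - b_j| + \epsilon.
\end{aligned}$$
Now letting $n \to \infty$ and then letting $\epsilon \downarrow 0$ yields
$$\limsup_{n\to\infty} \bigg| b_n^{-1} \sum_{j=1}^{n-1} R_j (b_{j+1} - b_j) \bigg| = 0.$$
Hence, (4.2) follows from (4.3). □

Lemma 8.4.3: For any random variable $X$,
$$\sum_{n=1}^\infty P(|X| > n) \le E|X| \le \sum_{n=0}^\infty P(|X| > n). \tag{4.4}$$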

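Proof: For $n \ge 1$, let $A_n = \{n - 1 < |X| \le n\}$. Define the random variables
$$Y = \sum_{n=1}^\infty (n-1) I_{A_n} \quad \text{and} \quad Z = \sum_{n=1}^\infty n I_{A_n}.$$
Then, it is clear that $Y \le |X| \le Z$, so that
$$EY \le E|X| \le EZ. \tag{4.5}$$
Note that
$$\begin{aligned}
EY &= \sum_{n=1}^\infty (n-1) P(A_n) = \sum_{n=2}^\infty \sum_{j=1}^{n-1} P(A_n) \\
&= \sum_{j=1}^\infty \sum_{n=j+1}^\infty P(n - 1 < |X| \le n) = \sum_{j=1}^\infty P(|X| > j).
\end{aligned}$$
Similarly, one can show that $EZ = \sum_{j=0}^\infty P(|X| > j)$. Hence, (4.4) follows. □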
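The two-sided bound (4.4) can be checked in closed form for a specific distribution. The sketch below assumes, purely for illustration, that $X$ is Exponential(1), so that $P(|X| > n) = e^{-n}$ and $E|X| = 1$; the sums are truncated at 200 terms, where the remaining tail is negligible in floating point.

```python
import math

# Assumed example: X ~ Exponential(1), so P(|X| > n) = exp(-n) and E|X| = 1.
lower = sum(math.exp(-n) for n in range(1, 200))   # sum over n >= 1 of P(|X| > n)
upper = sum(math.exp(-n) for n in range(0, 200))   # sum over n >= 0 of P(|X| > n)
mean = 1.0
print(lower, mean, upper)
```

The two sums differ by exactly the $n = 0$ term $P(|X| > 0) = 1$, which shows that the sandwich in (4.4) always has width at most 1.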

Theorem 8.4.4: (Marcinkiewicz-Zygmund SLLNs). Let $\{X_n\}_{n\ge1}$ be a sequence of identically distributed random variables and let $p \in (0, 2)$. Write $S_n = \sum_{i=1}^n X_i$, $n \ge 1$.

(i) If $\{X_n\}_{n\ge1}$ are pairwise independent and
$$\frac{S_n - nc}{n^{1/p}} \quad \text{converges w.p. 1} \tag{4.6}$$
for some $c \in \mathbb{R}$, then $E|X_1|^p < \infty$.

(ii) Conversely, if $E|X_1|^p < \infty$ and $\{X_n\}_{n\ge1}$ are independent, then (4.6) holds with $c = EX_1$ for $p \in [1, 2)$ and with any $c \in \mathbb{R}$ for $p \in (0, 1)$.

Corollary 8.4.5: (Kolmogorov's SLLN). Let $\{X_n\}_{n\ge1}$ be a sequence of iid random variables. Then,
$$\frac{S_n - nc}{n} \to 0 \quad \text{w.p. 1}$$
for some $c \in \mathbb{R}$ iff $E|X_1| < \infty$, in which case, $c = EX_1$.

Thus, Kolmogorov's SLLN corresponds to the special case $p = 1$ of Theorem 8.4.4. Note that compared with the WLLN and Borel's SLLN of Sections 8.1 and 8.2, Kolmogorov's SLLN presents a significant improvement in the moment condition, i.e., it assumes the finiteness of only the first absolute moment. Further, both Kolmogorov's SLLN and the Marcinkiewicz-Zygmund SLLN are proved under minimal moment conditions, since the corresponding moment conditions are shown to be necessary.

Proof of Theorem 8.4.4: (i) Suppose that (4.6) holds for some $c \in \mathbb{R}$. Then,
$$\frac{X_n}{n^{1/p}} = \frac{S_n - S_{n-1}}{n^{1/p}} = \frac{S_n - nc}{n^{1/p}} - \frac{S_{n-1} - (n-1)c}{n^{1/p}} + \frac{c}{n^{1/p}} \to 0 \quad \text{as } n \to \infty, \text{ a.s.}$$
Hence, $P(|X_n / n^{1/p}| > 1 \text{ i.o.}) = 0$. By the second Borel-Cantelli lemma and by the pairwise independence of $\{X_n\}_{n\ge1}$, this implies
$$\sum_{n=1}^\infty P\bigg(\frac{|X_n|}{n^{1/p}} > 1\bigg) < \infty, \quad \text{i.e.,} \quad \sum_{n=1}^\infty P\big(|X_1|^p > n\big) < \infty.$$
Hence, by Lemma 8.4.3, $E|X_1|^p < \infty$.


To prove (ii), suppose that $E|X_1|^p < \infty$ for some $p \in (0, 2)$. For $1 \le p < 2$, w.l.o.g. assume that $EX_1 = 0$. Next, define the variables $Z_n = X_n I(|X_n|^p \le n)$, $n \ge 1$. Then, by Lemma 8.4.3,
$$\sum_{n=1}^\infty P(X_n \ne Z_n) = \sum_{n=1}^\infty P(|X_n|^p > n) = \sum_{n=1}^\infty P(|X_1|^p > n) \le E|X_1|^p < \infty.$$
Hence, by the Borel-Cantelli lemma,
$$P(X_n \ne Z_n \text{ i.o.}) = 0. \tag{4.7}$$
Note that, in view of (4.7), (4.6) holds with $c = 0$ if and only if
$$n^{-1/p} \sum_{i=1}^n Z_i \to 0 \quad \text{as} \quad n \to \infty, \quad \text{w.p. 1.} \tag{4.8}$$

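Note that for any $j \in \mathbb{N}$, $\theta > 1$ and $\beta \in (-\infty, 0) \setminus \{-1\}$,
$$\sum_{n=j}^\infty n^{-\theta} \le j^{-\theta} + \sum_{n=j+1}^\infty \int_{n-1}^n x^{-\theta}\,dx = j^{-\theta} + \frac{1}{\theta - 1} \cdot j^{-(\theta-1)} \le \frac{\theta}{\theta - 1} \cdot j^{-(\theta-1)} \tag{4.9}$$
and similarly,
$$\sum_{n=1}^j n^\beta \le \frac{\beta + j^{\beta+1}}{\beta + 1} \le \frac{\beta}{\beta+1}\, I(\beta < -1) + \frac{j^{\beta+1}}{\beta+1}\, I(-1 < \beta < 0). \tag{4.10}$$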

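Now,
$$\begin{aligned}
\sum_{n=1}^\infty \mathrm{Var}(Z_n / n^{1/p}) &\le \sum_{n=1}^\infty EX_1^2 I(|X_1|^p \le n) \cdot n^{-2/p} \\
&= \sum_{n=1}^\infty \sum_{j=1}^n EX_1^2 I(j - 1 < |X_1|^p \le j) \cdot n^{-2/p} \\
&= \sum_{j=1}^\infty \bigg(\sum_{n=j}^\infty n^{-2/p}\bigg) \cdot EX_1^2 I(j - 1 < |X_1|^p \le j) \\
&\le \frac{2}{2-p} \sum_{j=1}^\infty j^{-(\frac{2}{p} - 1)} \cdot EX_1^2 I(j - 1 < |X_1|^p \le j) \quad \text{(by (4.9))} \\
&\le \frac{2}{2-p} \sum_{j=1}^\infty j^{-(\frac{2}{p} - 1)} \cdot E|X_1|^p I(j - 1 < |X_1|^p \le j) \cdot (j^{1/p})^{2-p} \\
&= \frac{2}{2-p}\, E|X_1|^p < \infty.
\end{aligned}$$
Hence, by Theorem 8.3.4, $\sum_{n=1}^\infty (Z_n - EZ_n)/n^{1/p}$ converges w.p. 1. By Kronecker's lemma (viz. Lemma 8.4.2),
$$n^{-1/p} \sum_{j=1}^n (Z_j - EZ_j) \to 0 \quad \text{as} \quad n \to \infty, \quad \text{w.p. 1.} \tag{4.11}$$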

Now consider the case $p = 1$. In this case, $E|X_1| < \infty$ and by the DCT, $EZ_n = EX_1 I(|X_1| \le n) \to EX_1 = 0$ as $n \to \infty$. Hence, $n^{-1} \sum_{i=1}^n EZ_i \to 0$. Part (ii) of the theorem now follows from (4.8) and (4.11) for $p = 1$.

Next consider the case $p \in (0, 2)$, $p \ne 1$. Using (4.9) and (4.10), one can show (cf. Problem 8.12) that
$$n^{-1/p} \sum_{j=1}^n EZ_j \to 0 \quad \text{as} \quad n \to \infty. \tag{4.12}$$
Hence, by (4.8), (4.11), and (4.12), one gets (4.6) with $c = 0$ for $p \in (0, 2) \setminus \{1\}$. Finally, note that for $p \in (0, 1)$, and for any $c \in \mathbb{R}$,
$$\frac{S_n - nc}{n^{1/p}} = \frac{S_n}{n^{1/p}} - \frac{nc}{n^{1/p}} \to 0 \quad \text{as } n \to \infty, \text{ a.s.,}$$
whenever $S_n / n^{1/p} \to 0$ as $n \to \infty$, w.p. 1. Hence, (4.6) holds with an arbitrary $c \in \mathbb{R}$ for $p \in (0, 1)$. This completes the proof of part (ii) for $p \in (0, 2) \setminus \{1\}$ and hence of the theorem. □

The next result gives a SLLN for independent random variables that are not necessarily identically distributed.

Theorem 8.4.6: Let $\{X_n\}_{n\ge1}$ be a sequence of independent random variables. If $\sum_{n=1}^\infty E|X_n|^{\alpha_n} / n^{\alpha_n} < \infty$ for some $\alpha_n \in [1, 2]$, $n \ge 1$, then
$$n^{-1} \sum_{j=1}^n (X_j - EX_j) \to 0 \quad \text{as} \quad n \to \infty, \quad \text{w.p. 1.} \tag{4.13}$$


Proof: W.l.o.g. suppose that $EX_n = 0$ for all $n \ge 1$. Let $Y_n = X_n I(|X_n| \le n)/n$. Note that $|EY_n| = |n^{-1}(EX_n - EX_n I(|X_n| > n))| = n^{-1} |EX_n I(|X_n| > n)|$, $n \ge 1$. Since $1 \le \alpha_n \le 2$,
$$\sum_{n=1}^\infty \{P(|X_n| > n) + |EY_n|\} \le 2 \sum_{n=1}^\infty n^{-1} E|X_n| I(|X_n| > n) \le 2 \sum_{n=1}^\infty E|X_n|^{\alpha_n} / n^{\alpha_n} < \infty$$
and
$$\sum_{n=1}^\infty \mathrm{Var}(Y_n) \le \sum_{n=1}^\infty n^{-2} EX_n^2 I(|X_n| \le n) \le \sum_{n=1}^\infty n^{-\alpha_n} E|X_n|^{\alpha_n} < \infty.$$
Hence, by Kolmogorov's 3-series theorem, $\sum_{n=1}^\infty (X_n / n)$ converges w.p. 1. Now the theorem follows from Lemma 8.4.2. □

Corollary 8.4.7: Let $\{X_n\}_{n\ge1}$ be a sequence of independent random variables such that for some $\alpha \in [1, 2]$, $\sum_{n=1}^\infty (n^{-\alpha} E|X_n|^\alpha) < \infty$. Then (4.13) holds.
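Corollary 8.4.7 covers independent variables that are not identically distributed. The sketch below is an assumed example, not one from the text: $X_k \sim \mathrm{Uniform}(-k^{0.1}, k^{0.1})$, so $EX_k = 0$ and $EX_k^2 = k^{0.2}/3$, and with $\alpha = 2$ the series $\sum_k k^{-2} EX_k^2$ converges even though the variances grow without bound.

```python
import random

rng = random.Random(3)
n = 200_000
total = 0.0
for k in range(1, n + 1):
    half_width = k ** 0.1      # assumed scale: X_k ~ Uniform(-k^0.1, k^0.1), so E X_k = 0
    total += rng.uniform(-half_width, half_width)
average = total / n            # n^{-1} * sum_{j<=n} (X_j - E X_j)
print(average)
```

The printed average should be close to 0, consistent with (4.13); a faster-growing scale could violate the summability condition, in which case the corollary would not apply.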

8.5 Renewal theory

8.5.1 Definitions and basic properties

Let $\{X_n\}_{n\ge0}$ be a sequence of nonnegative random variables that are independent and, for $i \ge 1$, identically distributed with cdf $F$. Let $S_n = \sum_{i=0}^n X_i$ for $n \ge 0$. Imagine a system where a component in operation at time $t = 0$ lasts $X_0$ units of time and then is replaced by a new one that lasts $X_1$ units of time, which, at failure, is replaced by yet another new one that lasts $X_2$ units of time and so on. The sequence $\{S_n\}_{n\ge0}$ represents the sequence of epochs when 'renewal' takes place and is called a renewal sequence. Assume that $P(X_1 = 0) < 1$. Then, since $P(X_1 < \infty) = 1$, it follows that for each $n$, $P(S_n < \infty) = 1$ and $\lim_{n\to\infty} S_n = \infty$ w.p. 1 (Problem 8.16). Now define the counting process $\{N(t) : t \ge 0\}$ by the relation
$$N(t) = k \quad \text{if } S_{k-1} \le t < S_k \quad \text{for } k = 0, 1, 2, \ldots \tag{5.1}$$
where $S_{-1} = 0$. Thus $N(t)$ counts the number of renewals up to time $t$.


Definition 8.5.1: The stochastic process $\{N(t) : t \ge 0\}$ is called a renewal process with lifetime distribution $F$. The renewal sequence $\{S_n\}_{n\ge0}$ and the renewal process $\{N(t) : t \ge 0\}$ are called nondelayed or standard if $X_0$ has the same distribution as $X_1$ and are called delayed otherwise.

Since $P(X_1 \ge 0) = 1$, $\{S_n\}_{n\ge0}$ is nondecreasing in $n$ and for each $t \ge 0$, the event $\{N(t) = k\} = \{S_{k-1} \le t < S_k\}$ belongs to the $\sigma$-algebra $\sigma\{X_j : 0 \le j \le k\}$ and hence $N(t)$ is a random variable. Using the nontriviality hypothesis that $P(X_1 = 0) < 1$, it is shown below that for each $t > 0$, the random variable $N(t)$ has finite moments of all orders.

Proposition 8.5.1: Let $P(X_1 = 0) < 1$. Then there exist $0 < \lambda < 1$ (not depending on $t$) and a constant $C(t) \in (0, \infty)$ such that
$$P(N(t) > k) \le C(t)\lambda^k \quad \text{for all} \quad k > 0. \tag{5.2}$$

Proof: For $t > 0$, $k \in \mathbb{N}$, and $\theta > 0$,
$$\begin{aligned}
P(N(t) > k) &= P(S_k \le t) = P\big(e^{-\theta S_k} \ge e^{-\theta t}\big) \\
&\le e^{\theta t} E\big(e^{-\theta S_k}\big) \quad \text{(by Markov's inequality)} \\
&= e^{\theta t} E\big(e^{-\theta X_0}\big) \Big(E\big(e^{-\theta X_1}\big)\Big)^k.
\end{aligned}$$
By BCT, $\lim_{\theta\uparrow\infty} E(e^{-\theta X_1}) = P(X_1 = 0) < 1$. Hence, there exists a $\theta$ large such that $\lambda \equiv E(e^{-\theta X_1})$ is less than one, thus completing the proof. □

Corollary 8.5.2: There exists an $s_0 > 0$ such that the moment generating function (m.g.f.) $E(e^{sN(t)}) < \infty$ for all $s < s_0$ and $t \ge 0$.

Proof: From (5.2), for any $t > 0$, it follows that $P(N(t) = k) = O(\lambda^k)$ as $k \to \infty$ for some $0 < \lambda < 1$ and hence $E(e^{sN(t)}) = \sum_{k=0}^\infty (e^s)^k P(N(t) = k) < \infty$ for any $s$ such that $e^s\lambda < 1$, i.e., for all $s < s_0 \equiv -\log\lambda$. □

From (5.1), it follows that for $t > 0$,
$$S_{N(t)-1} \le t < S_{N(t)} \ \Rightarrow\ \frac{S_{N(t)-1}}{(N(t)-1)} \cdot \frac{N(t)-1}{N(t)} \le \frac{t}{N(t)} \le \frac{S_{N(t)}}{N(t)}. \tag{5.3}$$
Let $A$ be the event that $\frac{S_n}{n} \to EX_1$ as $n \to \infty$ and let $B$ be the event that $N(t) \to \infty$ as $t \to \infty$. Since $S_n \to \infty$ w.p. 1, it follows that $P(B) = 1$. Also, by the SLLN, $P(A) = 1$. On the event $C = A \cap B$, it holds that
$$\frac{S_{N(t)}}{N(t)} \to EX_1 \quad \text{as } t \to \infty.$$


This together with (5.3) yields the following result.

Proposition 8.5.3: Suppose that $P(X_1 = 0) < 1$. Then,
$$\lim_{t\to\infty} \frac{N(t)}{t} = \frac{1}{EX_1} \quad \text{w.p. 1.} \tag{5.4}$$
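The almost sure limit (5.4) can be illustrated by simulating a single long path. The sketch below assumes, for illustration only, a nondelayed process with Exponential lifetimes of mean 2, so that $1/EX_1 = 0.5$; the function name `renewal_count` is hypothetical. Following (5.1), $N(t)$ is computed as the number of epochs $S_0, S_1, \ldots$ that are at most $t$.

```python
import random

def renewal_count(t, mean_life, rng):
    """Return N(t) as in (5.1): the number of epochs S_0, S_1, ... that are <= t."""
    s, count = 0.0, 0
    while True:
        s += rng.expovariate(1.0 / mean_life)   # next iid lifetime, mean = mean_life
        if s > t:
            return count
        count += 1

rng = random.Random(4)
t, mean_life = 50_000.0, 2.0
ratio = renewal_count(t, mean_life, rng) / t
print(ratio, 1.0 / mean_life)
```

For large $t$ the printed ratio should be close to $1/EX_1$. The exponential choice is only a convenience; by Proposition 8.5.3 any lifetime distribution with $P(X_1 = 0) < 1$ would do.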

Definition 8.5.2: The function $U(t) \equiv EN(t)$ for the nondelayed process is called the renewal function. An explicit expression for $U(\cdot)$ is given by (5.13) below.

Next consider the convergence of $EN(t)/t$. By (5.4) and Fatou's lemma, one gets
$$\liminf_{t\to\infty} \frac{EN(t)}{t} \ge \frac{1}{EX_1}. \tag{5.5}$$
It turns out that the $\liminf_{t\to\infty}$ in (5.5) can be replaced by $\lim_{t\to\infty}$ and $\ge$ by equality. To do this it suffices to show that the family $\{\frac{N(t)}{t} : t \ge k\}$ is uniformly integrable for some $k < \infty$. This can be done by showing that $E(\frac{N(t)}{t})^2$ is bounded in $t$ (see Chung (1974), Chapter 5). An alternate approach is to bound the lim sup. For this one can use an identity known as Wald's equation (see also Chapter 13).

8.5.2 Wald's equation

Let $\{X_j\}_{j\ge1}$ be independent random variables with $EX_j = 0$ for all $j \ge 1$. Also, let $S_0 = 0$, $S_n = \sum_{j=1}^n X_j$, $n \ge 1$.

Definition 8.5.3: A positive integer valued random variable $N$ is called a stopping time with respect to $\{X_j\}_{j\ge1}$ if for every $j \ge 1$, the event $\{N = j\} \in \sigma\{X_1, \ldots, X_j\}$. A stopping time $N$ is called bounded if there exists a $K < \infty$ such that $P(N \le K) = 1$.

Example 8.5.1: $N \equiv \min\{n : \sum_{j=1}^n X_j \ge 25\}$ is a stopping time w.r.t. $\{X_j\}_{j\ge1}$, but $M \equiv \max\{n : \sum_{j=1}^n X_j \ge 25\}$ is not.

Proposition 8.5.4: Let $\{X_j\}_{j\ge1}$ be independent random variables with $EX_j = 0$. Let $N$ be a bounded stopping time w.r.t. $\{X_j\}_{j\ge1}$. Then
$$E(|S_N|) < \infty \quad \text{and} \quad ES_N = 0.$$

Proof: Let $K \in \mathbb{N}$ be such that $P(N \le K) = 1$. Then $|S_N| \le \sum_{j=1}^K |X_j|$ and hence $E|S_N| < \infty$. Next, $S_N = \sum_{j=1}^K X_j I(N \ge j)$ and hence
$$ES_N = \sum_{j=1}^K E\big(X_j I(N \ge j)\big).$$
But the event $\{N \ge j\} = \{N \le j-1\}^c \in \sigma\{X_1, X_2, \ldots, X_{j-1}\}$. Since $X_j$ is independent of $\sigma\{X_1, X_2, \ldots, X_{j-1}\}$, $E\big(X_j I(N \ge j)\big) = 0$ for $1 \le j \le K$. Thus $ES_N = 0$. □

Corollary 8.5.5: Let $\{X_j\}_{j\ge1}$ be iid random variables with $E|X_1| < \infty$. Let $N$ be a bounded stopping time w.r.t. $\{X_j\}_{j\ge1}$. Then $ES_N = (EN)EX_1$.

Corollary 8.5.6: Let $\{X_j\}_{j\ge1}$ be iid nonnegative random variables with $E|X_1| < \infty$. Let $N$ be a stopping time w.r.t. $\{X_j\}_{j\ge1}$. Then $ES_N = (EN)EX_1$.

Proof: Let $N_k = N \wedge k$, $k = 1, 2, \ldots$. Then $N_k$ is a bounded stopping time. By Corollary 8.5.5, $E(S_{N_k}) = (EN_k)EX_1$. Let $k \uparrow \infty$. Then $0 \le S_{N_k} \uparrow S_N$ and $N_k \uparrow N$. By the MCT, $ES_{N_k} \uparrow ES_N$ and $EN_k \uparrow EN$. □

Theorem 8.5.7: (Wald's equation). Let $\{X_j\}_{j\ge1}$ be iid random variables with $E|X_1| < \infty$. Let $N$ be a stopping time w.r.t. $\{X_j\}_{j\ge1}$ such that $EN < \infty$. Then $ES_N = (EN)EX_1$.

Proof: Let $T_n = \sum_{j=1}^n |X_j|$, $n \ge 1$. Let $N_k = N \wedge k$, $k = 1, 2, \ldots$. Then by Corollary 8.5.5,
$$E(S_{N_k}) = (EN_k)EX_1.$$
Also, $|S_{N_k}| \le T_{N_k}$ and $ET_{N_k} = (EN_k)E|X_1|$. Further, as $k \to \infty$, $N_k \to N$, $S_{N_k} \to S_N$, $T_{N_k} \to T_N$, and $ET_{N_k} \to ET_N = (EN)E|X_1| < \infty$. So, by the extended DCT (Theorem 2.3.11), $ES_{N_k} \to ES_N$, i.e., $(EN_k)EX_1 \to ES_N$, i.e., $ES_N = (EN)EX_1$. □
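Wald's equation can be checked by simulation for the stopping time of Example 8.5.1. The sketch below assumes, for illustration only, that the $X_j$ are iid Uniform(0, 10), so $EX_1 = 5$, and estimates both $ES_N$ and $(EN)EX_1$ from the same runs; the function name `stopped_sum` is hypothetical.

```python
import random

def stopped_sum(threshold, rng):
    """Run S_n = X_1 + ... + X_n with X_j ~ Uniform(0, 10) until S_n >= threshold.

    Returns (N, S_N), where N = min{n : S_n >= threshold} is a stopping time.
    """
    s, n = 0.0, 0
    while s < threshold:
        s += rng.uniform(0.0, 10.0)
        n += 1
    return n, s

rng = random.Random(5)
trials = 20_000
results = [stopped_sum(25.0, rng) for _ in range(trials)]
mean_N = sum(n for n, _ in results) / trials      # estimate of E N
mean_SN = sum(s for _, s in results) / trials     # estimate of E S_N
print(mean_SN, mean_N * 5.0)                      # E X_1 = 5 under the assumed distribution
```

The two printed numbers should agree up to Monte Carlo error. Note that $ES_N$ exceeds 25 because of the overshoot at the stopping time, and Wald's equation accounts for this automatically.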


8.5.3 The renewal theorems

In this section, two versions of the renewal theorem will be proved. For this, the notation and concepts introduced in Sections 8.5.1 and 8.5.2 will be used without further explanation. Note that for each $t > 0$ and $j = 0, 1, 2, \ldots$, the event $\{N(t) = j\} = \{S_{j-1} \le t < S_j\}$ belongs to $\sigma\{X_0, \ldots, X_j\}$. Thus, by Wald's equation (Theorem 8.5.7 above),
$$E(S_{N(t)}) = \big(EN(t)\big) EX_1 + EX_0.$$
Let $m \in (0, \infty)$ and $\tilde X_i = \min\{X_i, m\}$, $i \ge 0$. Let $\{\tilde S_n\}_{n\ge0}$ and $\{\tilde N(t)\}_{t\ge0}$ be the associated renewal sequence and renewal process, respectively. Again, by Wald's equation,
$$E\big(\tilde S_{\tilde N(t)}\big) = \big(E\tilde N(t)\big) E\tilde X_1 + E\tilde X_0.$$
But since $\tilde S_{\tilde N(t)-1} \le t < \tilde S_{\tilde N(t)}$, it follows that $\tilde S_{\tilde N(t)} \le t + m$ and hence
$$\big(E\tilde N(t)\big) E\tilde X_1 + E\tilde X_0 \le t + m.$$
This yields
$$\limsup_{t\to\infty} \frac{E\tilde N(t)}{t} \le \frac{1}{E\tilde X_1}.$$
Clearly, for all $t > 0$, $\tilde N(t) \ge N(t)$ and hence
$$\limsup_{t\to\infty} \frac{EN(t)}{t} \le \frac{1}{E\tilde X_1}. \tag{5.6}$$
Since this is true for each $m \in (0, \infty)$ and, by the MCT, $E\tilde X_1 \to EX_1$ as $m \to \infty$, it follows that
$$\limsup_{t\to\infty} \frac{EN(t)}{t} \le \frac{1}{EX_1}.$$
Combining this with (5.5) leads to the following result.

Theorem 8.5.8: (The weak renewal theorem). Let $\{N(t) : t \ge 0\}$ be a renewal process with distribution $F$. Let $\mu = \int_{[0,\infty)} x\,dF(x) \in (0, \infty)$. Then,
$$\lim_{t\to\infty} \frac{EN(t)}{t} = \frac{1}{\mu}. \tag{5.7}$$
The above result is also valid when $\mu = \infty$, provided $\frac{1}{\mu}$ is interpreted as zero.

Definition 8.5.4: A random variable $X$ (and its probability distribution) is called arithmetic (or lattice) if there exists $a \in \mathbb{R}$ and $d > 0$ such that $\frac{X - a}{d}$ is integer valued. The largest such $d$ is called the span of (the distribution of) $X$.

Definition 8.5.5: A random variable $X$ (and its distribution) is called nonarithmetic (or nonlattice) if it is not arithmetic.

The weak renewal theorem (Theorem 8.5.8) implies that $EN(t) = t/\mu + o(t)$ as $t \to \infty$. This suggests that
$$E\big(N(t+h) - N(t)\big) = (t+h)/\mu - t/\mu + o(t) = h/\mu + o(t).$$
A strengthening of the above result is as follows.

Theorem 8.5.9: (The strong renewal theorem). Let $\{N(t) : t \ge 0\}$ be a renewal process with a nonarithmetic distribution $F$ with a finite positive mean $\mu$. Then, for each $h > 0$,
$$\lim_{t\to\infty} E\big(N(t+h) - N(t)\big) = \frac{h}{\mu}. \tag{5.8}$$

Remark 8.5.1: Since
$$N(t) = \sum_{j=0}^{k-1} \big(N(t-j) - N(t-j-1)\big) + N(t-k)$$
where $k \le t < k+1$, the weak renewal theorem follows from the strong renewal theorem.

The following are the "arithmetic versions" of Theorems 8.5.8 and 8.5.9. Let $\{X_i\}_{i\ge0}$ be independent positive integer valued random variables such that $\{X_i\}_{i\ge1}$ are iid with distribution $\{p_j\}_{j\ge1}$. Let $S_n = \sum_{j=0}^n X_j$, $n \ge 0$, $S_{-1} = 0$. Let $N_n = k$ if $S_{k-1} \le n < S_k$, $k = 0, 1, 2, \ldots$. Let
$$u_n = P(\text{there is a renewal at time } n) = P(S_k = n \text{ for some } k \ge 0).$$

Theorem 8.5.10: Let $\mu = \sum_{j=1}^\infty j p_j \in (0, \infty)$. Then
$$\frac{1}{n} \sum_{j=0}^n u_j \to \frac{1}{\mu} \quad \text{as} \quad n \to \infty. \tag{5.9}$$

Theorem 8.5.11: Let $\mu = \sum_{j=1}^\infty j p_j \in (0, \infty)$ and g.c.d. $\{k : p_k > 0\} = 1$. Then
$$u_n \to \frac{1}{\mu} \quad \text{as } n \to \infty. \tag{5.10}$$

For proofs of Theorems 8.5.9 and 8.5.11, see Feller (1966) for an analytic proof or Lindvall (1992) for a proof using the coupling method. The proof of Theorem 8.5.10 is similar to that of Theorem 8.5.8.


8.5.4 Renewal equations

The above strong renewal theorems have many applications. These are via what are known as renewal equations. Let $F(\cdot)$ be a cdf such that $F(0) = 0$. Let $B^0 \equiv \{f \mid f : [0, \infty) \to \mathbb{R},\ f$ is Borel measurable and bounded on bounded intervals$\}$.

Definition 8.5.6: A function $a(\cdot)$ is said to satisfy the renewal equation with distribution $F(\cdot)$ and forcing function $b(\cdot) \in B^0$ if $a \in B^0$ and
$$a(t) = b(t) + \int_{(0,t]} a(t-u)\,dF(u) \quad \text{for } t \ge 0. \tag{5.11}$$

Theorem 8.5.12: Let $F$ be a cdf such that $F(0) = 0$ and let $b(\cdot) \in B^0$. Then there is a unique solution $a_0(\cdot) \in B^0$ to (5.11) given by
$$a_0(t) = \int_{[0,t]} b(t-u)\,U(du) \tag{5.12}$$
where $U(\cdot)$ is the Lebesgue-Stieltjes measure induced by the nondecreasing function
$$U(t) \equiv \sum_{n=0}^\infty F^{(n)}(t), \tag{5.13}$$
with $F^{(n)}(\cdot)$, $n \ge 0$, being defined by the relations
$$F^{(n)}(t) = \int_{(0,t]} F^{(n-1)}(t-u)\,dF(u), \quad t \in \mathbb{R},\ n \ge 1, \qquad F^{(0)}(t) = \begin{cases} 1 & \text{if } t \ge 0, \\ 0 & \text{if } t < 0. \end{cases}$$
It will be shown below that the function $U(\cdot)$ defined in (5.13) is the renewal function $EN(t)$ as in Definition 8.5.2.

Proof: For any function $b \in B^0$ and any nondecreasing right continuous function $G : [0, \infty) \to \mathbb{R}$, let
$$(b * G)(t) \equiv \int_{[0,t]} b(t-u)\,dG(u).$$
Then since $F(0) = 0$, the equation (5.11) can be rewritten as
$$a = b + a * F. \tag{5.14}$$

Let $\{X_i\}_{i\ge1}$ be iid random variables with cdf $F$. Then it is easy to verify that $F^{(n)}(t) = P(S_n \le t)$, where $S_0 = 0$, and $S_n = \sum_{i=1}^n X_i$ for $n \ge 1$. Let $\{N(t) : t \ge 0\}$ be as defined by (5.1). Then, for $t \in (0, \infty)$,
$$EN(t) = \sum_{j=1}^\infty P\big(N(t) \ge j\big) = \sum_{j=1}^\infty P(S_{j-1} \le t) = \sum_{n=0}^\infty F^{(n)}(t) = U(t).$$
By Proposition 8.5.1, $U(t) < \infty$ for all $t > 0$ and is nondecreasing. Since $b \in B^0$, for each $0 < t < \infty$, $a_0$ defined by (5.12) is well-defined. By definition $a_0 = b * U$ and by (5.13), $a_0$ satisfies (5.14) and hence (5.11). If $a_1$ and $a_2$ from $B^0$ are two solutions to (5.14) then $\tilde a \equiv a_1 - a_2$ satisfies
$$\tilde a = \tilde a * F \quad \text{and hence} \quad \tilde a = \tilde a * F^{(n)} \quad \text{for all } n \ge 1.$$
This implies
$$M(t) \equiv \sup\{|\tilde a(u)| : 0 \le u \le t\} \le M(t) F^{(n)}(t).$$
But $F^{(n)}(t) \to 0$ as $n \to \infty$. Hence $|\tilde a| = 0$ on $(0, t]$ for each $t$. Thus $a_0 = b * U$ is the unique solution to (5.11). □

The discrete or arithmetic analog of the renewal equation (5.11) is as follows. Let $\{X_i\}_{i\ge1}$ be iid positive integer valued random variables with distribution $\{p_j\}_{j\ge1}$. Let $S_0 = 0$, and $S_n = \sum_{i=1}^n X_i$ for $n \ge 1$. Let $u_n = P(S_j = n \text{ for some } j \ge 0)$. Then, $u_0 = 1$ and $u_n$ satisfies $u_n = \sum_{j=1}^n p_j u_{n-j}$ for $n \ge 1$. For any sequence $\{b_j\}_{j\ge0}$, the equation
$$a_n = b_n + \sum_{j=1}^n a_{n-j} p_j, \quad n = 0, 1, 2, \ldots \tag{5.15}$$
is called the discrete renewal equation. As in the general case, it can be shown (Problem 8.17 (a)) that the unique solution to (5.15) is given by
$$a_n = \sum_{j=0}^n b_{n-j} u_j. \tag{5.16}$$
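The discrete renewal equation lends itself to direct computation. The sketch below uses an assumed lifetime distribution $p_1 = p_2 = 1/2$ (aperiodic, with $\mu = 1.5$) and an assumed forcing sequence $b_k = 1/(k+1)^2$, neither of which comes from the text. It builds $\{u_n\}$ from its recursion, solves (5.15) by forward recursion, and compares the result with the convolution formula (5.16). All function names are hypothetical.

```python
def renewal_sequence(p, n_max):
    """u_0 = 1 and u_n = sum_{j=1}^n p_j u_{n-j}; p maps j >= 1 to p_j."""
    u = [1.0]
    for n in range(1, n_max + 1):
        u.append(sum(p.get(j, 0.0) * u[n - j] for j in range(1, n + 1)))
    return u

def solve_by_recursion(b, p):
    """Forward recursion for (5.15): a_n = b_n + sum_{j=1}^n a_{n-j} p_j."""
    a = []
    for n in range(len(b)):
        a.append(b[n] + sum(a[n - j] * p.get(j, 0.0) for j in range(1, n + 1)))
    return a

def solve_by_convolution(b, u):
    """Closed form (5.16): a_n = sum_{j=0}^n b_{n-j} u_j."""
    return [sum(b[n - j] * u[j] for j in range(n + 1)) for n in range(len(b))]

p = {1: 0.5, 2: 0.5}                           # assumed distribution, mean mu = 1.5
u = renewal_sequence(p, 40)
b = [1.0 / (k + 1) ** 2 for k in range(41)]    # assumed summable forcing sequence
a_rec = solve_by_recursion(b, p)
a_conv = solve_by_convolution(b, u)
print(u[40], 1 / 1.5)
```

The two solution vectors should agree up to rounding, and $u_n$ should approach $1/\mu = 2/3$, in line with Theorem 8.5.11. For this particular $p$ one can check by hand that $u_n = 2/3 + (1/3)(-1/2)^n$.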

The following convergence results are easy to establish from Theorem 8.5.11 (Problem 8.17 (b)).

Theorem 8.5.13: (The key renewal theorem, discrete case). Let $\{p_j\}_{j\ge1}$ be aperiodic, i.e., g.c.d. $\{k : p_k > 0\} = 1$ and $\mu \equiv \sum_{j=1}^\infty j p_j \in (0, \infty)$. Let $\{u_n\}_{n\ge0}$ be the renewal sequence associated with $\{p_j\}_{j\ge1}$. That is, $u_0 = 1$ and $u_n = \sum_{j=1}^n p_j u_{n-j}$ for $n \ge 1$. Let $\{b_j\}_{j\ge0}$ be such that $\sum_{j=1}^\infty |b_j| < \infty$. Let $\{a_n\}_{n\ge0}$ satisfy $a_0 = b_0$ and
$$a_n = b_n + \sum_{j=1}^n a_{n-j} p_j, \quad n \ge 1. \tag{5.17}$$
Then $a_n = \sum_{j=0}^n b_j u_{n-j}$, $n \ge 0$, and
$$\lim_{n\to\infty} a_n = \frac{1}{\mu} \sum_{j=0}^\infty b_j.$$

The nonarithmetic analog of the above is as follows.

Definition 8.5.7: A function $b(\cdot) \in B^0$ is directly Riemann integrable (dri) on $[0, \infty)$ iff (i) for all $h > 0$, $\sum_{n=0}^\infty \sup\{|b(u)| : nh \le u \le (n+1)h\} < \infty$, and (ii) $\lim_{h\to0} \sum_{n=0}^\infty h\big(\overline{m}_n(h) - \underline{m}_n(h)\big) = 0$ where
$$\overline{m}_n(h) = \sup\{b(u) : nh \le u \le (n+1)h\}, \qquad \underline{m}_n(h) = \inf\{b(u) : nh \le u \le (n+1)h\}.$$

Theorem 8.5.14: (The key renewal theorem, nonarithmetic case). Let $F(\cdot)$ be a nonarithmetic distribution with $F(0) = 0$ and $\mu = \int_{[0,\infty)} u\,dF(u) < \infty$. Let $U(\cdot) = \sum_{n=0}^\infty F^{(n)}(\cdot)$ be the renewal function associated with $F$. Let $b(\cdot) \in B^0$ be directly Riemann integrable. Then the unique solution to the renewal equation
$$a = b + a * F \tag{5.18}$$
is given by $a = b * U$ and
$$\lim_{t\to\infty} a(t) = \frac{c(b)}{\mu} \tag{5.19}$$
where $c(b) \equiv \lim_{h\to0} \sum_{n=0}^\infty h\,\overline{m}_n(h)$.

Remark 8.5.2: A sufficient condition for $b(\cdot)$ to be dri is that it is Riemann integrable on bounded intervals and that there exists a nonincreasing integrable function $h(\cdot)$ on $[0, \infty)$ and a constant $C$ such that $|b(\cdot)| \le C h(\cdot)$ (Problem 8.18 (b)).

8.5.5 Applications

Here are two important applications of the above two theorems to a class of stochastic processes known as regenerative processes.

Definition 8.5.8: (a) A sequence of random variables $\{Y_n\}_{n\ge0}$ is called regenerative if there exists a renewal sequence $\{T_j\}_{j\ge0}$ such that the random cycles and cycle length variables $\eta_j = \big(\{Y_i : T_j \le i < T_{j+1}\},\ T_{j+1} - T_j\big)$ for $j = 0, 1, 2, \ldots$ are iid.

(b) A stochastic process $\{Y(t) : t \ge 0\}$ is called regenerative if there exists a renewal sequence $\{T_j\}_{j\ge0}$ such that the random cycles and cycle length variables $\eta_j \equiv \big(\{Y(t) : T_j \le t < T_{j+1}\},\ T_{j+1} - T_j\big)$ for $j = 0, 1, 2, \ldots$ are iid.

(c) In both (a) and (b), the sequence $\{T_j\}_{j\ge0}$ are called the regeneration times.

Example 8.5.2: Let $\{Y_n\}_{n\ge0}$ be a countable state space Markov chain (see Chapter 14) that is irreducible and recurrent. Fix a state $\Delta$. Let
$$T_0 = \min\{n : n > 0,\ Y_n = \Delta\}, \qquad T_{j+1} = \min\{n : n > T_j,\ Y_n = \Delta\}, \quad j \ge 0.$$
Then $\{Y_n\}_{n\ge0}$ is regenerative (Problem 8.19).

Example 8.5.3: Let $\{Y(t) : t \ge 0\}$ be a continuous time Markov chain (see Chapter 14) with a countable state space that is irreducible and recurrent. Fix a state $\Delta$. Let
$$T_0 = \inf\{t : t > 0,\ Y(t) = \Delta\}, \qquad T_{j+1} = \inf\{t : t > T_j,\ Y(t) = \Delta\}.$$

Then $\{Y(t) : t \ge 0\}$ is regenerative (Problem 8.19).

Theorem 8.5.15: Let $\{Y_n\}_{n\ge0}$ be a regenerative sequence of random variables with some state space $(S, \mathcal{S})$, where $\mathcal{S}$ is a $\sigma$-algebra on $S$, with regeneration times $\{T_j\}_{j\ge0}$. Let $f : S \to \mathbb{R}$ be bounded and $\langle \mathcal{S}, \mathcal{B}(\mathbb{R}) \rangle$-measurable. Let
$$a_n \equiv Ef(Y_{n+T_0}), \qquad b_n \equiv Ef(Y_{T_0+n}) I(T_1 > T_0 + n). \tag{5.20}$$
Let $\mu = E(T_1 - T_0) \in (0, \infty)$ and g.c.d. $\{j : p_j \equiv P(T_1 - T_0 = j) > 0\} = 1$. Then

(i) $a_n \to \int_S f(y)\,\pi(dy)$ where
$$\pi(A) \equiv \frac{1}{\mu} E\bigg(\sum_{j=T_0}^{T_1 - 1} I_A(Y_j)\bigg), \quad A \in \mathcal{S}.$$

(ii) In particular,
$$\|P(Y_n \in \cdot) - \pi(\cdot)\| \to 0 \quad \text{as} \quad n \to \infty, \tag{5.21}$$
where $\|\cdot\|$ denotes the total variation norm.

Proof: By the regenerative property, $\{a_n\}_{n\ge1}$ satisfies the renewal equation
$$a_n = b_n + \sum_{j=0}^n a_{n-j} p_j$$


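and hence, part (i) of the theorem follows from Theorem 8.5.13 and the fact that $\sum_{n=0}^{\infty} b_n = \mu \int_S f\,d\pi$ (which equals $\mu\pi(A)$ when $f = I_A$).

To prove (ii), note that
\[ \tilde{a}_n \equiv E f(Y_n) = E\big(f(Y_n) I(T_0 > n)\big) + \sum_{j=0}^{n} a_{n-j}\, P(T_0 = j) \]
and by the DCT, $\lim_{n\to\infty} \tilde{a}_n = \lim_{n\to\infty} a_n$. It is not difficult to show that for any two probability measures $\mu$ and $\nu$ on $(S, \mathcal{S})$, the total variation norm satisfies
\[ \|\mu - \nu\| = \sup\Big\{ \Big|\int f\,d\mu - \int f\,d\nu\Big| : f \in B(S, \mathbb{R}) \Big\}, \]
where $B(S, \mathbb{R}) = \{f : f : S \to \mathbb{R},\ f \text{ is } \mathcal{S}\text{-measurable},\ \sup\{|f(s)| : s \in S\} \le 1\}$ (Problem 4.10 (b)). Thus,
\[ \|P(Y_{n+T_0} \in \cdot) - \pi(\cdot)\| \le \sup\Big\{ \Big|E f(Y_{n+T_0}) - \int f\,d\pi\Big| : f \in B(S, \mathbb{R}) \Big\}. \tag{5.22} \]
Now, for any $f \in B(S, \mathbb{R})$ and any integer $K \ge 1$, from Theorem 8.5.13,
\[ \Big|E f(Y_{n+T_0}) - \int f\,d\pi\Big| \le \sum_{j=0}^{K} |b_j|\, \Big|u_{n-j} - \frac{1}{\mu}\Big| + 2 \sum_{j=K+1}^{\infty} P(T_1 - T_0 > j) \equiv \delta_n, \ \text{say,} \tag{5.23} \]
where $\{b_j\}$ is defined in (5.20). Since $E(T_1 - T_0) < \infty$, given $\epsilon > 0$, there exists a $K$ such that
\[ \sum_{j=K+1}^{\infty} P(T_1 - T_0 > j) < \epsilon/2. \]
By Theorem 8.5.11, $u_n \to \frac{1}{\mu}$. Thus, in (5.23), $\limsup_{n\to\infty} \delta_n \le \epsilon$, and so from (5.22), (ii) follows. □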
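The cycle formula for $\pi$ in Theorem 8.5.15 can be checked numerically in the setting of Example 8.5.2. The following is a minimal sketch, not part of the original text: it assumes a two-state Markov chain on $\{0, 1\}$ with hypothetical transition probabilities `a` $= P(0 \to 1)$ and `b` $= P(1 \to 0)$, uses the visits to state 0 as regeneration times, and compares the ratio (visits to state 1 per cycle)/(cycle length) with the stationary probability $a/(a+b)$ of state 1.

```python
import random

def cycle_estimate(a, b, n_cycles, seed=0):
    """Estimate pi({1}) for a two-state chain on {0, 1} via regeneration at state 0.

    a = P(0 -> 1), b = P(1 -> 0) are illustrative parameters. Each cycle starts
    at a visit to state 0 and ends just before the next visit to state 0.
    """
    rng = random.Random(seed)
    visits_to_1 = 0   # total number of times Y_j = 1 over all simulated cycles
    total_length = 0  # total length of all simulated cycles
    for _ in range(n_cycles):
        state = 0
        while True:
            total_length += 1
            if state == 1:
                visits_to_1 += 1
            # one transition of the chain
            if state == 0:
                state = 1 if rng.random() < a else 0
            else:
                state = 0 if rng.random() < b else 1
            if state == 0:   # regeneration: the cycle is over
                break
    return visits_to_1 / total_length

a, b = 0.3, 0.2
estimate = cycle_estimate(a, b, n_cycles=20000)
exact = a / (a + b)   # stationary probability of state 1
print(estimate, exact)
```

The ratio of sums is used, rather than an average of per-cycle ratios, because $\pi(A)$ is a ratio of two expectations.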

Theorem 8.5.16: Let $\{Y(t) : t \ge 0\}$ be a regenerative stochastic process with state space $(S, \mathcal{S})$, where $\mathcal{S}$ is a σ-algebra on $S$. Let $f : S \to \mathbb{R}$ be bounded and $\langle \mathcal{S}, \mathcal{B}(\mathbb{R})\rangle$-measurable. Let
\[ a(t) \equiv E f(Y(T_0 + t)), \quad t \ge 0, \qquad b(t) \equiv E f(Y(T_0 + t))\, I(T_1 > T_0 + t), \quad t \ge 0. \]
Let $\mu = E(T_1 - T_0) \in (0, \infty)$ and let the distribution of $T_1 - T_0$ be nonarithmetic. Then

(i) $a(t) \to \int_S f(y)\,\pi(dy)$, where
\[ \pi(A) = \frac{1}{\mu}\, E\Big(\int_{T_0}^{T_1} I_A(Y(u))\,du\Big), \quad A \in \mathcal{S}. \]

(ii) In particular,
\[ \|P(Y(t) \in \cdot) - \pi(\cdot)\| \to 0 \quad \text{as} \quad t \to \infty, \tag{5.24} \]
where $\|\cdot\|$ is the total variation norm.

The proof of this is similar to that of the previous theorem but uses Theorem 8.5.14. □

8.6 Ergodic theorems

8.6.1 Basic definitions and examples

The law of large numbers proved in Section 8.2 states that if $\{X_i\}_{i\ge 1}$ are pairwise independent and identically distributed and if $h(\cdot)$ is a Borel measurable function, then
\[ \text{the time average, i.e., } \frac{1}{n}\sum_{i=1}^{n} h(X_i) \ \to\ E h(X_1), \text{ i.e., the space average, w.p. 1} \tag{6.1} \]
as $n \to \infty$, provided $E|h(X_1)| < \infty$. The goal of this section is to investigate how far the independence assumption can be relaxed.

Definition 8.6.1: (Stationary sequences). A sequence of random variables $\{X_i\}_{i\ge 1}$ on a probability space $(\Omega, \mathcal{F}, P)$ is called strictly stationary if for each $k \ge 1$ the joint distribution of $(X_{i+j} : j = 1, 2, \ldots, k)$ is the same for all $i \ge 0$.

Example 8.6.1: $\{X_i\}_{i\ge 1}$ iid.

Example 8.6.2: Let $\{X_i\}_{i\ge 1}$ be iid. Fix $1 \le \ell < \infty$. Let $h : \mathbb{R}^{\ell} \to \mathbb{R}$ be a Borel function and $Y_i = h(X_i, X_{i+1}, \ldots, X_{i+\ell-1})$, $i \ge 1$. Then $\{Y_i\}_{i\ge 1}$ is strictly stationary.

Example 8.6.3: Let $\{X_i\}_{i\ge 1}$ be a Markov chain with a stationary distribution $\pi$. If $X_1 \sim \pi$ then $\{X_i\}_{i\ge 1}$ is strictly stationary (see Chapter 14).

It will be shown that if $\{X_i\}_{i\ge 1}$ is a strictly stationary sequence that is not a mixture of two other strictly stationary sequences, then (6.1) holds. This is known as the ergodic theorem (Theorem 8.6.1 below).

Definition 8.6.2: (Measure preserving transformations). Let $(\Omega, \mathcal{F}, P)$ be a probability space and $T : \Omega \to \Omega$ be $\langle \mathcal{F}, \mathcal{F}\rangle$-measurable. Then, $T$ is


called $P$-preserving (or simply measure preserving on $(\Omega, \mathcal{F}, P)$) if for all $A \in \mathcal{F}$, $P(T^{-1}(A)) = P(A)$. That is, the random point $T(\omega)$ has the same distribution as $\omega$.

Let $X$ be a real valued random variable on $(\Omega, \mathcal{F}, P)$. Let $X_i \equiv X(T^{(i-1)}(\omega))$ where $T^{(0)}(\omega) = \omega$, $T^{(i)}(\omega) = T(T^{(i-1)}(\omega))$, $i \ge 1$. Then $\{X_i\}_{i\ge 1}$ is a strictly stationary sequence.

It turns out that every strictly stationary sequence arises this way. Let $\{X_i\}_{i\ge 1}$ be a strictly stationary sequence defined on some probability space $(\Omega, \mathcal{F}, P)$. Let $\tilde{P}$ be the probability measure induced by $\tilde{X} \equiv \{X_i(\omega)\}_{i\ge 1}$ on $\tilde{\Omega} \equiv \mathbb{R}^{\infty}$, $\tilde{\mathcal{F}} \equiv \mathcal{B}(\mathbb{R}^{\infty})$, where $\mathbb{R}^{\infty}$ is the space of all sequences of real numbers and $\mathcal{B}(\mathbb{R}^{\infty})$ is the σ-algebra generated by finite dimensional cylinder sets of the form $\{x : (x_j : j = 1, 2, \ldots, k) \in A_k\}$, $1 \le k < \infty$, $A_k \in \mathcal{B}(\mathbb{R}^k)$. Let $T : \mathbb{R}^{\infty} \to \mathbb{R}^{\infty}$ be the unilateral (one sided) shift to the right, i.e., $T(x_i)_{i\ge 1} = (x_i)_{i\ge 2}$. Then $T$ is measure preserving on $(\tilde{\Omega}, \tilde{\mathcal{F}}, \tilde{P})$. Let $Y_1(\tilde{\omega}) = x_1$, and $Y_i(\tilde{\omega}) = Y_1(T^{i-1}\tilde{\omega}) = x_i$ for $i \ge 2$ if $\tilde{\omega} = (x_1, x_2, x_3, \ldots)$. Then $\{Y_i\}_{i\ge 1}$ is a strictly stationary sequence on $(\tilde{\Omega}, \tilde{\mathcal{F}}, \tilde{P})$ and has the same distribution as $\{X_i\}_{i\ge 1}$.

Example 8.6.4: Let $\Omega = [0, 1]$, $\mathcal{F} = \mathcal{B}([0, 1])$, $P$ = Lebesgue measure. Let $T\omega \equiv 2\omega \bmod 1$, i.e.,
\[ T\omega = \begin{cases} 2\omega & \text{if } 0 \le \omega < \tfrac{1}{2} \\ 2\omega - 1 & \text{if } \tfrac{1}{2} \le \omega < 1 \\ 0 & \text{if } \omega = 1. \end{cases} \]
Then $T$ is measure preserving since $P(\{\omega : a < T\omega < b\}) = (b - a)$ for all $0 < a < b < 1$ (Problem 8.20). This example is an equivalent version of the iid sequence $\{\delta_i\}_{i\ge 1}$ of Bernoulli (1/2) random variables. To see this, let $\omega = \sum_{i=1}^{\infty} \frac{\delta_i(\omega)}{2^i}$ be the binary expansion of $\omega$. Then $\{\delta_i\}_{i\ge 1}$ is iid Bernoulli (1/2) and $T\omega = 2\omega \bmod 1 = \sum_{i=2}^{\infty} \frac{\delta_i(\omega)}{2^{i-1}}$ (cf. Problem 7.4). Thus $T$ corresponds to the unilateral shift to the right on the iid sequence $\{\delta_i\}_{i\ge 1}$. For this reason, $T$ is called the Bernoulli shift.

Example 8.6.5: (Rotation). Let $\Omega = \{(x, y) : x^2 + y^2 = 1\}$ be the unit circle. Fix $\theta_0$ in $[0, 2\pi)$. If $\omega = (\cos\theta, \sin\theta)$, $\theta$ in $[0, 2\pi)$, set $T\omega = \big(\cos(\theta + \theta_0), \sin(\theta + \theta_0)\big)$. That is, $T$ rotates any point $\omega$ on $\Omega$ counterclockwise through an angle $\theta_0$. Then $T$ is measure preserving w.r.t. the Uniform distribution on $[0, 2\pi]$ (for the angle $\theta$).

Definition 8.6.3: Let $(\Omega, \mathcal{F}, P)$ be a probability space and $T : \Omega \to \Omega$ be an $\langle \mathcal{F}, \mathcal{F}\rangle$-measurable map. A set $A \in \mathcal{F}$ is $T$-invariant if $A = T^{-1}A$. A set $A \in \mathcal{F}$ is almost $T$-invariant w.r.t. $P$ if $P(A \,\triangle\, T^{-1}A) = 0$, where $A_1 \,\triangle\, A_2 = (A_1 \cap A_2^c) \cup (A_1^c \cap A_2)$ is the symmetric difference of $A_1$ and $A_2$.
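The two claims of Example 8.6.4 (that $T$ shifts the binary digits of $\omega$ and that preimages of intervals keep their length) can be checked with exact rational arithmetic. The sketch below is an illustration added here, not part of the original text; the helper names `doubling_map`, `binary_digits`, and `preimage_length` are hypothetical, and the digit routine is only used at a point that is not a dyadic rational, so the choice between the two expansions of dyadic points does not arise.

```python
from fractions import Fraction

def doubling_map(omega):
    """T(omega) = 2*omega mod 1 on [0, 1), with T(1) = 0 as in Example 8.6.4."""
    if omega == 1:
        return Fraction(0)
    doubled = 2 * omega
    return doubled - 1 if doubled >= 1 else doubled

def binary_digits(omega, n):
    """First n binary digits delta_1, ..., delta_n of omega in [0, 1)."""
    digits = []
    for _ in range(n):
        omega *= 2
        digit = 1 if omega >= 1 else 0
        digits.append(digit)
        omega -= digit
    return digits

def preimage_length(a, b):
    """Lebesgue measure of T^{-1}((a, b)) for 0 < a < b < 1.

    The preimage is (a/2, b/2) union ((a+1)/2, (b+1)/2), two disjoint intervals.
    """
    return (b / 2 - a / 2) + ((b + 1) / 2 - (a + 1) / 2)

omega = Fraction(5, 7)
digits = binary_digits(omega, 12)
shifted = binary_digits(doubling_map(omega), 11)
print(digits)    # digits of omega
print(shifted)   # digits of T(omega): the same sequence with the first digit dropped
```

Using `Fraction` rather than floats avoids the loss of one bit of precision per application of the map, which would otherwise make long digit comparisons unreliable.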


It can be shown that $A$ is almost $T$-invariant w.r.t. $P$ iff there exists a set $A'$ that is $T$-invariant and $P(A \,\triangle\, A') = 0$ (Problem 8.21). Examples of $T$-invariant sets are $A_1 = \{\omega : T^j\omega \in A_0 \text{ for infinitely many } j \ge 1\}$ where $A_0 \in \mathcal{F}$; and $A_2 = \big\{\omega : \frac{1}{n}\sum_{j=1}^{n} h(T^j\omega) \text{ converges as } n \to \infty\big\}$ where $h : \Omega \to \mathbb{R}$ is an $\mathcal{F}$-measurable function. On the other hand, the event $\{x : x_1 \le 0\}$ is not shift invariant in $\big(\mathbb{R}^{\infty}, \mathcal{B}(\mathbb{R}^{\infty})\big)$, nor is it almost shift invariant if $\tilde{P}$ corresponds to the iid case with a nondegenerate distribution.

The collection $\mathcal{I}$ of $T$-invariant sets is a σ-algebra and is called the invariant σ-algebra. A function $h : \Omega \to \mathbb{R}$ is $\mathcal{I}$-measurable iff $h(\omega) = h(T\omega)$ for all $\omega$ (Problem 8.22).

Definition 8.6.4: A measure preserving transformation $T$ on a probability space $(\Omega, \mathcal{F}, P)$ is ergodic or irreducible (w.r.t. $P$) if $A$ is $T$-invariant implies $P(A) = 0$ or 1.

Definition 8.6.5: A stationary sequence of random variables $\{X_i\}_{i\ge 1}$ is ergodic if the unilateral shift $T$ is ergodic on the sequence space $(\mathbb{R}^{\infty}, \mathcal{B}(\mathbb{R}^{\infty}), \tilde{P})$, where $\tilde{P}$ is the measure on $\mathbb{R}^{\infty}$ induced by $\{X_i\}_{i\ge 1}$.

Example 8.6.6: Consider the above sequence space. Then $A \in \tilde{\mathcal{F}}$ is invariant with respect to the unilateral shift implies that $A$ is in the tail σ-algebra $\mathcal{T} \equiv \bigcap_{n=1}^{\infty} \sigma(X_j(\omega), j \ge n)$ (Problem 8.23). If $\{X_i\}_{i\ge 1}$ are independent, then by Kolmogorov's zero-one law, $A \in \mathcal{T}$ implies $P(A) = 0$ or 1. Thus, if $\{X_i\}_{i\ge 1}$ are iid, then the sequence is ergodic. On the other hand, mixtures of iid sequences are not ergodic, as seen below.

Example 8.6.7: Let $\{X_i\}_{i\ge 1}$ and $\{Y_i\}_{i\ge 1}$ be two iid sequences with different distributions. Let $\delta$ be Bernoulli ($p$), $0 < p < 1$, and independent of both $\{X_i\}_{i\ge 1}$ and $\{Y_i\}_{i\ge 1}$. Let $Z_i \equiv \delta X_i + (1 - \delta)Y_i$, $i \ge 1$. Then $\{Z_i\}_{i\ge 1}$ is a stationary sequence and is not ergodic (Problem 8.24).

The above example can be extended to mixtures of irreducible positive recurrent discrete state space Markov chains (Problem 8.25 (a)). Another example is Example 8.6.5, i.e., rotation of the circle when $\theta_0$ is a rational multiple of $2\pi$ (Problem 8.25 (b)).

Remark 8.6.1: There is a simple example of a measure preserving transformation $T$ that is ergodic but $T^2$ is not. Let $\Omega = \{\omega_1, \omega_2\}$, $\omega_1 \ne \omega_2$. Let $T\omega_1 = \omega_2$, $T\omega_2 = \omega_1$, and let $P$ be the distribution $P(\{\omega_1\}) = P(\{\omega_2\}) = \frac{1}{2}$. Then $T$ is ergodic but $T^2$ is not (Problem 8.26).
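
Because the space in Remark 8.6.1 has only two points, Definitions 8.6.3 and 8.6.4 can be verified by brute force. The following sketch is an added illustration (the names `invariant_sets` and `is_ergodic` are hypothetical helpers): it enumerates all subsets $A$, keeps those with $T^{-1}A = A$, and checks whether every invariant set has probability 0 or 1.

```python
from itertools import combinations

omega_space = ("w1", "w2")
prob = {"w1": 0.5, "w2": 0.5}
T = {"w1": "w2", "w2": "w1"}             # the swap map of Remark 8.6.1
T2 = {w: T[T[w]] for w in omega_space}   # T composed with itself (the identity)

def preimage(transform, subset):
    """Return the preimage of `subset` under `transform`."""
    return frozenset(w for w in omega_space if transform[w] in subset)

def invariant_sets(transform):
    """All subsets A of the space with preimage(A) == A."""
    subsets = [frozenset(c)
               for r in range(len(omega_space) + 1)
               for c in combinations(omega_space, r)]
    return [s for s in subsets if preimage(transform, s) == s]

def is_ergodic(transform):
    """True if every invariant set has probability 0 or 1."""
    return all(sum(prob[w] for w in s) in (0, 1)
               for s in invariant_sets(transform))

print(is_ergodic(T), is_ergodic(T2))
```

For $T$ the only invariant sets are the empty set and $\Omega$, while for $T^2$ (the identity) each singleton is invariant and has probability $\frac{1}{2}$.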


8.6.2 Birkhoff's ergodic theorem

Theorem 8.6.1: Let $(\Omega, \mathcal{F}, P)$ be a probability space, $T : \Omega \to \Omega$ be a measure preserving ergodic map on $(\Omega, \mathcal{F}, P)$ and $X \in L^1(\Omega, \mathcal{F}, P)$. Then
\[ \frac{1}{n}\sum_{j=0}^{n-1} X(T^j\omega) \to EX \equiv \int_{\Omega} X\,dP \tag{6.2} \]
w.p. 1 and in $L^1$ as $n \to \infty$.

Remark 8.6.2: A more general version holds without the assumption that $T$ is ergodic. In this case, the right side of (6.2) is a random variable $Y(\omega)$ that is $T$-invariant, i.e., $Y(\omega) = Y(T(\omega))$ w.p. 1, and satisfies $\int_A X\,dP = \int_A Y\,dP$ for all $T$-invariant sets $A$. This $Y$ is called the conditional expectation of $X$ given $\mathcal{I}$, the σ-algebra of invariant sets (Chapter 13). For a proof of this version, see Durrett (2004).

The proof of Theorem 8.6.1 depends on the following inequality.

Lemma 8.6.2: (Maximal ergodic inequality). Let $T$ be measure preserving on a probability space $(\Omega, \mathcal{F}, P)$ and $X \in L^1(\Omega, \mathcal{F}, P)$. Let $S_0(\omega) = 0$, $S_n(\omega) = \sum_{j=0}^{n-1} X(T^j\omega)$, $n \ge 1$, and $M_n(\omega) = \max\{S_j(\omega) : 0 \le j \le n\}$. Then
\[ E\big(X(\omega)\, I(M_n(\omega) > 0)\big) \ge 0. \]

Proof: By definition of $M_n$, $S_j(\omega) \le M_n(\omega)$ for $1 \le j \le n$ and every $\omega$; in particular, $S_j(T\omega) \le M_n(T\omega)$. Thus
\[ X(\omega) + M_n(T\omega) \ge X(\omega) + S_j(T\omega) = S_{j+1}(\omega). \]
Also, since $M_n(T\omega) \ge 0$,
\[ X(\omega) \ge X(\omega) - M_n(T\omega) = S_1(\omega) - M_n(T\omega). \]
Thus $X(\omega) \ge \max\{S_j(\omega) : 1 \le j \le n\} - M_n(T\omega)$. For $\omega$ such that $M_n(\omega) > 0$, $M_n(\omega) = \max\{S_j(\omega) : 1 \le j \le n\}$ and hence $X(\omega) \ge M_n(\omega) - M_n(T\omega)$. Also, since $X \in L^1(\Omega, \mathcal{F}, P)$, it follows that $M_n \in L^1(\Omega, \mathcal{F}, P)$ for all $n \ge 1$. Taking expectations yields
\[ \begin{aligned} E\big(X(\omega) I(M_n(\omega) > 0)\big) &\ge E\big((M_n(\omega) - M_n(T\omega))\, I(M_n(\omega) > 0)\big) \\ &\ge E\big((M_n(\omega) - M_n(T\omega))\, I(M_n(\omega) \ge 0)\big) \quad (\text{since } M_n(T\omega) \ge 0) \\ &= E\big(M_n(\omega) - M_n(T\omega)\big) = 0, \end{aligned} \]
since $T$ is measure preserving. □


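Remark 8.6.3: Note that the measure preserving property of $T$ is used only at the last step.

Proof of Theorem 8.6.1: W.l.o.g. assume that $EX = 0$. Let $Z(\omega) \equiv \limsup_{n\to\infty} \frac{S_n(\omega)}{n}$. Fix $\epsilon > 0$ and set $A_\epsilon \equiv \{\omega : Z(\omega) > \epsilon\}$. It will be shown that $P(A_\epsilon) = 0$. Clearly, $A_\epsilon$ is $T$-invariant. Since $T$ is ergodic, $P(A_\epsilon) = 0$ or 1. Suppose $P(A_\epsilon) = 1$. Let $Y(\omega) = X(\omega) - \epsilon$. Let $M_{n,Y}(\omega) \equiv \max\{S_{j,Y}(\omega) : 0 \le j \le n\}$ where $S_{0,Y}(\omega) \equiv 0$, $S_{j,Y}(\omega) \equiv \sum_{k=0}^{j-1} Y(T^k\omega)$, $j \ge 1$. Then by Lemma 8.6.2 applied to $Y(\omega)$,
\[ E\big(Y(\omega)\, I(M_{n,Y}(\omega) > 0)\big) \ge 0. \]
But $B_n \equiv \{\omega : M_{n,Y}(\omega) > 0\} = \{\omega : \sup_{1\le j\le n} \frac{1}{j} S_{j,Y}(\omega) > 0\}$. Clearly, $B_n \uparrow B \equiv \{\omega : \sup_{1\le j<\infty} \frac{1}{j} S_{j,Y}(\omega) > 0\}$. Since $\frac{1}{j} S_{j,Y}(\omega) = \frac{1}{j} S_j(\omega) - \epsilon$ for $j \ge 1$, $B \supset A_\epsilon$, and since $P(A_\epsilon) = 1$, it follows that $P(B) = 1$. Also $|Y| \le |X| + \epsilon \in L^1(\Omega, \mathcal{F}, P)$. So by the dominated convergence theorem, $0 \le E(Y I_{B_n}) \to E(Y I_B) = EY = 0 - \epsilon < 0$, which is a contradiction. Thus $P(A_\epsilon) = 0$. This being true for every $\epsilon > 0$, it follows that $P\big(\limsup_{n\to\infty} \frac{S_n(\omega)}{n} \le 0\big) = 1$. Applying this to $-X(\omega)$ yields
\[ P\Big(\liminf_{n\to\infty} \frac{S_n(\omega)}{n} \ge 0\Big) = 1 \]
and hence $P\big(\lim_{n\to\infty} \frac{S_n(\omega)}{n} = 0\big) = 1$.

To prove $L^1$-convergence, note that applying the above to $X^+$ and $X^-$ yields
\[ f_n(\omega) \equiv \frac{1}{n}\sum_{i=1}^{n} X^+(T^i\omega) \to EX^+ \quad \text{w.p. 1.} \]
Since $T$ is measure preserving, $\int f_n(\omega)\,dP(\omega) = EX^+$ for all $n$. So by Scheffé's theorem (Lemma 8.2.5), $\int |f_n(\omega) - EX^+|\,dP \to 0$, i.e., $E\big|\frac{1}{n}\sum_{i=1}^{n} X^+(T^i\omega) - EX^+\big| \to 0$. Similarly, $E\big|\frac{1}{n}\sum_{i=1}^{n} X^-(T^i\omega) - EX^-\big| \to 0$. This yields $L^1$ convergence. □

Corollary 8.6.3: Let $\{X_i\}_{i\ge 1}$ be a stationary ergodic sequence of $\mathbb{R}^k$ valued random variables on some probability space $(\Omega, \mathcal{F}, P)$. Let $h : \mathbb{R}^k \to \mathbb{R}$ be Borel measurable and let $E|h(X_1, X_2, \ldots, X_k)| < \infty$. Then
\[ \frac{1}{n}\sum_{i=1}^{n} h(X_i, X_{i+1}, \ldots, X_{i+k-1}) \to E h(X_1, X_2, \ldots, X_k) \quad \text{w.p. 1.} \]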
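Theorem 8.6.1 can be illustrated with the rotation of Example 8.6.5. The sketch below is an added illustration and relies on an assumption not proved in this section: that rotation by an angle $\theta_0$ which is not a rational multiple of $2\pi$ is ergodic. It takes $X(x, y) = x^2$, whose space average under the uniform distribution is $\frac{1}{2}$, and compares time averages for $\theta_0 = 1$ radian with those for $\theta_0 = \pi$, where $T^2$ is the identity and $X$ is itself $T$-invariant. The function name `time_average` and the starting angle are illustrative choices.

```python
import math

def time_average(theta, theta0, n):
    """(1/n) * sum_{j=0}^{n-1} X(T^j w) for X(x, y) = x**2 and w = (cos theta, sin theta)."""
    return sum(math.cos(theta + j * theta0) ** 2 for j in range(n)) / n

space_average = 0.5   # (1 / (2*pi)) * integral of cos(t)**2 over [0, 2*pi)
theta = 0.3           # illustrative starting angle
avg_irrational = time_average(theta, 1.0, 10000)     # theta0 = 1 radian
avg_rational = time_average(theta, math.pi, 10000)   # theta0 = pi, a half turn
print(avg_irrational, avg_rational, math.cos(theta) ** 2)
```

In the first case the time average is close to the space average; in the second it stays at $\cos^2\theta$, the value of the invariant function at the starting point, in line with Remark 8.6.2.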

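Proof: Consider the probability space $\tilde{\Omega} = (\mathbb{R}^k)^{\infty}$, $\tilde{\mathcal{F}} \equiv \mathcal{B}\big((\mathbb{R}^k)^{\infty}\big)$ and $\tilde{P}$, the probability measure induced by the map $\omega \to (X_i(\omega))_{i\ge 1}$, and the unilateral shift map $\tilde{T}$ on $\tilde{\Omega}$ defined by $\tilde{T}(x_i)_{i\ge 1} = (x_i)_{i\ge 2}$. Then $\tilde{T}$ is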


measure preserving and ergodic. So the corollary follows from Theorem 8.6.1. □

Remark 8.6.4: This corollary is useful in statistical time series analysis. If $\{X_i\}_{i\ge 1}$ is a real valued stationary ergodic sequence, then the mean $m \equiv EX_1$, variance $\mathrm{Var}(X_1)$, and covariance $\mathrm{Cov}(X_1, X_2)$ can all be estimated consistently by the corresponding sample functions
\[ \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad \frac{1}{n}\sum_{i=1}^{n} X_i^2 - \Big(\frac{1}{n}\sum_{i=1}^{n} X_i\Big)^2, \qquad \text{and} \qquad \frac{1}{n}\sum_{i=1}^{n} X_i X_{i+1} - \Big(\frac{1}{n}\sum_{i=1}^{n} X_i\Big)^2. \]
Further, the joint distribution of $(X_1, X_2, \ldots, X_k)$ for any $k \ge 1$ can be estimated consistently by the corresponding empirical measure, i.e., by $L_n(A_1, A_2, \ldots, A_k) \equiv \frac{1}{n}\sum_{i=1}^{n} I(X_{i+j} \in A_j,\ j = 1, 2, \ldots, k)$, which converges to $P(X_1 \in A_1, X_2 \in A_2, \ldots, X_k \in A_k)$ w.p. 1, where $A_i \in \mathcal{B}(\mathbb{R})$, $i = 1, 2, \ldots, k$.

The next three results (Theorems 8.6.4–8.6.6) are consequences and extensions of the ergodic theorem, Theorem 8.6.1. For proofs, see Durrett (2004). The first one is the following result on the behavior of the log-likelihood function of a stationary ergodic sequence of random variables with a finite range.

Theorem 8.6.4: (Shannon-McMillan-Breiman theorem). Let $\{X_i\}_{i\ge 1}$ be a stationary ergodic sequence of random variables with values in a finite set $S \equiv \{a_1, a_2, \ldots, a_k\}$. For each $n$ and $x_1, x_2, \ldots, x_n$ in $S$, let
\[ p(x_n \mid x_{n-1}, x_{n-2}, \ldots, x_1) = P(X_n = x_n \mid X_j = x_j,\ 1 \le j \le n-1) \equiv \frac{P(X_j = x_j : 1 \le j \le n)}{P(X_j = x_j : 1 \le j \le n-1)} \]
whenever the denominator is positive, and let $p(x_1, x_2, \ldots, x_n) = P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)$. Then
\[ \lim_{n\to\infty} \frac{1}{n} \log p(X_1, X_2, \ldots, X_n) = -H \quad \text{exists w.p. 1,} \]
where $H \equiv \lim_{n\to\infty} E\big({-\log p(X_n \mid X_{n-1}, X_{n-2}, \ldots, X_1)}\big)$ is called the entropy rate of $\{X_i\}_{i\ge 1}$.
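
In the iid case the limit in Theorem 8.6.4 reduces to the strong law of large numbers applied to $-\log p(X_i)$, with $H = \sum_j (-\log p_j)\,p_j$. The following sketch is an added illustration under that iid assumption; the three-point distribution `probs` and the function name `empirical_entropy_rate` are hypothetical choices, and the simulation shows only one sample path, not a proof.

```python
import math
import random

def empirical_entropy_rate(probs, n, seed=0):
    """Return -(1/n) log p(X_1, ..., X_n) for an iid sample with P(X_1 = a_j) = probs[j]."""
    rng = random.Random(seed)
    sample = rng.choices(range(len(probs)), weights=probs, k=n)
    # In the iid case p(x_1, ..., x_n) is the product of the marginal probabilities,
    # so the log-likelihood is a sum of iid terms.
    log_likelihood = sum(math.log(probs[x]) for x in sample)
    return -log_likelihood / n

probs = [0.5, 0.3, 0.2]
entropy = -sum(p * math.log(p) for p in probs)   # H for the iid case
estimate = empirical_entropy_rate(probs, n=50000)
print(estimate, entropy)
```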

Remark 8.6.5: In the iid case this is a consequence of the strong law of large numbers, and $H$ can be identified as $\sum_{j=1}^{k} (-\log p_j)\,p_j$ where $p_j = P(X_1 = a_j)$, $1 \le j \le k$. This is called the Kolmogorov-Shannon entropy of the distribution $\{p_j : 1 \le j \le k\}$. If $\{X_i\}_{i\ge 1}$ is a stationary ergodic Markov chain, then again it is a consequence of the strong law of large numbers, and $H$ can be identified with
\[ E\big({-\log p(X_2 \mid X_1)}\big) = \sum_{i=1}^{k} \pi_i \sum_{j=1}^{k} (-\log p_{ij})\,p_{ij} \]

where $\pi \equiv \{\pi_i : 1 \le i \le k\}$ is the stationary distribution and $P \equiv ((p_{ij}))$ is the transition probability matrix of the Markov chain $\{X_i\}_{i\ge 1}$. See Problem 8.27.

A more general version of the ergodic Theorem 8.6.1 is the following.

Theorem 8.6.5: (Kingman's subadditive ergodic theorem). Let $\{X_{m,n} : 0 \le m < n\}_{n\ge 1}$ be a collection of random variables such that

(i) $X_{0,m} + X_{m,n} \ge X_{0,n}$ for all $0 \le m < n$, $n \ge 1$.

(ii) For all $k \ge 1$, $\{X_{nk,(n+1)k}\}_{n\ge 1}$ is a stationary sequence.

(iii) The sequence $\{X_{m,m+k}, k \ge 1\}$ has a distribution that does not depend on $m \ge 0$.

(iv) $EX_{0,1}^{+} < \infty$ and for all $n$, $\frac{EX_{0,n}}{n} \ge \gamma_0$, where $\gamma_0 > -\infty$.

Then

(i) $\lim_{n\to\infty} \frac{EX_{0,n}}{n} = \inf_{n\ge 1} \frac{EX_{0,n}}{n} \equiv \gamma$.

(ii) $\lim_{n\to\infty} \frac{X_{0,n}}{n} \equiv X$ exists w.p. 1 and in $L^1$, and $EX = \gamma$.

(iii) If $\{X_{nk,(n+1)k}\}_{n\ge 1}$ is ergodic for each $k \ge 1$, then $X \equiv \gamma$ w.p. 1.

A nice application of this is a result on products of random matrices.

Theorem 8.6.6: Let $\{A_i\}_{i\ge 1}$ be a stationary sequence of $k \times k$ random matrices with nonnegative entries. Let $\alpha_{m,n}(i, j)$ be the $(i, j)$th entry in the product $A_{m+1} \cdots A_n$. Suppose $E|\log \alpha_{1,2}(i, j)| < \infty$ for all $i, j$. Then

(i) $\lim_{n\to\infty} \frac{1}{n} \log \alpha_{0,n}(i, j) = \eta$ exists w.p. 1.

(ii) For any $m$, $\lim_{n\to\infty} \frac{1}{n} \log \|A_{m+1} \cdots A_n\| = \eta$ w.p. 1, where for any $k \times k$ matrix $B \equiv ((b_{ij}))$, $\|B\| = \max\big\{\sum_{j=1}^{k} |b_{ij}| : 1 \le i \le k\big\}$.
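
The two limits in Theorem 8.6.6 can be compared in a degenerate special case that is easy to verify by hand. The sketch below is an added illustration, not the general random setting: it assumes the constant (hence trivially stationary) sequence $A_i \equiv A$ for a fixed matrix $A$ with positive integer entries, for which $\eta$ is the logarithm of the largest eigenvalue of $A$. The matrix `A` and the helper names are hypothetical choices.

```python
import math

def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    k = len(a)
    return [[sum(a[i][r] * b[r][j] for r in range(k)) for j in range(k)]
            for i in range(k)]

def row_sum_norm(b):
    """||B|| = max over rows i of sum_j |b_ij|, as in Theorem 8.6.6 (ii)."""
    return max(sum(abs(x) for x in row) for row in b)

def growth_rates(matrices):
    """Return ((1/n) log alpha_{0,n}(1, 1), (1/n) log ||A_1 ... A_n||)."""
    product = matrices[0]
    for m in matrices[1:]:
        product = mat_mul(product, m)
    n = len(matrices)
    return math.log(product[0][0]) / n, math.log(row_sum_norm(product)) / n

A = [[2, 1], [1, 1]]   # positive integer entries, so the products stay exact
entry_rate, norm_rate = growth_rates([A] * 500)
eta = math.log((3 + math.sqrt(5)) / 2)   # log of the largest eigenvalue of A
print(entry_rate, norm_rate, eta)
```

Exact integer arithmetic is used so that the long product does not overflow; for genuinely random matrices with real entries one would instead renormalize the running product at each step and accumulate the logarithms of the normalizing constants.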


Remark 8.6.6: A concept related to ergodicity is that of mixing. A measure preserving transformation $T$ on a probability space $(\Omega, \mathcal{F}, P)$ is mixing if for all $A, B \in \mathcal{F}$,
\[ \lim_{n\to\infty} \big|P(A \cap T^{-n}B) - P(A)P(T^{-n}B)\big| = 0. \]
A stationary sequence of random variables $\{X_i\}_{i\ge 1}$ is mixing if the unilateral shift on the sequence space $\mathbb{R}^{\infty}$ induced by $\{X_i\}_{i\ge 1}$ is mixing. If $T$ is mixing and $A$ is $T$-invariant, then taking $B = A$ in the above yields $P(A) = P^2(A)$, i.e., $P(A) = 0$ or 1. Thus, if $T$ is mixing, then $T$ is ergodic. Conversely, if $T$ is ergodic, then by Theorem 8.6.1, for any $B$ in $\mathcal{F}$,
\[ \frac{1}{n}\sum_{j=1}^{n} I_B(T^j\omega) \to P(B) \quad \text{w.p. 1.} \]
Integrating both sides over $A$ w.r.t. $P$ yields $\frac{1}{n}\sum_{j=1}^{n} P(A \cap T^{-j}B) \to P(A)P(B)$, i.e., $T$ is mixing in an average sense, i.e., the Cesàro sense. A sufficient condition for a stationary sequence to be mixing is that the tail σ-algebra be trivial. If $\{X_i\}_{i\ge 1}$ is a stationary irreducible Markov chain with a countable state space, then it is mixing iff it is aperiodic. For proofs of the above results, see Durrett (2004).

8.7 Law of the iterated logarithm

Let $\{X_n\}_{n\ge 1}$ be a sequence of iid random variables with $EX_1 = 0$, $EX_1^2 = 1$. The SLLN asserts that the sample mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \to 0$ w.p. 1. The central limit theorem (to be proved later) asserts that for all $-\infty < a < b < \infty$, $P(a \le \sqrt{n}\,\bar{X}_n \le b) \to \Phi(b) - \Phi(a)$, where $\Phi(\cdot)$ is the standard Normal cdf. This suggests that $S_n = \sum_{i=1}^{n} X_i$ is of the order of magnitude $\sqrt{n}$ for large $n$. This raises the question of how large $\frac{S_n}{\sqrt{n}}$ gets as a function of $n$. It turns out that $S_n$ is of the order $\sqrt{2n \log\log n}$, i.e., $\frac{S_n}{\sqrt{n}}$ is of the order $\sqrt{2\log\log n}$. More precisely, the following holds:

Theorem 8.7.1: (Law of the iterated logarithm). Let $\{X_i(\omega)\}_{i\ge 1}$ be iid random variables on a probability space $(\Omega, \mathcal{F}, P)$ with mean zero and variance one. Let $S_0(\omega) = 0$, $S_n(\omega) = \sum_{i=1}^{n} X_i(\omega)$, $n \ge 1$. For each $\omega$, let $A(\omega)$ be the set of limit points of
\[ \Big\{ \frac{S_n(\omega)}{\sqrt{2n \log\log n}} \Big\}_{n\ge 1}. \]
Then $P\{\omega : A(\omega) = [-1, +1]\} = 1$. For a proof, see Durrett (2004).
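
The normalization in Theorem 8.7.1 can be explored along a simulated path. The sketch below is an added illustration under stated assumptions: the $X_i$ are taken to be $\pm 1$ with probability $\frac{1}{2}$ each (mean zero, variance one), and the ratio is only evaluated for $n \ge 1000$ so that $\log\log n$ is positive and not too small. A finite path cannot exhibit the full limit set $[-1, +1]$; it only shows that the normalized sums stay of order one. The function name `lil_ratios` and the cutoffs are hypothetical choices.

```python
import math
import random

def lil_ratios(n_max, n_min=1000, seed=0):
    """Return S_n / sqrt(2 n log log n) for n_min <= n <= n_max along one simulated path."""
    rng = random.Random(seed)
    s = 0
    ratios = []
    for n in range(1, n_max + 1):
        s += 1 if rng.random() < 0.5 else -1   # X_n = +1 or -1 with probability 1/2
        if n >= n_min:
            ratios.append(s / math.sqrt(2 * n * math.log(math.log(n))))
    return ratios

ratios = lil_ratios(200000)
print(min(ratios), max(ratios))
```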


A deep generalization of the above was obtained by Strassen (1964).

Theorem 8.7.2: Under the setup of Theorem 8.7.1, the following holds: Let $Y_n\big(\frac{j}{n}; \omega\big) = \frac{S_j(\omega)}{\sqrt{2n \log\log n}}$, $j = 0, 1, 2, \ldots, n$, and let $Y_n(t, \omega)$ be the function obtained by linearly interpolating the above values on $[0, 1]$. For each $\omega$, let $B(\omega)$ be the set of limit points of $\{Y_n(\cdot, \omega)\}_{n\ge 1}$ in the function space $C[0, 1]$ of all continuous functions on $[0, 1]$ with the supnorm. Then $P\{\omega : B(\omega) = K\} = 1$, where
\[ K \equiv \Big\{ f : f : [0, 1] \to \mathbb{R},\ f \text{ is absolutely continuous},\ f(0) = 0 \text{ and } \int_0^1 (f'(t))^2\,dt \le 1 \Big\}. \]

8.8 Problems

8.1 Prove Theorem 8.1.3 and Corollary 8.1.4. (Hint: Use Chebychev's inequality.)

8.2 Let $\{X_n\}_{n\ge 1}$ be a sequence of random variables on a probability space $(\Omega, \mathcal{F}, P)$ such that for some $m \in \mathbb{N}$ and for each $i = 1, \ldots, m$, $\{X_i, X_{i+m}, X_{i+2m}, \ldots\}$ are identically distributed and pairwise independent. Furthermore, suppose that $E(|X_1| + \cdots + |X_m|) < \infty$. Show that
\[ \bar{X}_n \longrightarrow \frac{1}{m}\sum_{i=1}^{m} EX_i, \quad \text{w.p. 1.} \]
(Hint: Reduce the problem to nonnegative $X_n$'s and apply Theorem 8.2.7 for each $i = 1, \ldots, m$.)

8.3 Let $f$ be a bounded measurable function on $[0, 1]$ that is continuous at $\frac{1}{2}$. Evaluate
\[ \lim_{n\to\infty} \int_0^1 \int_0^1 \cdots \int_0^1 f\Big(\frac{x_1 + x_2 + \cdots + x_n}{n}\Big)\,dx_1\,dx_2 \ldots dx_n. \]

8.4 Show that if $P(|X| > \alpha) < \frac{1}{2}$ for some real number $\alpha$, then any median of $X$ must lie in the interval $[-\alpha, \alpha]$.

8.5 Prove Theorem 8.3.4 using Kolmogorov's first inequality (Theorem 8.3.1 (a)). (Hint: Apply Theorem 8.3.1 to $\Delta_{n,k}$ defined in the proof of Theorem 8.3.3 to establish (3.4).)

8.6 Let $\{X_n\}_{n\ge 1}$ be a sequence of iid random variables with $E|X_1|^{\alpha} < \infty$ for some $\alpha > 0$. Derive a necessary and sufficient condition on $\alpha$ for almost sure convergence of the series $\sum_{n=1}^{\infty} X_n \sin 2\pi n t$ for all $t \in (0, 1)$.


8.7 Show that for any given sequence of random variables $\{X_n\}_{n\ge 1}$, there exists a sequence of real numbers $\{a_n\}_{n\ge 1} \subset (0, \infty)$ such that $\frac{X_n}{a_n} \to 0$ w.p. 1.

8.8 Let $\{X_n\}_{n\ge 1}$ be a sequence of independent random variables with
\[ P(X_n = 2) = P(X_n = n^{\beta}) = a_n, \qquad P(X_n = a_n) = 1 - 2a_n \]
for some $a_n \in (0, \frac{1}{3})$ and $\beta \in \mathbb{R}$. Show that $\sum_{n=1}^{\infty} X_n$ converges if and only if $\sum_{n=1}^{\infty} a_n < \infty$.

8.9 Let $\{X_n\}_{n\ge 1}$ be a sequence of iid random variables with $E|X_1|^p = \infty$ for some $p \in (0, 2)$. Then $P\big(\limsup_{n\to\infty} \big|n^{-1/p}\sum_{i=1}^{n} X_i\big| = \infty\big) = 1$.

8.10 For any random variable $X$ and any $r \in (0, \infty)$, $E|X|^r < \infty$ iff $\sum_{n=1}^{\infty} n^{r-1} (\log n)^r P(|X| > n \log n) < \infty$.
(Hint: Check that $\sum_{n=1}^{m} n^{r-1} (\log n)^r \sim r^{-1} m^r (\log m)^r$ as $m \to \infty$.)

8.11 Let $\{X_n\}_{n\ge 1}$ be a sequence of independent random variables with $EX_n = 0$, $EX_n^2 = \sigma_n^2$, $s_n^2 = \sum_{j=1}^{n} \sigma_j^2 \to \infty$. Then, show that for any $a > \frac{1}{2}$,
\[ s_n^{-2} (\log s_n^2)^{-a} \sum_{i=1}^{n} X_i \to 0 \quad \text{w.p. 1.} \]

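8.12 Show that for $p \in (0, 2)$, $p \ne 1$, (4.12) holds.
(Hint: For $p \in (1, 2)$,
\[ \sum_{n=1}^{\infty} |EZ_n / n^{1/p}| \le \sum_{n=1}^{\infty} E|X_1| I(|X_1|^p > n)\, n^{-1/p} = \sum_{j=1}^{\infty} \Big(\sum_{n=1}^{j} n^{-1/p}\Big) \cdot E|X_1| I(j < |X_1|^p \le j + 1) \le \frac{p}{p-1} E|X_1|^p < \infty, \]
by (4.10). For $p \in (0, 1)$,
\[ \sum_{n=1}^{\infty} |EZ_n / n^{1/p}| \le \sum_{j=1}^{\infty} \Big(\sum_{n=j}^{\infty} n^{-1/p}\Big) E|X_1| I(j - 1 < |X_1|^p \le j) \le \frac{1}{1-p} E|X_1|^p, \]
by (4.9).)

8.13 Let $Y_i = x_i\beta + \epsilon_i$, $i \ge 1$, where $\{\epsilon_n\}_{n\ge 1}$ is a sequence of iid random variables, $\{x_n\}_{n\ge 1}$ is a sequence of constants, and $\beta \in \mathbb{R}$ is a constant (the regression parameter). Let $\hat{\beta}_n = \sum_{i=1}^{n} x_i Y_i \big/ \sum_{i=1}^{n} x_i^2$ denote the least squares estimator of $\beta$. Let $n^{-1}\sum_{i=1}^{n} x_i^2 \to c \in (0, \infty)$ and $E\epsilon_1 = 0$.

(a) If $E|\epsilon_1|^{1+\delta} < \infty$ for some $\delta \in (0, \infty)$, then show that
\[ \hat{\beta}_n \longrightarrow \beta \quad \text{as} \quad n \to \infty, \quad \text{w.p. 1.} \tag{8.1} \]

(b) Suppose $\sup\{|x_i| : i \ge 1\} < \infty$ and $E|\epsilon_1| < \infty$. Show that (8.1) holds.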


8.14 (Strongly consistent estimation.) Let $\{X_i\}_{i\ge 1}$ be random variables on some probability space $(\Omega, \mathcal{F}, P)$ such that (i) for some integer $m \ge 1$ the collections $\{X_i : i \le n\}$ and $\{X_i : i \ge n + m\}$ are independent for each $n \ge 1$, and (ii) the distribution of $\{X_{i+j}; 0 \le j \le k\}$ does not depend on $i$, for all $k \ge 0$.

(a) Show that for every $\ell \ge 1$ and $h : \mathbb{R}^{\ell} \to \mathbb{R}$ with $E|h(X_1, X_2, \ldots, X_{\ell})| < \infty$, there are functions $\{f_n : \mathbb{R}^n \to \mathbb{R}\}_{n\ge 1}$ such that $f_n(X_1, X_2, \ldots, X_n) \to \lambda \equiv E h(X_1, X_2, \ldots, X_{\ell})$ w.p. 1. In this case, one says $\lambda$ is estimable from $\{X_i\}_{i\ge 1}$ in a strongly consistent manner.

(b) Now suppose the distribution $\mu(\cdot)$ of $X_1$ is a mixture of the form $\mu = \sum_{i=1}^{k} \alpha_i\mu_i$. Suppose there exist disjoint Borel sets $\{A_i\}_{1\le i\le k}$ in $\mathbb{R}$ such that $\mu_i(A_i) = 1$ for each $i$. Show that all the $\alpha_i$'s as well as $\lambda_i \equiv \int h_i(x)\,d\mu_i$, where $h_i \in L^1(\mu_i)$, are estimable from $\{X_i\}_{i\ge 1}$ in a strongly consistent manner.

8.15 (Normal numbers). Recall that in Section 4.5 it was shown that for any positive integer $p > 1$ and for any $0 \le \omega \le 1$, it is possible to write $\omega$ as
\[ \omega = \sum_{i=1}^{\infty} \frac{X_i(\omega)}{p^i} \tag{8.2} \]
where for each $i$, $X_i(\omega) \in \{0, 1, 2, \ldots, p - 1\}$. Recall also that such an expansion is unique except for $\omega$ of the form $q/p^n$, $q = 1, 2, \ldots, p^n - 1$, $n \ge 1$, in which case there are exactly two expansions, one of which is recurring. In what follows, for such $\omega$'s the recurring expansion will be the one used in (8.2). A number $\omega$ in $[0, 1]$ is called normal w.r.t. the integer $p$ if for every finite pattern $a_1 a_2 \ldots a_k$, where $k \ge 1$ is a positive integer and $a_i \in \{0, 1, 2, \ldots, p - 1\}$ for $1 \le i \le k$, the relative frequency $\frac{1}{n}\sum_{i=1}^{n} \delta_i(\omega)$, where
\[ \delta_i(\omega) = \begin{cases} 1 & \text{if } X_{i+j}(\omega) = a_{j+1},\ j = 0, 1, 2, \ldots, k - 1 \\ 0 & \text{otherwise,} \end{cases} \]
converges to $p^{-k}$ as $n \to \infty$. A number $\omega$ in $[0, 1]$ is called absolutely normal if it is normal w.r.t. $p$ for every integer $p > 1$. Show that the set $A$ of all numbers $\omega$ in $[0, 1]$ that are absolutely normal has Lebesgue measure one.
(Hint: Note that in (8.2), the functions $\{X_i(\omega)\}_{i\ge 1}$ are iid random variables. Now use Problem 8.14 repeatedly.)

8.16 Show that for the renewal sequence $\{S_n\}_{n=0}^{\infty}$, if $P(X_1 > 0) > 0$, then $\lim_{n\to\infty} S_n = \infty$ w.p. 1.


8.17 (a) Show that $\{a_n\}_{n\ge 0}$ of (5.16) is the unique solution to (5.15) by using generating functions (cf. Section 5.5).

(b) Deduce Theorems 8.5.13 and 8.5.14 from Theorems 8.5.11 and 8.5.12, respectively.
(Hint: For Theorem 8.5.13 use the DCT, and for Theorem 8.5.14, show first that
\[ \sum_{n=0}^{k} \underline{m}_n(h)\big(U((n+1)h) - U(nh)\big) \le a(kh) \le \sum_{n=0}^{k} \overline{m}_n(h)\big(U((n+1)h) - U(nh)\big).) \]

8.18 (a) Let b(·) : [0, ∞) → R be dri. Show that b(·) is Riemann integrable on every bounded interval. Conclude that if b(·) is dri it must be continuous almost everywhere w.r.t. Lebesgue measure. (b) Let b(·) : [0, ∞) → R be Riemann integrable on [0, K] for each K < ∞. Let h(·) : [0, ∞) → R+ be nonincreasing and integrable w.r.t. Lebesgue measure and |b(·)| ≤ h(·) on [0, ∞). Show that b(·) is dri. 8.19 Verify that the sequence {Yn }n≥0 in Example 8.5.2 and the process {Y (t) : t ≥ 0} in Example 8.5.3 are both regenerative. 8.20 Show that the map T in Example 8.6.4 in Section 8.6 is measure preserving. (Hint: Show that for 0 < a < b < 1, P ω : T ω ∈ (a, b) = (b − a).) 8.21 Let T be a measure preserving map on a probability space (Ω, F, P ). Show that A is almost T -invariant w.r.t. P iﬀ there exists a set A1 such that A1 = T −1 A1 and P (A A1 ) = 0. ∞ (Hint: Consider A1 = n=0 T −n A. ) 8.22 Show that a function h : Ω → R is I-measurable iﬀ h(ω) = h(T ω) for all ω where I is the σ-algebra of T -invariant sets. 8.23 Consider the sequence space R∞ , B(R∞ ) . Show that A ∈ B(R∞ ) is invariant w.r.t. the unilateral shift T implies that A is in the tail σ-algebra. 8.24 In Example 8.6.7 of Section 8.6, show that {Zi }i≥1 is a stationary sequence that is not ergodic. (Hint: Assuming it is ergodic, derive a contradiction using the ergodic Theorem 8.6.1.)


8.25 (a) Extend Example 8.6.7 to the Markov chain case with two disjoint irreducible positive recurrent subsets.

(b) Show that in Example 8.6.5, if $\theta_0$ is a rational multiple of $2\pi$, then $T$ is not ergodic.

8.26 (a) Verify that in Remark 8.6.1, $T$ is ergodic but $T^2$ is not.

(b) Construct a Markov chain with four states for which $T$ is ergodic but $T^2$ is not.

8.27 In Remark 8.6.5, prove the Shannon-McMillan-Breiman theorem directly for the Markov chain case.
(Hint: Express $p(X_1, X_2, \ldots, X_n)$ as $\big(\prod_{i=1}^{n-1} p_{X_i X_{i+1}}\big)\, p(X_1)$.)

8.28 Let $\{X_i\}_{i\ge 1}$ be iid Bernoulli (1/2) random variables. Let
\[ W_1 = \sum_{i=1}^{\infty} \frac{2X_{2i}}{4^i}, \qquad W_2 = \sum_{i=1}^{\infty} \frac{X_{2i-1}}{4^i}. \]

(a) Show that $W_1$ and $W_2$ are independent.

(b) Let $A_1 = \{\omega : \omega \in (0, 1)$ such that in the expansion of $\omega$ in base 4 only the digits 0 and 2 appear$\}$ and $A_2 = \{\omega : \omega \in (0, 1)$ such that in the expansion of $\omega$ in base 4 only the digits 0 and 1 appear$\}$. Show that $m(A_1) = m(A_2) = 0$, where $m(\cdot)$ is Lebesgue measure, and hence that the distributions of $W_1$ and $W_2$ are singular w.r.t. $m(\cdot)$.

(c) Let $W \equiv W_1 + W_2$. Then show that $W$ has the uniform (0,1) distribution.
(Hint: For (b) use the SLLN.)

Remark: This example shows that the convolution of two singular probability measures can be absolutely continuous w.r.t. Lebesgue measure.

8.29 Let $\{X_n\}_{n\ge 1}$ be a sequence of pairwise independent and identically distributed random variables with $P(X_1 \le x) = F(x)$, $x \in \mathbb{R}$. Fix $0 < p < 1$. Suppose that $F(\zeta_p + \epsilon) > p$ for all $\epsilon > 0$, where $\zeta_p = F^{-1}(p) \equiv \inf\{x : F(x) \ge p\}$. Show that $\hat{\zeta}_n \equiv F_n^{-1}(p) \equiv \inf\{x : F_n(x) \ge p\}$ converges to $\zeta_p$ w.p. 1, where $F_n(x) \equiv n^{-1}\sum_{i=1}^{n} I(X_i \le x)$, $x \in \mathbb{R}$, is the empirical distribution function of $X_1, \ldots, X_n$.


8.30 Let $\{X_i\}_{i\ge 1}$ be random variables such that $EX_i^2 < \infty$ for all $i \ge 1$. Suppose $\frac{1}{n}\sum_{i=1}^{n} EX_i \to 0$ and $a_n \equiv \frac{1}{n^2}\sum_{j=0}^{n} (n - j)v(j) \to 0$ as $n \to \infty$, where $v(j) = \sup_i |\mathrm{Cov}(X_i, X_{i+j})|$.

(a) Show that $\bar{X}_n \longrightarrow_p 0$.

(b) Suppose further that $\sum_{n=1}^{\infty} a_n < \infty$. Show that $\bar{X}_n \to 0$ w.p. 1.

(c) Show that as $n \to \infty$, $v(n) \to 0$ implies $a_n \to 0$ but the converse need not hold.

8.31 Let $\{X_i\}_{i\ge 1}$ be iid random variables with cdf $F(\cdot)$. Let $F_n(x) \equiv \frac{1}{n}\sum_{i=1}^{n} I(X_i \le x)$ be the empirical cdf. Suppose $x_n \to x_0$ and $F(\cdot)$ is continuous at $x_0$. Show that $F_n(x_n) \to F(x_0)$ w.p. 1.

8.32 Let $p$ be a positive integer $> 1$. Let $\{\delta_i\}_{i\ge 1}$ be iid random variables with distribution $P(\delta_1 = j) = p_j$, $0 \le j \le p - 1$, $p_j \ge 0$, $\sum_{j=0}^{p-1} p_j = 1$. Let $X = \sum_{i=1}^{\infty} \frac{\delta_i}{p^i}$. Show that

(a) $P(X \in (0, 1)) = 1$.

(b) $F_X(x) \equiv P(X \le x)$ is continuous and strictly increasing in $(0, 1)$ if $0 < p_j < 1$ for any $0 \le j \le p - 1$.

(c) $F_X(\cdot)$ is absolutely continuous iff $p_j = \frac{1}{p}$ for all $0 \le j \le p - 1$, in which case $F_X(x) \equiv x$, $0 \le x \le 1$.

8.33 (Random AR-series). Let $\{X_n\}_{n\ge 0}$ be a sequence of random variables such that
\[ X_{n+1} = \rho_{n+1}X_n + \epsilon_{n+1}, \quad n \ge 0, \]
where the sequence $\{(\rho_n, \epsilon_n)\}_{n\ge 1}$ is iid and independent of $X_0$.

(a) Show that if $E(\log|\rho_1|) < 0$ and $E(\log|\epsilon_1|)^+ < \infty$, then
\[ \hat{X}_n \equiv \sum_{j=0}^{n} \rho_1\rho_2 \cdots \rho_j\, \epsilon_{j+1} \]
converges w.p. 1.

(b) Show that under the hypothesis of (a), for any bounded continuous function $h : \mathbb{R} \to \mathbb{R}$ and for any distribution of $X_0$,
\[ E h(X_n) \to E h(\hat{X}_{\infty}). \]
(Hint: Show by the SLLN that there is a $0 < \lambda < 1$ such that $\rho_1\rho_2 \cdots \rho_j = O(\lambda^j)$ w.p. 1 as $j \to \infty$, and by Borel-Cantelli $|\epsilon_j| = O(\lambda'^j)$ for some $\lambda' > 0$ such that $\lambda\lambda' < 1$.)

8.34 (Iterated random functions). Let $(S, \rho)$ be a complete separable metric space. Let $(G, \mathcal{G})$ be a measurable space. Let $f : G \times S \to S$ be


a $\langle \mathcal{G} \times \mathcal{B}(S), \mathcal{B}(S)\rangle$-measurable function. Let $(\Omega, \mathcal{F}, P)$ be a probability space and $\{\theta_i\}_{i\ge 1}$ be iid $G$-valued random variables on $(\Omega, \mathcal{F}, P)$. Let $X_0$ be an $S$-valued random variable on $(\Omega, \mathcal{F}, P)$ independent of $\{\theta_i\}_{i\ge 1}$. Define $\{X_n\}_{n\ge 0}$ by the random iteration scheme
\[ X_0(x, \omega) \equiv x, \qquad X_{n+1}(x, \omega) = f\big(\theta_{n+1}(\omega), X_n(x, \omega)\big), \quad n \ge 0. \]

(a) Show that for each $n \ge 0$, the map $X_n : S \times \Omega \to S$ is $\langle \mathcal{B}(S) \times \mathcal{F}, \mathcal{B}(S)\rangle$-measurable.

(b) Let $f_n(x) \equiv f_n(x, \omega) \equiv f(\theta_n(\omega), x)$. Let $\hat{X}_n(x, \omega) = f_1(f_2(\cdots f_n(x) \cdots))$. Show that for each $x$ and $n$, $\hat{X}_n(x, \omega)$ and $X_n(x, \omega)$ have the same distribution.

(c) Now assume that for all $\omega$, $f(\theta_1(\omega), x)$ is Lipschitz from $S$ to $S$, i.e.,
\[ \ell_i(\omega) \equiv \sup_{x \ne y} \frac{d(f(\theta_i(\omega), x), f(\theta_i(\omega), y))}{d(x, y)} < \infty. \]
Show that $\ell_i(\omega)$ is a random variable on $(\Omega, \mathcal{F}, P)$, i.e., that $\ell_i(\cdot) : \Omega \to \mathbb{R}_+$ is $\langle \mathcal{F}, \mathcal{B}(\mathbb{R})\rangle$-measurable.

(d) Suppose that $E|\log \ell_1(\omega)| < \infty$ and $E\log \ell_1(\omega) < 0$, and $E|\log d(f(\theta_1, x), x)| < \infty$ for all $x$. Show that $\lim_n \hat{X}_n(x, \omega) = \hat{X}_{\infty}(\omega)$ exists w.p. 1 and is independent of $x$ w.p. 1.
(Hint: Use Borel-Cantelli to show that for each $x$, $\{\hat{X}_n(x, \omega)\}_{n\ge 1}$ is Cauchy in $(S, \rho)$.)

(e) Under the hypothesis in (d), show that for any bounded continuous $h : S \to \mathbb{R}$ and for any $x \in S$, $\lim_{n\to\infty} E h(X_n(x, \omega)) = E h(\hat{X}_{\infty}(\omega))$.

(f) Deduce the results in Problems 7.15 and 8.33 as special cases.

8.35 (Extension of Glivenko-Cantelli (Theorem 8.2.4) to the multivariate case). Let $\{X_n\}_{n\ge 1}$ be a sequence of pairwise independent and identically distributed random vectors taking values in $\mathbb{R}^k$ with cdf $F(x) \equiv P(X_{11} \le x_1, X_{12} \le x_2, \ldots, X_{1k} \le x_k)$, where $X_1 = (X_{11}, X_{12}, \ldots, X_{1k})$ and $x = (x_1, x_2, \ldots, x_k) \in \mathbb{R}^k$. Let $F_n(x) \equiv \frac{1}{n}\sum_{i=1}^{n} I(X_i \le x)$ be the empirical cdf based on $\{X_i\}_{1\le i\le n}$. Show that $\sup\{|F_n(x) - F(x)| : x \in \mathbb{R}^k\} \to 0$ w.p. 1.
(Hint: First prove an extension of Pólya's theorem (Lemma 8.2.6) to the multivariate case.)

9 Convergence in Distribution

9.1 Definitions and basic properties

In this section, the notion of 'convergence in distribution' of a sequence of random variables is discussed. The importance and usefulness of this notion lie in the following observation: if a sequence of random variables $X_n$ converges in distribution to a random variable $X$, then one may approximate the probabilities $P(X_n \in A)$ by $P(X \in A)$ for large $n$ for a large class of sets $A \in \mathcal{B}(\mathbb{R})$. In many situations, exact evaluation of $P(X_n \in A)$ is a more difficult task than the evaluation of $P(X \in A)$. As a result, one may work with the limiting value $P(X \in A)$ instead of $P(X_n \in A)$ when $n$ is large.

As an example, consider the following problem from statistical inference. Let $Y_1, Y_2, \ldots$ be a collection of iid random variables with a finite second moment. Suppose that one is interested in finding the observed level of significance or the p-value for a statistical test of the hypotheses $H_0 : \mu = 0$ against an alternative $H_1 : \mu \ne 0$ about the population mean $\mu$. If the test statistic $\bar{Y}_n = n^{-1}\sum_{i=1}^{n} Y_i$ is used and the test rejects $H_0$ for large values of $|\sqrt{n}\,\bar{Y}_n|$, then the p-value of the test can be found using the function $\psi_n(a) \equiv P_0(|\sqrt{n}\,\bar{Y}_n| > a)$, $a \in [0, \infty)$, where $P_0$ denotes the joint distribution of $\{Y_n\}_{n\ge 1}$ under $\mu = 0$. Note that here, finding $\psi_n(\cdot)$ is difficult, as it depends on the joint distribution of $Y_1, \ldots, Y_n$. If, however, one knows that under $\mu = 0$, $\sqrt{n}\,\bar{Y}_n$ converges in distribution to a normal random variable $Z$ (which is in fact guaranteed by the central limit theorem, see Chapter 11), then one may approximate $\psi_n(a)$ by $P(|Z| > a)$, which can be found, e.g., by using a table of normal probabilities.


The formal definition of 'convergence in distribution' is given below.

Definition 9.1.1: Let $X_n$, $n \ge 0$ be a collection of random variables and let $F_n$ denote the cdf of $X_n$, $n \ge 0$. Then, $\{X_n\}_{n\ge 1}$ is said to converge in distribution to $X_0$, written as $X_n \longrightarrow^d X_0$, if
\[ \lim_{n\to\infty} F_n(x) = F_0(x) \quad \text{for every} \quad x \in C(F_0), \tag{1.1} \]
where $C(F_0) = \{x \in \mathbb{R} : F_0 \text{ is continuous at } x\}$.

Definition 9.1.2: Let $\{\mu_n\}_{n\ge 0}$ be probability measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Then $\{\mu_n\}_{n\ge 1}$ is said to converge to $\mu_0$ weakly or in distribution, denoted by $\mu_n \longrightarrow^d \mu_0$, if (1.1) holds with $F_n(x) \equiv \mu_n\big((-\infty, x]\big)$, $x \in \mathbb{R}$, $n \ge 0$.

Unlike the notions of convergence in probability and convergence almost surely, the notion of convergence in distribution does not require that the random variables $X_n$, $n \ge 0$ be defined on a common probability space. Indeed, for each $n \ge 0$, $X_n$ may be defined on a different probability space $(\Omega_n, \mathcal{F}_n, P_n)$ and $\{X_n\}_{n\ge 1}$ may converge in distribution to $X_0$. In such a context, the notions of convergence of $\{X_n\}_{n\ge 1}$ to $X_0$ in probability or almost surely are not well defined. Definition 9.1.1 requires only the cdfs of the $X_n$'s to converge to that of $X_0$ at each $x \in C(F_0) \subset \mathbb{R}$, but does not require the (almost sure or in probability) convergence of the random variables $X_n$ themselves.

Example 9.1.1: For $n \ge 1$, let $X_n \sim$ Uniform $(0, \frac{1}{n})$, i.e., $X_n$ has the cdf
\[ F_n(x) = \begin{cases} 0 & \text{if } x \le 0 \\ nx & \text{if } 0 < x < \frac{1}{n} \\ 1 & \text{if } x \ge \frac{1}{n} \end{cases} \]
and let $X_0$ be the degenerate random variable taking the value 0 with probability 1, i.e., the cdf of $X_0$ is
\[ F_0(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{if } x \ge 0. \end{cases} \]
Note that the function $F_0(x)$ is discontinuous only at $x = 0$. Hence, $C(F_0) = \mathbb{R} \setminus \{0\}$. It is easy to verify that for every $x \ne 0$,
\[ F_n(x) \to F_0(x) \quad \text{as} \quad n \to \infty. \]
Hence, $X_n \longrightarrow^d X_0$.

Example 9.1.2: Let $\{a_n\}_{n\ge 1}$ and $\{b_n\}_{n\ge 1}$ be sequences of real numbers such that $0 < b_n < \infty$ for all $n \ge 1$. Let $X_n \sim N(a_n, b_n)$, $n \ge 1$. Then, the cdf of $X_n$ is given by
\[ F_n(x) = \Phi\Big(\frac{x - a_n}{b_n}\Big), \quad x \in \mathbb{R}, \tag{1.2} \]


where Φ(x) = ∫_{−∞}^{x} φ(t) dt and φ(x) = (2π)^{−1/2} exp(−x²/2), x ∈ R. If X0 ∼ N(a0, b0) for some a0 ∈ R, b0 ∈ [0, ∞), then using (1.2), one can show that Xn −→d X0 if and only if an → a0 and bn → b0 as n → ∞ (Problem 9.8).

Next some simple implications of Definition 9.1.1 are considered.

Proposition 9.1.1: If Xn −→p X0, then Xn −→d X0.

Proof: Let Fn denote the cdf of Xn, n ≥ 0. Fix x ∈ C(F0). Then, for any ε > 0,

$$P(X_n \le x) \le P(X_0 \le x + \epsilon) + P(X_n \le x,\, X_0 > x + \epsilon) \le P(X_0 \le x + \epsilon) + P(|X_n - X_0| > \epsilon) \tag{1.3}$$

and similarly,

$$P(X_n \le x) \ge P(X_0 \le x - \epsilon) - P(|X_n - X_0| > \epsilon). \tag{1.4}$$

Hence, by (1.3) and (1.4),

$$F_0(x - \epsilon) - P(|X_n - X_0| > \epsilon) \le F_n(x) \le F_0(x + \epsilon) + P(|X_n - X_0| > \epsilon).$$

Since Xn −→p X0, letting n → ∞, one gets

$$F_0(x - \epsilon) \le \liminf_{n\to\infty} F_n(x) \le \limsup_{n\to\infty} F_n(x) \le F_0(x + \epsilon) \tag{1.5}$$

for all ε ∈ (0, ∞). Note that as x ∈ C(F0), F0(x−) = F0(x). Hence, letting ε ↓ 0 in (1.5), one has limn→∞ Fn(x) = F0(x). This proves the result. □

As pointed out before, the converse of Proposition 9.1.1 is false in general. The following is a partial converse. The proof follows from the definitions of convergence in probability and convergence in distribution and is left as an exercise (Problem 9.1).

Proposition 9.1.2: If Xn −→d X0 and P(X0 = c) = 1 for some c ∈ R, then Xn −→p c.

Theorem 9.1.3: Let Xn, n ≥ 0 be a collection of random variables with respective cdfs Fn, n ≥ 0. Then, Xn −→d X0 if and only if there exists a dense set D in R such that

$$\lim_{n\to\infty} F_n(x) = F_0(x) \quad \text{for all } x \in D. \tag{1.6}$$

Proof: Since C(F0)^c has at most countably many points, the ‘only if’ part follows. To prove the ‘if’ part, suppose that (1.6) holds. Fix x ∈ C(F0). Then, there exist sequences {xn}n≥1, {yn}n≥1 in D such that xn ↑ x and yn ↓ x as n → ∞. Hence, for any k, n ∈ N, Fn(xk) ≤ Fn(x) ≤ Fn(yk).


By (1.6), for every k ∈ N,

$$F_0(x_k) = \lim_{n\to\infty} F_n(x_k) \le \liminf_{n\to\infty} F_n(x) \le \limsup_{n\to\infty} F_n(x) \le \lim_{n\to\infty} F_n(y_k) = F_0(y_k). \tag{1.7}$$

Since x ∈ C(F0), limk→∞ F0(xk) = F0(x) = limk→∞ F0(yk). Hence, by (1.7), limn→∞ Fn(x) exists and equals F0(x). This completes the proof of Theorem 9.1.3. □

Theorem 9.1.4: (Polya’s theorem). Let Xn, n ≥ 0 be random variables with respective cdfs Fn, n ≥ 0, such that Xn −→d X0. If F0 is continuous on R, then

$$\sup_{x \in \mathbb{R}} \big| F_n(x) - F_0(x) \big| \to 0 \quad \text{as } n \to \infty.$$

Proof: This is a special case of Lemma 8.2.6 and uses the following proposition. □

Proposition 9.1.5: If a cdf F is continuous on R, then it is uniformly continuous on R.

The proof of Proposition 9.1.5 is left as an exercise (Problem 9.2).

Theorem 9.1.6: (Slutsky’s theorem). Let {Xn}n≥1 and {Yn}n≥1 be two sequences of random variables such that for each n ≥ 1, (Xn, Yn) is defined on a probability space (Ωn, ℱn, Pn). If Xn −→d X and Yn −→p a for some a ∈ R, then
(i) Xn + Yn −→d X + a,
(ii) Xn Yn −→d aX, and
(iii) Xn/Yn −→d X/a, provided a ≠ 0.

Proof: Only a proof of part (i) is given here. The other parts may be proved similarly. Let F0 denote the cdf of X. Then, the cdf of X + a is given by F(x) = F0(x − a), x ∈ R. Fix x ∈ C(F). Then, x − a ∈ C(F0). For any ε > 0 (as in the derivations of (1.3) and (1.4)),

$$P(X_n + Y_n \le x) \le P(|Y_n - a| > \epsilon) + P(X_n + a - \epsilon \le x) \tag{1.8}$$

and

$$P(X_n + Y_n \le x) \ge P(X_n + a + \epsilon \le x) - P(|Y_n - a| > \epsilon). \tag{1.9}$$

Now fix ε > 0 such that x − a − ε, x − a + ε ∈ C(F0). This is possible since R\C(F0) is countable. Then, from (1.8) and (1.9), it follows that

$$\limsup_{n\to\infty} P(X_n + Y_n \le x) \le \lim_{n\to\infty} \big[ P(|Y_n - a| > \epsilon) + P(X_n \le x - a + \epsilon) \big] = F_0(x - a + \epsilon) \tag{1.10}$$

and similarly,

$$\liminf_{n\to\infty} P(X_n + Y_n \le x) \ge F_0(x - a - \epsilon). \tag{1.11}$$

Now letting ε → 0+ in such a way that x − a ± ε ∈ C(F0), from (1.10) and (1.11), it follows that

$$F_0((x - a)-) \le \liminf_{n\to\infty} P(X_n + Y_n \le x) \le \limsup_{n\to\infty} P(X_n + Y_n \le x) \le F_0(x - a).$$

Since x − a ∈ C(F0), (i) is proved. □
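Part (i) of Slutsky’s theorem can be illustrated by simulation. The sketch below is only an illustration under assumptions that are not in the text: Xn is taken to be a standardized sum of n iid Uniform(−√3, √3) variables (so that Xn −→d N(0, 1) by the central limit theorem) and Yn = a + V/n with V ∼ Uniform(−1, 1) (so that Yn −→p a). The function names, the seed, and the sample sizes are hypothetical choices.

```python
import math
import random

def normal_cdf(x: float) -> float:
    """cdf of N(0, 1)."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def simulate_sum_cdf(n: int, a: float, x: float, reps: int, seed: int = 0) -> float:
    """Monte Carlo estimate of P(X_n + Y_n <= x).

    Assumed toy model: X_n = n**-0.5 * (sum of n iid Uniform(-sqrt(3), sqrt(3))),
    each summand having variance 1, and Y_n = a + V / n with V ~ Uniform(-1, 1).
    """
    rng = random.Random(seed)
    half_width = math.sqrt(3.0)
    hits = 0
    for _ in range(reps):
        x_n = sum(rng.uniform(-half_width, half_width) for _ in range(n)) / math.sqrt(n)
        y_n = a + rng.uniform(-1.0, 1.0) / n
        if x_n + y_n <= x:
            hits += 1
    return hits / reps
```

With a = 2 and x = 2.5, the estimate simulate_sum_cdf(30, 2.0, 2.5, 20000) should be close to normal_cdf(0.5), the value F0(x − a) predicted by part (i), up to Monte Carlo error.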

9.2 Vague convergence, Helly-Bray theorems, and tightness

One version of the Bolzano-Weierstrass theorem from real analysis states that if A ⊂ [0, 1] is an infinite set, then there exists a sequence {xn}n≥1 ⊂ A such that limn→∞ xn ≡ x exists in [0, 1]. Note that x need not be in A unless A is closed. There is an analog of this for sub-probability measures on (R, B(R)), i.e., for measures µ on (R, B(R)) such that µ(R) ≤ 1. First, one needs a definition of convergence of sub-probability measures.

Definition 9.2.1: Let {µn}n≥1, µ be sub-probability measures on (R, B(R)). Then {µn}n≥1 is said to converge to µ vaguely, denoted by µn −→v µ, if there exists a set D ⊂ R such that D is dense in R and

$$\mu_n((a, b]) \to \mu((a, b]) \quad \text{as } n \to \infty \quad \text{for all } a, b \in D. \tag{2.1}$$

Example 9.2.1: Let {Xn}n≥1, X be random variables such that Xn converges to X in distribution, i.e.,

$$F_n(x) \equiv P(X_n \le x) \to F(x) \equiv P(X \le x) \tag{2.2}$$

for all x ∈ C(F), the set of continuity points of F. Since the complement of C(F) is at most countable, (2.2) implies that µn −→v µ where µn(·) ≡ P(Xn ∈ ·) and µ(·) ≡ P(X ∈ ·).

Remark 9.2.1: It follows from above that if {µn}n≥1, µ are probability measures, then

$$\mu_n \longrightarrow^{d} \mu \;\Rightarrow\; \mu_n \longrightarrow^{v} \mu. \tag{2.3}$$


Conversely, it is not difficult to show that (Problem 9.4) if µn −→v µ and µn and µ are probability measures, then µn −→d µ.

Example 9.2.2: Let µn be the probability measure corresponding to the Uniform distribution on [−n, n], n ≥ 1. It is easy to show that µn −→v µ0, where µ0 is the measure that assigns zero mass to all Borel sets. This shows that if µn −→v µ, then µn(R) need not converge to µ(R). But if µn(R) does converge to µ(R) and µ(R) > 0 and if µn −→v µ, then it can be shown that µ′n −→d µ′, where µ′n = µn/µn(R) and µ′ = µ/µ(R).

Theorem 9.2.1: (Helly’s selection theorem). Let A be an infinite collection of sub-probability measures on (R, B(R)). Then, there exist a sequence {µn}n≥1 ⊂ A and a sub-probability measure µ such that µn −→v µ.

Proof: Let D ≡ {rn}n≥1 be a countable dense set in R (for example, one may take D = Q, the set of rationals, or D = Dd, the set of all dyadic rationals of the form {j/2^n : j an integer, n a positive integer}). Let for each x, A(x) ≡ {µ((−∞, x]) : µ ∈ A}. Then A(x) ⊂ [0, 1] and so by the Bolzano-Weierstrass theorem applied to the set A(r1), one gets a sequence {µ1n}n≥1 ⊂ A such that limn→∞ F1n(r1) ≡ F(r1) exists, where F1i(x) ≡ µ1i((−∞, x]), x ∈ R. Next, applying the Bolzano-Weierstrass theorem to {F1n(r2)}n≥1 yields a further subsequence {µ2n}n≥1 ⊂ {µ1n}n≥1 ⊂ A such that limn→∞ F2n(r2) ≡ F(r2) exists, where F2i(x) ≡ µ2i((−∞, x]), i ≥ 1. Continuing this way, one obtains a sequence of nested subsequences {µjn}n≥1, j = 1, 2, . . . such that for each j, limn→∞ Fjn(rj) ≡ F(rj) exists. In particular, for the diagonal subsequence {µnn}n≥1,

$$\lim_{n\to\infty} F_{nn}(r_j) = F(r_j) \tag{2.4}$$

exists for all j. Now set

$$\tilde F(x) \equiv \inf\{F(r) : r > x,\ r \in D\}. \tag{2.5}$$

Then, F̃(·) is a nondecreasing right continuous function on R (Problem 9.5) and it equals F(·) on D. Let µ be the Lebesgue-Stieltjes measure generated by F̃. Since Fnn(x) ≤ 1 for all n and x, it follows that F̃(x) ≤ 1 for all x and hence that µ is a sub-probability measure. Suppose it is shown that (2.4) also implies that

$$\lim_{n\to\infty} F_{nn}(x) = \tilde F(x) \tag{2.6}$$

for all x ∈ C_F̃, the set of continuity points of F̃. Then, for all a, b ∈ C_F̃, µnn((a, b]) ≡ Fnn(b) − Fnn(a) → F̃(b) − F̃(a) = µ((a, b]) and hence µnn −→v µ. To establish (2.6), fix x ∈ C_F̃ and ε > 0. Then, there is a δ > 0 such that for all x − δ < y < x + δ, F̃(x) − ε < F̃(y) < F̃(x) + ε. This implies that there exist x − δ < r < x < r′ < x + δ, r, r′ ∈ D, with F̃(x) − ε < F(r) ≤ F̃(x) ≤ F(r′) < F̃(x) + ε. Since Fnn(r) ≤ Fnn(x) ≤ Fnn(r′), it follows that

$$\tilde F(x) - \epsilon \le \liminf_{n\to\infty} F_{nn}(x) \le \limsup_{n\to\infty} F_{nn}(x) \le \tilde F(x) + \epsilon,$$

establishing (2.6). □

Next, some characterization results on vague convergence and convergence in distribution will be established. These can then be used to define the notions of convergence of sub-probability measures on more general metric spaces.

Theorem 9.2.2: (The first Helly-Bray theorem or the Helly-Bray theorem for vague convergence). Let {µn}n≥1 and µ be sub-probability measures on (R, B(R)). Then µn −→v µ iff

$$\int f \, d\mu_n \to \int f \, d\mu \tag{2.7}$$

for all f ∈ C0(R) ≡ {g | g : R → R is continuous and lim|x|→∞ g(x) = 0}.

Proof: Let µn −→v µ and let f ∈ C0(R). Given ε > 0, choose K large such that |f(x)| < ε for |x| > K. Since µn −→v µ, there exists a dense set D ⊂ R such that µn((a, b]) → µ((a, b]) for all a, b ∈ D. Now choose a, b ∈ D such that a < −K and b > K. Since f is uniformly continuous on [a, b] and D is dense in R, there exist points x0 = a < x1 < x2 < · · · < xm = b in D such that sup{|f(x) − f(xi)| : xi ≤ x ≤ xi+1} < ε for all 0 ≤ i < m. Now

$$\int f \, d\mu_n = \int_{(-\infty, a]} f \, d\mu_n + \sum_{i=0}^{m-1} \int_{(x_i, x_{i+1}]} f \, d\mu_n + \int_{(b, \infty)} f \, d\mu_n$$

and so

$$\Big| \int f \, d\mu_n - \sum_{i=0}^{m-1} f(x_i)\, \mu_n((x_i, x_{i+1}]) \Big| < 2\epsilon + \epsilon \cdot \mu_n((a, b]) < 3\epsilon.$$

A similar approximation holds for ∫ f dµ. Since µn, µ are sub-probability measures, it follows that

$$\Big| \int f \, d\mu_n - \int f \, d\mu \Big| < 6\epsilon + \|f\| \sum_{i=0}^{m-1} \big| \mu_n((x_i, x_{i+1}]) - \mu((x_i, x_{i+1}]) \big|,$$

where ‖f‖ = sup{|f(x)| : x ∈ R}. Letting n → ∞ and noting that µn −→v µ and {xi : 0 ≤ i ≤ m} ⊂ D, one gets

$$\limsup_{n\to\infty} \Big| \int f \, d\mu_n - \int f \, d\mu \Big| \le 6\epsilon.$$

Since ε > 0 is arbitrary, (2.7) follows and the proof of the “only if” part is complete.

To prove the converse, let D be the set of points {x : µ({x}) = 0}. Fix a, b ∈ D, a < b. Let ε > 0. Let f1 be the function defined by

$$f_1(x) = \begin{cases} 1 & \text{if } a \le x \le b \\ 0 & \text{if } x < a - \epsilon \text{ or } x > b + \epsilon \\ \text{linear} & \text{on } a - \epsilon \le x < a,\ b \le x \le b + \epsilon. \end{cases}$$

Then, f1 ∈ C0(R) and by (2.7), ∫ f1 dµn → ∫ f1 dµ. But µn((a, b]) ≤ ∫ f1 dµn and ∫ f1 dµ ≤ µ((a − ε, b + ε]). Thus, lim supn→∞ µn((a, b]) ≤ µ((a − ε, b + ε]). Letting ε ↓ 0 and noting that a, b ∈ D, one gets

$$\limsup_{n\to\infty} \mu_n((a, b]) \le \mu((a, b]). \tag{2.8}$$

A similar argument with f2 = 1 on [a + ε, b − ε], f2 = 0 for x ≤ a and x ≥ b, and f2 linear in between, yields

$$\liminf_{n\to\infty} \mu_n((a, b]) \ge \mu((a, b]).$$

This with (2.8) completes the proof of the “if” part. □

Theorem 9.2.3: (The second Helly-Bray theorem or the Helly-Bray theorem for weak convergence). Let {µn}n≥1, µ be probability measures on (R, B(R)). Then, µn −→d µ iff

$$\int f \, d\mu_n \to \int f \, d\mu \tag{2.9}$$

for all f ∈ CB(R) ≡ {g | g : R → R, g is continuous and bounded}.

Proof: Let µn −→d µ. Let ε > 0 and f ∈ CB(R) be given. Choose K large such that µ((−K, K]) > 1 − ε. Also, choose a < −K and b > K such that µ({a}) = µ({b}) = 0, i.e., a, b ∈ D, where (as in the proof of Theorem 9.2.2) D ≡ {x : µ({x}) = 0}. Let a = x0 < x1 < · · · < xm = b be chosen so that x0, . . . , xm ∈ D and

$$\sup_{x_i \le x \le x_{i+1}} |f(x) - f(x_i)| < \epsilon$$

for all i = 0, . . . , m − 1. Since

$$\int f \, d\mu_n - \int f \, d\mu = \int_{(-\infty, a]} f \, d\mu_n - \int_{(-\infty, a]} f \, d\mu + \sum_{i=0}^{m-1} \Big( \int_{(x_i, x_{i+1}]} f \, d\mu_n - \int_{(x_i, x_{i+1}]} f \, d\mu \Big) + \int_{(b, \infty)} f \, d\mu_n - \int_{(b, \infty)} f \, d\mu,$$

it follows that

$$\Big| \int f \, d\mu_n - \int f \, d\mu \Big| < \|f\| \Big[ \mu_n((-\infty, a]) + \mu((-\infty, a]) + \sum_{i=0}^{m-1} \big| \mu_n((x_i, x_{i+1}]) - \mu((x_i, x_{i+1}]) \big| + \mu_n((b, \infty)) + \mu((b, \infty)) \Big].$$

Since a, b, x0, x1, . . . , xm ∈ D,

$$\limsup_{n\to\infty} \Big| \int f \, d\mu_n - \int f \, d\mu \Big| \le \|f\| \, 2\big(1 - \mu((a, b])\big) \le 2 \|f\| \epsilon.$$

Since ε > 0 is arbitrary, the “only if” part is proved. Next consider the “if” part. Since C0(R) ⊂ CB(R), (2.9) and Theorem 9.2.2 imply that µn −→v µ. As noted in Remark 9.2.1, if {µn}n≥1, µ are probability measures then µn −→v µ iff µn −→d µ. So the proof is complete. □

Definition 9.2.2: (a) A sequence of probability measures {µn}n≥1 on (R, B(R)) is called tight if for any ε > 0, there exists M = Mε ∈ (0, ∞) such that

$$\sup_{n \ge 1} \mu_n\big([-M, M]^c\big) < \epsilon. \tag{2.10}$$

(b) A sequence of random variables {Xn}n≥1 is called tight or stochastically bounded if the sequence of probability distributions {µn}n≥1 of {Xn}n≥1 is tight, i.e., given any ε > 0, there exists M = Mε ∈ (0, ∞) such that

$$\sup_{n \ge 1} P(|X_n| > M) < \epsilon. \tag{2.11}$$

Remark 9.2.3: In Definition 9.2.2 (b), the random variables Xn, n ≥ 1 need not be defined on a common probability space. If Xn is defined on a probability space (Ωn, ℱn, Pn), n ≥ 1, then (2.11) needs to be replaced by

$$\sup_{n \ge 1} P_n(|X_n| > M) < \epsilon.$$

Example 9.2.3: Let Xn ∼ Uniform(n, n + 1). Then, for any given M ∈ (0, ∞), P(|Xn| > M) ≥ P(Xn > M) = 1 for all n > M. Consequently, for any M ∈ (0, ∞),

$$\sup_{n \ge 1} P(|X_n| > M) = 1$$

and the sequence {Xn}n≥1 cannot be stochastically bounded.
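The failure of tightness in Example 9.2.3 can be checked with exact tail probabilities. The following is a minimal sketch; the helper names are hypothetical, and the supremum over all n is replaced by a maximum over finitely many n, which gives a lower bound that already equals 1.

```python
def tail_prob_uniform(lo: float, hi: float, m: float) -> float:
    """P(|X| > m) for X ~ Uniform(lo, hi), with lo < hi and m >= 0."""
    def cdf(t: float) -> float:
        return min(1.0, max(0.0, (t - lo) / (hi - lo)))
    # X is continuous, so P(|X| > m) = 1 - (F(m) - F(-m)).
    return 1.0 - (cdf(m) - cdf(-m))

def sup_tail(m: float, n_max: int) -> float:
    """Maximum over 1 <= n <= n_max of P(|X_n| > m) for X_n ~ Uniform(n, n + 1)."""
    return max(tail_prob_uniform(n, n + 1, m) for n in range(1, n_max + 1))
```

For any M, taking n_max larger than M gives sup_tail(M, n_max) = 1, so no choice of M can make the left side of (2.11) smaller than a given ε < 1.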


Example 9.2.4: For n ≥ 1, let

$$X_n \sim \text{Uniform}(a_n,\, 2 + a_n), \tag{2.12}$$

where an = (−1)^n. Then, {Xn}n≥1 is stochastically bounded. Indeed, |Xn| ≤ 3 for all n ≥ 1 and therefore, for any ε > 0, (2.11) holds with M = 3. Note that in this example, the sequence {Xn}n≥1 does not converge in distribution to a random variable X. From (2.12), it follows that as k → ∞,

$$X_{2k} \longrightarrow^{d} \text{Uniform}(1, 3), \qquad X_{2k-1} \longrightarrow^{d} \text{Uniform}(-1, 1). \tag{2.13}$$
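The two behaviours in Example 9.2.4 (tightness without convergence) can be seen directly from the cdfs. A minimal sketch with hypothetical function names:

```python
def cdf_example(n: int, x: float) -> float:
    """cdf at x of X_n ~ Uniform(a_n, a_n + 2) with a_n = (-1)**n."""
    a_n = (-1) ** n
    return min(1.0, max(0.0, (x - a_n) / 2.0))

def tail_example(n: int, m: float) -> float:
    """P(|X_n| > m) for the same X_n (a continuous distribution)."""
    return 1.0 - (cdf_example(n, m) - cdf_example(n, -m))
```

At x = 0.5 the cdf equals 0 along even n and 0.75 along odd n, so Fn(0.5) has no limit, while tail_example(n, 3) is 0 for every n, in line with (2.11) holding for M = 3.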

Examples 9.2.3 and 9.2.4 highlight two important characteristics of a tight sequence of random variables or probability measures. First, the notion of tightness of probability measures or random variables is analogous to the notion of boundedness of a sequence of real numbers. For a bounded sequence of real numbers {xn}n≥1, all the xn’s must lie in a bounded interval [−M, M], M ∈ (0, ∞). For a sequence of random variables {Xn}n≥1, the condition of tightness requires that given ε > 0 arbitrarily small, there exists an M = Mε in (0, ∞) such that for each n, Xn lies in [−M, M] with probability at least 1 − ε. Thus, for a tight sequence of random variables, no positive mass can escape to ±∞, which is contrary to what happens with the random variables {Xn}n≥1 of Example 9.2.3. The second property, illustrated by Example 9.2.4, is that like a bounded sequence of real numbers, a tight or stochastically bounded sequence of random variables may not converge in distribution, but has one or more convergent subsequences (cf. (2.13)). Indeed, the notion of tightness can be characterized by this property, as shown by the following result. For consistency with the other results in this section, it is stated in terms of probability measures instead of random variables.

Theorem 9.2.4: Let {µn}n≥1 be a sequence of probability measures on (R, B(R)). The sequence {µn}n≥1 is tight iff given any subsequence {µni}i≥1 of {µn}n≥1, there exists a further subsequence {µmi}i≥1 of {µni}i≥1 and a probability measure µ on (R, B(R)) such that

$$\mu_{m_i} \longrightarrow^{d} \mu \quad \text{as } i \to \infty. \tag{2.14}$$

Proof: Suppose that {µn}n≥1 is tight. Given any subsequence {µni}i≥1 of {µn}n≥1, by Helly’s selection theorem (Theorem 9.2.1), there exists a sub-probability measure µ and a further subsequence {µmi}i≥1 of {µni}i≥1 such that

$$\mu_{m_i} \longrightarrow^{v} \mu. \tag{2.15}$$

Next, fix ε ∈ (0, 1). Since {µn}n≥1 is tight, there exists M ∈ (0, ∞) such that

$$\sup_{n \ge 1} \mu_n\big([-M, M]^c\big) < \epsilon. \tag{2.16}$$

By (2.15) and (2.16), there exist a, b ∈ D, a < −M, b > M such that

$$\mu((a, b]) = \lim_{i\to\infty} \mu_{m_i}((a, b]) \ge \liminf_{i\to\infty} \mu_{m_i}([-M, M]) = 1 - \limsup_{i\to\infty} \mu_{m_i}\big([-M, M]^c\big) \ge 1 - \epsilon.$$

Since ε ∈ (0, 1) is arbitrary, this shows that µ is a probability measure and hence, the ‘only if’ part is proved.

Next, consider the ‘if’ part. Suppose {µn}n≥1 is not tight. Then, there exists ε0 ∈ (0, 1) such that for all M ∈ (0, ∞),

$$\sup_{n \ge 1} \mu_n\big([-M, M]^c\big) > \epsilon_0.$$

Hence, for each k ∈ N, there exists nk ∈ N such that

$$\mu_{n_k}\big([-k, k]^c\big) \ge \epsilon_0. \tag{2.17}$$

Since any finite collection of probability measures on (R, B(R)) is tight, it follows that {µnk : k ∈ N} is a countably infinite set. Hence, by the hypothesis, there exist a subsequence {µmi}i≥1 in {µnk : k ∈ N} and a probability measure µ such that

$$\mu_{m_i} \longrightarrow^{d} \mu \quad \text{as } i \to \infty. \tag{2.18}$$

Let a, b ∈ R be such that µ({a}) = 0 = µ({b}) and µ((a, b]^c) < ε0/2. By (2.18), there exists i0 ≥ 1 such that for all i ≥ i0,

$$\mu_{m_i}\big((a, b]^c\big) < \mu\big((a, b]^c\big) + \epsilon_0/2 < \epsilon_0.$$

Since (a, b]^c ⊃ [−k, k]^c for all k > max{|a|, |b|} and {µmi : i ≥ i0} ⊂ {µnk : k ∈ N}, this contradicts (2.17). Hence, {µn}n≥1 is tight. □

Theorem 9.2.5: Let {µn}n≥1, µ be probability measures on (R, B(R)). If µn −→d µ, then {µn}n≥1 is tight.

Proof: Fix ε ∈ (0, ∞). Then, there exist a, b ∈ R such that µ({a}) = 0 = µ({b}) and µ((a, b]^c) < ε/2. Since µn −→d µ, there exists n0 ≥ 1 such that for all n ≥ n0, |µn((a, b]) − µ((a, b])| < ε/2. Thus, for all n ≥ n0,

$$\mu_n\big((a, b]^c\big) \le \mu\big((a, b]^c\big) + \epsilon/2 < \epsilon. \tag{2.19}$$

Also, for each i = 1, . . . , n0, there exists Mi ∈ (0, ∞) such that

$$\mu_i\big([-M_i, M_i]^c\big) < \epsilon. \tag{2.20}$$

Let M = max{Mi : 0 ≤ i ≤ n0}, where M0 = max{|a|, |b|}. Then by (2.19) and (2.20),

$$\sup_{n \ge 1} \mu_n\big([-M, M]^c\big) < \epsilon.$$

Thus, {µn}n≥1 is tight. □

An easy consequence of Theorems 9.2.4 and 9.2.5 is the following characterization of weak convergence.

Theorem 9.2.6: Let {µn}n≥1 be a sequence of probability measures on (R, B(R)). Then µn −→d µ iff {µn}n≥1 is tight and all weakly convergent subsequences of {µn}n≥1 converge to the same limiting probability measure µ.

Proof: If µn −→d µ, then any weakly convergent subsequence of {µn}n≥1 converges to µ and by Theorem 9.2.5, {µn}n≥1 is tight. Hence, the ‘only if’ part follows. To prove the ‘if’ part, suppose that {µn}n≥1 is tight and that all weakly convergent subsequences of {µn}n≥1 converge to µ. Let {Fn}n≥1 and F denote the cdfs corresponding to {µn}n≥1 and µ, respectively. If possible, suppose that {µn}n≥1 does not converge in distribution to µ. Then, by definition, there exists x0 ∈ R with µ({x0}) = 0 such that Fn(x0) does not converge to F(x0) as n → ∞. Then, there exist ε0 ∈ (0, 1) and a subsequence {ni}i≥1 such that

$$\big| F_{n_i}(x_0) - F(x_0) \big| \ge \epsilon_0 \quad \text{for all } i \ge 1. \tag{2.21}$$

Since {µn}n≥1 is tight, there exist a subsequence {mi}i≥1 ⊂ {ni}i≥1 and a probability measure µ0 such that

$$\mu_{m_i} \longrightarrow^{d} \mu_0 \quad \text{as } i \to \infty. \tag{2.22}$$

By hypothesis, µ0 = µ. Hence µ0({x0}) = µ({x0}) = 0 and by (2.22),

$$F_{m_i}(x_0) \to F(x_0) \quad \text{as } i \to \infty,$$

contradicting (2.21). Therefore, µn −→d µ. □

For another proof of the ‘if’ part, see Problem 9.6. Note that by Slutsky’s theorem, if Xn −→d X and Yn −→p 0, then Xn Yn −→p 0. The following result gives a refinement of this.

Proposition 9.2.7: If {Xn}n≥1 is stochastically bounded and Yn −→p 0, then Xn Yn −→p 0.

The proof is left as an exercise (Problem 9.7).


9.3 Weak convergence on metric spaces

The Helly-Bray theorems proved above suggest the following definitions of vague convergence and convergence in distribution for measures on metric spaces. Recall that (S, d) is called a metric space, if S is a nonempty set and d is a function from S × S → [0, ∞) such that
(i) d(x, y) = d(y, x) for all x, y ∈ S,
(ii) d(x, y) = 0 iff x = y for all x, y ∈ S,
(iii) d(x, z) ≤ d(x, y) + d(y, z) for all x, y, z ∈ S.

A common example of a metric space is given by S = R^k and d(x, y), the Euclidean distance. A set G ⊂ S is open if for all x ∈ G, there exists an ε > 0 such that for any y in S, d(x, y) < ε ⇒ y ∈ G. The set B(x, ε) = {y : d(x, y) < ε} is called the open ball of radius ε with center at x, x ∈ S, ε > 0. Recall that f : S → R is continuous if f^{−1}((a, b)) is open for every −∞ < a < b < ∞. A family 𝒢 of open sets in S is called an open cover for a set B ⊂ S if for each x ∈ B, there exists a G ∈ 𝒢 such that x ∈ G. A set K ⊂ S is called compact if given any open cover 𝒢 for K, there is a finite subfamily 𝒢1 ⊂ 𝒢 such that 𝒢1 is an open cover for K. Let 𝒮 be the Borel σ-algebra on S, i.e., let 𝒮 be the σ-algebra generated by the open sets in S. A measure on the measurable space (S, 𝒮) is often simply referred to as a measure on (S, d).

Definition 9.3.1: Let {µn}n≥1 and µ be sub-probability measures on a metric space (S, d), i.e., {µn}n≥1 and µ are measures on (S, 𝒮) such that µn(S) ≤ 1 for all n ≥ 1 and µ(S) ≤ 1. Then {µn}n≥1 converges vaguely to µ (written as µn −→v µ) if

$$\int f \, d\mu_n \to \int f \, d\mu \tag{3.1}$$

for all f ∈ C0(S), where C0(S) ≡ {f | f : S → R, f is continuous and for every ε > 0, there exists a compact set K such that |f(x)| < ε for all x ∉ K}.

Definition 9.3.2: Let {µn}n≥1 and µ be probability measures on a metric space (S, d). Then {µn}n≥1 converges in distribution or converges weakly to µ (written as µn −→d µ) if

$$\int f \, d\mu_n \to \int f \, d\mu \tag{3.2}$$

for all f ∈ CB(S) ≡ {f | f : S → R, f is continuous and bounded}.


Recall that a sequence {xn}n≥1 in a metric space (S, d) is called Cauchy if for every ε > 0, there exists N such that n, m > N ⇒ d(xn, xm) < ε. A metric space (S, d) is complete if every Cauchy sequence {xn}n≥1 in S converges in S, i.e., given a Cauchy sequence {xn}n≥1, there exists an x in S such that d(xn, x) → 0 as n → ∞.

Example 9.3.1: For any k ∈ N, R^k with the Euclidean metric is complete but the set of all rational vectors Q^k with the Euclidean metric d(x, y) ≡ ‖x − y‖ is not complete. The set C[0, 1] of all continuous functions on [0, 1] is complete with the supremum metric d(f, g) = sup{|f(u) − g(u)| : 0 ≤ u ≤ 1} but the set of all polynomials on [0, 1] is not complete under the same metric.

Recall that a set D is called dense in (S, d) if B(x, ε) ∩ D ≠ ∅ for all x ∈ S and for all ε > 0, where B(x, ε) is the open ball with center at x and radius ε. Also, (S, d) is called separable if there exists a countable dense set D ⊂ S.

Definition 9.3.3: A metric space (S, d) is called Polish if it is complete and separable.

Example 9.3.2: All Euclidean spaces with the Euclidean metric as well as with the Lp metric for 1 ≤ p ≤ ∞, are complete. The space C[0, 1] of continuous functions on [0, 1] with the supremum metric is complete. All Lp-spaces over measure spaces with a σ-finite measure and a countably generated σ-algebra, 1 ≤ p ≤ ∞, are complete (cf. Chapter 3).

The following theorem gives several equivalent conditions for weak convergence of probability measures on a Polish space.

Theorem 9.3.1: Let (S, d) be Polish and {µn}n≥1, µ be probability measures. Then the following are equivalent:
(i) µn −→d µ.
(ii) For any open set G, lim infn→∞ µn(G) ≥ µ(G).
(iii) For any closed set C, lim supn→∞ µn(C) ≤ µ(C).
(iv) For all B ∈ 𝒮 such that µ(∂B) = 0, limn→∞ µn(B) = µ(B), where ∂B is the boundary of B, i.e., ∂B = {x : for all ε > 0, B(x, ε) ∩ B ≠ ∅, B(x, ε) ∩ B^c ≠ ∅}.
(v) For every uniformly continuous and bounded function f : S → R, ∫ f dµn → ∫ f dµ.


The proof uses the following fact.

Proposition 9.3.2: For every open set G in a metric space (S, d), there exists a sequence {fn}n≥1 of bounded continuous functions from S to [0, 1] such that as n ↑ ∞, fn(x) ↑ IG(x) for all x ∈ S.

Proof: Let Gn ≡ {x : d(x, G^c) > 1/n} where for any set A in (S, d), d(x, A) ≡ inf{d(x, y) : y ∈ A}. Then since G is open, d(x, G^c) > 0 for all x in G. Thus Gn ↑ G. Let for each n ≥ 1,

$$f_n(x) \equiv \frac{d(x, G^c)}{d(x, G^c) + d(x, G_n)}, \quad x \in S. \tag{3.3}$$

Check that (Problem 9.10) for each n, fn(x) is continuous on S, fn(x) = 1 on Gn and 0 on G^c, 0 ≤ fn(x) ≤ 1 for all x in S. Further fn(·) ↑ IG(·). □

Proof of Theorem 9.3.1: (i) ⇒ (ii): Let G be open. Choose {fn}n≥1 as in Proposition 9.3.2. Then for j ∈ N,

$$\mu_n(G) \ge \int f_j \, d\mu_n \;\Rightarrow\; \liminf_{n\to\infty} \mu_n(G) \ge \liminf_{n\to\infty} \int f_j \, d\mu_n = \int f_j \, d\mu \quad \text{(by (i))}.$$

But limj→∞ ∫ fj dµ = µ(G), by the bounded convergence theorem. Hence (ii) holds.

(ii) ⇔ (iii): Suppose (ii) holds. Let C be closed. Then G = C^c is open. So by (ii),

$$\liminf_{n\to\infty} \mu_n(C^c) \ge \mu(C^c) \;\Rightarrow\; \limsup_{n\to\infty} \mu_n(C) \le \mu(C),$$

since µn and µ are probability measures. Thus, (iii) holds. Similarly, (iii) ⇒ (ii).

(iii) ⇒ (iv): For any B ∈ 𝒮, let B^0 and B̄ denote, respectively, the interior and the closure of B. That is, B^0 = {y : B(y, ε) ⊂ B for some ε > 0} and B̄ = {y : for some {xn}n≥1 ⊂ B, limn→∞ xn = y}. Then, for any n ≥ 1, µn(B^0) ≤ µn(B) ≤ µn(B̄) and by (ii) and (iii),

$$\mu(B^0) \le \liminf_{n\to\infty} \mu_n(B) \le \limsup_{n\to\infty} \mu_n(B) \le \mu(\bar B).$$

But ∂B = B̄ \ B^0 and so µ(∂B) = 0 implies µ(B^0) = µ(B̄). Thus, limn→∞ µn(B) = µ(B).

(iv) ⇒ (v) ⇒ (i): This will be proved for the case where S is the real line. For the general Polish case, see Billingsley (1968). Let F(x) ≡ µ((−∞, x]) and Fn(x) ≡ µn((−∞, x]), x ∈ R, n ≥ 1. Let x be a continuity point of F. Then µ({x}) = 0. Since if B = (−∞, x], then ∂B = {x}, by (iv), Fn(x) = µn((−∞, x]) → µ((−∞, x]) = F(x).


Thus, µn −→d µ. By Theorem 9.2.3, (i) holds and hence (v) holds.

(v) ⇒ (i): Note that in the proof of Theorem 9.2.2, the approximating functions f1 and f2 were both uniformly continuous. Hence, the assertion follows from Theorem 9.2.2 and Remark 9.2.1. This completes the proof of Theorem 9.3.1. □

The following example shows that the inequality can be strict in (ii) and (iii) of the above theorem.

Example 9.3.3: Let X be a random variable. Set Xn = X + 1/n, Yn = X − 1/n, n ≥ 1. Since Xn and Yn both converge to X w.p. 1, the distributions of Xn and Yn converge to that of X. Let µn, νn, and µ denote the distributions of Xn, Yn, and X, respectively. Now suppose that there is a value x0 such that P(X = x0) > 0. Then,

$$\mu_n((-\infty, x_0)) \equiv P(X_n < x_0) = P\Big(X < x_0 - \frac{1}{n}\Big) \to P(X < x_0) = \mu((-\infty, x_0)),$$

$$\mu_n((-\infty, x_0]) = P(X_n \le x_0) = P\Big(X \le x_0 - \frac{1}{n}\Big) \to P(X < x_0) < \mu((-\infty, x_0]),$$

and

$$\nu_n((-\infty, x_0)) \equiv P(Y_n < x_0) = P\Big(X < x_0 + \frac{1}{n}\Big) \to P(X \le x_0) > P(X < x_0).$$

Note that µn and νn both converge in distribution to µ. However, for the closed set (−∞, x0],

$$\limsup_{n\to\infty} \mu_n((-\infty, x_0]) < \mu((-\infty, x_0])$$

and for the open set (−∞, x0),

$$\liminf_{n\to\infty} \nu_n((-\infty, x_0)) > \mu((-\infty, x_0)).$$
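The strict inequalities of Example 9.3.3 can be verified exactly for a concrete X with an atom. The sketch below assumes, for illustration only, that P(X = 0) = P(X = 1) = 1/2 and takes x0 = 0; the names PMF, prob, mu_n_closed, and nu_n_open are hypothetical.

```python
from fractions import Fraction

# Hypothetical discrete law for X: P(X = 0) = P(X = 1) = 1/2, with an atom at x0 = 0.
PMF = {Fraction(0): Fraction(1, 2), Fraction(1): Fraction(1, 2)}

def prob(event) -> Fraction:
    """P(X in event), where event is a predicate on the values of X."""
    return sum((p for v, p in PMF.items() if event(v)), Fraction(0))

def mu_n_closed(n: int, x0: Fraction) -> Fraction:
    """mu_n((-inf, x0]) = P(X + 1/n <= x0)."""
    return prob(lambda v: v + Fraction(1, n) <= x0)

def nu_n_open(n: int, x0: Fraction) -> Fraction:
    """nu_n((-inf, x0)) = P(X - 1/n < x0)."""
    return prob(lambda v: v - Fraction(1, n) < x0)
```

Here mu_n_closed(n, 0) equals 0 for every n while µ((−∞, 0]) = 1/2, and nu_n_open(n, 0) equals 1/2 for every n while µ((−∞, 0)) = 0, which reproduces the two strict inequalities.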

Remark 9.3.1: Convergent sequences of probability distributions arise in a natural way in parametric families in mathematical statistics. For example, let µ(·; θ) denote the normal distribution with mean θ and variance 1. Then, θn → θ ⇒ µn(·) ≡ N(θn, 1) −→d N(θ, 1) ≡ µ(·). Similarly, let θ = (λ, Σ), where λ ∈ R^k and Σ is a k × k positive definite matrix. Let µ(·; θ) be the k-variate normal distribution with mean λ and variance-covariance matrix Σ. Then, µ(·; θ) is continuous in θ in the sense that if θn → θ in the Euclidean metric, then µ(·; θn) −→d µ(·; θ). Most parametric families in mathematical statistics possess this continuity property.

Definition 9.3.4: Let {µn}n≥1 be a sequence of probability measures on (S, 𝒮), where S is a Polish space and 𝒮 is the Borel σ-algebra on S. Then {µn}n≥1 is called tight if for any ε > 0, there exists a compact set K such that

$$\sup_{n \ge 1} \mu_n(K^c) < \epsilon. \tag{3.4}$$

A sequence of S-valued random variables {Xn}n≥1 is called tight or stochastically bounded if the sequence {µXn}n≥1 is tight, where µXn is the probability distribution of Xn on (S, 𝒮).

If S = R^k, k ∈ N, and {Xn}n≥1 is a sequence of k-dimensional random vectors, then, by Definition 9.3.4, {Xn}n≥1 is tight if and only if for every ε > 0, there exists M ∈ (0, ∞) such that

$$\sup_{n \ge 1} P(\|X_n\| > M) < \epsilon, \tag{3.5}$$

where ‖·‖ denotes the usual Euclidean norm on R^k. Furthermore, if Xn = (Xn1, . . . , Xnk), n ≥ 1, then the tightness of {Xn}n≥1 is equivalent to the tightness of the k sequences of random variables {Xnj}n≥1, j = 1, . . . , k (Problem 9.9).

An analog of Theorem 9.2.4 holds for probability measures on (S, 𝒮) when S is Polish.

Theorem 9.3.3: (Prohorov-Varadarajan theorem). Let {µn}n≥1 be a sequence of probability measures on (S, 𝒮) where S is a Polish space and 𝒮 is the Borel σ-algebra on S. Then, {µn}n≥1 is tight iff given any subsequence {µni}i≥1 ⊂ {µn}n≥1, there exist a further subsequence {µmi}i≥1 of {µni}i≥1 and a probability measure µ on (S, 𝒮) such that

$$\mu_{m_i} \longrightarrow^{d} \mu \quad \text{as } i \to \infty. \tag{3.6}$$

For a proof of this result, see Section 1.6 of Billingsley (1968). This result is useful for proving weak convergence in function spaces (e.g., see Chapter 11 where a functional central limit theorem is stated).

9.4 Skorohod’s theorem and the continuous mapping theorem

If {Xn}n≥1 is a sequence of random variables that converge to a random variable X in probability, then Xn does converge in distribution to X (cf. Proposition 9.1.1). Here is another proof of this fact using Theorem 9.2.3. Let f : R → R be bounded and continuous. Then Xn → X in probability implies that f(Xn) → f(X) in probability (Problem 9.13) and by the BCT,

$$\int f \, d\mu_n = E f(X_n) \to E f(X) = \int f \, d\mu,$$

where µn(·) = P(Xn ∈ ·), n ≥ 1 and µ(·) = P(X ∈ ·). Hence, µn −→d µ. In particular, it follows that if Xn → X w.p. 1, then Xn −→d X. Skorohod’s theorem is a sort of converse to this. If µn −→d µ, then there exist random variables Xn, n ≥ 1 and X such that Xn has distribution µn, n ≥ 1 and X has distribution µ and Xn → X w.p. 1.

Theorem 9.4.1: (Skorohod’s theorem). Let {µn}n≥1, µ be probability measures on (R, B(R)) such that µn −→d µ. Let

$$X_n(\omega) \equiv \sup\{t : \mu_n((-\infty, t]) < \omega\}, \qquad X(\omega) \equiv \sup\{t : \mu((-\infty, t]) < \omega\}$$

for 0 < ω < 1. Then, Xn and X are random variables on ((0, 1), B((0, 1)), m) where m is the Lebesgue measure. Furthermore, Xn has distribution µn, n ≥ 1, X has distribution µ and Xn → X w.p. 1.

Proof: For any cdf F(·), let F^{−1}(u) ≡ sup{t : F(t) < u}. Then for any u ∈ (0, 1) and t ∈ R, it can be verified that F^{−1}(u) ≤ t ⇒ F(t) ≥ u ⇒ F^{−1}(u) ≤ t and hence, if U is a Uniform(0, 1) random variable (Problem 9.11), P(F^{−1}(U) ≤ t) = P(U ≤ F(t)) = F(t), implying that F^{−1}(U) has cdf F(·). This shows that Xn, n ≥ 1 and X have the asserted distributions. It remains to show that

$$X_n(\omega) \to X(\omega) \quad \text{w.p. 1.}$$

Fix ω ∈ (0, 1) and let y < X(ω) be such that µ({y}) = 0. Now y < X(ω) ⇒ µ((−∞, y]) < ω. Since µn −→d µ and µ({y}) = 0, µn((−∞, y]) → µ((−∞, y]) and so µn((−∞, y]) < ω for large n. This implies that Xn(ω) ≥ y for large n and hence lim infn→∞ Xn(ω) ≥ y. Since this is true for all y < X(ω) with µ({y}) = 0, and since the set of all such y’s is dense in R, it follows that

$$\liminf_{n\to\infty} X_n(\omega) \ge X(\omega) \quad \text{for all } \omega \text{ in } (0, 1).$$

Next fix ε > 0 and y > X(ω + ε) with µ({y}) = 0. Then µ((−∞, y]) ≥ ω + ε. Since µ({y}) = 0, µn((−∞, y]) → µ((−∞, y]). Thus, for large n, µn((−∞, y]) ≥ ω. This implies that Xn(ω) ≤ y for large n and hence that lim supn→∞ Xn(ω) ≤ y. Since this is true for all y > X(ω + ε) with µ({y}) = 0, it follows that

$$\limsup_{n\to\infty} X_n(\omega) \le X(\omega + \epsilon) \quad \text{for every } \epsilon > 0$$

and hence that

$$\limsup_{n\to\infty} X_n(\omega) \le X(\omega+).$$

Thus it has been shown that for all 0 < ω < 1,

$$X(\omega) \le \liminf_{n\to\infty} X_n(\omega) \le \limsup_{n\to\infty} X_n(\omega) \le X(\omega+).$$

Since X(ω) is a nondecreasing function on (0, 1), it has at most a countable number of discontinuities and so

$$\lim_{n\to\infty} X_n(\omega) = X(\omega) \quad \text{w.p. 1.} \qquad \Box$$
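The quantile construction in Skorohod’s theorem can be sketched numerically. The code below approximates F^{−1}(u) = sup{t : F(t) < u} by bisection and applies it to an illustrative family that is not from the text: µn = Exponential with rate 1 + 1/n and µ = Exponential with rate 1, so that µn −→d µ. The function names and the bracketing interval are assumptions of this sketch.

```python
import math

def quantile(cdf, u: float, lo: float = -1e6, hi: float = 1e6, iters: int = 200) -> float:
    """Approximate F^{-1}(u) = sup{t : F(t) < u} by bisection on [lo, hi].

    Assumes cdf is nondecreasing with cdf(lo) < u <= cdf(hi).
    """
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if cdf(mid) < u:
            lo = mid  # mid lies in {t : F(t) < u}, so the sup is at least mid
        else:
            hi = mid  # F(mid) >= u, so the sup is at most mid
    return lo

def exp_cdf(rate: float):
    """cdf of the Exponential(rate) distribution (illustrative choice of mu_n)."""
    return lambda t: 1.0 - math.exp(-rate * t) if t > 0 else 0.0

def x_n(n: int, omega: float) -> float:
    """X_n(omega) for mu_n = Exponential(1 + 1/n)."""
    return quantile(exp_cdf(1.0 + 1.0 / n), omega)

def x_limit(omega: float) -> float:
    """X(omega) for mu = Exponential(1)."""
    return quantile(exp_cdf(1.0), omega)
```

For a fixed ω such as 0.3, the values x_n(n, 0.3) approach x_limit(0.3) = −log(0.7) as n grows, which is the pointwise convergence Xn(ω) → X(ω) asserted by the theorem in this toy case.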

An immediate consequence of the above theorem is the continuity of convergence in distribution under continuous transformations.

Theorem 9.4.2: (The continuous mapping theorem). Let {Xn}n≥1, X be random variables such that Xn −→d X. Let f : R → R be Borel measurable such that P(X ∈ Df) = 0, where Df is the set of discontinuities of f. Then f(Xn) −→d f(X). In particular, this holds if f : R → R is continuous.

Remark 9.4.1: It can be shown that for any f : R → R, the set Df = {x : f is discontinuous at x} ∈ B(R) (Problem 9.12). Thus, {X ∈ Df} ∈ ℱ, and P(X ∈ Df) is well defined.

Proof: By Skorohod’s theorem, there exist random variables {X̃n}n≥1, X̃ defined on the Lebesgue space (Ω = (0, 1), B((0, 1)), m = Lebesgue measure) such that X̃n =d Xn for n ≥ 1, X̃ =d X, and

$$\tilde X_n \to \tilde X \quad \text{w.p. 1.}$$

Let A = {ω : X̃n(ω) → X̃(ω)} and B = {ω : X̃(ω) ∉ Df}. Then, P(A) = 1 = P(B) and so, for ω ∈ A ∩ B, f(X̃n(ω)) → f(X̃(ω)). Thus, f(X̃n) → f(X̃) w.p. 1 and hence f(Xn) −→d f(X). □

Another easy consequence of Skorohod’s theorem is the Helly-Bray Theorem 9.2.3. Since X̃n → X̃ w.p. 1 and f is a bounded continuous function, f(X̃n) → f(X̃) w.p. 1 and so by the bounded convergence theorem

$$E f(\tilde X_n) \to E f(\tilde X).$$

Since X̃n =d Xn for n ≥ 1 and X̃ =d X, this is the same as saying that

$$E f(X_n) \to E f(X).$$

That is, ∫ f dµn → ∫ f dµ, where µn(·) = P(Xn ∈ ·), n ≥ 1 and µ(·) = P(X ∈ ·).

Remark 9.4.2: Skorohod’s theorem is valid for any Polish space. Suppose that S is a Polish space and {µn}n≥1 and µ are probability measures on (S, 𝒮), where 𝒮 is the Borel σ-algebra on S, such that µn −→d µ. Then there exist random variables Xn and X defined on the Lebesgue space ((0, 1), B((0, 1)), m = the Lebesgue measure) such that for all n ≥ 1, Xn has distribution µn, X has distribution µ and Xn → X w.p. 1. For a proof, see Billingsley (1968).

9.5 The method of moments and the moment problem

9.5.1 Convergence of moments

Let $\{X_n\}_{n\ge1}$ and $X$ be random variables such that $X_n$ converges to $X$ in distribution. Suppose for some $k > 0$, $E|X_n|^k < \infty$ for each $n \ge 1$. A natural question is: When does this imply $E|X|^k < \infty$ and $\lim_{n\to\infty} E|X_n|^k = E|X|^k$? By Skorohod's theorem, one can assume w.l.o.g. that $X_n \to X$ w.p. 1. Then the results from Section 2.5 yield the following.

Theorem 9.5.1: Let $\{X_n\}_{n\ge1}$ and $X$ be a collection of random variables such that $X_n \longrightarrow^{d} X$. Then, for each $0 < k < \infty$, the following are equivalent:

(i) $E|X_n|^k < \infty$ for each $n \ge 1$, $E|X|^k < \infty$ and $E|X_n|^k \to E|X|^k$.

(ii) $\{|X_n|^k\}_{n\ge1}$ are uniformly integrable, i.e., for every $\epsilon > 0$, there exists an $M \in (0, \infty)$ such that $\sup_{n\ge1} E\big(|X_n|^k I(|X_n| > M)\big) < \epsilon$.

Remark 9.5.1: Recall that a sufficient condition for the uniform integrability of $\{|X_n|^k\}_{n\ge1}$ is that $\sup_{n\ge1} E|X_n|^{\ell} < \infty$ for some $\ell \in (k, \infty)$.

Example 9.5.1: Let $X_n$ have the distribution $P(X_n = 0) = 1 - \frac{1}{n}$, $P(X_n = n) = \frac{1}{n}$ for $n = 1, 2, \ldots$. Then $X_n \longrightarrow^{d} 0$ but $EX_n = 1$ does not go to 0. Note that $\{X_n\}_{n\ge1}$ is not uniformly integrable here.
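The arithmetic behind Example 9.5.1 can be checked exactly. The following is a minimal sketch, not part of the original text; the function name `example_moments` and the truncation level `M` are illustrative choices. It returns the mass at 0, the mean, and the tail contribution $E(X_n I(X_n > M))$ that appears in the definition of uniform integrability.

```python
from fractions import Fraction


def example_moments(n, M):
    """Exact quantities for P(X_n = 0) = 1 - 1/n, P(X_n = n) = 1/n."""
    p_zero = 1 - Fraction(1, n)      # mass at 0, which tends to 1
    mean = n * Fraction(1, n)        # E X_n = n * (1/n) = 1 for every n
    # E[X_n ; X_n > M]: the only nonzero atom is n, so this is 1 once n > M.
    tail = Fraction(n, n) if n > M else Fraction(0)
    return p_zero, mean, tail


for n in (10, 100, 1000):
    print(n, example_moments(n, M=50))
```

For any fixed `M`, the tail term equals 1 for all `n > M`, so the supremum over `n` cannot be made small; this is the failure of uniform integrability noted in the example.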

Remark 9.5.2: In Theorem 9.5.1, under hypothesis (ii), it follows that $E|X_n|^{\ell} \to E|X|^{\ell}$ for all real numbers $\ell \in (0, k)$ and $EX_n^p \to EX^p$ for all positive integers $p$, $0 < p \le k$.

9.5.2 The method of moments

Suppose that $\{X_n\}_{n\ge1}$ are random variables such that $\lim_{n\to\infty} EX_n^k = m_k < \infty$ exists for all integers $k = 0, 1, 2, \ldots$. Does there exist a random variable $X$ such that $X_n \longrightarrow^{d} X$? The answer is ‘yes’ provided that the moments $\{m_k\}_{k\ge1}$ determine the distribution of the random variable $X$ uniquely.

Theorem 9.5.2: (Fréchet-Shohat theorem). Let $\{X_n\}_{n\ge1}$ be a sequence of random variables such that for each $k \in \mathbb{N}$, $\lim_{n\to\infty} EX_n^k \equiv m_k$ exists and is finite. If the sequence $\{m_k\}_{k\ge1}$ uniquely determines the distribution of a random variable $X$, then $X_n \longrightarrow^{d} X$.

Proof: Suppose that for some subsequence $\{n_j\}_{j\ge1}$, the probability distributions $\{\mu_{n_j}\}_{j\ge1}$ of $\{X_{n_j}\}_{j\ge1}$ converge vaguely to some $\mu$. Since $\{EX_{n_j}^2\}_{j\ge1}$ is a bounded sequence, $\{\mu_{n_j}\}_{j\ge1}$ is tight. Hence $\mu$ must be a probability distribution and by Theorem 9.5.1, the moments of $\mu$ must coincide with $\{m_k\}_{k\ge1}$. Since the sequence $\{m_k\}_{k\ge1}$ determines the distribution uniquely, $\mu$ is unique and is the unique vague limit point of $\{\mu_n\}_{n\ge1}$ and by Theorem 9.2.6, $\mu_n \longrightarrow^{d} \mu$. So if $X$ is a random variable with distribution $\mu$, then $X_n \longrightarrow^{d} X$. □

The above “method of moments” used to be a tool for proving convergence in distribution, e.g., for proving asymptotic normality of the Binomial $(n, p)$ distribution. Since it requires existence of all moments, this method is too restrictive and is of limited use. However, the question of when the moments determine a distribution is an interesting one and is discussed next.
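As a small illustration of the hypothesis of Theorem 9.5.2, consider $X_n = Y_n/n$ with $Y_n$ uniform on $\{1, \ldots, n\}$ (the setting of Problem 9.26). The sketch below, whose function names are illustrative and not from the text, computes $EX_n^k$ exactly and compares it with $1/(k+1)$, the $k$th moment of the Uniform (0,1) distribution. That limit has bounded support, so it is determined by its moments.

```python
from fractions import Fraction


def scaled_uniform_moment(n, k):
    """Exact E[(Y_n / n)^k] for Y_n uniform on {1, ..., n}."""
    return sum(Fraction(j, n) ** k for j in range(1, n + 1)) / n


def uniform01_moment(k):
    """k-th moment of Uniform(0, 1): the integral of x^k over (0, 1)."""
    return Fraction(1, k + 1)


for k in (1, 2, 3):
    gaps = [float(scaled_uniform_moment(n, k) - uniform01_moment(k))
            for n in (10, 100, 1000)]
    print(k, gaps)
```

The printed gaps shrink as `n` grows for each fixed `k`, which is the moment convergence the theorem requires.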

9.5.3 The moment problem

Suppose $\{m_k\}_{k\ge1}$ is a sequence of real numbers such that there is at least one probability measure $\mu$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such that for all $k \in \mathbb{N}$, $m_k = \int x^k \mu(dx)$. Does the sequence $\{m_k\}_{k\ge1}$ determine $\mu$ uniquely? This is a part of the Hamburger moment problem, which includes seeking conditions under which a given sequence $\{m_k\}_{k\ge1}$ is the moment sequence of a probability distribution. The answer to the uniqueness question posed above is ‘no,’ as the following example shows.

Example 9.5.2: Let $Y$ be a standard normal random variable and let $X = \exp(Y)$. Then $X$ is said to have the log-normal distribution (which is a misnomer as a more appropriate name would be something like exponormal). Then $X$ has the probability density function

$$f(x) = \begin{cases} \dfrac{1}{\sqrt{2\pi}}\,\dfrac{1}{x}\exp\big(-[\log x]^2/2\big), & x > 0, \\ 0, & \text{otherwise.} \end{cases} \tag{5.1}$$

Consider now the family of functions $f_\alpha(x) = f(x)\big(1 + \alpha \sin(2\pi \log x)\big)$ with $|\alpha| \le 1$. It is clear that $f_\alpha(x) \ge 0$. Further, it is not difficult to check that for any $\alpha \in [-1, 1]$, $\int x^r f_\alpha(x)\,dx = \int x^r f(x)\,dx$ for all $r = 0, 1, 2, \ldots$. Thus, the sequence of moments $m_k \equiv \int x^k f(x)\,dx$ does not determine the log-normal distribution (5.1).

A sufficient condition for uniqueness is Carleman's condition:

$$\sum_{k=1}^{\infty} m_{2k}^{-1/2k} = \infty. \tag{5.2}$$
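The moment identity claimed in Example 9.5.2 and the behavior of the series in (5.2) for the log-normal distribution can be checked numerically. The sketch below is illustrative only: the function names, the grid step, and the integration window are assumptions made here. After the substitution $y = \log x$, the $r$th moment of $f_\alpha$ splits into a base term plus $\alpha$ times a perturbation term, and the perturbation should vanish for integer $r$. The second function uses the standard fact $EX^r = e^{r^2/2}$ for the log-normal distribution, so that $m_{2k}^{-1/2k} = e^{-k}$.

```python
import math


def normal_pdf(y):
    return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)


def lognormal_moment_parts(r, half_width=12.0, step=1e-3):
    """Return (base, perturbation) for the r-th moment of f_alpha.

    With y = log x the r-th moment of f_alpha is base + alpha * perturbation:
      base         = integral of exp(r*y) * normal_pdf(y) dy,
      perturbation = integral of exp(r*y) * normal_pdf(y) * sin(2*pi*y) dy.
    Both are approximated by a Riemann sum on a grid centred at y = r.
    """
    n = int(round(half_width / step))
    base = 0.0
    perturbation = 0.0
    for i in range(-n, n + 1):
        y = r + i * step
        w = math.exp(r * y) * normal_pdf(y) * step
        base += w
        perturbation += w * math.sin(2.0 * math.pi * y)
    return base, perturbation


def carleman_partial_sum_lognormal(K):
    """Partial sum of (5.2) for the log-normal: m_{2k}^(-1/(2k)) = exp(-k)."""
    return sum(math.exp(-k) for k in range(1, K + 1))


for r in range(4):
    print(r, lognormal_moment_parts(r))
print(carleman_partial_sum_lognormal(60))
```

The perturbation term is zero up to rounding for each integer `r`, while the partial sums of (5.2) stay bounded for the log-normal moments. The latter is consistent with the example, since Carleman's condition is only sufficient for uniqueness.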

For a proof, see Feller (1966) or Shohat and Tamarkin (1943).

Remark 9.5.3: A special case of the above is when

$$\limsup_{k\to\infty} m_{2k}^{1/2k} = r \in [0, \infty). \tag{5.3}$$

In particular, if $\{m_k\}_{k\ge1}$ is a moment sequence, then within the class of probability distributions $\mu$ that have bounded support and have $\{m_k\}_{k\ge1}$ as their moment sequence, $\mu$ is uniquely determined. This is so since if $M \equiv \sup\{x : \mu([-x, x]) < 1\}$, then (Problem 9.27)

$$m_{2k}^{1/2k} \to M \quad \text{as} \quad k \to \infty. \tag{5.4}$$

More generally, if $\mu$ is a probability distribution on $\mathbb{R}$ such that $\int e^{tx}\,d\mu(x) < \infty$ for all $|t| < \delta$ for some $\delta > 0$, then all its moments are finite and (5.2) holds and hence $\mu$ is uniquely determined by its moments

(Problem 9.28). In particular, the normal and Gamma distributions are determined by their moments.

Remark 9.5.4: If $\{m_k\}_{k\ge1}$ is a moment sequence of a distribution $\mu$ concentrated on $[0, \infty)$, the problem of determining $\mu$ uniquely is known as the Stieltjes moment problem. If $X$ is a random variable with distribution $\mu$, let $Y = \delta\sqrt{X}$, where $\delta$ is independent of $X$ and takes two values $\{-1, +1\}$ with equal probability. Then $Y$ has a symmetric distribution and for all $k \ge 1$, $E|Y|^{2k} = E|X|^k$. The distribution of $Y$ is uniquely determined (and hence that of $\sqrt{X}$ and hence that of $X$) if $(EY^{2k})^{1/2k}$ lim sup 0, there exists $M \in (0, \infty)$ such that $F(-x) + 1 - F(x) < \epsilon$ for all $x > M$, and (ii) $F$ is uniformly continuous on $[-M, M]$.)

9.3 Prove parts (ii) and (iii) of Theorem 9.1.6.

9.4 Let $\{\mu_n\}_{n\ge1}$, $\mu$ be probability measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such that $\mu_n \longrightarrow^{v} \mu$. Show that $\mu_n \longrightarrow^{d} \mu$.

9.5 Show that the function $\tilde F(\cdot)$, defined in (2.5), is nondecreasing and right continuous and that the function $F(x) \equiv \inf\{F(r) : r \ge x, r \in D\}$ is nondecreasing and left continuous.

9.6 Give another proof of the ‘if’ part of Theorem 9.2.6 by using Theorem 9.2.1 and showing that for any $f : \mathbb{R} \to \mathbb{R}$ continuous and bounded and any subsequence $\{n_i\}_{i\ge1}$, there exists a further subsequence $\{m_j\}_{j\ge1}$ such that $a_{m_j} = \int f\,d\mu_{m_j} \to a = \int f\,d\mu$ and hence, $a_n \equiv \int f\,d\mu_n \to a$.

9.7 If $\{X_n\}_{n\ge1}$ is stochastically bounded and $Y_n \longrightarrow^{p} 0$, then show that $X_n Y_n \longrightarrow^{p} 0$.


9.8 (a) Let Xn ∼ N (an , bn ) for n ≥ 0, where bn > 0 for n ≥ 1, b0 ∈ [0, ∞) and an ∈ R for all n ≥ 0. (i) Show that if an → a0 , bn → b0 as n → ∞, then Xn −→d X0 . (ii) Show that if Xn −→d X0 as n → ∞, then an → a0 and bn → b0 . (Hint: First show that {bn }n≥1 is bounded and then that {an }n≥1 is bounded and ﬁnally, that a0 and b0 are the only limit points of {an }n≥1 and {bn }n≥1 , respectively.) (b) For n ≥ 1, let Xn ∼ N (an , Σn ) where an ∈ Rk and Σn is a k × k positive deﬁnite matrix, k ∈ N. Then, {Xn }n≥1 is stochastically bounded if and only if {an }n≥1 and {Σn }n≥1 are bounded. 9.9 Let {Xjn }n≥1 , j = 1, . . . , k, k ∈ N be sequences of random variables. Let Xn = (X1n , . . . , Xkn ), n ≥ 1. Show that the sequence of random vectors {Xn }n≥1 is tight in Rk iﬀ for each 1 ≤ j ≤ k, the sequence of random variables {Xjn }n≥1 is tight in R. 9.10 Let (S, d) be a metric space. (a) For any set A ⊂ S, let d(x, A) ≡ inf{d(x, y) : y ∈ A}. Show that for each A, d(·, A) is continuous on S. (b) Let fn (·) be as in (3.3). Show that fn (·) is continuous on S and fn (·) ↑ IG (·). (Hint: Note that d(x, Gc ) + d(x, Gn ) > 0 for all x in S. ) 9.11 For any cdf F , let F −1 (u) ≡ sup{t : F (t) < u}, 0 < u < 1. Show that for any 0 < u0 < 1 and t0 in R, F −1 (u0 ) ≤ t0 ⇔ F (t0 ) ≥ u0 . (Hint: For ⇒, use the right continuity of F and for ⇐, use the deﬁnition of sup.) 9.12 For a function f : Rk → R (k ∈ N), deﬁne Df = {x ∈ Rk : f is discontinuous at x}. Show that Df ∈ B(Rk ). 9.13 If Xn −→p X and f : R → R is continuous, then f (Xn ) −→p f (X). 9.14 (The Delta method). Let {Xn }n≥1 be a sequence of random variables and {an }n≥1 ⊂ (0, ∞) be a sequence of constants such that an → ∞ as n → ∞ and an (Xn − θ) −→d Z


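for some random variable $Z$ and for some $\theta \in \mathbb{R}$. Let $H : \mathbb{R} \to \mathbb{R}$ be a function that is differentiable at $\theta$ with derivative $c$. Show that $a_n\big(H(X_n) - H(\theta)\big) \longrightarrow^{d} cZ$. (Hint: By Taylor's expansion, for any $x \in \mathbb{R}$, $H(x) = H(\theta) + c(x - \theta) + R(x)(x - \theta)$ where $R(x) \to 0$ as $x \to \theta$. Now use Problem 9.7 and Slutsky's theorem.)

9.15 Let $X$ be a random variable with $P(X = c) > 0$ for some $c \in \mathbb{R}$. Give examples of two sequences $\{X_n\}_{n\ge1}$ and $\{Y_n\}_{n\ge1}$ satisfying $X_n \longrightarrow^{d} X$ and $Y_n \longrightarrow^{d} X$ such that $\lim_{n\to\infty} P(X_n \le c) = P(X \le c)$ but $\lim_{n\to\infty} P(Y_n \le c) \ne P(X \le c)$. (Hint: Take $X_n =^{d} X$, $n \ge 1$ and $Y_n =^{d} X + \frac{1}{n}$, $n \ge 1$, say.)

9.16 Let $\{\mu_n\}_{n\ge1}$, $\mu$ be probability measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such that $\int f\,d\mu_n \to \int f\,d\mu$ for all $f \in \mathcal{F}$ for some collection $\mathcal{F}$ of functions from $\mathbb{R}$ to $\mathbb{R}$ specified below. Does $\mu_n \longrightarrow^{d} \mu$ if

(a) $\mathcal{F} = \{f \mid f : \mathbb{R} \to \mathbb{R}$, $f$ is bounded and continuously differentiable on $\mathbb{R}$ with a bounded derivative$\}$?

(b) $\mathcal{F} = \{f \mid f : \mathbb{R} \to \mathbb{R}$, $f$ is bounded and infinitely differentiable on $\mathbb{R}\}$?

(c) $\mathcal{F} \equiv \{f \mid f$ is a polynomial with real coefficients$\}$ and $\int |x|^k \mu(dx) + \int |x|^k \mu_n(dx) < \infty$ for all $n, k \in \mathbb{N}$?

9.17 For any two cdfs $F$, $G$ on $\mathbb{R}$, define

$$d_L(F, G) = \inf\{\epsilon > 0 : G(x - \epsilon) - \epsilon < F(x) < G(x + \epsilon) + \epsilon \ \text{for all } x \in \mathbb{R}\}. \tag{6.1}$$

Verify that $d_L$ defines a metric on the collection of all probability distributions on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. The metric $d_L$ is called the Levy metric.

9.18 Let $\{\mu_n\}_{n\ge1}$, $\mu$ be probability measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, with the corresponding cdfs $\{F_n\}_{n\ge1}$ and $F$. Show that $\mu_n \longrightarrow^{d} \mu$ iff $d_L(F_n, F) \to 0$ as $n \to \infty$.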

9.19 (a) Show that for any two cdfs $F$, $G$ on $\mathbb{R}$,

$$d_L(F, G) \le d_K(F, G), \tag{6.2}$$

where

$$d_K(F, G) = \sup_{x\in\mathbb{R}} |F(x) - G(x)| \tag{6.3}$$

($d_K$ is called the Kolmogorov distance or metric between $F$ and $G$).

(b) Give examples where (i) equality holds in (6.2), and (ii) strict inequality holds in (6.2).

9.20 Let $\{\mu_n\}_{n\ge1}$, $\mu$ be probability measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such that $\mu_n \longrightarrow^{d} \mu$. Let $\{f_a : a \in \mathbb{R}\}$ be a collection of bounded functions from $\mathbb{R} \to \mathbb{R}$ such that $\mu(D_{f_a}) = 0$ for all $a \in \mathbb{R}$ and $|f_a(x) - f_b(x)| \le h(x)|b - a|$ for all $a, b \in \mathbb{R}$ and for some $h : \mathbb{R} \to (0, \infty)$ with $\mu(D_h) = 0$ and $\int |h|\,d\mu < \infty$. Show that

$$\sup_{a\in\mathbb{R}} \Big| \int f_a\,d\mu_n - \int f_a\,d\mu \Big| \to 0 \quad \text{as } n \to \infty.$$

9.21 Let $\{X_n\}_{n\ge1}$, $X$ be $k$-dimensional random vectors such that $X_n \longrightarrow^{d} X$. Let $\{A_n\}_{n\ge1}$ be a sequence of $r \times k$ matrices of real numbers and $\{b_n\}_{n\ge1} \subset \mathbb{R}^r$, $r \in \mathbb{N}$. Define $Y_n = A_n X_n + b_n$ and $Z_n = A_n X_n X_n^T$ where $X_n^T$ denotes the transpose of $X_n$. Suppose that $A_n \to A$ and $b_n \to b$. Show that (a) $Y_n \longrightarrow^{d} Y$, where $Y =^{d} AX + b$, (b) $Z_n \longrightarrow^{d} Z$, where $Z =^{d} AXX^T$. (Note: Here convergence in distribution of a sequence of $r \times k$ matrix-valued random variables may be interpreted by considering the corresponding $rk$-dimensional random vectors obtained by concatenating the rows of the $r \times k$ matrix side-by-side and using the definition of convergence in distribution for random vectors.)

9.22 Let $\mu_n$, $\mu$ be probability measures on a countable set $D \equiv \{a_j\}_{j\ge1} \subset \mathbb{R}$. Let $p_{nj} = \mu_n(\{a_j\})$, $j \ge 1$, $n \ge 1$ and $p_j = \mu(\{a_j\})$. Show that, as $n \to \infty$, $\mu_n \longrightarrow^{d} \mu$ iff for all $j$, $p_{nj} \to p_j$ iff $\sum_j |p_{nj} - p_j| \to 0$.

9.23 Let $X_n \sim$ Binomial$(n, p_n)$, $n \ge 1$. Suppose $np_n \to \lambda$, $0 < \lambda < \infty$. Show that $X_n \longrightarrow^{d} X$, where $X \sim$ Poisson$(\lambda)$.

9.24 (a) Let $X_n \sim$ Geo$(p_n)$, i.e., $P(X_n = r) = q_n^{r-1} p_n$, $r \ge 1$, where $0 < p_n < 1$ and $q_n = 1 - p_n$. Show that as $n \to \infty$, if $p_n \to 0$ then

$$p_n X_n \longrightarrow^{d} X, \quad \text{where } X \sim \text{Exponential}(1). \tag{6.4}$$

(b) Fix a positive integer $k$. Let for $n \ge 1$,

$$p_{nr} = \binom{r-1}{k-1} p_n^{k} q_n^{r-k}, \quad r \ge k,$$

where $0 < p_n < 1$, $q_n = 1 - p_n$. (i) Verify that for each $n$, $\{p_{nr}\}_{r\ge k}$ is a probability distribution, i.e., $\sum_{r=k}^{\infty} p_{nr} = 1$. (ii) Let $Y_n$ be a random variable with distribution $P(Y_n = r) = p_{nr}$, $r \ge k$. Show that as $n \to \infty$, if $p_n \to 0$ then $\{p_n Y_n\}_{n\ge1}$ converges in distribution and identify the limit.

9.25 Let $\{F_n\}_{n\ge1}$ and $\{G_n\}_{n\ge1}$ be two sequences of cdfs on $\mathbb{R}$ such that, as $n \to \infty$, $F_n \longrightarrow^{d} F$, $G_n \longrightarrow^{d} G$ where $F$ and $G$ are cdfs on $\mathbb{R}$.

(a) Show that for each $n \ge 1$,

$$H_n(x) \equiv (F_n * G_n)(x) \equiv \int_{\mathbb{R}} F_n(x - y)\,dG_n(y)$$

is a cdf on $\mathbb{R}$.

(b) Show that, as $n \to \infty$, $H_n \longrightarrow^{d} H$ where $H = F * G$, by direct calculation and by Skorohod's theorem (i.e., Theorem 9.4.1) and Problem 7.14.

9.26 Let $Y_n$ have discrete uniform distribution on the integers $\{1, 2, \ldots, n\}$. Let $X_n \equiv \frac{Y_n}{n}$ and let $X$ be a Uniform (0,1) random variable. Show that $X_n \longrightarrow^{d} X$ using three different methods as follows: (a) Helly-Bray theorem, (b) the method of moments, (c) using the cdfs.

9.27 Establish (5.4) in Remark 9.5.3. (Hint: Show that for any $\epsilon > 0$, $m_{2k}^{1/2k} \ge (M - \epsilon)\big(\mu(\{x : |x| > M - \epsilon\})\big)^{1/2k}$.)

9.28 Let $\mu$ be a probability distribution on $\mathbb{R}$ such that $\phi(t_0) \equiv \int e^{t_0|x|}\,d\mu(x) < \infty$ for some $t_0 > 0$. Show that Carleman's condition (5.2) is satisfied. (Hint: Show that by Cramer's inequality (Corollary 3.1.5)

$$m_{2k} \le 2k\,\phi(t_0) \int_0^{\infty} x^{2k-1} e^{-t_0 x}\,dx = \phi(t_0)\,2k\,t_0^{-2k}\,(2k - 1)!$$

and then use Stirling's approximation: ‘$n! \sim \sqrt{2\pi}\, n^{n+1/2} e^{-n}$ as $n \to \infty$’ (Feller (1968)).)

9.29 (Continuity theorem for mgfs). Let $\{X_n\}_{n\ge1}$ and $X$ be random variables such that for some $\delta > 0$, the mgfs $M_{X_n}(t) \equiv E(e^{tX_n})$ and $M_X(t) \equiv E(e^{tX})$ are finite for all $|t| < \delta$. Further, let $M_{X_n}(t) \to M_X(t)$ for all $|t| < \delta$. Show that $X_n \longrightarrow^{d} X$. (Hint: Show first that $\{X_n\}_{n\ge1}$ is tight and use the fact that by Remark 9.5.3, the distribution of $X$ is determined by $M_X(\cdot)$.)

9.30 Let $X_n \sim$ Binomial$(n, p_n)$. Suppose $np_n \to \infty$. Let $Y_n = \dfrac{X_n - np_n}{\sqrt{np_n(1 - p_n)}}$, $n \ge 1$. Show that $Y_n \longrightarrow^{d} Y$, where $Y \sim N(0, 1)$. (Hint: Use Problem 9.29.)

9.31 Use the continuity theorem for mgfs to establish (6.4) and the convergence in distribution of $\{p_n Y_n\}_{n\ge1}$ in Problem 9.24 (b)(ii).

9.32 Let $\{X_j, V_j : j \ge 1\}$ be a collection of random variables on some probability space $(\Omega, \mathcal{F}, P)$ such that $P(V_j \in \mathbb{N}) = 1$ for all $j$, $V_j \to \infty$ w.p. 1 and $X_j \longrightarrow^{d} X$. Suppose that for each $j$, the random variable $V_j$ is independent of the sequence $\{X_n\}_{n\ge1}$. Show that $X_{V_j} \longrightarrow^{d} X$. (Hint: Verify that for any bounded continuous function $h : \mathbb{R} \to \mathbb{R}$,

$$\big|Eh(X_{V_j}) - Eh(X)\big| \le 2\|h\| P(V_j \le N) + \Delta_N P(V_j > N),$$

where $\Delta_N = \sup_{k>N} \big|Eh(X_k) - Eh(X)\big|$ and $\|h\| = \sup\{|h(x)| : x \in \mathbb{R}\}$.)

9.33 Let $X_n \longrightarrow^{d} X$ and $x_n \to x$ as $n \to \infty$. If $P(X = x) = 0$, then show that $P(X_n \le x_n) \to P(X \le x)$.

9.34 (Weyl's equi-distribution property). Let $0 < \alpha < 1$ be an irrational number. Let $\mu_n(\cdot)$ be the measure defined by $\mu_n(A) \equiv \frac{1}{n}\sum_{j=0}^{n-1} I_A(j\alpha \bmod 1)$, $A \in \mathcal{B}([0, 1])$. Show that $\mu_n \longrightarrow^{d}$ Uniform (0,1). (Hint: Verify that $\int f\,d\mu_n \to \int_0^1 f(x)\,dx$ for all $f$ of the form $f(x) = e^{\iota 2\pi kx}$, $k \in \mathbb{Z}$ and then approximate a bounded continuous function $f$ by trigonometric polynomials (cf. Section 5.6).)

9.35 (a) Let $\{X_i\}_{i\ge1}$ be iid random variables with Uniform (0,1) distribution. Let $M_n = \max_{1\le i\le n} X_i$. Show that $n(1 - M_n) \longrightarrow^{d}$ Exponential (1).

(b) Let $\{X_i\}_{i\ge1}$ be iid random variables such that $\lambda \equiv \sup\{x : P(X_1 \le x) < 1\} < \infty$, $P(X_1 = \lambda) = 0$, and $P(\lambda - x < X_1 < \lambda) \sim c x^{\alpha} L(x)$ as $x \downarrow 0$ where $\alpha > 0$, $c > 0$, and $L(\cdot)$ is slowly varying at 0, i.e., $\lim_{x\downarrow0} \frac{L(cx)}{L(x)} = 1$ for all $0 < c < \infty$. Let $M_n = \max_{1\le i\le n} X_i$. Show that $Y_n \equiv n^{1/\alpha}(\lambda - M_n)$ converges in distribution as $n \to \infty$ and identify the limit.


9.36 Let {Xi }i≥1 be iid positive random variables such that P (X1 < x) ∼ cxα L(x) as x ↓ 0, where c, α and L(·) are as in Problem 9.35. Let X1n ≡ min1≤i≤n Xi . Find {an }n≥1 ⊂ R+ such that Zn ≡ an X1n converges in distribution to a nondegenerate limit and identify the distribution. Specialize this to the cases where X1 has a pdf fX (·) such that (a) limx↓0 fX (x) = fX (0+) exists and is positive, (b) X1 has a Beta (a, b) distribution.

10 Characteristic Functions

10.1 Definition and examples

Characteristic functions play an important role in studying (asymptotic) distributional properties of random variables, particularly for sums of independent random variables. The main uses of characteristic functions are (1) to characterize the probability distribution of a given random variable, and (2) to establish convergence in distribution of a sequence of random variables and to identify the limit distribution.

Definition 10.1.1: (i) The characteristic function of a random variable $X$ is defined as

$$\phi_X(t) = E\exp(\iota tX), \quad t \in \mathbb{R}, \tag{1.1}$$

where $\iota = \sqrt{-1}$.

(ii) The characteristic function of a probability measure $\mu$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is defined as

$$\hat\mu(t) = \int \exp(\iota tx)\,\mu(dx), \quad t \in \mathbb{R}. \tag{1.2}$$

(iii) Let $F$ be a cdf on $\mathbb{R}$. Then, the characteristic function of $F$ is defined as $\hat\mu_F(\cdot)$, where $\mu_F$ is the Lebesgue-Stieltjes measure corresponding to $F$.

Note that the integrands in (1.1) and (1.2) are complex valued. Here and elsewhere, for any $f_1, f_2 \in L^1(\Omega, \mathcal{F}, \mu)$, the integral of $(f_1 + \iota f_2)$ with respect to $\mu$ is defined as

$$\int (f_1 + \iota f_2)\,d\mu = \int f_1\,d\mu + \iota \int f_2\,d\mu. \tag{1.3}$$

Thus, the characteristic function of $X$ is given by $\phi_X(t) = (E\cos tX) + \iota(E\sin tX)$, $t \in \mathbb{R}$. Since the functions $\cos tx$ and $\sin tx$ are bounded for every $t \in \mathbb{R}$, $\phi_X(t)$ is well defined for all $t \in \mathbb{R}$. Furthermore, $\phi_X(0) = 1$ and for any $t \in \mathbb{R}$,

$$|\phi_X(t)| = \big((E\cos tX)^2 + (E\sin tX)^2\big)^{1/2} \le \big(E(\cos tX)^2 + E(\sin tX)^2\big)^{1/2} \le 1. \tag{1.4}$$

If equality holds in (1.4), i.e., if $|\phi_X(t_0)| = 1$ for some $t_0 \ne 0$, then the random variable is necessarily discrete, as shown by the following proposition.

Proposition 10.1.1: Let $X$ be a random variable with characteristic function $\phi_X(\cdot)$. Then the following are equivalent:

(i) $|\phi_X(t_0)| = 1$ for some $t_0 \ne 0$.

(ii) There exist $a \in \mathbb{R}$, $h \ne 0$ such that

$$P\big(X \in \{a + jh : j \in \mathbb{Z}\}\big) = 1. \tag{1.5}$$

Proof: Suppose that (i) holds. Since $|\phi_X(t_0)| = 1$, there exists $a_0 \in \mathbb{R}$ such that $\phi_X(t_0) = e^{\iota a_0}$, i.e., $e^{-\iota a_0}\phi_X(t_0) = 1$. Let $a = a_0/t_0$. Since the characteristic function of $(X - a)$ is given by $e^{-\iota at}\phi_X(t)$, it follows that $E\exp(\iota t_0(X - a)) = 1$. Equating the real parts, one gets

$$E\cos t_0(X - a) = 1. \tag{1.6}$$

Since $|\cos\theta| \le 1$ for all $\theta$ and $\cos\theta = 1$ if and only if $\theta = 2\pi n$ for some $n \in \mathbb{Z}$, (1.6) implies that

$$P\big(t_0(X - a) \in \{2\pi j : j \in \mathbb{Z}\}\big) = 1. \tag{1.7}$$

Therefore, (ii) holds with $h = \frac{2\pi}{|t_0|}$ and with $a = a_0/t_0$. For the converse, note that with $p_j = P(X = a + jh)$, $j \in \mathbb{Z}$,

$$\phi_X(t) = \sum_{j\in\mathbb{Z}} \exp\big(\iota t(a + jh)\big)\,p_j, \quad t \in \mathbb{R},$$

and hence $\big|\phi_X\big(\frac{2\pi}{h}\big)\big| = 1$. □
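The converse direction of Proposition 10.1.1 is easy to see numerically. The following sketch uses a hypothetical lattice distribution (the offset `a`, the spacing `h`, and the masses `probs` are illustrative choices, not from the text) and evaluates the modulus of its characteristic function at $t = 2\pi/h$ and at another value of $t$.

```python
import cmath
import math


def lattice_cf(t, a, h, probs):
    """Characteristic function of X with P(X = a + j*h) = probs[j], j = 0, 1, ..."""
    return sum(p * cmath.exp(1j * t * (a + j * h)) for j, p in enumerate(probs))


a, h = 0.5, 2.0                # hypothetical offset and lattice spacing
probs = [0.1, 0.2, 0.3, 0.4]   # hypothetical masses on a, a+h, a+2h, a+3h

print(abs(lattice_cf(2 * math.pi / h, a, h, probs)))  # modulus at t = 2*pi/h
print(abs(lattice_cf(1.0, a, h, probs)))              # modulus at t = 1
```

At $t = 2\pi/h$ every term has the same phase $e^{\iota ta}$, so the modulus equals 1 up to rounding; at $t = 1$ the phases differ and the modulus is strictly smaller.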

Definition 10.1.2: A random variable $X$ satisfying (1.5) for some $a \in \mathbb{R}$ and $h > 0$ is called a lattice random variable. In this case, the distribution of $X$ is also called lattice or arithmetic. If $X$ is a nondegenerate lattice random variable, then the largest $h > 0$ for which (1.5) holds is called the span (of the probability distribution or of the characteristic function) of $X$.

An inspection of the proof of Proposition 10.1.1 shows that for a lattice random variable $X$ with span $h > 0$, its characteristic function satisfies the relation

$$\big|\phi_X(2\pi j/h)\big| = 1 \quad \text{for all } j \in \mathbb{Z}. \tag{1.8}$$

In particular, this implies that $\limsup_{|t|\to\infty} |\phi_X(t)| = 1$. The next result shows that characteristic functions of random variables with absolutely continuous cdfs exhibit a very different limit behavior.

Proposition 10.1.2: Let $X$ be a random variable with cdf $F$ and characteristic function $\phi_X$. If $F$ is absolutely continuous, then

$$\lim_{|t|\to\infty} |\phi_X(t)| = 0. \tag{1.9}$$

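Proof: Since $F$ is absolutely continuous, the probability distribution $\mu_X$ of $X$ has a density, say $f$, w.r.t. the Lebesgue measure $m$ on $\mathbb{R}$, and

$$\phi_X(t) = \int \exp(\iota tx) f(x)\,dx, \quad t \in \mathbb{R}.$$

Fix $\epsilon \in (0, \infty)$. Since $f \in L^1(\mathbb{R}, \mathcal{B}(\mathbb{R}), m)$, by Theorem 2.3.14, there exists a step function $f_\epsilon = \sum_{j=1}^{k} c_j I_{(a_j, b_j)}$ with $1 \le k < \infty$ and $a_j, b_j, c_j \in \mathbb{R}$ for $j = 1, \ldots, k$, such that

$$\int |f - f_\epsilon|\,dm < \epsilon/2. \tag{1.10}$$

Next note that for any $t \ne 0$,

$$\Big|\int \exp(\iota tx) f_\epsilon(x)\,dx\Big| = \Big|\sum_{j=1}^{k} c_j \int_{a_j}^{b_j} \exp(\iota tx)\,dx\Big| \le \sum_{j=1}^{k} |c_j|\,\frac{2}{|t|}. \tag{1.11}$$

Hence, by (1.10) and (1.11), it follows that

$$|\phi_X(t)| = \Big|\int \exp(\iota tx) f(x)\,dx\Big| \le \int |f - f_\epsilon|\,dx + \Big|\int \exp(\iota tx) f_\epsilon(x)\,dx\Big| < \epsilon$$

for all $|t| > t_\epsilon$, where $t_\epsilon = 4\sum_{j=1}^{k} |c_j|/\epsilon$. Thus (1.9) holds. □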
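For a concrete instance of Proposition 10.1.2, take $X \sim$ Uniform (0,1), whose density is itself a step function, so the bound (1.11) applies directly with $k = 1$ and $c_1 = 1$. The sketch below (function name illustrative) uses the closed form $\phi_X(t) = (e^{\iota t} - 1)/(\iota t)$, obtained by integrating $\exp(\iota tx)$ over $(0, 1)$, and compares $|\phi_X(t)|$ with $2/|t|$.

```python
import cmath


def uniform01_cf(t):
    """Characteristic function of Uniform(0, 1): integral of exp(i*t*x) over (0, 1)."""
    if t == 0:
        return 1.0 + 0.0j
    return (cmath.exp(1j * t) - 1.0) / (1j * t)


for t in (1.0, 10.0, 100.0, 1000.0):
    print(t, abs(uniform01_cf(t)), 2.0 / abs(t))
```

In each row the first value stays below the second, and both decay as `t` grows, in line with (1.9).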

Note that the above proof shows that for any $f \in L^1(m)$, the Fourier transform $\hat f(t) \equiv \int e^{\iota tx} f(x)\,dx$, $t \in \mathbb{R}$, satisfies $\lim_{|t|\to\infty} \hat f(t) = 0$. This is known as the Riemann-Lebesgue lemma (cf. Proposition 5.7.1). Next, some basic results on smoothness properties of the characteristic function are presented.

Proposition 10.1.3: Let $X$ be a random variable with characteristic function $\phi_X(\cdot)$. Then, $\phi_X(\cdot)$ is uniformly continuous on $\mathbb{R}$.

Proof: For $t, h \in \mathbb{R}$,

$$\begin{aligned} \big|\phi_X(t + h) - \phi_X(t)\big| &= \big|E\{\exp(\iota(t + h)X) - \exp(\iota tX)\}\big| \\ &= \big|E\exp(\iota tX)\cdot(e^{\iota hX} - 1)\big| \\ &\le E\big|e^{\iota hX} - 1\big| \equiv E\Delta(h), \text{ say,} \end{aligned}$$

where $\Delta(h) \equiv |\exp(\iota hX) - 1|$. It is easy to check that $|\Delta(h)| \le 2$ and $\lim_{h\to0} \Delta(h) = 0$ w.p. 1 (in fact, everywhere). Hence, by the BCT, $E\Delta(h) \to 0$ as $h \to 0$. Therefore,

$$\lim_{h\to0} \sup_{t\in\mathbb{R}} \big|\phi_X(t + h) - \phi_X(t)\big| \le \lim_{h\to0} E\Delta(h) = 0 \tag{1.12}$$

and hence, $\phi_X(\cdot)$ is uniformly continuous on $\mathbb{R}$. □

Theorem 10.1.4: Let $X$ be a random variable with characteristic function $\phi_X(\cdot)$. If $E|X|^r < \infty$ for some $r \in \mathbb{N}$, then $\phi_X(\cdot)$ is $r$-times continuously differentiable and

$$\phi_X^{(r)}(t) = E(\iota X)^r \exp(\iota tX), \quad t \in \mathbb{R}. \tag{1.13}$$

For proving the theorem, the following bound on the function $\exp(\iota x)$ is useful.

Lemma 10.1.5: For any $x \in \mathbb{R}$, $r \in \mathbb{N}$,

$$\Big|\exp(\iota x) - \sum_{k=0}^{r-1} \frac{(\iota x)^k}{k!}\Big| \le \min\Big\{\frac{|x|^r}{r!}, \frac{2|x|^{r-1}}{(r - 1)!}\Big\}. \tag{1.14}$$
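The inequality (1.14) can be spot-checked numerically before reading its proof. This is a minimal sketch; the function names are illustrative and the sample points are arbitrary.

```python
import cmath
import math


def taylor_remainder(x, r):
    """|exp(i*x) - sum_{k=0}^{r-1} (i*x)^k / k!|, the left side of (1.14)."""
    partial = sum((1j * x) ** k / math.factorial(k) for k in range(r))
    return abs(cmath.exp(1j * x) - partial)


def lemma_bound(x, r):
    """min{|x|^r / r!, 2|x|^(r-1) / (r-1)!}, the right side of (1.14)."""
    return min(abs(x) ** r / math.factorial(r),
               2.0 * abs(x) ** (r - 1) / math.factorial(r - 1))


for x in (0.1, 1.0, 10.0):
    for r in (1, 2, 3):
        print(x, r, taylor_remainder(x, r), lemma_bound(x, r))
```

For small `|x|` the first term of the minimum is the active one, and for large `|x|` the second term takes over.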

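Proof: Note that for any $x \in \mathbb{R}$ and for any $r \in \mathbb{N}$,

$$\frac{d^r}{dx^r}[\exp(\iota x)] = \frac{d^r}{dx^r}\cos x + \iota\,\frac{d^r}{dx^r}\sin x = \iota^r \exp(\iota x). \tag{1.15}$$

Hence, by (1.15) and Taylor's expansion (applied to the functions $\sin x$ and $\cos x$ of a real variable $x$), for any $x \in \mathbb{R}$ and $r \in \mathbb{N}$,

$$\exp(\iota x) = \sum_{k=0}^{r-1} \frac{(\iota x)^k}{k!} + \frac{(\iota x)^r}{(r - 1)!} \int_0^1 (1 - u)^{r-1} \exp(\iota ux)\,du. \tag{1.16}$$

Since $\int_0^1 (1 - u)^{r-1}\,du = 1/r$, it follows that for any $x \in \mathbb{R}$ and any $r \in \mathbb{N}$,

$$\Big|\exp(\iota x) - \sum_{k=0}^{r-1} \frac{(\iota x)^k}{k!}\Big| \le \frac{|x|^r}{r!}. \tag{1.17}$$

Also, for $r \ge 2$, using (1.17) with $r$ replaced by $r - 1$, one gets

$$\Big|\exp(\iota x) - \sum_{k=0}^{r-1} \frac{(\iota x)^k}{k!}\Big| \le \Big|\exp(\iota x) - \sum_{k=0}^{r-2} \frac{(\iota x)^k}{k!}\Big| + \frac{|x|^{r-1}}{(r - 1)!} \le \frac{2|x|^{r-1}}{(r - 1)!}. \tag{1.18}$$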

Hence, by (1.17) and (1.18), (1.14) holds for all $x \in \mathbb{R}$, $r \in \mathbb{N}$ with $r \ge 2$. For $r = 1$, (1.14) follows from (1.17) and the bound ‘$\sup_x |\exp(\iota x) - 1| \le 2$.’ □

Lemma 10.1.5 gives two upper bounds on the difference between the function $\exp(\iota x)$ and its $(r - 1)$th order Taylor's expansion around $x = 0$. For small values of $|x|$, the first bound, i.e., $\frac{|x|^r}{r!}$, is more accurate, whereas for large values of $|x|$, the other bound, i.e., $\frac{2|x|^{r-1}}{(r-1)!}$, is more accurate.

Proof of Theorem 10.1.4: Let $\mu$ denote the probability distribution of $X$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Suppose that $E|X| < \infty$. First it will be shown that $\phi_X(\cdot)$ is differentiable with $\phi_X^{(1)}(t) = E\{\iota X \exp(\iota tX)\}$, $t \in \mathbb{R}$. Fix $t \in \mathbb{R}$. For any $h \in \mathbb{R}$, $h \ne 0$,

$$\begin{aligned} h^{-1}[\phi_X(t + h) - \phi_X(t)] &= \int_{\mathbb{R}} \exp(\iota tx)\,\frac{\exp(\iota hx) - 1}{h}\,\mu(dx) \\ &= \int_{\mathbb{R}} \exp(\iota tx)\Big\{\frac{\exp(\iota hx) - 1}{h} - \iota x\Big\}\mu(dx) + \int_{\mathbb{R}} \iota x \exp(\iota tx)\,\mu(dx) \\ &\equiv \int_{\mathbb{R}} \psi_h(x)\,\mu(dx) + \int_{\mathbb{R}} \iota x \exp(\iota tx)\,\mu(dx), \text{ say.} \end{aligned} \tag{1.19}$$

By Lemma 10.1.5 (with $r = 2$),

$$|\psi_h(x)| \le \min\Big\{\frac{|h||x|^2}{2}, 2|x|\Big\} \quad \text{for all } x \in \mathbb{R},\ h \ne 0. \tag{1.20}$$

Hence, $\lim_{h\to0} \psi_h(x) = 0$ for each $x \in \mathbb{R}$. Also, $|\psi_h(x)| \le 2|x|$ and $\int |x|\,\mu(dx) = E|X| < \infty$. Hence, by the DCT,

$$\lim_{h\to0} \int \psi_h(x)\,\mu(dx) = 0$$

and therefore, from (1.19), it follows that $\phi_X(\cdot)$ is differentiable at $t$ with $\phi_X^{(1)}(t) = \int \iota x \exp(\iota tx)\,\mu(dx) = E\{\iota X \exp(\iota tX)\}$. Next suppose that the assertion of the theorem is true for some $r \in \mathbb{N}$. To prove it for $r + 1$, note that for $t \in \mathbb{R}$ and $h \ne 0$,

$$h^{-1}[\phi_X^{(r)}(t + h) - \phi_X^{(r)}(t)] - E(\iota X)^{r+1} \exp(\iota tX) = \int (\iota x)^r \psi_h(x)\,\mu(dx), \tag{1.21}$$

where $\psi_h(x)$ is as in (1.19). Now using the bound (1.20) on $\psi_h(x)$, the DCT, and the condition $E|X|^{r+1} < \infty$, one can show that the right side of (1.21) goes to zero as $h \to 0$. By induction, this completes the proof of the theorem. □

Proposition 10.1.6: Let $X$ and $Y$ be two independent random variables. Then

$$\phi_{X+Y}(t) = \phi_X(t) \cdot \phi_Y(t), \quad t \in \mathbb{R}. \tag{1.22}$$

Proof: Follows from (1.3), Proposition 7.1.3, and the independence of $X$ and $Y$. □

For a complex number $z = a + \iota b$, $a, b \in \mathbb{R}$, let $\bar z = a - \iota b$ denote the complex conjugate of $z$ and let

$$\mathrm{Re}(z) = a \quad \text{and} \quad \mathrm{Im}(z) = b \tag{1.23}$$

respectively denote the real and the imaginary parts of $z$.

Corollary 10.1.7: Let $X$ be a random variable with characteristic function $\phi_X$. Then, $\bar\phi_X$, $|\phi_X|^2$ and $\mathrm{Re}(\phi_X)$ are characteristic functions, where $\mathrm{Re}(\phi_X)(t) = \mathrm{Re}(\phi_X(t))$, $t \in \mathbb{R}$.

Proof: $\bar\phi_X(t) = E\exp(-\iota tX) = E\exp(\iota t(-X))$, $t \in \mathbb{R}$ $\Rightarrow$ $\bar\phi_X$ is the characteristic function of $-X$. Next, let $Y$ be an independent copy of $X$. Then, by (1.22), $\phi_{X-Y}(t) = |\phi_X(t)|^2$, $t \in \mathbb{R}$.

Finally, $\mathrm{Re}(\phi_X)(t) = \frac{1}{2}\big(\phi_X(t) + \bar\phi_X(t)\big) = \int \exp(\iota tx)\,\mu(dx)$, $t \in \mathbb{R}$, where $\mu(A) = 2^{-1}\big[P(X \in A) + P(-X \in A)\big]$, $A \in \mathcal{B}(\mathbb{R})$. □

Definition 10.1.3: A function $\phi : \mathbb{R} \to \mathbb{C}$, the set of complex numbers, is said to be nonnegative definite if for any $k \in \mathbb{N}$, $t_1, t_2, \ldots, t_k \in \mathbb{R}$, $\alpha_1, \alpha_2, \ldots, \alpha_k \in \mathbb{C}$,

$$\sum_{i=1}^{k} \sum_{j=1}^{k} \alpha_i \bar\alpha_j \phi(t_i - t_j) \ge 0. \tag{1.24}$$

Proposition 10.1.7: Let $\phi(\cdot)$ be the characteristic function of a random variable $X$. Then $\phi$ is nonnegative definite.

Proof: Check that for $k$, $\{t_i\}$, $\{\alpha_i\}$ as in Definition 10.1.3,

$$\sum_{i=1}^{k} \sum_{j=1}^{k} \alpha_i \bar\alpha_j \phi(t_i - t_j) = E\Big|\sum_{j=1}^{k} \alpha_j e^{\iota t_j X}\Big|^2.$$

□

A converse to the above is known as the Bochner-Khinchine theorem, which states that if $\phi : \mathbb{R} \to \mathbb{C}$ is nonnegative definite, continuous, and $\phi(0) = 1$, then $\phi$ is the characteristic function of a random variable $X$. For a proof, see Chung (1974). Another criterion for a function $\phi : \mathbb{R} \to \mathbb{C}$ to be a characteristic function is due to Pólya. For a proof, see Chung (1974).

Proposition 10.1.8: (Pólya's criterion). Let $\phi : \mathbb{R} \to \mathbb{C}$ satisfy $\phi(0) = 1$, $\phi(t) \ge 0$, $\phi(t) = \phi(-t)$ for all $t \in \mathbb{R}$ and let $\phi(\cdot)$ be nonincreasing and convex on $[0, \infty)$. Then $\phi$ is a characteristic function.

10.2 Inversion formulas

Let $F$ be a cdf and $\phi$ be its characteristic function. In this section, two inversion formulas to get the cdf $F$ from $\phi$ are presented. The first one is from Feller (1966), and the second one is more standard. Unless otherwise mentioned, for the rest of this section, $X$ will be a random variable with cdf $F$ and characteristic function $\phi_X$ and $N$ a standard normal random variable independent of $X$.

Lemma 10.2.1: Let $g : \mathbb{R} \to \mathbb{R}$ be a Borel measurable bounded function vanishing outside a bounded set and let $\epsilon \in (0, \infty)$. Then

$$Eg(X + \epsilon N) = \frac{1}{2\pi} \int\!\!\int g(x)\,\phi_X(t)\,e^{-\iota tx}\,e^{-\frac{\epsilon^2 t^2}{2}}\,dt\,dx. \tag{2.1}$$

Proof: The integrand on the right is bounded by $e^{-\frac{\epsilon^2 t^2}{2}}|g(x)|$ and so is integrable on $\mathbb{R} \times \mathbb{R}$ with respect to the Lebesgue measure on $(\mathbb{R}^2, \mathcal{B}(\mathbb{R}^2))$. Further, $\phi_X(t) = \int e^{\iota ty}\,dF(y)$ and $e^{-\frac{t^2}{2}} = \frac{1}{\sqrt{2\pi}} \int e^{\iota tx} e^{-\frac{x^2}{2}}\,dx$, $t \in \mathbb{R}$. By repeated applications of Fubini's theorem and the above two identities, the right side of (2.1) becomes

$$\begin{aligned} &\int g(x)\,\frac{1}{\sqrt{2\pi}} \int \Big(\frac{1}{\sqrt{2\pi}} \int e^{\iota t(y - x)} e^{-\frac{\epsilon^2 t^2}{2}}\,dt\Big) dF(y)\,dx \qquad [\text{set } s = \epsilon t] \\ &\quad = \int g(x)\,\frac{1}{\sqrt{2\pi}\,\epsilon} \int \Big(\frac{1}{\sqrt{2\pi}} \int e^{\iota s(y - x)/\epsilon} e^{-s^2/2}\,ds\Big) dF(y)\,dx \\ &\quad = \int g(x) \Big(\frac{1}{\sqrt{2\pi}\,\epsilon} \int e^{-\frac{(y - x)^2}{2\epsilon^2}}\,dF(y)\Big) dx. \end{aligned}$$

Since $X$ and $N$ are independent and $N$ has an absolutely continuous distribution w.r.t. the Lebesgue measure, $X + \epsilon N$ also has an absolutely continuous distribution with density

$$f_{X+\epsilon N}(x) = \frac{1}{\sqrt{2\pi}\,\epsilon} \int e^{-\frac{(y - x)^2}{2\epsilon^2}}\,dF(y), \quad x \in \mathbb{R}.$$

Thus, the right side of (2.1) reduces to $\int g(x) f_{X+\epsilon N}(x)\,dx = Eg(X + \epsilon N)$. □

Corollary 10.2.2: Let $g : \mathbb{R} \to \mathbb{R}$ be continuous and let $g(x) = 0$ for all $|x| > K$, for some $K$, $0 < K < \infty$. Then

$$Eg(X) = \int g(x)\,dF(x) = \lim_{\epsilon\to0+} \frac{1}{2\pi} \int\!\!\int g(x)\,e^{-\iota tx}\,\phi_X(t)\,e^{-\frac{\epsilon^2 t^2}{2}}\,dt\,dx. \tag{2.2}$$

Proof: This follows from Lemma 10.2.1, the fact that $X + \epsilon N \to X$ w.p. 1 as $\epsilon \to 0$, and the BCT. □

Corollary 10.2.3: (Feller's inversion formula). Let $a$ and $b$, $-\infty < a < b < \infty$, be two continuity points of $F$. Then

$$F(b) - F(a) = \lim_{\epsilon\to0+} \int_a^b \frac{1}{2\pi} \int e^{-\iota tx}\,\phi_X(t)\,e^{-\frac{\epsilon^2 t^2}{2}}\,dt\,dx. \tag{2.3}$$

Proof: This follows from Lemma 10.2.1 and Theorem 9.4.2, since the function $g(x) = 1$ for $a \le x \le b$ and 0 otherwise is continuous except at $a$ and $b$, which are continuity points of $F$. □

Corollary 10.2.4: If $\phi_X(t)$ is integrable w.r.t. the Lebesgue measure $m$ on $\mathbb{R}$, then $F$ is absolutely continuous with density w.r.t. $m$, given by

$$f(x) = \frac{1}{2\pi} \int e^{-\iota tx}\,\phi_X(t)\,dt, \quad x \in \mathbb{R}. \tag{2.4}$$

Proof: If $\phi_X$ is integrable, then

$$\frac{1}{2\pi} \int \phi_X(t)\,e^{-\iota tx}\,e^{-\frac{\epsilon^2 t^2}{2}}\,dt$$

is bounded by $(2\pi)^{-1} \int |\phi_X(t)|\,dt$ for all $x \in \mathbb{R}$, and it converges to $(2\pi)^{-1} \int e^{-\iota tx}\phi_X(t)\,dt$ as $\epsilon \to 0+$ for each $x \in \mathbb{R}$. Hence, by the BCT and Corollary 10.2.3, for any $a, b$, $-\infty < a < b < \infty$, that are continuity points of $F$,

$$F(b) - F(a) = \int_a^b \frac{1}{2\pi} \int \phi_X(t)\,e^{-\iota tx}\,dt\,dx.$$

Since $F$ has at most countably many discontinuity points and $F$ is right continuous, the above relation holds for all $-\infty < a < b < \infty$. □

Remark 10.2.1: The integrability of $\phi_X$ in Corollary 10.2.4 is only a sufficient condition. The standard exponential distribution has characteristic function $(1 - \iota t)^{-1}$ which is not integrable but the distribution is absolutely continuous.

Corollary 10.2.5: (Uniqueness). The characteristic function $\phi_X$ determines $F$ uniquely.

Proof: Since a cdf $F$ is uniquely determined by its values on the set of its continuity points, this corollary follows from Corollary 10.2.3. □

A more standard inversion formula is the following.

Theorem 10.2.6: Let $F$ be a cdf on $\mathbb{R}$ and $\phi(t) \equiv \int e^{\iota tx}\,dF(x)$, $t \in \mathbb{R}$, be its characteristic function.

(i) For any $a < b$, $a, b \in \mathbb{R}$, that are continuity points of $F$,

$$\lim_{T\to\infty} \frac{1}{2\pi} \int_{-T}^{T} \frac{e^{-\iota ta} - e^{-\iota tb}}{\iota t}\,\phi(t)\,dt = \mu_F((a, b)), \tag{2.5}$$

where $\mu_F$ is the Lebesgue-Stieltjes measure generated by $F$.

(ii) For any $a \in \mathbb{R}$,

$$\mu_F(\{a\}) = \lim_{T\to\infty} \frac{1}{2T} \int_{-T}^{T} e^{-\iota ta}\,\phi(t)\,dt. \tag{2.6}$$

A multivariate extension of part (i) and its proof are given in Section 10.4. See also Problem 10.4. For a proof of part (ii), see Problem 10.5 or see Chung (1974) or Durrett (2004).

Remark 10.2.2: (Inversion formula for integer valued random variables). If $X$ is integer valued with $p_k = P(X = k)$, $k \in \mathbb{Z}$, then its characteristic function is the Fourier series

$$\phi(t) = \sum_{k\in\mathbb{Z}} p_k e^{\iota tk}, \quad t \in \mathbb{R}. \tag{2.7}$$

Since $\int_{-\pi}^{\pi} e^{\iota tj}\,dt = 2\pi$ if $j = 0$ and $= 0$ otherwise, multiplying both sides of (2.7) by $e^{-\iota tk}$ and integrating over $t \in (-\pi, \pi)$ and using DCT, one gets

$$p_k = \frac{1}{2\pi} \int_{-\pi}^{\pi} \phi(t)\,e^{-\iota tk}\,dt, \quad k \in \mathbb{Z}. \tag{2.8}$$

As a corollary to part (ii) of Theorem 10.2.6, one can deduce a criterion for a distribution to be continuous. Let $\mu$ be a probability distribution and let $\{p_j\}$ be the masses of its atoms, if any. Let $\alpha = \sum_j p_j^2$. Let $X$ and $Y$ be two independent random variables with distribution $\mu$ and characteristic function $\phi(\cdot)$. Then $Z = X - Y$ has characteristic function $|\phi(\cdot)|^2$ and by Theorem 10.2.6, part (ii),

$$P(Z = 0) = \lim_{T\to\infty} \frac{1}{2T} \int_{-T}^{T} |\phi(t)|^2\,dt.$$

But $P(Z = 0) = \alpha$. Hence, it follows that

$$\sum_{j} p_j^2 = \lim_{T\to\infty} \frac{1}{2T} \int_{-T}^{T} |\phi(t)|^2\,dt. \tag{2.9}$$

Corollary 10.2.7: A distribution is continuous iff

$$\lim_{T\to\infty} \frac{1}{2T} \int_{-T}^{T} |\phi(t)|^2\,dt = 0. \tag{2.10}$$
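The inversion formula (2.8) and the identity (2.9) can be tried out on a Binomial distribution. The sketch below is illustrative: the function names and the grid size are choices made here, and the Binomial characteristic function $(1 - p + pe^{\iota t})^n$ is the standard closed form. For an integer valued random variable $\phi$ is $2\pi$-periodic, so the limit in (2.9) reduces to the average of $|\phi|^2$ over one period, which is what the second function computes.

```python
import cmath
import math


def binomial_cf(t, n, p):
    """Characteristic function of Binomial(n, p): (1 - p + p*exp(i*t))^n."""
    return (1.0 - p + p * cmath.exp(1j * t)) ** n


def invert_pmf(cf, k, grid=256):
    """Approximate (2.8): (1/(2*pi)) * integral over (-pi, pi) of cf(t)*exp(-i*t*k) dt."""
    step = 2.0 * math.pi / grid
    total = 0.0 + 0.0j
    for m in range(grid):
        t = -math.pi + (m + 0.5) * step      # midpoint rule on (-pi, pi)
        total += cf(t) * cmath.exp(-1j * t * k)
    return (total * step / (2.0 * math.pi)).real


def mean_squared_modulus(cf, grid=256):
    """Average of |cf(t)|^2 over one period (-pi, pi), by the midpoint rule."""
    step = 2.0 * math.pi / grid
    return sum(abs(cf(-math.pi + (m + 0.5) * step)) ** 2 for m in range(grid)) / grid


cf = lambda t: binomial_cf(t, 5, 0.3)
recovered = [invert_pmf(cf, k) for k in range(6)]
print(recovered)
print(mean_squared_modulus(cf), sum(q * q for q in recovered))
```

Because the integrands here are trigonometric polynomials of low degree, an equally spaced rule with more grid points than the degree reproduces the integrals up to rounding, so the recovered values should match the Binomial probabilities closely.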

Some consequences of the uniqueness result (cf. Corollary 10.2.5) are the following.

Corollary 10.2.8: For a random variable $X$, $X$ and $-X$ have the same distribution iff the characteristic function $\phi_X(t)$ of $X$ is real valued for all $t \in \mathbb{R}$.

Proof: If $\phi_X(t)$ is real, then

$$\phi_X(t) = \int (\cos tx)\,dF(x) \quad \text{for all } t \in \mathbb{R},$$

where $F$ is the cdf of $X$. So

$$\phi_X(t) = \phi_X(-t) = E(e^{-\iota tX}) = E(e^{\iota t(-X)}). \tag{2.11}$$

Since the characteristic function of $-X$ coincides with $\phi_X(t)$, the ‘if part’ follows. To prove the ‘only if’ part, suppose that $X$ and $-X$ have the same distribution. Then as in (2.11),

$$\phi_X(t) = \phi_{-X}(t) = \phi_X(-t) = \overline{\phi_X(t)},$$

where for any complex number $z = a + \iota b$, $a, b \in \mathbb{R}$, $\bar z \equiv a - \iota b$ denotes its conjugate. Hence, $\phi_X(t)$ is real for all $t \in \mathbb{R}$. □

Example 10.2.1: The standard Cauchy distribution has density

$$f(x) = \frac{1}{\pi}\,\frac{1}{1 + x^2}, \quad -\infty < x < \infty. \tag{2.12}$$

Its characteristic function is given by

$$\phi(t) = \frac{1}{\pi} \int \frac{e^{\iota tx}}{1 + x^2}\,dx = e^{-|t|}, \quad t \in \mathbb{R}. \tag{2.13}$$

To see this, let $X_1$ and $X_2$ be two independent copies of the standard exponential distribution. Since $\phi_{X_1}(t) = (1 - \iota t)^{-1}$, $t \in \mathbb{R}$, $Y \equiv X_1 - X_2$ has characteristic function

$$\phi_Y(t) = |\phi_{X_1}(t)|^2 = (1 + t^2)^{-1}, \quad t \in \mathbb{R}.$$

Since $\phi_Y$ is integrable, the density of $Y$ is

$$f_Y(y) = \frac{1}{2\pi} \int \frac{1}{1 + u^2}\,e^{-\iota uy}\,du, \quad y \in \mathbb{R}.$$

But by the convolution formula,

$$f_Y(y) = \int_0^{\infty} e^{-x} e^{-(y + x)}\,1_{(0,\infty)}(y + x)\,dx = \frac{1}{2} e^{-|y|}, \quad y \in \mathbb{R}.$$

So

$$\frac{1}{\pi} \int \frac{1}{1 + u^2}\,e^{\iota uy}\,du = e^{-|y|}, \quad y \in \mathbb{R},$$

proving (2.13).
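Example 10.2.1 can also be read in the other direction: since $e^{-|t|}$ is integrable, formula (2.4) should return the Cauchy density (2.12). The following sketch approximates the integral in (2.4) by a trapezoidal rule; the function name, the cutoff, and the step size are illustrative choices, and the result is only accurate up to the discretization and truncation error.

```python
import math


def cauchy_density_via_inversion(x, cutoff=40.0, step=1e-3):
    """Approximate (2.4) with phi(t) = exp(-|t|).

    The imaginary part of exp(-i*t*x) * phi(t) integrates to zero by symmetry,
    so only cos(t*x) * exp(-|t|) is summed (trapezoidal rule on [-cutoff, cutoff]).
    """
    n = int(round(cutoff / step))
    total = 0.0
    for i in range(-n, n + 1):
        t = i * step
        weight = 0.5 if abs(i) == n else 1.0   # half weight at the two endpoints
        total += weight * math.cos(t * x) * math.exp(-abs(t))
    return total * step / (2.0 * math.pi)


for x in (0.0, 1.0, -2.5):
    print(x, cauchy_density_via_inversion(x), 1.0 / (math.pi * (1.0 + x * x)))
```

The two printed columns should agree to several decimal places for each `x`.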

10.3 Levy-Cramer continuity theorem Characteristic functions are very useful in determining distributions, moments, and establishing various identities involving distributions. But by

328

10. Characteristic Functions

far their most important use is in establishing convergence in distribution. This is the content of a continuity theorem established by Paul Levy and H. Cramer. It says that the map ψ taking a distribution F to its characteristic function φ is a homeomorphism. That is, if Fn −→d F , then φn → φ and conversely. Here, the notion of convergence of φn to φ is that of uniform convergence on bounded intervals. The following result deals with the ‘if’ part. Theorem 10.3.1: Let Fn , n ≥ 1 and F be cdfs with characteristic functions φn , n ≥ 1 and φ, respectively. Let Fn −→d F . Then, for each 0 < K < ∞, sup |φn (t) − φ(t)| → 0 as n → ∞. |t|≤K

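That is, $\phi_n$ converges to $\phi$ uniformly on bounded intervals.

Proof: By Skorohod’s theorem, there exist random variables $X_n$, $X$ defined on the Lebesgue space $([0,1], \mathcal{B}([0,1]), m)$, where $m(\cdot)$ is the Lebesgue measure, such that $X_n \sim F_n$, $X \sim F$ and $X_n \to X$ w.p. 1. Now, for any $t \in \mathbb{R}$ and any $\epsilon > 0$, since $|1 - e^{\iota u}| \le 2$ for all real $u$,

$$|\phi_n(t) - \phi(t)| = \big|E\big(e^{\iota t X_n} - e^{\iota t X}\big)\big| \le E\big|1 - e^{\iota t (X - X_n)}\big| \le E\big|1 - e^{\iota t (X - X_n)}\big|\, I(|X - X_n| \le \epsilon) + 2P(|X_n - X| > \epsilon).$$

Hence,

$$\sup_{|t| \le K} |\phi_n(t) - \phi(t)| \le \sup_{|u| \le K\epsilon} |1 - e^{\iota u}| + 2P(|X_n - X| > \epsilon).$$

Given $K$ and $\delta > 0$, choose $\epsilon \in (0, \infty)$ small such that

$$\sup_{|u| \le K\epsilon} |1 - e^{\iota u}| < \delta.$$

Since for all $\epsilon > 0$, $P(|X_n - X| > \epsilon) \to 0$ as $n \to \infty$, the limit superior of the left side is at most $\delta$. As $\delta > 0$ is arbitrary, it follows that

$$\lim_{n \to \infty} \sup_{|t| \le K} |\phi_n(t) - \phi(t)| = 0.$$

□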
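A small numerical illustration of why Theorem 10.3.1 is stated for bounded intervals only. As an assumed example (not taken from the proof), take $F_n = N(0, 1/n)$, which converges in distribution to the point mass at 0, whose characteristic function is identically 1. The helper names `cf_normal` and `sup_gap` are hypothetical.

```python
import math

def cf_normal(t, variance):
    """Characteristic function of N(0, variance) at t (real-valued)."""
    return math.exp(-0.5 * variance * t * t)

def sup_gap(n, K, grid=2001):
    """Max over a grid on [-K, K] of |phi_n(t) - 1|, where phi_n is the cf of N(0, 1/n)."""
    ts = [-K + 2.0 * K * i / (grid - 1) for i in range(grid)]
    return max(abs(cf_normal(t, 1.0 / n) - 1.0) for t in ts)

# The gap on [-5, 5] shrinks with n; on a very wide interval it stays close to 1.
for n in (1, 10, 100, 1000):
    print(n, sup_gap(n, K=5.0), sup_gap(n, K=1e4))
```

For fixed $K$ the first column tends to 0, in line with the theorem, while the gap over a very wide range stays near 1 for every $n$ shown, so the convergence is not uniform over all of $\mathbb{R}$ in this example.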

The Levy-Cramer theorem is a converse to the above theorem. That is, if φn → φ uniformly on bounded intervals, then Fn −→d F . Actually, it is a stronger result than this converse. It says that it is enough to know that φn converges pointwise to a limit φ that is continuous at 0. Then φ is the characteristic function of some distribution F and Fn −→d F . The key to this is that under the given hypotheses, the family {Fn }n≥1 is tight.


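The next result relates the tail behavior of a probability measure to the behavior of its characteristic function near the origin, which in turn will be used to establish the tightness of $\{F_n\}_{n \ge 1}$.

Lemma 10.3.2: Let $\mu$ be a probability measure on $\mathbb{R}$ with characteristic function $\phi$. Then, for each $\delta > 0$,

$$\mu\big(\{x : |x|\delta \ge 2\}\big) \le \frac{1}{\delta}\int_{-\delta}^{\delta} (1 - \phi(u))\,du.$$

Proof: Fix $\delta \in (0, \infty)$. Then, using Fubini’s theorem and the fact that $1 - \frac{\sin x}{x} \ge 0$ for all $x$, one gets

$$\begin{aligned}
\int_{-\delta}^{\delta} (1 - \phi(u))\,du
&= \int \Big[\int_{-\delta}^{\delta} (1 - e^{\iota u x})\,du\Big]\mu(dx) \\
&= \int \Big[2\delta - \frac{2\sin \delta x}{x}\Big]\mu(dx) \\
&= 2\delta \int \Big[1 - \frac{\sin \delta x}{x\delta}\Big]\mu(dx) \\
&\ge 2\delta \int_{\{x : |x\delta| \ge 2\}} \Big[1 - \frac{1}{|x\delta|}\Big]\mu(dx) \\
&\ge \delta\,\mu\big(\{x : |x|\delta \ge 2\}\big).
\end{aligned}$$

□

Lemma 10.3.3: Let $\{\mu_n\}_{n \ge 1}$ be a sequence of probability measures with characteristic functions $\{\phi_n\}_{n \ge 1}$. Let $\lim_{n \to \infty} \phi_n(t) \equiv \phi(t)$ exist for $|t| \le \delta_0$ for some $\delta_0 > 0$. Let $\phi(\cdot)$ be continuous at 0. Then $\{\mu_n\}_{n \ge 1}$ is tight.

Proof: For any $0 < \delta < \delta_0$, by the BCT,

$$\frac{1}{\delta}\int_{-\delta}^{\delta} (1 - \phi_n(t))\,dt \to \frac{1}{\delta}\int_{-\delta}^{\delta} [1 - \phi(t)]\,dt.$$

Also, by continuity of $\phi$ at 0,

$$\frac{1}{\delta}\int_{-\delta}^{\delta} [1 - \phi(t)]\,dt \to 0 \quad \text{as } \delta \to 0.$$

Thus, given $\epsilon > 0$, there exist a $\delta_\epsilon \in (0, \delta_0)$ and an $M_\epsilon \in (0, \infty)$ such that for all $n \ge M_\epsilon$,

$$\frac{1}{\delta_\epsilon}\int_{-\delta_\epsilon}^{\delta_\epsilon} (1 - \phi_n(t))\,dt < \epsilon.$$

By Lemma 10.3.2, this implies that for all $n \ge M_\epsilon$,

$$\mu_n\Big(\Big\{x : |x| \ge \frac{2}{\delta_\epsilon}\Big\}\Big) < \epsilon.$$

Now choose $K_\epsilon > \frac{2}{\delta_\epsilon}$ such that

$$\mu_j\big(\{x : |x| \ge K_\epsilon\}\big) < \epsilon \quad \text{for } 1 \le j \le M_\epsilon.$$

Then,

$$\sup_{n \ge 1} \mu_n\big(\{x : |x| \ge K_\epsilon\}\big) < \epsilon$$

and hence, $\{\mu_n\}_{n \ge 1}$ is tight. □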
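The tail bound of Lemma 10.3.2 can be evaluated in closed form for the standard Cauchy distribution of Example 10.2.1, where $\phi(u) = e^{-|u|}$. The sketch below is an illustration under that assumed choice of $\mu$; the function names are hypothetical. Both sides are computed exactly: the left side from the Cauchy cdf and the right side from $\int_{-\delta}^{\delta}(1 - e^{-|u|})\,du = 2(\delta - 1 + e^{-\delta})$.

```python
import math

def cauchy_tail(delta):
    """mu{x : |x| >= 2/delta} for the standard Cauchy distribution."""
    return 1.0 - (2.0 / math.pi) * math.atan(2.0 / delta)

def cf_bound(delta):
    """(1/delta) * integral over [-delta, delta] of (1 - exp(-|u|)) du, in closed form."""
    return (2.0 / delta) * (delta - 1.0 + math.exp(-delta))

# Left side of the lemma versus right side, for several values of delta.
for delta in (0.01, 0.1, 1.0, 10.0):
    print(delta, cauchy_tail(delta), cf_bound(delta))
```

In each row the tail probability should not exceed the bound, and both tend to 0 as $\delta \to 0$, which is the mechanism used in the proof of Lemma 10.3.3.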

Theorem 10.3.4: (Levy-Cramer continuity theorem). Let $\{\mu_n\}_{n \ge 1}$ be a sequence of probability measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ with characteristic functions $\{\phi_n\}_{n \ge 1}$. Let $\lim_{n \to \infty} \phi_n(t) \equiv \phi(t)$ exist for all $t \in \mathbb{R}$ and let $\phi$ be continuous at 0. Then $\phi$ is the characteristic function of a probability measure $\mu$ and $\mu_n \xrightarrow{d} \mu$.

Proof: By Lemma 10.3.3, $\{\mu_n\}_{n \ge 1}$ is tight. Let $\{\mu_{n_j}\}_{j \ge 1}$ be any subsequence of $\{\mu_n\}_{n \ge 1}$ that converges vaguely to a limit $\mu$. By tightness, $\mu$ is a probability measure and by Theorem 10.3.1, $\lim_{j \to \infty} \phi_{n_j}(t)$ is the characteristic function of $\mu$. That is, $\phi$ is the characteristic function of $\mu$. Since $\phi$ determines $\mu$ uniquely, all vague limit points of $\{\mu_n\}_{n \ge 1}$ coincide with $\mu$ and hence by Theorem 9.2.6, $\mu_n \xrightarrow{d} \mu$. □

This theorem will be used extensively in the next chapter on central limit theorems. For the moment, some easy applications are given.

Example 10.3.1: (Convergence of Binomial to Poisson). Let $\{X_n\}_{n \ge 1}$ be a sequence of random variables such that $X_n \sim \text{Binomial}(N_n, p_n)$ for all $n \ge 1$. Suppose that as $n \to \infty$, $N_n \to \infty$, $p_n \to 0$ and $N_n p_n \to \lambda$, $\lambda \in (0, \infty)$. Then

$$X_n \xrightarrow{d} X \quad \text{where } X \sim \text{Poisson}(\lambda). \tag{3.1}$$

To prove (3.1), note that the characteristic function $\phi_n$ of $X_n$ is

$$\phi_n(t) = \big(p_n e^{\iota t} + 1 - p_n\big)^{N_n} = \big(1 + p_n(e^{\iota t} - 1)\big)^{N_n} = \Big(1 + \frac{N_n p_n}{N_n}(e^{\iota t} - 1)\Big)^{N_n}, \quad t \in \mathbb{R}.$$

Next recall the fact that if $\{z_n\}_{n \ge 1}$ is a sequence of complex numbers such that $\lim_{n \to \infty} z_n = z$ exists, then

$$(1 + n^{-1} z_n)^n \to e^{z} \quad \text{as } n \to \infty. \tag{3.2}$$

So $\phi_n(t) \to e^{\lambda(e^{\iota t} - 1)}$ for all $t \in \mathbb{R}$. Since $\phi(t) \equiv e^{\lambda(e^{\iota t} - 1)}$, $t \in \mathbb{R}$ is the characteristic function of a Poisson($\lambda$) random variable, (3.1) follows.
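The convergence of the Binomial characteristic functions to the Poisson one can be observed numerically. This is a minimal sketch assuming the particular choice $p_n = \lambda / N_n$ with $\lambda = 2$; the helper names `binomial_cf`, `poisson_cf`, and `max_gap` are illustrative, and the gap is evaluated only at a handful of arbitrarily chosen $t$ values.

```python
import cmath

def binomial_cf(t, N, p):
    """Characteristic function of Binomial(N, p) at t."""
    return (1.0 + p * (cmath.exp(1j * t) - 1.0)) ** N

def poisson_cf(t, lam):
    """Characteristic function of Poisson(lam) at t."""
    return cmath.exp(lam * (cmath.exp(1j * t) - 1.0))

def max_gap(N, lam=2.0, ts=(-3.0, -1.0, 0.5, 2.0)):
    """Largest |phi_n(t) - phi(t)| over a few test points, with p = lam / N."""
    return max(abs(binomial_cf(t, N, lam / N) - poisson_cf(t, lam)) for t in ts)

for N in (10, 100, 10_000, 1_000_000):
    print(N, max_gap(N))
```

The printed gaps should shrink as $N$ grows, consistent with (3.2).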

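A direct proof of (3.1) consists of showing that for each $j = 0, 1, 2, \ldots$,

$$P(X_n = j) \equiv \binom{N_n}{j} p_n^j (1 - p_n)^{N_n - j} \to P(X = j) = \frac{e^{-\lambda}\lambda^j}{j!}.$$

Example 10.3.2: (Convergence of Binomial to Normal). Let $X_n \sim \text{Binomial}(N_n, p_n)$ for all $n \ge 1$. Suppose that as $n \to \infty$, $N_n \to \infty$ and $s_n^2 \equiv N_n p_n (1 - p_n) \to \infty$. Then

$$Z_n \equiv \frac{X_n - N_n p_n}{s_n} \xrightarrow{d} N(0, 1). \tag{3.3}$$

To prove (3.3), note that the characteristic function $\phi_n$ of $Z_n$ is

$$\phi_n(t) = \Big[p_n \exp\big(\iota t (1 - p_n)/s_n\big) + (1 - p_n)\exp\big(-\iota t p_n / s_n\big)\Big]^{N_n} \equiv \Big(1 + \frac{z_n(t)}{N_n}\Big)^{N_n}, \ \text{say},$$

where $z_n(t) = N_n\big[p_n e^{\iota t (1 - p_n)/s_n} + (1 - p_n) e^{-\iota t p_n / s_n} - 1\big]$. By (3.2), it suffices to show that for all $t \in \mathbb{R}$,

$$z_n(t) \to -\frac{t^2}{2} \quad \text{as } n \to \infty.$$

By Lemma 10.1.5, for any $x$ real,

$$\Big|e^{\iota x} - 1 - \iota x - \frac{(\iota x)^2}{2}\Big| \le \frac{|x|^3}{3!}. \tag{3.4}$$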

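Since $s_n \to \infty$, for any $t \in \mathbb{R}$, with $p_n(t) \equiv t p_n / s_n$ and $q_n(t) \equiv t(1 - p_n)/s_n$, one has

$$\begin{aligned}
z_n(t) &= N_n\Big[p_n \exp\big(\iota t(1 - p_n)/s_n\big) + (1 - p_n)\exp\big(-\iota t p_n/s_n\big) - 1\Big] \\
&= N_n\Big[p_n\big(e^{\iota q_n(t)} - 1 - \iota q_n(t)\big) + (1 - p_n)\big(e^{-\iota p_n(t)} - 1 + \iota p_n(t)\big)\Big] \\
&= N_n\Big[\frac{p_n}{2}\big(\iota q_n(t)\big)^2 + \frac{1 - p_n}{2}\big(\iota p_n(t)\big)^2\Big] + N_n\, O\Big(\frac{p_n(1 - p_n)|t|^3}{s_n^3}\Big) \\
&= -\frac{t^2}{2} + o(1) \quad \text{as } n \to \infty.
\end{aligned}$$

Here the second equality uses $p_n\, \iota q_n(t) - (1 - p_n)\, \iota p_n(t) = 0$, and the last uses $N_n p_n (1 - p_n) = s_n^2$. This is known as the DeMoivre-Laplace CLT in the case $N_n = n$, $p_n = p$, $0 < p < 1$. The original proof was based on Stirling’s approximation.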

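Example 10.3.3: (Convergence of Poisson to Normal). Let $\{X_n\}_{n \ge 1}$ be a sequence of random variables such that for $n \ge 1$, $X_n \sim \text{Poisson}(\lambda_n)$, $\lambda_n \in (0, \infty)$. Let $Y_n = \frac{X_n - \lambda_n}{\sqrt{\lambda_n}}$, $n \ge 1$. If $\lambda_n \to \infty$ as $n \to \infty$, then

$$Y_n \xrightarrow{d} N(0, 1). \tag{3.5}$$

To prove (3.5), note that the characteristic function $\phi_n$ of $Y_n$ is

$$\phi_n(t) = \exp\big(-\iota t\sqrt{\lambda_n}\big)\exp\Big(\lambda_n\big(\exp\big(\iota t/\sqrt{\lambda_n}\big) - 1\big)\Big) = \exp\Big(\lambda_n\Big[\exp\big(\iota t/\sqrt{\lambda_n}\big) - 1 - \iota t/\sqrt{\lambda_n}\Big]\Big), \quad t \in \mathbb{R}.$$

Now using (3.4) again it is easy to show that for each $t \in \mathbb{R}$,

$$\lambda_n\Big[\exp\Big(\frac{\iota t}{\sqrt{\lambda_n}}\Big) - 1 - \frac{\iota t}{\sqrt{\lambda_n}}\Big] \to -\frac{t^2}{2} \quad \text{as } n \to \infty.$$

Hence, (3.5) follows.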
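The limit used in Example 10.3.3 can be inspected numerically. The sketch below evaluates the exponent $\lambda[\exp(\iota t/\sqrt{\lambda}) - 1 - \iota t/\sqrt{\lambda}]$ for growing $\lambda$ at one assumed value of $t$; the function name `poisson_exponent` is a hypothetical helper. By (3.4), the distance from $-t^2/2$ is at most $|t|^3/(6\sqrt{\lambda})$.

```python
import cmath

def poisson_exponent(t, lam):
    """lam * (exp(i t / sqrt(lam)) - 1 - i t / sqrt(lam)): log-cf of the standardized Poisson(lam)."""
    x = 1j * t / lam ** 0.5
    return lam * (cmath.exp(x) - 1.0 - x)

t = 1.5
for lam in (1.0, 100.0, 1e4, 1e6):
    # Second column should approach the third, -t^2/2, as lam grows.
    print(lam, poisson_exponent(t, lam), -t * t / 2.0)
```

The imaginary part of the printed value decays like $1/\sqrt{\lambda}$, which matches the cubic error term in (3.4).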

10.4 Extension to $\mathbb{R}^k$

Definition 10.4.1: (a) Let $X = (X_1, \ldots, X_k)$ be a $k$-dimensional random vector ($k \in \mathbb{N}$). The characteristic function of $X$ is defined as

$$\phi_X(t) = E\exp(\iota t \cdot X) = E\exp\Big(\iota \sum_{j=1}^k t_j X_j\Big), \tag{4.1}$$

$t = (t_1, \ldots, t_k) \in \mathbb{R}^k$, where $t \cdot x = \sum_{j=1}^k t_j x_j$ denotes the inner product of the two vectors $t = (t_1, \ldots, t_k)$, $x = (x_1, \ldots, x_k) \in \mathbb{R}^k$.

(b) For a probability measure $\mu$ on $\big(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k)\big)$, its characteristic function is defined as

$$\phi(t) = \int_{\mathbb{R}^k} \exp(\iota t \cdot x)\,\mu(dx). \tag{4.2}$$

Note that for a linear combination $L \equiv a_1 X_1 + \cdots + a_k X_k$, $a_1, \ldots, a_k \in \mathbb{R}$, of a set of random variables $X_1, \ldots, X_k$, all defined on a common probability space, the characteristic functions of $L$ and $X = (X_1, \ldots, X_k)$ are related by the identity

$$\phi_L(\lambda) = E\exp\Big(\iota\lambda \sum_{j=1}^k a_j X_j\Big) = \phi_X(\lambda a), \quad \lambda \in \mathbb{R}, \tag{4.3}$$

where $a = (a_1, \ldots, a_k)$. Thus, the characteristic function of a random vector $X = (X_1, \ldots, X_k)$ is determined by the characteristic functions of all its linear combinations and vice versa. It will now be shown that as in the one-dimensional case, the characteristic function of a random vector $X$ uniquely determines its probability distribution. The following is a multivariate version of Theorem 10.2.6.

Theorem 10.4.1: Let $X = (X_1, \ldots, X_k)$ be a random vector with characteristic function $\phi_X(\cdot)$ and let $A = (a_1, b_1] \times \cdots \times (a_k, b_k]$ be a rectangle in $\mathbb{R}^k$ with $-\infty < a_i < b_i < \infty$ for all $i = 1, \ldots, k$. If $P(X \in \partial A) = 0$, then

$$P(X \in A) = \lim_{T \to \infty} \frac{1}{(2\pi)^k}\int_{-T}^{T} \cdots \int_{-T}^{T} \prod_{j=1}^k h_j(t_j)\,\phi_X(t_1, \ldots, t_k)\,dt_1 \ldots dt_k, \tag{4.4}$$

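where $\partial A$ denotes the boundary of $A$ and where $h_j(t_j) \equiv \big[\exp(-\iota t_j a_j) - \exp(-\iota t_j b_j)\big](\iota t_j)^{-1}$ for $t_j \ne 0$ and $h_j(0) = (b_j - a_j)$, $1 \le j \le k$.

Proof: Consider the product space $\Omega = [-T, T]^k \times \mathbb{R}^k$ with the corresponding Borel $\sigma$-algebra $\mathcal{F} = \mathcal{B}([-T, T]^k) \times \mathcal{B}(\mathbb{R}^k)$ and the product measure $\mu = \mu_1 \times \mu_2$, where $\mu_1$ is the Lebesgue measure on $\big([-T, T]^k, \mathcal{B}([-T, T]^k)\big)$ and $\mu_2$ is the probability distribution of $X$ on $\big(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k)\big)$. Since the function

$$f(t, x) \equiv \prod_{j=1}^k h_j(t_j)\exp(\iota t \cdot x),$$

$(t, x) \in \Omega$, is integrable w.r.t. the product measure $\mu$, by Fubini’s theorem,

$$\begin{aligned}
I_T &\equiv \int_{-T}^{T} \cdots \int_{-T}^{T} \prod_{j=1}^k h_j(t_j)\,\phi_X(t_1, \ldots, t_k)\,dt_1 \ldots dt_k \\
&= \int_{\mathbb{R}^k}\Big[\int_{-T}^{T} \cdots \int_{-T}^{T} \prod_{j=1}^k \{h_j(t_j)\exp(\iota t_j x_j)\}\,dt_1 \ldots dt_k\Big]\mu_2(dx) \\
&= \int_{\mathbb{R}^k} \prod_{j=1}^k \Big[\int_{-T}^{T} \frac{\exp(\iota t_j (x_j - a_j)) - \exp(\iota t_j (x_j - b_j))}{\iota t_j}\,dt_j\Big]\mu_2(dx) \\
&= \int_{\mathbb{R}^k} \prod_{j=1}^k \Big[2\int_0^T \frac{\sin t_j (x_j - a_j)}{t_j}\,dt_j - 2\int_0^T \frac{\sin t_j (x_j - b_j)}{t_j}\,dt_j\Big]\mu_2(dx),
\end{aligned} \tag{4.5}$$

using (1.3) and the fact that $\frac{\sin\theta}{\theta}$ and $\frac{\cos\theta}{\theta}$ are respectively even and odd functions of $\theta$. It can be shown that (Problem 10.8)

$$\lim_{T \to \infty} \int_0^T \frac{\sin t}{t}\,dt = \pi/2. \tag{4.6}$$

Hence, by the change of variables theorem, it follows that, for any $c \in \mathbb{R}$,

$$\lim_{T \to \infty} \int_0^T \frac{\sin tc}{t}\,dt = \begin{cases} 0 & \text{if } c = 0 \\ \pi/2 & \text{if } c > 0 \\ -\pi/2 & \text{if } c < 0 \end{cases} \tag{4.7}$$

and

$$\sup_{T > 0,\, c \in \mathbb{R}} \Big|\int_0^T \frac{\sin tc}{t}\,dt\Big| = \sup_{T > 0}\Big|\int_0^T \frac{\sin u}{u}\,du\Big| \equiv K < \infty. \tag{4.8}$$

This implies that as $T \to \infty$, the integrand in (4.5) converges to the function $\prod_{j=1}^k g_j(x_j)$ for each $x \in \mathbb{R}^k$, where

$$g_j(y) = \begin{cases} \pi & \text{if } y \in \{a_j, b_j\} \\ 2\pi & \text{if } y \in (a_j, b_j) \\ 0 & \text{if } y \in (-\infty, a_j) \cup (b_j, \infty). \end{cases} \tag{4.9}$$

Hence, by (4.5), (4.8), (4.9), and the BCT,

$$\lim_{T \to \infty} I_T = \int_{\mathbb{R}^k} \prod_{j=1}^k g_j(x_j)\,\mu_2(dx).$$

By the boundary condition $P(X \in \partial A) = 0$, the right side above equals $(2\pi)^k P\big(X \in (a_1, b_1) \times \cdots \times (a_k, b_k)\big)$, proving the theorem. □

Remark 10.4.1: The inversion formula (2.3) can also be extended to the multivariate case.

Corollary 10.4.2: A probability measure on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$ is uniquely determined by its characteristic function.

Proof: Let $\mu$ and $\nu$ be probability measures on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$ with the same characteristic function $\phi(\cdot)$, i.e.,

$$\phi(t) = \int \exp(\iota t \cdot x)\,\mu(dx) = \int \exp(\iota t \cdot x)\,\nu(dx),$$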

$t \in \mathbb{R}^k$. Let $\mathcal{A} = \{A : A = (a_1, b_1] \times \cdots \times (a_k, b_k],\ -\infty < a_i < b_i < \infty,\ i = 1, \ldots, k,\ \mu(\partial A) = 0 = \nu(\partial A)\}$. It is easy to verify that $\mathcal{A}$ is a $\pi$-class. Since there are only countably many rectangles $(a_1, b_1] \times \cdots \times (a_k, b_k]$ with $\mu(\partial A) + \nu(\partial A) \ne 0$, $\mathcal{A}$ generates $\mathcal{B}(\mathbb{R}^k)$. But, by Theorem 10.4.1,

$$\mu(A) = \lim_{T \to \infty} (2\pi)^{-k}\int_{-T}^{T} \cdots \int_{-T}^{T} \prod_{j=1}^k h_j(t_j)\,\phi(t_1, \ldots, t_k)\,dt_1 \ldots dt_k = \nu(A)$$

for all $A \in \mathcal{A}$. Hence, by Theorem 1.2.4, $\mu(B) = \nu(B)$ for all $B \in \mathcal{B}(\mathbb{R}^k)$, i.e., $\mu = \nu$. □

Corollary 10.4.3: A probability measure $\mu$ on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$ is determined by its values assigned to the collection of half-spaces $\mathcal{H} \equiv \{H : H = \{x \in \mathbb{R}^k : a \cdot x \le c\},\ a \in \mathbb{R}^k,\ c \in \mathbb{R}\}$.

Proof: Let $X$ be the identity mapping on $\mathbb{R}^k$. Then, for any $H = \{x \in \mathbb{R}^k : a \cdot x \le c\}$, $\{X \in H\} = \{a \cdot X \le c\}$. Thus, the values $\{\mu(H) : H \in \mathcal{H}\}$ determine the probability distributions (and hence, the characteristic functions) of all linear combinations of $X$. Consequently, by (4.3), they determine the characteristic function of $X$. By Corollary 10.4.2, this determines $\mu$ uniquely. □

Theorem 10.4.4: Let $\{X_n\}_{n \ge 1}$, $X$ be $k$-dimensional random vectors. Then $X_n \xrightarrow{d} X$ iff

$$\phi_{X_n}(t) \to \phi_X(t) \quad \text{for all } t \in \mathbb{R}^k. \tag{4.10}$$

Proof: Suppose that $X_n \xrightarrow{d} X$. Then, (4.10) follows from the continuous mapping theorem for weak convergence (cf. Theorem 9.4.2). Conversely, suppose (4.10) holds. Let $X_n^{(j)}$ and $X^{(j)}$ denote the $j$th components of $X_n$ and $X$, respectively, $j = 1, \ldots, k$. By (4.10), for any $j \in \{1, \ldots, k\}$,

$$\lim_{n \to \infty} E\exp\big(\iota\lambda X_n^{(j)}\big) = E\exp\big(\iota\lambda X^{(j)}\big) \quad \text{for all } \lambda \in \mathbb{R}.$$

Hence, by Theorem 10.3.4,

$$X_n^{(j)} \xrightarrow{d} X^{(j)} \quad \text{for all } j = 1, \ldots, k. \tag{4.11}$$

This implies that the sequence of random vectors $\{X_n\}_{n \ge 1}$ is tight (Problem 9.9). Hence, by Theorem 9.3.3, given any subsequence $\{n_i\}_{i \ge 1}$, there exists a further subsequence $\{n_i'\}_{i \ge 1} \subset \{n_i\}_{i \ge 1}$ and a random vector $X_0$ such that $X_{n_i'} \xrightarrow{d} X_0$ as $i \to \infty$. By the ‘only if’ part, this implies

$$\phi_{X_{n_i'}}(t) \to \phi_{X_0}(t) \quad \text{as } i \to \infty,$$

for all $t \in \mathbb{R}^k$. Thus, $\phi_{X_0}(\cdot) = \phi_X(\cdot)$ and by the uniqueness of characteristic functions, $X_0 =^d X$. Thus, all convergent subsequences of $\{X_n\}_{n \ge 1}$ have the same limit. By arguments similar to the proof of Theorem 9.2.6, $X_n \xrightarrow{d} X$. This completes the proof of the theorem. □

Theorem 10.4.4 shows that as in the one-dimensional case, the (pointwise) convergence of the characteristic functions of a sequence of $k$-dimensional random vectors $\{X_n\}_{n \ge 1}$ to a given characteristic function is equivalent to convergence in distribution of the sequence $\{X_n\}_{n \ge 1}$. Since the characteristic function of a random vector is determined by the characteristic functions of all its linear combinations, this suggests that one may also be able to establish convergence in distribution of a sequence of random vectors by considering the convergence of the sequences of linear combinations that are one-dimensional random variables. This is indeed true as shown by the following result.

Theorem 10.4.5: (Cramer-Wold device). Let $\{X_n\}_{n \ge 1}$ be a sequence of $k$-dimensional random vectors and let $X$ be a $k$-dimensional random vector. Then, $X_n \xrightarrow{d} X$ iff for all $a \in \mathbb{R}^k$,

$$a \cdot X_n \xrightarrow{d} a \cdot X. \tag{4.12}$$

Proof: Suppose $X_n \xrightarrow{d} X$. Then, for any $a \in \mathbb{R}^k$, the function $h(x) = a \cdot x$, $x \in \mathbb{R}^k$ is continuous on $\mathbb{R}^k$. Hence, (4.12) follows from Theorem 9.4.2. Conversely, suppose that (4.12) holds for all $a \in \mathbb{R}^k$. By (4.3) and Theorem 10.3.1, this implies that as $n \to \infty$,

$$\phi_{X_n}(a) = \phi_{a \cdot X_n}(1) \to \phi_{a \cdot X}(1) = \phi_X(a),$$

for all $a \in \mathbb{R}^k$. Hence, by Theorem 10.4.4, $X_n \xrightarrow{d} X$. □

Recall that a set of random variables $X_1, \ldots, X_k$ defined on a common probability space are independent iff the joint cdf of $X_1, \ldots, X_k$ is the product of the marginal cdfs of the $X_i$’s. A similar characterization of independence can be given in terms of the characteristic functions, as shown by the following result. The proof is left as an exercise (Problem 10.16).

Proposition 10.4.6: Let $X_1, \ldots, X_k$ ($k \in \mathbb{N}$) be a collection of random variables defined on a common probability space. Then, $X_1, \ldots, X_k$ are independent iff

$$\phi_{(X_1, \ldots, X_k)}(t_1, \ldots, t_k) = \prod_{j=1}^k \phi_{X_j}(t_j) \quad \text{for all } t_1, \ldots, t_k \in \mathbb{R}.$$

10.5 Problems

10.1 Let $\{X_n\}_{n \ge 1}$ and $\{Y_n\}_{n \ge 1}$ be two sequences of random variables such that for each $n \ge 1$, $X_n$ and $Y_n$ are defined on a common probability space and $X_n$ and $Y_n$ are independent. If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{d} Y$, then show that

$$X_n + Y_n \xrightarrow{d} X_0 + Y_0 \tag{5.1}$$

where $X_0 =^d X$, $Y_0 =^d Y$ (cf. Section 2.2) and $X_0$ and $Y_0$ are independent. Show by an example that (5.1) is false without the independence hypothesis.

10.2 Give an example of a nonlattice discrete distribution $F$ on $\mathbb{R}$ supported by only a three point set.

10.3 Let $F$ be an absolutely continuous cdf on $\mathbb{R}$ with density $f$ and with characteristic function $\phi$. Show that if $f$ has a derivative $f^{(1)} \in L^1(\mathbb{R})$, then

$$\lim_{|t| \to \infty} |t\phi(t)| = 0.$$

Generalize this result when $f$ is $r$-times differentiable and the $j$th derivative $f^{(j)}$ lies in $L^1(\mathbb{R})$ for $j = 1, \ldots, r$.

10.4 Let $F$ be a cdf on $\mathbb{R}$ with characteristic function $\phi$. Show that for any $a < b$, $a, b \in \mathbb{R}$,

$$\lim_{T \to \infty} \frac{1}{2\pi}\int_{-T}^{T} \big[\exp(-\iota t a) - \exp(-\iota t b)\big](\iota t)^{-1}\phi(t)\,dt = \mu_F((a, b)) + \frac{1}{2}\mu_F(\{a, b\}), \tag{5.2}$$

where $\mu_F$ denotes the Lebesgue-Stieltjes measure corresponding to $F$. (Hint: Use (4.7) and the arguments in the proof of Theorem 10.4.1.)

10.5 Let $\phi$ be a characteristic function of a cdf $F$ and let $\mu_F$ denote the Lebesgue-Stieltjes measure corresponding to $F$.

(a) Show that for any $a \in \mathbb{R}$ and $T \in (0, \infty)$,

$$\int_{-T}^{T} \exp(-\iota t a)\phi(t)\,dt = 2T\mu_F(\{a\}) + \int_{\{x \ne a\}} \frac{\exp(\iota T(x - a)) - \exp(-\iota T(x - a))}{\iota (x - a)}\,\mu_F(dx). \tag{5.3}$$

(b) Conclude from (5.3) that for any $a \in \mathbb{R}$,

$$F(a) - F(a-) = \lim_{T \to \infty} \frac{1}{2T}\int_{-T}^{T} \exp(-\iota t a)\phi(t)\,dt. \tag{5.4}$$

10.6 Let $F$ be a cdf on $\mathbb{R}$ with characteristic function $\phi$. If $|\phi| \in L^2(\mathbb{R})$, then show that $F$ is continuous. (Hint: Use Corollary 10.1.7.)

10.7 Let $\{F_n\}_{n \ge 1}$, $F$ be cdfs with characteristic functions $\{\phi_n\}_{n \ge 1}$, $\phi$, respectively. Suppose that $F_n \xrightarrow{d} F$.

(a) Give an example to show that $\phi_n$ may not converge to $\phi$ uniformly over all of $\mathbb{R}$. (Hint: Try $\phi_n(t) \equiv e^{-t^2/n}$.)

(b) Let $\{\mu_n\}_{n \ge 1}$ and $\mu$ denote the Lebesgue-Stieltjes measures corresponding to $\{F_n\}_{n \ge 1}$ and $F$, respectively. Suppose that $\{\mu_n\}_{n \ge 1}$ and $\mu$ are dominated by a $\sigma$-finite measure $\lambda$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ with Radon-Nikodym derivatives $f_n = \frac{d\mu_n}{d\lambda}$, $n \ge 1$ and $f = \frac{d\mu}{d\lambda}$. If $f_n \to f$ a.e. ($\lambda$), then show that

$$\sup_{t \in \mathbb{R}} |\phi_n(t) - \phi(t)| \to 0 \quad \text{as } n \to \infty. \tag{5.5}$$

10.8 Let $G(x, a) = (1 + a^2)^{-1}\big(1 - e^{-ax}\{a \sin x + \cos x\}\big)$, $x \in \mathbb{R}$, $a \in \mathbb{R}$.

(a) Show that for any $a > 0$, $x_0 \ge 0$,

$$\int_0^{x_0} (\sin x)\,e^{-ax}\,dx = G(x_0, a). \tag{5.6}$$

(Hint: Consider the derivatives of the left and the right sides of (5.6) w.r.t. $x_0$.)

(b) Use Fubini’s theorem to justify that for all $T > 0$,

$$\int_0^T \int_0^\infty (\sin x)\,e^{-ax}\,da\,dx = \int_0^\infty \int_0^T (\sin x)\,e^{-ax}\,dx\,da. \tag{5.7}$$

(c) Use (5.6), (5.7) and the identity that for $x > 0$, $\int_0^\infty e^{-ax}\,da = \frac{1}{x}$ to conclude that for any $T > 0$,

$$\int_0^T \frac{\sin x}{x}\,dx = \int_0^\infty G(T, a)\,da. \tag{5.8}$$

(d) Use the DCT and the fact that $\int_0^\infty (1 + a^2)^{-1}\,da = \frac{\pi}{2}$ to conclude that the limit of the right side of (5.8) exists and equals $\frac{\pi}{2}$.

10.9 Let $F_1$, $F_2$, and $F_3$ be three cdfs on $\mathbb{R}$. Then show by an example that $F_1 * F_2 = F_1 * F_3$ does not imply that $F_2 = F_3$. Here $F_i * F_j$ denotes the convolution of $F_i$ and $F_j$, $1 \le i, j \le 3$. (Hint: For $F_1$, consider a cdf whose characteristic function $\phi$ has a bounded support.)

10.10 Let $\mu$ be a probability measure on $\mathbb{R}$ with characteristic function $\phi$. Prove that

$$\frac{2}{\pi}\int_0^\infty [1 - \mathrm{Re}(\phi(t))]\,t^{-2}\,dt = \int |x|\,\mu(dx).$$

10.11 Let $\phi$ be the characteristic function of a random variable $X$. If $|\phi(t)| = 1 = |\phi(\alpha t)|$ for some $t \ne 0$ and $\alpha \in \mathbb{R}$ irrational, then there exists $x_0 \in \mathbb{R}$ such that $P(X = x_0) = 1$. (Hint: Use Proposition 10.1.1.)

10.12 Show that for any characteristic function $\phi$, $\{t \in \mathbb{R} : |\phi(t)| = 1\}$ is either $\{0\}$ or countably infinite or all of $\mathbb{R}$.

10.13 Let $\{X_n\}_{n \ge 1}$ be a sequence of iid random variables with a nondegenerate distribution $F$. Suppose that there exist $a_n \in (0, \infty)$ and $b_n \in \mathbb{R}$ such that

$$a_n^{-1}\Big(\sum_{j=1}^n X_j - b_n\Big) \xrightarrow{d} Z \tag{5.9}$$

for some nondegenerate random variable $Z$.

(a) Show that $a_n \to \infty$ as $n \to \infty$. (Hint: If $a_n \to a \in \mathbb{R}$, then $E\exp(\iota t a Z) = \lim_{n \to \infty} E\exp\big(\iota t\big(\sum_{j=1}^n X_j - b_n\big)\big) = 0$ for all except countably many $t \in \mathbb{R}$, which leads to a contradiction.)

(b) Show that as $n \to \infty$,

$$b_n - b_{n-1} = o(a_n) \quad \text{and} \quad \frac{a_n}{a_{n-1}} \to 1.$$

(Hint: Use (a) to show that $\big(\sum_{j=1}^{n-1} X_j - b_n\big)/a_n \xrightarrow{d} Z$ and by (5.9), $\big(\sum_{j=1}^{n-1} X_j - b_{n-1}\big)/a_{n-1} \xrightarrow{d} Z$.)

10.14 Show that for every $T \in (0, \infty)$, there exist two distinct characteristic functions $\phi_{1T}$ and $\phi_{2T}$ satisfying

$$\phi_{1T}(t) = \phi_{2T}(t) \quad \text{for all } |t| \le T.$$

(Hint: Let $\phi_1(t) = e^{-|t|}$, $t \in \mathbb{R}$ and for any $T \in (0, \infty)$, define an even function $\phi_{2T}(\cdot)$ by

$$\phi_{2T}(t) = \begin{cases} \phi_1(t) & \text{for } 0 \le t \le T \\ \phi_1(T) + (t - T)(-\phi_1(T)) & \text{for } T \le t \le T + 1 \\ 0 & \text{for } t > T + 1. \end{cases}$$

Now use Polya’s criterion.)

10.15 Show that $\phi_\alpha(t) = \exp(-|t|^\alpha)$, $t \in \mathbb{R}$, $\alpha \in (0, \infty)$ is a characteristic function for $0 < \alpha \le 2$.

10.16 Prove Proposition 10.4.6. (Hint: The ‘only if’ part follows from (4.2) and Proposition 7.1.3. The ‘if part’ follows by using the inversion formulas of Theorems 10.4.1 and 10.2.6 and the characterization of independence in terms of cdfs (Corollary 7.1.2).)

10.17 Let $\{X_n\}_{n \ge 0}$ be a collection of random variables with characteristic functions $\{\phi_n\}_{n \ge 0}$. Suppose that $\int |\phi_n(t)|\,dt < \infty$ for all $n \ge 0$ and that $\phi_n(\cdot) \to \phi_0(\cdot)$ in $L^1(\mathbb{R})$ as $n \to \infty$. Show that

$$\sup_{B \in \mathcal{B}(\mathbb{R})} \big|P(X_n \in B) - P(X_0 \in B)\big| \to 0 \quad \text{as } n \to \infty.$$

10.18 Let $\phi(\cdot)$ be a characteristic function on $\mathbb{R}$ such that $\phi(t) \to 0$ as $|t| \to \infty$. Let $X$ be a random variable with $\phi$ as its characteristic function. For each $n \ge 1$, let $X_n = \frac{k}{n}$ if $\frac{k}{n} \le X < \frac{k+1}{n}$, $k = 0, \pm 1, \pm 2, \ldots$. Show that if $\phi_n(t) \equiv E(e^{\iota t X_n})$, then $\phi_n(t) \to \phi(t)$ for each $t \in \mathbb{R}$ but for each $n \ge 1$, $\sup\{|\phi_n(t) - \phi(t)| : t \in \mathbb{R}\} = 1$.

10.19 Let $\{\delta_i\}_{i \ge 1}$ be iid random variables with distribution $P(\delta_1 = 1) = P(\delta_1 = -1) = 1/2$. Let $X_n = \sum_{i=1}^n \frac{\delta_i}{2^i}$ and $X = \lim_{n \to \infty} X_n$.

(a) Find the characteristic function of $X_n$.

(b) Show that the characteristic function of $X$ is $\phi_X(t) \equiv \frac{\sin t}{t}$.

10.20 Let $\{X_k\}_{k \ge 1}$ be iid random variables with pdf $f(x) = \frac{1}{2}e^{-|x|}$, $x \in \mathbb{R}$. Show that $\sum_{k=1}^\infty \frac{1}{k}X_k$ converges w.p. 1 and compute its characteristic function. (Hint: Note that the characteristic function of the standard Cauchy (0,1) distribution is $e^{-|t|}$.)

10.21 Establish an extension of formula (2.3) to the multivariate case.

11 Central Limit Theorems

11.1 Lindeberg-Feller theorems

The central limit theorem (CLT) is one of the oldest and most useful results in probability theory. Empirical findings in applied sciences, dating back to the 17th century, showed that the averages of laboratory measurements on various physical quantities tended to have a bell-shaped distribution. The CLT provides a theoretical justification for this observation. Roughly speaking, it says that under some mild conditions, the average of a large number of iid random variables is approximately normally distributed. A version of this result for 0–1 valued random variables was proved by DeMoivre and Laplace in the early 18th century. An extension of this result to the averages of iid random variables with a finite second moment was done in the early 20th century. In this section, a more general set up is considered, namely, that of the limit behavior of the row sums of a triangular array of independent random variables.

Definition 11.1.1: For each $n \ge 1$, let $\{X_{n1}, \ldots, X_{nr_n}\}$ be a collection of random variables defined on a probability space $(\Omega_n, \mathcal{F}_n, P_n)$ such that $X_{n1}, \ldots, X_{nr_n}$ are independent. Then, $\{X_{nj} : 1 \le j \le r_n\}_{n \ge 1}$ is called a triangular array of independent random variables.

Let $\{X_{nj} : 1 \le j \le r_n\}_{n \ge 1}$ be a triangular array of independent random variables. Define the row sums

$$S_n = \sum_{j=1}^{r_n} X_{nj}, \quad n \ge 1. \tag{1.1}$$

Suppose that $EX_{nj}^2 < \infty$ for all $j, n$. Write $s_n^2 = \mathrm{Var}(S_n) = \sum_{j=1}^{r_n} \mathrm{Var}(X_{nj})$, $n \ge 1$. The following condition, introduced by Lindeberg, plays an important role in establishing convergence of $\frac{S_n - ES_n}{s_n}$ to a standard normal random variable in distribution.

Definition 11.1.2: Let $\{X_{nj} : 1 \le j \le r_n\}_{n \ge 1}$ be a triangular array of independent random variables such that

$$EX_{nj} = 0, \quad 0 < EX_{nj}^2 \equiv \sigma_{nj}^2 < \infty \quad \text{for all } 1 \le j \le r_n,\ n \ge 1. \tag{1.2}$$

Then, $\{X_{nj} : 1 \le j \le r_n\}_{n \ge 1}$ is said to satisfy the Lindeberg condition if for every $\epsilon > 0$,

$$\lim_{n \to \infty} s_n^{-2}\sum_{j=1}^{r_n} EX_{nj}^2 I(|X_{nj}| > \epsilon s_n) = 0, \tag{1.3}$$

where $s_n^2 = \sum_{j=1}^{r_n} \sigma_{nj}^2$, $n \ge 1$.

Example 11.1.1: Let $\{X_n\}_{n \ge 1}$ be a sequence of iid random variables with $EX_1 = \mu$ and $\mathrm{Var}(X_1) = \sigma^2 \in (0, \infty)$. Consider the centered and scaled sample mean

$$T_n = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma}, \quad n \ge 1, \tag{1.4}$$

where $\bar{X}_n = n^{-1}\sum_{j=1}^n X_j$. Note that $T_n$ can be written as the row sum of a triangular array of independent random variables:

$$T_n = \sum_{j=1}^n X_{nj}, \tag{1.5}$$

where $X_{nj} = (X_j - \mu)/\{\sigma\sqrt{n}\}$, $1 \le j \le n$, $n \ge 1$. Clearly, $\{X_{nj} : 1 \le j \le n\}_{n \ge 1}$ satisfies (1.2) with $\sigma_{nj}^2 = EX_{nj}^2 = \frac{1}{n\sigma^2}\mathrm{Var}(X_1) = 1/n$ for all $1 \le j \le n$, and hence, $s_n^2 = \sum_{j=1}^n \sigma_{nj}^2 = 1$ for all $n \ge 1$. Now, for any $\epsilon > 0$,

$$\begin{aligned}
s_n^{-2}\sum_{j=1}^n EX_{nj}^2 I(|X_{nj}| > \epsilon s_n)
&= \sum_{j=1}^n E\Big(\frac{X_j - \mu}{\sigma\sqrt{n}}\Big)^2 I\Big(\Big|\frac{X_j - \mu}{\sigma\sqrt{n}}\Big| > \epsilon\Big) \\
&= n\,\frac{1}{n\sigma^2}\,E(X_1 - \mu)^2 I\big(|X_1 - \mu| > \epsilon\sigma\sqrt{n}\big) \\
&= \sigma^{-2} E(X_1 - \mu)^2 I\big(|X_1 - \mu| > \epsilon\sigma\sqrt{n}\big) \to 0 \quad \text{as } n \to \infty,
\end{aligned}$$

by the DCT, since $E(X_1 - \mu)^2 < \infty$. Thus, the triangular array $\{X_{nj} : 1 \le j \le n\}$ of (1.5) satisfies the Lindeberg condition (1.3).

The main result of this section is the following CLT for scaled row sums of a triangular array of independent random variables.

Theorem 11.1.1: (Lindeberg CLT). Let $\{X_{nj} : 1 \le j \le r_n\}_{n \ge 1}$ be a triangular array of independent random variables satisfying (1.2) and the Lindeberg condition (1.3). Then,

$$\frac{S_n}{s_n} \xrightarrow{d} N(0, 1) \tag{1.6}$$

where $S_n = \sum_{j=1}^{r_n} X_{nj}$ and $s_n^2 = \mathrm{Var}(S_n) = \sum_{j=1}^{r_n} \sigma_{nj}^2$.

As a direct consequence of Theorem 11.1.1 and Example 11.1.1, one gets the more familiar version of the CLT for the sample mean of iid random variables.

Corollary 11.1.2: (CLT for iid random variables). Let $\{X_n\}_{n \ge 1}$ be a sequence of iid random variables with $EX_1 = \mu$ and $\mathrm{Var}(X_1) = \sigma^2 \in (0, \infty)$. Then,

$$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2), \tag{1.7}$$

where $\bar{X}_n = n^{-1}\sum_{j=1}^n X_j$, $n \ge 1$.
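The Lindeberg sum computed in Example 11.1.1 can be made concrete for a specific distribution. As an assumed illustration (the text does not single out any distribution), take $X_1 \sim$ Exponential(1), so $\mu = \sigma = 1$ and the sum reduces to $E(X_1 - 1)^2 I(|X_1 - 1| > \epsilon\sqrt{n})$. This has a closed form because $-e^{-x}(x^2 + 1)$ is an antiderivative of $(x - 1)^2 e^{-x}$. The function names are hypothetical.

```python
import math

def truncated_second_moment(c):
    """E[(X-1)^2 ; |X-1| > c] for X ~ Exponential(1), via the antiderivative -exp(-x)*(x^2+1)."""
    # Upper tail: x > 1 + c.
    upper = math.exp(-(1.0 + c)) * ((1.0 + c) ** 2 + 1.0)
    # Lower tail: 0 <= x < 1 - c, which is non-empty only when c < 1.
    lower = 0.0
    if c < 1.0:
        lower = 1.0 - math.exp(-(1.0 - c)) * ((1.0 - c) ** 2 + 1.0)
    return upper + lower

def lindeberg_sum(n, eps):
    """Lindeberg sum of Example 11.1.1 for Exponential(1) data (mu = sigma = 1)."""
    return truncated_second_moment(eps * math.sqrt(n))

for n in (1, 10, 100, 1000, 10_000):
    print(n, lindeberg_sum(n, eps=0.2))
```

At truncation level 0 the function returns the full variance 1, and the printed values decrease to 0 as $n$ grows, as the DCT argument in the example predicts.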

For proving the theorem, the following simple inequality will be used.

Lemma 11.1.3: For any $m \in \mathbb{N}$ and for any complex numbers $z_1, \ldots, z_m$, $\omega_1, \ldots, \omega_m$, with $|z_j| \le 1$, $|\omega_j| \le 1$ for all $j = 1, \ldots, m$,

$$\Big|\prod_{j=1}^m z_j - \prod_{j=1}^m \omega_j\Big| \le \sum_{j=1}^m |z_j - \omega_j|. \tag{1.8}$$

Proof: Inequality (1.8) follows from the identity

$$\prod_{j=1}^m z_j - \prod_{j=1}^m \omega_j = \Big(\prod_{j=1}^m z_j - \prod_{j=1}^{m-1} z_j\,\omega_m\Big) + \Big(\prod_{j=1}^{m-1} z_j\,\omega_m - \prod_{j=1}^{m-2} z_j\,\omega_{m-1}\omega_m\Big) + \cdots + \Big(z_1\prod_{j=2}^m \omega_j - \prod_{j=1}^m \omega_j\Big).$$
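Inequality (1.8) is easy to test on concrete numbers. The sketch below uses an arbitrary, assumed choice of complex numbers inside the unit disc; `product` and `check_product_inequality` are hypothetical helper names.

```python
import cmath

def product(values):
    """Product of a sequence of complex numbers."""
    result = 1 + 0j
    for v in values:
        result *= v
    return result

def check_product_inequality(zs, ws):
    """Return (|prod z - prod w|, sum |z_j - w_j|) for two equal-length sequences."""
    lhs = abs(product(zs) - product(ws))
    rhs = sum(abs(z - w) for z, w in zip(zs, ws))
    return lhs, rhs

# All moduli are at most 1, so Lemma 11.1.3 applies.
zs = [0.9 * cmath.exp(0.3j * k) for k in range(1, 8)]
ws = [0.8 * cmath.exp(0.25j * k) for k in range(1, 8)]
lhs, rhs = check_product_inequality(zs, ws)
print(lhs, rhs, lhs <= rhs)
```

The modulus restriction matters: with $z_1 = z_2 = 2$ and $\omega_1 = \omega_2 = 1$ the left side is 3 while the right side is 2, so the bound can fail outside the unit disc.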

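□

Proof of Theorem 11.1.1: W.l.o.g., suppose that $s_n^2 = 1$ for all $n \ge 1$. (Otherwise, setting $\tilde{X}_{nj} \equiv X_{nj}/s_n$, $1 \le j \le r_n$, $n \ge 1$, it is easy to check that for the triangular array $\{\tilde{X}_{nj} : 1 \le j \le r_n\}_{n \ge 1}$, the variance of the $n$th row sum $\tilde{s}_n^2 \equiv \sum_{j=1}^{r_n} \mathrm{Var}(\tilde{X}_{nj}) = 1$ for all $n \ge 1$, the Lindeberg condition holds, and $\tilde{s}_n^{-1}\sum_{j=1}^{r_n} \tilde{X}_{nj} \xrightarrow{d} N(0, 1)$ iff (1.6) holds.) Then, by Theorem 10.3.4, it is enough to show that

$$\lim_{n \to \infty} E\exp(\iota t S_n) = e^{-t^2/2} \quad \text{for all } t \in \mathbb{R}. \tag{1.9}$$

For any $\epsilon > 0$,

$$\begin{aligned}
\Delta_n &\equiv \max\big\{EX_{nj}^2 : 1 \le j \le r_n\big\} \\
&\le \max\big\{EX_{nj}^2 I(|X_{nj}| > \epsilon) + EX_{nj}^2 I(|X_{nj}| \le \epsilon) : 1 \le j \le r_n\big\} \\
&\le \sum_{j=1}^{r_n} EX_{nj}^2 I(|X_{nj}| > \epsilon) + \epsilon^2 \\
&= o(1) + \epsilon^2 \quad \text{as } n \to \infty, \text{ by the Lindeberg condition (1.3).}
\end{aligned}$$

Hence,

$$\Delta_n \to 0 \quad \text{as } n \to \infty. \tag{1.10}$$

Fix $t \in \mathbb{R}$. Let $\phi_{nj}(\cdot)$ denote the characteristic function of $X_{nj}$, $1 \le j \le r_n$, $n \ge 1$. Note that by (1.10), there exists $n_0 \in \mathbb{N}$ such that for all $n \ge n_0$, $I_{1n} \equiv \max\{|1 - t^2\sigma_{nj}^2/2| : 1 \le j \le r_n\} \le 1$. Next, noting that $s_n^2 = \sum_{j=1}^{r_n} \sigma_{nj}^2 = 1$, by Lemma 11.1.3, for all $n \ge n_0$,

$$\begin{aligned}
\Big|E\exp(\iota t S_n) - e^{-t^2/2}\Big|
&\le \Big|\prod_{j=1}^{r_n} \phi_{nj}(t) - \prod_{j=1}^{r_n}\Big(1 - \frac{t^2\sigma_{nj}^2}{2}\Big)\Big| + \Big|\prod_{j=1}^{r_n}\Big(1 - \frac{t^2\sigma_{nj}^2}{2}\Big) - \prod_{j=1}^{r_n}\exp(-t^2\sigma_{nj}^2/2)\Big| \\
&\le \sum_{j=1}^{r_n}\Big|\phi_{nj}(t) - \Big(1 - \frac{t^2\sigma_{nj}^2}{2}\Big)\Big| + \sum_{j=1}^{r_n}\Big|\exp(-t^2\sigma_{nj}^2/2) - \Big(1 - \frac{t^2\sigma_{nj}^2}{2}\Big)\Big| \\
&\equiv I_{2n} + I_{3n}, \ \text{say.}
\end{aligned} \tag{1.11}$$

It will now be shown that

$$\lim_{n \to \infty} I_{kn} = 0 \quad \text{for } k = 2, 3.$$

First consider $I_{2n}$. Since $\big|\exp(\iota x) - [1 + \iota x + (\iota x)^2/2]\big| \le \min\{|x|^3/3!, |x|^2\}$ for all $x \in \mathbb{R}$ (cf. Lemma 10.1.5) and $EX_{nj} = 0$ for all $1 \le j \le r_n$, for any $\epsilon > 0$, by the Lindeberg condition, one gets

$$\begin{aligned}
I_{2n} &\equiv \sum_{j=1}^{r_n}\Big|\phi_{nj}(t) - \Big(1 - \frac{t^2\sigma_{nj}^2}{2}\Big)\Big| \\
&= \sum_{j=1}^{r_n}\Big|E\exp(\iota t X_{nj}) - \Big[1 + \iota t EX_{nj} + \frac{(\iota t)^2}{2!}EX_{nj}^2\Big]\Big| \\
&\le \sum_{j=1}^{r_n} E\min\Big\{\frac{|tX_{nj}|^3}{3!}, |tX_{nj}|^2\Big\} \\
&\le \sum_{j=1}^{r_n} E|tX_{nj}|^3 I(|X_{nj}| \le \epsilon) + \sum_{j=1}^{r_n} E(tX_{nj})^2 I(|X_{nj}| > \epsilon) \\
&\le |t|^3\epsilon\sum_{j=1}^{r_n} EX_{nj}^2 + t^2\sum_{j=1}^{r_n} EX_{nj}^2 I(|X_{nj}| > \epsilon) \\
&\le |t|^3\epsilon + t^2 \cdot o(1) \quad \text{as } n \to \infty.
\end{aligned} \tag{1.12}$$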

Since $\epsilon \in (0, \infty)$ is arbitrary, $I_{2n} \to 0$ as $n \to \infty$. Next, consider $I_{3n}$. Note that for any $x \in \mathbb{R}$,

$$|e^x - 1 - x| = \Big|\sum_{k=2}^\infty x^k/k!\Big| \le x^2\sum_{k=2}^\infty \frac{|x|^{k-2}}{k!} \le x^2 e^{|x|}.$$

Hence, using (1.10) and the fact that $s_n^2 = 1$, one gets

$$\begin{aligned}
I_{3n} &\le \sum_{j=1}^{r_n}\Big(\frac{t^2\sigma_{nj}^2}{2}\Big)^2\exp(t^2\sigma_{nj}^2/2) \\
&\le t^4\exp(t^2\Delta_n/2)\sum_{j=1}^{r_n}\sigma_{nj}^2 \cdot \Delta_n \\
&= t^4\exp(t^2\Delta_n/2)\,\Delta_n \to 0 \quad \text{as } n \to \infty.
\end{aligned} \tag{1.13}$$

Now (1.9) follows from (1.11), (1.12), and (1.13). This completes the proof of the theorem. □

Oftentimes, verification of the Lindeberg condition (1.3) becomes difficult as one has to find the truncated second moments of the $X_{nj}$’s. A simpler sufficient condition for the CLT is provided by Lyapounov’s condition.

Definition 11.1.3: A triangular array $\{X_{nj} : 1 \le j \le r_n\}_{n \ge 1}$ of independent random variables satisfying (1.2) is said to satisfy Lyapounov’s condition if there exists a $\delta \in (0, \infty)$ such that

$$\lim_{n \to \infty} s_n^{-(2+\delta)}\sum_{j=1}^{r_n} E|X_{nj}|^{2+\delta} = 0, \tag{1.14}$$

where $s_n^2 = \sum_{j=1}^{r_n} EX_{nj}^2$.

Note that by Markov’s inequality, if a triangular array $\{X_{nj} : 1 \le j \le r_n\}_{n \ge 1}$ satisfies Lyapounov’s condition (1.14), then for any $\epsilon \in (0, \infty)$,

$$s_n^{-2}\sum_{j=1}^{r_n} EX_{nj}^2 I(|X_{nj}| > \epsilon s_n) \le s_n^{-2}\sum_{j=1}^{r_n} E|X_{nj}|^2\big(|X_{nj}|/(\epsilon s_n)\big)^\delta = \epsilon^{-\delta}\, s_n^{-(2+\delta)}\sum_{j=1}^{r_n} E|X_{nj}|^{2+\delta} \to 0 \quad \text{as } n \to \infty.$$

Thus, $\{X_{nj} : 1 \le j \le r_n\}_{n \ge 1}$ satisfies the Lindeberg condition (1.3). This observation leads to the following result.

Corollary 11.1.4: (Lyapounov’s CLT). Let $\{X_{nj} : 1 \le j \le r_n\}_{n \ge 1}$ be a triangular array of independent random variables satisfying (1.2) and Lyapounov’s condition (1.14). Then, (1.6) holds, i.e.,

$$\frac{S_n}{s_n} \xrightarrow{d} N(0, 1).$$

It is clear that Lyapounov’s condition is only a sufficient but not a necessary condition for the validity of the CLT. In contrast, under some regularity conditions on the triangular array $\{X_{nj} : 1 \le j \le r_n\}_{n \ge 1}$, which essentially say that the individual random variables $X_{nj}$ are ‘uniformly small’, the Lindeberg condition is also a necessary condition for the CLT. This converse is due to W. Feller.

Theorem 11.1.5: (Feller’s theorem). Let $\{X_{nj} : 1 \le j \le r_n\}_{n \ge 1}$ be a triangular array of independent random variables satisfying (1.2) such that for any $\epsilon > 0$,

$$\lim_{n \to \infty}\max_{1 \le j \le r_n} P(|X_{nj}| > \epsilon s_n) = 0, \tag{1.15}$$

where $s_n^2 = \sum_{j=1}^{r_n} EX_{nj}^2$. Let $S_n = \sum_{j=1}^{r_n} X_{nj}$. If, in addition,

$$\frac{S_n}{s_n} \xrightarrow{d} N(0, 1), \tag{1.16}$$

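then $\{X_{nj} : 1 \le j \le r_n\}_{n \ge 1}$ satisfies the Lindeberg condition.

A triangular array $\{X_{nj} : 1 \le j \le r_n\}_{n \ge 1}$ satisfying (1.15) is called a null array. Thus, the converse of Theorem 11.1.1 holds for null arrays. It may be noted that there exist non-null arrays for which (1.16) holds but the Lindeberg condition fails (Problem 11.9).

Proof: W.l.o.g., suppose that $s_n^2 = 1$ for all $n \ge 1$. Next fix $\epsilon \in (0, \infty)$. Then, setting $t_0 = 4/\epsilon$, and noting that $1 = \sum_{j=1}^{r_n} EX_{nj}^2$ and $\cos x \ge 1 - x^2/2$ for all $x \in \mathbb{R}$, one gets

$$\begin{aligned}
\sum_{j=1}^{r_n}\big(E\cos t_0 X_{nj} - 1\big) + \frac{t_0^2}{2}
&= \sum_{j=1}^{r_n} E\Big[\frac{t_0^2 X_{nj}^2}{2} - 1 + \cos t_0 X_{nj}\Big] \\
&\ge \sum_{j=1}^{r_n} E\Big[\frac{t_0^2 X_{nj}^2}{2} - 1 + \cos t_0 X_{nj}\Big] I\big(|X_{nj}| > \epsilon\big) \\
&\ge \sum_{j=1}^{r_n} E\Big[\frac{t_0^2 X_{nj}^2}{2} - 2\Big] I\big(|X_{nj}| > \epsilon\big) \\
&\ge \Big(\frac{t_0^2}{2} - \frac{2}{\epsilon^2}\Big)\sum_{j=1}^{r_n} EX_{nj}^2 I\big(|X_{nj}| > \epsilon\big) \\
&= \frac{6}{\epsilon^2}\sum_{j=1}^{r_n} EX_{nj}^2 I\big(|X_{nj}| > \epsilon\big).
\end{aligned}$$

Hence, the Lindeberg condition would hold if it is shown that for all $t \in \mathbb{R}$,

$$\sum_{j=1}^{r_n}\big(E\cos tX_{nj} - 1\big) + \frac{t^2}{2} \to 0 \ \text{ as } n \to \infty \quad \Leftrightarrow \quad \exp\Big(\sum_{j=1}^{r_n}\big(E\cos tX_{nj} - 1\big)\Big) \to e^{-t^2/2} \ \text{ as } n \to \infty. \tag{1.17}$$

Let $\phi_{nj}(t) = E\exp(\iota t X_{nj})$, $t \in \mathbb{R}$ denote the characteristic function of $X_{nj}$, $1 \le j \le r_n$, $n \ge 1$. Note that $E\cos tX_{nj} = \mathrm{Re}(\phi_{nj}(t))$, where recall that for any complex number $z$, $\mathrm{Re}(z)$ denotes the real part of $z$, i.e., $\mathrm{Re}(z) = a$ if $z = a + \iota b$, $a, b \in \mathbb{R}$. Since the function $h(z) = |z|$ is continuous on $\mathbb{C}$ and $|\exp(\phi_{nj}(t))| = \exp(E\cos tX_{nj})$, it follows that (1.17) holds if, for all $t \in \mathbb{R}$,

$$\exp\Big(\sum_{j=1}^{r_n}\big(\phi_{nj}(t) - 1\big)\Big) \to e^{-t^2/2} \quad \text{as } n \to \infty.$$

However, by (1.16),

$$E\exp(\iota t S_n) = \prod_{j=1}^{r_n}\phi_{nj}(t) \to e^{-t^2/2} \quad \text{for all } t \in \mathbb{R}.$$

Hence, it is enough to show that

$$I_{1n}(t) \equiv \exp\Big(\sum_{j=1}^{r_n}[\phi_{nj}(t) - 1]\Big) - \prod_{j=1}^{r_n}\phi_{nj}(t) \to 0 \quad \text{as } n \to \infty \text{ for all } t \in \mathbb{R}. \tag{1.18}$$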

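Note that for any $\epsilon \in (0, \infty)$, by (1.15) and the inequality $|e^{\iota x} - 1| \le \min\{2, |x|\}$ for all $x \in \mathbb{R}$, one has

$$|\phi_{nj}(t) - 1| = \big|E\exp(\iota t X_{nj}) - 1\big| \le E\min\{|tX_{nj}|, 2\} \le 2P(|X_{nj}| > \epsilon) + |t|\epsilon$$

uniformly in $j = 1, \ldots, r_n$. Hence, letting $n \to \infty$ and then $\epsilon \downarrow 0$, by (1.15), one gets

$$I_{2n}(t) \equiv \max_{1 \le j \le r_n}|\phi_{nj}(t) - 1| = o(1) \quad \text{as } n \to \infty \tag{1.19}$$

for all $t \in \mathbb{R}$. Further, by the inequality $|e^{\iota x} - 1 - \iota x| \le |x|^2/2$, $x \in \mathbb{R}$,

$$\sum_{j=1}^{r_n}|\phi_{nj}(t) - 1| = \sum_{j=1}^{r_n}\big|E\exp(\iota t X_{nj}) - 1 - E(\iota t X_{nj})\big| \le \frac{t^2}{2}\sum_{j=1}^{r_n} EX_{nj}^2 = \frac{t^2}{2}s_n^2 = \frac{t^2}{2} \tag{1.20}$$

uniformly in $t \in \mathbb{R}$, $n \ge 1$. Now fix $t \in \mathbb{R}$. Then, by (1.19), there exists $n_0 \in \mathbb{N}$ such that for all $n \ge n_0$, $\max_{1 \le j \le r_n}|\phi_{nj}(t) - 1| \le 1$. Hence, by the arguments in the proof of Lemma 11.1.3, and by the inequalities $|e^z| \le \sum_{k=0}^\infty |z|^k/k! = e^{|z|}$ and $|e^z - 1 - z| = \big|\sum_{k=2}^\infty z^k/k!\big| \le |z|^2\exp(|z|)$, $z \in \mathbb{C}$, for all $n \ge n_0$, one has

$$\begin{aligned}
|I_{1n}(t)| &= \Big|\prod_{j=1}^{r_n}\exp\big([\phi_{nj}(t) - 1]\big) - \prod_{j=1}^{r_n}\phi_{nj}(t)\Big| \\
&\le \sum_{j=1}^{r_n}\big|\exp\big([\phi_{nj}(t) - 1]\big) - \phi_{nj}(t)\big| \cdot \prod_{k=j+1}^{r_n}\big|\exp\big([\phi_{nk}(t) - 1]\big)\big| \\
&\le \sum_{j=1}^{r_n}\big|\exp\big([\phi_{nj}(t) - 1]\big) - \phi_{nj}(t)\big| \cdot \exp\Big(\sum_{j=1}^{r_n}\big|\phi_{nj}(t) - 1\big|\Big) \\
&\le \sum_{j=1}^{r_n}\big|\exp\big([\phi_{nj}(t) - 1]\big) - \phi_{nj}(t)\big| \cdot \exp\Big(\frac{t^2}{2}\Big) \\
&= \sum_{j=1}^{r_n}\big|\exp\big([\phi_{nj}(t) - 1]\big) - 1 - [\phi_{nj}(t) - 1]\big| \cdot \exp\Big(\frac{t^2}{2}\Big) \\
&\le \sum_{j=1}^{r_n}\big|\phi_{nj}(t) - 1\big|^2 \cdot \exp\Big(1 + \frac{t^2}{2}\Big) \\
&\le \Big(\max_{1 \le j \le r_n}|\phi_{nj}(t) - 1|\Big)\sum_{j=1}^{r_n}|\phi_{nj}(t) - 1|\,\exp\Big(1 + \frac{t^2}{2}\Big) \\
&\le I_{2n}(t) \cdot t^2 \cdot \exp\Big(1 + \frac{t^2}{2}\Big)\Big/2 \to 0 \quad \text{as } n \to \infty,
\end{aligned}$$

by (1.19) and (1.20). Hence, (1.18) holds. This completes the proof of the theorem. □

The following example is an application of the Lindeberg CLT for proving asymptotic normality of the least squares estimator of a regression parameter.

Example 11.1.2: Let

$$Y_j = x_j\beta + \epsilon_j, \quad j = 1, 2, \ldots \tag{1.21}$$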

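be a simple linear regression model, where $\{x_n\}_{n \ge 1}$ is a given sequence of real numbers, $\beta \in \mathbb{R}$ is the regression parameter and $\{\epsilon_n\}_{n \ge 1}$ is a sequence of iid random variables with $E\epsilon_1 = 0$ and $E\epsilon_1^2 \equiv \sigma^2 \in (0, \infty)$. The least squares estimator of $\beta$ based on $Y_1, \ldots, Y_n$ is given by
$$\hat\beta_n = \sum_{j=1}^{n} x_j Y_j / a_n^2, \quad n \ge 1,$$
where $a_n^2 = \sum_{j=1}^{n} x_j^2$. Suppose that the sequence $\{x_n\}_{n \ge 1}$ satisfies
$$\max_{1 \le j \le n} \{|x_j|/a_n\} \to 0 \quad \text{as } n \to \infty. \tag{1.22}$$
Then,
$$a_n(\hat\beta_n - \beta) \longrightarrow^d N(0, \sigma^2). \tag{1.23}$$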

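To prove (1.23), note that by (1.21),
$$a_n(\hat\beta_n - \beta) = a_n \Big(\sum_{j=1}^{n} x_j Y_j - a_n^2 \beta\Big) \frac{1}{a_n^2} = a_n^{-1} \sum_{j=1}^{n} x_j \epsilon_j \equiv \sum_{j=1}^{n} X_{nj}, \text{ say,} \tag{1.24}$$
where $X_{nj} = x_j \epsilon_j / a_n$, $1 \le j \le n$, $n \ge 1$. Note that $EX_{nj} = 0$, $EX_{nj}^2 < \infty$ and $s_n^2 \equiv \sum_{j=1}^{n} EX_{nj}^2 = \sum_{j=1}^{n} x_j^2 E\epsilon_j^2 / a_n^2 = \sigma^2$. Thus, $\{X_{nj} : 1 \le j \le n\}_{n \ge 1}$ is a triangular array of independent random variables satisfying (1.2). Next, let $m_n = \max\{|x_j|/a_n : 1 \le j \le n\}$, $n \ge 1$. Then, by (1.22), for any $\delta \in (0, \infty)$,
$$\begin{aligned}
s_n^{-2} \sum_{j=1}^{n} EX_{nj}^2 I\big(|X_{nj}| > \delta s_n\big)
&= \sigma^{-2} a_n^{-2} \sum_{j=1}^{n} x_j^2 E\epsilon_j^2 I\big(|x_j \epsilon_j / a_n| > \delta\sigma\big) \\
&\le \sigma^{-2} a_n^{-2} \sum_{j=1}^{n} x_j^2 \cdot E\epsilon_1^2 I\big(m_n \cdot |\epsilon_1| > \delta\sigma\big) \\
&= \sigma^{-2} E\epsilon_1^2 I\big(|\epsilon_1| > \delta\sigma \cdot m_n^{-1}\big) \\
&\to 0 \quad \text{as } n \to \infty \text{ by the DCT.}
\end{aligned}$$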

Thus, $\{X_{nj} : 1 \le j \le n\}_{n \ge 1}$ satisfies the Lindeberg condition (1.3) and hence, by Theorem 11.1.1,
$$\frac{\sum_{j=1}^{n} X_{nj}}{\sigma} \longrightarrow^d N(0, 1),$$
which, in view of (1.24), implies (1.23).

The next result gives a multivariate generalization of Theorem 11.1.1.

Theorem 11.1.6: (A multivariate version of the Lindeberg CLT). For each $n \ge 1$, let $\{X_{nj} : 1 \le j \le r_n\}$ be a collection of independent $k$-dimensional random vectors satisfying
$$EX_{nj} = 0, \ 1 \le j \le r_n, \quad \text{and} \quad \sum_{j=1}^{r_n} EX_{nj} X_{nj}' = I_k,$$
where $I_k$ denotes the identity matrix of order $k$ and for any vector $x$, $x'$ denotes its transpose. Suppose that for every $\epsilon \in (0, \infty)$,
$$\lim_{n \to \infty} \sum_{j=1}^{r_n} E\|X_{nj}\|^2 I(\|X_{nj}\| > \epsilon) = 0.$$
Then,
$$\sum_{j=1}^{r_n} X_{nj} \longrightarrow^d N(0, I_k).$$
The proof is a consequence of Theorem 11.1.1 and the Cramer-Wold device (cf. Theorem 10.4.5) and is left as an exercise (Problem 11.17).
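The conclusion (1.23) of Example 11.1.2 can be checked by simulation. The following is a minimal sketch, not part of the original text: the design $x_j = \sqrt{j}$ (for which $\max_j |x_j|/a_n = \sqrt{2/(n+1)} \to 0$, so (1.22) holds), the uniform error distribution with $\sigma^2 = 1$, and the function name `simulate_scaled_lse_errors` are all illustrative assumptions.

```python
import math
import random
import statistics


def simulate_scaled_lse_errors(n=400, beta=2.0, reps=2000, seed=0):
    """Return `reps` draws of a_n * (beta_hat_n - beta) for Y_j = x_j*beta + eps_j."""
    rng = random.Random(seed)
    # Assumed design: x_j = sqrt(j), which satisfies condition (1.22).
    x = [math.sqrt(j) for j in range(1, n + 1)]
    a_n = math.sqrt(sum(xj * xj for xj in x))
    half_width = math.sqrt(3.0)  # Uniform(-sqrt(3), sqrt(3)) has mean 0, variance 1
    draws = []
    for _ in range(reps):
        eps = [rng.uniform(-half_width, half_width) for _ in range(n)]
        y = [xj * beta + ej for xj, ej in zip(x, eps)]
        # Least squares estimator: sum_j x_j Y_j / a_n^2
        beta_hat = sum(xj * yj for xj, yj in zip(x, y)) / a_n ** 2
        draws.append(a_n * (beta_hat - beta))
    return draws


draws = simulate_scaled_lse_errors()
sample_mean = statistics.fmean(draws)
sample_var = statistics.pvariance(draws)
# Under N(0, 1), about 95% of the draws should fall in [-1.96, 1.96].
coverage = sum(abs(d) <= 1.96 for d in draws) / len(draws)
print(f"mean={sample_mean:.3f}  variance={sample_var:.3f}  coverage={coverage:.3f}")
```

With non-normal errors, the empirical mean, variance and central coverage of the scaled estimator should still be close to those of $N(0, \sigma^2)$ with $\sigma^2 = 1$.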

11.2 Stable distributions

If $\{X_n\}_{n \ge 1}$ is a sequence of iid $N(\mu, \sigma^2)$ random variables, then for each $k \ge 1$, $S_k \equiv \sum_{i=1}^{k} X_i$ has a $N(k\mu, k\sigma^2)$ distribution. Similarly, if $\{X_n\}_{n \ge 1}$ is a sequence of iid Cauchy $(\mu, \sigma)$ random variables, then for each $k \ge 1$, $S_k \equiv \sum_{i=1}^{k} X_i$ has a Cauchy $(k\mu, k\sigma)$ distribution. Thus, in both cases, for each $k \ge 1$, there exist constants $a_k$ and $b_k$ such that $S_k$ has the same distribution as $a_k X_1 + b_k$ (Problem 11.21).

Definition 11.2.1: A nondegenerate random variable $X$ is called stable if the above property holds, i.e., for each $k \in \mathbb{N}$, there exist constants $a_k$ and $b_k$ such that
$$S_k =^d a_k X_1 + b_k, \tag{2.1}$$
where $X_1, X_2, \ldots$ are iid random variables with the same distribution as $X$, and $S_k = \sum_{i=1}^{k} X_i$. In this case, the distribution $F_X$ of $X$ is called a stable distribution.

There are two characterizations of stable distributions.

Theorem 11.2.1: A nondegenerate distribution $F$ is stable iff there exist a sequence of iid random variables $\{Y_n\}_{n \ge 1}$ and constants $\{a_n\}_{n \ge 1}$ and $\{b_n\}_{n \ge 1}$ such that $\big(\sum_{i=1}^{n} Y_i - b_n\big)/a_n$ converges in distribution to $F$.

Theorem 11.2.2: A nondegenerate distribution $F$ is stable iff its characteristic function $\phi(t)$ admits the representation
$$\phi(t) = \exp\big(\iota t c - b|t|^{\alpha}\big(1 + \iota\lambda\,\mathrm{sgn}(t)\,\omega_\alpha(t)\big)\big) \tag{2.2}$$
where $\iota = \sqrt{-1}$, $-1 \le \lambda \le 1$, $0 < \alpha \le 2$, $0 \le b < \infty$, and the functions $\omega_\alpha(t)$ and $\mathrm{sgn}(\cdot)$ are defined as
$$\omega_\alpha(t) = \begin{cases} \tan\dfrac{\pi\alpha}{2} & \text{if } \alpha \ne 1 \\[4pt] \dfrac{2}{\pi}\log|t| & \text{if } \alpha = 1 \end{cases} \tag{2.3}$$
and
$$\mathrm{sgn}(t) = \begin{cases} 1 & \text{if } t > 0 \\ -1 & \text{if } t < 0 \\ 0 & \text{if } t = 0. \end{cases}$$
$$\ldots \quad f(x) = \frac{1}{\sqrt{2\pi}\, x^{3/2}} \exp\Big(-\frac{1}{2x}\Big), \quad x > 0. \tag{2.4}$$


For an explicit expression for the density of $F$ in other cases, see Feller (1966), Section 17.6.

Definition 11.2.2: The parameter $\alpha$ in (2.2) is called the index of the stable distribution.

Remark 11.2.4: The parameter $\lambda$ is related to the behavior of the ratio of the right tail of the distribution to the left tail through the relation
$$\lim_{x \to \infty} \frac{1 - F(x)}{F(-x)} = \frac{1 + \lambda}{1 - \lambda}, \tag{2.5}$$
where for $\lambda = 1$, the ratio on the right side of (2.5) is defined to be $+\infty$.

Definition 11.2.3: A function $L : (0, \infty) \to (0, \infty)$ is called slowly varying at $\infty$ if
$$\lim_{x \to \infty} \frac{L(cx)}{L(x)} = 1 \quad \text{for all } 0 < c < \infty. \tag{2.6}$$
A function $f : (0, \infty) \to (0, \infty)$ is called regularly varying at $\infty$ with index $\alpha \in \mathbb{R}$, $\alpha \ne 0$, if $f(x) = x^{\alpha} L(x)$ for all $x \in (0, \infty)$, where $L(\cdot)$ is slowly varying at $\infty$.

The functions $L_1(t) = \log t$, $L_2(t) = \log(\log t)$, $L_3(t) = (\log t)^2$ are slowly varying at $\infty$ but the function $L_4(t) = t^p$ is not so for $p \ne 0$.

There is a companion result to Theorem 11.2.1 giving necessary and sufficient conditions for convergence of normalized sums of iid random variables to a stable distribution.

Theorem 11.2.3: Let $F$ be a nondegenerate stable distribution with index $\alpha$, $0 < \alpha < 2$. Then in order that a sequence $\{Y_n\}_{n \ge 1}$ of iid random variables admits a sequence of constants $\{a_n\}_{n \ge 1}$ and $\{b_n\}_{n \ge 1}$ such that
$$\frac{\sum_{i=1}^{n} Y_i - b_n}{a_n} \longrightarrow^d F, \tag{2.7}$$
it is necessary and sufficient that
$$\lim_{x \to \infty} \frac{P(Y_1 > x)}{P(|Y_1| > x)} \equiv \theta \in [0, 1] \text{ exists} \tag{2.8}$$
and
$$P(|Y_1| > x) = x^{-\alpha} L(x), \tag{2.9}$$
where $L(\cdot)$ is a slowly varying function at $\infty$. If (2.8) and (2.9) hold, then the normalizing constants $\{a_n\}_{n \ge 1}$ and $\{b_n\}_{n \ge 1}$ may be chosen to satisfy
$$n a_n^{-\alpha} L(a_n) \to 1 \quad \text{and} \quad b_n = nEY_1 I(|Y_1| \le a_n). \tag{2.10}$$
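The contrast drawn in Definition 11.2.3 between $\log t$ and $t^p$ can be seen numerically. The sketch below is illustrative only; the choices $c = 10$ and $p = 1/2$ and the helper name `scaling_ratio` are assumptions made for this example. It evaluates the ratio $L(cx)/L(x)$ from (2.6) along an increasing sequence of $x$ values.

```python
import math


def scaling_ratio(func, c, x):
    """Return func(c*x) / func(x), the ratio whose limit appears in (2.6)."""
    return func(c * x) / func(x)


def power_half(t):
    """L4(t) = t**p with the illustrative choice p = 1/2."""
    return t ** 0.5


c = 10.0
xs = [1e2, 1e4, 1e8, 1e16, 1e100]
log_ratios = [scaling_ratio(math.log, c, x) for x in xs]
power_ratios = [scaling_ratio(power_half, c, x) for x in xs]

for x, r_log, r_pow in zip(xs, log_ratios, power_ratios):
    print(f"x={x:.0e}  log: {r_log:.4f}  power: {r_pow:.4f}")
```

The ratio for $\log t$ equals $1 + \log c/\log x$ and drifts toward 1, while the ratio for $t^{1/2}$ stays at $\sqrt{c}$ for every $x$, so it cannot converge to 1.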


Remark 11.2.5: The analog of Theorem 11.2.3 for the case $\alpha = 2$, i.e., for the normal distribution, is the following.

Theorem 11.2.4: Let $\{Y_n\}_{n \ge 1}$ be iid random variables. In order that there exist constants $\{a_n\}_{n \ge 1}$ and $\{b_n\}_{n \ge 1}$ such that
$$\frac{\sum_{i=1}^{n} Y_i - b_n}{a_n} \longrightarrow^d N(0, 1), \tag{2.11}$$
it is necessary and sufficient that
$$\frac{x^2 P(|Y_1| > x)}{EY_1^2 I(|Y_1| \le x)} \to 0 \quad \text{as } x \to \infty. \tag{2.12}$$

Remark 11.2.6: Note that condition (2.12) holds if $EY_1^2 < \infty$. However, if $P(|Y_1| > x) \sim C/x^2$ as $x \to \infty$, then $EY_1^2 = \infty$ and the classical CLT (cf. Corollary 11.1.2) fails. However, in this case, (2.12) holds and $\sum_{i=1}^{n} Y_i$ is asymptotically normal with a suitable centering and scaling (different from $\sqrt{n}$) (Problem 11.20).

Here, only the proof of Theorem 11.2.1 will be given. Further, a proof of the sufficiency part of Theorem 11.2.3 is also outlined. For the rest, see Feller (1966) or Gnedenko and Kolmogorov (1968).

For proving Theorem 11.2.1, the following result is needed.

Theorem 11.2.5: (Khinchine's theorem on convergence of types). Let $\{W_n\}_{n \ge 1}$ be a sequence of random variables such that for some sequences $\{\alpha_n\}_{n \ge 1} \subset [0, \infty)$ and $\{\beta_n\}_{n \ge 1} \subset \mathbb{R}$, both $W_n$ and $\alpha_n W_n + \beta_n$ converge in distribution to nondegenerate distributions $G$ and $H$ on $\mathbb{R}$, respectively. Then $\lim_{n \to \infty} \alpha_n = \alpha$ and $\lim_{n \to \infty} \beta_n = \beta$ exist with $0 < \alpha < \infty$ and $-\infty < \beta < \infty$.

Proof: Let $\{W_n'\}_{n \ge 1}$ be a sequence of random variables such that for each $n \ge 1$, $W_n$ and $W_n'$ have the same distribution and $W_n$ and $W_n'$ are independent. Then $Y_n \equiv W_n - W_n'$ and $Z_n \equiv \alpha_n(W_n - W_n')$ both converge in distribution to nondegenerate limits, say $\tilde G$ and $\tilde H$. Indeed, $\tilde G = G * G^{-}$ and $\tilde H = H * H^{-}$, where $*$ denotes convolution and $G^{-}$, $H^{-}$ denote the distributions of $-W$ and $-V$ for $W \sim G$ and $V \sim H$. This implies that $\{\alpha_n\}_{n \ge 1}$ cannot have 0 or $\infty$ as limit points. Also, if $0 < \alpha \le \alpha' < \infty$ are two limit points of $\{\alpha_n\}_{n \ge 1}$, then $\tilde H(x) = \tilde G(x/\alpha) = \tilde G(x/\alpha')$ for all $x$. Since $\tilde G(\cdot)$ is nondegenerate, $\alpha$ must equal $\alpha'$ and so $\lim_{n \to \infty} \alpha_n$ exists in $(0, \infty)$. This implies that $\lim_{n \to \infty} \beta_n$ exists in $\mathbb{R}$. □

Proof of Theorem 11.2.1: The 'only if' part follows from the definition of $F$ being stable, since one can take $\{Y_n\}_{n \ge 1}$ to be iid with distribution $F$.


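For the 'if' part, let $\{Y_n\}_{n \ge 1}$ be iid random variables such that there exist constants $\{a_n\}_{n \ge 1}$ and $\{b_n\}_{n \ge 1}$ such that as $n \to \infty$,
$$\frac{\sum_{i=1}^{n} Y_i - b_n}{a_n} \longrightarrow^d F.$$
To show that $F$ is stable, fix an integer $r \ge 1$. Let $\{X_n\}_{n \ge 1}$ be iid random variables with distribution $F$. Then as $k \to \infty$,
$$\frac{\sum_{i=1}^{kr} Y_i - b_{kr}}{a_{kr}} \longrightarrow^d X_1.$$
Also, the left side above equals
$$\sum_{j=0}^{r-1} \Big(\frac{\sum_{i=jk+1}^{(j+1)k} Y_i - b_k}{a_k}\Big) \frac{a_k}{a_{kr}} + \frac{r b_k - b_{kr}}{a_{kr}} = \alpha_{kr} \sum_{j=0}^{r-1} \eta_{jk} + \beta_{kr}, \text{ say,}$$
where $\alpha_{kr} = a_k/a_{kr}$, $\eta_{jk} = \big(\sum_{i=jk+1}^{(j+1)k} Y_i - b_k\big)/a_k$ and $\beta_{kr} = (r b_k - b_{kr})/a_{kr}$. Since $\{\eta_{jk} : j = 0, 1, \ldots, r - 1\}$ are independent and for each $j$, $\eta_{jk} \longrightarrow^d X_{j+1}$ as $k \to \infty$, it follows that as $k \to \infty$,
$$W_k = \sum_{j=0}^{r-1} \eta_{jk} \longrightarrow^d \sum_{j=0}^{r-1} X_{j+1} = \sum_{j=1}^{r} X_j.$$
Also, as $k \to \infty$,
$$\alpha_{kr} W_k + \beta_{kr} \longrightarrow^d X_1.$$
Since $F$ is nondegenerate, both $X_1$ and $\sum_{j=1}^{r} X_j$ are nondegenerate random variables. Thus, as $k \to \infty$, $W_k$ and $\alpha_{kr} W_k + \beta_{kr}$ converge in distribution to nondegenerate random variables. Thus, by Khinchine's theorem on convergence of types proved above, it follows that $\alpha_{kr} \to \alpha_r$ and $\beta_{kr} \to \beta_r$, with $0 < \alpha_r < \infty$ and $-\infty < \beta_r < \infty$. This yields that for each $r \in \mathbb{N}$, $\sum_{j=1}^{r} X_j$ has the same distribution as $\frac{1}{\alpha_r}(X_1 - \beta_r)$, i.e., $X_1$ is stable. □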

Proof of the sufficiency part of Theorem 11.2.3: (Outline). The proof is based on the continuity theorem. The characteristic function of $T_n \equiv (S_n - b_n)/a_n$ is
$$\phi_n(t) = E\big(e^{\iota t (S_n - b_n)/a_n}\big) = \Big[\phi\Big(\frac{t}{a_n}\Big) e^{-\iota t b_n'/a_n}\Big]^n \equiv \Big(1 + \frac{h_n(t)}{n}\Big)^n,$$
where $b_n' = b_n/n$ and $h_n(t) = n\big(\phi(t/a_n) e^{-\iota t b_n'/a_n} - 1\big)$. Let $G(\cdot)$ be the cdf of $Y_1$. Then
$$h_n(t) = n \int \big(e^{\iota t (y - b_n')/a_n} - 1\big)\, dG(y) = \int \big(e^{\iota t x} - 1\big)\, \mu_n(dx),$$
where $\mu_n(A) = nP(Y_1 \in b_n' + a_n A)$, $A \in \mathcal{B}(\mathbb{R})$. If $A = (u, v]$, $0 < u < v < \infty$, then
$$nP(Y_1 \in b_n' + a_n A) = nP(a_n u + b_n' < Y_1 \le a_n v + b_n') = nP(Y_1 > a_n u + b_n') - nP(Y_1 > a_n v + b_n').$$
By (2.8)–(2.10),
$$nP(Y_1 > a_n x) = \frac{P(Y_1 > a_n x)}{P(Y_1 > a_n)}\, nP(Y_1 > a_n) \to \theta x^{-\alpha} \quad \text{for } x > 0.$$
By using (2.10), it can be shown that
$$\frac{b_n'}{a_n} = \frac{EY_1 I(|Y_1| \le a_n)}{a_n} = O\Big(\frac{a_n^{1-\alpha} L(a_n)}{a_n}\Big) = o(1) \quad \text{as } n \to \infty.$$
Hence, it follows that
$$nP(Y_1 > a_n u + b_n') - nP(Y_1 > a_n v + b_n') \to \theta(u^{-\alpha} - v^{-\alpha}).$$
Similarly, for $A = (-v, -u]$,
$$nP(Y_1 \in b_n' + a_n A) \to (1 - \theta)(u^{-\alpha} - v^{-\alpha}).$$
This suggests that $h_n(t)$ should approach
$$\theta\alpha \int_0^{\infty} (e^{\iota t x} - 1) x^{-(\alpha+1)}\, dx + (1 - \theta)\alpha \int_{-\infty}^{0} (e^{\iota t x} - 1) |x|^{-(\alpha+1)}\, dx.$$
But there are integrability problems for $|x|^{-(\alpha+1)}$ near 0 and so a more careful analysis is needed. It can be shown that
$$\lim_{n \to \infty} h_n(t) = \iota t c + \theta\alpha \int_0^{\infty} \Big(e^{\iota t x} - 1 - \frac{\iota t x}{1 + x^2}\Big) x^{-(\alpha+1)}\, dx + (1 - \theta)\alpha \int_{-\infty}^{0} \Big(e^{\iota t x} - 1 - \frac{\iota t x}{1 + x^2}\Big) |x|^{-(\alpha+1)}\, dx,$$
where $c$ is a constant. The right side is continuous at $t = 0$ and so, the result follows by the continuity theorem. For details, see Feller (1966). □

Remark 11.2.7: By the necessity part of Theorem 11.2.3, every stable distribution $F$ must satisfy
$$1 - F(x) = \theta x^{-\alpha} L(x), \qquad F(-x) = (1 - \theta) x^{-\alpha} L(x)$$


for large $x$, where $0 \le \theta \le 1$, $L(\cdot)$ is slowly varying at $\infty$, and $0 < \alpha < 2$. This implies that $F$ has finite moments of order $p$ for every $p < \alpha$. Distributions satisfying the above tail condition are called heavy tailed and arise in many applications. The Pareto distribution in economics is an example of a heavy tailed distribution.

Remark 11.2.8: One way to generate heavy tailed distributions is as follows. If $Y$ is a positive random variable such that there exist $0 < c < \infty$ and $0 < p < \infty$ satisfying
$$P(Y < y) \sim c y^{p} \quad \text{as } y \downarrow 0,$$
then the random variable $X = Y^{-q}$ has the property
$$P(X > x) = P(Y < x^{-1/q}) \sim c x^{-p/q} \quad \text{as } x \to \infty.$$
If $p < 2q$, then $X$ has heavy tails. Thus if $\{Y_n\}_{n \ge 1}$ are iid Gamma(1,2), then $n^{-1} \sum_{i=1}^{n} Y_i^{-1}$ converges in distribution to a one sided Cauchy distribution (Problem 11.15).

Definition 11.2.4: Let $F$ and $G$ be two probability distributions on $\mathbb{R}$. Then $G$ is said to belong to the domain of attraction of $F$ if there exist a sequence of iid random variables $\{Y_n\}_{n \ge 1}$ with distribution $G$ and constants $\{a_n\}_{n \ge 1}$ and $\{b_n\}_{n \ge 1}$ such that
$$\frac{\sum_{i=1}^{n} Y_i - b_n}{a_n} \longrightarrow^d F.$$
Theorem 11.2.1 says that the only nondegenerate distributions $F$ that admit a nonempty domain of attraction are the stable distributions.
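The construction in Remark 11.2.8 can be tried out directly. The following sketch is an illustration under assumed choices, not taken from the text: $Y$ is Uniform$(0, 1]$, so that $P(Y < y) = y$ (that is, $c = 1$ and $p = 1$), and $q = 1$, in which case $P(X > x) = x^{-p/q} = 1/x$ holds exactly for $x \ge 1$. The function name `empirical_tail` is hypothetical.

```python
import random


def empirical_tail(samples, x):
    """Fraction of samples strictly greater than x."""
    return sum(s > x for s in samples) / len(samples)


rng = random.Random(1)
p, q = 1.0, 1.0           # assumed exponents; p < 2q, so X is heavy tailed
n_draws = 200_000
# 1 - rng.random() lies in (0, 1], which avoids division by zero below.
ys = [1.0 - rng.random() for _ in range(n_draws)]
xs = [y ** (-q) for y in ys]

thresholds = [2.0, 10.0, 50.0]
tails = {x: empirical_tail(xs, x) for x in thresholds}
for x in thresholds:
    print(f"x={x:5.1f}  empirical P(X > x)={tails[x]:.4f}  x^(-p/q)={x ** (-p / q):.4f}")
```

The empirical tail probabilities should track $x^{-p/q}$ closely, which is the polynomial tail decay that characterizes heavy tailed distributions.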

11.3 Infinitely divisible distributions

Definition 11.3.1: A random variable $X$ (and its distribution) is called infinitely divisible if for each integer $k \in \mathbb{N}$, there exist iid random variables $X_{k1}, X_{k2}, \ldots, X_{kk}$ such that $\sum_{j=1}^{k} X_{kj}$ has the same distribution as $X$.

Examples include constants (degenerate distributions), normal, Poisson, Cauchy, and Gamma distributions. But distributions with bounded support cannot be infinitely divisible unless they are degenerate. In fact, if $X$ is infinitely divisible satisfying $P(|X| \le M) = 1$ for some $M < \infty$, then the $X_{ki}$'s in the above definition must satisfy $P(|X_{k1}| \le M/k) = 1$ and so $\mathrm{Var}(X_{k1}) \le EX_{k1}^2 \le M^2/k^2$, implying $\mathrm{Var}(X) = k\,\mathrm{Var}(X_{k1}) \le M^2/k$ for each $k \ge 1$. Hence $\mathrm{Var}(X)$ must be zero, and the random variable $X$ is a constant w.p. 1.

The following results are easy to establish.


Theorem 11.3.1: (a) If $X$ and $Y$ are independent and infinitely divisible, then $X + Y$ is also infinitely divisible. (b) If $X_n$ is infinitely divisible for each $n \in \mathbb{N}$ and $X_n \longrightarrow^d X$, then $X$ is infinitely divisible.

Proof: (a) Follows from the definition. (b) For each $k \ge 1$ and $n \ge 1$, there exist iid random variables $X_{nk1}, X_{nk2}, \ldots, X_{nkk}$ such that $X_n$ and $\sum_{j=1}^{k} X_{nkj}$ have the same distribution. Now fix $k \ge 1$. Then for any $y > 0$,
$$\big[P(X_{nk1} > y)\big]^k = P(X_{nkj} > y \text{ for all } j = 1, \ldots, k) \le P(X_n > ky)$$
and similarly,
$$\big[P(X_{nk1} < -y)\big]^k \le P(X_n \le -ky).$$
Since $X_n \longrightarrow^d X$, the distributions of $\{X_n\}_{n \ge 1}$ are tight and so are those of $\{X_{nk1}\}_{n=1}^{\infty}$. So if $F_k$ is a weak limit point of $\{X_{nk1}\}_{n=1}^{\infty}$ and if $\{Y_{kj}\}_{j=1}^{k}$ are iid with distribution $F_k$, then $X$ and $\sum_{j=1}^{k} Y_{kj}$ have the same distribution and so $X$ is infinitely divisible. □

A large class of infinitely divisible distributions is generated by the compound Poisson family.

Definition 11.3.2: Let $\{Y_n\}_{n \ge 1}$ be iid random variables and let $N$ be a Poisson $(\lambda)$ random variable, independent of $\{Y_n\}_{n \ge 1}$. The random variable $X \equiv \sum_{i=1}^{N} Y_i$ is said to have a compound Poisson distribution.

Theorem 11.3.2: A compound Poisson distribution is infinitely divisible.

Proof: Let $X$ be a random variable as in Definition 11.3.2. For each $k \ge 1$, let $\{N_i\}_{i=1}^{k}$ be iid Poisson random variables with mean $\lambda/k$ that are independent of $\{Y_n\}_{n \ge 1}$. Let
$$X_{kj} = \sum_{i=T_j+1}^{T_{j+1}} Y_i, \quad 1 \le j \le k,$$
where $T_1 = 0$, $T_j = \sum_{i=1}^{j-1} N_i$, $2 \le j \le k + 1$. Then $\{X_{kj}\}_{j=1}^{k}$ are iid and $\sum_{j=1}^{k} X_{kj}$ and $X$ are identically distributed and so $X$ is infinitely divisible. □

Although the converse to the above is not valid, it is known that every infinitely divisible distribution is the limit of a sequence of centered and scaled compound Poisson distributions. This is a consequence of a deep result, stated below, giving an explicit formula for the characteristic function of an infinitely divisible distribution. For a proof, see Feller (1966), Chung (1974), or Gnedenko and Kolmogorov (1968).

Theorem 11.3.3: (Lévy-Khinchine representation theorem). Let $X$ be an infinitely divisible random variable. Then its characteristic function $\phi(t) \equiv$


$E(e^{\iota t X})$ is of the form
$$\phi(t) = \exp\Big(\iota t c - \beta \frac{t^2}{2} + \int_{\mathbb{R}} \Big(e^{\iota t x} - 1 - \frac{\iota t x}{1 + x^2}\Big)\, \mu(dx)\Big),$$
where $c \in \mathbb{R}$, $\beta \ge 0$ and $\mu$ is a measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such that $\mu(\{0\}) = 0$, $\int_{|x| \le 1} x^2\, \mu(dx) < \infty$ and $\mu(\{x : |x| > 1\}) < \infty$.

Corollary 11.3.4: Stable distributions are infinitely divisible.

Proof: The normal distribution corresponds to the case $\mu(\cdot) \equiv 0$ and $\beta > 0$. For nonnormal stable laws with index $\alpha < 2$, set $\beta = 0$ and $\mu(dx) = \theta x^{-(\alpha+1)}\, dx$ for $x > 0$ and $(1 - \theta)|x|^{-(\alpha+1)}\, dx$ for $x < 0$. □

Corollary 11.3.5: Every infinitely divisible distribution is the limit of centered and scaled compound Poisson distributions.

Proof: Since the normal distribution can be obtained as a (weak) limit of centered and scaled Poisson distributions, it is enough to consider the case when $\beta = 0$, $c = 0$. Let $\mu_n(A) = \mu(A \cap \{x : |x| > n^{-1}\})$, $A \in \mathcal{B}(\mathbb{R})$, and let
$$\phi_n(t) = \exp\Big(\int \Big(e^{\iota t x} - 1 - \frac{\iota t x}{1 + x^2}\Big)\, \mu_n(dx)\Big) = \exp\Big(\lambda_n \int (e^{\iota t x} - 1)\, \tilde\mu_n(dx) - \iota t c_n\Big),$$
where $\tilde\mu_n(A) = \mu_n(A)/\mu_n(\mathbb{R})$, $A \in \mathcal{B}(\mathbb{R})$, $\lambda_n = \mu_n(\mathbb{R})$, and $c_n = \int \frac{x}{1 + x^2}\, \mu_n(dx)$. Thus, $\phi_n(\cdot)$ is a compound Poisson characteristic function centered at $c_n$, with Poisson parameter $\lambda_n$ and with the compounding distribution $\tilde\mu_n$. By the DCT, $\phi_n(t) \to \phi(t)$ for each $t \in \mathbb{R}$. Hence by the Lévy-Cramér continuity theorem, the result follows. □

Another characterization of infinitely divisible distributions is similar to that of stable distributions. Recall that a stable distribution is one that is the limit of normalized sums of iid random variables and conversely.

Theorem 11.3.6: A random variable $X$ is infinitely divisible iff it is the limit in distribution of a sequence $\{X_n\}_{n \ge 1}$ where for each $n$, $X_n$ is the sum of $n$ iid random variables $\{X_{nj}\}_{j=1}^{n}$. Thus $X$ is infinitely divisible iff it is the limit in distribution of the row sums of a triangular array of random variables where in each row, all the random variables are iid.


Proof: The 'only if' part follows from the definition. For the 'if' part, fix $k \ge 1$. Then $X_{k \cdot n}$ can be written as
$$X_{k \cdot n} = \sum_{j=1}^{k} Y_{jn},$$
where $Y_{jn} = \sum_{r=(j-1)n+1}^{jn} X_{k \cdot n, r}$, $j = 1, 2, \ldots, k$. By hypothesis, $X_{k \cdot n} \longrightarrow^d X$. Now, $\{Y_{jn}\}_{j=1}^{k}$ are iid and it can be shown, as in the proof of Theorem 11.3.1, that for each $i = 1, \ldots, k$, $\{Y_{in}\}_{n=1}^{\infty}$ are tight and hence converge in distribution to a limit $Y_i$ through a subsequence, and that $X$ and $\sum_{i=1}^{k} Y_i$ have the same distribution. Thus, $X$ is infinitely divisible. □
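The construction in the proof of Theorem 11.3.2 can be illustrated by simulation. The sketch below rests on assumptions made only for this example: the jumps $Y_i$ are Exp(1), $\lambda = 3$, $k = 4$, and the helper names `poisson_draw` and `compound_poisson` are hypothetical. It compares a compound Poisson variable with rate $\lambda$ against the sum of $k$ independent compound Poisson variables with rate $\lambda/k$; both should have mean $\lambda EY_1 = 3$ and variance $\lambda EY_1^2 = 6$.

```python
import math
import random
import statistics


def poisson_draw(rng, lam):
    """Draw a Poisson(lam) variate by multiplying uniforms (suitable for small lam)."""
    threshold = math.exp(-lam)
    count, prod = 0, rng.random()
    while prod > threshold:
        count += 1
        prod *= rng.random()
    return count


def compound_poisson(rng, lam):
    """Draw sum_{i=1}^{N} Y_i with N ~ Poisson(lam) and Y_i ~ Exp(1) (assumed jump law)."""
    return sum(rng.expovariate(1.0) for _ in range(poisson_draw(rng, lam)))


rng = random.Random(7)
lam, k, reps = 3.0, 4, 20_000
direct = [compound_poisson(rng, lam) for _ in range(reps)]
# Sum of k iid pieces, each compound Poisson with rate lam / k.
split = [sum(compound_poisson(rng, lam / k) for _ in range(k)) for _ in range(reps)]

direct_stats = (statistics.fmean(direct), statistics.pvariance(direct))
split_stats = (statistics.fmean(split), statistics.pvariance(split))
print("direct:", direct_stats)
print("split :", split_stats)
```

Agreement of the first two moments is of course only a consistency check, not a proof of equality in distribution; the proof above gives the exact statement.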

11.4 Refinements and extensions of the CLT

This section is devoted to studying some refinements and generalizations of the basic CLT results, such as the rate of convergence in the CLT, Edgeworth expansions and large deviations for sums of iid random variables, and also a generalization of the basic CLT to a functional version.

11.4.1 The Berry-Esseen theorem

Let $X_1, X_2, \ldots$ be a sequence of iid random variables with $EX_1 = \mu$ and $\mathrm{Var}(X_1) = \sigma^2 \in (0, \infty)$. Then, Corollary 11.1.2 and Polya's theorem imply that
$$\Delta_n \equiv \sup_{x \in \mathbb{R}} \Big|P\Big(\frac{S_n - n\mu}{\sigma\sqrt{n}} \le x\Big) - \Phi(x)\Big| \to 0 \quad \text{as } n \to \infty, \tag{4.1}$$
where $S_n = X_1 + \cdots + X_n$, $n \ge 1$, and $\Phi(\cdot)$ is the cdf of the $N(0, 1)$ distribution. A natural question that arises in this context is "how fast does $\Delta_n$ go to zero?" Berry (1941) and Esseen (1942) independently proved that $\Delta_n = O(n^{-1/2})$ as $n \to \infty$, provided $E|X_1|^3 < \infty$. This result is referred to as the Berry-Esseen theorem.

Theorem 11.4.1: (The Berry-Esseen theorem). Let $\{X_n\}_{n \ge 1}$ be a sequence of iid random variables with $EX_1 = \mu$, $\mathrm{Var}(X_1) = \sigma^2 \in (0, \infty)$ and $E|X_1|^3 < \infty$. Then, for all $n \ge 1$,
$$\Delta_n \equiv \sup_{x \in \mathbb{R}} \Big|P\Big(\frac{S_n - n\mu}{\sigma\sqrt{n}} \le x\Big) - \Phi(x)\Big| \le C \cdot \frac{E|X_1 - \mu|^3}{\sigma^3\sqrt{n}} \tag{4.2}$$
where $C \in (0, \infty)$ is a constant.

The value of the constant $C \in (0, \infty)$ does not depend on $n$ or on any characteristics of the distribution of $X_1$. Indeed, the proof of Theorem 11.4.1 below shows that
$$C \le \sqrt{\frac{2}{\pi}} \cdot \Big[\frac{5}{2} + \frac{12}{\pi}\Big] < 5.05.$$


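The following result plays an important role in the proof of Theorem 11.4.1.

Lemma 11.4.2: (A smoothing inequality). Let $F$ be a cdf on $\mathbb{R}$ with $\int x\, dF(x) = 0$ and characteristic function $\zeta(t) = \int \exp(\iota t x)\, dF(x)$, $t \in \mathbb{R}$. Let $G : \mathbb{R} \to \mathbb{R}$ be a differentiable function with derivative $g$ such that $\lim_{|x| \to \infty} \big[F(x) - G(x)\big] = 0$. Suppose that $\int (1 + |x|)|g(x)|\, dx < \infty$, $\int_{-\infty}^{\infty} x^r g(x)\, dx = 0$ for $r = 0, 1$ and $|g(x)| \le C_0$ for all $x \in \mathbb{R}$, for some $C_0 \in (0, \infty)$. Then, for any $T \in (0, \infty)$,
$$\sup_{x \in \mathbb{R}} \big|F(x) - G(x)\big| \le \frac{1}{\pi} \int_{-T}^{T} \frac{|\zeta(t) - \xi(t)|}{|t|}\, dt + \frac{24 C_0}{\pi T} \tag{4.3}$$
where $\xi(t) = \int_{-\infty}^{\infty} \exp(\iota t x) g(x)\, dx$, $t \in \mathbb{R}$.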

For a proof of Lemma 11.4.2, see Feller (1966).

The next lemma deals with an expansion of the logarithm of the characteristic function of $X$ in a neighborhood of zero. Let $z = re^{i\theta}$, $r \in (0, \infty)$, $\theta \in [0, 2\pi)$ be the polar representation of a nonzero complex number $z$. Then, the (principal branch of the) complex logarithm of $z$ is defined as
$$\log z = \log r + i\theta. \tag{4.4}$$
The function $\log z$ is infinitely differentiable on the set $\{z \in \mathbb{C} : z = re^{i\theta}, r \in (0, \infty), 0 \le \theta < 2\pi\}$ and has a convergent Taylor's series expansion around 1 on the unit disc:
$$\log(1 + z) = \sum_{k=1}^{\infty} (-1)^{k-1} z^k / k \quad \text{for } |z| < 1. \tag{4.5}$$

Lemma 11.4.3: Let $Y$ be a random variable with $EY = 0$, $\tilde\sigma^2 = EY^2 \in (0, \infty)$, $\tilde\rho = E|Y|^3 < \infty$ and characteristic function $\phi_Y(t) = E\exp(\iota t Y)$, $t \in \mathbb{R}$. Then, for all $t \in \big[-\frac{1}{\tilde\sigma}, \frac{1}{\tilde\sigma}\big]$,
$$\Big|\log \phi_Y(t) + \frac{t^2 \tilde\sigma^2}{2}\Big| \le \frac{5}{12} |t|^3 \tilde\rho \tag{4.6}$$
and
$$\Big|\log \phi_Y(t) - \Big[\frac{(\iota t)^2}{2!} \tilde\sigma^2 + \frac{(\iota t)^3}{3!} EY^3\Big]\Big| \le E \min\Big\{\frac{|tY|^3}{3}, \frac{(tY)^4}{24}\Big\} + \frac{t^4 \tilde\sigma^4}{4}. \tag{4.7}$$


Proof: Note that by Lemma 10.1.5, 2 2 * * * * *φY (t) − 1* = *E exp(ιtY ) − 1 − ιtY * ≤ t EY ≤ 1 2 2

(4.8)

|t|# ≤ σ ˜ −1 . In particular, log φY (t) is well deﬁned for all t ∈ "whenever −1 −1 ˜ −σ ˜ ,σ . By (4.5), (4.8), and Lemma 10.1.5, for |t| ≤ σ ˜ −1 , * * 2 2* * ˜ * * log φY (t) + t σ * 2 * * * * t 2 σ ˜ 2 ** = ** log 1 + φY (t) − 1 + 2 * * * ∞ * *k t2 σ ˜ 2 ** ** ≤ **φY (t) − 1 − + φY (t) − 1* /k * 2 k=2

* * ∞ * |tY |3 ** 1 t2 σ ˜ 2 2 1 k−2 ≤ E **(tY )2 ∧ + 3! * 2 2 2 k=2

≤

3

4 4

|t| ρ t σ ˜ + . 6 4

Now using the bounds |t˜ σ | ≤ 1 and σ ˜ 3 = (EY 2 )3/2 ≤ E|Y |3 = ρ˜, one gets (4.6). The proof of (4.7) is similar and hence, it is left as an exercise (Problem 11.27). 2 Proof of Theorem 11.4.1: W.l.o.g., set µ = 0 and σ = 1. Then, X1 , X2 , . . . are iid zero mean, unit variance random variables. Let X =d X1 , ρ = E|X|3 and φX (·) denote the characteristic function of X. Itis easy to Sn ≤x , check that the conditions of Lemma 11.4.2 hold with F (x) = P √ n G(x) = Φ(x), x ∈ R, and C0 = √12π . Hence, by Lemma 11.4.2, with √ T = n/ρ, * n t * −t2 /2 * * 24ρ 1 T φX ( √n ) − e dt + √ ∆n ≤ . (4.9) π −T |t| π 2πn By Lemma 11.4.3 (with Y =

X1 −µ σ

and t replaced by

√t ), n

* t t2 * * * + * rn (t) ≡ *n log φX √ 2 n * t t 2 σ 2 * * * = n* log φX √ + √ * 2 n n 5 ρ|t|3 · √ ≤ 12 n for all |t| ≤

√

n, n ≥ 1.

(4.10)


√ Since ρ = E|X1 |3 ≥ (EX12 )3/2 = σ 3 = 1, |T | ≤ n. Hence, using the inequality |ez − 1| ≤ |z|e|z| for all z ∈ C and (4.10), one gets * * t 2 * * n − e−t /2 * *φX √ n * * t2 t t2 * * − 1* · exp − + = * exp n · log φX √ 2 2 n t2 ≤ |rn (t)| exp |rn (t)| · exp − 2 t2 5ρ|t| 5ρ 3 √ |t| exp − 1− √ ≤ 2 12 n 6 n 2 t 5ρ √ |t|3 exp − (4.11) ≤ 12 12 n √ t2 ∞ 2 √ ≤ 1, i.e., for all |t| ≤ T , n ≥ 1. Since dt = 6 2π, t exp − 12 for all ρ|t| −∞ n 0 " # 2 the theorem follows from (4.9) and (4.11) with C = π2 52 + 12 π . A striking feature of Theorem 11.4.1 is that the upper bound on ∆n in (4.2) is valid for all n ≥ 1. Also, under the conditions of Theorem 11.4.1, the rate O( √1n ) in (4.2) is the best possible in the sense that there exist random variables for which ∆n is bounded below by a constant multiple of −nµ √1 (cf. Problem 11.29). Edgeworth expansions of the cdf of Sn √ , to be n σ n developed in the next section, can be used to show that for certain random variables X1 satisfying additional moment and symmetry conditions, ∆n may go to zero at a faster rate. (For example, consider X1 ∼ N (µ, σ 2 ).) For iid sequences {Xn }n≥1 with E|X1 |2+δ < ∞ for some δ ∈ (0, 1], Theorem 11.4.1 can be strengthened to show that ∆n decreases at the rate O(n−δ/2 ) as n → ∞ (cf. Chow and Teicher (1997), Chapter 9).
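For a concrete feel for the bound (4.2), $\Delta_n$ can be computed exactly when the summands are Bernoulli. The following sketch is an illustration, not part of the proof: it assumes $X_1 \sim$ Bernoulli$(1/2)$, uses the value 5.05 quoted above as a stand-in for $C$, and the function names are hypothetical. Since the cdf of the normalized sum is a step function, the supremum is attained at a jump point, either at the point itself or as a left limit.

```python
import math


def std_normal_cdf(x):
    """Standard normal cdf Phi(x) via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def binom_pmf(n, k, p):
    """Binomial(n, p) pmf at k, computed on the log scale to avoid overflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1.0 - p))
    return math.exp(log_pmf)


def exact_delta_n(n, p):
    """Exact sup_x |P((S_n - n p)/(sigma sqrt(n)) <= x) - Phi(x)| for Bernoulli(p) summands."""
    sigma = math.sqrt(p * (1.0 - p))
    cdf_before, delta = 0.0, 0.0   # cdf_before = P(S_n <= k - 1)
    for k in range(n + 1):
        cdf_at = cdf_before + binom_pmf(n, k, p)
        phi_k = std_normal_cdf((k - n * p) / (sigma * math.sqrt(n)))
        # Compare Phi with the left limit and with the value at the jump point.
        delta = max(delta, abs(cdf_before - phi_k), abs(cdf_at - phi_k))
        cdf_before = cdf_at
    return delta


def third_moment_ratio(p):
    """E|X_1 - mu|^3 / sigma^3 for X_1 ~ Bernoulli(p)."""
    q = 1.0 - p
    return (p * q * (p * p + q * q)) / (p * q) ** 1.5


C_QUOTED = 5.05  # upper bound on C quoted in the text
results = {}
for n in (25, 100, 400):
    delta = exact_delta_n(n, 0.5)
    bound = C_QUOTED * third_moment_ratio(0.5) / math.sqrt(n)
    results[n] = (delta, bound)
    print(f"n={n:4d}  delta_n={delta:.5f}  sqrt(n)*delta_n={math.sqrt(n) * delta:.4f}  bound={bound:.4f}")
```

In this example $\sqrt{n}\,\Delta_n$ stays near a fixed positive value, which is consistent with the $O(n^{-1/2})$ rate being attained, while remaining well below the quoted bound.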

11.4.2 Edgeworth expansions

Recall from Chapter 10 that a random variable $X_1$ is called lattice if there exist $a \in \mathbb{R}$ and $h \in (0, \infty)$ such that
$$P\big(X_1 \in \{a + ih : i \in \mathbb{Z}\}\big) = 1. \tag{4.12}$$
The largest $h$ satisfying (4.12) is called the span of (the distribution of) $X_1$. A random variable $X_1$ is called nonlattice if it is not a lattice random variable. From Proposition 10.1.1, it follows that $X_1$ is nonlattice iff
$$\big|E\exp(\iota t X_1)\big| < 1 \quad \text{for all } t \ne 0. \tag{4.13}$$
The next result gives an Edgeworth expansion for the cdf of $\frac{S_n - n\mu}{\sigma\sqrt{n}}$ with an error of order $o(n^{-1/2})$ for nonlattice random variables.


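Theorem 11.4.4: Let $\{X_n\}_{n \ge 1}$ be a sequence of iid random variables with $EX_1 = \mu$, $\mathrm{Var}(X_1) = \sigma^2 \in (0, \infty)$ and $E|X_1|^3 < \infty$. Suppose, in addition, that $X_1$ is nonlattice, i.e., it satisfies (4.13). Then,
$$\sup_{x \in \mathbb{R}} \Big|P\Big(\frac{S_n - n\mu}{\sigma\sqrt{n}} \le x\Big) - \Big[\Phi(x) - \frac{1}{\sqrt{n}} \cdot \frac{\mu_3}{6\sigma^3} (x^2 - 1)\phi(x)\Big]\Big| = o(n^{-1/2}) \quad \text{as } n \to \infty, \tag{4.14}$$
where $\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$, $x \in \mathbb{R}$, and $\mu_3 = E(X_1 - \mu)^3$.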
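The improvement promised by (4.14) over the plain normal approximation can be checked in a case where the cdf of the normalized sum is available in closed form. The sketch below is illustrative and rests on assumed choices: $X_1 \sim$ Exp(1), for which $\mu = \sigma = 1$ and $\mu_3 = 2$, so that $S_n$ has an Erlang distribution; the grid on $[-4, 4]$ and the function names are hypothetical.

```python
import math


def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def erlang_cdf(n, s):
    """P(S_n <= s) for S_n a sum of n iid Exp(1) variables."""
    if s <= 0.0:
        return 0.0
    term, total = 1.0, 1.0          # running s^k / k! and its partial sum over k < n
    for k in range(1, n):
        term *= s / k
        total += term
    return 1.0 - math.exp(-s) * total


def edgeworth_cdf(x, n, mu3=2.0, sigma=1.0):
    """Phi(x) - n^{-1/2} * mu3 / (6 sigma^3) * (x^2 - 1) * phi(x), as in (4.14)."""
    correction = mu3 / (6.0 * sigma ** 3 * math.sqrt(n)) * (x * x - 1.0) * std_normal_pdf(x)
    return std_normal_cdf(x) - correction


def sup_errors(n, grid):
    """Return (sup error of the normal approximation, sup error of the Edgeworth expansion)."""
    normal_err = edge_err = 0.0
    for x in grid:
        exact = erlang_cdf(n, n + math.sqrt(n) * x)   # P((S_n - n)/sqrt(n) <= x)
        normal_err = max(normal_err, abs(exact - std_normal_cdf(x)))
        edge_err = max(edge_err, abs(exact - edgeworth_cdf(x, n)))
    return normal_err, edge_err


grid = [i / 100.0 for i in range(-400, 401)]
errors = {n: sup_errors(n, grid) for n in (10, 40, 160)}
for n, (normal_err, edge_err) in errors.items():
    print(f"n={n:4d}  normal error={normal_err:.5f}  Edgeworth error={edge_err:.5f}")
```

For these skewed summands the one-term correction should reduce the maximal error over the grid substantially at every sample size shown.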

The function
$$e_{n,2}(x) \equiv \Phi(x) - \frac{1}{\sqrt{n}} \cdot \frac{\mu_3}{6\sigma^3} (x^2 - 1)\phi(x), \quad x \in \mathbb{R} \tag{4.15}$$
is called a second order Edgeworth expansion for $T_n \equiv \frac{S_n - n\mu}{\sigma\sqrt{n}}$. The above theorem shows that the cdf of the normalized sum $T_n$ can be approximated by the second order Edgeworth expansion with accuracy $o(n^{-1/2})$. It can be shown that if $E|X_1|^4 < \infty$ and $X_1$ satisfies Cramer's condition:
$$\limsup_{|t| \to \infty} \big|E\exp(\iota t X_1)\big| < 1, \tag{4.16}$$

then the bound on the right side of (4.14) can be improved to O(n−1 ). Note that for a symmetric random variable X1 , having a ﬁnite fourth moment and satisfying (4.16), the second term in en,2 (x) is zero and the rate of normal approximation becomes O(n−1 ). Higher order Edgeworth expansions for Tn can be derived using (4.16) and arguments similar to those in the proof of Theorem 11.4.4, but the form of the expansion becomes more complicated. See Petrov (1975), Bhattacharya and Rao (1986), and Hall (1992) for detailed accounts of the Edgeworth expansion theory. Proof of Theorem 11.4.4: W.l.o.g., let µ = 0 and σ = 1. In Lemma 11.4.2, take F (x) = P (Tn ≤ x), and G(x) = en,2 (x), x ∈ R. Then, it is easy to verify that the conditions of Lemma 11.4.2 hold with g(x) = gn (x) ≡ φ(x) + 6µ√3n (x3 − 3x)φ(x), x ∈ R. Using repeated diﬀerentiation on both sides of the identity (inversion formula): ∞ 2 2 e−x /2 1 √ e−ιtx · e−t /2 dt, x ∈ R, = 2π −∞ 2π one can show that e−x /2 1 d3 e−x /2 −(x − 3x) √ = 3 √ = dx 2π 2π 2π 2

2

3

x ∈ R. Hence, ξn (t) ≡ ξ(t) =

2

eιtx gn (x)dx = e−t

/2

∞

2

e−ιtx (−ιt)3 e−t

/2

dt,

−∞

µ3 1 + √ (ιt)3 , t ∈ R. 6 n

(4.17)


√ Next, let ∈ (0, 1) be given. Then, set T = c n where c ≡ c = 24 · sup

3 |µ3 | 3 |x − 3x| φ(x) : x ∈ R . 1+ 6

Then, by Lemma 11.4.2 and (4.17), ∆n,2

* * * Sn − nµ * * √ ≡ sup *P ≤ x − en,2 (x)** σ n x∈R * * √ * * t n 1 c n *φX √n − ξ(t)* ≤ dt + √ , √ π −c n |t| n

(4.18)

where φX (t) = E exp(ιtX), t ∈ R and X =d X1 . Let ρ = E|X1 |3 . Let M ∈ (1, ∞) be such that E|X1 |3 I(|X1 | > M ) ≤ /2. Then, setting δ = 2M ρ , it √ |t| 4√ 3 follows that E|X1 | n I(|X1 | ≤ M ) ≤ M δE|X1 | ≤ /2 for all |t| ≤ δ n. √ Hence, for all |t| ≤ δ n, by (4.7) of Lemma 11.4.3, * * t (ιt)2 * µ3 ιt 3 ** * √ rn,2 (t) ≡ n* log φX √ + − * 2n 6 n n |X |3 |X |4 |t| * t *3 t4 * * 1 1 √ , ≤ n · * √ * E min + 2 3 24 4n n n |t| |t|3 √ E |X1 |3 I(|X1 | > M ) + E |X1 |4 √ I(|X1 | ≤ M ) ≤ 3 n n |t| 3 +√ n 4 |t|3 (4.19) ≤ √ . n Also, note that for any complex numbers z, w, |ez − 1 − w|

≤ ≤

|ez − ew | + |ew − 1 − w| # " 1 |z − w| + |w|2 exp |z| ∨ |w| . 2

(4.20)

√ Hence, by (4.10), (4.19), and (4.20), it follows that for all |t| ≤ δ n, * t * * n * − ξn (t)* *φX √ n * * t t2 * 2 * µ3 − 1 − √ (ιt)3 **e−t /2 = ** exp n log φX √ + 2 n 6 n *2 * 2 * µ 1 ** µ3 * * 3 * ≤ rn,2 (t) + * √ (ιt)3 * exp rn (t) ∨ * √ t3 * e−t /2 . 2 6 n 6 n


√ 5 ≤ 12 |t|2 for all |t| ≤ δ n, δ√n **φn ( √t ) − ξ (t)** n X n dt √ |t| −δ n δ √n t2 µ23 2 5 √ exp − ≤ · |t| + dt · t √ 72n 12 n −δ n ≤ C1 · √ n

Since rn (t) ∨


|µ3 t|3 √ 6 n

(4.21)

for some C1 ∈ (0, ∞). By (4.13), * t *n √ √ * * sup *φX √ * : δ n < |t| < c n ≤ θn n for some θ ∈ (0, 1). Hence, * * n t *φ √ − ξn (t)* X n

dt |t| 2 ρ 1 2θn log(c/δ) + √ e−t /2 1 + √ |t|3 dt, √ δ n |t|>δ n n

√ √ δ n x) assert that a similar behavior holds for many distributions P (X other than the normal distribution on the (negative) logarithmic scale. The main result of this section is the following. Theorem 11.4.6: Let {Xn }n≥1 be a sequence of iid nondegenerate random variables with φ(t) ≡ EetX1 < ∞ for all t > 0. (4.27) Let µ = EX1 . Then, for all x ∈ (µ, θ) ¯ n ≥ x) = −γ(x), lim n−1 log P (X

n→∞

(4.28)

where
$$\gamma(x) = \sup_{t > 0} \{tx - \log \phi(t)\} \quad \text{and} \quad \theta = \sup\{x \in \mathbb{R} : P(X_1 \le x) < 1\}. \tag{4.29}$$
Note that under (4.27), $EX_1^{+} < \infty$ and hence, $\mu \equiv EX_1$ is well defined, and $\mu \in [-\infty, \infty)$. For proving the theorem, the following results are needed.

Lemma 11.4.7: Let $X_1$ be a nondegenerate random variable satisfying (4.27). Let $\mu = EX_1$ and let $\gamma(x)$, $\theta$ be as in (4.29). Then,


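(i) the function $\phi(t)$ is infinitely differentiable on $(0, \infty)$ with $\phi^{(r)}(t)$, the $r$th derivative of $\phi(t)$, being given by
$$\phi^{(r)}(t) = E\big(X_1^r e^{tX_1}\big), \quad t \in (0, \infty), \ r \in \mathbb{N}, \tag{4.30}$$

(ii) $\lim_{t \downarrow 0} \phi(t) = 1$, $\lim_{t \downarrow 0} \phi^{(1)}(t) = \mu$, and $\lim_{t \to \infty} \frac{\phi^{(1)}(t)}{\phi(t)} = \theta$,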

(iii) for every x ∈ (µ, θ), there exists a unique solution ax ∈ R to the equation x = φ (ax )/φ(ax ) (4.31) such that γ(x) = xax − log φ(ax ). Proof: Let F denote the cdf of X1 . (i) Note that for any h = 0, " # h−1 φ(t + h) − φ(t) =

∞

−∞

ehx − 1 tx · e dF (x). h

* hs * As h *→ 0, the integrand converges xetx for all x, t. Also, * e h−1 * ≤ * ∞ * k−1 xk */k! ≤ |x|e|hx| for all h, x. Hence, for any x ∈ R, t ∈ (0, ∞), k=1 h and 0 < |h| < t/2, the integrand is bounded above by |x|e|hx| etx

Since

= |x|e(t−|h|)x I(−∞,0) (x) + |x|e(t+|h|)x I(x > 0) ≤

|x|e−t|x|/2 I(−∞,0) (x) + |x|e3tx I(0,∞) (x)

≡

g(x), say.

(4.32)

g(x)dF (x) < ∞, by the DCT, it follows that φ(t + h) − φ(t) lim h→0 h

exists and equals

xetx dF (x)

for all t ∈ (0, ∞). Thus, φ(t) is diﬀerentiable on (0, ∞) with φ(1) (t) = EX1 etX1 , t ∈ (0, ∞). Now, using induction and similar arguments, one can complete the proof of part (i) (Problem 11.34). Next consider (ii). Since etx ≤ I(−∞,0] (x) + ex I(0,∞) (x) for all x ∈ R, t ∈ (0, 1), by the DCT, the ﬁrst relation follows. For the second, note that |x|etx I(−∞,0) (x) ↑ |x|I(−∞,0] (x)

as

t↓0

(4.33)

and |x|etx ≤ |x|ex for all 0 < t ≤ 1, x > 0. Hence, applying the MCT for x ∈ (−∞, 0] and the DCT for x ∈ (0, ∞), one obtains the second limit. Derivation of the third limit is left as an exercise (Problem 11.35).


To prove part (iii), ﬁx x ∈ (µ, θ) and let γ(t) = tx − log φ(t), t ≥ 0. Then, for t ∈ (0, ∞), φ(1) (t) φ(t) (2) φ (t) φ(1) (t) 2 γ (2) (t) = − = Var(Yt ), (4.34) φ(t) φ(t) y where Yt is a random variable with cdf P (Yt ≤ y) = −∞ etu dF (x)/φ(t), y ∈ R. Since X1 is nondegenerate, so is Yt (for any t ≥ 0) and hence, Var(Yt ) > 0. As a consequence, the second derivative of the function γ(t) is positive. And the minimum of γ(t) over (0, ∞) is attained by a solution to the equation γ (1) (t) = 0, i.e., by t = ax satisfying (4.30). That such a solution exists and is unique follows from part (ii) and the facts that x > µ, γ (1) (t)

φ(1) (0+) φ(0)

=

x−

= µ (by (ii)), and that

φ(1) (t) φ(t)

is continuous and strictly increasing

on (0, ∞) (as for any t ∈ (0, ∞), the derivative of γ

(2)

φ(1) (t) φ(t)

(t), which is positive by (4.34)). This proves part (iii).

coincides with 2

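Lemma 11.4.8: Let $\{X_n\}_{n \ge 1}$ be as in Theorem 11.4.6. For $t \in (0, \infty)$, let $\{Y_{t,n}\}_{n \ge 1}$ be a sequence of iid random variables with cdf
$$P(Y_{t,1} \le y) = \int_{-\infty}^{y} e^{tu}\, dF(u) / \phi(t), \quad y \in \mathbb{R},$$
where $F$ is the cdf of $X_1$. Let $\nu_n$ and $\lambda_n$ denote the probability distributions of $S_n \equiv X_1 + \cdots + X_n$ and $T_{n,t} = Y_{t,1} + \cdots + Y_{t,n}$, $n \ge 1$. Then, for each $n \ge 1$,
$$\nu_n \ll \lambda_n \quad \text{and} \quad \frac{d\nu_n}{d\lambda_n}(x) = e^{-tx} \phi(t)^n, \quad x \in \mathbb{R}. \tag{4.35}$$

Proof: The proof is by induction. Clearly, the assertion holds for $n = 1$. Next, suppose that (4.35) is true for some $r \in \mathbb{N}$ and let $n = r + 1$. Then, for any $A \in \mathcal{B}(\mathbb{R})$,
$$\begin{aligned}
\nu_n(A) &= P(X_1 + \cdots + X_n \in A) \\
&= \int_{-\infty}^{\infty} P(X_1 + \cdots + X_{n-1} \in A - x)\, dF(x) \\
&= \int_{-\infty}^{\infty} \int_{A - x} \frac{d\nu_{n-1}}{d\lambda_{n-1}}(u)\, d\lambda_{n-1}(u)\, dF(x) \\
&= \int_{-\infty}^{\infty} \int_{A - x} e^{-tu} \phi(t)^{n-1}\, d\lambda_{n-1}(u)\, dF(x) \\
&= [\phi(t)]^n \int_{-\infty}^{\infty} \int_{A - x} e^{-t(u + x)}\, d\lambda_{n-1}(u)\, d\lambda_1(x) \\
&= [\phi(t)]^n \int_{A} e^{-tv}\, (\lambda_{n-1} * \lambda_1)(dv),
\end{aligned}$$
where $*$ denotes convolution. Since $\lambda_{n-1} * \lambda_1 = \lambda_n$, the result follows. □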

Proof of Theorem 11.4.6: Fix x ∈ (µ, θ). Note that by Markov’s inequality, for any t > 0, n ≥ 1, ¯ n ≥ x) = P etX¯ n ≥ etx P (X ¯ ≤ e−tx E etXn = exp − tx + n log φ(t/n) . Hence, ¯ n ≥ x ≤ −x · t + log φ t n−1 log P X for all t > 0, n ≥ 1 n n ¯ n ≥ x ≤ inf {−xt + log φ(t)} = −γ(x). (4.36) ⇒ lim sup n−1 log P X t>0

n→∞

This yields the upper bound. Next it will be shown that ¯ n ≥ x ≥ −γ(x). lim inf n−1 log P X n→∞

(4.37)

To that end, let {Yt,n }n≥1 , νn , and λn be as in Lemma 11.4.8. Also, let ax be as in (4.30). Then, for any y > x, t ∈ (ax , ∞), and n ≥ 1, by Lemma 11.4.8, ¯ n ≥ x = νn [nx, ∞) P X = e−tu φ(t)n du [nx,∞) ≥ e−tu φ(t)n du ≥

[nx,ny] n −tny

φ(t) e

λn [nx, ny] .

(4.38)

Note that $EY_{t,1} = \int u\, d\lambda_1(u) = \phi^{(1)}(t)/\phi(t)$. Since $\phi^{(1)}(\cdot)/\phi(\cdot)$ is strictly increasing and continuous on $(0, \infty)$, given $y > x$, there exists a $t = t_y \in (a_x, \infty)$ such that
\[ y > \frac{\phi^{(1)}(t)}{\phi(t)} > \frac{\phi^{(1)}(a_x)}{\phi(a_x)} = x. \tag{4.39} \]
By the WLLN, for any $y > x$ and $t$ satisfying (4.39),
\[ \lambda_n\big([nx, ny]\big) = P\Big(x \le \frac{Y_{t,1} + \cdots + Y_{t,n}}{n} \le y\Big) \to 1 \quad \text{as } n \to \infty. \]
Hence, from (4.38), it follows that
\[ \liminf_{n\to\infty} n^{-1} \log P(\bar X_n \ge x) \ge -ty + \log \phi(t) \]
for all $y > x$ and all $t \in (a_x, \infty)$ satisfying (4.39). Now, letting $t \downarrow a_x$ first and then $y \downarrow x$, one gets (4.37). This completes the proof of Theorem 11.4.6. □

Remark 11.4.1: If (4.27) holds and $\theta < \infty$, then
\[ P(\bar X_n \ge \theta) = \big[P(X_1 = \theta)\big]^n \]
so that
\[ \lim_{n\to\infty} n^{-1} \log P(\bar X_n \ge \theta) = \log P(X_1 = \theta). \]

In this case, (4.28) holds for $x = \theta$ with $\gamma(\theta) = -\log P(X_1 = \theta)$. For $x > \theta$, (4.28) holds with $\gamma(x) = +\infty$.

Remark 11.4.2: Suppose that there exists a $t_0 \in (0, \infty)$ such that, instead of (4.27), the following condition holds:
\[ \phi(t) \begin{cases} = +\infty & \text{for all } t > t_0 \\ < \infty & \text{for all } t \in (0, t_0), \end{cases} \]
and $\phi'(t)/\phi(t)$ increases to a finite limit $\theta_0$ as $t \uparrow t_0$. Then, $\theta$ must be $+\infty$. In this case, it can be shown that (4.28) holds for all $x \in (\mu, \theta_0)$ (with the given definition of $\gamma(x)$) and that (4.28) holds for all $x \in [\theta_0, \infty)$, with $\gamma(x) \equiv t_0 x - \log \phi(t_0)$. See Theorem 9.6, Chapter 1, Durrett (2004).
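The upper bound (4.36) is non-asymptotic: it gives $P(\bar X_n \ge x) \le e^{-n\gamma(x)}$ for every $n$. The following sketch checks this for Bernoulli($p$) variables, taking $\gamma(x) = \sup_{t>0}\{tx - \log\phi(t)\}$ as in the proof above. The Bernoulli choice, the ternary search, and all function names are assumptions made for illustration only. For this distribution the supremum has a well-known closed form (the relative entropy of Bernoulli($x$) with respect to Bernoulli($p$)), which the code uses as a cross-check.

```python
from math import ceil, comb, exp, log

def log_mgf(t, p):
    """log phi(t) for X_1 ~ Bernoulli(p)."""
    return log(1 - p + p * exp(t))

def rate(x, p, t_max=50.0, iters=200):
    """gamma(x) = sup_{t>0} {t x - log phi(t)}, by ternary search (objective is concave in t)."""
    lo, hi = 0.0, t_max
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if m1 * x - log_mgf(m1, p) < m2 * x - log_mgf(m2, p):
            lo = m1
        else:
            hi = m2
    t = (lo + hi) / 2
    return t * x - log_mgf(t, p)

def relative_entropy(x, p):
    """Closed form of gamma(x) in the Bernoulli case, for p < x < 1."""
    return x * log(x / p) + (1 - x) * log((1 - x) / (1 - p))

def upper_tail(n, x, p):
    """Exact P(sample mean >= x) for n Bernoulli(p) trials."""
    k0 = ceil(n * x - 1e-12)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k0, n + 1))

for n in (10, 50, 200):
    print(n, upper_tail(n, 0.7, 0.5), exp(-n * rate(0.7, 0.5)))  # exact tail, then bound
```

The exact tail stays below the bound for each $n$; the two agree only on the exponential scale, which is all that Theorem 11.4.6 asserts.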

11.4.4 The functional central limit theorem

Let $\{X_i\}_{i\ge1}$ be iid random variables with $EX_1 = 0$, $EX_1^2 = 1$. Let $S_0 = 0$, $S_n = \sum_{i=1}^n X_i$, $n \ge 1$. The central limit theorem says that as $n \to \infty$,
\[ W_n \equiv \frac{S_n}{\sqrt{n}} \longrightarrow^d N(0, 1). \]
Now let Wn and for any

j n

≤t

C0 ) = α and rejects the hypothesis that F is the distribution of {Xi }i≥1 if KS(Fn ) > C0 and accepts it, otherwise.

11.5 Problems

11.1 Show that the triangular array $\{X_{nj} : 1 \le j \le n\}_{n\ge1}$, with $X_{nj}$ as in (1.24), is a null array, i.e., satisfies (1.15) iff (1.22) holds.

11.2 Construct an example of a triangular array $\{X_{nj} : 1 \le j \le r_n\}_{n\ge1}$ of independent random variables such that for any $1 \le j \le r_n$, $n \ge 1$, $E|X_{nj}|^\alpha = \infty$ for all $\alpha \in (0, \infty)$, but there exist sequences $\{a_n\}_{n\ge1} \subset (0, \infty)$ and $\{b_n\}_{n\ge1} \subset \mathbb{R}$ such that
\[ \frac{S_n - b_n}{a_n} \longrightarrow^d N(0, 1). \]

11.3 Let $\{X_n\}_{n\ge1}$ be a sequence of independent random variables with
\[ P(X_n = \pm 1) = \frac{1}{2} - \frac{1}{2\sqrt{n}}, \quad P(X_n = \pm n^2) = \frac{1}{2\sqrt{n}}, \quad n \ge 1. \]
Find constants $\{a_n\}_{n\ge1} \subset (0, \infty)$ and $\{b_n\}_{n\ge1} \subset \mathbb{R}$ such that
\[ \frac{\sum_{j=1}^n X_j - b_n}{a_n} \longrightarrow^d N(0, 1). \]

11.4 Let $\{X_n\}_{n\ge1}$ be a sequence of independent random variables such that for some $\alpha \ge \frac12$,
\[ P(X_n = \pm n^\alpha) = \frac{n^{1-2\alpha}}{2} \quad \text{and} \quad P(X_n = 0) = 1 - n^{1-2\alpha}, \quad n \ge 1. \]
Let $S_n = \sum_{j=1}^n X_j$ and $s_n^2 = \mathrm{Var}(S_n)$.

(a) Show that for all $\alpha \in [\frac12, 1)$,
\[ \frac{S_n}{s_n} \longrightarrow^d N(0, 1). \tag{5.1} \]

(b) Show that (5.1) fails for $\alpha \in [1, \infty)$.

(c) Show that for $\alpha > 1$, $\{S_n\}_{n\ge1}$ converges to a random variable $S$ w.p. 1 and that $s_n \to \infty$.


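11.5 Let $\{X_{nj} : 1 \le j \le r_n\}_{n\ge1}$ be a triangular array of independent zero mean random variables satisfying the Lindeberg condition. Show that
\[ E\Big(\max_{1\le j\le r_n} \frac{X_{nj}^2}{s_n^2}\Big) \to 0 \quad \text{as } n \to \infty, \]
where $s_n^2 = \sum_{j=1}^{r_n} \mathrm{Var}(X_{nj})$, $n \ge 1$.

11.6 Let $\{X_n\}_{n\ge1}$ be a sequence of random variables. Let $S_n = \sum_{j=1}^n X_j$ and $s_n^2 = \sum_{j=1}^n EX_j^2 < \infty$, $n \ge 1$. If $s_n^2 \to \infty$ then show that
\[ \lim_{n\to\infty} s_n^{-2} \sum_{j=1}^n EX_j^2 I(|X_j| > \epsilon s_n) = 0 \quad \text{for all } \epsilon > 0 \]
\[ \iff \lim_{n\to\infty} s_n^{-2} \sum_{j=1}^n EX_j^2 I(|X_j| > \epsilon s_j) = 0 \quad \text{for all } \epsilon > 0. \]
(Hint: Verify that for any δ > 0, j:sj 2, show that
\[ \lim_{n\to\infty} s_n^{-r} \sum_{j=1}^n E|X_j|^r I(|X_j| > \epsilon s_n) = 0 \quad \text{for all } \epsilon \in (0, \infty) \]
\[ \iff \lim_{n\to\infty} s_n^{-r} \sum_{j=1}^n E|X_j|^r = 0, \tag{5.2} \]
where $s_n^2 = \sum_{j=1}^n EX_j^2$.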

11.8 Let $\{X_n\}_{n\ge1}$ be a sequence of zero mean independent random variables satisfying (5.2) for $r = 4$.

(a) Show that
\[ \lim_{n\to\infty} E(s_n^{-1} S_n)^k = EZ^k \]
for all $k = 2, 3, 4$, where $Z \sim N(0, 1)$.

(b) Show that $\big\{(S_n/s_n)^4\big\}_{n\ge1}$ is uniformly integrable.

(c) Show that $\lim_{n\to\infty} Eh(S_n/s_n) = Eh(Z)$ where $h(\cdot) : \mathbb{R} \to \mathbb{R}$ is continuous and $h(x) = O(|x|^4)$ as $|x| \to \infty$.

11.9 Let $\{X_n\}_{n\ge1}$ be a sequence of independent random variables such that
\[ P(X_n = \pm 1) = \frac14, \quad P(X_n = \pm n) = \frac{1}{4n^2} \quad \text{and} \quad P(X_n = 0) = \frac12\Big(1 - \frac{1}{n^2}\Big), \quad n \ge 1. \]


(a) Show that the triangular array $\{X_{nj} : 1 \le j \le n\}_{n\ge1}$ with $X_{nj} = X_j/\sqrt{n}$, $1 \le j \le n$, $n \ge 1$ does not satisfy the Lindeberg condition.

(b) Show that there exists $\sigma \in (0, \infty)$ such that
\[ \frac{S_n}{\sqrt{n}} \longrightarrow^d N(0, \sigma^2). \]
Find $\sigma^2$.

11.10 Let $\{X_j\}_{j\ge1}$ be independent random variables such that $X_j$ has Uniform $[-j, j]$ distribution. Show that the Lindeberg-Feller condition holds for the triangular array $X_{nj} = X_j/n^{3/2}$, $1 \le j \le n$, $n \ge 1$.

11.11 (CLT for random sums). Let $\{X_i\}_{i\ge1}$ be iid random variables with $EX_1 = 0$, $EX_1^2 = 1$. Let $\{N_n\}_{n\ge1}$ be a sequence of positive integer valued random variables such that $N_n/n \longrightarrow_p c$, $0 < c < \infty$. Show that
\[ \frac{S_{N_n}}{\sqrt{N_n}} \longrightarrow^d N(0, 1). \]

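(Hint: Use Kolmogorov's first inequality (cf. 8.3.1) to show that
\[ P\Big(\frac{|S_{N_n} - S_{nc}|}{\sqrt{n}} > \lambda,\ |N_n - nc| < n\epsilon\Big) \le \frac{\epsilon}{\lambda^2} \]
for any $\epsilon > 0$, $\lambda > 0$.)

11.12 Let $\{N(t) : t \ge 0\}$ be the renewal process as defined in (5.1) of Section 8.5. Assume $EX_1 = \mu \in (0, \infty)$, $EX_1^2 < \infty$. Show that
\[ \frac{N(t) - t/\mu}{\sqrt{t}} \longrightarrow^d N(0, \sigma^2) \tag{5.3} \]
for some $0 < \sigma^2 < \infty$. Find $\sigma^2$. (Hint: Use
\[ \frac{S_{N(t)} - N(t)\mu}{\sqrt{N(t)}} \le \frac{t - N(t)\mu}{\sqrt{N(t)}} \le \frac{S_{N(t)+1} - (N(t)+1)\mu}{\sqrt{N(t)}} + \frac{\mu}{\sqrt{N(t)}}, \]
and the fact that $N(t)/t \to 1/\mu$ w.p. 1.)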

11.13 Let $\{N(t) : t \ge 0\}$ be as in the above problem. Give another proof of (5.3) by using the CLT for $\{S_n\}_{n\ge1}$ and the relation $P(N(t) > n) = P(S_n < t)$ for all $t$, $n$.

11.14 Let $\{X_j\}_{j\ge1}$ be iid random variables with distribution $P(X_1 = 1) = 1/2 = P(X_1 = -1)$. Show that there exist positive integer valued random variables $\{r_k\}_{k\ge1}$ such that $r_k \to \infty$ w.p. 1, but $S_{r_k}/\sqrt{r_k}$ does not converge in distribution. (Hint: Let $r_1 = \min\{n : S_n/\sqrt{n} > 1\}$ and for $k \ge 1$, define recursively $r_{k+1} = \min\{n : n > r_k,\ S_n/\sqrt{n} > k + 1\}$.)


11.15 (CLT for sample quantiles). Let $\{X_i\}_{i\ge1}$ be iid random variables. Let $0 < p < 1$ and let $Y_n \equiv F_n^{-1}(p) = \inf\{x : F_n(x) \ge p\}$, where $F_n(x) \equiv n^{-1} \sum_{i=1}^n I(X_i \le x)$ is the empirical cdf based on $X_1, X_2, \ldots, X_n$. Assume that the cdf $F(x) \equiv P(X_1 \le x)$ is differentiable at $F^{-1}(p) \equiv \inf\{x : F(x) \ge p\}$ and that $\lambda_p \equiv F'(F^{-1}(p)) > 0$. Then show that
\[ \sqrt{n}\big(Y_n - F^{-1}(p)\big) \longrightarrow^d N(0, \sigma^2), \quad \text{where } \sigma^2 = p(1-p)/\lambda_p^2. \]
(Hint: Use the identity $P(Y_n \le x) = P(F_n(x) \ge p)$ for all $x$ and $p$.)

11.16 (A coupon collector's problem). For each $n \in \mathbb{N}$, let $\{X_{ni}\}_{i\ge1}$ be iid random variables such that $P(X_{n1} = j) = \frac1n$, $1 \le j \le n$. Let $T_{n0} = 1$, $T_{n1} = \min\{j : j > 1,\ X_j \ne X_1\}$, and $T_{n(i+1)} = \min\{j : j > T_{ni},\ X_j \notin \{X_{T_{nk}} : 0 \le k \le i\}\}$. That is, $T_{ni}$ is the first time the sample has $(i + 1)$ distinct elements. Suppose $k_n \uparrow \infty$ such that $k_n/n \to \theta$, $0 < \theta < 1$. Show that for some $a_n$, $b_n$,
\[ \frac{T_{n,k_n} - a_n}{b_n} \longrightarrow^d N(0, 1). \]
(Hint: Let $Y_{nj} = T_{nj} - T_{n(j-1)}$, $j = 1, 2, \ldots, (n - 1)$. Show that for each $n$, $\{Y_{nj}\}_{j=1,2,\ldots}$ are independent with $Y_{nj}$ having a geometric distribution with parameter $1 - \frac{j}{n}$. Now apply Lyapounov's criterion to the triangular array $\{Y_{nj} : 1 \le j \le k_n\}$.)

11.17 Prove Theorem 11.1.6.

11.18 Let $\{X_n\}_{n\ge1}$ be a sequence of iid random variables with $EX_n = 0$ and $EX_n^2 = \sigma^2 \in (0, \infty)$. Let $S_n = \sum_{j=1}^n X_j$, $n \ge 1$. For each $k \in \mathbb{N}$, find the limit distribution of the $k$-dimensional vector(s)

(a) $\Big(\frac{S_n}{\sqrt{n}}, \frac{S_{2n} - S_n}{\sqrt{n}}, \ldots, \frac{S_{nk} - S_{n(k-1)}}{\sqrt{n}}\Big)$,

(b) $\Big(\frac{S_{na_1}}{\sqrt{n}}, \frac{S_{na_2}}{\sqrt{n}}, \ldots, \frac{S_{na_k}}{\sqrt{n}}\Big)$, where $0 < a_1 < a_2 < \cdots < a_k < \infty$ are given real numbers,

(c) $\Big(\frac{S_{2n}}{\sqrt{n}}, \frac{S_{3n} - S_n}{\sqrt{n}}, \ldots, \frac{S_{(k+1)n} - S_{(k-1)n}}{\sqrt{n}}\Big)$.

11.19 For any random variable $X$, show that $EX^2 < \infty$ implies
\[ \frac{y^2 P(|X| > y)}{E(X^2 I(|X| \le y))} \to 0 \quad \text{as } y \to \infty. \]
Give an example to show that the converse is false. (Hint: Consider a random variable $X$ with pdf $f(x) = c_1 |x|^{-3}$ for $|x| > 2$.)


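11.20 Let $\{X_n\}_{n\ge1}$ be a sequence of iid random variables with common distribution
\[ P(X_1 \in A) = \int_A |x|^{-3} I(|x| > 1)\, dx, \quad A \in \mathcal{B}(\mathbb{R}). \]
Find sequences $\{a_n\}_{n\ge1} \subset (0, \infty)$ and $\{b_n\}_{n\ge1} \subset \mathbb{R}$ such that
\[ \frac{S_n - b_n}{a_n} \longrightarrow^d N(0, 1), \]
where $S_n = \sum_{j=1}^n X_j$, $n \ge 1$.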

11.21 Show using characteristic functions that if $X_1, X_2, \ldots, X_k$ are iid Cauchy $(\mu, \sigma^2)$ random variables, then $S_k \equiv \sum_{i=1}^k X_i$ has a Cauchy $(k\mu, k\sigma)$ distribution.

11.22 Show that if a random variable $Y_1$ has pdf $f$ as in (2.4), then (2.9) holds with $\alpha = 1/2$.

11.23 If $\{Y_n\}_{n\ge1}$ are iid Gamma (1,2), then $n^{-1} \sum_{i=1}^n Y_i^{-1} \longrightarrow^d W$, where $W$ has pdf $f_W(w) \equiv \frac{2}{\pi} \frac{1}{1+w^2} \cdot I_{(0,\infty)}(w)$.

11.24 Let $X$ be a nonnegative random variable such that $P(X \le x) \sim x^\alpha L(x)$ as $x \downarrow 0$ for some $\alpha > 0$ and $L(\cdot)$ slowly varying at 0. Let $Y = X^{-\beta}$, $\beta > 0$. Show that
\[ P(Y > y) \sim y^{-\gamma} \tilde L(y) \quad \text{as } y \uparrow \infty \]
for some $\gamma > 0$ and $\tilde L(\cdot)$ slowly varying at $\infty$.

11.25 Let $\{X_i\}_{i\ge1}$ be iid Beta $(m, n)$ random variables. Let $Y_i = X_i^{-\beta}$, $\beta > 0$, $i \ge 1$. Show that there exist sequences $\{a_n\}_{n\ge1}$ and $\{b_n\}_{n\ge1}$ such that $\frac{\sum_{i=1}^n Y_i - a_n}{b_n} \longrightarrow^d$ a stable law of order $\gamma$ for some $\gamma$ in $(0, 2]$.

11.26 Let $\{X_i\}_{i\ge1}$ be iid Uniform [0, 1] random variables.

(a) Show that for each $0 < \beta < \frac12$, there exist constants $\mu$ and $\sigma^2$ such that
\[ \frac{1}{\sigma\sqrt{n}} \Big(\sum_{i=1}^n X_i^{-\beta} - n\mu\Big) \longrightarrow^d N(0, 1). \]

(b) Show that for each $\frac12 < \beta < 1$, there exist a constant $0 < \gamma < 2$ and sequences $\{a_n\}_{n\ge1}$ and $\{b_n\}_{n\ge1}$ such that
\[ \frac{1}{b_n} \Big(\sum_{i=1}^n X_i^{-\beta} - a_n\Big) \longrightarrow^d \text{a stable law of order } \gamma. \]


11.27 Prove (4.7). (Hint: Use (4.5) and Lemma 10.1.5.)

11.28 (a) Show that the Gamma $(\alpha, \beta)$ distribution is infinitely divisible, $0 < \alpha, \beta < \infty$.

(b) Let $\mu$ be a finite measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Show that
\[ \phi(t) \equiv \exp\Big(\int (e^{\iota t u} - 1)\, \mu(du)\Big) \]
is the characteristic function of an infinitely divisible distribution.

11.29 Let $\{X_n\}_{n\ge1}$ be iid random variables with $P(X_1 = 0) = P(X_1 = 1) = \frac12$. Show that there exists a constant $C_1 \in (0, \infty)$ such that
\[ \Delta_n \ge C_1 n^{-1/2} \quad \text{for all } n \ge 1, \]
where $\Delta_n$ is as in (4.1).

11.30 Let $X_1$ be a random variable such that the absolutely continuous component $\beta F_{ac}(\cdot)$ in the decomposition (4.5.3) of the cdf $F$ of $X_1$ is nonzero. Show that $X_1$ satisfies Cramer's condition (4.13). (Hint: Use the Riemann-Lebesgue lemma.)

11.31 (Berry-Esseen theorem for sample quantile). Let $\{X_n\}_{n\ge1}$ be a collection of iid random variables with cdf $F(\cdot)$. Let $0 < p < 1$ and $Y_n = F_n^{-1}(p)$, where $F_n(x) = n^{-1} \sum_{i=1}^n I(X_i \le x)$, $x \in \mathbb{R}$. Suppose that $F(\cdot)$ is twice differentiable in a neighborhood of $\xi_p \equiv F^{-1}(p)$ and $F'(\xi_p) \in (0, \infty)$. Show that
\[ \sup_{x\in\mathbb{R}} \Big| P\big(\sqrt{n}(Y_n - \xi_p)/\sigma_p \le x\big) - \Phi(x) \Big| = O(n^{-1/2}) \quad \text{as } n \to \infty, \]
where $\sigma_p^2 = p(1-p)/\big(F'(\xi_p)\big)^2$. (Hint: Use the identity $P(Y_n \le x) = P(F_n(x) \ge p)$ for all $x$ and $p$, apply Theorem 11.4.1 to $F_n(x)$ for $\sqrt{n}|x - \xi_p| \le \log n$, and use monotonicity of cdfs for $\sqrt{n}|x - \xi_p| > \log n$. See Lahiri (1992) for more details. Also, see Reiss (1974) for a different proof.)

11.32 (A moderate deviation bound). Let $\{X_n\}_{n\ge1}$ be a sequence of iid random variables with $EX_1 = \mu$, $\mathrm{Var}(X_1) = \sigma^2 \in (0, \infty)$ and $E|X_1|^3 < \infty$. Show that
\[ P\Big(\sqrt{n}\,\big|\bar X_n - \mu\big| > \sigma\sqrt{\log n}\Big) = O(n^{-1/2}) \quad \text{as } n \to \infty. \]
(Hint: Apply Theorem 11.4.1.) It can be shown that the bound on the right side is indeed $o(n^{-1/2})$ as $n \to \infty$. For a more general version of this result, see Götze and Hipp (1978).


11.33 Show that the values of the functions $e_{n,2}(x)$ of (4.14) and $\tilde e_{n,2}(x)$ of (4.24) are not necessarily nonnegative for all $x \in \mathbb{R}$.

11.34 Complete the proof of Lemma 11.4.7 (i). (Hint: Suppose that for some $r \in \mathbb{N}$, $\phi$ is $r$-times differentiable with its $r$th derivative given by (4.30). Then, for $t \in (0, \infty)$,
\[ h^{-1}\big[\phi^{(r)}(t + h) - \phi^{(r)}(t)\big] = \int_{-\infty}^{\infty} \frac{e^{hx} - 1}{h} \cdot x^r e^{tx} F(dx) \]
and the integrand is bounded by the integrable function $|x|^r g(x)$ for all $x \in \mathbb{R}$, $0 < |h| < t/2$, where $g(\cdot)$ is as in (4.32). Now apply the DCT.)

11.35 Under the conditions of Lemma 11.4.7, show that
\[ \lim_{t\to\infty} \frac{\phi^{(1)}(t)}{\phi(t)} = \theta. \]
(Hint: Consider the cases '$\theta \in \mathbb{R}$' and '$\theta = \infty$' separately.)

11.36 Find the function $\gamma(x)$ of (4.28) in each of the following cases: (a) $X_1 \sim N(\mu, \sigma^2)$, (b) $X_1 \sim$ Gamma $(\alpha, \beta)$, (c) $X_1 \sim$ Uniform (0, 1).

11.37 Verify that the functions $T_i$, $i = 1, 2, 3$ defined by (4.42) are continuous on $C[0, 1]$.

12 Conditional Expectation and Conditional Probability

12.1 Conditional expectation: Definitions and examples

This section motivates the definition of conditional expectation for random variables with a finite variance through a mean square error prediction problem. The definition is then extended to integrable random variables by an approximation argument (cf. Definition 12.1.3). The more standard approach of proving the existence of conditional expectation by use of the Radon-Nikodym theorem is also outlined.

Let $(X, Y)$ be a bivariate random vector. A standard problem in regression analysis is to predict $Y$ having observed $X$, that is, to find a function $h(X)$ that predicts $Y$. A common criterion for measuring the accuracy of such a predictor is the mean squared error $E(Y - h(X))^2$. Under the assumption that $E|Y|^2 < \infty$, it can be shown that there exists a unique $h_0(X)$ that minimizes the mean squared error.

Theorem 12.1.1: Let $(X, Y)$ be a bivariate random vector. Let $EY^2 < \infty$. Then there exists a Borel measurable function $h_0 : \mathbb{R} \to \mathbb{R}$ with $E\big(h_0(X)\big)^2 < \infty$, such that
\[ E\big(Y - h_0(X)\big)^2 = \inf\big\{E\big(Y - h(X)\big)^2 : h(X) \in H_0\big\}, \tag{1.1} \]
where $H_0 = \big\{h(X) \mid h : \mathbb{R} \to \mathbb{R}$ is Borel measurable and $E(h(X))^2 < \infty\big\}$.


Proof: Let $H$ be the space of all Borel measurable functions of $(X, Y)$ that have a finite second moment. Let $H_0$ be the subspace of all Borel measurable functions of $X$ that have a finite second moment. It is known that $H_0$ is a closed subspace of $H$ (Problem 12.1) and that for any $Z$ in $H$, there exists a unique $Z_0$ in $H_0$ such that
\[ E(Z - Z_0)^2 = \min\big\{E(Z - Z_1)^2 : Z_1 \in H_0\big\}. \]
Further, $Z_0$ is the unique random variable (up to equivalence w.p. 1) such that
\[ E(Z - Z_0)Z_1 = 0 \quad \text{for all } Z_1 \in H_0. \tag{1.2} \]
A proof of this fact is given at the end of this section in Theorem 12.1.6. If one takes $Z$ to be $Y$, then this $Z_0$ is the desired $h_0(X)$. □

Remark 12.1.1: The random variable $Z_0$ in (1.2) is known as the projection of $Y$ onto $H_0$. It is known that for any random variable $Y$ with $EY^2 < \infty$, the constant $c$ that minimizes $E(Y - c)^2$ over all $c \in \mathbb{R}$ is $c = EY$, the expected value of $Y$. By analogy with this, one is led to the following definition.

Definition 12.1.1: For any bivariate random vector $(X, Y)$ with $EY^2 < \infty$, the conditional expectation of $Y$ given $X$, denoted as $E(Y|X)$, is the function $h_0(X)$ of Theorem 12.1.1.

Note that $h_0(X)$ is determined up to equivalence w.p. 1. Any such $h_0(X)$ is called a version of $E(Y|X)$. From (1.2) in the proof of Theorem 12.1.1, by taking $Z = Y$, $Z_1 = I_B(X)$, one finds that $Z_0 = h_0(X)$ satisfies
\[ EYI_A = Eh_0(X)I_A \tag{1.3} \]
for every event $A$ of the form $X^{-1}(B)$ where $B \in \mathcal{B}(\mathbb{R})$. Conversely, it can be shown that (1.3) implies (1.2), by the usual approximation procedure (Problem 12.1). From (1.3), the function $h_0(X)$ is determined w.p. 1. So one can take (1.3) to be the definition of $h_0(X)$. In statistics, the function $E(Y|X)$ is called the regression of $Y$ on $X$.

The function $h_0(x)$ can be determined explicitly in the following two special cases. If $X$ is a discrete random variable with values $x_1, x_2, \ldots$, then (1.3) implies, by taking $A = \{X = x_i\}$, that
\[ h_0(x_i) = \frac{E\big(Y I(X = x_i)\big)}{P(X = x_i)}, \quad i = 1, 2, \ldots. \tag{1.4} \]
Similarly, if $(X, Y)$ has an absolutely continuous distribution with joint probability density $f(x, y)$, it can be shown that w.p. 1, $E(Y|X) = h_0(X)$, where
\[ h_0(x) = \frac{\int y f(x, y)\, dy}{f_X(x)} \tag{1.5} \]


if $f_X(x) > 0$ and 0 otherwise, where $f_X(x) = \int f(x, y)\, dy$ is the probability density function of $X$ (Problem 12.2). The definition of $E(Y|X)$ can be generalized to the case when $X$ is a vector and more generally, as follows.

Theorem 12.1.2: Let $(\Omega, \mathcal{F}, P)$ be a probability space and $\mathcal{G} \subset \mathcal{F}$ be a σ-algebra. Let $H \equiv L^2(\Omega, \mathcal{F}, P)$ and $H_0 = L^2(\Omega, \mathcal{G}, P)$. Then for any $Y \in H$, there exists a $Z_0 \in H_0$ such that
\[ E(Y - Z_0)^2 = \inf\{E(Y - Z)^2 : Z \in H_0\} \tag{1.6} \]
and this $Z_0$ is determined w.p. 1 by the condition
\[ E(Y I_A) = E(Z_0 I_A) \quad \text{for all } A \in \mathcal{G}. \tag{1.7} \]
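When the sample space is finite and the σ-algebra is generated by a finite partition, the variable $Z_0$ characterized by (1.6) and (1.7) is simply the block average of $Y$. The sketch below illustrates this with exact rational arithmetic; the fair-die example and the helper names `cond_exp`, `expect`, and `mse` are hypothetical choices made for illustration.

```python
from fractions import Fraction

def cond_exp(prob, y, partition):
    """A version of E(Y|G) when G is generated by a finite partition:
    on each block, the probability-weighted average of y."""
    z = {}
    for block in partition:
        mass = sum(prob[w] for w in block)
        avg = sum(prob[w] * y[w] for w in block) / mass if mass else Fraction(0)
        for w in block:
            z[w] = avg
    return z

def expect(prob, f):
    return sum(prob[w] * f[w] for w in prob)

def mse(prob, y, z):
    return expect(prob, {w: (y[w] - z[w]) ** 2 for w in prob})

# Fair die, Y = face value, G generated by {odd, even}.
prob = {w: Fraction(1, 6) for w in range(1, 7)}
y = {w: Fraction(w) for w in prob}
partition = [{1, 3, 5}, {2, 4, 6}]
z0 = cond_exp(prob, y, partition)

# Condition (1.7) on each generating block A: E(Y I_A) = E(Z0 I_A).
for block in partition:
    lhs = sum(prob[w] * y[w] for w in block)
    rhs = sum(prob[w] * z0[w] for w in block)
    print(sorted(block), lhs, rhs)

# Condition (1.6): another G-measurable candidate has larger mean squared error.
z_alt = {w: Fraction(5, 2) if w % 2 else Fraction(9, 2) for w in prob}
print(mse(prob, y, z0), mse(prob, y, z_alt))
```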

The proof is similar to that of Theorem 12.1.1.

Definition 12.1.2: The random variable $Z_0$ in (1.7) is called the conditional expectation of $Y$ given $\mathcal{G}$ and is written as $E(Y|\mathcal{G})$.

When $\mathcal{G} = \sigma\langle X\rangle$, the σ-algebra generated by a random variable $X$, $E(Y|\mathcal{G})$ reduces to $E(Y|X)$ in Definition 12.1.1. The following properties of $E(Y|\mathcal{G})$ are easy to verify by using the defining equation (1.7) (Problem 12.3).

Proposition 12.1.3: Let $Y$ and $\mathcal{G}$ be as in Theorem 12.1.2.

(i) $Y \ge 0$ w.p. 1 $\Rightarrow$ $E(Y|\mathcal{G}) \ge 0$ w.p. 1.

(ii) $Y_1, Y_2 \in H$ $\Rightarrow$ $E\big((\alpha Y_1 + \beta Y_2)|\mathcal{G}\big) = \alpha E(Y_1|\mathcal{G}) + \beta E(Y_2|\mathcal{G})$ for any $\alpha, \beta \in \mathbb{R}$.

(iii) $Y_1 \ge Y_2$ w.p. 1 $\Rightarrow$ $E(Y_1|\mathcal{G}) \ge E(Y_2|\mathcal{G})$ w.p. 1.

Using a natural approximation procedure it is possible to extend the definition of $E(Y|\mathcal{G})$ to all random variables with just the first moment, i.e., $E|Y| < \infty$. This is done in the following result.

Theorem 12.1.4: Let $(\Omega, \mathcal{F}, P)$ be a probability space and $\mathcal{G} \subset \mathcal{F}$ be a sub-σ-algebra. Let $Y : \Omega \to \mathbb{R}$ be an $\mathcal{F}$-measurable random variable with $E|Y| < \infty$. Then there exists a random variable $Z_0 : \Omega \to \mathbb{R}$ that is $\mathcal{G}$-measurable, satisfies $E|Z_0| < \infty$, and is uniquely determined (up to equivalence w.p. 1) by
\[ E(Y I_A) = E(Z_0 I_A) \quad \text{for all } A \in \mathcal{G}. \tag{1.8} \]

Proof: Since $Y$ can be written as $Y = Y^+ - Y^-$, it is enough to consider the case $Y \ge 0$ w.p. 1. Let $Y_n = \min\{Y, n\}$ for $n = 1, 2, \ldots$. Then $EY_n^2 < \infty$ and by Theorem 12.1.2, $Z_n \equiv E(Y_n|\mathcal{G})$ is well defined, it is $\mathcal{G}$-measurable, and satisfies
\[ E(Y_n I_A) = E(Z_n I_A) \quad \text{for all } A \in \mathcal{G}. \tag{1.9} \]
Since $0 \le Y_n \le Y_{n+1}$, by Proposition 12.1.3, $0 \le Z_n \le Z_{n+1}$ w.p. 1 and so there exists a set $B \in \mathcal{G}$ such that $P(B) = 0$ and on $B^c$, $\{Z_n\}_{n\ge1}$ is nondecreasing and nonnegative. Let $Z_0 = \lim_{n\to\infty} Z_n$ on $B^c$ and 0 on $B$. Then $Z_0$ is $\mathcal{G}$-measurable. Applying the MCT to both sides of (1.9), one gets
\[ E(Y I_A) = E(Z_0 I_A) \quad \text{for all } A \in \mathcal{G}. \]
This proves the existence of a $\mathcal{G}$-measurable $Z_0$ satisfying (1.8). The uniqueness follows from the fact that if $Z_0$ and $Z_0'$ are $\mathcal{G}$-measurable with $E|Z_0| < \infty$, $E|Z_0'| < \infty$ and
\[ EZ_0 I_A = EZ_0' I_A \quad \text{for all } A \in \mathcal{G}, \]
then $Z_0 = Z_0'$ w.p. 1 (Problem 12.3). □

Remark 12.1.2: An alternative to the proof of Theorem 12.1.4 above leading to the definition of $E(Y|\mathcal{G})$ is via the Radon-Nikodym theorem. Here is an outline of this proof. Let $Y$ be a nonnegative random variable with $E|Y| < \infty$. Set $\mu(A) \equiv E(Y I_A)$ for all $A \in \mathcal{G}$. Then $\mu$ is a measure on $(\Omega, \mathcal{G})$ and it is dominated by $P_{\mathcal{G}}$, the restriction of $P$ to $\mathcal{G}$. By the Radon-Nikodym theorem, there is a $\mathcal{G}$-measurable function $Z$ such that
\[ E(Y I_A) = \mu(A) = \int_A Z\, dP_{\mathcal{G}} = EZI_A. \]
Extension to the case when $Y$ is real-valued with $E|Y| < \infty$ is via the decomposition $Y = Y^+ - Y^-$.

Remark 12.1.3: The arguments in the proof of Theorem 12.1.4 (and Problem 12.3) show that the conclusion of the theorem holds for any nonnegative random variable $Y$ for which $EY$ may or may not be finite.

Definition 12.1.3: Let $Y$ be an $\mathcal{F}$-measurable random variable on a probability space $(\Omega, \mathcal{F}, P)$ such that either $Y$ is nonnegative or $E|Y| < \infty$. A random variable $Z_0$ that is $\mathcal{G}$-measurable and satisfies (1.8) is called the conditional expectation of $Y$ given $\mathcal{G}$ and is written as $E(Y|\mathcal{G})$.

The following are some important consequences of (1.8):

(i) If $Y$ is $\mathcal{G}$-measurable then $E(Y|\mathcal{G}) = Y$.

(ii) If $\mathcal{G} = \mathcal{F}$, then $E(Y|\mathcal{G}) = Y$.

(iii) If $\mathcal{G} = \{\emptyset, \Omega\}$, then $E(Y|\mathcal{G}) = EY$.

(iv) By taking $A$ to be $\Omega$ in (1.8), $EY = E\big(E(Y|\mathcal{G})\big)$.

Furthermore, Proposition 12.1.3 extends to this case. When $\mathcal{G} = \sigma\langle X\rangle$ with $X$ discrete, (1.4) holds provided $E|Y| < \infty$.

Part (iv) is useful in computing $EY$ without explicitly determining the distribution of $Y$. Suppose $E(Y|X) = m(X)$ and $Em(X)$ is easy to compute but finding the distribution of $Y$ is not so easy. Then $EY$ can still be computed as $Em(X)$. For example, let $(X, Y)$ have a bivariate distribution with pdf
\[ f_{X,Y}(x, y) = \begin{cases} \dfrac{1}{\sqrt{2\pi}\,|x|}\, e^{-\frac{(y-x)^2}{2x^2}} \cdot \dfrac{1}{\sqrt{2\pi}}\, e^{-\frac{(x-1)^2}{2}} & \text{if } x \ne 0, \\ 0 & \text{if } x = 0, \end{cases} \quad (x, y) \in \mathbb{R}^2. \]
In this case, evaluating $f_Y(y)$ is not easy. On the other hand, it can be verified that for each $x$,
\[ m(x) \equiv \int y\, \frac{f_{X,Y}(x, y)}{f_X(x)}\, dy = x, \]
and that $f_X(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(x-1)^2}{2}}$. Thus, $EY = EX = 1$. For more examples of this type, see Problem 12.29.

The next proposition lists some useful properties of the conditional expectation.

Proposition 12.1.5: Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $Y$ be an $\mathcal{F}$-measurable random variable with $E|Y| < \infty$. Let $\mathcal{G}_1 \subset \mathcal{G}_2 \subset \mathcal{F}$ be two sub-σ-algebras contained in $\mathcal{F}$.

(i) Then

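\[ E(Y|\mathcal{G}_1) = E\big(E(Y|\mathcal{G}_2)|\mathcal{G}_1\big). \tag{1.10} \]

(ii) For any bounded $\mathcal{G}_1$-measurable random variable $U$,
\[ E(YU|\mathcal{G}_1) = U E(Y|\mathcal{G}_1). \tag{1.11} \]

Proof: (i) Let $A \in \mathcal{G}_1$, $Z_1 = E(Y|\mathcal{G}_1)$, and $Z_2 = E(Y|\mathcal{G}_2)$. Then $E(Y I_A) = E(Z_1 I_A)$ by the definition of $Z_1$. Since $\mathcal{G}_1 \subset \mathcal{G}_2$, $A \in \mathcal{G}_2$ and again by the definition of $Z_2$, $E(Y I_A) = E(Z_2 I_A)$. Thus,
\[ E(Z_2 I_A) = E(Z_1 I_A) \quad \text{for all } A \in \mathcal{G}_1 \]
and by the definition of $E(Z_2|\mathcal{G}_1)$, it follows that $Z_1 = E(Z_2|\mathcal{G}_1)$, proving (i).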


(ii) Let $Z_1 = E(Y|\mathcal{G}_1)$. If $U = I_B$ for some $B \in \mathcal{G}_1$, then for any $A \in \mathcal{G}_1$, $A \cap B \in \mathcal{G}_1$ and by (1.8),
\[ EY I_B I_A = EY I_{A\cap B} = E(Z_1 I_{A\cap B}) = E(Z_1 I_B \cdot I_A). \]
So in this case $E(YU|\mathcal{G}_1) = Z_1 U$. By linearity (Proposition 12.1.3 (ii)), it extends to all $U$ that are simple and $\mathcal{G}_1$-measurable. For any bounded $\mathcal{G}_1$-measurable $U$, there exists a sequence of bounded, $\mathcal{G}_1$-measurable, and simple random variables $\{U_n\}_{n\ge1}$ that converge to $U$ uniformly. Hence, for any $A \in \mathcal{G}_1$ and for $n \ge 1$,
\[ EY U_n I_A = EZ_1 U_n I_A. \]
The bounded convergence theorem applied to both sides yields $EY U I_A = EZ_1 U I_A$. Since $Z_1$ and $U$ are both $\mathcal{G}_1$-measurable, (ii) follows. □
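Both parts of Proposition 12.1.5 can be verified exactly on a small finite example. The sketch below takes a uniform eight-point sample space with two nested partitions generating the coarser and finer σ-algebras; the example and the helper name `cond_exp` are hypothetical, and conditional expectation given a partition is computed as a block average.

```python
from fractions import Fraction

def cond_exp(prob, y, partition):
    """Block averages of y: a version of E(y|G) for G generated by `partition`."""
    z = {}
    for block in partition:
        mass = sum(prob[w] for w in block)
        avg = sum(prob[w] * y[w] for w in block) / mass
        for w in block:
            z[w] = avg
    return z

omega = range(8)
prob = {w: Fraction(1, 8) for w in omega}
y = {w: Fraction(w * w) for w in omega}

coarse = [{0, 1, 2, 3}, {4, 5, 6, 7}]        # generates G1
fine = [{0, 1}, {2, 3}, {4, 5}, {6, 7}]      # generates G2, which contains G1

# (1.10): conditioning on G2 first and then on G1 equals conditioning on G1.
direct = cond_exp(prob, y, coarse)
tower = cond_exp(prob, cond_exp(prob, y, fine), coarse)

# (1.11): a G1-measurable factor U can be taken outside E(. | G1).
u = {w: Fraction(2) if w < 4 else Fraction(-3) for w in omega}
yu = {w: y[w] * u[w] for w in omega}
lhs = cond_exp(prob, yu, coarse)
rhs = {w: u[w] * direct[w] for w in omega}

print(direct == tower, lhs == rhs)
```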

Remark 12.1.4: If the random variable $U$ in Proposition 12.1.5 is $\mathcal{G}_1$-measurable and $E|YU| < \infty$, then part (ii) of the proposition holds. The proof needs a more careful approximation (see Billingsley (1995), p. 447).

An Approximation Theorem

Theorem 12.1.6: Let $H$ be a real Hilbert space and $M$ be a nonempty closed convex subset of $H$. Then for every $v \in H$, there is a unique $u_0 \in M$ such that
\[ \|v - u_0\| = \inf\{\|v - u\| : u \in M\} \tag{1.12} \]
where $\|x\|^2 = \langle x, x\rangle$, with $\langle x, y\rangle$ denoting the inner-product in $H$.

Proof: Let $\delta = \inf\{\|v - u\| : u \in M\}$. Then, $\delta \in [0, \infty)$. By definition, there exists a sequence $\{u_n\}_{n\ge1} \subset M$ such that $\|v - u_n\| \to \delta$. Also note that in any inner-product space, the parallelogram law holds, i.e., for any $x, y \in H$,
\[ \|x + y\|^2 + \|x - y\|^2 = 2\big(\|x\|^2 + \|y\|^2\big). \]
Thus
\[ \|2v - (u_n + u_m)\|^2 + \|u_n - u_m\|^2 = 2\big(\|v - u_n\|^2 + \|v - u_m\|^2\big). \tag{1.13} \]
Since $M$ is convex, $\frac{u_n + u_m}{2} \in M$, implying that $\big\|v - \frac{u_n + u_m}{2}\big\|^2 \ge \delta^2$. This, with (1.13), implies that
\[ \limsup_{m,n\to\infty} \|u_n - u_m\|^2 = 0, \]
making $\{u_n\}_{n\ge1}$ a Cauchy sequence. Since $H$ is a Hilbert space, there exists a $u_0 \in H$ such that $\{u_n\}_{n\ge1}$ converges to $u_0$. Also, since $M$ is closed, $u_0 \in M$. Since $\|v - u_n\| \to \delta$, it follows that $\|v - u_0\| = \delta$. To show the uniqueness, let $u_0' \in M$ also satisfy $\|v - u_0'\| = \delta$. Then as in (1.13),
\[ \Big\|v - \frac{u_0 + u_0'}{2}\Big\|^2 + \Big\|\frac{u_0 - u_0'}{2}\Big\|^2 = \delta^2, \]
implying $\|u_0 - u_0'\| = 0$.

Remark 12.1.5: The above theorem holds if $M$ is a closed subspace of $H$.
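In finite dimensions the minimizer of Theorem 12.1.6 can be written down explicitly for simple sets. The sketch below, which assumes $H = \mathbb{R}^3$ with the usual inner product, treats two hypothetical examples: the unit cube $[0,1]^3$ (a closed convex set, where the nearest point is obtained by clipping each coordinate) and the line spanned by a vector $a$ (a closed subspace, where the residual is orthogonal to $a$, in the spirit of (1.2)). All names are illustrative.

```python
import random
from math import sqrt

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def norm(a):
    return sqrt(dot(a, a))

def project_box(v, lo=0.0, hi=1.0):
    """Nearest point to v in the cube [lo, hi]^d: clip each coordinate."""
    return tuple(min(max(x, lo), hi) for x in v)

def project_line(v, a):
    """Nearest point to v on the line {c a : c real}."""
    c = dot(v, a) / dot(a, a)
    return tuple(c * x for x in a)

v = (2.0, -1.0, 0.5)
u_box = project_box(v)
a = (1.0, 1.0, 0.0)
u_line = project_line(v, a)

# No randomly drawn point of the cube is closer to v than u_box.
rng = random.Random(0)
candidates = [tuple(rng.random() for _ in v) for _ in range(1000)]
box_is_minimal = all(norm(sub(v, u_box)) <= norm(sub(v, u)) + 1e-12 for u in candidates)

# For the subspace, the residual v - u_line is orthogonal to a.
residual_dot = dot(sub(v, u_line), a)

print(u_box, box_is_minimal, u_line, residual_dot)
```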

12.2 Convergence theorems

From Proposition 12.1.3, it is seen that $E(Y|\mathcal{G})$ is monotone and linear in $Y$, suggesting that it behaves like an ordinary expectation. A natural question is whether under appropriate conditions, the basic convergence results extend to conditional expectations (CE). The answer is 'yes,' as shown by the following results.

Theorem 12.2.1: (Monotone convergence theorem for CE). Let $(\Omega, \mathcal{F}, P)$ be a probability space and $\mathcal{G} \subset \mathcal{F}$ be a sub-σ-algebra of $\mathcal{F}$. Let $\{Y_n\}_{n\ge1}$ be a sequence of nonnegative $\mathcal{F}$-measurable random variables such that $0 \le Y_n \le Y_{n+1}$ w.p. 1. Let $Y \equiv \lim_{n\to\infty} Y_n$ w.p. 1. Then
\[ \lim_{n\to\infty} E(Y_n|\mathcal{G}) = E(Y|\mathcal{G}) \quad \text{w.p. 1.} \tag{2.1} \]

Proof: By Proposition 12.1.3 (i), $Z_n \equiv E(Y_n|\mathcal{G})$ is monotone nondecreasing in $n$, w.p. 1, and so $Z \equiv \lim_{n\to\infty} Z_n$ exists w.p. 1. By the MCT, for all $A \in \mathcal{G}$,
\[ E(Y I_A) = \lim_{n\to\infty} EY_n I_A = \lim_{n\to\infty} E(Z_n I_A) = E(Z I_A). \]
Thus, $Z = E(Y|\mathcal{G})$ w.p. 1, proving (2.1). □

Theorem 12.2.2: (Fatou's lemma for CE). Let $\{Y_n\}_{n\ge1}$ be a sequence of nonnegative random variables on a probability space $(\Omega, \mathcal{F}, P)$ and let $\mathcal{G}$ be a sub-σ-algebra of $\mathcal{F}$. Then
\[ \liminf_{n\to\infty} E(Y_n|\mathcal{G}) \ge E\big(\liminf_{n\to\infty} Y_n \,\big|\, \mathcal{G}\big). \tag{2.2} \]

Proof: Let $\tilde Y_n = \inf_{j\ge n} Y_j$. Then $\{\tilde Y_n\}_{n\ge1}$ is a sequence of nonnegative nondecreasing random variables and $\lim_{n\to\infty} \tilde Y_n = \liminf_{n\to\infty} Y_n$. By the previous theorem,
\[ \lim_{n\to\infty} E(\tilde Y_n|\mathcal{G}) = E\big(\liminf_{n\to\infty} Y_n \,\big|\, \mathcal{G}\big). \tag{2.3} \]
Also, since $\tilde Y_n \le Y_j$ for each $j \ge n$,
\[ E(\tilde Y_n|\mathcal{G}) \le E(Y_j|\mathcal{G}) \quad \text{for each } j \ge n \text{ w.p. 1,} \]
implying that $E(\tilde Y_n|\mathcal{G}) \le \inf_{j\ge n} E(Y_j|\mathcal{G})$ w.p. 1. The right side converges to $\liminf_{n\to\infty} E(Y_n|\mathcal{G})$ w.p. 1. Now (2.2) follows from (2.3). □

It is easy to deduce from Fatou's lemma the following result (Problem 12.4).

Theorem 12.2.3: (Dominated convergence theorem for CE). Let $\{Y_n\}_{n\ge1}$ and $Y$ be random variables on a probability space $(\Omega, \mathcal{F}, P)$ and let $\mathcal{G}$ be a sub-σ-algebra of $\mathcal{F}$. Suppose that $\lim_{n\to\infty} Y_n = Y$ w.p. 1 and that there exists a random variable $Z$ such that $|Y_n| \le Z$ w.p. 1 and $EZ < \infty$. Then
\[ \lim_{n\to\infty} E(Y_n|\mathcal{G}) = E(Y|\mathcal{G}) \quad \text{w.p. 1.} \tag{2.4} \]

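Theorem 12.2.4: (Jensen's inequality for CE). Let $\phi : (a, b) \to \mathbb{R}$ be convex for some $-\infty \le a < b \le \infty$. Let $Y$ be a random variable on a probability space $(\Omega, \mathcal{F}, P)$ such that $P(Y \in (a, b)) = 1$ and $E|\phi(Y)| < \infty$. Let $\mathcal{G}$ be a sub-σ-algebra of $\mathcal{F}$. Then
\[ \phi\big(E(Y|\mathcal{G})\big) \le E\big(\phi(Y)|\mathcal{G}\big). \tag{2.5} \]

Proof: By the convexity of $\phi$ on $(a, b)$, for any $c, x \in (a, b)$,
\[ \phi(x) - \phi(c) - (x - c)\phi_-'(c) \ge 0, \tag{2.6} \]
where $\phi_-'(c)$ is the left derivative of $\phi$ at $c$. Taking $c = E(Y|\mathcal{G})$ and $x = Y$ in (2.6), one gets
\[ Z \equiv \phi(Y) - \phi\big(E(Y|\mathcal{G})\big) - \big(Y - E(Y|\mathcal{G})\big)\phi_-'\big(E(Y|\mathcal{G})\big) \ge 0. \tag{2.7} \]
Note that $E\big(\phi(E(Y|\mathcal{G}))\,\big|\,\mathcal{G}\big) = \phi\big(E(Y|\mathcal{G})\big)$ and that, by (1.11),
\[ E\Big[\big(Y - E(Y|\mathcal{G})\big)\phi_-'\big(E(Y|\mathcal{G})\big) \,\Big|\, \mathcal{G}\Big] = \phi_-'\big(E(Y|\mathcal{G})\big)\, E\Big[\big(Y - E(Y|\mathcal{G})\big) \,\Big|\, \mathcal{G}\Big] = 0. \]
Also, from (2.7), $E(Z|\mathcal{G}) \ge 0$ and hence, $E\big(\phi(Y)|\mathcal{G}\big) \ge \phi\big(E(Y|\mathcal{G})\big)$. □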

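The following inequalities are a direct consequence of Theorem 12.2.4.

Corollary 12.2.5: Let $Y$ be a random variable on a probability space $(\Omega, \mathcal{F}, P)$ and let $\mathcal{G}$ be a sub-σ-algebra of $\mathcal{F}$.

(i) If $EY^2 < \infty$, then $E(Y^2|\mathcal{G}) \ge \big(E(Y|\mathcal{G})\big)^2$.

(ii) If $E|Y|^p < \infty$ for some $p \in [1, \infty)$, then $E(|Y|^p|\mathcal{G}) \ge |E(Y|\mathcal{G})|^p$.

Definition 12.2.1: Let $EY^2 < \infty$. The conditional variance of $Y$ given $\mathcal{G}$, denoted by $\mathrm{Var}(Y|\mathcal{G})$, is defined as
\[ \mathrm{Var}(Y|\mathcal{G}) = E(Y^2|\mathcal{G}) - \big(E(Y|\mathcal{G})\big)^2. \tag{2.8} \]

This leads to the following formula for a decomposition of the variance of $Y$, known as the Analysis of Variance formula.

Theorem 12.2.6: Let $EY^2 < \infty$. Then
\[ \mathrm{Var}(Y) = \mathrm{Var}\big(E(Y|\mathcal{G})\big) + E\big(\mathrm{Var}(Y|\mathcal{G})\big). \tag{2.9} \]

Proof: $\mathrm{Var}(Y) = E(Y - EY)^2$. But $Y - EY = Y - E(Y|\mathcal{G}) + E(Y|\mathcal{G}) - EY$. Also by (1.11),
\begin{align*}
E\Big[\big(Y - E(Y|\mathcal{G})\big)\big(E(Y|\mathcal{G}) - EY\big) \,\Big|\, \mathcal{G}\Big] &= \big(E(Y|\mathcal{G}) - EY\big)\, E\Big[\big(Y - E(Y|\mathcal{G})\big) \,\Big|\, \mathcal{G}\Big] \\
&= \big(E(Y|\mathcal{G}) - EY\big) \cdot 0 = 0.
\end{align*}
Thus, $E\big[\big(Y - E(Y|\mathcal{G})\big)\big(E(Y|\mathcal{G}) - EY\big)\big] = 0$ and so
\[ E(Y - EY)^2 = E\big(Y - E(Y|\mathcal{G})\big)^2 + E\big(E(Y|\mathcal{G}) - EY\big)^2. \tag{2.10} \]
Now, noting that $E\big[Y E(Y|\mathcal{G})\big] = E\big[E\big(Y E(Y|\mathcal{G})\,\big|\,\mathcal{G}\big)\big] = E\big[\big(E(Y|\mathcal{G})\big)^2\big]$, one gets
\begin{align*}
E\big(Y - E(Y|\mathcal{G})\big)^2 &= EY^2 - 2E\big[Y E(Y|\mathcal{G})\big] + E\big(E(Y|\mathcal{G})\big)^2 \\
&= EY^2 - E\big(E(Y|\mathcal{G})\big)^2 \\
&= E\big[E(Y^2|\mathcal{G}) - \big(E(Y|\mathcal{G})\big)^2\big] = E\big(\mathrm{Var}(Y|\mathcal{G})\big).
\end{align*}
Also, since $E\big[E(Y|\mathcal{G})\big] = EY$,
\[ E\big(E(Y|\mathcal{G}) - EY\big)^2 = \mathrm{Var}\big(E(Y|\mathcal{G})\big). \]
Hence, (2.9) follows from (2.10). □

Remark 12.2.1: $E\big(\mathrm{Var}(Y|\mathcal{G})\big)$ is called the variance within and $\mathrm{Var}\big(E(Y|\mathcal{G})\big)$ is the variance between. The above proof also shows that
\[ E(Y - Z)^2 = E\big(Y - E(Y|\mathcal{G})\big)^2 + E\big(E(Y|\mathcal{G}) - Z\big)^2 \tag{2.11} \]
for any random variable $Z$ that is $\mathcal{G}$-measurable. This is used to prove the Rao-Blackwell theorem in mathematical statistics (Lehmann and Casella (1998)) (Problem 12.27).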
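The decomposition (2.9) can be illustrated by simulation on the bivariate density used as an example in Section 12.1, under which $X \sim N(1, 1)$ and, given $X = x$, $Y \sim N(x, x^2)$. With $\mathcal{G} = \sigma\langle X\rangle$ this gives $E(Y|X) = X$ and $\mathrm{Var}(Y|X) = X^2$, so the variance between is $\mathrm{Var}(X) = 1$ and the variance within is $EX^2 = 2$. The sketch below is a rough Monte Carlo check only; the sample size, seed, and function names are arbitrary choices.

```python
import random

def mean(values):
    return sum(values) / len(values)

def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def simulate(n, seed=0):
    """Draw (X, Y) with X ~ N(1, 1) and Y | X = x ~ N(x, x^2)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(1.0, 1.0)
        y = rng.gauss(x, abs(x))   # standard deviation |x|
        xs.append(x)
        ys.append(y)
    return xs, ys

xs, ys = simulate(100_000)
between = variance(xs)                 # estimates Var(E(Y|X)) = Var(X)
within = mean([x * x for x in xs])     # estimates E Var(Y|X) = E X^2
total = variance(ys)                   # estimates Var(Y)

print(between, within, total)  # total should be close to between + within
```

The three estimates are subject to sampling error, so agreement is only approximate.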


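12.3 Conditional probability

Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $\mathcal{G} \subset \mathcal{F}$ be a sub-σ-algebra.

Definition 12.3.1: For $B \in \mathcal{F}$, the conditional probability of $B$ given $\mathcal{G}$, denoted by $P(B|\mathcal{G})$, is defined as
\[ P(B|\mathcal{G}) = E(I_B|\mathcal{G}). \tag{3.1} \]
Thus $Z \equiv P(B|\mathcal{G})$ is a $\mathcal{G}$-measurable function such that
\[ P(A \cap B) = E(Z I_A) \quad \text{for all } A \in \mathcal{G}. \tag{3.2} \]
Since $0 \le P(A \cap B) \le P(A)$ for all $A \in \mathcal{G}$, it follows that $0 \le P(B|\mathcal{G}) \le 1$ w.p. 1. It is easy to check that w.p. 1
\[ P(\Omega|\mathcal{G}) = 1 \quad \text{and} \quad P(\emptyset|\mathcal{G}) = 0. \]
Also, by linearity, if $B_1, B_2 \in \mathcal{F}$, $B_1 \cap B_2 = \emptyset$, then
\[ P(B_1 \cup B_2|\mathcal{G}) = P(B_1|\mathcal{G}) + P(B_2|\mathcal{G}) \quad \text{w.p. 1.} \]
This suggests that w.p. 1, $P(B|\mathcal{G})$ is countably additive as a set function in $B$. That is, there exists a set $A_0 \in \mathcal{G}$ such that $P(A_0) = 0$ and for all $\omega \notin A_0$, the map $B \to P(B|\mathcal{G})(\omega)$ is countably additive. However, this is not true in general. For a given collection $\{B_n\}_{n\ge1}$ of disjoint sets in $\mathcal{F}$, there is indeed an exceptional set $A_0$ such that $P(A_0) = 0$ and for $\omega \notin A_0$,
\[ P\Big(\bigcup_{n\ge1} B_n \,\Big|\, \mathcal{G}\Big)(\omega) = \sum_{n=1}^{\infty} P(B_n|\mathcal{G})(\omega). \]
But this $A_0$ depends on $\{B_n\}_{n\ge1}$ and as the collection varies, these exceptional sets can be an uncountable collection whose union may not be contained in a set of probability zero.

Definition 12.3.2: Let $(\Omega, \mathcal{F}, P)$ be a probability space and $\mathcal{G}$ be a sub-σ-algebra of $\mathcal{F}$. A function $\mu : \mathcal{F} \times \Omega \to [0, 1]$ is called a regular conditional probability on $\mathcal{F}$ given $\mathcal{G}$ if

(i) for all $B \in \mathcal{F}$, $\mu(B, \omega) = P(B|\mathcal{G})$ w.p. 1, and

(ii) for all $\omega \in \Omega$, $\mu(\cdot, \omega)$ is a probability measure on $(\Omega, \mathcal{F})$.

If a regular conditional probability (r.c.p.) $\mu(\cdot, \cdot)$ exists on $\mathcal{F}$ given $\mathcal{G}$, then the conditional expectation of $Y$ given $\mathcal{G}$ can be computed as
\[ E(Y|\mathcal{G})(\omega) = \int Y(\omega')\, \mu(d\omega', \omega) \quad \text{w.p. 1} \]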


for all $Y$ such that $E|Y| < \infty$. The proof of this is via standard approximation using simple random variables (Problem 12.15). A sufficient condition for the existence of r.c.p. is provided by the following result.

Theorem 12.3.1: Let $(\Omega, \mathcal{F}, P)$ be a probability space. Let $S$ be a Polish space and $\mathcal{S}$ be its Borel σ-algebra. Let $X$ be an $S$-valued random variable on $(\Omega, \mathcal{F})$. Then for any σ-algebra $\mathcal{G} \subset \mathcal{F}$, there is a regular conditional probability on $\sigma\langle X\rangle$ given $\mathcal{G}$, where $\sigma\langle X\rangle = \{X^{-1}(D) : D \in \mathcal{S}\}$.

Proof: (for $S = \mathbb{R}$). Let $\mathbb{Q} = \{r_j\}$ be the set of rationals. Let $F(r_j, \omega) = P(X \le r_j|\mathcal{G})(\omega)$ w.p. 1. Then, there is a set $A_0 \in \mathcal{G}$ such that $P(A_0) = 0$ and for $\omega \notin A_0$, $F(r, \omega)$ is monotone nondecreasing on $\mathbb{Q}$. For $x \in \mathbb{R}$, set
\[ F(x, \omega) \equiv \begin{cases} \sup\{F(r, \omega) : r \le x\} & \text{if } \omega \notin A_0 \\ F_0(x) & \text{if } \omega \in A_0, \end{cases} \]
where $F_0(x)$ is a fixed cdf (say, $F_0 = \Phi$, the standard normal cdf). Then, $F(x, \omega)$ is a cdf in $x$ for each $\omega$ and for each $x$, $F(x, \cdot)$ is $\mathcal{G}$-measurable. Let $\mu(B, \omega)$ be the Lebesgue-Stieltjes measure induced by $F(\cdot, \omega)$. Then it can be checked using the $\pi$-$\lambda$ theorem (Theorem 1.1.2) that $\mu(\cdot, \cdot)$ is a regular conditional probability on $\sigma\langle X\rangle$ given $\mathcal{G}$ (Problem 12.16). □

Remark 12.3.1: When $\mathcal{F} = \sigma\langle X\rangle$, the regular conditional probability on $\mathcal{F}$ given $\mathcal{G}$ is also called the regular conditional probability distribution of $X$ given $\mathcal{G}$.

Remark 12.3.2: For a proof for the general Polish case, see Durrett (2004) and Parthasarathy (1967).

12.4 Problems

12.1 Let $(X, Y)$ be a bivariate random vector with $EY^2 < \infty$. Let $H = L^2(\mathbb{R}^2, \mathcal{B}(\mathbb{R}^2), P_{X,Y})$ and $H_0 = \{h(X) \mid h : \mathbb{R} \to \mathbb{R}$ is Borel measurable and $Eh(X)^2 < \infty\}$. Suppose that for some $h(X) \in H_0$,
\[ EY I_A = Eh(X) I_A \quad \text{for all } A \in \sigma\langle X\rangle. \]
Show that $E(Y - h(X))Z_1 = 0$ for all $Z_1 \in H_0$. Show also that $H_0$ is a closed subspace of $H$. (Hint: For any $Z_1 \in H_0$, there exists a sequence of simple random variables $\{W_n\}_{n\ge1} \subset H_0$ such that $|W_n| \le |Z_1|$ and $W_n \to Z_1$ a.s. Now, apply the DCT. For the second part, use the fact that $f : \Omega \to \mathbb{R}$ is $\sigma\langle X\rangle$-measurable iff there is a Borel measurable function $h : \mathbb{R} \to \mathbb{R}$ such that $f = h(X)$.)


12.2 Let $(X, Y)$ be a bivariate random vector that has an absolutely continuous distribution on $(\mathbb{R}^2, \mathcal{B}(\mathbb{R}^2))$ w.r.t. the Lebesgue measure with density $f(x, y)$. Suppose that $E|Y| < \infty$. Show that a version of $E(Y|X)$ is given by $h_0(X)$, where, with $f_X(x) = \int f(x, y)\, dy$,
\[ h_0(x) = \begin{cases} \dfrac{\int y f(x, y)\, dy}{f_X(x)} & \text{if } f_X(x) > 0 \\ 0 & \text{otherwise.} \end{cases} \]
(Hint: Verify (1.8) for all $A \in \sigma\langle X\rangle$.)

12.3 Let $Z_1$ and $Z_2$ be two random variables on a probability space $(\Omega, \mathcal{G}, P)$.

(a) Suppose that $E|Z_1| < \infty$, $E|Z_2| < \infty$ and
\[ EZ_1 I_A = EZ_2 I_A \quad \text{for all } A \in \mathcal{G}. \tag{4.1} \]
Show that $P(Z_1 = Z_2) = 1$.

(b) Suppose that $Z_1$ and $Z_2$ are nonnegative and (4.1) holds. Show that $P(Z_1 = Z_2) = 1$.

(c) Prove Proposition 12.1.3.

(Hint: (a) Consider (4.1) with $A_1 = \{Z_1 - Z_2 > 0\}$ and $A_2 = \{Z_1 - Z_2 < 0\}$ and conclude that $P(A_1) = 0 = P(A_2)$. (b) Let $A_{1n} = \{Z_1 \le n, Z_2 \le n, Z_1 - Z_2 > 0\}$ and $A_{2n} = \{Z_1 \le n, Z_2 \le n, Z_1 - Z_2 < 0\}$, $n \ge 1$. Then, by (4.1), $P(A_{1n}) = 0 = P(A_{2n})$ for all $n \ge 1$. But $A_i = \bigcup_{n\ge1} A_{in}$, $i = 1, 2$, where the $A_i$'s are as above.)

12.4 Prove Theorem 12.2.3.

12.5 Let $X_i$ be a $k_i$-dimensional random vector, $k_i \in \mathbb{N}$, $i = 1, 2$ such that $X_1$ and $X_2$ are independent. Let $h : \mathbb{R}^{k_1+k_2} \to [0, \infty)$ be a Borel measurable function. Show that
\[ E\big(h(X_1, X_2) \mid X_1\big) = g(X_1) \tag{4.2} \]
where $g(x) = Eh(x, X_2)$, $x \in \mathbb{R}^{k_1}$. Show that (4.2) is also valid for a real valued $h$ with $E\big|h(X_1, X_2)\big| < \infty$.

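(Hint: Let $k = k_1 + k_2$, $\Omega = \mathbb{R}^k$, $\mathcal{F} = \mathcal{B}(\mathbb{R}^k)$, $P = P_{X_1} \times P_{X_2}$. Verify (1.8) for all $A \in \{A_1 \times \mathbb{R}^{k_2} : A_1 \in \mathcal{B}(\mathbb{R}^{k_1})\} \equiv \sigma\langle X_1\rangle$.)

12.6 Let $X$ be a random variable on a probability space $(\Omega, \mathcal{F}, P)$ with $EX^2 < \infty$ and let $\mathcal{G} \subset \mathcal{F}$ be a sub-σ-field.

(a) Show that for any $A \in \mathcal{G}$,
\[ \Big|\int_A E(X|\mathcal{G})\, dP\Big| \le \Big(\int_A E(X^2|\mathcal{G})\, dP\Big)^{1/2}. \tag{4.3} \]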


(b) Show that (4.3) is valid for all A ∈ F. 12.7 Let f : (Rk , B(Rk ), P ) → (R, B(R)) be an integrable function where −k

P (A) = 2

exp A

−

k

|xi | dx1 , . . . , dxk , A ∈ B(Rk ).

i=1

For each of the following cases, find a version of E(f |G) and justify your answer: (a) G = σ{A ∈ B(Rk ) : A = −A}, (b) G = σ{(j1 , j1 + 1] × · · · × (jk , jk + 1] : j1 , . . . , jk ∈ Z}, (c) G = σ{B × {0} : B ∈ B(Rk−1 )}. 12.8 Let (Ω, F, P ) be a probability space and G = {∅, B, B c , Ω} for some B ∈ F with P (B) ∈ (0, 1). Determine P (A|G) for A ∈ F. 12.9 Let {Xn : n ∈ Z} be a collection of independent random variables with E|X0 | < ∞. Show that (a) E(X0 | X1 , . . . , Xn ) = EX0 for any n ∈ N, (b) E(X0 | X−n , . . . , X−1 ) = EX0 for any n ∈ N, (c) E(X0 | X1 , X2 , . . .) = EX0 = E(X0 | . . . , X−2 , X−1 ). 12.10 Let X be a random variable on a probability space (Ω, F, P ) with E|X| < ∞ and let C be a π-system such that σ⟨C⟩ = G ⊂ F. Suppose that there exists a G-measurable function f : Ω → R such that
$$\int_A f\,dP = \int_A X\,dP \quad \text{for all } A \in \mathcal{C}.$$

Show that f = E(X|G). 12.11 Let X and Y be integrable random variables on (Ω, F, P ) and let C be a semi-algebra, C ⊂ F. Suppose that $\int_A X\,dP \le \int_A Y\,dP$ for all A ∈ C. Show that E(X|G) ≤ E(Y |G) where G = σ⟨C⟩. 12.12 Let X, Y ∈ L²(Ω, F, P ). If E(X|Y ) = Y and E(Y |X) = X, then P (X = Y ) = 1. (Hint: Show that E(X − Y )² = EX² − EY².) 12.13 Let {Xn }n≥1 , X be a collection of random variables on (Ω, F, P ) and let G be a sub-σ-algebra of F. If limn→∞ E|Xn − X|^r = 0 for some r ≥ 1, then
$$\lim_{n\to\infty} E\big| E(X_n \mid \mathcal{G}) - E(X \mid \mathcal{G}) \big|^r = 0.$$


12. Conditional Expectation and Conditional Probability

12.14 Let X, Y ∈ L²(Ω, F, P ) and let G be a sub-σ-algebra of F. Show that E{Y E(X|G)} = E{XE(Y |G)}. 12.15 Let Y be an integrable random variable on (Ω, F, P ) and let µ be a r.c.p. on F given G. Show that $h(\omega) \equiv \int Y(\omega_1)\,\mu(d\omega_1, \omega)$, ω ∈ Ω, is a version of E(Y |G). (Hint: Prove this first for Y = IA , A ∈ F. Extend to simple functions by linearity. Use the DCT for CE for the general case.) 12.16 Complete the proof of Theorem 12.3.1 for S = R. 12.17 Let (Ω, F, P ) be a probability space, G be a sub-σ-algebra of F, and let {An }n≥1 ⊂ F be a collection of disjoint sets. Show that
$$P\Big( \bigcup_{n\ge 1} A_n \,\Big|\, \mathcal{G} \Big) = \sum_{n=1}^{\infty} P(A_n \mid \mathcal{G}).$$

Definition 12.4.1: Let G be a σ-algebra and let {Gλ : λ ∈ Λ} be a collection of subsets of F in a probability space (Ω, F, P ). Then, {Gλ : λ ∈ Λ} is called conditionally independent given G if for any λ1 , . . . , λk ∈ Λ, k ∈ N,
$$P(A_1 \cap \cdots \cap A_k \mid \mathcal{G}) = \prod_{i=1}^{k} P(A_i \mid \mathcal{G})$$
for all $A_1 \in \mathcal{G}_{\lambda_1}, \ldots, A_k \in \mathcal{G}_{\lambda_k}$. A collection of random variables {Xλ : λ ∈ Λ} on (Ω, F, P ) is called conditionally independent given G if {σ⟨Xλ⟩ : λ ∈ Λ} is conditionally independent given G. 12.18 Let G1 , G2 , G3 be three sub-σ-algebras of F. Recall that Gi ∨ Gj = σ⟨Gi ∪ Gj⟩, 1 ≤ i ≠ j ≤ 3. Show that G1 and G2 are conditionally independent given G3 iff
$$P(A_1 \mid \mathcal{G}_2 \vee \mathcal{G}_3) = P(A_1 \mid \mathcal{G}_3) \quad \text{for all } A_1 \in \mathcal{G}_1,$$
iff E(X|G2 ∨ G3 ) = E(X|G3 ) for every X ∈ L¹(Ω, G1 ∨ G3 , P ). 12.19 Let G1 , G2 , G3 be sub-σ-algebras of F. Show that if G1 ∨ G3 is independent of G2 , then G1 and G2 are conditionally independent given G3 . 12.20 Give an example where E(E(Y |X1 ) | X2 ) ≠ E(E(Y |X2 ) | X1 ).


12.21 Let X be an Exponential (1) random variable. For t > 0, let Y1 = min{X, t} and Y2 = max{X, t}. Find E(X|Yi ), i = 1, 2. (Hint: Verify that σ⟨Y1⟩ is the σ-algebra generated by the collection {X −1 (A) : A ∈ B(R), A ⊂ [0, t)} ∪ {X −1 [t, ∞)}.) 12.22 Let (X, Y ) be a bivariate random vector with a joint pdf f (x, y) w.r.t. the Lebesgue measure. Show that E(X|X + Y ) = h(X + Y ) where
$$h(z) = \frac{\int x f(x, z - x)\,dx}{\int f(x, z - x)\,dx}\; I_{(0,\infty)}\Big( \int f(x, z - x)\,dx \Big).$$
12.23 Let {Xi }i≥1 be iid random variables with E|X1 | < ∞. Show that for any n ≥ 1,
$$E\big( X_1 \mid (X_1 + X_2 + \cdots + X_n) \big) = \frac{X_1 + X_2 + \cdots + X_n}{n}.$$
(Hint: Show that E(Xi | (X1 + · · · + Xn )) is the same for all 1 ≤ i ≤ n.) Definition 12.4.2: A finite collection of random variables {Xi : 1 ≤ i ≤ n} on a probability space (Ω, F, P ) is said to be exchangeable if for any permutation (i1 , i2 , . . . , in ) of (1, 2, . . . , n), the joint distribution of (Xi1 , Xi2 , . . . , Xin ) is the same as that of (X1 , X2 , . . . , Xn ). A sequence of random variables {Xi }i≥1 on a probability space (Ω, F, P ) is said to be exchangeable if for any finite n, the collection {Xi : 1 ≤ i ≤ n} is exchangeable. 12.24 Let {Xi : 1 ≤ i ≤ n+1} be a finite collection of random variables such that, conditional on Xn+1 , {X1 , X2 , . . . , Xn } are iid. Show that {Xi : 1 ≤ i ≤ n} is exchangeable. 12.25 Let {Xi : 1 ≤ i ≤ n} be exchangeable. Suppose E|X1 | < ∞. Show that
$$E\big( X_1 \mid (X_1 + \cdots + X_n) \big) = \frac{X_1 + X_2 + \cdots + X_n}{n}.$$
12.26 Let (X1 , X2 , X3 ) be random variables such that
$$P(X_2 \in \cdot \mid X_1) = p_1(X_1, \cdot) \quad \text{and} \quad P(X_3 \in \cdot \mid X_1, X_2) = p_2(X_2, \cdot),$$
where for each i = 1, 2, pi (x, ·) is a probability transition function on R as defined in Example 6.3.8. Suppose pi (x, ·) admits a pdf fi (x, ·), i = 1, 2. Show that P (X1 ∈ · | X2 , X3 ) = P (X1 ∈ · | X2 ). (This says that if {X1 , X2 , X3 } has the Markov property, then so does {X3 , X2 , X1 }.)


12.27 (Rao-Blackwell theorem). Let Y ∈ L²(Ω, F, P ) and G ⊂ F be a sub-σ-algebra. Show that there exists Z ∈ L²(Ω, G, P ) such that EZ = EY and Var(Z) ≤ Var(Y ). (Hint: Consider Z = E(Y |G).) 12.28 Let (X, Y ) have an absolutely continuous bivariate distribution with density fX,Y (x, y). Show that there is a regular conditional probability on σ⟨Y⟩ given σ⟨X⟩ and that this probability measure induces an absolutely continuous distribution on R. Find its density. 12.29 Suppose, in the above problem,
$$f_{X,Y}(x, y) = \frac{1}{\sigma(x)}\, \phi\Big( \frac{y - m(x)}{\sigma(x)} \Big)\, g(x)$$

where m(·), σ(·), φ(·), and g(·) are all Borel measurable functions on R to R with σ, φ, and g being nonnegative and φ and g being probability densities. (a) Find the marginal probability densities fX (·) and fY (·) of X and Y , respectively. Set up the integrals for EX and EY . (b) Using the conditioning argument in Proposition 12.1.5, show that
$$EY = \int m(x) g(x)\,dx + \Big( \int u \phi(u)\,du \Big) \Big( \int \sigma(x) g(x)\,dx \Big)$$
(assuming that all the integrals are well defined). (c) Find similar expressions for EY² and E(e^{tY}). 12.30 Let X, Y , Z ∈ L¹(Ω, F, P ). Suppose that E(X|Y ) = Z, E(Y |Z) = X, E(Z|X) = Y. Show that X = Y = Z w.p. 1. 12.31 Let X, Y ∈ L²(Ω, F, P ). Suppose E|Y |⁴ < ∞. Show that
$$\min\big\{ E|X - (a + bY + cY^2)|^2 : a, b, c \in \mathbb{R} \big\} = \max\big\{ E(XZ) : Z \in L^2(\Omega, \mathcal{F}, P),\ EZ = 0,\ EZY = 0,\ EZY^2 = 0,\ EZ^2 = 1 \big\}.$$
12.32 Let X ∈ L²(Ω, F, P ) and G be a sub-σ-algebra of F. (a) Show that
$$\min\big\{ E(X - Y)^2 : Y \in L^2(\Omega, \mathcal{G}, P) \big\} = \max\big\{ (EXZ)^2 : EZ^2 = 1,\ E(Z \mid \mathcal{G}) = 0 \big\}.$$

(b) Find a random variable Z such that E(Z|G) = 0 w.p. 1 and ρ ≡ corr(X, Z) is maximized.

13 Discrete Parameter Martingales

13.1 Definitions and examples

This section deals with a class of stochastic processes called martingales. Martingales arise in a natural way in many problems in probability and statistics. They provide a more general framework than the case of independent random variables in which results like the SLLN, the CLT, and other convergence theorems can be established. Much of the discrete parameter martingale theory was developed by the great American mathematician J. L. Doob, whose book (Doob (1953)) has been very influential. Definition 13.1.1: Let (Ω, F, P ) be a probability space and let N = {1, . . . , n0 } be a nonempty subset of ℕ = {1, 2, . . .}, n0 ≤ ∞. (a) A collection {Fn : n ∈ N } of sub-σ-algebras of F is called a filtration if Fn ⊂ Fn+1 for all 1 ≤ n < n0 . (b) A collection of random variables {Xn : n ∈ N } is said to be adapted to the filtration {Fn : n ∈ N } if Xn is Fn -measurable for all n ∈ N . (c) Given a filtration {Fn : n ∈ N } and random variables {Xn : n ∈ N }, the collection {(Xn , Fn ) : n ∈ N } is called a martingale if (i) {Xn : n ∈ N } is adapted to {Fn : n ∈ N }, (ii) E|Xn | < ∞ for all n ∈ N , and (iii) for all 1 ≤ n < n0 , E(Xn+1 |Fn ) = Xn .

(1.1)


When N = ℕ, there is no maximum element in N . In this case, Definition 13.1.1 is to be interpreted by setting n0 = +∞ in parts (a) and (c) (iii). A similar convention applies to Definition 13.1.2 below. Also, recall that equalities and inequalities involving conditional expectations are interpreted as being valid w.p. 1. If {(Xn , Fn ) : n ∈ N } is a martingale, then {Xn : n ∈ N } is also said to be a martingale w.r.t. (the filtration) {Fn : n ∈ N }. Also {Xn : n ∈ N } is called a martingale if it is a martingale w.r.t. some filtration. Observe that if {Xn : n ∈ N } is a martingale w.r.t. any given filtration {Fn : n ∈ N }, it is also a martingale w.r.t. the natural filtration {$\mathcal{X}_n$ : n ∈ N }, where $\mathcal{X}_n$ = σ⟨{X1 , . . . , Xn }⟩, n ∈ N . Clearly, {Xn : n ∈ N } is adapted to {$\mathcal{X}_n$ : n ∈ N }. To see that E(Xn+1 | $\mathcal{X}_n$) = Xn for all 1 ≤ n < n0 , note that $\mathcal{X}_n$ ⊂ Fn for all n ∈ N and hence,
$$E(X_{n+1} \mid \mathcal{X}_n) = E\big( E(X_{n+1} \mid \mathcal{F}_n) \mid \mathcal{X}_n \big) = E(X_n \mid \mathcal{X}_n) = X_n. \tag{1.2}$$

Thus, {(Xn , $\mathcal{X}_n$) : n ∈ N } is a martingale. A classic interpretation of martingales in the context of gambling is given as follows. Let Xn represent the fortune of a gambler at the end of the nth play and let Fn be the information available to the gambler up to and including the nth play. Then, Fn contains the knowledge of all events like {Xj ≤ r} for r ∈ R, j ≤ n, making Xn measurable w.r.t. Fn . Condition (iii) in Definition 13.1.1 (c) says that given all the information up until the end of the nth play, the expected fortune of the gambler at the end of the (n + 1)th play remains unchanged. Thus a martingale represents a fair game. In situations where the game puts the gambler in a favorable or unfavorable position, one may express that by suitably modifying condition (iii), yielding what are known as sub- and super-martingales, respectively. Definition 13.1.2: Let {Fn : n ∈ N } be a filtration and {Xn : n ∈ N } be a collection of random variables in L¹(Ω, F, P ) adapted to {Fn : n ∈ N }. Then {(Xn , Fn ) : n ∈ N } is called a sub-martingale if
$$E(X_{n+1} \mid \mathcal{F}_n) \ge X_n \quad \text{for all } 1 \le n < n_0, \tag{1.3}$$
and a super-martingale if
$$E(X_{n+1} \mid \mathcal{F}_n) \le X_n \quad \text{for all } 1 \le n < n_0. \tag{1.4}$$

Suppose that {(Xn , Fn ) : n ∈ N } is a sub-martingale. Then A ∈ Fn implies that A ∈ Fn+1 ⊂ . . . ⊂ Fn+k for every k ≥ 1, n + k ∈ N and hence, by (1.3),
$$\int_A X_n\,dP \le \int_A E(X_{n+1} \mid \mathcal{F}_n)\,dP = \int_A X_{n+1}\,dP \le \cdots \le \int_A X_{n+k}\,dP. \tag{1.5}$$

Therefore, E(Xn+k |Fn ) ≥ Xn and, by taking A = Ω in (1.5), EXn+k ≥ EXn . Thus, the expected value of a sub-martingale is nondecreasing in n. For a martingale, by (1.1), equality holds at every step of (1.5) and hence,
$$E(X_{n+k} \mid \mathcal{F}_n) = X_n, \qquad EX_{n+k} = EX_n \tag{1.6}$$

for all k ≥ 1, n, n + k ∈ N . Thus, in a fair game, the expected fortune of the gambler remains constant over time. Here are some examples. Example 13.1.1: (Random walk). Let Z1 , Z2 , . . . be a sequence of iid random variables on a probability space (Ω, F, P ) with finite mean µ = EZ1 and let Fn = σ⟨Z1 , . . . , Zn⟩, n ≥ 1. Let Xn = Z1 + . . . + Zn , n ≥ 1. Then, for all n ≥ 1, σ⟨Xn⟩ ⊂ Fn and E|Xn | < ∞. Also,
$$E(X_{n+1} \mid \mathcal{F}_n) = E\big( (Z_1 + \cdots + Z_{n+1}) \mid Z_1, \ldots, Z_n \big) = Z_1 + \cdots + Z_n + EZ_{n+1} \ \text{(by independence)} = X_n + \mu,$$
so that
$$E(X_{n+1} \mid \mathcal{F}_n) \ \begin{cases} = X_n & \text{if } \mu = 0 \\ > X_n & \text{if } \mu > 0 \\ < X_n & \text{if } \mu < 0. \end{cases}$$
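The identity E(Xn+1 |Fn ) = Xn + µ in Example 13.1.1 can be checked by brute force on a small finite example. The sketch below is only an illustration: the function name `conditional_expectations` and the two-point step distributions are assumptions made here, not part of the text. It enumerates all paths of length n + 1, groups them by their first n steps, and averages Xn+1 within each group using exact rational arithmetic.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product


def conditional_expectations(steps, probs, n):
    """Return {history: E(X_{n+1} | Z_1..Z_n = history)} by exact enumeration.

    steps: distinct possible values of each iid increment Z_i (hypothetical choice)
    probs: matching probabilities, summing to 1
    """
    num = defaultdict(Fraction)  # sum of X_{n+1} * P(path) over paths extending a history
    den = defaultdict(Fraction)  # P(history)
    for path in product(range(len(steps)), repeat=n + 1):
        p = Fraction(1)
        for i in path:
            p *= probs[i]
        values = tuple(steps[i] for i in path)
        history = values[:n]
        num[history] += p * sum(values)
        den[history] += p
    return {h: num[h] / den[h] for h in den}


# Mean-zero steps: the walk is a martingale.
fair = conditional_expectations(
    [Fraction(-1), Fraction(2)], [Fraction(2, 3), Fraction(1, 3)], 3
)
# Positive drift mu = 1/2: the walk is a sub-martingale.
drift = conditional_expectations(
    [Fraction(-1), Fraction(1)], [Fraction(1, 4), Fraction(3, 4)], 3
)
```

In the first case every conditional mean equals the current position sum(history); in the second it exceeds it by exactly µ = 1/2.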

n} ∈ Fn for all n ≥ 1. However, the condition
$$\{T \ge n\} \in \mathcal{F}_n \quad \text{for all } n \ge 1 \tag{2.5}$$
does not always imply that T is a stopping time w.r.t. {Fn }n≥1 (cf. Problem 13.7). Note that for a stopping time T w.r.t. {Fn }n≥1 ,
$$\{T \ge n\} = \{T \le n - 1\}^c \in \mathcal{F}_{n-1} \quad \text{for } n \ge 2,$$

and {T ≥ 1} = Ω. Since Fn−1 ⊂ Fn for all n ≥ 2, (2.5) is a weaker requirement than T being a stopping time w.r.t. {Fn }n≥1 . Proposition 13.2.1: Let T be a stopping time w.r.t. {Fn }n≥1 and let F∞ be as in (2.3). Define
$$\mathcal{F}_T = \{A \in \mathcal{F}_\infty : A \cap \{T = n\} \in \mathcal{F}_n \ \text{for all } n \ge 1\}. \tag{2.6}$$

Then, FT is a σ-algebra. Proof: Left as an exercise (Problem 13.8).

□

If T is a stopping time w.r.t. {Fn }n≥1 , then for any m ∈ N, {T = m} ∩ {T = n} = ∅ ∈ Fn for all n ≠ m and {T = m} ∩ {T = n} = {T = m} ∈ Fm for n = m. Thus, {T = m} ∈ FT for all m ≥ 1 and hence, σ⟨T⟩ ⊂ FT . But the reverse inclusion may not hold as shown below. Example 13.2.1: Let T ≡ m for some given integer m ≥ 1. Then, T is a stopping time w.r.t. any filtration {Fn }n≥1 . For this T , A ∈ FT ⇒ A ∩ {T = m} ∈ Fm ⇒ A ∈ Fm ,

13.2 Stopping times and optional stopping theorems


so that FT ⊂ Fm . Conversely, suppose A ∈ Fm . Then, A ∩ {T = n} = ∅ ∈ Fn for all n ≠ m, and A ∩ {T = m} = A ∈ Fm for n = m. Thus, Fm = FT . But σ⟨T⟩ = {Ω, ∅}. Example 13.2.2: Let {Xn }n≥1 be a sequence of random variables adapted to a filtration {Fn }n≥1 and let {Bn }n≥1 be a sequence of Borel sets in R. Define the random variable
$$T = \inf\{ n \ge 1 : X_n \in B_n \}. \tag{2.7}$$
Then, T (ω) < ∞ if Xn (ω) ∈ Bn for some n ∈ N and T (ω) = +∞ if Xn (ω) ∉ Bn for all n ∈ N. Since, for any n ≥ 1,
$$\{T = n\} = \{ X_1 \notin B_1, \ldots, X_{n-1} \notin B_{n-1}, X_n \in B_n \} \in \mathcal{F}_n,$$
T is a stopping time w.r.t. {Fn }n≥1 . Now, define a new random variable XT by
$$X_T = \begin{cases} X_m & \text{if } T = m \\ \limsup_{n\to\infty} X_n & \text{if } T = \infty. \end{cases} \tag{2.8}$$
Then, $X_T \in \bar{\mathbb{R}}$ and for any n ≥ 1 and r ∈ R, {XT ≤ r} ∩ {T = n} = {Xn ≤ r} ∩ {T = n} ∈ Fn . Also, {XT = ±∞} ∩ {T = n} = {Xn = ±∞} ∩ {T = n} = ∅ for all n ≥ 1. Hence, it follows that XT is ⟨FT , $\mathcal{B}(\bar{\mathbb{R}})$⟩-measurable. Example 13.2.3: Let {Yn }n≥1 be a sequence of iid random variables with EY1 = µ. Let Xn = (Y1 + . . . + Yn ), n ≥ 1 denote the random walk corresponding to {Yn }n≥1 . For x > 0, let
$$T_x = \inf\{ n \ge 1 : X_n > n\mu + x\sqrt{n} \}. \tag{2.9}$$
Then, Tx is the first time the sequence of partial sums {Xn }n≥1 exceeds the level nµ + x√n and is a special case of (2.7) with Bn = (nµ + x√n, ∞), n ≥ 1. Consequently, Tx is a stopping time w.r.t. Fn = σ⟨Y1 , Y2 , . . . , Yn⟩, n ≥ 1. Note that if EY1² < ∞, by the law of iterated logarithm (cf. 8.7), with σ² = Var(Y1 ),
$$\limsup_{n\to\infty} \frac{X_n - n\mu}{\sqrt{2\sigma^2 n \log\log n}} = 1 \quad \text{w.p. 1},$$
i.e.,
$$X_n > n\mu + C\sqrt{n \log\log n} \quad \text{infinitely often w.p. 1}$$

for some constant C > 0. Thus, P (Tx < ∞) = 1 and hence, Tx is a ﬁnite stopping time. This random variable Tx arises in sequential probability ratio tests (SPRT) for testing hypotheses on the mean of a (normal) population. See Woodroofe (1982), Chapter 3.


Definition 13.2.2: Let {Fn }n≥0 be a filtration in a probability space (Ω, F, P ). A betting sequence w.r.t. {Fn }n≥0 is a sequence {Hn }n≥1 of nonnegative random variables such that for each n ≥ 1, Hn is Fn−1 -measurable. The following result says that there is no betting scheme that can beat a gambling system, i.e., convert a fair one into a favorable one or the other way around. Theorem 13.2.2: (Betting theorem). Let {Fn }n≥0 be a filtration in a probability space. Let {Hn }n≥1 be a betting sequence w.r.t. {Fn }n≥0 . For an adapted sequence {Xn , Fn }n≥0 let {Yn }n≥0 be defined by Y0 = X0 , $Y_n = Y_0 + \sum_{j=1}^{n} (X_j - X_{j-1}) H_j$, n ≥ 1. Let E|(Xj − Xj−1 )Hj | < ∞ for j ≥ 1. Then, (i) {Xn , Fn }n≥0 a martingale ⇒ {Yn , Fn }n≥0 is also a martingale, (ii) {Xn , Fn }n≥0 a sub-martingale ⇒ {Yn , Fn }n≥0 is also a sub-martingale. Proof: Clearly, for all n ≥ 1, E|Yn | < ∞ and Yn is Fn -measurable. Further,
$$E(Y_{n+1} \mid \mathcal{F}_n) = Y_n + E\big( (X_{n+1} - X_n) H_{n+1} \mid \mathcal{F}_n \big) = Y_n + H_{n+1} E\big( (X_{n+1} - X_n) \mid \mathcal{F}_n \big)$$
since Hn+1 is Fn -measurable. Now the theorem follows from the defining properties of {Xn , Fn }n≥0 and the nonnegativity of Hn+1 . □ The above theorem leads to the following results known as Doob's optional stopping theorems. Theorem 13.2.3: (Doob's optional stopping theorem I ). Let {Xn , Fn }n≥0 be a sub-martingale. Let T be a stopping time w.r.t. {Fn }n≥0 . Let $\tilde{X}_n$ ≡ XT∧n , n ≥ 0. Then {$\tilde{X}_n$ , Fn }n≥0 is also a sub-martingale and hence E$\tilde{X}_n$ ≥ EX0 for all n ≥ 1. Proof: For any A ∈ B(R) and n ≥ 0,
$$\tilde{X}_n^{-1}(A) = \{\omega : \tilde{X}_n \in A\} = \bigcup_{j=1}^{n} \{\omega : X_j \in A, T = j\} \cup \{\omega : X_n \in A, T > n\}.$$
Since T is a stopping time w.r.t. {Fn }n≥0 the right side above belongs to Fn for each n ≥ 0. Next, $|\tilde{X}_n| \le \sum_{j=1}^{n} |X_j|$ and hence E|$\tilde{X}_n$| < ∞. Finally, let Hj = 1 if j ≤ T and 0 if j > T . Since for all j ≥ 1, {ω : Hj = 1} = {ω : T ≤ j − 1}c ∈ Fj−1 , {Hj }j≥1 is a betting sequence w.r.t. {Fn }n≥0 . Also, $\tilde{X}_n = X_0 + \sum_{j=1}^{n} (X_j - X_{j-1}) H_j$. Now the betting theorem (Theorem 13.2.2) implies the present theorem. □
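The betting theorem can be illustrated on the fair ±1 coin-tossing walk with X0 = 0 by computing EYn exactly over all 2^n paths. The following is a minimal sketch under that assumption; the helper `transform_mean` and the strategies `doubling`, `stop_at_one`, and `flat` are hypothetical names introduced here. Each strategy receives only the increments before round j, which is how predictability (Hj being Fj−1 -measurable) is enforced in the code.

```python
from fractions import Fraction
from itertools import product


def transform_mean(n, bet):
    """E(Y_n) for Y_n = sum_{j<=n} (X_j - X_{j-1}) * H_j on a fair +-1 walk, X_0 = 0.

    bet(j, past) returns the nonnegative stake H_j and sees only the
    increments of rounds 1..j-1, so H_j is predictable.
    """
    total = Fraction(0)
    for incs in product((-1, 1), repeat=n):
        y = 0
        for j in range(1, n + 1):
            y += incs[j - 1] * bet(j, incs[: j - 1])
        total += Fraction(y, 2 ** n)
    return total


def doubling(j, past):
    """Double the stake after every loss; stop betting after the first win."""
    return 0 if 1 in past else 2 ** (j - 1)


def stop_at_one(j, past):
    """H_j = 1 if j <= T, where T is the first time the walk hits +1."""
    pos = 0
    for z in past:
        pos += z
        if pos == 1:
            return 0
    return 1


def flat(j, past):
    """Constant stake of 3 units per round."""
    return 3


means = [transform_mean(8, s) for s in (doubling, stop_at_one, flat)]
```

All three expectations are exactly zero. With `stop_at_one`, Yn is the stopped walk XT∧n of Theorem 13.2.3, so the computation also confirms EXT∧n = EX0 for that stopping time.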


Remark 13.2.1: If {Xn , Fn }n≥0 is a martingale, then both {Xn , Fn }n≥0 and {−Xn , Fn }n≥0 are sub-martingales, and hence the above theorem implies that if {Xn , Fn }n≥0 is a martingale, then so is {$\tilde{X}_n$ , Fn }n≥0 , and hence E$\tilde{X}_n$ = EXT∧n = EX0 = EXn for all n ≥ 1. This suggests the question: if P (T < ∞) = 1, then on letting n → ∞, does E$\tilde{X}_n$ → EXT ? Consider the following example. Let {Xn }n≥0 denote the symmetric simple random walk on the integers with X0 = 0. Let T = inf{n : n ≥ 1, Xn = 1}. Then P (T < ∞) = 1 and E$\tilde{X}_n$ = EXT∧n = EX0 = 0 but XT = 1 w.p. 1 and hence E$\tilde{X}_n$ ↛ EXT = 1. So, clearly some additional hypothesis is needed. Theorem 13.2.4: (Doob's optional stopping theorem II ). Let {Xn , Fn }n≥0 be a martingale. Let T be a stopping time w.r.t. {Fn }n≥0 . Suppose P (T < ∞) = 1 and there is a 0 < K < ∞ such that for all n ≥ 0, |XT∧n | ≤ K w.p. 1. Then EXT = EX0 . Proof: Since P (T < ∞) = 1, XT∧n → XT w.p. 1 and |XT | ≤ K < ∞ and hence E|XT | < ∞. Thus, E|XT − XT∧n | ≤ 2KP (T > n) → 0. □ Remark 13.2.2: Since XT∧n = XT I(T ≤ n) + Xn I(T > n) and, by the martingale property,
$$E X_T I(T \le n) = \sum_{j=0}^{n} E(X_j : T = j) = \sum_{j=0}^{n} E(X_n : T = j) = E(X_n : T \le n),$$
it follows that if E|Xn |I(T > n) → 0 as n → ∞ and P (T < ∞) = 1 then EXT = EX0 . A stronger version of Doob's optional stopping theorem is given below in Theorem 13.2.6. Proposition 13.2.5: Let S and T be two stopping times w.r.t. {Fn }n≥1 with S ≤ T . Then, FS ⊂ FT . Proof: For any A ∈ FS and n ≥ 1,
$$A \cap \{T = n\} = A \cap \{T = n\} \cap \{S \le n\} = \bigcup_{k=1}^{n} \big( A \cap \{S = k\} \big) \cap \{T = n\} \in \mathcal{F}_n,$$


since A ∩ {S = k} ∈ Fk for all 1 ≤ k ≤ n. Thus, A ∈ FT , proving the result. □ Theorem 13.2.6: (Doob's optional stopping theorem III ). Let {Xn , Fn }n≥1 be a sub-martingale and S and T be two finite stopping times w.r.t. {Fn }n≥1 such that S ≤ T . If XS and XT are integrable and if
$$\liminf_{n\to\infty} E|X_n| I(T > n) = 0, \tag{2.10}$$
then
$$E(X_T \mid \mathcal{F}_S) \ge X_S \quad \text{a.s.} \tag{2.11}$$

If, in addition, {Xn }n≥1 is a martingale, then equality holds in (2.11). Thus, Theorem 13.2.6 shows that if a martingale (or a sub-martingale) is stopped at random time points S and T with S ≤ T , then under very mild conditions, {(XS , FS ), (XT , FT )} continues to have the martingale (sub-martingale, respectively) property. Proof: To show (2.11), it is enough to show that
$$\int_A (X_T - X_S)\,dP \ge 0 \quad \text{for all } A \in \mathcal{F}_S. \tag{2.12}$$

Fix A ∈ FS . Let {nk }k≥1 be a subsequence along which the "lim inf" is attained in (2.10). Let Tk = min{T, nk } and Sk = min{S, nk }, k ≥ 1. The proof of (2.12) involves showing that
$$\int_A (X_{T_k} - X_{S_k})\,dP \ge 0 \quad \text{for all } k \ge 1 \tag{2.13}$$
and
$$\lim_{k\to\infty} \int_A \big[ (X_T - X_S) - (X_{T_k} - X_{S_k}) \big]\,dP = 0. \tag{2.14}$$

Consider (2.13). Since Sk ≤ Tk ≤ nk ,
$$X_{T_k} - X_{S_k} = \sum_{n=S_k+1}^{T_k} (X_n - X_{n-1}) = \sum_{n=2}^{n_k} (X_n - X_{n-1})\, I(S_k + 1 \le n \le T_k). \tag{2.15}$$

Note that for all 2 ≤ n ≤ nk , {Tk ≥ n} = {Tk ≤ n − 1}c = {T ≤ n − 1}c ∈ Fn−1 and {Sk + 1 ≤ n} = {Sk ≤ n − 1} = {S ≤ n − 1} ∈ Fn−1 . Also, since A ∈ FS , Bn ≡ A ∩ {Sk + 1 ≤ n ≤ Tk } = (A ∩ {Sk + 1 ≤ n}) ∩ {Tk ≥ n} ∈ Fn−1 for all 2 ≤ n ≤ nk . Hence, by the sub-martingale property of

{Xn }n≥1 , from (2.15),
$$\int_A (X_{T_k} - X_{S_k})\,dP = \sum_{n=2}^{n_k} \int_{A \cap \{S_k + 1 \le n \le T_k\}} (X_n - X_{n-1})\,dP = \sum_{n=2}^{n_k} \int_{B_n} \big[ E(X_n \mid \mathcal{F}_{n-1}) - X_{n-1} \big]\,dP \ge 0. \tag{2.16}$$

This proves (2.13). To prove (2.14), note that by (2.10) and the integrability of XS and XT and the DCT,
$$\begin{aligned}
\lim_{k\to\infty} \Big| \int_A \big[ (X_T - X_S) - (X_{T_k} - X_{S_k}) \big]\,dP \Big|
&\le \lim_{k\to\infty} \int_A \big[ |X_T - X_{T_k}| + |X_S - X_{S_k}| \big]\,dP \\
&\le \lim_{k\to\infty} \Big[ \int_{\{T > n_k\}} (|X_T| + |X_{n_k}|)\,dP + \int_{\{S > n_k\}} (|X_S| + |X_{n_k}|)\,dP \Big] \\
&\le \lim_{k\to\infty} \Big[ \int_{\{T > n_k\}} |X_T|\,dP + 2 \int_{\{T > n_k\}} |X_{n_k}|\,dP + \int_{\{S > n_k\}} |X_S|\,dP \Big] \\
&= 0,
\end{aligned}$$
since {S > nk } ⊂ {T > nk } and {T > nk } ↓ ∅ as k → ∞. This proves the theorem for the case when {Xn }n≥1 is a sub-martingale. When {Xn }n≥1 is a martingale, equality holds in the last line of (2.16), which implies equality in (2.13) and hence, in (2.12). This completes the proof. □ Remark 13.2.3: If there exists a t0 < ∞ such that P (T ≤ t0 ) = 1, then (2.10) holds. Corollary 13.2.7: Let {Xn , Fn }n≥1 be a sub-martingale and let T be a finite stopping time w.r.t. {Fn }n≥1 such that E|XT | < ∞ and (2.10) holds. Then,
$$EX_T \ge EX_1. \tag{2.17}$$
If, in addition, {Xn }n≥1 is a martingale, then equality holds in (2.17). Proof: Follows from Theorem 13.2.6 by setting S ≡ 1. □

Corollary 13.2.8: Let {Xn , Fn }n≥1 be a sub-martingale. Let {Tn }n≥1 be a sequence of stopping times such that (i) for all n ≥ 1, Tn ≤ Tn+1 w.p. 1, (ii) for all n ≥ 1, there exist a nonrandom tn ∈ (0, ∞) such that P (Tn ≤ tn ) = 1.


Let Gn ≡ FTn , Yn ≡ XTn , n ≥ 1. Then {Yn , Gn }n≥1 is a sub-martingale. If {Xn , Fn }n≥1 is a martingale, then {Yn , Gn }n≥1 is a martingale. Proof: Use Theorem 13.2.6 and Remark 13.2.3.

□

Corollary 13.2.9: Let {Xn , Fn }n≥1 be a sub-martingale. Let T be a stopping time. Let Tn = min{T, n}, n ≥ 1. Then {XTn , FTn }n≥1 is a sub-martingale. Note that this is a stronger version of Theorem 13.2.3 as FTn ⊂ Fn for all n ≥ 1. Theorem 13.2.10: (Doob's maximal inequality). Let {Xn , Fn }n≥1 be a sub-martingale and let Mm = max{X1 , . . . , Xm }, m ∈ N. Then, for any m ∈ N and x ∈ (0, ∞),
$$P(M_m > x) \le \frac{E\big( X_m^+ I(M_m > x) \big)}{x} \le \frac{EX_m^+}{x}. \tag{2.18}$$

Proof: Fix m ≥ 1, x > 0. Define a random variable S by
$$S = \begin{cases} \inf\{k : 1 \le k \le m, X_k > x\} & \text{on } A \\ m & \text{on } A^c \end{cases}$$
where A = {Xk > x for some 1 ≤ k ≤ m} = {Mm > x}. Then it is easy to check that S is a stopping time w.r.t. {Fn }n≥1 and S ≤ m. Set T ≡ m. Then (2.10) holds and hence, by Theorem 13.2.6, {(XS , FS ), (Xm , Fm )} is a sub-martingale. Note that $A = \{M_m > x\} = \bigcup_{k=1}^{m} \{M_m > x, S = k\} = \bigcup_{k=1}^{m} \{X_S > x, S = k\} = \{X_S > x\} \in \mathcal{F}_S$. Hence, by Markov's inequality,
$$P(A) = P(X_S > x) \le \frac{1}{x} \int_{\{X_S > x\}} X_S\,dP \le \frac{1}{x} \int_A X_m\,dP \le \frac{1}{x} \int_A X_m^+\,dP \le \frac{EX_m^+}{x}.$$
□ Remark 13.2.4: An alternative proof of (2.18) is as follows. Let A1 = {X1 > x}, Ak = {X1 ≤ x, X2 ≤ x, . . . , Xk−1 ≤ x, Xk > x} for k = 2, . . . , m. Then $\bigcup_{i=1}^{m} A_i = A \equiv \{M_m > x\}$ and Ak ∈ Fk for all k. Now for x > 0,
$$P(M_m > x) = \sum_{k=1}^{m} P(A_k) \le \frac{1}{x} \sum_{k=1}^{m} E(X_k I_{A_k}).$$
By the sub-martingale property of {Xn , Fn }n≥1 ,
$$E(X_k I_{A_k}) \le E(X_m I_{A_k}) \quad \text{for } k \le m.$$
Thus
$$P(M_m > x) \le \frac{1}{x} \sum_{k=1}^{m} E\big( X_m I_{A_k} \big) \le \frac{1}{x} E(X_m I_A) \le \frac{1}{x} E\big( X_m^+ I_A \big) \le \frac{1}{x} E X_m^+.$$
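Both inequalities in (2.18) can be verified exactly on a small case. The sketch below assumes, purely for illustration, the nonnegative sub-martingale Xk = |Sk | built from a symmetric ±1 random walk; the function name `maximal_inequality_terms` is hypothetical. It enumerates all 2^m paths and returns the three quantities in (2.18) as exact fractions.

```python
from fractions import Fraction
from itertools import product


def maximal_inequality_terms(m, x):
    """Return (P(M_m > x), E[X_m 1{M_m > x}] / x, E[X_m] / x) for X_k = |S_k|.

    S_k is a symmetric +-1 random walk, so X_k = |S_k| is a nonnegative
    sub-martingale and X_m^+ = X_m.
    """
    prob = Fraction(0)
    mid = Fraction(0)
    full = Fraction(0)
    w = Fraction(1, 2 ** m)  # probability of each path
    for incs in product((-1, 1), repeat=m):
        s = 0
        running_max = 0
        for z in incs:
            s += z
            running_max = max(running_max, abs(s))
        x_m = abs(s)
        full += w * x_m
        if running_max > x:
            prob += w
            mid += w * x_m
    return prob, mid / x, full / x


checks = [maximal_inequality_terms(m, x) for m in range(1, 9) for x in (1, 2, 3)]
```

For every pair (m, x) in `checks`, the three returned values are nondecreasing from left to right, as (2.18) requires.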

Theorem 13.2.11: (Doob's Lp -maximal inequality for sub-martingales). Let {Xn , Fn }n≥1 be a sub-martingale and let Mn = max{Xj : 1 ≤ j ≤ n}. Then, for any p ∈ (1, ∞),
$$E(M_n^+)^p \le \Big( \frac{p}{p-1} \Big)^p E(X_n^+)^p \le \infty. \tag{2.19}$$

Proof: If E(Xn+ )^p = ∞, then (2.19) holds trivially. Let E(Xn+ )^p < ∞. For p > 1, φ(x) = (x+ )^p is a convex nondecreasing function on R. Hence, {(Xn+ )^p , Fn }n≥1 is a sub-martingale and E(Xj+ )^p ≤ E(Xn+ )^p < ∞ for all j ≤ n. Since $(M_n^+)^p \le \sum_{j=1}^{n} (X_j^+)^p$, this implies that E(Mn+ )^p < ∞. For any nonnegative random variable Y and p > 0, by Tonelli's theorem,
$$EY^p = pE \int_0^Y x^{p-1}\,dx = pE \int_0^\infty x^{p-1} I(Y > x)\,dx = \int_0^\infty p x^{p-1} P(Y > x)\,dx.$$
Thus,
$$E(M_n^+)^p = \int_0^\infty p x^{p-1} P(M_n^+ > x)\,dx = \int_0^\infty p x^{p-1} P(M_n > x)\,dx.$$
By Theorem 13.2.10, for x > 0,
$$P(M_n > x) \le \frac{1}{x} E\big( X_n^+ I(M_n > x) \big),$$
and hence
$$\begin{aligned}
E(M_n^+)^p &\le \int_0^\infty p x^{p-2} E\big( X_n^+ I(M_n > x) \big)\,dx \\
&= \frac{p}{p-1} E\big( X_n^+ (M_n^+)^{p-1} \big) \\
&\le \frac{p}{p-1} \big( E(X_n^+)^p \big)^{1/p} \big( E(M_n^+)^{(p-1)q} \big)^{1/q}
\end{aligned}$$
(by Hölder's inequality) where q is the conjugate of p, i.e., q = p/(p − 1). Since (p − 1)q = p and E(Mn+ )^p < ∞, dividing both sides by (E(Mn+ )^p )^{1/q} gives
$$\big( E(M_n^+)^p \big)^{1/p} \le \frac{p}{p-1} \big( E(X_n^+)^p \big)^{1/p},$$
proving (2.19). □
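As a sanity check on the constant in (2.19), the two sides can be compared exactly for a small example. This sketch again assumes the sub-martingale Xk = |Sk | from a symmetric ±1 walk and restricts to integer p ≥ 2 so that all arithmetic stays rational; the function name `lp_maximal_sides` is an illustrative choice, not something from the text.

```python
from fractions import Fraction
from itertools import product


def lp_maximal_sides(n, p):
    """Return (E[M_n^p], (p/(p-1))^p * E[|S_n|^p]) for X_k = |S_k|, integer p >= 2."""
    lhs = Fraction(0)
    rhs = Fraction(0)
    w = Fraction(1, 2 ** n)  # probability of each path
    for incs in product((-1, 1), repeat=n):
        s = 0
        running_max = 0
        for z in incs:
            s += z
            running_max = max(running_max, abs(s))
        lhs += w * running_max ** p
        rhs += w * abs(s) ** p
    return lhs, Fraction(p, p - 1) ** p * rhs


sides = [lp_maximal_sides(n, p) for n in range(1, 9) for p in (2, 3)]
```

For p = 2 the right side is 4E(Sn²) = 4n, so the check reduces to E(Mn²) ≤ 4n for this walk.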

Corollary 13.2.12: Let {Xn , Fn }n≥1 be a martingale and let $\tilde{M}_n$ = sup{|Xj | : 1 ≤ j ≤ n}. Then, for p ∈ (1, ∞),
$$E\tilde{M}_n^p \le \Big( \frac{p}{p-1} \Big)^p E|X_n|^p.$$
Proof: Since {|Xn |, Fn }n≥1 is a sub-martingale, this follows from Theorem 13.2.11. □ Theorem 13.2.13: (Doob's L log L maximal inequality for sub-martingales). Let {Xn , Fn }n≥1 be a sub-martingale and Mn = max{Xj : 1 ≤ j ≤ n}. Then
$$EM_n^+ \le \frac{e}{e-1} \big( 1 + E(X_n^+ \log X_n^+) \big), \tag{2.20}$$
where 0 log 0 is interpreted as 0. Proof: As in the proof of Theorem 13.2.11,
$$EM_n^+ = \int_0^\infty P(M_n^+ > x)\,dx \le 1 + \int_1^\infty \frac{1}{x} E\big( X_n^+ I(M_n^+ > x) \big)\,dx = 1 + E(X_n^+ \log M_n^+). \tag{2.21}$$
For x > 0, y > 0,
$$x \log y = x \log x + x \log \frac{y}{x}.$$
Now x log(y/x) = yφ(x/y) where φ(t) ≡ −t log t, t > 0. It can be verified that φ(t) attains its maximum 1/e at t = 1/e. Thus
$$x \log y \le x \log x + \frac{y}{e}.$$
So
$$E X_n^+ \log M_n^+ \le E X_n^+ \log X_n^+ + \frac{EM_n^+}{e}. \tag{2.22}$$
If EXn+ log Xn+ = ∞, (2.20) is trivially true. If EXn+ log Xn+ < ∞, then as in the proof of Theorem 13.2.11, it can be shown that EMn+ < ∞. Hence, the theorem follows from (2.21) and (2.22). □

A special case of Theorem 13.2.10 is the maximal inequality of Kolmogorov (cf. Section 8.3) as shown by the following example. Example 13.2.4: Let {Yn }n≥1 be a sequence of independent random variables with EYn = 0 and E|Yn |^α < ∞ for all n ≥ 1 for some α ∈ (1, ∞). Let Sn = Y1 + . . . + Yn , n ≥ 1. Then, φ(x) ≡ |x|^α , x > 0 is a convex function, and hence, by Proposition 13.1.1, Xn ≡ φ(|Sn |), n ≥ 1 is a sub-martingale w.r.t. Fn = σ{Y1 , . . . , Yn }, n ≥ 1. Now, by Theorem 13.2.10, for any x > 0 and m ≥ 1,
$$P\Big( \max_{1 \le n \le m} |S_n| > x \Big) = P\Big( \max_{1 \le n \le m} X_n > x^\alpha \Big) \le x^{-\alpha} EX_m^+ \le x^{-\alpha} E|S_m|^\alpha. \tag{2.23}$$

Kolmogorov's inequality corresponds to the case where α = 2. Another application of the optional stopping theorem yields the following useful result. Theorem 13.2.14: (Wald's lemmas). Let {Yn }n≥1 be a sequence of iid random variables and let {Fn }n≥1 be a filtration such that
(i) Yn is Fn -measurable and (ii) Fn and σ{Yk : k ≥ n + 1} are independent for all n ≥ 1. (2.24)
Also, let T be a finite stopping time w.r.t. {Fn }n≥1 and E|T | < ∞. Let Sn = Y1 + . . . + Yn , n ≥ 1. Then,
(a) E|Y1 | < ∞ implies
$$ES_T = (EY_1)(ET). \tag{2.25}$$
(b) EY1² < ∞ implies
$$E(S_T - T\,EY_1)^2 = \mathrm{Var}(Y_1)\,E(T). \tag{2.26}$$
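Both identities (2.25) and (2.26) can be confirmed exactly for a bounded stopping time by a small dynamic program. The sketch below makes several assumptions for illustration only: steps are +1 with probability p and −1 with probability 1 − p, and T is the first exit time of the walk from the interval (−a, b), truncated at n so that T is bounded. The function name `wald_check` is hypothetical.

```python
from fractions import Fraction


def wald_check(p, a, b, n):
    """Exact E[S_T], E[T], E[(S_T - T*mu)^2] for T = min(n, first exit time of (-a, b)).

    Steps are +1 w.p. p and -1 w.p. 1 - p; a, b >= 1 are integers.
    Returns (E[S_T], E[T], E[(S_T - T*mu)^2], mu, var).
    """
    mu = 2 * p - 1
    var = 1 - mu ** 2
    alive = {0: Fraction(1)}  # distribution of S_k restricted to {T > k}
    e_s = e_t = e_sq = Fraction(0)
    for k in range(1, n + 1):
        nxt = {}
        for s, w in alive.items():
            for z, q in ((1, p), (-1, 1 - p)):
                nxt[s + z] = nxt.get(s + z, Fraction(0)) + w * q
        alive = {}
        for s, w in nxt.items():
            if s <= -a or s >= b or k == n:  # the walk is stopped at time k
                e_s += w * s
                e_t += w * k
                e_sq += w * (s - k * mu) ** 2
            else:
                alive[s] = w
    return e_s, e_t, e_sq, mu, var


e_s, e_t, e_sq, mu, var = wald_check(Fraction(2, 3), 2, 3, 12)
```

With these inputs, `e_s` equals `mu * e_t` and `e_sq` equals `var * e_t` as exact fractions, matching (2.25) and (2.26).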

Proof: W.l.o.g., suppose that EY1 = 0. Then, {Sn , Fn }n≥1 is a martingale. By Corollary 13.2.7, (2.25) would follow if one showed that (2.10) holds with Xn = Sn and that E|ST | < ∞. Since $|S_n| \le \sum_{i=1}^{n} |Y_i| \le \sum_{i=1}^{T} |Y_i|$ on the set {T ≥ n}, both these conditions would hold if $E\big( \sum_{i=1}^{T} |Y_i| \big) < \infty$. Now, by the MCT and the independence of Yi and {T ≥ i} = {T ≤ i − 1}c ∈ Fi−1 for i ≥ 2 and the fact that {T ≥ 1} = Ω, it follows that
$$E \sum_{i=1}^{T} |Y_i| = E \sum_{i=1}^{\infty} |Y_i| I(i \le T) = \sum_{i=1}^{\infty} E|Y_i| I(i \le T) = E|Y_1| \sum_{i=1}^{\infty} P(T \ge i) = E|Y_1|\,E|T| < \infty. \tag{2.27}$$

This proves (a). To prove (b), set σ² = Var(Y1 ) and note that EY1 = 0 ⇒ {Sn² − nσ² , Fn }n≥1 is a martingale. Let Tn = T ∧ n, n ≥ 1. Then, Tn is a bounded stopping time w.r.t. {Fn }n≥1 and hence, by Theorem 13.2.6,
$$E\big[ S_{T_n}^2 - (ET_n)\sigma^2 \big] = E(S_1^2 - \sigma^2) = 0 \quad \text{for all } n \ge 1. \tag{2.28}$$
Thus, (2.26) holds with T replaced by Tn . Since T is a finite stopping time, Tn ↑ T < ∞ w.p. 1 and therefore, STn → ST as n → ∞, w.p. 1. Now applying Fatou's lemma and the MCT, from (2.28), one gets
$$ES_T^2 \le \liminf_{n\to\infty} ES_{T_n}^2 = \liminf_{n\to\infty} (ET_n)\sigma^2 = (ET)\sigma^2. \tag{2.29}$$
Also, note that for any n ≥ 1
$$\begin{aligned}
E(S_T^2 - S_{T_n}^2) &= E(S_T^2 - S_n^2) I(T > n) \\
&= E\big[ (S_T - S_n)^2 + 2 S_n (S_T - S_n) \big] I(T > n) \\
&\ge 2 E S_n (S_T - S_n) I(T > n) = 2 E S_n (S_{T_{1n}} - S_n) \\
&= 2 E\big[ S_n \cdot E\{ (S_{T_{1n}} - S_n) \mid \mathcal{F}_n \} \big],
\end{aligned} \tag{2.30}$$

where T1n = max{T, n}. Since ET1n ≤ ET + n < ∞, and {T1n > k} = {T > k} for all k > n, the conditions of Theorem 13.2.6 hold with Xn = Sn , S = n and T = T1n . Hence, E(ST1n − Sn |Fn ) = 0 a.s. and by (2.30), EST² ≥ ESTn² for all n ≥ 1. Now letting n → ∞ and using (2.28), one gets EST² ≥ (ET )σ² , as in (2.29). This completes the proof of (b). □ This section is concluded with the statement of an inequality relating the pth moment of a martingale to the (p/2)th moment of its squared variation. Theorem 13.2.15: (Burkholder's inequality). Let {Xn , Fn }n≥1 be a martingale sequence. Let ξj = Xj − Xj−1 , j ≥ 1, with X0 = 0. Then for any p ∈ [2, ∞), there exist positive constants Ap and Bp such that
$$E|X_n|^p \le A_p\, E\Big( \sum_{i=1}^{n} \xi_i^2 \Big)^{p/2}$$
and
$$E|X_n|^p \le B_p \Big\{ E\Big( \sum_{i=1}^{n} E\big( \xi_i^2 \mid \mathcal{F}_{i-1} \big) \Big)^{p/2} + \sum_{i=1}^{n} E|\xi_i|^p \Big\}.$$

For a proof, see Chow and Teicher (1997).

13.3 Martingale convergence theorems

The martingale (or sub- or super-martingale) property of a sequence of random variables {Xn }n≥1 implies, under some mild additional conditions, a remarkable regularity, namely, that {Xn }n≥1 converges w.p. 1 as n → ∞. For example, any nonnegative super-martingale converges w.p. 1. Also any sub-martingale {Xn }n≥1 for which {E|Xn |}n≥1 is bounded converges w.p. 1. Further, if {E|Xn |^p }n≥1 is bounded for some p ∈ (1, ∞), then Xn converges w.p. 1 and in Lp as well. The proofs of these assertions depend crucially on an ingenious inequality due to Doob. Recall that one way to prove that a sequence of real numbers {xn }n≥1 converges as n → ∞ is to show that it does not oscillate too much as n → ∞. That is, for all a < b, the number of times the sequence goes from below a to above b is finite. This number is referred to as the number of upcrossings from a to b. Doob's upcrossing lemma (see Theorem 13.3.1 below) shows that for a sub-martingale, the mean of the upcrossings can be bounded above. First, a formal definition of upcrossings of a given sequence {xj : 1 ≤ j ≤ n} of real numbers from level a to level b with a < b is given. Let
$$N_1 = \min\{j : 1 \le j \le n,\ x_j \le a\}, \qquad N_2 = \min\{j : N_1 < j \le n,\ x_j \ge b\}$$
and, define recursively,
$$N_{2k-1} = \min\{j : N_{2k-2} < j \le n,\ x_j \le a\}, \qquad N_{2k} = \min\{j : N_{2k-1} < j \le n,\ x_j \ge b\}.$$

If any of these sets on the right side is empty, all subsequent ones will be empty as well and the corresponding Nk 's will not be well defined. If N1 or N2 is not well defined, then set $U(\{x_j\}_{j=1}^{n}; a, b)$, the number of upcrossings of the interval (a, b) by $\{x_j\}_{j=1}^{n}$, equal to zero. Otherwise let $N_\ell$ be the last one that is well defined. Set
$$U(\{x_j\}_{j=1}^{n}; a, b) = \begin{cases} \ell/2 & \text{if } \ell \text{ is even} \\ (\ell - 1)/2 & \text{if } \ell \text{ is odd.} \end{cases}$$
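The recursive definition of the indices N1 , N2 , . . . amounts to a two-state scan of the sequence: wait for a value at or below a, then wait for a later value at or above b, and count each completed pair. A minimal sketch of that scan follows; the function name `upcrossings` is an illustrative choice and not notation from the text.

```python
def upcrossings(xs, a, b):
    """Number of upcrossings of the interval (a, b) by the finite sequence xs, a < b.

    Mirrors the definition via N_1, N_2, ...: odd-indexed times are the
    successive first visits to (-inf, a], even-indexed times the successive
    first visits to [b, inf) after them; the count is the number of
    well-defined even-indexed times.
    """
    if not a < b:
        raise ValueError("need a < b")
    count = 0
    below = False  # True once a value <= a has been seen since the last upcrossing
    for x in xs:
        if not below:
            if x <= a:
                below = True
        elif x >= b:
            count += 1
            below = False
    return count


examples = [
    upcrossings([0, 1, 0, 1, 0, 1], 0, 1),
    upcrossings([1, 0, 1], 0, 1),
    upcrossings([1, 1], 0, 1),
    upcrossings([0, 0.5, 2, -1, 3], 0, 1),
]
```

A sequence that starts above b contributes no upcrossing until it has first dropped to a or below, which is why `[1, 1]` gives zero.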

Theorem 13.3.1: (Doob's upcrossing lemma). Let $\{X_j, \mathcal{F}_j\}_{j=1}^{n}$ be a sub-martingale and let a < b be real numbers. Let $U_n \equiv U(\{X_j\}_{j=1}^{n}; a, b)$. Then
$$EU_n \le \frac{E(X_n - a)^+ - E(X_1 - a)^+}{b - a} \le \frac{EX_n^+ + |a|}{b - a}. \tag{3.1}$$
Proof: Consider first the special case when Xj ≥ 0 w.p. 1 for all j ≥ 1 and a = 0. Let $\tilde{N}_0 = 1$. Let
$$\tilde{N}_j = \begin{cases} N_j & \text{if } j = 2k,\ k \le U_n, \text{ or if } j = 2k - 1,\ k \le U_n, \\ n & \text{otherwise.} \end{cases}$$
If j is odd and j + 1 ≤ 2Un , then $X_{\tilde{N}_{j+1}} \ge b > 0 = X_{\tilde{N}_j}$ (the last equality holds because $0 \le X_{\tilde{N}_j} \le a = 0$). If j is odd and j + 1 ≥ 2Un + 2, then $X_{\tilde{N}_{j+1}} = X_n = X_{\tilde{N}_j}$. Thus $\sum_{j \text{ odd}} (X_{\tilde{N}_{j+1}} - X_{\tilde{N}_j}) \ge bU_n$. It is easy to check that $\{\tilde{N}_j\}_{j=1}^{n}$ are stopping times. By Theorem 13.2.6,
$$E(X_{\tilde{N}_{j+1}} - X_{\tilde{N}_j}) \ge 0 \quad \text{for } j = 1, 2, \ldots, n.$$
Thus,
$$E(X_n - X_1) = E \sum_{j=0}^{n-1} (X_{\tilde{N}_{j+1}} - X_{\tilde{N}_j}) \ge bEU_n + E \sum_{j \text{ even}} (X_{\tilde{N}_{j+1}} - X_{\tilde{N}_j}) \ge bEU_n. \tag{3.2}$$
Hence, both inequalities of (3.1) hold for the special case. Now for the general case, let Yj ≡ (Xj − a)+ , 1 ≤ j ≤ n. Then $\{Y_j, \mathcal{F}_j\}_{j=1}^{n}$ is a nonnegative sub-martingale and $U(\{Y_j\}_{j=1}^{n}; 0, b - a) \equiv U(\{X_j\}_{j=1}^{n}; a, b)$. Thus, from (3.2)
$$EU_n \le \frac{E(Y_n - Y_1)}{b - a} = \frac{E(X_n - a)^+ - E(X_1 - a)^+}{b - a},$$


proving the first inequality of (3.1). The second inequality follows by noting that (x − a)+ ≤ x+ + |a| for any x, a ∈ R. □ The first convergence theorem is an easy consequence of the above theorem. Theorem 13.3.2: Let {Xn , Fn }n≥1 be a sub-martingale such that
$$\sup_{n \ge 1} EX_n^+ < \infty.$$
Then {Xn }n≥1 converges to a finite limit X∞ w.p. 1 and E|X∞ | < ∞. Proof: Let
$$A = \{\omega : \liminf_{n\to\infty} X_n < \limsup_{n\to\infty} X_n\},$$
and for a < b, let
$$A(a, b) = \{\omega : \liminf_{n\to\infty} X_n < a < b < \limsup_{n\to\infty} X_n\}.$$
Then, A = ∪A(a, b) where the union is taken over all rationals a, b such that a < b. To establish convergence of {Xn }n≥1 it suffices to show that P (A(a, b)) = 0 for each a < b, as this implies P (A) = 0. Fix a < b and let $U_n = U(\{X_j\}_{j=1}^{n}; a, b)$. For ω ∈ A(a, b), Un → ∞ as n → ∞. On the other hand, by the upcrossing lemma
$$EU_n \le \frac{EX_n^+ + |a|}{b - a}$$
and by hypothesis, supn≥1 EXn+ < ∞, implying that
$$\sup_{n \ge 1} EU_n < \infty.$$
By the MCT, E[limn→∞ Un ] = limn→∞ EUn , and hence
$$\lim_{n\to\infty} U_n < \infty \quad \text{w.p. 1}.$$
Thus, P (A(a, b)) = 0 for all a < b, and hence limn→∞ Xn = X∞ exists w.p. 1. By Fatou's lemma
$$E|X_\infty| \le \liminf_{n\to\infty} E|X_n| \le \sup_{n \ge 1} E|X_n|.$$
But E|Xn | = 2E(Xn+ ) − EXn ≤ 2EXn+ − EX1 , as {Xn , Fn }n≥1 a sub-martingale implies EXn ≥ EX1 . Thus, supn≥1 EXn+ < ∞ implies supn≥1 E|Xn | < ∞. So, E|X∞ | < ∞ and hence |X∞ | < ∞ w.p. 1. □ Corollary 13.3.3: Let {Xn , Fn }n≥1 be a nonnegative super-martingale. Then {Xn }n≥1 converges to a finite limit w.p. 1.


Proof: Since $\{-X_n, \mathcal{F}_n\}_{n\ge 1}$ is a nonpositive sub-martingale, $\sup_{n\ge 1} E(-X_n)^+ = 0 < \infty$. By Theorem 13.3.2, $\{-X_n\}_{n\ge 1}$ converges to a finite limit w.p. 1. □

Corollary 13.3.4: Every nonnegative martingale converges w.p. 1.

A natural question is whether a sub-martingale that converges w.p. 1 to a finite limit also converges in $L^1$, or in $L^p$ for $p > 1$. It turns out that if a sub-martingale is $L^p$ bounded for some $p > 1$, then it converges in $L^p$. But this is false for $p = 1$, as the following examples show.

Example 13.3.1: (Gambler's ruin problem). Let $\{S_n\}_{n\ge 1}$ be the simple symmetric random walk, i.e., $S_n = \sum_{i=1}^n \xi_i$, $n \ge 1$, where $\{\xi_n\}_{n\ge 1}$ is a sequence of iid random variables with $P(\xi_1 = 1) = \frac{1}{2} = P(\xi_1 = -1)$. Let $N = \inf\{n : n \ge 1, S_n = 1\}$. As noted earlier, $N$ is a finite stopping time and $\{S_n\}_{n\ge 1}$ is a martingale. Let $X_n = S_{N \wedge n}$, $n \ge 1$. Then by the optional sampling theorem, $\{X_n\}_{n\ge 1}$ is a martingale. Clearly, $\lim_{n\to\infty} X_n \equiv X_\infty = S_N = 1$ exists w.p. 1. But $EX_n \equiv 0$ while $EX_\infty = 1$, and so $X_n$ does not converge to $X_\infty$ in $L^1$.

Example 13.3.2: Suppose that $\{\xi_n\}_{n\ge 1}$ is a sequence of iid nonnegative random variables with $E\xi_1 = 1$. Let $X_n = \prod_{i=1}^n \xi_i$, $n \ge 1$. Then $\{X_n\}_{n\ge 1}$ is a nonnegative martingale and hence converges w.p. 1 to $X_\infty$, say. If $P(\xi_1 = 1) < 1$, it can be shown that $X_\infty = 0$ w.p. 1. Thus, $X_n \not\to X_\infty$ in $L^1$, since $EX_n \equiv 1$ while $EX_\infty = 0$. In particular, $\{X_n\}_{n\ge 1}$ is not UI (Problem 13.19).

Example 13.3.3: If $\{Z_n\}_{n\ge 0}$ is a branching process with offspring distribution $\{p_j\}_{j\ge 0}$ and mean $m = \sum_{j=1}^\infty j p_j$, then $X_n \equiv Z_n/m^n$ (cf. (1.9)) is a nonnegative martingale and hence $\lim_n X_n = X_\infty$ exists w.p. 1. It is known that $X_n$ converges to $X_\infty$ in $L^1$ iff $m > 1$ and $\sum_{j=1}^\infty (j \log j)\, p_j < \infty$ (cf. Chapter 18). See also Athreya and Ney (2004).

Theorem 13.3.5: Let $\{X_n, \mathcal{F}_n\}_{n\ge 1}$ be a sub-martingale. Then the following are equivalent:
(i) There exists a random variable $X_\infty$ in $L^1$ such that $X_n \to X_\infty$ in $L^1$.
(ii) $\{X_n\}_{n\ge 1}$ is uniformly integrable.

Proof: Clearly, (i) ⇒ (ii) for any sequence of integrable random variables $\{X_n\}_{n\ge 1}$. Conversely, if (ii) holds, then $\{E|X_n|\}_{n\ge 1}$ is bounded and hence by Theorem 13.3.2, $X_n \to X_\infty$ w.p. 1, and by uniform integrability, $X_n \to X_\infty$ in $L^1$, i.e., (i) holds. □
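The failure of $L^1$ convergence in Example 13.3.1 can be checked numerically. The following is a minimal sketch, not part of the original text: it computes the exact law of the stopped walk $X_n = S_{N\wedge n}$ with rational arithmetic, assuming the walk starts at $S_0 = 0$. The function names are illustrative only.

```python
from fractions import Fraction


def stopped_walk_distribution(n):
    """Exact law of X_n = S_{N ^ n}: simple symmetric walk stopped on hitting 1.

    Assumes S_0 = 0 (an assumption of this sketch).
    Returns a dict mapping each attainable value to its probability.
    """
    dist = {0: Fraction(1)}
    for _ in range(n):
        new = {}
        for s, pr in dist.items():
            if s == 1:
                # The walk has been stopped at level 1 and stays there.
                new[1] = new.get(1, 0) + pr
            else:
                for step in (-1, 1):
                    new[s + step] = new.get(s + step, 0) + pr / 2
        dist = new
    return dist


def mean(dist):
    """E X_n for a finite distribution given as {value: probability}."""
    return sum(s * pr for s, pr in dist.items())


def l1_distance_to_one(dist):
    """E|X_n - 1|, the L^1 distance between X_n and the a.s. limit X_infinity = 1."""
    return sum(abs(s - 1) * pr for s, pr in dist.items())


if __name__ == "__main__":
    for n in (10, 100):
        d = stopped_walk_distribution(n)
        print(n, float(d[1]), mean(d), l1_distance_to_one(d))
```

For every $n$ the mean stays at 0 and $E|X_n - 1|$ stays at 1, even though $P(X_n = 1)$ increases toward 1: the mass that has not yet been absorbed sits far enough to the left to keep the expectation fixed.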

13.3 Martingale convergence theorems


Remark 13.3.1: Let (ii) of Theorem 13.3.5 hold. For any $A \in \mathcal{F}_n$ and $m > n$, by the sub-martingale property, $E(X_n I_A) \le E(X_m I_A)$. By uniform integrability, for any $A \in \mathcal{F}$,

$$E X_n I_A \to E X_\infty I_A \quad \text{as } n \to \infty.$$

This implies that $\{X_n, \mathcal{F}_n\}_{n\ge 1} \cup \{X_\infty, \mathcal{F}_\infty\}$ is a sub-martingale, where $\mathcal{F}_\infty = \sigma\langle \bigcup_{n\ge 1} \mathcal{F}_n \rangle$. That is, the sub-martingale is closable at right. Further, $EX_n \to EX_\infty$. Conversely, it can be shown that if there exists a random variable $X_\infty$, measurable w.r.t. $\mathcal{F}_\infty$, such that (a) $E|X_\infty| < \infty$, (b) $\{X_n, \mathcal{F}_n\}_{n\ge 1} \cup \{X_\infty, \mathcal{F}_\infty\}$ is a sub-martingale, and (c) $EX_n \to EX_\infty$, then $\{X_n\}_{n\ge 1}$ is uniformly integrable and (i) of Theorem 13.3.5 holds.

Corollary 13.3.6: If $\{X_n, \mathcal{F}_n\}_{n\ge 1}$ is a martingale, then it is closable at right iff $\{X_n\}_{n\ge 1}$ is uniformly integrable iff $X_n$ converges in $L^1$.

This follows from the previous remark since for a martingale, $EX_n$ is constant for $1 \le n \le \infty$.

Remark 13.3.2: A sufficient condition for a sequence $\{X_n\}_{n\ge 1}$ of random variables to be uniformly integrable is that there exists a random variable $M$ such that $EM < \infty$ and $|X_n| \le M$ w.p. 1 for all $n \ge 1$. Suppose that $\{X_n\}_{n\ge 1}$ is a nonnegative sub-martingale and $M = \sup_{n\ge 1} X_n = \lim_{n\to\infty} M_n$, where $M_n = \sup_{1\le j\le n} X_j$. By the MCT, $EM = \lim_{n\to\infty} EM_n$. But by Doob's $L \log L$ maximal inequality (Theorem 13.2.13),

$$EM_n \le \frac{e}{e-1}\Bigl(1 + E\bigl(X_n (\log X_n)^+\bigr)\Bigr)$$

for all $n \ge 1$. Thus, if $\{X_n\}_{n\ge 1}$ is a nonnegative sub-martingale and $\sup_{n\ge 1} E(X_n (\log X_n)^+) < \infty$, then $EM < \infty$ and hence $\{X_n\}_{n\ge 1}$ is uniformly integrable and converges w.p. 1 and in $L^1$. Similarly, if $\{X_n\}_{n\ge 1}$ is a martingale such that $\sup_{n\ge 1} E(|X_n| (\log |X_n|)^+) < \infty$, then $\{X_n\}_{n\ge 1}$ is uniformly integrable.

L1 Convergence of the Doob Martingale

Definition 13.3.1: Let $X$ be a random variable on a probability space $(\Omega, \mathcal{F}, P)$ and $\{\mathcal{F}_n\}_{n\ge 1}$ a filtration in $\mathcal{F}$. Let $E|X| < \infty$ and $X_n \equiv E(X|\mathcal{F}_n)$, $n \ge 1$. Then $\{X_n, \mathcal{F}_n\}_{n\ge 1}$ is called a Doob martingale (cf. Example 13.1.3).

For a Doob martingale, $E|X_n| \le E|X|$ and it can be shown that $\{X_n\}_{n\ge 1}$ is uniformly integrable (Problem 13.20). Hence, $\lim_{n\to\infty} X_n$ exists w.p. 1 and in $L^1$, and equals $E(X|\mathcal{F}_\infty)$, where $\mathcal{F}_\infty = \sigma\langle \bigcup_{n\ge 1} \mathcal{F}_n \rangle$. This may be summarized as:

Theorem 13.3.7: Let $\{\mathcal{F}_n\}_{n\ge 1}$ be a filtration and let $X$ be an $\mathcal{F}_\infty$-measurable random variable with $E|X| < \infty$. Then

$$E(X|\mathcal{F}_n) \to X \quad \text{w.p. 1 and in } L^1.$$

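Corollary 13.3.8: Let $\{\mathcal{F}_n\}_{n\ge 1}$ be a filtration and $\mathcal{F}_\infty = \sigma\langle \bigcup_{n\ge 1} \mathcal{F}_n \rangle$.

(i) For any $A \in \mathcal{F}_\infty$, one has $P(A|\mathcal{F}_n) \to I_A$ w.p. 1.

(ii) For any random variable $X$ with $E|X| < \infty$, $E(X|\mathcal{F}_n) \to E(X|\mathcal{F}_\infty)$ w.p. 1.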

Proof: Take $X = I_A$ for (i). For (ii), replace $X$ by $E(X|\mathcal{F}_\infty)$ in Theorem 13.3.7 and note that $E\bigl(E(X|\mathcal{F}_\infty)\,|\,\mathcal{F}_n\bigr) = E(X|\mathcal{F}_n)$. □

Kolmogorov's zero-one law (Theorem 7.2.4) is an easy consequence of this. If $\{\xi_n\}_{n\ge 1}$ are independent random variables, $A$ is a tail event, and $\mathcal{F}_n \equiv \sigma\langle \xi_j : 1 \le j \le n \rangle$, then $P(A|\mathcal{F}_n) = P(A)$ for each $n$ and hence $P(A) = I_A$ w.p. 1, i.e., $P(A) = 0$ or 1.

Theorem 13.3.9: ($L^p$ convergence of sub-martingales, $p > 1$). Let $\{X_n, \mathcal{F}_n\}_{n\ge 1}$ be a nonnegative sub-martingale. Let $1 < p < \infty$ and $\sup_{n\ge 1} E|X_n|^p < \infty$. Then $\lim_{n\to\infty} X_n = X_\infty$ exists w.p. 1 and in $L^p$, and $\{(X_n, \mathcal{F}_n)\}_{n\ge 1} \cup \{X_\infty, \mathcal{F}_\infty\}$ is an $L^p$-bounded sub-martingale.

Proof: By Doob's maximal $L^p$ inequality (Theorem 13.2.11), for any $n \ge 1$,

$$EM_n^p \le \Bigl(\frac{p}{p-1}\Bigr)^p EX_n^p \le \Bigl(\frac{p}{p-1}\Bigr)^p \sup_{m\ge 1} EX_m^p, \tag{3.3}$$

where $M_n = \max\{X_j : 1 \le j \le n\}$. Let $M = \lim_{n\to\infty} M_n$. Then (3.3) and the MCT yield

$$EM^p < \infty.$$

This makes $\{|X_n|^p\}_{n\ge 1}$ uniformly integrable. Also, $\sup_{n\ge 1} E|X_n|^p < \infty$ and $p > 1$ imply $\sup_n E|X_n| < \infty$, and hence $\lim_{n\to\infty} X_n = X_\infty$ exists w.p. 1 as a finite limit. The uniform integrability of $\{|X_n|^p\}_{n\ge 1}$ implies $L^p$ convergence (cf. Problem 2.36). The closability also follows as in Remark 13.3.1. □

Corollary 13.3.10: Let $\{X_n, \mathcal{F}_n\}_{n\ge 1}$ be a martingale. Let $1 < p < \infty$ and $\sup_{n\ge 1} E|X_n|^p < \infty$. Then the conclusions of Theorem 13.3.9 hold.

Proof: Since $\{Y_n \equiv |X_n|, \mathcal{F}_n\}_{n\ge 1}$ is a nonnegative sub-martingale, Theorem 13.3.9 applies. □

Reversed Martingales

Definition 13.3.2: Let $\{X_n, \mathcal{F}_n\}_{n\le -1}$ be an adapted family with $(\Omega, \mathcal{F}, P)$ as the underlying probability space, i.e., for $n < m$, $\mathcal{F}_n \subset \mathcal{F}_m \subset \mathcal{F}$ and $X_n$ is $\mathcal{F}_n$-measurable for each $n \le -1$. Such a sequence is called a reversed martingale if
(i) $E|X_n| < \infty$ for all $n \le -1$,
(ii) $E(X_{n+1}|\mathcal{F}_n) = X_n$ for all $n \le -2$.

The definitions of reversed sub- and super-martingales are similar. Reversed martingales are well behaved since they are closed at right.

Theorem 13.3.11: Let $\{X_n, \mathcal{F}_n\}_{n\le -1}$ be a reversed martingale. Then

(a) $\lim_{n\to -\infty} X_n = X_{-\infty}$ exists w.p. 1 and in $L^1$,

(b) $X_{-\infty} = E(X_{-1}|\mathcal{F}_{-\infty})$, where $\mathcal{F}_{-\infty} \equiv \bigcap_{n\le -1} \mathcal{F}_n$.

Proof: Fix $a < b$. For $n \le -1$, let $U_n$ be the number of $(a, b)$ upcrossings of $\{X_j : n \le j \le -1\}$. Then by Doob's upcrossing lemma (Theorem 13.3.1),

$$EU_n \le \frac{E(X_{-1} - a)^+}{b - a}.$$

Let $U = \lim_{n\to -\infty} U_n$. Letting $n \to -\infty$, by the MCT, it follows that $EU < \infty$. Thus, $U < \infty$ w.p. 1. This being true for every $a < b$, one may conclude as in Theorem 13.3.2 that $P(\limsup_{n\to -\infty} X_n > \liminf_{n\to -\infty} X_n) = 0$. So $\lim_{n\to -\infty} X_n = X_{-\infty}$ exists w.p. 1. Also, since $X_n = E(X_{-1}|\mathcal{F}_n)$, Jensen's inequality shows that $\{X_n\}_{n\le -1}$ is uniformly integrable. So $X_n \to X_{-\infty}$ in $L^1$, proving (a). To prove (b), note that for any $A \in \mathcal{F}_{-\infty}$, by uniform integrability,

$$\int_A X_{-\infty}\, dP = \lim_{n\to -\infty} \int_A X_n\, dP = \int_A X_{-1}\, dP, \quad \text{by the martingale property.} \qquad □$$

Corollary 13.3.12: (The Strong Law of Large Numbers for iid random variables). Let $\{\xi_n\}_{n\ge 1}$ be a sequence of iid random variables with $E|\xi_1| < \infty$. Then $n^{-1} \sum_{i=1}^n \xi_i \to E\xi_1$ as $n \to \infty$, w.p. 1.

Proof: For $k \ge 1$, let $S_k = \xi_1 + \cdots + \xi_k$ and $\mathcal{F}_{-k} = \sigma\langle S_k, \xi_{k+1}, \xi_{k+2}, \ldots \rangle$. Let $X_n \equiv E(\xi_1|\mathcal{F}_n)$, $n \le -1$. By the independence of $\{\xi_i\}_{i\ge 1}$, for any $n \le -1$, with $k = -n$, $X_n = E(\xi_1|\sigma\langle S_k \rangle)$. Also, by symmetry, for any $k \ge 1$, $E(\xi_1|\sigma\langle S_k \rangle) = E(\xi_j|\sigma\langle S_k \rangle)$ for $1 \le j \le k$. Thus, $X_n = \frac{1}{k} \sum_{j=1}^k E(\xi_j|\sigma\langle S_k \rangle) = \frac{S_k}{k}$, for all $k = -n \ge 1$. It is easy to check that $\{X_n, \mathcal{F}_n\}_{n\le -1}$ is a reversed martingale and so by Theorem 13.3.11,

$$\lim_{n\to -\infty} X_n = \lim_{k\to\infty} \frac{S_k}{k} \quad \text{exists w.p. 1 and in } L^1.$$

Since $\lim_{k\to\infty} \frac{S_k}{k}$ is a tail random variable, by Kolmogorov's zero-one law it is a constant, which by $L^1$ convergence must equal $E\xi_1$. □

13.4 Applications of martingale methods

13.4.1 Supercritical branching processes

Recall Example 13.1.5 on branching processes. Assume that it is supercritical, i.e., $\mu = E\xi_{11} > 1$, and that $\sigma^2 = \mathrm{Var}(\xi_{11}) < \infty$.

Proposition 13.4.1: Let $X_n = \mu^{-n} Z_n$ be the martingale defined in (1.9). Then $\{X_n\}_{n\ge 1}$ is an $L^2$-bounded martingale.

Proof: Let $v_n = \mathrm{Var}(X_n)$, $n \ge 1$. Then

$$v_{n+1} = \mathrm{Var}\bigl(E(X_{n+1}|\mathcal{F}_n)\bigr) + E\bigl(\mathrm{Var}(X_{n+1}|\mathcal{F}_n)\bigr) = \mathrm{Var}(X_n) + \frac{E(Z_n \sigma^2)}{\mu^{2(n+1)}} = v_n + \sigma^2 \mu^{-2} \mu^{-n}, \quad n \ge 1.$$

Thus, $v_{n+1} = v_1 + \sigma^2 \mu^{-2} \sum_{j=1}^n \mu^{-j}$. Since $\mu > 1$, $\{v_n\}_{n\ge 1}$ is bounded. Now since $EX_n \equiv 1$, $\{X_n\}_{n\ge 1}$ is $L^2$-bounded. □

A direct consequence of Proposition 13.4.1 and Corollary 13.3.10 (with $p = 2$) is that $\lim_{n\to\infty} X_n = X_\infty$ exists w.p. 1 and in mean-square.
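The variance recursion in the proof of Proposition 13.4.1 is easy to iterate numerically. The sketch below is illustrative only and assumes $Z_0 = 1$ (so that $v_0 = 0$ and the recursion also holds at $n = 0$); under that assumption the geometric series sums to $\sigma^2 / (\mu(\mu - 1))$. The function names are hypothetical.

```python
def variance_sequence(mu, sigma2, n_max):
    """Return [v_0, ..., v_{n_max}] with v_n = Var(Z_n / mu**n).

    Assumes Z_0 = 1, so v_0 = 0, and applies
    v_{n+1} = v_n + sigma2 * mu**(-2) * mu**(-n).
    """
    v = [0.0]
    for n in range(n_max):
        v.append(v[-1] + sigma2 * mu ** (-2) * mu ** (-n))
    return v


def variance_limit(mu, sigma2):
    """Limit of v_n for mu > 1: sigma2 * mu**(-2) * sum_{j>=0} mu**(-j)."""
    if mu <= 1:
        raise ValueError("the bound is finite only in the supercritical case mu > 1")
    return sigma2 / (mu * (mu - 1.0))


if __name__ == "__main__":
    v = variance_sequence(mu=2.0, sigma2=1.0, n_max=20)
    print(v[1], v[5], v[20], variance_limit(2.0, 1.0))
```

The sequence is nondecreasing and bounded by the closed-form limit, which is the $L^2$-boundedness used in the proposition.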

13.4.2 Investment sequences

Let $X_n$ be the value of a portfolio at (the end of) the $n$th period. Suppose the returns on the investment are random and satisfy

$$E(X_{n+1}\,|\,X_0, X_1, \ldots, X_n) \le \rho_{n+1} X_n, \quad n \ge 1,$$

where $\rho_{n+1}$ is a strictly positive random variable that is $\mathcal{F}_n$-measurable, where $\mathcal{F}_n \equiv \sigma\langle X_1, X_2, \ldots, X_n \rangle$. Let $\rho_1 \equiv 1$ and

$$Z_n = \frac{X_n}{\prod_{j=1}^n \rho_j}, \quad n \ge 1.$$

Then $\{Z_n, \mathcal{F}_n\}_{n\ge 1}$ is a nonnegative super-martingale and hence it converges w.p. 1 to a limit $Z$, with $EZ \le EZ_1 = EX_1$. This implies that $\{X_n\}_{n\ge 1}$ converges w.p. 1 on the event $A \equiv \{\prod_{j=1}^n \rho_j \text{ converges}\}$.

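13.4.3 A conditional Borel-Cantelli lemma

Let $\{A_n\}_{n\ge 1}$ be a sequence of events in a probability space $(\Omega, \mathcal{F}, P)$ and let $\{\mathcal{F}_n\}_{n\ge 1}$ be a filtration in $\mathcal{F}$. Let $A_n \in \mathcal{F}_n$ for all $n \ge 1$ and $p_n = P(A_n|\mathcal{F}_{n-1})$, $n \ge 1$, where $\mathcal{F}_0$ is the trivial σ-algebra $\{\Omega, \emptyset\}$. Let $\delta_n = I_{A_n}$ and $X_n = \sum_{j=1}^n (\delta_j - p_j)$, $n \ge 1$. Then $\{X_n, \mathcal{F}_n\}_{n\ge 1}$ is a martingale. Let $V_j = \mathrm{Var}(\delta_j|\mathcal{F}_{j-1}) = p_j(1 - p_j)$, $j \ge 1$, $s_n = \sum_{j=1}^n V_j$ and $\tilde{s}_n = \max\{s_n, 1\}$, $n \ge 1$. Since $V_n$ is $\mathcal{F}_{n-1}$-measurable, so are $s_n$ and $\tilde{s}_n$. Let $Y_n = \sum_{j=1}^n (\delta_j - p_j)/\tilde{s}_j$, $n \ge 1$. Then $\{Y_n, \mathcal{F}_n\}_{n\ge 1}$ is a martingale. Clearly, $EY_n = 0$ and by the martingale property,

$$EY_n^2 = \mathrm{Var}(Y_n) = \sum_{j=1}^n \mathrm{Var}\Bigl(\frac{\delta_j - p_j}{\tilde{s}_j}\Bigr) = \sum_{j=1}^n E\Bigl(\frac{V_j}{\tilde{s}_j^2}\Bigr) = E\Bigl(\sum_{j=1}^n \frac{V_j}{\tilde{s}_j^2}\Bigr).$$

But $V_1 = s_1$ and $V_j = s_j - s_{j-1}$ for $j \ge 2$. Writing $s_0 \equiv 0$ and using $\max\{t, 1\} \le \tilde{s}_j$ for $t \le s_j$,

$$\frac{V_j}{\tilde{s}_j^2} \le \int_{s_{j-1}}^{s_j} \frac{1}{(\max\{t, 1\})^2}\, dt, \quad j \ge 1,$$

and hence,

$$\sum_{j=1}^\infty \frac{V_j}{\tilde{s}_j^2} \le \int_0^1 dt + \int_1^\infty \frac{1}{t^2}\, dt = 2.$$

So $\sup_{n\ge 1} EY_n^2 \le 2$. Thus, $\{Y_n\}_{n\ge 1}$ converges to some $Y$ w.p. 1 and in $L^2$.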

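If $s_n \to \infty$, then by Kronecker's lemma (cf. Lemma 8.4.2), applied to the convergent series $\sum_j (\delta_j - p_j)/\tilde{s}_j$,

$$\frac{1}{s_n} \sum_{j=1}^n (\delta_j - p_j) \to 0 \quad \Rightarrow \quad \frac{\sum_{j=1}^n p_j}{s_n} \Bigl( \frac{\sum_{j=1}^n \delta_j}{\sum_{j=1}^n p_j} - 1 \Bigr) \to 0.$$

But $\sum_{j=1}^n p_j \ge s_n$ and hence

$$\frac{\sum_{j=1}^n \delta_j}{\sum_{j=1}^n p_j} \to 1 \quad \text{w.p. 1 on the event } B \equiv \{s_n \to \infty\}. \tag{4.1}$$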

Next it is claimed that on $B^c \equiv \{\lim_{n\to\infty} s_n < \infty\}$, $\lim_{n\to\infty} X_n = X$ exists and is finite w.p. 1. To prove the claim, fix $0 < t < \infty$. Let $N_t = \inf\{n : s_{n+1} > t\}$. Since $s_{n+1}$ is $\mathcal{F}_n$-measurable, $N_t$ is a stopping time and by the optional stopping theorem I (Theorem 13.2.3), $\{Z_n \equiv X_{N_t \wedge n}\}_{n\ge 1}$ is a martingale. By Doob's $L^2$-maximal inequality,

$$E \sup_{1\le j\le n} Z_j^2 \le 4E(Z_n^2).$$

Also, it is easy to verify that $\{X_n^2 - s_n\}_{n\ge 1}$ is a martingale and by the optional sampling theorem (Theorem 13.2.4),

$$E(X_{n\wedge N_t}^2 - s_{n\wedge N_t}) = 0.$$

Thus, $EZ_n^2 = E s_{n\wedge N_t} \le t$ and hence for each $t$, $\lim_{n\to\infty} Z_n$ exists w.p. 1 and in $L^2$. Thus, $\lim_{n\to\infty} X_{N_t \wedge n}$ exists w.p. 1 for each $t$. But on $B^c$, $N_t = \infty$ for all large $t$. So $\lim_{n\to\infty} X_n = X$ exists w.p. 1 on $B^c$. This proves the claim.

It follows that on $B^c \cap \{\sum_{j=1}^\infty p_j = \infty\}$,

$$\frac{\sum_{j=1}^n \delta_j}{\sum_{j=1}^n p_j} - 1 = \frac{X_n}{\sum_{j=1}^n p_j} \to 0.$$

Also, since $B \equiv \{s_n \to \infty\}$ is a subset of $\{\sum_{j=1}^\infty p_j = \infty\}$ and it has been shown in (4.1) that

$$\frac{\sum_{j=1}^n \delta_j}{\sum_{j=1}^n p_j} \to 1 \quad \text{w.p. 1 on } B,$$

it follows that

$$\frac{\sum_{j=1}^n \delta_j}{\sum_{j=1}^n p_j} \to 1 \quad \text{w.p. 1 on } \Bigl\{\sum_{j=1}^\infty p_j = \infty\Bigr\}.$$

Summarizing the above, one gets the following result.

Theorem 13.4.2: (A conditional Borel-Cantelli lemma). Let $\{A_n\}_{n\ge 1}$ be a sequence of events in a probability space $(\Omega, \mathcal{F}, P)$ and $\{\mathcal{F}_n\}_{n\ge 1}$ be a filtration such that $A_n \in \mathcal{F}_n$ for all $n \ge 1$. Let $p_n = P(A_n|\mathcal{F}_{n-1})$ for $n \ge 2$, $p_1 = P(A_1)$. Then on the event $B_0 \equiv \{\sum_{j=1}^\infty p_j = \infty\}$,

$$\frac{\sum_{j=1}^n I_{A_j}}{\sum_{j=1}^n p_j} \to 1 \quad \text{w.p. 1},$$

and in particular, infinitely many $A_n$'s happen w.p. 1 on $B_0$.
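Theorem 13.4.2 can be illustrated by simulation. The sketch below is not from the text: it uses a hypothetical dependence structure in which $p_n$ equals 0.5 if $A_{n-1}$ occurred and 0.2 otherwise, so that $\sum_j p_j = \infty$ on every path, and it tracks the ratio $\sum_{j\le n} I_{A_j} / \sum_{j\le n} p_j$. The probabilities, the seed, and the function name are all assumptions of the example.

```python
import random


def simulate_ratio(n_steps, seed=0):
    """Simulate dependent events A_1, ..., A_n and return sum(I_Aj) / sum(p_j).

    Hypothetical model for this sketch: p_n = P(A_n | F_{n-1}) is 0.5 when
    A_{n-1} occurred and 0.2 otherwise (p_1 = 0.2).
    """
    rng = random.Random(seed)
    occurred_prev = False
    hits = 0
    p_sum = 0.0
    for _ in range(n_steps):
        p = 0.5 if occurred_prev else 0.2  # conditional probability given the past
        occurred_prev = rng.random() < p
        hits += int(occurred_prev)
        p_sum += p
    return hits / p_sum


if __name__ == "__main__":
    for n in (100, 10_000, 200_000):
        print(n, simulate_ratio(n, seed=1))
```

The events are not independent, so the classical second Borel-Cantelli lemma does not apply directly, yet the ratio settles near 1 as the theorem predicts.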

13.4.4 Decomposition of probability measures

The almost sure convergence of a nonnegative martingale yields the following theorem on the Lebesgue decomposition of two probability measures on a given measurable space.

Theorem 13.4.3: Let $(\Omega, \mathcal{F})$ be a measurable space and $\{\mathcal{F}_n\}_{n\ge 1}$ be a filtration with $\mathcal{F}_n \subset \mathcal{F}$ for all $n \ge 1$. Let $P$ and $Q$ be two probability measures on $(\Omega, \mathcal{F})$ such that for each $n \ge 1$, $P_n \equiv$ the restriction of $P$ to $\mathcal{F}_n$ is absolutely continuous w.r.t. $Q_n \equiv$ the restriction of $Q$ to $\mathcal{F}_n$, with the Radon-Nikodym derivative $X_n = \frac{dP_n}{dQ_n}$. Let $\mathcal{F}_\infty \equiv \sigma\langle \bigcup_{n\ge 1} \mathcal{F}_n \rangle$ and $X \equiv \limsup_{n\to\infty} X_n$. Then for any $A \in \mathcal{F}_\infty$,

$$P(A) = \int_A X\, dQ + P(A \cap (X = \infty)) \equiv P_a(A) + P_s(A), \text{ say}, \tag{4.2}$$

and $P_a \ll Q$ and $P_s \perp Q$.

Proof: For $1 \le k \le n$, let $M_{k,n} = \max_{k\le j\le n} X_j$. Let $M_k = \sup_{n\ge k} M_{k,n} = \lim_{n\to\infty} M_{k,n}$. Then $X \equiv \limsup_{n\to\infty} X_n = \lim_{k\to\infty} M_k$. Fix $1 \le k_0, N < \infty$ and $A \in \mathcal{F}_{k_0}$. Then for $n \ge k \ge k_0$, $B_{k,n} \equiv A \cap \{M_{k,n} \le N\} \in \mathcal{F}_n$ and hence

$$P(B_{k,n}) = \int X_n I_{B_{k,n}}\, dQ. \tag{4.3}$$

As $n \to \infty$, $M_{k,n} \uparrow M_k$ and so $I_{B_{k,n}} \downarrow I_{B_k}$, where $B_k \equiv A \cap \{M_k \le N\} = \bigcap_{n\ge k} B_{k,n}$. Also, since $\{X_n, \mathcal{F}_n\}_{n\ge 1}$ is a nonnegative martingale under the probability measure $Q$, $\lim_{n\to\infty} X_n$ exists w.p. 1 (under $Q$) and hence coincides with $X$. Thus, by the bounded convergence theorem applied to (4.3),

$$P(B_k) = \int X I_{B_k}\, dQ.$$


Now let N → ∞ and use the MCT to conclude that P (A ∩ (Mk < ∞)) = XI{Mk 0 for all n and i. Let F = σ n≥1 An . Let

$X_n \equiv \sum_{i=1}^{k_n} \frac{P(A_{ni})}{Q(A_{ni})} I_{A_{ni}}$. Then $\{X_n, \mathcal{F}_n\}_{n\ge 1}$ is a martingale on $(\Omega, \mathcal{F}, Q)$ and $P$ satisfies the decomposition (4.2).

The proof of Corollary 13.4.6 is left as an exercise (Problem 13.22).

Remark 13.4.3: This yields the Lebesgue decomposition of $P$ w.r.t. $Q$ when $\mathcal{F}$ is countably generated, i.e., when there exists a countable collection $\mathcal{A}$ of subsets of $\Omega$ such that $\mathcal{F} = \sigma\langle \mathcal{A} \rangle$. In particular, this holds if $\Omega = \mathbb{R}^k$ and $\mathcal{F} \equiv \mathcal{B}(\mathbb{R}^k)$ for $k \in \mathbb{N}$.

13.4.5 Kakutani's theorem

Theorem 13.4.7: (Kakutani's theorem). Let $P$ and $Q$ be the probability distributions on $(\mathbb{R}^\infty, \mathcal{B}(\mathbb{R}^\infty))$ of the sequences of independent random variables $\{X_j\}_{j\ge 1}$ and $\{Y_j\}_{j\ge 1}$, respectively. For each $j \ge 1$, let the distribution of $X_j$ be dominated by that of $Y_j$. Then either

$$P \ll Q \quad \text{or} \quad P \perp Q. \tag{4.5}$$

Proof: Let $f_j$ be the density of $\lambda_j$ w.r.t. $\mu_j$, where $\lambda_j(\cdot) = P(X_j \in \cdot)$ and $\mu_j(\cdot) = Q(Y_j \in \cdot)$. Let $\Omega = \mathbb{R}^\infty$, $\mathcal{F} = (\mathcal{B}(\mathbb{R}))^\infty$. Then $P = \prod_{j\ge 1} \lambda_j$, $Q = \prod_{j\ge 1} \mu_j$. Let $\xi_n(\omega) \equiv \omega(n)$, the $n$th co-ordinate of $\omega = (\omega_1, \omega_2, \ldots) \in \Omega$, and $\mathcal{F}_n \equiv \sigma\langle \xi_j : 1 \le j \le n \rangle$. Also let $P_n$ be the restriction of $P$ to $\mathcal{F}_n$ and $Q_n$ be that of $Q$ to $\mathcal{F}_n$. Then $P_n \ll Q_n$ with probability density

$$L_n = \frac{dP_n}{dQ_n} = \prod_{j=1}^n f_j(\xi_j).$$

Since $\{\limsup_{n\to\infty} L_n < \infty\}$ is a tail event, by the independence of $\{\xi_j\}_{j\ge 1}$ under $P$ and Kolmogorov's zero-one law, $P(\limsup_{n\to\infty} L_n < \infty) = 0$ or 1. Now, by Corollary 13.4.4, (4.5) follows. □

Remark 13.4.4: It can be shown that $P \ll Q$ or $P \perp Q$ according as $\prod_{j=1}^\infty E\sqrt{f_j(Y_j)} > 0$ or $= 0$. For a proof, see Durrett (2004).

Remark 13.4.5: If $\{X_j\}_{j\ge 1}$ are iid and $\{Y_j\}_{j\ge 1}$ are also iid, then $P = Q$ or $P \perp Q$. This is because $f_j = f_1$ for all $j$ and $E_Q\sqrt{f_1} \le (E_Q f_1)^{1/2} = 1$, and so either $E_Q\sqrt{f_1} < 1$ or $E_Q\sqrt{f_1} = 1$. In the latter case $f_1 \equiv 1$, since $E_Q(\sqrt{f_1})^2 = 1 = E_Q(\sqrt{f_1}) \Rightarrow \mathrm{Var}_Q(\sqrt{f_1}) = 0$.

Remark 13.4.6: The above result can be extended to Markov chains. Let $P$ and $Q$ be two irreducible stochastic matrices on a countable set and let $Q$ be positive recurrent. Also, let $P_{x_0}$ denote the distribution of a Markov chain $\{X_n\}_{n\ge 1}$ starting at $x_0$ and with transition probability $P$, and similarly, let $Q_{y_0}$ denote the distribution of a Markov chain $\{Y_n\}_{n\ge 1}$ starting at $y_0$ and with transition probability $Q$. Then either

$$P_{x_0} \perp Q_{y_0} \quad \text{or} \quad P = Q. \tag{4.6}$$

The proof of this is left as an exercise (Problem 13.23).

13.4.6 de Finetti's theorem

Let $\{X_n\}_{n\ge 1}$ be a sequence of exchangeable random variables on a probability space $(\Omega, \mathcal{F}, P)$, i.e., for each $n \ge 1$, the distribution of $(X_{\sigma(1)}, X_{\sigma(2)}, \ldots, X_{\sigma(n)})$ is the same as that of $(X_1, X_2, \ldots, X_n)$, where $(\sigma(1), \sigma(2), \ldots, \sigma(n))$ is a permutation of $(1, 2, \ldots, n)$. Then there is a σ-algebra $\mathcal{G} \subset \mathcal{F}$ such that for each $n \ge 1$,

$$P(X_i \in B_i,\ i = 1, 2, \ldots, n \mid \mathcal{G}) = \prod_{i=1}^n P(X_i \in B_i \mid \mathcal{G}) \tag{4.7}$$

for all $B_1, \ldots, B_n \in \mathcal{B}(\mathbb{R})$. This is known as de Finetti's theorem. For a proof, see Durrett (2004) and Chow and Teicher (1997). This theorem says that conditioned on $\mathcal{G}$, the $\{X_i\}_{i\ge 1}$ are iid random variables with distribution $P(X_1 \in \cdot \mid \mathcal{G})$. The converse to this result, i.e., that if (4.7) holds for some σ-algebra $\mathcal{G} \subset \mathcal{F}$, then the sequence $\{X_i\}_{i\ge 1}$ is exchangeable, is not difficult to verify (Problem 13.26).

13.5 Problems

13.1 Let $\Omega$ be a nonempty set and let $\{A_j\}_{j\ge 1}$ be a countable partition of $\Omega$. For $n \ge 1$, let $\mathcal{F}_n$ = σ-algebra generated by $\{A_j\}_{j=1}^n$.
(a) Show that $\{\mathcal{F}_n\}_{n\ge 1}$ is a filtration.
(b) Find $\mathcal{F}_\infty = \sigma\langle \bigcup_{n\ge 1} \mathcal{F}_n \rangle$.

13.2 Let $\Omega$ be a nonempty set. For each $n \ge 1$, let $\pi_n \equiv \{A_{nj} : j = 1, 2, \ldots, k_n\}$ be a partition of $\Omega$. Suppose that for each $n$ and $j$, $A_{nj}$ is a union of sets of $\pi_{n+1}$. Let $\mathcal{F}_n \equiv \sigma\langle \pi_n \rangle$ for $n \ge 1$.
(a) Show that $\{\mathcal{F}_n\}_{n\ge 1}$ is a filtration.
(b) Suppose $\Omega = [0, 1)$ and $\pi_n \equiv \{[\frac{j-1}{2^n}, \frac{j}{2^n}) : j = 1, 2, \ldots, 2^n\}$. Show that $\mathcal{F}_\infty = \sigma\langle \bigcup_{n\ge 1} \mathcal{F}_n \rangle$ is the Borel σ-algebra $\mathcal{B}([0, 1))$.

13.3 Let $\{(Y_n, \mathcal{F}_n) : n \ge 1\}$ and $\{(\tilde{Y}_n, \mathcal{F}_n) : n \ge 1\}$ be as in Example 13.1.2. Verify that $\{(Y_n, \mathcal{F}_n) : n \ge 1\}$ is a sub-martingale and $\{(\tilde{Y}_n, \tilde{\mathcal{F}}_n) : n \ge 1\}$ is a martingale.

13.4 Give an example of a random variable $T$ and two filtrations $\{\mathcal{F}_n\}_{n\ge 1}$ and $\{\mathcal{G}_n\}_{n\ge 1}$ such that $T$ is a stopping time w.r.t. the filtration $\{\mathcal{F}_n\}_{n\ge 1}$ but not w.r.t. $\{\mathcal{G}_n\}_{n\ge 1}$.

13.5 Let $T_1$ and $T_2$ be stopping times w.r.t. a filtration $\{\mathcal{F}_n\}_{n\ge 1}$. Verify that $\min(T_1, T_2)$, $\max(T_1, T_2)$, $T_1 + T_2$, and $T_1^2$ are stopping times w.r.t. $\{\mathcal{F}_n\}_{n\ge 1}$. Give an example to show that $\sqrt{T_1}$ and $T_1 - 1$ need not be stopping times w.r.t. $\{\mathcal{F}_n\}_{n\ge 1}$.

13.6 Let $T$ be a random variable taking values in $\{1, 2, 3, \ldots\}$. Show that there is a filtration $\{\mathcal{F}_n\}_{n\ge 1}$ w.r.t. which $T$ is a stopping time.

13.7 Let $\{\mathcal{F}_n\}_{n\ge 1}$ be a filtration.
(a) Show that $T$ is a stopping time w.r.t. $\{\mathcal{F}_n\}_{n\ge 1}$ iff $\{T \le n\} \in \mathcal{F}_n$ for all $n \ge 1$.
(b) Show by an example that if a random variable $T$ satisfies $\{T \ge n\} \in \mathcal{F}_n$ for all $n \ge 1$, it need not be a stopping time w.r.t. $\{\mathcal{F}_n\}_{n\ge 1}$.
(Hint: Consider a $T$ of the form $T = \inf\{k : k \ge 1, X_{k+1} \in A\}$ and $\mathcal{F}_n = \sigma\langle X_j : j \le n \rangle$.)

13.8 Show that $\mathcal{F}_T$ defined in (2.6) is a σ-algebra.

13.9 Let $\{X_n\}_{n\ge 1}$ be a sequence of random variables. Let $\mathcal{G}_n = \sigma\langle X_j : 1 \le j \le n \rangle$. Let $\{\mathcal{F}_n\}_{n\ge 1}$ be a filtration such that $\mathcal{G}_n \subset \mathcal{F}_n$ for each $n \ge 1$.
(a) Show that if $\{X_n, \mathcal{F}_n\}_{n\ge 1}$ is a martingale, then so is $\{X_n, \mathcal{G}_n\}_{n\ge 1}$.
(b) Show by an example that the converse need not be true.
(c) Let $\{X_n, \mathcal{F}_n\}_{n\ge 1}$ be a martingale. Let $1 \le k_1 < k_2 < k_3 < \cdots$ be a sequence of integers. Let $Y_n \equiv X_{k_n}$, $\mathcal{H}_n \equiv \mathcal{F}_{k_n}$, $n \ge 1$. Show that $\{Y_n, \mathcal{H}_n\}_{n\ge 1}$ is also a martingale.

13.10 A branching random walk is a branching process and a random walk associated with it. Individuals reproduce according to a branching process and the offspring move away from the parent a random distance. If $X_n \equiv \{x_{n1}, x_{n2}, \ldots, x_{nZ_n}\}$ denotes the position vector of the $Z_n$ individuals in the $n$th generation and the individual at location $x_{ni}$ produces $\rho_{ni}$ offspring, then each of them chooses a new position by moving a random distance from $x_{ni}$, and these are assumed to be iid. Let $\eta_{nij}$ be the random distance moved by the $j$th offspring of the individual at $x_{ni}$. Then the position vector of the $(n+1)$st generation is given by

$$X_{n+1} = \bigl\{ \{x_{ni} + \eta_{nij}\}_{j=1}^{\rho_{ni}},\ i = 1, 2, \ldots, Z_n \bigr\} \equiv \{x_{n+1,k} : k = 1, 2, \ldots, Z_{n+1}\}, \text{ say},$$

where

$$Z_{n+1} = \text{population size of the } (n+1)\text{st generation} = \sum_{i=1}^{Z_n} \rho_{ni}.$$

Let the offspring distribution be $\{p_k\}_{k\ge 0}$ and the jump size distribution be denoted by $F(\cdot)$. Assume that the $\eta$'s are real valued and also that the collection $\{\rho_{ni}\}_{i\ge 1, n\ge 0}$, $\{\eta_{nij}\}_{i\ge 1, j\ge 1, n\ge 0}$ are all independent, with the $\rho$'s being iid with distribution $\{p_k\}_{k\ge 0}$ and the $\eta$'s iid with distribution $F$. Fix $\theta \in \mathbb{R}$. For $n \ge 0$, let

$$Z_n(\theta) \equiv \sum_{i=1}^{Z_n} e^{\theta x_{ni}} \quad \text{and} \quad Y_n(\theta) = Z_n(\theta) \bigl(\rho \phi(\theta)\bigr)^{-n},$$

where $\rho = \sum_{k=0}^\infty k p_k$, $\phi(\theta) = E(e^{\theta \eta_{111}}) = \int e^{\theta x}\, dF(x)$. Assume $0 < \phi(\theta) < \infty$, $0 < \rho < \infty$.

0, $\lambda_0 > 0$, $E\bigl(|X_n| I(|X_n| > \lambda)\bigr) \le E\bigl(|X| I(|X_n| > \lambda)\bigr) \le E\bigl(|X| I(|X| > \lambda_0)\bigr) + \lambda_0 P(|X_n| > \lambda)$.)

13.21 Consider the following urn scheme due to Pólya. Let an urn contain $w_0$ white and $b_0$ black balls at time $n = 0$. A ball is drawn from the urn at random. It is returned to the urn with one more ball of the color drawn. Repeat this procedure for all $n \ge 1$. Let $W_n$ and $B_n$ denote the number of white and black balls in the urn after $n$ draws. Let $Z_n = \frac{W_n}{W_n + B_n}$, $n \ge 0$. Let $\mathcal{F}_n = \sigma\langle Z_0, Z_1, \ldots, Z_n \rangle$.
(a) Show that $\{(Z_n, \mathcal{F}_n)\}_{n\ge 0}$ is a martingale.
(b) Conclude that $Z_n$ converges w.p. 1 and in $L^1$ to a random variable $Z$.
(c) Show that for any $k \in \mathbb{N}$, $EZ_n^k$ converges as $n \to \infty$ and evaluate the limit. Deduce that $Z$ has the Beta$(w_0, b_0)$ distribution, i.e., its pdf is $f_Z(z) \equiv \frac{(w_0 + b_0 - 1)!}{(w_0 - 1)!\,(b_0 - 1)!} z^{w_0 - 1} (1 - z)^{b_0 - 1} I_{[0,1]}(z)$.
(d) Generalize (a) to the case when at the $n$th stage a random number $\alpha_n$ of balls of the color drawn are added, where $\{\alpha_n\}_{n\ge 1}$ is any sequence of nonnegative integer valued random variables.

13.22 Prove Corollary 13.4.6. (Hint: Argue as in Example 13.1.7.)

13.23 Prove the last equation (4.6) of Section 13.4. (Hint: Show using the strong law for the $Q$ chain that under $Q$, the martingale $X_n$ converges to 0 w.p. 1, where $X_n$ is the Radon-Nikodym derivative of $P_{x_0}((X_0, \ldots, X_n) \in \cdot)$ w.r.t. $Q_{x_0}((X_0, \ldots, X_n) \in \cdot)$.)

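13.24 Let $\{\mathcal{F}_n\}_{n\ge 0}$ be a filtration $\subset \mathcal{F}$, where $(\Omega, \mathcal{F}, P)$ is a probability space. Let $\{Y_n\}_{n\ge 0} \subset L^1(\Omega, \mathcal{F}, P)$. Suppose

$$Z \equiv \sup_{n\ge 1} |Y_n| \in L^1(\Omega, \mathcal{F}, P) \quad \text{and} \quad \lim_{n\to\infty} Y_n \equiv Y \text{ exists w.p. 1}.$$

Show that $E(Y_n|\mathcal{F}_n) \to E(Y|\mathcal{F}_\infty)$ w.p. 1. (Hint: Fix $m \ge 1$ and let $V_m = \sup_{n\ge m} |Y_n - Y|$. Show that

$$\limsup_{n} E(|Y_n - Y|\,|\,\mathcal{F}_n) \le \lim_{n\to\infty} E(V_m|\mathcal{F}_n) = E(V_m|\mathcal{F}_\infty).$$

Now show that $E(V_m|\mathcal{F}_\infty) \to 0$ as $m \to \infty$.)

13.25 Let $\{X_t, \mathcal{F}_t : t \in I \equiv \mathbb{Q} \cap (0, 1)\}$ be a martingale, i.e., for all $t_1 < t_2$ in $I$, $E(X_{t_2}|\mathcal{F}_{t_1}) = X_{t_1}$. Show that for each $t$ in $I$,

$$\lim_{s\uparrow t,\, s\in I} X_s \quad \text{and} \quad \lim_{s\downarrow t,\, s\in I} X_s$$

both exist w.p. 1 and in $L^1$ and equal $X_t$ w.p. 1.

13.26 Let $\{X_i\}_{i\ge 1}$ be random variables on a probability space $(\Omega, \mathcal{F}, P)$. Suppose for some σ-algebra $\mathcal{G} \subset \mathcal{F}$, (4.7) holds. Show that $\{X_i\}_{i\ge 1}$ are exchangeable.

13.27 Let $\{X_n\}_{n\ge 0}$, $\{Y_n\}_{n\ge 0}$ be martingales in $L^2(\Omega, \mathcal{F}, P)$ w.r.t. the same filtration $\{\mathcal{F}_n\}_{n\ge 0}$. Let $X_0 = Y_0 = 0$. Show that

$$E(X_n Y_n) = \sum_{k=1}^n E(X_k - X_{k-1})(Y_k - Y_{k-1}), \quad n \ge 1,$$

and, in particular,

$$E(X_n^2) = \sum_{k=1}^n E(X_k - X_{k-1})^2.$$

13.28 Let $\{X_n, \mathcal{F}_n\}_{n\ge 1}$ be a martingale in $L^2(\Omega, \mathcal{F}, P)$. Suppose $0 \le b_n \uparrow \infty$ is such that $\sum_{j=2}^\infty \frac{E(X_j - X_{j-1})^2}{b_j^2} < \infty$. Show that $\frac{X_n}{b_n} \to 0$ w.p. 1.
(Hint: Consider the sequence $Y_n \equiv \sum_{j=2}^n \frac{(X_j - X_{j-1})}{b_j}$, $n \ge 2$. Verify that $\{Y_n, \mathcal{F}_n\}_{n\ge 2}$ is an $L^2$ bounded martingale and use Kronecker's lemma (cf. Chapter 8).)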

13.29 Let $f \in L^1([0, 1], \mathcal{B}([0, 1]), m)$, where $m(\cdot)$ is Lebesgue measure on $[0, 1]$. Let $\{H_k(\cdot)\}_{k\ge 1}$ be the Haar functions defined by

$$H_1(t) \equiv 1, \qquad H_2(t) \equiv \begin{cases} 1 & 0 \le t < \frac{1}{2} \\ -1 & \frac{1}{2} \le t < 1, \end{cases}$$

$$H_{2^n+1}(t) = \begin{cases} 2^{n/2} & 0 \le t < 2^{-(n+1)} \\ -2^{n/2} & 2^{-(n+1)} \le t < 2^{-n} \\ 0 & \text{otherwise}, \end{cases} \quad n = 1, 2, \ldots,$$

$$H_{2^n+j}(t) = H_{2^n+1}\Bigl(t - \frac{j-1}{2^n}\Bigr), \quad j = 1, 2, \ldots, 2^n.$$

Let $a_k \equiv \int_0^1 f(t) H_k(t)\, dt$, $k = 1, 2, \ldots$.
(a) Verify that $\{X_n(t) \equiv \sum_{k=1}^n a_k H_k(t)\}_{n\ge 1}$ is a martingale w.r.t. the natural filtration.

(b) Show that $X_n$ converges w.p. 1 and in $L^1$ to $f$.

13.30 Let $\{X_n\}_{n\ge 1}$ be a sequence of nonnegative random variables on some probability space $(\Omega, \mathcal{F}, P)$ such that $E(X_{n+1}|\mathcal{F}_n) \le X_n + Y_n$, where $\mathcal{F}_n \equiv \sigma\langle X_1, \ldots, X_n \rangle$ and $\{Y_n\}_{n\ge 1}$ is a sequence of nonnegative constants such that $\sum_{n=1}^\infty Y_n < \infty$. Show that $\{X_n\}_{n\ge 1}$ converges w.p. 1.

13.31 Let $\{\tau_j\}_{j\ge 1}$ be independent exponential random variables with $\lambda_j = E\tau_j$, $j \ge 1$, such that $\sum_{j=1}^\infty \lambda_j^2 < \infty$. Let $T_0 = 0$, $T_n = \sum_{j=1}^n \tau_j$, $n \ge 1$, $s_n = \sum_{j=1}^n \lambda_j$. Show that $\{X_n \equiv T_n - s_n\}_{n\ge 1}$ converges w.p. 1 and in mean square. (Hint: Show that $\{X_n\}_{n\ge 1}$ is an $L^2$-bounded martingale.)

14 Markov Chains and MCMC

14.1 Markov chains: Countable state space

14.1.1 Definition

Let $S = \{a_j : j = 1, 2, \ldots, K\}$, $K \le \infty$, be a finite or countable set. Let $P = ((p_{ij}))_{K\times K}$ be a stochastic matrix, i.e., $p_{ij} \ge 0$ and $\sum_{j=1}^K p_{ij} = 1$ for every $i$, and let $\mu = \{\mu_j : 1 \le j \le K\}$ be a probability distribution, i.e., $\mu_j \ge 0$ for all $j$ and $\sum_{j=1}^K \mu_j = 1$.

Definition 14.1.1: A sequence $\{X_n\}_{n=0}^\infty$ of $S$-valued random variables on some probability space $(\Omega, \mathcal{F}, P)$ is called a Markov chain with stationary transition probabilities $P = ((p_{ij}))$, initial distribution $\mu$, and state space $S$ if
(i) $X_0 \sim \mu$, i.e., $P(X_0 = a_j) = \mu_j$ for all $j$, and
(ii) $P(X_{n+1} = a_j \mid X_n = a_i, X_{n-1} = a_{i_{n-1}}, \ldots, X_0 = a_{i_0}) = p_{ij}$ for all $a_i, a_j, a_{i_{n-1}}, \ldots, a_{i_0} \in S$ and $n = 0, 1, 2, \ldots$,
i.e., the sequence is memoryless. Given $X_n$, $X_{n+1}$ is independent of $\{X_j : j \le n - 1\}$. More generally, given the present ($X_n$), the past ($\{X_j : j \le n - 1\}$) and the future ($\{X_j : j > n\}$) are stochastically independent (Problem 14.1).

A few questions arise:

Question 1: Does such a sequence $\{X_n\}_{n=0}^\infty$ exist for every $\mu$ and $P$, and if so, how does one generate it? The answer is yes. There are two approaches, namely, (i) using Kolmogorov's consistency theorem and (ii) an iid random iteration scheme.

Question 2: How does one describe the finite time behavior, i.e., the joint distribution of $(X_0, X_1, \ldots, X_n)$ for any $n \in \mathbb{N}$? One may use the Markov property repeatedly to obtain the joint distribution.

Question 3: What can one say about the long-term behavior? One can ask questions like:
(a) Does the trajectory $n \to X_n$ converge as $n \to \infty$?
(b) Does the distribution of $X_n$ converge as $n \to \infty$?
(c) Do the laws of large numbers hold for a suitable class of functions $f$, i.e., do the limits $\lim_{n\to\infty} \frac{1}{n} \sum_{j=1}^n f(X_j)$ exist w.p. 1?
(d) Do stationary distributions exist? (A probability distribution $\pi = \{\pi_i\}_{i\in S}$ is called a stationary distribution for a Markov chain $\{X_n\}_{n\ge 0}$ if, whenever $X_0$ has distribution $\pi$, $X_n$ also has distribution $\pi$ for all $n \ge 1$.)

The keys to answering these questions are the concepts of communication, irreducibility, aperiodicity, and, most importantly, recurrence. The main tools are the laws of large numbers, renewal theory, and coupling.

14.1.2 Examples

Example 14.1.1: (IID sequence). Let $\{X_n\}_{n=0}^\infty$ be a sequence of iid $S$-valued random variables with distribution $\mu = \{\mu_j\}$. Then $\{X_n\}_{n=0}^\infty$ is a Markov chain with initial distribution $\mu$ and transition probabilities given by $p_{ij} = \mu_j$ for all $i$, i.e., all rows of $P$ are identical. It is also easy to prove the converse, i.e., if all rows of $P$ are identical, then $\{X_n\}_{n=1}^\infty$ are iid and independent of $X_0$. To answer Question 3 in this case, note that $P[X_n = j] = \mu_j$ for all $n$ and thus $X_n$ converges in distribution. But the trajectories do not converge. However, the law of large numbers holds and $\mu$ is the unique stationary distribution.

Example 14.1.2: (Random walks). Let $S = \mathbb{Z}$, the set of integers. Let $\{\epsilon_n\}_{n\ge 1}$ be iid with distribution $\{p_j\}_{j\in\mathbb{Z}}$, i.e., $P[\epsilon_1 = j] = p_j$ for $j \in \mathbb{Z}$ and $\{\epsilon_n\}_{n\ge 1}$ are independent. Let $X_0$ be a $\mathbb{Z}$-valued random variable independent of $\{\epsilon_n\}_{n\ge 1}$. Then, define for $n \ge 0$,

$$X_{n+1} = X_n + \epsilon_{n+1} = X_{n-1} + \epsilon_n + \epsilon_{n+1} = \cdots = X_0 + \sum_{j=1}^{n+1} \epsilon_j.$$

In this case, with probability one, the trajectories of $X_n$ go to $+\infty$ (respectively, $-\infty$) if $E(\epsilon_1) > 0$ (respectively, $< 0$). If $E(\epsilon_1) = 0$, then the trajectories fluctuate infinitely often provided $p_0 \ne 1$.

Example 14.1.3: (Branching processes). Let $S = \mathbb{Z}_+ = \{0, 1, 2, \ldots\}$. Let $\{p_j\}_{j=0}^\infty$ be a probability distribution. Let $\{\xi_{ni}\}_{i\in\mathbb{N}, n\in\mathbb{Z}_+}$ be iid random variables with distribution $\{p_j\}_{j=0}^\infty$. Let $Z_0$ be a $\mathbb{Z}_+$-valued random variable independent of $\{\xi_{ni}\}$. Let

$$Z_{n+1} = \sum_{i=1}^{Z_n} \xi_{ni} \quad \text{for } n \ge 0.$$

If $p_0 = 0$ and $p_1 < 1$, then $Z_n \to \infty$ w.p. 1. If $p_0 > 0$, then $P[Z_n \to \infty] + P[Z_n \to 0] = 1$. Also, $P[Z_n \to 0 \mid Z_0 = 1] = q$ is the smallest solution in $[0, 1]$ to the equation

$$q = f(q) = \sum_{j=0}^\infty p_j q^j.$$

So $q = 1$ iff $m \equiv \sum_{j=1}^\infty j p_j = f'(1) \le 1$ (see Chapter 18 also).
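The characterization of $q$ as the smallest root of $q = f(q)$ in $[0, 1]$ suggests a simple numerical scheme: iterate $q \mapsto f(q)$ starting from 0, which increases monotonically to that smallest root. The following is a minimal sketch for an offspring distribution with finite support; the function names and the example distribution are illustrative assumptions, not taken from the text.

```python
def offspring_mean(p):
    """m = sum_j j * p_j for an offspring distribution p = [p_0, p_1, ...]."""
    return sum(j * pj for j, pj in enumerate(p))


def extinction_probability(p, iterations=500):
    """Approximate the smallest solution in [0, 1] of q = f(q) = sum_j p_j q**j.

    Iterates q <- f(q) from q = 0; the iterates increase to the smallest
    fixed point of the generating function f.
    """
    q = 0.0
    for _ in range(iterations):
        q = sum(pj * q ** j for j, pj in enumerate(p))
    return q


if __name__ == "__main__":
    p = [0.25, 0.25, 0.5]  # hypothetical offspring law with m = 1.25 > 1
    print(offspring_mean(p), extinction_probability(p))
```

For this example the fixed-point equation reduces to $2q^2 - 3q + 1 = 0$, with roots $1/2$ and $1$, and the iteration picks out the smaller one. When $m \le 1$ the same iteration approaches 1 instead.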

Example 14.1.4: (Birth and death chains). Again take $S = \mathbb{Z}_+$. Define $P$ by

$$p_{i,i+1} = \alpha_i, \quad p_{i,i-1} = \beta_i = 1 - \alpha_i, \quad \text{for } i \ge 1,$$
$$p_{0,1} = \alpha_0, \quad p_{0,0} = \beta_0 = 1 - \alpha_0.$$

From state $i \ge 1$, the population increases by one with probability $\alpha_i$ and decreases by one with probability $1 - \alpha_i$.

Example 14.1.5: (Iterated function systems). Let $G := \{h_i : h_i : S \to S, i = 1, 2, \ldots, L\}$, $L \le \infty$. Let $\mu = \{p_i\}_{i=1}^L$ be a probability distribution. Let $\{f_n\}_{n=1}^\infty$ be iid, such that $P(f_n = h_i) = p_i$, $1 \le i \le L$. Let $X_0$ be an $S$-valued random variable independent of $\{f_n\}_{n=1}^\infty$. Then, the iid random iteration scheme

$$X_1 = f_1(X_0), \quad X_2 = f_2(X_1), \quad \ldots, \quad X_{n+1} = f_{n+1}(X_n) = f_{n+1}\bigl(f_n(\cdots(f_1(X_0))\cdots)\bigr)$$

is a Markov chain with transition probability matrix

$$p_{ij} = P(f_1(i) = j) = \sum_{r=1}^L p_r I\bigl(h_r(i) = j\bigr).$$

Remark 14.1.1: Any discrete state space Markov chain can be generated in this way (see II in 14.1.3 below).

14.1.3 Existence of a Markov chain

I. Kolmogorov's approach. Let $\Omega = S^{\mathbb{Z}_+} = \{\omega : \omega \equiv \{x_n\}_{n=0}^\infty, x_n \in S \text{ for all } n\}$ be the set of all sequences $\{x_n\}_{n=0}^\infty$ with values in $S$. Let $\mathcal{F}_0$ consist of all finite dimensional subsets of $\Omega$ of the form

$$A = \bigl\{\omega : \omega = \{x_n\}_{n=0}^\infty,\ x_j = a_j,\ 0 \le j \le m\bigr\},$$

where $m < \infty$ and $a_j \in S$ for all $j = 0, 1, \ldots, m$. Let $\mathcal{F}$ be the σ-algebra generated by $\mathcal{F}_0$. Fix $\mu$ and $P$. For $A$ as above let

$$\lambda_{\mu,P}(A) := \mu_{a_0} p_{a_0 a_1} p_{a_1 a_2} \cdots p_{a_{m-1} a_m}.$$

Then it can be shown, using the extension theorem from Chapter 2 or Kolmogorov's consistency theorem of Chapter 6, that $\lambda_{\mu,P}$ can be extended to be a probability measure on $\mathcal{F}$. Let $X_n(\omega) = x_n$, if $\omega = \{x_n\}_{n=0}^\infty$, be the coordinate projection. Then $\{X_n\}_{n=0}^\infty$ will be a sequence of $S$-valued random variables on $(\Omega, \mathcal{F}, \lambda_{\mu,P})$, such that it is a Markov chain with initial distribution $\mu$ and transition probability $P$. A typical element $\omega = \{x_n\}_{n=0}^\infty$ of $\Omega$ is called a sample path or a sample trajectory. The following are examples of events (sets) in $\mathcal{F}$ which are not finite-dimensional:

$$A_1 = \Bigl\{\omega : \lim_{n\to\infty} \frac{1}{n} \sum_{j=1}^n h(x_j) \text{ exists}\Bigr\} \text{ for a given } h : S \to \mathbb{R},$$
$$A_2 = \bigl\{\omega : \text{the set of limit points of } \{x_n\}_{n=0}^\infty = \{a, b\}\bigr\}.$$

Thus, it is essential to go to $(\Omega, \mathcal{F}, \lambda_{\mu,P})$ to discuss the events involving asymptotic (long term) behavior, i.e., as $n \to \infty$.

II. IIDRM approach (iteration of iid random maps). Let $P = ((p_{ij}))_{K\times K}$ be a stochastic matrix. Let $f : S \times [0, 1] \to S$ be

$$f(a_i, u) = \begin{cases} a_1 & \text{if } 0 \le u < p_{i1} \\ a_2 & \text{if } p_{i1} \le u < p_{i1} + p_{i2} \\ \ \vdots & \\ a_j & \text{if } p_{i1} + p_{i2} + \cdots + p_{i(j-1)} \le u < p_{i1} + p_{i2} + \cdots + p_{ij} \\ \ \vdots & \\ a_K & \text{if } p_{i1} + p_{i2} + \cdots + p_{i(K-1)} \le u < 1. \end{cases} \tag{1.1}$$

Let $U_1, U_2, \ldots$ be iid Uniform $[0, 1]$ random variables. Let $f_n(\cdot) := f(\cdot, U_n)$. Then for each $n$, $f_n$ maps $S$ to $S$. Also $\{f_n\}_{n=1}^\infty$ are iid.
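The map $f$ in (1.1) is an inverse-CDF lookup on row $i$ of $P$, and feeding it fresh uniforms produces a trajectory. The following is a minimal sketch for a finite state space with states labelled $0, \ldots, K-1$ (a labelling chosen here for convenience, in place of $a_1, \ldots, a_K$); the function names are hypothetical.

```python
import random


def f(P, i, u):
    """Return the state j with p_i0 + ... + p_i(j-1) <= u < p_i0 + ... + p_ij.

    P is a list of rows of a stochastic matrix; states are 0, ..., K-1.
    """
    cumulative = 0.0
    for j, pij in enumerate(P[i]):
        cumulative += pij
        if u < cumulative:
            return j
    # Guard against floating-point rounding when u is extremely close to 1.
    return len(P[i]) - 1


def simulate_chain(P, x0, n_steps, rng):
    """Iterate the random maps: X_{n+1} = f(X_n, U_{n+1}) with iid uniforms U_n."""
    path = [x0]
    for _ in range(n_steps):
        path.append(f(P, path[-1], rng.random()))
    return path


if __name__ == "__main__":
    P = [[0.5, 0.5], [0.25, 0.75]]
    print(simulate_chain(P, x0=0, n_steps=20, rng=random.Random(0)))
```

Because each step uses only the current state and a fresh uniform, the resulting sequence has one-step transition probabilities given by the rows of `P`.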

Let $X_0$ be independent of $\{U_i\}_{i=1}^\infty$ and $X_0 \sim \mu$. Then the sequence $\{X_n\}_{n=0}^\infty$ defined by

$$X_{n+1} = f_{n+1}(X_n) = f(X_n, U_{n+1})$$

is a Markov chain with initial distribution $\mu$ and transition probability $P$. The underlying probability space on which $X_0$ and $\{U_i\}_{i=1}^\infty$ are defined can be taken to be the Lebesgue space $([0, 1], \mathcal{B}([0, 1]), m)$, where $m$ is the Lebesgue measure.

Finite Time Behavior of $\{X_n\}$

For each $n \in \mathbb{N}$,

$$P(X_0 = a_0, X_1 = a_1, \ldots, X_n = a_n) = \Bigl[\prod_{j=1}^n P(X_j = a_j \mid X_{j-1} = a_{j-1}, \ldots, X_0 = a_0)\Bigr] P(X_0 = a_0) = \Bigl[\prod_{j=1}^n p_{a_{j-1} a_j}\Bigr] \mu_{a_0}. \tag{1.2}$$

Thus, the joint distribution for any finite $n$ is determined by $\mu$ and $P$. In particular,

$$P(X_n = a_n \mid X_0 = a_0) = \sum \prod_{j=1}^n p_{a_{j-1} a_j} = (P^n)_{a_0 a_n}, \tag{1.3}$$

where the sum in the middle term runs over all $a_1, a_2, \ldots, a_{n-1}$ and $P^n$ is the $n$th power of $P$. So the behavior of the distribution of $X_n$ can be studied via that of $P^n$ for large $n$. But this analytic approach is not as comprehensive as the probabilistic one, via the concept of recurrence, which will be described next.
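Identities (1.2) and (1.3) can be checked directly on a small chain. The sketch below is illustrative only: states are labelled $0, \ldots, K-1$, the matrix in the usage example is hypothetical, and the helper names are not from the text. It computes the path probability in (1.2), the brute-force sum over intermediate states in (1.3), and the matrix power $P^n$ for comparison.

```python
from itertools import product


def path_probability(mu, P, path):
    """mu[a_0] * prod_j P[a_{j-1}][a_j], the joint probability in (1.2)."""
    prob = mu[path[0]]
    for a, b in zip(path, path[1:]):
        prob *= P[a][b]
    return prob


def n_step_by_path_sum(P, a0, an, n):
    """Middle term of (1.3): sum over all intermediate states a_1, ..., a_{n-1}."""
    total = 0.0
    for middle in product(range(len(P)), repeat=n - 1):
        path = (a0, *middle, an)
        term = 1.0
        for a, b in zip(path, path[1:]):
            term *= P[a][b]
        total += term
    return total


def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    K = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(K)]
            for i in range(K)]


def mat_pow(P, n):
    """n-th power of P by repeated multiplication (n >= 0)."""
    K = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(K)] for i in range(K)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result


if __name__ == "__main__":
    P = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3], [0.4, 0.4, 0.2]]
    print(n_step_by_path_sum(P, 0, 2, 4), mat_pow(P, 4)[0][2])
```

The path sum has $K^{n-1}$ terms, so it is only feasible for small $n$, whereas the matrix power needs $n$ multiplications; this is one practical reason for studying $P^n$.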

14.1.4

Limit theory

Let {Xn }∞ n=0 be a Markov chain with state S = {1, 2, . . . , K}, K ≤ space ∞, and transition probability matrix P = (pij ) K×K . Deﬁnition 14.1.2: (Hitting times). For any set A ⊂ S, the hitting time TA is deﬁned as TA = inf{n : Xn ∈ A, n ≥ 1} i.e., it is the ﬁrst time after 0 that the chain enters A. The random variable TA is also called the ﬁrst passage time for A or the ﬁrst entrance time for A. Note that TA is a stopping time (cf. Chapter 13) w.r.t. the ﬁltration {Fn ≡ σ{Xj : 0 ≤ j ≤ n}}n≥0 .

444

14. Markov Chains and MCMC

By convention, $\inf \emptyset = \infty$. If $A = \{i\}$, then write $T_{\{i\}} = T_i$ for notational simplicity. A concept of fundamental importance is recurrence of a state.

Definition 14.1.3: (Recurrence). A state $i$ is recurrent or transient according as $P_i[T_i < \infty] = 1$ or $< 1$, where $P_i$ denotes the probability distribution of $\{X_n\}_{n=0}^{\infty}$ with $X_0 = i$ with probability 1. Thus $i$ is recurrent iff
$$f_{ii} \equiv P\big(X_n = i \text{ for some } 1 \le n < \infty \mid X_0 = i\big) = 1. \qquad (1.4)$$

Definition 14.1.4: (Null and positive recurrence). A recurrent state $i$ is called null recurrent if $E_i(T_i) = \infty$ and positive recurrent if $E_i(T_i) < \infty$, where $E_i$ refers to expectation w.r.t. the probability distribution $P_i$.

Example 14.1.6: (Frog in the well). Let $S = \{1, 2, \ldots\}$ and $P = ((p_{ij}))$ be given by
$$p_{i,i+1} = \alpha_i \quad \text{and} \quad p_{i,1} = 1 - \alpha_i \quad \text{for all } i \ge 1,\ 0 < \alpha_i < 1.$$
Then $P_1[T_1 > r] = \alpha_1 \alpha_2 \cdots \alpha_r$. So $P_1[T_1 < \infty] = 1$ iff $\prod_{i=1}^{r} \alpha_i \to 0$ as $r \to \infty$ iff $\sum_{i=1}^{\infty} (1 - \alpha_i) = \infty$. Further, 1 is positive recurrent iff $\sum_{r=1}^{\infty} \prod_{i=1}^{r} \alpha_i < \infty$. Thus, if $\alpha_i = 1 - \frac{1}{2i^2}$, then 1 is transient; but if $\alpha_i \equiv \alpha$, $0 < \alpha < 1$ for all $i$, then 1 is positive recurrent. If $\alpha_i = 1 - \frac{1}{ci}$, $c > 1$, then 1 is null recurrent (Problem 14.2).

Example 14.1.7: (Simple random walk). Let $S = \mathbb{Z}$, the set of all integers, $p_{i,i+1} = p$, $p_{i,i-1} = q$, $0 < p = 1 - q < 1$. Then it can be shown by using the SLLN that for $p \ne \frac{1}{2}$, 0 is transient. But it is harder to show that for $p = \frac{1}{2}$, 0 is null recurrent (see Corollary 14.1.5 below).

The next result says that after each return to $i$, the Markov chain starts afresh.

Proposition 14.1.1: (The strong Markov property). For any $i \in S$ and any initial distribution $\mu$ of $X_0$ and any $k < \infty$, $a_1, \ldots, a_k$ in $S$,
$$P_\mu(X_{T_i + j} = a_j,\ j = 1, 2, \ldots, k,\ T_i < \infty) = P_\mu(T_i < \infty)\, P_i(X_j = a_j,\ j = 1, 2, \ldots, k).$$

Proof: For any $n \in \mathbb{N}$,
$$\begin{aligned}
P_\mu(X_{T_i + j} = a_j,\ 1 \le j \le k,\ T_i = n)
&= P_\mu(X_{n+j} = a_j,\ 1 \le j \le k,\ X_n = i,\ X_r \ne i,\ 1 \le r \le n-1) \\
&= P_\mu(X_{n+j} = a_j,\ 1 \le j \le k \mid X_n = i,\ X_r \ne i,\ 1 \le r \le n-1) \\
&\qquad \cdot P_\mu(X_n = i,\ X_r \ne i,\ 1 \le r \le n-1) \\
&= P_i(X_j = a_j,\ 1 \le j \le k)\, P_\mu(X_n = i,\ X_r \ne i,\ 1 \le r \le n-1) \\
&= P_i(X_j = a_j,\ 1 \le j \le k)\, P_\mu(T_i = n).
\end{aligned}$$
Adding both sides over $n$ yields the result. □
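The three regimes of Example 14.1.6 can be checked numerically from the identity $P_1[T_1 > r] = \alpha_1 \cdots \alpha_r$ together with $E_1(T_1) = \sum_{r \ge 0} P_1[T_1 > r]$. The following is a minimal sketch, not part of the original text; the function names `tail_probs` and `mean_return_time`, the truncation levels, and the choice $c = 2$ are assumptions made for illustration.

```python
from itertools import accumulate
from operator import mul

def tail_probs(alpha, r_max):
    """Return [P_1(T_1 > r) for r = 0..r_max], the running products of alpha(1..r)."""
    factors = [alpha(i) for i in range(1, r_max + 1)]
    return [1.0] + list(accumulate(factors, mul))

def mean_return_time(alpha, r_max):
    """Truncated E_1(T_1) = sum_{r >= 0} P_1(T_1 > r), cut off at r_max."""
    return sum(tail_probs(alpha, r_max))

# Constant alpha: geometric tails, finite mean 1/(1 - alpha) (positive recurrent).
const_mean = mean_return_time(lambda i: 0.5, 200)

# alpha_i = 1 - 1/(2 i^2): the tail probability stays away from 0 (transient).
transient_tail = tail_probs(lambda i: 1 - 1 / (2 * i * i), 10_000)[-1]

# alpha_i = 1 - 1/(c i) with c = 2 (illustrative choice): the tails vanish,
# but their partial sums keep growing (null recurrent).
null_tails = tail_probs(lambda i: 1 - 1 / (2 * i), 10_000)
```

Because the sums are truncated, the null recurrent case only shows a slowly growing partial sum; the divergence itself is the content of Problem 14.2.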

The strong Markov property leads to the important and useful technique of breaking up the time evolution of a Markov chain into iid cycles. This, combined with the law of large numbers, yields the basic convergence results.

Definition 14.1.5: (IID cycles). Let $i$ be a state. Let $T_i^{(0)} = 0$ and
$$T_i^{(k+1)} = \begin{cases} \inf\{n : n > T_i^{(k)},\ X_n = i\}, & \text{if } T_i^{(k)} < \infty, \\ \infty, & \text{if } T_i^{(k)} = \infty, \end{cases} \qquad (1.5)$$
i.e., for $k = 0, 1, 2, \ldots$, $T_i^{(k)}$ is the successive return times to state $i$.

Proposition 14.1.2: Let $i$ be a recurrent state. Then $P_i\big(T_i^{(k)} < \infty\big) = 1$ for all $k \ge 1$.

Proof: By definition of recurrence, the claim is true for $k = 1$. If it is true for $k > 1$, then $P_i(T_i^{(k+1)}$ [...] 0. A pair of states $i$, $j$ are said to communicate if $i \to j$ and $j \to i$, i.e., if there exist $n \ge 1$, $m \ge 1$ such that $p_{ij}^{(n)} > 0$, $p_{ji}^{(m)} > 0$.

Definition 14.1.7: (Irreducibility). A Markov chain with state space $S \equiv \{1, 2, \ldots, K\}$, $K \le \infty$ and transition probability matrix $P \equiv ((p_{ij}))$ is irreducible if for each $i$, $j$ in $S$, $i$ and $j$ communicate.

Definition 14.1.8: A state $i$ is absorbing if $p_{ii} = 1$.

It is easy to show that if $i$ is absorbing and $j \to i$, then $j$ is transient (Problem 14.4).

Proposition 14.1.7: (Solidarity property). Let $i$ be recurrent and $i \to j$. Then $f_{ji} = 1$ and $j$ is recurrent, where $f_{ji} \equiv P(T_i < \infty \mid X_0 = j)$.

Proof: By the (strong) Markov property,
$$\begin{aligned}
1 - f_{ii} = P(T_i = \infty \mid X_0 = i) &\ge P(T_j < \infty,\ T_i = \infty \mid X_0 = i) \\
&= P(T_i = \infty \mid X_0 = j)\, P(T_j < T_i \mid X_0 = i) \\
&= (1 - f_{ji})\, f_{ij}^{*},
\end{aligned}$$
where $f_{ij}^{*} = P(\text{visiting } j \text{ before visiting } i \mid X_0 = i)$. (Intuitively speaking, one possibility of not returning to $i$, starting from $i$, is to visit $j$ and then not return to $i$.) Now $i$ recurrent and $i \to j$ yield $1 - f_{ii} = 0$ and $f_{ij}^{*} > 0$ (Problem 14.4) and so $1 - f_{ji} = 0$, i.e., $f_{ji} = 1$. Thus, starting from $j$, the chain visits $i$ w.p. 1. From $i$, it keeps returning to $i$ infinitely often. In each of these excursions, the probability $f_{ij}^{*}$ of visiting $j$ is positive, and since there are an infinite number of such excursions and they are iid, the chain does visit $j$ in one of these excursions w.p. 1. That is, $j$ is recurrent. □

An alternate proof using Corollary 14.1.5 is also possible (Problem 14.5).

Proposition 14.1.8: In a finite state space irreducible Markov chain all states are recurrent.

Proof: By Corollary 14.1.6, there is at least one state $i_0$ that is recurrent. By irreducibility and solidarity, this implies all states are recurrent.

Remark 14.1.2: A stronger result holds, namely, that for a finite state space irreducible Markov chain, all states are positive recurrent (Problem 14.6).


Theorem 14.1.9: (A law of large numbers). Let $i$ be positive recurrent. Let
$$N_n(j) = \#\{k : 0 \le k \le n,\ X_k = j\}, \quad j \in S \qquad (1.9)$$
be the number of visits to $j$ during $\{0, 1, \ldots, n\}$. Let $L_n(j) \equiv \frac{N_n(j)}{n+1}$ be the empirical distribution at time $n$. Let $X_0 = i$, with probability 1. Then
$$L_n(j) \to \frac{V_{ij}}{E_i(T_i)}, \quad \text{w.p. 1} \qquad (1.10)$$
where $V_{ij} = E_i\Big(\sum_{k=0}^{T_i - 1} \delta_{X_k, j}\Big)$ is the mean number of visits to $j$ during $\{0, 1, \ldots, T_i - 1\}$ starting from $i$. In particular, $L_n(i) \to \frac{1}{E_i T_i}$, w.p. 1.

Proof: For each $n$, let $k \equiv k(n)$ be such that $T_i^{(k)} \le n < T_i^{(k+1)}$. Then,
$$N_{T_i^{(k)}}(j) \le N_n(j) \le N_{T_i^{(k+1)}}(j),$$
i.e.,
$$\sum_{r=0}^{k} \xi_r \le N_n(j) \le \sum_{r=0}^{k+1} \xi_r,$$
where $\xi_r = \#\{l : T_i^{(r)} \le l < T_i^{(r+1)},\ X_l = j\}$ is the number of visits to $j$ during the $r$th excursion. Since $V_{ij} \equiv E_i \xi_1 \le E_i T_i < \infty$, by the SLLN, with probability 1,
$$\frac{1}{k(n)} \sum_{r=0}^{k(n)} \xi_r \to E_i(\xi_1) = V_{ij}$$
and $\frac{1}{k} T_i^{(k)} \to E_i(T_i)$. Note that $n$ is between $T_i^{(k(n))}$ and $T_i^{(k(n)+1)}$, so $\frac{n}{k(n)} \to E_i T_i$. Since
$$\frac{k}{n} \cdot \frac{1}{k} \sum_{r=0}^{k} \xi_r \le \frac{N_n(j)}{n} \le \frac{k+1}{n} \cdot \frac{1}{k+1} \sum_{r=0}^{k+1} \xi_r,$$
it follows that
$$L_n(j) = \frac{n}{n+1} \cdot \frac{N_n(j)}{n} \to \frac{E_i(\xi_1)}{E_i(T_i)}.$$
□
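The convergence in (1.10) can be observed on a simulated path. The sketch below is illustrative only: the two-state matrix `P`, the seed, and the helper name `empirical_distribution` are assumptions, not taken from the text. For this matrix the limit $V_{0j}/E_0(T_0)$ works out to $(5/6, 1/6)$.

```python
import random

def empirical_distribution(P, start, n, seed=0):
    """Simulate X_0..X_n and return L_n(j) = N_n(j)/(n+1) for each state j."""
    rng = random.Random(seed)
    states = range(len(P))
    counts = [0] * len(P)
    x = start
    counts[x] += 1                      # the visit at time 0
    for _ in range(n):
        x = rng.choices(states, weights=P[x])[0]   # one step of the chain
        counts[x] += 1
    return [c / (n + 1) for c in counts]

# Hypothetical two-state chain used only for this illustration.
P = [[0.9, 0.1],
     [0.5, 0.5]]
L = empirical_distribution(P, start=0, n=200_000)
```

The empirical frequencies in `L` should be close to $(5/6, 1/6)$; how close depends on the path length and the seed.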

Note that the above proof works for any initial distribution µ such that Pµ (Ti < ∞) = 1 and further, the limit of Ln (j) is independent of any such µ. Thus, if (S, P ) is irreducible and one state i is positive recurrent, then for any initial distribution µ, Pµ (Ti < ∞) = 1. Note ﬁnally that the proof in Theorem 14.1.9 can be adapted to yield a criterion for transience, null recurrence, and positive recurrence of a state. Thus, the following holds. Corollary 14.1.10: Fix a state i. Then


(i) $i$ is transient iff $\lim_{n \to \infty} N_n(i)$ exists and is finite w.p. 1 for any initial distribution iff
$$\lim_{n \to \infty} E_i N_n(i) = \sum_{k=0}^{\infty} p_{ii}^{(k)} < \infty.$$

(ii) $i$ is null recurrent iff $\sum_{k=0}^{\infty} p_{ii}^{(k)} = \infty$ and
$$\lim_{n \to \infty} E_i L_n(i) = \lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n} p_{ii}^{(k)} = 0.$$

(iii) $i$ is positive recurrent iff
$$\lim_{n \to \infty} E_i L_n(i) = \lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n} p_{ii}^{(k)} > 0.$$

(iv) Let $(S, P)$ be irreducible and let one state $i$ be positive recurrent. Then, for any $j$ and any initial distribution $\mu$,
$$L_n(j) \to \frac{1}{E_j T_j} \in (0, \infty) \quad \text{w.p. 1}.$$

Thus, for the symmetric simple random walk on the integers,
$$p_{00}^{(2k)} \sim \frac{c}{\sqrt{k}} \quad \text{as } k \to \infty,\ 0 < c < \infty,$$
and hence
$$\sum_{n=0}^{\infty} p_{00}^{(n)} = \infty \quad \text{and} \quad \frac{1}{n} \sum_{k=0}^{n} p_{00}^{(k)} \to 0.$$
Thus 0 is null recurrent.

It is not difficult to show that if $j$ leads to $i$, i.e., if $f_{ji} \equiv P_j(T_i < \infty)$ is positive, then the number $\xi_1$ of visits to $j$ between consecutive visits to $i$ has all moments (Problem 14.9). Using the SLLN, Theorem 14.1.9 can be extended as follows to cover both null and positive recurrent cases.

Theorem 14.1.11: Let $i$ be a recurrent state. Then, for any $j$ and any initial distribution $\mu$ such that $P_\mu(T_i < \infty) = 1$,
$$L_n(j) \equiv \frac{1}{n+1} \sum_{k=0}^{n} \delta_{X_k j} \to \pi_{ij} \equiv \frac{V_{ij}}{E_i T_i}, \quad \text{w.p. 1} \qquad (1.11)$$
as $n \to \infty$, where $V_{ij} < \infty$. (If $E_i T_i = \infty$, then $\pi_{ij} = 0$ for all $j$.)
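For the symmetric simple random walk discussed above, the return probabilities are explicit, $p_{00}^{(2k)} = \binom{2k}{k} 4^{-k}$, so the asymptotics and the null recurrence criterion of Corollary 14.1.10 (ii) can be checked directly. This is a small numerical sketch; the function name `p00` and the cutoff $k = 1000$ are illustrative choices.

```python
from math import comb, pi, sqrt

def p00(n):
    """n-step return probability to 0 of the symmetric simple random walk on Z."""
    if n % 2:
        return 0.0                       # odd times: return is impossible
    k = n // 2
    return comb(2 * k, k) / 4 ** k       # exact integers, converted by one division

k = 1000
ratio = p00(2 * k) * sqrt(pi * k)                  # near 1, i.e. c = 1/sqrt(pi)
partial_sum = sum(p00(n) for n in range(2001))     # grows like sqrt(n)
cesaro = partial_sum / 2000                        # shrinks like 1/sqrt(n)
```

The partial sums grow without bound while the averages tend to 0, which is the numerical face of null recurrence.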


Corollary 14.1.12: Let $(S, P)$ be irreducible and let one state be recurrent. Then, for any $j$ and any initial distribution,
$$L_n(j) \to c_j \quad \text{as } n \to \infty, \quad \text{w.p. 1}, \qquad (1.12)$$
where $c_j = 1/E_j T_j$ if $E_j T_j < \infty$ and $c_j = 0$ otherwise.

The Basic Ergodic Theorem

Taking expectation in Theorem 14.1.11 and using the bounded convergence theorem leads to

Corollary 14.1.13: Let $i$ be recurrent. Then, for any initial distribution $\mu$ with $P_\mu(T_i < \infty) = 1$,
$$E_\mu(L_n(j)) = \frac{1}{n+1} \sum_{k=0}^{n} P_\mu(X_k = j) \to \frac{V_{ij}}{E_i(T_i)} := \pi_{ij} \quad \text{as } n \to \infty. \qquad (1.13)$$

Theorem 14.1.14: Let $i$ be positive recurrent. Let $\pi_j := \pi_{ij}$ for $j \in S$. Then $\{\pi_j\}_{j \in S}$ is a stationary distribution for $P$, i.e., $\sum_j \pi_j = 1$ and $\sum_{l \in S} \pi_l p_{lj} = \pi_j$ for all $j$.

Proof:
$$\begin{aligned}
\frac{1}{n+1} \sum_{k=0}^{n} p_{ij}^{(k)}
&= \frac{1}{n+1} \delta_{ij} + \frac{1}{n+1} \sum_{k=1}^{n} p_{ij}^{(k)} \\
&= \frac{1}{n+1} \delta_{ij} + \frac{1}{n+1} \sum_{k=1}^{n} \sum_{l} p_{il}^{(k-1)} p_{lj} \\
&= \frac{1}{n+1} \delta_{ij} + \sum_{l} \Big( \frac{1}{n+1} \sum_{k=1}^{n} p_{il}^{(k-1)} \Big) p_{lj}.
\end{aligned}$$
Taking the limit as $n \to \infty$ and using Fatou's lemma yields
$$\pi_j \ge \sum_{l} \pi_l p_{lj}. \qquad (1.14)$$
If strict inequality were to hold for some $j_0$, then adding over $j$ would yield
$$\sum_j \pi_j > \sum_j \sum_l \pi_l p_{lj} = \sum_l \pi_l \sum_j p_{lj} = \sum_l \pi_l.$$
Since $\sum_{j \in S} V_{ij} = E_i T_i$, $\sum_j \pi_j = 1$, so there cannot be a strict inequality in (1.14) for any $j$. □

Therefore, the following has been established.


Theorem 14.1.15: Let $i$ be a positive recurrent state. Let
$$\pi_j := \frac{E_i\Big(\sum_{k=0}^{T_i - 1} \delta_{X_k, j}\Big)}{E_i(T_i)}.$$
Then

(i) $\{\pi_j\}_{j \in S}$ is a stationary distribution.

(ii) For any $j$ and any initial distribution $\mu$ with $P_\mu(T_i < \infty) = 1$,

(a) $L_n(j) \equiv \frac{1}{n+1} \#\{k : X_k = j,\ 0 \le k \le n\} \to \pi_j$ w.p. 1 $(P_\mu)$,

(b) $\frac{1}{n+1} \sum_{k=0}^{n} P_\mu(X_k = j) \to \pi_j$.

In particular, if $j = i$, we have
$$\frac{1}{n+1} \sum_{k=0}^{n} p_{ii}^{(k)} \to \pi_i = \frac{1}{E_i(T_i)}. \qquad (1.15)$$

Now let $i$ be a positive recurrent state and $j$ be such that $i \to j$. Then $V_{ij} > 0$ and by the solidarity property, $f_{ji} = 1$ and $j$ is recurrent. Now taking $\mu = \delta_j$ in (ii) above and using Corollary 14.1.13 leads to the conclusions that
$$\frac{1}{n+1} \sum_{k=0}^{n} p_{jj}^{(k)} \to \pi_j > 0 \qquad (1.16)$$
and
$$\pi_j = \frac{1}{E_j T_j}. \qquad (1.17)$$
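For a finite chain, the excursion formula of Theorem 14.1.15 can be evaluated without simulation by summing the taboo probabilities $P_i(X_n = j,\ T_i > n)$ over $n$. The sketch below is one possible way to do this, not the method of the text; the function name `excursion_visits`, the truncation `n_terms`, and the three-state matrix (the one that appears in Example 14.1.8) are choices made here for illustration.

```python
def excursion_visits(P, i, n_terms=2000):
    """Approximate V_ij = sum_{n >= 0} P_i(X_n = j, T_i > n) for every state j."""
    m = len(P)
    v = [0.0] * m
    v[i] = 1.0                      # n = 0 term: the chain sits at i
    V = v[:]
    for _ in range(n_terms):
        # One step of the chain applied to the sub-probability vector v.
        v = [sum(v[l] * P[l][j] for l in range(m)) for j in range(m)]
        v[i] = 0.0                  # discard paths that have already returned to i
        V = [a + b for a, b in zip(V, v)]
    return V

P = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0]]
V = excursion_visits(P, i=0)
mean_return = sum(V)                       # E_i(T_i) = sum_j V_ij
pi = [x / mean_return for x in V]          # pi_j = V_ij / E_i(T_i)
pi_P = [sum(pi[l] * P[l][j] for l in range(3)) for j in range(3)]
```

Here `pi` comes out as $(1/4, 1/2, 1/4)$ with `mean_return` equal to 4, and `pi_P` agrees with `pi`, in line with parts (i) and (1.15) of the theorem.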

Thus, $j$ is positive recurrent. Now Theorem 14.1.15 leads to the basic ergodic theorem for Markov chains.

Theorem 14.1.16: Let $(S, P)$ be irreducible and let one state be positive recurrent. Then

(i) all states are positive recurrent,

(ii) $\pi \equiv \{\pi_j \equiv (E_j T_j)^{-1} : j \in S\}$ is a stationary distribution for $(S, P)$,

(iii) for any initial distribution $\mu$ and any $j \in S$

(a) $\frac{1}{n+1} \sum_{k=0}^{n} P_\mu(X_k = j) \to \pi_j$, i.e., $\pi$ is the unique limiting distribution (in the average sense) and hence the unique stationary distribution,

(b) $L_n(j) \equiv \frac{1}{n+1} \sum_{k=0}^{n} \delta_{X_k j} \to \pi_j$ w.p. 1 $(P_\mu)$.

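There is a converse to the above result. To develop this, note first that if $j$ is a transient state, then the total number $N_j$ of visits to $j$ is finite w.p. 1 (for any initial distribution $\mu$) and hence $L_n(j) \to 0$ w.p. 1 and, taking expectations, for any $i$,
$$\frac{1}{n+1} \sum_{k=0}^{n} p_{ij}^{(k)} \to 0 \quad \text{as } n \to \infty.$$

Now, suppose that $\pi \equiv \{\pi_j : j \in S\}$ is a stationary distribution for $(S, P)$. Then, for all $j$, $\pi_j = \sum_{i \in S} \pi_i p_{ij}$ and hence
$$\pi_j = \sum_{i \in S,\, r \in S} \pi_r p_{ri} p_{ij} = \sum_{r \in S} \pi_r p_{rj}^{(2)}$$
and by induction,
$$\pi_j = \sum_{i \in S} \pi_i p_{ij}^{(k)} \quad \text{for all } k \ge 0,$$
implying
$$\pi_j = \sum_{i \in S} \pi_i \Big( \frac{1}{n+1} \sum_{k=0}^{n} p_{ij}^{(k)} \Big).$$

Now if $j$ is transient, then for any $i$,
$$\frac{1}{n+1} \sum_{k=0}^{n} p_{ij}^{(k)} \to 0 \quad \text{as } n \to \infty$$
and so by the bounded convergence theorem
$$\pi_j = \lim_{n \to \infty} \sum_{i \in S} \pi_i \Big( \frac{1}{n+1} \sum_{k=0}^{n} p_{ij}^{(k)} \Big) = \sum_{i \in S} \pi_i \cdot 0 = 0.$$

Thus, $\pi_j > 0$ implies $j$ is recurrent. For $j$ recurrent, it follows from arguments similar to those used to establish Theorem 14.1.15 that
$$\frac{1}{n} \sum_{k=0}^{n} p_{ij}^{(k)} \to f_{ij} \frac{1}{E_j T_j}.$$
Thus, $\pi_j = \Big( \sum_{i \in S} \pi_i f_{ij} \Big) \frac{1}{E_j T_j}$. But $\sum_{i \in S} \pi_i f_{ij} \ge \pi_j f_{jj} = \pi_j > 0$. So $E_j T_j < \infty$, i.e., $j$ is positive recurrent. Summarizing the above discussion leads to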


Proposition 14.1.17: Let $\pi \equiv \{\pi_j : j \in S\}$ be a stationary distribution for $(S, P)$. Then, $\pi_j > 0$ implies that $j$ is positive recurrent.

It is now possible to state a converse to Theorem 14.1.16.

Theorem 14.1.18: Let $(S, P)$ be irreducible and admit a stationary distribution $\pi \equiv \{\pi_j : j \in S\}$. Then,

(i) all states are positive recurrent,

(ii) $\pi \equiv \{\pi_j : \pi_j = \frac{1}{E_j T_j},\ j \in S\}$ is the unique stationary distribution,

(iii) for any initial distribution $\mu$ and for all $j \in S$,

(a) $\frac{1}{n+1} \sum_{k=0}^{n} P_\mu(X_k = j) \to \pi_j$,

(b) $\frac{1}{n+1} \sum_{k=0}^{n} \delta_{X_k j} \to \pi_j$ w.p. 1 $(P_\mu)$.

In summary, for an irreducible Markov chain $(S, P)$ with a countable state space, a stationary distribution $\pi$ exists iff all states are positive recurrent. In that case, $\pi$ is unique and, for any initial distribution $\mu$, the distribution at time $n$ converges to $\pi$ in the Cesaro sense (i.e., on average) and the law of large numbers (LLN) holds. For the finite state space case, irreducibility suffices (Problem 14.6). If $h : S \to \mathbb{R}$ is a function such that $\sum_{j \in S} |h(j)| \pi_j < \infty$, then the LLN can be strengthened to
$$\frac{1}{n+1} \sum_{k=0}^{n} h(X_k) \to \sum_{j=0}^{\infty} h(j) \pi_j \quad \text{w.p. 1 } (P_\mu)$$
for any $\mu$ (Problem 14.10). In particular, if $A$ is any subset of $S$, then
$$L_n(A) \equiv \frac{1}{n+1} \sum_{k=0}^{n} I_A(X_k) \to \pi(A) \equiv \sum_{j \in A} \pi_j$$
w.p. 1 $(P_\mu)$ for any $\mu$.

An important question that remains is whether the convergence of $P_\mu(X_n = j)$ to $\pi_j$ can be strengthened from the average sense to the full sense, i.e., from the convergence to $\pi_j$ of $\frac{1}{n+1} \sum_{k=0}^{n} P_\mu(X_k = j)$ to the convergence to $\pi_j$ of $P_\mu(X_n = j)$ as $n \to \infty$. For this, the additional hypothesis needed is aperiodicity.

Definition 14.1.9: For any state $i$, the period $d_i$ of the state $i$ is the $\text{g.c.d.}\{n : n \ge 1,\ p_{ii}^{(n)} > 0\}$.


Further, $i$ is called aperiodic if $d_i = 1$.

Example 14.1.8: Let $S = \{0, 1, 2\}$ and
$$P = \begin{pmatrix} 0 & 1 & 0 \\ \frac{1}{2} & 0 & \frac{1}{2} \\ 0 & 1 & 0 \end{pmatrix}.$$
Then $d_i = 2$ for all $i$. Note that in this example, since $(S, P)$ is finite and irreducible, it has a unique stationary distribution, given by $\pi = (\frac{1}{4}, \frac{1}{2}, \frac{1}{4})$, and
$$\frac{1}{n+1} \sum_{k=0}^{n} p_{00}^{(k)} \to \frac{1}{4} \quad \text{as } n \to \infty,$$
but $p_{00}^{(2n+1)} = 0$ for each $n$ and $p_{00}^{(2n)} \to \frac{1}{2}$ as $n \to \infty$. This suggests that aperiodicity will be needed. It turns out that if $(S, P)$ is irreducible, the period $d_i$ is the same for all $i$ (Problem 14.5).

Theorem 14.1.19: Let $(S, P)$ be irreducible, positive recurrent and aperiodic (i.e., $d_i = 1$ for all $i$). Let $\{X_n\}_{n \ge 0}$ be a $(S, P)$ Markov chain. Then, for any initial distribution $\mu$, $\lim_{n \to \infty} P_\mu(X_n = i) \equiv \pi_i$ exists for all $i$, where $\pi \equiv \{\pi_j : j \in S\}$ is the unique stationary distribution.

There are many proofs known for this, and two of them are outlined below. The first uses the discrete renewal theorem and the second uses a coupling argument.

Proof 1: (via the discrete renewal theorem). Fix a state $i$. Recall that $p_{ii}^{(n)} = P(X_n = i \mid X_0 = i)$, $n \ge 0$, and $f_{ii}^{(n)} = P(T_i = n \mid X_0 = i)$, $n \ge 1$. Using the Markov property, for $n \ge 1$,
$$\begin{aligned}
p_{ii}^{(n)} = P(X_n = i \mid X_0 = i)
&= \sum_{k=1}^{n} P(X_n = i,\ T_i = k \mid X_0 = i) \\
&= \sum_{k=1}^{n} P(T_i = k \mid X_0 = i)\, P(X_n = i \mid X_k = i) \\
&= \sum_{k=1}^{n} f_{ii}^{(k)} p_{ii}^{(n-k)}.
\end{aligned}$$
Let $a_n \equiv p_{ii}^{(n)}$, $n \ge 0$, $p_n \equiv f_{ii}^{(n)}$, $n \ge 1$. Then $\{p_n\}_{n \ge 1}$ is a probability distribution and $\{a_n\}_{n \ge 0}$ satisfies the discrete renewal equation
$$a_n = b_n + \sum_{k=0}^{n} a_{n-k} p_k, \quad n \ge 0,$$
where $b_n = \delta_{n0}$ and $p_0 = 0$. It can be shown that $d_i = 1$ iff $\text{g.c.d.}\{k : k \ge 1,\ p_k > 0\} = 1$. Further, $E_i T_i = \sum_{k=1}^{\infty} k p_k < \infty$, by the assumption of positive recurrence. Now it follows from the discrete renewal theorem (see Section 8.5) that
$$\lim_{n \to \infty} a_n \text{ exists and equals } \frac{\sum_{j=0}^{\infty} b_j}{\sum_{k=1}^{\infty} k p_k} = \frac{1}{E_i T_i} = \pi_i.$$
□

Proof 2: (Using coupling arguments). Let $\{X_n\}_{n \ge 0}$ and $\{Y_n\}_{n \ge 0}$ be independent $(S, P)$ Markov chains such that $Y_0$ has distribution $\pi$ and $X_0$ has distribution $\mu$. Then $\{Z_n = (X_n, Y_n)\}_{n \ge 0}$ is a Markov chain with state space $S \times S$ and transition probability $P \times P \equiv ((p_{(i,j),(k,\ell)} = p_{ik} p_{j\ell}))$. Further, it can be shown (see Hoel, Port and Stone (1972)) that

(a) $\{\pi_{i,j} \equiv \pi_i \pi_j : (i, j) \in S \times S\}$ is a stationary distribution for $\{Z_n\}$,

(b) since $(S, P)$ is irreducible and aperiodic, the pair $(S \times S, P \times P)$ is irreducible.

Since $(S \times S, P \times P)$ is irreducible and admits a stationary distribution, it is necessarily recurrent and so from any initial distribution the first passage time $T_D$ for the diagonal $D \equiv \{(i, i) : i \in S\}$ is finite with probability one. Thus, $T_c \equiv \min\{n : n \ge 1,\ X_n = Y_n\}$ is finite w.p. 1. The random variable $T_c$ is called the coupling time. Let
$$\tilde{X}_n = \begin{cases} X_n, & n \le T_c, \\ Y_n, & n > T_c. \end{cases}$$
Then, it can be verified that $\{X_n\}_{n \ge 0}$ and $\{\tilde{X}_n\}_{n \ge 0}$ are identically distributed Markov chains. Thus
$$P(X_n = j) = P(\tilde{X}_n = j) = P(\tilde{X}_n = j,\ T_c < n) + P(\tilde{X}_n = j,\ T_c \ge n)$$
and
$$P(Y_n = j) = P(Y_n = j,\ T_c < n) + P(Y_n = j,\ T_c \ge n),$$
implying that
$$\big| P(X_n = j) - P(Y_n = j) \big| \le 2 P(T_c \ge n).$$
Since $P(T_c \ge n) \to 0$ as $n \to \infty$ and, by the stationarity of $\pi$, $P(Y_n = j) = \pi_j$ for all $n$ and $j$, it follows that for any $j$, $\lim_{n \to \infty} P(X_n = j)$ exists and $= \pi_j$. □
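The role of aperiodicity in Theorem 14.1.19 can be seen numerically on the chain of Example 14.1.8: the Cesaro averages of $p_{00}^{(k)}$ settle at $\pi_0 = 1/4$, while $p_{00}^{(k)}$ itself alternates between 0 (odd $k$) and $1/2$ (even $k \ge 2$). The sketch below is illustrative; the helper names and the "lazy" modification $(P + I)/2$, used here to produce an aperiodic chain with the same stationary distribution, are assumptions and do not come from the text.

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    m = len(A)
    return [[sum(A[i][l] * B[l][j] for l in range(m)) for j in range(m)]
            for i in range(m)]

def n_step_and_cesaro(P, i, j, n):
    """Return ([p_ij^(k) for k = 0..n], (1/(n+1)) * sum_k p_ij^(k))."""
    m = len(P)
    power = [[float(r == c) for c in range(m)] for r in range(m)]  # P^0 = I
    probs = []
    for _ in range(n + 1):
        probs.append(power[i][j])
        power = mat_mul(power, P)
    return probs, sum(probs) / (n + 1)

P = [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5], [0.0, 1.0, 0.0]]   # Example 14.1.8
probs, avg = n_step_and_cesaro(P, 0, 0, 1999)

# Lazy variant (P + I)/2: an assumption made here to get an aperiodic chain.
lazy = [[0.5 * P[r][c] + 0.5 * (r == c) for c in range(3)] for r in range(3)]
lazy_probs, _ = n_step_and_cesaro(lazy, 0, 0, 200)
```

For the periodic chain only `avg` approaches $1/4$; for the lazy chain the individual entries `lazy_probs[k]` do.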


In order to obtain results on rates of convergence for $|P(X_n = j) - \pi_j|$, one needs more assumptions on the distribution of the return time $T_i$ or the coupling time $T_c$. For results on this, the books of Hoel et al. (1972), Meyn and Tweedie (1993), and Lindvall (1992) are good sources. It can be shown that in the irreducible case if for some $i$, $P_i(T_i > n) = O(\lambda_1^n)$ for some $0 < \lambda_1 < 1$, then $\sum_{j \in S} |P_i(X_n = j) - \pi_j| = O(\lambda_2^n)$ for some $\lambda_1 < \lambda_2 < 1$. In particular, this geometric convergence holds for the finite state space irreducible case.

The main results of this section are summarized below.

Theorem 14.1.20: Let $\{X_n\}_{n \ge 0}$ be a Markov chain with a countable state space $S = \{0, 1, 2, \ldots, K\}$, $K \le \infty$ and transition probability matrix $P \equiv ((p_{ij}))$. Let $(S, P)$ be irreducible. Then,

(a) All states are recurrent iff for some $i$ in $S$,
$$\sum_{n=1}^{\infty} p_{ii}^{(n)} = \infty.$$

(b) All states are positive recurrent iff for some $i$ in $S$,
$$\lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n} p_{ii}^{(k)} \quad \text{exists and is strictly positive.}$$

(c) There exists a stationary probability distribution $\pi$ iff there exists a positive recurrent state.

(d) If there exists a stationary distribution $\pi \equiv \{\pi_j : j \in S\}$, then

(i) it is unique, all states are positive recurrent and for all $j \in S$, $\pi_j = (E_j T_j)^{-1}$,

(ii) for all $i, j \in S$,
$$\frac{1}{n+1} \sum_{k=0}^{n} p_{ij}^{(k)} \to \pi_j \quad \text{as } n \to \infty,$$

(iii) for any initial distribution and any $j \in S$,
$$\frac{1}{n+1} \sum_{k=0}^{n} \delta_{X_k j} \to \pi_j \quad \text{w.p. 1},$$

(iv) if $\sum_{j \in S} |h(j)| \pi_j < \infty$, then
$$\frac{1}{n+1} \sum_{k=0}^{n} h(X_k) \to \sum_{j \in S} h(j) \pi_j \quad \text{w.p. 1},$$

(v) if, in addition, $d_i = 1$ for some $i \in S$, then $d_j = 1$ for all $j \in S$ and for all $i, j$,
$$p_{ij}^{(n)} \to \pi_j \quad \text{as } n \to \infty.$$

14.2 Markov chains on a general state space

14.2.1 Basic definitions

Let $\{X_n\}_{n \ge 0}$ be a sequence of random variables with values in some space $S$ that is not necessarily finite or countable. The Markov property says that conditioned on $X_n, X_{n-1}, \ldots, X_0$, the distribution of $X_{n+1}$ depends only on $X_n$ and not on the past, i.e., $\{X_j : j \le n - 1\}$. When $S$ is not countable, to make this notion of Markov property precise, one needs the following setup.

Let $(S, \mathcal{S})$ be a measurable space. Let $(\Omega, \mathcal{F}, P)$ be a probability space and $\{X_n(\omega)\}_{n \ge 0}$ be a sequence of maps from $\Omega$ to $S$ such that for each $n$, $X_n$ is $\langle \mathcal{F}, \mathcal{S} \rangle$ measurable. Let $\mathcal{F}_n \equiv \sigma\langle X_j : 0 \le j \le n \rangle$ be the sub-$\sigma$-algebra of $\mathcal{F}$ generated by $\{X_j : 0 \le j \le n\}$. In what follows, for any sub-$\sigma$-algebra $\mathcal{Y}$ of $\mathcal{F}$, let $P(\cdot \mid \mathcal{Y})$ denote the conditional probability given $\mathcal{Y}$, as defined in Chapter 12.

Definition 14.2.1: The sequence $\{X_n\}_{n \ge 0}$ is a Markov chain if for all $A \in \mathcal{S}$,
$$P(X_{n+1} \in A \mid \mathcal{F}_n) = P(X_{n+1} \in A \mid \sigma\langle X_n \rangle) \quad \text{w.p. 1, for all } n \ge 0, \qquad (2.1)$$
for any initial distribution of $X_0$, where $\sigma\langle X_n \rangle$ is the sub-$\sigma$-algebra of $\mathcal{F}$ generated by $X_n$.

It is easy to verify that (2.1) holds for all $A \in \mathcal{S}$ iff for any bounded measurable $h$ from $(S, \mathcal{S})$ to $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$,
$$E(h(X_{n+1}) \mid \mathcal{F}_n) = E(h(X_{n+1}) \mid \sigma\langle X_n \rangle) \quad \text{w.p. 1 for all } n \ge 0 \qquad (2.2)$$
for any initial distribution of $X_0$. Another equivalent formulation, which makes the Markov property symmetric with respect to time, says that given the present, the past and future are independent.

Proposition 14.2.1: A sequence of random variables $\{X_n\}_{n \ge 0}$ satisfies (2.1) iff for any $\{A_j\}_{0}^{n+k} \subset \mathcal{S}$,
$$\begin{aligned}
&P\big(X_j \in A_j,\ j = 0, 1, 2, \ldots, n-1, n+1, \ldots, n+k \mid \sigma\langle X_n \rangle\big) \\
&\quad = P\big(X_j \in A_j,\ j = 0, 1, 2, \ldots, n-1 \mid \sigma\langle X_n \rangle\big) \times P\big(X_j \in A_j,\ j = n+1, \ldots, n+k \mid \sigma\langle X_n \rangle\big)
\end{aligned}$$
w.p. 1.

The proof is somewhat involved but not difficult. The countable state space case is easier (Problem 14.1). An important tool for studying Markov chains is the notion of a transition probability function.

Definition 14.2.2: A function $P : S \times \mathcal{S} \to [0, 1]$ is called a transition probability function on $S$ if

(i) for all $x$ in $S$, $P(x, \cdot)$ is a probability measure on $(S, \mathcal{S})$,

(ii) for all $A \in \mathcal{S}$, $P(\cdot, A)$ is an $\mathcal{S}$-measurable function from $(S, \mathcal{S}) \to [0, 1]$.

Under some general conditions guaranteeing the existence of regular conditional probabilities, the right side of (2.1) can be expressed as $P_n(X_n, A)$, where $P_n(\cdot, \cdot)$ is a transition probability function on $S$. In such a case, yet another formulation of the Markov property is in terms of the joint distributions of $\{X_0, X_1, \ldots, X_n\}$ for any finite $n$.

Proposition 14.2.2: A sequence $\{X_n\}_{n \ge 0}$ satisfies (2.1) iff for any $n \in \mathbb{N}$ and $A_0, A_1, \ldots, A_n \in \mathcal{S}$,
$$P(X_j \in A_j,\ j = 0, 1, 2, \ldots, n) = \int_{A_0} \cdots \int_{A_{n-2}} \int_{A_{n-1}} P_{n-1}(x_{n-1}, A_n)\, P_{n-2}(x_{n-2}, dx_{n-1}) \cdots P_1(x_0, dx_1)\, \mu_0(dx_0),$$
where $\mu_0(A) = P(X_0 \in A)$, $A \in \mathcal{S}$.

The proof is by induction and left as an exercise (Problem 14.16). In what follows, it will be assumed that such transition functions exist.

Definition 14.2.3: A sequence of $S$-valued random variables $\{X_n\}_{n \ge 0}$ is called a Markov chain with transition function $P(\cdot, \cdot)$ if (2.1) holds and the right side equals $P(X_n, A)$ for all $n \in \mathbb{N}$.

14.2.2 Examples

Example 14.2.1: (IID sequence). Let $\{X_n\}_{n \ge 0}$ be iid $S$-valued random variables with distribution $\mu$. Then $\{X_n\}_{n \ge 0}$ is a Markov chain with transition function $P(x, A) \equiv \mu(A)$ and initial distribution $\mu$.

Example 14.2.2: ((Additive) random walk in $\mathbb{R}^k$). Let $\{\eta_j\}_{j \ge 1}$ be iid $\mathbb{R}^k$-valued random variables with distribution $\nu$. Let $X_0$ be an $\mathbb{R}^k$-valued random variable independent of $\{\eta_j\}_{j \ge 1}$ and with distribution $\mu$. Let
$$X_{n+1} = X_n + \eta_{n+1} = X_0 + \sum_{j=1}^{n+1} \eta_j, \quad n \ge 0.$$
Then $\{X_n\}_{n \ge 0}$ is an $\mathbb{R}^k$-valued Markov chain with transition function $P(x, A) \equiv \nu(A - x)$ and initial distribution $\mu$.

Example 14.2.3: (Multiplicative random walk on $\mathbb{R}^+$). Let $\{\eta_n\}_{n \ge 1}$ be iid nonnegative random variables with distribution $\nu$ and $X_0$ be a nonnegative random variable with distribution $\mu$ and independent of $\{\eta_n\}_{n \ge 1}$. Let $X_{n+1} = X_n \eta_{n+1}$, $n \ge 0$. Then $\{X_n\}_{n \ge 0}$ is a Markov chain with state space $\mathbb{R}^+$, transition function $P(x, A) = \nu(x^{-1} A)$ if $x > 0$ and $I_A(0)$ if $x = 0$, and initial distribution $\mu$. This is a model for the value of a stock portfolio subject to random growth rates. Clearly, the above iteration scheme leads to $X_n = X_0 \cdot \prod_{i=1}^{n} \eta_i$, which when normalized appropriately leads to what is known as the geometric Brownian motion model in the financial mathematics literature.

Example 14.2.4: (AR(1) time series). Let $\rho \in \mathbb{R}$ and $\nu$ be a probability measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Let $\{\eta_j\}_{j \ge 1}$ be iid with distribution $\nu$ and $X_0$ be a random variable independent of $\{\eta_j\}_{j \ge 1}$ and with distribution $\mu$. Let $X_{n+1} = \rho X_n + \eta_{n+1}$, $n \ge 0$. Then $\{X_n\}_{n \ge 0}$ is an $\mathbb{R}$-valued Markov chain with transition function $P(x, A) \equiv \nu(A - \rho x)$ and initial distribution $\mu$.

Example 14.2.5: (Random AR(1) vector time series). Let $\{(A_i, b_i)\}_{i \ge 1}$ be iid such that $A_i$ is a $k \times k$ matrix and $b_i$ is a $k \times 1$ vector. Let $\mu$ be a probability measure on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$. Let $X_0$ be an $\mathbb{R}^k$-valued random variable independent of $\{(A_i, b_i)\}_{i \ge 1}$ and with distribution $\mu$. Let $X_{n+1} = A_{n+1} X_n + b_{n+1}$, $n \ge 0$. Then $\{X_n\}_{n \ge 0}$ is an $\mathbb{R}^k$-valued Markov chain with transition function $P(x, B) \equiv P(A_1 x + b_1 \in B)$ and initial distribution $\mu$.

Example 14.2.6: (Waiting time chain). Let $\{\eta_i\}_{i \ge 1}$ be iid real valued random variables with distribution $\nu$ and $X_0$ be independent of $\{\eta_i\}_{i \ge 1}$ with distribution $\mu$. Let
$$X_{n+1} = \max\{X_n + \eta_{n+1},\ 0\}.$$
Then $\{X_n\}_{n \ge 0}$ is a nonnegative valued Markov chain with transition function $P(x, A) \equiv P(\max\{x + \eta_1, 0\} \in A)$ and initial distribution $\mu$. In the queuing theory context, if $\eta_n$ represents the difference between the $n$th interarrival time and service time, then $X_n$ represents the waiting time at the $n$th arrival.

All the above are special cases of the following:

Example 14.2.7: (Iterated function system (IFS)). Let $(S, \mathcal{S})$ be a measurable space. Let $(\Omega, \mathcal{F}, P)$ be a probability space. Let $\{f_i(x, \omega)\}_{i \ge 1}$ be such that for each $i$, $f_i : S \times \Omega \to S$ is $\langle \mathcal{S} \times \mathcal{F}, \mathcal{S} \rangle$-measurable and the stochastic processes $\{f_i(\cdot, \omega)\}_{i \ge 1}$ are iid. Let $X_0$ be an $S$-valued random variable on $(\Omega, \mathcal{F}, P)$ with distribution $\mu$ and independent of $\{f_i(\cdot, \cdot)\}_{i \ge 1}$. Let $X_{n+1}(\omega) = f_{n+1}(X_n(\omega), \omega)$, $n \ge 0$. Then $\{X_n\}_{n \ge 0}$ is an $S$-valued Markov chain with transition function $P(x, A) \equiv P(f_1(x, \omega) \in A)$ and initial distribution $\mu$.

It turns out that when $S$ is a Polish space with $\mathcal{S}$ as the Borel $\sigma$-algebra and $P(\cdot, \cdot)$ is a transition function on $S$, any $S$-valued Markov chain $\{X_n\}_{n \ge 0}$ with transition function $P(\cdot, \cdot)$ can be generated by an IFS as in Example 14.2.7. For a proof, see Kifer (1988) and Athreya and Stenflo (2003).

When $\{f_i\}_{i \ge 1}$ are iid such that $f_1$ has only finitely many choices $\{h_j\}_{j=1}^{k}$, where each $h_j$ is an affine contraction on $\mathbb{R}^p$, then the Markov chain $\{X_n\}$ converges in distribution to some $\pi(\cdot)$. Further, the limit point set of $\{X_n\}$ coincides w.p. 1 with the support $M$ of the limit distribution $\pi(\cdot)$. This has been exploited by Barnsley and others to solve the inverse problem: given a compact set $M$ in $\mathbb{R}^p$, find an IFS $\{h_j\}_{j=1}^{k}$ of affine contractions so that by generating the Markov chain $\{X_n\}$, one can get an approximate picture of $M$. This is called data compression or image generation by Markov chain Monte Carlo. See Barnsley (1992) for details on this. More generally, when $\{f_i\}$ are Lipschitz maps, the following holds.

Theorem 14.2.3: Let $\{f_i(\cdot, \cdot)\}_{i \ge 1}$ be iid Lipschitz maps on $S$. Assume

(i) $E|\log s(f_1)| < \infty$ and $E \log s(f_1) < 0$, where
$$s(f_1) = \sup_{x \ne y} \frac{d\big(f_1(x, \omega), f_1(y, \omega)\big)}{d(x, y)}$$
and $d(\cdot, \cdot)$ is the metric on $S$, and

(ii) for some $x_0$, $E \log^{+} d(f_1(x_0, \omega), x_0) < \infty$.

Then,

(i) $\hat{X}_n(x, \omega) \equiv f_1\big(f_2(\ldots f_{n-1}(f_n(x, \omega), \omega) \ldots)\big)$ converges w.p. 1 to a random variable $\hat{X}(\omega)$ that is independent of $x$,

(ii) for all $x$, $X_n(x, \omega) \equiv f_n\big(f_{n-1}(\ldots (f_1(x, \omega), \omega) \ldots)\big)$ converges in distribution to $\hat{X}(\omega)$.

That (ii) is a consequence of (i) is clear since for each $n$, $x$, $X_n(x, \omega)$ and $\hat{X}_n(x, \omega)$ are identically distributed. The proof of (i) involves showing that $\{\hat{X}_n\}$ is a Cauchy sequence in $S$ w.p. 1 (Problem 14.17).
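The difference between the backward iterates $\hat{X}_n$ and the forward iterates $X_n$ in Theorem 14.2.3 is easy to see on a toy IFS. The sketch below assumes the maps $f_k(x) = (x + \eta_k)/2$ with $\eta_k$ iid uniform on $\{0, 1\}$, so that $s(f_1) = 1/2$; the function names, the seed, and the number of maps are illustrative choices. For this choice, with $x = 0$, the backward sequence satisfies $|\hat{X}_{n+1} - \hat{X}_n| \le 2^{-(n+1)}$ and is therefore Cauchy on every path.

```python
import random

def make_maps(n, seed=0):
    """Draw n iid maps f_k(x) = (x + eta_k)/2 with eta_k uniform on {0, 1}."""
    rng = random.Random(seed)
    etas = [rng.randint(0, 1) for _ in range(n)]
    return [lambda x, e=e: 0.5 * (x + e) for e in etas]

def backward(maps, x, n):
    """X-hat_n(x) = f_1(f_2(...f_n(x)...)): the newest map is applied first."""
    for f in reversed(maps[:n]):
        x = f(x)
    return x

def forward(maps, x, n):
    """X_n(x) = f_n(f_{n-1}(...f_1(x)...)): the usual Markov chain iteration."""
    for f in maps[:n]:
        x = f(x)
    return x

maps = make_maps(50)
back = [backward(maps, 0.0, n) for n in range(51)]
fwd = [forward(maps, 0.0, n) for n in range(51)]
back_steps = [abs(b - a) for a, b in zip(back, back[1:])]
```

The forward sequence `fwd` has the same one-dimensional distributions as `back` but keeps moving on each path, which is why only convergence in distribution is claimed for it.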

14.2.3 Chapman-Kolmogorov equations

Let $P(\cdot, \cdot)$ be a transition function on $(S, \mathcal{S})$. For each $n \ge 0$, define a sequence of functions $\{P^{(n)}(\cdot, \cdot)\}_{n \ge 0}$ by the iteration scheme
$$P^{(n+1)}(x, A) = \int_S P^{(n)}(y, A)\, P(x, dy), \quad n \ge 0, \qquad (2.3)$$
where $P^{(0)}(x, A) \equiv I_A(x)$. It can be verified by induction that for each $n$, $P^{(n)}(\cdot, \cdot)$ is a transition probability function.

Definition 14.2.4: $P^{(n)}(\cdot, \cdot)$ defined by (2.3) is called the $n$-step transition function generated by $P(\cdot, \cdot)$.

It is easy to show by induction that if $X_0 = x$ w.p. 1, then
$$P(X_n \in A) = P^{(n)}(x, A) \quad \text{for all } n \ge 0 \qquad (2.4)$$
(Problem 14.18). This leads to the Chapman-Kolmogorov equations.

Proposition 14.2.4: Let $P(\cdot, \cdot)$ be a transition probability function on $(S, \mathcal{S})$ and let $P^{(n)}(\cdot, \cdot)$ be defined by (2.3). Then for any $n, m \ge 0$,
$$P^{(n+m)}(x, A) = \int P^{(n)}(y, A)\, P^{(m)}(x, dy). \qquad (2.5)$$

Proof: The analytic verification is straightforward by induction. One can verify this probabilistically using the Markov property. Indeed, from (2.4) the left side of (2.5) is
$$\begin{aligned}
P_x(X_{n+m} \in A) &= E_x\big(P(X_{n+m} \in A \mid \mathcal{F}_m)\big) \\
&= E_x\big(P^{(n)}(X_m, A)\big) \quad \text{(by Markov property)} \\
&= \text{right-hand side of (2.5)},
\end{aligned}$$
where $E_x$, $P_x$ denote expectation and probability distribution of $\{X_n\}_{n \ge 0}$ when $X_0 = x$ w.p. 1. □

From the above, one sees that the study of the limit behavior of the distribution of $X_n$ as $n \to \infty$ can be reduced analytically to the study of the $n$-step transition probabilities. This in turn can be done in terms of the operator $P$ on the Banach space $B(S, \mathbb{R})$ of bounded measurable real valued functions from $S$ to $\mathbb{R}$ (with sup norm), defined by
$$(Ph)(x) \equiv E_x h(X_1) \equiv \int h(y)\, P(x, dy). \qquad (2.6)$$
It is easy to verify that $P$ is a positive bounded linear operator on $B(S, \mathbb{R})$ of norm one. The Chapman-Kolmogorov equation (2.4) is equivalent to saying that $E_x h(X_n) = (P^n h)(x)$. Thus, analytically the study of the limit distribution of $\{X_n\}_{n \ge 0}$ can be reduced to that of the sequence $\{P^n\}_{n \ge 0}$ of powers of the operator $P$. However, probabilistic approaches via the notion of Harris irreducibility and recurrence when applicable, and via the notion of Feller continuity when $S$ is a Polish space, are more fruitful and will be developed below.
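When $S$ is finite, a transition function is a stochastic matrix, the integral in (2.3) and (2.5) becomes a sum, and the operator in (2.6) is a matrix-vector product. The following sketch checks (2.5) and the identity $E_x h(X_n) = (P^n h)(x)$ in that special case; the matrix `P`, the function `h`, and all helper names are made up for illustration.

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    m = len(A)
    return [[sum(A[i][l] * B[l][j] for l in range(m)) for j in range(m)]
            for i in range(m)]

def n_step(P, n):
    """P^(n) from the recursion (2.3): P^(0) = I, P^(n+1) = P P^(n)."""
    m = len(P)
    result = [[float(r == c) for c in range(m)] for r in range(m)]
    for _ in range(n):
        result = mat_mul(P, result)
    return result

def apply_operator(P, h):
    """(P h)(x) = sum_y P(x, y) h(y), the finite-state version of (2.6)."""
    return [sum(p * hy for p, hy in zip(row, h)) for row in P]

# Hypothetical three-state transition matrix.
P = [[0.2, 0.8, 0.0],
     [0.3, 0.3, 0.4],
     [0.5, 0.0, 0.5]]
lhs = n_step(P, 5)                          # P^(n+m) with n = 3, m = 2
rhs = mat_mul(n_step(P, 2), n_step(P, 3))   # sum_y P^(m)(x, y) P^(n)(y, A)

h = [1.0, -2.0, 0.5]
via_operator = h
for _ in range(5):
    via_operator = apply_operator(P, via_operator)   # P applied five times to h
via_matrix = apply_operator(lhs, h)                   # E_x h(X_5) from P^(5)
```

Both comparisons agree up to floating-point rounding, which is all the finite case of Proposition 14.2.4 asserts.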

14.2.4 Harris irreducibility, recurrence, and minorization

14.2.4.1 Definition of irreducibility

Recall that a Markov chain $\{X_n\}_{n \ge 0}$ with a discrete state space $S$ and transition probability matrix $P \equiv ((p_{ij}))$ is irreducible if for every $i, j$ in $S$, $i$ leads to $j$, i.e.,
$$P(X_n = j \text{ for some } n \ge 1 \mid X_0 = i) \equiv P_i(T_j < \infty) \equiv f_{ij} > 0,$$
where $T_j = \min\{n : n \ge 1,\ X_n = j\}$ is the time of first visit to $j$. To generalize this to the case of general state spaces, one starts with the notion of first entrance time or hitting time (also called the first passage time).

Definition 14.2.5: For any $A \in \mathcal{S}$, the first entrance time to $A$ is defined as
$$T_A \equiv \begin{cases} \min\{n : n \ge 1,\ X_n \in A\}, \\ \infty \quad \text{if } X_n \notin A \text{ for any } n \ge 1. \end{cases}$$

Since the event $\{T_A = 1\} = \{X_1 \in A\}$ and for $k \ge 2$, $\{T_A = k\} = \{X_1 \notin A, X_2 \notin A, \ldots, X_{k-1} \notin A, X_k \in A\}$ is an element of $\mathcal{F}_k = \sigma\langle X_j : j \le k \rangle$, $T_A$ is a stopping time w.r.t. the filtration $\{\mathcal{F}_n\}_{n \ge 1}$ (cf. Chapter 13).

Definition 14.2.6: Let $\phi$ be a nonzero $\sigma$-finite measure on $(S, \mathcal{S})$. The Markov chain $\{X_n\}_{n \ge 0}$ (or equivalently, its transition function $P(\cdot, \cdot)$) is said to be $\phi$-irreducible (or irreducible in the sense of Harris with reference measure $\phi$) if for any $A$ in $\mathcal{S}$,
$$\phi(A) > 0 \Rightarrow L(x, A) \equiv P_x(T_A < \infty) > 0 \qquad (2.7)$$
for all $x$ in $S$. This says that if a set $A$ in $\mathcal{S}$ is considered important by the measure $\phi$ (i.e., $\phi(A) > 0$), then it is also considered important by the chain $\{X_n\}_{n \ge 0}$ starting from any $x$ in $S$. If $G(x, A) \equiv \sum_{n=1}^{\infty} P^{n}(x, A)$ is the Green's function associated with $P$, then (2.7) is equivalent to
$$\phi(A) > 0 \Rightarrow G(x, A) > 0 \quad \text{for all } x \in S, \qquad (2.8)$$
i.e., $\phi(\cdot)$ is dominated by $G(x, \cdot)$ for all $x$ in $S$.

14.2.4.2 Examples

Example 14.2.7: If $S$ is countable and $\phi$ is the counting measure on $S$, then the irreducibility of a Markov chain $\{X_n\}_{n \ge 0}$ with state space $S$ is the same as $\phi$-irreducibility.

Example 14.2.8: If $\{X_n\}_{n \ge 0}$ are iid with distribution $\nu$, then it is $\nu$-irreducible.

Example 14.2.9: It can be verified that the random walk with jump distribution $\nu$ (Example 14.2.2) is $\phi$-irreducible with the Lebesgue measure as $\phi$ if $\nu$ has a nonzero absolutely continuous component with a positive density on some open interval (Problem 14.19 (a)).

Example 14.2.10: The AR(1) with $\eta_1$ having a nontrivial absolutely continuous component can be shown to be $\phi$-irreducible for some $\phi$. On the other hand the AR(1) chain
$$X_{n+1} = \frac{X_n}{2} + \frac{\eta_n}{2}, \quad n \ge 0,$$
where $\{\eta_n\}_{n \ge 1}$ are iid Bernoulli $(\frac{1}{2})$ random variables, is not $\phi$-irreducible for any $\phi$. In general, if $\{X_n\}_{n \ge 0}$ is a Markov chain that has a discrete distribution for each $n$ and has a limit distribution that is nonatomic, then $\{X_n\}_{n \ge 0}$ cannot be $\phi$-irreducible for any $\phi$ (Problem 14.19 (b)).

The waiting time chain (Example 14.2.6) is irreducible w.r.t. $\phi \equiv \delta_0$, the delta measure at 0, if $P(\eta_1 < 0) > 0$ (Problem 14.20).

It can be shown that if $\{X_n\}_{n \ge 0}$ is $\phi$-irreducible for some $\sigma$-finite $\phi$, then there exists a probability measure $\psi$ such that $\{X_n\}_{n \ge 0}$ is $\psi$-irreducible and it is maximal in the sense that if $\{X_n\}_{n \ge 0}$ is $\tilde{\phi}$-irreducible for some $\tilde{\phi}$, then $\tilde{\phi}$ is dominated by $\psi$. See Nummelin (1984).

14.2.4.3 Harris recurrence

A Markov chain $\{X_n\}_{n \ge 0}$ that is Harris irreducible with reference measure $\phi$ is said to be Harris recurrent if
$$A \in \mathcal{S},\ \phi(A) > 0 \Rightarrow P_x(T_A < \infty) = 1 \quad \text{for all } x \text{ in } S. \qquad (2.9)$$
Recall that irreducibility requires only that $P_x(T_A < \infty)$ be $> 0$. When $S$ is countable and $\phi$ is the counting measure, this reduces to the usual notion of irreducibility and recurrence. If $S$ is not countable but has a singleton $\Delta$ such that $P_x(T_\Delta < \infty) = 1$ for all $x$ in $S$, then the chain $\{X_n\}_{n \ge 0}$ is Harris recurrent with respect to the measure $\phi(\cdot) \equiv \delta_\Delta(\cdot)$, the delta measure at $\Delta$. The waiting time chain (Example 14.2.6) has such a $\Delta$ in 0 if $E\eta_1 < 0$ (Problem 14.20). If such a recurrent singleton $\Delta$ exists, then the sample paths of $\{X_n\}_{n \ge 0}$ can be broken into iid excursions by looking at the chain between consecutive returns to $\Delta$. This in turn will allow a complete extension of the basic limit theory from the countable case to this special case. In general, such a singleton may not exist. For example, for the AR(1) sequence with $\{\eta_i\}_{i \ge 1}$ having a continuous distribution, $P_x(X_n = x \text{ for some } n \ge 1) = 0$ for all $x$. However, it turns out that for Harris recurrent chains, such a singleton can be constructed via the regeneration theorem below, established independently by Athreya and Ney (1978) and Nummelin (1978).
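The decomposition into excursions between visits to a recurrent singleton can be illustrated with the waiting time chain of Example 14.2.6, whose atom is $\Delta = 0$. In the sketch below the increment distribution (uniform on $(-1.5, 1)$, so that $E\eta_1 = -0.25 < 0$), the starting point, the seed, and the function names are all assumptions made for the illustration.

```python
import random

def waiting_time_path(n, x0, seed=0):
    """Simulate X_{k+1} = max(X_k + eta_{k+1}, 0) with eta uniform on (-1.5, 1)."""
    rng = random.Random(seed)
    path = [x0]
    for _ in range(n):
        path.append(max(path[-1] + rng.uniform(-1.5, 1.0), 0.0))
    return path

def excursions_from_zero(path):
    """Cut the path into blocks that start at successive visits to the atom 0."""
    visits = [k for k, x in enumerate(path) if x == 0.0]
    return [path[a:b] for a, b in zip(visits, visits[1:])]

path = waiting_time_path(20_000, x0=5.0)
blocks = excursions_from_zero(path)
```

Each block starts at 0 and stays strictly positive afterwards; by the strong Markov property at the return times, the blocks are iid, which is what makes the countable-state limit theory carry over.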

14.2.5 The minorization theorem

A remarkable result of the subject is that when $\mathcal{S}$ is countably generated, a Harris recurrent chain can be embedded in a chain that has a recurrent singleton. This is achieved via the minorization theorem and the fundamental regeneration theorem below.

Theorem 14.2.5: (The minorization theorem). Let $(S, \mathcal{S})$ be such that $\mathcal{S}$ is countably generated. Let $\{X_n\}_{n \ge 0}$ be a Markov chain with state space $(S, \mathcal{S})$ and transition function $P(\cdot, \cdot)$ such that it is Harris irreducible with reference measure $\phi(\cdot)$. Then the following minorization hypothesis holds.

(i) (Hypothesis M). For every $B_0 \in \mathcal{S}$ such that $\phi(B_0) > 0$, there exists a set $A_0 \subset B_0$, an integer $n_0 \ge 1$, a constant $0 < \alpha < 1$, and a probability measure $\nu$ on $(S, \mathcal{S})$ such that $\phi(A_0) > 0$ and for all $x$ in $A_0$,
$$P^{n_0}(x, A) \ge \alpha \nu(A) \quad \text{for all } A \in \mathcal{S}.$$

(ii) (The C-set lemma). For any set $B_0 \in \mathcal{S}$ such that $\phi(B_0) > 0$, there exists a set $A_0 \subset B_0$, an $n_0 \ge 1$, a constant $0 < \alpha < 1$ such that for $x, y$ in $A_0$,
$$p^{n_0}(x, y) \ge \alpha,$$
where $p^{n_0}(x, \cdot)$ is the Radon-Nikodym derivative of the absolutely continuous component of $P^{n_0}(x, \cdot)$ w.r.t. $\phi(\cdot)$.

The proof of the C-set lemma is a nice application of the martingale convergence theorem (see Orey (1971)). The proof of Theorem 14.2.5 (i) using the C-set lemma is easy and is left as an exercise (Problem 14.21).
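On a finite state space, a minorization with $n_0 = 1$ and $A_0 = S$ can be written down directly whenever some column of the transition matrix is strictly positive: take $\alpha \nu(y) = \min_x P(x, y)$. This is a simple special construction (often associated with Doeblin's condition), shown here only as a sketch; it is not the argument used to prove Theorem 14.2.5, and the matrix and function names are hypothetical. The normalized remainder $(P - \alpha\nu)/(1 - \alpha)$ is then again a transition matrix, so $P$ is a mixture of $\nu$ and that remainder.

```python
def doeblin_minorization(P):
    """Return (alpha, nu) with P[x][y] >= alpha * nu[y] for all x, y."""
    m = len(P)
    col_min = [min(P[x][y] for x in range(m)) for y in range(m)]
    alpha = sum(col_min)
    if alpha <= 0.0:
        raise ValueError("no one-step minorization over the whole state space")
    return alpha, [c / alpha for c in col_min]

def residual_kernel(P, alpha, nu):
    """Normalized remainder (P(x, .) - alpha * nu(.)) / (1 - alpha); needs alpha < 1."""
    return [[(p - alpha * v) / (1.0 - alpha) for p, v in zip(row, nu)]
            for row in P]

# Hypothetical three-state transition matrix with all entries positive.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.5, 0.3],
     [0.3, 0.2, 0.5]]
alpha, nu = doeblin_minorization(P)
Q = residual_kernel(P, alpha, nu)
mixture = [[alpha * nu[y] + (1.0 - alpha) * Q[x][y] for y in range(3)]
           for x in range(3)]
```

For this matrix $\alpha = 0.6$ and $\nu$ is uniform, up to floating-point rounding; `mixture` reproduces `P`.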

14.2.6 The fundamental regeneration theorem

Theorem 14.2.6: Let $\{X_n\}_{n\ge 0}$ be a Markov chain with state space (S, S) and transition function P(·, ·). Suppose there exists a set $A_0 \in \mathcal{S}$, a constant 0 < α < 1, a probability measure ν(·) on (S, S) such that for all x in $A_0$,

$$P(x, A) \ge \alpha\,\nu(A) \quad \text{for all } A \in \mathcal{S}. \tag{2.10}$$

Suppose, in addition, that for all x in S,

$$P_x(T_{A_0} < \infty) = 1. \tag{2.11}$$

Then, for any initial distribution µ, there exists a sequence of random times $\{T_i\}_{i\ge 1}$ such that under $P_\mu$, the sequence of excursions $\eta_j \equiv \{X_{T_j+r},\ 0 \le r < T_{j+1}-T_j,\ T_{j+1}-T_j\}$, j = 1, 2, . . . are iid with $X_{T_j} \sim \nu(\cdot)$.

Proof: For x in $A_0$, let

$$Q(x, \cdot) = \frac{P(x, \cdot) - \alpha\,\nu(\cdot)}{1 - \alpha}. \tag{2.12}$$

Then, (2.10) implies that for x in $A_0$, Q(x, ·) is a probability measure on (S, S). For each x in $A_0$ and n ≥ 0, let $\eta_{n+1}$, $\delta_{n+1}$ and $Y_{n+1,x}$ be independent random variables such that $P(\eta_{n+1} \in \cdot) = \nu(\cdot)$, $\delta_{n+1}$ is Bernoulli(α), and $P(Y_{n+1,x} \in \cdot) = Q(x, \cdot)$. Then given $X_n = x$ in $A_0$, $X_{n+1}$ can be chosen to be

$$X_{n+1} = \begin{cases} \eta_{n+1} & \text{if } \delta_{n+1} = 1, \\ Y_{n+1,x} & \text{if } \delta_{n+1} = 0, \end{cases} \tag{2.13}$$

to ensure that $X_{n+1}$ has distribution P(x, ·). Indeed, for x in $A_0$,

$$P(\eta_{n+1} \in \cdot,\ \delta_{n+1} = 1) + P(Y_{n+1,x} \in \cdot,\ \delta_{n+1} = 0) = \nu(\cdot)\alpha + (1-\alpha)Q(x, \cdot) = P(x, \cdot).$$

Thus, each time the chain enters $A_0$, there is a probability α that the position next time will have distribution ν(·), independent of x ∈ $A_0$ as well as of all past history, i.e., that of starting afresh with distribution ν(·).

Now if (2.11) also holds, then for any x in S, by the Markov property, the chain enters $A_0$ infinitely often w.p. 1 ($P_x$). Let $\tau_1 < \tau_2 < \tau_3 < \cdots$ denote the times of successive visits to $A_0$. Since $X_{\tau_i} \in A_0$, by the above construction (cf. (2.13)), there is a probability α > 0 that $X_{\tau_i+1}$ will be distributed as ν(·), completely independent of $X_j$ for $j \le \tau_i$ and of $\tau_i$. By comparison with coin tossing, this implies that for any x, w.p. 1 ($P_x$), there is a finite index $i_0$ such that $\delta_{\tau_{i_0}+1} = 1$ and hence $X_{\tau_{i_0}+1}$ will be distributed as ν(·), independent of all history of the chain $\{X_n\}_{n\ge 0}$ including that of the $\delta_{\tau_i+1}$ and $Y_{\tau_i+1}$, $i \le i_0$, up to the time $\tau_{i_0}$. That is, at $\tau_{i_0}+1$ the chain starts afresh with distribution ν(·). Thus, it follows that for any initial distribution µ, there is a random time T such that $X_T$ is distributed as ν(·) and is independent of all history up to T − 1. More precisely, for any µ, $P_\mu(T < \infty) = 1$ and

$$P_\mu(X_j \in A_j,\ j = 0, 1, \ldots, n+k,\ T = n+1) = P_\mu(X_j \in A_j,\ j = 0, 1, \ldots, n,\ T = n+1) \times P_\nu(X_0 \in A_{n+1}, X_1 \in A_{n+2}, \ldots, X_{k-1} \in A_{n+k}).$$

Since this is true for any µ, it is true for µ = ν and hence the theorem follows. □

A consequence of the above theorem is the following.

Theorem 14.2.7: Suppose in Theorem 14.2.6, instead of (2.10) and (2.11), the following holds. There exists an $n_0 \ge 1$ such that for all x in $A_0$, A in S,

$$P^{n_0}(x, A) \ge \alpha\,\nu(A) \tag{2.14}$$

and for all x in $A_0$,

$$P_x(X_{n n_0} \in A_0 \text{ for some } n \ge 1) = 1 \tag{2.15}$$

and

$$P_x(T_{A_0} < \infty) = 1 \quad \text{for all } x \text{ in } S. \tag{2.16}$$

Let $Y_n \equiv X_{n n_0}$, n ≥ 0 (where $n n_0$ stands for the product of n and $n_0$). Then, for any initial distribution µ, there exist random times $\{T_i\}_{i\ge 1}$ such that under $P_\mu$, the sequence $\eta_j \equiv \{Y_{T_j+r},\ 0 \le r < T_{j+1}-T_j,\ T_{j+1}-T_j\}$ for j = 1, 2, . . . are iid with $Y_{T_j} \sim \nu(\cdot)$.

Proof: For any initial distribution µ such that $\mu(A_0) = 1$, the theorem follows from Theorem 14.2.6 since (2.14) and (2.15) are the same as (2.10) and (2.11) for the transition function $P^{n_0}(\cdot, \cdot)$ and the chain $\{Y_n\}_{n\ge 0}$. By (2.16), for any other µ, $P_\mu(T_{A_0} < \infty) = 1$. □

Given a realization of the Markov chain $\{Y_n \equiv X_{n n_0}\}_{n\ge 0}$, it is possible to construct a realization of the full Markov chain $\{X_n\}_{n\ge 0}$ by "filling the gaps" $\{X_j : k n_0 + 1 \le j \le (k+1)n_0 - 1\}$ as follows: Given $X_{k n_0} = x$, $X_{(k+1)n_0} = y$, generate an observation from the conditional distribution of $(X_1, X_2, \ldots, X_{n_0-1})$, given $X_0 = x$, $X_{n_0} = y$. This leads to the following.

Theorem 14.2.8: Under the set up of Theorem 14.2.7, the "excursions"

$$\Big\{\tilde\eta_j \equiv \big\{X_{n_0 T_j + k} : 0 \le k < n_0(T_{j+1}-T_j),\ T_{j+1}-T_j\big\}\Big\}_{j=1}^{\infty}$$

are identically distributed and are one dependent, i.e., for each r ≥ 1, the collections $\{\tilde\eta_1, \tilde\eta_2, \ldots, \tilde\eta_r\}$ and $\{\tilde\eta_{r+2}, \tilde\eta_{r+3}, \ldots\}$ are independent.


Proof: Note that applying the regeneration method to the sequence $\{Y_n\}_{n\ge 0}$ and then "filling the gaps" leads to the common portion $\{X_{(T_j-1)n_0+r} : 0 \le r \le n_0\}$, given the values $X_{(T_j-1)n_0}$ and $X_{T_j n_0}$. This makes two successive excursions $\tilde\eta_{j-1}$ and $\tilde\eta_j$ dependent. But the Markov property renders $\tilde\eta_j$ and $\tilde\eta_{j+2}$ independent. This yields the one-dependence of $\{\tilde\eta_j\}_{j\ge 1}$. □

By the C-set lemma and the minorization Theorem 14.2.5, φ-recurrence yields the hypothesis of Theorem 14.2.7.

Theorem 14.2.9: Let $\{X_n\}_{n\ge 0}$ be a φ-recurrent Markov chain with state space (S, S), where S is countably generated. Then there exist an $A_0$ in S, $n_0 \ge 1$, 0 < α < 1 and a probability measure ν such that (2.14), (2.15), and (2.16) hold and hence, the conclusions of Theorem 14.2.8 hold.

Thus, φ-recurrence implies that the Markov chain $\{X_n\}_{n\ge 0}$ is regenerative (defined fully below). This makes the law of large numbers for iid random variables available to such chains. The limit theory of regenerative sequences developed in Section 8.5 is reviewed below and, by the above results, such a theory will hold for φ-recurrent chains.
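The splitting construction (2.12)-(2.13) in the proof of Theorem 14.2.6 can be illustrated on a finite state space. The following is only a minimal sketch under stated assumptions: the 3-state matrix `P_EXAMPLE` is hypothetical, the set A0 is taken to be the whole state space (so every step is a visit to A0), and α and ν are obtained from column minima, which is one convenient way, not the only way, to get a minorization of the form (2.10).

```python
import random

def split_kernel(P):
    """Decompose a finite transition matrix as P = alpha*nu + (1-alpha)*Q.

    Illustrative assumption: A0 is the whole state space, so the
    minorization (2.10) is read off from the column minima of P.
    """
    n = len(P)
    col_min = [min(P[x][y] for x in range(n)) for y in range(n)]
    alpha = sum(col_min)
    if not 0.0 < alpha < 1.0:
        raise ValueError("this sketch needs 0 < alpha < 1")
    nu = [m / alpha for m in col_min]
    # Residual kernel Q(x, .) as in (2.12); clamp tiny negative round-off.
    Q = [[max(0.0, (P[x][y] - alpha * nu[y]) / (1.0 - alpha))
          for y in range(n)] for x in range(n)]
    return alpha, nu, Q

def simulate_split_chain(P, x0, n_steps, seed=0):
    """Simulate the chain via (2.13) and record the times n with delta_n = 1."""
    rng = random.Random(seed)
    alpha, nu, Q = split_kernel(P)
    states = range(len(P))
    path, regen_times = [x0], []
    for n in range(1, n_steps + 1):
        x = path[-1]
        if rng.random() < alpha:      # delta_n = 1: start afresh from nu
            path.append(rng.choices(states, weights=nu)[0])
            regen_times.append(n)
        else:                         # delta_n = 0: move according to Q(x, .)
            path.append(rng.choices(states, weights=Q[x])[0])
    return path, regen_times

# Hypothetical example matrix (not from the text).
P_EXAMPLE = [[0.5, 0.3, 0.2],
             [0.2, 0.5, 0.3],
             [0.3, 0.2, 0.5]]
```

Calling `simulate_split_chain(P_EXAMPLE, 0, 500)` returns a path together with the regeneration times; the stretches of the path between consecutive regeneration times play the role of the iid excursions in the theorem.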

14.2.7 Limit theory for regenerative sequences

Definition 14.2.7: Let (Ω, F, P) be a probability space and (S, S) be a measurable space. A sequence of random variables $\{X_n\}_{n\ge 0}$ defined on (Ω, F, P) with values in (S, S) is called regenerative if there exists a sequence of random times $0 < T_1 < T_2 < T_3 < \cdots$ such that the excursions $\eta_j \equiv \{X_n : T_j \le n < T_{j+1},\ T_{j+1}-T_j\}$, j ≥ 1, are iid, i.e.,

$$P\big(T_{j+1}-T_j = k_j,\ X_{T_j+\ell} \in A_{\ell,j},\ 0 \le \ell < k_j,\ j = 1, 2, \ldots, r\big) = \prod_{j=1}^{r} P\big(T_2-T_1 = k_j,\ X_{T_1+\ell} \in A_{\ell,j},\ 0 \le \ell < k_j\big) \tag{2.17}$$

for all $k_1, k_2, \ldots, k_r \in \mathbb{N}$ and $A_{\ell,j} \in \mathcal{S}$, $1 \le \ell \le k_j$, j = 1, . . . , r.

Example 14.2.11: Any Markov chain $\{X_n\}_{n\ge 0}$ with a countable state space S that is irreducible and recurrent is regenerative with $\{T_i\}_{i\ge 1}$ being the times of successive returns to a given state ∆.

Example 14.2.12: Any Harris recurrent chain satisfying the minorization condition (2.10) is regenerative by Theorem 14.2.6.

Example 14.2.13: The waiting time chain (Example 14.2.6) with $E\eta_1 < 0$ is regenerative with $\{T_i\}_{i\ge 1}$ being the times of successive returns of $\{X_n\}_{n\ge 0}$ to zero.


Example 14.2.14: An example of a non-Markov sequence $\{X_n\}_{n\ge 0}$ that is regenerative is a semi-Markov chain. Let $\{y_n\}_{n\ge 0}$ be a Harris recurrent Markov chain satisfying (2.10). Given $\{y_n = a_n\}_{n\ge 0}$, let $\{L_n\}_{n\ge 0}$ be independent positive integer valued random variables. Set

$$X_j = \begin{cases} y_0 & 0 \le j < L_0 \\ y_1 & L_0 \le j < L_0 + L_1 \\ y_2 & L_0 + L_1 \le j < L_0 + L_1 + L_2 \\ \ \vdots \end{cases}$$

Then $\{X_n\}_{n\ge 0}$ is called a semi-Markov chain with embedded Markov chain $\{y_n\}_{n\ge 0}$ and sojourn times $\{L_n\}_{n\ge 0}$. It is regenerative if $\{T_i\}_{i\ge 1}$ are defined by $T_i = \sum_{j=0}^{N_i-1} L_j$, where $\{N_i\}_{i\ge 1}$ are the successive regeneration times for $\{y_n\}$ as in Theorem 14.2.7.

Theorem 14.2.10: Let $\{X_n\}_{n\ge 0}$ be a regenerative sequence with regeneration times $\{T_i\}_{i\ge 1}$. Let $\tilde\pi(A) \equiv E\big(\sum_{j=T_1}^{T_2-1} I_A(X_j)\big)$ for $A \in \mathcal{S}$. Suppose $\tilde\pi(S) \equiv E(T_2 - T_1) < \infty$. Let $\pi(\cdot) = \tilde\pi(\cdot)/\tilde\pi(S)$. Then

(i) $\frac{1}{n}\sum_{j=0}^{n} f(X_j) \to \int f\,d\pi$ w.p. 1 for any $f \in L^1(S, \mathcal{S}, \pi)$.

(ii) $\mu_n(\cdot) \equiv \frac{1}{n}\sum_{j=0}^{n} P(X_j \in \cdot) \to \pi(\cdot)$ in total variation.

(iii) If the distribution of $T_2 - T_1$ is aperiodic, then $P(X_n \in \cdot) \to \pi(\cdot)$ in total variation.

Proof: To prove (i) it suffices to consider nonnegative f. For each n, let $N_n = k$ if $T_k \le n < T_{k+1}$. Let

$$Y_0 = \sum_{j=0}^{T_1-1} f(X_j) \quad \text{and} \quad Y_i = \sum_{j=T_i}^{T_{i+1}-1} f(X_j),\ i \ge 1.$$

Then

$$Y_0 + \sum_{i=1}^{N_n-1} Y_i \le \sum_{i=0}^{n} f(X_i) \le Y_0 + \sum_{i=1}^{N_n} Y_i. \tag{2.18}$$

By the SLLN, $\frac{1}{N_n}\sum_{i=1}^{N_n-1} Y_i$ and $\frac{1}{N_n}\sum_{i=1}^{N_n} Y_i$ converge to $EY_1$ w.p. 1 and $\frac{N_n}{n} \to \big(E(T_2-T_1)\big)^{-1}$. It follows from (2.18) that

$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=0}^{n} f(X_i) = \frac{EY_1}{E(T_2-T_1)} = \frac{\int f\,d\tilde\pi}{\tilde\pi(S)} = \int f\,d\pi,$$


establishing (i). By taking $f = I_A$ and using the BCT, one concludes from (i) that $\mu_n(A) \to \pi(A)$ for every A in S. Since $\mu_n$ and π are probability measures, this implies that $\mu_n \to \pi$ in total variation, proving (ii). To prove (iii), note that for any bounded measurable f, $a_n \equiv Ef(X_{T_1+n})$ satisfies

$$\begin{aligned} a_n &= E\big(f(X_{T_1+n})\,I(T_2-T_1 > n)\big) + \sum_{r=1}^{n} E\big(f(X_{T_1+n})\,I(T_2-T_1 = r)\big) \\ &= b_n + \sum_{r=1}^{n} E\big(f(X_{T_2+n-r})\big)\,P(T_2-T_1 = r) \\ &= b_n + \sum_{r=1}^{n} a_{n-r}\,p_r, \end{aligned}$$

where $b_n \equiv E\big(f(X_{T_1+n})\,I(T_2-T_1 > n)\big)$ and $p_r = P(T_2-T_1 = r)$. Now by the discrete renewal theorems from Section 8.5 (which apply since $E(T_2-T_1) < \infty$ and $T_2-T_1$ has an aperiodic distribution), (iii) follows. □

Remark 14.2.1: Since the strong law is valid for any m-dependent (m < ∞) and stationary sequence of random variables, Theorem 14.2.10 is valid even if the excursions $\{\eta_j\}_{j\ge 1}$ are m-dependent.
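Part (i) of Theorem 14.2.10 suggests estimating $\int f\,d\pi$ by the ratio of cycle sums to cycle lengths. The sketch below assumes a finite-state chain with regeneration at returns to a fixed state, as in Example 14.2.11; the function names and the 3-state matrix are hypothetical and chosen only for illustration.

```python
import random

def cycle_ratio(path, f, regen_state=0):
    """Ratio estimator: (sum of f over complete cycles) / (total cycle length).

    The cycles are the stretches of the path between successive visits to
    regen_state, i.e. between regeneration times T_1 < T_2 < ... < T_k.
    """
    times = [n for n, x in enumerate(path) if x == regen_state]
    if len(times) < 2:
        raise ValueError("need at least one complete cycle")
    start, stop = times[0], times[-1]          # T_1 and the last T_k
    total_f = sum(f(x) for x in path[start:stop])
    return total_f / (stop - start)

def simulate_chain(P, x0, n_steps, seed=0):
    """Simulate a finite-state Markov chain with transition matrix P."""
    rng = random.Random(seed)
    states = range(len(P))
    path = [x0]
    for _ in range(n_steps):
        path.append(rng.choices(states, weights=P[path[-1]])[0])
    return path

# Hypothetical doubly stochastic matrix, so its stationary law is uniform.
P_UNIFORM = [[0.5, 0.3, 0.2],
             [0.2, 0.5, 0.3],
             [0.3, 0.2, 0.5]]
```

For the short path `[1, 0, 2, 1, 0, 1, 0, 2, 2, 0, 1]` and f the indicator of state 2, the complete cycles are `[0, 2, 1]`, `[0, 1]`, `[0, 2, 2]`, giving the ratio 3/8. On a long simulated path from `P_UNIFORM` the same estimator should be close to π({2}) = 1/3.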

14.2.8 Limit theory of Harris recurrent Markov chains

The minorization theorem, the fundamental regeneration theorem, and the limit theorem for regenerative sequences, i.e., Theorems 14.2.5, 14.2.6, 14.2.7, and 14.2.10, are the essential components of a limit theory for Harris recurrent Markov chains that parallels the limit theory for discrete state space irreducible recurrent Markov chains.

Definition 14.2.8: A probability measure π on (S, S) is called stationary for a transition function P(·, ·) if

$$\pi(A) = \int_S P(x, A)\,\pi(dx) \quad \text{for all } A \in \mathcal{S}.$$

Note that if $X_0 \sim \pi$, then $X_n \sim \pi$ for all n ≥ 1, justifying the term "stationary."

Theorem 14.2.11: Let $\{X_n\}_{n\ge 0}$ be a Harris recurrent Markov chain with state space (S, S) and transition function P(·, ·). Let S be countably generated. Suppose there exists a stationary probability measure π. Then,

(i) π is unique.


(ii) (The law of large numbers). For all $f \in L^1(S, \mathcal{S}, \pi)$, for all x ∈ S,

$$\frac{1}{n}\sum_{j=0}^{n-1} f(X_j) \to \int f\,d\pi \quad \text{w.p. 1 } (P_x).$$

(iii) (Convergence of n-step probabilities). For all x ∈ S,

$$\mu_{n,x}(\cdot) \equiv \frac{1}{n}\sum_{j=0}^{n-1} P_x(X_j \in \cdot) \to \pi(\cdot) \quad \text{in total variation.}$$

Proof: By Harris recurrence and the minorization Theorem 14.2.5, there exist a set $A_0 \in \mathcal{S}$, a constant 0 < α < 1, an integer $n_0 \ge 1$, a probability measure ν such that for all $x \in A_0$, $A \in \mathcal{S}$,

$$P^{n_0}(x, A) \ge \alpha\,\nu(A) \tag{2.19}$$

and for all x in S,

$$P_x(T_{A_0} < \infty) = 1. \tag{2.20}$$

For simplicity of exposition, assume that $n_0 = 1$. (The general case $n_0 > 1$ can be reduced to this by considering the transition function $P^{n_0}$.) Let the sequence $\{\eta_n, \delta_n, Y_{n,x}\}_{n\ge 1}$ and the regeneration times $\{T_i\}_{i\ge 1}$ be as in Theorem 14.2.6. Recall that the first regeneration time $T_1$ can be defined as

$$T_1 = \min\{n : n > 0,\ X_{n-1} \in A_0,\ \delta_n = 1\} \tag{2.21}$$

and the succeeding ones by

$$T_{i+1} = \min\{n : n > T_i,\ X_{n-1} \in A_0,\ \delta_n = 1\}, \tag{2.22}$$

and that $X_{T_i}$ are distributed as ν independent of the past. Let for n ≥ 1, $N_n = k$ if $T_k \le n < T_{k+1}$. By Harris recurrence, for all x in S, $N_n \to \infty$ w.p. 1 ($P_x$) and by the SLLN, for all x in S,

$$\frac{N_n}{n} \to \frac{1}{E_\nu T_1} \quad \text{w.p. 1 } (P_x)$$

and hence, by the BCT,

$$E_x \frac{N_n}{n} \to \frac{1}{E_\nu T_1}.$$

On the other hand, for any k ≥ 1, x ∈ S,

$$P_x(\text{a regeneration occurs at } k) = P_x(X_{k-1} \in A_0,\ \delta_k = 1) = P_x(X_{k-1} \in A_0)\,\alpha.$$

Thus

$$E_x \frac{N_n}{n} = \frac{1}{n}\sum_{k=1}^{n} P_x(X_{k-1} \in A_0)\,\alpha$$


and hence, for all x in S,

$$\frac{1}{n}\sum_{j=0}^{n-1} P_x(X_j \in A_0) \to \frac{1}{\alpha E_\nu T_1}.$$

Now let π be a stationary measure for P(·, ·). Then $\pi(A_0) = \int P_x(X_j \in A_0)\,\pi(dx)$ for all j = 0, 1, 2, . . . and hence

$$n\,\pi(A_0) = \sum_{j=0}^{n-1} \int P_x(X_j \in A_0)\,\pi(dx). \tag{2.23}$$

Since $G(x, A_0) \equiv \sum_{j=0}^{\infty} P_x(X_j \in A_0) > 0$ for all x in S, by Harris recurrence (Harris irreducibility will do for this), it follows that $\pi(A_0) > 0$. Dividing both sides of (2.23) by n and letting n → ∞ yield

$$\pi(A_0) = \frac{1}{\alpha E_\nu T_1}$$

and hence that $E_\nu T_1 < \infty$. Since $E_\nu T_1 \equiv E(T_2 - T_1) < \infty$, by Theorem 14.2.10, for all x in S, $A \in \mathcal{S}$,

$$\frac{1}{n}\sum_{j=0}^{n-1} P_x(X_j \in A) \to \frac{E_\nu\big(\sum_{j=0}^{T_1-1} I_A(X_j)\big)}{E_\nu(T_1)}.$$

Integrating the left side with respect to π yields that for any $A \in \mathcal{S}$,

$$\pi(A) = \frac{E_\nu\big(\sum_{j=0}^{T_1-1} I_A(X_j)\big)}{E_\nu(T_1)},$$

thus establishing the uniqueness of π, i.e., establishing (i) of Theorem 14.2.11. The other two parts follow from the regeneration Theorem 14.2.6 and the limit Theorem 14.2.10. □

Remark 14.2.2: Under the assumption $n_0 = 1$ that was made at the beginning of the proof, it also follows that

$$P_x(X_j \in \cdot) \to \pi(\cdot) \tag{2.24}$$

in total variation. This also holds if the g.c.d. of the $n_0$'s for which there exist $A_0$, α, ν satisfying (2.19) is one.

Remark 14.2.3: A necessary and sufficient condition for the existence of a stationary distribution for a Harris recurrent chain is that there exists a set $\{A_0, \alpha, \nu, n_0\}$ satisfying (2.19) and (2.20) and $E_\nu T_{A_0} < \infty$.
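For a finite state space the occupation measures $\mu_{n,x}$ of Theorem 14.2.11 (iii) can be computed exactly by iterating the transition matrix, which makes the convergence to π easy to inspect. The sketch below uses a hypothetical two-state matrix `P2` whose stationary law is (5/6, 1/6); the total variation distance is normalized with the factor 1/2, which is a convention of this sketch and may differ from the norm used elsewhere in the chapter by a factor of 2.

```python
def occupation_measure(P, x, n):
    """mu_{n,x} = (1/n) * sum_{j=0}^{n-1} P^j(x, .) for a finite matrix P."""
    k = len(P)
    row = [1.0 if y == x else 0.0 for y in range(k)]   # law of X_0 under P_x
    acc = [0.0] * k
    for _ in range(n):
        acc = [a + r for a, r in zip(acc, row)]
        # Advance one step: law of X_{j+1} from the law of X_j.
        row = [sum(row[z] * P[z][y] for z in range(k)) for y in range(k)]
    return [a / n for a in acc]

def tv_distance(p, q):
    """sup_A |p(A) - q(A)| on a finite space (half the L1 distance)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical example: pi = (5/6, 1/6) solves pi = pi P for this matrix.
P2 = [[0.9, 0.1],
      [0.5, 0.5]]
PI2 = [5.0 / 6.0, 1.0 / 6.0]
```

For instance, `tv_distance(occupation_measure(P2, 1, n), PI2)` decreases roughly like 1/n, which is the typical rate for Cesàro averages even when the n-step probabilities themselves converge geometrically.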


A more general result than Theorem 14.2.11 is the following that was motivated by applications to Markov chain Monte Carlo methods.

Theorem 14.2.12: Let $\{X_n\}_{n\ge 0}$ be a Markov chain with state space (S, S) and transition function P(·, ·). Suppose (2.19) holds for some $(A_0, \alpha, \nu, n_0)$. Suppose π is a stationary probability measure for P(·, ·) such that

$$\pi\{x : P_x(T_{A_0} < \infty) > 0\} = 1. \tag{2.25}$$

Then, for π-almost all x,

(i) $P_x(T_{A_0} < \infty) = 1$.

(ii) $\mu_{n,x}(\cdot) = \frac{1}{n}\sum_{j=0}^{n-1} P_x(X_j \in \cdot) \to \pi(\cdot)$ in total variation.

(iii) For any $f \in L^1(S, \mathcal{S}, \pi)$, $\frac{1}{n}\sum_{j=0}^{n} f(X_j) \to \int f\,d\pi$ w.p. 1 ($P_x$).

(iv) $\frac{1}{n}\sum_{j=0}^{n} E_x f(X_j) \to \int f\,d\pi$.

If, further, the g.c.d. {m : there exists $\alpha_m > 0$ such that for all x in $A_0$, $P^m(x, \cdot) \ge \alpha_m\nu(\cdot)$} = 1, then (ii) can be strengthened to $P_x(X_n \in \cdot) \to \pi(\cdot)$ in total variation.

The key difference between Theorems 14.2.11 and 14.2.12 is that the latter does not require Harris recurrence, which is often difficult to verify. On the other hand, the conclusions of Theorem 14.2.12 are valid only for π-almost all x, unlike for all x in Theorem 14.2.11. In MCMC applications, the existence of a stationary measure is given (as it is the 'target distribution') and the minorization condition is easier to verify, as is the milder form of irreducibility condition (2.25). (Harris irreducibility will require $P_x(T_{A_0} < \infty) > 0$ for all x in S.) For a proof of Theorem 14.2.12 and applications to MCMC, see Athreya, Doss and Sethuraman (1996).

Example 14.2.15: (AR(1) time series) (Example 14.2.4). Suppose $\eta_1$ has an absolutely continuous component and that |ρ| < 1. Then the chain is Harris recurrent and admits a stationary probability distribution π(·), that of $\sum_{j=0}^{\infty} \rho^j \eta_j$, and $P_x(X_n \in \cdot) \to \pi(\cdot)$ in total variation for any x.

Example 14.2.16: (Waiting time chain) (Example 14.2.6). If $E\eta_1 < 0$, the state 0 is recurrent and hence the Markov chain is Harris recurrent. Also a stationary distribution π does exist. It is known that π is the same as the distribution of $M_\infty \equiv \sup_{j\ge 0} S_j$, where $S_0 = 0$, $S_j = \sum_{i=1}^{j} \eta_i$, j ≥ 1, $\{\eta_i\}_{i\ge 1}$ being iid.
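The AR(1) case of Example 14.2.15 lends itself to a quick numerical check of the law of large numbers. The sketch below assumes the recursion $X_{n+1} = \rho X_n + \eta_{n+1}$ with standard normal innovations (an illustrative choice, not one made in the text); under that assumption the stationary law of $\sum_j \rho^j \eta_j$ is normal with mean 0 and variance $1/(1-\rho^2)$, so time averages of x and $x^2$ have known limits.

```python
import random

def ar1_path(rho, n_steps, x0=0.0, seed=0):
    """Simulate X_{n+1} = rho * X_n + eta_{n+1} with N(0, 1) innovations.

    The normal innovation law is an assumption of this sketch.
    """
    rng = random.Random(seed)
    x, path = x0, []
    for _ in range(n_steps):
        x = rho * x + rng.gauss(0.0, 1.0)
        path.append(x)
    return path

def time_average(path, h):
    """Empirical average (1/n) * sum_j h(X_j)."""
    return sum(h(x) for x in path) / len(path)
```

With ρ = 0.5 the limiting second moment is 4/3, and the time average should approach it regardless of the starting point, e.g. `time_average(ar1_path(0.5, 200000, x0=10.0), lambda x: x * x)`.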

14.2.9 Markov chains on metric spaces

14.2.9.1 Feller continuity

Let (S, d) be a metric space and S be the Borel σ-algebra in S. Let P(·, ·) be a transition function. Let $\{X_n\}_{n\ge 0}$ be a Markov chain with state space (S, S) and transition function P(·, ·).

Definition 14.2.9: The transition function P(·, ·) is called Feller continuous (or simply Feller) if $x_n \to x$ in S implies $P(x_n, \cdot) \longrightarrow^d P(x, \cdot)$, i.e.,

$$(Pf)(x_n) \equiv \int f(y)\,P(x_n, dy) \to (Pf)(x) \equiv \int f(y)\,P(x, dy)$$

for all bounded continuous f : S → R. In terms of the Markov chain, this says $E\big(f(X_1) \mid X_0 = x_n\big) \to E\big(f(X_1) \mid X_0 = x\big)$ if $x_n \to x$.

Example 14.2.17: Let (Ω, F, P) be a probability space and h : S × Ω → S be jointly measurable and h(·, ω) be continuous w.p. 1. Let P(x, A) ≡ P(h(x, ω) ∈ A) for x ∈ S, $A \in \mathcal{S}$. Then P(·, ·) is a Feller continuous transition function. Indeed, for any bounded continuous f : S → R,

$$(Pf)(x) \equiv \int f(y)\,P(x, dy) = E f\big(h(x, \omega)\big).$$

Now, $x_n \to x$ ⇒ $h(x_n, \omega) \to h(x, \omega)$ w.p. 1 ⇒ $f(h(x_n, \omega)) \to f(h(x, \omega))$ w.p. 1 (by continuity of f) ⇒ $Ef(h(x_n, \omega)) \to Ef(h(x, \omega))$ (by the bounded convergence theorem). The first five examples of Section 14.2.4 fall in this category. If h is discontinuous, then P(·, ·) need not be Feller (Problem 14.22). That P(·, ·) is a transition function requires only that h(·, ·) be jointly measurable (Problem 14.23).

14.2.9.2 Stationary measures

A general method of finding a stationary measure for a Feller transition function P(·, ·) is to consider weak or vague limits of the occupation measures $\mu_{n,\lambda}(A) \equiv \frac{1}{n}\sum_{j=0}^{n-1} P_\lambda(X_j \in A)$, where λ is the initial distribution.

Theorem 14.2.13: Fix an initial distribution λ. Suppose a probability measure µ is a weak limit point of $\{\mu_{n,\lambda}\}_{n\ge 1}$. That is, for some $n_1 < n_2 < n_3 < \cdots$, $\mu_{n_k,\lambda} \longrightarrow^d \mu$. Assume P(·, ·) is Feller. Then µ is a stationary probability measure for P(·, ·).

Proof: Let f : S → R be continuous and bounded. Then

$$\int f(y)\,\mu_{n_k,\lambda}(dy) \to \int f(y)\,\mu(dy).$$


But the left side equals

$$\begin{aligned} \frac{1}{n_k}\sum_{j=0}^{n_k-1} E_\lambda f(X_j) &= \frac{1}{n_k} E_\lambda f(X_0) + \frac{1}{n_k}\sum_{j=1}^{n_k-1} E_\lambda f(X_j) \\ &= \frac{1}{n_k} E_\lambda f(X_0) + \frac{1}{n_k}\sum_{j=1}^{n_k-1} E_\lambda (Pf)(X_{j-1}) \\ &= \frac{1}{n_k} E_\lambda f(X_0) + \frac{1}{n_k}\sum_{j=0}^{n_k-1} E_\lambda (Pf)(X_j) - \frac{1}{n_k} E_\lambda (Pf)(X_{n_k-1}), \end{aligned}$$

where the second equality holds since, by the Markov property, for j ≥ 1, $E_\lambda f(X_j) = E_\lambda (Pf)(X_{j-1})$.

The first and third terms on the right side go to zero since f is bounded and $n_k \to \infty$. The second term goes to $\int (Pf)(y)\,\mu(dy)$ since by the Feller property Pf is a bounded continuous function. Thus,

$$\int_S f(y)\,\mu(dy) = \int_S (Pf)(y)\,\mu(dy) = \int_S \Big(\int_S f(z)\,P(y, dz)\Big)\mu(dy) = \int_S f(z)\,(\mu P)(dz),$$

where

$$\mu P(A) \equiv \int_S P(y, A)\,\mu(dy).$$

This being true for all bounded continuous f, it follows that µ = µP, i.e., µ is stationary for P(·, ·). □

A more general result is the following.

Theorem 14.2.14: Let λ be an initial distribution. Let µ be a subprobability measure (i.e., µ(S) ≤ 1) such that for some $n_1 < n_2 < n_3 < \cdots$, $\{\mu_{n_k,\lambda}\}$ converges vaguely to µ, i.e., $\int f\,d\mu_{n_k,\lambda} \to \int f\,d\mu$ for all f : S → R continuous with compact support. Suppose there exists an approximate identity $\{g_n\}_{n\ge 1}$ for S, i.e., for all n, $g_n$ is a continuous function from S to [0,1] with compact support and for every x in S, $g_n(x) \uparrow 1$ as n → ∞. Then µ is stationary for P(·, ·), i.e.,

$$\mu(A) = \int_S P(x, A)\,\mu(dx) \quad \text{for all } A \in \mathcal{S}.$$


For a proof, see Athreya (2004). If $S = \mathbb{R}^k$ for some k < ∞, S admits an approximate identity. Conditions to ensure that there is a vague limit point µ such that µ(S) > 0 are provided by the following.

Theorem 14.2.15: Suppose there exists a set $A_0 \in \mathcal{S}$, a function V : S → [0, ∞) and numbers 0 < α, M < ∞ such that

$$(PV)(x) \equiv E_x V(X_1) \le V(x) - \alpha \quad \text{for } x \notin A_0 \tag{2.26}$$

and

$$\sup_{x \in A_0} \big(E_x V(X_1) - V(x)\big) \equiv M < \infty. \tag{2.27}$$

Then, for any initial distribution λ,

$$\liminf_{n\to\infty} \mu_{n,\lambda}(A_0) \ge \frac{\alpha}{\alpha + M}. \tag{2.28}$$

Proof: For j ≥ 1,

$$\begin{aligned} E_\lambda V(X_j) - E_\lambda V(X_{j-1}) &= E_\lambda\big(PV(X_{j-1}) - V(X_{j-1})\big) \\ &= E_\lambda\Big(\big(PV(X_{j-1}) - V(X_{j-1})\big) I_{A_0}(X_{j-1})\Big) + E_\lambda\Big(\big(PV(X_{j-1}) - V(X_{j-1})\big) I_{A_0^c}(X_{j-1})\Big) \\ &\le M\,P_\lambda(X_{j-1} \in A_0) - \alpha\,P_\lambda(X_{j-1} \notin A_0). \end{aligned}$$

Adding over j = 1, 2, . . . , n yields

$$\frac{1}{n}\big(E_\lambda V(X_n) - V(x)\big) \le -\alpha + (\alpha + M)\,\mu_{n,\lambda}(A_0).$$

Since V(·) ≥ 0, letting n → ∞ yields (2.28). □

Definition 14.2.10: A metric space (S, d) has the vague compactness property if given any collection $\{\mu_\alpha : \alpha \in I\}$ of subprobability measures, there is a sequence $\{\alpha_j\} \subset I$ such that $\mu_{\alpha_j}$ converges vaguely to a subprobability measure µ.

It is known by Helly's selection theorem that all Euclidean spaces have this property. It is also known that any Polish space, i.e., a complete, separable metric space, has this property (see Billingsley (1968)). Combining the above two results yields the following:

Theorem 14.2.16: Let P(·, ·) be a Feller transition function on a metric space (S, d) that admits an approximate identity and has the vague compactness property. Suppose there exists a closed set $A_0$, a function V : S → [0, ∞), numbers 0 < α, M < ∞ such that (2.26) and (2.27) hold. Then there exists a stationary probability measure µ for P(·, ·).

Proof: Fix an initial distribution λ. Then the family $\{\mu_{n,\lambda}\}_{n\ge 1}$ has a subsequence $\{\mu_{n_k,\lambda}\}_{k\ge 1}$ and a subprobability measure µ such that $\mu_{n_k,\lambda} \to \mu$ vaguely. Since $A_0$ is closed, this implies

$$\limsup_{k\to\infty} \mu_{n_k,\lambda}(A_0) \le \mu(A_0).$$

Thus $\mu(A_0) \ge \limsup_k \mu_{n_k,\lambda}(A_0) \ge \liminf_k \mu_{n_k,\lambda}(A_0) \ge \frac{\alpha}{\alpha + M}$, by Theorem 14.2.15. This yields that µ(S) > 0. By Theorem 14.2.14, µ is stationary for P. So $\tilde\mu(\cdot) \equiv \mu(\cdot)/\mu(S)$ is a probability measure that is stationary for P. □

Example 14.2.18: Consider a Markov chain generated by the iteration of iid random logistic maps $X_{n+1} = C_{n+1} X_n (1 - X_n)$, n ≥ 0, with S = [0, 1], $\{C_n\}_{n\ge 1}$ iid with values in [0,4] and $X_0$ independent of $\{C_n\}_{n\ge 1}$. Assume $E \log C_1 > 0$ and $E|\log(4 - C_1)| < \infty$. Then there exists a stationary probability measure π such that π((0, 1)) = 1. This follows from Theorem 14.2.16 by showing that if V(x) = |log x|, then there exist $A_0 = [a, b] \subset (0, 1)$ and constants 0 < α, M < ∞ such that (2.26) and (2.27) hold. For details, see Athreya (2004).

14.2.9.3 Convergence questions

If $\{X_n\}_{n\ge 0}$ is a Feller Markov chain (i.e., its transition function P(·, ·) is Feller continuous), what can one say about the convergence of the distribution of $X_n$ as n → ∞? If P(·, ·) admits a unique stationary probability measure π and the family $\{\mu_{n,\lambda} : n \ge 1\}$ is tight for a given λ, then one can conclude from Theorem 14.2.13 that every weak limit point of this family has to be π and hence π is the only weak limit point and that for this λ, $\mu_{n,\lambda} \longrightarrow^d \pi$. To go from this to the convergence of $P_\lambda(X_n \in \cdot)$ to π(·), one needs extra conditions to rule out periodic behavior.

Since the occupation measure $\mu_{n,\lambda}(A)$ is the mean of the empirical measure

$$L_n(A) \equiv \frac{1}{n}\sum_{j=0}^{n-1} I_A(X_j),$$

a natural question is what can one say about the convergence of the empirical measure? This is important for the statistical estimation of π. When the chain is Harris recurrent, it was shown in the previous section that for each x and for each $A \in \mathcal{S}$, $L_n(A) \to \pi(A)$ w.p. 1 ($P_x$).
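The random logistic maps of Example 14.2.18 and the empirical measure $L_n$ are straightforward to simulate. The sketch below assumes, purely for illustration, that $C_n$ is uniform on [1.5, 3.5]; this choice satisfies $E \log C_1 > 0$ (since $C_1 \ge 1.5 > 1$) and $E|\log(4 - C_1)| < \infty$ (since $4 - C_1$ stays in [0.5, 2.5]), but it is not taken from the text.

```python
import random

def random_logistic_path(n_steps, x0=0.3, c_low=1.5, c_high=3.5, seed=0):
    """Iterate X_{n+1} = C_{n+1} * X_n * (1 - X_n) with C_n iid uniform.

    The uniform law on [c_low, c_high] is an assumption of this sketch.
    """
    rng = random.Random(seed)
    x, path = x0, [x0]
    for _ in range(n_steps):
        c = rng.uniform(c_low, c_high)
        x = c * x * (1.0 - x)
        path.append(x)
    return path

def empirical_measure(path, a, b):
    """L_n([a, b]): the fraction of the path lying in the interval [a, b]."""
    return sum(1 for x in path if a <= x <= b) / len(path)
```

Evaluating `empirical_measure(path, a, b)` over a grid of intervals gives a histogram-type estimate of the stationary measure π, in line with the convergence of $L_n$ discussed above.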


For a Feller chain admitting a unique stationary measure π, one can appeal to the ergodic theorem to conclude that for each A in S, Ln (A) → π(A) w.p. 1 (Px ) for π-almost all x in S. Further, if S is Polish, then one can show that for π-almost all x, Ln (·) −→d π(·) w.p. 1 (Px ).

14.3 Markov chain Monte Carlo (MCMC)

14.3.1 Introduction

Let π be a probability measure on a measurable space (S, S). Let h(·) : S → R be S-measurable with $\int |h|\,d\pi < \infty$ and let $\lambda = \int h\,d\pi$. The effort in the computation of λ depends on the complexity of the function h(·) as well as that of the measure π. Clearly, a first approach is to go back to the definition of $\int h\,d\pi$ and use numerical approximation, such as approximating h(·) by a sequence of simple functions and evaluating π(·) on the sets involved in these simple functions. However, in many situations this may not be feasible, especially if the measure π(·) is specified only up to a constant that is not easy to evaluate. Such is often the case in Bayesian statistics, where π is the posterior distribution $\pi_{\theta|X}$ of the parameter θ given the data X, whose density is proportional to f(X|θ)ν(dθ), f(X|θ) being the density of X given θ and ν(dθ) the prior distribution of θ. In such situations, objects of interest are the posterior mean, variance, and other moments as well as the posterior probability of θ being in some set of interest. In these problems, the main difficulty lies in the evaluation of the normalizing constant $C(X) = \int f(X|\theta)\,\nu(d\theta)$. However, it may be possible to generate a sequence of random variables $\{X_n\}_{n\ge 1}$ such that the distribution of $X_n$ gets close to π in a suitable sense and a law of large numbers asserting that

$$\frac{1}{n}\sum_{i=1}^{n} h(X_i) \to \lambda = \int h\,d\pi$$

holds for a large class of h such that $\int |h|\,d\pi < \infty$. A method that has become very useful in Bayesian statistics in the past twenty years or so (with the advent of high speed computing) is that of generating a Markov chain $\{X_n\}_{n\ge 1}$ with stationary distribution π. This method has its origins in the important paper of Metropolis, Rosenbluth, Rosenbluth, Teller and Teller (1953). For the adaptation of this method to image processing problems, see Geman and Geman (1984). This method is now known as Markov chain Monte Carlo, or MCMC for short. For the basic limit theory of Markov chains, see Section 14.2. For proofs of the claims in the rest of this section and further details on MCMC, see the recent book of Robert and Casella (1999). In the rest of this section two of the widely used MCMC algorithms are discussed. These are the Metropolis-Hastings algorithm and the Gibbs sampler.

14.3.2 Metropolis-Hastings algorithm

Let π be a probability measure on a measurable space (S, S). Let π be dominated by a σ-finite measure µ with density f(·). Let for each x, q(y|x) be a probability density in y w.r.t. µ. That is, q(y|x) is jointly measurable as a function from (S × S, S × S) → $\mathbb{R}_+$ and for each x, $\int_S q(y|x)\,\mu(dy) = 1$. Such a distribution q(·|·) is called the instrumental or proposal distribution. The Metropolis-Hastings algorithm generates a Markov chain $\{X_n\}$ using the densities f(·) and q(·) in two steps as follows:

Step 1: Given $X_n = x$, first generate a random variable $Y_n$ with density q(·|x).

Step 2: Then, set $X_{n+1} = Y_n$ with probability $p(x, Y_n)$ and $X_{n+1} = X_n$ with probability $1 - p(x, Y_n)$, where

$$p(x, y) \equiv \min\Big\{\frac{f(y)}{f(x)}\,\frac{q(x|y)}{q(y|x)},\ 1\Big\}. \tag{3.1}$$

Thus, the value $Y_n$ is "accepted" as $X_{n+1}$ with probability $p(x, Y_n)$ and if rejected the chain stays where it was, i.e., at $X_n$. Implicit in the above definition is that the state space of the Markov chain $\{X_n\}$ is simply the set $A_f \equiv \{x : f(x) > 0\}$. It is also assumed that for all x, y in $A_f$, q(y|x) > 0. The transition function P(x, A) for this Markov chain is given by

$$P(x, A) = I_A(x)\big(1 - r(x)\big) + \int_A p(x, y)\,q(y|x)\,\mu(dy) \tag{3.2}$$

where

$$r(x) = \int_S p(x, y)\,q(y|x)\,\mu(dy).$$

It turns out that the measure π(·) is a stationary measure for this Markov chain $\{X_n\}$. Indeed, for any $A \in \mathcal{S}$,

$$\begin{aligned} \int_S P(x, A)\,\pi(dx) &= \int_S P(x, A)\,f(x)\,\mu(dx) \\ &= \int_S I_A(x)\big(1 - r(x)\big) f(x)\,\mu(dx) + \int_S \int_A p(x, y)\,q(y|x)\,f(x)\,\mu(dy)\,\mu(dx). \end{aligned} \tag{3.3}$$

By definition of p(x, y), the identity

$$q(y|x)\,f(x)\,p(x, y) = p(y, x)\,q(x|y)\,f(y) \tag{3.4}$$

holds for all x, y. Thus the second integral in (3.3) (using Tonelli's theorem) is

$$\int_A \Big(\int_S p(y, x)\,q(x|y)\,\mu(dx)\Big) f(y)\,\mu(dy) = \int_A r(y)\,f(y)\,\mu(dy).$$

Thus, the right side of (3.3) is

$$\int_S I_A(x)\,f(x)\,\mu(dx) \equiv \pi(A),$$

verifying stationarity. From the results of Section 14.2, it follows that if the transition function P(·, ·) is Harris irreducible w.r.t. some reference measure ϕ, then (since it admits π as a stationary measure) the law of large numbers holds, i.e., for any $h \in L^1(\pi)$,

$$\frac{1}{n}\sum_{j=0}^{n-1} h(X_j) \to \int h\,d\pi \quad \text{as } n \to \infty$$

w.p. 1 for any initial distribution. Thus, a good MCMC approximation to $\lambda = \int h\,d\pi$ is $\hat\lambda_n \equiv \frac{1}{n}\sum_{j=0}^{n-1} h(X_j)$. A sufficient condition for irreducibility is that q(y|x) > 0 for all (x, y) in $A_f \times A_f$. Summarizing the above discussion leads to

Theorem 14.3.1: Let π be a probability measure on a measurable space (S, S) with probability density f(·) w.r.t. a σ-finite measure µ. Let $A_f = \{x : f(x) > 0\}$. Let q(y|x) be a measurable function from $A_f \times A_f$ → (0, ∞) such that $\int_S q(y|x)\,\mu(dy) = 1$ for all x in $A_f$. Let $\{X_n\}_{n\ge 0}$ be a Markov chain generated by the Metropolis-Hastings algorithm (3.1). Then, for any $h \in L^1(\pi)$,

$$\frac{1}{n}\sum_{j=0}^{n-1} h(X_j) \to \int h\,d\pi \quad \text{as } n \to \infty \quad \text{w.p. 1} \tag{3.5}$$

for any (initial) distribution of $X_0$.

The Metropolis-Hastings algorithm does not need the full knowledge of the target density f(·) of π(·). The function f(·) enters the algorithm only through the function p(x, y), which involves only the knowledge of $\frac{f(y)}{f(x)}$ and q(·|·), and hence this algorithm can be implemented even if f is known only up to a multiplicative constant. This is often the case in Bayesian statistics. Also, the choice of q(x|y) depends on f(·) only through the condition that q(x|y) > 0 on $A_f \times A_f$. Thus, the Metropolis-Hastings algorithm has wide applicability. Two special cases of this algorithm are given below.

14.3.2.1 Independent Metropolis-Hastings

Let q(y|x) ≡ g(y) where g(·) is a probability density such that g(y) > 0 if f(y) > 0. Suppose $\sup\big\{\frac{f(y)}{g(y)} : f(y) > 0\big\} \equiv M < \infty$. Then, in addition to the law of large numbers (3.5) of Theorem 14.3.1, it holds that for any initial value of $X_0$,

$$\big\|P(X_n \in \cdot) - \pi(\cdot)\big\| \le 2\Big(1 - \frac{1}{M}\Big)^n$$

where ‖·‖ is the total variation norm. Thus, the distribution of $\{X_n\}$ converges in total variation at a geometric rate. For a proof, see Robert and Casella (1999).

14.3.2.2 Random-walk Metropolis-Hastings

Here the state space is the real line or a subset of some Euclidean space. Let q(y|x) = g(y − x) where g(·) is a probability density such that g(y − x) > 0 for all x, y such that f(x) > 0, f(y) > 0. This ensures irreducibility and hence the law of large numbers (3.5) holds. A sufficient condition for geometric convergence of the distribution of $\{X_n\}$ in the real line case is the following:

(a) The density f(·) is symmetric about 0 and is asymptotically log concave, i.e., it holds that for some α > 0 and $x_0 > 0$, log f(x) − log f(y) ≥ α|y − x| for all $y < x < -x_0$ or $x_0 < x < y$.

(b) The density function g(·) is positive and symmetric.

For further special cases and more results, see Robert and Casella (1999).
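The two steps of the algorithm, specialized to the random-walk case of Section 14.3.2.2, can be sketched as follows. This is an illustrative sketch only: the increment density g is taken to be normal with standard deviation `scale` (an assumption, not a choice made in the text), and the target is the standard normal density known up to its normalizing constant. Because g is symmetric, q cancels in (3.1) and the acceptance probability reduces to min{f(y)/f(x), 1}.

```python
import math
import random

def random_walk_metropolis(log_f, x0, n_steps, scale=2.0, seed=0):
    """Random-walk Metropolis-Hastings with N(0, scale^2) increments.

    log_f is the log of the target density up to an additive constant,
    so the normalizing constant of f is never needed.
    """
    rng = random.Random(seed)
    x, path, accepted = x0, [x0], 0
    for _ in range(n_steps):
        y = x + rng.gauss(0.0, scale)            # Step 1: propose Y_n
        log_ratio = log_f(y) - log_f(x)          # log of f(y) / f(x)
        # Step 2: accept with probability min{f(y)/f(x), 1}.
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = y
            accepted += 1
        path.append(x)                           # on rejection the chain stays put
    return path, accepted / n_steps

def log_unnormalized_normal(x):
    """log of exp(-x^2 / 2): the standard normal density without its constant."""
    return -0.5 * x * x
```

The time averages of `path` then serve as the estimate $\hat\lambda_n$ of (3.5); for example, averaging x and $x^2$ over the path should give values near 0 and 1 for this target.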

14.3.3 The Gibbs sampler

Suppose π is the probability distribution of a bivariate random vector (X, Y). A Markov chain $\{Z_n\}_{n\ge 0}$ can be generated with π as its stationary distribution using only the families of conditional distributions Q(·, y) of X|Y = y and P(·, x) of Y|X = x for all x, y generated by the joint distribution π of (X, Y). This Markov chain is known as the Gibbs sampler. The algorithm is as follows:

Step 1: Start with some initial value $X_0 = x_0$. Generate $Y_0$ according to the conditional distribution $P(Y \in \cdot \mid X_0 = x_0) = P(\cdot, x_0)$.

Step 2: Next, generate $X_1$ according to the conditional distribution $P(X_1 \in \cdot \mid Y_0 = y_0) = Q(\cdot, y_0)$.

Step 3: Now generate $Y_1$ as in Step 1 but with conditioning value $X_1$.

Step 4: Now generate $X_2$ as in Step 2 but with conditioning value $Y_1$ and so on.


Thus, starting from X0 , one generates successively Y0 , X1 , Y1 , X2 , Y2 , . . .. Clearly, the sequences {Xn }n≥0 , {Yn }n≥0 and {Zn ≡ (Xn , Yn )}n≥0 are all Markov chains. It is also easy to verify that the marginal distribution πX of X, the marginal distribution πY of Y , and the distribution π are, respectively, stationary for the {Xn }, {Yn }, and {Zn } chains. Indeed, if X0 ∼ πX , then Y0 ∼ πY and hence X1 ∼ πX . Similarly one can verify the other two claims. Recall that a suﬃcient condition for the law of large numbers (3.5) to hold is irreducibility. A suﬃcient condition for irreducibility in turn is that the chain {Zn }n≥0 has a transition function R(z, ·) that, for each z = (x, y), is absolutely continuous with respect to some ﬁxed dominating measure on R2 . The above algorithm is easily generalized to cover the k-variate case (k > 2). Let (X1 , X2 , . . . , Xk ) be a random vector with distribution π. For any vector x = (x1 , x2 , . . . , xk ) let x(i) = (x1 , x2 , . . . , xi−1 , xi+1 , . . . , xk ) and Pi (· | x(i) ) be the conditional distribution of Xi given X(i) = x(i) . Now generate a Markov chain Zn ≡ (Zn1 , Zn2 , . . . , Znk ), n ≥ 0 as follows: Step 1: Start with some initial value Z0j = z0j , j = 1, 2, . . . , k − 1. Generate Z0k from the conditional distribution Pk (· | Xj = z0j , j = 1, 2, . . . , k − 1). Step 2: Next, generate Z11 from the conditional distribution P1 (· | Xj = z0j , j = 2, . . . , k − 1, Xk = Z0k ). Step 3: Next, generate Z12 from the conditional distribution P2 (· | X1 = Z11 , Xj = z0j , j = 3, . . . , k − 1, Xk = Z0k ) and so on until (Z11 , Z12 , . . . , Z1,k−1 ) is generated. Now go back to Step 1 to generate Z1k and repeat Steps 2 and 3 and so on. This sequence {Zn }n≥0 is called the Gibbs sampler Markov chain for the distribution π. A suﬃcient condition for irreducibility given earlier for the 2-variate case carries over to the k-variate case. For more on the Gibbs sampler, see Robert and Casella (1999).
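The two-variable scheme above can be sketched for a target whose conditionals are explicit. The example below assumes, for illustration only, that π is the bivariate normal distribution with zero means, unit variances and correlation ρ, for which X | Y = y is N(ρy, 1 − ρ²) and Y | X = x is N(ρx, 1 − ρ²); the function name is hypothetical.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_steps, x0=0.0, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Alternates draws from the two conditional distributions, producing
    X_0, Y_0, X_1, Y_1, ... and returning the pairs Z_n = (X_n, Y_n).
    """
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho * rho)        # conditional standard deviation
    x = x0
    y = rng.gauss(rho * x, sd)             # Step 1: Y_0 given X_0
    chain = [(x, y)]
    for _ in range(n_steps):
        x = rng.gauss(rho * y, sd)         # X_{n+1} given Y_n
        y = rng.gauss(rho * x, sd)         # Y_{n+1} given X_{n+1}
        chain.append((x, y))
    return chain
```

Averaging x·y over the chain estimates E(XY) = ρ under this target, which gives a simple check that the chain is sampling from the intended joint distribution.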

14.4 Problems

14.1 (a) Show using Definition 14.1.1 that when the state space S is countable, for any n, conditioned on $\{X_n = a_n\}$, the events $\{X_{n+j} = a_{n+j},\ 1 \le j \le k\}$ and $\{X_j = a_j : 0 \le j \le n-1\}$ are independent for all choices of k and $\{a_j\}_{j=0}^{n+k}$. Thus, conditioned on the "present" $\{X_n = a_n\}$, the "past" $\{X_j : j \le n-1\}$ and "future" $\{X_j : j \ge n+1\}$ are two families of independent random variables with respect to the conditional probability measure $P(\cdot \mid X_n = a_n)$, provided $P(X_n = a_n) > 0$.

(b) Prove Proposition 14.2.2 using induction on n (cf. Chapter 6).

14.2 In Example 14.1.1 (Frog in the well), verify that

(a) if $\alpha_i \equiv 1 - \frac{1}{ci}$, $c > 1$, $i \ge 1$, then 1 is null recurrent,

(b) if $\alpha_i \equiv \alpha$, $0 < \alpha < 1$, then 1 is positive recurrent, and

(c) if $\alpha_i \equiv 1 - \frac{1}{2i^2}$, then 1 is transient.

14.3 Consider SSRW in $\mathbb{Z}^2$ where the transition probabilities are $p_{(i,j)(i',j')} = \frac{1}{4}$ each if $(i',j') \in \{(i+1,j), (i-1,j), (i,j+1), (i,j-1)\}$ and zero otherwise. Verify that for $n = 2k$
$$p^{(2k)}_{(0,0),(0,0)} = \binom{2k}{k}^2 \frac{1}{4^{2k}} \sim \frac{1}{\pi}\,\frac{1}{k}$$
and conclude that (0,0) is null recurrent. Extend this calculation to SSRW in $\mathbb{Z}^3$ and conclude that (0,0,0) is transient.

14.4 Show that if i is absorbing and j → i, then j is transient by showing that if j → i, then $f^*_{ji} = P(T_i < T_j \mid X_0 = j) > 0$ and $1 - f_{jj} \ge f^*_{ji}$.

14.5 (a) Let i be recurrent and i → j. Show that j is recurrent using Corollary 14.1.5. (Hint: Show that there exist $n_0$ and $m_0$ such that for all n, $p_{jj}^{(n_0+n+m_0)} \ge p_{ji}^{(n_0)}\, p_{ii}^{(n)}\, p_{ij}^{(m_0)}$ with $p_{ji}^{(n_0)} > 0$ and $p_{ij}^{(m_0)} > 0$.)

(b) Let i and j communicate. Show that $d_i = d_j$.

14.6 Show that in a finite state space irreducible Markov chain (S, P), all states are positive recurrent by showing

(a) that for any i, j in S, there exists r, r ≤ K, such that $p_{ij}^{(r)} > 0$, where K is the number of states in S,

(b) for any i in S, there exist 0 < α < 1 and c < ∞ such that $P_i(T_i > n) \le c\alpha^n$.

Give an alternate proof by showing that if S is finite, then for any initial distribution µ, the occupation measures
$$\mu_n(\cdot) \equiv \frac{1}{n+1} \sum_{j=0}^{n} P_\mu(X_j \in \cdot)$$
have a subsequence that converges to a probability distribution π that is stationary for (S, P).

14.7 Prove Theorem 14.1.3 using the Markov property and induction.


14.8 Adapt the proof of Theorem 14.1.9 to show that for any i, j
$$\frac{1}{n} \sum_{k=1}^{n} p_{ij}^{(k)} \to \frac{f_{ij}}{E_j T_j}$$
if j is positive recurrent and 0 otherwise. Conclude that in a finite state space case, there must be at least one state that is positive recurrent.

14.9 If j → i then $\zeta_1 \equiv \sum_{r=0}^{T_i - 1} \delta_{X_r j}$, the number of visits to j before visiting i, satisfies $P_i(\zeta_1 > n) \le c\alpha^n$ for some 0 < c < ∞, 0 < α < 1 and all n ≥ 1.

14.10 Adapt the proof of Theorem 14.1.9 to establish the following laws of large numbers. Let (S, P) be irreducible and positive recurrent with stationary distribution π.

(a) Let h : S → R be such that $\sum_{j \in S} |h(j)| \pi_j < \infty$. Then, for any initial distribution µ,
$$\frac{1}{n+1} \sum_{j=0}^{n} h(X_j) \to \sum_{j \in S} h(j)\pi_j \quad \text{w.p. 1}$$
by first verifying that
$$E_i \Big| \sum_{j=0}^{T_i - 1} h(X_j) \Big| < \infty.$$

(b) Let g : S × S → R be such that $\sum_{i,j \in S} |g(i,j)| \pi_i p_{ij} < \infty$. Then, for any initial distribution µ,
$$\frac{1}{n+1} \sum_{j=0}^{n} g(X_j, X_{j+1}) \to \sum_{i,j \in S} g(i,j) \pi_i p_{ij} \quad \text{w.p. 1.}$$

(c) Fix two disjoint subsets A and B in S. Evaluate the long run proportion of transitions from A to B. (d) Extend (b) to conclude that the tail sequence Zn ≡ {Xn+j : j ≥ 0} of the Markov chain {Xn }n≥0 converges as n → ∞ in the sense of ﬁnite dimensional distributions to the strictly stationary sequence {Xn }n≥0 which is the Markov chain (S, P ) with initial distribution π. 14.11 Let {Xn }n≥0 be a Markov chain that is irreducible and has at least two states. Show that w.p. 1 the trajectories {Xn } do not converge, i.e., w.p. 1, limn→∞ Xn does not exist. (Hint: Show that w.p. 1, the set of limit points of the set {Xn : n ≥ 0} coincides with S.)


14.12 Let $\{X_n\}_{n \ge 0}$ be a Markov chain with state space S and tr. pr. $P \equiv ((p_{ij}))$. A probability distribution $\pi \equiv \{\pi_j : j \in S\}$ is said to satisfy the condition of detailed balance or time reversibility with respect to (S, P) if for all i, j, $\pi_i p_{ij} = \pi_j p_{ji}$.

(a) Show that such a π is necessarily a stationary distribution.

(b) For the birth and death chain (Example 14.1.4), find a condition in terms of the birth and death rates $\{\alpha_i, \beta_i\}_{i \ge 0}$ for the existence of a probability distribution π that satisfies the condition of detailed balance.

14.13 (Absorption probabilities and times). Let 0 be an absorbing state. For any i ≠ 0, let $\theta_i = f_{i0} \equiv P_i(T_0 < \infty)$ and $\eta_i = E_i T_0$. Show using the Markov property that for every i ≠ 0,
$$\theta_i = p_{i0} + \sum_{j \ne 0} \theta_j p_{ij}, \qquad \eta_i = 1 + \sum_{j \ne 0} \eta_j p_{ij}.$$
Apply this to the Gambler's ruin problem with S = {0, 1, 2, . . . , N}, N < ∞ and $p_{00} = 1$, $p_{NN} = 1$, $p_{i,i+1} = p$, $p_{i,i-1} = 1 - p$, 0 < p < 1, 1 ≤ i ≤ N − 1 and find the probability and expected waiting time for ruin (absorption at 0) starting from an initial fortune of i, 1 ≤ i ≤ N − 1.

14.14 (Renewal theory via Markov chains). Let $\{X_j\}_{j \ge 1}$ be iid positive integer valued random variables. Let $S_0 = 0$, $S_n = \sum_{j=1}^{n} X_j$, n ≥ 1, let N(n) = k if $S_k \le n < S_{k+1}$, k = 0, 1, 2, . . . be the number of renewals up to time n, and let $A_n = n - S_{N(n)}$ be the age of the current unit at time n.

(a) Show that $\{A_n\}_{n \ge 0}$ is a Markov chain and find its state space S and transition probabilities.

(b) Assuming that $EX_1 < \infty$, verify that
$$\pi_j = \frac{P(X_1 > j)}{EX_1}, \quad j = 0, 1, 2, \ldots$$
is the unique stationary distribution.

(c) Assuming that $X_1$ has an aperiodic distribution and Theorem 14.1.18 holds, show that the discrete renewal theorem holds.

14.15 Prove Proposition 14.2.1 for the countable state space case.

14.16 Prove Proposition 14.2.2.


14.17 Establish assertion (i) of Theorem 14.2.3. (Hint: Show that
$$d\big(\hat X_n(x,\omega), \hat X_{n+1}(x,\omega)\big) \le \Big(\prod_{i=1}^{n} s(f_i)\Big)\, d\big(x, f_{n+1}(x,\omega)\big)$$
and use Borel-Cantelli to show that the right side is $O(\lambda^n)$ w.p. 1 for some 0 < λ < 1, and show similarly that $d\big(\hat X_n(x,\omega), \hat X_n(y,\omega)\big) = O(\lambda^n)$ w.p. 1 for any x, y.)

14.18 Show that if P(·, ·) is the transition function of a Markov chain $\{X_n\}_{n \ge 0}$, then for any n ≥ 0, $P_x(X_n \in A) = P^{(n)}(x, A)$, where $P^{(n)}(\cdot,\cdot)$ is defined by the iteration
$$P^{(n+1)}(x, A) = \int_S P^{(n)}(y, A)\, P(x, dy),$$
with $P^{(0)}(x, A) = I_A(x)$.

(Hint: Use induction and Markov property.) 14.19 (a) Let {Xn }n≥0 be a random walk deﬁned by the iteration scheme Xn+1 = Xn +ηn+1 where {ηn }n≥1 are iid random variables independent of X0 . Assume that ν(·) = P (η1 ∈ ·) has an absolutely continuous component with a density that is strictly positive a.e. on an open interval around 0. Show that {Xn }n≥0 is Harris irreducible w.r.t. the Lebesgue measure on R. Show that if in addition Eη1 = 0, then {Xn } is Harris recurrent as well. (b) Use Theorem 14.2.11 to establish the second claim in Example 14.2.10. 14.20 Show that the waiting time chain (Example 14.2.6) deﬁned by Xn+1 = max{Xn + ηn+1 , 0}, where {ηn }n≥1 are iid is irreducible with reference measure φ(·) ≡ δ0 (·), the delta measure at 0, provided P (η1 < 0) > 0. Show further that it is φ recurrent if Eη1 < 0. 14.21 Prove Theorem 14.2.5 (i) using the C-set lemma. 14.22 Find a h : [0, 1] × [0, 1] → [0, 1] such that h(x, y) is discontinuous in x for almost all y in [0, 1] and conclude that the function P (x, A) = P h(x, Y ) ∈ A where Y is a uniform [0,1] random variable need not be Feller. 14.23 Let (Ω, F, P ) be a probability space and (S, S) a measurable space. Let h : S × Ω → S be jointly measurable. Show that P (x, A) ≡ P (h(x, ω) ∈ A) is a transition function. 14.24 (a) Let {Xn }n≥0 be an irreducible Markov chain with state space S ≡ {0, 1, 2, . . .}. Suppose V : S → [0, ∞) is such that for some K < ∞, Ex V (X1 ) ≤ V (x) for all x > K and that limx→∞ V (x) = ∞. Show that {Xn }n≥0 is recurrent.

(Hint: Let $\{\tilde X_n\}_{n \ge 0}$ be a Markov chain with state space S ≡ {0, 1, 2, . . .} and transition probabilities same as that of $\{X_n\}_{n \ge 0}$ except that the states {0, 1, 2, . . . , K} are absorbing. Verify that $\{V(\tilde X_n)\}_{n \ge 0}$ is a nonnegative super-martingale and hence that $\{\tilde X_n\}_{n \ge 0}$ is bounded w.p. 1. Now conclude that there must exist a state x that gets visited infinitely often by $\{X_n\}_{n \ge 0}$.)

(b) Consider the reflecting nonhomogeneous random walk on S ≡ {0, 1, 2, . . .} such that
$$p_{ij} = \begin{cases} p_i & \text{if } j = i + 1 \\ 1 - p_i & \text{if } j = i - 1 \end{cases}$$
with $p_0 = 1$, $0 < p_i \le q_i$ (here $q_i \equiv 1 - p_i$) for all $i \ge k_0$ and some $1 \le k_0 < \infty$, and $0 < p_i < 1$ for all i ≥ 1. Show that $\{X_n\}_{n \ge 0}$ is irreducible and recurrent.

14.25 Let $\{X_n\}_{n \ge 0}$ be an irreducible and recurrent Markov chain with a countable state space S. Let $V : S \to R_+$ be such that $E_x V(X_1) \le V(x)$ for all x in S. Show that V(·) is constant on S.

14.26 Let $\{C_n\}_{n \ge 1}$ be iid random variables with values in [0, 4]. Let $\{X_n\}_{n \ge 0}$ be a Markov chain with values in [0, 1] defined by the random iteration scheme $X_{n+1} = C_{n+1} X_n (1 - X_n)$, n ≥ 0.

(a) Show that if $E \log C_1 < 0$ then $X_n = O(\lambda^n)$ w.p. 1 for some 0 < λ < 1.

(b) Show also that if $E \log C_1 < 0$ and $0 < \mathrm{Var}(\log C_1) < \infty$ then there exist sequences $\{a_n\}_{n \ge 1}$ and $\{b_n\}_{n \ge 1}$ such that
$$\frac{\log X_n - a_n}{b_n} \longrightarrow^d N(0, 1).$$

15 Stochastic Processes

This chapter gives a brief discussion of two special classes of real valued stochastic processes {X(t) : t ≥ 0} in continuous time [0, ∞). These are continuous time Markov chains with a discrete state space (including Poisson processes) and Brownian motion. Both are very useful in many areas of application such as queuing theory and mathematical finance.

15.1 Continuous time Markov chains

15.1.1 Definition

Consider a physical system that can be in one of a finite or countable number of states {0, 1, 2, . . . , K}, K ≤ ∞. Assume that the system evolves in continuous time in the following manner. In each state the system stays a random length of time that is exponentially distributed and then jumps to a new state with a probability distribution that depends only on the current state and not on the past history. Thus, if the state of the system at the time of the nth transition is denoted by $y_n$, n = 0, 1, 2, . . ., then $\{y_n\}_{n \ge 0}$ is a Markov chain with state space S ≡ {0, 1, 2, . . . , K}, K ≤ ∞ and some transition probability matrix $P \equiv ((p_{ij}))$. If $y_n = i_n$, then the system stays in $i_n$ a random length of time $L_n$, called the sojourn time, such that conditional on $\{y_n = i_n\}_{n \ge 0}$, $\{L_n\}_{n \ge 0}$ are independent exponential random variables with $L_n$ having mean $\lambda_{i_n}^{-1}$. Now set the state of the system X(t) at time t ≥ 0 by
$$X(t) = \begin{cases} y_0 & 0 \le t < L_0 \\ y_1 & L_0 \le t < L_0 + L_1 \\ \ \vdots \\ y_n & L_0 + L_1 + \cdots + L_{n-1} \le t < L_0 + L_1 + \cdots + L_n \\ \ \vdots \end{cases} \tag{1.1}$$
Then {X(t) : t ≥ 0} is called a continuous time Markov chain with state space S, jump probabilities $P \equiv ((p_{ij}))$, waiting time parameters $\{\lambda_i : i \in S\}$, and embedded Markov chain $\{y_n\}_{n \ge 0}$. To make sure that there are only a finite number of transitions in finite time, i.e.,
$$\sum_{i=0}^{\infty} L_i = \infty \quad \text{w.p. 1,}$$
one needs to impose the nonexplosion condition
$$\sum_{n=0}^{\infty} \frac{1}{\lambda_{y_n}} = \infty \quad \text{w.p. 1} \tag{1.2}$$
(Problem 15.1). Clearly, a sufficient condition for this is that $\lambda_i < \infty$ for all i ∈ S and $\{y_n\}_{n \ge 0}$ is an irreducible recurrent Markov chain. It can be verified using the "memorylessness" property of the exponential distribution (Problem 15.2) that {X(t) : t ≥ 0} has the Markov property, i.e., for any $0 \le t_1 \le t_2 \le \cdots \le t_r < \infty$ and any states $i_1, i_2, \ldots, i_r$ in S,
$$P\big(X(t_r) = i_r \mid X(t_j) = i_j,\ 1 \le j \le r-1\big) = P\big(X(t_r) = i_r \mid X(t_{r-1}) = i_{r-1}\big). \tag{1.3}$$
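The construction (1.1) translates directly into a simulation: draw an exponential sojourn time in the current state, then move according to the embedded chain. The sketch below is a minimal illustration, not part of the text; the function names and the two-state example in the usage note are assumptions made here.

```python
import random

def simulate_ctmc(P, lam, y0, t_max, seed=0):
    """Return (jump times, states) of X(t) on [0, t_max], following (1.1).

    P[i][j] are the jump probabilities of the embedded chain and lam[i] > 0
    are the waiting time parameters, so the sojourn in state i is
    exponential with mean 1 / lam[i].
    """
    rng = random.Random(seed)
    states = list(range(len(P)))
    times, path = [0.0], [y0]
    t, y = 0.0, y0
    while True:
        t += rng.expovariate(lam[y])              # sojourn time L_n in state y_n
        if t >= t_max:
            return times, path
        y = rng.choices(states, weights=P[y])[0]  # next state y_{n+1}
        times.append(t)
        path.append(y)

def occupation_fraction(times, path, t_max, state):
    """Fraction of [0, t_max] that the simulated path spends in `state`."""
    ends = times[1:] + [t_max]
    return sum(e - s for s, e, y in zip(times, ends, path) if y == state) / t_max
```

For instance, with P = [[0, 1], [1, 0]] and lam = [1.0, 3.0] the chain alternates between the two states, with mean sojourn 1 in state 0 and 1/3 in state 1, so over a long horizon the fraction of time spent in state 0 should be near 0.75.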

15.1.2 Kolmogorov's differential equations

The functions
$$p_{ij}(t) \equiv P\big(X(t) = j \mid X(0) = i\big) \tag{1.4}$$
are called transition functions. To determine these functions from the jump probabilities $\{p_{ij}\}$ and the waiting time parameters $\{\lambda_i\}$, one uses the Chapman-Kolmogorov equations
$$p_{ij}(t + s) = \sum_{k \in S} p_{ik}(t)\, p_{kj}(s), \quad t, s \ge 0, \tag{1.5}$$
which are an immediate consequence of the Markov property (1.3) and the definition (1.4). In addition to (1.5), one has the continuity condition
$$\lim_{t \downarrow 0} p_{ij}(t) = \delta_{ij}. \tag{1.6}$$

Under the nonexplosion hypothesis (1.2), it can be shown (Chung (1967), Feller (1966), Karlin and Taylor (1975)) that the $p_{ij}(t)$ are differentiable as functions of t and satisfy Kolmogorov's forward and backward differential equations
$$p'_{ij}(t) = \sum_{k} p_{ik}(t)\, p'_{kj}(0) \quad \text{(forward)} \tag{1.7a}$$
$$p'_{ij}(t) = \sum_{k} p'_{ik}(0)\, p_{kj}(t) \quad \text{(backward)} \tag{1.7b}$$
Further, $a_{kj} \equiv p'_{kj}(0)$ can be shown to be $\lambda_k p_{kj}$ for k ≠ j and $-\lambda_k$ for k = j. The matrix $A \equiv ((a_{ij}))$ is called the infinitesimal matrix or generator of the process {X(t) : t ≥ 0}. If the state space S is finite, i.e., K < ∞, then $P(t) \equiv ((p_{ij}(t)))$ can be shown to be
$$P(t) = \exp(At) \equiv \sum_{n=0}^{\infty} \frac{A^n}{n!}\, t^n. \tag{1.8}$$

15.1.3 Examples

Example 15.1.1: (Birth and death process). Here
$$p_{i,i+1} = \frac{\alpha_i}{\alpha_i + \beta_i}, \quad i \ge 0, \qquad p_{i,i-1} = \frac{\beta_i}{\alpha_i + \beta_i}, \quad i \ge 1, \qquad \lambda_i = (\alpha_i + \beta_i), \quad i \ge 0,$$
where $\{\alpha_i, \beta_i\}_{i \ge 0}$ are nonnegative numbers with $\alpha_i$ being the birth rate and $\beta_i$ being the death rate. This has the meaning that given X(t) = i, for small h > 0, X(t + h) goes up to (i + 1) with probability $\alpha_i h + o(h)$ or goes down to (i − 1) with probability $\beta_i h + o(h)$ or stays at i with probability $1 - (\alpha_i + \beta_i)h + o(h)$. In this case the forward and backward equations become
$$p'_{ij}(t) = \alpha_{j-1}\, p_{i,j-1}(t) + \beta_{j+1}\, p_{i,j+1}(t) - (\alpha_j + \beta_j)\, p_{ij}(t),$$
$$p'_{ij}(t) = \alpha_i\, p_{i+1,j}(t) + \beta_i\, p_{i-1,j}(t) - (\alpha_i + \beta_i)\, p_{ij}(t)$$
with initial condition $p_{ij}(0) = \delta_{ij}$.

(a) Pure birth process. A special case of the above is when $\beta_i \equiv 0$ for all i. Here $p_{ij}(t) = 0$ if j < i, X(t) is a nondecreasing function of t, and the jumps are of size one. A further special case of this is when $\alpha_i \equiv \alpha$ for all i. In this case, the process waits in each state a random length of time with mean $\alpha^{-1}$ and jumps one step higher. It can be verified that in this case, the solution of Kolmogorov's differential equations (1.7a) and (1.7b) is given by
$$p_{ij}(t) = e^{-\alpha t}\, \frac{(\alpha t)^{j-i}}{(j-i)!}. \tag{1.9}$$

From this it is easy to conclude that {X(t) : t ≥ 0} is a Lévy process, i.e., it has stationary and independent increments: for $0 = t_0 \le t_1 \le t_2 \le \cdots \le t_r < \infty$, the increments $Y_j = X(t_j) - X(t_{j-1})$, j = 1, 2, . . . , r are independent and the distribution of $Y_j$ depends only on $(t_j - t_{j-1})$. Further, in this case, X(t) − X(0) has a Poisson distribution with mean αt. This {X(t) : t ≥ 0} is called a Poisson process with intensity parameter α.

Another special case is the linear birth and death process. Here $\alpha_i = i\alpha$, $\beta_i = i\beta$ for i = 0, 1, 2, . . .. The pure death process has parameters $\alpha_i \equiv 0$ for i ≥ 0.

A number of queuing processes can be modeled as a birth and death process and more generally as a continuous time Markov chain. For example, an M/M/s queuing system is one in which customers arrive at a service facility at the jump times of a Poisson process (with parameter α) and there are s servers with service time at each server being exponential with the same mean ($= \beta^{-1}$). The number X(t) of customers in the system at time t evolves as a birth and death process with parameters $\alpha_i \equiv \alpha$ for i ≥ 0, $\beta_i = i\beta$ for 0 ≤ i ≤ s, and $\beta_i = s\beta$ for i > s.

Example 15.1.2: (Markov branching processes). Here X(t) is the population size in a process where each particle lives a random length of time with exponential distribution with mean $\alpha^{-1}$ and on death creates a random number of new particles with offspring distribution $\{p_j\}_{j \ge 0}$, and all particles evolve independently of each other. This implies that $\lambda_i = i\alpha$, i ≥ 0, $p_{ij} = p_{j-i+1}$ for j ≥ i − 1 and $p_{ij} = 0$ for j < i − 1, i ≥ 1, and $p_{00} = 1$. Thus 0 is an absorbing barrier. The random variable T ≡ inf{t : t > 0, X(t) = 0} is called the extinction time. It can be shown that
$$\sum_{j=0}^{\infty} p_{ij}(t)\, s^j = \Big(\sum_{j=0}^{\infty} p_{1j}(t)\, s^j\Big)^{i} \quad \text{for } i \ge 0,\ 0 \le s \le 1, \tag{1.10}$$
and also that
$$F(s, t) \equiv \sum_{j=0}^{\infty} p_{1j}(t)\, s^j$$
satisfies the differential equations
$$\frac{\partial F(s,t)}{\partial t} = u(s)\, \frac{\partial F(s,t)}{\partial s} \quad \text{(forward equation)} \tag{1.11}$$
$$\frac{\partial F(s,t)}{\partial t} = u\big(F(s,t)\big) \quad \text{(backward equation)} \tag{1.12}$$
with F(s, 0) ≡ s, where
$$u(s) \equiv \alpha\Big(\sum_{j=0}^{\infty} p_j s^j - s\Big). \tag{1.13}$$
Further, if q ≡ P(T < ∞ | X(0) = 1) is the extinction probability, then q is the smallest solution in [0,1] of the equation $q = \sum_{j=0}^{\infty} p_j q^j$ (cf. Chapter 18). (See Athreya and Ney (2004), Chapter III, p. 106.)

Example 15.1.3: (Compound Poisson processes). Let $\{L_i\}_{i \ge 0}$ and $\{\xi_i\}_{i \ge 1}$ be two independent sequences of random variables such that $\{L_i\}_{i \ge 0}$ are iid exponential with mean $\alpha^{-1}$ and $\{\xi_i\}_{i \ge 1}$ are iid integer valued random variables with distribution $\{p_j\}$. Let
$$X(t) = \begin{cases} 0 & 0 \le t < L_0 \\ 1 & L_0 \le t < L_0 + L_1 \\ \ \vdots \\ k & L_0 + \cdots + L_{k-1} \le t < L_0 + \cdots + L_k \\ \ \vdots \end{cases}$$
Let
$$Y(t) = \sum_{i=1}^{X(t)} \xi_i, \quad t \ge 0. \tag{1.14}$$
Then {Y(t) : t ≥ 0} is a continuous time Markov chain with state space S ≡ {0, ±1, ±2, . . .} and jump probabilities $p_{ij} = P(\xi_1 = j - i) = p_{j-i}$. It is also a Lévy process. It is called a compound Poisson process with jump rate α and jump distribution $\{p_j\}$. If $p_1 \equiv 1$ this reduces to the Poisson process case.
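A short simulation sketch of (1.14) may help fix ideas. It draws the exponential waiting times, counts the jumps that occur by time t, and adds one jump value per event. The function name and the ±1 jump law used in the usage note are assumptions made for illustration only.

```python
import random

def compound_poisson(alpha, jump_sampler, t, rng):
    """One draw of Y(t) in (1.14): the sum of the jumps xi_i occurring by time t.

    alpha is the jump rate; jump_sampler(rng) returns one integer jump.
    """
    clock = rng.expovariate(alpha)       # L_0, the time of the first jump
    total = 0
    while clock <= t:
        total += jump_sampler(rng)       # add xi_i at the i-th jump time
        clock += rng.expovariate(alpha)  # wait L_i until the next jump
    return total
```

With jumps equal to +1 with probability 0.7 and −1 with probability 0.3, the average of many draws of Y(3) at rate α = 2 should be near α t E ξ1 = 2.4 (a standard identity for compound Poisson sums, not derived in the text). With all jumps equal to 1 the draws are Poisson counts with mean αt.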

15.1.4 Limit theorems

To investigate what happens to pij (t) ≡ P (X(t) = j | X(0) = i) as t → ∞, one needs to assume that the embedded chain {yn }n≥0 is irreducible and recurrent. This implies that for any i0 the random variable T = min{t : t > L0 , X(t) = i0 } is ﬁnite w.p. 1. Further, the process, starting from i0 , returns to i0 inﬁnitely often and hence by the Markov property is regenerative in the sense that the excursions between consecutive returns to i0 are iid. One can use this, laws of large numbers and renewal theory (cf. Section 8.5) to arrive at the following:


Theorem 15.1.1: Let $P = \{p_{ij}\}$ be irreducible and recurrent and $0 < \lambda_i < \infty$ for all i in S. Let there exist a probability distribution $\{\pi_i\}$ such that
$$\sum_{j \in S} \pi_j a_{ji} = 0 \quad \text{for all } i \tag{1.15}$$
where
$$a_{ij} = \begin{cases} \lambda_i p_{ij} & i \ne j \\ -\lambda_i & i = j. \end{cases}$$
Then

(i) $\{\pi_j\}$ is stationary for $\{p_{ij}(t)\}$, i.e., $\sum_{i \in S} \pi_i p_{ij}(t) = \pi_j$ for all j and all t ≥ 0,

(ii) for all i, j
$$\lim_{t \to \infty} p_{ij}(t) = \pi_j, \tag{1.16}$$
and hence $\{\pi_j\}$ is the unique stationary distribution,

(iii) for any function h : S → R such that $\sum_{j \in S} |h(j)| \pi_j < \infty$,
$$\lim_{t \to \infty} \frac{1}{t} \int_0^t h\big(X(u)\big)\, du = \sum_{j \in S} h(j) \pi_j \quad \text{w.p. 1} \tag{1.17}$$
for any initial distribution of X(0).

Note that (1.16) holds without any assumption of aperiodicity on $P \equiv ((p_{ij}))$. A sufficient condition for a probability distribution $\{\pi_j\}$ to be a stationary distribution is the so-called detailed balance condition
$$\pi_k a_{kj} = \pi_j a_{jk} \quad \text{for all } k, j. \tag{1.18}$$
One can use this for birth and death chains on a finite state space S ≡ {0, 1, 2, . . . , N}, N < ∞ to conclude that the stationary distribution is given by
$$\pi_n = \frac{\alpha_{n-1} \alpha_{n-2} \cdots \alpha_0}{\beta_n \beta_{n-1} \cdots \beta_1}\, \pi_0 \tag{1.19}$$
provided $\alpha_i > 0$ for all 0 ≤ i ≤ N − 1, $\beta_i > 0$ for all 1 ≤ i ≤ N and $\alpha_N = 0$, $\beta_0 = 0$. A necessary and sufficient condition for equilibrium, i.e., the existence of a stationary distribution, when N = ∞ is
$$\sum_{n=1}^{\infty} \frac{\alpha_{n-1} \cdots \alpha_0}{\beta_n \cdots \beta_1} < \infty. \tag{1.20}$$
This yields in the M/M/s case with arrival rate α and service rate β (i.e., $\alpha_i \equiv \alpha$ for i ≥ 0, $\beta_i \equiv i\beta$ for 1 ≤ i ≤ s, and $\beta_i = s\beta$ for i > s) the necessary and sufficient condition for equilibrium that the traffic intensity
$$\rho \equiv \frac{\alpha}{s\beta} < 1, \tag{1.21}$$
i.e., that the mean number of arrivals per unit time be less than the mean number of persons served per unit time. For further discussion and results, see the books of Karlin and Taylor (1975) and Durrett (2001).
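The product formula (1.19) is easy to evaluate numerically for the M/M/s queue. The sketch below truncates the state space at a hypothetical level n_max (an assumption needed to normalize on a computer; it is harmless when ρ < 1 and n_max is large) and returns the normalized probabilities.

```python
def mms_stationary(alpha, beta, s, n_max):
    """Stationary probabilities pi_0, ..., pi_{n_max} for M/M/s via (1.19).

    Birth rates are alpha_n = alpha; death rates are beta_n = n * beta for
    n <= s and s * beta for n > s. The infinite state space is truncated at
    n_max, which is an approximation made only for this sketch.
    """
    weights = [1.0]                          # ratio pi_n / pi_0 for n = 0
    for n in range(1, n_max + 1):
        death_rate = min(n, s) * beta        # beta_n
        weights.append(weights[-1] * alpha / death_rate)
    total = sum(weights)
    return [w / total for w in weights]
```

For s = 1, α = 1, β = 2 (so ρ = 1/2) the output is close to the geometric law π_n = (1 − ρ)ρ^n, and for s = 2, α = β = 1 one gets π_0 close to 1/3.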

15.2 Brownian motion

Definition 15.2.1: A real valued stochastic process {B(t) : t ≥ 0} is called standard Brownian motion (SBM) if it satisfies

(i) B(0) = 0,

(ii) B(t) has N(0, t) distribution, for each t ≥ 0,

(iii) it is a Lévy process, i.e., it has stationary independent increments.

It follows that {B(t) : t ≥ 0} is a Gaussian process (i.e., the finite dimensional distributions are Gaussian) with mean function m(t) ≡ 0 and covariance function c(s, t) = min(s, t). It can be shown that the trajectories are continuous w.p. 1. Thus, Brownian motion is a Gaussian process, has continuous trajectories and has stationary independent increments (and hence is Markovian). These features make it a very useful building block for modeling many real world phenomena such as the movement of pollen (which was studied by the English botanist Robert Brown, hence the name Brownian motion), the movement of a tagged particle in a liquid subject to the bombardment of the molecules of the liquid (studied by Einstein and Smoluchowski), and the fluctuations in stock market prices (studied by the French economist Bachelier).

15.2.1 Construction of SBM

Let $\{\eta_i\}_{i \ge 1}$ be iid N(0, 1) random variables on some probability space (Ω, F, P). Let $\{\phi_i(\cdot)\}_{i \ge 1}$ be the sequence of Haar functions on [0, 1] defined by the doubly indexed collection
$$H_{00}(t) \equiv 1, \qquad H_{11}(t) = \begin{cases} 1 & \text{on } [0, \tfrac12) \\ -1 & \text{on } [\tfrac12, 1] \end{cases}$$
and for n ≥ 1
$$H_{n,j}(t) = \begin{cases} 2^{\frac{n-1}{2}} & \text{for } t \text{ in } \big[\frac{j-1}{2^n}, \frac{j}{2^n}\big) \\ -2^{\frac{n-1}{2}} & \text{for } t \text{ in } \big[\frac{j}{2^n}, \frac{j+1}{2^n}\big) \\ 0 & \text{otherwise} \end{cases} \qquad j = 1, 3, \ldots, 2^n - 1.$$
It is known that this family is a complete orthonormal basis for $L^2([0, 1])$. Let
$$B_N(t, \omega) \equiv \sum_{i=1}^{N} \eta_i(\omega) \int_0^t \phi_i(u)\, du. \tag{2.1}$$
Then, for each N, $\{B_N(t, \omega) : 0 \le t \le 1\}$ is a Gaussian process on (Ω, F, P) with mean function $m_N(t) \equiv 0$ and covariance function
$$c_N(s, t) = \sum_{i=1}^{N} \Big(\int_0^t \phi_i(u)\, du\Big) \Big(\int_0^s \phi_i(u)\, du\Big)$$
and the property that the trajectories $t \to B_N(t, \omega)$ are continuous in t for each ω in Ω. It can be shown (Problem 15.11) that w.p. 1 the sequence $\{B_N(\cdot, \omega)\}_{N \ge 1}$ is a Cauchy sequence in the Banach space C[0, 1] of continuous real valued functions on [0, 1] with supremum metric. Hence, $\{B_N(\cdot, \omega)\}_{N \ge 1}$ converges w.p. 1 to a limit element B(·, ω) which will be a Gaussian process with continuous trajectories and mean and covariance functions m(t) ≡ 0 and
$$c(s, t) = \sum_{i=1}^{\infty} \Big(\int_0^t \phi_i(u)\, du\Big) \Big(\int_0^s \phi_i(u)\, du\Big) = \int_0^1 I_{[0,t]}(u)\, I_{[0,s]}(u)\, du = \min(s, t),$$
respectively. (See Section 2.3 of Karatzas and Shreve (1991).) Thus,
$$B(t, \omega) \equiv \sum_{i=1}^{\infty} \eta_i(\omega) \int_0^t \phi_i(u)\, du \tag{2.2}$$
is a well-defined stochastic process for 0 ≤ t ≤ 1 that has all the properties claimed above and is called SBM on [0,1]. Let $\{B^{(j)}(t, \omega) : 0 \le t \le 1\}_{j \ge 1}$ be iid copies of {B(t, ω) : 0 ≤ t ≤ 1} as defined above. Now set
$$B(t, \omega) \equiv \begin{cases} B^{(1)}(t, \omega), & 0 \le t \le 1 \\ B^{(1)}(1, \omega) + B^{(2)}(t - 1, \omega), & 1 \le t \le 2 \\ \ \vdots \\ B(n, \omega) + B^{(n+1)}(t - n, \omega), & n \le t \le n + 1,\ n = 1, 2, \ldots \end{cases} \tag{2.3}$$
Then {B(t, ω) : t ≥ 0} satisfies

(i) B(0, ω) = 0,

(ii) t → B(t, ω) is continuous in t for all ω,

(iii) it is Gaussian with mean function m(t) ≡ 0 and covariance function c(s, t) ≡ min(s, t),

i.e., it is SBM on [0, ∞). From now on the symbol ω may be suppressed.
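The covariance identity behind this construction can be checked numerically. The sketch below integrates the Haar functions in closed form (the resulting tent-shaped functions are often called Schauder functions), sums the products over all Haar functions up to a given dyadic level, and also draws one partial-sum path as in (2.1). The names `schauder`, `partial_covariance`, `sample_partial_sum` and the parameter `levels` are illustrative choices made here; `levels` counts dyadic levels, so it is not the same index as N in (2.1).

```python
import random

def schauder(n, j, t):
    """Integral over [0, t] of the Haar function H_{n,j} (n >= 1, j odd)."""
    left, mid, right = (j - 1) / 2**n, j / 2**n, (j + 1) / 2**n
    height = 2 ** ((n - 1) / 2)
    if t <= left:
        return 0.0
    if t <= mid:
        return height * (t - left)   # rising part of the tent
    if t <= right:
        return height * (right - t)  # falling part of the tent
    return 0.0

def partial_covariance(s, t, levels):
    """Sum of products of integrated Haar functions up to dyadic level `levels`."""
    total = s * t                    # contribution of H_00, whose integral is t
    for n in range(1, levels + 1):
        for j in range(1, 2**n, 2):
            total += schauder(n, j, s) * schauder(n, j, t)
    return total

def sample_partial_sum(levels, rng):
    """One draw of the partial sum path t -> B_N(t) using Haar levels 0..levels."""
    eta0 = rng.gauss(0.0, 1.0)
    coeffs = {(n, j): rng.gauss(0.0, 1.0)
              for n in range(1, levels + 1) for j in range(1, 2**n, 2)}
    def path(t):
        return eta0 * t + sum(c * schauder(n, j, t) for (n, j), c in coeffs.items())
    return path
```

At dyadic points such as s = 0.25, t = 0.75 the partial covariance already equals min(s, t) once the level is fine enough to resolve both points; at other points it approaches min(s, t) as the number of levels grows.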

15.2.2 Basic properties of SBM

(i) Scaling properties. Fix c > 0 and set
$$B_c(t) \equiv \frac{1}{\sqrt{c}}\, B(ct), \quad t \ge 0. \tag{2.4}$$
Then $\{B_c(t)\}_{t \ge 0}$ is also an SBM. This is easily verified by noting that $B_c(0) = 0$, $B_c(t) \sim N(0, t)$, $\mathrm{Cov}(B_c(t), B_c(s)) = \frac{1}{c} \min\{ct, cs\} = \min(t, s)$ and that $\{B_c(\cdot)\}$ is a Lévy process and the trajectories are continuous w.p. 1.

(ii) Reflection. If {B(·)} is SBM, then so is {−B(·)}. This follows from the symmetry of the mean zero Gaussian distribution.

(iii) Time inversion. Let
$$\tilde B(t) = \begin{cases} t B(\tfrac{1}{t}) & \text{for } t > 0 \\ 0 & \text{for } t = 0. \end{cases} \tag{2.5}$$
Then $\{\tilde B(t) : t \ge 0\}$ is also an SBM. The facts that $\{\tilde B(t) : t > 0\}$ is a Gaussian process with mean and covariance function same as SBM and the trajectories are continuous in the open interval (0, ∞) are straightforward to verify. It only remains to verify that $\lim_{t \to 0} \tilde B(t) = 0$ w.p. 1. Fix $0 < t_1 < t_2$. Then $\{\tilde B(t) : t_1 \le t \le t_2\}$ is a Gaussian process with mean function 0 and covariance function min(s, t) and has continuous trajectories, i.e., it has the same distribution as $\{B(t) : t_1 \le t \le t_2\}$. Thus $\tilde X_1 \equiv \sup\{|\tilde B(t)| : t_1 \le t \le t_2\}$ has the same distribution as $X_1 \equiv \sup\{|B(t)| : t_1 \le t \le t_2\}$. Since both converge as $t_1 \downarrow 0$ to $\tilde X_2(t_2) \equiv \sup\{|\tilde B(t)| : 0 < t \le t_2\}$ and $X_2(t_2) \equiv \sup\{|B(t)| : 0 < t \le t_2\}$, respectively, these two have the same distribution. Again, since $\tilde X_2(t_2)$ and $X_2(t_2)$ both converge as $t_2 \downarrow 0$ to $\tilde X_2 \equiv \limsup_{t \downarrow 0} |\tilde B(t)|$ and $X_2 \equiv \limsup_{t \downarrow 0} |B(t)|$, respectively, $\tilde X_2$ and $X_2$ have the same distribution. But $X_2 = 0$ w.p. 1 since B(t) is continuous in [0, ∞). Thus $\tilde X_2 = 0$ w.p. 1, i.e., $\lim_{t \to 0} \tilde B(t) = 0$ w.p. 1.

(iv) Translation invariance (after a fixed time $t_0$). Fix $t_0 > 0$ and set
$$B_{t_0}(t) = B(t + t_0) - B(t_0), \quad t \ge 0. \tag{2.6}$$

Then $\{B_{t_0}(t)\}_{t \ge 0}$ is also an SBM. This follows from the stationary independent increments property.

(v) Translation invariance (after a stopping time $T_0$). A random variable T(ω) with values in [0, ∞) is called a stopping time w.r.t. the SBM {B(t) : t ≥ 0} if for each t in [0, ∞) the event {T ≤ t} is in the σ-algebra $F_t \equiv \sigma(B(s) : s \le t)$ generated by the trajectory B(s) for 0 ≤ s ≤ t. Examples of stopping times are
$$T_a = \min\{t : t \ge 0,\ B(t) \ge a\} \tag{2.7}$$
for 0 < a < ∞, and
$$T_{a,b} = \min\{t : t > 0,\ B(t) \notin (a, b)\} \tag{2.8}$$
where a < 0 < b. Let $T_0$ be a stopping time w.r.t. SBM {B(t) : t ≥ 0}. Let
$$B_{T_0}(t) \equiv B(T_0 + t) - B(T_0), \quad t \ge 0. \tag{2.9}$$

Then $\{B_{T_0}(t)\}_{t \ge 0}$ is again an SBM. Here is an outline of the proof.

(a) $T_0$ deterministic is covered by (iv) above.

(b) If $T_0$ takes only countably many values, say $\{a_j\}_{j \ge 1}$, then it is not difficult to show that conditioned on the event $T_0 = a_j$, the process $B_{T_0}(t) \equiv B(T_0 + t) - B(T_0)$ is SBM. Thus the unconditional distribution of $\{B_{T_0}(t) : t \ge 0\}$ is again that of an SBM.

(c) Next, given a general stopping time $T_0$, one can approximate it by a sequence $T_n$ of stopping times where for each n, $T_n$ is discrete. By continuity of trajectories, $\{B_{T_0}(t) : t \ge 0\}$ has the same distribution as the limit of $\{B_{T_n}(t) : t \ge 0\}$ as n → ∞.

A consequence of the above two properties is that SBM has the Markov and the strong Markov properties. That is, for each fixed $t_0$, the distribution of $B(t), t \ge t_0$ given $B(s) : s \le t_0$ depends only on $B(t_0)$ (Markov property) and for each stopping time $T_0$, the distribution of $B(t) : t \ge T_0$ given $B(s) : s \le T_0$ depends only on $B(T_0)$ (strong Markov property).

(vi) The reflection principle. Fix a > 0 and let $T_a = \inf\{t : B(t) \ge a\}$ where {B(t) : t ≥ 0} is SBM. For any t > 0, a > 0,
$$P(T_a \le t) = P(T_a \le t,\ B(t) > a) + P(T_a \le t,\ B(t) < a).$$
Now, by continuity of the trajectory, $B(T_a) = a$ on $\{T_a \le t\}$. Thus
$$P\big(T_a \le t,\ B(t) < a\big) = P\big(T_a \le t,\ B(t) < B(T_a)\big) = P\big(T_a \le t,\ B(t) - B(T_a) < 0\big) = P\big(T_a \le t,\ B(t) - B(T_a) > 0\big) = P\big(T_a \le t,\ B(t) > a\big).$$
To see this note that by (v), $\{B(T_a + h) - B(T_a) : h \ge 0\}$ is independent of $T_a$ and has the same distribution as an SBM and hence $\{-(B(T_a + h) - B(T_a)) : h \ge 0\}$ is also independent of $T_a$ and has the same distribution as an SBM. Thus, since $\{B(t) > a\} \subset \{T_a \le t\}$,
$$P(T_a \le t) = 2P\big(T_a \le t,\ B(t) > a\big) = 2P\big(B(t) > a\big) = 2\Big(1 - \Phi\Big(\frac{a}{\sqrt{t}}\Big)\Big) \tag{2.10}$$

where Φ(·) is the standard N(0, 1) cdf. The above argument is known as the reflection principle as it asserts that the path
$$\tilde B(t) \equiv \begin{cases} B(t), & t \le T_a \\ B(T_a) - \big(B(t) - B(T_a)\big), & t > T_a \end{cases} \tag{2.11}$$
obtained by reflecting the original path on the line y = a from the point $(T_a, a)$ for $t > T_a$ yields a path that has the same distribution as the original path. Thus the probability density function of $T_a$ is
$$f_{T_a}(t) = 2\phi\Big(\frac{a}{\sqrt{t}}\Big)\, \frac{a}{2}\, \frac{1}{t^{3/2}} = \frac{a}{\sqrt{2\pi}}\, e^{-\frac{a^2}{2t}}\, \frac{1}{t^{3/2}} \tag{2.12}$$
implying that $E T_a^p < \infty$ for p < 1/2 and $E T_a^p = \infty$ for p ≥ 1/2. Also, by the strong Markov property the process $\{T_a : a \ge 0\}$ is a process with stationary independent increments, i.e., a Lévy process. It is also a stable process of order 1/2. One can use this calculation of $P(T_a \le t)$ to show that the probability that the SBM crosses zero in the interval $(t_1, t_2)$ is $\frac{2}{\pi} \arccos \sqrt{t_1/t_2} = 1 - \frac{2}{\pi} \arcsin \sqrt{t_1/t_2}$ (Problem 15.12). If M(t) ≡ sup{B(s) : 0 ≤ s ≤ t} then for a > 0
$$P\big(M(t) > a\big) = P(T_a \le t) = 2P\big(B(t) > a\big) = P\big(|B(t)| > a\big). \tag{2.13}$$
It follows that M(t) has the same distribution as |B(t)| and hence has finite moments of all orders. In fact,
$$E\big(e^{\theta M(t)}\big) < \infty \quad \text{for all } \theta > 0.$$
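As a quick numerical sanity check of (2.10) and (2.12), one can code the distribution function and the density of $T_a$ and confirm that a finite-difference derivative of the former matches the latter. The helper names below are illustrative; only the standard library error function is used.

```python
import math

def Phi(x):
    """Standard normal cdf, written via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def hitting_cdf(a, t):
    """P(T_a <= t) as given by the reflection principle (2.10)."""
    return 2.0 * (1.0 - Phi(a / math.sqrt(t)))

def hitting_density(a, t):
    """Density of T_a as given in (2.12)."""
    return a / math.sqrt(2.0 * math.pi) * t ** -1.5 * math.exp(-a * a / (2.0 * t))

def central_difference(f, t, h=1e-5):
    """Symmetric finite-difference approximation to f'(t)."""
    return (f(t + h) - f(t - h)) / (2.0 * h)
```

For example, `central_difference(lambda t: hitting_cdf(1.0, t), 2.0)` and `hitting_density(1.0, 2.0)` should agree to several decimal places.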

15.2.3 Some related processes

(i) Let {B(t) : t ≥ 0} be an SBM. For µ in (−∞, ∞) and σ > 0, the process $B_{\mu,\sigma}(t) \equiv \mu t + \sigma B(t)$, t ≥ 0 is called Brownian motion with constant drift µ and constant diffusion σ.

(ii) Let $B_0(t) = B(t) - tB(1)$, 0 ≤ t ≤ 1. The process $\{B_0(t) : 0 \le t \le 1\}$ is called the Brownian bridge. It is a Gaussian process with mean function 0 and covariance min(s, t) − st and has continuous trajectories that vanish both at 0 and 1.

(iii) Let $Y(t) = e^{-t} B(e^{2t})$, −∞ < t < ∞. Then {Y(t) : −∞ < t < ∞} is a Gaussian process with mean function 0 and covariance function $c(s, t) = e^{-(s+t)} e^{2s} = e^{s-t}$ if s < t. This process is called the Ornstein-Uhlenbeck process. It is to be noted that for each t, Y(t) ∼ N(0, 1) and in fact {Y(t) : −∞ < t < ∞} is a strictly stationary process and is a Markov process as well.

15.2.4 Some limit theorems

Let $\{\xi_i\}_{i \ge 1}$ be iid random variables with $E\xi_1 = 0$, $E\xi_1^2 = 1$. Let $S_0 = 0$, $S_n = \sum_{i=1}^{n} \xi_i$, n ≥ 1. Let $B_n(j/n) = \frac{1}{\sqrt{n}} S_j$, j = 0, 1, 2, . . . , n and let $\{B_n(t) : 0 \le t \le 1\}$ be obtained by linear interpolation from the values at j/n for j = 0, 1, 2, . . . , n. Then for each n, $\{B_n(t) : 0 \le t \le 1\}$ is a random continuous trajectory and hence is a random element of the metric space of continuous real valued functions on [0,1] that are zero at zero with the metric
$$\rho(f, g) \equiv \sup\{|f(t) - g(t)| : 0 \le t \le 1\}. \tag{2.14}$$
Let $\mu_n \equiv P B_n^{-1}$ be the induced probability measure on C[0, 1]. The following is a generalization of the central limit theorem as noted in Chapter 11.

Theorem 15.2.1: (Donsker). In the space (C[0, 1], ρ) the sequence of probability measures $\{\mu_n \equiv P B_n^{-1}\}_{n \ge 1}$ converges weakly to µ, the probability distribution of the SBM. That is, for any bounded continuous function h from C[0, 1] to R, $\int h\, d\mu_n \to \int h\, d\mu$.

For a proof, see Billingsley (1968).

Corollary 15.2.2: For any continuous functional T on (C[0, 1], ρ) to $R^k$, k < ∞, the distribution of $T(B_n)$ converges to that of T(B). In particular, the joint distribution of $\big(\max_{0 \le j \le n} \frac{S_j}{\sqrt{n}},\ \max_{0 \le j \le n} \frac{|S_j|}{\sqrt{n}}\big)$ converges weakly to that of $\big(\max_{0 \le u \le 1} B(u),\ \max_{0 \le u \le 1} |B(u)|\big)$.

There are similar limit theorems asserting the convergence of the empirical processes to the Brownian bridge with applications to the limit distribution of the Kolmogorov-Smirnov statistics (see Billingsley (1968)).

Theorem 15.2.3: (Laws of large numbers).
$$\lim_{t \to \infty} \frac{B(t)}{t} = 0 \quad \text{w.p. 1.} \tag{2.15}$$

Proof: By the time inversion property (2.5),
$$\lim_{t \to 0} \tilde B(t) = 0 \quad \text{w.p. 1.}$$
But
$$\lim_{t \to 0} \tilde B(t) = \lim_{t \to 0} tB(1/t) = \lim_{\tau \to \infty} \frac{B(\tau)}{\tau}. \qquad \Box$$

Theorem 15.2.4: (Kallianpur-Robbins). Let f : R → R be integrable with respect to Lebesgue measure. Then
$$\frac{1}{\sqrt{t}} \int_0^t f(B(u))\, du \longrightarrow^d \Big(\int_{-\infty}^{\infty} f(u)\, du\Big) Z \tag{2.16}$$
where Z is a random variable with density $\frac{1}{\pi \sqrt{z(1-z)}}$ in [0,1].

This is a special case of the Darling-Kac formula for Markov processes that can be established here using the regenerative property of SBM due to the fact that starting from 0, SBM will hit level 1 at some time $T_1$ and from there hit level 0 at a later time $\tau_1$. This can be repeated to produce a sequence of times $0, \tau_1, \tau_2, \ldots$ such that the excursions $\{B(t) : \tau_i \le t < \tau_{i+1}\}_{i \ge 1}$ are iid. The sequence $\{\tau_i\}_{i \ge 1}$ is a renewal sequence with lifetime distribution (that of $\tau_1$) having a regularly varying tail of order 1/2 and hence infinite mean. One can appeal now to results from renewal theory to complete the proof (see Feller (1966) and Athreya (1986)).

15.2.5 Some sample path properties of SBM

The sample paths t → B(t, ω) of the SBM are continuous w.p. 1. It turns out that they are not any smoother than this. For example, they are not differentiable nor are they of bounded variation on finite intervals. It will be shown now that w.p. 1 Brownian sample paths are not differentiable anywhere and that the quadratic variation over any finite interval is finite and nonrandom. (See also Karatzas and Shreve (1991).)

(i) Nondifferentiability of B(·, ω) in [0,1]. Let
$$A_{n,k} = \Big\{\omega : \sup_{0 < |t-s| \le 3/n} \frac{|B(t,\omega) - B(s,\omega)|}{|t-s|} \le k \ \text{for some } 0 \le s \le 1\Big\}.$$
Let $Z_{r,n} = \big|B\big(\frac{r+1}{n}\big) - B\big(\frac{r}{n}\big)\big|$, r = 0, 1, 2, . . . , n − 1. Let
$$B_{n,k} = \Big\{\omega : \max\big(Z_{r,n}, Z_{r+1,n}, Z_{r+2,n}\big) \le \frac{6k}{n} \ \text{for some } r\Big\}.$$
It can be verified that $A_{n,k} \subset B_{n,k}$. Now
$$P(B_{n,k}) \le \sum_{r=0}^{n-1} P\Big(\max\big(Z_{r,n}, Z_{r+1,n}, Z_{r+2,n}\big) \le \frac{6k}{n}\Big) \le n \Big(P\Big(|Z_{0,n}| \le \frac{6k}{n}\Big)\Big)^3 = n \Big(P\Big(\frac{|Z_{0,n}|}{1/\sqrt{n}} \le \frac{6k}{\sqrt{n}}\Big)\Big)^3 \le n \Big(\frac{\text{Const}}{\sqrt{n}}\Big)^3 \to 0 \quad \text{as } n \to \infty,$$
since $\frac{Z_{0,n}}{1/\sqrt{n}}$ is distributed as the absolute value of an N(0, 1) random variable. Thus for each k < ∞, $P(A_{n,k}) \le \frac{\text{Const}}{\sqrt{n}}$. This implies
$$\sum_{n=1}^{\infty} P(A_{n^3,k}) < \infty.$$

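So by the Borel-Cantelli lemma, only finitely many $A_{n^3,k}$ can happen w.p. 1. The event A ≡ {ω : B(t, ω) is differentiable for at least one t in [0, 1]} is contained in $C \equiv \bigcup_{k \ge 1} \{\omega : \omega \in A_{n^3,k} \text{ for infinitely many } n\}$ and so P(A) ≤ P(C) = 0.

(ii) Finite quadratic variation of SBM. Let $\eta_{n,j} = B(j 2^{-n}) - B\big((j-1) 2^{-n}\big)$, j = 1, 2, . . . , $2^n$, and
$$\Delta_n \equiv \sum_{j=1}^{2^n} \eta_{n,j}^2. \tag{2.17}$$
Then
$$E\Delta_n = \sum_{j=1}^{2^n} \frac{1}{2^n} = 1.$$
Also by independence and stationarity of increments
$$\mathrm{Var}(\Delta_n) = \sum_{j=1}^{2^n} \mathrm{Var}(\eta_{n,j}^2) \le \sum_{j=1}^{2^n} E\eta_{n,j}^4 = \sum_{j=1}^{2^n} \frac{3}{2^{2n}} = \frac{3}{2^n}.$$
Thus $P(|\Delta_n - 1| > \epsilon) \le \frac{\mathrm{Var}(\Delta_n)}{\epsilon^2}$ for any ε > 0. Since the right side is summable in n, this implies, by Borel-Cantelli, $\Delta_n \to 1$ w.p. 1. By definition the quadratic variation is
$$\Delta \equiv \sup\Big\{\sum_{j=1}^{n} \big|B(t_j, \omega) - B(t_{j-1}, \omega)\big|^2 : \text{all finite partitions } (t_0, t_1, \ldots, t_n) \text{ of } [0, 1]\Big\}. \tag{2.18}$$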


It is easy to verify that ∆ = limn ∆n . Thus ∆ = 1 w.p. 1. It follows that w.p. 1 the Brownian motion paths are not of bounded variation. By the scaling property of SBM, it follows that the quadratic variation of SBM over [0, t] is t w.p. 1 for any t > 0.
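The convergence $\Delta_n \to 1$ can be observed numerically. The following is a minimal sketch, not part of the text's argument: it draws the $2^n$ independent $N(0, 2^{-n})$ increments of SBM over the dyadic grid of $[0,1]$ and sums their squares. The function name `dyadic_quadratic_variation` and the seed are illustrative choices.

```python
import math
import random


def dyadic_quadratic_variation(n, rng):
    """Return Delta_n: the sum of squared SBM increments over the 2**n dyadic cells of [0, 1]."""
    cells = 2 ** n
    sd = math.sqrt(1.0 / cells)  # each increment is N(0, 2**-n)
    return sum(rng.gauss(0.0, sd) ** 2 for _ in range(cells))


rng = random.Random(12345)  # fixed seed, chosen arbitrarily for reproducibility
quad_var = {n: dyadic_quadratic_variation(n, rng) for n in (4, 8, 12, 14)}
for n, value in quad_var.items():
    print(n, round(value, 4))
```

Since $E\Delta_n = 1$ and the variance of $\Delta_n$ is of order $2^{-n}$, the printed values should settle near 1 as $n$ grows.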

15.2.6 Brownian motion and martingales

There are three natural martingales associated with Brownian motion.

Theorem 15.2.5: Let $\{B(t) : t \ge 0\}$ be SBM. Then

(i) (Linear martingale) $\{B(t) : t \ge 0\}$ is a martingale.

(ii) (Quadratic martingale) $\{B^2(t) - t : t \ge 0\}$ is a martingale.

(iii) (Exponential martingale) For any $\theta$ real, $\{e^{\theta B(t) - \frac{\theta^2}{2}t} : t \ge 0\}$ is a martingale.

Proof: (i) and (ii). Since $B(t) \sim N(0,t)$, $E|B(t)| < \infty$ and $E|B(t)|^2 < \infty$. By the stationary independent increments property, for any $t \ge 0$, $s \ge 0$,
$$E\big(B(t+s) \mid B(u) : u \le t\big) = E\big(B(t+s) - B(t) \mid B(u) : u \le t\big) + B(t) = 0 + B(t) = B(t),$$
establishing (i). Next,
$$\begin{aligned}
E\big(B^2(t+s) \mid B(u) : u \le t\big)
&= E\big((B(t+s) - B(t))^2 \mid B(u) : u \le t\big) + B^2(t) \\
&\quad + 2E\big((B(t+s) - B(t))B(t) \mid B(u) : u \le t\big) \\
&= s + B^2(t) + 0
\end{aligned}$$
and hence $E\big(B^2(t+s) - (t+s) \mid B(u) : u \le t\big) = B^2(t) - t$, establishing (ii).

(iii)
$$E\big(e^{\theta B(t+s) - \frac{\theta^2}{2}(t+s)} \mid B(u) : u \le t\big) = E\big(e^{\theta(B(t+s) - B(t)) - \frac{\theta^2}{2}s} \mid B(u) : u \le t\big)\, e^{\theta B(t) - \frac{\theta^2}{2}t}.$$
Again using the fact that $B(t+s) - B(t)$ given $\{B(u) : u \le t\}$ is $N(0,s)$, whose moment generating function at $\theta$ is $e^{\theta^2 s/2}$, the first term on the right side becomes 1, proving (iii). □


15.2.7 Some applications

The martingales in Theorem 15.2.5 combined with the optional stopping theorems of Chapter 13 yield the following applications.

(i) Exit probabilities

Let $B(\cdot)$ be SBM. Fix $a < 0 < b$. Let $T_{a,b} = \min\{t : t > 0, B(t) = a \text{ or } b\}$. From (i) of Theorem 15.2.5 and the optional sampling theorem, for any $t > 0$,
$$EB(T_{a,b} \wedge t) = EB(0) = 0. \tag{2.19}$$
Also, by continuity, $B(T_{a,b} \wedge t) \to B(T_{a,b})$ as $t \to \infty$. By the bounded convergence theorem, this implies
$$EB(T_{a,b}) = 0, \tag{2.20}$$
i.e., $ap + b(1-p) = 0$ where $p = P(T_a < T_b) = P(B(\cdot)$ reaches $a$ before $b)$. Thus, $p = \frac{b}{b-a}$.

(ii) Mean exit time

From (ii) of Theorem 15.2.5 and the optional sampling theorem, $E\big(B^2(T_{a,b} \wedge t) - (T_{a,b} \wedge t)\big) = 0$, i.e.,
$$EB^2(T_{a,b} \wedge t) = E(T_{a,b} \wedge t). \tag{2.21}$$
By using the bounded convergence theorem on the left and the monotone convergence theorem on the right, one may conclude $EB^2(T_{a,b}) = ET_{a,b}$, i.e.,
$$ET_{a,b} = pa^2 + (1-p)b^2 = a^2\frac{b}{b-a} + b^2\frac{(-a)}{b-a} = -ab. \tag{2.22}$$

(iii) The distribution of $T_{a,b}$

From (iii) of Theorem 15.2.5 and the optional sampling theorem,
$$E\big(e^{\theta B(T_{a,b} \wedge t) - \frac{\theta^2}{2}(T_{a,b} \wedge t)}\big) = 1.$$
By the bounded convergence theorem, this implies
$$E\big(e^{\theta B(T_{a,b}) - \frac{\theta^2}{2}T_{a,b}}\big) = 1. \tag{2.23}$$


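In particular, if $b = -a$ this reduces to
$$\begin{aligned}
1 &= E\big(e^{\theta a - \frac{\theta^2}{2}T_{a,-a}} : T_a < T_{-a}\big) + E\big(e^{-\theta a} e^{-\frac{\theta^2}{2}T_{a,-a}} : T_a > T_{-a}\big) \\
&= \frac{1}{2}\big(e^{\theta a} + e^{-\theta a}\big) E\big(e^{-\frac{\theta^2}{2}T_{a,-a}}\big),
\end{aligned}$$
since by symmetry
$$E\big(e^{-\frac{\theta^2}{2}T_{a,-a}} : T_a < T_{-a}\big) = E\big(e^{-\frac{\theta^2}{2}T_{a,-a}} : T_a > T_{-a}\big).$$
Thus, for $\lambda \ge 0$,
$$E\big(e^{-\lambda T_{a,-a}}\big) = 2\big(e^{\sqrt{2\lambda}\,a} + e^{-\sqrt{2\lambda}\,a}\big)^{-1}. \tag{2.24}$$
Similarly, it can be shown that for $\lambda \ge 0$, $a > 0$,
$$E\big(e^{-\lambda T_a}\big) = e^{-\sqrt{2\lambda}\,a}. \tag{2.25}$$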
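The closed forms in (i)-(iii) are easy to evaluate and cross-check. The sketch below is illustrative only (the function names are not from the text): it computes the exit probability $p = b/(b-a)$ and the mean exit time $pa^2 + (1-p)b^2 = -ab$, and checks that the Laplace transform (2.24) is consistent with the mean exit time, since $-\frac{d}{d\lambda}E e^{-\lambda T_{a,-a}}$ at $\lambda = 0$ should equal $ET_{a,-a} = a^2$.

```python
import math


def exit_probability(a, b):
    """P(B reaches a before b) for a < 0 < b, from a*p + b*(1 - p) = 0."""
    return b / (b - a)


def mean_exit_time(a, b):
    """E T_{a,b} = p*a**2 + (1 - p)*b**2, which simplifies to -a*b."""
    p = exit_probability(a, b)
    return p * a * a + (1.0 - p) * b * b


def laplace_symmetric_exit(lam, a):
    """E exp(-lam * T_{a,-a}) as in (2.24)."""
    s = math.sqrt(2.0 * lam) * a
    return 2.0 / (math.exp(s) + math.exp(-s))


a, b = -1.0, 3.0
p = exit_probability(a, b)
mean_t = mean_exit_time(a, b)

# Finite-difference slope of (2.24) at lam = 0 with a = 2; it should be close to a**2 = 4.
h = 1e-6
slope = (1.0 - laplace_symmetric_exit(h, 2.0)) / h
print(p, mean_t, slope)
```

With $a = -1$, $b = 3$ the exit probability is $3/4$ and the mean exit time is $3 = -ab$.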

15.2.8 The Black-Scholes formula for stock price option

Let $X(t)$ denote the price of one unit of a stock S at time $t$. Due to fluctuations in the market place, it is natural to postulate that $\{X(t) : t \ge 0\}$ is a stochastic process. To build an appropriate model, consider the discrete time case first. If $X_n$ denotes the unit price at time $n$, it is natural to postulate that $X_{n+1} = X_n y_{n+1}$ where $y_{n+1}$ represents the effects of the market fluctuation in the time interval $[n, n+1)$. This leads to the formula $X_n = X_0 y_1 y_2 \cdots y_n$. If one assumes that $\{y_n\}_{n\ge 1}$ are sufficiently independent, then the exponent in
$$X_n = X_0 \exp\Big(n\mu + \sum_{i=1}^n (\log y_i - \mu)\Big)$$
is, by the central limit theorem, approximately Gaussian, leading one to consider a model of the form
$$X(t) = X(0)e^{\mu t + \sigma B(t)} \tag{2.26}$$
where $\{B(t) : t \ge 0\}$ is SBM. Thus, $\{\log X(t) - \log X(0) : t \ge 0\}$ is postulated to be a Brownian motion with drift $\mu$ and diffusion $\sigma$. In the language of finance, $\mu$ is called the growth rate and $\sigma$ the volatility rate. The so-called European option allows one to buy the stock at a future time $t_0$ for a unit price of $K$ dollars fixed at time 0. If $X(t_0) < K$ then one has the option of not buying, whereas if $X(t_0) \ge K$, then one can buy it at $K$ dollars and sell it immediately at the market price $X(t_0)$ and realize a profit of $X(t_0) - K$. Thus the net revenue from this option is
$$\tilde{X}(t_0) = \begin{cases} 0 & \text{if } X(t_0) \le K \\ X(t_0) - K & \text{if } X(t_0) > K. \end{cases} \tag{2.27}$$


Since the value of money depreciates over time, say at rate $r$, the net revenue's value at time 0 is $\tilde{X}(t_0)e^{-t_0 r}$. So a fair price for this European option is
$$p_0 = E\big(\tilde{X}(t_0)e^{-t_0 r}\big) = E\big((X(t_0) - K)^+ e^{-t_0 r}\big). \tag{2.28}$$
Here the constants $\mu, \sigma, K, t_0, r$ are assumed known. The goal is to compute $p_0$. This becomes feasible if one makes the natural assumption of no arbitrage. That is, the discounted value of the stock, i.e., $X(t)e^{-rt}$, evolves as a martingale. This is a reasonable assumption as otherwise (if it is advantageous) then everybody will want to take advantage of it and start buying the stock, thereby driving the price down and making it unprofitable. Thus, in effect, this assumption says that $X(t)e^{-rt} \equiv X(0)e^{\mu t + \sigma B(t) - rt}$ evolves as a martingale. But recall that if $B(\cdot)$ is an SBM then for any $\theta$ real, $e^{\theta B(t) - \frac{\theta^2}{2}t}$ evolves as a martingale. Thus, $\mu, \sigma, r$ should satisfy the condition $-\frac{\sigma^2}{2} = (\mu - r)$. With this assumption, the fair price for this European option with $\mu, \sigma, r, K, t_0$ given is
$$\begin{aligned}
p_0 &= E\Big(e^{-t_0 r}\big(X_0 e^{\sigma B(t_0) + \mu t_0} - K\big)^+\Big) \\
&= e^{-t_0 r} \int_{\{X_0 e^{\sigma y + \mu t_0} > K\}} \big(X_0 e^{\sigma y + \mu t_0} - K\big) \frac{1}{\sqrt{2\pi t_0}}\, e^{-\frac{y^2}{2t_0}}\, dy.
\end{aligned} \tag{2.29}$$
This is known as the Black-Scholes formula. For more detailed discussions on Brownian motion, including the development of Ito stochastic integration and diffusion processes via a martingale formulation, the books of Stroock and Varadhan (1979) and Karlin and Taylor (1975) should be consulted. See also Karatzas and Shreve (1991).
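The integral in (2.29) can be evaluated directly. The sketch below is an illustration under stated assumptions, not the text's own computation: it imposes the no-arbitrage relation $\mu = r - \sigma^2/2$, integrates (2.29) by Simpson's rule, and compares the result with the familiar closed form $X_0\Phi(d_1) - Ke^{-rt_0}\Phi(d_2)$, which follows from (2.29) by completing the square. The parameter values and function names are illustrative.

```python
import math


def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def price_by_integration(x0, k, sigma, r, t0, steps=20000):
    """Simpson's rule for the integral in (2.29), with mu = r - sigma**2 / 2 (steps must be even)."""
    mu = r - 0.5 * sigma ** 2
    lower = (math.log(k / x0) - mu * t0) / sigma    # the payoff vanishes below this y
    upper = max(lower, 0.0) + 12.0 * math.sqrt(t0)  # Gaussian tail beyond this is negligible

    def integrand(y):
        payoff = x0 * math.exp(sigma * y + mu * t0) - k
        density = math.exp(-y * y / (2.0 * t0)) / math.sqrt(2.0 * math.pi * t0)
        return payoff * density

    h = (upper - lower) / steps
    total = integrand(lower) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(lower + i * h)
    return math.exp(-r * t0) * total * h / 3.0


def price_closed_form(x0, k, sigma, r, t0):
    """Closed form of (2.29) under mu = r - sigma**2 / 2."""
    d1 = (math.log(x0 / k) + (r + 0.5 * sigma ** 2) * t0) / (sigma * math.sqrt(t0))
    d2 = d1 - sigma * math.sqrt(t0)
    return x0 * normal_cdf(d1) - k * math.exp(-r * t0) * normal_cdf(d2)


numeric = price_by_integration(100.0, 100.0, 0.2, 0.05, 1.0)
closed = price_closed_form(100.0, 100.0, 0.2, 0.05, 1.0)
print(round(numeric, 4), round(closed, 4))
```

For $X_0 = K = 100$, $\sigma = 0.2$, $r = 0.05$, $t_0 = 1$ both routes give a price of about 10.45.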

15.3 Problems

15.1 Let $\{L_j\}_{j\ge 0}$ be as in Section 15.1.1. Show that for any $\theta \ge 0$

(a) $E\big(e^{-\theta\sum_{j=0}^n L_j}\big) = E\Big(\prod_{j=0}^n \frac{\lambda_{y_j}}{\theta + \lambda_{y_j}}\Big)$,

(b) $E\big(e^{-\theta\sum_{j=0}^\infty L_j}\big) = 0$ for all $\theta > 0$ iff $\sum_{j=0}^\infty \frac{1}{\lambda_{y_j}} = \infty$ w.p. 1, assuming $0 < \lambda_i < \infty$ for all $i$.

15.2 Let $L$ be an exponential random variable. Verify that for any $x > 0$, $u > 0$,
$$P(L > x + u \mid L > x) = P(L > u).$$
(This is referred to as the "lack of memory" property.)

15.3 Solve the Kolmogorov's forward and backward equations for the following special cases of birth and death processes:

(a) Poisson process: $\alpha_i \equiv \alpha$, $\beta_i \equiv 0$,

(b) Yule process: $\alpha_i \equiv i\alpha$, $\beta_i \equiv 0$,

(c) On-off process: $\alpha_0 = \alpha$, $\alpha_i = 0$, $i \ge 1$, $\beta_1 = \beta$, $\beta_i = 0$, $i = 0, 2, \ldots$,

(d) M/M/1 queue: $\alpha_i = \alpha$, $i \ge 0$, $\beta_i = \beta$, $i \ge 1$, $\beta_0 = 0$,

(e) M/M/s queue: $\alpha_i = \alpha$, $i \ge 0$, $\beta_i = i\beta$, $1 \le i \le s$ and $= s\beta$, $i > s$, $\beta_0 = 0$,

(f) Pure death process: $\beta_i \equiv \beta$, $i \ge 1$, $\beta_0 = 0$, $\alpha_i = 0$, $i \ge 0$.

15.4 Find the stationary distributions when they exist for the processes in Problem 15.3.

15.5 Consider 2 independent M/M/1 queues with arrival rate $\lambda$, service rate $\mu$ (Case I), and one M/M/1 queue with arrival rate $2\lambda$ and service rate $2\mu$ (Case II). Assume $\lambda < \mu$. Show that in the stationary state the mean number in the system in Case I is larger than in Case II and their ratio approaches 2 as $\rho = \frac{\lambda}{\mu} \uparrow 1$.

15.6 Show that for any finite state space irreducible CTMC $\{X(t) : t \ge 0\}$ with all $\lambda_i \in (0,\infty)$, there is a unique stationary distribution.

15.7 (M/M/∞ queue). This is a birth and death process such that $\alpha_n \equiv \alpha$, $\beta_n = n\beta$, $n \ge 0$, $0 < \alpha, \beta < \infty$. Show that this process has a stationary distribution that is Poisson with mean $\rho = \frac{\alpha}{\beta}$.

15.8 (a) Let $\{X(t)\}_{t\ge 0}$ be a Poisson process with rate $\lambda$. Let $L$ be an exponential random variable with mean $\mu^{-1}$ and independent of $\{X(t)\}_{t\ge 0}$. Let $N(t) = X(t + L) - X(t)$. Find the distribution of $N(t)$.

(b) Let $\{Y(t)\}_{t\ge 0}$ be also a Poisson process with rate $\mu$ and independent of $\{X(t)\}_{t\ge 0}$ in (a). Let $T$ and $T'$ be two successive 'event epochs' for the $\{Y(t)\}_{t\ge 0}$ process. Let $N = X(T') - X(T)$. Find the distribution of $N$.

(c) Let $\{X(t)\}_{t\ge 0}$ be as in (a). Let $\tau_0 = 0 < \tau_1 < \tau_2 < \cdots$ be the successive event epochs of $\{X(t)\}_{t\ge 0}$. Find the joint distribution of $(\tau_1, \tau_2, \ldots, \tau_n)$ conditioned on the event $\{N(t) = 1\}$ for some $0 < t < \infty$.

15.9 Let $\{X(t) : t \ge 0\}$ be a Poisson process with rate $\lambda$. Suppose at each event epoch of the Poisson process an experiment is performed that results in one of $k$ possible outcomes $\{a_i : 1 \le i \le k\}$ with probability distribution $\{p_i : 1 \le i \le k\}$. Let $X_i(t)$ = number of outcomes $a_i$ in $[0,t]$. Assume the experiments are iid. Show that $\{X_i(t) : t \ge 0\}$ are independent Poisson processes with rate $\lambda p_i$ for $1 \le i \le k$.

15.10 Let $\{X(t) : t \ge 0\}$ be a Poisson process with rate $\lambda$, $0 < \lambda < \infty$. Let $\{\xi_i\}_{i\ge 1}$ be a sequence of iid random variables independent of $\{X(t) : t \ge 0\}$ with values in a measurable space $(S, \mathcal{S})$. For each $A \in \mathcal{S}$ define
$$N(A,t) \equiv \sum_{j=1}^{N(t)} I(\xi_j \in A), \quad t \ge 0.$$

(a) Verify that for each $A \in \mathcal{S}$, $\{N(A,t)\}_{t\ge 0}$ is a Poisson process and find its rate.

(b) Show that if $A_1, A_2 \in \mathcal{S}$, $A_1 \cap A_2 = \emptyset$, then the two Poisson processes $\{N(A_i,t)\}_{t\ge 0}$, $i = 1, 2$ are independent.

(c) Show that for each $t > 0$, $\{N(A,t) : A \in \mathcal{S}\}$ is a Poisson random field on $S$, i.e., for each $A$, $N(A,t)$ is Poisson and for $A_1, A_2, \ldots, A_k$ pairwise disjoint elements of $\mathcal{S}$, $\{N(A_i,t)\}_{i=1}^k$ are independent.

(d) Show that $\{N(\cdot,t)\}_{t\ge 0}$ is a process with stationary independent increments that is Poisson random measure valued.

15.11 Let $B_N(\cdot,\omega)$ be as in (2.1). Show that $\{B_N(\cdot,\omega)\}_{N\ge 1}$ is Cauchy in the Banach space $C[0,1]$ with sup norm by completing the following steps:

(a) If $\xi_{nj}(t,\omega) = Z_{nj}(\omega)S_{nj}(t)$ then

(i) $\|\xi_{nj}(\cdot,\omega)\| \equiv \sup\{|\xi_{nj}(t,\omega)| : 0 \le t \le 1\} = |Z_{nj}(\omega)|\,2^{-\frac{(n+1)}{2}}$,

(ii) $\sup\Big\{\sum_{j=1}^{2^n-1} |\xi_{nj}(t,\omega)| : 0 \le t \le 1\Big\} = \big(\max\{|Z_{nj}(\omega)| : 1 \le j \le 2^n - 1\}\big)\,2^{-\frac{(n+1)}{2}}$,

(b) for any sequence $\{\eta_i\}_{i\ge 1}$ of random variables with $\sup_i E(e^{\eta_i}) < \infty$, w.p. 1, $\eta_i \le 2\log i$ for all large $i$,

(c) w.p. 1 there is a $C < \infty$ such that $\max\{|Z_{nj}(\omega)| : 1 \le j \le 2^n - 1\} \le Cn$,

(d) $\sum_{n=1}^\infty \Big\|\sum_{j=1}^{2^n-1} \xi_{nj}(\cdot,\omega)\Big\| < \infty$ w.p. 1.

15.12 Show that if $B(\cdot)$ is SBM,
$$P\big(B(t) = 0 \text{ for some } t \text{ in } (t_1, t_2)\big) = \frac{2}{\pi}\arccos\sqrt{\frac{t_1}{t_2}}.$$
(Hint: Conditioned on $B(t_1) = x \ne 0$, the required probability equals $P(T_{|x|} \le t_2 - t_1) = 2\Big(1 - \Phi\Big(\frac{|x|}{\sqrt{t_2 - t_1}}\Big)\Big)$ and hence the unconditional probability is $E\,2\Big(1 - \Phi\Big(\frac{|B(t_1)|}{\sqrt{t_2 - t_1}}\Big)\Big)$.)

15.13 Use the reflection principle to find $P(M(t) \ge x, B(t) \le y)$ for $x > y$ where $M(t) = \max\{B(u) : 0 \le u \le t\}$ and $B(\cdot)$ is SBM.

15.14 For $a < 0 < b < c$ find $P(T_b < T_a < T_c)$ where $T_x = \min\{t : t > 0, B(t) = x\}$ and $B(\cdot)$ is SBM.

15.15 Let $B_0(t) \equiv B(t) - tB(1)$, $0 \le t \le 1$ (where $\{B(t) : t \ge 0\}$ is SBM) be the Brownian bridge. Find the distribution of $X(t) \equiv (1+t)B_0\big(\frac{t}{1+t}\big)$, $t \ge 0$. (Hint: $X$ is a Gaussian process. Find its mean and covariance functions.)

15.16 Let $B(\cdot)$ be SBM. Let $M_n = \sup\{|B(t) - B(n)| : n - 1 \le t \le n\}$, $n = 1, 2, \ldots$.

(a) Show that $\frac{M_n}{n} \to 0$ w.p. 1 as $n \to \infty$. (Hint: Show $\{M_n\}_{n\ge 1}$ are iid and $EM_1 < \infty$.)

(b) Using this show that $\frac{B(t)}{t} \to 0$ w.p. 1 as $t \to \infty$ and give another proof of the time inversion result 15.2.3.

15.17 Use the exponential martingale to find $E(e^{-\lambda T})$ where $T = \inf\{t : t \ge 0, B(t) \ge \alpha + \beta t\}$, $\lambda > 0$, $\alpha > 0$, $\beta > 0$ and $B(\cdot)$ is SBM.

15.18 Let $\{Y(t) : -\infty < t < \infty\}$ be the Ornstein-Uhlenbeck process as defined in 15.2.4. Let $f : \mathbb{R} \to \mathbb{R}$ be Borel measurable and $E|f(Z)| < \infty$ where $Z \sim N(0,1)$. Evaluate $\lim_{t\to\infty} \frac{1}{t}\int_0^t f\big(Y(u)\big)\,du$. (Hint: Show that $Y(\cdot)$ is a regenerative stochastic process.)

16 Limit Theorems for Dependent Processes

16.1 A central limit theorem for martingales

Let $\{X_n\}_{n\ge 1}$ be a sequence of random variables on $(\Omega, \mathcal{F}, P)$, and let $\{\mathcal{F}_n\}_{n\ge 1}$ be a filtration, i.e., a sequence of σ-algebras on $\Omega$ such that $\mathcal{F}_n \subset \mathcal{F}_{n+1} \subset \mathcal{F}$ for all $n \ge 1$. From Chapter 13, recall that $\{X_n, \mathcal{F}_n\}_{n\ge 1}$ is called a martingale if $X_n$ is $\mathcal{F}_n$-measurable for each $n \ge 1$ and $E(X_{n+1} \mid \mathcal{F}_n) = X_n$ for each $n \ge 1$. Given a martingale $\{X_n, \mathcal{F}_n\}_{n\ge 1}$, define
$$Y_1 = X_1 - EX_1, \qquad Y_n = X_n - X_{n-1}, \quad n \ge 2.$$
Note that each $Y_n$ is $\mathcal{F}_n$-measurable and
$$E(Y_n \mid \mathcal{F}_{n-1}) = 0 \quad \text{for all } n \ge 1, \tag{1.1}$$
where $\mathcal{F}_0 = \{\Omega, \emptyset\}$.

Definition 16.1.1: Let $\{Y_n\}_{n\ge 1}$ be a collection of random variables on a probability space $(\Omega, \mathcal{F}, P)$ and let $\{\mathcal{F}_n\}_{n\ge 1}$ be a filtration. Then, $\{Y_n, \mathcal{F}_n\}_{n\ge 1}$ is called a martingale difference array (mda) if $Y_n$ is $\mathcal{F}_n$-measurable for each $n \ge 1$ and (1.1) holds.

For example, if $\{Y_n\}_{n\ge 1}$ is a sequence of zero mean independent random variables, then $\{Y_n, \mathcal{F}_n\}_{n\ge 1}$ is a mda w.r.t. the natural filtration $\mathcal{F}_n = \sigma\langle Y_1, \ldots, Y_n\rangle$, $n \ge 1$. Other examples of mda's can be constructed from the


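examples given in Chapter 13. The main result of this section shows that for square-integrable mda's satisfying a Lindeberg-type condition, the CLT holds. For more on limit theorems for mda's, see Hall and Heyde (1980).

Theorem 16.1.1: For each $n \ge 1$, let $\{Y_{ni}, \mathcal{F}_{ni}\}_{i\ge 1}$ be a mda on $(\Omega, \mathcal{F}, P)$ with $EY_{ni}^2 < \infty$ for all $i \ge 1$ and let $\tau_n$ be a finite stopping time w.r.t. $\{\mathcal{F}_{ni}\}_{i\ge 1}$. Suppose that for some constant $\sigma^2 \in (0,\infty)$,
$$\sum_{i=1}^{\tau_n} E\big(Y_{ni}^2 \mid \mathcal{F}_{n,i-1}\big) \longrightarrow_p \sigma^2 \quad \text{as } n \to \infty \tag{1.2}$$
and that for each $\epsilon > 0$,
$$\Delta_n(\epsilon) \equiv \sum_{i=1}^{\tau_n} E\big(Y_{ni}^2 I(|Y_{ni}| > \epsilon) \mid \mathcal{F}_{n,i-1}\big) \longrightarrow_p 0 \quad \text{as } n \to \infty. \tag{1.3}$$
Then,
$$\sum_{i=1}^{\tau_n} Y_{ni} \longrightarrow_d N(0, \sigma^2). \tag{1.4}$$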

Proof: First the theorem will be proved under the additional condition that
$$\tau_n = m_n \ \text{ for all } n \ge 1 \text{ for some nonrandom sequence of positive integers } \{m_n\}_{n\ge 1} \tag{1.5}$$
and that for some $c \in (0,\infty)$,
$$\sum_{i=1}^{m_n} E\big(Y_{ni}^2 \mid \mathcal{F}_{n,i-1}\big) \le c \quad \text{w.p. 1.} \tag{1.6}$$

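Let $\sigma_{ni}^2 = E\big(Y_{ni}^2 \mid \mathcal{F}_{n,i-1}\big)$, $i \ge 1$, $n \ge 1$. Also, write $m$ for $m_n$ to ease the notation. Since $\sigma_{ni}^2$ is $\mathcal{F}_{n,i-1}$-measurable, for any $t \in \mathbb{R}$,
$$\begin{aligned}
&\Big|E\exp\Big(\iota t\sum_{j=1}^m Y_{nj}\Big) - \exp\big(-t^2\sigma^2/2\big)\Big| \\
&\quad\le \Big|E\exp\Big(\iota t\sum_{j=1}^m Y_{nj}\Big) - E\exp\Big(\iota t\sum_{j=1}^{m-1} Y_{nj}\Big)\exp\big(-t^2\sigma_{nm}^2/2\big)\Big| \\
&\qquad + \cdots + \Big|E\exp\big(\iota tY_{n1}\big)\exp\Big(-\sum_{j=2}^m t^2\sigma_{nj}^2/2\Big) - E\exp\Big(-\sum_{j=1}^m t^2\sigma_{nj}^2/2\Big)\Big| \\
&\qquad + \Big|E\exp\Big(-\sum_{j=1}^m t^2\sigma_{nj}^2/2\Big) - \exp\big(-t^2\sigma^2/2\big)\Big| \\
&\quad\le \sum_{k=1}^m E\Big|E\big(\exp(\iota tY_{nk}) \mid \mathcal{F}_{n,k-1}\big) - \exp\big(-t^2\sigma_{nk}^2/2\big)\Big| \\
&\qquad + \Big|E\exp\Big(-\sum_{j=1}^m t^2\sigma_{nj}^2/2\Big) - \exp\big(-t^2\sigma^2/2\big)\Big| \\
&\quad\equiv I_{1n} + I_{2n}, \ \text{say.}
\end{aligned} \tag{1.7}$$
By (1.2), (1.5), and the BCT,
$$I_{2n} \to 0 \quad \text{as } n \to \infty. \tag{1.8}$$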

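To estimate $I_{1n}$, note that for any $1 \le k \le m$,
$$\begin{aligned}
E\big(\exp(\iota tY_{nk}) \mid \mathcal{F}_{n,k-1}\big)
&= 1 + \iota tE\big(Y_{nk} \mid \mathcal{F}_{n,k-1}\big) + \frac{(\iota t)^2}{2}E\big(Y_{nk}^2 \mid \mathcal{F}_{n,k-1}\big) + \theta_{nk}(t) \\
&= 1 - \frac{t^2}{2}\sigma_{nk}^2 + \theta_{nk}(t)
\end{aligned} \tag{1.9}$$
and
$$\exp\big(-t^2\sigma_{nk}^2/2\big) = 1 - \frac{t^2}{2}\sigma_{nk}^2 + \gamma_{nk}, \ \text{say.} \tag{1.10}$$
It is easy to verify that
$$|\theta_{nk}| \le E\Big(\min\Big\{(tY_{nk})^2, \frac{|tY_{nk}|^3}{6}\Big\} \,\Big|\, \mathcal{F}_{n,k-1}\Big)$$
and
$$|\gamma_{nk}| \le \big(t^2\sigma_{nk}^2\big)^2 \exp\big(t^2\sigma_{nk}^2/2\big)/8.$$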

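Hence, by (1.3), (1.6), (1.9), and (1.10), for any $\epsilon$ in (0,1),
$$\begin{aligned}
I_{1n} &\le \sum_{k=1}^m E\big(|\theta_{nk}| + |\gamma_{nk}|\big) \\
&\le t^2\sum_{k=1}^m E\Big|E\big(Y_{nk}^2 I(|Y_{nk}| > \epsilon) \mid \mathcal{F}_{n,k-1}\big)\Big| + |t|^3\epsilon\sum_{k=1}^m E\Big|E\big(Y_{nk}^2 \mid \mathcal{F}_{n,k-1}\big)\Big| + \sum_{k=1}^m E\,t^4\sigma_{nk}^4\exp\big(t^2c/2\big) \\
&\le t^2E\Delta_n(\epsilon) + |t|^3\epsilon\, E\Big(\sum_{k=1}^m \sigma_{nk}^2\Big) + t^4\exp\big(t^2c/2\big)\,E\Big[\Big(\sum_{k=1}^m \sigma_{nk}^2\Big)\max_{1\le k\le m}\sigma_{nk}^2\Big].
\end{aligned}$$
Note that for any $\epsilon > 0$,
$$E\Big(\max_{1\le k\le m}\sigma_{nk}^2\Big) \le \epsilon^2 + E\Big(\max_{1\le k\le m} E\big(Y_{nk}^2 I(|Y_{nk}| > \epsilon) \mid \mathcal{F}_{n,k-1}\big)\Big) \le \epsilon^2 + E\Delta_n(\epsilon).$$
Hence, by (1.3), (1.6), and the BCT, for any $\epsilon \in (0,1)$,
$$\limsup_{n\to\infty} I_{1n} \le \limsup_{n\to\infty}\Big[t^2E\Delta_n(\epsilon) + |t|^3\epsilon c + t^4ce^{t^2c/2}\big(\epsilon^2 + E\Delta_n(\epsilon)\big)\Big] \le c_1(t)\,\epsilon$$
for some $c_1(t) \in (0,\infty)$, not depending on $\epsilon$. This implies that
$$\lim_{n\to\infty} I_{1n} = 0. \tag{1.11}$$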

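Clearly (1.7), (1.8), and (1.11) yield (1.4), whenever (1.5) and (1.6) are true.

Next, suppose that condition (1.6) is not assumed a priori but (1.5) holds true. Fix $c > \sigma^2$ and define the sets $B_{nk} = \big\{\sum_{i=1}^k \sigma_{ni}^2 \le c\big\}$, and the variables $\check{Y}_{nk} = Y_{nk}I_{B_{nk}}$, $k \ge 1$, $n \ge 1$. Note that $B_{nk} \in \mathcal{F}_{n,k-1}$ and hence,
$$E\big(\check{Y}_{nk} \mid \mathcal{F}_{n,k-1}\big) = I_{B_{nk}}E\big(Y_{nk} \mid \mathcal{F}_{n,k-1}\big) = 0,$$
and
$$\check{\sigma}_{nk}^2 \equiv E\big(\check{Y}_{nk}^2 \mid \mathcal{F}_{n,k-1}\big) = I_{B_{nk}}\sigma_{nk}^2, \tag{1.12}$$
for all $k \ge 1$. In particular, $\{\check{Y}_{nk}, \mathcal{F}_{n,k}\}$ is a mda. Since $B_{n,k-1} \supset B_{nk}$ for all $k$, by the definitions of the sets $B_{nk}$, it follows that
$$\begin{aligned}
\sum_{k=1}^m \check{\sigma}_{nk}^2 &= \sum_{k=1}^m \sigma_{nk}^2 I_{B_{nk}} \\
&= \Big(\sum_{k=1}^m \sigma_{nk}^2\Big)I_{B_{nm}} + \Big(\sum_{k=1}^{m-1}\sigma_{nk}^2\Big)\big(I_{B_{n,m-1}} - I_{B_{nm}}\big) + \cdots + \sigma_{n1}^2\big(I_{B_{n1}} - I_{B_{n2}}\big) \\
&\le cI_{B_{nm}} + c\big(I_{B_{n,m-1}} - I_{B_{nm}}\big) + \cdots + c\big(I_{B_{n1}} - I_{B_{n2}}\big) \\
&\le c.
\end{aligned} \tag{1.13}$$
Thus, the mda $\{\check{Y}_{nk}, \mathcal{F}_{nk}\}_{k\ge 1}$ satisfies (1.6). Next note that by (1.2) and (1.5),
$$P\big(B_{nm}^c\big) \to 0 \quad \text{as } n \to \infty. \tag{1.14}$$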

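Also, by (1.12), $\sum_{k=1}^m \check{\sigma}_{nk}^2 = \sum_{k=1}^m \sigma_{nk}^2$ on $B_{nm}$. Hence, it follows that
$$\sum_{k=1}^m \check{\sigma}_{nk}^2 \longrightarrow_p \sigma^2,$$
i.e., the mda $\{\check{Y}_{nk}, \mathcal{F}_{nk}\}$ satisfies (1.2). Further, the inequality "$|\check{Y}_{nk}| \le |Y_{nk}|$" and the fact that the function $h(x) = x^2I(|x| > \epsilon)$, $x > 0$ is nondecreasing jointly imply that (1.3) holds for $\{\check{Y}_{nk}, \mathcal{F}_{nk}\}$. Hence, by the case already proved,
$$\sum_{k=1}^m \check{Y}_{nk} \longrightarrow_d N(0,\sigma^2). \tag{1.15}$$
But $\sum_{k=1}^m \check{Y}_{nk} = \sum_{k=1}^m Y_{nk}$ on $B_{nm}$. Hence, by (1.14),
$$\sum_{k=1}^m Y_{nk} \longrightarrow_d N(0,\sigma^2), \tag{1.16}$$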

and therefore, the CLT holds without the restriction in (1.6).

Next consider relaxing the restrictions in (1.5) (and (1.6)). Since $P(\tau_n < \infty) = 1$, there exist positive integers $m_n$ such that (Problem 16.2)
$$P\big(\tau_n > m_n\big) \to 0 \quad \text{as } n \to \infty. \tag{1.17}$$
Next define
$$\tilde{Y}_{nk} = Y_{nk}I(\tau_n \ge k), \quad k \ge 1, \ n \ge 1. \tag{1.18}$$
It is easy to check (Problem 16.3) that $\{\tilde{Y}_{nk}, \mathcal{F}_{nk}\}$ is a mda, and that $\{\tilde{Y}_{nk}, \mathcal{F}_{nk}\}$ satisfies (1.2) and (1.3) with $\tau_n$ replaced by $m_n$ (Problem 16.4). Hence, by the previous case already proved,
$$\sum_{k=1}^{m_n} \tilde{Y}_{nk} \longrightarrow_d N(0,\sigma^2).$$
Next note that (cf. (4.1), Problem 16.4),
$$\sum_{k=1}^{\tau_n} Y_{nk} - \sum_{k=1}^{m_n} \tilde{Y}_{nk} \longrightarrow_p 0 \quad \text{as } n \to \infty. \tag{1.19}$$
Hence, (1.4) holds and the proof of the theorem is complete. □
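Theorem 16.1.1 can be illustrated by simulation. The sketch below uses a hypothetical mda chosen only for illustration: $Y_{ni} = \epsilon_i(1 + 0.5\,\epsilon_{i-1})/\sqrt{n}$ with $\epsilon_i$ iid $\pm 1$ signs and $\tau_n = n$. Here $E(Y_{ni} \mid \mathcal{F}_{n,i-1}) = 0$, the conditional variances sum to about $E(1 + 0.5\,\epsilon_0)^2 = 1.25$, and $|Y_{ni}| \le 1.5/\sqrt{n}$, so (1.2) and (1.3) hold with $\sigma^2 = 1.25$. The seed and sample sizes are arbitrary.

```python
import random
import statistics


def martingale_sum(n, rng):
    """One draw of sum_{i<=n} Y_{ni}, where Y_{ni} = eps_i * (1 + 0.5 * eps_{i-1}) / sqrt(n)."""
    total = 0.0
    prev = rng.choice((-1.0, 1.0))  # eps_0
    for _ in range(n):
        eps = rng.choice((-1.0, 1.0))
        total += eps * (1.0 + 0.5 * prev)  # the scale factor depends only on the past
        prev = eps
    return total / n ** 0.5


rng = random.Random(2024)  # arbitrary fixed seed
draws = [martingale_sum(200, rng) for _ in range(2000)]
sample_mean = statistics.mean(draws)
sample_var = statistics.variance(draws)
print(round(sample_mean, 3), round(sample_var, 3))
```

The sample mean and variance of the simulated sums should be close to 0 and 1.25, the parameters of the limit in (1.4).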

16.2 Mixing sequences

This section deals with a class of dependent processes, called the mixing processes, where the degree of dependence decreases as the distance (in time) between two given sets of random variables goes to infinity. The 'degree of dependence' is measured by various mixing coefficients, which are defined in Section 16.2.1 below. Some basic properties of the mixing coefficients are presented in Section 16.2.2. Limit theorems for sums of mixing random variables are given in Section 16.2.3.

16.2.1 Mixing coefficients

Definition 16.2.1: Let $(\Omega, \mathcal{F}, P)$ be a probability space and $\mathcal{G}_1, \mathcal{G}_2$ be sub-σ-algebras of $\mathcal{F}$.

(a) The α-mixing or strong mixing coefficient between $\mathcal{G}_1$ and $\mathcal{G}_2$ is defined as
$$\alpha(\mathcal{G}_1, \mathcal{G}_2) \equiv \sup\big\{|P(A \cap B) - P(A)P(B)| : A \in \mathcal{G}_1, B \in \mathcal{G}_2\big\}. \tag{2.1}$$

(b) The β-mixing coefficient or the coefficient of absolute regularity between $\mathcal{G}_1$ and $\mathcal{G}_2$ is defined as
$$\beta(\mathcal{G}_1, \mathcal{G}_2) \equiv \sup\frac{1}{2}\sum_{i=1}^k\sum_{j=1}^\ell \big|P(A_i \cap B_j) - P(A_i)P(B_j)\big|, \tag{2.2}$$
where the supremum is taken over all finite partitions $\{A_1, \ldots, A_k\}$ and $\{B_1, \ldots, B_\ell\}$ of $\Omega$ by sets $A_i \in \mathcal{G}_1$ and $B_j \in \mathcal{G}_2$, $1 \le i \le k$, $1 \le j \le \ell$, $\ell, k \in \mathbb{N}$.

(c) The ρ-mixing coefficient or the coefficient of maximal correlation between $\mathcal{G}_1$ and $\mathcal{G}_2$ is defined as
$$\rho(\mathcal{G}_1, \mathcal{G}_2) \equiv \sup\big\{\rho_{X_1,X_2} : X_i \in L^2(\Omega, \mathcal{G}_i, P), i = 1, 2\big\} \tag{2.3}$$
where $\rho_{X_1,X_2} \equiv \frac{\mathrm{Cov}(X_1,X_2)}{\sqrt{\mathrm{Var}(X_1)\mathrm{Var}(X_2)}}$ is the correlation coefficient of $X_1$ and $X_2$.

It is easy to check (Problem 16.5 (a) and (d)) that all three mixing coefficients take values in the interval [0, 1] and that
$$\rho(\mathcal{G}_1, \mathcal{G}_2) = \sup\big\{|EX_1X_2| : X_i \in L^2(\Omega, \mathcal{G}_i, P), EX_i = 0, EX_i^2 = 1, i = 1, 2\big\}.$$
When the σ-algebras $\mathcal{G}_1$ and $\mathcal{G}_2$ are independent, these coefficients equal zero, and vice versa. Thus, nonzero values of the mixing coefficients give various measures of the degree of dependence between $\mathcal{G}_1$ and $\mathcal{G}_2$. It is easy to check (Problem 16.5 (c)) that
$$\alpha(\mathcal{G}_1, \mathcal{G}_2) \le \beta(\mathcal{G}_1, \mathcal{G}_2) \quad \text{and} \quad \alpha(\mathcal{G}_1, \mathcal{G}_2) \le \rho(\mathcal{G}_1, \mathcal{G}_2). \tag{2.4}$$
However, no ordering between $\beta(\mathcal{G}_1, \mathcal{G}_2)$ and $\rho(\mathcal{G}_1, \mathcal{G}_2)$ exists, in general (Problem 16.6). There are two other mixing coefficients that are also often used in the literature. These are given by the φ-mixing coefficient
$$\phi(\mathcal{G}_1, \mathcal{G}_2) \equiv \sup\big\{|P(A) - P(A \mid B)| : B \in \mathcal{G}_1, P(B) > 0, A \in \mathcal{G}_2\big\}, \tag{2.5}$$
and the Ψ-mixing coefficient
$$\Psi(\mathcal{G}_1, \mathcal{G}_2) \equiv \sup_{A \in \mathcal{G}_1^*, B \in \mathcal{G}_2^*} \frac{|P(A \cap B) - P(A)P(B)|}{P(A)P(B)}, \tag{2.6}$$
where $P(A \mid B) = P(A \cap B)/P(B)$ for $P(B) > 0$, and where $\mathcal{G}_i^* = \{A : A \in \mathcal{G}_i, P(A) > 0\}$, $i = 1, 2$. It is easy to check that $\Psi(\mathcal{G}_1, \mathcal{G}_2) \ge \phi(\mathcal{G}_1, \mathcal{G}_2) \ge \beta(\mathcal{G}_1, \mathcal{G}_2)$.

Definition 16.2.2: Let $\{X_i\}_{i\in\mathbb{Z}}$ be a (doubly-infinite) sequence of random variables on a probability space $(\Omega, \mathcal{F}, P)$. Then, the strong- or α-mixing coefficient of $\{X_i\}_{i\in\mathbb{Z}}$, denoted by $\alpha_X(\cdot)$, is defined by
$$\alpha_X(n) \equiv \sup_{i\in\mathbb{Z}} \alpha\big(\sigma\langle X_j : j \le i, j \in \mathbb{Z}\rangle, \sigma\langle X_j : j \ge i + n, j \in \mathbb{Z}\rangle\big), \quad n \ge 1, \tag{2.7}$$
where the $\alpha(\cdot,\cdot)$ on the right side of (2.7) is as defined in (2.1). The process $\{X_i\}_{i\in\mathbb{Z}}$ is called strongly mixing or α-mixing if
$$\lim_{n\to\infty} \alpha_X(n) = 0. \tag{2.8}$$
The other mixing coefficients of $\{X_i\}_{i\in\mathbb{Z}}$ (e.g., $\beta_X(\cdot)$, $\rho_X(\cdot)$, etc.) are defined similarly. For a one-sided sequence $\{X_i\}_{i\ge 1}$, the α-mixing coefficient of $\{X_i\}_{i\ge 1}$ is defined by replacing $\mathbb{Z}$ on the right side of (2.7) by $\mathbb{N}$ on all three occurrences. A similar modification is needed for the other mixing coefficients. When there is no chance of confusion, the coefficients $\alpha_X(\cdot)$, $\beta_X(\cdot)$, ..., etc., will be written as $\alpha(\cdot)$, $\beta(\cdot)$, ..., etc., to ease the notation. Another important notion of 'weak' dependence is given by the following:

Definition 16.2.3: Let $m \in \mathbb{Z}_+$ be an integer and $\{X_i\}_{i\in\mathbb{Z}}$ be a collection of random variables on $(\Omega, \mathcal{F}, P)$. Then, $\{X_i\}_{i\in\mathbb{Z}}$ is called m-dependent if for every $k \in \mathbb{Z}$, $\{X_i : i \le k, i \in \mathbb{Z}\}$ and $\{X_i : i > k + m, i \in \mathbb{Z}\}$ are independent.

Example 16.2.1: If $\{\epsilon_i\}_{i\in\mathbb{Z}}$ is a collection of independent random variables and $X_i = (\epsilon_i + \epsilon_{i+1})$, $i \in \mathbb{Z}$, then $\{\epsilon_i\}_{i\in\mathbb{Z}}$ is 0-dependent and $\{X_i\}_{i\in\mathbb{Z}}$ is 1-dependent. It is easy to see that if $\{X_i\}_{i\in\mathbb{Z}}$ is m-dependent for some $m \in \mathbb{Z}_+$, then $\alpha_X^*(n) = 0$ for all $n > m$, where $\alpha_X^* \in \{\alpha_X, \beta_X, \rho_X, \phi_X, \Psi_X\}$. Therefore, m-dependence of $\{X_i\}_{i\in\mathbb{Z}}$ implies that the process $\{X_i\}_{i\in\mathbb{Z}}$ is $\alpha_X^*$-mixing. In this sense, the condition of m-dependence is the strongest and the condition of α-mixing is the weakest among all weak dependence conditions introduced here.

Example 16.2.2: Let $\{\epsilon_i\}_{i\in\mathbb{Z}}$ be a collection of iid random variables with $E\epsilon_1 = 0$, $E\epsilon_1^2 < \infty$ and let
$$X_n = \sum_{i\in\mathbb{Z}} a_i\epsilon_{n-i}, \quad n \in \mathbb{Z} \tag{2.9}$$


where $a_i \in \mathbb{R}$ and $a_i = O\big(\exp(-ci)\big)$ as $i \to \infty$, $c \in (0,\infty)$. If $\epsilon_1$ has an integrable characteristic function, then $\{X_i\}_{i\in\mathbb{Z}}$ is strongly mixing (Chanda (1974), Gorodetskii (1977), Withers (1981), Athreya and Pantula (1986)).

Example 16.2.3: Let $\{X_i\}_{i\in\mathbb{Z}}$ be a zero mean stationary Gaussian process. Suppose that $\{X_i\}_{i\in\mathbb{Z}}$ has spectral density $f : (-\pi,\pi) \to [0,\infty)$, i.e.,
$$EX_0X_k = \int_{-\pi}^{\pi} e^{\iota kx}f(x)\,dx, \quad k \in \mathbb{Z}. \tag{2.10}$$
Then, $\alpha_X(n) \le \rho_X(n) \le 2\pi\alpha(n)$, $n \ge 1$ and, therefore, $\{X_i\}_{i\in\mathbb{Z}}$ is α-mixing iff it is ρ-mixing (Ibragimov and Rozanov (1978), Chapter 4). Further, $\{X_i\}_{i\in\mathbb{Z}}$ is α-mixing iff the spectral density $f$ admits the representation
$$f(t) = \big|p(e^{\iota t})\big|^2\exp\big(u(e^{\iota t}) + \tilde{v}(e^{\iota t})\big), \tag{2.11}$$
where $p(\cdot)$ is a polynomial, $u$ and $v$ are continuous real-valued functions on the unit circle in the complex plane, and $\tilde{v}$ is the conjugate function of $v$. It is also known that if the Gaussian process $\{X_i\}_{i\in\mathbb{Z}}$ is φ-mixing, then it is necessarily m-dependent for some $m \in \mathbb{Z}_+$. Thus, for Gaussian processes, the condition of α-mixing is as strong as ρ-mixing and the conditions of φ-mixing and Ψ-mixing are equivalent to m-dependence. See Ibragimov and Rozanov (1978) for more details.
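For σ-algebras generated by finitely-valued random variables, the coefficients in (2.1) and (2.2) can be computed by brute force. The sketch below is an illustration, not part of the text: it takes a joint probability mass function of a pair $(X, Y)$ as a dictionary, computes $\alpha(\sigma\langle X\rangle, \sigma\langle Y\rangle)$ by enumerating all events $\{X \in S\}$, $\{Y \in T\}$, and computes $\beta$ using the fact that in the finite case the supremum in (2.2) is attained at the partitions into atoms $\{X = x\}$, $\{Y = y\}$. The function names and the example distribution are hypothetical.

```python
from itertools import chain, combinations


def subsets(values):
    """All subsets of a finite collection, as tuples."""
    values = list(values)
    return chain.from_iterable(combinations(values, r) for r in range(len(values) + 1))


def marginals(joint):
    """Marginal pmfs of X and Y from a joint pmf {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return px, py


def alpha_coefficient(joint):
    """sup |P(A and B) - P(A)P(B)| over A = {X in S}, B = {Y in T}, as in (2.1)."""
    px, py = marginals(joint)
    best = 0.0
    for s in subsets(px):
        for t in subsets(py):
            p_joint = sum(joint.get((x, y), 0.0) for x in s for y in t)
            p_prod = sum(px[x] for x in s) * sum(py[y] for y in t)
            best = max(best, abs(p_joint - p_prod))
    return best


def beta_coefficient(joint):
    """(2.2) evaluated at the finest partitions, i.e., the atoms of X and Y."""
    px, py = marginals(joint)
    return 0.5 * sum(abs(joint.get((x, y), 0.0) - px[x] * py[y]) for x in px for y in py)


dependent = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
alpha_dep, beta_dep = alpha_coefficient(dependent), beta_coefficient(dependent)
print(alpha_dep, beta_dep, alpha_coefficient(independent), beta_coefficient(independent))
```

In the dependent example $\alpha = 0.15$ and $\beta = 0.30$, consistent with (2.4), while both coefficients vanish for the independent pair.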

16.2.2 Coupling and covariance inequalities

The mixing coefficients can be seen as measures of deviations from independence. The idea of coupling is to construct independent copies of a given pair of random vectors on a suitable probability space such that the Euclidean distance between these copies admits a bound in terms of the mixing coefficient between the (σ-algebras generated by the) random vectors. Thus, coupling gives a geometric interpretation of the mixing coefficients. The first result is for β-mixing random vectors.

Theorem 16.2.1: (Berbee's theorem). Let $(X, Y)$ be a random vector on a probability space $(\Omega_0, \mathcal{F}_0, P_0)$ such that $X$ takes values in $\mathbb{R}^d$ and $Y$ in $\mathbb{R}^s$, $d, s \in \mathbb{N}$. Then, there exist an enlarged probability space $(\Omega, \mathcal{F}, P)$ and an $s$-dimensional random vector $Y^*$ such that

(i) $(X, Y, Y^*)$ are defined on $(\Omega, \mathcal{F}, P)$,

(ii) $Y^*$ is independent of $X$ under $P$ and $(X, Y)$ have the same distribution under $P$ and $P_0$,

(iii) $P(Y \ne Y^*) = \beta(\sigma\langle X\rangle, \sigma\langle Y\rangle)$.


Proof: See Corollary 4.2.5 of Berbee (1979).

A weaker version of the above result is available for α-mixing random variables, where the difference between $Y$ and its independent copy admits a bound in terms of the α-mixing coefficient.

Theorem 16.2.2: (Bradley's theorem). In Theorem 16.2.1, assume $s = 1$ and $0 < E|Y|^\gamma < \infty$ for some $0 < \gamma < \infty$. Then, for all $0 < y \le (E|Y|^\gamma)^{1/\gamma}$,
$$P\big(|Y - Y^*| \ge y\big) \le 18\,\big(\alpha(\sigma\langle X\rangle, \sigma\langle Y\rangle)\big)^{2\gamma/(1+2\gamma)}\,\big(E|Y|^\gamma\big)^{1/(1+2\gamma)}\,y^{-\gamma/(1+2\gamma)}. \tag{2.12}$$

Proof: See Theorem 3 of Bradley (1983).

Next, some bounds on the covariance between mixing random variables are established. These will be useful for deriving limit theorems for sums of mixing random variables. For a random variable $X$, define the function
$$Q_X(u) = \inf\big\{t : P(|X| > t) \le u\big\}, \quad u \in (0,1). \tag{2.13}$$
Thus, $Q_X(u)$ is the quantile function of $|X|$ at $(1 - u)$.

Theorem 16.2.3: (Rio's inequality). Let $X$ and $Y$ be two random variables with $\int_0^1 Q_X(u)Q_Y(u)\,du < \infty$. Then,
$$\big|\mathrm{Cov}(X,Y)\big| \le 2\int_0^{2\alpha} Q_X(u)Q_Y(u)\,du \tag{2.14}$$
where $\alpha = \alpha(\sigma\langle X\rangle, \sigma\langle Y\rangle)$.

Proof: By Tonelli's theorem,
$$\begin{aligned}
EX^+Y^+ &= E\Big(\int_0^{X^+} du\int_0^{Y^+} dv\Big) \\
&= E\Big(\int_0^\infty\int_0^\infty I(X > u)I(Y > v)\,du\,dv\Big) \\
&= \int_0^\infty\int_0^\infty P(X > u, Y > v)\,du\,dv
\end{aligned}$$
and similarly, $EX^+ = \int_0^\infty P(X > u)\,du$. Hence, by (2.1), it follows that
$$\begin{aligned}
\big|\mathrm{Cov}(X^+, Y^+)\big| &= \Big|\int_0^\infty\int_0^\infty \big[P(X > u, Y > v) - P(X > u)P(Y > v)\big]\,du\,dv\Big| \\
&\le \int_0^\infty\int_0^\infty \min\big\{\alpha, P(X > u), P(Y > v)\big\}\,du\,dv.
\end{aligned} \tag{2.15}$$


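Next note that for any real numbers $a, b, c, d$,
$$\begin{aligned}
&(\alpha \wedge a \wedge c) + (\alpha \wedge a \wedge d) + (\alpha \wedge b \wedge c) + (\alpha \wedge b \wedge d) \\
&\quad\le [2(\alpha \wedge a)] \wedge (c + d) + [2(\alpha \wedge b)] \wedge (c + d) \\
&\quad\le 2[2\alpha \wedge (a + b) \wedge (c + d)].
\end{aligned} \tag{2.16}$$
Now using (2.15), (2.16), and the identity $\mathrm{Cov}(X,Y) = \mathrm{Cov}(X^+,Y^+) + \mathrm{Cov}(X^-,Y^-) - \mathrm{Cov}(X^+,Y^-) - \mathrm{Cov}(X^-,Y^+)$, one gets
$$\big|\mathrm{Cov}(X,Y)\big| \le 2\int_0^\infty\int_0^\infty \min\big\{2\alpha, P(|X| > u), P(|Y| > v)\big\}\,du\,dv. \tag{2.17}$$
Hence, it is enough to show that the right sides of (2.14) and (2.17) agree. To that end, let $U$ be a Uniform (0,1) random variable and define $(W_1, W_2) = (0,0)I(U \ge 2\alpha) + \big(Q_X(U), Q_Y(U)\big)I(U < 2\alpha)$. Then
$$EW_1W_2 = \int_0^{2\alpha} Q_X(u)Q_Y(u)\,du. \tag{2.18}$$
On the other hand, noting that $Q_X(a) > t$ iff $P(|X| > t) > a$, one has
$$\begin{aligned}
EW_1W_2 &= \int_0^\infty\int_0^\infty P\big(W_1 > u, W_2 > v\big)\,du\,dv \\
&= \int_0^\infty\int_0^\infty P\big(U < 2\alpha, Q_X(U) > u, Q_Y(U) > v\big)\,du\,dv \\
&= \int_0^\infty\int_0^\infty \min\big\{2\alpha, P(|X| > u), P(|Y| > v)\big\}\,du\,dv.
\end{aligned}$$
Hence, the theorem follows from (2.17), (2.18), and the above identity. □

Corollary 16.2.4: Let $X$ and $Y$ be two random variables with $\alpha(\sigma\langle X\rangle, \sigma\langle Y\rangle) = \alpha \in [0,1]$.

(i) (Davydov's inequality). Suppose that $E|X|^p < \infty$, $E|Y|^q < \infty$ for some $p, q \in (1,\infty)$ with $\frac{1}{p} + \frac{1}{q} < 1$. Then, $E|XY| < \infty$ and
$$\big|\mathrm{Cov}(X,Y)\big| \le 2r(2\alpha)^{1/r}\big(E|X|^p\big)^{1/p}\big(E|Y|^q\big)^{1/q}, \tag{2.19}$$
where $\frac{1}{r} = 1 - \Big(\frac{1}{p} + \frac{1}{q}\Big)$.

(ii) If $P(|X| \le c_1) = 1 = P(|Y| \le c_2)$ for some constants $c_1, c_2 \in (0,\infty)$, then
$$\big|\mathrm{Cov}(X,Y)\big| \le 4c_1c_2\alpha. \tag{2.20}$$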


Proof: Let $a = (E|X|^p)^{1/p}$ and $b = (E|Y|^q)^{1/q}$. W.l.o.g., suppose that $a, b \in (0,\infty)$. Then, by Markov's inequality, for any $0 < u < 1$,
$$P\big(|X| > au^{-1/p}\big) \le \frac{E|X|^p}{(au^{-1/p})^p} = u$$
and hence, $Q_X(u) \le au^{-1/p}$. Similarly, $Q_Y(u) \le bu^{-1/q}$, $0 < u < 1$. Hence, by Theorem 16.2.3,
$$\big|\mathrm{Cov}(X,Y)\big| \le 2\int_0^{2\alpha} ab\,u^{-1/p-1/q}\,du = 2ab\,(2\alpha)^{1-1/p-1/q}\Big/\Big(1 - \frac{1}{p} - \frac{1}{q}\Big),$$
which is equivalent to (2.19).

The proof of (2.20) is a direct consequence of Rio's inequality and the bounds $Q_X(u) \le c_1$ and $Q_Y(u) \le c_2$ for all $0 < u < 1$. □
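The chain of bounds $|\mathrm{Cov}(X,Y)| \le 2\int_0^{2\alpha} Q_XQ_Y\,du \le 4c_1c_2\alpha$ can be checked numerically for a pair of finitely-valued random variables. The sketch below is illustrative only: the joint distribution is a made-up example, $\alpha$ is computed by enumerating events as in (2.1), $Q_X$ follows (2.13), and the integral in (2.14) is approximated by a midpoint rule. All function names are hypothetical.

```python
from itertools import chain, combinations


def subsets(values):
    values = list(values)
    return chain.from_iterable(combinations(values, r) for r in range(len(values) + 1))


def alpha_coefficient(joint, px, py):
    """sup |P(X in S, Y in T) - P(X in S) P(Y in T)| over all S, T, as in (2.1)."""
    best = 0.0
    for s in subsets(px):
        for t in subsets(py):
            p_joint = sum(joint.get((x, y), 0.0) for x in s for y in t)
            best = max(best, abs(p_joint - sum(px[x] for x in s) * sum(py[y] for y in t)))
    return best


def quantile_q(pmf, u):
    """Q(u) = inf{t : P(|X| > t) <= u}; for a finite pmf the infimum is 0 or a value of |X|."""
    for t in sorted({0.0} | {abs(x) for x in pmf}):
        if sum(p for x, p in pmf.items() if abs(x) > t) <= u:
            return t
    raise ValueError("pmf must sum to one")


def rio_bound(px, py, alpha, steps=2000):
    """Midpoint-rule approximation of 2 * integral_0^{2 alpha} Q_X(u) Q_Y(u) du."""
    h = 2.0 * alpha / steps
    return 2.0 * h * sum(quantile_q(px, (i + 0.5) * h) * quantile_q(py, (i + 0.5) * h)
                         for i in range(steps))


# Hypothetical joint pmf of (X, Y): X takes values -2, 0, 1 and Y takes values -1, 1.
joint = {(-2, -1): 0.25, (-2, 1): 0.05, (0, -1): 0.15, (0, 1): 0.15, (1, -1): 0.10, (1, 1): 0.30}
px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0.0) + p
    py[y] = py.get(y, 0.0) + p

mean_x = sum(x * p for x, p in px.items())
mean_y = sum(y * p for y, p in py.items())
cov = sum(x * y * p for (x, y), p in joint.items()) - mean_x * mean_y
alpha = alpha_coefficient(joint, px, py)
c1, c2 = max(abs(x) for x in px), max(abs(y) for y in py)
rio = rio_bound(px, py, alpha)
print(cov, alpha, rio, 4 * c1 * c2 * alpha)
```

The printed covariance should not exceed the Rio bound, which in turn should not exceed $4c_1c_2\alpha$.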

16.3 Central limit theorems for mixing sequences

In this section, CLTs for sequences of random variables satisfying different mixing conditions are proved.

Proposition 16.3.1: Let $\{X_i\}_{i\in\mathbb{Z}}$ be a collection of random variables with strong mixing coefficient $\alpha(\cdot)$.

(i) Suppose that $\sum_{n=1}^\infty \alpha(n) < \infty$ and for some $c \in (0,\infty)$, $P(|X_i| \le c) = 1$ for all $i$. Then,
$$\sum_{n=1}^\infty \mathrm{Cov}(X_1, X_{n+1}) \ \text{ converges absolutely.} \tag{3.1}$$

(ii) Suppose that $\sum_{n=1}^\infty \alpha(n)^{\delta/(2+\delta)} < \infty$ and $\sup_{i\in\mathbb{Z}} E|X_i|^{2+\delta} < \infty$ for some $\delta \in (0,\infty)$. Then, (3.1) holds.

Proof: A direct consequence of Corollary 16.2.4. □

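Next suppose that the collection of random variables $\{X_i\}_{i\in\mathbb{Z}}$ is stationary and that $\mathrm{Var}(X_1) + \sum_{n=1}^\infty \big|\mathrm{Cov}(X_1, X_{1+n})\big| < \infty$. Then by the DCT,
$$\begin{aligned}
n\mathrm{Var}(\bar{X}_n) &= n^{-1}\mathrm{Var}\Big(\sum_{i=1}^n X_i\Big) \\
&= n^{-1}\Big[\sum_{i=1}^n \mathrm{Var}(X_i) + 2\sum_{1\le i<j\le n} \mathrm{Cov}(X_i, X_j)\Big] \\
&= n^{-1}\Big[n\mathrm{Var}(X_1) + 2\sum_{i=1}^{n-1}\sum_{k=1}^{n-i} \mathrm{Cov}(X_i, X_{i+k})\Big] \\
&= n^{-1}\Big[n\mathrm{Var}(X_1) + 2\sum_{k=1}^{n-1}(n-k)\mathrm{Cov}(X_1, X_{1+k})\Big] \\
&\longrightarrow \sigma_\infty^2 \equiv \mathrm{Var}(X_1) + 2\sum_{k=1}^\infty \mathrm{Cov}(X_1, X_{1+k}) \quad \text{as } n \to \infty.
\end{aligned} \tag{3.2}$$
In particular, under the conditions of part (i) or part (ii) of Proposition 16.3.1, $\lim_{n\to\infty} \mathrm{Var}(\sqrt{n}\,\bar{X}_n)$ exists and equals $\sigma_\infty^2$.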
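The last equality in (3.2) gives an exact formula for $n\mathrm{Var}(\bar{X}_n)$ in terms of the autocovariances $\gamma(k) = \mathrm{Cov}(X_1, X_{1+k})$, which makes the convergence to $\sigma_\infty^2$ easy to inspect. The sketch below is illustrative: the two autocovariance functions are assumed examples (a geometrically decaying one, for which $\sigma_\infty^2 = 1/(1-\phi)^2$, and that of the 1-dependent sequence $X_i = \epsilon_i - \epsilon_{i+1}$ with iid unit-variance $\epsilon_i$, for which $\sigma_\infty^2 = 0$).

```python
def n_var_mean(autocov, n):
    """n * Var(X_bar_n) = gamma(0) + 2 * sum_{k=1}^{n-1} (1 - k/n) * gamma(k), as in (3.2)."""
    return autocov(0) + 2.0 * sum((1.0 - k / n) * autocov(k) for k in range(1, n))


def geometric_autocov(phi):
    """gamma(k) = phi**k / (1 - phi**2); an assumed example with sigma_inf^2 = 1 / (1 - phi)**2."""
    return lambda k: phi ** k / (1.0 - phi ** 2)


def differenced_noise_autocov(k):
    """X_i = e_i - e_{i+1} with iid unit-variance e_i: gamma(0) = 2, gamma(1) = -1, else 0."""
    return {0: 2.0, 1: -1.0}.get(k, 0.0)


for n in (10, 100, 1000):
    print(n, n_var_mean(geometric_autocov(0.5), n), n_var_mean(differenced_noise_autocov, n))

limit_geometric = n_var_mean(geometric_autocov(0.5), 1000)  # should be near 1 / (1 - 0.5)**2 = 4
limit_degenerate = n_var_mean(differenced_noise_autocov, 1000)  # equals 2 / n
```

The second example shows that the limit $\sigma_\infty^2$ can vanish even though each $X_i$ has positive variance.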

In general, it is not guaranteed that $\sigma_\infty^2 > 0$. Indeed, it is not difficult to construct an example of a stationary strong mixing sequence $\{X_n\}_{n\ge 1}$ such that $\sigma_\infty^2 = 0$ (Problem 16.8). However, in addition to the conditions of Proposition 16.3.1, if it is assumed that $\sigma_\infty^2 > 0$, then a CLT for $\sqrt{n}(\bar{X}_n - EX_1)$ holds in the stationary case; see Corollary 16.3.3 and 16.3.6 below.

A classical method of proving the CLT (and other limit theorems) for mixing random variables is based on the idea of blocking, introduced by S. N. Bernstein. Intuitively, the 'blocking' approach can be described as follows: Suppose $\mu = EX_1 = 0$. First, write the sum $\sum_{i=1}^n X_i$ in terms of alternating sums of 'big blocks' $B_i$ (of length $p$, say) and 'little blocks' $L_i$ (of length $q$, say) as
$$\begin{aligned}
\sum_{i=1}^n X_i &= \big(X_1 + \cdots + X_p\big) + \big(X_{p+1} + \cdots + X_{p+q}\big) + \big(X_{p+q+1} + \cdots + X_{2p+q}\big) + \cdots \\
&= B_1 + L_1 + B_2 + L_2 + \cdots + (B_K + L_K) + R_n,
\end{aligned}$$
where the last term $R_n$ is the excess (if any) over the last complete pair of big- and little-blocks $(B_K, L_K)$. Next, group together the $B_i$'s and $L_i$'s to write
$$\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i = \frac{1}{\sqrt{n}}\sum_{j=1}^K B_j + \frac{1}{\sqrt{n}}\sum_{j=1}^K L_j + R_n/\sqrt{n}. \tag{3.3}$$
If $q \ll p \ll n$, then the number of $X_i$'s in $\sum_{j=1}^K L_j$ and in $R_n$ are of smaller order than $n$, the total number of $X_i$'s. Using this, one can show that the contribution of the last two terms in (3.3) to the limit is negligible, i.e.,
$$\frac{1}{\sqrt{n}}\Big(\sum_{j=1}^K L_j + R_n\Big) \longrightarrow_p 0.$$
To handle the first term, $\frac{1}{\sqrt{n}}\sum_{j=1}^K B_j$, note that the $B_j$'s are functions of disjoint collections of $X_j$'s that are separated by a distance of $q$ or more. By letting $q \to \infty$ suitably and using the mixing condition, one can replace the $B_j$'s by their independent copies, and appeal to the Lindeberg CLT for sums of independent random variables to conclude that
$$\frac{1}{\sqrt{n}}\sum_{j=1}^K B_j \longrightarrow_d N\big(0, \sigma_\infty^2\big).$$
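The bookkeeping behind the big-block/little-block decomposition can be sketched in a few lines. This is only an illustration of the index arithmetic in (3.3), with a hypothetical function name and placeholder data; it says nothing about how $p$ and $q$ should grow with $n$.

```python
def block_decomposition(xs, p, q):
    """Split xs into sums over big blocks (length p), little blocks (length q), and a remainder."""
    n = len(xs)
    pairs = n // (p + q)  # K: the number of complete (big, little) pairs
    big, little = [], []
    for j in range(pairs):
        start = j * (p + q)
        big.append(sum(xs[start:start + p]))                # B_{j+1}
        little.append(sum(xs[start + p:start + p + q]))     # L_{j+1}
    remainder = sum(xs[pairs * (p + q):])                   # R_n
    return big, little, remainder


xs = list(range(1, 24))  # placeholder data standing in for X_1, ..., X_23
big, little, remainder = block_decomposition(xs, p=5, q=2)
print(big, little, remainder)  # [15, 50, 85] [13, 27, 41] 45
```

The three pieces always add back up to $\sum_{i=1}^n X_i$, which is the identity underlying (3.3).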

Although the blocking approach is described here for stationary random variables, with minor modifications, it is applicable to certain nonstationary sequences as shown below.

Theorem 16.3.2: Let $\{X_n\}_{n\ge 1}$ be a sequence of random variables (not necessarily stationary) with strong mixing coefficient $\alpha(\cdot)$. Suppose that there exist constants $\sigma_0^2, c \in (0,\infty)$ such that
$$P(|X_i| \le c) = 1 \quad \text{for all } i \in \mathbb{N}, \tag{3.4}$$
$$\gamma_n \equiv \sup_{j\ge 1}\Big|n^{-1}\mathrm{Var}\Big(\sum_{i=j}^{j+n-1} X_i\Big) - \sigma_0^2\Big| \to 0 \quad \text{as } n \to \infty, \tag{3.5}$$
and that
$$\sum_{n=1}^\infty \alpha(n) < \infty. \tag{3.6}$$
Then,
$$\sqrt{n}\big(\bar{X}_n - \bar{\mu}_n\big) \longrightarrow_d N(0, \sigma_0^2) \quad \text{as } n \to \infty \tag{3.7}$$
where $\bar{\mu}_n = E\bar{X}_n$, and $\bar{X}_n = n^{-1}\sum_{i=1}^n X_i$, $n \ge 1$.

An important special case of Theorem 16.3.2 is the following:

Corollary 16.3.3: If $\{X_n\}_{n\ge 1}$ is a sequence of stationary bounded random variables with $\sum_{n=1}^\infty \alpha(n) < \infty$, and if $\sigma_\infty^2$ of (3.2) is positive, then, with $\mu = EX_1$,
$$\sqrt{n}\big(\bar{X}_n - \mu\big) \longrightarrow_d N(0, \sigma_\infty^2) \quad \text{as } n \to \infty. \tag{3.8}$$

Proof: For stationary random variables, (3.5) holds with $\sigma_0^2 = \sigma_\infty^2$ (cf. (3.2)). Hence, the Corollary follows from Theorem 16.3.2. □

For proving the theorem, the following auxiliary result will be used.

Lemma 16.3.4: Suppose that the conditions of Theorem 16.3.2 hold. Then,
$$\sup_{m\ge 1} E\Big(\sum_{i=m}^{m+n-1} (X_i - EX_i)\Big)^4 = o(n^3) \quad \text{as } n \to \infty.$$


Proof: W.l.o.g., let $EX_i = 0$ for all $i$. Note that for any $m \in \mathbb{N}$,
$$\begin{aligned}
E\Big(\sum_{j=m}^{n+m-1} X_j\Big)^4 &= \sum_j EX_j^4 + 6\sum_{i<j} EX_i^2X_j^2 + 4\sum_{i\ne j} EX_i^3X_j \\
&\quad + 6\sum_{i\ne j\ne k} EX_i^2X_jX_k + \sum_{i\ne j\ne k\ne \ell} EX_iX_jX_kX_\ell \\
&\equiv I_{1n} + \cdots + I_{5n}, \ \text{say,}
\end{aligned} \tag{3.9}$$
where the indices $i, j, k, \ell$ in the above sums lie between $m$ and $m + n - 1$. By (3.4),
$$|I_{1n}| + |I_{2n}| + |I_{3n}| \le n \cdot c^4 + 7n(n-1)c^4 \le 7n^2c^4. \tag{3.10}$$
By Corollary 16.2.4 (ii), noting that $EX_i = 0$ for all $i$,
$$|I_{4n}| \le 12\sum_{i<j\,\cdots}\Big(\big|EX_i^2X_jX_k\big| + \big|EX_iX_j^2X_k\big| + \big|EX_iX_jX_k^2\big|\Big)$$

[...]

$0] = 1 - \exp[(-2\mu/\sigma^2)x]$, $0 < x < \infty$.

(b) $\lim_t P\big[\sup_x |A(x,t) - A(x)| > \epsilon \mid Z(t) > 0\big] = 0$ for any $\epsilon > 0$, where $A(\cdot,t)$ and $A(x)$ are as in Theorem 18.3.2 (d) with $\alpha = 0$.

Theorem 18.3.4: (Subcritical case). Let $m < 1$. Then for any initial $Z_0 \ne 0$, $G(\cdot)$ nonlattice (cf. Chapter 10),
$$\lim_{t\to\infty} P[Z(t) = j \mid Z(t) > 0] = \pi_j \tag{3.5}$$
exists for all $j \ge 1$ and $\sum_{j=1}^\infty \pi_j = 1$.


18.4 Embedding of urn schemes in continuous time branching processes

It turns out that many urn schemes can be embedded in continuous time branching processes. The case of Pólya's urn is discussed below. Recall that Pólya's urn scheme is the following. Let an urn have an initial composition of $R_0$ red and $B_0$ black balls. A draw consists of taking a ball at random from the urn, noting its color, and returning it to the urn with one more ball of the color drawn. Let $(R_n, B_n)$ denote the composition after $n$ draws. Clearly, $R_n + B_n = R_0 + B_0 + n$ for all $n \ge 0$ and $\{(R_n, B_n)\}_{n\ge 0}$ is a Markov chain.

Let $\{Z_i(t) : t \ge 0\}$, $i = 1, 2$ be two independent continuous time branching processes with unit exponential lifetimes and offspring distribution of binary splitting, i.e., $p_2 = 1$, and $Z_1(0) = R_0$, $Z_2(0) = B_0$. Let $\tau_0 = 0 < \tau_1 < \tau_2 < \ldots < \tau_n < \ldots$ denote the successive times of death of an individual in the combined population. Then the sequence $\{(Z_1(\tau_n), Z_2(\tau_n))\}_{n\ge 0}$ has the same distribution as $\{(R_n, B_n)\}_{n\ge 0}$.

To establish this claim, by the Markov property of $\{(Z_1(t), Z_2(t))\}_{t\ge 0}$, it suffices to show that $(Z_1(\tau_1), Z_2(\tau_1))$ has the same distribution as $(R_1, B_1)$. It is easy to show that if $\eta_i$, $i = 1, 2, \ldots, n$ are independent exponential random variables with parameters $\lambda_i$, $i = 1, 2, \ldots, n$, then $\eta \equiv \min\{\eta_i : 1 \le i \le n\}$ is an Exponential $(\sum_{i=1}^{n}\lambda_i)$ random variable and $P(\eta = \eta_i) = \lambda_i / (\sum_{j=1}^{n}\lambda_j)$ (Problem 18.8). This, in turn, leads to the fact that at time $\tau_1$, the probability that a split takes place in $\{Z_1(t) : t \ge 0\}$ is $\frac{Z_1(0)}{Z_1(0)+Z_2(0)}$. At this split, the parent is lost but is replaced by two new individuals, resulting in a net addition of one more individual, establishing the claim.

The same reasoning yields the embedding of the following general urn scheme. Let $X_n = (X_{n1}, \ldots, X_{nk})$ be the vector of the composition of an urn at time $n$, where $X_{ni}$ is the number of balls of color $i$. Assume that given $(X_0, X_1, \ldots, X_n)$, $X_{n+1}$ is generated as follows. Pick a ball at random from the urn. If it happens to be of color $i$, then return it to the urn along with a random number $\zeta_{ij}$ of balls of color $j = 1, 2, \ldots, k$, where the joint distribution of $\zeta_i \equiv (\zeta_{i1}, \zeta_{i2}, \ldots, \zeta_{ik})$ depends on $i$, $i = 1, 2, \ldots, k$. Now set $X_{n+1} = X_n + \zeta_i$.

The embedding is done as follows. Consider a continuous time multitype branching process $\{Z(t) : t \ge 0\}$ with Exponential (1) lifetimes in which the offspring distribution of the $i$th type is the same as that of $\tilde\zeta_i \equiv \zeta_i + \delta_i$, where $\zeta_i$ is as above and $\delta_i$ is the $i$th unit vector. For $i = 1, 2, \ldots, k$, let $\{Z_i(t) : t \ge 0\}$ be a branching process that evolves as $\{Z(t) : t \ge 0\}$ above but has initial size $Z_i(0) \equiv (0, 0, \ldots, X_{0i}, 0, \ldots, 0)$. Let $0 = \tau_0 < \tau_1 < \tau_2 < \ldots$ denote the times at which deaths occur in the process obtained by pooling all the $k$ processes. Then $\{(Z_i(\tau_n) : i = 1, 2, \ldots, k)\}$, $n = 0, 1, 2, \ldots$ has the same distribution as $\{(X_{ni}, i = 1, 2, \ldots, k)\}$, $n \ge 0$.


This embedding has been used to prove limit theorems for urn models. See Athreya and Ney (2004), Chapter 5, for details. For applications to clinical trials, see Rosenberger (2002).
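The Pólya urn embedding of Section 18.4 can be illustrated by simulation. The following is a minimal sketch, not taken from the text: the function names and parameter choices are hypothetical, and it assumes $R_0, B_0 \ge 1$. It uses the memoryless property of the exponential distribution, so that the time to the next death in a population of $z$ individuals with Exponential (1) lifetimes is Exponential ($z$).

```python
import random

def polya_direct(r0, b0, n_draws, rng):
    """Run the urn directly: draw a ball at random and return it
    together with one more ball of the same color."""
    r, b = r0, b0
    for _ in range(n_draws):
        if rng.random() < r / (r + b):
            r += 1
        else:
            b += 1
    return r, b

def polya_embedded(r0, b0, n_draws, rng):
    """Run two independent binary-splitting populations with Exp(1)
    lifetimes and record the composition at successive death times.
    Assumes r0 >= 1 and b0 >= 1 (an exponential rate must be positive)."""
    z1, z2 = r0, b0
    for _ in range(n_draws):
        # By memorylessness, the next death in a population of size z
        # occurs after an Exponential(z) waiting time.
        t1 = rng.expovariate(z1)
        t2 = rng.expovariate(z2)
        if t1 < t2:
            z1 += 1  # parent dies and is replaced by two: net gain of one
        else:
            z2 += 1
    return z1, z2

def mean_red_fraction(sampler, r0, b0, n_draws, n_runs, seed):
    """Average of R_n / (R_n + B_n) over independent runs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_runs):
        r, b = sampler(r0, b0, n_draws, rng)
        total += r / (r + b)
    return total / n_runs
```

Both samplers should give an average red fraction close to $R_0/(R_0+B_0)$, since $R_n/(R_n+B_n)$ is a martingale under the urn dynamics; comparing full histograms of the two outputs is a stronger empirical check of the claimed equality in distribution.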

18.5 Problems

18.1 Show that for any probability distribution $\{p_j\}_{j\ge 0}$, $f(s) = \sum_{j=0}^{\infty} p_j s^j$ is convex in [0,1]. Show also that there exists a $q \in [0, 1)$ such that $f(q) = q$ iff $m = f'(1-) > 1$.

18.2 Assume $\sum_{j=1}^{\infty} j^2 p_j < \infty$.

(a) Let $v_n = V(Z_n \mid Z_0 = 1)$. Show that $v_{n+1} = V(Z_n m \mid Z_0 = 1) + E(Z_n \sigma^2 \mid Z_0 = 1)$ where $m = E(Z_1 \mid Z_0 = 1)$ and $\sigma^2 = V(Z_1 \mid Z_0 = 1)$ and hence $v_{n+1} = m^2 v_n + \sigma^2 m^n$.

(b) Conclude from (a) that $\sup_n EW_n^2 < \infty$, where $W_n = Z_n/m^n$.

(c) Using the fact that $\{W_n\}_{n\ge 0}$ is a martingale, show that if $\sum j^2 p_j < \infty$ then $\{W_n\}$ converges w.p. 1 and in $L^2$ to a random variable $W$ such that $E(W \mid Z_0 = 1) = 1$.

18.3 By definition, the sequence $\{Z_n\}_{n\ge 0}$ of population sizes satisfies the random iteration scheme
\[ Z_{n+1} = \sum_{i=1}^{Z_n} \zeta_{ni} \]
where $\{\zeta_{ni}, i = 1, 2, \ldots, n = 1, 2, \ldots\}$ is a doubly infinite sequence of iid random variables with distribution $\{p_j\}$.

(a) (Independence of lines of descent). Establish the property that for any $k \ge 0$, if $Z_0 = k$ then $\{Z_n\}_{n\ge 0}$ has the same distribution as $\{\sum_{j=1}^{k} Z_n^{(j)}\}_{n\ge 0}$ where $\{Z_n^{(j)}\}_{n\ge 0}$, $j \ge 1$ are iid copies of $\{Z_n\}_{n\ge 0}$ with $Z_0 = 1$.

(b) In the context of Theorem 18.1.2, show that if $Z_0 = 1$ then $W \equiv \lim W_n$ can be represented as
\[ W = \frac{1}{m}\sum_{j=1}^{Z_1} W^{(j)} \]
where $Z_1$, $W^{(j)}$, $j = 1, 2, \ldots$ are all independent with $Z_1$ having distribution $\{p_j\}_{j\ge 0}$ and $\{W^{(j)}\}_{j\ge 1}$ are iid with distribution same as $W$.


(c) Let $\alpha \equiv \sum_{a_j \in D} P(W = a_j)$ where $D \equiv \{a_j\}$ is the set of values such that $P(W = a_j) > 0$. Show using (b) that $\alpha = f(\alpha)$ and conclude that if $\alpha < 1$, then $\alpha = q$ and hence that if $\alpha < 1$, then the distribution of $W$ conditional on $W > 0$ must be continuous.

(d) Let $\beta$ be the singular component of the distribution of $W$ in its Lebesgue decomposition. Show using (b) that $\beta$ satisfies $\beta \le f(\beta)$ and hence that if $\beta < 1$, then $\beta = P(W = 0)$ and the distribution of $W$ conditional on $W > 0$ must be absolutely continuous.

(e) Let $p_0 = 0$. Show that the distribution of $W$ is of the pure type, i.e., it is either purely discrete, purely singular continuous, or purely absolutely continuous.

18.4 (a) Show using Problem 18.3 (b) that if $W$ has a lattice distribution with span $d$, then $d$ must satisfy $d = md$ and hence $d = \infty$. Conclude that if $P(W = 0) < 1$, then the distribution of $W$ on $\{W > 0\}$ must be nonlattice.

(b) Let $p_0 = 0$ and $P(W = 0) = 0$. Use (a) to conclude that the characteristic function $\phi(t) \equiv E(e^{\iota t W})$ of $W$ satisfies $\sup_{1 \le |t| \le m} |\phi(t)| < 1$.

(c) Let $p_0 = 0$. Show that for any $0 \le s_0 < 1$, $\epsilon > 0$, $f^{(n)}(s_0) = O(\epsilon^n)$. (Hint: By the mean value theorem, $f^{(n)}(s) = \prod_{j=0}^{n-1} f'(f_j(s))$. Now use $f(0) = p_0$, $f_j(s) \to 0$ as $j \to \infty$.)

(d) Let $p_0 = 0$, $P(W = 0) = 0$. Show that for any $n \ge 1$, $\phi(m^n t) = f^{(n)}(\phi(t))$ and hence $\int_{-\infty}^{\infty} |\phi(u)|\,du < \infty$. Conclude that the distribution of $W$ is absolutely continuous.

18.5 In the multitype case for the martingale defined in (2.1), show that $\{W_n\}_{n\ge 0}$ is $L^2$ bounded if $E_i Z_{1j}^2 < \infty$ for all $i, j$, where $E_i$ denotes expectation when one starts with an individual of type $i$.

18.6 Let $m(\cdot)$ satisfy the integral equation (3.3).

(a) Show that $m_\alpha(t) \equiv m(t)e^{-\alpha t}$ satisfies the renewal equation
\[ m_\alpha(t) = \big(1 - G(t)\big)e^{-\alpha t} + \int_{(0,t]} m_\alpha(t-u)\,dG_\alpha(u) \]
where $G_\alpha(t) \equiv m\int_0^t e^{-\alpha u}\,dG(u)$, $t \ge 0$.

(b) Use the key renewal theorem of Section 8.5 to conclude that limt→∞ mα (t) exists and identify the limit.


(c) Assuming $\sum_{j=1}^{\infty} j^2 p_j < \infty$, show using the key renewal theorem of Section 8.5 that $\{W(t) : t \ge 0\}$ of Theorem 18.3.2 is $L^2$ bounded.

18.7 Consider an M/G/1 queue with Poisson arrivals and general service time. Let $Z_1$ be the number of customers that arrive during the service time of the first customer. Call these first generation customers. Let $Z_2$ be the number of customers that arrive during the time it takes to serve all the $Z_1$ customers. Call these second generation customers. For $n \ge 1$, let $Z_{n+1}$ denote the number of customers that arrive during the time it takes to serve all $Z_n$ of the $n$th generation customers.

(a) Show that $\{Z_n\}_{n\ge 0}$ is a BGW branching process as in Section 18.1.

(b) Find the offspring distribution $\{p_j\}_{j=0}^{\infty}$ and its mean $m$ in terms of the rate parameter $\lambda$ of the Poisson arrival process and the service time distribution $G(\cdot)$.

(c) Show that the queue size goes to $\infty$ with positive probability iff $m > 1$.

(d) Set up a functional equation for the moment generating function of the busy period $U$, i.e., the time interval between when the first service starts and when the server is idle for the first time.

18.8 Let $\{\eta_i : i = 1, 2, \ldots, n\}$ be independent exponential random variables with $E\eta_i = \lambda_i^{-1}$, $i = 1, 2, \ldots, n$. Let $\eta \equiv \min\{\eta_i : 1 \le i \le n\}$. Show that $\eta$ has an exponential distribution with $E\eta = \big(\sum_{i=1}^{n}\lambda_i\big)^{-1}$ and that $P(\eta = \eta_j) = \lambda_j / \sum_{i=1}^{n}\lambda_i$.

18.9 Using the embedding outlined in Section 18.4 for the Pólya urn scheme, show that $Y_n \equiv \frac{R_n}{R_n + B_n} \to Y$ w.p. 1 and that $Y$ can be represented as
\[ Y = \frac{\sum_{i=1}^{R_0} X_i}{\sum_{j=1}^{R_0+B_0} X_j} \]
where $\{X_i\}_{i\ge 1}$ are iid exponential (1) random variables. Conclude that $Y$ has Beta $(R_0, B_0)$ distribution.

Appendix A Advanced Calculus: A Review

This Appendix is a brief review of elementary set theory, real numbers, limits, sequences and series, continuity, diﬀerentiability, Riemann integration, complex numbers, exponential and trigonometric functions, and metric spaces. For proofs and further details, see Rudin (1976) and Royden (1988).

A.1 Elementary set theory This section reviews the following: sets, set operations, product sets (ﬁnite and inﬁnite), equivalence relation, axiom of choice, countability, and uncountability. Deﬁnition A.1.1: A set is a collection of objects. It is typically deﬁned as a collection of objects with a common deﬁning property. For example, the collection of even integers can be written as E ≡ {n : n is an even integer}. In general, a set Ω with deﬁning property p is written as Ω = {ω : ω has property p}. The individual elements are denoted by the small letters ω, a, x, s, t, etc., and the sets by capital letters Ω, A, X, S, T , etc. Example A.1.1: The closed interval [0, 1] ≡ {x : x a real number, 0 ≤ x ≤ 1}.


Example A.1.2: The set of positive rationals $\equiv \{x : x = \frac{m}{n}, \ m \text{ and } n \text{ positive integers}\}$.

Example A.1.3: The set of polynomials in $x$ of degree 10 $\equiv \{P(x) : P(x) = \sum_{j=0}^{10} a_j x^j, \ a_j \text{ real}, \ j = 0, \ldots, 10, \ a_{10} \ne 0\}$.

Example A.1.4: The set of all polynomials in $x$ $\equiv \{P(x) : P(x) = \sum_{j=0}^{n} a_j x^j, \ n \text{ a nonnegative integer}, \ a_j \text{ real}, \ j = 0, 1, 2, \ldots, n\}$.

A.1.1 Set operations

Definition A.1.2: Let A be a set. A set B is called a subset of A and written as B ⊂ A if every element of B is also an element of A. Two sets A and B are the same and written as A = B if each is a subset of the other. A subset A ⊂ Ω is called empty and denoted by ∅ if there exists no ω in Ω such that ω ∈ A. Using the mathematical notation ∈ and ⇒, one writes B ⊂ A if x ∈ B ⇒ x ∈ A. Here '∈' means "belongs to" and ⇒ means "implies."

Example A.1.5: Let N be the set of natural numbers, i.e., N = {1, 2, 3, . . .}. Let E be the set of even natural numbers, i.e., E = {n : n ∈ N, n = 2k for some k ∈ N}. Then E ⊂ N.

Example A.1.6: Let A = [0, 1] and B be the set of x in A such that $x^2 < \frac{1}{4}$. Then $B = \{x : 0 \le x < \frac{1}{2}\} \subset A$.

Definition A.1.3: (Intersection and union). Let $A_1$, $A_2$ be subsets of a set Ω. Then $A_1$ union $A_2$, written as $A_1 \cup A_2$, is the set defined by $A_1 \cup A_2 = \{\omega : \omega \in A_1 \text{ or } \omega \in A_2 \text{ or both}\}$. Similarly, $A_1$ intersection $A_2$, written as $A_1 \cap A_2$, is the set defined by $A_1 \cap A_2 = \{\omega : \omega \in A_1 \text{ and } \omega \in A_2\}$.

Example A.1.7: Let Ω ≡ N ≡ {1, 2, 3, . . .},
\[ A_1 = \{\omega : \omega = 3k \text{ for some } k \in \mathbb{N}\} \quad\text{and}\quad A_2 = \{\omega : \omega = 5k \text{ for some } k \in \mathbb{N}\}. \]
Then $A_1 \cup A_2 = \{\omega : \omega$ is divisible by at least one of the two integers 3 and 5$\}$, $A_1 \cap A_2 = \{\omega : \omega$ is divisible by both 3 and 5$\}$.
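Example A.1.7 can be checked on a finite window of N. The sketch below is only an illustration: the cutoff 60 and the variable names are arbitrary choices, not part of the text.

```python
# Finite window {1, ..., 60} of the natural numbers; the cutoff is arbitrary.
omega = set(range(1, 61))

a1 = {w for w in omega if w % 3 == 0}   # multiples of 3
a2 = {w for w in omega if w % 5 == 0}   # multiples of 5

union = a1 | a2          # divisible by at least one of 3 and 5
intersection = a1 & a2   # divisible by both 3 and 5

# Being divisible by both 3 and 5 is the same as being divisible by 15.
assert intersection == {w for w in omega if w % 15 == 0}
# Both sets are subsets of omega, as in Definition A.1.2.
assert a1 <= omega and a2 <= omega

print(sorted(intersection))  # [15, 30, 45, 60]
```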


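Definition A.1.4: Let Ω and I be nonempty sets. Let $\{A_\alpha : \alpha \in I\}$ be a collection of subsets of Ω. Then I is called the index set. The union of $\{A_\alpha : \alpha \in I\}$ is defined as
\[ \bigcup_{\alpha\in I} A_\alpha \equiv \{\omega : \omega \in A_\alpha \text{ for some } \alpha \in I\}. \]
The intersection of $\{A_\alpha : \alpha \in I\}$ is defined as
\[ \bigcap_{\alpha\in I} A_\alpha \equiv \{\omega : \omega \in A_\alpha \text{ for every } \alpha \in I\}. \]

Definition A.1.5: (Complement of a set). Let A ⊂ Ω. Then the complement of the set A, written as $A^c$ (or $\tilde A$), is defined by $A^c \equiv \{\omega : \omega \notin A\}$.

Example A.1.8: If Ω = N and A is the set of all integers that are divisible by 2, then $A^c$ is the set of all odd integers, i.e., $A^c = \{1, 3, 5, 7, \ldots\}$.

Proposition A.1.1: (DeMorgan's law). For any collection $\{A_\alpha : \alpha \in I\}$ of subsets of Ω, $(\cup_{\alpha\in I} A_\alpha)^c = \cap_{\alpha\in I} A_\alpha^c$ and $(\cap_{\alpha\in I} A_\alpha)^c = \cup_{\alpha\in I} A_\alpha^c$.

Proof: To show that two sets A and B are the same, it suffices to show that ω ∈ A ⇒ ω ∈ B and ω ∈ B ⇒ ω ∈ A. Let $\omega \in (\cup_{\alpha\in I} A_\alpha)^c$. Then
\[ \omega \notin \cup_{\alpha\in I} A_\alpha \ \Rightarrow\ \omega \notin A_\alpha \text{ for any } \alpha \in I \ \Rightarrow\ \omega \in A_\alpha^c \text{ for each } \alpha \in I \ \Rightarrow\ \omega \in \bigcap_{\alpha\in I} A_\alpha^c. \]
Thus $(\cup_{\alpha\in I} A_\alpha)^c \subset \cap_{\alpha\in I} A_\alpha^c$. The opposite inclusion and the second identity are similarly proved. □

Definition A.1.6: (Product sets). Let $\Omega_1$ and $\Omega_2$ be two nonempty sets. Then the product set of $\Omega_1$ and $\Omega_2$, denoted by $\Omega \equiv \Omega_1 \times \Omega_2$, consists of all ordered pairs $(\omega_1, \omega_2)$ such that $\omega_1 \in \Omega_1$, $\omega_2 \in \Omega_2$. Note that if $\Omega_1 = \Omega_2$ and $\omega_1 \ne \omega_2$, then the pair $(\omega_1, \omega_2)$ is not the same as $(\omega_2, \omega_1)$, i.e., the order is important.

Example A.1.9: $\Omega_1 = [0, 1]$, $\Omega_2 = [2, 3]$. Then $\Omega_1 \times \Omega_2 = \{(x, y) : 0 \le x \le 1, \ 2 \le y \le 3\}$.

Definition A.1.7: (Finite products). If $\Omega_i$, $i = 1, 2, \ldots, k$ are nonempty sets, then $\Omega = \Omega_1 \times \Omega_2 \times \ldots \times \Omega_k$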


is the set of all ordered $k$ vectors $(\omega_1, \omega_2, \ldots, \omega_k)$ where $\omega_i \in \Omega_i$. If $\Omega_i = \Omega_1$ for all $1 \le i \le k$, then $\Omega_1 \times \Omega_2 \times \ldots \times \Omega_k$ is written as $\Omega_1^{(k)}$ or $\Omega_1^k$.

Definition A.1.8: (Infinite products). Let $\{\Omega_\alpha : \alpha \in I\}$ be an infinite collection of nonempty sets. Then $\times_{\alpha\in I}\Omega_\alpha$, the product set, is defined as $\{f : f$ is a function defined on I such that for each $\alpha$, $f(\alpha) \in \Omega_\alpha\}$. If $\Omega_\alpha = \Omega$ for all $\alpha \in I$, then $\times_{\alpha\in I}\Omega_\alpha$ is also written as $\Omega^I$.

It is a basic axiom of set theory, known as the axiom of choice (A.C.), that this space is nonempty. That is, given an arbitrary collection of nonempty sets, it is possible to form a parliament with one representative from each set. For a long time it was thought this should follow from the other axioms of set theory. But it is shown in Cohen (1966) that it is an independent axiom. That is, both the A.C. and its negation are consistent with the rest of the axioms of set theory. There are several equivalent versions of A.C. These are Zorn's lemma, Hausdorff's maximality principle, the 'Principle of Well Ordering,' and Tukey's lemma. For a proof of these equivalences, see Hewitt and Stromberg (1965).

Definition A.1.9: (Functions, countability and uncountability). A function f is a correspondence between the elements of a set X and another set Y and is written as f : X → Y. It satisfies the condition that for each x, there is a unique y in Y that corresponds to it and is denoted as y = f(x). The set X is called the domain of f and the set f(X), defined as f(X) ≡ {y : there exists x in X such that f(x) = y}, is called the range of f. It is possible that many x's may correspond to the same y and also there may exist y in Y for which there is no x such that f(x) = y. If f(X) is all of Y, then the map is called onto. If for each y in f(X), there is a unique x in X such that f(x) = y, then f is called (1–1) or one-to-one. If f is one-to-one and onto, then X and Y are said to have the same cardinality.

Definition A.1.10: Let f : X → Y be (1–1) and onto. Then, for each y in Y, there is a unique element x in X such that f(x) = y. This x is denoted as $f^{-1}(y)$. Note that in this case, $g(y) \equiv f^{-1}(y)$ is a (1–1) onto map from Y to X and is called the inverse of f.

Example A.1.10: Let X = N ≡ {1, 2, 3, . . .}. Let Y = {n : n = 2k, k ∈ N} be the set of even integers. Then the map f(x) = 2x is a (1–1) onto map from X to Y.

Example A.1.11: Let X be N and let P be the set of all prime numbers. Then X and P have the same cardinality.


Definition A.1.11: A set X is finite if there exists n ∈ N such that X and Y ≡ {1, 2, . . . , n} have the same cardinality, i.e., there exists a (1–1) onto map from Y to X. A set X is countable if X and N have the same cardinality, i.e., there exists a (1–1) onto map from N to X. A set X is uncountable if it is not finite or countable.

Example A.1.12: The set {0, 1, 2, . . . , 9} is finite, the set $\mathbb{N}^k$ (k ∈ N) is countable and $\mathbb{N}^{\mathbb{N}}$ is uncountable (Problem A.6).

Definition A.1.12: Let Ω be a nonempty set. Then the power set of Ω, denoted by P(Ω), is the collection of all subsets of Ω, i.e., P(Ω) ≡ {A : A ⊂ Ω}.

Remark A.1.1: P(N) is an uncountable set (Problem A.5).

A.1.2 The principle of induction

The set N of natural numbers has the well ordering property that every nonempty subset A of N has a smallest element s such that (i) s ∈ A and (ii) a ∈ A ⇒ a ≥ s. This property is one of the basic postulates in the deﬁnition of N. The principle of induction is a consequence of the well ordering property. It says the following: Let {P (n) : n ∈ N} be a collection of propositions (or statements). Suppose that (i) P (1) is true. (ii) For each n ∈ N, P (n) true ⇒ P (n + 1) true. Then, P (n) is true for all n ∈ N. See Problem A.9 for some examples.

A.1.3 Equivalence relations

Deﬁnition A.1.13: (a) Let Ω be a nonempty set. Let G be a nonempty subset of Ω × Ω. Write x ∼ y if (x, y) ∈ G and call it a relation deﬁned by G. (b) A relation deﬁned by G is an equivalence relation if (i) (reﬂexive) for all x in Ω, x ∼ x, i.e., (x, x) ∈ G; (ii) (symmetric) x ∼ y ⇒ y ∼ x, i.e., (x, y) ∈ G ⇒ (y, x) ∈ G; (iii) (transitive) x ∼ y, y ∼ z ⇒ x ∼ z, i.e., (x, y) ∈ G, (y, z) ∈ G ⇒ (x, z) ∈ G.


Example A.1.13: Let Ω = Z, the set of all integers, G = {(m, n) : m − n is divisible by 3}. Thus, m ∼ n if (m − n) is a multiple of 3. It is easy to verify that this is an equivalence relation.

Definition A.1.14: (Equivalence classes). Let Ω be a nonempty set. Let G define an equivalence relation on Ω. For each x in Ω, the set [x] ≡ {y : x ∼ y} is called the equivalence class generated by x.

Proposition A.1.2: Let $\mathcal{C}$ be the set of all equivalence classes in Ω generated by an equivalence relation defined by G. Then

(i) $C_1, C_2 \in \mathcal{C} \Rightarrow C_1 = C_2$ or $C_1 \cap C_2 = \emptyset$.

(ii) $\bigcup_{C\in\mathcal{C}} C = \Omega$.

Proof: (i) Suppose $C_1 \cap C_2 \ne \emptyset$. Then there exist $x_1, x_2, y$ such that $C_1 = [x_1]$, $C_2 = [x_2]$ and $y \in C_1 \cap C_2$. This implies $x_1 \sim y$, $x_2 \sim y$. But by symmetry $y \sim x_2$ and this implies by transitivity that $x_1 \sim x_2$, i.e., $x_2 \in C_1$, implying $C_2 \subset C_1$. Similarly, $C_1 \subset C_2$, i.e., $C_1 = C_2$.

(ii) For each x in Ω, (x, x) ∈ G and so [x] is not empty and x ∈ [x]. □

The above proposition says that every equivalence relation on Ω leads to a decomposition of Ω into equivalence classes that are disjoint and whose union is all of Ω. In the example given above, the set Z of all integers can be decomposed to three equivalence classes Cj ≡ {n : n = 3m + j for some m ∈ Z}, j = 0, 1, 2.
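The decomposition into equivalence classes can be computed explicitly on a finite set. The sketch below is an illustration under stated assumptions: the helper `equivalence_classes` is a hypothetical name, and comparing each element with a single representative of a class is valid only because the relation is assumed to be an equivalence relation.

```python
def equivalence_classes(elements, related):
    """Partition `elements` into classes of the relation `related`.
    Assumes `related` is reflexive, symmetric, and transitive, so that
    testing against one representative per class is enough."""
    classes = []
    for x in elements:
        for cls in classes:
            if related(x, cls[0]):
                cls.append(x)
                break
        else:
            classes.append([x])  # x starts a new class
    return classes

# Finite window of Z with m ~ n iff m - n is divisible by 3 (Example A.1.13).
window = list(range(-6, 7))
classes = equivalence_classes(window, lambda m, n: (m - n) % 3 == 0)

# The classes are pairwise disjoint and their union is the whole window.
assert sorted(x for cls in classes for x in cls) == window
assert len(classes) == 3
```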

A.2 Real numbers, continuity, differentiability, and integration

A.2.1 Real numbers

This section reviews the following: integers, rationals, real numbers; algebraic, order, and completeness axioms; Archimedean property, denseness of rationals. There are at least two approaches to deﬁning the real number system. Approach 1. Start with the natural numbers N, construct the set Z of all integers (N ∪ {0} ∪ (−N)), and next, the set Q of rationals and then the set R of real numbers either as the set of all Cauchy sequences of rationals or as Dedekind cuts. The step going from Q to R via Cauchy sequences is also available for completing any incomplete metric space (see Section A.4).


Approach 2. Define the set of real numbers R as a set that satisfies three sets of axioms. The first set is algebraic, involving addition and multiplication. The second set is on ordering that, with the first, makes R an ordered field (see Royden (1988) for a definition). The third set is a single axiom known as the completeness axiom. Thus R is defined as a complete ordered field.

The algebraic axioms say that there are two binary operations known as addition (+) and multiplication (·) that render R a field. See Royden (1988) for the nine axioms for this set. The order axiom says that there is a set P ⊂ R, to be called positive numbers, such that

(i) x, y ∈ P ⇒ x · y ∈ P, x + y ∈ P

(ii) x ∈ P ⇒ −x ∉ P

(iii) x ∈ R ⇒ x = 0 or x ∈ P or −x ∈ P.

The set Q of rational numbers is an ordered field (i.e., it satisfies the algebraic and order axioms). But Q does not satisfy the completeness axiom (see below). Given P, one can define an order on R by defining x < y (read x less than y) to mean y − x ∈ P. Since for all x, y in R, (x − y) is either 0 or (x − y) ∈ P or (y − x) ∈ P, it follows that for all x, y in R, either x = y or x < y or x > y. This is called total or linear order.

Definition A.2.1: (Upper and lower bounds). (a) Let A ⊂ R. A real number M is an upper bound for A if a ∈ A ⇒ a ≤ M and m is a lower bound for A if a ∈ A ⇒ a ≥ m. (b) The supremum of a set A, denoted by sup A or the least upper bound (l.u.b.) of A, is defined by the following conditions: (i) x ∈ A ⇒ x ≤ sup A, (ii) K < sup A ⇒ there exists x ∈ A such that K < x.

The completeness axiom says that if A ⊂ R has an upper bound M in R, then there exists an $\tilde M$ in R such that $\tilde M = \sup A$. That is, every set A that is bounded above in R has a l.u.b. in R. The ordered field of rationals Q does not possess this property. One well-known example is the set $A = \{r : r \in \mathbb{Q}, \ r^2 < 2\}$. Then A is bounded above in Q but has no l.u.b. in Q (Problem A.11). Next some consequences of the completeness axiom are discussed.
Proposition A.2.1: (Axiom of Eudoxus and Archimedes (AOE)). For all x in R, there exists a natural number n such that n > x.


Proof: If x ≤ 1, take n = 2. If x > 1, let $S_x \equiv \{k : k \in \mathbb{N}, k \le x\}$. Then $S_x$ is not empty and is bounded above. By the completeness axiom, there is a real number y that is the l.u.b. of $S_x$. Thus $y - \frac{1}{2}$ is not an upper bound for $S_x$ and so there exists $k_0 \in S_x$ such that $y - \frac{1}{2} < k_0$. This implies that $(k_0 + 1) > y - \frac{1}{2} + 1 = y + \frac{1}{2} > y$ and so $(k_0 + 1) \notin S_x$. By the linear order in R, $(k_0 + 1) > x$ and so $(k_0 + 1)$ is the desired integer. □

Corollary A.2.2: For any x, y ∈ R with x < y, there is an r in Q such that x < r < y.

Proof: Let $z = (y - x)^{-1}$. Then there is an integer k such that 0 < z < k (by AOE). Again by AOE, there is a positive integer n such that n > yk. Let S = {n : n ∈ N, n > yk}. Since S ≠ ∅, it has a smallest element (by the well ordering property of N), say, p. Then $p - 1 < yk < p$, i.e., $\frac{p-1}{k} < y < \frac{p}{k}$. Since $\frac{1}{k} < \frac{1}{z} = (y - x)$ and $\frac{p}{k} > y$, it follows that $\frac{p-1}{k} > x$. Now take $r = \frac{p-1}{k}$. □

Remark A.2.1: This property is often stated as: The set Q of rationals is dense in the set R of real numbers.

Definition A.2.2: The set $\bar{\mathbb{R}}$ of extended real numbers is the set consisting of R and two elements identified as +∞ (plus infinity) and −∞ (negative infinity) with the following definition of addition (+) and multiplication (·). For any x in R, x + ∞ = ∞, x − ∞ = −∞, x · ∞ = ∞ if x > 0, x · ∞ = −∞ if x < 0, 0 · ∞ = 0, ∞ + ∞ = ∞, −∞ − ∞ = −∞, ∞ · (±∞) = ±∞, (−∞) · (±∞) = ∓∞. But ∞ − ∞ is not defined. The order property on $\bar{\mathbb{R}}$ is defined by extending that on R with the additional condition x ∈ R ⇒ −∞ < x < +∞. Finally, if A ⊂ R does not have an upper bound in R, then sup A is defined as +∞ and if A ⊂ R does not have a lower bound in R, then inf A is defined as −∞.
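The construction in the proof of Corollary A.2.2 can be carried out with exact rational arithmetic. The following sketch is a slight variant rather than a transcription of the proof: it takes $p$ to be the smallest integer with $p \ge yk$ (instead of $p > yk$), which keeps $r < y$ strict even when $yk$ is an integer and also covers negative $y$. The function name is hypothetical.

```python
from fractions import Fraction
import math

def rational_between(x, y):
    """Return a rational r with x < r < y, for real inputs x < y
    given as ints, floats, or Fractions (converted exactly)."""
    x, y = Fraction(x), Fraction(y)
    if not x < y:
        raise ValueError("need x < y")
    # An integer k with k > 1/(y - x), so that 1/k < y - x.
    k = math.floor(1 / (y - x)) + 1
    # Smallest integer p with p >= y*k, hence p - 1 < y*k <= p.
    p = math.ceil(y * k)
    # (p - 1)/k < y, and (p - 1)/k >= y - 1/k > y - (y - x) = x.
    return Fraction(p - 1, k)

r = rational_between(Fraction(1, 3), Fraction(1, 2))
print(r)  # 3/7
```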

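A.2.2 Sequences, series, limits, limsup, liminf

Definition A.2.3: Let $\{x_n\}_{n\ge 1}$ be a sequence of real numbers.

(i) For a real number a, $\lim_{n\to\infty} x_n = a$ if for every $\epsilon > 0$, there exists a positive integer $N_\epsilon$ such that $n \ge N_\epsilon \Rightarrow |x_n - a| < \epsilon$.

(ii) $\lim_{n\to\infty} x_n = \infty$ if for any K in R, there exists an integer $N_K$ such that $n \ge N_K \Rightarrow x_n > K$.

(iii) $\lim_{n\to\infty} x_n = -\infty$ if $\lim_{n\to\infty} (-x_n) = \infty$.

(iv) $\limsup_{n\to\infty} x_n \equiv \overline{\lim}_{n\to\infty}\, x_n = \inf_{n\ge 1}\big(\sup_{j\ge n} x_j\big)$.

(v) $\liminf_{n\to\infty} x_n \equiv \underline{\lim}_{n\to\infty}\, x_n = \sup_{n\ge 1}\big(\inf_{j\ge n} x_j\big)$.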


Definition A.2.4: (Cauchy sequences). A sequence $\{x_n\}_{n\ge 1} \subset \mathbb{R}$ is called a Cauchy sequence if for every ε > 0, there is $N_\varepsilon$ such that $n, m \ge N_\varepsilon \Rightarrow |x_m - x_n| < \varepsilon$.

Proposition A.2.3: If $\{x_n\}_{n\ge 1} \subset \mathbb{R}$ is convergent in R (i.e., $\lim_{n\to\infty} x_n = a$ exists in R), then $\{x_n\}_{n\ge 1}$ is Cauchy. Conversely, if $\{x_n\}_{n\ge 1} \subset \mathbb{R}$ is Cauchy, then there exists an a ∈ R such that $\lim_{n\to\infty} x_n = a$.

The proof is based on the use of the l.u.b. axiom (Problem A.14).

Definition A.2.5: Let $\{x_n\}_{n\ge 1}$ be a sequence of real numbers. For n ≥ 1, $s_n \equiv \sum_{j=1}^{n} x_j$ is called the nth partial sum of the series $\sum_{j=1}^{\infty} x_j$. The series $\sum_{j=1}^{\infty} x_j$ is said to converge to s in R if $\lim_{n\to\infty} s_n = s$. If $\lim_{n\to\infty} s_n = \pm\infty$, then the series $\sum_{j=1}^{\infty} x_j$ is said to diverge to ±∞.

Note that if $x_j \ge 0$ for all j, then either $\lim_{n\to\infty} s_n = s \in \mathbb{R}$, or $\lim_{n\to\infty} s_n = \infty$.

Example A.2.1: (Geometric series). Fix 0 < r < 1. Let $x_n = r^n$, n ≥ 0. Then $s_n = 1 + r + \ldots + r^n = \frac{1 - r^{n+1}}{1 - r}$ and $\sum_{j=0}^{\infty} r^j$ converges to $s = \frac{1}{1-r}$.

Example A.2.2: Consider the series $\sum_{j=1}^{\infty} \frac{1}{j^p}$, 0 < p < ∞. It can be shown that this converges for p > 1 and diverges to ∞ for 0 < p ≤ 1.

Definition A.2.6: The series $\sum_{j=1}^{\infty} x_j$ converges absolutely if the series $\sum_{j=1}^{\infty} |x_j|$ converges in R.

There exist series $\sum_{j=1}^{\infty} x_j$ that converge but not absolutely. For example, $\sum_{j=1}^{\infty} \frac{(-1)^j}{j}$. For further material on convergence properties of series, such as tests for convergence, rates of convergence, etc., see Rudin (1976).

Definition A.2.7: (Power series). Let $\{a_n\}_{n\ge 0}$ be a sequence of real numbers. For x ∈ R, the series $\sum_{n=0}^{\infty} a_n x^n$ is called a power series. If the series $\sum_{n=0}^{\infty} a_n x^n$ converges for all x in B ⊂ R, the power series $\sum_{n=0}^{\infty} a_n x^n$ is said to be convergent on B.

Proposition A.2.4: Let $\{a_n\}_{n\ge 0}$ be a sequence of real numbers. Let $\rho = (\limsup_{n\to\infty} |a_n|^{1/n})^{-1}$. Then

(i) $|x| < \rho \Rightarrow \sum_{n=0}^{\infty} |a_n x^n|$ converges.

(ii) $|x| > \rho \Rightarrow \sum_{n=0}^{\infty} |a_n x^n|$ diverges to +∞.

Proof of this is left as an exercise (Problem A.15).

Definition A.2.8: $\rho \equiv (\limsup_{n\to\infty} |a_n|^{1/n})^{-1}$ is called the radius of convergence of the power series $\sum_{n=0}^{\infty} a_n x^n$.
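The closed form for the partial sums of the geometric series in Example A.2.1 can be checked numerically. This is a small sketch with an arbitrarily chosen ratio $r = 0.5$; the function name is illustrative only.

```python
def geometric_partial_sum(r, n):
    """Compute s_n = 1 + r + ... + r^n by adding the terms one at a time."""
    total, term = 0.0, 1.0
    for _ in range(n + 1):
        total += term
        term *= r
    return total

r = 0.5  # any ratio with 0 < r < 1 would do
for n in (5, 10, 20):
    closed_form = (1 - r ** (n + 1)) / (1 - r)
    assert abs(geometric_partial_sum(r, n) - closed_form) < 1e-12

# The partial sums approach the limit s = 1/(1 - r) as n grows.
limit = 1 / (1 - r)
assert abs(geometric_partial_sum(r, 60) - limit) < 1e-12
```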

A.2.3 Continuity and differentiability

Definition A.2.9: Let f : A → R, A ⊂ R. Then

(a) f is continuous at $x_0$ in A if for every $\epsilon > 0$, there exists a δ > 0 such that x ∈ A, $|x - x_0| < \delta$, implies $|f(x) - f(x_0)| < \epsilon$. (Here, δ may depend on $\epsilon$ and $x_0$.)

(b) f is continuous on B ⊂ A if it is continuous at every $x_0$ in B.

(c) f is uniformly continuous on B ⊂ A if for every $\epsilon > 0$, there exists a δ > 0 such that $\sup\{|f(x) - f(y)| : x, y \in B, |x - y| < \delta\} < \epsilon$.

Some properties of continuous functions are listed below.

Proposition A.2.5:

(i) (Sums, products, and ratios of continuous functions). Let f, g : A → R, A ⊂ R. Let f and g be continuous on B ⊂ A. Then (a) f + g, f − g, α · f for any α ∈ R are all continuous on B. (b) f(x)/g(x) is continuous at $x_0$ in B, provided $g(x_0) \ne 0$.

(ii) (Continuous functions on a closed bounded interval). Let f be continuous on a closed and bounded interval [a, b]. Then (a) f is bounded, i.e., sup{|f(x)| : a ≤ x ≤ b} < ∞, (b) it achieves its maximum and minimum, i.e., there exist $x_0$, $y_0$ in [a, b] such that $f(x_0) \ge f(x) \ge f(y_0)$ for all x in [a, b] and f attains all values in $[f(y_0), f(x_0)]$, i.e., for all $\ell \in [f(y_0), f(x_0)]$, there exists z ∈ [a, b] such that $f(z) = \ell$. Thus, f maps bounded closed intervals onto bounded closed intervals. (c) f is uniformly continuous on [a, b].

(iii) (Composition of functions). Let f : A → R, g : B → R be continuous on A and B, respectively. Let f(A) ⊂ B, i.e., for any x in A, f(x) ∈ B. Let h(x) = g(f(x)) for x in A. Then h : A → R is continuous.

(iv) (Uniform limits of continuous functions). Let $\{f_n\}_{n\ge 1}$ be a sequence of functions continuous on A to R, A ⊂ R. If $\sup\{|f_n(x) - f(x)| : x \in A\} \to 0$ as n → ∞ for some f : A → R, i.e., $f_n$ converges to f uniformly on A, then f is continuous on A.

Remark A.2.2: The function f(x) ≡ x is clearly continuous on R. Now by Proposition A.2.5 (i) and (iv), it follows that all polynomials are continuous on R, and hence, so are their uniform limits.
Weierstrass’ approximation theorem is a sort of converse to this. That is, every continuous function on a closed and bounded interval is the uniform limit of polynomials. More precisely, one has the following:


Theorem A.2.6: Let f : [a, b] → R be continuous. Then for any $\epsilon > 0$ there is a polynomial $p(x) = \sum_{j=0}^{n} a_j x^j$, $a_j \in \mathbb{R}$, j = 0, 1, 2, . . . , n, such that $\sup\{|f(x) - p(x)| : x \in [a, b]\} < \epsilon$.

It should be noted that a power series $A(x) \equiv \sum_{n=0}^{\infty} a_n x^n$ is the uniform limit of polynomials on [−λ, λ] for any $0 < \lambda < \rho \equiv (\limsup_{n\to\infty} |a_n|^{1/n})^{-1}$ and hence is continuous on (−ρ, ρ).

Definition A.2.10: Let f : (a, b) → R, (a, b) ⊂ R. The function f is said to be differentiable at $x_0 \in (a, b)$ if
\[ \lim_{h\to 0} \frac{f(x_0 + h) - f(x_0)}{h} \equiv f'(x_0) \quad \text{exists in } \mathbb{R}. \]

A function is differentiable in (a, b) if it is differentiable at each x in (a, b). Some important consequences of differentiability are listed below.

Proposition A.2.7: Let f, g : (a, b) → R, (a, b) ⊂ R. Then

(i) f differentiable at $x_0$ in (a, b) implies f is continuous at $x_0$.

(ii) (Mean value theorem). f differentiable on (a, b), f continuous on [a, b] implies that for some a < c < b, $f(b) - f(a) = (b - a)f'(c)$.

(iii) (Maxima and minima). f differentiable at $x_0$ and for some δ > 0, $f(x) \le f(x_0)$ for all $x \in (x_0 - \delta, x_0 + \delta)$ implies that $f'(x_0) = 0$.

(iv) (Sums, products and ratios). f, g differentiable at $x_0$ implies that for any α, β in R, (αf + βg) and fg are differentiable at $x_0$ with
\[ (\alpha f + \beta g)'(x_0) = \alpha f'(x_0) + \beta g'(x_0), \]
\[ (fg)'(x_0) = f'(x_0)g(x_0) + f(x_0)g'(x_0), \]
and if $g(x_0) \ne 0$, then f/g is differentiable at $x_0$ with
\[ (f/g)'(x_0) = \frac{f'(x_0)g(x_0) - f(x_0)g'(x_0)}{(g(x_0))^2}. \]

(v) (Chain rule). If f is differentiable at $x_0$ and g is differentiable at $f(x_0)$, then h(x) ≡ g(f(x)) is differentiable at $x_0$ with $h'(x_0) = g'(f(x_0))f'(x_0)$.

(vi) (Differentiability of power series). Let $A(x) \equiv \sum_{n=0}^{\infty} a_n x^n$ be a power series with radius of convergence $\rho \equiv (\limsup_{n\to\infty} |a_n|^{1/n})^{-1} > 0$. Then A(·) is differentiable infinitely many times on (−ρ, ρ) and for x in (−ρ, ρ),
\[ \frac{d^k A(x)}{dx^k} = \sum_{n=k}^{\infty} n(n-1)\cdots(n-k+1)\,a_n x^{n-k}, \quad k \ge 1. \]


Remark A.2.3: It should be noted that the converse to (i) in the above proposition does not hold. For example, the function f(x) = |x| is continuous at $x_0 = 0$ but is not differentiable at $x_0$. Indeed, Weierstrass showed that there exists a function f : [0, 1] → R such that it is continuous on [0, 1] but is not differentiable at any x in (0, 1). Also note that the mean value theorem implies that if $f'(\cdot) \ge 0$ on (a, b), then f is nondecreasing on (a, b).

Definition A.2.11: (Taylor series). Let f be a map from I ≡ (a − η, a + η) to R for some a ∈ R, η > 0. Suppose f is n times differentiable in I, for each n ≥ 1. Let $a_n = \frac{f^{(n)}(a)}{n!}$. Then the power series $\sum_{n=0}^{\infty} a_n (x-a)^n = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!}(x-a)^n$ is called the Taylor series of f at a.

Remark A.2.4: Let f be as in Definition A.2.11. Taylor's remainder theorem says that for any x in I and any n ≥ 1, if f is (n + 1) times differentiable in I, then
\[ \Big| f(x) - \sum_{j=0}^{n} a_j (x-a)^j \Big| \le \frac{|f^{(n+1)}(y_n)|}{(n+1)!}\,|x-a|^{n+1} \]
for some $y_n$ in I. Thus, if for some $\epsilon > 0$,

sup |f (k) (y)| ≡ λk satisﬁes

|y−a| 0, cos t = 0}. Set π = 2t0 . Since cos π2 = 0, (iv) implies π sin π2 = 1 and hence that eι 2 = ι. π

(vi) Clearly, $e^{\iota\frac{\pi}{2}} = \iota$ implies that $e^{\iota\pi} = -1$ and $e^{\iota 2\pi} = 1$ and $e^{\iota 2\pi k} = 1$ for all integers k. Since $e^{\iota 2\pi} = 1$, it follows that $e^z = e^{z + \iota 2\pi}$ for all z ∈ C. □

It is now possible to prove various results involving π that one learns in calculus from the above definition. For example, that the arc length of the unit circle {z : |z| = 1} is 2π and that $\int_{-\infty}^{\infty} \frac{1}{1+x^2}\,dx = \pi$, etc. (Problems A.19 and A.20). The following assertions about $e^z$ can be proved with some more effort.

Theorem A.3.2:

(i) $e^z = 1$ iff $z = 2\pi\iota k$ for some integer k.

(ii) The map $t \to e^{\iota t}$ from R is onto the unit circle.

(iii) For any ω ∈ C, ω ≠ 0, there is a z ∈ C such that $\omega = e^z$.


For a proof of this theorem as well as more details on Theorem A.3.1, see Rudin (1987).

Theorem A.3.3: (Orthogonality of $\{e^{\iota 2\pi n t}\}_{n\in\mathbb{Z}}$). For any n ∈ Z,
\[ \int_0^1 e^{\iota 2\pi n t}\,dt = 0 \ \text{ if } n \ne 0 \quad\text{and}\quad 1 \ \text{ if } n = 0. \]

Proof: Since $(e^{\iota t})' = \iota e^{\iota t}$, $(e^{\iota 2\pi n t})' = \iota 2\pi n\, e^{\iota 2\pi n t}$, n ∈ Z, and so for n ≠ 0,
\[ \int_0^1 e^{\iota 2\pi n t}\,dt = \frac{1}{\iota 2\pi n}\int_0^1 \big(e^{\iota 2\pi n t}\big)'\,dt = \frac{1}{\iota 2\pi n}\big(e^{\iota 2\pi n} - 1\big) = 0. \]

Corollary A.3.4: The family {cos 2πnt : n = 0, 1, 2, . . .} ∪ {sin 2πnt : n = 1, 2, . . .} are orthogonal in $L^2[0, 1]$ (Problem A.22), i.e., for any two f, g in this family, $\int_0^1 f(x)g(x)\,dx = 0$ for f ≠ g.
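Theorem A.3.3 can be checked numerically with a Riemann sum on an equally spaced grid. This is only a sanity check, not a proof; the function name and the grid size are arbitrary choices, and Python writes the imaginary unit $\iota$ as `1j`.

```python
import cmath

def mean_of_exponential(n, num_points=1000):
    """Riemann-sum approximation of the integral over [0, 1] of
    exp(i * 2 * pi * n * t), using num_points equally spaced points."""
    total = 0j
    for j in range(num_points):
        t = j / num_points
        total += cmath.exp(2j * cmath.pi * n * t)
    return total / num_points

# n = 0 gives 1; a nonzero integer n gives (numerically) 0.
assert abs(mean_of_exponential(0) - 1) < 1e-12
assert abs(mean_of_exponential(3)) < 1e-9
assert abs(mean_of_exponential(-7)) < 1e-9
```

On an equally spaced grid the sum is a sum of roots of unity, so it vanishes up to rounding error whenever $n$ is not a multiple of the number of grid points.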

A.4 Metric spaces

A.4.1 Basic definitions

This section reviews the following: metric spaces, Cauchy sequences, completeness, functions, continuity, compactness, convergence of sequences of functions, and uniform convergence.

Definition A.4.1: Let S be a nonempty set. Let $d : S \times S \to \mathbb{R}_+ = [0, \infty)$ be such that

(i) d(x, y) = d(y, x) for any x, y in S.

(ii) d(x, z) ≤ d(x, y) + d(y, z) for any x, y, z in S.

(iii) d(x, y) = 0 iff x = y.

Such a d is called a metric on S and the pair (S, d) a metric space. Property (ii) is called the triangle inequality.

Example A.4.1: Let $\mathbb{R}^k \equiv \{(x_1, \ldots, x_k) : x_i \in \mathbb{R}, 1 \le i \le k\}$ be the k-dimensional Euclidean space. For 1 ≤ p < ∞ and $x = (x_1, \ldots, x_k)$, $y = (y_1, y_2, \ldots, y_k) \in \mathbb{R}^k$, let
\[ d_p(x, y) = \Big(\sum_{i=1}^{k} |x_i - y_i|^p\Big)^{1/p}, \]


and $d_\infty(x, y) = \max\{|x_i - y_i| : 1 \le i \le k\}$. It can be shown that $d_p(\cdot, \cdot)$ is a metric on $\mathbb{R}^k$ for all 1 ≤ p ≤ ∞ (Problem A.24).

Definition A.4.2: A sequence $\{x_n\}_{n\ge 1}$ in a metric space (S, d) converges to an x in S if for every ε > 0, there is an $N_\varepsilon$ such that $n \ge N_\varepsilon \Rightarrow d(x_n, x) < \varepsilon$; this is written as $\lim_{n\to\infty} x_n = x$.

Definition A.4.3: A sequence $\{x_n\}_{n\ge 1}$ in a metric space (S, d) is Cauchy if for all ε > 0, there exists $N_\varepsilon$ such that $n, m \ge N_\varepsilon \Rightarrow d(x_n, x_m) < \varepsilon$.

Definition A.4.4: A metric space (S, d) is complete if every Cauchy sequence $\{x_n\}_{n\ge 1}$ converges to some x in S.

Example A.4.2:
(a) Let S = Q, the set of rationals, and d(x, y) ≡ |x − y|. Then (Q, d) is a metric space that is not complete.
(b) Let S = R and d(x, y) = |x − y|. Then (R, d) is complete (cf. Proposition A.2.3).
(c) Let $S = \mathbb{R}^k$. Then $(\mathbb{R}^k, d_p)$ is complete for every 1 ≤ p ≤ ∞, where $d_p$ is as in Example A.4.1.

Remark A.4.1: (Completion of an incomplete metric space). Let (S, d) be a metric space. Let $\tilde{S}$ be the set of all Cauchy sequences in S. Identify each x in S with the Cauchy sequence $\{x_n = x\}_{n\ge 1}$. Define a function from $\tilde{S} \times \tilde{S}$ to $\mathbb{R}_+$ by
$$\tilde{d}\big(\{x_n\}_{n\ge 1}, \{y_n\}_{n\ge 1}\big) = \limsup_{n\to\infty} d(x_n, y_n).$$
It is easy to verify that $\tilde{d}$ is symmetric and satisfies the triangle inequality. Define $s_1 = \{x_n\}_{n\ge 1}$ and $s_2 = \{y_n\}_{n\ge 1}$ to be equivalent (write $\{x_n\} \sim \{y_n\}$) if $\tilde{d}(s_1, s_2) = 0$. Let $\bar{S}$ be the set of all equivalence classes in $\tilde{S}$ and define $\bar{d}(c_1, c_2) \equiv \tilde{d}(s_1, s_2)$, where $c_1, c_2$ are equivalence classes and $s_1, s_2$ are arbitrary elements of $c_1$ and $c_2$, respectively. It can now be verified that $(\bar{S}, \bar{d})$ is a complete metric space and (S, d) is embedded in $(\bar{S}, \bar{d})$ by identifying each x in S with the equivalence class containing the sequence $\{x_n = x\}_{n\ge 1}$.

Definition A.4.5: A metric space (S, d) is separable if there exists a subset D ⊂ S that is countable and dense in S, i.e., for each x in S and ε > 0, there is a y in D such that d(x, y) < ε.

Example A.4.3: By the Archimedean property, Q is dense in R. Similarly $\mathbb{Q}^k$, the set of all k-vectors with components from Q, is dense in $\mathbb{R}^k$.


Definition A.4.6: A metric space (S, d) is called Polish if it is complete and separable.

Example A.4.4: $(\mathbb{R}^k, d_p)$ in Example A.4.2 is Polish.
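The metrics $d_p$ of Example A.4.1 are simple to compute. The following is a minimal sketch, assuming points of $\mathbb{R}^k$ are represented as equal-length Python sequences of numbers and that p = ∞ is passed as `math.inf`; the function name `d_p` simply mirrors the notation of the example and is not taken from any library.

```python
import math


def d_p(x, y, p):
    """Distance d_p(x, y) on R^k for 1 <= p <= infinity.

    p = math.inf gives the max metric d_infinity.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same dimension")
    if p < 1:
        raise ValueError("p must satisfy 1 <= p <= infinity")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


if __name__ == "__main__":
    x, y = (0, 0), (3, 4)
    # The same pair of points is at a different distance under each metric.
    for p in (1, 2, math.inf):
        print(p, d_p(x, y, p))
```

For x = (0, 0) and y = (3, 4) the three distances are 7, 5 and 4, which shows how the choice of p changes the geometry even though all of these metrics make $\mathbb{R}^k$ complete and separable.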

A.4.2 Continuous functions

Let (S, d) and (T, ρ) be two metric spaces. Let f : S → T be a map from S to T.

Definition A.4.7:
(a) f is continuous at p in S if for each ε > 0, there exists δ > 0 such that d(x, p) < δ ⇒ ρ(f(x), f(p)) < ε. (Here the δ may depend on ε and p.)
(b) f is continuous on a set B ⊂ S if it is continuous at every p ∈ B.
(c) f is uniformly continuous on B if for each ε > 0, there exists δ > 0 such that for each pair x, y in B, d(x, y) < δ ⇒ ρ(f(x), f(y)) < ε.

Definition A.4.8: Let (S, d) be a metric space.
(a) A set O ⊂ (S, d) is open if x ∈ O ⇒ there exists δ > 0 such that d(x, y) < δ ⇒ y ∈ O. That is, at every point x in O, an open ball $B_x(\delta) \equiv \{y : d(x, y) < \delta\}$ of positive radius δ is a subset of O.
(b) A set C ⊂ (S, d) is closed if $C^c$ is open.

Theorem A.4.1: Let (S, d) and (T, ρ) be metric spaces. A map f : S → T is continuous on S iff for each O open in T, $f^{-1}(O)$ is open in S.

The proof is left as an exercise (Problem A.28).

A.4.3 Compactness

Definition A.4.9: A collection of open sets $\{O_\alpha : \alpha \in I\}$ is an open cover for a set B ⊂ (S, d) if for each x ∈ B, there exists α ∈ I such that $x \in O_\alpha$.

Example A.4.5: Let B = (0, 1). Then the collection $\big\{\big(\alpha - \frac{\alpha}{2},\ \alpha + \frac{1-\alpha}{2}\big) : \alpha \in \mathbb{Q} \cap (0, 1)\big\}$ is an open cover for B.

Definition A.4.10: Let (S, d) be a metric space. A set K ⊂ S is called compact if given any open cover $\{O_\alpha : \alpha \in I\}$ for K, there exists a finite subcollection $\{O_{\alpha_i} : \alpha_i \in I,\ i = 1, 2, \ldots, n,\ n < \infty\}$ that is an open cover for K.

Example A.4.6: The set B = (0, 1) is not compact, as the open cover in Example A.4.5 above does not admit a finite subcover.


The next result is the well-known Heine-Borel theorem.

Theorem A.4.2:
(i) For any −∞ < a < b < ∞, the closed interval [a, b] is compact in R.
(ii) Any K ⊂ R is compact iff it is bounded and closed.

For a proof, see Rudin (1976). From Theorem A.4.1, it is seen that the inverse image of an open set under a continuous function is open, but the forward image may not have this property. But the following is true.

Theorem A.4.3: Let (S, d) and (T, ρ) be two metric spaces and let f : (S, d) → (T, ρ) be continuous. Let K ⊂ S be compact. Then f(K) is compact.

The proof is left as an exercise (Problem A.35).

A.4.4 Sequences of functions and uniform convergence

Definition A.4.11: Let (S, d) and (T, ρ) be two metric spaces and let $\{f_n\}_{n\ge 1}$ be a sequence of functions from (S, d) to (T, ρ). The sequence $\{f_n\}_{n\ge 1}$ is said to:
(a) converge pointwise to f on a set A ⊂ S if $\lim_{n\to\infty} f_n(x) = f(x)$ for each x in A;
(b) converge uniformly to f on a set A ⊂ S if for each ε > 0, there exists $N_\varepsilon > 0$ (depending on ε and A) such that $n \ge N_\varepsilon \Rightarrow \rho\big(f_n(x), f(x)\big) < \varepsilon$ for all x in A.

A consequence of uniform convergence is the preservation of the continuity property.

Theorem A.4.4: Let (S, d) and (T, ρ) be two metric spaces and let $\{f_n\}_{n\ge 1}$ be a sequence of functions from (S, d) to (T, ρ). For each n ≥ 1, let $f_n$ be continuous on A ⊂ S. Let $\{f_n\}_{n\ge 1}$ converge to f uniformly on A. Then f is continuous on A.

Proof: The proof is based on the “break up into three parts” idea. By the triangle inequality,
$$\rho\big(f(x), f(y)\big) \le \rho\big(f(x), f_n(x)\big) + \rho\big(f_n(x), f_n(y)\big) + \rho\big(f_n(y), f(y)\big).$$
Fix x in A. By the uniform convergence on A, $\sup\{\rho\big(f_n(u), f(u)\big) : u \in A\} \to 0$ as n → ∞. So for each ε > 0, there exists $N_\varepsilon < \infty$ such that $n \ge N_\varepsilon \Rightarrow \rho\big(f_n(u), f(u)\big) < \frac{\varepsilon}{3}$ for all u in A. Now since $f_{N_\varepsilon}(\cdot)$ is continuous on A, there exists a δ > 0 (depending on $N_\varepsilon$ and x) such that $d(x, y) < \delta,\ y \in A \Rightarrow \rho\big(f_{N_\varepsilon}(y), f_{N_\varepsilon}(x)\big) < \frac{\varepsilon}{3}$. Thus, $y \in A,\ d(x, y) < \delta \Rightarrow \rho\big(f(x), f(y)\big) < 2\cdot\frac{\varepsilon}{3} + \frac{\varepsilon}{3} = \varepsilon$. □
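The difference between the two modes of convergence in Definition A.4.11 can be illustrated with the sequence $f_n(x) = x^n$ and the limit g ≡ 0 (the sequence that also appears in Problem A.36). The following is a rough numerical sketch, not a proof: it assumes a finite grid as a stand-in for the supremum over an interval, and the helper names `power_fn`, `uniform_grid`, `sup_distance_to_zero` and `witness_point` are hypothetical.

```python
def power_fn(n):
    """Return the function x -> x**n."""
    return lambda x: x ** n


def uniform_grid(a, b, num=2001):
    """Return num equally spaced points in [a, b], endpoints included."""
    step = (b - a) / (num - 1)
    return [a + k * step for k in range(num)]


def sup_distance_to_zero(n, points):
    """Largest value of |x**n - 0| over the given finite set of points."""
    f = power_fn(n)
    return max(abs(f(x)) for x in points)


def witness_point(n):
    """A point x in (0, 1) with x**n equal to 1/2 (up to rounding)."""
    return 0.5 ** (1.0 / n)


if __name__ == "__main__":
    grid = uniform_grid(-0.9, 0.9)
    for n in (10, 50, 200):
        # On [-0.9, 0.9] the largest deviation from 0 is 0.9**n, which tends to 0.
        print(n, sup_distance_to_zero(n, grid))
        # On (0, 1) there is always a point where x**n is 1/2, so the
        # supremum over (0, 1) never drops below 1/2.
        x = witness_point(n)
        print(n, x, power_fn(n)(x))
```

The witness points $x_n = 2^{-1/n}$ move toward 1 as n grows, which is exactly why no single $N_\varepsilon$ works for all x in (0, 1) when ε < 1/2.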


A.5 Problems

A.1 Express the following sets in the form {x : x has property p}.
(a) The set A of all integers which when divided by 7 leave a remainder ≤ 3.
(b) The set B of all functions from [0, 1] to R with at most two discontinuity points.
(c) The set C of all students at a given university who are graduate students with at least one course in mathematics at the graduate level.
(d) The set D of all algebraic numbers. (A number x is called an algebraic number if it is the root of a polynomial with rational coefficients.)
(e) The set E of all possible sequences whose elements are either 0 or 1.

A.2 Give an example of sets $A_1, A_2$ such that $A_1 \cap A_2 = A_1 \cup A_2$.

A.3 Let I = [0, 1], Ω = R and for α ∈ R, $A_\alpha = (\alpha - 1, \alpha + 1)$, the open interval {x : α − 1 < x < α + 1}.
(a) Show that $\bigcup_{\alpha\in I} A_\alpha = (-1, 2)$ and $\bigcap_{\alpha\in I} A_\alpha = (0, 1)$.
(b) Suppose J = {x : x ∈ I, x is rational}. Find $\bigcup_{x\in J} A_x$ and $\bigcap_{x\in J} A_x$.

A.4 With Ω ≡ N ≡ {1, 2, 3, . . .}, find $A^c$ in the following cases:
(a) A = {ω : ω is divisible by 2 or 3 or both}. If $\omega \in A^c$, what can be said about its prime factors?
(b) A = {ω : ω is divisible by 15 and 16}.
(c) A = {ω : ω is a perfect square}.

A.5 Show that $X \equiv \{0, 1\}^{\mathbb{N}}$, the set of all sequences $\{\omega_i\}_{i\in\mathbb{N}}$ where each $\omega_i \in \{0, 1\}$, is uncountable. Conclude that P(N) is uncountable.

A.6 Show that if $\Omega_i$ is countable for each i ∈ N, then for each k ∈ N, $\times_{i=1}^{k} \Omega_i$ is countable and $\bigcup_{i\in\mathbb{N}} \Omega_i$ is also countable but $\times_{i\in\mathbb{N}} \Omega_i$ is not countable.

A.7 Show that the set of all polynomials in x with integer coefficients is countable.

A.8 Show that the well ordering property implies the principle of induction.

A.9 Apply the principle of induction to establish the following:

(a) For each n ∈ N, $\sum_{j=1}^{n} j^2 = \frac{n(n+1)(2n+1)}{6}$.
(b) For each n ∈ N, $x_1, x_2, \ldots, x_k \in \mathbb{R}$,
(i) (The binomial formula). $(x_1 + x_2)^n = \sum_{r=0}^{n} \binom{n}{r} x_1^{r} x_2^{n-r}$.
(ii) (The multinomial formula). $(x_1 + x_2 + \ldots + x_k)^n = \sum \frac{n!}{r_1!\, r_2! \cdots r_k!}\, x_1^{r_1} x_2^{r_2} \cdots x_k^{r_k}$, where the summation extends over all $(r_1, r_2, \ldots, r_k)$ such that $r_i \in \mathbb{N}$, $0 \le r_i \le n$, $\sum_{i=1}^{k} r_i = n$.

A.10 Verify that on R, the relation x ∼ y if x − y is rational is an equivalence relation but the relation x ∼ y if x − y is irrational is not.

A.11 Show that the set $A = \{r : r \in \mathbb{Q},\ r^2 < 2\}$ is bounded above in Q but has no l.u.b. in Q.

A.12 Show that for any two sequences $\{x_n\}_{n\ge 1}, \{y_n\}_{n\ge 1} \subset \mathbb{R}$,
$$\underline{\lim}_{n\to\infty}\, x_n + \underline{\lim}_{n\to\infty}\, y_n \le \underline{\lim}_{n\to\infty}\, (x_n + y_n) \le \overline{\lim}_{n\to\infty}\, (x_n + y_n) \le \overline{\lim}_{n\to\infty}\, x_n + \overline{\lim}_{n\to\infty}\, y_n.$$

A.13 Verify that $\lim_{n\to\infty} x_n = a \in \mathbb{R}$ iff $\underline{\lim}_{n\to\infty}\, x_n = \overline{\lim}_{n\to\infty}\, x_n = a$.

A.14 Establish Proposition A.2.3. (Hint: First show that a Cauchy sequence is bounded and then show that $\underline{\lim}_{n\to\infty}\, x_n = \overline{\lim}_{n\to\infty}\, x_n$.)

A.15 (a) Prove Proposition A.2.4 by comparison with the geometric series.
(b) Show that for integer k ≥ 1, the power series $\sum_{n=k}^{\infty} n(n-1)\cdots(n-k+1)\, a_n x^{n-k}$ has the same radius of convergence as $\sum_{n=0}^{\infty} a_n x^n$.

A.16 Show that the series $\sum_{j=2}^{\infty} \frac{1}{j(\log j)^p}$ converges for p > 1 and diverges for p ≤ 1.

A.17 Find the radius of convergence, ρ, for the power series $A(x) \equiv \sum_{n=0}^{\infty} a_n x^n$ where
(a) $a_n = \frac{n}{n+1}$, n ≥ 0.
(b) $a_n = n^p$, n ≥ 0, p ∈ R.


(c) $a_n = \frac{1}{n!}$, n ≥ 0 (where 0! = 1).

A.18 (a) Find the Taylor series at a = 0 for the function $f(x) = \frac{1}{1-x}$ in I ≡ (−1, +1) and show that it converges to f(x) on I.
(b) Find the Taylor series of $1 + x + x^2$ in I = (1, 3), centered at 2.
(c) Let
$$f(x) = \begin{cases} e^{-1/x^2} & \text{if } |x| < 1,\ x \neq 0, \\ 0 & \text{if } x = 0. \end{cases}$$
(i) Show that f is infinitely differentiable at 0 and compute $f^{(j)}(0)$ for all j ≥ 1.
(ii) Show that the Taylor series at a = 0 converges but not to f on (−1, 1).

A.19 Let S = {z : z ∈ C, |z| = 1} be the unit circle. Using the parameterization $t \to e^{\iota t} = (\cos t + \iota \sin t)$ from [0, 2π] to S, show that the arc length of S (i.e., the circumference of the unit circle) is 2π.

A.20 Set $\phi(t) = \frac{\sin t}{\cos t}$ for $-\frac{\pi}{2} < t < \frac{\pi}{2}$. Verify that $\phi' = 1 + \phi^2$ and that $\phi : (-\frac{\pi}{2}, \frac{\pi}{2})$ to (−∞, ∞) is strictly monotone increasing and onto. Conclude that
$$\int_{-\infty}^{\infty} \frac{1}{1 + x^2}\,dx = \int_{-\pi/2}^{\pi/2} \frac{\phi'(t)}{1 + (\phi(t))^2}\,dt = \pi.$$

A.21 Using the property that $e^{\iota\pi/2} = \iota$, verify that for all t in R,
$$\cos\big(\tfrac{\pi}{2} - t\big) = \sin t, \quad \sin\big(\tfrac{\pi}{2} - t\big) = \cos t,$$
$$\cos(\pi + t) = -\cos t, \quad \sin(\pi + t) = -\sin t,$$
$$\cos(2\pi + t) = \cos t, \quad \sin(2\pi + t) = \sin t.$$
Also show that cos t is a strictly decreasing map from [0, π] onto [−1, 1] and that sin t is a strictly increasing map from $[-\frac{\pi}{2}, \frac{\pi}{2}]$ onto [−1, 1].

A.22 Using (i) of Theorem A.3.1, express $\cos(t_1 + t_2)$, $\sin(t_1 + t_2)$ in terms of $\cos t_i$, $\sin t_i$, i = 1, 2, and in turn use this to prove Corollary A.3.4 from Theorem A.3.3.

A.23 Verify that $p_n(z) \equiv (1 + \frac{z}{n})^n$ converges to $e^z$ uniformly on bounded sets in C.

A.24 (a) Verify that for p = 1, p = 2 and p = ∞, $d_p$ is a metric on $\mathbb{R}^k$.
(b) Show that for fixed x and y, $\varphi(p) \equiv d_p(x, y)$ is continuous in p on [1, ∞].


(c) Draw the open unit ball $B_p \equiv \{x : x \in \mathbb{R}^2,\ d_p(x, 0) < 1\}$ in $\mathbb{R}^2$ for p = 1, 2 and ∞.

A.25 Let S = C[0, 1] be the set of all real valued continuous functions on [0, 1]. Now let
$$d_1(f, g) = \int_0^1 |f(x) - g(x)|\,dx \quad \text{(area metric)},$$
$$d_2(f, g) = \Big(\int_0^1 |f(x) - g(x)|^2\,dx\Big)^{1/2} \quad \text{(least square metric)},$$
$$d_\infty(f, g) = \sup\{|f(x) - g(x)| : 0 \le x \le 1\} \quad \text{(sup metric)}.$$
Show that all these are metrics on S.

A.26 Let $S = \mathbb{R}^\infty \equiv \{\{x_n\}_{n\ge 1} : x_n \in \mathbb{R} \text{ for all } n \ge 1\}$ be the space of all sequences of real numbers. Let
$$d\big(\{x_n\}_{n\ge 1}, \{y_n\}_{n\ge 1}\big) = \sum_{j=1}^{\infty} \Big(\frac{|x_j - y_j|}{1 + |x_j - y_j|}\Big) \frac{1}{2^j}.$$
Show that (S, d) is a Polish space.

A.27 If $s_k = \{x_{kn}\}_{n\ge 1}$ and $s = \{x_n\}_{n\ge 1}$ are elements of $S = \mathbb{R}^\infty$ as in Problem A.26, verify that as k → ∞, $s_k \to s$ iff $x_{kn} \to x_n$ for all n ≥ 1.

A.28 Establish Theorem A.4.1.

A.29 Let S = C[0, 1] and $d_p(f, g) \equiv \big(\int_0^1 |f(t) - g(t)|^p\,dt\big)^{1/p}$ for 1 ≤ p < ∞ and $d_\infty(f, g) = \sup\{|f(t) - g(t)| : t \in [0, 1]\}$.
(a) Let f(t) ≡ 1. Let $f_n(t) \equiv 1$ for $0 \le t \le 1 - \frac{1}{n}$, and $f_n(t) = n(1 - t)$ for $1 - \frac{1}{n} \le t \le 1$. Show that $d_p(f_n, f) \to 0$ for 1 ≤ p < ∞ but $d_\infty(f_n, f) \not\to 0$.
(b) Fix f ∈ C[0, 1]. Let $g_n(t) = f(t)$, $0 \le t \le 1 - \frac{1}{n}$, and $g_n(t) = f(1 - \frac{1}{n}) + \big(f(1) - f(1 - \frac{1}{n})\big)\, n\, (t + \frac{1}{n} - 1)$, $1 - \frac{1}{n} \le t \le 1$. Show that $d_p(g_n, f) \to 0$ for all 1 ≤ p ≤ ∞.

A.30 Show that if $\{x_n\}_{n\ge 1}$ is a convergent sequence in a metric space (S, d), then it is Cauchy.

A.31 Verify (b) of Example A.4.2 from the axioms of real numbers (cf. Proposition A.2.3). Verify (c) of the same example from (b).

A.32 Let S = C[0, 1] and d be the supremum metric, i.e., d(f, g) = sup{|f(x) − g(x)| : 0 ≤ x ≤ 1}. By approximating any continuous function with piecewise linear functions with rational end points and rational values, show that (S, d) is Polish, i.e., it is complete and separable.


A.33 Show that the function $f(x) = x^2$ is continuous on R, uniformly so on any bounded set B ⊂ R, but not uniformly on R.

A.34 Show that unions of open sets are open and the intersection of any two open sets is open. Give an example to show that the intersection of an infinite number of open sets need not be open.

A.35 Prove Theorem A.4.3.

A.36 Let $f_n(x) = x^n$ and g(x) ≡ 0 on R. Show that $\{f_n\}_{n\ge 1}$ converges pointwise to g on (−1, 1), uniformly on [a, b] for −1 < a < b < 1, but not uniformly on (0, 1).

A.37 Let $\{f_n\}_{n\ge 1}$, f ∈ C[0, 1]. Let $\{f_n\}_{n\ge 1}$ converge to f uniformly on [0, 1]. Show that $\lim_{n\to\infty} \int_0^1 |f_n(x) - f(x)|\,dx = 0$ and $\lim_{n\to\infty} \int_0^1 f_n(x)\,dx = \int_0^1 f(x)\,dx$.

A.38 Give a proof of Proposition A.2.6 (vi) (term by term differentiability of a power series) using Proposition A.2.7 (iv) (the fundamental theorem of Riemann integration).

Appendix B List of Abbreviations and Symbols

B.1 Abbreviations

a.c.        absolutely continuous (functions)
a.e.        almost everywhere
AR(1)       autoregressive process of order one
a.s.        almost sure(ly)
BCT         bounded convergence theorem
BGW         Bienaymé-Galton-Watson
cdf         cumulative distribution function
CE          conditional expectation
CLT         central limit theorem
CTMC        continuous time Markov chain
DCT         dominated convergence theorem
EDCT        extended dominated convergence theorem
fdds        finite dimensional distributions
iff         if and only if
IFS         iterated function system
iid         independent and identically distributed
IIIDRM      iterations of iid random maps
i.o.        infinitely often
LIL         law of the iterated logarithm
LLN         laws of large numbers
MBB         moving block bootstrap
m.c.f.a.    monotone continuity from above
m.c.f.b.    monotone continuity from below
MCMC        Markov chain Monte Carlo
MCT         monotone convergence theorem
o.n.b.      orthonormal basis
r.c.p.      regular conditional probability
SBM         standard Brownian motion
SLLN        strong law of large numbers
s.o.c.      second order correctness
SSRW        simple symmetric random walk
UI          uniform integrability
WLLN        weak law of large numbers
w.p. 1      with probability one
w.r.t.      with respect to
w.l.o.g.    without loss of generality

B.2 Symbols

≪                     µ ≪ ν: absolute continuity of a measure
−→d                   convergence in distribution
−→p                   convergence in probability
(·) ∗ (·)             convolution of measures, functions, etc.
(·)∗                  extension of a measure
a ∼ b                 a and b are equivalent (under an equivalence relation)
a_n ∼ b_n             a_n / b_n → 1 as n → ∞
⌊a⌋                   the integer part of a, i.e., ⌊a⌋ = k if k ≤ a < k + 1, k ∈ Z, a ∈ R
⌈a⌉                   the smallest integer not less than a, i.e., ⌈a⌉ = k + 1 if k < a ≤ k + 1, k ∈ Z, a ∈ R
Ā                     closure of A
A^c                   complement of a set A
∂A                    boundary of A
A Δ B                 symmetric difference of two sets A and B, i.e., A Δ B = (A ∩ B^c) ∪ (A^c ∩ B)
B(S)                  Borel σ-algebra on a metric space S such as S = R, R^k, R^∞
B(S, R)               ≡ {f | f : S → R, F-measurable, sup{|f(s)| : s ∈ S} ≤ 1}
B(x, ε), B_x(ε)       open ball of radius ε with center at x in a metric space (S, d), i.e., {y : d(x, y) < ε}
C                     the set of all complex numbers
C[a, b]               = {f | f : [a, b] → R, f continuous}
C_B(R)                ≡ {f | f : R → R, f bounded and continuous}
C_c(R)                ≡ {f | f : R → R, continuous and f ≡ 0 outside a bounded interval}
C_0(R)                ≡ {f | f : R → R, continuous and lim_{|x|→∞} f(x) = 0}
C_0(S)                = {f | f : S → R, f continuous and for every ε > 0, there exists a compact set K_ε such that |f(x)| < ε for x ∉ K_ε}
C(F), C_F             the set of all continuity points of a cdf F
δ_ij                  Kronecker delta, i.e., δ_ij = 1 if i = j and = 0 if i ≠ j
δ_x                   the probability distribution putting mass one at x
dµ/dν                 Radon-Nikodym derivative of µ w.r.t. ν
E(Y|G)                conditional expectation of Y given G
H^⊥                   orthogonal complement of a subspace H of a Hilbert space
ι                     √−1
I_A(·)                the indicator function of a set A
II_k                  the identity matrix of order k
λ⟨A⟩                  λ-class generated by a class of sets A
L^p(Ω, F, µ)          = {f | f : Ω → 𝔽, F-measurable, ∫ |f|^p dµ < ∞}, with 𝔽 = R or C (𝔽 = C in Sections 5.6, 5.7 only)
L^p(R)                = L^p(R, B(R), m)
m                     the Lebesgue measure
µ_F                   Lebesgue-Stieltjes measure corresponding to F
µ ⊥ ν                 singularity of measures µ and ν
N                     the set of natural numbers
∅                     the null set
Φ(·)                  standard normal cdf, i.e., $\Phi(x) \equiv \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-u^2/2}\,du$, −∞ < x < ∞
P(A|G)                probability of A given G
P_λ(·)                probability distribution of a Markov chain with initial distribution λ
P(Ω)                  the power set of Ω = {A : A ⊂ Ω}
P_x(·)                same as P_λ with λ = δ_x
(Ω, F, P)             generic probability space
(Ω, F, µ)             generic measure space
Q                     the set of all rationals
R                     the set of real numbers, (−∞, ∞)
R_+                   the set of nonnegative real numbers, [0, ∞)
R̄                     the set of all extended real numbers, [−∞, ∞]
R̄_+                   = [0, ∞]
(S, d)                a metric space S with a metric d
σ⟨A⟩                  σ-algebra generated by a class of sets A
σ⟨{f_a : a ∈ A}⟩      σ-algebra generated by a collection of mappings {f_a : a ∈ A}
T                     tail σ-algebra
|z|                   = √(a² + b²), the absolute value of a complex number z = a + ιb, a, b ∈ R
Re(z)                 = a, the real part of a complex number z = a + ιb, a, b ∈ R
Im(z)                 = b, the imaginary part of a complex number z = a + ιb, a, b ∈ R
Z                     the set of all integers = {0, ±1, ±2, . . .}
Z_+                   the set of all nonnegative integers = {0, 1, 2, . . .}

References

Arcones, M. A. and Giné, E. (1989), ‘The bootstrap of the mean with arbitrary bootstrap sample size’, Ann. Inst. H. Poincaré Probab. Statist. 25(4), 457–481.

Arcones, M. A. and Giné, E. (1991), ‘Additions and correction to: “The bootstrap of the mean with arbitrary bootstrap sample size” [Ann. Inst. H. Poincaré Probab. Statist. 25(4) (1989), 457–481]’, Ann. Inst. H. Poincaré Probab. Statist. 27(4), 583–595.

Athreya, K. B. (1986), ‘Darling and Kac revisited’, Sankhyā A 48(3), 255–266.

Athreya, K. B. (1987a), ‘Bootstrap of the mean in the infinite variance case’, Ann. Statist. 15(2), 724–731.

Athreya, K. B. (1987b), Bootstrap of the mean in the infinite variance case, in ‘Proceedings of the 1st World Congress of the Bernoulli Society’, Vol. 2, VNU Sci. Press, Utrecht, pp. 95–98.

Athreya, K. B. (2000), ‘Change of measures for Markov chains and the l log l theorem for branching processes’, Bernoulli 6, 323–338.

Athreya, K. B. (2004), ‘Stationary measures for some Markov chain models in ecology and economics’, Econom. Theory 23(1), 107–122.

Athreya, K. B., Doss, H. and Sethuraman, J. (1996), ‘On the convergence of the Markov chain simulation method’, Ann. Statist. 24(1), 69–100.


Athreya, K. B. and Jagers, P., eds (1997), Classical and Modern Branching Processes, Vol. 84 of The IMA Volumes in Mathematics and its Applications, Springer-Verlag, New York.

Athreya, K. B. and Ney, P. (1978), ‘A new approach to the limit theory of recurrent Markov chains’, Trans. Amer. Math. Soc. 245, 493–501.

Athreya, K. B. and Ney, P. E. (2004), Branching Processes, Dover Publications, Inc, Mineola, NY. (Reprint of Band 196, Grundlehren der Mathematischen Wissenschaften, Springer-Verlag, Berlin).

Athreya, K. B. and Pantula, S. G. (1986), ‘Mixing properties of Harris chains and autoregressive processes’, J. Appl. Probab. 23(4), 880–892.

Athreya, K. B. and Stenflo, O. (2003), ‘Perfect sampling for Doeblin chains’, Sankhyā A 65(4), 763–777.

Bahadur, R. R. (1966), ‘A note on quantiles in large samples’, Ann. Math. Statist. 37, 577–580.

Barnsley, M. F. (1992), Fractals Everywhere, 2nd edn, Academic Press, New York.

Berbee, H. C. P. (1979), Random Walks with Stationary Increments and Renewal Theory, Mathematical Centre, Amsterdam.

Berry, A. C. (1941), ‘The accuracy of the Gaussian approximation to the sum of independent variates’, Trans. Amer. Math. Soc. 48, 122–136.

Bhatia, R. (2003), Fourier Series, 2nd edn, Hindustan Book Agency, New Delhi, India.

Bhattacharya, R. N. and Rao, R. R. (1986), Normal Approximation and Asymptotic Expansions, Robert E. Krieger, Melbourne, FL.

Billingsley, P. (1968), Convergence of Probability Measures, John Wiley, New York.

Billingsley, P. (1995), Probability and Measure, 3rd edn, John Wiley, New York.

Bradley, R. C. (1983), ‘Approximation theorems for strongly mixing random variables’, Michigan Math. J. 30(1), 69–81.

Brillinger, D. R. (1975), Time Series. Data Analysis and Theory, Holt, Rinehart and Winston, Inc, New York.

Carlstein, E. (1986), ‘The use of subseries values for estimating the variance of a general statistic from a stationary sequence’, Ann. Statist. 14(3), 1171–1179.


Chanda, K. C. (1974), ‘Strong mixing properties of linear stochastic processes’, J. Appl. Probab. 11, 401–408.

Chow, Y.-S. and Teicher, H. (1997), Probability Theory: Independence, Interchangeability, Martingales, Springer-Verlag, New York.

Chung, K. L. (1967), Markov Chains with Stationary Transition Probabilities, 2nd edn, Springer-Verlag, New York.

Chung, K. L. (1974), A Course in Probability Theory, 2nd edn, Academic Press, New York.

Cohen, P. (1966), Set Theory and the Continuum Hypothesis, Benjamin, New York.

Doob, J. L. (1953), Stochastic Processes, John Wiley, New York.

Doukhan, P., Massart, P. and Rio, E. (1994), ‘The functional central limit theorem for strongly mixing processes’, Ann. Inst. H. Poincaré Probab. Statist. 30, 63–82.

Durrett, R. (2001), Essentials of Stochastic Processes, Springer-Verlag, New York.

Durrett, R. (2004), Probability: Theory and Examples, 3rd edn, Duxbury Press, San Jose, CA.

Efron, B. (1979), ‘Bootstrap methods: Another look at the jackknife’, Ann. Statist. 7(1), 1–26.

Esseen, C.-G. (1942), ‘Rate of convergence in the central limit theorem’, Ark. Mat. Astr. Fys. 28A(9).

Esseen, C.-G. (1945), ‘Fourier analysis of distribution functions. A mathematical study of the Laplace-Gaussian law’, Acta Math. 77, 1–125.

Etemadi, N. (1981), ‘An elementary proof of the strong law of large numbers’, Z. Wahrsch. Verw. Gebiete 55(1), 119–122.

Feller, W. (1966), An Introduction to Probability Theory and Its Applications, Vol. II, John Wiley, New York.

Feller, W. (1968), An Introduction to Probability Theory and Its Applications, Vol. I, 3rd edn, John Wiley, New York.

Geman, S. and Geman, D. (1984), ‘Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images’, IEEE Trans. Pattern Analysis Mach. Intell. 6, 721–741.

Giné, E. and Zinn, J. (1989), ‘Necessary conditions for the bootstrap of the mean’, Ann. Statist. 17(2), 684–691.


Gnedenko, B. V. and Kolmogorov, A. N. (1968), Limit Distributions for Sums of Independent Random Variables, Revised edn, Addison-Wesley, Reading, MA.

Gorodetskii, V. V. (1977), ‘On the strong mixing property for linear sequences’, Theory Probab. 22, 411–413.

Götze, F. and Hipp, C. (1978), ‘Asymptotic expansions in the central limit theorem under moment conditions’, Z. Wahrsch. Verw. Gebiete 42, 67–87.

Götze, F. and Hipp, C. (1983), ‘Asymptotic expansions for sums of weakly dependent random vectors’, Z. Wahrsch. Verw. Gebiete 64, 211–239.

Hall, P. (1985), ‘Resampling a coverage pattern’, Stochastic Process. Appl. 20, 231–246.

Hall, P. (1992), The Bootstrap and Edgeworth Expansion, Springer-Verlag, New York.

Hall, P. G. and Heyde, C. C. (1980), Martingale Limit Theory and Its Applications, Academic Press, New York.

Hall, P., Horowitz, J. L. and Jing, B.-Y. (1995), ‘On blocking rules for the bootstrap with dependent data’, Biometrika 82, 561–574.

Herrndorf, N. (1983), ‘Stationary strongly mixing sequences not satisfying the central limit theorem’, Ann. Probab. 11, 809–813.

Hewitt, E. and Stromberg, K. (1965), Real and Abstract Analysis, Springer-Verlag, New York.

Hoel, P. G., Port, S. C. and Stone, C. J. (1972), Introduction to Stochastic Processes, Houghton-Mifflin, Boston, MA.

Ibragimov, I. A. and Rozanov, Y. A. (1978), Gaussian Random Processes, Springer-Verlag, Berlin.

Karatzas, I. and Shreve, S. E. (1991), Brownian Motion and Stochastic Calculus, 2nd edn, Springer-Verlag, New York.

Karlin, S. and Taylor, H. M. (1975), A First Course in Stochastic Processes, Academic Press, New York.

Kifer, Y. (1988), Random Perturbations of Dynamical Systems, Birkhäuser, Boston, MA.

Kolmogorov, A. N. (1956), Foundations of the Theory of Probability, 2nd edn, Chelsea, New York.


Körner, T. W. (1989), Fourier Analysis, Cambridge University Press, New York.

Künsch, H. R. (1989), ‘The jackknife and the bootstrap for general stationary observations’, Ann. Statist. 17, 1217–1261.

Lahiri, S. N. (1991), ‘Second order optimality of stationary bootstrap’, Statist. Probab. Lett. 11, 335–341.

Lahiri, S. N. (1992), ‘Edgeworth expansions for m-estimators of a regression parameter’, J. Multivariate Analysis 43, 125–132.

Lahiri, S. N. (1994), ‘Rates of bootstrap approximation for the mean of lattice variables’, Sankhyā A 56, 77–89.

Lahiri, S. N. (1996), ‘Asymptotic expansions for sums of random vectors under polynomial mixing rates’, Sankhyā A 58, 206–225.

Lahiri, S. N. (2001), ‘Effects of block lengths on the validity of block resampling methods’, Probab. Theory Related Fields 121, 73–97.

Lahiri, S. N. (2003), Resampling Methods for Dependent Data, Springer-Verlag, New York.

Lehmann, E. L. and Casella, G. (1998), Theory of Point Estimation, Springer-Verlag, New York.

Lindvall, T. (1992), Lectures on Coupling Theory, John Wiley, New York.

Liu, R. Y. and Singh, K. (1992), Moving blocks jackknife and bootstrap capture weak dependence, in R. Lepage and L. Billard, eds, ‘Exploring the Limits of the Bootstrap’, John Wiley, New York, pp. 225–248.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. (1953), ‘Equations of state calculations by fast computing machines’, J. Chem. Physics 21, 1087–1092.

Meyn, S. P. and Tweedie, R. L. (1993), Markov Chains and Stochastic Stability, Springer-Verlag, New York.

Munkres, J. R. (1975), Topology, A First Course, Prentice Hall, Englewood Cliffs, NJ.

Nummelin, E. (1978), ‘A splitting technique for Harris recurrent Markov chains’, Z. Wahrsch. Verw. Gebiete 43(4), 309–318.

Nummelin, E. (1984), General Irreducible Markov Chains and Nonnegative Operators, Cambridge University Press, Cambridge.

Orey, S. (1971), Limit Theorems for Markov Chain Transition Probabilities, Van Nostrand Reinhold, London.


Parthasarathy, K. R. (1967), Probability Measures on Metric Spaces, Academic Press, San Diego, CA.

Parthasarathy, K. R. (2005), Introduction to Probability and Measure, Vol. 33 of Texts and Readings in Mathematics, Hindustan Book Agency, New Delhi, India.

Peligrad, M. (1982), ‘Invariance principles for mixing sequences of random variables’, Ann. Probab. 10(4), 968–981.

Petrov, V. V. (1975), Sums of Independent Random Variables, Springer-Verlag, New York.

Reiss, R.-D. (1974), ‘On the accuracy of the normal approximation for quantiles’, Ann. Probab. 2, 741–744.

Robert, C. P. and Casella, G. (1999), Monte Carlo Statistical Methods, Springer-Verlag, New York.

Rosenberger, W. F. (2002), ‘Urn models and sequential design’, Sequential Anal. 21(1–2), 1–41.

Royden, H. L. (1988), Real Analysis, 3rd edn, Macmillan Publishing Co., New York.

Rudin, W. (1976), Principles of Mathematical Analysis, International Series in Pure and Applied Mathematics, 3rd edn, McGraw-Hill Book Co., New York.

Rudin, W. (1987), Real and Complex Analysis, 3rd edn, McGraw-Hill Book Co., New York.

Shohat, J. A. and Tamarkin, J. D. (1943), The problem of moments, in ‘American Mathematical Society Mathematical Surveys’, Vol. II, American Mathematical Society, New York.

Singh, K. (1981), ‘On the asymptotic accuracy of Efron’s bootstrap’, Ann. Statist. 9, 1187–1195.

Strassen, V. (1964), ‘An invariance principle for the law of the iterated logarithm’, Z. Wahrsch. Verw. Gebiete 3, 211–226.

Stroock, D. W. and Varadhan, S. (1979), Multidimensional Diffusion Processes, Band 233, Grundlehren der Mathematischen Wissenschaften, Springer-Verlag, Berlin.

Szego, G. (1939), Orthogonal Polynomials, Vol. 23 of American Mathematical Society Colloquium Publications, American Mathematical Society, Providence, RI.


Withers, C. S. (1981), ‘Conditions for linear processes to be strong-mixing’, Z. Wahrsch. Verw. Gebiete 57, 477–480.

Woodroofe, M. (1982), Nonlinear Renewal Theory in Sequential Analysis, SIAM, Philadelphia, PA.
