Introduction to Probability Dimitri P. Bertsekas and John N. Tsitsiklis Massachusetts Institute of Technology
WWW site for book information and orders http://www.athenasc.com
Athena Scientific, Belmont, Massachusetts
Athena Scientific Post Office Box 391 Belmont, Mass. 02478-9998 U.S.A. Email:
[email protected] WWW: http://www.athenasc.com
Cover Design: Ann Gallager
c 2002 Dimitri P. Bertsekas and John N. Tsitsiklis All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.
Publisher’s Cataloging-in-Publication Data Bertsekas, Dimitri P., Tsitsiklis, John N. Introduction to Probability Includes bibliographical references and index 1. Probabilities. 2. Stochastic Processes. I. Title. QA273.B475 2002 519.2 – 21 Library of Congress Control Number: 2002092167 ISBN 1-886529-40-X
To the memory of Pantelis Bertsekas and Nikos Tsitsiklis
Contents 1. Sample Space and Probability 1.1. 1.2. 1.3. 1.4. 1.5. 1.6. 1.7.
Sets . . . . . . . . . . Probabilistic Models . . . Conditional Probability . Total Probability Theorem Independence . . . . . . Counting . . . . . . . Summary and Discussion Problems . . . . . . . .
2. Discrete Random Variables 2.1. 2.2. 2.3. 2.4. 2.5. 2.6. 2.7. 2.8.
. . . . . . and . . . . . . . .
. . . . . . . . . Bayes’ . . . . . . . . . . . .
. . . . . . . . . Rule . . . . . . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . . . . . . . .
Basic Concepts . . . . . . . . . . . . Probability Mass Functions . . . . . . Functions of Random Variables . . . . . Expectation, Mean, and Variance . . . . Joint PMFs of Multiple Random Variables Conditioning . . . . . . . . . . . . . Independence . . . . . . . . . . . . . Summary and Discussion . . . . . . . Problems . . . . . . . . . . . . . . .
3. General Random Variables 3.1. 3.2. 3.3. 3.4. 3.5. 3.6. 3.7.
. . . . . . . . . . . . . . p. 1
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
p. 71 . . . . . . . . .
. . . . . . . . . . . . . .
Continuous Random Variables and PDFs Cumulative Distribution Functions . . . Normal Random Variables . . . . . . . Conditioning on an Event . . . . . . . Multiple Continuous Random Variables . Derived Distributions . . . . . . . . . Summary and Discussion . . . . . . . Problems . . . . . . . . . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. p. 3 . p. 6 p. 18 p. 28 p. 34 p. 43 p. 50 p. 52
p. 72 p. 74 p. 80 p. 81 p. 92 p. 98 p. 110 p. 116 p. 119 p. 139
. . . . . . . .
p. p. p. p. p. p. p. p.
140 148 152 158 164 179 190 192 v
vi
Contents
4. Further Topics on Random Variables 4.1. 4.2. 4.3. 4.4. 4.5. 4.6. 4.7. 4.8.
. . . . . . . . . .
Transforms . . . . . . . . . . . . . . . . . . . . . Sums of Independent Random Variables - Convolution . . . More on Conditional Expectation and Variance . . . . . . Sum of a Random Number of Independent Random Variables Covariance and Correlation . . . . . . . . . . . . . . Least Squares Estimation . . . . . . . . . . . . . . . The Bivariate Normal Distribution . . . . . . . . . . . Summary and Discussion . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . .
p. 209
. . . p. . . . p. . . . p. . . p. . . . p. . . . p. . . . p. . . . p. . . . p.
5. The Bernoulli and Poisson Processes . . . . . . . . . . . 5.1. The Bernoulli Process . . . . . 5.2. The Poisson Process . . . . . 5.3. Summary and Discussion . . . Problems . . . . . . . . . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
p. 271 . . . .
6. Markov Chains . . . . . . . . . . . . . . . . . . . . 6.1. 6.2. 6.3. 6.4. 6.5. 6.6.
Discrete-Time Markov Chains . . . . . . Classification of States . . . . . . . . . . Steady-State Behavior . . . . . . . . . . Absorption Probabilities and Expected Time Continuous-Time Markov Chains . . . . . Summary and Discussion . . . . . . . . Problems . . . . . . . . . . . . . . . .
. . . to . . .
. . . . . . . . . . . . . . . . . . Absorption . . . . . . . . . . . . . . . . . .
. . . . . . .
. . . . . . .
Markov and Chebyshev Inequalities The Weak Law of Large Numbers . Convergence in Probability . . . . The Central Limit Theorem . . . The Strong Law of Large Numbers Summary and Discussion . . . . Problems . . . . . . . . . . . .
Index
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . .
p. p. p. p.
273 285 299 301
p. 313 . . . . . . .
7. Limit Theorems . . . . . . . . . . . . . . . . . . . . 7.1. 7.2. 7.3. 7.4. 7.5. 7.6.
210 221 225 232 236 240 247 255 257
p. p. p. p. p. p. p.
314 321 326 337 344 352 354
p. 379 . . . . . . .
p. p. p. p. p. p. p.
381 383 386 388 395 397 399
p. 411
Preface
Probability is common sense reduced to calculation Laplace
This book is an outgrowth of our involvement in teaching an introductory probability course (“Probabilistic Systems Analysis”) at the Massachusetts Institute of Technology. The course is attended by a large number of students with diverse backgrounds, and a broad range of interests. They span the entire spectrum from freshmen to beginning graduate students, and from the engineering school to the school of management. Accordingly, we have tried to strike a balance between simplicity in exposition and sophistication in analytical reasoning. Our key aim has been to develop the ability to construct and analyze probabilistic models in a manner that combines intuitive understanding and mathematical precision. In this spirit, some of the more mathematically rigorous analysis has been just sketched or intuitively explained in the text, so that complex proofs do not stand in the way of an otherwise simple exposition. At the same time, some of this analysis is developed (at the level of advanced calculus) in theoretical problems, that are included at the end of the corresponding chapter. Furthermore, some of the subtler mathematical issues are hinted at in footnotes addressed to the more attentive reader. The book covers the fundamentals of probability theory (probabilistic models, discrete and continuous random variables, multiple random variables, and limit theorems), which are typically part of a first course on the subject. It also contains, in Chapters 4-6 a number of more advanced topics, from which an instructor can choose to match the goals of a particular course. In particular, in Chapter 4, we develop transforms, a more advanced view of conditioning, sums of random variables, least squares estimation, and the bivariate normal distribuvii
viii
Preface
tion. Furthermore, in Chapters 5 and 6, we provide a fairly detailed introduction to Bernoulli, Poisson, and Markov processes. Our M.I.T. course covers all seven chapters in a single semester, with the exception of the material on the bivariate normal (Section 4.7), and on continuoustime Markov chains (Section 6.5). However, in an alternative course, the material on stochastic processes could be omitted, thereby allowing additional emphasis on foundational material, or coverage of other topics of the instructor’s choice. Our most notable omission in coverage is an introduction to statistics. While we develop all the basic elements of Bayesian statistics, in the form of Bayes’ rule for discrete and continuous models, and least squares estimation, we do not enter the subjects of parameter estimation, or non-Bayesian hypothesis testing. The problems that supplement the main text are divided in three categories: (a) Theoretical problems: The theoretical problems (marked by *) constitute an important component of the text, and ensure that the mathematically oriented reader will find here a smooth development without major gaps. Their solutions are given in the text, but an ambitious reader may be able to solve many of them, especially in earlier chapters, before looking at the solutions. (b) Problems in the text: Besides theoretical problems, the text contains several problems, of various levels of difficulty. These are representative of the problems that are usually covered in recitation and tutorial sessions at M.I.T., and are a primary mechanism through which many of our students learn the material. Our hope is that students elsewhere will attempt to solve these problems, and then refer to their solutions to calibrate and enhance their understanding of the material. The solutions are posted on the book’s www site http://www.athenasc.com/probbook.html (c) Supplementary problems: There is a large (and growing) collection of additional problems, which is not included in the book, but is made available at the book’s www site. Many of these problems have been assigned as homework or exam problems at M.I.T., and we expect that instructors elsewhere will use them for a similar purpose. While the statements of these additional problems are publicly accessible, the solutions are made available from the authors only to course instructors. We would like to acknowledge our debt to several people who contributed in various ways to the book. Our writing project began when we assumed responsibility for a popular probability class at M.I.T. that our colleague Al Drake had taught for several decades. We were thus fortunate to start with an organization of the subject that had stood the test of time, a lively presentation of the various topics in Al’s classic textbook, and a rich set of material that had been used in recitation sessions and for homework. We are thus indebted to Al Drake
Preface
ix
for providing a very favorable set of initial conditions. We are thankful to the several colleagues who have either taught from the draft of the book at various universities or have read it, and have provided us with valuable feedback. In particular, we thank Ibrahim Abou Faycal, Gustavo de Veciana, Eugene Feinberg, Bob Gray, Muriel M´edard, Jason Papastavrou, Ilya Pollak, David Tse, and Terry Wagner. The teaching assistants for the M.I.T. class have been very helpful. They pointed out corrections to various drafts, they developed problems and solutions suitable for the class, and through their direct interaction with the student body, they provided a robust mechanism for calibrating the level of the material. Reaching thousands of bright students at M.I.T. at an early stage in their studies was a great source of satisfaction for us. We thank them for their valuable feedback and for being patient while they were taught from a textbook-inprogress. Last but not least, we are grateful to our families for their support throughout the course of this long project. Dimitri P. Bertsekas,
[email protected] John N. Tsitsiklis,
[email protected] Cambridge, Mass., May 2002
ATHENA SCIENTIFIC BOOKS 1. Introduction to Probability, by Dimitri P. Bertsekas and John N. Tsitsiklis, 2002, ISBN 1-886529-40-X, 430 pages 2. Dynamic Programming and Optimal Control: Second Edition, Vols. I and II, by Dimitri P. Bertsekas, 2001, ISBN 1-886529-08-6, 704 pages 3. Nonlinear Programming, Second Edition, by Dimitri P. Bertsekas, 1999, ISBN 1-886529-00-0, 791 pages 4. Network Optimization: Continuous and Discrete Models by Dimitri P. Bertsekas, 1998, ISBN 1-886529-02-7, 608 pages 5. Network Flows and Monotropic Optimization by R. Tyrrell Rockafellar, 1998, ISBN 1-886529-06-X, 634 pages 6. Introduction to Linear Optimization by Dimitris Bertsimas and John N. Tsitsiklis, 1997, ISBN 1-886529-19-1, 608 pages 7. Parallel and Distributed Computation: Numerical Methods by Dimitri P. Bertsekas and John N. Tsitsiklis, 1997, ISBN 1-886529-01-9, 718 pages 8. Neuro-Dynamic Programming, by Dimitri P. Bertsekas and John N. Tsitsiklis, 1996, ISBN 1-886529-10-8, 512 pages 9. Constrained Optimization and Lagrange Multiplier Methods, by Dimitri P. Bertsekas, 1996, ISBN 1-886529-04-3, 410 pages 10. Stochastic Optimal Control: The Discrete-Time Case by Dimitri P. Bertsekas and Steven E. Shreve, 1996, ISBN 1-886529-03-5, 330 pages
x
1 Sample Space and Probability
Contents 1.1. 1.2. 1.3. 1.4. 1.5. 1.6. 1.7.
Sets . . . . . . . . . . Probabilistic Models . . . Conditional Probability . Total Probability Theorem Independence . . . . . . Counting . . . . . . . Summary and Discussion Problems . . . . . . . .
. . . . . . and . . . . . . . .
. . . . . . . . . Bayes’ . . . . . . . . . . . .
. . . . . . . . . Rule . . . . . . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. p. 3 . p. 6 p. 18 p. 28 p. 34 p. 43 p. 50 p. 52
1
2
Sample Space and Probability
Chap. 1
“Probability” is a very useful concept, but can be interpreted in a number of ways. As an illustration, consider the following.
A patient is admitted to the hospital and a potentially life-saving drug is administered. The following dialog takes place between the nurse and a concerned relative. RELATIVE:
Nurse, what is the probability that the drug will work? I hope it works, we’ll know tomorrow. RELATIVE: Yes, but what is the probability that it will? NURSE: Each case is different, we have to wait. RELATIVE: But let’s see, out of a hundred patients that are treated under similar conditions, how many times would you expect it to work? NURSE (somewhat annoyed): I told you, every person is different, for some it works, for some it doesn’t. RELATIVE (insisting): Then tell me, if you had to bet whether it will work or not, which side of the bet would you take? NURSE (cheering up for a moment): I’d bet it will work. RELATIVE (somewhat relieved): OK, now, would you be willing to lose two dollars if it doesn’t work, and gain one dollar if it does? NURSE (exasperated): What a sick thought! You are wasting my time! NURSE:
In this conversation, the relative attempts to use the concept of probability to discuss an uncertain situation. The nurse’s initial response indicates that the meaning of “probability” is not uniformly shared or understood, and the relative tries to make it more concrete. The first approach is to define probability in terms of frequency of occurrence, as a percentage of successes in a moderately large number of similar situations. Such an interpretation is often natural. For example, when we say that a perfectly manufactured coin lands on heads “with probability 50%,” we typically mean “roughly half of the time.” But the nurse may not be entirely wrong in refusing to discuss in such terms. What if this was an experimental drug that was administered for the very first time in this hospital or in the nurse’s experience? While there are many situations involving uncertainty in which the frequency interpretation is appropriate, there are other situations in which it is not. Consider, for example, a scholar who asserts that the Iliad and the Odyssey were composed by the same person, with probability 90%. Such an assertion conveys some information, but not in terms of frequencies, since the subject is a one-time event. Rather, it is an expression of the scholar’s subjective belief. One might think that subjective beliefs are not interesting, at least from a mathematical or scientific point of view. On the other hand, people often have to make choices in the presence of uncertainty, and a systematic way of making use of their beliefs is a prerequisite for successful, or at least consistent, decision making.
Sec. 1.1
Sets
3
In fact, the choices and actions of a rational person, can reveal a lot about the inner-held subjective probabilities, even if the person does not make conscious use of probabilistic reasoning. Indeed, the last part of the earlier dialog was an attempt to infer the nurse’s beliefs in an indirect manner. Since the nurse was willing to accept a one-for-one bet that the drug would work, we may infer that the probability of success was judged to be at least 50%. And had the nurse accepted the last proposed bet (two-for-one), that would have indicated a success probability of at least 2/3. Rather than dwelling further into philosophical issues about the appropriateness of probabilistic reasoning, we will simply take it as a given that the theory of probability is useful in a broad variety of contexts, including some where the assumed probabilities only reflect subjective beliefs. There is a large body of successful applications in science, engineering, medicine, management, etc., and on the basis of this empirical evidence, probability theory is an extremely useful tool. Our main objective in this book is to develop the art of describing uncertainty in terms of probabilistic models, as well as the skill of probabilistic reasoning. The first step, which is the subject of this chapter, is to describe the generic structure of such models, and their basic properties. The models we consider assign probabilities to collections (sets) of possible outcomes. For this reason, we must begin with a short review of set theory.
1.1 SETS Probability makes extensive use of set operations, so let us introduce at the outset the relevant notation and terminology. A set is a collection of objects, which are the elements of the set. If S is a set and x is an element of S, we write x ∈ S. If x is not an element of S, we write x ∈ / S. A set can have no elements, in which case it is called the empty set, denoted by Ø. Sets can be specified in a variety of ways. If S contains a finite number of elements, say x1 , x2 , . . . , xn , we write it as a list of the elements, in braces: S = {x1 , x2 , . . . , xn }. For example, the set of possible outcomes of a die roll is {1, 2, 3, 4, 5, 6}, and the set of possible outcomes of a coin toss is {H, T }, where H stands for “heads” and T stands for “tails.” If S contains infinitely many elements x1 , x2 , . . ., which can be enumerated in a list (so that there are as many elements as there are positive integers) we write S = {x1 , x2 , . . .}, and we say that S is countably infinite. For example, the set of even integers can be written as {0, 2, −2, 4, −4, . . .}, and is countably infinite.
4
Sample Space and Probability
Chap. 1
Alternatively, we can consider the set of all x that have a certain property P , and denote it by {x | x satisfies P }. (The symbol “|” is to be read as “such that.”) For example, the set of even integers can be written as {k | k/2 is integer}. Similarly, the set of all scalars x in the interval [0, 1] can be written as {x | 0 ≤ x ≤ 1}. Note that the elements x of the latter set take a continuous range of values, and cannot be written down in a list (a proof is sketched in the end-of-chapter problems); such a set is said to be uncountable. If every element of a set S is also an element of a set T , we say that S is a subset of T , and we write S ⊂ T or T ⊃ S. If S ⊂ T and T ⊂ S, the two sets are equal, and we write S = T . It is also expedient to introduce a universal set, denoted by Ω, which contains all objects that could conceivably be of interest in a particular context. Having specified the context in terms of a universal set Ω, we only consider sets S that are subsets of Ω. Set Operations The complement of a set S, with respect to the universe Ω, is the set {x ∈ Ω|x ∈ / S} of all elements of Ω that do not belong to S, and is denoted by S c . Note that Ωc = Ø. The union of two sets S and T is the set of all elements that belong to S or T (or both), and is denoted by S ∪ T . The intersection of two sets S and T is the set of all elements that belong to both S and T , and is denoted by S ∩ T . Thus, S ∪ T = {x | x ∈ S or x ∈ T }, and S ∩ T = {x | x ∈ S and x ∈ T }. In some cases, we will have to consider the union or the intersection of several, even infinitely many sets, defined in the obvious way. For example, if for every positive integer n, we are given a set Sn , then ∞
Sn = S1 ∪ S2 ∪ · · · = {x | x ∈ Sn for some n},
n=1
and
∞
Sn = S1 ∩ S2 ∩ · · · = {x | x ∈ Sn for all n}.
n=1
Two sets are said to be disjoint if their intersection is empty. More generally, several sets are said to be disjoint if no two of them have a common element. A collection of sets is said to be a partition of a set S if the sets in the collection are disjoint and their union is S.
Sec. 1.1
Sets
5
If x and y are two objects, we use (x, y) to denote the ordered pair of x and y. The set of scalars (real numbers) is denoted by ; the set of pairs (or triplets) of scalars, i.e., the two-dimensional plane (or three-dimensional space, respectively) is denoted by 2 (or 3 , respectively). Sets and the associated operations are easy to visualize in terms of Venn diagrams, as illustrated in Fig. 1.1. Ω
Ω
S
Ω
S
S T
T
T (b)
(a)
(c)
Ω
Ω
T
U
S
U
T (d)
Ω
T
S
S
(e)
(f )
Figure 1.1: Examples of Venn diagrams. (a) The shaded region is S ∩ T . (b) The shaded region is S ∪ T . (c) The shaded region is S ∩ T c . (d) Here, T ⊂ S. The shaded region is the complement of S. (e) The sets S, T , and U are disjoint. (f) The sets S, T , and U form a partition of the set Ω.
The Algebra of Sets Set operations have several properties, which are elementary consequences of the definitions. Some examples are: S∪T S ∩ (T ∪ U ) (S c )c S∪Ω
= T ∪ S, = (S ∩ T ) ∪ (S ∩ U ), = S, = Ω,
S ∪ (T ∪ U ) S ∪ (T ∩ U ) S ∩ Sc S∩Ω
= (S ∪ T ) ∪ U, = (S ∪ T ) ∩ (S ∪ U ), = Ø, = S.
Two particularly useful properties are given by De Morgan’s laws which state that c c Sn = Snc , Sn = Snc . n
n
n
n
To establish the first law, suppose that x ∈ (∪n Sn Then, x ∈ / ∪n Sn , which implies that for every n, we have x ∈ / Sn . Thus, x belongs to the complement )c .
6
Sample Space and Probability
Chap. 1
of every Sn , and xn ∈ ∩n Snc . This shows that (∪n Sn )c ⊂ ∩n Snc . The converse inclusion is established by reversing the above argument, and the first law follows. The argument for the second law is similar.
1.2 PROBABILISTIC MODELS A probabilistic model is a mathematical description of an uncertain situation. It must be in accordance with a fundamental framework that we discuss in this section. Its two main ingredients are listed below and are visualized in Fig. 1.2.
Elements of a Probabilistic Model • The sample space Ω, which is the set of all possible outcomes of an experiment. • The probability law, which assigns to a set A of possible outcomes (also called an event) a nonnegative number P(A) (called the probability of A) that encodes our knowledge or belief about the collective “likelihood” of the elements of A. The probability law must satisfy certain properties to be introduced shortly.
Probability law Event B Experiment
Event A Sample space (Set of possible outcomes)
P(B) P(A)
A
B
Events
Figure 1.2: The main ingredients of a probabilistic model.
Sample Spaces and Events Every probabilistic model involves an underlying process, called the experiment, that will produce exactly one out of several possible outcomes. The set of all possible outcomes is called the sample space of the experiment, and is denoted by Ω. A subset of the sample space, that is, a collection of possible
Sec. 1.2
Probabilistic Models
7
outcomes, is called an event.† There is no restriction on what constitutes an experiment. For example, it could be a single toss of a coin, or three tosses, or an infinite sequence of tosses. However, it is important to note that in our formulation of a probabilistic model, there is only one experiment. So, three tosses of a coin constitute a single experiment, rather than three experiments. The sample space of an experiment may consist of a finite or an infinite number of possible outcomes. Finite sample spaces are conceptually and mathematically simpler. Still, sample spaces with an infinite number of elements are quite common. For an example, consider throwing a dart on a square target and viewing the point of impact as the outcome. Choosing an Appropriate Sample Space Regardless of their number, different elements of the sample space should be distinct and mutually exclusive so that when the experiment is carried out, there is a unique outcome. For example, the sample space associated with the roll of a die cannot contain “1 or 3” as a possible outcome and also “1 or 4” as another possible outcome, because we would not be able to assign a unique outcome when the roll is a 1. A given physical situation may be modeled in several different ways, depending on the kind of questions that we are interested in. Generally, the sample space chosen for a probabilistic model must be collectively exhaustive, in the sense that no matter what happens in the experiment, we always obtain an outcome that has been included in the sample space. In addition, the sample space should have enough detail to distinguish between all outcomes of interest to the modeler, while avoiding irrelevant details.
Example 1.1. Consider two alternative games, both involving ten successive coin tosses: Game 1: We receive $1 each time a head comes up. Game 2: We receive $1 for every coin toss, up to and including the first time a head comes up. Then, we receive $2 for every coin toss, up to the second time a head comes up. More generally, the dollar amount per toss is doubled each time a head comes up. † Any collection of possible outcomes, including the entire sample space Ω and its complement, the empty set Ø, may qualify as an event. Strictly speaking, however, some sets have to be excluded. In particular, when dealing with probabilistic models involving an uncountably infinite sample space, there are certain unusual subsets for which one cannot associate meaningful probabilities. This is an intricate technical issue, involving the mathematics of measure theory. Fortunately, such pathological subsets do not arise in the problems considered in this text or in practice, and the issue can be safely ignored.
8
Sample Space and Probability
Chap. 1
In game 1, it is only the total number of heads in the ten-toss sequence that matters, while in game 2, the order of heads and tails is also important. Thus, in a probabilistic model for game 1, we can work with a sample space consisting of eleven possible outcomes, namely, 0, 1, . . . , 10. In game 2, a finer grain description of the experiment is called for, and it is more appropriate to let the sample space consist of every possible ten-long sequence of heads and tails.
Sequential Models Many experiments have an inherently sequential character, such as for example tossing a coin three times, or observing the value of a stock on five successive days, or receiving eight successive digits at a communication receiver. It is then often useful to describe the experiment and the associated sample space by means of a tree-based sequential description, as in Fig. 1.3. Sample space for a pair of rolls
Tree-based sequential description
4
1
3 2nd roll
1,1 1,2 1,3 1,4
2 Root
2
Leaves 3
1 1
2
3 1st roll
4
4
Figure 1.3: Two equivalent descriptions of the sample space of an experiment involving two rolls of a 4-sided die. The possible outcomes are all the ordered pairs of the form (i, j), where i is the result of the first roll, and j is the result of the second. These outcomes can be arranged in a 2-dimensional grid as in the figure on the left, or they can be described by the tree on the right, which reflects the sequential character of the experiment. Here, each possible outcome corresponds to a leaf of the tree and is associated with the unique path from the root to that leaf. The shaded area on the left is the event {(1, 4), (2, 4), (3, 4), (4, 4)} that the result of the second roll is 4. That same event can be described by the set of leaves highlighted on the right. Note also that every node of the tree can be identified with an event, namely, the set of all leaves downstream from that node. For example, the node labeled by a 1 can be identified with the event {(1, 1), (1, 2), (1, 3), (1, 4)} that the result of the first roll is 1.
Probability Laws Suppose we have settled on the sample space Ω associated with an experiment. Then, to complete the probabilistic model, we must introduce a probability
Sec. 1.2
Probabilistic Models
9
law. Intuitively, this specifies the “likelihood” of any outcome, or of any set of possible outcomes (an event, as we have called it earlier). More precisely, the probability law assigns to every event A, a number P(A), called the probability of A, satisfying the following axioms.
Probability Axioms 1. (Nonnegativity) P(A) ≥ 0, for every event A. 2. (Additivity) If A and B are two disjoint events, then the probability of their union satisfies P(A ∪ B) = P(A) + P(B). More generally, if the sample space has an infinite number of elements and A1 , A2 , . . . is a sequence of disjoint events, then the probability of their union satisfies P(A1 ∪ A2 ∪ · · ·) = P(A1 ) + P(A2 ) + · · · . 3. (Normalization) The probability of the entire sample space Ω is equal to 1, that is, P(Ω) = 1.
In order to visualize a probability law, consider a unit of mass which is “spread” over the sample space. Then, P(A) is simply the total mass that was assigned collectively to the elements of A. In terms of this analogy, the additivity axiom becomes quite intuitive: the total mass in a sequence of disjoint events is the sum of their individual masses. A more concrete interpretation of probabilities is in terms of relative frequencies: a statement such as P(A) = 2/3 often represents a belief that event A will occur in about two thirds out of a large number of repetitions of the experiment. Such an interpretation, though not always appropriate, can sometimes facilitate our intuitive understanding. It will be revisited in Chapter 7, in our study of limit theorems. There are many natural properties of a probability law, which have not been included in the above axioms for the simple reason that they can be derived from them. For example, note that the normalization and additivity axioms imply that 1 = P(Ω) = P(Ω ∪ Ø) = P(Ω) + P(Ø) = 1 + P(Ø), and this shows that the probability of the empty event is 0: P(Ø) = 0.
10
Sample Space and Probability
Chap. 1
As another example, consider three disjoint events A1 , A2 , and A3 . We can use the additivity axiom for two disjoint events repeatedly, to obtain P(A1 ∪ A2 ∪ A3 ) = P A1 ∪ (A2 ∪ A3 ) = P(A1 ) + P(A2 ∪ A3 ) = P(A1 ) + P(A2 ) + P(A3 ). Proceeding similarly, we obtain that the probability of the union of finitely many disjoint events is always equal to the sum of the probabilities of these events. More such properties will be considered shortly. Discrete Models Here is an illustration of how to construct a probability law starting from some common sense assumptions about a model.
Example 1.2. Consider an experiment involving a single coin toss. There are two possible outcomes, heads (H) and tails (T ). The sample space is Ω = {H, T }, and the events are {H, T }, {H}, {T }, Ø. If the coin is fair, i.e., if we believe that heads and tails are “equally likely,” we should assign equal probabilities to the two possible outcomes and specify that P({H}) = P({T }) = 0.5. The additivity axiom implies that
P {H, T } = P {H} + P {T } = 1, which is consistent with the normalization axiom. Thus, the probability law is given by
P {H, T } = 1,
P {H} = 0.5,
P {T } = 0.5,
P(Ø) = 0,
and satisfies all three axioms. Consider another experiment involving three coin tosses. The outcome will now be a 3-long string of heads or tails. The sample space is Ω = {HHH, HHT, HT H, HT T, T HH, T HT, T T H, T T T }. We assume that each possible outcome has the same probability of 1/8. Let us construct a probability law that satisfies the three axioms. Consider, as an example, the event A = {exactly 2 heads occur} = {HHT, HT H, T HH}. Using additivity, the probability of A is the sum of the probabilities of its elements:
P {HHT, HT H, T HH} = P {HHT } + P {HT H} + P {T HH} 1 1 1 + + 8 8 8 3 = . 8 =
Sec. 1.2
Probabilistic Models
11
Similarly, the probability of any event is equal to 1/8 times the number of possible outcomes contained in the event. This defines a probability law that satisfies the three axioms.
By using the additivity axiom and by generalizing the reasoning in the preceding example, we reach the following conclusion.
Discrete Probability Law If the sample space consists of a finite number of possible outcomes, then the probability law is specified by the probabilities of the events that consist of a single element. In particular, the probability of any event {s1 , s2 , . . . , sn } is the sum of the probabilities of its elements: P {s1 , s2 , . . . , sn } = P(s1 ) + P(s2 ) + · · · + P(sn ).
Note that we are using here the simpler notation P(si ) to denote the probability of the event {si }, instead of the more precise P({si }). This convention will be used throughout the remainder of the book. In the special case where the probabilities P(s1 ), . . . , P(sn ) are all the same (by necessity equal to 1/n, in view of the normalization axiom), we obtain the following.
Discrete Uniform Probability Law If the sample space consists of n possible outcomes which are equally likely (i.e., all single-element events have the same probability), then the probability of any event A is given by P(A) =
number of elements of A . n
Let us provide a few more examples of sample spaces and probability laws.
Example 1.3. Consider the experiment of rolling a pair of 4-sided dice (cf. Fig. 1.4). We assume the dice are fair, and we interpret this assumption to mean that each of the sixteen possible outcomes [pairs (i, j), with i, j = 1, 2, 3, 4], has the same probability of 1/16. To calculate the probability of an event, we must count the number of elements of the event and divide by 16 (the total number of possible
12
Sample Space and Probability
Chap. 1
outcomes). Here are some event probabilities calculated in this way:
P {the sum of the rolls is even} = 8/16 = 1/2, P {the sum of the rolls is odd} = 8/16 = 1/2,
P {the first roll is equal to the second} = 4/16 = 1/4,
P {the first roll is larger than the second} = 6/16 = 3/8,
P {at least one roll is equal to 4} = 7/16.
Sample space for a pair of rolls 4 3 2nd roll
Event = {at least one roll is a 4} Probability = 7/16
2 1 1
2
3 1st roll
4
Event = {the first roll is equal to the second} Probability = 4/16
Figure 1.4: Various events in the experiment of rolling a pair of 4-sided dice, and their probabilities, calculated according to the discrete uniform law.
Continuous Models Probabilistic models with continuous sample spaces differ from their discrete counterparts in that the probabilities of the single-element events may not be sufficient to characterize the probability law. This is illustrated in the following examples, which also indicate how to generalize the uniform probability law to the case of a continuous sample space.
Example 1.4. A wheel of fortune is continuously calibrated from 0 to 1, so the possible outcomes of an experiment consisting of a single spin are the numbers in the interval Ω = [0, 1]. Assuming a fair wheel, it is appropriate to consider all outcomes equally likely, but what is the probability of the event consisting of a single element? It cannot be positive, because then, using the additivity axiom, it would follow that events with a sufficiently large number of elements would have
Sec. 1.2
Probabilistic Models
13
probability larger than 1. Therefore, the probability of any event that consists of a single element must be 0. In this example, it makes sense to assign probability b − a to any subinterval [a, b] of [0, 1], and to calculate the probability of a more complicated set by evaluating its “length.”† This assignment satisfies the three probability axioms and qualifies as a legitimate probability law.
Example 1.5. Romeo and Juliet have a date at a given time, and each will arrive at the meeting place with a delay between 0 and 1 hour, with all pairs of delays being equally likely. The first to arrive will wait for 15 minutes and will leave if the other has not yet arrived. What is the probability that they will meet? Let us use as sample space the unit square, whose elements are the possible pairs of delays for the two of them. Our interpretation of “equally likely” pairs of delays is to let the probability of a subset of Ω be equal to its area. This probability law satisfies the three probability axioms. The event that Romeo and Juliet will meet is the shaded region in Fig. 1.5, and its probability is calculated to be 7/16.
y 1
M 1/4
0
1/4
1
x
Figure 1.5: The event M that Romeo and Juliet will arrive within 15 minutes of each other (cf. Example 1.5) is M =
(x, y)
|x − y| ≤ 1/4, 0 ≤ x ≤ 1, 0 ≤ y ≤ 1 ,
and is shaded in the figure. The area of M is 1 minus the area of the two unshaded triangles, or 1 − (3/4) · (3/4) = 7/16. Thus, the probability of meeting is 7/16.
† The “length” of a subset S of [0, 1] is the integral S dt, which is defined, for “nice” sets S, in the usual calculus sense. For unusual sets, this integral may not be well defined mathematically, but such issues belong to a more advanced treatment of the subject. Incidentally, the legitimacy of using length as a probability law hinges on the fact that the unit interval has an uncountably infinite number of elements. Indeed, if the unit interval had a countable number of elements, with each element having zero probability, the additivity axiom would imply that the whole interval has zero probability, which would contradict the normalization axiom.
14
Sample Space and Probability
Chap. 1
Properties of Probability Laws Probability laws have a number of properties, which can be deduced from the axioms. Some of them are summarized below. Some Properties of Probability Laws Consider a probability law, and let A, B, and C be events. (a) If A ⊂ B, then P(A) ≤ P(B). (b) P(A ∪ B) = P(A) + P(B) − P(A ∩ B). (c) P(A ∪ B) ≤ P(A) + P(B). (d) P(A ∪ B ∪ C) = P(A) + P(Ac ∩ B) + P(Ac ∩ B c ∩ C). These properties, and other similar ones, can be visualized and verified graphically using Venn diagrams, as in Fig. 1.6. Note that property (c) can be generalized as follows: P(A1 ∪ A2 ∪ · · · ∪ An ) ≤
n
P(Ai ).
i=1
To see this, we apply property (c) to the sets A1 and A2 ∪ · · · ∪ An , to obtain P(A1 ∪ A2 ∪ · · · ∪ An ) ≤ P(A1 ) + P(A2 ∪ · · · ∪ An ). We also apply property (c) to the sets A2 and A3 ∪ · · · ∪ An , to obtain P(A2 ∪ · · · ∪ An ) ≤ P(A2 ) + P(A3 ∪ · · · ∪ An ). We continue similarly, and finally add. Models and Reality The framework of probability theory can be used to analyze uncertainty in a wide variety of physical contexts. Typically, this involves two distinct stages. (a) In the first stage, we construct a probabilistic model, by specifying a probability law on a suitably defined sample space. There are no hard rules to guide this step, other than the requirement that the probability law conform to the three axioms. Reasonable people may disagree on which model best represents reality. In many cases, one may even want to use a somewhat “incorrect” model, if it is simpler than the “correct” one or allows for tractable calculations. This is consistent with common practice in science
Sec. 1.2
Probabilistic Models
15
and engineering, where the choice of a model often involves a tradeoff between accuracy, simplicity, and tractability. Sometimes, a model is chosen on the basis of historical data or past outcomes of similar experiments, using methods from the field of statistics.
B
U
A
Ac
B
U
A
B
B
A
(b)
(a)
B
C Bc
U
U
Ac
C
Ac
U
A
B
(c)
Figure 1.6: Visualization and verification of various properties of probability laws using Venn diagrams. If A ⊂ B, then B is the union of the two disjoint events A and Ac ∩ B; see diagram (a). Therefore, by the additivity axiom, we have P(B) = P(A) + P(Ac ∩ B) ≥ P(A), where the inequality follows from the nonnegativity axiom, and verifies property (a). From diagram (b), we can express the events A ∪ B and B as unions of disjoint events: A ∪ B = A ∪ (Ac ∩ B),
B = (A ∩ B) ∪ (Ac ∩ B).
Using the additivity axiom, we have P(A ∪ B) = P(A) + P(Ac ∩ B),
P(B) = P(A ∩ B) + P(Ac ∩ B).
Subtracting the second equality from the first and rearranging terms, we obtain P(A ∪ B) = P(A) + P(B) − P(A ∩ B), verifying property (b). Using also the fact P(A ∩ B) ≥ 0 (the nonnegativity axiom), we obtain P(A ∪ B) ≤ P(A) + P(B), verifying property (c). From diagram (c), we see that the event A ∪ B ∪ C can be expressed as a union of three disjoint events: A ∪ B ∪ C = A ∪ (Ac ∩ B) ∪ (Ac ∩ B c ∩ C), so property (d) follows as a consequence of the additivity axiom.
16
Sample Space and Probability
Chap. 1
(b) In the second stage, we work within a fully specified probabilistic model and derive the probabilities of certain events, or deduce some interesting properties. While the first stage entails the often open-ended task of connecting the real world with mathematics, the second one is tightly regulated by the rules of ordinary logic and the axioms of probability. Difficulties may arise in the latter if some required calculations are complex, or if a probability law is specified in an indirect fashion. Even so, there is no room for ambiguity: all conceivable questions have precise answers and it is only a matter of developing the skill to arrive at them. Probability theory is full of “paradoxes” in which different calculation methods seem to give different answers to the same question. Invariably though, these apparent inconsistencies turn out to reflect poorly specified or ambiguous probabilistic models. An example, Bertrand’s paradox, is shown in Fig. 1.7.
V chord at angle Φ
. chord through C
..
Φ
A
C
B
(a)
midpoint of AB (b)
Figure 1.7: This example, presented by L. F. Bertrand in 1889, illustrates the need to specify unambiguously a probabilistic model. Consider a circle and an equilateral triangle inscribed in the circle. What is the probability that the length of a randomly chosen chord of the circle is greater than the side of the triangle? The answer here depends on the precise meaning of “randomly chosen.” The two methods illustrated in parts (a) and (b) of the figure lead to contradictory results. In (a), we take a radius of the circle, such as AB, and we choose a point C on that radius, with all points being equally likely. We then draw the chord through C that is orthogonal to AB. From elementary geometry, AB intersects the triangle at the midpoint of AB, so the probability that the length of the chord is greater than the side is 1/2. In (b), we take a point on the circle, such as the vertex V , we draw the tangent to the circle through V , and we draw a line through V that forms a random angle Φ with the tangent, with all angles being equally likely. We consider the chord obtained by the intersection of this line with the circle. From elementary geometry, the length of the chord is greater that the side of the triangle if Φ is between π/3 and 2π/3. Since Φ takes values between 0 and π, the probability that the length of the chord is greater than the side is 1/3.
Sec. 1.2
Probabilistic Models
17
A Brief History of Probability • B.C. Games of chance were popular in ancient Greece and Rome, but no scientific development of the subject took place, possibly because the number system used by the Greeks did not facilitate algebraic calculations. The development of probability based on sound scientific analysis had to await the development of the modern arithmetic system by the Hindus and the Arabs in the second half of the first millennium, as well as the flood of scientific ideas generated by the Renaissance. • 16th century. Girolamo Cardano, a colorful and controversial Italian mathematician, publishes the first book describing correct methods for calculating probabilities in games of chance such as dice and cards. • 17th century. A correspondence between Fermat and Pascal touches upon several interesting probability questions, and motivates further study in the field. • 18th century. Jacob Bernoulli studies repeated coin tossing and introduces the first law of large numbers, which lays a foundation for linking theoretical probability concepts and empirical fact. Several mathematicians, such as Daniel Bernoulli, Leibnitz, Bayes, and Lagrange, make important contributions to probability theory and its use in analyzing real-world phenomena. De Moivre introduces the normal distribution and proves the first form of the central limit theorem. • 19th century. Laplace publishes an influential book that establishes the importance of probability as a quantitative field and contains many original contributions, including a more general version of the central limit theorem. Legendre and Gauss apply probability to astronomical predictions, using the method of least squares, thus pointing the way to a vast range of applications. Poisson publishes an influential book with many original contributions, including the Poisson distribution. Chebyshev, and his students Markov and Lyapunov, study limit theorems and raise the standards of mathematical rigor in the field. Throughout this period, probability theory is largely viewed as a natural science, its primary goal being the explanation of physical phenomena. Consistently with this goal, probabilities are mainly interpreted as limits of relative frequencies in the context of repeatable experiments. • 20th century. Relative frequency is abandoned as the conceptual foundation of probability theory in favor of the axiomatic system that is universally used now. Similar to other branches of mathematics, the development of probability theory from the axioms relies only on logical correctness, regardless of its relevance to physical phenomena. Nonetheless, probability theory is used pervasively in science and engineering because of its ability to describe and interpret most types of uncertain phenomena in the real world.
18
Sample Space and Probability
Chap. 1
1.3 CONDITIONAL PROBABILITY Conditional probability provides us with a way to reason about the outcome of an experiment, based on partial information. Here are some examples of situations we have in mind: (a) In an experiment involving two successive rolls of a die, you are told that the sum of the two rolls is 9. How likely is it that the first roll was a 6? (b) In a word guessing game, the first letter of the word is a “t”. What is the likelihood that the second letter is an “h”? (c) How likely is it that a person has a disease given that a medical test was negative? (d) A spot shows up on a radar screen. How likely is it that it corresponds to an aircraft? In more precise terms, given an experiment, a corresponding sample space, and a probability law, suppose that we know that the outcome is within some given event B. We wish to quantify the likelihood that the outcome also belongs to some other given event A. We thus seek to construct a new probability law, which takes into account the available knowledge and which, for any event A, gives us the conditional probability of A given B, denoted by P(A | B). We would like the conditional probabilities P(A | B) of different events A to constitute a legitimate probability law, that satisfies the probability axioms. The conditional probabilities should also be consistent with our intuition in important special cases, e.g., when all possible outcomes of the experiment are equally likely. For example, suppose that all six possible outcomes of a fair die roll are equally likely. If we are told that the outcome is even, we are left with only three possible outcomes, namely, 2, 4, and 6. These three outcomes were equally likely to start with, and so they should remain equally likely given the additional knowledge that the outcome was even. Thus, it is reasonable to let P(the outcome is 6 | the outcome is even) =
1 . 3
This argument suggests that an appropriate definition of conditional probability when all outcomes are equally likely, is given by P(A | B) =
number of elements of A ∩ B . number of elements of B
Generalizing the argument, we introduce the following definition of conditional probability: P(A ∩ B) P(A | B) = , P(B)
Sec. 1.3
Conditional Probability
19
where we assume that P(B) > 0; the conditional probability is undefined if the conditioning event has zero probability. In words, out of the total probability of the elements of B, P(A | B) is the fraction that is assigned to possible outcomes that also belong to A. Conditional Probabilities Specify a Probability Law For a fixed event B, it can be verified that the conditional probabilities P(A | B) form a legitimate probability law that satisfies the three axioms. Indeed, nonnegativity is clear. Furthermore, P(Ω | B) =
P(Ω ∩ B) P(B) = = 1, P(B) P(B)
and the normalization axiom is also satisfied. To verify the additivity axiom, we write for any two disjoint events A1 and A2 , P (A1 ∪ A2 ) ∩ B P(A1 ∪ A2 | B) = P(B) =
P((A1 ∩ B) ∪ (A2 ∩ B)) P(B)
=
P(A1 ∩ B) + P(A2 ∩ B) P(B)
=
P(A1 ∩ B) P(A2 ∩ B) + P(B) P(B)
= P(A1 | B) + P(A2 | B), where for the third equality, we used the fact that A1 ∩ B and A2 ∩ B are disjoint sets, and the additivity axiom for the (unconditional) probability law. The argument for a countable collection of disjoint sets is similar. Since conditional probabilities constitute a legitimate probability law, all general properties of probability laws remain valid. For example, a fact such as P(A ∪ C) ≤ P(A) + P(C) translates to the new fact P(A ∪ C | B) ≤ P(A | B) + P(C | B). Let us also note that since we have P(B | B) = P(B)/P(B) = 1, all of the conditional probability is concentrated on B. Thus, we might as well discard all possible outcomes outside B and treat the conditional probabilities as a probability law defined on the new universe B. Let us summarize the conclusions reached so far.
20
Sample Space and Probability
Chap. 1
Properties of Conditional Probability • The conditional probability of an event A, given an event B with P(B) > 0, is defined by P(A | B) =
P(A ∩ B) , P(B)
and specifies a new (conditional) probability law on the same sample space Ω. In particular, all known properties of probability laws remain valid for conditional probability laws. • Conditional probabilities can also be viewed as a probability law on a new universe B, because all of the conditional probability is concentrated on B. • In the case where the possible outcomes are finitely many and equally likely, we have P(A | B) =
number of elements of A ∩ B . number of elements of B
Example 1.6. We toss a fair coin three successive times. We wish to find the conditional probability P(A | B) when A and B are the events A = {more heads than tails come up},
B = {1st toss is a head}.
The sample space consists of eight sequences, Ω = {HHH, HHT, HT H, HT T, T HH, T HT, T T H, T T T }, which we assume to be equally likely. The event B consists of the four elements HHH, HHT, HT H, HT T , so its probability is P(B) =
4 . 8
The event A ∩ B consists of the three elements HHH, HHT, HT H, so its probability is 3 P(A ∩ B) = . 8 Thus, the conditional probability P(A | B) is P(A | B) =
P(A ∩ B) 3/8 3 = = . P(B) 4/8 4
Sec. 1.3
Conditional Probability
21
Because all possible outcomes are equally likely here, we can also compute P(A | B) using a shortcut. We can bypass the calculation of P(B) and P(A ∩ B), and simply divide the number of elements shared by A and B (which is 3) with the number of elements of B (which is 4), to obtain the same result 3/4. Example 1.7. A fair 4-sided die is rolled twice and we assume that all sixteen possible outcomes are equally likely. Let X and Y be the result of the 1st and the 2nd roll, respectively. We wish to determine the conditional probability P(A | B), where B = min(X, Y ) = 2 , A = max(X, Y ) = m , and m takes each of the values 1, 2, 3, 4. As in the preceding example, we can first determine the probabilities P(A∩B) and P(B) by counting the number of elements of A ∩ B and B, respectively, and dividing by 16. Alternatively, we can directly divide the number of elements of A ∩ B with the number of elements of B; see Fig. 1.8.
All outcomes equally likely Probability = 1/16 4 3 2nd roll Y 2
B
1 1
2 3 1st roll X
4
Figure 1.8: Sample space of an experiment involving two rolls of a 4-sided die. (cf. Example 1.7). The conditioning event B = {min(X, Y ) = 2} consists of the 5-element shaded set. The set A = {max(X, Y ) = m} shares with B two elements if m = 3 or m = 4, one element if m = 2, and no element if m = 1. Thus, we have
P {max(X, Y ) = m} B =
2/5, 1/5, 0,
if m = 3 or m = 4, if m = 2, if m = 1.
Example 1.8. A conservative design team, call it C, and an innovative design team, call it N, are asked to separately design a new product within a month. From past experience we know that: (a) The probability that team C is successful is 2/3. (b) The probability that team N is successful is 1/2.
22
Sample Space and Probability
Chap. 1
(c) The probability that at least one team is successful is 3/4. Assuming that exactly one successful design is produced, what is the probability that it was designed by team N? There are four possible outcomes here, corresponding to the four combinations of success and failure of the two teams: SS: both succeed, SF : C succeeds, N fails,
F F : both fail, F S: C fails, N succeeds.
We are given that the probabilities of these outcomes satisfy P(SS) + P(SF ) =
2 , 3
1 , 2
P(SS) + P(F S) =
P(SS) + P(SF ) + P(F S) =
3 . 4
From these relations, together with the normalization equation P(SS) + P(SF ) + P(F S) + P(F F ) = 1, we can obtain the probabilities of all the outcomes: P(SS) =
5 , 12
P(SF ) =
1 , 4
P(F S) =
1 , 12
P(F F ) =
1 . 4
The desired conditional probability is 1
1 12 = . P F S {SF, F S} = 1 1 4 +
4
12
Using Conditional Probability for Modeling When constructing probabilistic models for experiments that have a sequential character, it is often natural and convenient to first specify conditional probabilities and then use them to determine unconditional probabilities. The rule P(A∩B) = P(B)P(A | B), which is a restatement of the definition of conditional probability, is often helpful in this process.
Example 1.9. Radar Detection. If an aircraft is present in a certain area, a radar correctly registers its presence with probability 0.99. If it is not present, the radar falsely registers an aircraft presence with probability 0.10. We assume that an aircraft is present with probability 0.05. What is the probability of false alarm (a false indication of aircraft presence), and the probability of missed detection (nothing registers, even though an aircraft is present)? A sequential representation of the experiment is appropriate here, as shown in Fig. 1.9. Let A and B be the events A = {an aircraft is present}, B = {the radar registers an aircraft presence},
Sec. 1.3
Conditional Probability
23
and consider also their complements Ac = {an aircraft is not present}, B c = {the radar does not register an aircraft presence}. The given probabilities are recorded along the corresponding branches of the tree describing the sample space, as shown in Fig. 1.9. Each possible outcome corresponds to a leaf of the tree, and its probability is equal to the product of the probabilities associated with the branches in a path from the root to the corresponding leaf. The desired probabilities of false alarm and missed detection are P(false alarm) = P(Ac ∩ B) = P(Ac )P(B | Ac ) = 0.95 · 0.10 = 0.095, P(missed detection) = P(A ∩ B c ) = P(A)P(B c | A) = 0.05 · 0.01 = 0.0005.
Aircraft present P(A) = 0.05
P(Ac)
= 0.95
Aircraft not present
P(
A B|
P( Bc |A
9
Missed detection
)=
0.0
c)=
B P(
0.9
)=
1
0.1
|A
P( Bc |A c )=
0
False alarm
0.9
0
Figure 1.9: Sequential description of the experiment for the radar detection problem in Example 1.9.
Extending the preceding example, we have a general rule for calculating various probabilities in conjunction with a tree-based sequential description of an experiment. In particular: (a) We set up the tree so that an event of interest is associated with a leaf. We view the occurrence of the event as a sequence of steps, namely, the traversals of the branches along the path from the root to the leaf. (b) We record the conditional probabilities associated with the branches of the tree. (c) We obtain the probability of a leaf by multiplying the probabilities recorded along the corresponding path of the tree.
24
Sample Space and Probability
Chap. 1
In mathematical terms, we are dealing with an event A which occurs if and only if each one of several events A1 , . . . , An has occurred, i.e., A = A1 ∩ A2 ∩ · · · ∩ An . The occurrence of A is viewed as an occurrence of A1 , followed by the occurrence of A2 , then of A3 , etc., and it is visualized as a path with n branches, corresponding to the events A1 , . . . , An . The probability of A is given by the following rule (see also Fig. 1.10).
Multiplication Rule Assuming that all of the conditioning events have positive probability, we have P ∩ni=1 Ai = P(A1 )P(A2 | A1 )P(A3 | A1 ∩ A2 ) · · · P An | ∩n−1 i=1 Ai .
The multiplication rule can be verified by writing n P ∩ni=1 Ai P(A1 ∩ A2 ) P(A1 ∩ A2 ∩ A3 ) , P ∩i=1 Ai = P(A1 ) · · ··· P(A1 ) P(A1 ∩ A2 ) P ∩n−1 i=1 Ai
Event A1 ∩ A2 ∩ . . . ∩ An
Event A1 ∩ A2 ∩ A3
A1 P(A1)
A2 P(A2 |A1)
A3 P(A3 |A1 ∩ A2)
. . . An−1
An P(An |A1 ∩ A2 ∩. . . ∩ An−1)
Figure 1.10: Visualization of the multiplication rule. The intersection event A = A1 ∩ A2 ∩ · · · ∩ An is associated with a particular path on a tree that describes the experiment. We associate the branches of this path with the events A1 , . . . , An , and we record next to the branches the corresponding conditional probabilities. The final node of the path corresponds to the intersection event A, and its probability is obtained by multiplying the conditional probabilities recorded along the branches of the path P(A1 ∩ A2 ∩ · · · ∩ A3 ) = P(A1 )P(A2 | A1 ) · · · P(An | A1 ∩ A2 ∩ · · · ∩ An−1 ). Note that any intermediate node along the path also corresponds to some intersection event and its probability is obtained by multiplying the corresponding conditional probabilities up to that node. For example, the event A1 ∩ A2 ∩ A3 corresponds to the node shown in the figure, and its probability is P(A1 ∩ A2 ∩ A3 ) = P(A1 )P(A2 | A1 )P(A3 | A1 ∩ A2 ).
Sec. 1.3
Conditional Probability
25
and by using the definition of conditional probability to rewrite the right-hand side above as P(A1 )P(A2 | A1 )P(A3 | A1 ∩ A2 ) · · · P An | ∩n−1 i=1 Ai . For the case of just two events, A1 and A2 , the multiplication rule is simply the definition of conditional probability.
Example 1.10. Three cards are drawn from an ordinary 52-card deck without replacement (drawn cards are not placed back in the deck). We wish to find the probability that none of the three cards is a heart. We assume that at each step, each one of the remaining cards is equally likely to be picked. By symmetry, this implies that every triplet of cards is equally likely to be drawn. A cumbersome approach, that we will not use, is to count the number of all card triplets that do not include a heart, and divide it with the number of all possible card triplets. Instead, we use a sequential description of the experiment in conjunction with the multiplication rule (cf. Fig. 1.11). Define the events Ai = {the ith card is not a heart},
i = 1, 2, 3.
We will calculate P(A1 ∩ A2 ∩ A3 ), the probability that none of the three cards is a heart, using the multiplication rule P(A1 ∩ A2 ∩ A3 ) = P(A1 )P(A2 | A1 )P(A3 | A1 ∩ A2 ). We have P(A1 ) =
39 , 52
since there are 39 cards that are not hearts in the 52-card deck. Given that the first card is not a heart, we are left with 51 cards, 38 of which are not hearts, and P(A2 | A1 ) =
38 . 51
Finally, given that the first two cards drawn are not hearts, there are 37 cards which are not hearts in the remaining 50-card deck, and P(A3 | A1 ∩ A2 ) =
37 . 50
These probabilities are recorded along the corresponding branches of the tree describing the sample space, as shown in Fig. 1.11. The desired probability is now obtained by multiplying the probabilities recorded along the corresponding path of the tree: 39 38 37 P(A1 ∩ A2 ∩ A3 ) = · · . 52 51 50
26
Sample Space and Probability
Chap. 1
Not a heart 37/50 Not a heart 38/51 Not a heart 39/52
Heart 13/50
Figure 1.11: Sequential description of the experiment in the 3-card selection problem of Example 1.10.
Heart 13/51
Heart 13/52
Note that once the probabilities are recorded along the tree, the probability of several other events can be similarly calculated. For example, P(1st is not a heart and 2nd is a heart) = P(1st two are not hearts and 3rd is a heart) =
39 13 · , 52 51 39 38 13 · · . 52 51 50
Example 1.11. A class consisting of 4 graduate and 12 undergraduate students is randomly divided into 4 groups of 4. What is the probability that each group includes a graduate student? We interpret “randomly” to mean that given the assignment of some students to certain slots, any of the remaining students is equally likely to be assigned to any of the remaining slots. We then calculate the desired probability using the multiplication rule, based on the sequential description shown in Fig. 1.12. Let us denote the four graduate students by 1, 2, 3, 4, and consider the events A1 = {students 1 and 2 are in different groups}, A2 = {students 1, 2, and 3 are in different groups}, A3 = {students 1, 2, 3, and 4 are in different groups}. We will calculate P(A3 ) using the multiplication rule: P(A3 ) = P(A1 ∩ A2 ∩ A3 ) = P(A1 )P(A2 | A1 )P(A3 | A1 ∩ A2 ). We have P(A1 ) =
12 , 15
since there are 12 student slots in groups other than the one of student 1, and there are 15 student slots overall, excluding student 1. Similarly, P(A2 | A1 ) =
8 , 14
Sec. 1.3
Conditional Probability
27
since there are 8 student slots in groups other than those of students 1 and 2, and there are 14 student slots, excluding students 1 and 2. Also, P(A3 | A1 ∩ A2 ) =
4 , 13
since there are 4 student slots in groups other than those of students 1, 2, and 3, and there are 13 student slots, excluding students 1, 2, and 3. Thus, the desired probability is 4 12 8 · · , 15 14 13 and is obtained by multiplying the conditional probabilities along the corresponding path of the tree of Fig. 1.12.
Students 1, 2, 3, & 4 are in different groups 4/13 Students 1, 2, & 3 are in different groups 8/14 Students 1 & 2 are in different groups 12/15
Figure 1.12: Sequential description of the experiment in the student problem of Example 1.11.
Example 1.12. The Monty Hall Problem. This is a much discussed puzzle, based on an old American game show. You are told that a prize is equally likely to be found behind any one of three closed doors in front of you. You point to one of the doors. A friend opens for you one of the remaining two doors, after making sure that the prize is not behind it. At this point, you can stick to your initial choice, or switch to the other unopened door. You win the prize if it lies behind your final choice of a door. Consider the following strategies: (a) Stick to your initial choice. (b) Switch to the other unopened door. (c) You first point to door 1. If door 2 is opened, you do not switch. If door 3 is opened, you switch. Which is the best strategy? To answer the question, let us calculate the probability of winning under each of the three strategies. Under the strategy of no switching, your initial choice will determine whether you win or not, and the probability of winning is 1/3. This is because the prize is equally likely to be behind each door. Under the strategy of switching, if the prize is behind the initially chosen door (probability 1/3), you do not win. If it is not (probability 2/3), and given that
28
Sample Space and Probability
Chap. 1
another door without a prize has been opened for you, you will get to the winning door once you switch. Thus, the probability of winning is now 2/3, so (b) is a better strategy than (a). Consider now strategy (c). Under this strategy, there is insufficient information for determining the probability of winning. The answer depends on the way that your friend chooses which door to open. Let us consider two possibilities. Suppose that if the prize is behind door 1, your friend always chooses to open door 2. (If the prize is behind door 2 or 3, your friend has no choice.) If the prize is behind door 1, your friend opens door 2, you do not switch, and you win. If the prize is behind door 2, your friend opens door 3, you switch, and you win. If the prize is behind door 3, your friend opens door 2, you do not switch, and you lose. Thus, the probability of winning is 2/3, so strategy (c) in this case is as good as strategy (b). Suppose now that if the prize is behind door 1, your friend is equally likely to open either door 2 or 3. If the prize is behind door 1 (probability 1/3), and if your friend opens door 2 (probability 1/2), you do not switch and you win (probability 1/6). But if your friend opens door 3, you switch and you lose. If the prize is behind door 2, your friend opens door 3, you switch, and you win (probability 1/3). If the prize is behind door 3, your friend opens door 2, you do not switch and you lose. Thus, the probability of winning is 1/6 + 1/3 = 1/2, so strategy (c) in this case is inferior to strategy (b).
1.4 TOTAL PROBABILITY THEOREM AND BAYES’ RULE In this section, we explore some applications of conditional probability. We start with the following theorem, which is often useful for computing the probabilities of various events, using a “divide-and-conquer” approach.
Total Probability Theorem Let A1 , . . . , An be disjoint events that form a partition of the sample space (each possible outcome is included in exactly one of the events A1 , . . . , An ) and assume that P(Ai ) > 0, for all i. Then, for any event B, we have P(B) = P(A1 ∩ B) + · · · + P(An ∩ B) = P(A1 )P(B | A1 ) + · · · + P(An )P(B | An ).
The theorem is visualized and proved in Fig. 1.13. Intuitively, we are partitioning the sample space into a number of scenarios (events) Ai . Then, the probability that B occurs is a weighted average of its conditional probability under each scenario, where each scenario is weighted according to its (unconditional) probability. One of the uses of the theorem is to compute the probability of various events B for which the conditional probabilities P(B | Ai ) are known or
Sec. 1.4
Total Probability Theorem and Bayes’ Rule
29
B A1
A1
Bc
B
B
A2
A2 ∩ B
Bc
A3 A3
A2
A1 ∩ B
B
A3 ∩ B
Bc
Figure 1.13: Visualization and verification of the total probability theorem. The events A1 , . . . , An form a partition of the sample space, so the event B can be decomposed into the disjoint union of its intersections Ai ∩ B with the sets Ai , i.e., B = (A1 ∩ B) ∪ · · · ∪ (An ∩ B). Using the additivity axiom, it follows that P(B) = P(A1 ∩ B) + · · · + P(An ∩ B). Since, by the definition of conditional probability, we have P(Ai ∩ B) = P(Ai )P(B | Ai ), the preceding equality yields P(B) = P(A1 )P(B | A1 ) + · · · + P(An )P(B | An ). For an alternative view, consider an equivalent sequential model, as shown on the right. The probability of the leaf Ai ∩ B is the product P(Ai )P(B | Ai ) of the probabilities along the path leading to that leaf. The event B consists of the three highlighted leaves and P(B) is obtained by adding their probabilities.
easy to derive. The key is to choose appropriately the partition A1 , . . . , An , and this choice is often suggested by the problem structure. Here are some examples. Example 1.13. You enter a chess tournament where your probability of winning a game is 0.3 against half the players (call them type 1), 0.4 against a quarter of the players (call them type 2), and 0.5 against the remaining quarter of the players (call them type 3). You play a game against a randomly chosen opponent. What is the probability of winning? Let Ai be the event of playing with an opponent of type i. We have P(A1 ) = 0.5,
P(A2 ) = 0.25,
P(A3 ) = 0.25.
Let also B be the event of winning. We have P(B | A1 ) = 0.3,
P(B | A2 ) = 0.4,
P(B | A3 ) = 0.5.
30
Sample Space and Probability
Chap. 1
Thus, by the total probability theorem, the probability of winning is P(B) = P(A1 )P(B | A1 ) + P(A2 )P(B | A2 ) + P(A3 )P(B | A3 ) = 0.5 · 0.3 + 0.25 · 0.4 + 0.25 · 0.5 = 0.375.
Example 1.14. You roll a fair four-sided die. If the result is 1 or 2, you roll once more but otherwise, you stop. What is the probability that the sum total of your rolls is at least 4? Let Ai be the event that the result of first roll is i, and note that P(Ai ) = 1/4 for each i. Let B be the event that the sum total is at least 4. Given the event A1 , the sum total will be at least 4 if the second roll results in 3 or 4, which happens with probability 1/2. Similarly, given the event A2 , the sum total will be at least 4 if the second roll results in 2, 3, or 4, which happens with probability 3/4. Also, given the event A3 , you stop and the sum total remains below 4. Therefore, P(B | A1 ) =
1 , 2
P(B | A2 ) =
3 , 4
P(B | A3 ) = 0,
P(B | A4 ) = 1.
By the total probability theorem, P(B) =
1 3 1 1 9 1 1 · + · + ·0+ ·1= . 4 2 4 4 4 4 16
The total probability theorem can be applied repeatedly to calculate probabilities in experiments that have a sequential character, as shown in the following example.
Example 1.15. Alice is taking a probability class and at the end of each week she can be either up-to-date or she may have fallen behind. If she is up-to-date in a given week, the probability that she will be up-to-date (or behind) in the next week is 0.8 (or 0.2, respectively). If she is behind in a given week, the probability that she will be up-to-date (or behind) in the next week is 0.4 (or 0.6, respectively). Alice is (by default) up-to-date when she starts the class. What is the probability that she is up-to-date after three weeks? Let Ui and Bi be the events that Alice is up-to-date or behind, respectively, after i weeks. According to the total probability theorem, the desired probability P(U3 ) is given by P(U3 ) = P(U2 )P(U3 | U2 ) + P(B2 )P(U3 | B2 ) = P(U2 ) · 0.8 + P(B2 ) · 0.4. The probabilities P(U2 ) and P(B2 ) can also be calculated using the total probability theorem: P(U2 ) = P(U1 )P(U2 | U1 ) + P(B1 )P(U2 | B1 ) = P(U1 ) · 0.8 + P(B1 ) · 0.4,
Sec. 1.4
Total Probability Theorem and Bayes’ Rule
31
P(B2 ) = P(U1 )P(B2 | U1 ) + P(B1 )P(B2 | B1 ) = P(U1 ) · 0.2 + P(B1 ) · 0.6. Finally, since Alice starts her class up-to-date, we have P(U1 ) = 0.8,
P(B1 ) = 0.2.
We can now combine the preceding three equations to obtain P(U2 ) = 0.8 · 0.8 + 0.2 · 0.4 = 0.72, P(B2 ) = 0.8 · 0.2 + 0.2 · 0.6 = 0.28, and by using the above probabilities in the formula for P(U3 ): P(U3 ) = 0.72 · 0.8 + 0.28 · 0.4 = 0.688. Note that we could have calculated the desired probability P(U3 ) by constructing a tree description of the experiment, by calculating the probability of every element of U3 using the multiplication rule on the tree, and by adding. However, there are cases where the calculation based on the total probability theorem is more convenient. For example, suppose we are interested in the probability P(U20 ) that Alice is up-to-date after 20 weeks. Calculating this probability using the multiplication rule is very cumbersome, because the tree representing the experiment is 20-stages deep and has 220 leaves. On the other hand, with a computer, a sequential calculation using the total probability formulas P(Ui+1 ) = P(Ui ) · 0.8 + P(Bi ) · 0.4, P(Bi+1 ) = P(Ui ) · 0.2 + P(Bi ) · 0.6, and the initial conditions P(U1 ) = 0.8, P(B1 ) = 0.2, is very simple.
Inference and Bayes’ Rule The total probability theorem is often used in conjunction with the following celebrated theorem, which relates conditional probabilities of the form P(A | B) with conditional probabilities of the form P(B | A), in which the order of the conditioning is reversed.
Bayes’ Rule Let A1 , A2 , . . . , An be disjoint events that form a partition of the sample space, and assume that P(Ai ) > 0, for all i. Then, for any event B such that P(B) > 0, we have P(Ai )P(B | Ai ) P(B) P(Ai )P(B | Ai ) = . P(A1 )P(B | A1 ) + · · · + P(An )P(B | An )
P(Ai | B) =
32
Sample Space and Probability
Chap. 1
To verify Bayes’ rule, note that P(Ai )P(B | Ai ) and P(Ai | B)P(B) are equal, because they are both equal to P(Ai ∩ B). This yields the first equality. The second equality follows from the first by using the total probability theorem to rewrite P(B). Bayes’ rule is often used for inference. There are a number of “causes” that may result in a certain “effect.” We observe the effect, and we wish to infer the cause. The events A1 , . . . , An are associated with the causes and the event B represents the effect. The probability P(B | Ai ) that the effect will be observed when the cause Ai is present amounts to a probabilistic model of the cause-effect relation (cf. Fig. 1.14). Given that the effect B has been observed, we wish to evaluate the probability P(Ai | B) that the cause Ai is present.
Cause 1: malignant tumor
B
Cause 3: other A1
A1 A2
Effect: shade observed Cause 2: nonmalignant tumor
A3
B A2
A1 ∩ B
Bc B
A2 ∩ B
Bc B
A3 ∩ B
A3 Bc
Figure 1.14: An example of the inference context that is implicit in Bayes’ rule. We observe a shade in a person’s X-ray (this is event B, the “effect”) and we want to estimate the likelihood of three mutually exclusive and collectively exhaustive potential causes: cause 1 (event A1 ) is that there is a malignant tumor, cause 2 (event A2 ) is that there is a nonmalignant tumor, and cause 3 (event A3 ) corresponds to reasons other than a tumor. We assume that we know the probabilities P(Ai ) and P(B | Ai ), i = 1, 2, 3. Given that we see a shade (event B occurs), Bayes’ rule gives the conditional probabilities of the various causes as P(Ai | B) =
P(Ai )P(B | Ai ) , P(A1 )P(B | A1 ) + P(A2 )P(B | A2 ) + P(A3 )P(B | A3 )
i = 1, 2, 3.
For an alternative view, consider an equivalent sequential model, as shown on the right. The probability P(A1 | B) of a malignant tumor is the probability of the first highlighted leaf, which is P(A1 ∩ B), divided by the total probability of the highlighted leaves, which is P(B).
Example 1.16. Let us return to the radar detection problem of Example 1.9 and Fig. 1.9. Let A = {an aircraft is present}, B = {the radar registers an aircraft presence}.
Sec. 1.4
Total Probability Theorem and Bayes’ Rule
33
We are given that P(B | A) = 0.99,
P(A) = 0.05,
P(B | Ac ) = 0.1.
Applying Bayes’ rule, with A1 = A and A2 = Ac , we obtain P(aircraft present | radar registers) = P(A | B) =
P(A)P(B | A) P(B)
=
P(A)P(B | A) P(A)P(B | A) + P(Ac )P(B | Ac )
=
0.05 · 0.99 0.05 · 0.99 + 0.95 · 0.1
≈ 0.3426. Example 1.17. Let us return to the chess problem of Example 1.13. Here, Ai is the event of getting an opponent of type i, and P(A1 ) = 0.5,
P(A2 ) = 0.25,
P(A3 ) = 0.25.
Also, B is the event of winning, and P(B | A1 ) = 0.3,
P(B | A2 ) = 0.4,
P(B | A3 ) = 0.5.
Suppose that you win. What is the probability P(A1 | B) that you had an opponent of type 1? Using Bayes’ rule, we have P(A1 | B) = =
P(A1 )P(B | A1 ) P(A1 )P(B | A1 ) + P(A2 )P(B | A2 ) + P(A3 )P(B | A3 ) 0.5 · 0.3 0.5 · 0.3 + 0.25 · 0.4 + 0.25 · 0.5
= 0.4.
Example 1.18. The False-Positive Puzzle. A test for a certain rare disease is assumed to be correct 95% of the time: if a person has the disease, the test results are positive with probability 0.95, and if the person does not have the disease, the test results are negative with probability 0.95. A random person drawn from a certain population has probability 0.001 of having the disease. Given that the person just tested positive, what is the probability of having the disease? If A is the event that the person has the disease, and B is the event that the test results are positive, the desired probability, P(A | B), is P(A | B) = =
P(A)P(B | A) P(A)P(B | A) + P(Ac )P(B | Ac ) 0.001 · 0.95 0.001 · 0.95 + 0.999 · 0.05
= 0.0187.
34
Sample Space and Probability
Chap. 1
Note that even though the test was assumed to be fairly accurate, a person who has tested positive is still very unlikely (less than 2%) to have the disease. According to The Economist (February 20th, 1999), 80% of those questioned at a leading American hospital substantially missed the correct answer to a question of this type. Most of them said that the probability that the person has the disease is 0.95!
1.5 INDEPENDENCE We have introduced the conditional probability P(A | B) to capture the partial information that event B provides about event A. An interesting and important special case arises when the occurrence of B provides no such information and does not alter the probability that A has occurred, i.e., P(A | B) = P(A). When the above equality holds, we say that A is independent of B. Note that by the definition P(A | B) = P(A ∩ B)/P(B), this is equivalent to P(A ∩ B) = P(A)P(B). We adopt this latter relation as the definition of independence because it can be used even if P(B) = 0, in which case P(A | B) is undefined. The symmetry of this relation also implies that independence is a symmetric property; that is, if A is independent of B, then B is independent of A, and we can unambiguously say that A and B are independent events. Independence is often easy to grasp intuitively. For example, if the occurrence of two events is governed by distinct and noninteracting physical processes, such events will turn out to be independent. On the other hand, independence is not easily visualized in terms of the sample space. A common first thought is that two events are independent if they are disjoint, but in fact the opposite is true: two disjoint events A and B with P(A) > 0 and P(B) > 0 are never independent, since their intersection A ∩ B is empty and has probability 0.
Example 1.19. Consider an experiment involving two successive rolls of a 4-sided die in which all 16 possible outcomes are equally likely and have probability 1/16. (a) Are the events Ai = {1st roll results in i}, independent? We have
Bj = {2nd roll results in j},
P(Ai ∩ Bj ) = P the result of the two rolls is (i, j) =
1 , 16
P(Ai ) =
number of elements of Ai 4 = , total number of possible outcomes 16
P(Bj ) =
number of elements of Bj 4 = . total number of possible outcomes 16
Sec. 1.5
Independence
35
We observe that P(Ai ∩ Bj ) = P(Ai )P(Bj ), and the independence of Ai and Bj is verified. Thus, our choice of the discrete uniform probability law (which might have seemed arbitrary) models the independence of the two rolls. (b) Are the events A = {1st roll is a 1},
B = {sum of the two rolls is a 5},
independent? The answer here is not quite obvious. We have
P(A ∩ B) = P the result of the two rolls is (1,4) =
1 , 16
and also P(A) =
number of elements of A 4 = . total number of possible outcomes 16
The event B consists of the outcomes (1,4), (2,3), (3,2), and (4,1), and P(B) =
4 number of elements of B = . total number of possible outcomes 16
Thus, we see that P(A ∩ B) = P(A)P(B), and the events A and B are independent. (c) Are the events A = {maximum of the two rolls is 2}, B = {minimum of the two rolls is 2}, independent? Intuitively, the answer is “no” because the minimum of the two rolls conveys some information about the maximum. For example, if the minimum is 2, the maximum cannot be 1. More precisely, to verify that A and B are not independent, we calculate
P(A ∩ B) = P the result of the two rolls is (2,2) =
1 , 16
and also P(A) =
number of elements of A 3 = , total number of possible outcomes 16
P(B) =
5 number of elements of B = . total number of possible outcomes 16
We have P(A)P(B) = 15/(16)2 , so that P(A ∩ B) = P(A)P(B), and A and B are not independent.
36
Sample Space and Probability
Chap. 1
Conditional Independence We noted earlier that the conditional probabilities of events, conditioned on a particular event, form a legitimate probability law. We can thus talk about independence of various events with respect to this conditional law. In particular, given an event C, the events A and B are called conditionally independent if P(A ∩ B | C) = P(A | C)P(B | C). To derive an alternative characterization of conditional independence, we use the definition of the conditional probability and the multiplication rule, to write P(A ∩ B | C) = =
P(A ∩ B ∩ C) P(C) P(C)P(B | C)P(A | B ∩ C) P(C)
= P(B | C)P(A | B ∩ C). We now compare the preceding two expressions, and after eliminating the common factor P(B | C), assumed nonzero, we see that conditional independence is the same as the condition P(A | B ∩ C) = P(A | C). In words, this relation states that if C is known to have occurred, the additional knowledge that B also occurred does not change the probability of A. Interestingly, independence of two events A and B with respect to the unconditional probability law, does not imply conditional independence, and vice versa, as illustrated by the next two examples.
Example 1.20. Consider two independent fair coin tosses, in which all four possible outcomes are equally likely. Let H1 = {1st toss is a head}, H2 = {2nd toss is a head}, D = {the two tosses have different results}. The events H1 and H2 are (unconditionally) independent. But P(H1 | D) =
1 , 2
P(H2 | D) =
1 , 2
P(H1 ∩ H2 | D) = 0,
so that P(H1 ∩ H2 | D) = P(H1 | D)P(H2 | D), and H1 , H2 are not conditionally independent.
Sec. 1.5
Independence
37
Example 1.21. There are two coins, a blue and a red one. We choose one of the two at random, each being chosen with probability 1/2, and proceed with two independent tosses. The coins are biased: with the blue coin, the probability of heads in any given toss is 0.99, whereas for the red coin it is 0.01. Let B be the event that the blue coin was selected. Let also Hi be the event that the ith toss resulted in heads. Given the choice of a coin, the events H1 and H2 are independent, because of our assumption of independent tosses. Thus, P(H1 ∩ H2 | B) = P(H1 | B)P(H2 | B) = 0.99 · 0.99. On the other hand, the events H1 and H2 are not independent. Intuitively, if we are told that the first toss resulted in heads, this leads us to suspect that the blue coin was selected, in which case, we expect the second toss to also result in heads. Mathematically, we use the total probability theorem to obtain P(H1 ) = P(B)P(H1 | B) + P(B c )P(H1 | B c ) =
1 1 1 · 0.99 + · 0.01 = , 2 2 2
as should be expected from symmetry considerations. Similarly, we have P(H2 ) = 1/2. Now notice that P(H1 ∩ H2 ) = P(B)P(H1 ∩ H2 | B) + P(B c )P(H1 ∩ H2 | B c ) 1 1 1 = · 0.99 · 0.99 + · 0.01 · 0.01 ≈ . 2 2 2 Thus, P(H1 ∩ H2 ) = P(H1 )P(H2 ), and the events H1 and H2 are dependent, even though they are conditionally independent given B.
As mentioned earlier, if A and B are independent, the occurrence of B does not provide any new information on the probability of A occurring. It is then intuitive that the non-occurrence of B should also provide no information on the probability of A. Indeed, it can be verified that if A and B are independent, the same holds true for A and B c (see the end-of-chapter problems). We now summarize. Independence • Two events A and B are said to be independent if P(A ∩ B) = P(A)P(B). If in addition, P(B) > 0, independence is equivalent to the condition P(A | B) = P(A).
38
Sample Space and Probability
Chap. 1
• If A and B are independent, so are A and B c . • Two events A and B are said to be conditionally independent, given another event C with P(C) > 0, if P(A ∩ B | C) = P(A | C)P(B | C). If in addition, P(B ∩ C) > 0, conditional independence is equivalent to the condition P(A | B ∩ C) = P(A | C). • Independence does not imply conditional independence, and vice versa.
Independence of a Collection of Events The definition of independence can be extended to multiple events.
Definition of Independence of Several Events We say that the events A1 , A2 , . . . , An are independent if P
i∈S
Ai
=
P(Ai ),
for every subset S of {1, 2, . . . , n}.
i∈S
For the case of three events, A1 , A2 , and A3 , independence amounts to satisfying the four conditions P(A1 ∩ A2 ) = P(A1 ) P(A2 ), P(A1 ∩ A3 ) = P(A1 ) P(A3 ), P(A2 ∩ A3 ) = P(A2 ) P(A3 ), P(A1 ∩ A2 ∩ A3 ) = P(A1 ) P(A2 ) P(A3 ). The first three conditions simply assert that any two events are independent, a property known as pairwise independence. But the fourth condition is also important and does not follow from the first three. Conversely, the fourth condition does not imply the first three; see the two examples that follow.
Example 1.22. Pairwise Independence does not Imply Independence. Consider two independent fair coin tosses, and the following events:
Sec. 1.5
Independence
39
H1 = {1st toss is a head}, H2 = {2nd toss is a head}, D = {the two tosses have different results}. The events H1 and H2 are independent, by definition. To see that H1 and D are independent, we note that P(D | H1 ) =
P(H1 ∩ D) 1/4 1 = = = P(D). P(H1 ) 1/2 2
Similarly, H2 and D are independent. On the other hand, we have 1 1 1 · · = P(H1 )P(H2 )P(D), 2 2 2 and these three events are not independent. P(H1 ∩ H2 ∩ D) = 0 =
Example 1.23. The Equality P(A1 ∩ A2 ∩ A3 ) = P(A1 ) P(A2 ) P(A3 ) is not Enough for Independence. Consider two independent rolls of a fair six-sided die, and the following events: A = {1st roll is 1, 2, or 3}, B = {1st roll is 3, 4, or 5}, C = {the sum of the two rolls is 9}. We have P(A ∩ B) =
1 1 1 = · = P(A)P(B), 6 2 2
P(A ∩ C) =
1 4 1 = · = P(A)P(C), 36 2 36
P(B ∩ C) =
1 4 1 = · = P(B)P(C). 12 2 36
Thus the three events A, B, and C are not independent, and indeed no two of these events are independent. On the other hand, we have P(A ∩ B ∩ C) =
1 1 4 1 = · · = P(A)P(B)P(C). 36 2 2 36
The intuition behind the independence of a collection of events is analogous to the case of two events. Independence means that the occurrence or non-occurrence of any number of the events from that collection carries no information on the remaining events or their complements. For example, if the events A1 , A2 , A3 , A4 are independent, one obtains relations such as P(A1 ∪ A2 | A3 ∩ A4 ) = P(A1 ∪ A2 ) or P(A1 ∪ Ac2 | Ac3 ∩ A4 ) = P(A1 ∪ Ac2 ); see the end-of-chapter problems.
40
Sample Space and Probability
Chap. 1
Reliability In probabilistic models of complex systems involving several components, it is often convenient to assume that the behaviors of the components are uncoupled (independent). This typically simplifies the calculations and the analysis, as illustrated in the following example.
Example 1.24. Network Connectivity. A computer network connects two nodes A and B through intermediate nodes C, D, E, F, as shown in Fig. 1.15(a). For every pair of directly connected nodes, say i and j, there is a given probability pij that the link from i to j is up. We assume that link failures are independent of each other. What is the probability that there is a path connecting A and B in which all links are up?
0.8 0.9
1
E
0.95
A
1
F 0.85
0.75
0.95
D
3
Series connection
0.9
C
2
B
2 3 Parallel connection
(a)
(b)
Figure 1.15: (a) Network for Example 1.24. The number next to each link indicates the probability that the link is up. (b) Series and parallel connections of three components in a reliability problem.
This is a typical problem of assessing the reliability of a system consisting of components that can fail independently. Such a system can often be divided into subsystems, where each subsystem consists in turn of several components that are connected either in series or in parallel; see Fig. 1.15(b). Let a subsystem consist of components 1, 2, . . . , m, and let pi be the probability that component i is up (“succeeds”). Then, a series subsystem succeeds if all of its components are up, so its probability of success is the product of the probabilities of success of the corresponding components, i.e., P(series subsystem succeeds) = p1 p2 · · · pm . A parallel subsystem succeeds if any one of its components succeeds, so its probability of failure is the product of the probabilities of failure of the corresponding components, i.e., P(parallel subsystem succeeds) = 1 − P(parallel subsystem fails) = 1 − (1 − p1 )(1 − p2 ) · · · (1 − pm ).
Sec. 1.5
Independence
41
Returning now to the network of Fig. 1.15(a), we can calculate the probability of success (a path from A to B is available) sequentially, using the preceding formulas, and starting from the end. Let us use the notation X → Y to denote the event that there is a (possibly indirect) connection from node X to node Y . Then,
P(C → B) = 1 − 1 − P(C → E and E → B)
1 − P(C → F and F → B)
= 1 − (1 − pCE pEB )(1 − pCF pF B ) = 1 − (1 − 0.8 · 0.9)(1 − 0.95 · 0.85) = 0.946, P(A → C and C → B) = P(A → C)P(C → B) = 0.9 · 0.946 = 0.851, P(A → D and D → B) = P(A → D)P(D → B) = 0.75 · 0.95 = 0.712, and finally we obtain the desired probability
P(A → B) = 1 − 1 − P(A → C and C → B)
1 − P(A → D and D → B)
= 1 − (1 − 0.851)(1 − 0.712) = 0.957.
Independent Trials and the Binomial Probabilities If an experiment involves a sequence of independent but identical stages, we say that we have a sequence of independent trials. In the special case where there are only two possible results at each stage, we say that we have a sequence of independent Bernoulli trials. The two possible results can be anything, e.g., “it rains” or “it doesn’t rain,” but we will often think in terms of coin tosses and refer to the two results as “heads” (H) and “tails” (T ). Consider an experiment that consists of n independent tosses of a coin, in which the probability of heads is p, where p is some number between 0 and 1. In this context, independence means that the events A1 , A2 , . . . , An are independent, where Ai = {ith toss is a head}. We can visualize independent Bernoulli trials by means of a sequential description, as shown in Fig. 1.16 for the case where n = 3. The conditional probability of any toss being a head, conditioned on the results of any preceding tosses is p, because of independence. Thus, by multiplying the conditional probabilities along the corresponding path of the tree, we see that any particular outcome (3-long sequence of heads and tails) that involves k heads and 3 − k tails has probability pk (1 − p)3−k . This formula extends to the case of a general number n of tosses. We obtain that the probability of any particular n-long sequence that contains k heads and n − k tails is pk (1 − p)n−k , for all k from 0 to n. Let us now consider the probability p(k) = P(k heads come up in an n-toss sequence),
42
Sample Space and Probability HHH
Prob = p 3
HH T
Prob = p2(1 − p)
HTH
Prob = p2(1 − p)
HT T
Prob = p(1 − p) 2
p
THH
Prob = p2(1 − p)
1− p
THT
Prob = p(1 − p) 2
p
T TH
Prob = p(1 − p) 2
TTT
Prob = (1 − p) 3
p
Chap. 1
HH p
1− p
H
p
1− p
p
HT 1− p TH 1− p
p T 1− p TT
1− p
Figure 1.16: Sequential description of an experiment involving three independent tosses of a coin. Along the branches of the tree, we record the corresponding conditional probabilities, and by the multiplication rule, the probability of obtaining a particular 3-toss sequence is calculated by multiplying the probabilities recorded along the corresponding path of the tree.
which will play an important role later. We showed above that the probability of any given sequence that contains k heads is pk (1 − p)n−k , so we have n k p(k) = p (1 − p)n−k , k where we use the notation n = number of distinct n-toss sequences that contain k heads. k The numbers nk (called “n choose k”) are known as the binomial coefficients, while the probabilities p(k) are known as the binomial probabilities. Using a counting argument, to be given in Section 1.6, we can show that n n! = , k = 0, 1, . . . , n, k k! (n − k)! where for any positive integer i we have i! = 1 · 2 · · · (i − 1) · i, and, by convention, 0! = 1. An alternative verification is sketched in the end-ofchapter problems. Note that the binomial probabilities p(k) must add to 1, thus
Sec. 1.6
Counting
43
showing the binomial formula n n k=0
k
pk (1 − p)n−k = 1.
Example 1.25. Grade of Service. An internet service provider has installed c modems to serve the needs of a population of n customers. It is estimated that at a given time, each customer will need a connection with probability p, independently of the others. What is the probability that there are more customers needing a connection than there are modems? Here we are interested in the probability that more than c customers simultaneously need a connection. It is equal to n
p(k),
k=c+1
where p(k) =
n k p (1 − p)n−k k
are the binomial probabilities. For instance, if n = 100, p = 0.1, and c = 15, the desired probability turns out to be 0.0399. This example is typical of problems of sizing a facility to serve the needs of a homogeneous population, consisting of independently acting customers. The problem is to select the facility size to achieve a certain threshold probability (sometimes called grade of service) that no user is left unserved.
1.6 COUNTING The calculation of probabilities often involves counting the number of outcomes in various events. We have already seen two contexts where such counting arises. (a) When the sample space Ω has a finite number of equally likely outcomes, so that the discrete uniform probability law applies. Then, the probability of any event A is given by P(A) =
number of elements of A , number of elements of Ω
and involves counting the elements of A and of Ω. (b) When we want to calculate the probability of an event A with a finite number of equally likely outcomes, each of which has an already known probability p. Then the probability of A is given by P(A) = p · (number of elements of A),
44
Sample Space and Probability
Chap. 1
and involves counting the number of elements of A. An example of this type is the calculation of the probability of k heads in n coin tosses (the binomial probabilities). We saw in the preceding section that the probability of each distinct sequence involving k heads is easily obtained, but the calculation of the number of all such sequences is somewhat intricate, as will be seen shortly. While counting is in principle straightforward, it is frequently challenging; the art of counting constitutes a large portion of the field of combinatorics. In this section, we present the basic principle of counting and apply it to a number of situations that are often encountered in probabilistic models.
The Counting Principle The counting principle is based on a divide-and-conquer approach, whereby the counting is broken down into stages through the use of a tree. For example, consider an experiment that consists of two consecutive stages. The possible results of the first stage are a1 , a2 , . . . , am ; the possible results of the second stage are b1 , b2 , . . . , bn . Then, the possible results of the two-stage experiment are all possible ordered pairs (ai , bj ), i = 1, . . . , m, j = 1, . . . , n. Note that the number of such ordered pairs is equal to mn. This observation can be generalized as follows (see also Fig. 1.17).
...... ... ... ......
Leaves
...... n1 choices
n2 n3 n4 choices choices choices
Stage 1 Stage 2 Stage 3 Stage 4
Figure 1.17: Illustration of the basic counting principle. The counting is carried out in r stages (r = 4 in the figure). The first stage has n1 possible results. For every possible result of the first i − 1 stages, there are ni possible results at the ith stage. The number of leaves is n1 n2 · · · nr . This is the desired count.
Sec. 1.6
Counting
45
The Counting Principle Consider a process that consists of r stages. Suppose that: (a) There are n1 possible results at the first stage. (b) For every possible result of the first stage, there are n2 possible results at the second stage. (c) More generally, for any possible results of the first i − 1 stages, there are ni possible results at the ith stage. Then, the total number of possible results of the r-stage process is n 1 n2 · · · n r .
Example 1.26. The Number of Telephone Numbers. A telephone number is a 7-digit sequence, but the first digit has to be different from 0 or 1. How many distinct telephone numbers are there? We can visualize the choice of a sequence as a sequential process, where we select one digit at a time. We have a total of 7 stages, and a choice of one out of 10 elements at each stage, except for the first stage where we only have 8 choices. Therefore, the answer is 6 8 · 10 · 10 · · · 10 = 8 · 10 . 6
times
Example 1.27. The Number of Subsets of an n-Element Set. Consider an n-element set {s1 , s2 , . . . , sn }. How many subsets does it have (including itself and the empty set)? We can visualize the choice of a subset as a sequential process where we examine one element at a time and decide whether to include it in the set or not. We have a total of n stages, and a binary choice at each stage. Therefore the number of subsets is n 2 · 2· · · 2 = 2 . n
times
It should be noted that the Counting Principle remains valid even if each first-stage result leads to a different set of potential second-stage results, etc. The only requirement is that the number of possible second-stage results is constant, regardless of the first-stage result. In what follows, we will focus primarily on two types of counting arguments that involve the selection of k objects out of a collection of n objects. If the order of selection matters, the selection is called a permutation, and otherwise, it is called a combination. We will then discuss a more general type of counting, involving a partition of a collection of n objects into multiple subsets.
46
Sample Space and Probability
Chap. 1
k-permutations We start with n distinct objects, and let k be some positive integer, with k ≤ n. We wish to count the number of different ways that we can pick k out of these n objects and arrange them in a sequence, i.e., the number of distinct k-object sequences. We can choose any of the n objects to be the first one. Having chosen the first, there are only n − 1 possible choices for the second; given the choice of the first two, there only remain n − 2 available objects for the third stage, etc. When we are ready to select the last (the kth) object, we have already chosen k − 1 objects, which leaves us with n − (k − 1) choices for the last one. By the Counting Principle, the number of possible sequences, called k-permutations, is n(n − 1) · · · (n − k + 1)(n − k) · · · 2 · 1 n(n − 1) · · · (n − k + 1) = (n − k) · · · 2 · 1 n! = . (n − k)! In the special case where k = n, the number of possible sequences, simply called permutations, is n(n − 1)(n − 2) · · · 2 · 1 = n!. (Let k = n in the formula for the number of k-permutations, and recall the convention 0! = 1.)
Example 1.28. Let us count the number of words that consist of four distinct letters. This is the problem of counting the number of 4-permutations of the 26 letters in the alphabet. The desired number is n! 26! = = 26 · 25 · 24 · 23 = 358, 800. (n − k)! 22!
The count for permutations can be combined with the Counting Principle to solve more complicated counting problems.
Example 1.29. You have n1 classical music CDs, n2 rock music CDs, and n3 country music CDs. In how many different ways can you arrange them so that the CDs of the same type are contiguous? We break down the problem in two stages, where we first select the order of the CD types, and then the order of the CDs of each type. There are 3! ordered sequences of the types of CDs (such as classical/rock/country, rock/country/classical, etc.), and there are n1 ! (or n2 !, or n3 !) permutations of the classical (or rock, or country, respectively) CDs. Thus for each of the 3! CD type sequences, there are n1 ! n2 ! n3 ! arrangements of CDs, and the desired total number is 3! n1 ! n2 ! n3 !.
Sec. 1.6
Counting
47
Combinations There are n people and we are interested in forming a committee of k. How many different committees are possible? More abstractly, this is the same as the problem of counting the number of k-element subsets of a given n-element set. Notice that forming a combination is different than forming a k-permutation, because in a combination there is no ordering of the selected elements. Thus for example, whereas the 2-permutations of the letters A, B, C, and D are AB, AC, AD, BA, BC, BD, CA, CB, CD, DA, DB, DC, the combinations of two out of these four letters are AB, AC, AD, BC, BD, CD. (Since the elements of a combination are unordered, BA is not viewed as being distinct from AB.) To count the number of combinations, we observe that selecting a kpermutation is the same as first selecting a combination of k items and then ordering them. Since there are k! ways of ordering the k selected items, we see that the number n!/(n − k)! of k-permutations is equal to the number of combinations times k!. Hence, the number of possible combinations, is equal to n! . k! (n − k)! Let us now relate the above expression to the binomial coefficient, which was denoted by nk and was defined in the preceding section as the number of n-toss sequences with k heads. We note that specifying an n-toss sequence with k heads is the same as selecting k elements (those that correspond to heads) out of the n-element set of tosses, i.e., a combination of k out of n objects. Hence, the binomial coefficient is also given by the same formula and we have n n! = . k k! (n − k)!
Example 1.30. The number of combinations of two out of the four letters A, B, C, and D is found by letting n = 4 and k = 2. It is
4 2
=
4! = 6, 2! 2!
consistently with the listing given earlier.
48
Sample Space and Probability
Chap. 1
It is worth observing that counting arguments sometimes lead to formulas that are rather difficult to derive algebraically. One example is the binomial formula n n k p (1 − p)n−k = 1, k k=0
discussed in Section 1.5. In the special case where p = 1/2, this formula becomes n n k=0
k
= 2n ,
and admits the following simple interpretation. Since nk is the number of k element subsets of a given n-element subset, the sum over k of nk counts the number of subsets of all possible cardinalities. It is therefore equal to the number of all subsets of an n-element set, which is 2n . Partitions Recall that a combination is a choice of k elements out of an n-element set without regard to order. Thus, a combination can be viewed as a partition of the set in two: one part contains k elements and the other contains the remaining n − k. We now generalize by considering partitions into more than two subsets. We are given an n-element set and nonnegative integers n1 , n2 , . . . , nr , whose sum is equal to n. We consider partitions of the set into r disjoint subsets, with the ith subset containing exactly ni elements. Let us count in how many ways this can be done. We form the subsets one at a time. We have nn1 ways of forming the first subset. Having formed the first subset, we are left with n − n1 elements. need We to choose n2 of them in order to form the second subset, and we have n−n1 choices, etc. Using the Counting Principle for this r-stage process, the n2 total number of choices is n n − n1 n − n1 − n2 n − n1 − · · · − nr−1 ··· , n1 n2 n3 nr which is equal to n! (n − n1 )! (n − n1 − · · · − nr−1 )! · ··· . n1 ! (n − n1 )! n2 ! (n − n1 − n2 )! (n − n1 − · · · − nr−1 − nr )! nr ! We note that several terms cancel and we are left with n! . n 1 ! n 2 ! · · · nr !
Sec. 1.6
Counting
49
This is called the multinomial coefficient and is usually denoted by n . n1 , n2 , . . . , nr Example 1.31. Anagrams. How many different words (letter sequences) can be obtained by rearranging the letters in the word TATTOO? There are six positions to be filled by the available letters. Each rearrangement corresponds to a partition of the set of the six positions into a group of size 3 (the positions that get the letter T), a group of size 1 (the position that gets the letter A), and a group of size 2 (the positions that get the letter O). Thus, the desired number is 6! 1·2·3·4·5·6 = = 60. 1! 2! 3! 1·1·2·1·2·3 It is instructive to derive this answer using an alternative argument. (This argument can also be used to rederive the multinomial coefficient formula; see the end-of-chapter problems.) Let us write TATTOO in the form T1 AT2 T3 O1 O2 pretending for a moment that we are dealing with 6 distinguishable objects. These 6 objects can be rearranged in 6! different ways. However, any of the 3! possible permutations of T1 , T1 , and T3 , as well as any of the 2! possible permutations of O1 and O2 , lead to the same word. Thus, when the subscripts are removed, there are only 6!/(3! 2!) different words.
Example 1.32. A class consisting of 4 graduate and 12 undergraduate students is randomly divided into four groups of 4. What is the probability that each group includes a graduate student? This is the same as Example 1.11 in Section 1.3, but we will now obtain the answer using a counting argument. We first determine the nature of the sample space. A typical outcome is a particular way of partitioning the 16 students into four groups of 4. We take the term “randomly” to mean that every possible partition is equally likely, so that the probability question can be reduced to one of counting. According to our earlier discussion, there are
16 4, 4, 4, 4
=
16! 4! 4! 4! 4!
different partitions, and this is the size of the sample space. Let us now focus on the event that each group contains a graduate student. Generating an outcome with this property can be accomplished in two stages: (a) Take the four graduate students and distribute them to the four groups; there are four choices for the group of the first graduate student, three choices for the second, two for the third. Thus, there is a total of 4! choices for this stage. (b) Take the remaining 12 undergraduate students and distribute them to the four groups (3 students in each). This can be done in
12 3, 3, 3, 3
=
12! 3! 3! 3! 3!
50
Sample Space and Probability
Chap. 1
different ways. By the Counting Principle, the event of interest can occur in 4! 12! 3! 3! 3! 3! different ways. The probability of this event is 4! 12! 3! 3! 3! 3! . 16! 4! 4! 4! 4! After some cancellations, we find that this is equal to 12 · 8 · 4 , 15 · 14 · 13 consistent with the answer obtained in Example 1.11.
Here is a summary of all the counting results we have developed.
Summary of Counting Results • Permutations of n objects: n!. • k-permutations of n objects: n!/(n − k)!. n n! • Combinations of k out of n objects: = . k k! (n − k)! • Partitions of n objects into r groups, with the ith group having ni objects: n n! = . n1 , n2 , . . . , nr n1 ! n2 ! · · · nr !
1.7 SUMMARY AND DISCUSSION A probability problem can usually be broken down into a few basic steps: (a) The description of the sample space, that is, the set of possible outcomes of a given experiment. (b) The (possibly indirect) specification of the probability law (the probability of each event). (c) The calculation of probabilities and conditional probabilities of various events of interest.
Sec. 1.7
Summary and Discussion
51
The probabilities of events must satisfy the nonnegativity, additivity, and normalization axioms. In the important special case where the set of possible outcomes is finite, one can just specify the probability of each outcome and obtain the probability of any event by adding the probabilities of the elements of the event. Given a probability law, we are often interested in conditional probabilities, which allow us to reason based on partial information about the outcome of the experiment. We can view conditional probabilities as probability laws of a special type, under which only outcomes contained in the conditioning event can have positive conditional probability. Conditional probabilities can be derived from the (unconditional) probability law using the definition P(A | B) = P(A ∩ B)/P(B). However, the reverse process is often convenient, that is, first specify some conditional probabilities that are natural for the real situation that we wish to model, and then use them to derive the (unconditional) probability law. We have illustrated through examples three methods for calculating probabilities: (a) The counting method. This method applies to the case where the number of possible outcomes is finite, and all outcomes are equally likely. To calculate the probability of an event, we count the number of elements of the event and divide by the number of elements of the sample space. (b) The sequential method. This method applies when the experiment has a sequential character, and suitable conditional probabilities are specified or calculated along the branches of the corresponding tree (perhaps using the counting method). The probabilities of various events are then obtained by multiplying conditional probabilities along the corresponding paths of the tree, using the multiplication rule. (c) The divide-and-conquer method. Here, the probabilities P(B) of various events B are obtained from conditional probabilities P(B | Ai ), where the Ai are suitable events that form a partition of the sample space and have known probabilities P(Ai ). The probabilities P(B) are then obtained by using the total probability theorem. Finally, we have focused on a few side topics that reinforce our main themes. We have discussed the use of Bayes’ rule in inference, which is an important application context. We have also discussed some basic principles of counting and combinatorics, which are helpful in applying the counting method.
52
Sample Space and Probability
Chap. 1
PROBLEMS
SECTION 1.1. Sets Problem 1. Consider rolling a six-sided die. Let A be the set of outcomes where the roll is an even number. Let B be the set of outcomes where the roll is greater than 3. Calculate and compare the sets on both sides of De Morgan’s laws
(A ∪ B)c = Ac ∩ B c ,
A∩B
c
= Ac ∪ B c .
Problem 2. Let A and B be two sets. (a) Show that Ac = (Ac ∩ B) ∪ (Ac ∩ B c ),
B c = (A ∩ B c ) ∪ (Ac ∩ B c ).
(b) Show that (A ∩ B)c = (Ac ∩ B) ∪ (Ac ∩ B c ) ∪ (A ∩ B c ). (c) Consider rolling a six-sided die. Let A be the set of outcomes where the roll is an odd number. Let B be the set of outcomes where the roll is less than 4. Calculate the sets on both sides of the equality in part (b), and verify that the equality holds. Problem 3.* Prove the identity
∞ A ∪ ∩∞ n=1 Bn = ∩n=1 (A ∪ Bn ).
Solution. If x belongs to the set on the left, there are two possibilities. Either x ∈ A, in which case x belongs to all of the sets A ∪ Bn , and therefore belongs to the set on the right. Alternatively, x belongs to all of the sets Bn in which case, it belongs to all of the sets A ∪ Bn , and therefore again belongs to the set on the right. Conversely, if x belongs to the set on the right, then it belongs to A ∪ Bn for all n. If x belongs to A, then it belongs to the set on the left. Otherwise, x must belong to every set Bn and again belongs to the set on the left. Problem 4.* Cantor’s diagonalization argument. Show that the unit interval [0, 1] is uncountable, i.e., its elements cannot be arranged in a sequence. Solution. Any number x in [0, 1] can be represented in terms of its decimal expansion, e.g., 1/3 = 0.3333 · · ·. Note that most numbers have a unique decimal expansion, but there are a few exceptions. For example, 1/2 can be represented as 0.5000 · · · or as 0.49999 · · ·. It can be shown that this is the only kind of exception, i.e., decimal expansions that end with an infinite string of zeroes or an infinite string of nines.
Problems
53
Suppose, to obtain a contradiction, that the elements of [0, 1] can be arranged in a sequence x1 , x2 , x3 , . . ., so that every element of [0, 1] appears in the sequence. Consider the decimal expansion of xn : xn = 0.a1n a2n a3n · · · , where each digit ain belongs to {0, 1, . . . , 9}. Consider now a number y constructed as follows. The nth digit of y can be 1 or 2, and is chosen so that it is different from the nth digit of xn . Note that y has a unique decimal expansion since it does not end with an infinite sequence of zeroes or nines. The number y differs from each xn , since it has a different nth digit. Therefore, the sequence x1 , x2 , . . . does not exhaust the elements of [0, 1], contrary to what was assumed. The contradiction establishes that the set [0, 1] is uncountable.
SECTION 1.2. Probabilistic Models Problem 5. Out of the students in a class, 60% are geniuses, 70% love chocolate, and 40% fall into both categories. Determine the probability that a randomly selected student is neither a genius nor a chocolate lover. Problem 6. A six-sided die is loaded in a way that each even face is twice as likely as each odd face. All even faces are equally likely, as are all odd faces. Construct a probabilistic model for a single roll of this die and find the probability that the outcome is less than 4. Problem 7. A four-sided die is rolled repeatedly, until the first time (if ever) that an even number is obtained. What is the sample space for this experiment? Problem 8.* Bonferroni’s inequality. (a) Prove that for any two events A and B, we have P(A ∩ B) ≥ P(A) + P(B) − 1. (b) Generalize to the case of n events A1 , A2 , . . . , An , by showing that P(A1 ∩ A2 ∩ · · · ∩ An ) ≥ P(A1 ) + P(A2 ) + · · · + P(An ) − (n − 1). Solution. We have P(A ∪ B) = P(A) + P(B) − P(A ∩ B) and P(A ∪ B) ≤ 1, which implies part (a). For part (b), we use De Morgan’s law to obtain
1 − P(A1 ∩ · · · ∩ An ) = P (A1 ∩ · · · ∩ An )c
= P(Ac1 ∪ · · · ∪ Acn ) ≤ P(Ac1 ) + · · · + P(Acn )
= 1 − P(A1 ) + · · · + 1 − P(An ) = n − P(A1 ) − · · · − P(An ).
54
Sample Space and Probability
Chap. 1
Problem 9.* The inclusion-exclusion formula. Show the following generalizations of the formula P(A ∪ B) = P(A) + P(B) − P(A ∩ B). (a) Let A, B, and C be events. Then, P(A∪B∪C) = P(A)+P(B)+P(C)−P(A∩B)−P(B∩C)−P(A∩C)+P(A∩B∩C). (b) Let A1 , A2 , . . . , An be events. Let S1 = {i | 1 ≤ i ≤ n}, S2 = {(i1 , i2 ) | 1 ≤ i1 < i2 ≤ n}, and more generally, let Sm be the set of all m-tuples (i1 , . . . , im ) of indices that satisfy 1 ≤ i1 < i2 < · · · < im ≤ n. Then, P (∪n k=1 Ak ) =
i∈S1
+
P(Ai ) −
P(Ai1 ∩ Ai2 )
(i1 ,i2 )∈S2
P(Ai1 ∩ Ai2 ∩ Ai3 ) − · · · + (−1)n−1 P (∩n k=1 Ak ) .
(i1 ,i2 ,i3 )∈S3
Solution. (a) We use the formulas P(X ∪ Y ) = P(X) + P(Y ) − P(X ∩ Y ) and (A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C). We have
P(A ∪ B ∪ C) = P(A ∪ B) + P(C) − P (A ∪ B) ∩ C
= P(A ∪ B) + P(C) − P (A ∩ C) ∪ (B ∩ C)
= P(A ∪ B) + P(C) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C) = P(A) + P(B) − P(A ∩ B) + P(C) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(B ∩ C) − P(A ∩ C) + P(A ∩ B ∩ C). (b) Use induction and verify the main induction step by emulating the derivation of part (a). For a different approach, see the problems at the end of Chapter 2. Problem 10.* Continuity property of probabilities. (a) Let A1 , A2 , . . . be an infinite sequence of events, which is “monotonically increasing,” meaning that An ⊂ An+1 for every n. Let A = ∪∞ n=1 An . Show that P(A) = limn→∞ P(An ). Hint: Express the event A as a union of countably many disjoint sets. (b) Suppose now that the events are “monotonically decreasing,” i.e., An+1 ⊂ An for every n. Let A = ∩∞ n=1 An . Show that P(A) = limn→∞ P(An ). Hint: Apply the result of part (a) to the complements of the events. (c) Consider a probabilistic model whose sample space is the real line. Show that
P [0, ∞) = lim P [0, n] n→∞
and
lim P [n, ∞) = 0.
n→∞
Problems
55
Solution. (a) Let B1 = A1 and, for n ≥ 2, Bn = An ∩ Acn−1 . The events Bn are ∞ disjoint, and we have ∪n k=1 Bk = An , and ∪k=1 Bk = A. We apply the additivity axiom to obtain P(A) =
∞
P(Bk ) = lim
n
n→∞
k=1
P(Bk ) = lim P(∪n k=1 Bk ) = lim P(An ). n→∞
n→∞
k=1
(b) Let Cn = Acn and C = Ac . Since An+1 ⊂ An , we obtain Cn ⊂ Cn+1 , and the events c ∞ c ∞ Cn are increasing. Furthermore, C = Ac = (∩∞ n=1 An ) = ∪n=1 An = ∪n=1 Cn . Using the result from part (a) for the sequence Cn , we obtain
1 − P(A) = P(Ac ) = P(C) = lim P(Cn ) = lim 1 − P(An ) , n→∞
n→∞
from which we conclude that P(A) = limn→∞ P(An ). (c) For the first equality, use the result from part (a) with An = [0, n] and A = [0, ∞). For the second, use the result from part (b) with An = [n, ∞) and A = ∩∞ n=1 An = Ø.
SECTION 1.3. Conditional Probability Problem 11. We roll two fair 6-sided dice. Each one of the 36 possible outcomes is assumed to be equally likely. (a) Find the probability that doubles are rolled. (b) Given that the roll results in a sum of 4 or less, find the conditional probability that doubles are rolled. (c) Find the probability that at least one die roll is a 6. (d) Given that the two dice land on different numbers, find the conditional probability that at least one die roll is a 6. Problem 12. A coin is tossed twice. Alice claims that the event of two heads is at least as likely if we know that the first toss is a head than if we know that at least one of the tosses is a head. Is she right? Does it make a difference if the coin is fair or unfair? How can we generalize Alice’s reasoning? Problem 13. We are given three coins: one has heads in both faces, the second has tails in both faces, and the third has a head in one face and a tail in the other. We choose a coin at random, toss it, and it comes heads. What is the probability that the opposite face is tails? Problem 14. A batch of one hundred items is inspected by testing four randomly selected items. If one of the four is defective, the batch is rejected. What is the probability that the batch is accepted if it contains five defectives? Problem 15. Let A and B be events. Show that P(A ∩ B | B) = P(A | B), assuming that P(B) > 0.
56
Sample Space and Probability
Chap. 1
SECTION 1.4. Total Probability Theorem and Bayes’ Rule Problem 16. Alice searches for her term paper in her filing cabinet, which has n drawers. She knows that she left her term paper in drawer j with probability pj > 0. The drawers are so messy that even if she correctly guesses that the term paper is in drawer i, the probability that she finds it is only di . Alice searches in a particular drawer, say drawer i, but the search is unsuccessful. Conditioned on this event, show that the probability that her paper is in drawer j, is given by pj , 1 − pi di
if j = i,
pi (1 − di ) , 1 − pi di
if j = i.
Problem 17. How an inferior player with a superior strategy can gain an advantage. Boris is about to play a two-game chess match with an opponent, and wants to find the strategy that maximizes his winning chances. Each game ends with either a win by one of the players, or a draw. If the score is tied at the end of the two games, the match goes into sudden-death mode, and the players continue to play until the first time one of them wins a game (and the match). Boris has two playing styles, timid and bold, and he can choose one of the two at will in each game, no matter what style he chose in previous games. With timid play, he draws with probability pd > 0, and he loses with probability 1 − pd . With bold play, he wins with probability pw , and he loses with probability 1 − pw . Boris will always play bold during sudden death, but may switch style between games 1 and 2. (a) Find the probability that Boris wins the match for each of the following strategies: (i) Play bold in both games 1 and 2. (ii) Play timid in both games 1 and 2. (iii) Play timid whenever he is ahead in the score, and play bold otherwise. (b) Assume that pw < 1/2, so Boris is the worse player, regardless of the playing style he adopts. Show that with the strategy in (iii) above, and depending on the values of pw and pd , Boris may have a better than a 50-50 chance to win the match. How do you explain this advantage? Problem 18. Two players take turns removing a ball from a jar that initially contains m white and n black balls. The first player to remove a white ball wins. Develop a recursive formula that allows the convenient computation of the probability that the starting player wins. Problem 19. Each of k jars contains m white and n black balls. A ball is randomly chosen from jar 1 and transferred to jar 2, then a ball is randomly chosen from jar 2 and transferred to jar 3, etc. Finally, a ball is randomly chosen from jar k. Show that the probability that the last ball is white is the same as the probability that the first ball is white, i.e., it is m/(m + n). Problem 20. We have two jars each containing initially n balls. We perform four successive ball exchanges. In each exchange, we pick simultaneously and at random a ball from each jar and move it to the other jar. What is the probability that at the end of the four exchanges all the balls will be in the jar where they started?
Problems
57
Problem 21. The prisoner’s dilemma. Two out of three prisoners are to be released. One of the prisoners asks a guard to tell him the identity of a prisoner other than himself that will be released. The guard refuses with the following rationale: at your present state of knowledge, your probability of being released is 2/3, but after you know my answer, your probability of being released will become 1/2, since there will be two prisoners (including yourself) whose fate is unknown and exactly one of the two will be released. What is wrong with the guard’s reasoning? Problem 22. A two-envelopes puzzle. You are handed two envelopes, and you know that each contains a positive integer dollar amount and that the two amounts are different. The values of these two amounts are modeled as constants that are unknown. Without knowing what the amounts are, you select at random one of the two envelopes, and after looking at the amount inside, you may switch envelopes if you wish. A friend claims that the following strategy will increase above 1/2 your probability of ending up with the envelope with the larger amount: toss a coin repeatedly, let X be equal to 1/2 plus the number of tosses required to obtain heads for the first time, and switch if the amount in the envelope you selected is less than the value of X. Is your friend correct? Problem 23. The paradox of induction. Consider a statement whose truth is unknown. If we see many examples that are compatible with it, we are tempted to view the statement as more probable. Such reasoning is often referred to as inductive inference (in a philosophical, rather than mathematical sense). Consider now the statement that “all cows are white.” An equivalent statement is that “everything that is not white is not a cow.” We then observe several black crows. Our observations are clearly compatible with the statement, but do they make the hypothesis “all cows are white” more likely? To analyze such a situation, we consider a probabilistic model. Let us assume that there are two possible states of the world, which we model as complementary events: A : all cows are white, Ac : 50% of all cows are white. Let p be the prior probability P(A) that all cows are white. We make an observation of a cow or a crow, with probability q and 1 − q, respectively, independently of whether event A occurs or not. Assume that 0 < p < 1, 0 < q < 1, and that all crows are black. (a) Given the event B = {a black crow was observed}, what is P(A | B)? (b) Given the event C = {a white cow was observed}, what is P(A | C)? Problem 24.* Conditional version of the total probability theorem. Show the identity P(A | B) = P(C | B)P(A | B ∩ C) + P(C c | B)P(A | B ∩ C c ), assuming all the conditioning events have positive probability. Solution. Using the conditional probability definition and the additivity axiom on the disjoint sets A ∩ B ∩ C and A ∩ B ∩ C c , we obtain
58
Sample Space and Probability
Chap. 1
P(C | B)P(A | B ∩ C) + P(C c | B)P(A | B ∩ C c ) =
P(B ∩ C c ) P(A ∩ B ∩ C c ) P(B ∩ C) P(A ∩ B ∩ C) · + · P(B) P(B ∩ C) P(B) P(B ∩ C c )
=
P(A ∩ B ∩ C) + P(A ∩ B ∩ C c ) P(B)
= =
P (A ∩ B ∩ C) ∪ (A ∩ B ∩ C c ) P(B) P(A ∩ B) P(B)
= P(A | B). Problem 25.* Let A and B be events with P(A) > 0 and P(B) > 0. We say that an event B suggests an event A if P(A | B) > P(A), and does not suggest event A if P(A | B) < P(A). (a) Show that B suggests A if and only if A suggests B. (b) Assume that P(B c ) > 0. Show that B suggests A if and only if B c does not suggest A. (c) We know that a treasure is located in one of two places, with probabilities β and 1 − β, respectively, where 0 < β < 1. We search the first place and if the treasure is there, we find it with probability p > 0. Show that the event of not finding the treasure in the first place suggests that the treasure is in the second place. Solution. (a) We have P(A | B) = P(A ∩ B)/P(B), so B suggests A if and only if P(A ∩ B) > P(A)P(B), which is equivalent to A suggesting B, by symmetry. (b) Since P(B) + P(B c ) = 1, we have P(B)P(A) + P(B c )P(A) = P(A) = P(B)P(A | B) + P(B c )P(A | B c ), which implies that
P(B c ) P(A) − P(A | B c ) = P(B) P(A | B) − P(A) . Thus, P(A | B) > P(A) (B suggests A) if and only if P(A) > P(A | B c ) (B c does not suggest A). (c) Let A and B be the events A = {the treasure is in the second place}, B = {we don’t find the treasure in the first place}. Using the total probability theorem, we have P(B) = P(Ac )P(B | Ac ) + P(A)P(B | A) = β(1 − p) + (1 − β),
Problems
59
so P(A | B) =
P(A ∩ B) 1−β 1−β = = > 1 − β = P(A). P(B) β(1 − p) + (1 − β) 1 − βp
It follows that event B suggests event A.
SECTION 1.5. Independence Problem 26. A hunter has two hunting dogs. One day, on the trail of some animal, the hunter comes to a place where the road diverges into two paths. He knows that each dog, independently of the other, will choose the correct path with probability p. The hunter decides to let each dog choose a path, and if they agree, take that one, and if they disagree, to randomly pick a path. Is his strategy better than just letting one of the two dogs decide on a path? Problem 27. Communication through a noisy channel. A binary (0 or 1) symbol transmitted through a noisy communication channel is received incorrectly with probability 0 and 1 , respectively (see Fig. 1.18). Errors in different symbol transmissions are independent.
1 − ε0
0
0
ε0 ε1 1
1 − ε1
1
Figure 1.18: Error probabilities in a binary communication channel.
(a) Suppose that the channel source transmits a 0 with probability p and transmits a 1 with probability 1 − p. What is the probability that a randomly chosen symbol is received correctly? (b) Suppose that the string of symbols 1011 is transmitted. What is the probability that all the symbols in the string are received correctly? (c) In an effort to improve reliability, each symbol is transmitted three times and the received symbol is decoded by majority rule. In other words, a 0 (or 1) is transmitted as 000 (or 111, respectively), and it is decoded at the receiver as a 0 (or 1) if and only if the received three-symbol string contains at least two 0s (or 1s, respectively). What is the probability that a transmitted 0 is correctly decoded? (d) Suppose that the channel source transmits a 0 with probability p and transmits a 1 with probability 1 − p, and that the scheme of part (c) is used. What is the probability that a 0 was transmitted given that the received string is 101? Problem 28. The king’s sibling. The king has only one sibling. What is the probability that the sibling is male? Assume that every birth results in a boy with probability
60
Sample Space and Probability
Chap. 1
1/2, independent of other births. Be careful to state any additional assumptions you have to make in order to arrive at an answer. Problem 29. Using a biased coin to make an unbiased decision. Alice and Bob want to choose between the opera and the movies by tossing a fair coin. Unfortunately, the only available coin is biased (though the bias is not known exactly). How can they use the biased coin to make a decision so that either option (opera or the movies) is equally likely to be chosen? Problem 30. An electrical system consists of identical components that are operational with probability p, independently of other components. The components are connected in three subsystems, as shown in Fig. 1.19. The system is operational if there is a path that starts at point A, ends at point B, and consists of operational components. This is the same as requiring that all three subsystems are operational. What are the probabilities that the three subsystems, as well as the entire system, are operational?
2
1 A
3 B
Figure 1.19: A system of identical components that consists of the three subsystems 1, 2, and 3. The system is operational if there is a path that starts at point A, ends at point B, and consists of operational components.
Problem 31. Reliability of a k-out-of-n system. A system consists of n identical components that are operational with probability p, independently of other components. The system is operational if at least k out of the n components are operational. What is the probability that the system is operational? Problem 32. A power utility can supply electricity to a city from n different power plants. Power plant i fails with probability pi , independently of the others. (a) Suppose that any one plant can produce enough electricity to supply the entire city. What is the probability that the city will experience a black-out? (b) Suppose that two power plants are necessary to keep the city from a black-out. Find the probability that the city will experience a black-out. Problem 33. A cellular phone system services a population of n1 “voice users” (those that occasionally need a voice connection) and n2 “data users” (those that occasionally need a data connection). We estimate that at a given time, each user will need to be connected to the system with probability p1 (for voice users) or p2 (for data users),
Problems
61
independently of other users. The data rate for a voice user is r1 bits/sec and for a data user is r2 bits/sec. The cellular system has a total capacity of c bits/sec. What is the probability that more users want to use the system than the system can accommodate? Problem 34. The problem of points. Telis and Wendy play a round of golf (18 holes) for a $10 stake, and their probabilities of winning on any one hole are p and 1 − p, respectively, independently of their results in other holes. At the end of 10 holes, with the score 4 to 6 in favor of Wendy, Telis receives an urgent call and has to report back to work. They decide to split the stake in proportion to their probabilities of winning had they completed the round, as follows. If pT and pW are the conditional probabilities that Telis and Wendy, respectively, are ahead in the score after 18 holes given the 4-6 score after 10 holes, then Telis should get a fraction pT /(pT + pW ) of the stake, and Wendy should get the remaining pW /(pT + pW ). How much money should Telis get? Note: This is an example of the, so-called, problem of points, which played an important historical role in the development of probability theory. The problem was posed by Chevalier de M´er´e in the 17th century to Pascal, who introduced the idea that the stake of an interrupted game should be divided in proportion to the players’ conditional probabilities of winning given the state of the game at the time of interruption. Pascal worked out some special cases and through a correspondence with Fermat, stimulated much thinking and several probability-related investigations. Problem 35. A particular class has had a history of low attendance. The annoyed professor decides that she will not lecture unless at least k of the n students enrolled in the class are present. Each student will independently show up with probability pg if the weather is good, and with probability pb if the weather is bad. Given the probability of bad weather on a given day, calculate the probability that the professor will teach her class on that day. Problem 36. Consider a coin that comes up heads with probability p and tails with probability 1 − p. Let qn be the probability that after n independent tosses, there have been an even number of heads. Derive a recursion that relates qn to qn−1 , and solve this recursion to establish the formula
qn = 1 + (1 − 2p)n /2. Problem 37.* Gambler’s ruin. A gambler makes a sequence of independent bets. In each bet, he wins $1 with probability p, and loses $1 with probability 1 − p. Initially, the gambler has $k, and plays until he either accumulates $n or has no money left. What is the probability that the gambler will end up with $n? Solution. Let us denote by A the event that he ends up with $n, and by F the event that he wins the first bet. Denote also by wk the probability of event A, if he starts with $k. We apply the total probability theorem to obtain wk = P(A | F )P(F ) + P(A | F c )P(F c ) = pP(A | F ) + qP(A | F c ),
0 < k < n,
where q = 1 − p. By the independence of past and future bets, having won the first bet is the same as if he were just starting now but with $(k+1), so that P(A | F ) = wk+1 and similarly P(A | F c ) = wk−1 . Thus, we have wk = pwk+1 + qwk−1 , which can be written as wk+1 − wk = r(wk − wk−1 ), 0 < k < n,
62
Sample Space and Probability
Chap. 1
where r = q/p. We will solve for wk in terms of p and q using iteration, and the boundary values w0 = 0 and wn = 1. We have wk+1 − wk = rk (w1 − w0 ), and since w0 = 0, wk+1 = wk + rk w1 = wk−1 + rk−1 w1 + rk w1 = w1 + rw1 + · · · + rk w1 . The sum in the right-hand side can be calculated separately for the two cases where r = 1 (or p = q) and r = 1 (or p = q). We have
wk =
1 − rk w1 , 1−r kw1 ,
if p = q, if p = q.
Since wn = 1, we can solve for w1 and therefore for wk :
1−r , if p = q, n
w1 =
1−r
1,
if p = q,
n
so that wk =
1 − rk , if p = q, n 1−r
k,
if p = q.
n
Problem 38.* Let A and B be independent events. Use the definition of independence to prove the following: (a) The events A and B c are independent. (b) The events Ac and B c are independent. Solution. (a) The event A is the union of the disjoint events A ∩ B c and A ∩ B. Using the additivity axiom and the independence of A and B, we obtain P(A) = P(A ∩ B) + P(A ∩ B c ) = P(A)P(B) + P(A ∩ B c ). It follows that
P(A ∩ B c ) = P(A) 1 − P(B) = P(A)P(B c ), so A and B c are independent. (b) Apply the result of part (a) twice: first on A and B, then on B c and A. Problem 39.* Let A, B, and C be independent events, with P(C) > 0. Prove that A and B are conditionally independent given C. Solution. We have P(A ∩ B | C) = =
P(A ∩ B ∩ C) P(C) P(A)P(B)P(C) P(C)
= P(A)P(B) = P(A | C)P(B | C),
Problems
63
so A and B are conditionally independent given C. In the preceding calculation, the first equality uses the definition of conditional probabilities; the second uses the assumed independence; the fourth uses the independence of A from C, and of B from C. Problem 40.* Assume that the events A1 , A2 , A3 , A4 are independent and that P(A3 ∩ A4 ) > 0. Show that P(A1 ∪ A2 | A3 ∩ A4 ) = P(A1 ∪ A2 ). Solution. We have P(A1 | A3 ∩ A4 ) =
P(A1 ∩ A3 ∩ A4 ) P(A1 )P(A3 )P(A4 ) = = P(A1 ). P(A3 ∩ A4 ) P(A3 )P(A4 )
We similarly obtain P(A2 | A3 ∩ A4 ) = P(A2 ) and P(A1 ∩ A2 | A3 ∩ A4 ) = P(A1 ∩ A2 ), and finally, P(A1 ∪ A2 | A3 ∩ A4 ) = P(A1 | A3 ∩ A4 ) + P(A2 | A3 ∩ A4 ) − P(A1 ∩ A2 | A3 ∩ A4 ) = P(A1 ) + P(A2 ) − P(A1 ∩ A2 ) = P(A1 ∪ A2 ).
Problem 41.* Laplace’s rule of succession. Consider m + 1 boxes with the kth box containing k red balls and m − k white balls, where k ranges from 0 to m. We choose a box at random (all boxes are equally likely) and then choose a ball at random from that box, n successive times (the ball drawn is replaced each time, and a new ball is selected independently). Suppose a red ball was drawn each of the n times. What is the probability that if we draw a ball one more time it will be red? Estimate this probability for large m. Solution. We want to find the conditional probability P(E | Rn ), where E is the event of a red ball drawn at time n + 1, and Rn is the event of a red ball drawn each of the n preceding times. Intuitively, the consistent draw of a red ball indicates that a box with a high percentage of red balls was chosen, so we expect that P(E | Rn ) is closer to 1 than to 0. In fact, Laplace used this example to calculate the probability that the sun will rise tomorrow given that it has risen for the preceding 5,000 years. (It is not clear how serious Laplace was about this calculation, but the story is part of the folklore of probability theory.) We have P(E ∩ Rn ) P(E | Rn ) = , P(Rn ) and by using the total probability theorem, we obtain
P(Rn ) =
m
P(kth box chosen)
k=0
k m
n =
m 1 k n+1 P(E ∩ Rn ) = P(Rn+1 ) = . m+1 m k=0
m 1 k n , m+1 m k=0
64
Sample Space and Probability
Chap. 1
For large m, we can view P(Rn ) as a piecewise constant approximation to an integral: P(Rn ) =
m m 1 k n 1 1 mn+1 1 ≈ xn dx = · ≈ . n n m+1 m (m + 1)m 0 (m + 1)m n+1 n+1 k=0
Similarly, P(E ∩ Rn ) = P(Rn+1 ) ≈ so that P(E | Rn ) ≈
1 , n+2
n+1 . n+2
Thus, for large m, drawing a red ball one more time is almost certain when n is large. Problem 42.* Binomial coefficient formula and the Pascal triangle.
(a) Use the definition of nk as the number of distinct n-toss sequences with k heads, to derive the recursion suggested by the so called Pascal triangle, given in Fig. 1.20. (b) Use the recursion derived in part (a) and induction, to establish the formula
n k
=
n! . k! (n − k)!
( 00 ) ( 10 ) ( 11 ) ( 20 ) ( 21 ) ( 22 ) ( 30 ) ( 31 ) ( 32 ) ( 33 ) ( 40 ) ( 14 ) ( 42 ) ( 43 ) ( 44 ) . . . . . . . .
1 1 2
1 1 1
1
3 4
1 3
6
1 4
1
. . . . . . . .
Figure 1.20: Sequential calculation method of the binomial coefficients using the in the triangular array on the left is computed Pascal triangle. Each term n k and placed in the triangular array on the right by adding its two neighbors in the row above it (except for the boundary terms with k = 0 or k = n, which are equal to 1).
Solution. (a) Note that n-toss sequences that contain k heads (for 0 < k < n) can be obtained in two ways: (1) By starting with an (n− 1)-toss sequence that contains k heads and adding a tail at the end. There are n−1 different sequences of this type. k
Problems
65
(2) By starting with an (n − 1)-toss that contains k − 1 heads and adding sequence different sequences of this type. a head at the end. There are n−1 k−1 Thus,
n k
n − 1 + n − 1 , if k = 1, 2, . . . , n − 1, k−1 k = 1,
if k = 0, n.
This is the formula corresponding to the Pascal triangle calculation, given in Fig. 1.20. (b) We now use the recursion from part (a), to demonstrate the formula
n k
=
n! , k! (n − k)!
by induction on n. Indeed, we have from the definition 10 = 11 = 1, so for n = 1 the above formula is seen to hold as long as we use the convention 0! = 1. If the formula holds for each index up to n − 1, we have for k = 1, 2, . . . , n − 1,
n k
=
n−1 k−1
+
n−1 k
=
(n − 1)! (n − 1)! + (k − 1)! (n − 1 − k + 1)! k! (n − 1 − k)!
=
n! n−k n! k · + · n k! (n − k)! n k! (n − k)!
=
n! , k! (n − k)!
and the induction is complete. Problem 43.* The Borel-Cantelli lemma. Consider an infinite sequence of trials. The probability of success at the ith trial is some positive number pi . Let N be the event that there is no success, and let I be the event that there is an infinite number of successes. (a) Assume that the trials are independent and that P(N ) = 0 and P(I) = 1. (b) Assume that
∞
i=1
∞
i=1
pi = ∞. Show that
pi < ∞. Show that P(I) = 0.
Solution. (a) The event N is a subset of the event that there were no successes in the first n trials, so that P(N ) ≤
n
(1 − pi ).
i=1
Taking logarithms, log P(N ) ≤
n i=1
log(1 − pi ) ≤
n i=1
(−pi ).
66
Sample Space and Probability
Chap. 1
Taking the limit as n tends to infinity, we obtain log P(N ) = −∞, or P(N ) = 0. Let now Ln be the event that there is a finite number of successes and that the last success occurs at the nth trial. We use the already established result P(N ) = 0, and apply it to the sequence of trials after trial n, to obtain P(Ln ) = 0. The event I c (finite number of successes) is the union of the disjoint events Ln , n ≥ 1, and N , so that P(I c ) = P(N ) +
∞
P(Ln ) = 0,
n=1
and P(I) = 1. (b) Let Si be the event that the ith trial is a success. Fix some number n and for every i > n, let Fi be the event that the first success after time n occurs at time i. Note that Fi ⊂ Si . Finally, let An be the event that there is at least one success after time n. Note that I ⊂ An , because an infinite number of successes implies that there are successes subsequent to time n. Furthermore, the event An is the union of the disjoint events Fi , i > n. Therefore,
P(I) ≤ P(An ) = P
∞
i=n+1
Fi
=
∞ i=n+1
P(Fi ) ≤
∞ i=n+1
∞
P(Si ) =
pi .
i=n+1
We take the limit of both sides as n → ∞. Because of the assumption the right-hand side converges to zero. This implies that P(I) = 0.
∞ i=1
pi < ∞,
SECTION 1.6. Counting Problem 44. De M´ er´ e’s puzzle. A six-sided die is rolled three times independently. Which is more likely: a sum of 11 or a sum of 12? (This question was posed by the French nobleman de M´er´e to his friend Pascal in the 17th century.) Problem 45. The birthday problem. Consider n people who are attending a party. We assume that every person has an equal probability of being born on any day during the year, independently of everyone else, and ignore the additional complication presented by leap years (i.e., nobody is born on February 29). What is the probability that each person has a distinct birthday? Problem 46. An urn contains m red and n white balls. (a) We draw two balls randomly and simultaneously. Describe the sample space and calculate the probability that the selected balls are of different color, by using two approaches: a counting approach based on the discrete uniform law, and a sequential approach based on the multiplication rule. (b) We roll a fair 3-sided die whose faces are labeled 1,2,3, and if k comes up, we remove k balls from the urn at random and put them aside. Describe the sample space and calculate the probability that all of the balls drawn are red, using a divide-and-conquer approach and the total probability theorem. Problem 47. We deal from a well-shuffled 52-card deck. Calculate the probability that the 13th card is the first king to be dealt.
Problems
67
Problem 48. Ninety students, including Joe and Jane, are to be split into three classes of equal size, and this is to be done at random. What is the probability that Joe and Jane end up in the same class? Problem 49. Twenty distinct cars park in the same parking lot every day. Ten of these cars are US-made, while the other ten are foreign-made. The parking lot has exactly twenty spaces, all in a row, so the cars park side by side. However, the drivers have varying schedules, so the position any car might take on a certain day is random. (a) In how many different ways can the cars line up? (b) What is the probability that on a given day, the cars will park in such a way that they alternate (no two US-made are adjacent and no two foreign-made are adjacent)? Problem 50. Eight rooks are placed in distinct squares of an 8 × 8 chessboard, with all possible placements being equally likely. Find the probability that all the rooks are safe from one another, i.e., that there is no row or column with more than one rook. Problem 51. An academic department offers 8 lower level courses: {L1 , L2 , . . . , L8 } and 10 higher level courses: {H1 , H2 , . . . , H10 }. A valid curriculum consists of 4 lower level courses, and 3 higher level courses. (a) How many different curricula are possible? (b) Suppose that {H1 , . . . , H5 } have L1 as a prerequisite, and {H6 , . . . H10 } have L2 and L3 as prerequisites, i.e., any curricula which involve, say, one of {H1 , . . . , H5 } must also include L1 . How many different curricula are there? Problem 52. How many 6-word sentences can be made using each of the 26 letters of the alphabet exactly once? A word is defined as a nonempty (possibly jibberish) sequence of letters. Problem 53. Consider a group of n persons. A club consists of a special person from the group (the club leader) and a number (possibly zero) of additional club members. (a) Explain why the number of possible clubs is n2n−1 . (b) Find an alternative way of counting the number of possible clubs and show the identity n n k = n2n−1 . k k=1
Problem 54. We draw the top 7 cards from a well-shuffled standard 52-card deck. Find the probability that: (a) The 7 cards include exactly 3 aces. (b) The 7 cards include exactly 2 kings. (c) The probability that the 7 cards include exactly 3 aces or exactly 2 kings. Problem 55. A parking lot contains 100 cars, k of which happen to be lemons. We select m of these cars at random and take them for a testdrive. Find the probability
68
Sample Space and Probability
Chap. 1
that n of the cars tested turn out to be lemons. Problem 56. A well-shuffled 52-card deck is dealt to 4 players. Find the probability that each of the players gets an ace. Problem 57.* Hypergeometric probabilities. An urn contains n balls, out of which m are red. We select k of the balls at random, without replacement (i.e., selected balls are not put back into the urn before the next selection). What is the probability that i of the selected balls are red?
Solution. The sample space consists of the nk different ways that we can select k out of the available balls. For the event of interest to occur, we have to select i out of the m red balls, which can be done in m ways, and also select k − i out of the n − m blue i ways. Therefore, the desired probability is balls, which can be done in n−m k−i
m i
n−m k−i
n k
,
for i ≥ 0 satisfying i ≤ m, i ≤ k, and k − i ≤ n − m. For all other i, the probability is zero. Problem 58.* Correcting the number of permutations for indistinguishable objects. When permuting n objects, some of which are indistinguishable, different permutations may lead to indistinguishable object sequences, so the number of distinguishable object sequences is less than n!. For example, there are six permutations of the letters A, B, and C: ABC, ACB, BAC, BCA, CAB, CBA, but only three distinguishable sequences that can be formed using the letters A, D, and D: ADD, DAD, DDA. (a) Suppose that k out of the n objects are indistinguishable. Show that the number of distinguishable object sequences is n!/k!. (b) Suppose that we have r types of indistinguishable objects, and for each i, ki objects of type i. Show that the number of distinguishable object sequences is n! . k 1 ! k2 ! · · · k r ! Solution. (a) Each one of the n! permutations corresponds to k! duplicates which are obtained by permuting the k indistinguishable objects. Thus, the n! permutations can be grouped into n!/k! groups of k! indistinguishable permutations that result in the same object sequence. Therefore, the number of distinguishable object sequences is n!/k!. For example, the three letters A, D, and D give the 3! = 6 permutations ADD, ADD, DAD, DDA, DAD, DDA,
Problems
69
obtained by replacing B and C by D in the permutations of A, B, and C given earlier. However, these 6 permutations can be divided into the n!/k! = 3!/2! = 3 groups {ADD, ADD}, {DAD, DAD}, {DDA, DDA}, each having k! = 2! = 2 indistinguishable permutations. (b) One solution is to extend the argument in (a) above: for each object type i, there are ki ! indistinguishable permutations of the ki objects. Hence, each permutation belongs to a group of k1 ! k2 ! · · · kr ! indistinguishable permutations, all of which yield the same object sequence. An alternative argument goes as follows. Choosing a distinguishable object sequence is the same as starting with n slots and for each i, choosing the ki slots to be occupied by objects of type i. This is the same as partitioning the set {1, . . . , n} into groups of size k1 , . . . , kr , and the number of such partitions is given by the multinomial coefficient.
Discrete Random Variables
Contents
2.1. 2.2. 2.3. 2.4. 2.5. 2.6. 2.7. 2.8.
Basic Concepts . . . . . . . . . . . . Probability Mass F'unctions . . . . . . F'unctions of Random Variables . . . . . Expectation, Mean, and Variance . . . . Joint PMFs of Multiple Random Variables Conditioning . . . . . . . . . . . . . Independence . . . . . . . . . . . . . Summary and Discussion . . . . . . . Problems . . . . . . . . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. p . 72 . p . 74 . p . 80 . p . 81 . p . 92 . p . 98 .p . 110 . p . 116 . p . 119
Discrete Random Variables
Chap. 2
BASIC CONCEPTS In many probabilistic models, the outcomes are numerical, e.g., when they correspond to instrument readings or stock prices. In other experiments, the outcomes are not numerical, but they may be associated with some numerical values of interest. For example, if the experiment is the selection of students from a given population, we may wish to consider their grade point average. When dealing with such numerical values, it is often useful to assign probabilities to them. This is done through the notion of a random variable, the focus of the present chapter. Given an experiment and the corresponding set of possible outcomes (the sample space), a random variable associates a particular number with each outcome; see Fig. 2.1. We refer to this number as the numerical value or simply the value of the random variable. I\/lathematically, a random variable is a real-valued function of the experimental outcome.
Figure 2.1: (a) Visualization of a random variable. It is a function that assigns a numerical value to each possible outcome of the experiment. (b) An example of a random variable. The experiment consists of two rolls of a 4-sided die, and the random variable is the maximum of the two rolls. If the outcome of the experiment is (4,2), the value of this random variable is 4.
Here are some examples of random variables: (a) In an experiment involving a sequence of 5 tosses of a coin, the number of heads in the sequence is a random variable. However, the 5-long sequence
Sec. 2.1
Basic Concepts
73
of heads and tails is not considered a random variable because it does not have an explicit numerical value. (b) In an experiment involving two rolls of a die, the following are examples of random variables: (i) The sum of the two rolls. (ii) The number of sixes in the two rolls. (iii) The second roll raised to the fifth power. (c) In an experiment involving the transmission of a message, the time needed to transmit the message, the number of symbols received in error, and the delay with which the message is received are all random variables. There are several basic concepts associated with random variables, which are summarized below. These concepts will be discussed in detail in the present chapter.
Main Concepts kelated to Random Variables Starting with a probabilistic model of an experiment: a,
A random variable is a real-valued function of the outcome of the experiment.
c,
e
a,
*
A function of a random variable defines another random variable. We can associate with each random variable certain "averages" of interest, such as the mean and the variance.
A random variable can be conditioned on an event or on another random variable. There is a notion of independence of a random variable from an event or fiom another random variable.
A random variable is called discrete if its range (the set of values that it can take) is finite or at most countably infinite. For example, the random variables mentioned in (a) and (b) above can take at most a finite number of numerical values, and are therefore discrete. A random variable that can take an uncountably infinite number of values is not discrete. For an example, consider the experiment of choosing a point a from the interval [-I, 11. The random variable that associates the numerical value a2 to the outcome a is not discrete. On the other hand, the random variable that associates with a the numerical value
Discrete Random Variables
74
Chap. 2
is discrete. In this chapter, we focus exclusively on discrete random variables, even though we will typically onlit the qualifier "discrete."
Concepts Related to Discrete Random Variables Starting with a probabilistic model of an experiment: e
A discrete random variable is a real-valued function of t h of the experiment that can tale
A discrete random variable has an as tion ( P M F ) , which gives the proba the random variable can t a
A function of a discrete random variable, whose P original random variable.
We will discuss each of tlie above concepts and the associated methodology in the following sections. In addition, we will provide examples of some important and frequently encountered random variables. In Chapter 3, we will discuss general (not necessarily discrete) random variables. Even though this chapter may appear to be covering a lot of new ground, this is not really the case. The general line of development is to simply talce the concepts from Chapter 1 (probabilities, conditioning, independence, etc.) and apply them to random variables rather than events, together with some appropriate new notation. The only genuinely new concepts relate to means and variances.
2.2 P R O B A B I L I T Y M A S S F U N C T I O N S
The most important way to characterize a random variable is through the probabilities of the values that it can take. For a discrete random variable X , these are captured by the probability mass function (PMF for short) of X , denoted px. In particular, if x is any possible value of X , the probability mass of x, denoted px (x), is the probability of the event {X = x) consisting of all outcomes that give rise to a value of X equal to x:
For example, let the experiment consist of two independent tosses of a fair coin, and let X be the number of heads obtained. Then the PMF of X is
Sec. 2.2
Probability n/fass Functions
114, if x = 0 or x = 2, 0,
otherwise.
In what follows, we will often omit the braces from the eventlset notation, when no ambiguity can arise. In particular, we will usually write P ( X = x) in place of the more correct notation P ( { X = x)). We will also adhere to the following convention throughout: we will use upper case characters to denote random variables, and lower case characters to denote real numbers such as the numerical values of a random variable. Note that
where in the summation above, x ranges over all the possible numerical values of X . This follows from the additivity and normalization axioms, because the events { X = x) are disjoint and form a partition of the sample space, as x ranges over all possible values of X. By a similar argument, for any set S of possible
For example, if X is the number of heads obtained in two independent tosses of a fair coin, as above, the probability of at least one head is
Calculating the PMF of X is conceptually straightforward, and is illustrated in Fig. 2.2. Calculation of the P M F of a Random Variable X For each possible value x of X:
1. Collect all the possible outcomes that give rise to the event {X = x}. 2. Add their probabilities to obtain p x ( x ) .
The Bernoulli Random Variable Consider the toss of a coin, which comes up a head with probability p, and a tail with probability 1 - p. The Bernoulli random variable talces the two values 1 and 0, depending on whether the outcome is a head or a tail:
x = {0,1,
i f a thaeial .d ,
Discrete Random Variables
Chap. 2
I
Event
(X=x) (a)
I
Random variable:
I
Sample space: pairs of rolls
Figure 2.2: (a) Illustration of the method to calculate the PI\/IF of a random variable X. For each possible value x, we collect all the outcomes that give rise to X = x and add their probabilities to obtain px(x). (b) Calculation of the PMF p x of the random variable X = maximum roll in two independent rolls of a fair Csided die. There are four possible values x, namely, 1, 2, 3, 4. To calculate p.y(x) for a given x, we add the probabilities of the outcomes that give rise to x. For example, there are three outcomes that give rise to a: = 2, namely, (1,2), (2,2), ( 2 , l ) . Each of these outcomes has probability 1/16, so pX(2) = 3/16, as indicated in the figure.
Its PWIF is
For all its simplicity, the Bernoulli random variable is very important. In practice, it is used to model generic probabilistic situations with just two outcomes, such as: (a) The state of a telephone at a given time that can be either free or busy. (b) A person who can be either healthy or sick with a certain disease. (c) The preference of a person who can be either for or against a certain political candidate.
Sec. 2.2
Probability Mass Functions
77
Furthermore, by combining multiple Bernoulli random variables, one can construct more complicated random variables, such as the binomial random variable, which is discussed next. The Binomial Random Variable
A coin is tossed n times. At each toss, the coin comes up a head with probability p, and a tail with probability 1 - p, independently of prior tosses. Let X be the number of heads in the n-toss sequence. We refer to X as a binomial random variable with parameters n and p. The PMF of X consists of the binomial probabilities that were calculated in Section 1.5:
(Note that here and elsewhere, we simplify notation and use k, instead of x, to denote the values of integer-valued random variables.) The normalization property, specialized to the binomial random variable, is written as
Some special cases of the binomial PMF are sketched in Fig. 2.3
lJx(k)
P,Y(k) Binomial PIVIF, n = 9, p = 112
Binomial PMF, n = large, p = small
Figure 2.3: The P M F of a binomial random variable. If p = 112, the P M F is symmetric around 4 2 . Otherwise, the P M F is skewed towards 0 if p < 112, and towards n if p > 112.
The Geometric Random Variable Suppose that we repeatedly and independently toss a coin with probability of a head equal to p, where 0 < p < 1. The geometric random variable is the number X of tosses needed for a head to come up for the first time. Its PlLlF is given by px(k)=(l-p)"-lp, k = l , 2 , . . .,
Discrete Random Variables
78
Chap. 2
since (1- ~ ) ~ - isl the p probability of the sequence consisting of k - 1 successive tails followed by a head; see Fig. 2.4. This is a legitimate PMF because
Naturally, the use of coin tosses here is just to provide insight. More generally, we can interpret the geometric random variable in terms of repeated independent trials until the first "success." Each trial has probability of success p and the number of trials until (and including) the first success is modeled by the geometric random variable. The meaning of "success" is context-dependent. For example, it could mean passing a test in a given try, or finding a missing item in a given search, or logging into a computer system in a given attempt, etc.
Figure 2.4: The PMF
of a geometric random variable. It decreases as a geometric progression with parameter 1 - p.
The Poisson Random Variable
A Poisson random variable has a PkIF given by
where X is a positive parameter characterizing the PI\/IF, see Fig. 2.5. This is a legitimate PMF because
Sec. 2.2
Probability Mass Functions ~.y(k)
Pdk)
Poisson, h = 0.5
Poisson, h = 3
Figure 2.5: The PMF e ~ ~ X ~of/ the k ! Poisson random variable for different values of A. Note that if X < 1, then the PMF is monotonically decreasing, while if X > 1, the PUIF first increases and then decreases as the value of k increases (this is shown in the end-of-chapter problems).
To get a feel for the Poisson random variable, think of a binomial random variable with very small p and very large n. For example, consider the number of typos in a book with a total of n words, when the probability p that any one word is misspelled is very small (associate a word with a coin toss which comes a head when the word is misspelled), or the number of cars involved in accidents in a city on a given day (associate a car with a coin toss which comes a head when the car has an accident). Such random variables can be well-modeled with a Poisson PMF. I\/Iore precisely, the Poisson PWIF with parameter X is a good approximation for a binomial PMF with parameters n and p, i.e., e-A- Xk k!
n! pk(1 - p)"-k, k ! ( n- k ) !
k = 0, 1 , . . . , n ,
provided X = np, n is very large, and p is very small. In this case, using the Poisson PMF may result in simpler models and calculations. For example, let n = 100 and p = 0.01. Then the probability of k = 5 successes in n = 100 trials is calculated using the binomial PMF as
Using the Poisson PMF with approximated by
X
= np = 100 . 0.01 = 1, this probability is
We provide a formal justification of the Poisson approximation property in the end-of-chapter problems and also in Chapter 5, where we will further interpret it, extend it, and use it in the context of the Poisson process.
Discrete Random Variables
80
Chap. 2
2.3 FUNCTIONS OF RANDOM VARIABLES Given a random variable X, one may generate other random variables by applying various transformations on X . As an example, let the random variable X be today's temperature in degrees Celsius, and consider the transformation Y = 1.8X 32, which gives the temperature in degrees Fahrenheit. In this example, Y is a linear function of X ,of the form
+
where a and b are scalars. We may also consider nonlinear functions of the general form Y = g(X). For example, if we wish to display temperatures on a logarithmic scale, we would want to use the function g(X) = log X. If Y = g ( X ) is a function of a random variable X, then Y is also a random variable, since it provides a numerical value for each possible outcome. This is because every outcome in the sample space defines a numerical value x for X and hence also the numerical value y = g ( x ) for Y. If X is discrete with P M F p x , then Y is also discrete, and its PMF py can be calculated using the PWIF of X. In particular, to obtain py(y) for any y, we add the probabilities of all values of x such that g(x) = y:
Example 2.1. Let Y = 1x1 and let us apply the preceding formula for the PMF py to the case where
px(x) =
119, if x is an integer in the range [-4,4], 0, otherwise;
see Fig. 2.6 for an illustration. The possible values of Y are y = 0,1,2,3,4. To compute py (y) for some given value y from this range, we must add p x (x) over all values x such that 1x1 = y In particular, there is only one value of X that corresponds to y = 0, namely x = 0. Thus,
Also, there are two values of X that correspond to each y = 1,2,3,4,so for example,
Sec. 2.4
Expectation, Mean, and Variance
Figure 2.6: The PMFs of X and Y = (XI in Example 2.1.
Thus, the PMF of Y is
219, if y = 1,2,3,4, 0,
otherwise.
For another related example, let Z = x2. To obtain the PMF of Z,we can view it either as the square of the random variable X or as the square of the x2=Z) p ~ ( x ) or random variable Y = 1x1. By applying the formula pz(z) = the formula pz(z) = 1 y 2 = z ) PY(Y),we obtain
x{x
zIy
2.4
219,
if z = 1,4,9,16,
0,
otherwise.
EXPECTATION, MEAN, AND VARIANCE The PMF of a random variable X provides us with several numbers, the probabilities of all the possible values of X. It would be desirable to summarize this information in a single representative number. This is accomplished by the expectation of XI which is a weighted (in proportion to probabilities) average of the possible values of X. As motivation, suppose you spin a wheel of fortune many times. At each spin, one of the numbers m l , mz, . . . , m, comes up with corresponding probability p l , pa, . . . ,p,, and this is your monetary reward from that spin. What is to get "per spin"? The terms "expect" the amount of money that you LLexpect" and "per spin" are a little ambiguous, but here is a reasonable interpretation. Suppose that you spin the wheel k times, and that ki is the number of times .. that the outcome is mi. Then, the total amount received is mlk1 +mzka m,kn. The amount received per spin is
+. +
Discrete Random Variables
82
Chap. 2
If the number of spins k is very large, and if we are willing t o interpret probabilities as relative frequencies, it is reasonable t o anticipate that m Q comes up a fraction of times t h a t is roughly equal t o pi:
Thus, the amount of money per spin that you "expect" t o receive is
M=
mlkl +mzkz
+ ...+ m n k n = m l p l + mzpz + . . . +mnpn.
k
Motivated by this example, we introduce the following definiti0n.T
Expectation We define the expected value (also called the expectation or the mean) of a random variable X, with PMF p x , by
C
EjXl =,
XP-Y ( x )
+
x
Example 2.2. Consider two independent coin tosses, each with a 314 probability of a head, and let X be the number of heads obtained. This is a binomial random variable with parameters n = 2 and p = 314. Its PMF is
so the mean is
t When dealing with random variables that talce a countably infinite number of values, one has to deal with the possibility that the infinite sum xpx(x) is not well-defined. &/lore concretely, we will say that the expectation is well-defined if Ixlpx (x) < oo. In that case, it is known that the infinite sum xp,y(x) converges to a finite value that is independent of the order in which the various terms are summed. For an example where the expectation is not well-defined, consider a random variable X that talces the value 2" with probability 2 - k , for k = 1,2,. . .. For a more subtle example, consider the random variable X that takes the values 2k and -2k with probability 2-< for k = 2 , 3 , . . .. The expectation is again undefined, even though the PMF is symmetric around zero and one might be tempted to say that E[X] is zero. Throughout this book, in lack of an indication to the contrary, we implicitly assume that the expected value of the random variables of interest is well-defined.
xz
xz
xz
Sec. 2.4
Expectation, Atlean, and Variance
83
It is useful to view the mean of X as a "representative" value of X , which lies somewhere in the middle of its range. We can make this statement more precise, by viewing the mean as the center of gravity of the PMF, in the sense explained in Fig. 2.7.
Center of gravity c = mean = E[X] Figure 2.7: Interpretation of the mean as a center of gravity. Given a bar with a weight p x (x) placed at each point x with p x (x) > 0,the center of gravity c is the point at which the sum of the torques from the weights t o its left is equal to the sum of the torques from the weights t o its right:
Thus, c =
xz
z
x p ~ ( x ) ,i.e., the center of gravity is equal to the mean E[X]
Variance, Moments, a n d t h e E x p e c t e d Value R u l e Besides the mean, there are several other quantities that we can associate with a random variable and its PILIF. For example, we define the 2nd m o m e n t of the random variable X as the expected value of the random variable Xz. lUore generally, we define the n t h m o m e n t as E[Xn], the expected value of the random variable Xn. With this terminology, the 1st moment of X is just the mean. The most important quantity associated with a random variable X , other than the mean, is its variance, which is denoted by var(X) and is defined as ', i.e., the expected value of the random variable ( X - E[x])
Since ( X - E[x])'can only talce nonnegative values, the variance is always nonnegative. The variance provides a measure of dispersion of X around its mean. Another measure of dispersion is the s t a n d a r d deviation of X , which is defined as the square root of the variance and is denoted by c.y:
Discrete Random Variables
84
Chap. 2
The standard deviation is often easier to interpret, because it has the same units as X. For example, if X measures length in meters, the units of variance are square meters, while the units of the standard deviation are meters. One way to calculate var(X), is to use the definition of expected value, after calculating the PMF of the random variable (X - E [ x ] ) ~ . This latter random variable is a function of X , and its PMF can be obtained in the manner discussed in the preceding section.
Example 2.3. PMF
Consider the random variable X of Example 2.1, which has the
ps(x) =
119, if x is an integer in the range [-4,4], 0,
otherwise.
The mean E[X] is equal to 0. This can be seen from the symmetry of the PMF of X around 0, and can also be verified from the definition:
Let Z = (X - E[x])'= x 2 . AS in Example 2.1, we obtain 219, if z = 1,4,9,16, 0,
otherwise.
The variance of X is then obtained by
It turns out that there is an easier method to calculate var(X), which uses the PMF of X but does n o t require t h e PMF of (X - E[x])'. This method is based on the following rule. E x p e c t e d Value R u l e for Functions of R a n d o m Variables Let X be a random variable with PMF p x , and let g(X) be a function of X . Then, the expected value of the random variable g(X) is given by ~ [ g j X )= ]
g(x)px(x). 3:
Sec. 2.4
Expectation, Alean, and Variance
To verify this rule, we let Y = g(X) and use the formula
derived in the preceding section. We have
Using the expected value rule, we can write the variance of X as
Similarly, the nth moment is given by
and there is no need to calculate the PMF of Xn.
Example 2.3 (continued). For the random variable X with PMF PX(X) =
1/9, if x is an integer in the range [-4,4], 0,
otherwise,
we have
=9 - E x 2 1
(since E[X] = 0)
Discrete Random Variables
86
Chap. 2
which is consistent with the result obtained earlier.
As we have noted earlier, the variance is always nonnegative, but could it be zero? Since every term in the formula (x - E [ x ](x) )~ for ~ the~variance is nonnegative, the sum is zero if and only if (x - E[X])2p,y(x) = 0 for every x. This condition implies that for any x with px(x) > 0, we must have x = E[X] and the random variable X is not really L'random": its value is equal to the mean E[X], with probability 1.
Ex
Variance
The variance var(X) of a random variable X is defined by var(X) = E
'1
[(x- E[XI) ,
and can be calculated as var(X) =
(x - E[x]) 2p.y (x). x
It is always nonnegative. Its square root is denoted by crx and is called the standard deviation.
Properties of Mean and Variance
We will now use the expected value rule in order to derive some important properties of the mean and the variance. We start with a random variable X and define a new random variable Y, of the form where a and b are given scalars. Let us derive the mean and tlie variance of the linear function Y. We have
Furthermore, var(Y) =
(ax
+ b - E[aX + b ~ ) ~ p(x) ,y
Sec. 2.4
Expectation, Adean, and Variance
87
Mean and Variance of a Linear Function of a Random Variable Let X be a random variable and let Y=aX+b, where a and b are given scalars. Then, E[Y]=aE[X]+b,
var(Y)=a2var(X).
Let us also give a convenient formula for the variance of a random variable X with given PMF.
Variance in Terms of Moments Expression var(X) = E[X2] - (E[x])~.
This expression is verified as follows: var(X) =
x x
(a - E
[ x ](x) )~~~
5
=
(x2 - 2xE[X]
+ (E[x])~)PX(X)
5
=CX~PX(X - )2EIXI CxppX(x) x
+ (E[xI)? C PX(X)
x
x
+ (~1x1)'
= E[X2] - ~ ( E [ x ] ) ~ = E[X2] -
(E[x])'.
We finally illustrate by example a common pitfall: unless g(X) is a linear function, it is not generally true that E [ ~ ( x ) ] is equal to g(E[X]).
Example 2.4. Average Speed Versus Average Time. If the weather is good (which happens with probability 0.6), Alice walks the 2 miles to class at a speed of V = 5 miles per hour, and otherwise rides her motorcycle at a speed of V = 30 miles per hour. What is the mean of the time T to get to class? A correct way to solve the problem is to first derive the PMF of T, 0.6, if t = 215 hours, 0.4, if t = 2/30 hours,
Discrete Random Variables
88
Chap. 2
and then calculate its mean by 2 2 4 E[T] = 0.6. - + 0.4. - = - hours. 5 30 15 However, it is wrong to calculate the meail of the speed V, E[V] = 0.6 . 5
+ 0.4. 30 = 15 miles per hour,
and then claim that the mean of the time T is 2
E[VJ-15
hours
To summarize, in this example we have 2 V'
T= -
2 and E[T] = E - f. E[V]'
Mean and Variance of Some Common Random Variables We will now derive formulas for the mean and the variance of a few important random variables. These formulas will be used repeatedly in a variety of contexts throughout the text.
Example 2.5. Mean and Variance of the Bernoulli. Consider the experiment of tossing a coin, which comes up a head with probability p and a tail with probability 1 - p, and the Bernoulli random variable X with PMF
Its mean, second moment, and variance are given by the following calculations: E[X] = l . p + O . ( l - p ) = p ,
E[x']= 1 2 . p + ~ ~ ( 1 - p ) = p , var(X) = E [ x ~ - ]( E [ x ]=)p~- pZ = p(1 - p).
Example 2.6. Discrete Uniform Random Variable. What is the mean and variance of the roll of a fair six-sided die? If we view the result of the roll as a random variable X , its PMF is 116, i f k = 1 , 2 , 3 , 4 , 5 , 6 , , otherwise.
Sec. 2.4
Expectation, Mean, and Variance
89
Since the PMF is symmetric around 3.5, we conclude that E[X]= 3.5. Regarding the variance, we have
which yields var(X) = 35/12. The above random variable is a special case of a discrete uniformly distributed random variable (or discrete uniform for short), which by definition, takes one out of a range of contiguous integer values, with equal probability. More precisely, this random variable has a PMF of the form
1
i f k = a , a + l , . . . , b, otherwise,
where a and b are two integers with a The mean is
< b; see Fig. 2.8.
as can be seen by inspection, since the PMF is symmetric around (a + b)/2. To calculate the variance of X I we first consider the simpler case where a = 1 and b = n. It can be verified by induction on n that
We leave the verification of this as an exercise for the reader. The variance can now be obtained in terms of the first and second moments
For the case of general integers a and b, we note that a random variable which is uniformly distributed over the interval [a, b] has the same variance as one which is uniformly distributed over [I, b - a 11, since the PMF of the second is just a shifted version of the PMF of the first. Therefore, the desired variance is given by the above formula with n = b - a 1, which yields
+
+
Discrete Random Variables
Chap. 2
Figure 2.8: PMF of the discrete random variable that is uniformly distributed between two integers a and b. Its mean and variance are
Example 2.7. The M e a n of the Poisson. The mean of the Poisson PMF
can be calculated is follows:
(the li = 0 term is zero)
The last equality is obtained by noting that
is the normalization property for the Poisson PMF. A similar calculation shows that the variance of a Poisson random variable is also X (see the end-of-chapter problems). We will derive this fact in a number of different ways in later chapters.
Sec. 2.4
Expectation, Mean, and Variance
91
Decision Making Using Expected Values Expected values often provide a convenient vehicle for formulating optimization problems where we have t o choose between several candidate decisions t h a t result in random rewards. If we view t h e expected reward of a decision as its "average payoff over a large number of trials," i t is reasonable t o choose a decision with maximum expected reward. T h e following is a n example.
Example 2.8. The Quiz Problem. This example, when generalized appropriately, is a prototypical model for optimal scheduling of a collection of tasks that have uncertain outcomes. Consider a quiz game where a person is given two questions and must decide which question to answer first. Question 1 will be answered correctly with probability 0.8, and the person will then receive a s prize $100, while question 2 will be answered correctly with probability 0.5, and the person will then receive as prize $200. If the first question attempted is answered incorrectly, the quiz terminates, i.e., the person is not allowed to attempt the second question. If the first question is answered correctly, the person is allowed to attempt the second question. Which question should be answered first to maximize the expected value of the total prize money received? The answer is not obvious because there is a tradeoff: attempting first the more valuable but also more difficult question 2 carries the risk of never getting a chance to attempt the easier question 1. Let us view the total prize money received as a random variable X, and calculate the expected value E[X] under the two possible question orders (cf. Fig. 2.9):
Figure 2.9: Sequential description of the sample space of the quiz problem for the two cases where we answer question 1 or question 2 first.
(a) Answer question 1 first: Then the PMF of X is (cf. the left side of Fig. 2.9) px(O)=0.2,
px(100)=0.8.0.5,
px(300)=0.8.0.5,
and we have E[X] = 0.8 .0.5 . 100
+ 0.8 . 0.5 . 300 = $160.
Discrete Random Variables (b) Answer question 2 first: Then the
px(O)=O.5,
Chap. 2
PMF of X is (cf. the right side of Fig.
p.y(200)=0.5.0.2,
2.9)
px(300)=0.5,0.8,
and we have
E [ X ]= 0.5 . 0 . 2 . 200 + 0 . 5 . 0 . 8 . 300 = $140. Thus, it is preferable to attempt the easier question 1 first. Let us now generalize the analysis. Denote by pl and p2 the probabilities of correctly answering questions 1 and 2, respectively, and by ul and vz the corresponding prizes. If question 1 is answered first, we have
while if question 2 is answered first, we have
It is thus optimal to answer question 1 first if and only if
or equivalently, if
Therefore, it is optimal to order the questions in decreasing value of the expression p v / ( l - p), which provides a convenient index of quality for a question with probability of correct answer p and value v. Interestingly, this rule generalizes to the case of more than two questions (see the end-of-chapter problems).
2.5 JOINT PMFS OF MULTIPLE RANDOM VARIABLES
Probabilistic models often involve several random variables. For example, in a medical diagnosis context, the results of several tests may be significant, or in a networking context, the workloads of several gateways may be of interest. All of these random variables are associated with the same experiment, sample space, and probability law, and their values may relate in interesting ways. This motivates us to consider probabilities involving simultaneously the numerical values of several random variables and to investigate their mutual couplings. In this section, we will extend the concepts of PMF and expectation developed so far to multiple random variables. Later on, we will also develop notions of conditioning and independence that closely parallel the ideas discussed in Chapter 1. Consider two discrete random variables X and Y associated with the same experiment. The probabilities of the values that X and Y can take are captured
/
Sec. 2.5
Joint PMFs of Ilfultiple Random Variables
93
by the joint P M F of X and Y, denoted px,y. In particular, if (x, y) is a pair of possible values of X and Y, the probability mass of (x, y) is the probability of the event { X = a, Y = y): PX,Y (x, Y) = P ( X = 2, Y = y).
Here and elsewhere, we will use the abbreviated notation P ( X = x, Y = y) instead of the more precise notations P({X = x)n{Y = y)) or P ( X = x and Y = XI. The joint PMF determines the probability of any event that can be specified in terms of the random variables X and Y. For example if A is the set of all pairs (x,y) that have a certain property, then
In fact, we can calculate the PMFs of X and Y by using the formulas
The formula for px(x) can be verified using the calculation
px (x) = P ( X = 2)
where the second equality follows by noting that the event {X = x) is the union of the disjoint events { X = x, Y = y) as y ranges over all the different values of Y. The formula for py (y) is verified similarly. We sometimes refer to p x and py as the marginal PMFs, to distinguish them from the joint PMF. We can calculate the marginal PNlFs from the joint PMF by using the tabular method. Here, the joint PMF of X and Y is arranged in a twodimensional table, and the marginal P M F of X or Y at a given value is obtained by adding the table entries along a corresponding column or row, respectively. This is illustrated in the following example and in Fig. 2.10.
Example 2.9. Consider two random variables, X and Y, described by the joint PMF shown in Fig. 2.10. The marginal PMFs are calculated by adding the table entries along the columns (for the marginal PMF of X) and along the rows (for the marginal PMF of Y), as indicated.
Discrete R a n d o m Variables
Chap. 2
.Joint PbIF p ~ , ~ ( x . y ) in tabular form
A
-
3/20
3
1/20
2/20 3 / 2 0
2
1/20
2/20 3/20 1/20
1
1/20
1/20 1/20
7/20
*
1/20
7/20
Row sums: marg~nalPh4F Py(y)
3)
0
3/20 B .b
1
B 3/20
2
3
4
7
V
6/20
8/20
3/20
I{
"
Column sums: marginal PhIF p X ( x ) Figure 2.10: Illustration of the tabular method for calculating the marginal P M F s from the joint P M F in Example 2.9. The joint PMF is represented by t h e table, where the number in each square (x, y ) gives the vaiue of p x , y (x, y). To calculate the marginal PMF p x ( x ) for a given value of x, we add the numbers in the column corresponding t o x. For example p x ( 2 ) = 6/20. Similarly, t o calculate t h e marginal P M F p y ( y ) for a given value of y , we add the numbers in the row corresponding to y. For example py ( 2 ) = 7 / 2 0 .
Functions of Multiple Random Variables When there are multiple random variables of interest, it is possible to generate new random variables by considering functions involving several of these random variables. In particular, a function Z = g ( X , Y ) of the random variables X and Y defines another random variable. Its PMF can be calculated from the joint PMF p x , y according to
Furthermore, the expected value rule for functions naturally extends and takes the form
The verification of this is very similar to the earlier case of a function of a single random variable. In the special case where g is linear and of the form aX+bY fc, where a , b, and c are given scalars, we have
Sec. 2.5
Joint PPfFs of Multiple Random Variables
95
Example 2.9 (continued). Consider again the random variables X and Y whose joint PMF is given in Fig. 2.10, and a new random variable Z defined by
The PMF of Z can be calculated using the formula
and we have, using the PMF given in Fig. 2.10,
3 3 2 pz(8) = 20, pz(9) = 20, p z ( l 0 ) = -, 20
1 p z ( l 1 ) = -, 20
The expected value of Z can be obtained from its PMF:
Alternatively, we can obtain E[Z] using the formula
From the marginal PMFs, given in Fig. 2.10, we have
1 ~ ~ ( 1 =2 -. 20 )
Discrete Random Vnriables
96
Chap. 2
More than TWORandom Variables The joint P M F of three random variables X , Y, and Z is defined in analogy with the above as P X , Y , Z ( X , ~=, ~ P(X ) =x,Y =y , Z = z), for all possible triplets of numerical values (x, y, 2). Corresponding marginal PbIFs are analogously obtained by equations such as PX.Y(XI
Y) = ~-P X , Y , Z ( Xy,, z),
and
The expected value rule for functions is give11 by E[g(X, Y, Z)] =
g(x, Y, ~)Px,Y,z(x, Y. z ) , x
and if g is linear and has the form
y
z
aX + bY + c Z + dl then
Furthermore, there are obvious generalizations of the above to more than three random variables. For example, for any random variables XI, X2,. . . , X n and any scalars a l , aa, . . . ,a,, we have
Example 2.10. Mean of the Binomial. Your probability class has 300 students and each student has probability 1/3 of getting an A, independently of any other student. What is the mean of X, the number of students that get an A? Let 1, if the ith student gets an A, Xi = 0, otherwise.
Thus XI, X2, . . . ,X, are Bernoulli random variables with common mean p = 1/3. Their sum X =XI +X2+...+Xn is the number of students that get an A. Since X is the number of "successes" in n independent trials, it is a binomial random variable with parameters n and p. Using the linearity of X as a function of the X,, we have
Sec. 2.5
Joint P W s of &lultipIe Random Variables
97
If we repeat this calculation for a general number of students n and probability of A equal to p, we obtain E[Xi] = C
E[X] = i=l
p = np.
i=l
Example 2.11. The Hat Problem. Suppose that n people throw their hats in a box and then each picks one hat at random. (Each hat can be picked by only one person, and each assignment of hats to persons is equally likely.) What is the expected value of X, the number of people that get back their own hat? For the ith person, we introduce a random variable Xi that takes the value 1 if the person selects his/her own hat, and takes the value 0 otherwise. Since P(Xi = 1) = l / n and P(X, = 0) = 1 - l l n , the mean of Xi is
We now have so that
Discrete Random Variables
98
Chap. 2
2.6 CONDITIONING
Similar to our discussion in Chapter 1, conditional probabilities can be used to capture the information conveyed by various events about the different possible values of a random variable. We are thus motivated to introduce conditional PMFs, given the occurrence of a certain event or given the value of another random variable. We develop this idea in this section and we discuss the properties of conditional PRIIFs. In reality though, there is not much that is new, only an elaboration of concepts that are familiar from Chapter 1, together with some new notation. Conditioning a Random Variable on an Event The conditional P M F of a random variable event A with P ( A ) > 0, is defined by PXIA(X)=P ( X
= x I A) =
X ,conditioned on a particular
P ({X= x} fl A) P(A)
Note that the events { X = x} n A are disjoint for different values of x, their union is A, and, therefore,
Combining the above two formulas, we see that
so p x l ~is a legitimate PMF. The conditional PMF is calculated similar to its unconditional counterpart: to obtain p X I A ( x ) ,we add the probabilities of the outcomes that give rise to
Sec. 2.6
Conditioning
99
X = x and belong t o t h e conditioning event A, a n d t h e n normalize by dividing with P ( A ) . Example 2.12. Let X be the roll of a fair six-sided die and let A be the event that the roll is an even number. Then, by applying the preceding formula, we obtain px l a(k) = P ( X = k I roll is even)
P ( X = k and X is even) P(rol1 is even)
=
{
j3,
if k = 2,416, otherwise.
Example 2.13. A student will take a certain test repeatedly, up to a maximum of n times, each time with a probability p of passing, independently of the number of previous attempts. What is the PMF of the number of attempts, given that the student passes the test? Let A be the event that the student passes the test (with at most n attempts). We introduce the random variable X , which is the number of attempts that would be needed if an unlimited number of attempts were allowed. Then, X is a geometric random variable with parameter p, and A = {X 5 n). We have
!0,
otherwise,
as illustrated in Fig. 2.11.
Figure 2.11: Visualization and calculation of the conditional PMF p X I A ( k in ) Example 2.13. We start with the P M F of X , we set to zero the P M F values for all I; that do not belong to the conditioning event A, and we normalize the remaining values by dividing with P ( A ) .
Discrete Random Variables
Chap. 2
Figure 2.12: Visualization and calculation of the conditional PMF pxla(x). For each x , we add the probabilities of the outcomes in the intersection (X = x ) nA, and normalize by dividing with P(A).
Figure 2.12 provides a more abstract visualization of the construction of the conditional PMF. Conditioning one Random Variable on Another Let X and Y be two random variables associated with the same experiment. If we know that the value of Y is some particular y [with py ( y ) > 01, this provides partial knowledge about the value of X. This knowledge is captured by the conditional PMF pxly of X given Y, which is defined by specializing the ~ events A of the form {Y = y ) : definition of p x , to
Using the definition of conditional probabilities, we have
Let us fix some y with py ( y ) > 0, and consider p x l Y(xI y) as a function of x. This function is a valid PWIF for X:it assigns nonnegative values to each possible z, and these values add to 1. Furthermore, this function of x, has the same shape as p ~ , y ( xy ,) except that it is divided by py ( y ) , which enforces the normalization property
Figure 2.13 provides a visualization of the conditional PMF. The conditional PMF is often convenient for the calculation of the joint PMF, using a sequential approach and the formula
Sec. 2.6
Conditioning Conditional PMF
Slice view of conditional PMF
x Figure 2.13: Visualization of t h e conditional P M F p X I (x / y). For each y, we view t h e joint P M F along the slice Y = y and renormalize so t h a t
o r its counterpart
This method is entirely similar t o t h e use of t h e multiplication rule from Chapter 1. T h e following example provides a n illustration.
Example 2.14. Professor May B. Right often has her facts wrong, and answers each of her students' questions incorrectly with probability 114, independently of other questions. In each lecture May is asked 0, 1, or 2 questions with equal probability 113. Let X and Y be the number of questions May is asked and the number of questions she answers wrong in a given lecture, respectively. To construct the joint PMF px,y(x, y), we need to calculate all the probabilities P(X = x, Y = y) for all combinations of values of x and y. This can be done by using a sequential description of the experiment and the multiplication rule, as shown in Fig. 2.14. For example, for the case where one question is asked and is answered wrong, we have 1 1 1 PX,Y(~ 1), = PX(X)PY~X(Y I x) = 3 . 4 = 12' The joint PMF can be represented by a two-dimensional table, as shown in Fig. 2.14. It can be used to calculate the probability of any event of interest. For
Discrete Random Variables
102
Chap. 2
instance, we have P ( a t least one wrong answer) = P X , Y (1,l)
+ p.u,u(Z,1) + p x , (~2 , 2 )
Prob 1/48
Prob 6/48
Prob: 4/48
1 - X : Number of questio~lsaslced
Y : Number of questions answered wrong
Joint PMF p X , y ( x , l ~ ) in tabular for111
Figure 2.14: Calculation of the joint PMF p x , y ( x , y ) in Example 2.14.
The conditional PMF can also be used to calculate the marginal PMFs. In particular, we have by using the definitions,
This formula provides a divide-and-conquer method for calculating marginal PMF's. It is in essence identical to the total probability theorem given in Chapter 1, but cast in different notation. The following example provides an illustration.
Example 2.15. Consider a transmitter that is sending messages over a computer network. Let us define the following two random variables:
X : the travel time of a given message, Y : the length oE the given message. \Ve lcnow the PI\/IF of the travel time of a message that has a given length, and we know the PMF of the message length. We want to find the (unconditional) P M F of the travel time of a message.
Sec. 2.6
Conditioning
103
We assume that the length of a message can take two possible values: y = l o 2 bytes with probability 516, and y = lo4 bytes with probability 116, so that
We assume that the travel time X of the message depends on its length Y and the congestion in the networlc at the time of transmission. In particular, the travel time is seconds with probability 112, ~ o - ~seconds Y with probability 113, and ~ o - ~seconds Y with probability 116. Thus, we have
To find the PMF of X I we use the total probability formula
We obtain
Note finally t h a t one can define conditional PMFs involving more t h a n two random variables, such as p x , Y j z ( x yl I z ) or p x lyYz(x yI z ) . T h e concepts a n d methods described above generalize easily.
I
Discrete Random Variables
104
e
Chap. 2
The conditional PMF of X given Y by This is analogous to the multipli and can be used to calculate the
e
The conditional PMF of X gi marginal PMF of X with the formu
This is analogous to the divide probabilities using es
There are natural random variables.
Conditional Expectation
A conditional Pi'vIF can be thought of as an ordinary PbIF over a new universe determined by the conditioning event. In the same spirit, a conditional expectation is the same as an ordinary expectation, except that it refers to the new universe, and all probabilities and PMFs are replaced by their conditional counterparts. (Conditional variances can also be treated similarly.) We list the main definitions and relevant facts below.
Summary of Facts About Conditional Expectatio Let X and Y be rando e
The conditional expectation of X defined by
For a function g(X), we have
Sec. 2.6
e
Conditioning
105
The conditional expectation of X given a value y of Y is defined by E[XIY=?II = ~ x P x , Y ( x / ~ ) . 5
an We have
E[Xl =
CPY (Y)E[X I Y = Yl. Y
This is the total expectation theorem. an Let All . . . , A n be disjoint events that form a partition of the sample
space, and assume that P(Az) > 0 for all i. Then, 11
E[X] = a,
C P(A,)E[X 1 A,].
Let A1,. . . ,A, be disjoint events that form a partition of an event B , and assume that P(A, n B ) > 0 for all i. Then, n
E[XI B] = C P ( A zI B)E[XI Ai
B].
Let us verify the total expectation theorem, which basically says that "the unconditional average can be obtained by averaging the conditional averages." The theorem is derived using the total probability formula
and the calculation
P(Ai)E[X I Ai] can be verified by viewing it as The relation E[X] = Cy=L=l a special case of the total expectation theorem. Let us introduce the random
Discrete Random Variables
106
Cl~ap.2
variable Y t h a t takes the value i if and only if the event Ai occurs. Its P M F is given by py(i) = { E ( A i ) , i f 2 = 1 1 2 , . ,n, otherwise. The total expectation theorem yields
and since t h e event {Y = i) is just Ai, we obtain the desired expression
Finally, the last relation in the table is essentially the same as the preceding one, but applied t o a universe where event B is known t o have occurred. T h e total expectation theorem is analogous t o the total probability theorem. It can be used to calculate t h e unconditional expectation E[X] from the conditional P M F or expectation, using a divide-and-conquer approach.
Example 2.16. Messages transmitted by a computer in Boston through a data network are destined for New York with probability 0.5, for Chicago with probability 0.3, and for San Francisco with probability 0.2. The transit time X of a message is random. Its mean is 0.05 seconds if it is destined for New Yorlc, 0.1 seconds if it is destined for Chicago, and 0.3 seconds if it is destined for San Francisco. Then, E[X] is easily calculated using the total expectation theorem as
E [ X ] = 0 . 5 ~ 0 . 0 5 + 0 . 3 ~ 0 . 1 + 0 . 2 ~ 0 . 3 = 0 . 1seconds. 15
Example 2.17. Mean and Variance of the Geometric. You write a software program over and over, and each time there is probability p that it works correctly, independently from previous attempts. What is the mean and variance of X , the number of tries until the program works correctly? We recognize X a s a geometric random variable with PMF
The mean and variance of X are given by
but evaluating these infinite sums is somewhat tedious. As an alternative, we will apply the total expectation theorem, with A1 = {X = 1) = {first try is a success),
Sec. 2.6
Conditioning
107
= {X > 1) = {first try is a failure), and end up with a much simpler calculation. If the first try is successful, we have X = 1, and
A2
If the first try fails (X > I), we have wasted one try, and we are back where we started. So, the expected number of remaining tries is E[X], and
Thus,
E[X] = P(X = 1)E[XI X = 11
+ P ( X > 1)E[XI X > 11
from which we obtain
1 E[X] = -. P With similar reasoning, we also have
so that
E[x2]= p . I + (I - p) (1
+ 2E[X] + E [ x ~ ] ) ,
from which we obtain
and, using the formula E[X] = l/p derived above,
We conclude that
Example 2.18. The Two-Envelopes Paradox. This is a much discussed puzzle that involves a subtle mathematical point regarding conditional expectations. You are handed two envelopes, and you are told that one of them contains m times as much money as the other, where m is an integer with m > 1. You open one of the envelopes and loolc a t the amount inside. You may now keep this amount, or you may switch envelopes and keep the amount in the other envelope. What is the best strategy? Here is a line of reasoning that argues in favor of switching. Let A be the envelope you open and B be the envelope that you may switch to. Let also x and
Discrete Random Variables
108
Chap. 2
y be the amounts in A and B, respectively. Then, as the argument goes, either y = x / m or y = mx, with equal probability 1/2, so given x , the expected value of y is
+
since 1 m2 > 2m for m > 1. Therefore, you should always switch to envelope B! But then, since you should switch regardless of the amount found in A, you might as well open B to begin with; but once you do, you should switch again, etc. There are two assumptions, both flawed to some extent, that underlie this paradoxical line of reasoning. (a) You have no a priori ltnowledge about the amounts in the envelopes, so given x, the only thing you know about y, is that it is either l / m or m times x, and there is no reason to assume that one is more lilcely than the other. (b) Given two random variables X and Y , representing monetary amounts, if
for all possible values x of X , then the strategy that always switches to Y yields a higher expected monetary gain. Let us scrutinize these assumptions. . Assumption (a) is flawed because it relies on an incompletely specified probabilistic model. Indeed, in any correct model, all events, including the possible values of X and Y, must have well-defined probabilities. With such probabilistic knowledge about X and Y, the value of X may reveal a great deal of information about Y. For example, assume the following probabilistic model: someone chooses an integer dollar amount Z from a known range [g,Z] according to some distribution, places this amount in a randomly chosen envelope, and places m times this amount in the other envelope. You then choose to open one of the two envelopes (with equal probability), and look at the enclosed amount X . If X turns out to be larger than the upper range limit z,you know that X is the larger of the two amounts, and hence you should not switch. On the other hand, for some other values of X , such a s the lower range limit g, you should switch envelopes. Thus, in this model: the choice to switch or not should depend on the value of X. Roughly speaking, if you have an idea about the range and lilcelihood of the values of X , you can judge whether the amount X found in A is relatively small or relatively large, and accordingly switch or not switch envelopes. Mathematically, in a correct probabilistic model, we must have a joint PI\IIF for the random variables X and Y, the amounts in envelopes A and B, respectively. This joint PMF is specified by introducing a PMF pz for the random variable Z , the minimum of the amounts in the two envelopes. Then, for all z, PX-Y (mz, z)
1 = P X , Y (z, mz) = -pZ(z), 2
and PX,Y ( 2 ,Y) = 0, for every (x,y) that is not of the form (mz, z) or (z,mz). With this specification of p,y,y (x,y), and given that X = x, one can use the rule
switch if and only if E[Y ( X = XI
> x.
Sec. 2.6
Conditioning
109
According to this decision rule, one may or may not switch envelopes, depending on the value of X , as indicated earlier. Is it true that, with the above described probabilistic model and decision rule, you should be switching for some values x but not for others? Ordinarily yes, as illustrated from the earlier example where Z takes values in a bounded range. However, here is a devilish example where because of a subtle mathematical quirk, you will always switch! A fair coin is tossed until it comes up heads. Let N be the number of tosses. Then, mN dollars are placed in one envelope and mN-l dollars are placed in the other. Let X be the amount in the envelope you open (envelope A), and let Y be the amount in the other envelope (envelope B). NOW,if A contains $1, clearly B contains $m,so you should switch envelopes. If, on the other hand, A contains mn dollars, where n > 0, then B contains either mn-l or mn+l dollars. Since N has a geometric PMF, we have
Thus
and
+
We have ( 2 m"/3m Thus if m > 2, then
> 1 if
and only if m2 - 3m f 2
> 0 or
( m - l ) ( m- 2)
> 0.
and to maximize the expected monetary gain you should always switch to B! What is happening in this example is that you switch for all values of x because
E [ Y / X = x] > x,
for all x.
A naive application of the total expectation theorem might seem to indicate that E [ Y ] > E [ X ] . However, this cannot be true, since X and Y have identical PMFs. Instead, we have
E [ Y ] = E [ X ] = oo, which is not necessarily inconsistent with the relation E [ Y / X = x] > x for all x. The conclusion is that the decision rule that switches if and only if E [ Y ( X = x] > x does not improve the expected monetary gain in the case where E [ Y ] = E [ X ] = co,and the apparent paradox is resolved.
110
Discrete Random Variables
Chap. 2
2.7 INDEPENDENCE
We now discuss concepts of independence related to random variables. These concepts are analogous to the concepts of independence between events (cf. Chapter 1). They are developed by simply introducing suitable events involving the possible values of various random variables, and by considering the independence of these events. Independence of a Random Variable from an Event The independence of a random variable from an event is similar to the independence of two events. The idea is that knowing the occurrence of the conditioning event provides no new information on the value of the random variable. More formally, we say that the random variable X is independent of the event A if P ( X = x and A) = P (X = x)P(A) = p x ( x ) P(A), for all x, which is the same as requiring that the two events {X = x} and A be independent, for any choice x. From the definition of the conditional PMF, we have P ( X = x and A) = p x l A ( ~ ) P ( A ) , so that as long as P ( A ) > 0, independence is the same as the condition PX\A(X) = px(x),
for all 2.
Example 2.19. Consider two independent tosses of a fair coin. Let X be the number of heads and let A be the event that the number of heads is even. The (unconditional) PMF of X is
and P ( A ) = 112. The conditional PMF is obtained from the definition p x i a ( x )= P (X= x and A) /P ( A ) :
Clearly, X and A are not independent, since the PMFs px and p , y l ~ are different. For an example of a random variable that is independent of A, consider the random variable that takes the value 0 if the first toss is a head, and the value 1 if the first toss is a tail. This is intuitively clear and can also be verified by using the definition of independence.
Sec. 2.7
Independence
111
Independence of R a n d o m Variables The notion of independence of two random variables is similar to the independence of a random variable from an event. We say that two r a n d o m variables X and Y are independent if PX,Y(X,Y) = PX(X)PY(Y),
for all x, y.
This is the same as requiring that the two events { X = x) and {Y = y) be independent for every x and y. Finally, the formula px,y (x, y) = p x l Y(x I y)py (y) shows that independence is equivalent to the condition
Intuitively, independence means that the value of Y provides no information on the value of X . There is a similar notion of conditional independence of two random variables, given an event A with P(A) > 0. The conditioning event A defines a new universe and all probabilities (or PMFs) have to be replaced by their conditional counterparts. For example, X and Y are said to be conditionally independ e n t , given a positive probability event A, if P ( X = x , Y = y J A ) = P ( X=xIA)P(Y = y / A ) ,
f o r a l l x a n d y,
or, in this chapter's notation,
Once more, this is equivalent to
As in the case of events (Section 1.5), conditional independence may not imply unconditional independence and vice versa. This is illustrated by the example in Fig. 2.15. If X and Y are independent random variables, then
as shown by the following calculation:
=
7, -yXYPX(X)PY (Y)
(by independence)
Discrete R a n d o m Variables
Chap. 2
Figure 2.15: Example illustrating that conditional independence may not imply unconditional independence. For the PMF shown, the random variables X and Y are not independent. For example, we have
0, find the values for p for which n = 2k + 1 is better for the Celtics than n = 2k - 1.
Problem 7. You just rented a large house and the realtor gave you 5 keys, one for each of the 5 doors of the house. Unfortunately, all keys look identical, so to open the front door, you try them at random. (a) Find the PMF of the number of trials you will need to open the door, under the following alternative assumptions: (1) after an unsuccessful trial, you marlc the corresponding key, so that you never try it again, and (2) at each trial you are equally likely to choose any key. , (b) Repeat part (a) for the case where the realtor gave you an extra duplicate key for each of the 5 doors.
Problem 8. Recursive computation of the binomial PMF. Let X be a binomial random variable with parameters n and p. Show that its PWIF can be computed by , by using the recursive formula starting with p x (0) = (1 - P ) ~and p x ( k + l ) = -...?-. I-p
Problem 9.
n-k
k + l P X ( ~ ) , k = 0 , 1 , . . . ,n - 1 .
Form of the binomial PMF. Consider a binomial random variable
X with parameters n and p. Let k* be the largest integer that is less than or equal
+
to (n 1)p. Show that the PMF px(k) is monotonically nondecreasing with k in the range from 0 to k*, and is monotonically decreasing with k for k 2 k*.
Problem 10. Form of the Poisson PMF. Let X be a Poisson random variable with parameter A. Show that the PMF p x ( k ) increases monotonically with k up to the point where k reaches the largest integer not exceeding A, and after that point decreases monotonically with k. Problem 11." The matchbox problem - inspired by Banach's smoking habits. A smoker mathematician carries one matchbox in his right pocket and one in his left pocket. Each time he wants to light a cigarette, he selects a matchbox from either pocket with probability p = 112, independently of earlier selections. The two matchboxes have initially n matches each. What is the PRiIF of the number of remaining matches at the moment when the mathematician reaches for a match and discovers that the corresponding matchbox is empty? How can we generalize to the case where the probabilities of a left and a right pocket selection are p and 1 - p, respectively?
Problems
121
Solution. Let X be the number of matches that remain when a matchbox is found empty. For k = 0 , 1 , . . . , n , let Lk (or Rk) be the event that an empty box is first discovered in the left (respectively, right) pocltet while the number of matches in the right (respectively, left) poclcet is k at that time. The PMF of X is
Viewing a left and a right poclcet selection as a "success" and a "failure," respectively, P(Lk) is the probability that there are n successes in the first 2n - k trials, and trial 2n - k t 1 is a success, or
By symmetry, P(Lk) = P(Rk), so
In the more general case, where the probabilities of a left and a right pocket selection are p and 1 - p, using a similar reasoning, we obtain
which yields p x (k) = P(Lk) =
+ P(Rk)
(2ni k, (pn+'(l - pln-* + p n W k ( l- p)"+'),
k = o , I , . . . ,n.
P r o b l e m 12.* Justification of t h e Poisson a p p r o x i m a t i o n property. Consider the PMF of a binomial random variable with parameters n and p. Show that asymptotically, as p - t 0, n --+ 00, while n p is fixed at a given value A, this PMF approaches the PMF of a Poisson random variable with parameter A. Solution. Using the equation X = np, write the binomial P M F as
( n - k)! k!
p k ( l - p)n-k
- n(n-l)...(n-k-t-l)
nk
A~ k!
n-k
Discrete Random Variables
122
Chap. 2
Fix k and let n -+m. We have, for j = 1 , . . . , k,
Thus, for each fixed k , as n
-
m we obtain
SECTION 2.3. Functions of Random Variables P r o b l e m 13. A family has 5 natural children and has adopted 2 girls. Each natural child has equal probability of being a girl or a boy, independently of the other children. Find the PMF of the number of girls out of the 7 children. P r o b l e m 14. Let X be a random variable that takes values from 0 to 9 with equal probability 1/10. (a) Find the PMF of the random variable Y = X mod(3). (b) Find the PMF of the random variable Y = 5 mod(X
+ 1).
P r o b l e m 15. Let K be a random variable that takes, with equal probability 1/(2n+l), the integer values in the interval [-n, n]. Find the PMF of the random variable Y = l n X , where X = alKI, and a is a positive number.
SECTION 2.4. Expectation, Mean, and Variance P r o b l e m 16. Let X be a random variable with PMF ifx=-3,-2,-1,0,1,2,3, otherwise. (a) Find a and E[X]. (b) What is the PMF of the random variable Z = (X - ~1x1)' ? (c) Using part (b), compute the variance of X (d) Compute the variance of X using the formula var(X) =
2 23, (X - E[x]) px(x).
P r o b l e m 17. A city's temperature is modeled as a random variable with mean and standard deviation both equal to 10 degrees Celsius. A day is described as "normal" if the temperature during that day ranges within one standard deviation from the mean. What would be the temperature range for a normal day if temperature were expressed in degrees Fahrenheit? P r o b l e m 18. Let a and b be positive integers with a I_< b, and let X be a random variable that takes as values, with equal probability, the powers of 2 in the interval [2", 2b]. Find the expected value and the variance of X.
Problems
123
Problem 19. A prize is randomly placed in one of ten boxes, numbered from 1 to 10. You search for the prize by asking yes-no questions. Find the expected number of questions until you are sure about the location of the prize, under each of the following strategies. (a) An enumeration strategy: you ask questions of the form "is it in box k?" (b) A bisection strategy: you eliminate as close to half of the remaining boxes as possible by asking questions of the form "is it in a box numbered less than or equal to k?".
Problem 20. As an advertising campaign, a chocolate factory places golden tickets in some of its candy bars, with the promise that a golden ticket is worth a trip through the chocolate factory, and all the chocolate you can eat for life. If the probability of finding a golden ticket is p, find the mean and the variance of the number of candy bars you need to eat to find a ticket. Problem 21. St. Petersburg paradox. You toss independently a fair coin and you count the number of tosses until the first tail appears. If this number is n, you receive 2" dollars. What is the expected amount that you will receive? How much would you be willing to pay to play this game? Problem 22. Two coins are simultaneously tossed until one of them comes up a head and the other a tail. The first coin comes up a head with probability p and the second with probability q. All tosses are assumed independent. (a) Find the PMF, the expected value, and the variance of the number of tosses. (b) What is the probability that the last toss of the first coin is a head?
Problem 23. (a) A fair coin is tossed successively until two consecutive heads or two consecutive tails appear. Find the PMF, the expected value, and the variance of the number of tosses. (b) Assume now that the coin is tossed successively until we obtain a tail that is immediately preceded by a head. Find the PMF and the expected value of the number of tosses.
Problem 24." Variance of the Poisson. Consider the Poisson random variable with parameter A. Compute its second moment and variance. Solution. As shown in the text, the mean is given by
Also
Discrete Random Variables
from which
var(X) = E[x']- (E[x])' = X(X
+ 1) - A'
Chap. 2
= A.
SECTION 2.5. Joint PMFs of Multiple Random Variables Problem 25. A stoclc market trader buys 100 shares of stock A and 200 shares of stock B. Let X and Y be the price changes of A and B, respectively, over a certain time period, and assume that the joint PMF of X and Y is uniform over the set of integers x and y satisfying -1Ly-xI1. -25x54, (a) Find the marginal PMFs and the means of X and Y (b) Find the mean of the trader's profit.
Problem 26. A class of n students taltes a test consisting of rn questions. Suppose that student i submitted answers to the first mi questions. (a) The grader randomly picks one answer, call it (I,J ) , where I is the student ID number (taking values 1 , . . . ,n) and J is the question number (taking values 1,. . . ,m). Assume that all answers are equally likely to be picked. Calculate the joint and the marginal PMFs of I and J.
(b) Assume that an answer to question j , if submitted by student i, is correct with probability pi?. Each answer gets a points if it is correct and gets b points otherwise. Calculate the expected value of the score of student i.
Problem 27. PMF of the minimum of several random variables. On a given day, your golf score talres values from the range 101 to 110, with probability 0.1, independently from other days. Determined to improve your score, you decide to play Xz, on three different days and declare as your score the minimum X of the scores XI, and X8 on the different days. (a) Calculate the PMF of X . (b) By how much has your expected score improved as a result of playing on three days?
Problem 28." The quiz problem. Consider a quiz contest where a person is given a list of n questions and can answer these questions in any order he or she chooses. Question i will be answered correctly with probability pi, and the person will then receive a reward vi. At the first incorrect answer, the quiz terminates and the person is allowed to lteep his or her previous rewards. The problem is to choose the ordering of questions so as to maxirnize the expected value of the total reward obtained. Show that it is optimal to answer questions in a nonincreasing order of pivi/(l - pi).
Problems
125
Solution. We will use a so-called interchange argument, which is often useful in scheduling-like problems. Let i and j be the kth and (k 1)st questions in an optimally ordered list L = (il, . . . ,ik-1,i,j,ilc+3,.. . , i n ) .
+
Consider the list L' = (il, . . . , ik-1 , j , i, i k + 2 , . . . ,in) obtained from L by interchanging the order of questions i and j. We compute the expected values of the rewards of L and L', and argue that since L is optimally ordered, we have E[reward of L] 2 E[reward of L']. Define the weight of question i to be
We will show that any permutation of the questions in a nonincreasing order of weights maximizes the expected reward. If L = (il, . . . , i n ) is a permutation of the questions, define L ( ~to ) be the permutation obtained from L by interchanging questions i k and ik+l. Let us first compute . have the difference between the expected reward of L and that of L ( ~ )We
and
Therefore
E [reward of L("] - E[reward of L] = pil
. . 'Pik-1 (
~ i ~ + ~ +v Pik+lPfkvik i ~ + ~
Now, let us go back to our problem. Consider any permutation L of the questions. ) w(ik+l) for some k, it follows from the above equation that the permutation If ~ ( i k < L ( ~has ) an expected reward larger than that of L. So, an optimal permutation of the questions must be in a nonincreasing order of weights. Let us finally show that any two such permutations have equal expected rewards. Assume that L is such a permutation and say that w(ik) = w(ik+l) for some k. We know that interchanging ik and ik+1preserves the expected reward. So, the expected reward of any permutation L' in a non-increasing order of weights is equal to that of L, because L' can be obtained from L by repeatedly interchanging adjacent questions having equal weights.
Discrete Random Variables
126
Chap. 2
P r o b l e m 29." T h e inclusion-exclusion formula. Let A1, A,, . . . , A, be events. Let S1 = {i I 1 i _< n), S2 = {(il, i?) I 1 il < zz n), and more generally, let Sm be the set of all m-tuples ( i l l . . . ,i,) of indices that satisfy 1 il < ia < . . . < im n. Show that
1.
This is because if the spider and the fly are at a distance d > 1 apart, then one second later their distance will be d (if the fly moves away from the spider) or d - 1 (if the fly
Problems
129
does not move) or d - 2 (if the fly moves towards the spider). We also have, for the case where the spider and the fly start one unit apart,
Using the total expectation theorem, we obtain
E[TI Ad] =: P(Bd I Ad)E[TIAd
n Bd] P(Bd-1 I Ad)E[T 1 Ad flBd-I] + P(Bd-2 I Ad)E[TI Ad n Bd-21,
if d
> 1,
and
It can be seen based on the problem data that
EIT/AlnB1]=l+EITIAl],
E [ T I A l n B o ]- 1 ,
so by applying the formula for the case d = 1, we obtain
By applying the formula with d = 2, we obtain
E[T ) Az] =: pE[T I A2 We have
n Bz] + ( 1 - 2p)E[T\ A2 n B I ]+ pE[T I Az n Bo].
E[T I A? n Bo] = 1, E[T I Az
n Bi] = 1 + E[T I All,
so by substituting these relations in the expression for E[T / Az],we obtain
This equation yields after some calculation
Discrete Random Variables Generalizing, we obtain for d
E[T I
= p ( l + E[TI Ad])
Chap. 2
> 2,
+ ( 1 - 2p) (1 + E[T I & - I ] ) + p ( l + E[T I A ~ - ~ ] ) .
Thus, E[T 1 Ad] can be generated recursively for any initial distance d , using as initial conditions the values of E[T I A?]and E[T ( ,421 obtained earlier. Finally, the expected value of T can be obtained using the given PMF for the initial distance D and the total expectation theorem:
Problem 35." Verify the expected value rule
using the expected value rule for a function of a single random variable. Then, use the rule for the special case of a linear function, to verify the formula
where a and b are given scalars. Solution. We use the total expectation theorem to reduce the problem to the case of a single random variable. In particular, we have
as desired. Note that the third equality above used the expected value rule for the function g ( X , y) of a single random variable X. For the linear special case, the expected value rule gives
Problems
131
P r o b l e m 36.* The multiplication r u l e for conditional P M F s . Let X , Y , and Z be random variables. (a) Show that P X , Y , Z (Y,~ ,2) = PX(X)PY~X(Y I ~ ) P Z I X(2, YIx, Y). (b) How can we interpret this formula as a special case of the multiplication rule given in Section 1.3? (c) Generalize to the case of more than three random variables. Solution. (a) We have P X , Y , Z ( ~ ,=~P, (~X) = x,Y= y, Z = z) = P ( x = x ) ~ ( ~ = ~ , z = z j x = x )
=P(X=x)P(Y=y~X=x)P(Z=z~X=x,Y=y) = PX(X)PY~X(Y I ~ ) P Z ~ X , Y( 2( ,Zy).
(b) The formula can be written as
which is a special case of the multiplication rule. (c) The generalization is
P r o b l e m 37.* Splitting a Poisson r a n d o m variable. A transmitter sends out either a 1 with probability p, or a 0 with probability 1 - p, independently of earlier transmissions. If the number of transmissions within a given time interval has a Poisson P M F with parameter A, show that the number of 1's transmitted in that same time interval has a Poisson PMF with parameter pX. Solution. Let X and Y be the numbers of 1's and 0's transmitted, respectively. Let Z = X iY be the total number of symbols transmitted. We have
Discrete Random Variables
132
Chap. 2
Thus,
-
e-x~
( A P ) e-X(l-p) ~
n.!
m!
so that X is Poisson with parameter Ap.
SECTION 2.7. Independence Problern 38. Alice passes through four traffic lights on her way to work, and each light is equally likely to be green or red, independently of the others. (a) What is the PMF, the mean, and the variance of the number of red lights that Alice encounters? (b) Suppose that each red light delays Alice by exactly two minutes. What is the variance of Alice's commuting time? Problem 39. Each morning, Hungry Harry eats some eggs. On any given morning, the number of eggs he eats is equally likely to be 1, 2, 3, 4, 5, or 6, independently of what he has done in the past. Let X be the number of eggs that Harry eats in 10 days. Find the mean and variance of X. Problem 40. A particular professor is lcnown for his arbitrary grading policies. Each paper receives a grade from the set {A, A-, B f , B, B-, C f ) , with equal probability, independently of other papers. How many papers do you expect to hand in before you receive each possible grade at least once? Problem 41. You drive to work 5 days a week for a full year (50 weeks), and with probability p = 0.02 you get a traffic ticlcet on any given day, independently of other days. Let X be the total number of tickets you get in the year. (a) What is the probability that the number of tickets you get is exactly equal to the expected value of X? (b) Calculate approximately the probability in (a) using a Poisson approximation. (c) Any one of the ticlcets is $10 or $20 or $50 with respective probabilities 0.5, 0.3, and 0.2, and independently of other tickets. Find the mean and the variance of the amount of money you pay in traffic tickets during the year. (d) Suppose you don't know the probability p of getting a ticket, but you got 5 tickets during the year, and you estimate p by the sample mean
Problems
133
What is the range of possible values of p assuming that the difference between p and the sample mean is within 5 times the standard deviation of the sample mean? P r o b l e m 42. C o m p u t a t i o n a l problem. Here is a probabilistic method for computing the area of a given subset S of the unit square. The method uses a sequence of independent random selections of points in the unit square [O,1] x [0, I], according to a uniform probability law. If the ith point belongs to the subset S the value of a random variable Xi is set to 1, and otherwise it is set to 0. Let X I , Xz, . . . be the sequence of random variables thus defined, and for any n, let
(a) Show that E[S,] is equal to the area of the subset S, and that var(S,) diminishes to 0 as n increases. (b) Show that to calculate S,, it is sufficient to know Sn-1 and X,, so the past values of X k , k = 1,. . . ,n - 1, do not need to be remembered. Give a formula. (c) Write a computer program to generate S, for n = 1 , 2 , . . . ,10000, using the computer's random number generator, for the case where the subset S is the circle inscribed within the unit square. How can you use your program to measure experimentally the value of n? (d) Use a similar computer program to calculate approximately the area of the set of all (x, y) that lie within the unit square and satisfy 0 cos nx sinny 1.
0. Show that the variance of X is maximized if the pi are chosen to be all equal to p/n.
(b) Suppose that each Xi is geometric with parameter pi, and that pl, . . . , p n are chosen so that the mean of X is a given p > 0. Show that the variance of X is minimized if the pi are chosen to be all equal to n/p. [Note the strikingly different character of the results of parts (a) and (b).] Solution. (a) We have var(X) =
C v a r ( ~ i =) C pi(l -
pi) = p -
Thus maximizing the variance is equivalent to minimizing pi = p) that (using the constraint
z:=,
so
ElnZl pp is minimized when p, = pln for all i.
CP:.
ELlp:.
It can be seen
Problems (b) We have
and
Introducing the change of variables yi = l l p i = E[Xi], we see that the constraint becomes n
and that we must minimize
subject to that constraint. This is the same problem as the one of part (a), so the method of proof given there applies.
Problem 46." Entropy and uncertainty. Consider a random variable X that can take n values, 21,. . . , x,, with corresponding probabilities p l , . . . , p n . The entropy of X is defined to be n
(All logarithms in this problem are with respect to base two.) The entropy H ( X ) provides a measure of the uncertainty about the value of X . To get a sense of this, note that H ( X ) 2 0 and that H ( X ) is very close to 0 when X is "nearly deterministic," i.e., takes one of its possible values with probability very close to 1 (since we have plogp G 0 if either p x 0 or p w 1). The notion of entropy is fundamental in information theory, which originated with C. Shannon's famous work and is described in many specialized textbooks. For example, it can be shown that H ( X ) is a lower bound to the average number of yes-no questions (such as "is X = xl?" or "is X < xs?") that must be asked in order to determine the value of X . Furthermore, if k is the average number of questions required to determine the value of a string of independent identically distributed random variables X I , X2,. . . , X n , then, with a suitable strategy, k l n can be made as close to H ( X ) as desired, when n is large. (a) Show that if 91,. . . , qn are nonnegative numbers such that x:=l
qi = 1, then
with equality if and only if pi = qi for all i. As a special case, show that H(X) logn, with equality if and only if pi = l / n for all i. Hint: Use the inequality l n a a - 1, for a > 0, which holds with equality if and only if a = 1.
t The integral JB f ~ ( xdx ) is to be interpreted in the usual calculus/Riemann sense and we implicitly assume that it is well-defined. For highly unusual functions and sets, this integral can be harder - or even impossible - to define, but such issues belong to a more advanced treatment of the subject. In any case, it is comforting to know that mathematical subtleties of this type do not arise if fx is a piecewise continuous function with a finite or countable number of points of discontinuity, and B is the union of a finite or countable number of intervals.
Sec. 3.1
Continuous Random Variables ancl PDFs
141
Figure 3.1: Illustration of a PDF. The probability that X takes a value in an b interval [a,b] is Ja fx (x)dx,which is the shaded area in the figure.
Graphically, this means that the entire area under the graph of the PDF must be equal to 1. To interpret the PDF, note that for an interval [x,x S] with very small length 6, we have
+
so we can view f X ( x ) as the "probability mass per unit length" near x (cf. Fig. 3.2). It is important to realize that even though a PDF is used to calculate event probabilities, f x ( x ) is not the probability of any particular event. In particular, it is not restricted to be less than or equal to one.
Figure 3.2: Interpretation of the PDF fx (x) as "probability mass pkr unit length" around x. If 6 is very small, the probability that X takes a value in the interval [x,x S] is the shaded area in the figure, which is approximately equal t o fx(x) . 6 .
+
Example 3.1. Continuous Uniform Random Variable. A gambler spins a wheel of fortune, continuously calibrated between 0 and 1,and observes the resulting number. Assuming that any two subintervals of [0,1] of the same length have the same probability, this experiment can be modeled in terms of a random variable X with PDF c, i f O < x < l , fxb)= 0, otherwise,
General Random Variables
142
Chap. 3
for some coizstant c. This constant can be determined by using the normalization property 1
fxbIdx=llcdx=ci
dx=c,
so tliat c = 1. More generally, we can consider a random variable X that takes values in an interval [a,b], and again assume that any two subintervals of the same length have the same probability. We refer t o this type of random variable as uniform or uniformly distributed. Its P D F has the form
0,
otherwise,
(cf. Fig. 3.3). The constant value of the PDF within [a,b] is determined from the normalization property. Indeed, we have
Example 3.2. Piecewise Constant PDF. Alvin's driving time to work is between 15 and 20 minutes if the day is sunny, and between 20 and 25 minutes if the day is rainy, with all times being equally likely in each case. Assume that a day is sunny with probability 213 and rainy with probability 113. What is the P D F of the driving time, viewed as a random variable X? We interpret the statement that "all times are equally likely" in the sunny and the rainy cases, to mean that the PDF of X is constant in each of the intervals [15,20] and [20,25]. Furthermore, since these two intervals contain all possible driving times, the PDF should be zero everywhere else:
fx(x)=
{
< <
0, and g(x) = 0, otherwise. Then Y = g(X) is a discrete random variable taking values in the finite set ( 0 , l ) . In either case, the mean of g ( X ) satisfies the expected value rule
in complete analogy with the discrete case; see the end-of-chapter problems. The nth m o m e n t of a continuous random variable X is defined as E[Xn], the expected value of the random variable Xn. The variance, denoted by var(X) , is defined as the expected value of the random variable (X - E[X])2 . We now summarize this discussion and list a number of additional facts that are practically identical to their discrete counterparts.
Example 3.4. Mean and Variance of the Uniform Random Variable. Consider the case of a uniform PDF over an interval [a,b], as in Example 3.1. We
146
General Random Variables
Chap. 3
have
E[X] =
1:
xfx (x)dx
+
as one expects based on the symmetry of the PDF around ( a b ) / 2 . To obtain the variance, we first calculate the second moment. We have b
E[x']=
d
x2
b-a
x
Thus, the variance is obtained as
after some calculation.
Exponential Random Variable
An exponential random variable has a PDF of the form
fx(4 =
{,A;-*x,
if x 2 0, otherwise,
Sec. 3.1
Continuous Random Variables and PDFs
14'7
where X is a positive parameter characterizing the PDF (see Fig. 3.5). This is a legitimate PDF because
Note that the probability that X exceeds a certain value decreases exponentially. Indeed, for any a 0, we have
>
An exponential random variable can, for example, be a good model for the amount of time until a piece of equipment breaks down, until a light bulb burns out, or until an accident occurs. It will play a major role in our study of random processes in Chapter 5, but for the time being we will simply view it as a special random variable that is fairly tractable analytically.
Figure 3.5: The PDF ~
e of -an exponential ~ ~ random variable.
The mean and the variance can be calculated to be
These formulas can be verified by straightforward calculation, as we now show. We have, using integration by parts,
General Random Variables
148
Chap. 3
Using again integration by parts, the second moment is
Finally, using the formula var(X) = E[X2J- (E[x])~, we obtain
Example 3.5. The time until a small meteorite first lands anywhere in the Sahara desert is modeled as an exponential random variable with a mean of 10 days. The time is currently midnight. What is the probability that a meteorite first lands some time between 6 a.m. and 6 p.m. of the first day? Let X be the time elapsed until the event of interest, measured in days. Then, X is exponential, with mean 1/X = 10, which yields X = 1/10. The desired probability is
where we have used the formula P ( X >_ a ) = P ( X
> a) = e-"".
,2 CUMUEATPVE DISTRIBUTION FUNCTIONS
We have been dealing with discrete and continuous random variables in a somewhat different manner, using PMFs and PDFs, respectively. It would be desirable to describe all kinds of random variables with a single mathematical concept. This is accomplished with the cumulative distribution f u n c t i o n , or CDF for short. The CDF of a random variable X is denoted by Fx and provides the probability B(X x). In particular, for every x we have
Figure 3.11: The signal detection scheme of Example 3.8. The area of the shaded region gives the probability of error in the two cases where -1 and +1 is transmitted.
An error occurs whenever -1 is transmitted and the noise N is at least 1 so that N S = N - 1 2 0, or whenever +l is transmitted and the noise N is smaller
+
)
,
General Random Variables
158
than -1 so that N + S = N
Chap. 3
+ 1 < 0. In the former case, the probability of error is
In the latter case, the probability of error is the same, by symmetry. The value of @(l/a)can be obtained from the normal table. For a = 1, we have @(1/0)= @(I)= 0.8413, and the probability of error is 0.1587.
The normal random variable plays an important role in a broad range of probabilistic models. The main reason is that, generally speaking, it models well the additive effect of many independent factors in a variety of engineering, physical, and statistical contexts. Mathematically, the key fact is that the s u m of a large number of independent and identically distributed (not necessarily normal) random variables has a n approximately normal CDF,regardless of the CDF of the individual random variables. This property is captured in the celebrated central limit theorem, which will be discussecl in Chapter 7.
3.4 CONDETPONING ON AN EVENT
The conditional PDF of a continuous random variable X, given an event {X E A) with P ( X E A) > 0 , is defined by:
t 0,
otherwise.
It can be used to calculate the conditional probability of events of the form {X E R) because
As in the discrete case, the conditional PDF is zero outside the conditioning set. Within the conditioning set, the conclitional PDF has exactly the same shape as the unconditional one, except that it is scaled by the constant factor 1/P(X E A). This normalization ensures that f x l A integrates to 1, which malces it a legitimate PDF; see Fig. 3.12. Thus, the conditional PDF is similar to an ordinary PDF, except that it refers to a new universe in which the event {X E A) is known to have occurrecl. In the more general case where we wish to condition on a positive probability event A that cannot be described in terms of the random variable X , the
Sec. 3.4
Conditioning on an event
Figure 3.12: The unconditional P D F fx and the conditional P D F f x l A ,where A is the interval [a,b]. Note that within the conditioning event A,fx A retains the same shape as fx,except that it is scaled along the vertical axis.
conditional PDF of X given A is defined as a nonnegative function fxl that satisfies
for any subset B of the real line although, in general, there is no simple formula for fxla(x)in terms of flY(x).t
Example 3.9. The Exponential Random Variable is Memoryless. The time T until a new light bulb burns out is an exponential random variable with parameter A. Ariadne turns the light on, leaves the room, and when she returns, t time units later, finds that the light bulb is still on, which corresponds to the event A = {T > t). Let X be the additional time until the light bulb burns out. What is the conditional CDF of X, given the event A? We have, for x 2 0,
- P(T > t + x and T > t)
P(T > t)
t According to the notation introduced here, when we condition on an event of (x).However, in such the form {X E A), we should be using the notation fxl(xEAl introduced in the beginning cases, we will be reverting to the simpler notation fxIA(x), of this section.
160
General Random Variables
Chap. 3
where we have used the expression for the CDF of an exponential random variable derived in Section 3.2. Thus, the conditional CDF of X is exponential with parameter A, regardless of the time t that elapsed between the lighting of the bulb and Ariadne's arrival. This is known as the memorylessness property of the exponential. Generally, if we model the time to complete a certain operation by an exponential random variable X, this property implies that as long as the operation has not been completed, the remaining time up to completion has the same exponential CDF, no matter when the operation started.
For a continuous random variable, we define the conditional expectation E [ X / A] and the conditional variance var(X / A), given an event A, similar to the unconditional case, except that we now need to use the conditional PDF. We summarize the discussion so far, together with some additional properties in the table that follows.
Sec. 3.4
e
Conditioning on an event
161
If A1, A2, . . . , An are disjoint events with P(Ai) form a partition of the sample space, then
> 0 for each i, that
n
(a version of the total probability theorem), and
(the tot a1 expect ation theorem). Similarly, n
To justify the above version of the total probability theorem, we use the total probability theorem from Chapter 1, to obtain
This formula can be rewritten as
We take the derivative of both sides, with respect to x , and obtain the desired relation n f ~ ( x=) C ~ ( ~ i ) f x(x). ~ a , i=l If we now multiply both sides by x and then integrate from - m to m , we obtain random variables. the total expectation theorem for contin~~ous The total expectation theorem can often facilitate the calculation of the mean, variance, and other moments of a random variable, using a divide-andconquer approach.
Example 3.10. Mean and Variance of a Piecewise Constant PDF. Suppose that the random variable X has the piecewise constant PDF 113, if 0 x 1, 213, i f l < x < 2 , 0, otherwise,
<
0, the conditional PDF of X given that Y = y, is defined
This definition is analogous to the formula p x l y ( x I y) = px,y(x, y)/py(y) for the discrete case. When thinking about the conditional PDF, it is best to view y as a fixed number and consider f x l Y (x / y) as a function of the single variable x. As a function of x, the conditional PDF f x j y(3: I y) has the same shape as the joint PDF fX, (x, y ) , because the normalizing factor f y (y) does not depend on x;see Fig. 3.17.Note that the normalization ensures that
so for any fixed y, f,ul (x 1 y) is a legitimate PDF.
Example 3.15. Circular Uniform PDF. Ben throws a dart at a circular target of radius r (see Fig. 3.18). We assume that he always hits the target, and that all points of impact (x, y) are equally likely, so that the joint PDF of the random
Sec. 3.5
P'ultiple Continuous Random Variables
169
Figure 3.17: Visualization of the conditional P D F f X l (x 1 y). Let X and Y have a joint P D F which is uniform on the set S . For each fixed y, we consider the joint P D F along the slice Y = y and normalize it so that it integrates to 1.
variables X and Y is uniform. Following Example 3.12, and since the area of the circle is n r 2 , we have f x , y ( x ,Y) =
1 area of the circle '
if (x, y) is in the circle, otherwise,
( 0,
otherwise.
Figure 3.18: Circular target for Example 3.15.
To calculate the conditional PDF f x I (x 1 y), let us first calculate the marginal PDF f y ( y ) . For / y / > r , it is zero. For /yl r, it can be calculated as follows:
if x 0, otherwise, 1
E[X] = A'
l-e-Xx,
1 var (X) = -. A2
ifx>O, otherwise,
Sec. 3.7
S~zrnrnaryand Discussion
Normal with Parameters
A!.
and a "
191
0:
We have also introduced CDFs, which can be used to characterize general random variables that are neither discrete nor continuous. CDFs are related to PMFs and PDFs, but are more general. In the case of a discrete random variable, we can obtain the PMF by differencing the CDF, and in the case of a continuous random variable, we can obtain the PDF by differentiating the CDF. Calculating the PDF of a function g(X) of a continuous random variable X can be challenging. The concept of a CDF is useful in this calculation. In particular, the PDF of g(X) is typically obtained by calculating and differentiating the corresponding CDF. In some cases, such as when the function g is strictly monotonic, the calculation is facilitated through the use of some special formulas.
General Random Variables
192
Chap. 3
PROBLEMS
SECTION 3.1. Continuous Random Variables and PDFs P r o b l e m 1. Let X be uniformly distributed in the unit interval [0, 11. Consider the random variable Y = g(X), where
Find the expected value of Y by first deriving its PMF. Verify the result using the expected value rule. P r o b l e m 2 . Laplace r a n d o m variable. Let X have the PDF
where X is a positive scalar. Verify that fx satisfies the normalization condition, and evaluate the mean and variance of X. P r o b l e m 3." Show that the expected value of a continuous random variable X satisfies E[X] =
I"
P(X
> X) dx -
I"
P(X
< -x)
dx.
Solution. We have
where for the second equality we have reversed the order of integration by writing the set {(x, y) 10 x < CQ, x y < CQ) as {(x, y ) 10 x y , 0 _< y < CQ). Similarly, we can show that
t if and only if gS (x) > t . We will use the result
> 0,
from the preceding exercise. The first term in the right-hand side is equal to
By a symmetrical argument, the term in the left-hand side is given by
Lw
p(g(x)
< t ) dt =
L:
g-(x) f x ( x ) dx.
Combining the above equalities, we obtain
SECTION 3.2. Cumulative Distribution Functions Problem 5. Consider a triangle and a point chosen within the triangle according t o the uniform probability law. Let X be the distance from the point to the base of the triangle. Given the height of the triangle, find the CDF and the PDF of X . Problem 6. Calamity Jane goes to the bank to make a withdrawal, and is equally lilcely to find 0 or 1 customers ahead of her. The service time of the customer ahead, if present, is exponentially distributed with parameter A. What is the CDF of Jane's waiting time? Problem 7. Consider two continuous random variables Y and 2,and a random variable X that is equal to Y with probability p and to Z with probability 1 - p.
(a) Show that the PDF of X is given by
General Random Variables
194
Chap. 3
(b) Calculate the CDF of the two-sided exponential random variable that has PDF given by
wliere X
> 0 and 0 < p < 1.
P r o b l e m %.* Simulating a continuous r a n d o m variable. A computer has a subroutine that can generate values of a random variable U that is uniformly distributed in the interval [0, 11. Such a subroutine can be used to generate values of a continuous random variable with given CDF F ( x ) as follows. If U takes a value u, we let the value of X be a number x that satisfies F ( x ) = u. For simplicity, we assume that the given CDF is strictly increasing over the range S of values of interest, where S = {x 10 < F ( x ) < 1). This condition guarantees that for any u E (0, I ) , there is a unique x that satisfies F ( x ) = u. (a) Show that the CDF of the random variable X thus generated is indeed equal to the given CDF. (b) Describe how this procedure can be used to simulate an exponential random variable with parameter A. (c) How can this procedure be generalized to simulate a discrete integer-valued random variable? Solution. (a) By definition, the random variables X and U satisfy the relation F ( X ) = U. Since F is monotonic, we have for every x, X
<x
if and only if
F(X)
< F(x).
Therefore,
P(X 2 X) = P(F(X) 5 F(x)) = P ( U 5 F(x)) = F(x), where the last equality follows because U is uniform. Thus, X has the desired CDF.
>
(b) The exponential CDF has the form F ( x ) = 1 - em'" for x 0. Thus, to generate values of X , we should generate values u E (O,1) of a uniformly distributed random variable U, and set X to the value for which 1 - e-'" = u , or x = - ln(1 - u)/A. (c) Let again F be the desired CDF. To any u E (0, l ) , there corresponds a unique integer xu such that F(x, - 1) < u F(x,). This correspondence defines a random variable X as a function of the random variable U. We then have, for every integer k ,
'
Since this is the same for all x, it follows that the conditional distribution of X is uniform on the interval [0,z], with PDF f x l z(x I z) = 1/z.
Problern 28.* Consider two continuous random variables with joint PDF f x , ~ Sup. pose that for any subsets A and B of the real line, the events {X E A) and {Y € B) are independent. Show that the random variables X and Y are independent. Solution. For any two real numbers x and y, using the independence of the events {X x) and {Y y), we have
0 (or p < 0), then the values of X - E[X] and Y - E[Y] "tend" to have the same (or opposite, respectively) sign, and the size of jpl provides a
238
E'urther Topics on Random Variables
Chap. 4
normalized measure of the extent to which this is true. In fact, always assuming that X and Y have positive variances, it can be shown that p = 1 (or p = -1) if and only if there exists a positive (or negative, respectively) constant c such that
Y
-
E[Y]= c(X - E[x])
(see the end-of-chapter problems). The following example illustrates in part this property.
Example 4.25. Consider n independent tosses of a coin with probability of a head equal to p. Let X and Y be the numbers of heads and of tails, respectively, and let us look at the correlation coefficient of X and Y. Here. we have X Y = n. and also E[X] E[Y] = n. Thus,
+
+
We will calculate the correlation coefficient of X and equal to -1. S47e have
Y ,and verify that it is indeed
Hence, the correlation coefficient is
Covariance and Correlation e
The covariance of X and Y is given by
e
If cov(X, Y) = 0 , we say that X and Y are uncorrelated.
e
If X and Y are independent, they are uncorrelated. The converse is not always true.
Sec. 4.5
e
Covariance and Correlation
239
The correlation coefficient p(X, Y) of two random variables X and Y with positive variances is defined by
and satisfies
-1 5 p(X,Y) 5 1.
Variance of the Sum of Random Variables The covariance can be used to obtain a formula for the variance of the sum of several (not necessarily independent) random variables. In particular, if XI, X2, . . . , Xn are random variables with finite variance, we have var(X1
+ X2) = var(X1) + var(Xz) + 2cov(X1, Xz),
and, more generally,
This can be seen from the following calculation, where for brevity, we denote x i = X i- E[X,]:
Further Topics on Random Variables
240
Chap. 4
T h e following example illustrates the use of this formula.
Example 4.26. Consider the hat problem discussed in Section 2 . 5 , where n people throw their hats in a box and then pick a hat at random. Let us find the variance of X , the number of people who pick their own hat. We have
where X, is the random variable that takes the value 1 if the zth person selects his/her own hat, and takes the value 0 otherwise. Noting that X,is Bernoulli with parameter p = P ( X , = 1) = l / n , we obtain
For i
# j , we have
Therefore var(X) = var
=
c11 X,
G var ( x i ) + 2
C cov(Xi, XJ )
4.6 LEAST SQUARES ESTIMATION In many practical contexts, we want t o form an estimate of the value of a random variable X given the value of a related random variable Y, which may be viewed as some form of "measurement" of X. For example, X may b e the range of
Sec. 4.6
Least Squares Estimation
241
an aircraft and Y may be a noise-corrupted measurement of that range. In this section we discuss a popular formulation of the estimation problem, which is based on finding the estimate c that minimizes the expected value of the ~ the name "least squares"). squared error ( X - c ) (hence We start by considering the simpler problem of estimating X before learning the value of Y. The estimation error X - c is random (because X is random), ] a number that depends on c and can but the mean squared error E[(X - c ) ~ is be minimized over c. With respect to this criterion, it turns out that the best possible estimate is to set c equal to E[X], as we proceed to verify. For any estimate c, we have
E [(x- c)2] = var ( X - c) + (E[X - c]) = var(X) + (E[x] - c) ',
+
2
where the first equality makes use of the formula E[Z2] = var(Z) (E[Z]) , and the second holds because when the constant c is subtracted from the random variable X , the variance is unaffected while the mean is reduced by c. We now note that the term var(X) does not depend on our choice of c. Therefore, we 2 should choose c to minimize the term (E[x] - c) , which leads to c = E[X] (see Fig. 4.9). Mean squared estimation error
Figure 4.9: The mean squared error E [(x - c ) ~ ]as, a function of the estimate c, is a quadratic in c and is minimized when c = E[X]. The minimum value of the mean squared error is var(X).
Suppose now that we observe the value y of some related random variable
Y, before forming an estimate of X . How can we exploit this additional information? Once we know the value y of Y, the situation is identical to the one considered earlier, except that we are now in a new "universe," where everything is conditioned on Y = y. We can therefore adapt our earlier conclusion and assert that c = E[X I Y = y] minimizes the conditional mean squared error ~ [ ( -cX ) 2 I Y = y]. Note that the resulting estimate c depends on the value y of Y (as it should). Thus, vie call E[X I Y = y] the least-squares estimate of X given the value y of Y.
Further Topics on Random Variables
242
Chap. 4
Example 4.27. Let X be uniformly distributed in the interval [4,10]and suppose that we observe X with some random error W, that is, we observe the value of the random variable Y=X+W. We assume that W is uniformly distributed in the interval 1-l,1], and independent of X . What is the least squares estimate of X given the value y of Y? We have fx (x) = 116 for 4 x 10, and fx (x) = 0, elsewhere. Conditioned on X being equal to some x, Y is the same as x W, and is uniform over the interval [x - 1,x + 11. Thus, the joint PDF is given by 1 1 1 ~x,Y(x,Y) = ~ X ( X ) ~ Y J X1 (X) Y= 6'2 = -, 12 10 and x - 1 5 y 5 x + 1, and is zero for all other values of (x, y). if 4 5 x The slanted rectangle in the right-hand side of Fig. 4.10 is the set of pairs (x, y) for which fx,y(x,y) is nonzero. Given a value y of Y, the conditional PDF fxjY of X is uniform on the corresponding vertical section of the slanted rectangle. The optimal estimate E[X / Y = y] is the midpoint of that section. In the special case of the present example, it happens to be a piecewise linear function of y.
<
1.
Linear Least Squares Estimation Based on a Single Measurement MTe are interested in finding a and b that minimize the mean squared estimation error E [(x - aY - b ) 2 ] , associated with a linear estimator a Y b of X . Suppose that a has already been chosen. How should we choose b? This is the same as having to choose a constant b to estimate the random variable X - aY and. by our earlier results, the best choice is to let b = E [ X - aY] = E [ X ] - aE[Y]. It now remains to minimize, with respect to a , the expression
+
which is tile same as var(X - aY) = a$ where a x and
ay
+ a2a$ + 2cov(X, -aY)
= a$
+ a2a$ - 2 a . cov(X, Y),
are the standard deviations of X and Y, respectively, and
is the covariance of X and Y.The squared error is a quadratic function of a . It is minimized at the point where its derivative is zero, which is
where
is the correlation coefficient. With this choice of a , the mean squared estimation error is given by
Sec. 4.7
The Bi9ariate Normal Distribution
247
Linear Least Squares Estimation Formulas e
The linear least squares estimator of X based on Y is given by
~jxj+ p 'JY g (Y- EIYI), where P=
cov(X, Y) ox 'JY
is the correlation coefficient. e
The resulting mean squared estimation error is equal to (1 - p2)0$-.
The formula for the linear least squares estimator has an intuitive interpretation. Suppose, for concreteness, that the correlation coefficient p is positive. The estimator starts with the baseline estimate E[X] for X ; which it then adjusts by taking into account the value of Y - E[Y]. For example, when Y is larger than its mean, the positive correlation between X and Y suggests that X is expected to be larger than its mean, and the resulting estimate is set to a value larger than X. The value of p also affects the quality of the estimate. When lpl is close to 1, the two random variables are highly correlated and knowing Y allows us to accurately estimate X.
4.7 THE BIVARIATE NORMAL DISTRIBUTION
Let U and V be two independent normal random variables, and consider two new random variables X and Y of the form X=aU+bV, Y=cU+dV, where a, b, c, d, are some scalars. Each one of the random variables X and Y is normal, since it is a linear function of independent normal random variables (se,e Example 4.12) .t Furthermore, because X and Y are linear functions of the same
t For the purposes of this section, we adopt the following convention. A random variable which is always equal to a constant will also be called normal, with zero variance, even though it does not have a PDF. With this convention, the family of normal random variables is closed under linear operations. That is, if X is normal, then a x + b is also normal, even if a = 0.
Further Topics on Random Variables
248
Chap. 4
two independent normal random variables, their joint PDF takes a special form, known as the bivariate normal PDF. The bivariate normal PDF has several useful and elegant properties and, for this reason, it is a commonly employed model. In this section, we derive many such properties, both qualitative and analytical, culminating in a closed-form expression for the joint PDF. To keep the discussion simple, we restrict ourselves to the case where X and Y have zero mean.
Jointly Normal Random Variables Two random variables X and Y are said to be jointly normal if they can be expressed in the form X=aU+bV, Y =cU+dV, where U and V are independent normal random variables. Note that if X and Y are jointly normal, then any linear combination
has a normal distribution. The reason is that if we have X = aU + bV and Y = cU + dV for some independent normal random variables U and V, then
+
Thus, Z is the sum of the independent normal random variables (as1 cs2)U and (bsl ds2)V, and is therefore normal. A very important property of jointly normal random variables, and which will be the starting point for our development, is that zero correlation implies independence.
+
Zero Correlation Implies Independence If two random variables X and Y are jointly normal and are uncorrelated, then they are independent. This property can be verified using multivariate transforms, as follows. Suppose that U and V are independent zero-mean normal random variables, and that X = aU bV and Y = cU dV, so that X and Y are jointly normal. We assume that X and Y are uncorrelated, and we wish to show that they are independent. Our first step is to derive a formula for the multivariate transform Adx,y (sl, s2) associated with X and Y. Recall that if Z is a zero-mean normal
+
+
Sec. 4.7
he Bivariate Normal Distribution
random variable with variance o$., the associated transform is 2
2
E[esZ]= h b (s) = eaz3 12, which implies that
E[ez] = M z ( 1 ) =
+
Let us fix some scalars sl, s2, and let Z = s l X s 2 Y The random variable is normal, by our earlier discussion, with variance
Z
This leads to the following formula for the multivariate transforin associated with the uncorrelated pair X and Y:
Let now fT and T? be independent zero-mean normal random variables with are the same vaxiances 0% and a$ as X and Y, respectively. Since X and independent, they are also uncorrelated, and the preceding argument yields
--
Thus, the two pairs of random variables (X, Y) and (X,Y) are associated with the same multivariate transform. Since the multivariate transform completely determines the joint PDF, it follows that the pair ( X , Y ) has the same joint -PDF as the pair (X,Y). Since f? and Y are independent, X and Y must also be independent, which establishes our claim.
The Conditional Distribution of X Given Y We now turn t o the problem of estimating X given the value of Y. To avoid uninteresting degenerate cases, we assume that both X and Y have positive variance. Let us definet
where
t Comparing with the formulas in the preceding section, it is seen that x is defined to be the linear least squares estimator of X , and k is the corresponding estimation error, although these facts are not needed for the argument that follows.
Further Topics on Random Variables
250
Chap. 4
is the correlation coefficient of X and Y.Since X and Y are linear combinations of independent normal random variables U and V, it follows that Y and x are also linear combinations of U and V. In particular, Y and 2 are jointly normal. Furthermore.
Thus, Y and x are uncorrelated and, therefore, independent. Since x is a scalar multiple of Y, it follows that x and X are independent. lVe have so far decomposed X into a sum of two independent normal random variables, namely.
l i e take conditional expectat,ions of both sides, given
Y,to obtain
where we have made use of the independence of Y and X to set E[X / Y]= 0.We have therefore reached the important conclusion that the conditional expectation E[X/ Y] is a linear function of the random variable Y. Using the above decomposition, it is now easy t o determine the conditional PDF of X. Given a value of Y, the random variable 2 = paxY/oy becomes is a known constant, but the normal distribution of the random variable unaffected. since is independent of Y. Therefore, the conditional distribution of X given Y is the same as the unconditional distribution of X.shifted by X . Since x is normal with mean zero and some variance a2-, we conclude that the
x
x
conditional distribution of X is also normal with mean X and the same variance a$. The variance of can be found with the following calculation:
x
where we have made use of the property E [ X Y ]= ,ooxoy. lVe summarize our conclusions below. Although our discussion used the zero-mean assumption, these conclusions also hold for the non-zero mean case and we state them with this added generality; see the end-of-chapter problems.
Sec. 4.7
The Bivariate Normal Distribution
Properties of Jointly Normal Random Variables Let X and Y be jointly normal random variables. e
X and Y are independent if and only if they are uncorrelated.
e
The conditional expectation of X given Y satisfies E [ X / Y] = E[X]
+ p=U Y (Y - E[Y]).
It is a linear function of Y and has a normal PDF. e
The estimation error x = X - E [ X I Y] is zero-mean, normal, and independent of Y, with variance
e
a X2 - = ( I - p 2 ) o $ .
The conditional distribution of X given Y is normal with mean E [ X ( Y] and variance 0;.
The Form of the Bivariate Normal PDF Having determined the parameters of the PDF of x and of the conditional PDF of X ,we can give explicit formulas for these PDFs. We keep assuming that X and Y have zero means and positive variances. Furthermore, to avoid the degenerate where if is identically zero, we assume that (pl < 1. We have
and
(.pzY) 2
~ X I (YX
-
1
I Y ) = &,/"ox
e
-
/2o$
1
where
02X
=
( 1 - p2)o$.
Using also the formula for the PDF of Y,
and the multiplication rule f x , y ( x , y) = f y ( y ) f X j y ( x I y ) , we can obtain the joint PDF of X and Y. This PDF is of the form
Further Topics on Random Variables
252
Chap. 4
where the normalizing constant is
The exponent term q(x:y ) is a quadratic function of x and y,
which after some straightforward algebra simplifies to
An important observation here is that the joint PDF is completely determined by ax,a y , and p. In the special case u-here X and Y are uncorrelated (p = O), the joint PDF takes the simple form
which is just the product of two independent normal PDFs. We can get some insight into the form of this PDF by considering its contours, i.e., sets of points at which the PDF takes a constant value. These contours are described by an equation of the form x2 y2 = constant,
, +, ax OY
and are ellipses whose two axes are horizontal and vertical. In the more general case u-here X and Y are dependent, a typical contour is described by x2 y2 -- = constant, 2 ~ - xY
4
+
,,fly
0 :
and is again an ellipse, but its axes are no longer horizontal and vertical. Figure 4.12 illustrates the contours for two cases, one in which p is positive and one in which p is negative.
Example 4.29. Suppose that X and Z are zero-mean jointly normal random variables, such that a$ = 4, a; = 1719, and E[XZ] = 2. We define a new random variable Y = 2X - 32. We wish to determine the PDF of Y, the conditional PDF of X given Y, and the joint PDF of and Y.
Sec. 4.7
The Bivariate Normal Distribution
Figure 4.12: Contours of the bivariate normal PDF. The diagram on the left (respectively, right) corresponds to a case of positive (respectively,negative) correlation coefficient p.
As noted earlier, a linear function of two jointly normal random variables is also normal. Thus, Y is normal with variance
Hence, Y has the normal PDF
We next note that X and Y are jointly normal. The reason is that X and Z are linear functions of two independent normal random variables (by the definition of joint normality), so that X and Y are also linear functions of the same independent normal random variables. The covariance of X and Y is equal to
Hence, the correlation coefficient of X and Y , denoted by p, is equal to
The conditional expectation of X given Y is u
E [ X jY] = p-Y UY
1 2 2 = - . -Y = -Y. 3 3 9
254
Further Topics on Random Variables
Chap. 4
The conditional variance of X given Y (which is the same as the variance of
x=
so that a x = 4313. Hence, the conditional PDF of X given Y is
Finally, the joint P D F of X and Y is obtained using either the multiplication rule f x Y (x, y) = fx (x)f X j (x I y ) , or by using the earlier developed formula for the exponent q(x,y ) , and is equal t o
We end wit,h a cautionary note. If X and Y are jointly normal, then each random variable X and Y is normal. However, the converse is not true. Namely, if each of the random variables X and Y is normal, it does not follow that they are jointly normal, even if they are uncorrelated. This is illustrated in the following example.
Example 4.30. Let X have a normal distribution with zero mean and unit variance. Let Z be independent of X ,with P(Z = 1) = P(Z = -1) = 112. Let Y = ZX,m~hichis also normal with zero mean. The reason is that conditioned on either value of 2,Y has the same normal distribution, hence its unconditional distribution is also normal. Furthermore,
so X and Y are uncorrelated. On the other hand X and Y are clearly dependent. (For example, if X = 1, then Y must be either -1 or 1.) If X and Y were jointly normal, we would have a contradiction to our earlier conclusion that zero correlation implies independence. It follows that X and Y are not jointly normal, even though both marginal distributions are normal.
The Multivariate Normal PDF The development in this section generalizes to the case of more than two random variables. For example, we can say that the random variables X I , . . . , Xn are jointly normal if all of them are linear functions of a set U l , . . . , Un of independent
Sec. 4.8
Summary and Discussion
normal random variables. We can then establish the natural extensions of the results derived in this section. For example, it is still true that zero correlation implies independence, that the conditional expectation of one random variable given some of the others is a linear function of the conditioning random variables, and that the conditional PDF of X I , . . . , X k given X k + l , . .. , Xn is multivariate normal. Finally, there is a closed-form expression for the joint PDF. Assuming that none of the random variables is a deterministic function of the others, we have fx Xn = c e - q ( ~ l , . , , r ~ n ) , where c is a normalizing constant and where q(x1, . . . , x,) is a quadratic function of X I , .. . , x, that increases to infinity as the magnitude of the vector ( X I , .. . , x,) tends to infinity. Multivariate normal models are very common in statistics, econon~etrics, signal processing, feedback control, and many other fields. However, a full development falls outside the scope of this text.
4.8 SUMMARY AND DISCUSSION In this chapter, we have studied a number of advanced topics. We discuss here some of the highlights. In Section 4.1, we introduced the transform associated with a random variable, and saw how such a transform can be computed. Conversely, we indicated that given a transform, the distribution of an associated random variable is uniquely determined and can be found, for example, using tables of commonly occurring transforms. We have found transforms useful for a variety of purposes, such as the following. (a) Knowledge of the transform associated with a random variable provides a shortcut for calculating the moments of the random variable. (b) The transform associa,ted with ables is equal to the product of them. This property was used normal (respectively, Poisson) Poisson).
the sum of two independent random varithe transforms associated with each one of to show that the sum of two independent random variables is normal (respectively,
(c) Ti-ansforms can be used to characterize the distribution of the sum of a random number of random variables (Section 4.4), something which is often impossible by any other means. (d) The inversion property of multivariate transforms was used in Section 4.7 to establish that two uncorrelated jointly normal random variables are independent. Sums of independent random variables can also be handled without using transforms. In Section 4.2, we gave the convolution formula for the PMF or
256
Further Topics on Random Variables
Chap. 4
PDF of the sum of two independent random variables, in terms of the individual PMFs or PDFs. Furthermore, in Section 4.4, we derived formulas for the mean and variance of the sum of a random number of random variables, using a divideand-conquer strategy. One of the most useful divide-and-conquer strategies relies on conditioning and the use of conditional expectations. In Section 4.3, we took a closer look at the conditional expectation and indicated that it can be viewed as a random variable, with an expectation and variance of its own. We derived some related properties, including the law of iterated expectations, and the law of total variance. In Section 4.6 we discussed problems of estimation and prediction of one random variable X, based on knowledge of another random variable Y. These problems are related t o the problems of ihference that we discussed in Section 3.5 using Bayes' rule. In this chapter, we discussed the least squares approach. The highlights of this approach are: (a) The conditional expectation E[X I Y] is the optimal estimator under the least squares criterion. (b) When the conditional expectation is hard to compute, one may be interested in an optimal estimator within a restricted class of estimators that depend linearly on the available information Y. The formula for the best linear estimator involves the correlation coefficient between X and Y, which was introduced in Section 4.5, and which provides a numerical indicator of the relation between X and Y. Finally, in Section 4.7, we introduced the concept of jointly normal random variables. Such random variables have several remarkable properties. For example, the sum of jointly normal random variables is normal. Furthermore, the conditional expectation E[X I Y] is a linear function of Y and is therefore normal. These properties imply that one can keep on forming linear combinations of jointly normal random variables, and taking conditional expectations, while remaining within the analytically tractable class of normal random variables. Furthermore, the somewhat cumbersome formula for the bivariate normal P D F is completely determined in terms of a few parameters: the means, variances, and the correlation coefficient of the random variables of interest. Thus, in order to determine the particular form of a bivariate normal PDF, one only needs to calculate the values of these parameters. For all these reasons, jointly normal models have found a tremendous range of applications, in statistics, signal processing, communication theory, and many other contexts.
Pro bIems
257
--
PROBLEMS
SECTION 4.1. Transforms Problem 1. Let X be a random variable that takes the values 1, 2, and 3, with the following probabilities:
Find the transform associated with X and use it to obtain the first three moments, E[Xl, E[X21, EIX31. Problem 2. A nonnegative integer-valued random variable X has one of the following two expressions as its transform:
2. n/r(s) = e2(eeS -1). (a) Explain why one of the two cannot possibly be the transform.
(b) Use the true transform to find P ( X = 0). Problem 3. Find the PDF of the continuous random variable X associated with the transform 1 2 2 3 M ( s ) = - . -+ - . ---3 2-s 3 3-s' Problem 4. Let X be a random variable that takes nonnegative integer values, and is associated with a transform of the form
where c is some scalar. Find E[X], p x ( I ) , and E [ X I X f 01. Problem 5. Let X , Y, and Z be independent random variables, where X is Bernoulli with parameter 1/3, Y is exponential with parameter 2, and 2 is Poisson with parameter 3. (a) Consider the new random variable U = X Y associated with U.
+ (1 - X ) Z .
Find the transform
+ 3. (c) Find the transform associated with Y + 2.
(b) Find the transform associated with 2 2
Problem 6." Let X be a discrete random variable taking nonnegative integer values. Let M ( s ) be the transform associated with X.
Further Topics on Ra.ndom Varia.bles
258
Chap. 4
(a) Show that P ( X = 0) = lim Ad(s) s1-00
(b) Use part (a) to verify that if X is a binomial random variable with parameters n and p, we have P ( X = 0) = (1 - p)". F~rt~hermore, if X is a Poisson random variable wit11 parameter A, we have P ( X = 0) = epX.
(c) Suppose that X is instead known to ta.ke only int,eger values that are greater than or equal to a given integer k. How can we calculate P ( X = k ) using the transform associated with X ? Solution. (a) We have
As s
+
-00,
all the terms eks with k
> 0 tend to
0, so we obtain lim,--,
Ad(s) =
P(X = 0). (b) In the case of the binomial, we have from the transform tables
so that lim,,-,
M ( s ) = (1 - p)". In the case of the Poisson, we have
so that lim,,-,
M(s) = eCX.
(c) The random variable I f = X - h- -takes only nonnegative integer values and the associated transform is Ady(s) = e-'"(s) (cf. Example 4.4). Since P(Y = 0) = P ( X = %),we have from part (a),
Problem 7." Transforms associated w i t h uniform random variables. (a) Find the transform associated with an integer-valued random variable X that is uniformly distributed in the range { a , a 1 , . . . , h } .
+
(b) Find t,he transform associated with a conl,inuous random variable X that is uniformly distributed in the range [ a ,b]. Solution. (a) The PMF of X is
1
i f k = a , a + l , . . . , b, otherwise.
Problems The transform is
b-a
(b) We have
Problem 8." Find the third, fourth, and fifth moments of an exponential random variable with parameter A. Solution. The transform is
X A-s
M ( s ) = -. Thus,
By setting s = 0, we obtain
Problem 9." Suppose that the transform associated with a discrete random variable X has the form
where A(t) and B ( t ) are polynomials of the generic variable t. Assume that A(t) and B ( t ) have no common roots and that the degree of A(t) is smaller than the degree of B ( t ) . Assume also that B ( t ) has distinct, real, and nonzero roots that have absolute value greater than 1. Then it can be seen that M ( s ) can be written in the form
Further Topics on Random Variables
260
Chap. 4
where l / r l , .. . , llr, are the roots of B ( t ) and the a,, are constants that are equal to limes, (1 - ,rze5)A.l(s),i = 1,. . . , m,. 7',&
(a) Show that the PMF of X has the form
\ 0,
otherwise
Note: For large k , the P M F of X can be approximated by a,$, index corresponding to the largest Ir,\ (assuming ? is unique).
where
T is the
(b) Extend the result of pa.rt (a) t,o the case where M ( s ) = e " ~ ( t ) / B ( t ) and b is an integer. les Solution. (a) We have for all s such tha.t Ir.L
Therefore, m
M(S) =
a
+
(5
&=l
s==l
es
for k 0, and P ( X = k ) = 0 for k PMFs.
< 0. Note that this PMF is a mixture of geometric
(b) In this case, M ( s ) corresporrds to the t,ranslation by b of a random variable whose transform is A(t)/B(t) (cf. Example 4.4), so we have
\ 0, SECTION 4.2. tions
otherwise
Sums of Independent Random Variables - Convolu-
Problem 10. A soccer team has three designated players who take turns striking penalty shots. The ith player has probability of success p,: independently of the successes of the other players. Let X be the number of si~cce~sful penalty shots after each player has ha.d one turn. Use convolution to calculate the P M F of X. Confirm your
Problems
261
answer by first calculating the transform associated with X and then obtaining the PMF from the transform. Problem 11. The random variables X , Y, and Z are indepe~ide~lt and uniformly Y + 2. distribut,ed between zero and one. Find the PDF of X
+
Problem 12. Consider a PDF that is positive olrly within an interval [a,h] and is symmetric around the mean ( a $- b)/2. Let X and Y be independent random variables that both have this PDF. Suppose that. you have calculated the PDF and the transforin associated with X+Ii. How can you ea.sily obtain the PDF and the transform associated with X - Y ?
SECTION 4.3. More on Conditional Expectation and Variance Problem 13. Pat a.nd Nat are dating, arld all of their dat,es are scheduled to start at 9 p.m. 1qat always arrives promptly at 9 p.m. Pat is highly disorganized and arrives a.t a time that is uniformly distributed between 8 p.m. and 10 p.m. Let X be the time in hours between 8 p.m. and the time when Pa,t arrives. If Pat arrives before 9 p.m., their date will la.st exactly 3 hours. If Pat arrives after 9 p.m., their date will last for a time that is uxliforrnly distributed between 0 and 3 - X hours. The da.te starts at the time they meet. Nat gets irritated when Pat is late and will end the relationship after the second date on which Pat is late by more than 45 minutes. All dates are independent of any other dates. (a) W1.iat is the expected number of hours Nat waits for Pat t o arrive? (b) What is the expected duration of any particular date? (c) What is the expect,ed number of dates they will have before breaking up? Problem 14. A retired professor comes to i,he office at a time which is uniformly distributed between 9 a.m. and 1 p.m., performs a single task, and leaves when the task is completed. The duration of the task is exponcntially distributed with pararnel,er X(y) = 1/(5 - y), where y is the length of the time interval between 9 a.m. and tlie time of his arrival. (a) What is the expected amount of time that the professor devotes to tlie task? (b) What is the expected time at which the ta.sk is complet.ed?
(c) The professor has a Ph.D. student who on a given day comes to see him a.t, a time that is uniformly distributed between 9 a.m. and 5 p.m. If the student does not find the professor, he leaves and does not return. If he finds the professor, he spends an amount of t.ime tha.t is uniformly distributed between D and 1 hour. The professor will spend the same total amount of t,ime on his task regardless of whether he is interrupted by the student. Wha.t is the expected amount of time that the professor will spend with the student and what is the expected time at which he will leave his office? Problem 15. Consider a gambler who at each gamble either wins or loses his bet with probabilities p and 1 - p, independently of earlier gambles. When p > 1/2, a popular ga,mbling system, known as the Kelly strategy, is to always bet the fraction
Further Topics on Random Variables
262
Chap. 4
2p - 1 of the current fortune. Assuming p > 112) compute the expected fortune aft,er n gambles, starting with x units and elnploying the Kelly strategy. Problem 16. Let X and Y be two random variables that are uniformly distributed over the triangle formed by the points (0, O), (1, 0), and (0,2) (this is an asymmetric version of the PDF of Example 4.15). Calculate E[X] and E[Y] using the law of iterated expectations and a method similar to the one of Example 4.15. Problem 1'9." Let X and Y be independent random variables. Use the law of total variance to show that
Solution. Let
We have so that
Z=XY.
The law of total variance yields
E[Z I X ] = E [ X Y 1 XI = XEIY], var (E[Z / XI) = var (xE[Y]) = ( E [ Y ] ) ~ V ~ ~ ( X ) .
so that
Combining the preceding relations! we obtain
SECTION 4.4. Sum of a Random Number of Independent Random Variables Problem 18. .4t a certain time, the number of people that enter an elevator is a Poisson random variable with parameter A. The weight of each person is independent of every other person's weight, and is uniformly distributed between 100 and 200 lbs. Let X , be the fraction of 100 by which the ith person exceeds 100 lbs, e.g., if t,he 7t" person weighs 175 lbs., then X7 = 0.75. Let Y be the sum of the X,. (a) Find the transform associated with Y (b) Use the transform to compute the expected value of Y. (c) Verify your answer to part (b) by using the law of iterated expectations.
Problems
263
Problem 19. Construct an example to show that the sum of a random number of independent normal random variables is not normal (even though a fixed sum is). Problem 20. A motorist goes through 4 lights, each of which is found to be red with probability 112. The waiting times at each light are modeled as independent normal random variables with mean 1 minute and standard deviation 112 minute. Let X be the total waiting time at the red lights.
(a) Use the total probability theorem to find the P D F and the transform associated with X , and the probability that X exceeds 4 minutes. Is X normal? (b) Find the transform associated with X by viewing X as a sum of a random number of random variables. Problem 21." Use transforms to show that the sum of a Poisson-distributed number of independent, identically distributed Bernoulli random variables is Poisson.
SoIution. Let N be a Poisson-distributed random variable with parameter A. Let X , , i = I,.. . , N, be independent Bernoulli random variables with success probability p, and let
L=X1+...+X~ be the corresponding sum. The transform associated with L is found by starting with the transform associated with N, which is
and replacing each occurrence of eS by the transform associated with
We obtain
~
X,,which is
~= e X ((l - ~ + ~ ~e S - l ) 1= e X ~ ( e s-1)
This is the transform associated with a Poisson random variable with parameter Xp.
SECTION 4.5. Covariance and Correlation Problem 22. Consider four random variables, W , X , Y, 2,with
and assume that W , X , Y , Z are pairwise uncorrelated. Find the correlation coefficients p(A, B)and p(A, C ) ,where A = W + X, B = X Y, and C = Y Z.
+
+
Problem 23. Suppose that X is a standard normal random variable.
and E[X4]. (a) Calculate E[x3] (b) Define a new random variable Y such that
Further Topics on Random Variables
264
Chap. 4
Find the correlation coefficient p ( X , Y ) . Schwarz inequality. Show that i f X and Y are random variables,
Problem 24.* we have
Solution. W e may assume that E [ Y 2 ]# 0; otherwise, we have Y = 0 with probability 1 , and hence E [ X Y ]= 0 , so the inequality holds. mJe have
i.e., ( E [ x Y ] ) 5 ~ E [ x 2 ]E [ y 2 ] . Problem 25.'
Correlation coefficient. Consider the correlation coefficient
o f two random variables X and Y that have positive variances Show that: ( a ) p ( X , Y ) /5 1. Hint:Use the Schwarz inequality from the preceding problem. ( b ) I f Y - E [ Y ]is a positive (or negative) multiple o f X - E [ X ] ,then p ( X , Y ) = 1 [or p ( X , Y ) = -1, respectively].
( c ) I f p ( X , Y ) = 1 [or p ( X , Y ) = I ] , then, with probability 1 , Y - E [ Y ]is a positive (or negative, respectively) multiple o f X - EIX1. Solution we get
( a ) Let
2 =X
and hence lp(X, Y)I (b)I f
? = a*, then
5 1.
- E [ X ] and
?=Y
- E [ Y ] Using the Schwarz inequality,
(c) If ( p ( X ,1')) = 1 , then
Thus, with probability 1, the random variable
is equal to zero. It follows that, with probability I,
i.e., the sign of the constant ratio of -'.; and J? is determined by the sign of p ( X . Y)
SECTION 4.6. Least Squares Estimation Problem 26. 4 police radar always overestimates the speed of incoming cars by an amount that is uniformly distributed between 0 and 5 miles/hour. Assume that car speeds are uniformly distributed from 55 to 75 miles/hour. What is the least squares estimate of the car speed based on the radar's measurement? Problem 27. Consider the random variables X and Y of Example 4.27. Find the linear least squares estimator of X given Y . Problem 28." Linear least squares estimate based on several measurements. Let X be a random variable with mean p and variance v, and let Yl, . . . , Y, be measurements of the form Y, = X TVt,
+
where t,he W , are random variables with mean 0 and variance v,:which represent measurement errors. We assume that the random variables X, W l : .. . firn are independent. Show that the linear least squares estimator of X based on Y ] ,. . . , Y, is
Rirther Topics on Random Varia.bles
266
Chap. 4
Note. If either v is very large, or if t,he number of rneasurements is very large so that l / v is negligible relative to ( l l v , ) , this estimator can be well approximated by
xy=,
which does not require knowledge of p and v. If in addition all the v, are equal (all the measurements are of equal "quality"), the estimator can be well approximated by Y,/n,the sarnple mean of the measurements.
xr='=,
Solulion. TVe want to show that the choices of a ] ,. . . ,a,, b that minimize the function
To this end, it is sufficient to show that the partial derivatives of f , with respect to o l , . . . ,a,,6 , are all equal to O when evaluated at a ; , . . . ,a:, b*. By differentiating f , we have
From the expressions for b* a.nd a:, we see that
Using this equat.ion and the facts
E [ X ]= p , it follows tha.t
E [ W , ]= 0 ,
Problems Using, in addition, the equations E[y,(p,-x)] = E [ ( x - ~ + ~ ~ + P ) ( P - x ) ] =-',
EIY,~,= ] E[(X
+ wi)wt] =
vi?
E~Y,w,] = E [(x+ w J ) w t ] = 0,
for all i , for all i and j with i # j l
we obtain
a; ,bf
= -0-
b* P
* + aivi
= 0,
where the last equality holds in view of the definitions of a: and b*.
problem 29.* Let
(a) Let
x and Y be two random variables with positive variances.
xLbe the linear least squares estimator of X based on Y E/(X
Show that
- XL)Y] = 0
that the estimation error X - X L is uncorrelated with Y.
d, (b) Let
x = E[X 1 Y] be the least squares estimator of X E[(X
given Y Show that
- x ) ~ ( Y ) ] = 0,
for any function h. (,) Is it trne that the estimation error X - E [ X 1 Y] is independent
Y1
Solution. (a) We have
xL= E[X] + p ( x , Y:)
(y - ~ [ y ]=)E \ x ] +
cov(X, Y)
ff;(Y - EIE':)
so cov(X, Y)
E[(X - x L ) Y ] = E
+
= E[XY] - E[X] E[Y] -
= cov(X, Y)
(
1-
E ~ Y ~ ]( ~ 1 ~ 1 ) ' )
+ ; ff
= cov(X,Y) (1 = 0.
-
$)
;
ff
3
Further Topics on Random Variables
268
Chap. 4
To show that cov(X - XL, Y) = 0, we write
and X
-
E [ X ] - c o v ( ~ ' Y )-( E[Y])] ~ *Y
as desired. (b) We have E [(x - X) h(Y)] = E [(x- E [ X / Y]) h ( y ) ]
(c) The answer is no. For a counterexample, let X and Y be discrete random variables with the joint PMF
=
{ 0,114,
for (x, Y) = ( L O ) ,(0, I ) , (-1, otherwise.
o), (0, -1)
Note that E [ X / Y = y] = 0 for all possible values y, so E [ X 1 Y] = 0. Thus, mre have X - E [ X / Y] = X . Since X and Y are not independent, X - E [ X / Y] and Y are also not independent.
SECTION 4.7. The Bivariate Normal Distribution Problem 30. Let X I and X2 be independent standard normal random variables. Define the random variables Yl and Y2 by
Find E[Yl], E[Yz], cov(Y1, Yz), and the joint PDF fyl,y2 Problem 31. The random variables X and Y are described by a joint P D F of the form -8z2-6ry-18y2 ~ X ,(XI Y Y) = ce Find the means, variances, and the correlation coefficient of X and Y. Also, find the value of the constant c. Problem 32. Suppose that X and Y are independent normal random variables with the same variance. Show that X - Y and X Y are independent.
+
Problems
269
Problem 33. The coordinates X and Y of a point are independent zero-mean normal random variables with common variance 02.Given that the point is at a distance of at least c from the origin, find the conditional joint PDF of X and Y. Suppose that X and Y are jointly normal random variables. Show
Problem 34." that
Hint: Consider the random variables X - E[X] and Y established in the text for the zero-mean case.
- E[Y] and use the result
= X - E[X] and ? = Y - E[Y]. The random variables % and p Solution. Let are joir~tlynormal. This is because if X and Y are linear functions of two independent normal random variables U and V, then 2 and ? are also linear functions of U and V. Therefore, as established in the text,
Note that conditioning on
Since X and Finally,
? is the same as conditioning on Y. Therefore,
2 only differ by a
constant, we have oj: = ox and, similarly, oy = otJ.
COV(%, p ) = E[%F]= E[(X - E[x]) (Y - E[Y])] = cov(X, Y ) , from which it follows that p ( 2 , p ) = p ( X , Y). The desired formula follows by substituting the above relations in the formula for EIX 1 p]. Problem 35." (a) Let X i , X p , . . . ,X n be independent identically distributed random variables and let Y = X i X z + . . . -t- Xn. Show that
+
Y
E [ X l j Y]= -.
n
(b) Let X and W be independent zero-mean normal random variables, with positive integer variances k and m, respectively. Use the result of part (a) to find E [ X 1 X W], and verify that this agrees with the conditional expectation formula for jointly normal random variables given in the text. Hint: Think of X and W as sums of independent random variables.
+
Solution. (a) By symmetry, we see that E[X, / Y] is the same for all i . Furthermore,
Therefore, E[XI / Y] = Y/n. (b) We can think of X and W as sums of independent standard normal random variables: X=Xl+...+Xk, W=IV1+...+Wm .
Further Topics on Random Variables
270
We identify Y with X
+ W and use the result from part (a), to obtain
Thus,
E [ X l X + W ] = E [ X 1 + . . . + X k ( X + W ] =-( x + W ) . k+m
This formula agrees with the formula derived in the text because
We have used here the property
Chap. 4
5
Stochastic Processes
p.273 p .. 285 p .. 299 p.301
271
272
Stochastic Processes
Chap. 5
A stochastic process is a mathematical model of a probabilistic experiment that evolves in time and generates a sequence of numerical values. For example, a stochastic process can be used to model: (a) the sequence of daily prices of a stock; (b) the sequence of scores in a football game; (c) the sequence of failure times of a machine; (d) the sequence of hourly traffic loads at a node of a communication network; (e) the sequence of radar measurements of the position of an airplane. Each numerical value in the sequence is modeled by a random variable, so a stochastic process is simply a (finite or infinite) sequence of random variables and does not represent a major conceptual departure from our basic framework. We are still dealing with a single basic experiment that involves outcomes governed by a probability law, and random variables that inherit their probabilistic properties from that law.t However, stochastic processes involve some change in emphasis over our earlier models. In particular: (a) We tend to focus on the dependencies in the sequence of values generated by the process. For example, how do future prices of a stock depend on past values? (b) We are often interested in long-term averages, involving the entire sequence of generated values. For example, what is the fraction of time that a machine is idle? (c) We sometimes wish to characterize the likelihood or frequency of certain boundary events. For example, what is the probability that within a given hour all circuits of some telephone system become simultaneously busy, or what is the frequency with which some buffer in a computer network overflows with data? There is a wide variety of stochastic processes, but in this book, we will only discuss two major categories. (i) Arrival-Type Processes: Here, we are interested in occurrences that have the character of an "arrival," such as message receptions at a receiver, job completions in a manufacturing cell, customer purchases at a store, etc. We will focus on models in which the interarrival times (the times between successive arrivals) are independent random variables. In Section 5.1, we consider the case where arrivals occur in discrete time and the interarrival
t Let us emphasize that all of the random variables arising in a stochastic process refer to a single and common experiment, and are therefore defined on a common sample space. The corresponding probability law can be specified explicitly or implicitly (in terms of its properties), provided that it determines unambiguously the joint CDF of any subset of the random variables involved.
Sec. 5.1
The Bernoulli Process
273
times are geometrically distributed - this is the Bernoulli process. In Section 5.2, we consider the case where arrivals occur in continuous time and the interarrival times are exponentially distributed - this is the Poisson process. (ii) Markov Processes: Here, we are looking at experiments that evolve in time and in which the future evolution exhibits a probabilistic dependence on the past. As an example, the future daily prices of a stock are typically dependent on past prices. However, in a Markov process, we assume a very special type of dependence: the next value depends on past values only through the current value. There is a rich methodology that applies to such processes, and is the subject of Chapter 6.
5.1 THE BERNOULLI PROCESS The Bernoulli process can be visualized as a sequence of independent coin tosses, where the probability of heads in each toss is a fixed number p in the range o < p < 1. In general, the Bernoulli process consists of a sequence of Bernoulli trials, where each trial produces a 1 (a success) with probability p, and a 0 (a failure) with probability 1 - p, independently of what happens in other trials. Of course, coin tossing is just a paradigm for a broad range of contexts involving a sequence of independent binary outcomes. For example, a Bernoulli process is often used to model systems involving arrivals of customers or jobs at service centers. Here, time is discretized into periods, and a "success" at the kth trial is associated with the arrival of at least one customer at the service center during the kth period. We will often use the term "arrival" in place of "success" when this is justified by the context. In a more formal description, we define the Bernoulli process as a sequence Xl, X 2 , ... of independent Bernoulli random variables Xi with
== 1) == P(success at the ith trial) == p, P(Xi == 0) == P(failure at the ith trial) == 1 - p,
P(Xi
for each i.t Given an arrival process, one is often interested in random variables such as the number of arrivals within a certain time period, or the time until the first
t Generalizing from the case of a finite number of random variables, the independence of an infinite sequence of random variables Xi is defined by the requirement that the random variables Xl, ... , X n be independent for any finite n. Intuitively, knowing the values of any finite subset of the random variables does not provide any new probabilistic information on the remaining random variables, and the conditional distribution of the latter is the same as the unconditional one.
Stochastic Processes
274
Chap. 5
arrival. For the case of a Bernoulli process, some answers are already available from earlier chapters. Here "is a summary of the main facts.
Some Random Variables Associated with the Bernoulli Process and their Properties • The binomial with parameters p and n. This is the numberS of successes in n independent trials. Its PMF, mean, and variance are k == 0, 1, ... , n, E[S]
== np,
var(S)
== np(l
- p).
This is the number T of trials Its PMF, mean, and "tT~'t~1~1''lr>O
The geometric up to (and ,L,L,L'-",L~""'''',L,L,Lj~1 are
t == 1,2, ... , E[T]==-, p
p -y.
Independence and Memorylessness The independence assumption underlying the Bernoulli process has important implications, including a memorylessness property (whatever has happened in past trials provides no information on the outcomes of future trials). An appreciation and intuitive understanding of such properties is very useful, and allows the quick solution of many problems that would be difficult with a more formal approach. In this subsection, we aim at developing the necessary intuition. Let us start by considering random variables that are defined in terms of what happened in a certain set of trials. For example, the random variable Z == (Xl + XS)X6X7 is defined in terms of the first, third, sixth, and seventh trial. If we have two random variables of this type and if the two sets of trials that define them have no common element, then these random variables are independent. This is a generalization of a fact first seen in Chapter 2: if two random variables U and V are independent, then any two functions of them, g(U) and h(V), are also independent. Example 5.1.
(a) Let U be the number of successes in trials 1 to 5. Let V be the number of successes in trials 6 to 10. Then, U and V are independent. This is because
Sec. 5.1
The Bernoulli Process
275
U = Xl + ... +X5 , V = X6 + ... +XIO , and the two collections {Xl, ... , X 5 }, {X6, ... ,X IO } have no common elements. (b) Let U (respectively, V) be the first odd (respectively, even) time i in which we have a success. Then, U is determined by the odd-time sequence X I, X 3 , ... , whereas V is determined by the even-time sequence X2, X4, . ... Since these two sequences have no comlnon elements, U and V are independent.
Suppose now that a Bernoulli process has been running for n time steps, and that we have observed the values of Xl, X2, . .. ,Xn . We notice that the sequence of future trials X n + l , X n +2 , ... are independent Bernoulli trials and therefore form a Bernoulli process. In addition, these future trials are independent from the past ones. We conclude that starting from any given point in time, the future is also modeled by a Bernoulli process, which is independent of the past. We refer to this as the fresh-start property of the Bernoulli process. Let us now recall that the time T until the first success is' a geometric random variable. Suppose that we have been watching the process for n time steps and no success has been recorded. What can we say about the number T-n of remaining trials until the first success? Since the future of the process (after time n) is independent of the past and constitutes a fresh-starting Bernoulli process, the number of future trials until the first success is described by the same geometric PMF. Mathematically, we have
P(T - n = tiT> n)
~
(1 - p)t-I p = P(T = t),
t = 1,2, ....
This memorylessness property can also be derived algebraically, using the definition of conditional probabilities, but the argument given here is certainly more intuitive.
Memorylessnessand the Fresh-Start Property of the Bernoulli Process For any given time n, the sequence of random variables Xn+l, X n+2, ... (the future of the process) is also a Bernoulli process, and is independent from Xl, ... ,Xn (the past of the process). Let n be a given time and let T be the time of the first success after time n. Then, T - n has a geometric distribution with parameter p, and is independent of the random variables Xl, ... , X n .
According to the fresh-start property, if we start watching a Bernoulli process at a certain time n, what we see is indistinguishable from a Bernoulli process that has just started. It turns out that the same is true if we start watching the process at some random time N, as long as N is determined only by the past history of the process and does not convey any information on the future. As an example, consider a roulette wheel with each occurrence of red viewed
Stochastic Processes
276
Chap. 5
as a success. The sequence generated starting at some fixed spin (say, the 25th spin) is probabilistically indistinguishable from the sequence generated starting immediately after red occurs in five consecutive spins. In either case, the process starts fresh (although one can certainly find galnblers with alternative theories). The example that follows makes a similar argument, but more mathematically.
Example 5.2. Fresh-Start at a Random Time. Let N be the first time in which we have a success immediately following a previous success. (That is, N is the first i for which X i - 1 = Xi = 1.) What is the probability P(XN+l = XN+2 = 0) that there are no successes in the two trials that follow? Intuitively, once the condition XN-l = X N = 1 is satisfied, and from then on, the future of the process consists of independent Bernoulli trials. Therefore, the probability of an event that refers to the future of the process is the, same as in a fresh-starting Bernoulli process, so that P(XN+l = XN+2 = 0) = (1 _ p)2. To provide a rigorous justification of the above argument, we note that the time N is a random variable, and by conditioning on the possible values of N, we have 00
P(XN+l = XN+2 = 0) = LP(N = n)P(XN+1 = XN+2 = 0 I N = n) n=l 00
= L P(N = n)P(Xn +1
= X n +2 = 0 I N = n).
n=l
Because of the way that N was defined, the event {N = n} occurs if and only if the values of Xl, ... ,Xn satisfy a certain condition. But these random variables are independent of X n + 1 and X n + 2 . Therefore,
P(Xn +1 = X n +2 = 0 IN = n) = P(Xn +1 = X n +2 = 0) = (1 _ p)2, which leads to 00
P(XN+l = XN+2 = 0) = LP(N = n)(l - p)2 = (1 _ p)2. n=l
The next example illustrates the use of the fresh-start property to derive the distribution of certain random variables.
Example 5.3. A computer executes two types of tasks, priority and nonpriority, and operates in discrete time units (slots). A priority task arrives with probability p at the beginning of each slot, independently of other slots, and requires one full slot to complete. A nonpriority task is always available and is executed at a given slot if no priority task is available. In this context, it may be important to know the probabilistic properties of the time intervals available for nonpriority tasks. With this in mind, let us call a slot busy if within this slot, the computer executes a priority task, and otherwise let us call it. idle. We call a string of idle
The Bernoulli Process
Sec. 5.1
277
(or busy) slots, flanked by busy (or idle, respectively) slots, an idle period (or busy period, respectively). Let us derive the PMF, mean, and variance of the following random variables (cf. Fig. 5.1): (a) T
= the time index of the first idle slot;
(b) B = the length (number of slots) of the first busy period; (c) I = the length of the first idle period. (d) Z = the number of slots after the first slot of the first busy period up to and including the first subsequent idle slot.
,. ..
B IlII
IlII
T IlII
.. .. Z
.. IlII
..
Busy period
IdIII
.
time
Idle period
time
Figure 5.1: Illustration of random variables, and busy and idle periods in Example 5.3. In the top diagram, T = 4, B = 3, I = 2, and Z = 3. In the bottom diagram, T = 1, I = 5, B = 4, and Z = 4.
We recognize T as a geometrically distributed random variable with parameter 1 - p. Its PMF is k = 1,2, .... Its mean and variance are E[TJ=-11 ,
-p
var(T) = (1
p p)2'
Let us now consider the first busy period. It starts with the first busy slot, call it slot L. (In the top diagram in Fig. 5.1, L = 1; in the bottom diagram, L = 6.) The number Z of subsequent slots until (and including) the first subsequent idle slot has the same distribution as T, because the Bernoulli process starts fresh at time L + 1. We then notice that Z = B and conclude that B has the same PMF as T.
Chap. 5
Stochastic Processes
278
If we reverse the roles of idle and busy slots, and interchange p with 1 - p, we see that the length I of the first idle period has the same PMF as the time index of the first busy slot, so that k = 1,2, ... ,
1
E[I] =-, p
var(I) =
1-p
-2-'
P
We finally note that the argument given here also works for the second, third, etc. busy (or idle) period. Thus, the PMFs calculated above apply to the ith busy and idle period, for any i.
Interarrival Times An important random variable associated with the Bernoulli process is the time of the kth success (or arrival), which we denote by Yk. A related random'variable is the kth interarrival time, denoted by Tk. It is defined by k
== 2,3, ... ,
and represents the nUlnber of trials following the (k - l)st success until the next success. See Fig. 5.2 for an illustration, and also note that
Y k == T 1
+ T2 + ... + Tk.
time
Figure 5.2: Illustration of interarrival times, where a 1 represents an arrival. In this example, Tl == 3, T2 == 5, Ts == 2, T4 == 1. Furthermore, Yl == 3, Y2 == 8, Ys == 10, Y4 == 11.
We have already seen that the time T 1 until the first success is a geometric random variable with parameter p. Having had a success at time T 1 , the future is a fresh-starting Bernoulli process. Thus, the number of trials T2 until the next success has the same geometric PMF. Furthermore, past trials (up to and including time Tl) are independent of future trials (from time Tl + 1 onward). Since T2 is determined exclusively by what happens in these future trials, we see that T2 is independent of T 1 . Continuing similarly, we conclude that the random variables Tl, T2, T3, ... are independent and all have the same geometric distribution. This important observation leads to an alternative, but equivalent way of describing the Bernoulli process, which is sornetimes more convenient to work with.
Sec. 5.1
The Bernoulli Process
279
Alternative Description of the Bernoulli Process 1. Start with a sequence of independent geometric random variables T 1 , T2, ... , with common parameter p, and let these stand for the interarrival times. 2. Record a success (or arrival) at times Tl, Tl
+ T2,
+ T2 + T3,
Tl
etc.
Example 5.4. It has been observed that after a rainy day, the number of days until it rains again is geometrically distributed with parameter p, independently of the past. Find the probability that it rains on both the 5th and the 8th day of the month. If we attempt to approach this problem by manipulating the geometric PMFs in the problem statement, the solution is quite tedious. However, if we view rainy days as "arrivals," we notice that the description of the weather conforms to the alternative description of the Bernoulli process given above. Therefore, any given day is rainy with probability p, independent of other days. In particular, the probability that days 5 and 8 are rainy is equal to p2.
The kth Arrival Time The time Yk of the kth success (or arrival) is equal to the sum Yk Tl + T2 + ... + Tk of k independent identically distributed geometric random variables. This allows us to derive formulas for the mean, variance, and PMF of Yk, which are given in the table that follows.
Properties of the kth Arrival Time • The kth arrival time is equal to the sum of the first k interarrival times
and the latter are independent geometric random variables with common parameter p. The mean and variance of Yk are given by
var(Yk)
= var(Tl) + ... + var(Tk) =
k(l - p) 2
P
.
Stochastic Processes
280
Chap. 5
• The PMF of Yk is given by PYk(t)
=
-1)
t ( k _I Pk (l- p)t-k,
t
= k, k + 1, ... ,
and is known as the Pascal PMF of order k. To verify the formula for the PMF of Yk, we first note that Yk cannot be smaller than k. For t 2:: k, we observe that the event {Yk = t} (the kth success comes at time t) will occur if and only if both of the following two events A and B occur: (a) event A: trial t is a success; (b) event B: exactly k - 1 successes occur in the first t - 1 trials. The probabilities of these two events are
P(A) = P, and P(B) =
-1)
t ( k-l pk-l(l - p)t-k,
respectively. In addition, these two events are independent (whether trial t is a success or not is independent of what happened in the first t-l trials). Therefore, PYk (t)
-1)
= P(Yk = t) = P(A n B) = P(A)P(B) = ( kt _
1 pk(l - p)t-k,
as claimed.
Example 5.5. In each minute of basketball play, Alicia commits a single foul with probability p and no foul with probability 1 - p. The number of fouls in different minutes are assumed to be independent. Alicia will foul out of the game once she commits her sixth foul, and will play 30 minutes if she does not foul out. What is the PMF of Alicia's playing time? We model fouls as a Bernoulli process with parameter p. Alicia's playing time Z is equal to Y6, the time until the sixth foul, except if Y6 is larger than 30, in which case, her playing time is 30; that is, Z == min{Y6 , 30}. The random variable Y6 has a Pascal PMF of order 6, which is given by PY6
(t) == (t -5 1) (1 _)t-6 P , P
6
t == 6,7, ...
To determine the PMF pz(z) of Z, we first consider the case where z is between 6 and 29. For z in this range, we have
pz(z)
= P(Z = z) = P(Y6 = z) =
(z
~ 1)p6(1_ p)Z-6,
z == 6, 7, ... ,29.
The Bernoulli Process
Sec. 5.1
281
The probability that Z = 30 is then determined from 29
pz(30) = 1 -
LPZ(Z). z=6
Splitting and Merging of Bernoulli Processes Starting with a Bernoulli process in which there is a probability p of an arrival at each time, consider splitting it as follows. Whenever there is an arrival, we choose to either keep it (with probability q), or to discard it (with probability 1- q); see Fig. 5.3. Assume that the decisions to keep or discard are independent for different arrivals. If we focus on the process of arrivals that are kept, we see that it is a Bernoulli process: in each time slot, there is a probability pq of a kept arrival, independently of what happens in other slots. For the s~me reason, the process of discarded arrivals is also a Bernoulli process, with a probability of a discarded arrival at each time slot equal to p(l - q).
time q ~
O_r_i_gi_n_al process
_--L.--lI---Il--&.---L-..L----IIL---III----JlL.-.....I....-Ii--L-JIl--..Il--..IIl--_--IliP
tilne
l-q
time
Figure 5.3: Splitting of a Bernoulli process.
In a reverse situation, we start with two independent Bernoulli processes (with parameters p and q, respectively) and merge them into a single process, as follows. An arrival is recorded in the merged process if and only if there is an arrival in at least one of the two original processes, which happens with probability p + q - pq [one minus the probability (1 - p)(l - q) of no arrival in either process]. Since different time slots in either of the original processes are independent, different slots in the merged process are also independent. Thus, the merged process is Bernoulli, with success probability p + q - pq at each time step; see Fig. 5.4. Splitting and merging of Bernoulli (or other) arrival processes arises in many contexts. For example, a two-machine work center may see a stream of arriving parts to be processed and split them by sending each part to a randomly
Chap. 5
Stochastic Processes
282
Bernoulli (P) _&..-.1-10.. ; ; . .111.--·. .L.1......1 1 --L1--.L-,;;10;;.. a. 1O~"T(AT)k/k!, with T = 1, and k = 0 or k = 1:
P(O, 1)
= e- O. 2 = 0.819,
P(l, 1) = 0.2 . e- O. 2 = 0.164.
Suppose that you have not checked your email for a whole day. What is the probability of finding no new messages? We use again the Poisson PMF and obtain P(0,24) = e- O. 2 . 24 = 0.0083.
Alternatively, we can argue that the event of no messages in a 24-hour period is the intersection of the events of no messages during each of 24 hours. The latter events are independent and the probability of each is P(O, 1) = e- O.2 , so P(0,24)
= (P(O, 1)) 24 = (e- O.2 ) 24 = 0.0083,
which is consistent with the preceding calculation method.
Example 5.9. The Sum of Independent Poisson Random Variables is Poisson. Arrivals of customers at the local supermarket are modeled by a Poisson
Stochastic Processes
290
Chap. 5
process with a rate of A = 10 customers per minute. Let IV! be the number of customers arriving between 9:00 and 9:10. Also, let N be the number of customers arriving between 9:30 and 9:35. What is the distribution of M + N? We notice that M is Poisson with parameter J-l = 10·10 = 100 and N is Poisson with parameter v = 10·5 = 50. Furthermore, M and N are independent. As shown in Section 4.1, using transforms, M +N is Poisson with parameter J-l+v 150. We will now proceed to derive the same result in a more direct and intuitive manner. Let N be the number of customers that arrive between 9:10 and 9:15. Note that N has the same distribution as N (Poisson with parameter 50). Furthermore, N is also independent of M. Thus, the distribution of ]v[ + N is the same as the distribution of M + N. But M + N is the number of arrivals during an interval of length 15, and has therefore a Poisson distribution with parameter 10 . 15 = 150. This example makes a point that is valid in general. The probability of k arrivals during a set of times of total length T is always given by P(k, T), even if that set is not an interval. (In this example, we dealt with the set [9:00,9:10]U [9:30,9:35], of total length 15.)
Independence and Memorylessness The Poisson process has several properties that parallel those of the Bernoulli process, including the independence of nonoverlapping time sets, a fresh-start property, and the memorylessness of the interarrival time distribution. Given that the Poisson process can be viewed as a limiting case of a Bernoulli process, the fact that it inherits the qualitative properties of the latter should be hardly surprising.
Memorylessness and the :Fresh-Start Property of the Poisson Process For any given time t > 0, the history of the process after time t is also a Poisson process, and is independent from the history of the process until time t. Let t be a given time and let T be the time of the first arrival after time t. Then, T - t has an exponential distribution with parameter A, and is independent of the history of the process until time t.
The fresh-start property is established by observing that the portion of the process that starts at time t satisfies the properties required by the definition of the Poisson process. The independence of the future from the past is a direct consequence of the independence assumption in the definition of the Poisson process. Finally, the fact that T - t has the same exponential distribution can be verified by noting that
P(T - t > s) = p(O arrivals during [t, t + s]) = P(O, s) = e- AS • The following is an example of reasoning based on the memoryless property.
Sec. 5.2
The Poisson Process
291
Example 5.10. You and your partner go to a tennis court, and have to wait until the players occupying the court finish playing. Assume (somewhat unrealistically) that their playing time has an exponential PDF. Then the PDF of your waiting time (equivalently, their remaining playing time) also has the same exponential PDF, regardless of when they started playing.
Example 5.11. When you enter the bank, you find that all three tellers are busy serving other customers, and there are no other customers in queue. Assume that the service times for you and for each of the customers being served are independent identically distributed exponential random variables. What is the probability that you will be the last to leave? The answer is 1/3. To see this, focus at the moment when you start service with one of the tellers. Then, the remaining time of each of the other two customers being served, as well as your own remaining time, have the same PDF. Therefore, you and the other two customers have equal probability 1/3 of being the last to leave.
Interarrival Times An important random variable associated with a Poisson process that starts at time 0, is the time of the kth arrival, which we denote by Yk. A related random variable is the kth interarrival time, denoted by Tk. It is defined by k
==
2,3, ...
and represents the amount of time between the (k - l)st and the kth arrival. Note that
We have already seen that the time T I until the first arrival is an exponential random variable with parameter A. Starting from the time TI of the first arrival, the future is a fresh-starting Poisson process. t Thus, the time until the next arrival has the same exponential PDF. Furthermore, the past of the process (up to time TI) is independent of the future (after time T I ). Since T2 is determined exclusively by what happens in the future, we see that T 2 is independent of T I . Continuing similarly, we conclude that the random variables T I , T2, T3, ... are independent and all have the same exponential distribution.
t
We are stating here that the Poisson process starts afresh at the random time
T 1 . This statement is a bit stronger than what was claimed earlier (fresh-start at deterministic times), but is quite intuitive, and can be formally justified using an argument analogous to the one used in Example 5.2, by conditioning on all possible values of the random variable T 1 .
Stochastic Processes
292
Chap. 5
This important observation leads to an alternative, but equivalent, way of describing the Poisson process. t
Alternative Description of the Poisson Process 1. Start with a sequence of independent exponential random variables Tl, T2, ... , with common parameter A, and let these represent the interarrival times. 2. Record an arrival at times Tl, Tl
+ T2, T 1 + T2 + T3,
etc.
The kth Arrival Time The time Yk of the kth arrival is equal to the sum Yk == Tl + T2 + ... + Tk of k independent identically distributed exponential random variables. This allows us to derive formulas for the mean, variance, and PDF of Yk, which are given in the table that follows.
Properties of the kth Arrival Time The kth arrival time is equal to the sum of the first k interarrival times
and the latter are independent exponential random variables with common parameter A. The mean and variance of Yk are given by
• The PDF of Yk is given by
t In our original definition, a process was called Poisson if it possessed certain properties. The reader may have noticed that we did not establish that a process with the required properties exists. In an alternative line of development, we can use the constructive description given here: we start with a sequence of independent, exponentially distributed interarrival times, from which the arrival times are completely determined. With this definition, it is then possible to establish that the process satisfies all of the properties that were postulated in our original Poisson process definition.
The Poisson Process
Sec. 5.2
/Yk(Y)
293
=
'Akyk-le-AY (k - 1)! '
y 2: 0,
and is known as the Erlang PDF of order k.
To evaluate the PDF fYk of Yk, we can argue that for a small 8, the product 8· fYk (y) is the probability that the kth arrival occurs between times y and y + 8. t When 8 is very small, the probability of more than one arrival during the interval [y, y + 8] is negligible. Thus, the kth arrival occurs between y and y + 8 if and only if the following two events A and B occur: (a) event A: there is an arrival during the interval [y, y
+ 8];
(b) event B: there are exactly k - 1 arrivals before time y. The probabilities of these two events are
P(A)
~
'A8,
and
P(B) = P(k - 1,y) =
'Ak-1yk-le-AY (k 1)! .
Since A and B are independent, we have
8/yk (y)
;::;!
P(y ::; Yk ::; y + 8)
;::;!
P(A n B)
= P(A)P(B)
;::;!
,\8
'Ak-1yk-le-AY (k _ 1)! '
from which we obtain y
2: O.
Example 5.12. You call the IRS hotline and you are told that you are the 56th person in line, excluding the person currently being served. Callers depart
t For an alternative derivation that does not rely on approximation arguments, note that for a given y ~ 0, the event {Yk :S y} is the same as the event {number of arrivals in the interval [0, y] is at least
k}.
Thus the CDF of Yk is given by 00
k-l
FYk(y)=p(Yk:'S:y)=L:P(n,y)=l-L:P(n,y)=l n=k
n=O
~ (Ayte- AY
L..t
n!
.
n=O
The PDF of Y k can be obtained by differentiating the above expression, which by straightforward calculation yields the Erlang PDF formula
Stochastic Processes
294
Chap. 5
according to a Poisson process with a rate of A = 2 per minute. How long will you have to wait on the average until your service starts, and what is the probability you will have to wait for more than an hour? By the memoryless property, the remaining service time of the person currently being served is exponentially distributed with parameter 2. The service times of the 55 persons ahead of you are also exponential with the same parameter, and all of these random variables are independent. Thus, your waiting time in minutes, call it Y, is Erlang of order 56, and
~= 28.
E[Y]
The probability that you have to wait for more than an hour is given by the formula P(Y ~ 60) =
l
CX>
60
A56
55 ->.y
Y
~
55.
dy.
Computing this probability is quite tedious. In Chapter 7, we will discuss a much easier way to compute approximately this probability. This is done using the central limit theorem, which allows us to approximate the CDF of the sum of a large number of independent random variables with a normal CDF, and then to calculate various probabilities of interest by using the normal tables.
Splitting and Merging of Poisson Processes Similar to the case of a Bernoulli process, we can start with a Poisson process with rate A and split it, as follows: each arrival is kept with probability p and discarded with probability I-p, independently of what happens to other arrivals. In the Bernoulli case, we saw that the result of the splitting was also a Bernoulli process. In the present context, the result of the splitting turns out to be a Poisson process with rate Ap. Alternatively, we can start with two independent Poisson processes, with rates )\1 and A2, and merge them by recording an arrival whenever an arrival occurs in either process. It turns out that the merged process is also Poisson with rate Al + A2. Furthermore, any particular arrival of the merged process has probability Al/(Al + A2) of originating from the first process, and probability A2/(Al + A2) of originating from the second, independently of all other arrivals and their origins. We discuss these properties in the context of some examples, and at the same time provide the arguments that establish their validity.
Example 5.13. Splitting of Poisson Processes. A packet that arrives at a node of a data network is either a local packet that is destined for that node (this happens with probability p), or else it is a transit packet that must be relayed to another node (this happens "\\rith probability 1 - p). Packets arrive according to a Poisson process with rate A, and each one is a local or transit packet independently
Sec. 5.2
The Poisson Process
295
of other packets and of the arrival times. As stated above, the process of local packet arrivals is Poisson with rate Ap. Let us see why. We verify that the process of local packet arrivals satisfies the defining properties of a Poisson process. Since )... and p are constant (do not change with time), the first property (time homogeneity) clearly holds. Furtherlnore, there is no dependence between what happens in disjoint time intervals, verifying the second property. Finally, if we focus on a small interval of length 6, the probability of a local arrival is approximately the probability that there is a packet arrival, and that this turns out to be a local one, i.e., A6·p. In addition, the probability of two or more local arrivals is negligible in comparison to 6, and this verifies the third property. We conclude that local packet arrivals form a Poisson process and, in particular, the number of such arrivals during an interval of length T has a Poisson PMF with parameter pAT. By a symmetrical argument, the process of transit packet arrivals is also Poisson, with rate A(1 - p). A somewhat surprising fact in this context is that the two Poisson processes obtained by splitting an original Poisson process are independent, see the end-of-chapter problems.
Example 5.14. Merging of Poisson Processes. People with letters to mail arrive at the post office according to a Poisson process with rate AI, while people with packages to mail arrive according to an independent Poisson process with rate A2. As stated earlier the merged process, which includes arrivals of both types, is Poisson with rate Al + A2. Let us see why. First, it should be clear that the merged process satisfies the time-homogeneity property. Furthermore, since different intervals in each of the two arrival processes are independent, the same property holds for the merged process. Let us now focus on a small interval of length 6. Ignoring terms that are negligible compared to 6, we have P(O arrivals in the merged process) ~ (1
A16)(1
A26)
P(1 arrival in the merged process) ~ A16(1 - A26)
+ (1 -
~
1 - (AI
A16)A26
+ A2)6, ~
(AI
+ A2)6,
and the third property has been verified. Given that an arrival has just been recorded, what is the probability that it is an arrival of a person with a letter to mail? We focus again on a small interval of length 6 around the current time, and we seek the probability P(1 arrival of person with a letter 11 arrival).
Using the definition of conditional probabilities, and ignoring the negligible probability of more than one arrival, this is P(1 arrival of person with a letter) P(1 arrival)
--------------
~
A16 + A2)6
----
(AI
+
Generalizing this calculation, we let L k be the event that the kth arrival corresponds to an arrival of a person with a letter to mail, and we have
Stochastic Processes
296
Chap. 5
Furthermore, since distinct arrivals happen at different times, and since, for Poisson processes, events at different times are independent, it follows that the random variables £1, £2, ... are independent.
Example 5.15. Competing Exponentials. Two light bulbs have independent and exponentially distributed lifetimes T a and T b, with parameters Aa and Ab, respectively. What is the distribution of Z = min{Ta , Tb}, the first time when a bulb burns out? We can treat this as an exercise in derived distributions. For all z 2: 0, we have,
Fz(z) = P(min{Ta , Tb} :::; z) = 1- P(min{Ta , T b}
> z)
= 1 - P(Ta
> z, Tb > z)
= 1 - P(Ta
> Z)P(Tb > z)
This is recognized as the exponential CDF with parameter Aa + Ab. Thus, the minimum of two independent exponentials with parameters Aa and Ab is an exponential with parameter Aa + Ab. For a more intuitive explanation of this fact, let us think of T a and Tb as the times of the first arrival in two independent Poisson processes with rates Aa and Ab, respectively. If we merge these two processes, the first arrival time will be min{Ta , T b }. But we already know that the merged process is Poisson with rate Aa + Ab, and it follows that the first arrival time, min{Ta , T b}, is exponential with parameter Aa + Ab.
The preceding discussion can be generalized to the case of more than two processes. Thus, the total arrival process obtained by merging the arrivals of n independent Poisson processes with arrival rates )\1, ... , An is Poisson with arrival rate equal to the sum Al + ... + An.
Example 5.16. More on Competing Exponentials. Three light bulbs have independent exponentially distributed lifetimes with a common parameter A. What is the expected value of the time until the last bulb burns out? We think of the times when each bulb burns out as the first arrival times in independent Poisson pJocesses. In the beginning, we have three bulbs, and the merged process has rate 3A. Thus, the time T 1 of the first burnout is exponential with parameter 3A, and mean 1/3A. Once a bulb burns out, and because of the memorylessness property of the exponential distribution, the remaining lifetimes of the other two lightbulbs are again independent exponential random variables with parameter A. We thus have two Poisson processes running in parallel, and the remaining time T 2 until the first arrival in one of these two processes is now
Sec. 5.2
The Poisson Process
297
exponential with parameter 2.A and mean 1/2.A. Finally, once a second bulb burns out, we are left with a single one. Using memorylessness once more, the remaining time T 3 until the last bulb burns out is exponential with parameter .A and mean 1/.A. Thus, the expected value of the total time is
Note that the random variables T 1 , T2, T 3 are independent, because of memorylessness. This allows us to also compute the variance of the total time: var(T1
+ T 2 + T3)
= var(T1 )
+ var(T2) + var(T3)
1 = 9.A 2
+
1 4.A 2
+
1 .A 2 '
We close by noting a related and quite deep fact, namely that the sum of a large number of (not necessarily Poisson) independent arrival processes, can be approximated by a Poisson process with arrival rate equal to the sum of the individual arrival rates. The component processes must have a small rate relative to the total (so that none of them imposes its probabilistic character on the total arrival process) and they must also satisfy some technical mathematical assumptions. Further discussion of this fact is beyond our scope, but we note that it is in large measure responsible for the abundance of Poisson-like processes in practice. For example, the telephone traffic originating in a city consists of many component processes, each of which characterizes the phone calls placed by individual residents. The component processes need not be Poisson; some people for example tend to make calls in batches, and (usually) while in the process of talking, cannot initiate or receive a second call. However, the total telephone traffic is well-modeled by a Poisson process. For the same reasons, the process of auto accidents in a city, customer arrivals at a store, particle emissions from radioactive material, etc., tend to have the character of the Poisson process. The Random Incidence Paradox The arrivals of a Poisson process partition the time axis into a sequence of interarrival intervals; each interarrival interval starts with an arrival and ends at the time of the next arrival. We have seen that the lengths of these interarrival intervals are independent exponential random variables with parameter A, where A is the rate of the process. More precisely, for every k, the length of the kth interarrival interval has this exponential distribution. In this subsection, we look at these interarrival intervals from a different perspective. Let us fix a time instant t* and consider the length L of the interarrival interval that contains t*. For a concrete context, think of a person who shows up at the bus station at some arbitrary time t* and records the time from the previous bus arrival until the next bus arrival. The arrival of this person is often referred to as a "random incidence," but the reader should be aware that the term is misleading: t* is just a particular time instance, not a random variable.
Stochastic Processes
298
Chap. 5
We assume that t* is much larger than the starting time of the Poisson process so that we can be fairly certain that there has been an arrival prior to time t*. To avoid the issue of how large t* should be, we assume that the Poisson process has been running forever, so that we can be certain that there has been a prior arrival, and that L is well-defined. One might superficially argue that L is the length of a "typical" interarrival interval, and is exponentially distributed, but this turns out to be false. Instead, we will establish that L has an Erlang PDF of order two. This is known as the random incidence phenomenon or paradox, and it can be explained with the help of Fig. 5.7. Let [U, V] be the interarrival interval that contains t*, so that L == V - U. In particular, U is the time of the first arrival prior to t* and V is the time of the first arrival after t*. We split L into two parts, L == (t* - U) + (V - t*), where t* - U is the elapsed time since the last arrival, and V - t* is the remaining time until the next arrival. Note that t* - U is determined by the past history of the process (before t*), while V - t* is determined by the future of the process (after time t*). By the independence properties of the Poisson process, the random variables t* - U and V - t* are independent. By the memorylessness property, the Poisson process starts fresh at time t*, and therefore V - t* is exponential with parameter A. The random variable t* - U is also exponential with parameter A. The easiest way to see this is to realize that if we run a Poisson process backwards in time it remains Poisson; this is because the defining properties of a Poisson process make no reference to whether time moves forward or backward. A more formal argument is obtained by noting that P(t* - U > x)
== P (no arrivals during [t*
x, t*]) == P(O, x) == e- AX ,
x ~ O.
We have therefore established that L is the sum of two independent exponential random variables with parameter A, i.e., Erlang of order two, with mean 2/A. L
-~i==---..~ ~.r.--+-h---lillllltim:
Elapsed time t*- U
~ Remaining
Chosen time instant
time V - t*
Figure 5.7: Illustration of the random incidence phenomenon. For a fixed time instant t*, the corresponding interarrival interval [U, V] consists of the elapsed time t* - U and the remaining time V - t*. These two times are independent and are exponentially distributed with parameter -\, so the PDF of their sum is Erlang of order two.
Sec. 5.3
Summary and Discussion
299
Random incidence phenomena are often the source of misconceptions and errors, but these can be avoided with careful probabilistic modeling. The key issue is that even though interarrival intervals have length 1/ A on the average, an observer who arrives at an arbitrary time is more likely to fall in a large rather than a small interarrival interval. As a consequence the expected length seen by the observer is higher, 2/ A in this case. A similar situation arises in the example that follows.
Example 5.17. Random Incidence in a Non-Poisson Arrival Process. Buses arrive at a station deterministically, on the hour, and five minutes after the hour. Thus, the interarrival times alternate between 5 and 55 minutes. The average interarrival time is 30 minutes. A person shows up at the bus station at a "random" time. We interpret "random" to mean a time that is uniformly distributed within a particular hour. Such a person falls into an interarrival interval of length 5 with probability 1/12, and an interarrival interval of length 55 with probability 11/12. The expected length of the chosen interarrival interval is 1 5· 12
11
+ 55 . 12
= 50.83,
which is considerably larger than 30, the average interarrival time.
As the preceding example indicates, random incidence is a subtle phenomenon that introduces a bias in favor of larger interarrival intervals, and can manifest itself in contexts other than the Poisson process. In general, whenever different calculations give contradictory results, the reason is that they refer to different probabilistic mechanisms. For instance, considering a fixed nonrandom k and the associated random value of the kth interarrival interval is a different experiment from fixing a time t and considering the random K such that the Kth interarrival interval contains t. For a last example with the same flavor, consider a survey of the utilization of town buses. One approach is to select a few buses at random and calculate the average number of riders in the selected buses. An alternative approach is to select a few bus riders at random, look at the buses that they rode and calculate the average number of riders in the latter set of buses. The estimates produced by the two methods have very different statistics, with the second method being biased upwards. The reason is that with the second method, it is much more likely to select a bus with a large number of riders than a bus that is near-empty.
5.3 SUMMARY AND DISCUSSION In this chapter, we introduced and analyzed two memoryless arrival processes. The Bernoulli process evolves in discrete time, and during each discrete time step, there is a constant probability p of an arrival. The Poisson process evolves
300
Stochastic Processes
Chap. 5
in continuous time, and during each small interval of length 0, there is a probability of an arrival approximately equal to AO. In both cases, the numbers of arrivals in disjoint time intervals are assumed independent. Thus the Poisson process can be viewed as a limiting case of the Bernoulli process, in which the duration of each discrete time slot is taken to be a very small number O. This fact can be used to draw parallels between the major properties of the two processes, and to transfer insights gained from one process to the other. Using the memorylessness property of the Bernoulli and the Poisson, we derived the following. (a) The PMF of the number of arrivals during a time interval of given length is binomial and Poisson, respectively. (b) The distribution of the time between successive arrivals is geometric and exponential, respectively. (c) The distribution of the time until the kth arrival, is Pascal of order k and Erlang of order k, respectively. Furthermore, we saw that one can start with two independent Bernoulli (respectively, Poisson) processes and "merge" them to form a new Bernoulli (respectively, Poisson) process. Conversely, if one "accepts" each arrival by flipping a coin with success probability p ("splitting"), the process of accepted arrivals is a Bernoulli or Poisson process whose arrival rate is p times the original arrival rate. We finally considered the "random incidence" phenomenon where an external observer arrives at some given time and measures the interarrival interval within which he arrives. The statistics of the measured interval are different from those of a "typical" interarrival interval, because the arriving observer is more likely to fall in a larger interarrival interval. This phenomenon indicates that when talking about a "typical" interval, one must carefully describe the mechanism by which the interval is selected. Different mechanisms will in general result in different statistical properties.
Problems
301
PROBLEMS
SECTION 5.1. The Bernoulli Process Problem 1. Each of n packages is loaded independently onto either a red truck (with probability p) or onto a green truck (with probability 1 - p). Let R be the total number of items selected for the red truck and let G be the total number of items selected for the green truck.
(a) Determine the PMF, expected value, and variance of the random variable R. (b) Evaluate the probability that the first item to be loaded ends up being the only one on its truck. (c) Evaluate the probability that at least one truck ends up with a total of exactly one package. (d) Evaluate the expected value and the variance of the difference R - G. (e) Assume that n 2:: 2. Given that both of the first two packages to be loaded go onto the red truck, find the conditional expectation, variance, and PMF of the random variable R. Problem 2. Dave fails quizzes with probability 1/4, independently of other quizzes.
(a) What is the probability that Dave fails exactly two of the next six quizzes? (b) What is the expected number of quizzes that Dave will pass before he has failed three times? (c) What is the probability that the second and third time Dave fails a quiz will occur when he takes his eighth and ninth quizzes, respectively? (d) What is the probability that Dave fails two quizzes in a row before he passes two quizzes in a row? Problem 3. A computer system carries out tasks submitted by two users. Time is divided into slots. A slot can be idle, with probability PI == 1/6, and busy with probability PB == 5/6. During a busy slot, there is probability PIIB == 2/5 (respectively, P21B == 3/5) that a task from user 1 (respectively, 2) is executed. We assume that events related to different slots are independent.
(a) Find the probability that a task from user 1 is executed for the first time during the 4th slot. (b) Given that exactly 5 out of the first 10 slots were idle, find the probability that the 6th idle slot is slot 12. (c) Find the expected number of slots up to and including the 5th task from user 1.
Stochastic Processes
302
Chap. 5
(d) Find the expected number of busy slots up to and including the 5th task from user 1. (e) Find the PMF, mean, and variance of the number of tasks from user 2 until the time of the 5th task from user 1. Problem 4. * Consider a Bernoulli process with probability of success in each trial equal to p. (a) Relate the number of failures before the rth success (sometimes called a negative binomial random variable) to a Pascal random variable and derive its PMF. (b) Find the expected value and variance of the number of failures before the rth success. (c) Obtain an expression for the probability that the ith failure occurs before the rth success.
Solution. (a) Let Y be the number of trials until the rth success, which is a Pascal random variable of order r. Let X be the number of failures before the rth success, so that X = Y - r. Therefore, px (k) = py (k + r), and k = 0,1, ....
(b) Using the notation of part (a), we have
r p
E[X] = E[Y] - r = - - r
(1 - p)r p
Furthermore, var(X) (c) Let again X be the number of failures before the rth success. The ith failure occurs before the rth success if and only if X 2:: i. Therefore, the desired probability is equal to i = 1,2, ....
An alternative formula is derived as follows. Consider the first r + i 1 trials. The number of failures in these trials is at least i if and only if the number of successes is less than r. But this is equivalent to the ith failure occurring before the rth success. Hence, the desired probability is the probability that the number of successes in r+i-1 trials is less than r, which is
i = 1,2, ....
Problem 5. * Random incidence in the Bernoulli process. Your cousin has been playing the same video game from time immemorial. Assume that he wins each
Problems
303
game with probability p, independently of the outcomes of other games. At midnight, you enter his room and witness his losing the current game. What is the PMF of the number of lost games between his most recent win and his first future win?
Solution. Let t be the number of the game when you enter the room. Let lvI be the number of the most recent past game that he won, and let N be the number of the first game to be won in the future. The random variable X = N - t is geometrically distributed with parameter p. By symmetry and independence of the games, the random variable Y = t - 1\;[ is also geometrically distributed with parameter p. The games he lost between his most recent win and his first future win are all the games between M and N. Their number L is given by
L=N-M-1=X+Y-1. Thus, L
+ 1 has a
Pascal PMF of order two, and
peL + 1 = k) Hence, 1,2, ....
*
Problem 6. Sum of a geometric number of independent geometric random variables. Let Y Xl + ... +XN , where the random variables Xi are geometric with parameter p, and N is geometric with parameter q. Assume that the random variables N, Xl, X2, ... are independent. Show, without using transforms, that Y is geometric with parameter pq. Hint: Interpret the various random variables in terms of a split Bernoulli process.
Solution. We derived this result in Chapter 4, using transforms, but we develop a more intuitive derivation here. We interpret the random variables Xi and N as follows. We view the times Xl, Xl + X2, etc. as the arrival times in a Bernoulli process with parameter p. Each arrival is rejected with probability 1 - q and is accepted with probability q. We interpret N as the number of arrivals until the first acceptance. The process of accepted arrivals is obtained by splitting a Bernoulli process and is therefore itself Bernoulli with parameter pq. The random variable Y = Xl + ... + XN is the time of the first accepted arrival and is therefore geometric, with parameter pq. Problem 7.* The bits in a uniform random variable form a Bernoulli process. Let Xl, X 2 , ... be a sequence of binary random variables taking values in the set {a, I}. Let Y be a continuous random variable that takes values in the set [0,1]. We relate X and Y by assuming that Y is the real number whose binary representation is 0.X l X 2 X3 .... More concretely, we have Y = '""'" L...i2 -k Xk. k=l
(a) Suppose that the Xi form a Bernoulli process with parameter p = 1/2. Show that Y is uniformly distributed. Hint: Consider the probability of the event (i - 1)/2 k < Y < i/2 k , where i and k are positive integers.
Stochastic Processes
304
Chap. 5
(b) Suppose that Y is uniformly distributed. Show that the Xi form a Bernoulli process with parameter p = 1/2.
Solution. (a) We have 1
p(Y E [0,1/2]) = P(XI = 0) = "2 = p(Y E [1/2,1]). Furthermore,
p(Y E [0,1/4])
P(X I = 0, X2 = 0)
1 4
Arguing similarly, we consider an interval of the form [(i - 1)/2 k , i/2 k J, where i and k are positive integers and i :::; 2k . For Y to fall in the interior of this interval, we need Xl, ... ,Xk to take on a particular set of values (namely, the binary expansion of i-I). Hence, k p ((i 1) 12 < Y < i 12k ) = 21k ' Note also that for any y E [0,1], we have P(Y = y) 0, because the event {Y = y} can only occur if infinitely many XiS take on particular values, a zero probability event. Therefore, the CDF of Y is continuous and satisfies
Since every number y in [0,1] can be closely approximated by a number of the form i/2 k , we have P(Y :::; y) = y, for every y E [0,1], which establishes that Y is uniform. (b) As in part (a), we observe that every possible zero-one pattern for X I, ... , X k is associated to one particular interval of the form [(i - 1)/2 k , i/2 k J for Y. These intervals have equal length, and therefore have the same probability 1/2 k , since Y is uniform. This particular joint PMF for X I, ... , X k, corresponds to independent Bernoulli random variables with parameter p 1/2.
SECTION 5.2. The Poisson Process Problem 8. During rush hour, from 8 a.m. to 9 a.m., traffic accidents occur according to a Poisson process with a rate of 5 accidents per hour. Between 9 a.m. and 11 a.m., they occur as an independent Poisson process with a rate of 3 accidents per hour. What is the PMF of the total number of accidents between 8 a.m. and 11 a.m.? Problem 9. A fisherman catches fish according to a Poisson process with rate A = 0.6 per hour. The fisherman will keep fishing for two hours. If he has caught at least one fish, he quits. Otherwise, he continues until he catches at least one fish. (a) Find the probability that he stays for more than two hours. (b) Find the probability that the total time he spends fishing is between two and five hours. (c) Find the probability that he catches at least two fish. (d) Find the expected number of fish that he catches. (e) Find the expected total fishing time, given that he has been fishing for four hours.
305
ProblelTIS
Problem 10. Customers depart fronl a bookstore according to a Poisson process with rate A per hour. Each customer buys a book with probability p, independent of everything else. (a) Find the distribution of the time until the first sale of a book. (b) Find the probability that no books are sold during a particular hour. (c) Find the expected nUlnber of customers who buy a book during a particular hour. Problem 11. Transmitters A and B independently send messages to a single receiver in a Poisson manner, with rates of AA and AB, respectively. All messages are so brief that we may assume that they occupy single points in time. The number of words in a message, regardless of the source that is transmitting it, is a random variable with PMF 2/6, if w == 1, _ 3/6, if w == 2, pw(w) 1/6, if w == 3, { otherwise, 0, and is independent of everything else. (a) What is the probability that during an interval of duration t, a total of exactly nine messages will be received? (b) Let N be the total number of words received during an interval of duration t. Determine the expected value of N. (c) Determine the PDF of the time from t == 0 until the receiver has received exactly eight three-word messages from transmitter A. (d) What is the probability that exactly eight of the next twelve messages received will be from transmitter A? Problem 12. Beginning at time t == 0, we start using bulbs, one at a time, to illuminate a room. Bulbs are replaced immediately upon failure. Each new bulb is selected independently by an equally likely choice between a type-A bulb and a type-B bulb. The lifetime, X, of any particular bulb of a particular type is a randoln variable, independent of everything else, with the following PDF: for type-A Bulbs:
Ix (x)
for type-B Bulbs:
Ix (x)
{
e -x , 0,
3e- 3x , { 0,
if x 2: 0, otherwise; if x 2: 0, otherwise.
(a) Find the expected time until the first failure. (b) Find the probability that there are no bulb failures before time t. (c) Given that there are no failures until time t, determine the conditional probability that the first bulb used is a type-A bulb. (d) Find the variance of the time until the first bulb failure. (e) Find the probability that the 12th bulb failure is also the 4th type-A bulb failure.
306
Stochastic Processes
Chap. 5
(f) Vp to and including the 12th bulb failure, what is the probability that a total of exactly 4 type-A bulbs have failed? (g) Determine either the PDF or the transform associated with the time until the 12th bulb failure. (h) Determine the probability that the total period of illumination provided by the first two type-B bulbs is longer than that provided by the first type-A bulb. (i) Suppose the process terminates as soon as a total of exactly 12 bulb failures have occurred. Determine the expected value and variance of the total period of illumination provided by type-B bulbs while the process is in operation.
(j) Given that there are no failures until time t, find the expected value of the time until the first failure. Problem 13. A service station handles jobs of two types, A and B. (Multiple jobs can be processed simultaneously.) Arrivals of the two job types are independent Poisson processes with parameters AA = 3 and AB = 4 per minute, respectively. Type A jobs stay in the service station for exactly one minute. Each type B job stays in the service station for a random but integer amount of time which is geometrically distributed, with mean equal to 2, and independent of everything else. The service station started operating at some time in the remote past. (a) What is the mean, variance, and PMF of the total number of jobs that arrive within a given three-minute interval? (b) We are told that during a 10-minute interval, exactly 10 new jobs arrived. What is the probability that exactly 3 of them are of type A? (c) At time 0, no job is present in the service station. What is the PMF of the number of type B jobs that arrive in the future, but before the first type A arrival? (d) At time t = 0, there were exactly two type A jobs in the service station. What is the PDF of the time of the last (before time 0) type A arrival? (e) At time 1, there was exactly one type B job in the service station. Find the distribution of the time until this type B job departs. Problem 14. Each morning, as you pull out of your driveway, you would like to make a V-turn rather than drive around the block. V nfortunately, V-turns are illegal in your neighborhood, and police cars drive by according to a Poisson process with rate A. You decide to make a V-turn once you see that the road has been clear of police cars for T time units. Let N be the number of police cars you see before you make the V-turn. (a) Find E[N]. (b) Find the conditional expectation of the time elapsed between police cars n - 1 and n, given that N ~ n. (c) Find the expected time that you wait until you make the V-turn. Hint: Condition onN. Problem 15. A wombat in the San Diego zoo spends the day walking from a burrow to a food tray, eating, walking back to the burrow, resting, and repeating the cycle. The amount of time to walk from the burrow to the tray (and also from the tray to the
Problems
307
burrow) is 20 secs. The amounts of time spent at the tray and resting are exponentially distributed with mean 30 secs. The wombat, with probability 1/3, will momentarily stand still (for a negligibly small time) during a walk to or from the tray, with all times being equally likely (and independently of what happened in the past). A photographer arrives at a random time and will take a picture at the first time the wombat will stand still. What is the expected value of the length of time the photographer has to wait to snap the wombat's picture? Problem 16. * Consider a Poisson process. Given that a single arrival occurred in a given interval [0, t], show that the PDF of the arrival time is uniform over [0, t].
Solution. Consider an interval [a, b] C [0, t] of length l == b - a. Let T be the time of the first arrival, and let A be the event that a single arrival occurred during [0, t]. We have P(T E [a,b] and A)
P(TE[a,bJjA)=
peA)
.
The numerator is equal to the probability P(I, l) that the Poisson process has exactly one arrival during the length l interval [a, b], times the probability P(O, t - l) that the process has zero arrivals during the set [0, a) U (b, t], of total length t - l. Hence
P (T
[ b] I A) == P(I, l)P(O, t - l) == (Al)e->-le->-(t-l) E a,
P(I, t)
(At)e->-t
l
-
t'
which establishes that T is uniformly distributed. Problem 17.
*
(a) Let Xl and X 2 be independent and exponentially distributed, with parameters Al and A2, respectively. Find the expected value of max{XI , X 2}. (b) Let Y be exponentially distributed with parameter AI. Let Z be Erlang of order 2 with parameter )..2. Assume that Y and Z are independent. Find the expected value of max{Y, Z}.
Solution. A direct but tedious approach would be to find the PDF of the random variable of interest and then evaluate an integral to find its expectation. A much simpler solution is obtained by interpreting the random variables of interest in terms of underlying Poisson processes. (a) Consider two independent Poisson processes with rates Al and A2, respectively. We interpret XI as the first arrival time in the first process, and X 2 as the first arrival time in the second process. Let T == min{X1 , X 2 } be the first time when one of the processes registers an arrival. Let S == max{X I , X 2 } - T be the additional time until both have registered an arrival. Since the merged process is Poisson with rate Al + A2, we have 1 E[T]==~. /\1
+ /\2
Concerning S, there are two cases to consider. (i) The first arrival comes from the first process; this happens with probability Al/(Al + A2)' We then have to wait for an arrival from the second process, which takes 1/ A2 time on the average.
Stochastic Processes
308
Chap. 5
(ii) The first arrival comes from the second process; this happens with probability A2/(Al + A2)' We then have to wait for an arrival from the first process, which takes 1/ Al time on the average. Putting everything together, we obtain
(b) Consider two independent Poisson processes with rates Al and A2, respectively. We interpret Y as the first arrival time in the first process, and Z as the second arrival time in the second process. Let T be the first time when one of the processes registers an arrival. Since the merged process is Poisson with rate Al + A2, we have E[T] == 1/(Al + A2)' There are two cases to consider. (i) The arrival at time T comes from the first process; this happens with probability Al / (AI + A2). In this case, we have to wait an additional time until the second process registers two arrivals. This additional time is Erlang of order 2, with parameter A2, and its expected value is 2/ A2. (ii) The arrival at time T comes from the second process; this happens with probability A2/ (AI + A2). In this case, the additional time S we have to wait is the time until each of the two processes registers an arrival. This is the maximum of two independent exponential random variables and, according to the result of part (a), we have
E[S] == Al
1
+ A2
(
Al
1 + A2
A2)
+ Al
.
Putting everything together, we have E [ max{Y, Z} ]
1
Al
A2
2
== --+- + Al + A2 . A2 + Al + A2 . E[S],
where E[S] is given by the previous formula. Problem 18. * Let Yk be the time of the kth arrival in a Poisson process with rate A. Show that for all y > 0,
L
fYk
(y) == A.
k=1
Solution. We have
(let m
== A.
==
k
1)
Problems
309
The last equality holds because the )..myme->..y 1m! terms are the values of a Poisson PMF with parameter )..y and must therefore sum to 1. For a more intuitive derivation, let 8 be a small positive number and consider the following events:
Ak: the kth arrival occurs between y and y+8; the probability of this event is P(A k ) ::::: fYk (y)8; A: an arrival occurs between y and y
+ 8;
the probability of this event is P(A) :::::
fy(y)8. Suppose that 8 is taken small enough so that the possibility of two or more arrivals during an interval of length 8 can be ignored. With this approximation, the events AI, A 2 , ... become disjoint, and their union is A. Therefore, 00
00
EfYk(y)·8::::: Ep(Ak) k=l
k=l
::::: P(A)
::::: ),,8,
and the desired result follows by canceling 8 from both sides.
Problem 19. * Consider an experiment involving two independent Poisson processes with rates )..1 and )..2. Let Xl (k) and X2(k) be the times of the kth arrival in the 1st and the 2nd process, respectively. Show that
Solution~ Consider the merged Poisson process, which has rate )..1 + )..2. Each time there is an arrival in the merged process, this arrival comes from the first process ("success") with probability )..1/()..1 +)..2), and from the second process ("failure") with probability )..2/()..1 + )..2). Consider the situation after n + m - 1 arrivals. The number of arrivals from the first process is at least n if and only if the number of arrivals from the second process is less than m, which happens if and only if the nth success occurs before the mth failure. Thus, the event {X1(n) < X2(m)} is the same as the event of having at least n successes in the first n + m 1 trials. Therefore, using the binomial PMF for the number of successes in a given number of trials, we have
Problem 20. * Sum of a geometric number of independent exponential random variables. Let Y == Xl + ... + X N, where the random variables Xi are exponential with parameter ).., and N is geometric with parameter p. Assume that the raIl,dom variables N, Xl, X 2 , ... are independent. Show, without using transforms, that Y is exponential with parameter )..p. Hint: Interpret the various random variables in terms of a split Poisson process.
Stochastic Processes
310
Chap. 5
Solution. We derived this result in Chapter 4, using transforms, but we develop a more intuitive derivation here. We interpret the random variables Xi and N as follows. We view the times Xl, Xl + X 2 , etc. as the arrival times in a Poisson process with parameter A. Each arrival is rejected with probability 1 - P and is accepted with probability p. We interpret N as the number of arrivals until the first acceptance. The process of accepted arrivals is obtained by splitting a Poisson process and is therefore itself Poisson with parameter pA. Note that Y == Xl + ... + XN is the time of the first accepted arrival and is therefore exponential with parameter pA. Problem 21. * The number of Poisson arrivals during an exponentially distributed interval. Consider a Poisson process with parameter A, and an independent random variable T, which is exponential with parameter 1/. Find the PMF of the number of Poisson arrivals during the time interval [0, T].
Solution. Let us view T as the first arrival time in anew, independent, Poisson process with parameter 1/, and merge this process with the original Poisson process. Each arrival in the merged process comes from the original Poisson process with probability A/(A + 1/), independently from other arrivals. If we view each arrival in the merged process as a trial, and an arrival from the new process as a success, we note thatrP the number K of trials/arrivals until the first success has a geometric PMF, of the form PK(k) ==
1/
(
A+1/
)
(
A
A+1/
)k-l '
k
== 1,2, ....
Now the number L of arrivals from the original Poisson process until the first "success" is equal to K - 1 and its PMF is
l
== 0,1, ....
Problem 22. * An infinite server queue. We consider a queueing system with an infinite number of servers, in which customers arrive according to a Poisson process with rate A. The ith customer stays in the system for a random amount of time, denoted by Xi. We assume that the random variables Xi are independent identically distributed, and also independent from the arrival process. We also assume, for simplicity, that the Xi take integer values in the range 1, 2, ... ,n, with given probabilities. Find the PMF of Nt, the number of customers in the system at time t.
Solution. Let us refer to those customers i whose service time Xi is equal to k as "type-k" customers. We view the overall arrival process as the merging of n Poisson subprocesses; the kth subprocess corresponds to arrivals of type-k customers, is independent of the other arrival subprocesses, and has rate APk, where Pk == P(Xi == k). Let N tk be the number of type-k customers in the system at time t. Thus, n
and the random variables N tk are independent. We now determine the PMF of N tk . A type-k customer is in the system at time t if and only if that customer arrived between times t - k and t. Thus, N tk has a Poisson
Problems
311
PI\1F with mean Akpk. Since the sum of independent Poisson random variables is Poisson, it follows that Nt has a Poisson PMF with parameter
I: kPk = AE[X
E[Nd = A
i ].
k=l
Problem 23. * Independence of Poisson processes obtained by splitting. Consider a Poisson process whose arrivals are split, with each arrival assigned to one of two subprocesses by flipping an independent coin with success probability p. In Example 5.13, it was established that each of the subprocesses is a Poisson process. Show that the two subprocesses are independent.
Solution. Let us start with two independent Poisson processes PI and P2, with rates pA and (1 p)A, respectively. We merge the two processes and obtain a Poisson process P with rate A. We now split the process P into two new subprocesses P~ and P~, according to the following rule: an arrival is assigned to subprocess P~ (respectively, P~) if and only if that arrival corresponds to an arrival from subprocess PI (respectively, P2). Clearly, the two new subprocesses P~ and P~ are independent, since they are identical to the original subprocesses PI and P2. However, P~ and P~ were generated by a splitting mechanism that looks different than the one in the problem statement. We will now verify that the new splitting mechanism considered here is statistically identical to the one in the problem statement. It will then follow that the subprocesses constructed in the problem statement have the same statistical properties as P~ and p~, and are also independent. So, let us consider the above described splitting mechanism. Given that P had an arrival at time t, this was due to either an arrival in PI (with probability p), or to an arrival in P2 (probability 1 - p). Therefore, the arrival to P is assigned to P~ or P~ with probabilities p and 1 - p, respectively, exactly as in the splitting procedure described in the problem statement. Consider now the kth arrival in P and let Lk be the event that this arrival originated from subprocess PI; this is the same as the event that the kth arrival is assigned to subprocess P~. As explained in the context of Example 5.14, the events Lk are independent. Thus, the assignments of arrivals to the subprocesses P~ and P~ are independent for different arrivals, which is the other requirement of the splitting mechanism described in the problem statement. Problem 24. * Random incidence in an Erlang arrival process. Consider an arrival process in which the interarrival times are independent Erlang random variables of order 2, with mean 2/A. Assume that the arrival process has been ongoing for a very long time. An external observer arrives at a given time t. Find the PDF of the length of the interarrival interval that contains t.
Solution. We view the Erlang arrival process in the problem statement as part of a Poisson process with rate A. In particular, the Erlang arrival process registers an arrival once every two arrivals of the Poisson process. For concreteness, let us say that the Erlang process arrivals correspond to even-numbered arrivals in the Poisson process. Let Y k be the time of the kth arrival in the Poisson process. Let K be such that YK :::; t < YI 0 for all j E R. As an example, consider the first chain in Fig. 6.9. Starting from state 1, every state is possible at time n == 3, so the unique recurrent class of that chain is aperiodic. A converse statement, which we do not prove, also turns out to be true: if a recurrent class R is aperiodic, then there exists a time n such that rij (n) > 0 for every i and j in R; see the end-of-chapter problems.
lVIarkov Chains
326
Chap. 6
Periodicity Consider a recurrent class R. • The class is called periodic if its states can be grouped in d > 1 disjoint subsets Sl, ... , Sd, so that all transitions from Sk lead to Sk+1 (or to S1 if k == d). J
The class is aperiodic (not periodic) if and only if there exists a time Tij (n) > 0, for all i, j E R.
n such that
6.3 STEADY-STATE BEHAVIOR In Markov chain models, we are often interested in long-term state occupancy behavior, that is, in the n-step transition probabilities Tij (n) when n is very large. We have seen in the example of Fig. 6.6 that the Tij (n) may converge to steady-state values that are independent of the initial state. We wish to understand the extent to which this behavior is typical. If there are two or more classes of recurrent states, it is clear that the limiting values of the Tij (n) must depend on the initial state (the possibility of visiting j far into the future depends on whether j is in the same class as the initial state i). We will, therefore, restrict attention to chains involving a single recurrent class, plus possibly some transient states. This is not as restrictive as it may seem, since we know that once the state enters a particular recurrent class, it will stay within that class. Thus, the asymptotic behavior of a multiclass chain can be understood in terms of the asymptotic behavior of a single-class chain. Even for chains with a single recurrent class, the Tij (n) may fail to converge. To see this, consider a recurrent class with two states, 1 and 2, such that from state 1 we can only go to 2, and from 2 we can only go to 1 (P12 == P21 == 1). Then, starting at some state, we will be in that same state after any even number of transitions, and in the other state after any odd number of transitions. Formally,
n even, n odd. What is happening here is that the recurrent class is periodic, and for such a class, it can be seen that the Tij (n) gen'erically oscillate. We now assert that for every state j, the probability T ij (n) of being at state j approaches a limiting value that is independent of the initial state i, provided we exclude the two situations discussed above (multiple recurrent classes and/or a periodic class). This limiting value, denoted by 1T' j, has the interpretation 1T'j ~
P(X n
== j),
when n is large,
Sec. 6.3
Steady-State Behavior
327
and is called the steady-state probability of j. The following is an important theorem. Its proof is quite complicated and is outlined together with several other proofs in the end-of-chapter problems.
Steady-State Convergence Theorem
,
Consider a Markov chain with a single recurrent class, which is aperiodic. Then, the states j are associated with steady-state probabilities 7fj that have the following properties. (a) For each j, we have for all i. (b) The
7fj
are the unique solution to the system of equations below: m
7fj
== L7fkPkj,
j
==
1, ... , m,
k=l m
(c) We have 7fj 7fj
== 0, > 0,
for all transient states j, for all recurrent states j.
The steady-state probabilities 7fj sum to 1 and form a probability distribution on the state space, called the stationary distribution of the chain. The reason for the qualification "stationary" is that if the initial state is chosen according to this distribution, i.e., if P(Xo
== j)
7fj,
j
1, ...
,m,
then, using the total probability theorem, we have m
P(X1
== j)
LP(Xo k=l
m
== k)Pkj == L 7rkPkj == 7fj, k=l
where the last equality follows from part (b) of the steady-state convergence theorem. Similarly, we obtain P(Xn == j) == 7rj, for all nand j. Thus, if the initial state is chosen according to the stationary distribution, the state at any future time will have the same distribution.
Markov Chains
328
Chap. 6
The equations m
1rj
==
2=:
1rkPkj,
j
==
1, ...
,m,
k=l
are called the balance equations. They are a simple consequence of part (a) of the theorem and the Chapman-Kolmogorov equation. Indeed, once the convergence of rij(n) to some 1rj is taken for granted, we can consider the equation, m
rij(n)
==
2=:
rik(n
l)Pkj,
k=l
take the limit of both sides as n -+ 00, and recover the balance equations. t Together with the normalization equation
the balance equations can be solved to obtain the illustrate the solution process.
Example 6.5.
1rj.
The following examples
Consider a two-state Markov chain with transition probabilities PII P21
== 0.8, == 0.6,
PI2
0.2,
P22
== 0.4.
(This is the same as the chain of Example 6.1 and Fig. 6.1.) The balance equations take the form
or 1Tl
== 0.8 . 1Tl + 0.6 . 1T2,
1T2
== 0.2 . 1Tl + 0.4 . 1T2.
Note that the above two equations are dependent, since they are both equivalent to
t According to a famous and important theorem from linear algebra (called the Perron-Frobenius theorem), the balance equations always have a nonnegative solution, for any Markov chain. What is special about a chain that has a single recurrent class, which is aperiodic, is that given also the normalization equation, the solution is unique and is equal to the limit of the n-step transition probabilities rij(n).
Sec. 6.3
Steady-State Behavior
329
This is a generic property, and in fact it can be shown that anyone of the balance equations can always be derived from the remaining equations. However, we also know that the 1rj satisfy the normalization equation 1rl
+ 1r2
= 1, "
which supplements the balance equations and suffices to determine the 1rj uniquely. 1, we Indeed, by substituting the equation 1rl = 31r2 into the equation 1rl + 1r2 obtain 31r2 + 1r2 = 1, or 1r2 = 0.25, which using the equation
1rl
+ 1r2 =
1, yields
1rl
= 0.75.
This is consistent with what we found earlier by iterating the Chapman-Kolmogorov equation (cf. Fig. 6.6).
Example 6.6. An absent-minded professor has two umbrellas that she uses when commuting from home to office and back. If it rains and an umbrella is available in her location, she takes it. If it is not raining, she always forgets to take an umbrella. Suppose that it rains with probability p each time she commutes, independently of other times. What is the steady-state probability that she gets wet during a commute? We model this problem using a Markov chain with the following states: Statei: i umbrellas are available in her current location,
i = 0,1,2.
The transition probability graph is given in Fig. 6.11, and the transition probability matrix is
P
I-p
.CD:: Z-:=W 1
I-p
No umbrellas
p
Two umbrellas
One umbrella
Figure 6.11: Transition probability graph for Example 6.6.
The chain has a single recurrent class that is aperiodic (assuming 0 < p < 1), so the steady-state convergence theorem applies. The balance equations are
Markov Chains
330
Chap. 6
From the second equation, we obtain 1rl == 1r2, which together with the first equation 1ro == (1 - p)1r2 and the normalization equation 1ro + 1rl + 1r2 == 1, yields 1ro
==
1-
3-p
1rl
1 -p
== -3--'
1r2
==
1 -p
According to the steady-state convergence theorem, the steady-state probability that the professor finds herself in a place without an umbrella is 1ro. The steadystate probability that she gets wet is 1ro times the probability of rain p.
Example 6.7. A superstitious professor works in a circular building with m doors, where m is odd, and never uses the same door twice in a row. Instead he uses with probability p (or probability 1 - p) the door that is adjacent in the clockwise direction (or the counterclockwise direction, respectively) to the door he used last. What is the probability that a given door will be used on some particular day far into the future?
p
Door 3
Door 4
Figure 6.12: Transition probability graph in Example 6.7, for the case of m == 5 doors. Assuming that 0 < p < 1, it is not hard to see that given an initial state, every state j can be reached in exactly 5 steps, and therefore the chain is aperiodic.
We introduce a Markov chain with the following m states: Statei: last door used is door i,
i·'== 1, ... , m.
The transition probability graph of the chain is given in Fig. 6.12, for the case m == 5. The transition probability matrix is
0 I-p 0
p 0 I-p
p
0
0 0 p 0 0 p 0
0
0 0 0
I-p 0 0
I-p
0
Sec. 6.3
331
Steady-State Behavior
Assuming that 0 < p < 1, the chain has a single recurrent class that is aperiodic. (To verify aperiodicity, we leave it to the reader to verify that given an initial state, every state j can be reached in exactly m steps, and the criterion for aperiodicity given at the end of the preceding section is satisfied.) The balance equations are
1f1
= (1 - p)1f2
1fi
= P1fi-1
1fm
= (1
+ p1fm ,
+ (1 - p )1fi+1, P)1f1 + p1fm -1 .
i = 2, ... ,m -1,
These equations are easily solved once we observe that by symmetry, all doors should have the same steady-state probability. This suggests the solution 1 1fj
= m'
j = 1, ... ,m.
Indeed, we see that these 1fj satisfy the balance equations as well as the normalization equation, so they must be the desired steady-state probabilities (by the uniqueness part of the steady-state convergence theorem). Note that if either p = 0 or p = 1, the chain still has a single recurrent class but is periodic. In this case, the n-step transition probabilities rij (n) do not converge to a limit, because the doors are used in a cyclic order. Similarly, if m is even, the recurrent class of the chain is periodic, since the states can be grouped into two subsets, the even and the odd numbered states, such that from each subset one can only go to the other subset.
Long-Term Frequency Interpretations
Probabilities are often interpreted as relative frequencies in an infinitely long string of independent trials. The steady-state probabilities of a Markov chain admit a similar interp~etation, despite the absence of independence. Consider, for example, a Markov chain involving a machine, which at the end of any day can be in one of two states, wor¥ing or broken down. Each time it breaks down, it is immediately repaired at a cost of $1. How are we to model the long-term expected cost of repair per day? One possibility is to view it as the expected value of the repair cost on a randomly chosen day far into the future; this is just the steady-state probability of the broken down state. Alternatively, we can calculate the total expected repair cost in n days, where n is very large, and divide it by n. Intuition suggests that these two methods of calculation should give the same result. Theory supports this intuition, and in general we have the following interpretation of steady-state probabilities (a justification is given in the end-of-chapter problems).
332
]\;farkov Chains
Chap. 6
Steady-State Probabilities as Expected State Frequencies For a Markov chain with a single class which is aperiodic, the steady-state probabilities 1T'j satisfy , Vij(n) 1T'j == 11m - - , n-+oo n where Vij (n) is the expected value of the number of visits to state j within the first n transitions, starting from state i,
Based on this interpretation, 1T'j is the long-term expected fraction of time that the state is equal to j. Each time that state j is visited, there is probability Pjk that the next transition takes us to state k. We conclude that 1T'jPjk can be viewed as the long-term expected fraction of transitions that move the state from j to k.t
Expected Frequency of
Particular Transition
Consider n transitions of a.Markov chain with a single· class which is aperiodic, starting from a giveninitial·state.. Let qjk(n) be the expected number of such. transitions that take the state from j to k. Then, regardless of the initial state, .we have .••• qjk(n) 11m-.--- == 1T'jPjk. n-+oo n
The frequency interpretation of 1T'j and 1T'jPjk allows for a simple interpretation of the balance equations. The state is equal to j if and only if there is a transition that brings the state to j. Thus, the expected frequency 1T' j of visits to j is equal to the sum of the expected frequencies 1T'kPkj of transitions that lead to j, and m
1T'j
==
L
1T'kPkj;
k=l
see Fig. 6.13.
t In fact, some stronger statements are also true, such as the following. Whenever we carry out a probabilistic experiment and generate a trajectory of the Markov chain over an infinite time horizon, the observed long-term frequency with which state j is visited will be exactly equal to 1fj, and the observed long-term frequency of transitions from j to k will be exactly equal to 1fj pj k. Even though the trajectory is random, these equalities hold with essential certainty, that is, with probability 1. The exact meaning of such a statement will become more apparent in the next chapter, when we discuss concepts related to the limiting behavior of random processes.
Sec. 6.3
Steady-State Behavior
333
m
Figure 6.13: Interpretation of the balance equations in terms of frequencies. In a very large number of transitions, we expect a fraction 7TkPkj that bring the state from k to j. (This also applies to transitions from j to itself, which occur with frequency 7TjPjj.) The sum of the expected frequencies of such transitions is the expected frequency 7Tj of being at state j.
Birth-Death Processes A birth-death process is a Markov chain in which the states are linearly arranged and transitions can only occur to a neighboring state, or else leave the state unchanged. They arise in many contexts, especially in queueing theory. Figure 6.14 shows the general structure of a birth-death process and also introduces some generic notation for the transition probabilities. In particular,
bi == P(Xn +l == i + 11 X n == i), di == P(Xn + 1 == i - I I X n == i),
("birth" probability at state i), ("death" probability at state i).
Figure 6.14: Transition probability graph for a birth-death process.
For a birth-death process, the balance equations can be substantially simplified. Let us focus on two neighboring states, say, i and i + 1. In any trajectory of the Markov chain, a transition from i to i + 1 has to be followed by a transition from i + 1 to i, before another transition from i to i + 1 can occur. Therefore, the expected frequency of transitions from i to i + 1, which is 1ribi, must be equal
l\Jarkov Chains
334
to the expected frequency of transitions from i leads to the local balance equations t i
~
+ 1 to i,
which is
Chap. 6
7ri+ldi+l.
This
0,1, ... ,fJn- 1.
Using the local balance equations, we obtain i
~
1, ... , fJn,
from which, using also the normalization equation probabilities 7ri are easily computed.
E i 7ri ~
1, the steady-state
Example 6.8. Random Walk with Reflecting Barriers. A person walks along a straight line and, at each time period, takes a step to the right with probability b, and a step to the left with probability 1 - b. The person starts in one of the positions 1, 2, ... ,m, but if he reaches position 0 (or position m + 1), his step is instantly reflected back to position 1 (or position m, respectively). Equivalently, we may assume that when the person is in positions 1 or m, he will stay in that position with corresponding probability 1 - band b, respectively. We introduce a Markov chain model whose states are the positions 1, ... , m. The transition probability graph of the chain is given in Fig. 6.15.
Figure 6.15: 'fransition probability graph for the random walk Example 6.8.
The local balance equations are i
Thus,
1I"i+1
== 1, ... , m - 1.
== P1l"i, where b
p==
lb'
t A more formal derivation that does not rely on the frequency interpretation proceeds as follows. The balance equation at state 0 is 11"0(1 - bo) + 1r1 d 1 == 11"0, which yields the first local balance equation 1I"0 bo == 11"1 d 1 • The balance equation at state 1 is 1I"0 bo + 11"1 (1 b1 - d 1 ) + 1T'2d2 == 1T'1. Using the local balance equation 1I"0 bo == 1I" 1 d 1 at the previous state, this is rewritten as 1T' l d 1 + 11"1(1 - b1 - d 1 ) + 1T' 2 d 2 == 11"1, which simplifies to 1T' l b1 == 1I" 2 d 2 . We can then continue similarly to obtain the local balance equations at all other states.
Sec. 6.3
Steady-State Behavior
and we can express all the 1ri
1rj
335
in terms of
== Pi - I 1rl,
Using the normalization equation 1
1r1,
i
==
as 1, ... , m.
== 1rl + ... + 1rm , we obtain
which leads to i
== 1, ... ,m.
Note that if p == 1 (left and right steps are equally likely), then
1ri
== 1/m for all i.
Example 6.9. Queueing. Packets arrive at a node of a communication network, where they are stored in a buffer and then transmitted. The storage capacity of the buffer is m: if m packets are already present, any newly arriving packets are discarded. We discretize time in very small periods, and we assume that in each period, at most one event can happen that can change the number of packets stored in the node (an arrival of a new packet or a completion of the transmission of an existing packet). In particular, we assume that at each period, exactly one of the following occurs: (a) one new packet arrives; this happens with a given probability b > 0; (b) one existing packet completes transmission; this happens with a given probability d > 0 if there is at least one packet in the node, and with probability o otherwise; (c) no new packet arrives and no existing packet completes transmission; this happens with probability 1 - b - d if there is at least one packet in the node, and with probability 1 - b otherwise. We introduce a Markov chain with states 0,1, ... ,m, corresponding to the number of packets in the buffer. The transition probability graph is given in Fig. 6.16.
Figure 6.16: Transition probability graph in Example 6.9.
The local balance equations are i
== 0,1, ... ,m - 1.
Markov Chains
336
We define
b
p=
and obtain
Chap. 6
1ri+l
= P1ri, which leads to 1ri
i
i = 0,1, ... ,m.
= P 1ro,
By using the normalization equation 1 =
and 1fo =
Using the equation
1fi
= pi 1fO ,
1- p
?Ti
d'
{
1- p 1 - pm+l' 1
{
m+1'
+ 1rl + ... + 1fm , we obtain
if p
-#
1,
if p = 1. m+1' the steady-state probabilities are i
1 ~ prn+l p ,
=
1ro
if P -# 1, O,l, ... ,m.
if P = 1,
It is interesting to consider what happens when the buffer size m is so large that it can be considered as practically infinite. We distinguish two cases. (a) Suppose that b < d, or p < 1. In this case, arrivals of new packets are less likely than departures of existing packets. This prevents the number of packets in the buffer from growing, and the steady-state probabilities 1fi decrease with i, as in a (truncated) geometric PMF. We observe that as m -+ 00, we have 1 - pm+l -+ 1, and for all i. We can view these as the steady-state probabilities in a system with an infinite buffer. [As a check, note that we have pi(l - p) = 1.]
2::0
(b) Suppose that b > d, or p > 1. In this case, arrivals of new packets are more likely than departures of existing packets. The number of packets in the buffer tends to increase, and the steady-state probabilities 1fi increase with i. As we consider larger al).d larger buffer sizes m, the steady-state probability of any fixed state i decreases to zero: 1ri -+
0,
for all i.
Were we to consider a system with an infinite buffer, we would have a Markov chain with a countably infinite number of states. Although we do not have the machinery to study such chains, the preceding calculation suggests that every state will have zero steady-state probability and will be "transient." The number of packets in queue will generally grow to infinity, and any particular state will be visited only a finite number of times. The preceding analysis provides a glimpse into the character of Markov chains with an infinite number of states. In such chains, even if there is a single and aperiodic recurrent class, the chain may never reach steady-state and a steadystate distribution may not exist.
Sec. 6.4
Absorption Probabilities and Expected Time to Absorption
337
6.4 ABSORPTION PROBABILITIES AND EXPECTED TIME TO ABSORPTION In this section, we study the short-term behavior of Markov chains. We first consider the case where the Markov chain starts at a transient state., We are interested in the first recurrent state to be entered, as well as in the time until this happens. When addressing such questions, the subsequent behavior of the Markov chain (after a recurrent state is encountered) is immaterial. We can therefore focus on the case where every recurrent state k is absorbing, i.e., Pkk
==
1,
Pkj
==
°
for all j
=1=
k.
If there is a unique absorbing state k, its steady-state probability is 1 (because all other states are transient and have zero steady-state probability), and will be reached with probability 1, starting from any initial state. If there are multiple absorbing states, the probability that one of them will be eventually reached is still 1, but the identity of the absorbing state to be entered is random and the associated probabilities may depend on the starting state. In the sequel, we fix a particular absorbing state, denoted by s, and consider the absorption probability ai that s is eventually reached, starting from i: ai
== P(Xn eventually becomes equal to the absorbing state s I Xo == i).
Absorption probabilities can be obtained by solving a system of linear equations, as indicated below. Absorption Probability Equations
Consider a· Markov· chain. where .each •state is ing, and fix a particular absorbing state s. Then, eventually reaching state s, starting from i, are the equations as == 1, for all absorbingi ai == 0,
=l=s,
m
ai
==
LPijaj,
for all transient i.
j=l
The equations as == 1, and ai == 0, for all absorbing i =1= s, are evident from the definitions. To verify the remaining equations, we argue as follows. Let us consider a transient state i and let A be the event that state s is eventually
lVlarkov Chains
338
Chap. 6
reached. We have ai
== P(A I X o == i) m
==
I: P(A IXo == i, Xl == j)P(X
1
== j I Xo == i)
(total probability thm.)
j==l m
==
I: P(A IXl == j)pij
(~arkov property)
j==l m
==
I:
ajPij.
j==l
The uniqueness property of the solution to the absorption probability equations requires a separate argument, which is given in the end-of-chapter problems. The next example illustrates how we can use the preceding method to calculate the probability of entering a given recurrent class (rather than a given absorbing state).
Example 6.10. Consider the Markov chain shown in Fig. 6.17(a). Note that there are two recurrent classes, namely {1} and {4, 5}. We would like to calculate the probability that the state eventually enters the recurrent class {4, 5} starting from one of the transient states. For the purposes of this problem, the possible transitions within the recurrent class {4, 5} are immaterial. We can therefore lump the states in this recurrent class and treat them as a single absorbing state (call it state 6), as in Fig. 6.17(b). It then suffices to compute the probability of eventually entering state 6 in this new chain. The probabilities of eventually reaching state 6, starting from the transient states 2 and 3, satisfy the following equations: a2 == O.2al a3 == O.2a2 Using the facts al == 0 and a6
+ a.3a2 + a.4a3 + a.1a6, + a.8a6.
== 1,
we obtain
a2 == O.3a2
+ a.4a3 + a.1,
a3 == O.2a2
+ a.8.
This is a system of two equations in the two unknowns a2 and a3, which can be readily solved to yield a2 == 21/31 and a3 == 29/31.
Example 6.11. Gambler's Ruin. A gambler wins $1 at each round, with probability p, and loses $1, with probability 1 - p. Different rounds are assumed independent. The gambler plays continuously until he either accumulates a target amount of $m, or loses all his money. What is the probability of eventually accumulating the target amount (winning) or of losing his fortune?
Sec. 6.4
Absorption Probabilities and Expected Time to Absorption
339
0.5
Figure 6.17: (a) Transition probability graph in Example 6.10. (b) A new graph in which states 4 and 5 have been lumped into the absorbing state 6.
We introduce the Markov chain shown in Fig. 6.18 whose state i represents the gambler's wealth at the beginning of a round. The states i = 0 and i = m correspond to losing and winning, respectively. All states are transient, except for the winning and losing states which are absorbing. Thus, the problem amounts to finding the probabilities of absorption at each one of these two absorbing states. Of course, these absorption probabilities depend on the initial state i.
Figure 6.18: Thansition probability graph for the gambler's ruin problem (Example 6.11). Here m == 4.
Let us set s = m in which case the absorption probability ai is the probability of winning, starting from state i. These probabilities satisfy
ao = 0, i = 1, ... ,m -1,
am = 1.
l\larkov Chains
340
Chap. 6
These equations can be solved in a variety of ways. It turns out there is an elegant method that leads to a nice closed form solution. Let us write the equations for the ai as i == 1, ... ,m - 1.
Then, by denoting i == 0, ... ,m - 1,
and p==
1p
the equations are written as i == 1, ... ,m - 1,
from which we obtain i == 1, ... ,m - 1.
This, together with the equation 80 (1
+ 81 + ... + 8m - 1 == am
+ p + ... + pm-1 ) 80 ==
and
ao
==
1, implies that
1,
1
== - - - - - - - l+p+'''+
80
Since ao == 0 and ai+1 == ai fortune i is equal to
+ 8i ,
the probability ai of winning starting from a
which simplifies to 1- pi
y-rn'
ai
= { ~,
p
if p
-I- 1,
if p == 1.
The solution reveals that if p > 1, which corresponds to p < 1/2 and unfavorable odds for the gambler, the probability of winning approaches 0 as m -+ 00, for any fixed initial fortune. This suggests that if you aim for a large profit under unfavorable odds, financial ruin is almost certain.
341
Absorption Probabilities and Expected Time to Absorption
Sec. 6.4
Expected Time to Absorption We now turn our attention to the expected number of steps until a recurrent state is entered (an event that we refer to as "absorption"), starting from a particular transient state. For any state i, we denote J-Li
E [number of transitions until absorption, starting from
i]
== E [ min{ n 2: 0 I X n is recurrent} I Xo == iJ. Note that if i is recurrent, then J-Li == 0 according to this definition. We can derive equations for the J-Li by using the total expectation theorem. We argue that the time to absorption starting from a transient state i is equal to 1 plus the expected time to absorption starting from the next state, which is j with probability Pij. We then obtain a system of linear equations, stated below, which has a unique solution (see the end-of-chapter problems).
Equations for the Expected Time. to Absorption The expected times to· absorption, /-Ll,···, /-Lm, the equations for all
J-Li. == 0,
-rar>,,-r-ra"t'"l1""
m
for all
TT- t], is independent of the past [the random variables X (T) for T < t].
Approximation by a Discrete-Time Markov Chain We now elaborate on the relation between a continuous-time Markov chain and a corresponding discrete-time version. This relation will lead to an alternative
t If a transition takes place at time t, the notation X (t) is ambiguous. A common convention is to let X (t) refer to the state right after the transition, so that X (Yn ) is the same as X n .
Sec. 6.5
Continuous- Time ]V[arkov Chains
347
description of a continuous-time Markov chain, and to a set of balance equations characterizing the steady-state behavior. Let us fix a small positive number 5 and consider the discrete-time Markov chain Zn that is obtained by observing X(t) every 5 time units:
Zn == X(n5),
n == 0,1, ....
The fact that Zn is a Markov chain (the future is independent from the past, given the present) follows from the Markov property of X(t). We will use the notation Pij to describe the transition probabilities of Zn. Given that Zn i, there is a probability approximately equal to v i 5 that there is a transition between times n5 and (n + 1)5, and in that case there is a further probability Pij that the next state is j. Therefore, if j
-# i,
where 0(5) is a term that is negligible compared to 5, as 5 gets smaller. The probability of remaining at i [Le., no transition occurs between times n5 and (n + 1)5] is
Pii
== P(Zn+l == i I Zn == i) == 1 -
LPij' j=j;i
This gives rise to the following alternative description. t
history •. of. the· process.
Example 6.14 (continued). Neglecting 0(8) terms, the transition probability matrix for the corresponding discrete-time Mar~ov chain Zn is
1-8 8 0] [ 58/2 38
1 - 58 0
58/2 •. 1 - 38
t Our argument so far shows that a continuous-time Markov chain satisfies this alternative description. Conversely, it can be shown that if we start with this alternative description, the time until a transition out of state i is an exponential random variable with parameter Vi == I:j qij. Furthermore, given that such a transition has just occurred, the next state is j with probability qij / Vi Pij. This establishes that the alternative description is equivalent to the original one.
IVlarkov Chains
348
Chap. 6
Example 6.15. Queueing. Packets arrive at a node of a communication network according to a Poisson process with rate .-\. The packets are stored at a buffer with room for up to m packets, and are then transmitted one at a time. However, if a packet finds a full buffer upon arrival, it is discarded. The time required to transmit a packet is exponentially distributed with parameter J.L. The transmission times of different packets are independent and are also independent from all the inte,rarrival times. We will model this system using a continuous-time Markov chain with state X (t) equal to the number of packets in the system at time t [if X (t ) > 0, then X(t) - 1 packets are waiting in the queue and one packet is under transmission]. The state increases by one when a new packet arrives and decreases by one when an existing packet departs. To show that X(t) is indeed a Markov chain, we verify that we have the property specified in the above alternative description, and at the same time identify the transition rates qij' Consider first the case where the system is empty, Le., the state X(t) is equal to O. A transition out of state 0 can only occur if there is a new arrival, in which case the state becomes equal to 1. Since arrivals are Poisson, we have
p(X(t + 8) == and qOj
==
11 X(t)
{0,.-\,
==
0)
==.-\6 + 0(8),
if j == 1, otherwise.
Consider next the case where the system is full, Le., the state X (t) is equal to m. A transition out of state m will occur upon the completion of the current packet transmission, at which point the state will become m - 1. Since the duration of a transmission is exponential (and in particular, memoryless), we have
p(X(t + 8)
m- 11 X(t)
and
==
m)
== J.L8 + 0(8),
if j m 1, otherwise.
Consider finally the case where X (t) is· equal to some intermediate state i, with 0 < i < m. During the next 8 time units, there is a probability .-\8 + 0(8) of a new packet arrival, which will bring the state to i + 1, and a probability J.L8 + 0(8) that a packet transmission is completed, which will bring the state to i - I . [The probability of both an arrival and a departure within an interval of length 8 is of the order of 82 and can be neglected, as is the case with other 0(8) terms.] Hence,
p(X(t + 8) == i - I I X(t) == i) ::::; J.L8 + 0(8), p(X(t + 8) == i + 11 X(t) == i) == .-\8 + 0(6), and
see Fig. 6.21.
if j == i + 1, if j == i-I, otherwise,
for i == 1, 2, ... , m - 1;
Sec. 6.5
Continuous- Time l\([arkov Chains
349
Figure 6.21: 'fransition graph in Example 6.15.
Steady-State Behavior We now turn our attention to the long-term behavior of a continuous-time Markov chain and focus on the state occupancy probabilities P(X(t) == i), in the limit as t gets large. We approach this problem by studying the steady-state probabilities of the corresponding discrete-time chain Zn. Since Zn == X (n8), it is clear that the limit 1rj of P (Zn == j I Zo == i), if it exists, is the same as the limit of P(X(t) == j I X(O) == i). It therefore suffices to consider the steady-state probabilities associated with Zn. Reasoning as in the discrete-time case, we see that for the steady-state probabilities to be independent of the initial state, we need the chain Zn to have a single recurrent class, which we will henceforth assume. We also note that the Markov chain Zn is automatically aperiodic. This is because the self-transition probabilities are of the form Pii == 1 - 8 qij + 0(8), j::;t=i
which is positive when O. Since an converges to a, there exists some no such that ak ~ a+(E/2), for all k > no. Let also c = maXk ak. We then have no bn = -1 L ak n k=l
+ -n1
ak k=no+l
~
E) .
noc + n --no - ( a + -2 n n
The limit of the right-hand side, as n tends to infinity, is a + (E/2). Therefore, there exists some nl such that bn ~ a + E, for every n ~ nl. By a symmetrical argument, there exists some n2 such that bn ~ a - E, for every n ~ n2. We have shown that for every E > 0, there exists some n3 (namely, n3 = max{nl,n2}) such that Ibn - al ~ E, for all n ~ n3. This means that bn converges to a.
Problem 20. * Doubly stochastic matrices. Consider a Markov chain with a single recurrent class which is aperiodic, and whose transition probability matrix is doubly stochastic, Le., it has the property that the entries in any column (as well as in any row) add to unity, so that m
LPij
j = 1, ... ,m.
1,
i=l
(a) Show that the transition probability matrix of the chain in Example 6.7 is doubly stochastic. (b) Show that the steady-state probabilities are 1 1rj
m
,
j = 1, ... ,m.
365
Problems
(c) Suppose that the recurrent class of the chain is instead periodic. Show that 7fl == ... == 7f m == 11m is the unique solution to the balance and normalization equations. Discuss your answer in the context of Example 6.7 for the case where m is even.
Solution. (a) Indeed the rows and the columns of the transition probability rp.atrix in this example all add to 1. (b) We have 1 m Thus, the given probabilities 7fj == 11m satisfy the balance equations and must therefore be the steady-state probabilities. (c) Let 7fj be a solution to the balance and normalization equations. Consider a particular j such that 7fj 2: 7fi for all i, and let q == 7fj. The balance equation for state j yields m
q
== 7fj == L
7fiPij ::;
q LPij
i=l
q,
i=l
where the last step follows because the transition probability matrix is doubly stochastic. It follows that the above inequality is actually an equality and m
m 7fiPij
i=l
==
qPij· i=l
Since 7fi ::; q for all i, we must have 7fiPij == qPij for every i, Thus, 7fi == q for every state i from which a transition to j is possible. By repeating this argument, we see that 7fi == q for every state i such that there is a positive probability path from i to j. Since all states are recurrent and belong to the same class, all states i have this property, and therefore 7fi is the same for all i. Since the 7fi add to 1, we obtain 7fl == 11m for all i. If m is even in Example 6.7, the chain is periodic with period 2. Despite this fact, the result we have just established shows that 7fj == 11m is the unique solution to the balance and normalization equations.
*
Problem 21. Queueing. Consider the queueing Example 6.9, but assume that the probabilities of a packet arrival and a packet transmission depend on the state of the queue. In particular, in each period where there are i packets in the node, one of the following occurs: (i) one new packet arrives; this happens with a given probability bi . We assume that bi > 0 for i < m, and bm == O.
(ii) one existing packet completes transmission; this happens with a given probability di
> 0 if i 2:
1, and with probability 0 otherwise;
(iii) no new packet arrives and no existing packet completes transmission; this happens with probability 1 - bi di if i 2: 1, and with probability 1 - bi if i == O. Calculate the steady-state probabilities of the corresponding Markov chain.
Markov Chains
366
Chap. 6
Figure 6.23: Transition probability graph for Problem 21.
Solution. We introduce a Markov chain where the states are 0,1, ... ,m, and correspond to the number of packets currently stored at the node. The transition probability graph is given in Fig. 6.23. Similar to Example 6.9, we write down the local balance equations, which take the form i == 0,1, ... , m - 1. 7fi bi == 7fi+1di+1, Thus we have
7fi+1
==
where
pi7fi,
Pi ==
Hence 1 == 1T'0
7fi
== (po'"
Pi-1)7fO
for i
== 1, ... , m. By using the normalization equation
+ 7f1 + ... + 7fm , we obtain 1 == 7fo(l + po + POP1 + ... + po ... Pm-I),
from which 7fo
1 == - - - - - - - - - - - - - 1 + po + POPI + ... + po ... Pm-l
The remaining steady-state probabilities are 7fi
== - - - - - - - - - - - - - -
1 + po
+ POPI + ... + po ... pm-l
i
== 1, ... ,m.
Problem 22. * Dependence of the balance equations. Show that if we add the first m - 1 balance equations 1T'j == 1T'kPkj, for j 1, ... ,m - 1, we obtain the last equation 1T'm == 7fkPkm'
2::=1
2:;:1
Solution. By adding the first m - 1 balance equations, we obtain m-1
m-}
L:
m
L: L:
1T'j
j=1
7fkPkj
j=l k=l
m-l
==
L: L: 1T'k
k=1
Pkj
j=l
m
==
L:
1T'k(l -
Pkm)
k=1 m-1
==
7f m
+ L: 7fk k=1
m
L: k=1
7fkPkm'
Problems
367
This equation is equivalent to the last balance equation
7r m
==
2::=1 7rkPkm.
Problem 23. * Local balance equations. We are given a Markov chain that has a single recurrent class which is aperiodic. Suppose that we have found a solution 7r1, ... , 7r m to the following system of local balance and normalization equations: i, j
==
1, ... , m,
m
L
7ri
==
i
1,
==
1, ... ,m.
i=1
(a) Show that the
7rj
are the steady-state probabilities.
(b) What is the interpretation of the equations 7riPij == 7rjPji in terms of expected long-term frequencies of transitions between i and j? (c) Construct an example where the local balance equations are not satisfied by the steady-state probabilities.
Solution. (a) By adding the local balance equations m
7riPij
== 7rjPji over
i, we obtain
m
L
7riPij
==
i=1
L
7rjPji
==
7rj,
i=1
so the 7rj also satisfy the balance equations. Therefore, they are equal to the steadystate probabilities. (b) We know that 7riPij can be interpreted as the expected long-term frequency of transitions from i to j, so the local balance equations imply that the expected longterm frequency of any transition is equal to the expected long-term frequency of the reverse transition. (This property is also known as time reversibility of the chain.) (c) We need a minimum of three states for such an example. Let the states be 1,2,3, and let P12 > 0, P13 > 0, P21 > 0, P32 > 0, with all other transition probabilities being 0. The chain has a single recurrent aperiodic "class. The local balance equations do not hold because the expected frequency of transitions from 1 to 3 is positive, but the expected frequency of reverse transitions is O. Problem 24. * Sampled Markov chains. Consider a Markov chain X n with transition probabilities Pij, ~nd let Tij (n) be the n-step transition probabilities. (a) Show that for all n
~
1 and l
~
1, we have
Tij(n+l)
==
LTik(n)Tkj(l). k=1
(b) Suppose that there is a single recurrent class, which is aperiodic. We sample the Markov chain every l transitions, thus generating a process Yn , where Yn == X Zn . Show that the sampled process can be modeled by a Markov chain with a single aperiodic recurrent class and transition probabilities Tij (l). (c) Show that the Markov chain of part (b) has the same steady-state probabilities as the original process.
Markov Cbains
368
Cbap.6
Solution. (a) We condition on X n and use the total probability theorem. We have
rij(n + l) == P(Xn+l == j I X o == i)
==
L P(X
n
== k I X o == i) P(Xn+l == j I X n == k, X o == i)
k=l m
==
L P(X
n
k I X o == i) P(Xn+l == j I X n
k)
k=l
==
L rik(n)rkj(l), k=l
where in the third equality we used the Markov property. (b) Since X n is Markov, once we condition on Xl n , the past of the process (the states X k for k < In) becomes independent of the future (the states X k for k > In). This implies that given Yn , the past (the states Yk for k < n) is independent of the future (the states Yk for k > n). Thus, Yn has the Markov property. Because of our assumptions on X n , there is a time 'ii such that
P(Xn == j I X o == i) > 0, for every n ~ 'ii, every state i, and every state j in the single recurrent class R of the process X n . This implies that
P(Yn
== j I Yo
i) > 0,
for every n ~ 'ii, every i, and every j E R. Therefore, the process Y n has a single recurrent class, which is aperiodic. (c) The n-step transition probabilities rij(n) of the process X n converge to the steadystate probabilities 1T j. The n-step transition probabilities of the process Y n are of the form rij (In), and therefore also converge to the same limits 1Tj. This establishes that the 1T j are the steady-state probabilities of the process Y n .
Problem 25. * Given a Markov chain X n with a single recurrent class which is aperiodic, consider the Markov chain whose state at time n is (Xn-1,X n ). Thus, the state in the new chain can be associated with the last transition in the original chain. (a) Show that the steady-state probabilities of the new chain are
'TJij == 1TiPij, where the 1Ti are the steady-state probabilities of the original chain. (b) Generalize part (a) to the case of the Markov chain (Xn-k,Xn-k+l,""X n ), whose state can be associated with the last k transitions of the original chain.
Solution. (a) For every state (i,j) of the new Markov chain, we have P ((Xn- 1 , X n ) == (i, j))
== P(Xn - 1 == i) P(Xn == j I X n- 1 == i) == P(Xn - 1 == i) Pij.
369
Problems
Since the Markov chain X n has a single recurrent class which is aperiodic, P(Xn - 1 == i) converges to the steady-state probability 1fi, for every i. It follows that P ((X n - 1 , X n ) == (i, j)) converges to
7f iPij,
which is therefore the steady-state probability of (i, j).
(b) Using the multiplication rule, we have
Therefore, by an argument similar to the one in part (a), the steady-state probability of state (io, ... ,ik) is equal to 1fiOPiOil ... Pik-l ik'
SECTION 6.4. Absorption Probabilities and Expected Time to Absorption Problem 26. There are m classes offered by a particular department, and each year, the students rank each class from 1 to m, in order of difficulty, with rank m being the highest. Unfortunately, the ranking is completely arbitrary. In fact, any given class is equally likely to receive any given rank on a given year (two classes may not receive the same rank). A certain professor chooses to remember only the highest ranking his class has ever gotten. (a) Find the transition probabilities of the Markov chain that models the ranking that the professor remembers. (b) Find the recurrent and the transient states. (c) Find the expected number of years for the professor to achieve the highest ranking given that in the first year he achieved the ith ranking.
Problem 27. Consider the Markov chain specified in Fig. 6.24. The steady-state probabilities are known to be: 6
1fl
== 31'
6
9
7T2
== 31'
7T3
== 31'
Assume that the process is in state 1 just before the first transition.
Figure 6.24: Transition probability graph for Problem 27.
(a) What is the probability that the process will be in state 1 just after the sixth transition?
Markov Chains
370
Chap. 6
(b) Determine the expected value and variance of the number of transitions up to and including the next transition during which the process returns to state 1. (c) What is (approximately) the probability that the state of the system resulting from transition 1000 is neither the same as the state resulting from transition 999 nor the same as the state resulting from transition IDOl? Problem 28. Consider the Markov chain specified in Fig. 6.25.
0.5
0.4
0.4
0.4
1
0.2
0.3
Figure 6.25: Transition probability graph for Problem 28.
(a) Identify the transient and recurrent states. Also, determine the recurrent classes and indicate which ones, if any are periodic. (b) Do there exist steady-state probabilities given that the process starts in state I? If so, what are they? (c) Do there exist steady-state probabilities given that the process starts in state 6? If so, what are they? (d) Assume that the process starts in state 1 but we begin observing it after it reaches steady-state. (i) Find the probability that the state increases by one during the first transition we observe.
(ii) Find the conditional probability that the process was in state 2 when we started observing it, given that the state increased by one during the first transition that we observed. (iii) Find the probab~lity that the state increased by one during the first change of state that we observed. (e) Assume that the process starts in state 4. (i) For each recurrent class, determine the probability that we eventually reach that class.
(ii) What is the expected number of transitions up to and including the transition at which we reach a recurrent state for the first time? Problem 29. * Absorption probabilities. Consider a Markov chain where each state is either transient or absorbing. Fix an absorbing state s. Show that the probabilities ai of eventually reaching s starting from a state i are the unique solution to
Problems
371
the equations as = 1, for all absorbing i
=1=
s,
m
=
ai
for all transient i.
LPijaj, j=l
Hint: If there are two solutions, find a system of equations that is satisfied by their difference, and look for its solutions.
Solution. The fact that the ai satisfy these equations was established in the text, using the total probability theorem. To show uniqueness, let ai be another solution, and let 8i = ai - ai. Denoting by A the set of absorbing states and using the fact 8j = 0 for all j E A, we have 8i =
for all transient i.
L P ij 8j j=l
Applying this relation m successive times, we obtain
8i =
Pij1 j1 ~A
Pj1j2 ... L j2~A
Pjm-1jm .
8jm ·
jm~A
Hence
18i I :::;
Pij1 j1 ~A
= P(X1
Pj1j2 ... L j2~A
~
A,
:::; P(X1 ~ A,
Pjm-ljm .
18jm I
jm~A
, X m ~ A I X o = i) . 18jm I ,Xm ~ A\Xo = i)· maxl8j j~A
l.
The above relation holds for all transient i, so we obtain
where jJ
= P(X1
~
A, ... , X m
rt A \ X o = i).
Note that jJ < 1, becaus~ there is positive probability that X m is absorbing, regardless of the initial state. It follows that maXj~A 18j I 0, or ai = ai for all i that are not absorbing. We also have aj = aj for all absorbing j~' so ai = ai for all i. Problem 30. * Multiple recurrent classes. Consider a Markov chain that has more that one recurrent class, as well as some transient states. Assume that all the recurrent classes are aperiodic. (a) For any transient state i, let ai(k) be the probability that starting from i we will reach a state in the kth recurrent class. Derive a system of equations whose solution are the ai (k). (b) Show that each of the n-step transition probabilities and discuss how these limits can be calculated.
Tij (n)
converges to a limit,
Markov Chains
372
Chap. 6
Solution. (a) We introduce a new Markov chain that has only transient and absorbing states. The transient states correspond to the transient states of the original, while the absorbing states correspond to the recurrent classes of the original. The transition probabilities Pij of the new chain are as follows: if i and j are transient, Pij = Pij; if i is a transient state and k corresponds to a recurrent class, Pik is the sum of the transition probabilities from i to states in the recurrent class in the original Markov chain. The desired probabilities ai(k) are the absorption probabilities in the new Markov chain and are given by the corresponding formulas: for all transient i. j: transient
(b) If i and j are recurrent but belong to different classes, rij (n) is always O. If i and j are recurrent but belong to the same class, r ij (n) converges to the steady-state probability of j in a chain consisting of only this particular recurrent class. If j is transient, r ij (n) converges to O. Finally, if i is transient and j is recurrent, then r ij (n) converges to the product of two probabilities: (1) the probability that starting from i we will reach a state in the recurrent class of j, and (2) the steady-state probability of j conditioned on the initial state being in the class of j. Problem 31. * Mean first passage times and expected times to absorption. Consider a Markov chain with a single recurrent class, and let s be a fixed recurrent state. Show that the system of equations m
t s = 0,
1+
ti
LPijtj,
for all i
f. s,
j=l
satisfied by the mean first passage times, has a unique solution. Show a similar result for the system of equations satisfied by the expected times to absorption j.Li. Hint: If there are two solutions, find a system of equations that is satisfied by their difference, and look for its solutions.
Solution. Let ti be the mean first passage times. These satisfy the given system of equations. To show uniqueness, let ti be another solution. Then we have for all i f. s 1+
ti
Pijtj,
ti
= 1 + LPijtj,
j:j:.s
j:j:.s
and by subtraction, we obtain
8i =
L Pij 8 j , j:j:.s
where 8i = ti
ti.
By applying m successive times this relation, if follows that
L
8i
Pijl
jl:j:.s
Hence, we have for all i
18i l ::;
f.
Pjlj2 ... j2:j:.s
L
Pjm-ljm . 8 jm ·
jm:j:.s
s
L
Pijl
jl:j:.s
= P(X1
L
Pjlj2 ...
j2:j:.s
f.
s, .. . , X m
L
Pjm-ljm . ffi?-X J
jm:j:.s
f.
s 1X o
18j I
i) . m~x 18j l· J
373
Problems
On the other hand, we have P(X1 =I- s, .. . ,Xm =I- s I X o == i) < 1. This is because starting from any state there is positive probability that s is reached in m steps. It follows that all the Oi must be equal to zero. The proof for the cas~ of the system of equations satisfied by the expected times to absorption is nearly identical.
Problem 32. * Balance equations and mean recurrence times. Consider a Markov chain with a single recurrent class, and let s be a fixed recurrent state. For any state i, let Pi
== E [Number of visits to i between two successive visits to sJ,
where by convention, ps (a) Show that for
~ll
== 1.
i, we have m
==
Pi
L
PkPki·
k=l
(b) Show that the numbers Pi t~
1, ...
,
,m,
sum to 1 and satisfy the balance equations, where t; is the mean recurrence time of s (the expected number of transitions up to the first return to s, starting from s). (c) Show that if to 1, then
1i"1, ... , 1i" m
are nonnegative, satisfy the balance equations, and sum
1i"i
== {
~,
0, where
ti
if i is recurrent, if i is transient,
is the mean recurrence time of i.
(d) Show that the distribution of part (b) is the unique probability distribution that satisfies the balance equations.
Note: This problem not only provides an alternative proof of the existence and uniqueness of probability distributions that satisfy the balance equations, but also makes an intuitive connection between steady-state probabilities and mean recurrence times. The main idea is to break the process into "cycles," with a new cycle starting each time that the recurrent state s is visited. The steady-state probability of state s can be interpreted as the long-term expected frequency of visits to state s, which is inversely proportional to the average time between consecutive' visits (the mean recurrence time); cf. part (c). Furthermore, if a state i is expected to be visited, say, twice as often as some other state j during a typical cycle, it is plausible that the long-term expected frequency 1fi of state i will be twice as large as 1i"j. Thus, the steady-state probabilities 1i"i should be proportional to the expected number of visits Pi during a cycle; cf. part
(b). Solution. (a) Let us consider the Markov chain X n , initialized with X o assert that for all i, 00
Pi
== LP(X1 =I- S, ... ,Xn - 1 =I- S,Xn == i). n=l
== s. We first
lvfarkov Chains
374
Chap. 6
To see this, we first consider the case i i- s, and let In be the random variable that takes the value 1 if Xl i- s, ... ,Xn -1 i- s, and X n == i, and the value 0 otherwise. Then, the number of visits to state i before the next visit to state s is equal to In. Thus,t
2::::1
n=l
==
L P(X
1
i-
s, ... ,Xn -
1
i-
s, X n == i).
n=l
When i == s, the events {Xl i- S, ... , X n - 1 i- s, X n == s}, for the different values of n, form a partition of the sample space, because they correspond to the different possibilities for the time of the next visit to state s. Thus, P(X1
i- s, ... ,Xn - 1 i- s, X n == s)
== 1 == ps,
n=l
which completes the verification of our assertion. We next use the total probability theorem to write for n 2: 2, P(X1
i- s, ... ,Xn - 1 i- s, X n
== i) ==
P(X1
i- S, ... ,Xn - 2 i- s, X n - 1
k)Pki.
k=j:s
t The interchange of the infinite summation and the expectation in the subsequent calculation can be justified by the following argument. We have for any k > 0,
Let T be the first positive time that state s is visited. Then,
E
[
ex:>] In n=k+1
f'P(T=t)E t=k+2
[f
n=k+1
2::1
In
I T=t]:::;
00
tP(T=t).
t=k+2
00
Since the mean recurrence time t peT == t) is finite, the limit, as k ---+ of 2::k+2 t peT == t) is equal to zero. We take the limit of both sides of the earlier equation, as k ---+ to obtain the desired relation
00,
Problems
375
We thus obtain (Xl
Pi
==
L: P(X
:f:
1
S, ...
,Xn -
1
S, ...
,Xn -
:f:
S,
Xn
== i)
n=l
== Psi +
L: P(X
1
:f:
:f:
1
S,
Xn
== i)
:f:
Xn-
n=2 (Xl
== Psi +
L: P(X
1
:f:
S, ...
,Xn -
2
S,
1
==
k)Pki
n=2 k=l-s
== psi + L:Pki k=l-s
== pspsi +
L: P(X
1
:f:
S, ...
,Xn -
2
:f:
S,
Xn-
1
k)
n=2
L: PkiPk k=l-s
==
L: PkPki· k=l
(b) Dividing both sides of the relation established in part (a) by
t~,
we obtain
m
1fi
==
1fkPki, k=l
where 1fi == Pi/t~. Thus, the 1fi solve the balance equations. Furthermore, the 1fi are nonnegative, and we clearly have 2:::1 Pi == t~ or 2:::1 1fi == 1. Hence, (1fl, ... ,1fm ) is a probability distribution. (c) Consider a probability distribution (1f1, ... , 1fm) that satisfies the balance equations. Fix a recurrent state s, let t~ be the mean recurrence time of s, and let ti be the mean first passage time from a statei:f: s to state s. We will show that 1fst~ == 1. Indeed, we have == 1 + L:PSjtj,
t:
j=l-s
for all i :f: s.
. ti==l+L:Pijtj, j=l-s
Multiplying these equations with 1fsand 1fi, respectively, and adding, we obtain m
1fst:
+
1fiti
== 1 +
i=l-s
L:
Pijtj.
1fi
i=l
j=l-s
By using the balance equations, the right-hand side is equal to m
m
1+
L: i=l
1fi
L:Pijtj j=l-s
== 1 +
L: tj L: 1fiPij == 1 + 'L: tj1fj. j=l-s
i=l
j=l-s
A1arkov Chains
376
Chap. 6
t;
By combining the last two equations, we obtain 7T s == 1. Since the probability distribution (7Tl, .•. , 7T m) satisfies the balance equations, if the initial state X o is chosen according to this distribution, all subsequent states X n have the same distribution. If we start at a transient state i, the probability of being at that state at time n diminishes to 0 as n -+ 00. It follows that we must have 7Ti == O. (d) Part (b) shows that there exists at least one probability distribution that sati;sfies the balance equations. Part (c) shows that there can be only one such probability distribution.
SECTION 6.5. Continuous-Time Markov Chains Problem 33. A facility of m identical machines is sharing a single repairperson. The time to repair a failed machine is exponentially distributed with mean 1/.:\. A machine once operational, fails after a time that is exponentially distributed with mean 1/ /-L. All failure and repair times are independent. (a) Find the steady-state probability that there is no operational machine. (b) Find the expected number of operational machines, in steady-state. Problem 34. An athletic facility has 5 tennis courts. Pairs of players arrive at the courts according to a Poisson process with rate of one pair per 10 minutes, and use a court for an exponentially distributed time with mean 40 minutes. Suppose a pair of players arrives and finds all courts busy and k other pairs waiting in queue. What is the expected waiting time to get a court? Problem 35. Empty taxis pass by a street corner at a Poisson rate of two per minute and pick up a passenger if one is waiting there. Passengers arrive at the street corner at a Poisson rate of one per minute and wait for a taxi only if there are less thq,n four persons waiting; otherwise they leave and never return. Penelope arrives at the street corner at a given time. Find her expected waiting time, given that she joins the queue. Assume that the process is in steady-state. Problem 36. There are m users who share a computer system. Each user alternates between "thinking" intervals whose durations are independent exponentially distributed with parameter .:\, and an "active" mode that starts by submitting a service request. The server can only serve one request at a time, and will serve a request completely before serving other requests. The service times of different requests are independent exponentially distributed random variables with parameter /-L, and also independent of the thinking times of the users. Construct a Markov chain model and derive the steady-state distribution of the number of p'ending requests, including the one presently served, if any. Problem 37. * Consider a continuous-time Markov chain in which the transition rates are the same for all i. Assume that the process has a single recurrent class.
Vi
(a) Explain why the sequence Y n of transition times form a Poisson process. (b) Show that the steady-state probabilities of the Markov chain X (t) are the same as the steady-state probabilities of the embedded Markov chain X n .
377
Problems
Solution. (a) Denote by v the common value of the transition rates Vi. The sequence {Yn } is a sequence of independent exponentially distributed time intervals with parameter v. Therefore they can be associated with the arrival times of a Poisson process with rate v. (b) The balance and normalization equations for the continuous-time chain aFe =
qjk
7fj
k=j=j
j
L 7 fk Qkj,
1, ...
,m,
k=j=j
m
1=
L 7 fk . k=l
By using the relation qjk = are written as
Vpjk,
and by canceling the common factor v, these equations j = 1, ...
We have
Ek=j=j Pjk
= 1
Pjj,
,m,
so the first of these two equations is written as
7fj
(1
Pjj)
=
7fkPkj, k=j=j
or
m 7fj
=
7fkPkj,
j = 1, ... ,m.
k=l
These are the balance equations for the embedded Markov chain, which have a unique solution since the embedded Markov chain has a single recurrent class, which is aperiodic. Hence the 7fj are the steady-state probabilities for the embedded Markov chain.
7
imit Theorems
379
Limit Theorems
380
Chap. 7
In this chapter, we discuss some fundamental issues related to the asymptotic behavior of sequences of random variables. Our principal context involves a sequence Xl, X2, ... of independent identically distributed random variables with mean f.-L and variance 0'2. Let
Sn==XI+",+Xn be the sum of the first n of them. Limit theorems are mostly concerned with the properties of Sn and related random variables as n becomes very large. Because of independence, we have
var(Sn) == var(XI) + ... + var(Xn ) == nO' 2 . Thus, the distribution of Sn spreads out as n increases, and cannot have a meaningful limit. The situation is different if we consider the sample mean Mn==XI+",+X n Sn n n A quick calculation yields
==
0'2
var(Mn ) == -. n In particular, the variance of Mn decreases to zero as n increases, and the bulk of the distribution of M n must be very close to the mean f.-L. This phenomenon is the subject of certain laws of large numbers, which generally assert that the sample mean M n (a random variable) converges to the true mean f.-L (a number), in a precise sense. These laws provide a mathematical basis for the loose interpretation of an expectation E[X] == f.-L as the average of a large number of independent samples drawn from the distribution of X. We will also consider a quantity which is intermediate between Sn and M n . We first subtract nf.-L from Sn, to obtain the zero-mean random variable Sn - nf.-L and then divide by a yri, to form the random variable E[Mn ]
f.-L,
Zn
- Sn - nf.-L O'yri .
It can be seen that
E[Zn] == 0,
var(Zn) == 1.
Since the mean and the variance of Zn remain unchanged as n increases, its distribution neither spreads, nor shrinks to a point. The central limit theorem is concerned with the asymptotic shape of the distribution of Zn and asserts that it becomes the standard normal distribution. Limit theorems are useful for· several reasons: (a) Conceptually, they provide an interpretation of expectations (as well as probabilities) in terms of a long sequence of identical independent experiments. (b) They allow for an approximate analysis of the properties of random variables such as Sn. This is to be contrasted with an exact analysis which would require a formula for the PMF or PDF of Sn, a complicated and tedious task when n is large.
Sec. 7.1
381
Markov and Chebyshev Inequalities
7.1 MARKOV AND CHEBYSHEV INEQUALITIES In this section, we derive some important inequalities. These inequalities use the mean, and possibly the variance, of a random variable to draw conclusions on the probabilities of certain events, They are primarily useful in situations where the mean and variance of a random variable X are easily computable, but the distribution of X is either unavail~ble or hard to calculate. We first present the Markov inequality. Loosely speaking, it asserts that if a nonnegative random variable has a small mean, then the probability that it takes a large value must also be small.
Markov ......
c"t.,...'1I'11""l>I·'IIT'v;r
To justify the Markov inequality, let us fix a positive number a and consider the random variable Ya defined' by Ya =
{O,a,
if X < a, if X ~ a.
It is seen that the relation always holds and therefore, E[Ya ] ~ E[X].
On the other hand, E[Ya ]
= aP(Ya = a) = aP(X
~ a),
from which we obtain aP(X ~ a) ~ E[X];
see Fig. 7.1 for an illustration.
Example 7.1. Let X be uniformly distributed in the interval [0,4] and note that E[X] == 2. Then, the Markov inequality asserts that
P(X :2: 2) :::;
~ = 1,
P(X :2: 3) :::;
~
= 0.67,
P(X ~ 4) ~
2
4 == 0.5.
Limit Theorems
382
Chap. 7
(a)
o
x
a
P(Ya = 0)
(b)
P(Ya = a) a
o
Figure 7.1: Illustration of the derivation of the Markov inequality. Part (a) of the figure shows the PDF of a nonnegative random variable X. Part (b) shows the PMF of a related random variable Y a , which is constructed as follows. All of the probability mass in the PDF of X that lies between 0 and a is assigned to 0, and all of the mass that lies above a is assigned to a. Since mass is shifted to the left, the expectation can only decrease and, therefore,
E[X]
~ E[Ya ]
==
aP(Ya
==
a)
==
aPeX ~ a).
By comparing with the exact probabilities
P(X 2:: 2) == 0.5,
P(X 2:: 3) == 0.25,
P(X 2:: 4) == 0,
we see that the bounds provided by the Markov inequality can be quite loose.
We continue with the Chebyshev inequality. Loosely speaking, it asserts that if the variance of a random variable is small, then the probability that it takes a value far from its mean is also small. Note that the Chebyshev inequality does not require the random variable to be nonnegative.
Chebyshev Inequality If X is a random variable with mean f.-L and variance
p(IX - Ikl
~
0'2
c) ::; 2' c
for all c
0'2,
>
then
o.
Sec. 7.2
The Weak Law of Large Numbers
383
To justify the Chebyshev inequality, we consider the nonnegative random variable (X - J-L)2 and apply the Markov inequality with a == c2. We obtain
We complete the derivation by observing that the event (X - J-L)2 2:: c2 is identical to the event IX - J-LI 2:: C, so that
An alternative form of the Chebyshev inequality is obtained by letting C
== k(7, where k is positive, which yields
p(IX - p,1 ;:::
ka)
:s;
(72
1
k 2a 2 = k 2 '
Thus, the probability that a random variable takes a value more than k standard deviations away from its mean is at most 1/ k 2 . The Chebyshev inequality tends to be more powerful than the Markov inequality (the bounds that it provides are more accurate), because it also uses information on the variance of X. Still, the mean and the variance of a random variable are only a rough summary of its properties, and we cannot expect the bounds to be close approximations of the exact probabilities.
Example 7.2. As in Example 7.1, let X be uniformly distributed in [0,4]. Let us use the Chebyshev inequality to bound the probability that IX - 21 ~ 1. We have a 2 = 16/12 = 4/3, and
which is uninformative. For another example, let X be exponentially distributed with parameter A = 1, so that E[X] = var(X) = 1. For c > 1, using the Chebyshev inequality, we obtain
P(X :::: c)
= P(X -
1
1 :::: c - 1) :S P (IX - 11 :::: c - 1) :S (c _ 1)2 .
This is again conservative compared to the exact answer P(X 2:: c) = e- c .
7.2 THE "WEAK LA"W OF LARGE NUMBERS The weak law of large numbers asserts that the sample mean of a large number of independent identically distributed random variables is very close to the true mean, with high probability.
Limit Theorems
384
Chap. 7
As in the introduction to this chapter, we consider a sequence Xl, X 2 , ... of independent identically distributed random variables with mean f..l and variance 0- 2 , and define the sample mean by Mn
We have E[M
n
]
+X == Xl + ... .. ,n . n
= E[X1 ] + ... + E[Xn ] = np, = p" n
n
and, using independence, var (
M) n
==
var(X1
+ ... + X n ) n2
var(XI)
+ ... + var(Xn ) n2
na 2 n2
a2 n
We apply the Chebyshev inequality and obtain for any
E
> O.
We observe that for any fixed E > 0, the right-hand side of this inequality goes to zero as n increases. As a consequence, we obtain the weak law of large numbers, which is stated below. It turns out that this law remains true even if the Xi have infinite variance, but a much more elaborate argument is needed, which we omit. The only assumption needed is that E[Xi] is finite.
The weak law of large numbers states that for large n, the bulk of the distribution of M n is concentrated near jl;. That is, if we consider a positive length interval [f..l - E, f1 + E] around f..l, then there is high probability that M n will fall in that interval; as n -+ 00, this probability converges to 1. Of course, if E is very small, we may have to wait longer (i.e., need a larger value of n) before we can assert that M n is highly likely to fall in that interval.
Example 7.3. Probabilities and Frequencies. Consider an event A defined in the context of some probabilistic experiment. Let p = P(A) be the probability
Sec. 7.2
The Weak Law of Large Numbers
385
of that event. We consider n independent repetitions of the experiment, and let M n be the fraction of time that event A occurs; in this context, lVln is often called the empirical frequency of A. Note that
Mn
n = Xl +",+X , n
where Xi is 1 whenever A occurs, and 0 otherwise; in particular, E[Xi ] = p. The weak law applies and shows that when n is large, the empirical frequency is most likely to be within E of p. Loosely speaking, this allows us to say that empirical frequencies are faithful estimates of p. Alternatively, this is a step towards interpreting the probability p as the frequency of occurrence of A.
Example 7.4. Polling. Let p be the fraction of voters who support a particular candidate for office. We interview n "randomly selected" voters and record M n , the fraction of them that support the candidate. We view M n as our estimate of p and would like to investigate its properties. We interpret "randomly selected" to mean that the n voters are chosen independently and uniformly from the given population. Thus, the reply of each person interviewed can be viewed as an independent Bernoulli random variable Xi with success probability p and variance 0"2 = p(l - p). The Chebyshev inequality yields
p(l ~ p). nE The true value of the parameter p is assumed to be unknown. On the other hand, it is easily verified that p(l - p) ::; 1/4, which yields
P(IMn
-
pi ::=:
E) ::;
1
P(IMn-PI~E)::;-4 2' nE For example, if
E
= 0.1
and n
= 100, we obtain
1 P(IMlQO -pi::=: 0.1) ::; 4.100. (0.1)2
= 0.25.
In words, with a sample size of n = 100, the probability that our estimate is incorrect by more than 0.1 is no larger than 0.25. Suppose now that we impose some tight specifications on our poll. We would like to have high confidence (probability at least 95%) that our estimate will be very accurate (within .01 of p). How many voters should be sampled? The only guarantee that we have at this point is the inequality
P
(IMn
-
1 pi ::=: 0.01) ::; 4n(0.01)2·
We will be sure to satisfy the above specifications if we choose n large enough so that 1 4n(0.01)2 ::; 1 - 0.95 = 0.05, which yields n ~ 50,000. This choice of n satisfies our specifications, but turns out to be fairly conservative, because it is based on the rather loose Chebyshev inequality. A refinement will be considered in Section 7.4.
Limit Theorems
386
Chap. 7
7.3 CONVERGENCE IN PROBABILITY We can interpret the weak law of large numbers as stating that "Mn converges to J-L." However, since MI, M2, ... is a sequence of random variables, not a sequence of numbers, the meaning of convergence has to be made precise. A particular definition is provided below. To facilitate the comparison with the ordinary notion of convergence, we also include the definition of the latter.
Convergence of a Deterministic Sequence Let· aI, a2, ... ·bea sequence of real numbers, and let a be another real number. We say that the sequence an converges to a, orlimn-+ooan = a, if for every E>. 0 there exists some. no such that for alln .~ no.
Intuitively, if limn-+ oo an = a, then for any given accuracy level E, an must be within E of a, when n is large enough.
Convergence in Probability Let Yl, Y2, ... be a sequence of random variables (no~ necessarily independent}, and let a be areal number. We say that the sequence Yn converges a probability, if for every E> O,we 0, and for every 0, there exists some no such that for all n
~
no.
If we refer to Eas the accuracy level, and 0, we have using the independence of the X n ,
p(IYn
01 ~ €) = P(X1 ~ €,
-
= P(XI
~
€)
,Xn ~ €) P(Xn ~ €)
=(1_€)n.
In particular, lim
n-+oo
p(IYn
Since this is true for every bility.
€
01 ~
-
> 0,
€) = n-+oo lim (1 -
€)n
= 0.
we conclude that Yn converges to zero, in proba-
Example 7.6. Let Y be an exponentially distributed random variable with parameter A = 1. For any positive integer n, let Yn = Y / n. (Note that these random variables are dependent.) We wish to investigate whether the sequence Y n converges to zero. For € > 0, we have
In particular, lim
n-+oo
p(IYn
Since this is the case for every
01 ~
€
€) = n-+oo lim e- n = 0. €
> 0, Y n converges to zero, in probability.
One might be tempted to believe that if a sequence Yn converges to a number a, then E[Yn ] must also converge to a. The following example shows that this need not be the case, and illustrates some of the limitations of the notion of convergence in probability.
Example 7.7. Consider a sequence of discrete random variables Y n with the following distribution: 1 1--,
fory=O,
1
for y = n 2 ,
n
P(Yn = y) =
n'
elsewhere;
Limit Theorems
388
see Fig. 7.2 for an illustration. For every
lim
n-l-oo
E
> 0,
Chap. 7
we have
p(IYnl ~ E) = n-+oo lim .!. = 0, n
and Yn converges to zero in probability. On the other hand, E[Yn ] = n 2 In = n, which goes to infinity as n increases.
I-lin
PMF of Yn
lin
o Figure 7.2: The PMF of the random variable Y n in Example 7.7.
7.4 THE CENTRAL LIMIT THEOREM According to the weak law of large numbers, the distribution of the sample mean lVln is increasingly concentrated in the near vicinity of the true mean J..L. In particular, its variance tends to zero. On the other hand, the variance of the sum Sn == Xl + ... + X n == nMn increases to infinity, and the distribution of Sn cannot be said to converge to anything meaningful. An intermediate view is obtained by considering the deviation Sn - nJ..L of Sn from its mean nJ..L, and scaling it by a factor proportional to 1/ Vii. What is special about this particular scaling is that it keeps the variance at a constant level. The central limit theorem asserts that the distribution of this scaled random variable approaches a normal distribution. More specifically, let Xl, X 2, . .. be a sequence of independent identically distributed random variables with mean J..L and variance a 2 . We define
Zn == Sn - nJ..L a Vii An easy calculation yields E[Zn] = E[X1
+ ... + X n] a Vii
n/-L = 0,
Sec. 7.4
The Central Limit Theorem
389
and
var(Zn) = var(X1 + ... + X n ) = Var(Xl) a2n
+ ... + var(Xn)
= na
a2n
2
= 1.
na 2
The Central Limit· Theorem Let Xl,X2 , .•.. be a sequence of independent identically distributed random variables with common ·mean J.L and variance a 2 , and define
Then, the
of Zn converges to the standard normal CDF
P(Zn ~z)
==
for every z.
The central limit theorem is surprisingly general. Besides independence, and the implicit assumption that the mean and variance are finite, it places no other requirement on the distribution of the Xi, which could be discrete, continuous, or mixed; see the end-of-chapter problems for an outline of its proof. The theorem is of tremendous importance for several reasons, both conceptual and practical. On the conceptual side, it indicates that the sum of a large number of independent random variables is approximately normal. As such, it applies to many situations in which a random effect is the sum of a large number of small but independent random factors. Noise in many natural or engineered systems has this property. In a wide array of contexts, it has been found empirically that the statistics of noise are well-described by normal distributions, and the central limit theorem provides a convincing explanation for this phenomenon. On the practical side, the central limit theorem eliminates the need for detailed probabilistic models, and for tedious manipulations of PMFs and PDFs. Rather, it allows the calculation of certain probabilities by simply referring to the normal CDF table. Furthermore, these calculations only require the knowledge of means and variances.
Approximations Based on the Central Limit Theorem The central limit theorem allows us to calculate probabilities related to Zn as
Limit Theorems
390
Chap. 7
if Zn were normal. Since normality is preserved under linear transformations, this is equivalent to treating Sn as a normal random variable with mean nJ.-L and variance na 2 .
Normal Approximation Based on the· Central Limit Theorem Let Sn = XI·+ ... + X n , where the Xi are independent identically distributed random variables with mean J.-L and variance 0- 2 • If n is large, the probability P(Sn :::; c) can be approximated by treating Sn as if it were normal, according to the following procedure. 1. Calculate the meannJ.-L and the variancen0- 2 of Sn. 2. Calculatethe normalized valuez =(c -nJ.-L}/a@. 3. Use the approximation P(Sn :::; c)
~~(z),
where ~(z) is available from standard normal CDF tables.
Example 7.8. We load on a plane 100 packages whose weights are independent random variables that are uniformly distributed between 5 and 50 pounds. What is the probability that the total weight will exceed 3000 pounds? It is not easy to calculate the CDF of the total weight and the desired probability, but an approximate answer can be quickly obtained using the central limit theorem. We want to calculate P(SlOO > 3000), where SlOO is the sum of the weights of 100 packages. The mean and the variance of the weight of a single package are I/. 5 +2 50 -- 275 ,..-.,
(J2
=
(50 - 5)2 12
= 168.75 '
based on the formulas for the mean and variance of the uniform PDF. We thus calculate the normalized value Z
=
3000 - 100 . 27.5 V168.75·100
=
250 129.9
= 1.92 '
and use the standard normal tables to obtain the approximation P(SlOO :::; 3000) ~ 3000)
= 1 - P(SIOO :::; 3000) ~ 1 - 0.9726 = 0.0274.
Example 7.9. A machine processes parts, one at a time. The processing times of different parts are independent random variables, uniformly distributed in [1,5].
Sec. 7.4
The Central Limit Theorem
391
We wish to approximate the probability that the number of parts processed within 320 time units, denoted by N 320 , is at least 100. There is no obvious way of expressing the random variable N 320 as a sum of independent random variables, but we can proceed differently. Let Xi be the processing time of the ith part, and let 8 10 0 == Xl +...+ X 100 be the total processing time of the first 100 parts. The event {N320 ~ 100} is the same as the event {8lo o ::; 320}, and we can now use a normal approximation to the distribution of 8 10 0. Note that J.-t == E[Xi ] == 3 and (J2 == var(Xi ) == 16/12 == 4/3. We calculate the normalized value Z
== 320 - nJ.-t
== 320 - 300
(JVii
== 1.73,
J100 . 4/3
and use the approximation
P(8l00
::;
320)
~
O.
(e) Apply the result of part (c) to obtain a bound for P (X 2:: a), for the case where X is a standard normal random variable and a > O. (f) Let Xl, X 2 , •.. be independent random variables with the same distribution as X. Show that for any a > E[X], we have
so that the probability that the sample mean exceeds the mean by a certain amount decreases exponentially with n.
Solution. (a) Given some a and s 2:: 0, consider the random variable Y a defined by
< a,
Ya == {O~a e , if X 2:: a. if X
Limit Theorems
400
Chap. 7
It is seen that the relation always holds and therefore,
On the other hand,
from which we obtain
(b) The argument is similar to the one for part (a). We define Y a by
sa Y a = { e , if X ::; a, 0,
Since s
if X
> a.
< 0, the relation
always holds and therefore,
On the other hand,
from which we obtain
P(X ::; a) ::; e- sa M(s).
(c) Since the inequality from part (a) is valid for every 8 2:: 0, we obtain P(X ~ a)::; min (e-saM(s)) s2::0
= mine-(sa-InM(s)) s2::0
= e- max s2::o (sa-In M(S))
(d) For s = 0, we have sa-lnM(s)=O-lnl=O, where we have used the generic property M(O) = 1 of transforms. Furthermore,
lId
d -(sa-lnM(s)) =a---·-M(s) I =a-l·E[X]>O. ds s=o M(s) ds s=o
Problems
401
Since the function sa -In 1\1(s) is zero and has a positive derivative at s = 0, it must be positive when s is positive and small. It follows that the maximum ¢( a) of the function sa - In M (s) over all s 2: is also positive.
°
s2
(e) For a standard normal random variable X, we have 1\1 (s) = e /2. Therefore, sa - In M (s) = sa - s2/2. To maximize this expression over all s 2: 0, we form the derivative, which is a - s, and set it to zero, resulting in s = a. Thus, ¢(a) = a 2/2, which leads to the bound Note: In the case where a < 0, the maximizing value of s turns out to be s = 0, resulting in ¢( a) = 0 and in the uninteresting bound P(X 2: a) :::; 1.
(f) Let Y = Xl
+ ... + X n . p
Using the result of part (c), we have
(~.n
a)
Xi 2:
= P(Y
2: na) :::;
e-¢y(na),
~=l
where ¢y(na) = max (nsa -lnMy(s)), s~O
and is the transform associated with Y. We have In1\1y(s) = nlnM(s), from which we obtain ¢y(na) = n· max (sa -lnM(s)) = n¢(a), s~O
and
Note that when a > E[X), part (d) asserts that ¢( a) decreases exponentially with n.
> 0, so the probability of interest
Problem 2. * Jensen inequality. A twice differentiable real-valued function f of a single variable is called convex if its second derivative (d 2 f / dx 2 ) (x) is nonnegative for all x in its domain of definition. (a) Show that the functions f(x)
=e
QX ,
f(x) = -lnx, and f(x) = x 4 are all convex.
(b) Show that if f is twice differentiable and convex, then the first order Taylor approximation of f is an underestimate of the function, that is,
f(a) for every a and x.
+ (x -
a) : (a) :::; f(x),
Limit Theorems
402
(c) Show that if
Chap. 7
f has the property in part (b), and if X is a random variable, then f (E[X]) :::; E [f(X) ] .
Solution. (a) We have 2
d ax 2 ax dx 2 e = a e
d2 -d2(-lnx) x
0
> ,
2
1
d 4 x = 4 . 3 . x2 > dx 2
= 2" > 0, x
o.
(b) Since the second derivative of f is nonnegative, its first derivative must be nondecreasing. Using the fundamental theorem of calculus, we obtain
+
f(x) = f(a)
l
X
~ (t) dt ~ f(a) +
l
x
~ (a) dt =
f(a)
+ (x -
a) ~~ (a).
(c) Since the inequality from part (b) is assumed valid for every possible value x of the random variable X, we obtain
f(a) We now choose a
= E[X]
+ (X -
a) ~~ (a)
~
f(X).
and take expectations, to obtain
f (E[XJ) + (E[X] - E[XJ) ~ (E[XJ) ~ E [f (X)] , or
f(E[XJ) :::; E[f(X)J. SECTION 7.2. The Weak Law of Large Numbers Problem 3. In order to estimate f, the true fraction of smokers in a large population, Alvin selects n people at random. His estimator M n is obtained by dividing Sn, the number of smokers in his sample, by n, Le., M n = Sn/n. Alvin chooses the sample size n to be the smallest possible number for which the Chebyshev inequality yields a guarantee that where E and [) are some prespecified tolerances. Determine how the value of n recommended by the Chebyshev inequality changes in the following cases. (a) The value of
E
is reduced to half its original value.
(b) The probability [) is reduced to [) /2.
SECTION 7.3. Convergence in Probability
*
Problem 4. Consider two sequences of random variables Xl, X 2, ... and Y1 , Y 2 , ... , associated with the same experiment. Suppose that X n converges to a and Yn converges to b, in probability. Show that X n + Y n converges to a + b, in probability.
Problems
403
Solution. We need to show that P(IXn + Y n - a - bl ~ E) converges to zero, for any > o. To bound this probability, we note that for IXn + Y n - a - bl to be as large as E, we need either IXn - al or \Yn - bl (or both) to exceed E/2. Therefore, in terms of events, we have E
This implies that
and lim
n-+oo
p(IXn+Yn a-bl
~E):; lim
n-+oo
where the last equality follows
p(IXn al
becau~e
~E/2)+ lim
n-+oo
p(IYn-bl
~E/2) ==0,
of the assumption that the sequences X n and
Y n converge, in probability, to a and b, respectively.
Problem 5. * A sequence X n of random variables is said to converge to a number c in the mean square, if lim E [(X n - c)2J == o. n-+oo
(a) Show that convergence in the mean square implies convergence in probability. (b) Give an example that shows that convergence in probability does not imply convergence in the mean square.
Solution. (a) Suppose that X n converges to c in the mean square. Using the Markov inequality, we have
P ( IXn Taking the limit as n
-
cl
~ E) ==
--+ 00,
P ( IXn
-
cl
2
2)
~ E:;
E [(X n
2
C)2J .
E
we obtain lim
n-+oo
p(IXn- cl
~
E)
== 0,
which establishes convergence in probability. (b) In Example 7.7, we have convergence in probability to 0 but E[Y;] == n 3 , which diverges to infinity.
SECTION 7.4. The Central Limit Theorem Problem 6. Before starting to play the roulette in a casino, you want to look for biases that you can exploit. You therefore watch 100 rounds that result in a number between 1 and 36, and count the number of rounds for which the result is odd. If the count exceeds 55, you decide that the roulette is not fair. Assuming that the roulette is fair, find an approximation for the probability that you will make the wrong decision.
Limit Theorems
404
Chap. 7
Problem 7. During each day, the probability that your computer's operating system crashes at least once is 5%, independent of every other day. You are interested in the probability of at least 45 crash-free days out of the next 50 days. (a) Find the probability of interest by using the normal approximation to the binomial. (b) Repeat part (a), this time using the Poisson approximation to the binomial.
Problem 8. A factory produces X n gadgets on day n, where the X n are independent and identically distributed random variables, with mean 5 and variance 9. (a) Find an approximation to the probability that the total number of gadgets produced in 100 days is less than 440. (b) Find (approximately) the largest value of n such that
P(XI
+ ... + X n
~
200 + 5n) ::; 0.05.
(c) Let N be the first day on which the total number of gadgets produced exceeds 1000. Calculate an approximation to the probability that N ~ 220.
Problem 9. Let Xl, YI,X 2 , Y2 , ... be independent random variables, uniformly distributed in the unit interval [0,1]. Let
Find a numerical approximation to the quantity
p(IW - E[W]I < 0.001). Problem 10. * Proof of the central limit theorem. Let Xl, X 2 , . . . be a sequence of independent identically distributed zero-mean random variables with common variance 0- 2 , and associated transform Mx(s). We assume that Mx(s) is finite when -d < s < d, where d is some positive number. Let
(a) Show that the transform associated with Zn satisfies
(b) Suppose that the transform Mx (s) has a second order Taylor series expansion around s = 0, of the form
Problems
405
where 0(8 2) is a function that satisfies lims-+o 0(8 2)/8 2 == O. Find a, b, and c in terms of a 2 . (c) Combine the results of parts (a) and (b) to show that the transform M Zn (8 ) converges to the transform associated with a standard normal random variable, that is, for all
8.
Note: The central limit theorem follows from the result of part (c), together with the fact (whose proof lies beyond the scope of this text) that if the transforms M Zn (8) converge to the transform IV[z(8) of a random variable Z whose CDF is continuous, then the CDFs FZ n converge to the CDF of Z. In our case, this implies that the CDF of Zn converges to the CDF of a standard normal.
Solution. (a) We have, using the independence of the Xi, MZ n (8)
== E[eSZnJ
=E[eXP{a~ Xi}] n
n
E [eSXi/(a vn )]
i=l
(b) Using the moment generating properties of the transform, we have
b == dd 1\{x ( 8 ) I
a == Mx(O) == 1,
8
s=O
and c=
! . .!...-MxCsl! 2 2
d8
== E[X s=o
2
2
== E [X] == 0, 2
]
== a
.
2
(c) We combine the results of parts (a) and (b). We have 2 (
a
( 82 ) ) n + ayb8~n + -C82n - +0 -2an (j
and using the formulas for a, b, and c from part (b), it follows that
We now take the limit as n .-,.
00,
and use the identity
lim n-+oo
(1 + ~)n == e n
C ,
,
Limit Theorems
406
Chap. 7
to obtain
SECTION 7.5. The Strong Law of Large Numbers Problem 11. * Consider two sequences of random variables Xl, X 2 , ••• and YI , Y2 , .••. Suppose that X n converges to a and Yn converges to b, with probability 1. Show that X n +Yn converges to a+b, with probability 1. Also, assuming that the random variables Y n cannot be equal to zero, show that Xn/Yn converges to alb, with probability 1.
Solution. Let A (respectively, B) be the event that the sequence of values of the random variables X n (respectively, Yn ) does not converge to a (respectively, b). Let C be the event that the sequence of values of X n + Yn does not converge to a + band notice that C c Au B. Since X n and Yn converge to a and b, respectively, with probability 1, we have P(A) == 0 and P(B) == O. Hence, P(C) ::; P(A U B) ::; P(A)
+ P(B)
O.
Therefore, P(CC) == 1, or equivalently, X n + Y n converges to a For the convergence of Xn/Yn , the argument is similar.
+ b with probability 1.
Problem 12.* Let X I ,X2 , ••• be a sequence of independent identically distributed random variables. Let YI , Y2, ... be another sequence of independent identically distributed random variables. We assume that the Xi and Yi have finite mean, and that YI + ... + Y n cannot be equal to zero. Does the sequence
converge with probability 1, and if so, what is the limit?
Solution. We have
Zn == (Xl (YI
+ +
+ Xn)/n. + Yn)/n
By the strong law of large numbers, the numerator and denominator converge with probability 1 to E[X] and E[Y], respectively. It follows that Zn converges to E[X]/E[Y], with probability 1 (cf. the preceding problem).
Problem 13. * Suppose that a sequence YI , Y 2 , ... of random variables converges to a real number c, with probability 1. Show that the sequence also converges to c in probability.
Solution. Let C be the event that the sequence of values of the random variables Y n converges to c. By assumption, we have P(C) == 1. Fix some E'> 0, and let Ak be the event that IYn - c/ < E for every n 2:: k. If the sequence of values of the random variables Yn converges to c, then there must exist some k such that for every n 2:: k, this sequence of values is within less than E from c. Therefore, every element of C belongs to Ak for some k, or
Problems
407
Note also that the sequence of events A k is monotonically increasing, in the sense that A k C Ak+1 for all k. Finally, note that the event Ak is a subset of the event {IYk - cl < f}. Therefore, lim P(IYk -
k-+oo
cl < f)
~ lim peAk) k-+oo
= p(Ur=lAk)
~ P(C)
=
1,
where the first equality uses the continuity property of probabilities (Problem 10 in Chapter 1) . It follows that lim p(IYk
k-+oo
cl
~
f)
= 0,
which establishes convergence in probability. Problem 14. * Consider a sequence Y n of nonnegative random variables and suppose that
Show that Yn converges to 0, with probability l. Note: This result provides a commonly used method for establishing convergence with probability 1. To evaluate the expectation of Y n , one typically uses the formula
2::=1
The fact that the expectation and the infinite summation can be interchanged, for the case of nonnegative random variables, is known as the monotone convergence theorem, a fundamental result of probability theory, whose proof lies beyond the scope of this text.
2::=1
Solution. We note that the infinite sum Y n must be finite, with probability 1. Indeed, if it had a positive probability of being infinite, then its expectation would also be infinite. But if the sum of the values of the random variables Y n is finite, the sequence of these values must converge to zero. Since the probability of this event is equal to 1, it follows that the sequence Y n converges to zero, with probability 1. Problem 15. * Consider a sequence of Bernoulli random variables X n , and let pn = P(Xn = 1) be the probability of success in the nth trial. Assuming that pn < show that the number of successes is finite, with probability 1. [Compare with Exercise l.43(b) .]
2::=1
Solution. Using the monotone convergence theorem (see above note), we have
This implies that 00
LXn < 00, n=l
00,
Limit Theorems
408
Chap. 7
with probability 1. We then note that the event { L~=l X n < oo} is the same as the event that there is a finite number of successes. Problem 16.* The strong law of large numbers. Let Xl, X 2 , ... be a sequence of independent identically distributed random variables and assume that E[X;] < 00. Prove the strong law of large numbers.
Solution. We note that the assumption E[X;] < 00 implies that the expected value of the Xi is finite. Indeed, using the inequality Ixl ::; 1 + x 4 , we have
Let us assume first that E[Xi ]
== o. We will show that
We have
Let us consider the various terms in this sum. If one of the indices is different from all of the other indices, the corresponding term is equal to zero. For example, if il is different from i2, is, or i4, the assumption E[Xi ] == 0 yields
Therefore, the nonzero terms in the above sum are either of the form E[X;] (there are n such terms), or of the form E[X; XJ], with i i=- j. Let us count how many terms there are of this form. Such terms are obtained in three different ways: by setting il == i2 i=- is == i 4 , or by setting il is i=- i2 == i4, or by setting il i4 i=- i2 is. For each one of these three ways, we have n choices for the first pair of indices, and n - 1 choices for the second pair. We conclude that there are 3n(n - 1) terms of this type. Thus, E [(Xl + ... + Xn)4] == nE[Xt] + 3n(n - l)E[X;X?]. n4 n4 Using the inequality xy ::; (x 2 + y2)/2, we obtain E[XfX?] ::; E[Xt], and
3E[Xt] ~
It follows that
Problems
409
2::=1
00.
where the last step uses the well known property n- 2 < This implies that 4 (Xl +.. ·+Xn )4 In converges to zero with probability 1 (cf. Problem 14), and therefore, (Xl + ... + Xn)ln also converges to zero with probability 1, which is the strong law of large numbers. For the more general case where the mean of the random variables Xi is nonzero, the preceding argument establishes that (Xl + ... + X n - nE[X l ]) In converges to zero, which is the same as (Xl + ... + Xn)ln converging to E[X l ], with probability 1.
*
Problem 17. The strong law of large numbers for Markov chains. Consider a finite-state Markov chain in which all states belong to a single recurrent class which is aperiodic. For a fixed state s, let Y k be the time of the kth visit to state s. Let also Vn be the number of visits to state s during the first n transitions. (a) Show that Y k I k converges with probability 1 to the mean recurrence time state s. (b) Show that Vnln converges with probability 1 to
t;
of
lit;.
(c) Can you relate the limit of Vnln to the steady-state probability of state s?
Solution. (a) Let us fix an initial state i, not necessarily the same as s. Thus, the random variables Yk+l - Yk, for k 2: 1, correspond to the time between successive visits to state s. Because of the Markov property (the past is independent of the future, given the present), the process "starts fresh" at each revisit to state s and, therefore, the random variables Yk+l - Yk are independent and identically distributed, with mean equal to the mean recurrence time Using the strong law of large numbers, we obtain
t;.
with probability 1. (b) Let us fix an element of the sample space (a trajectory of the Markov chain). Let Yk and V n be the values of the random variables Yk and Vn , respectively. Furthermore, let us assume that the sequence Yk I k converges to t;; according to the result of part (a), the set of trajectories with this property has probability 1. Let us consider some n between the time of the kth visit to state s and the time just before the next visit to that state: For every n in this range, we have
Vn
= k, and also
111 Yk+l n - Yk'
--