DISCRETE STOCHASTIC PROCESSES
Draft of 2nd Edition

R. G. Gallager

August 30, 2009
Contents

1 INTRODUCTION AND REVIEW OF PROBABILITY
  1.1 Probability models
    1.1.1 The sample space of a probability model
    1.1.2 Assigning probabilities for finite sample spaces
  1.2 The axioms of probability theory
    1.2.1 Axioms for the class of events for a sample space Ω
    1.2.2 Axioms of probability
  1.3 Probability review
    1.3.1 Conditional probabilities and statistical independence
    1.3.2 Repeated idealized experiments
    1.3.3 Random variables
    1.3.4 Multiple random variables and conditional probabilities
    1.3.5 Stochastic processes
    1.3.6 Expectations
    1.3.7 Random variables as functions of other random variables
    1.3.8 Conditional expectations
    1.3.9 Indicator random variables
    1.3.10 Transforms
  1.4 The laws of large numbers
    1.4.1 Basic inequalities
    1.4.2 Weak law of large numbers with a finite variance
    1.4.3 Relative frequency
    1.4.4 The central limit theorem
    1.4.5 Weak law with an infinite variance
    1.4.6 Strong law of large numbers (SLLN)
    1.4.7 Convergence of random variables
  1.5 Relation of probability models to the real world
    1.5.1 Relative frequencies in a probability model
    1.5.2 Relative frequencies in the real world
    1.5.3 Statistical independence of real-world experiments
    1.5.4 Limitations of relative frequencies
    1.5.5 Subjective probability
  1.6 Summary
  1.7 Appendix
    1.7.1 Table of standard random variables
  1.8 Exercises

2 POISSON PROCESSES
  2.1 Introduction
    2.1.1 Arrival processes
  2.2 Definition and properties of the Poisson process
    2.2.1 Memoryless property
    2.2.2 Probability density of Sn and of S1, . . . , Sn
    2.2.3 The PMF for N(t)
    2.2.4 Alternate definitions of Poisson processes
    2.2.5 The Poisson process as a limit of shrinking Bernoulli processes
  2.3 Combining and splitting Poisson processes
    2.3.1 Subdividing a Poisson process
    2.3.2 Examples using independent Poisson processes
  2.4 Non-homogeneous Poisson processes
  2.5 Conditional arrival densities and order statistics
  2.6 Summary
  2.7 Exercises

3 RENEWAL PROCESSES
  3.1 Introduction
  3.2 Strong Law of Large Numbers for renewal processes
  3.3 Expected number of renewals
    3.3.1 Laplace transform approach
    3.3.2 Random stopping times
    3.3.3 Wald's equality
  3.4 Renewal-reward processes; time-averages
    3.4.1 General renewal-reward processes
  3.5 Renewal-reward processes; ensemble-averages
  3.6 Applications of renewal-reward theory
    3.6.1 Little's theorem
    3.6.2 Expected queueing time for an M/G/1 queue
  3.7 Delayed renewal processes
    3.7.1 Delayed renewal-reward processes
    3.7.2 Transient behavior of delayed renewal processes
    3.7.3 The equilibrium process
  3.8 Summary
  3.9 Exercises

4 FINITE-STATE MARKOV CHAINS
  4.1 Introduction
  4.2 Classification of states
  4.3 The matrix representation
    4.3.1 The eigenvalues and eigenvectors of P
  4.4 Perron-Frobenius theory
  4.5 Markov chains with rewards
  4.6 Markov decision theory and dynamic programming
    4.6.1 Introduction
    4.6.2 Dynamic programming algorithm
    4.6.3 Optimal stationary policies
    4.6.4 Policy iteration and the solution of Bellman's equation
    4.6.5 Stationary policies with arbitrary final rewards
  4.7 Summary
  4.8 Exercises

5 COUNTABLE-STATE MARKOV CHAINS
  5.1 Introduction and classification of states
  5.2 Branching processes
  5.3 Birth-death Markov chains
  5.4 Reversible Markov chains
  5.5 The M/M/1 sample-time Markov chain
  5.6 Round-robin and processor sharing
  5.7 Semi-Markov processes
  5.8 Example — the M/G/1 queue
  5.9 Summary
  5.10 Exercises

6 MARKOV PROCESSES WITH COUNTABLE STATE SPACES
  6.1 Introduction
    6.1.1 The sampled-time approximation to a Markov process
  6.2 Steady-state behavior of irreducible Markov processes
    6.2.1 The number of transitions per unit time
    6.2.2 Renewals on successive entries to a state
    6.2.3 The strong law for time-average state probabilities
    6.2.4 The equations for the steady-state process probabilities
    6.2.5 The sampled-time approximation again
    6.2.6 Pathological cases
  6.3 The Kolmogorov differential equations
  6.4 Uniformization
  6.5 Birth-death processes
  6.6 Reversibility for Markov processes
  6.7 Jackson networks
    6.7.1 Closed Jackson networks
  6.8 Summary
  6.9 Exercises

7 RANDOM WALKS, LARGE DEVIATIONS, AND MARTINGALES
  7.1 Introduction
    7.1.1 Simple random walks
    7.1.2 Integer-valued random walks
    7.1.3 Renewal processes as special cases of random walks
  7.2 The waiting time in a G/G/1 queue
  7.3 Detection, decisions, and hypothesis testing
  7.4 Threshold crossing probabilities in random walks
  7.5 Thresholds, stopping rules, and Wald's identity
    7.5.1 Stopping rules
    7.5.2 Joint distribution of N and barrier
    7.5.3 Proof of Wald's identity
  7.6 Martingales and submartingales
    7.6.1 Simple examples of martingales
    7.6.2 Markov modulated random walks
    7.6.3 Generating functions for Markov random walks
    7.6.4 Scaled branching processes
    7.6.5 Partial isolation of past and future in martingales
    7.6.6 Submartingales and supermartingales
  7.7 Stopped processes and stopping times
    7.7.1 Stopping times for martingales relative to a process
  7.8 The Kolmogorov inequalities
    7.8.1 The strong law of large numbers
    7.8.2 The martingale convergence theorem
  7.9 Summary
  7.10 Exercises
Chapter 1
INTRODUCTION AND REVIEW OF PROBABILITY

1.1 Probability models
Probability theory is a central field of mathematics, widely applicable to scientific, technological, and human situations involving uncertainty. The most obvious applications are to situations, such as games of chance, in which repeated trials of essentially the same procedure lead to differing outcomes. For example, when we flip a coin, roll a die, pick a card from a shuffled deck, or spin a ball onto a roulette wheel, the procedure is the same from one trial to the next, but the outcome (heads (H) or tails (T) in the case of a coin, one to six in the case of a die, etc.) varies from one trial to another in a seemingly random fashion.

For the case of flipping a coin, the outcome of the flip could be predicted from the initial position, velocity, and angular momentum of the coin and from the nature of the surface on which it lands. Thus, in one sense, a coin flip is deterministic rather than random, and the same can be said for the other examples above. When these initial conditions are unspecified, however, as when playing these games, the outcome can again be viewed as random in some intuitive sense.

Many scientific experiments are similar to games of chance in the sense that multiple trials of the same procedure lead to results that vary from one trial to another. In some cases, this variation is due to slight variations in the experimental procedure, in some it is due to noise, and in some, such as in quantum mechanics, the randomness is generally believed to be fundamental. Similar situations occur in many types of systems, especially those in which noise and random delays are important. Some of these systems, rather than being repetitions of a common basic procedure, are systems that evolve over time while still containing a sequence of underlying similar random occurrences.

This intuitive notion of randomness, as described above, is a very special kind of uncertainty. Rather than involving a lack of understanding, it involves a type of uncertainty that can lead to probabilistic models with precise results. As in any scientific field, the models might or might not correspond to reality very well, but when they do correspond to reality, there is the sense that the situation is completely understood, while still being random. For example, we all feel that we understand flipping a coin or rolling a die, but still accept randomness in each outcome. The theory of probability was developed particularly to give precise and quantitative understanding to these types of situations. The remainder of this section introduces this relationship between the precise view of probability theory and the intuitive view as used in applications and everyday language.

After this introduction, the following sections review probability theory as a mathematical discipline, with a special emphasis on the laws of large numbers. In the final section of this chapter, we use the theory and the laws of large numbers to obtain a fuller understanding of the relationship between theory and the real world.[1]

Probability theory, as a mathematical discipline, started to evolve in the 17th century and was initially focused on games of chance. The importance of the theory grew rapidly, particularly in the 20th century, and it now plays a central role in risk assessment, statistics, data networks, operations research, information theory, control theory, theoretical computer science, quantum theory, game theory, neurophysiology, and many other fields.

The core concept in probability theory is that of a probability model. Given the extent of the theory, both in mathematics and in applications, the simplicity of probability models is surprising. The first component of a probability model is a sample space, which is a set whose elements are called outcomes or sample points. Probability models are particularly simple in the special case where the sample space is finite,[2] and we consider only this case in the remainder of this section. The second component of a probability model is a class of events, which can be considered for now simply as the class of all subsets of the sample space. The third component is a probability measure, which can be regarded here as the assignment of a non-negative number to each outcome, with the restriction that these numbers must sum to one over the sample space. The probability of an event is the sum of the probabilities of the outcomes comprising that event.

These probability models play a dual role. In the first, the many known results about various classes of models, and the many known relationships between models, constitute the essence of probability theory. Thus one often studies a model not because of any relationship to the real world, but simply because the model provides an example or a building block useful for the theory and thus ultimately for other models. In the other role, when probability theory is applied to some game, experiment, or some other situation involving randomness, a probability model is used to represent a trial of the experiment (in what follows, we refer to all of these random situations as experiments).

For example, the standard probability model for rolling a die uses {1, 2, 3, 4, 5, 6} as the sample space, with each possible outcome having probability 1/6. An odd result, i.e., the subset {1, 3, 5}, is an example of an event in this sample space, and this event has probability 1/2. The correspondence between model and actual experiment seems straightforward here. The set of outcomes is the same and, given the symmetry between faces of the die, the choice of equal probabilities seems natural. On closer inspection, there is the following important difference. The model corresponds to a single roll of a die, with a probability defined for each possible outcome. In a real-world single roll of a die, an outcome k from 1 to 6 occurs, but there is no observable probability of k.

For repeated rolls, however, the real-world relative frequency of k, i.e., the fraction of rolls for which the output is k, can be compared with the sample value of the relative frequency of k in a model for repeated rolls. The sample values of the relative frequency of k in the model resemble the probability of k in a way to be explained later. It is this relationship through relative frequencies that helps overcome the non-observable nature of probabilities in the real world.

[1] It would be appealing to show how probability theory evolved from real-world situations involving randomness, but we shall find that it is difficult enough to find good models of real-world situations even using all the insights and results of probability theory as a starting basis.

[2] A number of mathematical issues arise with infinite sample spaces, as discussed in the following section.
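The relative-frequency idea above is easy to try numerically. The following short Python sketch is not part of the original text; the trial counts and the seed are arbitrary choices. It rolls a simulated fair die repeatedly and prints the relative frequency of each face, which tends to settle near the modeled probability 1/6 as the number of rolls grows.

```python
import random
from collections import Counter

def die_relative_frequencies(num_rolls, seed=1):
    """Simulate num_rolls rolls of a fair die and return the relative
    frequency of each face, i.e., (number of occurrences) / num_rolls."""
    rng = random.Random(seed)
    counts = Counter(rng.randint(1, 6) for _ in range(num_rolls))
    return {face: counts[face] / num_rolls for face in range(1, 7)}

if __name__ == "__main__":
    for n in (60, 6000, 600000):
        freqs = die_relative_frequencies(n)
        # Each relative frequency clusters more tightly near 1/6 as n grows.
        print(n, {face: round(f, 4) for face, f in freqs.items()})
```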
1.1.1 The sample space of a probability model
An outcome or sample point in a probability model corresponds to the complete result of a trial of the experiment being modeled. For example, a game of cards is often appropriately modeled by the arrangement of cards within a shuffled 52 card deck, thus giving rise to a set of 52! outcomes (incredibly detailed, but trivially simple in structure), even though the entire deck might not be played in one trial of the game. A poker hand with 4 aces is an event rather than an outcome in this model, since many arrangements of the cards can give rise to 4 aces in a given hand. The possible outcomes in a probability model (and in the experiment being modeled) are mutually exclusive and collectively constitute the entire sample space (space of possible results). An outcome is often called a finest grain result of the model in the sense that an outcome ω contains no subsets other than the empty set φ and the singleton subset {ω}. Thus events typically give only partial information about the result of the experiment, whereas an outcome fully specifies the result.

In choosing the sample space for a probability model of an experiment, we often omit details that appear irrelevant for the purpose at hand. Thus in modeling the set of outcomes for a coin toss as {H, T}, we usually ignore the type of coin, the initial velocity and angular momentum of the toss, etc. We also omit the rare possibility that the coin comes to rest on its edge. Sometimes, conversely, we add detail in the interest of structural simplicity, such as the use of a shuffled deck of 52 cards above.

The choice of the sample space in a probability model is similar to the choice of a mathematical model in any branch of science. That is, one simplifies the physical situation by eliminating essentially irrelevant detail. One often does this in an iterative way, using a very simple model to acquire initial understanding, and then successively choosing more detailed models based on the understanding from earlier models.

The mathematical theory of probability views the sample space simply as an abstract set of elements, and from a strictly mathematical point of view, the idea of doing an experiment and getting an outcome is a distraction. For visualizing the correspondence between the theory and applications, however, it is better to view the abstract set of elements as the set of possible outcomes of an idealized experiment in which, when the idealized experiment is performed, one and only one of those outcomes occurs. The two views are mathematically identical, but it will be helpful to refer to the first view as a probability model and the second as an idealized experiment. In applied probability texts and technical articles, these idealized experiments, rather than the real-world situations with randomness, often become the primary topic of discussion.[3]
1.1.2 Assigning probabilities for finite sample spaces
The word probability is widely used in everyday language, and most of us attach various intuitive meanings[4] to the word. For example, everyone would agree that something virtually impossible should be assigned a probability close to 0 and something virtually certain should be assigned a probability close to 1. For these special cases, this provides a good rationale for choosing probabilities. The relationship between virtually and close to is unclear at the moment, but if there is some implied limiting process, we would all agree that, in the limit, certainty and impossibility correspond to probabilities 1 and 0 respectively.

Between virtual impossibility and certainty, if one outcome appears to be closer to certainty than another, its probability should be correspondingly greater. This intuitive notion is imprecise and highly subjective; it provides little rationale for choosing numerical probabilities for different outcomes, and, even worse, little rationale justifying that probability models bear any precise relation to real-world situations.

Symmetry can often provide a better rationale for choosing probabilities. For example, the symmetry between H and T for a coin, or the symmetry between the six faces of a die, motivates assigning equal probabilities, 1/2 each for H and T and 1/6 each for the six faces of a die. This is reasonable and extremely useful, but there is no completely convincing reason for choosing probabilities based on symmetry.

Another approach is to perform the experiment many times and choose the probability of each outcome as the relative frequency of that outcome (i.e., the number of occurrences of that outcome (or event) divided by the total number of trials). Experience shows that the relative frequency of an outcome often approaches a limiting value with an increasing number of trials. Associating the probability of an outcome with that limiting relative frequency is certainly close to our intuition and also appears to provide a testable criterion between model and real world. This criterion is discussed in Sections 1.5.1 and 1.5.2 to follow. This provides a very concrete way to use probabilities, since it suggests that the randomness in a single trial tends to disappear in the aggregate of many trials. Other approaches to choosing probability models will be discussed later.
[3] This is not intended as criticism, since we will see that there are good reasons to concentrate initially on such idealized experiments. However, readers should always be aware that modeling errors are the major cause of misleading results in applications of probability, and thus modeling must be seriously considered before using the results.

[4] It is popular to try to define probability by likelihood, but this is unhelpful since the words are synonyms (except for a technical meaning that likelihood takes on in detection theory).
1.2 The axioms of probability theory
As the applications of probability theory became increasingly varied and complex during the 20th century, it became increasingly necessary to put the theory on a firm mathematical footing, and this was accomplished by an axiomatization of the theory largely due to the great Russian mathematician A. N. Kolmogorov [14] in 1932. Before stating and explaining these axioms of probability theory, the following two examples explain why the simple approach of the last section, assigning a probability to each sample point, often fails with infinite sample spaces.

Example 1.2.1. Suppose we want to model the phase of a sine wave, where the phase is reasonably viewed as being uniformly distributed between 0 and 2π. If this phase is the only quantity of interest, it is reasonable to choose a sample space consisting of the set of real numbers between 0 and 2π. There are an uncountably[5] infinite number of possible phases between 0 and 2π, and with any reasonable interpretation of uniform distribution, one must conclude that each sample point has probability zero. Thus, the simple approach of the last section leads us to conclude that any event with a finite or countably infinite set of sample points should have probability zero, but that there is no way to find the probability of, say, the interval (0, π).

For this example, the appropriate view, as developed in all elementary probability texts, is to assign a probability density 1/2π to the phase. The probability of any given event of interest can usually be found by integrating the density over the event, whereas it generally can not be found from knowing only that each sample point has 0 probability. Useful as densities are, however, they do not lead to a general approach over arbitrary sample spaces.[6]

Example 1.2.2. Consider an infinite sequence of coin tosses. The probability of any given n-tuple of individual outcomes can be taken to be 2^{-n}, so in the limit n → ∞, the probability of any given sequence is 0. There are 2^n equiprobable outcomes for n tosses and an infinite number of outcomes for n → ∞. Again, expressing the probability of events involving infinitely many tosses as a sum of individual sample point probabilities does not work. The obvious approach is to evaluate the probability of an event as an appropriate limit from finite length sequences, but we find situations in which this doesn't work.

Although appropriate rules can be generated for simple examples such as those above, there is a need for a consistent and general approach. In such an approach, rather than assigning probabilities to sample points, which are then used to assign probabilities to events, probabilities must be associated directly with events. The axioms to follow establish consistency requirements between the probabilities of different events. The axioms, and the corollaries derived from them, are consistent with one's intuition, and, for finite sample spaces, are consistent with our earlier approach. Dealing with the countable unions of events in the axioms will be unfamiliar to some students, but will soon become both familiar and consistent with intuition.

The strange part of the axioms comes from the fact that defining the class of events as the set of all subsets of the sample space is usually inappropriate when the sample space is uncountably infinite. What is needed is a class of events that is large enough that we can almost forget that some very strange subsets are excluded. This is accomplished by having two simple sets of axioms, one defining the class of events,[7] and the other defining the relations between the probabilities assigned to these events. In this theory, all events have probabilities, but those truly weird subsets that are not events do not have probabilities. This will be discussed more after giving the axioms for events.

The axioms for events use the standard notation of set theory. That is, for a set Ω, the union of two subsets A1 and A2, denoted A1 ∪ A2, is the subset containing all points in either A1 or A2. The intersection, denoted A1 ∩ A2, or A1A2, is the set of points in both A1 and A2. A sequence of events is a class of events that can be put into one-to-one correspondence with the positive integers, i.e., A1, A2, . . . , ad infinitum. The union of a sequence of events is called a countable union, meaning the union includes one event for each positive integer n.

[5] A set is uncountably infinite if it is infinite but its members cannot be put into one-to-one correspondence with the positive integers. For example the set of real numbers over some interval such as (0, 2π) is uncountably infinite. The Wikipedia article on countable sets provides a friendly introduction to the concepts of countability and uncountability.

[6] It is possible to avoid the consideration of infinite sample spaces here by quantizing the possible phases. This is analogous to avoiding calculus by working only with discrete functions. Both usually result in both artificiality and added complexity.

[7] The mathematical name for a class of elements satisfying these axioms is a σ-algebra, sometimes called a σ-field.
1.2.1 Axioms for the class of events for a sample space Ω
1. Ω is an event.

2. For every sequence of events A1, A2, . . . , the union ∪_{n=1}^∞ An is an event.

3. For every event A, the complement of A, denoted A^c, is an event.

There are a number of important corollaries of these axioms. The first is that since the empty set φ is the complement of Ω, it is also an event. The empty set does not correspond to our intuition about events, but the theory would be extremely awkward if it were omitted. Second, by making all but a finite number of events in a union be the empty event, it is clear that all finite unions of events are events. Third, by using deMorgan's law,

    [∪_n An]^c = ∩_n An^c,

it is clear that finite and countable intersections of events are events.
Although we will not make a big fuss about these axioms in the rest of the text, we will be careful to use only complements and countable unions and intersections in our analysis. Thus subsets that are not events will not arise and we need not concern ourselves with them.

Note that the axioms do not say that all subsets of Ω are events. In fact, there are many rather silly ways to define classes of events that obey the axioms. For example, the axioms are satisfied by choosing only the universal set Ω and the empty set φ to be events. We shall avoid such trivialities by assuming that for each sample point ω, the singleton subset {ω} is an event. For finite sample spaces, this assumption, plus the axioms above, implies that all subsets are events.

For uncountably infinite sample spaces, such as the sinusoidal phase above, this assumption, plus the axioms above, still leaves considerable freedom in choosing a class of events. As an example, the class of all subsets of Ω satisfies the axioms but surprisingly does not allow the probability axioms to be satisfied in any sensible way. How to choose an appropriate class of events requires an understanding of measure theory, which would take us too far afield for our purposes. Thus we neither assume nor develop measure theory here.[8]

From a pragmatic standpoint, we start with the class of events of interest, such as those required to define the random variables needed in the problem. That class is then extended so as to be closed under complementation and countable unions. Measure theory shows that this extension can always be done, but here we simply accept that extension as a known result.

[8] There is no doubt that measure theory is useful in probability theory, and serious students of probability should certainly learn measure theory at some point. For application-oriented people, however, it seems advisable to acquire more insight and understanding of probability, at a graduate level, before concentrating on the abstractions and subtleties of measure theory.
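For a finite sample space the event axioms are easy to check mechanically. The Python sketch below is an illustration added here, not part of the text; the sample space and the two candidate classes of events are arbitrary. It verifies that a proposed class of subsets of a small Ω contains Ω itself and is closed under complementation and finite unions, which is all that countable unions can produce when Ω is finite.

```python
from itertools import combinations

def is_event_class(omega, events):
    """Check the event axioms on a finite sample space: omega is an event,
    complements of events are events, and unions of events are events."""
    events = {frozenset(e) for e in events}
    omega = frozenset(omega)
    if omega not in events:
        return False
    for a in events:
        if omega - a not in events:          # closure under complement
            return False
    for a, b in combinations(events, 2):
        if a | b not in events:              # closure under pairwise union
            return False
    return True

omega = {1, 2, 3, 4}
trivial = [set(), omega]                     # the "silly" class: only phi and omega
richer = [set(), {1, 2}, {3, 4}, omega]      # generated by splitting omega in two
print(is_event_class(omega, trivial), is_event_class(omega, richer))  # True True
```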
1.2.2 Axioms of probability
Given any sample space Ω and any class of events satisfying the axioms of events, the following three probability axioms[9] hold:

1. Pr{Ω} = 1.

2. For every event A, Pr{A} ≥ 0.

3. For every sequence A1, A2, . . . of disjoint events, the probability of their union is

    Pr{∪_{n=1}^∞ An} = Σ_{n=1}^∞ Pr{An},                                    (1.1)

where Σ_{n=1}^∞ Pr{An} is shorthand for lim_{m→∞} Σ_{n=1}^m Pr{An}.

The axioms imply the following useful corollaries:

    Pr{φ} = 0                                                               (1.2)
    Pr{∪_{n=1}^m An} = Σ_{n=1}^m Pr{An}    for disjoint events A1, . . . , Am    (1.3)
    Pr{A^c} = 1 − Pr{A}                    for all A                        (1.4)
    Pr{A} ≤ Pr{B}                          for all A, B such that A ⊆ B     (1.5)
    Pr{A} ≤ 1                              for all A                        (1.6)
    Σ_n Pr{An} ≤ 1                         for finite or countable disjoint events.    (1.7)

[9] Sometimes finite additivity, (1.3), is added as an additional axiom. This might aid intuition and also avoids the very technical proofs given for (1.2) and (1.3).
To verify (1.2), note that events are disjoint if they have no outcomes in common, and thus the empty event φ is disjoint from itself and every other event. Thus, φ, φ, . . . , is a sequence of disjoint events, and it follows from Axiom 3 (see Exercise 1.1) that Pr{φ} = 0.

To verify (1.3), apply Axiom 3 to the disjoint sequence A1, . . . , Am, φ, φ, . . . . One might reasonably guess that (1.3), along with Axioms 1 and 2, implies Axiom 3. Exercise 1.3 shows why this guess is incorrect.

To verify (1.4), note that Ω = A ∪ A^c. Then apply (1.3) to the disjoint sets A and A^c.

To verify (1.5), note that if A ⊆ F, then F = A ∪ (FA^c), where[10] A and FA^c are disjoint. Apply (1.3) to A ∪ (FA^c) to get (1.5), noting that Pr{FA^c} ≥ 0.

To verify (1.6) and (1.7), first substitute Ω for F in (1.5) and then substitute ∪_n An for A.

The axioms specify the probability of any disjoint union of events in terms of the individual event probabilities, but what about a finite or countable union of arbitrary events A1, A2, . . . ? Exercise 1.4 shows that in this case, (1.1) or (1.3) can be generalized to

    Pr{∪_n An} = Σ_n Pr{An − ∪_{i=1}^{n−1} Ai}.                             (1.8)

In order to use this, one must know not only the event probabilities for A1, A2, . . . , but also the probabilities of their intersections. The union bound, which is also derived in Exercise 1.4, depends only on the individual event probabilities, but gives only the following upper bound on the union probability:

    Pr{∪_n An} ≤ Σ_n Pr{An}.                                                (1.9)
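The union bound (1.9) can be checked by brute force on a small finite model. The following sketch is added for illustration only; the sample space (two die rolls) and the three events are arbitrary choices. It enumerates a uniform sample space and compares the exact probability of a union with the sum of the individual probabilities.

```python
from itertools import product
from fractions import Fraction

# Sample space: ordered pairs from two rolls of a die, each with probability 1/36.
omega = list(product(range(1, 7), repeat=2))
p = Fraction(1, len(omega))

def prob(event):
    return p * sum(1 for w in omega if event(w))

events = [
    lambda w: w[0] + w[1] == 7,    # A1: the sum is 7
    lambda w: w[0] == 6,           # A2: the first roll is 6
    lambda w: w[1] >= 5,           # A3: the second roll is 5 or 6
]

union_prob = prob(lambda w: any(a(w) for a in events))
bound = sum(prob(a) for a in events)
print(union_prob, bound, union_prob <= bound)   # exact value, union bound, True
```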
1.3 Probability review

1.3.1 Conditional probabilities and statistical independence
Definition 1.1. For any two events A and B (with Pr{B} > 0), the conditional probability of A, conditional on B, is defined by

    Pr{A|B} = Pr{AB}/Pr{B}.                                                 (1.10)
One visualizes an experiment that has been partly carried out with B as the result. Pr{A|B} can then be viewed as the probability of A normalized to a sample space restricted to event B. Within this restricted sample space, we can view B as the sample space and AB as an event within this sample space. For a fixed event B, we can visualize mapping each event A in the original space into AB in the restricted space. It is easy to see that the event axioms are still satisfied in the restricted space. Assigning probability Pr{A|B} to each AB in the restricted space, it is easy to see that the axioms of probability are also satisfied. In other words, everything we know about probability can also be applied to such a restricted probability space.[10]

[10] We sometimes express intersection as A ∩ B and sometimes directly as AB.
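The "restricted sample space" view of (1.10) can be made concrete with a small enumeration. In the sketch below, an added illustration that is not from the text and whose events are arbitrary, conditioning on B simply means renormalizing over the outcomes in B, and the result agrees with Pr{AB}/Pr{B}.

```python
from itertools import product
from fractions import Fraction

omega = list(product(range(1, 7), repeat=2))   # two die rolls, all equally likely

def prob(event, within=None):
    """Pr of an event, optionally within a restricted sample space:
    count the event's outcomes inside 'within' and normalize by its size."""
    space = [w for w in omega if within is None or within(w)]
    return Fraction(sum(1 for w in space if event(w)), len(space))

A = lambda w: w[0] + w[1] == 7      # the sum is 7
B = lambda w: w[0] % 2 == 0         # the first roll is even

# Pr{A|B} computed two ways: by restriction, and as Pr{AB}/Pr{B}; both give 1/6.
print(prob(A, within=B))
print(prob(lambda w: A(w) and B(w)) / prob(B))
```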
Definition 1.2. Two events, A and B, are statistically independent (or, more briefly, independent) if

    Pr{AB} = Pr{A} Pr{B}.

For Pr{B} > 0, this is equivalent to Pr{A|B} = Pr{A}. This latter form corresponds to our intuitive view of independence, since it says that the observation of B does not change the probability of A. Such intuitive statements about "observation" and "occurrence" are helpful in reasoning probabilistically, but also can lead to great confusion. For Pr{B} > 0, Pr{A|B} is defined without any notion of B being observed "before" A. For example Pr{A|B} Pr{B} = Pr{B|A} Pr{A} is simply a consequence of the definition of conditional probability and has nothing to do with causality or observations. This issue caused immense confusion in probabilistic arguments before the axiomatic theory was developed.

The notion of independence is of vital importance in defining, and reasoning about, probability models. We will see many examples where very complex systems become very simple, both in terms of intuition and analysis, when enough quantities are modeled as statistically independent. An example will be given in the next subsection where repeated independent experiments are used in understanding relative frequency arguments.

Often, when the assumption of independence turns out to be oversimplified, it is reasonable to assume conditional independence, where A and B are said to be conditionally independent given C if Pr{AB|C} = Pr{A|C} Pr{B|C}. Most of the stochastic processes to be studied here are characterized by particular forms of independence or conditional independence.

For more than two events, the definition of statistical independence is a little more complicated.

Definition 1.3. The events A1, . . . , An, n > 2, are statistically independent if, for each subset S of two or more of the integers 1 to n,

    Pr{∩_{i∈S} Ai} = ∏_{i∈S} Pr{Ai}.                                        (1.11)

This includes the full subset S = {1, . . . , n}, so one necessary condition for independence is that

    Pr{∩_{i=1}^n Ai} = ∏_{i=1}^n Pr{Ai}.                                    (1.12)

Assuming that Pr{Ai} is strictly positive for each i, this says that

    Pr{Ai | ∩_{j∈S} Aj} = Pr{Ai}    for each i and each S such that i ∉ S.

It might be surprising that (1.12) does not imply (1.11), but the example in Exercise 1.5 will help clarify this. This definition will become clearer (and simpler) when we see how to view independence of events as a special case of independence of random variables.
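The subtlety in Definition 1.3 is that the product rule must hold for every subset S of two or more indices, not just for pairs or just for the full set. A standard illustration, added here as a sketch and not necessarily the example of Exercise 1.5, uses two fair coin flips: let A1 = {first flip is heads}, A2 = {second flip is heads}, and A3 = {the two flips agree}. Each pair satisfies the product rule, but the full triple does not, so the three events are pairwise independent without being statistically independent.

```python
from itertools import combinations, product
from fractions import Fraction

omega = list(product("HT", repeat=2))          # four equally likely outcomes
prob = lambda ev: Fraction(sum(1 for w in omega if ev(w)), len(omega))

A1 = lambda w: w[0] == "H"                     # first flip is heads
A2 = lambda w: w[1] == "H"                     # second flip is heads
A3 = lambda w: w[0] == w[1]                    # the two flips agree
events = [A1, A2, A3]

for subset in (s for r in (2, 3) for s in combinations(range(3), r)):
    joint = prob(lambda w: all(events[i](w) for i in subset))
    prod_ = Fraction(1)
    for i in subset:
        prod_ *= prob(events[i])
    print(subset, joint == prod_)
# Pairs print True; the triple (0, 1, 2) prints False, since Pr{A1 A2 A3} = 1/4, not 1/8.
```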
1.3.2 Repeated idealized experiments
Much of our intuitive understanding of probability comes from the notion of repeated idealized experiments, but the axioms of probability contain no explicit recognition of such repetitions. The appropriate way to handle n repetitions of an idealized experiment is through an extended probability model whose sample points are n-tuples of sample points from the original model. In other words, given an original sample space Ω, the sample space of an n-repetition model is the Cartesian product

    Ω^n = {(ω1, ω2, . . . , ωn) | ωi ∈ Ω for each i, 1 ≤ i ≤ n}.            (1.13)

Since each sample point in the n-repetition model is an n-tuple of points from the original Ω, an event in the n-repetition model is a subset of Ω^n, i.e., a subset of the n-tuples (ω1, . . . , ωn), each ωi ∈ Ω. This class of events should include each event of the form (A1, A2, . . . , An) where A1 ⊆ Ω is an event within the first trial, A2 ⊆ Ω is an event within the 2nd trial, etc. For example, with two rolls of a die, an even number {2, 4, 6} followed by an odd number is an event. There are other events also, such as the union of odd followed by even or even followed by odd, that can not be represented as an n-tuple of elementary events. Thus the set of events (for n repetitions) should include all n-tuples of elementary events plus the extension of this class of events over complementation and countable unions and intersections. An understanding of how to construct this extension in general requires measure theory but is not necessary for our purposes.

The simplest and most natural way of creating a probability measure for this extended sample space and class of events is through the assumption that the repetitions are statistically independent of each other. More precisely, we assume that for each of the extended events (A1, A2, . . . , An) contained in Ω^n, we have

    Pr{(A1, A2, . . . , An)} = ∏_{i=1}^n Pr{Ai},                            (1.14)
where Pr{Ai} is the probability of event Ai in the original model. Note that since Ω can be substituted for any Ai in this formula, the subset condition of (1.11) is automatically satisfied. In other words, for any probability model, there is an extended independent n-repetition model for which the events in each repetition are independent of those in the other repetitions. In what follows, we refer to this as the probability model for n independent identically distributed (IID) experiments. The niceties of how this model for n IID experiments is created depend on measure theory, but we can rely on the fact that such a model can be created, and that the events in each repetition are independent of those in the other repetitions.

What we have done here is very important conceptually. A probability model for an experiment does not say anything directly about repeated experiments. However, questions about independent repeated experiments can be handled directly within this n IID experiment model. This can also be extended to a countable number of IID experiments.
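Equation (1.14) is easy to check for a small original model. The sketch below, added purely for illustration with an arbitrary choice of events, builds the 2-repetition model of a die roll explicitly as the Cartesian product with product probabilities and confirms that the probability of the extended event (A1, A2) equals Pr{A1} Pr{A2}.

```python
from fractions import Fraction
from itertools import product

omega = range(1, 7)
p = {w: Fraction(1, 6) for w in omega}                 # original model: a fair die

# 2-repetition model: pairs of outcomes with product probabilities (IID assumption).
p2 = {(w1, w2): p[w1] * p[w2] for w1, w2 in product(omega, repeat=2)}

A1 = {2, 4, 6}                                          # even on the first trial
A2 = {1, 3, 5}                                          # odd on the second trial

pr_extended = sum(p2[w] for w in p2 if w[0] in A1 and w[1] in A2)
pr_product = sum(p[w] for w in A1) * sum(p[w] for w in A2)
print(pr_extended, pr_product, pr_extended == pr_product)   # 1/4 1/4 True
```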
1.3.3 Random variables
The outcome of a probabilistic experiment often specifies a collection of numerical values such as temperatures, voltages, numbers of arrivals or departures in various intervals, etc. Each such numerical value varies, depending on the particular outcome of the experiment, and thus can be viewed as a mapping from the set Ω of sample points to the set R of real numbers. Note that R does not include ±∞, and in those rare instances where ±∞ should be included, it is called the extended set of real numbers. These mappings from sample points to numerical values are called random variables.

Definition 1.4. A random variable (rv) is a function X from the sample space Ω of a probability model to the set of real numbers R. That is, each sample point ω in Ω, except perhaps for a subset[11] of probability 0, is mapped into a real number denoted by X(ω). There is an additional technical requirement that, for each real x, the set {X(ω) ≤ x} must be an event in the probability model.

As with any function, there is often some confusion between the function itself, which is called X in the definition above, and the value X(ω) the function takes on for a sample point ω. This is particularly prevalent with random variables (rv's) since we intuitively associate a rv with its sample value when an experiment is performed. We try to control that confusion here by using X, X(ω), and x, respectively, to refer to the rv, the sample value taken for a given sample point ω, and a generic sample value.

Definition 1.5. The distribution function[12] FX(x) of a random variable (rv) X is a function, R → R, defined by FX(x) = Pr{ω ∈ Ω | X(ω) ≤ x}. The argument ω is usually omitted for brevity, so FX(x) = Pr{X ≤ x}.

Note that x is the argument of FX(x), whereas the subscript X denotes the particular rv under consideration. As illustrated in Figure 1.1, the distribution function FX(x) is nondecreasing with x and must satisfy the limits lim_{x→−∞} FX(x) = 0 and lim_{x→∞} FX(x) = 1 (see Exercise 1.3).

Figure 1.1: Example of a distribution function for a rv that is neither continuous nor discrete. If FX(x) has a discontinuity at some xo, it means that there is a discrete probability at xo equal to the magnitude of the discontinuity. In this case FX(xo) is given by the height of the upper point at the discontinuity.

The concept of a rv is often extended to complex random variables (rv's) and vector rv's. A complex random variable is a mapping from the sample space to the set of finite complex numbers, and a vector random variable (rv) is a mapping from the sample space to the finite vectors in some finite-dimensional vector space. Another extension is that of defective rv's. X is defective if there is an event of positive probability for which the mapping is either undefined or defined to be either +∞ or −∞. When we refer to random variables in this text (without any modifier such as complex, vector, or defective), we explicitly restrict attention to the original definition, i.e., a function from Ω to R.

If X has only a finite or countable number of possible sample values, say x1, x2, . . . , the probability Pr{X = xi} of each sample value xi is called the probability mass function (PMF) at xi and denoted by pX(xi); such a random variable is called discrete. The distribution function of a discrete rv is a 'staircase function', staying constant between the possible sample values and having a jump of magnitude pX(xi) at each sample value xi.

If the distribution function FX(x) of a rv X has a (finite) derivative at x, the derivative is called the probability density (or just density) of X at x and denoted by fX(x); for sufficiently small δ, δfX(x) then approximates the probability that X is mapped into a value between x and x + δ. If the density exists for all x, the rv is said to be continuous.

Elementary probability courses work primarily with the PMF and the density, since they are convenient for computational exercises. We will often work with the distribution function here. This is partly because it is always defined, partly to avoid saying everything thrice, for discrete, continuous, and other rv's, and partly because the distribution function is most important in limiting arguments such as steady-state time-average arguments. For distribution functions, density functions, and PMF's, the subscript denoting the rv is usually omitted if the rv is clear from the context. The same convention is used for complex, vector, etc. rv's.

[11] We will find many situations, particularly when taking limits, when rv's are undefined for some set of sample points of probability zero. These are still called rv's (rather than defective rv's) and the zero probability subsets can usually be ignored.

[12] Some people refer to the distribution function as the cumulative distribution function.
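For a discrete rv, the distribution function of Definition 1.5 is just the running sum of the PMF, a staircase that jumps by pX(xi) at each sample value. A minimal sketch is given below; it is an added illustration and the PMF values are arbitrary.

```python
def distribution_function(pmf):
    """Given a PMF as {sample value: probability}, return F_X as a function.
    F_X(x) = sum of pmf over all sample values <= x (a staircase in x)."""
    points = sorted(pmf.items())
    def F(x):
        return sum(p for value, p in points if value <= x)
    return F

pmf = {0: 0.5, 1: 0.3, 4: 0.2}      # a discrete rv with three sample values
F = distribution_function(pmf)
print([F(x) for x in (-1, 0, 0.5, 1, 3, 4, 10)])   # -> [0, 0.5, 0.5, 0.8, 0.8, 1.0, 1.0]
```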
1.3.4 Multiple random variables and conditional probabilities
Often we must deal with multiple random variables (rv's) in a single probability experiment. If X1, X2, . . . , Xn are n rv's or the components of a vector rv, their joint distribution function is defined by

    FX1,...,Xn(x1, x2, . . . , xn) = Pr{ω ∈ Ω | X1(ω) ≤ x1, X2(ω) ≤ x2, . . . , Xn(ω) ≤ xn}.     (1.15)

This definition goes a long way toward explaining why we need the notion of a sample space Ω when all we want to talk about is a set of rv's. The distribution function of a rv fully describes the individual behavior of that rv, but Ω and the mappings are needed to describe how the rv's interact.

For a vector rv X with components X1, . . . , Xn, or a complex rv X with real and imaginary parts X1, X2, the distribution function is also defined by (1.15). Note that {X1 ≤ x1, X2 ≤ x2, . . . , Xn ≤ xn} is an event and the corresponding probability is nondecreasing in each argument xi. Also the distribution function of any subset of random variables is obtained by setting the other arguments to +∞. For example, the distribution of a single rv (called a marginal distribution) is given by

    FXi(xi) = FX1,...,Xi−1,Xi,Xi+1,...,Xn(∞, . . . , ∞, xi, ∞, . . . , ∞).

If the rv's are all discrete, the joint PMF is given by Pr{X1 = x1, . . . , Xn = xn}. Similarly, if a joint probability density f(x1, . . . , xn) exists, it is given by the derivative ∂^n F(x1, . . . , xn)/(∂x1 ∂x2 · · · ∂xn).

Two rv's, say X and Y, are statistically independent (or, more briefly, independent) if

    FXY(x, y) = FX(x)FY(y)    for each x ∈ R, y ∈ R.                        (1.16)

If X and Y are discrete rv's, then the definition of independence in (1.16) is seen to be equivalent to

    pXY(xi, yj) = pX(xi)pY(yj)    for each value xi of X and yj of Y.

Since X = xi and Y = yj are events, the conditional probability of X = xi conditional on Y = yj (assuming pY(yj) > 0) is given by (1.10) to be

    pX|Y(xi | yj) = pXY(xi, yj)/pY(yj).

If pX|Y(xi | yj) = pX(xi) for all i, j, then it is seen that X and Y are independent. This captures the intuitive notion of independence better than (1.16) for discrete rv's, since it can be viewed as saying that the PMF of X is not affected by the sample value of Y.

If X and Y have a joint density, then (1.16) is equivalent to

    fXY(x, y) = fX(x)fY(y)    for each x ∈ R, y ∈ R.

If fY(y) > 0, the conditional density can be defined as fX|Y(x | y) = fXY(x, y)/fY(y). Then statistical independence can be expressed as

    fX|Y(x|y) = fX(x)    where fY(y) > 0.                                   (1.17)

This captures the intuitive notion of statistical independence for continuous rv's better than (1.16), but it does not quite say that the density of X, conditional on Y = y, is the same as the marginal density of X. The event Y = y has zero probability, and we cannot condition on events of zero probability. If we look at the derivatives defining these densities, the conditional density looks at the probability that x ≤ X ≤ x + δ given that y ≤ Y ≤ y + ε in the limit δ, ε → 0. At some level, this is a very technical point and the intuition of conditioning on Y = y works very well. In fact, problems are often directly modeled in terms of conditional probabilities so that viewing a conditional density as a derivative is less relevant.

We can similarly define the probability of an arbitrary event A conditional on a given value of a rv Y with a density as

    Pr{A | Y = y} = lim_{δ→0} Pr{A, Y ∈ [y, y + δ]} / Pr{Y ∈ [y, y + δ]}.
For n rv's X1, . . . , Xn, we define statistical independence by the equation

    F(x1, . . . , xn) = ∏_{i=1}^n Pr{Xi ≤ xi} = ∏_{i=1}^n FXi(xi)    for all values of x1, . . . , xn.    (1.18)

Another way to state this is that X1, . . . , Xn are independent if the events Xi ≤ xi for 1 ≤ i ≤ n are independent for all choices of x1, . . . , xn. If the density or PMF exists, (1.18) is equivalent to a product form for the density or mass function. A set of rv's is said to be pairwise independent if each pair of rv's in the set is independent. As shown in Exercise 1.19, pairwise independence does not imply that the entire set is independent.

Independent rv's are very often also identically distributed, i.e., they all have the same distribution function. These cases arise so often that we abbreviate independent identically distributed by IID. For the IID case, (1.18) becomes

    F(x1, . . . , xn) = ∏_{i=1}^n FX(xi).                                   (1.19)
1.3.5 Stochastic processes
A stochastic[13] process (or random process) is an infinite collection of rv's, usually indexed by an integer or a real number that is often interpreted as discrete or continuously varying time. Thus each sample point of the probability model maps into an infinite collection of sample values of rv's. If the index is regarded as time, then each sample point maps into a function of time called a sample function. These sample functions might vary continuously with time or might vary at only discrete times, and if they vary at discrete times, those times can be deterministic or random.

In many cases, this collection of rv's comprising the stochastic process is the only thing of interest. In this case the sample points of the probability model can be taken to be the sample functions of the process. Conceptually, then, each event is a set of sample functions. Usually these events are characterized in terms of a finite set of rv's.

As an example of sample functions that vary at only discrete times, we might be concerned with arrivals to some system. The arrivals might model incoming jobs for a computer system, arriving packets to a communication system, patients in a health care system, or orders for some merchandising warehouse. The Bernoulli process is an example of an arrival process and is probably the simplest imaginable non-trivial stochastic process. The following example defines this process and develops a few of its many properties. We will often return to it as an example.

Example 1.3.1. The Bernoulli process: A Bernoulli process is an IID sequence, Y1, Y2, . . . , of binary random variables. Let q = Pr{Yi = 1} and 1 − q = Pr{Yi = 0}. We visualize time as a discrete variable and the event {Yi = 1} as an arrival at time i and {Yi = 0} as no arrival. There can be at most one arrival at each discrete time. We visualize the process as starting at time 0, with the first opportunity for an arrival at time 1.

Several of the random variables of interest in a Bernoulli process are illustrated in Figure 1.2. Note that the counting process N(t) = Σ_{i=1}^t Yi gives the number of arrivals up to and including each time t. It is an alternate way to specify the process, since Yi = N(i) − N(i−1).

[13] Stochastic and random are synonyms, but random has become more popular for rv's and stochastic for stochastic processes, perhaps because both the word random and the word process are used and misused in so many ways.
✛
✲
X3
S1
✛X✲ 2
✛ X1 ✲ i 0 Yi
1 0
2 1
N (t) 3 1
4 0
5 0
6 1
7 0
8 0
Figure 1.2: For a Bernoulli with binary rv’s Yi , the counting process at each Pprocess t discrete time t is N (t) = i=1 Yi . The arrival epoch Sj is the smallest t for which N (t) = j, i.e., it is the epoch of the jth arrival. The interarrival interval Xj is S1 for j = 1 and Sj − Sj−1 for j > 1. The rv N (t), for any given t and q, is a binomial rv and is a good example of the sums of IID rv’s involved in the laws of large numbers. The PMF pN (t) (k) is the probability that k ° ¢ out of t of the Yi ’s have the value 1. There are kt arrangements of t binary numbers with k 1’s, and each has probability q k (1 − q)t−k . Thus µ ∂ t k pN (t) (k) = q (1 − q)t−k . k The interarrival intervals Xj are also illustrated in the figure. The rv X1 is simply the time of the first arrival. It has the value 1 if Y1 = 1, and thus pX1 (1) = q. It has the value 2 if Y1 = 0 and Y2 = 1, so pX1 (2) = q(1 − q). Continuing, we see that X1 has the geometric PMF, pX1 (j) = q(1 − q)j−1 . We can see that each subsequent Xj , conditional on Sj−1 , can be found in the same way and has this same geometric PMF. Also, since this conditional PMF does not depend on Sj−1 , Xj is independent of Sj−1 and thus14 also of all previous Xj . We have seen that the Bernoulli process can be expressed in several alternative ways, all involving a sequence of rv’s. For each of these alternatives, there are generalizations that lead to other interesting processes. A particularly interesting generalization arises by allowing the interarrival intervals to be arbitrary discrete or continuous nonnegative IID rv’s 14
This is one of those maddening arguments that, while essentially obvious, require some careful reasoning to be precise. We go through those kinds of arguments with great care in the next chapter, and suggest that skeptical readers wait until then.
16
CHAPTER 1. INTRODUCTION AND REVIEW OF PROBABILITY
rather than geometric rv’s. Letting the interarrival intervals be IID exponential rv’s leads to the Poisson process, which is the topic of Chapter 2. IID Interarrival intervals with an arbitrary distribution leads to renewal processes, which play a central role in the theory of stochastic processes, and consitute the subject of Chapter 3. Renewal processes are examples of discrete stochastic processes. The distinguishing characteristic of such processes is that interesting things (arrivals, departures, changes of state) occur at discrete instants of time separated by deterministic or random intervals. Discrete stochastic processes are to be distinguished from noise-like stochastic processes in which changes are continuously occurring and the sample functions are continuously varying functions of time. The description of discrete stochastic processes above is not intended to be precise. The various types of stochastic processes developed in subsequent chapters are all discrete in the above sense, however, and we refer to these processes, somewhat loosely, as discrete stochastic processes. Discrete stochastic processes find wide and diverse applications in operations research, communication, control, computer systems, management science, etc. Paradoxically, we shall spend relatively little of our time discussing these particular applications, and rather spend our time developing results and insights about these processes in general. We shall discuss a number of examples drawn from the above fields, but the examples will be “toy” examples, designed to enhance understanding of the material rather than to really understand the application areas.
1.3.6
Expectations
Before giving a general definition for the expected value (the expectation) of a rv, we discuss several special cases. First, the expected value of a non-negative discrete rv X with PMF pX is given by

    E[X] = Σ_x x pX(x).    (1.20)
The expected value in this case must be finite if the rv has only a finite set of sample values, but it can be either finite or infinite if the rv has a countably infinite set of sample values. Example 1.3.2 illustrates a case in which it is infinite.
Next, assume X is discrete with both positive and negative sample values. It is possible in this case to have Σ_{x≥0} x pX(x) = ∞ and Σ_{x<0} x pX(x) = −∞; when both of these partial sums are infinite, the expectation is undefined, and when only one of them is infinite, E[X] is ±∞. When both are finite, E[X] is given by (1.20) with the sum taken over all sample values. Similarly, for a continuous rv X with a density fX, the expectation (when it is defined) is

    E[X] = ∫_{−∞}^{∞} x fX(x) dx.    (1.22)

For a non-negative rv X (discrete, continuous, or arbitrary), the expectation can also be expressed as the area under the complementary distribution function,

    E[X] = ∫_0^∞ [1 − FX(x)] dx.    (1.23)

Figure 1.3: For a non-negative discrete rv X with sample values a1 < a2 < a3 < a4, the complementary distribution function Pr{X > x} is 1 for x < a1, drops to 1 − pX(a1) at x = a1, and Pr{X > x} has similar drops as x reaches a2, a3, and a4. E[X], from (1.20), is Σ_i ai pX(ai), which is the sum of the rectangles in the figure. This is also the area under the curve 1 − FX(x), i.e., ∫_0^∞ [1 − FX(x)] dx. It can be seen that this argument applies to any non-negative rv, thus verifying (1.23).

This relationship is just as important for conceptual purposes as for computational purposes. To derive it for a discrete random variable, consider the sketch of the complementary distribution function, Pr{X > x} = 1 − FX(x), in Figure 1.3. The figure shows that Σ_i ai pX(ai) is equal to the area under the curve 1 − FX(x) from x = 0 to ∞. For an arbitrary nonnegative rv, we can visualize quantizing the distribution function, using the above argument, and then passing to the limit of arbitrarily fine quantizing. Since there are no mathematical subtleties in integrating an arbitrary non-negative decreasing function, this is not only useful in calculating expectations, but is also the fundamental way to define expectation. Note that it is possible for E[X] to be infinite, but this simply arises when the integral in (1.23) is infinite. One can also derive (1.23) from (1.22) by integration by parts,
1.3. PROBABILITY REVIEW
19
but it is the more fundamental graphical argument that shows that the complementary distribution function can always be integrated with either a finite or infinite value. For an arbitrary (rather than nonnegative) rv, the mean, as illustrated in Figure 1.4 and analyzed in Exercise 1.7, generalizes to

    E[X] = −∫_{−∞}^0 FX(x) dx + ∫_0^∞ [1 − FX(x)] dx.    (1.24)

Figure 1.4: The figure shows the complementary distribution function, 1 − FX(x), of a rv X for x between 0 and ∞ and shows the distribution function, FX(x), for x between −∞ and 0. The expectation of X, E[X], is the area under the curve on the right less the area under the curve on the left. If the area under each curve is finite, then E[X] is finite. If one area is finite and the other infinite, then E[X] is ±∞, and if both are infinite then E[X] is undefined.

Definition 1.6. The expectation of a rv X for which at least one of the two terms in (1.24) is finite is given by (1.24). If both these terms are infinite, the expectation is undefined.

Given this as the definition of expectation, one can use the arguments above to see that the formulas for discrete and continuous expectations in (1.20) and (1.22) respectively follow from this definition.
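The area interpretation in (1.23) and (1.24) is easy to check numerically. The sketch below (an illustration only; the exponential rv and the shift by 2 are arbitrary choices) integrates the complementary distribution function for a non-negative rv and then applies (1.24) to a rv taking both signs.

    import numpy as np

    # Non-negative example: X exponential with rate 1, so E[X] = 1 and
    # 1 - F_X(x) = exp(-x) for x >= 0; (1.23) is the integral of exp(-x).
    x = np.linspace(0, 50, 200_001)
    print(np.trapz(np.exp(-x), x))          # ~= 1.0

    # General example for (1.24): X = Z - 2 with Z exponential(1), so E[X] = -1.
    # F_X(x) = 1 - exp(-(x+2)) for x >= -2, and F_X(x) = 0 for x < -2.
    xn = np.linspace(-2, 0, 20_001)         # F_X on the negative axis
    xp = np.linspace(0, 50, 200_001)        # 1 - F_X on the positive axis
    left = np.trapz(1 - np.exp(-(xn + 2)), xn)
    right = np.trapz(np.exp(-(xp + 2)), xp)
    print(-left + right)                    # ~= -1.0, in agreement with (1.24)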
1.3.7
Random variables as functions of other random variables
Random variables (rv's) are often defined in terms of each other. For example, if g is a function from R to R and X is a rv, then Y = g(X) is the random variable that maps each sample point ω into the composite function g(X(ω)). As indicated in Exercise 1.15, one can find the expected value of Y (if it is defined) in either of the following ways:

    E[Y] = ∫_{−∞}^{∞} y dFY(y) = ∫_{−∞}^{∞} g(x) dFX(x).    (1.25)

The integral in this equation is called a Stieltjes integral,15 which for our purposes is shorthand for ∫ g(x) f(x) dx if a density f exists and Σ_x g(x) p(x) if a PMF p exists.

15 To be a little more careful, we can view ∫_{−∞}^{∞} g(x) dF(x) as lim_{δ→0} Σ_n g(nδ)[F(nδ) − F(nδ − δ)].
The integral is always linear in the sense that if g(x) = a(x) + b(x), then ∫ g(x) dF(x) = ∫ a(x) dF(x) + ∫ b(x) dF(x). Usually it can be calculated using some variation on (1.24).

Particularly important examples of such expected values are the moments E[X^n] of a rv X and the central moments E[(X − X̄)^n] of X, where X̄ is the mean E[X]. The second central moment is called the variance, denoted by VAR(X) or σX². It is given by

    VAR(X) = E[(X − X̄)²] = E[X²] − X̄².    (1.26)

The standard deviation of X, σX, is the square root of the variance and provides a measure of dispersion of the rv around the mean. Thus the mean is a rough measure of typical values for the outcome of the rv, and σX is a measure of the typical difference between X and X̄. There are other measures of typical value (such as the median and the mode) and other measures of dispersion, but mean and standard deviation have a number of special properties that make them important. One of these (see Exercise 1.20) is that E[X] is the value of x that minimizes E[(X − x)²].

If g is a strictly monotonic increasing function, then the distribution function of Y = g(X) can be found by noting that for any x, the event {X ≤ x} is the same as the event {Y ≤ g(x)}. Thus

    FY(g(x)) = FX(x)    or    FY(y) = FX(g⁻¹(y)).    (1.27)

If X is a continuous rv and g a continuously differentiable function, then the density of Y = g(X) is related to the density of X by

    g′(x) fY(g(x)) = fX(x)    or    fY(y) = fX(g⁻¹(y)) / g′(g⁻¹(y)).    (1.28)

Another way to see this is to observe that the probability that X lies between x and x + ε is, to first order, ε fX(x). This is the same as the probability that Y lies between g(x) and g(x + ε) ≈ g(x) + ε g′(x). Finally, if g is not monotonic then (1.27) must be modified to take account of the possibility that the event {Y ≤ y} corresponds to multiple intervals for X.
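As a concrete instance of (1.28) (an illustrative sketch; the exponential density and the map g(x) = x² are arbitrary choices), if X has density fX(x) = e^{−x} for x ≥ 0 and Y = X², then g′(x) = 2x and (1.28) gives fY(y) = e^{−√y}/(2√y) for y > 0. The Monte Carlo check below compares an interval probability computed from samples of Y with the integral of this density.

    import numpy as np

    rng = np.random.default_rng(1)
    y_samples = rng.exponential(1.0, 1_000_000) ** 2          # Y = X^2

    # Compare the empirical probability of an interval with the integral of
    # f_Y(y) = exp(-sqrt(y)) / (2 sqrt(y)) predicted by (1.28).
    a, b = 1.0, 4.0
    empirical = np.mean((y_samples >= a) & (y_samples <= b))
    y = np.linspace(a, b, 100_001)
    predicted = np.trapz(np.exp(-np.sqrt(y)) / (2 * np.sqrt(y)), y)
    print(empirical, predicted)    # both ~= exp(-1) - exp(-2) ~= 0.2325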
If X and Y are rv's, then the sum Z = X + Y is also a rv. To see this, note that X(ω) ∈ R and Y(ω) ∈ R for every ω ∈ Ω and thus, by the axioms of the real numbers, Z(ω) is in R and is thus finite.16 In the same way, the sum Sn = X1 + · · · + Xn of any finite collection of rv's is also a rv. If X and Y are independent, then the distribution function of Z is given by

    FZ(z) = ∫_{−∞}^{∞} FX(z − y) dFY(y) = ∫_{−∞}^{∞} FY(z − x) dFX(x).    (1.29)

If X and Y both have densities, this can be rewritten as

    fZ(z) = ∫_{−∞}^{∞} fX(z − y) fY(y) dy = ∫_{−∞}^{∞} fY(z − x) fX(x) dx.    (1.30)

16 We usually extend the definition of a rv by allowing it to be undefined over a set of probability 0, but the sum is then undefined only over a set of probability 0 also.
Eq. (1.30) is the familiar convolution equation from linear systems, and we similarly refer to (1.29) as the convolution of distribution functions (although it has a different functional form from (1.30)). If X and Y are non-negative random variables, then the integrands in (1.29) and (1.30) are non-zero only between 0 and z, so we often use 0 and z as the limits in (1.29) and (1.30). Note, however, that if Pr{Y = 0} ≠ 0, then the lower limit must be 0⁻, i.e., in terms of (1.29), it must include the jump in FY(y) at y = 0. One can avoid confusion of this type by always keeping infinite limits until actually calculating something.

If X1, X2, . . . , Xn are independent rv's, then the distribution of the rv Sn = X1 + X2 + · · · + Xn can be found by first convolving the distributions of X1 and X2 to get the distribution of S2 and then, for each i ≥ 2, convolving the distribution of Si and Xi+1 to get the distribution of Si+1. The distributions can be convolved in any order to get the same resulting distribution.

Whether or not X1, X2, . . . , Xn are independent, the expected value of Sn = X1 + X2 + · · · + Xn satisfies

    E[Sn] = E[X1 + X2 + · · · + Xn] = E[X1] + E[X2] + · · · + E[Xn].    (1.31)

This says that the expected value of a sum is equal to the sum of the expected values, whether or not the rv's are independent (see Exercise 1.11). The following example shows how this can be a valuable problem solving aid with an appropriate choice of rv's.

Example 1.3.4. In packet networks, a packet can be crudely modeled as a string of IID binary digits with Pr{0} = Pr{1} = 1/2. Packets are often separated from each other by a special bit pattern, 01111110, called a flag. If this special pattern appears within a packet, it could be interpreted as a flag indicating the end of the packet. To prevent this problem, an extra binary digit of value 0 is inserted after each appearance of 011111 in the original string (this can be deleted after reception). Suppose we want to find the expected number of inserted bits in a string of length n. For each position i ≥ 6 in the original string, define Xi as a rv whose value is 1 if an insertion occurs after the ith data bit. The total number of insertions is then just the sum of Xi from i = 6 to n inclusive. Since E[Xi] = 2^{−6}, the expected number of insertions is (n − 5)2^{−6}. Note that the positions in which the insertions occur are highly dependent, and the problem would be quite difficult if one didn't use (1.31) to avoid worrying about the dependence.

If the rv's X1, . . . , Xn are independent, then, as shown in Exercises 1.11 and 1.17, the variance of Sn = X1 + · · · + Xn is given by

    σSn² = Σ_{i=1}^n σXi².    (1.32)

If X1, . . . , Xn are also identically distributed (i.e., X1, . . . , Xn are IID) with variance σX², then σSn² = nσX². Thus the standard deviation of Sn is σSn = √n σX. Sums of IID rv's appear everywhere in probability theory and play an especially central role in the laws of large numbers. It is important to remember that, while the mean of Sn is increasing linearly with n, the standard deviation is increasing only with the square root of n. Figure 1.5 illustrates this behavior.
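The answer in Example 1.3.4 is easy to confirm by simulation (a sketch; the string length n = 1000, the number of trials, and the seed are arbitrary choices). Each window of six consecutive bits equal to 011111 corresponds to one insertion indicator Xi, and the sample mean of the total count should be close to (n − 5)2^{−6} even though the Xi are dependent.

    import numpy as np

    rng = np.random.default_rng(2)
    n, trials = 1000, 20_000
    pattern = np.array([0, 1, 1, 1, 1, 1])

    total = 0
    for _ in range(trials):
        bits = rng.integers(0, 2, n)
        # X_i = 1 when bits i-5..i (1-indexed) equal 011111, i.e., an insertion occurs.
        windows = np.lib.stride_tricks.sliding_window_view(bits, 6)
        total += np.sum(np.all(windows == pattern, axis=1))

    print(total / trials, (n - 5) * 2**-6)   # sample mean vs (n-5)/64 ~= 15.55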
Figure 1.5: The distribution function FSn(s) of the sum Sn of n IID rv's for n = 4, n = 20, and n = 50. The rv is binary with Pr{1} = 1/4, Pr{0} = 3/4. Note that the mean is increasing with n and the standard deviation with √n.
1.3.8
Conditional expectations
Just as the conditional distribution of one rv conditioned on a sample value of another rv is important, the conditional expectation of one rv based on the sample value of another is equally important. Initially let X be a positive discrete rv and let y be a sample value of another discrete rv Y such that pY(y) > 0. Then the conditional expectation of X given Y = y is defined to be

    E[X | Y=y] = Σ_x x pX|Y(x | y).    (1.33)

This is simply the ordinary expected value of X using the conditional probabilities in the reduced sample space corresponding to Y = y. It can be finite or infinite as before. More generally, if X can take on positive or negative values, then there is the possibility that the conditional expectation is undefined if the sum is +∞ over positive values of x and −∞ over negative values. In other words, for discrete rv's, the conditional expectation is exactly the same as the ordinary expectation, except that it is taken using conditional probabilities over the reduced sample space.

More generally yet, let X be a completely arbitrary rv and let y be a sample value of a discrete rv Y with pY(y) > 0. The conditional distribution function of X conditional on Y = y is defined as

    FX|Y(x | y) = Pr{X ≤ x, Y = y} / Pr{Y = y}.

Since this is an ordinary distribution function in the reduced sample space where Y = y,
1.3. PROBABILITY REVIEW
23
(1.24) expresses the expectation of X conditional on Y = y as

    E[X | Y=y] = −∫_{−∞}^0 FX|Y(x | y) dx + ∫_0^∞ [1 − FX|Y(x | y)] dx.    (1.34)

The forms of conditional expectation in (1.33) and (1.34) are given for individual sample values of Y for which pY(y) > 0. Thus each ω ∈ Ω (except perhaps a set of zero probability) maps into Y = y for some y with pY(y) > 0 and each such y corresponds to E[X | Y=y] for that y. Thus we can consider E[X | Y] to be a rv that is a function of Y, mapping ω into Y = y and hence into E[X | Y=y]. Regarding a conditional expectation as a rv that is a function of the conditioning rv is a powerful tool both in problem solving and in advanced work. For now, we use this to express the unconditional mean of X as E[X] = E[E[X | Y]], where the outer expectation is over the rv E[X | Y]. Again assuming X to be discrete, we can write out this expectation by using (1.33) for E[X | Y=y]:

    E[X] = E[E[X | Y]] = Σ_y pY(y) E[X | Y=y] = Σ_y pY(y) Σ_x x pX|Y(x|y).    (1.35)

Operationally, there is nothing very fancy here. Combining the sums, (1.35) simply says that E[X] = Σ_{y,x} x pY X(y, x). As a concept, however, viewing the conditional expectation as a rv based on the conditioning rv is often very useful in research. This approach is equally useful as a tool in problem solving, since there are many problems where it is easy to find the conditional expectations, and then to find the total expectation by (1.35). For this reason, this result is sometimes called the total expectation theorem. Exercise 1.16 illustrates the advantages of this approach, particularly where it is initially unclear whether or not the expectation is finite. The following example shows that this approach can sometimes hide convergence questions and give the wrong answer.

Example 1.3.5. Let Y be a geometric rv with the PMF pY(y) = 2^{−y} for integer y ≥ 1. Let X be an integer rv that, conditional on Y, is binary with equiprobable values ±2^y given Y = y. We then see that E[X | Y=y] = 0 for all y, and thus (1.35) indicates that E[X] = 0. On the other hand, it is easy to see that pX(2^k) = 2^{−k−1} for each integer k ≥ 1 and pX(−2^k) = 2^{−k−1} for each integer k ≥ 1. Thus the expectation over positive values of X is +∞ and that over negative values is −∞. In other words, the expected value of X is undefined and (1.35) is incorrect.

The difficulty in the above example cannot occur if X is a non-negative rv. Then (1.35) is simply a sum of a countable number of non-negative terms, and thus it either converges to a finite sum independent of the order of summation, or it diverges to ∞, again independent of the order of summation. If X has both positive and negative components, we can separate it into X = X⁺ + X⁻ where X⁺ = max(0, X) and X⁻ = min(X, 0). Then (1.35) applies to X⁺ and −X⁻ separately. If at most one of E[X⁺] and E[−X⁻] is infinite, then (1.35) applies to X, and otherwise E[X] is undefined. This is summarized in the following theorem:
24
CHAPTER 1. INTRODUCTION AND REVIEW OF PROBABILITY
Theorem 1.1 (Total expectation). Let X and Y be discrete rv's. If X is non-negative, then E[X] = E[E[X | Y]] = Σ_y pY(y) E[X | Y=y]. If X has both positive and negative values, and if at most one of E[X⁺] and E[−X⁻] is infinite, then E[X] = E[E[X | Y]] = Σ_y pY(y) E[X | Y=y].
We have seen above that if Y is a discrete rv, then the conditional expectation E[X | Y=y] is little more complicated than the unconditional expectation, and this is true whether X is discrete, continuous, or arbitrary. If X and Y are continuous, we can essentially extend these results to probability densities. In particular,

    E[X | Y=y] = ∫_{−∞}^{∞} x fX|Y(x | y) dx,    (1.36)

and

    E[X] = ∫_{−∞}^{∞} fY(y) E[X | Y=y] dy = ∫_{−∞}^{∞} fY(y) ∫_{−∞}^{∞} x fX|Y(x | y) dx dy.    (1.37)
We do not state this as a theorem because the details about the integration do not seem necessary for the places where it is useful.
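The total expectation theorem is easy to exercise numerically. In the sketch below (an illustration; the particular joint distribution is an arbitrary choice), Y is uniform on {1, . . . , 6} and X, conditional on Y = y, is binomial with parameters (y, 1/2), so E[X | Y=y] = y/2 and the theorem gives E[X] = E[Y]/2 = 1.75.

    import numpy as np

    rng = np.random.default_rng(3)
    trials = 1_000_000
    Y = rng.integers(1, 7, trials)                  # Y uniform on {1,...,6}
    X = rng.binomial(Y, 0.5)                        # X | Y=y  ~ binomial(y, 1/2)

    # Direct sample average of X versus sum_y p_Y(y) E[X | Y=y] from (1.35).
    direct = X.mean()
    via_conditioning = sum((1 / 6) * (y / 2) for y in range(1, 7))
    print(direct, via_conditioning)                 # both ~= 1.75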
1.3.9
Indicator random variables
For any event A, the indicator random variable of A, denoted IA, is a binary rv that has the value 1 for all ω ∈ A and the value 0 otherwise. Thus, as illustrated in Figure 1.6, the distribution function FIA(x) is 0 for x < 0, 1 − Pr{A} for 0 ≤ x < 1, and 1 for x ≥ 1. It is obvious, by comparing Figures 1.6 and 1.3, that E[IA] = Pr{A}.
Figure 1.6: The distribution function of an indicator random variable. Indicator rv’s are useful because they allow us to apply the many known results about rv’s and expectations to events. For example, the laws of large numbers are expressed in terms of sums of rv’s, and those results all translate into results about relative frequencies through the use of indicator functions.
1.3.10
Transforms
The moment generating function (mgf) for a rv X is given by

    gX(r) = E[e^{rX}] = ∫_{−∞}^{∞} e^{rx} dFX(x).    (1.38)
Viewing r as a real variable, we see that for r > 0, gX(r) only exists if 1 − FX(x) approaches 0 at least exponentially as x → ∞. Similarly, for r < 0, gX(r) exists only if FX(x) approaches 0 at least exponentially as x → −∞. If gX(r) exists in a region of r around 0, then derivatives of all orders exist, given by

    ∂^n gX(r)/∂r^n = ∫_{−∞}^{∞} x^n e^{rx} dFX(x) ;    ∂^n gX(r)/∂r^n |_{r=0} = E[X^n].    (1.39)

This shows that finding the moment generating function often provides a convenient way to calculate the moments of a random variable. Another convenient feature of moment generating functions is their use in dealing with sums of independent rv's. For example, suppose S = X1 + X2 + · · · + Xn. Then

    gS(r) = E[e^{rS}] = E[exp(Σ_{i=1}^n r Xi)] = E[Π_{i=1}^n exp(r Xi)] = Π_{i=1}^n gXi(r).    (1.40)

In the last step here, we have used a result of Exercise 1.11, which shows that for independent rv's, the mean of the product is equal to the product of the means. If X1, . . . , Xn are also IID, then

    gS(r) = [gX(r)]^n.    (1.41)
The variable r in the mgf can also be viewed as a complex variable, giving rise to a number of other transforms. A particularly important case is to view r as a pure imaginary variable, say iω where i = √−1 and ω is real. The mgf is then called the characteristic function. Since |e^{iωx}| is 1 for all x, gX(iω) must always exist, and its magnitude is at most one. Note that gX(iω) is the inverse Fourier transform of the density of X. The Z-transform is the result of replacing e^r with z in gX(r). This is useful primarily for integer valued rv's, but if one transform can be evaluated, the other can be found immediately. Finally, if we use −s, viewed as a complex variable, in place of r, we get the two-sided Laplace transform of the density of the random variable. Note that for all of these transforms, multiplication in the transform domain corresponds to convolution of the distribution functions or densities, and thus to summation of independent rv's. It is the simplicity of taking products of transforms that makes transforms so useful in probability theory. We will use transforms sparingly here, since, along with simplifying computational work, they sometimes obscure underlying probabilistic insights.
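As a small numerical illustration of (1.39) and (1.41) (a sketch; the exponential rv with rate λ = 2 is an arbitrary choice), recall that an exponential rv with rate λ has mgf gX(r) = λ/(λ − r) for r < λ. Differentiating numerically at r = 0 recovers the first two moments, and raising gX(r) to the nth power gives the mgf of an IID sum.

    import numpy as np

    lam = 2.0
    g = lambda r: lam / (lam - r)       # mgf of an exponential rv, valid for r < lam

    # First and second derivatives at r = 0 by central differences, as in (1.39).
    h = 1e-4
    m1 = (g(h) - g(-h)) / (2 * h)                       # ~= E[X]   = 1/lam = 0.5
    m2 = (g(h) - 2 * g(0.0) + g(-h)) / h**2             # ~= E[X^2] = 2/lam^2 = 0.5
    print(m1, m2)

    # For S = X_1 + ... + X_n IID, g_S(r) = [g_X(r)]^n as in (1.41); check E[S] = n/lam.
    n, r = 5, 1e-4
    gS = lambda r: g(r) ** n
    print((gS(r) - gS(-r)) / (2 * r))                   # ~= n / lam = 2.5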
1.4
The laws of large numbers
The laws of large numbers are a collection of results in probability theory that describe the behavior of the arithmetic average of n rv's for large n. For any n rv's, X1, . . . , Xn, the arithmetic average is the rv (1/n) Σ_{i=1}^n Xi. Since in any outcome of the experiment, the sample value of this rv is the arithmetic average of the sample values of X1, . . . , Xn, this random variable is called the sample average. If X1, . . . , Xn are viewed as successive variables in time, this sample average is called the time-average. Under fairly general
assumptions, the standard deviation of the sample average goes to 0 with increasing n, and, in various ways, depending on the assumptions, the sample average approaches the mean. These results are central to the study of stochastic processes because they allow us to relate time-averages (i.e., the average over time of individual sample paths) to ensemble-averages (i.e., the mean of the value of the process at a given time). In this and the next section, we discuss two of these results, the weak and the strong law of large numbers for independent identically distributed rv’s. The strong law requires considerable patience to understand, but it is a basic and useful result in studying stochastic processes. We start with some basic inequalities which are useful in their own right and which lead us directly to the weak law of large numbers.
1.4.1
Basic inequalities
Inequalities play an unusually important role in probability theory, perhaps since so much of the theory is based on mathematical analysis, or perhaps because many of the problems are so complex that important quantities can not be evaluated exactly, but can only be bounded. One of the simplest and most basic of these inequalities is the Markov inequality, which states that if a non-negative random variable Y has a mean E[Y], then the probability that Y exceeds any given number y satisfies

    Pr{Y ≥ y} ≤ E[Y]/y    (Markov inequality).    (1.42)
Figure 1.7 derives this result using the fact (see Figure 1.3) that the mean of a non-negative rv is the integral of its complementary distribution function, i.e., of the area under the curve Pr{Y > z}. Exercise 1.8 gives another simple proof using an indicator random variable.
Figure 1.7: Demonstration that y Pr{Y ≥ y} ≤ E[Y]: the rectangle of area y Pr{Y ≥ y} lies under the curve Pr{Y ≥ y}, and the total area under that curve is E[Y].

As an example of this inequality, assume that the average height of a population of people is 1.6 meters. Then the Markov inequality states that at most half of the population have a height exceeding 3.2 meters. We see from this example that the Markov inequality is often very weak. However, for any y > 0, we can consider a rv that takes on the value y with probability ε and the value 0 with probability 1 − ε; this rv satisfies the Markov inequality at the point y with equality. Another application of Figure 1.7 is the observation that, for any given non-negative rv Y with finite mean, (i.e., with finite area under the curve
Pr{Y ≥ y}),

    lim_{y→∞} y Pr{Y ≥ y} = 0.    (1.43)

This is explained more fully by Exercise 1.31 and will be useful shortly in the proof of Theorem 1.3.

We now use the Markov inequality to establish the probably better known Chebyshev inequality. Let Z be an arbitrary rv with finite mean E[Z] and finite variance σZ², and define Y as the non-negative rv Y = (Z − E[Z])². Thus E[Y] = σZ². Applying (1.42),

    Pr{(Z − E[Z])² ≥ y} ≤ σZ²/y.

Replacing y with ε² (for any ε > 0) and noting that the event {(Z − E[Z])² ≥ ε²} is the same as {|Z − E[Z]| ≥ ε}, this becomes

    Pr{|Z − E[Z]| ≥ ε} ≤ σZ²/ε²    (Chebyshev inequality).    (1.44)
Note that the Markov inequality bounds just the upper tail of the distribution function and applies only to non-negative rv's, whereas the Chebyshev inequality bounds both tails of the distribution function. The more important difference, however, is that the Chebyshev bound goes to zero inversely with the square of the distance from the mean, whereas the Markov bound goes to zero inversely with the distance from 0 (and thus asymptotically with distance from the mean).

There is another variant of the Markov inequality, known as an exponential bound or Chernoff bound, in which the bound goes to 0 exponentially with distance from the mean. Let Y = exp(rZ) for some arbitrary rv Z that has a moment generating function gZ(r) = E[exp(rZ)] over some open interval of real values of r including r = 0. Then, for any r in that interval, (1.42) becomes

    Pr{exp(rZ) ≥ y} ≤ gZ(r)/y.

Letting y = exp(ra) for some constant a, we have the following two inequalities,

    Pr{Z ≥ a} ≤ gZ(r) exp(−ra) ;    (Exponential bound for r ≥ 0).    (1.45)

    Pr{Z ≤ a} ≤ gZ(r) exp(−ra) ;    (Exponential bound for r ≤ 0).    (1.46)

Note that the right hand side of (1.45) is one for r = 0 and its derivative with respect to r is Z̄ − a at r = 0. Thus, for any a > Z̄, this bound is less than one for small enough r > 0, i.e.,

    Pr{Z ≥ a} ≤ gZ(r) exp(−ra) < 1 ;    for a > Z̄ and sufficiently small r > 0.    (1.47)
Note that for fixed r > 0, this bound decreases exponentially with increasing a. Similarly, for r < 0 and a < Z̄, (1.46) satisfies

    Pr{Z ≤ a} ≤ gZ(r) exp(−ra) < 1 ;    for a < Z̄ and r < 0 sufficiently close to 0.    (1.48)
The bound in (1.46) is less than 1 for negative r sufficiently close to 0. This bound also decreases exponentially with decreasing a. Both these bounds can be optimized over r to get the strongest bound for any given a. These bounds will be used in the next section to prove the strong law of large numbers. They will also be used extensively in Chapter 7 and are useful for detection, random walks, and information theory.
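The difference in the decay rates of the three bounds is easy to see numerically (a sketch; the exponential rv with rate 1 and the threshold a = 10 are arbitrary choices). For Z exponential with rate 1, E[Z] = 1, σZ² = 1, gZ(r) = 1/(1 − r) for r < 1, and the exact tail is Pr{Z ≥ a} = e^{−a}.

    import numpy as np

    a = 10.0                                 # threshold well above the mean E[Z] = 1
    exact = np.exp(-a)                       # Pr{Z >= a} for an exponential(1) rv

    markov = 1.0 / a                               # (1.42) with E[Z] = 1
    chebyshev = 1.0 / (a - 1.0) ** 2               # (1.44) with sigma^2 = 1 and eps = a - 1
    # (1.45) optimized over 0 < r < 1: g_Z(r) e^{-ra} = e^{-ra}/(1-r), minimized at r = 1 - 1/a.
    r = 1.0 - 1.0 / a
    chernoff = np.exp(-r * a) / (1.0 - r)

    print(exact, markov, chebyshev, chernoff)
    # ~= 4.5e-05, 0.1, 0.012, 0.0012 -- only the exponential bound decays exponentially in a.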
1.4.2
Weak law of large numbers with a finite variance
Let X1, X2, . . . , Xn be IID rv's with a finite mean X̄ and finite variance σX², let Sn = X1 + · · · + Xn, and consider the sample average Sn/n. We saw in (1.32) that σSn² = nσX². Thus the variance of Sn/n is

    VAR(Sn/n) = E[(Sn/n − X̄)²] = (1/n²) E[(Sn − nX̄)²] = σ²/n.    (1.49)

This says that the standard deviation of the sample average Sn/n is σ/√n, which approaches 0 as n increases. Figure 1.8 illustrates this decrease in the standard deviation of Sn/n with increasing n. In contrast, recall that Figure 1.5 illustrated how the standard deviation of Sn increases with n. From (1.49), we see that

    lim_{n→∞} E[(Sn/n − X̄)²] = 0.    (1.50)

As a result, we say that Sn/n converges in mean square to X̄. This convergence in mean square essentially says that the sample average approaches the mean with increasing n. This connection of the sample average (which is a random variable) to the mean (which is a fixed number) is probably the most important conceptual result in probability theory. There are many ways of expressing this concept and many ways of relaxing the IID and finite variance conditions. Perhaps the most insightful way of expressing the concept is the following weak law of large numbers (it is called a weak law to contrast it with the strong law of large numbers, which will be discussed later in this section).

Theorem 1.2 (Weak law with finite variance). Let Sn = X1 + · · · + Xn be the sum of n IID rv's with a finite variance. Then the following equivalent conditions hold: first,

    lim_{n→∞} Pr{|Sn/n − X̄| ≥ ε} = 0    for every ε > 0.    (1.51)

Second, a real-valued function f(ε, δ) exists such that for every ε > 0 and every δ > 0,

    Pr{|Sn/n − X̄| ≥ ε} ≤ δ    for all n ≥ f(ε, δ).    (1.52)
Figure 1.8: The same distribution as Figure 1.5, scaled differently to give the distribution function of the sample average Yn = Sn/n. Note that as n increases, the distribution function of Yn slowly becomes closer to a unit step at the mean, 0.25, of the variables X being summed.

Discussion: For any arbitrarily small ε > 0, (1.51) says that lim_{n→∞} Pr{An} = 0, where An is the event that the sample average Sn/n differs from the true mean by more than ε. This means (see Figure 1.9) that the distribution function of Sn/n approaches a unit step function with the step at X̄ as n → ∞. Because of (1.51), Sn/n is said to converge to E[X] in probability. Figure 1.9 also illustrates how the ε and δ of (1.52) control the approximation of FSn/n for finite n by a unit step.

Proof: For any given ε > 0, we can apply the Chebyshev inequality, (1.44), to the sample average Sn/n, getting

    Pr{|Sn/n − X̄| ≥ ε} ≤ σ²/(nε²),    (1.53)

where σ² is the variance of each Xi. For the given ε > 0, the righthand side of (1.53) is decreasing toward 0 as 1/n with increasing n. Since Pr{|Sn/n − X̄| ≥ ε} is sandwiched between 0 and a quantity approaching 0 as 1/n, (1.51) is satisfied. To make this particularly clear, note that for a given ε, Pr{|Sn/n − X̄| ≥ ε} represents a sequence of real numbers, one for each positive integer n. The limit of a sequence of real numbers a1, a2, . . . is defined to exist17 and equal some number b if for each δ > 0, there is an f(δ) such that |b − an| ≤ δ for all n ≥ f(δ). For the limit here, b = 0, an = Pr{|Sn/n − X̄| ≥ ε} and, from (1.53), f(δ) can be taken to satisfy δ = σ²/(f(δ)ε²). Thus we see that (1.52) simply spells out the definition of the limit in (1.51).

One might reasonably ask at this point what (1.52) adds to the more specific statement in (1.53). In particular, (1.53) shows that the function f(ε, δ) in (1.52) can be taken to be σ²/(ε²δ),

17 Exercise 1.9 is designed to help those unfamiliar with limits to understand this definition.
Figure 1.9: Approximation of the distribution function FSn/n of a sample average by a step function at the mean. From (1.51), the probability that Sn/n differs from X̄ by more than ε (i.e., Pr{|Sn/n − X̄| ≥ ε}) approaches 0 with increasing n. This means that the limiting distribution function goes from 0 to 1 as x goes from X̄ − ε to X̄ + ε. Since this is true for every ε > 0 (presumably with slower convergence as ε gets smaller), FSn/n approaches a unit step at X̄. Note the rectangle of width 2ε and height 1 − δ in the figure. The meaning of (1.52) is that the distribution function of Sn/n enters this rectangle from below and exits from above, thus approximating a unit step. Note that there are two ‘fudge factors’ here, ε and δ, and, since we are approximating an entire distribution function, neither can be omitted, except by going to a limit as n → ∞.
which provides added information on the rapidity of convergence. The reason for (1.52) is that it remains valid when the theorem is generalized. For variables that are not IID or have an infinite variance, Eq. (1.53) is no longer necessarily valid. For the Bernoulli process of Example 1.3.1, the binomial PMF can be compared with the upper bound in (1.55), and in fact the example in Figure 1.8 uses the binomial PMF. Exact calculation, and bounds based on the binomial PMF, provide a tighter bound than (1.55), but the result is still proportional to σ², 1/n, and 1/ε².
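A short simulation makes the convergence in (1.51) and the 1/n behavior of the Chebyshev bound (1.53) visible (a sketch; the binary rv with Pr{1} = 1/4 matches Figure 1.8, while ε and the sample sizes are arbitrary choices).

    import numpy as np

    rng = np.random.default_rng(4)
    p, eps, trials = 0.25, 0.05, 200_000
    var = p * (1 - p)                                  # sigma^2 = 3/16

    for n in (4, 20, 50, 200, 1000):
        sample_avg = rng.binomial(n, p, trials) / n    # S_n / n for many independent runs
        prob = np.mean(np.abs(sample_avg - p) >= eps)  # Pr{|S_n/n - mean| >= eps}
        print(n, prob, var / (n * eps**2))             # empirical value vs Chebyshev bound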
1.4.3
Relative frequency
We next show that (1.52) and (1.51) can be applied to the relative frequency of an event as well as to the sample average of a random variable. Suppose that A is some event in a single experiment, and that the experiment is independently repeated n times. Then, in the probability model for the n repetitions, let Ai be the event that A occurs at the ith trial, 1 ≤ i ≤ n. The events A1 , A2 , . . . , An are then IID.
If we let IAi be the indicator rv for A on the ith trial, then the rv Sn = IA1 + IA2 + · · · + IAn is the number of occurrences of A over the n trials. It follows that

    relative frequency of A = Sn/n = (Σ_{i=1}^n IAi)/n.    (1.54)

Thus the relative frequency of A is the sample average of IA and, from (1.52),

    Pr{|relative frequency of A − Pr{A}| ≥ ε} ≤ σ²/(nε²),    (1.55)
where σ² is the variance of IA, i.e., σ² = Pr{A}(1 − Pr{A}). The main point here is that everything we learn about sums of IID rv's applies equally to the relative frequency of IID events. In fact, since the indicator rv's are binary, everything known about the binomial PMF also applies. Relative frequency is extremely important, but we need not discuss it much in what follows, since it is simply a special case of sample averages.
1.4.4
The central limit theorem
The law of large numbers says that Sn/n is close to X̄ with high probability for large n, but this most emphatically does not mean that Sn is close to nX̄. In fact, the standard deviation of Sn is σ√n, which increases with n. This leads us to ask about the behavior of Sn/√n, since its standard deviation is σ and does not vary with n. The celebrated central limit theorem answers this question by stating that if σ² < ∞, then for every real number y,

    lim_{n→∞} Pr{ (Sn − nX̄)/(√n σ) ≤ y } = ∫_{−∞}^y (1/√(2π)) exp(−x²/2) dx.    (1.56)

The quantity on the right side of (1.56) is the distribution function of a normalized Gaussian rv, which is known to have mean 0 and variance 1. The sequence of rv's Zn = (Sn − nX̄)/(√n σ) on the left also have mean 0 and variance 1 for all n. The central limit theorem, as expressed in (1.56), says that the sequence of distribution functions, FZ1(y), FZ2(y), . . . converges at each value of y to FΦ(y) as n → ∞, where FΦ(y) is the distribution function on the right side of (1.56). In other words, lim_{n→∞} FZn(y) = FΦ(y) for each y ∈ R. This is called convergence in distribution, since it is the sequence of distribution functions that is converging. The theorem is illustrated by Figure 1.10.

It is known that if |X| has a finite third moment, then for each y ∈ R, FZn(y) converges with increasing n to FΦ(y) as 1/√n. Without a third moment the convergence still takes place, but perhaps very slowly. The word central appears in the name central limit theorem because FZn(y) is usually much better approximated by FΦ(y) when the magnitude of y is relatively small, say y ∈ (−3, 3). Part of the reason for this is that when y is very negative, FΦ(y) is extremely small (for example FΦ(−3) = 0.00135). Approximating FZn(y) by FΦ(y) is then only going to be meaningful if |FZn(y) − FΦ(y)| is small relative to FΦ(y). For y = −3, say, this requires n to be huge. The same problem occurs for large y, since 1 − FΦ(y) is extremely small and is the quantity of interest for large y. Finding how FZn(y) behaves for large n and large |y| is a part of the theory of large deviations, which will be introduced in Chapter 7.

The central limit theorem (CLT) helps explain why Gaussian random variables play such a central role in probability theory. In fact, many of the cookbook formulas of elementary statistics are based on the tacit assumption that the underlying variables are Gaussian, and the CLT helps explain why these formulas often give reasonable results.

One should be careful to avoid reading more into the CLT than it says. For example, the normalized sum, (Sn − nX̄)/(√n σ), need not have a density that is approximately Gaussian.
Figure 1.10: The same distribution functions as Figure 1.5 normalized to 0 mean and unit standard deviation, i.e., the distribution functions of Zn = (Sn − E[Sn])/(σX √n) for n = 4, 20, 50. Note that as n increases, the distribution function of Zn slowly starts to resemble the normal distribution function.

In fact, if the underlying variables are discrete, the normalized sum is also, and does not have a density at all. What is happening is that the normalized sum can have very detailed fine structure; this does not disappear as n increases, but becomes “integrated out” in the distribution function. We will not use the CLT extensively here, and will not prove it (see Feller [8] for a thorough and careful exposition on various forms of the central limit theorem). Giving a proof from first principles is quite tricky and uses analytical tools that will not be needed subsequently; many elementary texts on probability give “heuristic” proofs indicating that the normalized sum has a density that tends to Gaussian (thus indicating that both the heuristics and the manipulations are wrong).

The central limit theorem gives us very explicit information on how the distribution function of the sample average, FSn/n, converges to a unit step at X̄. It says that as n increases, FSn/n becomes better approximated by the S-shaped Gaussian distribution function, where with increasing n, the standard deviation of the Gaussian curve decreases as 1/√n. Thus not only do we know that FSn/n is approaching a unit step, but we know the shape of FSn/n as it approaches this step. Thus it is reasonable to ask why we emphasize the weak law of large numbers when the central limit theorem is so much more explicit. There are three answers—the first is that sometimes one doesn't care about the added information in (1.56), and that the added information obscures some issues of interest. The next is that (1.56) is only meaningful if the variance σ² of X is finite, whereas, as we soon show, (1.51) holds whether or not the variance is finite. One might think that variables with infinite variance are of purely academic interest. However, whether or not a variance is finite depends on the far tails of a distribution, and thus results are considerably more robust when a finite variance is not required.

The third reason for interest in laws of large numbers is that they hold in many situations
1.4. THE LAWS OF LARGE NUMBERS
33
more general than that of sums of IID random variables18 . These more general laws of large numbers can usually be interpreted as saying that a time-average (i.e., an average over time of a sample function of a stochastic process) approaches the expected value of the process at a given time. Since expected values only exist within a probability model, but time-averages can be evaluated from a sample function of the actual process being modeled, the relationship between time-averages and expected values is often the link between a model and the reality being modeled. These more general laws often go under the term ergodic theory.
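The convergence in (1.56) can be watched directly by simulation (a sketch; the binary rv with Pr{1} = 1/4 again matches Figures 1.5 and 1.10, and the grid of y values is an arbitrary choice). For each n, the empirical distribution function of Zn is compared with the Gaussian distribution function Φ(y) at a few points.

    import numpy as np
    from math import erf, sqrt

    rng = np.random.default_rng(5)
    p, trials = 0.25, 400_000
    phi = lambda y: 0.5 * (1 + erf(y / sqrt(2)))       # standard normal distribution function

    for n in (4, 20, 50):
        s = rng.binomial(n, p, trials)                 # S_n for many independent runs
        z = (s - n * p) / np.sqrt(n * p * (1 - p))     # Z_n = (S_n - n*mean)/(sqrt(n) sigma)
        for y in (-1.0, 0.0, 1.0):
            print(n, y, np.mean(z <= y), phi(y))       # empirical F_{Z_n}(y) versus Phi(y)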
1.4.5
Weak law with an infinite variance
We now establish the law of large numbers without assuming a finite variance.

Theorem 1.3 (Weak Law of Large Numbers). Let Sn = X1 + · · · + Xn where X1, X2, . . . are IID rv's with a finite mean E[X]. Then for any ε > 0,

    lim_{n→∞} Pr{|Sn/n − E[X]| ≥ ε} = 0.    (1.57)

Outline of Proof: We use a truncation argument; such arguments are used frequently in dealing with rv's that have infinite variance. Let b be a real number (which we later take to be increasing with n), and for each variable Xi, define a new rv X̆i (see Figure 1.11) by

    X̆i = Xi             for E[X] − b ≤ Xi ≤ E[X] + b,
    X̆i = E[X] + b    for Xi > E[X] + b,    (1.58)
    X̆i = E[X] − b    for Xi < E[X] − b.
Figure 1.11: The truncated rv X̆ for a given rv X has a distribution function which is truncated at X̄ ± b.

The truncated variables X̆i are IID with a finite second moment. Also the first moment approaches the mean of the original variables Xi with increasing b. Thus, as shown in Exercise 1.26, the sum S̆n = X̆1 + · · · + X̆n of the truncated rv's satisfies the following type of weak law of large numbers:

    Pr{|S̆n/n − E[X]| ≥ ε} ≤ 8bE[|X|]/(nε²)    (1.59)

18 Central limit theorems also hold in many of these more general situations, but they usually do not have quite the generality of the laws of large numbers.
for sufficiently large b.

The original sum Sn is the same as S̆n unless one of the Xi has an outage, i.e., |Xi − X̄| > b, so, using the union bound, Pr{Sn ≠ S̆n} ≤ n Pr{|Xi − X̄| > b}. Combining with (1.59),

    Pr{|Sn/n − E[X]| ≥ ε} ≤ 8bE[|X|]/(nε²) + (n/b)[b Pr{|X − E[X]| > b}].    (1.60)

The trick of the proof is to show that by letting b and n both approach infinity in the appropriate ratio, both of these terms go to 0, establishing the theorem. The way to do this is to let g(b) = b Pr{|X − X̄| > b}. From (1.43), lim_{b→∞} g(b) = 0. Choosing n by b/n = √g(b), we see that both terms in (1.60) approach 0 with increasing b and n, completing the proof.
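The point of Theorem 1.3 is that the sample average concentrates even when the variance is infinite. The sketch below (an illustration; the particular heavy-tailed rv is an arbitrary choice) uses a rv with density 2x^{−3} for x ≥ 1, which has mean 2 but infinite variance, and estimates the probability in (1.57) for increasing n.

    import numpy as np

    rng = np.random.default_rng(6)
    trials, eps = 2000, 0.1

    def heavy_tail(size):
        # Inverse-CDF sampling of f(x) = 2 x^(-3), x >= 1: F(x) = 1 - x^(-2), mean 2, infinite variance.
        return (1 - rng.random(size)) ** -0.5

    for n in (100, 1000, 10_000):
        avg = heavy_tail((trials, n)).mean(axis=1)             # S_n / n for many runs
        print(n, np.mean(np.abs(avg - 2.0) >= eps))            # Pr{|S_n/n - E[X]| >= eps} -> 0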
1.4.6
Strong law of large numbers (SLLN)
We next discuss the strong law of large numbers. We do not have the mathematical tools to prove the theorem in its full generality, but will give a fairly insightful proof under the additional assumption that the rv under discussion has a moment generating function.

Theorem 1.4 (SLLN (Version 1)). Let Sn = X1 + · · · + Xn where X1, X2, . . . are IID rv's with a finite mean X̄. Then for every ε > 0,

    lim_{n→∞} Pr{ ⋃_{m≥n} {|Sm/m − X̄| > ε} } = 0.    (1.61)

As discussed later, an equivalent statement is

    Pr{ ⋂_{n≥1} ⋃_{m≥n} {|Sm/m − X̄| > ε} } = 0.    (1.62)
For those who do not eat countable unions and intersections for breakfast, we describe what these equations mean before proving the theorem. This understanding is perhaps as difficult as following the proof. For brevity, assuming some fixed ε, let

    Am = {|Sm/m − X̄| > ε}.    (1.63)

This is the event that the sample average for m trials differs from the mean by more than the given ε. The weak law asserts that lim_{n→∞} Pr{An} = 0. The strong law (in the form (1.61)) asserts that lim_{n→∞} Pr{⋃_{m≥n} Am} = 0. This means that not only is An unlikely, but that all subsequent Am are collectively unlikely.

Example 1.4.1. The difference between lim_{n→∞} Pr{⋃_{m≥n} Am} and lim_{n→∞} Pr{An} can be more easily understood by looking at a simpler sequence of events than those in (1.63). Let U be a uniformly distributed random variable over the interval [0, 1). Let An, n ≥ 1, be a sequence of events locating the sample value of U with greater and greater precision.
In particular, let A1 = {0 ≤ U < 1} be the entire interval, A2 = {0 ≤ U < 1/2} and A3 = {1/2 ≤ U < 1} be the two half intervals. Then let A4 = {0 ≤ U < 1/4}, . . . , A7 = {3/4 ≤ U < 1} be the quarter intervals, etc. For each positive integer k, then, Pr{An} = 2^{−k} for n in the range 2^k ≤ n < 2^{k+1}. We see that lim_{n→∞} Pr{An} = 0. On the other hand, for each k, any given sample value u must lie in An for some n between 2^k and 2^{k+1}. In other words, ⋃_{n=2^k}^{2^{k+1}−1} An = Ω. It follows that Pr{⋃_{m≥n} Am} = 1 for all n. Taking the limit as n → ∞, lim_{n→∞} Pr{⋃_{m≥n} Am} = 1.

The example shows that, for this sequence of events, there is a considerable difference between the probability of the nth event and the collective probability of one or more events from n on.

We next want to understand why (1.61) and (1.62) say the same thing. Let Bn = ⋃_{m≥n} Am. The sequence {Bn; n ≥ 1} is a sequence of unions, and each event Bn is modified from the previous event Bn−1 by the removal of An−1. Thus these unions are nested, with each one contained in the previous one, i.e., Bn ⊆ Bn−1. It follows from this that Bn = ⋂_{k=1}^n Bk for each n, and thus

    Pr{Bn} = Pr{ ⋂_{k=1}^n Bk }.    (1.64)

Since {Bn; n ≥ 1} is nested, Pr{Bn} is nonincreasing in n. Since Pr{Bn} is also nonnegative for each n, the limit exists and (1.61) says that lim_{n→∞} Pr{Bn} = 0. The righthand side of (1.64), as a function of n, is the same sequence of numbers, which must then also approach 0. This is (1.62), which is thus equivalent to (1.61).

The event ⋂_n Bn = ⋂_n ⋃_{m≥n} Am occurs frequently in probability theory and is called the lim sup of the sequence {An; n ≥ 1}. This event is the set of sample points which are in each Bn, and which thus, for each n, occur in some Am with m ≥ n. This means that if ω ∈ ⋂_n Bn, it must occur in an infinite collection of the Am. For Am = {|Sm/m − X̄| > ε}, these sample points correspond to sample averages more than ε from the mean for an infinite number of m. The SLLN thus says that the collective probability of these sample points is 0.

We now prove Theorem 1.4 for the special case where the moment generating function gX(r) exists around r = 0. The proof is based on the following lemmas:

Lemma 1.1. Assume that gX(r) exists around r = 0 and let Am = {|Sm/m − X̄| > ε} for each m ≥ 1. Then Σ_{m=1}^∞ Pr{Am} < ∞.

Proof: First we break Pr{Am} into two terms,

    Pr{Am} = Pr{|Sm/m − X̄| > ε} = Pr{Sm/m − X̄ > ε} + Pr{Sm/m − X̄ < −ε}.    (1.65)

From (1.41), the moment generating function of Sm is [gX(r)]^m. We can now apply the
exponential bound of (1.45) to the first term above,

    Pr{Sm/m − X̄ > ε} = Pr{Sm > mX̄ + mε} ≤ gSm(r) e^{−mrX̄−mrε} = [gX(r) e^{−rX̄−rε}]^m.

As pointed out in (1.47), gX(r) e^{−rX̄−rε} < 1 for small enough r > 0. Letting α = gX(r) e^{−rX̄−rε} for some such r, it follows that 0 < α < 1 and

    Pr{Sm/m − X̄ > ε} ≤ α^m.    (1.66)

Using (1.48) in the same way on the second term in (1.65), there is some β, 0 < β < 1, such that

    Pr{Sm/m − X̄ < −ε} ≤ β^m.

Substituting these inequalities in (1.65), Pr{Am} ≤ α^m + β^m. Thus

    Σ_{m=1}^∞ Pr{Am} ≤ Σ_{m=1}^∞ (α^m + β^m) = α/(1−α) + β/(1−β) < ∞,

completing the proof.
Lemma 1.2. (Borel-Cantelli) Let A1, A2, . . . be a sequence of arbitrary events for which Σ_{m=1}^∞ Pr{Am} < ∞. Then

    Pr{ ⋂_{n=1}^∞ ⋃_{m=n}^∞ Am } = 0.    (1.67)

Proof: The hypothesis Σ_{m=1}^∞ Pr{Am} < ∞ means that Σ_{m=1}^n Pr{Am} converges to a finite limit as n → ∞, and thus that

    lim_{n→∞} Σ_{m=n}^∞ Pr{Am} = 0.    (1.68)

Upper bounding the probability of an intersection of events by the probability of any term in that intersection, we get, for each n ≥ 1,

    Pr{ ⋂_n ⋃_{m≥n} Am } ≤ Pr{ ⋃_{m≥n} Am } ≤ Σ_{m=n}^∞ Pr{Am},

where we have used the union bound. Since this inequality holds for each n, it also holds in the limit n → ∞, completing the proof.
Proof of Theorem 1.4: Lemma 1.1 shows that Σ_m Pr{Am} < ∞ for Am = {|Sm/m − X̄| > ε}, and Lemma 1.2 shows that this implies (1.68) and (1.67).

Note that the assumption of a moment generating function was used only in Lemma 1.1, and was only used to show that Σ_m Pr{Am} < ∞. Exercise 1.34 shows that Σ_m Pr{Am} < ∞ also follows from the assumption that X has a finite fourth moment. Section 7.8.1 uses more powerful tools to show the sufficiency of a second moment, and advanced texts on mathematical probability demonstrate the sufficiency of a finite mean.
The version of the SLLN in Theorem 1.4 is not the version that is most useful. Before proceeding to that version, the following restatement of the previous version will be useful in proving and understanding the more useful version.

Lemma 1.3. A sequence of rv's {Sn/n; n ≥ 1} satisfies the condition

    Pr{ ⋂_{n≥1} ⋃_{m≥n} {|Sm/m − X̄| > ε} } = 0    for every ε > 0    (1.69)

if and only if it satisfies

    Pr{ ⋃_{k≥1} ⋂_{n≥1} ⋃_{m≥n} {|Sm/m − X̄| > 1/k} } = 0.    (1.70)

Proof: First assume that (1.69) is satisfied. Using the union bound on k in (1.70),

    Pr{ ⋃_{k≥1} ⋂_{n≥1} ⋃_{m≥n} {|Sm/m − X̄| > 1/k} } ≤ Σ_{k≥1} Pr{ ⋂_{n≥1} ⋃_{m≥n} {|Sm/m − X̄| > 1/k} }.

Each term on the right side above is zero by choosing ε in (1.69) to be 1/k. The sum of these terms is then also 0, showing that (1.69) implies (1.70).

Next assume that (1.70) is satisfied. Since the probability of a union can be no less than the probability of any term in the union, we see that for every integer k,

    0 = Pr{ ⋃_{k≥1} ⋂_{n≥1} ⋃_{m≥n} {|Sm/m − X̄| > 1/k} } ≥ Pr{ ⋂_{n≥1} ⋃_{m≥n} {|Sm/m − X̄| > 1/k} }.

This implies (1.69) for every ε > 1/k and, since k is arbitrary, for every ε > 0.
We are now ready to state the most useful version of the strong law. The statement is deceptively simple, and it will take some care to interpret what it means.

Theorem 1.5 (Strong Law of Large Numbers (SLLN - Version 2)). Let {Xm; m ≥ 1} be a sequence of IID rv's with a finite mean X̄. For each n, let Sn = X1 + · · · + Xn. Then

    Pr{ lim_{n→∞} Sn/n = X̄ } = 1.    (1.71)
Before proving the theorem, we must be clear about the meaning of the limit of a sequence of rv's. For each sample point ω in the sample space, lim_{n→∞} Sn(ω)/n is the limit of a sequence of real numbers. If the limit exists, it is a real number and otherwise the limit is undefined. Thus the limit is a possibly defective random variable, mapping each sample point to a real number or to a black hole. The theorem then says three things: first, the set of ω ∈ Ω for which the limit exists is an event19 according to the axioms of probability; second, this event has probability 1; and third, the limiting rv is deterministic and equal to X̄ with probability 1. A sequence of rv's {Sn/n; n ≥ 1} that converges in the sense of (1.71) is said to converge with probability 1. Thus we usually express (1.71) as lim_{n→∞} Sn/n = X̄ with probability 1, or even more briefly as lim_{n→∞} Sn/n = X̄ W.P.1.

Proof: Surprisingly, the subtle mathematical issues about convergence of rv's can be avoided if we look carefully at Version 1 of the SLLN. In the form of (1.70),

    Pr{ ⋃_{k≥1} ⋂_{n≥1} ⋃_{m≥n} {|Sm/m − X̄| > 1/k} } = 0.

The complement of the event above must have probability 1, and using de Morgan's laws to find the complement, we get

    Pr{ ⋂_{k≥1} ⋃_{n≥1} ⋂_{m≥n} {|Sm/m − X̄| ≤ 1/k} } = 1.    (1.72)

To better understand this expression, define the event Ak as

    Ak = ⋃_{n≥1} ⋂_{m≥n} {|Sm/m − X̄| ≤ 1/k}.

Thus Version 1 of the SLLN states that Pr{⋂_k Ak} = 1. We now show that ⋂_k Ak is simply the set of ω for which lim_n Sn(ω)/n = X̄. To see this, note that ω ∈ Ak means that there is some n ≥ 1 such that |Sm(ω)/m − X̄| ≤ 1/k for all m ≥ n. It then follows that ω ∈ ⋂_k Ak means that for every k ≥ 1, there is some n ≥ 1 such that |Sm(ω)/m − X̄| ≤ 1/k for all m ≥ n. This is the definition of lim_n Sn(ω)/n = X̄. This means that the set of ω for which lim_n Sn(ω)/n = X̄ is the set of ω in ⋂_k Ak. Thus this set is an event (as a countable intersection of a countable union of a countable intersection of events), and this event has probability 1.

19 This is the only place in this text where one must seriously question whether a subset of sample points is actually an event. The extreme peculiarity of this subset can be seen by visualizing a Bernoulli process with pX(1) = q. In this case, E[X] = q and, according to the theorem, the set of sample paths for which Sm/m → q has probability 1. If we consider another Bernoulli process with pX(1) = q′, q′ ≠ q, then the old set of sample paths with Sm/m → q has probability 0 and the new set with Sm/m → q′ has probability 1. There are uncountably many choices for q, so there are uncountably many sets, each of which has probability 1 for its own q. Each of these sets is an event in the same class of events. Thus they partition the sample space into an uncountable set of events, each with probability 1 for its own q, and in addition there are the sample points for which Sm/m does not have a limit. There is nothing wrong mathematically here, since these sets are described by countable unions and intersections. However, even the simple friendly Bernoulli process is very strange when one looks at events of this type.
Example 1.4.2. Suppose the Xi are IID binary with equiprobable ones and zeros. Then X̄ = 1/2. We can easily construct sequences for which the sample average is not 1/2; for example the sequence of all zeros, the sequence of all ones, sequences with 1/3 zeros and 2/3 ones, and so forth. The theorem says, however, that collectively those sequences have zero probability.

The proof above actually shows more than the theorem claims. It shows that Version 1 and Version 2 of the SLLN are actually equivalent statements in the same sense that (1.61) and (1.62) are equivalent statements. Each form leads to its own set of insights, but when we show that other sequences of rv's satisfy one or the other of these forms, we can recognize almost immediately that the other equivalent form is also valid.
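The strong law is a statement about individual sample paths, so one informative check is to follow a single long path (a sketch; the equiprobable binary rv of Example 1.4.2 and the path length are arbitrary choices) and look at sup_{m≥n} |Sm/m − 1/2|, the quantity that Version 1 of the SLLN drives to 0.

    import numpy as np

    rng = np.random.default_rng(7)
    n_max = 1_000_000
    path = rng.integers(0, 2, n_max)                       # one path of IID equiprobable bits
    running_avg = np.cumsum(path) / np.arange(1, n_max + 1)

    # sup_{m >= n} |S_m/m - 1/2| for a few n, estimated over this finite path.
    dev = np.abs(running_avg - 0.5)
    worst_from_here = np.maximum.accumulate(dev[::-1])[::-1]
    for n in (10, 100, 10_000, 1_000_000):
        print(n, worst_from_here[n - 1])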
1.4.7
Convergence of random variables
This section has developed a number of laws of large numbers, each saying that a sum of many IID random variables (rv's), suitably normalized, is essentially equal to the mean. In the case of the CLT, the limiting distribution around the mean is also specified to be Gaussian. At the outermost intuitive level, i.e., at the level most useful when first looking at some very complicated set of issues, viewing the sample average as being essentially equal to the mean is highly appropriate. At the next intuitive level down, the meaning of the word essentially becomes important and thus involves the details of the above laws. All of the results involve how the rv's Sn/n change with n and in particular become better and better approximated by X̄. When we talk about a sequence of rv's (namely a sequence of functions on the sample space) being approximated by a numerical constant, we are talking about some kind of convergence, but it clearly is not as simple as a sequence of real numbers (such as 1/n for example) converging to some other number (0 for example).

The purpose of this section is to give names and definitions to these various forms of convergence. This will give us increased understanding of the laws of large numbers already developed, but, more important, we can later apply these convergence results to sequences of random variables other than sample averages of IID rv's.

We discuss four types of convergence in what follows: convergence in distribution, in probability, in mean square, and with probability 1. For each, we first recall the type of large number result with that type of convergence and then give the general definition. The examples are all based on a sequence {Xn; n ≥ 1} of rv's with partial sums Sn = X1 + · · · + Xn, and the definitions are given in terms of an arbitrary sequence {Yn; n ≥ 1} of rv's.

We start with the central limit theorem, which, from (1.56), says

    lim_{n→∞} Pr{ (Sn − nX̄)/(√n σ) ≤ y } = ∫_{−∞}^y (1/√(2π)) exp(−x²/2) dx    for every y ∈ R.

This is illustrated in Figure 1.10 and says that the sequence (in n) of distribution functions Pr{(Sn − nX̄)/(√n σ) ≤ y} converges at every y to the normal distribution function at y. This is an example of convergence in distribution.
Definition 1.7. A sequence of random variables, Y1, Y2, . . . , converges in distribution to a random variable Z if lim_{n→∞} FYn(z) = FZ(z) at each z for which FZ(z) is continuous.

Convergence in distribution does not say that the rv's themselves are converging in any sense (see Exercise 1.29), but only that their distribution functions are converging. Note that for the CLT, it is the rv's Zn = (Sn − nX̄)/(√n σ) that are converging in distribution. Convergence in distribution does not say that the PMF's or densities of the rv's are converging, and it is easy to see that a sequence of PMF's can never converge to a density.
Figure 1.12: Relationship between different kinds of convergence. Convergence in distribution is the most general and is implied by all the others. Convergence in probability is the next most general and is implied by convergence with probability 1 (W.P.1) and by mean square (MS) convergence, neither of which imply the other.
Next, the weak law of large numbers, in the form of (1.57), says that

    lim_{n→∞} Pr{|Sn/n − X̄| ≥ ε} = 0    for every ε > 0.

As illustrated in Figure 1.8, this means that Sn/n converges in distribution to a unit step function at X̄. This is an example of convergence in probability.

Definition 1.8. A sequence of random variables Y1, Y2, . . . , converges in probability to a real number z if lim_{n→∞} Pr{|Yn − z| ≥ ε} = 0 for every ε > 0.

An equivalent statement, as illustrated in Figure 1.9, is that the sequence Y1, Y2, . . . , converges in probability to z if lim_{n→∞} FYn(y) = 0 for all y < z and lim_{n→∞} FYn(y) = 1 for all y > z. This shows (as illustrated in Figure 1.12) that convergence in probability is a special case of convergence in distribution, since with convergence in probability, the sequence FYn of distribution functions converges to a unit step at z. Note that lim_{n→∞} FYn(z) for the z where the step occurs is not specified. However, the step function is not continuous at z, so the limit there need not be specified for convergence in distribution.

Convergence in probability says quite a bit more than convergence in distribution. As an important example of this, consider the difference Yn − Ym for n and m both large. If {Yn; n ≥ 1} converges in probability to z, then Yn − z and Ym − z are close to 0 with high probability for large n and m. Thus Yn − Ym is close to 0 with high probability. More precisely, lim_{m→∞, n→∞} Pr{|Yn − Ym| > ε} = 0 for every ε > 0. If the sequence {Yn; n ≥ 1}
merely converges in distribution to some arbitrary distribution, then Yn − Ym can be large with high probability, even when n and m are large. An example of this is given in Exercise 1.29. In other words, convergence in distribution need not mean that the random variables converge in any sense — it is only their distribution functions that must converge.

There is something slightly paradoxical here, since the CLT seems to say a great deal more about how Sn/n approaches X̄ than the weak law, but it corresponds to a weaker type of convergence. The resolution of this paradox is that the sequence of rv's in the CLT is {(Sn − nX̄)/(√n σ); n ≥ 1}. The presence of √n in the denominator of this sequence provides much more detailed information about how Sn/n approaches X̄ with increasing n than the limiting unit step of FSn/n itself. For example, it is easy to see from the CLT that lim_{n→∞} FSn/n(X̄) = 1/2, which can not be directly derived from the weak law.

Yet another kind of convergence is convergence in mean square (MS). An example of this, for the sample average Sn/n of IID rv's with a variance, is given in (1.50), repeated below:

    lim_{n→∞} E[(Sn/n − X̄)²] = 0.

The general definition is as follows:

Definition 1.9. A sequence of rv's Y1, Y2, . . . , converges in mean square (MS) to a real number z if lim_{n→∞} E[(Yn − z)²] = 0.
Our derivation of the weak law of large numbers (Theorem 1.2) was essentially based on the MS convergence of (1.50). Using the same approach, Exercise 1.28 shows in general that convergence in MS implies convergence in probability. Convergence in probability does not imply MS convergence, since as shown in Theorem 1.3, the weak law of large numbers holds without the need for a variance.

The final type of convergence to be discussed is convergence with probability 1. The strong law of large numbers, stating that Pr{lim_{n→∞} S_n/n = X̄} = 1, is an example. The general definition is as follows:

Definition 1.10. A sequence of rv's Y1, Y2, . . . , converges with probability 1 (W.P.1) to a real number z if Pr{lim_{n→∞} Y_n = z} = 1.
An equivalent statement is that Y1, Y2, . . . , converges W.P.1 to a real number z if, for every ε > 0, lim_{n→∞} Pr{∪_{m≥n} {|Y_m − z| > ε}} = 0. The equivalence of these two statements follows by exactly the same argument used to show the equivalence of the two versions of the strong law. In seeing this, note that the proof of version 2 of the strong law did not use any properties of the sample averages S_n/n.

We now show that convergence with probability 1 implies convergence in probability. Using the second form above of convergence W.P.1, we see that

Pr{∪_{m≥n} {|Y_m − z| > ε}} ≥ Pr{|Y_n − z| > ε}.     (1.73)
Thus if the term on the left converges to 0 as n → ∞, then the term on the right does also, and convergence W.P.1 implies convergence in probability.

It turns out that convergence in probability does not imply convergence W.P.1. We cannot show this from the SLLN in Theorem 1.5 and the WLLN in Theorem 1.3, since both hold under the same conditions. However, convergence in probability and convergence W.P.1 also hold for many other sequences of rv's. A simple example where convergence in probability holds but convergence W.P.1 does not is given by Example 1.4.1, where we let Y_n = I_{A_n} and z = 0. Then for any positive ε < 1, the left side of (1.73) is 1 and the right side converges to 0. This means that {Y_n; n ≥ 1} converges to 0 in probability but does not converge with probability 1.

Finally, we want to show that convergence W.P.1 neither implies nor is implied by convergence in mean square. Although we have not shown it, convergence W.P.1 in the form of the SLLN is valid for rv's with a mean but not a variance, so this is a case where convergence W.P.1 holds, but convergence in MS does not. Going the other way, Example 1.4.1 again shows that convergence W.P.1 does not hold here, but convergence in MS to 0 does hold. Figure 1.12 illustrates the fact that neither convergence W.P.1 nor convergence in MS implies the other.
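The distinction between convergence of distribution functions and convergence of the rv's themselves can also be seen numerically. The sketch below (illustrative only; it mimics the situation of Exercise 1.29 using centered exponential rv's) shows that Y_n = S_n/(σ√n) has an essentially fixed spread (its distribution converges to normal), while Y_n − Y_{2n} does not shrink: its standard deviation stays near √(2 − √2) ≈ 0.77 for all n.

```python
# Sketch: convergence in distribution without convergence of the rv's themselves.
import numpy as np

rng = np.random.default_rng(2)
n_samples = 20_000

for n in (100, 1000, 5000):
    # Centered exponential rv's: mean 0, variance 1.  S_2n reuses the first n terms.
    first = rng.exponential(1.0, size=(n_samples, n)).sum(axis=1) - n
    second = rng.exponential(1.0, size=(n_samples, n)).sum(axis=1) - n
    y_n = first / np.sqrt(n)
    y_2n = (first + second) / np.sqrt(2 * n)
    print(f"n={n:5d}  std(Y_n)≈{y_n.std():.3f}  std(Y_2n)≈{y_2n.std():.3f}  "
          f"std(Y_n - Y_2n)≈{(y_n - y_2n).std():.3f}")
```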
1.5 Relation of probability models to the real world
Whenever first-rate engineers or scientists construct a probability model to represent aspects of some system that either exists or is being designed for some application, they first acquire a deep knowledge of the system and its surrounding circumstances. This is then combined with a deep and wide knowledge of probabilistic analysis of similar types of problems. For a text such as this, however, it is necessary for motivation and insight to pose the models we look at as highly simplified models of real-world applications, chosen more for their tutorial than practical value. There is a danger, then, that readers will come away with the impression that analysis is more challenging and important than modeling. To the contrary, modeling is almost always more difficult, more challenging, and more important than analysis. The objective here is to provide the necessary knowledge and insight about probabilistic systems so that the reader can later combine it with a deep understanding of some real application area which will result in a useful interactive use of models, analysis, and experimentation. In this section, our purpose is not to learn how to model real-world problems, since, as said above, this requires deep and specialized knowledge of whatever application area is of interest. Rather it is to understand the following conceptual problem that was posed in Section 1.1. Suppose we have a probability model of some real-world experiment involving randomness in the sense expressed there. When the real-world experiment being modeled is performed, there is an outcome, which presumably is one of the outcomes of the probability model, but there is no observable probability. It appears to be intuitively natural, for experiments that can be carried out repeatedly under essentially the same conditions, to associate the probability of a given event with
the relative frequency of that event over many repetitions. We now have the background to understand this approach. We first look at relative frequencies within the probability model, and then within the real world.
1.5.1 Relative frequencies in a probability model
We have seen that for any probability model, an extended probability model exists for n IID idealized experiments of the original model. For any event A in the original model, the indicator function I_A is a random variable, and the relative frequency of A over n IID experiments is the sample average of n IID rv's with the distribution of I_A. From the weak law of large numbers, this relative frequency converges in probability to E[I_A] = Pr{A}. By taking the limit n → ∞, the strong law of large numbers says that the relative frequency of A converges with probability 1 to Pr{A}.

In plain English, this says that for large n, the relative frequency of an event (in the n-repetition IID model) is essentially the same as the probability of that event. The word essentially is carrying a great deal of hidden baggage. For the weak law, for any ε, δ > 0, the relative frequency is within some ε of Pr{A} with a confidence level 1 − δ whenever n is sufficiently large. For the strong law, the ε and δ are avoided, but only by looking directly at the limit n → ∞.
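A quick simulation (illustrative parameters only, not from the text) makes the statement concrete: the relative frequency of an event A over n IID trials of a model experiment approaches Pr{A} as n grows. Here A is taken to be "a fair die shows 5 or 6", so Pr{A} = 1/3.

```python
# Sketch: relative frequency of an event over n IID trials vs. its model probability.
import numpy as np

rng = np.random.default_rng(3)
p_A = 1.0 / 3.0                          # Pr{A} in the model

for n in (100, 10_000, 1_000_000):
    rolls = rng.integers(1, 7, size=n)   # n IID rolls of a fair die (values 1..6)
    rel_freq = np.mean(rolls >= 5)       # sample average of the indicator I_A
    print(f"n={n:9d}  relative frequency of A = {rel_freq:.4f}   Pr{{A}} = {p_A:.4f}")
```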
1.5.2 Relative frequencies in the real world
In trying to sort out if and when the laws of large numbers have much to do with real-world experiments, we should ignore the mathematical details for the moment and agree that for large n, the relative frequency of an event A over n IID trials of an idealized experiment is essentially Pr{A}. We can certainly visualize a real-world experiment that has the same set of possible outcomes as the idealized experiment and we can visualize evaluating the relative frequency of A over n repetitions with large n. If that real-world relative frequency is essentially equal to Pr{A}, and this is true for the various events A of greatest interest, then it is reasonable to hypothesize that the idealized experiment is a reasonable model for the real-world experiment, at least so far as those given events of interest are concerned. One problem with this comparison of relative frequencies is that we have carefully specified a model for n IID repetitions of the idealized experiment, but have said nothing about how the real-world experiments are repeated. The IID idealized experiments specify that the conditional probability of A at one trial is the same no matter what the results of the other trials are. Intuitively, we would then try to isolate the n real-world trials so they don’t affect each other, but this is a little vague. The following examples help explain this problem and several others in comparing idealized and real-world relative frequenices. Example 1.5.1. Coin tossing: Tossing coins is widely used as a way to choose the first player in other games, and is also sometimes used as a primitive form of gambling. Its importance, however, and the reason for its frequent use, is its simplicity. When tossing a coin, we would argue from the symmetry between the two sides of the coin that each should be equally probable (since any procedure for evaluating the probability of one side
should apply equally to the other). Thus since H and T are the only outcomes (the remote possibility of the coin balancing on its edge is omitted from the model), the reasonable and universally accepted model for coin tossing is that H and T each have probability 1/2. On the other hand, the two sides of a coin are embossed in different ways, so that the mass is not uniformly distributed. Also the two sides do not behave in quite the same way when bouncing off a surface. Each denomination of each currency behaves slightly differently in this respect. Thus, not only do coins violate symmetry in small ways, but different coins violate it in different ways. How do we test whether this effect is significant? If we assume for the moment that successive tosses of the coin are well-modeled by the idealized experiment of n IID trials, we can essentially find the probability of H for a particular coin as the relative frequency of H in a sufficiently large number of independent tosses of that coin. This gives us slightly different relative frequencies for different coins, and thus slightly different probability models for different coins. If we want a generic model, we might randomly choose coins from a large bowl and then toss them, but this is manifestly silly. The assumption of independent tosses is also questionable. Consider building a carefully engineered machine for tossing coins and using it in a vibration-free environment. A standard coin is inserted into the machine in the same way for each toss and we count the number of heads and tails. Since the machine has essentially eliminated the randomness, we would expect all the coins, or almost all the coins, to come up the same way — the more precise the machine, the less independent the results. By inserting the original coin in a random way, a single trial might have equiprobable results, but successive tosses are certainly not independent. The successive trials would be closer to independent if the tosses were done by a slightly inebriated individual who tossed the coins high in the air. The point of this example is that there are many different coins and many ways of tossing them, and the idea that one model fits all is reasonable under some conditions and not under others. Rather than retreating into the comfortable world of theory, however, note that we can now find the relative frequency of heads for any given coin and essentially for any given way of tossing that coin.20 Example 1.5.2. Binary data: Consider the binary data transmitted over a communication link or stored in a data facility. The data is often a mixture of encoded voice, video, graphics, text, etc., with relatively long runs of each, interspersed with various protocols for retrieving the original non-binary data. The simplest (and most common) model for this is to assume that each binary digit is 0 or 1 with equal probability and that successive digits are statistically independent. This is the same as the model for coin tossing after the trivial modification of converting {H, T } into {0, 1}. This is also a rather appropriate model for designing a communication or storage facility, since all n-tuples are then equiprobable (in the model) for each n, and thus the 20
Footnote 20: We are not suggesting that distinguishing different coins for the sake of coin tossing is an important problem. Rather, we are illustrating that even in such a simple situation, the assumption of identically prepared experiments is questionable and the assumption of independent experiments is questionable. The extension to n repetitions of IID experiments is not necessarily a good model for coin tossing. In other words, one has to question both the original model and the n-repetition model.
facilities need not rely on any special characteristics of the data. On the other hand, if one wants to compress the data, reducing the required number of transmitted or stored bits per incoming bit, then a more elaborate model is needed. Developing such an improved model would require finding out more about where the data is coming from — a naive application of calculating relative frequencies of n-tuples would probably not be the best choice. On the other hand, there are well-known data compression schemes that in essence track dependencies in the data and use them for compression in a coordinated way. These schemes are called universal data-compression schemes since they don’t rely on a probability model. At the same time, they are best analyzed by looking at how they perform for various idealized probability models. The point of this example is that choosing probability models often depends heavily on how the model is to be used. Models more complex than IID binary digits are usually based on what is known about the input processes. Associating relative frequencies with probabilities is the basic underlying conceptual connection between real-world and models, but in practice this is essentially the relationship of last resort. For most of the applications we will study, there is a long history of modeling to build on, with experiments as needed. Example 1.5.3. Fable: In the year 2008, the financial structure of the US failed and the world economy was brought to its knees. Much has been written about the role of greed on Wall Street and incompetence and stupidity in Washington. Another aspect of the collapse, however, was a widespread faith in stochastic models for limiting risk. These models encouraged people to engage in investments that turned out to be far riskier than the models predicted. These models were created by some of the brightest PhD’s from the best universities, but they failed miserably because they modeled the everyday events very well, but modeled the rare events and the interconnection of events poorly. They failed badly by not understanding their application, and in particular, by trying to extrapolate typical behavior when their primary goal was to protect against atypical situations. The moral of the fable is that brilliant analysis is not helpful when the modeling is poor; as computer engineers say, “garbage in, garbage out.” The examples above show that the problems of modeling a real-world experiment are often connected with the question of creating a model for a set of experiments that are not exactly the same and do not necessarily correspond to the notion of independent repetitions within the model. In other words, the question is not only whether the probability model is reasonable for a single experiment, but also whether the IID repetition model is appropriate for multiple copies of the real-world experiment. At least we have seen, however, that if a real-world experiment can be performed many times with a physical isolation between performances that is well modeled by the IID repetition model, then the relative frequencies of events in the real-world experiment correspond to relative frequencies in the idealized IID repetition model, which correspond to probabilities in the original model. In other words, under appropriate circumstances, the probabilities in a model become essentially observable over many repetitions. We will see later that our emphasis on IID repetitions was done for simplicity. 
There are other models for repetitions of a basic model, such as Markov models, that we study later.
These will also lead to relative frequencies approaching probabilities within the repetition model. Thus, for repeated real-world experiments that are well modeled by these repetition models, the real world relative frequencies approximate the probabilities in the model.
1.5.3 Statistical independence of real-world experiments
We have been discussing the use of relative frequencies of an event A in a repeated real-world experiment to test Pr{A} in a probability model of that experiment. This can be done essentially successfully if the repeated trials correspond to IID trials in the idealized experiment. However, the statement about IID trials in the idealized experiment is a statement about probabilities in the extended n-trial model. Thus, just as we tested Pr{A} by repeated real-world trials of a single experiment, we should be able to test Pr{A1, . . . , An} in the n-repetition model by a much larger number of real-world repetitions taking an n-tuple at a time.

To be more specific, choose two large integers, m and n, and perform the underlying real-world experiment mn times. Partition the mn trials into m runs of n trials each. For any given n-tuple A1, . . . , An of successive events, find the relative frequency (over m trials of n-tuples) of the n-tuple event A1, . . . , An. This can then be used essentially to test the probability Pr{A1, . . . , An} in the model for n IID trials. The individual event probabilities can also be tested, so the condition for independence can be tested.

The observant reader will note that there is a tacit assumption above that successive n-tuples can be modeled as independent, so it seems that we are simply replacing a big problem with a bigger problem. This is not quite true, since if the trials are dependent with some given probability model for dependent trials, then this test for independence will essentially reject the independence hypothesis for large enough n.

Choosing models for real-world experiments is primarily a subject for statistics, and we will not pursue it further except for brief discussions when treating particular application areas. The purpose here was to treat a fundamental issue in probability theory. As stated before, probabilities are non-observables — they exist in the theory but are not directly measurable in real-world experiments. We have shown that probabilities essentially become observable in the real world via relative frequencies over repeated trials.
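The testing procedure described above can be sketched in a few lines of code (the event, probabilities, and sample sizes below are invented for illustration): generate m runs of n trials, then compare the relative frequency of an n-tuple of events with the product of the individual relative frequencies.

```python
# Sketch: comparing an n-tuple relative frequency with the product of single-event frequencies.
import numpy as np

rng = np.random.default_rng(4)
n, m = 3, 200_000                    # n trials per run, m runs of n trials
p = 0.25                             # Pr{A} for the event A in a single trial

runs = rng.random((m, n)) < p        # m independent runs of n Bernoulli(p) trials
tuple_freq = np.mean(runs.all(axis=1))    # rel. freq. of the n-tuple (A, A, ..., A)
single_freqs = runs.mean(axis=0)          # rel. freq. of A in each of the n positions

print("relative frequency of the n-tuple:    ", tuple_freq)
print("product of the individual frequencies:", np.prod(single_freqs))
print("value in the n-IID-trial model, p^n:  ", p ** n)
```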
1.5.4 Limitations of relative frequencies
Most real-world applications that are modeled by probability models have such a large sample space that it is impractical to conduct enough trials to choose probabilities from relative frequencies. Even a shuffled deck of 52 cards would require many more than 52! ≈ 8 × 1067 trials for most of the outcomes to appear even once. Thus relative frequencies can be used to test the probability of given individual events of importance, but are usually impractical for choosing the entire model and even more impractical for choosing a model for repeated trials. Since relative frequencies give us a concrete interpretation of what probability means, however, we can now rely on other approaches, such as symmetry, for modeling. From symmetry,
for example, it is clear that all 52! possible arrangements of a card deck are equiprobable after shuffling. This leads, for example, to the ability to calculate probabilities of different poker hands, etc., which are such popular exercises in elementary probability. Another valuable modeling procedure is that of constructing a probability model where the possible outcomes are independently chosen n-tuples of outcomes in a simpler model. More generally, most of the random processes to be studied in this text are defined as various ways of combining simpler idealized experiments. What is really happening as we look at modeling increasingly sophisticated systems and studying increasingly sophisticated models is that we are developing mathematical results for simple idealized models and relating those results to real-world results (such as relating idealized statistically independent trials to real-world independent trials). The association of relative frequencies to probabilities forms the basis for this, but is usually exercised only in the simplest cases. The way one selects probability models of real-world experiments in practice is to use scientific knowledge and experience, plus simple experiments, to choose a reasonable model. The results from the model (such as the law of large numbers) are then used both to hypothesize results about the real-world experiment and to provisionally reject the model when further experiments show it to be highly questionable. Although the results about the model are mathematically precise, the corresponding results about the real-world are at best insightful hypotheses whose most important aspects must be validated in practice.
1.5.5 Subjective probability
There are many useful applications of probability theory to situations other than repeated trials of a given experiment. When designing a new system in which randomness (of the type used in probability models) is hypothesized to be significantly involved, one would like to analyze the system before actually building it. In such cases, the real-world system does not exist, so indirect means must be used to construct a probability model. Often some sources of randomness, such as noise, can be modeled in the absence of the system. Often similar systems or simulation can be used to help understand the system and help in formulating appropriate probability models. However, the choice of probabilities is to a certain extent subjective. In other situations, as illustrated in the examples above, there are repeated trials of similar experiments, but a probability model would have to choose one probability assignment for a model rather than a range of assignments, each appropriate for different types of experiments. Here again the choice of probabilities for a model is somewhat subjective. Another type of situation, of which a canonic example is risk analysis for nuclear reactors, deals with a large number of very unlikely outcomes, each catastrophic in nature. Experimentation clearly cannot be used to establish probabilities, and it is not clear that probabilities have any real meaning here. It can be helpful, however, to choose a probability model on the basis of subjective beliefs which can be used as a basis for reasoning about the problem. When handled well, this can at least make the subjective biases clear, leading to a more rational approach to the problem. When handled poorly, it can hide the arbitrary
nature of possibly poor decisions. We will not discuss the various, often ingenious methods to choose subjective probabilities. The reason is that subjective beliefs should be based on intensive and long term exposure to the particular problem involved; discussing these problems in abstract probability terms weakens this link. We will focus instead on the analysis of idealized models. These can be used to provide insights for subjective models, and more refined and precise results for objective models.
1.6 Summary
This chapter started with an introduction into the correspondence between probability theory and real-world experiments involving randomness. While almost all work in probability theory works with established probability models, it is important to think through what these probabilities mean in the real world, and elementary subjects rarely address these questions seriously. The next section discussed the axioms of probability theory, along with some insights about why these particular axioms were chosen. This was followed by a review of conditional probabilities, statistical independence, rv’s, stochastic processes, and expectations. The emphasis was on understanding the underlying structure of the field rather than reviewing details and problem solving techniques. This was followed by a fairly extensive treatment of the laws of large numbers. This involved a fair amount of abstraction, combined with mathematical analysis. The central idea is that the sample average of n IID rv’s approaches the mean with increasing n. As a special case, the relative frequency of an event A approaches Pr{A}. What the word approaches means here is both tricky and vital in understanding probability theory. The strong law of large numbers requires mathematical maturity, and might be postponed to Chapter 3 where it is first used. The final section came back to the fundamental problem of understanding the relation between probability theory and randomness in the real-world. It was shown, via the laws of large numbers, that probabilities become essentially observable via relative frequencies calculated over repeated experiments. There are too many texts on elementary probability to mention here, and most of them serve to give added understanding and background to the material in this chapter. We recommend Bertsekas and Tsitsiklis [2] and also [15] and [23], as both sound and readable. Kolmogorov [14] is readable for the mathematically mature and is also of historical interest as the translation of the 1933 book that first put probability on a firm mathematical basis. Feller [8] is the classic extended and elegant treatment of elementary material from a mature point of view. Rudin [17] is an excellent text on measure theory for those with advanced mathematical preparation.
1.7 Appendix

1.7.1 Table of standard random variables
The following tables summarize the properties of some common random variables. If a density or PMF is specified only in a given region, it is assumed to be zero elsewhere.

Continuous random variables:

Name         Density f_X(x)                                    Mean   Variance    MGF                     Constraints
Exponential  λ exp(−λx); x ≥ 0                                 1/λ    1/λ²        λ/(λ − r)               λ > 0
Erlang       λ^n x^{n−1} exp(−λx)/(n−1)!; x ≥ 0                n/λ    n/λ²        [λ/(λ − r)]^n           λ > 0, n ≥ 1
Gaussian     (1/(σ√(2π))) exp(−(x − a)²/(2σ²))                 a      σ²          exp(ra + r²σ²/2)        σ > 0
Uniform      1/a; 0 ≤ x ≤ a                                    a/2    a²/12       [exp(ra) − 1]/(ra)      a > 0

Discrete random variables:

Name         PMF p_N(n)                                        Mean   Variance    MGF                     Constraints
Binary       p_N(0) = 1 − q; p_N(1) = q                        q      q(1 − q)    1 − q + q e^r           0 ≤ q ≤ 1
Binomial     (m choose n) q^n (1 − q)^{m−n}; 0 ≤ n ≤ m         mq     mq(1 − q)   [1 − q + q e^r]^m       0 ≤ q ≤ 1
Geometric    q(1 − q)^{n−1}; n ≥ 1                             1/q    (1 − q)/q²  q e^r/[1 − (1 − q)e^r]  0 < q ≤ 1
Poisson      λ^n exp(−λ)/n!; n ≥ 0                             λ      λ           exp[λ(e^r − 1)]         λ > 0
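As a quick sanity check of a few table entries (a sketch, not part of the text; the parameters are arbitrary), one can sample these distributions and compare sample means and variances with the tabulated formulas:

```python
# Sketch: spot-checking tabulated means and variances by simulation.
import numpy as np

rng = np.random.default_rng(10)
n = 1_000_000
lam, q, a = 2.0, 0.3, 4.0

checks = {
    "Exponential(lam)": (rng.exponential(1 / lam, n), 1 / lam, 1 / lam**2),
    "Uniform(0, a)":    (rng.uniform(0, a, n),        a / 2,   a**2 / 12),
    "Geometric(q)":     (rng.geometric(q, n),         1 / q,   (1 - q) / q**2),
    "Poisson(lam)":     (rng.poisson(lam, n),         lam,     lam),
}
for name, (samples, mean, var) in checks.items():
    print(f"{name:17s} sample mean {samples.mean():7.4f} (table {mean:7.4f})   "
          f"sample var {samples.var():7.4f} (table {var:7.4f})")
```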
1.8 Exercises
Exercise 1.1. Consider a sequence A1, A2, . . . of events each of which have probability zero.
a) Find Pr{∪_{n=1}^{m} A_n} and find lim_{m→∞} Pr{∪_{n=1}^{m} A_n}. What you have done is to show that the sum of a countably infinite set of numbers each equal to 0 is perfectly well defined as 0.
b) For a sequence of possible phases, a1, a2, . . . between 0 and 2π, and a sequence of singleton events, A_n = {a_n}, find Pr{∪_n A_n} assuming that the phase is uniformly distributed.
c) Now let A_n be the empty event φ for all n. Use (1.1) to show that Pr{φ} = 0.
Exercise 1.2. Let A1 and A2 be arbitrary events and show that Pr{A1 ∪ A2} + Pr{A1A2} = Pr{A1} + Pr{A2}. Explain which parts of the sample space are being double counted on both sides of this equation and which parts are being counted once.

Exercise 1.3. Let A1, A2, . . . , be a sequence of disjoint events and assume that Pr{A_n} = 2^{−n−1} for each n ≥ 1. Assume also that Ω = ∪_{n=1}^{∞} A_n.
a) Show that these assumptions violate the axioms of probability.
b) Show that if (1.3) is substituted for the third of those axioms, then the above assumptions satisfy the axioms. This shows that the countable additivity of Axiom 3 says something more than the finite additivity of (1.3).

Exercise 1.4. This exercise derives the probability of an arbitrary (non-disjoint) union of events and also derives the union bound.
a) For 2 arbitrary events A1 and A2, show that

A1 ∪ A2 = A1 ∪ (A2 − A1),

where A2 − A1 = A2 A1^c and where A1 and A2 − A1 are disjoint. Hint: This is what Venn diagrams were invented for.
b) Show that for any n ≥ 2 and any events A1, . . . , An,

(∪_{i=1}^{n−1} A_i) ∪ A_n = (∪_{i=1}^{n−1} A_i) ∪ (A_n − ∪_{i=1}^{n−1} A_i),
where the two parenthesized expressions on the right are disjoint.
c) Show that

Pr{∪_n A_n} = Σ_n Pr{A_n − ∪_{i=1}^{n−1} A_i}.
d) Show that for each n, Pr{A_n − ∪_{i=1}^{n−1} A_i} ≤ Pr{A_n}. Use this to show that

Pr{∪_n A_n} ≤ Σ_n Pr{A_n}.
Exercise 1.5. Consider a sample space of 8 equiprobable sample points and let A1 , A2 , A3 be three events each of probability 1/2 such that Pr{A1 A2 A3 } = Pr{A1 } Pr{A2 } Pr{A3 }.
a) Create an example where Pr{A1A2} = Pr{A1A3} = 1/4 but Pr{A2A3} = 1/8. Hint: Make a table with a row for each sample point and a column for each event and experiment with choosing which sample points belong to each event (the answer is not unique).
b) Show that, for your example, A2 and A3 are not independent. Note that the definition of statistical independence would be very strange if it allowed A1, A2, A3 to be independent while A2 and A3 are dependent. This illustrates why the definition of independence requires (1.11) rather than just (1.12).

Exercise 1.6. Suppose X and Y are discrete rv's with the PMF p_{XY}(x_i, y_j). Show (a picture will help) that this is related to the joint distribution function by

p_{XY}(x_i, y_j) = lim_{δ>0, δ→0} [F(x_i, y_j) − F(x_i − δ, y_j) − F(x_i, y_j − δ) + F(x_i − δ, y_j − δ)].
Exercise 1.7. The text shows that, for a non-negative rv X with distribution function F_X(x), E[X] = ∫_0^∞ [1 − F_X(x)] dx.
a) Write this integral as a sum where X is a non-negative integer random variable.
b) Generalize the above integral for the case of an arbitrary (rather than non-negative) rv Y with distribution function F_Y(y); use a graphical argument.
c) Find E[|Y|] by the same type of argument.
d) For what value of α is E[|Y − α|] minimized? Use a graphical argument again.

Exercise 1.8. a) Let Y be a nonnegative rv and y > 0 be some fixed number. Let A be the event that Y ≥ y. Show that y I_A ≤ Y (i.e., that this inequality is satisfied for every ω ∈ Ω).
b) Use your result in part a) to prove the Markov inequality.

Exercise 1.9. Use the definition of a limit in the proof of Theorem 1.2 to show that the sequences in parts a and b satisfy lim_{n→∞} a_n = 0 but the sequence in part c does not have a limit.
a) a_n = 1/ln(ln n)
b) a_n = n^{10} exp(−n)
c) a_n = 1 for n = 10^k for each positive integer k, and a_n = 0 otherwise.
d) Show that the definition can be changed (with no change in meaning) by replacing δ with either 1/m or 2^{−m} for every positive integer m.
Exercise 1.10. Let X be a rv with distribution function FX (x). Find the distribution function of the following rv’s. a) The maximum of n IID rv’s with distribution function FX (x). b) The minimum of n IID rv’s with distribution FX (x). c) The difference of the rv’s defined in a) and b); assume X has a density fX (x). Exercise 1.11. a) Let X1 , X2 , . . . , Xn be rv’s with expected values X 1 , . . . , X n . Prove that E [X1 + · · · + Xn ] = X 1 + · · · + X n . Do not assume that the rv’s are independent. b) Now assume that X1 , . . . , Xn are statistically independent and show that the expected value of the product is equal to the product of the expected values. c) Again assuming that X1 , . . . , Xn are statistically independent, show that the variance of the sum is equal to the sum of the variances. Exercise 1.12. Let X1 , X2 , . . . , Xn , . . . be a sequence of IID continuous rv’s with the common probability density function fX (x); note that Pr{X=α} = 0 for all α and that Pr{Xi =Xj } = 0 for all i 6= j. For n ≥ 2, define Xn as a record-to-date of the sequence if Xn > Xi for all i < n. a) Find the probability that X2 is a record-to-date. Use symmetry to obtain a numerical answer without computation. A one or two line explanation should be adequate). b) Find the probability that Xn is a record-to-date, as a function of n ≥ 1. Again use symmetry. c) Find a simple expression for the expected number of records-to-date that occur over the first m trials for any given integer m. Hint: Use indicator functions. Show that this expected number is infinite in the limit m → 1. Exercise 1.13. (Continuation of Exercise 1.12) a) Let N1 be the index of the first record-to-date in the sequence. Find Pr{N1 > n} for each n ≥ 2. Hint: There is a far simpler way to do this than working from part b in Exercise 1.12. b) Show that N1 is a rv. c) Show that E [N1 ] = 1. d) Let N2 be the index of the second record-to-date in the sequence. Show that N2 is a rv. Hint: You need not find the distribution function of N2 here. e) Contrast your result in part c to the result from part c of Exercise 1.12 saying that the expected number of records-to-date is infinite over an an infinite number of trials. Note: this should be a shock to your intuition — there is an infinite expected wait for the first of an infinite sequence of occurrences.
Exercise 1.14. (Another direction from Exercise 1.12) a) For any given n ≥ 2, find the probability that X_n and X_{n+1} are both records-to-date. Hint: The idea in part b of 1.12 is helpful here, but the result is not.
b) Is the event that X_n is a record-to-date statistically independent of the event that X_{n+1} is a record-to-date?
c) Find the expected number of adjacent pairs of records-to-date over the sequence X1, X2, . . . . Hint: A helpful fact here is that 1/(n(n+1)) = 1/n − 1/(n+1).

Exercise 1.15. a) Assume that X is a discrete rv taking on values a1, a2, . . . , and let Y = g(X). Let b_i = g(a_i), i ≥ 1, be the ith value taken on by Y. Show that E[Y] = Σ_i b_i p_Y(b_i) = Σ_i g(a_i) p_X(a_i).
b) Let X be a continuous rv with density f_X(x) and let g be differentiable and monotonic increasing. Show that E[Y] = ∫ y f_Y(y) dy = ∫ g(x) f_X(x) dx.

Exercise 1.16. a) Consider a positive, integer-valued rv whose distribution function is given at integer values by

F_Y(y) = 1 − 2/[(y + 1)(y + 2)]     for integer y ≥ 0.
Use (1.24) to show that E[Y] = 2. Hint: Note the PMF given in (1.21).
b) Find the PMF of Y and use it to check the value of E[Y].
c) Let X be another positive, integer-valued rv. Assume its conditional PMF is given by

p_{X|Y}(x|y) = 1/y     for 1 ≤ x ≤ y.
Find E[X | Y = y] and show that E[X] = 3/2. Explore finding p_X(x) until you are convinced that using the conditional expectation to calculate E[X] is considerably easier than using p_X(x).
d) Let Z be another integer-valued rv with the conditional PMF

p_{Z|Y}(z|y) = 1/y²     for 1 ≤ z ≤ y².
Find E[Z | Y = y] for each integer y ≥ 1 and find E[Z].

Exercise 1.17. a) Show that, for uncorrelated rv's, the expected value of the product is equal to the product of the expected values (by definition, X and Y are uncorrelated if E[(X − X̄)(Y − Ȳ)] = 0).
b) Show that if X and Y are uncorrelated, then the variance of X + Y is equal to the variance of X plus the variance of Y .
c) Show that if X1, . . . , Xn are uncorrelated, then the variance of the sum is equal to the sum of the variances.
d) Show that independent rv's are uncorrelated.
e) Let X, Y be identically distributed ternary valued random variables with the PMF p_X(−1) = p_X(1) = 1/4; p_X(0) = 1/2. Find a simple joint probability assignment such that X and Y are uncorrelated but dependent.
f) You have seen that the moment generating function of a sum of independent rv's is equal to the product of the individual moment generating functions. Give an example where this is false if the variables are uncorrelated but dependent.

Exercise 1.18. Suppose X has the Poisson PMF, p_X(n) = λ^n exp(−λ)/n! for n ≥ 0, and Y has the Poisson PMF, p_Y(m) = µ^m exp(−µ)/m! for m ≥ 0. Find the distribution of Z = X + Y and find the conditional distribution of Y conditional on Z = n.

Exercise 1.19. a) Suppose X, Y and Z are binary rv's, each taking on the value 0 with probability 1/2 and the value 1 with probability 1/2. Find a simple example in which X, Y, Z are statistically dependent but are pairwise statistically independent (i.e., X, Y are statistically independent, X, Z are statistically independent, and Y, Z are statistically independent). Give p_{XYZ}(x, y, z) for your example. Hint: In the simplest example, only 4 of the joint values for x, y, z have positive probabilities.
b) Is pairwise statistical independence enough to ensure that

E[Π_{i=1}^{n} X_i] = Π_{i=1}^{n} E[X_i]

for a set of rv's X1, . . . , Xn?
Exercise 1.20. Show that E[X] is the value of z that minimizes E[(X − z)²].

Exercise 1.21. A computer system has n users, each with a unique name and password. Due to a software error, the n passwords are randomly permuted internally (i.e., each of the n! possible permutations are equally likely). Only those users lucky enough to have had their passwords unchanged in the permutation are able to continue using the system.
a) What is the probability that a particular user, say user 1, is able to continue using the system?
b) What is the expected number of users able to continue using the system? Hint: Let X_i be a rv with the value 1 if user i can use the system and 0 otherwise.

Exercise 1.22. Suppose the rv X is continuous and has the distribution function F_X(x). Consider another rv Y = F_X(X). That is, for any sample point α such that X(α) = x, we have Y(α) = F_X(x). Show that Y is uniformly distributed in the interval 0 to 1.
Exercise 1.23. Let Z be an integer valued rv with the PMF p_Z(n) = 1/k for 0 ≤ n ≤ k−1. Find the mean, variance, and moment generating function of Z. Hint: The elegant way to do this is to let U be a uniformly distributed continuous rv over (0, 1] that is independent of Z. Then U + Z is uniform over (0, k]. Use the known results about U and U + Z to find the mean, variance, and mgf for Z.

Exercise 1.24. Let {X_n; n ≥ 1} be a sequence of independent but not identically distributed rv's. We say that the weak law of large numbers holds for this sequence if for all ε > 0

lim_{n→∞} Pr{|S_n/n − E[S_n]/n| ≥ ε} = 0,   where S_n = X1 + X2 + · · · + X_n.     (a)

a) Show that (a) holds if there is some constant A such that VAR(X_n) ≤ A for all n.
b) Suppose that VAR(X_n) ≤ A n^{1−α} for some α < 1 and for all n. Show that (a) holds in this case.
Exercise 1.25. Let {X_i; i ≥ 1} be IID binary rv's. Let Pr{X_i = 1} = δ, Pr{X_i = 0} = 1 − δ. Let S_n = X1 + · · · + X_n. Let m be an arbitrary but fixed positive integer. Think! then evaluate the following and explain your answers:
a) lim_{n→∞} Σ_{i: nδ−m ≤ i ≤ nδ+m} Pr{S_n = i}
b) lim_{n→∞} Σ_{i: 0 ≤ i ≤ nδ+m} Pr{S_n = i}
c) lim_{n→∞} Σ_{i: n(δ−1/m) ≤ i ≤ n(δ+1/m)} Pr{S_n = i}

Exercise 1.26. (Details in the proof of Theorem 1.3)
a) Show that if a sequence {X_i; i ≥ 0} are IID, then the truncated versions X̆_i are also IID.
b) Show that each X̆_i has a finite mean E[X̆] and finite variance σ²_X̆. Show that the variance is upper bounded by the second moment around the original mean X̄, i.e., show that σ²_X̆ ≤ E[|X̆ − E[X]|²].
c) Assume that X̆_i is X_i truncated to X̄ ± b. Show that σ²_X̆ ≤ 2bE[|X|]. Hint: The fact that |A − B| ≤ |A| + |B| for any numbers or rv's might be helpful.
d) Let S̆_n = X̆_1 + · · · + X̆_n and show that for any ε > 0,

Pr{|S̆_n/n − E[X̆]| ≥ ε/2} ≤ 8bE[|X|]/(nε²).

e) Use (1.24) to show that for sufficiently large b, |E[X̆] − E[X]| ≤ ε/2, and use this to show that

Pr{|S̆_n/n − E[X]| ≥ ε} ≤ 8bE[|X|]/(nε²)     for large enough b.
Exercise 1.27. Let {X_i; i ≥ 1} be IID rv's with mean 0 and infinite variance. Assume that E[|X_i|^{1+h}] = β for some given h, 0 < h < 1, and some given β. Let S_n = X1 + · · · + X_n.
a) Show that Pr{|X_i| ≥ y} ≤ β y^{−1−h}.
b) Let {X̆_i; i ≥ 1} be truncated variables: X̆_i = b for X_i ≥ b, X̆_i = X_i for −b ≤ X_i ≤ b, and X̆_i = −b for X_i ≤ −b. Show that E[X̆²] ≤ 2βb^{1−h}/(1 − h). Hint: For a non-negative rv Z, E[Z²] = ∫_0^∞ 2z Pr{Z ≥ z} dz (you can establish this, if you wish, by integration by parts).
c) Let S̆_n = X̆_1 + . . . + X̆_n. Show that Pr{S_n ≠ S̆_n} ≤ nβ b^{−1−h}.
d) Show that Pr{|S_n/n| ≥ ε} ≤ β [2b^{1−h}/((1 − h)nε²) + n/b^{1+h}].
e) Optimize your bound with respect to b. How fast does this optimized bound approach 0 with increasing n?

Exercise 1.28. (MS convergence → convergence in probability) Assume that {Y_n; n ≥ 1} is a sequence of rv's and z is a number with the property that lim_{n→∞} E[(Y_n − z)²] = 0.
a) Let ε > 0 be arbitrary and show that for each n ≥ 0,

Pr{|Y_n − z| ≥ ε} ≤ E[(Y_n − z)²]/ε².

b) For the ε above, let δ > 0 be arbitrary. Show that there is an integer m such that E[(Y_n − z)²] ≤ ε²δ for all n ≥ m.
c) Show that this implies convergence in probability.
Exercise 1.29. Let X1, X2, . . . , be a sequence of IID rv's each with mean 0 and variance σ². Let S_n = X1 + · · · + X_n for all n and consider the random variable S_n/(σ√n) − S_{2n}/(σ√(2n)). Find the limiting distribution function for this rv as n → ∞. The point of this exercise is to see clearly that the distribution function of S_n/(σ√n) is converging but that the sequence of rv's is not converging.

Exercise 1.30. A town starts a mosquito control program and the rv Z_n is the number of mosquitos at the end of the nth year (n = 0, 1, 2, . . . ). Let X_n be the growth rate of mosquitos in year n; i.e., Z_n = X_n Z_{n−1}; n ≥ 1. Assume that {X_n; n ≥ 1} is a sequence of IID rv's with the PMF Pr{X=2} = 1/2; Pr{X=1/2} = 1/4; Pr{X=1/4} = 1/4. Suppose that Z_0, the initial number of mosquitos, is some known constant and assume for simplicity and consistency that Z_n can take on non-integer values.
a) Find E[Z_n] as a function of n and find lim_{n→∞} E[Z_n].
b) Let W_n = log_2 X_n. Find E[W_n] and E[log_2(Z_n/Z_0)] as a function of n.
c) There is a constant α such that lim_{n→∞} (1/n)[log_2(Z_n/Z_0)] = α with probability 1. Find α and explain how this follows from the strong law of large numbers.
d) Using (c), show that lim_{n→∞} Z_n = β with probability 1 for some β and evaluate β.
e) Explain carefully how the result in (a) and the result in (d) are possible. What you should learn from this problem is that the expected value of the log of a product of IID rv's is more significant than the expected value of the product itself.

Exercise 1.31. Use Figure 1.7 to verify (1.43). Hint: Show that y Pr{Y ≥ y} ≤ ∫_{z≥y} z dF_Y(z) and show that lim_{y→∞} ∫_{z≥y} z dF_Y(z) = 0 if E[Y] is finite.
Exercise 1.32. Show that Π_{m≥n}(1 − 1/m) = 0. Hint: Show that

(1 − 1/m) = exp(ln(1 − 1/m)) ≤ exp(−1/m).
Exercise 1.33. Let A1, A2, . . . , denote an arbitrary set of events. This exercise shows that lim_n Pr{∪_{m≥n} A_m} = 0 if and only if Pr{∩_n ∪_{m≥n} A_m} = 0.
a) Show that lim_n Pr{∪_{m≥n} A_m} = 0 implies that Pr{∩_n ∪_{m≥n} A_m} = 0. Hint: First show that for every positive integer n,

Pr{∩_n ∪_{m≥n} A_m} ≤ Pr{∪_{m≥n} A_m}.

b) Show that Pr{∩_n ∪_{m≥n} A_m} = 0 implies that lim_n Pr{∪_{m≥n} A_m} = 0. Hint: First show that each of the following hold for every positive integer n,

Pr{∪_{m≥n} A_m} = Pr{∩_{k≤n} ∪_{m≥k} A_m},
lim_{n→∞} Pr{∪_{m≥n} A_m} ≤ Pr{∩_{k≤n} ∪_{m≥k} A_m}.
Exercise 1.34. Assume that X is a zero-mean rv with finite second and fourth moments, i.e., E[X⁴] = γ < ∞ and E[X²] = σ² < ∞. Let X1, X2, . . . , be a sequence of IID rv's, each with the distribution of X. Let S_n = X1 + · · · + X_n.
a) Show that E[S_n⁴] = nγ + 3n(n − 1)σ⁴.
b) Show that Pr{|S_n/n| ≥ ε} ≤ [nγ + 3n(n − 1)σ⁴]/(ε⁴n⁴).
c) Show that Σ_n Pr{|S_n/n| ≥ ε} < ∞. Note that you have shown that Lemma 1.1 in the proof of the strong law of large numbers holds if X has a fourth moment.
d) Show that a finite fourth moment implies a finite second moment. Hint: Let Y = X² and Y² = X⁴ and show that if a nonnegative rv has a second moment, it also has a mean.
e) Modify the above parts for the case in which X has a non-zero mean.
Chapter 2
POISSON PROCESSES

2.1 Introduction
A Poisson process is a simple and widely used stochastic process for modeling the times at which arrivals enter a system. It is in many ways the continuous-time version of the Bernoulli process that was described briefly in Subsection 1.3.5. Recall that the Bernoulli process is defined by a sequence of IID binary rv's Y1, Y2, . . . , with PMF p_Y(1) = q specifying the probability of an arrival in each time slot i > 0. There is an associated counting process {N(t); t ≥ 0} giving the number of arrivals up to and including time slot t. The PMF for N(t), for integer t > 0, is the binomial p_{N(t)}(n) = \binom{t}{n} q^n (1 − q)^{t−n}. There is also a sequence S1, S2, . . . of integer arrival times (epochs), where the rv S_i is the epoch of the ith arrival. Finally there is an associated sequence of interarrival times, X1, X2, . . . , which are IID with the geometric PMF, p_{X_i}(x) = q(1 − q)^{x−1} for positive integer x. It is intuitively clear that the Bernoulli process is fully specified by specifying that the interarrival intervals are IID with the geometric PMF.

For the Poisson process, arrivals may occur at any time, and the probability of an arrival at any particular instant is 0. This means that there is no very clean way of describing a Poisson process in terms of the probability of an arrival at any given instant. It is more convenient to define a Poisson process in terms of the sequence of interarrival times, X1, X2, . . . , which are defined to be IID. Before doing this, we describe arrival processes in a little more detail.
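As a small illustration of the Bernoulli process just described (a sketch with invented parameters, not part of the text), one can simulate many sample paths and compare the empirical PMF of N(t) with the binomial PMF above:

```python
# Sketch: empirical PMF of N(t) for a Bernoulli process vs. the binomial PMF.
import numpy as np
from math import comb

rng = np.random.default_rng(4)
q, t, n_paths = 0.3, 10, 100_000

arrivals = rng.random((n_paths, t)) < q        # slot i has an arrival with probability q
n_t = arrivals.sum(axis=1)                     # counting rv N(t) for each sample path

for n in range(t + 1):
    empirical = np.mean(n_t == n)
    binom = comb(t, n) * q**n * (1 - q)**(t - n)
    print(f"n={n:2d}  empirical {empirical:.4f}   binomial {binom:.4f}")
```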
2.1.1 Arrival processes
An arrival process is a sequence of increasing rv’s , 0 < S1 < S2 < · · · , where Si < Si+1 means that Si+1 − Si is a positive rv, i.e., a rv X such that Pr{X ≤ 0} = 0. These random variables are called arrival epochs (the word time is somewhat overused in this subject) and represent the times at which some repeating phenomenon occurs. Note that the process starts at time 0 and that multiple arrivals can’t occur simultaneously (the phenomenon of bulk arrivals can be easily handled by the simple extension of associating a positive integer rv to each arrival). We will often specify arrival processes in a way that allows an arrival at 58
time 0 or simultaneous arrivals as events of zero probability, but such zero probability events can usually be ignored. In order to fully specify the process by the sequence S1 , S2 , . . . of rv’s, it is necessary to specify the joint distribution of the subsequences S1 , . . . , Sn for all n > 1. Although we refer to these processes as arrival processes, they could equally well model departures from a system, or any other sequence of incidents. Although it is quite common, especially in the simulation field, to refer to incidents or arrivals as events, we shall avoid that here. The nth arrival epoch Sn is a rv and {Sn ≤ t}, for example, is an event. This would make it confusing to also refer to the nth arrival itself as an event.
Figure 2.1: An arrival process and its arrival epochs {S1 , S2 , . . . }, its interarrival intervals {X1 , X2 , . . . }, and its counting process {N (t); t ≥ 0}
As illustrated in Figure 2.1, any arrival process can also be specified by two other stochastic processes. The first is the sequence of interarrival times, X1, X2, . . . . These are positive rv's defined in terms of the arrival epochs by X1 = S1 and X_i = S_i − S_{i−1} for i > 1. Similarly, given the X_i, the arrival epochs S_i are specified as

S_n = Σ_{i=1}^{n} X_i.     (2.1)
Thus the joint distribution of X1, . . . , Xn for all n > 1 is sufficient (in principle) to specify the arrival process. When the interarrival times are IID, as they are for the processes of most interest here, it is usually much easier to specify the joint distribution of the X_i than of the S_i.
The second alternative to specify an arrival process is the counting process N (t), where for each t > 0, the rv N (t) is the number of arrivals up to and including time t. The counting process {N (t); t > 0}, illustrated in Figure 2.1, is an uncountably infinite family of rv’s {N (t); t ≥ 0} where N (t), for each t > 0, is the number of arrivals in the interval (0, t]. Whether the end points are included in these intervals is sometimes important, and we use parentheses to represent intervals without end points and square brackets to represent inclusion of the end point. Thus (a, b) denotes the interval {t : a < t < b}, and (a, b] denotes {t : a < t ≤ b}. The counting rv’s N (t) for each t > 0 are then defined as the number of arrivals in the interval (0, t]. N (0) is defined to be 0 with probability 1, which means, as before, that we are considering only arrivals at strictly positive times. The counting process {N (t), t ≥ 0} for any arrival process has the properties that N (τ ) ≥ N (t) for all τ ≥ t > 0 (i.e., N (τ ) − N (t) is a non-negative random variable).
For any given integer n ≥ 1 and time t ≥ 0, the nth arrival epoch, Sn , and the counting random variable, N (t), are related by {Sn ≤ t} = {N (t) ≥ n}.
(2.2)
To see this, note that {Sn ≤ t} is the event that the nth arrival occurs by time t. This event implies that N (t), the number of arrivals by time t, must be at least n; i.e., it implies the event {N (t) ≥ n}. Similarly, {N (t) ≥ n} implies {Sn ≤ t}, yielding the equality in (2.2). This equation is essentially obvious from Figure 2.1, but is one of those peculiar obvious things that is often difficult to see. One should be sure to understand it, since it is fundamental in going back and forth between arrival epochs and counting rv’s. In principle, (2.2) specifies the joint distributions of {Si ; i > 0} and {N (t); t > 0} in terms of each other, and we will see many examples of this in what follows. In summary, then, an arrival process can be specified by the joint distributions of the arrival epochs, the interarrival intervals, or the counting rv’s. In principle, specifying any one of these specifies the others also.1
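The relation (2.2) is easy to check numerically. The sketch below (with an arbitrarily chosen interarrival distribution, since (2.2) holds for any arrival process) builds the arrival epochs and the counting rv from the interarrival times and verifies that the two events coincide:

```python
# Sketch: numerical check that {S_n <= t} and {N(t) >= n} are the same event.
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0.0, 2.0, size=1000)    # interarrival times X_1, X_2, ... (illustrative choice)
s = np.cumsum(x)                         # arrival epochs S_n = X_1 + ... + X_n

def count_N(t):
    """N(t): number of arrivals in (0, t]."""
    return int(np.sum(s <= t))

for t in (5.0, 50.0, 500.0):
    for n in (1, 10, 100, 500):
        assert (s[n - 1] <= t) == (count_N(t) >= n)
print("Checked: {S_n <= t} and {N(t) >= n} agree for all tested (n, t).")
```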
2.2 Definition and properties of the Poisson process
The Poisson process is an example of an arrival process, and the interarrival times provide the most convenient description since the interarrival times are defined to be IID. Processes with IID interarrival times are particularly important and form the topic of Chapter 3.

Definition 2.1. A renewal process is an arrival process for which the sequence of interarrival times is a sequence of IID rv's.

Definition 2.2. A Poisson process is a renewal process in which the interarrival intervals have an exponential distribution function; i.e., for some parameter λ, each X_i has the density f_X(x) = λ exp(−λx) for x ≥ 0 (see Footnote 2).

The parameter λ is called the rate of the process. We shall see later that for any interval of size t, λt is the expected number of arrivals in that interval. Thus λ is called the arrival rate of the process.
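A Poisson process can therefore be simulated directly from Definition 2.2 by summing IID exponential interarrival times. The sketch below (rate, horizon, and names are arbitrary choices, not from the text) does this and checks the claim that λt is the expected number of arrivals in an interval of size t:

```python
# Sketch: generate a Poisson process of rate lam and check E[N(t)] against lam*t.
import numpy as np

rng = np.random.default_rng(6)
lam, t, n_paths = 2.5, 10.0, 20_000

# Generate more interarrival times than should be needed to cover (0, t],
# then count the arrival epochs that fall in (0, t].
inter = rng.exponential(scale=1.0 / lam, size=(n_paths, int(3 * lam * t) + 20))
epochs = np.cumsum(inter, axis=1)
counts = np.sum(epochs <= t, axis=1)

print(f"average N(t) ≈ {counts.mean():.3f}     lam*t = {lam * t:.3f}")
print(f"variance of N(t) ≈ {counts.var():.3f}  (also lam*t for a Poisson process)")
```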
2.2.1 Memoryless property
What makes the Poisson process unique among renewal processes is the memoryless property of the exponential distribution.
Footnote 1: By definition, a stochastic process is a collection of rv's, so one might ask whether an arrival process (as a stochastic process) is ‘really’ the arrival epoch process 0 ≤ S1 ≤ S2 ≤ · · · or the interarrival process X1, X2, . . . or the counting process {N(t); t ≥ 0}. The arrival time process comes to grips with the actual arrivals, the interarrival process is often the simplest, and the counting process ‘looks’ most like a stochastic process in time since N(t) is a rv for each t ≥ 0. It seems preferable, since the descriptions are so clearly equivalent, to view arrival processes in terms of whichever description is more convenient at the time.

Footnote 2: With this density, Pr{X_i = 0} = 0, so that we can regard X_i as a positive random variable. Since events of probability zero can be ignored, the density λ exp(−λx) for x ≥ 0 and zero for x < 0 is effectively the same as the density λ exp(−λx) for x > 0 and zero for x ≤ 0.
Definition 2.3. Memoryless random variables: A non-negative non-deterministic rv X possesses the memoryless property if, for every x ≥ 0 and t ≥ 0, Pr{X > t + x} = Pr{X > x} Pr{X > t} .
(2.3)
Note that (2.3) is a statement about the complementary distribution function of X. There is no intimation that the event {X > t + x} in the equation has any relation to the events {X > t} or {X > x}. For the case x = t = 0, (2.3) says that Pr{X > 0} = [Pr{X > 0}]2 , so, since X is non-deterministic, Pr{X > 0} = 1. In a similar way, by looking at cases where x = t, it can be seen that Pr{X > t} > 0 for all t ≥ 0. Thus (2.3) can be rewritten as Pr{X > t + x | X > t} = Pr{X > x} .
(2.4)
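Before interpreting (2.4), here is a quick numerical check of it for exponential samples (a sketch with an arbitrary rate and sample size; not part of the text's development):

```python
# Sketch: empirical check of the memoryless property Pr{X > t+x | X > t} = Pr{X > x}.
import numpy as np

rng = np.random.default_rng(7)
lam, n_samples = 1.5, 1_000_000
x_samples = rng.exponential(scale=1.0 / lam, size=n_samples)

t, x = 0.8, 0.5
p_uncond = np.mean(x_samples > x)                  # Pr{X > x}
survivors = x_samples[x_samples > t]               # samples with X > t
p_cond = np.mean(survivors > t + x)                # Pr{X > t+x | X > t}
print(f"Pr{{X > x}} ≈ {p_uncond:.4f}   Pr{{X > t+x | X > t}} ≈ {p_cond:.4f}")
```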
If X is interpreted as the waiting time until some given arrival, then (2.4) states that, given that the arrival has not occured by time t, the distribution of the remaining waiting time (given by x on the left side of (2.4)) is the same as the original waiting time distribution (given on the right side of (2.4)), i.e., the remaining waiting time has no memory of previous waiting. Example 2.2.1. If X is the waiting time for a bus to arrive, and X is memoryless, then after you wait 15 minutes, you are no better off than you were originally. On the other hand, if the bus is known to arrive regularly every 16 minutes, then you know that it will arrive within a minute, so X is not memoryless. The opposite situation is also possible. If the bus frequently breaks down, then a 15 minute wait can indicate that the remaining wait is probably very long, so again X is not memoryless. We study these non-memoryless situations when we study renewal processes in the next chapter. For an exponential rv X of rate ∏, Pr{X > x} = e−∏x , so (2.3) is satisfied and X is memoryless. Conversely, it turns out that an arbitrary non-negative non-deterministic rv X is memoryless only if it is exponential. To see this, let h(x) = ln[Pr{X > x}] and observe that since Pr{X > x} is nonincreasing in x, h(x) is also. In addition, (2.3) says that h(t + x) = h(x) + h(t) for all x, t ≥ 0. These two statements (see Exercise 2.6) imply that h(x) must be linear in x, and Pr{X > x} must be exponential in x. Although the exponential distribution is the only memoryless distribution, it is interesting to note that if we restrict the definition of memoryless to integer times, then the geometric distribution is memoryless, so the Bernoulli process in this respect seems like a discrete-time version of the Poisson process. We now use the memoryless property of the exponential rv to find the distribution of the first arrival in a Poisson process after some given time t > 0. We not only find this distribution, but also show that this first arrival after t is independent of all arrivals up to and including t. Note that t is an arbitrarily selected constant here; it is not a random variable. Let Z be the duration of the interval from t until the first arrival after t. First we find Pr{Z > z | N (t) = 0} .
Figure 2.2: For some fixed t, consider the event N (t) = 0. Conditional on this event, Z is the interval from t to S1 ; i.e., Z = X1 − t.
As illustrated in Figure 2.2, for {N (t) = 0}, the first arrival after t is the first arrival of the process. Stating this more precisely, the following events are identical:3 {Z > z}
\ \ {N (t) = 0} = {X1 > z + t} {N (t) = 0}.
The conditional probabilities are then
Pr{Z > z | N (t)=0} = Pr{X1 > z + t | N (t)=0} = Pr{X1 > z + t | X1 > t} −∏z
= Pr{X1 > z} = e
.
(2.5) (2.6)
In (2.5), we used the fact that {N (t) = 0} = {X1 > t}, which is clear from Figure 2.1. In (2.6) we used the memoryless condition in (2.4) and the fact that X1 is exponential. Next consider the condition that there are n arrivals in (0, t] and the nth occurs at epoch Sn = τ ≤ t. The argument here is basically the same as that with N (t) = 0, with a few extra details (see Figure 2.3). ✛
X3
✲
N (t)
✛ X2 ✲
✻ ✛
Z
✲
✛ X1 ✲
0
S1
τ S2
t S3
Figure 2.3: Given N (t) = 2, and S2 = τ , X3 is equal to Z + (t − τ ). Also, the event {N (t)=2, S2 =τ } is the same as the event {S2 =τ, X3 >t−τ }. Conditional on N (t) = n and Sn = τ , the first arrival after t is the first arrival after the arrival at Sn , i.e., Z = z corresponds to Xn+1 = z +t−τ . Stating this precisly, the following events are identical: \ \ \ \ {Z > z} {N (t) = n} {Sn = τ } = {Xn+1 > z+t−τ } {N (t) = n} {Sn = τ }. 3
It helps intuition to sometimes think of one event A as conditional on another T event B. More precisely, A given B is the set of sample points in B that are also in A, which is simply A B.
2.2. DEFINITION AND PROPERTIES OF THE POISSON PROCESS
63
Note that Sn = τ is an event of zero probability, but Sn is a sum of n IID random variables with densities, and thus has a density itself, so that other events can be conditioned on it. Pr{Z > z | N (t)=n, Sn =τ } = Pr{Xn+1 > z+t−τ | N (t)=n, Sn =τ }
= Pr{Xn+1 > z+t−τ | Xn+1 >t−τ, Sn =τ }
= Pr{Xn+1 > z+t−τ | Xn+1 >t−τ } −∏z
= Pr{Xn+1 > z} = e
.
(2.7) (2.8) (2.9) (2.10)
In (2.8), we have used the fact that, given Sn = τ, the event N(t) = n is the same as X_{n+1} > t − τ (see Figure 2.3). In (2.9) we used the fact that X_{n+1} is independent of Sn. In (2.10) we used the memoryless condition in (2.4) and the fact that X_{n+1} is exponential. The same argument applies if, in (2.7), we condition not only on Sn but also on S1, . . . , S_{n−1}. Since this is equivalent to conditioning on N(τ) for all τ in (0, t], we have

Pr{Z > z | {N(τ); 0 < τ ≤ t}} = exp(−λz).    (2.11)
The following theorem states this in words.

Theorem 2.1. For a Poisson process of rate λ, and any given time t > 0, the interval from t until the first arrival after t is a nonnegative rv Z with the distribution function 1 − exp[−λz] for z ≥ 0. This rv is independent of all arrival epochs before time t and independent of N(τ) for all τ ≤ t.

The length of our derivation of (2.11) somewhat hides its conceptual simplicity. Z, conditional on the time τ of the last arrival before t, is simply the remaining time until the next arrival, which, by the memoryless property, is independent of τ ≤ t, and hence also independent of everything before t.

Next consider subsequent interarrival intervals after a given time t. For m ≥ 2, let Zm be the interarrival interval from the (m−1)st arrival epoch after t to the mth arrival epoch after t. Given N(t) = n, we see that Zm = X_{m+n}, and therefore Z2, Z3, . . . are IID exponentially distributed random variables, conditional on N(t) = n (see Exercise 2.8). Let Z in (2.11) become Z1 here. Since Z1 is independent of Z2, Z3, . . . and independent of N(t), we see that Z1, Z2, . . . are unconditionally IID and also independent of N(t). It should also be clear that Z1, Z2, . . . are independent of {N(τ); 0 < τ ≤ t}.

The above argument shows that the portion of a Poisson process starting at some time t > 0 is a probabilistic replica of the process starting at 0; that is, the time until the first arrival after t is an exponentially distributed rv with parameter λ, and all subsequent arrivals are independent of this first arrival and of each other and all have the same exponential distribution.

Definition 2.4. A counting process {N(t); t ≥ 0} has the stationary increment property if for every t′ > t > 0, N(t′) − N(t) has the same distribution function as N(t′ − t).

Let us define Ñ(t, t′) = N(t′) − N(t) as the number of arrivals in the interval (t, t′] for any given t′ ≥ t. We have just shown that for a Poisson process, the rv Ñ(t, t′) has the same distribution as N(t′ − t), which means that a Poisson process has the stationary increment property. Thus, the distribution of the number of arrivals in an interval depends on the size of the interval but not on its starting point.

Definition 2.5. A counting process {N(t); t ≥ 0} has the independent increment property if, for every integer k > 0 and every k-tuple of times 0 < t1 < t2 < · · · < tk, the k-tuple of rv's N(t1), Ñ(t1, t2), . . . , Ñ(t_{k−1}, tk) are statistically independent.
For the Poisson process, Theorem 2.1 says that for any t, the time Z1 until the next arrival after t is independent of N(τ) for all τ ≤ t. Letting t1 < t2 < · · · < t_{k−1} < t, this means that Z1 is independent of N(t1), Ñ(t1, t2), . . . , Ñ(t_{k−1}, t). We have also seen that the subsequent interarrival times after Z1, and thus Ñ(t, t′), are independent of N(t1), Ñ(t1, t2), . . . , Ñ(t_{k−1}, t). Renaming t as tk and t′ as t_{k+1}, we see that Ñ(tk, t_{k+1}) is independent of N(t1), Ñ(t1, t2), . . . , Ñ(t_{k−1}, tk). Since this is true for all k, the Poisson process has the independent increment property. In summary, we have proved the following:
Theorem 2.2. Poisson processes have both the stationary increment and independent increment properties.
Note that if we look only at integer times, then the Bernoulli process also has the stationary and independent increment properties.
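As a quick numerical illustration (a Python sketch, not part of the text; the rate, the time t, and the cutoff of 60 interarrivals are arbitrary choices), one can build sample paths from IID exponential interarrival times as in Definition 1 and confirm that the residual time Z after a fixed t has mean 1/λ and does not depend on N(t):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, t, runs = 1.5, 10.0, 100_000

residuals, counts = [], []
for _ in range(runs):
    # Definition 1: arrival epochs are partial sums of IID Exponential(lam) interarrivals.
    arrivals = np.cumsum(rng.exponential(1/lam, size=60))   # 60 interarrivals: sum is far beyond t
    residuals.append(arrivals[arrivals > t][0] - t)          # Z: time from t to the next arrival
    counts.append(int(np.sum(arrivals <= t)))                # N(t)

residuals, counts = np.array(residuals), np.array(counts)
print("E[Z] ≈", residuals.mean(), "   1/λ =", 1/lam)
for n in (10, 15, 20):
    print(f"E[Z | N(t)={n}] ≈", residuals[counts == n].mean())
```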
2.2.2 Probability density of Sn and S1, . . . , Sn
Recall from (2.1) that, for a Poisson process, Sn is the sum of n IID rv's, each with the density function f(x) = λ exp(−λx), x ≥ 0. Also recall that the density of the sum of two independent rv's can be found by convolving their densities, and thus the density of S2 can be found by convolving f(x) with itself, S3 by convolving the density of S2 with f(x), and so forth. The result, for t ≥ 0, is called the Erlang density,4

f_{S_n}(t) = \frac{λ^n t^{n−1} \exp(−λt)}{(n−1)!}.    (2.12)

We can understand this density (and other related matters) much better by reviewing the above mechanical derivation more carefully. The joint density for two continuous independent rv's X1 and X2 is given by f_{X1,X2}(x1, x2) = f_{X1}(x1) f_{X2}(x2). Letting S2 = X1 + X2 and substituting S2 − X1 for X2, we get the following joint density for X1 and the sum S2,

f_{X1 S2}(x1, s2) = f_{X1}(x1) f_{X2}(s2 − x1).

The marginal density for S2 then results from integrating x1 out from the joint density, and this, of course, is the familiar convolution integration. For IID exponential rv's X1, X2, the joint density of X1, S2 takes the following interesting form:

f_{X1 S2}(x1, s2) = λ² exp(−λx1) exp(−λ(s2−x1)) = λ² exp(−λs2)    for 0 ≤ x1 ≤ s2.    (2.13)

4 Another (somewhat poorly chosen and rarely used) name for the Erlang density is the gamma density.
This says that the joint density does not contain x1, except for the constraint 0 ≤ x1 ≤ s2. Thus, for fixed s2, the joint density, and thus the conditional density of X1 given S2 = s2, is uniform over 0 ≤ x1 ≤ s2. The integration over x1 in the convolution equation is then simply multiplication by the interval size s2, yielding the marginal distribution f_{S2}(s2) = λ² s2 exp(−λs2), in agreement with (2.12) for n = 2.

This same curious behavior exhibits itself for the sum of an arbitrary number n of IID exponential rv's. That is,

f_{X1,...,Xn}(x1, . . . , xn) = λ^n exp(−λx1 − λx2 − · · · − λxn).

Letting Sn = X1 + · · · + Xn and substituting Sn − X1 − · · · − X_{n−1} for Xn, this becomes

f_{X1···X_{n−1} Sn}(x1, . . . , x_{n−1}, sn) = λ^n exp(−λsn),

since each xi cancels out above. This equation is valid over the region where each xi ≥ 0 and sn − x1 − · · · − x_{n−1} ≥ 0. The density is 0 elsewhere. The constraint region becomes more clear here if we replace the interarrival intervals X1, . . . , X_{n−1} with the arrival epochs S1, . . . , S_{n−1} where S1 = X1 and Si = Xi + S_{i−1} for 2 ≤ i ≤ n − 1. The joint density then becomes5

f_{S1···Sn}(s1, . . . , sn) = λ^n exp(−λsn)    for 0 ≤ s1 ≤ s2 ≤ · · · ≤ sn.    (2.14)

The interpretation here is the same as with S2. The joint density does not contain any arrival time other than sn, except for the ordering constraint 0 ≤ s1 ≤ s2 ≤ · · · ≤ sn, and thus this joint density is the same for all choices of arrival times satisfying the ordering constraint. Mechanically integrating this over s1, then s2, etc., we get the Erlang formula (2.12). The Erlang density then is the joint density in (2.14) times the volume s_n^{n−1}/(n−1)! of the region of s1, . . . , s_{n−1} satisfying 0 < s1 < · · · < sn. This will be discussed further later.

5 The random vector S = (S1, . . . , Sn) is then related to the interarrival intervals X = (X1, . . . , Xn) by a linear transformation, say S = AX. In general, the joint density of S at s = Ax is f_S(s) = f_X(x)/|det A|. This is because the transformation A carries a cube δ on a side into a parallelepiped of volume δ^n |det A|. In the case here, A is triangular with 1's on the diagonal, so det A = 1.
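As a quick numerical check (a Python sketch with arbitrarily chosen λ and n, not part of the derivation), one can sum IID exponentials and compare the histogram of Sn with the Erlang density (2.12):

```python
import numpy as np
from math import factorial

rng = np.random.default_rng(1)
lam, n = 2.0, 4
samples = rng.exponential(1/lam, size=(500_000, n)).sum(axis=1)   # S_n as a sum of n IID exponentials

edges = np.linspace(0, 6, 61)
hist, _ = np.histogram(samples, bins=edges, density=True)          # empirical density of S_n
centers = (edges[:-1] + edges[1:]) / 2
erlang = lam**n * centers**(n-1) * np.exp(-lam*centers) / factorial(n-1)   # the Erlang density (2.12)
print("max |empirical - Erlang| ≈", np.max(np.abs(hist - erlang)))
```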
2.2.3 The PMF for N(t)
The Poisson counting process, {N(t); t > 0}, consists of a discrete rv N(t) for each t > 0. In this section, we show that the PMF for this rv is the well-known Poisson PMF, as stated in the following theorem. We give two proofs for the theorem, each providing its own type of understanding and each showing the close relationship between {N(t) = n} and {Sn = t}.

Theorem 2.3. For a Poisson process of rate λ, and for any t > 0, the PMF for N(t) (i.e., the number of arrivals in (0, t]) is given by the Poisson PMF,

p_{N(t)}(n) = \frac{(λt)^n \exp(−λt)}{n!}.    (2.15)
Proof 1: This proof, for given n and t, is based on two ways of calculating the probability Pr{t < S_{n+1} ≤ t + δ} for some vanishingly small δ. The first way is based on the already
known density of S_{n+1} and gives

Pr{t < S_{n+1} ≤ t + δ} = \int_t^{t+δ} f_{S_{n+1}}(τ) dτ = f_{S_{n+1}}(t) (δ + o(δ)).

The term o(δ) is used to describe a function of δ that goes to 0 faster than δ as δ → 0. More precisely, a function g(δ) is said to be of order o(δ) if lim_{δ→0} g(δ)/δ = 0. Thus Pr{t < S_{n+1} ≤ t + δ} = f_{S_{n+1}}(t)(δ + o(δ)) is simply a consequence of the fact that S_{n+1} has a continuous probability density in the interval [t, t + δ].

The second way is that {t < S_{n+1} ≤ t + δ} occurs if exactly n arrivals arrive in the interval (0, t] and one arrival occurs in (t, t + δ]. Because of the independent increment property, this is an event of probability p_{N(t)}(n)(λδ + o(δ)). It is also possible to have fewer than n arrivals in (0, t] and more than one in (t, t + δ], but this has probability o(δ). Thus

p_{N(t)}(n)(λδ + o(δ)) + o(δ) = f_{S_{n+1}}(t)(δ + o(δ)).

Dividing by δ and taking the limit δ → 0, we get λ p_{N(t)}(n) = f_{S_{n+1}}(t). Using the Erlang density (2.12) (with n replaced by n+1) for f_{S_{n+1}}, we get (2.15).

Proof 2: The approach here is to use the fundamental relation that {N(t) ≥ n} = {Sn ≤ t}. Taking the probabilities of these events,

\sum_{i=n}^{∞} p_{N(t)}(i) = \int_0^t f_{S_n}(τ) dτ    for all n ≥ 1 and t > 0.
The term on the right above is the distribution function of Sn for each n ≥ 1 and the term on the left is the complementary distribution function of N(t) for each t > 0. Thus this equation (for all n ≥ 1, t > 0) uniquely specifies the PMF of N(t) for each t > 0. The theorem will then be proven by showing that

\sum_{i=n}^{∞} \frac{(λt)^i \exp(−λt)}{i!} = \int_0^t f_{S_n}(τ) dτ.    (2.16)

If we take the derivative with respect to t of each side of (2.16), we find that almost magically each term except the first on the left cancels out, leaving us with

\frac{λ^n t^{n−1} \exp(−λt)}{(n−1)!} = f_{S_n}(t).

Thus the derivative with respect to t of each side of (2.16) is equal to the derivative of the other for all n ≥ 1 and t > 0. The two sides of (2.16) are also equal in the limit t → 0, so it follows that (2.16) is satisfied everywhere, completing the proof.
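The identity (2.16) is easy to check numerically. The following Python sketch (parameters chosen arbitrarily) compares the complementary Poisson distribution function with a numerical integral of the Erlang density:

```python
import math

lam, t, n = 1.3, 2.0, 3

# Left side of (2.16): Pr{N(t) >= n}, summed far enough out for the tail to be negligible
left = sum((lam*t)**i * math.exp(-lam*t) / math.factorial(i) for i in range(n, 100))

# Right side of (2.16): midpoint-rule integral of the Erlang density f_{S_n} over (0, t]
steps = 100_000
dt = t / steps
right = sum(lam**n * ((k+0.5)*dt)**(n-1) * math.exp(-lam*(k+0.5)*dt) / math.factorial(n-1)
            for k in range(steps)) * dt

print(left, right)   # the two numbers should agree to several decimal places
```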
2.2.4 Alternate definitions of Poisson processes
Definition 2 of a Poisson process: A Poisson counting process {N(t); t ≥ 0} is a counting process that satisfies (2.15) (i.e., has the Poisson PMF) and has the independent and stationary increment properties.

We have seen that the properties in Definition 2 are satisfied starting with Definition 1 (using IID exponential interarrival times), so Definition 1 implies Definition 2. Exercise 2.4 shows that IID exponential interarrival times are implied by Definition 2, so the two definitions are equivalent. It may be somewhat surprising at first to realize that a counting process that has the Poisson PMF at each t is not necessarily a Poisson process, and that the independent and stationary increment properties are also necessary. One way to see this is to recall that the Poisson PMF for all t in a counting process is equivalent to the Erlang density for the successive arrival epochs. Specifying the probability density for S1, S2, . . . as Erlang specifies the marginal densities of S1, S2, . . . , but need not specify the joint densities of these rv's. Figure 2.4 illustrates this in terms of the joint density of S1, S2, given as

f_{S1 S2}(s1, s2) = λ² exp(−λs2)    for 0 ≤ s1 ≤ s2,

and 0 elsewhere. The figure illustrates how the joint density can be changed without changing the marginals.
Figure 2.4: The joint density of S1, S2 is nonzero in the region shown. It can be changed, while holding the marginals constant, by reducing the joint density by ε in the upper left and lower right squares above and increasing it by ε in the upper right and lower left squares.
There is a similar effect with the Bernoulli process in that there is a discrete counting process for which the number of arrivals from 0 to t is a binomial rv for each integer t, but the process is not Bernoulli. This is explored in Exercise 2.5.

The next definition of a Poisson process is based on its incremental properties. Consider the number of arrivals in some very small interval (t, t + δ]. Since Ñ(t, t + δ) has the same distribution as N(δ), we can use (2.15) to get

Pr{Ñ(t, t + δ) = 0} = e^{−λδ} ≈ 1 − λδ + o(δ)
Pr{Ñ(t, t + δ) = 1} = λδ e^{−λδ} ≈ λδ + o(δ)
Pr{Ñ(t, t + δ) ≥ 2} ≈ o(δ).    (2.17)
Definition 3 of a Poisson process: A Poisson counting process is a counting process that satisfies (2.17) and has the stationary and independent increment properties.

We have seen that Definition 1 implies Definition 3. The essence of the argument the other way is that for any interarrival interval X, F_X(x + δ) − F_X(x) is the probability of an arrival in an appropriate infinitesimal interval of width δ, which by (2.17) is λδ + o(δ). Turning this into a differential equation (see Exercise 2.7), we get the desired exponential interarrival intervals. Definition 3 has an intuitive appeal, since it is based on the idea of independent arrivals during arbitrary disjoint intervals. It has the disadvantage that one must do a considerable amount of work to be sure that these conditions are mutually consistent, and probably the easiest way is to start with Definition 1 and derive these properties. Showing that there is a unique process that satisfies the conditions of Definition 3 is even harder, but is not necessary at this point, since all we need is the use of these properties. Section 2.2.5 will illustrate better how to use this definition (or more precisely, how to use (2.17)).

What (2.17) accomplishes, beyond the assumption of independent and stationary increments in Definition 3, is the prevention of bulk arrivals. For example, consider a counting process in which arrivals always occur in pairs, and the intervals between successive pairs are IID and exponentially distributed with parameter λ (see Figure 2.5). For this process, Pr{Ñ(t, t+δ) = 1} = 0 and Pr{Ñ(t, t+δ) = 2} = λδ + o(δ), thus violating (2.17). This process has stationary and independent increments, however, since the process formed by viewing a pair of arrivals as a single incident is a Poisson process.
Figure 2.5: A counting process modeling bulk arrivals. X1 is the time until the first pair of arrivals and X2 is the interval between the first and second pair of arrivals.
2.2.5 The Poisson process as a limit of shrinking Bernoulli processes
The intuition of Definition 3 can be achieved in a much less abstract way by starting with the Bernoulli process, which has the properties of Definition 3 in a discrete-time sense. We then go to an appropriate limit of a sequence of these processes, and find that this sequence of Bernoulli processes converges in various ways to the Poisson process.

Recall that a Bernoulli process is an IID sequence, Y1, Y2, . . . , of binary random variables for which p_Y(1) = q and p_Y(0) = 1 − q. We can visualize Yi = 1 as an arrival at time i and Yi = 0 as no arrival, but we can also ‘shrink’ the time scale of the process so that for some integer j > 0, Yi is an arrival or no arrival at time i2^{−j}. We consider a sequence indexed by j of such shrinking Bernoulli processes, and in order to keep the arrival rate constant, we let q = λ2^{−j} for the jth process. Thus for each unit increase in j, the Bernoulli process shrinks by replacing each slot with two slots, each with half the previous arrival probability. The expected number of arrivals per time unit is then λ, matching the Poisson process that we are approximating.

If we look at this jth process relative to Definition 3 of a Poisson process, we see that for these regularly spaced increments of size δ = 2^{−j}, the probability of one arrival in an increment is λδ and that of no arrival is 1 − λδ, and thus (2.17) is satisfied, and in fact the o(δ) terms are exactly zero. For arbitrary sized increments, it is clear that disjoint increments have independent arrivals. The increments are not quite stationary, since, for example, an increment of size 2^{−j−1} might contain a time that is a multiple of 2^{−j} or might not, depending on its placement. However, for any fixed increment of size δ, the number of multiples of 2^{−j} (i.e., the number of possible arrival points) is either ⌊δ2^j⌋ or 1 + ⌊δ2^j⌋. Thus in the limit j → ∞, the increments are both stationary and independent.

For each j, the jth Bernoulli process has an associated Bernoulli counting process N_j(t) = \sum_{i=1}^{⌊t2^j⌋} Y_i. This is the number of arrivals up to time t and is a discrete rv with the binomial PMF. That is, p_{N_j(t)}(n) = \binom{⌊t2^j⌋}{n} q^n (1 − q)^{⌊t2^j⌋−n} where q = λ2^{−j}. We now show that this PMF approaches the Poisson PMF as j increases.
Theorem 2.4. Consider the sequence of shrinking Bernoulli processes with arrival probability λ2^{−j} and time-slot size 2^{−j}. Then for every fixed time t > 0 and fixed number of arrivals n, the counting PMF p_{N_j(t)}(n) approaches the Poisson PMF (of the same λ) with increasing j, i.e.,
\lim_{j→∞} p_{N_j(t)}(n) = p_{N(t)}(n).    (2.18)
Proof: We first rewrite the binomial PMF, for ⌊t2^j⌋ variables with q = λ2^{−j}, as

\lim_{j→∞} p_{N_j(t)}(n) = \lim_{j→∞} \binom{⌊t2^j⌋}{n} \left(\frac{λ2^{−j}}{1 − λ2^{−j}}\right)^n \exp[⌊t2^j⌋ \ln(1 − λ2^{−j})]
= \lim_{j→∞} \binom{⌊t2^j⌋}{n} \left(\frac{λ2^{−j}}{1 − λ2^{−j}}\right)^n \exp(−λt)    (2.19)
= \lim_{j→∞} \frac{⌊t2^j⌋ · ⌊t2^j − 1⌋ · · · ⌊t2^j − n + 1⌋}{n!} \left(\frac{λ2^{−j}}{1 − λ2^{−j}}\right)^n \exp(−λt)    (2.20)
= \frac{(λt)^n \exp(−λt)}{n!}.    (2.21)

We used ln(1 − λ2^{−j}) = −λ2^{−j} + o(2^{−j}) in (2.19) and expanded the combinatorial term in (2.20). In (2.21), we recognized that \lim_{j→∞} ⌊t2^j − i⌋ \left(\frac{λ2^{−j}}{1 − λ2^{−j}}\right) = λt for 0 ≤ i ≤ n − 1.

Since the binomial PMF (scaled as above) has the Poisson PMF as a limit for each n, the distribution function of N_j(t) also converges to the Poisson distribution function for each t. In other words, for each t > 0, the counting random variables N_j(t) of the Bernoulli processes converge in distribution to N(t) of the Poisson process. By Definition 2, then,6 this limiting process is a Poisson process. With the same scaling, the distribution function of the geometric distribution converges to the exponential distribution function,

\lim_{j→∞} (1 − λ2^{−j})^{⌊t2^j − 1⌋} = \exp(−λt).
Note that no matter how large j is, the corresponding shrunken Bernoulli process can have arrivals only at the discrete times that are multiples of 2−j . Thus the interarrival times and the arrival epochs are discrete rv’s and in no way approach the densities of the Poisson process. The distribution functions of these rv’s, however quickly approach the distribution functions of the corresponding Poisson rv’s. This is a good illustration of why it is sensible to focus on distribution functions rather than PDF’s or PMF’s.
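The convergence in (2.18) is easy to see numerically. The following Python sketch (with an arbitrarily chosen λ, t, and n) evaluates the binomial PMF of the shrinking Bernoulli process for increasing j and compares it with the Poisson limit:

```python
from math import comb, exp, factorial

lam, t, n = 1.0, 3.0, 4
for j in (4, 8, 12, 16):
    m, q = int(t * 2**j), lam * 2**-j          # number of slots up to time t, per-slot arrival probability
    binom = comb(m, n) * q**n * (1-q)**(m-n)   # p_{N_j(t)}(n) for the shrinking Bernoulli process
    print(j, binom)
print("Poisson limit:", (lam*t)**n * exp(-lam*t) / factorial(n))
```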
2.3 Combining and splitting Poisson processes
Suppose that {N1(t); t ≥ 0} and {N2(t); t ≥ 0} are independent Poisson counting processes7 of rates λ1 and λ2 respectively. We want to look at the sum process where N(t) = N1(t) + N2(t) for all t ≥ 0. In other words, {N(t); t ≥ 0} is the process consisting of all arrivals to both process 1 and process 2. We shall show that {N(t); t ≥ 0} is a Poisson counting process of rate λ = λ1 + λ2. We show this in three different ways, first using Definition 3 of a Poisson process (since that is most natural for this problem), then using Definition 2, and finally Definition 1. We then draw some conclusions about the way in which each approach is helpful.

Since {N1(t); t ≥ 0} and {N2(t); t ≥ 0} are independent and both possess the stationary and independent increment properties, it follows from the definitions that {N(t); t ≥ 0} also possesses the stationary and independent increment properties. Using the approximations in (2.17) for the individual processes, we see that

Pr{Ñ(t, t + δ) = 0} = Pr{Ñ1(t, t + δ) = 0} Pr{Ñ2(t, t + δ) = 0} = (1 − λ1δ)(1 − λ2δ) ≈ 1 − λδ,

where the λ1λ2δ² term has been dropped. In the same way, Pr{Ñ(t, t+δ) = 1} is approximated by λδ and Pr{Ñ(t, t+δ) ≥ 2} is approximated by 0, both with errors proportional to δ². It follows that {N(t); t ≥ 0} is a Poisson process.

6 Note that the counting process is one way of defining the Poisson process, but this requires the joint distributions of the counting variables, not just the marginal distributions. This is why Definition 2 requires not only the Poisson distribution for each N(t) (i.e., each marginal distribution), but also the stationary and independent increment properties. Exercise 2.5 gives an example for how the binomial distribution can be satisfied without satisfying the discrete version of the independent increment property.

7 Two processes {N1(t); t ≥ 0} and {N2(t); t ≥ 0} are said to be independent if for all positive integers k and all sets of times t1, . . . , tk, the random variables N1(t1), . . . , N1(tk) are independent of N2(t1), . . . , N2(tk). Here it is enough to extend the independent increment property to independence between increments over the two processes; equivalently, one can require the interarrival intervals for one process to be independent of the interarrivals for the other process.
In the second approach, we have N (t) = N1 (t) + N2 (t). Since N (t), for any given t, is the sum of two independent Poisson rv’s , it is also a Poisson rv with mean ∏t = ∏1 t + ∏2 t. If the reader is not aware that the sum of two independent Poisson rv’s is Poisson, it can be derived by discrete convolution of the two PMF’s (see Exercise 1.18). More elegantly, one can observe that we have already implicitly shown this fact. That is, if we break an interval I into disjoint subintervals, I1 and I2 , the number of arrivals in I (which is Poisson) is the sum of the number of arrivals in I1 and in I2 (which are independent Poisson). Finally, since N (t) is Poisson for each t, and since the stationary and independent increment properties are satisfied, {N (t); t ≥ 0} is a Poisson process.
In the third approach, X1 , the first interarrival interval for the sum process, is the minimum of X11 , the first interarrival interval for the first process, and X21 , the first interarrival interval for the second process. Thus X1 > t if and only if both X11 and X21 exceed t, so Pr{X1 > t} = Pr{X11 > t} Pr{X21 > t} = exp(−∏1 t − ∏2 t) = exp(−∏t). Using the memoryless property, each subsequent interarrival interval can be analyzed in the same way. The first approach above was the most intuitive for this problem, but it required constant care about the order of magnitude of the terms being neglected. The second approach was the simplest analytically (after recognizing that sums of independent Poisson rv’s are Poisson), and required no approximations. The third approach was very simple in retrospect, but not very natural for this problem. If we add many independent Poisson processes together, it is clear, by adding them one at a time, that the sum process is again Poisson. What is more interesting is that when many independent counting processes (not necessarily Poisson) are added together, the sum process often tends to be approximately Poisson if the individual processes have small rates compared to the sum. To obtain some crude intuition about why this might be expected, note that the interarrival intervals for each process (assuming no bulk arrivals) will tend to be large relative to the mean interarrival interval
for the sum process. Thus arrivals that are close together in time will typically come from different processes. The number of arrivals in an interval large relative to the combined mean interarrival interval, but small relative to the individual interarrival intervals, will be the sum of the number of arrivals from the different processes; each of these is 0 with large probability and 1 with small probability, so the sum will be approximately Poisson.
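As a simple numerical illustration (a Python sketch, not part of the text; the rates and horizon are arbitrary choices), one can merge two independently generated Poisson processes and check that the combined interarrival times behave like Exponential(λ1 + λ2):

```python
import numpy as np

rng = np.random.default_rng(2)
lam1, lam2, horizon = 0.7, 1.1, 2000.0

def poisson_arrivals(rate):
    # Definition 1: arrival epochs are cumulative sums of IID Exponential(rate) interarrivals
    xs = rng.exponential(1/rate, size=int(3*rate*horizon) + 100)
    s = np.cumsum(xs)
    return s[s <= horizon]

merged = np.sort(np.concatenate([poisson_arrivals(lam1), poisson_arrivals(lam2)]))
gaps = np.diff(merged)
print("mean interarrival of merged process ≈", gaps.mean(), "   1/(λ1+λ2) =", 1/(lam1+lam2))
print("std of interarrivals ≈", gaps.std(), "   (exponential ⇒ std = mean)")
```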
2.3.1 Subdividing a Poisson process
Next we look at how to break {N (t), t ≥ 0}, a Poisson counting process of rate ∏, into two processes, {N1 (t), t ≥ 0} and {N2 (t), t ≥ 0}. Suppose that each arrival in {N (t), t ≥ 0} is sent to the first process with probability p and to the second process with probability 1 − p (see Figure 2.6). Each arrival is switched independently of each other arrival and independently of the arrival epochs. We shall show that the resulting processes are each Poisson, with rates ∏1 = ∏p and ∏2 = ∏(1 − p) respectively, and that furthermore the two processes are independent. Note that, conditional on the original process, the two new processes are not independent; in fact one completely determines the other. Thus this independence might be a little surprising.
Figure 2.6: Each arrival is independently sent to process 1 (rate λ1 = pλ) with probability p and to process 2 (rate λ2 = (1−p)λ) otherwise.

First consider a small increment (t, t + δ]. The original process has an arrival in this incremental interval with probability λδ (ignoring δ² terms as usual), and thus process 1 has an arrival with probability λδp and process 2 with probability λδ(1 − p). Because of the independent increment property of the original process and the independence of the division of each arrival between the two processes, the new processes each have the independent increment property, and from above have the stationary increment property. Thus each process is Poisson. Note now that we cannot verify that the two processes are independent from this small increment model. We would have to show that the number of arrivals for process 1 and 2 are independent over (t, t + δ]. Unfortunately, leaving out the terms of order δ², there is at most one arrival to the original process and no possibility of an arrival to each new process in (t, t + δ]. If it is impossible for both processes to have an arrival in the same interval, they cannot be independent. It is possible, of course, for each process to have an arrival in the same interval, but this is a term of order δ². Thus, without paying attention to the terms of order δ², it is impossible to demonstrate that the processes are independent.

To demonstrate that process 1 and 2 are independent, we first calculate the joint PMF for N1(t), N2(t) for arbitrary t. Conditioning on a given number of arrivals N(t) for the
original process, we have

Pr{N1(t)=m, N2(t)=k | N(t)=m+k} = \frac{(m+k)!}{m!\,k!} p^m (1−p)^k.    (2.22)

Equation (2.22) is simply the binomial distribution, since, given m + k arrivals to the original process, each independently goes to process 1 with probability p. Since the event {N1(t) = m, N2(t) = k} is a subset of the conditioning event above,

Pr{N1(t)=m, N2(t)=k | N(t)=m+k} = \frac{Pr{N1(t)=m, N2(t)=k}}{Pr{N(t)=m+k}}.

Combining this with (2.22), we have

Pr{N1(t)=m, N2(t)=k} = \frac{(m+k)!}{m!\,k!} p^m (1−p)^k \frac{(λt)^{m+k} e^{−λt}}{(m+k)!}.    (2.23)

Rearranging terms, we get

Pr{N1(t)=m, N2(t)=k} = \frac{(pλt)^m e^{−λpt}}{m!} \cdot \frac{[(1−p)λt]^k e^{−λ(1−p)t}}{k!}.    (2.24)

This shows that N1(t) and N2(t) are independent. To show that the processes are independent, we must show that for any k > 1 and any set of times 0 ≤ t1 ≤ t2 ≤ · · · ≤ tk, the sets {N1(ti); 1 ≤ i ≤ k} and {N2(tj); 1 ≤ j ≤ k} are independent of each other. It is equivalent to show that the sets {Ñ1(t_{i−1}, ti); 1 ≤ i ≤ k} and {Ñ2(t_{j−1}, tj); 1 ≤ j ≤ k} (where t0 is 0) are independent. The argument above shows this independence for i = j, and for i ≠ j, the independence follows from the independent increment property of {N(t); t ≥ 0}.
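The factorization in (2.24) is easy to see in simulation. The following Python sketch (parameters arbitrary) splits Poisson counts binomially and checks the means, variances, and covariance of N1(t) and N2(t):

```python
import numpy as np

rng = np.random.default_rng(3)
lam, p, t, runs = 2.0, 0.3, 5.0, 200_000

n  = rng.poisson(lam * t, size=runs)   # N(t) for the original process
n1 = rng.binomial(n, p)                # each of the N(t) arrivals goes to process 1 with probability p
n2 = n - n1

print("mean, var of N1(t):", n1.mean(), n1.var(), "   pλt =", p*lam*t)
print("mean, var of N2(t):", n2.mean(), n2.var(), "   (1-p)λt =", (1-p)*lam*t)
print("cov(N1, N2):", np.cov(n1, n2)[0, 1], "  (independence in (2.24) ⇒ ≈ 0)")
```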
2.3.2 Examples using independent Poisson processes
We have observed that if the arrivals of a Poisson process are split into two new arrival processes, each arrival of the original process independently going into the first of the new processes with some fixed probability p, then the new processes are Poisson processes and are independent. The most useful consequence of this is that any two independent Poisson processes can be viewed as being generated from a single process in this way. Thus, if one process has rate ∏1 and the other has rate ∏2 , they can be viewed as coming from a process of rate ∏1 + ∏2 . Each arrival to the combined process then goes to the first process with probability p = ∏1 /(∏1 + ∏2 ) and to the second process with probability 1 − p. The above point of view is very useful for finding probabilities such as Pr{S1k < S2j } where S1k is the epoch of the kth arrival to the first process and S2j is the epoch of the jth arrival to the second process. The problem can be rephrased in terms of a combined process to ask: out of the first k + j − 1 arrivals to the combined process, what is the probability that k or more of them are switched to the first process? (Note that if k or more of the first k + j − 1 go to the first process, at most j − 1 go to the second, so the kth arrival to the first precedes the jth arrival to the second; similarly if fewer than k of the first k + j − 1 go to the first process, then the jth arrival to the second process precedes the kth arrival
to the first). Since each of these first k + j − 1 arrivals is switched independently with the same probability p, the answer is

Pr{S1k < S2j} = \sum_{i=k}^{k+j−1} \frac{(k+j−1)!}{i!\,(k+j−1−i)!} p^i (1−p)^{k+j−1−i}.    (2.25)
As an example of this, suppose a queueing system has arrivals according to a Poisson process (process 1) of rate ∏. There is a single server who serves arriving customers in order with a service time distribution F (y) = 1−exp[−µy]. Thus during periods when the server is busy, customers leave the system according to a Poisson process (process 2) of rate µ. Thus, if j or more customers are waiting at a given time, then (2.25) gives the probability that the kth subsequent arrival comes before the jth departure.
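A short Python sketch (the parameter values are arbitrary) evaluates (2.25) directly, with p = λ1/(λ1 + λ2) as above:

```python
from math import comb

def p_first_beats_second(k, j, lam1, lam2):
    """Pr{S_{1k} < S_{2j}} via (2.25): at least k of the first k+j-1 combined
    arrivals are routed to process 1, each independently with probability p."""
    p = lam1 / (lam1 + lam2)
    return sum(comb(k+j-1, i) * p**i * (1-p)**(k+j-1-i) for i in range(k, k+j))

print(p_first_beats_second(k=2, j=3, lam1=1.0, lam2=2.0))
```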
2.4 Non-homogeneous Poisson processes
The Poisson process, as we defined it, is characterized by a constant arrival rate λ. It is often useful to consider a more general type of process in which the arrival rate varies as a function of time. A non-homogeneous Poisson process with time varying arrival rate λ(t) is defined8 as a counting process {N(t); t ≥ 0} which has the independent increment property and, for all t ≥ 0, δ > 0, also satisfies:

Pr{Ñ(t, t + δ) = 0} = 1 − δλ(t) + o(δ)
Pr{Ñ(t, t + δ) = 1} = δλ(t) + o(δ)
Pr{Ñ(t, t + δ) ≥ 2} = o(δ),    (2.26)

where Ñ(t, t + δ) = N(t + δ) − N(t). The non-homogeneous Poisson process does not have the stationary increment property.
One common application occurs in optical communication where a non-homogeneous Poisson process is often used to model the stream of photons from an optical modulator; the modulation is accomplished by varying the photon intensity λ(t). We shall see another application shortly in the next example. Sometimes a Poisson process, as we defined it earlier, is called a homogeneous Poisson process.

We can use a “shrinking Bernoulli process” again to approximate a non-homogeneous Poisson process. To see how to do this, assume that λ(t) is bounded away from zero. We partition the time axis into increments whose lengths δ vary inversely with λ(t), thus holding the probability of an arrival in an increment at some fixed value q = δλ(t).
Thus, temporarily ignoring the variation of λ(t) within an increment,

Pr{Ñ(t, t + q/λ(t)) = 0} = 1 − q + o(q)
Pr{Ñ(t, t + q/λ(t)) = 1} = q + o(q)
Pr{Ñ(t, t + q/λ(t)) ≥ 2} = o(q).    (2.27)

This partition is defined more precisely by defining m(t) as

m(t) = \int_0^t λ(τ) dτ.    (2.28)

8 We assume that λ(t) is right continuous, i.e., that for each t, λ(t) is the limit of λ(t+ε) as ε approaches 0 from above. This allows λ(t) to contain discontinuities, as illustrated in Figure 2.7, but follows the convention that the value of the function at the discontinuity is the limiting value from the right. This convention is required in (2.26) to talk about the distribution of arrivals just to the right of time t.
Then the ith increment ends at that t for which m(t) = qi.

Figure 2.7: Partitioning the time axis into increments each with an expected number of arrivals equal to q. Each rectangle above has the same area, which ensures that the ith partition ends where m(t) = qi.
As before, let {Yi; i ≥ 1} be a sequence of IID binary rv's with Pr{Yi = 1} = q and Pr{Yi = 0} = 1 − q. Consider the counting process {N(t); t ≥ 0} in which Yi, for each i ≥ 1, denotes the number of arrivals in the interval (t_{i−1}, ti], where ti satisfies m(ti) = iq. Thus, N(ti) = Y1 + Y2 + · · · + Yi. If q is decreased as 2^{−j}, each increment is successively split into a pair of increments. Thus by the same argument as in (2.21),

Pr{N(t) = n} = \frac{[1 + o(q)][m(t)]^n \exp[−m(t)]}{n!}.    (2.29)

Similarly, for any interval (t, τ], taking m̃(t, τ) = \int_t^τ λ(u) du, and taking t = tk, τ = ti for some k, i, we get

Pr{Ñ(t, τ) = n} = \frac{[1 + o(q)][m̃(t, τ)]^n \exp[−m̃(t, τ)]}{n!}.    (2.30)
Going to the limit q → 0, the counting process {N (t); t ≥ 0} above approaches the nonhomogeneous Poisson process under consideration, and we have the following theorem:
Theorem 2.5. For a non-homogeneous Poisson process with right-continuous arrival rate λ(t) bounded away from zero, the distribution of Ñ(t, τ), the number of arrivals in (t, τ], satisfies

Pr{Ñ(t, τ) = n} = \frac{[m̃(t, τ)]^n \exp[−m̃(t, τ)]}{n!}    where m̃(t, τ) = \int_t^τ λ(u) du.    (2.31)
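As a numerical illustration of (2.31) (a Python sketch with an illustrative, hypothetical rate function; it uses the time-rescaling view developed in the paragraph below, N(t) = N*(m(t)) for a rate-1 process N*):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical rate function and its integral m(t) = ∫_0^t λ(τ)dτ, chosen only for illustration
lam = lambda u: 2.0 + np.sin(u)
m   = lambda u: 2.0*u + 1.0 - np.cos(u)

def count_arrivals(t1, t2, runs=50_000):
    """Ñ(t1, t2] realized as N*(m(t2)) - N*(m(t1)), where N* is a rate-1 Poisson process."""
    out = np.empty(runs, dtype=int)
    for r in range(runs):
        s, k = 0.0, 0
        while True:                        # generate rate-1 epochs until m(t2) is passed
            s += rng.exponential(1.0)
            if s > m(t2):
                break
            k += s > m(t1)                 # count the epochs falling in (m(t1), m(t2)]
        out[r] = k
    return out

counts = count_arrivals(1.0, 3.0)
print("mean, var of Ñ(1,3]:", counts.mean(), counts.var(), "   m̃(1,3) =", m(3.0) - m(1.0))
```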
Hence, one can view a non-homogeneous Poisson process as a (homogeneous) Poisson process over a non-linear time scale. That is, let {N*(s); s ≥ 0} be a (homogeneous) Poisson process with rate 1. The non-homogeneous Poisson process is then given by N(t) = N*(m(t)) for each t.

Example 2.4.1 (The M/G/∞ Queue). Queueing theorists use a standard notation of characters separated by slashes to describe common types of queueing systems. The first character describes the arrival process to the queue. M stands for memoryless and means a Poisson arrival process; D stands for deterministic and means that the interarrival interval is fixed and non-random; G stands for general interarrival distribution. We assume that the interarrival intervals are IID (thus making the arrival process a renewal process), but many authors use GI to explicitly indicate IID interarrivals. The second character describes the service process. The same letters are used, with M indicating the exponential service time distribution. The third character gives the number of servers. It is assumed, when this notation is used, that the service times are IID, independent of the arrival times, and independent of which server is used. With this notation, M/G/∞ indicates a queue with Poisson arrivals, a general service distribution, and an infinite number of servers. Similarly, the example at the end of Section 2.3 considered an M/M/1 queue.

Since the M/G/∞ queue has an infinite number of servers, no arriving customers are ever queued. Each arrival immediately starts to be served by some server, and the service time Yi of customer i is IID over i with some distribution function G(y); the service time is the interval from start to completion of service and is also independent of arrival epochs. We would like to find the distribution function of the number of customers being served at a given epoch τ.

Let {N(t); t ≥ 0} be the Poisson counting process of customer arrivals. Consider the arrival times of those customers that are still in service at some fixed time τ. In some arbitrarily small interval (t, t + δ], the probability of an arrival is δλ + o(δ) and the probability of 2 or more arrivals is negligible (i.e., o(δ)). The probability that an arrival occurred in (t, t + δ] and that that customer is still being served at time τ > t is then δλ[1 − G(τ − t)] + o(δ). Consider a counting process {N1(t); 0 ≤ t ≤ τ} where N1(t) is the number of arrivals between 0 and t that are still in service at τ. This counting process has the independent increment property. To see this, note that the overall arrivals in {N(t); t ≥ 0} have the independent increment property; also the arrivals in {N(t); t ≥ 0} have independent service times, and thus are independently in or not in {N1(t); 0 ≤ t < τ}. It follows that {N1(t); 0 ≤ t < τ} is a non-homogeneous Poisson process with rate λ[1 − G(τ − t)] at time t ≤ τ. The expected number of arrivals still in service at time τ is then

m(τ) = λ \int_{t=0}^{τ} [1 − G(τ − t)] dt = λ \int_{t=0}^{τ} [1 − G(t)] dt,    (2.32)

and the PMF of the number in service at time τ is given by

Pr{N1(τ) = n} = \frac{m(τ)^n \exp(−m(τ))}{n!}.    (2.33)
Note that as τ → ∞, the integral in (2.32) approaches the mean of the service time distribution (i.e., it is the integral of the complementary distribution function, 1 − G(t), of the service time). This means that in steady-state (as τ → ∞), the distribution of the number in service at τ depends on the service time distribution only through its mean. This example can be used to model situations such as the number of phone calls taking place at a given epoch. This requires arrivals of new calls to be modeled as a Poisson process and the holding time of each call to be modeled as a random variable independent of other holding times and of call arrival times.

Finally, as shown in Figure 2.8, we can regard {N1(t); 0 ≤ t ≤ τ} as a splitting of the arrival process {N(t); t ≥ 0}. By the same type of argument as in Section 2.3, the number of customers who have completed service by time τ is independent of the number still in service.

Figure 2.8: Poisson arrivals {N(t); t ≥ 0} can be considered to be split in a non-homogeneous way. An arrival at t is split with probability 1 − G(τ − t) into the process of customers still in service at τ, and otherwise into the process of customers departed by τ.
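As a numerical check on (2.32) and (2.33) (a Python sketch; the arrival rate, observation epoch, and the uniform service distribution are arbitrary illustrative choices, not part of the example):

```python
import numpy as np

rng = np.random.default_rng(5)
lam, tau, runs = 3.0, 8.0, 50_000          # arrival rate, observation epoch, sample paths

def busy_servers_at(tau):
    # Poisson arrivals on (0, tau]; each customer gets an independent service time
    arrivals = np.cumsum(rng.exponential(1/lam, size=int(5*lam*tau) + 50))
    arrivals = arrivals[arrivals <= tau]
    services = rng.uniform(0.0, 2.0, size=len(arrivals))   # G uniform on [0, 2]; mean service time 1
    return np.sum(arrivals + services > tau)               # customers still in service at tau

counts = np.array([busy_servers_at(tau) for _ in range(runs)])
print("mean, var of N1(tau):", counts.mean(), counts.var(),
      "   m(tau) = λ·E[Y] =", lam*1.0, "(once tau exceeds the longest service time)")
```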
2.5 Conditional arrival densities and order statistics
A diverse range of problems involving Poisson processes are best tackled by conditioning on a given number n of arrivals in the interval (0, t], i.e., on the event N(t) = n. Because of the incremental view of the Poisson process as independent and stationary arrivals in each incremental interval of the time axis, we would guess that the arrivals should have some sort of uniform distribution given N(t) = n. More precisely, the following theorem shows that the joint density of S^{(n)} = (S1, S2, . . . , Sn) given N(t) = n is uniform over the region 0 < s1 < s2 < · · · < sn < t.

Theorem 2.6. Let f_{S^{(n)}|N(t)}(s^{(n)} | n) be the joint density of S^{(n)} conditional on N(t) = n. This density is constant over the region 0 < s1 < · · · < sn < t and has the value

f_{S^{(n)}|N(t)}(s^{(n)} | n) = \frac{n!}{t^n}.    (2.34)
Two proofs are given, each illustrative of useful techniques.

Proof 1: Recall that the joint density of the first n + 1 arrivals S^{(n+1)} = (S1, . . . , Sn, S_{n+1}) with no conditioning is given in (2.14). We first use Bayes' law to calculate the joint density of S^{(n+1)} conditional on N(t) = n:

f_{S^{(n+1)}|N(t)}(s^{(n+1)} | n) p_{N(t)}(n) = p_{N(t)|S^{(n+1)}}(n | s^{(n+1)}) f_{S^{(n+1)}}(s^{(n+1)}).

Note that N(t) = n if and only if Sn ≤ t and S_{n+1} > t. Thus p_{N(t)|S^{(n+1)}}(n | s^{(n+1)}) is 1 if sn ≤ t and s_{n+1} > t and is 0 otherwise. Restricting attention to the case N(t) = n, sn ≤ t
and s_{n+1} > t,

f_{S^{(n+1)}|N(t)}(s^{(n+1)} | n) = \frac{f_{S^{(n+1)}}(s^{(n+1)})}{p_{N(t)}(n)}
                                 = \frac{λ^{n+1} \exp(−λs_{n+1})}{(λt)^n \exp(−λt)/n!}
                                 = \frac{n!\,λ \exp(−λ(s_{n+1} − t))}{t^n}.    (2.35)
This is a useful expression, but we are interested in S^{(n)} rather than S^{(n+1)}. Thus we break up the left side of (2.35) as follows:

f_{S^{(n+1)}|N(t)}(s^{(n+1)} | n) = f_{S^{(n)}|N(t)}(s^{(n)} | n) f_{S_{n+1}|S^{(n)} N(t)}(s_{n+1} | s^{(n)}, n).

Conditional on N(t) = n, S_{n+1} is the first arrival epoch after t, which by the memoryless property is independent of S^{(n)}. Thus that final term is simply λ exp(−λ(s_{n+1} − t)) for s_{n+1} > t. Substituting this into (2.35), the result is (2.34).

Proof 2: This alternative proof derives (2.34) by looking at arrivals in very small increments of size δ (see Figure 2.9). For a given t and a given set of n times, 0 < s1 < · · · < sn < t, we calculate the probability that there is a single arrival in each of the intervals (si, si + δ], 1 ≤ i ≤ n, and no other arrivals in the interval (0, t]. Letting A(δ) be this event,
Figure 2.9: Intervals for arrival density.
Pr{A(δ)} = p_{N(s1)}(0) p_{Ñ(s1,s1+δ)}(1) p_{Ñ(s1+δ,s2)}(0) p_{Ñ(s2,s2+δ)}(1) · · · p_{Ñ(sn+δ,t)}(0).
The sum of the lengths of the above intervals is t, so if we represent p_{Ñ(si,si+δ)}(1) as λδ exp(−λδ) + o(δ) for each i, then

Pr{A(δ)} = (λδ)^n exp(−λt) + δ^{n−1} o(δ).

The event A(δ) can be characterized as the event that, first, N(t) = n and, second, that the n arrivals occur in (si, si+δ] for 1 ≤ i ≤ n. Thus we conclude that

f_{S^{(n)}|N(t)}(s^{(n)} | n) = \lim_{δ→0} \frac{Pr{A(δ)}}{δ^n p_{N(t)}(n)},

which simplifies to (2.34).

The joint density of the interarrival intervals, X^{(n)} = (X1, . . . , Xn), given N(t) = n can be found directly from Theorem 2.6 simply by making the linear transformation X1 = S1
Figure 2.10: Mapping from arrival epochs to interarrival times. Note that incremental cubes in the arrival space map into parallelepipeds of the same volume in the interarrival space.
and Xi = Si − S_{i−1} for 2 ≤ i ≤ n. The density is unchanged, but the constraint region transforms into \sum_{i=1}^{n} Xi < t with Xi > 0 for 1 ≤ i ≤ n (see Figure 2.10):

f_{X^{(n)}|N(t)}(x^{(n)} | n) = \frac{n!}{t^n}    for X^{(n)} > 0, \sum_{i=1}^{n} Xi < t.    (2.36)
It is also instructive to compare the joint distribution of S^{(n)} conditional on N(t) = n with the joint distribution of n IID uniformly distributed random variables, U^{(n)} = (U1, . . . , Un), on (0, t]. For any point U^{(n)} = u^{(n)}, this joint density is f_{U^{(n)}}(u^{(n)}) = 1/t^n for 0 < ui ≤ t, 1 ≤ i ≤ n. Both f_{S^{(n)}} and f_{U^{(n)}} are uniform over the volume of n-space where they are non-zero, but as illustrated in Figure 2.11 for n = 2, the volume for the latter is n! times larger than the volume for the former. To explain this more fully, we can define a set of random variables S1, . . . , Sn, not as arrival epochs in a Poisson process, but rather as the order statistics function of the IID uniform variables U1, . . . , Un; that is, S1 = min(U1, . . . , Un), S2 = 2nd smallest of (U1, . . . , Un), etc. The n-cube is partitioned into n! regions, one where u1 < u2 < · · · < un. For each permutation π(i) of the integers 1 to n, there is another region9 where u_{π(1)} < u_{π(2)} < · · · < u_{π(n)}. By symmetry, each of these regions has the same volume, which then must be 1/n! of the volume t^n of the n-cube. Each of these n! regions maps into the same region of ordered values. Thus these order statistics have the same joint probability density function as the arrival epochs S1, . . . , Sn conditional on N(t) = n. Thus anything we know (or can discover) about order statistics is valid for arrival epochs given N(t) = n and vice versa.10

9 As usual, we are ignoring those points where ui = uj for some i, j, since the set of such points has 0 probability.

10 There is certainly also the intuitive notion, given n arrivals in (0, t], and given the stationary and independent increment properties of the Poisson process, that those n arrivals can be viewed as uniformly distributed. It does not seem worth the trouble, however, to make this precise, since there is no natural way to associate each arrival with one of the uniform rv's.
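The correspondence with order statistics is easy to see in simulation. The following Python sketch (with arbitrarily chosen λ, t, and n) conditions Poisson sample paths on N(t) = n and compares the conditional arrival epochs with sorted IID uniform variables:

```python
import numpy as np

rng = np.random.default_rng(6)
lam, t, n, runs = 1.0, 5.0, 3, 200_000

s = np.cumsum(rng.exponential(1/lam, size=(runs, 20)), axis=1)   # arrival epochs, 20 per path
keep = np.sum(s <= t, axis=1) == n                                # condition on N(t) = n
poisson_cond = s[keep][:, :n]                                     # (S1, S2, S3) on those paths

order_stats = np.sort(rng.uniform(0, t, size=(keep.sum(), n)), axis=1)

print("conditional Poisson epochs, means:", poisson_cond.mean(axis=0))
print("uniform order statistics,  means:", order_stats.mean(axis=0))
print("theory t·i/(n+1):                ", [t*i/(n+1) for i in range(1, n+1)])
```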
Figure 2.11: Density for the order statistics of an IID 2-dimensional uniform distribution. Note that the square over which f_{U^{(2)}} is non-zero contains one triangle where u2 > u1 and another of equal size where u1 > u2. Each of these maps, by a permutation mapping, into the single triangle where s2 > s1.

Next we want to find the marginal distribution functions of the individual Si conditional on N(t) = n. Starting with S1, and viewing it as the minimum of the IID uniformly distributed variables U1, . . . , Un, we recognize that S1 > τ if and only if Ui > τ for all i, 1 ≤ i ≤ n. Thus,

Pr{S1 > τ | N(t)=n} = \left[\frac{t−τ}{t}\right]^n    for 0 < τ ≤ t.    (2.37)

Since this is the complement of the distribution function of S1, conditional on N(t) = n, we can integrate it to get the conditional mean of S1,

E[S1 | N(t)=n] = \frac{t}{n+1}.    (2.38)
We come back later to the distribution functions of S2, . . . , Sn, and first look at the marginal distributions of the interarrival intervals. Recall from (2.36) that

f_{X^{(n)}|N(t)}(x^{(n)} | n) = \frac{n!}{t^n}    for X^{(n)} > 0, \sum_{i=1}^{n} Xi < t.    (2.39)
The joint density is the same for all points in the constraint region, and the constraint does not distinguish between X1 to Xn. Thus they must all have the same marginal distribution, and more generally the marginal distribution of any subset of the Xi can depend only on the size of the subset. We have found the distribution of S1, which is the same as X1, and thus

Pr{Xi > τ | N(t)=n} = \left[\frac{t−τ}{t}\right]^n    for 1 ≤ i ≤ n and 0 < τ ≤ t,    (2.40)

E[Xi | N(t)=n] = \frac{t}{n+1}    for 1 ≤ i ≤ n.    (2.41)

Next define X*_{n+1} = t − Sn to be the interval from the largest of the IID variables to t, the right end of the interval. Using (2.39),

f_{X^{(n)}|N(t)}(x^{(n)} | n) = \frac{n!}{t^n}    for X^{(n)} > 0, X*_{n+1} > 0, \sum_{i=1}^{n} Xi + X*_{n+1} = t.
The constraints above are symmetric in X1, . . . , Xn, X*_{n+1}, and the density of X1, . . . , Xn within the constraint region is uniform. This density can be replaced by a density over any other n rv's out of X1, . . . , Xn, X*_{n+1} by a linear transformation with unit determinant. Thus X*_{n+1} has the same marginal distribution as each of the Xi. This gives us a partial check on our work, since the interval (0, t] is divided into n+1 intervals of sizes X1, X2, . . . , Xn, X*_{n+1}, and each of these has a mean size t/(n + 1). We also see that the joint distribution function of any proper subset of X1, X2, . . . , Xn, X*_{n+1} is independent of the order of the variables.
Next consider the distribution function of X_{i+1} for i < n, conditional both on N(t) = n and Si = si (or conditional on any given values for X1, . . . , Xi summing to si). We see that X_{i+1} is just the wait until the first arrival in the interval (si, t], given that this interval contains n − i arrivals. From the same argument as used in (2.37), we have

Pr{X_{i+1} > τ | N(t)=n, Si=si} = \left[\frac{t − si − τ}{t − si}\right]^{n−i}.    (2.42)

Since S_{i+1} is X_{i+1} + Si, this immediately gives us the conditional distribution of S_{i+1},

Pr{S_{i+1} > s_{i+1} | N(t) = n, Si = si} = \left[\frac{t − s_{i+1}}{t − si}\right]^{n−i}.    (2.43)

We note that this is independent of S1, . . . , S_{i−1}. As a check, one can find the conditional densities from (2.43) and multiply them all together to get back to (2.34) (see Exercise 2.25). We can also find the distribution of each Si conditioned on N(t) = n but unconditioned on S1, S2, . . . , S_{i−1}. The density for this is calculated by looking at n uniformly distributed rv's in (0, t]. The probability that one of these lies in the interval (x, x + dt] is (n dt)/t. Out of the remaining n − 1, the probability that i − 1 lie in the interval (0, x] is given by the binomial distribution with probability of success x/t. Thus the desired density is

f_{Si}(x | N(t)=n) dt = \frac{x^{i−1}(t−x)^{n−i}(n−1)!}{t^{n−1}(n−i)!(i−1)!} \cdot \frac{n\,dt}{t},

f_{Si}(x | N(t)=n) = \frac{x^{i−1}(t−x)^{n−i}\,n!}{t^{n}(n−i)!(i−1)!}.    (2.44)

2.6 Summary
We started the chapter with three equivalent definitions of a Poisson process—first as a renewal process with exponentially distributed inter-renewal intervals, second as a stationary and independent increment counting process with Poisson distributed arrivals in each interval, and third essentially as a limit of shrinking Bernoulli processes. We saw that each definition provided its own insights into the properties of the process. We emphasized the importance of the memoryless property of the exponential distribution, both as a useful tool in problem solving and as an underlying reason why the Poisson process is so simple. We next showed that the sum of independent Poisson processes is again a Poisson process. We also showed that if the arrivals in a Poisson process are independently routed to different
locations with some fixed probability assignment, then the arrivals at each of these locations form independent Poisson processes. This ability to view independent Poisson processes either independently or as a splitting of a combined process is a powerful technique for finding almost trivial solutions to many problems.

It was next shown that a non-homogeneous Poisson process could be viewed as a (homogeneous) Poisson process on a non-linear time scale. This allows all the properties of (homogeneous) Poisson processes to be applied directly to the non-homogeneous case. The simplest and most useful result from this is (2.31), showing that the number of arrivals in any interval has a Poisson PMF. This result was used to show that the number of customers in service at any given time τ in an M/G/∞ queue has a Poisson PMF with a mean approaching λ times the expected service time in the limit as τ → ∞.

Finally we looked at the distribution of arrivals conditional on n arrivals in the interval (0, t]. It was found that these arrivals had the same joint distribution as the order statistics of n uniform IID rv's in (0, t]. By using symmetry and going back and forth between the uniform variables and the Poisson process arrivals, we found the distribution of the interarrival times, the arrival epochs, and various conditional distributions.
2.7 Exercises
Exercise 2.1. a) Find the Erlang density fSn (t) by convolving fX (x) = ∏ exp(−∏x) with itself n times. b) Find the moment generating function of X (or find the Laplace transform of fX (x)), and use this to find the moment generating function (or Laplace transform) of Sn = X1 + X2 + · · · + Xn . Invert your result to find fSn (t). c) Find the Erlang density by starting with (2.14) and then calculating the marginal density for Sn . Exercise 2.2. a) Find the mean, variance, and moment generating function of N (t), as given by (2.15). b) Show by discrete convolution that the sum of two independent Poisson rv’s is again Poisson. c) Show by using the properties of the Poisson process that the sum of two independent Poisson rv’s must be Poisson. Exercise 2.3. The purpose of this exercise is to give an alternate derivation of the Poisson distribution for N (t), the number of arrivals in a Poisson process up to time t; let ∏ be the rate of the process. a) Find the conditional probability Pr{N (t) = n | Sn = τ } for all τ ≤ t. b) Using the Erlang density for Sn , use (a) to find Pr{N (t) = n}.
Exercise 2.4. Assume that a counting process {N(t); t ≥ 0} has the independent and stationary increment properties and satisfies (2.15) (for all t > 0). Let X1 be the epoch of the first arrival and Xn be the interarrival time between the (n−1)st and the nth arrival.
a) Show that Pr{X1 > x} = e^{−λx}.
b) Let S_{n−1} be the epoch of the (n−1)st arrival. Show that Pr{Xn > x | S_{n−1} = τ} = e^{−λx}.
c) For each n > 1, show that Pr{Xn > x} = e^{−λx} and that Xn is independent of S_{n−1}.
d) Argue that Xn is independent of X1, X2, . . . , X_{n−1}.

Exercise 2.5. The point of this exercise is to show that the sequence of PMF's for the counting process of a Bernoulli process does not specify the process. In other words, knowing that N(t) satisfies the binomial distribution for all t does not mean that the process is Bernoulli. This helps us understand why the second definition of a Poisson process requires stationary and independent increments as well as the Poisson distribution for N(t).
a) For a sequence of binary rv's Y1, Y2, Y3, . . . , in which each rv is 0 or 1 with equal probability, find a joint distribution for Y1, Y2, Y3 that satisfies the binomial distribution, p_{N(t)}(k) = \binom{t}{k} 2^{−t} for t = 1, 2, 3 and 0 ≤ k ≤ t, but for which Y1, Y2, Y3 are not independent. Your solution should contain four 3-tuples with probability 1/8 each, two 3-tuples with probability 1/4 each, and two 3-tuples with probability 0. Note that by making the subsequent arrivals IID and equiprobable, you have an example where N(t) is binomial for all t but the process is not Bernoulli. Hint: Use the binomial for t = 3 to find two 3-tuples that must have probability 1/8. Combine this with the binomial for t = 2 to find two other 3-tuples with probability 1/8. Finally look at the constraints imposed by the binomial distribution on the remaining four 3-tuples.
b) Generalize part a) to the case where Y1, Y2, Y3 satisfy Pr{Yi = 1} = q and Pr{Yi = 0} = 1 − q. Assume q < 1/2 and find a joint distribution on Y1, Y2, Y3 that satisfies the binomial distribution, but for which the 3-tuple (0, 1, 1) has zero probability.
c) More generally yet, view a joint PMF on binary t-tuples as a non-negative vector in a 2^t dimensional vector space. Each binomial probability p_{N(τ)}(k) = \binom{τ}{k} q^k (1 − q)^{τ−k} constitutes a linear constraint on this vector. For each τ, show that one of these constraints may be replaced by the constraint that the components of the vector sum to 1.
d) Using part c), show that at most (t + 1)t/2 + 1 of the binomial constraints are linearly independent. Note that this means that the linear space of vectors satisfying these binomial constraints has dimension at least 2^t − (t + 1)t/2 − 1. This linear space has dimension 1 for t = 3, explaining the results in parts a) and b). It has a rapidly increasing dimensionality for t > 3, suggesting that the binomial constraints are relatively ineffectual for constraining the joint PMF of a joint distribution. More work is required for the case of t > 3 because of all the inequality constraints, but it turns out that this large dimensionality remains.

Exercise 2.6. Let h(x) be a positive function of a real variable that satisfies h(x + t) = h(x) + h(t) and let h(1) = c.
a) Show that for integer k > 0, h(k) = kc.
b) Show that for integer j > 0, h(1/j) = c/j.
c) Show that for all integer k, j, h(k/j) = ck/j.
d) The above parts show that h(x) is linear in positive rational numbers. For very picky mathematicians, this does not guarantee that h(x) is linear in positive real numbers. Show that if h(x) is also monotonic in x, then h(x) is linear in x > 0.

Exercise 2.7. Assume that a counting process {N(t); t ≥ 0} has the independent and stationary increment properties and, for all t > 0, satisfies

Pr{Ñ(t, t + δ) = 0} = 1 − λδ + o(δ)
Pr{Ñ(t, t + δ) = 1} = λδ + o(δ)
Pr{Ñ(t, t + δ) > 1} = o(δ).
a) Let F0(τ) = Pr{N(τ) = 0} and show that dF0(τ)/dτ = −λF0(τ).
b) Show that X1, the time of the first arrival, is exponential with parameter λ.
c) Let Fn(τ) = Pr{Ñ(t, t + τ) = 0 | S_{n−1} = t} and show that dFn(τ)/dτ = −λFn(τ).
d) Argue that Xn is exponential with parameter λ and independent of earlier arrival times.
Exercise 2.8. Let t > 0 be an arbitrary time, let Z1 be the duration of the interval from t until the next arrival after t. Let Zm, for each m > 1, be the interarrival time from the epoch of the (m−1)st arrival after t until the mth arrival.
a) Given that N(t) = n, explain why Zm = X_{m+n} for m > 1 and Z1 = X_{n+1} − t + Sn.
b) Conditional on N(t) = n and Sn = τ, show that Z1, Z2, . . . are IID.
c) Show that Z1, Z2, . . . are IID.

Exercise 2.9. Consider a “shrinking Bernoulli” approximation Nδ(mδ) = Y1 + · · · + Ym to a Poisson process as described in Subsection 2.2.5.
a) Show that

Pr{Nδ(mδ) = n} = \binom{m}{n} (λδ)^n (1 − λδ)^{m−n}.

b) Let t = mδ, and let t be fixed for the remainder of the exercise. Explain why

\lim_{δ→0} Pr{Nδ(t) = n} = \lim_{m→∞} \binom{m}{n} \left(\frac{λt}{m}\right)^n \left(1 − \frac{λt}{m}\right)^{m−n},
where the limit on the left is taken over values of δ that divide t.
c) Derive the following two equalities:

\lim_{m→∞} \binom{m}{n} \frac{1}{m^n} = \frac{1}{n!};    and    \lim_{m→∞} \left(1 − \frac{λt}{m}\right)^{m−n} = e^{−λt}.
d) Conclude from this that for every t and every n, \lim_{δ→0} Pr{Nδ(t)=n} = Pr{N(t)=n} where {N(t); t ≥ 0} is a Poisson process of rate λ.

Exercise 2.10. Let {N(t); t ≥ 0} be a Poisson process of rate λ.
a) Find the joint probability mass function (PMF) of N(t), N(t + s) for s > 0.
b) Find E[N(t) · N(t + s)] for s > 0.
c) Find E[Ñ(t1, t3) · Ñ(t2, t4)] where Ñ(t, τ) is the number of arrivals in (t, τ] and t1 < t2 < t3 < t4.

Exercise 2.11. An experiment is independently performed N times where N is a Poisson rv of mean λ. Let {a1, a2, . . . , aK} be the set of sample points of the experiment and let pk, 1 ≤ k ≤ K, denote the probability of ak.
a) Let Ni denote the number of experiments performed for which the output is ai. Find the PMF for Ni (1 ≤ i ≤ K). (Hint: no calculation is necessary.)
b) Find the PMF for N1 + N2.
c) Find the conditional PMF for N1 given that N = n.
d) Find the conditional PMF for N1 + N2 given that N = n.
e) Find the conditional PMF for N given that N1 = n1.

Exercise 2.12. Starting from time 0, northbound buses arrive at 77 Mass. Avenue according to a Poisson process of rate λ. Passengers arrive according to an independent Poisson process of rate µ. When a bus arrives, all waiting customers instantly enter the bus and subsequent customers wait for the next bus.
a) Find the PMF for the number of customers entering a bus (more specifically, for any given m, find the PMF for the number of customers entering the mth bus).
b) Find the PMF for the number of customers entering the mth bus given that the interarrival interval between bus m − 1 and bus m is x.
c) Given that a bus arrives at time 10:30 PM, find the PMF for the number of customers entering the next bus.
d) Given that a bus arrives at 10:30 PM and no bus arrives between 10:30 and 11, find the PMF for the number of customers on the next bus.
e) Find the PMF for the number of customers waiting at some given time, say 2:30 PM (assume that the processes started infinitely far in the past). Hint: think of what happens moving backward in time from 2:30 PM. f) Find the PMF for the number of customers getting on the next bus to arrive after 2:30. (Hint: this is different from part (a); look carefully at part e). g) Given that I arrive to wait for a bus at 2:30 PM, find the PMF for the number of customers getting on the next bus.

Exercise 2.13. a) Show that the arrival epochs of a Poisson process satisfy

f_{S^{(n)}|S_{n+1}}(s^{(n)} | s_{n+1}) = n!/s_{n+1}^n.

Hint: This is easy if you use only the results of Section 2.2.2. b) Contrast this with the result of Theorem 2.6.

Exercise 2.14. Equation (2.44) gives fSi(x | N(t)=n), which is the density of random variable Si conditional on N(t) = n for n ≥ i. Multiply this expression by Pr{N(t) = n} and sum over n to find fSi(x); verify that your answer is indeed the Erlang density.

Exercise 2.15. Consider generalizing the bulk arrival process in Figure 2.5. Assume that the epochs at which arrivals occur form a Poisson process {N(t); t ≥ 0} of rate λ. At each arrival epoch, Sn, the number of arrivals, Zn, satisfies Pr{Zn = 1} = p, Pr{Zn = 2} = 1 − p. The variables Zn are IID. a) Let {N1(t); t ≥ 0} be the counting process of the epochs at which single arrivals occur. Find the PMF of N1(t) as a function of t. Similarly, let {N2(t); t ≥ 0} be the counting process of the epochs at which double arrivals occur. Find the PMF of N2(t) as a function of t. b) Let {NB(t); t ≥ 0} be the counting process of the total number of arrivals. Give an expression for the PMF of NB(t) as a function of t.

Exercise 2.16. a) For a Poisson counting process of rate λ, find the joint probability density of S1, S2, . . . , Sn−1 conditional on Sn = t. b) Find Pr{X1 > τ | Sn=t}. c) Find Pr{Xi > τ | Sn=t} for 1 ≤ i ≤ n. d) Find the density fSi(x|Sn=t) for 1 ≤ i ≤ n − 1. e) Give an explanation for the striking similarity between the condition N(t) = n − 1 and the condition Sn = t.
Exercise 2.17. a) For a Poisson process of rate λ, find Pr{N(t)=n | S1=τ} for t > τ and n ≥ 1. b) Using this, find fS1(τ | N(t)=n). c) Check your answer against (2.37).

Exercise 2.18. Consider a counting process in which the rate is a rv Λ with probability density fΛ(λ) = αe^{−αλ} for λ > 0. Conditional on a given sample value λ for the rate, the counting process is a Poisson process of rate λ (i.e., nature first chooses a sample value λ and then generates a sample function of a Poisson process of that rate λ). a) What is Pr{N(t)=n | Λ=λ}, where N(t) is the number of arrivals in the interval (0, t] for some given t > 0? b) Show that Pr{N(t)=n}, the unconditional PMF for N(t), is given by

Pr{N(t)=n} = α t^n / (t + α)^{n+1}.
c) Find fΛ(λ | N(t)=n), the density of λ conditional on N(t)=n. d) Find E[Λ | N(t)=n] and interpret your result for very small t with n = 0 and for very large t with n large. e) Find E[Λ | N(t)=n, S1, S2, . . . , Sn]. (Hint: consider the distribution of S1, . . . , Sn conditional on N(t) and Λ). Find E[Λ | N(t)=n, N(τ)=m] for some τ < t.

Exercise 2.19. a) Use Equation (2.44) to find E[Si | N(t)=n]. Hint: When you integrate x fSi(x | N(t)=n), compare this integral with fSi+1(x | N(t)=n + 1) and use the fact that the latter expression is a probability density. b) Find the second moment and the variance of Si conditional on N(t)=n. Hint: Extend the previous hint. c) Assume that n is odd, and consider i = (n + 1)/2. What is the relationship between Si, conditional on N(t)=n, and the sample median of n IID uniform random variables? d) Give a weak law of large numbers for the above median.

Exercise 2.20. Suppose cars enter a one-way, infinite length, infinite lane highway at a Poisson rate λ. The ith car to enter chooses a velocity Vi and travels at this velocity. Assume that the Vi's are independent positive rv's having a common distribution F. Derive the distribution of the number of cars that are located in an interval (0, a) at time t.

Exercise 2.21. Consider an M/G/1 queue, i.e., a queue with Poisson arrivals of rate λ in which each arrival i, independent of other arrivals, remains in the system for a time Xi, where {Xi; i ≥ 1} is a set of IID rv's with some given distribution function F(x).
You may assume that the number of arrivals in any interval (t, t + ε) that are still in the system at some later time τ ≥ t + ε is statistically independent of the number of arrivals in that same interval (t, t + ε) that have departed from the system by time τ. a) Let N(τ) be the number of customers in the system at time τ. Find the mean, m(τ), of N(τ) and find Pr{N(τ) = n}. b) Let D(τ) be the number of customers that have departed from the system by time τ. Find the mean, E[D(τ)], and find Pr{D(τ) = d}. c) Find Pr{N(τ) = n, D(τ) = d}. d) Let A(τ) be the total number of arrivals up to time τ. Find Pr{N(τ) = n | A(τ) = a}. e) Find Pr{D(τ + ε) − D(τ) = d}.

Exercise 2.22. The voters in a given town arrive at the place of voting according to a Poisson process of rate λ = 100 voters per hour. The voters independently vote for candidate A and candidate B each with probability 1/2. Assume that the voting starts at time 0 and continues indefinitely. a) Conditional on 1000 voters arriving during the first 10 hours of voting, find the probability that candidate A receives n of those votes. b) Again conditional on 1000 voters during the first 10 hours, find the probability that candidate A receives n votes in the first 4 hours of voting. c) Let T be the epoch of the arrival of the first voter voting for candidate A. Find the density of T. d) Find the PMF of the number of voters for candidate B who arrive before the first voter for A. e) Define the nth voter as a reversal if the nth voter votes for a different candidate than the (n − 1)st. For example, in the sequence of votes AABAABB, the third, fourth, and sixth voters are reversals; the third and sixth are A to B reversals and the fourth is a B to A reversal. Let N(t) be the number of reversals up to time t (t in hours). Is {N(t); t ≥ 0} a Poisson process? Explain. f) Find the expected time (in hours) between reversals. g) Find the probability density of the time between reversals. h) Find the density of the time from one A to B reversal to the next A to B reversal.

Exercise 2.23. Let {N1(t); t ≥ 0} be a Poisson counting process of rate λ. Assume that the arrivals from this process are switched on and off by arrivals from a second independent Poisson process {N2(t); t ≥ 0} of rate γ. Let {NA(t); t≥0} be the switched process; that is, NA(t) includes the arrivals from {N1(t); t ≥ 0} during periods when N2(t) is even and excludes the arrivals from {N1(t); t ≥ 0} while N2(t) is odd.
[Figure for Exercise 2.23: arrivals of rate λ from N1(t) pass through a switch whose On periods are determined by the process N2(t) of rate γ; the output of the switch is the process NA(t).]
a) Find the PMF for the number of arrivals of the first process, {N1(t); t ≥ 0}, during the nth period when the switch is on. b) Given that the first arrival for the second process occurs at epoch τ, find the conditional PMF for the number of arrivals of the first process up to τ. c) Given that the number of arrivals of the first process, up to the first arrival for the second process, is n, find the density for the epoch of the first arrival from the second process. d) Find the density of the interarrival time for {NA(t); t ≥ 0}. Note: This part is quite messy and is done most easily via Laplace transforms.

Exercise 2.24. Let us model the chess tournament between Fisher and Spassky as a stochastic process. Let Xi, for i ≥ 1, be the duration of the ith game and assume that {Xi; i≥1} is a set of IID exponentially distributed rv's each with density f(x) = λe^{−λx}. Suppose that each game (independently of all other games, and independently of the length of the games) is won by Fisher with probability p, by Spassky with probability q, and is a draw with probability 1 − p − q. The first player to win n games is defined to be the winner, but we consider the match up to the point of winning as being embedded in an unending sequence of games. a) Find the distribution of time, from the beginning of the match, until the completion of the first game that is won (i.e., that is not a draw). Characterize the process of the number {N(t); t ≥ 0} of games won up to and including time t. Characterize the process of the number {NF(t); t ≥ 0} of games won by Fisher and the number {NS(t); t ≥ 0} won by Spassky. b) For the remainder of the problem, assume that the probability of a draw is zero; i.e., that p + q = 1. How many of the first 2n − 1 games must be won by Fisher in order to win the match? c) What is the probability that Fisher wins the match? Your answer should not involve any integrals. Hint: consider the unending sequence of games and use part b. d) Let T be the epoch at which the match is completed (i.e., either Fisher or Spassky wins). Find the distribution function of T. e) Find the probability that Fisher wins and that T lies in the interval (t, t+δ) for arbitrarily small δ.

Exercise 2.25. Using (2.43), find the conditional density of Si+1, conditional on N(t) = n and Si = si, and use this to find the joint density of S1, . . . , Sn conditional on N(t) = n. Verify that your answer agrees with (2.34).
Exercise 2.26. A two-dimensional Poisson process is a process of randomly occurring special points in the plane such that (i) for any region of area A the number of special points in that region has a Poisson distribution with mean λA, and (ii) the number of special points in nonoverlapping regions is independent. For such a process consider an arbitrary location in the plane and let X denote its distance from its nearest special point (where distance is measured in the usual Euclidean manner). Show that
a) Pr{X > t} = exp(−λπt²)
b) E[X] = 1/(2√λ).

Exercise 2.27. This problem is intended to show that one can analyze the long term behavior of queueing problems by using just notions of means and variances, but that such analysis is awkward, justifying understanding the strong law of large numbers. Consider an M/G/1 queue. The arrival process is Poisson with λ = 1. The expected service time, E[Y], is 1/2 and the variance of the service time is given to be 1. a) Consider Sn, the time of the nth arrival, for n = 10^12. With high probability, Sn will lie within 3 standard deviations of its mean. Find and compare this mean and the 3σ range. b) Let Vn be the total amount of time during which the server is busy with these n arrivals (i.e., the sum of 10^12 service times). Find the mean and 3σ range of Vn. c) Find the mean and 3σ range of In, the total amount of time the server is idle up until Sn (take In as Sn − Vn, thus ignoring any service time after Sn). d) An idle period starts when the server completes a service and there are no waiting arrivals; it ends on the next arrival. Find the mean and variance of an idle period. Are successive idle periods IID? e) Combine (c) and (d) to estimate the total number of idle periods up to time Sn. Use this to estimate the total number of busy periods. f) Combine (e) and (b) to estimate the expected length of a busy period.

Exercise 2.28. The purpose of this problem is to illustrate that for an arrival process with independent but not identically distributed interarrival intervals, X1, X2, . . . , the number of arrivals N(t) in the interval (0, t] can be a defective rv. In other words, the ‘counting process’ is not a stochastic process according to our definitions. This illustrates that it is necessary to prove that the counting rv's for a renewal process are actually rv's. a) Let the distribution function of the ith interarrival interval for an arrival process be FXi(xi) = 1 − exp(−α^i xi) for some fixed α ∈ (0, 1). Let Sn = X1 + · · · + Xn and show that

E[Sn] = (1 − α^{n−1})/(1 − α).
b) Sketch a ‘reasonable’ sample function for N(t). c) Find σ²_{Sn}.
d) Use the Chebyshev inequality on Pr{Sn ≥ t} to find an upper bound on Pr{N (t) ≤ n} that is smaller than 1 for all n and for large enough t. Use this to show that N (t) is defective for large enough t.
Chapter 3
RENEWAL PROCESSES

3.1 Introduction
Recall that a renewal process is an arrival process in which the interarrival intervals are positive,1 independent and identically distributed (IID) random variables (rv's). Renewal processes (since they are arrival processes) can be specified in three standard ways, first, by the joint distributions of the arrival epochs, second, by the joint distributions of the interarrival times, and third, by the joint distributions of the counting process {N(t); t ≥ 0} in which N(t) represents the number of arrivals to a system in the interval (0, t]. The simplest characterization is through the interarrival times, since they are IID. Perhaps the most useful characterization, however, is through the counting process.

These processes are called renewal processes because the process probabilistically starts over at each arrival epoch, Sn = X1 + · · · + Xn. That is, if the nth arrival occurs at Sn = τ, then, counting from Sn = τ, the jth subsequent arrival epoch is at Sn+j − Sn = Xn+1 + · · · + Xn+j. Thus, given Sn = τ, {N(τ + t) − N(τ); t ≥ 0} is a renewal counting process with IID interarrival intervals of the same distribution as the original renewal process. Because of this renewal property, we shall usually refer to arrivals as renewals.

The major reason for studying renewal processes is that many complicated processes have randomly occurring instants at which the system returns to a state probabilistically equivalent to the starting state. These embedded renewal epochs allow us to separate the long term behavior of the process (which can be studied through renewal theory) from the behavior of the actual process within a renewal period.

Example 3.1.1. Consider a G/G/m queue. The arrival counting process to a G/G/m queue is a renewal counting process, {N(t); t ≥ 0}. Each arriving customer waits in the queue until one of m identical servers is free to serve it. The service time required by each
1
Renewal processes are often defined in a slightly more general way, allowing the interarrival intervals Xi to include the possibility 1 > Pr{Xi = 0} > 0. All of the theorems in this chapter are valid under this more general assumption, as can be verified by complicating the proofs somewhat. Allowing Pr{Xi = 0} > 0 allows multiple arrivals at the same instant, which makes it necessary to allow N (0) to take on positive values, and appears to inhibit intuition about renewals. Exercise 3.2 shows how to view these more general renewal processes while using the definition here, thus showing that the added generality is not worth much.
customer is a rv, IID over customers, and independent of arrival times and server. We define a new renewal counting process, {N′(t); t ≥ 0}, for which the renewal epochs are those epochs in the original process {N(t); t ≥ 0} at which an arriving customer sees an empty system (i.e., no customer in queue and none in service). To make the time origin probabilistically identical to the renewal epochs in this new renewal process, we consider time 0 as an arrival epoch that starts a busy period. There are two renewal counting processes, {N(t); t ≥ 0} and {N′(t); t ≥ 0}, in this example, and the renewals in the second process are those particular arrivals in the first process that arrive to an empty system.

Throughout our study of renewal processes, we use X̄ and E[X] interchangeably to denote the mean inter-renewal interval, and use σ² to denote the variance of the inter-renewal interval. We will usually assume that X̄ is finite, but, except where explicitly stated, we need not assume that σ² is finite. This means, first, that this variance need not be calculated (which is often difficult if renewals are embedded into a more complex process), and second, that the results are relatively robust to modeling errors on the far tails of the inter-renewal distribution.

Much of this chapter will be devoted to understanding the behavior of N(t)/t and N(t) as t becomes large. N(t)/t is the time-average renewal rate over the interval (0, t]. It is a random variable (i.e., it is not defective) for each value of t (see Exercise 3.1). One of the major results about renewal theory, which we establish shortly, is that, with probability 1, this family of random variables, {N(t)/t; t > 0}, has a limiting value, lim_{t→∞} N(t)/t, equal to 1/X̄. This result is called the strong law of large numbers for renewal processes. We shall often refer to it by the less precise statement that the time-average renewal rate is 1/X̄. This result is an analog (and direct consequence) of the strong law of large numbers, Theorem 1.5. Another important result is the elementary renewal theorem, which states that E[N(t)/t] also approaches 1/X̄ as t → ∞. It seems surprising that this does not follow from the strong law for renewal processes, but in fact it doesn't, and we shall develop several widely useful results in establishing this theorem. The final major result is Blackwell's theorem, which shows that, for appropriate values of δ, the expected number of renewals in an interval (t, t + δ] approaches δ/X̄ as t → ∞. We shall thus interpret 1/X̄ as an ensemble-average renewal rate. This rate is the same as the above time-average renewal rate. We shall see the benefits of being able to work with both time-averages and ensemble-averages.
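The following minimal Python sketch (an added illustration, not part of the text) simulates a single sample path of a renewal counting process and computes the time-average renewal rate N(t)/t. The Uniform(0, 2) inter-renewal distribution and all numerical values are assumptions made only for the example, chosen so that X̄ = 1.

import numpy as np

rng = np.random.default_rng(0)

def renewal_count(t_end, draw_interarrival, n_guess=10**6):
    # N(t_end) for one sample path: the number of arrival epochs
    # S_n = X_1 + ... + X_n that fall in (0, t_end].
    arrivals = np.cumsum(draw_interarrival(n_guess))
    assert arrivals[-1] > t_end      # n_guess assumed large enough to pass t_end
    return np.searchsorted(arrivals, t_end, side='right')

# Hypothetical inter-renewal distribution: Uniform(0, 2), so E[X] = 1.
draw = lambda n: rng.uniform(0.0, 2.0, size=n)

for t in (10, 100, 1000, 10000):
    n_t = renewal_count(t, draw)
    print(f"t = {t:6d}   N(t)/t = {n_t / t:.4f}   (1/E[X] = 1.0)")

As t grows, the printed ratios settle near 1/X̄, which is the content of the strong law for renewal processes stated in the next section.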
3.2 Strong Law of Large Numbers for renewal processes
To get an intuitive idea why N(t)/t should approach 1/X̄ for large t, note that Sn/n is the sample average of n inter-renewal intervals. From the strong law of large numbers, we know that Sn/n approaches X̄ with probability 1 as n → ∞. From Figure 3.1, observe that N(Sn), the number of renewals at the epoch of the nth renewal, is n, and thus N(Sn)/Sn is n/Sn. This is the reciprocal of the sample average of n inter-renewal intervals. Since Sn/n → X̄ as n → ∞, we should expect n/Sn to approach2 1/X̄, and thus we should
Note that the fact that E[Sn/n] = X̄ does not imply that E[n/Sn] = 1/X̄. We really need the stronger result that Sn/n → X̄ with probability 1.
hypothesize that N(Sn)/Sn approaches 1/X̄ as n → ∞, and thus that N(t)/t approaches 1/X̄. To make this precise, we need the following lemma. Essentially, it says that the first renewal occurs eventually, and after that, the second eventually occurs, and so forth.
Figure 3.1: Comparison of n/Sn with N(t)/t for N(t) = n.

Lemma 3.1. Let {N(t); t ≥ 0} be a renewal counting process with inter-renewal rv's {Xn; n ≥ 1}. Then (whether or not E[Xn] is finite), lim_{t→∞} N(t) = ∞ with probability 1 and lim_{t→∞} E[N(t)] = ∞.

Proof: Note that for each sample point ω ∈ Ω, N(t, ω) is a nondecreasing function of t and thus either has a finite limit or an infinite limit. The probability that this limit is finite with value less than n is

lim_{t→∞} Pr{N(t) < n} = 1 − lim_{t→∞} Pr{N(t) ≥ n} = 1 − lim_{t→∞} Pr{Sn ≤ t}.
Since the Xi are rv's, the sums Sn are also rv's (i.e., nondefective) for each n (see Section 1.3.7), and thus lim_{t→∞} Pr{Sn ≤ t} = 1 for each n. Thus lim_{t→∞} Pr{N(t) < n} = 0 for each n. This shows that the set of sample points ω for which lim_{t→∞} N(t) < ∞ has probability 0. Thus lim_{t→∞} N(t) = ∞ with probability 1.

Next, E[N(t)] is nondecreasing in t, and thus has either a finite or infinite limit as t → ∞. For each n, Pr{N(t) ≥ n} ≥ 1/2 for large enough t, and therefore E[N(t)] ≥ n/2 for all such t. Thus E[N(t)] can have no finite limit, and lim_{t→∞} E[N(t)] = ∞.

For any given t > 0, the random variable N(t) is the number of renewal epochs in the interval (0, t]. The random variable S_{N(t)} is then the epoch at which renewal N(t) occurs, i.e., the latest renewal epoch before or equal to time t. Similarly S_{N(t)+1} is the first arrival epoch after time t (see Figure 3.2). Thus we have the inequalities

S_{N(t)}/N(t) ≤ t/N(t) < S_{N(t)+1}/N(t).     (3.1)
From lemma 3.1, lim_{t→∞} N(t) = ∞ with probability 1. Assuming that X̄ < ∞, the strong law of large numbers (Theorem 1.5) asserts that lim_{n→∞} Sn/n = X̄ with probability 1. For any sample function (i.e., sample point ω), S_{N(t)}(ω)/N(t, ω) runs through the same sequence of values with increasing t as Sn(ω)/n runs through with increasing n. Thus letting Ω0 be the set of sample points ω for which both lim_{n→∞} Sn/n = X̄ and lim_{t→∞} N(t) = ∞, we
have lim_{t→∞} S_{N(t)}/N(t) = X̄ for all sample points in Ω0. In the same way,

lim_{t→∞} S_{N(t)+1}/N(t) = lim_{t→∞} [S_{N(t)+1}/(N(t)+1)] · [(N(t)+1)/N(t)] = X̄     for all ω ∈ Ω0.     (3.2)

Figure 3.2: Comparison of N(t)/t with N(t)/S_{N(t)} and N(t)/S_{N(t)+1}.
Since t/N(t) in (3.1) lies between two random variables both converging to X̄ for all sample points in Ω0, we see that lim_{t→∞} t/N(t) = X̄ for all sample functions in Ω0, i.e., with probability 1. Since X̄ must be greater than 0, it follows that lim_{t→∞} N(t)/t = 1/X̄ for all sample points in Ω0. This proves the following strong law for renewal processes.

Theorem 3.1 (Strong Law for Renewal Processes). For a renewal process with mean inter-renewal interval X̄, lim_{t→∞} N(t)/t = 1/X̄ with probability 1.

This theorem is also true if the mean inter-renewal interval is infinite; this can be seen by a truncation argument (see Exercise 3.3). We could also prove a weak law for N(t) (i.e., we could show that for any ε > 0, lim_{t→∞} Pr{|N(t)/t − 1/X̄| ≥ ε} = 0). This could be done by using the weak law of large numbers for Sn (Theorem 1.3) and the fact that the event Sn ≤ t is the same as N(t) ≥ n. Such a derivation is tedious, however, and illustrates that the strong law of large numbers is often much easier to work with than the weak law. We shall not derive the weak law here, since the strong law for renewal processes implies the weak law and it is the strong law that is most often useful.

Figure 3.3 helps give some appreciation of what the strong law for N(t) says and doesn't say. The strong law deals with time-averages, lim_{t→∞} N(t, ω)/t, for individual sample points ω; these are indicated in the figure as horizontal averages, one for each ω. It is also of interest to look at time and ensemble-averages, E[N(t)/t], shown in the figure as vertical averages. N(t, ω)/t is the time-average number of renewals from 0 to t, and E[N(t)/t] averages also over the ensemble. Finally, to focus on arrivals in the vicinity of a particular time t, it is of interest to look at the ensemble-average E[N(t + δ) − N(t)]/δ.

Given the strong law for N(t), one would hypothesize that E[N(t)/t] approaches 1/X̄ as t → ∞. One might also hypothesize that lim_{t→∞} E[N(t + δ) − N(t)]/δ = 1/X̄, subject to some minor restrictions on δ. These hypotheses are correct and are discussed in detail in what follows. This equality of time-averages and limiting ensemble-averages for renewal processes carries over to a large number of stochastic processes, and forms the basis of ergodic theory. These results are important for both theoretical and practical purposes. It is sometimes easy to find time averages (just like it was easy to find the time-average
N(t, ω)/t from the strong law of large numbers), and it is sometimes easy to find limiting ensemble-averages. Being able to equate the two then allows us to alternate at will between time and ensemble-averages.
Figure 3.3: Time average at a sample point, time and ensemble average from 0 to a given τ , and the ensemble-average in an interval (t, t + δ].
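As a companion to Figure 3.3, the sketch below (added here, not from the text) contrasts the two kinds of averages: the "horizontal" time-average N(t, ω)/t computed along individual sample paths, and the "vertical" ensemble average E[N(t)]/t estimated across paths at a fixed t. The exponential inter-renewal distribution and the parameter values are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(1)
mean_x, t_end, n_paths = 2.0, 5000.0, 200     # assumed parameters

def count_renewals(t_end):
    # N(t_end) for one path with exponential (mean 2) inter-renewals.
    total, n = 0.0, 0
    while True:
        total += rng.exponential(mean_x)
        if total > t_end:
            return n
        n += 1

counts = np.array([count_renewals(t_end) for _ in range(n_paths)])
time_avgs = counts / t_end                    # one "horizontal" average per sample path
print("a few time-averages N(t,w)/t :", np.round(time_avgs[:5], 4))
print("ensemble average E[N(t)]/t   :", round(time_avgs.mean(), 4))
print("1/E[X]                       :", 1 / mean_x)

For a renewal process both numbers agree with 1/X̄ for large t; the next example shows a process for which the two kinds of averages disagree.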
Note that in order to equate time-averages and limiting ensemble-averages, quite a few conditions are required. First, the time average must exist in the limit t → ∞ with probability one and have a fixed value with probability one; second, the ensemble-average must approach a limit as t → ∞; and third, the limits must be the same. The following example, for a stochastic process very different from a renewal process, shows that equality between time and ensemble averages is not always satisfied for arbitrary processes.

Example 3.2.1. Let {Xi; i ≥ 1} be a sequence of binary IID random variables, each taking the value 0 with probability 1/2 and 2 with probability 1/2. Let {Mn; n ≥ 1} be the product process in which Mn = X1 X2 · · · Xn. Since Mn = 2^n if X1 to Xn each take the value 2 (an event of probability 2^{−n}) and Mn = 0 otherwise, we see that lim_{n→∞} Mn = 0 with probability 1. Also E[Mn] = 1 for all n ≥ 1. Thus the time-average exists and equals 0 with probability 1 and the ensemble-average exists and equals 1 for all n, but the two are different. The problem is that as n increases, the atypical event in which Mn = 2^n has a probability approaching 0, but still has a significant effect on the ensemble-average.

Before establishing the results about ensemble-averages, we state and briefly discuss the central limit theorem for renewal processes.

Theorem 3.2 (Central Limit Theorem for N(t)). Assume that the inter-renewal intervals for a renewal counting process {N(t); t ≥ 0} have finite standard deviation σ > 0. Then

lim_{t→∞} Pr{ [N(t) − t/X̄] / (σ X̄^{−3/2} √t) < α } = Φ(α),     (3.3)

where Φ(y) = ∫_{−∞}^{y} (1/√(2π)) exp(−x²/2) dx.
Figure 3.4: Illustration of the central limit theorem for renewal processes. A given integer n is shown on the vertical axis, and the corresponding mean, E[Sn] = nX̄, is shown on the horizontal axis. The horizontal line with arrows at height n indicates α standard deviations from E[Sn], and the vertical line with arrows indicates the distance below t/X̄.

This says that N(t) tends to Gaussian with mean t/X̄ and standard deviation σ X̄^{−3/2} √t.
The theorem can be proved by applying the central limit theorem (CLT) for a sum of IID rv's, (1.56), to Sn and then using the identity {Sn ≤ t} = {N(t) ≥ n}. The general idea is illustrated in Figure 3.4, but the details are somewhat tedious, and can be found, for example, in [16]. We simply outline the argument here. For any real α, the CLT states that

Pr{Sn ≤ nX̄ + α√n σ} ≈ Φ(α),

where Φ(α) = ∫_{−∞}^{α} (1/√(2π)) exp(−x²/2) dx and where the approximation becomes exact in the limit n → ∞. Letting

t = nX̄ + α√n σ,

and using {Sn ≤ t} = {N(t) ≥ n},

Pr{N(t) ≥ n} ≈ Φ(α).     (3.4)

Since t is monotonic in n for fixed α, we can express n in terms of t, getting

n = t/X̄ − ασ√n/X̄ ≈ t/X̄ − ασ t^{1/2} (X̄)^{−3/2}.

Substituting this into (3.4) establishes the theorem for −α, which establishes the theorem since α is arbitrary. The omitted details involve handling the approximations carefully.
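A quick numerical check of Theorem 3.2 (added illustration, not from the text): the sketch below standardizes simulated values of N(t) by the mean t/X̄ and standard deviation σX̄^{−3/2}√t from (3.3) and compares the empirical distribution with Φ. The Uniform(0, 2) interarrival distribution and the values of t and the trial count are assumptions.

import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(2)

# Assumed inter-renewal distribution: Uniform(0, 2); X_bar = 1, sigma^2 = 1/3.
x_bar, sigma = 1.0, sqrt(1.0 / 3.0)
t, trials = 2000.0, 4000

def N_of_t(t):
    s, n = 0.0, 0
    while True:
        s += rng.uniform(0.0, 2.0)
        if s > t:
            return n
        n += 1

samples = np.array([N_of_t(t) for _ in range(trials)])
z = (samples - t / x_bar) / (sigma * x_bar ** (-1.5) * sqrt(t))

Phi = lambda a: 0.5 * (1 + erf(a / sqrt(2)))
for alpha in (-1.0, 0.0, 1.0):
    print(f"alpha = {alpha:+.1f}   empirical {np.mean(z < alpha):.3f}   Phi(alpha) {Phi(alpha):.3f}")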
3.3 Expected number of renewals
Let E[N(t)] be denoted by m(t) in what follows. We first find an exact expression for m(t). This is often quite messy for large t, so we then find the asymptotic behavior of m(t). Since N(t)/t approaches 1/X̄ with probability 1, we expect m(t) to grow with a
slope that asymptotically approaches 1/X̄, but we will find that this is not quite true in general. Two somewhat weaker results, however, are true. The first, called the elementary renewal theorem (Theorem 3.4), states that lim_{t→∞} m(t)/t = 1/X̄. The second result, called Blackwell's theorem (Theorem 3.5), states that, subject to some limitations on δ > 0, lim_{t→∞} [m(t + δ) − m(t)] = δ/X̄. This says essentially that the expected renewal rate approaches steady-state as t → ∞. We will find a large number of applications of Blackwell's theorem throughout the remainder of the text.

The exact calculation of m(t) makes use of the fact that the expectation of a non-negative random variable is the integral of its complementary distribution function,

m(t) = E[N(t)] = Σ_{n=1}^{∞} Pr{N(t) ≥ n}.

Since the event N(t) ≥ n is the same as Sn ≤ t, m(t) is expressed in terms of the distribution functions of Sn, n ≥ 1, as follows.

m(t) = Σ_{n=1}^{∞} Pr{Sn ≤ t}.     (3.5)
Although this expression looks fairly simple, it becomes increasingly complex with increasing t. As t increases, there is an increasing set of values of n for which Pr{Sn ≤ t} is significant, and Pr{Sn ≤ t} itself is not that easy to calculate if the interarrival distribution FX(x) is complicated. The main utility of (3.5) comes from the fact that it leads to an integral equation for m(t). Since Sn = Sn−1 + Xn for each n ≥ 1 (interpreting S0 as 0), and since Xn and Sn−1 are independent, we can use the convolution equation (1.11) to get

Pr{Sn ≤ t} = ∫_{x=0}^{t} Pr{Sn−1 ≤ t − x} dFX(x)     for n ≥ 2.

Substituting this in (3.5) for n ≥ 2 and using the fact that Pr{S1 ≤ t} = FX(t), we can interchange the order of integration and summation to get

m(t) = FX(t) + ∫_{x=0}^{t} Σ_{n=2}^{∞} Pr{Sn−1 ≤ t − x} dFX(x)
     = FX(t) + ∫_{x=0}^{t} Σ_{n=1}^{∞} Pr{Sn ≤ t − x} dFX(x)
     = FX(t) + ∫_{x=0}^{t} m(t − x) dFX(x);     t ≥ 0.     (3.6)
An alternative derivation is given in Exercise 3.6. This integral equation is called the renewal equation.
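The renewal equation (3.6) can also be attacked numerically. The sketch below (an added illustration, not part of the text) discretizes (3.6) on a grid of spacing h and solves for m(t) recursively; the exponential example, for which m(t) = λt exactly, is included only as a sanity check, and the grid spacing is an arbitrary assumption.

import numpy as np

def renewal_function(F, t_end, h=0.01):
    # Crude numerical solution of m(t) = F(t) + integral_0^t m(t-x) dF(x):
    # on the grid t = kh, approximate dF over ((j-1)h, jh] by F(jh) - F((j-1)h).
    n = int(round(t_end / h)) + 1
    grid = np.arange(n) * h
    F_vals = F(grid)
    dF = np.diff(F_vals, prepend=0.0)
    m = np.zeros(n)
    for k in range(1, n):
        # m(kh) ~ F(kh) + sum_{j=1..k} m((k-j)h) * [F(jh) - F((j-1)h)]
        m[k] = F_vals[k] + np.dot(m[k - 1::-1][:k], dF[1:k + 1])
    return grid, m

# Assumed sanity check: exponential interarrivals of rate 1, where m(t) = t exactly.
F_exp = lambda x: 1.0 - np.exp(-x)
grid, m = renewal_function(F_exp, t_end=5.0)
print(np.round(m[::100], 3))      # roughly 0, 1, 2, 3, 4, 5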
3.3.1 Laplace transform approach
If we assume that X has a density fX(x), and that this density has a Laplace transform LX(s) = ∫_0^∞ fX(x) e^{−sx} dx, we can take the Laplace transform of both sides of (3.6). Note
that the final term in (3.6) is the convolution of m with fX, so that the Laplace transform of m(t) satisfies Lm(s) = LX(s)/s + Lm(s) LX(s). Solving for Lm(s),

Lm(s) = LX(s) / (s[1 − LX(s)]).     (3.7)
Example 3.3.1. As a simple example of how this can be used to calculate m(t), suppose fX(x) = (1/2)e^{−x} + e^{−2x} for x ≥ 0. The Laplace transform is given by

LX(s) = 1/(2(s + 1)) + 1/(s + 2) = [(3/2)s + 2] / [(s + 1)(s + 2)].
Substituting this into (3.7) yields

Lm(s) = [(3/2)s + 2] / [s²(s + 3/2)] = 4/(3s²) + 1/(9s) − 1/(9(s + 3/2)).
Taking the inverse Laplace transform, we then have

m(t) = 4t/3 + [1 − exp(−(3/2)t)]/9.
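As an added sanity check on Example 3.3.1 (not part of the text), one can estimate m(t) = E[N(t)] by simulation. The density fX(x) = (1/2)e^{−x} + e^{−2x} is an equal mixture of exponentials with rates 1 and 2, which makes sampling easy; the number of trials is an arbitrary assumption.

import numpy as np

rng = np.random.default_rng(3)

def draw_x():
    # f_X(x) = (1/2)e^{-x} + e^{-2x} is an equal mixture of Exp(rate 1) and Exp(rate 2).
    rate = 1.0 if rng.random() < 0.5 else 2.0
    return rng.exponential(1.0 / rate)

def estimate_m(t, trials=20000):
    total = 0
    for _ in range(trials):
        s, n = 0.0, 0
        while True:
            s += draw_x()
            if s > t:
                break
            n += 1
        total += n
    return total / trials

m_exact = lambda t: 4 * t / 3 + (1 - np.exp(-1.5 * t)) / 9
for t in (0.5, 1.0, 2.0, 5.0):
    print(f"t = {t}:  simulated m(t) = {estimate_m(t):.3f},  closed form = {m_exact(t):.3f}")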
The procedure in this example can be used for any inter-renewal density fX(x) for which the Laplace transform is a rational function, i.e., a ratio of polynomials. In such cases, Lm(s) will also be a rational function. The Heaviside inversion formula (i.e., factoring the denominator and expressing Lm(s) as a sum of individual poles as done above) can then be used to calculate m(t). In the example above, there was a second order pole at s = 0 leading to the linear term 4t/3 in m(t), there was a first order pole at s = 0 leading to the constant 1/9, and there was a pole at s = −3/2 leading to the exponentially decaying term.

We now show that a second order pole at s = 0 always occurs when LX(s) is a rational function. To see this, note that LX(0) is just the integral of fX(x), which is 1; thus 1 − LX(s) has a zero at s = 0 and Lm(s) has a second order pole at s = 0. To evaluate the residue for this second order pole, we recall that the first and second derivatives of LX(s) at s = 0 are −E[X] and E[X²] respectively. Expanding LX(s) in a power series around s = 0 then yields LX(s) = 1 − sE[X] + (s²/2)E[X²] plus terms of order s³ or higher. This gives us

Lm(s) = [1 − sX̄ + (s²/2)E[X²] + · · · ] / (s²[X̄ − (s/2)E[X²] + · · · ]) = 1/(s²X̄) + (1/s)[E[X²]/(2X̄²) − 1] + · · · .     (3.8)

The remaining terms are the other poles of Lm(s) with their residues. For values of s with ... that In, the decision whether or not to observe Xn, should depend only on X1, . . . , Xn−1.3
For example, poker players do not take kindly to a player who bets on a hand and then withdraws his bet when someone else wins the hand. 4 Stopping times are sometimes called optional stopping times.
3.3.3 Wald's equality
An important question that arises with stopping rules is to evaluate the sum SJ of the random variables up to the stopping time, i.e., SJ = Σ_{n=1}^{J} Xn. Many gambling strategies and investing strategies involve some sort of rule for when to stop, and it is important to understand the rv SJ. Wald's equality is very useful in establishing a very simple way to find E[SJ].

Theorem 3.3 (Wald's Equality). Let {Xn; n ≥ 1} be IID rv's, each of mean X̄. If J is a stopping time for {Xn; n ≥ 1}, E[J] < ∞, and SJ = X1 + X2 + · · · + XJ, then

E[SJ] = X̄ E[J].     (3.10)
Proof: We can express SJ as

SJ = Σ_{n=1}^{∞} Xn In     where In = 1 if J ≥ n and In = 0 if J < n.     (3.11)

By the definition of a stopping rule, In is independent of Xn, Xn+1, . . . , and thus, in particular, independent of Xn. Thus, E[Xn In] = E[Xn] E[In] = X̄ E[In]. We then have

E[SJ] = E[Σ_{n=1}^{∞} Xn In]
      = Σ_{n=1}^{∞} E[Xn In]     (3.12)
      = Σ_{n=1}^{∞} X̄ E[In]
      = X̄ E[J].     (3.13)
The interchange of expectation and infinite sum in (3.12) is obviously valid for a finite sum, and is also valid for the infinite sum if E[J] < ∞, although we do not prove that here. The final step above comes from the observation that E[In] = Pr{In = 1} = Pr{J ≥ n} and, since J is a positive integer rv, E[J] = Σ_{n≥1} Pr{J ≥ n}. One can also obtain the last step by using J = Σ_{n≥1} In (see Exercise 3.4).
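A small simulation of Wald's equality (added here, not from the text). The stopping rule "stop with the first Xn that pushes the partial sum above a threshold" is a stopping time of exactly the kind used for N(t) + 1 below, and the estimate of E[SJ] should match X̄E[J]. The exponential interarrival distribution and the threshold are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(4)
x_bar, threshold, trials = 1.0, 20.0, 50000      # assumed values

S_J = np.zeros(trials)
J = np.zeros(trials)
for i in range(trials):
    s, n = 0.0, 0
    # We keep observing X_n as long as S_{n-1} <= threshold, so the decision to
    # observe X_n depends only on X_1, ..., X_{n-1}; the number J of observed
    # variables is therefore a stopping time.
    while s <= threshold:
        s += rng.exponential(x_bar)
        n += 1
    S_J[i], J[i] = s, n

print("estimate of E[S_J]    :", round(S_J.mean(), 3))
print("X_bar * estimate E[J] :", round(x_bar * J.mean(), 3))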
It can be seen from the proof that the result essentially holds under the weaker conditions that the random variables Xn all have the same mean and that for each n, Xn and In are uncorrelated. In this case, however, added conditions are necessary in order to exchange the sum and the expectation in (3.12). What this result essentially says in terms of gambling is that strategies for when to stop betting are not really effective as far as the mean is concerned. This is one of these strange results that sometimes appear obvious and sometimes appear very surprising, depending on the application.

We next use Wald's equality in evaluating m(t). Consider an experiment in which we observe successive interarrival intervals until the sum first exceeds t. From Figure 3.5, note that S_{N(t)+1} is the epoch of the first arrival after t, and thus N(t) + 1 is the number of intervals observed until the sum exceeds t. We now show that N(t) + 1 is a stopping
time for the interarrival sequence {Xn; n ≥ 1}. Informally, the decision to stop when the sum exceeds t depends only on the interarrival intervals already observed. More formally, N(t) + 1 is a random variable and the associated decision variable In has the value 1 for N(t) + 1 ≥ n, which is equivalent to Sn−1 ≤ t. This is a function only of X1, . . . , Xn−1, and thus independent of Xn, Xn+1, . . . , verifying that N(t) + 1 is a stopping time for X1, X2, . . . .
Figure 3.5: Illustration of Wald's equality, (3.14), applied to N(t) + 1.

Note that N(t) is not a stopping time for X1, X2, . . . . For any given n, observation of X1, . . . , Xn−1 where Sn−1 < t does not specify whether or not N(t) ≥ n. One would have to peek ahead at Xn to verify whether Sn exceeds t. Since N(t) + 1 is a stopping time, however, Wald's equality yields

E[S_{N(t)+1}] = X̄ E[N(t) + 1] = X̄[m(t) + 1]
m(t) = E[S_{N(t)+1}]/X̄ − 1.     (3.14)

Since E[S_{N(t)+1}] ≥ t, we have m(t) ≥ t/X̄ − 1, and

m(t)/t ≥ 1/X̄ − 1/t.     (3.15)

If we had a similar upper bound on E[S_{N(t)+1}] − t, we could easily show that m(t)/t approaches 1/X̄ in the limit t → ∞, but unfortunately E[S_{N(t)+1}] might be larger than t by a surprising amount. The difference S_{N(t)+1} − t is the interval from t to the next arrival and is known as the residual life of the renewal process at t. We shall see subsequently that its expected value, in the limit as t → ∞, is E[X²]/(2E[X]). Substituting this into (3.14), we find the same limiting expression for m(t) as in (3.9) (which was restricted to inter-renewal intervals with a rational Laplace transform). Since E[X²] can be arbitrarily large, and even infinite, this does not show that m(t)/t → 1/X̄.

The reason that the expected residual life can be so large can be seen by an example. Suppose that X is 0 with probability 1 − ε and 1/ε with probability ε, and that ε is very small. Then X̄ = 1, but arrivals occur bunched together with a large gap of 1/ε between successive bunches. Most points t lie in these gaps, and the residual life is large over most of each gap (we discuss this example in more detail later). Fortunately, it turns out that the familiar truncation method allows us to circumvent these problems and prove the following theorem.

Theorem 3.4 (The Elementary Renewal Theorem). Let {N(t); t ≥ 0} be a renewal counting process with mean inter-renewal interval X̄. Then lim_{t→∞} E[N(t)]/t = 1/X̄.
Proof 5: Let X̃i = Xi for Xi ≤ b and let X̃i = b for Xi > b. Since these truncated random variables are IID, they form a related renewal counting process {Ñ(t); t > 0} with m̃(t) = E[Ñ(t)] and S̃n = X̃1 + · · · + X̃n. Since the nth arrival in this truncated process arrives no later than the nth arrival in the original process, Ñ(t) ≥ N(t), so m̃(t) ≥ m(t). Finally, in the truncated process, E[S̃_{Ñ(t)+1}] ≤ t + b. Thus, applying (3.14) to the truncated process,

E[X̃](m(t) + 1) ≤ E[X̃](m̃(t) + 1) = E[S̃_{Ñ(t)+1}] ≤ t + b.

Combining this equation with (3.15), we have upper and lower bounds on m(t),

1/E[X] − 1/t ≤ m(t)/t ≤ 1/E[X̃] + b/(t E[X̃]) − 1/t.     (3.16)
Finally, choose b = √t. Then as t → ∞, b → ∞ also, so that E[X̃] = ∫_0^b [1 − FX(x)] dx → E[X]. Also, both of the final terms on the right of (3.16) approach 0, completing the proof.

Note that this theorem (and its proof) have not assumed finite variance. The theorem also holds when E[X] is infinite; E[X̃] in (3.16) simply approaches ∞ in this case.
We have just shown that m(t) has the property that lim_{t→∞} [m(t)/t] = 1/E[X]. Note again that N(t, ω)/t is the average number of renewals from 0 to t for a sample function ω, and m(t)/t is the average of this over ω. Combining with Theorem 3.1, the limiting time and ensemble-average equals the time-average renewal rate for each sample function except for a set of probability 0.

Another interesting question is to determine the expected renewal rate in the limit of large t without averaging from 0 to t. That is, are there some values of t at which renewals are more likely than others for large t? If the inter-renewal intervals have an integer distribution function (i.e., each inter-renewal interval must last for an integer number of time units), then each renewal epoch Sn must also be an integer. This means that N(t) can increase only at integer times and the expected rate of renewals is zero at all non-integer times. An obvious generalization of this behavior for integer valued inter-renewal intervals is that of inter-renewals that occur only at integer multiples of some real number d > 0. Such a distribution is called an arithmetic distribution. The span of an arithmetic distribution is the largest number d such that this property holds. Thus, for example, if X takes on only the values 0, 2, and 6, its distribution is arithmetic with span 2. Similarly, if X takes on only the values 1/3 and 1/5, then the span is 1/15. The remarkable thing, for our purposes, is that any inter-renewal distribution that is not an arithmetic distribution leads to a uniform expected rate of renewals in the limit of large t. This result is contained in Blackwell's theorem which we state without proof (see Section 11.1, Theorem 1 of [9]). Recall, however, that for the special case of an inter-renewal density that has a rational Laplace transform, Blackwell's theorem is a simple consequence of (3.9).
5
This proof can be postponed and read later.
Theorem 3.5 (Blackwell). If a renewal process has an inter-renewal distribution that is non-arithmetic, then for any δ > 0,

lim_{t→∞} [m(t + δ) − m(t)] = δ/E[X].     (3.17)
If the inter-renewal distribution is arithmetic with span d, then for any integer n ≥ 1,

lim_{t→∞} [m(t + nd) − m(t)] = nd/E[X].     (3.18)
Eq. (3.17) says that for non-arithmetic distributions, the expected number of arrivals in the interval (t, t + δ] is equal to δ/E[X] in the limit t → ∞. Since the theorem is true for arbitrarily small δ, the theorem almost, but not quite, says that dm(t)/dt is equal to 1/E[X] in the limit t → ∞. Unfortunately this latter statement is not true, and one can see the reason by looking at an example where X can take on only the values 1 and π. Then no matter how large t is, N(t) can only increase at discrete points of time of the form k + jπ where k and j are non-negative integers; thus dm(t)/dt is either 0 or ∞ for all t. As t gets larger, however, the jumps in m(t) become both smaller in magnitude and more closely spaced from one to the next. Thus [m(t + δ) − m(t)]/δ approaches 1/E[X] as t → ∞ for any fixed δ, no matter how small, but as δ gets smaller, the convergence in t gets slower. No matter how large one chooses t, [m(t + δ) − m(t)]/δ does not approach 1/E[X] as δ → 0.

Since the inter-renewal intervals are positive random variables, multiple renewals cannot occur simultaneously, and thus, as shown in Exercise 3.15, (3.17) implies that the probability of a renewal in a small interval (t, t + δ] tends to δ/E[X] + o(δ) as t → ∞. Thus, for a non-arithmetic inter-renewal distribution, the limiting distribution of renewals in a small interval (t, t + δ] satisfies

lim_{t→∞} Pr{N(t + δ) − N(t) = 0} = 1 − δ/X̄ + o(δ)
lim_{t→∞} Pr{N(t + δ) − N(t) = 1} = δ/X̄ + o(δ)
lim_{t→∞} Pr{N(t + δ) − N(t) ≥ 2} = o(δ).     (3.19)
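A numerical illustration of (3.17) and (3.19) (added here, not from the text): for a hypothetical "large" t and small δ, the expected number of renewals in (t, t + δ] should be close to δ/X̄, and the probability of exactly one renewal close to δ/X̄ as well. The Uniform(0, 2) inter-renewal distribution (non-arithmetic, X̄ = 1) and the parameter values are assumptions.

import numpy as np

rng = np.random.default_rng(5)
t0, delta, trials = 1000.0, 0.1, 20000        # "large" t and small delta are assumptions

def renewals_in_window(t0, t1, n_max=1500):
    # Uniform(0, 2) inter-renewals; n_max draws are (with overwhelming probability)
    # enough for the arrival epochs to pass t1.
    epochs = np.cumsum(rng.uniform(0.0, 2.0, size=n_max))
    return int(np.count_nonzero((epochs > t0) & (epochs <= t1)))

counts = np.array([renewals_in_window(t0, t0 + delta) for _ in range(trials)])
print("estimate of E[N(t+delta) - N(t)]  :", counts.mean(), "   delta/E[X] =", delta)
print("estimate of Pr{exactly one renewal}:", np.mean(counts == 1))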
If we compare this with Eq. (2.17), associating the rate λ of a Poisson process with 1/X̄, we see that, asymptotically, a renewal process with a non-arithmetic inter-renewal distribution satisfies two of the three requirements in definition 3 of a Poisson process. That is, the increments are asymptotically stationary and the renewals do not occur simultaneously. If the renewal process is not Poisson, however, the increments are not independent. For an arithmetic renewal process with span d, (3.18), with n = 1, states that the probability of a renewal at time kd is given by

lim_{k→∞} Pr{N(kd) − N(kd − d) = 1} = d/X̄.     (3.20)

3.4 Renewal-reward processes; time-averages
There are many situations in which, along with a renewal counting process {N (t); t ≥ 0}, there is another randomly varying function of time, called a reward function {R(t); t ≥ 0}.
R(t) models a rate at which the process is accumulating a reward. We shall illustrate many examples of such processes and see that a "reward" could also be a cost or any randomly varying quantity of interest. The important restriction on these reward functions is that R(t) at a given t depends only on the particular inter-renewal interval containing t. We start with several examples to illustrate the kinds of questions addressed by this type of process.

Example 3.4.1. (time-average Residual Life) For a renewal counting process {N(t), t ≥ 0}, let Y(t) be the residual life at time t. The residual life is defined as the interval from t until the next renewal epoch, i.e., as S_{N(t)+1} − t. For example, if we arrive at a bus stop at time t and buses arrive according to a renewal process, Y(t) is the time we have to wait for a bus to arrive (see Figure 3.6). We interpret {Y(t); t ≥ 0} as a reward function. The time-average of Y(t), over the interval (0, t], is given by6 (1/t)∫_0^t Y(τ) dτ. We are interested in the limit of this average as t → ∞ (assuming that it exists in some sense). Figure 3.6 illustrates a sample function of a renewal counting process {N(t); t ≥ 0} and shows the residual life Y(t) for that sample function. Note that, for a given sample function {Y(t) = y(t)}, the integral ∫_0^t y(τ) dτ is simply a sum of isosceles right triangles, with part of a final triangle at the end. Thus it can be expressed as

∫_0^t y(τ) dτ = (1/2) Σ_{i=1}^{n(t)} x_i² + ∫_{τ=s_{n(t)}}^{t} y(τ) dτ,
where {x_i; 0 < i < ∞} is the set of sample values for the inter-renewal intervals.

Since this relationship holds for every sample point, we see that the random variable ∫_0^t Y(τ) dτ can be expressed in terms of the inter-renewal random variables Xn as

∫_{τ=0}^{t} Y(τ) dτ = (1/2) Σ_{n=1}^{N(t)} Xn² + ∫_{τ=S_{N(t)}}^{t} Y(τ) dτ.

Although the final term above can be easily evaluated for a given S_{N(t)}, it is more convenient to use the following bound:

(1/2t) Σ_{n=1}^{N(t)} Xn² ≤ (1/t) ∫_{τ=0}^{t} Y(τ) dτ ≤ (1/2t) Σ_{n=1}^{N(t)+1} Xn².     (3.21)
The term on the left can now be evaluated in the limit t → ∞ (for all sample functions except a set of probability zero) as follows:

lim_{t→∞} Σ_{n=1}^{N(t)} Xn²/(2t) = lim_{t→∞} [Σ_{n=1}^{N(t)} Xn²/N(t)] · [N(t)/(2t)] = E[X²]/(2E[X]).     (3.22)

6 ∫_0^t Y(τ) dτ is a random variable just like any other function of a set of variables. It has a sample value for each sample function of {N(t); t ≥ 0}, and its distribution function could be calculated in a straightforward but tedious way. For arbitrary stochastic processes, integration and differentiation can require great mathematical sophistication, but none of those subtleties occur here.
Figure 3.6: Residual life at time t. For any given sample function of the renewal process, the sample function of residual life decreases linearly with a slope of −1 from the beginning to the end of each inter-renewal interval.
The second equality above follows by applying the strong law of large numbers (Theorem 1.5) to Σ_{n≤N(t)} Xn²/N(t) as N(t) approaches infinity, and by applying the strong law for renewal processes (Theorem 3.1) to N(t)/t as t → ∞. The right hand term of (3.21) is handled almost the same way:

lim_{t→∞} Σ_{n=1}^{N(t)+1} Xn²/(2t) = lim_{t→∞} [Σ_{n=1}^{N(t)+1} Xn²/(N(t)+1)] · [(N(t)+1)/N(t)] · [N(t)/(2t)] = E[X²]/(2E[X]).     (3.23)

Combining these two results, we see that, with probability 1 (abbreviated as W.P.1), the time-average residual life is given by

lim_{t→∞} (1/t) ∫_{τ=0}^{t} Y(τ) dτ = E[X²]/(2E[X])     W.P.1.     (3.24)

Note that this time-average depends on the second moment of X; this is X̄² + σ² ≥ X̄², so the time-average residual life is at least half the expected inter-renewal interval (which is not surprising). On the other hand, the second moment of X can be arbitrarily large (even infinite) for any given value of E[X], so that the time-average residual life can be arbitrarily large relative to E[X]. This can be explained intuitively by observing that large inter-renewal intervals are weighted more heavily in this time-average than small inter-renewal intervals. As an example, consider an inter-renewal random variable X that takes on value ε with probability 1 − ε and value 1/ε with probability ε. Then, for small ε, E[X] ∼ 1, E[X²] ∼ 1/ε, and the time-average residual life is approximately 1/(2ε) (see Figure 3.7).

Figure 3.7: Average Residual life is dominated by large interarrival intervals. Note the large intervals of time (within the large triangles) over which Y(t) is large, and small aggregate intervals over which it is small.

Example 3.4.2. (time-average Age) Let Z(t) be the age of a renewal process at time t where age is defined as the interval from the most recent arrival before (or at) t until t,
i.e., Z(t) = t − S_{N(t)} (see Figure 3.8). We notice that the age process, for a given sample function of the renewal process, is almost the same as the residual life process—the isosceles right triangles are simply turned around. Thus the same analysis as before can be used to show that the time average of Z(t) is the same as the time-average of the residual life,

lim_{t→∞} (1/t) ∫_{τ=0}^{t} Z(τ) dτ = E[X²]/(2E[X])     W.P.1.     (3.25)
Figure 3.8: Age at time t: For any given sample function of the renewal process, the sample function of age increases linearly with a slope of 1 from the beginning to the end of each inter-renewal interval.
Example 3.4.3. (time-average Duration) Let X(t) be the duration of the inter-renewal interval containing time t, i.e., X(t) = X_{N(t)+1} = S_{N(t)+1} − S_{N(t)} (see Figure 3.9). It is clear that X(t) = Z(t) + Y(t), and thus the time-average of the duration is given by

lim_{t→∞} (1/t) ∫_{τ=0}^{t} X(τ) dτ = E[X²]/E[X]     W.P.1.     (3.26)

Again, long intervals are heavily weighted in this average, so that the time-average duration is at least as large as the mean inter-renewal interval and often much larger.

Figure 3.9: Duration X(t) = X_{N(t)+1} of the inter-renewal interval containing t.
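The three time-averages above are easy to check by simulation. The sketch below (an added illustration, not from the text) integrates the residual life and the duration over a long stretch of complete renewal intervals; by the symmetry noted in Example 3.4.2, the age gives the same value as the residual life. The Uniform(0, 2) inter-renewal distribution, for which E[X] = 1 and E[X²] = 4/3, is an assumption for the example.

import numpy as np

rng = np.random.default_rng(6)

# Assumed inter-renewal distribution: Uniform(0, 2), so E[X] = 1 and E[X^2] = 4/3.
x = rng.uniform(0.0, 2.0, size=2_000_000)
t_end = x.sum()                  # integrate over exactly N complete renewal intervals

# Over one interval of length X_n, Y(t) (and, by symmetry, Z(t)) sweeps a triangle
# of area X_n^2 / 2, while X(t) is constant at X_n and contributes X_n^2.
residual_time_avg = (x ** 2 / 2).sum() / t_end
duration_time_avg = (x ** 2).sum() / t_end

print("time-average residual life:", round(residual_time_avg, 4), "  E[X^2]/(2E[X]) =", round(2 / 3, 4))
print("time-average duration     :", round(duration_time_avg, 4), "  E[X^2]/E[X]    =", round(4 / 3, 4))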
3.4.1 General renewal-reward processes
In each of these examples, and in many other situations, we have a random function of time (i.e., Y (t), Z(t), or X(t)) whose value at time t depends only on where t is in the
current inter-renewal interval (i.e., on the age Z(t) and on the duration X(t) of the current inter-renewal interval). We now investigate the general class of reward functions for which the reward function at time t depends only on the age and the duration at t, i.e., the reward R(t) at time t is given explicitly as a function R(Z(t), X(t)) of the age and duration at t. For the three examples above, the function R is trivial. That is, the residual life, Y(t), is given by X(t) − Z(t) and the age and duration are given directly.

We now find the time-average value of R(t), namely, lim_{t→∞} (1/t) ∫_0^t R(τ) dτ. As in examples 3.4.1 to 3.4.3 above, we first want to look at the accumulated reward over each inter-renewal period separately. Define Rn as the accumulated reward in the nth renewal interval,

Rn = ∫_{Sn−1}^{Sn} R(τ) dτ = ∫_{Sn−1}^{Sn} R[Z(τ), X(τ)] dτ.     (3.27)
For residual life (see Example 3.4.1), Rn is the area of the nth isosceles right triangle in Figure 3.6. In general, since Z(τ) = τ − Sn−1,

Rn = ∫_{Sn−1}^{Sn} R(τ − Sn−1, Xn) dτ = ∫_{z=0}^{Xn} R(z, Xn) dz.     (3.28)
Note that Rn depends only on the value of Xn and the form of the function R(Z, X). From this, it is clear that {Rn; n ≥ 1} is essentially7 a set of IID random variables. For residual life, R(z, Xn) = Xn − z, so the integral in (3.28) is Xn²/2, as calculated by inspection before. In general, from (3.28), the expected value of Rn is given by

E[Rn] = ∫_{x=0}^{∞} ∫_{z=0}^{x} R(z, x) dz dFX(x).     (3.29)
Breaking ∫_0^t R(τ) dτ into the reward over the successive renewal periods, we get

∫_0^t R(τ) dτ = ∫_0^{S1} R(τ) dτ + ∫_{S1}^{S2} R(τ) dτ + · · · + ∫_{S_{N(t)−1}}^{S_{N(t)}} R(τ) dτ + ∫_{S_{N(t)}}^{t} R(τ) dτ
             = Σ_{n=1}^{N(t)} Rn + ∫_{S_{N(t)}}^{t} R(τ) dτ.     (3.30)
7
One can certainly define functions R(Z, X) for which the integral in (3.28) is infinite or undefined for some values of Xn , and thus Rn becomes a defective rv. It seems better to handle this type of situation when it arises rather than handling it in general.
The following theorem now generalizes the results of Examples 3.4.1, 3.4.2, and 3.4.3 to general renewal-reward functions.

Theorem 3.6. Let {R(t); t > 0} ≥ 0 be a non-negative renewal-reward function for a renewal process with expected inter-renewal time E[X] = X̄. If X̄ < ∞ or E[Rn] < ∞, then with probability 1

lim_{t→∞} (1/t) ∫_{τ=0}^{t} R(τ) dτ = E[Rn]/X̄.     (3.31)

Proof: Using (3.30), the accumulated reward up to time t can be bounded between the accumulated reward up to the renewal before t and that to the next renewal after t,

Σ_{n=1}^{N(t)} Rn / t ≤ ∫_{τ=0}^{t} R(τ) dτ / t ≤ Σ_{n=1}^{N(t)+1} Rn / t.     (3.32)

The left hand side of (3.32) can now be broken into

Σ_{n=1}^{N(t)} Rn / t = [Σ_{n=1}^{N(t)} Rn / N(t)] · [N(t)/t].     (3.33)
As t → ∞, N(t) → ∞, and thus, by the strong law of large numbers, the first term on the right side of (3.33) approaches E[Rn] with probability 1. Also the second term approaches 1/X̄ by the strong law for renewal processes. Thus the product of the two terms approaches the limit E[Rn]/X̄. The right hand term of (3.32) is handled almost the same way,

Σ_{n=1}^{N(t)+1} Rn / t = [Σ_{n=1}^{N(t)+1} Rn / (N(t)+1)] · [(N(t)+1)/N(t)] · [N(t)/t].     (3.34)
It is seen that the terms on the right side of (3.34) approach limits as before and thus the term on the left approaches E[Rn]/X̄ with probability 1. Since the upper and lower bound in (3.32) approach the same limit, (1/t) ∫_{τ=0}^{t} R(τ) dτ approaches the same limit and the theorem is proved.

The restriction to non-negative renewal-reward functions in Theorem 3.6 is slightly artificial. The same result holds for non-positive reward functions simply by changing the directions of the inequalities in (3.32). Assuming that E[Rn] exists (i.e., that both its positive and negative parts are finite), the same result applies in general by splitting an arbitrary reward function into a positive and negative part. This gives us the corollary:

Corollary 3.1. Let {R(t); t > 0} be a renewal-reward function for a renewal process with expected inter-renewal time E[X] = X̄. If E[Rn] exists, then with probability 1

lim_{t→∞} (1/t) ∫_{τ=0}^{t} R(τ) dτ = E[Rn]/X̄.     (3.35)

Example 3.4.4. (Distribution of Residual Life) Example 3.4.1 treated the time-average value of the residual life Y(t). Suppose, however, that we would like to find the time-average distribution function of Y(t), i.e., the fraction of time that Y(t) ≤ y as a function of y. The
approach, which applies to a wide variety of applications, is to use an indicator function (for a given value of y) as a reward function. That is, define R(t) to have the value 1 for all t such that Y(t) ≤ y and to have the value 0 otherwise. Figure 3.10 illustrates this function for a given sample path. Expressing this reward function in terms of Z(t) and X(t), we have

R(t) = R(Z(t), X(t)) = 1 if X(t) − Z(t) ≤ y;   0 otherwise.
Figure 3.10: Reward function to find the time-average fraction of time that {Y (t) ≤ y}. For the sample function in the figure, X1 > y, X2 > y, and X4 > y, but X3 < y Note that if an inter-renewal interval is smaller than y (such as the third interval in Figure 3.10), then R(t) has the value one over the entire interval, whereas if the interval is greater than y, then R(t) has the value one only over the final y units of the interval. Thus Rn = min[y, Xn ]. Note that the random variable min[y, Xn ] is equal to Xn for Xn ≤ y, and thus has the same distribution function as Xn in the range 0 to y. Figure 3.11 illustrates this in terms of the complementary distribution function. From the figure, we see that Z 1 Z y E [Rn ] = E [min(X, y)] = Pr{min(X, y) > x} dx = Pr{X > x} dx. (3.36) x=0
x=0
✛ ✒ ✲ ° ° Pr{min(X, y) > x}
° ✠
Pr{X > x}
x y
Figure 3.11: Rn for distribution of residual life. Rt Let FY (y) = limt→1 (1/t) 0 R(τ ) dτ denote the time-average fraction of time that the residual life is less than or equal to y. From Theorem 3.6 and Eq.(3.36), we then have Z E [Rn ] 1 y Pr{X > x} dx. (3.37) FY (y) = = X X x=0 As a check, note that this integral is increasing in y and approaches 1 as y → 1. In the development so far, the reward function R(t) has been a function solely of the age and duration intervals. In more general situations, where the renewal process is embedded in some more complex process, it is often desirable to define R(t) to depend on other aspects of the process as well. The important thing here is for the reward function to be independent of the renewal process outside the given inter-renewal interval so that the accumulated rewards
112
CHAPTER 3. RENEWAL PROCESSES
over successive inter-renewal intervals are IID random variables. Under this circumstance, Theorem 3.6 clearly remains valid. The subsequent examples of Little’s theorem and the M/G/1 expected queueing delay both use this more general type of renewal-reward function. The above time-average is sometimes visualized by the following type of experiment. For some given large time t, let T be a uniformly distributed random variable over R t (0, t]; T is independent of the renewal-reward process under consideration. Then (1/t) 0 R(τ ) dτ is the expected value (over T ) of R(T ) for a given sample point of {R(τ ); τ >0}. Theorem 3.6 states that in the limit t → 1, all sample points (except a set of probability 0) yield the same expected value over T . This approach of viewing a time-average as a random choice of time is referred to as random incidence. Random incidence is awkward mathematically, since the random variable T changes with the overall time t and has no reasonable limit. It also blurs the distinction between time and ensemble-averages, so we won’t use it in what follows. We next investigate the ensemble-average, E [R(t)] of a reward function, and in particular how it varies for large hR i t and whether limt→1 E [R(t)] exists. We could also find t limt→1 (1/t)E 0 R(τ ) dτ , which can be interpreted as the limiting expected value of reward, averaged over both time and ensemble. Not surprisingly, this is equal to E [Rn ] /E [X]) (See [16], Theorem 6.6.1). We will not bother with that here, however, since what is more important is finding E [R(t)] and limt→1 E [R(τ )] if it exists. In concrete terms, if E [R(t)] varies significantly with t, even for large t, it means, for example, that our waiting time for a bus depends strongly on our arrival time. In other words, it is important to learn if the effect of the original arrival at t = 0 gradually dies out as t → 1.
3.5
Renewal-reward processes; ensemble-averages
As in the last section, {N(t); t ≥ 0} is a renewal counting process, Z(t) and X(t), t > 0, are the age and duration random variables, R(z, x) is a real valued function of the real variables z and x, and {R(t); t ≥ 0} is a reward process with R(t) = R[Z(t), X(t)]. Our objective is to find (and to understand) lim_{t→∞} E[R(t)]. We start out with an intuitive derivation which assumes that the inter-renewal intervals {Xn; n ≥ 1} have a probability density fX(x). Also, rather than finding E[R(t)] for a finite t and then going to the limit, we simply assume that t is so large that m(τ + δ) − m(τ) = δ/X̄ for all τ in the vicinity of t (i.e., we ignore the limit in (3.17)). After this intuitive derivation, we return to look at the limiting issues more carefully.

Since R(t) = R[Z(t), X(t)], we start by finding the joint probability density, fZ(t),X(t)(z, x), of Z(t), X(t). Since the duration at t is equal to the age plus residual life at t, we must have X(t) ≥ Z(t), and the joint probability density can be non-zero only in the triangular region shown in Figure 3.12. From (3.19), the probability of a renewal in a small interval [t − z, t − z + δ) is δ/X̄ − o(δ). Note that, although Z(t) = z implies a renewal at t − z, a renewal at t − z does not imply that Z(t) = z, since there might be other renewals between t − z and t. Given a renewal at t − z, however, the subsequent inter-renewal interval has probability density fX(x). Thus,
the joint probability of a renewal in [t − z, t − z + δ) and a subsequent inter-renewal interval X between x and x + δ is δ²fX(x)/X̄ + o(δ²), i.e.,

Pr{renewal ∈ [t − z, t − z + δ), X ∈ (x, x + δ]} = δ²fX(x)/X̄ + o(δ²).

This is valid for arbitrary x. For x > z, however, the joint event above is the same as the joint event {Z(t) ∈ (z − δ, z], X(t) ∈ (x, x + δ]}. Thus, going to the limit δ → 0, we have

fZ(t),X(t)(z, x) = fX(x)/X̄  for x > z;    fZ(t),X(t)(z, x) = 0 elsewhere.    (3.38)
This joint density is illustrated in Figure 3.12. Note that the argument z does not appear except in the condition x > z ≥ 0, but this condition is very important. The marginal densities for Z(t) and X(t) can be found by integrating (3.38) over the constraint region,

fZ(t)(z) = ∫_{x=z}^{∞} [fX(x)/X̄] dx = [1 − FX(z)]/X̄.    (3.39)

fX(t)(x) = ∫_{z=0}^{x} [fX(x)/X̄] dz = x fX(x)/X̄.    (3.40)

The mean age can be calculated from (3.39) by integration by parts, yielding

E[Z(t)] = E[X²]/(2X̄).    (3.41)

Figure 3.12: Joint density of age and duration; fZ(t),X(t)(z, x) = fX(x)/X̄ on the triangular region x > z ≥ 0.

Similarly, the mean duration can be found from (3.40),

E[X(t)] = E[X²]/X̄.    (3.42)
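Before continuing, a quick numerical sanity check on (3.41) and (3.42) may be useful; the sketch below is ours (the Uniform(0, 2) inter-renewal distribution and the observation time are arbitrary illustrative choices). It samples the age and duration at a large fixed time t over many independent sample paths and compares the ensemble-averages with E[X²]/(2X̄) and E[X²]/X̄.

import numpy as np

rng = np.random.default_rng(1)
t_obs, trials = 100.0, 20_000
ages, durations = np.empty(trials), np.empty(trials)

for i in range(trials):
    x = rng.uniform(0.0, 2.0, 150)               # Uniform(0,2): X̄ = 1, E[X²] = 4/3
    s = np.cumsum(x)                              # renewal epochs S1, S2, ... (150 draws amply pass t_obs)
    k = np.searchsorted(s, t_obs, side="right")   # number of epochs <= t_obs
    prev = 0.0 if k == 0 else s[k - 1]            # S_{N(t)}, the last epoch <= t_obs
    ages[i] = t_obs - prev                        # Z(t)
    durations[i] = x[k]                           # X(t) = S_{N(t)+1} - S_{N(t)}

print("E[Z(t)] ≈", ages.mean(), "  theory E[X²]/(2X̄) =", (4/3) / 2)
print("E[X(t)] ≈", durations.mean(), "  theory E[X²]/X̄  =", 4/3)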
These ensemble-averages agree with the time-averages found in (3.25) and (3.26). In calculating time-averages, the somewhat paradoxical result that the time-average duration is greater than E [X] was explained by the large inter-renewal intervals being weighted more heavily in the time-average. Here the same effect occurs, but it can be given a different interpretation: the joint density, at z and x, for age and duration, is proportional to the
inter-renewal density fX(x), but the marginal density for duration is weighted by x since the range for age is proportional to x.

Using the joint probability density of Z(t) and X(t) to evaluate the expectation of an arbitrary reward function R(t) = R(Z(t), X(t)) in the limit t → ∞, we get

lim_{t→∞} E[R(t)] = ∫_{x=0}^{∞} ∫_{z=0}^{x} R(z, x) [fX(x)/X̄] dz dx = E[Rn]/X̄,    (3.43)

where E[Rn] is defined in (3.29). Thus the limiting ensemble-average is the same as the time-average. This result should not be surprising. Since we are dealing with non-arithmetic renewals, the probability of a renewal in a small interval becomes independent of where the interval is, so long as the interval is far enough removed from 0 for the process to be in "steady-state". Since the reward function depends on the process only through the current renewal interval, the reward function must also become independent of time.

In the intuitive derivation above, we assumed a density for the inter-renewal variables {Xn; n ≥ 1}, and we ignored the mathematical issues of taking limits. Here we correct those defects, but the reader should bear in mind that the logical train of the argument is exactly the same as before. Assume throughout that Xn is non-arithmetic with distribution function FX(x). Since m(τ) = E[N(τ)], the expected number of renewals in (t − z, t − z + δ] is m(t − z + δ) − m(t − z). As shown in Exercise 3.15, the assumption that the inter-renewal variables are strictly positive ensures that the probability of more than one renewal in (t − z, t − z + δ] is insignificant, and more specifically, it also shows that for any t > z > 0,

Pr{renewal ∈ (t − z, t − z + δ]} = m(t − z + δ) − m(t − z) − o(δ),    (3.44)

where lim_{δ→0} o(δ)/δ = 0 and o(δ) ≥ 0. Conditional on a renewal at a sample value t − z, the inter-renewal interval X starting at t − z has the distribution FX. Thus

Pr{renewal ∈ (t−z, t−z+δ1], X ∈ (x, x+δ2]} = [m(t−z+δ1) − m(t−z) − o(δ1)][FX(x+δ2) − FX(x)].

For x > z, the joint event {renewal ∈ (t − z, t − z + δ1], X ∈ (x, x + δ2]} is the same as the joint event {Z(t) ∈ [z − δ1, z), X(t) ∈ (x, x + δ2]}. Thus,

Pr{Z(t) ∈ [z−δ1, z), X(t) ∈ (x, x+δ2]} = [m(t−z+δ1) − m(t−z) − o(δ1)][FX(x+δ2) − FX(x)].    (3.45)

Now let R(z, x) ≥ 0 be a renewal-reward function (the non-negativity restriction will be relaxed later). We want to find E[R(t)] for this function. In principle, we can find this expectation by quantizing the region x ≥ z in Figure 3.12 into rectangles δ1 by δ2, multiplying the probability of each rectangle by R(z, x) in that rectangle, summing, and shrinking the δ's to 0. That is, without worrying about the range of the sums,

E[R(t)] ≈ Σ_{n,k} Pr{Z(t) ∈ [nδ1−δ1, nδ1), X(t) ∈ (kδ2, kδ2+δ2]} R(nδ1, kδ2)
        = Σ_{n,k} [m(t−nδ1+δ1) − m(t−nδ1) − o(δ1)][FX(kδ2+δ2) − FX(kδ2)] R(nδ1, kδ2).
It is simpler, in going to the limit as δ1, δ2 → 0, to take δ2 → 0 first, since the integration over x does not involve m(t), which is the difficult quantity to work with. Thus define r(z) by⁸

r(z) = ∫_{x=z}^{∞} R(z, x) dFX(x).    (3.46)

Going to the limit δ2 → 0 in the sum above and using (3.46), we have

E[R(t)] ≈ Σ_{n} [m(t−nδ1+δ1) − m(t−nδ1) − o(δ1)] r(nδ1).    (3.47)

We can now make this precise by upper and lower bounding E[R(t)]. Using (3.44) to upper and lower bound the first term of (3.47) and upper bounding r(z) over an interval (nδ1 − δ1, nδ1] by r̄(n, δ1) = sup_{nδ1−δ1 ≤ z ≤ nδ1} r(z),

E[R(t)] ≤ Σ_{n=1}^{t/δ1} [m(t − nδ1 + δ1) − m(t − nδ1)] r̄(n, δ1).    (3.48)

We can similarly lower bound E[R(t)] as

E[R(t)] ≥ Σ_{n=1}^{t/δ1} [m(t − nδ1 + δ1) − m(t − nδ1) − o(δ1)] r̲(n, δ1),    (3.49)

where r̲(n, δ) = inf_{nδ−δ ≤ z ≤ nδ} r(z). Aside from the term o(δ1) in (3.49), we see from the definition of a Stieltjes integral in (1.7) that if the Stieltjes integral exists, we can express E[R(t)] as

E[R(t)] = ∫_{τ=0}^{t} r(t − τ) dm(τ).    (3.50)

This is valid for all t, but is not always easy to calculate because of the complex behavior of m(t), especially for non-arithmetic discrete distributions. In the limit of large t, we can use Blackwell's theorem to evaluate the limit in t (if it exists) of the right hand side of (3.48),

lim_{t→∞} Σ_{n=1}^{t/δ} [m(t − nδ + δ) − m(t − nδ)] r̄(n, δ) = (δ/X̄) Σ_{n=1}^{∞} r̄(n, δ).    (3.51)

Similarly, the limit of the right hand side of (3.49), if it exists, is

lim_{t→∞} Σ_{n=1}^{t/δ} [m(t − nδ + δ) − m(t − nδ) − o(δ)] r̲(n, δ) = (δ/X̄ − o(δ)) Σ_{n=1}^{∞} r̲(n, δ).    (3.52)
⁸If one insists on interpreting r(z), one can see that r(z)/(1 − FX(z)) is E[R(t) | Z(t) = z]. It is probably better to simply view r(z) as a step in a derivation.
A function r(z) is called directly Riemann integrable if Σ_{n≥1} r̄(n, δ) and Σ_{n≥1} r̲(n, δ) are finite for all δ > 0 and if lim_{δ→0} Σ_{n≥1} δ r̄(n, δ) = lim_{δ→0} Σ_{n≥1} δ r̲(n, δ). If this latter equality holds, then each limit is equal to ∫_{z≥0} r(z) dz. If r(z) is directly Riemann integrable, then the right hand sides of (3.51) and (3.52) are equal in the limit δ → 0. Since one is an upper bound and the other a lower bound to lim_{t→∞} E[R(t)], we see that the limit exists and is equal to [∫_{z≥0} r(z) dz]/X̄. This can be summarized in the following theorem, known as the Key renewal theorem:

Theorem 3.7 (Key renewal theorem). Let r(z) ≥ 0 be a directly Riemann integrable function, and let m(t) = E[N(t)] for a non-arithmetic renewal process. Then

lim_{t→∞} ∫_{τ=0}^{t} r(t − τ) dm(τ) = (1/X̄) ∫_{z=0}^{∞} r(z) dz.    (3.53)

Since R(z, x) ≥ 0, (3.46) shows that r(z) ≥ 0. Also, from (3.29), E[Rn] = ∫_{z≥0} r(z) dz. Thus, combining (3.50) with (3.53), we have the corollary

Corollary 3.2. Let {N(t); t ≥ 0} be a non-arithmetic renewal process, let R(z, x) ≥ 0, and let r(z) ≥ 0 in (3.46) be directly Riemann integrable. Then

lim_{t→∞} E[R(t)] = E[Rn]/X̄.    (3.54)
The major restrictions imposed by r(z) being directly Riemann integrable are, first, that E[Rn] = ∫_{z≥0} r(z) dz is finite, second, that r(z) contains no impulses, and third, that r(z) is not too wildly varying (being continuous and bounded by a decreasing integrable function is enough). It is also not necessary to assume R(z, x) ≥ 0, since, if E[Rn] exists, one can break a more general R into positive and negative parts.

The above development assumed a non-arithmetic renewal process. For an arithmetic process, the situation is somewhat simpler mathematically, but in general E[R(t)] depends on the remainder when t is divided by the span d. Usually with such processes, one is interested only in reward functions that remain constant over intervals of length d, so we can consider E[R(t)] only for t equal to multiples of d, and thus we assume t = nd here. Thus the function R(z, x) is of interest only when z and x are multiples of d, and in particular, only for x = d, 2d, . . . and for z = d, 2d, . . . , x. We follow the convention that an inter-renewal interval is open on the left and closed on the right, thus including the renewal that ends the interval. Then

E[R(nd)] = Σ_{i=1}^{∞} Σ_{j=1}^{i} R(jd, id) Pr{renewal at (n − j)d, next renewal at (n − j + i)d}.

Let Pi be the probability that an inter-renewal interval has size id. Using (3.20) for the limiting probability of a renewal at (n − j)d, this becomes

lim_{n→∞} E[R(nd)] = Σ_{i=1}^{∞} Σ_{j=1}^{i} R(jd, id) (d/X̄) Pi = E[Rn]/X̄,    (3.55)
where E[Rn] is the expected reward over a renewal period. In using this formula, remember that R(t) is piecewise constant, so that the aggregate reward over an interval of size d around nd is dR(nd).

It has been important, and theoretically reassuring, to be able to find ensemble-averages for renewal-reward functions in the limit of large t and to show (not surprisingly) that they are the same as the time-average results. The ensemble-average results are quite tricky, though, and it is wise to check results achieved that way with the corresponding time-average results.
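One concrete way to perform such a check is numerically. The sketch below is ours (the particular PMF is an arbitrary illustration): for an arithmetic renewal process with span d = 1, it computes Pr{renewal at n} recursively, checks Blackwell's limit d/X̄, and then evaluates the limiting ensemble-average duration via (3.55), comparing it with the time-average value E[X²]/X̄ from (3.42).

import numpy as np

f = {1: 0.2, 2: 0.5, 5: 0.3}                    # Pr{X = k}; X̄ = 2.7, span d = 1
Xbar = sum(k * q for k, q in f.items())

N = 2000
p = np.zeros(N + 1)                              # p[n] = Pr{renewal at n}
for n in range(1, N + 1):
    p[n] = f.get(n, 0.0) + sum(f[k] * p[n - k] for k in f if k < n)

print("Pr{renewal at n}, large n:", round(p[N], 4), "   d/X̄:", round(1 / Xbar, 4))

# Reward R(z, x) = x (duration).  (3.55): lim E[X(nd)] should equal E[X²]/X̄.
ensemble = sum(f[i] * i * p[N - j] for i in f for j in range(1, i + 1))
theory = sum(i * i * q for i, q in f.items()) / Xbar
print("ensemble-average duration:", round(ensemble, 4), "   E[X²]/X̄:", round(theory, 4))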
3.6
Applications of renewal-reward theory
3.6.1
Little’s theorem
Little’s theorem is an important queueing result stating that the expected number of customers in a queueing system is equal to the expected time each customer waits in the system times the arrival rate. This result is true under very general conditions; we use the G/G/1 queue as a specific example, but the reason for the greater generality will be clear as we proceed. Note that the theorem does not tell us how to find either the expected number or expected wait; it only says that if one can be found, the other can also be found.
Figure 3.13: Arrival process, departure process, and waiting times for a queue. Renewals occur at S1 and S2, i.e., when an arrival sees an empty system. The area between A(τ) and D(τ) up to time t is ∫_0^t L(τ) dτ where L(τ) = A(τ) − D(τ). The sum W1 + · · · + WA(t) also includes the shaded area to the right of t.

Figure 3.13 illustrates the setting for Little's theorem. It is assumed that an arrival occurs at time 0, and that the subsequent interarrival intervals are IID. A(t) is the number of arrivals from time 0 to t, including the arrival at 0, so {A(t) − 1; t ≥ 0} is a renewal counting process. The departure process {D(t); t ≥ 0} is the number of departures from 0 to t, and thus increases by one each time a customer leaves the system. The difference, L(t) = A(t) − D(t), is the number in the system at time t.

To be specific about the waiting time of each customer, we assume that customers enter service in the order of their arrival to the system. This service rule is called First-Come-First-Serve (FCFS).⁹ Assuming FCFS, the system time of customer n, i.e., the time customer n spends in the system, is the interval from the nth arrival to the nth departure. Finally,
⁹For single server queues, this is also frequently referred to as First In First Out (FIFO) service.
the figure shows the renewal points S1, S2, . . . at which arriving customers find an empty system. As observed in Example 3.1.1, the system probabilistically restarts at each of these renewal instants, and the behavior of the system in one inter-renewal interval is independent of that in each other inter-renewal interval.

It is important here to distinguish between two different renewal processes. The arrival process, or more precisely, {A(t) − 1; t ≥ 0}, is one renewal counting process, and the renewal epochs S1, S2, . . . in the figure generate another renewal process. In what follows, {A(t); t ≥ 0} is referred to as the arrival process and {N(t); t ≥ 0}, with renewal epochs S1, S2, . . . , is referred to as the renewal process. The entire system can be viewed as starting anew at each renewal epoch, but not at each arrival epoch.

We now regard L(t), the number of customers in the system at time t, as a reward function over the renewal process. This is slightly more general than the reward functions of Sections 3.4 and 3.5, since L(t) depends on the arrivals and departures within a busy period (i.e., within an inter-renewal interval). Conditional on the age Z(t) and duration X(t) of the inter-renewal interval at time t, one could, in principle, calculate the expected value R(Z(t), X(t)) over the parameters other than Z(t) and X(t). Fortunately, this is not necessary and we can use the sample functions of the combined arrival and departure processes directly, which specify L(t) as A(t) − D(t). Assuming that the expected inter-renewal interval is finite, Theorem 3.6 asserts that the time-average number of customers in the system (with probability 1) is equal to E[Ln]/E[X]. E[Ln] is the expected area between A(t) and D(t) (i.e., the expected integral of L(t)) over an inter-renewal interval. An inter-renewal interval is a busy period followed by an idle period, so E[Ln] is also the expected area over a busy period. E[X] is the mean inter-renewal interval.

From Figure 3.13, we observe that W1 + W2 + W3 is the area of the region between A(t) and D(t) in the first inter-renewal interval for the particular sample path in the figure. This is the aggregate reward over the first inter-renewal interval for the reward function L(t). More generally, for any time t, W1 + W2 + · · · + WA(t) is the area between A(t) and D(t) up to a height of A(t). It is equal to ∫_0^t L(τ) dτ plus the remaining waiting time of each of the customers in the system at time t (see Figure 3.13). Since this remaining waiting time is at most the area between A(t) and D(t) from t until the next time when the system is empty, we have

Σ_{n=1}^{N(t)} Ln ≤ ∫_{τ=0}^{t} L(τ) dτ ≤ Σ_{i=1}^{A(t)} Wi ≤ Σ_{n=1}^{N(t)+1} Ln.    (3.56)
Assuming that the expected inter-renewal interval, E[X], is finite, we can divide both sides of (3.56) by t and go to the limit t → ∞. From the same argument as in Theorem 3.6, we get

lim_{t→∞} (1/t) Σ_{i=1}^{A(t)} Wi = lim_{t→∞} (1/t) ∫_{τ=0}^{t} L(τ) dτ = E[Ln]/E[X]    with probability 1.    (3.57)
We denote lim_{t→∞} (1/t) ∫_0^t L(τ) dτ as L. The quantity on the left of (3.57) can now be broken up as waiting time per customer multiplied by number of customers per unit time,
i.e.,

lim_{t→∞} (1/t) Σ_{i=1}^{A(t)} Wi = lim_{t→∞} [Σ_{i=1}^{A(t)} Wi / A(t)] · lim_{t→∞} [A(t)/t].    (3.58)
From (3.57), the limit on the left side of (3.58) exists (and equals L) with probability 1. The second limit on the right also exists with probability 1 by the strong law for renewal processes, applied to {A(t) − 1; t ≥ 0}. This limit is called the arrival rate λ, and is equal to the reciprocal of the mean interarrival interval for {A(t)}. Since these two limits exist with probability 1, the first limit on the right, which is the sample-path-average waiting time per customer, denoted W, also exists with probability 1. We have thus proved Little's theorem,

Theorem 3.8 (Little). For a FCFS G/G/1 queue in which the expected inter-renewal interval is finite, the time-average number of customers in the system is equal, with probability 1, to the sample-path-average waiting time per customer multiplied by the customer arrival rate, i.e., L = λW.

The mathematics we have brought to bear here is quite formidable considering the simplicity of the idea. At any time t within an idle period, the sum of customer waiting periods up to time t is precisely equal to t times the time-average number of customers in the system up to t (see Figure 3.13). Renewal theory informs us that the limits exist and that the edge effects (i.e., the customers in the system at an arbitrary time t) do not have any effect in the limit.

Recall that we assumed earlier that customers departed from the queue in the same order in which they arrived. From Figure 3.14, however, it is clear that FCFS order is not required for the argument. Thus the theorem generalizes to systems with multiple servers and arbitrary service disciplines in which customers do not follow FCFS order. In fact, all that the argument requires is that the system has renewals (which are IID by definition of a renewal) and that the inter-renewal interval is finite with probability 1.
Figure 3.14: Arrivals and departures in non-FCFS systems. The aggregate reward (integral of number of customers in system) up to time t is the enclosed area to the left of t; the sum of waits of customers arriving by t includes the additional shaded area to the right of t.
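A simulation makes the content of Theorem 3.8 tangible. The sketch below is ours, not from the text; it runs a FCFS single-server queue with illustrative exponential interarrival and service distributions, measures the time-average number in system L and the per-customer average system time W over one long sample path, and checks that L ≈ λW.

import numpy as np

rng = np.random.default_rng(2)
lam, mu, n_cust = 0.5, 1.0, 200_000           # arrival and service rates (illustrative)

interarrivals = rng.exponential(1 / lam, n_cust)
services = rng.exponential(1 / mu, n_cust)
arrivals = np.cumsum(interarrivals)

# FCFS single server: departure of customer n is max(arrival_n, departure_{n-1}) + service_n
departures = np.empty(n_cust)
prev = 0.0
for n in range(n_cust):
    prev = max(arrivals[n], prev) + services[n]
    departures[n] = prev

T = departures[-1]
waits = departures - arrivals                  # system time W_n of each customer
W = waits.mean()
L = waits.sum() / T                            # integral of L(t) over [0, T] equals the sum of waits
print(f"L = {L:.3f},  lambda*W = {lam * W:.3f}")   # should nearly agree (Little)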
Finally, suppose the inter-renewal distribution is non-arithmetic; this occurs if the interarrival distribution is non-arithmetic. Then L, the time-average number of customers in the
system, is also equal to¹⁰ lim_{t→∞} E[L(t)]. It is also possible (see Exercise 3.31) to replace the time-average waiting time W with lim_{n→∞} E[Wn]. This gives us the following variant of Little's theorem:

lim_{t→∞} E[L(t)] = λW = lim_{n→∞} λE[Wn].    (3.59)

The same argument as in Little's theorem can be used to relate the average number of customers in the queue (not counting service) to the average wait in the queue (not counting service). Renewals still occur on arrivals to an empty system, and the integral of customers in queue over a busy period is still equal to the sum of the queue waiting times. Let Lq(t) be the number in the queue at time t and let Lq = lim_{t→∞} (1/t) ∫_0^t Lq(τ) dτ be the time-average number in the queue. Letting Wq be the time-average waiting time in queue,

Lq = λWq.    (3.60)
If the inter-renewal distribution is non-arithmetic, then

lim_{t→∞} E[Lq(t)] = λWq.    (3.61)
The same argument can also be applied to the service facility. The time-average of the number of customers in the server is just the fraction of time that the server is busy. Denoting this fraction by ρ and the expected service time by Z̄, we get

ρ = λZ̄.    (3.62)

3.6.2
Expected queueing time for an M/G/1 queue
For our last example of the use of renewal-reward processes, we consider the expected queueing time in an M/G/1 queue. We again assume that an arrival to an empty system occurs at time 0 and renewals occur on subsequent arrivals to an empty system. At any given time t, let Lq(t) be the number of customers in the queue (not counting the customer in service, if any) and let R(t) be the residual life of the customer in service. If no customer is in service, R(t) = 0, and otherwise R(t) is the remaining time until the current service will be completed. Let U(t) be the waiting time in queue that would be experienced by a customer arriving at time t. This is often called the unfinished work in the queueing literature and represents the delay until all the customers currently in the system complete service. Thus the rv U(t) is equal to R(t), the residual life of the customer in service, plus the service times of each of the Lq(t) customers currently waiting in the queue:

U(t) = Σ_{i=1}^{Lq(t)} Zi + R(t),    (3.63)
¹⁰To show this mathematically requires a little care. One approach is to split the reward function into many individual terms. Let Ln(t) = 1 if the nth arrival since the beginning of the busy period has arrived by time t, and let Ln(t) = 0 otherwise. Let Sn(t) = 1 if the nth departure since the beginning of the busy period occurs by time t. It is easy to show that the direct Riemann integrability condition holds for each of these reward functions, and L(t) is Σn Ln(t) − Σn Sn(t).
where Zi is the required service time of the ith customer in the queue at time t. Since the service times are independent of the arrival times and of the earlier service times, Lq(t) is independent of Z1, Z2, . . . , ZLq(t), so, taking expected values,

E[U(t)] = E[Lq(t)] E[Z] + E[R(t)].    (3.64)
Figure 3.15: Sample value of the residual life function of customers in service.

Figure 3.15 illustrates how to find the time-average of R(t). Viewing R(t) as a reward function, we can find the accumulated reward up to time t as the sum of triangular areas. First, consider ∫R(τ) dτ from 0 to SN(t), i.e., the accumulated reward up to the last renewal epoch before t. SN(t) is not only a renewal epoch for the renewal process, but also an arrival epoch for the arrival process; in particular, it is the A(SN(t))th arrival epoch, and the A(SN(t)) − 1 earlier arrivals are the customers that have received service up to time SN(t). Thus,

∫_{τ=0}^{SN(t)} R(τ) dτ = Σ_{i=1}^{A(SN(t))−1} Zi²/2 ≤ Σ_{i=1}^{A(t)} Zi²/2.

We can similarly upper bound the term on the right above by ∫_{τ=0}^{SN(t)+1} R(τ) dτ. We also know (from going through virtually the same argument many times) that (1/t) ∫_{τ=0}^{t} R(τ) dτ will approach a limit with probability 1 as t → ∞, and that the limit will be unchanged if t is replaced with SN(t) or SN(t)+1. Thus, taking λ as the arrival rate,

lim_{t→∞} (1/t) ∫_0^t R(τ) dτ = lim_{t→∞} [Σ_{i=1}^{A(t)} Zi²/(2A(t))] · [A(t)/t] = λE[Z²]/2    with probability 1.

From Corollary 3.2, we can replace the time-average above with the limiting ensemble-average, so that

lim_{t→∞} E[R(t)] = λE[Z²]/2.    (3.65)

Finally, we can use Little's theorem, in the limiting ensemble-average form of (3.61), to assert that lim_{t→∞} E[Lq(t)] = λWq. Substituting this plus (3.65) into (3.64), we get

lim_{t→∞} E[U(t)] = λE[Z]Wq + λE[Z²]/2.    (3.66)
This shows that lim_{t→∞} E[U(t)] exists, so that E[U(t)] is asymptotically independent of t. It is now important to distinguish between E[U(t)] and Wq. The first is the expected unfinished work at time t, which is the queue delay that a customer would incur by arriving at t; the second is the time-average expected queue delay. For Poisson arrivals, the probability of an arrival in (t, t + δ] is independent of U(t).¹¹ Thus, in the limit t → ∞, each arrival faces an expected delay lim_{t→∞} E[U(t)], so lim_{t→∞} E[U(t)] must be equal to Wq. Substituting this into (3.66), we obtain the celebrated Pollaczek-Khinchin formula,

Wq = λE[Z²] / (2(1 − λE[Z])).    (3.67)

This queueing delay has some of the peculiar features of residual life, and in particular, if E[Z²] = ∞, the limiting expected queueing delay is infinite even though the expected service time is less than the expected interarrival interval.

In trying to visualize why the queueing delay is so large when E[Z²] is large, note that while a particularly long service is taking place, numerous arrivals are coming into the system, and all are being delayed by this single long service. In other words, the number of new customers held up by a long service is proportional to the length of the service, and the amount each of them is held up is also proportional to the length of the service. This visualization is rather crude, but does serve to explain the second moment of Z in (3.67). This phenomenon is sometimes called the "slow truck effect" because of the pile-up of cars behind a slow truck on a single-lane road.

For a G/G/1 queue, (3.66) is still valid, but arrival times are no longer independent of U(t), so that typically E[U(t)] ≠ Wq. As an example, suppose that the service time is uniformly distributed between 1 − ε and 1 + ε and that the interarrival interval is uniformly distributed between 2 − ε and 2 + ε. Assuming that ε < 1/2, the system has no queueing and Wq = 0. On the other hand, for small ε, lim_{t→∞} E[U(t)] ∼ 1/4 (i.e., the server is busy half the time with unfinished work ranging from 0 to 1).
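As with Little's theorem, (3.67) is easy to check by simulation. The sketch below is ours; the uniform service distribution is just an illustrative choice with a finite second moment. It simulates an M/G/1 FCFS queue via the Lindley recursion for queueing delay and compares the long-run average delay with the Pollaczek-Khinchin value.

import numpy as np

rng = np.random.default_rng(3)
lam, n_cust = 0.6, 500_000
interarrivals = rng.exponential(1 / lam, n_cust)      # Poisson arrivals
services = rng.uniform(0.0, 1.0, n_cust)              # E[Z] = 0.5, E[Z^2] = 1/3

# Lindley recursion for the delay in queue (excluding service) of customer n:
#   W_{n+1} = max(0, W_n + Z_n - A_{n+1}), A_{n+1} the next interarrival time.
delay = np.zeros(n_cust)
for n in range(1, n_cust):
    delay[n] = max(0.0, delay[n - 1] + services[n - 1] - interarrivals[n])

EZ, EZ2 = 0.5, 1.0 / 3.0
wq_pk = lam * EZ2 / (2 * (1 - lam * EZ))
print(f"simulated Wq = {delay.mean():.4f},  Pollaczek-Khinchin = {wq_pk:.4f}")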
3.7
Delayed renewal processes
We have seen a certain awkwardness in our discussion of Little's theorem and the M/G/1 delay result because an arrival was assumed, but not counted, at time 0; this was necessary for the first interarrival interval to be statistically identical to the others. In this section, we correct that defect by allowing the epoch at which the first renewal occurs to be arbitrarily distributed. The resulting type of process is a generalization of the class of renewal processes known as delayed renewal processes. The word delayed does not necessarily imply that the first renewal epoch is in any sense larger than the other inter-renewal intervals. Rather, it means that the usual renewal process, with IID inter-renewal times, is delayed until after the epoch of the first renewal. What we shall discover is intuitive: both the time-average behavior and, in essence, the limiting ensemble behavior are not affected by the
¹¹This is often called the PASTA property, standing for Poisson arrivals see time-averages. This holds with great generality, requiring only that time-averages exist and that the state of the system at a given time t is independent of future arrivals.
distribution of the first renewal epoch. It might be somewhat surprising, however, to find that this irrelevance of the distribution of the first renewal epoch holds even when the mean of the first renewal epoch is infinite.

To be more precise, we let {Xi; i ≥ 1} be a set of independent non-negative random variables. X1 has some distribution function G(x), whereas {Xi; i ≥ 2} are identically distributed with some distribution function F(x). Typically, G(x) ≠ F(x), since if equality held, we would have an ordinary renewal process. Let Sn = Σ_{i=1}^{n} Xi be the nth renewal epoch and let N(t) be the number of renewal epochs up to and including time t (i.e., N(t) ≥ n if and only if Sn ≤ t). {N(t); t ≥ 0} is then called a delayed renewal counting process. The following simple lemma follows from Lemma 3.1.

Lemma 3.2. Let {N(t); t ≥ 0} be a delayed renewal counting process. Then lim_{t→∞} N(t) = ∞ with probability 1 and lim_{t→∞} E[N(t)] = ∞.

Proof: Conditioning on X1 = x, we can write N(t) = 1 + N′(t−x) where {N′(t); t ≥ 0} is the ordinary renewal counting process with inter-renewal intervals X2, X3, . . . . From Lemma 3.1, lim_{t→∞} N′(t − x) = ∞ with probability 1, and lim_{t→∞} E[N′(t − x)] = ∞. Since this is true for every finite x > 0, and X1 is finite with probability 1, the lemma is proven.

Theorem 3.9 (Strong Law for Delayed Renewal Processes). Let a delayed renewal process have mean inter-renewal interval X̄₂ = ∫_{x=0}^{∞} [1 − F(x)] dx. Then

lim_{t→∞} N(t)/t = 1/X̄₂    with probability 1.    (3.68)

Proof: As in the proof of Theorem 3.1, we have

SN(t)/N(t) ≤ t/N(t) ≤ SN(t)+1/N(t).    (3.69)

lim_{t→∞} SN(t)/N(t) = lim_{t→∞} X1/N(t) + lim_{t→∞} [Σ_{n=2}^{N(t)} Xn/(N(t) − 1)] · [(N(t) − 1)/N(t)].    (3.70)

From Lemma 3.2, N(t) approaches ∞ as t → ∞. Thus for any finite sample value of X1, the first limit on the right side of (3.70) approaches 0. Since X1 is a random variable, it takes on a finite value with probability 1, so this first term is 0 with probability 1 (note that this does not require X1 to have a finite mean). The second term in (3.70) approaches X̄₂ with probability 1 by the strong law of large numbers. The same argument applies to the right side of (3.69), so that lim_{t→∞} t/N(t) = X̄₂ with probability 1. Eq. (3.68) then follows. A truncation argument, as in Exercise 3.3, shows that the theorem is still valid if X̄₂ = ∞.

Next we look at the elementary renewal theorem and Blackwell's theorem for delayed renewal processes. To do this, we view a delayed renewal counting process {N(t); t ≥ 0} as an ordinary renewal counting process that starts at a random non-negative epoch X1 with some distribution function G(t). Define No(t−X1) as the number of renewals that occur in the interval (X1, t]. Conditional on any given sample value x for X1, {No(t−x); t−x ≥ 0} is an ordinary renewal counting process and thus, given X1 = x, lim_{t→∞} E[No(t − x)]/(t−x) = 1/X̄₂.
Since N(t) = 1 + No(t − X1) for t > X1, we see that, conditional on X1 = x,

lim_{t→∞} E[N(t) | X1 = x]/t = lim_{t→∞} [E[No(t − x)]/(t − x)] · [(t − x)/t] = 1/X̄₂.    (3.71)

Since this is true for every finite sample value x for X1, we can take the expected value over X1 to get the following theorem:

Theorem 3.10 (Elementary Delayed Renewal Theorem). For a delayed renewal process with E[Xi] = X̄₂ for all i ≥ 2,

lim_{t→∞} E[N(t)]/t = 1/X̄₂.    (3.72)

The same approach gives us Blackwell's theorem. Specifically, if {Xi; i ≥ 2} are non-arithmetic, then, using Blackwell's theorem for ordinary renewal processes, for any δ > 0,

lim_{t→∞} [E[No(t − x + δ) − No(t − x)]]/δ = 1/X̄₂.    (3.73)

Thus, conditional on any sample value X1 = x, lim_{t→∞} E[N(t+δ) − N(t) | X1 = x] = δ/X̄₂. Taking the expected value over X1 gives us lim_{t→∞} E[N(t + δ) − N(t)] = δ/X̄₂. The case in which {Xi; i ≥ 2} are arithmetic with span d is somewhat more complicated. If X1 is arithmetic with span d (or a multiple of d), then the first renewal epoch must be at some multiple of d and d/X̄₂ gives the expected number of arrivals at time id in the limit as i → ∞. If X1 is non-arithmetic or arithmetic with a span other than a multiple of d, then the effect of the first renewal epoch never dies out, since all subsequent renewals occur at multiples of d from this first epoch. This gives us the theorem:

Theorem 3.11 (Blackwell for Delayed Renewal). If {Xi; i ≥ 2} are non-arithmetic, then, for all δ > 0,

lim_{t→∞} [E[N(t + δ) − N(t)]]/δ = 1/X̄₂.    (3.74)

If {Xi; i ≥ 2} are arithmetic with span d and X1 is arithmetic with span md for some positive integer m, then

lim_{i→∞} Pr{renewal at t = id} = d/X̄₂.    (3.75)
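To see Theorems 3.9 to 3.11 in action, the following sketch (ours; the distributions are chosen only for illustration) simulates a delayed renewal process whose first interval is heavy-tailed with infinite mean, while the remaining intervals are exponential with mean X̄₂, and checks that N(t)/t is still close to 1/X̄₂ for large t.

import numpy as np

rng = np.random.default_rng(4)
t_end, trials, X2_bar = 10_000.0, 100, 2.0      # X̄₂ = 2 for the intervals X2, X3, ...

rates = []
for _ in range(trials):
    # First epoch X1: Pareto with Pr{X1 > x} = x**(-1/2) for x >= 1, so E[X1] = infinity
    s = (1.0 - rng.uniform()) ** -2.0
    n = 0
    while s <= t_end:
        n += 1                                   # renewal at epoch s
        s += rng.exponential(X2_bar)             # subsequent intervals, mean X̄₂
    rates.append(n / t_end)

print("average of N(t)/t:", np.mean(rates), "   1/X̄₂:", 1 / X2_bar)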
3.7.1
Delayed renewal-reward processes
We have seen that the distribution of the first renewal epoch has no effect on the time or ensemble-average behavior of a renewal process (other than the ensemble dependence on time for an arithmetic process). This carries over to reward functions with almost no change. In particular, the extended version of Theorem 3.6 is as follows:
Theorem 3.12. Let {N(t); t ≥ 0} be a delayed renewal counting process, let Z(t) = t − SN(t), let X(t) = SN(t)+1 − SN(t), and let R(t) = R(Z(t), X(t)) be a reward function. Assume that

E[Rn] = ∫_{x=0}^{∞} ∫_{z=0}^{x} R(z, x) dz dFXn(x) < ∞    for all n.

Then, with probability one,

lim_{t→∞} (1/t) ∫_{τ=0}^{t} R(τ) dτ = E[Rn]/X̄₂    for n ≥ 2.    (3.76)
We omit the proof of this since it is a minor variation of that of Theorem 3.6. Finally, since Blackwell's theorem holds for delayed renewal processes, Eq. (3.54), giving the ensemble-average reward for non-arithmetic processes, follows as before, yielding

lim_{t→∞} E[R(t)] = E[Rn]/X̄₂.    (3.77)
3.7.2
Transient behavior of delayed renewal processes
Let m(t) = E[N(t)] for a delayed renewal process. As in (3.5), we have

m(t) = Σ_{n=1}^{∞} Pr{N(t) ≥ n} = Σ_{n=1}^{∞} Pr{Sn ≤ t}.    (3.78)

For n ≥ 2, Sn = Sn−1 + Xn where Xn and Sn−1 are independent. From the convolution equation (1.12),

Pr{Sn ≤ t} = ∫_{x=0}^{t} Pr{Sn−1 ≤ t − x} dF(x)    for n ≥ 2.    (3.79)

For n = 1, Pr{Sn ≤ t} = G(t). Substituting this in (3.78) and interchanging the order of integration and summation,

m(t) = G(t) + ∫_{x=0}^{t} Σ_{n=2}^{∞} Pr{Sn−1 ≤ t − x} dF(x)
     = G(t) + ∫_{x=0}^{t} Σ_{n=1}^{∞} Pr{Sn ≤ t − x} dF(x)
     = G(t) + ∫_{x=0}^{t} m(t − x) dF(x);    t ≥ 0.    (3.80)
This is called the renewal equation and is a generalization of (3.6). [9], Section 11.1, Theorem 3.1, shows that it has a unique solution. There is another useful integral equation very similar to (3.80) that arises from breaking up Sn as the sum of X1 and Ŝn−1, where Ŝn−1 = X2 + · · · + Xn. Letting m̂(t) be the expected number of renewals in time t for an ordinary renewal process with interarrival distribution F,
a similar argument to that above, starting with Pr{Sn ≤ t} = ∫_0^t Pr{Ŝn−1 ≤ t − x} dG(x), yields

m(t) = G(t) + ∫_{x=0}^{t} m̂(t − x) dG(x).    (3.81)
This equation brings out the effect of the initial renewal interval clearly, and is useful in computation if one already knows m̂(t).
Frequently, the most convenient way of dealing with m(t) is through transforms. Following the same argument as that in (3.7), we get Lm(r) = (1/r)LG(r) + Lm(r)LF(r). Solving, we get

Lm(r) = LG(r) / (r[1 − LF(r)]).    (3.82)

We can find m(t) from (3.82) by finding the inverse Laplace transform, using the same procedure as in Example 3.3.1. There is a second order pole at r = 0 again, and, evaluating the residue, it is −1/L′F(0) = 1/X̄₂, which is not surprising in terms of Blackwell's theorem. We can also expand numerator and denominator of (3.82) in a power series, as in (3.8). The inverse transform, corresponding to (3.9), is

m(t) = t/X̄₂ + E[X₂²]/(2X̄₂²) − X̄₁/X̄₂ + ε(t)    for t ≥ 0,    (3.83)

where lim_{t→∞} ε(t) = 0.
3.7.3
The equilibrium process
Consider an ordinary non-arithmetic renewal process with an inter-renewal interval X of distribution F(x). We have seen that the distribution of the interval from t to the next renewal approaches FY(y) = (1/E[X]) ∫_0^y [1 − F(x)] dx as t → ∞. This suggests that if we look at this renewal process starting at some very large t, we should see a delayed renewal process for which the distribution G(x) of the first renewal is equal to the residual life distribution FY(x) above and subsequent inter-renewal intervals should have the original distribution F(x) above. Thus it appears that such a delayed renewal process is the same as the original ordinary renewal process, except that it starts in "steady-state." To verify this, we show that m(t) = t/X̄₂ is a solution to (3.80) if G(t) = FY(t). Substituting (t − x)/X̄₂ for m(t − x), the right hand side of (3.80) is

(1/X̄₂) ∫_0^t [1 − F(x)] dx + (1/X̄₂) ∫_0^t (t − x) dF(x) = (1/X̄₂) ∫_0^t [1 − F(x)] dx + (1/X̄₂) ∫_0^t F(x) dx = t/X̄₂,

where we have used integration by parts for the first equality. This particular delayed renewal process is called the equilibrium process, since it starts off in steady state, and thus one need not worry about transients.
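A small sketch (ours; the Uniform(0, 2) inter-renewal distribution is an arbitrary choice) illustrates the equilibrium process: starting the first interval with the residual-life distribution makes E[N(t)] essentially equal to t/X̄ for every t, while the ordinary process shows a visible transient at small t.

import numpy as np

rng = np.random.default_rng(5)
trials = 20_000
t_grid = [0.5, 1.0, 2.0, 4.0]

def count_renewals(first, t_end):
    """N(t_end) when the first interval is 'first' and later ones are Uniform(0, 2)."""
    s, n = first, 0
    while s <= t_end:
        n += 1
        s += rng.uniform(0.0, 2.0)
    return n

samplers = [
    ("ordinary   ", lambda: rng.uniform(0.0, 2.0)),
    # Equilibrium start: residual-life density 1 - y/2 on [0, 2]; inverse-CDF sample.
    ("equilibrium", lambda: 2.0 * (1.0 - np.sqrt(1.0 - rng.uniform()))),
]
for label, sample_first in samplers:
    m = [np.mean([count_renewals(sample_first(), t) for _ in range(trials)]) for t in t_grid]
    print(label, [round(v, 3) for v in m], "   t/X̄ =", t_grid)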
3.8
Summary
Sections 3.1 to 3.3 give the central results about renewal processes that form the basis for many of the subsequent chapters. The chapter starts with the strong law for renewal processes, showing that the time-average rate of renewals, N(t)/t, approaches 1/X̄ with probability 1 as t → ∞. This, combined with the strong law of large numbers in Chapter 1, is the basis for most subsequent results about time-averages. The next topic is the expected renewal rate, E[N(t)]/t. If the Laplace transform of the inter-renewal density is rational, E[N(t)]/t can be easily calculated. In general, the Wald equality shows that lim_{t→∞} E[N(t)]/t = 1/X̄. Finally, Blackwell's theorem shows that the renewal epochs reach a steady-state as t → ∞. The form of this steady-state depends on whether the inter-renewal distribution is arithmetic (see (3.18)) or non-arithmetic (see (3.17) and (3.19)).

Sections 3.4 and 3.5 add a reward function R(t) to the underlying renewal process; R(t) depends only on the inter-renewal interval containing t. The time-average value of reward exists with probability 1 and is equal to the expected reward over a renewal interval divided by the expected length of an inter-renewal interval. Under some minor restrictions imposed by the key renewal theorem, we also found that, for non-arithmetic inter-renewal distributions, lim_{t→∞} E[R(t)] is the same as the time-average value of reward. These general results were applied to residual life, age, and duration, and were also used to derive and understand Little's theorem and the Pollaczek-Khinchin expression for the expected delay in an M/G/1 queue. Finally, all the results above were shown to apply to delayed renewal processes.

For further reading on renewal processes, see Feller [9], Ross [16], or Wolff [22]. Feller still appears to be the best source for deep understanding of renewal processes, but Ross and Wolff are somewhat more accessible.
3.9
Exercises
Exercise 3.1. The purpose of this exercise is to show that for an arbitrary renewal process, the number of renewals in (0, t] is a random variable for each t > 0, i.e., to show that N(t), for each t > 0, is an actual rv rather than a defective rv.
a) Let X1, X2, . . . , be a sequence of IID inter-renewal rv's. Let Sn = X1 + · · · + Xn be the corresponding renewal epochs for each n ≥ 1. Assume that each Xi has a finite expectation X̄ > 0 and use the weak law of large numbers to show that lim_{n→∞} Pr{Sn < t} = 0.
b) Use part a) to show that lim_{n→∞} Pr{N(t) ≥ n} = 0 and explain why this means that N(t) is not defective.
c) Prove that N(t) is not defective without assuming that each Xi has a mean (but assuming that Pr{X = 0} ≠ 1). Hint: Lower bound each Xi by a binary rv Yi (i.e., Xi(ω) ≥ Yi(ω) for each sample point ω) and show that this implies that FXi(x) ≤ FYi(x). Be sure you understand this strange reversal of inequality signs.
Exercise 3.2. Let {Xi; i ≥ 1} be the inter-renewal intervals of a renewal process generalized to allow for inter-renewal intervals of size 0 and let Pr{Xi = 0} = α, 0 < α < 1. Let {Yi; i ≥ 1} be the sequence of non-zero interarrival intervals. For example, if X1 = x1 > 0, X2 = 0, X3 = x3 > 0, . . . , then Y1 = x1, Y2 = x3, . . . .
a) Find the distribution function of each Yi in terms of that of the Xi.
b) Find the PMF of the number of arrivals of the generalized renewal process at each epoch at which arrivals occur.
c) Explain how to view the generalized renewal process as an ordinary renewal process with inter-renewal intervals {Yi; i ≥ 1} and bulk arrivals at each renewal epoch.

Exercise 3.3. Let {Xi; i ≥ 1} be the inter-renewal intervals of a renewal process and assume that E[Xi] = ∞. Let b > 0 be an arbitrary number and let X̃i be a truncated random variable defined by X̃i = Xi if Xi ≤ b and X̃i = b otherwise.
a) Show that for any constant M > 0, there is a b sufficiently large so that E[X̃i] ≥ M.
b) Let {Ñ(t); t ≥ 0} be the renewal counting process with inter-renewal intervals {X̃i; i ≥ 1} and show that for all t > 0, Ñ(t) ≥ N(t).
c) Show that for all sample functions N(t, ω), except a set of probability 0, N(t, ω)/t < 2/M for all sufficiently large t. Note: Since M is arbitrary, this means that lim N(t)/t = 0 with probability 1.

Exercise 3.4. a) Let J be a stopping rule and In be the indicator random variable of the event {J ≥ n}. Show that J = Σ_{n≥1} In.
b) Show that I1 ≥ I2 ≥ I3 ≥ . . . , i.e., show that for each n > 1, In(ω) ≥ In+1(ω) for each ω ∈ Ω (except perhaps for a set of probability 0).

Exercise 3.5. Is it true for a renewal process that:
a) N(t) < n if and only if Sn > t?
b) N(t) ≤ n if and only if Sn ≥ t?
c) N(t) > n if and only if Sn < t?
Exercise 3.6. Let {N(t); t ≥ 0} be a renewal counting process and let m(t) = E[N(t)] be the expected number of arrivals up to and including time t. Let {Xi; i ≥ 1} be the inter-renewal times and assume that FX(0) = 0.
a) For all x > 0 and t > x show that E[N(t)|X1 = x] = E[N(t − x)] + 1.
b) Use part (a) to show that m(t) = FX(t) + ∫_0^t m(t − x) dFX(x) for t > 0. This equation is the renewal equation derived differently in (3.6).
c) Suppose that X is an exponential random variable of parameter λ. Evaluate Lm(s) from (3.7); verify that the inverse Laplace transform is λt; t ≥ 0.
Exercise 3.7. a) Let the inter-renewal interval of a renewal process have a second order Erlang density, fX(x) = λ²x exp(−λx). Evaluate the Laplace transform of m(t) = E[N(t)].
b) Use this to evaluate m(t) for t ≥ 0. Verify that your answer agrees with (3.9).
c) Evaluate the slope of m(t) at t = 0 and explain why that slope is not surprising.
d) View the renewals here as being the even numbered arrivals in a Poisson process of rate λ. Sketch m(t) for the process here and show one half the expected number of arrivals for the Poisson process on the same sketch. Explain the difference between the two.

Exercise 3.8. a) Let N(t) be the number of arrivals in the interval (0, t] for a Poisson process of rate λ. Show that the probability that N(t) is even is [1 + exp(−2λt)]/2. Hint: Look at the power series expansion of exp(−λt) and that of exp(λt), and look at the sum of the two. Compare this with Σ_{n even} Pr{N(t) = n}.
b) Let Ñ(t) be the number of even numbered arrivals in (0, t]. Show that Ñ(t) = N(t)/2 − Iodd(t)/2, where Iodd(t) is a random variable that is 1 if N(t) is odd and 0 otherwise.
c) Use parts a and b to find E[Ñ(t)]. Note that this is m(t) for a renewal process with 2nd order Erlang inter-renewal intervals.
Exercise 3.9. Use Wald’s equality to compute the expected number of trials of a Bernoulli process up to and including the kth success. Exercise 3.10. A gambler with an initial finite capital of d > 0 dollars starts to play a dollar slot machine. At each play, either his dollar is lost or is returned with some additional number of dollars. Let Xi be his change of capital on the ith play. Assume that {Xi ; i=1, 2, . . . } is a set of IID random variables taking on integer values {−1, 0, 1, . . . }. Assume that E [Xi ] < 0. The gambler plays until losing all his money (i.e., the initial d dollars plus subsequent winnings). a) Let J be the number of plays until the gambler loses all his money. Is the weak law of large numbers sufficient to argue that limn→1 Pr{J > n} = 0 (i.e., that J is a random variable) or is the strong law necessary? b) Find E [J]. Exercise 3.11. Let {Xi ; i ≥ 1} be IID binary random variables with PX (0) = PX (1) = 1/2. Let J be a non negative integer valued P random variable defined on the above sample space of binary sequences and let SJ = Ji=1 Xi . Find the simplest example you can in which J is not a stopping rule for {Xi ; i ≥ 1} and where E [X] E [J] 6= E [SJ ]. Exercise 3.12. Let J = min{n | Sn ≤B or Sn ≥A}, where A is a positive integer, B is a negative integer, and Sn = X1 + X2 + · · · + Xn . Assume that {Xi ; i≥1} is a set of zero mean IID rv’s that can take on only the set of values {−1, 0, +1}, each with positive probability.
a) Is J a stopping rule? Why or why not? Hint: Part of this is to argue that J is finite with probability 1; you do not need to construct a proof of this, but try to argue why it must be true.
b) What are the possible values of SJ?
c) Find an expression for E[SJ] in terms of p, A, and B, where p = Pr{SJ ≥ A}.
d) Find an expression for E[SJ] from Wald's equality. Use this to solve for p.

Exercise 3.13. Let {N(t); t ≥ 0} be a renewal counting process generalized to allow for inter-renewal intervals {Xi} of duration 0. Let each Xi have the PMF Pr{Xi = 0} = 1 − ε; Pr{Xi = 1/ε} = ε.
a) Sketch a typical sample function of {N(t); t ≥ 0}. Note that N(0) can be non-zero (i.e., N(0) is the number of zero interarrival times that occur before the first non-zero interarrival time).
b) Evaluate E[N(t)] as a function of t.
c) Sketch E[N(t)]/t as a function of t.
d) Evaluate E[SN(t)+1] as a function of t (do this directly, and then use Wald's equality as a check on your work).
e) Sketch the lower bound E[N(t)]/t ≥ 1/E[X] − 1/t on the same graph with part (c).
f) Sketch E[SN(t)+1] − t as a function of t and find the time average of this quantity.
g) Evaluate E[SN(t)] as a function of t; verify that E[SN(t)] ≠ E[X] E[N(t)].

Exercise 3.14. Consider a miner trapped in a room that contains three doors. Door 1 leads him to freedom after two days' travel; door 2 returns him to his room after four days' travel; and door 3 returns him to his room after eight days' travel. Suppose each door is equally likely to be chosen whenever he is in the room, and let T denote the time it takes the miner to become free.
a) Define a sequence of independent and identically distributed random variables X1, X2, . . . and a stopping rule J such that

T = Σ_{i=1}^{J} Xi.

b) Use Wald's equality to find E[T].
c) Compute E[Σ_{i=1}^{J} Xi | J = n] and show that it is not equal to E[Σ_{i=1}^{n} Xi].
d) Use part (c) for a second derivation of E[T].
Exercise 3.15. Consider a non-arithmetic renewal counting process {N(t); t > 0}. For a given t and δ, denote Pr{N(t + δ) − N(t) = i} by Pi.
a) Show that P2 ≤ P1 FX(δ).
b) Show that Pi ≤ P1 [FX(δ)]^{i−1}.
c) Show that E[N(t + δ) − N(t)] ≤ P1 [1 − FX(δ)]^{−2}.
d) Show that E [N (t + δ) − N (t)] ≤ P1 [1 − o(δ)]. e) Show that E [N (t + δ) − N (t)] ≥ P1 . f ) Use parts d) and e), along with Blackwell’s theorem, to verify (3.45) and the three equations of (3.19). Exercise 3.16. Let Y (t) = SN (t)+1 − t be the residual life at time t of a renewal process. First consider a renewal process in which the interarrival time has density fX (x) = e−x ; x ≥ 0, and next consider a renewal process with density fX (x) =
3 ; (x + 1)4
x ≥ 0.
For each of the above densities, use renewal-reward theory to find: i) the time-average of Y (t) ii) the second moment in time of Y (t) (i.e., limT →1
1 T
RT 0
Y 2 (t)dt)
£ § For the exponential density, verify your answers by finding E [Y (t)] and E Y 2 (t) directly.
Exercise 3.17. Consider a variation of an M/G/1 queueing system in which there is no facility to save waiting customers. Assume customers arrive according to a Poisson process of rate ∏. If the server is busy, the customer departs and is lost forever; if the server is not busy, the customer enters service with a service time distribution function denoted by FY (y). Successive service times (for those customers that are served) are IID and independent of arrival times. Assume that customer number 0 arrives and enters service at time t = 0. a) Show that the sequence of times S1 , S2 , . . . at which successive customers enter service are the renewal times of a renewal process. Show that each inter-renewal interval Xi = Si −Si−1 (where S0 = 0) is the sum of two independent random variables, Yi + Ui where Yi is the ith service time; find the probability density of Ui . b) Assume that a reward (actually a cost in this case) of one unit is incurred for each customer turned away. Sketch the expected reward function as a function of time for the sample function of inter-renewal intervals and service intervals shown below; the expectation is to be taken over those (unshown) arrivals of customers that must be turned away. Rt c) Let 0 R(τ )dτ denote the accumulated reward (i.e., cost) from 0 to t and find the limit Rt as t → 1 of (1/t) 0 R(τ )dτ . Explain (without any attempt to be rigorous or formal) why this limit exists with probability 1.
(Sketch for Exercise 3.17(b): successive service intervals of duration Yi begin at the renewal epochs S0 = 0, S1, S2, . . . .)
d) In the limit of large t, find the expected reward from time t until the next renewal. Hint: Sketch this expected reward as a function of t for a given sample of inter-renewal intervals and service intervals; then find the time-average.
e) Now assume that the arrivals are deterministic, with the first arrival at time 0 and the nth arrival at time n − 1. Does the sequence of times S1, S2, . . . at which subsequent customers start service still constitute the renewal times of a renewal process? Draw a sketch of arrivals, departures, and service time intervals. Again find lim_{t→∞} (∫_0^t R(τ) dτ)/t.

Exercise 3.18. Let Z(t) = t − SN(t) be the age of a renewal process and Y(t) = SN(t)+1 − t be the residual life. Let FX(x) be the distribution function of the inter-renewal interval and find the following as a function of FX(x):
a) Pr{Y(t) > x | Z(t) = s}
b) Pr{Y(t) > x | Z(t+x/2) = s}
c) Pr{Y(t) > x | Z(t+x) > s} for a Poisson process.

Exercise 3.19. Let Z(t), Y(t), X(t) denote the age, residual life, and duration at time t for a renewal counting process {N(t); t ≥ 0} in which the interarrival time has a density given by f(x). Find the following probability densities; assume steady-state.
a) fY(t)(y | Z(t+s/2) = s) for given s > 0.
b) fY(t),Z(t)(y, z).
c) fY(t)(y | X(t) = x).
d) fZ(t)(z | Y(t−s/2) = s) for given s > 0.
e) fY(t)(y | Z(t+s/2) ≥ s) for given s > 0.

Exercise 3.20. a) Find lim_{t→∞} {E[N(t)] − t/X̄} for a renewal counting process {N(t); t ≥ 0} with inter-renewal times {Xi; i ≥ 1}. Hint: use Wald's equation.
b) Evaluate your result for the case in which X is an exponential random variable (you already know what the result should be in this case).
c) Evaluate your result for a case in which E[X] < ∞ and E[X²] = ∞. Explain (very briefly) why this does not contradict the elementary renewal theorem.

Exercise 3.21. Customers arrive at a bus stop according to a Poisson process of rate λ. Independently, buses arrive according to a renewal process with the inter-renewal interval
distribution FX(x). At the epoch of a bus arrival, all waiting passengers enter the bus and the bus leaves immediately. Let R(t) be the number of customers waiting at time t.
a) Draw a sketch of a sample function of R(t).
b) Given that the first bus arrives at time X1 = x, find the expected number of customers picked up; then find E[∫_0^x R(t) dt], again given the first bus arrival at X1 = x.
c) Find lim_{t→∞} (1/t) ∫_0^t R(τ) dτ (with probability 1). Assuming that FX is a non-arithmetic distribution, find lim_{t→∞} E[R(t)]. Interpret what these quantities mean.
d) Use part (c) to find the time-average expected wait per customer.
e) Find the fraction of time that there are no customers at the bus stop. (Hint: this part is independent of a), b), and c); check your answer for E[X] ≪ 1/λ).

Exercise 3.22. Consider the same setup as in Exercise 3.21 except that now customers arrive according to a non-arithmetic renewal process independent of the bus arrival process. Let 1/λ be the expected inter-renewal interval for the customer renewal process. Assume that both renewal processes are in steady-state (i.e., either we look only at t ≫ 0, or we assume that they are equilibrium processes). Given that the nth customer arrives at time t, find the expected wait for customer n. Find the expected wait for customer n without conditioning on the arrival time.

Exercise 3.23. Let {N1(t); t ≥ 0} be a Poisson counting process of rate λ. Assume that the arrivals from this process are switched on and off by arrivals from a non-arithmetic renewal counting process {N2(t); t ≥ 0} (see figure below). The two processes are independent.
(Figure for Exercise 3.23: arrivals of N1(t), at rate λ, contribute to NA(t) only during the "On" intervals determined by the renewal process N2(t).)
Let {NA(t); t ≥ 0} be the switched process; that is, NA(t) includes arrivals from {N1(t); t ≥ 0} while N2(t) is even and excludes arrivals from {N1(t); t ≥ 0} while N2(t) is odd.
a) Is NA(t) a renewal counting process? Explain your answer and if you are not sure, look at several examples for N2(t).
b) Find lim_{t→∞} (1/t) NA(t) and explain why the limit exists with probability 1. Hint: Use symmetry; that is, look at N1(t) − NA(t). To show why the limit exists, use the renewal-reward theorem. What is the appropriate renewal process to use here?
c) Now suppose that {N1(t); t ≥ 0} is a non-arithmetic renewal counting process but not a Poisson process and let the expected inter-renewal interval be 1/λ. For any given δ, find lim_{t→∞} E[NA(t + δ) − NA(t)] and explain your reasoning. Why does your argument in (b) fail to demonstrate a time-average for this case?
Exercise 3.24. An M/G/1 queue has arrivals at rate λ and a service time distribution given by FY(y). Assume that λ < 1/E[Y]. Epochs at which the system becomes empty define a renewal process. Let FZ(z) be the distribution of the inter-renewal intervals and let E[Z] be the mean inter-renewal interval.
a) Find the fraction of time that the system is empty as a function of λ and E[Z]. State carefully what you mean by such a fraction.
b) Apply Little's theorem, not to the system as a whole, but to the number of customers in the server (i.e., 0 or 1). Use this to find the fraction of time that the server is busy.
c) Combine your results in a) and b) to find E[Z] in terms of λ and E[Y]; give the fraction of time that the system is idle in terms of λ and E[Y].
d) Find the expected duration of a busy period.

Exercise 3.25. Consider a sequence X1, X2, . . . of IID binary random variables. Let p and 1 − p denote Pr{Xm = 1} and Pr{Xm = 0} respectively. A renewal is said to occur at time m if Xm−1 = 0 and Xm = 1.
a) Show that {N(m); m ≥ 0} is a renewal counting process where N(m) is the number of renewals up to and including time m and N(0) and N(1) are taken to be 0.
b) What is the probability that a renewal occurs at time m, m ≥ 2?
c) Find the expected inter-renewal interval; use Blackwell's theorem here.
d) Now change the definition of renewal; a renewal now occurs at time m if Xm−1 = 1 and Xm = 1. Show that {N*m; m ≥ 0} is a delayed renewal counting process where N*m is the number of renewals up to and including m for this new definition of renewal (N*0 = N*1 = 0).
e) Find the expected inter-renewal interval for the renewals of part d).
f) Given that a renewal (according to the definition in (d)) occurs at time m, find the expected time until the next renewal, conditional, first, on Xm+1 = 1 and, next, on Xm+1 = 0. Hint: use the result in e) plus the result for Xm+1 = 1 for the conditioning on Xm+1 = 0.
g) Use your result in f) to find the expected interval from time 0 to the first renewal according to the renewal definition in d).
h) Which pattern requires a larger expected time to occur: 0011 or 0101?
i) What is the expected time until the first occurrence of 0111111?

Exercise 3.26. A large system is controlled by n identical computers. Each computer independently alternates between an operational state and a repair state. The duration of the operational state, from completion of one repair until the next need for repair, is a random variable X with finite expected duration E[X]. The time required to repair a computer is an exponentially distributed random variable with density λe^{−λt}. All operating durations and repair durations are independent. Assume that all computers are in the repair state at time 0.
a) For a single computer, say the ith, do the epochs at which the computer enters the repair state form a renewal process? If so, find the expected inter-renewal interval.
b) Do the epochs at which it enters the operational state form a renewal process?
c) Find the fraction of time over which the ith computer is operational and explain what you mean by fraction of time.
d) Let Qi(t) be the probability that the ith computer is operational at time t and find lim_{t→∞} Qi(t).
e) The system is in failure mode at a given time if all computers are in the repair state at that time. Do the epochs at which system failure modes begin form a renewal process?
f) Let Pr{t} be the probability that the system is in failure mode at time t. Find lim_{t→∞} Pr{t}. Hint: look at part (d).
g) For δ small, find the probability that the system enters failure mode in the interval (t, t + δ] in the limit as t → ∞.
h) Find the expected time between successive entries into failure mode.
i) Next assume that the repair time of each computer has an arbitrary density rather than exponential, but has a mean repair time of 1/λ. Do the epochs at which system failure modes begin form a renewal process?
j) Repeat part (f) for the assumption in (i).

Exercise 3.27. Let {N1(t); t ≥ 0} and {N2(t); t ≥ 0} be independent renewal counting processes. Assume that each has the same distribution function F(x) for interarrival intervals and assume that a density f(x) exists for the interarrival intervals.
a) Is the counting process {N1(t) + N2(t); t ≥ 0} a renewal counting process? Explain.
b) Let Y(t) be the interval from t until the first arrival (from either process) after t. Find an expression for the distribution function of Y(t) in the limit t → ∞ (you may assume that time averages and ensemble-averages are the same).
c) Assume that a reward R of rate 1 unit per second starts to be earned whenever an arrival from process 1 occurs and ceases to be earned whenever an arrival from process 2 occurs. Assume that lim_{t→∞} (1/t) ∫_0^t R(τ) dτ exists with probability 1 and find its numerical value.
d) Let Z(t) be the interval from t until the first time after t that R(t) (as in part c) changes value. Find an expression for E[Z(t)] in the limit t → ∞. Hint: Make sure you understand why Z(t) is not the same as Y(t) in part b). You might find it easiest to first find the expectation of Z(t) conditional on both the duration of the {N1(t); t ≥ 0} interarrival interval containing t and the duration of the {N2(t); t ≥ 0} interarrival interval containing t; draw pictures!

Exercise 3.28. This problem provides another way of treating ensemble-averages for renewal-reward problems. Assume for notational simplicity that X is a continuous random variable.
a) Show that Pr{one or more arrivals in (τ, τ + δ)} = m(τ +δ)−m(τ )−o(δ) where o(δ) ≥ 0 and limδ→0 o(δ)/δ = 0. b) Show that Pr{Z(t) ∈ [z, z + δ), X(t) ∈ (x, x + δ)} is equal to [m(t − z) − m(t − z − δ) − o(δ)][FX (x + δ) − FX (x)] for x ≥ z + δ.
c) Assuming that m′(τ) = dm(τ)/dτ exists for all τ, show that the joint density of Z(t), X(t) is f_{Z(t),X(t)}(z, x) = m′(t − z) fX(x) for x > z.

d) Show that

    E[R(t)] = ∫_{z=0}^{t} ∫_{x=z}^{∞} R(z, x) fX(x) dx m′(t − z) dz.

Note: Without densities, this becomes ∫_{z=0}^{t} ∫_{x=z}^{∞} R(z, x) dFX(x) dm(t − z). This is the same as (3.50), and if r(z) = ∫_{x≥z} R(z, x) dF(x) is directly Riemann integrable, then, as shown in (3.51) to (3.53), this leads to (3.54).

Exercise 3.29. This problem is designed to give you an alternate way of looking at ensemble-averages for renewal-reward problems. First we find an exact expression for Pr{S_{N(t)} > s}. We find this for arbitrary s and t, 0 < s < t.

a) By breaking the event {S_{N(t)} > s} into subevents {S_{N(t)} > s, N(t) = n}, explain each of the following steps:

    Pr{S_{N(t)} > s} = Σ_{n=1}^{∞} Pr{t ≥ Sn > s, Sn+1 > t}
                     = Σ_{n=1}^{∞} ∫_{y=s}^{t} Pr{Sn+1 > t | Sn = y} dF_{Sn}(y)
                     = ∫_{y=s}^{t} [1 − FX(t−y)] d Σ_{n=1}^{∞} F_{Sn}(y)
                     = ∫_{y=s}^{t} [1 − FX(t−y)] dm(y),    where m(y) = E[N(y)].
b) Show that for 0 < s < t < u,

    Pr{S_{N(t)} > s, S_{N(t)+1} > u} = ∫_{y=s}^{t} [1 − FX(u − y)] dm(y).
c) Draw a two dimensional sketch, with age and duration as the axes, and show the region of (age, duration) values corresponding to the event {S_{N(t)} > s, S_{N(t)+1} > u}.

d) Assume that for large t, dm(y) can be approximated (according to Blackwell) as (1/X̄)dy, where X̄ = E[X]. Assuming that X also has a density, use the result in parts b) and c) to find the joint density of age and duration.

Exercise 3.30. In this problem, we show how to calculate the residual life distribution Y(t) as a transient in t. Let µ(t) = dm(t)/dt where m(t) = E[N(t)], and let the interarrival distribution have the density fX(x). Let Y(t) have the density f_{Y(t)}(y).
a) Show that these densities are related by the integral equation

    µ(t + y) = f_{Y(t)}(y) + ∫_{u=0}^{y} µ(t + u) fX(y − u) du.

b) Let L_{µ,t}(r) = ∫_{y≥0} µ(t + y) e^{−ry} dy and let L_{Y(t)}(r) and L_X(r) be the Laplace transforms of f_{Y(t)}(y) and fX(x) respectively. Find L_{Y(t)}(r) as a function of L_{µ,t} and L_X.

c) Consider the inter-renewal density fX(x) = (1/2)e^{−x} + e^{−2x} for x ≥ 0 (as in Example 3.3.1). Find L_{µ,t}(r) and L_{Y(t)}(r) for this example.

d) Find f_{Y(t)}(y). Show that your answer reduces to that of (3.37) in the limit as t → ∞.

e) Explain how to go about finding f_{Y(t)}(y) in general, assuming that fX has a rational Laplace transform.

Exercise 3.31. Show that for a G/G/1 queue, the time-average wait in the system is the same as lim_{n→∞} E[Wn]. Hint: Consider an integer renewal counting process {M(n); n ≥ 0} where M(n) is the number of renewals in the G/G/1 process of Section 3.6 that have occurred by the nth arrival. Show that this renewal process has a span of 1. Then consider {Wn; n ≥ 1} as a reward within this renewal process.

Exercise 3.32. If one extends the definition of renewal processes to include inter-renewal intervals of duration 0, with Pr{X=0} = α, show that the expected number of simultaneous renewals at a renewal epoch is 1/(1 − α), and that, for a non-arithmetic process, the probability of 1 or more renewals in the interval (t, t + δ] tends to (1 − α)δ/E[X] + o(δ) as t → ∞.

Exercise 3.33. The purpose of this exercise is to show why the interchange of expectation and sum in the proof of Wald's equality is justified when E[J] < ∞ but not otherwise. Let X1, X2, . . . , be a sequence of IID rv's, each with the distribution FX. Assume that E[|X|] < ∞.

a) Show that Sn = X1 + · · · + Xn is a rv for each integer n > 0. Note: Sn is obviously a mapping from the sample space to the real numbers, but you must show that it is finite with probability 1. Hint: Recall the additivity axiom for the real numbers.

b) Let J be a stopping time for X1, X2, . . . . Show that SJ = X1 + · · · + XJ is a rv. Hint: Represent Pr{SJ} as Σ_{n=1}^{∞} Pr{J = n} Sn.
c) For the stopping time J above, let J(k) = min(J, k) be the stopping time J truncated to integer k. Explain why the interchange of sum and expectation in the proof of Wald's equality is justified in this case, so E[S_{J(k)}] = X̄ E[J(k)].
d) In parts d), e), and f), assume, in addition to the assumptions above, that FX(0) = 0, i.e., that the Xi are positive rv's. Show that lim_{k→∞} E[S_{J(k)}] < ∞ if E[J] < ∞ and lim_{k→∞} E[S_{J(k)}] = ∞ if E[J] = ∞.

e) Show that

    Pr{S_{J(k)} > x} ≤ Pr{SJ > x}
for all k, x.

f) Show that E[SJ] = X̄ E[J] if E[J] < ∞ and E[SJ] = ∞ if E[J] = ∞.

g) Now assume that X has both negative and positive values with nonzero probability and let X+ = max(0, X) and X− = min(X, 0). Express SJ as SJ+ + SJ− where SJ+ = Σ_{i=1}^{J} Xi+ and SJ− = Σ_{i=1}^{J} Xi−. Show that E[SJ] = X̄ E[J] if E[J] < ∞ and that E[SJ] is undefined otherwise.
Chapter 4
FINITE-STATE MARKOV CHAINS

4.1 Introduction
The counting processes {N(t), t ≥ 0} of Chapters 2 and 3 have the property that N(t) changes at discrete instants of time, but is defined for all real t ≥ 0. Such stochastic processes are generally called continuous time processes. The Markov chains to be discussed in this and the next chapter are stochastic processes defined only at integer values of time, n = 0, 1, . . . . At each integer time n ≥ 0, there is an integer valued random variable (rv) Xn, called the state at time n, and the process is the family of rv's {Xn, n ≥ 0}. These processes are often called discrete time processes, but we prefer the more specific term integer time processes. An integer time process {Xn; n ≥ 0} can also be viewed as a continuous time process {X(t); t ≥ 0} by taking X(t) = Xn for n ≤ t < n + 1, but since changes only occur at integer times, it is usually simpler to view the process only at integer times.

In general, for Markov chains, the set of possible values for each rv Xn is a countable set usually taken to be {0, 1, 2, . . . }. In this chapter (except for Theorems 4.2 and 4.3), we restrict attention to a finite set of possible values, say {1, . . . , M}. Thus we are looking at processes whose sample functions are sequences of integers, each between 1 and M. There is no special significance to using integer labels for states, and no compelling reason to include 0 as a state for the countably infinite case and not to include 0 for the finite case. For the countably infinite case, the most common applications come from queueing theory, and the state often represents the number of waiting customers, which can be zero. For the finite case, we often use vectors and matrices, and it is more conventional to use positive integer labels. In some examples, it will be more convenient to use more illustrative labels for states.

Definition 4.1. A Markov chain is an integer time process, {Xn, n ≥ 0} for which each rv Xn, n ≥ 1, is integer valued and depends on the past only through the most recent rv Xn−1,
i.e., for all integer n ≥ 1 and all integer i, j, k, . . . , m,

    Pr{Xn=j | Xn−1=i, Xn−2=k, . . . , X0=m} = Pr{Xn=j | Xn−1=i}.    (4.1)

Pr{Xn=j | Xn−1=i} depends only on i and j (not n) and is denoted by

    Pr{Xn=j | Xn−1=i} = Pij.    (4.2)
The initial state X0 has an arbitrary probability distribution, which is required for a full probabilistic description of the process, but is not needed for most of the results. A Markov chain in which each Xn has a finite set of possible sample values is a finite-state Markov chain. The rv Xn is called the state of the chain at time n. The possible values for the state at time n, namely {1, . . . , M} or {0, 1, . . . }, are also generally called states, usually without too much confusion. Thus Pij is the probability of going to state j given that the previous state is i; the new state, given the previous state, is independent of all earlier states. The use of the word state here conforms to the usual idea of the state of a system: the state at a given time summarizes everything about the past that is relevant to the future. Note that the transition probabilities, Pij, do not depend on n. Occasionally, a more general model is required where the transition probabilities do depend on n. In such situations, (4.1) and (4.2) are replaced by

    Pr{Xn=j | Xn−1=i, Xn−2=k, . . . , X0=m} = Pr{Xn=j | Xn−1=i} = Pij(n).    (4.3)
A process that obeys (4.3), with a dependence on n, is called a non-homogeneous Markov chain. Some people refer to a Markov chain (as defined in (4.1) and (4.2)) as a homogeneous Markov chain. We will discuss only the homogeneous case (since not much of general interest can be said about the non-homogeneous case) and thus omit the word homogeneous as a qualifier. An initial probability distribution for X0, combined with the transition probabilities {Pij} (or {Pij(n)} for the non-homogeneous case), define the probabilities for all events.

Markov chains can be used to model an enormous variety of physical phenomena and can be used to approximate most other kinds of stochastic processes. To see this, consider sampling a given process at a high rate in time, and then quantizing it, thus converting it into a discrete time process, {Zn; −∞ < n < ∞}, where each Zn takes on a finite set of possible values. In this new process, each variable Zn will typically have a statistical dependence on past values that gradually dies out in time, so we can approximate the process by allowing Zn to depend on only a finite number of past variables, say Zn−1, . . . , Zn−k. Finally, we can define a Markov process where the state at time n is Xn = (Zn, Zn−1, . . . , Zn−k+1). The state Xn = (Zn, Zn−1, . . . , Zn−k+1) then depends only on Xn−1 = (Zn−1, . . . , Zn−k+1, Zn−k), since the new part of Xn, i.e., Zn, is independent of Zn−k−1, Zn−k−2, . . . , and the other variables comprising Xn are specified by Xn−1. Thus {Xn} forms a Markov chain approximating the original process. This is not always an insightful or desirable model, but at least provides one possibility for modeling relatively general stochastic processes.

Markov chains are often described by a directed graph (see Figure 4.1). In the graphical representation, there is one node for each state and a directed arc for each non-zero transition
probability. If Pij = 0, then the arc from node i to node j is omitted; thus the difference between zero and non-zero transition probabilities stands out clearly in the graph. Several of the most important characteristics of a Markov chain depend only on which transition probabilities are zero, so the graphical representation is well suited for understanding these characteristics. A finite-state Markov chain is also often described by a matrix [P] (see Figure 4.1). If the chain has M states, then [P] is a M by M matrix with elements Pij. The matrix representation is ideally suited for studying algebraic and computational issues.
Figure 4.1: Graphical and Matrix Representation of a 6 state Markov Chain; a directed arc from i to j is included in the graph if and only if (iff) Pij > 0.
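To make the matrix representation concrete, the short sketch below (not part of the original text) stores a chain as a transition matrix and simulates a sample path from it. The 3-state matrix and its probabilities are made up for illustration, and states are indexed from 0 in the code rather than from 1.

    import numpy as np

    # A hypothetical 3-state chain; row i holds the transition probabilities out of state i.
    P = np.array([[0.5, 0.5, 0.0],
                  [0.1, 0.6, 0.3],
                  [0.0, 0.4, 0.6]])

    def simulate(P, x0, n, seed=0):
        """Generate a sample path X_0, ..., X_n; X_k is drawn from row X_{k-1} of [P]."""
        rng = np.random.default_rng(seed)
        path = [x0]
        for _ in range(n):
            path.append(int(rng.choice(len(P), p=P[path[-1]])))
        return path

    print(simulate(P, x0=0, n=10))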
4.2 Classification of states
This section, except where indicated otherwise, applies to Markov chains with both finite and countable state spaces. We start with several definitions.

Definition 4.2. An (n-step) walk[1] is an ordered string of nodes {i0, i1, . . . , in}, n ≥ 1, in which there is a directed arc from im−1 to im for each m, 1 ≤ m ≤ n. A path is a walk in which the nodes are distinct. A cycle is a walk in which the first and last nodes are the same and the other nodes are distinct.

[1] We are interested here only in directed graphs, and thus undirected walks and paths do not arise.

Note that a walk can start and end on the same node, whereas a path cannot. Also, the number of steps in a walk can be arbitrarily large, whereas a path can have at most M − 1 steps and a cycle at most M steps.

Definition 4.3. A state j is accessible from i (abbreviated as i → j) if there is a walk in the graph from i to j.

For example, in Figure 4.1(a), there is a walk from node 1 to node 3 (passing through node 2), so state 3 is accessible from 1. There is no walk from node 5 to 3, so state 3 is not accessible from 5. State 2, for example, is accessible from itself, but state 6 is not accessible from itself. To see the probabilistic meaning of accessibility, suppose that a walk i0, i1, . . . , in exists from node i0 to in. Then, conditional on X0 = i0, there is a positive probability, Pi0i1, that X1 = i1, and consequently (since Pi1i2 > 0), there is a positive probability that
X2 = i2. Continuing this argument, there is a positive probability that Xn = in, so that Pr{Xn=in | X0=i0} > 0. Similarly, if Pr{Xn=in | X0=i0} > 0, then there is an n-step walk from i0 to in. Summarizing, i → j if and only if (iff) Pr{Xn=j | X0=i} > 0 for some n ≥ 1. We denote Pr{Xn=j | X0=i} by Pij^n. Thus, for n ≥ 1, Pij^n > 0 iff the graph has an n-step walk from i to j (perhaps visiting the same node more than once). For the example in Figure 4.1(a), P13^2 = P12 P23 > 0. On the other hand, P53^n = 0 for all n ≥ 1.

An important relation that we use often in what follows is that if there is an n-step walk from state i to j and an m-step walk from state j to k, then there is a walk of m + n steps from i to k. Thus Pij^n > 0 and Pjk^m > 0 imply

    Pik^(n+m) > 0.    (4.4)

This also shows that i → j and j → k imply

    i → k.    (4.5)
Definition 4.4. Two distinct states i and j communicate (abbreviated i ↔ j) if i is accessible from j and j is accessible from i.

An important fact about communicating states is that if i ↔ j and m ↔ j then i ↔ m. To see this, note that i ↔ j and m ↔ j imply that i → j and j → m, so that i → m. Similarly, m → i, so i ↔ m.

Definition 4.5. A class T of states is a non-empty set of states such that for each state i ∈ T, i communicates with each j ∈ T (except perhaps itself) and does not communicate with any j ∉ T.

For the example of Fig. 4.1(a), {1, 2, 3, 4} is one class of states, {5} is another, and {6} is another. Note that state 6 does not communicate with itself, but {6} is still considered to be a class. The entire set of states in a given Markov chain is partitioned into one or more disjoint classes in this way.

Definition 4.6. For finite-state Markov chains, a recurrent state is a state i that is accessible from all states that are accessible from i (i is recurrent if i → j implies that j → i). A transient state is a state that is not recurrent.

Recurrent and transient states for Markov chains with a countably infinite set of states will be defined in the next chapter. According to the definition, a state i in a finite-state Markov chain is recurrent if there is no possibility of going to a state j from which there can be no return. As we shall see later, if a Markov chain ever enters a recurrent state, it returns to that state eventually with probability 1, and thus keeps returning infinitely often (in fact, this property serves as the definition of recurrence for Markov chains without the finite-state restriction). A state i is transient if there is some j that is accessible from i but from which there is no possible return. Each time the system returns to i, there is a possibility of going to j; eventually this possibility will occur, and then no more returns to i can occur (this can be thought of as a mathematical form of Murphy's law).
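As a sketch of how these definitions can be checked mechanically (the code and its 4-state example matrix are illustrative assumptions, not from the text), one can compute accessibility as the transitive closure of the directed graph with an arc for each Pij > 0, group states into classes, and mark a class recurrent exactly when i → j implies j → i:

    import numpy as np

    # A hypothetical 4-state chain: {0} transient, {1, 2} recurrent, {3} transient
    # (0-based labels in code).
    P = np.array([[0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.5, 0.5, 0.0],
                  [0.0, 0.5, 0.5, 0.0],
                  [0.3, 0.0, 0.0, 0.7]])
    M = len(P)

    # R[i, j] = 1 iff there is a walk of length >= 1 from i to j (transitive closure).
    R = (P > 0).astype(int)
    for _ in range(M):
        R = ((R + R @ R) > 0).astype(int)

    seen, classes = set(), []
    for i in range(M):
        if i in seen:
            continue
        cls = {i} | {j for j in range(M) if R[i, j] and R[j, i]}
        seen |= cls
        recurrent = all(R[j, i] for j in range(M) if R[i, j])   # i -> j implies j -> i
        classes.append((sorted(cls), "recurrent" if recurrent else "transient"))

    print(classes)   # [([0], 'transient'), ([1, 2], 'recurrent'), ([3], 'transient')]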
Theorem 4.1. For finite-state Markov chains, either all states in a class are transient or all are recurrent.[2]

[2] This theorem is also true for Markov chains with a countable state space, but the proof here is inadequate. Also, recurrent classes with a countable state space are further classified into either positive-recurrent or null-recurrent, a distinction that does not appear in the finite-state case.

Proof: Assume that state i is transient (i.e., for some j, i → j but j ↛ i) and suppose that i and m are in the same class (i.e., i ↔ m). Then m → i and i → j, so m → j. Now if j → m, then the walk from j to m could be extended to i; this is a contradiction, and therefore there is no walk from j to m, and m is transient. Since we have just shown that all nodes in a class are transient if any are, it follows that the states in a class are either all recurrent or all transient.

For the example of Fig. 4.1(a), {1, 2, 3, 4} is a transient class and {5} is a recurrent class. In terms of the graph of a Markov chain, a class is transient if there are any directed arcs going from a node in the class to a node outside the class. Every finite-state Markov chain must have at least one recurrent class of states (see Exercise 4.1), and can have arbitrarily many additional classes of recurrent states and transient states.

States can also be classified according to their periods (see Figure 4.2). In Fig. 4.2(a), given that X0 = 2, we see that X1 must be either 1 or 3, X2 must then be either 2 or 4, and in general, Xn must be 2 or 4 for n even and 1 or 3 for n odd. On the other hand, if X0 is 1 or 3, then Xn is 2 or 4 for n odd and 1 or 3 for n even. Thus the effect of the starting state never dies out. Fig. 4.2(b) illustrates another example in which the state alternates from odd to even and the memory of the starting state never dies out. The states in both these Markov chains are said to be periodic with period 2.
Figure 4.2: Periodic Markov Chains

Definition 4.7. The period of a state i, denoted d(i), is the greatest common divisor (gcd) of those values of n for which Pii^n > 0. If the period is 1, the state is aperiodic, and if the period is 2 or more, the state is periodic.[3]

[3] For completeness, we say that the period is infinite if Pii^n = 0 for all n ≥ 1. Such states do not have the intuitive characteristics of either periodic or aperiodic states. Such a state cannot communicate with any other state, and cannot return to itself, so it corresponds to a singleton class of transient states. The notion of periodicity is of primary interest for recurrent states.

For example, in Figure 4.2(a), P11^n > 0 for n = 2, 4, 6, . . . . Thus d(1), the period of state 1, is two. Similarly, d(i) = 2 for the other states in Figure 4.2(a). For Figure 4.2(b), we have P11^n > 0 for n = 4, 8, 10, 12, . . . ; thus d(1) = 2, and it can be seen that d(i) = 2 for all the states. These examples suggest the following theorem.
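The period of a state can also be estimated mechanically as the gcd of the return times observed up to some horizon. The sketch below is illustrative only: the 4-state ring is an assumed stand-in for a chain like Figure 4.2(a), and the horizon of 2M² steps is a practical cutoff rather than a theoretical bound.

    import numpy as np
    from functools import reduce
    from math import gcd

    # A 4-state ring in which each state moves to one of its two neighbours with
    # probability 1/2 (assumed example; every state has period 2).
    M = 4
    P = np.zeros((M, M))
    for i in range(M):
        P[i, (i - 1) % M] = P[i, (i + 1) % M] = 0.5

    def period(P, i, horizon=None):
        """gcd of the walk lengths n with P_ii^n > 0, checked up to a finite horizon."""
        horizon = horizon or 2 * len(P) ** 2
        returns = []
        Pn = np.eye(len(P))
        for n in range(1, horizon + 1):
            Pn = Pn @ P
            if Pn[i, i] > 0:
                returns.append(n)
        return reduce(gcd, returns) if returns else None

    print([period(P, i) for i in range(M)])   # [2, 2, 2, 2]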
Theorem 4.2. For any Markov chain (with either a finite or countably infinite number of states), all states in the same class have the same period.

Proof: Let i and j be any distinct pair of states in a class. Then i ↔ j and there is some r such that Pij^r > 0 and some s such that Pji^s > 0. Since there is a walk of length r + s going from i to j and back to i, r + s must be divisible by d(i). Let t be any integer such that Pjj^t > 0. Since there is a walk of length r + t + s that goes first from i to j, then to j again, and then back to i, r + t + s is divisible by d(i), and thus t is divisible by d(i). Since this is true for any t such that Pjj^t > 0, d(j) is divisible by d(i). Reversing the roles of i and j, d(i) is divisible by d(j), so d(i) = d(j).

Since the states in a class all have the same period and are either all recurrent or all transient, we refer to the class itself as having the period of its states and as being recurrent or transient. Similarly, if a Markov chain has a single class of states, we refer to the chain as having the corresponding period and being recurrent or transient.

Theorem 4.3. If a recurrent class in a finite-state Markov chain has period d, then the states in the class can be partitioned into d subsets, S1, S2, . . . , Sd, such that all transitions out of subset Sm go to subset Sm+1 for m < d and to subset S1 for m = d. That is, if j ∈ Sm and Pjk > 0, then k ∈ Sm+1 for m < d and k ∈ S1 for m = d.

Figure 4.3: Structure of a Periodic Markov Chain with d = 3. Note that transitions only go from one subset Sm to the next subset Sm+1 (or from Sd to S1).

Proof: See Figure 4.3 for an illustration of the theorem. For a given state in the class, say state 1, define the sets S1, . . . , Sd by

    Sm = {j : P1j^(nd+m) > 0 for some n ≥ 0};    1 ≤ m ≤ d.    (4.6)

For each given j in the class, we first show that there is one and only one value of m such that j ∈ Sm. Since 1 ↔ j, there is some r for which P1j^r > 0 and some s for which Pj1^s > 0. Since there is a walk from 1 to 1 (through j) of length r + s, r + s is divisible by d. Define m, 1 ≤ m ≤ d, by r = m + nd, where n is an integer. From (4.6), j ∈ Sm. Now let r′ be any other integer such that P1j^r′ > 0. Then r′ + s is also divisible by d, so that r′ − r is divisible by d. Thus r′ = m + n′d for some integer n′ and that same m. Since r′ is any integer such that P1j^r′ > 0, j is in Sm for only that one value of m. Since j is arbitrary, this shows that the sets Sm are disjoint and partition the class.

Finally, suppose j ∈ Sm and Pjk > 0. Given a walk of length r = nd + m from state 1 to j, there is a walk of length nd + m + 1 from state 1 to k. It follows that if m < d, then k ∈ Sm+1 and if m = d, then k ∈ S1, completing the proof.

We have seen that each class of states (for a finite-state chain) can be classified both in terms of its period and in terms of whether or not it is recurrent. The most important case is that in which a class is both recurrent and aperiodic.

Definition 4.8. For a finite-state Markov chain, an ergodic class of states is a class that
is both recurrent and aperiodic.[4] A Markov chain consisting entirely of one ergodic class is called an ergodic chain.

[4] For Markov chains with a countably infinite state space, ergodic means that the states are positive-recurrent and aperiodic (see Chapter 5, Section 5.1).

We shall see later that these chains have the desirable property that Pij^n becomes independent of the starting state i as n → ∞. The next theorem establishes the first part of this by showing that Pij^n > 0 for all i and j when n is sufficiently large. The Markov chain in Figure 4.4 illustrates the theorem by illustrating how large n must be in the worst case.
Figure 4.4: An ergodic chain with M = 6 states in which Pij^m > 0 for all m > (M − 1)^2 and all i, j, but P11^((M−1)^2) = 0.

The figure also illustrates that an M state Markov chain must have a cycle with M − 1 or fewer nodes. To see this, note that an ergodic chain must have cycles, since each node must have a walk to itself, and any subcycle of repeated nodes can be omitted from that walk, converting it into a cycle. Such a cycle might have M nodes, but a chain with only an M node cycle would be periodic. Thus some nodes must be on smaller cycles, such as the cycle of length 5 in the figure.
Theorem 4.4. For an ergodic M state Markov chain, Pij^m > 0 for all i, j, and all m ≥ (M − 1)^2 + 1.
Proof*:[5] As shown in Figure 4.4, the chain must contain a cycle with fewer than M nodes. Let τ ≤ M − 1 be the number of nodes on a smallest cycle in the chain and let i be any given state on such a cycle. Define T(m), m ≥ 1, as the set of states accessible from the fixed state i in m steps. Thus T(1) = {j : Pij > 0}, and for arbitrary m ≥ 1,

    T(m) = {j : Pij^m > 0}.    (4.7)

[5] Proofs marked with an asterisk can be omitted without loss of continuity.
Since i is on a cycle of length τ, Pii^τ > 0. For any m ≥ 1 and any j ∈ T(m), we can then construct an m + τ step walk from i to j by going from i to i in τ steps and then to j in another m steps. This is true for all j ∈ T(m), so

    T(m) ⊆ T(m + τ).    (4.8)
By defining T(0) to be the singleton set {i}, (4.8) also holds for m = 0, since i ∈ T(τ). By starting with m = 0 and iterating on (4.8),

    T(0) ⊆ T(τ) ⊆ T(2τ) ⊆ · · · ⊆ T(nτ) ⊆ · · · .    (4.9)
We now show that if one of the inclusion relations in (4.9) is satisfied with equality, then all the subsequent relations are satisfied with equality. More generally, assume that T(m) = T(m + s) for some m ≥ 0 and s ≥ 1. Note that T(m + 1) is the set of states that can be reached in one step from states in T(m), and similarly T(m + s + 1) is the set reachable in one step from T(m + s) = T(m). Thus T(m + 1) = T(m + 1 + s). Iterating this result,

    T(m) = T(m + s) implies T(n) = T(n + s) for all n ≥ m.    (4.10)
Thus, (4.9) starts with strict inclusions and then continues with equalities. Since the entire set has M members, there can be at most M − 1 strict inclusions in (4.9). Thus

    T((M − 1)τ) = T(nτ) for all integers n ≥ M − 1.    (4.11)
Define k as (M − 1)τ. We can then rewrite (4.11) as

    T(k) = T(k + jτ) for all j ≥ 1.    (4.12)
We next show that T(k) consists of all M nodes in the chain. The central part of this is to show that T(k) = T(k + 1). Let t be any positive integer other than τ such that Pii^t > 0. Letting m = k in (4.8) and using t in place of τ,

    T(k) ⊆ T(k + t) ⊆ T(k + 2t) ⊆ · · · ⊆ T(k + τt).    (4.13)
Since T(k + τt) = T(k), this shows that

    T(k) = T(k + t).    (4.14)
Now let s be the smallest positive integer such that

    T(k) = T(k + s).    (4.15)
From (4.11), we see that (4.15) holds when s takes the value τ. Thus, the minimizing s must lie in the range 1 ≤ s ≤ τ. We will show that s = 1 by assuming s > 1 and establishing a contradiction. Since the chain is aperiodic, there is some t not divisible by s for which Pii^t > 0. This t can be represented by t = js + ℓ where 1 ≤ ℓ < s and j ≥ 0. Iterating (4.15), we get T(k) = T(k + js), and applying (4.10) to this,

    T(k + ℓ) = T(k + js + ℓ) = T(k + t) = T(k),

where we have used t = js + ℓ followed by (4.14). This is the desired contradiction, since ℓ < s. Thus s = 1 and T(k) = T(k + 1). Iterating this,

    T(k) = T(k + n) for all n ≥ 0.    (4.16)
Since the chain is ergodic, each state j continues to be accessible after k steps. Therefore j must be in T(k + n) for some n ≥ 0, which, from (4.16), implies that j ∈ T(k). Since j is arbitrary, T(k) must be the entire set of states. Thus Pij^n > 0 for all n ≥ k and all j. This same argument can be applied to any state i on the given cycle with τ nodes. Any state m not on this cycle has a path to the cycle using at most M − τ steps. Using this path to reach a node i on the cycle, and following this with all the walks from i of length k = (M − 1)τ, we see that

    Pmj^(M−τ+(M−1)τ) > 0    for all j, m.
The proof is complete, since M − τ + (M − 1)τ ≤ (M − 1)^2 + 1 for all τ, 1 ≤ τ ≤ M − 1, with equality when τ = M − 1. Figure 4.4 illustrates a situation where the bound (M − 1)^2 + 1 is met with equality. Note that there is one cycle of length M − 1 and the single node not on this cycle, node 1, is the unique starting node at which the bound is met with equality.
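The worst-case bound can be checked numerically. The exact arcs of Figure 4.4 are not reproduced in this text, so the sketch below assumes one chain consistent with the description: a cycle of length M − 1 = 5 through states 2, . . . , 6, with state 1 reached from state 6 and returning to the cycle through state 2. The particular probabilities are arbitrary, since only the zero/non-zero pattern matters.

    import numpy as np

    M = 6
    P = np.zeros((M, M))
    P[0, 1] = 1.0                       # state 1 -> state 2 (0-based indices in code)
    P[1, 2] = P[2, 3] = P[3, 4] = 1.0   # 2 -> 3 -> 4 -> 5
    P[4, 5] = 1.0                       # 5 -> 6
    P[5, 0] = P[5, 1] = 0.5             # 6 -> 1 and 6 -> 2

    def min_power_all_positive(P, limit=100):
        """Smallest m with every entry of [P]^m strictly positive."""
        Pm = np.eye(len(P))
        for m in range(1, limit + 1):
            Pm = Pm @ P
            if np.all(Pm > 0):
                return m
        return None

    print(min_power_all_positive(P))            # 26 = (M - 1)^2 + 1
    print(np.linalg.matrix_power(P, 25)[0, 0])  # 0.0: P_11^25 = 0, so 25 steps do not suffice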
4.3 The matrix representation
The matrix [P] of transition probabilities of a Markov chain is called a stochastic matrix; that is, a stochastic matrix is a square matrix of non-negative terms in which the elements in each row sum to 1. We first consider the n step transition probabilities Pij^n in terms of [P]. The probability of going from state i to state j in two steps is the sum over h of all possible two step walks, from i to h and from h to j. Using the Markov condition in (4.1),

    Pij^2 = Σ_{h=1}^{M} Pih Phj.
It can be seen that this is just the ij term of the product of matrix [P] with itself; denoting [P][P] as [P]^2, this means that Pij^2 is the (i, j) element of the matrix [P]^2. Similarly, Pij^n is the ij element of the nth power of the matrix [P]. Since [P]^(m+n) = [P]^m [P]^n, this means that

    Pij^(m+n) = Σ_{h=1}^{M} Pih^m Phj^n.    (4.17)
This is known as the Chapman-Kolmogorov equation. An efficient approach to compute [P]^n (and thus Pij^n) for large n is to multiply [P]^2 by [P]^2, then [P]^4 by [P]^4 and so forth, and then multiply these binary powers together as needed.

The matrix [P]^n (i.e., the matrix of transition probabilities raised to the nth power) is very important for a number of reasons. The i, j element of this matrix is Pij^n, which is the probability of being in state j at time n given state i at time 0. If memory of the past dies out with increasing n, then we would expect the dependence on both n and i to disappear in Pij^n. This means, first, that [P]^n should converge to a limit as n → ∞, and, second, that each row of [P]^n should tend to the same set of probabilities. If this convergence occurs (and we later determine the circumstances under which it occurs), [P]^n and [P]^(n+1) will be the same in the limit n → ∞, which means lim [P]^n = (lim [P]^n)[P]. If all the rows of lim [P]^n are the same, equal to some row vector π = (π1, π2, . . . , πM), this simplifies to π = π[P]. Since π is a probability vector (i.e., its components are the probabilities of being in the various states in the limit n → ∞), its components must be non-negative and sum to 1.
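The binary-power idea can be sketched as follows; the two-state matrix is a made-up example, and numpy's matrix_power is used only as an independent check.

    import numpy as np

    def matrix_power_by_squaring(P, n):
        """Compute [P]^n by repeated squaring, folding in the binary powers that n selects."""
        result = np.eye(len(P))
        square = P.copy()
        while n > 0:
            if n & 1:                 # this binary digit of n is 1: multiply in the current power
                result = result @ square
            square = square @ square  # [P]^2, [P]^4, [P]^8, ...
            n >>= 1
        return result

    P = np.array([[0.9, 0.1],
                  [0.2, 0.8]])        # hypothetical 2-state chain
    print(matrix_power_by_squaring(P, 50))
    print(np.linalg.matrix_power(P, 50))   # same result; both rows approach the same probabilities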
The steady-state probability vector is also often called a stationary distribution. If a probability vector π satisfying (4.18) is taken as the initial probability assignment of the chain at time 0, then thatPassigment is maintained forever. That is, if Pr{X0 =i} = πi for all i, then Pr{X1 =j} = i πi Pij = πj for all j, and, by induction, Pr{Xn = j} = πj for all j and all n > 0. If [P ]n converges as above, then, for each starting state, the steady-state distribution is reached asymptotically. There are a number of questions that must be answered for a steady-state distribution as defined above: 1. Does π = π [P ] always have a probability vector solution? 2. Does π = π [P ] have a unique probability vector solution? 3. Do the rows of [P ]n converge to a probability vector solution of π = π [P ]? We first give the answers to these questions for finite-state Markov chains and then derive them. First, (4.18) always has a solution (although this is not necessarily true for infinitestate chains). The answer to the second and third questions is simpler with the following definition:
Definition 4.10. A unichain is a finite-state Markov chain that contains a single recurrent class plus, perhaps, some transient states. An ergodic unichain is a unichain for which the recurrent class is ergodic.

A unichain, as we shall see, is the natural generalization of a recurrent chain to allow for some initial transient behavior without disturbing the long term asymptotic behavior of the underlying recurrent chain. The answer to the second question above is that the solution to (4.18) is unique iff [P] is the transition matrix of a unichain. If there are r recurrent classes, then π = π[P] has r linearly independent solutions. For the third question, each row of [P]^n converges to the unique solution of (4.18) if [P] is the transition matrix of an ergodic unichain. If there are multiple recurrent classes, but all of them are aperiodic, then [P]^n still converges, but to a matrix with non-identical rows. If the Markov chain has one or more periodic recurrent classes, then [P]^n does not converge.

We first look at these answers from the standpoint of matrix theory and then proceed in Chapter 5 to look at the more general problem of Markov chains with a countably infinite number of states. There we use renewal theory to answer these same questions (and to discover the differences that occur for infinite-state Markov chains). The matrix theory approach is useful computationally and also has the advantage of telling us something about rates of convergence. The approach using renewal theory is very simple (given an understanding of renewal processes), but is more abstract.
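As a small computational aside (an illustrative sketch, with an assumed 3-state ergodic chain), the steady-state vector of (4.18) can be found by solving the linear equations π[P] = π together with the normalization Σ_i πi = 1:

    import numpy as np

    P = np.array([[0.5, 0.5, 0.0],
                  [0.1, 0.6, 0.3],
                  [0.0, 0.4, 0.6]])   # hypothetical ergodic chain

    M = len(P)
    # pi [P] = pi is equivalent to (P^T - I) pi^T = 0; append sum(pi) = 1 as an extra
    # equation and solve by least squares (the system is consistent, so the residual is 0).
    A = np.vstack([P.T - np.eye(M), np.ones(M)])
    b = np.zeros(M + 1)
    b[-1] = 1.0
    pi = np.linalg.lstsq(A, b, rcond=None)[0]

    print(pi)            # steady-state probabilities, non-negative and summing to 1
    print(pi @ P - pi)   # approximately zero: pi [P] = pi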
4.3.1 The eigenvalues and eigenvectors of [P]
A convenient way of dealing with the nth power of a matrix is to find the eigenvalues and eigenvectors of the matrix.

Definition 4.11. The row vector π is a left eigenvector of [P] of eigenvalue λ if π ≠ 0 and π[P] = λπ. The column vector ν is a right eigenvector of eigenvalue λ if ν ≠ 0 and [P]ν = λν.

We first treat the special case of a Markov chain with two states. Here the eigenvalues and eigenvectors can be found by elementary (but slightly tedious) algebra. The eigenvector equations can be written out as

    π1 P11 + π2 P21 = λ π1        P11 ν1 + P12 ν2 = λ ν1
    π1 P12 + π2 P22 = λ π2        P21 ν1 + P22 ν2 = λ ν2        (4.19)

These equations have a non-zero solution iff the matrix [P − λI], where [I] is the identity matrix, is singular (i.e., there must be a non-zero ν for which [P − λI]ν = 0). Thus λ must be such that the determinant of [P − λI], namely (P11 − λ)(P22 − λ) − P12 P21, is equal to 0. Solving this quadratic equation in λ, we find that λ has two solutions, λ1 = 1 and λ2 = 1 − P12 − P21. Assume initially that P12 and P21 are not both 0. Then the solutions for the left and right eigenvectors, π^(1) and ν^(1) of λ1, and π^(2) and ν^(2) of λ2, are given by

    π1^(1) = P21/(P12+P21)    π2^(1) = P12/(P12+P21)    ν1^(1) = 1                 ν2^(1) = 1
    π1^(2) = 1                π2^(2) = −1               ν1^(2) = P12/(P12+P21)     ν2^(2) = −P21/(P12+P21).
These solutions contain an arbitrary normalization factor. Now let [Λ] be the diagonal matrix with λ1 and λ2 on the diagonal and let [U] be a matrix with columns ν^(1) and ν^(2). Then the two right eigenvector equations in (4.19) can be combined compactly as [P][U] = [U][Λ]. It turns out (given the way we have normalized the eigenvectors) that the inverse of [U] is just the matrix whose rows are the left eigenvectors of [P] (this can be verified by direct calculation, and we show later that any right eigenvector of one eigenvalue must be orthogonal to any left eigenvector of another eigenvalue). We then see that [P] = [U][Λ][U]^−1 and consequently [P]^n = [U][Λ]^n [U]^−1. Multiplying this out, we get

    [P]^n = [ π1 + π2 λ2^n    π2 − π2 λ2^n ]
            [ π1 − π1 λ2^n    π2 + π1 λ2^n ]      where π1 = P21/(P12 + P21), π2 = 1 − π1.
Recalling that λ2 = 1 − P12 − P21, we see that |λ2| ≤ 1. If P12 = P21 = 0, then λ2 = 1 so that [P] and [P]^n are simply identity matrices. If P12 = P21 = 1, then λ2 = −1 so that [P]^n alternates between the identity matrix for n even and [P] for n odd. In all other cases, |λ2| < 1 and [P]^n approaches the matrix whose rows are both equal to π.

Parts of this special case generalize to an arbitrary finite number of states. In particular, λ = 1 is always an eigenvalue and the vector e whose components are all equal to 1 is always a right eigenvector of λ = 1 (this follows immediately from the fact that each row of a stochastic matrix sums to 1). Unfortunately, not all stochastic matrices can be represented in the form [P] = [U][Λ][U]^−1 (since M independent right eigenvectors need not exist; see Exercise 4.9). In general, the diagonal matrix of eigenvalues in [P] = [U][Λ][U]^−1 must be replaced by something called a Jordan form, which does not easily lead us to the desired results. In what follows, we develop the powerful Perron and Frobenius theorems, which are useful in their own right and also provide the necessary results about [P]^n in general.
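A quick numerical check of the two-state decomposition [P] = [U][Λ][U]^−1 and of the closed form for [P]^n above; the values of P12 and P21 are arbitrary choices for illustration.

    import numpy as np

    P12, P21 = 0.3, 0.1
    P = np.array([[1 - P12, P12],
                  [P21, 1 - P21]])

    lam2 = 1 - P12 - P21
    pi1 = P21 / (P12 + P21)
    pi2 = 1 - pi1

    # Columns of [U] are the right eigenvectors nu^(1) and nu^(2); with this
    # normalization, the rows of [U]^{-1} are the left eigenvectors pi^(1) and pi^(2).
    U = np.array([[1.0, P12 / (P12 + P21)],
                  [1.0, -P21 / (P12 + P21)]])
    Lam = np.diag([1.0, lam2])

    n = 7
    Pn_eig = U @ np.linalg.matrix_power(Lam, n) @ np.linalg.inv(U)
    Pn_closed = np.array([[pi1 + pi2 * lam2**n, pi2 - pi2 * lam2**n],
                          [pi1 - pi1 * lam2**n, pi2 + pi1 * lam2**n]])

    print(np.allclose(Pn_eig, np.linalg.matrix_power(P, n)))     # True
    print(np.allclose(Pn_closed, np.linalg.matrix_power(P, n)))  # True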
4.4 Perron-Frobenius theory
A real vector x (i.e., a vector with real components) is defined to be positive, denoted x > 0, if xi > 0 for each component i. A real matrix [A] is positive, denoted [A] > 0, if Aij > 0 for each i, j. Similarly, x is non-negative, denoted x ≥ 0, if xi ≥ 0 for all i. [A] is non-negative, denoted [A] ≥ 0, if Aij ≥ 0 for all i, j. Note that it is possible to have x ≥ 0 and x ≠ 0 without having x > 0, since x > 0 means that all components of x are positive and x ≥ 0, x ≠ 0 means that at least one component of x is positive and all are non-negative. Next, x > y and y < x both mean x − y > 0. Similarly, x ≥ y and y ≤ x mean x − y ≥ 0. The corresponding matrix inequalities have corresponding meanings.

We start by looking at the eigenvalues and eigenvectors of positive square matrices. In what follows, when we assert that a matrix, vector, or number is positive or non-negative, we implicitly mean that it is real also. We will prove Perron's theorem, which is the critical result for dealing with positive matrices. We then generalize Perron's theorem to the Frobenius theorem, which treats a class of non-negative matrices called irreducible matrices. We finally specialize the results to stochastic matrices.
Perron's theorem shows that a square positive matrix [A] always has a positive eigenvalue λ that exceeds the magnitude of all other eigenvalues. It also shows that this λ has a right eigenvector ν that is positive and unique within a scale factor. It establishes these results by relating λ to the following frequently useful optimization problem. For a given square matrix [A] > 0, and for any non-zero vector[6] x ≥ 0, let g(x) be the largest real number a for which ax ≤ [A]x. Let λ be defined by

    λ = sup_{x≠0, x≥0} g(x).    (4.20)

We can express g(x) explicitly by rewriting ax ≤ [A]x as a xi ≤ Σ_j Aij xj for all i. Thus, the largest a for which this is satisfied is

    g(x) = min_i gi(x)    where gi(x) = (Σ_j Aij xj) / xi.    (4.21)

Since [A] > 0, x ≥ 0 and x ≠ 0, it follows that the numerator Σ_j Aij xj is positive for all i. Thus gi(x) is positive for xi > 0 and infinite for xi = 0, so g(x) > 0. It is shown in Exercise 4.10 that g(x) is a continuous function of x over x ≠ 0, x ≥ 0 and that the supremum in (4.20) is actually achieved as a maximum.

[6] Note that the set of nonzero vectors x for which x ≥ 0 is different from the set {x > 0} in that the former allows some xi to be zero, whereas the latter requires all xi to be positive.

Theorem 4.5 (Perron). Let [A] > 0 be a M by M matrix, let λ > 0 be given by (4.20) and (4.21), and let ν be a vector x that maximizes (4.20). Then

1. λν = [A]ν and ν > 0.
2. For any other eigenvalue µ of [A], |µ| < λ.
3. If x satisfies λx = [A]x, then x = βν for some (possibly complex) number β.

Discussion: Property (1) asserts not only that the solution λ of the optimization problem is an eigenvalue of [A], but also that the optimizing vector ν is an eigenvector and is strictly positive. Property (2) says that λ is strictly greater than the magnitude of any other eigenvalue, and thus we refer to it in what follows as the largest eigenvalue of [A]. Property (3) asserts that the eigenvector ν is unique (within a scale factor), not only among positive vectors but among all (possibly complex) vectors.

Proof* Property 1: We are given that

    λ = g(ν) ≥ g(x)    for each x ≥ 0, x ≠ 0.    (4.22)

We must show that λν = [A]ν, i.e., that λνi = Σ_j Aij νj for each i, or equivalently that

    λ = g(ν) = gi(ν) = (Σ_j Aij νj) / νi    for each i.    (4.23)

Thus we want to show that the minimum in (4.21) is achieved by each i, 1 ≤ i ≤ M. To show this, we assume the contrary and demonstrate a contradiction. Thus, suppose that
g(ν) < gk(ν) for some k. Let ek be the kth unit vector and let ε be a small positive number. The contradiction will be to show that g(ν + εek) > g(ν) for small enough ε, thus violating (4.22). For i ≠ k,

    gi(ν + εek) = (Σ_j Aij νj + εAik) / νi > (Σ_j Aij νj) / νi.    (4.24)

gk(ν + εek), on the other hand, is continuous in ε as ε increases from 0 and thus remains greater than g(ν) for small enough ε. This shows that g(ν + εek) > g(ν), completing the contradiction. This also shows that νk must be greater than 0 for each k.

Property 2: Let µ be any eigenvalue of [A]. Let x ≠ 0 be a right eigenvector (perhaps complex) for µ. Taking the magnitude of each side of µx = [A]x, we get the following for each component i

    |µ||xi| = |Σ_j Aij xj| ≤ Σ_j Aij |xj|.    (4.25)

Let u = (|x1|, |x2|, . . . , |xM|), so (4.25) becomes |µ|u ≤ [A]u. Since u ≥ 0, u ≠ 0, it follows from the definition of g(x) that |µ| ≤ g(u). From (4.20), g(u) ≤ λ, so |µ| ≤ λ.

Next assume that |µ| = λ. From (4.25), then, λu ≤ [A]u, so u achieves the maximization in (4.20) and part 1 of the theorem asserts that λu = [A]u. This means that (4.25) is satisfied with equality, and it follows from this (see Exercise 4.11) that x = βu for some (perhaps complex) scalar β. Thus x is an eigenvector of λ, and µ = λ. Thus |µ| = λ is impossible for µ ≠ λ, so λ > |µ| for all eigenvalues µ ≠ λ.

Property 3: Let x be any eigenvector of λ. Property 2 showed that x = βu where ui = |xi| for each i and u is a non-negative eigenvector of eigenvalue λ. Since ν > 0, we can choose α > 0 so that ν − αu ≥ 0 and νi − αui = 0 for some i. Now ν − αu is either identically 0 or else an eigenvector of eigenvalue λ, and thus strictly positive. Since νi − αui = 0 for some i, ν − αu = 0. Thus u and x are scalar multiples of ν, completing the proof.

Next we apply the results above to a more general type of non-negative matrix called an irreducible matrix. Recall that we analyzed the classes of a finite-state Markov chain in terms of a directed graph where the nodes represent the states of the chain and a directed arc goes from i to j if Pij > 0. We can draw the same type of directed graph for an arbitrary non-negative matrix [A]; i.e., a directed arc goes from i to j if Aij > 0.

Definition 4.12. An irreducible matrix is a non-negative matrix such that for every pair of nodes i, j in its graph, there is a walk from i to j.

For stochastic matrices, an irreducible matrix is thus the matrix of a recurrent Markov chain. If we denote the i, j element of [A]^n by Aij^n, then we see that Aij^n > 0 iff there is a walk of length n from i to j in the graph. If [A] is irreducible, a walk exists from any i to any j (including j = i) with length at most M, since the walk need visit each other node at most once. Thus Aij^n > 0 for some n, 1 ≤ n ≤ M, and Σ_{n=1}^{M} Aij^n > 0. The key to analyzing irreducible matrices is then the fact that the matrix [B] = Σ_{n=1}^{M} [A]^n is strictly positive.
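As an illustrative sketch (the 2 by 2 positive matrix is a made-up example), the characterization in (4.20) and (4.21) can be explored numerically: power iteration pulls a positive starting vector toward the Perron eigenvector, and evaluating g(x) at the iterate approaches the largest eigenvalue reported by a general-purpose eigensolver.

    import numpy as np

    A = np.array([[2.0, 1.0],
                  [1.0, 3.0]])   # a made-up positive matrix

    def g(A, x):
        """g(x) = min_i (sum_j A_ij x_j) / x_i, as in (4.21), for a vector x > 0."""
        return np.min((A @ x) / x)

    # Power iteration: repeated multiplication by [A] turns any positive starting
    # vector toward the Perron eigenvector, and g(x) approaches lambda.
    x = np.ones(len(A))
    for _ in range(50):
        x = A @ x
        x = x / x.sum()

    print(g(A, x))                          # approximately lambda
    print(max(np.linalg.eigvals(A).real))   # same value from a general eigensolver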
Theorem 4.6 (Frobenius). Let [A] ≥ 0 be a M by M irreducible matrix and let λ be the supremum in (4.20) and (4.21). Then the supremum is achieved as a maximum at some vector ν and the pair λ, ν have the following properties:

1. λν = [A]ν and ν > 0.
2. For any other eigenvalue µ of [A], |µ| ≤ λ.
3. If x satisfies λx = [A]x, then x = βν for some (possibly complex) number β.

Discussion: Note that this is almost the same as the Perron theorem, except that [A] is irreducible (but not necessarily positive), and the magnitudes of the other eigenvalues need not be strictly less than λ. When we look at recurrent matrices of period d, we shall find that there are d − 1 other eigenvalues of magnitude equal to λ. Because of the possibility of other eigenvalues with the same magnitude as λ, we refer to λ as the largest real eigenvalue of [A].

Proof* Property 1: We first establish property 1 for a particular choice of λ and ν and then show that this choice satisfies the optimization problem in (4.20) and (4.21). Let [B] = Σ_{n=1}^{M} [A]^n > 0. Using Theorem 4.5, we let λB be the largest eigenvalue of [B] and let ν > 0 be the corresponding right eigenvector. Then [B]ν = λB ν. Also, since [B][A] = [A][B], we have [B]{[A]ν} = [A][B]ν = λB [A]ν. Thus [A]ν is a right eigenvector for eigenvalue λB of [B] and thus equal to ν multiplied by some positive scale factor. Define this scale factor to be λ, so that [A]ν = λν and λ > 0. We can relate λ to λB by [B]ν = Σ_{n=1}^{M} [A]^n ν = (λ + · · · + λ^M)ν. Thus λB = λ + · · · + λ^M.
Next, for any non-zero x ≥ 0, let g > 0 be the largest number such that [A]x ≥ gx. Multiplying both sides of this by [A], we see that [A]^2 x ≥ g[A]x ≥ g^2 x. Similarly, [A]^i x ≥ g^i x for each i ≥ 1, so it follows that [B]x ≥ (g + g^2 + · · · + g^M)x. From the optimization property of λB in Theorem 4.5, this shows that λB ≥ g + g^2 + · · · + g^M. Since λB = λ + λ^2 + · · · + λ^M, we conclude that λ ≥ g, showing that λ, ν solve the optimization problem for [A] in (4.20) and (4.21).
Properties 2 and 3: The first half of the proof of property 2 in Theorem 4.5 applies here also to show that |µ| ≤ λ for all eigenvalues µ of [A]. Finally, let x be an arbitrary vector satisfying [A]x = λx. Then, from the argument above, x is also a right eigenvector of [B] with eigenvalue λB, so from Theorem 4.5, x must be a scalar multiple of ν, completing the proof.

Corollary 4.1. The largest real eigenvalue λ of an irreducible matrix [A] ≥ 0 has a positive left eigenvector π. π is the unique left eigenvector of λ (within a scale factor) and is the only non-negative non-zero vector (within a scale factor) that satisfies λπ ≤ π[A].

Proof: A left eigenvector of [A] is a right eigenvector (transposed) of [A]^T. The graph corresponding to [A]^T is the same as that for [A] with all the arc directions reversed, so that all pairs of nodes still communicate and [A]^T is irreducible. Since [A] and [A]^T have the same eigenvalues, the corollary is just a restatement of the theorem.
Corollary 4.2. Let λ be the largest real eigenvalue of an irreducible matrix and let the right and left eigenvectors of λ be ν > 0 and π > 0. Then, within a scale factor, ν is the only non-negative right eigenvector of [A] (i.e., no other eigenvalues have non-negative eigenvectors). Similarly, within a scale factor, π is the only non-negative left eigenvector of [A].

Proof: Theorem 4.6 asserts that ν is the unique right eigenvector (within a scale factor) of the largest real eigenvalue λ, so suppose that u is a right eigenvector of some other eigenvalue µ. Letting π be the left eigenvector of λ, we have π[A]u = λπu and also π[A]u = µπu. Thus πu = 0. Since π > 0, u cannot be non-negative and non-zero. The same argument shows the uniqueness of π.

Corollary 4.3. Let [P] be a stochastic irreducible matrix (i.e., the matrix of a recurrent Markov chain). Then λ = 1 is the largest real eigenvalue of [P], e = (1, 1, . . . , 1)^T is the right eigenvector of λ = 1, unique within a scale factor, and there is a unique probability vector π > 0 that is a left eigenvector of λ = 1.

Proof: Since each row of [P] adds up to 1, [P]e = e. Corollary 4.2 asserts the uniqueness of e and the fact that λ = 1 is the largest real eigenvalue, and Corollary 4.1 asserts the uniqueness of π.

The proof above shows that every stochastic matrix, whether irreducible or not, has an eigenvalue λ = 1 with e = (1, . . . , 1)^T as a right eigenvector. In general, a stochastic matrix with r recurrent classes has r independent non-negative right eigenvectors and r independent non-negative left eigenvectors; the left eigenvectors can be taken as the steady-state probability vectors within the r recurrent classes (see Exercise 4.14). The following corollary, proved in Exercise 4.13, extends Corollary 4.3 to unichains.

Corollary 4.4. Let [P] be the transition matrix of a unichain. Then λ = 1 is the largest real eigenvalue of [P], e = (1, 1, . . . , 1)^T is the right eigenvector of λ = 1, unique within a scale factor, and there is a unique probability vector π ≥ 0 that is a left eigenvector of λ = 1; πi > 0 for each recurrent state i and πi = 0 for each transient state.

Corollary 4.5. The largest real eigenvalue λ of an irreducible matrix [A] ≥ 0 is a strictly increasing function of each component of [A].

Proof: For a given irreducible [A], let [B] satisfy [B] ≥ [A], [B] ≠ [A]. Let λ be the largest real eigenvalue of [A] and ν > 0 be the corresponding right eigenvector. Then λν = [A]ν ≤ [B]ν, but λν ≠ [B]ν. Let µ be the largest real eigenvalue of [B], which is also irreducible. If µ ≤ λ, then µν ≤ λν ≤ [B]ν, and µν ≠ [B]ν, which is a contradiction of property 1 in Theorem 4.6. Thus, µ > λ.

We are now ready to study the asymptotic behavior of [A]^n. The simplest and cleanest result holds for [A] > 0. We establish this in the following corollary and then look at the case of greatest importance, that of a stochastic matrix for an ergodic Markov chain. More general cases are treated in Exercises 4.13 and 4.14.
Corollary 4.6. Let λ be the largest eigenvalue of [A] > 0 and let π and ν be the positive left and right eigenvectors of λ, normalized so that πν = 1. Then

    lim_{n→∞} [A]^n / λ^n = νπ.    (4.26)
Proof*: Since ν > 0 is a column vector and π > 0 is a row vector, νπ is a positive matrix of the same dimension as [A]. Since [A] > 0, we can define a matrix [B] = [A] − ανπ which is positive for small enough α > 0. Note that π and ν are left and right eigenvectors of [B] with eigenvalue µ = λ − α. We then have µ^n ν = [B]^n ν, which when pre-multiplied by π yields

    (λ − α)^n = π [B]^n ν = Σ_i Σ_j πi Bij^n νj,

where Bij^n is the i, j element of [B]^n. Since each term in the above summation is positive, we have (λ − α)^n ≥ πi Bij^n νj, and therefore Bij^n ≤ (λ − α)^n / (πi νj). Thus, for each i, j, lim_{n→∞} Bij^n λ^{−n} = 0, and therefore lim_{n→∞} [B]^n λ^{−n} = 0. Next we use a convenient matrix identity: for any eigenvalue λ of a matrix [A], and any corresponding right and left eigenvectors ν and π, normalized so that πν = 1, we have {[A] − λνπ}^n = [A]^n − λ^n νπ (see Exercise 4.12). Applying the same identity to [B], we have {[B] − µνπ}^n = [B]^n − µ^n νπ. Finally, since [B] = [A] − ανπ, we have [B] − µνπ = [A] − λνπ, so that

    [A]^n − λ^n νπ = [B]^n − µ^n νπ.    (4.27)
Dividing both sides of (4.27) by λ^n and taking the limit of both sides of (4.27) as n → ∞, the right hand side goes to 0, completing the proof.

Note that for a stochastic matrix [P] > 0, this corollary simplifies to lim_{n→∞} [P]^n = eπ. This means that lim_{n→∞} Pij^n = πj, which means that the probability of being in state j after a long time is πj, independent of the starting state.

Theorem 4.7. Let [P] be the transition matrix of an ergodic finite-state Markov chain. Then λ = 1 is the largest real eigenvalue of [P], and λ > |µ| for every other eigenvalue µ. Furthermore, lim_{n→∞} [P]^n = eπ, where π > 0 is the unique probability vector satisfying π[P] = π and e = (1, 1, . . . , 1)^T is the unique vector ν (within a scale factor) satisfying [P]ν = ν.

Proof: From Corollary 4.3, λ = 1 is the largest real eigenvalue of [P], e is the unique (within a scale factor) right eigenvector of λ = 1, and there is a unique probability vector π such that π[P] = π. From Theorem 4.4, [P]^m is positive for sufficiently large m. Since [P]^m is also stochastic, λ = 1 is strictly larger than the magnitude of any other eigenvalue of [P]^m. Let µ be any other eigenvalue of [P] and let x be a right eigenvector of µ. Note that x is also a right eigenvector of [P]^m with eigenvalue (µ)^m. Since λ = 1 is the only eigenvalue of [P]^m of magnitude 1 or more, we either have |µ| < λ or (µ)^m = λ. If (µ)^m = λ, then x must be a scalar times e. This is impossible, since x cannot be an eigenvector of [P] with both eigenvalue λ and µ. Thus |µ| < λ. Similarly, π > 0 is the unique left eigenvector of [P]^m with eigenvalue λ = 1, and πe = 1. Corollary 4.6 then asserts that
lim_{n→∞} [P]^(mn) = eπ. Multiplying by [P]^i for any i, 1 ≤ i < m, we get lim_{n→∞} [P]^(mn+i) = eπ, so lim_{n→∞} [P]^n = eπ.

Theorem 4.7 generalizes easily to an ergodic unichain (see Exercise 4.15). In this case, as one might suspect, πi = 0 for each transient state i and πi > 0 within the ergodic class. Theorem 4.7 becomes:

Theorem 4.8. Let [P] be the transition matrix of an ergodic unichain. Then λ = 1 is the largest real eigenvalue of [P], and λ > |µ| for every other eigenvalue µ. Furthermore,

    lim_{m→∞} [P]^m = eπ,    (4.28)
where π ≥ 0 is the unique probability vector satisfying π[P] = π and e = (1, 1, . . . , 1)^T is the unique ν (within a scale factor) satisfying [P]ν = ν.

If a chain has a periodic recurrent class, [P]^m never converges. The existence of a unique probability vector solution to π[P] = π for a periodic recurrent chain is somewhat mystifying at first. If the period is d, then the steady-state vector π assigns probability 1/d to each of the d subsets of Theorem 4.3. If the initial probabilities for the chain are chosen as Pr{X0 = i} = πi for each i, then for each subsequent time n, Pr{Xn = i} = πi. What is happening is that this initial probability assignment starts the chain in each of the d subsets with probability 1/d, and subsequent transitions maintain this randomness over subsets. On the other hand, [P]^n cannot converge because Pii^n, for each i, is zero except when n is a multiple of d. Thus the memory of starting state never dies out. An ergodic Markov chain does not have this peculiar property, and the memory of the starting state dies out (from Theorem 4.7).

The intuition to be associated with the word ergodic is that of a process in which time-averages are equal to ensemble-averages. Using the general definition of ergodicity (which is beyond our scope here), a periodic recurrent Markov chain in steady-state (i.e., with Pr{Xn = i} = πi for all n and i) is ergodic. Thus the notion of ergodicity for Markov chains is slightly different than that in the general theory. The difference is that we think of a Markov chain as being specified without specifying the initial state distribution, and thus different initial state distributions really correspond to different stochastic processes. If a periodic Markov chain starts in steady state, then the corresponding stochastic process is stationary, and otherwise not.
4.5 Markov chains with rewards
Suppose that each state i in a Markov chain is associated with some reward, ri . As the Markov chain proceeds from state to state, there is an associated sequence of rewards that are not independent, but are related by the statistics of the Markov chain. The situation is similar to, but simpler than, that of renewal-reward processes. As with renewal-reward processes, the reward ri could equally well be a cost or an arbitrary real valued function of the state. In this section, the expected value of the aggregate reward over time is analyzed.
The model of Markov chains with rewards is surprisingly broad. We have already seen that almost any stochastic process can be approximated by a Markov chain. Also, as we saw in studying renewal theory, the concept of rewards is quite graphic not only in modeling such things as corporate profits or portfolio performance, but also for studying residual life, queueing delay, and many other phenomena.

In Section 4.6, we shall study Markov decision theory, or dynamic programming. This can be viewed as a generalization of Markov chains with rewards in the sense that there is a "decision maker" or "policy maker" who in each state can choose between several different policies; for each policy, there is a given set of transition probabilities to the next state and a given expected reward for the current state. Thus the decision maker must make a compromise between the expected reward of a given policy in the current state (i.e., the immediate reward) and the long term benefit from the next state to be entered. This is a much more challenging problem than the current study of Markov chains with rewards, but a thorough understanding of the current problem provides the machinery to understand Markov decision theory also.

Frequently it is more natural to associate rewards with transitions rather than states. If rij denotes the reward associated with a transition from i to j and Pij denotes the corresponding transition probability, then ri = Σ_j Pij rij is the expected reward associated with a transition from state i. Since we analyze only expected rewards here, and since the effect of transition rewards rij are summarized into the state rewards ri = Σ_j Pij rij, we henceforth ignore transition rewards and consider only state rewards. The steady-state expected reward per unit time, assuming a single recurrent class of states, is easily seen to be g = Σ_i πi ri where πi is the steady-state probability of being in state i.

The following examples demonstrate that it is also important to understand the transient behavior of rewards. This transient behavior will turn out to be even more important when we study Markov decision theory and dynamic programming.

Example 4.5.1 (Expected first-passage time). A common problem when dealing with Markov chains is that of finding the expected number of steps, starting in some initial state, before some given final state is entered. Since the answer to this problem does not depend on what happens after the given final state is entered, we can modify the chain to convert the given final state, say state 1, into a trapping state (a trapping state i is a state from which there is no exit, i.e., for which Pii = 1). That is, we set P11 = 1, P1j = 0 for all j ≠ 1, and leave Pij unchanged for all i ≠ 1 and all j (see Figure 4.5).
Figure 4.5: The conversion of a four state Markov chain into a chain for which state 1 is a trapping state. Note that the outgoing arcs from node 1 have been removed.
Let v_i be the expected number of steps to reach state 1 starting in state i ≠ 1. This number of steps includes the first step plus the expected number of steps from whatever state is entered next (which is 0 if state 1 is entered next). Thus, for the chain in Figure 4.5, we have the equations

v_2 = 1 + P_{23} v_3 + P_{24} v_4
v_3 = 1 + P_{32} v_2 + P_{33} v_3 + P_{34} v_4
v_4 = 1 + P_{42} v_2 + P_{43} v_3 .

For an arbitrary chain of M states where 1 is a trapping state and all other states are transient, this set of equations becomes

v_i = 1 + \sum_{j≠1} P_{ij} v_j ;     i ≠ 1.        (4.29)
If we define r_i = 1 for i ≠ 1 and r_i = 0 for i = 1, then r_i is a unit reward for not yet entering the trapping state, and v_i is the expected aggregate reward before entering the trapping state. Thus by taking r_1 = 0, the reward ceases upon entering the trapping state, and v_i is the expected transient reward, i.e., the expected first-passage time from state i to state 1. Note that in this example, rewards occur only in transient states. Since transient states have zero steady-state probabilities, the steady-state gain per unit time, g = \sum_i π_i r_i, is 0. If we define v_1 = 0, then (4.29), along with v_1 = 0, has the vector form

v = r + [P]v ;     v_1 = 0.        (4.30)
For a Markov chain with M states, (4.29) is a set of M − 1 equations in the M − 1 variables v_2 to v_M. The equation v = r + [P]v is a set of M linear equations, of which the first is the vacuous equation v_1 = 0 + v_1, and, with v_1 = 0, the last M − 1 correspond to (4.29). It is not hard to show that (4.30) has a unique solution for v under the condition that states 2 to M are all transient states and 1 is a trapping state, but we prove this later, in Lemma 4.1, under more general circumstances.

Example 4.5.2. Assume that a Markov chain has M states, {0, 1, . . . , M − 1}, and that the state represents the number of customers in an integer time queueing system. Suppose we wish to find the expected sum of the times all customers spend in the system, starting at an integer time where i customers are in the system, and ending at the first instant when the system becomes idle. From our discussion of Little's theorem in Section 3.6, we know that this sum of times is equal to the sum of the number of customers in the system, summed over each integer time from the initial time with i customers to the final time when the system becomes empty. As in the previous example, we modify the Markov chain to make state 0 a trapping state. We take r_i = i as the “reward” in state i, and v_i as the expected aggregate reward until the trapping state is entered. Using the same reasoning as in the previous example, v_i is equal to the immediate “reward” r_i = i plus the expected reward from whatever state is entered next. Thus v_i = r_i + \sum_{j≥1} P_{ij} v_j. With v_0 = 0, this is v = r + [P]v. This has a unique solution for v, as will be shown later in Lemma 4.1. This same analysis is valid for any choice of reward r_i for each transient state i; the reward in the trapping state must be 0 so as to keep the expected aggregate reward finite.
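The computation in Examples 4.5.1 and 4.5.2 is just a linear solve. The following minimal Python/numpy sketch (an illustration only; the particular transition probabilities below are invented, since Figure 4.5 leaves them unspecified) solves v = r + [P]v with v_1 = 0 for a four-state chain in which state 1 has been converted into a trapping state:

import numpy as np

# Hypothetical transition matrix for a 4-state chain in which state 1
# (index 0) is a trapping state, as in Figure 4.5; numbers are illustrative.
P = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.2, 0.0, 0.5, 0.3],
              [0.0, 0.3, 0.3, 0.4],
              [0.5, 0.3, 0.2, 0.0]])

r = np.array([0.0, 1.0, 1.0, 1.0])   # unit reward until trapping, r_1 = 0

# Solve v = r + [P] v with v_1 = 0, i.e., (I - [P]) v = r over states 2..M.
M = len(r)
v = np.zeros(M)
v[1:] = np.linalg.solve(np.eye(M - 1) - P[1:, 1:], r[1:])
print(v)    # v_i = expected number of steps from state i to state 1

Replacing r by any other reward vector with r_1 = 0 (for instance r_i = i as in Example 4.5.2) changes only the right-hand side of the solve.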
In the above examples, the Markov chain has a trapping state with zero gain, so the expected gain is essentially a transient phenomenon until entering the trapping state. We now look at the more general case of a unichain, i.e., a chain with a single recurrent class, possibly along with some transient states. In this more general case, there can be some average gain per unit time, along with some transient gain depending on the initial state. We first look at the aggregate gain over a finite number of time units, thus providing a clean way of going to the limit.

Example 4.5.3. The example in Figure 4.6 provides some intuitive appreciation for the general problem. Note that the chain tends to persist in whatever state it is in for a relatively long time. Thus if the chain starts in state 2, not only is an immediate reward of 1 achieved, but there is a high probability of an additional gain of 1 on many successive transitions. Thus the aggregate value of starting in state 2 is considerably more than the immediate reward of 1. On the other hand, we see from symmetry that the expected gain per unit time, over a long time period, must be one half.
Figure 4.6: Markov chain with rewards. There are two states, with r_1 = 0 and r_2 = 1; each state is held with probability 0.99 and left with probability 0.01 (P_{11} = P_{22} = 0.99, P_{12} = P_{21} = 0.01).

Returning to the general case, it is convenient to work backward from a final time rather than forward from the initial time. This will be quite helpful later when we consider dynamic programming and Markov decision theory. For any final time m, define stage n as n time units before the final time, i.e., as time m − n in Figure 4.7. Equivalently, we often view the final time as time 0, and then stage n corresponds to time −n.
Figure 4.7: Alternate views of stages; time m − n (equivalently, time −n) corresponds to stage n.

As a final generalization of the problem (which will be helpful in the solution), we allow the reward at the final time (i.e., in stage 0) to be different from that at other times. The final reward in state i is denoted u_i, and u = (u_1, . . . , u_M)^T. We denote the expected aggregate reward from stage n up to and including the final stage (stage zero), given state i at stage n, as v_i(n, u). Note that the notation here is taking advantage of the Markov property. That is, given that the chain is in state i at time −n (i.e., stage n), the expected aggregate reward up to and including time 0 is independent of the states before time −n and is independent of when the Markov chain started prior to time −n.
The expected aggregate reward can be found by starting at stage 1. Given that the chain is in state i at time −1, the immediate reward is r_i. The chain then makes a transition (with probability P_{ij}) to some state j at time 0 with a final reward of u_j. Thus

v_i(1, u) = r_i + \sum_j P_{ij} u_j .        (4.31)
For the example of Figure 4.6 (assuming the final reward is the same as that at the other stages, i.e., u_i = r_i for i = 1, 2), we have v_1(1, u) = 0.01 and v_2(1, u) = 1.99. The expected aggregate reward for stage 2 can be calculated in the same way. Given state i at time −2 (i.e., stage 2), there is an immediate reward of r_i and, with probability P_{ij}, the chain goes to state j at time −1 (i.e., stage 1) with an expected additional gain of v_j(1, u). Thus

v_i(2, u) = r_i + \sum_j P_{ij} v_j(1, u).        (4.32)
Note that v_j(1, u), as calculated in (4.31), includes the gain in stages 1 and 0, and does not depend on how state j was entered. Iterating the above argument to stage 3, 4, . . . , n,

v_i(n, u) = r_i + \sum_j P_{ij} v_j(n−1, u).        (4.33)
This can be written in vector form as

v(n, u) = r + [P]v(n−1, u);     n ≥ 1,        (4.34)
where r is a column vector with components r_1, r_2, . . . , r_M and v(n, u) is a column vector with components v_1(n, u), . . . , v_M(n, u). By substituting (4.34), with n replaced by n − 1, into the last term of (4.34),

v(n, u) = r + [P]r + [P]^2 v(n−2, u);     n ≥ 2.        (4.35)
Applying the same substitution recursively, we eventually get an explicit expression for v(n, u),

v(n, u) = r + [P]r + [P]^2 r + · · · + [P]^{n−1} r + [P]^n u.        (4.36)
Eq. (4.34), applied iteratively, is more convenient for calculating v(n, u) than (4.36), but neither gives us much insight into the behavior of the expected aggregate reward, especially for large n. We can get a little insight by averaging the components of (4.36) over the steady-state probability vector π. Since π[P]^m = π for all m and πr is, by definition, the steady-state gain per stage g, this gives us

π v(n, u) = ng + π u.        (4.37)
This result is not surprising, since when the chain starts in steady-state at stage n, it remains in steady-state, yielding a gain per stage of g until the final reward at stage 0. For the example of Figure 4.6 (again assuming u = r ), Figure 4.8 tabulates this steady
state expected aggregate gain and compares it with the expected aggregate gain v_i(n, u) for initial states 1 and 2. Note that v_1(n, u) is always less than the steady-state average by an amount approaching 25 with increasing n. Similarly, v_2(n, u) is greater than the average by the corresponding amount. In other words, for this example, v_i(n, u) − π v(n, u), for each state i, approaches a limit as n → ∞. This limit is called the asymptotic relative gain for starting in state i, relative to starting in steady state. In what follows, we shall see that this type of asymptotic behavior is quite general.

    n      π v(n, r)    v_1(n, r)    v_2(n, r)
    1          1           0.01         1.99
    2          1.5         0.0298       2.9702
    4          2.5         0.098        4.902
   10          5.5         0.518       10.482
   40         20.5         6.420       34.580
  100         50.5        28.749       72.250
  400        200.5       175.507      225.492
 1000        500.5       475.500      525.500
Figure 4.8: The expected aggregate reward, as a function of starting state and stage, for the example of Figure 4.6.

Initially we consider only ergodic Markov chains and first try to understand the asymptotic behavior above at an intuitive level. For large n, the probability of being in state j at time 0, conditional on starting in i at time −n, is P_{ij}^n ≈ π_j. Thus, the expected final reward at time 0 is approximately πu for each possible starting state at time −n. For (4.36), this says that the final term [P]^n u is approximately (πu)e for large n. Similarly, in (4.36), [P]^{n−m} r ≈ ge if n − m is large. This means that for very large n, each unit increase or decrease in n simply adds or subtracts ge to the vector gain. Thus, we might conjecture that, for large n, v(n, u) is the sum of an initial transient term w, an intermediate term, nge, and a final term, (πu)e, i.e.,

v(n, u) ≈ w + nge + (πu)e,        (4.38)

where we also conjecture that the approximation becomes exact as n → ∞. Substituting (4.37) into (4.38), the conjecture (which we shall soon validate) is

v(n, u) ≈ w + (π v(n, u))e.        (4.39)
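The table of Figure 4.8 is easy to regenerate by iterating (4.34) directly. The following short Python/numpy sketch (an illustration added here, not part of the text) does so for the chain of Figure 4.6 with u = r:

import numpy as np

P = np.array([[0.99, 0.01],
              [0.01, 0.99]])
r = np.array([0.0, 1.0])
u = r.copy()                      # final reward u = r, as in Figure 4.8
pi = np.array([0.5, 0.5])

v = u.copy()
for n in range(1, 1001):
    v = r + P @ v                 # the recursion (4.34)
    if n in (1, 2, 4, 10, 40, 100, 400, 1000):
        print(n, pi @ v, v[0], v[1])

# n = 1 gives (1.0, 0.01, 1.99); n = 1000 gives (500.5, 475.5, 525.5),
# so v_i(n, u) - pi v(n, u) approaches -25 and +25 as n grows,

which is exactly the asymptotic behavior conjectured in (4.38) and (4.39).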
That is, the component w_i of w tells us how profitable it is, in the long term, to start in a particular state i rather than start in steady-state. Thus w is called the asymptotic relative gain vector or, for brevity, the relative gain vector. In the example of the table above, w = (−25, +25). There are two reasonable approaches to validate the conjecture above and to evaluate the relative gain vector w. The first is explored in Exercise 4.22 and expands on the intuitive argument leading to (4.38) to show that w is given by

w = \sum_{i=0}^{∞} ([P]^i − eπ) r.        (4.40)
This expression is not a very useful way to calculate w, and thus we follow the second approach here, which provides both a convenient expression for w and a proof that the approximation in (4.38) becomes exact in the limit. Rearranging (4.38) and going to the limit,

w = lim_{n→∞} {v(n, u) − nge − (πu)e}.        (4.41)
The conjecture, which is still to be proven, is that the limit in (4.41) actually exists. We now show that if this limit exists, w must have a particular form. In particular, substituting (4.34) into (4.41),

w = lim_{n→∞} {r + [P]v(n − 1, u) − nge − (πu)e}
  = r − ge + [P] lim_{n→∞} {v(n − 1, u) − (n − 1)ge − (πu)e}
  = r − ge + [P]w.

Thus, if the limit in (4.41) exists, that limiting vector w must satisfy w + ge = r + [P]w. The following lemma shows that this equation has a solution. The lemma does not depend on the conjecture in (4.41); we are simply using this conjecture to motivate why the equation (4.42) is important.

Lemma 4.1. Let [P] be the transition matrix of an M state unichain. Let r = (r_1, . . . , r_M)^T be a reward vector, let π = (π_1, . . . , π_M) be the steady-state probabilities of the chain, and let g = \sum_i π_i r_i. Then the equation

w + ge = r + [P]w        (4.42)
has a solution for w. With the additional condition πw = 0, that solution is unique.

Discussion: Note that v = r + [P]v in Example 4.5.1 is a special case of (4.42) in which π = (1, 0, . . . , 0) and r = (0, 1, . . . , 1)^T and thus g = 0. With the added condition v_1 = πv = 0, the solution is unique. Example 4.5.2 is the same, except that r is different, and thus also has a unique solution.

Proof: Rewrite (4.42) as

{[P] − [I]}w = ge − r.        (4.43)
Let w̃ be a particular solution to (4.43) (if one exists). Then any solution to (4.43) can be expressed as w̃ + x for some x that satisfies the homogeneous equation {[P] − [I]}x = 0. For x to satisfy {[P] − [I]}x = 0, however, x must be a right eigenvector of [P] with eigenvalue 1. From Theorem 4.8, x must have the form αe for some number α. This means that if a particular solution w̃ to (4.43) exists, then all solutions have the form w = w̃ + αe. For a particular solution to (4.43) to exist, ge − r must lie in the column space of the matrix [P] − [I]. This column space is the space orthogonal to the left null space of [P] − [I]. This left null space, however, is simply the set of left eigenvectors of [P] of eigenvalue 1, i.e., the scalar multiples of π. Thus, a particular solution exists iff π(ge − r) = 0. Since πge = g and πr = g, this equality is satisfied and a particular solution exists. Since all solutions
have the form w = w̃ + αe, setting πw = 0 determines the value of α to be −πw̃, thus yielding a unique solution with πw = 0 and completing the proof.
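In practice the relative gain vector is found by solving the linear system (4.42) together with the normalization πw = 0. A minimal Python/numpy sketch for the chain of Figure 4.6 (an illustration added here; the text states that the answer is w = (−25, 25)):

import numpy as np

P = np.array([[0.99, 0.01],
              [0.01, 0.99]])
r = np.array([0.0, 1.0])
pi = np.array([0.5, 0.5])
g = pi @ r                                   # steady-state gain per stage

# Solve w + g e = r + [P] w together with pi w = 0:
# ([P] - I) w = g e - r, plus the normalization row pi w = 0.
A = np.vstack([P - np.eye(2), pi])
b = np.concatenate([g * np.ones(2) - r, [0.0]])
w, *_ = np.linalg.lstsq(A, b, rcond=None)
print(w)                                     # -> [-25.  25.]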
It is not necessary to assume that g = πr in the lemma. If g is treated as a variable in (4.42), then, by pre-multiplying any solution w, g of (4.42) by π, we find that g = πr must be satisfied. This means that (4.42) can be viewed as M linear equations in the M + 1 variables w, g, and the set of solutions can be found without first calculating π. Naturally, π must be found to find the particular solution with πw = 0.

If the final reward vector is chosen to be any solution w of (4.42) (not necessarily the one with πw = 0), then

v(1, w) = r + [P]w = w + ge
v(2, w) = r + [P]{w + ge} = w + 2ge
    · · ·
v(n, w) = r + [P]{w + (n − 1)ge} = w + nge.        (4.44)
This is a simple explicit expression for the expected aggregate gain for this special final reward vector. We now show how to use this to get a simple expression for v(n, u) for arbitrary u. From (4.36),

v(n, u) − v(n, w) = [P]^n {u − w}.        (4.45)
Note that this is valid for any Markov unichain and any reward vector. Substituting (4.44) into (4.45),

v(n, u) = nge + w + [P]^n {u − w}.        (4.46)
It should now be clear why we wanted to allow the final reward vector to differ from the reward vector at other stages. The result is summarized in the following theorem:

Theorem 4.9. Let [P] be the transition matrix of a unichain. Let r be a reward vector and w a solution to (4.42). Then the expected aggregate reward vector over n stages is given by (4.46). If the unichain is ergodic and w satisfies πw = 0, then

lim_{n→∞} {v(n, u) − nge} = w + (πu)e.        (4.47)
Proof: The argument above established (4.46). If the recurrent class is ergodic, then [P]^n approaches a matrix whose rows each equal π, and (4.47) follows.

The set of solutions to (4.42) has the form w + αe where w satisfies πw = 0 and α is any real number. The factor α cancels out in (4.46), so any solution can be used. In (4.47), however, the restriction to πw = 0 is necessary. We have defined the (asymptotic) relative gain vector w to satisfy πw = 0 so that, in the ergodic case, the expected aggregate gain v(n, u) can be cleanly split into an initial transient w, an intermediate gain nge (i.e., g per stage), and the final gain πu, as in (4.47). We shall call other solutions to (4.42) shifted relative gain vectors.
Recall that Examples 4.5.1 and 4.5.2 showed that the aggregate reward v_i from state i to enter a trapping state, state 1, is given by the solution to v = r + [P]v, v_1 = 0. This aggregate reward, in the general setup of Theorem 4.9, is lim_{n→∞} v(n, u). Since g = 0 and u = 0 in these examples, (4.47) simplifies to lim_{n→∞} v(n, u) = w where w = r + [P]w and πw = w_1 = 0. Thus, we see that (4.47) gives the same answer as we got in these examples.

For the example in Figure 4.6, we have seen that w = (−25, 25) (see Exercise 4.21 also). The large relative gain for state 2 accounts for both the immediate reward and the high probability of multiple additional rewards through remaining in state 2. Note that w_2 cannot be interpreted as the expected reward up to the first transition from state 2 to 1. The reason for this is that the gain starting from state 1 cannot be ignored; this can be seen from Figure 4.9, which modifies Figure 4.6 by changing P_{12} to 1. In this case (see Exercise 4.21), w_2 − w_1 = 1/1.01 ≈ 0.99, reflecting the fact that state 1 is always left immediately, thus reducing the advantage of starting in state 2.
Figure 4.9: A variation of Figure 4.6 in which P_{12} is changed to 1; as before, r_1 = 0, r_2 = 1, P_{22} = 0.99, and P_{21} = 0.01.

We can now interpret the general solution in (4.46) by viewing ge as the steady-state gain per stage, viewing w as the dependence on the initial state, and viewing [P]^n {u − w} as the dependence on the final reward vector u. If the recurrent class is ergodic, then, as seen in (4.47), this final term is asymptotically independent of the starting state and w, but depends on πu.

Example 4.5.4. In order to understand better why (4.47) can be false without the assumption of an ergodic unichain, consider a two state periodic chain with P_{12} = P_{21} = 1, r_1 = r_2 = 0, and arbitrary final reward with u_1 ≠ u_2. Then it is easy to see that for n even, v_1(n) = u_1, v_2(n) = u_2, and for n odd, v_1(n) = u_2, v_2(n) = u_1. Thus, the effect of the final reward on the initial state never dies out.

For a unichain with a periodic recurrent class of period d, as in the example above, it is a little hard to interpret w as an asymptotic relative gain vector, since the last term of (4.46) involves w also (i.e., the relative gain of starting in different states depends on both n and u). The trouble is that the final reward happens at a particular phase of the periodic variation, and the starting state determines the set of states at which the final reward is assigned. If we view the final reward as being randomized over a period, with equal probability of occurring at each phase, then, from (4.46),

(1/d) \sum_{m=0}^{d−1} ( v(n + m, u) − (n + m)ge ) = w + [P]^n (1/d)( I + [P] + · · · + [P]^{d−1} ) {u − w}.
Going to the limit n → ∞, and using the result of Exercise 4.18, this becomes almost the
same as the result for an ergodic unichain, i.e.,

lim_{n→∞} (1/d) \sum_{m=0}^{d−1} ( v(n + m, u) − (n + m)ge ) = w + (eπ)u.        (4.48)
There is an interesting analogy between the steady-state vector π and the relative gain vector w. If the recurrent class of states is ergodic, then any initial distribution on the states approaches the steady state with increasing time, and similarly the effect of any final gain vector becomes negligible (except for the choice of (πu)) with an increasing number of stages. On the other hand, if the recurrent class is periodic, then starting the Markov chain in steady-state maintains the steady state, and similarly, choosing the final gain to be the relative gain vector maintains the same relative gain at each stage.

Theorem 4.9 treated only unichains, and it is sometimes useful to look at asymptotic expressions for chains with m > 1 recurrent classes. In this case, the analogous quantity to a relative gain vector can be expressed as a solution to

w + \sum_{i=1}^{m} g^{(i)} ν^{(i)} = r + [P]w,        (4.49)
where g^{(i)} is the gain of the ith recurrent class and ν^{(i)} is the corresponding right eigenvector of [P] (see Exercise 4.14). Using a solution to (4.49) as a final gain vector, we can repeat the argument in (4.44) to get

v(n, w) = w + n \sum_{i=1}^{m} g^{(i)} ν^{(i)}     for all n ≥ 1.        (4.50)
As expected, the average reward per stage depends on the recurrent class of the initial state. If the initial state, j, is transient, the average reward per stage is averaged over the recurrent classes, using the probability ν_j^{(i)} that state j eventually reaches class i. For an arbitrary final reward vector u, (4.50) can be combined with (4.45) to get

v(n, u) = w + n \sum_{i=1}^{m} g^{(i)} ν^{(i)} + [P]^n {u − w}     for all n ≥ 1.        (4.51)
Eqn. (4.49) always has a solution (see Exercise 4.27), and in fact has an m-dimensional set of solutions given by w = w̃ + \sum_i α_i ν^{(i)}, where α_1, . . . , α_m can be chosen arbitrarily and w̃ is any given solution.
4.6 Markov decision theory and dynamic programming

4.6.1 Introduction
In the previous section, we analyzed the behavior of a Markov chain with rewards. In this section, we consider a much more elaborate structure in which a decision maker can select
between various possible decisions for rewards and transition probabilities. In place of the reward r_i and the transition probabilities {P_{ij}; 1 ≤ j ≤ M} associated with a given state i, there is a choice between some number K_i of different rewards, say r_i^{(1)}, r_i^{(2)}, . . . , r_i^{(K_i)}, and a corresponding choice between K_i different sets of transition probabilities, say {P_{ij}^{(1)}; 1 ≤ j ≤ M}, {P_{ij}^{(2)}; 1 ≤ j ≤ M}, . . . , {P_{ij}^{(K_i)}; 1 ≤ j ≤ M}. A decision maker then decides between these K_i possible decisions each time the chain is in state i. Note that if the decision maker chooses decision k for state i, then the reward is r_i^{(k)} and the transition probabilities from state i are {P_{ij}^{(k)}; 1 ≤ j ≤ M}; it is not possible to choose r_i^{(k)} for one k and {P_{ij}^{(k)}; 1 ≤ j ≤ M} for another k. We assume that, given X_n = i, and given decision k at time n, the probability of entering state j at time n + 1 is P_{ij}^{(k)}, independent of earlier states and decisions. Figure 4.10 shows an example of this situation in which the decision maker can choose between two possible decisions in state 2 (K_2 = 2) and has no freedom of choice in state 1 (K_1 = 1). This figure illustrates the familiar tradeoff between instant gratification (alternative 2) and long term gratification (alternative 1).
4.6. MARKOV DECISION THEORY AND DYNAMIC PROGRAMMING
167
the optimal policy does in fact depend both on m and on the final rewards {ui ; 1 ≤ i ≤ M}. We call this optimal policy the optimal dynamic policy. This policy is found from the dynamic programming algorithm, which, as we shall see, is conceptually very simple. We then go on to find the relationship between the optimal dynamic policy and the optimal stationary policy and show that each has the same long term gain per trial.
4.6.2
Dynamic programming algorithm
As in our development of Markov chains with rewards, we consider expected aggregate reward over n time periods and we use stages, counting backwards from the final trial. First consider the optimum decision with just one trial (i.e., with just one stage). We start (k) in a given state i at stage 1, make a decision k, obtain the reward ri , then go to some state (k) j with probability Pij and obtain the final reward uj . This expected aggregate reward is maximized over the choice of k, i.e., X (k) (k) vi∗ (1, u) = max{ri + Pij uj }. (4.52) k
j
We use the notation vi∗ (n, u) to represent the maximum expected aggregate reward for n stages starting in state i. Note that vi∗ (1, u) depends on the final reward vector u = T (u1 , u2 , . . . , uM ) . Next consider the maximum expected aggregate reward starting in state i at stage 2. For each state j, 1 ≤ j ≤ M, let vj (1, u) be the expected aggregate reward, over stages 1 and 0, for some arbitrary policy, conditional on the chain being in state j at stage 1. Then if decision k is made in state i at stage 2, the expected aggregate reward for P (k) (k) stage 2 is ri + j Pij vj (1, u). Note that no matter what policy is chosen at stage 2, this expression is maximized at stage 1 by choosing the stage 1 policy that maximizes vj (1, u). Thus, independent of what we choose at stage 2 (or at earlier times), we must use vj∗ (1, u) for the aggregate gain from stage 1 onward in order to maximize the overall aggregate gain from stage 2. Thus, at stage 2, we achieve maximum expected aggregate gain, vi∗ (2, u), by choosing the k that achieves the following maximum: X (k) (k) vi∗ (2, u) = max {ri + Pij vj∗ (1, u)}. (4.53) k
j
Repeating this argument for successively larger n, we obtain the general expression X (k) (k) vi∗ (n, u) = max{ri + Pij vj∗ (n − 1, u)}. (4.54) k
j
Note that this is almost the same as (4.33), differing only by the maximization over k. We can also write this in vector form, for n ≥ 1, as v ∗ (n, u) = max{r k + [P k ]v ∗ (n − 1, u)}, k
(4.55)
where for n = 1, we take v ∗ (0, u) = u. Here k is a set (or vector) of decisions, k = (k1 , k2 , . . . , kM ) where ki is the decision to be used in state i. [P k ] denotes a matrix whose
168
CHAPTER 4. FINITE-STATE MARKOV CHAINS (k )
(k )
(i, j) element is Pij i , and r k denotes a vector whose ith element is ri i . The maximization over k in (4.55) is really M separate and independent maximizations, one for each state, i.e., (4.55) is simply a vector form of (4.54). Another frequently useful way to rewrite (4.54) or (4.55) is as follows: 0
0
v ∗ (n, u) = r k + [P k ]v ∗ (n−1) for k 0 such that 0
0
r k + [P k ]v ∗ (n−1) = max r k + [P k ]v ∗ (n−1).
(4.56)
k
If k 0 satisfies (4.56), it is called an optimal decision for stage n. Note that (4.54), (4.55), and (4.56) are valid with no restrictions (such as recurrent or aperiodic states) on the possible transition probabilities [P k ]. The dynamic programming algorithm is just the calculation of (4.54), (4.55), or (4.56), performed successively for n = 1, 2, 3, . . . . The development of this algorithm, as a systematic tool for solving this class of problems, is due to Bellman [Bel57]. This algorithm yields the optimal dynamic policy for any given final reward vector, u. Along with the calculation of v ∗ (n, u) for each n, the algorithm also yields the optimal decision at each stage. The surprising simplicity of the algorithm is due to the Markov property. That is, vi∗ (n, u) is the aggregate present and future reward conditional on the present state. Since it is conditioned on the present state, it is independent of the past (i.e., how the process arrived at state i from previous transitions and choices). Although dynamic programming is computationally straightforward and convenient7 , the asymptotic behavior of v ∗ (n, u) as n → 1 is not evident from the algorithm. After working out some simple examples, we look at the general question of asymptotic behavior. Example 4.6.1. Consider Fig. 4.10, repeated below, with the final rewards u2 = u1 = 0. ✿ 1♥ ✘ ②
0.99
r1 =0
0.01 0.01
③ ♥ 2 ②
✿ 1♥ ✘ ②
0.99
0.99
(1)
r2 =1
r1 =0
0.01 1
③ ♥ 2 (2)
r2 =50
Since there is no reward in stage 0, uj = 0. Also r1 = 0, so, from (4.52), the aggregate gain in state 1 at stage 1 is X v1∗ (1, u) = r1 + Pij uj = 0. j
(1)
Similarly, since policy 1 has an immediate reward r2 = 1 in state 2, and policy 2 has an (2) immediate reward r2 = 50, Ωh X (1) i h (2) X (2) iæ (1) ∗ v2 (1, u) = max r2 + Pij uj , Pij uj r2 + = max{1, 50} = 50. j
7
j
Unfortunately, many dynamic programming problems of interest have enormous numbers of states and possible choices of decision (the so called curse of dimensionality), and thus, even though the equations are simple, the computational requirements might be beyond the range of practical feasibility.
4.6. MARKOV DECISION THEORY AND DYNAMIC PROGRAMMING
169
We can now go on to stage 2, using the results above for vj∗ (1, u). From (4.53), v1∗ (2) = r1 + P11 v1∗ (1, u) + P12 v2∗ (1, u) = P12 v2∗ (1, u) = 0.5 Ωh i h iæ X (1) (1) (2) (2) ∗ ∗ ∗ P2j vj (1, u) , r2 + P21 v1 (1, u) v2 (2) = max r2 + j
o n (1) = max [1 + P22 v2∗ (1, u)], 50 = max{50.5, 50} = 50.5.
Thus, we have seen that, in state 2, decision 1 is preferable at stage 2, while decision 2 is preferable at stage 1. What is happening is that the choice of decision 2 at stage 1 has made it very profitable to be in state 2 at stage 1. Thus if the chain is in state 2 at stage 2, it is preferable to choose decision 1 (i.e., the small unit gain) at stage 2 with the corresponding high probability of remaining in state 2 at stage 1. Continuing this computation for larger n, one finds that v1∗ (n, u) = n/2 and v2∗ (n, u) = 50 + n/2. The optimum dynamic policy is decision 2 for stage 1 and decision 1 for all stages n > 1. This example also illustrates that the maximization of expected gain is not necessarily what is most desirable in all applications. For example, people who want to avoid risk might well prefer decision 2 at stage 2. This guarantees a reward of 50, rather than taking a small chance of losing that reward. Example 4.6.2 (Shortest Path Problems). The problem of finding the shortest paths between nodes in a directed graph arises in many situations, from routing in communication networks to calculating the time to complete complex tasks. The problem is quite similar to the expected first passage time of example 4.5.1. In that problem, arcs in a directed graph were selected according to a probability distribution, whereas here, we must make a decision about which arc to take. Although there are no probabilities here, the problem can be posed as dynamic programming. We suppose that we want to find the shortest path from each node in a directed graph to some particular node, say node 1 (see Figure 4.11). The link lengths are arbitrary numbers that might reflect physical distance, or might reflect an arbitrary type of cost. The length of a path is the sum of the lengths of the arcs on that path. In terms of dynamic programming, a policy is a choice of arc out of each node. Here we want to minimize cost (i.e., path length) rather than maximizing reward, so we simply replace the maximum in the dynamic programming algorithm with a minimum (or, if one wishes, all costs can be replaced with negative rewards. 2♥ 2 ❍ ✚ 4✚ ✄✗ 0 ❂ ✚ ❥ 3♥ ❍ ✿ 1♥ ✘ ② ✯ ✟ ⑥ ✎✄ ✙ ♥ ✟ 4 4
Figure 4.11: A shortest path problem. The arcs are marked with their lengths. Any unmarked link has length 1
We start the dynamic programming algorithm with a final cost vector that is 0 for node 1 and infinite for all other nodes. In stage 1, we choose the arc from node 2 to 1 and that
170
CHAPTER 4. FINITE-STATE MARKOV CHAINS
from 4 to 1; the choice at node 3 is immaterial. The stage 1 costs are then v1 (1, u) = 0,
v2 (1, u) = 4,
v3 (1, u) = 1,
In stage 2, the cost v3 (2, u), for example, is h v3 (2, u) = min 2 + v2 (1, u),
The set of costs at stage 2 are v1 (2, u) = 0,
v2 (2, u) = 2,
v4 (1, u) = 1.
i 4 + v4 (1, u) = 5.
v3 (2, u) = 5,
v4 (2, u) = 1.
and the policy is for node 2 to go to 4, node 3 to 4, and 4 to 1. At stage, node 3 switches to node 2, reducing its path length to 4, and nodes 2 and 4 are unchanged. Further iterations yield no change, and the resulting policy is also the optimal stationary policy. It can be seen without too much difficulty, for the example of Figure 4.11, that these final aggregate costs and shortest paths also result no matter what final cost vector u (with u1 = 0) is used. We shall see later that this always happens so long as all the cycles in the directed graph (other than the self loop from node 1 to node 1) have positive cost.
4.6.3
Optimal stationary policies
In Example 4.6.1, we saw that there was a final transient (for stage 1) in which decision 1 was taken, and in all other stages, decision 2 was taken. Thus, the optimal dynamic policy used a stationary policy (using decision 2) except for a final transient. It seems reasonable to expect this same type of behavior for typical but more complex Markov decision problems. We can get a clue about how to demonstrate this by first looking at a situation in which the aggregate expected gain of a stationary policy is equal to that of the optimal dynamic 0 ) of decisions in policy. Denote some given stationary policy by the vector k 0 = (k10 , . . . , kM 0 each state. Assume that the Markov chain with transition matrix [P k ] is a unichain, i.e., recurrent with perhaps additional transient states. The expected aggregate reward for this stationary policy is then given by (4.46), using the Markov chain with transition matrix 0 0 [P k ] and reward vector r k . Let w 0 be the relative gain vector for the stationary policy k 0 . Recall from (4.44) that if w 0 is used as the final reward vector, then the expected aggregate gain simplifies to 0
v k (n, w 0 ) − ng 0 e = w 0 ,
(4.57)
P 0 0 0 where g 0 = i πik rik is the steady-state gain, π k is the steady-state probability vector, and the relative gain vector w 0 satisfies 0
0
w 0 + g 0 e = r k + [P k ]w 0 ;
0
π k w 0 = 0.
(4.58)
The fact that the right hand side of (4.57) is independent of the stage, n, leads us to hypothesize that if the stationary policy k 0 is the same as the dynamic policy except for a final transient, then that final transient might disappear if we use w 0 as a final reward
4.6. MARKOV DECISION THEORY AND DYNAMIC PROGRAMMING
171
vector. To pursue this hypothesis, assume a final reward equal to w 0 . Then, if k 0 maximizes r k + [P k ]w 0 over k , we have 0
0
v ∗ (1, w 0 ) = r k + [P k ]w 0 = max{r k + [P k ]w 0 }. k
(4.59)
Substituting (4.58) into (4.59), we see that the vector decision k 0 is optimal at stage 1 if 0
0
w 0 + g 0 e = r k + [P k ]w 0 = max{r k + [P k ]w 0 }. k
(4.60)
If (4.60) is also satisfied, then the optimal gain is given by v ∗ (1, w 0 ) = w 0 + g 0 e.
(4.61)
The following theorem now shows that if (4.60) is satisfied, then, not only is the decision k 0 that maximizes r k + [P k ]w 0 an optimal dynamic policy for stage 1 but is also optimal at all stages (i.e., the stationary policy k 0 is also an optimal dynamic policy). Theorem 4.10. Assume that (4.60) is satisfied for some w0 , g 0 , and k0 . Then, if the final reward vector is equal to w0 , the stationary policy k0 is an optimal dynamic policy and the optimal expected aggregate gain satisfies v∗ (n, w0 ) = w0 + ng 0 e.
(4.62)
Proof: Since k 0 maximizes r k + [P k ]w 0 , it is an optimal decision at stage 1 for the final 0 0 vector w 0 . From (4.60), w 0 + g 0 e = r k + [P k ]w 0 , so v ∗ (1, w 0 ) = w 0 + g 0 e. Thus (4.62) is satisfied for n = 1, and we use induction on n, with n = 1 as a basis, to verify (4.62) in general. Thus, assume that (4.62) is satisfied for n. Then, from (4.55), v ∗ (n + 1, w 0 ) = max{r k + [P k ]v ∗ (n, w 0 )} k o n k = max r k + [P k ]{w 0 + ng 0 e} k
(4.63) (4.64)
= ng 0 e + max{r k + [P k ]w 0 }
(4.65)
= (n + 1)g 0 e + w 0 ..
(4.66)
k
Eqn (4.64) follows from the inductive hypothesis of (4.62), (4.65) follows because [P k ]e = e for all k , and (4.66) follows from (4.60). This verifies (4.62) for n + 1. Also, since k 0 maximizes (4.65), it also maximizes (4.63), showing that k 0 is the optimal decision at stage n + 1. This completes the inductive step and thus the proof. Since our major interest in stationary policies is to help understand the relationship between the optimal dynamic policy and stationary policies, we define an optimal stationary policy as follows: Definition 4.13. A stationary policy k0 is optimal if there is some final reward vector w0 for which k0 is the optimal dynamic policy.
172
CHAPTER 4. FINITE-STATE MARKOV CHAINS
From Theorem 4.10, we see that if there is a solution to (4.60), then the stationary policy k 0 that maximizes r k + [P k ]w 0 is an optimal stationary policy. Eqn. (4.60) is known as Bellman’s equation, and we now explore the situations in which it has a solution (since these solutions give rise to optimal stationary policies). Theorem 4.10 made no assumptions beyond Bellman’s equation about w 0 , g 0 , or the stationary policy k 0 that maximizes r k + [P k ]w 0 . However, if k 0 corresponds to a unichain, then, from Lemma 4.1 and its following discussion, w 0 and g 0 are uniquely determined (aside from an additive factor of αe in w 0 ) as the relative gain vector and gain per stage for k 0 . If Bellman’s equation has a solution, w 0 , g 0 , then, for every decision k , we have w 0 + g 0 e ≥ r k + [P k ]w 0
with equality for some k 0 .
(4.67)
The Markov chains with transition matrices [P k ] might have multiple recurrent classes, so we let π k ,R denote the steady-state probability vector for a given recurrent class R of k . Premultiplying both sides of (4.67) by π k ,R , π k ,R w 0 + g 0π k ,R e ≥ π k ,R r k + π k ,R [P k ]w 0
with equality for some k 0 .
(4.68)
Recognizing that π k ,R e = 1 and π k ,R [P k ] = π k ,R , this simplifies to g 0 ≥ π k ,R r k
with equality for some k 0 .
(4.69)
This says that if Bellman’s equation has a solution w 0 , g 0 , then the gain per stage g 0 in that solution is greater than or equal to the gain per stage in each recurrent class of each stationary policy, and is equal to the gain per stage in each recurrent class of the maximizing stationary policy, k 0 . Thus, the maximizing stationary policy is either a unichain or consists of several recurrent classes all with the same gain per stage. We have been discussing the properties that any solution of Bellman’s equation must have, but still have no guarantee that any such solution must exist. The following subsection describes a fairly general algorithm (policy iteration) to find a solution of Bellman’s algorithm, and also shows why, in some cases, no solution exists. Before doing this, however, we look briefly at the overall relations between the states in a Markov decision problem. For any Markov decision problem, consider a directed graph for which the nodes of the graph are the states in the Markov decision problem, and, for each pair of states (i, j), (k ) there is a directed arc from i to j if Pij i > 0 for some decision ki . Definition 4.14. A state i in a Markov decision problem is reachable from state j if there is a path from j to i in the above directed graph. Note that if i is reachable from j, then there is a stationary policy in which i is accessible from j (i.e., for each arc (m, l) on the path, a decision km in state m is used for which (k ) Pmlm > 0). Definition 4.15. A state i in a Markov decision problem is inherently transient if it is not reachable from some state j that is reachable from i. A state i is inherently recurrent
4.6. MARKOV DECISION THEORY AND DYNAMIC PROGRAMMING
173
if it is not inherently transient. A class I of states is inherently recurrent if each i ∈ I is inherently recurrent, each is reachable from each other, and no state j ∈ / I is reachable from any i ∈ I. A Markov decision problem is inherently recurrent if all states form an inherently recurrent class. An inherently recurrent class of states is a class that, once entered, can never be left, but which has no subclass with that property. An inherently transient state is transient in at least one stationary policy, but might be recurrent in other policies (but all the states in any such recurrent class must be inherently transient). In the following subsection, we analyze inherently recurrent Markov decision problems. Multiple inherently recurrent classes can be analyzed one by one using the same approach, and we later give a short discussion of inherently transient states.
4.6.4
Policy iteration and the solution of Bellman’s equation
The general idea of policy iteration is to start with an arbitrary unichain stationary policy k 0 and to find its gain per stage g 0 and its relative gain vector w 0 . We then check whether Bellman’s equation, (4.60), is satisfied, and if not, we find another stationary policy k that is ‘better’ than k 0 in a sense to be described later. Unfortunately, the ‘better’ policy that we find might not be a unichain, so the following lemma shows that any such policy can be converted into an equally ‘good’ unichain policy. The algorithm then iteratively finds better and better unichain stationary policies, until eventually one of them satisfies Bellman’s equation and is thus optimal. Lemma 4.2. Let k = (k1 , . . . , kM ) be an arbitrary stationary policy in an inherently recurrent Markov decision problem. Let R be a recurrent class of states in k. Then a unichain ˜ = (k˜1 , . . . , k˜M ) exists with the recurrent class R and with k˜j = kj for stationary policy k j ∈ R. Proof: Let j be any state in R. By the inherently recurrent assumption, there is a decision vector, say k 0 under which j is accessible from all other states (see Exercise 4.38). Choosing k˜i = ki for i ∈ R and k˜i = ki0 for i ∈ / R completes the proof. Now that we are assured that unichain stationary policies exist and can be found, we can state the policy improvement algorithm for inherently recurrent Markov decision problems. This algorithm is a generalization of Howard’s policy iteration algorithm, [How60]. Policy Improvement Algorithm 1. Choose an arbitrary unichain policy k 0 0
0
2. For policy k 0 , calculate w 0 and g 0 from w 0 + g 0 e = r k + [P k ]w 0 . 3. If w 0 + g 0 e = maxk {r k + [P k ]w 0 }, then stop; k 0 is optimal. P (k ) (k ) 4. Otherwise, choose i and ki so that wi0 + g 0 < ri i + j Pij i wj0 . For j 6= i, let kj = kj0 .
5. If the policy k = (k1 , . . . kM ) is not a unichain, then let R be the recurrent class in policy k that contains state i, and let k˜ be the unichain policy of Lemma 4.2. Update k to the value of k˜ .
174
CHAPTER 4. FINITE-STATE MARKOV CHAINS
6. Update k 0 to the value of k and return to step 2. (k )
If the stopping test in step 3 fails, then there is some i for which wi0 + g 0 < maxki {ri i + P (ki ) 0 j Pij wj }, so step 4 can always be executed if the algorithm does not stop in step 3. The resulting policy k then satisfies w 0 + g0 e
≤ 6 =
r k + [P k ]w 0 ,
(4.70)
where ≤ 6= means that the inequality is strict for at least one component (namely i) of the vectors. 0
Note that at the end of step 4, [P k ] differs from [P k ] only in the transitions out of state i. Thus the set of states from which i is accessible is the same in k 0 as k . If i is recurrent in the unichain k 0 , then it is accessible from all states in k 0 and thus also accessible from all states in k . It follows that i is also recurrent in k and that k is a unichain (see Exercise 4.2. On the other hand, if i is transient in k 0 , and if R0 is the recurrent class of k 0 , then R0 must also be a recurrent class of k , since the transitions from states in R0 are unchanged. There (k ) are then two possibilities when i is transient in k 0 . First, if the changes in Pij i eliminate all the paths from i to R0 , then a new recurrent class R will be formed with i a member. This is the case in which step 5 is used to change k back to a unichain. Alternatively, if a path still exists to R0 , then i is transient in k and k is a unichain with the same recurrent class R0 as k 0 . These results are summarized in the following lemma: Lemma 4.3. There are only three possibilities for k at the end of step 4 of the policy improvement algorithm for inherently recurrent Markov decision problems. First, k is a unichain and i is recurrent in both k0 and k. Second, k is not a unichain and i is transient in k0 and recurrent in k. Third, k is a unichain with the same recurrent class as k0 and i is transient in both k0 and k. The following lemma now asserts that the new policy on returning to step 2 of the algorithm is an improvement over the previous policy k 0 . Lemma 4.4. Let k0 be the unichain policy of step 2 in an iteration of the policy improvement algorithm for an inherently recurrent Markov decision problem. Let g 0 , w0 , R0 be the gain per stage, relative gain vector, and recurrent class respectively of k0 . Assume the algorithm doesn’t stop at step 3 and let k be the unichain policy of step 6. Then either the gain per stage g of k satisfies g > g 0 or else the recurrent class of k is R0 , the gain per stage satisfies g = g 0 , and there is a shifted relative gain vector, w, of k satisfying w0
≤ 6 =
w
and wj0 = wj for each j ∈ R0 .
(4.71)
Proof*: The policy k of step 4 satisfies (4.70) with strict inequality for the component i in which k 0 and k differ. Let R be any recurrent class of k and let π be the steady-state probability vector for R. Premultiplying both sides of (4.70) by π , we get π w 0 + g 0 ≤ π r k + π [P k ]w 0 .
(4.72)
Recognizing that π [P k ] = π and cancelling terms, this shows that g 0 ≤ π r k . Now (4.70) is satisfied with strict inequality for component i, and thus, if πi > 0, (4.72) is satisfied with strict inequality. Thus, g 0 ≤ π r k with equality iff πi = 0.
(4.73)
For the first possibility of Lemma 4.3, k is a unichain and i ∈ R. Thus g 0 < π r k = g. Similarly, for the second possibility in Lemma 4.3, i ∈ R for the new recurrent class that is formed in k , so again g 0 < π r k . Since k˜ is a unichain with the recurrent class R, we have g 0 < g again. For the third possibility in Lemma 4.3, i is transient in R0 = R. Thus πi = 0, so π 0 = π , and g 0 = g. Thus, to complete the proof, we must demonstrate the validity of (4.71) for this case. We first show that, for each n ≥ 1, v k (n, w 0 ) − ng 0 e ≤ v k (n+1, w 0 ) − (n+1)g 0 e.
(4.74)
v k (1, w 0 ) = r k + [P k ]w 0 .
(4.75)
For n = 1,
Using this, (4.70) can be rewritten as w0
≤ 6 =
v k (1, w 0 ) − g 0 e.
(4.76)
Using (4.75) and then (4.76), v k (1, w 0 ) − g 0 e = r k + [P k ]w 0 − g 0 e
≤ r k + [P k ]{v k (1, w 0 ) − g 0 e} − g 0 e k
k
k
0
(4.77)
0
= r + [P ]v (1, w ) − 2g e = v k (2, w 0 ) − 2g 0 e.
We now use induction on n, using n = 1 as the basis, to demonstrate (4.74) in general. For any n > 1, assume (4.74) for n − 1 as the inductive hypothesis. v k (n, w 0 ) − ng 0 e = r k + [P k ]v k (n − 1, w 0 ) − ng 0 e
= r k + [P k ]{v k (n − 1, w 0 ) − (n − 1)g 0 e} − g 0 e ≤ r k + [P k ]{v k (n, w 0 ) − ng 0 e} − g 0 e
= v k (n+1, w 0 ) − (n+1)g 0 e.
This completes the induction, verifying (4.74) and showing that v k (n, w 0 ) − ng 0 e is nondecreasing in n. Since k is a unichain, Lemma 4.1 asserts that k has a shifted relative gain vector w , i.e., a solution to (4.42). From (4.46), v k (n, w 0 ) = w + ng 0 e + [P k ]n {w 0 − w }.
(4.78)
Since [P k ]n is a stochastic matrix, its elements are each between 0 and 1, so the sequence of vectors v k − ng 0 e must be bounded independent of n. Since this sequence is also non˜, decreasing, it must have a limit, say w ˜. lim v k (n, w 0 ) − ng 0 e = w
n→1
(4.79)
˜ satisfies (4.42) for k . We next show that w ˜ w
= =
lim {v k (n+1, w 0 ) − (n + 1)g 0 e}
n→1
lim {r k + [P k ]v k (n, w 0 ) − (n + 1)g 0 e}
n→1 k
˜. = r − g 0 e + [P k ] lim {v k (n, w 0 ) − ng 0 e} = r k − g 0 e + [P k ]w n→1
(4.80) (4.81)
˜ is a shifted relative gain vector for k . Finally we must show that w ˜ satisfies the Thus w conditions on w in (4.71). Using (4.76) and iterating with (4.74), w0
≤ 6 =
˜ v k (n, w 0 ) − ng 0 e ≤ w
for all n ≥ 1.
(4.82)
Premultiplying each term in (4.82) by the steady-state probability vector π for k , ˜. π w 0 ≤ π v k (n, w 0 ) − ng 0 ≤ π w
(4.83)
Now, k is the same as k 0 over the recurrent class, and π = π 0 since π is non-zero only over the recurrent class. This means that the first inequality above is actually an equality. Also, ˜ . Since πi ≥ 0 and wi0 ≤ w going to the limit, we see that π w 0 = π w ˜i , this implies that 0 wi = w ˜i for all recurrent i, completing the proof. We now see that each iteration of the algorithm either increases the gain per stage or holds the gain constant and increases the shifted relative gain vector w . Thus the sequence of policies found by the algorithm can never repeat. Since there are a finite number of stationary policies, the algorithm must eventually terminate at step 3. Thus we have proved the following important theorem. Theorem 4.11. For any inherently recurrent Markov decision problem, there is a solution to Bellman’s equation and a maximizing stationary policy that is a unichain. There are also many interesting Markov decision problems, such as shortest path problems, that contain not only an inherently recurrent class but also some inherently transient states. The following theorem then applies. Theorem 4.12. Consider a Markov decision problem with a single inherently recurrent class of states and one or more inherently transient states. Let g∗ be the maximum gain per stage over all recurrent classes of all stationary policies and assume that each recurrent class with gain per stage equal to g ∗ is contained in the inherently recurrent class. Then there is a solution to Bellman’s equation and a maximizing stationary policy that is a unichain. Proof*: Let k be a stationary policy which has a recurrent class, R, with gain per stage g ∗ . Let j be any state in R. Since j is inherently recurrent, there is a decision vector k˜ under which j is accessible from all other states. Choose k 0 such that ki0 = ki for all i ∈ R and ki0 = k˜i for all i ∈ / R. Then k 0 is a unichain policy with gain per stage g ∗ . Suppose the policy improvement algorithm is started with this unichain policy. If the algorithm stops at step 3, then k 0 satisfies Bellman’s equation and we are done. Otherwise, from Lemma 4.4, the unichain policy in step 6 of the algorithm either has a larger gain per stage (which is impossible) or has the same recurrent class R and has a relative gain vector w satisfying
(4.74). Iterating the algorithm, we find successively larger relative gain vectors. Since the policies cannot repeat, the algorithm must eventually stop with a solution to Bellman’s equation. The above theorems give us a good idea of the situations under which optimal stationary policies and solutions to Bellman’s equation exist. However, we call a stationary policy optimal if it is the optimal dynamic policy for one special final reward vector. In the next subsection, we will show that if an optimal stationary policy is unique and is an ergodic unichain, then that policy is optimal except for a final transient no matter what the final reward vector is.
4.6.5 Stationary policies with arbitrary final rewards
We start out this subsection with the main theorem, then build up some notation and preliminary ideas for the proof, then prove a couple of lemmas, and finally prove the theorem. Theorem 4.13. Assume that k0 is a unique optimal stationary policy and is an ergodic unichain with the ergodic class R = {1, 2, . . . , m}. Let w0 and g 0 be the relative gain vector and gain per stage for k0 . Then, for any final gain vector u, the following limit exists and is independent of i lim v ∗ (n, u) n→1 i
π 0 u)(u), − ng 0 − wi0 = (π
(4.84)
π 0 u)(u) satisfies where (π π 0 u)(u) = lim π 0 [v∗ (n, u) − ng 0 e − w0 ] (π n→1
(4.85)
and π 0 is the steady-state vector for k0 Discussion: The theorem says that, asymptotically, the relative advantage of starting in one state rather than another is independent of the final gain vector, i.e., that for any states i, j, limn→1 [u∗i (n, u) − u∗j (n, u)] is independent of u. For the shortest path problem, for example, this says that v ∗ (n, u) converges to the shortest path vector for any choice of u for which ui = 0. This means that if the arc lengths change, we can start the algorithm at the shortest paths for the previous arc lengths, and the algorithm is guaranteed to converge to the correct new shortest paths. To see why the theorem can be false without the ergodic assumption, consider Example 4.5.4 where, even without any choice of decisions, (4.84) is false. Exercise 4.34 shows why the theorem can be false without the uniqueness assumption. It can also be shown (see Exercise 4.35) that for any Markov decision problem satisfying the hypotheses of Theorem 4.13, there is some n0 such that the optimal dynamic policy uses the optimal stationary policy for all stages n ≥ n0 . Thus, the dynamic part of the optimal dynamic policy is strictly a transient. The proof of the theorem is quite lengthy. Under the restricted conditions that k 0 is an ergodic Markov chain, the proof is simpler and involves only Lemma 4.5.
We now develop some notation required for the proof of the theorem. Given a final reward (k) P (k) vector u, define ki (n) for each i and n as the k that maximizes ri + j Pij vj∗ (n, u). Then (ki (n))
vi∗ (n + 1, u) = ri
X
+
(k (n)) ∗ vj (n, u)
Pij i
j
(k)
Similarly, since ki0 maximizes ri (ki0 )
vi∗ (n + 1, w 0 ) = ri
+
X j
+
P
j
(ki0 )
≥ ri
X
+
(k0 )
Pij i vj∗ (n, u).
(4.86)
(k (n)) ∗ vj (n, w 0 ).
(4.87)
j
(k)
Pij vj∗ (n, w 0 ),
(k0 )
(ki (n))
Pij i vj∗ (n, w 0 ) ≥ ri
+
X
Pij i
j
Subtracting (4.87) from (4.86), we get the following two inequalities, X (k0 ) Pij i [vj∗ (n, u) − vj∗ (n, w 0 )]. vi∗ (n + 1, u) − vi∗ (n + 1, w 0 ) ≥
(4.88)
j
vi∗ (n + 1, u) − vi∗ (n + 1, w 0 ) ≤
X
(k (n))
Pij i
j
[vj∗ (n, u) − vj∗ (n, w 0 )].
(4.89)
Define δi (n) = vi∗ (n, u) − vi∗ (n, w 0 ). Then (4.88) and (4.89) become δi (n + 1) ≥
δi (n + 1) ≤
X
(k0 )
Pij i δj (n).
(4.90)
j
X
(k (n))
Pij i
δj (n).
(4.91)
j
Since vi∗ (n, w 0 ) = ng 0 + wi0 for all i, n, δi (n) = vi∗ (n, u) − ng 0 − wi0 . Thus the theorem can be restated as asserting that limn→1 δi (n) = β(u) for each state i. Define δmax (n) = max δi (n); i
Then, from (4.90), δi (n + 1) ≥
P
δmin (n) = min δi (n). i
(k0 )
j
Pij i δmin (n) = δmin (n). Since this is true for all i,
δmin (n + 1) ≥ δmin (n).
(4.92)
δmax (n + 1) ≤ δmax (n).
(4.93)
In the same way, from (4.91),
The following lemma shows that (4.84) is valid for each of the recurrent states.
Lemma 4.5. Under the hypotheses of Theorem 4.12, the limiting expression for β(u) in (4.85) exists and

    lim_{n→∞} δ_i(n) = β(u)   for 1 ≤ i ≤ m.     (4.94)

Proof* of Lemma 4.5: Multiplying each side of (4.90) by π_i′ and summing over i,

    π′δ(n+1) ≥ π′[P^{k′}]δ(n) = π′δ(n).

Thus π′δ(n) is non-decreasing in n. Also, from (4.93), π′δ(n) ≤ δ_max(n) ≤ δ_max(1). Since π′δ(n) is non-decreasing and bounded, it has a limit β(u) as defined by (4.85), π′δ(n) ≤ β(u), and

    lim_{n→∞} π′δ(n) = β(u).     (4.95)

Next, iterating (4.90) m times, we get

    δ(n+m) ≥ [P^{k′}]^m δ(n).

Since the recurrent class of k′ is ergodic, (4.28) shows that lim_{m→∞} [P^{k′}]^m = eπ′. Thus,

    [P^{k′}]^m = eπ′ + [χ(m)],

where [χ(m)] is a sequence of matrices for which lim_{m→∞} [χ(m)] = 0. Thus

    δ(n+m) ≥ eπ′δ(n) + [χ(m)]δ(n).

For any ε > 0, (4.95) shows that for all sufficiently large n, π′δ(n) ≥ β(u) − ε/2. Also, since δ_min(1) ≤ δ_i(n) ≤ δ_max(1) for all i and n, and since [χ(m)] → 0, we see that [χ(m)]δ(n) ≥ −(ε/2)e for all large enough m. Thus, for all large enough n and m, δ_i(n+m) ≥ β(u) − ε. Thus, for any ε > 0, there is an n_0 such that for all n ≥ n_0,

    δ_i(n) ≥ β(u) − ε.     (4.96)

Also, from (4.95), we have π′[δ(n) − β(u)e] ≤ 0, so

    π′[δ(n) − β(u)e + εe] ≤ ε.     (4.97)

From (4.96), each term, π_i′[δ_i(n) − β(u) + ε], on the left side of (4.97) is non-negative, so each must also be smaller than ε. For π_i′ > 0, it follows that

    δ_i(n) − β(u) + ε ≤ ε/π_i′   for all i and all n ≥ n_0.     (4.98)

Since ε > 0 is arbitrary, (4.96) and (4.98), together with π_i′ > 0, show that lim_{n→∞} δ_i(n) = β(u), completing the proof of Lemma 4.5.

Since k′ is a unique optimal stationary policy, we have

    r_i^{(k_i′)} + Σ_j P_ij^{(k_i′)} w_j′ > r_i^{(k_i)} + Σ_j P_ij^{(k_i)} w_j′
for all i and all k_i ≠ k_i′. Since this is a finite set of strict inequalities, there is an α > 0 such that for all i and all k_i ≠ k_i′,

    r_i^{(k_i′)} + Σ_j P_ij^{(k_i′)} w_j′ ≥ r_i^{(k_i)} + Σ_j P_ij^{(k_i)} w_j′ + α.     (4.99)

Since v*_i(n, w′) = ng′ + w_i′,

    v*_i(n+1, w′) = r_i^{(k_i′)} + Σ_j P_ij^{(k_i′)} v*_j(n, w′)     (4.100)
                 ≥ r_i^{(k_i(n))} + Σ_j P_ij^{(k_i(n))} v*_j(n, w′) + α,     (4.101)

for each i with k_i(n) ≠ k_i′. Subtracting (4.101) from (4.86),

    δ_i(n+1) ≤ Σ_j P_ij^{(k_i(n))} δ_j(n) − α   for k_i(n) ≠ k_i′.     (4.102)

Since δ_i(n) ≤ δ_max(n), (4.102) can be further bounded by δ_i(n+1) ≤ δ_max(n) − α for k_i(n) ≠ k_i′. Combining this with δ_i(n+1) = Σ_j P_ij^{(k_i′)} δ_j(n) for k_i(n) = k_i′,

    δ_i(n+1) ≤ max[ δ_max(n) − α,  Σ_j P_ij^{(k_i′)} δ_j(n) ].     (4.103)

Next, since k′ is a unichain, we can renumber the transient states, m < i ≤ M, so that Σ_{j<i} P_ij^{(k_i′)} > 0 for each i, m < i ≤ M. Since this is a finite set of strict inequalities, there is some γ > 0 such that

    Σ_{j<i} P_ij^{(k_i′)} ≥ γ   for m < i ≤ M.     (4.104)

The quantity δ_i(n) for each transient state i is somewhat difficult to work with directly, so we define the new quantity, δ̃_i(n), which will be shown in the following lemma to upper bound δ_i(n). The definition of δ̃_i(n) is given iteratively for n ≥ 1, m < i ≤ M as

    δ̃_i(n+1) = max[ δ̃_M(n) − α,  γ δ̃_{i−1}(n) + (1 − γ) δ̃_M(n) ].     (4.105)

The boundary conditions for this are defined to be

    δ̃_i(1) = δ_max(1);   m < i ≤ M     (4.106)

    δ̃_m(n) = sup_{n′≥n} max_{i≤m} δ_i(n′).     (4.107)

Lemma 4.6. Under the hypotheses of Theorem 4.13, with α defined by (4.99) and γ defined by (4.104), the following three inequalities hold:

    δ̃_i(n) ≤ δ̃_i(n−1);   for n ≥ 2,  m ≤ i ≤ M     (4.108)

    δ̃_i(n) ≤ δ̃_{i+1}(n);   for n ≥ 1,  m ≤ i < M     (4.109)

    δ_j(n) ≤ δ̃_i(n);   for n ≥ 1,  j ≤ i,  m ≤ i ≤ M.     (4.110)
Proof* of (4.108): Since the supremum in (4.107) is over a set decreasing in n,

    δ̃_m(n) ≤ δ̃_m(n−1);   for n ≥ 2.     (4.111)

This establishes (4.108) for i = m. To establish (4.108) for n = 2, note that δ̃_i(1) = δ_max(1) for i > m and

    δ̃_m(1) = sup_{n′≥1} max_{i≤m} δ_i(n′) ≤ sup_{n′≥1} δ_max(n′) ≤ δ_max(1).     (4.112)

Thus

    δ̃_i(2) = max[ δ̃_M(1) − α,  γ δ̃_{i−1}(1) + (1−γ) δ̃_M(1) ] ≤ δ_max(1) = δ̃_i(1)   for i > m.

Finally, we use induction for n ≥ 2, i > m, using n = 2 as the basis. Assuming (4.108) for a given n ≥ 2,

    δ̃_i(n+1) = max[ δ̃_M(n) − α,  γ δ̃_{i−1}(n) + (1−γ) δ̃_M(n) ]
             ≤ max[ δ̃_M(n−1) − α,  γ δ̃_{i−1}(n−1) + (1−γ) δ̃_M(n−1) ] = δ̃_i(n).

Proof* of (4.109): Using (4.112) and the fact that δ̃_i(1) = δ_max(1) for i > m, (4.109) is valid for n = 1. Using induction on n with n = 1 as the basis, we assume (4.109) for a given n ≥ 1. Then for m ≤ i < M,

    δ̃_i(n+1) ≤ δ̃_i(n) ≤ γ δ̃_i(n) + (1−γ) δ̃_M(n) ≤ max[ δ̃_M(n) − α,  γ δ̃_i(n) + (1−γ) δ̃_M(n) ] = δ̃_{i+1}(n+1).

Proof* of (4.110): Note that δ_j(n) ≤ δ̃_m(n) for all j ≤ m and n ≥ 1 by the definition in (4.107). From (4.109), δ_j(n) ≤ δ̃_i(n) for j ≤ m ≤ i. Also, for all i > m and j ≤ i, δ_j(1) ≤ δ_max(1) = δ̃_i(1). Thus (4.110) holds for n = 1. We complete the proof by using induction on n for m < j ≤ i, using n = 1 as the basis. Assume (4.110) for a given n ≥ 1. Then δ_j(n) ≤ δ̃_M(n) for all j, and it then follows that δ_max(n) ≤ δ̃_M(n). Similarly, δ_j(n) ≤ δ̃_{i−1}(n) for j ≤ i − 1. For i > m, we then have

    δ_i(n+1) ≤ max[ δ_max(n) − α,  Σ_j P_ij^{(k_i′)} δ_j(n) ]
            ≤ max[ δ̃_M(n) − α,  Σ_{j<i} P_ij^{(k_i′)} δ̃_{i−1}(n) + Σ_{j≥i} P_ij^{(k_i′)} δ̃_M(n) ]
            ≤ max[ δ̃_M(n) − α,  γ δ̃_{i−1}(n) + (1−γ) δ̃_M(n) ] = δ̃_i(n+1),

where the final inequality follows from the definition of γ. Finally, using (4.109) again, we have δ_j(n+1) ≤ δ̃_j(n+1) ≤ δ̃_i(n+1) for m < j ≤ i, completing the proof of Lemma 4.6.
Proof* of Theorem 4.13: From (4.108), δ̃_i(n) is non-increasing in n for i ≥ m. Also, from (4.109) and (4.97), δ̃_i(n) ≥ δ̃_m(n) ≥ β(u). Thus lim_{n→∞} δ̃_i(n) exists for each i ≥ m. We then have

    lim_{n→∞} δ̃_M(n) = max[ lim_{n→∞} δ̃_M(n) − α,  γ lim_{n→∞} δ̃_{M−1}(n) + (1−γ) lim_{n→∞} δ̃_M(n) ].
Since α > 0, the second term in the maximum above must achieve the maximum in the limit. Thus,

    lim_{n→∞} δ̃_M(n) = lim_{n→∞} δ̃_{M−1}(n).     (4.113)

In the same way,

    lim_{n→∞} δ̃_{M−1}(n) = max[ lim_{n→∞} δ̃_M(n) − α,  γ lim_{n→∞} δ̃_{M−2}(n) + (1−γ) lim_{n→∞} δ̃_{M−1}(n) ].

Again, the second term must achieve the maximum, and using (4.113),

    lim_{n→∞} δ̃_{M−1}(n) = lim_{n→∞} δ̃_{M−2}(n).

Repeating this argument,

    lim_{n→∞} δ̃_i(n) = lim_{n→∞} δ̃_{i−1}(n)   for each i, m < i ≤ M.     (4.114)

Now, from (4.94), lim_{n→∞} δ_i(n) = β(u) for i ≤ m. From (4.107), then, we see that lim_{n→∞} δ̃_m(n) = β(u). Combining this with (4.114),

    lim_{n→∞} δ̃_i(n) = β(u)   for each i such that m ≤ i ≤ M.     (4.115)

Combining this with (4.110), we see that for any ε > 0 and any i, δ_i(n) ≤ β(u) + ε for large enough n. Combining this with (4.96) completes the proof.
4.7 Summary
This chapter has developed the basic results about finite-state Markov chains from a primarily algebraic standpoint. It was shown that the states of any finite-state chain can be partitioned into classes, where each class is either transient or recurrent, and each class is periodic or aperiodic. If the entire chain is one recurrent class, then the Frobenius theorem, with all its corollaries, shows that λ = 1 is an eigenvalue of largest magnitude and has positive right and left eigenvectors, unique within a scale factor. The left eigenvector (scaled to be a probability vector) is the steady-state probability vector. If the chain is also aperiodic, then the eigenvalue λ = 1 is the only eigenvalue of magnitude 1, and all rows of [P]^n converge geometrically in n to the steady-state vector. This same analysis can be applied to each aperiodic recurrent class of a general Markov chain, given that the chain ever enters that class. For a periodic recurrent chain of period d, there are d − 1 other eigenvalues of magnitude 1, with all d eigenvalues uniformly placed around the unit circle in the complex plane. Exercise 4.17 shows how to interpret these eigenvectors, and shows that [P]^{nd} converges geometrically as n → ∞. For an arbitrary finite-state Markov chain, if the initial state is transient, then the Markov chain will eventually enter a recurrent state, and the probability that this takes more than
n steps approaches zero geometrically in n; Exercise 4.14 shows how to find the probability that each recurrent class is entered. Given an entry into a particular recurrent class, then the results above can be used to analyze the behavior within that class. The results about Markov chains were extended to Markov chains with rewards. As with renewal processes, the use of reward functions provides a systematic way to approach a large class of problems ranging from first passage times to dynamic programming. The key result here is Theorem 4.9, which provides both an exact expression and an asymptotic expression for the expected aggregate reward over n stages. Finally, the results on Markov chains with rewards were used to understand Markov decision theory. We developed the Bellman dynamic programming algorithm, and also investigated the optimal stationary policy. Theorem 4.13 demonstrated the relationship between the optimal dynamic policy and the optimal stationary policy. This section provided only an introduction to dynamic programming and omitted all discussion of discounting (in which future gain is considered worth less than present gain because of interest rates). We also omitted infinite state spaces. For an introduction to vectors, matrices, and linear algebra, see any introductory text on linear algebra such as Strang [20]. Gantmacher [11] has a particularly complete treatment of non-negative matrices and Perron-Frobenius theory. For further reading on Markov decision theory and dynamic programming, see Bertsekas, [3]. Bellman [1] is of historic interest and quite readable.
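Before turning to the exercises, here is a small numerical illustration of the geometric convergence of [P]^n to eπ described in the summary. The transition matrix below is an arbitrary made-up ergodic example (not one from the text), and the sketch simply compares the rows of [P]^n with the left eigenvector of eigenvalue 1.

```python
import numpy as np

# A made-up ergodic (single recurrent class, aperiodic) transition matrix.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])

# Steady-state vector: left eigenvector of eigenvalue 1, scaled to sum to 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
pi = pi / pi.sum()

for n in (1, 2, 5, 10, 20, 40):
    Pn = np.linalg.matrix_power(P, n)
    # Maximum deviation of any row of [P]^n from the steady-state vector pi.
    print(n, np.max(np.abs(Pn - pi)))
```

The printed deviations shrink roughly by a constant factor per step, consistent with geometric convergence.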
4.8 Exercises
Exercise 4.1. a) Prove that, for a finite-state Markov chain, if Pii > 0 for some i in a recurrent class A, then class A is aperiodic. b) Show that every finite-state Markov chain contains at least one recurrent set of states. Hint: Construct a directed graph in which the states are nodes and an edge goes from i to j if i → j but i is not accessible from j. Show that this graph contains no cycles, and thus contains one or more nodes with no outgoing edges. Show that each such node is in a recurrent class. Note: this result is not true for Markov chains with countably infinite state spaces. Exercise 4.2. Consider a finite-state Markov chain in which some given state, say state 1, is accessible from every other state. Show that the chain has at most one recurrent class of states. (Note that, combined with Exercise 4.1, there is exactly one recurrent class and the chain is then a unichain.) Exercise 4.3. Show how to generalize the graph in Figure 4.4 to an arbitrary number of states M ≥ 3 with one cycle of M nodes and one of M − 1 nodes. For M = 4, let node 1 be the node not in the cycle of M − 1 nodes. List the set of states accessible from node 1 in n steps for each n ≤ 12 and show that the bound in Theorem 4.5 is met with equality. Explain why the same result holds for all larger M.
Exercise 4.4. Consider a Markov chain with one ergodic class of m states, say {1, 2, . . . , m}, and M − m other states that are all transient. Show that P_ij^n > 0 for all j ≤ m and n ≥ (m − 1)^2 + 1 + M − m.

Exercise 4.5. a) Let τ be the number of states in the smallest cycle of an arbitrary ergodic Markov chain of M ≥ 3 states. Show that P_ij^n > 0 for all n ≥ (M − 2)τ + M. Hint: Look at the last part of the proof of Theorem 4.4.

b) For τ = 1, draw the graph of an ergodic Markov chain (generalized for arbitrary M ≥ 3) for which there is an i, j for which P_ij^n = 0 for n = 2M − 3. Hint: Look at Figure 4.4.

d) For arbitrary τ < M − 1, draw the graph of an ergodic Markov chain (generalized for arbitrary M) for which there is an i, j for which P_ij^n = 0 for n = (M − 2)τ + M − 1.

Exercise 4.6. A transition probability matrix [P] is said to be doubly stochastic if

    Σ_j P_ij = 1 for all i;     Σ_i P_ij = 1 for all j.
That is, the row sum and the column sum each equal 1. If a doubly stochastic chain has M states and is ergodic (i.e., has a single class of states and is aperiodic), calculate its steady-state probabilities.

Exercise 4.7. a) Find the steady-state probabilities π_0, . . . , π_{k−1} for the Markov chain below. Express your answer in terms of the ratio ρ = p/q. Pay particular attention to the special case ρ = 1.

b) Sketch π_0, . . . , π_{k−1}. Give one sketch for ρ = 1/2, one for ρ = 1, and one for ρ = 2.

c) Find the limit of π_0 as k approaches ∞; give separate answers for ρ < 1, ρ = 1, and ρ > 1. Find limiting values of π_{k−1} for the same cases.

[Figure: birth-death chain on states 0, 1, 2, . . . , k−1 with right transitions of probability p and left transitions of probability q = 1 − p.]
Exercise 4.8. a) Find the steady-state probabilities for each of the Markov chains in Figure 4.2 of section 4.1. Assume that all clockwise probabilities in the first graph are the same, say p, and assume that P4,5 = P4,1 in the second graph. b) Find the matrices [P ]2 for the same chains. Draw the graphs for the Markov chains represented by [P ]2 , i.e., the graph of two step transitions for the original chains. Find the steady-state probabilities for these two step chains. Explain why your steady-state probabilities are not unique. c) Find limn→1 [P ]2n for each of the chains.
Exercise 4.9. Answer each of the following questions for each of the following non-negative matrices [A]
i)
∑
1 0 1 1
∏
ii)
1 0 0 1/2 1/2 0 . 0 1/2 1/2
a) Find [A]n in closed form for arbitrary n > 1. b) Find all eigenvalues and all right eigenvectors of [A]. c) Use (b) to show that there is no diagonal matrix [Λ] and no invertible matrix [Q] for which [A][Q] = [Q][Λ]. d) Rederive the result of part (c) using the result of (a) rather than (b). Exercise 4.10. a) Show that g(x ), as given in (4.21), is a continuous function of x for x ≥ 0 , x 6= 0 . b) Show that g(x ) = g(βx ) for all β > 0. Show that this implies that P the supremum of g(x ) over x ≥ 0 , x 6= 0 is the same as the supremum over x ≥ 0 , i xi = 1. Note that this shows that the supremum must be achieved, since it is a supremum of a continuous function over a closed and bounded space. Exercise 4.11. a) Show that if x1 and x2 are real or complex numbers, then |x1 + x2 | = |x1 | + |x2 | implies that for some β, βx1 and βx2 are both real and non-negative. b) Show from this that if the inequality in (4.25) is satisfied with equality, then there is some β for which βxi = |xi | for all i. Exercise 4.12. a) Let ∏ be an eigenvalue of a matrix [A], and let ∫ and π be right and left eigenvectors respectively of ∏, normalized so that π ∫ = 1. Show that [[A] − ∏∫∫ π ]2 = [A]2 − ∏2∫ π . b) Show that [[A]n − ∏n∫ π ][[A] − ∏∫∫ π ] = [A]n+1 − ∏n+1∫ π . c) Use induction to show that [[A] − ∏∫∫ π ]n = [A]n − ∏n∫ π .
Exercise 4.13. Let [P ] be the transition matrix for a Markov unichain with M recurrent states, numbered 1 to∏ M, and K transient states, J +1 to J +K. Thus [P ] can be partitioned ∑ Pr 0 as [P ] = . Ptr Ptt ∑ ∏ [Pr ]n [0] a) Show that [P ]n can be partitioned as [P ]n = . That is, the blocks on [Pijn ] [Ptt ]n the diagonal are simply products of the corresponding blocks of [P ], and the lower left block is whatever it turns out to be.
b) Let Qi be the probability that P the chain will be in a recurrent state after K transitions, starting from state i, i.e., Qi = j≤M PijK . Show that Qi > 0 for all transient i. c) Let Q be the minimum Qi over all transient i and show that PijnK ≤ (1 − Q)n for all transient i, j (i.e., show that [Ptt ]n approaches the all zero matrix [0] with increasing n).
π r , π t ) be a left eigenvector of [P ] of eigenvalue 1 (if one exists). Show that d) Let π = (π π t = 0 and show that π r must be positive and be a left eigenvector of [Pr ]. Thus show that π exists and is unique (within a scale factor). e) Show that e is the unique right eigenvector of [P ] of eigenvalue 1 (within a scale factor). Exercise 4.14. Generalize Exercise 4.13 to the case of a Markov chain [P ] with r recurrent classes and one or more transient classes. In particular, a) Show that [P ] has exactly r linearly independent left eigenvectors, π (1) , π (2) , . . . , π (r) of eigenvalue 1, and that the ith can be taken as a probability vector that is positive on the ith recurrent class and zero elsewhere. b) Show that [P ] has exactly r linearly independent right eigenvectors, ∫ (1) , ∫ (2) , . . . , ∫ (r) (i) of eigenvalue 1, and that the ith can be taken as a vector with ∫j equal to the probability that recurrent class i will ever be entered starting from state j. Exercise 4.15. Prove Theorem 4.8. Hint: Use Theorem 4.7 and the results of Exercise 4.13. Exercise 4.16. Generalize Exercise 4.15 to the case of a Markov chain [P ] with r aperiodic recurrent classes and one or more transient classes. In particular, using the left and right eigenvectors π (1) , π (2) , . . . , π (r) and ∫ (1) , . . . , ∫ (r) of Exercise 4.14, show that X ∫ (i)π (i) . lim [P ]n = n→1
i
Exercise 4.17. Suppose a Markov chain with an irreducible matrix [P ] is periodic with period d and let Ti , 1 ≤ i ≤ d, be the ith subset in the sense of Theorem 4.3. Assume the states are numbered so that the first M1 states are in T1 , the next J2 are in T2 , and so forth. Thus [P ] has the block form given by
0
0 . [P ] = .. 0
[Pd ]
[P1 ] 0 .. . 0 0
..
.
..
.
..
.
0 .. . .. .
[P2 ] .. .. . . .. .. . . [Pd−1 ] .. .. . . 0
where [Pi ] has dimension Mi by Mi+1 for i < d and Md by M1 for i = d
a) Show that [P ]d has the form
.. . 0 0 [Q1 ] d . . .. [P ] = 0 [Q2 ] . . .. . [Qd ] 0 0
where [Qi ] = [Pi ][Pi+1 ] . . . [Pd ][P1 ] . . . [Pi−1 ].
so that with the eigenvectors b) Show that [Qi ] is the matrix of an ergodic Markov P chain, nd (i) (i) defined in Exercises 4.14 and 4.16, limn→1 [P ] = i ∫ π . c) Show that πˆ (i) , the left eigenvector of [Qi ] of eigenvalue 1 satisfies πˆ (i) [Pi ] = πˆ (i+1) for i < d and πˆ (d) [Pd ] = πˆ (1) . √
π (1) , πˆ (2) eαk , πˆ (3) e2αk , . . . , πˆ (d) e(d−1)αk ). Show that π (k) d) Let α = 2ππ d −1 and let π (k) = (ˆ is a left eigenvector of [P ] of eigenvalue e−αk . Exercise 4.18. (continuation of Exercise 4.17). defined in Exercises 4.14 and 4.16, nd
lim [P ] [P ] =
n→1
d X
a) Show that, with the eigenvectors
∫ (i)π (i+1)
i=1
where π (d+1) is taken to be π (1) . b) Show that, for 1 ≤ j < d, lim [P ]nd [P ]j =
n→1
d X
∫ (i)π (i+j)
i=1
where π (d+m) is taken to be π (m) for 1 ≤ m < d. c) Show that n o lim [P ]nd I + [P ] + . . . , [P ]d−1 =
n→1
√ d X i=1
∫ (i)
!√ d X
π (i+j)
i=1
!
.
d) Show that ¥ 1≥ n π [P ] + [P ]n+1 + · · · + [P ]n+d−1 = eπ n→1 d lim
where π isPthe steady-state probability vector for [P ]. Hint: Show that e = π = (1/n) i π (i) . e) Show that the above result is also valid for periodic unichains.
P
i∫
(i)
and
Exercise 4.19. Assume a friend has developed an excellent program for finding the steadystate probabilities for finite-state Markov chains. More precisely, given the transition matrix [P], the program returns limn→1 Piin for each i. Assume all chains are aperiodic. a) You want to find the expected time to first reach a given state k starting from a different state m for a Markov chain with transition matrix [P ]. You modify the matrix to [P 0 ] where 0 0 = 0 for j 6= m, and P 0 = P otherwise. How do you find the desired first Pkm = 1, Pkj ij ij passage time from the program output given [P 0 ] as an input? (Hint: The times at which a Markov chain enters any given state can be considered as renewals in a (perhaps delayed) renewal process). b) Using the same [P 0 ] as the program input, how can you find the expected number of returns to state m before the first passage to state k? c) Suppose, for the same Markov chain [P ] and the same starting state m, you want to find the probability of reaching some given state n before the first passage to k. Modify [P ] to some [P 00 ] so that the above program with P 00 as an input allows you to easily find the desired probability. d) Let Pr{X(0) = i} = Qi , 1 ≤ i ≤ M be an arbitrary set of initial probabilities for the same Markov chain [P ] as above. Show how to modify [P ] to some [P 000 ] for which the steady-state probabilities allow you to easily find the expected time of the first passage to state k. Exercise 4.20. Suppose A and B are each ergodic Markov chains with transition probabilities {PAi ,Aj } and {PBi ,Bj } respectively. Denote the steady-state probabilities of A and B by {πAi } and {πBi } respectively. The chains are now connected and modified as shown below. In particular, states A1 and B1 are now connected and the new transition probabilities P 0 for the combined chain are given by PA0 1 ,B1 = ε, PB0 1 ,A1
= δ,
PA0 1 ,Aj = (1 − ε)PA1 ,Aj
PB0 1 ,Bj
= (1 − δ)PB1 ,Bj
for all Aj for all Bj .
All other transition probabilities remain the same. Think intuitively of ε and δ as being small, but do not make any approximations in what follows. Give your answers to the following questions as functions of ε, δ, {πAi } and {πBi }. ✬ ♥
✩ ✬
♥ ° ° ♥ A1 ② ° ✚ ° ✚ ° ✚ ♥ ♥
✫
ε δ
♥ ✚ ✚ ✚ ♥ ♥ ③ B1 ✚ ✚ ♥ ✚
✪ ✫
✩
✪
Chain A Chain B a) Assume that ≤ > 0, δ = 0 (i.e., that A is a set of transient states in the combined chain). Starting in state A1 , find the conditional expected time to return to A1 given that the first transition is to some state in chain A.
4.8. EXERCISES
189
b) Assume that ≤ > 0, δ = 0. Find TA,B , the expected time to first reach state B1 starting from state A1 . Your answer should be a function of ≤ and the original steady state probabilities {πAi } in chain A. c) Assume ε > 0, δ > 0, find TB,A , the expected time to first reach state A1 , starting in state B1 . Your answer should depend only on δ and {πBi }.
d) Assume ε > 0 and δ > 0. Find P 0 (A), the steady-state probability that the combined chain is in one of the states {Aj } of the original chain A.
e) Assume ε > 0, δ = 0. For each state Aj 6= A1 in A, find vAj , the expected number of visits to state Aj , starting in state A1 , before reaching state B1 . Your answer should depend only on ε and {πAi }. 0 , the steady-state probability of f ) Assume ε > 0, δ > 0. For each state Aj in A, find πA j being in state Aj in the combined chain. Hint: Be careful in your treatment of state A1 .
Exercise 4.21. For the Markov chain with rewards in figure 4.6, a) Find the general solution to (4.42) and then find the particular solution (the relative gain vector) with π w = 0. b) Modify Figure 4.6 by letting P12 be an arbitrary probability. Find g and w again and give an intuitive explanation of why P12 effects w2 . Exercise 4.22. a) Show that, for any i, π )r + ge. [P ]i r = ([P ]i − eπ b) Show that 4.36) can be rewritten as v (n, u) =
n−1 X i=0
π )r + nge + (π π u)e. ([P ]i − eπ
Pn−1 π ) converges in the c) Show that if [P ] is a positive stochastic matrix, then i=0 ([P ]i − eπ limit n → 1. Hint: You can use the same argument as in the proof of Corollary 4.6. Note: this sum also converges for an arbitrary ergodic Markov chain. Exercise 4.23. Consider the Markov chain below: a) Suppose the chain is started in state i and goes through n transitions; let vi (n, u) be the expected number of transitions (out of the total of n) until the chain enters the trapping state, state 1. Find an expression for v (n, u) = (v1 (n, u), v2 (n, u), v3 (n, u)) in terms of v (n − 1, u) (take v1 (n, u) = 0 for all n). (Hint: view the system as a Markov reward system; what is the value of r ?) b) Solve numerically for limn→1 v (n, u). Interpret the meaning of the elements vi in the solution of (4.30). c) Give a direct argument why (4.30) provides the solution directly to the expected time from each state to enter the trapping state.
190
CHAPTER 4. FINITE-STATE MARKOV CHAINS
✿ 2♥ ✘ PP PP 1/2 ✻ PP PP PP q ♥ 1 P 1/2 ✶ 1 ② ✏ ✏ ✏ ✏ ✏ ✏✏ ✏✏ 1/4 1/4 ✏ ✏ ✿ 3♥ ✘
1/2
Exercise 4.24. Consider a sequence of IID binary rv’s X1 , X2 , . . . . Assume that Pr{Xi = 1} = p1 , Pr{Xi = 0} = p0 = 1 − p1 . A binary string (a1 , a2 , . . . , ak ) occurs at time n if Xn = ak , Xn−1 = ak−1 , . . . Xn−k+1 = a1 . For a given string (a1 , a2 , . . . , ak ), consider a Markov chain with k + 1 states {0, 1, . . . , k}. State 0 is the initial state, state k is a final trapping state where (a1 , a2 , . . . , ak ) has already occurred, and each intervening state i, 0 < i < k, has the property that if the subsequent k − i variables take on the values ai+1 , ai+2 , . . . , ak , the Markov chain will move successively from state i to i + 1 to i + 2 and so forth to k. For example, if k = 2 and (a1 , a2 ) = (0, 1), the corresponding chain is given by 0 1 ☎ ③ ❧ ③ ❧ ❧ 0 1 2✘ ② ✆ ② 0 a) For the chain above, find the mean first passage time from state 0 to state 2. b) For parts b to d, let (a1 , a2 , a3 , . . . , ak ) = (0, 1, 1, . . . , 1), i.e., zero followed by k − 1 ones. Draw the corresponding Markov chain for k = 4. c) Let vi , 1 ≤ i ≤ k be the expected first passage time from state i to state k. Note that vk = 0. Show that v0 = 1/p0 + v1 . d) For each i, 1 ≤ i < k, show that vi = αi + vi+1 and v0 = βi + vi+1 where αi and βi are each a product of powers of p0 and p1 . Hint: use induction, or iteration, starting with i = 1, and establish both equalities together. e) Let k = 3 and let (a1 , a2 , a3 ) = (1, 0, 1). Draw the corresponding Markov chain for this string. Evaluate v0 , the expected first passage time for the string 1,0,1 to occur. f ) Use renewal theory to explain why the answer in part e is different from that in part d with k = 3.
Exercise 4.25. a) Find limn→1 [P ]n for the Markov chain below. Hint: Think in terms of the long term transition probabilities. Recall that the edges in the graph for a Markov chain correspond to the positive transition probabilities. b) Let π (1) and π (2) denote the first two rows of limn→1 [P ]n and let ∫ (1) and ∫ (2) denote the first two columns of limn→1 [P ]n . Show that π (1) and π (2) are independent left eigenvectors of [P ], and that ∫ (1) and ∫ (2) are independent right eigenvectors of [P ]. Find the eigenvalue for each eigenvector.
4.8. EXERCISES
1
✛ P31 ✿ 1♥ ✘
3♥ ❈❖
P32✲ ♥ 1 2 ②
P33
c) Let r be an arbitrary reward vector and consider the equation w + g (1)∫ (1) + g (2)∫ (2) = r + [P ]w .
(4.116)
Determine what values g (1) and g (2) must have in order for (4.84) to have a solution. Argue that with the additional constraints w1 = w2 = 0, (4.84) has a unique solution for w and find that w . d) Show that, with the w above, w 0 = w + α∫∫ (1) + β∫∫ (2) satisfies (4.84) for all choices of scalars α and β. e) Assume that the reward at stage 0 is u = w . Show that v (n, w ) = n(g (1)∫ (1) +g (2)∫ (2) )+ w. f ) For an arbitrary reward u at stage 0, show that v (n, u) = n(g (1)∫ (1) + g (2)∫ (2) ) + w + [P ]n (u − w ). Note that this verifies (4.49-4.51) for this special case. Exercise 4.26. Generalize Exercise 4.25 to the general case of two recurrent classes and an arbitrary set of transient states. In part (f), you will have to assume that the recurrent classes are ergodic. Hint: generalize the proof of Lemma 4.1 and Theorem 4.9 Exercise 4.27. Generalize Exercise 4.26 to an arbitrary number of recurrent classes and an arbitrary number of transient states. This verifies (4.49-4.51) in general. Exercise 4.28. Let u and u 0 be arbitrary final reward vectors with u ≤ u 0 .
a) Let k be an arbitrary stationary policy and prove that v k (n, u) ≤ v k (n, u 0 ) for each n ≥ 1.
b) Prove that v ∗ (n, u) ≤ v ∗ (n, u 0 ) for each n ≥ 1. This is known as the monotonicity theorem.
Exercise 4.29. George drives his car to the theater, which is at the end of a one-way street. There are parking places along the side of the street and a parking garage that costs $5 at the theater. Each parking place is independently occupied or unoccupied with probability 1/2. If George parks n parking places away from the theater, it costs him n cents (in time and shoe leather) to walk the rest of the way. George is myopic and can only see the parking place he is currently passing. If George has not already parked by the time he reaches the nth place, he first decides whether or not he will park if the place is unoccupied, and then observes the place and acts according to his decision. George can never go back and must park in the parking garage if he has not parked before. a) Model the above problem as a 2 state Markov decision problem. In the “driving” state, state 2, there are two possible decisions: park if the current place is unoccupied or drive on whether or not the current place is unoccupied.
192
CHAPTER 4. FINITE-STATE MARKOV CHAINS
b) Find vi∗ (n, u), the minimum expected aggregate cost for n stages (i.e., immediately before observation of the nth parking place) starting in state i = 1 or 2; it is sufficient to express vi∗ (n, u) in times of vi∗ (n − 1). The final costs, in cents, at stage 0 should be v2 (0) = 500, v1 (0) = 0. c) For what values of n is the optimal decision the decision to drive on? d) What is the probability that George will park in the garage, assuming that he follows the optimal policy? Exercise 4.30. Consider the dynamic programming problem below with two states and two possible policies, denoted k and k 0 . The policies differ only in state 2. 1/2 3/4 1/2 7/8 1/2 1/2 ③ ♥ ③ ♥ ♥ ♥ 2 2 ✿ 1 ✘ ② ✿ 1 ✘ ② ② ② 0 1/4 1/8 k =5 k r2 r2 =6 r1 =0 r1 =0 a) Find the steady-state gain per stage, g and g 0 , for stationary policies k and k 0 . Show that g = g 0 . b) Find the relative gain vectors, w and w 0 , for stationary policies k and k 0 . c) Suppose the final reward, at stage 0, is u1 = 0, u2 = u. For what range of u does the dynamic programming algorithm use decision k in state 2 at stage 1? d) For what range of u does the dynamic programming algorithm use decision k in state 2 at stage 2? at stage n? You should find that (for this example) the dynamic programming algorithm uses the same decision at each stage n as it uses in stage 1. e) Find the optimal gain v2∗ (n, u) and v1∗ (n, u) as a function of stage n assuming u = 10. f ) Find limn→1 v ∗ (n, u) and show how it depends on u. Exercise 4.31. Consider a Markov decision problem in which the stationary policies k and k 0 each satisfy Bellman’s equation, (4.60) and each correspond to ergodic Markov chains. 0
0
a) Show that if r k + [P k ]w 0 ≥ r k + [P k ]w 0 is not satisfied with equality, then g 0 > g. 0
0
b) Show that r k + [P k ]w 0 = r k + [P k ]w 0 (Hint: use part a).
c) Find the relationship between the relative gain vector w k for policy k and the relative gain vector w 0 for policy k 0 . (Hint: Show that r k + [P k ]w 0 = ge + w 0 ; what does this say about w and w 0 ?) e) Suppose that policy k uses decision 1 in state 1 and policy k 0 uses decision 2 in state 1 (i.e., k1 = 1 for policy k and k1 = 2 for policy k 0 ). What is the relationship between (k) (k) (k) (k) r1 , P11 , P12 , . . . P1J for k equal to 1 and 2? f ) Now suppose that policy k uses decision 1 in each state and policy k 0 uses decision 2 in (1) (2) each state. Is it possible that ri > ri for all i? Explain carefully. (1)
g) Now assume that ri Explain.
is the same for all i. Does this change your answer to part f)?
4.8. EXERCISES
193
Exercise 4.32. Consider a Markov decision problem with three states. Assume that each stationary policy corresponds to an ergodic Markov chain. It is known that a particular policy k 0 = (k1 , k2 , k3 ) = (2, 4, 1) is the unique optimal stationary policy (i.e., the gain per stage in steady-state is maximized by always using decision 2 in state 1, decision 4 in state (k) 2, and decision 1 in state 3). As usual, ri denotes the reward in state i under decision k, (k) and Pij denotes the probability of a transition to state j given state i and given the use of decision k in state i. Consider the effect of changing the Markov decision problem in each of the following ways (the changes in each part are to be considered in the absence of the changes in the other parts): (1)
(1)
(2)
(2)
a) r1 is replaced by r1 − 1. b) r1 is replaced by r1 + 1. (k)
c) r1
(k)
is replaced by r1 + 1 for all state 1 decisions k. (ki )
d) for all i, ri
is replaced by r(ki ) + 1 for the decision ki of policy k 0 .
For each of the above changes, answer the following questions; give explanations: 1) Is the gain per stage, g 0 , increased, decreased, or unchanged by the given change? 2) Is it possible that another policy, k 6= k 0 , is optimal after the given change? Exercise 4.33. (The Odoni Bound) Let k 0 be the optimal stationary policy for a Markov decision problem and let g 0 and π 0 be the corresponding gain and steady-state probability respectively. Let vi∗ (n, u) be the optimal dynamic expected reward for starting in state i at stage n. a) Show that mini [vi∗ (n, u) − vi∗ (n − 1)] ≤ g 0 ≤ maxi [vi∗ (n, u) − vi∗ (n − 1)] ; n ≥ 1. Hint: Consider premultiplying v ∗ (n, u) − v ∗ (n − 1) by π 0 or π 0 where k is the optimal dynamic policy at stage n. b) Show that the lower bound is non-decreasing in n and the upper bound is non-increasing in n and both converge to g 0 with increasing n. Exercise 4.34. Consider a Markov decision problem with three states, {1, 2, 3}. For state (1) (2) (1) (2) 3, there are two decisions, r3 = r3 = 0 and P3,1 = P3,2 = 1. For state 1, there are two (1)
decisions, r1 = 0, (1)
r2 = 0,
(2)
(2)
(1)
(2)
r1 = −100 and P1,1 = P1,3 = 1. For state 2, there are two decisions, (1)
(2
r2 = −100 and P2,1 = P2,3 = 1.
a) Show that there are two ergodic unichain optimal stationary policies, one using decision 1 in states 1 and 3 and decision 2 in state 2. The other uses the opposite decision in each state. b) Find the relative gain vector for each of the above stationary policies. c) Let u be the final reward vector. Show that the first stationary policy above is the optimal dynamic policy in all stages if u1 ≥ u2 + 100 and u3 ≥ u2 + 100. Show that a non-unichain stationary policy is the optimal dynamic policy if u1 = u2 = u3
194
CHAPTER 4. FINITE-STATE MARKOV CHAINS
c) Theorem 4.13 implies that, under the conditions of the theorem, limn→1 [vi∗ (n, u) − vj∗ (n, u)] is independent of u. Show that this is not true for the conditions of this exercise. Exercise 4.35. Assume that k 0 is a unique optimal stationary policy and corresponds to an ergodic unichain (as in Theorem 4.13). Let w 0 and g 0 be the relative gain and gain per stage for k 0 and let u be an arbitrary final reward vector. 0 ). Show that for each i and each k 6= k 0 , there is some α > 0 such a) Let k 0 = (k10 , k20 , ..., kM i that for each i and k 6= ki0 , (ki0 )
ri
+
X j
(k0 )
Pij i wj0 ≥ +rik +
X
Pijk wj0 + α.
j
Hint: Look at the proof of Lemma 4.5 b) Show that there is some n0 such that for all n ≥ n0 , Ø Ø Ø ∗ Ø Øvj (n − 1) − (n − 1)g 0 − wj0 − β(u)Ø < α/2
where β(u) is given in Theorem 4.13.
c) Use part b) to show that for all i and all n ≥ n0 , (ki0 )
ri
+
X j
(k0 )
(ki0 )
Pij i vj∗ (n − 1) > +ri
+
X j
(k0 )
Pij i wj0 + (n − 1)g 0 + β(u) − α/2.
d) Use parts a) and b) to show that for all i, all n ≥ n0 , and all k 6= ki0 , rik +
X j
(ki0 )
Pijk vj∗ (n − 1) < +ri
+
X j
(k0 )
Pij i wj0 + (n − 1)g 0 + β(u) − α/2.
e Combine parts c) and d) to conclude that the optimal dynamic policy uses policy k 0 for all n ≥ n0 . Exercise 4.36. Consider an integer time queueing system with a finite buffer of size 2. At the beginning of the nth time interval, the queue contains at most two customers. There is a cost of one unit for each customer in queue (i.e., the cost of delaying that customer). If there is one customer in queue, that customer is served. If there are two customers, an extra server is hired at a cost of 3 units and both customers are served. Thus the total immediate cost for two customers in queue is 5, the cost for one customer is 1, and the cost for 0 customers is 0. At the end of the nth time interval, either 0, 1, or 2 new customers arrive (each with probability 1/3). a) Assume that the system starts with 0 ≤ i ≤ 2 customers in queue at time −1 (i.e., in stage 1) and terminates at time 0 (stage 0) with a final cost u of 5 units for each customer in queue (at the beginning of interval 0). Find the expected aggregate cost vi (1, u) for 0 ≤ i ≤ 2.
b) Assume now that the system starts with i customers in queue at time −2 with the same final cost at time 0. Find the expected aggregate cost vi (2, u) for 0 ≤ i ≤ 2. c) For an arbitrary starting time −n, find the expected aggregate cost vi (n, u) for 0 ≤ i ≤ 2. d) Find the cost per stage and find the relative cost (gain) vector. e) Now assume that there is a decision maker who can choose whether or not to hire the extra server when there are two customers in queue. If the extra server is not hired, the 3 unit fee is saved, but only one of the customers is served. If there are two arrivals in this case, assume that one is turned away at a cost of 5 units. Find the minimum dynamic aggregate expected cost vi∗ (1), 0 ≤ i ≤, for stage 1 with the same final cost as before. f ) Find the minimum dynamic aggregate expected cost vi∗ (n, u) for stage n, 0 ≤ i ≤ 2. g) Now assume a final cost u of one unit per customer rather than 5, and find the new minimum dynamic aggregate expected cost vi∗ (n, u), 0 ≤ i ≤ 2. Exercise 4.37. Consider a finite-state ergodic Markov chain {Xn ; n ≥ 0} with an integer valued set of states {−K, −K+1, . . . , −1, 0, 1, . . . , +K}, a set of transition probabilities Pij ; −K ≤ i, j ≤ K, and initial state X0 = 0. One example of such a chain is given by: 0.9
0.9
0.9
✎✄ 0.1 ✄✎ ✄✎ ✲ 0♥ 0.1✲ 1♥ -1♥ ②
0.1
P Let {Sn ; n ≥ 0} be a stochastic process with Sn = ni=0 Xi . Parts (a), (b), and (c) are independent of parts (d) and (e). Parts (a), (b), and (c) should be solved both for the special case in the above graph and for the general case. a) Find limn→1 E [Xn ] for the example and express limn→1 E [Xn ] in terms of the steadystate probabilities of {Xn , n ≥ 0} for the general case. b) Show that limn→1 Sn /n exists with probability one and find the value of the limit. Hint: apply renewal-reward theory to {Xn ; n ≥ 0}. c) Assume that limn→1 E [Xn ] = 0. Find limn→1 E [Sn ]. d) Show that Pr{Sn =sn | Sn−1 =sn−1 , Sn−2 =sn−2 , Sn−3 =sn−3 , . . . , S0 =0} = Pr{Sn =sn | Sn−1 =sn−1 , Sn−2 =sn−2 } . e) Let Y n = (Sn , Sn−1 ) (i. e., Y n is a random two dimensional integer valued vector). Show that {Y n ; n ≥ 0} (where Y 0 = (0, 0)) is a Markov chain. Describe the transition probabilities of {Y n ; n ≥ 0} in terms of {Pij }.
Exercise 4.38. Consider a Markov decision problem with M states in which some state, say state 1, is inherently reachable from each other state. a) Show that there must be some other state, say state 2, and some decision, k2 , such that (k ) P21 2 > 0. b) Show that there must be some other state, say state 3, and some decision, k3 , such that (k ) (k ) either P31 3 > 0 or P32 3 > 0. c)Assume, for some i, and some set of decisions k2 , . . . , ki that, for each j, 2 ≤ j ≤ i, (k ) Pjl j > 0 for some l < j (i.e., that each state from 2 to j has a non-zero transition to a lower numbered state). Show that there is some state (other than 1 to i), say i + 1 and (ki+1 ) some decision ki+1 such that Pi+1,l > 0 for some l ≤ i. d) Use parts a), b), and c) to observe that there is a stationary policy k = k1 , . . . , kM for which state 1 is accessible from each other state.
Chapter 5

COUNTABLE-STATE MARKOV CHAINS

5.1 Introduction and classification of states
Markov chains with a countably-infinite state space (more briefly, countable-state Markov chains) exhibit some types of behavior not possible for chains with a finite state space. Figure 5.1 helps explain how these new types of behavior arise. If the right-going transitions p in the figure satisfy p > 1/2, then transitions to the right occur with higher frequency than transitions to the left. Thus, reasoning heuristically, we expect the state X_n at time n to drift to the right with increasing n. Given X_0 = 0, the probability P_0j^n of being in state j at time n should then tend to zero for any fixed j with increasing n. If one tried to define the steady-state probability of state j as lim_{n→∞} P_0j^n, then this limit would be 0 for all j. These probabilities would not sum to 1, and thus would not correspond to a limiting distribution. Thus we say that a steady-state does not exist. In more poetic terms, the state wanders off into the wild blue yonder.
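This drift is easy to see numerically. The short simulation below is a minimal sketch (the choice p = 0.6 and the trial counts are illustrative assumptions, not values from the text); it runs the chain of Figure 5.1 from X_0 = 0 and estimates Pr{X_n = j} for a few fixed j, which shrink toward 0 as n grows.

```python
import random

def simulate(p, n, trials=10000):
    """Estimate the distribution of X_n for the chain of Figure 5.1, starting at X_0 = 0."""
    counts = {}
    for _ in range(trials):
        x = 0
        for _ in range(n):
            # Move right with probability p; otherwise move left (state 0 holds its place).
            if random.random() < p:
                x += 1
            elif x > 0:
                x -= 1
        counts[x] = counts.get(x, 0) + 1
    return {j: c / trials for j, c in counts.items()}

for n in (10, 100, 1000):
    dist = simulate(0.6, n)
    print(n, [round(dist.get(j, 0.0), 3) for j in range(5)])  # Pr{X_n = j} for j = 0..4
```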
[Figure: states 0, 1, 2, 3, 4, . . . , with transitions to the right having probability p and transitions to the left (and the self-loop at state 0) having probability q = 1 − p.]
Figure 5.1: A Markov chain with a countable state space. If p > 1/2, then as time n increases, the state X_n becomes large with high probability, i.e., lim_{n→∞} Pr{X_n ≥ j} = 1 for each integer j.

The truncation of Figure 5.1 to k states is analyzed in Exercise 4.7. The solution there defines ρ = p/q and shows that if ρ ≠ 1, then π_i = (1 − ρ)ρ^i/(1 − ρ^k) for each i, 0 ≤ i < k. For ρ = 1, π_i = 1/k for each i. For ρ < 1, the limiting behavior as k → ∞ is π_i = (1 − ρ)ρ^i. Thus for ρ < 1, the truncated Markov chain is similar to the untruncated chain. For ρ > 1, on the other hand, the steady-state probabilities for the truncated case are geometrically decreasing from the right, and the states with significant probability keep moving to the right
as k increases. Although the probability of each fixed state j approaches 0 as k increases, the truncated chain never resembles the untruncated chain. This example is further studied in Section 5.3, which considers a generalization known as birth-death Markov chains.

Fortunately, the strange behavior of Figure 5.1 when p > q is not typical of the Markov chains of interest for most applications. For typical countable-state Markov chains, a steady-state does exist, and the steady-state probabilities of all but a finite number of states (the number depending on the chain and the application) can almost be ignored for numerical calculations.

It turns out that the appropriate tool to analyze the behavior, and particularly the long term behavior, of countable-state Markov chains is renewal theory. In particular, we will first revise the definition of recurrent states for finite-state Markov chains to cover the countable-state case. We then show that for any given recurrent state j, the sequence of discrete time epochs n at which the state X_n is equal to j essentially forms a renewal process.1 The renewal theorems then specify the time-average relative-frequency of state j, the limiting probability of j with increasing time, and a number of other relations.

To be slightly more precise, we want to understand the sequence of epochs at which one state, say j, is entered, conditional on starting the chain either at j or at some other state, say i. We will see that, subject to the classification of states i and j, this gives rise to a delayed renewal process. In preparing to study this delayed renewal process, we need to understand the inter-renewal intervals. The probability mass functions (PMF's) of these intervals are called first-passage-time probabilities in the notation of Markov chains.

Definition 5.1. The first-passage-time probability, f_ij(n), is the probability that the first entry to state j occurs at discrete time n (for n ≥ 1), given that X_0 = i. That is, for n = 1, f_ij(1) = P_ij. For n ≥ 2,

    f_ij(n) = Pr{X_n = j, X_{n−1} ≠ j, X_{n−2} ≠ j, . . . , X_1 ≠ j | X_0 = i}.
(5.1)
For n ≥ 2, note the distinction between f_ij(n) and P_ij^n = Pr{X_n = j | X_0 = i}. The definition in (5.1) also applies for j = i; f_ii(n) is thus the probability, given X_0 = i, that the first occurrence of state i after time 0 occurs at time n. Since the transition probabilities are independent of time, f_ij(n−1) is also the probability, given X_1 = i, that the first subsequent occurrence of state j occurs at time n. Thus we can calculate f_ij(n) from the iterative relations

    f_ij(n) = Σ_{k≠j} P_ik f_kj(n−1),  n > 1;     f_ij(1) = P_ij.     (5.2)

With this iterative approach, the first-passage-time probabilities f_ij(n) for a given n must be calculated for all i before proceeding to calculate them for the next larger value of n. This also gives us f_jj(n), although f_jj(n) is not used in the iteration.
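The iteration in (5.2) translates directly into a few lines of code. The sketch below is a minimal illustration: the transition matrix and the function name are assumptions made only for this example, not anything defined in the text. It computes f_ij(n) for all i and one fixed target state j, for n = 1, 2, . . . , N.

```python
import numpy as np

def first_passage_probs(P, j, N):
    """Return an array f with f[n-1, i] = f_ij(n), computed by the iteration in (5.2)."""
    M = P.shape[0]
    f = np.zeros((N + 1, M))
    f[1, :] = P[:, j]                      # f_ij(1) = P_ij
    for n in range(2, N + 1):
        for i in range(M):
            # Sum over k != j of P_ik * f_kj(n-1).
            f[n, i] = sum(P[i, k] * f[n - 1, k] for k in range(M) if k != j)
    return f[1:]

# Illustrative 3-state chain (made-up numbers).
P = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.1, 0.5],
              [0.6, 0.3, 0.1]])
print(first_passage_probs(P, j=0, N=5))
```

Summing the rows over n approximates F_ij(n), defined next.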
With this iterative approach, the first passage time probabilities fij (n) for a given n must be calculated for all i before proceeding to calculate them for the next larger value of n. This also gives us fjj (n), although fjj (n) is not used in the iteration. 1 We say ‘essentially forms a renewal process’ because we haven’t yet specified the exact conditions upon which these returns to a given state form a renewal process. Note, however, that since we start the process in state j at time 0, the time at which the first renewal occurs is the same as the interval between successive renewals.
Let F_ij(n), for n ≥ 1, be the probability, given X_0 = i, that state j occurs at some time between 1 and n inclusive. Thus,

    F_ij(n) = Σ_{m=1}^{n} f_ij(m).     (5.3)
For each i, j, F_ij(n) is non-decreasing in n and (since it is a probability) is upper bounded by 1. Thus F_ij(∞), i.e., lim_{n→∞} F_ij(n), must exist, and is the probability, given X_0 = i, that state j will ever occur. If F_ij(∞) = 1, then, given X_0 = i, it is certain (with probability 1) that the chain will eventually enter state j. In this case, we can define a random variable (rv) T_ij, conditional on X_0 = i, as the first passage time from i to j. Then f_ij(n) is the probability mass function of T_ij and F_ij(n) is the distribution function of T_ij. If F_ij(∞) < 1, then T_ij is a defective rv, since, with some non-zero probability, there is no first passage to j. Defective rv's are not considered to be rv's (in the theorems here or elsewhere), but they do have many of the properties of rv's.

The first passage time T_jj from a state j back to itself is of particular importance. It has the PMF f_jj(n), the distribution function F_jj(n), and is a rv (as opposed to a defective rv) if F_jj(∞) = 1, i.e., if the state eventually returns to state j with probability 1 given that it starts in state j. This leads to the definition of recurrence.

Definition 5.2. A state j in a countable-state Markov chain is recurrent if F_jj(∞) = 1. It is transient if F_jj(∞) < 1.

Thus each state j in a countable-state Markov chain is either recurrent or transient, and is recurrent if and only if (iff) an eventual return to j occurs W.P.1, given that X_0 = j. Equivalently, j is recurrent iff T_jj, the time to first return to j, is a rv. Note that for the special case of finite-state Markov chains, this definition is consistent with the one in Chapter 4. For a countably-infinite state space, however, the earlier definition is not adequate; for example, i and j communicate for all states i and j in Figure 5.1, but for p > 1/2, each state is transient (this is shown in Exercise 5.2, and further explained in Section 5.3).

If state j is recurrent, and if the initial state is specified to be X_0 = j, then T_jj is the integer time of the first recurrence of state j. At that recurrence, the Markov chain is in the same state j as it started in, and the discrete interval from T_jj to the next occurrence of state j, say T_jj,2, has the same distribution as T_jj and is clearly independent of T_jj. Similarly, the sequence of successive recurrence intervals, T_jj, T_jj,2, T_jj,3, . . . is a sequence of IID rv's. This sequence of recurrence intervals2 is then the sequence of inter-renewal intervals of a renewal process, where each renewal interval has the distribution of T_jj. These inter-renewal intervals have the PMF f_jj(n) and the distribution function F_jj(n).

Since results about Markov chains depend very heavily on whether states are recurrent or transient, we will look carefully at the probabilities F_ij(n). Substituting (5.2) into (5.3), we
Note that in Chapter 3 the inter-renewal intervals were denoted X1 , X2 , . . . , whereas here X0 , X1 , . . . , is the sequence of states in the Markov chain and Tjj , Tjj,2 , . . . , is the sequence of inter-renewal intervals.
200
CHAPTER 5. COUNTABLE-STATE MARKOV CHAINS
obtain Fij (n) = Pij +
X k6=j
Pik Fkj (n − 1);
n > 1;
Fij (1) = Pij .
(5.4)
P In the expression Pij + k6=j Pik FkjP (n − 1), note that Pij is the probability that state j is entered on the first transition, and k6=j Pik Fkj (n − 1) is the sum, over every other state k, of the joint probability that k is entered on the first transition and that j is entered on one of the subsequent n − 1 transitions. For each i, Fij (n) is non-decreasing in n and upper bounded by 1 (this can be seen from (5.3), and can also be established directly from (5.4) by induction). Thus it can be shown that the limit as n → 1 exists and satisfies Fij (1) = Pij +
X
Pik Fkj (1).
(5.5)
k6=j
There is not always a unique solution to (5.5). That is, the set of equations xij = Pij +
X
Pik xkj ;
k6=j
all i ≥ 0
(5.6)
always has a solution in which xij = 1 for all i ≥ 0, but if state j is transient, there is another solution in which xij is the true value of Fij (1) and Fjj (1) < 1. Exercise 5.1 shows that if (5.6) is satisfied by a set of non-negative numbers {xij ; 1 ≤ i ≤ J}, then Fij (1) ≤ xij for each i. We have defined a state j to be recurrent if Fjj (1) = 1 and have seen that if j is recurrent, then the returns to state j, given X0 = j form a renewal process, and all of the results of renewal theory can then be applied to the random sequence of integer times at which j is entered. Our next objective is to show that all states in the same class as a recurrent state are also recurrent. Recall that two states are in the same class if they communicate, i.e., each has a path to the other. For finite-state Markov chains, the fact that either all states in the same class are recurrent or all transient was relatively obvious, but for countable-state Markov chains, the definition of recurrence has been changed and the above fact is no longer obvious. We start with a lemma that summarizes some familiar results from Chapter 3. Lemma 5.1. Let {Njj (t); t ≥ 0} be the counting process for occurrences of state j up to time t in a Markov chain with X0 = j. The following conditions are then equivalent. 1. state j is recurrent. 2. limt→1 Njj (t) = 1 with probability 1. 3. limt→1 E [Njj (t)] = 1. 4. limt→1
P
n 1≤n≤t Pjj
= 1.
Proof: First assume that j is recurrent, i.e., that Fjj (1) = 1. This implies that the inter-renewal times between occurrences of j are IID rv’s, and consequently {Njj (t); t ≥ 1} is a renewal counting process. Recall from Lemma 3.1 of Chapter 3 that, whether or not the expected inter-renewal time E [Tjj ] is finite, limt→1 Njj (t) = 1 with probability 1 and limt→1 E [Njj (t)] = 1. Next assume that state j is transient. In this case, the inter-renewal time Tjj is not a rv, so {Njj (t); t ≥ 0} is not a renewal process. An eventual return to state j occurs only with probability Fjj (1) < 1, and, since subsequent returns are independent, the total number of returns to state j is a geometric rv with mean Fjj (1)/[1 − Fjj (1)]. Thus the total number of returns is finite with probability 1 and the expected total number of returns is finite. This establishes the first three equivalences. n , the probability of a transition to state j at integer time n, is equal Finally, note that Pjj to the expectation of a transition to j at integer time n (i.e., a single transition occurs n and 0 occurs otherwise). Since N (t) is the sum of the number of with probability Pjj jj transitions to j over times 1 to t, we have X n E [Njj (t)] = Pjj , 1≤n≤t
which establishes the final equivalence. Next we show that if one state in a class is recurrent, then the others are also. Lemma 5.2. If state j is recurrent and states i and j are in the same class, i.e., i and j communicate, then state i is also recurrent. P n = 1. Since j and i commuProof: From Lemma 5.1, state j satisfies limt→1 1≤n≤t Pjj m nicate, there are integers m and k such that Pij > 0 and Pjik > 0. For every walk from state j to j in n steps, there is a corresponding walk from i to i in m + n + k steps, going from i to j in m steps, j to j in n steps, and j back to i in k steps. Thus n k Piim+n+k ≥ Pijm Pjj Pji 1 X
n=1
Piin ≥
1 X
n=1
Piim+n+k ≥ Pijm Pjik
1 X
n=1
n Pjj = 1.
Thus, from Lemma 5.1, i is recurrent, completing the proof. Since each state in a Markov chain is either recurrent or transient, and since, if one state in a class is recurrent, all states in that class are recurrent, we see that if one state in a class is transient, they all are. Thus we can refer to each class as being recurrent or transient. This result shows that Theorem 4.1 also applies to countable-state Markov chains. We state this theorem separately here to be specific. Theorem 5.1. For a countable-state Markov chain, either all states in a class are transient or all are recurrent.
We next look at the delayed counting process {Nij (n); n ≥ 1}. We saw that for finitestate ergodic Markov chains, the effect of the starting state eventually dies out. In order to find the conditions under which this happens with countable-state Markov chains, we compare the counting processes {Njj (n); n ≥ 1} and {Nij (n); n ≥ 1}. These differ only in the starting state, with X0 = j or X0 = i. In order to use renewal theory to study these counting processes, we must first verify that the first-passage-time from i to j is a rv. The following lemma establishes the rather intuitive result that if state j is recurrent, then from any state i accessible from j, there must be an eventual return to j. Lemma 5.3. Let states i and j be in the same recurrent class. Then Fij (1) = 1. Proof: First assume that Pji > 0, and assume for the sake of establishing a contradiction that Fij (1) < 1. Since Fjj (1) = 1, we can apply (5.5) to Fjj (1), getting X X Pjk Fkj (1) < Pjj + Pjk = 1, 1 = Fjj (1) = Pjj + k6=j
k6=j
where the strict inequality follows since Pji Fij (1) < Pji by assumption. This is a contradiction, so Fij (1) = 1 for every i accessible from j in one step. Next assume that Pji2 > 0, say with Pjm > 0 and Pmi > 0. For a contradiction, again assume that Fij (1) < 1. From (5.5), X X Fmj (1) = Pmj + Pmk Fkj (1) < Pmj + Pmk = 1, k6=j
k6=j
where the strict inequality follows since Pmi Fij (1) < Pmi . This is a contradiction, since m is accessible from j in one step, and thus Fmj (1) = 1. It follows that every i accessible from j in two steps satisfies Fij (1) = 1. Extending the same argument for successively larger numbers of steps, the conclusion of the lemma follows. Lemma 5.4. Let {Nij (t); t ≥ 0} be the counting process for transitions into state j up to time t for a Markov chain given X0 = i 6= j. Then if i and j are in the same recurrent class, {Nij (t); t ≥ 0} is a delayed renewal process. Proof: From Lemma 5.3, Tij , the time until the first transition into j, is a rv. Also Tjj is a rv by definition of recurrence, and subsequent intervals between occurrences of state j are IID, completing the proof. If Fij (1) = 1, we have seen that the first-passage time from i to j is a rv, i.e., is finite with probability 1. In this case, the mean time T ij to first enter state j starting from state i is of interest. Since Tij is a non-negative random variable, its expectation is the integral of its complementary distribution function, T ij = 1 +
1 X
(1 − Fij (n)).
(5.7)
n=1
It is possible to have Fij (1) = 1 but T ij = 1. As will be shown in Section 5.3, the chain in Figure 5.1 satisfies Fij (1) = 1 and T ij < 1 for p < 1/2 and Fij (1) = 1 and T ij = 1 for p = 1/2. As discussed before, Fij (1) < 1 for p > 1/2. This leads us to the following definition.
Definition 5.3. A state j in a countable-state Markov chain is positive-recurrent if Fjj (1) = 1 and T jj < 1. It is null-recurrent if Fjj (1) = 1 and T jj = 1. Each state of a Markov chain is thus classified as one of the following three types — positiverecurrent, null-recurrent, or transient. For the example of Figure 5.1, null-recurrence lies on a boundary between positive-recurrence and transience, and this is often a good way to look at null-recurrence. Part f) of Exercise 6.1 illustrates another type of situation in which null-recurrence can occur. Assume that state j is recurrent and consider the renewal process {Njj (t); t ≥ 0}. The limiting theorems for renewal processes can be applied directly. From the strong law for renewal processes, Theorem 3.1, lim Njj (t)/t = 1/T jj
t→1
with probability 1.
(5.8)
From the elementary renewal theorem, Theorem 3.4, lim E [Njj (t)/t] = 1/T jj .
t→1
(5.9)
Equations (5.8) and (5.9) are valid whether j is positive-recurrent or null-recurrent. Next we apply Blackwell’s theorem to {Njj (t); t ≥ 0}. Recall that the period of a given state j in a Markov chain (whether the chain has a countable or finite number of states) is n > 0. If this period the greatest common divisor of the set of integers n > 0 such that Pjj is d, then {Njj (t); t ≥ 0} is arithmetic with span d (i.e., renewals occur only at times that are multiples of d). From Blackwell’s theorem in the arithmetic form of (3.20), lim Pr{Xnd = j | X0 = j} = d/T jj .
n→1
(5.10)
If state j is aperiodic (i.e., d = 1), this says that lim_{n→∞} Pr{Xn = j | X0 = j} = 1/T jj .

Equations (5.8) and (5.9) suggest that 1/T jj has some of the properties associated with a steady-state probability of state j, and (5.10) strengthens this if j is aperiodic. For a Markov chain consisting of a single class of states, all positive-recurrent, we will strengthen this association further in Theorem 5.4 by showing that there is a unique steady-state distribution {πj , j ≥ 0} such that πj = 1/T jj for all j, πj = ∑_i πi Pij for all j ≥ 0, and ∑_j πj = 1. The following theorem starts this development by showing that (5.8)-(5.10) are independent of the starting state.

Theorem 5.2. Let j be a recurrent state in a Markov chain and let i be any state in the same class as j. Given X0 = i, let Nij (t) be the number of transitions into state j by time t and let T jj be the expected recurrence time of state j (either finite or infinite). Then

    lim_{t→∞} Nij (t)/t = 1/T jj    with probability 1        (5.11)

    lim_{t→∞} E [Nij (t)/t] = 1/T jj .        (5.12)

If j is also aperiodic, then

    lim_{n→∞} Pr{Xn = j | X0 = i} = 1/T jj .        (5.13)
Proof: Since i and j are recurrent and in the same class, Lemma 5.4 asserts that {Nij (t); t ≥ 0} is a delayed renewal process for j ≠ i. Thus (5.11) and (5.12) follow from Theorems 3.9 and 3.10 of Chapter 3. If j is aperiodic, then {Nij (t); t ≥ 0} is a delayed renewal process for which the inter-renewal intervals Tjj have span 1 and Tij has an integer span. Thus, (5.13) follows from Blackwell's theorem for delayed renewal processes, Theorem 3.11. For i = j, Equations (5.11-5.13) follow from (5.8-5.10), completing the proof.

Theorem 5.3. All states in the same class of a Markov chain are of the same type — either all positive-recurrent, all null-recurrent, or all transient.

Proof: Let j be a recurrent state. From Theorem 5.1, all states in a class are recurrent or all are transient. Next suppose that j is positive-recurrent, so that 1/T jj > 0. Let i be in the same class as j, and consider the renewal-reward process on {Njj (t); t ≥ 0} for which R(t) = 1 whenever the process is in state i (i.e., if Xn = i, then R(t) = 1 for n ≤ t < n + 1). The reward is 0 whenever the process is in some state other than i. Let E [Rn ] be the expected reward in an inter-renewal interval; this must be positive since i is accessible from j. From the strong law for renewal-reward processes, Theorem 3.6,

    lim_{t→∞} (1/t) ∫_0^t R(τ ) dτ = E [Rn ] / T jj    with probability 1.

The term on the left is the time-average number of transitions into state i, given X0 = j, and this is 1/T ii from (5.11). Since E [Rn ] > 0 and T jj < ∞, we have 1/T ii > 0, so i is positive-recurrent. Thus if one state is positive-recurrent, the entire class is, completing the proof.

If all of the states in a Markov chain are in a null-recurrent class, then 1/T jj = 0 for each state, and one might think of 1/T jj = 0 as a "steady-state" probability for j in the sense that 0 is both the time-average rate of occurrence of j and the limiting probability of j. However, these "probabilities" do not add up to 1, so a steady-state probability distribution does not exist. This appears rather paradoxical at first, but the example of Figure 5.1, with p = 1/2, will help to clarify the situation. As time n increases (starting in state i, say), the random variable Xn spreads out over more and more states around i, and thus is less likely to be in each individual state. For each j, lim_{n→∞} P^n_ij = 0. Thus, ∑_j {lim_{n→∞} P^n_ij } = 0. On the other hand, for every n, ∑_j P^n_ij = 1. This is one of those unusual examples where a limit and a sum cannot be interchanged.

In Chapter 4, we defined the steady-state distribution of a finite-state Markov chain as a probability vector π that satisfies π = π[P ]. Here we define {πi ; i ≥ 0} in the same way, as a set of numbers that satisfy

    πj = ∑_i πi Pij for all j;    πj ≥ 0 for all j;    ∑_j πj = 1.        (5.14)
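The limit-versus-sum issue above is easy to see numerically. The following sketch (an illustration, not part of the text) assumes the chain of Figure 5.1 has Pi,i+1 = p and Pi,i−1 = 1 − p for i ≥ 1, with P00 = 1 − p and P01 = p, takes p = 1/2, truncates the state space at a large N, and tracks the distribution of Xn given X0 = 0: every individual probability shrinks toward 0, yet each row of P^n still sums to 1.

    import numpy as np

    # Truncated version of the (assumed) Figure 5.1 chain with p = 1/2.
    p, N = 0.5, 2000
    P = np.zeros((N + 1, N + 1))
    P[0, 0], P[0, 1] = 1 - p, p
    for i in range(1, N):
        P[i, i - 1], P[i, i + 1] = 1 - p, p
    P[N, N - 1], P[N, N] = 1 - p, p      # boundary, irrelevant for the horizons below

    dist0 = np.zeros(N + 1); dist0[0] = 1.0   # distribution of X_n given X_0 = 0
    for n in (10, 100, 1000):
        dist = dist0.copy()
        for _ in range(n):
            dist = dist @ P
        # the largest single-state probability keeps shrinking, but the total stays 1
        print(n, dist.max(), dist.sum())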
Suppose that a set of numbers {πi ; i ≥ 0} satisfying (5.14) is chosen as the initial probability distribution for a Markov chain, i.e., Pr{X0 = i} = πi for all i. Then Pr{X1 = j} = ∑_i πi Pij = πj for all j, and, by induction, Pr{Xn = j} = πj for all j and all n ≥ 0. The fact that Pr{Xn = j} = πj for all j motivates the definition of steady-state distribution
above. Theorem 5.2 showed that 1/T jj is a 'steady-state' probability for state j, both in a time-average and a limiting ensemble-average sense. The following theorem brings these ideas together. An irreducible Markov chain is a Markov chain in which all pairs of states communicate. For finite-state chains, irreducibility implied a single class of recurrent states, whereas for countably infinite chains, an irreducible chain is a single class that can be transient, null-recurrent, or positive-recurrent.

Theorem 5.4. Assume an irreducible Markov chain with transition probabilities {Pij }. If (5.14) has a solution, then the solution is unique, πi = 1/T ii > 0 for all i ≥ 0, and the states are positive-recurrent. Also, if the states are positive-recurrent then (5.14) has a solution.

Proof*: Let {πj ; j ≥ 0} satisfy (5.14) and be the initial distribution of the Markov chain, i.e., Pr{X0 =j} = πj , j ≥ 0. Then, as shown above, Pr{Xn =j} = πj for all n ≥ 0, j ≥ 0. Let Ñj (t) be the number of occurrences of any given state j from time 1 to t. Equating Pr{Xn =j} to the expectation of an occurrence of j at time n, we have

    (1/t)E [Ñj (t)] = (1/t) ∑_{1≤n≤t} Pr{Xn =j} = πj    for all integers t ≥ 1.
Conditioning this on the possible starting states i, and using the counting processes {Nij (t); t ≥ 0} defined earlier,

    πj = (1/t)E [Ñj (t)] = ∑_i πi E [Nij (t)/t]    for all integer t ≥ 1.        (5.15)
For any given state i, let Tij be the time of the first occurrence of state j given X0 = i. Then if Tij < ∞, we have Nij (t) ≤ Nij (Tij + t). Thus, for all t ≥ 1,

    E [Nij (t)] ≤ E [Nij (Tij + t)] = 1 + E [Njj (t)] .        (5.16)
The last step follows since the process is in state j at time Tij , and the expected number of occurrences of state j in the next t steps is E [Njj (t)]. Substituting (5.16) in (5.15) for each i, πj ≤ 1/t + E [Njj (t)/t]. Taking the limit as t → ∞ and using (5.12), πj ≤ lim_{t→∞} E [Njj (t)/t]. Since ∑_i πi = 1, there is at least one value of j for which πj > 0, and for this j, lim_{t→∞} E [Njj (t)/t] > 0, and consequently lim_{t→∞} E [Njj (t)] = ∞. Thus, from Lemma 5.1, state j is recurrent, and from Theorem 5.2, j is positive-recurrent. From Theorem 5.3, all states are then positive-recurrent. For any j and any integer M , (5.15) implies that

    πj ≥ ∑_{i≤M} πi E [Nij (t)/t]    for all t.        (5.17)
From Theorem 5.2, lim_{t→∞} E [Nij (t)/t] = 1/T jj for all i. Substituting this into (5.17), we get πj ≥ (1/T jj ) ∑_{i≤M} πi . Since M is arbitrary, πj ≥ 1/T jj . Since we already showed that πj ≤ lim_{t→∞} E [Njj (t)/t] = 1/T jj , we have πj = 1/T jj for all j. This shows both that πj > 0 for all j and that the solution to (5.14) is unique. Exercise 5.4 completes the proof by showing that if the states are positive-recurrent, then choosing πj = 1/T jj for all j satisfies (5.14).
In practice, it is usually easy to see whether a chain is irreducible. We shall also see by a number of examples that the steady-state distribution can often be calculated from (5.14). Theorem 5.4 then says that the calculated distribution is unique and that its existence guarantees that the chain is positive-recurrent.

Example 5.1.1. Age of a renewal process: Consider a renewal process {N (t); t ≥ 0} in which the inter-renewal random variables {Wn ; n ≥ 1} are arithmetic with span 1. We will use a Markov chain to model the age of this process (see Figure 5.2). The probability that a renewal occurs at a particular integer time depends on the past only through the integer time back to the last renewal. The state of the Markov chain during a unit interval will be taken as the age of the renewal process at the beginning of the interval. Thus, each unit of time, the age either increases by one or a renewal occurs and the age decreases to 0 (i.e., if a renewal occurs at time t, the age at time t is 0).
[Figure 5.2 graphic: a chain on states 0, 1, 2, 3, 4, . . . with forward transitions P01 , P12 , P23 , P34 , . . . and transitions P00 , P10 , P20 , P30 , P40 , . . . from each state back to state 0.]
Figure 5.2: A Markov chain model of the age of a renewal process.

Pr{W > n} is the probability that an inter-renewal interval lasts for more than n time units. We assume that Pr{W > 0} = 1, so that each renewal interval lasts at least one time unit. The probability Pn,0 in the Markov chain is the probability that a renewal interval has duration n + 1, given that the interval exceeds n. Thus, for example, P00 is the probability that the renewal interval is equal to 1. Pn,n+1 is 1 − Pn,0 , which is Pr{W > n + 1} /Pr{W > n}. We can then solve for the steady-state probabilities in the chain: for n > 0,

    πn = πn−1 Pn−1,n = πn−2 Pn−2,n−1 Pn−1,n = π0 P0,1 P1,2 . . . Pn−1,n .

The first equality above results from the fact that state n, for n > 0, can be entered only from state n − 1. The subsequent equalities come from substituting in the same expression for πn−1 , then πn−2 , and so forth.

    πn = π0 · (Pr{W > 1}/Pr{W > 0}) · (Pr{W > 2}/Pr{W > 1}) · · · (Pr{W > n}/Pr{W > n − 1}) = π0 Pr{W > n} .        (5.18)
We have cancelled out all the cross terms above and used the fact that Pr{W > 0} = 1. Another way to see that πn = π0 Pr{W > n} is to observe that state 0 occurs exactly once in each inter-renewal interval; state n occurs exactly once in those inter-renewal intervals of duration n or more. Since the steady-state probabilities must sum to 1, (5.18) can be solved for π0 as

    π0 = 1 / ∑_{n=0}^{∞} Pr{W > n} = 1/E [W ] .        (5.19)
The second equality follows by expressing E [W ] as the integral of the complementary distribution function of W . Combining this with (5.18), the steady-state probabilities for n ≥ 0 are

    πn = Pr{W > n} / E [W ] .        (5.20)
In terms of the renewal process, πn is the probability that, at some large integer time, the age of the process will be n. Note that if the age of the process at an integer time is n, then the age increases toward n + 1 at the next integer time, at which point it either drops to 0 or continues to rise. Thus πn can be interpreted as the fraction of time that the age of the process is between n and n + 1. Recall from (3.37) (and the fact that residual life and age are equally distributed) that the distribution function of the time-average age is given by FZ (n) = ∫_0^n Pr{W > w} dw / E [W ]. Thus, the probability that the age is between n and n + 1 is FZ (n + 1) − FZ (n). Since W is an integer random variable, this is Pr{W > n} /E [W ], in agreement with our result here. The analysis here gives a new, and intuitively satisfying, explanation of why the age of a renewal process is so different from the inter-renewal time. The Markov chain shows the ever increasing loops that give rise to large expected age when the inter-renewal time is heavy tailed (i.e., has a distribution function that goes to 0 slowly with increasing time). These loops can be associated with the isosceles triangles of Figure 3.8. The advantage here is that we can associate the states with steady-state probabilities if the chain is recurrent. Even when the Markov chain is null-recurrent (i.e., the associated renewal process has infinite expected age), it seems easier to visualize the phenomenon of infinite expected age.
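As a quick numerical check of (5.20) (an illustrative sketch, not part of the text; the inter-renewal PMF below is assumed), the steady-state age probabilities can be computed two ways: directly from πn = Pr{W > n}/E[W], and by solving the balance equations of the chain of Figure 5.2. Because the assumed W is bounded, the two agree exactly.

    import numpy as np

    # Assumed inter-renewal PMF on {1, 2, 3, 4}, purely for illustration.
    pmf = {1: 0.4, 2: 0.3, 3: 0.2, 4: 0.1}
    EW = sum(k * p for k, p in pmf.items())
    surv = lambda n: sum(p for k, p in pmf.items() if k > n)   # Pr{W > n}

    # Steady-state age probabilities from (5.20).
    pi_formula = [surv(n) / EW for n in range(4)]

    # Same thing from the age chain: P(n,0) = Pr{W = n+1 | W > n}, P(n,n+1) = 1 - P(n,0).
    N = 4
    P = np.zeros((N, N))
    for n in range(N):
        pn0 = pmf.get(n + 1, 0.0) / surv(n)
        P[n, 0] = pn0
        if n + 1 < N:
            P[n, n + 1] = 1 - pn0
    # Solve pi = pi P together with sum(pi) = 1.
    A = np.vstack([P.T - np.eye(N), np.ones(N)])
    b = np.zeros(N + 1); b[-1] = 1
    pi_chain = np.linalg.lstsq(A, b, rcond=None)[0]

    print(np.round(pi_formula, 4))
    print(np.round(pi_chain, 4))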
5.2
Branching processes
Branching processes provide a simple model for studying the population of various types of individuals from one generation to the next. The individuals could be photons in a photomultiplier, particles in a cloud chamber, micro-organisms, insects, or branches in a data structure. Let Xn be the number of individuals in generation n of some population. Each of these Xn individuals, independently of each other, produces a random number of offspring, and these offspring collectively make up generation n + 1. More precisely, a branching process is a Markov chain in which the state Xn at time n models the number of individuals in generation n. Denote the individuals of generation n as {1, 2, ..., Xn } and let Yk,n be the number of offspring of individual k. The random variables Yk,n are defined to be IID over k and n, with a PMF pj = Pr{Yk,n = j}. The state at time n + 1, namely the number of individuals in generation n + 1, is

    Xn+1 = ∑_{k=1}^{Xn} Yk,n .        (5.21)
Assume a given distribution (perhaps deterministic) for the initial state X0 . The transition probability, Pij = Pr{Xn+1 = j | Xn = i}, is just the probability that Y1,n +Y2,n +· · ·+Yi,n =
j. The zero state (i.e., the state in which there are no individuals) is a trapping state (i.e., P00 = 1) since no future offspring can arise in this case. One of the most important issues about a branching process is the probability that the population dies out eventually. Naturally, if p0 (the probability that an individual has no offspring) is zero, then each generation must be at least as large as the generation before, and the population cannot die out unless X0 = 0. We assume in what follows that p0 > 0 and X0 > 0. Recall that Fij (n) was defined as the probability, given X0 = i, that state j is entered between times 1 and n. From (5.4), this satisfies the iterative relation

    Fij (n) = Pij + ∑_{k≠j} Pik Fkj (n − 1),   n > 1;        Fij (1) = Pij .        (5.22)
The probability that the process dies out by time n or before, given X0 = i, is thus Fi0 (n). For the nth generation to die out, starting with an initial population of i individuals, the descendants of each of those i individuals must die out. Since each individual generates descendants independently, we have Fi0 (n) = [F10 (n)]^i for all i and n. Because of this relationship, it is sufficient to find F10 (n), which can be determined from (5.22). Observe that P1k is just pk , the probability that an individual will have k offspring. Thus, (5.22) becomes

    F10 (n) = p0 + ∑_{k=1}^{∞} pk [F10 (n − 1)]^k = ∑_{k=0}^{∞} pk [F10 (n − 1)]^k .        (5.23)
Let h(z) = ∑_k pk z^k be the z-transform of the number of an individual's offspring. Then (5.23) can be written as

    F10 (n) = h(F10 (n − 1)).        (5.24)
This iteration starts with F10 (1) = p0 . Figure 5.3 shows a graphical construction for evaluating F10 (n). Having found F10 (n) as an ordinate on the graph for a given value of n, we find the same value as an abscissa by drawing a horizontal line over to the straight line of slope 1; we then draw a vertical line back to the curve h(z) to find h(F10 (n)) = F10 (n + 1). For the two figures shown, it can be seen that F10 (∞) is equal to the smallest root of the equation h(z) − z = 0. We next show that these two figures are representative of all possibilities. Since h(z) is a z-transform, we know that h(1) = 1, so that z = 1 is one root of h(z) − z = 0. Also, h′(1) = Y , where Y = ∑_k k pk is the expected number of an individual's offspring. If Y > 1, as in Figure 5.3a, then h(z) − z is negative for z slightly smaller than 1. Also, for z = 0, h(z) − z = h(0) = p0 > 0. Since h″(z) ≥ 0, there is exactly one root of h(z) − z = 0 for 0 < z < 1, and that root is equal to F10 (∞). By the same type of analysis, it can be seen that if Y ≤ 1, as in Figure 5.3b, then there is no root of h(z) − z = 0 for z < 1, and F10 (∞) = 1. As we saw earlier, Fi0 (∞) = [F10 (∞)]^i , so that for any initial population size, there is a probability strictly between 0 and 1 that successive generations eventually die out for Y > 1, and probability 1 that successive generations eventually die out for Y ≤ 1. Since state 0 is accessible from all i, but F0i (∞) = 0, it follows from Lemma 5.3 that all states other than state 0 are transient.
[Figure 5.3 graphic: two panels plotting h(z) against the line of slope 1, showing the iterates p0 = F10 (1), F10 (2), F10 (3), . . . converging to F10 (∞); panel (a) has Y > 1 with F10 (∞) < 1, panel (b) has Y ≤ 1 with F10 (∞) = 1.]
Figure 5.3: Graphical construction to find the probability that a population dies out. Here F10 (n) is the probability that a population starting with one member at generation 0 dies out by generation n or before. Thus F10 (∞) is the probability that the population ever dies out.
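The iteration in (5.24) is easy to carry out numerically. The following sketch (illustrative only; the offspring PMF is assumed, not taken from the text) iterates h until convergence and compares the result with the smallest root of h(z) − z = 0.

    # Extinction probability by iterating F10(n) = h(F10(n-1)).
    # Assumed offspring PMF: p0 = 0.25, p1 = 0.25, p2 = 0.5, so Y = 1.25 > 1.
    p = [0.25, 0.25, 0.5]

    def h(z):
        return sum(pk * z**k for k, pk in enumerate(p))

    F = p[0]                    # F10(1) = p0
    for _ in range(1000):
        F = h(F)                # F10(n) = h(F10(n-1))
    print(F)                    # converges to the smallest root of h(z) = z

    # Check: h(z) - z = 0.5 z^2 - 0.75 z + 0.25 = 0 has roots z = 0.5 and z = 1,
    # so the printed extinction probability should be 0.5.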
We next evaluate the expected number of individuals in a given generation. Conditional on Xn−1 = i, (5.21) shows that the expected value of Xn is iY . Taking the expectation over Xn−1 , we have

    E [Xn ] = Y E [Xn−1 ] .        (5.25)

Iterating this equation, we get

    E [Xn ] = Y^n E [X0 ] .        (5.26)
Thus, if Y > 1, the expected number of individuals in a generation increases exponentially with n, and Y gives the rate of growth. Physical processes do not grow exponentially forever, so branching processes are appropriate models of such physical processes only over some finite range of population. Even more important, the model here assumes that the number of offspring of a single member is independent of the total population, which is highly questionable in many areas of population growth. The advantage of an oversimplified model such as this is that it explains what would happen under these idealized conditions, thus providing insight into how the model should be changed for more realistic scenarios. It is important to realize that, for branching processes, the mean number of individuals is not a good measure of the actual number of individuals. For Y = 1 and X0 = 1, the expected number of individuals in each generation is 1, but the probability that Xn = 0 approaches 1 with increasing n; this means that as n gets large, the nth generation contains a large number of individuals with a very small probability and contains no individuals with a very large probability. For Y > 1, we have just seen that there is a positive probability that the population dies out, but the expected number is growing exponentially. A surprising result, which is derived from the theory of martingales in Chapter 7, is that if X0 = 1 and Y > 1, then the sequence of random variables Xn /Y^n has a limit with probability 1. This limit is a random variable; it has the value 0 with probability F10 (∞), and has larger values with some given distribution. Intuitively, for large n, Xn is either 0 or very large. If it is very large, it tends to grow in an orderly way, increasing by a multiple of Y in each subsequent generation.
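The martingale limit just mentioned is easy to watch in simulation. The sketch below (illustrative only, reusing the assumed offspring PMF from the previous sketch) prints X30 /Y^30 for a few independent runs; each run is either 0 (extinction) or a moderate positive number.

    import random

    # Simulate X_n / Y^n for a branching process with assumed PMF p0=0.25, p1=0.25, p2=0.5.
    random.seed(3)
    p = [0.25, 0.25, 0.5]
    Ybar = sum(k * pk for k, pk in enumerate(p))     # 1.25

    def offspring():
        u = random.random()
        return 0 if u < p[0] else (1 if u < p[0] + p[1] else 2)

    for run in range(5):
        X = 1
        for n in range(30):
            X = sum(offspring() for _ in range(X))
        print(run, X / Ybar ** 30)      # 0 on extinction, otherwise a positive value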
5.3
Birth-death Markov chains
A birth-death Markov chain is a Markov chain in which the state space is the set of nonnegative integers; for all i ≥ 0, the transition probabilities satisfy Pi,i+1 > 0 and Pi+1,i > 0, and for all |i−j| > 1, Pij = 0 (see Figure 5.4). A transition from state i to i+1 is regarded as a birth and one from i+1 to i as a death. Thus the restriction on the transition probabilities means that only one birth or death can occur in one unit of time. Many applications of birth-death processes arise in queueing theory, where the state is the number of customers, births are customer arrivals, and deaths are customer departures. The restriction to only one arrival or departure at a time seems rather peculiar, but usually such a chain is a finely sampled approximation to a continuous time process, and the time increments are then small enough that multiple arrivals or departures in a time increment are unlikely and can be ignored in the limit.
[Figure 5.4 graphic: states 0, 1, 2, 3, 4, . . . with upward transition probabilities p0 , p1 , p2 , p3 , . . . , downward transition probabilities q1 , q2 , q3 , q4 , . . . , and self-transition probabilities such as 1 − p3 − q3 .]
Figure 5.4: Birth-death Markov chain.
We denote Pi,i+1 by pi and Pi,i−1 by qi . Thus Pii = 1 − pi − qi . There is an easy way to find the steady-state probabilities of these birth-death chains. In any sample function of the process, note that the number of transitions from state i to i + 1 differs by at most 1 from the number of transitions from i + 1 to i. If the process starts to the left of i and ends to the right, then one more i → i + 1 transition occurs than i + 1 → i, etc. Thus if we visualize a renewal-reward process with renewals on occurrences of state i and unit reward on transitions from state i to i + 1, the limiting time-average number of transitions per unit time is πi pi . Similarly, the limiting time-average number of transitions per unit time from i + 1 to i is πi+1 qi+1 . Since these two must be equal in the limit,

    πi pi = πi+1 qi+1    for i ≥ 0.        (5.27)
The intuition in (5.27) is simply that the rate at which downward transitions occur from i + 1 to i must equal the rate of upward transitions. Since this result is very important, both here and in our later study of continuous time birth-death processes, we show that (5.27) also results from using the steady-state equations in (5.14):

    πi = pi−1 πi−1 + (1 − pi − qi )πi + qi+1 πi+1 ;    i > 0        (5.28)

    π0 = (1 − p0 )π0 + q1 π1 .        (5.29)
From (5.29), p0 π0 = q1 π1 . To see that (5.27) is satisfied for i > 0, we use induction on i, with i = 0 as the base. Thus assume, for a given i, that pi−1 πi−1 = qi πi . Substituting this in (5.28), we get pi πi = qi+1 πi+1 , thus completing the inductive proof.
It is convenient to define ρi as pi /qi+1 . Then we have πi+1 = ρi πi , and iterating this,

    πi = π0 ∏_{j=0}^{i−1} ρj ;        π0 = 1 / ( 1 + ∑_{i=1}^{∞} ∏_{j=0}^{i−1} ρj ) .        (5.30)
If ∑_{i≥1} ∏_{0≤j<i} ρj < ∞, then π0 is positive and (5.30) gives the steady-state probabilities; in particular, if ρj ≤ 1 − ε for some ε > 0 and all sufficiently large j, then this sum of products will converge and the states will be positive-recurrent. For the simple birth-death process of Figure 5.1, if we define ρ = p/q, then ρj = ρ for all j. For ρ < 1, (5.30) simplifies to πi = π0 ρ^i for all i ≥ 0, π0 = 1 − ρ, and thus πi = (1 − ρ)ρ^i for i ≥ 0. Exercise 5.2 shows how to find Fij (∞) for all i, j in the case where ρ ≥ 1. We have seen that the simple birth-death chain of Figure 5.1 is transient if ρ > 1. This is not necessarily so in the case where self-transitions exist, but the chain is still either transient or null-recurrent. An important example of this arises in Exercise 6.1.
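As a concrete illustration of (5.30), the steady-state probabilities can be computed by accumulating the products of ρj ; the sketch below (illustrative only, with assumed constant birth and death probabilities) truncates the sum at a large state.

    # Steady-state probabilities of a birth-death chain from (5.30), truncated at N.
    # Assumed probabilities: p_i = 0.3 and q_i = 0.5 for every state.
    def birth_death_pi(p, q, N=200):
        rho = [p(i) / q(i + 1) for i in range(N)]
        prods = [1.0]                       # empty product, i.e. the i = 0 term
        for r in rho:
            prods.append(prods[-1] * r)     # prod_{j < i} rho_j
        total = sum(prods)
        return [x / total for x in prods]

    pi = birth_death_pi(lambda i: 0.3, lambda i: 0.5)
    print(pi[:5])     # approximately geometric: (1 - 0.6) * 0.6**i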
5.4
Reversible Markov chains
Many important Markov chains have the property that, in steady-state, the sequence of states looked at backwards in time, i.e., . . . , Xn+1 , Xn , Xn−1 , . . . , has the same probabilistic structure as the sequence of states running forward in time. This equivalence between the forward chain and backward chain leads to a number of results that are intuitively quite surprising and that are quite difficult to derive without using this equivalence. We shall study these results here and then extend them in Chapter 6 to Markov processes with a discrete state space. This set of ideas, and its use in queueing and queueing networks, has been an active area of queueing research over many years. It leads to many simple results for systems that initially look very complex. We only scratch the surface here and refer the interested reader to [13] for a more comprehensive treatment. Before going into reversibility, we describe the backward chain for an arbitrary Markov chain. The defining characteristic of a Markov chain {Xn ; n ≥ 0} is that for all n ≥ 0,

    Pr{Xn+1 | Xn , Xn−1 , . . . , X0 } = Pr{Xn+1 | Xn } .        (5.31)
For homogeneous chains, which we have been assuming throughout, Pr{Xn+1 = j | Xn = i} = Pij , independent of n. For any k > 1, we can extend (5.31) to get

    Pr{Xn+k , Xn+k−1 , . . . , Xn+1 | Xn , Xn−1 , . . . , X0 }
        = Pr{Xn+k | Xn+k−1 } Pr{Xn+k−1 | Xn+k−2 } . . . Pr{Xn+1 | Xn }
        = Pr{Xn+k , Xn+k−1 , . . . , Xn+1 | Xn } .        (5.32)
By letting A+ be any event defined on the states Xn+1 to Xn+k and letting A− be any event defined on X0 to Xn−1 , this can be written more succinctly as

    Pr{A+ | Xn , A− } = Pr{A+ | Xn } .        (5.33)
This says that, given state Xn , any future event A+ is statistically independent of any past event A− . This result, namely that past and future are independent given the present state, is equivalent to (5.31) for defining a Markov chain, but it has the advantage of showing the symmetry between past and future. This symmetry is best brought out by multiplying both sides of (5.33) by Pr{A− | Xn }, obtaining3

    Pr{A+ , A− | Xn } = Pr{A+ | Xn } Pr{A− | Xn } .        (5.34)
This symmetric form says that, conditional on the current state, past and future are statistically independent. Dividing both sides by Pr{A+ | Xn } then yields

    Pr{A− | Xn , A+ } = Pr{A− | Xn } .        (5.35)

By letting A− be Xn−1 and A+ be Xn+1 , Xn+2 , . . . , Xn+k , this becomes Pr{Xn−1 | Xn , Xn+1 , . . . , Xn+k } = Pr{Xn−1 | Xn }. This is the equivalent form to (5.31) for the backward chain, and says that the backward chain is also a Markov chain. By Bayes' law, Pr{Xn−1 | Xn } can be evaluated as

    Pr{Xn−1 | Xn } = Pr{Xn | Xn−1 } Pr{Xn−1 } / Pr{Xn } .        (5.36)
Since the distribution of Xn can vary with n, Pr{Xn−1 | Xn } can also depend on n. Thus the backward Markov chain is not necessarily homogeneous. This should not be surprising, since the forward chain was defined with some arbitrary distribution for the initial state at time 0. This initial distribution was not relevant for equations (5.31) to (5.33), but as soon as Pr{A− | Xn } was introduced, the initial state implicitly became a part of each equation and destroyed the symmetry between past and future. For a chain in steady-state, however, Pr{Xn = j} = Pr{Xn−1 = j} = πj for all j, and we have

    Pr{Xn−1 = j | Xn = i} = Pji πj /πi .        (5.37)
Thus the backward chain is homogeneous if the forward chain is in steady-state. For a chain with steady-state probabilities {πi ; i ≥ 0}, we define the backward transition probabilities Pij∗ as

    πi Pij∗ = πj Pji .        (5.38)
From (5.36), the backward transition probability Pij∗ , for a Markov chain in steady-state, is then equal to Pr{Xn−1 = j | Xn = i}, the probability that the previous state is j given that the current state is i. Now consider a new Markov chain with transition probabilities {Pij∗ }. Over some segment of time for which both this new chain and the old chain are in steady-state, the set of states generated by the new chain is statistically indistinguishable from the backward running sequence of states from the original chain. It is somewhat simpler, in talking about forward

3 Much more broadly, any 3 events, say A− , X0 , A+ are said to be Markov if Pr{A+ | X0 , A− } = Pr{A+ | X0 }, and this implies the more symmetric form Pr{A− , A+ | X0 } = Pr{A− | X0 } Pr{A+ | X0 }.
and backward running chains, however, to visualize Markov chains running in steady-state from t = −∞ to t = +∞. If one is uncomfortable with this, one can also visualize starting the Markov chain at some very negative time with the initial distribution equal to the steady-state distribution. A Markov chain is now defined to be reversible if Pij∗ = Pij for all states i and j. Thus the chain is reversible if, in steady-state, the backward running sequence of states is statistically indistinguishable from the forward running sequence. Comparing (5.38) with the steady-state equations (5.27) that we derived for birth-death chains, we have the important theorem:

Theorem 5.5. Every birth-death chain with a steady-state probability distribution is reversible.

We saw that for birth-death chains, the equation πi Pij = πj Pji (which only had to be considered for |i − j| ≤ 1) provided a very simple way of calculating the steady-state probabilities. Unfortunately, it appears that we must first calculate the steady-state probabilities in order to show that a chain is reversible. The following simple theorem gives us a convenient escape from this dilemma.

Theorem 5.6. Assume that an irreducible Markov chain has transition probabilities {Pij }. Suppose {πi } is a set of positive numbers summing to 1 and satisfying

    πi Pij = πj Pji ;    all i, j.        (5.39)
then, first, {πi ; i ≥ 0} is the steady-state distribution for the chain, and, second, the chain is reversible.

Proof: Given a solution to (5.39) for all i and j, we can sum this equation over i for each j.

    ∑_i πi Pij = ∑_i πj Pji = πj .        (5.40)
Thus the solution to (5.39), along with the constraints πi > 0, ∑_i πi = 1, satisfies the steady-state equations, (5.14), and, from Theorem 5.4, this is the unique steady-state distribution. Since (5.39) is satisfied, the chain is also reversible. It is often possible, sometimes by using an educated guess, to find a solution to (5.39). If this is successful, then we are assured both that the chain is reversible and that the actual steady-state probabilities have been found. Note that the theorem applies to periodic chains as well as to aperiodic chains. If the chain is periodic, then the steady-state probabilities have to be interpreted as average values over the period, but, from Theorem 5.4, (5.40) still has a unique solution (assuming an irreducible chain). On the other hand, for a chain with period d > 1, there are d subclasses of states and the sequence {Xn } must rotate between these classes in a fixed order. For this same order to be followed in the backward chain, the only possibility is d = 2. Thus periodic chains with periods other than 2 cannot be reversible.
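The "educated guess" strategy of Theorem 5.6 is easy to test numerically. The sketch below (illustrative only; the chain and its probabilities are assumptions, not from the text) guesses a product-form distribution for a small birth-death chain and checks both the detailed-balance equations (5.39) and stationarity.

    import numpy as np

    # A small birth-death chain on {0,...,4} with assumed p_i = 0.3, q_i = 0.5.
    N, p, q = 5, 0.3, 0.5
    P = np.zeros((N, N))
    for i in range(N):
        if i + 1 < N: P[i, i + 1] = p
        if i - 1 >= 0: P[i, i - 1] = q
        P[i, i] = 1 - P[i].sum()

    # Guess pi_i proportional to (p/q)^i and normalize.
    guess = (p / q) ** np.arange(N)
    pi = guess / guess.sum()

    # Check detailed balance (5.39), i.e. pi_i P_ij = pi_j P_ji, and pi = pi P.
    M = pi[:, None] * P
    print(np.allclose(M, M.T), np.allclose(pi @ P, pi))   # both True: reversible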
There are several simple tests that can be used to show that some given irreducible chain is not reversible. First, the steady-state probabilities must satisfy πi > 0 for all i, and thus, if Pij > 0 but Pji = 0 for some i, j, then (5.39) cannot be satisfied and the chain is not reversible. Second, consider any set of three states, i, j, k. If Pji Pik Pkj is unequal to Pjk Pki Pij then the chain cannot be reversible. To see this, note that (5.39) requires that πi = πj Pji /Pij = πk Pki /Pik . Thus, πj Pji Pik = πk Pki Pij . Equation (5.39) also requires that πj Pjk = πk Pkj . Taking the ratio of these equations, we see that Pji Pik Pkj = Pjk Pki Pij . Thus if this equation is not satisfied, the chain cannot be reversible. In retrospect, this result is not surprising. What it says is that for any cycle of three states, the probability of three transitions going around the cycle in one direction must be the same as the probability of going around the cycle in the opposite (and therefore backwards) direction. It is also true (see [16] for a proof) that a necessary and sufficient condition for a chain to be reversible is that the product of transition probabilities around any cycle of arbitrary length must be the same as the product of transition probabilities going around the cycle in the opposite direction. This doesn't seem to be a widely useful way to demonstrate reversibility. There is another result, similar to Theorem 5.6, for finding the steady-state probabilities of an arbitrary Markov chain and simultaneously finding the transition probabilities of the backward chain.

Theorem 5.7. Assume that an irreducible Markov chain has transition probabilities {Pij }. Suppose {πi } is a set of positive numbers summing to 1 and that {Pij∗ } is a set of transition probabilities satisfying

    πi Pij = πj Pji∗ ;    all i, j.        (5.41)
Then {πi } is the steady-state distribution and {Pij∗ } is the set of transition probabilities for the backward chain. Proof: Summing (5.41) over i, we get the steady-state equations for the Markov chain, so the fact that the given {πi } satisfy these equations asserts that they are the steady-state probabilities. Equation (5.41) then asserts that {Pij∗ } is the set of transition probabilities for the backward chain. The following two sections illustrate some important applications of reversibility.
5.5
The M/M/1 sample-time Markov chain
The M/M/1 Markov chain is a sampled-time model of the M/M/1 queueing system. Recall that the M/M/1 queue has Poisson arrivals at some rate λ and IID exponentially distributed service times at some rate µ. We assume throughout this section that λ < µ (this is required to make the states positive-recurrent). For some given small increment of time
δ, we visualize observing the state of the system at the sample times nδ. As indicated in Figure 5.5, the probability of an arrival in the interval from (n − 1)δ to nδ is modeled as λδ, independent of the state of the chain at time (n − 1)δ and thus independent of all prior arrivals and departures. Thus the arrival process, viewed as arrivals in subsequent intervals of duration δ, is Bernoulli, thus approximating the Poisson arrivals. This is a sampled-time approximation to the Poisson arrival process of rate λ for a continuous time M/M/1 queue.
[Figure 5.5 graphic: a birth-death chain on states 0, 1, 2, 3, 4, . . . with upward transition probability λδ and downward transition probability µδ on every arc.]
Figure 5.5: Sampled-time approximation to M/M/1 queue for time increment δ.

When the system is non-empty (i.e., the state of the chain is one or more), the probability of a departure in the interval (n − 1)δ to nδ is µδ, thus modelling the exponential service times. When the system is empty, of course, departures cannot occur. Note that in our sampled-time model, there can be at most one arrival or departure in an interval (n − 1)δ to nδ. As in the Poisson process, the probability of more than one arrival, more than one departure, or both an arrival and a departure in an increment δ is of order δ^2 for the actual continuous time M/M/1 system being modeled. Thus, for δ very small, we expect the sampled-time model to be relatively good. At any rate, we can now analyze the model with no further approximations. Since this chain is a birth-death chain, we can use (5.30) to determine the steady-state probabilities; they are πi = π0 ρ^i with ρ = λ/µ < 1. Setting the sum of the πi to 1, we find that π0 = 1 − ρ, so

    πi = (1 − ρ)ρ^i ;    all i ≥ 0.        (5.42)
Thus the steady-state probabilities exist and the chain is a birth-death chain, so from Theorem 5.5, it is reversible. We now exploit the consequences of reversibility to find some rather surprising results about the M/M/1 chain in steady-state. Figure 5.6 illustrates a sample path of arrivals and departures for the chain. To avoid the confusion associated with the backward chain evolving backward in time, we refer to the original chain as the chain moving to the right and to the backward chain as the chain moving to the left. There are two types of correspondence between the right-moving and the left-moving chain: 1. The left-moving chain has the same Markov chain description as the right-moving chain, and thus can be viewed as an M/M/1 chain in its own right. We still label the sampled-time intervals from left to right, however, so that the left-moving chain makes transitions from Xn+1 to Xn to Xn−1 . Thus, for example, if Xn = i and Xn−1 = i + 1, the left-moving chain has an arrival in the interval from nδ to (n − 1)δ.
2. Each sample function . . . xn−1 , xn , xn+1 . . . of the right-moving chain corresponds to the same sample function . . . xn+1 , xn , xn−1 . . . of the left-moving chain, where Xn−1 = xn−1 is to the left of Xn = xn for both chains. With this correspondence, an arrival to the right-moving chain in the interval (n − 1)δ to nδ is a departure from the leftmoving chain in the interval nδ to (n − 1)δ, and a departure from the right-moving chain is an arrival to the left-moving chain. Using this correspondence, each event in the left-moving chain corresponds to some event in the right-moving chain. In each of the properties of the M/M/1 chain to be derived below, a property of the leftmoving chain is developed through correspondence 1 above, and then that property is translated into a property of the right-moving chain by correspondence 2. Property 1: Since the arrival process of the right-moving chain is Bernoulli, the arrival process of the left-moving chain is also Bernoulli (by correspondence 1). Looking at a sample function xn+1 , xn , xn−1 of the left-moving chain (i.e., using correspondence 2), an arrival in the interval nδ to (n − 1)δ of the left-moving chain is a departure in the interval (n − 1)δ to nδ of the right-moving chain. Since the arrivals in successive increments of the left-moving chain are independent and have probability ∏δ in each increment δ, we conclude that departures in the right-moving chain are similarly Bernoulli.
[Figure 5.6 graphic: a sample path of the state over a busy period, with the corresponding arrival and departure sequences for the right-moving chain and for the left-moving chain.]
Figure 5.6: Sample function of M/M/1 chain over a busy period and corresponding arrivals and departures for right and left-moving chains. Arrivals and departures are viewed as occurring between the sample times, and an arrival in the left-moving chain between time nδ and (n + 1)δ corresponds to a departure in the right-moving chain between (n + 1)δ and nδ. The fact that the departure process is Bernoulli with departure probability ∏δ in each increment is surprising. Note that the probability of a departure in the interval (nδ − δ, nδ] is µδ conditional on Xn−1 ≥ 1 and is 0 conditional on Xn−1 = 0. Since Pr{Xn−1 ≥ 1} = 1 − Pr{Xn−1 = 0} = ρ, we see that the unconditional probability of a departure in the
interval (nδ − δ, nδ] is ρµδ = ∏δ as asserted above. The fact that successive departures are independent is much harder to derive without using reversibility (see exercise 5.12). Property 2: In the original (right-moving) chain, arrivals in the time increments after nδ are independent of Xn . Thus, for the left-moving chain, arrivals in time increments to the left of nδ are independent of the state of the chain at nδ. From the correspondence between sample paths, however, a left chain arrival is a right chain departure, so that for the right-moving chain, departures in the time increments prior to nδ are independent of Xn , which is equivalent to saying that the state Xn is independent of the prior departures. This means that if one observes the departures prior to time nδ, one obtains no information about the state of the chain at nδ. This is again a surprising result. To make it seem more plausible, note that an unusually large number of departures in an interval from (n − m)δ to nδ indicates that a large number of customers were probably in the system at time (n−m)δ, but it doesn’t appear to say much (and in fact it says exactly nothing) about the number remaining at nδ. The following theorem summarizes these results. Theorem 5.8 (Burke’s theorem for sampled-time). Given an M/M/1 Markov chain in steady-state with ∏ < µ, a) the departure process is Bernoulli, b) the state Xn at any time nδ is independent of departures prior to nδ. The proof of Burke’s theorem above did not use the fact that the departure probability is the same for all states except state 0. Thus these results remain valid for any birth-death chain with Bernoulli arrivals that are independent of the current state (i.e., for which Pi,i+1 = ∏δ for all i ≥ 0). One important example of such a chain is the sampled time approximation to an M/M/m queue. Here there are m servers, and the probability of departure from state i in an increment δ is µiδ for i ≤ m and µmδ for i > m. For the states to be recurrent, and thus for a steady-state to exist, ∏ must be less than µm. Subject to this restriction, properties a) and b) above are valid for sampled-time M/M/m queues.
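Burke's theorem is easy to probe by simulation. The sketch below (illustrative only; the parameter values are assumptions) simulates the sampled-time M/M/1 chain and checks that departures occur with probability close to λδ in each increment, and that a departure is no more likely given a departure 10 increments earlier, consistent with the Bernoulli departure process of part a).

    import random

    # Sampled-time M/M/1 chain (assumed values: lambda = 0.6, mu = 1.0, delta = 0.1).
    lam, mu, delta, steps = 0.6, 1.0, 0.1, 1_000_000
    random.seed(1)

    x = 0
    dep = [0] * steps                 # dep[n] = 1 if a departure occurs in increment n
    for n in range(steps):
        u = random.random()
        if u < lam * delta:                              # arrival, probability lambda*delta
            x += 1
        elif x > 0 and u < lam * delta + mu * delta:     # departure, probability mu*delta
            x -= 1
            dep[n] = 1

    print(sum(dep) / steps)                              # close to lambda*delta = 0.06
    both = sum(1 for n in range(10, steps) if dep[n] and dep[n - 10])
    prior = sum(dep[:steps - 10])
    print(both / max(1, prior))                          # conditional rate, also close to 0.06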
5.6
Round-robin and processor sharing
Typical queueing systems have one or more servers who each serve customers in FCFS order, serving one customer completely while other customers wait. These typical systems have larger average delay than necessary. For example, if two customers with service requirements of 10 and 1 units respectively are waiting when a single server becomes empty, then serving the first before the second results in departures at times 10 and 11, for an average delay of 10.5. Serving the customers in the opposite order results in departures at times 1 and 11, for an average delay of 6. Supermarkets have recognized this for years and have special express checkout lines for customers with small service requirements. Giving priority to customers with small service requirements, however, has some disadvantages; first, customers with high service requirements can feel discriminated against, and
second, it is not always possible to determine the service requirements of customers before they are served. The following alternative to priorities is popular both in the computer and data network industries. When a processor in a computer system has many jobs to accomplish, it often serves these jobs on a time-shared basis, spending a small increment of time on one, then the next, and so forth. In data networks, particularly high-speed networks, messages are broken into small fixed-length packets, and then the packets from different messages can be transmitted on an alternating basis between messages. A round-robin service system is a system in which, if there are m customers in the system, say c1 , c2 , . . . , cm , then c1 is served for an incremental interval δ, followed by c2 being served for an interval δ, and so forth up to cm . After cm is served for an interval δ, the server returns and starts serving c1 for an interval δ again. Thus the customers are served in a cyclic, or “round-robin” order, each getting a small increment of service on each visit from the server. When a customer’s service is completed, the customer leaves the system, m is reduced, and the server continues rotating through the now reduced cycle as before. When a new customer arrives, m is increased and the new customer must be inserted into the cycle of existing customers in a way to be discussed later. Processor sharing is the limit of round-robin service as the increment δ goes to zero. Thus, with processor sharing, if m customers are in the system, all are being served simultaneously, but each is being served at 1/m times the basic server rate. For the example of two customers with service requirement 1 and 10, each customer is initially served at rate 1/2, so one customer departs at time 2. At that time, the remaining customer is served at rate 1 and departs at time 11. For round-robin service with an increment of 1, the customer with unit service requirement departs at either time 1 or 2, depending on the initial order of service. With other increments of service, the results are slightly different. We first analyze round-robin service and then go to the processor-sharing limit as δ → 0. As the above example suggests, the results are somewhat cleaner in the limiting case, but more realistic in the round-robin case. Round robin provides a good example of the use of backward transition probabilities to find the steady-state distribution of a Markov chain. The techniques used here are quite similar to those used in the next chapter to analyze queueing networks. Assume a Bernoulli arrival process in which the probability of an arrival in an interval δ is ∏δ. Assume that the ith arriving customer has a service requirement Wi . The random variables Wi , i ≥ 1, are IID and independent of the arrival epochs. Thus, in terms of the arrival process and the service requirements, this is the same as an M/G/1 queue (see Section 3.6), but with M/G/1 queues, the server serves each customer completely before going on to the next customer. We shall find that the round-robin service here avoids the “slow truck effect” identified with the M/G/1 queue. For simplicity, assume that Wi is arithmetic with span δ, taking on only values that are positive integer multiples of δ. Let f (j) = Pr{Wi = jδ} , j ≥ 1 and let F (j) = Pr{Wi > jδ}. Note that if a customer has already received j increments of service, then the probability that that customer will depart after 1 more increment is f (j + 1)/F (j). 
This probability of departure on the next service increment after the jth is denoted by

    g(j) = f (j + 1)/F (j);    j ≥ 1.        (5.43)
The state s of a round-robin system can be expressed as the number, m, of customers in the system, along with an ordered listing of how many service increments each of those m customers have received, i.e.,

    s = (m, z1 , z2 , . . . , zm ),        (5.44)
where z1 δ is the amount of service already received by the customer at the front of the queue, z2 δ is the service already received by the next customer in order, etc. In the special case of an idle queue, s = (0), which we denote as φ. Given that the state Xn at time nδ is s ≠ φ, the state Xn+1 at time nδ + δ evolves as follows:
• A new arrival enters with probability λδ and is placed at the front of the queue;

• The customer at the front of the queue receives an increment δ of service;

• The customer departs if service is complete;

• Otherwise, the customer goes to the back of the queue.

It can be seen that the state transition depends, first, on whether a new arrival occurs (an event of probability λδ), and, second, on whether a departure occurs. If no arrival and no departure occurs, then the queue simply rotates. The new state is s′ = r(s), where the rotation operator r(s) is defined by r(s) = (m, z2 , . . . , zm , z1 + 1). If a departure but no arrival occurs, then the customer at the front of the queue receives its last unit of service and departs. The new state is s′ = d(s), where the departure operator d(s) is defined by d(s) = (m − 1, z2 , . . . , zm ). If an arrival occurs, the new customer receives one unit of service and goes to the back of the queue if more than one unit of service is required. In this case, the new state is s′ = a(s) where the arrival operator a(s) is defined by a(s) = (m + 1, z1 , z2 , . . . , zm , 1). If only one unit of service is required by a new arrival, the arrival departs and s′ = s. In the special case of an empty queue, s = φ, the state is unchanged if either no arrival occurs or an arrival requiring one increment of service arrives. Otherwise, the new state is s = (1, 1), i.e., the one customer in the system has received one increment of service. We next find the probability of each transition for s ≠ φ. The probability of no arrival is 1 − λδ. Given no arrival, and given a non-empty system, s ≠ φ, the probability of a departure is g(z1 ) = f (z1 + 1)/F (z1 ), i.e., the probability that one more increment of service allows the customer at the front of the queue to depart. Thus the probability of a departure is (1 − λδ)g(z1 ) and the probability of a rotation is (1 − λδ)[1 − g(z1 )]. Finally, the probability of an arrival is λδ, and given an arrival, the new arrival will leave the system after one unit of service with probability g(0) = f (1). Thus the probability of an arrival and no departure is λδ[1 − f (1)] and the probability of an unchanged system is λδf (1). To
summarize, for s ≠ φ,

    Ps,r(s) = (1 − λδ)[1 − g(z1 )];        r(s) = (m, z2 , . . . , zm , z1 + 1)
    Ps,d(s) = (1 − λδ)g(z1 );              d(s) = (m − 1, z2 , . . . , zm )
    Ps,a(s) = λδ[1 − f (1)];               a(s) = (m + 1, z1 , z2 , . . . , zm , 1)
    Ps,s   = λδf (1).                                                              (5.45)
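For readers who like to see the bookkeeping spelled out, here is a small sketch of the state and its three operators; the function names and the example PMF are assumptions made for illustration, not notation from the text.

    # Round-robin state s = (m, z1, ..., zm), represented here as a tuple of the z's.
    # rotate: front customer gets one more increment and moves to the back (r(s));
    # depart: front customer leaves (d(s));  arrive: new customer gets one increment
    # and joins the back of the queue (a(s)).
    def rotate(s):   return s[1:] + (s[0] + 1,)
    def depart(s):   return s[1:]
    def arrive(s):   return s + (1,)

    def transition_probs(s, lam_delta, f, Fbar):
        """Return {next_state: probability} for a non-empty state s, following (5.45)."""
        z1 = s[0]
        g = f(z1 + 1) / Fbar(z1)          # departure probability for the front customer
        return {
            rotate(s): (1 - lam_delta) * (1 - g),
            depart(s): (1 - lam_delta) * g,
            arrive(s): lam_delta * (1 - f(1)),
            s:         lam_delta * f(1),
        }

    # Example: service time uniform on {1, 2, 3} increments, lambda*delta = 0.05.
    f = lambda j: 1/3 if j in (1, 2, 3) else 0.0
    Fbar = lambda j: max(0.0, 1 - j/3)
    print(transition_probs((2, 1), 0.05, f, Fbar))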
For the special case of the idle state, Pφ,φ = (1 − λδ) + λδf (1) and Pφ,(1,1) = λδ(1 − f (1)).

We now find the steady-state distribution for this Markov chain by looking at the backward Markov chain. We will hypothesize backward transition probabilities, and then use Theorem 5.7 to verify that the hypothesis is correct. Consider the backward transitions corresponding to each of the forward transitions in (5.45). A rotation in forward time causes the elements z1 , . . . , zm in the state s = (m, z1 , . . . , zm ) to rotate left, and the left most element (corresponding to the front of the queue) is incremented while rotating to the right end. The backward transition from r(s) to s corresponds to the elements z2 , . . . , zm , z1 + 1 rotating to the right, with the right most element being decremented while rotating to the left end. If we view the transitions in backward time as a kind of round-robin system, we see that the rotation is in the opposite direction from the forward time system. In the backward time system, we view the numbers z1 , . . . , zm in the state as the remaining service required before the corresponding customers can depart. Thus, these numbers decrease in the backward moving system. Also, since the customer rotation in the backward moving system is opposite to that in the forward moving system, zm is the remaining service of the customer at the front of the queue, and z1 is the remaining service of the customer at the back of the queue. We also view departures in forward time as arrivals in backward time. Thus the backward transition from d(s) = (m − 1, z2 , . . . , zm ) to s = (m, z1 , . . . , zm ) corresponds to an arrival requiring z1 + 1 units of service; the arrival goes to the front of the queue, receives one increment of service, and then goes to the back of the queue with z1 increments of remaining service. The nicest thing we could now hope for is that the arrivals in backward time are Bernoulli. This is a reasonable hypothesis to make, partly because it is plausible, and partly because it is easy to check via Theorem 5.7. Fortunately, we shall find that it is valid. According to this hypothesis, the backward transition probability P∗r(s),s is given by 1 − λδ; that is, given that Xn+1 is r(s) = (m, z2 , . . . , zm , z1 +1), and given that there is no arrival in the backward system at time (n + 1)δ, then the only possible state at time n is s = (m, z1 , . . . , zm ). Next consider a backward transition from d(s) = (m − 1, z2 , . . . , zm ) to s = (m, z1 , z2 , . . . , zm ). This corresponds to an arrival in the backward moving system; the arrival requires z1 + 1 increments of service, one of which is provided immediately, leaving the arrival at the back of the queue with z1 required increments of service remaining. The probability of this transition is P∗d(s),s = λδf (z1 + 1). Calculating the other backward transitions in the same way, the hypothesized backward transition probabilities are given by

    P∗r(s),s = 1 − λδ        P∗d(s),s = λδf (z1 + 1)
    P∗a(s),s = 1 − λδ        P∗s,s = λδf (1).        (5.46)
One should view (5.46) as an hypothesis for the backward transition probabilities. The arguments leading up to (5.46) are simply motivation for this hypothesis. If the hypothesis
is correct, we can combine (5.45) and (5.46) to express the steady-state equations of Theorem 5.7 (for s ≠ φ) as

    πs Ps,r(s) = πr(s) P∗r(s),s ;    (1 − λδ)[1 − g(z1 )]πs = (1 − λδ)πr(s)        (5.47)
    πs Ps,d(s) = πd(s) P∗d(s),s ;    (1 − λδ)g(z1 )πs = λδf (z1 + 1)πd(s)        (5.48)
    πs Ps,a(s) = πa(s) P∗a(s),s ;    λδ[1 − f (1)]πs = (1 − λδ)πa(s)        (5.49)
    πs Ps,s = πs P∗s,s ;             λδf (1)πs = λδf (1)πs .        (5.50)
We next show that (5.48), applied repeatedly, will allow us to solve for πs (if λ is small enough for the states to be positive-recurrent). Verifying that the solution also satisfies (5.47) and (5.49) will then verify the hypothesis. Since f (z1 + 1)/g(z1 ) is F (z1 ) from (5.43), we have

    πs = [λδ/(1 − λδ)] F (z1 )πd(s) .        (5.51)
For m > 1, d(s) = (m − 1, z2 , . . . , zm ), so we can apply (5.51) to πd(s) , and substitute the result back into (5.51), yielding

    πs = [λδ/(1 − λδ)]^2 F (z1 )F (z2 )πd(d(s)) ,        (5.52)
where d(d(s)) = (m − 2, z3 , . . . , zm ). Applying (5.51) repeatedly to πd(d(s)) , πd(d(d(s))) , and so forth, we eventually get

    πs = [λδ/(1 − λδ)]^m ∏_{j=1}^{m} F (zj ) πφ .        (5.53)
Before this can be accepted as a steady-state distribution, we must verify that it satisfies (5.47) and (5.49). The left hand side of (5.47) is (1 − λδ)[1 − g(z1 )]πs , and, from (5.43), 1 − g(z1 ) = [F (z1 ) − f (z1 + 1)]/F (z1 ) = F (z1 + 1)/F (z1 ). Thus using (5.53), the left side of (5.47) is

    (1 − λδ) [λδ/(1 − λδ)]^m (F (z1 +1)/F (z1 )) ∏_{j=1}^{m} F (zj ) πφ = (1 − λδ) [λδ/(1 − λδ)]^m F (z1 +1) ∏_{j=2}^{m} F (zj ) πφ .
This is equal to (1 − λδ)πr(s) , verifying (5.47). Equation (5.49) is verified in the same way. We now have to find whether there is a solution for πφ such that these probabilities sum to 1. First define Pm = ∑_{z1 ,...,zm} π(m, z1 , . . . , zm ). This is the probability of m customers in the system. Whenever a new customer enters the system, it receives one increment of service immediately, so each zi ≥ 1. Using the hypothesized solution in (5.53),

    Pm = [λδ/(1 − λδ)]^m ( ∑_{i=1}^{∞} F (i) )^m πφ .        (5.54)
222
CHAPTER 5. COUNTABLE-STATE MARKOV CHAINS
Since F (i) = Pr{W > iδ}, since W is arithmetic with span δ, and since the mean of a non-negative random variable is the integral of its complementary distribution function, we have δ
1 X i=1
F (i) = E [W ] − δ Pm =
√
∏ 1 − ∏δ
!m
(5.55) ≥ ¥m πφ . E [W ] − δ
(5.56)
Defining ρ = [∏/(1 − ∏δ)]{E [W ] − δ}, we see Pm = ρm πφ . If ρ < 1, then πφ = 1 − ρ, and Pm = (1 − ρ)ρm ;
m ≥ 0.
(5.57)
The condition r < 1 is required for the states to be positive-recurrent. The expected P number of customers in the system for a round-robin queue is m mPm = ρ/(1 − ρ), and using Little’s theorem, Theorem 3.8, the expected delay is ρ/[∏(1 − ρ)]. In using Little’s theorem here, however, we are viewing the time a customer spends in the system as starting when the number m in the state increases; that is, if a customer arrives at time nδ, it goes to the front of the queue and receives one increment of service, and then, assuming it needs more than one increment, the number m in the state increases at time (n + 1)δ. Thus the actual expected delay, including the original d when the customer is being served but not counted in the state, is δ + ρ/[∏(1 − ρ)]. The relation between ρ and ∏E [W ] is shown in Figure 5.7, and it is seen that ρ < 1 for ∏E [W ] < 1. The extreme case where ∏δ = ∏E [W ] is the case for which each customer requires exactly one unit of service. Since at most one customer can arrive per time increment, the state always remains at s = φ, and the delay is δ, i.e., the original increment of service received when a customer arrives. 1
° ° ρ ° ° ∏δ ° ∏E [W ]
1
Figure 5.7: ρ as a function of ∏E [W ] for given ∏δ. Note that (5.57) is the same as the distribution of customers in the system for the M/M/1 Markov chain in (5.42), except for the anomaly in the definition of ρ here. We then have the surprising result that if round-robin queueing is used rather than FCFS, then the distribution of the number of customers in the system is approximately the same as that for an M/M/1 queue. In other words, the slow truck effect associated with the M/G/1 queue has been eliminated. Another remarkable feature of round-robin systems is that one can also calculate the expected delay for a customer conditional on the required service of that customer. This
5.7. SEMI-MARKOV PROCESSES
223
is done in Exercise 5.15, and it is found that the expected delay is linear in the required service. Next we look at processor sharing by going to the limit as δ → 0. We first eliminate the assumption that the service requirement distribution is arithmetic with span δ. Assume that the server always spends an increment of time δ on the customer at the front of the queue, and if service is finished before the interval of length δ ends, the server is idle until the next sample time. The analysis of the steady-state distribution above P is still valid if 1 we define F (j) = Pr{W > jδ}, and f (j) = F (j) − F (j + 1). In this case δ i=1 F (i) lies between E [W ] − δ and E [W ]. As δ → 0, ρ = ∏E [W ], and distribution of time in the system becomes identical to that of the M/M/1 system.
5.7
Semi-Markov processes
Semi-Markov process are generalizations of Markov chains in which the time intervals between transitions are random. To be specific, let X(t) be the state of the process at time t and let {0, 1, 2, . . . } denote the set of possible states (which can be finite or countably infinite). Let the random variables S1 < S2 < S3 < . . . denote the successive epochs at which state transitions occur. Let Xn be the new state entered at time Sn (i.e., Xn = X(Sn ), and X(t) = Xn for Sn ≤ t < Sn + 1). Let S0 = 0 and let X0 denote the starting state at time 0 (i.e., X0 = X(0) = X(S0 ). As part of the definition of a semi-Markov process, the sequence {Xn ; n ≥ 0} is required to be a Markov chain, and the transition probabilities of that chain are denoted {Pij , i ≥ 0, j ≥ 0}. This Markov chain is called the embedded Markov chain of the semi-Markov process. Thus, for n ≥ 1, Pr{Xn = j | Xn−1 = i} = Pr{X(Sn ) = j | X(Sn−1 ) = i} = Pij .
(5.58)
Conditional on X(Sn−1 ), the state entered at Sn is independent of X(t) for all t < Sn−1 . As the other part of the definition of a semi-Markov process, the intervals Un = Sn − Sn−1 between successive transitions for n ≥ 1 are random variables that depend only on the states X(Sn−1 ) and X(Sn ). More precisely, given Xn−1 and Xn , the interval Un is independent of the set of Um for m < n and independent of X(t) for all t < Sn−1 . The conditional distribution function for the intervals Un is denoted by Gij (u), i.e., Pr{Un ≤ u | Xn−1 = i, Xn = j} = Gij (u).
(5.59)
The conditional mean of Un , conditional on Xn−1 = i, Xn = j, is denoted (i, j), i.e., Z U (i, j) = E [Un | Xn−1 = i, Xn = j] = [1 − Gij (u)]du. (5.60) u≥0
We can visualize a semi-Markov process evolving as follows: given an initial state, X0 = i at time 0, a new state X1 = j is selected according to the embedded chain with probability Pij . Then U1 = S1 is selected using the distribution Gij (u). Next a new state X2 = k is chosen according to the probability Pjk ; then, given X1 = j and X2 = k, the interval U2 is selected with distribution function Gjk (u). Successive state transitions and transition times are chosen in the same way. Because of this evolution from X0 = i, we see that U1 = S1
224
CHAPTER 5. COUNTABLE-STATE MARKOV CHAINS
is a random variable, so S1 is finite with probability 1. Also U2 is a random variable, so that S2 = S1 + U2 is a random variable and thus is finite with probability 1. By induction, Sn is a random variable and thus is finite with probability 1 for all n ≥ 1. This proves the following simple lemma. Lemma 5.5. Let M (t) be the number of transitions in a semi-Markov process in the interval (0, t], (i.e., SM (t) ≤ t < SM (t+1) ) for some given initial state X0 . Then limt→1 M (t) = 1 with probability 1. Figure 5.8 shows an example of a semi-Markov process in which the transition times are deterministic but depend on the transitions. The important point that this example brings out is that the embedded Markov chain has steady-state probabilities that are each 1/2. On the other hand, the semi-Markov process spends most of its time making long transitions from state 0 to state 1, and during these transitions the process is in state 0. This means that one of our first objectives must be to understand what steady-state probabilities mean for a semi-Markov process. 1/2; 1
[Figure 5.8 graph and sample path omitted in this extraction. The surviving arc labels are 1/2; 1 and 1/2; 10, and the sample path below the graph shows the transitions 0→0, 0→1, 1→0, 0→0.]
Figure 5.8: Example of a semi-Markov process with deterministic transition epochs. The label on each arc (i, j) in the graph gives Pij followed by U(i, j). The solid dots on the sample function below the graph show the state transition epochs and the new states entered. Note that the state at Sn is the new state entered, i.e., Xn, and the state remains Xn in the interval [Sn, Sn+1).
In what follows, we assume that the embedded Markov chain is irreducible and positive-recurrent. Define U(i) as the expected time in state i before a transition, i.e.,
U(i) = E[Un | Xn−1 = i] = Σ_j Pij E[Un | Xn−1 = i, Xn = j] = Σ_j Pij U(i, j).     (5.61)
The steady-state probabilities {πi} for the embedded chain tell us the fraction of transitions that enter any given state i. Since U(i) is the expected holding time in i per transition into i, we would guess that the fraction of time spent in state i should be proportional to πi U(i). Normalizing, we would guess that the time-average probability of being in state i should be
pi = πi U(i) / Σ_j πj U(j).     (5.62)
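A minimal numerical sketch of (5.62) (an illustration, not part of the text): given the embedded-chain transition matrix {Pij} and the mean holding times U(i, j), it solves for the embedded steady-state vector π and then forms the time-average probabilities pi. The three-state matrix and holding times used here are arbitrary placeholders.

import numpy as np

def semi_markov_time_averages(P, U):
    """P: embedded-chain transition matrix; U: matrix of mean holding times U(i, j).
    Returns (pi, p): embedded steady-state probabilities and time-average state probabilities (5.62)."""
    n = P.shape[0]
    # Solve pi = pi P together with the normalization sum(pi) = 1.
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1); b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    Ubar = (P * U).sum(axis=1)            # U(i) = sum_j Pij U(i, j), as in (5.61)
    p = pi * Ubar / (pi * Ubar).sum()     # (5.62)
    return pi, p

# Illustrative placeholder numbers (not from the text).
P = np.array([[0.0, 0.7, 0.3],
              [0.5, 0.0, 0.5],
              [1.0, 0.0, 0.0]])
U = np.array([[0.0, 2.0, 1.0],
              [4.0, 0.0, 1.0],
              [0.5, 0.0, 0.0]])
print(semi_markov_time_averages(P, U))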
By now, it should be no surprise that renewal theory is the appropriate tool to make this precise. We continue to assume an irreducible positive-recurrent chain; this ensures that the steady-state probabilities for the embedded chain exist and are positive. It is possible
for the denominator in (5.62) to be infinite. This can happen either because U(i) is infinite for some i, or because the sum does not converge (for example, there could be a countably infinite number of states and U(i) could be proportional to 1/πi). When the denominator is infinite, we say that the probabilities {pi} do not exist; this is a bizarre special case, and it is discussed further in Section 6.1, but it doesn't have to be specifically excluded from the following analysis.
Lemma 5.6. Consider a semi-Markov process with an irreducible recurrent embedded chain {Xn ; n ≥ 0}. Given X0 = i, let {Mij(t); t ≥ 0} be the number of transitions into a given state j in the interval (0, t]. Then {Mij(t); t ≥ 0} is a delayed renewal process (or, if j = i, is an ordinary renewal process).
Proof: Let M(t) be the total number of state transitions over all states that occur in the interval (0, t]. From Lemma 5.5, lim_{t→∞} M(t) = ∞ with probability 1. Let Nij(n) be the number of transitions into state j that occur in the embedded Markov chain by the nth transition of the embedded chain. From Lemma 5.4, {Nij(n); n ≥ 0} is a delayed renewal process. It follows from Lemma 3.2 of Chapter 3 that lim_{n→∞} Nij(n) = ∞ with probability 1. Since Mij(t) is the number of transitions into j during the first M(t) transitions of the embedded chain, we have Mij(t) = Nij(M(t)). Thus,
lim_{t→∞} Mij(t) = lim_{t→∞} Nij(M(t)) = lim_{n→∞} Nij(n) = ∞.
It follows that the time W1 at which the first transition into state j occurs, and the subsequent interval W2 until the next transition, are both finite with probability 1. Subsequent intervals have the same distribution as W2, and all intervals are independent, so {Mij(t); t ≥ 0} is a delayed renewal process with inter-renewal intervals {Wk ; k ≥ 1}. If i = j, then all Wk are identically distributed and we have an ordinary renewal process, completing the proof.
Let W(j) be the mean inter-renewal interval between successive transitions into state j (i.e., the mean of the inter-renewal intervals W2, W3, . . . in {Mij(t); t ≥ 0}). Consider a delayed renewal-reward process defined on {Mij(t); t ≥ 0} for which R(t) = 1 whenever X(t) = j (see Figure 5.9). Define pj as the time-average fraction of time spent in state j. Then, if U(j) < ∞, Theorems 3.8 and 3.12 of Chapter 3 state that
pj = lim_{t→∞} (1/t) ∫_0^t R(τ) dτ = U(j)/W(j)     with probability 1.     (5.63)
We can also investigate the limit, as t → ∞, of the probability that X(t) = j. This is equal to E[R(t)] for the renewal-reward process above. From Equation (3.72) of Chapter 3, if the distribution of the inter-renewal time is non-arithmetic, then
pj = lim_{t→∞} E[R(t)] = U(j)/W(j).     (5.64)
Next we must express the mean inter-renewal time, W(j), in terms of more accessible quantities. From Theorem 3.9 of Chapter 3,
lim_{t→∞} Mij(t)/t = 1/W(j)     with probability 1.     (5.65)
Figure 5.9: The delayed renewal-reward process for time in state j. The reward is one from an entry into state j, say at the nth transition of the embedded chain, until the next transition out of state j. The expected duration of such an interval Un is U(j). The inter-renewal interval Wk, assuming the kth occurrence of state j happens on the nth transition of the embedded chain, lasts until the next entry into state j, with expected duration W(j).
As before, Mij(t) = Nij(M(t)) where, given X0 = i, M(t) is the total number of transitions in (0, t] and Nij(n) is the number of transitions into state j in the embedded Markov chain by the nth transition. Lemma 5.5 shows that lim_{t→∞} M(t) = ∞ with probability 1, so
lim_{t→∞} Mij(t)/M(t) = lim_{t→∞} Nij(M(t))/M(t) = lim_{n→∞} Nij(n)/n = πj .     (5.66)
Combining (5.65) and (5.66), we have
1/W(j) = lim_{t→∞} Mij(t)/t
       = lim_{t→∞} [Mij(t)/M(t)] [M(t)/t]     (5.67)
       = lim_{t→∞} [Mij(t)/M(t)] lim_{t→∞} [M(t)/t]
       = πj lim_{t→∞} M(t)/t.     (5.68)
Substituting this in (5.63), we see that pj = πj U(j) lim_{t→∞} {M(t)/t}. Since lim_{t→∞} {M(t)/t} is independent of j, and since Σ_j pj = 1, we see that lim_{t→∞} {M(t)/t} must be equal to {Σ_j πj U(j)}^{−1}, thus yielding (5.62). Summarizing, we have the following theorem:
Theorem 5.9. Assume that the embedded Markov chain of a semi-Markov process is irreducible and positive-recurrent. If Σ_i πi U(i) < ∞, then, with probability 1, the time-average fraction of time spent in each state j is given by the pj of (5.62), and each pj > 0.
It is not hard to see that if the distribution of inter-renewal intervals for one value of j is arithmetic with span d, then the distribution of inter-renewal intervals for each i is arithmetic with the same span (see Exercise 5.19).
For the example of Figure 5.8, we see by inspection that U(0) = 5.5 and U(1) = 1. Thus p0 = 11/13 and p1 = 2/13. For a semi-Markov process, knowing the probability that X(t) = j for large t does not completely specify the steady-state behavior. Another important steady-state question is to determine the fraction of time involved in i to j transitions. To make this notion precise, define Y(t) as the residual time until the next transition after time t (i.e., t + Y(t) is the epoch of the next transition after time t). We want to determine the fraction of time t over which X(t) = i and X(t + Y(t)) = j. Equivalently, for a non-arithmetic process, we want to determine Pr{X(t) = i, X(t + Y(t)) = j} in the limit as t → ∞. Call this limit Q(i, j). Consider a renewal process, starting in state i and with renewals on transitions to state i. Define a reward R(t) = 1 for X(t) = i, X(t + Y(t)) = j and R(t) = 0 otherwise (see Figure 5.10). That is, for each n such that X(Sn) = i and X(Sn+1) = j, R(t) = 1 for Sn ≤ t < Sn+1. The expected reward in an inter-renewal interval is then Pij U(i, j). It follows that Q(i, j) is given by
Q(i, j) = lim_{t→∞} (1/t) ∫_0^t R(τ) dτ = Pij U(i, j)/W(i) = pi Pij U(i, j)/U(i).     (5.69)
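As a sanity check on (5.62) and the numbers just computed, here is a short Monte Carlo sketch (not part of the text). It assumes a reading of Figure 5.8 in which each of the four arcs has probability 1/2, the 0→1 transition takes 10 time units, and the other transitions take 1 time unit; if the figure is labeled differently, the parameters below would need to change.

import random

def simulate_fig_5_8(n_transitions=200_000, seed=1):
    """Simulate the two-state semi-Markov process and estimate the fraction of time in state 0."""
    random.seed(seed)
    # Assumed arc labels (Pij; U(i, j)): from state 0, 1/2;1 to state 0 and 1/2;10 to state 1;
    # from state 1, 1/2;1 to state 0 and 1/2;1 to state 1.
    state, time_in_0, total_time = 0, 0.0, 0.0
    for _ in range(n_transitions):
        nxt = random.randint(0, 1)                    # embedded chain: each next state with probability 1/2
        hold = 10.0 if (state == 0 and nxt == 1) else 1.0
        if state == 0:
            time_in_0 += hold
        total_time += hold
        state = nxt
    return time_in_0 / total_time

print(simulate_fig_5_8(), 11 / 13)   # the estimate should be close to 11/13, about 0.846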
Figure 5.10: The renewal-reward process for i to j transitions. The expected value of Un if Xn = i and Xn+1 = j is U (i, j) and the expected interval between entries to i is W (i).
5.8
Example — the M/G/1 queue
As one example of a semi-Markov chain, consider an M/G/1 queue. Rather than the usual interpretation in which the state of the system is the number of customers in the system, we view the state of the system as changing only at departure times; the new state at a departure time is the number of customers left behind by the departure. This state then remains fixed until the next departure. New customers still enter the system according to the Poisson arrival process, but these new customers are not considered as part of the state until the next departure time. The number of customers in the system at arrival epochs does not in general constitute a “state” for the system, since the age of the current service is also necessary as part of the statistical characterization of the process. One purpose of this example is to illustrate that it is often more convenient to visualize the transition interval Un = Sn − Sn−1 as being chosen first and the new state Xn as being
chosen second rather than choosing the state first and the transition time second. For the M/G/1 queue, first suppose that the state is some i > 0. In this case, service begins on the next customer immediately after the old customer departs. Thus, Un, conditional on Xn = i for i > 0, has the distribution of the service time, say G(u). The mean interval until a state transition occurs is
U(i) = ∫_0^∞ [1 − G(u)] du;     i > 0.     (5.70)
Given the interval u for a transition from state i > 0, the number of arrivals in that period is a Poisson random variable with mean λu, where λ is the Poisson arrival rate. Since the next state j is the old state i, plus the number of new arrivals, minus the single departure,
Pr{Xn+1 = j | Xn = i, Un = u} = (λu)^{j−i+1} exp(−λu) / (j − i + 1)!     (5.71)
for j ≥ i − 1. For j < i − 1, the probability above is 0. The unconditional probability Pij of a transition from i to j can then be found by multiplying the right side of (5.71) by the probability density g(u) of the service time and integrating over u,
Pij = ∫_0^∞ g(u) (λu)^{j−i+1} exp(−λu) / (j − i + 1)! du;     j ≥ i − 1, i > 0.     (5.72)
For the case i = 0, the server must wait until the next arrival before starting service. Thus the expected time from entering the empty state until a service completion is
U(0) = (1/λ) + ∫_0^∞ [1 − G(u)] du.     (5.73)
We can evaluate P0j by observing that the departure of that first arrival leaves j customers in the system iff j customers arrive during the service time of that first customer; i.e., the new state doesn't depend on how long the server waits for a new customer to serve, but only on the arrivals while that customer is being served. Letting g(u) be the density of the service time,
P0j = ∫_0^∞ g(u) (λu)^j exp(−λu) / j! du;     j ≥ 0.     (5.74)
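The transition probabilities (5.72) and (5.74) are easy to evaluate numerically once a specific service density is chosen. The sketch below is an illustration, not part of the text: it assumes λ = 0.5 and, purely as an example (in the spirit of Exercise 5.18), a service time uniform on (0, 2), so g(u) = 1/2 there; any other density could be substituted.

from math import exp, factorial
from scipy.integrate import quad

lam = 0.5                                      # example arrival rate (assumed)
g = lambda u: 0.5 if 0 <= u <= 2 else 0.0      # example service density: uniform on (0, 2)

def P0j(j):
    """P_{0j} from (5.74): j Poisson(lam*u) arrivals during the first service."""
    return quad(lambda u: g(u) * (lam * u)**j * exp(-lam * u) / factorial(j), 0, 2)[0]

def Pij(i, j):
    """P_{ij} from (5.72) for i > 0, j >= i - 1: j - i + 1 arrivals during a service."""
    k = j - i + 1
    if i <= 0 or k < 0:
        raise ValueError("need i > 0 and j >= i - 1")
    return quad(lambda u: g(u) * (lam * u)**k * exp(-lam * u) / factorial(k), 0, 2)[0]

print(P0j(0), Pij(3, 2))    # e.g. probability of leaving 0 customers behind, and of a 3 -> 2 transition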
5.9
Summary
This chapter extended the finite-state Markov chain results of Chapter 4 to the case of countably-infinite state spaces. It also provided an excellent example of how renewal processes can be used for understanding other kinds of processes. In Section 5.1, the first-passage-time random variables were used to construct renewal processes with renewals on successive transitions to a given state. These renewal processes were used to rederive the basic properties of Markov chains using renewal theory as opposed to the algebraic Perron-Frobenius approach of Chapter 4. The central result of this was Theorem 5.4, which showed that, for an irreducible chain, the states are positive-recurrent iff the steady-state equations,
(5.14), have a solution. Also if (5.14) has a solution, it is positive and unique. We also showed that these steady-state probabilities are, with probability 1, time-averages for sample paths, and that, for an ergodic chain, they are limiting probabilities independent of the starting state.
We found that the major complications that result from countable state spaces are, first, different kinds of transient behavior, and second, the possibility of null-recurrent states. For finite-state Markov chains, a state is transient only if it can reach some other state from which it can't return. For countably infinite chains, there is also the case, as in Figure 5.1 for p > 1/2, where the state just wanders away, never to return. Null recurrence is a limiting situation where the state wanders away and returns with probability 1, but with an infinite expected time. There is not much engineering significance to null recurrence; it is highly sensitive to modeling details over the entire infinite set of states. One usually uses countably infinite chains to simplify models; for example, if a buffer is very large and we don't expect it to overflow, we assume it is infinite. Finding out, then, that the chain is transient or null-recurrent simply means that the modeling assumption was not very good.
Branching processes were introduced in Section 5.3 as a model to study the growth of various kinds of elements that reproduce. In general, for these models (assuming p0 > 0), there is one trapping state and all other states are transient. Figure 5.3 showed how to find the probability that the trapping state is entered by the nth generation, and also the probability that it is entered eventually. If the expected number of offspring of an element is at most 1, then the population dies out with probability 1, and otherwise, the population dies out with some given probability q, and grows without bound with probability 1 − q.
We next studied birth-death Markov chains and reversibility. Birth-death chains are widely used in queueing theory as sample time approximations for systems with Poisson arrivals and various generalizations of exponentially distributed service times. Equation (5.30) gives their steady-state probabilities if positive-recurrent, and shows the condition under which they are positive-recurrent. We showed that these chains are reversible if they are positive-recurrent. Theorems 5.6 and 5.7 provided a simple way to find the steady-state distribution of reversible chains and also of chains where the backward chain behavior could be hypothesized or deduced. We used reversibility to show that M/M/1 and M/M/m Markov chains satisfy Burke's theorem for sampled-time — namely that the departure process is Bernoulli, and that the state at any time is independent of departures before that time. Round-robin queueing was then used as a more complex example of how to use the backward process to deduce the steady-state distribution of a rather complicated Markov chain; this also gave us added insight into the behavior of queueing systems and allowed us to show that, in the processor-sharing limit, the distribution of number of customers is the same as that in an M/M/1 queue.
Finally, semi-Markov processes were introduced. Renewal theory again provided the key to analyzing these systems.
Theorem 5.9 showed how to find the steady-state probabilities of these processes, and it was shown that these probabilities could be interpreted both as time-averages and, in the case of non-arithmetic transition times, as limiting probabilities in time.
For further reading on Markov chains with countably infinite state spaces, see [9], [16], or [22]. Feller [9] is particularly complete, but Ross and Wolff are somewhat more accessible. Harris [12] is the standard reference on branching processes and Kelly [13] is the standard reference on reversibility. The material on round-robin systems is from [24] and is generalized there.
5.10
Exercises
Exercise 5.1. Let {Pij ; i, j ≥ 0} be the set of transition probabilities for an infinite-state Markov chain. For each i, j, let Fij(n) be the probability that state j occurs sometime between time 1 and n inclusive, given X0 = i. For some given j, assume that {xk ; k ≥ 0} is a set of non-negative numbers satisfying xi = Pij + Σ_{k≠j} Pik xk. Show that xi ≥ Fij(n) for all n and i, and hence that xi ≥ Fij(∞) for all i. Hint: use induction.
Exercise 5.2. a) For the Markov chain in Figure 5.1, show that, for p ≥ 1/2, F00(∞) = 2(1 − p) and show that Fi0(∞) = [(1 − p)/p]^i for i ≥ 1. Hint: first show that this solution satisfies (5.5) and then show that (5.5) has no smaller solution (see Exercise 5.1). Note that you have shown that the chain is transient for p > 1/2 and that it is recurrent for p = 1/2.
b) Under the same conditions as part (a), show that Fij(∞) equals 2(1 − p) for j = i, equals [(1 − p)/p]^{i−j} for i > j, and equals 1 for i < j.
Exercise 5.3. Let j be a transient state in a Markov chain and let j be accessible from i. Show that i is transient also. Interpret this as a form of Murphy's law (if something bad can happen, it will, where the bad thing is the lack of an eventual return). Note: give a direct demonstration rather than using Lemma 5.3.
Exercise 5.4. Consider an irreducible positive-recurrent Markov chain. Consider the renewal process {Njj(t); t ≥ 0} where, given X0 = j, Njj(t) is the number of times that state j is visited from time 1 to t. For each i ≥ 0, consider a renewal-reward function Ri(t) equal to 1 whenever the chain is in state i and equal to 0 otherwise. Let πi be the time-average reward.
a) Show that πi = 1/T_{ii} for each i with probability 1.
b) Show that Σ_i πi = 1. Hint: consider Σ_{i≤M} πi for any integer M.
c) Consider a renewal-reward function Rij(t) that is 1 whenever the chain is in state i and the next state is state j; Rij(t) = 0 otherwise. Show that the time-average reward is equal to πi Pij with probability 1. Show that pk = Σ_i πi Pik for all k.
Exercise 5.5. Let {Xn ; n ≥ 0} be a branching process with X0 = 1. Let Y, σ² be the mean and variance of the number of offspring of an individual.
a) Argue that lim_{n→∞} Xn exists with probability 1 and either has the value 0 (with probability F10(∞)) or the value ∞ (with probability 1 − F10(∞)).
b) Show that VAR(Xn) = σ² Y^{n−1} (Y^n − 1)/(Y − 1) for Y ≠ 1, and VAR(Xn) = nσ² for Y = 1.
Exercise 5.6. There are n states and for each pair of states i and j, a positive number dij = dji is given. A particle moves from state to state in the following manner: Given that the particle is in any state i, it will next move to any j ≠ i with probability Pij given by
Pij = dij / Σ_{k≠i} dik.
Assume that Pii = 0 for all i. Show that the sequence of positions is a reversible Markov chain and find the limiting probabilities.
Exercise 5.7. Consider a reversible Markov chain with transition probabilities Pij and limiting probabilities πi. Also consider the same chain truncated to the states 0, 1, . . . , M. That is, the transition probabilities {P′ij} of the truncated chain are
P′ij = Pij / Σ_{k=0}^{M} Pik for 0 ≤ i, j ≤ M,   and P′ij = 0 elsewhere.
Show that the truncated chain is also reversible and has limiting probabilities given by
π′i = [πi Σ_{j=0}^{M} Pij] / [Σ_{k=0}^{M} πk Σ_{m=0}^{M} Pkm].
Exercise 5.8. A Markov chain (with states {0, 1, 2, . . . , J − 1} where J is either finite or infinite) has transition probabilities {Pij ; i, j ≥ 0}. Assume that P0j > 0 for all j > 0 and Pj0 > 0 for all j > 0. Also assume that for all i, j, k, we have Pij Pjk Pki = Pik Pkj Pji.
a) Assuming also that all states are positive recurrent, show that the chain is reversible and find the steady state probabilities {πi} in simplest form.
b) Find a condition on {P0j ; j ≥ 0} and {Pj0 ; j ≥ 0} that is sufficient to ensure that all states are positive recurrent.
Exercise 5.9. a) Use the birth and death model described in figure 5.4 to find the steady state probability mass function for the number of customers in the system (queue plus service facility) for the following queues:
i) M/M/1 with arrival probability λδ, service completion probability µδ.
ii) M/M/m with arrival probability λδ, service completion probability iµδ for i servers busy, 1 ≤ i ≤ m.
iii) M/M/∞ with arrival probability λδ, service completion probability iµδ for i servers busy. Assume δ so small that iµδ < 1 for all i of interest.
Assume the system is positive recurrent.
b) For each of the queues above give necessary conditions (if any) for the states in the chain to be i) transient, ii) null recurrent, iii) positive recurrent.
c) For each of the queues find: L = (steady state) mean number of customers in the system. Lq = (steady state) mean number of customers in the queue. W = (steady state) mean waiting time in the system. Wq = (steady state) mean waiting time in the queue.
Exercise 5.10. a) Given that an arrival occurs in the interval (nδ, (n+1)δ) for the sampled-time M/M/1 model in figure 5, find the conditional PMF of the state of the system at time nδ (assume n arbitrarily large and assume positive recurrence).
b) For the same model, again in steady state but not conditioned on an arrival in (nδ, (n + 1)δ), find the probability Q(i, j) (i ≥ j > 0) that the system is in state i at nδ and that i − j departures occur before the next arrival.
c) Find the expected number of customers seen in the system by the first arrival after time nδ. (Note: the purpose of this exercise is to make you cautious about the meaning of "the state seen by a random arrival").
Exercise 5.11. Find the backward transition probabilities for the Markov chain model of age in figure 2. Draw the graph for the backward Markov chain, and interpret it as a model for residual life.
Exercise 5.12. Consider the sample time approximation to the M/M/1 queue in figure 5.
a) Give the steady-state probabilities for this chain (no explanations or calculations required, just the answer).
In parts b) to g) do not use reversibility and do not use Burke's theorem. Let Xn be the state of the system at time nδ and let Dn be a random variable taking on the value 1 if a departure occurs between nδ and (n + 1)δ, and the value 0 if no departure occurs. Assume that the system is in steady-state at time nδ.
b) Find Pr{Xn = i, Dn = j} for i ≥ 0, j = 0, 1.
c) Find Pr{Dn = 1}.
d) Find Pr{Xn = i | Dn = 1} for i ≥ 0.
e) Find Pr{Xn+1 = i | Dn = 1} and show that Xn+1 is statistically independent of Dn. Hint: Use part d); also show that Pr{Xn+1 = i} = Pr{Xn+1 = i | Dn = 1} for all i ≥ 0 is sufficient to show independence.
f) Find Pr{Xn+1 = i, Dn+1 = j | Dn} and show that the pair of variables (Xn+1, Dn+1) is statistically independent of Dn.
g) For each k > 1, find Pr{Xn+k = i, Dn+k = j | Dn+k−1, Dn+k−2, . . . , Dn} and show that the pair (Xn+k, Dn+k) is statistically independent of (Dn+k−1, Dn+k−2, . . . , Dn). Hint: use induction on k; as a substep, find Pr{Xn+k = i | Dn+k−1 = 1, Dn+k−2, . . . , Dn} and show that Xn+k is independent of Dn+k−1, Dn+k−2, . . . , Dn.
h) What do your results mean relative to Burke's theorem?
Exercise 5.13. Let {Xn, n ≥ 1} denote an irreducible recurrent Markov chain having a countable state space. Now consider a new stochastic process {Yn, n ≥ 0} that only accepts values of the Markov chain that are between 0 and some integer m. For instance, if m = 3 and X1 = 1, X2 = 3, X3 = 5, X4 = 6, X5 = 2, then Y1 = 1, Y2 = 3, Y3 = 2.
a) Is {Yn, n ≥ 0} a Markov chain? Explain briefly.
b) Let pj denote the proportion of time that {Xn, n ≥ 1} is in state j. If pj > 0 for all j, what proportion of time is {Yn, n ≥ 0} in each of the states 0, 1, . . . , m?
c) Suppose {Xn} is null-recurrent and let pi(m), i = 0, 1, . . . , m denote the long-run proportions for {Yn, n ≥ 0}. Show that pj(m) = pi(m)E[time the X process spends in j between returns to i], j ≠ i.
c) Show that the probability that there are m customers in the system, and that those customers have original service requirements given by w1 , . . . , wm , is √ !m m Y ∏δ (wj − 1)f (wj ). Pr{m, w1 , . . . , wm } = πφ 1 − ∏δ j=1
d) Given that a customer has original service requirement w, find the expected time that customer spends in the system.
Exercise 5.16. A taxi alternates between three locations. When it reaches location 1 it is equally likely to go next to either 2 or 3. When it reaches 2 it will next go to 1 with probability 1/3 and to 3 with probability 2/3. From 3 it always goes to 1. The mean times between locations i and j are t12 = 20, t13 = 30, t23 = 30 (assume tij = tji). What is the (limiting) probability that the taxi's most recent stop was at location i, i = 1, 2, 3? What is the (limiting) probability that the taxi is heading for location 2? What fraction of time is the taxi traveling from 2 to 3? Note: Upon arrival at a location the taxi immediately departs.
Exercise 5.17. Consider an M/G/1 queueing system with Poisson arrivals of rate λ and expected service time E[X]. Let ρ = λE[X] and assume ρ < 1. Consider a semi-Markov process model of the M/G/1 queueing system in which transitions occur on departures from the queueing system and the state is the number of customers immediately following a departure.
a) Suppose a colleague has calculated the steady-state probabilities {pi} of being in state i for each i ≥ 0. For each i ≥ 0, find the steady-state probability πi of state i in the embedded Markov chain. Give your solution as a function of ρ, πi, and p0.
b) Calculate p0 as a function of ρ.
c) Find πi as a function of ρ and pi.
d) Is pi the same as the steady-state probability that the queueing system contains i customers at a given time? Explain carefully.
Exercise 5.18. Consider an M/G/1 queue in which the arrival rate is λ and the service time distribution is uniform (0, 2W) with λW < 1. Define a semi-Markov chain following the framework for the M/G/1 queue in Section 5.8.
a) Find P0j ; j ≥ 0.
b) Find Pij for i > 0; j ≥ i − 1.
Exercise 5.19. Consider a semi-Markov process for which the embedded Markov chain is irreducible and positive-recurrent. Assume that the distribution of inter-renewal intervals for one state j is arithmetic with span d. Show that the distribution of inter-renewal intervals for all states is arithmetic with the same span.
Chapter 6
MARKOV PROCESSES WITH COUNTABLE STATE SPACES
6.1
Introduction
Recall that a Markov chain is a discrete-time process {Xn ; n ≥ 0} with the property that for each integer n ≥ 1, the state Xn at time n is a random variable (rv) that is statistically dependent on past states only through the most recent state Xn−1. A Markov process is a generalization of a Markov chain in the sense that, along with the sequence of states, there is a random time interval from the entry to one state until the entry to the next. We denote the sequence of states by {Xn ; n ≥ 0} and, as before, assume this sequence forms a Markov chain with a countable state space. We assume the process is in state X0 at time 0, and let S1, S2, . . . , be the epochs at which successive state transitions occur. Thus the process is in state X0 in the time interval [0, S1), in state X1 in the interval [S1, S2), etc. The intervals between successive transitions are denoted U1, U2, . . . , and thus U1 = S1, U2 = S2 − S1, and in general Un = Sn − Sn−1. The epoch of the nth transition is then Sn = Σ_{i=1}^{n} Ui. Note that Ui is the interval preceding the entry to Xi.
A Markov process is thus specified by specifying both the sequence of rv's {Xn ; n ≥ 0} and {Un ; n ≥ 1}. There are two additional requirements for the process to be defined as Markov. The first is that for each n ≥ 1, Un, conditional on Xn−1, is statistically independent of all the other states and of all the other intervals. Second, Un (conditional on Xn−1) is required to be an exponential rv with a rate νi that is a function only of the sample value i of Xn−1. Thus, for all integers n ≥ 1, all sample values i of the state space, and all u ≥ 0,
Pr{Un ≤ u | Xn−1 = i} = 1 − exp(−νi u).
(6.1)
A Markov process is then specified by assigning a rate νi to each state i in the state space and by specifying the transition probabilities Pij for the Markov chain. The Markov chain here is called the embedded Markov chain of the process. The state of a Markov process at any time t > 0 is denoted by X(t) and is given by
X(t) = Xn   for Sn ≤ t < Sn+1.     (6.2)
This defines a stochastic process {X(t); t ≥ 0} in the sense that each sample point ω ∈ Ω maps into a sequence of sample values of {Xn ; n ≥ 0} and {Sn ; n ≥ 1}, and thus into a sample function of {X(t); t ≥ 0}. This stochastic process is what is usually referred to as a Markov process, but it is often simpler to view {Xn ; n ≥ 0}, {Sn ; n ≥ 1} as a characterization of the process. We assume throughout this chapter (except in a few places where specified otherwise) that the embedded Markov chain has no self transitions, i.e., Pii = 0 for all states i. One reason for this is that such transitions are invisible in {X(t); t ≥ 0}. Another is that with this assumption, the representation of {X(t); t ≥ 0} in terms of the embedded chain and the transition rates is unique. We are not interested for the moment in exploring the probability distribution of X(t) for given values of t, but one feature we can see immediately is that for any times t > τ > 0 and any states i, j,
Pr{X(t)=j | X(τ)=i, {X(s) = x(s); s < τ}} = Pr{X(t−τ)=j | X(0)=i}.
(6.3)
This property arises because of the memoryless property of the exponential distribution. If X(τ) = i, it makes no difference how long the process has been in state i before τ; the time to the next transition is still exponential and the next state is still determined by the embedded chain. This will be seen more clearly in the following exposition. This property is the reason why these processes are called Markov, and is often taken as the defining property of Markov processes.
Example 6.1.1. The M/M/1 queue: An M/M/1 queue has Poisson arrivals at a rate denoted by λ and has a single server with an exponential service distribution of rate µ > λ (see Figure 6.1). Successive service times are independent both of each other and of arrivals. The state X(t) of the queue is the total number of customers either in the queue or in service. When X(t) = 0, the time to the next transition is the time until the next arrival, i.e., ν0 = λ. For any state i > 0, the server is busy and the time to the next transition is the time until either a new arrival occurs or a departure occurs. Thus νi = λ + µ. For the embedded Markov chain, P01 = 1 since only arrivals are possible in state 0, and they increase the state to 1. In the other states, Pi,i−1 = µ/(λ+µ) and Pi,i+1 = λ/(λ+µ).
Figure 6.1: The Markov process for an M/M/1 queue. Each node i is labeled with its corresponding rate νi to the next transition, and each transition is labeled with the corresponding transition probability in the embedded Markov chain.
The embedded Markov chain is a birth-death Markov chain, and its steady-state probabilities can be calculated easily using (5.27). The result is
π0 = (1 − ρ)/2;   πn = ((1 − ρ²)/2) ρ^{n−1} for n ≥ 1,   where ρ = λ/µ.     (6.4)
Note that (6.4) requires ρ < 1, i.e., λ < µ, for the embedded chain to be positive recurrent. It is often more convenient to describe the process in terms of the transition rates qij = νi Pij rather than in terms of the νi and Pij separately. For the M/M/1 queue, qi,i+1 = λ is the arrival rate for each i ≥ 0 and, for i > 0, qi,i−1 = µ is the departure rate when there are customers to be served. Figure 6.2 shows Figure 6.1 incorporating this notational simplification.
Figure 6.2: The Markov process for an M/M/1 queue. Each transition (i, j) is labelled with the transition rate qij in the embedded Markov chain.
Note that the interarrival density for the Poisson process from i to a given j is qij exp(−qij x). On the other hand, given that the process is in state i, the probability density for the interval until the next arrival, whether conditioned on an arrival1 to j or not, is νi exp(−νi x).
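To make the description concrete, here is a small simulation sketch (an illustration, not part of the text) of a Markov process specified by rates {νi} and embedded transition probabilities {Pij}, applied to a truncated version of the M/M/1 example: the holding time in state i is drawn as an exponential of rate νi and the next state is drawn from row i of the embedded chain. The truncation level and rates below are assumed, illustrative values.

import random

def simulate_markov_process(nu, P, t_end=50_000.0, x0=0, seed=0):
    """Return the fraction of time spent in each state up to t_end.
    nu[i]: exit rate of state i; P[i]: embedded-chain transition probabilities from i."""
    random.seed(seed)
    n = len(nu)
    time_in = [0.0] * n
    x, t = x0, 0.0
    while t < t_end:
        hold = random.expovariate(nu[x])                  # exponential holding time with rate nu[x]
        time_in[x] += min(hold, t_end - t)
        t += hold
        x = random.choices(range(n), weights=P[x])[0]     # next state from the embedded chain
    return [u / t_end for u in time_in]

# Truncated M/M/1 example with lam = 1, mu = 2; state N-1 can only move down, at rate mu.
lam, mu, N = 1.0, 2.0, 30
nu = [lam] + [lam + mu] * (N - 2) + [mu]
P = [[0.0] * N for _ in range(N)]
P[0][1] = 1.0
for i in range(1, N - 1):
    P[i][i - 1], P[i][i + 1] = mu / (lam + mu), lam / (lam + mu)
P[N - 1][N - 2] = 1.0
print(simulate_markov_process(nu, P)[:5])   # roughly (1 - rho) * rho**i for small i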
6.1.1
The sampled-time approximation to a Markov process
As yet another way to visualize a Markov process, consider approximating the process by viewing it only at times separated by a given increment size δ. The Poisson processes above are then approximated by Bernoulli processes where the transition probability from i to j in the sampled-time chain is defined to be qij δ for all j ≠ i. The Markov process is then approximated by a Markov chain, and self-transition probabilities from i to i are required for those time increments in which no transition occurs. Thus, as illustrated in Figure 6.3, we have Pii = 1 − Σ_j qij δ = 1 − νi δ for each i. Note that this is an approximation to the Markov process in two ways. First, transitions occur only at integer multiples of the increment δ, and second, qij δ is an approximation to Pr{X(δ)=j | X(0)=i}. From (6.3), Pr{X(δ)=j | X(0)=i} = qij δ + o(δ), so this second approximation is increasingly good as δ → 0.
Since the transition probability from i to itself in this approximation is 1 − νi δ, we require that νi δ ≤ 1 for all i. For a finite state space, this is satisfied for any δ ≤ [maxi νi]^{−1}. For a countably infinite set of states, however, the sampled-time approximation requires the existence of some finite B such that νi ≤ B for all i.
The sampled-time Markov chain for the M/M/1 queue was analyzed in Section 5.5. Recall that this required a self-loop for each state to handle the probability of no transitions in a time increment. In that sampled-time model, the steady-state probability of state i is given by (1 − ρ)ρ^i where ρ = λ/µ. We will see that even though the sampled-time model contains several approximations, the resulting steady-state probabilities are exact.
1 This is the same paradoxical situation that arises whenever we view one Poisson process as the sum of several other Poisson processes. Perhaps the easiest way to understand this is with the M/M/1 example. Given an entry into state i > 0, customer arrivals occur at rate λ and departures with rate µ, but the state changes at rate λ + µ, and the epoch of this change is independent of whether it is caused by an arrival or departure.
Figure 6.3: Approximating a Markov process by its sampled-time Markov chain.
6.2
Steady-state behavior of irreducible Markov processes
As one might guess, the appropriate approach to exploring the steady-state behavior of Markov processes comes from applying renewal theory to various renewal processes associated with the Markov process. Many of the needed results for this have already been developed in looking at the steady-state behavior of countable-state Markov chains. We restrict our analysis to Markov processes for which the embedded Markov chain is irreducible, i.e., consists of a single class of states. Such Markov processes are themselves called irreducible. The reason for this restriction is not that Markov processes with multiple classes of states are unimportant, but rather that they can usually be best understood by looking at the embedded Markov chain and the various classes making up that chain.
Recall the following results about irreducible Markov chains from Theorems 5.4 and 5.2. An irreducible countable-state Markov chain with transition probabilities {Pij ; i, j ≥ 0} is positive recurrent if and only if there is a set of numbers {πi ; i ≥ 0} satisfying
πj = Σ_i πi Pij for all j;   πj ≥ 0 for all j;   Σ_j πj = 1.     (6.6)
If such a solution exists, it is unique and πj > 0 for all j. Furthermore, if such a solution exists (i.e., if the chain is positive recurrent), then for each i, j, a delayed renewal counting process {Nij(n)} exists counting the renewals into state j over the first n transitions of the chain, given an initial state X(0) = i. These processes each satisfy
lim_{t→∞} Nij(t)/t = πj     with probability 1     (6.7)
lim_{t→∞} E[Nij(t)/t] = πj.     (6.8)
Now consider a Markov process which has a positive recurrent embedded Markov chain and thus satisfies (6.6 - 6.8). When a transition in the embedded chain leads to state j, the time until the next transition is exponential with rate νj. Reasoning intuitively, we would expect the fraction of time the process spends in a state j to be proportional to πj (the fraction of transitions going to j), but also to be proportional to the expected holding time in state j, which is 1/νj. Since the fraction of time in different states must add up to 1, we would then hypothesize that the fraction of time pj spent in any given state j should satisfy
pj = (πj/νj) / Σ_i (πi/νi).     (6.9)
In fact, if we apply this formula to the embedded chain for the M/M/1 queue in (6.4), we find that pi = (1−ρ)ρ^i. This is the same result given by the sampled-time analysis of the M/M/1 queue. In other words, the self loops in the sampled-time model provide the same effect as the difference in holding times in the Markov process model.
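As a quick numerical check of this claim (a sketch, not part of the text), the embedded probabilities (6.4) and the rates ν0 = λ, νi = λ + µ can be plugged into (6.9) over a truncated range of states; the result should match (1 − ρ)ρ^i up to truncation error. The rates and truncation level below are assumed example values.

import numpy as np

lam, mu, N = 1.0, 2.0, 200          # example rates (assumed) and truncation level
rho = lam / mu

n = np.arange(N)
pi = np.where(n == 0, (1 - rho) / 2, (1 - rho**2) / 2 * rho**(n - 1))   # embedded chain, (6.4)
nu = np.where(n == 0, lam, lam + mu)                                    # nu_0 = lam, nu_i = lam + mu
p = (pi / nu) / (pi / nu).sum()                                         # (6.9), truncated

print(np.max(np.abs(p - (1 - rho) * rho**n)))   # should be essentially zero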
6.2.1
The number of transitions per unit time
We now turn to a careful general derivation of (6.9), but the most important issue is to understand what is meant by the fraction of time in a state (e.g., is there a strong law interpretation for pj such as that for πj in (6.7)?). There is also the question of what happens if Σ_i πi/νi = ∞.
Assume that the embedded Markov chain starts in some arbitrary state X0 = i. Then the time U1 until the next transition is exponential with rate νi. The interval U2 until the next following transition is a mixture of exponentials depending on X1, but it is a well-defined rv. In fact, for each n, Un is a rv with the distribution function
Pr{Un ≤ u} = Σ_j P_{ij}^{n−1} [1 − exp(−νj u)].
The epoch of the nth transition is Sn = U1 + · · · + Un, which is also a rv (i.e., it is finite with probability 1, so lim_{t→∞} Pr{Sn ≤ t} = 1), i.e., for all n, the nth transition eventually occurs with probability 1. Now define Mi(t) as the number of transitions (given X0 = i) that occur by time t. Note that {Mi(t); t ≥ 0} is not a renewal process (neither delayed nor ordinary), but it still satisfies the set identity {Mi(t) ≥ n} = {Sn ≤ t}. This then implies that lim_{t→∞} Pr{Mi(t) ≥ n} = 1 for all n, or in other words,2 lim_{t→∞} Mi(t) = ∞ with probability 1. Note that this result, which is stated in the following lemma, does not assume that the Markov process is irreducible or recurrent.
Lemma 6.1. Let Mi(t) be the number of transitions in the interval (0, t] of a Markov process starting with X0 = i. Then
lim_{t→∞} Mi(t) = ∞     with probability 1.     (6.10)
6.2.2
Renewals on successive entries to a state
For an irreducible Markov process with X0 = i, let Mij (t) be the number of transitions into state j over the interval (0, t]. We want to find when this is a delayed renewal counting process. It is clear that the sequence of epochs at which state j is entered form renewal points, since they form renewal points in the embedded Markov chain and the time intervals between transitions depend only on the current state. The questions are whether the first 2
2 To spell this out, consider the sample function Mi(t, ω) for any ω ∈ Ω. This is nondecreasing in t and thus either has a finite limit or goes to ∞. The set of ω for which this limit is at most n has probability 0, since lim_{t→∞} Pr{Mi(t) ≥ n} = 1, and thus the limit is ∞ with probability 1.
entry to state j must occur within some finite time, and then whether recurrences must occur within finite time. The following lemma answers these questions for the case where the embedded chain is recurrent (either positive recurrent or null recurrent).
Lemma 6.2. Consider a Markov process with an irreducible recurrent embedded chain {Xn ; n ≥ 0}. Given X0 = i, let {Mij(t); t ≥ 0} be the number of transitions into a given state j in the interval (0, t]. Then {Mij(t); t ≥ 0} is a delayed renewal counting process (or, if j = i, is an ordinary renewal counting process).
Proof: Given X0 = i, let Nij(n) be the number of transitions into state j that occur in the embedded Markov chain by the nth transition of the embedded chain. From Lemma 5.4, {Nij(n); n ≥ 0} is a delayed renewal process, and from Lemma 3.2, lim_{n→∞} Nij(n) = ∞ with probability 1. Then Mij(t) = Nij(Mi(t)) where Mi(t) is the total number of state transitions (between all states) in the interval (0, t]. Thus, with probability 1,
lim_{t→∞} Mij(t) = lim_{t→∞} Nij(Mi(t)) = lim_{n→∞} Nij(n) = ∞.
where we have used Lemma 6.1, which asserts that lim_{t→∞} Mi(t) = ∞ with probability 1.
It follows that the time W1 at which the first transition into state j occurs, and the subsequent interval W2 until the next transition to state j, are both finite with probability 1. Subsequent intervals have the same distribution as W2, and all intervals are independent, so {Mij(t); t ≥ 0} is a delayed renewal process with inter-renewal intervals {Wk ; k ≥ 1}. If i = j, then all Wk are identically distributed and we have an ordinary renewal process, completing the proof.
The inter-renewal intervals W2, W3, . . . for {Mij(t); t ≥ 0} above are well-defined nonnegative iid rv's whose distribution depends on j but not i. They either have an expectation as a finite number or can be regarded as having an infinite expectation. In either case, this expectation is denoted as E[W(j)] = W(j). This is the mean time between successive entries to state j, and we will see later that in some cases this mean time can be infinite.
In order to study the fraction of time spent in state j, we define a delayed renewal-reward process, based on {Mij(t); t ≥ 0}, for which unit reward is accumulated whenever the process is in state j (see Figure 6.4). If transition n−1 of the embedded chain enters state j, then the interval Un until the nth transition is exponential with rate νj, so E[Un | Xn−1 = j] = 1/νj. Define pj as the limiting time-average fraction of time spent in state j (if such a limit exists). Then, since U(j) = 1/νj, Theorems 3.6 and 3.12, for ordinary and delayed renewal-reward processes respectively, state that
pj = lim_{t→∞} (1/t) ∫_0^t Rj(τ) dτ = U(j)/W(j) = 1/(νj W(j))     with probability 1.     (6.11)
We can also investigate the limit, as t → ∞, of the probability that X(t) = j. This is equal to lim_{t→∞} E[R(t)] for the renewal-reward process above. Because of the exponential holding
Figure 6.4: The delayed renewal-reward process {Rj(t); t ≥ 0} for time in state j. The reward is one whenever the process is in state j. A renewal occurs on each entry to state j, so the reward starts at each such entry and continues until a state transition, assumed to enter a state other than j. The reward then ceases until the next renewal, i.e., the next entry to state j. The figure illustrates the kth inter-renewal interval, of duration Wk, which is assumed to start on the (n−1)st state transition. The expected interval over which a reward is accumulated is 1/νj and the expected duration of the inter-renewal interval is W(j).
times, the inter-renewal times are non-arithmetic, and thus from Blackwell's theorem, in the form of (3.77),
lim_{t→∞} Pr{X(t) = j} = 1/(νj W(j)).     (6.12)
We summarize these results in the following lemma.
Lemma 6.3. Consider an irreducible Markov process with a recurrent embedded Markov chain. Then with probability 1, the limiting time-average fraction of time in each state j is given by pj = 1/(νj W(j)). This is also the limit, as t → ∞, of Pr{X(t) = j}.
Next we must express the mean inter-renewal time, W(j), in terms of more accessible quantities. We continue to assume an irreducible process and a starting state X0 = i. From the strong law for delayed renewal processes (Theorem 3.9),
lim_{t→∞} Mij(t)/t = 1/W(j)     W.P. 1.     (6.13)
As before, Mij(t) = Nij(Mi(t)). Since lim_{t→∞} Mi(t) = ∞ with probability 1,
lim_{t→∞} Mij(t)/Mi(t) = lim_{t→∞} Nij(Mi(t))/Mi(t) = lim_{n→∞} Nij(n)/n = πj     W.P. 1.     (6.14)
Combining (6.13) and (6.14), the following equalities hold with probability 1.
1/W(j) = lim_{t→∞} Mij(t)/t
       = lim_{t→∞} [Mij(t)/Mi(t)] [Mi(t)/t]     (6.15)
       = πj lim_{t→∞} Mi(t)/t.     (6.16)
Substituting this in (6.11), we see that
pj = (πj/νj) lim_{t→∞} Mi(t)/t     W.P. 1.     (6.17)
This equation does not quite tell us what pj is, since we don't have a good expression for the time-average number of transitions per unit time, lim_{t→∞} Mi(t)/t. It does, however, tell us that this time-average converges to some particular value with probability 1, and also tells us (not surprisingly) that this value is independent of the starting state i. Finally, it tells us that if we can find the single value lim_{t→∞} Mi(t)/t (which is independent of i), then we have an expression for all pj in terms of πj, νj, and this single constant.3
6.2.3
The strong law for time-average state probabilities
In order to evaluate lim_{t→∞} Mi(t)/t, we modify the assumption that X0 = i for some arbitrary i to the assumption that the embedded Markov chain starts in steady state. The expected time until the first transition in the Markov process is then Σ_i πi/νi. The embedded Markov chain remains in steady state after each transition, and therefore the inter-transition times, say Un for the nth transition, are a sequence of IID rv's, each with mean Σ_i πi/νi. There is a renewal process, say M(t), associated with this sum of inter-renewal times, and by the strong law for renewal processes,
lim_{t→∞} M(t)/t = 1 / Σ_i πi/νi.
Now M(t) = Σ_i πi Mi(t), and (6.11) does not depend on i. Thus
pj = (πj/νj) lim_{t→∞} M(t)/t = (πj/νj) / Σ_i πi/νi     W.P. 1.     (6.18)
The following theorem summarizes these results.
Theorem 6.1. Consider an irreducible Markov process with a positive recurrent embedded Markov chain starting in any state X0 = i. Then, with probability 1, the limiting time-average fraction of time spent in an arbitrary state j is given by (6.18). Also pj = lim_{t→∞} Pr{X(t)=j}. Finally, the expected time between returns to state j is W(j) = (Σ_i πi/νi)/πj.
This has been a great deal of work to validate (6.18), which must seem almost intuitively obvious. However, the fact that these time averages are valid over all sample points with probability 1 is not obvious and the fact that πj W(j) is independent of j is certainly not obvious.
3 It is extremely tempting at this point to argue that Σ_j pj = 1. This reduces (6.17) to (6.9), and gives the correct answer in most cases. Mathematically, however, pj is defined as a time limit, and asserting that Σ_j pj = 1 involves interchanging a limit with a sum. This is not always valid, and we shortly find the conditions under which it is invalid for this problem.
The most subtle thing here, however, is that if Σ_i πi/νi = ∞, then pj = 0 for all states j. This is strange because the time-average state probabilities do not add to 1, and also strange because the embedded Markov chain continues to make transitions, and these transitions, in steady state, occur with the probabilities πi. We give an example of this phenomenon for a birth-death process in the next subsection. What is happening is that for very small t, the state is with high probability in the starting state, but as t increases, there is an increasing probability that X(t) is in a state with a large holding time. The effect of these long holding times builds up with increasing t. Since Σ_{i=1}^{k} πi/νi is finite for any integer k, it means that the probability of states larger than k is increasing with time. In the limit t → ∞, the state disperses over an infinite set of possibilities with increasingly small probabilities for each. To look at this in another way, the expected time to the next transition, starting in steady state for the embedded chain, is ∞. This is probably not a phenomenon that can be understood intuitively. Perhaps the best one can hope for is to see that it does not violate the things we know.
6.2.4
The equations for the steady state process probabilities
Let us now come back to the case where Σ_i πi/νi < ∞, which is the case of virtually all applications. We have seen that a Markov process can be specified in terms of the transition rates qij = νi Pij, and it is useful to express the steady state equations for pj directly in terms of qij rather than indirectly in terms of the embedded chain. First note from (6.18) that if Σ_i πi/νi < ∞, then Σ_j pj = 1, so that the limiting time averages behave as we would expect in all but the peculiar case where Σ_i πi/νi = ∞. We also note that (6.18) specifies the embedded steady state probabilities πi in terms of the pi. Since πi = pi νi α, where α is independent of i, we can use the normalization Σ_i πi = 1 to obtain
πi = pi νi / Σ_k pk νk.     (6.19)
We can substitute πi as given by (6.19) into (6.6), obtaining pj νj = Σ_i pi νi Pij for each state j. Since νi Pij = qij,
pj νj = Σ_i pi qij;   Σ_i pi = 1.     (6.20)
This set of equations is known as the steady state equations for the Markov process. The condition Σ_i pi = 1 has been added as a normalization condition. Equation (6.20) has a nice interpretation in that the term on the left is the steady state rate in time at which transitions occur out of state j and the term on the right is the rate in time at which transitions occur into state j. Since the total number of entries to j must differ by at most 1 from the exits from j for each sample path, this equation is not surprising.
We know that (6.6) has a unique solution with all πi > 0 if the embedded chain is positive recurrent, and thus (6.20) also has a unique solution with all pi > 0 under the conditions that the embedded chain is positive recurrent and Σ_i πi/νi < ∞. This final condition is not quite what we want, since we would like to solve (6.20) directly without worrying about the embedded chain.
If we find a solution to (6.20), however, and if Σ_i pi νi < ∞ in that solution, then the corresponding set of πi from (6.19) must satisfy (6.6) and be the unique steady state solution for the embedded chain. Thus the solution for pi must be the corresponding steady state solution for the Markov process. This is summarized in the following theorem.
Theorem 6.2. Assume an irreducible Markov process and let {pi ; i ≥ 0} be a solution to (6.20). If Σ_i pi νi < ∞, then, first, that solution is unique, second, each pi is positive, and third, the embedded Markov chain is positive recurrent with the steady state πi satisfying (6.19). Also, if the embedded chain is positive recurrent, and Σ_i πi/νi < ∞, then the set of pi satisfying (6.18) is the unique solution to (6.20).
6.2.5
The sampled-time approximation again
For an alternative view of the probabilities {pi}, consider the special case (but the typical case) where the transition rates {νi} are bounded. Consider the sampled-time approximation to the process for a given increment size δ ≤ [maxi νi]^{−1} (see Figure 6.3). Let {wi ; i ≥ 0} be the set of steady state probabilities for the sampled-time chain, assuming that they exist. These steady state probabilities satisfy
wj = Σ_{i≠j} wi qij δ + wj (1 − νj δ);   wj ≥ 0;   Σ_j wj = 1.     (6.21)
The first equation simplifies to wj νj = Σ_{i≠j} wi qij, which is the same as (6.20). It follows that the steady state probabilities {pi ; i ≥ 0} for the process are the same as the steady state probabilities {wi ; i ≥ 0} for the sampled-time approximation. Note that this is not an approximation; wi is exactly equal to pi for all values of δ ≤ 1/supi νi. We shall see later that the dynamics of a Markov process are not quite so well modeled by the sampled-time approximation except in the limit δ → 0.
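The equality of {pi} and {wi} is easy to check numerically. The sketch below is an illustration with made-up rates, not part of the text: it builds [Q] for a small chain, solves the process equations (6.20) directly, and compares the result with the stationary vector of the sampled-time matrix I + δ[Q] for a legal δ.

import numpy as np

# Example transition rates qij for a 3-state process (illustrative values).
Q = np.array([[-3.0,  2.0,  1.0],
              [ 1.0, -1.5,  0.5],
              [ 2.0,  2.0, -4.0]])     # off-diagonal entries: qij; diagonal entries: -nu_i

# Solve p[Q] = 0 with sum(p) = 1, which is exactly (6.20).
n = Q.shape[0]
A = np.vstack([Q.T, np.ones(n)])
b = np.zeros(n + 1); b[-1] = 1.0
p, *_ = np.linalg.lstsq(A, b, rcond=None)

# Sampled-time chain W = I + delta*Q for delta <= 1/max nu_i; find its stationary vector w.
delta = 1.0 / np.max(-np.diag(Q))
W = np.eye(n) + delta * Q
A2 = np.vstack([W.T - np.eye(n), np.ones(n)])
w, *_ = np.linalg.lstsq(A2, b, rcond=None)

print(p, w)    # the two vectors agree, as claimed in Section 6.2.5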
6.2.6
Pathological cases
The example in Figure 6.5 gives some insight into the case of positive recurrent embedded chains with Σ_i πi/νi = ∞. It models a variation of an M/M/1 queue in which the server becomes increasingly rattled and slow as the queue builds up, and the customers become almost equally discouraged about entering. The downward drift in the transitions is more than overcome by the slowdown in higher states. Transitions continue to occur, but the number of transitions per unit time goes to 0 with increasing time. Exercise 6.1 gives some added insight into this type of situation.
It is also possible for (6.20) to have a solution for {pi ; i ≥ 0} with Σ_i pi = 1, but Σ_i pi νi = ∞. This is not possible for a positive recurrent embedded chain, but is possible both if the embedded Markov chain is transient and if it is null recurrent. A transient chain means that there is a positive probability that the embedded chain will never return to a state after leaving it, and thus there can be no sensible kind of steady state behavior for the process. These processes are characterized by arbitrarily large transition rates from the
Figure 6.5: The Markov process for a variation on M/M/1 where arrivals and services get slower with increasing state. Each node i has a rate νi = 2^{−i}. The embedded chain transition probabilities are Pi,i+1 = 0.4 for i ≥ 1 and Pi,i−1 = 0.6 for i ≥ 1. Note that qi,i+1 > qi+1,i.
various states, and these allow the process to transit through an infinite number of states in a finite time. Processes for which there is a non-zero probability of passing through an infinite number of states in a finite time are called irregular. Exercises 6.4 and 6.5 give some insight into irregular processes. Exercise 6.6 gives an example of a process that is not irregular, but for which (6.20) has a solution with Σ_i pi = 1 and the embedded Markov chain is null recurrent. We restrict our attention in what follows to irreducible Markov chains for which (6.20) has a solution, Σ pi = 1, and Σ pi νi < ∞. This is slightly more restrictive than necessary, but processes for which Σ_i pi νi = ∞ (see Exercise 6.6) are not very robust.
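A numerical illustration of why {pi} fails to exist for this example (a sketch using only the rates and transition probabilities given in the caption of Figure 6.5): the embedded-chain probabilities πi decay like (2/3)^i, but νi = 2^{−i}, so the partial sums of πi/νi grow without bound.

import numpy as np

N = 60
nu = 2.0 ** (-np.arange(N))                 # nu_i = 2^{-i}, from the caption of Figure 6.5
# Embedded birth-death chain: P_{i,i+1} = 0.4, P_{i,i-1} = 0.6 for i >= 1, and P_{01} = 1.
pi = np.empty(N)
pi[0] = 1.0
pi[1] = pi[0] * 1.0 / 0.6                   # detailed balance: pi_0 P_{01} = pi_1 P_{10}
for i in range(1, N - 1):
    pi[i + 1] = pi[i] * 0.4 / 0.6           # pi_i P_{i,i+1} = pi_{i+1} P_{i+1,i}
pi /= pi.sum()                              # approximately normalized over the truncation

partial = np.cumsum(pi / nu)
print(partial[[10, 20, 40, 59]])            # the partial sums of pi_i/nu_i keep growing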
6.3
The Kolmogorov differential equations
Let Pij (t) be the probability that a Markov process is in state j at time t given that X(0) = i, Pij (t) = P {X(t)=j | X(0)=i}.
(6.22)
Pij(t) is analogous to the nth order transition probabilities Pij^n for Markov chains. We have already seen that lim_{t→∞} Pij(t) = pj for the case where the embedded chain is positive recurrent and Σ_i πi/νi < ∞. Here we want to find the transient behavior, and we start by deriving the Chapman-Kolmogorov equations for Markov processes. Let s and t be arbitrary times, 0 < s < t. By including the state at time s, we can rewrite (6.22) as
Pij(t) = Σ_k Pr{X(t)=j, X(s)=k | X(0)=i}
       = Σ_k Pr{X(s)=k | X(0)=i} Pr{X(t)=j | X(s)=k};   all i, j,     (6.23)
where we have used the Markov condition, (6.3). Given that X(s) = k, the residual time until the next transition after s is exponential with rate νk, and thus the process starting at time s in state k is statistically identical to that starting at time 0 in state k. Thus, for any s ≥ 0, we have P{X(t)=j | X(s)=k} = Pkj(t − s).
Substituting this into (6.23), we have the Chapman-Kolmogorov equations for a Markov process,
Pij(t) = Σ_k Pik(s) Pkj(t − s).     (6.24)
These equations correspond to (4.17) for Markov chains. We now use these equations to derive two types of sets of differential equations for finding Pij(t). The first type are called the Kolmogorov backward differential equations, and the second are called the Kolmogorov forward differential equations. The backward equations are obtained by letting s approach 0 from above, and the forward equations are obtained by letting s approach t from below.
First we derive the backward equations. For s small, the probability of two transitions in (0, s] is o(s) and the probability of a single transition into k is qik s + o(s). Thus Pik(s) = qik s + o(s) for k ≠ i. Now, given that X(0) = i, the state remains at i throughout the interval (0, s] with probability exp(−νi s) = 1 − νi s + o(s). The probability that the state changes in (0, s) and returns to i by time s is also of order o(s). Thus Pii(s) = 1 − νi s + o(s) and (6.24) becomes
Pij(t) = Σ_{k≠i} [qik s Pkj(t − s)] + (1 − νi s) Pij(t − s) + o(s).     (6.25)
Subtracting Pij(t − s) from both sides and dividing by s, we get
[Pij(t) − Pij(t − s)]/s = Σ_{k≠i} [qik Pkj(t − s)] − νi Pij(t − s) + o(s)/s.     (6.26)
Taking the limit as s → 0 (and assuming that the summation and limit can be interchanged), we get the Kolmogorov backward equations,
dPij(t)/dt = Σ_{k≠i} [qik Pkj(t)] − νi Pij(t).     (6.27)
P For interpretation, the right hand side of (6.27) can be rewritten as k6=i qik [Pkj (t)−Pij (t)]. We see that qik is the rate of moving from i to k in an initial increment of time. If such a transition occurs in the initial increment, the change in probability of ending in state j after an additional interval of length t is given by Pkj (t) − Pij (t). Thus the rate of change of Pij (t) can be interpreted as arising from the initial rates of transition from i to other states. The set of equations over i, j in (6.27) is a set of linear differential first order equations which, for any given j, must be solved simultaneously for all i. For a finite state space, this set of equations, for all i and j, can be expressed most compactly in matrix notation. Letting [P (t)] be a matrix for each t whose i, j element is Pij (t), and letting [Q] be a matrix whose i, j term is qij for i 6= j and −∫i for i = j, we have d[P (t)]/dt = [Q][P (t)] ;
t ≥ 0.
(6.28)
248
CHAPTER 6. MARKOV PROCESSES WITH COUNTABLE STATE SPACES
The initial conditions, at t = 0, are [P (t)] = I, where I is the identity matrix. The solution to (6.28), subject to these initial conditions, is 1 X tm [Q]m [P (t)] = . m! m=0
(6.29)
Equation (6.29) can be verified by substitution into (6.28). We shall return to (6.28) and (6.29) after deriving the Kolmogorov forward equations. For t − s small and positive, (6.24) becomes X [Pik (s)(t − s)qkj ] + Pij (s)[1 − (t − s)∫j ] + o(t − s) Pij (t) =
(6.30)
k6=j
Pij (t) − Pij (s) t−s
=
dPij (t) dt
=
X k6=j
X k6=j
[Pik (s)qkj − Pij (s)∫j
(6.31)
[Pik (t)qkj ] − Pij (t)∫j ,
(6.32)
(where we have again assumed that the summation and limit can be interchanged). These are the Kolmogorov forward equations; their interpretation is similar to, but slightly simpler than, that of the backward equations. The incremental change in Pij (t) is equal to the difference of two terms. The first is the sum over k of the probability of being in state k at time t and then moving to j in the final incremental interval; the second is the probability of being in state j at time t and then moving away in the final incremental interval. For the finite state case, (6.32), in matrix notation is d[P (t)]/dt = [P (t)][Q].
(6.33)
This equation also has the matrix solution in (6.29). Although (6.29) is a solution to these equations, it doesn’t provide much insight into the form of the solution. In order to provide more of this insight, we go back to the sampled time approximation. With an increment of size δ between samples, the probability of a transition from i to j, i 6= j, is qij δ, and the probability of remaining in state i is 1 − ∫i δ. Thus, in terms of the matrix [Q] of transition rates, the transition probability matrix in the sampled time model is I +δ[Q], where I is the identity matrix. We denote this matrix by [Wδ ] = I + δ[Q]. Note that ∏ is an eigenvalue of [Q] iff 1 + ∏δ is an eigenvalue of [Wδ ], and the eigenvectors of these matrices are the same. That is, if ∫(π) is a right (left) eigenvector of [Q] with eigenvalue ∏, then ∫(π) is a right (left) eigenvector of [Wδ ] with eigenvalue 1 + ∏δ, and conversely. We have already studied the eigenvalues and eigenvectors of transition matrices such as [Wδ ] in Section 4.4, and the following theorem translates some of these results into results about [Q]. Theorem 6.3. Consider an irreducible finite-state Markov process with n states. Then the matrix [Q] for that process has an eigenvalue ∏ equal to 0. That eigenvalue has a right eigenvector e = (1, 1, . . . , 1)T which is unique within a scale factor. It has a left eigenvector p = (p1 , . . . , pn ) that is positive, sums to 1, satisfies (6.20), and is unique within a scale factor. All the other eigenvalues of [Q] have strictly negative real parts.
6.3. THE KOLMOGOROV DIFFERENTIAL EQUATIONS
249
Proof: Since all n states communicate, the sampled time chain is recurrent. From corollary 4.3 to the Frobenius theorem (Theorem 4.6), [Wδ ] has an eigenvalue ∏ = 1; the corresponding right eigenvector is e; the left eigenvector is the steady state probability vector, which is positive. These eigenvectors are unique within a scale factor. From the equivalence of (6.20) and (6.21), p, as given by (6.20), is the steady state probability vector. Each eigenvalue ∏δ of [Wδ ] corresponds to an eigenvalue ∏ of [Q] with the correspondence ∏δ = 1 + ∏δ , i.e., ∏ = (∏δ − 1)/δ. Thus the eigenvalue 1 of [Wδ ] corresponds to the eigenvalue 0 of [Q]. Since |∏δ | ≤ 1 and ∏δ 6= 1 for all other eigenvalues, the other eigenvalues of [Q] all have strictly negative real parts, completing the proof. We continue to look at an irreducible n-state Markov process, and assume for simplicity T T that [Q] has n distinct eigenvalues, ∏1 , . . . , ∏n . Let ∫ 1 , ∫ 2 , . . . , ∫ n and π 1 , . . . , π n be the corresponding right and left eigenvectors respectively. Assume that these eigenvectors are T T scaled so that π i ∫ i = 1 for each i, and recall that π i ∫ j = 0 for i 6= j. Thus, if [V ] is defined as the matrix with columns ∫ 1 , ∫ 2 , . . . , ∫ n , then its inverse, [V ]−1 , is the matrix whose rows T T are π 1 , . . . , π n . Finally, let [Λ] be the diagonal matrix with elements ∏1 , . . . , ∏n . We then have the relationships [Λ] = [V ]−1 [Q][V ] ;
[Q] = [V ][Λ][V ]−1 .
(6.34)
We then also have [Q]m = [V ]−1 [Λ]m [V ], and thus (6.29) can be written as [P (t)] =
1 X
m=0
[V ]
tm [Λ]m [V ]−1 = [V ][etΛ ][V ]−1 , m!
(6.35)
where [etΛ ] is the diagonal matrix with elements et∏1 , et∏2 , . . . , , et∏n . Finally, if we break up the matrix [etΛ ] into a sum of n matrices, each with only a single non-zero element, then (6.35) becomes [P (t)] =
n X
T
∫ i et∏i π i .
(6.36)
i=1
Note that each term in (6.36) is a matrix formed by the product of a column vector ∫i and a row vector πi , scaled by et∏i . From Theorem 6.3, one of these eigenvalues, say ∏1 , is equal to 0, and thus et∏1 = 1 for all t. Each other eigenvalue ∏i has a negative real part, so et∏i goes to zero with increasing t for each of these terms. Thus the term for i = 1 in (6.36) contains the steady state solution, ep; this is a matrix for which each row is the steady state probability vector p. Another way to express the solution to these equations (for finite n) is by the use of Laplace transforms. Let Lij (s) be the Laplace transform of Pij (t) and let [L(s)] be the n by n matrix (for each s) of the elements Lij (s). Then the equation d[P (t)]/dt = [Q][P (t)] for t ≥ 0, along with the initial condition [P (0)] = I, becomes the Laplace transform equation [L(s)] = [sI − [Q]]−1 .
(6.37)
This appears to be simpler than (6.36), but it is really just more compact in notation. It can still be used, however, when [Q] has fewer than n eigenvectors.
250
6.4
CHAPTER 6. MARKOV PROCESSES WITH COUNTABLE STATE SPACES
Uniformization
Up until now, we have discussed Markov processes under the assumption that qii = 0 (i.e., no transitions from a state into itself are allowed). We now consider what happens if this restriction is removed. Suppose we start with some Markov process defined by a set of transition rates qij with qii = 0, and we modify this process by some arbitrary choice of qii ≥ 0 for eachP state i. ThisPmodification changes the embedded Markov chain, since ∫i is increased from k6=i qik to k6=i qik + qii . From (6.5), Pij is changed to qij /∫i for the new value of ∫i for each i, j. Thus the steady state probabilities πi for the embedded chain are changed. The Markov process {X(t); t ≥ 0} is not changed, since a transition from i into itself does not change X(t) and does not change the distribution of the time until the next transition to a different state. The steady state probabilities for the process still satisfy X X pj ∫j = pk qkj ; pi = 1. (6.38) i
k
The addition of the new term qjj increases ∫j by qjj , thus increasing the left hand side by pj qjj . The right hand side is similarly increased by pj qjj , so that the solution is unchanged (as we already determined it must be). A particularly convenient way to add self-transitions is to add them in such a way as to make the transition rate ∫j the same for all states. Assuming that the transition rates {∫i ; i ≥ 0} arePbounded, we define ∫ ∗ as supj ∫j for the original transition rates. Then we set qjj = ∫ ∗ − k6=j qjk for each j. With this addition of self-transitions, all transition rates become ∫ ∗ . From (6.19), we see that the new steady state probabilities, πi∗ , in the embedded Markov chain become equal to the steady state process probabilities, pi . Naturally, we could also choose any ∫ greater than ∫ ∗ and increase each qjj to make all transition rates equal to that value of ∫. When the transition rates are changed in this way, the resulting embedded chain is called a uniformized chain and the Markov process is called the uniformized process. The uniformized process is the same as the original process, except that quantities like the number of transitions over some interval are different because of the self transitions. Assuming that all transition rates are made equal to ∫ ∗ , the new transition probabilities in the embedded chain become Pij∗ = qij /∫ ∗ . Let N (t) be the total number of transitions that occur from 0 to t in the uniformized process. Since the rate of transitions is the same from all states and the inter-transition intervals are independent and identically exponentially distributed, N (t) is a Poisson counting process of rate ∫ ∗ . Also, N (t) is independent of the sequence of transitions in the embedded uniformized Markov chain. Thus, given that N (t) = n, the probability that X(t) = j given that X(0) = i is just the probability that the embedded chain goes from i to j in ∫ steps, i.e., Pij∗n . This gives us another formula for calculating Pij (t), (i.e., the probability that X(t) = j given that X(0) = i). Pij (t) =
1 X
n=0
Pij∗n
∗
e−∫ t (∫ ∗ t)n . n!
(6.39)
Another situation where the uniformized process is useful is in extending Markov decision theory to Markov processes, but we do not pursue this.
6.5. BIRTH-DEATH PROCESSES
6.5
251
Birth-death processes
Birth-death processes are very similar to the birth-death Markov chains that we studied earlier. Here transitions occur only between neighboring states, so it is convenient to define ∏i as qi,i+1 and µi as qi,i−1 (see Figure 6.6). Since the number of transitions from i to i + 1 is within 1 of the number of transitions from i + 1 to i for every sample path, we conclude that pi ∏i = pi+1 µi+1 .
(6.40)
This can also be obtained inductively from (6.20) using the same argument that we used earlier for birth-death Markov chains. ✿ 0♥ ✘ ②
∏0 µ1
③ ♥ 1 ②
③ ♥ 2 ②
∏1 µ2
③ ♥ 3 ②
∏2 µ3
∏3 µ4
③ ♥ 4
...
Figure 6.6: Birth-death process.
Define ρi as ∏i /µi+1 . Then applying (6.40) iteratively, we obtain the steady state equations pi = p0
i−1 Y
ρj ;
i ≥ 1.
j=0
We can solve for p0 by substituting (6.41) into p0 =
1+
P1
P
i
1
i=1
(6.41)
pi , yielding
Qi−1
j=0 ρj
.
(6.42)
For the M/M/1 queue, the state of the Markov process is the number of customers in the system (i.e., waiting in queue or in service). The transitions from i to i + 1 correspond to arrivals, and since the arrival process is Poisson of rate ∏, we have ∏i = ∏ for all i ≥ 0. The transitions from i to i − 1 correspond to departures, and since the service time distribution is exponential with parameter µ, say, we have µi = µ for all i ≥ 1. Thus, (6.42) simplifies to p0 = 1 − ρ, where ρ = ∏/µ and thus pi = (1 − ρ)ρi ;
i ≥ 0.
(6.43)
We assume that ρ < 1, which is required for positive recurrence. The probability that there are i or more customers in the system in steady state is then given by P {X(t) ≥ i} = ρi and the expected number of customers in the system is given by E [X(t)] =
1 X i=1
P {X(t) ≥ i} =
ρ . 1−ρ
(6.44)
252
CHAPTER 6. MARKOV PROCESSES WITH COUNTABLE STATE SPACES
The expected time that a customer spends in the system in steady state can now be determined by Little’s formula (Theorem 3.8). E [System time] =
E [X(t)] ρ 1 = = . ∏ ∏(1 − ρ) µ−∏
(6.45)
The expected time that a customer spends in the queue (i.e., before entering service) is just the expected system time less the expected service time, so E [Queueing time] =
1 ρ 1 − = . µ−∏ µ µ−∏
(6.46)
Finally, the expected number of customers in the queue can be found by applying Little’s formula to (6.46), E [Number in queue] =
∏ρ . µ−∏
(6.47)
Note that the expected number of customers in the system and in the queue depend only on ρ, so that if the arrival rate and service rate were both speeded up by the same factor, these expected values would remain the same. The expected system time and queueing time, however would decrease by the factor of the rate increases. Note also that as ρ approaches 1, all these quantities approach infinity as 1/(1 − ρ). At the value ρ = 1, the embedded Markov chain becomes null-recurrent and the steady state probabilities (both {πi ; i ≥ 0} and {pi ; i ≥ 0}) can be viewed as being all 0 or as failing to exist. There are many types of queueing systems that can be modeled as birth-death processes. For example the arrival rate could vary with the number in the system and the service rate could vary with the number in the system. All of these systems can be analyzed in steady state in the same way, but (6.41) and (6.42) can become quite messy in these more complex systems. As an example, we analyze the M/M/m system. Here there are m servers, each with exponentially distributed service times with parameter µ. When i customers are in the system, there are i servers working for i < m and all m servers are working for i ≥ m. With i servers working, the probability of a departure in an incremental time δ is iµδ, so that µi is iµ for i < m and mµ for i ≥ m (see Figure 6.7). Define ρ = ∏/(mµ). Then in terms of our general birth-death process notation, ρi = mρ/(i + 1) for i < m and ρi = ρ for i ≥ m. From (6.41), we have mρ mρ mρ p0 (mρ)i ··· = ; 1 2 i i! p0 ρi mm ; i ≥ m. m!
pi = p0 pi =
i≤m
(6.48) (6.49)
We can find p0 by summing pi and setting the result P equal to 1; a solution exists if ρ < 1. Nothing simplifies much in this sum, except that i≥m pi = p0 (ρm)m /[m!(1 − ρ)], and the solution is " #−1 m−1 X (mρ)i (mρ)m p0 = . (6.50) + m!(1 − ρ) i! i=0
6.6. REVERSIBILITY FOR MARKOV PROCESSES
0♥ ②
∏ µ
③ ♥ 1 ②
∏ 2µ
③ ♥ 2 ②
∏ 3µ
253
③ ♥ 3 ②
∏ 3µ
③ ♥ 4
...
Figure 6.7: M/M/m queue for m = 3..
6.6
Reversibility for Markov processes
In Section 5.4 on reversibility for Markov chains, (5.37) showed that the backward transition probabilities Pij∗ in steady state satisfy πi Pij∗ = πj Pji .
(6.51)
These equations are then valid for the embedded chain of a Markov process. Next, consider backward transitions in the process itself. Given that the process is in state i, the probability of a transition in an increment δ of time is ∫i δ+o(δ), and transitions in successive increments are independent. Thus, if we view the process running backward in time, the probability of a transition in each increment δ of time is also ∫i δ + o(δ) with independence between increments. Thus, going to the limit δ → 0, the distribution of the time backward to a transition is exponential with parameter ∫i . This means that the process running backwards is again a Markov process with transition probabilities Pij∗ and transition rates ∫i . Figure 6.8 helps to illustrate this. ✛
State i
✲✛
State j, rate ∫j
t1
✲✛
State k
✲
t2
Figure 6.8: The forward process enters state j at time t1 and departs at t2 . The backward process enters state j at time t2 and departs at t1 . In any sample function, as illustrated, the interval in a given state is the same in the forward and backward process. Given X(t) = j, the time forward to the next transition and the time backward to the previous transition are each exponential with rate ∫j . Since the steady state probabilities {pi ; i ≥ 0} for the Markov process are determined by πi /∫i , k πk /∫k
pi = P
(6.52)
and since {πi ; i ≥ 0} and {∫i ; i ≥ 0} are the same for the forward and backward processes, we see that the steady state probabilities in the backward Markov process are the same as the steady state probabilities in the forward process. This result can also be seen by the correspondence between sample functions in the forward and backward processes. ∗ = ∫ P ∗ . Using (6.51), we The transition rates in the backward process are defined by qij i ij
254
CHAPTER 6. MARKOV PROCESSES WITH COUNTABLE STATE SPACES
have ∗ qij = ∫j Pij∗ =
∫i πi Pji ∫i πj qji = . πi πi ∫j
(6.53)
From (6.52), we note that pj = απj /∫j and pi = απi /∫i for the same value of α. Thus the ∗ = p q /p , and ratio of πj /∫j to πi /∫i is pj /pi . This simplifies (6.53) to qij j ji i ∗ = pj qji . pi qij
(6.54)
This equation can be used as an alternate definition of the backward transition rates. To interpret this, let δ be a vanishingly small increment of time and assume the process is in steady state at time t. Then δpj qji ≈ Pr{X(t) = j} Pr{X(t + δ) = i | X(t) = j} whereas ∗ ≈ Pr{X(t + δ) = i} Pr{X(t) = j | X(t + δ) = i}. δpi qij
∗ = q for all i, j. If the embedded Markov A Markov process is defined to be reversible if qij ij chain is reversible, (i.e., Pij∗ = Pij for all i, j), then one can repeat the above steps using Pij ∗ to see that p q = p q for all i, j. Thus, if the embedded and qij in place of Pij∗ and qij i ij j ji chain is reversible, the process is also. Similarly, if the Markov process is reversible, the above argument can be reversed to see that the embedded chain is reversible. Thus, we have the following useful lemma.
Lemma 6.4. Assume that steady state P probabilities {pi ; i≥0} exist in an irreducible Markov process (i.e., (6.20) has a solution and pi ∫i < 1). Then the Markov process is reversible if and only if the embedded chain is reversible. One can find the steady state probabilities of a reversible Markov process and simultaneously show that it is reversible by the following useful theorem (which is directly analogous to Theorem 5.6 of chapter 5). Theorem 6.4. For an irreducible Markov process, assume that {pi ; i ≥ 0} is a set of nonP negative numbers summing to 1, satisfying i pi ∫i ≤ 1, and satisfying pi qij = pj qji
for all i, j.
(6.55)
Then {pi ; i ≥ 0} is the set of steady state probabilities for the process, pi > 0 for all i, the process is reversible, and the embedded chain is positive recurrent. Proof: Summing (6.55) over i, we obtain X pi qij = pj ∫j
for all j.
i
P
These, along with i pi = 1 are the steady state equations for the process. These equations have a solution, and by Theorem 6.2, pi > 0 for all i, the embedded chain is positive recurrent, and pi = limt→1 Pr{X(t) = i}. Comparing (6.55) with (6.54), we see that ∗ , so the process is reversible. qij = qij There are many irreducible Markov processes that are not reversible but for which the backward process has interesting properties that can be deduced, at least intuitively, from
6.6. REVERSIBILITY FOR MARKOV PROCESSES
255
the forward process. Jackson networks (to be studied shortly) and many more complex networks of queues fall into this category. The following simple theorem allows us to use whatever combination of intuitive reasoning and wishful thinking we desire to guess both ∗ in the backward process and the steady state probabilities, and to the transition rates qij then verify rigorously that the guess is correct. One might think that guessing is somehow unscientific, but in fact, the art of educated guessing and intuitive reasoning is at the heart of all good scientific work. Theorem 6.5. ForPan irreducible P Markov process, assume that a set of positive numbers {pi ; i ≥ 0} satisfy i pi = 1 and i pi ∫i < 1. Also assume that a set of non-negative ∗ } satisfy the two sets of equations numbers {qij X
qij =
j
X
∗ qij
for all i
(6.56)
for all i, j.
(6.57)
j
∗ pi qij = pj qji
Then {pi } is the set of steady state probabilities for the process, pi > 0 for all i, the embedded ∗ } is the set of transition rates in the backward process. chain is positive recurrent, and {qij Proof: Sum (6.57) over i. Using the fact that X
pi qij = pj ∫j
P
j qij
= ∫i and using (6.56), we obtain
for allj.
(6.58)
i
P These, along with i pi = 1, are the steady state equations for the process. These equations thus have a solution, and by Theorem 6.2, pi > 0 for all i, the embedded chain is positive ∗ as given by (6.57) is the backward recurrent, and pi = limt→1 Pr{X(t) = i}. Finally, qij transition rate as given by (6.54) for all i, j. ∗ We see that Theorem 6.4 is just a special case of Theorem 6.5 in which the guess about qij ∗ is that qij = qij .
Birth-death processes are all reversible if the steady state probabilities exist. To see this, note that Equation (6.40) (the equation to find the steady state probabilities) is just (6.55) applied to the special case of birth-death processes. Due to the importance of this, we state it as a theorem. Theorem 6.6. For P P a birth-death process, if there is a solution {pi ; i ≥ 0} to (6.40) with p = 1 and i i i pi ∫i < 1, then the process is reversible, and the embedded chain is positive recurrent and reversible. Since the M/M/1 queueing process is a birth-death process, it is also reversible. Burke’s theorem, which was given as Theorem 5.8 for sampled-time M/M/1 queues, can now be established for continuous-time M/M/1 queues. Note that the theorem here contains an extra part, part c). Theorem 6.7 (Burke’s theorem). Given an M/M/1 queueing system in steady state with ∏ < µ,
256
CHAPTER 6. MARKOV PROCESSES WITH COUNTABLE STATE SPACES
a) the departure process is Poisson with rate ∏, b) the state X(t) at any time t is independent of departures prior to t, and c) for FCFS service, given that a customer departs at time t, the arrival time of that customer is independent of the departures prior to t. Proof: The proofs of parts a) and b) are the same as the proof of Burke’s theorem for sampled-time, Theorem 5.8, and thus will not be repeated. For part c), note that with FCFS service, the mth customer to arrive at the system is also the mth customer to depart. Figure 6.9 illustrates that the association between arrivals and departures is the same in the backward system as in the forward system (even though the customer ordering is reversed in the backward system). In the forward, right moving system, let τ be the epoch of some given arrival. The customers arriving after τ wait behind the given arrival in the queue, and have no effect on the given customer’s service. Thus the interval from τ to the given customer’s service completion is independent of arrivals after τ . r a1
d4 r
r a2
r a3
r d3 a4 r Right moving (forward) M/M/1 process d3 r
d1 r
r a4
d2 r
d2 r
r a3
r a2
d1 r Left moving (backward) M/M/1 process
d4 r r a1
Figure 6.9: FCFS arrivals and departures in right and left moving M/M/1 processes. Since the backward, left moving, system is also an M/M/1 queue, the interval from a given backward arrival, say at epoch t, moving left until the corresponding departure, is independent of arrivals to the left of t. From the correspondence between sample functions in the right moving and left moving systems, given a departure at epoch t in the right moving system, the departures before time t are independent of the arrival epoch of the given customer departing at t; this completes the proof. Part c) of Burke’s theorem does not apply to sampled-time M/M/1 queues because the sampled time model does not allow for both an arrival and departure in the same increment of time. Note that the proof of Burke’s theorem (including parts a and b from Section 5.5) does not make use of the fact that the transition rate qi,i−1 = µ for i ≥ 1 in the M/M/1 queue. Thus Burke’s theorem remains true for any birth death Markov process in steady state for which qi,i+1 = ∏ for all i ≥ 0. For example, parts a and b are valid for M/M/m queues; part c is also valid (see [Wol89]), but the argument here is not adequate since the first customer to enter the system might not be the first to depart.
6.6. REVERSIBILITY FOR MARKOV PROCESSES
257
We next show how Burke’s theorem can be used to analyze a tandem set of queues. As shown in Figure 6.10, we have an M/M/1 queueing system with Poisson arrivals at rate ∏ and service at rate µ1 . The departures from this queueing system are the arrivals to a second queueing system, and we assume that a departure from queue 1 at time t instantaneously enters queueing system 2 at the same time t. The second queueing system has a single server and the service times are IID and exponentially distributed with rate µ2 . The successive service times at system 2 are also independent of the arrivals to systems 1 and 2, and independent of the service times in system 1. Since we have already seen that the departures from the first system are Poisson with rate ∏, the arrivals to the second queue are Poisson with rate ∏. Thus the second system is also M/M/1. M/M/1
M/M/1 ∏ ✲
∏ ✲ µ1
∏ ✲ µ2
Figure 6.10: A tandem queueing system. Assuming that ∏ > µ1 and ∏ > µ2 , the departures from each queue are Poisson of rate ∏.
Let X(t) be the state of queueing system 1 and Y (t) be the state of queueing system 2. Since X(t) at time t is independent of the departures from system 1 prior to t, X(t) is independent of the arrivals to system 2 prior to time t. Since Y (t) depends only on the arrivals to system 2 prior to t and on the service times that have been completed prior to t, we see that X(t) is independent of Y (t). This leaves a slight nit-picking question about what happens at the instant of a departure from system 1. We have considered the state X(t) at the instant of a departure to be the number of customers remaining in system 1 not counting the departing customer. Also the state Y (t) is the state in system 2 including the new arrival at instant t. The state X(t) then is independent of the departures up to and including t, so that X(t) and Y (t) are still independent. Next assume that both systems use FCFS service. Consider a customer that leaves system 1 at time t. The time at which that customer arrived at system 1, and thus the waiting time in system 1 for that customer, is independent of the departures prior to t. This means that the state of system 2 immediately before the given customer arrives at time t is independent of the time the customer spent in system 1. It therefore follows that the time that the customer spends in system 2 is independent of the time spent in system 1. Thus the total system time that a customer spends in both system 1 and system 2 is the sum of two independent random variables. This same argument can be applied to more than 2 queueing systems in tandem. It can also be applied to more general networks of queues, each with single servers with exponentially distributed service times. The restriction here is that there can not be any cycle of queueing systems where departures from each queue in the cycle can enter the next queue in the cycle. The problem posed by such cycles can be seen easily in the following example of a single queueing system with feedback (see Figure 6.11).
258
CHAPTER 6. MARKOV PROCESSES WITH COUNTABLE STATE SPACES
M/M/1 ∏ ✻
Q
✲
∏ ✲
1−Q µ
Figure 6.11: A queue with feedback. Assuming that µ > ∏/Q, the exogenous output is Poisson of rate ∏. We assume that the queueing system in Figure 6.11 has a single server with IID exponentially distributed service times that are independent of arrival times. The exogenous arrivals from outside the system are Poisson with rate ∏. With probability Q, the departures from the queue leave the entire system, and, alternatively, with probability 1 − Q, they return instantaneously to the input of the queue. Successive choices between leaving the system and returning to the input are IID and independent of exogenous arrivals and of service times. Figure 6.12 shows a sample function of the arrivals and departures in the case in which the service rate µ is very much greater than the exogenous arrival rate ∏. Each exogenous arrival spawns a geometrically distributed set of departures and simultaneous re-entries. Thus the overall arrival process to the queue, counting both exogenous arrivals and feedback from the output, is not Poisson. Note, however, that if we look at the Markov process description, the departures that are fed back to the input correspond to self loops from one state to itself. Thus the Markov process is the same as one without the self loops with a service rate equal to µQ. Thus, from Burke’s theorem, the exogenous departures are Poisson with rate ∏. Also the steady state distribution of X(t) is P {X(t) = i} = (1 − ρ)ρi where ρ = ∏/(µQ) (assuming, of course, that ρ < 1). exogenous arrival
✄re-entries ° ✄°
❄ ❄ ✂❄✁
❄ ✂❄✁
endogenous departures
❄ ❄
endogenous departures
✄°
❄ ✂❄✁
❄
Figure 6.12: Sample path of arrivals and departures for queue with feedback. The tandem queueing system of Figure 6.10 can also be regarded as a combined Markov process in which the state at time t is the pair (X(t), Y (t)). The transitions in this process correspond to, first, exogenous arrivals in which X(t) increases, second, exogenous departures in which Y (t) decreases, and third, transfers from system 1 to system 2 in which X(t) decreases and Y (t) simultaneously increases. The combined process is not reversible since there is no transition in which X(t) increases and Y (t) simultaneously decreases. In the next section, we show how to analyze these combined Markov processes for more general
6.7. JACKSON NETWORKS
259
networks of queues.
6.7
Jackson networks
In many queueing situations, a customer has to wait in a number of different queues before completing the desired transaction and leaving the system. For example, when we go to the registry of motor vehicles to get a driver’s license, we must wait in one queue to have the application processed, in another queue to pay for the license, and in yet a third queue to obtain a photograph for the license. In a multiprocessor computer facility, a job can be queued waiting for service at one processor, then go to wait for another processor, and so forth; frequently the same processor is visited several times before the job is completed. In a data network, packets traverse multiple intermediate nodes; at each node they enter a queue waiting for transmission to other nodes. Such systems are modeled by a network of queues, and Jackson networks are perhaps the simplest models of such networks. In such a model, we have a network of k interconnected queueing systems which we call nodes. Each of the k nodes receives customers (i.e., tasks or jobs) both from outside the network (exogenous inputs) and from other nodes within the network (endogenous inputs). It is assumed that the exogenous inputs to each node i form a Poisson process of rate ri and that these Poisson processes are independent of each other. For analytical convenience, we regard this as a single Poisson input process of rate ∏0 , with each input independently going to each node i with probability Q0i = ri /∏0i . ∏0 Q02
❅ ❅ ❘✓✏ ❅ Q11 ✿ 1 ✐ ✘ ✒✑ ° ■ ° ✠ ° Q 10
Q13
Q12 Q21 Q31
Q32
✓✏ ❘ 3 ✙ ✒✑ ✒ ❖❅ ° ° ❅ ❘ Q30 ❅ ° ∏0 Q03
°∏0 Q02 ° ✓✏ ✠ ° q 2 ② Q22 ✯✒✑ ❅ ❅ ❘ Q20 ❅ Q23
Q33
Figure 6.13: A Jackson network with 3 nodes. Given a departure from node i, the probability that departure goes to node j (or, for j = 0, departs the system) is Qij . Note that a departure from node i can re-enter node i with probability Qii . The overall exogenous arrival rate is ∏0 , and, conditional on an arrival, the probability the arrival enters node i is Q0i . Each node i contains a single server, and the successive service times at node i are IID random variables with an exponentially distributed service time of rate µi . The service times at each node are also independent of the service times at all other nodes and independent of the exogenous arrival times at all nodes. When a customer completes service at a given
260
CHAPTER 6. MARKOV PROCESSES WITH COUNTABLE STATE SPACES
node i, that customer is routed to node j with probability Qij (see Figure 6.13). It is also possible for the customer to depart from the network entirely (called an exogenous P departure), and this occurs with probability Qi0 = 1 − j≥1 Qij . For a customer departing from node i, the next node j is a random variable with PMF {Qij , 0 ≤ j ≤ k}. Successive choices of the next node for customers at node i are IID, independent of the customer routing at other nodes, independent of all service times, and independent of the exogenous inputs. Notationally, we are regarding the outside world as a fictitious node 0 from which customers appear and to which they disappear. When a customer is routed from node i to node j, it is assumed that the routing is instantaneous; thus at the epoch of a departure from node i, there is a simultaneous endogenous arrival at node j. Thus a node j receives Poisson exogenous arrivals from outside the system at rate ∏0 Q0j and receives endogenous arrivals from other nodes according to the probabilistic rules just described. We can visualize these combined exogenous and endogenous arrivals as being served in FCFS fashion, but it really makes no difference in which order they are served, since the customers are statistically identical and simply give rise to service at node j at rate µj whenever there are customers to be served. The Jackson queueing network, as just defined, is fully described by the exogenous input rate ∏0 , the service rates {µi }, and the routing probabilities {Qij ; 0 ≤ i, j ≤ k}. The network as a whole is a Markov process in which the state is a vector m = (m1 , m2 , . . . , mk ), where mi , 1 ≤ i ≤ k, is the number of customers at node i. State changes occur upon exogenous arrivals to the various nodes, exogenous departures from the various nodes, and departures from one node that enter another node. In a vanishingly small interval δ of time, given that the state at the beginning of that interval is m, an exogenous arrival at node j occurs in the interval with probability ∏0 Q0j δ and changes the state to m 0 = m + e j where e j is a unit vector with a one in position j. If mi > 0, an exogenous departure from node i occurs in the interval with probability µi Qi0 δ and changes the state to m 0 = m − e i . Finally, if mi > 0, a departure from node i entering node j occurs in the interval with probability µi Qij δ and changes the state to m 0 = m − e i + e j . Thus, the transition rates are given by qm,m 0 = ∏0 Q0j = µi Qi0 = µi Qij = 0
for m 0 = m + e j , 0
for m = m − m i ,
1≤i≤k mi > 0,
0
(6.59) 1≤i≤k
for m = m − e i + e j , mi > 0, 1 ≤ i, j ≤ k
(6.60) (6.61)
0
for all other choices of m .
Note that a departure from node i that re-enters node i causes a transition from state m back into state m; we disallowed such transitions in sections 6.1 and 6.2, but showed that they caused no problems in our discussion of uniformization. It is convenient to allow these self transitions here, partly for the added generality and partly to illustrate that the single node network with feedback of Figure 6.11 is an example of a Jackson network. Our objective is to find the steady state probabilities p(m) for this type of process, and our plan of attack is in accordance with Theorem 6.5; that is, we shall guess a set of transition rates for the backward Markov process, use these to guess p(m), and then verify that the guesses are correct. Before making these guesses, however, we must find out a little more about how the system works, so as to guide the guesswork. Let us define ∏i for each i,
6.7. JACKSON NETWORKS
261
1 ≤ i ≤ k, as the time average overall rate of arrivals to node i, including both exogenous and endogenous arrivals. Since ∏0 is the rate of exogenous inputs, we can interpret ∏i /∏0 as the expected number of visits to node i per exogenous input. The endogenous arrivals to node i are not necessarily Poisson, as the example of a single queue with feedback shows, and we are not even sure at this point that such a time-average rate exists in any reasonable sense. However, let us assume for the time being that such rates exist and that the timeaverage rate of departures from each node equals the time-average rate of arrivals (i.e., the queue sizes do not grow linearly with time). Then these rates must satisfy the equation ∏j =
k X j=0
∏i Qij ;
1 ≤ j ≤ k.
(6.62)
To see this, note that ∏0 Q0j is the rate of exogenous arrivals to j. Also ∏i is the time-average rate at which customers depart from queue i, and ∏i Qij is the rate at which customers go from node i to node j. Thus, the right hand side of (6.62) is the sum of the exogenous and endogenous arrival rates to node j. Note the distinction between the time-average rate of customers going from i to j in (6.62) and the rate qm,m 0 = µi Qij for m 0 = m −e i +e j , mi > 0 in (6.61). The rate in (6.61) is conditioned on a state m with mi > 0, whereas that in (6.62) is the overall time-average rate, averaged over all states. Note that {Qij ; 0 ≤ i, j ≤ k} forms a stochastic matrix and (6.62) is formally equivalent to the equations for steady state probabilities (except that steady state probabilities sum to 1). The usual equations for steady state probabilities include an equation for j = 0, but that equation is redundant. Thus we know that, if there is a path between each pair of nodes (including the fictitious node 0), then (6.62) has a solution for {∏i ; 0 ≤ i ≤ k, and that solution is unique within a scale factor. The known value of ∏0 determines this scale factor and makes the solution unique. Note that we don’t have to be careful at this point about whether these rates are time averages in any nice sense, since this will be verified later; we do have to make sure that (6.62) has a solution, however, since it will appear in our solution for p(m). Thus we assume in what follows that a path exists between each pair of nodes, and thus that (6.62) has a unique solution as a function of ∏0 . We now make the final necessary assumption about the network, which is that µi > ∏i for each node i. This will turn out to be required in order to make the process positive recurrent. We also define ρi as ∏i µi . We shall find that, even though the inputs to an individual node i are not Poisson in general, there is a steady state distribution for the number of customers at i, and that distribution is the same as that of an M/M/1 queue with the parameter ρi . Now consider the backward time process. We have seen that only three kinds of transitions are possible in the forward process. First, there are transitions from m to m 0 = m + e j for any j, 1 ≤ j ≤ k. Second, there are transitions from m to m − e i for any i, 1 ≥ i ≤ k, such that mi > 0. Third, there are transitions from m to m 0 = m − e i + e j for 1 ≤ i, j ≤ k with mi > 0. Thus in the backward process, transitions from m 0 to m are possible only for the m, m 0 pairs above. Corresponding to each arrival in the forward process, there is a departure in the backward process; for each forward departure, there is a backward arrival; and for each forward passage from i to j, there is a backward passage from j to i.
262
CHAPTER 6. MARKOV PROCESSES WITH COUNTABLE STATE SPACES
We now make the conjecture that the backward process is itself a Jackson network with Poisson exogenous arrivals at rates {∏0 Q∗0j }, service times that are exponential with rates {µi }, and routing probabilities {Q∗ij }. The backward routing probabilities {Q∗ij } must be chosen to be consistent with the transition rates in the forward process. Since each transition from i to j in the forward process must correspond to a transition from j to i in the backward process, we should have ∏i Qij = ∏j Q∗ji
;
0 ≤ i, j ≤ k.
(6.63)
Note that ∏i Qij represents the rate at which forward transitions go from i to j, and ∏i represents the rate at which forward transitions leave node i. Equation (6.63) takes advantage of the fact that ∏i is also the rate at which forward transitions enter node i, and thus the rate at which backward transitions leave node i. Using the conjecture that the backward time system is a Jackson network with routing probabilities {Q∗ij ; 0 ≤ i, j ≤ k}, we can write down the backward transition rates in the same way as (6.59-6.61), ∗ qm,m = ∏0 Q∗0j 0
= =
for m 0 = m + e j
µi Q∗i0 µi Q∗ij
(6.64)
0
(6.65)
0
(6.66)
for m = m − e i , mi > 0, 1 ≤ i ≤ k
for m = m − e i + e j , m − i > 0, 1 ≤ i, j ≤ k.
If we substitute (6.63) into (6.64)-(6.66), we obtain ∗ qm,m = ∏j Qj0 0
for m 0 = m + e j ,
= (µi /∏i )∏0 Q0i = (µi /∏i )∏j Qji
0
1≤j≤k
for m = m − e i , mi > 0, 0
for m = m − e i + e j ,
(6.67)
1≤i≤k
(6.68)
mi > 0, 1 ≤ i, j ≤ k. (6.69)
This gives us our hypothesized backward transition rates in terms of the parameters of the original Jackson network. To we must verify that there is a set of positive P use theorem 6.5, P numbers, p(m), satisfying m p(m) = 1 and m ∫m pm < 1, and a set of non-negative ∗ numbers qm 0 ,m satisfying the following two sets of equations: ∗ p(m)qm,m 0 = p(m 0 )qm 0 ,m X X ∗ qm,m 0 = qm,m 0 m
for all m, m 0
(6.70)
for all m.
(6.71)
m0
We verify (6.70) by substituting (6.59)-(6.61) on the left side of (6.70) and (6.67)-(6.69) on the right side. Recalling that ρi is defined as ∏i /µi , and cancelling out common terms on each side, we have p(m) = p(m 0 )/ρj 0
p(m) = p(m )ρi 0
p(m) = p(m )ρi /ρj
for m 0 = m + e j
(6.72)
0
for m = m − e i , mi > 0 0
for m = m − e i + e j ,
(6.73) mi > 0.
(6.74)
Looking at the case m 0 = m − e i , and using this equation repeatedly to get from state (0, 0, . . . , 0) up to an arbitrary m, we obtain p(m) = p(0, 0, . . . , 0)
k Y i=1
i ρm i .
(6.75)
6.7. JACKSON NETWORKS
263
It is easy to verify that (6.75) satisfies (6.72)-(6.74) for all possible transitions. Summing over all m to solve for p(0, 0, . . . , 0), we get X X X X m m2 1 1= p(m) = p(0, 0, . . . , 0) ρm ρ . . . ρk k 1 2 m1 ,m2 ,... ,mk
m1
m2
−1
= p(0, 0 . . . , 0)(1 − ρ1 )
mk
−1
(1 − ρ2 )
. . . (1 − ρk )−1 .
Thus, p(0, 0, . . . , 0) = (1 − ρ1 )(1 − ρ2 ) . . . (1 − ρk ), and substituting this in (6.75), we get p(m) =
k Y i=1
k h i Y i pi (mi ) = (1 − ρi )ρm . i
(6.76)
i=1
where pi (m) = (1 − ρi )ρm i is the steady state distribution of a single M/M/1 queue. Now that we have found the steady state distribution implied by our assumption about the backward process being a Jackson network, our remaining task is to verify (6.71) P P ∗ To verify (6.71), i.e., m 0 qm,m 0 = m 0 qm,m 0 , first consider the right side. Using (6.64) to 0 sum over all m = m + e j , then (6.65) to sum over m 0 = m − e i (for i such that mi > 0), and finally (6.66) to sum over m 0 = m − e i + e j , (again for i such that mi > 0), we get X m0
∗ qm,m 0 =
k X
∏0 Q∗0j +
j=1
X
µi Q∗i0 +
i:mj >0
X
i:mi >0
Using the fact Q∗ is a stochastic matrix, then, X X ∗ qm,m µi . 0 = ∏0 + m0
µi
k X
Q∗ij .
(6.77)
j=1
(6.78)
i:mi >0
The left hand side of (6.71) can be summed in the same way to get the result on the right side of (6.78), but we can see that P this must be the result by simply observing that ∏0 is the rate of exogenous arrivals and i:mi>0 µi is P the overall rate of P service completions in 0 state m. Note that this also verifies that ∫ = q ≥ ∏ + m 0 m 0 m,m i µi , and since ∫m is P bounded, m ∫m p(m) < 1. Since all the conditions of Theorem 6.5 are satisfied, p(m), as given in (6.76), gives the steady state probabilities for the Jackson network. This also verifies that the backward process is a Jackson network, and hence the exogenous departures are Poisson and independent. Although the exogenous arrivals and departures in a Jackson network are Poisson, the endogenous processes of customers travelling from one node to another are typically not Poisson if there are feedback paths in the network. Also, although (6.76) shows that the numbers of customers at the different nodes are independent random variables at any given time in steady state, it is not generally true that the number of customers at one node at one time is independent of the number of customers at another node at another time. There are many generalizations of the reversibility arguments used above, and many network situations in which the nodes have independent states at a common time. We discuss just two of them here and refer to Kelly, [13], for a complete treatment.
264
CHAPTER 6. MARKOV PROCESSES WITH COUNTABLE STATE SPACES
For the first generalization, assume that the service time at each node depends on the number of customers at that node, i.e., µi is replaced by µi,mi . Note that this includes the M/M/m type of situation in which each node has several independent exponential servers. With this modification, the transition rates in (6.60) and (6.61) are modified by replacing µi with µi,mi . The hypothesized backward transition rates are modified in the same way, and the only effect of these changes is to replace ρi and ρj for each i and j in (6.72)-(6.74) with ρi,mi = ∏i /µi,mi and ρj,mj = ∏j /µj,mj . With this change, (6.75) becomes p(m) =
k Y
pi (mi ) =
i=1
pi (0) = 1 +
k Y
j=1 1 m−1 X Y
m=1 j=0
pi (0) −1
ρi,j
mi Y
ρi,j
(6.79)
j=0
.
(6.80)
Thus, p(m) is given by the product distribution of k individual birth death systems.
6.7.1
Closed Jackson networks
The second generalization is to a network of queues with a fixed number M of customers in the system and with no exogenous inputs or outputs. Such networks are called closed Jackson networks, whereas the networks analyzed above are often called open Jackson networks. P Suppose a k node closed network has routing probabilities Qij , 1 ≤ i, j ≤ k, where j Qij = 1, and has exponential service times of rate µi (this can be generalized to µi,mi as above). We make the same assumptions as before about independence of service variables and routing variables, and assume that there is a path between each pair of nodes. Since {Qij ; 1 ≤ i, j ≤ k} forms an irreducible stochastic matrix, there is a one dimensional set of solutions to the steady state equations X ∏i Qij ; 1 ≤ j ≤ k. (6.81) ∏j = i
We interpret ∏i as the time-average rate of transitions that go into node i. Since this set of equations can only be solved within an unknown multiplicative constant, and since this constant can only be determined at the end of the argument, we define {πi ; 1 ≤ i ≤ k} as the particular solution of (6.81) satisfying πj =
X i
πi Qij ; 1 ≤ j ≤ k;
X
πi = 1.
(6.82)
i
Thus, for all i, ∏i = απi , where α is some unknown constant. The state P of the Markov process is again taken as m = (m1 , m2 , . . . , mk ) with the condition i mi = M . The transition rates of the Markov process are the same as for open networks, except that there are no exogenous arrivals or departures; thus (6.59)-(6.61) are replaced by qm,m 0 = µi Qij
for m 0 = m − e i + e j ,
mi > 0, 1 ≤ i, j ≤ k.
(6.83)
6.7. JACKSON NETWORKS
265
We hypothesize that the backward time process is also a closed Jackson network, and as before, we conclude that if the hypothesis is true, the backward transition rates should be ∗ qm,m = µi Q∗ij 0
where ∏i Qij =
∏j Q∗ji
for m 0 = m − e i + e j , mi > 0, 1 ≤ i, j ≤ k for 1 ≤ i, j ≤ k.
(6.84) (6.85)
In order to use Theorem 6.5 again, we must verify that a PMF p(m) exists satisfying 0 )q∗ 0 0 p(m)q m ,m for all possible states and transitions, and we must also verify Pm,m = p(m P that m 0 qm,m 0 = m 0 q∗m,m 0 for all possible m. This latter verification is virtually the same as before and is left as an exercise. The former verification, with the use of (72), (73), and (74), becomes p(m)(µi /∏i ) = p(m 0 )(µj /∏j ) for m 0 = m − e i + e j ,
mi > 0.
(6.86)
Using the open network solution to guide our intuition, we see Pthat the following choice of p(m) satisfies (6.86) for all possible m (i.e., all m such that i mi = M ) p(m) = A
k Y
(∏i /µi )mi ;
for m such that
i=1
X
mi = M.
(6.87)
i
The constant A is a normalizing constant, chosen to make p(m) sum to unity. The problem with (6.87) is that we do not know ∏i (except within a multiplicative constant independent of i). Fortunately, however, if we substitute πi /α for ∏i , we see that α is raised to the power −M , independent of the state m. Thus, letting A0 = Aα − M , our solution becomes p(m) = A0
K Y
(πi /µi )mi ; for m such that
i=1
1 A0
= m:
X
P
k Y
i=1 i mi =M
√
πi µi
!mi
X
mi = M.
(6.88)
i
.
(6.89)
Note that the steady state distribution of the closed Jackson network has been found without solving for the time-average transition rates. Note also that the steady state distribution looks very similar to that for an open network; that is, it is a product distribution over the nodes with a geometric type distribution within each node. This is somewhat misleading, however, since the constant A0 can be quite difficult to calculate. It is surprising at first that the parameter of the geometric distribution can be changed by a constant multiplier in (6.88) and (6.89) (i.e., πi could be replaced with ∏i ) and the solution does not change; the important quantity is the relative values of πi /µi from one value of i to another rather than the absolute value. In order to find ∏i (and this is important, since it says how quickly the system is doing its work), note that ∏i = µi Pr{mi > 0}). Solving for Pr{mi > 0} requires finding the constant A0 in (6.83). In fact, the major difference between open and closed networks is that the relevant constants for closed networks are tedious to calculate (even by computer) for large networks and large M .
266
6.8
CHAPTER 6. MARKOV PROCESSES WITH COUNTABLE STATE SPACES
Summary
We have seen that Markov processes with countable state spaces are remarkably similar to Markov chains with countable state spaces, and throughout the chapter, we frequently made use of both the embedded chain corresponding to the process and to the sampled time approximation to the process. P For irreducible processes, the steady state equations, (6.20) and i ∫i = 1, were found to specify the steady state probabilities, pi , which have significance both as time averages and as limiting probabilities. If the transition rates ni are bounded, then the sampled time approximation exists and has the same steady state P probabilities as the Markov process itself. If the transition rates ni are unbounded but i pi ∫i < 1, then the embedded chain is positive recurrent and has steady state probabilities, but the sampled time approximation P does not exist. We assumed throughout the remainder of the chapter that i pi ∫i < 1. This ruled out irregular processes in which there is no meaningful steady state, and also some peculiar processes such as that in Exercise 6.6 where the embedded chain is null recurrent. Section 6.3 developed the Kolmogoroff backward and forward differential equations for the transient probabilities Pij (t) of being in state j at time t given state i at time 0. We showed that for finite-state processes, these equations can be solved by finding the eigenvalues and eigenvectors of the transition rate matrix Q. There are close analogies between this analysis and the algebraic treatment of finite-state Markov chains in chapter 4, and exercise 6.7 showed how the transients of the process are related to the transients of the sampled time approximation. For irreducible processes with bounded transition rates, uniformization was introduced as a way to simplify the structure of the process. The addition of self transitions does not change the process itself, but can be used to adjust the transition rates ∫i to be the same for all states. This changes the embedded Markov chain, and the steady state probabilities for the embedded chain become the same as those for the process. The epochs at which transitions occur then form a Poisson process which is independent of the set of states entered. This yields a separation between the transition epochs and the sequence of states. The next two sections analyzed birth-death processes and reversibility. The results about birth-death Markov chains and reversibility for Markov chains carried over almost without change to Markov processes. These results are central in queueing theory, and Burke’s theorem allowed us to look at simple queueing networks with no feedback and to understand how feedback complicates the problem. Finally, Jackson networks were discussed. These are important in their own right and also provide a good example of how one can solve complex queueing problems by studying the reverse time process and making educated guesses about the steady state behavior. The somewhat startling result here is that in steady state, and at a fixed time, the number of customers at each node is independent of the number at each other node and satisfies the same distribution as for an M/M/1 queue. Also the exogenous departures from the network are Poisson and independent from node to node. We emphasized that the number of customers at one node at one time is often dependent on the number at other nodes at
6.8. SUMMARY
267
other times. The independence holds only when all nodes are viewed at the same time. For further reading on Markov processes, see [13], [16], [22], and [9].
268
6.9
CHAPTER 6. MARKOV PROCESSES WITH COUNTABLE STATE SPACES
Exercises
Exercise 6.1. Consider a Markov process for which the embedded Markov chain is a birthdeath chain with transition probabilities Pi,i+1 = 2/5 for all i ≥ 1, Pi,i−1 = 3/5 for all i ≥ 1, P01 = 1, and Pij = 0 otherwise. a) Find the steady state probabilities {πi ; i ≥ 0} for the embedded chain.
b) Assume that the transition rate ∫i out of state i, for i ≥ 0, is given by ∫i = 2i . Find the transition rates {qij } between states and find the steady state probabilities {pi } for the Markov process. Explain heuristically why πi 6= pi .
c) Now assume in parts c) to f ) that the transition rate out of state i, for i ≥ 0, is given by ∫i = 2−i . Find the transition rates {qij } between states and draw the directed graph with these transition rates. d) Show that there is no probability vector solution {pi ; i ≥ 0} to (6.20). e) Argue that the expected time between visits to any given state i is infinite. Find the expected number of transitions between visits to any given state i. Argue that, starting from any state i, an eventual return to state i occurs with probability 1. f ) Consider the sampled-time approximation of this process with δ = 1. Draw the graph of the resulting Markov chain and argue why it must be null-recurrent. Exercise 6.2. Consider the Markov process illustrated below. The transitions are labelled by the rate qij at which those transitions occur. The process can be viewed as a single server queue where arrivals become increasingly discouraged as the queue lengthens. The word time-average below refers to the limiting time-average over each sample-path of the process, except for a set of sample paths of probability 0. 0♥ ②
∏ µ
③ ♥ 1 ②
∏/2 µ
③ ♥ 2 ②
∏/3 µ
③ ♥ 3 ②
∏/4 µ
③ ♥ 4
...
a) Find the time-average fraction of time pi spent in each state i > 0 in terms of p0 and then solve for p0 . Hint: First find an equation relating pi to pi+1 for each i. It also may help to recall the power series expansion of ex . P b) Find a closed form solution to i pj ∫i where ∫i is the departure rate from state i. Show that the process is positive recurrent for all choices of ∏ > 0 and µ > 0 and explain intuitively why this must be so. c) For the embedded Markov chain corresponding to this process, find the steady-state probabilities πi for each i ≥ 0 and the transition probabilities Pij for each i, j. d) For each i, find both the time-average interval and the time-average number of overall state transitions between successive visits to i. Exercise 6.3. (Continuation of Exercise 6.2 a) Assume that the Markov process in Exercise 6.2 is changed in the following way: whenever the process enters state 0, the time spent
6.9. EXERCISES
269
before leaving state 0 is now a uniformly distributed rv, taking values from 0 to 2/∏. All other transitions remain the same. For this new process, determine whether the successive epochs of entry to state 0 form renewal epochs, whether the successive epochs of exit from state 0 form renewal epochs, and whether the successive entries to any other given state i form renewal epochs. e) For each i, find both the time-average interval and the time-average number of overall state transitions between successive visits to i. f ) Is this modified process a Markov process in the sense that Pr{X(t) = i | X(τ ) = j, X(s) = k} = Pr{X(t) = i | X(τ ) = j} for all 0 < s < τ < t and all i, j, k? Explain. Exercise 6.4. a) Consider a Markov process with the set of states {0, 1, . . . } in which the transition rates {qij } between states are given by qi,i+1 = (3/5)2i for i ≥ 0, qi,i−1 = (2/5)2i for i ≥ 1, and qij = 0 otherwise. Find the transition rate ∫i out of state i for each i ≥ 0 and find the transition probabilities {Pij } for the embedded Markov chain. P b) Find a solution {pi ; i ≥ 0} with i pi = 1 to (6.20). c) Show that all states of the embedded Markov chain are transient.
Exercise 6.5. a) Consider the process in the figure below. The process starts at X(0) = 1, and for all i ≥ 1, Pi,i+1 = 1 and νi = i^2. Let Tn be the time that the nth transition occurs. Show that

E[Tn] = Σ_{i=1}^{n} i^{−2} < 2   for all n.

Hint: Upper bound the sum from i = 2 by integrating x^{−2} from x = 1.

[Figure: states 1 → 2 → 3 → 4 → · · · with transition rates ν1 = 1, ν2 = 4, ν3 = 9, ν4 = 16.]
b) Use the Markov inequality to show that Pr{Tn > 4} ≤ 1/2 for all n. Show that the probability of an infinite number of transitions by time 4 is at least 1/2.

Exercise 6.6. Let qi,i+1 = 2^{i−1} for all i ≥ 0 and let qi,i−1 = 2^{i−1} for all i ≥ 1. All other transition rates are 0.

a) Solve the steady state equations and show that pi = 2^{−i−1} for all i ≥ 0.

b) Find the transition probabilities for the embedded Markov chain and show that the chain is null recurrent.

c) For any state i, consider the renewal process for which the Markov process starts in state i and renewals occur on each transition to state i. Show that, for each i ≥ 1, the expected inter-renewal interval is equal to 2. Hint: Use renewal-reward theory.
d) Show that the expected number of transitions between each entry into state i is infinite. Explain why this does not mean that an infinite number of transitions can occur in a finite time.

Exercise 6.7. A two state Markov process has transition rates q01 = 1, q10 = 2. Find P01(t), the probability that X(t) = 1 given that X(0) = 0. Hint: You can do this by solving a single first order differential equation if you make the right choice between forward and backward equations.

Exercise 6.8. a) Consider a two state Markov process with q01 = λ and q10 = µ. Find the eigenvalues and eigenvectors of the transition rate matrix [Q]. b) Use (6.36) to solve for [P(t)]. c) Use the Kolmogorov forward equation for P01(t) directly to find P01(t) for t ≥ 0. Hint: you don't have to use the equation for P00(t); why? d) Check your answer in b) with that in c).

Exercise 6.9. Consider an irreducible Markov process with n states and assume that the transition rate matrix [Q] = [V][Λ][V]^{−1}, where [V] is the matrix of right eigenvectors of [Q], [Λ] is the diagonal matrix of eigenvalues of [Q], and the inverse [V]^{−1} is the matrix of left eigenvectors. a) Consider the sampled-time approximation to the process with an increment of size δ, and let [Wδ] be the corresponding transition matrix. b) Express [Wδ]^n in terms of [V] and [Λ]. c) Expand [Wδ]^n in the same form as (6.36). d) Let t be an integer multiple of δ, and compare [Wδ]^{t/δ} to [P(t)]. Note: What you see from this is that λi in (6.36) is replaced by (1/δ) ln(1 + δλi). For the steady state term, λ1 = 0, this causes no change, but for the other eigenvalues, there is a change that vanishes as δ → 0.

Exercise 6.10. Consider the three state Markov process below; the number given on edge (i, j) is qij, the transition rate from i to j. Assume that the process is in steady state.

[Figure: three-state Markov process with the transition rate qij labelled on each edge (i, j).]
a) Is this process reversible? b) Find pi, the time-average fraction of time spent in state i for each i. c) Given that the process is in state i at time t, find the mean delay from t until the process leaves state i. d) Find πi, the time-average fraction of all transitions that go into state i for each i. e) Suppose the process is in steady state at time t. Find the steady state probability that the next state to be entered is state 1. f) Given that the process is in state 1 at time t, find the mean delay until the process first returns to state 1. g) Consider an arbitrary irreducible finite-state Markov process in which qij = qji for all i, j. Either show that such a process is reversible or find a counter example.

Exercise 6.11. a) Consider an M/M/1 queueing system with arrival rate λ, service rate µ, µ > λ. Assume that the queue is in steady state. Given that an arrival occurs at time t, find the probability that the system is in state i immediately after time t. b) Assuming FCFS service, and conditional on i customers in the system immediately after the above arrival, characterize the time until the above customer departs as a sum of random variables. c) Find the unconditional probability density of the time until the above customer departs. Hint: You know (from splitting a Poisson process) that the sum of a geometrically distributed number of IID exponentially distributed random variables is exponentially distributed. Use the same idea here.

Exercise 6.12. a) Consider an M/M/1 queue in steady state. Assume ρ = λ/µ < 1. Find the probability Q(i, j) for i ≥ j > 0 that the system is in state i at time t and that i − j departures occur before the next arrival. b) Find the PMF of the state immediately before the first arrival after time t. c) There is a well known queueing principle called PASTA, standing for “Poisson arrivals see time averages”. Given your results above, give a more precise statement of what that principle means in the case of the M/M/1 queue.

Exercise 6.13. A small bookie shop has room for at most two customers. Potential customers arrive at a Poisson rate of 10 customers per hour; they enter if there is room and are turned away, never to return, otherwise. The bookie serves the admitted customers in order, requiring an exponentially distributed time of mean 4 minutes per customer. a) Find the steady state distribution of number of customers in the shop. b) Find the rate at which potential customers are turned away. c) Suppose the bookie hires an assistant; the bookie and assistant, working together, now serve each customer in an exponentially distributed time of mean 2 minutes, but there is
only room for one customer (i.e., the customer being served) in the shop. Find the new rate at which customers are turned away.

Exercise 6.14. Consider the job sharing computer system illustrated below. Incoming jobs arrive from the left in a Poisson stream of rate λ. Each job, independently of other jobs, requires pre-processing in system 1 with probability Q. Jobs in system 1 are served FCFS and the service times for successive jobs entering system 1 are IID with an exponential distribution of mean 1/µ1. The jobs entering system 2 are also served FCFS and successive service times are IID with an exponential distribution of mean 1/µ2. The service times in the two systems are independent of each other and of the arrival times. Assume that µ1 > λQ and that µ2 > λ. Assume that the combined system is in steady state.

[Figure: a fraction Q of the Poisson arrivals of rate λ passes through system 1 (service rate µ1) and then into system 2 (service rate µ2); the remaining fraction 1 − Q goes directly into system 2.]
a) Is the input to system 1 Poisson? Explain. b) Is each of the two input processes coming into system 2 Poisson? Explain. d) Give the joint steady state PMF of the number of jobs in the two systems. Explain briefly. e) What is the probability that the first job to leave system 1 after time t is the same as the first job that entered the entire system after time t? f) What is the probability that the first job to leave system 2 after time t both passed through system 1 and arrived at system 1 after time t?

Exercise 6.15. Consider the following combined queueing system. The first queue system is M/M/1 with Poisson arrivals of rate λ1 and service rate µ1. The second queue system has IID exponentially distributed service times with rate µ2. Each departure from system 1 independently enters system 2 with probability Q1 and leaves the combined system with probability 1 − Q1. System 2 has an additional Poisson input of rate λ2, independent of inputs and outputs from the first system. Each departure from the second system independently leaves the combined system with probability Q2 and re-enters system 2 with probability 1 − Q2. For parts a, b, c assume that Q2 = 1 (i.e., there is no feedback).

[Figure: Poisson arrivals of rate λ1 feed system 1 (rate µ1); each departure from system 1 splits between system 2 and the outside according to Q1 and 1 − Q1; system 2 (rate µ2) also receives an exogenous Poisson input of rate λ2 and, for part d), a feedback stream with probability 1 − Q2.]
a) Characterize the process of departures from system 1 that enter system 2 and characterize the overall process of arrivals to system 2. b) Assuming FCFS service in each system, find the steady state distribution of time that a customer spends in each system.
c) For a customer that goes through both systems, show why the time in each system is independent of that in the other; find the distribution of the combined system time for such a customer. d) Now assume that Q2 < 1. Is the departure process from the combined system Poisson? Which of the three input processes to system 2 are Poisson? Which of the input processes are independent? Explain your reasoning, but do not attempt formal proofs.

Exercise 6.16. Suppose a Markov chain with transition probabilities {Pij} is reversible. Suppose we change the transition probabilities out of state 0 from {P0j; j ≥ 0} to {P′0j; j ≥ 0}. Assuming that all Pij for i ≠ 0 are unchanged, what is the most general way in which we can choose {P′0j; j ≥ 0} so as to maintain reversibility? Your answer should be explicit about how the steady state probabilities {πi; i ≥ 0} are changed. Your answer should also indicate what this problem has to do with uniformization of reversible Markov processes, if anything. Hint: Given Pij for all i, j, a single additional parameter will suffice to specify P′0j for all j.

Exercise 6.17. Consider the closed queueing network in the figure below. There are three customers who are doomed forever to cycle between node 1 and node 2. Both nodes use FCFS service and have exponentially distributed IID service times. The service times at one node are also independent of those at the other node and are independent of the customer being served. The server at node i has mean service time 1/µi, i = 1, 2. Assume to be specific that µ2 < µ1.

[Figure: closed network in which node 1 (rate µ1) feeds node 2 (rate µ2) and node 2 feeds node 1.]
a) The system can be represented by a four state Markov process. Draw its graphical representation and label it with the individual states and the transition rates between them. b) Find the steady state probability of each state. c) Find the time-average rate at which customers leave node 1. d) Find the time-average rate at which a given customer cycles through the system. e) Is the Markov process reversible? Suppose that the backward Markov process is interpreted as a closed queueing network. What does a departure from node 1 in the forward process correspond to in the backward process? Can the transitions of a single customer in the forward process be associated with transitions of a single customer in the backward process?
Exercise 6.18. Consider an M/G/1 queueing system with last come first serve (LCFS) pre-emptive resume service. That is, customers arrive according to a Poisson process of rate λ. A newly arriving customer interrupts the customer in service and enters service itself. When a customer is finished, it leaves the system and the customer that had been interrupted by the departing customer resumes service from where it had left off. For example, if customer 1 arrives at time 0 and requires 2 units of service, and customer 2 arrives at time 1 and requires 1 unit of service, then customer 1 is served from time 0 to 1; customer 2 is served from time 1 to 2 and leaves the system, and then customer 1 completes service from time 2 to 3. Let Xi be the service time required by the ith customer; the Xi are IID random variables with expected value E[X]; they are independent of customer arrival times. Assume λE[X] < 1.

a) Find the mean time between busy periods (i.e., the time until a new arrival occurs after the system becomes empty). b) Find the time-average fraction of time that the system is busy. c) Find the mean duration, E[B], of a busy period. Hint: use a) and b). d) Explain briefly why the customer that starts a busy period remains in the system for the entire busy period; use this to find the expected system time of a customer given that that customer arrives when the system is empty. e) Is there any statistical dependence between the system time of a given customer (i.e., the time from the customer's arrival until departure) and the number of customers in the system when the given customer arrives? f) Show that a customer's expected system time is equal to E[B]. Hint: Look carefully at your answers to d) and e). g) Let C be the expected system time of a customer conditional on the service time X of that customer being 1. Find (in terms of C) the expected system time of a customer conditional on X = 2; (Hint: compare a customer with X = 2 to two customers with X = 1 each); repeat for arbitrary X = x. h) Find the constant C. Hint: use f) and g); don't do any tedious calculations.

Exercise 6.19. Consider a queueing system with two classes of customers, type A and type B. Type A customers arrive according to a Poisson process of rate λA and customers of type B arrive according to an independent Poisson process of rate λB. The queue has a FCFS server with exponentially distributed IID service times of rate µ > λA + λB. Characterize the departure process of class A customers; explain carefully. Hint: Consider the combined arrival process and be judicious about how to select between A and B types of customers.

Exercise 6.20. Consider a pre-emptive resume last come first serve (LCFS) queueing system with two classes of customers. Type A customer arrivals are Poisson with rate λA and Type B customer arrivals are Poisson with rate λB. The service time for type A customers is exponential with rate µA and that for type B is exponential with rate µB. Each service time is independent of all other service times and of all arrival epochs.
Define the “state” of the system at time t by the string of customer types in the system at t, in order of arrival. Thus state AB means that the system contains two customers, one of type A and the other of type B; the type B customer arrived later, so is in service. The set of possible states arising from transitions out of AB is as follows: ABA if another type A arrives; ABB if another type B arrives; A if the customer in service (B) departs. Note that whenever a customer completes service, the next most recently arrived resumes service, so the state changes by eliminating the final element in the string.

a) Draw a graph for the states of the process, showing all states with 2 or fewer customers and a couple of states with 3 customers (label the empty state as E). Draw an arrow, labelled by the rate, for each state transition. Explain why these are states in a Markov process. b) Is this process reversible? Explain. Assume positive recurrence. Hint: If there is a transition from one state S to another state S′, how is the number of transitions from S to S′ related to the number from S′ to S? c) Characterize the process of type A departures from the system (i.e., are they Poisson? do they form a renewal process? at what rate do they depart? etc.) d) Express the steady state probability Pr{A} of state A in terms of the probability of the empty state Pr{E}. Find the probability Pr{AB} and the probability Pr{ABBA} in terms of Pr{E}. Use the notation ρA = λA/µA and ρB = λB/µB. e) Let Qn be the probability of n customers in the system, as a function of Q0 = Pr{E}. Show that Qn = (1 − ρ)ρ^n where ρ = ρA + ρB.

Exercise 6.21. a) Generalize Exercise 6.20 to the case in which there are m types of customers, each with independent Poisson arrivals and each with independent exponential service times. Let λi and µi be the arrival rate and service rate respectively of the ith user. Let ρi = λi/µi and assume that ρ = ρ1 + ρ2 + · · · + ρm < 1. In particular, show, as before, that the probability of n customers in the system is Qn = (1 − ρ)ρ^n for 0 ≤ n < ∞. b) View the customers in part a) as a single type of customer with Poisson arrivals of rate λ = Σ_i λi and with a service density Σ_i (λi/λ)µi exp(−µi x). Show that the expected service time is ρ/λ. Note that you have shown that, if a service distribution can be represented as a weighted sum of exponentials, then the distribution of customers in the system for LCFS service is the same as for the M/M/1 queue with equal mean service time.
Exercise 6.22. Consider a k node Jackson type network with the modification that each node i has s servers rather than one server. Each server at i has an exponentially distributed service time with rate µi. The exogenous input rate to node i is ρi = λ0 Q0i and each output from i is switched to j with probability Qij and switched out of the system with probability
Qi0 (as in the text). Let λi, 1 ≤ i ≤ k, be the solution, for given λ0, to

λj = Σ_{i=0}^{k} λi Qij ;   1 ≤ j ≤ k,

and assume that λi < sµi, 1 ≤ i ≤ k. Show that the steady state probability of state m is

Pr{m} = Π_{i=1}^{k} pi(mi),
where pi(mi) is the probability of state mi in an (M, M, s) queue. Hint: simply extend the argument in the text to the multiple server case.

Exercise 6.23. Suppose a Markov process with the set of states A is reversible and has steady state probabilities pi; i ∈ A. Let B be a subset of A and assume that the process is changed by setting qij = 0 for all i ∈ B, j ∉ B. Assuming that the new process (starting in B and remaining in B) is irreducible, show that the new process is reversible and find its steady state probabilities.

Exercise 6.24. Consider a queueing system with two classes of customers. Type A customer arrivals are Poisson with rate λA and Type B customer arrivals are Poisson with rate λB. The service time for type A customers is exponential with rate µA and that for type B is exponential with rate µB. Each service time is independent of all other service times and of all arrival epochs.

a) First assume there are infinitely many identical servers, and each new arrival immediately enters an idle server and begins service. Let the state of the system be (i, j) where i and j are the numbers of type A and B customers respectively in service. Draw a graph of the state transitions for i ≤ 2, j ≤ 2. Find the steady state PMF, {p(i, j); i, j ≥ 0}, for the Markov process. Hint: Note that the type A and type B customers do not interact. b) Assume for the rest of the exercise that there is some finite number m of servers. Customers who arrive when all servers are occupied are turned away. Find the steady state PMF, {p(i, j); i, j ≥ 0, i + j ≤ m}, in terms of p(0, 0) for this Markov process. Hint: Combine part (a) with the result of Exercise 6.23. c) Let Qn be the probability that there are n customers in service at some given time in steady state. Show that Qn = p(0, 0)ρ^n/n! for 0 ≤ n ≤ m where ρ = ρA + ρB, ρA = λA/µA, and ρB = λB/µB. Solve for p(0, 0).

Exercise 6.25. a) Generalize Exercise 6.24 to the case in which there are K types of customers, each with independent Poisson arrivals and each with independent exponential service times. Let λk and µk be the arrival rate and service rate respectively for the kth user type, 1 ≤ k ≤ K. Let ρk = λk/µk and ρ = ρ1 + ρ2 + · · · + ρK. In particular, show, as before, that the probability of n customers in the system is Qn = p(0, . . . , 0)ρ^n/n! for 0 ≤ n ≤ m.
b) View the customers in part (a) as a single type of customer with Poisson arrivals of rate λ = Σ_k λk and with a service density Σ_k (λk/λ)µk exp(−µk x). Show that the expected service time is ρ/λ. Note that what you have shown is that, if a service distribution can be represented as a weighted sum of exponentials, then the distribution of customers in the system is the same as for the M/M/m,m queue with equal mean service time.

Exercise 6.26. Consider a sampled-time M/D/m/m queueing system. The arrival process is Bernoulli with probability λ . . .

For a given threshold α > 0, we might try to find the probability that Sn ≥ α for any n, or, given that Sn ≥ α for one or more values of n, we might want to find the distribution of the smallest n such that Sn ≥ α. Typical questions about random walks are finding the smallest n such that Sn reaches or exceeds a threshold, and finding the probability that the threshold is ever reached or crossed. Since Sn tends to drift downward with increasing n if E[X] = X̄ < 0, and tends to drift upward if X̄ > 0, the results to be obtained depend critically on whether X̄ < 0, X̄ > 0, or X̄ = 0. Since results for X̄ < 0 can be easily translated into results for X̄ > 0 by considering {−Sn; n ≥ 0}, we will focus on the case X̄ < 0. As one might expect, both the results and the techniques have a very different flavor when X̄ = 0, since here the random walk does not drift but typically wanders around in a rather aimless fashion.

The following three subsections discuss three special cases of random walks. The first two, simple random walks and integer random walks, will be useful throughout as examples, since they can be easily visualized and analyzed. The third special case is that of renewal processes, which we have already studied and which will provide additional insight into the general study of random walks. After this, Sections 7.2 and 7.3 show how two major application areas, G/G/1 queues and
hypothesis testing, can be treated in terms of random walks. These sections also show why questions related to threshold crossings are so important in random walks. Section 7.4 then develops the theory of threshold crossing for general random walks and Section 7.5 extends and in many ways simplifies these results through the use of stopping rules and a powerful generalization of Wald's equality known as Wald's identity. The remainder of the chapter is devoted to a rather general type of stochastic process called martingales. The topic of martingales is both a subject of interest in its own right and also a tool that provides additional insight into random walks, laws of large numbers, and other basic topics in probability and stochastic processes.
7.1.1 Simple random walks
Suppose X1, X2, . . . are IID binary random variables, each taking on the value 1 with probability p and −1 with probability q = 1 − p. Letting Sn = X1 + · · · + Xn, the sequence of sums {Sn; n ≥ 1} is called a simple random walk. Sn is the difference between positive and negative occurrences in the first n trials. Thus, if there are j positive occurrences for 0 ≤ j ≤ n, then Sn = 2j − n, and

Pr{Sn = 2j − n} = [n!/(j!(n − j)!)] p^j (1 − p)^{n−j}.   (7.1)
This distribution allows us to answer questions about Sn for any given n, but it is not very helpful in answering such questions as the following: for any given integer k > 0, what is the probability that the sequence S1, S2, . . . ever reaches or exceeds k? This probability can be expressed¹ as Pr{⋃_{n=1}^{∞} {Sn ≥ k}} and is referred to as the probability that the random walk crosses a threshold at k. Exercise 7.1 demonstrates the surprisingly simple result that for a simple random walk with p < 1/2, this threshold crossing probability is

Pr{⋃_{n=1}^{∞} {Sn ≥ k}} = (p/(1 − p))^k.   (7.2)

Sections 7.4 and 7.5 treat this same question for general random walks. They also treat questions such as the overshoot given a threshold crossing, the time at which the threshold is crossed given that it is crossed, and the probability of crossing such a positive threshold before crossing a given negative threshold.
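As an illustrative aside (not needed for the development that follows), the result (7.2) is easy to check by simulation. The Python sketch below simulates truncated sample paths of the simple random walk and compares the empirical frequency of ever reaching k with (p/(1 − p))^k; the particular values p = 0.4, k = 3, the number of trials, and the truncation length are arbitrary choices made only for illustration, and the truncation slightly underestimates the true crossing probability.

import random

def ever_reaches(p, k, max_steps=2000, rng=random):
    # Simulate one truncated simple random walk; report whether S_n >= k ever occurs.
    s = 0
    for _ in range(max_steps):
        s += 1 if rng.random() < p else -1
        if s >= k:
            return True
    return False

p, k, trials = 0.4, 3, 5000
empirical = sum(ever_reaches(p, k) for _ in range(trials)) / trials
print(empirical, (p / (1 - p)) ** k)   # empirical frequency vs. the right side of (7.2)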
7.1.2 Integer-valued random walks

Suppose next that X1, X2, . . . are arbitrary IID integer-valued random variables. We can again ask for the probability that such an integer-valued random walk crosses a threshold
¹This same probability is often expressed as Pr{sup_{n≥1} Sn ≥ k}. For a general random walk, the event ⋃_{n≥1} {Sn ≥ k} is slightly different from {sup_{n≥1} Sn ≥ k}. The latter event can include sample sequences s1, s2, . . . in which a subsequence of values sn approaches k as a limit but never quite reaches k. This is impossible for a simple random walk since all sn must be integers. It is possible, but can be shown to have probability zero, for general random walks. We will avoid this silliness by not using the sup notation to refer to threshold crossings.
at k, i.e., that the event ⋃_{n=1}^{∞} {Sn ≥ k} occurs, but the question is considerably harder than for simple random walks. Since this random walk takes on only integer values, it can be represented as a Markov chain with the set of integers forming the state space. In the Markov chain representation, threshold crossing problems are first-passage-time problems. These problems can be attacked by the Markov chain tools we already know, but the special structure of the random walk provides new approaches and simplifications that will be explained in Sections 7.4 and 7.5.
7.1.3 Renewal processes as special cases of random walks
If X1, X2, . . . are IID positive random variables, then {Sn; n ≥ 1} is both a special case of a random walk and also the sequence of arrival epochs of a renewal counting process, {N(t); t ≥ 0}. In this special case, the sequence {Sn; n ≥ 1} must eventually cross a threshold at any given positive value α, and the question of whether the threshold is ever crossed becomes uninteresting. However, the trial on which a threshold is crossed and the overshoot when it is crossed are familiar questions from the study of renewal theory. For the renewal counting process, N(α) is the largest n for which Sn ≤ α and N(α) + 1 is the smallest n for which Sn > α, i.e., the smallest n for which the threshold at α is strictly exceeded. Thus the trial at which α is crossed is a central issue in renewal theory. Also, the overshoot, which is S_{N(α)+1} − α, is familiar as the residual life at α.

Figure 7.1 illustrates the difference between general random walks and positive random walks, i.e., renewal processes. Note that the renewal process is illustrated with the axes reversed from the usual representation. We usually view each renewal epoch as a time (epoch) and view N(α) as the number of trials up to time α, whereas with random walks, we usually view the number of trials as a discrete time variable and view the sum of rv's as some kind of amplitude or cost. Mathematically this makes no difference, and it is often valuable to move from one point of view to another.
7.2 The waiting time in a G/G/1 queue
This section and the next introduce two important problems that are best solved by viewing them as random walks. In this section we represent the waiting time in a G/G/1 queue as a threshold crossing problem in a random walk. In the next section, we represent the error probability in a standard type of detection problem as a random walk problem. This detection problem will later be generalized to a sequential detection problem based on threshold crossings in a random walk. Consider a G/G/1 queue with first come first serve (FCFS) service. We shall find how to associate the probability that a customer must wait more than some given time α in the queue with the probability that a certain random walk crosses a threshold at α. Let X1 , X2 , . . . be the inter-arrival times of a G/G/1 queueing system; thus these variables are IID with a given distribution function FX (x) = Pr{Xi ≤ x}. Assume that arrival 0 enters an empty system at time 0, so that Sn = X1 + X2 + · · · + Xn is the epoch of the nth arrival after time 0. Let Y0 , Y1 , . . . , be the service times of the successive customers. These are IID
Figure 7.1: The sample function in (a) above illustrates a random walk with arbitrary (positive and negative) step sizes {Xi ; i ≥ 1}. The sample function in (b) illustrates a random walk restricted to positive step sizes {Xi > 0; i ≥ 1}, i.e., a renewal process. Note that the axes are reversed from the usual depiction of a renewal process. The same sample function is shown in part (c) using the customary axes for a renewal process. For both the arbitrary random walk of part (a) and the random walk with positive step sizes of parts (b) and (c), a ‘threshold’ at α is crossed on trial 4 with an overshoot S4 − α.
with some given distribution function FY (y) and are independent of {Xi ; i ≥ 1}. Figure 7.2 shows a sample path of arrivals and departures and illustrates the waiting time in queue for each arrival. To analyze the waiting time, note that the system time, i.e., the time in queue plus the time in service, for any given customer n is Wn + Yn , where Wn is the queueing time and Yn is the service time. As illustrated in Figure 7.2, customer n + 1 arrives Xn+1 time units after the beginning of this interval, i.e., after the arrival of customer n. If Xn+1 < Wn + Yn , then customer n + 1 arrives while customer n is still in the system, and thus must wait in the queue until n finishes service (in the figure, for example, customer 2 arrives while customer 1 is still in the queue). Thus Wn+1 = Wn + Yn − Xn+1
if Xn+1 ≤ Wn + Yn .
(7.3)
On the other hand, if Xn+1 > Wn + Yn , then customer n (and all earlier customers) have departed when n + 1 arrives. Thus n + 1 starts service immediately and Wn+1 = 0. This is the case for customer 3 in the figure. These two cases can be combined in the single equation Wn+1 = max[Wn + Yn − Xn+1 , 0];
for n ≥ 0; W0 = 0.
(7.4)
Since Yn and Xn+1 are coupled together in this equation for each n, it is convenient to define Un+1 = Yn − Xn+1 . Note that {Ui ; i ≥ 1} is a sequence of IID random variables. From (7.4), Wn = max[Wn−1 + Un , 0], and iterating on this equation, Wn = max[max[Wn−2 +Un−1 , 0]+Un , 0] = max[(Wn−2 + Un−1 + Un ), Un , 0] = max[(Wn−3 +Un−2 +Un−1 +Un ), (Un−1 +Un ), Un , 0] = ··· ···
(7.5)
Figure 7.2: Sample path of arrivals and departures from a G/G/1 queue. Customer 0 arrives at time 0 and enters service immediately. Customer 1 arrives at time s1 = x1. For the case shown above, customer 0 has not yet departed, i.e., x1 < y0, so customer 1's time in queue is w1 = y0 − x1. As illustrated, customer 1's system time (queueing time plus service time) is w1 + y1. Customer 2 arrives at s2 = x1 + x2. For the case shown above, this is before customer 1 departs at y0 + y1. Thus, customer 2's wait in queue is w2 = y0 + y1 − x1 − x2. As illustrated above, x2 + w2 is also equal to customer 1's system time, so w2 = w1 + y1 − x2. Customer 3 arrives when the system is empty, so it enters service immediately with no wait in queue, i.e., w3 = 0.
= max[(U1 + U2 + . . . + Un), (U2 + U3 + . . . + Un), . . . , (Un−1 + Un), Un, 0].   (7.6)

It is not necessary for the theorem below, but we can understand this maximization better by realizing that if the maximization is achieved at Ui + Ui+1 + · · · + Un, then a busy period must start with the arrival of customer i − 1 and continue at least through the service of customer n. To see this intuitively, note that the analysis above starts with the arrival of customer 0 to an empty system at time 0, but the choice of time 0 and customer number 0 has nothing to do with the analysis, and thus the analysis is valid for any arrival to an empty system. Choosing the largest customer number before n that starts a busy period must then give the correct waiting time and thus maximize (7.5). Exercise 7.2 provides further insight into this maximization.

Define Z^n_1 = Un, define Z^n_2 = Un + Un−1, and in general, for i ≤ n, define Z^n_i = Un + Un−1 + · · · + Un−i+1. Thus Z^n_n = Un + · · · + U1. With these definitions, (7.5) becomes

Wn = max[0, Z^n_1, Z^n_2, . . . , Z^n_n].
(7.7)
Note that the terms in {Z^n_i; 1 ≤ i ≤ n} are the first n terms of a random walk, but it is not the random walk based on U1, U2, . . . , but rather the random walk going backward, starting with Un. Note also that Wn+1, for example, is the maximum of a different set of variables, i.e., it is the walk going backward from Un+1. Fortunately, this doesn't matter for the analysis since the ordered variables (Un, Un−1, . . . , U1) are statistically identical to (U1, . . . , Un). The probability that the wait is greater than or equal to a given value α is

Pr{Wn ≥ α} = Pr{max[0, Z^n_1, Z^n_2, . . . , Z^n_n] ≥ α}.
(7.8)
This says that, for the nth customer, Pr{Wn ≥ α} is equal to the probability that the random walk {Z^n_i; 1 ≤ i ≤ n} crosses a threshold at α by the nth trial. Because of the
initialization used in the analysis, we see that Wn is the waiting time in queue of the nth arrival after the beginning of a busy period (although this nth arrival might belong to a later busy period than that initial busy period). As noted above, (Un, Un−1, . . . , U1) is statistically identical to (U1, . . . , Un) and thus Pr{Wn ≥ α} is the same as the probability that the first n terms of a random walk based on {Ui; i ≥ 1} cross a threshold at α. Since the first n + 1 terms of this random walk provide one more opportunity to cross α than the first n terms, we see that

· · · ≤ Pr{Wn ≥ α} ≤ Pr{Wn+1 ≥ α} ≤ · · · ≤ 1.
(7.9)
Since this sequence of probabilities is non-decreasing, it must have a limit as n → ∞, and this limit is denoted Pr{W ≥ α}. Mathematically,² this limit is the probability that a random walk based on {Ui; i ≥ 1} ever crosses a threshold at α. Physically, this limit is the probability that the waiting time in queue is at least α for any given very large-numbered customer (i.e., for customer n when the influence of a busy period starting n customers earlier has died out). These results are summarized in the following theorem.

Theorem 7.1. Let {Xi; i ≥ 1} be the interarrival intervals of a G/G/1 queue, let {Yi; i ≥ 0} be the service times, and assume that the system is empty at time 0 when customer 0 arrives. Let Wn be the time that the nth customer waits in the queue. Let Un = Yn−1 − Xn for n ≥ 1. Then for any α > 0 and n ≥ 1, Wn is given by (7.7). Also, Pr{Wn ≥ α} is equal to the probability that the random walk based on {Ui; i ≥ 1} crosses a threshold at α by the nth trial. Finally, Pr{W ≥ α} = lim_{n→∞} Pr{Wn ≥ α} is equal to the probability that the random walk based on {Ui; i ≥ 1} ever crosses a threshold at α.

Note that the theorem specifies the distribution function of Wn for each n, but says nothing about the joint distribution of successive waiting times. These are not the same as the distribution of successive terms in a random walk because of the reversal of terms above.

We shall find a relatively simple solution to the probability that a random walk crosses a positive threshold in Section 7.4. From Theorem 7.1, this also solves for the distribution of queueing delay for the G/G/1 queue (and thus also for the M/G/1 and M/M/1 queues).
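As an illustrative aside, the recursion (7.4) also gives a direct way to estimate Pr{W ≥ α} numerically. The sketch below iterates Wn+1 = max(Wn + Yn − Xn+1, 0) along one long sample path and reports the fraction of customers whose queueing wait is at least α. The exponential interarrival and service distributions (an M/M/1 special case) and the parameter values are arbitrary choices for illustration; for this special case the long-run fraction should be near ρ e^{−µ(1−ρ)α} with ρ = λ/µ.

import math, random

def waiting_time_tail(lam, mu, alpha, num_customers=200_000, seed=1):
    # Iterate the G/G/1 recursion W_{n+1} = max(W_n + U_{n+1}, 0) with U_{n+1} = Y_n - X_{n+1}
    # and return the fraction of customers whose wait is at least alpha.
    rng = random.Random(seed)
    w, count = 0.0, 0
    for _ in range(num_customers):
        u = rng.expovariate(mu) - rng.expovariate(lam)   # Y_n - X_{n+1}
        w = max(w + u, 0.0)
        if w >= alpha:
            count += 1
    return count / num_customers

lam, mu, alpha = 0.5, 1.0, 2.0
print(waiting_time_tail(lam, mu, alpha),
      (lam / mu) * math.exp(-(mu - lam) * alpha))   # simulation vs. the M/M/1 value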
7.3 Detection, decisions, and hypothesis testing
Consider a situation in which we make n noisy observations of the outcome of a single binary random variable H and then guess, on the basis of the observations alone, which binary outcome occurred. In communication technology, this is called a detection problem. It models, for example, the situation in which a single binary digit is transmitted over some time interval but a noisy vector depending on that binary digit is received. It similarly models the problem of detecting whether or not a target is present in a radar observation. In control theory, such situations are usually referred to as decision problems, whereas in statistics, they are referred to as hypothesis testing. 2
More precisely, the sequence of waiting times W1 , W2 . . . , have distribution functions FWn that converge to FW , the generic distribution of the given threshold crossing problem with unlimited trials. As n increases, the distribution of Wn approaches FW , and we refer to W as the waiting time in steady state.
Specifically, let H0 and H1 be the names for the two possible values of the binary random variable H and let p0 = Pr{H0} and p1 = 1 − p0 = Pr{H1}. Thus p0 and p1 are the a priori probabilities³ for the random variable H. Let Y1, Y2, . . . , Yn be the n observations. We assume that, conditional on H0, the observations Y1, . . . , Yn are IID random variables. Suppose, to be specific, that these variables have a density f(y | H0). Conditional on H0, the joint density of a sample n-tuple y = (y1, y2, . . . , yn) of observations is given by

f(y | H0) = Π_{i=1}^{n} f(yi | H0).   (7.10)
Similarly, conditional on H1, we assume that Y1, . . . , Yn are IID random variables with a conditional joint density given by (7.10) with H1 in place of H0. In summary then, the model is that H is a rv with PMF {p0, p1}. Conditional on H, Y = (Y1, . . . , Yn) is an n-tuple of IID rv's. Given a particular sample of n observations y = y1, y2, . . . , yn, we can evaluate Pr{H1 | y} as

Pr{H1 | y} = [p1 Π_{i=1}^{n} f(yi | H1)] / [p1 Π_{i=1}^{n} f(yi | H1) + p0 Π_{i=1}^{n} f(yi | H0)].   (7.11)
We can evaluate Pr{H0 | y} in the same way, and the ratio of these quantities is given by

Pr{H1 | y} / Pr{H0 | y} = [p1 Π_{i=1}^{n} f(yi | H1)] / [p0 Π_{i=1}^{n} f(yi | H0)].   (7.12)
If we observe y and choose H0 , then Pr{H1 | y } is the resulting probability of error, and conversely if we choose H1 , then Pr{H0 | y } is the resulting probability of error. Thus the probability of error is minimized, for a given y , by evaluating the above ratio and choosing H1 if the ratio is greater than 1 and choosing H0 otherwise. If the ratio is equal to 1, the error probability is the same whether H0 or H1 is chosen.
The above rule for choosing H0 or H1 is called the maximum a posteriori probability detection rule, usually abbreviated as the MAP rule. The rule has a more attractive form (and also brings us back to random walks) if we take the logarithm of each side of (7.12), getting

ln [Pr{H1 | y} / Pr{H0 | y}] = ln(p1/p0) + Σ_{i=1}^{n} zi ;   where zi = ln [f(yi | H1) / f(yi | H0)].   (7.13)
The quantity zi in (7.13) is called a log likelihood ratio. Note that zi is a function only of yi, and that this same function is used for each i. For simplicity, we assume that this

³Statisticians have argued since the beginning of statistics about the validity of choosing a priori probabilities for an hypothesis to be tested. Bayesian statisticians are comfortable with this practice and non-Bayesians are not. Both are comfortable with choosing a probability model for the observations conditional on each hypothesis. We take a Bayesian approach here, partly to take advantage of the power of a complete probability model, and partly because non-Bayesian results, i.e., results that do not depend on the a priori probabilities, are much easier to derive and interpret within a full probability model. As will be seen, the Bayesian approach also makes it natural to incorporate the results of early observations into updated a priori probabilities for analyzing later observations.
function is finite for all y. The MAP rule is to choose H1 or H0 depending on whether the quantity on the right is positive or negative, i.e.,

Σ_{i=1}^{n} zi   > ln(p0/p1) :  choose H1
                 < ln(p0/p1) :  choose H0                (7.14)
                 = ln(p0/p1) :  don't care, choose either.
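As a small illustration of (7.13) and (7.14), not part of the development, the sketch below implements the MAP rule for one hypothetical choice of observation densities: Gaussian with unit variance, mean 0 under H0 and mean 1 under H1. This choice is ours, made only to have explicit densities; with it the log likelihood ratio reduces to zi = yi − 1/2, and any other pair of densities could be substituted in the function llr.

import math, random

def llr(y):
    # z = ln[f(y|H1)/f(y|H0)] for the assumed densities N(1,1) and N(0,1); reduces to y - 1/2
    return y - 0.5

def map_rule(observations, p0):
    # Threshold test (7.14): compare the sum of log likelihood ratios with ln(p0/p1)
    s = sum(llr(y) for y in observations)
    return 'H1' if s > math.log(p0 / (1 - p0)) else 'H0'

rng = random.Random(0)
p0 = 0.7
h = 'H0' if rng.random() < p0 else 'H1'          # the hypothesis is selected first
mean = 0.0 if h == 'H0' else 1.0
ys = [rng.gauss(mean, 1.0) for _ in range(20)]    # then n conditionally IID observations
print('true:', h, ' decided:', map_rule(ys, p0))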
Conditional on H0, the rv's {Yi; 1 ≤ i ≤ n} are IID. Since Zi = ln[f(Yi | H1)/f(Yi | H0)] for 1 ≤ i ≤ n, and since Zi is the same finite function of Yi for all i, we see that each Zi is a rv and that Z1, . . . , Zn are IID conditional on H0. Similarly, Z1, . . . , Zn are IID conditional on H1. Without conditioning on H0 or H1, neither the rv's Y1, . . . , Yn nor the rv's Z1, . . . , Zn are IID. Thus it is important to keep in mind the basic structure of this problem: initially a sample value is chosen for H. Then n observations, IID conditional on H, are made. Naturally the observer does not observe the original selection for H.

Conditional on H0, the sum on the left in (7.14) is thus the sample value of the nth term in the random walk Sn = Z1 + · · · + Zn based on the rv's {Zi; i ≥ 1} conditional on H0. The MAP rule chooses H1, thus making an error conditional on H0, if Sn is greater than the threshold ln[p0/p1]. Similarly, conditional on H1, Sn = Z1 + · · · + Zn is the nth term in a random walk with the conditional probabilities from H1, and an error is made, conditional on H1, if Sn is less than the threshold ln[p0/p1].

It is interesting to observe that Σ_i zi in (7.14) depends only on the observations but not on p0, whereas the threshold ln(p0/p1) depends only on p0 and not on the observations. Naturally the marginal probability distribution of Σ_i Zi does depend on p0 (through the conditioning), but Σ_i zi is a function only of the observations, so its value does not depend on p0.

The decision rule in (7.14) is called a threshold test in the sense that Σ_i zi is compared with a threshold to make a decision. There are a number of other formulations of the problem that also lead to threshold tests. For example, maximum likelihood (ML) detection chooses the hypothesis i that maximizes f(y | Hi), and thus corresponds to a threshold at 0. The ML rule has the property that it minimizes the maximum of Pr{H0 | Y} and Pr{H1 | Y}; this has obvious benefits when one is unsure of the a priori probabilities. In many detection situations there are unequal costs associated with the two kinds of errors. For example, one kind of error in a medical test could lead to death of the patient and the other to an unneeded medical procedure. A minimum cost decision minimizes the expected cost over the two types of errors. As shown in Exercise 7.5, this is also a threshold test. Finally, one might impose the constraint that Pr{error | H1} must be less than some tolerable limit α, and then minimize Pr{error | H0} subject to this constraint. The solution to this is called a Neyman-Pearson threshold test (see Exercise 7.6). The Neyman-Pearson test is of particular interest since it does not require any assumptions about a priori probabilities.

So far we have assumed that a decision is made after n observations. In many situations there is a cost associated with observations and one would prefer, after a given number of observations, to make a decision if the resulting probability of error is small enough, and to
continue with more observations otherwise. Common sense dictates such a strategy, and the branch of probability theory analyzing such strategies is called sequential analysis, which is based on the results in the next section. Essentially, we will see that the appropriate way to vary the number of observations based on the result of the observations is as follows: the probability of error under either hypothesis is based on Sn = Z1 + · · · + Zn, and the appropriate rule is to choose H0 if the sample value of Sn is less than some negative threshold β, to choose H1 if the sample value of Sn ≥ α for some positive threshold α, and to continue testing if the sample value has not exceeded either threshold.

The previous examples have all involved random walks crossing thresholds, and we now turn to the systematic study of threshold crossing problems. First we look at single thresholds, so that one question of interest is to find Pr{Sn ≥ α} for an arbitrary integer n ≥ 1 and arbitrary α > 0. Another question is whether Sn ≥ α for any n ≥ 1. We then turn to random walks with both a positive and negative threshold. Here, some questions of interest are to find the probability that the positive threshold is crossed before the negative threshold, to find the distribution of the threshold crossing time given the particular threshold crossed, and to find the overshoot when a threshold is crossed.
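The sequential rule just described can be made concrete with a short sketch. The code below again uses the hypothetical Gaussian densities of the earlier MAP sketch and arbitrarily chosen thresholds α = 4 and β = −4 (both choices are ours, for illustration only); it simply accumulates the random walk Sn = Z1 + · · · + Zn until it leaves (β, α) and then decides.

import random

def llr(y):
    # log likelihood ratio for the assumed densities N(1,1) vs N(0,1)
    return y - 0.5

def sequential_test(observations, alpha, beta):
    # Stop at the first n with S_n >= alpha (choose H1) or S_n <= beta (choose H0).
    s = 0.0
    for n, y in enumerate(observations, start=1):
        s += llr(y)
        if s >= alpha:
            return 'H1', n
        if s <= beta:
            return 'H0', n
    return None, None    # no decision within the available observations

rng = random.Random(1)
stream = (rng.gauss(1.0, 1.0) for _ in range(10_000))   # data generated under H1 here
print(sequential_test(stream, alpha=4.0, beta=-4.0))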
7.4 Threshold crossing probabilities in random walks
Let {Xi; i ≥ 1} be a sequence of IID random variables with the distribution function FX(x), and let {Sn; n ≥ 1} be a random walk with Sn = X1 + · · · + Xn. We assume throughout that E[X] exists and is finite. The reader should focus on the case E[X] = X̄ < 0 on a first reading, and consider X̄ = 0 and X̄ > 0 later. For X̄ < 0 and α > 0, we shall develop upper bounds on Pr{Sn ≥ α} that are exponentially decreasing in n and α. These bounds, and many similar results to follow, are examples of large deviation theory, i.e., probabilities of highly unlikely events.

We assume throughout this section that X has a moment generating function g(r) = E[e^{rX}] = ∫ e^{rx} dFX(x), and that g(r) is finite in some open interval around r = 0. As pointed out in Chapter 1, X must then have moments of all orders and the tails of its distribution function FX(x) must decay at least exponentially in x as x → −∞ and as x → +∞. Note that e^{rx} is increasing in r for x > 0, so that if ∫_{0}^{∞} e^{rx} dFX(x) blows up for some r+ > 0, it remains infinite for all r > r+. Similarly, for x < 0, e^{rx} is increasing in −r, so that if ∫_{x≤0} e^{rx} dFX(x) blows up at some r− < 0, it is infinite for all r < r−. Thus if r− and r+ are the smallest and largest values such that g(r) is finite for r− < r < r+, then g(r) is infinite for r > r+ and for r < r−. The end points r− and r+ can each be finite or infinite, and the values g(r+) and g(r−) can each be finite or infinite.

Note that if X is bounded in the sense that Pr{X < −B} = 0 and Pr{X > B} = 0 for some B < ∞, then g(r) exists for all r. Such rv's are said to have finite support and include all discrete rv's with a finite set of possible values. Another simple example is that if X is a non-negative rv with FX(x) = 1 − exp(−αx) for x ≥ 0, then r+ = α. Similarly, if X is a negative rv with FX(x) = exp(βx) for x < 0, then r− = −β. Exercise 7.7 provides further examples of these possibilities.
The moment generating function of Sn = X1 + · · · + Xn is given by

gSn(r) = E[exp(rSn)] = E[exp(r(X1 + · · · + Xn))] = {E[exp(rX)]}^n = {g(r)}^n.
(7.15)
It follows that gSn(r) is finite in the same interval (r−, r+) as g(r). First we look at the probability, Pr{Sn ≥ α}, that the nth step of the random walk satisfies Sn ≥ α for some threshold α > 0. We could actually find the distribution of Sn either by convolving the density of X with itself n times or by going through the transform domain. This would not give us much insight, however, and would be computationally tedious for large n. Instead, we explore the exponential bound, (1.38). For any r ≥ 0 in the region where g(r) is finite, i.e., for 0 ≤ r < r+, we have

Pr{Sn ≥ α} ≤ gSn(r) e^{−rα} = [g(r)]^n e^{−rα}.
(7.16)
It is convenient to rewrite (7.16) in terms of the semi-invariant moment generating function γ(r) = ln[g(r)]:

Pr{Sn ≥ α} ≤ exp[nγ(r) − rα] ;   any r, 0 ≤ r < r+.   (7.17)
The first two derivatives of γ with respect to r are given by

γ′(r) = g′(r)/g(r) ;   γ″(r) = [g(r)g″(r) − [g′(r)]^2] / [g(r)]^2.   (7.18)

Recall from (1.32) that g′(0) = E[X] and g″(0) = E[X^2]. Substituting this into (7.18), we can evaluate γ′(0) and γ″(0) as

γ′(0) = X̄ = E[X] ;   γ″(0) = σ_X^2.   (7.19)
The fact that γ″(0) is the second central moment of X is why γ is called a semi-invariant moment generating function. Unfortunately, the higher-order derivatives of γ, evaluated at r = 0, are not equal to the higher-order central moments. Over the range of r where g(r) < ∞, it is shown in Exercise 7.8 that γ″(r) ≥ 0, with strict inequality except in the very special (and uninteresting) case where X is deterministic. If X is deterministic, then Sn is also, and there is no point to considering a probabilistic model. We thus assume in what follows that X is non-deterministic and thus γ″(r) > 0 for all r between r− and r+. Figure 7.3 sketches γ(r) assuming that X̄ < 0 and r+ = ∞.

We can now minimize the exponent in (7.17) over r ≥ 0. For simplicity, first assume that r+ = ∞. Since γ″(r) > 0, the exponent is minimized by setting its derivative equal to 0. The minimum (if it exists) occurs at the r, say ro, for which γ′(r) = α/n. As seen from Figure 7.3, this is satisfied with r ≥ 0 only if α/n ≥ X̄. Thus

Pr{Sn ≥ α} ≤ exp{n[γ(ro) − ro γ′(ro)]}   where γ′(ro) = α/n ≥ E[X]   (7.20)

            = exp{α[γ(ro)/γ′(ro) − ro]}.   (7.21)
Figure 7.3: Semi-invariant moment generating function γ(r) for a rv X such that E[X] < 0 and r+ = ∞. Note that γ(r) is tangent to the line of slope E[X] < 0 at 0 and has a positive second derivative everywhere.
Figure 7.4: Graphical minimization of γ(r) − (α/n)r. For any r, γ(r) − (α/n)r is found by drawing a line of slope (α/n) from the point (r, γ(r)) to the vertical axis. The minimum occurs when the line of slope α/n is tangent to the curve.

The first of these inequalities shows how Pr{Sn ≥ α} decreases exponentially with n for fixed α/n = γ′(ro) and the second shows how it decreases with α for the same ratio α/n = γ′(ro). We now give a graphical interpretation in Figure 7.4 of what these exponents mean, and return subsequently to discuss whether α/n = γ′(r) actually has a solution.

The function γ(r) has a strictly positive second derivative, and thus any tangent to the function must lie below the function everywhere except at the point of tangency. The particular tangent shown is tangent at the point r = ro where γ′(r) = α/n. Thus this tangent line has slope α/n = γ′(ro) and meets the vertical axis at the point γ(ro) − ro γ′(ro). As illustrated, this vertical axis intercept is smaller than γ(r) − (α/n)r for any other choice of r. This is the exponent in (7.20). This exponent is negative and shows that for a fixed ratio α/n, Pr{Sn ≥ α} decays exponentially in n.

Our primary interest is in the probability that Sn exceeds a positive threshold, α > 0, but it can be seen that both the algebraic and graphical arguments above apply whenever α > nE[X]. Since E[X] < 0, we might also be interested in the probability that Sn exceeds the mean by some amount, while also being negative.

Figure 7.4 also gives a geometric interpretation of (7.21) for the case α > 0. The exponent in α is given by (7.21) to be −ro + γ(ro)/γ′(ro), where ro satisfies γ′(ro) = α/n. The negative
of this is seen to be the horizontal axis intercept of the tangent to γ(r) at ro, and thus this intercept gives the exponential decay rate of Pr{Sn ≥ α} in α for fixed α/n.

It is interesting to observe what happens to (7.21) as n is changed while holding α > 0 fixed. This is an important question for threshold crossings, since it provides an upper bound on crossing a fixed α for different values of n. For the α and n illustrated in Figure 7.4, note that as n increases with fixed α, the slope of the tangent decreases, moving the horizontal axis intercept to the right, i.e., increasing the exponential decay rate in α. Conversely, as n is decreased, the intercept moves to the left, decreasing the exponential decay rate. Note, however, that when the slope increases to the point where the intercept reaches the point where γ(r) = 0, i.e., the point labelled r∗ in Figure 7.4, then further reductions in n move the tangent point to where γ(r) is positive. At this point, the intercept starts to move to the right again. This means that for all n, an upper bound to Pr{Sn ≥ α} is given by

Pr{Sn ≥ α} ≤ exp(−r∗ α)   for arbitrary α > 0, n ≥ 1.   (7.22)
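Since the bound (7.22) involves only r∗, the positive root of γ(r) = 0, it is worth noting that r∗ is easy to compute numerically once g(r) is known. The sketch below does this by bisection for one arbitrarily chosen binary rv (X = +1 with probability p = 0.4 and X = −1 otherwise, a choice made only for illustration); for this rv the root also has the closed form ln[(1 − p)/p], and the bound exp(−r∗k) then agrees with the exact simple-random-walk crossing probability (7.2) at integer thresholds k.

import math

p = 0.4    # Pr{X = +1}; Pr{X = -1} = 1 - p, so E[X] = 2p - 1 < 0

def gamma(r):
    # semi-invariant moment generating function gamma(r) = ln E[e^{rX}] for this binary rv
    return math.log(p * math.exp(r) + (1 - p) * math.exp(-r))

def r_star(lo=1e-9, hi=20.0, iters=200):
    # Bisection for the positive root of gamma(r) = 0 (gamma < 0 just above 0, > 0 for large r).
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if gamma(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

alpha = 3
rs = r_star()
print(rs, math.log((1 - p) / p))                       # numerical vs. closed-form r*
print(math.exp(-rs * alpha), (p / (1 - p)) ** alpha)   # bound (7.22) vs. the exact value from (7.2)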
We now must return to the question of whether the equation α/n = γ′(r) has a solution. From the assumption that E[X] < 0, we know that γ′(0) < 0. We have not yet shown why γ′(r) should become positive as r increases. To see this in the simplest case, assume that X is discrete and takes on at least one positive value (if X were a non-positive rv, there would be no point to discussing the probability of crossing a positive threshold). Let xmax be the largest such value. Then g(r) = Σ_x p(x)e^{rx} ≥ p(xmax)e^{r xmax}. It follows that γ(r) ≥ r xmax + ln(p(xmax)). Since γ has a positive second derivative, it follows that γ′(r) must be increasing with r and must approach xmax in the limit as r → ∞. Thus α/n = γ′(r) has a solution whenever α/n < xmax. It is also clear that Pr{Sn ≥ α} = 0 for α/n > xmax. Thus γ′(r) = α/n has a solution over the range of interest. One can extend this argument to the case where X has an arbitrary distribution function with negative mean.

Although we have only established (7.20), (7.21), and (7.22) as upper bounds, Exercise 7.10 shows that for any fixed ratio a = α/n and any ε > 0, there is an n0(ε) such that for all n ≥ n0(ε), Pr{Sn ≥ n(a − ε)} > exp{−n[ra − γ(r) + ε]} where r satisfies γ′(r) = a. This means that for fixed a = α/n, (7.20) is exponentially tight, i.e., Pr{Sn ≥ na} decays exponentially with increasing n at the asymptotic rate −ra + γ(r) where r satisfies γ′(r) = a.

The above discussion has treated only the case where r+ = ∞. Figure 7.5 illustrates the minimization of (7.17) for the case where r+ < ∞. We have assumed that γ(r) < 0 for r < r+, since the previous argument applies if γ(r) crosses 0 at some r∗ ≤ r+. To include this case, (7.20) is generalized to

Pr{Sn ≥ α} ≤ exp{n[γ(ro) − ro α/n]}   if α/n = γ′(ro) for some ro < r+ ;

Pr{Sn ≥ α} ≤ exp{n [lim_{r∈(r−,r+)→r+} (γ(r) − rα/n)]}   otherwise.   (7.23)
Figure 7.5: Graphical minimization of γ(r) − (α/n)r for the case where r+ < ∞. As before, for any r < r+, γ(r) − rα/n is found by drawing a line of slope (α/n) from the point (r, γ(r)) to the vertical axis. The minimum occurs when the line of slope α/n is tangent to the curve or when it touches the curve at r = r∗.
If we extend the definition of r∗ to be the supremum of r such that γ(r) ≤ 0, then the bound Pr{Sn ≥ α} ≤ exp(−r∗α) still holds for arbitrary α > 0, n ≥ 1. The next section establishes Wald's identity, which shows, among other things, that if X̄ < 0, then exp(−r∗α) is an upper bound (and a reasonable approximation) to the probability that the walk ever crosses a threshold at α > 0. Note that we have already found an upper bound to Pr{Sn ≥ α} for any α > 0, n ≥ 1, but this new result bounds Pr{⋃_n {Sn ≥ α}} for any α > 0. Both the threshold-crossing bounds in this section and Wald's identity in the next suggest that for large n or large α, the most important parameter of the IID rv's X making up the walk is the positive root r∗ of γ(r), rather than the mean, variance, or other moments of X. As a prelude to developing these large deviation results about threshold crossings, we define stopping rules in a way that is both simpler and more general than the treatment in Chapter 3.
7.5 Thresholds, stopping rules, and Wald's identity
The following lemma shows that a random walk with two thresholds, say α > 0 and β < 0, eventually crosses one of the thresholds. Figure 7.6 illustrates two sample paths and how they cross thresholds. More specifically, the random walk first crosses a threshold at trial n if β < Si < α for 1 ≤ i < n and either Sn ≥ α or Sn ≤ β. The lemma shows that this random number of trials N is finite with probability 1 (i.e., N is a rv) and that N has moments of all orders.

Lemma 7.1. Let {Xi; i ≥ 1} be IID and not identically 0. For each n ≥ 1, let Sn = X1 + · · · + Xn. Let α > 0 and β < 0 be arbitrary real numbers, and let N be the smallest n for which either Sn ≥ α or Sn ≤ β. Then N is a random variable (i.e., lim_{m→∞} Pr{N ≥ m} = 0) and has finite moments of all orders.
Figure 7.6: Two sample paths of a random walk with two thresholds. In the first, the threshold at α is crossed at N = 5. In the second, the threshold at β is crossed at N = 4.
Proof: Since X is not identically 0, there is some n for which either Pr{Sn ≤ −α + β} > 0 or Pr{Sn ≥ α − β} > 0. For any such n, let ε = max[Pr{Sn ≤ −α + β}, Pr{Sn ≥ α − β}]. For any integer k ≥ 1, given that N > n(k − 1), and given any value of S_{n(k−1)} in (β, α), a threshold will be crossed by time nk with probability at least ε. Thus Pr{N > nk | N > n(k − 1)} ≤ 1 − ε. Iterating on k, Pr{N > nk} ≤ (1 − ε)^k. This shows that N is finite with probability 1 and that Pr{N ≥ j} goes to 0 at least geometrically in j. It follows that the moment generating function gN(r) of N is finite in a region around r = 0, and that N has moments of all orders.
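The geometric decay of Pr{N > m} established in the proof is easy to observe by simulation. The sketch below uses the simple random walk with p = 0.4 and thresholds α = 5, β = −5 (all arbitrary choices for illustration) and estimates Pr{N > m} for a few values of m.

import random

def stopping_time(p, alpha, beta, rng, cap=1_000_000):
    # Smallest n with S_n >= alpha or S_n <= beta; cap is only a safety limit for the loop.
    s = 0
    for n in range(1, cap + 1):
        s += 1 if rng.random() < p else -1
        if s >= alpha or s <= beta:
            return n
    return cap

rng = random.Random(2)
p, alpha, beta, trials = 0.4, 5, -5, 20_000
samples = [stopping_time(p, alpha, beta, rng) for _ in range(trials)]
for m in (10, 20, 40, 80):
    print(m, sum(n > m for n in samples) / trials)    # empirical Pr{N > m}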
7.5.1 Stopping rules
In this section, we start with a definition of stopping rules that is more fundamental and quite different from that in Chapter 3. We then use this definition to establish Wald’s identity, which is the basis for all of our subsequent results about random walks and threshold crossings. First consider a simple example. Consider a sequence {Xn ; n ≥ 1} of binary random variables taking on only the values ±1. Suppose we are interested in the first occurrence of the string (+1, −1), and we view this condition as a stopping rule. Figure 7.7 illustrates this stopped process by viewing it as the truncation of a tree of possible sequences. Aside from the complexity of the tree, the same approach can be taken when considering a random walk with a stopping rule that stops at the first trial in which the random walk reaches either α > 0 or β < 0. In this case also, the stopping node is the initial segment for which the first crossing occurs at the final trial of that segment.
Figure 7.7: A tree representing the collection of binary (1, -1) sequences, with a stopping rule viewed as a pruning of the tree. The particular stopping rule here is to stop on the first occurrence of the string (+1, −1). The leaves of the tree (i.e., the nodes at which stopping occurs) are marked with large dots and the intermediate nodes (the other nodes) with small dots. Note that each leaf in the tree has a one-to-one correspondence with an initial segment of the tree, so the stopping nodes can be unambiguously viewed either as leaves of the tree or initial segments of the sample sequences.
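As a concrete illustration (not from the text), the following minimal Python sketch simulates this stopping rule; the fair-coin probability, the cutoff max_len, and the sample size are illustrative assumptions.

import random

def stopping_time_plus_minus(rng, p=0.5, max_len=10**6):
    # return the trial at which the string (+1, -1) first occurs,
    # i.e., the length of the stopping node (initial segment)
    prev = None
    for n in range(1, max_len + 1):
        x = 1 if rng.random() < p else -1
        if prev == 1 and x == -1:
            return n
        prev = x
    return None                     # no stopping node within max_len trials

samples = [stopping_time_plus_minus(random.Random(seed)) for seed in range(10000)]
print(sum(samples) / len(samples))  # sample mean of N; near 4 for a fair sequence

The point of the sketch is only that the stopping rule is a property of the sample sequence alone; the probability measure enters only when asking whether N is defective or what its moments are.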
Note that in both of these examples, the stopping rule determines which initial segment of any given sample sequence satisfies the rule. The distribution of each Xn, and even whether or not the sequence is IID, is usually not relevant for defining these stopping rules. In other words, the conditions about statistical independence used in Chapter 3 for the indicator functions of stopping rules are quite unnatural for most applications. The essence of a stopping rule, however, is illustrated quite well in Figure 7.7. If one stops at some initial segment of a sample sequence, then one cannot stop again at some longer initial segment of the same sample sequence. This leads us to the following definitions of stopping nodes, stopping rules, and stopping times. Definition 7.2 (Stopping nodes). Given a sequence {Xn; n ≥ 1} of rv’s, a collection of stopping nodes is a collection of initial segments of the sample sequences of {Xn; n ≥ 1}. If an initial segment of one sequence is a stopping node, then it is a stopping node for all sequences with that same initial segment. Also, no stopping node can be an initial segment of any other stopping node. This definition is less abstract when each Xn is discrete with a finite number, say m, of possible values. In this case, as illustrated in Figure 7.7, the set of sequences is represented by a tree in which each node has one branch coming in from the root and m branches going out. Each stopping node corresponds to ‘pruning’ the tree at that node. All the sequences with that given initial segment can then be ignored since they all have that same initial
segment, i.e., stopping node. In this sense, every ‘pruning’ of the tree corresponds to a collection of stopping nodes. In information theory, such a collection of stopping nodes is called a prefix-free source code. Each segment corresponding to a stopping node is used as a codeword for some given message. If a sequence of consecutive segments is transmitted, a receiver can parse the incoming letters into segments by using the fact that no stopping node is an initial segment of any other stopping node. Definition 7.3 (Stopping rule and stopping time). A stopping rule for {Xn; n ≥ 1} is a rule that determines a collection of stopping nodes. A stopping time is a perhaps defective rv whose value, for a sample sequence with a stopping node, is the length of the initial segment for that node. Its value, for a sample sequence with no stopping node, is infinite. For most interesting stopping rules, sample sequences exist that have no stopping nodes. For the example of a random walk with two thresholds, there are many sequences that stay inside the thresholds forever. As shown by Lemma 7.1 however, this set of sequences has zero probability and thus the stopping time is a (non-defective) rv. We see from this that, although stopping rules are generally defined without the use of a probability measure, and the mapping from sample sequences to stopping nodes is similarly independent of the probability measure, the question of whether the stopping time is defective and whether it has moments is very dependent on the probability measure. Theorem 7.2 (Wald’s identity). Let {Xi; i ≥ 1} be IID and let γ(r) = ln{E[e^{rX}]} be the semi-invariant moment generating function of each Xi. Assume γ(r) is finite in an open interval (r−, r+) with r− < 0 < r+. For each n ≥ 1, let Sn = X1 + · · · + Xn. Let α > 0 and β < 0 be arbitrary real numbers, and let N be the smallest n for which either Sn ≥ α or Sn ≤ β. Then for all r ∈ (r−, r+),
E [exp(rSN − Nγ(r))] = 1.
(7.24)
We first show how to use and interpret this theorem, and then prove it. The proof is quite simple, but will mean more after understanding the surprising power of this result. Wald’s identity can be thought of as a generating function form of Wald’s equality as established in Theorem 3.3. First note that the trial N at which a threshold is crossed in the theorem is a stopping time in the terminology of Chapter 3. Also, if we take the derivative with respect to r of both sides of (7.24), we get
E [(SN − Nγ'(r)) exp{rSN − Nγ(r)}] = 0.
Setting r = 0 and recalling that γ(0) = 0 and γ'(0) = X, this becomes Wald’s equality,
E [SN ] = E [N ] X.
(7.25)
Note that this derivation of Wald’s equality is restricted to a random walk with two thresholds (and this automatically satisfies the constraint in Wald’s equality that E [N ] < ∞). The result in Chapter 3 was more general, applying to any stopping time such that E [N ] < ∞.
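As a quick numerical sanity check on (7.24) and (7.25), the following sketch (not part of the text) simulates a walk with Gaussian increments; the mean, variance, thresholds, and the value of r are illustrative assumptions, and γ(r) = rX + r²σ²/2 is the Gaussian semi-invariant MGF.

import math, random

mu, sigma, alpha, beta, r = -0.5, 1.0, 5.0, -5.0, 0.3   # illustrative choices
gamma_r = mu * r + 0.5 * (sigma * r) ** 2               # gamma(r) for Gaussian X

rng = random.Random(0)
ident, SNs, Ns = [], [], []
for _ in range(20000):
    S, n = 0.0, 0
    while beta < S < alpha:                             # run until a threshold is crossed
        S += rng.gauss(mu, sigma)
        n += 1
    ident.append(math.exp(r * S - n * gamma_r))
    SNs.append(S)
    Ns.append(n)

print(sum(ident) / len(ident))                          # ~ 1, as in (7.24)
print(sum(SNs) / len(SNs), mu * sum(Ns) / len(Ns))      # E[S_N] ~ X E[N], as in (7.25)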
The second derivative of (7.24) with respect to r is
E [[(SN − Nγ'(r))² − Nγ''(r)] exp{rSN − Nγ(r)}] = 0.
At r = 0, this is
E [SN² − 2N SN X + N²X²] = E [N ] σX².   (7.26)
This equation is often difficult to use because of the cross term between SN and N, but its main application comes in the case where X = 0. In this case, Wald’s equality provides no information about E [N ], but (7.26) simplifies to
E [SN²] = E [N ] σX².   (7.27)
Example 7.5.1 (Simple random walks again). As an example, consider the simple random walk of Section 7.1.1 with Pr{X=1} = Pr{X=−1} = 1/2, and assume that α > 0 and β < 0 are integers. Since Sn takes on only integer values and changes only by ±1, it takes on the value α or β before exceeding either of these values. Thus SN = α or SN = β. Let qα denote Pr{SN = α}. The expected value of SN is then αqα + β(1 − qα). From Wald’s equality, E [SN ] = 0, so
qα = −β/(α − β);   1 − qα = α/(α − β).   (7.28)
From (7.27),
E [N ] σX² = E [SN²] = α²qα + β²(1 − qα).   (7.29)
Using the value of qα from (7.28) and recognizing that σX² = 1,
E [N ] = −βα/σX² = −βα.   (7.30)
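The following minimal simulation sketch (not from the text; the thresholds and sample size are illustrative) checks (7.28) and (7.30) directly:

import random

alpha, beta, trials = 4, -7, 50000        # illustrative integer thresholds
rng = random.Random(1)
hits_alpha, total_N = 0, 0
for _ in range(trials):
    S, n = 0, 0
    while beta < S < alpha:               # fair +/-1 steps until a threshold is hit
        S += 1 if rng.random() < 0.5 else -1
        n += 1
    hits_alpha += (S >= alpha)
    total_N += n

print(hits_alpha / trials, -beta / (alpha - beta))   # q_alpha vs (7.28): 7/11
print(total_N / trials, -beta * alpha)               # E[N]   vs (7.30): 28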
As a sanity check, note that if α and β are each multiplied by some large constant k, then E [N ] increases by k². Since σ²Sn = n, we would expect Sn to fluctuate with increasing n with typical values growing as √n, and thus it is reasonable for the time to reach a threshold to increase with the square of the distance to the threshold. We also notice that if β is decreased toward −∞, while holding α constant, then qα → 1 and E [N ] → ∞, which helps explain the possibility of winning one coin with probability 1 in a coin tossing game, assuming we have an infinite capital to risk and an infinite time to wait. For more general random walks with X = 0, there is usually an overshoot when the threshold is crossed. If the magnitudes of α and β are large relative to the range of X, however, it is often reasonable to ignore the overshoots, and then −βα/σX² becomes a good approximation to E [N ]. If one wants to include the overshoot, then the effect of the overshoot must be taken into account both in (7.28) and (7.29). We next apply Wald’s identity to upper bound Pr{SN ≥ α} for the case where X < 0.
Corollary 7.1. Under the conditions of Theorem 7.2, assume that γ(r) has a root at r∗ > 0. Then Pr{SN ≥ α} ≤ exp(−r∗α).
(7.31)
Proof: Wald’s identity, with r = r∗ , reduces to E [exp(r∗ SN )] = 1. We can express this as Pr{SN ≥ α} E [exp(r∗ SN ) | SN ≥ α] + Pr{SN ≤ β} E [exp(r∗ SN ) | SN ≤ β] = 1.
(7.32)
Since the second term on the left is non-negative, Pr{SN ≥ α} E [exp(r∗ SN ) | SN ≥ α] ≤ 1.
(7.33)
Given that SN ≥ α, we see that exp(r∗ SN ) ≥ exp(r∗ α). Thus Pr{SN ≥ α} exp(r∗ α) ≤ 1.
(7.34)
which is equivalent to (7.31). This bound is valid for all β < 0, and thus is also valid in the limit β → −∞ (see Exercise 7.12 for a more careful demonstration that (7.31) is valid without a lower threshold). Equation (7.31) is also valid for the case of Figure 7.5, where γ(r) < 0 for all r ∈ (0, r+). The exponential bound in (7.22) shows that Pr{Sn ≥ α} ≤ exp(−r∗α) for each n; (7.31) is stronger than this. It shows that Pr{∪n {Sn ≥ α}} ≤ exp(−r∗α). This also holds in the limit β → −∞. When Corollary 7.1 is applied to the G/G/1 queue in Theorem 7.1, (7.31) is referred to as the Kingman Bound. Corollary 7.2 (Kingman Bound). Let {Xi; i ≥ 1} and {Yi; i ≥ 0} be the interarrival intervals and service times of a G/G/1 queue that is empty at time 0 when customer 0 arrives. Let {Ui = Yi−1 − Xi; i ≥ 1}, and let γ(r) = ln{E[e^{rU}]} be the semi-invariant moment generating function of each Ui. Assume that γ(r) has a root at r∗ > 0. Then Wn, the waiting time of the nth arrival, and W, the steady state waiting time, satisfy
Pr{Wn ≥ α} ≤ Pr{W ≥ α} ≤ exp(−r∗α);   for all α > 0.   (7.35)
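As an illustration (an assumption-laden sketch, not from the text), the Kingman bound can be evaluated numerically once γ(r) for U = Y − X is known. Here the interarrival and service times are taken to be exponential with rates lam and mu; for that special case the root works out to r∗ = mu − lam, which is used only as a cross-check on the bisection.

import math

lam, mu = 0.7, 1.0                  # assumed arrival and service rates, lam < mu
alpha = 5.0                         # waiting-time threshold

def gamma(r):                       # gamma(r) of U = Y - X for exponential X, Y
    return math.log(mu / (mu - r)) + math.log(lam / (lam + r))

# gamma(r) is negative just above 0 (since E[U] < 0) and grows without bound as
# r -> mu, so the positive root r* can be found by bisection.
lo, hi = 1e-9, mu - 1e-9
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if gamma(mid) < 0:
        lo = mid
    else:
        hi = mid
r_star = 0.5 * (lo + hi)

print(r_star, mu - lam)             # numeric root vs the closed form mu - lam
print(math.exp(-r_star * alpha))    # Kingman bound on Pr{W >= alpha}, as in (7.35)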
In most applications, a positive threshold crossing for a random walk with a negative drift corresponds to some exceptional, and usually undesirable, circumstance (for example an error in the hypothesis testing problem or an overflow in the G/G/1 queue). Thus an upper bound such as (7.31) provides an assurance of a certain level of performance and is often more useful than either an approximation or an exact expression that is very difficult to evaluate. For a random walk with X > 0, the exceptional circumstance is Pr{SN ≤ β}. This can be analyzed by changing the sign of X and β and using the results for a negative expected value. These exponential bounds do not work for X = 0, and we will not analyze that case here. Note that (7.31) is an upper bound because, first, the effect of the second threshold in (7.32) was set to 0, and, second, the overshoot in the threshold crossing at α was set to 0 in going
from (7.33) to (7.34). It is easy to account for the second threshold by recognizing that Pr{SN ≤ β} = 1 − Pr{SN ≥ α}. Then (7.32) can be solved, getting
Pr{SN ≥ α} = [1 − E [exp(r∗SN) | SN ≤ β]] / [E [exp(r∗SN) | SN ≥ α] − E [exp(r∗SN) | SN ≤ β]].   (7.36)
Accounting for the overshoots is much more difficult. For the case of the simple random walk, overshoots never occur since the random walk always changes in unit steps. Thus, for α and β integers, we have E [exp(r∗SN) | SN ≤ β] = exp(r∗β) and E [exp(r∗SN) | SN ≥ α] = exp(r∗α). Substituting this in (7.36) yields the exact solution
Pr{SN ≥ α} = exp(−r∗α)[1 − exp(r∗β)] / (1 − exp[−r∗(α − β)]).   (7.37)
Solving the equation γ(r∗) = 0 for the simple random walk with probabilities p and q yields r∗ = ln(q/p). This is also valid if X takes on the three values −1, 0, and +1 with p = Pr{X = 1}, q = Pr{X = −1}, and 1 − p − q = Pr{X = 0}. It can be seen that if α and −β are large positive integers, then the simple bound of (7.31) is almost exact for this example. Equation (7.37) is sometimes taken as an approximation for (7.36). Unfortunately, for many applications, the overshoots are more significant than the effect of the opposite threshold so that (7.37) is only negligibly better than (7.31) as an approximation, and has the disadvantage of not being a bound. If Pr{SN ≥ α} must actually be calculated, then the overshoots in (7.36) must be taken into account. See Chapter 12 of [9] for a treatment of overshoots.
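The comparison between (7.37) and (7.31) for the simple random walk is easy to see numerically; the step probabilities and thresholds below are illustrative assumptions.

import math

p, q = 0.4, 0.6                      # assumed step probabilities, p < q
r_star = math.log(q / p)             # root of gamma(r) = 0 for the simple walk
alpha, beta = 10, -10                # illustrative integer thresholds

exact = (math.exp(-r_star * alpha) * (1 - math.exp(r_star * beta))
         / (1 - math.exp(-r_star * (alpha - beta))))     # exact solution (7.37)
bound = math.exp(-r_star * alpha)                        # bound (7.31)
print(exact, bound)                  # nearly equal when alpha and -beta are large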
7.5.2
Joint distribution of N and barrier
Next we look at Pr{N ≥ n, SN ≥ α}, where again we assume that X < 0 and that γ(r∗) = 0 for some r∗ > 0. For any r in the region where γ(r) ≤ 0 (i.e., for 0 ≤ r ≤ r∗), we have −Nγ(r) ≥ −nγ(r) for N ≥ n. Thus, from the Wald identity, we have
1 ≥ E [exp[rSN − Nγ(r)] | N ≥ n, SN ≥ α] Pr{N ≥ n, SN ≥ α} ≥ exp[rα − nγ(r)] Pr{N ≥ n, SN ≥ α}
Pr{N ≥ n, SN ≥ α} ≤ exp[−rα + nγ(r)];   for all r such that 0 ≤ r ≤ r∗.   (7.38)
Under our assumption that X < 0, we have γ(r) ≤ 0 in the range 0 ≤ r ≤ r∗, and (7.38) is valid for all r in this range. To obtain the tightest bound of this form, we should minimize the right hand side of (7.38). This is the same minimization (except for the constraint r ≤ r∗) as in Figure 7.4, and the result, if α/n < γ'(r∗), is
Pr{N ≥ n, SN ≥ α} ≤ exp[−roα + nγ(ro)],   (7.39)
where ro satisfies γ'(ro) = α/n. This is the same as the bound on Pr{Sn ≥ α} in (7.20) except that r ≤ r∗ in (7.39). For the special case described in Figure 7.5 where γ(r) < 0 for all r < r+, (7.39) is modified in the same way as used in (7.23).
The bound in (7.39) is strictly tighter than the bound Pr{SN ≥ α} ≤ exp(−r∗α) if α/n < γ'(r∗). For a given value of α, define n∗ = α/γ'(r∗). We can then rewrite (7.39) as
Pr{N ≥ n, SN ≥ α} ≤ exp[nγ(ro) − roα] with α/n = γ'(ro),   for n > n∗;
Pr{N ≥ n, SN ≥ α} ≤ exp[−r∗α],   for n ≤ n∗.   (7.40)
The interpretation of (7.40) is that n∗ is an estimate of the typical value of N given that the threshold at α is crossed. For n greater than this typical value, (7.40) provides a tighter bound on Pr{N ≥ n, SN ≥ α} than the bound on Pr{SN ≥ α} in (7.31), whereas (7.40) provides nothing new for n ≤ n∗. In Section 7.7, we shall derive the slightly stronger result that Pr{supi≥n Si ≥ α} is also upper bounded by the right hand side of (7.40). We next develop an almost identical upper bound to Pr{N ≤ n, SN ≥ α} by using the Wald identity for r > r∗. Here γ(r) > 0, so −Nγ(r) ≥ −nγ(r) for N ≤ n. It follows that
1 ≥ E [exp[rSN − Nγ(r)] | N ≤ n, SN ≥ α] Pr{N ≤ n, SN ≥ α} ≥ exp[rα − nγ(r)] Pr{N ≤ n, SN ≥ α},
so that
Pr{N ≤ n, SN ≥ α} ≤ exp[−rα + nγ(r)].   (7.41)
Optimizing over r as before (except recognizing that r ≥ r∗), we get
Pr{N ≤ n, SN ≥ α} ≤ exp[nγ(ro) − roα] with α/n = γ'(ro),   for n < n∗;
Pr{N ≤ n, SN ≥ α} ≤ exp[−r∗α],   for n ≥ n∗.   (7.42)
This strengthens the interpretation of n∗ as the typical value of N conditional on crossing the threshold at α. That is, (7.42) provides information on the lower tail of the distribution of N (conditional on SN ≥ α), whereas (7.40) provides information on the upper tail.
7.5.3
Proof of Wald’s identity
The proof is given under the assumption that the rv’s Xi are discrete with a PMF p(X). Extending the proof to arbitrary rv’s requires some careful mathematical analysis but nothing conceptually different. Let T be the collection of stopping nodes for the given random walk with two thresholds. We refer to each stopping node as a pair, (n, x), where n is the length of the initial segment and x = (x1, . . . , xn) is the segment itself. The probability that this initial segment occurs is then p(x1)p(x2) · · · p(xn). From Lemma 7.1, stopping occurs with probability 1, so
Σ_{(n,x)∈T} Π_{i=1}^{n} p(xi) = 1.
Now consider the same set of sample sequences, but with a different IID probability measure. In particular, each rv Xi is replaced with a ‘tilted’ rv with the PMF q(X) = p(X) e^{rX−γ(r)}
for any fixed r, r− < r < r+. Note that q(X) sums to 1 over the sample values of X, so it is in fact a valid PMF. Note also that we can consider the same random walk as with the original rv’s, with only the probability measure changed. That is, the same set of stopping nodes correspond to threshold crossings as before. Applying Lemma 7.1 to this modified random walk, we get
Σ_{(n,x)∈T} Π_{i=1}^{n} q(xi) = 1.
Expressing this in terms of p(x),
Σ_{(n,x)∈T} Π_{i=1}^{n} p(xi) exp[rxi − γ(r)] = 1.
For any given (n, x) ∈ T, the product above can be written as
Π_{i=1}^{n} p(xi) exp[rxi − γ(r)] = Pr{(n, x)} exp[rs(n, x) − nγ(r)],
where s(n, x) = Σ_{i=1}^{n} xi. Thus
Σ_{(n,x)∈T} Pr{(n, x)} exp[rs(n, x) − nγ(r)] = 1.   (7.43)
Note that the stopping time N and the stopping value SN are functions of the stopping node, and a stopping node occurs with probability 1. Thus the expectation in Wald’s identity can be taken over the stopping nodes, and that expectation is given in (7.43), completing the proof. The proof of this theorem makes it clear that the theorem can be generalized easily in a number of directions. The only requirement is that the random walk must stop with probability 1, and the tilted version must also stop with probability 1. Thus, for example, the thresholds could vary as a function of the trial number, the thresholds could be replaced by the first time some given pattern occurs, etc. It is usually not possible, however, to replace two thresholds with one, since the most useful stopping nodes in that case do not occur with probability 1.
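The exponential tilting used in the proof is easy to visualize numerically. The sketch below (an illustrative assumption, not from the text) builds the tilted PMF q(x) = p(x) exp(rx − γ(r)) for a small three-valued X, and checks that q is a valid PMF; a standard fact about tilting (not derived in the text) is that the tilted mean is γ'(r), so a positive r shifts the drift upward.

import math

pmf = {-1: 0.6, 0: 0.1, 1: 0.3}      # assumed PMF with negative mean

def gamma(r):
    return math.log(sum(p * math.exp(r * x) for x, p in pmf.items()))

def tilted(r):                       # q(x) = p(x) exp(rx - gamma(r))
    return {x: p * math.exp(r * x - gamma(r)) for x, p in pmf.items()}

r = 0.8
q = tilted(r)
print(sum(q.values()))                                    # = 1: q is a valid PMF
print(sum(x * p for x, p in pmf.items()),                 # original mean (< 0)
      sum(x * p for x, p in q.items()))                   # tilted mean (larger)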
7.6
Martingales and submartingales
A martingale is defined as an integer-time stochastic process {Zn; n ≥ 1} with the properties that E [|Zn|] < ∞ for all n ≥ 1 and E [Zn | Zn−1, Zn−2, . . . , Z1] = Zn−1;
for all n ≥ 2.
(7.44)
The name martingale comes from gambling terminology where martingales refer to gambling strategies in which the amount to be bet is determined by the past history of winning or
losing. If one visualizes Zn as representing the gambler’s fortune at the end of the nth play, the definition above means, first, that the game is fair in the sense that the expected increase in fortune from play n − 1 to n is zero, and, second, that the expected fortune on the nth play depends on the past only through the fortune on play n − 1. There are two interpretations of (7.44); the first and most straightforward is to view it as shorthand for E [Zn | Zn−1 =zn−1 , Zn−2 =zn−2 , . . . , Z1 =z1 ] = zn−1 for all possible sample values z1 , z2 , . . . , zn−1 . The second is that E [Zn | Zn−1 =zn−1 , . . . , Z1 =z1 ] is a function of the sample values z1 , . . . , zn−1 and thus E [Zn | Zn−1 , . . . , Z1 ] is a random variable which is a function of the random variables Z1 , . . . , Zn−1 (and, for a martingale, a function only of Zn−1 ). The student is encouraged to take the first viewpoint initially and to write out the expanded type of expression in cases of confusion. It is important to understand the difference between martingales and Markov chains. For the Markov chain {Xn ; n ≥ 1}, each rv Xn is conditioned on the past only through Xn−1 , whereas for the martingale {Zn ; n ≥ 1}, it is only the expected value of Zn that is conditioned on the past only through Zn−1 . The rv Zn itself, conditioned on Zn−1 can be dependent on all the earlier Zi ’s. It is very surprising that so many results can be developed using such a weak form of conditioning. In what follows, we give a number of important examples of martingales, then develop some results about martingales, and then discuss those results in the context of the examples.
7.6.1
Simple examples of martingales
Example 7.6.1 (Zero-mean random walk). One example of a martingale is a zero-mean random walk, since if Zn = X1 + X2 + · · · + Xn, where the Xi are IID and zero mean, then
E [Zn | Zn−1, . . . , Z1] = E [Xn + Zn−1 | Zn−1, . . . , Z1]   (7.45)
= E [Xn] + Zn−1 = Zn−1.   (7.46)
Extending this example, suppose that {Xi; i ≥ 1} is an arbitrary sequence of IID random variables with mean X and let X̃i = Xi − X. Then {Sn; n ≥ 1} is a random walk with Sn = X1 + · · · + Xn and {Zn; n ≥ 1} is a martingale with Zn = X̃1 + · · · + X̃n. The random walk and the martingale are simply related by Zn = Sn − nX, and thus general results about martingales can easily be applied to random walks. Example 7.6.2 (Sums of dependent zero-mean variables). Let {Xi; i ≥ 1} be a set of dependent random variables satisfying E [Xi | Xi−1, . . . , X1] = 0. Then {Zn; n ≥ 1}, where Zn = X1 + · · · + Xn, is a zero mean martingale. To see this, note that E [Zn | Zn−1, . . . , Z1] = E [Xn + Zn−1 | Zn−1, . . . , Z1]
= E [Xn | Xn−1 , . . . , X1 ] + E [Zn−1 | Zn−1 , . . . , Z1 ] = Zn−1 .
This is a more general example than it appears, since given any martingale {Zn ; n ≥ 1}, we can define Xn = Zn − Zn−1 for n ≥ 2 and define X1 = Z1 . Then E [Xn | Xn−1 , . . . , X1 ] = 0 for n ≥ 2. If the martingale is zero mean (i.e., if E [Z1 ] = 0), then E [X1 ] = 0 also.
Example 7.6.3 (Product-form martingales). Another example is a product of unit mean IID random variables. Thus if Zn = X1 X2 . . . Xn , we have E [Zn | Zn−1 , . . . , Z1 ] = E [Xn Zn−1 | Zn−1 , . . . , Z1 ] = E [Xn ] Zn−1 = Zn−1 .
(7.47)
A particularly simple case of this product example is where Xn = 2 with probability 1/2 and Xn = 0 with probability 1/2. Then Pr{Zn = 2n } = 2−n ;
Pr{Zn = 0} = 1 − 2−n ;
E [Zn ] = 1.
(7.48)
Thus limn→∞ Zn = 0 with probability 1, but E [Zn] = 1 for all n and limn→∞ E [Zn] = 1. This is an important example to keep in mind when trying to understand why proofs about martingales are necessary and non-trivial. An important example of a product-form martingale is as follows: let {Xi; i ≥ 1} be an IID sequence, and let {Sn = X1 + · · · + Xn; n ≥ 1} be a random walk. Assume that the semi-invariant generating function γ(r) = ln{E [exp(rX)]} exists in some region of r around 0. For each n ≥ 1, let Zn be defined as
Zn = exp{rSn − nγ(r)}   (7.49)
= exp{rXn − γ(r)} exp{rSn−1 − (n − 1)γ(r)}
= exp{rXn − γ(r)} Zn−1.   (7.50)
Taking the conditional expectation of this,
E [Zn | Zn−1, . . . , Z1] = E [exp(rXn − γ(r))] E [Zn−1 | Zn−1, . . . , Z1] = Zn−1,   (7.51)
where we have used the fact that E [exp(rXn)] = exp(γ(r)). Thus we see that {Zn; n ≥ 1} is a martingale of the product form.
7.6.2
Markov modulated random walks
Frequently it is useful to generalize random walks to allow some dependence between the variables being summed. The particular form of dependence here is the same as the Markov reward processes of Section 4.5. The treatment in Section 4.5 discussed only expected rewards, whereas the treatment here focuses on the random variables themselves. Let {Ym; m ≥ 0} be a sequence of (possibly dependent) rv’s, and let {Sn; n ≥ 1}, where
Sn = Σ_{m=0}^{n−1} Ym,   (7.52)
be the process of successive sums of these random variables. Let {Xn; n ≥ 0} be a Markov chain, and assume that each Yn can depend on Xn and Xn+1. Conditional on Xn and Xn+1, however, Yn is independent of Yn−1, . . . , Y1, and of Xi for all i ≠ n. Assume that Yn, conditional on Xn and Xn+1, has a distribution function Fij(y) = Pr{Yn ≤ y | Xn = i, Xn+1 = j}.
Thus each rv Yn depends only on the associated transition in the Markov chain, and this dependence is the same for all n. The process {Sn; n ≥ 1} is called a Markov modulated random walk. It is also the sequence of epochs in a semi-Markov process. For each m, Ym is associated with the transition in the Markov chain from time m to m + 1, and Sn is the aggregate reward up to but not including time n. Let Yij denote E [Yn | Xn = i, Xn+1 = j] and Yi denote E [Yn | Xn = i]. Let {Pij} be the set of transition probabilities for the Markov chain, so Yi = Σ_j Pij Yij. We may think of the process {Yn; n ≥ 0} as evolving along with the Markov chain. The distributions of the variables Yn are associated with the transitions from Xn to Xn+1, but the Yn are otherwise independent random variables. In order to define a martingale related to the process {Sn; n ≥ 1}, we must subtract the mean reward from {Sn} and must also compensate for the effect of the state of the Markov chain. The appropriate compensation factor turns out to be the relative gain vector defined in Section 4.5. For simplicity, consider only finite-state irreducible Markov chains with M states. Let π = (π1, . . . , πM) be the steady state probability vector for the chain, let Y = (Y1, . . . , YM)^T be the vector of expected rewards, let g = πY be the steady state gain per unit time, and let w = (w1, . . . , wM)^T be the relative gain vector. From (4.42), w is the unique solution to
w + ge = Y + [P]w;   w1 = 0.   (7.53)
We assume a fixed starting state X0 = k. As we now show, the process {Zn; n ≥ 1} given by
Zn = Sn − ng + wXn − wk;   n ≥ 1   (7.54)
is a martingale. First condition on a given state, Xn−1 = i:
E [Zn | Zn−1, Zn−2, . . . , Z1, Xn−1 = i].   (7.55)
Since Sn = Sn−1 + Yn−1, we can express Zn as
Zn = Zn−1 + Yn−1 − g + wXn − wXn−1.   (7.56)
Since E [Yn−1 | Xn−1 = i] = Yi and E [wXn | Xn−1 = i] = Σ_j Pij wj, we have
E [Zn | Zn−1, Zn−2, . . . , Z1, Xn−1 = i] = Zn−1 + Yi − g + Σ_j Pij wj − wi.   (7.57)
From (7.53) the final four terms in (7.57) sum to 0, so E [Zn | Zn−1 , . . . , Z1 , Xn−1 = i] = Zn−1 .
(7.58)
Since this is valid for all choices of Xn−1, we have E [Zn | Zn−1, . . . , Z1] = Zn−1. Since the expected values of all the reward variables Yi exist, we see that E [|Yn|] < ∞, so that E [|Zn|] < ∞ also. This verifies that {Zn; n ≥ 1} is a martingale. It can be verified similarly that E [Z1] = 0, so E [Zn] = 0 for all n ≥ 1.
In showing that {Zn ; n ≥ 1} is a martingale, we actually showed something a little stronger. That is, (7.58) is conditioned on Xn−1 as well as Zn−1 , . . . , Z1 . In the same way, it follows that for all n > 1, E [Zn | Zn−1 , Xn−1 , Zn−2 , Xn−2 , . . . , Z1 , X1 ] = Zn−1 .
(7.59)
In terms of the gambling analogy, this says that {Zn ; n ≥ 1} is fair for each possible past sequence of states. A martingale {Zn ; n ≥ 1} with this property (i.e., satisfying (7.59)) is said to be a martingale relative to the joint process {Zn , Xn ; n ≥ 1}. We will use this martingale later to discuss threshold crossing problems for Markov modulated random walks. We shall see that the added property of being a martingale relative to {Zn , Xn } gives us added flexibility in defining stopping times. As an added bonus to this example, note that if {Xn ; n ≥ 0} is taken as the embedded chain of a Markov process (or semi-Markov process), and if Yn is taken as the time interval from transition n to n + 1, then Sn becomes the epoch of the nth transition in the process.
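The construction of the martingale (7.54) can be made concrete with a small numerical sketch (not from the text; the two-state chain, the deterministic rewards Ybar[i][j], and the sample sizes are illustrative assumptions). The relative gain vector w is computed from (7.53) with w1 = 0, and E [Zn] ≈ 0 is checked by simulation from the fixed starting state k = 0.

import random

P = [[0.9, 0.1], [0.2, 0.8]]                 # assumed transition probabilities
Ybar = [[1.0, -2.0], [0.0, 3.0]]             # assumed reward on each transition

pi0 = P[1][0] / (P[0][1] + P[1][0])          # stationary distribution of a 2-state chain
pi = [pi0, 1 - pi0]
Yb = [sum(P[i][j] * Ybar[i][j] for j in range(2)) for i in range(2)]   # Ybar_i
g = pi[0] * Yb[0] + pi[1] * Yb[1]            # steady-state gain per transition
w = [0.0, (g - Yb[0]) / P[0][1]]             # relative gain vector from (7.53), w1 = 0

def sample_Zn(rng, n, k=0):
    x, S = k, 0.0
    for _ in range(n):
        nxt = 0 if rng.random() < P[x][0] else 1
        S += Ybar[x][nxt]                    # Yn is deterministic given the transition
        x = nxt
    return S - n * g + w[x] - w[k]           # Zn from (7.54)

rng = random.Random(2)
zs = [sample_Zn(rng, 20) for _ in range(40000)]
print(sum(zs) / len(zs))                     # should be close to 0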
7.6.3
Generating functions for Markov random walks
Consider the same Markov chain and reward variables as in the previous example, and assume that for each pair of states, i, j, the moment generating function
gij(r) = E [exp(rYn) | Xn = i, Xn+1 = j]   (7.60)
exists over some open interval (r−, r+) containing 0. Let [Γ(r)] be the matrix with terms Pij gij(r). Since [Γ(r)] is an irreducible non-negative matrix, Theorem 4.6 shows that [Γ(r)] has a largest real eigenvalue, ρ(r) > 0, and an associated positive right eigenvector, ν(r) = (ν1(r), . . . , νM(r))^T, that is unique within a scale factor. We now show that the process {Mn(r); n ≥ 1} defined by
Mn(r) = exp(rSn) νXn(r) / (ρ(r)^n νk(r))   (7.61)
is a product-type martingale for each r ∈ (r−, r+). Since Sn = Sn−1 + Yn−1, we can express Mn(r) as
Mn(r) = Mn−1(r) exp(rYn−1) νXn(r) / (ρ(r) νXn−1(r)).   (7.62)
The expected value of the ratio in (7.62), conditional on Xn−1 = i, is
E [exp(rYn−1) νXn(r) / (ρ(r) νi(r)) | Xn−1 = i] = Σ_j Pij gij(r) νj(r) / (ρ(r) νi(r)) = 1,   (7.63)
where, in the last step, we have used the fact that ν(r) is an eigenvector of [Γ(r)]. Thus, E [Mn(r) | Mn−1(r), . . . , M1(r), Xn−1 = i] = Mn−1(r). Since this is true for all choices of i, the condition on Xn−1 = i can be removed and {Mn(r); n ≥ 1} is a martingale. Also, for n > 1,
E [Mn(r) | Mn−1(r), Xn−1, . . . , M1(r), X1] = Mn−1(r),   (7.64)
so that {Mn(r); n ≥ 1} is also a martingale relative to the joint process {Mn(r), Xn; n ≥ 1}. It can be verified by the same argument as in (7.63) that E [M1(r)] = 1. It then follows that E [Mn(r)] = 1 for all n ≥ 1. One of the uses of this martingale is to provide exponential upper bounds, similar to (7.16), to the probabilities of threshold crossings for Markov modulated random walks. Define
M̃n(r) = exp(rSn) minj(νj(r)) / (ρ(r)^n νk(r)).   (7.65)
Then M̃n(r) ≤ Mn(r), so E [M̃n(r)] ≤ 1. For any µ > 0, the Markov inequality can be applied to M̃n(r) to get
Pr{M̃n(r) ≥ µ} ≤ (1/µ) E [M̃n(r)] ≤ 1/µ.   (7.66)
For any given α and any given r, 0 < r < r+, we can choose µ = exp(rα) ρ(r)^{−n} minj(νj(r)) / νk(r). For r > 0, the event {Sn ≥ α} implies {M̃n(r) ≥ µ}, so combining (7.65) and (7.66),
Pr{Sn ≥ α} ≤ ρ(r)^n exp(−rα) νk(r) / minj(νj(r)).   (7.67)
This can be optimized over r to get the tightest bound in the same way as (7.16).
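The quantities ρ(r) and ν(r) needed for (7.67) are easy to compute for a small chain. The sketch below (illustrative assumptions only, not from the text) reuses the two-state chain with deterministic rewards from the earlier sketch, so gij(r) = exp(r·Ybar[i][j]), and finds the Perron eigenvalue and eigenvector of [Γ(r)] by power iteration; one could then scan r to tighten the bound.

import math

P = [[0.9, 0.1], [0.2, 0.8]]
Ybar = [[1.0, -2.0], [0.0, 3.0]]

def perron(r, iters=500):
    # power iteration on the positive matrix [Gamma(r)] with entries P_ij g_ij(r)
    G = [[P[i][j] * math.exp(r * Ybar[i][j]) for j in range(2)] for i in range(2)]
    nu = [1.0, 1.0]
    for _ in range(iters):
        nxt = [G[i][0] * nu[0] + G[i][1] * nu[1] for i in range(2)]
        rho = max(nxt)
        nu = [x / rho for x in nxt]
    return rho, nu

r, alpha, n, k = 0.2, 40.0, 25, 0         # illustrative parameters
rho, nu = perron(r)
bound = rho ** n * math.exp(-r * alpha) * nu[k] / min(nu)
print(rho, nu, bound)                     # bound on Pr{Sn >= alpha}, per (7.67)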
7.6.4
Scaled branching processes
A final example of a martingale is a “scaled down” version of a branching process {Xn; n ≥ 0}. Recall from Section 5.2 that, for each n, Xn is defined as the aggregate number of elements in generation n. Each element i of generation n, 1 ≤ i ≤ Xn, has a number Yi,n of offspring which collectively constitute generation n + 1, i.e., Xn+1 = Σ_{i=1}^{Xn} Yi,n. The rv’s Yi,n are IID over both i and n. Let Y = E [Yi,n] be the mean number of offspring of each element of the population. Then E [Xn | Xn−1] = Y Xn−1, which resembles a martingale except for the factor of Y. We can convert this branching process into a martingale by scaling it, however. That is, define Zn = Xn / Y^n. It follows that
E [Zn | Zn−1, . . . , Z1] = E [Xn / Y^n | Xn−1, . . . , X1] = Y Xn−1 / Y^n = Zn−1.   (7.68)
Thus {Zn; n ≥ 1} is a martingale. We will see the surprising result later that this implies that Zn converges with probability 1 to a limiting rv as n → ∞.
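A single simulated sample path makes the scaled martingale Zn = Xn/Y^n tangible. The sketch below (not from the text) assumes, purely for illustration, a Poisson offspring distribution with mean 1.3; the printed values of Zn change substantially in early generations and very little later, previewing the convergence result.

import random, math

def poisson(rng, lam):
    # count arrivals of a rate-lam Poisson process in [0,1] via exponential gaps
    k, t = 0, 0.0
    while True:
        t += -math.log(1.0 - rng.random()) / lam
        if t > 1.0:
            return k
        k += 1

Ybar = 1.3                          # assumed mean number of offspring
rng = random.Random(3)
X = 1
for n in range(1, 21):
    X = sum(poisson(rng, Ybar) for _ in range(X))
    print(n, X / Ybar ** n)         # Zn along one sample path
    if X == 0:                      # the line has died out; Zn stays at 0
        break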
7.6.5
Partial isolation of past and future in martingales
Recall that for a Markov chain, the states at all times greater than a given n are independent of the states at all times less than n conditional on the state at time n. The following lemma shows that at least a small part of this independence of past and future applies to martingales.
Lemma 7.2. Let {Zn ; n ≥ 1} be a martingale. Then for any n > i ≥ 1, E [Zn | Zi , Zi−1 , . . . , Z1 ] = Zi .
(7.69)
Proof: For n = i + 1, E [Zi+1 | Zi , . . . , Z1 ] = Zi by the definition of a martingale. Now consider n = i + 2. Then E [Zi+2 | Zi+1 , . . . , Z1 ] is a rv equal to Zi+1 . We can view E [Zi+2 | Zi , . . . , Z1 ] as the conditional expectation of the rv E [Zi+2 | Zi+1 , . . . , Z1 ] over Zi+1 conditional on Zi , . . . , Z1 . E [Zi+2 |Zi , . . . , Z1 ] = E [E [Zi+2 | Zi+1 , Zi , . . . , Z1 ] | Zi , . . . , Z1 ] = E [Zi+1 | Zi , . . . , Z1 ] = Zi .
(7.70)
For n = i+3, (7.70), with i incremented, shows us that the rv E [Zi+3 | Zi+1 , . . . , Z1 ] = Zi+1 . Taking the conditional expectation of this rv over Zi+1 conditional on Zi , . . . , Z1 , we get E [Zi+3 | Zi , . . . , Z1 ] = Zi . This argument can be applied successively to any n > i. The same argument can also be used (see Exercise 7.18) to show that E [Zn ] = E [Z1 ]
for all n > 1.   (7.71)
7.6.6
Submartingales and supermartingales
Submartingales and supermartingales are simple generalizations of martingales that provide many useful results for very little additional work. We will subsequently derive the Kolmogorov submartingale inequality, which is a powerful generalization of the Markov inequality. We use this both to give a simple proof of the strong law of large numbers and also to better understand threshold crossing problems for random walks. Definition 7.4. A submartingale is an integer time stochastic process {Zn; n ≥ 1} that satisfies the relations
E [|Zn|] < ∞;   E [Zn | Zn−1, Zn−2, . . . , Z1] ≥ Zn−1;   n ≥ 1.   (7.72)
A supermartingale is an integer time stochastic process {Zn; n ≥ 1} that satisfies the relations
E [|Zn|] < ∞;   E [Zn | Zn−1, Zn−2, . . . , Z1] ≤ Zn−1;   n ≥ 1.   (7.73)
In terms of our gambling analogy, a submartingale corresponds to a game that is at least fair, i.e., where the expected fortune of the gambler either increases or remains the same. A supermartingale is a process with the opposite type of inequality.4

4 The easiest way to remember the difference between a submartingale and a supermartingale is to remember that it is contrary to what common sense would dictate. That is, a submartingale is bigger than a supermartingale. Why this terminology became standard is a mystery.
Since a martingale satisfies both (7.72) and (7.73) with equality, a martingale is both a submartingale and a supermartingale. Note that if {Zn; n ≥ 1} is a submartingale, then {−Zn; n ≥ 1} is a supermartingale, and conversely. Thus, some of the results to follow are stated only for submartingales, with the understanding that they can be applied to supermartingales by changing signs as above. Lemma 7.2, with the equality replaced by inequality, also applies to submartingales and supermartingales. That is, if {Zn; n ≥ 1} is a submartingale, then
E [Zn | Zi, Zi−1, . . . , Z1] ≥ Zi;   1 ≤ i < n,   (7.74)
and if {Zn; n ≥ 1} is a supermartingale, then
E [Zn | Zi, Zi−1, . . . , Z1] ≤ Zi;   1 ≤ i < n.   (7.75)
Equations (7.74) and (7.75) are verified in the same way as Lemma 7.2 (see Exercise 7.20). Similarly, the appropriate generalization of (7.71) is that if {Zn; n ≥ 1} is a submartingale, then
E [Zn] ≥ E [Zi];   for all i, 1 ≤ i < n,   (7.76)
and if {Zn; n ≥ 1} is a supermartingale, then
E [Zn] ≤ E [Zi];   for all i, 1 ≤ i < n.   (7.77)
A random walk {Sn; n ≥ 1} with Sn = X1 + · · · + Xn is a submartingale, martingale, or supermartingale respectively for X ≥ 0, X = 0, or X ≤ 0. Also, if X has a semi-invariant moment generating function γ(r) for some given r, and if Zn is defined as Zn = exp(rSn), then the process {Zn; n ≥ 1} is a submartingale, martingale, or supermartingale respectively for γ(r) ≥ 0, γ(r) = 0, or γ(r) ≤ 0. The next example gives an important way in which martingales and submartingales are related. Example 7.6.4 (Convex functions of martingales). Figure 7.8 illustrates the graph of a convex function h of a real variable x. A function h of a real variable is defined to be convex if, for each point x1, there is a real number c with the property that h(x1) + c(x − x1) ≤ h(x) for all x. Geometrically, this says that every tangent to h(x) lies on or below h(x). If h(x) has a derivative at x1, then c is the value of that derivative and h(x1) + c(x − x1) is the tangent line at x1. If h(x) has a discontinuous slope at x1, then there might be many choices for c; for example, h(x) = |x| is convex, and for x1 = 0, one could choose any c in the range −1 to +1. A simple condition that implies convexity is a nonnegative second derivative everywhere. This is not necessary, however, and functions can be convex even when the first derivative does not exist everywhere. For example, the function γ(r) in Figure 7.5 is convex even though it blows up at r = r+. Jensen’s inequality states that if h is convex and X is a random variable with an expectation, then h(E [X]) ≤ E [h(X)]. To prove this, let x1 = E [X] and choose c so that h(x1) + c(x −
Figure 7.8: Convex functions: For each x1, there is a value of c such that, for all x, h(x1) + c(x − x1) ≤ h(x). If h is differentiable at x1, then c is the derivative of h at x1.
x1) ≤ h(x). Using the random variable X in place of x and taking expected values of both sides, we get Jensen’s inequality. Note that for any particular event A, this same argument applies to X conditional on A, so that h(E [X | A]) ≤ E [h(X) | A]. Jensen’s inequality is very widely used; it is a minor miracle that we have not required it previously. Theorem 7.3. If h is a convex function of a real variable, {Zn; n ≥ 1} is a martingale, and E [|h(Zn)|] < ∞ for all n, then {h(Zn); n ≥ 1} is a submartingale. Proof: For any choice of z1, . . . , zn−1, we can use Jensen’s inequality with the conditioning probabilities to get
E [h(Zn) | Zn−1=zn−1, . . . , Z1=z1] ≥ h(E [Zn | Zn−1=zn−1, . . . , Z1=z1]) = h(zn−1).   (7.78)
For any choice of numbers h1, . . . , hn−1 in the range of the function h, let z1, . . . , zn−1 be arbitrary numbers satisfying h(z1)=h1, . . . , h(zn−1)=hn−1. For each such choice, (7.78) holds, so that
E [h(Zn) | h(Zn−1)=hn−1, . . . , h(Z1)=h1] ≥ h(E [Zn | h(Zn−1)=hn−1, . . . , h(Z1)=h1]) = h(zn−1) = hn−1,
(7.79)
completing the proof. Some examples of this result, applied to a martingale {Zn; n ≥ 1}, are as follows:
{|Zn|; n ≥ 1} is a submartingale   (7.80)
{Zn²; n ≥ 1} is a submartingale if E [Zn²] < ∞   (7.81)
{exp(rZn); n ≥ 1} is a submartingale for r such that E [exp(rZn)] < ∞.   (7.82)
A function of a real variable h(x) is defined to be concave if −h(x) is convex. It then follows from Theorem 7.3 that if h is concave and {Zn; n ≥ 1} is a martingale, then {h(Zn); n ≥ 1} is a supermartingale (assuming that E [|h(Zn)|] < ∞). For example, if {Zn; n ≥ 1} is a positive martingale and E [|ln(Zn)|] < ∞, then {ln(Zn); n ≥ 1} is a supermartingale.
7.7
Stopped processes and stopping times
The discussion of stopping times in Section 7.5.1 applies to arbitrary integer time processes {Zn ; n ≥ 1} as well as to IID sequences. Recall that a collection T of stopping nodes for a
sequence {Zn; n ≥ 1} of rv’s is a collection of initial segments such that no initial segment in T is an initial segment of any other segment in T. For any sample sequence which has an initial segment in T, the stopping time for that sample sequence is the length of that initial segment. If the set of sample sequences containing an initial segment in T has probability 1, then the stopping time N is a random variable, and otherwise it is a defective random variable. For some of the results to follow, it is unimportant whether N is a random variable or a defective random variable (i.e., whether or not the process stops with probability 1). If it is not specified whether N is a random variable or a defective random variable, we refer to the stopping time as a possibly defective stopping time; we consider N to take on the value ∞ if the process does not stop. Given a possibly defective stopping time N for a process {Zn; n ≥ 1}, the corresponding stopped process is defined as the process {Zn∗; n ≥ 1} in which Zn∗ = Zn for n ≤ N and Zn∗ = ZN for n > N. As an example, suppose Zn models the fortune of a gambler at the completion of the nth trial of some game, and suppose the gambler then modifies the game by deciding to stop gambling under some given circumstances (i.e., at the stopping time). Thus, after stopping, the fortune remains constant, so the stopped process models the gambler’s fortune in time, including the effect of the stopping time. As another example, consider a random walk with a positive and negative threshold, and consider the process to stop after reaching or crossing a threshold. The stopped process then stays at that point beyond the threshold as an artifice to simplify analysis. The use of stopped processes is similar to the artifice that we employed in Section 4.5 for first passage times in Markov chains; recall that we added an artificial trapping state after the desired passage to simplify analysis. We next show that the possibly defective stopped process of a martingale is itself a martingale; the intuitive reason is that, before stopping, the stopped process is the same as the martingale, and, after stopping, Zn∗ = Z∗n−1. The following theorem establishes this and the corresponding results for submartingales and supermartingales. Theorem 7.4. Given a stochastic process {Zn; n ≥ 1} and a possibly defective stopping time N for the process, the stopped process {Zn∗; n ≥ 1} is a submartingale if {Zn; n ≥ 1} is a submartingale, is a martingale if {Zn; n ≥ 1} is a martingale, and is a supermartingale if {Zn; n ≥ 1} is a supermartingale. Proof: First we show that, for all three cases, the stopped process satisfies E [|Zn∗|] < ∞ for any given n ≥ 1. Conditional on N = i for some i < n, we have Zn∗ = Zi, so E [|Zn∗| | N = i] = E [|Zi| | N = i] < ∞ for each i < n such that Pr{N = i} > 0. The reason for this is that if E [|Zi| | N = i] = ∞ and Pr{N = i} > 0, then E [|Zi|] = ∞, contrary to the assumption that {Zn; n ≥ 1} is a martingale, submartingale, or supermartingale. Similarly, for N ≥ n, we have Zn∗ = Zn so
E [|Zn∗| | N ≥ n] = E [|Zn| | N ≥ n] < ∞
if Pr{N ≥ n} > 0.
Averaging,
E [|Zn∗|] = Σ_{i=1}^{n−1} E [|Zn∗| | N=i] Pr{N=i} + E [|Zn∗| | N ≥ n] Pr{N ≥ n} < ∞.
Next assume that {Zn; n ≥ 1} is a submartingale. For any given n > 1, consider an arbitrary initial sample sequence Z1 = z1, Z2 = z2, . . . , Zn−1 = zn−1. First, suppose that for some i ≤ n − 1, (i, z1, . . . , zi) ∈ T where T is the set of stopping nodes. Then for the stopped process under these assumptions, z∗n−1 = · · · = z∗i = zi and z∗j = zj for 1 ≤ j < i. Also, Zn∗ = z∗n−1 = zi. Thus
E [Zn∗ | Z∗n−1=z∗n−1, . . . , Z1∗=z1∗] = z∗n−1.   (7.83)
Next, contrary to the above assumption, assume there is no i ≤ n − 1 for which (i, z1, . . . , zi) ∈ T. Given this assumption, z∗j = zj for 1 ≤ j ≤ n − 1. Conditional on this, Zn∗ = Zn. Using the fact that {Zn; n ≥ 1} is a submartingale,
E [Zn∗ | Z∗n−1=z∗n−1, . . . , Z1∗=z1∗] ≥ z∗n−1.   (7.84)
The same argument works for martingales and supermartingales by replacing the inequality in (7.84) by equality for the martingale case and the opposite inequality for the supermartingale case.
Theorem 7.5. Given a stochastic process {Zn; n ≥ 1} and a possibly defective stopping time N for the process, the stopped process {Zn∗; n ≥ 1} satisfies the following conditions for all n ≥ 1 if {Zn; n ≥ 1} is a submartingale, martingale, or supermartingale respectively:
E [Z1] ≤ E [Zn∗] ≤ E [Zn]   (submartingale)   (7.85)
E [Z1] = E [Zn∗] = E [Zn]   (martingale)   (7.86)
E [Z1] ≥ E [Zn∗] ≥ E [Zn]   (supermartingale).   (7.87)
Proof: Since a process cannot stop before epoch 1, Z1 = Z1∗ in all cases. First consider the case in which {Zn ; n ≥ 1} is a submartingale. Theorem 7.4 shows that {Zn∗ ; n ≥ 1} is a submartingale, and from (7.76), E [Z1 ] ≤ E [Zn∗ ] for all n ≥ 1. This establishes the first half of (7.85) and we next prove the second half. First condition on the set of sequences for which some given initial segment (i, z1 , . . . , zi ) ∈ T for some i < n. Then E [Zn∗ ] = zi . From (7.74), E [Zn ] ≥ zi , proving (7.85) for this case. For those sequences not having such an initial segment, Zn∗ = Zn , establishing (7.85) in that case. Averaging over these two cases gives (7.85) in general. Finally, if {Zn ; n ≥ 1} is a supermartingale, then {−Zn ; n ≥ 1} is a submartingale, verifying (7.87). Since a martingale is both a submartingale and supermartingale, (7.86) follows and the proof is complete. Consider a (non-defective) stopping time N for a martingale {Zn ; n ≥ 1}. Since the stopped process is also a martingale, we have E [Zn∗ ] = E [Z1∗ ] = E [Z1 ] ; n ≥ 1.
(7.88)
Since Zn∗ = ZN for all n ≥ N and since N is finite with probability 1, we see that limn→∞ Zn∗ = ZN with probability 1. Unfortunately, in general, E [ZN] is unequal to limn→∞ E [Zn∗] = E [Z1]. An example in which this occurs is the binary product martingale in (7.48). Taking the stopping time N to be the smallest n for which Zn = 0, we have ZN = 0 with probability 1, and thus E [ZN] = 0. But Zn∗ = Zn for all n, and E [Zn∗] = 1 for all n. The problem here is that, given that the process has not stopped by time n, Zn and Zn∗ each have the value 2^n. Fortunately, in most situations, this type of bizarre behavior does not occur and E [ZN] = E [Z1]. To get a better understanding of when E [ZN] = E [Z1], note that for any n, we have
E [Zn∗] = Σ_{i=1}^{n} E [Zn∗ | N = i] Pr{N = i} + E [Zn∗ | N > n] Pr{N > n}   (7.89)
= Σ_{i=1}^{n} E [ZN | N = i] Pr{N = i} + E [Zn | N > n] Pr{N > n}.   (7.90)
The left side of this equation is E [Z1] for all n. If the final term on the right converges to 0 as n → ∞, then the sum must converge to E [Z1]. If E [|ZN|] < ∞, then the sum also converges to E [ZN]. Without the condition E [|ZN|] < ∞, the sum might consist of alternating terms which converge, but whose absolute values do not converge, in which case E [ZN] does not exist (see Exercise 7.23 for an example). Thus we have established the following theorem. Theorem 7.6. Let N be a stopping time for a martingale {Zn; n ≥ 1}. Then E [ZN] = E [Z1] if and only if
lim_{n→∞} E [Zn | N > n] Pr{N > n} = 0   and   E [|ZN|] < ∞.   (7.91)
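The binary product martingale discussed above makes the need for (7.91) vivid, and the failure is easy to reproduce numerically (a sketch with illustrative sample sizes, not from the text): here E [Zn | N > n] Pr{N > n} = 2^n · 2^−n = 1 for every n, so the condition fails.

import random

rng = random.Random(4)
stopped_values, stopping_times = [], []
for _ in range(10000):
    Z, n = 1.0, 0
    while Z > 0:
        n += 1
        Z = Z * (2 if rng.random() < 0.5 else 0)   # the martingale of (7.48)
    stopped_values.append(Z)                       # always 0 at the stopping time
    stopping_times.append(n)

print(sum(stopped_values) / len(stopped_values))   # E[Z_N] = 0, not E[Z1] = 1
print(sum(stopping_times) / len(stopping_times))   # E[N] = 2 (geometric stopping)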
Example 7.7.1 (Random walks with thresholds). Recall the generating function product martingale of (7.49) in which {Zn = exp[rSn − nγ(r)]; n ≥ 1} is a martingale defined in terms of the random walk {Sn = X1 + · · · + Xn; n ≥ 1}. From (7.86), we have E [Zn] = E [Z1], and since E [Z1] = E [exp{rX1 − γ(r)}] = 1, we have E [Zn] = 1 for all n. Also, for any possibly defective stopping time N, we have E [Zn∗] = E [Z1] = 1. If N is a non-defective stopping time, and if (7.91) holds, then
E [ZN] = E [exp{rSN − Nγ(r)}] = 1.
(7.92)
If there are two thresholds, one at α > 0, and the other at β < 0, and the stopping rule is to stop when either threshold is crossed, then (7.92) is just the Wald identity, (7.24). The nice part about the approach here is that it also applies naturally to other stopping rules. For example, for some given integer n, let Nn+ be the smallest integer i ≥ n for which Si ≥ α or Si ≤ β. Then, in the limit β → −∞, Pr{SNn+ ≥ α} = Pr{∪_{i=n}^{∞} (Si ≥ α)}. Assuming X < 0, we can find an upper bound to Pr{SNn+ ≥ α} for any r > 0 and γ(r) ≤ 0 (i.e., for 0 < r ≤ r∗) by the following steps
1 = E [exp{rSNn+ − Nn+ γ(r)}] ≥ Pr{SNn+ ≥ α} exp[rα − nγ(r)]
Pr{SNn+ ≥ α} ≤ exp[−rα + nγ(r)];   0 ≤ r ≤ r∗.   (7.93)
This is almost the same result as (7.38), except that it is slightly stronger; (7.38) bounded the probability that the first threshold crossing crossed α at some epoch i ≥ n, whereas this includes the possibility that Sm ≥ α and Si ≥ α for some m < n ≤ i.
7.7.1
Stopping times for martingales relative to a process
In Section 7.6.2, we defined a martingale {Zn; n ≥ 1} relative to a joint process {Zn, Xn; n ≥ 1} as a martingale for which (7.59) is satisfied, i.e., E [Zn | Zn−1, Xn−1, . . . , Z1, X1] = Zn−1. In the same way, we can define a submartingale or supermartingale {Zn; n ≥ 1} relative to a joint process {Zn, Xn; n ≥ 1} as a submartingale or supermartingale satisfying (7.59) with the = sign replaced by ≥ or ≤ respectively. The purpose of this added complication is to make it easier to define useful stopping rules. As before, a stopping rule is a rule that determines a collection of stopping nodes. A stopping node here is an initial segment (i, x1, z1, . . . , xi, zi) of a joint sample sequence of {Zn, Xn; n ≥ 1}. If an initial segment is a stopping node for one joint sample sequence, it is a stopping node for all joint sample sequences with that initial segment, and no shorter initial segment is a stopping node. The stopping time for joint sequences with that initial segment is the length i of that segment. Theorems 7.4, 7.5, 7.6 all carry over to martingales (submartingales or supermartingales) relative to a joint process. These theorems are stated more precisely in Exercises 7.24 to 7.27. To summarize them here, assume that {Zn; n ≥ 1} is a martingale (submartingale or supermartingale) relative to a joint process {Zn, Xn; n ≥ 1} and assume that N is a stopping time for {Zn; n ≥ 1} relative to {Zn, Xn; n ≥ 1}. Then the stopped process is a martingale (submartingale or supermartingale) respectively, (7.85-7.87) are satisfied, and, for a martingale, E [ZN] = E [Z1] is satisfied iff (7.91) is satisfied. Example 7.7.2 (Markov modulated random walks with thresholds). In Sections 7.6.2 and 7.6.3, we developed two martingales for Markov modulated random walks, both conditioned on a fixed initial state X0 = k. The first, given in (7.54), is {Zn = Sn − ng + wXn − wk; n ≥ 1}. Recall that E [Zn] = 0 for all n ≥ 1 for this martingale. Given two thresholds, α > 0 and β < 0, define N as the smallest n for which Sn ≥ α or Sn ≤ β. The indicator function In of {N ≥ n} is 1 iff β < Si < α for 1 ≤ i ≤ n − 1. Since Si = Zi + ig − wXi + wk, Si is a function of Zi and Xi, so the stopping nodes are functions of both Zi and Xi for 1 ≤ i ≤ n − 1. It follows that N is a stopping time for {Zn; n ≥ 1} relative to {Zn, Xn; n ≥ 1}. From Theorem 7.6, we can assert that E [ZN] = E [Z1] = 0 if (7.91) is satisfied, i.e., if limn→∞ E [Zn | N > n] Pr{N > n} = 0 is satisfied. Using the same argument as in Lemma 7.1, we can see that Pr{N > n} goes to 0 at least geometrically in n. Conditional on N > n, β < Sn < α, so Sn is bounded independent of n. Also wXn is bounded, since the chain is finite state, and ng is linear in n. Thus E [Zn | N > n] varies at most linearly with n, so (7.91) is satisfied, and
0 = E [ZN] = E [SN] − E [N] g + E [wXN] − wk.
(7.94)
Recall that Wald’s equality for random walks, (7.25), is E [SN] = E [N] X, with the mean X playing the role that g plays here. For Markov modulated random walks, this is modified, as shown in (7.94), by the relative gain vector terms.
The same arguments can be applied to the generating function martingale of (7.61). Again, let N be the smallest n for which Sn ≥ α or Sn ≤ β. As before, Si is a function of Mi(r) and Xi, so In is a function of Mi(r) and Xi for 1 ≤ i ≤ n − 1. It follows that N is a stopping time for {Mn(r); n ≥ 1} relative to {Mn(r), Xn; n ≥ 1}. Next we need the following lemma: Lemma 7.3. For the martingale {Mn(r); n ≥ 1} relative to {Mn(r), Xn; n ≥ 1} defined in (7.61), where {Xn; n ≥ 0} is a finite-state Markov chain, and for the above stopping time N,
lim_{n→∞} E [Mn(r) | N > n] Pr{N > n} = 0.   (7.95)
Proof: From Lemma 4, slightly modified for the case here, there is a δ > 0 such that for all states i, j, and all n > 1 such that Pr{N = n, Xn−1 = i, Xn = j} > 0,
E [exp(rSn) | N = n, Xn−1 = i, Xn = j] ≥ δ.
(7.96)
Since the stopped process, {Mn∗(r); n ≥ 1}, is a martingale, we have for each m,
1 = E [Mm∗(r)] ≥ Σ_{n=1}^{m} E [exp(rSn) νXn(r) | N = n] Pr{N = n} / (ρ(r)^n νk(r)).   (7.97)
From (7.96), we see that there is some δ' > 0 such that E [exp(rSn) νXn(r)/νk(r) | N = n] ≥ δ' for all n such that Pr{N = n} > 0. Thus (7.97) is bounded by
1 ≥ δ' Σ_{n≤m} ρ(r)^{−n} Pr{N = n}.
Since this is valid for all m, it follows by the argument in the proof of Theorem 7.2 that limn→∞ ρ(r)^{−n} Pr{N > n} = 0. This, along with (7.96), establishes (7.95), completing the proof. From Theorem 7.6, we have the desired result:
E [MN(r)] = E [exp(rSN) νXN(r) / ([ρ(r)]^N νk(r))] = 1;   r− < r < r+.   (7.98)
This is the extension of the Wald identity to Markov modulated random walks, and is used in the same way as the Wald identity. As shown in Exercise 7.29, the derivative of (7.98), evaluated at r = 0, is the same as (7.94).
7.8
The Kolmogorov inequalities
We now use the previous theorems to establish Kolmogorov’s submartingale inequality, which is a major strengthening of the Markov inequality. Just as the Markov inequality in Section 1.7 was used to derive the Chebychev inequality and then the weak law of large numbers, the Kolmogorov submartingale inequality will be used to strengthen the Chebychev inequality, from which the strong law of large numbers will follow.
Theorem 7.7 (Kolmogorov’s submartingale inequality). Let {Zn; n ≥ 1} be a non-negative submartingale. Then for any positive integer m and any a > 0,
Pr{max_{1≤i≤m} Zi ≥ a} ≤ E [Zm] / a.   (7.99)
Proof: Given a non-negative submartingale {Zn; n ≥ 1}, given a > 0, and given a positive integer m, let N be the stopping time defined as the smallest n ≤ m such that Zn ≥ a. If Zn < a for all n ≤ m, then N = m. Thus the process must stop by time m, and ZN ≥ a if and only if Zn ≥ a for some n ≤ m. Thus
Pr{max_{1≤n≤m} Zn ≥ a} = Pr{ZN ≥ a} ≤ E [ZN] / a,   (7.100)
where we have used the Markov inequality. Finally, since the process must be stopped by time m, we have ZN = Zm∗. From (7.85), E [Zm∗] ≤ E [Zm], so the right hand side of (7.100) is less than or equal to E [Zm] / a, completing the proof. The following simple corollary shows that (7.99) takes a simpler form for non-negative martingales. Corollary 7.3 (Non-negative martingale inequality). Let {Zn; n ≥ 1} be a non-negative martingale. Then
Pr{sup_{n≥1} |Zn| ≥ a} ≤ E [Z1] / a;   for all a > 0.   (7.101)
Proof: For a martingale, E [Zm] = E [Z1]. Thus, from (7.99), Pr{max_{1≤i≤m} Zi ≥ a} ≤ E [Z1]/a for all m > 1. Passing to the limit m → ∞ yields (7.101).
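The submartingale inequality is easy to test empirically. The sketch below (illustrative parameters, not from the text) takes the non-negative submartingale Zn = Sn² for a fair ±1 random walk, for which E [Zm] = m, and compares the empirical probability of the running maximum exceeding a with the bound E [Zm]/a from (7.99).

import random

m, a, trials = 100, 400, 20000       # illustrative parameters
rng = random.Random(5)
count = 0
for _ in range(trials):
    S, hit = 0, False
    for _ in range(m):
        S += 1 if rng.random() < 0.5 else -1
        if S * S >= a:               # Zn = Sn^2 crossed the level a
            hit = True
            break
    count += hit

print(count / trials, m / a)         # empirical Pr{max Zi >= a} vs the bound E[Zm]/a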
The following corollary bears the same relationship to the submartingale inequality as the Chebychev inequality does to the Markov inequality. Corollary 7.4 (Kolmogorov’s martingale inequality). Let {Zn; n ≥ 1} be a martingale with E [Zn²] < ∞ for all n ≥ 1. Then
Pr{max_{1≤n≤m} |Zn| ≥ b} ≤ E [Zm²] / b²;   for all integer m ≥ 2, all b > 0.   (7.102)
Proof: Since {Zn; n ≥ 1} is a martingale and Zn² is a convex function of Zn, it follows from Theorem 7.3 that {Zn²; n ≥ 1} is a submartingale. Since Zn² is non-negative, we can use the Kolmogorov submartingale inequality to see that
Pr{max_{n≤m} Zn² ≥ a} ≤ E [Zm²] / a   for any a > 0.
Substituting b² for a, we get (7.102).
Corollary 7.5 (Kolmogorov’s random walk inequality). Let {Sn; n ≥ 1} be a random walk with Sn = X1 + · · · + Xn where {Xi; i ≥ 1} is a set of IID random variables with mean X and variance σ². Then for any positive integer m and any ε > 0,
Pr{max_{1≤n≤m} |Sn − nX| ≥ mε} ≤ σ² / (mε²).   (7.103)
Proof: {Zn = Sn − nX; n ≥ 1} is a zero mean random walk, and thus a martingale. Since E [Zm²] = mσ², (7.103) follows by substituting mε for b in (7.102).
Recall that the simplest form of the weak law of large numbers was given in (1.53) as Pr{|Sm/m − X| ≥ ε} ≤ σ²/(mε²). This is strengthened in (7.103) to upper bound the probability that any of the first m terms deviate from the mean by more than mε. It is this strengthening that will allow us to prove the strong law of large numbers. The following corollary yields essentially the same result as (7.41), but is included here as another example of the use of the Kolmogorov submartingale inequality.
Corollary 7.6. Let {Sn; n ≥ 1} be a random walk, Sn = X1 + · · · + Xn, where each Xi has mean X < 0 and semi-invariant moment generating function γ(r). For any r > 0 such that 0 < γ(r) < ∞ (i.e., for r > r∗), and for any α > 0,
Pr{max_{1≤i≤n} Si ≥ α} ≤ exp{−rα + nγ(r)}.   (7.104)
Proof: For r > r∗, {exp(rSn); n ≥ 1} is a submartingale. Taking a = exp(rα) in (7.99), we get (7.104). The following theorem about supermartingales is, in a sense, the dual of the Kolmogorov submartingale inequality. Note, however, that it applies to the terms n ≥ m in the supermartingale rather than n ≤ m. Theorem 7.8. Let {Zn; n ≥ 1} be a non-negative supermartingale. Then for any positive integer m and any a > 0,
Pr{sup_{i≥m} Zi ≥ a} ≤ E [Zm] / a.   (7.105)
Proof: For given m ≥ 1 and a > 0, let N be a possibly defective stopping time defined as the smallest i ≥ m for which Zi ≥ a. Let {Zn∗; n ≥ 1} be the corresponding stopped process, which is also non-negative and is a supermartingale from Theorem 7.4. For any k > m, note that Zk∗ ≥ a iff max_{m≤i≤k} Zi ≥ a. Thus
Pr{max_{m≤i≤k} Zi ≥ a} = Pr{Zk∗ ≥ a} ≤ E [Zk∗] / a.
Since {Zn∗; n ≥ 1} is a supermartingale, (7.77) shows that E [Zk∗] ≤ E [Zm∗]. On the other hand, Zm∗ = Zm since the process can not stop before epoch m. Thus Pr{max_{m≤i≤k} Zi ≥ a} is at most E [Zm] / a. Since k is arbitrary, we can pass to the limit, getting (7.105) and completing the proof.
7.8.1
The strong law of large numbers
We now proceed to prove the strong law of large numbers. Recall that we proved this in Chapter 1 either using the assumption of moments of all orders or of a fourth moment. Here we need assume only a finite second moment, and, given the machinery we now have, the proof is considerably simpler than before. The theorem is also true assuming only a first absolute moment, but the truncation argument we used for the weak law in Theorem 1.3 does not carry over simply here. The theorem here is Version 1 of the strong law as stated in Chapter 1, generalized to require no more than a second moment. The more useful form, Version 2 of Chapter 1, follows as before. Theorem 7.9 (Strong law of large numbers). Let {Xi; i ≥ 1} be a sequence of IID random variables with mean X and standard deviation σ < ∞. Let Sn = X1 + · · · + Xn. Then for any ε > 0,
lim_{n→∞} Pr{∪_{m>n} {|Sm/m − X| > ε}} = 0.   (7.106)
Proof: As n increases, the number of terms in the union above decreases, and thus the union has a non-increasing probability. Thus we can restrict attention to n of the form 2^k for integer k. For any given k, the union above can be separated into blocks as follows:
Pr{∪_{m>2^k} {|Sm/m − X| > ε}} = Pr{∪_{j=k}^{∞} ∪_{m=2^j+1}^{2^{j+1}} {|Sm/m − X| > ε}}
≤ Σ_{j=k}^{∞} Pr{∪_{m=2^j+1}^{2^{j+1}} {|Sm/m − X| > ε}}   (7.107)
= Σ_{j=k}^{∞} Pr{∪_{m=2^j+1}^{2^{j+1}} {|Sm − mX| ≥ εm}}
≤ Σ_{j=k}^{∞} Pr{∪_{m=2^j+1}^{2^{j+1}} {|Sm − mX| ≥ ε2^j}}   (7.108)
= Σ_{j=k}^{∞} Pr{max_{2^j<m≤2^{j+1}} |Sm − mX| ≥ ε2^j}
≤ Σ_{j=k}^{∞} Pr{max_{1≤m≤2^{j+1}} |Sm − mX| ≥ ε2^j}   (7.109)
≤ Σ_{j=k}^{∞} 2^{j+1}σ² / (ε² 2^{2j}) = 2^{−k+2}σ² / ε².   (7.110)
In (7.107), we used the union bound on the union over j. In (7.108), we used the fact that m ≥ 2j to increase the size of the sets in the union. In (7.109), we upper bounded by adding
additional terms into the maximization, and in (7.110) we used the Kolmogorov martingale inequality. The proof is completed by noting that the upper bound in (7.110) goes to 0 with increasing k. It should be noted that, although the proof consists of a large number of steps, the steps are all quite small and familiar. Basically the proof is a slick way of upper bounding the probability of the high-order terms in a sequence by using the Kolmogorov martingale inequality, which bounds the low-order terms.
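The quantitative content of the proof, namely that the probability of any deviation beyond time 2^k is at most 2^{−k+2}σ²/ε², can also be checked by simulation. The sketch below is illustrative only; the ±1 step distribution, the truncation of "any m > 2^k" at a finite horizon, and the parameter values are arbitrary choices.

import numpy as np

def check_slln_bound(k=10, horizon=8192, eps=0.15, trials=2000, seed=2):
    # X_i uniform on {-1, +1}: Xbar = 0, sigma^2 = 1.  The proof bounds
    # Pr{ |S_m/m| > eps for some m > 2^k }  by  2^(2-k) * sigma^2 / eps^2;
    # "some m > 2^k" is truncated here at m = horizon.
    rng = np.random.default_rng(seed)
    m = np.arange(1, horizon + 1)
    bad = 0
    for _ in range(trials):
        s = np.cumsum(rng.choice([-1.0, 1.0], size=horizon))
        if np.any(np.abs(s[2 ** k:] / m[2 ** k:]) > eps):
            bad += 1
    print("simulated probability      :", bad / trials)
    print("bound 2^(2-k) sigma^2/eps^2:", 2.0 ** (2 - k) / eps ** 2)

check_slln_bound()

The bound is quite conservative (the simulated probability is far smaller), which is typical of union-bound arguments; what matters for the proof is only that the bound goes to 0 as k increases.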
7.8.2  The martingale convergence theorem
Another famous result that follows from the Kolmogorov submartingale inequality is the martingale convergence theorem. This states that if a martingale {Zn; n ≥ 1} has the property that there is some finite M such that E[|Zn|] ≤ M for all n, then lim_{n→∞} Zn exists (and is finite) with probability 1. This is a powerful theorem in more advanced work, but it is not quite as useful as it appears, since the restriction E[|Zn|] ≤ M is more than a technical restriction; for example, it is not satisfied by a zero-mean random walk. We prove the theorem with the additional restriction that there is some finite M such that E[Zn²] ≤ M for all n.
Theorem 7.10 (Martingale convergence theorem). Let {Zn; n ≥ 1} be a martingale and assume that there is some finite M such that E[Zn²] ≤ M for all n. Then there is a random variable Z such that, for all sample sequences except a set of probability 0, lim_{n→∞} Zn = Z.

Proof*: From Theorem 7.3 and the assumption that E[Zn²] ≤ M, {Zn²; n ≥ 1} is a submartingale. Thus, from (7.76), E[Zn²] is nondecreasing in n, and since E[Zn²] is bounded, lim_{n→∞} E[Zn²] = M′ for some M′ ≤ M. For any integer k, the process {Yn = Z_{k+n} − Z_k; n ≥ 1} is a zero-mean martingale (see Exercise 7.30). Thus from Kolmogorov's martingale inequality,

$$\Pr\Bigl\{\max_{1\le n\le m} |Z_{k+n} - Z_k| \ge b\Bigr\} \;\le\; \mathsf{E}\bigl[(Z_{k+m}-Z_k)^2\bigr]/b^2. \qquad (7.111)$$
Next, observe that E[Z_{k+m} Z_k | Z_k = z_k, Z_{k−1} = z_{k−1}, . . . , Z_1 = z_1] = z_k², and therefore E[Z_{k+m} Z_k] = E[Z_k²]. Thus E[(Z_{k+m} − Z_k)²] = E[Z_{k+m}²] − E[Z_k²] ≤ M′ − E[Z_k²]. Since this is independent of m, we can pass to the limit, obtaining

$$\Pr\Bigl\{\sup_{n\ge 1} |Z_{k+n} - Z_k| \ge b\Bigr\} \;\le\; \frac{M' - \mathsf{E}[Z_k^2]}{b^2}. \qquad (7.112)$$

Since lim_{k→∞} E[Z_k²] = M′, we then have, for all b > 0,

$$\lim_{k\to\infty} \Pr\Bigl\{\sup_{n\ge 1} |Z_{k+n} - Z_k| \ge b\Bigr\} = 0. \qquad (7.113)$$
This means that with probability 1, a sample sequence of {Zn ; n ≥ 1} is a Cauchy sequence, and thus approaches a limit, concluding the proof.
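As an illustration of Theorem 7.10, consider the martingale Zn = Σ_{i≤n} Xi/i with Xi = ±1 equiprobable; here E[Zn²] = Σ_{i≤n} 1/i² ≤ π²/6 is bounded, so each sample path converges. The short sketch below (the checkpoint times and number of paths are arbitrary choices) prints a few sample paths at widely spaced times to show them settling.

import numpy as np

def martingale_convergence_demo(n_max=100000, paths=4, seed=3):
    # Z_n = sum_{i<=n} X_i / i with X_i = +-1 equiprobable: a martingale with
    # E[Z_n^2] = sum_{i<=n} 1/i^2 <= pi^2/6, so Theorem 7.10 applies and each
    # sample path converges.  Print each path at widely spaced times.
    rng = np.random.default_rng(seed)
    checkpoints = (10, 100, 1000, 10000, 100000)
    for p in range(paths):
        x = rng.choice([-1.0, 1.0], size=n_max)
        z = np.cumsum(x / np.arange(1, n_max + 1))
        print("path", p, [round(z[c - 1], 4) for c in checkpoints])

martingale_convergence_demo()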
This result can be relatively easily interpreted for branching processes. For a branching process {Xn; n ≥ 1} where Ȳ is the expected number of offspring of an individual, {Xn/Ȳⁿ; n ≥ 1} is a martingale that satisfies the above conditions. If Ȳ ≤ 1, the branching process dies out with probability 1, so Xn/Ȳⁿ approaches 0 with probability 1. For Ȳ > 1, however, the branching process dies out with some probability less than 1 and grows without bound otherwise. Thus, the limiting random variable Z is 0 with the probability that the process ultimately dies out, and is positive otherwise. In the latter case, the interpretation is that when the population is very large, a law of large numbers effect controls its growth in each successive generation, so that Xn/Ȳⁿ tends to change in a random way for small n and then changes increasingly little as n increases.
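This interpretation can be visualized with a short simulation. The sketch below is illustrative only; the Poisson offspring distribution, the mean 1.5, and the number of generations are arbitrary choices. On paths that die out the martingale Xn/Ȳⁿ ends at 0, while on surviving paths it settles near a positive value.

import numpy as np

def branching_martingale_demo(mean_offspring=1.5, generations=30,
                              paths=6, seed=4):
    # X_n = population at generation n with X_0 = 1 and Poisson(mean_offspring)
    # offspring per individual.  Z_n = X_n / mean_offspring**n is a martingale;
    # it ends at 0 on paths that die out and settles near a positive limit on
    # paths that survive.  Print Z_n at a few generations along each path.
    rng = np.random.default_rng(seed)
    checkpoints = (10, 20, 30)
    for p in range(paths):
        x, record = 1, []
        for n in range(1, generations + 1):
            x = int(rng.poisson(mean_offspring, size=x).sum()) if x > 0 else 0
            if n in checkpoints:
                record.append(round(x / mean_offspring ** n, 4))
        print("path", p, "Z_n at generations", checkpoints, ":", record)

branching_martingale_demo()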
7.9  Summary
Each term in a random walk {Sn; n ≥ 1} is a sum of IID random variables, and thus the study of random walks is closely related to that of sums of IID variables. The focus in random walks, however, as in most of the processes we have studied, is more on the relationship between the terms (such as which term first crosses a threshold) than on the individual terms. We started by showing that random walks are a generalization of renewal processes, and are central to studying the waiting time in queue for G/G/1 queues and to sequential analysis for hypothesis testing.

A major focus of the chapter was on estimating the probabilities of very unlikely events, a topic known as large deviation theory. We started by studying the exponential bound on Pr{Sn ≥ α} for α > 0 and E[X] < 0. We then developed the Wald identity, which can be used to find tight upper bounds on the probability that a threshold is ever crossed by a random walk. One of the insights gained here was that if a threshold at α is crossed, it is likely to be crossed at a time close to n* = α/γ′(r*), where r* is the positive root of γ(r). We also found that r* plays a fundamental role in the probability of threshold crossings. For questions of typical behavior, the mean and variance of a random variable are the major quantities of interest, but when interested in atypically large deviations, r* is the major parameter of interest.

We next introduced martingales, submartingales and supermartingales. These are sometimes regarded as somewhat exotic topics in mathematics, but in fact they are very useful in a large variety of relatively simple processes. For example, we showed that all of the random walk issues of earlier sections can be treated as a special case of martingales, and that martingales can be used to model both sums and products of random variables. We also showed how Markov modulated random walks can be treated as martingales.

Stopping times, as first introduced in Chapter 3, were then applied to martingales. We defined a stopped process {Zn*; n ≥ 1} to be the same as the original process {Zn; n ≥ 1} up to the stopping point, and then constant thereafter. Theorems 7.4 and 7.5 showed that the stopped process has the same form (martingale, submartingale, or supermartingale) as the original process, and that the expected values E[Zn*] are between E[Z1] and E[Zn]. We also looked at E[ZN] and found that it is equal to E[Z1] iff (7.91) is satisfied. The Wald identity can be viewed as E[ZN] = E[Z1] = 1 for the Wald martingale, Zn = exp{rSn − nγ(r)}. We
then found a similar identity for Markov modulated random walks. In deriving results for Markov modulated random walks, it was necessary to define martingales relative to other processes in order to find suitable stopping times, which are also defined relative to those other processes. This added restriction on martingales is useful in other contexts.

The Kolmogorov inequalities were next developed. They are analogs of the Markov inequality and the Chebyshev inequality, but are sufficiently stronger to allow proofs of the strong law of large numbers and the martingale convergence theorem.

A standard reference on random walks, and particularly on the analysis of overshoots, is Feller [9]. Dembo and Zeitouni [5] develop large deviation theory in a much more general and detailed way than the introduction here. The classic reference on martingales is Doob [6], but [4] and [16] are more accessible.
7.10  Exercises
Exercise 7.1. Consider the simple random walk {Sn; n ≥ 1} of Example 7.1.1 with Sn = X1 + · · · + Xn and Pr{Xi = 1} = p; Pr{Xi = −1} = 1 − p; assume that p < 1/2.

a) Show that Pr{sup_{i≥1} Si ≥ k} = [Pr{sup_{i≥1} Si ≥ 1}]^k for any positive integer k. Hint: Given that the random walk ever reaches the value 1, consider a new random walk starting at that time and explore the probability that the new walk ever reaches a value 1 greater than its starting point.

b) Find a quadratic equation for y = Pr{sup_{i≥1} Si ≥ 1}. Hint: explore each of the two possibilities immediately after the first trial.

c) Show that the two roots of this quadratic equation are p/(1 − p) and 1. Argue that Pr{sup_{i≥1} Si ≥ 1} cannot be 1 and thus must be p/(1 − p).

d) Show that p/(1 − p) = exp(−r*) where r* is the unique positive root of g(r) = 1 and where g(r) = E[e^{rX}].
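A quick Monte Carlo sanity check of the formula asserted in part c) follows; it is not a substitute for the derivation the exercise asks for, and the values of p, k, the truncation horizon, and the trial count are arbitrary choices.

import numpy as np

def check_sup_probability(p=0.4, k=3, horizon=2000, trials=20000, seed=5):
    # Simple random walk with Pr{X = +1} = p < 1/2.  Parts a) and c) give
    # Pr{sup_n S_n >= k} = (p/(1-p))**k; estimate it with the sup truncated
    # at a long but finite horizon.
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        s = np.cumsum(rng.choice([1, -1], size=horizon, p=[p, 1 - p]))
        if s.max() >= k:
            hits += 1
    print("simulated:", hits / trials, "  formula:", (p / (1 - p)) ** k)

check_sup_probability()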
Exercise 7.2. Consider a G/G/1 queue with IID arrivals {Xi; i ≥ 1}, IID FCFS service times {Yi; i ≥ 0}, and an initial arrival to an empty system at time 0. Define Ui = Xi − Y_{i−1} for i ≥ 1. Consider a sample path where (u1, . . . , u6) = (1, −2, 2, −1, 3, −2).

a) Let Z_i^6 = U6 + U5 + · · · + U_{6−i+1}. Find the waiting time in queue for customer 6 as the maximum of the 'backward' random walk with elements 0, Z_1^6, Z_2^6, . . . , Z_6^6; sketch this random walk.

b) Find the waiting time in queue for customers 1 to 5.

c) Which customers start a busy period (i.e., arrive when the queue and server are both empty)? Verify that if Z_i^6 maximizes the random walk in part a), then a busy period starts with arrival 6 − i.

d) Now consider a forward random walk Vn = U1 + · · · + Un. Sketch this walk for the sample path above and show that the waiting time for each customer is the difference between two appropriately chosen values of this walk.

Exercise 7.3. A G/G/1 queue has a deterministic service time of 2 and inter-arrival times that are 3 with probability p and 1 with probability 1 − p.

a) Find the distribution of W1, the wait in queue of the first arrival after the beginning of a busy period.

b) Find the distribution of W∞, the steady-state wait in queue.

c) Repeat parts a) and b) if the service times and inter-arrival times are exponentially distributed with rates µ and λ respectively.

Exercise 7.4. A sales executive hears that one of his sales people is routing half of his incoming sales to a competitor. In particular, arriving sales are known to be Poisson at rate
one per hour. According to the report (which we view as hypothesis 1), each second arrival is routed to the competition; thus under hypothesis 1 the inter-arrival density for successful sales is f(y|H1) = y e^{−y}; y ≥ 0. The alternate hypothesis (H0) is that the rumor is false and the inter-arrival density for successful sales is f(y|H0) = e^{−y}; y ≥ 0. Assume that, a priori, the hypotheses are equally likely. The executive, a recent student of stochastic processes, explores various alternatives for choosing between the hypotheses; he can only observe the times of successful sales, however.

a) Starting with a successful sale at time 0, let Si be the arrival time of the ith subsequent successful sale. The executive observes S1, S2, . . . , Sn (n ≥ 1) and chooses the maximum a posteriori probability hypothesis given this data. Find the joint probability densities f(S1, S2, . . . , Sn|H1) and f(S1, . . . , Sn|H0) and give the decision rule.

b) This is the same as part a) except that the system is in steady state at time 0 (rather than starting with a successful sale). Find the density of S1 (the time of the first arrival after time 0) conditional on H0 and on H1. What is the decision rule now after observing S1, . . . , Sn?

c) This is the same as part b), except that rather than observing n successful sales, the successful sales up to some given time t are observed. Find the probability, under each hypothesis, that the first successful sale occurs in (s1, s1 + ∆], the second in (s2, s2 + ∆], . . . , and the last in (s_{N(t)}, s_{N(t)} + ∆] (assume ∆ very small). What is the decision rule now?

Exercise 7.5. For the hypothesis testing problem of Section 7.3, assume that there is a cost C0 of choosing H1 when H0 is correct, and a cost C1 of choosing H0 when H1 is correct. Find the test that minimizes the expected cost and express it in the form of (7.12).

Exercise 7.6. a) For the hypothesis testing problem of Section 7.3, assume that Zi = ln[f(Yi|H1)/f(Yi|H0)], as a random variable conditional on Hi, i = 0, 1, has a continuous non-zero probability density. Show that, for any given α, a threshold test minimizes Pr{error | H0} under the constraint that Pr{error | H1} ≤ α. Hint: Assume that some given test is better than any threshold test, and show, as a contradiction, that that test has a smaller probability of error than the MAP test for the appropriate Pr{H0}. (The test in this problem is called the Neyman-Pearson test.)

b) Repeat part a) without the assumption that Zi has a continuous density. Hint: be careful about the don't-care case in (7.14).

Exercise 7.7. For each of the following random variables, find the interval (r−, r+) over which the moment generating function g(r) exists, and determine whether g(r) exists at each of the end points. For parts a) and b) you should also find and sketch g(r). For part c), g(r) has no closed form.

a) Let a, b be positive numbers and let X have the density

fX(x) = (2a)^{−1} exp(−ax), x ≥ 0;   fX(x) = (2b)^{−1} exp(bx), x < 0.
b) Let Y be a Gaussian random variable with mean m and variance σ 2 .
c) Let Z be a non-negative random variable with density

fZ(z) = k(1 + z)^{−2} exp(−az), z ≥ 0,

where a > 0 and k = [∫_{z≥0} (1 + z)^{−2} exp(−az) dz]^{−1}. Hint: you can find the Laplace transform for fZ in Laplace transform tables, but the result is not in closed form and is not much help. Fortunately, there is no need to evaluate g(r) to find r+ or to find whether g(r+) is finite.

Exercise 7.8. a) Assume that the random variable X has a moment generating function gX(r) that is finite in the interval (r−, r+), r− < 0 < r+, and assume r− < r < r+ throughout. For any finite constant c, express the moment generating function of X − c, i.e., g_{(X−c)}(r), in terms of the moment generating function of X. Explain why g″_{(X−c)}(r) ≥ 0.

b) Show that g″_{(X−c)}(r) = [g″_X(r) − 2c g′_X(r) + c² g_X(r)] e^{−rc}.

c) Use a) and b) to show that g″_X(r) g_X(r) − [g′_X(r)]² ≥ 0, and that γ″_X(r) ≥ 0. Hint: Choose c = g′_X(r)/g_X(r).
d) Assume that X is non-atomic, i.e., that there is no value of c such that Pr{X = c} = 1. Show that the inequality sign "≥" may be replaced by ">" everywhere in a), b) and c).

Exercise 7.9. Define γ(r) as ln[g(r)] where g(r) = E[exp(rX)]. Assume that X is discrete with possible outcomes {ai; i ≥ 1}, let pi denote Pr{X = ai}, and assume that g(r) exists in some region (r−, r+) around r = 0. For any given r, r− < r < r+, define a random variable Xr with the same set of possible outcomes {ai; i ≥ 1} as X, but with a probability mass function qi = Pr{Xr = ai} = pi exp[ai r − γ(r)]. Xr is not a function of X, and is not even to be viewed as in the same probability space as X; it is of interest simply because of the behavior of its defined probability mass function. It is called a tilted random variable relative to X, and this exercise, along with Exercise 7.10, will justify our interest in it.

a) Verify that Σ_i qi = 1.

b) Verify that E[Xr] = Σ_i ai qi is equal to γ′(r).

c) Verify that Var[Xr] = Σ_i ai² qi − (E[Xr])² is equal to γ″(r).

d) Argue that γ″(r) ≥ 0 for all r such that g(r) exists, and that γ″(r) > 0 if γ″(0) > 0.
e) Give a similar definition of Xr for a random variable X with a density, and modify parts a) to d) accordingly.

Exercise 7.10. Assume that X is discrete, with possible values {ai; i ≥ 1} and probabilities Pr{X = ai} = pi. Let Xr be the corresponding tilted random variable as defined in Exercise 7.9. Let Sn = X1 + · · · + Xn be the sum of n IID rv's with the distribution of X, and let Sn,r = X1,r + · · · + Xn,r be the sum of n IID tilted rv's with the distribution of Xr. Assume that X̄ < 0 and that r > 0 is such that γ(r) exists.

a) Show that Pr{Sn,r = s} = Pr{Sn = s} exp[sr − nγ(r)]. Hint: first show that

Pr{X1,r = v1, . . . , Xn,r = vn} = Pr{X1 = v1, . . . , Xn = vn} exp[sr − nγ(r)]
where s = v1 + · · · + vn.

b) Find the mean and variance of Sn,r in terms of γ(r).

c) Define a = γ′(r) and σr² = γ″(r). Show that Pr{|Sn,r − na| ≤ √(2n) σr} > 1/2. Use this to show that

Pr{|Sn − na| ≤ √(2n) σr} > (1/2) exp[−r(an + √(2n) σr) + nγ(r)].

d) Use this to show that for any ε and for all sufficiently large n,

Pr{Sn ≥ n(γ′(r) − ε)} > (1/2) exp[−rn(γ′(r) + ε) + nγ(r)].

Exercise 7.11. a) Redraw Figure 7.3 for the case X̄ > 0.

b) Find the value of r ≥ 0 that minimizes the right hand side of (7.17) and redraw Figure 7.4 to illustrate your solution. Hint: the nature of your answer will change depending on the relationship between α/n and X̄.

Exercise 7.12. Consider a random walk with thresholds α > 0, β < 0. We wish to find Pr{SN ≥ α} in the absence of a lower threshold. Use the upper bound in (7.31) for the probability that the random walk crosses α before β.

a) Given that the random walk crosses β first, find an upper bound to the probability that α is now crossed before a yet lower threshold at 2β is crossed.

b) Given that 2β is crossed before α, upper bound the probability that α is crossed before a threshold at 3β. Extending this argument to successively lower thresholds, find an upper bound to each successive term, and find an upper bound on the overall probability that α is crossed. By observing that β is arbitrary, show that (7.31) is valid with no lower threshold.

Exercise 7.13. This exercise verifies that Corollary 7.1 holds in the situation where γ(r) < 0 for all r ∈ (r−, r+) and where r* is taken to be r+ (see Figure 7.4).

a) Show that for the situation above, exp(rSN) ≤ exp(rSN − Nγ(r)) for all r ∈ (0, r*).

b) Show that E[exp(rSN)] ≤ 1 for all r ∈ (0, r*).
c) Show that Pr{SN ≥ α} ≤ exp(−rα) for all r ∈ (0, r*). Hint: Follow the steps of the proof of Corollary 7.1.

d) Show that Pr{SN ≥ α} ≤ exp(−r*α).

Exercise 7.14. a) Use Wald's equality to show that if X̄ = 0, then E[SN] = 0 where N is the time of threshold crossing with one threshold at α > 0 and another at β < 0.

b) Obtain an expression for Pr{SN ≥ α}. Your expression should involve the expected value of SN conditional on crossing the individual thresholds (you need not try to calculate these expected values).
c) Evaluate your expression for the case of a simple random walk.

d) Evaluate your expression when X has an exponential density, fX(x) = a1 e^{−λx} for x ≥ 0 and fX(x) = a2 e^{µx} for x < 0, where a1 and a2 are chosen so that X̄ = 0.

Exercise 7.15. A random walk {Sn; n ≥ 1}, with Sn = Σ_{i=1}^n Xi, has the following probability density for Xi:

fX(x) = e^{−x}/(e − e^{−1}), −1 ≤ x ≤ 1;   fX(x) = 0, elsewhere.

a) Find the values of r for which g(r) = E[exp(rX)] = 1.
b) Let Pα be the probability that the random walk ever crosses a threshold at α for some α > 0. Find an upper bound to Pα of the form Pα ≤ e^{−αA} where A is a constant that does not depend on α; evaluate A.

c) Find a lower bound to Pα of the form Pα ≥ B e^{−αA} where A is the same as in part b) and B is a constant that does not depend on α. Hint: keep it simple; you are not expected to find an elaborate bound. Also recall that E[e^{r*S_N}] = 1 where N is a stopping time for the random walk and g(r*) = 1.

Exercise 7.16. Let {Xn; n ≥ 1} be a sequence of IID integer valued random variables with the probability mass function P_X(k) = Q_k. Assume that Q_k > 0 for |k| ≤ 10 and Q_k = 0 for |k| > 10. Let {Sn; n ≥ 1} be a random walk with Sn = X1 + · · · + Xn. Let α > 0 and β < 0 be integer valued thresholds, and let N be the smallest value of n for which either Sn ≥ α or Sn ≤ β. Let {Sn*; n ≥ 1} be the stopped random walk; i.e., Sn* = Sn for n ≤ N and Sn* = SN for n > N. Let πi* = Pr{SN = i}.

a) Consider a Markov chain in which this stopped random walk is run repeatedly until the point of stopping. That is, the Markov chain transition probabilities are given by P_{ij} = Q_{j−i} for β < i < α and P_{i0} = 1 for i ≤ β and i ≥ α. All other transition probabilities are 0 and the set of states is the set of integers [−9 + β, 9 + α]. Show that this Markov chain is ergodic.

b) Let {πi} be the set of steady-state probabilities for this Markov chain. Find the set of probabilities {πi*} for the stopping states of the stopped random walk in terms of {πi}.

c) Find E[SN] and E[N] in terms of {πi}.

Exercise 7.17. a) Conditional on H0 for the hypothesis testing problem, consider the random variables Zi = ln[f(Yi|H1)/f(Yi|H0)]. Show that r*, the positive solution to g(r) = 1, where g(r) = E[exp(rZi)], is given by r* = 1.

b) Assuming that Y is a discrete random variable (under each hypothesis), show that the tilted random variable Zr with r = 1 has the PMF P_Y(y|H1).
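The tilting used in Exercises 7.9, 7.10, and 7.17 is easy to explore numerically. The sketch below (the five-point PMF and the value of r are arbitrary illustrative choices) forms the tilted PMF qi = pi exp[ai r − γ(r)] and checks that it sums to 1 and that its mean and variance agree with γ′(r) and γ″(r) computed by finite differences.

import numpy as np

def tilted_check(r=0.7):
    # Tilted rv of Exercise 7.9: q_i = p_i exp(a_i r - gamma(r)) with
    # gamma(r) = ln E[exp(rX)].  Check sum(q) = 1 and that the mean and
    # variance of X_r match gamma'(r) and gamma''(r).
    a = np.array([-2.0, -1.0, 0.0, 1.0, 3.0])      # illustrative outcomes
    p = np.array([0.3, 0.25, 0.2, 0.15, 0.1])      # illustrative PMF
    gamma = lambda s: np.log(np.sum(p * np.exp(s * a)))
    q = p * np.exp(a * r - gamma(r))
    h = 1e-5
    d1 = (gamma(r + h) - gamma(r - h)) / (2 * h)
    d2 = (gamma(r + h) - 2 * gamma(r) + gamma(r - h)) / h ** 2
    print("sum of q              :", q.sum())
    print("E[X_r] vs gamma'(r)   :", np.sum(a * q), d1)
    print("Var[X_r] vs gamma''(r):", np.sum(a ** 2 * q) - np.sum(a * q) ** 2, d2)

tilted_check()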
Exercise 7.18. a) Suppose {Zn; n ≥ 1} is a martingale. Verify (7.71); i.e., E[Zn] = E[Z1] for n > 1.

b) If {Zn; n ≥ 1} is a submartingale, verify (7.76), and if a supermartingale, verify (7.77).

Exercise 7.19. Suppose {Zn; n ≥ 1} is a martingale. Show that

E[Zm | Z_{n_i}, Z_{n_{i−1}}, . . . , Z_{n_1}] = Z_{n_i}  for all 0 < n1 < n2 < . . . < ni < m.

Exercise 7.20. a) Assume that {Zn; n ≥ 1} is a submartingale. Show that

E[Zm | Zn, Z_{n−1}, . . . , Z1] ≥ Zn  for all n < m.

b) Show that

E[Zm | Z_{n_i}, Z_{n_{i−1}}, . . . , Z_{n_1}] ≥ Z_{n_i}  for all 0 < n1 < n2 < . . . < ni < m.
c) Assume now that {Zn; n ≥ 1} is a supermartingale. Show that parts a) and b) still hold with ≥ replaced by ≤.

Exercise 7.21. Let {Zn = exp[rSn − nγ(r)]; n ≥ 1} be the generating function martingale of (7.49) where Sn = X1 + · · · + Xn and X1, . . . , Xn are IID with mean X̄ < 0. Let N be the possibly defective stopping time for which the process stops after crossing a threshold at α > 0 (there is no negative threshold). Show that exp[−r*α] is an upper bound to the probability of threshold crossing by considering the stopped process {Zn*; n ≥ 1}. The purpose of this exercise is to illustrate that the stopped process can yield useful upper bounds even when the stopping time is defective.

Exercise 7.22. This problem uses martingales to find the expected number of trials E[N] before a fixed pattern, a1, a2, . . . , ak, of binary digits occurs in a sequence of IID binary random variables X1, X2, . . . (see Exercises 3.25 and 4.24 for alternate approaches). A mythical casino and set of gamblers who follow a prescribed strategy will be used to determine E[N]. The casino has a game where, on the ith trial, gamblers bet money on either 1 or 0. After bets are placed, Xi above is used to select the outcome 0 or 1. Let p(1) = Pr{Xi = 1} and p(0) = 1 − p(1) = Pr{Xi = 0}. If an amount s is bet on 1, the casino receives s if Xi = 0, and pays out s/p(1) − s (plus returning the bet s) if Xi = 1. If s is bet on 0, the casino receives s if Xi = 1, and pays out s/p(0) − s (plus the bet s) if Xi = 0.

a) Assume an arbitrary pattern of bets by various gamblers on various trials (some gamblers might bet arbitrary amounts on 0 and some on 1 at any given trial). Let Yi be the net gain of the casino on trial i. Show that E[Yi] = 0 (i.e., show that the game is fair). Let Zn = Y1 + Y2 + · · · + Yn be the aggregate gain of the casino over n trials. Show that for the given pattern of bets, {Zn; n ≥ 1} is a martingale.

b) In order to determine E[N] for a given pattern a1, a2, . . . , ak, we program our gamblers to bet as follows:
i) Gambler 1 has an initial capital of 1 which is bet on a1 at trial 1. If he wins, his capital grows to 1/p(a1), which is bet on a2 at trial 2. If he wins again, he bets his entire capital, 1/[p(a1)p(a2)], on a3 at trial 3. He continues, at each trial i, to bet his entire capital on ai until he loses at some trial (in which case he leaves with no money) or he wins on k successive trials (in which case he leaves with 1/[p(a1) . . . p(ak)]).

ii) Gambler j, j > 1, follows exactly the same strategy but starts at trial j. Note that if the pattern a1, . . . , ak appears for the first time at N = n, then gambler n − k + 1 leaves at time n with capital 1/[p(a1) . . . p(ak)] and gamblers j < n − k + 1 all lose their capital.

Suppose the string (a1, . . . , ak) is (0, 1), and consider the aggregate casino gain Zn under the above gambling strategy. Given that N = 3 (i.e., that X2 = 0 and X3 = 1), note that gambler 1 loses his money at either trial 1 or 2, gambler 2 leaves at time 3 with 1/[p(0)p(1)], and gambler 3 loses his money at time 3. Show that ZN = 3 − 1/[p(0)p(1)] given N = 3. Find ZN given N = n for arbitrary n ≥ 2 (note that the condition N = n uniquely specifies ZN).

c) Find E[ZN] from part (a). Use this plus part (b) to find E[N].

d) Repeat parts (b) and (c) using the string (a1, . . . , ak) = (1, 1). Be careful about gambler 3 for N = 3. Show that E[N] = 1/[p(1)p(1)] + 1/p(1).

e) Repeat parts (b) and (c) for (a1, . . . , ak) = (1, 1, 1, 0, 1, 1).

Exercise 7.23. a) This exercise shows why the condition E[|ZN|] < ∞ is required in Theorem 7.6. Let Z1 = −2 and, for n ≥ 1, let Z_{n+1} = Zn[1 + Xn(3n + 1)/(n + 1)] where X1, X2, . . . are IID and take on the values +1 and −1 with probability 1/2 each. Show that {Zn; n ≥ 1} is a martingale.

b) Consider the stopping time N such that N is the smallest value of n > 1 for which Zn and Z_{n−1} have the same sign. Show that, conditional on n < N, Zn = (−2)^n/n and, conditional on n = N, ZN = −(−2)^n(n − 2)/(n² − n).

c) Show that E[|ZN|] is infinite, so that E[ZN] does not exist according to the definition of expectation, and show that lim_{n→∞} E[Zn | N > n] Pr{N > n} = 0.

Exercise 7.24. Show that Theorem 7.3 is also valid for martingales relative to a joint process. That is, show that if h is a convex function of a real variable and if {Zn; n ≥ 1} is a martingale relative to a joint process {Zn, Xn; n ≥ 1}, then {h(Zn); n ≥ 1} is a submartingale relative to {h(Zn), Xn; n ≥ 1}.

Exercise 7.25. Show that if {Zn; n ≥ 1} is a martingale (submartingale or supermartingale) relative to a joint process {Zn, Xn; n ≥ 1} and if N is a stopping time for {Zn; n ≥ 1} relative to {Zn, Xn; n ≥ 1}, then the stopped process is a martingale (submartingale or supermartingale) respectively relative to the joint process.

Exercise 7.26. Show that if {Zn; n ≥ 1} is a martingale (submartingale or supermartingale) relative to a joint process {Zn, Xn; n ≥ 1} and if N is a stopping time for {Zn; n ≥ 1} relative to {Zn, Xn; n ≥ 1}, then the stopped process satisfies (7.85), (7.86), or (7.87) respectively.
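A Monte Carlo check of the answer stated in Exercise 7.22 d) for the pattern (1, 1) follows; it is illustrative only, the value of p(1) and the trial count are arbitrary choices, and the simulation is of course not the martingale argument the exercise asks for.

import numpy as np

def expected_wait_for_pattern(pattern=(1, 1), p1=0.3, trials=100000, seed=6):
    # Monte Carlo estimate of E[N], the first trial at which the binary
    # pattern appears in IID bits with Pr{X = 1} = p1, to compare with the
    # answer stated in part d): 1/p1^2 + 1/p1 for the pattern (1, 1).
    rng = np.random.default_rng(seed)
    k, total = len(pattern), 0
    for _ in range(trials):
        recent, n = [], 0
        while tuple(recent) != pattern:
            n += 1
            recent.append(1 if rng.random() < p1 else 0)
            if len(recent) > k:
                recent.pop(0)
        total += n
    print("simulated E[N]       :", total / trials)
    print("formula 1/p1^2 + 1/p1:", 1 / p1 ** 2 + 1 / p1)

expected_wait_for_pattern()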
Exercise 7.27. Show that if {Zn; n ≥ 1} is a martingale relative to a joint process {Zn, Xn; n ≥ 1} and if N is a stopping time for {Zn; n ≥ 1} relative to {Zn, Xn; n ≥ 1}, then E[ZN] = E[Z1] if and only if (7.91) is satisfied.

Exercise 7.28. Consider the Markov modulated random walk in the figure below. The random variables Yn in this example take on only a single value for each transition, that value being 1 for all transitions from state 1, 10 for all transitions from state 2, and 0 otherwise. ε > 0 is a very small number, say ε < 10^{−6}.

[Figure: a three-state Markov chain on states 1, 0, 2. From state 0, transitions go to state 1 and to state 2 with probability 1/2 each; states 1 and 2 each return to state 0 with probability ε and remain in place with probability 1 − ε.]
a) Show that the steady-state gain per transition is 5.5/(1 + ε). Show that the relative gain vector is w = (0, (ε − 4.5)/[ε(1 + ε)], (10ε + 4.5)/[ε(1 + ε)]).

b) Let Sn = Y0 + Y1 + · · · + Y_{n−1} and take the starting state X0 to be 0. Let N be the smallest value of n for which Sn ≥ 100. Find Pr{N = 11} and Pr{N = 101}. Find an estimate of E[N] that is exact in the limit ε → 0.

c) Show that Pr{XN = 1} = (1 − 45ε + o(ε))/2 and that Pr{XN = 2} = (1 + 45ε + o(ε))/2. Verify, to first order in ε, that (7.94) is satisfied.

Exercise 7.29. Show that (7.94) results from taking the derivative of (7.98) and evaluating it at r = 0.

Exercise 7.30. Let {Zn; n ≥ 1} be a martingale, and for some integer m, let Yn = Z_{n+m} − Zm.

a) Show that E[Yn | Z_{n+m−1} = z_{n+m−1}, Z_{n+m−2} = z_{n+m−2}, . . . , Zm = zm, . . . , Z1 = z1] = z_{n+m−1} − zm.

b) Show that E[Yn | Y_{n−1} = y_{n−1}, . . . , Y1 = y1] = y_{n−1}.

c) Show that E[|Yn|] < ∞. Note that b) and c) show that {Yn; n ≥ 1} is a martingale.

Exercise 7.31. a) Show that Theorem 7.3 is valid if {Zn; n ≥ 1} is a submartingale rather than a martingale. Hint: Simply follow the proof of Theorem 7.3 in the text.

b) Show that the Kolmogorov martingale inequality also holds if {Zn; n ≥ 1} is a submartingale rather than a martingale.
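Returning to Exercise 7.28 b), the limiting behavior as ε → 0 can be checked with a short simulation. The sketch below uses the state and reward structure of the exercise; the particular value of ε and the number of trials are arbitrary choices. For small ε, roughly half the runs give N = 11 and half give N = 101, so E[N] is near 56.

import numpy as np

def mmrw_threshold_time(eps=1e-3, threshold=100, trials=20000, seed=7):
    # Markov modulated walk of Exercise 7.28: from state 0 go to state 1 or 2
    # with probability 1/2 each; states 1 and 2 return to 0 with probability
    # eps and stay put otherwise.  Rewards per transition: 0 from state 0,
    # 1 from state 1, 10 from state 2.  N = first n with S_n >= threshold.
    rng = np.random.default_rng(seed)
    reward = {0: 0.0, 1: 1.0, 2: 10.0}
    counts = {}
    for _ in range(trials):
        state, s, n = 0, 0.0, 0
        while s < threshold:
            s += reward[state]
            n += 1
            if state == 0:
                state = 1 if rng.random() < 0.5 else 2
            elif rng.random() < eps:
                state = 0
        counts[n] = counts.get(n, 0) + 1
    for n in (11, 101):
        print("Pr{N =", n, "} ~", counts.get(n, 0) / trials)
    print("E[N] ~", sum(n * c for n, c in counts.items()) / trials)

mmrw_threshold_time()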
Bibliography

[1] Bellman, R., Dynamic Programming, Princeton University Press, Princeton, NJ, 1957.

[2] Bertsekas, D. & J. Tsitsiklis, An Introduction to Probability Theory, 2nd Ed., Athena Scientific, Belmont, MA, 2008.

[3] Bertsekas, D. P., Dynamic Programming - Deterministic and Stochastic Models, Prentice Hall, Englewood Cliffs, NJ, 1987.

[4] Bhattacharya, R. N. & E. C. Waymire, Stochastic Processes with Applications, Wiley, NY, 1990.

[5] Dembo, A. & O. Zeitouni, Large Deviation Techniques and Applications, Jones & Bartlett Publishers, 1993.

[6] Doob, J. L., Stochastic Processes, Wiley, NY, 1953.

[7] Drake, A. W., Fundamentals of Applied Probability, McGraw Hill, NY, 1967.

[8] Feller, W., An Introduction to Probability Theory and its Applications, vol 1, 3rd Ed., Wiley, NY, 1968.

[9] Feller, W., An Introduction to Probability Theory and its Applications, vol 2, Wiley, NY, 1966.

[10] Gallager, R. G., Discrete Stochastic Processes, Kluwer, Norwell, MA, 1996.

[11] Gantmacher, F. R., Applications of the Theory of Matrices (English translation), Interscience, NY, 1959.

[12] Harris, T. E., The Theory of Branching Processes, Springer Verlag, Berlin, and Prentice Hall, Englewood Cliffs, NJ, 1963.

[13] Kelly, F. P., Reversibility and Stochastic Networks, Wiley, 1979.

[14] Kolmogorov, A. N., Foundations of the Theory of Probability, Chelsea, 1950.

[15] Ross, S., A First Course in Probability, 4th Ed., McMillan & Co., 1994.

[16] Ross, S., Stochastic Processes, 2nd Ed., Wiley, NY, 1983.
[17] Rudin, W., Real and Complex Analysis, McGraw Hill, NY, 1966.

[18] Shannon, C. E., "A mathematical theory of communication," Bell System Technical Journal, 27, 1948, 379-423 and 623-656.

[19] Stark, H. & J. W. Woods, Probability, Random Processes, and Estimation for Engineers, Prentice Hall, Englewood Cliffs, NJ, 1994.

[20] Strang, G., Linear Algebra and its Applications, 3rd Ed., Harcourt, Brace, Jovanovich, Fort Worth, TX, 1988.

[21] Schweitzer, P. J. & A. Federgruen, "The Asymptotic Behavior of Undiscounted Value Iteration in Markov Decision Problems," Math. of Op. Res., 2, Nov. 1967, pp. 360-381.

[22] Wolff, R. W., Stochastic Modeling and the Theory of Queues, Prentice Hall, Englewood Cliffs, NJ, 1989.

[23] Yates, R. D. & D. J. Goodman, Probability and Stochastic Processes, Wiley, NY, 1999.

[24] Yates, R. D., High Speed Round Robin Queueing Networks, LIDS-TH-1983, M.I.T., Cambridge, MA, 1983.

[25] Ziv, J. & A. Lempel, "A universal algorithm for sequential data compression," IEEE Trans. IT, May 1977, pp. 337-343.

[26] Ziv, J. & A. Lempel, "Compression of individual sequences via variable-rate coding," IEEE Trans. IT, Sept. 1978, pp. 530-536.