FINITE STATE MARKOVIAN DECISION PROCESSES
CYRUS DERMAN
DIVISION OF MATHEMATICAL METHODS OF ENGINEERING AND OPERATIONS RESEARCH
SCHOOL OF ENGINEERING AND APPLIED SCIENCE
COLUMBIA UNIVERSITY
NEW YORK, NEW YORK
1970
ACADEMIC PRESS New York and London
COPYRIGHT © 1970, BY ACADEMIC PRESS, INC. ALL RIGHTS RESERVED NO PART OF THIS BOOK MAY BE REPRODUCED IN ANY FORM, BY PHOTOSTAT, MICROFILM, RETRIEVAL SYSTEM, OR ANY OTHER MEANS, WITHOUT WRITTEN PERMISSION FROM THE PUBLISHERS.
ACADEMIC PRESS, INC.
111 Fifth Avenue, New York, New York 10003
United Kingdom Edition published by ACADEMIC PRESS, INC. (LONDON) LTD.
Berkeley Square House, London W1X 6BA
LIBRARY OF CONGRESS CATALOG CARD NUMBER: 70-117083
PRINTED IN THE UNITED STATES OF AMERICA
TO MY PARENTS
Samuel and Bessie (Segal) Derman
Preface
We are concerned with the optimal sequential control of certain types of dynamic systems. We assume such a system is observed periodically. After each observation the system is classified into one of a possible number of states; after each classification one of a possible number of decisions is implemented. The sequence of implemented decisions interacts with the chance environment to effect the evolution of the system. We call the mathematical abstraction of this process a Markovian decision process; however, some authors use the term discrete dynamic programming. Just as linear programming provides a general framework for formulating and solving certain optimization problems, so does the Markovian decision process provide a structure within which optimal control of dynamic systems can be formulated and solved.
Recognizing its potential usefulness we presented a course on Markovian decision processes in 1966 and 1968 for operations research students of the Columbia University School of Engineering and Applied Science. The lectures for the course have served as a basis for the preparation of this monograph. This book is intended for operations researchers, statisticians, mathematicians, and engineers interested in mathematical methods for the control of dynamic systems. It can serve as a text for a course on dynamic programming which is intended to provide students with the basic computational algorithms as well as to prepare them for research in the subject. Prerequisites include a reasonable grounding in real analysis or advanced calculus, a knowledge of the elementary theory of Markov chains, and an acquaintance with the rudiments of linear programming. An appendix on these prerequisites is provided; however, its primary purpose is to collect those facts which are explicitly used in the text and is not intended to serve as a source for obtaining the necessary background for reading this book.
Acknowledgments
Acknowledgments are due to Richard Bellman for his initial and recurrent encouragement to write this book; to Arthur Veinott for a reading of the manuscript and helpful comments; to, at least, Edward Ignall, Morton Klein, Peter Kolesar, and Howard Taylor for enlightening conversations; and to Regina Tetens and Stennett Parris for preparation of the manuscript. This work was supported, in part, by the Army, Navy, Air Force, and NASA under Contract N00014-67-A-0108-0008 with the Office of Naval Research.
Introduction
Preliminary Remarks

Markovian decision processes are stochastic processes that describe the evolution of dynamic systems controlled by sequences of decisions or actions. Thus, in this monograph we shall be concerned with certain types of dynamic systems which are observed periodically, and influenced at the time of observation by the taking of one of a possible number of different actions. The evolution of the system will be the result of the interaction between the "laws of motion" of the system
and the sequence of actions taken over time. The different paths of the system will have associated economic consequences; the ultimate aim is to take those actions that control the system in an optimal manner. Optimality will be defined relative to a stipulated criterion. A typical system is an inventory system for a given product where the inventory level is under periodic review. After each review, the action taken is that of adding a certain amount of the product to the inventory level. The laws of motion of the system are determined by the pattern of demand for the product between times of review. Various costs associated with ordering new product, holding inventory in storage, shortages, etc. contribute to the economic consequences of the actions taken at the various times. A criterion for optimality would ordinarily be a function of long term costs. Another typical system might consist of a component or group of components that is under periodic surveillance and subjected to periodic maintenance or replacement of one or more components. At each inspection the system is classified in some appropriate way and a decision is made as to what degree of maintenance to employ. The properties of the system together with the demands upon it determine the laws of motion. Economic aspects involve the various costs associated with maintenance and also the attributed costs due to failure of the system. Occasionally, failure costs may be difficult or impossible to ascertain, in which case "system reliability" may be a more appropriate yardstick with which to measure the effectiveness of a surveillance and maintenance procedure.

The Markovian Decision Model

To introduce the general model, let us assume that at points of time t = 0, 1, ... the system is observed and classified into one of a possible number of states. We let {Y_t, t = 0, 1, ...} denote the sequence of observed states. The letter I will denote the space of possible states. Throughout this volume, I will be a finite space. After each observation of the system, one of a possible number of
actions is taken. We let {A_t, t = 0, 1, ...} denote the sequence of actions. K_i denotes the number of actions possible when the system is in state i. More frequently, we shall also use K_i to denote the set of possible actions when the system is in state i. No confusion should result from the double use of the notation. Throughout, K_i will be finite. A rule or policy, to be denoted by R, is a prescription for taking actions at each point in time. We shall permit a policy for taking an action at time t to be a function of the entire "history" of the system up to time t. We will allow actions to be taken which are determined by a random mechanism; the random mechanism will be a function of the "history." For example, when in state i, a coin may be tossed to determine which of two possible actions to take. However, the kind of coin used may depend on the previous sequence of states and actions taken. In this volume the use of policies employing randomization enables the use of linear programming formulations of the problems of interest and allows one to obtain optimal policies in the face of certain constraints. Thus, a policy R is a set of functions

    {D_a(H_{t−1}, Y_t),  a ∈ K_{Y_t},  t = 0, 1, ...}

satisfying

    0 ≤ D_a ≤ 1  and  Σ_a D_a = 1,
where H_t denotes the history of the system up to time t; that is, H_t = {Y_0, A_0, ..., Y_t, A_t}. One interprets D_a(H_{t−1}, Y_t) as follows: if h_{t−1} denotes the history of the system up to time t − 1 and Y_t the state of the system at time t, then a random mechanism is to be used which assigns the probability D_a(h_{t−1}, Y_t) of taking action a at time t. We assume throughout that the laws of motion of the system can be characterized by a time invariant set of transition probabilities. Namely, whenever the system is in state i and action a is taken, then, regardless of its history, q_{ij}(a) denotes the probability of the system being in state j at the next instant the system is observed. An alternative way of stating this assumption is that no matter what policy R is employed, the conditional probability that Y_{t+1} = j, given H_{t−1}, Y_t = i, and A_t = a,
is equal to q_{ij}(a). In this volume it will be assumed that the set of numbers {q_{ij}(a), a ∈ K_i, i ∈ I, j ∈ I} are known and, of course, satisfy

    0 ≤ q_{ij}(a) ≤ 1  and  Σ_j q_{ij}(a) = 1.
For example, consider the laws of motion of an inventory system under periodic review. Let Y_t denote the level of inventory at time t and let A_t denote the amount ordered after observing Y_t. Assume delivery of the A_t units is instantaneous so that at the moment of ordering, the inventory level is Y_t + A_t. Suppose the sequence of demands {D_t} for the product during each of the periods is a sequence of independent and identically distributed random variables. If, for simplicity, we allow negative inventory (that is, backlogging of demand) and a denumerable state space, then

    q_{ij}(a) = P{Y_{t+1} = j | H_{t−1}, Y_t = i, A_t = a} = P{Demand = i + a − j}.
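As a concrete illustration of this law of motion, the short sketch below builds q_{ij}(a) for the backlogging inventory example on a truncated range of inventory levels. The Poisson demand distribution and the truncation bound are illustrative assumptions, not part of the text.

```python
from math import exp, factorial

def poisson_pmf(k, mean):
    """P{Demand = k} for an assumed Poisson demand distribution."""
    return exp(-mean) * mean**k / factorial(k) if k >= 0 else 0.0

def inventory_transition(i, a, j, demand_mean=2.0, min_level=-10):
    """q_ij(a) = P{Demand = i + a - j}; demand carrying the level below the
    truncation point min_level is lumped into the lowest inventory state."""
    if j > min_level:
        return poisson_pmf(i + a - j, demand_mean)
    # lump the tail of the demand distribution into the boundary state
    return sum(poisson_pmf(k, demand_mean) for k in range(i + a - j, 200))

# Example: probability of moving from inventory 3 to inventory 1 after ordering 2 units
print(inventory_transition(3, 2, 1))  # = P{Demand = 4}
```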
Given a distribution P{Y_0 = i} over the initial states of the system and a policy R, the sequence {Y_t, A_t, t = 0, 1, ...} is a stochastic process. We call this process a Markovian decision process. The term Markovian is employed because of the special assumptions regarding the laws of motion. However, we point out that the process {Y_t, A_t, t = 0, 1, ...} is not necessarily a Markov process. Since a policy R may be such that the prescription for taking actions is dependent upon the entire history of the process, the Markov property may not be satisfied by {Y_t, A_t}. In order to indicate the dependence of the probabilities on the policy R, the notation P_R{E} will denote the probability of an event E occurring when policy R is used. The probability of the event E given the initial state Y_0 = i and the use of the policy R will be denoted by P_R{E | Y_0 = i}. We assume a certain cost structure superimposed on the Markovian decision process. Whenever the system is in state i and action a is taken, we assume that a known cost w_{ia} is incurred. For most of what is considered in this volume, w_{ia} may also denote an expected cost rather than an actual cost. However, the important aspect of the assumption is that this cost is a function only of the state and action taken. For example, in
our inventory problem the cost incurred in a period is a function of the ordering costs and the inventory level at the end of that period. The expected value of this function taken with respect to the distribution of demand will yield our assumed cost w_{ia}, the expected cost associated with inventory level i and the action of ordering a units. We define the random variables {W_t, t = 0, 1, ...}:

    W_t = w_{ia}  if  Y_t = i, A_t = a,  a ∈ K_i, i ∈ I.

We can then speak of expected costs; that is,

    E_R(W_t | Y_0 = i) = Σ_{j∈I} Σ_{a∈K_j} w_{ja} P_R{Y_t = j, A_t = a | Y_0 = i}.

Since I and K_i will be finite, no question of existence of E_R W_t will arise.
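This expectation can be computed by propagating the state distribution forward one period at a time. The sketch below does so for a policy whose action probabilities depend only on the current state, given as D[i][a]; the problem data (states, transition law, costs) are passed in as plain dictionaries and are assumptions of the illustration, not the book's notation.

```python
def expected_cost_at_t(t, i0, states, actions, q, w, D):
    """E_R(W_t | Y_0 = i0) for a state-dependent policy D, where
    q[(i, a)][j] is the law of motion and w[(i, a)] the one-period cost."""
    prob = {i: 1.0 if i == i0 else 0.0 for i in states}  # distribution of Y_s
    for _ in range(t):
        nxt = {j: 0.0 for j in states}
        for i in states:
            for a in actions[i]:
                for j in states:
                    nxt[j] += prob[i] * D[i][a] * q[(i, a)][j]
        prob = nxt
    return sum(prob[i] * D[i][a] * w[(i, a)]
               for i in states for a in actions[i])
```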
Problems to Be Treated

In terms of our model, we are now in the position to state some of the problems of interest in this volume. In each of the problems, it will be assumed that the initial state Y_0 = i is given; that is, P{Y_0 = i} = 1. Let

    S_{R,T}(i) = E_R{Σ_{t=0}^{T} W_t | Y_0 = i}.

In words, S_{R,T}(i) is the expected total cost of operating the system up to and including the time "horizon" t = T, given that the initial state is i and the policy used is R. The problem of interest is that of obtaining that policy R which minimizes S_{R,T}(i). This is the type of problem which is commonly dealt with by the straightforward method of dynamic programming. We discuss its solution in the next chapter. Another problem treated is that of finding R to minimize σ_R(i) = S_{R,τ}(i), where τ denotes the smallest positive value of t such that Y_t = j, j being a "target" state at which the process is stopped. We refer to this as an optimal first-passage problem. It should be noted that τ is a random variable so that σ_R(i) is the expected value of a random sum
of random variables. For an obvious generalization, j need not refer to a single state, but may be a class of states. A third problem to be dealt with is that involving the discounted cost criterion

    Ψ_R(i, α) = E_R Σ_{t=0}^{∞} α^t W_t,

where 0 ≤ α < 1 (the discount factor) is a given number. We shall be interested in finding R to minimize Ψ_R(i, α). A fourth problem arises from the expected average cost per unit time criterion:

    φ_R(i) = lim_{T→∞} S_{R,T}(i)/(T + 1).
For some policies R, the limit of the above expression may not exist. In those cases we shall deal with the upper or lower limit, whichever seems appropriate. We shall be interested in finding a policy R to minimize φ_R(i). Several other problems will be treated as well. We may have other criteria to minimize or we might be interested in minimizing φ_R(i) subject to certain side constraints. In Chapter 7 we hope to develop the theory so that these problems can be dealt with; in Chapter 8 a stopping problem is considered.
Hierarchical Classification of Policies

In each case we shall be concerned with three questions: existence, structure, and computational procedures. With regard to each of these it is convenient to introduce a hierarchical classification of the totality of possible policies. We let C denote the class of all policies under consideration; that is, those with possible dependency on the complete history of the system. We let C_M denote the class of all memoryless or Markovian type policies. That is, C_M consists of all policies R such that D_a(H_{t−1}, Y_t) is a function only of Y_t, t, and a. When R ∈ C_M, then {Y_t, t = 0, 1, ...} is a Markov chain, not necessarily stationary. We let
C_S denote the class of all Markovian policies which are time invariant. That is, C_S consists of all policies R such that D_a(H_{t−1}, Y_t) are functions only of Y_t and a. Let D_{ia} = D_a{H_{t−1}, Y_t = i} when R ∈ C_S. Then {Y_t, t = 0, 1, ...} is a Markov chain with stationary transition probabilities

    p_{ij} = Σ_a D_{ia} q_{ij}(a),  i, j ∈ I.
Finally, we let C_D denote the subclass of C_S consisting of the deterministic policies. That is, R ∈ C_D whenever D_{ia} is 0 or 1 for every i ∈ I. In this case we can think of R as defining a single-valued transformation from the states to the actions; that is, when R ∈ C_D, to each state i there corresponds an action a_i, among the possible actions K_i, such that R prescribes action a_i when the system is in state i. Accordingly, when convenient and R ∈ C_D, we shall employ the notation w_{iR} and q_{ij}(R) to denote w_{ia_i} and q_{ij}(a_i). Since the class C of all policies is infinite, the question of existence of an optimal policy will be important for each problem considered. In all problems dealt with here, we shall want to assert that not only does an optimal policy exist but that one is also a member of C_M, C_S, or C_D. In other words, we shall want to say something about the structure of at least one of the optimal policies. In certain special cases, perhaps, more can be said about structure. When the structure is such that an optimal policy is a member of C_M, C_S, or C_D, then frequently, a computational procedure can be obtained for its determination.
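The stationary classes lend themselves to a direct computational representation: a policy in C_S is just a matrix of action probabilities per state, a policy in C_D is the special case of 0-1 rows, and the induced chain p_{ij} = Σ_a D_{ia} q_{ij}(a) follows by one contraction. The sketch below uses a small invented two-state example only to make the definitions concrete.

```python
import numpy as np

# Illustrative data: 2 states, 2 actions; q[a][i][j] is the law of motion.
q = np.array([[[0.5, 0.5],     # action 1
               [0.9, 0.1]],
              [[0.2, 0.8],     # action 2
               [0.4, 0.6]]])

# A policy in C_S: D[i][a] = probability of taking action a in state i.
D_stationary = np.array([[0.3, 0.7],
                         [1.0, 0.0]])

# A policy in C_D: each row of D is a unit vector.
D_deterministic = np.array([[0.0, 1.0],
                            [1.0, 0.0]])

def induced_chain(D, q):
    """Stationary transition matrix p_ij = sum_a D_ia q_ij(a)."""
    return np.einsum('ia,aij->ij', D, q)

print(induced_chain(D_stationary, q))
print(induced_chain(D_deterministic, q))
```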
Bibliographical Remarks

The title of this book might well have been called "Dynamic Programming," or better, "Discrete Dynamic Programming" as used by Blackwell [6]. The descriptive phrase "Markovian Decision Process" is due to Bellman [2], and because of the connections of the material treated herein with Markov chains, we prefer the latter description. The Markovian decision model in recent years has been the subject of an increased amount of research activity. Early papers, in a special
context by Bellman and LaSalle [4], Bellman and Blackwell [5], and later, more generally, by Shapley [48], were among the first formulations of the model in the context of two-person dynamic games. Its first explicit formulation outside the game context is given by Bellman [2]. The model has a large number of parameters and is readily adaptable to many dynamic systems as a descriptive model. Although some of the methods of dynamic programming such as backward induction and the method of successive approximations predate the formal conception of dynamic programming and, in particular, the Markovian decision model, it was not until computational breakthroughs by Howard [34], Manne [46], and D'Epenoux [14], some seven years or so after Shapley's [48] treatment, that interest in the model increased and an awareness of its potential usefulness developed. In the Shapley [48] two-person stochastic games model, as in the earlier papers by Bellman and LaSalle [4] and Bellman and Blackwell [5], the process {Y_t, t = 0, 1, ...} is controlled by two sets of simultaneous actions. The "laws of motion" are in the form of numbers {q_{ij}(a, b), a ∈ K_i', b ∈ K_i'', i ∈ I, j ∈ I}, where K_i' and K_i'', i ∈ I, are sets of possible actions for a "player I" and a "player II" at state i. That is, if the process is in state i and player I takes action a and player II takes action b, then the probability is q_{ij}(a, b) that the next period will find the process in state j. The costs in this model are of the form w_{iab}, to be interpreted as the cost to player I and the gain to player II when the process is in state i and player I takes action a and player II takes action b. Thus, the process under consideration in this volume concerns the special case where player II has only one available action at each state.
Problems

(1) Derive the form of the laws of motion, the q_{ij}(a)'s, for the example of the inventory system under periodic review when backlogging is not permitted.
(2) For an inventory system under periodic review for the cases of backlogging and no backlogging, construct the forms of the costs {w_{ia}}.

(3) For the inventory system under periodic review construct a policy R that belongs to C_D; to C_S − C_D; to C − C_S.
2 Finite Horizon Expected Cost Minimization
Dynamic Programming

This chapter is concerned with the determination of that policy R ∈ C which minimizes S_{R,T}(i) = E_R Σ_{t=0}^{T} W_t, where the horizon T and initial state i are given. We will show that a backward induction method, which is the essence of dynamic programming, provides a computational procedure for obtaining the optimal policy. Although I is assumed to be
finite, the method of this chapter holds for countable I as long as the costs {w_{ia}} are such that E_R W_t is well defined for all R and t. Let us denote by V_n(R, j, h_{n−1}), 0 ≤ n ≤ T, the conditional expected total cost of a process from time t = n to time t = T given the history H_{n−1} = h_{n−1}, Y_n = j, and policy R; that is,

    V_n(R, j, h_{n−1}) = E_R{Σ_{t=n}^{T} W_t | H_{n−1} = h_{n−1}, Y_n = j}.

When n = 0, we have, since there is no history, V_0(R, i) = S_{R,T}(i).
Set

    V_n*(i, h_{n−1}) = inf_{R∈C} V_n(R, i, h_{n−1}),  0 ≤ n ≤ T.

Then, provided V_0*(i) = min_{R∈C} V_0(R, i), the value of S_{R,T}(i) is V_0*(i) when an optimal policy is used. It will be seen that the minimum over all R ∈ C is attained. The method of backward induction employs a recursion formula by which V*_{n−1} can be expressed in terms of V_n* for n = T, T − 1, ..., 1, thereby achieving the value V_0*(i). At the same time the optimal policy is perceived. We first prove:
LEMMA 1. For every H_{n−1} = h_{n−1}, V_n*(i, h_{n−1}) = V_n*(i), n = 1, ..., T, i ∈ I; that is to say, V_n*(i, h_{n−1}) is independent of h_{n−1}.
Proof: Fix i. Let v*(i) = inf_{R, h_{n−1}} V_n(R, i, h_{n−1}). Let ε > 0 be given arbitrarily. Let R_0, h^0_{n−1} be such that

    V_n(R_0, i, h^0_{n−1}) < v*(i) + ε.

Define R_1 as follows:

    D_a^{R_1}(H_{t−1}, Y_t) = D_a^{R_0}(H_{t−1}, Y_t),  t = 0, ..., n − 1,
                            = D_a^{R_0}(h^0_{n−1}, Y_n, ..., Y_t),  t = n, ..., T;

that is, R_1 is the same as R_0 for t = 0, ..., n − 1, but, for t = n, ..., T,
R_1 prescribes actions as if policy R_0 were in effect and the history H_{n−1} = h^0_{n−1} had been observed up to time n − 1. Then, for every H_{n−1} = h_{n−1},

    V_n(R_1, i, h_{n−1}) = V_n(R_0, i, h^0_{n−1}) ≤ v*(i) + ε.

Therefore, since ε > 0 is arbitrary, V_n*(i, h_{n−1}) ≤ v*(i). On the other hand, by definition of v*(i), we have that

    V_n(R, i, h_{n−1}) ≥ v*(i)

for every R and h_{n−1}. Hence, V_n*(i, h_{n−1}) = v*(i) independent of h_{n−1}, and the lemma is proved.

For the following theorem, we set V*_{T+1}(i) = 0, i ∈ I. We prove:
THEOREM 1. If R* is defined as a policy which at time n takes action a_i* (a function of n) satisfying

    w_{ia_i*} + Σ_j q_{ij}(a_i*) V*_{n+1}(j) = V_n*(i),  n = 0, 1, ..., T,

for i ∈ I, then V_n(R*, i, h_{n−1}) = V_n*(i) for i ∈ I and n = 0, ..., T. In particular, R* is optimal for minimizing S_{R,T}(i).
Proof: We will use backward induction on n. Suppose n = T. Then for any R and h_{T−1},

    E_R(W_T | Y_T = i, H_{T−1} = h_{T−1}) = Σ_a D_a^R(h_{T−1}, i) w_{ia}
                                          ≥ min_a {w_{ia}}
                                          = V_T(R*, i, h_{T−1}).

Since, in fact, V_T(R*, i, h_{T−1}) is independent of h_{T−1}, we have that it is equal to V_T*(i). Now assume that V_t(R*, i, h_{t−1}) = V_t*(i) for t = n + 1, ..., T. We shall show that the same holds for t = n. For any R and h_{n−1},

    V_n(R, i, h_{n−1}) = Σ_a D_a^R(h_{n−1}, i)[w_{ia} + E_R(Σ_{t=n+1}^{T} W_t | Y_n = i, A_n = a, h_{n−1})]
                       ≥ Σ_a D_a^R(h_{n−1}, i)[w_{ia} + Σ_j q_{ij}(a) V*_{n+1}(j)]
                       ≥ min_a [w_{ia} + Σ_j q_{ij}(a) V*_{n+1}(j)]
                       = V_n(R*, i, h_{n−1}).

The first inequality follows from Lemma 1 and the last equation follows from the induction assumption. The right-hand side is again independent of h_{n−1}. Hence, V_n(R*, i, h_{n−1}) = V_n*(i), i ∈ I. This proves the theorem.
COROLLARY 1. V_n*(i) = min_a {w_{ia} + Σ_j q_{ij}(a) V*_{n+1}(j)}, i ∈ I, n = 0, 1, ..., T.

Proof: The equations follow from the fact that V_n*(i) = V_n(R*, i, h_{n−1}), for i ∈ I and n = 0, 1, ..., T, and the last equality of the proof of the theorem.

COROLLARY 2. R* is a member of C_M.

Proof: This is apparent from the definition of R*.

The defining equations of R* of Corollary 1 are known as the functional equations of dynamic programming. They provide a simple but extremely useful recursive scheme for obtaining the optimal policy as long as the state space I is not too large. They also express what is
commonly referred to as the "principle of optimality," which asserts that an optimal policy for minimizing S_{R,T}(i) must also minimize V_n(R, i, h_{n−1}) for every n = 0, 1, ..., T.
Computational Example

Suppose I = {0, 1}; K_i = 2, i = 0, 1, with costs {w_{ia}} and laws of motion {q_{ij}(a)} as given. We want to find R to minimize S_{R,2}(i), i = 0, 1, for T = 2. First, we calculate V_2*(i), i = 0, 1:

    V_2*(0) = min{w_{01}, w_{02}} = 0,
    V_2*(1) = min{w_{11}, w_{12}} = 2,

keeping in mind that a_0*(2) (the optimal action taken at t = 2 when in state 0) is 2 and that a_1*(2) = 1 or 2. Now

    V_1*(0) = min_a {w_{0a} + q_{00}(a)V_2*(0) + q_{01}(a)V_2*(1)} = 2

and

    V_1*(1) = min_a {w_{1a} + q_{10}(a)V_2*(0) + q_{11}(a)V_2*(1)}.

Here a_0*(1) = 1, a_1*(1) = 2. Then

    V_0*(i) = min_a {w_{ia} + q_{i0}(a)V_1*(0) + q_{i1}(a)V_1*(1)},  i = 0, 1,

with a_0*(0) = 1, a_1*(0) = 2. Therefore, the optimal policy with respect to minimizing S_{R,2}(i) is to take actions 1, 1, and 2 when in state 0 at times t = 0, 1, 2, respectively, and to take actions 2, 2, and 1 or 2 when in state 1 at times t = 0, 1, 2, respectively.
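The backward induction above is mechanical enough to state as a short program. Since the numerical cost and transition data of this example did not survive reproduction here, the sketch below uses invented two-state data of the same shape; only the algorithm, the functional equations of Corollary 1, is taken from the text.

```python
def backward_induction(states, actions, w, q, T):
    """Return V_n*(i) and minimizing actions a_i*(n) for n = 0, ..., T, using
    V_{T+1}* = 0 and V_n*(i) = min_a {w[i][a] + sum_j q[i][a][j] V_{n+1}*(j)}."""
    V = {i: 0.0 for i in states}            # V_{T+1}*
    values, policy = [None] * (T + 1), [None] * (T + 1)
    for n in range(T, -1, -1):
        Vn, an = {}, {}
        for i in states:
            best = min(actions[i],
                       key=lambda a: w[i][a] + sum(q[i][a][j] * V[j] for j in states))
            an[i] = best
            Vn[i] = w[i][best] + sum(q[i][best][j] * V[j] for j in states)
        V, values[n], policy[n] = Vn, Vn, an
    return values, policy

# Invented data for illustration (not the book's table):
states = [0, 1]
actions = {0: [1, 2], 1: [1, 2]}
w = {0: {1: 1.0, 2: 0.0}, 1: {1: 2.0, 2: 2.0}}
q = {0: {1: {0: 0.5, 1: 0.5}, 2: {0: 0.25, 1: 0.75}},
     1: {1: {0: 0.75, 1: 0.25}, 2: {0: 2/3, 1: 1/3}}}

values, policy = backward_induction(states, actions, w, q, T=2)
print(values[0], policy)   # V_0*(i) and a_i*(n) for n = 0, 1, 2
```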
Pathologies

When the state and action spaces are noncountably infinite, one may encounter technical difficulties which restrict the universal validity of the backward induction method. These difficulties arise in pathological cases where not all policies possess an associated E_R W_t defined for all t; only those policies which satisfy certain measurability conditions can be so evaluated. Thus, the use of the criterion of mathematical expectation or, indeed, the assumption that {Y_t} be a stochastic process has the effect of imposing subtle constraints on the space of possible policies, which in turn raises the possibility that the functional equation approach has some flaws. In order to appreciate how constraints on acceptable policies can invalidate the dynamic programming procedure, even in the finite state and action case, it should be noted that the imposition of the gross restriction that R be a member of C_S nullifies the fact that the optimal policy will satisfy the functional equations.
Bibliographical Remarks

For earlier and fuller expositions of the material of this chapter one should refer to Bellman [3]. So intuitive is the dynamic programming or backward induction approach that one rarely encounters a formal proof that the method yields an optimal policy; hence, the proof is given here, despite the fact that for discrete state spaces and finite actions at each state the procedure is clearly correct. The pathology alluded to when state and action spaces are noncountable was revealed by Blackwell [7].
Problem
3 Some Existence Theorems
Summary

In this chapter we shall prove the existence of optimal policies in the class C_D for the expected discounted cost, expected average cost, and first-passage problems. The criterion in the expected average cost problem is at first φ_R(i) = lim sup_{T→∞} S_{R,T}(i)/(T + 1). We then obtain the same result for the problem based on the lower limit definition of φ_R(i). Our method here involves discussion of the discounted cost
problem first and then, using the results obtained together with elementary Abelian theorems, proving existence theorems for the other two problems.

Expected Discounted Cost Problem
Our basic approach to existence in the discounted case is to first establish that Ψ_R(i, α) for fixed i and α, 0 ≤ α < 1, is a continuous function of R and that C is a compact space. Thus an R ∈ C minimizing Ψ_R(i, α) must exist. From there, we establish that there exists an R ∈ C_D minimizing Ψ_R(i, α). We say a sequence {R_n, n = 1, 2, ...} of policies converges to a policy R if for every a, y_t, h_t, t = 0, 1, ..., lim_{n→∞} D_a^{R_n}(h_t, y_t) = D_a^R(h_t, y_t).
We say a class of policies is compact if for every sequence of policies {R_n, n = 1, 2, ...} there exists a subsequence {R_{n_k}, k = 1, 2, ...} that converges to a policy in the class. We first prove:
LEMMA 1. The class C is compact.

Proof: For every H_t = h_t, Y_t = y_t, the space D(h_t, y_t) = {D_1(h_t, y_t), ..., D_{K_{y_t}}(h_t, y_t)} is compact since K_i is finite for every i ∈ I. By Tychonov's theorem (Theorem 3 of Appendix B), the product space ∏_{h_t, y_t, t} D(h_t, y_t) is also compact. However, every point in this product space is by definition a policy and every policy corresponds to a point in the product space. Hence, the product space is the space C. Thus C is compact.

The temptation is to assert that Lemma 1 holds for any state space I and compact action spaces. However, because of the measurability constraints alluded to in Chapter 2, C, being the class of all policies for which the stochastic processes {Y_t, A_t, t = 0, 1, ...} and expectations E_R W_t, t = 0, 1, ... are well defined, may be a proper subclass of the product space ∏_{h_t, y_t, t} D(h_t, y_t) and therefore may not be compact.
LEMMA 2. Let t be arbitrary and H_t = h_t = {Y_0 = i, A_0 = a_0, ..., Y_t = i_t, A_t = a_t} be given; then P_R{H_t = h_t | Y_0 = i} is a continuous function of R.

Proof: For t = 0, P_R{H_0 = h_0 | Y_0 = i} = D_{a_0}^R(Y_0 = i), and hence the assertion is true for t = 0. Assume it true for t = 0, ..., T. Then since

    P_R{H_{T+1} = h_{T+1} | Y_0 = i} = P_R{H_T = h_T | Y_0 = i} q_{i_T, i_{T+1}}(a_T) D_{a_{T+1}}^R(h_T, Y_{T+1} = i_{T+1}),

we have by induction that the assertion is true for t = T + 1 and the lemma is proven.
=
C P R { K = j,
A, = a, I Yo = i,
x PR{H,-1 =
I Yo= i}
ht-1
H,-l
=
h,-l}
Cqi,-,,j(ar-,)D,Rt(h,-i, Yr=j).PR(Hf-l =ht-lIYo=i}*
ht-1
Since there is only a finite number of ht-17s, Lemma 3 follows from Lemma 2. We remark that if I is countable so that a countable number of histories h, exist for each t , Lemma 3 can still be established.
3 Some Existence Theorems
22
LEMMA 4. E R W , , t = 0, 1,. . . and YR(i, a) for a given 0 5 a < 1 are continuous functions of R.
Proof: For a given t, E R W, is a finite linear combination of terms PR{yt = j , A , = a I Yo = i>. Thus from Lemma 3, E R W, is continuous. Similarly YR(i,T ) = Since
yR(i,
T
r=o
a'
ER
W, is continuous for every T = 0, 1, ,. , .
a) = lim YR(iy a, T) uniformly in
R, YR(i,a) is also con-
T+CO
tinuous.
We are now in a position to prove:
5. Let Yo = i be given. There exists an R* LEMMA yR*(i,
a) = inf YR(i,a),
E
C such that
i EI .
ReC
Proof: Let v,(a) =
Ci B i y R ( i , a)
3
where p i , i E Z are given positive numbers. Since by Lemma 1, C is compact and by Lemma 4,y R ( i , a) is a continuous function of R, then y R ( a ) is also a continuous function of R and hence, from the wellknown fact (Theorem 2 of Appendix B) that continuous functions over compact spaces achieve their extremes, YR(a)is minimized by a policy R* E C.However, R* must also minimize YR(i,a) for each i E I, otherwise a different policy could easily be constructed which would yield a smaller value for YR(u). Define a, i E I , as those actions which satisfy Wio,
+a i
qij(ai>YR*O',
Expected Discounted Cost Problem
23
If ai is not uniquely defined, let it be any one of the several actions satisfying (1). Let R, be defined as that policy which takes action ai when the system is in state i, i E Z. Here, Ro is a member of C , . We now prove
Proof: For each n
= 1,2,
... let R, be a policy defined as follows : for every H , -= ~I t t - ,
D~(/Z~ Y,-=~i ), = 1
and t = 0 , 1 , and
..., n - 1
D , R ~ ( H ~ - ,,x) = D ? ( H ~ ? ~ ,~ j ? ~ ) for
where
up, = r,, H,",
t
2n,
A:)n = A , ,
. .. , Yp,, A!?"}.
= { Y p ,At',
In words, for t = 0, . ..,n - 1, R, prescribes action a, whenever the system is in state i. Thereafter it prescribes according to R* as if the process started anew at time t = n. Now, from Lemma 1 of Chapter 2, E R t(=g1 a ' W , I Y o = i , & = a ,
, --j
=infE,
}>ReC
co
Cd&1Y1=j
(1=1
for all R E C; therefore, for each i E Z we have Y R * ( i ,a ) = C D ~ ( = Y i)w, ~ a
+ a C 1DY(YO = i)qij(a) i
s
(equation continued)
3 Some Existence Theorems
24
= YR,(i,a).
Consequently, since YR*(i, a) is minimal we have that the equality holds. By repeated iteration it follows that YR*,(i, a) = YRn(i, a) for n = 1,2, . .. and all i E I. Since YR(i,a) is a continuous function of R, and {R,, n = 1,2, .. .} converges to R,, it follows that YR*(i, a) = 'PRO(& a), i E I . This establishes the optimality of R, and also that YRo(i, a) satisfies (2). Uniqueness will be shown in Corollary 1 of Theorem 1 of Chapter 4.
COROLLARY 1. There exists an E CD such that for each i E I , Y ; R (a) ~= , inf YR(i,a) for all a near enough to 1. ReC
Proof: From Theorem 1 we need only show that
YR(i,a) = inf YR(i,a) RECD
for all a near enough to 1. However, for any R E CD, it is easily seen that
from which it follows (see Theorem 3 of Appendix A) that YR(i,a) is a rational function of a for 0 5 a < 1. Let (a,,n = 1, 2, . . .} be a sequence such that lim a, = 1 and R,,= R,, = ... = (say) where n+m
RanE C, minimizes y R ( i 9 a,). Such a sequence can be chosen since C,
Expected Average Cost Problem
25
is a finite set. Since the difference Y R ( i , a) - YR(i,a) is also rational it is either identically zero or has, at most, a finite number of zeros. Thus,
there is an interval (a(R,i), 1) for which YR(i,a) - YR(i,a) >= 0 for all G( E (a(R,i), 1). Let d = max a(R, i). Then for a > d , R E C,, we have R,i
YR(i,a) 2 YR(i,a) and the corollary is proved.
Expected Average Cost Problem

We now turn to the problem of finding R ∈ C to minimize φ_R(i)
+ 1)
= lim SUP S R , T(i)/(T T+
m
the expected average cost per unit time over an infinite time horizon, given the initial state Yo = i. Since lim SR,T(i)/T+1 does not exist in T-tm
general, we define 4 R ( i ) by the upper limit. However, Corollary 1 to Theorem 2 below will treat the case where (PR(I') is defined by the lower limit. We prove : THEOREM 2. There exists a policy R*
4R*(i)=Rinf &R(i), EC Proof: Let R*
E C,
E
C, such that
iEI .
be such that
Yp(i,a) = inf YR(i,a), REC
i EI,
for every ci near enough to 1. Corollary 1 to Theorem 1 guarantees the existence of such a policy R*. We shall now show that R* is an optimal policy for the criterion 4 R ( i ) . From Theorem l(b) of Appendix B, since 4R(i)= lim S R , T ( i ) / T + 1 when R E C , (a consequence of T- m
Theorem 1 of Appendix A), we have &(i)
= lim(1 - a)YR.(i, a); a+ 1
3 Some Existence Theorems
26
from Theorem l(c) of Appendix B we have for all R E C that sR,T(i) lim sup(1 - a)YR(i,a) 5 lim sup a+ 1 ~-+m T+1'
EI,
a) 5 YR(i,a) for all a near enough to 1 Using the fact that YR8(i, combined with the two above inequalities yields
and the theorem is proved. We remark that more than one optimal policy may exist. In our construction of the proof we showed that the policy R*,which is optimal with respect to YR(i,a) for every a near enough to 1, is also optimal with respect to 4R(i).However, not every policy that is optimal with respect to 4R(i)will be optimal with respect to YR(i,a) for every a near enough to 1. The following example demonstrates this. Let I consist of the states 0 and 1. Let KO = 2, K, = 1, where qoo(l) = /3 > 0, q0,(2) = I, qI1(1)= 1, wol= 1, woz = 0, wll= 0. Here, C, contains two policies. Let R1 denote the policy in C, which takes action 1 in state 0 and R,, which takes action 2 in state 0. Clearly, m
4R,(0)= 4R2(0).However, YR,(O,a) = C (up)',
YR,(O,
t=O
a) = 0. Thus
R,
is a better choice than R, with respect to the discounted cost criterion. For more on this subject see the bibliographical notes of Chapter 6. Suppose we define 4R(i)= lim inf SR,T(i)/(T+1). As a consequence T-r m
of Theorem 2 we have:
COROLLARY 1. There exists a policy R* (the same policy as in Theorem 2) such that 4R*(i)
when
( b R ( i ) iS
= inf
ReC
4R(i),
iEI ,
defined as lim inf SR,T(i)/T+1. T+ m
Expected Average Cost Problem
27
Proof: If R is such that lim S R , T ( i ) / T T+ co
+ I exists for every i E I , then
by Theorem 2 $Rr(i)5 &(i), i E I. Suppose that for every i and j there exists a policy R j i E C , such that PRjr{Y,= i for some t 2 1 I Yo = i} = 1, and m j i , the mean first-passage time from j to i under R j i , is finite. Starting with an arbitrary policy R, define l? as follows: use policy R for times t = 0, .. . , T ; then if Yo = i and YT = j , use policy Rji until Y, = i for the first time after T, after which use policy R as if starting from t = 0. Repeat this construction; that is, each time policy R j i , depending on the state j , returns the process t o state i, use policy R as if starting from t = O for T + 1 units of time, then use policy Rii wherej is the state of the process at the time of the switch in policies. The process { Y,} under l? is a recurrent event process (see Appendix B) and by Theorem 6, Appendix B, lim SR,T(i)/(T + 1) T+m
exists. Hence, $R.(i) S $a(i).Now let us suppose there exists a policy R such that $R(i) < C#IR*(i) for some i. Then there will be an E > 0 and a subsequence {T,,,v = 1,2, .. of T values such that
.>
Let us adjoin to K j for each j # i an action a" such that qji(6) = I with cost wjZ= w to be assigned. Let i? be defined as above with R j i the policy which takes actiona"in statej and T = T,,for some v to be assigned. By Theorem 6 , Appendix B,
Choose w large enough so that $ ~ ~ * (= i ) min C#IR(i); RECD
that is, if w is large enough so that any policy R E C , using action a" is too costly, the policy R* being originally optimal will again be optimal
3 Some Existence Theorems
28
among all policies in the enlarged class C,. But also choose v large enough so that
after which we have
φ_R(i) < φ_{R*}(i), a contradiction of Theorem 2. This proves the corollary.

Remark: That Theorem 2 and its corollary are true might seem intuitively obvious. This, however, does not stem alone from the Markovian structure we have assumed. For when the state space I is allowed to be countable, Theorem 2 does not hold; moreover, optimal policies may not exist or, when they do exist, they may not be members of C_S or C_D.

First-Passage Problem

We now consider the optimal first-passage problem. Let us first assume that w_{ia} ≥ 0 for all a ∈ K_i and i ∈ I. We let j = 0 denote a given target state. Without loss of generality, we can take w_{0a} = 0 and q_{00}(a) = 1 for all a ∈ K_0. For since j = 0 is the target state and only costs associated with reaching the target state are relevant, this assumption will not affect the solution to the problem. Then, if Y_0 = i is the initial state and τ denotes the smallest positive integer t such that Y_t = 0, then
~ ( ~ 1
m
where W, are nonnegative random variables. By setting woo = 0 and
mt Passage Problem
29
qoo(a)= 1, we are able to remove the random variable z in the upper
limit of the definition of oR(i). We now prove:
THEOREM 3. If (wia} are nonnegative, then there exists an R* E CD such that gR*(i)
= inf
iEI
cR(i),
REC
Pruufi Consider first m
2 CI'W,. m
= ER
t=O
The interchange of expectation and summation is justified since T
W
t=O
t=O
C a'Wt converges uniformly to
a'Wt for 0 S
CI
c 1. By Corollary 1
to Theorem 1 there exists a policy a E CDsuch that YR(i,a) 5 y R ( i , a) for all i E I , R E C, and a near enough to 1. Since for every R E C, lim YR(i,a) = U'1
m
C
f=O
ERWt (the limit may equal a),we have from
Theorem 4(a) of Appendix B that
which in turn is
= oR(i)
The theorem is proved.
.
3 Some Existence Theorems
30
In relaxing the condition of nonnegativity on all costs, we need an hypothesis of the sort that PR(Yo= 0 for some t 2 1 I Yo = i} = 1 for i c l for all R E C,. Such a condition guarantees that uR(i)is well defined. Without such a condition an infinite path through positive and negative cost states could give rise to an indeterminate cost criterion. We now state and prove Theorem 4, a version of Theorem 3 without the nonnegativity assumption. In Chapter 7 a proof by other methods will also be given for this case. THEOREM 4. If PR(Yt= 0 for some t 2 1 I Yo = i} = 1 for all i and R E C, , then the conclusion of Theorem 3 holds.
~
Proof: Under the hypothesis, the mean first passage miRfrom state i to 0 is finite for every i E I and R E C, (see Theorem 6 of Appendix A). Now let wig = JwiaJand, by Theorem 4(a) of Appendix B, oR’(i)= E R
m
z=o
W,l
m
=CERWf‘, z=o
where Wz’ = wia if Yz = i, A , = a , t = 0, 1, ... . By Corollary 1 of Theorem 1 (with maximum replacing minimum) there exists an R* E C, such that i E I, R E C
Yk,(i, a) 2 Yi(i, a),
for all a sufficiently close to 1. However,
m
= lim
C Q‘ER w,l W
a-tl f=O
(equation continued)
l
First Passage as a Discounted Cost Problem
= ER,
31
C W,’
I maxi I wid ] rn,R+} i.
a
< co, Thus, since IWtl that
=
forevery
ieI
and R E C .
Wt’, it follows from Theorem 4(b) of Appendix B
m
for every i E I , R E C. Now, again using Corollary 1 to Theorem I , there exists an R* E C, such that for all i E I , R E C, and c1 near enough a). Hence, using Theorem l(a) of Appendix B, to 1, YR(i,ct) 2 YR*(i, cRI( i) = lim a-0
5 lim
m
1ct‘ER*W,
t=O
cc
1ct‘ERW,
a+l t = O
= cR(i),
i #j,
R EC.
This proves the theorem.
First Passage as a Discounted Cost Problem

Consider the special case where q_{i0}(a) = 1 − α for all a ∈ K_i, i ∈ I − {0}, where again q_{00}(a) ≡ 1, w_{0a} = 0. Then, for any R ∈ C,
3 Some Existence Theorems
32
But
m
=
C ER(WtI t > t , r=o
Yo = i}P(t > t }
However,
E { W r l t > t , Yo = i} Thus, cR(i),i E I- { 0 } , is equivalent to YR(i,a), i E I - { 0 } , for a Markovian decision process based on the state space I’ = I - {0}, with laws of motion qij(u) = qt(u)/a, u E Ki , i, j E 1’,and costs w:, = wia , u E Ki ,i E 1’.Or, let YR(i, a) be a discounted cost criterion over a state space I , with laws of motion {qij(u)}. Define a fictitious state “0” and adjoin it to I . Define laws of motion by qio(u)= 1 - a, i E I, qij(u) = aqij(u),u E Ki , i, j E I, with qoo(u)E 1. Then, YR(i,a) is equivalent to oR(i),i E I , where aR(i) is the expected cost of a first passage to the state “0” using the laws of motion {qLj(u)}.
Bibliographical Remarks
Using essentially the same method, Derman [17] proved Theorem 1 for the case of a denumerable state space I. The idea of using Tychonov's theorem in this connection is due to Karlin [35]. Another proof of
Theorem 1 holding for a denumerable state space is given by Blackwell [7]. Corollary 1 to Theorem 1 is due to Blackwell [6]. The proof of Theorem 2 employs techniques used in Derman [15] and Gillette [33]. Gillette [33] employs an incorrect (see Liggett and Lippman [42]) extension of a theorem by Hardy and Littlewood while working with φ_R(i) defined by lim inf_{T→∞} S_{R,T}(i)/T. The proof of Corollary 1 to Theorem 2 given here is a modification of the one given by Derman [18].
For more on the remark following Theorem 2 see Derman [19] and Fisher and Ross [32]. Theorem 3 was proved by Derman [15]. In connection with Problem 2 following, Veinott [53] provides an alternative proof. His proof shows that the limit converges geometrically. This result in Problem 2 is used in the proof of Theorem 1 of Chapter 5.
Problems

(1) Is φ_R(i) always a continuous function of R?

(2) Suppose q_{00}(a) = 1 for every a ∈ K_0 and that P_R{Y_t = 0 for some t > 0 | Y_0 = i} > 0, i ∈ I, R ∈ C_D. Show that

    lim_{t→∞} sup_{R∈C} P_R{Y_t ≠ 0 | Y_0 = i} = 0,  i ∈ I.
Solution: By Theorem 4 and Theorem 6 of Appendix A, on maximizing the expected first-passage cost with w_{ia} = 1, a ∈ K_i, i ≠ 0, there exists an R* ∈ C_D such that

    m_i* = max_{R∈C} m_i^R < ∞,

where m_i^R denotes the mean first-passage time to 0 from state i. But by Chebyshev's inequality (P{X > c} < EX/c if X is a nonnegative random variable), for every R ∈ C, i ∈ I,

    P_R{Y_t ≠ 0 | Y_0 = i} = P_R{(first-passage time to 0) > t | Y_0 = i} ≤ m_i^R/t ≤ m_i*/t.

Thus, for i ∈ I, sup_{R∈C} P_R{Y_t ≠ 0 | Y_0 = i} ≤ m_i*/t and

    lim_{t→∞} sup_{R∈C} P_R{Y_t ≠ 0 | Y_0 = i} = 0,  i ∈ I.
4 Computational Methods for the Discounted Cost Problem
Introduction and Summary

In the previous chapter we showed that solutions exist to the problems of minimizing the expected discounted cost and the expected average cost criterion as well as to the problem of minimizing the mean first-passage cost. In this chapter and Chapters 5 and 6, we present methods by which the optimal policies actually can be obtained.
Although it was shown that optimal policies for each problem exist in C_D, the problem of finding them is nontrivial. In spite of the fact that, given R ∈ C_D, the cost criteria Ψ_R, φ_R, and σ_R can be evaluated, the number of policies in C_D may be astronomically large. For example, if I contains N states and K_i = 2, i ∈ I, then C_D contains 2^N different policies. For small values of N the method of simple enumeration is feasible; however, for N moderately large, complete enumeration is virtually impossible. Nevertheless, modern computational methods have overcome problems of this sort. For example, in solving the linear programming problem, the space of possible solutions that would have to be searched if the problem were to be solved by enumeration would usually be a large finite number; however, computational methods (the simplex method is one) have been developed which select an optimal solution without the need for complete enumeration. We shall show here that comparable methods exist for obtaining optimal policies. In fact, linear programming computational procedures, among others, can be employed. In this chapter we discuss the problem of minimizing the expected discounted cost criterion Ψ_R. We present three approaches to obtaining the optimal policy: the method of successive approximations, policy improvement, and linear programming. The first is the classical method used in solving differential and integral equations. In itself, it does not provide a method for obtaining a solution in a finite number of iterations; however, slightly modified, it can. More significantly, this method has some uses in determining structure of optimal policies. Both the policy improvement and linear programming methods are finite algorithms and are feasible, provided the size of I is not too large.

Method of Successive Approximations

Let {v_0(i), i ∈ I} denote an arbitrary set of values. Define for n = 0, 1, ..., i ∈ I,

    v_{n+1}(i) = min_a {w_{ia} + α Σ_j q_{ij}(a) v_n(j)},    (1)
where 0 ≤ α < 1 is fixed. We have:
THEOREM 1. If {v_n(i), i ∈ I, n = 1, ...} is defined by the transformation (1) with {v_0(i), i ∈ I} arbitrary, then lim_{n→∞} v_n(i) = Ψ_{R_0}(i, α), i ∈ I, independent of {v_0(i), i ∈ I}, where R_0 is a policy that minimizes Ψ_R(i, α), i ∈ I.
Proof: Let {uo(i), i E I } be arbitrary and uo'(i) = YR,,(i, a), i E I . We first show that n = 0, 1, . . . .
maxlu,+,(i) - uL+l(i)/ S a maxlu,(i) - u,,'(i)l, is1
is1
Let ai',i c I , be the action which minimizes the right-hand side of Eq. (l), where in (1) u,,(i) is the nth iterate of uo'(i). Then
Ia 1qij(ai') maxIu,(j) i
i
= a maxlu,(j) i
- u,,'(j)l,
- u,,'(j)I i E 1.
Similarly, on letting a , , i E I , be the action which minimizes the righthand side of Eq. (l), where in ( 1 ) u,(i) is the nth iterate of uo(i), we obtain uA+ l(i)
- on+ l ( i ) 5 a maxIu,(j) - u,,'(j)l, i
i
E I.
Hence, we have shown the inequality. Now, by iteration, we obtain that ~ ~ ~ + ~ ( i ) - u $anmaxluo(j)-u,'(j)l, ~ + ~ ( i ) ~
Thus, lim (uA(i) - u,(i)) n-rm
= 0, i E I .
i e I , n=0,1,
However, from Theorem
.... 1 of
Chapter 3 (without using uniqueness) we have that un'(i) = YRo(i, a), i E I, n = 0, 1, .... Hence, lim un(i) = YRo(i, a), i E I. Since uo(i), i E I , n-rm
was arbitrarily selected, the theorem is proven.
4 The Discounted Cost Problem
38
The proof of the uniqueness part of Theorem 1 of Chapter 3 was postponed. Essentially, we now have shown this and summarize it in
COROLLARY 1. Equation (2) of Chapter 3* has one and only one solution, namely, Ψ_{R_0}(i, α), i ∈ I, where R_0 is a policy that minimizes Ψ_R(i, α), i ∈ I.

Proof: If v_0''(i), i ∈ I, is a second solution, then v_n''(i) = v_0''(i), i ∈ I, n = 1, ..., and thus, lim_{n→∞} v_n''(i) = v_0''(i). However, from Theorem 1, lim_{n→∞} v_n''(i) has been evaluated to be Ψ_{R_0}(i, α).
n+m
The method of successive approximations based on Theorem 1 consists of selecting an initial arbitrary function and transforming it successively according to the transformation defined by (1). The limiting function will satisfy Eq. (3.2) and the optimal policy is obtained by taking that action in state i, i E I , which minimizes the right-hand side of (3.2). In practice, the limiting function will be approximated only. An approximation to the optimal policy is obtained by treating y R o ( i , ct) and its approximation as if they were equal. Actually, if the approximation is close enough to y R o ( i , a), which will be the case for large n, the exact optimal policy will be obtained. However, within the procedure, no formal stopping method is given. One might modify the method by occasionally evaluating for various values of n, YRn(i, a), i E I , from Eq. (2) following, where R, E C, is the policy defined by taking that action in state i, i f I , which minimizes the right-hand side of (1). If { y R n ( i , a), i E I } satisfies (3.2) then R, is optimal. See also Problems 2, 3, and 4 at the end of this chapter. In general, the functions {v,(i), i E I } , n = 0, 1, .. have no tangible interpretation. However, if uo(i) = min{w,,}, i E I , then using methods of Chapter 2, we have
.
,
n = 0,1, ...,
* Hereafter this equation will be referred to as (3.2).
Policy Improvement Procedure
39
the optimal expected discounted cost criterion over the periods 0, 1 , . .., n. Otherwise, if v,(i) is interpreted as the terminal cost of being in state i at time n if the process is terminated at time n, then u,(i) is the minimal expected discounted cost plus terminal cost over periods 0, 1, , . ,n given that the initial state is i. In practice, the method of successive approximations may be used when an approximation or guess to an expected discounted cost criterion corresponding to a heuristic policy (one arrived at by respected intuition) is available. Then several iterations will hopefully improve it. In any case, use of the method of successive approximations never necessitates the computation of an exact discounted cost criterion ; thus, one is spared the work of solving the system of simultaneous equations (2) following. This latter feature makes the method an attractive computational procedure, particularly if the computations are done by hand or with a desk calculator. Perhaps the method of successive approximations is of greatest value in a more theoretical context. That is, certain mathematical properties can be ascertained. For example, suppose I is the set of integers 0, 1, , ..,L and the laws of motion are such that ul(i) is a nondecreasing function of i whenever vo(i) is a nondecreasing function. Then it follows that YRo(i, a) is a nondecreasing function. From this property the structure of an optimal policy may sometimes be deduced. (See Chapter 9, Section 1.)
.
Policy Improvement Procedure

This is an iterative procedure that improves on each iteration and terminates after a finite number of iterations with an optimal policy. Let R_1 ∈ C_D be arbitrary. Then {Ψ_{R_1}(i, α), i ∈ I} satisfies, uniquely, the equations
@-) = WiR 1
+ a i qij(R1)yR1O’,
a),
I
*
(2)
Uniqueness follows from the fact (using Theorem 3 of Appendix A) that the matrix of system (2) is I - a Q ( I is the identity matrix and
4 The Discounted Cost Problem
40
Q = {qij(R,)}) which has an inverse ( I - aQ}-' =
m
C
n=O
a " p . Let Ei
denote the set of actions i for which wia+ c1 C qij(a)YRl(j, a) is strictly j
less than the right-hand side of (2). Define R , E C, as follows: For one or more states i for which Ei is nonempty, prescribe action a in E l . For all other states, take the action prescribed by R , . We refer to the derivation of R , from R , as a policy improvement iteration. The fact that the iteration is an improvement is established in the following. THEOREM 2. If E, is nonempty for at least one state i, then YR2(i, a) 6 YRl(i, a), i E I , with strict inequality holding at every i for which R, # R 1 . Proof: By definition of the policy improvement iteration,
with strict inequality holding at each i for which R , # R , . Let (q$)(R,)) (t = 0, 1, , . .) denote the t-step transition probabilities under R , . Then from (3), on premultiplying by atq$)(R2)and summing over i, we can write for t = 0, 1,. . ., a'
c i
q$)(R2)yR1(i9
zar
a)
c #(R2)WiR2 + i
a'+'
CI q $ + ' ) ( ~ , ) Y ~a),~ ( l , j E I .
(4)
For t = 0, Eqs. (3) and (4) are identical. On summing (4) over t = 0,1,. .., we obtain, since YR2(j,a) =
C a' Eq$)(R2)wiR2, m
r=O
i
Linear Programming
41
with strict inequality holding, because of terms at t = 0, at e a c h j for which R, # R,.On subtracting the second term on the right-hand side from the left term, and because the two differ only when j = i (since qi;) = S i j ) , we have Y R , ( jcx) , 2 Y R l ( ja,) , j E I, with strict inequality holding for each j for which R , # R, . Thus the theorem is proved. We refer to a sequence of policy improvement iterations as the policy improvement procedure. We can state
COROLLARY 1. The policy improvement procedure terminates, after a finite number of iterations, at an optimal policy. Proof: C, contains only a finite number of policies. Since each iteration is accompanied by a strict improvement, no repetitions will occur. Thus, at some point no improvements will be possible, at which time (3.2) will hold and the corollary is proved. In summary, the policy improvement procedure provides a monotone (always improving) convergent sequence of policies and attains in a finite number of iterations the optimal policy. Its drawback is that the discounted cost function for each policy R in the sequence must be calculated. This involves solving the linear system (2).
Linear Programming

Both the methods of successive approximation and policy improvement may be regarded as methods of dynamic programming. Thus, it is somewhat surprising that the method of linear programming can also be brought to bear. For this, consider the linear programming problem: Maximize
4 The Discounted Cost Problem
42
subject to ui
where
5 wia + a C q i j ( a ) u j , i
a E K ~ ,i
EI,
Pi > 0 , j E I, and 1Pi = 1 are given numbers. The dual
linear
j
programming problem is: Minimize
subject to xia 2 0,
U E
Ki, i E I ,
and
where aij = 0 if i # j , and 1 if i =j. We first discuss the dual problem.
THEOREM 3. Let R E C, be defined by { D i a } :Then
is a feasible solution to the dual problem. On the other hand, if {xis} is any feasible solution to the dual problem, then { D i a }= {xia/Cxi,.} a'
defines a policy R E C, and x i a = Tia , a E Ki , i E I. That is, { D i a }= {xh/c xias}is a one-to-one mapping of the feasible solutions to the dual a'
problem onto C,. Proof: It can be readily verified that {Zia} satisfies the feasibility constraints. Now let {xi.} be any feasible solution to the dual problem.
Linear Programming
43
The feasibility equations can be written
from which it follows that
c xia> 0,i
E
a
I , since it is assumed that
Pi > 0 , j E I. Thus ( 0 3 = { x j U / zxla}is well defined. However, treating a 2 xia, i E I, as variables in the above representation of the feasibility (I
equations, it follows from Theorem 3, Appendix A that
But Eio
= =
c xi,. D, a'
C xis, Dia a'
-xis,
aEKi, i e I ,
which completes the proof of the theorem.
COROLLARY 1 . An optimal policy R , E C, is obtained by solving the dual problem and setting D F = xialC xi., a E K i , i E I, where (I
(xi.} is an optimal solution to the dual problem.
leI
Pro08 Since the objective function of the dual problem is in fact YR(I,a), this expression is minimized. However, since PI > 0 and
a single R E C, minimizes YR(l,a) for every I E I, it follows that YR(I,a) is minimized for each 1 E I .
COROLLARY 2. If the simplex method is used to solve the dual problem, an optimal policy R, E C, is obtained.
4 The Discounted Cost Problem
44
Proof: The simplex method obtains only extreme point solutions. It follows that xia> 0, i E I , and from Theorem 3 of Appendix C,
1 a
xi. > 0 for exactly one a E K j for each i E I . This then implies D F = 1
or 0, i E I .
COROLLARY 3. For every optimal solution to the primal problem, (3.2) must hold.
Proof: It follows from t.he complementary slackness property of primal and dual linear programming problems (Theorem 5, Appendix C) that if ( u j ,j E I } is optimal for the primal problem, then
for those values of i and a where x i p> 0. However, we have seen in the proof of Theorem 3 that for each i E I, xio> 0 for some a. Therefore, if we consider the constraints of the primal problem, (3.2) must hold. We can now prove THEOREM 4. If { u j o } is an optimal solution to the primal problem, then {vjo> satisfies (3.2) and consequently u: = YRo(j, a),j E I, where R, E C, is an optimal policy.
Proof: From Corollary 3 we have that (3.2) must be satisfied by an optimal solution to the primal problem. Since by corollary 1 to Theorem 1, Eq. (3.2) has a unique solution {YRo(j, a ) , j E I}, the equality must follow. COROLLARY 1. An optimal policy R, E C, can be obtained from the optimal solution to the primal problem by letting R, be the policy that takes action a = ai at state i which achieves equality in the constraints of the primal linear programming problem. If more than one action a achieves equality at any state then either action may be taken.
Linear Programming
45
Proof: Once a solution to (3.2) is obtained by any means, an optimal policy is prescribed according to Theorem 1 of Chapter 3. Theorems 3 and 4 and their corollaries provide the linear programming machinery for obtaining optimal policies. We point out that the variables (xis} of the dual problem have policy and expected frequency interpretations for euery feasible solution to the problem.' For xia is, in a discounted sense, an average probability of being in state i and making decision a when P( Yo = I } = P I , I E I, and policy R E C, is used, where R is given by Dia = x i a / cxi., Thus, in a sense, the simplex algorithm a
for solving the dual problem, which is a procedure that has the property that successive iterations provide improving solutions, is a special type of policy improvement method. On the other hand, for the primal problem, it is the equalities (3.2) obtained only in an optimal solution that yield an interpretation with respect to policies. Of course, the optimal values have their interpretation in terms of being the optimal discounted costs YR(i,a), i E 1. In the proof of Theorem 3, the assumption that P I > 0, I E I, is used in showing that xi" > 0, i E I. If we allow a subset S of states
2a
= 0, I E S, then it is possible that
such that
a
xia = 0 for some i E S.
In this case, suppose we define R = ( D i a }by setting Dia= x,lc xi,, if a
1xia > 0 and choosing Dia arbitrarily if 1xia= 0. Then, for every i
c xia
a
such that
a
a
= 0, we can assert that
PR{Y, = 1 I Yo = I } = 0,
t = 0, 1, . . . for every 1 such that > 0. Moreover, R is optimal with respect to minimizing YR(Z,a) for every 1 for which > 0. However, R may not be optimal with respect to minimizing YR(i,a) for every i such that 2 xi. = 0. a
For this reason we should have perhaps referred to the problem involving the variables {xi.} as the primal linear programming problem and to the other as dual. However, since the problem involving the variables {ui } arises first we have called it primal.
4 TheDIscountedcOatProblem
46
Computational Examples
Suppose, as in Chapter 2, that Z = (0, 11, Ki = 2, i = 0, 1, where
and
We take u = 3. Let us first employ the method of successive approximations in order to obtain an optimal policy. Let uo(0)= uo(l) = 0. Then using (1)s
= min(1,O) =0,
and, similarly, u,(O) = min{2,2)
= 2.
Then,
= mini;
3
=-
4
t)
,
Computational Examples
47
and 2)
7 3‘
=-
Iteration once again yields
. 37 93 - -
= mm
(24 ’ 96)
31 32
=-
and
0
= 2 + -1 -2. -3+ -1. - 7 2 3 4 3 3 95 36‘
=-
The policy R* approximating the optimal policy is the one that takes action a = 2 at state 0 since
+.
-)
+.
-),
1 + 1 1 3- 1+ - .1 9 5 > 1 1 3- 1+ -. 395 2 2 3 2 236 2 4 3 2 436
and takes action a = 1 at state 1 since
4 The Discounted Cost Problem
48
Let us check whether this policy is, in fact, optimal. We have that yR*(O)
= 3 * (iyR*(O)
and VR*(l)= 2
+ SYR*(l))
+ +(+YR*(o) + $ Y R * ( 1 ) .
On solving, we obtain
We can now check whether
and
The inequalities hold so that R* is, in fact, optimal. Let us now obtain an optimal policy using the policy improvement procedure. Let Rl be the policy that takes action a = 1 at state 0 and action a = 1 at state a = 1 . Then
from which we see that
At state 0,
hence, the policy R , ,action a = 2 at state 0, and action a = 1 at state 1, is better. Since we already know R , is optimal, no further policy improvement iterations will be possible.
Bibliographical Remarks
49
The primal linear programming problem, letting Po = P I = 4, is: To maximize 3(vo
+4
subject to 9uo
- i u 1 5 1,
300
- 301
-300
-&lo
5 0,
+ 3-01 5 2 ,
+ +Vl 5 2.
The dual problem is to minimize xo1
+ 2x1, + 2x12
subject to and
We leave it to the reader to numerically solve each of the linear programming problems and determine the optimal policy from each solution.
Bibliographical Remarks The proof of Theorem 1 essentially involves establishing that .the transformation defined by (1) is a contraction. Since we know a fixed point of (1) already exists; namely {YRo(i), i E I } , the remainder of the proof is somewhat simpler than the classic proof of the Picard-Banach fixed point theorem. Maitra’s proof [45]for the denumerable state case motivated our approach. However, the theorem for the finite case should be credited to Shapley [48] who also used the contraction method.
4 The Discounted Cost Problem
50
That the policy improvement procedures are associated with dynamic programming can be seen in the writings of Bellman (see, for example, [3]). The explicit procedure for the Markovian decision process with discounted cost criterion appears in Howard 1341. See also Blackwell
161.
That linear programming can be used for the discounted cost criterion is due to D'Epenoux [14].
Problems Using the data provided for Problem 1 of Chapter 2, find the minimal expected discounted cost policy using each of the computational methods. In the method of successive approximations with uo(i) = 0, i E I, show that for each i E I,
where R, is optimal.
Assume min {IwM - wb.l} > 0. How large must n be i,
0 , a' o'#a
in order that the method of successive approximations yields an optimal policy? If in the method of successive approximatiom, ul(i) 2 uo(i), i E I , show that ~ " + ~ 2 ( iun(i), ) i E I; that is, that {u,,(i), n = 0, 1, .. .} converges monotonically to YRo(i, a) from below. By direct argument, that is, without resorting to the dual problem, prove that the optimal solution {vi, i E I} to the primal problem must satisfy ui = min wi. a
(
1
+ Ci qij(a)uj ,
i EI
.
Problems
51
(6) Define the transformation TR of a vector u = {u(i), i E I } by
+
(TRu)(i)= wiR a C qrj(R)u(j), i
i EI .
Show that if u’(i) 1 u(i), i E I , then (TRv ’ ) ( i ) 2 (TRu)(i), iEI. (7) Prove the assertions in the last paragraph of the linear programming section.
5 Computational Procedures for the Optimal First-Passage Problem
Introduction In solving the optimal policy for the expected discounted cost criterion, we saw that the methods of successive approximation, policy improvement, and linear programming can all be used. The same can be said for obtaining optimal solutions to the optimal mean first-passage problem. We assume that state j = 0 is the target state and PR{Yt= 0 for 53
5 The Optimal First-Paasage Problem
54
some t > O I Y o = i } = l , i E Z , a n d q o o ( R ) = l for all R E C ~Recall . from Theorem 4, Chapter 3 that some policy R E C, is optimal and thus we need only consider the rules in C, . However, as in the previous chapter, it will be convenient, in the linear programming formulation, to consider the class C, of policies. Method of Successive Approximations
We first discuss the method of successive approximations in the present context. Let {uo(i), i E Z - (0)) be arbitrary, and define
(
un+l(i) = min wI. a
I
+ C qij(a)un(j) ,
We shall prove:
j+o
i
EI
- (0).
(1)
THEOREM1. If {un(i), i E Z - { 0 } ,n = 0, 1, ...} are defined by transformation (l), then lim un(i) = oR0(i),i E Z - {0}, independent of n-r m
{uo(i), i E Z - {0}, where Ro E C, minimizes trR(i), i E I - (0).
Proofi Let {v0(i),i E Z - (0)) be arbitrary and v;(i) = oRo(i), i E Z (0). Since Ro is optimal and is a member of C, we have that u,,’(i), the nth iterate of uo’(i) in (l), is equal to uo‘(i) for n = 1,2, , . , Let ai,i E Z - (0)denote the actions minimizing the right-hand side of (1) where vn(i) is the nth iterate of uo(i) in (1). On subtraction, we have
-
.
forn=O, 1, ...,
Similarly, if a;, i E Z - {0}, denote the actions minimizing the righthand side of (l), where un(i) of (1) is the nth iterate of uo’(i). Then for n = 0 , 1, ...,
Method of Succeesive Approximations
Putting the two inequalities together we obtain
for each n = 0, 1, ...and i E I - (0). Consequently, letting Rndenote the policy in C, which takes action ai or a; at state i depending upon which yields the larger value for C qii(a)I u n ( j ) - un’(j)[,we obtain Iun+1(i)
- uA+i(i)I 5jToqij(&) lun(j)- u,,’(j)l,
for each n = 0, 1, IUn+l(i)
i E - (0)
9
.... Repeated iteration yields
- uA+ i(i)I s C PR,{Yn = j I Yo = i > IUdj) - 0o’O’)I j#O
- PR,{Y, # 0 I Yo = i } maxlu ( j ) - uo‘(j)}, S i
i
EZ
- (0),
where 8, is the policy in C, that takes action at time t according to policy Rn-, (0 S f n). From Problem 2, Chapter 3, it follows that lim PR,( Yn # j IYo= i ) = 0, n-r
iE
Z - (0).
~3
Therefore lim lu,(i) - u,,’(O)l = 0,
n-+m
i E Z - (0).
Since ~ ; + ~ (=i )uo’(i) = aRo(i),i E I - {0}, we have that lim un(i)= aR0(i), i E Z - (0}, and the theorem is proved.
n-r
m
The remarks regarding the method of successive approximations in Chapter 4 hold here as well. In particular we have: COROLLARY 1. The function nR0(i),i E I - (0}, uniquely satisfies (2)
5 The Optimal First-Passage Problem
56
Proof: Same as for Corollary 1 to Theorem 1, Chapter 4.
Thus, we start the method of successive approximations with an arbitrary function vo(i),ie I - {0}, and iterate it according to (1). In the limit, we get (2) with Ro E C, as that policy determined by those actions which minimize the right-hand side of (2). In practice, the limit is not attained, but a large number of iterations of (1) should in most cases yield the optimal policy or a good approximation.
Policy Improvement Procedure We turn now to the policy improvement procedure for obtaining an optimal policy. Let R denote an arbitrary policy in C,. Then {aR(i))satisfies uniquely (Theorem 2 of Appendix A) the system
For each i E I
- (0) let Eidenote those actions a for which
Define R E C, by choosing an action in E, for at least one i where Ei is not empty. At all other states let R' = R. If Eiis nonempty for at least one i we call the transformation of R to R' a policy improvement iteration. A sequence of policy improvement iterations that leads to an optimal policy is called the policy improvement procedure. That, in fact, every policy improvement procedure leads to an optimal policy is summarized in the following theorem and corollary.
THEOREM 2. If R' is obtained from R by a policy improvement iteration, then ax.(i) 5 aR(i),i E I - (01, with strict inequality holding for at least one i E Z - ( 0 ) . Proof: From (3) on substituting the inequalities of the policy
improvement iteration, we have
2 WiR‘ +
j#O
qij(R’)cR(.i),
i E I - (0),
with strict inequality holding for at least one i E I - {0}, namely, for i where R’ # R. On iterating the inequality we obtain
T
On letting T + 00 we obtain aR(i)2 aRt(i)+ lim
1q { ? + ’ ) ( ~ ’ ) c ~ ~ ( j )
T+m j Z 0
= aRt(i),
i E I - {0},
since lim &)(R’) = 0. Strict equality holds, at least for those i where T-tm
R‘ # R. The theorem is proved. COROLLARY 1. The policy improvement procedure converges, within a finite number of policy improvement iterations, to an optimal policy. Proof: Each iteration yields a strictly better policy. Only a finite number of policies are in C, Thus, at some point, no policy iteration is possible and (2) is satisfied by the final policy; by the corollary to Theorem 1 it must be optimal.
.
Linear Programming Formulations We allude now to the linear programming formulations, the primal and dual, of the optimal first-passage problem.
58
5 The Optimal First-Passage Problem
Consider first what we call the primal problem:
To maximize
subject to ui
where the
s wia +
j+O
qij(a)uj,
a E Ki, i
EI
- {O} ,
(pi} are known positive numbers such that
dual problem is: to minimize
C C xia
i#O a
j+O
/I,= 1. The
wia
subject to
xi.20,
aeKi,
icZ- {0},
and
By the same methods of Theorem 3, Chapter 4 we can assert that there is a one-to-one correspondence between any solution {xi.} to the dual problem, and R E C, given by
a
and
In words, xio is equal to the expected number of times under the policy R that the process is in state i and action a is taken before the process
Computational Examples
59
enters state 0 given that P{ Yo = 1) = P I , 1 E I - (0). That x,, in fact, is finite for R E C, ,follows from the assumption that each i, i E Z - {0}, is transient for every R E C, and Theorem 2 of Appendix A. Thus, Theorems 3 and 4 of Chapter 4 and their corollaries have their counterpart for the optimal first-passage problem with Eq. (2) of this chapter replacing (3.2) of those discussions. When some Bj’s are equal to zero the remark in Chapter 4 holds here as well. Computational Examples In order to keep the computations extremely simple we shall consider a two-state problem with one of the states as the target state. Clearly, for such a simple case, the optimal policy can be seen by inspection. However, we shall formally go through the steps of the various procedures. Suppose I = (0,1); 0 is the target state; K, = 2, q1,(I) = 3, q11(2)= 3; WI1= 3, W1z = 1. We first use the method of successive approximations. Let vo(l) = 0. Then
ul(l)
= min(3
+ +uo(l), 1 + 3uo(l))
= min(3, l} = 1.
-1
- 3’
~ ~ (= 1 min(3 )
+
1,
1
+3
+ 3 - +,
1
+ 3 - +}
~ ( 1=) min(3 +
*
*
1)
= $0.
On the basis of v3(1), we have that 2 10 1 10 3+-->1+--; 39 29 hence, the approximation to the optimal policy is to take action a = 2
5 The Optimal First-Passage Problem
60
at state 1. In this case the approximation is, in fact, the optimal action. Using policy improvement, suppose R, is the policy which prescribes action a = 1 at state 1. Then gRR1(l)
+ 3aR,(1)
=
3
and, therefore oRl(l)= 6. Since
+ 3CRl(l)
6>
=5,
R, ,which prescribes action a = 2, is the policy obtained by the policy iteration; C , contains only the policies C , and C2;therefore R2 is optimal. To evaluate uRz(l),we have that =
oRz(l)
+! ~ R Z ( ~ ) ,
or crRz(l)= 3 . The primal linear programming problem looks like: Maximize u1
subject to +vl 5 3
+ul 5 1.
and
Clearly the solution is u1 = 3, and since equality is obtained at the second constraint, action a = 2 is optimal. The dual problem takes the form : Minimize
3x11 + x12
subject to x1110, +XI1
XI2209
+ 3x12 = 1.
The solution is x I 2 = 3 which yields D I 2 = 1, as the optimal policy.
The Finite Horizon Problem
61
The Finite Horizon Problem as a First-Passage Problem The finite horizon problem of Chapter 2 can be viewed as an optimal first-passage problem. Let I' be the state space consisting of all pairs 'i = (i, t), i E I, t = 0, 1, . ,T and an adjoined state 0 (say); let Ki,= Ki,if E I' - {0}, KO = 1, qirj.(a)= qij(a)if 'i = (i, t),j' = ( j , t + 1) for i, j E I , t = 0, 1, . .., T - 1, qit0(u)= 1 if i' E {(i, T ) , i E I } , qoo(a)= 1 and qiPjt(a) = 0 otherwise; wi., = w i g ,i E I , t = 0, . . , T, and wOa= 0. Denote by C' the class of all policies. This is merely an enlargement of the original state space to one where the new state designation includes the time of observation as well as the original state; state 0 denotes time T 1 without concern for the original state at time T + 1. Within this conception the state 0 is an absorbing state and a first passage from any state in {(i, 0), i E I } to state 0 takes exactly T 1 units of time. Thus, it should be clear, that to find R E C to minimize ,SR,*(i), i E I , is equivalent to finding R E C' to minimize aR((i, O)), i E I , where the " target" is the state 0. This observation coupled with the contents of this chapter point out that ,SR,T(i)can be minimized by the method of successive approximations, the policy improvement procedure, and by linear programming. This is not to say that any of these methods would be superior to the method ofchapter two. The simple dynamic programming algorithm is the natural and highly efficient way to solve the problem. However, when certain types of additional constraints are imposed the dual linear programming approach may prove useful. For example, suppose, for a given initial state i, we wish to find R E C, to minimize SR,=(i)subject to the constraint that ,SR,=(i) 2 s (a given constant). Translated to the first-passage problem this would be equivalent to finding R E C,' (the stationary Markovian subclass of C') such that bR((i,0)) is minimized subject to b R ( ( i , 0)) 2 s. Letting p(i,o)= I , p i . = 0,if # (i, 0), this problem can then be formulated as finding { x i , , } ,to minimize
..
.
+
+
5 The Optimal First-Passage Problem
62
subject to U E K ~ . i, ‘ E 1 ’ - { O } ,
xi’.hO,
cc
i’#O
a
xi.a(s,j,
and
- qi.j@)) = pi.,
cc
i’#O
If (x,,}
a
Xita Wi’.
j ’ E I’
- (0)
2s .
c
is the optimal solution to this linear programming problem, set = 0. Di., = xina/Cxi.. if xi,. > 0 and let DiSabe arbitrary if a
a
a
We point out that the policy so obtained will not in general be a member of C,‘ since for at least one state i‘ a random mechanism will be used for deciding on which action to take. Bibliographical Remarks The optimal first-passage problem was formulated by Eaton and Zadeh [30]; they called it a “pursuit problem.” The transformation (1) is not, in general, a contraction for the I” norm. However, because all states except 0 are transient, the proof of the convergence of the method of successive approximations proceeds along the lines of the previous chapter. Other norms are given by Veinott [53] for which (1) is a contraction. A different linear programming formulation involving the minimization of the ratio of two linear forms (a problem called a fractional linear programming problem which can be readily transformed into a linear programming problem) was first given by Derman [15]. The one given here circumvents the need for the fractional linear programming form. The remark regarding the formulation of the finite horizon problem as a first-passage problem with application to constrained optimal policies appears in Derman and Klein [22].
Problems
63
Problems (1)
Is it possible to put a bound on Iu,,(i) i € I - {O}?
- oR0(i)l,
(2) For the data given in Problem 1 , Chapter 2, find the optimal policy using each method.
6 Expected Average Cost Criterion Computational Procedures
Summary In Chapters 4 and 5, it was shown that the method of successive approximations, the policy improvement procedure, and linear programming provide general methods for obtaining optimal policies for the discounted cost criterion and for the first-passage problem. In this chapter, which is devoted to the expected average cost criterion, we shall see that a special kind of policy improvement procedure and the 65
6 Expected Average Cost Criterion
66
methods of linear programming provide general algorithms for obtaining optimal policies.
Policy Improvement Procedure We first consider the policy improvement procedure. Let R E C, be arbitrary. We let
the limit always existing (Theorem 1 of Appendix A). We also have (Theorem 1 of Appendix A): nil(R) =
Ci nij(R>#(R)
= =
Ci q$)(R)njl(R)
C nij(R)njl(R), j
i , j E I, t = 0,1, .
..,
relations which we shall use throughout. Consider the equations in {4i , v i , . . . i E I } : $i
+
Ui
= WiR
+ i qij(R)oj
y
iE I
(1)
and
Cj n i j ( ~ ) v=i 0,
i E I.
Equations (1) and (2) are the essence of the policy improvement procedure. We shall construct a solution to (1) and (2). Let El,E2 , . . . ,Ek be the recurrent classes of I under R. Let E = {jl, ...,j,} be a set of selected states from El, . . , ,E,; that is, j,, E E,, , n = 1, .. .,k. Define wfR= wiR - 4R(i),i E I ; let W,' = w j R , if
~- 1
7-
Y, = j , and set T = mh{t [ Y, E E, t 2 11. Let uR(i)= ER{
t=O
W,' I Yo = i},
i E I ; that is, uR(i)is the expected cost under R and the cost structure {wiR} of going from state i to any of the states in E not counting the cost at the time of arrival.
Policy Improvement Procedure
67
In constructing a solution to (1) and (2), we first demonstrate that ( 4 R ( i ) ,uR(i),i E I } satisfies the system (1). Then by a suitable modification we can construct a solution to (1) and (2). By its definition we clearly have that
However, from Theorem 5 of Appendix A, for any i E E (say i E E n , where En is one of the recurrent classes El, ...,EJ, using the fact that n i j = 0 if j # E n ,
=O since
68
6 Expected Average Cost Criterion
Since, in fact, z i j ( R ) is independent of i for i E En (Theorem 4 of Appendix A), c(i) is a function of n when i is recurrent. Notice now that
=0,
iEZ
since Hence, uR(i) satisfies (2) for all i E I. Now we also see, using qij(R)njl(R)= zil(R),that for i E I , i
WiR
+ jCe I q i j ( R ) u R ( 8
= WiR + = WiR
c
jsl
- c(i)>
qij(W~R(-i)
+j s I q i j ( R ) U R ( j )
- c(i>
+
= 4R(i) uR(i)- c ( i ) = 4 R ( i,
+ uR( i, ;
that is, (4R(i), uR(i), i E I } satisfies (1). We have thus constructed a solution (bR(i), uR(i), i E I } to the combined system (1) and (2). We can further state :
LEMMA1. The numbers {4K(i), uR(i), i E I } satisfy (1) and (2); moreover, there is only one solution for which C$i is constant over each recurrent class and equal to q5R(i)when i is transient.
Proof: The first statement was just proven. To prove uniqueness, suppose the values { C $ i , u i , i E I } satisfy (1) and (2) where 4i is constant on each recurrent class. By premultiplying (1) by nli(R)and summing over i E I and using (2) we have
Ci z , i ( R ) d i
= $R(O,
1EI .
(3)
Policy Improvement Procedure
69
Suppose I is a recurrent state. Let En denote the class of states to which I belongs. Then since nli is independent of I for i E E, and = O if i $ E n , and + i is constant for i E En, it follows from (3) that + i = cjR(i)for all i E E n . Hence, for all recurrent states i, 4i = 4R(i).Now letting Ai = uR(i) - ui and subtracting in (l), we have
Ai =
c qij(R)Aj,
i EI
i
Iterating, we obtain
Ai = C q$(R) Aj, i
Averaging (4)over t = 1,
i E I,
t =
1,2, . . . .
(4)
. . . , T and letting T + co,we have by Eq.(2), Ai = C wij(R) Ai i
=o,
i€Z.
This proves the lemma.
LEMMA2. Given any c1 (0 < c1 < I ) and R E C , ,
where zR(i, a) + 0 as c1+ 1 Proof: Since
i
q i j ( R )+ R ( j )= 4x(i),it follows on multiplying both
sides of (1) by cc'q{f)(R)and summing over i that @'4R(I)
+ cc'Ci q l f ' ( R ) u R ( i )
On summing over I we arrive at
(l - a>-'(bR(l)
+ uR(l)
6 Expected Average Cost Criterion
70
Using (2) and the Abelian theorem l(b) of Appendix B on the last term of the right-hand side, the lemma follows. Let R E CDbe arbitrary. For each i E I , define Ei to be the set of actions at state i for which i
qij(a)$R(j)
< (bR(i)
3
or, if no actions satisfy the inequality, the set that satisfies
and Wia
-k
1qij(‘)uRO’) j
< WiR -k = &(i)
qij(R)uRO’)
+
UR(i).
Define R‘ E CDas the policy which takes an action a E E, in at least one state i for which Eiis nonempty; otherwise, the action taken is the one dictated by R.Of course, if E, is empty for all i, then R = R’. If R‘ # R, then either
Ci q i j ( W M j ) IMi),
i E1
(6)
7
with strict inequality holding for at least one i and qii(R’)= qij(R), wiR= wiR.j E I , for each i where equality holds, or i
q i j ( R ’ ) b R ( i ) = 4di),
(7)
I,
and $R(i)
+ v R ( i ) 2 w ~ R ,+ Ci qij(R’)Ud.&
i
E1,
(8)
with strict inequality in (8) holding for at least one i and qjj(R’)= qij(R), wiR= w i R . j €I , for each i where equality in (8) holds.
Policy Improvement Procedure
and
Y R , ( i ,a)S Y R ( i ,a),
i E I, a near 1 ,
with strict inequality holding in (10) for at least one i.
Proofi From the representation ( 5 ) we can write
If (6) holds, for some a,,near enough to 1 we can write for all a 2 a,,,
with ‘strictinequality holding for that i where strict inequality holds in (6). Thus, Theorem 2 of Chapter 4 applies; that is, policy improvement for the discounted cost criterion takes place in going from R to R’ for every a 2 a,. If (7) and (8) hold, the same can be said. Thus (10) holds. From the fact that (10) holds and using (3,one sees that (9) also holds. Thus the lemma is proven. Let us define the transformation from R to R‘ as a policy improvement iteration. Thus, the policy improvement iteration takes a policy R E C, to R’E C, such that (6) is satisfied or (7) and (8) is satisfied. Thus, the policy improvement iteration is analogous to those discussed in Chapters 4 and 5 though somewhat more complicated. We refer to a sequence of policy improvement iterations as the policy improvement procedure. We have: 1. The policy improvement procedure leads to an THEOREM optimal policy within a finite number of iterations.
Proofi Let R,,R2 , ... be the policies obtained from a sequence of policy improvement iterations with R1arbitrary. Since there are only a
6 Expected Average Cost Criterion
72
finite number of policies in C , and {YRn(i, a),i E I } is a strictly decreasing sequence as long as a policy improvement iteration can be effected, there is an n for which R, = R,+l = R (say); that is, a policy iteration on R results in no change of policy. The fact that { Y R " ( i , a), u = 1, ...,n ) is strictly decreasing prevents cycling from occurring within the sequence R , , , . . , R, . Then we must have
and 4R(i)
= min a
1 i
I,
qfj(a)$Ru)?
(12)
where Ki'in (11) is the subset of actions at i such that equality is achieved in (12). We now show that whenever R is such that both (1 1) and (12) hold, then R must be optimal; that is, R is not a local minimum but is, in fact, an absolute minimum. Suppose f? is an arbitrary policy in C , . By virtue of (11) and (12) holding together with the argument employing ( 5 ) and its expansion used to prove Lemma 3, we now conclude that YR(i,a) 5 wiR
+ a Ci qii(R)YR(j,a),
i E I,
for all a sufficiently close to one. By the method of proof used in proving Theorem 2, Chapter 4, we obtain the fact that vR(i,
iEI,
a) 5 YR(i,a),
for all a sufficiently near 1. From ( 5 ) we then conclude that d)R(i)
5 Mi),
iE I .
Since a is arbitrary this proves the theorem. To spell out the policy improvement procedure, we first start with an arbitrary R, E C,. We then solve for { $ R , ( i ) , u R , ( i ) ,i E I}. Given { $ R , ( i ) , i E f>, we obtain {uR,(i),i E I} algebraically by virtue of Lemma 1. For any R E C,, { d R ( i )i, E Z} is calculated from {nij(R),i , j E I } ,
Linear Programming Formulations
73
where {nij(R) = , p j , i , j E En} is the unique solution to
(Theorem 4 of Appendix A), and for i # nij(R)
where ain= P{Y, E En for some t
u k
n= 1
En,
= U i n nnj 3
2 I 1 Yo = i}.
{mi,, , i 4
satisfies the system ain
=
C qij(R) + C jsE, j
u k
n= 1
u
En}uniquely
k
qij(R)ajn
+ni,En
3
iE
En*
n= 1
Hence {4Rl(i), vRl(i), i E I } can be obtained algebraically. Having solved for {dRl(i), vR,(i),i E I } , R, is obtained by a policy improvement iteration; that is, at one or more i, where possible, an action a is taken which satisfies either (6) or (7) and (8) with R, = R, at all other states. This process is repeated until (1 1) and (12) are satisfied, at which point an optimal policy is on hand. Unfortunately, +R(i) and v R ( i )must be obtained anew at each iteration.
Linear Programming Formulations
We now turn to the linear programming approach to obtaining an optimal policy. First consider the linear programming (primal) problem. To determine values of the variables { 4 i ,v i , i E I} to maximize
74
6 Expected Average Cost Criterion
subject to
Cj "j(6ij - &j(U)) + 4i 5 wia,
C J
(Sij
- qij(a)Mj S 03
a E Ki, i E I , a€Ki,
(13) (14)
iEI,
where Pj > 0,c/Ij = 1 are known constants. The dual problem is to find j
values of the variables {xis yia, a E Ki, i E I } to minimize
Cxja + C C r i a ( s i j - y i j ( a ) ) = B j , a i a
i ~ l -
We consider the primal problem. In what follows, R* optimal policy.
E
(16) C , is an
LEMMA. 4. Let {cji,v i , ie I } be any optimal solution to the primal problem; then 4i= 4 R * ( i ) ,i E I .
Linear Programming Fomulations
75
Thus 4R.(I)2 r$t, I E I . However, from Eqs. (1 1) and (12), we see that {4R.(i),uR.(i) ~4~.(i), i E I> for c large enough is a feasible solution to the primal linear programming problem from which it follows that +j = bR.(j), j E I.
+
LEMMA 5. Let R E C,, be any optimal policy and optimal solution to the primal problem; then
c i
uj(Gij
{ 4 i ,u,, i E I } an
+ 4 i = wiR
- qij(R))
for every i that is recurrent with respect to R,and
C(6, - qij = o ,
,=I
Then there exists an R E C , satisfying n
X,"(i) =
1 P,xR,,(i),
u= l
for all
T.
Proofi Consider a policy generated by selecting R, ( u = 1, . . . , n) at random according to selection probabilities B,(v = 1, 2, . .. , n); that is, introduce an initial randomization over the policies R,,. . ., R, . Denote this policy by K. Strictly speaking, is outside of the class C of all rules which we are considering since i?. not only depends on the history of the process but also the outcome of the initial randomization. However, i?. is a policy from the use of which { Y , , A , , t = 0,
92
7 State-Action Frequencies, Problems with Constraints
1, ...) is a stochastic process. Define
We shall show for every t = 0, 1, . . . that P,(y,=j,
A,=a(Y,=i}
=Pa{&=j,
A,=a(Yo=i}
For t = 0 and j # i, Eq. (1) holds trivially since both sides vanish. For t = 0 and j = i, p R { y o= i,
n
C PuPR,{YO= i, v= 1
= a I Yo = i} =
n
A . = a I Yo = i}
The Main Theorems and Applications
93
However, by the induction hypothesis
PR{Y, = j I Yo = i}
=
c c PR{ Y, l
a
-1 = I,
A,-
1 =a
I Yo = i}q,j ( a )
Hence,
and the induction argument is complete.
COROLLARY 1. Let HR(i) be the set of limit points of {XTa(i),
T = 0,1, . . .}; then there exists an R E C , such that HR(i)= HR(i).
Proof: Since XTR(i)= XTR(i)for every Tif R is the policy constructed in Theorem 1, the corollary is evident.
The significance of Theorem 1 and its corollary is that for any optimization problem involving expected frequencies of state and decision in its cost criterion and constraints, only policies in C Mneed be considered. That is, if R, (say) is any other policy that is optimal it can always be replaced by a Markovian policy R which is also optimal. However, more can be said along these lines. THEOREM 2. H(i) = H M ( i )= f l D ( i ) = gs(i). Proof: First we prove that H ( i ) c aD(i). Suppose the contrary; that is, there exists a point X‘ = {xi,(i)} E H(i) not contained in RD(i).Since RD(i)is a closed convex set there exists (Theorem 1 of Appendix C)
94
7 State-Action Frequencies, Problems with Constraints
a set of numbers {wik} such that
However, X' is a limit point of some policy R E C. Hence,
C C w&xia(i) 2 &R'(i) = Tlim inf C C w&xFja(i). +m a i
j
a
However, by Corollary 1 of Theorem 2, Chapter 3, there exists an
R* E C, such that 4k.(i) 5 &X(i). Thus we have a contradiction. We now prove that RD(i)= RS(i).Clearly fTs(i) =I RD(i)since Hs(i) =I HD(i).
n
f
= (X*) = min f ( X ) X EH(i)
= min lim inf f(XTR(i)). R s C T+m
Therefore, f can be minimized by a policy R* E C,
.
The Main Theorems and Applications
95
THEOREM 3. If I has at most one ergodic class for every R E C, , then H(i) = Hs(i). Pro08 In much the same way as Lemma 8 of Chapter 6 was proven we can show that if I has at most one ergodic class for every R E C,, this also holds for every R E C,. We shall show that under the hypothesis, Hs(i) is closed and convex, in which case, Hs(i) = g"(i) and the theorem follows using Theorem 2. For each .RE C,, let X = {xi,} = { I ~ ~ " where D ~ ~ njR } is the steady-state probability or long term expected frequency of state j under policy R.Let R* and R**be any two policies in C, with X * and X** being their corresponding matrices. Let X = bX* + (1 - P)X** where 0 5 p 5 1 is arbitrary. In order to show that Hs(i) is convex, we need to show that X corresponds tc some R E C, . We note that X satisfies the system (the steady-state system of equations and inequalities)
since X* and X** satisfy the system. But we know that there corresponds a policy R E C, yielding X = {zjRDj",}; namely Dja = xja/C xja, if a
xja > 0, and Dja arbitrary, if
a
a
xja = 0. Hence Hs(i) is convex.
To show that Hs(i) is closed, let {R", u = 1,2, ...} be a sequence of policies in C, such that ( X u ,v = 1,2, . . .}, the sequence of corresponding X matrices, converges to X . We need to show that X E Hs(i). Since C, is compact we can assume that {R,,v = 1,2, . . .} converges to R (say) E C,, for otherwise we can select a convergent subsequence of {R,} that does converge. However, the transition probabilities { p i j } are continuous functions of the policies in C, and since the steady-state has a unique solution (Theorems 2 and 4,Appendix A), we must have
96
7 State-Action Frequencies, Problems with Constraints
that X corresponds to R.Thus Hs(i) is closed as well as convex and the theorem is proved. As in the application of Theorem 2, letf(.) be a continuous function over the closure of the possible points XT"(i) for all R E C, T = 0, 1, . . . , for a given Yo = i. Suppose the problem is to minimize lim inff(XTR(i)) over R E C subject to the constraint that HR(i)c G , T-tm
a given closed subset of H(i). Since f is continuous and G is closed, an optimal policy will exist. Let R* denote an optimal policy with lim inff(X,R*(i)) = f ( X * ) . Then by Theorem 3, there exists an R**E Cs T-+m
such that X** = X * , where X** is the X matrix corresponding to R**. That is, R** is also optimal; consequently, in seeking an optimal policy, we need only consider those policies in class Cs. Let us return to the problem described at the beginning of this chapter which is also Example 3 with i = 0 and r = 1. It was shown in Example 3 that Ez, the expected recurrence time, and the probability under constraint are expressible as continuous functions of points in H(i). Under reasonable conditions on the laws of motion the hypothesis of Theorem 3 will hold. Thus, it is possible in accordance with the above remark to restrict consideration to policies in Cs. However, for policies in Cs it is readily seen (Theorem 5 of Appendix A) that ERz = (noR)-l and PR(Yt= j , 1 6 t 5 z I Yo = O} = rcjR/rcoR. Thus, we can state the problem as that of minimizing zOR subject to the constraint that zjR5 nORcz,where o! is a given number 0 6 a 4 1. This problem can now be put into the linear programming form: Minimize
subject t o Xia
2 0,
iEI ,
The Main Theorems and Applications
a
Letting D , = x i a / cxia if a
97
a
1xia > 0 or Dia arbitrary otherwise, yields a
the optimal policy R E C, . In Theorem 4 of Chapter 3 we proved that a,(i) is minimized by a policy R E C,. We now provide an alternative proof. Repeating the statement of the theorem:
THEOREM 4. Let j be the target state. If PR{Yt= j for some c 2 1 1 Yo = i } = 1 for every R E C,, then there exists an R E C, such that CR(i) = min uR(i), REC
for i # j .
Proof: Define qji(a)= when i # j , where 1 + 1/p is equal to the number of states in I . Since wja = 0, we can define for every R E C,
(Note, in each case z denotes min{t I Y, =j , t 2 l}.) Clearly, if R minimizes crR(j),it will also minimize gR(i)for each i # j . We first argue that ER{z I Yo =j } < co for every R E C. In Example 2 of this chapter, we have that ER{z I Yo = j } = m (: x ; ~ ~ ) - ’where , R is any
1
-rm a
policy in the class of “ renewal ” policies. However, under the hypothesis of the theorem and the definition of qji(u),the state space I is irreducible for every R E C,. Hence, by Theorem 3 if an R existed such that E R { I~Yo = j } = co, then there must exist an R E C, such that E,{T 1 Yo =j } = 00. But from Markov chain theory (Theorem 6 of Appendix A) this cannot be the case. Hence ER{zI Yo = j } < co for all R. Now from Example 2 we also have that ~ , ( j is) expressible as a continuous function of points in H(i). In fact it can be shown that this function assumes its minimum at the extreme points of H(i) = H”(i) (the
98
7 State-Action Frequencies, Problems with Constraints
equality of these two sets given by Theorem 2). Thus by the argument used in the application following Theorem 2, there exists a policy R E C, that minimizes oR(i).This proves the theorem.
State-Action Frequencies The first three theorems of this chapter deal with expected stateaction frequencies. In some applications it is desirable to have similar statements concerning the . sample frequencies: that is, the actual frequencies of state-action combinations without taking expectations. If R E C, , the long-run frequencies and expected frequencies coincide with probability 1. However, if R 4 C, this may not be the case. Let
Ztja= 1, = 0,
if
Yt = j , A , = a,
otherwise.
Let
For a fixed R, denote by and 2, denote the matrix of quantities {z,,]. w a sample sequence of the joint process { Yt , A , , t = 0, 1, . . .}. Let U R ( o )be the set of limit points of { Z T R ,T = 0, 1, . . .}. We have THEOREM 5. For each R E C, PR{U R ( o )c R } = 1, where R is the closed convex hull of HD(j).
u
js 1
Before proceeding to the proof of Theorem 5, we shall need a preliminary inequality. Using the notation of Chapter 2, we set
LEMMA 1. lim inf min VT*(i)/T 2 min min T+m
isl
isI REC
4R(i).
State-Action Frequencies
99
Pro08 If the inequality were not to hold, there would exist a sufficiently large T and state i, recurrent with respect to some policy such that VT*(i) < T min min q5R(i), ieI
REC
from which one could construct a policy R with a 4R(i) smaller than min min + R (i). We leave the details t o the reader. k I REC
Proof of Theorem 5 : Suppose the theorem is false. Let R be a policy such that PR{U"(o) c < 1. Then there exists a sphere S with positive radius such that S n fl the null set, and PR{U R ( w ) nS # @} > 0. This is so since B c , the complement of R, can be covered by a denumerable number of such spheres S, , u = 1, 2, . . . and
n}
=a,
Since R is a closed and bounded convex set and S is convex and the two sets are disjoint, the two sets can be separated by a hyper-plane; that is, by Theorem 1 of Appendix C there exists a set of numbers { w i n } a, E K i , i E I such that wiariq> winsia for all r = {ria}E 17 and
cc
1
i
i n
o
s = {sin}E S . Let W, = win when Yt = i, A , = a and note that
so that 1 lim inf -C W , = T-m T + 1t=O
for some point Z
=
1C wiaZia i
n
{ Z i n }in U"(o). We intend to show that the set of T 1 W, < min 1 T + l,=O
0's such that lim inf-
1
rsH
winria has at most prob-
ability 0. If this is the case we must have PR{V"(m)
nS # $3)= 0, a
7 State-Action Frequencies, Problems with Constraints
100
contradiction proving the theorem. For a fixed N let
BV= B' =
W,,
u = 1, ...,[ T I N ] ,
W,,
if [ T I N ] < T I N ,
1)N+ 1
t=(u-
2 T
t = [T/N]Ni 1
= 0,
if [TIN] = T I N ,
where [ T / N ] is the greatest integer less than or equal to TIN. Clearly IB,I, u = 1, ..., [TIN] and IB'I are bounded and
ER{B,IBi, - - - 7 B " - i }
2 min C C wja is1
j
a
VN
C
t=(v-l)N+l
PR{ = 0, A, = a I qv-l)N = i>
N
= m i n C C wja ieI
j
a
C PR(yt = j ,
A , = i l Yo = i}
t=1
2 min{ v,*(i ) - max wia} is1
aeKi
2 min VN*(i)- max(wia}. is1
By Lemma 1, for any E > 0, there exists an N such that
and
min V,*(i) ie1
N
E
2 min min 4R(i)- i
where r*
= {r:}
2
is1 REC
a
is such that C C W E T i a = rnin rnin (bR(i). Thus, for the i
a
ieI REC
value of N and for 2, = 1, ..., [ T / N ] (the greatest integer less than or equal to TIN), we have
ER(B,[B,, ..., Bu-l) 2 N c z wiarE - N E . i
a
State-Action Frequencies
101
However, by Theorem 5 of Appendix B, we have
with probability 1. Consequently,
with probability 1. But then, with probability 1,
= lim inf -
1
lT/Nl
T+1
v=l
2 lim inf T-m
= N-'lim inf[T/N]-l T+m
1 T+1
B, + lirn inf -(B' + W,j T+m
ITIN1 v= 1
B,
e Ci C aw i a r z - E Since E is arbitrary, we have
with probability 1, and the theorem is proved. As an application, suppose f( -) is a continuous function defined over the closure of the possible values of Z,, T = 0, 1, . . , and R, and it is of interest to find R E C which minimizes E lirn inff(Z,) (as
distinct from minimizing lim inff(XT) T-t m
= lim
for each R E C, ,I is irreducible. Then B
T-rm
T-m
inff(EZ,)). Assume that,
= Hs(i) = H S independent
of
7 State-Action Frequencies, Problems with Constraints
102
i by Theorem 3 and Theorem 4 of Appendix A. Sincef( .) is continuous, then lim inff(ZT) 2 minf(r) = f ( r o ) with probability 1. Hence, T-m
reR
E lim inff(&.) 2 minf(ro). Since
= H S , there exists a policy R
E
C, such that lirn Z, = ro with T+m
probability 1. Therefore, under this policy, ER lirn inff(z,) = f ( r o ) . T+ m
Bibliographical Remarks Theorem 1 is a slightly more general form of a result obtained by Derman and Strauch [25]. The general form was given by Strauch and Veinott [50],from which follows the equality of H(i) with H M ( i ) in Theorem 2. The remaining results of this chapter are due to the author [16, 181.
Optimal Stopping of a Markov Chain
Statement of the Problem Let us suppose { Y,, t = 0, 1, . . .} is a finite state Markov chain with stationary transition probabilities { p i j } . Let us suppose there exists an absorbing state 0 (that is, poo = 1) in the state space I such that P{Yt=O for some t z l l Y o = i } = l for every i E I . Let { w i , i € I } denote nonnegative numerical values associated with each state. When the chain is absorbed at state 0, we can think of the process as having been stopped at that point in time and we receive the value w,,associated 103
8 Optimal Stopping of a Markov Chain
104
with the state 0. However, we can also think of stopping the process at any point in time prior to absorption and receiving the value wi if i is the state of the chain when the process is stopped. If our aim is to receive the highest possible numerical value and if w o < max{w,}, then iEI
clearly we would not necessarily wait for absorption before stopping the process. By a stopping time T, we mean a rule that prescribes the time to stop the process; z = t means that the process is stopped at time t and the information for stopping the process at time t must be confined to the values of the variables Y o , .. . , Y, . We shall assume for all stopping times z considered that if T - min{t) Y, = 0, 12 l}, then z 5 zl. By a stopped process (Y, , t = 0, I, .. .} determined by a stopping time T we mean Ft = Y , , t 5 2, = Y,, if t > z.
--
The problem of this chapter is to determine the stopping time z such that E{wy,1 Yo = i}, i E I - (03,is maximized. Stopping Problem as an Expected Average Gain Problem We first remark that an optimal stopping time does exist and in fact is of the form that the prescription as to when to stop the process need only be a function of the state of the process at the time of stopping; that is, Z will be dichotimized into states where the process is stopped and states where the process is not stopped. To see this we need only to observe that the problem can be reformulated so as to be of the form discussed in Chapter 6. At each state there are two possible actions. Action 1 continues the process according to the transition probabilities { p i j } ;action 2 at state i transforms i into an absorbing state. At state 0 the two actions coincide. That is, Set
qij(l>= p i j ,
wil = 0,
qij(2) = 6,,,
wiz = w i ,
iEI, j E I . i E I - (0),
A Different Approach
105
and
Consider the problem of maximizing 4R(i),the expected average cost per unit time, over all possible policies in C. Notice that the class of stopping times is a subclass of the class C of all policies. This is the subclass of policies such that whenever action 2 is dictated at a state i and time t = z, then action 2 is dictated for all t > z. Notice also, that for such a policy R = z (say), 4R(i)= E(wYrI Yo = i } . Thus rnax &(i) 2 REC
max E{wYzIYo = i } . However, by Theorem 2, Chapter 3, or its t
Corollary 1, (PR(i) is maximized for each i E I by a policy R E C D .But each R E C , is a stopping time since, if action 2 is prescribed at state i, the process will remain in state i and continue to prescribe action 2. Thus, max
REC
4R(i)= max dR(i) RECD
= max E{wYzI Yo = i}, z
iEI ,
where the optimal stopping time t is the policy R E C, that maximizes 4R(i), i E 1. Of course, it follows that the computational methods of Chapter 6 can be used to obtain an optimal stopping time. A Different Approach We return to the original problem formulation of this chapter and offer another approach. Let
Mi) = max E(w,= I Y, = i},
i EI
I
By the remarks of the previous section, the optimal stopping time 7 need only be a time invariant function of the state of the process, so
8 Optimal Stopping of a Markov Chain
106
that we have the dynamic programming functional equations M(0) = w,
thus, the optimal stopping time takes the form of stopping the process at those values of i where w i 2 c p i j M ( j ) ,i E I . If M(i), i e I, were a j
known function, the optimal stopping time would be known. The following discussion is intended to provide methods for determining M(i), i E I. By a super-regular function f ( i ) , i E I , with respect to { p i j } ,we mean a nonnegative function satisfying
1 J
We first prove:
Pij
f ( j > 5 f(i>,
iEI
(2)
+
LEMMA 1. Let z be any stopping time. Iff ( i ) , i E I , is any function such thatf(i) 2 w i , i E I , then E{wyzI Yo = i } 5 E{f ( Y J I Yo = i},
iEI .
Proof:
E{f(Y,)I Yo = i}
=
C E { f ( y r )I Y, = i,
=
C f(j)P{K= j l
j
i
Y, = j } P { Y , = j 1 Y, = i}
yo = i}
~CwjP{Yr=jI~,=i} j
=CE{wY,IY;=i, Yr=j}P{Yr=jIY,,=i} 1
= E { w y z I Yo = i}
then
.
LEMMA 2. Let z be any stopping time. Iff(i), i E I, is super-regular,
E{f(Y,)IYo = i} rf(i), i e I .
A Different Approach
107
Proof: Let R be the space of all sequences w = {ik, k = 0, 1, . . .}, where the range of each coordinate of w is I. Because { Y,} is eventually stopped or absorbed at 0, all the probability mass on R is concentrated on a denumerable subset of R. In what follows, P{o}is to beinterpreted as P{ Yk = ik ,k = 0, 1, . . .}. Let En denote any subset of R determined by conditions on io , i,, . . . ,in (that is, on the first n + 1 coordinates of w). Since f is super-regular, we have
=
c P(0I
WEEn
yo = i>f(Y,(o>).
Recall that { 7,,n = 0, 1, . . .} denotes the stopped process. We now show that for each n = 0, 1, . . . , (3) ~{f( F,,) I yo = i} 5 f(i>, i 6 I. Noting that {w I z n} and {oI z > n} are both subsets of R of the form E,, , we can write using (3)
E{f(P"+1) I y o
= E(f(
= il
E) I Yo = i }
Since E{f(Y0)I Yo = i } = f ( i ) , Eq. (4) holds on iterating the above
8 Optimal Stopping of a Markov Chain
108
inequality. Since 7 < co with probability I (because z is less than or equal to the time of absorption at state 0), we have that lim Yn(w)= n-t m
Y,(w) with probability 1. Since interchange of limit and expectation are valid here, we have E{f(K) I Yo = i}
= lim
n-t m
E { f ( y n )I Yo = i}
5 f(i),
i EI .
and the lemma is proved. We now define the smallest super-regular function dominating
{wi , i E I } as that function {s(i), i E I } satisfying the conditions that
(i) s is super-regular, (ii) s(i) 2 w i , ie I , (iii) s(i) sf(i), i E I , whenever f is super-regular and f (i) 2 wi , i E I . I f f and g are two super-regular functions then h = min(f, g) is also super-regular since
Thus, s can be defined as the lower envelope of all superregular functions dominating { w i, i E I } . Clearly, one and only one such function exists. The main result relating M ( i ) to the notion of the smallest superregular function dominating {wi , i E I } is
THEOREM 1. The function { M ( i ) ,i E I } of (1) is equivalent to the smallest super-regular function dominating { w i , i E I } . Proof: Clearly M ( i ) satisfies condition (ii). Also, from Eq. (I), if M(i) = w i , then M(i) 2 pij M ( j ) ; otherwise, M(i) = pijM ( j ) .
1 j
j
Hence, condition (i) holds. To show that condition (iii) holds, supposef is super-regular andf(i) 2 w i , i E I. Let z be the optimal stopping time.
A Different Approach
109
Then for each i E I , by Lemma 1 and 2,
M(i) = E{wYTI Yo= i }
I J W ( Y , >I yo = i>
SfW. Thus, condition (iii) holds and the theorem is proved. We now can determine { M ( i ) ,i E I } by solving a linear programming problem as given in: THEOREM 2. If {vi*, i E I } is an optimal solution to the linear programming problem to minimize
subject to
then M(i) = vi*, i E I . Proof: From the constraints of linear programming problem, the function {vi*, i E I } is super-regular and dominates { w i , i E I } . By Theorem 2, s(i) = M(i), i E I . If {M(i),i E I > is not equal to {ui*,i E I > then M(i) 5 vi*, i E I , with strict inequality holding for at least one i. However, then {vi*, i E I } cannot be the optimal solution to the linear programming problem, a contradiction proving the theorem.
Another method for calculating {M(i), i E I } is a method of successive approximations. Let {fo(i),i E I } be a given function. Define
8 Optimal Stopping of a Markov Chain
110
recursively, f,(i)
= max{fo(i), C ~ ~ ~ f n - i~E I ( jfor ) }n, = 1,2, j
have :
.. . . We
THEOREM 3. Iffo(i) = w i t i E I, then M(i) = lim f,(i), i E I . n-rm
Pro08 It is easily established that f,(i) = E{wYzn I Yo = i}, i E I, where z, is the optimal stopping time among the class of all stopping numbers such that z 5 n with probability 1. As {S.) is a nondecreasing sequence with f , ( i ) 2 M(i), i E I, then f(i) = limf,(i) S M(i). We also n-
have f(i) = max
(
wi 9
CPijf(j) j
I
9
00
i E I,
and clearly f is a super-regular function that dominates { w i , i E I}. However, by Theorem 2, we must have thatf(i) = M(i), i E I, so that the theorem is proved. A third method for calculating {M(i),i E I} is to solve the system (1). That this is true follows from:
then f ( i ) = M(i), i E I.
Proof: If {f(i), i E I} satisfies (1) then it is super-regular with f ( i ) 2 w i , i E I. Since M(i) is the unique smallest super-regular function dominating { w i , i E I}, then Ai
=f(i) - M(i) 20, iEI.
A Different Approach
111
On subtracting M(i) fromf(i) in (l), we obtain
Ai$xpijAj,
iEI.
j
However, on iterating, we obtain that Ai $
1i p$) A j ,
i
EI
,
and sincej = 0 is an absorbing state with all other states being transient, lim pi;) = 0 for j # 0. Thus
t+O
Ai 5 d o =0,
iel,
which proves the theorem. It is sometimes possible, without knowing {M(i),i E I},to determine that a state i as one at which the process is stopped under an optimal stopping time. More explicitly we state: THEOREM 5. Let { f ( j ) , j I} ~ be any super-regular function that dominates {wj ,j E I}(that is,f( j) 2 wi ,j E I).If for some i, p i i f ( j ) = i
wi ,then M(i) = wi; that is, i is a state where the process is stopped under an optimal policy.
Proof: Since wi S M(i) sf(i),
= wi ;
hence, equality must hold. See Problem 3 for an application of Theorem 5.
8 Optimal Stopping of a Markov Chain
112
Computational Example
Suppose I = 0, 1, 2, where Po0
Po1
(hP11 P20
PZl
;
1 0 0
Po2
p12) = P22
and (wo,w,,wz)= (0,2, 1). We want to find z to minimize Ewy,, Let us solve for M(i), i = 0, 1, 2, by linear programming. Since we know that M(0) = 0, the linear programming problem can be stated as finding those variables vi , v2 to minimize
Solving, we find vi* = 2, vi* = 1 as an optimal solution. Thus, M(0) = 0, M(1) = 2, M(2) = 1, where
wt = 2
> 3(2 + 1)
and w2
=1
< i(2
+ 1)
Thus, the optimal stopping time is always T = 0.
Dual Linear Programming Problem
113
The Dual Linear Programming Problem Let us consider the dual linear programming problem for obtaining ( M ( i ) , i E I}. First, it is convenient to rewrite the primal problem :
=
c i
pij wj
= yi,
- wi
i E I (say).
This problem was obtained by setting ui = ui - wi in the original primal problem; the constant term in the objective function has been dropped. The dual problem is: Maximize
Ci Yixi
subject to
Xi30,
1 i
Xi(6ij
- pij)
i€I,
i Pj ,
j EI
From Theorem 2 of Appendix A, one can argue that for every possible stopping set S , the values {Ti}, equal to the expected number of entries into state i from time t = 0 up to but not including the time of entry into S, where P( Yo = i ) = Pi, i E I, are feasible solutions to the dual problem. However, the objective function under one of these solutions for t given by the stopping set S, can be seen to be E w , ~ - C p i wi . Thus if S is the optimal stopping set, the objective function must equal C P i u i , where { u i } is the optimal solution to the primal i
114
8 Optimal Stopping of a Markov Chain
problem. By the duality theorem (Theorem 4 of Appendix C), at least one optimal solution of the dual problem must be {Xi}, where S is the optimal stopping set. In any case, using Theorem 4 of Appendix C , part of the optimal stopping set can easily be extracted from the dual problem solution; namely, set uj = 0 when the jth constraint in the dual problem solution holds with strict inequality prevailing. However, uj = 0 implies that j E S. Some Other Forms of the Stopping Problem
The problem formulation of this chapter includes the case where a cost is incurred for each period that the process continues. The cost can be a function of the state of the process; that is, there is a cost cieach time the process is in state i, i E I . The problem is to maximize
(
I
E wyr-ccytIYo=i, zo:
by selecting the best stopping number z. Let
iE1,
where z1 denotes the stopping time that waits until Y, = 0 for the first time; that is, T~ = min{tl Yr = 0, t 2 I}. For any stopping time T, let us note that
x P{Yr = j ,
z = n I Yo = i }
c 1C(j)P{Y, m
= =
j
n=O
=j ,
T =n
Ci C(.~)P{Y, = j I yo = i},
I Y, = i> i
EI
.
Some Other Forms of the Stopping Problem
115
Then, for any stopping time z,
= C W ~ P { Y , = ~ ( Y+CC(j)P{Y,=i}-C(i) ,=~} i
=
C (wj+ C(j))P{Y,= j I Y, i
=E{w,,+C(Y,)I
i
= i}
- C(i)
Y,=i}-C(i).
+
Therefore, on letting wi’= w i C(i),i E I , and solving the original stopping problem with respect to the values {wi‘,i E I } , we will obtain an optimal stopping time for the problem with costs. The problem formulation also includes the problem of finding a stopping time z to maximize
where c1 is a number between 0 and 1. It is well to note that for this problem, it is not necessary to have an absorbing state in order to have a nontrivial problem. The presence of the factor ar makes it imperative that we do not wait too long before stopping the process. However, we now show that by introducing an additional state, the Eroblem can be reverted to its original form. Let us consider a related Markov chain {Y,’, t = 0, I, . . .} over the state space I’ = I u {0} where 0 is an absorbing state of {Yr’, t =,O, 1, . . .}. More specifically, we let pbo = 1 ;pio = 1 - u, i E I ; plj = upij, i, j E I ; wo’= 0, wi‘= wi , i E I . For any stopping time 5’ to stop the process, {Yt’,t = 0, 1, . . .}, which is a function only of the state of the process, relate the stopping time z to stop { Y, , t = 0, 1, . ..} by defining z to stop { Y,} at those states i # 0 a t which z’ stops { Y,’}.
116
8 Optimal Stopping of a Markov Chain
However, for any stopping time z’, we have
E { w l t z ,I Yo’ = i} m
=I C w i ’ P { Y , ’ = j , j c I t=O
= c cwj’P{y,’=j, = cCwja‘P{Y,=j, 00
z’=tiY,=i} Y,’#O,
J c I t=O
m
j E I t=O
= E{a‘wY, 1 Yo = i }
Y,#O,
lSn = f ( i ) ,
i EI ,
for any stopping time z.
(3)
Suppose there are n objects with associated distinct values ul, v 2 , . . .,u, . We define the following selection process: An object is selected at random. If its value is acceptable, then the process of selection terminates with the value of the selected object given to the decision maker. If the value of the object is unacceptable, then the object is discarded and another random selection is made from the remaining n - 1 objects.
8 Optimal Stopping of a Markov Chain
118
The selection process continues in this manner until an object is accepted. If all n objects have been rejected then the value received is zero. Determine a stopping procedure that maximizes the probability of choice of the most valuable object. Consider only stopping procedures that do not accept an object whose value is less than the value of one already rejected. Solution (Dynkin [29]): Consider the state space I = {1,2,. . ., n, 0}, with Y, = 1. Let Y, = i if the ith object selected is the first to have a value greater than the first selected; Y3 = j if the jth object selected is the first object to have a value greater than that associated with the value of the ith object selected. In general, Y, is the number of the object selected which has value exceeding the value of the (Y,- ,)th object selected; Yt = 0 when n objects have been rejected. Clearly, pij=Oif 1 S j s i . For i < j ,
i p..=” j ( j - 1) ’
n
Pio=1-
C Pij. j=i+l
Also, whenever Y, = i and the object is accepted, the probability that the object is most valuable is i/n. Therefore, we set wi = i/n, i = 1, .. ., n. Let i* be the largest integer for which
c i/j =- 1 . One can verify
n-1
j=i*
that the function f(j) = max(i*/n,j/n) is a superregular function which dominates {wi ,.j E I } . Also, c p i j f ( j )= wi for i 2 i*. Thus, by Theorem 5, states i i
for which i 2 i* are states at which an optimal stopping time r stops and M(i) = i/n for i 2 i*. Since pii = 0 for j 5 i,
Problems
119
hence, at state i* - 1 an optimal stopping time does not stop. Repeating the argument successively, the same holds for state i* - 2, .. . , 1. Summarizing, z stops at states i*, . . . ,it and does not stop at states i, . . . , i* - 1. (4) Suppose E is a set of states such that pii = 0 for every i E E and j $ E , z p i j w j B w i for every i # E , and .i
c p i i w j 5 wi for every i E E. Then show that an j
optimal policy stops for all i E E, and continues for all i $ E. (5)
Supposef(i) = wi - C p i , w j , i E I, is a nonincreasing i
function and C p i j g ( j ) , i E I , is nonincreasing whenever i
g(i), i E I , is nonincreasing. Show that the optimal policy is of the form: stop for all i 2 i* and continue for all i c i* where i* is a state that must be determined.
Proof (Breiman [S]): Let
H(i ) = M( i ) - wi
Using Theorem 3, show that H(i) is nonincreasing from which it will follow that H(i) = 0 for all i 2 i* and H(i) < 0 for i < i*.
9 Some Applications
1 A Replacement Model A common activity is the periodic inspection of some system, or one of its components, as part of a procedure for keeping it operative. After each inspection, an action must be taken as to whether or not to alter the system at that time. The problem is that of determining, according to some appropriate cost criterion, the optimal policy for taking actions. More specifically, suppose a unit (a system, a component of a system, a piece of operating equipment, etc.) is inspected at equally spaced points in time and that after each inspection it is classified into 121
9 Some Applications
122
one of $L + 1$ states $0, 1, \ldots, L$. Then $\{Y_t\}$ is the sequence of observed states. A unit is in state 0 if and only if it is new; a unit is in state $L$ if and only if it is inoperative. We assume that at states $1, \ldots, L-1$ there are two possible actions: $a = 1$ is not to replace the unit, $a = 2$ is to replace the unit. At state 0 only one action is possible, not to replace. At state $L$ only one action is possible, to replace. Accordingly, we set
$$q_{ij}(1) = p_{ij}, \qquad i = 0, \ldots, L, \quad j = 0, \ldots, L,$$
with $p_{i0} = 0$, $i = 0, \ldots, L-1$, and $p_{L0} = 1$; $q_{i0}(2) = 1$, $i = 1, \ldots, L-1$. We assume the $\{p_{ij}\}$ are such that for every $i$ ($i = 0, \ldots, L-1$), $p_{iL}^{(t)} > 0$ for some $t \ge 1$. This implies that a unit not replaced will eventually become inoperative with probability 1. We assume two types of cost, the cost to replace an operative unit and the cost to replace an inoperative unit. That is, we set
$$w_{i1} = 0, \quad i = 0, \ldots, L-1; \qquad w_{i2} = c, \quad i = 1, \ldots, L-1; \qquad w_{L1} = c + A,$$
where $c > 0$, $A > 0$. Thus, $A$ is the additional cost incurred if the unit is allowed to become inoperative before being replaced; $\{W_t\}$ is the sequence of costs. Either the discounted expected cost criterion $\Psi_R(i, \alpha)$ for some given $\alpha$ ($0 < \alpha < 1$) or the expected average cost criterion $\phi_R(i)$ may be of interest. The methods of Chapters 4 and 6 can be employed to find optimal replacement policies, depending on which criterion is selected. However, in practice it is frequently desirable to use simple replacement policies. For example, we speak of a control-limit policy as one which always replaces the unit whenever the observed state is $i_0, i_0 + 1, \ldots, L$ and never replaces the unit in states $0, 1, \ldots, i_0 - 1$; state $i_0$ is the control limit. We shall show that under certain conditions on $\{p_{ij}\}$ there always exists a control-limit policy that is optimal. We state
CONDITION A: The transition probabilities $\{p_{ij}\}$ are such that for every nondecreasing function $f(j)$, $j = 0, \ldots, L$, the function
$$g(i) = \sum_{j=0}^{L} p_{ij} f(j), \qquad i = 0, \ldots, L-1,$$
is also nondecreasing.
We also state

CONDITION B: The transition probabilities $\{p_{ij}\}$ are such that for each $k = 0, 1, \ldots, L$, the function
$$r_k(i) = \sum_{j=k}^{L} p_{ij}, \qquad i = 0, \ldots, L-1,$$
is nondecreasing. Let us first show
LEMMA 1. Conditions A and B are equivalent.

Proof: Assume Condition A. Then, in particular, for each $k$ the function
$$\sum_{j=0}^{L} p_{ij} f_k(j), \qquad \text{where } f_k(j) = \begin{cases} 0, & j = 0, \ldots, k-1, \\ 1, & j = k, \ldots, L, \end{cases}$$
is nondecreasing. But we have
$$\sum_{j=0}^{L} p_{ij} f_k(j) = \sum_{j=k}^{L} p_{ij} = r_k(i),$$
and, hence, Condition B holds. Assume Condition B holds. Any nondecreasing function $f(j)$ can be expressed in the form
$$f(j) = \sum_{k=0}^{L} c_k f_k(j),$$
where $c_k \ge 0$, $k = 0, \ldots, L$, and $f_k(j)$ is defined above. Then
$$g(i) = \sum_{j=0}^{L} p_{ij} f(j) = \sum_{j=0}^{L} p_{ij} \sum_{k=0}^{L} c_k f_k(j) = \sum_{k=0}^{L} c_k \sum_{j=0}^{L} p_{ij} f_k(j) = \sum_{k=0}^{L} c_k \sum_{j=k}^{L} p_{ij}.$$
Since $c_k \ge 0$ and, by hypothesis, $\sum_{j=k}^{L} p_{ij}$ is nondecreasing in $i$ for each $k$, it follows that $g(i)$ is nondecreasing, proving the lemma.

The significance of Lemma 1 is that Condition A becomes a verifiable condition through the verification of Condition B. We now state:
THEOREM 1. If Condition A (or B) holds, then for each $\alpha$ there exists a control-limit policy $R(\alpha)$ such that
$$\Psi_{R(\alpha)}(i, \alpha) = \min_{R \in C} \Psi_R(i, \alpha), \qquad i = 0, \ldots, L.$$

Proof: Let
$$\Psi(i, \alpha, N) = \min_{R \in C} \sum_{t=0}^{N} \alpha^t E_R(W_t \mid Y_0 = i), \qquad N = 0, 1, \ldots.$$
Clearly, $\Psi(i, \alpha, 0)$ is a nondecreasing function of $i$. Assume $\Psi(i, \alpha, n)$ is nondecreasing in $i$ for $0 \le n \le N$. Then, since
$$\Psi(L, \alpha, N+1) = c + A + \alpha \sum_{j=0}^{L} p_{0j} \Psi(j, \alpha, N),$$
it follows from the induction hypothesis and Condition A that there exists an $i^*$ such that
$$\Psi(i, \alpha, N+1) = \begin{cases} \alpha \sum_{j=0}^{L} p_{ij} \Psi(j, \alpha, N), & i < i^*, \\ c + \alpha \sum_{j=0}^{L} p_{0j} \Psi(j, \alpha, N), & i^* \le i < L, \\ c + A + \alpha \sum_{j=0}^{L} p_{0j} \Psi(j, \alpha, N), & i = L, \end{cases}$$
where $\Psi(i, \alpha, N+1)$ is a nondecreasing function of $i$. Therefore, $\Psi(i, \alpha, N)$ is nondecreasing in $i$ for $N = 0, 1, \ldots$. From Chapter 4, Theorem 1, we know that $\lim_{N \to \infty} \Psi(i, \alpha, N) = \min_{R \in C} \Psi_R(i, \alpha)$ is also nondecreasing in $i$. On repeating the argument using Condition A again, the theorem follows.
THEOREM 2. If Condition A (or B) holds, then there exists a control-limit policy $R^*$ such that
$$\phi_{R^*}(i) = \min_{R \in C} \phi_R(i), \qquad i = 0, \ldots, L.$$

Proof: From Theorem 2 of Chapter 3 and Corollary 1 to Theorem 1 of Chapter 6, we need only consider policies in $C_D$. By Theorem 1, for each $\alpha$ ($0 < \alpha < 1$) there exists a control-limit policy $R(\alpha)$ that minimizes $\Psi_R(i, \alpha)$. Let $\{\alpha_v,\ v = 1, 2, \ldots\}$ be any sequence of discount factors such that $\lim_{v \to \infty} \alpha_v = 1$ and $R(\alpha_1) = R(\alpha_2) = \cdots = R^*$. Since there are at most a finite number of different control-limit policies, such a sequence exists. Let $R$ be any policy in $C_D$. Since $R^* = R(\alpha_v)$ is optimal for $\alpha_v$, we have
$$(1 - \alpha_v)\, \Psi_R(i, \alpha_v) \ge (1 - \alpha_v)\, \Psi_{R^*}(i, \alpha_v), \qquad v = 1, 2, \ldots.$$
Letting $v \to \infty$ and using Theorem 1 of Appendix A and Theorem 1(b) of Appendix B, we obtain
$$\phi_R(i) = \lim_{v \to \infty} (1 - \alpha_v)\, \Psi_R(i, \alpha_v) \ge \lim_{v \to \infty} (1 - \alpha_v)\, \Psi_{R^*}(i, \alpha_v) = \phi_{R^*}(i), \qquad i = 0, \ldots, L.$$
Therefore, $R^*$ is optimal and the theorem is proved.
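The following minimal sketch illustrates Theorems 1 and 2 numerically. The deterioration matrix, costs, and discount factor are assumed example data, not taken from the text; the matrix is chosen so that its partial row sums $\sum_{j \ge k} p_{ij}$ are nondecreasing in $i$, i.e., Condition B holds. Value iteration then produces an optimal policy of control-limit form, as the theorems predict.

```python
# Value iteration for an assumed replacement example satisfying Condition B.
import numpy as np

L, c, A, alpha = 4, 1.0, 2.0, 0.95
# Rows 0..L-1: deterioration probabilities; the partial sums over j >= k are nondecreasing in i
# (Condition B).  Row L encodes the forced return to state 0 from the inoperative state.
p = np.array([[0.0, 0.5, 0.3, 0.1, 0.1],
              [0.0, 0.4, 0.3, 0.2, 0.1],
              [0.0, 0.0, 0.4, 0.3, 0.3],
              [0.0, 0.0, 0.0, 0.4, 0.6],
              [1.0, 0.0, 0.0, 0.0, 0.0]])

V = np.zeros(L + 1)
for _ in range(500):
    keep = alpha * (p[:L] @ V)            # action a = 1 (do not replace) in states 0..L-1
    repl = c + alpha * (p[0] @ V)         # replace: pay c, then evolve as a new unit
    newV = np.empty(L + 1)
    newV[0] = keep[0]                     # state 0: only "keep" is allowed
    newV[1:L] = np.minimum(keep[1:L], repl)
    newV[L] = c + A + alpha * (p[0] @ V)  # state L: forced replacement plus the penalty A
    V = newV

keep = alpha * (p[:L] @ V)
repl = c + alpha * (p[0] @ V)
policy = ["keep"] + ["replace" if repl <= k else "keep" for k in keep[1:]] + ["replace"]
print("optimal actions by state:", policy)   # expected: keep, ..., keep, replace, ..., replace
```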
2 A Surveillance-Maintenance-Replacement Model

Consider a system, in use or in storage, which is deteriorating. Suppose that the deterioration occurs stochastically and that the condition of the system is known only if it is inspected, which is costly.
After inspection the manager of the system has two basic alternatives: (a) to replace the system or (b) to keep it. Under the second alternative he must decide the extent of repairs to be made and when to make the next inspection. If inspection is put off too long, the system may fail in the interim, the consequence of which is an incurred cost which is a function of how long the system has been inoperative. Let us suppose that the uninspected system evolves according to a Markov chain through the states $0, 1, \ldots, L$. The state 0, as before, denotes a new system and $L$ an inoperative system. Let $\{p_{ij}\}$ denote the matrix of transition probabilities, with $p_{LL} = 1$ and $p_{ii} > 0$ for each $i$. Assume that when a replacement is made an instantaneous transition to state 0 takes place; when a repair is made an instantaneous transition takes place to one of the states $1, \ldots, L-1$, depending on the extent of the repairs. Replacements or repairs are made only at the times of inspections. Assume that $M < \infty$ denotes the upper bound on the number of periods that can elapse without an inspection. Let $c_i$ denote the cost of inspection when, in fact, the system is in state $i$. Let $r_{ij}$, $i = 1, \ldots, L$, $j = 0, \ldots, L-1$, denote the cost to place the system in state $j$ after observing the system to be in state $i$. In particular, $r_{i0}$ is the cost to replace the system from state $i$. In addition, we let $r_{L(m), j}$, $m = 1, \ldots, M$, denote the cost to place the system in state $j$ from state $L$ when, prior to discovering the system in state $L$, the system has been in state $L$ for $m$ uninspected periods. This cost represents, in addition to the repair or replacement costs, the cost associated with undetected failure. For a criterion, we shall be interested in minimizing the expected average cost per unit time attributed to the surveillance-maintenance-replacement policy. The Markovian decision process we shall work with is the process $\{Y_t, A_t,\ t = 0, 1, \ldots\}$, $Y_0 = 0$, where $\{Y_t,\ t = 0, 1, \ldots\}$ is the sequence of observed states and $\{A_t,\ t = 0, 1, \ldots\}$ is the sequence of actions taken. The state space $I$ will consist of the states $0, 1, \ldots, L, L(1), \ldots, L(M)$, where $L(m)$, $m = 1, \ldots, M$, are additional states with $L(m)$ denoting the fact that the system is observed to be in state $L$ and has been in state $L$ for $m$ uninspected periods. At each state $i \in I$, an action $A_t = a_{jm}$ consists in placing the system in state $j$, $j = 0, 1, \ldots, L$, and deciding to
skip $m$ ($m \le M$) time periods before observing the system again. If the system is observed in one of the states $L, L(1), \ldots, L(M)$, we assume that $a_{j0}$, $j = 0, \ldots, L-1$, are the only possible actions. We have as transition probabilities for the observed process
$$q_{ij}(a_{lm}) = p_{lj}^{(m+1)}$$
for each $i$, $j$, $l$, and $m$. Let
$$Z_{tijm} = \begin{cases} 1, & \text{if } Y_t = i,\ A_t = a_{jm}, \\ 0, & \text{otherwise.} \end{cases}$$
If $J_T$ is the average cost up to time $T$ (in real time), then
$$J_T = \frac{\sum_{t=0}^{t(T)} \sum_i \sum_j \sum_m (c_i + r_{ij})\, Z_{tijm}}{\sum_{t=0}^{t(T)} \sum_i \sum_j \sum_m (1+m)\, Z_{tijm} + \theta},$$
where $t(T)$ is the largest value of $t$ such that the real time is less than or equal to $T$, and $\theta$ is some positive integer less than or equal to $M$.
From the application of Theorem 5, Chapter 7, and since $t(T) \to \infty$ when $T \to \infty$, it is possible to select a policy $R \in C_D$ that minimizes $E \liminf_{T \to \infty} J_T$ (or $E \limsup_{T \to \infty} J_T$) over all $R \in C$. Since, for any $R \in C_D$,
$$\lim_{T \to \infty} \frac{1}{T+1} \sum_{t=0}^{T} Z_{tijm} = \lim_{T \to \infty} \frac{1}{T+1} \sum_{t=0}^{T} P\{Y_t = i,\ A_t = a_{jm}\}$$
with probability 1 (combine Theorem 4 of Appendix A and Theorem 5 of Appendix B), the problem can, using the methods of Chapter 6, ultimately be put into the form: choose $\{x_{i a_{jm}}\}$ to minimize
$$\frac{\sum_i \sum_j \sum_m (c_i + r_{ij})\, x_{i a_{jm}}}{\sum_i \sum_j \sum_m (1+m)\, x_{i a_{jm}}}$$
subject to
$$x_{i a_{jm}} \ge 0 \quad \text{for all } i, j, m, \qquad \sum_j \sum_m x_{i a_{jm}} = \sum_k \sum_j \sum_m x_{k a_{jm}}\, q_{ki}(a_{jm}), \quad i \in I, \qquad \sum_i \sum_j \sum_m x_{i a_{jm}} = 1.$$
The optimal policy is obtained by putting
$$D_{i, a_{jm}} = \frac{x_{i a_{jm}}}{\sum_j \sum_m x_{i a_{jm}}}$$
for each $i$ and $a_{jm}$.

The above problem involves minimizing a ratio of linear functions subject to linear constraints, where the lower linear form is always positive. Any problem of this form can always be transformed into a linear programming problem. Namely, suppose we wish to minimize
$$\frac{\sum_{i=1}^{n} c_i x_i}{\sum_{i=1}^{n} d_i x_i}$$
subject to
$$x_i \ge 0, \quad i = 1, \ldots, n, \qquad \sum_{j=1}^{n} a_{ij} x_j = 0, \quad i = 1, \ldots, m, \qquad \sum_{i=1}^{n} x_i = 1,$$
where $\sum_{i=1}^{n} d_i x_i > 0$ for all feasible $(x_1, \ldots, x_n)$. Set
$$z_i = \frac{x_i}{\sum_{l=1}^{n} d_l x_l}, \quad i = 1, \ldots, n, \qquad z_{n+1} = \frac{1}{\sum_{l=1}^{n} d_l x_l}.$$
Then we can write the linear programming problem in $z_1, \ldots, z_{n+1}$: minimize
$$\sum_{i=1}^{n} c_i z_i$$
subject to
$$z_i \ge 0, \quad i = 1, \ldots, n+1, \qquad \sum_{j=1}^{n} a_{ij} z_j = 0, \quad i = 1, \ldots, m, \qquad \sum_{j=1}^{n} z_j = z_{n+1}, \qquad \sum_{j=1}^{n} d_j z_j = 1.$$
Clearly, a one-to-one correspondence exists between the feasible solutions of the two problems.
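A small computational sketch of the transformation just described, using Python and scipy.optimize.linprog (both are choices made here, not part of the text); the numerator and denominator coefficients and the single homogeneous constraint are illustrative.

```python
# Fractional program min (c.x)/(d.x) over {x >= 0, A x = 0, sum(x) = 1, d.x > 0},
# solved via the linear program in z = x / (d.x) with the extra variable z_{n+1} = 1/(d.x).
import numpy as np
from scipy.optimize import linprog

c = np.array([3.0, 1.0, 2.0])        # numerator coefficients (illustrative)
d = np.array([1.0, 2.0, 1.0])        # denominator coefficients, positive on the feasible set
A = np.array([[1.0, -1.0, 0.0]])     # one homogeneous constraint A x = 0

n = len(c)
A_eq = np.vstack([np.hstack([A, np.zeros((A.shape[0], 1))]),   # A z = 0
                  np.hstack([np.ones(n), [-1.0]]),             # sum_j z_j - z_{n+1} = 0
                  np.hstack([d, [0.0]])])                      # d . z = 1
b_eq = np.concatenate([np.zeros(A.shape[0]), [0.0], [1.0]])
res = linprog(np.concatenate([c, [0.0]]), A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * (n + 1))

z = res.x
x = z[:n] / z[n]                     # recover the fractional-program solution
print("optimal ratio:", (c @ x) / (d @ x))
```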
In the case of our problem, the $D_{i, a_{jm}}$ can be obtained by solving the linear programming problem and setting
$$D_{i, a_{jm}} = \frac{z_{i a_{jm}}}{\sum_j \sum_m z_{i a_{jm}}}$$
for each state $i$ and decision $a_{jm}$. As argued in Chapters 4, 5, and 6, if the simplex method is used, it will turn out that for each state $i$, $D_{i, a_{jm}}$ will be equal to 1 for precisely one $a_{jm}$ and equal to 0 for all others. Thus, a policy in $C_D$ will be optimal among all policies.
3 AOQL of Continuous Sampling Plans

We have in mind a stream of items produced on a conveyor line. Each item, if inspected, could be classified as either defective or nondefective. Frequently, the stream is grouped into lots from which a sample is drawn and the lot accepted or rejected according to whether few or many defectives are found in the lot. Sampling in this manner in order to control the quality of output of the production system is called lot-by-lot sampling. However, often it is not feasible to group items into lots. When this is the case the continuous stream is sampled, with defective items found replaced by nondefective ones. Sampling plans of this kind are called continuous sampling plans. We treat one of the simplest such plans, a version of what is usually referred to as Dodge plans. The plan proceeds as follows: Sample every item until $m$ successive nondefective items are seen, at which time sample each item produced with probability $f$, $0 < f < 1$. When a defective is found, resort to 100% inspection. Then continue as before. Each defective item is replaced (from a pool of good items) by a nondefective item. The average outgoing quality (AOQ) for a given stream of items is defined as the least upper bound on the possible proportions of defective items that can pass through the inspection process. That is, for a given stream, because of the random inspections, the proportion of defective items passed (defined as the lim sup of the ratio of the number of defective
items passed to the number produced) is a random variable; the smallest value that exceeds this random variable with probability 1 is the AOQ. A conservative measure of the effectiveness of a continuous sampling plan is the largest AOQ that can be obtained. If the AOQ can be kept within acceptable bounds no matter what the stream, then the plan is a satisfactory one. Accordingly, we define the Average Outgoing Quality Level as
$$\mathrm{AOQL} = \sup_{s} \mathrm{AOQ},$$
where $s$ denotes a production stream and the supremum is taken over all possible streams, or all possible streams in a given subset of streams. The problem we consider here is that of computing the AOQL for given $m$ and $f$. For example, we say the production process is in a state of "control" if each item has some fixed probability $p$ of being defective. We can readily compute the AOQL under the restricted assumption that the process is in control. For let $\{Y_t\}$ be a stochastic process where $Y_t$ denotes the state of the inspection system just before the $t$th item is inspected; that is, $Y_t$ can be equal to $1, \ldots, m, m+1$, where $i$, $i = 1, \ldots, m$, denotes the state of being in 100% inspection with $i - 1$ successive nondefectives already observed since entering 100% inspection, and $i = m+1$ denotes the state of being in partial inspection (sampling with probability $f$). Then $\{Y_t\}$ is a Markov chain with transition probabilities
$$p_{i1} = p, \qquad p_{i,i+1} = 1 - p, \qquad i = 1, \ldots, m; \qquad p_{m+1,1} = pf, \qquad p_{m+1,m+1} = 1 - pf.$$
Using Theorem 4 of Appendix A, the long-run proportion $\pi_{m+1}$ of time the inspection system is in state $m+1$ can be computed, from which we can obtain $\mathrm{AOQ} = (1-f)\, p\, \pi_{m+1}$. In fact,
$$\pi_{m+1} = \frac{(1-p)^m}{f\,[1 - (1-p)^m] + (1-p)^m}.$$
Hence,
$$\mathrm{AOQL} = \max_{0 \le p \le 1} \frac{(1-f)\, p\, (1-p)^m}{f + (1-f)(1-p)^m}\,.$$
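For concreteness, the in-control formulas above can be evaluated numerically; the parameter values m = 3 and f = 0.1 below are illustrative only, and Python with numpy is a choice made here rather than anything from the text. The script also prints the out-of-control bound (1 - f)/(mf + 1) obtained at the end of this section, which is always at least as large.

```python
# Numerical evaluation of the in-control AOQL for an assumed Dodge plan (m, f).
import numpy as np

m, f = 3, 0.1
ps = np.linspace(0.0, 1.0, 100001)
q = (1 - ps) ** m
aoq = (1 - f) * ps * q / (f + (1 - f) * q)          # in-control AOQ as a function of p
print("in-control AOQL    :", round(aoq.max(), 5))
print("out-of-control AOQL:", round((1 - f) / (m * f + 1), 5))
```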
We use the Markovian decision process framework to compute the AOQL when the production process is not in a state of control. Let $\{Y_t,\ t = 1, 2, \ldots\}$ be as defined and consider the production process as the decision maker; that is, at the $t$th item, the production process, noting the state of the inspection system, decides to produce a nondefective or a defective. Thus, $A_t = 1$ or 2, 1 denoting a nondefective item and 2 a defective item. Let $Z(m+1, 2)$ denote the long-range proportion of items for which $Y_t = m+1$, $A_t = 2$ (or the lim sup of the long-range proportion). Then, using Theorem 5 of Appendix B (or the well-known law of large numbers), we can see that the long-range proportion of passed defectives is $Z(m+1, 2)(1-f)$ with probability 1. Thus, if $L_R = \inf\{L \mid P_R(Z(m+1, 2) \le L) = 1\}$, the production process wants a policy $R$ to maximize $L_R$; in so doing it obtains the AOQL, $L_R(1-f)$. From the application of Theorem 5, Chapter 7, and because $Z(m+1, 2)$ can be maximized at an extreme point $R$ (using the notation of Theorem 5 of Chapter 7), such a policy can be found in $C_D$. Thus, for certain states $A_t = 1$, and for the others $A_t = 2$. Clearly, $A_t = 1$ if $Y_t = i$, $i = 1, \ldots, m$, and $A_t = 2$ if $Y_t = m+1$. Under this policy $\{Y_t\}$ is a Markov chain with transition probabilities
$$p_{i,i+1} = 1, \quad i = 1, \ldots, m; \qquad p_{m+1,1} = f, \qquad p_{m+1,m+1} = 1 - f.$$
Using Theorem 4 of Appendix A, one obtains that $Z(m+1, 2) = 1/(mf+1)$ with probability 1 and, hence, $\mathrm{AOQL} = (1-f)/(mf+1)$.

4 A Sequential Search Problem
Suppose a hunted object moves about within a finite number of regions according to known probabilistic laws. A single searcher, using a detection system whose effectiveness is a function of the amount of effort used and the region searched, checks one region at a time until the object is found, his effort budget is exhausted, or he decides it is "uneconomical" to continue. The problem is to find an optimal sequential search policy; that is, one which tells the searcher, at each point in
time, whether to search, where to search, and how much effort to use in order to optimize a given effectiveness criterion. More precisely, let us assume that there are regions labeled $1, \ldots, L$. At time $t$, $t = 0, 1, \ldots$, the object is in one of the regions. At that time the searcher selects one of the regions to search, say region $i$, and puts effort $e$ into the search. If the searcher finds the object, the search is over; otherwise the object moves to region $j$ at time $t+1$ with known probability $h_{ij}$ ($i, j = 1, \ldots, L$). Initially, before the searcher has searched any regions, the object places itself in region $j$ with probability $h_{0j}$. Thus, the movement of the object is a function of where the hunter searches. We shall assume that the initial effort budget for the searcher is $B$, a nonnegative integer, and that the efforts $e$ expended are nonnegative integers. Also, in moving from region $i$ to region $j$, the searcher uses up $r_{ij}$ (a nonnegative integer) units of his effort budget. We define the state space $I$ of the Markovian decision process to be $I = \{i_b^k,\ i = 1, \ldots, L;\ k = 0, 1;\ b = 0, \ldots, B\} \cup \{0_B^0\}$. The state $i_b^0$ denotes that region $i$ has just been searched, the object has not been found (indicated by the superscript 0), and $b$ units of the budget remain for future use. The state $i_b^1$ denotes the fact that region $i$ has just been searched, the object has been found (indicated by the superscript 1), and $b$ units remain for future use. The state $0_B^0$ is a fictitious initial state from which the searcher begins his sequence of searches; $0_B^0$ can be interpreted as a base from which all search sequences begin. The class of states
$$T = \{i_0^0,\ i = 1, \ldots, L\} \cup \{i_b^1,\ i = 1, \ldots, L,\ b = 0, \ldots, B\}$$
are states at which the search process terminates; in the first group, termination is due to lack of funds; in the second group, termination is due to the object having been found. When the process is in any of the states of $T$, the next point in time will find the process in $0_B^0$. Thus, the process returns to the initial state, completing a cycle. At each state of $I - T$ the searcher has a number of possible actions, depending on the state. His possible actions at state $i_b^0$ are $\{a_{je},\ j = 1, \ldots, L;\ e = 1, \ldots, b\}$, where $a_{je}$ denotes the action of searching region
$j$ and expending effort $e$. As stated in the previous paragraph, at each state of $T$ the searcher has only one possible action: to return to $0_B^0$. We let $v_j(e)$ denote the conditional probability of finding the object given that region $j$ is searched, the object is in region $j$, and effort $e$ is expended in the search. Thus, the laws of motion of the Markovian decision process are determined by the $\{h_{ij}\}$, the detection probabilities $\{v_j(e)\}$, and the budget accounting, for $i, j = 0, \ldots, L$, $1 \le e \le b$, and $b = 0, \ldots, B-1$.
Suppose we are interested in minimizing the expected cost to reach a state in $T$ subject to the probability of a successful search being greater than or equal to a given number $\beta$. The expected cost of reaching a state in $T$ is the same as the expected cost of a return to state $0_B^0$. We can make use of Examples 2 and 3 of Chapter 7, as well as Theorem 2 of Chapter 7 and Theorem 5 of Appendix A, to formulate the problem as finding that policy $R \in C_D$ that minimizes the expected cost per cycle subject to the constraint that the probability of entering the set
$$T' = \{i_b^1,\ i = 1, \ldots, L;\ b = 0, \ldots, B-1\}$$
during a cycle is at least $\beta$.
The fact that Theorem 2 of Chapter 7 applies is due to the fact that at most one ergodic class can exist in $I$ for every $R \in C_D$. Letting $x_{i a_{je}} = \pi_i(R) D^R_{i a_{je}}$, as in Chapter 5, the problem can be put into the form of finding $\{x_{i a_{je}}\}$ that minimize a ratio of linear functions subject to linear constraints. The method of Section 2 of this chapter can then be used to transform this fractional linear programming problem into an ordinary linear programming problem and then to obtain from its solution the optimal policy in $C_D$. Other criteria could be formulated, with the same methods for obtaining the optimal policy prevailing. For example, we could simply maximize the probability of a successful search.

5 A Stochastic Traveling Salesman Problem

In the well-known traveling salesman problem there are $L + 1$ cities $0, \ldots, L$. There is a cost of $w_{ij}$ to travel from city $i$ to city $j$. The problem consists of finding a route, starting and ending at city 0 and passing through all cities, which has minimum total cost. Clearly, a minimal route goes through each city, not counting the initial time in city 0, exactly once.
In the stochastic traveling salesman problem we allow the routes to be determined by chance, with the constraint that in a route starting from 0 and returning to 0 the expected number of visits to city $i$ be exactly one. We seek a random mechanism for generating the routes having the property that the expected total cost of a route is minimized. To approach this problem with the tools of Markovian decision processes, let $I$ denote the states $0, \ldots, L$. At each state suppose there are actions $a = 0, \ldots, L$; that is, action $a = i$ is an effort to move the process to state $i$. However, for technical reasons, we assume that if action $a = j$ is taken in state $i$, then for $i, j \in I$,
$$q_{ij}(a = j) = 1 - \varepsilon, \qquad q_{ij'}(a = j) = \frac{\varepsilon}{L}, \quad j' \ne j,$$
for some small positive number $\varepsilon$. We assume $\varepsilon > 0$ in order that $I$ be irreducible for every policy $R \in C_D$. The cost of being in state $i$ and taking action $a = j$ is $w_{ij}$. Using the results of Example 2 of Chapter 7, Theorem 2 of Chapter 7, and Theorem 5 of Appendix A, we can formulate the stochastic traveling salesman problem as finding that policy $R \in C_D$ for which the expected cost of a route is minimized subject to the constraint that the expected number of visits to each city per route be one. From the latter constraints we obtain that $z_0(R) = 1/(L+1)$. Thus, using the transformation $x_{ij} = z_i(R) D^R_{ij}$ of Chapter 5, we can formulate the problem as that of finding $\{x_{ij}\}$ to minimize
$$\sum_{i \in I} \sum_{j \in I} w_{ij}\, x_{ij}$$
subject to
$$x_{ij} \ge 0, \quad i, j \in I, \qquad \sum_i \sum_j x_{ij} = 1, \qquad \sum_j x_{ij} = \frac{1}{L+1}, \quad i \in I,$$
and the equations $\sum_k \sum_a x_{ka}\, q_{kj}(a) = \sum_a x_{ja}$, $j \in I$. Since, for each $j \in I$,
$$\sum_k \sum_a x_{ka}\, q_{kj}(a) = \sum_k x_{kj}(1 - \varepsilon) + \frac{\varepsilon}{L} \sum_k \Bigl( \sum_a x_{ka} - x_{kj} \Bigr) = (1 - \varepsilon) \sum_k x_{kj} + \frac{\varepsilon}{L}\Bigl(1 - \sum_k x_{kj}\Bigr),$$
we can rewrite the equations as
$$\sum_i x_{ij} = \frac{1 - \varepsilon}{L + 1 - \varepsilon}, \quad j \in I, \qquad \sum_j x_{ij} = \frac{1}{L+1}, \quad i \in I.$$
As in Chapter 5, we then let $D_{ij}^* = x_{ij} \big/ \sum_j x_{ij}$ be the optimal policy, where $\{x_{ij}\}$ is the optimal solution to the linear programming problem.
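A numerical sketch of the resulting linear program for a small instance, using Python with numpy and scipy.optimize.linprog and randomly generated illustrative travel costs (all of these are choices made here, not part of the text). The stationarity (balance) equations and the equal-occupancy requirement are imposed directly, without the algebraic reduction carried out above; the policy is then recovered as $D_{ij} = x_{ij} / \sum_j x_{ij}$.

```python
# Stochastic traveling salesman LP on an assumed small instance.
import numpy as np
from scipy.optimize import linprog

L, eps = 3, 1e-3                       # cities 0, ..., L
n = L + 1
rng = np.random.default_rng(0)
w = rng.integers(1, 10, size=(n, n)).astype(float)   # illustrative costs w_ij
np.fill_diagonal(w, 0.0)

def q(i, j, a):                        # perturbed law of motion q_ij(a)
    return 1 - eps if j == a else eps / L

idx = lambda i, j: i * n + j           # flatten x_ij into a vector
A_eq, b_eq = [], []
for j in range(n):                     # balance: sum_k sum_a x_ka q_kj(a) = sum_a x_ja
    row = np.zeros(n * n)
    for k in range(n):
        for a in range(n):
            row[idx(k, a)] += q(k, j, a)
    for a in range(n):
        row[idx(j, a)] -= 1.0
    A_eq.append(row); b_eq.append(0.0)
for i in range(n):                     # each city occupied a fraction 1/(L+1) of the time
    row = np.zeros(n * n)
    row[idx(i, 0):idx(i, 0) + n] = 1.0
    A_eq.append(row); b_eq.append(1.0 / n)

res = linprog(w.flatten(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
              bounds=[(0, None)] * (n * n))
x = res.x.reshape(n, n)
D = x / x.sum(axis=1, keepdims=True)   # D_ij: probability of choosing action j in city i
print(np.round(D, 3))
```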
Bibliographical Remarks

The material of Section 1 appears in Derman [20]. Further results are obtained by Kolesar [41]. A similar problem, maximizing the expected time between replacements subject to a probability bound on failure, is treated by Derman [21] and is extended by Kolesar [40] to show the control-limit optimality for this case. Section 2 is found in Klein [38]. We make use of our results in Chapter 7 to reduce the problem to policies in $C_D$. In the original paper, only policies in $C_D$ were permitted. The transformation to the linear programming problem appears in Derman [15]. See Charnes and Cooper [9] for a more general treatment of "linear fractional programming" problems. The calculation of AOQL's for continuous sampling plans via Markovian decision processes has been treated more generally by White [54]. The out-of-control AOQL for the Dodge continuous sampling plans was first computed by Lieberman [42]. For Dodge's original paper see [27]. The application of Markovian decision models to the sequential search problem is due to Klein [39]. We refer the reader to Derman and Klein [23] for the original version of the stochastic traveling salesman problem.
Appendix A. Markov Chains
A sequence of random variables $\{Y_t,\ t = 0, 1, \ldots\}$ whose range is a finite state space $I$ is called a finite-state Markov chain with stationary transition probabilities $\{p_{ij},\ i, j \in I\}$ if
$$P\{Y_{t+1} = j \mid Y_0 = y_0, \ldots, Y_{t-1} = y_{t-1}, Y_t = i\} = p_{ij}$$
for every $y_0, \ldots, y_{t-1}, i, j \in I$ and $t = 0, 1, \ldots$. The parameters $\{p_{ij}\}$ are called one-step transition probabilities. More generally, it follows that
$$P\{Y_{s+t} = j \mid Y_0 = y_0, \ldots, Y_{s-1} = y_{s-1}, Y_s = i\} = P\{Y_{s+t} = j \mid Y_s = i\} = P\{Y_t = j \mid Y_0 = i\} = p_{ij}^{(t)}$$
for every $y_0, \ldots, y_{s-1}, i, j \in I$, $s = 0, 1, \ldots$, and $t = 1, 2, \ldots$. We refer to $\{p_{ij}^{(t)}\}$ as the $t$-step transition probabilities. For $t = 0$, we define $p_{ij}^{(0)} = \delta_{ij}$, the Kronecker delta. If $t = 1$, $p_{ij} = p_{ij}^{(1)}$. The fundamental relationships connecting the transition probabilities are the Chapman-Kolmogorov equations
$$p_{ij}^{(s+t)} = \sum_{k \in I} p_{ik}^{(s)}\, p_{kj}^{(t)}, \tag{1}$$
holding for all $s, t \ge 0$. In particular one gets the recursive formulas
$$p_{ij}^{(t+1)} = \sum_{k \in I} p_{ik}\, p_{kj}^{(t)} = \sum_{k \in I} p_{ik}^{(t)}\, p_{kj}. \tag{2}$$
As a consequence of Eqs. (1) and (2) certain fundamental results can be obtained. An important result is:
THEOREM 1. For all $i, j \in I$,
$$\pi_{ij} = \lim_{T \to \infty} \frac{1}{T+1} \sum_{t=0}^{T} p_{ij}^{(t)}$$
exists and satisfies
$$\pi_{ij} \ge 0, \qquad \sum_{j \in I} \pi_{ij} = 1, \qquad \pi_{ij} = \sum_{k \in I} \pi_{ik}\, p_{kj}.$$
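A quick numerical illustration of Theorem 1 (the transition matrix below is arbitrary, and Python with numpy is a choice made here): the Cesàro averages of the powers of $P$ converge, and each row of the limit is a probability vector invariant under $P$.

```python
# Cesaro averaging of P^t for a small example chain.
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.4, 0.6]])

T = 5000
avg, power = np.zeros_like(P), np.eye(3)
for t in range(T + 1):
    avg += power
    power = power @ P
avg /= (T + 1)

print(np.round(avg, 4))                       # the matrix {pi_ij}
print(np.allclose(avg @ P, avg, atol=1e-3))   # pi_ij = sum_k pi_ik p_kj
print(np.allclose(avg.sum(axis=1), 1.0))      # sum_j pi_ij = 1
```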
It is convenient to classify the various states of $I$. If there exist positive integers $t_1$ and $t_2$ such that $p_{ij}^{(t_1)} > 0$ and $p_{ji}^{(t_2)} > 0$, we say that $i$ and $j$ communicate. If $I'$ is a set of states all of which communicate and no set containing $I'$ possesses states all of which communicate, then $I'$ is called a class (or ergodic class). In general $I$ may be divided into a number of classes and some states which do not belong to any classes. When $I$ consists of one class $I' = I$, then we say $I$ is irreducible.
States that belong to some class are called recurrent. States that do not belong to any class are called transient. Transient states can be characterized in a different way. A state $i$ is transient if there exists a state $j$ and an integer $t$ such that $p_{ij}^{(t)} > 0$ but $p_{ji}^{(t)} = 0$, $t = 0, 1, \ldots$. If $j$ is transient, then $\lim_{t \to \infty} p_{ij}^{(t)} = \pi_{ij} = 0$ for all $i \in I$.

Let $I'$ denote a set of transient states and denote by $Q$ the matrix $\{p_{ij},\ i, j \in I'\}$.

THEOREM 2. The matrix $I - Q$ has an inverse; namely,
$$(I - Q)^{-1} = I + Q + Q^2 + \cdots < \infty.$$

A result similar to Theorem 2 is stated in:
THEOREM 3. If $Q$ is any matrix of positive elements whose row sums are less than or equal to unity and $\alpha$ is any number, $0 \le \alpha < 1$, then
$$(I - \alpha Q)^{-1} = I + \alpha Q + \alpha^2 Q^2 + \cdots < \infty.$$

The following theorem enables one to evaluate $\pi_{ij}$ when $j$ is recurrent.

THEOREM 4. Let $I'$ and $I''$ be classes of recurrent states. If $i \in I'' \ne I'$, then $\pi_{ij} = 0$, $j \in I'$. If $i \in I'$, $j \in I'$, then $\pi_{ij} = \pi_j$, independent of $i$, where $\{\pi_j,\ j \in I'\}$ uniquely satisfy
$$\pi_j > 0, \qquad \sum_{j \in I'} \pi_j = 1, \qquad \pi_j = \sum_{i \in I'} \pi_i\, p_{ij}.$$
If $i$ is transient and $j \in I'$, then
$$\pi_{ij} = P\{Y_t \in I' \text{ for some } t > 0 \mid Y_0 = i\}\, \pi_j.$$
Let $Z_t = 1$ if $Y_t = j$, or zero otherwise. If $i \in I'$, $j \in I'$, and $Y_0 = i$, then
$$\lim_{T \to \infty} \frac{1}{T+1} \sum_{t=0}^{T} Z_t = \pi_j$$
with probability 1. Further useful results can be summarized in:
THEOREM 5. Suppose $I'$ is a class of recurrent states, suppose $j \in I'$, and let $\tau = \min(t \ge 1 \mid Y_t = j)$. Then
$$E\{\tau \mid Y_0 = j\} = \frac{1}{\pi_j}.$$
Also, if $\{w_i,\ i \in I\}$ is a set of numerical values and $u_j = E\{\sum_{t=1}^{\tau} W_t \mid Y_0 = j\}$, where $W_t = w_i$ if $Y_t = i$, then
$$u_j = \frac{\sum_{i \in I'} \pi_i\, w_i}{\pi_j}.$$
Let
$$f_{ij}^{(t)} = P\{Y_t = j,\ Y_n \ne j \text{ for } 1 \le n < t \mid Y_0 = i\}.$$
Then
$$\sum_{t=1}^{\infty} f_{ij}^{(t)} = P\{Y_t = j \text{ for some } t \ge 1 \mid Y_0 = i\}.$$
We let
$$m_{ij} = \sum_{t=1}^{\infty} t\, f_{ij}^{(t)},$$
which can be interpreted as the mean first-passage time from state $i$ to state $j$ when
$$\sum_{t=1}^{\infty} f_{ij}^{(t)} = 1.$$
The following theorem is useful:

THEOREM 6. If $P\{Y_t = j \text{ for some } t \ge 1 \mid Y_0 = i\} > 0$ for all $i \in I$, then
$$\sum_{t=1}^{\infty} f_{ij}^{(t)} = 1 \quad \text{for all } i \in I,$$
and furthermore $m_{ij} < \infty$ for all $i \in I$.
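The quantities appearing in Theorems 2 and 6 are easy to compute for a small example. In the sketch below (arbitrary irreducible chain, written in Python with numpy as an assumed choice), the states other than a chosen target $j$ play the role of the transient set $I'$, and the mean first-passage times $m_{ij}$ are obtained as the row sums of $(I - Q)^{-1}$.

```python
# Mean first-passage times into a target state via (I - Q)^(-1).
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.4, 0.6]])
j = 2                                           # target state
keep = [s for s in range(P.shape[0]) if s != j]
Q = P[np.ix_(keep, keep)]                       # P restricted to the non-target states
N = np.linalg.inv(np.eye(len(keep)) - Q)        # I + Q + Q^2 + ... (Theorem 2)
m = N.sum(axis=1)                               # expected number of steps before reaching j
print(dict(zip(keep, np.round(m, 3))))          # m_ij for i != j
```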
Bibliographical Notes
All results stated here can be found with proofs in Chung [11], Feller [31], and/or Kemeny and Snell [37].
Appendix B. Some Theorems from Analysis and Probability Theory
Let $\{a_n,\ n = 0, 1, \ldots\}$ be a sequence of real numbers and let
$$f(x) = \sum_{n=0}^{\infty} a_n x^n, \qquad 0 \le x < 1.$$

THEOREM 1. (a) If
$$\sum_{n=0}^{\infty} a_n = A < \infty,$$
then $\lim_{x \to 1} f(x) = A$.
(b) If
$$\lim_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N} a_n = A,$$
then $\lim_{x \to 1} (1 - x) f(x) = A$.
(c) Let
$$\limsup_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N} a_n = A;$$
then $\limsup_{x \to 1} (1 - x) f(x) \le A$.
Let $S$ be any space of points. If there is a function $\rho(x, y)$ of $x$ and $y$ in $S$ satisfying
$$\rho(x, x) = 0, \quad x \in S, \qquad \rho(x, z) \le \rho(x, y) + \rho(y, z), \quad x, y, z \in S, \qquad \rho(x, y) = \rho(y, x), \quad x, y \in S,$$
then $\rho$ is called a metric and $S$ is called a metric space (with metric $\rho$). We say a set of points $\{x_n\} \in S$ converges to a point $x \in S$ if $\lim_{n \to \infty} \rho(x_n, x) = 0$.

A space $S$ is said to be compact if for every sequence of points $\{x_n\} \in S$ there exists a subsequence $\{x_{n_v},\ v = 1, \ldots\}$ and a point $x \in S$ such that $\{x_{n_v},\ v = 1, \ldots\}$ converges to $x$.
Let $f(x)$ be any function on $S$. A well-known fact about continuous functions over compact spaces is:
THEOREM 2. If $f$ is a continuous function on $S$ and $S$ is compact, then there exists a point $x^* \in S$ such that $\min_{x \in S} f(x) = f(x^*)$.

Let $S_1, S_2, \ldots$ be a denumerable collection of spaces. By the product space $\prod S_i$ we mean the set of all possible sequences $x = \{x_i,\ x_i \in S_i,\ i = 1, 2, \ldots\}$. We shall say that a sequence $\{x^{(n)},\ n = 1, 2, \ldots\}$ of points in $\prod S_i$ converges to a point $x \in \prod S_i$ if $\lim_{n \to \infty} \rho_i(x_i^{(n)}, x_i) = 0$ for every
$i = 1, 2, \ldots$, where $x_i^{(n)}$ is the $i$th component of $x^{(n)}$, $x_i$ is the $i$th component of $x$, and $\rho_i$ is the metric of $S_i$. In Chapter 3 we used:

THEOREM 3. If $S_i$ is compact for each $i = 1, 2, \ldots$, then $\prod S_i$ is also compact.

Proof: Let $\{x^{(n)}\}$ be any sequence of points in $\prod S_i$. Since $S_1$ is compact, there exists a subsequence $\{n_v^1,\ v = 1, \ldots\}$ and a point $x_1 \in S_1$ such that $\lim_{v \to \infty} \rho_1(x_1^{(n_v^1)}, x_1) = 0$. Since $S_2$ is compact, there exists a subsequence $\{n_v^2,\ v = 1, \ldots\}$ of $\{n_v^1\}$ and a point $x_2 \in S_2$ such that $\lim_{v \to \infty} \rho_2(x_2^{(n_v^2)}, x_2) = 0$. Continuing, for each $i$ there exists a subsequence $\{n_v^i,\ v = 1, \ldots\}$ of $\{n_v^{i-1},\ v = 1, \ldots\}$ and a point $x_i \in S_i$ such that $\lim_{v \to \infty} \rho_i(x_i^{(n_v^i)}, x_i) = 0$. Now let $n_v = n_v^v$, $v = 1, \ldots$, and $x = (x_1, x_2, \ldots) \in \prod S_i$. Then, since $\{n_v,\ v = 1, \ldots\}$ is a subsequence of $\{n_v^i\}$ for each $i$, we have $\lim_{v \to \infty} \rho_i(x_i^{(n_v)}, x_i) = 0$. Thus, $\prod S_i$ is compact.
Let $\{Y_t,\ t = 1, 2, \ldots\}$ be any sequence of random variables. We use the following:

THEOREM 4. (a) If $Y_t \ge 0$, $t = 1, 2, \ldots$, then
$$E \sum_{t=1}^{\infty} Y_t = \sum_{t=1}^{\infty} E Y_t.$$
(b) If
$$\sum_{t=1}^{\infty} E|Y_t| < \infty,$$
then
$$E \sum_{t=1}^{\infty} Y_t = \sum_{t=1}^{\infty} E Y_t.$$
THEOREM 5. If $\{Y_t,\ t = 1, 2, \ldots\}$ are uniformly bounded random variables, then
$$\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \bigl[ Y_t - E(Y_t \mid Y_1, \ldots, Y_{t-1}) \bigr] = 0$$
with probability 1.

Suppose $\{Y_t,\ t = 1, 2, \ldots\}$ is a sequence of random variables with associated random times $\tau_1, \tau_2, \ldots$ such that the sets of random variables
, q,>,{ Y T l + 1 2. . K,}, . . . 7
*
are independent and have the same probability laws. It is assumed (letting zo = 0) that {z, - zv-l, u = 1, . . .} are independent and identically distributed with z, - T , - ~ being determined by conditions on YTuI + 1, . . . , YTu. Such a sequence we call a recurrent event process; that is, some event occurs at times zl,z 2 , . . . which has the effect of starting the process { Y, , t = 1,2, . . .} anew. For example, if { Y,, t = 0, 1, . . .} is a Markov chain with Yo = i, a recurrent state, z l , z2 , . . . may be the successive times at which Y, = i. Let { C, , u = 1,2, . . .} be a sequence defined by
c, = C(Y*"-,+ 1 , . . . ) Y,"),
u = 1, 2,
. . .;
that is, C is a function of the random variables in the t.th " cycle " in the recurrent event process. The sequence { C , , u = 1,2, . . .} consists of independent and identically distributed random variables. One can show:
THEOREM 6. If $EC_v < \infty$ and $E(\tau_v - \tau_{v-1}) < \infty$, and if $v(t)$ is the largest $v$ such that $\tau_v \le t$, then
$$\lim_{t \to \infty} \frac{E \sum_{v=1}^{v(t)} C_v}{t} = \frac{EC_1}{E(\tau_1 - \tau_0)}.$$
If we interpret $C_v$ to be a cost associated with the $v$th cycle defined by the function $C$, then Theorem 6 asserts that the expected average cost per unit time is equal to the ratio of the expected cost of a cycle to the expected length of a cycle. See Theorem 5 of Appendix A for an application in the Markov chain context. We can make the same assertion for the average cost per unit time, the limit existing and taking on the same value with probability one.
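A small Monte Carlo check of Theorem 6: the simulated time-average cost of a recurrent event process approaches the ratio of the expected cycle cost to the expected cycle length. The cycle-length and cycle-cost distributions below are arbitrary illustrative choices (for which the ratio equals 1.5/3 = 0.5), and Python is an assumed choice of language.

```python
# Renewal-reward simulation with illustrative cycle distributions.
import random
random.seed(0)

def one_cycle():
    length = random.randint(1, 5)                         # tau_v - tau_{v-1}, uniform on 1..5
    cost = sum(random.random() for _ in range(length))    # C_v depends on the cycle
    return length, cost

T, elapsed, total_cost = 10**6, 0, 0.0
while elapsed < T:
    length, cost = one_cycle()
    elapsed += length
    total_cost += cost

print("simulated time-average cost:", round(total_cost / elapsed, 4))
print("E[cycle cost]/E[cycle length]:", 1.5 / 3)
```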
Bibliographical Notes
Theorem 1 summarizes several well-known Abelian theorems. See, for example, Widder [55]. Theorem 2 can be found in any text on real analysis. Theorem 3 is a special case of Tychonov's theorem. We only need $\prod S_i$ to be the product space of a denumerable number of spaces $S_i$; Tychonov's theorem holds for any product space of compact spaces. The theorem asserts that $\prod S_i$ is compact in the product topology, equivalent to the topology we implicitly defined. Theorem 4(a) is a consequence of the Lebesgue monotone convergence theorem, and (b) is a consequence of the theorem of Fubini. Theorem 5 is a strong law of large numbers for dependent random variables. (See Loeve [44], p. 387.) See Feller [31] for a treatment of recurrent event processes.
Appendix C. Convex Sets and Linear Programming
Let $E$ denote an $n$-dimensional Euclidean space. A set $S \subset E$ is said to be convex if whenever $x \in S$, $y \in S$, then $\alpha x + (1 - \alpha) y \in S$ for every real number $\alpha$, $0 \le \alpha \le 1$. A fact we use about convex sets is:

THEOREM 1. If $S$ is a closed convex set and $x = (x_1, \ldots, x_n) \notin S$, then there exists a set of real numbers $\{w_1, \ldots, w_n\}$ such that for every $y = (y_1, \ldots, y_n) \in S$,
$$\sum_{i=1}^{n} w_i y_i < \sum_{i=1}^{n} w_i x_i.$$
The convex hull of a set $T \subset E$ is the smallest convex set $S$ such that $S \supset T$. The closed convex hull of $T$ is the smallest closed convex set containing $T$; namely, $\bar{S}$, the closure of $S$. An extreme point of a convex set $S$ is any point $x \in S$ such that there do not exist points $y \in S$, $z \in S$ distinct from $x$ for which $x = \alpha y + (1 - \alpha) z$ for some $\alpha$, $0 < \alpha < 1$.

THEOREM 2. Let $P$ denote the set of extreme points of a closed convex set $S$. Then every point $x \in S$ can be expressed in the form
$$x = \sum_{i=1}^{k} \alpha_i z_i, \qquad \text{where } z_i \in P,\ i = 1, \ldots, k,\ 0 \le \alpha_i \le 1, \text{ and } \sum_{i=1}^{k} \alpha_i = 1.$$
A function $f(\cdot)$ over a convex set $S$ is concave (convex) if for every $x_1 \in S$, $x_2 \in S$, and $\alpha$, $0 \le \alpha \le 1$,
$$f(\alpha x_1 + (1 - \alpha) x_2) \ge (\le)\ \alpha f(x_1) + (1 - \alpha) f(x_2).$$

THEOREM 2a. If $f(\cdot)$ is concave (convex) over a closed convex set $S$ and if it achieves its minimum (maximum), then $f(\cdot)$ is minimized (maximized) at an extreme point of $S$.
A linear programming problem is a mathematical optimization problem that can be formulated in one of several standard forms. A problem given in one form can always be translated into any one of the other standard forms. The following is one form: To find variables $x_1, \ldots, x_n$ to minimize
$$\sum_{j=1}^{n} c_j x_j$$
subject to the constraints
$$\sum_{j=1}^{n} a_{ij} x_j = b_i, \qquad i = 1, \ldots, m, \tag{1}$$
and
$$x_j \ge 0, \qquad j = 1, \ldots, n, \tag{2}$$
where $\{c_j,\ j = 1, \ldots, n\}$, $\{a_{ij},\ i = 1, \ldots, m;\ j = 1, \ldots, n\}$, and $\{b_i,\ i = 1, \ldots, m\}$ are given real-valued constants. A set of values that satisfies (1) and (2) is called a feasible solution to the linear programming problem. The set of all feasible solutions is a closed convex set possessing a finite number of extreme points. The linear expression $\sum_{j=1}^{n} c_j x_j$ is called the objective function. A feasible solution that minimizes the objective function is called an optimal solution. Depending on circumstances, there may exist one, many, or no optimal solutions. However, if at least one optimal solution exists, then there exists an extreme point in the convex set of feasible solutions which is an optimal solution. We can assert:

THEOREM 3. An extreme point of the set of feasible solutions of the linear programming problem stated in the given form has at most $m$ positive components.

Since there is an extreme point which is optimal whenever an optimal solution exists, computational methods may restrict themselves to searching the extreme points for an optimal solution. The simplex method is a general method for solving linear programming problems which yields an extreme point optimal solution. A linear programming problem must be translated to the above form in order to use the simplex method. A problem related to the one stated above is: To find variables $v_1, \ldots, v_m$ to maximize
$$\sum_{i=1}^{m} b_i v_i$$
subject to the constraints
$$\sum_{i=1}^{m} a_{ij} v_i \le c_j, \qquad j = 1, \ldots, n,$$
where $\{c_j,\ j = 1, \ldots, n\}$, $\{b_i,\ i = 1, \ldots, m\}$, and $\{a_{ij},\ i = 1, \ldots, m;\ j = 1, \ldots, n\}$ are as before. This second problem is a linear programming problem in a different form. This problem is called the dual problem to the first problem; the first is referred to as the primal problem. Every linear programming problem, regardless of the form in which it is stated, has a well-defined dual problem, its form depending on the form of the primal. In particular, the dual of the dual problem is always the primal problem; thus, it makes no difference which problem is originally referred to as the primal. The central relationship between primal and dual problems can be stated as:
c
*
n
cjxj =
j = 1
c"
*
sioi.
i= 1
We also make use of the following:

THEOREM 5. Let $u_j = c_j - \sum_{i=1}^{m} a_{ij} v_i$, $j = 1, \ldots, n$. A necessary and sufficient condition that feasible solutions $x_1, \ldots, x_n$ and $v_1, \ldots, v_m$ both be optimal for their respective problems is that
$$\sum_{j=1}^{n} x_j u_j = 0.$$
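To illustrate Theorem 4, the following sketch solves a small primal problem in the stated form and the corresponding dual with scipy.optimize.linprog (the data are arbitrary, and the use of Python and scipy is a choice made here); the two optimal objective values coincide.

```python
# Primal and dual of a small equality-constrained linear program.
import numpy as np
from scipy.optimize import linprog

c = np.array([2.0, 3.0, 1.0])
A = np.array([[1.0, 1.0, 1.0],
              [2.0, 0.5, 1.0]])
b = np.array([4.0, 3.0])

# Primal: minimize c.x subject to A x = b, x >= 0
primal = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * 3)

# Dual: maximize b.v subject to A^T v <= c, v unrestricted; linprog minimizes, so negate b
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=[(None, None)] * 2)

print("primal optimum:", round(primal.fun, 6))
print("dual optimum  :", round(-dual.fun, 6))
```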
Bibliographical Notes
Our sources for information on convex sets and linear programming include Karlin [36] and Dantzig [12].
References

1. Balinski, M. L., On Solving Discrete Stochastic Decision Problems. U.S. Navy Supply System Research Study 2. Mathematica, Princeton, New Jersey, 1961.
2. Bellman, R., A Markovian decision process, J. Math. Mech. 6, 679-684 (1957).
3. Bellman, R., "Dynamic Programming." Princeton Univ. Press, Princeton, New Jersey, 1957.
4. Bellman, R. and LaSalle, J. P., "On Non-Zero Sum Games and Stochastic Processes." Rand McNally, Chicago, Illinois, 1949.
5. Bellman, R. and Blackwell, D., "On a Particular Non-Zero Sum Game." Rand McNally, Chicago, Illinois, September, 1949.
6. Blackwell, D., Discrete dynamic programming, Ann. Math. Statist. 33, 719-726 (1962).
7. Blackwell, D., Discounted dynamic programming, Ann. Math. Statist. 36, 226-235 (1965).
8. Breiman, L., Stopping rule problems, in "Applied Combinatorial Mathematics," Chapter 10. Wiley, New York, 1964.
9. Charnes, A. and Cooper, W. W., Programming with fractional functionals: I, Linear fractional programming, Naval Res. Logist. Quart. 9, Nos. 3 & 4, 181-186 (1962).
10. Chow, Y. S. and Robbins, H., A martingale system theorem and applications, in Proc. Fourth Berkeley Symp. Math. Statist. Prob., pp. 93-104. Univ. of California Press, Berkeley, California, 1961.
11. Chung, Kai Lai, "Markov Chains with Stationary Transition Probabilities." Springer, Berlin, 1960.
12. Dantzig, G., "Linear Programming and Extensions." Princeton Univ. Press, Princeton, New Jersey, 1963.
13. Denardo, E. V. and Fox, B. L., Multichain Markov renewal programs, SIAM J. Appl. Math. 16, 468-487 (1968).
14. D'Epenoux, F., Sur un probleme de production et de stockage dans l'aleatoire, Rev. Francaise Informat. Recherche Operationnelle 14, 3-16 (1960). [English transl.: Mgt. Sci. 10, 98-108 (1963).]
15. Derman, C., On sequential decisions and Markov chains, Mgt. Sci. 9, 16-24 (1962).
16. Derman, C., Stable sequential control rules and Markov chains, J. Math. Anal. Appl. 6, 257-265 (1963).
17. Derman, C., Markovian sequential control processes - denumerable state space, J. Math. Anal. Appl. 10, 295-302 (1965).
18. Derman, C., On sequential control processes, Ann. Math. Statist. 35, 341-349 (1964).
19. Derman, C., Denumerable state Markovian decision processes - average cost criterion, Ann. Math. Statist. 37, 1545-1554 (1966).
20. Derman, C., On optimal replacement rules when changes of state are Markovian, in "Mathematical Optimization Techniques" (R. Bellman, ed.), Chapter 9, pp. 201-210. Univ. of California Press, Berkeley, California, 1963.
21. Derman, C., Optimal replacement under Markovian deterioration with probability bounds on failure, Mgt. Sci. 9, 478-481 (1963).
22. Derman, C. and Klein, M., Some remarks on finite horizon Markovian decision models, Operations Res. 13, 272-278 (1965).
23. Derman, C. and Klein, M., Surveillance of multi-component systems: A stochastic travelling salesman problem, Naval Res. Logist. Quart. 13, 103-111 (1966).
24. Derman, C. and Sacks, J., Replacement of periodically inspected equipment (An optimal stopping rule), Naval Res. Logist. Quart. 7, 597-607 (1960).
25. Derman, C. and Strauch, R., A note on memoryless rules for controlling sequential control processes, Ann. Math. Statist. 37, 276-278 (1966).
26. Derman, C. and Veinott, Jr., A. F., A solution to a countable system of equations arising in Markovian decision processes, Ann. Math. Statist. 38, 582-584 (1967).
27. Dodge, H. F., A sampling inspection plan for continuous production, Ann. Math. Statist. 14, 264-279 (1943).
28. Doob, J. L., "Stochastic Processes." Wiley, New York, 1953.
29. Dynkin, E. B., The optimum choice of instant for stopping a Markov process, Soviet Math. Dokl. [English transl.] 4, 627-629 (1963).
30. Eaton, J. H. and Zadeh, L. A., Optimal pursuit strategies in discrete state probabilistic systems, Trans. ASME Ser. D, J. Basic Engineering 84, 23-29 (1962).
31. Feller, W., "An Introduction to Probability Theory and Its Applications," 3rd ed. Wiley, New York, 1968.
32. Fisher, L. and Ross, S., An example in denumerable decision processes, Ann. Math. Statist. 39, 674-676 (1968).
33. Gillette, D., Stochastic games with zero stop probabilities, Ann. Math. Studies 3, 179-186 (1957).
34. Howard, R. A., "Dynamic Programming and Markov Processes." Technology Press, Cambridge, Massachusetts, and Wiley, New York, 1960.
35. Karlin, S., Structure of dynamic programming, Naval Res. Logist. Quart. 2, 285-294 (1955).
36. Karlin, S., "Mathematical Methods and Theory in Games, Programming and Economics," Vol. 1. Addison-Wesley, Reading, Massachusetts, 1959.
37. Kemeny, J. G. and Snell, J. L., "Finite Markov Chains." Van Nostrand, Princeton, New Jersey, 1960.
38. Klein, M., Inspection-maintenance-replacement schedules under Markovian deterioration, Mgt. Sci. 9, 25-32 (1962).
39. Klein, M., Note on sequential search, Naval Res. Logist. Quart. 15, 469-475 (1968).
40. Kolesar, P., Randomized replacement rules which maximize the expected cycle length of equipment subject to Markovian deterioration, Mgt. Sci. 867-876 (1967).
41. Kolesar, P., Minimum cost replacement under Markovian deterioration, Mgt. Sci. 12, 694-766 (1966).
42. Lieberman, G. J., A note on Dodge's continuous inspection plans, Ann. Math. Statist. 24, 480-484 (1953).
43. Liggett, T. M. and Lippman, S. A., Stochastic Games with Perfect Information. Working Paper 142, Western Management Science Institute, October (1968).
44. Loeve, M., "Probability Theory." Van Nostrand, Princeton, New Jersey, 1960.
45. Maitra, A., Dynamic programming for countable state systems, Sankhya Ser. A 27, Parts 2, 3, & 4, 241-248 (1965).
46. Manne, A. S., Linear programming and sequential decisions, Mgt. Sci. 6, 259-267 (1960).
47. Miller, B. L. and Veinott, A. F., Discrete dynamic programming with small interest rate, Ann. Math. Statist. 40, 366-370 (1969).
48. Shapley, L. S., Stochastic games, Proc. Nat. Acad. Sci. U.S. 39 (1953).
49. Snell, J. L., Applications of martingale system theorems, Trans. Amer. Math. Soc. 73, 293-312 (1952).
50. Strauch, R. and Veinott, A. F., "A Property of Sequential Control Processes." Rand McNally, Chicago, Illinois, 1966.
51. Taylor, H., Optimal stopping in a Markov process, Ann. Math. Statist. 39, 1333-1344 (1968).
52. Taylor, H., Optimal Stopping of Averaged Brownian Motion. Tech. Rept., Dept. of Operations Res., Cornell Univ., Ithaca, New York, November (1967).
53. Veinott, A. F., Jr., Discrete dynamic programming with sensitive discount optimality criteria, Ann. Math. Statist. 40, 1635-1660 (1969).
54. White, L. S., Markovian decision models for the evaluation of a large class of continuous inspection plans, Ann. Math. Statist. 36, 1408-1420 (1965).
55. Widder, D., "Laplace Transform." Princeton Univ. Press, Princeton, New Jersey, 1946.
Index

A
Abelian theorems, 147
Actions, number of, set of, sequence of, 3
Average cost criterion, 6
  problem, 6, 20, 25

B
Backward induction, 11
Balinski, M. L., 84, 153
Bellman, R., 7, 8, 17, 50, 84, 153
Blackwell, D., 7, 8, 17, 33, 50, 83, 153, 154
Breiman, L., 116, 117, 119, 154

C
Chapman-Kolmogorov equations, 140
Charnes, A., 138, 154
Chow, Y. S., 116, 117, 154
Chung, K. L., 142, 154
Communicating states, 140
Compact metric spaces, 144
Compactness of policy space, 20
Concave functions, 150
Continuous sampling plans, 130
  AOQ of, 130
  AOQL of, 131
  Dodge type, 130
Control limit, 122
Control limit policy, 122
Convergence of policies, 20
Convex functions, 150
Convex hull, closed convex hull, 150
Convex sets, 149
  extreme points of, 150
Cooper, W. W., 138, 154
Cost structure, 4

D
Dantzig, G., 152, 154
Denardo, E. V., 84, 154
D'Epenoux, F., 8, 50, 154
Derman, C., 32, 33, 62, 83, 102, 116, 117, 137, 138, 154
Deterministic (time invariant) policies, 7
Discounted cost criterion, 6
  problem, 6
Dodge, H. F., 138, 155
Doob, J. L., 116, 155
Dynamic programming, 11
  functional equations of, 14
Dynkin, E. B., 116, 118, 155

E
Eaton, J. H., 62, 155
Entrance fees, 117
Ergodic class, 140

F
Feller, W., 142, 147, 155
Fisher, L., 33, 155
Fox, B. L., 84, 154

G
Gillette, D., 33, 155

H
History, 3
Horizon, 5
  finite horizon problem, 11
Howard, R. A., 8, 50, 83, 155

I
Inventory system under periodic review, 4
Irreducible state space, 78, 140

K
Karlin, S., 32, 152, 155
Kemeny, J. G., 142, 155
Klein, M., 62, 138, 154, 155
Kolesar, P., 137, 155

L
LaSalle, J. P., 8
Laws of motion, 1, 3
Lieberman, G. J., 138, 155
Liggett, T. M., 33, 155
Linear programming formulation of
  average cost problem, 73
  discounted cost problem, 41
  first-passage problem, 57
  stopping problem, 109, 113
Linear programming problem, 150
  dual, 151
  feasible solution of, 150
  objective function of, 150
  optimal solution of, 150
  primal, 151
  simplex method solution of, 150
Lippman, S., 33, 155
Loeve, M., 147, 155

M
Maitra, A., 49, 155
Manne, A. S., 8, 84, 155
Markov chains with stationary transition probabilities, 139
Markovian decision model, 2
Markovian decision process, 1, 4
Mean first-passage time, 142
Metric, 144
Metric space, 144
Miller, B. L., 84, 155

O
Optimal first-passage problem, 5
Optimality principle, 15

P
Policy, 3
  convergence of policies, 20
  deterministic, 7
  memoryless (Markovian), 6
  renewal, 90
  time invariant (Markovian), 7
Policy improvement iteration for
  average cost problem, 71
  discounted cost problem, 40
  first-passage problem, 56
Policy improvement procedure for
  average cost problem, 71
  discounted cost problem, 41
  first-passage problem, 56
Pursuit problem, 62

R
Randomization, 3
Recurrent event process, 90, 146
Recurrent states, 141
Replacement model, 121
Robbins, H., 116, 117, 154
Ross, S., 33, 155
Rule, see Policy

S
Sacks, J. S., 116, 117, 154
Sequential search problem, 132
Shapley, L. S., 8, 49, 155
Snell, J. L., 116, 142, 155, 156
State-action frequencies, 98
  expected, 89
States, 2
  sequence of, 2
Stochastic games, 8
Stochastic traveling salesman problem, 135
Stopped process, 104
Stopping time, 104
Strauch, R., 102, 154, 156
Successive approximations
  for discounted cost problem, 36
  for first-passage problem, 54
  for stopping problem, 109
Super-regular function, 106
  smallest, dominating a function, 108
Surveillance-maintenance-replacement model, 127

T
Target state, 5
Taylor, H., 116, 156
Transient states, 141
Transition probabilities, 139
  t-step, 140
Tychonov's theorem, 20, 32, 147

V
Veinott, A. F., Jr., 33, 62, 83, 84, 102, 154, 156

W
White, L. S., 138, 156
Widder, D., 147, 156

Z
Zadeh, L. A., 62, 155