Optimal Control
WILEY-INTERSCIENCE SERIES IN SYSTEMS AND OPTIMIZATION
Advisory Editors Sheldon Ross Department of Industrial Engineering and Operations Research, University of California, Berkeley, CA 94720, USA
Richard Weber Cambridge University, Engineering Department, Management Studies Group, Mill Lane, Cambridge CB2 1RX, UK
GITTINS Multi-armed Bandit Allocation Indices
KALL/WALLACE Stochastic Programming
KAMP/HASLER Recursive Neural Networks for Associative Memory
KIBZUN/KAN Stochastic Programming Problems with Probability and Quantile Functions
VAN DIJK Queueing Networks and Product Forms: A Systems Approach
WHITTLE Optimal Control: Basics and Beyond
WHITTLE Risk-sensitive Optimal Control
Optimal Control Basics and Beyond Peter Whittle Statistical Laboratory, University of Cambridge, UK
JOHN WILEY & SONS
Copyright © 1996 by John Wiley & Sons Ltd, Baffins Lane, Chichester, West Sussex PO19 1UD, England. National 01243 nm1 International (+44) 1243 TI97TI. All rights reserved. No part of this book may be reproduced by any means, or transmitted, or translated into a machine language without the written permission of the publisher. Cover photograph by courtesy of The News, Portsmouth
Other Wiley Editorial Offices
John Wiley & Sons, Inc., 605 Third Avenue, New York, NY 10158-0012, USA
Jacaranda Wiley Ltd, 33 Park Road, Milton, Queensland 4064, Australia
John Wiley & Sons (Canada) Ltd, 22 Worcester Road, Rexdale, Ontario M9W 1L1, Canada
John Wiley & Sons (SEA) Pte Ltd, Yl Jalan Pemimpin #05-04, Block B, Union Industrial Building, Singapore 2057
Library of Congress Cataloging-in-Publication Data
Whittle, Peter.
Optimal control: basics and beyond / Peter Whittle.
p. cm. (Wiley-Interscience series in systems and optimization)
Includes bibliographical references and index.
ISBN 0 471 95679 1 (hc : alk. paper). ISBN 0 471 96099 3 (pb : alk. paper)
1. Automatic control. 2. Control theory. I. Title. II. Series.
TJ213.W442 1996 629.8 dc20 95-22113 CIP
British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library
ISBN 0 471 95679 1; 0 471 96099 3 (pbk)
Typeset in 10/12pt Times by Pure Tech India Limited, Pondicherry. Printed and bound in Great Britain by Biddles Ltd, Guildford and King's Lynn. This book is printed on acid-free paper responsibly manufactured from sustainable forestation, for which at least two trees are planted for each one used for paper production.
Contents

Preface vii

1 First ideas 1

BASICS

Part 1 Deterministic Models 9

2 Deterministic models and their optimisation 11
3 A sketch of infinite-horizon behaviour; policy improvement 47
4 The classic formulation of the control problem; operators and filters 63
5 State-structured deterministic models 91
6 Stationary rules and direct optimisation for the LQ model 111
7 The Pontryagin maximum principle 131

Part 2 Stochastic Models 167

8 Stochastic dynamic programming 169
9 Stochastic dynamics in continuous time 179
10 Some stochastic examples 189
11 Policy improvement: stochastic versions and examples 215
12 The LQG model with imperfect observation 229
13 Stationary processes; spectral theory 255
14 Optimal allocation; the multi-armed bandit 269
15 Imperfect state observation 285

BEYOND

Part 3 Risk-sensitive and H∞ Criteria 293

16 Risk-sensitivity: the LEQG model 295
17 The H∞ formulation 321

Part 4 Time-integral Methods and Optimal Stationary Policies 331

18 The time-integral formalism 335
19 Optimal stationary LQG policies: perfect observation 349
20 Optimal stationary LQG policies: imperfect observation 357
21 The risk-sensitive (LEQG) version 371

Part 5 Near-determinism and Large Deviation Theory 379

22 The essentials of large deviation theory 383
23 Control optimisation in the large deviation limit 405
24 Controlled first passage 415
25 Imperfect observation; non-linear filtering 431

Appendices 443

A1 Notation and conventions 443
A2 The structural basis of temporal optimisation 449
A3 Moment generating functions; basic properties 455

References 457

Index 461
Preface

Anyone who writes on the subject of control without having faced the responsibility of practical implementation should be conscious of his presumption, and the strength of this sense should be at least doubled if he writes on optimal control. Beautiful theories commonly wither when put to the test, usually because factors are present which simply had not been envisaged. This is the reason why the design of practical control systems still has aspects of an art, for all the science on which it now calls. Nevertheless, even an art requires guidelines, and it can be claimed that the proper function of a quest for optimality is just the revelation of fundamental guidelines. The notion of achieving optimality in systems of the degree of complexity encountered in practice is a delusion, but the attempt to optimise idealised systems does generate the fundamental concepts needed for the enlightened treatment of less ideal cases. This observation then has a corollary: the theory must be natural and incisive enough that it does generate recognisable concepts; a theory which ends in an opaque jumble of formulae has served no purpose.

'Control theory' is now understood not merely in the narrow sense of the control of mechanisms but in the wider sense of the control of any dynamic system (e.g. communication, distribution, production, financial, economic), in general stochastic and imperfectly observed. The text takes this wider view and so covers general techniques of optimisation (e.g. dynamic programming and the maximum principle) as well as topics more classically associated with narrow-sense control theory (e.g. stability, feedback, controllability). There is now a great deal of standard material in this area, and it is to this that the 'basics' component of the book provides an introduction. However, while the material may be standard, the treatment of the section is shaped considerably by consciousness of the 'beyond' component into which it leads.
There are two pieces of standard theory which impress one as complete: one is the Pontryagin maximum principle for the optimisation of deterministic processes; the other is the optimisation of LQG models (a class of stochastic models with Linear dynamics, Quadratic costs and Gaussian noise). These have appeared like two islands in a sea of problems for which little more than an ad hoc treatment was available. However, in recent years the sea-bed has begun to rise: depths have become shallows, and shallows have become bridging dry land. The class of risk-sensitive models, LEQG models, was introduced, and it was
found that the LQG theory could be extended to these, although the mode of extension was sufficiently unevident that its perception added considerable insight. At about the same time it was found that optimisation on the H∞ criterion was both feasible, in that analytic advance was possible, and useful, in that it gave a robust criterion. Unexpectedly and beautifully, these two lines of work coalesced when it was realised that the H∞ criterion was a special case of the LEQG criterion, for all that one was phrased deterministically and the other stochastically. Finally, it was realised that, if large-deviation theory is applicable (as it is when a stochastic model is close to determinism in a certain sense), then all the exact results of the LQG theory have a version which holds in considerable generality. These successive insights revealed a structure in which concepts which had been familiar in special contexts for decades (e.g. time-integral solutions, Hamiltonian structure, certainty equivalence, solution by canonical factorisation) were seen to be closely related and to supply exactly the right view of a very general class of stochastic models. The 'beyond' component is devoted to exposition of this material, and it was the fact that such a connected treatment now seems possible which motivated the writing of this text. Another motivation was the desire to write a successor to my earlier work Optimisation over Time (Wiley 1982, 1983). However, it is not squarely a successor. I wanted to write something much more homogeneous and tightly focused, and the restriction to the control theme provided that tightness. Remarkably, the recent advances mentioned above also induced a tightening, rather than the loosening one might have expected. For example, it turns out that the discounted cost criterion so beloved of exponents of dynamic programming is logically inconsistent outside a rather narrow context (see Section 16.12).
In control contexts it is natural to work with either total or time-averaged cost (in terminating or non-terminating situations respectively). The algorithm which emerges as natural is the iterative one of policy improvement. This has intrinsically a clear variational basis; it can also be seen as a Newton-Raphson algorithm (Section 3.5) whose second-order convergence is often rapid enough that a single iteration is enlightening (see Section 3.7 and the examples of Chapter 11); it implies similarly effective algorithms in derived work, e.g. for the canonical factorisations of Chapters 18-21. One very important topic to which we give little space is that of dual control. By this is meant the use of control actions to evoke information as well as to govern the dynamics of the system, with its associated concepts of adaptive control, self-tuning regulators, etc. Chapter 14 on the multi-armed bandit constitutes almost the only substantial discussion. Despite the fact that the idea of dual control emerges spontaneously in any effort to optimise the running of a stochastic dynamic system, the topic is too demanding and idiosyncratic for one to treat it in passing. Indeed, one may say that the treatment of this book pushes a certain line about as far as it can be taken, and that this line necessarily skirts
dual control. In all our formulations of the LQG model, the LEQG model, large-deviation versions and even minimax control we find that there is a certainty equivalence principle. The principle indeed generally takes a more sophisticated form than that familiar from the simple LQG case, but any such principle must by its nature exclude dual control: the notion that control actions affect information gained. Another topic from which we refrain, despite the attention it has received in recent years, is the use of J-factorisation techniques and the like to determine all stabilising controls satisfying some lower bound on performance. This topic is important because of the increased emphasis given to robustness: the realisation that a control which is optimal for a specified model is of little use if its performance deteriorates rapidly with departure from that specification. However, we take reassurance from one conclusion which this body of work establishes: that if a control rule is optimised under the assumption that there is observation error then it is also proofed to some extent against errors in model specification (see Section 17.3). The factorisation techniques which we employ are those associated with the formulation of optimal control as the extremisation of a suitably defined time-integral (even in the stochastic case). This is a class of ideas completely distinct from that of J-factorisation, and with its own particular elegance. My references to the literature are not systematic, but I have certainly given credit for all recent work for which I knew an attribution. However, there are many sections in which I have worked out my own treatment, very possibly in ignorance of existing work. Let me apologise in advance to authors thus unwittingly overlooked, and affirm my readiness to correct the record at the first opportunity.
A substantial proportion of this work was completed before my retirement in 1994 from the Churchill Chair, endowed by the Esso Petroleum Company. I am profoundly indebted to the Company for its support over my 27-year occupancy of the Chair.
CHAPTER 1
First Ideas

1 CONTROL AS AN OPTIMISATION PROBLEM

One tends to think of 'control' as meaning the control of mechanisms: e.g. the classic stabilisation of the speed of a steam engine by the centrifugal governor, the stabilisation of temperature in a central heating system, or the many automatic controls built into a modern aircraft. However, the controls built into an aircraft are modest compared with those which Nature has built into any higher organism: a biological rather than a mechanical system. This can be taken as an indication that any system operating in time, be it mechanical, electrical, biological, economic or industrial, will need continuous monitoring and correction if it is to keep on course. In other words, it needs control. The efficient running of the dynamic system constituted by an economy or a factory poses a control problem just as much as does the operation of an aircraft. The fact that control actions may be realised by procedures or by conscious decisions rather than by mechanisms is a matter of implementation rather than of principle. (Although it is also true that it is the higher-level decisions, laying out the general course one wishes the system to follow, which will be taken consciously, and it is the lower-level decisions which will be automated. The more complex the system, the more need there will be for an automated low-level decision structure which ensures that the system actually follows the course decided by higher-level policy.) In traditional control theory the problem is regarded very much as one of stability: that departures from the desired course should certainly be corrected ultimately, and should preferably be corrected quickly, smoothly and effortlessly. Since the mid-century increasing attention has been given to more specific design criteria: control rules are chosen so as to minimise a cost function which appropriately penalises both deviation from course and excessive control action.
That is, the design problem is formulated as an optimisation problem. This has virtues, in that it leads to a sharpening of concepts; indeed, to the generation of concepts. It has faults, in that the model behind the optimisation may be so idealised that it leads to a non-robust solution: a solution which is likely to prove unacceptable if the actual system deviates at all from that supposed. However, as is usual when 'theory' is criticised, this objection is not a criticism of theory as such, but criticism of a naive theory. One may say, indeed, that optimisation exposes the weaknesses in thinking which are usually compensated for by soundness of intuition. By this is meant that, if one makes certain assumptions,
then an attempt at optimisation will go to the limit in some direction consistent with a literal interpretation of these assumptions. It is not a bad idea, then, to see how an ill-posed attempt at optimisation can reveal the pitfalls and point the way to their remedy.

2 AN EXAMPLE: THE HARVESTING OF A RENEWABLE RESOURCE

A good example of the harvesting of a renewable resource would be the operation of a fishery. Consider the simplest case, in which the description of current fish stocks is condensed to a single variable, x, the biomass. That is, we neglect the classification by species, age, size and location which a more adequate model would obviously require. We also neglect the effect of the seasons (although see Exercise 1) and suppose simply that, in the absence of fishing, biomass follows a differential equation

ẋ = a(x)   (1)

where ẋ is the rate of change of x with time, dx/dt. The function a(x) represents the rate of change of biomass, a net reproduction rate, and in practice has very much the course illustrated in Figure 1. It is initially positive and increasing with x, but then dips and becomes negative for large x, as the demands which a large biomass levies on environmental resources make themselves felt. Two significant stock levels are x0 and xm, distinguished in Figure 1. The stock level x0 is the equilibrium level for the unharvested population, that at which the net reproduction rate is zero. The stock level xm is that at which the net reproduction rate is greatest. If stocks are depleted at a rate u by fishing then the equation becomes

ẋ = a(x) - u.   (2)
Figure 1 The postulated form of the net reproduction rate for a population. This rate is maximal at xm and it is zero at x0, which would consequently be the equilibrium level of the unharvested population.
Figure 2 The values x1 and x2 are the possible equilibrium levels of population if harvesting is carried out at a fixed rate u for x > 0. These are respectively unstable and stable, as is seen from the indicated direction of movement of ẋ.
Note that u is the actual catch rate, rather than, for example, fishing effort. Presumably a given effort yields less in the way of catch as x decreases until, when x becomes zero, one could catch at no faster rate than the rate a(0) at which the population is being replenished from external sources (which may be zero). Suppose, nevertheless, that one prescribes a fishing policy by announcing how one will determine u. If one chooses u varying with x then one is showing some responsiveness to the current state; in control terminology one is incorporating feedback. However, let us consider the most naive policy (which is not to say that it has not been used): that which sets u at a definite fixed value for x > 0. An equilibrium value of x under this policy must satisfy a(x) = u, and we see from the graph of Figure 2 that this equation has in general two solutions, x1 and x2, say. Recall that the domain of attraction of an equilibrium point is the set of initial values x for which the trajectory would lead to that equilibrium. Further, that the equilibrium is stable (in a local sense) only if all points in some neighbourhood of it lie in its domain of attraction. Examining the sign of ẋ = a(x) - u, we see that the lesser value x1 has only itself as domain of attraction, and so is unstable. The greater value x2 has x > x1 as domain of attraction, and so is stable. One might pose as a natural aspiration: to choose the value of u which is largest consistent with existence of a stable equilibrium solution, and this would seem to be

u = um := a(xm).

That is, the maximal value of u for which a(x) = u has a solution, and so for which the equilibrium operating point is such that the biomass replaces itself at the maximal rate.
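These stability claims are easy to check numerically. The sketch below assumes, purely for illustration, a logistic reproduction rate a(x) = x(1 - x) (so that x0 = 1, xm = 1/2 and um = a(xm) = 1/4) and a crude Euler integration; none of these particular choices comes from the text.

```python
# Numerical check of the equilibrium analysis of the fixed-rate policy.
# The logistic form a(x) = x*(1 - x) and the Euler step are illustrative
# assumptions, not taken from the text.

def a(x):
    return x * (1.0 - x)

def final_stock(u, x0, T=400.0, dt=0.01):
    """Euler-integrate x' = a(x) - u for x > 0, clipping biomass at zero."""
    x = x0
    for _ in range(int(T / dt)):
        x = max(0.0, x + dt * (a(x) - (u if x > 0 else 0.0)))
    return x

um = 0.25
# For u = 0.2 < um the two equilibria are x1 = 0.2764..., x2 = 0.7236...
print(final_stock(0.2, x0=0.5))    # attracted to the stable equilibrium x2
print(final_stock(0.2, x0=0.27))   # just below x1: the stock crashes to zero
print(final_stock(um, x0=0.49))    # at u = um, any depletion below xm crashes
```

The third run illustrates the semi-stability discussed below: at u = um even a slight initial shortfall from xm sends the biomass to zero.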
Figure 3 If the fixed harvesting rate is taken as high as um, then the equilibrium at xm is only semi-stable.
However, this argument is fallacious, and its adoption is said to be the reason why the Peruvian anchovy fishery crashed between 1970 and 1973 from an annual catch of 12.3 million tons to one of 1.8 million tons (Clark, 1976). As u increases to um then x1 and x2 converge to the common value xm. But xm has domain of attraction x ≥ xm, and so is only semi-stable (Figure 3). If the biomass drops at all from the value xm then it crashes to zero. In Exercise 10.4.1 we consider a stochastic model of the situation which makes the same point in another way. We shall see in the next chapter that the policy which indeed maximises the steady-state harvest rate is that which one might expect: to fish at the maximal feasible rate (presumably greater than um) for x > xm and not to fish at all for x < xm. This makes the stock level xm a stable point of the controlled system, at which one achieves an effective harvest rate of a(xm). At least, this is the optimal policy for this simple model; the model can be criticised on many grounds.

Exercises and comments
(1) One can to some extent consider seasonal effects by considering a discrete-time model x_{t+1} = a(x_t) - u_t in which time t moves forwards in unit steps (corresponding to the annual cycle) rather than continuously. In this case the function a has the form of Figure 4 rather than of Figure 1. The same arguments can be applied as in the continuous-time case, although it is worth noting that it was this model (with u = 0) which provided the first and simplest demonstration of chaotic effects.

(2) Suppose that the constant value presumed for u when x > 0 exceeds a(0), with u = 0 for x = 0. Then x = 0 is effectively a stable equilibrium point, with an
Figure 4 The form of the year-to-year reproduction rate.
effective harvest rate u = a(0). This is because one harvests at the constant rate the moment x becomes positive, and drives the biomass back to zero again. One has then a 'chattering' equilibrium, at which the alternation of zero and infinitesimally positive values of x (and of zero and positive values of u) is infinitely rapid. The effective harvest rate must balance the immigration rate, a(0). At this level, a fish is caught the moment it appears from somewhere. Under the policy indicated at the end of the section the equilibrium at xm is also a 'chattering' one. Practical considerations would of course smooth out both operation and solution around this transition point.

3 DYNAMIC OPTIMISATION TECHNIQUES

The crudity of the control rule of the previous section lay, of course, in the assumption of a constant harvest rate. The harvest rate must be adapted to current conditions, and in such a way as to ensure that, at the very least, a depleted population can recover. With improved dynamics it may well be possible to retain the point of maximal productivity xm as the equilibrium operating point. However, one certainly needs a basis for the deduction of good dynamic rules. There are a number of approaches, all ultimately related. The first is the classical design approach, with its primary concern to secure stability at the desired operating point and, after that, other desirable dynamic characteristics. This shares at least one set of techniques with later approaches: the techniques needed to handle dynamic systems (see Chapters 4 and 5). One optimisation approach is that of laying direct variational conditions on the path of the process; of requiring that there should be no variation of the path, consistent with the prescribed dynamics, which would yield a smaller cost. The optimisation problem is then cast as a problem in the calculus of variations. However, this classic calculus needs modification if the control problem is to be
accommodated naturally, and the form in which it is effective is that of the Pontryagin maximum principle (Chapter 7). This is a valuable technique, but one which would seem to be applicable only in the deterministic case. However, it has a natural version for at least certain classes of stochastic models; see Chapters 16, 18-21, 23 and 25. Another approach is the recursive one, in which one optimises the control action at a given time on the assumption that the optimal rule for later times has already been determined. This leads to the dynamic programming technique, a technique which is central and which has the merit of being immediately applicable also in the stochastic case (see Chapter 8). It is this approach which in a sense provides the spine of our treatment, although we shall see that all other methods are related to it and sometimes provide advantageous variants of it. It is also true that there is merit in methods which display the future options for the controlled process more clearly than does the dynamic programming technique (see the certainty equivalence principles of Chapters 12 and 16). One might say that methods which are expressed in terms of the predicted future path of the process (such as the maximum principle, the certainty-equivalence principle and the time-integral methods of Chapters 18-21) correspond to the approach of a chess-player who explores a range of future scenarios in his mind before he makes a move. The dynamic programming approach reflects the approach of the player who has built up a mental evaluation of all possible board configurations, and so can replace the long-term goal of winning by the short-term goal of choosing a move which leads to a higher-value configuration. There is virtue both in the explicit awareness of future possibilities and in the ability to be guided to the same effect by aiming for some more immediate goal.
Finally, there is the relatively naive approach of simply choosing a reasonable control rule and evaluating its performance (by, say, determination of the average cost associated with the rule under equilibrium conditions). It is seldom easy to optimise the rule at this stage; the indirect routes to optimisation are more effective and more revealing. However, there is a systematic method of improving such solutions to yield something which is well on the way to optimality. This is the technique of policy improvement (see Chapters 3 and 11), an approach also derived from dynamic programming. Judged either as an analytic or a computational technique, this may be the single most important tool. In cases where optimality may be an unrealistic ambition, even a false one, it offers a way of starting from a humble base and achieving performance comparable with the optimal. The revision of policy that it recommends can itself convey insight. Policy improvement has a good theoretical basis, has a natural expression in all the characterisations of optimality and, as an iterative technique, it shows second-order convergence to optimality.
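The policy-improvement iteration can be made concrete on a toy terminating problem. The model below (states 0, ..., N with state 0 absorbing; control u in {1, ..., s} moves s to s - u at cost 1 + u²) is invented purely for illustration, but the loop (evaluate the current policy exactly, then improve it greedily against the resulting cost-to-go) is the algorithm described above.

```python
# A toy illustration of policy improvement in a terminating, total-cost
# setting.  The model is invented: from state s > 0 a control u in 1..s
# moves the process to s - u at cost 1 + u**2; state 0 is absorbing.

N = 6

def evaluate(policy):
    """Total cost-to-go of a fixed policy (exact, since every path terminates)."""
    F = [0.0] * (N + 1)
    for s in range(1, N + 1):
        u = policy[s]
        F[s] = 1 + u * u + F[s - u]
    return F

def improve(F):
    """Greedy improvement against the current cost-to-go F."""
    return [0] + [min(range(1, s + 1), key=lambda u: 1 + u * u + F[s - u])
                  for s in range(1, N + 1)]

policy = [0] + [s for s in range(1, N + 1)]   # crude start: jump straight to 0
for _ in range(3):                            # a few iterations suffice
    policy = improve(evaluate(policy))

print(policy)            # settles on u = 1 everywhere
print(evaluate(policy))  # with total cost F(s) = 2s
```

Starting from the crude rule 'jump straight to 0', two improvement steps already reach the optimal rule, illustrating the rapid convergence mentioned above.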
4 ORGANISATION OF THE TEXT

Conventions on notation and standard notations are listed in Appendix 1. While much of the treatment of the text is informal, conclusions are either announced in advance or summarised afterwards in theorem-proof form. This form should be regarded as neither forbidding nor pretentious, but simply as the best way of punctuating and summarising the discussion. It is also by far the best form for readers looking for a quick reference on some point. It does create one difficulty, however. There are theorems whose validity is completely assured by the conditions stated; mathematicians could conceive of nothing else. However, there are situations where arguments of less than full rigour have led one to considerable penetration and to what one indeed believes to be the essential insight, but for which the aspiration to full rigour would multiply the length of the treatment and obscure its point. This is particularly the case when the topic is new enough that a rigorous treatment, even if available, is itself not insightful. One would still wish to summarise assertions, however, leaving it to be understood that the truth of these is subject to technical conditions of a nature neither stated nor verified. Such summary assertions should not properly be termed 'theorems'. We cover this point by starring the second type. So, Theorem 2.3.1 is true as it stands. On the other hand, *Theorem 7.2.1 is 'essentially' valid in statement and proof, but both would need technical supplement before the star could be removed. Exercises are in some cases substantial. In others they simply make points which, although important or interesting in themselves, would have interrupted the discussion if they had been incorporated into the main text. Theorems carry chapter and section labels. Thus, Theorem 2.3.1 is the first theorem of Section 3 of Chapter 2. Equations are numbered consecutively through a chapter, however, without chapter label.
A reference to equation (18) would thus mean equation (18) of the current chapter, but a reference to equation (3.18) would mean equation (18) of Chapter 3. A similar convention holds for figures.
BASICS PART 1
Deterministic Models
CHAPTER 2
Deterministic Models and their Optimisation

1 STATE STRUCTURE, OPTIMISATION AND DYNAMIC PROGRAMMING
The dynamic operation one is controlling is referred to as the 'process' or the 'plant' more or less interchangeably; we shall usually take 'system' as including sensors, controls and even command signals as well as plant. The set of variables which describe the evolution of the process will be collectively termed the process variable and denoted by x. The control variable, whose value can be chosen by the optimiser, will be denoted by u. This is consistent with the notation of Chapter 1. Models are termed stochastic or deterministic according as to whether randomness enters the description or not. We shall see that the incorporation of stochastic effects (i.e. of randomness) is essentially a way of recognising that the values of certain variables may be unknown; in particular, that the future course of certain input variables may be only imperfectly predictable. We restrict ourselves to deterministic models in these first seven chapters. We shall denote time by t. Physical models are naturally phrased in continuous time, when t may take any value on the real axis. However, it is also useful to consider models in discrete time, when t is considered to take only integer values t = ..., -2, -1, 0, 1, 2, .... This corresponds to the notion that the process develops in stages, of equal length. It is a natural view in economic contexts, for example, when data become available at regular intervals, and so decisions tend to be taken at the same intervals. Even engineers operate in this way, when they work with 'sampled data'. Discretisation of time is inevitable if control values are determined digitally. There are mathematical advantages in starting with a discrete-time formulation, even if one later transfers the treatment to the more physical continuous-time formulation. We shall in general try to cover material in both versions. There are two aspects of the model which must be specified if the control optimisation problem is to be properly formulated.
The first of these is the plant equation; the dynamic evolution rule that x obeys for given controls u. This describes the dynamics of the system which is to be controlled, and must be derived from a physical model of that system. The second aspect is the performance criterion, which usually implies specification of a cost function. This
cost function penalises all aspects of the path of the process which are regarded as undesirable (e.g. deviations from required path, lack of smoothness, depletion of resources) and the control policy is to be chosen to minimise it.

Consider first the case of an uncontrolled system in discrete time. The plant equation must then take the form of a recursion expressing x_t in terms of previous x-values. Suppose that this recursion is first-order, so taking the form

x_t = a(x_{t-1}, t),   (1)

where we have allowed dynamics also to depend upon time. In this case the variable x constitutes a dynamically complete description of the state of the system, in that the future course {x_τ; τ > t} of the process at time t is determined totally by x_t, and is independent of the path {x_τ; τ < t} by which x_t was reached. A model with this property is said to have state structure, and the process variable x can be more strongly characterised as the state variable. State structure for a controlled process in discrete time will also require that the model is, in some sense, simply recursive. It is a property of system dynamics and the cost function jointly. We shall assume that the plant equation takes the form

x_t = a(x_{t-1}, u_{t-1}, t),   (2)
analogously to (1). Further, if one is optimising over the time period 0 ≤ t ≤ h, we shall assume that the cost function C takes the additive form

C = Σ_{τ=0}^{h-1} c(x_τ, u_τ, τ) + C_h(x_h) = Σ_{τ=0}^{h-1} c_τ + C_h,   (3)
say. The endpoint h is referred to as the horizon, the point at which operations close. It is natural to regard the terms c_τ and C_h as costs incurred at time τ and time h respectively; we shall refer to them as the instantaneous and closing costs. We have thus assumed, not merely additivity, but also that the instantaneous cost depends only on current state, control and time, and that the closing cost depends only upon the closing state x_h. One would often refer to x_h and C_h as the terminal state and the terminal cost respectively. However, we shall encounter processes which may terminate in other ways before the horizon point is reached (e.g. by accident or by bankruptcy) and it is useful to distinguish between the cost incurred in such a physical termination and one incurred simply by the state one is left in at the expiry of the planning period. We have yet to define what we mean by 'state structure' in the controlled case, but shall see in Theorem 2.1.1 that assumptions (2) and (3) do in fact imply the simply recursive character of the optimisation problem that one would wish. Relation (2) is of course a simple forward recursion, and the significant property of the cost function (3) turns out to be that it can be generated by a simple backward recursion. We can interpret the quantity
C_t = Σ_{τ=t}^{h−1} c_τ + C_h,   (4)

as the cost incurred from time t onwards. It plainly obeys the backward recursion

C_t = c(x_t, u_t, t) + C_{t+1}.
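The backward recursion for C_t is what makes the optimisation simply recursive: minimising it stage by stage gives the finite-horizon dynamic programming algorithm. The following sketch (in Python, which the book itself does not use; the two-state plant, the costs and the horizon are all invented for illustration) computes the value function by working back from the closing cost:

```python
# Finite-horizon dynamic programming via the backward recursion
# C_t = c(x_t, u_t, t) + C_{t+1}, minimised stage by stage.
# The two-state plant and the costs below are invented for illustration.

def solve_dp(states, controls, a, c, C_h, h):
    """Return value functions F[t][x] and an optimal policy pi[t][x]."""
    F = [dict() for _ in range(h + 1)]
    pi = [dict() for _ in range(h)]
    for x in states:
        F[h][x] = C_h(x)                      # closing cost at the horizon
    for t in range(h - 1, -1, -1):            # backward in time
        for x in states:
            best_u, best_v = None, float("inf")
            for u in controls:
                v = c(x, u, t) + F[t + 1][a(x, u, t)]
                if v < best_v:
                    best_u, best_v = u, v
            F[t][x], pi[t][x] = best_v, best_u
    return F, pi

# Toy example: the control can hold the state or flip it.
states, controls = [0, 1], [0, 1]
a = lambda x, u, t: (x + u) % 2               # plant equation
c = lambda x, u, t: x + 0.5 * u               # instantaneous cost
C_h = lambda x: 10.0 * x                      # closing cost
F, pi = solve_dp(states, controls, a, c, C_h, h=5)
print(F[0], pi[0])
```

The point of state structure is visible in the complexity: the work is of order h × |states| × |controls|, rather than a search over whole control paths.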
u_t > 0 for a given t. If x_t ≥ d_t and one transferred the production u_t to day t + 1 one would save a cost cu_t or a + cu_t according as u_{t+1} is zero or positive. Hence the first assertion. For the second point, suppose that on day t one produces enough to satisfy demand before time τ, but not enough to satisfy demand at time τ as well. That is, u_t = Σ_{j=t}^{τ−1} d_j − x_t + δ, where 0 ≤ δ < d_τ. Then one must produce on day τ in any case. If one decreased u_t by δ and increased u_τ by δ then one would save a storage cost c(τ − t)δ, with no other effect on costs. Hence δ = 0 in an optimal policy. □

Thus, if x_t ≥ d_t one does not produce and lets stock run down. If x_t < d_t then one produces enough that demand is met up to some time τ − 1 exactly (so that x_τ = 0), where τ is an integer exceeding t. The optimal τ will be determined by the optimality equation

F(x, t) = min_{τ > t} [c(t, τ) + F(0, τ)],   (18)

where

c(t, τ) = a + Σ_{j=t}^{τ−1} [b + c(j − t)] d_j.   (19)
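Equations (18)-(19) can be solved by the same backward method, since F(0, τ) for larger τ feeds F(x, t) for smaller t. A minimal sketch (the cost parameters and the demand figures are invented; a = set-up cost, b = unit production cost, c = unit daily storage cost):

```python
# Sketch of the optimality equation (18)-(19) for the production-scheduling
# example, starting from zero stock: F(0, t) = min_tau [c(t, tau) + F(0, tau)].
# The demand figures d and the costs a, b, c are invented for illustration.

def schedule(d, a, b, c):
    h = len(d)
    F = [0.0] * (h + 1)                  # F[h] = 0: no demand left
    choice = [None] * h
    def batch_cost(t, tau):              # relation (19)
        return a + sum((b + c * (j - t)) * d[j] for j in range(t, tau))
    for t in range(h - 1, -1, -1):
        if d[t] == 0:                    # no demand today: carry on with zero stock
            F[t] = F[t + 1]
            continue
        F[t], choice[t] = min(
            (batch_cost(t, tau) + F[tau], tau) for tau in range(t + 1, h + 1))
    return F, choice

d = [3, 1, 0, 4, 2]                      # hypothetical daily demands
F, choice = schedule(d, a=5.0, b=1.0, c=0.5)
print(F[0], choice[0])                   # minimal cost; day the first batch covers up to
```

Each production day covers demand up to the chosen τ exactly, as the lemma requires.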
…tend to 0 with increasing t. Solution (41) thus becomes x_t = Γ^t x_0, consistent with the recursive analysis.

Exercises and comments

1. Consider the following separate variations of the assumptions of the theorem. (i) If B = 0 (so that the process is uncontrolled) then (35) has a finite solution if the plant is strictly stable; otherwise it has no finite positive solution. (ii) If R = 0 and the plant is stable then (35) has the unique solution Π = 0. If R = 0 and the plant is unstable then there is a solution Π = 0 and another positive solution. These two solutions correspond respectively to infinite-horizon limits for the two cases Π_h = 0 and Π_h > 0. (iii) If Q = 0 then Π = R. In this case the optimal control takes whatever value is needed to bring the state variable to zero at the next step.
6 DYNAMIC PROGRAMMING IN CONTINUOUS TIME

Control models are usually framed in continuous time, and all the material of Section 1 has a continuous-time analogue. If we look for state structure then the analogues of relations (2) and (3) are a first-order differential equation

ẋ = a(x, u, t),   (42)

as plant equation and a cost function

C = ∫_0^h c(x, u, τ) dτ + C(x(h), h),   (43)

of integral form. Here ẋ is the time rate of change dx/dt of the state variable x. It thus seems that an assumption is forced upon one: that the possible values and course of x are such that this rate of change can be defined; see Exercise 1. We shall usually suppose x to be a vector taking values in ℝⁿ. We shall write the value of x at time t as x(t) rather than x_t, although often the time argument will be suppressed. So, it is understood that the variables are evaluated at time t in the plant equation (42), and at time τ in the integral of (43). The quantity c(x, u, τ) now represents an instantaneous rate of cost, and the final term in (43) again represents the closing cost incurred at the horizon point h. The general discussion of optimisation methods in Section 1 holds equally for the continuous-time case: there seem to be just two methods which are widely applicable. One is that of direct trajectory optimisation by Lagrangian methods, which we develop in Chapter 7 under its usual description of the maximum principle. The other is the recursive approach of dynamic programming.
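The continuous-time model (42)-(43) is, in computation, handled by discretising over a small increment δt: the state moves by a(x, u, t)δt and cost accrues at rate c(x, u, t). A sketch (the scalar plant, cost rate and policy are invented; only Euler integration is illustrated):

```python
# Euler discretisation of the continuous-time model (42)-(43): over a small
# increment dt the state moves by a(x, u, t)*dt and cost accrues at rate
# c(x, u, t). The scalar plant and costs below are invented for illustration.

def rollout(a, c, C_close, policy, x0, h, dt=1e-3):
    x, t, cost = x0, 0.0, 0.0
    while t < h:
        u = policy(x, t)
        cost += c(x, u, t) * dt          # integral term of (43)
        x += a(x, u, t) * dt             # plant equation (42)
        t += dt
    return cost + C_close(x, h)          # add the closing cost

a = lambda x, u, t: -x + u               # hypothetical linear plant
c = lambda x, u, t: x * x + u * u        # hypothetical running cost rate
C_close = lambda x, h: 0.0
total = rollout(a, c, C_close, policy=lambda x, t: 0.0, x0=1.0, h=2.0)
print(total)
```

With the zero control the exact cost is ∫_0^2 e^{−2τ} dτ = (1 − e^{−4})/2, which the Euler sum approaches as δt shrinks.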
We can derive the continuous-time dynamic programming equation formally from the discrete-time version (6). The value function F(x, t) is again defined as the minimal future cost incurred if one starts from time t with state value x. Considering events over a positive time-increment δt we deduce (cf. (6)) that

F(x, t) = inf_u [c(x, u, t)δt + F(x + a(x, u, t)δt, t + δt)] + o(δt).   (44)

Letting δt tend to zero in this relation we deduce the continuous-time optimality equation

inf_u [ c(x, u, t) + ∂F(x, t)/∂t + (∂F(x, t)/∂x) a(x, u, t) ] = 0.
{λ_t} is then the sequence of differentials (70) defined on this optimal orbit. If we set x = x_t in (69) then the minimality condition with respect to u_t will imply a stationarity condition

(71)

where the row vector c_u and matrix a_u of derivatives are evaluated at time t. Differentiation of (69) with respect to x_t yields the companion equation

(72)

Theorem 2.10.1 (Discrete time) Assume that the differentials above exist and that the optimally controlled process has an equilibrium point. Then the values of x, u and λ at an optimal equilibrium must satisfy the three equations

x = a,   (73)

where arguments x and u for a, c and their derivatives are understood. This follows simply by taking the equilibrium condition x = a together with the equilibrium forms of equations (71) and (72) above. Note the necessity for introduction of the conjugate variable λ; it was because of this that we had to establish the dynamic equations (71) and (72) first (see Exercise 1). One can, of course, eliminate λ to obtain the pair of equations
DETERMINISTIC MODELS AND THEIR OPTIMISATION
x = a,   (74)
The fact that the optimal equilibrium point varies with β is something we have already observed for the continuous-time model of Section 7, for which the optimal equilibrium point c was determined in terms of the discount rate α by equation (49). Discounting has the effect of encouraging one to realise return early in time, so there is an inducement to take a quick immediate yield from the system, even at the cost of sinking to a lower-yielding equilibrium.

Equations (73) may have several solutions, of course, and the eligible solution must at least be stable. For a case in point, one can consider an optimal fishing policy under a model which allows for age structure in the fish population. For certain parameter choices, there is an equilibrium in that the optimal fishing pattern is constant from year to year. For other parameter values, one should opt for the non-stationary policy of 'pulse fishing', under which one allows the population to build up for a period before harvesting it; a static equilibrium solution may or may not be unstable under these circumstances.

For the continuous-time analogue of this analysis the plant equation is ẋ = a(x, u) and the dynamic programming equation is

inf_u [c − αF + ∂F/∂t + F_x a] = 0,   (75)

where α is the discount rate. The equations analogous to (71) and (72) are

(76)

(77)

We shall encounter these again in Chapter 7 when we develop the Pontryagin maximum principle; relations (76) and (77) are then seen as conditions that the orbit be optimal. For present purposes, we deduce the analogue of Theorem 2.10.1.

Theorem 2.10.2 (Continuous time) Assume that the differentials above exist and that the optimally controlled process has an equilibrium point. Then the values of x, u and λ at an optimal equilibrium must satisfy the three equations

a = 0,   (78)
The question that should now really be faced is: what should the optimal control policy be in the neighbourhood of the equilibrium point? That is, if the equilibrium values are denoted by x̄ and ū, then how should u vary from ū as x varies slightly from x̄? To determine this we must consider second-order effects and obtain what is in essence an LQ approximation to the process in the neighbourhood of equilibrium. More generally, one can do the same in the neighbourhood of an optimal orbit.
10 OPTIMAL EQUILIBRIUM POINTS
Consider again the discrete-time case, and define Π_t as the value of the square matrix of second-order derivatives F_xx on an optimal orbit at time t. Let Δx_t denote a given perturbation of x_t from its value on the orbit at this point and Δu_t the corresponding perturbation in u_t (whose optimal value is now to be determined).

Theorem 2.10.3 (Discrete time) Assume that all differentials now invoked exist. Define the Hamiltonian at time t, H(x_t, u_t, λ_{t+1}) = c(x_t, u_t) + βλ_{t+1}^T a(x_t, u_t), and the matrices A_t = a_x, B_t = a_u, R_t = H_xx, S_t = H_ux, Q_t = H_uu; these being evaluated on the original optimal orbit at time t. Then the matrix Π_t satisfies the Riccati recursion

Π_t = [R + βA^T Π_{t+1} A − (S^T + βA^T Π_{t+1} B)(Q + βB^T Π_{t+1} B)^{−1}(S + βB^T Π_{t+1} A)]_t   (79)

where all terms in the square bracket save Π are to bear the subscript t. The perturbation in optimal control is, to first order, Δu_t = K_t Δx_t, where

K_t = −(Q + βB^T Π_{t+1} B)^{−1}(S + βB^T Π_{t+1} A),   (80)

again with the subscript t understood on the bracketed terms.

Proof Set x = x_t + Δx_t and u = u_t + Δu_t in the dynamic programming equation (69), where x_t and u_t are the values on the original optimal orbit, and expand all expressions as far as the second-order terms in these perturbations. The zeroth-order terms cancel in virtue of the equation (69) itself, and the first-order terms cancel in virtue of relations (71) and (72). One is then left with a relation in second-order terms which is just the equation

½ x^T Π_t x = inf_u [c(x, u) + ½β(Ax + Bu)^T Π_{t+1} (Ax + Bu)]
with x and u replaced by Δx_t and Δu_t and A, B, R, S and Q replaced by the t-dependent quantities defined above. The conclusions thus follow. □

The interest lies in the replacement of the cost function c(x, u) (with cost matrices R, S and Q) by the Lagrangian-like expression c(x, u) + βλ^T a(x, u). This is the negative of the Hamiltonian which will play such a role in the discussion of the maximum principle in Chapter 7. The additional term βλ^T a(x, u) would in fact contribute nothing at this point if a(x, u) were linear: it is the non-linearity in the plant equation which adds an effective supplementary cost. The negative signs which occur in our definition of λ come from a desire to be consistent with convention; these signs would all be positive if one had phrased the problem as one of reward maximisation rather than of cost minimisation. This perturbation calculation is, of course, of no value unless the perturbed orbit continues to lie close to the original optimal orbit. So, either one must be
considering events over a horizon which is short enough that perturbations remain small, or the perturbed orbit must in fact converge back to the original orbit. The implication in the latter case is that the original orbit is an attractor under the optimal control rule. This would be a rather special circumstance, except in the particular case when the original orbit had itself settled to a stable equilibrium value. In such a case the matrices Π and K will also be independent of t.
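In that stable-equilibrium case the recursion (79) can simply be iterated to its fixed point, and (80) then gives the constant gain. A sketch with NumPy (the matrices A, B, R, S, Q, the iteration count and β = 1 are all invented for illustration):

```python
# Sketch of the Riccati recursion (79) and gain (80); the matrices below
# (a discrete double integrator with quadratic costs) are invented.
import numpy as np

def riccati_step(Pi_next, A, B, R, S, Q, beta=1.0):
    """One backward step of (79); returns (Pi_t, K_t) with K_t as in (80)."""
    G = Q + beta * B.T @ Pi_next @ B
    H = S + beta * B.T @ Pi_next @ A
    Pi = R + beta * A.T @ Pi_next @ A - H.T @ np.linalg.solve(G, H)
    K = -np.linalg.solve(G, H)           # optimal perturbation: Delta u = K Delta x
    return Pi, K

A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
R = np.eye(2); S = np.zeros((1, 2)); Q = np.eye(1)
Pi = np.zeros((2, 2))                    # closing condition Pi_h = 0
for _ in range(200):                     # iterate backward to the limit
    Pi, K = riccati_step(Pi, A, B, R, S, Q)
print(np.round(Pi, 3), np.round(K, 3))
```

At the fixed point the closed-loop matrix A + BK has spectral radius below one, i.e. the orbit is an attractor, which is exactly the circumstance the text requires.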
The continuous-time analogue follows fairly immediately; we quote only the undiscounted case.

Theorem 2.10.4 (Continuous time) Assume that all differentials now invoked exist. Define the Hamiltonian H(x, u, λ) = c(x, u) + λ^T a(x, u) and the time-dependent matrices A = a_x, B = a_u, R = H_xx, S = H_ux, Q = H_uu; these being evaluated on the original optimal orbit at time t. Then the matrix Π = F_xx (evaluated on the original orbit) satisfies the Riccati equation

(81)

The perturbation in optimal control is, to first order, Δu = KΔx, where K has the time-dependent value

(82)

For the harvesting model of Section 7 the Hamiltonian is linear in u, so that Q = 0 and the above analysis fails. Such cases are spoken of as singular. As we see from this example, an optimal control with an equilibrium may nevertheless very well exist. The singularity reflects itself in the discontinuous nature of the control rule.

Exercises and comments

(1) One could have derived the conditions of Theorem 2.10.1 in the undiscounted case simply by minimising c(x, u) with respect to x and u subject to the constraint x = a(x, u). The variable λ then appears as a Lagrange multiplier associated with the constraint. The approach of the text is better for the discounted case, however, and necessary for the dynamic case.
CHAPTER 3
A Sketch of Infinite-horizon Behaviour; Policy Improvement

Suppose the model time-homogeneous. In the control context one will very frequently consider indefinitely continuing operation, i.e. an infinite horizon. If the model is also such that infinite-horizon operation makes physical sense then one will expect that the value function and the optimal control rule will have proper infinite-horizon limits, and that these are indeed time-independent. That is, that the optimal control policy exists and is stationary in the infinite-horizon limit. In this chapter we simply list the types of behaviour one might expect in the infinite-horizon limit, and that typify applications. Expectations can be false, as we illustrate by counterexample; dangers are best put in proportion if they are identified. However, we defer any substantial analysis until Chapter 11, when more examples have been seen and the treatment can be incorporated in that of the stochastic case. Coupled with this discussion is introduction of the important and central technique of policy improvement.

The infinite-horizon limit should not be confused with the limit of equilibrium behaviour. If the model is time-homogeneous and the control rule stationary then one can expect behaviour to tend to some kind of equilibrium with time, under suitable regularity conditions. It would then be more appropriate to regard equilibrium as an infinite-history limit.
1 A FORMALISM FOR THE DYNAMIC PROGRAMMING EQUATION

We suppose throughout that the model is state-structured, so that we can appeal to the material of Sections 2.1 and 2.6. Consider the discrete-time case first. We shall suppose the model time-homogeneous, so that the time argument can be dropped from a(x, u, t) and c(x, u, t), but we shall allow the possibility of discounting. The value function F(x, t) can then be written F_s(x), where s = h − t is the time to go. We achieve considerable notational simplification if we write the dynamic programming equation (2.10) simply as

F_s = ℒF_{s−1}   (s > 0)   (1)

where ℒ is the operator with action
ℒφ(x) = inf_u [c(x, u) + βφ(a(x, u))]   (2)

on a scalar function of state φ(x). The interpretation of ℒφ(x) is that it is the cost incurred if one optimises a single stage of operation: so optimising u_t, say, knowing that x_t = x and that one will incur a closing cost φ(x_{t+1}) at time t + 1. The operator ℒ thus transforms scalar functions of state into scalar functions of state. We shall term ℒ the forward operator of the optimised process, since it indicates how minimal costs from time t (say) depend upon a given cost function at time t + 1. The term is quite consistent with the fact that the dynamic programming equation is a backward equation.

Let us also consider how costs evolve if one chooses a policy π which is not necessarily optimal. If the policy π is Markov, in that u_t depends only on t and the value x of current state x_t, then the value function from time t will also be a function only of these variables, V(π, x, t), say. If the policy is also stationary then it must be of the form u_t = g(x_t) for some fixed function g. In this case the policy is often written as π = g^(∞), to indicate indefinite application of this constant rule, and one can write the value function with time s to go as V_s(g^(∞), x). The backward cost recursion for this fixed policy is

V_{s+1}(g^(∞), x) = c(x, g(x)) + βV_s(g^(∞), a(x, g(x))),   (s > 0)   (3)

a relation which we shall condense to

V_{s+1} = L(g)V_s.   (4)

Here L(g) is then the forward operator for the process with policy g^(∞). If it can be taken for granted that this is the policy being operated then we can suppress the argument g and write the recursion (4) simply as

V_{s+1} = LV_s.

The operators L and ℒ transform scalar functions of state to scalar functions of state. If one applies either of them to a function φ, then it is assumed that φ is a scalar function of state. One important property they share is that they are monotonic. By this we mean that ℒφ ≥ ℒψ if φ ≥ ψ; similarly for L.

Theorem 3.1.1 (i) The operators ℒ and L are monotonic. (ii) If F_1 ≥ (≤) F_0 then the sequence {F_s} is monotone non-decreasing (non-increasing); correspondingly for {V_s}.

Proof L is plainly monotonic. We have then, for φ ≥ ψ,

ℒφ = L(g)φ ≥ L(g)ψ ≥ ℒψ

if we take u = g(x) as the particular control rule induced by formation of ℒφ, i.e. the minimising value of u in (2). Assertion (i) is thus proven.
Assertion (ii) follows inductively. If F_s ≥ F_{s−1} then F_{s+1} = ℒF_s ≥ ℒF_{s−1} = F_s. □
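On a finite state space the operators ℒ and L(g) are just minimisations over a lookup table, and the monotonicity of Theorem 3.1.1 can be checked directly. A sketch (the plant and costs are invented for illustration):

```python
# The operators script-L (one-stage optimisation) and L(g) (fixed policy g)
# realised on a finite state space; the toy plant and costs are invented.

BETA = 0.9
STATES, CONTROLS = [0, 1, 2], [0, 1]
a = lambda x, u: max(x - u, 0)           # plant: the control reduces the state
c = lambda x, u: x + 0.4 * u             # instantaneous cost

def script_L(phi):
    """(script-L phi)(x) = min_u [c(x,u) + beta * phi(a(x,u))], relation (2)."""
    return {x: min(c(x, u) + BETA * phi[a(x, u)] for u in CONTROLS)
            for x in STATES}

def L(g, phi):
    """(L(g) phi)(x) = c(x, g(x)) + beta * phi(a(x, g(x))), relation (3)."""
    return {x: c(x, g(x)) + BETA * phi[a(x, g(x))] for x in STATES}

# Monotonicity: phi >= psi pointwise implies script_L(phi) >= script_L(psi).
phi = {0: 1.0, 1: 2.0, 2: 3.0}
psi = {0: 0.0, 1: 1.0, 2: 2.0}
Lphi, Lpsi = script_L(phi), script_L(psi)
print(all(Lphi[x] >= Lpsi[x] for x in STATES))
```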
In continuous time we have (with the conventions of Section 2.6) the analogues of relations (1) and (4):

∂F/∂s = ℳF,   ∂V/∂s = MV   (s > 0).   (5)

Here F and V are taken as functions of x and s and the operators ℳ and M = M(g) have the actions

ℳφ(x) = inf_u [ c(x, u) − αφ(x) + (∂φ(x)/∂x) a(x, u) ],

Mφ(x) = c(x, g(x)) − αφ(x) + (∂φ(x)/∂x) a(x, g(x)).
Exercises and comments

(1) Consider a time-homogeneous discrete-time model with uniformly bounded instantaneous cost function and strict discounting (so that β < 1). Suppose also (for simplicity rather than necessity) that the state and control variables can take only finitely many values. Define the norm ‖φ‖ of a scalar function φ(x) of state by sup_x |φ(x)|, the supremum norm. Let ℬ be the class of functions bounded in this norm. Show that for φ and ψ in ℬ we have ‖ℒφ − ℒψ‖ ≤ β‖φ − ψ‖. Hence show that the equilibrium optimality equation F = ℒF has a unique solution F in ℬ, identifiable with lim_{s→∞} ℒ^s φ for any φ of ℬ, and that the u/x relation determined by ℒF defines an optimal infinite-horizon control rule.
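A numerical companion to Exercise 1 (the finite model is invented): the contraction bound ‖ℒφ − ℒψ‖ ≤ β‖φ − ψ‖ can be observed directly, and iterating ℒ then converges to the fixed point F = ℒF:

```python
# Numerical check of the contraction property of Exercise 1 under strict
# discounting, and of the resulting convergence of value iteration.
# The finite toy model is invented for illustration.

BETA = 0.8
STATES, CONTROLS = [0, 1, 2], [0, 1]
a = lambda x, u: (x + 1 - u) % 3
c = lambda x, u: float(x == 2) + 0.3 * u

def script_L(phi):
    return {x: min(c(x, u) + BETA * phi[a(x, u)] for u in CONTROLS)
            for x in STATES}

sup = lambda phi: max(abs(v) for v in phi.values())   # supremum norm
phi = {x: 5.0 * x for x in STATES}
psi = {x: -1.0 for x in STATES}
d0 = sup({x: phi[x] - psi[x] for x in STATES})
Lphi, Lpsi = script_L(phi), script_L(psi)
d1 = sup({x: Lphi[x] - Lpsi[x] for x in STATES})
print(d1 <= BETA * d0 + 1e-12)           # the contraction bound

F = {x: 0.0 for x in STATES}
for _ in range(200):                     # value iteration to the fixed point
    F = script_L(F)
err = sup({x: script_L(F)[x] - F[x] for x in STATES})
print(err < 1e-9)                        # F = script_L F to numerical accuracy
```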
2 INFINITE-HORIZON LIMITS FOR TOTAL COST

The fact that one faces an infinite horizon, i.e. envisages indefinite operation, does not mean that the process may not terminate. For example, if one fires a guided missile, it will continue until it either strikes some object or falls back to Earth with its fuel spent. (For simplicity we exclude the possibility of escape into space.) In either case the trajectory has terminated, with a terminal cost which is a function 𝕂(x̄) of the terminal value x̄ of state. The problem is nevertheless an infinite-horizon one if one has set no a priori bound on the time allowed. If one had set a bound, in that one considered the firing a failure if the missile were still in flight at a prescribed time h, then presumably one should assign a cost C_h(x_h) to this contingency. The cost C_h might then more appropriately be termed a closing cost, to distinguish it from the terminal cost 𝕂, the cost of natural termination.
In the infinite-horizon case h is infinite and there is no mention of a closing cost. One very regular case is that for which the total infinite-horizon cost is well defined, and the total costs V(x) and F(x) (under a policy g^(∞) and an optimal policy respectively) are finite for prescribed x. If instantaneous cost is non-negative then this means that the trajectory of the controlled process must be such that the cost incurred after time t tends to zero with increasing t. One situation for which this would occur is that envisaged above: that in which the process terminates of itself at some finite time and incurs no further cost. Another is that discussed in Exercise 1.1, in which instantaneous costs are uniformly bounded and discounting is strict, when the value at time 0 of cost incurred after time t tends to zero as β^t. Yet another case is that typified by the LQ regulation problem of Section 2.3. Suppose one has a fixed policy u = Kx which is stabilising, in that the elements of (A + BK)^t tend to zero as ρ^t with increasing t. Then x_t and u_t also tend to zero as ρ^t, and the instantaneous cost c(x_t, u_t) tends to zero as ρ^{2t}. Although there is no actual termination in this case, one approaches a costless equilibrium sufficiently fast that total cost is finite.

One would hope that finiteness of total cost would imply that V and F could be identified respectively as the limits of V_s = L^s C_h and of F_s = ℒ^s C_h as s → ∞, for any specified closing cost C_h in some natural class 𝒞. One would also hope that these cost functions obeyed the equilibrium forms of the dynamic programming equations

F = ℒF,   (6)

V = LV   (7)

and that they were the unique solutions of these equations (at least in some natural class of functions). Further, that the minimising value of u in (6) would determine a stationary optimal policy. That is, that if

F = ℒF = L(g)F   (8)

then g^(∞) is an optimal policy.

(1) …for j > 0 one can move only to j − 1; state 0 is absorbing. All transitions are costless, except that from 1 to 0, which carries unit cost. Then F_s(α) is zero, because one can move from α to a j so large that the transition 1 → 0 does not occur before closing. Thus F_∞(α) := lim F_s(α) = 0. On the other hand F(α) = 1, because the transition 1 → 0 must occur at some time under any policy (i.e. choice of move in state α). The fact that F_∞ ≠ F means that optimisation does not commute with passage to the infinite-horizon limit, and is referred to (unfortunately) as instability.

(2) Suppose that one can either continue, at zero cost, or terminate, at unit cost. There is thus effectively only one continuation state; if F is the minimal cost in this state then the equation F = ℒF becomes F = min(F, 1). The solution we want is F = 0, corresponding to the optimal policy of indefinite continuation. However, the equation is solved by F = K for any constant K ≤ 1; that for K = 1 is indeed consistent with the non-optimal policy that one chooses termination. K can be regarded as a notional closing cost, whose value affects costs and decisions at all horizons. It is optimal to continue or to terminate according as K ≤ 1 or K ≥ 1. In fact, K = 0, by the conventions of the infinite-horizon formulation, but the non-uniqueness in the solution of the dynamic programming equation reflects a sensitivity to any other specification.

(3) A more elaborate version of the same effect is to assume that x and u may take integral values, say, and that the plant equation and cost function are such as to imply the equilibrium dynamic programming equation

F(x) = min_u [|u| + F(x − u)].

The desired solution is F = 0, u = 0, corresponding to a zero closing cost. However, there are many other solutions, as the reader may verify, corresponding to a non-zero notional closing cost.

(4) Consider a process on the positive integers x = 1, 2, 3, … such that, when in x, one has the options of either moving to x + 1 at zero reward ('continuation') or retiring with reward 1 − 1/x ('termination'). This then is a problem in 'positive programming': one is maximising a non-negative reward rather than minimising a non-negative cost. If G(x) is the maximal infinite-horizon reward from state x then the dynamic programming equation is

G(x) = max[G(x + 1), 1 − 1/x].

This is solved by any constant G ≥ 1, corresponding to the specification of some x-dependent closing reward which exceeds or equals the supremum terminal
reward of 1. However, we know that there is no such closing reward, and we must restrict ourselves to solutions in G ≤ 1. The only such solution is G = 1, attained for indefinite continuation. But indefinite continuation is non-optimal: one then never collects the reward. In short, this is a case for which there is a g such that ℒF = L(g)F for the correct F, but g^(∞) is nevertheless non-optimal. In fact, no optimal solution exists in this case. If one decides in advance to terminate in state x, then there is always an advantage in choosing x larger, but x may not be infinite.
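Counterexample (4) can be made concrete by backward induction on the horizon (a sketch; the zero closing reward follows the text, while the truncation level is arbitrary): the horizon-s value from x = 1 is 1 − 1/s, which approaches the supremum 1 that no infinite-horizon policy attains.

```python
# Counterexample (4) numerically: G_{s+1}(x) = max(1 - 1/x, G_s(x + 1)) with
# zero closing reward G_0 = 0. The truncation level is invented and only
# needs to exceed the largest horizon considered.

def horizon_values(s_max, n_states):
    G = [0.0] * (n_states + 2)               # G_0 = 0 (zero closing reward)
    out = []
    for s in range(1, s_max + 1):
        G = [max(1.0 - 1.0 / x, G[x + 1]) if x >= 1 else 0.0
             for x in range(n_states + 1)] + [0.0]
        out.append(G[1])                     # horizon-s value from state 1
    return out

vals = horizon_values(s_max=50, n_states=100)
print(vals[:3], vals[-1])                    # values creep up towards 1
```

The finite-horizon values increase with s but their limit is not attained by any policy, which is the sense in which no optimal solution exists.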
3 AVERAGE-COST OPTIMALITY

In most control applications it is not natural to discount, and the controlled process will, under a stationary and stabilising policy, converge to some kind of equilibrium behaviour. A cost will still be incurred under these conditions, but at a uniform rate γ, say. The dominant component of cost over a horizon h will thus be the linear growth term γh, for large h. For example, suppose we consider the LQ regulation problem of Section 2.4, but with the cost function c(x, u) modified to ½(x − q)^T R(x − q) + ½u^T Qu. One is thus trying to regulate to the set point (q, 0) rather than to (0, 0). At the optimal equilibrium a constant control effort will be required to hold x in at least the direction of q. One then incurs a constant cost, because of the constant offset of (x, u) from the desired set point (q, 0); see Exercise 1. More generally, disturbances in the plant equation will demand continuing correction and so constitute a continuing source of cost, as we saw in Section 2.9. With known disturbances cost is incurred at a known but time-varying rate. One could doubtless develop the notion of a long-term average cost rate under appropriate hypotheses, but a truly time-invariant model can only be achieved if disturbances are specified statistically. For example, we shall see in Section 10.1 that, if the disturbance takes the form of 'white noise' with covariance matrix N, then the minimal expected cost incurred per unit time is γ = ½ tr(NΠ). Here Π is the matrix of the deterministic value function derived in Section 2.4. In general, there are many aspects of average-cost optimality…

The transfer function of the filter d → x implied by relation (18) is G(z) = A(z)^{−1}. The causal form of the filter is stable if and only if all zeros of A(z) lie strictly outside the unit circle. Inversion of the operator thus amounts, in the TIL case, to inversion of the transfer function in the most literal sense: G(z) is simply the reciprocal of the transfer function A(z) of the inverse filter.
However, to this observation must be added some guidance as to how the series expansion of A(z)^{−1} is to be understood. We do indeed demand causality if equation (18) is to be understood physically: as a forward recursion in time whose solution cannot be affected by future input.

We shall see in the next section that the theorem continues to hold in the vector case. In this case A(z) is a matrix, with elements polynomial in z, which implies that A(z)^{−1} is rational. That is, the transfer function from any component of the input to any component of the output is rational. This can be regarded as the way the rational transfer functions make their appearance in practice: as a consequence of finite-order, finite-dimensional (but multivariable) linear dynamics.

The simplest example is that which we have already quoted in the last section; the uncontrolled version

x_t = Ax_{t−1} + d_t   (20)
of relation (2.61). We understood this as a vector relation in Section 2.9, and shall soon do so again, but let us regard it as a scalar relation for the moment. We have then A(z) = 1 − Az, with its single zero at z = A^{−1}. The necessary and sufficient condition for stability is then that |A| < 1. The solution of the equation for prescribed initial conditions at t = r is

x_t = Σ_{τ=0}^{t−r−1} A^τ d_{t−τ} + A^{t−r} x_r,   (t ≥ r)   (21)

whatever A. From this we see that the stability condition |A| < 1 both assures stability in what one would regard as the usual dynamic sense (that the effect of initial conditions vanishes in the limit r → −∞) and in the filter sense (e.g. that the filter has the BIBO property). If |A| ≥ 1 then solution (21) is still valid, but will in general diverge as t − r increases.

We can return to the point now that the transfer function has, in the present case G(z) = (1 − Az)^{−1}, two distinct series expansions. These are

G(z) = Σ_{τ=0}^{∞} A^τ z^τ,   G(z) = −Σ_{τ=−∞}^{−1} A^τ z^τ,

valid respectively for |z| < |A|^{−1} and |z| > |A|^{−1}. The first of these corresponds to a filter which is causal, but stable only if |A| < 1. The second corresponds to a
THE CLASSIC FORMULATION OF THE CONTROL PROBLEM
filter which is stable if |A| > 1, but is noncausal. Indeed, it corresponds to a solution

x_t = −Σ_{r=1}^{∞} A^{−r} d_{t+r}

of (20). This is mathematically acceptable if |A| > 1 (and d uniformly bounded, say), but is of course physically unacceptable, being noncausal.
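Both expansions can be checked numerically against the plant equation (20) (a sketch; the input signal and truncation depth are invented):

```python
# The two expansions of G(z) = (1 - Az)^{-1} as solutions of x_t = A x_{t-1} + d_t:
# the causal sum over past inputs (convergent when |A| < 1) and the noncausal
# sum over future inputs (convergent when |A| > 1).

import math

def causal(A, d, t, depth=200):
    # x_t = sum_{tau >= 0} A^tau d_{t - tau}
    return sum(A ** tau * d(t - tau) for tau in range(depth))

def noncausal(A, d, t, depth=200):
    # x_t = -sum_{r >= 1} A^{-r} d_{t + r}
    return -sum(A ** (-r) * d(t + r) for r in range(1, depth))

d = lambda t: math.cos(0.5 * t)          # a bounded, two-sided input signal

# Stable case |A| < 1: the causal sum satisfies the plant equation.
x0, x1 = causal(0.5, d, 0), causal(0.5, d, 1)
print(abs(x1 - (0.5 * x0 + d(1))) < 1e-9)

# Unstable case |A| > 1: the noncausal sum satisfies it instead.
y0, y1 = noncausal(2.0, d, 0), noncausal(2.0, d, 1)
print(abs(y1 - (2.0 * y0 + d(1))) < 1e-9)
```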
Exercises and comments

(1) Consider the filter d → x implied by the difference equation (18). The leading coefficient A_0 must be non-zero if the filter is to be causal. Consider the partial power expansion

A(z)^{−1} = Σ_{τ=0}^{t−1} g_τ z^τ + A(z)^{−1} Σ_{k=0}^{p−1} c_{tk} z^{t+k},

in which the two sums are respectively the dividend and the remainder after t steps of division of A(z) into unity by the long-division algorithm. Show, by establishing recursions for these quantities, that the solution of system (18) for general initial conditions at t = 0 is just

x_t = Σ_{τ=0}^{t−1} g_τ d_{t−τ} + Σ_{k=0}^{p−1} c_{tk} x_{−k}.

Relation (21) illustrates this in the case p = 1.

(2) Suppose that A(z) = Π_{j=1}^{p} (1 − α_j z), so that we require |α_j| < 1 for all j for stability. Determine the coefficients c_j in the partial fraction expansion

A(z)^{−1} = Σ_{j=1}^{p} c_j (1 − α_j z)^{−1}

in the case when the α_j are distinct. Hence determine the coefficients g_τ and c_{tk} of the partial inversion of Exercise 1.

(3) Model (20) has the frequency response function (1 − Ae^{−iω})^{−1}. An input signal d_t = e^{iωt} will indeed be multiplied by this factor in the output x if the filter is stable and sufficient time has elapsed since start-up. This will be true for an unstable filter only if special initial conditions hold (indeed, that x showed this pattern already at start-up). What is the amplitude of the response function?
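The long-division construction of Exercise 1 is mechanical enough to sketch (here checked against the case p = 1, where (21) gives g_τ = A^τ and c_{t0} = A^t; the value of A is invented):

```python
# Exercise 1's construction: divide A(z) into 1 for t steps, producing the
# transient-response coefficients g_tau and remainder coefficients c_tk.

def long_divide(A_coeffs, t):
    """A_coeffs = [A_0, ..., A_p]; returns (g[0..t-1], [c_t0, ..., c_t,p-1])."""
    p = len(A_coeffs) - 1
    rem = [1.0] + [0.0] * p              # current remainder of the division
    g = []
    for _ in range(t):
        q = rem[0] / A_coeffs[0]         # next quotient coefficient g_tau
        g.append(q)
        # remainder <- (remainder - q * A(z)) / z
        rem = [rem[k + 1] - q * A_coeffs[k + 1] for k in range(p)] + [0.0]
    return g, rem[:p]

A_val = 0.7                               # hypothetical scalar A in (20)
g, c_rem = long_divide([1.0, -A_val], t=5)   # A(z) = 1 - Az, so p = 1
print(g, c_rem)                           # expect g_tau = A^tau and c_t0 = A^t
```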
4 FILTERS IN DISCRETE TIME; THE VECTOR (MIMO) CASE

Suppose now that input d is an m-vector and output x an n-vector. Then the representations (7) and (12) still hold for linear and TIL filters respectively, the
first by definition and the second again by Theorem 4.2.1. The coefficients g_{tτ} and g_τ are still interpretable as transient response, but are now n × m matrices, since they must give the response of all n components of output to all m components of input. However, for the continuation of the treatment of the last section, let us take an approach which is both more oblique and more revealing.

Note, first, that we are using 𝒯 in different senses in the two places where it occurs in (11). On the right, it is applied to the input, and so converts m-vectors into m-vectors. On the left, it is applied to output, and does the same for n-vectors. For this reason, we should not regard 𝒯 as a particular case of a filter operator 𝒢; we should regard it rather as an operator of a special character, which can be applied to a signal of any dimension, and which in fact operates only on the time argument of that signal and not on its amplitude.

We now restrict ourselves wholly to the TIL case. Since 𝒢 is a linear operator, one might look for its eigenvectors; i.e. signals ξ_t with the property 𝒢ξ_t = λξ_t for some scalar constant λ. However, since the output dimension is in general not equal to the input dimension we look rather for a scalar signal σ_t such that 𝒢ξσ_t = ησ_t for some fixed vector η for any fixed vector ξ. The translation-invariance condition (11) implies that, if the input-output pair {ξσ_t, ησ_t} has this property, then so does {ξσ_{t−1}, ησ_{t−1}}. If these sequences are unique to within a multiplicative constant for a prescribed ξ then one set of signals must be a scalar multiple of the other, so that ξσ_{t−1} = zξσ_t for some scalar z. This implies that σ_t ∝ z^{−t}, which reveals the particular role of the exponential sequences. Further, η must then be linearly dependent upon ξ, although in general by a z-dependent rule, so that we can set η = G(z)ξ for some matrix G(z). But it is then legitimate to write

𝒢 = G(𝒯),   (22)

since 𝒢 has this action for any input of the form ξz^{−t} or a linear combination of such expressions for varying ξ and z. If G(z) has a power series expansion Σ_τ g_τ z^τ then relation (22) implies an expression (12) for the output of the filter, with g_τ identified as the matrix-valued transient response. We say 'a' rather than 'the' because, just as in the SISO case, there may be several such expansions, and the appropriate one must be resolved by considerations of causality and stability. These concepts are defined as before, with the l₁ condition (13) modified to

Σ_τ |g_{τjk}| < ∞,

where g_{τjk} is the jk-th element of g_τ. The transient response g_τ is obtained from G(z) exactly as in (17): by determination of the coefficient of z^τ in the appropriate expansion of G(z). If causality is demanded then the only acceptable expansion is that in non-negative powers of z.
If G(z) is rational (meaning that the elements of G(z) are all rational in z) then the necessary and sufficient condition that the filter be both causal and stable is that of Theorem 4.2.3, applied to each element of G.

Again, one returns to basics and to physical reality if one sees the filter as being generated by a model. Suppose that the filter is generated by the dynamic equation (18), with x and d now understood as vectors. If this relation is to be invertible we shall in general require that input and output be of the same dimension, so that A(𝒯) is a square matrix whose elements are polynomials of degree at most p in 𝒯. The analysis of Section 3 generalises immediately; we can summarise the conclusions.
Theorem 4.4.1 The filter 𝒢 determined by (18) has transfer function G(z) = A(z)^{-1}. The causal form of the filter is stable if and only if the zeros of |A(z)| lie strictly outside the unit circle.

Here |A(z)| is the determinant of A(z), regarded as a function of z. The first conclusion follows from A(z)G(z) = I, established as before. The elements of A(z)^{-1} are rational in z, with poles at the zeros of the determinant |A(z)|. (More exactly, these are the only possible poles, and all of them occur as the pole of some element.) The second conclusion then follows from Theorem 4.2.3.

The fact that z = 0 must not be a zero of |A(z)| implies that A_0 is nonsingular. This is of course necessary if (18) is to be seen as a forward recursion, determining x_t in terms of current input and past x-values.

The actual filter output may be only part of the output generated by the dynamic equation (18). Suppose we again take the car driving over the bumpy road as our example, and take the actual filter output as being what the driver observes. He will observe only the grosser motions of the car body, and will in fact observe only some lower-dimensional function of the process variable x.

Exercises and comments
(1) Return to the control context and consider the equation pair

x_t = Ax_{t-1} + Bu_{t-1},    y_t = Cx_{t-1},    (23)

as a model of plant and observer. In an input-output description of the plant one often regards the control u as the input and the observation y as the output, the state variable x simply being a hidden variable. Show that the transfer function u → y is C(I - Az)^{-1}Bz². What is the condition for stability of this causal filter?

If a disturbance d were added to the plant equation of (23), then this would constitute a second input. If a control policy has been determined then one has a higher-order formulation; d is now the only input to the controlled system.

(2) The general version of this last model would be
𝒜x + ℬu = 0,    y + 𝒞x = 0,

where 𝒜, ℬ and 𝒞 are causal TIL operators. If 𝒜 = A(𝒯) etc. then the transfer function u → y is C(z)A(z)^{-1}B(z).
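Exercise (1) can be checked numerically. A throwaway sketch (not from the book; the scalar values A = 0.8, B = 2, C = 3 are invented for illustration): the impulse response of the pair (23) should match the coefficients of C(I - Az)^{-1}Bz², namely CA^{t-2}B for t ≥ 2.

```python
# Impulse response of x_t = A x_{t-1} + B u_{t-1}, y_t = C x_{t-1} (scalar
# state) versus the series coefficients of C(I - Az)^{-1} B z^2.

A, B, C = 0.8, 2.0, 3.0        # |A| < 1, so the causal filter is stable

T = 12
u = [1.0] + [0.0] * (T - 1)    # unit impulse in the control at t = 0
ys = []
xprev, uprev = 0.0, 0.0        # x_{t-1} and u_{t-1}; zero before t = 0
for t in range(T):
    ys.append(C * xprev)                              # y_t = C x_{t-1}
    xprev, uprev = A * xprev + B * uprev, u[t]        # x_t = A x_{t-1} + B u_{t-1}

# the coefficient of z^t in C(I - Az)^{-1} B z^2 is C A^(t-2) B for t >= 2
assert ys[0] == 0.0 and ys[1] == 0.0
for t in range(2, T):
    assert abs(ys[t] - C * A ** (t - 2) * B) < 1e-12
```

The two lags of delay (the factor z²) are visible in the two leading zeros of the impulse response, and the stability condition is that the singularity of (I - Az)^{-1}, at z = 1/A, lies outside the unit circle.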
5 COMPOSITION AND INVERSION OF FILTERS; z-TRANSFORMS

We assume for the remainder of the chapter that all filters are linear, translation-invariant and causal. Let us denote the class of such filters by 𝒞. If filters 𝒢_1 and 𝒢_2 are applied in succession then the compound filter thus generated also lies in 𝒞 and has action

𝒢_2𝒢_1 = G_2(𝒯)G_1(𝒯).    (24)

That is, its transient response at lag r is Σ_v g_{2v} g_{1,r-v}, in an obvious terminology. However, relation (24) expresses the same fact much more neatly.

The formalism we have developed for TIL filters shows that we can manipulate the filter operators just as though the operator 𝒯 were an ordinary scalar, with some guidance from physical considerations as to how power series expansions are to be taken. This formalism is just the Heaviside operator calculus, and is completely justified as a way of expressing identities between coefficients such as the A_r of the vector version of (18) and the consequent transient response g_r. However, there is a parallel and useful formalism in terms of z-transforms (which become Laplace transforms in continuous time). This should not be seen as justifying the operator formalism (such justification not being needed) but as supplying useful analytic characterisations and evaluations.

Suppose that the vector system

Σ_{r=0}^p A_r x_{t-r} = d_t    (25)
does indeed start up at time zero, in that both x and d are zero on the negative time axis. Define the z-transforms

x̄(z) = Σ_{t=0}^∞ x_t z^t,    d̄(z) = Σ_{t=0}^∞ d_t z^t    (26)

for scalar complex z. Then it is readily verified that relation (25) amounts to

A(z)x̄(z) = d̄(z)    (27)

with the inversion

x̄(z) = A(z)^{-1}d̄(z).

This latter expression amounts to the known conclusion G(z) = A(z)^{-1} if we understand that A(z)^{-1} and d̄(z) are to be expanded in nonnegative powers of z.
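The statement that the causal solution corresponds to expansion in nonnegative powers of z can be tested on a scalar toy case (a sketch, not from the book; the numbers are invented). For x_t - 0.5x_{t-1} = d_t we have A(z) = 1 - 0.5z and G(z) = Σ_r 0.5^r z^r, so the solution should equal the convolution x_t = Σ_r 0.5^r d_{t-r}.

```python
# The recursion x_t - 0.5 x_{t-1} = d_t, solved two ways: directly, and by
# convolving d with the transient response g_r = 0.5**r, the coefficients of
# the expansion of A(z)^{-1} = (1 - 0.5z)^{-1} in nonnegative powers of z.

d = [1.0, -2.0, 0.5, 3.0, 0.0, 1.0]      # arbitrary input, zero before t = 0

# direct recursion
x_rec = []
for t, dt in enumerate(d):
    x_rec.append(dt + (0.5 * x_rec[t - 1] if t > 0 else 0.0))

# convolution with the transient response
x_conv = [sum(0.5 ** r * d[t - r] for r in range(t + 1)) for t in range(len(d))]

assert all(abs(a - b) < 1e-12 for a, b in zip(x_rec, x_conv))
```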
The inversion is completed by the assertion that x_t is the coefficient of z^t in the expansion of x̄(z). Analytically, this is expressed as

x_t = (1/2πi) ∮ x̄(z) z^{-t-1} dz    (28)

where the contour of integration is a circle around the origin in the complex plane small enough that all singularities of x̄(z) lie outside it.

Use of transforms supplies an alternative language in which the application of an operator, as in A(𝒯)x, is replaced by the application of a simple matrix multiplier: A(z)x̄. This can be useful, in that important properties of the operator A(𝒯) can be expressed in terms of the algebraic properties of A(z), and calculable integral expressions can be obtained for the transient response g_r of (17). However, it is also true that both the operator formalism and the concept of a transfer function continue to be valid in cases where the signal transforms x̄ and d̄ do not exist, as we shall see in Chapter 13.

Exercises and comments
(1) There is a version of (27) for arbitrary initial conditions. If one multiplies relation (25) by z^t and then sums over t ≥ 0 one obtains the relation

x̄(z) = A(z)^{-1} [ d̄(z) - Σ_{k≥1} Σ_{j≥k} A_j x_{-k} z^{j-k} ].    (29)

This is a transform version of the solution for x_t of Exercise 3.1, implying that solution for all t ≥ 0. Note that we could write it more compactly as
(30)

where the operator [ ]_+ retains only the terms in nonnegative powers of z from the power series in the bracket. It is plain from this solution that stability of the filter implies also that the effect of initial conditions dies away exponentially fast with increasing time.

(2) Consider the relation x_t = g_0 d_t + g_1 d_{t-1}, for which the transient response is zero at lags greater than unity. There is interest in seeing whether this could be generated by a recursion of finite order, where we would for example regard relation (25) as a recursion of order p. If we define the augmented process variable x̃ as that with vector components x_t and d_t then we see that it obeys a recursion x̃_t = Ãx̃_{t-1} + B̃d_t, where

Ã = [ 0  g_1 ]     B̃ = [ g_0 ]
    [ 0   0  ],        [  1  ].

The fact that this system could not be unstable under any circumstances is reflected in the fact that |I - Ãz| has no zeros, and so G(z) has no poles.
6 FILTERS IN CONTINUOUS TIME

In continuous time the translation operator 𝒯^τ is defined for any real τ; let us replace r by the continuous variable τ. However, the role of the unit translation 𝒯 must be taken over by that of an infinitesimal translation. More specifically, one must consider the rate of change with translation 𝒟 = lim_{τ↓0} τ^{-1}[1 - 𝒯^τ], which is just the differential operator 𝒟 = d/dt. The relation

lim_{δt↓0} [x(t - τ + δt) - x(t - τ)]/δt = ẋ(t - τ)

amounts to the operator relation

(d/dτ)𝒯^τ = -𝒟𝒯^τ.

Since 𝒯^0 = 1 this has formal solution

𝒯^τ = e^{-τ𝒟},    (31)
which exhibits 𝒟 as the infinitesimal generator of the translations. Relation (31) can be regarded as an expression of Taylor's theorem

e^{-τ𝒟}x(t) = Σ_{j=0}^∞ [(-τ𝒟)^j / j!] x(t) = x(t - τ) = 𝒯^τ x(t).    (32)

Note, though, that the translation x(t - τ) of x(t) makes sense even if x is not differentiable to any given order, let alone indefinitely.
Note, though, that translation x(t T) of x(t) makes sense even if x is not differentiable to any given order, let alone indefinitely. As in the discretetime case, a translationinvariant linear filter ~ will modify an exponential signal x(t) =~est merely by a multiplicative matrix factor, G(s), say, so that (33)
G(s) is then the transfer function of the continuoustime filter. We can legitimately write ~=
G(Ei')
(34)
in analogy to (22), since, by (33), r§ has this action on any linear combination of exponential signals. However, use of the formalism (34) does not imply that the differentials £}r x( t) need exist for any r for a function x( t) to which~ is to be applied. We have already seen in (32) that the translation e_.,.g x( t) is welldefined, even if x( t) is not differentiable at all.
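For an analytic signal the formal identities (31) and (32) can be checked directly. A throwaway numerical sketch (not from the book; the numbers are invented) using x(t) = e^{at}, for which 𝒟^j x(t) = a^j e^{at}:

```python
import math

# For x(t) = e^{at} the Taylor series sum_j ((-tau)^j / j!) D^j x(t)
# should reproduce the translation x(t - tau), as in (31)-(32).
a, t, tau = 0.7, 1.3, 0.4
series = sum((-tau * a) ** j / math.factorial(j) * math.exp(a * t)
             for j in range(30))
assert abs(series - math.exp(a * (t - tau))) < 1e-12
```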
In fact, identification (31) demonstrates that, if G(s) has a Fourier-Laplace representation

G(s) = ∫_0^∞ e^{-sτ} g(τ) dτ,    (35)

then the filter relationship x = 𝒢d can be written

x(t) = ∫_0^∞ g(τ) d(t - τ) dτ,    (36)

whence we see that we can identify g(τ) with the transient response of the filter. However, we must be prepared to stretch our ideas. For example, one could envisage a filter x = 𝒟^r d which formed the rth time differential of the current input. One could represent this in the form (36) only by setting g(τ) = δ^{(r)}(τ), the rth differential of a delta-function. We have taken the integral only over nonnegative τ in (35) on the supposition that the filter is causal. If the integral (35) is absolutely convergent for some real positive value α of s then it will define G(s) as an analytic function for all s such that Re(s) ≥ α.
7 DYNAMIC MODELS: THE INVERSION OF CONTINUOUS-TIME FILTERS

As we have already emphasised in Sections 3 and 4, one must look beyond the input-output specification of a filter to the dynamic mechanism behind it. Suppose that these equations take the form

𝒜x = d.    (37)

The simplest finite-order TIL assumption is that 𝒜 is a differential operator of order p, say:

𝒜 = A(𝒟) = Σ_{r=0}^p A_r 𝒟^r.    (38)

(For economy of notation we have denoted the matrix coefficients by A_r, as for the discrete-time version (18), but the two sets of coefficients are completely distinct.) The system (37), (38) then constitutes a set of differential equations of degree p at most. This is to be regarded as a forward equation in time, determining the forward course of the output x. In discrete time this led to the requirement that A_0 should be nonsingular. The corresponding requirement now is that the matrix coefficient of the highest-order differentials should be nonsingular. That is, if the kth individual output x_k occurs differentiated to order r_k at most in system (37), (38), then the matrix Ā whose jkth element is the jkth element of A_{r_k} (for all relevant j, k) must be nonsingular.
Just as for the discrete-time case of Sections 3 and 4, the actual filter d → x obtained by inversion of the relation (37) has A(s)^{-1} as transfer function. We must also suppose the filter causal, if relation (37) is supposed on physical grounds to be a forward relation in time. Thus, if A(s)^{-1} has the Laplace representation

A(s)^{-1} = ∫_0^∞ e^{-sτ} g(τ) dτ,    (39)

then g(τ) is the transient response of the filter, and the solution of equation (37) is

x(t) = ∫_0^∞ g(τ) d(t - τ) dτ    (40)

plus a possible contribution from initial conditions. There will be no such contribution if the system starts from a quiescent and undisturbed past or if the filter is stable and has operated indefinitely.

The Laplace transform is the continuous-time analogue of the z-transform of Section 5. Suppose the system (37) quiescent before time zero, in that both x(t) and d(t) are zero for t < 0. If we multiply relation (37) at time t by e^{-st} and integrate over t from 0 to infinity then we obtain

A(s)x̄(s) = d̄(s)    (41)

where x̄(s) is the Laplace transform of x(t):

x̄(s) = ∫_0^∞ e^{-st} x(t) dt.    (42)

The reason for emphasising inclusion of the value t = 0 in the range of integration will transpire in the next section. However, relation (41) certainly implies that x̄(s) = A(s)^{-1}d̄(s) for all s for which both d̄(s) and A(s)^{-1} are defined, and this is indeed equivalent to the solution implied by the evaluation of the transient response implicit in (39). In a sense there is no need for the introduction of Laplace transforms, in that the solution determined by (39), (40) remains valid in cases when d̄(s) does not exist. However, the Laplace formalism provides the natural technique for the inversion constituted by relation (39); i.e. for the actual determination of g(τ) from G(s) = A(s)^{-1}.

Exercises and comments

(1) Show, by partial integration, that

∫_0^∞ e^{-st} 𝒟^r x(t) dt = s^r x̄(s) - Σ_{q=0}^{r-1} s^q 𝒟^{r-q-1} x(0).
(Here 𝒟^r x(0) is a somewhat loose expression for the rth differential of x at time 0.) For general initial conditions at t = 0 relation (41) must thus be replaced by

A(s)x̄(s) = d̄(s) + Σ_r A_r Σ_{q=0}^{r-1} s^q 𝒟^{r-q-1} x(0).

This is the continuous-time analogue of relation (29).

(2) A stock example is that of radioactive decay. Suppose that a radioactive substance can decay through consecutive elemental forms j = 0, 1, 2, ..., and that x_j(t) is the amount of element j at time t. Under standard assumptions the x_j will obey the equations
ẋ_0 = d(t) - μ_0 x_0,    ẋ_j = μ_{j-1} x_{j-1} - μ_j x_j    (j = 1, 2, ...),

where μ_j is the decay rate in state j. Here we have supposed for simplicity that only element 0 is replenished externally, at rate d(t). In terms of Laplace transforms these relations become

(s + μ_0)x̄_0 = d̄,    (s + μ_j)x̄_j = μ_{j-1} x̄_{j-1}    (j = 1, 2, ...),

if we assume that x_j(0) = 0 for all j. We thus find that

x̄_j = [μ_0 μ_1 ⋯ μ_{j-1} / ((s + μ_0)(s + μ_1) ⋯ (s + μ_j))] d̄ = [μ_0 μ_1 ⋯ μ_{j-1} / P_j(s)] d̄,

say. If μ_0, μ_1, ..., μ_j are distinct and positive then this corresponds to a transient response function (for d → x_j):

g_j(τ) = Σ_{k=0}^j μ_0 μ_1 ⋯ μ_{j-1} e^{-μ_k τ} / P_j′(-μ_k).    (43)

Suppose that j = p is the terminal state, in that μ_p = 0. The term corresponding to k = p in expression (43) for j = p is then simply 1. This corresponds to a singularity at s = 0 in the transfer function from d to x_p. The singularity corresponds to a rather innocent instability in the response of x_p to d: simply that all matter entering the system ultimately accumulates in state p.
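As a numerical check (a sketch, not from the book; the step size and rates are invented), one can integrate the chain with a unit impulse of input, which places one unit of matter in state 0 at time 0, so that x_1(t) then equals the transient response of the filter d → x_1. For j = 1 formula (43) reduces to g_1(τ) = μ_0 (e^{-μ_0 τ} - e^{-μ_1 τ}) / (μ_1 - μ_0).

```python
import math

# Impulse response of the two-stage decay chain
#   x0' = -mu0 x0,   x1' = mu0 x0 - mu1 x1,   x0(0) = 1, x1(0) = 0,
# integrated by Euler steps and compared with the partial-fraction formula.

mu0, mu1 = 1.0, 0.5
x0, x1 = 1.0, 0.0                 # unit impulse of input at t = 0
T, steps = 2.0, 20000
h = T / steps
for _ in range(steps):
    x0, x1 = x0 + h * (-mu0 * x0), x1 + h * (mu0 * x0 - mu1 * x1)

g1 = mu0 * (math.exp(-mu0 * T) - math.exp(-mu1 * T)) / (mu1 - mu0)
assert abs(x1 - g1) < 1e-3        # Euler error is O(h)
```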
8 LAPLACE TRANSFORMS

The Laplace transform (42) is often written as ℒx to emphasise that the function x̄(s) has been obtained from the function x(t) by the transformation ℒ. (This is quite distinct from the forward operator defined in Section 3.1; we use ℒ to denote Laplace transformation only in this section.) The transformation ℒ is linear, as is its inverse, which is written ℒ^{-1}. One of the key results is that the inversion has the explicit form
x(t) = ℒ^{-1}x̄ = (1/2πi) ∫_{u-i∞}^{u+i∞} e^{st} x̄(s) ds    (44)

where the real number u is taken large enough that all singularities of x̄ lie to the left of the axis of integration. (This choice of integration path yields x(t) = 0 for t < 0, which is what is effectively supposed in the formation (42) of the transform. It is analogous to the choice of integration path in (17) to exclude all singularities of G(z), if one wishes to determine the causal form of the filter.)

Inversion of a transform then often becomes an exercise in evaluation of residues at the various singularities of the integrand. The glossary of simple transform pairs in Table 4.1 covers all cases in which x̄(s) is rational and proper (a term defined in the next section). The reader may wish to verify the validity of both direct and inverse transforms. In all cases the assumption is that x(t) is zero for negative t.

Table 4.1

x(t)                x̄(s)
1                   s^{-1}
t^n/n!              s^{-n-1}
e^{-αt}             (s + α)^{-1}
t^n e^{-αt}/n!      (s + α)^{-n-1}
δ(t)                1
If s corresponds to the operation of differentiation then s^{-1} presumably corresponds to the operation of integration. This is indeed true (as we saw in Exercise 7.2), but the operation of integration is unique only to within an additive constant, to be determined from initial conditions. That initial conditions should have a persistent effect is an indication of instability.

A very useful result is the final value theorem: that if lim_{t→∞} x(t) exists, then so does lim_{s↓0} s x̄(s), and the two are equal. This is easily proved, but one should note that the converse holds only under regularity conditions. Note an implication: that if lim_{t→∞} 𝒟^j x(t) exists for a given positive integer j, then so does lim_{s↓0} s^{j+1} x̄(s), and the two are equal.
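A quick numerical illustration (a sketch, not from the book; the function and tolerances are invented) of one entry of Table 4.1 and of the final value theorem, taking x(t) = 1 - e^{-t}, whose transform by the table is x̄(s) = s^{-1} - (s + 1)^{-1}:

```python
import math

def xbar(s):
    # transform of x(t) = 1 - e^{-t}, read off Table 4.1
    return 1.0 / s - 1.0 / (s + 1.0)

# final value theorem: s*xbar(s) -> 1 as s -> 0, matching lim x(t) = 1
for s in (1e-2, 1e-4, 1e-6):
    assert abs(s * xbar(s) - 1.0) < 2 * s      # the error is of order s

# direct check of the transform at s = 0.7 by the composite trapezoid rule
def f(t, s=0.7):
    return math.exp(-s * t) * (1.0 - math.exp(-t))

h, T = 1e-3, 40.0
n = int(T / h)
total = sum(0.5 * h * (f(k * h) + f((k + 1) * h)) for k in range(n))
assert abs(total - xbar(0.7)) < 1e-5
```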
9 STABILITY OF CONTINUOUS-TIME FILTERS

Let us consider the SISO case to begin with, which again sets the general pattern. Lq-stability for a realisable filter requires that

∫_0^∞ |g(τ)|^q dτ < ∞.

L1-stability is then again equivalent to BIBO stability, and implies that G(s) is analytic for Re(s) ≥ 0. This condition certainly excludes the possibility that g could have a differentiated delta-function as component, i.e. that the filter would
actually differentiate the input at any lag. A bounded function need not have a differential at all, let alone a bounded one.

A filter for which |G(s)| remains bounded as |s| → ∞ is said to be proper. This excludes s^r behaviour of G(s) for any r > 0, and so excludes a differentiating action for the filter. If |G(s)| → 0 as |s| → ∞ then the filter is said to be strictly proper. In this case even delta-functions are excluded in g; i.e. the response must be smoothly distributed.

A rational transfer function G(s) is now one which is rational in s. As in the discrete-time case, this can be seen as the consequence of finite-order, finite-dimensional linear dynamics. The following theorem is the analogue of Theorem 4.2.3.
Theorem 4.9.1 Suppose that a causal filter has a response function G(s) which is rational and proper. Then the filter is stable if and only if all poles of G(s) have strictly negative real part (i.e. lie strictly in the left half of the complex plane).

Proof This is again analogous to the proof of Theorem 4.2.3. G(s) will have the expansion in partial fractions

G(s) = Σ_r c_r s^r + Σ_j Σ_k d_{jk} (s - s_j)^{-k-1},

where the ranges of summation are finite, j and k are nonnegative integers and the s_j are the nonzero poles of G(s). Negative powers s^{-r} cannot occur, since these would imply a component in the output consisting of integrated input, and the integral of a bounded function will not be bounded in general. Neither can positive powers occur, because of the condition that the filter be proper. The first sum thus reduces to a constant c_0. This corresponds to an instantaneous term c_0 d(t) in the filter output 𝒢d, which is plainly stable. The term in (s - s_j)^{-k-1} gives a term proportional to τ^k exp(s_j τ) in the filter response; this component is then stable if and only if Re(s_j) < 0. The condition of the theorem is thus sufficient, and necessity follows, as previously, from the linear independence of the components of filter response. □

The decay example of Exercise 7.2 illustrates these points. The transfer function G_j(s) for the output x_j had poles at the values -μ_k (k = 0, 1, ..., j). These are strictly negative for k < p, and so G_j determined a stable filter for j < p. The final filter had a response singularity at s = -μ_p = 0. This gave rise to an instability corresponding, as we saw, to the accumulation of matter in the final state.
Exercises and comments

(1) The second stock example is that of the hanging pendulum: a damped harmonic oscillator. Suppose that the bob of the pendulum has unit mass and that it is driven by a forcing term d. The linearised equation of motion (see Section 5.1) for the angle of displacement x of the pendulum is then A(𝒟)x = d where A(s) = s^2 + a_1 s + a_0. Here a_1 (nonnegative) represents damping and a_0 (positive) represents the restoring force due to gravity. If a_1 = 0 then the zeros of A(s) have the purely imaginary values ±i√a_0 (corresponding to an undamped oscillation of the free pendulum, with an amplitude determined by initial conditions). If 0 < a_1 < 2√a_0 then the zeros are complex with negative real part (corresponding to a damped oscillation of the free pendulum). If a_1 ≥ 2√a_0 then they are negative real (corresponding to a damped non-oscillatory motion of the free pendulum). The equivalent filter is thus stable or unstable according as the pendulum is damped or not.

A damped harmonic oscillator would also provide the simplest useful model of our car driving over a bumpy road, the output variable x being the vertical displacement of the car body. If the suspension is damped lightly enough that the car shows an oscillatory response near the natural frequency of vibration ω_0 = √a_0 then the response function A(s)^{-1} will be large in modulus at s = ±iω_0. This can be observed when one drives along an unsealed road which has developed regular transverse ridges (as can happen on a dry creek bed). There is a critical speed which must be avoided if the car is not to develop violent oscillations. The effect is enhanced by the fact that the ridges develop in response to such oscillations!
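The three damping regimes can be checked mechanically. A small sketch (not from the book; a_0 = 4 is an invented value, so that 2√a_0 = 4):

```python
import cmath

# Zeros of A(s) = s^2 + a1 s + a0 in the three damping regimes, a0 = 4.
def zeros(a1, a0=4.0):
    disc = cmath.sqrt(a1 * a1 - 4 * a0)
    return (-a1 + disc) / 2, (-a1 - disc) / 2

z1, z2 = zeros(0.0)
assert z1.real == 0 and z2.real == 0                   # undamped: purely imaginary

z1, z2 = zeros(1.0)
assert z1.real < 0 and z1.imag != 0                    # underdamped: stable oscillation

z1, z2 = zeros(5.0)
assert z1.imag == 0 and z1.real < 0 and z2.real < 0    # overdamped: negative real
```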
10 SYSTEMS STABILISED BY FEEDBACK

We shall from now on often specify a filter simply by its transfer function, so that we write G rather than 𝒢. In continuous time the understanding is then that G denotes G(𝒟) or G(s) according as one is considering the action of the filter in the time domain or the transform domain. Correspondingly, in discrete time it denotes G(𝒯) or G(z), as appropriate.

We are now in a position to resume the discussion of Section 1. There the physically given system (the plant) was augmented to the controlled system by addition of a feedback loop incorporating a controller. The total system thus consists of two filters in a loop, corresponding to plant and controller, and one seeks to choose the controller so as to achieve satisfactory performance of the whole system. Optimisation considerations, a consciousness that 'plant' constitutes a model for more than just the process under control (see Exercise 1.1) and a later concern for robustness to misspecification (see Chapter 17) lead one to modify the block diagram of Figure 2 somewhat, to that of Figure 4.

In this diagram u and y denote, as ever, the signals constituted by control and observations respectively. The signal ζ combines all primitive exogenous inputs to the system: e.g. plant noise, observation noise and command signals (or the noise that drives command signals if these are generated by a statistical model).
Figure 4 The block diagram corresponding to the canonical set of system equations (45). The plant G, understood in a wide sense, has as outputs the actual observations y and the vector of deviations Δ. It has as inputs the control u and the vector ζ of 'primitive' inputs to the system.
The signal Δ comprises all the 'deviations' which are penalised in the cost function. These would for example include the tracking error and those aspects of the control u itself which incur cost.

Here the plant G is understood as including all given aspects of the system. These certainly include the plant in the narrow sense (the process being controlled) but also the sensor system which provides the observations. They may also include subsidiary models used to predict, for example, sea and weather for the long-distance yachtsman of Section 1, or the future inflow to the reservoir of Section 2.9, or the command signal constituted by the position of a vehicle one is attempting to follow. The optimiser may be unable to exert any control upon these aspects, but he must regard them as part of the total given physical model.

As well as the control input u to this generalised plant one has the exogenous input ζ. This comprises all quantities which are primitive inputs to the system; i.e. exogenous to it and not explained by it. These include statistical noise variables (white noise, which no model can reduce) and also command sequences and the like which are known in advance (and so for which no model is needed). It may be thought that some of these inputs should enter the system at another point; e.g. that observation noise should enter just before the controller, and that a known command sequence should be a direct input to the controller. However, the simple formalism of Figure 4 covers all these cases. The input ζ is in general a vector input whose components feed into the plant at different ports. A command or noise signal destined for the controller can be routed through the plant, and either included in or superimposed upon the information stream y. As far as plant outputs are concerned, the deviation signal Δ will not be completely observable in general, but must be defined if one is to evaluate (and optimise) system performance.
If we assume time-invariant linear structure then the block diagram of Figure 4 is equivalent to a set of relations

Δ = G_11 ζ + G_12 u,
y = G_21 ζ + G_22 u,    (45)
u = Ky.

We can write this as an equation system determining the system variables in terms of the system inputs; the endogenous variables in terms of the exogenous variables:

[ I    0   -G_12 ] [ Δ ]   [ G_11 ζ ]
[ 0    I   -G_22 ] [ y ] = [ G_21 ζ ]    (46)
[ 0   -K     I   ] [ u ]   [    0   ]

By inverting this set of equations one determines the system transfer function, which specifies the transfer functions from all components of the system input ζ to the three system outputs: Δ, y and u. The classic demand is that the response of tracking error to command signal should be stable, but this may not be enough. One will in general require that all signals occurring in the system should be finite throughout their course.

Denote the first matrix of operators in (46) by M, so that it is M which must be inverted. The simplest demand would be that the solution of (46) should be determinate; i.e. that M(s) should not be singular identically in s. A stronger demand would be that the system transfer function thus determined should be proper, so that the controlled system does not effectively differentiate its inputs. A yet stronger demand is that of internal stability: that the system transfer function should be stable.

Suppose all the coefficient matrices in (46) are rational in s. Then the case which is most clear-cut is that in which the poles of all the individual transfer functions are exactly at the zeros of |M(s)|, i.e. at the zeros of |I - G_22(s)K(s)|. In such a case stability of any particular response (e.g. of error to command signal) would imply internal stability, and the necessary and sufficient condition for stability would be that |I - G_22(s)K(s)| should have all its zeros strictly in the left half-plane. In fact, it is only in quite special cases that this pattern fails. These cases are important, however, because performance deteriorates as one approaches them.

To illustrate the kind of thing that can happen, let us revert to the model (3) represented in Figure 3, which is indeed a special case of that which we are considering. The plant output y is required to follow the command signal w; both of these are observable and the controller works on their difference e. The only noise is process noise d, superimposed upon the control input.
Let us suppose for simplicity that all signals are scalar. Solution (5) then becomes

e = (1 + GK)^{-1}(Gd - w)    (47)
in the response function notation. Suppose that G(s) and K(s) are rational. Then the transfer function (1 + GK)^{-1} of e to w is rational and proper and its only poles are precisely at the zeros of 1 + G(s)K(s). It is then stable if and only if these zeros lie strictly in the left half-plane. The same will be true of the response function (1 + GK)^{-1}G of e to d if all unstable poles of G are also poles of GK. However, suppose that the plant response G has an unstable pole which is cancelled by a zero of the controller response K. Then this pole will persist in the transfer function (1 + GK)^{-1}G, which is consequently unstable.

To take the simplest numerical example, suppose that G = (s - 1)^{-1}, K = 1 - s^{-1}. Then the transfer functions

(1 + GK)^{-1} = s/(s + 1),    (1 + GK)^{-1}G = s/[(s + 1)(s - 1)]

are respectively stable and unstable. One can say in such a case that the controller is such that its output does not excite this unstable plant mode, so that the mode seems innocuous. The mode is there and ready to be excited, however, and the noise does just that. Moreover, the fact that the controller cannot excite the mode means that it is also unable to dampen it.

These points are largely taken care of automatically when the controller is chosen optimally. If certain signal amplitudes are penalised in the cost function, then those signals will be stabilised to a low value in the optimal design, if they can be. If inputs are such that they will excite an instability of the total system then such instabilities will be designed out, if they can be. If inputs are such that their differentials do not exist then the optimal system will be proper, if it can be. One may say that optimality enforces a degree of robustness in that, as far as physical constraints permit, it protects against any irregularity permitted in system input which is penalised in system output. Optimisation, like computer programming, is a very literal procedure. It supplies all the protection it can against contingencies which are envisaged, none at all against others.

Exercises and comments
(1) Application of the converse to the final value theorem (Section 8) can yield useful information about dynamic lags: the limiting values for large time of the tracking error e or its derivatives. Consider a scalar version of the simple system of Figure 3. If the loop transfer function has the form ks^{-N}(1 + o(s)) for small s, then k is said to be the effective gain of the loop and N its type number: the effective number of integrations achieved in passage around the loop. Consider a command signal w which is equal to t^n/n! for positive t and zero otherwise. Then w̄ = s^{-n-1}, and it follows from (47) and an application of the converse to the final value theorem (if applicable) that the limit of 𝒟^j e for large t is lim_{s↓0} O(s^{N-n+j}). It
thus follows that the limit offset in the jth differential of the output path y is zero, finite or infinite according as n is less than, equal to or greater than N + j. So, suppose that w is the position of a fleeing hare and y the position of a dog pursuing it. Then a zero offset for j = 1 and n = 2 would mean that, if the hare maintained a constant acceleration (!) then at least the difference in the velocities of the dog and the hare would tend to zero with time. It appears then that an increase in N improves offset. However, it also causes a decrease in stability, and N = 2 is regarded as a practical upper limit.
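The offset rule can be sketched numerically (not from the book; the idealised loop GK = k/s^N with N = 1, k = 2 and j = 0 is an invented assumption). The error transform in response to w̄ = s^{-n-1} is then ē(s) = s^N s^{-n-1}/(s^N + k), and the final value theorem suggests lim e = lim_{s↓0} s ē(s):

```python
# Limiting tracking error for the loop GK = k/s^N (type number N = 1, k = 2):
# s*ebar(s) = s^(N-n) / (s^N + k), evaluated at small s as a proxy for s -> 0.

def limit_error(n, N=1, k=2.0, s=1e-8):
    return s ** (N - n) / (s ** N + k)

assert limit_error(0) < 1e-7                  # n < N: zero offset
assert abs(limit_error(1) - 0.5) < 1e-7       # n = N: finite offset, here 1/k
assert limit_error(2) > 1e7                   # n > N: offset diverges
```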
CHAPTER 5
State-structured Deterministic Models

In the last chapter we considered deterministic models in the classic input-output formulation. In this we discuss models in the more explicit state-space formulation, specialising rather quickly to the time-homogeneous linear case. The advantage of the state-space formulation is that one has a physically explicit model whose dynamics and whose optimisation can both be treated by recursive methods, without assumption of stationarity. Concepts such as those of controllability and observability are certainly best developed first in this framework. The advantage of the input-output formulation is that one can work with a more condensed formulation of the model (in that there is no necessity to expand it to a state description) and that the transform techniques then available permit a powerful treatment of, in particular, the stationary case. We shall later move freely between the two formulations, as appropriate.

1 STATE STRUCTURE FOR THE UNCONTROLLED CASE: STABILITY; LINEARISATION

Let us set questions of control and observation to one side to begin with, and simply consider a dynamic system whose course is described by a process variable x. We have already introduced the notion of state structure for a discrete-time model in Section 2.1. The system has state structure if x obeys a simple recursion

x_t = a(x_{t-1}, t),    (1)

when x is termed the state variable. Dynamics are time-homogeneous if they are governed by time-independent rules, in which case (1) reduces to the form

x_t = a(x_{t-1}).    (2)

We have said nothing of the set of values within which x may vary. In the majority of practical cases x is numerical in value: we may suppose it a vector of finite dimension n. The most amenable models are those which are linear, and the assumption of linearity often has at least a local validity. A model which is state-structured, time-homogeneous and linear then necessarily has the form

x_t = Ax_{t-1} + b    (3)
where A is a square matrix and b an n-vector. If the equation (I - A)x̄ = b has a solution for x̄ (see Exercise 1) then we can normalise b to zero by working with a new variable x - x̄. If we assume this normalisation performed, then the model (3) reduces to

x_t = A x_{t-1}.     (4)

The model has by now been pared down considerably, but is still interesting enough to serve as a basis for elaboration in later sections to controlled and imperfectly observed versions.

We are now interested in the behaviour of the sequence x_t = A^t x_0 generated by (4). It obviously has an equilibrium point x = 0 (corresponding to the equilibrium point x = x̄ of (3)). This will be the unique equilibrium point if I - A is nonsingular, when the only solution of x = Ax is x = 0. Supposing this true, one may now ask whether this equilibrium is stable in that x_t → 0 with increasing t for any x_0.

Theorem 5.1.1 The equilibrium of system (4) at x = 0 is stable if and only if all eigenvalues of the matrix A have modulus strictly less than unity.

Proof Let λ be the eigenvalue of maximal modulus. Then there are sequences A^t x_0 which grow as λ^t, so |λ| < 1 is necessary for stability. On the other hand, no such sequence grows faster than t^{n-1} |λ|^t, so |λ| < 1 is also sufficient for stability. □
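The criterion of Theorem 5.1.1 is easy to check numerically. A minimal sketch; the matrix A below is an arbitrary illustrative choice, not one taken from the text:

```python
import numpy as np

# An example 2x2 system matrix (hypothetical values for illustration).
A = np.array([[0.5, 0.3],
              [0.1, 0.4]])

# Theorem 5.1.1: stability iff the spectral radius (maximal eigenvalue
# modulus) is strictly less than one.
spectral_radius = max(abs(np.linalg.eigvals(A)))
assert spectral_radius < 1

# Iterating x_t = A x_{t-1} should then drive any x_0 towards zero.
x = np.array([10.0, -7.0])
for _ in range(200):
    x = A @ x
assert np.linalg.norm(x) < 1e-6
```

The same check with an eigenvalue of modulus greater than one would show the iterates diverging for almost every starting point.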
A matrix A with this property is termed a stability matrix. More explicitly, it is termed a stability matrix 'in the discrete-time sense', since the corresponding property in continuous time differs somewhat. Note that if the equilibrium at zero is stable then it is necessarily unique; if it is not unique then it cannot be stable (Exercise 2).

Note that g_t = A^t is the transient response function of the system (4) to a driving input. The fact that stability implies exponential convergence of this response to zero also implies l_q-stability of the filter thus constituted, and so of the filter of Exercise 4.4.1. The stability criterion deduced there, that C(I - Az)^{-1}B should have all its singularities strictly outside the unit circle, is implied by that of Theorem 5.1.1.

All this material has a direct analogue in the continuous-time case, at least for the case of vector x (to which we are virtually forced; see the discussion of Exercise 2.1.1). The analogue of (2), a state-structured time-homogeneous model, is
ẋ = a(x).     (5)

(For economy of notation we use the same notation a(x) as in (2), but the functions in the two cases are quite distinct.) The normalised linear version of this model, corresponding to (4), is
ẋ = Ax.     (6)

The analogue of the formal solution x_t = A^t x_0 of (4) is the solution

x(t) = e^{tA} x(0) := Σ_{j=0}^∞ [(tA)^j / j!] x(0)     (7)

of (6). The stability criterion is also analogous.
Theorem 5.1.2 The equilibrium of system (6) at x = 0 is stable if and only if all eigenvalues of the matrix A have real part strictly less than zero.
Proof Let σ be the eigenvalue of maximal real part. Then there are functions x(t) = e^{tA} x(0) which grow as e^{σt}, so Re(σ) < 0 is necessary for stability. On the other hand, no such function grows faster than t^{n-1} e^{Re(σ)t}, so Re(σ) < 0 is also sufficient for stability. □

A matrix A with this property is a stability matrix in the continuous-time sense.

If a(x) is nonlinear then there may be several solutions of a(x) = 0, and so several possible equilibrium points. Recall the definitions of Section 1.2: that the domain of attraction of an equilibrium point is the set of initial values from which the path would lead to that point, and that the equilibrium is locally stable if its domain of attraction includes a neighbourhood of the point. Henceforth we shall take 'stability' as meaning simply 'local stability'. For nonlinear models the equilibrium points are usually separated (which is not possible in the linear case; see Exercise 2) and so one or more of them can be stable.

Suppose that x̄ is such an equilibrium point, and define the deviation Δ(t) = x(t) - x̄ from the equilibrium value. If a(x) possesses a matrix a_x of first derivatives which is continuous in the neighbourhood of x̄ and has value A at x̄ then equation (5) becomes

Δ̇ = AΔ     (8)

to within a term o(Δ) in the neighbourhood of x̄. The state variable x will indeed remain in the neighbourhood of x̄ if A is a stability matrix, and it is by thus testing A = a_x(x̄) that one determines whether or not x̄ is locally stable. The passage from (5) to (8) is termed a linearisation of the model in the neighbourhood of x̄, for obvious reasons, and the technique of linearisation is indeed an invaluable tool for the study of local behaviour. However, one should be aware that nonlinear systems such as (2) and (5) can show limiting behaviour much more complicated than that of passage to a static equilibrium: e.g. limit cycles or chaotic behaviour. Either of these would represent something of a failure in most control contexts, however, and it is reasonable to expect that optimisation will exclude them for all but the most exotic of examples.
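The linearisation test can be carried out entirely numerically: estimate the Jacobian a_x at an equilibrium by finite differences and apply Theorem 5.1.2. A sketch, using a damped oscillator as an illustrative nonlinear system (the dynamics and the damping coefficient 0.1 are assumptions for the example, not taken from the text):

```python
import numpy as np

# Continuous-time dynamics xdot = a(x): a damped pendulum-like system
# in state form (displacement, velocity); damping 0.1 is illustrative.
def a(x):
    return np.array([x[1], -np.sin(x[0]) - 0.1 * x[1]])

def jacobian(f, x_bar, eps=1e-6):
    """Linearise f about x_bar by central differences: A = f_x(x_bar)."""
    n = len(x_bar)
    A = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        A[:, j] = (f(x_bar + e) - f(x_bar - e)) / (2 * eps)
    return A

x_bar = np.zeros(2)
assert np.allclose(a(x_bar), 0)        # x = 0 is indeed an equilibrium
A = jacobian(a, x_bar)                 # linearisation at the equilibrium

# Theorem 5.1.2: locally stable iff all eigenvalues have Re < 0.
assert max(np.linalg.eigvals(A).real) < 0
```

The same routine applied at an unstable equilibrium would produce an eigenvalue with positive real part.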
We have already seen an example of multiple equilibria in the harvesting example of Section 1.2. If the harvest rate was less than the maximal net reproduction rate then there were two equilibria; one stable and the other unstable.

The stock example is of course the pendulum; in its linearised form the archetypal harmonic oscillator. If we suppose the pendulum undamped then the equation of motion for the angle α of displacement from the hanging position is

α̈ + ω² sin α = 0,     (9)

where ω^{-2} is proportional to the effective length of the pendulum. There are two static equilibrium positions: α = 0 (the hanging position) and α = π (the inverted position).
Let us bring the model to state form and linearise it simultaneously, by defining Δ as the vector whose elements are the deviations of α and α̇ from their equilibrium values, and then retaining only first-order terms in Δ. The matrix A for the linearised system is then

A = [ 0    1 ]
    [ ±ω²  0 ],

where the + and - options refer to the inverted and hanging equilibria respectively. We find that A has eigenvalues ±ω in the inverted position, so this is certainly unstable. In the hanging position the eigenvalues are ±iω, so this is also unstable, but only just: the amplitude of the oscillation about equilibrium remains constant. Of course, one can calculate these eigenvalues simply by determining which values a are consistent with a solution α(t) = e^{at} of the linearised version of equation (9). However, we shall tend to discuss models in their state-reduced forms. Discrete-time models can equally well be linearised; we leave details to the reader. We shall develop some examples of greater novelty in the next section, when we consider controlled processes.

Exercises and comments
(1) This exercise and the next refer to the discrete-time model (3). If (I - A)x̄ = b has no solution for x̄ then a finite equilibrium value certainly does not exist. It follows also that I - A must be singular, so that A has an eigenvalue λ = 1.

(2) If (I - A)x̄ = b has more than one solution then, again, I - A is singular. Furthermore, any linear combination of these solutions with scalar coefficients (i.e. any point in the smallest linear manifold M containing these points) is a solution, and a possible equilibrium. There is neutral equilibrium between points of M in that, once x_t is in M, there is no further motion.
(3) Suppose that the component x_{jt} of the vector x_t represents the number (assumed continuous-valued) of individuals of age j in a population at time t, and that the x_{jt} satisfy the dynamic equations

x_{0t} = Σ_{j=0}^∞ a_j x_{j,t-1},     x_{jt} = b_{j-1} x_{j-1,t-1}     (j > 0).

The interpretation is that a_j and b_j are respectively the reproduction and survival rates at age j. One may assume that b_j = 0 for some finite j if one wishes the dimension of the vector x to be finite. Show that the equilibrium at x = 0 is stable (i.e. the population becomes extinct in the course of time) if all roots λ of

Σ_{j=0}^∞ b_0 b_1 ··· b_{j-1} a_j λ^{-j-1} = 1

are strictly less than unity in modulus. Show that the root of greatest modulus is the unique positive real root.

(4) A pattern observed in many applications is that the recursion (2) holds for a scalar x with the function a(x) having a sigmoid form: e.g. x²/(1 + x)² (x ≥ 0). l 0 and Jg > a are thus necessary and sufficient for stability of this equilibrium.
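A numerical check of the criterion in Exercise 3, for a population truncated to three age classes; the reproduction and survival rates below are illustrative values, not taken from the text:

```python
import numpy as np

# Three-age-class Leslie matrix: reproduction rates a_j in the first row,
# survival rates b_j on the subdiagonal (illustrative values).
a = [0.0, 0.8, 0.4]
b = [0.5, 0.3]
A = np.array([[a[0], a[1], a[2]],
              [b[0], 0.0,  0.0],
              [0.0,  b[1], 0.0]])

rho = max(abs(np.linalg.eigvals(A)))   # root of greatest modulus

# The root condition of the exercise: at lambda = rho the left side of
# sum_j b_0...b_{j-1} a_j lambda^{-j-1} = 1 should equal one, since this
# is just the characteristic equation of A rearranged.
lhs = a[0] / rho + b[0] * a[1] / rho**2 + b[0] * b[1] * a[2] / rho**3
assert np.isclose(lhs, 1.0)

# All roots inside the unit circle: the population becomes extinct.
assert rho < 1
```

Raising the reproduction rates until rho exceeds one reverses the conclusion: the population then grows geometrically.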
(6) The eigenvalues and eigenvectors of A are important in determining the 'modes' of the system (4) or (6). Consider the continuous-time case (6) for definiteness, and suppose that A has the full spectral representation A = H^{-1}ΛH, where Λ is the diagonal matrix of eigenvalues λ_j, and the columns of H^{-1} (rows of H) are the corresponding right (left) eigenvectors. Then, by adoption of a new state vector x̃ = Hx, one can write the vector equation (6) as the n decoupled scalar equations dx̃_j/dt = λ_j x̃_j, corresponding to the n decoupled modes of variation. An oscillatory mode will correspond to a pair of complex conjugate eigenvalues.

(7) The typical case for which a complete diagonalisation cannot be achieved is that in which A takes the form
A = [ λ  μ ]
    [ 0  λ ]

for nonzero μ. One can imagine that two population groups both reproduce at net rate λ, but that group 2 also generates members of group 1 at rate μ. There is a double eigenvalue of A at λ, but

A^t = λ^{t-1} [ λ  tμ ],          e^{At} = e^{λt} [ 1  μt ].
              [ 0   λ ]                           [ 0   1 ]
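The formula for A^t is easy to verify numerically, and makes the polynomial growth of the off-diagonal entry explicit (the values of λ and μ below are illustrative assumptions):

```python
import numpy as np

lam, mu = 0.9, 1.0                    # illustrative values
A = np.array([[lam, mu],
              [0.0, lam]])

# A^t = lam^(t-1) * [[lam, t*mu], [0, lam]]: the off-diagonal entry grows
# like t * lam^t rather than lam^t, the signature of a repeated eigenvalue.
t = 7
At = np.linalg.matrix_power(A, t)
expected = lam**(t - 1) * np.array([[lam, t * mu],
                                    [0.0, lam]])
assert np.allclose(At, expected)
```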
One can regard this as a situation in which a mode of transient response e^{λt} (in continuous time) is driven by a signal of the same type; the effect is to produce an output proportional to t e^{λt}. If there are n consecutive such stages of driving then the response at the last stage is proportional to t^n e^{λt}. In the case when λ is purely imaginary (iω, say) this corresponds to the familiar phenomenon of resonance of response to input of the same frequency ω. The effect of resonance is that output amplitude increases indefinitely with time until other effects (nonlinearity, or slight damping) take over.

2 CONTROL, IMPERFECT OBSERVATION AND STATE STRUCTURE
We saw in Section 2.1 that achievement of 'state structure' for the optimisation of a controlled process implied conditions upon both dynamics and cost function. However, in this chapter we consider dynamics alone, and the controlled analogue of the state-structured dynamic relation (1) would seem to be

x_t = a(x_{t-1}, u_{t-1}, t),     (10)
which is indeed the relation assumed previously.

Control can be based only upon what is currently observable, and it may well be that current state is not fully observable. Consider, for example, the task of an anaesthetist who is trying to hold a patient in a condition of light anaesthesia. The patient's body is a dynamical system, and so its 'physiological state' exists in principle, but is far too complex to be specifiable, let alone observable. The anaesthetist must then do as best he can on the basis of relatively crude indicators of state: e.g. appearance, pulse and breathing.

In general we shall assume that the new observation available at time t is of the form

y_t = c(x_{t-1}, u_{t-1}, t).     (11)
So, if the new information consisted of several numerical observations, then y_t would be a vector. Note that y_t is regarded as being an observation on immediate past state x_{t-1} rather than on current state x_t. This turns out to be the formally natural convention, although it can certainly be modified. It is assumed that the past control u_{t-1} is known; one remembers past actions taken. Relation (11) thus
represents an imperfect observation on x_{t-1}, whose nature is perhaps affected both by the value chosen for u_{t-1} and by time. Information is cumulative; all past observations are supposed in principle to be available.

In the time-homogeneous linear case x, u and y are vectors and relations (10) and (11) reduce to

x_t = A x_{t-1} + B u_{t-1},     (12)
y_t = C x_{t-1}.     (13)
Formal completeness would demand the inclusion of a term D u_{t-1} in the right-hand member of (13). However, this term is known in value, and can just as well be subtracted out. Control does not affect the nature of the information gained in this linear case. System (12), (13) is often referred to as the system [A, B, C], since it is specified by these three matrices. The dimension of the system is n, the dimension of x. The linear system is relatively tractable, which explains much of its popularity. However, for all its relative simplicity, the [A, B, C] system generates a theory which as yet shows no signs of completion.

Once a particular control rule has been chosen then one is back in the situation of the last section. Suppose, for example, that current state is in fact observable, and that one chooses a control rule of the form u_t = K x_t. The controlled plant equation for system (12) then becomes x_t = (A + BK) x_{t-1},
whose solution will converge to zero if A + BK is a stability matrix.

The continuous-time analogue of relations (10), (11) is

ẋ = a(x, u, t),     y = c(x, u, t),     (14)

and of (12), (13)

ẋ = Ax + Bu,     y = Cx + Du,     (15)
with D usually normalised to zero. Note that, while the plant equation of (14) or (15) now becomes a first-order differential equation, the observation relation becomes an instantaneous relation, non-differential in form. This turns out to be the natural structure to adopt on the whole, although it can also be natural to recast the observation relation in differential form; see Chapter 25. In continuous time the system (15) with D normalised to zero is also referred to as the system [A, B, C].
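The state-feedback stabilisation mentioned above for the discrete-time system (12) — choosing K so that A + BK is a stability matrix — can be illustrated numerically. Both the plant matrices and the gain K below are hypothetical choices, not taken from the text:

```python
import numpy as np

# An unstable discrete-time plant x_t = A x_{t-1} + B u_{t-1}
# (illustrative numbers).
A = np.array([[1.2, 0.5],
              [0.0, 0.8]])
B = np.array([[1.0],
              [0.5]])

# A hypothetical feedback gain K for the rule u_t = K x_t.
K = np.array([[-0.9, -0.6]])

closed_loop = A + B @ K
# Open loop has an eigenvalue of modulus > 1, closed loop does not.
assert max(abs(np.linalg.eigvals(A))) > 1
assert max(abs(np.linalg.eigvals(closed_loop))) < 1
```

How such a gain can be found systematically (and when one exists at all) is exactly the stabilisability question taken up in Section 4.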
One sometimes derives a linear system (15) by the linearisation of a time-homogeneous nonlinear system in the neighbourhood of a stable equilibrium point of the controlled system. Suppose that state and control values fluctuate about constant values x̄ and ū, so that y fluctuates about ȳ = c(x̄, ū). Defining the transformed variables x̃ = x - x̄, ũ = u - ū and ỹ = y - ȳ, we obtain the
system (15) in these transformed variables as a linearised version of the system (14) with the identifications

A = a_x,     B = a_u,     C = c_x,     D = c_u.

Here the derivatives are evaluated at (x̄, ū), and must be supposed continuous in a neighbourhood of this point. The approximation remains valid only as long as x and u stay in this neighbourhood, which implies either that (x̄, ū) is a stable equilibrium of the controlled system or that one is considering only a short time span. Subject to these latter considerations, one can linearise even about a time-variable orbit, as we saw in Section 2.9.

Exercises and comments

(1) Non-uniqueness of the state variable. If relations (12), (13) are regarded as just a way of realising a transfer function C(I - Az)^{-1} B z² from u to y then this realisation is far from unique. By considering a new state variable Hx (for square nonsingular H) one sees that the system [HAH^{-1}, HB, CH^{-1}] realises the same transfer function as does [A, B, C].

(2) A satellite in a planar orbit. Let (r, θ) be the polar coordinates of a particle of unit mass (the satellite) moving in a plane and gravitationally attracted to the origin (where the centre of mass of the Earth is supposed situated). The Newtonian equations of motion are then
r̈ = r θ̇² - γ r^{-2} + u_r,     θ̈ = -2 ṙ θ̇ / r + r^{-1} u_θ,

where γ r^{-2} represents the gravitational force and u_r and u_θ are the radial and tangential components of a control force applied to the satellite. A possible equilibrium orbit under zero control forces is the circle of radius r = ρ, when the angular velocity must be θ̇ = ω = √(γ ρ^{-3}). Suppose that small control forces are applied; define x as the 4-vector whose components are the deviations of r, ṙ, θ and θ̇ from their values on the circular orbit and u as the 2-vector whose elements are u_r and u_θ. Show that for the linearised version (15) of the dynamic equations one has
A = [ 0       1     0   0   ]          B = [ 0   0   ]
    [ 3ω²     0     0   2ωρ ],             [ 1   0   ]
    [ 0       0     0   1   ]              [ 0   0   ]
    [ 0    -2ω/ρ    0   0   ]              [ 0   1/ρ ].
Note that the matrix A has eigenvalues 0, 0 and ±iω. The zero eigenvalues correspond to the neutral stability of the orbit, which is one of a continuous family of ellipses. The others correspond to the periodic motion of frequency ω.
In deriving the dynamic equations for the following standard examples we appeal to the Lagrangian formalism for Newtonian dynamics. Suppose that the system is described by a vector q of position coordinates q_j, that the potential and kinetic energies of the configuration are functions V(q) and T(q, q̇) and that an external force u with components u_j is applied. Then the dynamic equations can be written

d/dt (∂T/∂q̇_j) - ∂T/∂q_j + ∂V/∂q_j = u_j     (j = 1, 2, ...).

(3) The cart and the pendulum. This is the celebrated control problem formulated in Section 2.8: the stabilisation of an inverted pendulum mounted on a cart by the exertion of a horizontal control force on the cart. In the notation of that section we have the expressions

V = mgL cos α,     T = ½Mq̇² + ½m[(q̇ + Lα̇ cos α)² + (Lα̇ sin α)²].
Show that the equations of motion, linearised for small α, are

(M + m)q̈ + mLα̈ = u,     q̈ + Lα̈ = gα,
and hence derive expressions (2.60) for the matrices A and B of the linearised state equations. Show that the eigenvalues of A are 0, 0 and ±√(g(1 + m/M)/L). The zero eigenvalues correspond to the 'mode' in which the whole system moves at a constant (and arbitrary) horizontal velocity. The positive eigenvalue of course corresponds to the instability of the upright pendulum.

(4) A popular class of controllers is provided by the PID (proportional, integral, differential) controllers, for which u is a linear function of current values of tracking error, its time-integral and its rate of change. Consider the equation for the controlled pendulum, linearised about the hanging position: α̈ + ω²α = u. Suppose one wishes to stabilise this to rest, so that α is itself the error. Note that a purely proportional control will never stabilise it. The LQ-optimal control of Section 2.8 would be linear in α and α̇, and so would be of the PD form. LQ-optimisation will produce something like an integral term in the control only if there is observation error; see Chapter 12.
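The claim in Exercise 4 — that proportional control alone never stabilises α̈ + ω²α = u, while a PD law does — can be checked by inspecting the closed-loop eigenvalues. A sketch; the frequency ω and the gains are illustrative values:

```python
import numpy as np

omega = 1.5                            # illustrative pendulum frequency

def closed_loop_eigs(kp, kd):
    """Eigenvalues of alphaddot + omega^2 alpha = u under the PD law
    u = -kp*alpha - kd*alphadot, written in state form (alpha, alphadot)."""
    A = np.array([[0.0, 1.0],
                  [-(omega**2 + kp), -kd]])
    return np.linalg.eigvals(A)

# Purely proportional control (kd = 0): the eigenvalues stay on the
# imaginary axis, so the oscillation is never damped out.
assert np.allclose(closed_loop_eigs(kp=1.0, kd=0.0).real, 0)

# Adding a derivative term moves both eigenvalues into the left half-plane.
assert max(closed_loop_eigs(kp=1.0, kd=0.5).real) < 0
```

Proportional gain alone only changes the oscillation frequency; it is the derivative term that supplies the damping.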
3 THE CAYLEY-HAMILTON THEOREM

A deeper study of the system [A, B, C] takes one into the byways of linear algebra. We shall manage with a knowledge of the standard elementary properties of matrices. However, there is one result which should be formalised.

Theorem 5.3.1 Let A be an n × n matrix. Then the first n powers I, A, A², ..., A^{n-1} constitute a basis for all the powers A^r of A, in that scalar coefficients c_{rj} exist such that
A^r = Σ_{j=0}^{n-1} c_{rj} A^j     (r = 0, 1, 2, ...).     (16)

It is important that the coefficients are scalar, so that each element of A^r has the same representation in terms of the corresponding elements of I, A, ..., A^{n-1}.
Proof Define the generating function

Φ(z) = Σ_{j=0}^∞ (Az)^j = (I - Az)^{-1},

where z is a scalar; this series will be convergent if |z| is smaller than the reciprocal of the maximal eigenvalue modulus of A. Writing the inverse as the adjugate divided by the determinant we have then

|I - Az| Φ(z) = adj(I - Az).     (17)

Now |I - Az| is a polynomial with scalar coefficients a_j:

|I - Az| = Σ_{j=0}^n a_j z^j,

say, and the elements of adj(I - Az) are polynomials in z of degree less than n. Evaluating the coefficient of z^r on both sides of (17) we thus deduce that

Σ_{j=0}^n a_j A^{r-j} = 0     (r ≥ n).     (18)
Relation (18) constitutes a recursion for the powers of A with scalar coefficients a_j. It can be solved for r ≥ n in the form (16). □

The Cayley-Hamilton theorem asserts simply relation (18) for r = n, but this has the extended relation (18) and Theorem 5.3.1 as immediate consequences. It is sometimes expressed verbally as 'a square matrix obeys its own characteristic equation', the characteristic equation being the equation Σ_{j=0}^n a_j λ^{n-j} = 0 for the eigenvalues λ.

Exercises and comments

(1) Define the nth degree polynomial P(z) = |I - Az| = Σ_{j=0}^n a_j z^j. If we have a discrete-time system x_t = A x_{t-1} then Theorem 5.3.1 implies that any scalar linear function ξ_t = c^T x_t of the state variable satisfies the equation P(𝒯)ξ_t = 0, where 𝒯 is the backward translation operator. That is, a first-order n-vector system has been reduced to an nth-order scalar system.
One can reverse this manipulation. Suppose that one has a model for which the process variable ξ is a scalar obeying the plant equation P(𝒯)ξ_t = b u_{t-1} (*) with a_0 = 1. Show that the column vector x_t with elements (ξ_t, ξ_{t-1}, ..., ξ_{t-n+1}) is a state variable with plant equation x_t = A x_{t-1} + B u_{t-1}, where

A = [ -a_1  -a_2  ...  -a_{n-1}  -a_n ]          B = [ b ]
    [  1     0    ...    0        0   ]              [ 0 ]
    [  0     1    ...    0        0   ],             [ . ]
    [  .     .           .        .   ]              [ . ]
    [  0     0    ...    1        0   ]              [ 0 ].

The matrix A is often termed the companion matrix of the nth-order system (*).

(2) Consider the continuous-time analogue of Exercise 1. If x obeys ẋ = Ax then it follows as above that the scalar ξ obeys the nth-order differential equation

P(𝒟)ξ = 0.

Reverse this argument to obtain a companion form (i.e. state-reduced form) of the equation P(𝒟)ξ = bu. Note that this equation must be regarded as expressing the highest-order differential of ξ in terms of lower-order differentials, whereas the discrete-time relation (*) expresses the least-lagged variable ξ_t in terms of lagged variables.
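The Cayley-Hamilton theorem itself is straightforward to verify numerically for a particular matrix; the random 4 × 4 matrix below is simply an arbitrary test case:

```python
import numpy as np

# A random 4x4 matrix (seeded for reproducibility).
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
n = A.shape[0]

# Coefficients of the characteristic polynomial det(lambda*I - A),
# highest power first; np.poly computes them from the eigenvalues.
coeffs = np.poly(A)

# Cayley-Hamilton: the matrix satisfies its own characteristic equation,
# sum_j a_j A^(n-j) = 0.
result = sum(c * np.linalg.matrix_power(A, n - j)
             for j, c in enumerate(coeffs))
assert np.allclose(result, np.zeros((n, n)), atol=1e-8)
```

By Theorem 5.3.1 the same relation lets one express any higher power A^r as a combination of I, A, ..., A^{n-1} with scalar coefficients.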
4 CONTROLLABILITY (DISCRETE TIME)

The twin concepts of controllability and observability concern the respective questions of whether control bites deeply enough that one can bring the state variable to a specified value and whether observations are revealing enough that one can indeed determine the value of the state variable from them. We shall consider these concepts only in the case of the time-homogeneous linear system (12), (13), when they must mirror properties of the three matrices A, B and C.

The system is termed r-controllable if, given an arbitrary value of x_0, one can choose control values u_0, u_1, ..., u_{r-1} such that x_r has an arbitrary prescribed value. For example, if m = n and B is nonsingular then the system is 1-controllable, because one can move from any value of x_0 to any value of x_1 by choosing u_0 = B^{-1}(x_1 - Ax_0). As a second example, consider the system

x_t = [ a_11  0    ] x_{t-1} + [ 1 ] u_{t-1}
      [ a_21  a_22 ]           [ 0 ]

for which m = 1, n = 2. It is never 1-controllable, because u cannot affect the second component of x in a single step. It is uncontrollable if a_21 = 0, because u can then never affect this second component. It is 2-controllable if a_21 ≠ 0, because
x_2 - A²x_0 = Bu_1 + ABu_0 = [ 1  a_11 ] [ u_1 ],
                             [ 0  a_21 ] [ u_0 ]
and this has a solution for u_0, u_1 if a_21 ≠ 0. This argument generalises.

Theorem 5.4.1 The n-dimensional system [A, B, ·] is r-controllable if and only if the matrix

T_r^con = [B, AB, A²B, ..., A^{r-1}B]     (19)

has rank n.
We write the system as [A, B, ·] rather than as [A, B, C] since the matrix C evidently has no bearing on the question of controllability. The matrix (19) is written in a partitioned form; it has n rows and mr columns. The notation T_r^con is clumsy, but short-lived and motivated by the limitations of the alphabet; read it as 'test matrix of size r for the control context'.

Proof If we solve the plant equation (12) for x_r in terms of the initial value x_0 and subsequent control values we obtain the relation
x_r - A^r x_0 = Σ_{τ=0}^{r-1} A^{r-τ-1} B u_τ.     (20)

The question is whether this set of linear equations in u_0, u_1, ..., u_{r-1} has a solution, whatever the value of the n-vector x_r - A^r x_0. Such a solution will always exist if and only if the coefficient matrix of the u-variables has rank n. This matrix is just T_r^con, whence the theorem follows. □

If equation (20) has a solution at all then in general it has many. We shall find a way of determining 'good' solutions in Theorem 5.4.3. Meanwhile, the Cayley-Hamilton theorem has an important consequence.

Theorem 5.4.2 If a system of dimension n is r-controllable, then it is also s-controllable for s ≥ min(n, r).

Proof The rank of T_r^con is non-decreasing in r, so r-controllability certainly implies s-controllability for s ≥ r. However, Theorem 5.3.1 implies that the rank of T_r^con is constant for r ≥ n, because it implies that the columns of T_r^con are then linear combinations (with scalar coefficients) of the columns of T_n^con. The system is thus n-controllable if it is r-controllable for any r, and we deduce the complete assertion. □
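The rank test of Theorem 5.4.1 is mechanical to apply. A sketch, using the 2 × 2 example above with illustrative numerical entries (a_21 ≠ 0):

```python
import numpy as np

def controllability_matrix(A, B, r=None):
    """Test matrix [B, AB, ..., A^(r-1)B] of Theorem 5.4.1 (r defaults to n)."""
    n = A.shape[0]
    r = n if r is None else r
    blocks = [B]
    for _ in range(r - 1):
        blocks.append(A @ blocks[-1])
    return np.hstack(blocks)

# The lower-triangular example: control enters the first component only.
a11, a21, a22 = 0.5, 0.3, 0.7          # illustrative values, a21 != 0
A = np.array([[a11, 0.0],
              [a21, a22]])
B = np.array([[1.0],
              [0.0]])

assert np.linalg.matrix_rank(controllability_matrix(A, B, r=1)) == 1  # not 1-controllable
assert np.linalg.matrix_rank(controllability_matrix(A, B, r=2)) == 2  # 2-controllable

# With a21 = 0 the second component can never be influenced at all.
A0 = np.array([[a11, 0.0],
               [0.0, a22]])
assert np.linalg.matrix_rank(controllability_matrix(A0, B)) == 1
```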
If the system is n-controllable then it is simply termed controllable. This is a reasonable convention, since Theorem 5.4.2 then implies that the system is controllable if it is r-controllable for any r, and that it is r-controllable for r ≥ n if it is controllable.

One should distinguish between controllability, which implies that one can bring the state vector to a prescribed value in at most n steps, and the weaker property of stabilisability, which requires only that a matrix K can be found such that A + BK is a stability matrix, and so that the policy u_t = K x_t will stabilise the equilibrium at x = 0. It will be proved incidentally in Section 6.1 that, if the process can be stabilised in any way, then it can be stabilised in this way; also that controllability implies stabilisability. That stabilisability does not imply controllability follows from the case in which A is a stability matrix and B is zero. This is uncontrolled, and so not controllable, but stable, and so stabilisable. Note, however, that stabilisability does not imply the existence of a control which stabilises the process to an arbitrary prescribed equilibrium point; see Sections 6.2 and 6.6.

Finally, the notion of finding a u-solution to the equation system (20) can be made more definite if we require that the transfer from x_0 to x_r be achieved optimally, in some sense.
Theorem 5.4.3
(i) r-controllability is equivalent to the demand that the matrix

G_r^con = Σ_{j=0}^{r-1} A^j B Q^{-1} B^T (A^T)^j     (21)

should be positive definite. Here Q is a prescribed positive definite matrix.
(ii) If the process is r-controllable then the control which achieves the passage from prescribed x_0 to prescribed x_r with minimal control cost ½ Σ_{τ=0}^{r-1} u_τ^T Q u_τ is

u_τ = Q^{-1} B^T (A^T)^{r-τ-1} (G_r^con)^{-1} (x_r - A^r x_0)     (0 ≤ τ < r).     (22)

Proof Let us take assertion (ii) first. We seek to minimise the control cost subject to the constraint (20) on the controls (if indeed this constraint can be satisfied, i.e. if controls exist which will effect the transfer). Free minimisation of the Lagrangian form

½ Σ_{τ=0}^{r-1} u_τ^T Q u_τ + λ^T (x_r - A^r x_0 - Σ_{τ=0}^{r-1} A^{r-τ-1} B u_τ)

yields the control evaluation

u_τ = Q^{-1} B^T (A^T)^{r-τ-1} λ

in terms of the Lagrange multiplier λ. Evaluation of λ by substitution of this solution back into the constraint (20) yields the asserted control (22).
However, we see that (22) provides a control rule which is acceptable for general x_0 and x_r if and only if the matrix G_r^con is nonsingular. The requirement of nonsingularity of G_r^con must then be equivalent to that of controllability, and so to the rank condition of Theorem 5.4.1. This is a sufficient proof of assertion (i), but we can give an explicit argument. Suppose G_r^con singular. Then there exists an n-vector w such that w^T G_r^con = 0, and so w^T G_r^con w = 0, or

Σ_{j=0}^{r-1} (w^T A^j B) Q^{-1} (w^T A^j B)^T = 0.

But the terms in this sum are individually non-negative, so the sum can be zero only if the terms are individually zero, which implies in turn that w^T A^j B = 0 (j = 0, 1, ..., r-1). That is, w^T T_r^con = 0, so that T_r^con is of less than full row-rank, n. This chain of implications is easily reversed, demonstrating the equivalence of the two conditions: T_r^con is of rank n if and only if G_r^con is, i.e. if and only if the non-negative definite matrix G_r^con is in fact positive definite. □

The matrix G_r^con is known as the control Gramian. At least, this is the name given in the particular case Q = I and r = n. As the proof will have made clear, the choice of Q does not affect the definiteness properties of the Gramian, as long as Q is itself positive definite. Consideration of general Q has the advantage that we relate the controllability problem back to the optimal regulation problem of Section 2.4. We shall give some continuous-time examples in the exercises of the next section, for some of which the reader will see obvious discrete-time analogues.
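Formulas (21) and (22) can be exercised directly: build the Gramian, solve for the multiplier, apply the resulting controls, and confirm that the plant lands exactly on the prescribed terminal state. The plant, horizon and target values below are illustrative assumptions:

```python
import numpy as np

# Plant, horizon, weighting and prescribed endpoints (illustrative values).
A = np.array([[0.9, 0.2],
              [0.0, 0.8]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(1)
r = 2
x0 = np.array([1.0, -1.0])
xr = np.array([2.0, 3.0])

Qinv = np.linalg.inv(Q)
# Control Gramian (21): sum_j A^j B Q^{-1} B^T (A^T)^j, j = 0..r-1.
G = sum(np.linalg.matrix_power(A, j) @ B @ Qinv @ B.T @ np.linalg.matrix_power(A.T, j)
        for j in range(r))
lam = np.linalg.solve(G, xr - np.linalg.matrix_power(A, r) @ x0)

# Optimal controls (22): u_tau = Q^{-1} B^T (A^T)^{r-tau-1} lambda.
u = [Qinv @ B.T @ np.linalg.matrix_power(A.T, r - tau - 1) @ lam
     for tau in range(r)]

# Running the plant with these controls must land exactly on xr.
x = x0
for tau in range(r):
    x = A @ x + B @ u[tau]
assert np.allclose(x, xr)
```

Among all control sequences effecting the transfer, this one minimises the quadratic control cost; any other feasible sequence costs at least as much.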
5 CONTROLLABILITY (CONTINUOUS TIME)

Controllability considerations in continuous time are closely analogous to those in discrete time, but there are also special features. The system is controllable if, for a given t > 0, one can find a control {u(τ); 0 ≤ τ < t} which takes the state value from an arbitrary prescribed initial value x(0) to an arbitrary prescribed terminal value x(t). The value of t is immaterial to the extent that, if the system is controllable for one value of t, then it is controllable for any other. However, the smaller the value of t, and so the shorter the interval of time in which the transfer must be completed, the more vigorous must be the control actions. Indeed, in the limit t ↓ 0 infinite values of u will generally be required, corresponding to the application of impulses or differentials of impulses. This makes clear that the concept of r-controllability does not carry over to continuous time, and also that some thought must be given to the class of controls regarded as admissible.
It follows from the Cayley-Hamilton theorem and the relation

e^{At} = Σ_{j=0}^∞ (At)^j / j!,     (23)

implicit in (7), that the matrices I, A, A², ..., A^{n-1} constitute a basis also for the family of matrices {e^{At}; t ≥ 0}. Here n is, as ever, the dimension of the system. We shall define T_r^con again as in (19), despite the somewhat different understanding of the matrix A in the continuous-time case.
Theorem 5.5.1 (i) The n-dimensional system [A, B, ·] is controllable if and only if the matrix T_n^con has rank n.
(ii) This condition is equivalent to the condition that the control Gramian

G(t)^con = ∫_0^t e^{Aτ} B Q^{-1} B^T e^{A^T τ} dτ     (24)

should be positive definite (for prescribed positive t and positive definite Q).
(iii) If the system is controllable, then the control which achieves the passage from prescribed x(0) to prescribed x(t) with minimal control cost ½ ∫_0^t u^T Q u dτ is

u(τ) = Q^{-1} B^T e^{A^T (t-τ)} [G(t)^con]^{-1} [x(t) - e^{At} x(0)].     (25)
Proof If transfer from prescribed x(0) to prescribed x(t) is possible then controls {u(τ); 0 ≤ τ < t} exist which satisfy the equation

x(t) - e^{At} x(0) = ∫_0^t e^{A(t-τ)} B u(τ) dτ,     (26)
analogous to (20). There must then be a control in this class which minimises the control cost defined in the theorem; we find this to be (25) by exactly the methods used to derive (22). This solution is acceptable if and only if the control Gramian G(t)^con is nonsingular (i.e. positive definite); this is consequently the necessary and sufficient condition for controllability. As in the proof of Theorem 5.4.3 (i): if G(t)^con were singular then there would be an n-vector w for which w^T G(t)^con = 0, with the successive implications that w^T e^{At} B = 0 (t ≥ 0), w^T A^j B = 0 (j = 0, 1, 2, ...) and w^T T_n^con = 0. The reverse implications also hold. Thus, an alternative necessary and sufficient condition for controllability is that T_n^con should have full rank. □
While the Gramians G(t)^con for varying t > 0 and Q > 0 are either all singular or all nonsingular, it is evident that G(t)^con approaches the zero matrix as t approaches zero, and that the control (25) will then become infinite.
Exercises and comments (1) Consider the satellite example of Exercise 2.1 in its linearised form. Show that the system is controllable. Show that it is indeed controllable under tangential thrust alone, but not under radial thrust alone.
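A numerical version of Exercise 1, using the linearised satellite matrices from Exercise 2.2 with the illustrative values ω = ρ = 1:

```python
import numpy as np

omega, rho = 1.0, 1.0   # illustrative orbit frequency and radius

# Linearised satellite dynamics; state: deviations of r, rdot, theta, thetadot.
A = np.array([[0.0,          1.0, 0.0, 0.0],
              [3 * omega**2, 0.0, 0.0, 2 * omega * rho],
              [0.0,          0.0, 0.0, 1.0],
              [0.0, -2 * omega / rho, 0.0, 0.0]])
B = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 0.0],
              [0.0, 1.0 / rho]])   # columns: radial and tangential thrust

def ctrb_rank(A, B):
    """Rank of the test matrix [B, AB, ..., A^(n-1)B]."""
    n = A.shape[0]
    blocks = [B]
    for _ in range(n - 1):
        blocks.append(A @ blocks[-1])
    return np.linalg.matrix_rank(np.hstack(blocks))

assert ctrb_rank(A, B) == 4            # controllable with both thrusters
assert ctrb_rank(A, B[:, 1:]) == 4     # tangential thrust alone suffices
assert ctrb_rank(A, B[:, :1]) < 4      # radial thrust alone does not
```

The rank deficiency under radial thrust reflects a genuine physical obstruction: radial forcing alone cannot reach every combination of orbital deviations.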
(2) Consider the two-variable system ẋ_j = λ_j x_j + u (j = 1, 2). One might regard this as a situation in which one has two rooms, room j having temperature x_j and losing temperature at a rate λ_j x_j, and heat being supplied (or extracted) exogenously at a common rate u. Show that the system is controllable if and only if λ_1 ≠ λ_2. Indeed, if λ_1 = λ_2 < 0 then the temperature difference between the two rooms will converge to zero in time, however u is varied.

(3) The situation of Exercise 2 can be generalised. Suppose, as in Exercise 1.6, that the matrix A is diagonalisable to the form H^{-1}ΛH. With the change of state variable to the set of modal variables x̃ = Hx considered there the dynamic equations become

dx̃_j/dt = λ_j x̃_j + Σ_k b̃_jk u_k,
where b̃_jk is the jk-th element of the matrix HB. Suppose all the λ_j distinct; it is a fact that the square matrix with jk-th element λ_j^{k-1} is then nonsingular. Use this fact to show that controllability, equivalent to the fact that T_n^con has rank n, is equivalent to the fact that HB should have no rows which are zero. In other words, the system is controllable if and only if there is some control input to every mode. This assertion does not, however, imply a failure of controllability if there are repeated eigenvalues.

6 OBSERVABILITY

The notion of controllability rested on the assumption that the initial value of state was known. If, however, one must rely upon imperfect observations, then it is a question whether the value of state (either in the past or in the present) can be determined from these observations. The discrete-time system [A, B, C] is said to be r-observable if the value of x_0 can be inferred from knowledge of the subsequent observations y_1, y_2, ..., y_r and subsequent relevant control values u_0, u_1, ..., u_{r-2}. Note that, if x_0 can be thus determined, then x_t is also in principle simultaneously determinable for all t for which one knows the control history. The notion of observability stands in a dual relation to that of controllability; a duality which indeed persists right throughout the subject. We have the determination
y_τ = C [ Σ_{j=0}^{τ-2} A^{τ-j-2} B u_j + A^{τ-1} x_0 ]

of y_τ in terms of x_0 and subsequent controls. Thus, if we define the reduced observation
ỹ_τ = y_τ − C Σ_{j=0}^{τ-2} A^{τ-j-2} B u_j

then x_0 is to be determined from the system of equations

ỹ_τ = C A^{τ-1} x_0        (0 < τ ≤ r).        (27)
These equations are mutually consistent, by hypothesis, and so have a solution. The question is whether this solution is unique. This is the reverse of the situation for controllability, when the question was whether equation (20) for the u-values had a solution at all, unique or not. Note an implication of the system (27): the property of observability depends only upon the matrices A and C; not at all upon B. We define the matrix
T_r^obs = [ C ; CA ; CA^2 ; ... ; CA^{r-1} ]        (28)

(the blocks stacked as rows), the test matrix of size r for the observability context.
Theorem 5.6.1 (i) The n-dimensional system [A, ·, C] is r-observable if and only if the matrix T_r^obs has rank n. (ii) Equivalently, the system is r-observable if and only if the matrix

G_r^obs = Σ_{τ=0}^{r-1} (C A^τ)^T M^{-1} (C A^τ)        (29)

is positive definite, for prescribed positive definite M. (iii) If the system is r-observable then the determination of x_0 can be expressed

x_0 = (G_r^obs)^{-1} Σ_{τ=0}^{r-1} (C A^τ)^T M^{-1} ỹ_{τ+1}.        (30)

(iv) If the system is r-observable then it is s-observable for s ≥ min(n, r).

Proof If system (27) has a solution for x_0 (which is so by hypothesis) then this solution is unique if and only if the coefficient matrix T_r^obs of the system has rank n, implying assertion (i). Assertion (iv) follows by appeal to the Cayley-Hamilton theorem, as in Theorem 5.4.2.

If we define the deviation η_τ = ỹ_τ − C A^{τ-1} x_0 then equations (27) amount to η_τ = 0 (0 < τ ≤ r). If these equations were not consistent we could still define a
'least-square' solution to them by minimising any positive-definite quadratic form in these deviations with respect to x_0. In particular, we could minimise Σ_{τ=1}^{r} η_τ^T M^{-1} η_τ. This minimisation yields expression (30). If equations (27) indeed have a solution (i.e. are mutually consistent, as we suppose) and this is unique then expression (30) must equal this solution: the actual value of x_0. The criterion for uniqueness of the least-square solution is that G_r^obs should be nonsingular, which is exactly condition (ii). As in Theorem 5.4.3, equivalence of conditions (i) and (ii) can be verified directly, if desired. □

Note that we have again found it helpful to bring in an optimisation criterion. This time it was a question, not of finding a 'least cost' solution when many solutions are known to exist, but of finding a 'best fit' solution when no exact solution may exist. This approach lies close to the statistical approach necessary when observations are corrupted by noise; see Chapter 12. Matrix (29) is the observation Gramian.

The continuous-time version of these results will now be apparent, with a proof which bears the same relation to that of Theorem 5.6.1 as that of Theorem 5.5.1 does to the material of Section 4.
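The rank test and the Gramian determination (30) can be checked numerically. The following is a minimal sketch (not from the text, assuming Python with numpy; the matrices A, C and the initial state are arbitrary illustrative choices, with M = I):

```python
import numpy as np

# Numerical check of Theorem 5.6.1: rank of the test matrix (28) and
# recovery of x0 via the observation Gramian (29)-(30), with M = I.
n = 3
A = np.array([[0.9, 0.2, 0.0],
              [0.0, 0.7, 0.1],
              [0.1, 0.0, 0.8]])
C = np.array([[1.0, 0.0, 0.0]])   # scalar observation of the first component

# Test matrix (28): C, CA, ..., CA^{r-1} stacked, with r = n.
T_obs = np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(n)])
assert np.linalg.matrix_rank(T_obs) == n      # the system is observable

# Reduced observations (27): y~_{tau+1} = C A^tau x0, tau = 0..r-1.
x0 = np.array([0.5, -1.0, 2.0])
ytil = [C @ np.linalg.matrix_power(A, k) @ x0 for k in range(n)]

# Gramian (29) and determination (30).
G = sum((C @ np.linalg.matrix_power(A, k)).T @ (C @ np.linalg.matrix_power(A, k))
        for k in range(n))
x0_hat = np.linalg.solve(G, sum((C @ np.linalg.matrix_power(A, k)).T @ ytil[k]
                                for k in range(n)))
print(np.allclose(x0_hat, x0))    # True: x0 is recovered exactly
```

Since the reduced observations here are exact, the least-square determination (30) reproduces x_0 exactly; with noisy observations it would give the best fit in the sense of the quadratic form above.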
Theorem 5.6.2 (i) The n-dimensional continuous-time system [A, ·, C] is observable if and only if the matrix T_n^obs defined by (28) has rank n. (ii) This condition is equivalent to the condition that the observation Gramian

G(t)^obs = ∫_0^t (C e^{Aτ})^T M^{-1} (C e^{Aτ}) dτ        (31)

should be positive definite (for prescribed positive t and positive definite M). (iii) If the system is observable then the determination of x(0) can be written

x(0) = (G(t)^obs)^{-1} ∫_0^t (C e^{Aτ})^T M^{-1} ỹ(τ) dτ,

where

ỹ(t) = y(t) − ∫_0^t C e^{A(t-τ)} B u(τ) dτ.
A way of generating real-time estimates of current state is to drive a model of the plant by the apparent discrepancy in observation. For the continuous-time model (15) this would amount to generating an estimate x̂(t) of x(t) as a solution of the equation

(d/dt)x̂ = A x̂ + B u + H(y − C x̂),        (32)

where the matrix H is to be chosen suitably. One can regard this as the realisation of a filter whose output is the estimate x̂ generated from the known inputs u and
y. Such a relation is spoken of as an observer; it is unaffected in its performance by the control policy. We shall see in Chapter 12 that the optimal estimating relation in the statistical context, the Kalman filter, is exactly of this form.

Denote the estimation error x(t) − x̂(t) by Δ(t). By subtracting the plant equation from relation (32) and setting y = Cx we see that

(d/dt)Δ = (A − HC)Δ.

Estimation will thus be successful (in that the estimation error will tend to zero with time) if A − HC is a stability matrix. If it is possible to find a matrix H such that this is so then the system [A, ·, C] is said to be detectable; a property corresponding to the control property of stabilisability.
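The observer mechanism can be illustrated in discrete time, where the error obeys Δ_{t+1} = (A − HC)Δ_t. A minimal sketch (not from the text; A, B, C, H are illustrative choices, with H picked so that A − HC is a stability matrix):

```python
import numpy as np

# Discrete-time analogue of the observer (32):
#   x^_{t+1} = A x^_t + B u_t + H(y_t - C x^_t),   y_t = C x_t.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
C = np.array([[1.0, 0.0]])
H = np.array([[0.9], [1.9]])          # makes A - HC a stability matrix

assert max(abs(np.linalg.eigvals(A - H @ C))) < 1

x = np.array([1.0, -1.0])             # true state
xh = np.zeros(2)                      # observer state, wrong initial guess
for t in range(200):
    u = np.array([np.sin(0.1 * t)])   # arbitrary control input
    y = C @ x
    xh = A @ xh + B @ u + H @ (y - C @ xh)
    x = A @ x + B @ u
print(np.linalg.norm(x - xh))         # estimation error has decayed to ~0
```

Note that the error dynamics are independent of the control input u, which is the sense in which the observer's performance is unaffected by the control policy.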
Exercises and comments

(1) Consider the linearised satellite model of Exercise 2.2. Show that the state x is observable from angle measurements alone (i.e. from observation of θ and θ̇) but not from radial measurements alone.

(2) The scalar variables x_j (j = 1, 2, ..., n) of a metabolic system obey the equations

ẋ_1 = 2(1 + x_n)^{-1} − x_1 + u,
ẋ_j = x_{j-1} − x_j        (j = 2, 3, ..., n).

Show that in the absence of the control u there is a unique equilibrium point in the positive orthant. Consider the controlled system linearised about this equilibrium point. Show that it is controllable, and that it is observable from measurements of x_1 alone.
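The conclusions of Exercise (2) can be checked numerically. A sketch (not from the text, and assuming the dynamics as reconstructed above): at the equilibrium x_1 = ... = x_n = 1 (the positive root of x = 2/(1 + x)) the derivative of 2(1 + x_n)^{-1} is −1/2, giving the linearised matrices below.

```python
import numpy as np

# Linearisation of the metabolic system about its equilibrium x_j = 1.
n = 4
A = -np.eye(n)
for j in range(1, n):
    A[j, j - 1] = 1.0            # x_{j-1} feeds x_j
A[0, n - 1] = -0.5               # feedback of x_n on x_1: d/dx [2/(1+x)] = -1/2
B = np.zeros((n, 1)); B[0, 0] = 1.0    # control enters the first equation
C = np.zeros((1, n)); C[0, 0] = 1.0    # observation of x_1 alone

ctrb = np.hstack([np.linalg.matrix_power(A, k) @ B for k in range(n)])
obsv = np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(n)])
print(np.linalg.matrix_rank(ctrb), np.linalg.matrix_rank(obsv))   # n and n
```

Both test matrices have full rank n, confirming controllability from u and observability from x_1 alone.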
Notes

We have covered the material which is of immediate relevance for our purposes, but this is only a small part of the very extensive theory which exists, even (and especially) for the time-homogeneous linear case. One classical piece of work is the Routh-Hurwitz criterion for stability, which states in verifiable form the necessary and sufficient conditions that the characteristic polynomial |λI − A| should have all its zeros strictly in the left half-plane. Modern work has been particularly concerned with the synthesis or realisation problem: can one find a system [A, B, C] which realises a given transfer function G? If one can find such a realisation, of finite dimension, then it is of course not unique (see Exercise 2.1). However, the main consideration is to achieve a realisation which is minimal, in that it is of minimal dimension. One has the important and beautiful theorem: a system [A, B, C] realising G is minimal if and only if it is both controllable and observable. (See, for example, Brockett (1970) p. 94.)
However, when we resume optimisation, the relevant parts of this further theory are in a sense generated automatically and in the operational form dictated by the goal. So, existence theorems are replaced by explicit solutions (as Theorem 5.4.3 gave greater definiteness to Theorem 5.4.1), the family of 'good' solutions is generated by the optimal solution as the cost function is varied, and the conditions for validity of the optimal solution provide the minimal and natural conditions for existence or realisability.
CHAPTER 6
Stationary Rules and Direct Optimisation for the LQ Model

The LQ model introduced in Sections 2.4, 2.8 and 2.9 has aspects which go far beyond what was indicated there, and a theory which is more elegant than the reader might have concluded from a first impression. In Section 1 we deal with a central issue: proof of the existence of infinite-horizon limits for the LQ regulation problem under appropriate hypotheses. The consequences of this for the LQ tracking problem are considered in Section 2. However, in Section 3 we move to deduction of an optimal policy, not by dynamic programming, but by direct optimisation of the trajectory by Lagrangian methods. This yields a treatment of the tracking problem which is much more elegant and insightful than that given earlier, at least in the stationary case. The approach is one which anticipates the maximum principle of the next chapter and provides a natural application of the transform methods of Chapter 4. As we see in Sections 5 and 6, it generalises with remarkable simplicity; we continue this line with the development of time-integral methods in Chapters 18-21. The material of Sections 3, 5, 6 and 7 must be regarded, not as a systematic exposition, but as a first sketch of an important pattern whose details will be progressively completed.

1 INFINITE-HORIZON LIMITS FOR THE LQ REGULATION PROBLEM

One hopes that, if the horizon is allowed to become infinite, then the control problem will simplify in that it becomes time-invariant, i.e. such that a time-shift leaves the problem unchanged. One hopes in particular that the optimal policy will become time-invariant in form, when it is referred to as stationary. The stationary case is the natural one in a high proportion of control contexts, where one has a system which, for practical purposes, operates indefinitely under constant conditions.
The existence of infinite-horizon limits has to be established by different arguments in different cases, and will certainly demand conditions of some kind: time homogeneity, plus both the ability and the incentive to control. In this section we shall study an important case, the LQ regulation problem of Section 2.4. In this case the value function F(x, t) has the form ½ x^T Π_t x, where Π obeys the Riccati equation (2.25). It is convenient to write F(x, t) rather as F_s(x) = ½ x^T Π_(s) x, where s = h − t is the 'time to go'. The matrix Π_(0) is then that
associated with the terminal cost function. The passage to an infinite horizon is then just the passage s → +∞, and infinite-horizon limits will exist if Π_(s) has a limit value Π which is independent of Π_(0) for the class of terminal cost functions one is likely to consider. In this case the matrix K_t = K_(s) of the optimal control rule (2.27) has a corresponding limit value K, so that the rule takes the stationary form u_t = K x_t.

Two basic conditions are required for the existence of infinite-horizon limits in this case. One is that of sensitivity: that any deviation from the desired rest point x = 0, u = 0 should ultimately carry a cost penalty, and so demand correction. The other is that of controllability: that any such deviation can indeed be corrected ultimately. We suppose S normalised to zero; a normalisation which can be reversed if required by replacing R and A in the calculations below by R − S^T Q^{-1} S and A − B Q^{-1} S respectively. The Riccati equation then takes the form
Π_(s) = f Π_(s-1)        (s = 1, 2, ...),        (1)

where f has the action

f Π = R + A^T Π A − A^T Π B (Q + B^T Π B)^{-1} B^T Π A.        (2)
Lemma 6.1.1 Suppose that Π_(0) = 0, R ≥ 0, Q > 0. Then the sequence {Π_(s)} is non-decreasing (in the ordering of positive-definiteness).

Proof We have F_1(x) = ½ x^T R x ≥ 0 = F_0(x). Thus, by Theorem 3.1.1, F_s(x) = ½ x^T Π_(s) x is non-decreasing in s for fixed x. That is, Π_(s) is non-decreasing in the matrix-definiteness sense. □

Lemma 6.1.2 Suppose that Π_(0) = 0, R ≥ 0, Q > 0 and that the system [A, B, ·] is either controllable or stabilisable. Then {Π_(s)} is bounded above and has a finite limit Π.

Proof To demonstrate boundedness one must demonstrate that a policy can be found which incurs a finite infinite-horizon cost for any prescribed value x of initial state. Controllability implies that there is a linear control rule (e.g. that suggested in the proof of Theorem 5.4.3) which, for any x_0 = x, will bring the state to zero in at most n steps and at a finite cost ½ x^T Π* x, say. The cost of holding it at zero thereafter is zero, so we can assert that

x^T Π_(s) x ≤ x^T Π* x        (3)

for all x. Stabilisability implies the same conclusion, except that convergence to zero takes place exponentially fast rather than in a finite time. The non-decreasing sequence {Π_(s)} is thus bounded above by Π* (in the positive-definite
sense) and so has a limit Π. (More explicitly, take x = e_j, the vector with a unit in the jth place and zeros elsewhere. The previous lemma and relation (3) then imply that π_jj^(s), the jjth element of Π_(s), is non-decreasing and bounded above by π*_jj. It thus converges. By then taking x = e_j + e_k we can similarly prove the convergence of π_jk^(s).) □

We shall now show that, under natural conditions, Π is indeed the limit of Π_(s) for any non-negative definite Π_(0). The proof reveals more.
Theorem 6.1.3 Suppose that R > 0, Q > 0 and the system [A, B, ·] is either controllable or stabilisable. Then

(i) The equilibrium Riccati equation

Π = f Π        (4)

has a unique non-negative definite solution Π.
(ii) For any finite non-negative definite Π_(0) the sequence {Π_(s)} converges to Π.
(iii) The gain matrix Γ corresponding to Π is a stability matrix.

Proof Define Π as the limit of the sequence f^(s) 0. We know by Lemma 6.1.2 that this limit exists, is finite and satisfies (4). Setting u_t = K x_t and x_{t+1} = (A + BK) x_t = Γ x_t in the relation

x^T (f Π) x = min_u [ x^T R x + u^T Q u + (Ax + Bu)^T Π (Ax + Bu) ],        (5)

where K and Γ are the values corresponding to Π, we see that we can write (4) as
Π = R + K^T Q K + Γ^T Π Γ.        (6)

Consider the form

V(x) = x^T Π x        (7)

and a sequence x_t = Γ^t x_0, for arbitrary x_0. Then

V(x_{t+1}) − V(x_t) = −x_t^T (R + K^T Q K) x_t ≤ 0.        (8)

Thus V(x_t) decreases and, being bounded below by zero, tends to a limit. Thus

x_t^T (R + K^T Q K) x_t → 0,        (9)

which implies that x_t → 0, since R + K^T Q K ≥ R > 0. Since x_0 is arbitrary this implies that Γ^t → 0, establishing (iii). We can thus deduce from (6) the convergent series expression for Π:

Π = Σ_{j=0}^∞ (Γ^T)^j (R + K^T Q K) Γ^j.        (10)
Note now that, for arbitrary finite non-negative definite Π_(0),

Π_(s) = f^(s) Π_(0) ≥ f^(s) 0 → Π.        (11)

Comparing the minimal s-horizon cost with that incurred by using the stationary rule u_t = K x_t we deduce a reverse inequality

Π_(s) ≤ Σ_{j=0}^{s-1} (Γ^T)^j (R + K^T Q K) Γ^j + (Γ^T)^s Π_(0) Γ^s → Π.        (12)

Relations (11) and (12) imply (ii). Finally, assertion (i) follows because, if Π̃ is another finite non-negative definite solution of (4), then Π̃ = f^(s) Π̃ → Π. □
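The convergence asserted by the theorem is easily observed numerically. A minimal sketch (not from the text; the matrices are an arbitrary controllable example with R > 0, Q > 0, and the plant is deliberately open-loop unstable):

```python
import numpy as np

# Iterate the Riccati operator f of (2) from Pi_(0) = 0, as in Theorem 6.1.3.
A = np.array([[1.1, 0.3],
              [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
R = np.eye(2)
Q = np.array([[1.0]])

def f(P):
    """The operator (2): f P = R + A'PA - A'PB (Q + B'PB)^{-1} B'PA."""
    G = np.linalg.solve(Q + B.T @ P @ B, B.T @ P @ A)
    return R + A.T @ P @ A - A.T @ P @ B @ G

P = np.zeros((2, 2))
for _ in range(200):
    P = f(P)
assert np.allclose(P, f(P), atol=1e-8)     # equilibrium equation (4) holds

K = -np.linalg.solve(Q + B.T @ P @ B, B.T @ P @ A)   # optimal gain
Gam = A + B @ K                                      # closed-loop gain matrix
print(max(abs(np.linalg.eigvals(Gam))) < 1)          # True: Gamma is stable
```

The iterates increase monotonically from zero, as Lemma 6.1.1 asserts, and the limiting gain matrix Γ = A + BK has spectral radius below one, illustrating assertion (iii).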
It is gratifying that proof of the convergence Π_(s) → Π implies incidentally that Γ is a stability matrix. Of course, this is no more than one would expect: if the optimal policy is successful in driving the state variable x to zero then it must indeed stabilise the equilibrium point x = 0. The proof appealed to the condition R > 0, which is exactly the condition that any deviation of x from zero should be penalised immediately. However, we can weaken this to the condition that the deviation should be penalised ultimately.
Theorem 6.1.4 The conclusions of Theorem 6.1.3 remain valid if the condition that R > 0 is replaced by the condition that, if R = L^T L, then the system [A, ·, L] is either observable or detectable.

Proof Relation (9) now becomes

(L x_t)^T (L x_t) + (K x_t)^T Q (K x_t) → 0,

which implies that K x_t → 0 and L x_t → 0. These convergences, with the relation x_t = (A + BK) x_{t-1}, imply that x_t ultimately enters a manifold for which

L x_t = 0,        x_t = A x_{t-1}.        (13)

The observability condition implies that these relations can hold only if x_t = 0. The detectability condition implies that we can find an H such that A − HL is a stability matrix, and since relations (13) imply that x_t = (A − HL) x_{t-1}, then again x_t → 0. Thus x_t → 0 under either condition. This fact established, the proof continues as in Theorem 6.1.3. □

We can note the corollaries of this result, already mentioned in Section 5.4.
Corollary 6.1.5 (i) Controllability implies stabilisability. (ii) Stabilisability to x = 0 by any means implies stabilisability by a control of the linear Markov form u_t = K x_t.
Proof (i) The proof of Theorem 6.1.3 demonstrated that a stabilising policy of the linear Markov form could be found if the system were controllable. (ii) The optimal policy under a quadratic cost function is exactly of the linear Markov form, so, if such a policy will not stabilise the system (in the sense of ensuring a finite-cost passage to x = 0), then neither will any other. □

2 STATIONARY TRACKING RULES
The proof of the existence of infinite-horizon limits demonstrates the validity of the infinite-horizon tracking rule (2.67) of Section 2.9, at least if the hypotheses of the last section are satisfied and the disturbances and command signals are such that the feedforward term in (2.67) is convergent. We can now take matters somewhat further and begin, in the next section, to see the underlying structure. In order to avoid undue repetition of the material of Section 2.9, and to link up with conventional control ideas, we shall discuss the continuous-time case. The continuous-time analogue of (2.67) would be

u − u^c = K(x − x^c) − Q^{-1} B^T ∫_0^∞ e^{Γ^T τ} Π [d(t + τ) − d^c(t + τ)] dτ,        (14)
where Π, K and Γ are the infinite-horizon limits of Section 1 (in a continuous-time version) and the time argument t is understood unless otherwise stated. We regard (14) as a stationary rule because, although it involves the time-dependent signal

d^c = ẋ^c − A x^c − B u^c,        (15)

this signal is seen as a system input on which the rule (14) operates in a stationary fashion.

A classic control rule for the tracking of a command signal r in the case u^c = 0 would be simply

u = K(x − r),        (16)
where u = Kx is a control which is known to stabilise the equilibrium x = 0. We see that (14) differs from this in the feedforward term, which can of course be calculated only if the future courses of the command signal and disturbance are known. Neither rule in general leads ultimately to perfect following ('zero offset'), although rule (14) does so if d − d^c, defined in (15), tends to zero with increasing time. This is sometimes expressed as the condition that all unstable modes of (x^c, u^c, d) should satisfy the plant equation.

There is one point that we should cover. In most cases one will not prescribe the course of all components of the process vector x, but merely that of certain linear functions of this vector. For example, an aeroplane following a moving target is merely required to keep that target in its sights from an appropriate distance and angle; not to specify all aspects of its dynamic state. In such a case it is better not
to carry out the normalisation of x, u and d adopted in Section 2.9. If we assume that u^c = 0 and work with the raw variables then we find that the control rule (14) becomes rather

u = Kx − Q^{-1} B^T ∫_0^∞ e^{Γ^T τ} [ Π d(t + τ) − (R − S^T Q^{-1} S) x^c(t + τ) ] dτ.        (17)
Details of derivation are omitted, because in the next section we shall develop an analysis which, at least for the stationary case, is much more direct and powerful than that of Section 2.9. Relation (17) is exactly what we want. A penalty term such as (x − r)^T R (x − r) is a function only of those linear functions of (x − r) which are penalised. The consequence is then that Rr and Sr are functions only of those linear functions of r which are prescribed.

If we consider the case when S, u^c and d are zero and r is constant then relation (17) reduces to

u = Kx − Q^{-1} B^T (Γ^T)^{-1} R r,        (18)

which is to be compared with relation (16) and must be superior to it (in average cost terms). When we insert these two rules into the plant equation we see that x settles to the equilibrium value Γ^{-1} B K r = −Γ^{-1} J Π r for rule (16) and Γ^{-1} J (Γ^T)^{-1} R r for rule (18). Here J is again the control-power matrix B Q^{-1} B^T. We obtain expressions for the total offset costs in Exercise 1 and Section 6.

Exercises and comments

(1) Verify that the offset cost under control (16) (assuming S zero and r constant) is ½ (Ar)^T P (Ar), where

P = (Γ^T)^{-1} (R + K^T Q K) Γ^{-1} = −Π Γ^{-1} − (Γ^T)^{-1} Π.
(rT) 1ll.
We shall come to evaluations under the optimal rule in Section 6. However, if R is assumed nonsingular (so that all components of r are necessarily specified) then location of the optimal equilibrium point by the methods of Section 2.10 leads to the conclusion that the offset cost under the optimal rule (18) again has the form !{Ar?P(Ar), but now with P = (AR 1AT +BQ 1BT) 1. This is generalised in equation (43). 3 DIRECI' TRAJECTORY OPTIMISATION: WHY THE OPTIMAL FEEDBACK/FEEDFORWARD CONTROL RULE HAS THE FORM IT DOES Our analysis of the disturbed tracking problem in Section 2.9 won through to a solution with an appealing form, but only after some rather unappealing
117
3 DIRECT TRAJECTORY OPTIMISATION
calculations. Direct trajectory optimisation turns out to offer a quick, powerful and transparent treatment of the problem, at least in the stationary case. The approach carries over to much more general models, and we shall develop it as a principal theme. Consider the discretetime model of Section 2.9, assuming plant equation (2.61) and instantaneous cost function (2.62). Regard the plant equation at time T as a constraint and associate with it a vector Lagrange multiplier >.7 , so that we have a Lagrangian form 0 = L[c(xr, U71 r) + >.'J(xr AXrl B~l d7 )] +terminal cost.
( 19)
T
Here the time variable T runs over the timeinterval under consideration, wb.ich we shall now suppose to be h 1 < T < h2 ; the terminal cost is incurred at the horizon point T = hz. We use T to denote a running value of time rather than t, and shall do so henceforth, reserving t to indicate the particular moment 'now'. In other words, it is assumed that controls U7 for T < t have been determined, not necessarily optimally, and that the time t has come at which the value of ur is to be determined in the light of information currently available. We shall refer to the form 0 of (19) as a 'timeintegral' since it is indeed the discretetime version of an integral. We shall also require of a 'timeintegral' a property which 0 possesses, that one optimises by extremising the integral.free/y with respect to all variables except those whose values are currently known. That is, optimisation is subject to no other constraint. The application of Lagrangian methods is certainly legitimate (at least in the case of a fixed and finite horizon) if all cost functions are nonnegative definite quadratic forms; see Section 7.1. We can make the strong statement that the optimal trajectory from time t is determined by minimisation of 0with respect to (x7 , ~) and maximisation with respect to .\7 for all relevant T ~ t. This extremisation then yields a linear system of equations which we can write
R
s
[ I  AfT
ST
Q Bfl
[3],
The matrix of operators Φ(𝒯) thus defined is Hermitian, in that if we define the conjugate of Φ = Φ(𝒯) as Φ̄ = Φ(𝒯^{-1})^T then Φ̄ = Φ. Suppose that Φ(z), with z a scalar complex variable, has a canonical factorisation

Φ(z) = Φ_−(z) Φ_+(z),        (23)

where Φ_+(z) and Φ_+(z)^{-1} can be validly expanded on the unit circle wholly in non-negative powers of z, and Φ_− ditto for non-positive powers. What this would mean is that an equation such as

Φ_+(𝒯) Δ_τ = v_τ        (24)

for Δ with known v (and valid for all τ before the current point of operation) can be regarded as a stable forward recursion for Δ, with solution

Δ_τ = Φ_+(𝒯)^{-1} v_τ.        (25)
Here the solution is that obtained by expanding the operator Φ_+(𝒯)^{-1} in non-negative powers of 𝒯, and so is linear in present and past v; see Section 4.5. We have taken recursion (24) as a forward recursion in that we have solved it in terms of past v; it is stable in that solution (25) is certainly convergent for uniformly bounded v. Factorisation (23) then implies the rewriting of (22) as

Φ_−(𝒯) [ Φ_+(𝒯) Δ_τ ] = ζ_τ,        (26)

so representing the difference equation (22) as the composition of a stable forward recursion and a stable backward recursion. The situation may be plainer in terms of the scalar example of Exercise 1. One finds generally that the optimisation of the path of a process generated by a forward recursion yields a recursion of double order, symmetric in past and future, and that if we can represent this double-order recursion as the composition of a stable forward recursion and a stable backward recursion, then it is the stable forward recursion which determines the optimal forward path in the infinite-horizon case (see Chapter 18).

Suppose we let the horizon point h_2 tend to +∞, so that (26) holds for all τ ≥ t. We can then legitimately half-invert (26) to

Φ_+(𝒯) Δ_τ = Φ_−(𝒯)^{-1} ζ_τ        (27)
if ζ_τ grows sufficiently slowly with increasing τ that the expression on the right is convergent when Φ_−(𝒯)^{-1} is expanded in non-positive powers of 𝒯. We thus have an expression for Δ_τ in terms of past Δ and present and future ζ. This gives us exactly what we want: an expression for the optimal control in the desired feedback/feedforward form.
Theorem 6.3.1 Suppose that d̃_τ grows sufficiently slowly with τ that the semi-inversion (27) is legitimate. More specifically, that the semi-inversion

Φ_+(𝒯) [ x − x^c ; u − u^c ; λ ]_τ = Φ_−(𝒯)^{-1} [ 0 ; 0 ; d̃ ]_τ        (27')

is legitimate in that the right-hand member is convergent when the operator Φ_−(𝒯)^{-1} is expanded in non-positive powers of 𝒯. Then
(i) The determination of u_t obtained by setting τ = t in relation (27') constitutes an expression of the infinite-horizon optimal control rule in feedback/feedforward form.
(ii) The Hermitian character of Φ implies that the factorisation (23) can be chosen so that Φ_− = Φ̄_+, whence it follows that the operator which gives the feedforward component is just the inverse of the conjugate of the feedback operator.

Relation (27') in fact determines the whole future course of the optimally controlled process recursively, but it is the determination of the current control u_t that is of immediate interest. The relation at τ = t determines u_t (optimally) in terms of x_t and d̃_τ (τ ≥ t); the feedback/feedforward rule. Furthermore, the symmetry in the evaluation of these two components explains the structure which began to emerge in Section 2.9, and which we now see as inevitable. We shall both generalise this solution and make it more explicit in later chapters. The achievement of the canonical factorisation (23) is the equivalent of solving the stationary form of the Riccati recursion, and in fact the policy-improvement algorithm of Section 3.5 translates into a fast and natural algorithm for this factorisation.

The assumptions behind solution (27) are twofold. First, there is the assumption that the canonical factorisation (23) exists. This corresponds to the assumption that infinite-horizon limits exist for the original problem of Section 2.4; that of optimal regulation to zero in the absence of disturbances. Existence of the canonical factorisation is exactly the necessary and sufficient condition for existence of the infinite-horizon limits; the controllability/sensitivity assumptions of Theorem 6.1.3 were sufficient, but probably not necessary. We shall see in Chapter 18 that the policy-improvement method for deriving the optimal infinite-horizon policy indeed implies the natural algorithm for determination of the canonical factorisation.
The second assumption is that the normalised disturbance d̃_τ should increase sufficiently slowly with τ that the right-hand member of (27), the feedforward
term, should converge. Such convergence does not guarantee that the vector of 'errors' Δ_t will converge to zero with increasing t, even in the case when all components of this error are penalised. There may well be non-zero offsets in the limit; the errors may even increase exponentially fast. However, convergence of the right-hand member of (27) implies that d̃_τ increases slowly enough with time that an infinite-horizon optimisation is meaningful. Specifically, suppose that the zeros of |Φ_+(z)| do not exceed ρ in modulus (where certainly ρ < 1) and that the components of ζ_τ grow no faster than γ^τ in modulus. Then the feedforward term in (27) is convergent if ργ < 1. On the other hand, some components of Δ_τ may indeed grow at rate γ^τ.

For the particular Φ of (20), corresponding to the state-structured case, one finds a canonical factorisation

Φ(z) = Φ_−(z) Φ_0 Φ_+(z),        (28)

where the factors are given by (29), Π and K being the infinite-horizon limits of these quantities. Factorisation (28) differs slightly from (23) in that there is the interposing constant matrix Φ_0, but this only slightly modifies the semi-inversion (27). In the case when r is only partially prescribed we would write the equation system (20) rather as
[ R          S          I − A^T 𝒯^{-1} ] [ x ]     [ R r   ]
[ S^T        Q          −B^T 𝒯^{-1}   ] [ u ]  =  [ S^T r ]
[ I − A𝒯    −B𝒯       0              ] [ λ ]_τ   [ d     ]_τ
and perform the same semiinversion on this system as previously. The command
signal r occurs only in the combinations Rr and Sr, which are functions only of those components of r which are prescribed.
Exercises and comments

(1) Consider the simple regulation problem in the undisturbed scalar case, when we can write the cost function as ½ Σ_τ [Q B^{-2} (x_τ − A x_{τ-1})^2 + R x_τ^2] + terminal cost. The problem is thus reduced, since we have used the plant equation to eliminate u, and so need not introduce λ. The stationarity condition on x_τ yields Φ(𝒯) x_τ = 0 (*), where Φ(z) = Q(1 − Az)(1 − Az^{-1}) + B^2 R. (This is not the Φ of (22), but a reduced version of it.)
The symmetric (in past and future) second-order equation (*) can be legitimately reduced to a first-order stable forward equation, which determines the infinite-horizon optimal path from an arbitrary starting point, so yielding in effect an optimal control rule in open-loop form. Suppose the canonical factorisation Φ(z) ∝ (1 − Γz)(1 − Γz^{-1}), where Γ is less than unity in modulus. (Note that this is exactly the determination of the optimal gain factor Γ given by equation (2.36).) Then division of (*) by the 'future' factor leaves the equation (1 − Γ𝒯) x_τ = 0, or x_τ = Γ x_{τ-1}, which we know to be indeed the plant equation under optimal infinite-horizon control.

(2) Use relations (27')-(29) to verify the determination (2.67) of the optimal control rule.

(3) The continuous-time version of equation (22) (for the state-structured model (2.52)/(2.53)) is Φ(𝒟)Δ = ζ, with 𝒟 the differential operator, where
Φ(s) = [ R         S         −sI − A^T ]
       [ S^T       Q         −B^T      ]
       [ sI − A    −B        0         ]
If we define the conjugate of Φ(s) as Φ̄(s) = Φ(−s)^T then Φ is evidently self-conjugate. The analogue of the canonical factorisation (23) is Φ(s) = Φ_−(s) Φ_+(s), where both Φ_+ and Φ_+^{-1} have a Laplace representation valid for Re(s) ≥ 0 which involves only exponentials e^{−st} for non-negative t (and so have all singularities in the left half-plane). Φ_− has the complementary definition, and will in fact be the conjugate of Φ_+. Solution by the semi-inversion (27) then proceeds analogously.

(4) Show that the roots of |Φ(z)| = 0 are just the eigenvalues of the optimal infinite-horizon gain matrix Γ, together with their reciprocals.
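The scalar case of Exercises (1) and (4) can be verified directly. A sketch (not from the text; the numbers are an arbitrary example): the roots of Φ(z) = Q(1 − Az)(1 − Az^{-1}) + B²R occur in a reciprocal pair, and the root inside the unit circle equals the closed-loop gain Γ obtained from the Riccati fixed point.

```python
import numpy as np

A, B, Q, R = 0.9, 1.0, 1.0, 1.0

# Scalar Riccati fixed point, by iteration from 0 (equation (2) specialised).
Pi = 0.0
for _ in range(500):
    Pi = R + A * A * Pi * Q / (Q + B * B * Pi)
Gam = A * Q / (Q + B * B * Pi)     # closed-loop gain A + BK

# z Phi(z) = -QA z^2 + (Q + Q A^2 + B^2 R) z - QA: a quadratic in z.
roots = np.roots([-Q * A, Q + Q * A * A + B * B * R, -Q * A])
stable = roots[np.argmin(abs(roots))]
print(abs(stable - Gam) < 1e-8)             # True: stable root = Gamma
print(abs(roots[0] * roots[1] - 1.0) < 1e-8)  # True: roots are a pair z, 1/z
```

Dividing out the stable factor (1 − Γz) is exactly the 'half-inversion' of the text: what remains is the stable forward recursion x_τ = Γ x_{τ-1}.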
4 AN ALTERNATIVE FORM FOR THE RICCATI RECURSION

One could extremise the time-integral (19) recursively, and by doing so one is led to the expanded form of the optimality equation

F(x, t) = inf_{x_{t+1}, u} sup_λ [ c(x, u, t) + λ^T (x_{t+1} − Ax − Bu) + F(x_{t+1}, t + 1) ].
Here we have written x_t, u_t and λ_{t+1} simply as x, u and λ, and F(x, t) is the usual value function at time t, which we know to have the form ½ x^T Π_t x. Let us suppose for simplicity that we are in the time-homogeneous case when c(x, u) is given by (2.23) and that S has been normalised to zero. This normalisation can be reversed by replacing A and R in the expression below by A − B Q^{-1} S and R − S^T Q^{-1} S respectively.

If we extremise λ out first in the equation above then this is reduced to the usual optimality equation (2.29), leading to expressions (2.25) and (2.27) for the Riccati
recursion and the optimal control. If we extremise the variables out in the order x_{t+1}, u, λ then we find that the right-hand member of the equation undergoes the successive transformations

½(x^T R x + u^T Q u) − λ^T (Ax + Bu) − ½ λ^T Π_{t+1}^{-1} λ
→ ½ x^T R x − λ^T A x − ½ λ^T Π_{t+1}^{-1} λ − ½ λ^T J λ
→ ½ x^T [ R + A^T (J + Π_{t+1}^{-1})^{-1} A ] x,

where J = B Q^{-1} B^T, as ever. That is, the Riccati equation is recovered in the alternative form

Π_t = R + A^T (J + Π_{t+1}^{-1})^{-1} A.        (31)

Further, if we trace through the extremal value of u_t which emerges from the above sequence of transformations we find that it is u_t = K_t x_t, as ever, but with the gain matrix K_t having the evaluation

K_t = −Q^{-1} B^T (J + Π_{t+1}^{-1})^{-1} A        (32)

instead of (2.28). We shall find a use for these alternative forms (31) and (32) in Chapter 16. In continuous time the two forms coalesce; expressions (2.55) and (2.56) for the continuous-time Riccati equation and control matrix are (with the normalisation to S = 0) the limit forms of both discrete-time versions. However, the alternative forms above have in common with the continuous-time forms that they reveal the role of the control-power matrix J = B Q^{-1} B^T.
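The equivalence of the two forms (it follows from the matrix inversion lemma) can be checked numerically. A minimal sketch (not from the text; the matrices are arbitrary, with Π_{t+1} chosen nonsingular so that the alternative form makes sense as written):

```python
import numpy as np

A = np.array([[1.0, 0.2],
              [0.1, 0.8]])
B = np.eye(2)
Q = np.diag([1.0, 2.0])
R = np.eye(2)
P1 = np.array([[2.0, 0.3],            # a nonsingular Pi_{t+1} > 0
               [0.3, 1.5]])
J = B @ np.linalg.inv(Q) @ B.T        # the control-power matrix

# Standard form (2.25): Pi_t = R + A'PA - A'PB (Q + B'PB)^{-1} B'PA.
std = R + A.T @ P1 @ A - A.T @ P1 @ B @ np.linalg.solve(Q + B.T @ P1 @ B,
                                                        B.T @ P1 @ A)
# Alternative form (31).
alt = R + A.T @ np.linalg.solve(J + np.linalg.inv(P1), A)
print(np.allclose(std, alt))          # True: the two forms coincide

# Gains: (2.28) versus (32).
K_std = -np.linalg.solve(Q + B.T @ P1 @ B, B.T @ P1 @ A)
K_alt = -np.linalg.inv(Q) @ B.T @ np.linalg.solve(J + np.linalg.inv(P1), A)
print(np.allclose(K_std, K_alt))      # True
```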
5 DIRECT TRAJECTORY OPTIMISATION FOR HIGHERORDER
CASES The approach of Section 6.2 generalises immediately. Suppose that the plant equation is generalised to
𝒜x + ℬu = d,   (33)

where 𝒜 = A(𝒯) = Σ_{r=0}^p A_r𝒯^r and ℬ = B(𝒯) = Σ_{r=1}^p B_r𝒯^r, corresponding to pth-order dynamics. The time-integral (19) then generalises to

Σ_τ [c(x_τ, u_τ, τ) + λ_τᵀ(𝒜x_τ + ℬu_τ − d_τ)] + terminal cost   (34)
and the equation system (20) correspondingly to

[ R   Sᵀ   𝒜̄ ] [ x − x^c ]     [ 0 ]
[ S   Q    ℬ̄ ] [ u − u^c ]  =  [ 0 ]
[ 𝒜   ℬ    0 ] [    λ    ]_τ   [ d̃ ]_τ   (35)
Here an expression such as 𝒜̄ again denotes A(𝒯⁻¹)ᵀ, the conjugate of 𝒜. The normalised disturbance d̃ now has the definition d̃ := d − 𝒜x^c − ℬu^c. We can again write system (35) as (22); the matrix of operators Φ(𝒯) thus defined again has the self-conjugacy property Φ̄ = Φ. Now, provided that a canonical factorisation (23) of Φ(z) exists (which is again ensured by appropriate controllability and sensitivity hypotheses) then the manipulations (26)-(27) go through exactly as previously to yield the optimal control rule implied in (27). This is expressed in terms of present and past process variables, and so we must suppose these observable. The case of imperfect observation is best left until we treat the full stochastic case (see Chapters 12 and 20 for the cases p = 1 and p unrestricted respectively). Because we have restricted the dynamics to finite order p the canonical factors are polynomial of degree p, and a fast iterative method of factorisation based upon the policy improvement algorithm is still available; see Chapter 18. The continuous-time results are analogous, with, for example, 𝒜 having a representation A(𝒟) and its conjugate 𝒜̄ a representation A(−𝒟)ᵀ. The canonical factors have the characterisation given in Exercise 3.3.
6 THE CONCLUSIONS IN TRANSFER FUNCTION TERMS

The analysis of Sections 3 and 5 is almost in transfer function terms as it stands; completion of the view raises some interesting points. Consider the continuous-time version of system (20):

Φ(𝒟)Δ = ζ.   (36)

This then constitutes a filter (the optimal filter, as a transformation from input ζ to output Δ) with transfer function Φ(s)⁻¹. This conclusion seems so simple that one wonders whether the subtleties of canonical factorisation etc. were necessary. However, they were indeed so. For one thing, the symmetry of Φ implies that the system (36) is unstable as a forward dynamic system, and that the corresponding filter cannot be both causal and stable. That it is not causal is, of course, because the optimal control has a feedforward component, anticipating the effect of future disturbances. The stable inversion must be of the form
Δ(t) = ∫_{−∞}^{∞} g(τ)ζ(t − τ) dτ   (37)
where g(τ) is determined from the Fourier inversion

g(τ) = (1/2πi) ∫_{−i∞}^{i∞} Φ(s)⁻¹ e^{sτ} ds   (38)

(i.e. by taking the contour of integration as the imaginary axis in s-space when inverting the Laplace transform; cf. Theorem 4.2.2(ii)). The transient response
g(τ) for τ > 0 (τ < 0) is made up of contributions from the poles of the integrand corresponding to zeros of |Φ(s)| in the left (right) half of the complex plane. It is the separation of these two sets of singularities which corresponds to the canonical factorisation of Φ. The equivalent of a canonical factorisation cannot be avoided, and remains to be faced in the apparently simple inversion (37). The second point is that the filter view of relation (36) implies that it holds for all time τ, whereas we suppose that it holds only for τ ≥ t, where t is the 'now' of the optimiser. That is, in order to develop the optimal control rule in closed-loop form, we assume that optimisation of control from time t does not presume optimality of control before time t. This is the reason why the set of optimality conditions (36) is inverted only partially, to Φ₊(𝒟)Δ = Φ₋(𝒟)⁻¹ζ, rather than to Δ = Φ(𝒟)⁻¹ζ. A control rule deduced from the second relation would be open-loop, even if optimal. However, if this complete filter inversion is not useful for determining policy, it is useful for determining some aspects of performance. Consider the particular case in which u^c and d are both identically zero and the command signal r is prescribed in all components, so that, as in Sections 4.1 and 4.10, we consider the response of tracking error x − r to command signal r. In this case the system (36) reduces to
Φ(𝒟)Δ = (0, 0, −𝒜r)ᵀ,
where the vector Δ on the left can be regarded as system error, the quantity that one would wish to tend to zero with the passage of time. If the optimal control has been used at all times and
r(τ) = Σ_j w_j e^{s_j τ}   (39)
then the contribution to system error of the jth term is

Φ(s_j)⁻¹ (0, 0, −A(s_j)w_j)ᵀ e^{s_j τ}.
We can thus assert

Theorem 6.6.1 The command signal (39) will ultimately be followed with zero error by the optimally controlled process if A(s_j)w_j = 0 for all j such that Re(s_j) ≥ 0.
This of course is simply a restatement of the conclusion already reached in Section 2.9: that there will be zero offset in the limit if all unstable modes of the command signal satisfy the uncontrolled plant equation.
The conclusion extends to the case when only certain functions of x are subject to command. Suppose that only the linear combination Dx^c is prescribed, and has the form

Dx^c(τ) = Σ_j v_j e^{s_j τ}.   (40)
Theorem 6.6.2 The partial command signal (40) will ultimately be followed with zero error by the optimally controlled process if for each j such that Re(s_j) ≥ 0 one can find a vector w_j such that A(s_j)w_j = 0 and Dw_j = v_j.
This corresponds to a completion of r to a form (39) which is consistent with prescription (40) and with zero limiting tracking error. Let us calculate what the asymptotic offsets would be in the case of constant x^c, u^c and zero d. The asymptotic values of x and u are then determined by
[ R    Sᵀ   A₀ᵀ ] [ x ]     [ Rx^c + Sᵀu^c ]
[ S    Q    B₀ᵀ ] [ u ]  =  [ Sx^c + Qu^c  ]
[ A₀   B₀   0   ] [ λ ]     [      0       ]
Here the matrix in the left-hand member is just Φ(0). Recall the assertions that Φ is self-conjugate, and that it possesses a canonical factorisation if it is strictly positive on the imaginary axis. The second assertion would certainly need modification in our Lagrangian formalism, when our (intentionally) unreduced time-integral is not positive-definite as a quadratic form, but of mixed character. This mixed character is intrinsic and, in the later risk-sensitive context, significant.
CHAPTER 7

The Pontryagin Maximum Principle

The Pontryagin maximum principle states a method of direct trajectory optimisation which we have in fact already seen in two contexts and by two approaches. These are the approaches which yield both a rapid development of the formalism and a feeling for its meaning. In Section 2.10 we considered perturbations from an optimal trajectory and, in the continuous-time case, deduced the first-order relations (2.76) and (2.77) by dynamic programming methods. These can be regarded as necessary conditions for optimality of the trajectory, at least when the derivatives invoked exist. Then, in the last chapter, we found Lagrangian methods efficacious for a direct optimisation of trajectory. This was for the LQ case, but perhaps generalises. The maximum principle was first seen as replacing the classical calculus of variations, which it superseded because it could deal with cases for which the optimal control turned out to be discontinuous. One finds such discontinuous controls in the 'bang-bang' control of the Bush problem (Section 6), rocket thrust programming (Section 7) and even our fishing problem (Section 2.7). The principle also revealed an attractive formalism, a Hamiltonian structure analogous to that of classical mechanics. The reason for this is that a Hamiltonian structure appears whenever an incomplete specification of dynamics is supplemented by an extremal principle. In the control case the dynamical specification is incomplete because the plant equation determines process dynamics for given control values, but gives no guide for the determination of these control values. The guide is supplied by the extremal principle of cost-minimisation. A fully rigorous derivation of the principle would be, at the present, both lengthy and unattractive.
We shall rather adopt a heuristic approach which reveals the structure very quickly and which provides a machinery for the quick generation, in any particular case, of the assertions which one may expect to hold. There is one interesting point, however. Versions of the principle hold in both discrete and continuous time, but in fact under milder conditions in the continuous-time case, for reasons explained in Section 1. Because of this, and because of the greater importance of the continuous-time case in the control context, we shall devote most of the treatment to this case.
1 THE PRINCIPLE AS A DIRECT LAGRANGIAN OPTIMISATION

We saw in the last chapter how effective direct trajectory optimisation was for the LQ model, especially in the infinite-horizon limit. It is then natural to ask whether the same methods could not be carried over to other models. That is, consider a discrete-time model with the state structure implied by the plant equation (2.2) and the cost function (2.3), and with a vector-valued state variable. Regard the plant equation at time τ as a constraint on the {x, u} path, with which can be associated a vector Lagrangian multiplier λ_τ. The Lagrangian form
Σ_τ [c(x_τ, u_τ, τ) + λ_τᵀ(x_τ − a(x_{τ−1}, u_{τ−1}, τ))] + C_h(x_h)   (1)
should then be extremised with respect to the {x, u, λ} path, for given initial conditions. We have hitherto supposed the horizon point h prescribed, with h = +∞ as a limit case. It is conceivable, however, that the Lagrangian approach could be valid under other stopping rules; e.g. that h is the first time at which x enters some set. For instance, the flight of an aircraft terminates at the moment when it comes to rest on the ground, which may be at a time and in a manner unscheduled. The formalism that one derives this way is exactly the formalism of the so-called 'discrete maximum principle', the maximum principle in a discrete-time formulation. However, the conclusions are correct under only quite restrictive conditions, because the strong form of Lagrangian methods to which we wish to appeal is valid only under such conditions. Briefly, we would like to assert that necessary and sufficient conditions for optimality of the {x, u} path are that the Lagrangian form (1) should be minimal with respect to {x, u} and maximal with respect to {λ}. This assertion is certainly true under the following hypotheses: that the set of permitted paths {x, u} is convex, that c is convex in its x, u arguments, that a is linear in its x, u arguments, that the horizon is fixed and finite, plus growth conditions ensuring boundedness of optimal path and cost. These hypotheses were all satisfied for the LQ model of the last chapter, at least so long as the horizon was held finite. However, if they are weakened then one can be sure of nothing without further investigation. The continuous-time analogue of the Lagrangian form (1) would be
∫₀^h [c(x, u, τ) + λᵀ(ẋ − a(x, u, τ))] dτ + C(x(h), h)   (2)
in the notation of Section 2.6. One now has a continuous infinity of variables, and so would expect additional reasons for possible failure of the Lagrangian approach. Greater care is indeed necessary, but the continuous-time case presents one significant simplification. Suppose that the control variable u is also vector-valued, but with its values restricted to a set 𝒰 which may very well not be
convex. However, by varying u rapidly relative to the rate at which x is changing one can effectively achieve any value of u which is an average of values in 𝒰. That is, 𝒰 is effectively replaced by its convex hull, a convex set. This is the intuitive content of the so-called 'chattering lemma'. This differing behaviour manifests itself in that the dynamic programming equations yield a version of the maximum principle in continuous time which can only be equalled in strength in discrete time if one makes restrictive assumptions. We shall then associate the principle almost entirely with the continuous-time case. The LQ model remains the outstanding example of a case in which the discrete-time maximum principle is valid and useful; we indicate a couple of others in the exercises.

Exercises and comments
(1) Economic growth Consider the dynamic allocation problem of Section 3.4, where x_t is the 'activity vector' at time t, the vector of intensities at which the various activities are pursued. Thus x_t ≥ 0, and resource limitations enforce the plant equation Ax_t ≤ b + Bx_{t−1}. Suppose that utility is linear: that one wishes to maximise Σ_t cᵀx_t, where the kth component of c is the rate at which activity k delivers 'utility'. Application of Lagrangian techniques is valid; the Lagrangian form would be

Σ_{t=1}^{h} [cᵀx_t + λ_tᵀ(b + Bx_{t−1} − Ax_t − z_t)].
Here z_t ≥ 0 is the margin of inequality in the plant equation; the vector of amounts of resources unused at stage t. The multiplier has the familiar price interpretation; the jth element of λ_t is the effective unit price of resource j at time t. Maximisation with respect to z_t yields the conclusion λ_t ≥ 0, with equality in those components for which the corresponding component of z_t is positive. Maximisation with respect to x_t yields c + Bᵀλ_{t+1} − Aᵀλ_t ≤ 0, with equality in those components for which the corresponding component of x_t is positive. If the system is self-sufficient, in that it can maintain itself in the absence of external supplies, then, at times remote from both the beginning and the end of the optimisation period, it settles on to a maximal growth path (the turnpike of the economists), for which x_t is of the form ρᵗx̄ for some fixed x̄ and some ρ ≥ 1. The maximal growth rate ρ and the direction x̄ of the optimal path are determined by
ρ = max_{x̄ ≥ 0} min_j [(Bx̄)_j / (Ax̄)_j],   (3)

so that ρ is just the maximal root of |ρA − B| = 0.
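The maximal root of |ρA − B| = 0 is a generalised eigenvalue, and so is easily computed. A minimal numerical sketch, with invented input and output matrices A and B for a two-activity economy:

```python
import numpy as np

# Invented matrices: running activity j at unit intensity consumes
# column j of A and yields column j of B (illustrative numbers only).
A = np.array([[1.0, 0.1],
              [0.0, 2.0]])
B = np.array([[1.2, 0.1],
              [0.2, 1.1]])

# |rho A - B| = 0 is equivalent to rho being an eigenvalue of A^{-1} B.
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(A, B))
i = int(np.argmax(eigvals.real))
rho = float(eigvals[i].real)          # maximal growth rate
x_bar = eigvecs[:, i].real
x_bar = x_bar / x_bar.sum()           # growth direction, normalised

print(rho > 1.0, np.all(x_bar >= 0))
```

For these numbers the economy is productive (ρ > 1) and the turnpike direction x̄ has nonnegative components, as the economic interpretation requires.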
The von NeumannGale model (see e.g. Gale, 1960, 1967, 1968) generalises this in that it relaxes the assumptions of linear dependence of consumption, production and utility on x. Convexity assumptions must be retained, however, or the two extremal operations in the analogue of (3) no longer commute, and the concept of an optimal growth rate has to be qualified. (2) Optimal dosage Consider the plant equation x, = Ax11 +__!!! in scalar variables, where the controls u are to be chosen to minimise ~}(xr  X,:) 2 subject only to u ;;;:: 0. This would then be an LQ tracking problem but for the fact that control costing has been replaced by the positivity condition u ;;;:: 0. One might regard x as the concentration of a drug in a patient's body, attenuating at rate A in the absence of further administration, but maintained by dosage u. The sequence { u~} is the desired concentration profile. (It is convenient to make a couple of temporary changes of convention: we write u1 rather than u,_ 1 in the plant equation, and shall change the sign of A.) The Lagrangian form is
Σ_{t=1}^{h} [½(x_t − x_t*)² + λ_t(Ax_{t−1} + u_t − x_t)].
One then finds the conditions λ_t ≥ 0, with equality if u_t > 0, and

x_t − x_t* = λ_t − Aλ_{t+1}.

These determine whether medication should begin immediately; if, on the other hand, the relevant initial quantity is non-positive, then medication begins first at some later time t₁. We have used the notation d to indicate potential generalisations of the argument. The problem has something in common with the production problem of Section 4.3. In the case A = 1 it reduces to the so-called 'monotone regression' problem, in which one tries to approximate a sequence {x_t*} as closely as possible by a nondecreasing sequence {x_t}. In this case X_t is the smallest concave sequence which exceeds X_t*, and its 'free segments' are straight lines.
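For the A = 1 case, the monotone-regression reduction invites a direct computation. The sketch below is an illustrative pool-adjacent-violators routine (not Whittle's own algorithm): it fits the closest nondecreasing sequence in the least-squares sense, and the pooled blocks of equal fitted values correspond to the 'free segments' mentioned above:

```python
def monotone_regression(target):
    """Least-squares fit of a nondecreasing sequence to `target`
    by pooling adjacent violators; each block keeps (sum, count)."""
    blocks = []
    for v in target:
        blocks.append([v, 1])
        # merge while the newest block's mean violates monotonicity
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] >= blocks[-1][0] / blocks[-1][1]:
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fit = []
    for s, n in blocks:
        fit.extend([s / n] * n)
    return fit

print(monotone_regression([1.0, 3.0, 2.0, 4.0]))   # -> [1.0, 2.5, 2.5, 4.0]
```

The violating pair (3, 2) is pooled to its mean 2.5, exactly the kind of averaging over a 'free segment' that the text describes.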
2 THE PONTRYAGIN MAXIMUM PRINCIPLE

The maximum principle (henceforth abbreviated to MP) is a direct optimality condition on the path of the process. It is a calculation for a fixed initial value x of state, whereas the DP approach is a calculation for a generic initial value. It can be regarded as both a computational and an analytic technique (and in the second case will then solve the problem for general initial value). The proof of the fact that derivatives etc. exist in the required sense is a very technical and lengthy matter, which we shall not attempt. It is much more important to have a feeling for the principle and to understand why it holds, coupled with an appreciation that caution may be necessary. We shall give a heuristic derivation based upon the dynamic programming equation, which is certainly the directest and most enlightening way to derive the conclusions which one may expect to be valid. A conjugate variable p will make its appearance. This corresponds to the multiplier vector λ, the identification in fact being p = λᵀ, so that p is a row vector. The row notation p fits in naturally with gradient and Hamiltonian conventions; the column notation λ is better when, as in equation (6.20), we wish to write all the stationarity conditions as a single equation system. We shall refer to p as either the 'conjugate variable' or the 'dual variable'. Note the conventions on derivatives listed in Appendix 1: in particular, that the vector of first derivatives of a scalar variable with respect to a column (row) vector variable is a row (column) vector. Consider first a time-invariant formulation. The state variable x is a column vector of dimension n; the control variable u may take values in a largely arbitrary set 𝒰. We suppose plant equation ẋ = a(x, u), instantaneous cost function c(x, u), and that the process stops when x first enters a prescribed stopping set 𝒮, when a terminal cost 𝕂(x) is incurred.
The value function F(x) then obeys the dynamic programming equation
inf_u (c + F_x a) = 0   (x ∉ 𝒮),   (5)
with the terminal condition

F(x) = 𝕂(x)   (x ∈ 𝒮).   (6)
The derivative F_x may well not exist if x is close to the boundary of a forbidden region (within which F is effectively infinite), or even if it is close to the boundary of a highly penalised but avoidable region (when F will be discontinuous at the boundary). We have already seen examples of this in Exercise 2.6.2 and shall see others in Section 10. However, let us suppose for the moment that x is on a free orbit, on which any perturbation δx in position changes F only by a term F_x δx + o(δx). Define the conjugate variable

p = −F_x   (7)
(a row vector, to be regarded as a function of time p(t) on the path) and the Hamiltonian

H(x, u, p) = pa(x, u) − c(x, u)   (8)

(a scalar, defined at each point on the path as a function of current x, u and p).
*Theorem 7.2.1 (The Pontryagin maximum principle on a free orbit; time-invariant version) (i) On the optimal path the variables x and p obey the equations

ẋ = H_p [= a(x, u)]   (9)

ṗ = −H_x   (10)

and the optimal value of u(t) is the value of u maximising H[x(t), u, p(t)].
(ii) The value of H is identically zero on this path.

*Proof Only assertions (9) and (10) need proof; the others follow from the dynamic programming equation (5) and the definition (7) of p. Assertion (9) is obviously valid. To demonstrate (10), write the dynamic programming equation in incremental form as

F(x) = inf_u [c(x, u)δt + F(x + a(x, u)δt)] + o(δt).   (11)
Differentiation with respect to x yields

p(t) = −c_x δt + p(t + δt)[I + a_x δt] + o(δt),

whence (10) follows. □
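The assertions of the theorem can be checked on the simplest scalar example (my own illustration, not from the text): take ẋ = u and c = ½(x² + u²). Then H = pu − ½(x² + u²) is maximal at u = p, and equations (9)-(10) read ẋ = p, ṗ = x. Starting on the optimal manifold p = −x, H should remain numerically zero along the orbit:

```python
# Scalar check of Theorem 7.2.1: xdot = u, c = (x^2 + u^2)/2.
# H(x, u, p) = p*u - (x^2 + u^2)/2 is maximal at u = p, and the
# canonical equations (9)-(10) become xdot = p, pdot = x.
x, p = 1.0, -1.0            # start on the optimal manifold p = -x
dt = 1e-4
for _ in range(100_000):    # integrate ten time units by Euler steps
    x, p = x + dt * p, p + dt * x

u = p                       # the H-maximising control
H = p * u - 0.5 * (x * x + u * u)
print(abs(H))
```

The relation p = −x is preserved exactly by the symmetric Euler steps, so H = ½(p² − x²) stays at zero throughout, as assertion (ii) requires.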
The fact that the principle is such an immediate consequence of the dynamic programming equation may make one wonder what has been gained. What has been gained is that, instead of having to solve the partial differential equation (5) (with its associated extremal condition on u) over the whole continuation set, one has now merely to solve the two sets of ordinary differential equations (9) and (10) (with the associated extremal condition on u) on the orbit. These conditions on the orbit are indeed those which one would obtain by a formal extremisation of the Lagrangian form (2) with respect to x, u and λ, as we leave the reader to verify. Note that the equations (9) and (10) demand only stationarity of the Lagrangian form with respect to the λ- and x-paths, whereas the condition with respect to u makes the stronger demand of maximality. It is (9) and (10) which one would regard as characterising Hamiltonian structure; they follow by extremisation of an integral ∫[pẋ − H(x, p)] dt with respect to the (x, p) path. A substantial question, which we shall defer to the next section, is that of the terminal conditions which hold when x encounters 𝒮. Let us first transfer the conclusions above to the time-dependent case, when a, c, 𝒮 and 𝕂 may all be t-dependent. The DP equation for F(x, t) will now be

inf_u (c + F_t + F_x a) = 0   (12)
outside 𝒮, with F(x, t) = 𝕂(x, t) for (x, t) in 𝒮. However, we can reduce this case to a formally time-invariant case by augmenting the state variable x by the variable t. We then have the augmented variable

p̃ = [p  p₀],   (13)

where the scalar p₀ is to be identified with −F_t. However, we shall still preserve the same definition (8) of H as before, so that, as we see from (12), the relation

H = F_t = −p₀   (14)

holds on the optimal orbit.
Theorem 7.2.2 (The Pontryagin maximum principle on a free orbit) (i) The assertions of Theorem 7.2.1(i) still hold, but equation (10) is now augmented by the relation

ṗ₀ = −H_t = c_t − p a_t.   (15)

(ii) H + p₀ is identically zero on an optimal orbit. Suppose the system time-homogeneous, in that a and c are independent of t. Then H is constant on an optimal orbit.
Proof All assertions save the last are simple translations of the assertions of Theorem 7.2.1. If a and c are independent of t then we see from (15) that p₀ is constant on an optimal orbit, whence the final assertion follows. □

However, the essential assertions of the maximum principle are those expressed in Theorem 7.2.1(i) which, as we see, transfer to the time-dependent case unchanged. Note that H is now a function of t as well as of x(t), u(t) and p(t).

Exercises and comments
(1) As indicated above, one can expect H to be identically zero on an optimal orbit when the process is intrinsically time-invariant and the total cost F(x) is well-defined. The case of a scalar state variable is then particularly amenable. By eliminating p from the two relations, that H is identically zero and that it is maximal with respect to u, one derives the optimal control rule in closed-loop form.

(2) Suppose that the process is time-invariant and has a well-defined average cost γ. The total future cost is then F(x, t) = f(x) − γt plus an 'infinite constant' representing a cost of γ per unit time in perpetuity. We thus have H = −p₀ = F_t = −γ, so that the constant value of H can be identified with the average reward rate −γ. In the scalar case the optimal control rule can be determined, at least implicitly, from H + γ = 0 and the u-maximality condition. The equilibrium point is then determined from H_x = H_p = 0; γ can then be evaluated as minus the value of H at this equilibrium point.

3 TERMINAL CONDITIONS

The most obvious example of departure from a free orbit is at termination of the path on the stopping set. Since the path is continuous, there is the obvious matching condition: that the terminal point is the limit point along the path. However, if one may vary the path so as to choose a favourable terminal point, then there will also be optimality conditions. The rigorous statement of these terminal conditions can be quite difficult if one allows rather general stopping sets and terminal costs. We shall give only the assertions which follow readily in the most regular cases. However, even more difficult than termination is the case when parts of state space are forbidden, so constraining the path. (For example, an aircraft may be compelled to avoid obstacles, or an industrialist may not be allowed to incur debt, even temporarily.) In such a case the optimal path must presumably skirt the boundary of the forbidden region for a time before resuming a free path. The special conditions which hold on entry to and exit from such restricted phases are termed transversality conditions; we shall consider them in Section 11. Consider first the fully time-invariant case. One then has the terminal condition F(x) = 𝕂(x) for x in the stopping set 𝒮. However, one can appeal to
this as a continuity condition, that F(x) → 𝕂(x̄) as x (outside 𝒮) approaches x̄ (inside 𝒮), only if x̄ is the optimal termination point for some free trajectory. Obviously x̄ must lie on the boundary ∂𝒮 of 𝒮, since motion is continuous. However, we shall see from the examples of the next section that there may be points on this boundary which are so costly that they are not optimal terminal points for any trajectory terminating in 𝒮. Let 𝒮_opt denote the set of possible optimal termination points. Partial integration of the Lagrangian expression for cost minimisation throws it into the form
∫₀^t̄ (pẋ − H) dτ + 𝕂(x̄) = −∫₀^t̄ (ṗx + H) dτ + p̄x̄ + 𝕂(x̄) − p(0)x(0)   (16)
where the overbar indicates terminal values. Let α be a direction from x̄ into 𝒮_opt, in that there is a value x_ε = x̄ + εα + o(ε) which lies in 𝒮_opt for all small enough values of the positive scalar ε. If x̄ is an optimal termination point for the trajectory under consideration then we deduce from (16) that p̄x̄ + 𝕂(x̄) ≤ p̄x_ε + 𝕂(x_ε).

If r₀ > 0 then v > 0 at termination. There is thus no NT arc; maximal thrust is applied throughout. If r₀ = 0 then it follows again from H = 0 that v = 0 at termination. If r₀ < 0 then v < 0 at termination. These are cases in which the thrust is insufficient to lift the rocket. If initially the rocket happens to be already rising then maximal thrust is applied until the rocket is on the point of reversing, which is taken as the terminal instant. If the rocket happens to be already falling then termination is immediate. This last discussion illustrates the literal nature of the analysis. In discussing all the possible cases one comes across some which are indeed physically possible but which one would hardly envisage in practice.

Exercises and comments

(1) An approximate reverse of the sounding rocket problem is that of soft landing: to land a rocket on the surface of a planet with prescribed terminal velocity in such a way as to minimise fuel consumption. It may be assumed that gravitational forces are vertical and constant, that there is no atmosphere and that all motion is vertical. Note that equation (42) remains valid. Hence show that the thrust programme must consist of a phase of null thrust followed by one of maximal thrust upwards (the phases possibly being of zero duration). How is the solution affected if one also penalises the time taken?
8 PROBLEMS WHICH ARE PARTIALLY LQ

LQ models can be equally well treated by dynamic programming or by the maximum principle; one treatment is in fact only a slightly disguised version of
the other. However, there is a class of partially LQ models for which the maximum principle quickly reveals some simple conclusions. We shall treat these at some length, since the conclusions are both explicit and transfer in an interesting way to the stochastic case (see Chapter 24). Assume vector state and control variables and a linear plant equation
ẋ = Ax + Bu.   (43)
Suppose an instantaneous cost function

c(u) = ½uᵀQu,   (44)
which is quadratic in u and independent of x altogether. We shall suppose that the only state costs are those incurred at termination. The analysis which follows remains valid if we allow the matrices A, B and Q to depend upon time t, but we assume them constant for simplicity. However, we shall allow termination rules which are both time-dependent and non-LQ, in that we shall assume that a terminal cost 𝕂(ξ) is incurred upon first entry to a stopping set of ξ-values 𝒮, where ξ is the combined state/time variable ξ = (x, t). We assume that any constraint on the path is incorporated in the prescription of 𝒮 and 𝕂, so that ξ-values which are 'forbidden' belong to 𝒮 and carry infinite penalty. The model is thus LQ except at termination. The assumption that state costs are incurred first at termination is realistic under certain circumstances. For example, imagine a missile or an aircraft which is moving through a region of space which (outside the stopping set 𝒮) is uniform in its properties (i.e. in gravitational force and air density). Then no immediate position-dependent cost is incurred. This does not mean to say that spatial position is immaterial, however; one will certainly avoid any configuration of the craft which would take it to the wrong target or (in the case of the aircraft) lead it to crash. In other words, one will try to so manoeuvre the craft that flight terminates favourably, in that the sum of control costs and terminal cost is minimal. This will be the interest of the problem: to chart out a course which is both economical and avoids hazards (e.g. mountain peaks) which would lead to premature termination. The effect of such hazards is even more interesting in the stochastic case, when even the controlled path of the craft is not completely predictable. It is then not enough to scrape past a hazard; one must allow a safe clearance. The analysis of this section has a natural stochastic analogue, which we pursue in Chapter 24.
The Hamiltonian for the problem is

H(x, u, p) = λᵀ[Ax + Bu] − ½uᵀQu

if we take p = λᵀ as the multiplier. It thus follows that on a free section of the optimal path (i.e. a section clear of 𝒮)
u = Q⁻¹Bᵀλ,   (45)

λ̇ = −Aᵀλ.   (46)
Consider now the optimal passage from an initial position ξ = (x, t) to a terminal position ξ̄ = (x̄, t̄) by a path which we suppose to be free. We shall correspondingly denote the terminal value of λ by λ̄, and shall denote time-to-go t̄ − t by s. It follows then from (45) and (46) that the optimal value of u is given by

u(τ) = Q⁻¹Bᵀe^{Aᵀ(t̄−τ)}λ̄.   (47)

Inserting this expression for u back into the plant equation and cost function we find the expressions
x̄ − e^{As}x = V(s)λ̄,   (48)

F(ξ, ξ̄) = ½λ̄ᵀV(s)λ̄,   (49)

for terminal x and total cost in terms of λ̄, where

V(s) = ∫₀^s e^{Ar} J e^{Aᵀr} dr   (50)
and J = BQ⁻¹Bᵀ, as ever. In (50) we recognise just the controllability Gramian. Solving for λ̄ from (48) and substituting in (47) and (49) we deduce
Theorem 7.8.1 Assume the model specified above. Then
(i) The minimal cost of free passage from ξ to ξ̄ is

F(ξ, ξ̄) = ½(x̄ − e^{As}x)ᵀV(s)⁻¹(x̄ − e^{As}x),   (51)

and the open-loop form of the optimal control at ξ = (x, t) is

u = Q⁻¹Bᵀe^{Aᵀs}V(s)⁻¹(x̄ − e^{As}x),   (52)

where s = t̄ − t.
(ii) The minimal cost of passage from ξ to the stopping set 𝒮, by an orbit which is free before termination, is

F(ξ) = inf_{ξ̄ ∈ 𝒮} [F(ξ, ξ̄) + 𝕂(ξ̄)],   (53)

if this can be attained.
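These formulae lend themselves to a direct numerical check. In the sketch below (an invented double-integrator example; A is nilpotent there, so e^{Ar} = I + Ar exactly) the Gramian (50) is accumulated by quadrature, λ̄ is obtained from (48), and simulation of the open-loop control (47) is confirmed to reproduce the terminal state and the cost (49):

```python
import numpy as np

# Double integrator: position/velocity state, scalar control (invented numbers).
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
B = np.array([[0.0],
              [1.0]])
Q = np.array([[1.0]])
J = B @ np.linalg.inv(Q) @ B.T

def expA(r):
    # A @ A = 0 here, so the matrix exponential truncates exactly
    return np.eye(2) + A * r

s = 2.0                       # time to go
x0 = np.array([1.0, 0.0])
xbar = np.array([0.0, 0.0])   # target terminal state

# controllability Gramian V(s) = int_0^s e^{Ar} J e^{A'r} dr (Riemann sum)
n = 20000
dr = s / n
V = sum(expA(k * dr) @ J @ expA(k * dr).T * dr for k in range(n))

lam_bar = np.linalg.solve(V, xbar - expA(s) @ x0)   # from (48)
F = 0.5 * lam_bar @ V @ lam_bar                     # predicted cost, (49)

# simulate the open-loop control u(tau) = Q^{-1} B' e^{A'(s-tau)} lam_bar, (47)
Qi = np.linalg.inv(Q)
x, cost = x0.copy(), 0.0
dt = s / n
for k in range(n):
    tau = k * dt
    u = Qi @ B.T @ expA(s - tau).T @ lam_bar
    cost += 0.5 * float(u @ Q @ u) * dt
    x = x + dt * (A @ x + B @ u)

print(np.allclose(x, xbar, atol=1e-2), abs(cost - F) < 1e-2)
```

For these numbers the exact Gramian is V(s) = [[s³/3, s²/2], [s²/2, s]] and the minimal cost of the passage works out to 0.75, which the simulation reproduces to discretisation accuracy.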
Expression (52) still gives the optimal control rule, with ξ̄ determined by the minimisation in (53). This value of ξ̄ will be constant along an optimal path. We have in effect used the simple and immediate consequence (47) of the maximum principle to solve the dynamic programming equation. Relation (52) is indeed the closed-loop rule which one would wish, but one would never have
imagined that it would imply the simple course (47) of actual control values along the optimal orbit. Solution of (43) and (47) yields the optimal orbit as

x(τ) = e^{A(τ−t)}x + V(τ − t)e^{Aᵀ(t̄−τ)}λ̄   (t ≤ τ ≤ t̄),

where x = x(t) and λ̄ is determined by (48). For validity of evaluation (53) it is necessary that this orbit should not meet the stopping set 𝒮 before time t̄. Should it do so, then the orbit will have to break up into more than one free section, these sections being separated by grazing encounters with 𝒮 at which special transition conditions will hold. We shall consider some such cases by example in the following sections.

9 CONTROL OF THE INERTIALESS PARTICLE
The examples we now consider are grossly simpler than any actual practical problem, but bring out points which are important for such problems. We shall be able to generalise these to the stochastic case (see Chapter 24), where they are certainly non-trivial. Let x be a scalar, corresponding to the height of an aircraft above level ground. We shall suppose that the craft is moving with a constant horizontal velocity, which we can normalise to unity, so that time can be equated to horizontal distance travelled. We suppose that the plant equation is simply
x=u,
(.54)
i.e. that velocity equals control force applied. This would represent the dynamics of a mass moving in treacle: there are no inertial effects, and it is velocity rather than acceleration which is proportional to applied force. We shall then refer to the object being controlled as an 'inertialess particle'; inertialess for the reasons stated, and a particle because its dynamic state is supposed specified fully by its position. It is then the lamest possible example of an aircraft; it not merely shows no inertia, but also no directional effects, no angular inertia and no aerodynamic effects such as lift. We shall use the term 'aircraft' for vividness, however, and as a reminder of the physical object towards whose description we aspire by elaboration ofthe model. We have A = 0, B = l.We thus see from (45)/(46) that the optimal control value u is constant along a free section of the orbit, whence it follows from the plant equation (54) that such sections of orbit must be straight lines. We find that V(s) = Q 1s, so that F(~, ~) = Q(x x) 2 /2s. Suppose thatthe stopping set is the solid, level earth, so that the region of free movement is x > 0 and the effective stopping set is the surface of the ground, x = 0. The terminal cost can then be specified as a function IK( t) of time (i.e. distance along the ground). The expressions (53) and (52) for the value function and the closedloop optimal control rule then become
THE PONTRYAGIN MAXIMUM PRINCIPLE
F(x, t) = inf_s [Qx²/(2s) + K(t + s)],   (55)

u = −x/s.   (56)
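The interplay of (55) and (56) is easily exercised numerically. The sketch below is not from the text: the particular terminal cost K is an invented quadratic penalty about a preferred landing point, and the minimisation in (55) is done by crude grid search.

```python
# Grid-search illustration of (55)-(56) for the inertialess particle.
# K(t_bar) is an invented quadratic penalty about a preferred landing
# point t_bar = 10; Q is the control-cost weight.
Q = 1.0

def K(t_bar):
    return 0.5 * (t_bar - 10.0) ** 2

def value(x, t, s):
    # bracketed expression in (55): free-passage cost plus terminal cost
    return Q * x ** 2 / (2.0 * s) + K(t + s)

def optimal_s(x, t):
    # the minimisation in (55), by grid search over the time-to-go s
    return min((0.01 * k for k in range(1, 2001)),
               key=lambda s: value(x, t, s))

x, t = 4.0, 0.0
s = optimal_s(x, t)
u = -x / s          # closed-loop rule (56): constant rate of descent

# moving along the optimal path for one time unit and re-solving
# should return (nearly) the same landing point t_bar = t + s
x2, t2 = x + u * 1.0, t + 1.0
s2 = optimal_s(x2, t2)
```

The re-solve at (x2, t2) illustrates the consistency remarked in the text: the rule (56) holds the landing point fixed along the optimal straight-line path.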
Here the time-to-go s must be determined from the minimisation in (55), which determines the optimal landing point t̄ = t + s. The rule (56) is indeed consistent with a constant rate of descent along the straight-line path joining (x, t) and (0, t̄). However, suppose there is a sharp mountain between the starting point and the desired terminal point t̄ determined above, sufficiently high that the path determined above will not clear it. That is, if the peak occurs at coordinate t₁ and has height h then we require that x(t₁) > h. If the optimal straight-line path determined above does not satisfy this then the path must divide into two straight-line segments, as illustrated in Figure 6. The total cost of this compound path is
Q(x − h)²/[2(t₁ − t)] + F(h, t₁),   (57)

where F is the 'free path' value function determined in (55). It is more realistic to regard a crash on the mountain as carrying a high cost, K₁ say, rather than as prohibited. In the stochastic case this is the view that one must necessarily take, because then the crash outcome always has positive probability. If one resigns oneself and chooses to crash then there will be no control cost at all and the total cost incurred will be just K₁. One will then choose the crash option if one is in a position (x, t) for which expression (57) exceeds K₁.
s > −3x/v,   w > −(3x + vs)²/(4xs).   (65)
If these inequalities do indeed both hold then the optimal path suffers a grazing encounter with the ground after a time s₁ = t₁ − t determined as the unique positive root of

(3x + vs₁)²/s₁⁴ = w²/(s − s₁)²   (66)

which is less than −3x/v. The first condition of (65) states that the prescribed landing time must be later than the time at which it would be optimal to pull out of the dive without consideration of the sequel. The second condition implies that, even if the first holds, the optimal path to the termination point will still be free if the required terminal rate of descent is large enough. That is, if one is willing to accept a crash landing at the destination!
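Conditions (65) and the root of (66) are easy to check numerically. The sketch below uses invented numbers, with the equations in the form reconstructed here: for data satisfying (65) it locates the root s₁ of (66) in (0, −3x/v) by bisection, and checks it against the closed-form root of the associated quadratic quoted at the end of Section 10.

```python
# Numerical check of (65)-(66) for the inertial particle; the numbers
# are invented. x = height, v = initial rate of climb (negative: a
# dive), w = prescribed terminal rate of climb, s = time to landing.
x, v, w, s = 1.0, -1.0, -0.05, 8.0

# the pair of conditions (65)
assert s > -3 * x / v
assert w > -(3 * x + v * s) ** 2 / (4 * x * s)

def g(s1):
    # difference of the two members of (66) after taking square roots
    # with the negative option: (3x + v*s1)/s1**2 + w/(s - s1)
    return (3 * x + v * s1) / s1 ** 2 + w / (s - s1)

# g decreases from +inf to a negative value on (0, -3x/v), so bisect
lo, hi = 1e-9, -3 * x / v - 1e-9
for _ in range(200):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
s1 = 0.5 * (lo + hi)
```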
10 CONTROL OF THE INERTIAL PARTICLE
Proof If the path is a free one then it has the cubic form given above. We may as well normalise the time origin by setting t = 0, and so t̄ = s in this relation. The coefficients α and β in the cubic are then determined by the terminal conditions x(s) = 0, ẋ(s) = w. The cubic then has a root at τ = s, and one finds that the remaining two roots are the roots of the quadratic

τ² − 2aτ + b = 0,   (67)

where

a = s(x + vs)/{2[2x + (v + w)s]},   b = −xs²/[2x + (v + w)s].
The only case (consistent with x > 0, w < 0) in which the optimal path is not free is the case (iii) of Figure 7, so this is the one we must exclude. This will be the case in which the quadratic (67) has both roots between 0 and s. This in turn is true if and only if the quadratic expression is positive at both 0 and s, and has a turning point inside the interval at which it is negative. That is:

b > 0,   s² − 2as + b > 0,   0 < a < s,   a² > b.

We find that the first two of these conditions are both equivalent to

2x + (v + w)s < 0.   (68)

This last, with the inequality a > 0, implies that

x + vs < 0.   (69)

The condition a < s is equivalent to

3x + (v + 2w)s < 0,   (70)

and the condition a² > b is equivalent to

(3x + vs)² + 4xws > 0.   (71)

The free path is non-optimal if and only if all of relations (68)-(71) hold. Relations (70) and (71) give the bounds on w

−(3x + vs)²/(4xs) < w < −(3x + vs)/(2s).

The upper bound in this relation exceeds the lower bound by (3x + vs)(x + vs)/(4xs). It follows from (69) that the interval thus determined for w is empty unless 3x + vs < 0, a relation which implies (68), (69) and (70). We are thus left with the pair of conditions (65) asserted in the theorem. In the case that both these conditions are fulfilled the optimal path cannot be free, and is made up of two free segments meeting at time t₁ = t + s₁. We choose s₁ to minimise the sum of the costs incurred on the two free segments, as given by expression (60); the stationarity condition (66) emerges immediately. It follows
Figure 10 A graphical illustration of the solution of equation (66) for the optimal timing s₁ of the grazing point.
from the observation after Theorem 7.10.1 that the root s₁ must be less than −3x/v. Indeed, equation (66) has a single such root, as we see from Figure 10; the left- and right-hand members of (66) are respectively decreasing and increasing, as functions of s₁, in the interval 0 ≤ s₁ ≤ −3x/v. □

Indeed, we can determine s₁ explicitly. In taking square roots in (66) we must take the negative option on one side, because 3x + vs₁ is positive whereas w is negative. The appropriate root of the resulting quadratic in s₁ is

s₁ = [(3x − vs) − √((3x + vs)² − 12xws)] / [2(w − v)],

at least if w − v > 0, which we may expect. This approaches −3x/v as w tends to zero.

11 AVOIDANCE OF THE STOPPING SET: A GENERAL RESULT
The conclusions of Theorem 7.10.1 can be generalised, with implications for the stochastic case. We consider the general model of Section 8 and envisage a situation in which the initial conditions are such that the uncontrolled path would meet the stopping set 𝒮; one's only wish is to avoid this encounter in the most economical fashion. Presumably the optimal avoiding path will graze 𝒮 and then continue in a zero-cost fashion (i.e. subsequently avoid 𝒮 without further control). We shall speak of this grazing point as the 'termination' point, since it indeed marks the end of the controlled phase of the orbit. We then consider the linear system (43) with control cost (44). Suppose also that the stopping set 𝒮 is the half-space
𝒮 = {x: aᵀx ≤ b}.   (72)
This is not as special as it may appear; if the stopping set is one that is to be avoided rather than sought then its boundary will generally be (n − 1)-dimensional, and can be regarded as locally planar under regularity conditions. Let us denote the linear function aᵀx of state by z. Let F(ξ, ξ̄) be the minimal cost of transition by a free orbit from an initial point ξ = (x, 0) to a terminal point ξ̄ = (x̄, t), already evaluated in (51). Then we shall find that, under certain conditions, the grazing point ξ̄ of the optimal avoiding orbit is determined by minimising F(ξ, ξ̄) with respect to the free components of x̄ at termination and maximising it (at least locally) with respect to t. That is, the grazing point is, as far as timing goes, the most expensive point on the surface of 𝒮 on which to terminate from a free orbit. The principal condition required is that aᵀB = 0, implying that the component z is not directly affected by the control. This is equivalent to aᵀJ = 0. Certainly, if one is to terminate at a given time t then the value of z is prescribed as b, but one should then optimise over all other components of x̄. Let the value of F(ξ, ξ̄) thus minimised be denoted G(ξ, t). The assertion is then that the optimal value of t is that which maximises G(ξ, t). We need a preparatory lemma.

Lemma 7.11.1 Add to the prescriptions (43), (44) and (72) of plant equation, cost function and stopping set the conditions that the process be controllable and that aᵀB = 0. Then
∂G(ξ, t)/∂t = −[(b − aᵀe^{At}x)/(aᵀV(t)a)] ż.   (73)

Proof Optimisation of F(ξ, ξ̄) with respect to the free components of x̄ will imply that the λ̄ of (47) is proportional to a, and so can be written θa for some scalar θ. The values of θ and t are related by the termination condition z = b, which, in virtue of (48), we can write as

aᵀe^{At}x + θaᵀV(t)a = b.   (74)

We further see from (49) that

G(ξ, t) = ½θ²aᵀV(t)a.   (75)
Let us write V(t) and its derivative with respect to t simply as V and V̇. Controllability then implies that aᵀVa > 0. By replacing τ by t − τ in the integrand of (50) we see that we can write V̇ as

V̇ = J + AV + VAᵀ.   (76)

Differentiating (74) with respect to t we find that

(aᵀVa)(dθ/dt) + aᵀAe^{At}x + θaᵀV̇a = 0,   (77)
so that

∂G/∂t = ½θ²aᵀV̇a + θaᵀVa(dθ/dt) = −θaᵀAe^{At}x − ½θ²aᵀV̇a.

Finally, we have

ż = aᵀ(Ax + Jλ) = aᵀAx = aᵀAe^{At}x + θaᵀAVa = aᵀAe^{At}x + ½θaᵀV̇a.

The second equality follows because aᵀJ = 0 and the fourth by appeal to (76). We thus deduce from the last two relations that ∂G/∂t = −θż. Inserting the evaluation of θ implied by (74) we deduce (73). □
Theorem 7.11.2 The assumptions of Lemma 7.11.1 imply that the grazing point ξ̄ of the optimal 𝒮-avoiding orbit is determined by first minimising F(ξ, ξ̄) with respect to x̄ subject to aᵀx̄ = b and then maximising it, at least locally, with respect to t.

Proof As indicated above, optimality will require the restricted x̄-optimisation; we have then to show that the optimal t maximises G(ξ, t). At any value of t for which the controlled orbit crosses z = b the uncontrolled orbit will lie below, so that aᵀe^{At}x − b < 0. If, on the controlled orbit, z decreases through b at time t, then one will increase t in an attempt to find an orbit which does not enter 𝒮. We see from (73) that G(ξ, t) will then increase. Correspondingly, if z crosses b from below then one will decrease t, and G(ξ, t) will again increase. If the two original controlled orbits are such that the t values ultimately coincide under this exercise, then ż = 0 at the common value, so that the orbit grazes 𝒮 locally, and G(ξ, t) is locally maximal with respect to t. The true grazing point (i.e. that for which the orbit meets 𝒮 at no other point) will be found by repeated elimination of crossings in this fashion. □

One conjectures that G(ξ, t) is in fact globally maximal with respect to t at the grazing point, but this has yet to be demonstrated. We leave the reader to confirm that this criterion yields the known grazing point t = −3x/v for the crash-avoidance problem of Theorem 7.10.1, for which the condition aᵀB = 0 was indeed satisfied. We should now determine the optimal 𝒮-avoiding control explicitly.
Theorem 7.11.3 Define

Δ = Δ(x, s) = b − aᵀe^{As}x,   σ = σ(s) = √(aᵀV(s)a).

Then, under the conditions of Lemma 7.11.1, the optimal 𝒮-avoiding control at x is

u = (Δ/σ²) Q⁻¹Bᵀe^{Aᵀs}a,   (78)
where s is given the value which maximises (Δ/σ)². With this understanding, Δ/σ² is constant along the optimal orbit (before the grazing point) and s is the time remaining until grazing.

Proof Let us set ξ = (x, 0), ξ̄ = (x̄, s), so that x is the value of the state when a time s remains before termination. The cost of passage along the optimal free orbit from ξ to ξ̄ is

F(ξ, ξ̄) = ½δᵀV⁻¹δ,   (79)

where V = V(s) is given by (50) and δ = x̄ − e^{As}x. The optimal control at time t = 0 for prescribed ξ̄ is

u = Q⁻¹Bᵀe^{Aᵀs}V⁻¹δ.   (80)

The quantity V⁻¹δ is the terminal value of λ and is consequently invariant along the optimal orbit. That is, if one considers it as a function of the initial value then its value is the same for any initial point chosen on that orbit. Specialising now to the 𝒮-avoiding problem, we know from the previous theorem that we determine the optimal grazing point by minimising F(ξ, ξ̄) with respect to x̄ subject to z = b and then maximising it with respect to s. The first minimisation yields

V⁻¹δ = (Δ/σ²)a,   (81)

so the value of s at the optimal grazing point is that which maximises (Δ/σ)². Expression (78) for the optimal control now follows from (80) and (81). The identification of (Δ/σ²)a with V⁻¹δ (with s determined in terms of the current x) demonstrates its invariance along the optimal orbit (before grazing). □
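Theorem 7.11.3 can be exercised on the inertial particle avoiding the ground, for which A = [[0, 1], [0, 0]], B = (0, 1)ᵀ, a = (1, 0)ᵀ and b = 0, so that aᵀB = 0 holds. Then Δ(s) = −(x + vs), σ(s)² = s³/(3Q), and (78) reduces to u = 3Δ(s)/s². The sketch below uses invented numbers; the maximisation is taken, as the theorem intends, locally, over the times s at which the uncontrolled orbit lies inside 𝒮.

```python
# Theorem 7.11.3 specialised to the inertial particle avoiding the
# ground: a = (1, 0)^T, b = 0, a^T B = 0. Here Delta(s) = -(x + v*s)
# and sigma(s)^2 = a^T V(s) a = s^3/(3Q). Numbers are illustrative.
x, v, Q = 1.0, -1.0, 2.0        # height > 0, descending (v < 0)

def delta(s):
    return -(x + v * s)

def sigma2(s):
    return s ** 3 / (3.0 * Q)

def ratio2(s):
    # (Delta/sigma)^2, i.e. twice the minimised grazing cost
    return delta(s) ** 2 / sigma2(s)

# local maximisation over times s at which the uncontrolled orbit lies
# inside the stopping set, i.e. s > -x/v (where Delta > 0)
s0 = -x / v
s_hat = max((s0 + 0.001 * k for k in range(1, 9001)), key=ratio2)

# control (78): u = (Delta/sigma^2) Q^{-1} B^T exp(A^T s) a, and
# B^T exp(A^T s) a = s here, so u = 3 Delta(s)/s^2
u = delta(s_hat) / sigma2(s_hat) / Q * s_hat
```

The maximiser recovers the grazing time −3x/v noted after Theorem 7.11.2, and the resulting control is positive: the diving particle is pulled up.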
12 CONSTRAINED PATHS: TRANSVERSALITY CONDITIONS

We should now consider the transition rules which hold when a free orbit enters or emerges from a part of state space in which the orbit is constrained. We shall see that conclusions and argument are very similar to those which we deduced for termination in Section 3. Consider first of all the time-invariant case. Suppose that an optimal path which begins freely meets a set ℱ in state space which is forbidden. We shall assume that ℱ is open, so that the path can traverse the boundary ∂ℱ of ℱ for a while, during which time the path is of course constrained. One can ask: what conditions hold at the optimal points of entry to and exit from ∂ℱ? Let x be an entry point and p and p′ the values of the conjugate variable immediately before and after entry. Just as for the treatment of the terminal problem in Section 3, we can partially integrate the Lagrangian expression for minimal total cost up to the transition point and so deduce an expression whose primary dependence upon the transition value x occurs in a term (p − p′)x.
Suppose we can vary x to x + εσ + o(ε), a neighbouring point in ∂ℱ. Then, by the same argument as that of Theorem 7.3.1, we deduce that (p − p′)σ is zero if x is an optimal transition value. That is, the linear function pσ of the conjugate variable is continuous at an optimal transition point for all directions σ tangential to the surface of ℱ at x. Otherwise expressed, the vector (p − p′)ᵀ is normal to the surface of ℱ at x. We deduce the same conclusion for optimal exit points by appeal to a time-reversed version of the problem. Transferring these results to the time-varying problem by the usual device of taking an augmented state variable ξ = (x, t), we thus deduce

Theorem 7.12.1 Let ℱ be an open set in (x, t)-space which is forbidden. Let (x, t) be an optimal transition point (for either entry to or exit from ∂ℱ) and (p, p₀) and (p′, p₀′) the values of the augmented conjugate variable immediately before and after transition. Then

(p − p′)σ + (p₀ − p₀′)τ = 0   (82)

for all directions (σ, τ) tangential to ∂ℱ at (x, t). In particular, if t can be varied independently of x then the Hamiltonian H is continuous at the transition.

Proof The first assertion follows from the argument before the theorem, as indicated. If we can vary t in both directions for fixed x at transition then (82) implies that p₀ is continuous at transition. But we have p₀ + H = 0 on both sides of the transition, so the implication is that H is continuous at the transition. □
One can also develop conclusions concerning the form of the optimal path during the phase when it lies in ∂ℱ. However, we shall content ourselves with what can be gained from discussion of a simple example in the next section. An example we have already considered by a direct discussion of costs is the steering of the inertial particle in Section 10. For this the state variable was (x, v) and ℱ was x < 0. The boundary ∂ℱ is then x = 0, but on this we must also require that v = 0, or the condition x ≥ 0 would be violated in either the immediate past or the immediate future. Suppose that we start from (x, v) at t = 0 and reach x = 0 first at time t₁ (when necessarily v = 0 as well). Suppose p, q are the variables conjugate to x, v, so that the Hamiltonian is H = pv + qu − Qu²/2, and so equal to pv + q²/2Q when u is chosen optimally. Continuity of H at a transition point, when v is necessarily zero, thus amounts to continuity of q². Let us confirm that this continuity is consistent with the previously derived condition (66). If p and q denote the values of the conjugate variables at t₁−, just before transition, then integration of equation (46) leads to

p(τ) = p,   q(τ) = q + ps,   u(τ) = Q⁻¹(q + ps),
v(τ) = −Q⁻¹(qs + ps²/2),   x(τ) = Q⁻¹(qs²/2 + ps³/6),

where s = t₁ − τ. The values of p and q are determined by the prescribed initial
values x and v. We find q proportional to (3x + vt₁)/t₁², and one can find a similar expression for q immediately after transition in terms of the values 0 and w of terminal height and velocity. Assertion of the continuity of q² at transition thus leads exactly to equation (66).
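This consistency can be verified numerically. In the sketch below the numbers are invented, and the constants of proportionality, 2Q(3x + vs₁)/s₁² just before the graze and −2Qw/(s − s₁) just after, are what one obtains on working through the integrated equations above and their analogue for the second segment.

```python
# Numerical check (invented numbers) that continuity of q^2 at the
# grazing point reproduces (66). The grazing time s1 is taken from the
# explicit root of (66) quoted at the end of Section 10.
Q = 1.0
x, v, w, s = 1.0, -1.0, -0.05, 8.0

s1 = ((3*x - v*s) - ((3*x + v*s)**2 - 12*x*w*s) ** 0.5) / (2 * (w - v))

q_minus = 2 * Q * (3 * x + v * s1) / s1 ** 2   # just before the graze
q_plus = -2 * Q * w / (s - s1)                 # just after the graze
```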
13 REGULATION OF A RESERVOIR

This is a problem which the author, among others, has discussed as 'regulation of a dam'. However, purists are correct in their demur that the term 'dam' can refer only to the retaining wall, and that the object one wishes to control, the mass of water, is more properly referred to as the 'reservoir'. Let x denote the amount of water in the reservoir, and suppose that it obeys the plant equation ẋ = v − u, where v is the inflow rate (a function of time known in advance) and u is the draw-off rate (a quantity at the disposition of the controller). One wishes to maximise a criterion with instantaneous reward rate g(u), where g is concave and monotonic increasing. This concavity will (by Jensen's inequality) discourage variability in u. One also has the natural constraint u ≥ 0. The state variable x enters the analysis by the constraint 0 ≤ x ≤ C, where C is the capacity of the reservoir. We shall describe the situations in which x = C, x = 0 and 0 < x < C as full, empty and intermediate phases respectively. One would of course wish to extend the analysis to the case for which v (which depends on future rainfall, for example) is imperfectly predictable, and so supposed stochastic. This can be achieved for LQ versions of the model (see Section 2.9) but is difficult if one retains the hard constraints on x and u and a non-quadratic reward rate g(u). We can start from maximisation of the Lagrangian form ∫[g(u) − p(ẋ − v + u)]dτ, so that the Hamiltonian is H(x, u, p) = g(u) + p(v − u). A price interpretation would indeed characterise p as an effective current price for water. We then deduce the following conclusions.
Theorem 7.13.1 An optimal draw-off programme shows the following features.
(i) The value of u is the non-negative value maximising g(u) − pu, which then increases with decreasing p.
(ii) The value of u is constant in any one intermediate phase.
(iii) The value of p is decreasing (increasing), and so the value of u = v is increasing (decreasing), in an empty (full) phase.
(iv) The value of u is continuous at transition points.

Proof Assertion (i) follows immediately from the form of the Hamiltonian and the nature of g. Assertions (ii) and (iii) follow from extremisation of the Lagrangian form with respect to x. In an intermediate phase x can be perturbed either way, and one deduces that ṗ = 0 (the free orbit condition for this particular case). Hence p, and so u, is constant in such a phase. In an empty phase perturbations of x can only be non-negative, and so one can deduce only that ṗ ≤ 0. Thus p is decreasing, and so u is increasing. Since u and v are necessarily equal, if x is being held constant, then v must also be increasing. The analogous assertions for a full phase follow in the same way; perturbations can then only be non-positive. The final assertion follows from continuity of H at transition points. With u set equal to its optimal value, H becomes a monotonic, continuous function of p. Continuity of H at transition points then implies continuity of p, and so of u. □

The example is interesting for its appeal to transversality conditions, but also because there is some discussion of optimal behaviour during the empty and full phases (which constitute the boundary ∂ℱ of the forbidden region ℱ: the union of x < 0 and x > C). Trivially, one must have u = v in these phases. However, one should not regard this as the equation determining u. In the case x = 0 (say) one is always free to take a smaller value of u (and so to let water accumulate, and so to move into an intermediate phase). The optimal draw-off rate continues to be determined as the value extremising g(u) − pu; it is the development of p which is constrained by the condition x = 0. Although the rule u = v is trivial if one is determined not to leave the empty phase, the conclusion that v must be increasing during such a phase (for optimality) is non-trivial.
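Assertion (i) can be made concrete with an invented reward rate, say g(u) = log(1 + u): the stationarity condition g′(u) = p gives u = max(0, 1/p − 1), which indeed increases as the price p falls.

```python
# Illustration of Theorem 7.13.1(i) with the invented concave
# increasing reward rate g(u) = log(1 + u). The optimal draw-off
# maximises g(u) - p*u over u >= 0, here u = max(0, 1/p - 1).
import math

def g(u):
    return math.log(1.0 + u)

def draw_off(p):
    # stationarity g'(u) = 1/(1+u) = p gives u = 1/p - 1, truncated at 0
    return max(0.0, 1.0 / p - 1.0)

# draw-off for a falling sequence of water prices p
us = [draw_off(p) for p in (2.0, 1.0, 0.5, 0.25)]
```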
Notes

Pontryagin is indeed the originator of the principle which bears his name, and whose theory and application have been so developed by himself and others. It is notable that he held the dynamic programming principle in great scorn; M.H.A. Davis describes him memorably as holding it up 'like a dead rat by its tail' in the preface to Pontryagin et al. (1962). This was because of the occasional non-existence of the derivative F_x in the simplest of cases. However, as we have seen, it is a rat which is alive, ingenious, direct, and able to squeeze through where authorities say it cannot. The material of Section 11 is believed to be new.
PART 2
Stochastic Models
CHAPTER 8
Stochastic Dynamic Programming

A difficulty which must be faced is that of incompleteness of information. That is, one may simply not have all the information needed to make an optimal decision, and which we have hitherto supposed available. For example, it may be impossible or impracticable to observe all aspects of the process variable: the workings of even a moderate-sized plant, or of the patient under anaesthesia whom we instanced in Section 5.2, are far too complex. This might matter less if the plant were observable in the technical sense of Chapter 5, so that the observations available nevertheless allowed one to build up a complete picture of the state of affairs in the course of time. However, there are other uncertainties which cannot be resolved in this way. Most systems will have exogenous inputs of some kind: disturbances, reference signals or time-varying parameters such as price or weather. If the future of these is imperfectly predictable, as is usually the case, then the basis for the methods we have used hitherto is lost.

There are two approaches which lead to a natural mathematical resolution of this situation. One is to adopt a stochastic formulation. That is, one arrives somehow at a probability model for plant and observations, so that all variables are jointly defined as random variables. The variables which are observable can then be used to make inferences on those which are not. More specifically, one chooses a policy, a control rule in terms of current observables, which minimises the expectation of some criterion based on cost. The other, quite as natural mathematically, is the minimax approach. In this one assumes that all unobservables take the worst values they can take (judged on the optimisation criterion) consistently with the values of observables. The operation of conditional expectation is thus replaced by a conditional maximisation (of cost).
The stochastic approach seems to be the one which takes account of average performance in the long run; it has the completer theory and is the one usually adopted. The minimax approach corresponds to a worst-case analysis, and is frankly pessimistic. We shall consider only the stochastic approach, but shall find minimax ideas playing a role when we later develop the idea of risk-sensitivity.

Lastly, there is a point which should be made to maintain perspective, even if it cannot be followed up in this volume. The larger the system (i.e. the greater the number of individual variables) then the more unrealistic becomes the picture that there is a central optimiser who uses all currently available information to make all necessary decisions. The physical flow of information and commands
would be excessive, as would the central processing load. This is why an economy or a biological organism is partially decentralised: some control decisions are made locally, on the basis of local information plus central commands, leaving only the major decisions to be made centrally, on aggregated information. Indeed, the more complex the system, the greater the premium on trading a loss in optimality for a gain in simplicity, and, perhaps, the greater the possibility of doing so advantageously, and of recognising the essential which is to be optimised.

We use the familiar notations E(x) and E(x|y) for expectation and conditional expectation, and shall rarely make the notational distinction (which is only occasionally called for) between random variables and particular values which they may adopt. Correspondingly, P(x) and P(x|y) denote the probability (unconditional and conditional) of a particular value x, at least if x is discrete-valued. However, more generally and more loosely, we also use P(x) to denote simply the probability law of x. So, the Markov property of a process {x_t} would be expressed, whatever the nature of the state space, by P(x_{t+1}|X_t) = P(x_{t+1}|x_t), where X_t is the history {x_τ; τ ≤ t}.

1 ONE-STAGE OPTIMISATION

A special feature of control optimisation is that it is a multi-stage problem: one makes a sequence of decisions in time, the later decisions being in general based on more information than the earlier ones. For this very reason it is helpful to begin by considering the single-stage case, in which one only has a single decision to make. For example, suppose that the pollution level of a water supply is being monitored. One observes pollution level y in the sample taken and has then the choice of two actions u: to raise the alarm or to do nothing. In practice, of course, one might well convert this into a dynamic problem by allowing sampling to continue over a period of time until there is a more assured basis for action one way or the other.
However, suppose that action must be taken on the basis of this single observation. A cost C is incurred: the cost of raising the alarm (perhaps wrongly) or of not doing so (perhaps wrongly). The magnitude of the cost will then depend upon the decision u and upon the unknown 'true state' of affairs. Let us denote the cost incurred if action u is taken by C(u), a random variable whose distribution depends on u. One assumes a stochastic (probabilistic) model in which the values of the cost C(u) for varying u and of the observable y are jointly defined as random variables. A policy prescribes u as a function u(y) of the observable y; the policy is to be chosen to minimise E[C(u(y))].

Theorem 8.1.1 The optimal decision function u(y) is determined by choosing u as the value minimising E[C(u)|y].

Proof If a decision rule u(y) is followed then the expected cost is
E[C(u(y))] = E{E[C(u(y))|y]} ≥ E{inf_u E[C(u)|y]},

and the lower bound is attained by the rule suggested in the theorem. □
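In discrete form the rule of the theorem is a short computation: for each observed y one averages the cost over P(x)P(y|x) (the normalising factor P(y) is irrelevant to the comparison) and picks the minimising action. A sketch with invented numbers, in the spirit of the pollution example:

```python
# Discrete illustration of Theorem 8.1.1: choose u minimising
# E[C(u)|y]. All numbers are invented. States x, actions u,
# observations y; prior pi[x], observation model P(y|x), cost C(x, u).
pi = {0: 0.9, 1: 0.1}                      # prior P(x)
p_y_given_x = {0: {'lo': 0.8, 'hi': 0.2},  # P(y|x)
               1: {'lo': 0.3, 'hi': 0.7}}
cost = {(0, 'alarm'): 1.0, (1, 'alarm'): 0.0,   # C(x, u)
        (0, 'wait'): 0.0, (1, 'wait'): 5.0}

def best_action(y):
    # posterior-weighted cost; the normalising P(y) does not affect
    # the comparison, so we may weight by P(x) P(y|x) directly
    def expected_cost(u):
        return sum(pi[x] * p_y_given_x[x][y] * cost[(x, u)] for x in pi)
    return min(('alarm', 'wait'), key=expected_cost)
```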
The theorem may seem trivial, but the reader should understand its point: the reduction of a constrained minimisation problem to a free one. The initial problem is that of minimising E[C(u(y))] with respect to the function u(y), so that the minimising u is constrained to be a function of y at most. This is reduced to the problem of minimising E[C(u)|y] freely with respect to the parameter u.

One might regard u as a variable whose prescription affects the probability distribution of the cost C, just as does that of y, and so write E[C(u)|y] rather as E[C|y, u]. However, to do this is to blur a distinction between the variables y and u. The variable y is a random variable whose specification conditions the distribution of C. The variable u is not initially random, but a variable whose value can be chosen by the optimiser and which parametrises the distribution of C. We discuss the point in Appendix 2, where a distinction is made by writing the expectation as E[C|y; u], the semicolon separating parametrising variables from conditioning variables. However, while the distinction is important in some contexts, it is not in this, for reasons explained in Appendix 2.

The reader may be uneasy: the formulation of Theorem 8.1.1 makes no mention of an important physical variable, the 'true state' of affairs. This would be the actual level of pollution in the pollution example. It would be this variable of which the observation y is an imperfect indicator, and which in combination with the decision u determines the cost. Suppose that the problem admits a state variable x which really does express the 'true state' of affairs, in that the cost is in fact a deterministic function C(x, u) of x and u. So, if one knew x, one would simply choose u to minimise C(x, u). However, one knows only y, which is to be regarded as an imperfect observation on x. The joint distribution of x and y is independent of u, because the values of these random variables, whether observable or not, have been realised before the decision u is taken.
Theorem 8.1.2 Suppose that the problem admits a state variable x, that C(x, u) is the cost function and f(x, y) the joint density of x and y with respect to a product measure μ₁(dx)μ₂(dy). Then the optimal value of u is that minimising ∫C(x, u)f(x, y)μ₁(dx).
Proof Let us assume for simplicity that x and y are discrete random variables with a joint distribution P(x, y); the formal generalisation is then clear. In this case

E[C(u)|y] = Σ_x C(x, u)P(x|y) ∝ Σ_x C(x, u)P(x, y) = Σ_x C(x, u)P(x)P(y|x),   (1)

where the proportionality sign indicates a factor P(y)⁻¹, independent of u. The third of these expressions is the analogue of the integral expression asserted in the theorem. □

We give the fourth expression in (1) because P(x) and P(y|x) are often specified on fairly distinct grounds. The conditional distribution P(y|x) of observation on state is supplied by one's statistical model of the observation process, whose mechanism may be fairly clear. The distribution P(x) constitutes the 'prior distribution' of state and its specification may be debatable; see Exercise 1.

Exercises and comments

(1) The so-called two-hypothesis two-action case is the simplest, but is both illuminating and useful. Consider the pollution example of the text, and suppose that serious pollution of the river, if it occurs, can only be due to a catastrophic failure at a factory upstream. There are then only two 'pollution states', that this failure has not occurred or that it has, corresponding to x equal to 0 or 1, say. Denote the prior probabilities P(x) of these by π_x and the probability density of the observation y conditional on the value of x by f_x(y). Suppose that there are just two actions: to raise the alarm or not. The cost of raising the alarm is C₀ or zero according as x is 0 or 1; the cost of not raising the alarm is zero or C₁ according as x is 0 or 1. It follows then from the last form of the criterion that one should raise the alarm if π₀f₀(y)C₀ < π₁f₁(y)C₁. That is, if the likelihood ratio f₁(y)/f₀(y) exceeds the threshold value π₀C₀/π₁C₁.

(2) Risk-sensitivity and hedging. The new effects that a stochastic element can bring can be demonstrated on a one-stage model. Suppose that, if one divides a sum of money x so that an amount x_j is invested in activity j, then one receives a total return of r = Σ_j c_j x_j. That is, c_j is the rate of return on activity j.
If the c_j are known then one maximises r by investing the whole sum in an activity for which the return rate c_j is greatest. (If there are several activities achieving this rate then the investment can be spread over them, but there is no advantage in such diversification.) If the c_j are unknown, and regarded as random variables, then one maximises the expected return by investing the whole sum in an activity for which the expected rate of return E(c_j) is maximal, where this is an expectation conditional on one's information at the time of the investment decision. However, suppose one chooses rather to extremise the expected value of exp(θr), maximising or minimising according as the risk-sensitivity parameter θ is positive or negative. This criterion takes account of variability of return as well as of expectation, variability being welcome or unwelcome according as θ is positive or negative (the risk-seeking and risk-averse cases; see Chapter 16).
For simplicity, let us assume (rather unrealistically) that the random variables c_j are independent. Then one chooses the allocation to maximise Σ_j θ^{−1} ψ_j(θ x_j), where ψ_j(α) = log{E[exp(α c_j)]}. The functions ψ_j(α) are convex (see Appendix 3). It follows then that the functions θ^{−1} ψ_j(θ x_j) are convex or concave according as one is in the risk-seeking or risk-averse case. In the first case one will invest the whole sum in an activity j for which θ^{−1} ψ_j(θ x) is maximal. In the second case one spreads the investment (hedges against uncertainty) by choosing the x_j so that ψ_j′(θ x_j) ≤ λ, with equality if x_j is positive (j = 1, 2, …). Here λ is a Lagrange multiplier, chosen so that Σ_j x_j = x. Hedging is a very real feature in investment practice, and we see that it is induced by the two elements of uncertainty and risk-averseness.

(3) Follow through the treatment of Exercise 2 in the case (again unrealistic!) of normally distributed c_j, when ψ_j(α) = μ_j α + ½ v_j α². Here μ_j and v_j are respectively the mean and variance of c_j (conditional on information at the time of decision).
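The risk-averse allocation of Exercises 2 and 3 is easy to compute numerically. A minimal sketch, not from the text: for normal c_j the condition ψ_j′(θ x_j) = μ_j + v_j θ x_j ≤ λ gives x_j = max(0, (μ_j − λ)/(v_j |θ|)) in the risk-averse case θ < 0, and the multiplier λ is located by bisection so that the budget Σ_j x_j = x is met. All rates, variances and the value of θ below are invented for illustration.

```python
# Hedged allocation under normally distributed returns (Exercises 2-3).
# Risk-averse case theta < 0: maximise sum_j psi_j(theta*x_j)/theta with
# psi_j(a) = mu_j*a + 0.5*v_j*a^2, subject to sum_j x_j = x, x_j >= 0.
# Stationarity: mu_j + v_j*theta*x_j <= lam, equality when x_j > 0,
# i.e. x_j = max(0, (mu_j - lam) / (v_j*|theta|)); bisect on lam.
# All numerical values are invented for illustration.

def hedged_allocation(mu, v, theta, x):
    assert theta < 0, "risk-averse case"
    a = abs(theta)

    def total(lam):
        return sum(max(0.0, (m - lam) / (vv * a)) for m, vv in zip(mu, v))

    lo, hi = min(mu) - x * a * max(v), max(mu)   # total(lo) >= x, total(hi) = 0
    for _ in range(200):                          # bisection on the multiplier
        mid = 0.5 * (lo + hi)
        if total(mid) > x:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return [max(0.0, (m - lam) / (vv * a)) for m, vv in zip(mu, v)]

mu = [0.10, 0.08, 0.04]   # expected return rates (hypothetical)
v = [0.05, 0.02, 0.01]    # conditional variances (hypothetical)
alloc = hedged_allocation(mu, v, theta=-2.0, x=1.0)
print(alloc)              # the investment is spread over activities: hedging
```

With these invented numbers the whole sum is divided between the two higher-return activities, while the lowest-return activity receives nothing: exactly the pattern described in the exercise.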
2 MULTISTAGE OPTIMISATION; THE DYNAMIC PROGRAMMING EQUATION

If we extend the analysis of the last section to the multistage case then we are essentially treating a control optimisation problem in discrete time. Indeed, the discussion will link back to that of Section 2.1 in that we arrive at a stochastic version of the dynamic programming principle. There are two points to be made, however. Firstly, the stochastic formulation makes it particularly clear that the dynamic programming principle is valid without the assumption of state structure and, indeed, that state structure is a separate issue best brought in later. Secondly, the temporal structure of the problem implies properties which one often takes for granted: this structure has to be made explicit.

Suppose, as in Section 2.1, that the process is to be optimised over the time period t = 0, 1, 2, …, h. Let W_0 indicate all the information available at time 0; it is from this and the stochastic model that one must initially infer plant history up to time t = 0, insofar as this is necessary. Let x_t denote the value of the process variable at time t, and X_t the partial process history {x_1, x_2, …, x_t}. Correspondingly, let y_t denote the observation which becomes available and u_t the control action taken at time t, with corresponding partial histories Y_t and U_t. Let W_t denote the information available at time t, i.e. the information on which choice of u_t is to be based. Then we assume that W_t = {W_0, Y_t, U_{t−1}}. That is, current information consists just of initial information plus current observation history plus previous control history. It is taken for granted, and so not explicitly indicated, that W_t also implies prescription of the stochastic model and knowledge of clock time t.
A realisable policy π is one which specifies u_t as a function of W_t for t = 1, 2, …, h − 1. One assumes a cost function C. This may be specified as a function of X_h and U_{h−1}, but is best regarded simply as a random variable whose distribution, jointly with that of the course of the process and observations, is parametrised by the chosen control sequence U_{h−1}. The aim is then to choose π to minimise E_π(C). Define the total value function

G(W_t) = inf_π E_π[C | W_t],    (2)
the minimal expected cost conditional on information at time t. Here E_π is the expectation operator induced by policy π. We term G the total value function because it refers to total cost, whereas the usual value function F refers only to present and future cost (in the case when cost can indeed be partitioned over time). G is automatically t-dependent, in that W_t takes values in different sets for different t. However, the simple specification of W_t as argument is enough to indicate this dependence.
*Theorem 8.2.1 (The dynamic programming principle) The total value function G(W_t) obeys the backward recursion (the dynamic programming or optimality equation)

G(W_t) = inf_{u_t} E[G(W_{t+1}) | W_t, u_t]    (t = 1, 2, …, h − 1)    (3)

with closing condition

G(W_h) = E[C | W_h],    (4)

and the minimising value of u_t in (3) is the optimal value of control at time t.

We prove these assertions formally in Appendix 2. They may seem plausible in the light of the discussion of the previous section, but demonstration of even their formal truth requires a consideration of the structure implicit in a temporal optimisation problem. They are in fact rigorously true if all variables take values in finite sets and if the horizon is finite; the theorem is starred only because of possible technical complications in other cases. That some discussion is needed even of formal validity is clear from the facts that the conditioning variables W_t in (2) and (W_t, u_t) in (3) are a mixture of random variables Y_t, which are genuinely conditioning, and control histories U_{t−1} or U_t, which should be seen as parametrising. Further, it is implied that the expectations in (3) and (4) are policy-independent, the justification in the case of (4) being that all decisions lie in the past. These points are covered in the discussion of Appendix 2. Relation (3) certainly provides a taut and general expression of the dynamic programming principle, couched, as it must be, in terms of the maximal current observable W_t.
3 STATE STRUCTURE

The two principal properties which ensure state structure are exactly the stochastic analogues of those assumed in Chapter 2.

(i) Markov dynamics. It is required that the process variable x should have the property

P(x_{t+1} | X_t, U_t) = P(x_{t+1} | x_t, u_t),    (5)

where X_t, U_t are now complete histories. That is, if we consider the distribution of x_{t+1} conditional on process history and parametrised by control history, then it is in fact only the values of process and control variables at time t which have any effect. This is the stochastic analogue of the simply-recursive deterministic plant equation (2.2), and specification of the right-hand member of (5) as a function of its three arguments amounts to specification of a stochastic plant equation.

(ii) Decomposable cost function. It is required that the cost function should break into a sum of instantaneous and closing costs, of the form

C = Σ_{t=0}^{h−1} c(x_t, u_t, t) + C_h(x_h) = Σ_{t=0}^{h−1} c_t + C_h,    (6)
say. This is exactly the assumption (2.3) already made in the deterministic case.

We recall the definition of sufficiency of a variable ξ_t in Section 2.1 and the characterisation of x as a state variable if (x_t, t) is sufficient. These definitions transfer to the stochastic case, and we shall see, by an argument parallel to that of the deterministic case, that assumptions (i) and (ii) do indeed imply that x is a state variable, if only it is observable. A model satisfying these assumptions is often termed a Markov decision process, the point being that they define a simply recursive structure. However, if one is to reap the maximal benefit of this structure then one must make an observational demand.

(iii) Perfect state observation. It is required that the current value of state should be observable. That is, x_t should be known at the time t when u_t is to be determined, so that W_t = (X_t, U_{t−1}).

As we have seen already in the deterministic case, assumption (i) can in principle be satisfied if there is a description x of the system which is detailed enough that it can be regarded as physically complete. Whether this detailed description is immediately observable is another matter, and one to which we return in Chapters 12 and 15.

We follow the pattern of Section 2.1. Define the future cost at time t

C_t = Σ_{τ=t}^{h−1} c_τ + C_h

and the value function

F(W_t) = inf_π E_π[C_t | W_t],    (7)
so that G(W_t) = Σ_{τ=0}^{t−1} c_τ + F(W_t). Then the following theorem spells out the sufficiency of ξ_t = (x_t, t) under the assumptions above.
Theorem 8.3.1 Assume conditions (i)–(iii) above. Then

(i) F(W_t) is a function of x_t and t alone. If we write it F(x_t, t) then it obeys the dynamic programming equation

F(x, t) = inf_u {c(x, u, t) + E[F(x_{t+1}, t + 1) | x_t = x, u_t = u]}    (t < h)    (8)

with terminal condition

F(x, h) = C_h(x).    (9)

(ii) The minimising value of u_t in (8) is the optimal value of control at time t, which is consequently also a function only of x_t and t.
Proof The value of F(W_h) is C_h(x_h), so the asserted reduction of F is valid at time h. Assume it valid at time t + 1. The general dynamic programming equation (3) then reduces to

F(W_t) = inf_{u_t} {c(x_t, u_t, t) + E[F(x_{t+1}, t + 1) | X_t, U_t]}    (10)

and the minimising value of u_t is optimal. But, by assumption (i), the right-hand member of (10) reduces to the right-hand member of (8). All assertions then follow by induction. □

So, again one has the simplification that not all past information need be stored; it is sufficient for purposes of optimisation that one should know the current value of state. The optimal control rule derived by the minimisation in (8) is again in closed-loop form, since the policy before time t has not been specified. It is in the stochastic case that the necessity for closed-loop operation is especially clear, since continuing stochastic disturbance of the dynamics makes use of the most recent information imperative.

At least in the time-homogeneous case it is convenient to write (8) simply as
F(·, t) = L F(·, t + 1),    (11)

where L is the operator with action

L φ(x) = inf_u {c(x, u) + E[φ(x_{t+1}) | x_t = x, u_t = u]}.    (12)
This is of course just the stochastic version of the forward operator already introduced in Section 3.1. As then, L φ(x_t) is the minimal cost incurred if one is allowed to choose u_t optimally, in the knowledge that at time t + 1 one will incur a cost of φ(x_{t+1}). In the discounted case L would have the action

L φ(x) = inf_u {c(x, u) + β E[φ(x_{t+1}) | x_t = x, u_t = u]}.    (13)
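The backward recursion (8) is immediately computable once states and actions are finite. A minimal sketch, not from the text, for a hypothetical two-state, two-action Markov decision process; the transition probabilities, costs and horizon are invented for illustration.

```python
# Backward recursion F(x,t) = min_u { c(x,u) + E[F(x',t+1) | x,u] } of (8),
# for a hypothetical 2-state, 2-action Markov decision process.
# P[u][x][y] = transition probability x -> y under action u (invented).

P = [
    [[0.9, 0.1], [0.4, 0.6]],   # action 0
    [[0.5, 0.5], [0.1, 0.9]],   # action 1
]
c = [[0.0, 2.0], [1.0, 0.5]]    # c[x][u]: instantaneous cost (invented)
Ch = [0.0, 5.0]                 # terminal cost C_h(x) (invented)
h = 10

F = list(Ch)                    # F(., h)
policy = []
for t in range(h - 1, -1, -1):  # t = h-1, ..., 0
    Fnew, u_opt = [], []
    for x in range(2):
        q = [c[x][u] + sum(P[u][x][y] * F[y] for y in range(2))
             for u in range(2)]
        Fnew.append(min(q))
        u_opt.append(q.index(min(q)))
    F, policy = Fnew, [u_opt] + policy

print(F)          # value function at t = 0
print(policy[0])  # optimal actions at t = 0, one per state
```

The minimising action at each (x, t) is recorded as it is found, illustrating that the optimal control is a function of (x_t, t) alone, as Theorem 8.3.1 asserts.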
4 THE DYNAMIC PROGRAMMING EQUATION IN CONTINUOUS TIME

It is convenient to note here the continuous-time analogue of the material of the last section, and then to develop some continuous-time formalism in Chapter 9 before progressing to applications in Chapter 10. The analogues of assumptions (i)–(iii) of that section will be plain; we deduce the continuous-time analogues of the conclusions by a formal passage to the limit. It follows by the discrete-time argument that the value function, the infimum of expected remaining cost from time t conditional on previous process and control history, is a function F(x, t) of x(t) = x and t alone. The analogue of the dynamic programming equation (8) for passage from t to t + δt is

F(x, t) = inf_u {c(x, u, t) δt + E[F(x(t + δt), t + δt) | x(t) = x, u(t) = u]} + o(δt).    (14)
Define now the infinitesimal generator A(u, t) of the controlled process by

A(u, t)φ(x) = lim_{δt ↓ 0} (δt)^{−1} {E[φ(x(t + δt)) | x(t) = x, u(t) = u] − φ(x)}.    (15)

That is, there is an assumption that, at least for sufficiently regular φ(x),

E[φ(x(t + δt)) | x(t) = x, u(t) = u] = φ(x) + [A(u, t)φ(x)] δt + o(δt).

The form of the term of order δt defines the operator A; to write the coefficient of δt as A(u, t)φ(x) emphasises that the distribution of x(t + δt) is conditioned by the value x of x(t) and parametrised by t and the value u of u(t). We shall consider the form of A in some particular cases in the next chapter.
*Theorem 8.4.1 Assume the continuous-time analogues of conditions (i)–(iii) of Section 3. Then x is a state variable and the value function F(x, t) obeys the dynamic programming equation

inf_u [c(x, u, t) + ∂F(x, t)/∂t + A(u, t)F(x, t)] = 0    (t < h)    (16)

with terminal condition F(x, h) = C_h(x).

CHAPTER 9

Stochastic Dynamics in Continuous Time

1 MARKOV JUMP PROCESSES

Suppose that the process variable x takes values in a discrete set, its possible values being labelled j, k, etc. For a Markov process one then assumes that

P[x(t + δt) = k | x(t) = j] = λ_{jk} δt + o(δt)    (k ≠ j)    (1)
for small positive δt. This is a regularity condition which turns out to be self-consistent. The quantity λ_{jk} is termed the probability intensity of transition from j to k, or simply the transition intensity. The assumption itself implies that the transition has been a direct one: the probability of its having occurred by passage through some other state is of smaller order in δt (see Exercise 1).
Theorem 9.1.1 The process with transition intensities λ_{jk} has infinitesimal generator A with action

Aφ(j) = Σ_k λ_{jk}[φ(k) − φ(j)].    (2)
Proof It is a consequence of (1) that

E[φ(x(t + δt)) − φ(j) | x(t) = j] = Σ_{k≠j} λ_{jk}[φ(k) − φ(j)] δt + o(δt),

whence (2) follows, by the definition of A. To include the case k = j in the summation plainly has no effect. □

We can return to the controlled time-dependent case by making the transition intensity a function λ_{jk}(u, t) of u and t, when the dynamic programming equation (8.16) becomes

inf_u [c(j, u, t) + ∂F(j, t)/∂t + Σ_k λ_{jk}(u, t)[F(k, t) − F(j, t)]] = 0.    (3)
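Relation (1) translates directly into a simulation rule: the holding time in state j is exponential with rate Σ_k λ_{jk}, and the next state is k with probability λ_{jk}/Σ_k λ_{jk}. A sketch, not from the text, for a hypothetical two-state chain with invented intensity values:

```python
# Simulating a Markov jump process with transition intensities lam[j][k]:
# exponential holding times, next state chosen proportionally to lam[j][k].
# The intensity values are invented for illustration.
import random

lam = [[0.0, 2.0], [1.0, 0.0]]  # rates j -> k for two states (hypothetical)

def simulate(j, T, rng):
    t, path = 0.0, [(0.0, j)]
    while True:
        rate = sum(lam[j])                 # total exit rate from state j
        t += rng.expovariate(rate)         # exponential holding time
        if t >= T:
            return path
        r = rng.random() * rate            # pick next state k with
        for k, l in enumerate(lam[j]):     # probability lam[j][k]/rate
            r -= l
            if r < 0:
                j = k
                break
        path.append((t, j))

rng = random.Random(0)
path = simulate(0, 100.0, rng)
print(len(path), path[-1])
```

With these rates the chain alternates between the two states; over a long run the fraction of time spent in state 0 approaches λ_{10}/(λ_{01} + λ_{10}) = 1/3, the stationary distribution of the chain.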
Exercises and comments

(1) It follows from (1) and the Markov character of the process that

P[x(t + δt) = i, x(t + 2δt) = k | x(t) = j] = λ_{ji} λ_{ik} (δt)² + o[(δt)²]

for i distinct from both j and k. This at least makes it plausible that the probability of multiple transitions in an interval of length δt is o(δt).

2 DETERMINISTIC AND PIECEWISE DETERMINISTIC PROCESSES

The deterministic model for which x is a vector obeying the plant equation ẋ = a(x) is indeed a special case of a stochastic model. The rate of change of φ(x) in time is φ_x a, so that the infinitesimal generator A has the action

Aφ(x) = φ_x(x) a(x),    (4)

where φ_x is the row vector of differentials of φ with respect to the components of x.

Consider a hybrid of this deterministic process and the jump process of the last section, in which the x-variable follows deterministic dynamics ẋ = a_j(x) in the jth regime, but transition can take place from regime j to regime k with intensity λ_{jk}(x). Such a process is termed a piecewise deterministic process. The study of such processes was initiated and developed by Davis (1984, 1986, 1993).

For example, if we consider an animal population, then statistical variability can occur in the population for at least two reasons. One is the intrinsic variability due to the fact that the population consists of a finite number of individuals: demographic stochasticity. Another is that induced by variability of climate, weather etc.: environmental stochasticity. If the population is large then it is the second source which is dominant: the population will behave virtually deterministically under fixed environmental conditions. If we suppose, for
simplicity, that the environmental states are discrete, with well-defined transition intensities, then the process is effectively piecewise deterministic. In such a case the state variable consists of the pair (j, x): the regime label j and the plant variable x. We leave it to the reader to verify that the infinitesimal generator of the process has the action

Aφ(j, x) = φ_x(j, x) a_j(x) + Σ_k λ_{jk}(x)[φ(k, x) − φ(j, x)].
3 THE DERIVATE CHARACTERISTIC FUNCTION

Recall that the moment-generating function (abbreviated MGF) of a random column vector x is defined as M(α) = E(e^{αx}), where the transform variable α is then a row vector. Some basic properties of MGFs are derived in Appendix 3. One would define M(iθ) as the characteristic function of x; this always exists for real θ. The two definitions then differ only by a 90° rotation of the argument in the complex plane, and it is not uncommon to see the two terms loosely confused.

Suppose from now on that the state variable x is vector-valued. We can then define the function

H(x, α) = e^{−αx} A e^{αx}    (5)

of the column vector x and the row vector α. It is known as the derivate characteristic function (abbreviated to DCF). We see, from the interpretation of A following its definition (8.15), that H has the corresponding interpretation

E(e^{α δx} | x(t) = x) = 1 + H(x, α) δt + o(δt),    (6)
where δx = x(t + δt) − x(t) is the increment in x over the time interval (t, t + δt]. The DCF thus determines the MGF of this increment for small δt; to this fact, plus the looseness of terminology mentioned above, it owes its name.

For example, consider a process for which x can jump to a value x + d_j(x) with probability intensity λ_j(x) (j = 1, 2, …). For this the infinitesimal generator has the action

Aφ(x) = Σ_j λ_j(x)[φ(x + d_j(x)) − φ(x)]    (7)

and the DCF has the evaluation

H(x, α) = Σ_j λ_j(x)[e^{α d_j(x)} − 1].    (8)
Comparing these last two relations we see that we can make the assertion, at least for processes of this type:

*Theorem 9.3.1 Relation (5) between the derivate characteristic function H and the infinitesimal generator A has the inversion

A = H(x, ∂/∂x)    (9)

if it is understood that, when we form Aφ(x) = H(x, ∂/∂x)φ(x), the differential operator ∂/∂x acts only on the x-argument of φ.

This indeed follows from comparison of (7), (8) and appeal to the 'Taylor' formula

e^{(∂/∂x)d} φ(x) = φ(x + d).

Here by (∂/∂x)d we mean the inner product Σ_k d_k ∂/∂x_k, where d_k and x_k are the kth components of the vectors d and x respectively. We shall return to the important identity (9) in Section 22.3.

4 PROCESSES OF INDEPENDENT INCREMENTS, AND PROCESSES DRIVEN BY THEM

A process {x(t)} is one of independent increments if the increments in x over disjoint time intervals are statistically independent. If, in addition, the distribution of the increment depends only upon the length of the interval, then it is a homogeneous process of independent increments (abbreviated HPII). Such processes are important because their 'time derivative' (should it exist) provides the continuous-time equivalent of a sequence of IID random variables. The abbreviation HPII is not a standard one, but it will serve us for this and the next section.

Let us define the MGF
M(α, t) = E[e^{α[x(t) − x(0)]}].

Then the HPII property is obviously equivalent to the property

M(α, t₁ + t₂) = M(α, t₁) M(α, t₂).    (10)

*Theorem 9.4.1 If {x(t)} is an HPII then there exists a function ψ(α) such that

M(α, t) = e^{tψ(α)}.    (11)

Furthermore,

E[exp{∫ α(t) dx(t)}] = exp{∫ ψ(α(t)) dt}    (12)

for prescribed functions α(t) for which the right-hand member is defined.
4 PROCESSES OF INDEPENDENT INCREMENTS
183
*Proof Relation (11) is indeed a consequence of (10). It follows from (11) and the independence of increments that, if {t_j} is an increasing sequence of time-points, then

E[exp{Σ_j α_j[x(t_j) − x(t_{j−1})]}] = exp{Σ_j (t_j − t_{j−1}) ψ(α_j)}.

Relation (12) then follows by a limiting argument. □

*Theorem 9.4.2 The process {x(t)} is an HPII if and only if H(x, α) is independent of x, so that

A = ψ(∂/∂x),    (13)

say. One can then identify ψ with the ψ of (11).

*Proof If the process is an HPII, so that (11) holds, then one indeed verifies that H(x, α) = ψ(α), implying (13). To establish the reverse conclusion, define the conditional MGF M(α, x, t) = E[e^{αx(t)} | x(0) = x]. This obeys the backward equation

∂M(α, x, t)/∂t = ψ(∂/∂x) M(α, x, t),

which has a solution

M(α, x, t) = e^{αx + tψ(α)}.    (14)
But relation (14) is sufficient to establish both relation (10) and the representation (11). □

So, H(x, α) = ψ(α) is both necessary and sufficient for the HPII property, with the ψ(α) of (11) identified with the DCF. It is natural to ask: what functions ψ are possible? which are interesting? In fact, the HPII property implies that exp[ψ(α)] must be the MGF of an infinitely divisible distribution. The Lévy–Khinchine theorem then implies that an HPII process must be a superposition of independent Poisson and Wiener processes. It is a quick matter to describe these processes and verify the HPII property for them.

The simplest HPII process is the Poisson process. For this x takes only non-negative integer values, the only transitions are of the form x → x + 1, and these all have the same intensity λ. One could regard x(t) − x(0) as the number of events which have taken place in the time interval (0, t], if events take place independently with probability intensity λ. These events might be insurance claims, traffic accidents or arrivals of cosmic particles in a chamber. The DCF of the process is ψ(α) = λ(e^α − 1). We see then from (11) that the distribution of the number of events in a time interval of length t is Poisson with expectation λt, whence the name of the process. One speaks of the events occurring in a Poisson stream of rate λ.
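The Poisson character of the count is easy to check by simulation, using the fact that inter-event times in a stream of rate λ are exponential with rate λ. A sketch, not from the text, with an invented rate and interval length; for a Poisson distribution the mean and variance of the count should both be close to λt.

```python
# Poisson process: events at exponential spacings of rate lam, so the count
# in (0, t] is Poisson with mean (and variance) lam * t. Monte Carlo check;
# the rate, interval and replication count are invented for illustration.
import random

def poisson_count(lam, t, rng):
    n, s = 0, rng.expovariate(lam)   # first event time
    while s <= t:
        n += 1
        s += rng.expovariate(lam)    # next exponential gap
    return n

rng = random.Random(1)
lam, t, reps = 3.0, 2.0, 20000
counts = [poisson_count(lam, t, rng) for _ in range(reps)]
mean = sum(counts) / reps
var = sum((c - mean) ** 2 for c in counts) / reps
print(round(mean, 2), round(var, 2))   # both should be close to lam*t = 6
```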
We can generalise the process by superimposing Poisson streams. Consider the scalar process {w(t)} generated by

w(t) = Σ_j a_j x_j(t),    (15)

where the a_j are constants and the {x_j(t)} are independent Poisson processes of respective rates λ_j. For example, the jth stream might be interpretable as a stream of particles each carrying charge a_j; the variable w(t) is then the total charge which has accumulated by time t. Such a process is termed a compound Poisson process. It is plainly HPII with DCF

ψ(α) = Σ_j λ_j (e^{α a_j} − 1).    (16)
One obtains an interesting class of models by driving a deterministic model by an HPII.

Theorem 9.4.3 Consider the vector process {x(t)} generated by the differential equation (written in incremental form)

δx = a(x) δt + b(x) δw,    (17)

where {w(t)} is a vector-valued homogeneous process of independent increments with DCF ψ(α). Then {x(t)} is a Markov process with DCF

H(x, α) = α a(x) + ψ(α b(x)).    (18)
The assertions follow immediately if one regards relation (6) as providing the effective definition of the DCF and inserts expression (17) for δx into it.

The 'plant equation' (17) is interesting as the continuous-time version of a process driven by 'noise' which is totally random, in the sense that its values at different times are statistically independent. One is tempted to rewrite it as

ẋ = a(x) + b(x)ε,    (19)

where the 'noise' variable ε is the time derivative of w. The process {w(t)} generated by (15) plainly does not have a derivative in the usual sense. If one admitted the idea, one would have to regard it as a random sequence of impulses, those of magnitude a_j occurring with intensity λ_j, independently of other impulses. This is the 'shot noise' process formulated by physicists early in the century. The concept is proper in that integrals of shot noise are proper; one can write (12) as

E[exp{∫ α(t) ε(t) dt}] = exp{∫ ψ(α(t)) dt}.    (20)
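Expanding (16) to first and second order in α gives the mean rate Σ_j λ_j a_j and variance rate Σ_j λ_j a_j² of the compound Poisson process, so E[w(t)] = t Σ_j λ_j a_j and var[w(t)] = t Σ_j λ_j a_j². A Monte Carlo sketch, not from the text, with invented rates and charges:

```python
# Compound Poisson process (15): superposed Poisson streams of rates lam[j]
# carrying charges a[j]. From psi(alpha) = sum_j lam[j]*(e^{alpha*a[j]} - 1),
# mean and variance of w(t) are t*sum(lam*a) and t*sum(lam*a^2).
# All numerical values are invented for illustration.
import random

lam = [2.0, 0.5]          # stream rates (hypothetical)
a = [1.0, -3.0]           # charges carried (hypothetical)
t, reps = 4.0, 20000

def poisson(mean, rng):
    # Poisson variate via unit-rate exponential spacings on [0, mean]
    n, s = 0, rng.expovariate(1.0)
    while s <= mean:
        n += 1
        s += rng.expovariate(1.0)
    return n

rng = random.Random(3)
ws = [sum(aj * poisson(lj * t, rng) for lj, aj in zip(lam, a))
      for _ in range(reps)]
mean = sum(ws) / reps
var = sum((w - mean) ** 2 for w in ws) / reps
print(round(mean, 2), round(var, 2))
# theory: mean = 4*(2*1 + 0.5*(-3)) = 2.0, var = 4*(2*1 + 0.5*9) = 26.0
```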
5 WIENER PROCESSES (BROWNIAN MOTION), WHITE NOISE AND DIFFUSION PROCESSES

If we consider a sequence of shot noise processes in which the impulses become ever weaker but ever more frequent, then we reach the classic limit process associated with Brownian motion and white noise.

*Theorem 9.5.1 Consider the scalar HPII {w(t)} with DCF

ψ(α) = (e^{aα} + e^{−aα} − 2)/(2a²).    (21)

The limit process for small a is then an HPII with DCF ψ(α) = ½α² and infinitesimal generator

A = ψ(∂/∂x) = ½(∂/∂x)².    (22)

Further, since

E[exp{∫ α(t) dw(t)}] = E[exp{∫ α(t) ε(t) dt}] = exp{½ ∫ α(t)² dt},    (23)

increments of the process are jointly normally distributed.

The assertions are formally immediate. They constitute in fact a process form of the central limit theorem, given operational meaning by the fact that the argument can be phrased in terms of the statistics of well-behaved functionals such as the integral ∫ α(t) dw(t). We recognise in (21) a compound Poisson process consisting of two streams, each of rate 1/(2a²) and carrying 'charges' of size a and opposite signs. If a is small then the charge carried is small, but particles arrive fast. The 'derivative' ε of the limit process is an infinitely dense sequence of infinitesimal independent impulses, whose integral has zero expectation but is normally distributed.

The limit process {w(t)} is a time-homogeneous process of independent increments which also has the property of being Gaussian. As a function of time it is in fact continuous, and is the only continuous HPII. It is in all senses classical: sometimes referred to as Brownian motion and written B(t); sometimes referred to as the Wiener process and written W(t). Its formal differential ε(t) is termed 'white noise' by engineers, for reasons explained in Section 13.2. It is improper in having infinite variance, but proper in that integrals of white noise are proper, as we see in (23).

The argument of Theorem 9.5.1 can be repeated in a vector version. We then arrive at the notion of a vector Wiener process w(t) and vector white noise ε(t) with the property that the linear form ∫ α(t) dw(t) = ∫ α(t)ε(t) dt is normally distributed with zero mean and covariance matrix
cov(∫ α(t) ε(t) dt) = ∫ α(t) N α(t)^T dt    (24)
for some non-negative definite matrix N, at least for functions α(t) such that this last integral is convergent. Relation (24) implies that

cov((δt)^{−1} ∫_t^{t+δt} ε(τ) dτ) = N/δt,

which has an infinite limit as δt tends to zero. This is an indication that ε itself has an irregular character; its components have infinite variance. Such a conclusion is inevitable if one requires property (24) for non-zero N. The matrix N indeed gives the statistical scale of ε but is not its covariance matrix. It is more appropriately referred to as the power matrix.

As we have seen, one can derive white noise as the 'limit' of realistic physical processes. Mathematically, the concept is regularised in the so-called theory of generalised processes (see e.g. Gihman and Skorohod, 1972, 1979). White noise is to conventional processes what the Dirac δ-function is to conventional functions. Like the δ-function, it can be seen as a limit of conventional processes, and the operational characterisation in terms of linear functionals above is analogous to the operational characterisation ∫ α(t)δ(t) dt = α(0) of the δ-function. Initial discussions of the δ-function took the circumspect route of writing ∫ α(t)δ(t) dt as ∫ α(t) dH(t), where the Heaviside function H(t) is the formal integral of the δ-function: the function which is constant except for a single unit step at the origin. This is now seen as a crutch which can be kicked away, but the same caution leads to the writing of ∫ α(t)ε(t) dt as ∫ α(t) dw(t), where {w(t)} is a Wiener process.

The noise-driven processes (19) which are most considered, for various reasons, are those for which the driving noise ε is white; such a process is termed a diffusion process for reasons which will soon become clear. If one takes a deterministic process as a first crude approximation to a given stochastic process, then there are good reasons for taking a diffusion process as the second; see Section 22.3. We can specialise Theorem 9.4.3 to the diffusion case.
Theorem 9.5.2 Consider the vector process {x(t)} generated by the differential equation

δx = a(x) δt + b(x) δw    (25)

or

ẋ = a(x) + b(x)ε,    (26)

where ε is a white noise process (w a Wiener process) of power matrix N. Then {x(t)} is a Markov process with DCF

H(x, α) = α a(x) + ½ α N(x) α^T,    (27)

where N(x) = b(x) N b(x)^T.

We see from (13) and (27) that the generator of the process has the action

Aφ = φ_x a(x) + ½ tr[N(x) φ_xx],    (28)
where φ_xx is the square matrix of second differentials of φ. It is because A is a second-order differential operator in x that one terms the process a diffusion process. More exactly, the corresponding equation ∂f/∂t = A*f for the probability density f of x(t), where A* is the adjoint of A, is a diffusion equation in the full technical sense, representing the diffusion of probability mass in state space. The deterministic term a(x) in (26) simply induces a transfer of probability mass along the deterministic path, but the noise term induces a spreading of that mass. The vector a(x) and matrix N(x) are referred to as the drift and diffusion coefficients respectively.

It is indeed the generator (28) which determines the forward dynamics of the process, and so implies a specification of the 'stochastic plant equation'. However, in this case the stochastic differential equation (26) can be regarded as providing an alternative specification, and one expressed as a dynamic equation in the state variable itself.

For a controlled diffusion process the dynamic programming equation for the value function F(x, t) would be

inf_u [c(x, u, t) + F_t + F_x a(x, u, t) + ½ tr(N(x, u, t) F_xx)] = 0.    (29)
CHAPTER 10
Some Stochastic Examples

1 LQG REGULATION WITH PLANT NOISE

Let us return to the LQ regulation example of Section 2.4 with the sole modification that the plant equation is changed to

x_t = A x_{t−1} + B u_{t−1} + ε_t,    (1)

where ε_t is a random disturbance term. This term would often be referred to as plant noise: 'plant' because it occurs in the plant equation and 'noise' because in audio-electronic contexts the disturbance manifests itself as a rush of background noise. Plant noise is indeed often to be regarded as random, in that it is only partially predictable. The thermal noise of electronic circuits or the gusts to which an aircraft is subject provide examples of disturbances which are scarcely predictable at all.

The mention of predictability is a reminder that the stochastic character of the noise must be specified. A common starting assumption is that the noise is as unpredictable as possible: that it is white, in that the distribution of ε_t conditional on process history before time t is independent of both process history and t. This implies that the process has the Markov property (i) of Section 8.3. It also implies that the variables ε_t are independently and identically distributed. (See Sections 9.5 and 13.2 for a discussion of the special character of white noise in continuous time and of the reason for the description 'white'.) Process history before time t is specified by (X_{t−1}, U_{t−1}), since this determines ε_τ for τ < t.

Actually, we do not need to demand complete 'whiteness', but simply that

E(ε_t | X_{t−1}, U_{t−1}) = 0,    cov(ε_t | X_{t−1}, U_{t−1}) = N    (2)

for a constant matrix N. That is, that the conditional mean of ε_t should be constant (and normalised to zero, by convention) and that the conditional covariance matrix of ε_t should also be constant. A stronger assumption which is often made is that the noise should be Gaussian as well as white, i.e. that the noise variables should be jointly normally distributed. The resultant model is then termed an LQG model, indicating linear dynamics, quadratic costs and Gaussian noise. However, as far as discrete-time models are concerned, we need the Gaussian assumption first when we come to consider imperfect state observation in Chapter 12.
Theorem 10.1.1 Assume the stochastic plant equation specified by (1), (2) and the quadratic cost structure specified in equations (2.22) and (2.23). Then the value function has the form

F(x, t) = ½ x^T Π_t x + γ_t    (3)

and the optimal control the form

u_t = K_t x_t.    (4)

Here the matrices Π_t and K_t have the same evaluations as in Theorem 2.4.2 and

γ_t = γ_{t+1} + ½ tr(N Π_{t+1})    (5)

with terminal condition γ_h = 0.

That is, the matrix Π_t is the same solution of the Riccati equation (2.25) as before, and K_t is determined from it by the same relation (2.28) as before. Indeed, the control rule (4) is exactly what it was in the noise-free case, and the only effect of the noise is to add a term γ_t to the cost. The interpretation is that the quadratic term in (3) represents the cost caused by the initial deviation of state x from zero and the term γ_t represents the cost caused by the continued disturbance of state by noise. In the infinite-horizon limit (if one exists) the presence of noise induces a cost of ½ tr(NΠ) per unit time.
Proof The course of the proof is inductive, as ever. Relation (3) holds at time h. Assume then that it holds at time t + 1, so that

F(W_t) = inf_u {c(x, u) + ½ E[(Ax + Bu + ε_{t+1})^T Π_{t+1} (Ax + Bu + ε_{t+1}) | X_t, U_t]} + γ_{t+1},

where x and u are the values holding at time t. Assumptions (2) imply that the conditional expectation in this relation reduces to

½ (Ax + Bu)^T Π_{t+1} (Ax + Bu) + ½ tr(N Π_{t+1}),

whence all assertions follow. □
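The conclusions of Theorem 10.1.1 are easy to verify numerically in the scalar case: the Riccati recursion for Π_t is exactly the noise-free one, and the noise enters only through the additive cost γ_t of (5). A sketch, not from the text, with invented parameter values, taking per-stage cost ½(Rx² + Qu²) and, as an assumption, terminal cost matrix Π_h = R:

```python
# Scalar illustration of Theorem 10.1.1: the Riccati recursion for Pi_t is
# unchanged by plant noise; noise only adds gamma_t = gamma_{t+1}
# + 0.5 * N * Pi_{t+1} to the cost. All parameter values are invented,
# and Pi_h = R is an assumed terminal condition.

A, B, R, Q, N, h = 0.9, 0.5, 1.0, 1.0, 0.2, 20

Pi, gamma = R, 0.0                  # Pi_h = R (assumed), gamma_h = 0
for t in range(h - 1, -1, -1):
    K = -(Q + B * Pi * B) ** -1 * B * Pi * A    # gain: u_t = K_t x_t
    gamma = gamma + 0.5 * N * Pi                # noise-induced cost (5)
    Pi = R + A * Pi * A - A * Pi * B * (Q + B * Pi * B) ** -1 * B * Pi * A

print(round(Pi, 4), round(K, 4), round(gamma, 4))
```

Running the recursion shows Π_t converging to a fixed point of the Riccati equation while γ_t grows linearly at rate ½NΠ, the infinite-horizon cost per unit time noted above.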
The admission of plant noise with the 'second-order white' properties (2) thus affects the optimal control rule not at all (at least, if this is expressed in closed-loop form) and increases costs only by a term independent of state or policy.

It may seem strange that, for our first stochastic example, the optimal control was unaffected (at least in its closed-loop form) by the presence of the stochastic element. This is a consequence of the very special assumptions. The essence of optimal control generally is that one exploits the statistical characteristics of
noise variables to the full. For example, suppose that the plant equation (1) is modified to
$$x_t = Ax_{t-1} + Bu_{t-1} + v_t,$$
(6)
where $v_t$ is a stochastic noise signal of some more general form. One will often suppose that it can be generated as the output $v_t = G(\mathcal T)\epsilon_t$ of a linear filter driven by white noise, so that one still has in effect a system driven by white noise, but more complex than (1). This can be coped with in various ways, but the most insightful is that which follows from the certainty equivalence principle of Chapter 12. This principle implies that, under some further assumptions on the linear/Gaussian nature of observations, the optimal control at time $t$ for system (6) would be the same as that for the deterministic system
$$x_\tau = Ax_{\tau-1} + Bu_{\tau-1} + v_\tau^{(t)} \qquad (\tau > t), \tag{7}$$
where $v_\tau^{(t)}$ is an appropriate linear predictor of $v_\tau$ based on information available at time $t$. That is, at time $t$ one replaces future stochastic noise $v_\tau$ $(\tau > t)$ by an 'equivalent' deterministic disturbance $v_\tau^{(t)}$ and then applies the methods of Sections 2.9 or 6.3 to deduce the optimal feedback/feedforward control in terms of this predicted disturbance. We shall see in Chapter 12 that similar considerations hold if the state vector $x$ is itself not perfectly observable. It turns out that $\epsilon_\tau^{(t)} = 0$ $(\tau > t)$ for a white-noise input $\epsilon$ which has been perfectly observed up to time $t$. This explains why the closed-loop control rule was unaffected in case (1).

Once we drop LQG assumptions then treatment of the stochastic case becomes much more difficult. For general nonlinear cases there is not a great deal that can be said. We shall see in Section 7 and in Chapter 24 that one can treat some models for which LQG assumptions hold before termination, but for which rather general termination conditions and costs may be assumed. Some other models which we treat in this chapter are those concerned with the timing of a single definite action, or with the determination of a threshold for action. For systems of a realistic degree of complexity the natural appeal is often to asymptotic considerations: e.g. the 'heavy traffic' approximations for queueing systems or the large-deviation treatment of large-scale systems.

Exercises and comments

(1) Consider the closed- and open-loop forms of optimal control (2.32) and (2.33)
deduced for the simple LQ problem considered there. Show that if the plant equation is driven by white noise of variance $N$ then the additional cost incurred from time $t = 0$ is $\sum_s DQN(Q + sD)^{-1}$ or $hDN$ according as the closed- or the open-loop rule is used. These then grow as $\log h$ or as $h$ with increasing horizon.
SOME STOCHASTIC EXAMPLES
2 OPTIMAL EXERCISE OF A STOCK OPTION

As a last discrete-time example we shall consider a simple but typical financial optimisation problem. One has an option, although not an obligation, to buy a share at price $p$. The option must be exercised by day $h$. If the option is exercised on day $t$ then one can sell immediately at the current price $x_t$, realising a profit of $x_t - p$. The price sequence obeys the equation $x_{t+1} = x_t + \epsilon_t$, where the $\epsilon_t$ are independently and identically distributed random variables for which $E|\epsilon| < \infty$. The aim is to exercise the option optimally.

The state variable at time $t$ is, strictly speaking, $x_t$ plus a variable which indicates whether the option has been exercised or not. However, it is only the latter case which is of interest, so $x$ is the effective state variable. If $F_s(x)$ is the value function (maximal expected profit) with time $s$ to go then
$$F_0(x) = \max\{x - p, 0\} = (x - p)^+$$
and
$$F_s(x) = \max\{x - p,\; E[F_{s-1}(x + \epsilon)]\} \qquad (s = 1, 2, \ldots).$$
The general character of $F_s(x)$ is indicated in Figure 1; one can establish the following properties inductively: (i) $F_s(x) - x$ is non-increasing in $x$; (ii) $F_s(x)$ is increasing in $x$; (iii) $F_s(x)$ is continuous in $x$; (iv) $F_s(x)$ is non-decreasing in $s$. For example, (iv) is obvious, since an increase in $s$ amounts to a relaxation of the time constraint. However, for a formal proof:
$$F_1(x) = \max\{x - p,\; E[F_0(x + \epsilon)]\} \geq \max\{x - p, 0\} = F_0(x),$$
Figure 1  The value function at horizon $s$ for the stock option example.
whence $F_s$ is non-decreasing in $s$, by Theorem 3.1.1. Correspondingly, an inductive proof of (i) follows from
$$F_s(x) - x = \max\{-p,\; E[F_{s-1}(x + \epsilon) - (x + \epsilon)] + E(\epsilon)\}.$$
We then derive

Theorem 10.2.1  There exists a non-decreasing sequence $\{a_s\}$ such that an optimal policy is to exercise the option first when $x \geq a_s$, where $x$ is the current price and $s$ is the number of days to go before expiry of the option.

Proof  From (i) and the fact that $F_s(x) \geq x - p$ it follows that there exists an $a_s$ such that $F_s(x)$ is greater than $x - p$ if $x < a_s$ and equals $x - p$ if $x \geq a_s$. It follows from (iv) that $a_s$ is non-decreasing in $s$. □
The constant $a_s$ is then just the supremum of values of $x$ for which $F_s(x) > x - p$.
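The recursion for $F_s$ and the thresholds $a_s$ is easy to compute numerically. The sketch below uses an integer price grid and an assumed three-point distribution for $\epsilon$ with negative mean (all numbers hypothetical, chosen so that early exercise is sometimes optimal); it recovers the non-decreasing threshold sequence of Theorem 10.2.1.

```python
import numpy as np

# Assumed data: strike p = 10, step eps in {-1, 0, +1} with the probabilities
# below (mean -0.2), prices kept on a grid wide enough that edges don't matter.
p = 10
eps_vals = [-1, 0, 1]
eps_prob = [0.5, 0.2, 0.3]
grid = np.arange(0, 41)

F = np.maximum(grid - p, 0.0)     # F_0(x) = (x - p)^+
thresholds = []
for s in range(1, 16):
    # continuation value E[F_{s-1}(x + eps)], clipping x + eps at the edges
    EF = sum(q * F[np.clip(grid + e, 0, grid[-1])]
             for e, q in zip(eps_vals, eps_prob))
    # a_s: smallest x at which immediate exercise (value x - p) is optimal
    thresholds.append(grid[np.argmax(grid - p >= EF)])
    F = np.maximum(grid - p, EF)  # F_s(x)

print(thresholds)  # non-decreasing in s, as the theorem asserts
```

With zero or positive drift the continuation value dominates by Jensen's inequality and the threshold escapes to the top of the grid; the negative drift here is what makes the exercise region non-trivial.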
3 A QUEUEING MODEL

Queues and systems of queues provide a rich source of optimisation models in continuous time and with discrete state variable. One must not think simply of the single queue ('line') at a ticket counter; computer and communication systems are examples of queueing models which constitute a fundamental type of stochastic system of great technological importance. However, consideration of queues which feed into each other opens too big a subject; we shall just cover a few of the simplest ideas for single and parallel queues in this section and the next chapter.

Consider the case of a so-called M/M/1 queue, with $x$ representing the size of the queue and the control variable $u$ being regarded as something like service effort. If we say that customers arrive at rate $\lambda$ and are served at rate $\mu(u)$ then this is a loose way of stating that the transition $x \to x + 1$ has intensity $\lambda$ and the transition $x \to x - 1$ has intensity $\mu(u)$ if $x > 0$ (and, of course, intensity zero if $x = 0$). We assume the process time-homogeneous, so that the dynamic programming equation takes the form
$$\inf_u\left[c(x, u) + \frac{\partial F(x, t)}{\partial t} + \Lambda(u)F(x, t)\right] = 0. \tag{8}$$
Here the infinitesimal generator has the action (cf. (9.2))
$$\Lambda(u)\phi(x) = \lambda[\phi(x + 1) - \phi(x)] + \mu(u, x)[\phi(x - 1) - \phi(x)],$$
where $\mu(x, u)$ equals $\mu(u)$ or zero according as $x$ is positive or zero. If we were interested in average-optimisation over an infinite horizon then equation (8) would be replaced by
$$\gamma = \inf_u[c(x, u) + \Lambda(u)f(x)], \tag{9}$$
where $\gamma$ and $f(x)$ are respectively the average cost and the transient cost associated with the average-optimal policy.

In fact, we shall concern ourselves more with the question of optimal allocation of effort or of customers between several queues than with optimisation of a single queue. In preparation for this, it is useful to solve (9) for the uncontrolled case, when $\mu(u)$ reduces to a constant $\mu$ and $c(x, u)$ to $c(x)$. In fact, we shall assume the instantaneous cost proportional to the number in the queue, so that $c(x) = ax$. We leave it to the reader to verify that the solution of equation (9) is, in this reduced case,
$$\gamma = \frac{a\lambda}{\mu - \lambda}, \qquad f(x) = \frac{a\,x(x + 1)}{2(\mu - \lambda)}. \tag{10}$$
We have assumed the normalisation $f(0) = 0$. The solution is, of course, valid only if $\lambda < \mu$, for it is only then that queue size is finite in equilibrium.
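The verification invited above can be done mechanically. A minimal check (parameter values assumed for illustration) that $\gamma$ and $f(x)$ of (10) satisfy the average-cost equation (9) at every state, including the boundary state $x = 0$:

```python
# Uncontrolled M/M/1 queue with c(x) = a*x; requires lam < mu.
a, lam, mu = 1.0, 0.7, 1.0
gamma = a * lam / (mu - lam)
f = lambda x: a * x * (x + 1) / (2 * (mu - lam))

for x in range(0, 50):
    death = mu if x > 0 else 0.0   # mu(x) = mu for x > 0, zero at x = 0
    # right-hand side of (9): c(x) + Lambda f(x)
    rhs = a * x + lam * (f(x + 1) - f(x)) + death * (f(x - 1) - f(x))
    assert abs(rhs - gamma) < 1e-9

print("gamma =", gamma)
```

The $x$-dependent terms cancel because $f$ is quadratic with the coefficient in (10); what remains is the constant $\gamma = a\lambda/(\mu - \lambda)$, the mean queue length times the holding cost.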
4 THE HARVESTING EXAMPLE: A BIRTH-DEATH MODEL

Recall the deterministic harvesting model of Section 1.2, which we shall generally associate with fisheries, for definiteness. This had a scalar state variable $x$, the 'biomass' or stock level, which followed the equation
$$\dot{x} = a(x) - u. \tag{11}$$
Here $u$ is the harvesting rate, which (it is supposed) may be varied as desired. The rate of return is also supposed proportional to $u$, and normalised to be equal to it. (The model thus neglects two very important elements: the age structure of the stock and the $x$-dependence of the cost of harvesting at rate $u$.) We suppose again that the function $a(x)$, the net reproduction rate of the unharvested population, has the form illustrated in Figure 1.1; see also Figure 2. We again denote by $x_m$ and $x_0$ the values at which $a(x)$ is respectively maximal and zero. An unharvested population would thus reach an equilibrium at $x = x_0$.

We know from the discussion of Section 2.7 that the optimal policy has the threshold form: $u$ is zero for $x \leq c$ and takes its maximal value ($M$, say) for $x > c$. Here $c$ is the threshold, and one seeks now to determine its optimal value. If $a(x) > 0$ for $x \leq c$ and $a(x) - M < 0$ for $x > c$ then the harvested population has the equilibrium value $c$ and yields a return at rate $\gamma = a(c)$. If we do not discount, and so choose a threshold value which maximises this average return $\gamma$, then the optimal threshold is the value $x_m$ which maximises $a(c)$.

A threshold policy will still be optimal for a stochastic model under corresponding assumptions on birth and death rates. However, there is an effect which at first sight seems remarkable. If extinction of the population is impossible then one will again choose a threshold value which maximises average
return, and we shall see that, under a variety of assumptions, this optimal threshold indeed approaches $x_m$ as the stochastic model approaches determinism. However, if extinction is certain under harvesting then a natural criterion (in the absence of discounting) is to maximise the expected total return before extinction. It then turns out that the optimal threshold approaches $x_0$ rather than $x_m$ as the model approaches determinism. There thus seems to be a radical discontinuity in optimal policy between the situations in which the time to extinction is finite or infinite (with probability one, in both cases). We explain the apparent inconsistency in Section 6, and are again led to a more informed choice of criterion.

We shall consider three distinct stochastic versions of the model; to follow them through at least provides exercise in the various types of continuous-time dynamics described in the last chapter.

The first is a birth-death process. Let $j$ be the actual number of fish; we shall set $x = j/\kappa$ where $\kappa$ is a scaling parameter, reflecting the fact that quite usual levels of stock $x$ correspond to large values of $j$. (We are forced to assume biomass proportional to population size, since we have not allowed for an age structure.) We shall suppose that $j$ follows a continuous-time Markov process on the non-negative integers with possible transitions $j \to j + 1$ and $j \to j - 1$ at respective probability intensities $\lambda_j$ and $\mu_j$. These intensities thus correspond to population birth and death rates. The net reproduction rate $\lambda_j - \mu_j$ could be written $a_j$, and corresponds to $\kappa a(x)$. Necessarily $\mu_0 = 0$, but we shall suppose initially that $\lambda_0 > 0$. That is, that a zero population is replenished (by a trickle of immigration, say), so that extinction is impossible.

Let $\pi_j$ denote the equilibrium distribution of population size; the probability that the population has size $j$ in the steady state.
Then the relation $\pi_j\lambda_j = \pi_{j+1}\mu_{j+1}$ (expressing the balance of probability flux between states $j$ and $j + 1$ in equilibrium) implies that $\pi_j \propto \rho_j$, where
$$\rho_j = \frac{\lambda_0\lambda_1\cdots\lambda_{j-1}}{\mu_1\mu_2\cdots\mu_j} \qquad (j = 0, 1, 2, \ldots) \tag{12}$$
(with $\rho_0 = 1$, which is consistent with the convention that an empty product should be assigned the value unity). A threshold $c$ for the $x$-process implies a threshold $d \approx \kappa c$ for the $j$-process. For simplicity we shall suppose that the harvesting rate $M$ is infinite, although the case of a finite rate can be treated almost as easily. Any excess of population over $d$ is then immediately removed and one effectively has $\lambda_d = 0$ and $\rho_j = 0$ for $j > d$. The average return (i.e. expected rate of return in the steady state) on the $x$-scale is then
$$\gamma = \kappa^{-1}\pi_d\lambda_d, \tag{13}$$
the term $\pi_d\lambda_d$ representing the expected rate at which excess of population over $c$ is produced and immediately harvested.

Suppose now that the ratio $\theta_j = \mu_j/\lambda_j$ is effectively constant (and less than unity) for $j$ in the neighbourhood of $d$. The effect of this is that $\pi_{d-j} \approx \pi_d\theta_d^j$ $(j \leq d)$, so the probability that the population is an amount $j$ below threshold falls away exponentially fast with increasing $j$. Formula (13) then becomes
$$\gamma \approx \kappa^{-1}\lambda_d(1 - \theta_d) = \kappa^{-1}(\lambda_d - \mu_d) = \kappa^{-1}a_d = a(c).$$
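This argument can be made concrete numerically. In the sketch below the rate functions are assumed for illustration: $\lambda(x) = 2x + 0.05$ (the constant giving the trickle of immigration, so $\lambda_0 > 0$) and $\mu(x) = x + x^2$, so that $a(x) = x - x^2 + 0.05$ peaks near $x_m = 0.5$. For each threshold $d$ we build the equilibrium distribution from (12) and evaluate the average return (13); the maximising threshold lies close to $x_m$.

```python
import numpy as np

kappa = 200
lam = lambda j: kappa * (2 * (j / kappa) + 0.05)        # kappa * lambda(j/kappa)
mu  = lambda j: kappa * ((j / kappa) + (j / kappa) ** 2) # kappa * mu(j/kappa)

def average_return(d):
    # log-scale recursion for rho_j of (12), for numerical stability
    log_rho = np.zeros(d + 1)
    for j in range(1, d + 1):
        log_rho[j] = log_rho[j - 1] + np.log(lam(j - 1)) - np.log(mu(j))
    pi = np.exp(log_rho - log_rho.max())
    pi /= pi.sum()                 # normalised equilibrium distribution
    return pi[d] * lam(d) / kappa  # expression (13)

best_d = max(range(1, int(1.2 * kappa)), key=average_return)
print("optimal threshold c =", best_d / kappa)  # close to x_m = 0.5
```

The finite-$\kappa$ optimum is not exactly $x_m$, but the discrepancy shrinks as $\kappa$ grows, in line with the exponential-falloff argument above.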
The optimal rule under these circumstances is then indeed to choose the threshold $c$ as the level at which the net reproduction rate is maximal, namely, $x_m$.

This argument can be made precise if we pay attention to the scaling. The nature of the scaling leads one to suppose that the birth and death rates are of the forms $\lambda_j = \kappa\lambda(j/\kappa)$, $\mu_j = \kappa\mu(j/\kappa)$ in terms of functions $\lambda(x)$ and $\mu(x)$, corresponding to the deterministic equation $\dot{x} = \lambda(x) - \mu(x) = a(x)$ in the limit of large $\kappa$. The implication is then that $\lambda_j/\mu_j$ varies slowly with $j$ if $\kappa$ is large, with the consequence that the equilibrium distribution of $x$ falls away virtually exponentially as $x$ decreases from $c = d/\kappa$. The details of this verbal argument are easily completed (although anomalous behaviour at $j = 0$ can invalidate it in an interesting fashion, as we shall see). The theory of large deviations (Chapter 22) deals precisely with such scaled processes, in the range for which the scale is large but not yet so large that the process has collapsed to determinism. Virtually any physical system whose complexity grows at a smaller rate than its size generates examples of such processes.

Suppose now that $\lambda_0 = 0$, so that extinction is possible (and indeed certain if, as we shall suppose, passage to 0 is possible from all states and the population is harvested above some finite threshold). Expression (12) then yields simply a distribution concentrated on $j = 0$. Let $F_j$ be the expected total return before extinction conditional on an initial population of $j$. (It is understood that the policy is that of harvesting at an infinite rate above the prescribed threshold value $d$.) The dynamic programming equation is then
$$\lambda_j(F_{j+1} - F_j) + \mu_j(F_{j-1} - F_j) = 0 \qquad (0 < j \leq d), \tag{14}$$
with $F_0 = 0$ and $F_{d+1} - F_d = 1$.

Theorem 10.4.1  The solution of (14) is
$$F_j = \Pi_d\sum_{k=1}^{j}\frac{\mu_1\mu_2\cdots\mu_{k-1}}{\lambda_1\lambda_2\cdots\lambda_{k-1}} \qquad (j \leq d), \tag{15}$$
where
$$\Pi_d = \prod_{i=1}^{d}\frac{\lambda_i}{\mu_i}. \tag{16}$$
Proof  We can write (14) as $\lambda_j\Delta_{j+1} = \mu_j\Delta_j$, where $\Delta_j = F_j - F_{j-1}$. Using this equation to determine $\Delta_j$ in terms of $\Delta_{d+1} = 1$ and then summing to determine $F_j$, we obtain the solution (15). □
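The same machinery locates the optimal threshold in the extinction case. In the sketch below the rates are assumed for illustration: $\lambda(x) = 2x$ (so $\lambda_0 = 0$ and extinction is certain) and $\mu(x) = x + x^2$, giving $a(x) = x - x^2$ with $x_m = 0.5$ and $x_0 = 1$. The $d$-dependence of (15) is carried entirely by the factor $\Pi_d$ of (16), so the optimal threshold maximises $\Pi_d$; numerically it approaches $x_0$, not $x_m$.

```python
import numpy as np

kappa = 500
lam = lambda j: 2.0 * j            # kappa * lambda(j/kappa) with lambda(x) = 2x
mu  = lambda j: j + j * j / kappa  # kappa * mu(j/kappa) with mu(x) = x + x^2

# log Pi_d = sum_{i<=d} log(lambda_i / mu_i), accumulated over d
log_Pi = np.cumsum([np.log(lam(i)) - np.log(mu(i)) for i in range(1, 3 * kappa)])
best_d = 1 + int(np.argmax(log_Pi))
print("optimal threshold c =", best_d / kappa)  # approaches x_0 = 1, not x_m
```

Each factor $\lambda_i/\mu_i$ exceeds one exactly while $a(j/\kappa) > 0$, so $\Pi_d$ is maximised where $a$ crosses zero, i.e. within $\kappa^{-1}$ of $x_0$, as asserted in the text.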
We see from (15) that the $d$-dependence of the $F_j$ occurs only through the common factor $\Pi_d$, and the optimal threshold will maximise this. The maximising value will be that at which $\lambda_d/\mu_d$ decreases from a value above unity to one below, so that $a_d = \lambda_d - \mu_d$ decreases through zero. That is, the optimal value of $c$ is $x_0$, the equilibrium level of the unharvested population. More exactly, it is less than $x_0$ by an amount not exceeding $\kappa^{-1}$. This means in fact a very low rate of catch, even while the population is viable.

The two cases thus lead to radically different recommendations: that the threshold should be set near to $x_m$ or virtually at $x_0$ respectively. We shall explain the apparent conflict in the next two sections. It turns out that the issue is not really one of whether extinction is possible or not, but of two criteria which differ fundamentally and are both extreme in their way. A better understanding of the issues reveals the continuum between the two policies.

Exercises and comments
(1) Consider the naive policy first envisaged in Chapter 1, in which the population was harvested at a flat rate $u$ for all positive $x$. Suppose that this translates into an extra mortality rate of $v$ per fish in the stochastic model. The equilibrium distribution $\pi_j$ of population size is then again given by expression (12), once normalised, but with $\mu_j$ now modified to $\mu_j + v$. The roots $x_1$ and $x_2$ of $a(x) = u$, indicated in Figure 1.2 and corresponding to unstable and stable equilibria of the deterministic process, now correspond to a local minimum and a local maximum of $\pi_j$. It is on this local maximum that faith is placed, in that the probability mass is supposed to be concentrated there. However, as $v$ (and so $u$) increases this local maximum becomes ever feebler, and vanishes altogether when $u$ reaches the critical value $u_m = a(x_m)$.

5 BEHAVIOUR FOR NEAR-DETERMINISTIC MODELS

In order to explain the apparent stark contradiction between the two policies derived in Section 4 we need to obtain a feeling for orders of magnitude of the various quantities occurring as $\kappa$ becomes large, and the process approaches determinism. We shall follow through the analysis just for the birth-death model
of the last section, but it holds equally for the alternative models of Sections 8 and 9. Indeed, all three cases provide examples of the large deviation theory of Chapter 22.

Consider then the birth-death model in the case when extinction is possible. Since most of the time before extinction will be spent near the threshold value if $\kappa$ is large (an assertion which we shall shortly justify) we shall consider only $F_d$, the expected yield before extinction conditional on an initial value $d$ of $j$. Let $T_j$ denote the expected time before extinction which is spent in state $j$ (conditional on a start from $d$). Then, by the same methods which led to the evaluation (15), we find that
$$T_j = \Pi_d\left[\sum_{k=1}^{d}\frac{\mu_1\mu_2\cdots\mu_{k-1}}{\lambda_1\lambda_2\cdots\lambda_{k-1}}\right]\left[\frac{\mu_{j+1}\mu_{j+2}\cdots\mu_d}{\lambda_j\lambda_{j+1}\cdots\lambda_d}\right], \tag{17}$$
which is consistent with expression (15) for $F_d = \lambda_d T_d$.
When we see the process in terms of the scaled variable $x = j/\kappa$ we shall write $F_j = \kappa F(x)$ and $T_j = T(x)$. If we define
$$R(x) = \int_0^x \log[\lambda(y)/\mu(y)]\,dy \tag{18}$$
then we deduce from expression (15) that
$$F(c) = e^{\kappa R(c) + o(\kappa)}.$$
Theorem 10.8.2  Suppose that extinction is certain for the harvested process. Then the relevant solution of (43) is
$$F(x) = e^{\kappa R(c)}\int_0^x e^{-\kappa R(y)}\,dy\;\cdots$$
Proof  … $x > c$ and appeal to the condition of certain extinction. The last assertion follows from this evaluation. □

The analogue of the final conclusion of Section 4 then follows.
Theorem 10.8.3  The optimal value of $c$ is that maximising the expression $e^{\kappa R(c)}\cdots$

… $> c$. Show then that expression (40) has the evaluation
$$M(M\cdots R')^{-1} \approx (1/R') + 1/(M(\cdots R')) = a,$$
for large $\kappa$, where all functions are evaluated at $c$.
9 THE HARVESTING EXAMPLE: A MODEL WITH A SWITCHING ENVIRONMENT

As indicated in Section 9.2, we can use a piecewise-deterministic model to represent the effects of environmental variation. Suppose that the model has several environmental regimes, labelled by $i = 1, 2, \ldots$. In regime $i$ the population grows deterministically at net rate $a_i(x)$, but transition can take place to regime $h$ with probability intensity $\kappa\nu_{ih}$. This is then a model whose stochasticity is of quite a different nature to that of the birth-death or diffusion models of Sections 4 and 8. It comes from without rather than within, and represents what is termed 'environmental stochasticity', as distinguished from 'demographic stochasticity'. Whether this affects conclusions has to be determined. The equivalent deterministic model would be given by equation (11) but with
$$a(x) = \sum_i p_i a_i(x), \tag{46}$$
where $p_i$ is the steady-state probability that the system is in regime $i$. The model converges to this deterministic version if transitions between regimes take place so rapidly that one is essentially working in an 'average regime'. This occurs in the limit of large $\kappa$, so that $\kappa$ again appears as the natural scaling parameter.

A fixed threshold would certainly not be optimal for such a multi-regime model. It is likely that the optimal policy would be of a threshold nature, but with a different threshold in each regime. Of course, it is a question whether the regime is known at the time decisions must be made. Observation of the rate of change of $x$ should in principle enable one to determine which regime is currently in force, and so what threshold to apply. However, observation of $x$ itself is unreliable, and estimation of its rate of change fraught with extreme error. If one allowed for such imperfect observation then the optimal policy would base action on a posterior distribution (i.e. a distribution conditional on current observables) of the values of both $x$ and $i$ (see Chapter 15). An optimal policy would probably take the form that harvesting effort should be applied only if an estimate of the current value of $x$ exceeded a threshold dependent on both the precision of this estimate and the current posterior probabilities of the different regimes.

We shall consider only the fixed-threshold policy, desperately crude though it must be in this case, and shall see how the optimal threshold value compares with that for the equivalent deterministic model.

We shall consider a two-regime case, which is amenable to analysis. A value of $x$ at which $a_1(x)$ and $a_2(x)$ have the same sign cannot have positive probability in equilibrium. Let us suppose then that $a_1(x) = \lambda(x) \geq 0$ and $a_2(x) = -\mu(x) \leq 0$ over an interval which includes all $x$-values of interest. We shall set $\nu_{12} = \nu_1$ and $\nu_{21} = \nu_2$.
Suppose initially that extinction is impossible, so that the aim is to maximise the expected rate of return $\gamma$ in the steady state. We shall suppose that the maximal harvest rate $M$ is infinite. For the deterministic equivalent of the process we have, by (46),
$$a(x) = \frac{\nu_2\lambda(x) - \nu_1\mu(x)}{\nu_1 + \nu_2}. \tag{47}$$
We shall suppose that this has the character previously assumed; see Figure 2. We also suppose that $\mu(x) = 0$ for $x \leq 0$, so that $x$ is indeed confined to $x \geq 0$.

The question of extinction or non-extinction is more subtle for this model. Suppose, for example, that $\lambda(0) = 0$ (so that a zero population cannot be replenished) and that $\mu(x)$ is bounded away from zero for positive $x$. Then extinction would be certain, because there is a non-zero probability that the unfavourable regime 2 can be held long enough that the population is run down to zero. For extinction to be impossible in an isolated population one requires that $\mu(x)$ tends to zero sufficiently fast as $x$ decreases to zero; the exact condition will emerge.

Let $p_i(x)$ denote the probability/probability-density of the $(i, x)$ pair in equilibrium. These obey the Kolmogorov forward equations
$$\cdots \qquad (0 \leq x \leq c),$$
… $\lambda(x)$ grows as $x \downarrow 0$.
Theorem 10.9.2  In the deterministic limit expression (57) reduces to $\gamma = a(c)$, with $a(x)$ specified by (52).

Proof  We may suppose that $a(x) \geq 0$ in the interval $[0, c)$, so that $R$ is increasing in the interval. In the limit of large $\kappa$ expression (52) then becomes
$$\gamma \approx \frac{\lambda}{(\nu_1/R')(\lambda^{-1} + \mu^{-1}) + 1} = \frac{\nu_2\lambda - \nu_1\mu}{\nu_1 + \nu_2} = a,$$
with the argument $c$ understood throughout. □
The optimal value of $c$ thus converges to $x_m$ in the deterministic limit, as one would expect.

Suppose now that extinction is possible, in that $\lambda(0) = 0$ and that the population can be run down to zero in a bounded time in the unfavourable regime 2. It will then in fact be certain under communication/boundedness conditions. Let $F_i(x)$ denote the expected total return before absorption conditional on a start from $(i, x)$. We have then the dynamic programming equations
$$\lambda(x)F_1'(x) + \kappa\nu_1[F_2(x) - F_1(x)] = 0, \qquad -\mu(x)F_2'(x) + \kappa\nu_2[F_1(x) - F_2(x)] = 0 \qquad (0 < x < c). \tag{54}$$
Since escape from $x = 0$ by any means is impossible we have $F_1(0) = F_2(0) = 0$. However, the real assertion is that
$$F_1(0+) = \phi, \tag{55}$$
where $F_1(0+) = \lim_{x\downarrow 0}F_1(x)$ and $\phi$ is an as yet undetermined positive quantity. The point is that, if $x$ is small and positive, then it has time to grow in regime 1 and time to decline to zero in regime 2 (before there is a change in regime). The
SOME STOCHASTIC EXAMPLES
212
second equation of (54) continues to hold at $x = c$, but for the first we must substitute
$$F_1(c) = \lambda(c)/(\kappa\nu_1) + F_2(c). \tag{56}$$
Relation (56) follows from the fact that escape from $(1, c)$ is possible only by the transition of $i$ from 1 to 2; this takes an expected time $(\kappa\nu_1)^{-1}$, during which return is being built up at rate $\lambda(c)$.

Theorem 10.9.3
The value functions $F_i(x)$ have the evaluations
$$F_1(x) = \phi + (\beta/\nu_2)\int_0^x \cdots, \qquad F_2(x) = (\beta/\nu_1)\int_0^x \lambda(y)\,e^{\cdots}\cdots$$
… $\lambda(c)e^{\kappa R(c)}$. For $\kappa$ large this amounts to the maximisation of $R(c)$, i.e. to the equation $a(c) = 0$, with $a(x)$ having the determination (47). That is, the optimal threshold again approaches the value $x_0$. To be exact, the stationarity condition with respect to $c$ is …
If we assume that $\lambda(x)$ is increasing with $x$ then we see that, at least for sufficiently large $\kappa$, the optimal threshold $c$ lies somewhat above the value $x_0$. For the two previous models it lay below. It is in this that the nature of the stochasticity (environmental rather than demographic) reveals itself. In the previous examples
there would virtually never have been any harvesting if $c$ had been set above the equilibrium value $x_0$. The effect in this case is that $x$ can indeed rise sufficiently above $x_0$ during the favourable regime 1, and one waits for this to happen before harvesting.

Notes on the literature

The fact that the threshold $c$ should seemingly be set at the unharvested equilibrium level $x_0$ if one sought to maximise the expected total return before a point of certain extinction was first observed by Lande, Engen and Saether (1994, 1995) for the case of a diffusion process. The analysis of Section 8 expands this treatment. The material of Sections 4-6 and 9 appears in Whittle and Horwood (1995).
CHAPTER 11

Policy Improvement: Stochastic Versions and Examples

1 THE TRANSFER OF DETERMINISTIC CONCLUSIONS

In Chapter 3 we considered patterns of infinite-horizon behaviour and the technique of policy improvement for the deterministic case. All conclusions reached there transfer as they stand to the stochastic case if we appropriately extend the definitions of the forward operators $L$ and $\mathscr{L}$ and their continuous-time analogues. As in Chapter 3, attention is restricted to the state-structured time-homogeneous case. In discrete time we define the conditional expectation operator
$$E(u)\phi(x) = E[\phi(x_{t+1}) \mid x_t = x,\; u_t = u]. \tag{1}$$
The forward operators $L(g)$ and $\mathscr{L}$ for the policy $g^{(\infty)}$ and an optimal policy respectively are then defined by
$$L(g)\phi(x) = c(x, g(x)) + \beta E(g(x))\phi(x), \tag{2}$$
$$\mathscr{L}\phi(x) = \inf_u[c(x, u) + \beta E(u)\phi(x)]. \tag{3}$$
The corresponding dynamic programming equations for the value functions $V_s = V_s(g^{(\infty)})$ and $F_s$ then again take the forms
$$V_s = L(g)V_{s-1}, \qquad F_s = \mathscr{L}F_{s-1} \qquad (s > 0), \tag{4}$$
with the $x$-argument of these functions understood. The material of Chapter 3 is now valid as it stands with these extended definitions. Explicitly: Theorem 3.1.1, asserting the monotonicity of the forward operators and the monotonicity (in $s$) of the value functions under appropriate terminal conditions, still holds. If we assume the instantaneous cost function $c(x, u)$ non-negative then Theorems 3.2.1-3.2.3 on total cost still hold, as does Theorem 3.5.1 on policy improvement.

The continuous-time analogue of the operator (1) is the infinitesimal generator $\Lambda(u)$ for the controlled process, defined in (8.15). In terms of this the stochastic versions of the differential forward operators $M(g)$ and $\mathscr{M}$ of Section 3.1 take the forms
$$M(g)\phi(x) = c(x, g(x)) - \alpha\phi(x) + \Lambda(g(x))\phi(x),$$
$$\mathscr{M}\phi(x) = \inf_u[c(x, u) - \alpha\phi(x) + \Lambda(u)\phi(x)].$$
The assertions of Chapter 3 for the continuous-time case then also transfer bodily.

Exercises and comments
(1) We can supplement the example of instability given in Exercise 3.2.1 by the classic stochastic example of the simplest gambling problem. Suppose a gambler has a capital $x$ which takes integral values, positive as well as negative, and has the choice of ceasing play with reward $x$, or of placing a unit stake and continuing. In this second case he doubles his stake or loses it, each with probability 1/2. If his aim is to maximise expected reward then the dynamic programming equation is
$$G_s(x) = \max\{x,\; \tfrac12[G_{s-1}(x - 1) + G_{s-1}(x + 1)]\} \qquad (s > 0),$$
where $G_s(x)$ is his maximal expected reward if he has capital $x$ with $s$ plays remaining. If at $s = 0$ he only has the option of retiring then $G_0(x) = x$, and so $G_s(x) = x$ for all $s$. However, the infinite-horizon version of this equation also has a solution $G(x) = +\infty$. If the retirement reward $x$ is replaced by $\min(a, x)$ for integral $a$ then the equation has a solution $G(x) = a$ for $x \leq a$. This corresponds to the policy in which the player continues until he has a capital of $a$ (an event which ultimately occurs with probability one for any prescribed $a$) and then retires. The solution $G(x) = +\infty$ corresponds to an indefinitely large choice of $a$.

Investigate how infinite-horizon conclusions are modified if any of the following concessions to reality is admitted: (i) debt is forbidden, so that termination is enforced in state 0; (ii) rewards are discounted; (iii) a constant positive transaction cost is levied at each play.

(2) An interesting example in positive programming is that of blackmail. Suppose there are two states: those in which the blackmailer's victim is compliant or resistant. Suppose that, if the blackmailer makes a demand of $u$ $(0 \leq u \leq 1)$, then a compliant victim pays it, but becomes resistant with probability $u^2$. A resistant victim pays nothing and stays resistant. If $G_s$ is the maximal expected amount the blackmailer can extract from a compliant victim in $s$ further demands, then $G_0 = 0$ and
$$G_{s+1} = \sup_u[u + (1 - u^2)G_s] = \psi(G_s),$$
say. Here the optimising value of $u$ is the smaller of 1 and $(2G_s)^{-1}$, and $\psi(G)$ is 1 or $G + 1/(4G)$ according as $G$ is less than or greater than $\tfrac12$. Show that $G_s$ grows as $s^{1/2}$ and the optimal demand decreases as $s^{-1/2}$ for large $s$. There is thus no meaningful infinite-horizon limit, either in total reward or in
optimal policy. The blackmailer becomes ever more careful as his horizon increases, but the limiting policy $u = 0$ is of course not optimal.
(3) Consider the discounted version of the problem, for which the infinite-horizon reward $G$ obeys $G = \psi(\beta G)$. Show that, if $\tfrac12 \leq \beta < 1$, then $G = (2\sqrt{\beta(1 - \beta)})^{-1}$.
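Exercises (2) and (3) can be checked numerically. A short sketch (all numbers illustrative) iterating the map $\psi$ to exhibit the $s^{1/2}$ growth, and then iterating $G \mapsto \psi(\beta G)$ to recover the discounted closed form:

```python
import math

def psi(G):
    # psi(G) = sup_u [u + (1 - u^2) G] with optimiser u = min(1, 1/(2G))
    u = min(1.0, 1.0 / (2.0 * G)) if G > 0 else 1.0
    return u + (1.0 - u * u) * G

# undiscounted recursion: G_s / sqrt(s) settles near a constant
G = 0.0
for s in range(10000):
    G = psi(G)
growth = G / math.sqrt(10000)
print("G_10000 / sqrt(10000) =", growth)

# discounted fixed point G = psi(beta * G), compared with the closed form
beta = 0.8
Gd = 1.0
for _ in range(200):
    Gd = psi(beta * Gd)
closed_form = 1.0 / (2.0 * math.sqrt(beta * (1.0 - beta)))
print(Gd, closed_form)
```

Since $G_{s+1}^2 \approx G_s^2 + \tfrac12$ for large $G_s$, the growth constant settles near $1/\sqrt2 \approx 0.707$, and for $\beta = 0.8$ the fixed point agrees with $(2\sqrt{\beta(1-\beta)})^{-1} = 1.25$.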
2 AVERAGE-COST OPTIMALITY

The problem of average-cost optimisation is one for which the stochastic model in fact shows significant additional features. Because of its importance in applications, it is also one for which we would wish to strengthen the discussion of Chapter 3.

We observed in the deterministic contexts of Sections 2.9 and 3.3 that one could relatively easily determine the value of control at an optimal equilibrium point, but that the determination of a control which stabilised that point (or, more ambitiously, optimised passage to it) was a distinct and more difficult matter. The stochastic case is less degenerate, in that this distinction is then blurred. Consider the case of a discrete state space, for simplicity. Suppose that there is a class of states $\mathscr{R}$ for which recurrence is certain under a class of control policies which includes the optimal policy. Then all these states will have positive probability in equilibrium (under any policy of the class) and, in minimising the average cost, one also optimises the infinite-horizon control rule at every state value in $\mathscr{R}$. Otherwise expressed: since equilibrium behaviour still implies continuing variation (within $\mathscr{R}$) in the stochastic case, optimisation of average cost also implies optimisation against transient disturbance (within $\mathscr{R}$).

These ideas allow us to give the equilibrium dynamic programming equations (3.9) and (3.10) an interpretation and a derivation independent of the sometimes troublesome infinite-horizon limit. Consider the cost recursion for the policy $g^{(\infty)}$:
$$\gamma + v = L(g)v, \tag{5}$$
where $\gamma$ is the average cost under the policy and $v(x)$ the transient cost from $x$, suitably normalised. (These are both $g$-dependent, but we take this as understood, for notational simplicity.) Suppose that the state space is discrete and all states are recurrent under the policy. Then $\gamma$ can be regarded as an average cost over a recurrence cycle to any prescribed state (see Exercise 2). Equation (5) can be very easily derived in this approach, which completely avoids any mention of infinite-horizon limits, although it does imply indefinite continuation. The natural normalisation of $v$ is to require that $E[v(x)]$ should be zero, where the expectation is that induced by the policy in equilibrium. This is equivalent to requiring that the total transient cost over a recurrence cycle should have zero expectation.

If there are several recurrence classes under the policy then there will be a separate equation (5) for each recurrence class. These recurrence classes are
POLICY IMPROVEMENT: STOCHASTIC VERSIONS
analogous to the stable equilibrium points of the deterministic case (although see Exercise 1). In the optimised case the equation

   γ + f = ℒf   (6)

has a similar interpretation, the equation holding for all x in a given recurrence class under the optimal policy, and γ and f being the minimal average cost and transient cost in this class. Whether this equation is either necessary or sufficient for optimality depends again upon uniqueness questions: on whether the value function could be affected by a notional closing cost, even in the infinite-horizon limit. The supposition of a non-negative cost does indeed imply that f may be assumed bounded below, and so that the relevant f-solution of (6) is the minimal solution exceeding an arbitrary specified bound. However, one must frequently appeal to arguments more specific to the problem if one is to resolve the matter.
Theorem 11.2.1 Suppose that (6) can be strengthened to

   γ + f = ℒf = L(g)f.   (7)

Then

   γ + d_n ≤ n⁻¹ E_π[ Σ_{t=0}^{n−1} c_t | W_0 ],   (8)

where d_n = n⁻¹(f(x_0) − E[f(x_n)|W_0]), for any policy π, with equality if π = g^(∞). If d_n → 0 with increasing n for any π and for any W_0 one can then assert that the policy g^(∞) is average-cost optimal.

Proof Relation (7) is a strengthening in that the second equality asserts that the infimum with respect to u implied in the evaluation of ℒf(x) is attained by the choice u = g(x). If we denote the value of c(x, u) at time t by c_t then (7) can be written

   γ + f(x_t) ≤ E_π[c_t + f(x_{t+1}) | W_t]

for any policy π, with equality if π = g^(∞). Taking expectations on both sides conditional on W_0 and summing over t from 0 to n − 1 we deduce that

   nγ + f(x_0) ≤ E_π[ Σ_{t=0}^{n−1} c_t + f(x_n) | W_0 ],

whence the assertions from (8) onwards follow. □
If c is uniformly bounded in both directions then we may assume the same of f, and it is then clear that d_n has a zero limit. In other cases special arguments are required to establish the result.

Exercises and comments
(1) One might think of the recurrence classes as being analogous to the domains of attraction of the deterministic case, but this is not so. The notion that occupation of the states should actually recur is important: in the deterministic case this would be true only of the points of the so-called attractor. Furthermore, the transient states in the stochastic formulation may communicate with many of the recurrence classes. A different set of ideas altogether (and one with which we are not concerned for the moment) is encountered if a process is 'scaled' so that it approaches a deterministic limit as a parameter is increased. In this case states which are in the same recurrence class may communicate ever more feebly as the parameter is increased, until they belong to distinct domains of attraction in the deterministic limit.
(2) A controlled Markov process with a stationary Markov control policy is simply a Markov process with an associated cost function. Suppose the state space discrete, that c(x) is the instantaneous cost in state x and that p(x, y) is the transition probability from state x to state y. Suppose we define a modified cost function c(x) − γ and define v(x) as the expected sum of modified costs over a path which starts at state x and ends at first entry (or recurrence) to a specified state, say state 0. Then

   v(x) = c(x) − γ + Σ_{y≠0} p(x, y)v(y).

Suppose we choose γ so that v(0) = 0, i.e. so that the sum of modified costs has zero expectation over a recurrence cycle. Then this last equation becomes simply

   v(x) = c(x) − γ + Σ_y p(x, y)v(y),   (9)

which we can identify with (5). We have 0 = v(0) = E[Σ_t (c(x_t) − γ)], where the sum is over a recurrence cycle to state 0. That is,

   γ = E[ Σ_{t<τ} c(x_t) ] / E[τ],

where τ is the recurrence time; this exhibits γ as the average cost over a recurrence cycle. On the other hand, we deduce from (9) that γ = Σ_x π(x)c(x), where {π(x)} is the equilibrium distribution over states. The second interpretation is the 'infinite horizon' one; the first is purely in terms of recurrence, with no explicit appeal to limiting operations.
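The two interpretations of γ can be checked numerically. The sketch below (plain NumPy; the chain and its costs are invented for illustration) computes γ both from the equilibrium distribution and from a recurrence cycle to state 0, and then solves (9) under the normalisation v(0) = 0.

```python
import numpy as np

# A small Markov chain with a cost function, illustrating Exercise 2.
# p(x, y) is the transition matrix P, c(x) the instantaneous cost.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.4, 0.0, 0.6]])
c = np.array([1.0, 3.0, 5.0])

# Interpretation 1: gamma = sum_x pi(x) c(x), pi the equilibrium distribution.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi /= pi.sum()
gamma_equilibrium = pi @ c

# Interpretation 2: gamma = (expected cost over a recurrence cycle to state 0)
# divided by the expected recurrence time.  Solve the first-passage equations
#   w(x) = c(x) + sum_{y != 0} p(x,y) w(y)   (expected cost until hitting 0)
#   m(x) = 1    + sum_{y != 0} p(x,y) m(y)   (expected time until hitting 0)
P0 = np.hstack([np.zeros((3, 1)), P[:, 1:]])   # transitions avoiding state 0
A = np.eye(3) - P0
w = np.linalg.solve(A, c)
m = np.linalg.solve(A, np.ones(3))
gamma_cycle = w[0] / m[0]          # cycle starts (and ends) at state 0

assert np.isclose(gamma_equilibrium, gamma_cycle)

# Transient cost v from (9): v = c - gamma + P v, normalised by v(0) = 0.
# Replace the equation for state 0 by the normalisation to fix the solution.
B = np.eye(3) - P
B[0] = [1.0, 0.0, 0.0]
rhs = c - gamma_cycle
rhs[0] = 0.0
v = np.linalg.solve(B, rhs)
print(gamma_cycle, v)
```

As the theory predicts, the cycle-based and equilibrium-based values of γ agree, and the computed v satisfies (9) in full, including the row for state 0.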
(3) The blackmail example of Exercise 1.2 showed infinite total reward, but is not rephrasable as a regular average-reward problem. The maximal expected reward over horizon s grows as s^(1/2) rather than s, so the average reward is zero and the transient reward infinite. Indeed, an average reward of γ and a transient reward of f(x) are associated with a development

   F(x) = γ/(1 − β) + f(x) + o(1)
of the discounted value function for 1 − β small and positive. This contrasts with the (1 − β)^(−1/2) behaviour observed in Exercise 1.2.

3 POLICY IMPROVEMENT FOR THE AVERAGE-COST CASE

The average-cost criterion is the natural one in many control contexts, as we have emphasised. It is then desirable that we should obtain an average-cost analogue of Theorem 3.5.1, establishing the efficacy of policy improvement. We assume a policy g_i^(∞) at stage i. Let us distinguish the corresponding evaluations of γ, v(x), L and expectation E by a subscript i. The policy-improvement step determines g_{i+1} by ℒv_i = L_{i+1}v_i. Let us write the consequent equation

   γ_i + v_i ≥ ℒv_i = L_{i+1}v_i   (10)

in the form

   γ_i + v_i ≥ ℒv_i + δ = L_{i+1}v_i + δ,   (11)

where δ is some non-negative constant. If δ can in fact be chosen positive then one has an improvement in a rather strong sense, as we shall see.
Theorem 11.3.1 Relation (11) has the implication

   γ_i ≥ n⁻¹ E_{i+1}[ Σ_{t=0}^{n−1} c_t | x_0 ] + δ + d_n,   (12)

where d_n = n⁻¹(v_i(x_0) − E_{i+1}[v_i(x_n)|x_0]). If d_n → 0 with increasing n it then follows that: (i) there is a strict improvement in average cost if δ is positive; (ii) average-optimality has been attained if equality holds in (11).

The proof follows the same course as that of Theorem 11.2.1. Note that, effectively, the only policies considered are stationary Markov. However, we expect the optimal policy to lie in this class.
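The average-cost policy-improvement iteration can be sketched in a few lines. The chain, costs and actions below are invented; the evaluation step solves γ + v = L(g)v with the normalisation v(0) = 0 (it assumes state 0 is reachable under every stationary policy, so the linear system is nonsingular), and the improvement step minimises c(x, u) + Σ_y p(x, y|u)v(y).

```python
import numpy as np

# Average-cost policy improvement on a toy controlled Markov chain.
# P[u] is the transition matrix under action u; C[x, u] the cost rate.
n_states, n_actions = 3, 2
P = np.array([
    [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]],   # action 0
    [[0.5, 0.5, 0.0], [0.5, 0.3, 0.2], [0.6, 0.2, 0.2]],   # action 1
])
C = np.array([[0.0, 1.0],
              [2.0, 2.5],
              [5.0, 4.0]])

def evaluate(g):
    """Solve gamma + v = c_g + P_g v with normalisation v(0) = 0."""
    Pg = P[g, np.arange(n_states)]
    cg = C[np.arange(n_states), g]
    # Unknowns are (gamma, v_1, ..., v_{n-1}), with v_0 = 0.
    A = np.zeros((n_states, n_states))
    A[:, 0] = 1.0                              # coefficient of gamma
    A[:, 1:] = (np.eye(n_states) - Pg)[:, 1:]  # coefficients of v_1, v_2
    sol = np.linalg.solve(A, cg)
    return sol[0], np.concatenate([[0.0], sol[1:]])

g = np.zeros(n_states, dtype=int)      # initial stationary Markov policy
for _ in range(20):
    gamma, v = evaluate(g)
    # Improvement step: minimise c(x,u) + sum_y p(x,y|u) v(y) over u.
    g_new = np.argmin(C + np.einsum('uxy,y->xu', P, v), axis=1)
    if np.array_equal(g_new, g):
        break                          # equality in (11): optimality attained
    g = g_new

print("average cost:", gamma, "policy:", g)
```

At termination the pair (γ, v) satisfies the optimality equation (6)-(7): γ + v equals the minimised right-hand side in every state.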
In continuous time the optimality equation (7) is replaced by γ = ℒf, i.e.

   γ = inf_u [ c(x, u) + Λ(u)f(x) ],   (13)

where Λ(u) is the infinitesimal generator of the process under control value u.
4 MACHINE MAINTENANCE

As an example, consider the optimisation of the allocation of service effort over a set of n machines. The model will be a very simple one, formulated in continuous time. Consider first a model for a single machine which, with use, passes through states of increasing wear x = 0, 1, 2, .... The passage from x to x + 1 has intensity λ for all x, and the machine incurs cost at rate cx while in state x, this representing the effect of wear on operation. A service restores the machine instantaneously to state 0. Suppose the machine is serviced randomly, with probability intensity μ. The dynamic programming equation under this blind policy will be

   γ = cx + λ[f(x + 1) − f(x)] + μ[f(0) − f(x)]   (14)

if costs are undiscounted. Here γ is the average cost under the policy and f(x) the transient cost in state x. The equation has the unique solution

   f(x) = cx/μ,   γ = λc/μ,   (15)
if we make the normalising assumption f(0) = 0.

Suppose now that we have n such machines and distinguish the parameters and state of the ith machine by a subscript i. The average cost for the whole system will then be

   γ = Σ_i γ_i = Σ_i λ_i c_i/μ_i   (16)

with the μ_i constrained by

   Σ_i μ_i = μ,   (17)

if μ is the intensity of total service effort available. Specification of the μ_i amounts to a random policy in which, when a machine is to be serviced, the maintenance man selects machine i with probability μ_i/μ. A more intelligent policy will react to system state x = {x_i}. One stage of policy improvement will achieve this. However, before considering policy improvement, one could simply optimise the random policy by choosing the μ_i to minimise expression (16) subject to constraint (17). One readily finds that this leads to an allocation μ_i ∝ √(λ_i c_i). Policy improvement will recommend one to next service the machine i for which
   Σ_j {c_j x_j + λ_j[f_j(x_j + 1) − f_j(x_j)]} + μ[f_i(0) − f_i(x_i)]
is minimal; i.e. for which
f_i(x_i) = c_i x_i/μ_i is greatest. If we use the optimised value of μ_i derived above then the recommendation is that the next machine i to be serviced is that for which x_i √(c_i/λ_i) is greatest.

Note that this is an index rule, in that an index x_i √(c_i/λ_i) is calculated for each machine and that machine chosen for service whose current index is greatest. The rule seems sensible: degree of wear, cost of wear and rapidity of wear are all factors which would cause one to direct attention towards a given machine. However, the rule is counter-intuitive in one respect: the index decreases with increasing λ_i, so that an increased rate of wear would seem to make the machine need service less urgently. However, the rate is already taken account of in the state variable x_i itself, which one would expect to be of order λ_i if a given time has elapsed since the last service. The deflation of x_i by a factor √λ_i is a reflection of the fact that one expects a quickly wearing component to be more worn, even under an optimal policy. An alternative argument in Section 14.6 will demonstrate that this policy is indeed close to optimal.

5 CUSTOMER ALLOCATION BETWEEN QUEUES

Suppose there are n queues of the type considered in Section 10.3. Quantities defined for the ith queue will be given subscript i, so that x_i, λ_i, μ_i and a_i x_i represent size, arrival rate, service rate and instantaneous cost rate for that queue. We suppose initially that these queues operate independently, and use
   λ = Σ_i λ_i   (18)

to denote the total arrival rate of customers into the system.
However, suppose that arriving customers can in fact be routed into any of the queues (so that the queues are mutually substitutable alternatives rather than components of a structured network). We look for a routing policy π which minimises the expected average cost γ_π = E_π[Σ_i a_i x_i]. The policy implied by the specification above simply sends an arriving customer to queue i with probability λ_i/λ; the optimal policy will presumably react to the current system state {x_i}. The random routing policy achieves an average cost of

   γ = Σ_i γ_i = Σ_i a_i λ_i/(μ_i − λ_i)   (19)
(see (10.10)). As in the last section, before applying policy improvement we might as well optimise the random policy by choosing the allocation rates λ_i to minimise expression (19) subject to (18), for given λ. One readily finds that the optimal choice is

   λ_i = μ_i − √(θ a_i μ_i),   (20)

where θ is a Lagrange multiplier whose value is chosen to secure equality in (18). So, speed of service and cheapness of occupation both make a queue more attractive; a queue for which a_i/μ_i > 1/θ will not be used at all.

Consider one stage of policy improvement. It follows from the form of the dynamic programming equation that, if the current system state is {x_i}, then one will send the next arrival to that queue i for which
   λ[f_i(x_i + 1) − f_i(x_i)] + Σ_j μ_j(x_j)[f_j(x_j − 1) − f_j(x_j)]   (21)
is minimal. Here the f_j are the transient costs under the random allocation policy, determined in (10.10). That is, one sends the arrival to that queue i for which

   f_i(x_i + 1) − f_i(x_i) = a_i(x_i + 1)/(μ_i − λ_i)

is minimal. If the λ_i have already been optimised by the rule (20) then it follows that one sends the new arrival to the queue for which (x_i + 1)√(a_i/μ_i) is minimal, although with i restricted to the set of values for which expression (20) is positive. This rule seems sensible: one tends to direct customers to queues which are small, cheap and fast-serving. Note that the rule again has the index form.
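The effect of the index rule can be seen in a small simulation. The sketch below uniformises a two-queue continuous-time chain; all parameter values are invented, and the 'random' benchmark splits arrivals equally rather than by the optimised rates (20), so the comparison is indicative only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two-queue comparison of random routing with the index rule
# (x_i + 1) * sqrt(a_i / mu_i).  Parameter values are invented.
a = np.array([1.0, 4.0])     # holding cost rates a_i
mu = np.array([2.0, 3.0])    # service rates mu_i
lam = 2.0                    # total arrival rate

def simulate(policy, T=100000):
    """Average holding cost per step of the uniformised chain."""
    x = np.zeros(2)
    rate = lam + mu.sum()              # uniformisation constant
    cost = 0.0
    for _ in range(T):
        cost += a @ x
        u = rng.uniform(0.0, rate)
        if u < lam:                    # an arrival: route it
            x[policy(x)] += 1
        elif u < lam + mu[0]:          # potential completion at queue 0
            x[0] = max(x[0] - 1, 0)
        else:                          # potential completion at queue 1
            x[1] = max(x[1] - 1, 0)
    return cost / T

random_policy = lambda x: rng.integers(2)                        # equal split
index_policy = lambda x: int(np.argmin((x + 1) * np.sqrt(a / mu)))

avg_random = simulate(random_policy)
avg_index = simulate(index_policy)
print("random split:", avg_random, " index rule:", avg_index)
```

Since the uniformised steps have state-independent mean duration, the per-step averages are proportional to time averages, and the state-dependent index rule comes out well below the blind split.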
6 ALLOCATION OF SERVER EFFORT BETWEEN QUEUES

We could have considered the problem of the last section under the assumption that it is server effort rather than customer arrivals which can be switched between queues. The problem could be rephrased in the form in which it often occurs: that there is a single queue to which customers of different types (indexed by i) arrive, and such that any customer can be chosen from the queue for the next service. Customers of different types may take different times to serve, so it is as well to make a distinction between service effort and service rate. We shall suppose that if one puts service effort σ_i into serving customers of type i (i.e. the equivalent of σ_i servers working at a standard rate) then the intensity of service completion is the service rate σ_iμ_i. One may suppose that a customer of type i has an exponentially distributed 'service requirement' of expectation μ_i⁻¹, and that this is worked off at rate σ_i if service effort σ_i is applied.
As in the last section, we can begin by optimising the fixed service-allocation policy, which one will do by minimising the expression

   γ = Σ_i γ_i = Σ_i a_i λ_i/(σ_iμ_i − λ_i)

with respect to the service efforts subject to the constraint

   Σ_i σ_i = s,   (22)

on total service effort. The optimal allocation is

   σ_i = μ_i⁻¹[λ_i + √(θ a_i λ_i μ_i)],   (23)

where θ is a Lagrange multiplier, chosen to achieve equality in (22). An application of policy improvement leads us, by an argument analogous to that of the previous section, to the conclusion that all service effort should be allocated to that non-empty queue i for which μ_i[f_i(x_i − 1) − f_i(x_i)] is minimal; i.e. for which

   v_i(x_i) = μ_i[f_i(x_i) − f_i(x_i − 1)] = a_iμ_ix_i/(σ_iμ_i − λ_i)   (24)
is maximal. If the efforts σ_i have been given their optimal values (23) then the rule is: all service effort should be concentrated on that non-empty customer class i for which x_i√(a_iμ_i/λ_i) is maximal. It is reasonable that one should direct effort towards queues whose size x_i or unit cost a_i is large, or for which the response μ_i to service effort is good. However, again it seems paradoxical that a large arrival rate λ_i should work in the opposite direction. The explanation is analogous to that of the previous section: this arrival rate is already taken account of in the queue size x_i itself, and the deflation of x_i by a factor √λ_i is a reflection of the fact that one expects a busy queue to be larger, even under an optimal policy.

Of course, the notion that service effort can be switched wholly and instantaneously is unrealistic, and a policy that took account of switching costs could not be a pure index policy. Suppose that to switch an amount of service effort u from one queue to another costs c|u|. Suppose that one application of policy improvement to a policy of fixed allocation {σ_i} of service effort will modify this to {σ̃_i}. Then the σ̃_i will minimise

   Σ_i [ ½c|σ̃_i − σ_i| − σ̃_i v_i ]   (25)

subject to

   Σ_i σ̃_i = s   (26)
and non-negativity. Here v_i = v_i(x_i) is the index defined in (24) and the factor ½ occurs because the first sum in (25) effectively counts each transfer of effort twice. If we take account of constraint (26) by a Lagrange multiplier θ then the differential of the Lagrangian form L is

   ∂L/∂σ̃_i = L_{i+} := ½c − v_i − θ   (σ̃_i > σ_i),
   ∂L/∂σ̃_i = L_{i−} := −½c − v_i − θ   (σ̃_i < σ_i).

We must thus have σ̃_i equal to σ_i, not less than σ_i, not greater than σ_i or equal to zero according as L_{i−} < 0 < L_{i+}, L_{i+} = 0, L_{i−} = 0 or L_{i−} > 0. This leads us to the improved policy. Let 𝔄₊ be the set of i for which v_i is maximal and 𝔄₋ the set of i (possibly empty) for which

   v_i < max_j v_j − c.   (27)
Then all serving effort for members of 𝔄₋ should be transferred to members of 𝔄₊. The definitions of the v_i will then of course be updated by substitution of the new values of service allocation. Such a reallocation of effort will occur whenever discrepancies between queues become so great that (27) holds for some i. The larger c, the less frequently will this occur.
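The reallocation rule can be sketched as a function: it computes the indices (24), forms the sets 𝔄₊ and 𝔄₋, and transfers the effort of 𝔄₋ to 𝔄₊. Splitting the transferred effort equally over 𝔄₊ when that set has several members is a choice made here for definiteness, and all numerical values are invented.

```python
import numpy as np

# Sketch of the switching rule: effort moves only from queues whose index
# falls more than c below the maximal index.
def reallocate(x, sigma, a, mu, lam, c):
    v = a * mu * x / (sigma * mu - lam)      # indices (24); needs sigma*mu > lam
    vmax = v.max()
    A_plus = np.flatnonzero(v == vmax)       # receivers of effort
    A_minus = np.flatnonzero(v < vmax - c)   # donors, per (27)
    moved = sigma[A_minus].sum()
    sigma = sigma.copy()
    sigma[A_minus] = 0.0
    sigma[A_plus] += moved / len(A_plus)     # transferred effort, split equally
    return sigma, A_plus, A_minus

a = np.array([1.0, 2.0, 1.5])
mu = np.array([2.0, 2.0, 3.0])
lam = np.array([0.5, 0.6, 0.7])
sigma = np.array([1.0, 1.0, 1.0])
x = np.array([1.0, 6.0, 1.0])                # queue 1 has built up

new_sigma, A_plus, A_minus = reallocate(x, sigma, a, mu, lam, c=2.0)
print(new_sigma, A_plus, A_minus)
```

With these numbers the index of queue 1 dominates by more than c, so the effort of queues 0 and 2 is switched wholly to queue 1.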
7 REWARDED SERVICE RATHER THAN PENALISED WAITING

Suppose the model of the last section is modified in that a reward r_i is earned for every customer of type i whose service is completed, no cost being levied for waiting customers. This is then a completely different situation, in that queues can be allowed to build up indefinitely without penalty. It has long been known (Cox and Smith, 1961) that the optimal policy under these circumstances is to serve a customer of that type i for which r_iμ_i is maximal among those customers present in the queue. This is intuitively right, and in fact holds for service requirements of general distribution. We shall see that policy improvement leads to what is effectively just this policy.

For a queue of a single type served at unit rate we have the dynamic programming equation in the undiscounted case

   γ = λΔ(x + 1) + μ(x)[r − Δ(x)],   (28)

where x is queue size, λ is arrival rate, μ(x) equals the completion rate μ if x is positive and is otherwise zero, r is the reward and Δ(x) is the increment f(x) − f(x − 1) in transient cost. Equation (28) has the general solution

   Δ(x) = (γ − μr)/(λ − μ) + C(μ/λ)^(x−1)   (29)

for some constant C. If μ > λ then finiteness implies that the second term must be absent in (29), whence we deduce that
   Δ(x) = r.   (30)

This is the situation in which it is the arrival rate which limits the rate at which reward can be earned. In the other case, μ < λ, it is the completion rate which is the limiting factor; the queue builds up and we have

   Δ(x) = r(μ/λ)^x.   (31)
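The solutions (30) and (31) can be checked directly against equation (28), with γ = λr in the arrival-limited case and γ = μr in the completion-limited case; a small sketch with invented parameter values:

```python
# Check of (30)-(31) against the dynamic programming equation (28):
#   gamma = lam * Delta(x+1) + mu(x) * (r - Delta(x)),
# where mu(x) = mu for x > 0 and mu(0) = 0.
r = 2.0

def check(lam, mu):
    if mu > lam:                       # arrival-limited case: (30)
        gamma, Delta = lam * r, lambda x: r
    else:                              # completion-limited case: (31)
        gamma, Delta = mu * r, lambda x: r * (mu / lam) ** x
    mu_x = lambda x: mu if x > 0 else 0.0
    for x in range(40):
        lhs = lam * Delta(x + 1) + mu_x(x) * (r - Delta(x))
        assert abs(lhs - gamma) < 1e-9

check(lam=1.0, mu=2.5)                 # mu > lam
check(lam=2.5, mu=1.0)                 # mu < lam
print("(30) and (31) satisfy (28), including the boundary at x = 0")
```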
The total reward rate for a queue of several types under a fixed allocation {σ_i} of service effort is then

   γ = Σ_i r_i min[λ_i, σ_iμ_i].

If we rank types in order of decreasing r_iμ_i then an optimal fixed allocation is that for which σ_i = λ_i/μ_i for i = 1, 2, …, m, where m is as large as possible consistent with (22); one then allocates any remaining service effort to type m + 1 and zero service effort to the remaining types.

Now consider a policy improvement step. We should direct the next service to a customer of a type i which is present in the queue and for which i maximises μ_i[r_i − Δ_i(x_i)]. It follows from solutions (30) and (31) that this expression is zero for i = 1, 2, …, m and equal to r_iμ_i[1 − (σ_iμ_i/λ_i)^(x_i)] for i > m. That is, one will serve any of the first m types if one can. If none of these are present, then in effect one will serve the customer type present which maximises r_iμ_i, because the fact that the traffic intensity for queue m + 1 exceeds unity means that x_{m+1} will be infinite in equilibrium. It may seem strange that the order in which one serves members of the m queues of highest effective value r_iμ_i appears immaterial. The point is that all arrivals of these types will be served ultimately in any case. If there were any discounting at all then one would, of course, always choose the type of highest value among those present for first service.

8 CALL ROUTING IN LOSS NETWORKS

Consider a network of telephone exchanges, with the nodes of the network (the exchanges) indexed by a variable j = 1, 2, .... Suppose that there are m_jk lines ('trunks') on the directed link from exchange j to exchange k, of which x_jk are busy at a given time. One might think then that the vector x = {x_jk} of these occupation numbers would adequately describe the state of the system, but we shall see that this is not quite so. Calls arrive for a jk-connection in a Poisson stream of rate λ_jk, these streams being supposed independent. Such calls, once established, terminate with probability intensity μ_jk. When a call arrives for a jk-connection, it need not be established on a direct jk link.
There may be no free trunks on this link, in which case one can either look for an alternative indirect routing (of which there may be many) or simply not
accept the call. In this latter case we assume that the call is simply lost: no queueing facility is provided, and the caller is assumed to drop back into the population, resigned to disappointment.

We see that a full description of the state of the network must indicate how many calls are in progress on each possible route. Let n_r be the number of calls in progress on route r. Denote the vector with elements n_r by n and let n + e_r denote the same vector with n_r increased by one. Let us suppose that the establishment of a jk-connection brings in revenue w_jk, and that one seeks a routing policy which maximises the average expected revenue. The dynamic programming equation would then be

   γ = Σ_j Σ_k λ_jk max{0, w_jk + max_r [f(n + e_r) − f(n)]} + Σ_r n_r μ_r [f(n − e_r) − f(n)],   (32)
where γ and f indicate average reward and transient reward respectively. Here the r-maximisation in the first sum is over feasible routes which establish a jk-connection. The zero option in this sum corresponds to rejection of the call on the grounds that connection is either impossible or uneconomic. (The difference f(n) − f(n + e_r) can be regarded as the implied cost of establishing an incoming call along route r. If this exceeds w_jk then the connection uses capacity which could be more profitably used elsewhere. We can take as a convention that this cost is infinite if the route is infeasible, i.e. requires non-existent capacity.) In the second sum μ_r is taken to equal μ_jk if route r begins in j and ends in k. The term indicated is included in the sum only if n − e_r ≥ 0; i.e. if there is indeed at least one call established on route r.

Solution of this equation seems hopeless. However, the value function can be determined for the simple policy in which one uses only direct routes, accepting a call for this route if and only if there is a free trunk. We shall then apply one stage of policy improvement. The average and transient rewards on one such link (for which we shall drop the jk subscripts) are determined by

   γ = λ[w + f(x + 1) − f(x)] + μx[f(x − 1) − f(x)]   (0 < x < m).   (33)
For x = 0 this equation holds with the term in μ missing; for x = m it holds with the term in λ missing. Let us define the quantity

   Δ(x) = f(x) − f(x + 1),

which can be interpreted as the cost of accepting an incoming call if x trunks are currently busy. One finds then the elegant solution

   Δ(x) = w B(m, θ)/B(x, θ)   (0 ≤ x < m);   γ = λw[1 − B(m, θ)],
where θ = λ/μ is the traffic intensity on the link, and B(m, θ) is the Erlang function

   B(m, θ) = (θ^m/m!) / ( Σ_{x=0}^{m} θ^x/x! ).

This is the probability that all trunks are busy, so that an incoming call cannot be accepted: the blocking probability. The formula for γ thus makes sense. Returning to the full network, define Δ_jk(x_jk) as this implied acceptance cost for link jk, evaluated at its current occupation number x_jk.
Then we see from (32) that one stage of policy improvement yields the revised policy: if a call comes in, assign it the feasible route for which the sum of the costs Δ along the route is minimal, provided that there is such a feasible route and that this minimal cost does not exceed w_jk. In other cases, the call must or should be rejected.

The policy seems sensible, and is attractive in that the effective cost of a route is just the sum of individual components Δ. For this latter reason the routing policy is termed separable. Separability ignores the interactions between links, and to that extent misses the next level of sophistication one would wish to reach. The policy tends to assign indirect routings more readily than does a more sophisticated policy.

This analysis is taken, with minor modifications, from Krishnan and Ott (1986). Later work has taken the view that, for a large system, policies should be decentralised, assume very limited knowledge of system state and demand little in the way of real-time calculation. One might expect that performance would fall well below the full-information optimum under such constraints, but it seems, quite remarkably, that this need not be the case. Gibbens et al. (1988, 1995) have proposed the dynamic alternative routing policy under which, if a one-step route is not available, two-step routes are tested at random, the call being rejected if this search is not quickly successful. One sees how little information or processing is required, and yet performance has been shown to be close to optimal.
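The per-link quantities above are easy to compute: the sketch below uses the standard recursion for the Erlang function, B(x, θ) = θB(x−1, θ)/(x + θB(x−1, θ)), and then forms γ and the implied acceptance costs Δ(x). Parameter values are invented.

```python
# Link quantities for the direct-routing policy:
# the Erlang function B(x, theta) via its standard recursion,
# the average revenue gamma = lam * w * (1 - B(m, theta)), and the
# implied acceptance costs Delta(x) = w * B(m, theta) / B(x, theta).
def link_quantities(m, lam, mu, w):
    theta = lam / mu
    B = [1.0]                                  # B(0, theta) = 1
    for x in range(1, m + 1):
        B.append(theta * B[-1] / (x + theta * B[-1]))
    gamma = lam * w * (1 - B[m])
    Delta = [w * B[m] / B[x] for x in range(m)]
    return gamma, Delta

gamma, Delta = link_quantities(m=5, lam=3.0, mu=1.0, w=1.0)
print("average revenue:", gamma)
print("acceptance costs:", Delta)
```

Since B(x, θ) decreases in x, the cost Δ(x) increases with occupancy: accepting a call becomes dearer as the link fills, which is just the behaviour the separable routing rule exploits.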
CHAPTER 12
The LQG Model with Imperfect Observation

1 AN INDICATION OF CONCLUSIONS

When we introduced stochastic state structure in Section 8.3 we assumed that the current value of the state variable was observable, a property generally referred to as 'perfect observation'. Note that this is not the same as 'complete information', which describes the situation for which the whole future course of the process is predictable for a given policy. If the model is state-structured but imperfectly observed (i.e. if the current value of state is imperfectly observable) then the simple recursive treatment of Section 8.3 fails. We shall see in Chapter 15 that, in this case, an effective state variable is supplied by the so-called information state: the distribution of x_t conditional on W_t. This means that the state space 𝒳 with which we have to work has perforce been expanded to the space of distributions on 𝒳, a tremendous increase in dimensionality. However, LQG processes show a great simplification, in that for them these conditional distributions are always Gaussian, and so are parametrised by the conditional mean value x̂_t and the conditional covariance matrix V_t. Indeed, matters are even simpler, in that V_t turns out to develop in time by deterministic and policy-independent rules. The only point at which we appeal to the observed course of the process is then in the calculation of x̂_t. One can regard x̂_t as an estimate of the current state x_t based on current information W_t. It is interesting that the formulation of a control rule forces one to estimate unobservables; even more interesting that the optimisation of this rule implies criteria for the optimisation of estimation.

LQG processes have already been defined by the properties of linear relations between variables, quadratic costs and Gaussian noise. We shall come shortly to a definition which expresses their essence even more briefly and exactly.
If we add the feature of imperfect observation to the LQG regulation problem of Section 10.1 then we obtain what one might regard as the prototype imperfectly observed state-structured LQG process in discrete time. For this the plant and observation relations take the form

   x_t = Ax_{t−1} + Bu_{t−1} + ε_t,   (1)

   y_t = Cx_{t−1} + η_t,   (2)
where y_t is the observation which becomes available at time t. We suppose that the process noise ε and the observation noise η jointly constitute a Gaussian white noise process with zero mean and with covariance matrix (3).
Further, we retain the instantaneous cost function

   c(x, u) = ½(x′Rx + 2u′Sx + u′Qu)   (4)
of Section 9.1. If we can treat this model then treatment of more general cases (e.g. incorporating tracking of a reference signal) will follow readily enough.

We shall see that there are two principal and striking conclusions. One is that, if u_t = K_t x_t is the optimal control in the case of perfect state information, then the optimal control in the imperfect-observation case is simply u_t = K_t x̂_t. This is a manifestation of what is termed the certainty equivalence principle (CEP). The CEP states, roughly, that one should proceed by replacing any unobservable by its current estimate, and then behaving as if one were in the perfect-observation case. It turns out to be a key concept, not limited to LQG models. On the other hand, it cannot hold for models in which policy affects the information which is gained. The other useful conclusion is the recursion for the estimate x̂_t
   x̂_t = Ax̂_{t−1} + Bu_{t−1} + H_t(y_t − Cx̂_{t−1}),   (5)

known as the Kalman filter. This might be said to be the plant equation for the effective state variable x̂_t; it takes the form of the original plant equation (1), but, instead of being driven by plant noise ε_t, is driven by the innovation y_t − Cx̂_{t−1}. The innovation is just the deviation of the observation y_t from the value E(y_t | W_{t−1}) that one would have predicted for it at time t − 1; hence the name. The matrix H_t is calculated by rules depending on V_{t−1}.

The Kalman filter provides the natural computational tool for the real-time determination of state estimates, a computation which would be realised by either a computational or an analogue model of the plant.

Finally, LQG structure has really nothing to do with state structure, and essential ideas are indeed obscured if one treats the state-structured case alone. Suppose that X, U and Y denote process, control and observation realisations over the complete course of the process. The cost function C will then be a function C(X, U). Suppose that the probability density of X and Y for a control sequence U announced in advance is
   f(X, Y; U) = e^{−𝔻(X, Y; U)}   (6)

(so that U is a parametrising variable). This must be a density relative to an appropriate measure; we return to this point below. We shall term 𝔻
the discrepancy, since it increases with increasing improbability of plant/observation realisations X and Y for given U. The two functions C and 𝔻 of the variables indicated characterise the cost and stochastic structure of the problem. One can say that the problem is LQG if both the cost function C and the discrepancy 𝔻 are quadratic in their arguments when the density (6) is relative to Lebesgue measure. This characterisation indeed implies, very economically, that dynamic relations are linear, costs quadratic and noise Gaussian. It also implies that policy cannot affect information, in that the only effect of controls on observations is to add on a term which is linear in known controls, and so can be corrected out. It implies in addition that the variables take values freely in a vector space (since the density is relative to Lebesgue measure) and that the stopping rule is specification of a horizon point h (since histories X, Y, U are taken over a prescribed time interval).

It can be asserted quite generally (i.e. independently of LQG ideas) that the two quantities C and 𝔻 between them characterise the model. One might say that both C and 𝔻 should be small (relative to what they could be), in that one wishes to make C small and expects 𝔻 to be small.

One can define the discrepancy for any random variable; i.e. 𝔻(x) for x alone or 𝔻(x|y) for x conditioned by y. The fact that the discrepancy is the negative logarithm of a density relative to a measure which may be normalised in different ways means that it is determined only up to an additive constant. We shall assume it so normalised that

   inf_x 𝔻(x) = 0.
So, if x is a vector normally distributed with mean μ and covariance matrix V, then 𝔻(x) is just the quadratic form ½(x − μ)′V⁻¹(x − μ). Note the general validity of formulae such as
   𝔻(x, y) = 𝔻(y) + 𝔻(x|y).
It is often asked why the observation relation is taken in the form (2), with y_t essentially being an observation on the immediately previous state rather than on the current state. The weak answer is that things work out nicely that way. Probably the right answer is that, if one regards (x_t, y_t) as a joint state variable, then y_t is a function of current state, uncorrupted by noise. This is an alternative expression of imperfect observation which has a good deal to recommend it: that one observes some aspect of current state without error.
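A one-dimensional sketch of the filter (5) coupled with a certainty-equivalence control illustrates the conclusions of this section. It assumes uncorrelated plant and observation noise; the gain H_t and the covariance recursion are worked out for the convention (2), in which y_t observes x_{t−1}. All numerical values, and the control coefficient −0.5, are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Scalar model (1)-(2): x_t = A x_{t-1} + B u_{t-1} + eps_t,
#                        y_t = C x_{t-1} + eta_t,
# with var(eps) = N and var(eta) = M, eps and eta independent.
A, B, C = 0.9, 1.0, 1.0
N, M = 0.25, 1.0

x, xhat, V = 0.0, 0.0, 1.0
for t in range(200):
    u = -0.5 * xhat                    # certainty-equivalence control on xhat
    x_prev = x
    x = A * x + B * u + rng.normal(scale=np.sqrt(N))
    y = C * x_prev + rng.normal(scale=np.sqrt(M))
    # y_t bears on x_{t-1}: update that estimate, then predict forward.
    K = V * C / (C * V * C + M)        # gain for the x_{t-1} update
    H = A * K                          # the gain H_t appearing in (5)
    xhat = A * xhat + B * u + H * (y - C * xhat)
    V = A * (V - K * C * V) * A + N    # deterministic, policy-independent V_t

print("final estimation error:", x - xhat, " steady-state V:", V)
```

Note that the recursion for V never consults the observations or the controls, in line with the remark that V_t develops by deterministic and policy-independent rules; only x̂_t depends on the observed course of the process.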
2 LQG STRUCTURE AND IMPERFECT OBSERVATION

The treatment of imperfectly observed LQG models involves a whole train of ideas, which authors order in different ways as they seek the right development.
We shall start from what is surely the most economical characterisation of LQG structure: the assumed quadratic character of the cost C(X, U) and the discrepancy 𝔻(X, Y; U) as functions of their arguments. We also regard the treatment as constrained by two considerations: it should not appeal to state structure and it should generalise naturally to what is a natural extension of LQG structure: the risk-sensitive models of Chapters 16 and 21. The consequent treatment is indeed a very economical and direct one, completed essentially in this section and the next. Sections 6 to 8 are added for completeness: to express the results already obtained in the traditional vocabulary of linear least square estimates, innovations, etc. Section 9 introduces the dual variables, in terms of which the duality of optimal estimation to optimal control finds its natural statement.

Note first some general points. The fact that we have written the cost C as C(X, U) implies that the process has been defined generally enough that the pair (X_t, U_t) includes all arguments entering the cost function C up to time t (such as values of reference signals as well as of the plant itself). The dynamics of the process and observations are specified by the probability law P(X, Y; U), which is subject to some natural constraints. We have
$$P(X, Y; U) = \prod_{t=0}^{h-1} P(x_{t+1}, y_{t+1} \mid X_t, Y_t; U) = \prod_{t=0}^{h-1} P(x_{t+1}, y_{t+1} \mid X_t, Y_t; U_t), \qquad (7)$$
the second equality following from the basic condition of causality; equation (A2.4). Further
$$D(x_{t+1}, y_{t+1} \mid X_t, Y_t; U_t) = D(x_{t+1} \mid X_t, Y_t; U_t) + D(y_{t+1} \mid X_{t+1}, Y_t; U_t) = D(x_{t+1} \mid X_t; U_t) + D(y_{t+1} \mid X_{t+1}, Y_t; U_t), \qquad (8)$$
the second equality expressing the fact that plant is autonomous and observation subsidiary to it, in that the pair $(X_t, Y_t)$ is no more informative for the prediction of $x_{t+1}$ than is $X_t$ alone. Relations (7) and (8) then imply
Theorem 12.2.1 The discrepancy has the additive decomposition
$$D(X, Y; U) = \sum_{t=0}^{h-1} \left[ D(x_{t+1} \mid X_t; U_t) + D(y_{t+1} \mid X_{t+1}, Y_t; U_t) \right]. \qquad (9)$$
Consider now the LQG case, when all expressions in (9) are quadratic in their arguments. The dynamics of the process itself are specified by
$$D(X; U) = \sum_{t=0}^{h-1} D(x_{t+1} \mid X_t; U_t) \qquad (10)$$
and the conditional discrepancies will have the specific form
$$D(x_{t+1} \mid X_t; U_t) = \tfrac{1}{2}\,(x_{t+1} - d_{t+1} - A_t X_t - B_t U_t)^{\mathrm T} N_t^{-1} (x_{t+1} - d_{t+1} - A_t X_t - B_t U_t) \qquad (11)$$
for some vector $d$ and matrices $A$, $B$, $N$, in general time-dependent. Relations (10) and (11) between them imply a stochastic plant equation
$$x_{t+1} = A_t X_t + B_t U_t + d_{t+1} + \epsilon_{t+1}$$
$(\tau > t)$ have been substituted the values which minimise $D(X; U)$. The value of $u_t$ thus determined, $u_t^{\mathrm{det}}(X_t, U_{t-1})$, is the optimal value of $u_t$ for the deterministic process in closed-loop form.

Proof. All that we have done is to use the deterministic version of (12) to express future process variables $x_\tau$ $(\tau > t)$ in the cost function in terms of control variables $U$ and current process history $X_t$, and then minimise the consequent expression for $C$ with respect to the as yet undetermined controls $u_\tau$ $(\tau \geq t)$. However, the point of the theorem is that this future course of the deterministic process is determined (for given $U$) by minimisation of $D(X; U)$ with respect to these future variables. □

Otherwise expressed, the 'deterministic future path' is exactly the most probable future path for given $X_t$ and $U$. In optimising the 'deterministic process' we have supposed current plant history $X_t$ known; a supposition we now drop.
Exercises and comments

(1) Note that state structure would be expressed, as far as plant dynamics go, by
$$P(X; U) = \prod_t P(x_{t+1} \mid x_t; u_t).$$

(2) Relation (12) should indeed be seen as a canonical plant equation rather than necessarily the 'physical' plant equation. The physical plant equation might have the form
$$x_{t+1} = A_t X_t + B_t U_t + d_{t+1} + \epsilon^*_{t+1},$$
where the plant noise $\epsilon^*$ is autocorrelated. However, we can substitute $\epsilon^*_{t+1} = E(\epsilon^*_{t+1} \mid X_t, U_t) + \epsilon_{t+1}$. This standardises the equation to the form (12), since the expectation is linear in the conditioning variables and $\epsilon$ has the required orthogonality properties. The deterministic forms of the two sets of equations will be equivalent, one being derivable from the other by linear operations.
3 THE CERTAINTY EQUIVALENCE PRINCIPLE

When we later develop the ideas of projection estimates and innovations then the certainty equivalence principle is rather easily proved in a version which immediately generalises; see Exercise 7.2. Many readers may find this sufficient. However, we find it economical to give a version which does not presume this apparatus, does not assume state structure and which holds for a more general optimisation criterion.

The LQG criterion is that one chooses a policy $\pi$ to minimise the expected cost $E_\pi(C)$. However, it is actually simpler and more economical for present purposes to consider the rather more general criterion: that $\pi$ should maximise $E_\pi[e^{-\theta C}]$ for prescribed positive $\theta$. Since this second criterion function is $1 - \theta E_\pi(C) + o(\theta)$ for small $\theta$, we see that the two criteria agree in the limit of small $\theta$. For convenience we shall refer to the two criteria as the LQG criterion and the LEQG criterion respectively, LEQG being an established term in which the EQ stands for 'exponential of quadratic'. The move to the LEQG criterion induces a measure of 'risk-sensitivity': of regard for the variability of $C$ as well as its expectation. We shall see in Chapters 16 and 21 that LQG theory has a complete LEQG analogue. Indeed, LQG theory appears as almost a degenerate case of LEQG theory, and it is that fact which we exploit in this section: LEQG methods provide the insightful and economical treatment of the LQG case.

We wish to set up a dynamic programming equation for the LEQG model. If we define the somewhat transformed total value function $G(W_t)$ by
$$e^{G(W_t)} = f(Y_t)\, \sup_\pi E_\pi[e^{-\theta C} \mid W_t],$$
where $f$ is the Gaussian probability density, then the dynamic programming equation takes the convenient form
The representation
$$\tfrac{1}{2}\, x^{\mathrm T} V^{-1} x = \sup_\lambda\,[\lambda^{\mathrm T} x - \tfrac{1}{2}\lambda^{\mathrm T} V \lambda]$$
for non-negative definite $V$ is fundamental, and valid even if $V$ is singular (in that the evaluation is $+\infty$
unless $x$ lies in the orthogonal complement of the null space of $V$). It gives us the evaluation of the discrepancy
$$D(x, y) = \sup_{\lambda, \mu} \left( \lambda^{\mathrm T} x + \mu^{\mathrm T} y - \tfrac{1}{2} \begin{bmatrix} \lambda \\ \mu \end{bmatrix}^{\mathrm T} \begin{bmatrix} V_{xx} & V_{xy} \\ V_{yx} & V_{yy} \end{bmatrix} \begin{bmatrix} \lambda \\ \mu \end{bmatrix} \right).$$
The extremising values of $\lambda$ and $\mu$ are the differentials of $D$ with respect to $x$ and $y$. If $x$ has the discrepancy-minimising value $\hat x$ then $\lambda$ must be zero, so the extremal equations with respect to $\lambda$ and $\mu$ become $\hat x = V_{xy}\mu$ and $y = V_{yy}\mu$. This amounts just to equation (35). The analogues of $\lambda$ and $\mu$ in the dynamic context are just the Lagrange multipliers associated with plant and observation equations, and it is by their introduction that a direct trajectory optimisation finds its natural completion (see Section 9), as it did already in the case of perfect observation (Section 6.3).

7 INNOVATIONS

The previous section was concerned with a one-stage situation. In the temporal context we shall have a multi-stage situation, in which at every instant of time $t$ $(= 0, 1, 2, \ldots)$ one receives a new vector observation $y_t$. The observation history at time $t$ is then $Y_t = \{y_\tau;\ 0 \leq \tau \leq t\}$. Define the innovation $\zeta_t$ at time $t$ by
$$\zeta_0 = y_0 - E(y_0), \qquad \zeta_t = y_t - \mathscr{E}(y_t \mid Y_{t-1}) \quad (t = 1, 2, 3, \ldots). \qquad (40)$$
The innovation is thus the deviation of the actual observation $y_t$ at time $t$ from the projection forecast one would have made of it the moment before, at time $t - 1$. It thus represents the 'new' information gained at time $t$, in that it is that part of $y_t$ which could not have been predicted from earlier information.

Theorem 12.7.1 The innovations are mutually uncorrelated.

Proof. Certainly $\zeta_t \perp Y_{t-1}$. It follows then that $\zeta_t \perp \zeta_\tau$ for $\tau < t$, because $\zeta_\tau$ is a linear function of $Y_{t-1}$. □
Now define
$$x^{(t)} = \mathscr{E}(x \mid Y_t),$$
the projection estimate of a given random vector $x$ based on information at time $t$. We shall use this superscript notation consistently from now on, to denote the optimal determination (of the quantity to which the superscript is applied, and under a criterion of optimality which may vary) based on information at time $t$. It follows, by appeal to (39), that
$$x^{(t)} = \mathscr{E}(x \mid Y_{t-1}) + \mathscr{E}(x \mid \zeta_t) = x^{(t-1)} + H_t \zeta_t \qquad (41)$$
where the matrix $H_t$ in fact has the determination
$$H_t = \mathrm{cov}(x, \zeta_t)\,[\mathrm{cov}(\zeta_t)]^{-1}. \qquad (42)$$
Equation (41) shows that the estimate of $x$ based on $Y_{t-1}$ can be updated to the estimate based on $Y_t$ simply by the addition of a term linear in the last innovation. This is elegant conceptually, and turns out to provide the natural algorithm in the dynamic context. Equation (37) of course implies further that
$$x^{(t)} = \sum_{\tau=0}^{t} \mathscr{E}(x \mid \zeta_\tau) = \sum_{\tau=0}^{t} H_\tau \zeta_\tau,$$
which is nothing but the development of a projection in terms of an orthogonal basis. However, the innovations are more than just an orthogonalisation of the sequence $\{y_t\}$; the fact that the orthogonalisation is achieved by the time-ordered rule (40) means that $\zeta_t$ indeed has the character expressed in the term 'innovation'. The innovations themselves are calculated by the forward recursion
$$\zeta_t = y_t - \sum_{\tau=0}^{t-1} \mathscr{E}(y_t \mid \zeta_\tau). \qquad (43)$$
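The recursion (43) can be realised at the level of covariances. In the sketch below (our own construction, not from the text) the observations are jointly Gaussian and zero-mean with a given covariance matrix `S`; each projection of $y_t$ on an earlier innovation is computed from `S`, each innovation is tracked as a linear combination of the $y$'s, and the mutual uncorrelatedness of the innovations is confirmed by checking that their covariance matrix is diagonal.

```python
import numpy as np

# Sketch (names and data ours): realise recursion (43) at the covariance level.
# Each innovation zeta_t is a linear combination of y_0..y_t; the projection of
# y_t on zeta_tau has coefficient cov(y_t, zeta_tau)/var(zeta_tau), computed
# from the joint covariance S of the zero-mean scalar observations y_0..y_4.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
S = A @ A.T + 5 * np.eye(5)        # joint covariance of y_0..y_4

n = S.shape[0]
Z = np.zeros((n, n))               # row t: coefficients of zeta_t in y_0..y_t
for t in range(n):
    coef = np.zeros(n); coef[t] = 1.0          # start from y_t itself
    for tau in range(t):
        z = Z[tau]
        h = (S @ coef) @ z / (z @ S @ z)       # cov(., zeta_tau)/var(zeta_tau)
        coef -= h * z                          # subtract projection on zeta_tau
    Z[t] = coef

C = Z @ S @ Z.T                    # covariance matrix of the innovations
off = C - np.diag(np.diag(C))
print(np.allclose(off, 0))         # True: innovations mutually uncorrelated
```

This is of course just Gram-Schmidt orthogonalisation in the inner product defined by `S`, which is exactly the point made in the text: the innovations are a time-ordered orthogonalisation of the observation sequence.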
Alternatively, we can take the dual point of view, and calculate them by minimising the discrepancy backwards in time rather than by doing least square calculations forward in time.

Theorem 12.7.2 The value of $y_t$ minimising $D(Y_t)$ is its projection forecast from earlier observations. That is, by repeated appeal to the autoregressive relation, future $\epsilon$ are set equal to zero. This is indeed an immediate consequence of the fact that $\mathscr{E}(\epsilon_\tau \mid X_t) = 0$ for $\tau > t$, which is itself a consequence of the white-noise assumptions.
5 OPTIMISATION CRITERIA IN THE STEADY STATE

Optimisation criteria will often have a compact expression in terms of spectral densities in the stationary regime (i.e. the steady-state limit). Whether one best goes about steady-state optimisation by a direct minimisation of these expressions with respect to policy is another matter, but the expressions are certainly interesting and have operational significance.

Consider first the LQG criterion. Let us write an instantaneous cost function such as (2.23) in the 'system' form
$$c = \tfrac{1}{2}\, \Delta^{\mathrm T} \mathfrak{R}\, \Delta. \qquad (22)$$
Here $\Delta$ is the vector of 'deviations' which one wishes to penalise (having, for example, $x - r$ and $u - u^r$ as components) and $\mathfrak{R}$ is the associated matrix (which would be just
$$\begin{bmatrix} R & 0 \\ 0 & Q \end{bmatrix}$$
in the case (2.23)). Suppose that $\{\Delta_t\}$ has autocovariance generating function $g(z)$ in the stationary regime under a given stabilising policy $\pi$, and corresponding spectral density $f(\omega)$. Then the aim is to choose $\pi$ to minimise the average cost
$$\gamma = E[\tfrac{1}{2}\Delta^{\mathrm T} \mathfrak{R} \Delta] = \tfrac{1}{2}\, E\{\mathrm{tr}[\mathfrak{R}\, \Delta\Delta^{\mathrm T}]\} = \tfrac{1}{2}\, \mathrm{tr}[\mathfrak{R}\, \mathrm{cov}(\Delta)]. \qquad (23)$$
Here $E$ denotes an expectation under the stationary regime for policy $\pi$ and $\mathrm{tr}(P)$ denotes, as ever, the trace of the matrix $P$. For economy of notation we have not explicitly indicated the dependence of $\gamma$, $E$ and $g(z)$ upon $\pi$. Appealing to the formula
$$\mathrm{cov}(\Delta) = \frac{1}{2\pi} \int f(\omega)\, \mathrm d\omega,$$
we thus deduce
Theorem 13.5.1 The criterion (23) can be expressed
$$\gamma = E[\tfrac{1}{2}\Delta^{\mathrm T} \mathfrak{R} \Delta] = \frac{1}{4\pi} \int \mathrm{tr}[\mathfrak{R}\, f(\omega)]\, \mathrm d\omega \qquad (24)$$
where $f(\omega)$ is the spectral density function of the $\Delta$-process under the stationary regime for policy $\pi$, and the integral is over the real-frequency interval $[-\pi, \pi]$ in the case of discrete time and over the whole real line in the case of continuous time.
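Formula (24) is easy to check numerically in a scalar case. The sketch below is our own example (an AR(1) deviation process, not tied to any particular model in the text): it compares the spectral evaluation of the average cost with the direct computation $\tfrac12 R\,\mathrm{var}(\Delta)$.

```python
import numpy as np

# Numerical check of (24) in the simplest scalar case (our example, not the
# book's): Delta_t = a Delta_{t-1} + e_t is a stationary AR(1) process with
# noise variance s2, penalised by instantaneous cost (1/2) R Delta^2.  Then
# gamma = (1/2) R var(Delta) should equal (1/(4 pi)) int R f(w) dw over [-pi, pi].
a, s2, R = 0.7, 1.0, 2.0

w = np.linspace(-np.pi, np.pi, 200000, endpoint=False)
f = s2 / np.abs(1 - a * np.exp(-1j * w)) ** 2       # spectral density of the AR(1)
gamma_spectral = np.mean(R * f) / 2                 # periodic Riemann sum of the integral

gamma_direct = 0.5 * R * s2 / (1 - a ** 2)          # (1/2) R var(Delta)
print(abs(gamma_spectral - gamma_direct) < 1e-9)    # True
```

The equally spaced sum is used because the integrand is smooth and periodic, for which it converges extremely fast.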
STATIONARY PROCESSES; SPECTRAL THEORY
In the discrete-time case it is sometimes useful to see expression (24) in power series rather than Fourier terms, and so to write it
$$\gamma = \tfrac{1}{2}\, \mathrm{Abs}\{\mathrm{tr}[\mathfrak{R}\, g(z)]\} = \frac{1}{4\pi i} \oint \mathrm{tr}[\mathfrak{R}\, g(z)]\, \frac{\mathrm dz}{z}. \qquad (25)$$
Here the symbol 'Abs' denotes the operation of extracting the absolute term in the expansion of the bracketed term in powers of $z$ on the unit circle, and the integral is taken around the unit circle in an anticlockwise direction.

If we had considered a cost function
$$C = \sum_{t=0}^{h-1} c_t \qquad (26)$$
up to a horizon $h$ then we could have regarded the average cost (23) as being characterised by the asymptotic relation
$$E(C) = h\gamma + o(h) \qquad (27)$$
for large $h$. Here $E$ is again the expectation operator under policy $\pi$, but now conditional on the specified initial conditions. The $o(h)$ term reflects the effect of these conditions; it will be zero if the stationary regime has already been reached at time $t = 0$.

Consider now the LEQG criterion introduced in Section 12.3. We saw already from that section that the LEQG model provided a natural embedding for the LQG model; we shall see in Chapters 16, 17 and 21 that it plays an increasing role as we bring in the concepts of risk-sensitivity, the $H_\infty$ criterion and large-deviation evaluations. For this criterion we would expect a relation analogous to (27):
$$E(e^{-\theta C}) = \exp[-h\theta\gamma(\theta) + o(h)]. \qquad (28)$$
Here $\gamma(\theta)$ is a type of geometric-average cost, depending on both the policy $\pi$ and the risk-sensitivity parameter $\theta$. The least ambitious aim of LEQG optimisation in the infinite-horizon limit would be to choose the policy $\pi$ to minimise $\gamma(\theta)$. ('Least ambitious', because a full-dress dynamic programming approach would minimise transient costs as well as average costs.) We aim then to derive an expression for $\gamma(\theta)$ analogous to (24).
Theorem 13.5.2 The average cost $\gamma(\theta)$ defined by (28) has the evaluation
$$\gamma(\theta) = \frac{1}{4\pi\theta} \int \log|I + \theta \mathfrak{R}\, f(\omega)|\, \mathrm d\omega \qquad (29)$$
for values of $\theta$ such that the symmetrisation of the matrix $I + \theta \mathfrak{R} f(\omega)$ is positive definite for all real $\omega$. Here $f(\omega)$ is the spectral density function of the $\Delta$-process under the stationary regime for the specified policy $\pi$ (assumed stationary and
linear), and the integral is again over the interval $[-\pi, \pi]$ or the whole real axis in discrete or continuous time respectively. Here $|P|$ denotes the determinant of a matrix $P$; note that expression (29) indeed reduces to (24) in the limit $\theta \to 0$.

We shall prove an intermediate lemma for the discrete-time case before proving the theorem. Suppose that the AGF of $\{\Delta_t\}$ has a canonical factorisation (16) with $A(z)$ analytic in some annulus $|z| \leq 1 + \delta$ for positive $\delta$. That is, the $\Delta$-process has an autoregressive representation. It is also Gaussian, since the policy is linear. Let us further suppose this representation so normalised that $A_0 = I$. The probability density of $\Delta_t$ conditional on past values is then
$$f(\Delta_t \mid \Delta_\tau;\ \tau < t) = [(2\pi)^m |V|]^{-1/2} \exp[-\tfrac{1}{2}\, \epsilon_t^{\mathrm T} V^{-1} \epsilon_t]$$
where $\epsilon_t = A(\mathcal{T})\Delta_t$ and $m$ is the dimension of $\Delta$. This implies that
$$f(\Delta_0, \Delta_1, \ldots, \Delta_{h-1} \mid \Delta_\tau;\ \tau < 0) = [(2\pi)^m |V|]^{-h/2} \exp\Big[{-\tfrac{1}{2}} \sum_t \epsilon_t^{\mathrm T} V^{-1} \epsilon_t\Big] = [(2\pi)^m |V|]^{-h/2} \exp\Big[{-\tfrac{1}{2}} \sum_t \Delta_t^{\mathrm T} M(\mathcal{T}) \Delta_t + o(h)\Big] \qquad (30)$$
where $M(z) = A(z^{-1})^{\mathrm T} V^{-1} A(z) = g(z)^{-1}$. Since the multivariate density (30) integrates to unity, and since the normalisation $A_0 = I$ implies the relation $\log|V| = \mathrm{Abs}\,\log|g(z)|$, we see that we can write conclusion (30) as follows.
Lemma 13.5.3 Suppose that $M(z)$ is self-conjugate, positive definite on the unit circle and analytic in a neighbourhood of the unit circle. Then
$$\int\!\!\int \cdots \int \exp\Big[{-\tfrac{1}{2}} \sum_t x_t^{\mathrm T} M(\mathcal{T}) x_t\Big]\, \mathrm dx_0\, \mathrm dx_1 \cdots \mathrm dx_{h-1} = (2\pi)^{hm/2} \exp\{-(h/2)\,\mathrm{Abs}[\log|M(z)|] + o(h)\}. \qquad (31)$$
*Proof of Theorem 13.5.2. The expectation in (28) can be expressed as the ratio of two multivariate integrals of type (31), with the identifications $M(z) = g(z)^{-1} + \theta\mathfrak{R}$ and $M(z) = g(z)^{-1}$ in numerator and denominator respectively. We thus have
$$E(e^{-\theta C}) = \exp[-(h/2)\,\mathrm{Abs}\{\log|g(z)^{-1} + \theta\mathfrak{R}| + \log|g(z)|\} + o(h)],$$
whence the evaluation
$$\gamma(\theta) = \frac{1}{2\theta}\, \mathrm{Abs}[\log|I + \theta\mathfrak{R}\, g(z)|]$$
and its alternative expression (29) follow. The continuous-time demonstration is analogous. □

The only reason why we have starred this proof is that the $o(h)$ term in the exponent of the final expression (30) is in fact $\Delta$-dependent, so one should go into more detail to justify the passage to the assertion (31).
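The reduction of (29) to (24) as $\theta \to 0$ can likewise be checked numerically. The following sketch (again our own scalar AR(1) example, not from the text) evaluates $\gamma(\theta)$ by discretising the integral in (29) and compares it with the LQG value.

```python
import numpy as np

# Scalar sanity check (our construction) of (29): for an AR(1) deviation process
# with spectral density f(w) = s2/|1 - a e^{-iw}|^2 and weight R, the quantity
# gamma(theta) = (1/(4 pi theta)) int log(1 + theta R f(w)) dw should reduce to
# the LQG evaluation (24) as theta -> 0, and lies below it for theta > 0.
a, s2, R = 0.7, 1.0, 2.0
w = np.linspace(-np.pi, np.pi, 1 << 18, endpoint=False)
f = s2 / np.abs(1 - a * np.exp(-1j * w)) ** 2

def gamma(theta):
    # periodic Riemann sum of (1/(4 pi theta)) * integral over [-pi, pi]
    return np.mean(np.log1p(theta * R * f)) / (2 * theta)

gamma_lqg = np.mean(R * f) / 2                     # evaluation (24)
print(abs(gamma(1e-6) - gamma_lqg) < 1e-3)         # True: the theta -> 0 limit
print(gamma(0.5) < gamma_lqg)                      # True, since log(1+x) <= x
```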
CHAPTER 14
Optimal Allocation; The Multi-armed Bandit

1 ALLOCATION AND CONTROL

This chapter is somewhat off the principal theme, although on a topic of considerable importance in its own right, and the reader may bypass it if he wishes. Allocation problems are concerned with the sharing of limited resources between various activities which are being pursued. This is not the kind of problem envisaged in classical control theory, but it is of course indeed a control problem if this allocation is being varied in time to meet changing conditions. The adaptive allocation of transmission capacity in a communications network provides just such an example, clearly of fundamental technological importance.

The classic dynamic allocation problem is the 'multi-armed bandit', henceforth referred to as MAB. This is the situation in which a gambler makes a sequence of plays of any of $n$ gambling machines (the 'bandits'), and wishes to choose the machine which he plays at each stage so as to maximise the total expected payoff (perhaps discounted, in the infinite-horizon case). The payoff probability of the $i$th machine is a parameter, $\theta_i$ say, whose value is unknown. However, the gambler builds up an estimate of $\theta_i$ which becomes ever more exact as he gains more experience of the machine. The conflict, then, is between playing a machine which is known to have a good value of $\theta_i$ and experimenting with a machine about which little is known, but which just might prove even better. It is in order to resolve this conflict that one formulates the problem as an optimisation problem.

As an allocation problem this is quite special, on the one hand, but has features which confuse the issue, on the other. The resource which one is allocating is one's time (or, equivalently, effort), in that one can play only one machine at a time, and must decide which. In more general models one will be able to split the allocation at a given time.
The problem also has a feature which is fascinating but is irrelevant in the first instance: the 'state' of the $i$th machine at a given time is not its physical state, but is the state of one's information about the value of $\theta_i$, an 'informational state'. We shall see how to handle this concept in the next chapter, but this aspect should not divert one from the essential problem, which is to decide which machine to use next on the basis of something which we may indeed term the current 'state' of each machine.
In order to sideline such irrelevancies it is useful to formulate the MAB problem in greater generality. We shall suppose that one has $n$ 'projects', the $i$th of which has state value $x_i$. The current state of all projects is assumed known. One can engage only one project at a time; if one engages project $i$ then $x_i$ changes to a value $\tilde x_i$ by time-homogeneous Markov rules (i.e. the distribution of $\tilde x_i$ conditional on all previous state values of all projects is in fact dependent only on $x_i$), the states of unengaged projects do not change, and there is a reward whose expectation is a function $r_i(x_i)$ of $x_i$ and $i$. If one envisages indefinite operation starting from time $t = 0$ and discounts rewards by a factor $\beta$ then the aim is to choose policy $\pi$ so as to maximise $E_\pi[\sum_{t=0}^{\infty} \beta^t R_t]$, where $R_t$ is the reward received at time $t$. The policy is the rule by which one decides which project to engage at any given time.

One can generalise even this formulation so as to make it more realistic in several directions, as we shall see. However, the problem as stated is the best first formalisation, and captures the essential elements of a dynamic allocation problem.

The problem in this guise proved frustratingly difficult, and resisted sustained attack from the 'forties to the 'seventies. However, it had in fact been solved by Gittins about 1970; his solution became generally known about 1981, when it opened up wide practical and conceptual horizons. Gittins' solution is simple to a degree which is found amazing by anyone who knew the frustrations of earlier work on the topic. One important feature which emerges is that the optimal policy is an index policy. That is, one can attach an index $\nu_i(x_i)$ to the $i$th project which is a function of the project label $i$ and the current state $x_i$ of the project alone. If the index is appropriately calculated (the Gittins index), then the optimal policy is simply to choose a project of currently greatest index at each stage.

Furthermore, the Gittins index $\nu_i$ is determined by the statistical properties of project $i$ alone. We shall describe this determination, both simple and subtle, in the next section.

The MAB formulation must be generalised if one is to approach a problem as complicated as, for example, the routing of telephone traffic through a network of exchanges. One must allow several types of resource; these must be capable of allocation over more than one 'project' at a time; projects which are unengaged may nevertheless be changing state; projects may indeed interact. We shall sketch one direction of generalisation in Sections 5-7. The Gittins solution of the MAB problem stands as the exact solution of a 'pure' problem. The inevitable next stage in the analysis is to see how this exact solution of an idealised problem implies a solution, necessarily optimal only in some asymptotic sense, for a large and complex system.
2 THE GITTINS INDEX

The Gittins index is defined as follows. Consider the situation in which one has only two alternative actions: either to operate project $i$ or to stop operation and
receive a 'retirement' reward of $M$. One has then in fact an optimal stopping problem (since once one ceases to operate the project, its state will not change and there is no reason to resume operation). Denote the value function for this problem by $\phi_i(x_i, M)$, to make the dependence on the retirement reward as well as on the project state explicit. This will obey the dynamic programming equation
$$\phi_i(x_i, M) = \max\{M,\ r_i(x_i) + \beta E[\phi_i(\tilde x_i, M) \mid x_i]\}, \qquad (1)$$
a relation which we shall abbreviate to
$$\phi_i = \max[M, \mathcal{L}_i \phi_i]. \qquad (2)$$
Here $x_i$ and $\tilde x_i$ are, as above, the values of the state of project $i$ before and after one stage of operation. As a function of $M$ for fixed $x_i$ the function $\phi_i$ has the form indicated in Figure 1: non-decreasing and convex, and equal to $M$ for $M$ greater than a critical value $M_i(x_i)$. This is the range in which it is optimal to accept the retirement reward rather than to continue, and $M_i(x_i)$ is the crossover value, at which $M$ is just large enough that the options of continuing or terminating are equally attractive. Note that $M_i(x_i)$ is not the fair buy-out price for the project when in state $x_i$; it is more subtle than that. It is the price which is fair (in state $x_i$) if an offer is made which is to remain open, and so which the project operator is free to accept at any time in the future. It is this quantity which can be taken as the Gittins index, although usually one scales it to take
$$\nu_i(x_i) = (1 - \beta)\, M_i(x_i) \qquad (3)$$
as the index. One can regard $\nu$ as the size of an annuity that a capital sum $M$ would buy, so that one is rephrasing the problem as a choice between the alternatives of either operating the project with its uncertain return or of moving to the 'certain' project which offers a constant income of $\nu$.
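The determination of $M_i(x_i)$, and hence of the index, can be sketched computationally for a finite-state project. The scheme below is our own illustration, not the book's algorithm: value iteration solves the stopping-problem equation $\phi_i = \max[M, \mathcal{L}_i\phi_i]$ for fixed $M$, and bisection locates the crossover value. The degenerate check at the end uses a project frozen in a single state paying a constant reward $r = 2$, whose scaled index must be exactly $r$.

```python
import numpy as np

# Sketch (our own construction): for a finite-state project with transition
# matrix P, expected rewards r and discount beta, the stopping-problem value
# phi(., M) solves phi = max(M, r + beta * P phi).  M(x) is the crossover value
# of M in state x, and the scaled Gittins index is nu = (1 - beta) M(x).
beta = 0.9

def phi(P, r, M, iters=2000):
    v = np.full(len(r), M)
    for _ in range(iters):
        v = np.maximum(M, r + beta * (P @ v))
    return v

def gittins(P, r, x, lo=0.0, hi=1e3):
    for _ in range(60):                       # bisection on the crossover M(x)
        M = 0.5 * (lo + hi)
        cont = r + beta * (P @ phi(P, r, M))  # value of operating once more
        if cont[x] > M + 1e-12:
            lo = M                            # continuing still strictly better
        else:
            hi = M
    return (1 - beta) * M

# Degenerate check: a project frozen in one state paying r = 2 for ever.
# Its crossover value is r/(1 - beta) = 20, so its index is exactly 2.
P = np.array([[1.0]])
r = np.array([2.0])
print(round(gittins(P, r, 0), 6))             # 2.0
```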
Figure 1 The graph of $\phi_i(x_i, M)$ as a function of $M$. Here $x_i$ and $M$ are respectively the state of project $i$ and the reward if the retirement option is accepted, and $\phi_i(x_i, M)$ is the value function of the corresponding optimal stopping problem.
Note that the index $\nu_i(x_i)$ is indeed evaluated in terms of the properties of project $i$ alone. The solution of the $n$-project problem is phrased in terms of these indices: the optimal policy (the Gittins index policy) is to choose at each stage one of the projects of currently greatest index. One may indeed regard this as a reduction of the problem which is so powerful that one can term it a solution, since solution of the $n$-project problem has been reduced to solution of a stopping problem for individual projects. We shall prove optimality of this policy in the next section, but let us note now an associated assertion. Let $x$ denote the composite state $(x_1, x_2, \ldots, x_n)$ of the set of all $n$ projects, and let $\Phi(x, M)$ denote the value function of the $n$-project problem when the retirement option $M$ is also available. Then $\Phi(x, M)$ has the evaluation in terms of the one-project value functions $\phi_i$
$$\Phi(x, M) = B - \int_M^B \prod_i \frac{\partial \phi_i(x_i, m)}{\partial m}\, \mathrm dm, \qquad (4)$$
where $B$ is an upper bound such that retirement is immediately optimal for all retirement rewards exceeding $B$.
Solution of this augmented problem for general $M$ implies solution of the $n$-project problem without the retirement option, because if $M$ is small enough then one will never accept the retirement option.

Exercises and comments

(1) Prove that $\phi_i(x_i, M)$ has indeed the character asserted in the text and illustrated in Figure 1.
3 OPTIMALITY OF THE GITTINS INDEX POLICY

The value function $\Phi(x, M)$ will obey the dynamic programming equation
$$\Phi = \max[M,\ \max_i \mathcal{L}_i \Phi] \qquad (5)$$
where the operator $\mathcal{L}_i$, defined implicitly by comparison of (1) and (2), acts only on the $x_i$-argument of $\Phi$. We shall prove validity of (4) and optimality of the Gittins policy by demonstrating that expression (4) is the unique solution of (5) and that the Gittins policy corresponds to the maximising options in (5). More explicitly: one operates a project of maximal index $\nu_i$ if this exceeds $M(1 - \beta)$, and otherwise accepts the retirement option. Many other proofs of optimality have now been offered in the literature which do not depend upon dynamic programming ideas; one particular line is indicated in the exercises
below. However, these are no shorter, most of them do not yield the extra conclusion (4), and it is a matter of opinion as to whether they are more insightful.
Lemma 14.3.1 Expression (4) may alternatively be written
$$\Phi(x, M) = \phi_i(x_i, M)\, P_i(x, M) + \int_M^{\infty} \phi_i(x_i, m)\, \mathrm d_m P_i(x, m) \qquad (6)$$
where
$$P_i(x, M) = \prod_{j \neq i} \frac{\partial \phi_j(x_j, M)}{\partial M} \qquad (7)$$
is non-negative, non-decreasing in $M$ and equal to unity for
$$M > M_{(i)} := \max_{j \neq i} M_j. \qquad (8)$$
Proof. Note that quantities such as $M_i$ and $M_{(i)}$ have a dependence upon $x$ which we have suppressed. Equation (6) follows from (4) by partial integration. Since $\phi_j$, as a function of $M$, is non-decreasing, convex and equal to $M$ for $M \geq M_j$, then $\partial\phi_j/\partial M$ is non-negative, non-decreasing and equal to unity for $M \geq M_j$. The properties asserted for $P_i$ thus follow. □

Consider the quantity
$$\delta_i(x_i, M) = \phi_i(x_i, M) - \mathcal{L}_i \phi_i(x_i, M)$$
and note that $\delta_i \geq 0$, with equality for $M \leq M_i$. We are interested for the moment in the dependence of the various quantities on $M$ for fixed $x$, and so shall for simplicity suppress the $x$-argument.
Lemma 14.3.2 Expression (6) satisfies the relations
$$\Phi \geq M \qquad (9)$$
with equality if $M \geq \max_j M_j$, and
$$\Phi(M) - \mathcal{L}_i\Phi(M) = \delta_i(M)\, P_i(M) + \int_M^{\infty} \delta_i(m)\, \mathrm d_m P_i(m) \geq 0 \qquad (10)$$
with equality if $M_i = \max_j M_j \geq M$. Here $\mathrm d_m P_i(m)$ is the increment in $P_i(m)$ for an increment $\mathrm dm$ in $m$.
Proof. Inequality (9) and the characterisation of the equality case follow from (6) and the properties of $P_i$. The first relation of (10) follows immediately from (6). The non-negativity of the expression follows from the non-negativity of $\delta_i$ and the non-negative and
non-decreasing nature of $P_i$. We know that $\delta_i(M) = 0$ for $M \leq M_i$ and that $\mathrm d_m P_i(m) = 0$ for $m \geq M_{(i)}$, so that expression (10) will be zero if $M \leq M_i$ and $M_{(i)} \leq M_i$. This pair of conditions is equivalent to those asserted in the lemma. □
Theorem 14.3.3 The value function $\Phi(x, M)$ of the augmented problem indeed has the evaluation (4), and the Gittins index policy is optimal.

Proof. The assertions of Lemma 14.3.2 show both that expression (4) satisfies the dynamic programming equation (5) and that the Gittins index policy (augmented by the recommendation of termination if $M$ exceeds $\max_i M_i$) provides the maximising option in (5). But since (5) has a unique solution and the maximising option indicates the optimal action (see Exercise 3.1.1, valid also for the stochastic case) both assertions are proved. □

Exercises and comments

We indicate an alternative line of argument which explains the form of solution (4) and avoids some of the appeal to dynamic programming ideas.

(1) (Whittle, 1980). Consider a policy which is such that project $i$ is terminated as soon as $x_i$ enters a write-off set $S_i$ $(i = 1, 2, \ldots, N)$, and retirement with reward $M$ takes place the moment all projects have been written off. We assume that there is some rule for the choice from the set of projects still in use, which we need not specify. Let us term such a policy a write-off policy, and denote the value functions under a given such policy for the $N$-project and one-project situations by $F(x, M)$ and $f_i(x_i, M)$ respectively. Then
$$\frac{\partial F}{\partial M} = E(\beta^{\tau} \mid x), \qquad \frac{\partial f_i}{\partial M} = E(\beta^{\tau_i} \mid x_i),$$
where $\tau$ is the (random) time taken to drive all $N$ projects into their write-off sets and $\tau_i$ the time taken to drive project $i$ into its write-off set. But $\tau$ is distributed as $\sum_i \tau_i$ with the $\tau_i$ taken as independently distributed. Hence it follows that
$$\frac{\partial F}{\partial M} = \prod_i \frac{\partial f_i}{\partial M},$$
which would imply the validity of (4) if it could be asserted that the optimal policy was also a write-off policy.

(2) (Tsitsiklis-Weber). Denote the value function for the augmented problem when one is restricted to a set $I$ of projects by $V(I)$. This should also show a dependence on $M$ and the project states, which we suppress. Then $V$ has the submodular property
$$V(I) + V(J) \geq V(I \cup J) + V(I \cap J). \qquad (11)$$
Prove this by induction on the time-to-go $s$, appealing to the fact that a choice of one project from each of $I \cup J$ and $I \cap J$ can be seen as a choice of one project from each of $I$ and $J$, but that the converse assertion is not in general true.

(3) (Tsitsiklis, 1986). Take $I$ as the set of all projects which are written off (in an optimal policy for the full set of $n$ projects) and $J$ as its complement. Then relation (11) becomes $M + V(J) \geq \Phi + M$, or $\Phi \leq V(J)$. But plainly the reverse inequality holds, so that $\Phi = V(J)$.

4 EXAMPLES

In the continuation region $x > \xi$ the value function obeys
$$\alpha\phi = x + \mu\phi_x + \tfrac{1}{2}\, N \phi_{xx}, \qquad (12)$$
where $\xi$ is the optimal breakpoint for retirement reward $M$. We find the solution of (12) to be
$$\phi(x, M) = (x/\alpha) + (\mu/\alpha^2) + c\, e^{px} \qquad (13)$$
where $p$ is the negative solution of
$$\tfrac{1}{2}\, N p^2 + \mu p - \alpha = 0,$$
and $c$ is an arbitrary constant. The general solution of (12) would also contain an exponential term corresponding to the positive root of this last equation, but this will be excluded since $\phi$ cannot grow faster than linearly with increasing $x$. The unknowns $c$ and $\xi$ are determined by the boundary conditions $\phi = M$ and $\phi_x = 0$ at $x = \xi$ (see Exercise 10.7.2). If we substitute expression (13) into these two equations then the relation between $M$ and $\xi$ which results is equivalent to $M = M(\xi)$. We leave it to the reader to verify that the calculation yields the determination
$$M(\xi) = \frac{\xi}{\alpha} + \frac{\mu + \sqrt{\mu^2 + 2\alpha N}}{2\alpha^2}.$$
The constant term in $M(\xi)$ represents the future discounted reward expected from future change in $x$. This is positive even if $\mu$ is negative, a consequence of the fact that one can take advantage of a random surge against trend if this occurs, but can retire if it does not.
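The pieces of this example can be checked against one another numerically. The sketch below (our own check, with made-up parameter values) takes $\phi$ as in (13), computes $p$ as the negative root of the characteristic equation and fixes $c$ from $\phi_x(\xi) = 0$; the boundary value $\phi(\xi)$ must then reproduce $M(\xi) = \xi/\alpha + (\mu + \sqrt{\mu^2 + 2\alpha N})/(2\alpha^2)$, and $\phi$ must satisfy $\alpha\phi = x + \mu\phi_x + \tfrac12 N\phi_{xx}$.

```python
import numpy as np

# Consistency check (our own, made-up parameters) of (13), the characteristic
# equation and the boundary conditions phi = M, phi_x = 0 at x = xi.
alpha, mu, N, xi = 0.1, 0.5, 1.0, 2.0

p = (-mu - np.sqrt(mu ** 2 + 2 * alpha * N)) / N       # negative root
c = -1.0 / (alpha * p * np.exp(p * xi))                # from phi_x(xi) = 0

phi = lambda x: x / alpha + mu / alpha ** 2 + c * np.exp(p * x)
M_closed = xi / alpha + (mu + np.sqrt(mu ** 2 + 2 * alpha * N)) / (2 * alpha ** 2)

print(abs(phi(xi) - M_closed) < 1e-9)                  # True: phi(xi) = M(xi)

# phi also satisfies alpha*phi = x + mu*phi_x + (N/2)*phi_xx, checked at x = 3:
phi_x = lambda x: 1 / alpha + c * p * np.exp(p * x)
phi_xx = lambda x: c * p ** 2 * np.exp(p * x)
print(abs(alpha * phi(3.0) - (3.0 + mu * phi_x(3.0) + 0.5 * N * phi_xx(3.0))) < 1e-9)
```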
5 RESTLESS BANDITS

One desirable relaxation of the basic MAB model would be to allow projects to change state even when not engaged, although of course by different rules. For example, one's knowledge of the efficacy of a medical treatment used to combat a particular infection improves as one uses it. However, it could actually deteriorate when one ceased to use the treatment if, for example, the virus causing the infection were mutating. For a similar example, one's information concerning the position of an enemy submarine will in general improve as long as one tracks it, but would actually deteriorate if one ceased tracking. Even if the vessel were not taking deliberate evasive action its path would still not be perfectly predictable. As a final example, suppose that one has a pool of $n$ employees of whom exactly $m$ are to be set to work at a given time. One can imagine that employees who are working produce, but at a decreasing rate as they tire. Employees who are resting do not produce, but recover. The 'project' (the employee) is thus changing state whether or not he is at work.

We shall speak of the phases when a project is in operation or not as active and passive phases. For the traditional MAB model a project was static in its passive phase. As we have seen, for many problems this is not true: the active and passive phases produce contrary movements in state space. For submarine surveillance the two phases induce gain and loss of information respectively. For the labour force the two phases correspond to tiring and recovery. We shall refer to a project which may change even in the passive phase as a 'restless bandit', the description being a literal one for the submarine example.

The workforce example generalised the MAB model in another respect: one was allowed to engage $m$ of the $n$ projects at a time rather than just one. We shall allow this, so that, for the submarine example, we could suppose that $m$ $(< n)$ aircraft are available to track the $n$ submarines.
It is then a matter of allocating the surveillance effort of the m aircraft in order to keep track of all n submarines as well as possible. We shall specialise the model in one respect: we shall assume rewards undiscounted, so that the problem is that of maximising the average reward. This makes for a much simpler analysis; indeed, the known solutions of a number of standard problems are greatly simpler in the undiscounted case. As we have maintained earlier, and argue on structural grounds in Section 16.9, the undiscounted case is in general the realistic one in the control context.
Let us label the phases by $k$, with active and passive phases corresponding to $k = 1$ and $k = 2$ respectively. If one could operate project $i$ without constraint then it would yield a maximal average reward $\gamma_i$ determined by the dynamic programming equation
$$\gamma_i + f_i(x_i) = \max_k \{ r_{ik}(x_i) + E_k[f_i(\tilde x_i) \mid x_i] \}. \qquad (14)$$
Here $r_{ik}(x_i)$ is the expected instantaneous reward for project $i$ in phase $k$, $E_k$ is the expectation operator in phase $k$ and $f_i(x_i)$ is the transient reward for the optimised project. We shall write (14) more compactly as
$$\gamma_i + f_i = \max[L_{i1} f_i,\ L_{i2} f_i] \qquad (i = 1, 2, \ldots, n). \qquad (15)$$
Let m(t) be the number of projects which are active at time t. We wish to optimise operation under the constraint that m(t) = m for prescribed m and identically in t; that is, exactly m projects should be active at all times. Let R_opt(m) be the optimal average return (from the whole population of n projects) under this constraint. However, a more relaxed demand would be simply to require that

E[m(t)] = m,    (16)

where the expectation is the equilibrium expectation for the policy adopted. Essentially, then, we wish to maximise E(Σ_i r_i) subject to E(Σ_i I_i) = n − m. Here r_i is the reward yielded by project i (dependent on project state and phase) and I_i is the indicator function which takes the value 1 or 0 according as project i is in the passive or the active phase. However, this is a constraint we could take account of by maximising E[Σ_i (r_i + ν I_i)], where ν is a Lagrangian multiplier. We are thus effectively solving the modified version of (15)

γ_i(ν) + f_i = max[L_i1 f_i, ν + L_i2 f_i]    (i = 1, 2, ..., n)    (17)
where f_i is a function f_i(x_i, ν) of x_i and ν. An economist would view ν as a 'subsidy for passivity', pitched at just the level (which might be positive or negative) which ensures that m projects are active on average. Note that the subsidy is independent of the project; the constraint (16) is one on total activity, not on individual project activity.

We thus see that the effect of relaxing the constraint m(t) = m to the averaged version (16) is to decouple the projects; relation (17) involves project i alone. This is also the point of the Gittins solution of the original MAB problem: that the n-project problem was decomposed into n one-project problems.

A negative subsidy would usually be termed a 'tax'. We shall use the term 'subsidy' under all circumstances, however, and shall refer to the policy induced by the optimality equations (17) as the subsidy policy. This is a policy optimal under the averaged constraint (16). If we wish to be specific about the value of ν we shall refer to the policy as the ν-subsidy policy. For definiteness, we shall close the passive set. That is, if x_i is such that L_i1 f_i = ν + L_i2 f_i then project i is to be rested. The value of ν must be chosen so that indeed m projects are active on average. This induces a mild recoupling of the projects.
Theorem 14.5.1 The maximal average reward under constraint (16) is

R(m) = min_ν [ Σ_i γ_i(ν) − ν(n − m) ]    (18)
and the minimising value of ν is the required subsidy level.

Proof This is a classic assertion in convex programming (that, under regularity conditions, the extrema attained in primal and dual problems are equal) but is best seen directly. The average reward is indeed the square-bracketed expression in (18), because the average subsidy paid must be subtracted. Since
Σ_i γ_i(ν) = sup_π E_π [ Σ_i (r_i + ν I_i) ]

(where π denotes policy) then

(∂/∂ν) Σ_i γ_i(ν) = E_π Σ_i I_i = n − m.

This equation relates m and ν and yields the minimality condition in (18). The condition is indeed one of minimality, because Σ_i γ_i(ν) is convex increasing in ν. The function R(m) is concave, and represents the maximal average reward for any m in [0, n]. □

Define now the index ν_i(x_i) of project i when in state x_i as the value of ν which makes L_i1 f_i = ν + L_i2 f_i in (17). In other words, it is the value of subsidy which makes the two phases equally attractive for project i in state x_i. This is an obvious analogue of the Gittins index, to which it indeed reduces when passive projects are static and yield no reward. The interest is that the index (which Gittins characterised as a fair 'retirement income') is now seen as the Lagrange multiplier associated with a constraint on average activity. The index is obviously meaningful: the greater the subsidy needed to induce one to rest a project, the more rewarding must it be to operate that project.

Suppose now that we wish to enforce the constraint m(t) = m rigidly. Then a plausible policy is the index policy: at all times to choose the projects to be operated as the m projects of currently greatest index (i.e. the first m on a list ranking projects by decreasing index). Denote the average return from this policy by R_ind(m).
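The per-project calculation can be sketched numerically. The fragment below is an illustrative sketch, not from the text: the three-state 'tiring worker' project, its transition matrices and its rewards are all invented here. It solves the single-project equation (17) by relative value iteration and locates the index ν_i(x) by bisection on the subsidy.

```python
import numpy as np

def solve_project(P1, P2, r1, r2, nu, iters=5000, tol=1e-9):
    # Relative value iteration for the single-project equation (17):
    # gamma(nu) + f(x) = max( r1(x) + E1[f | x], nu + r2(x) + E2[f | x] )
    f = np.zeros(len(r1))
    gamma = 0.0
    for _ in range(iters):
        Tf = np.maximum(r1 + P1 @ f, nu + r2 + P2 @ f)
        gamma, fnew = Tf[0], Tf - Tf[0]        # normalise so that f(0) = 0
        if np.max(np.abs(fnew - f)) < tol:
            f = fnew
            break
        f = fnew
    return gamma, f

def whittle_index(P1, P2, r1, r2, x, lo=-100.0, hi=100.0, steps=50):
    # Bisect on nu until the active and passive phases are equally
    # attractive in state x: this is the index nu_i(x).
    for _ in range(steps):
        nu = 0.5 * (lo + hi)
        _, f = solve_project(P1, P2, r1, r2, nu)
        if r1[x] + P1[x] @ f > nu + r2[x] + P2[x] @ f:
            lo = nu            # activity still preferred: subsidy too small
        else:
            hi = nu
    return 0.5 * (lo + hi)

# Hypothetical 'tiring worker' project: active work earns less as the
# worker tires (state 0 fresh, state 2 exhausted); rest earns nothing
# but lets the worker recover.  All numbers are invented.
P1 = np.array([[0.3, 0.7, 0.0], [0.0, 0.3, 0.7], [0.0, 0.0, 1.0]])  # active
P2 = np.array([[1.0, 0.0, 0.0], [0.7, 0.3, 0.0], [0.0, 0.7, 0.3]])  # passive
r1 = np.array([3.0, 2.0, 1.0])
r2 = np.zeros(3)
indices = [whittle_index(P1, P2, r1, r2, x) for x in range(3)]
```

For this invented project the fresh state carries the largest index, as one would expect. Choosing ν so that on average m of the n projects have ν_i(x_i) > ν recovers the subsidy policy; the index policy simply operates the m projects of greatest current index.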
Theorem 14.5.2

R_ind(m) ≤ R_opt(m) ≤ R(m).    (19)
Proof The first inequality holds because R_opt is by definition the optimal average return under the constraint m(t) = m. The second holds because R(m) is the optimal average return under the relaxed version of this constraint, E[m(t)] = m. □
The question now is: how close are the inequalities (19), i.e. how close is the index policy to optimality? Suppose we reduce rewards to a per-project basis in that we divide through by n. The relation

R_ind(m)/n ≤ R_opt(m)/n ≤ R(m)/n    (20)
then expresses inequalities between rewards (under various policies) averaged over both time and projects. One might conjecture that, if we let m and n tend to infinity in constant ratio and hold the population of projects to some fixed composition, then all the quotients in (20) will have limits and equality will hold throughout in this limit. This conjecture has in fact been essentially verified in a very ingenious analysis by Weber and Weiss (1990). However, there are a couple of interesting reservations.

Let us say that a project is indexable if the set of values of state for which the project is rested increases from the empty set to the set of all states as ν increases. This implies that, if the project is rested for a given value of subsidy, then it is rested for all greater values. It also implies that, if all projects are indexable, then the projects i which are active under a ν-subsidy policy are just those for which ν_i(x_i) > ν.

One might think that indexability would hold as a matter of course. It does so in the classic MAB case, but not in this. Counterexamples can be found, although they seem to constitute a small proportion of the set of all examples. An example given in Whittle (1988) shows how non-indexability can come about. Let D(ν) be the set of states for which a given project is rested under the ν-subsidy policy. Suppose that a given state (ξ, say) enters D as ν increases. It can be that paths starting from ξ within D show long excursions from D before they return. This implies a surrender of subsidy which can become non-optimal once ν increases through some higher value, when ξ will leave D.

Another point is that asymptotic equality in the second inequality of (20) can fail unless a certain stability condition is satisfied (explicitly, unless the solution of the deterministic version of the equations governing the distribution of index values in the population under the index policy converges to a unique equilibrium). However, the statistics of the matter are interesting. In an investigation of over 20 000 randomly generated test problems Weber and Weiss found that about 90% were indexable, but found no counterexamples to average-optimality (i.e. of instability of the dynamic equations). In searching a more specific set of examples they found counterexamples in fewer than one case in 10^3, and for these the margin of suboptimality was of the order of one part in 10^5. The assertion that the index policy is average-optimal is then virtually true for all indexable cases; it is remarkable that an assertion can escape absolute validity by so little. The result gives some indication that asymptotic optimality can be achieved in large systems by quite simple policies.
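Indexability is easy to probe numerically. The sketch below (an invented three-state 'tiring worker' project with illustrative numbers; nothing here comes from the text) sweeps the subsidy ν, computes the passive set D(ν) from the ν-subsidy policy, and checks that it grows monotonically from the empty set to the whole state space.

```python
import numpy as np

def passive_set(P1, P2, r1, r2, nu, iters=5000, tol=1e-9):
    # Solve (17) by relative value iteration, then return D(nu): the
    # states in which the nu-subsidy policy rests the project
    # (closing the passive set, i.e. resting on ties).
    f = np.zeros(len(r1))
    for _ in range(iters):
        fnew = np.maximum(r1 + P1 @ f, nu + r2 + P2 @ f)
        fnew -= fnew[0]
        if np.max(np.abs(fnew - f)) < tol:
            f = fnew
            break
        f = fnew
    return set(np.flatnonzero(nu + r2 + P2 @ f >= r1 + P1 @ f))

# Invented 'tiring worker' project (illustrative numbers only).
P1 = np.array([[0.3, 0.7, 0.0], [0.0, 0.3, 0.7], [0.0, 0.0, 1.0]])
P2 = np.array([[1.0, 0.0, 0.0], [0.7, 0.3, 0.0], [0.0, 0.7, 0.3]])
r1 = np.array([3.0, 2.0, 1.0])
r2 = np.zeros(3)

sets = [passive_set(P1, P2, r1, r2, nu) for nu in np.linspace(-2.0, 6.0, 33)]
indexable = all(a <= b for a, b in zip(sets, sets[1:]))  # D(nu) nested?
```

For this monotone project the sweep shows the nested structure D(ν) required by indexability; a counterexample of the Whittle (1988) kind would show some state leaving D(ν) again as ν increases.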
6 AN EXAMPLE: MACHINE MAINTENANCE

The machine maintenance problem considered in Section 11.4 constitutes a good first example. This is phrased in terms of costs rather than rewards, so that it is now natural to think of ν as a cost for action rather than a subsidy for inaction; i.e. to identify it as the cost of a machine overhaul. In the notation of Section 11.4, the dynamic programming equation for a single machine analogous to equation (17) is then

γ = min{ ν + cx + f(0) − f(x),  cx + λ[f(x + 1) − f(x)] }.    (21)
If we normalise f(0) to zero and conjecture that the optimal policy is to service the machine if x ≥ ξ for some critical value ξ, then (21) implies the equations

γ + f(x) = ν + cx    (x ≥ ξ),
γ = cx + λ[f(x + 1) − f(x)]    (0 ≤ x < ξ).

These have solution

f(x) = cx + ν − γ    (x ≥ ξ),
f(x) = Σ_{j=0}^{x−1} (γ − cj)/λ = [γx − cx(x − 1)/2]/λ    (0 ≤ x ≤ ξ).

The identity of the two solutions at x = ξ thus provides the determining equation for γ in terms of ξ:

γξ − cξ²/2 ≈ λ(cξ + ν − γ).    (22)

Here we have neglected a discreteness effect by replacing ξ(ξ − 1) by ξ². Now ξ is determined by optimality; i.e. by the requirement that γ should be minimal with respect to ξ. This is equivalent to requiring that the derivatives with respect to ξ of the two sides of relation (22) should be equal (see Exercise 10.7.2), i.e. to the condition λc ≈ γ − cξ. Substituting this evaluation of γ into (22) we deduce a relation between ν and ξ, equivalent to ν ≈ ν(ξ). In this way we find the evaluation

ν(x) ≈ c(x + λ) + cx²/(2λ),
accurate to within a discreteness effect. The last (and dominant) term in this expression is essentially the index which was deduced by policy improvement in Section 11.4.
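The approximation can be checked directly. The following sketch (illustrative parameter values c = 1, λ = 0.5, invented here) sets ν from the formula ν(ξ) ≈ c(ξ + λ) + cξ²/(2λ) for a target threshold ξ = 6, solves (21) by relative value iteration on a truncated state space, and recovers the service threshold of the computed optimal policy.

```python
import numpy as np

# Illustrative parameters (invented): wear-cost rate c, deterioration
# probability lam per period, target service threshold xi = 6.
c, lam, xi = 1.0, 0.5, 6
nu = c * (xi + lam) + c * xi**2 / (2 * lam)   # approximate index formula

N = 80                                        # truncation (numerical device)
x = np.arange(N + 1)
up = np.minimum(x + 1, N)
f = np.zeros(N + 1)
gamma = 0.0
for _ in range(20000):
    run = c * x + lam * f[up] + (1 - lam) * f     # let the machine run
    service = nu + c * x + f[0]                   # overhaul: back to state 0
    fnew = np.minimum(run, service)
    gamma, fnew = fnew[0], fnew - fnew[0]         # normalise f(0) = 0
    if np.max(np.abs(fnew - f)) < 1e-10:
        f = fnew
        break
    f = fnew

run = c * x + lam * f[up] + (1 - lam) * f
service = nu + c * x + f[0]
threshold = int(np.argmax(service <= run))    # first state where we service
```

With these numbers ν = 42.5, and the recovered threshold sits at or next to ξ = 6, the small discrepancy being exactly the neglected discreteness effect.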
7 QUEUEING PROBLEMS

Despite the fact that the technique of policy improvement produced plausible index policies for a number of queueing problems in Sections 11.5-11.7, the restless bandit technique fails to do so. Consider, for example, the costing of service, in the hope of deducing an index for the allocation of service between queues. With the cost structure of Section 11.5 the dynamic programming equation corresponding to (17) would be

γ = cx + min{ λΔ(x + 1),  ν + λΔ(x + 1) − μ(x)Δ(x) }.

Here ν is the postulated cost of employing a server, Δ(x) is the increment f(x) − f(x − 1) in transient cost and μ(x) equals μ or zero according as x is positive or zero. One must assume that μ > λ if γ is to be finite. One finds (and we leave this as an exercise to the reader) that for any nonnegative ν the active set (the set in which it is optimal to employ a server) is the whole set x > 0 of positive x. One thus does not have a decision boundary which changes with changing ν; the very feature on which the definition of the index was based in Section 5.

One can see how this comes about: there is no point in considering policies for which, for example, there is no service in a state x = ξ for some ξ > 0. Such policies would present the same optimisation problem as ever for the queue on the set of states x ≥ ξ, but with the feature that one is accepting a baseload of ξ and so incurring an unnecessary constant cost of cξ. One might argue that variation of μ allows varying degrees of engagement, so that one might allow μ to vary with state with a corresponding proportional service cost. However, one reaches essentially the same conclusion in the undiscounted case: that an optimal policy serves all states x > 0 at a common rate.

The special character of such queueing problems becomes manifest if one considers the large-system limit n → ∞ envisaged in Section 5. It is assumed that the traffic intensity for the whole system is less than unity, so that there is sufficient service capacity to cover all queues. In that case all this capacity is directed to a queue the moment it has customers, so that all queues are either empty or (momentarily) have one customer in the course of service. This rather unrealistic immediacy of response can be avoided only if response is inhibited by switching costs or slowed down by delay in either observation or transfer.

Such considerations do not hold for the pure reward problem of Section 11.7. We leave the reader to confirm that the methods of Section 6 indeed lead to the known optimal policy.
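The assertion that the active set is the whole of x > 0 for any nonnegative ν can be checked numerically. The sketch below (a uniformised single queue with invented rates λ = 0.3, μ = 0.6 and holding cost c = 1; the truncation at N customers is purely a numerical convenience) solves the average-cost equation by relative value iteration for two values of ν and inspects where serving is optimal.

```python
import numpy as np

# Uniformised M/M/1 sketch: arrival prob lam, service prob mu per slot
# (lam + mu < 1 leaves a self-loop, keeping the chain aperiodic),
# holding cost c per customer, server cost nu per slot of service.
lam, mu, c, N = 0.3, 0.6, 1.0, 120
x = np.arange(N + 1)
up = np.minimum(x + 1, N)
down = np.maximum(x - 1, 0)

def active_set(nu, iters=30000, tol=1e-10):
    f = np.zeros(N + 1)
    can_serve = x > 0
    gamma = 0.0
    for _ in range(iters):
        idle = c * x + lam * f[up] + (1 - lam) * f
        serve = c * x + nu + lam * f[up] \
                + mu * np.where(can_serve, f[down], f) + (1 - lam - mu) * f
        fnew = np.minimum(idle, serve)
        gamma, fnew = fnew[0], fnew - fnew[0]
        if np.max(np.abs(fnew - f)) < tol:
            f = fnew
            break
        f = fnew
    idle = c * x + lam * f[up] + (1 - lam) * f
    serve = c * x + nu + lam * f[up] \
            + mu * np.where(can_serve, f[down], f) + (1 - lam - mu) * f
    return serve <= idle, gamma        # where serving is (weakly) optimal

served0, gamma0 = active_set(0.0)
served2, gamma2 = active_set(2.0)
```

Serving is optimal at every x > 0 whatever the value of ν, as the text asserts: in equilibrium the fraction of time served is λ/μ for any stable policy, so the server cost is policy-independent and one simply minimises the holding cost by serving at once.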
Notes on the literature

The subject has a vast literature. Although Gittins had arrived at his solution by 1970 it was published first in Gittins and Jones (1974). His later book (Gittins, 1989) gives a collected exposition of the whole subject. The proof of optimality associated with solution (4) for the value function was given in Whittle (1980). Restless bandits were introduced in Whittle (1988) and their analysis completed by Weber and Weiss (1990).
CHAPTER 15
Imperfect State Observation

We saw in Chapter 12 that, in the LQG case, there was a treatment of the case of imperfect observation which was both complete and explicit. By 'imperfect observation' we mean that the current value of the process variable (or of the state variable, in state-structured cases) is not completely observable. In practice, observation will seldom be complete. However, the LQG case is an exception in that it is so tractable; this tractability carries no further than to the LEQG case (see Chapter 16). There are very few models with imperfect observation which can be treated both exactly and explicitly.

Let us restrict attention to the state-structured case, which one might regard as the standard normalised form. Then the central formal result which emerges if the state variable is imperfectly observed is that there is still a simply recursive dynamic programming equation. However, the argument of the value function is no longer the physical state variable, but an 'informational' state variable: the distribution of the current value of physical state conditional on current information. This argument is then not, as formerly, an element of the state space, but a distribution on that space. This great increase in cardinality of the argument makes analysis very much more difficult. The simplification of the LQG case is that this conditional distribution is always Gaussian, and so parametrised by its current mean x̂_t and covariance matrix V_t. The value function then depends on these arguments alone, and the validity of the certainty-equivalence principle and associated ideas implies even further simplification. Certainly, there is no general analogue of the certainty-equivalence principle, and it must be said that this fact adds interest as well as difficulty.

In general the choice of control will also affect the information one gains, so that control actions must be chosen with two (usually conflicting) goals in mind: to control the system in the conventional sense and to gain information on aspects of the system for which it is needed. For example, to 'feel the steering' of a strange car is effective as a long-term control measure (in that one gains necessary information on the car's driving characteristics) but not as a short-term one. Similarly, it is the conflict between the considerations of profit-maximisation on the basis of the information one has and the choice of actions to improve this information base that gives the original multi-armed bandit problem its particular character. Control with this dual goal is often referred to as dual control. (A 'duality' quite distinct from the mathematical duality between control/future and estimation/
past which has been our constant theme.) An associated concept is that of adaptive control: the system may have parameters which are not merely unobservable but also changing, and procedures must be such as to track these changes as well as possible and adapt the control rule to them. A procedure which effectively estimates an unknown parameter will of course also track a changing parameter. The theory of dual and adaptive control requires a completely new set of ideas; it is subtle, technical and, while extensive, is as yet incomplete. For these reasons we shall simply not attempt any account of it, but shall merely outline the basic formalism and give a single tractable example.

1 SUFFICIENCY OF THE POSTERIOR DISTRIBUTION OF STATE

Let us suppose, for simplicity of notation, that all random variables are discrete-valued; the formal extension of conclusions to more general cases is then obvious. We shall use a naive notation, so that, for example, P(x_t | W_t) denotes the probability of a value x_t of the state at time t conditional on the information W_t available at time t. We are thus not making a notational distinction between a random variable and particular values of that random variable, just as the 'P' in the above expression denotes simply 'probability of' rather than a defined function of the bracketed arguments. We shall use a more explicit functional notation when needed.

Let us consider the discrete-time case. The structural axioms of Appendix 2 are taken for granted (and so also their implication: that past controls, as parametrising variables, can be unequivocally lumped in with the conditioning variables). We assume the following modified version of the state-structure hypotheses of Section 8.3.

(i) Markov dynamics It is assumed that process variable x and observation y have the property

P(x_{t+1}, y_{t+1} | X_t, Y_t, U_t) = P(x_{t+1}, y_{t+1} | x_t, u_t).    (1)

(ii) Decomposable cost function It is assumed that the cost function separates into a sum of instantaneous and terminal costs, of the form

C = Σ_{t=0}^{h−1} β^t c(x_t, u_t, t) + β^h C_h(x_h).    (2)

(iii) Information It is assumed that W_t = (W_0, Y_t, U_{t−1}) and that the information available at time t = 0 implies a prior distribution of initial state P(x_0 | W_0).

It is thus implied in (iii) that y_t is the observation that becomes available at time t, when the value of u_t is to be determined. Assumption (i) asserts rather more than Markov structure; it states that, for given control values U, the stochastic
process {x_t} is autonomous, in that the distribution of x_{t+1} conditional on X_t and Y_t is in fact dependent only on x_t. In other words, the causal dependence is one-way; y depends upon x but not conversely. This is an assumption which could be weakened; see Exercise 1. We have included a discount factor in the cost function (2) for economy of treatment.

Assumptions (i) and (iii) imply that both state variables and observations can be regarded as random variables, and that the stochastic treatment can be started up, in that an initial distribution P(x_0 | W_0) is prescribed for initial state x_0 (the prior distribution). The implication may be more radical than it seems. For example, for the original multi-armed bandit the physical state variable is in fact the parameter vector θ = {θ_i}: the vector of unknown success probabilities. The fact that this does not change with time, under usual assumptions, makes it no less a state variable. However, the change in viewpoint which allows one to regard this parameter as a random variable is non-trivial. Generally, the formulation above implies that unknown plant parameters are to be included in the state variable and are to be regarded as random variables, not directly observable but of known prior distribution. That is, one takes the Bayesian point of view to inference on structure. The controversy among statisticians concerning the Bayesian formulation and its interpretation has been a battle not yet consigned to history. We shall take the pragmatic point of view that, in this context, the only formulations which lead to a natural recursive mathematical analysis of the problem are the Bayesian formulation and its minimax analogue.

We shall refer to the distribution

P_t = {P_t(x_t)} := {P(x_t | W_t)}    (3)

as the posterior distribution of state; more explicitly, the posterior distribution of x_t at time t, in that it is the distribution of x_t conditional upon the information that has been gathered by time t. It obeys a natural forward recursion.
Theorem 15.1.1 (Bayes updating of the posterior distribution) Under the assumptions (i)-(iii) listed above the posterior distribution P_t obeys the updating formula

P_{t+1}(x_{t+1}) ∝ Σ_{x_t} P_t(x_t) P(x_{t+1}, y_{t+1} | x_t, u_t).    (4)
Proof We have, for fixed W_{t+1} and variable x_{t+1},

P(x_{t+1} | W_{t+1}) ∝ P(x_{t+1}, y_{t+1} | W_t, u_t)
  = Σ_{x_t} P(x_t, x_{t+1}, y_{t+1} | W_t, u_t)
  = Σ_{x_t} P(x_t | W_t, u_t) P(x_{t+1}, y_{t+1} | W_t, x_t, u_t)
  = Σ_{x_t} P_t(x_t) P(x_{t+1}, y_{t+1} | x_t, u_t).

The last step follows from (1) and (3), together with the implication of causality: that P(x_t | W_t, u_t) = P(x_t | W_t). Normalising this expression for the conditional distribution of x_{t+1} we deduce recursion (4). □

Just as the generic value of x_t is often denoted simply by x, so shall we often denote the generic value of P_t simply by P. We see from (4) that the updating formula for P can be expressed
P(x) → P′(x) := Σ_z P(z) a_t(x, y | z, u) / Σ_x Σ_z P(z) a_t(x, y | z, u)    (5)

where a_t(x, y | z, u) is the functional form of the conditional probability

P(x_{t+1} = x, y_{t+1} = y | x_t = z, u_t = u).

Recall now our definition of a sufficient variable ξ_t in Section 2.1. Theorem 8.3.1 could be expressed as an assertion that, under the assumptions (i)-(iii) of Section 8.3, the pair (x_t, t) is sufficient, where x_t is the dynamic state variable. What we shall now demonstrate is that, under the imperfect-observation versions of these assumptions expressed in (i)-(iii) above, it is the pair (P_t, t) which is sufficient, where P_t is the posterior distribution of dynamical state x_t. For this reason P is sometimes referred to as an 'informational state' variable or a 'hyperstate' variable, to distinguish it from x itself, which still remains the underlying physical state variable.
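In code, one step of the recursion (5) is a single matrix contraction. The following sketch is a hypothetical two-state example with invented numbers: the kernel a is assembled from a transition matrix T and an observation matrix O, with the control u already folded into the kernel.

```python
import numpy as np

def bayes_update(P, a, y):
    # One application of (5): P'(x) ∝ Σ_z P(z) a(x, y | z, u), where
    # a[z, x, y] = P(x_{t+1}=x, y_{t+1}=y | x_t=z, u_t=u) for the chosen u.
    unnorm = P @ a[:, :, y]
    return unnorm / unnorm.sum()

# Hypothetical example: a static two-valued state (identity dynamics),
# observed through a noisy channel that reports correctly w.p. 0.8.
T = np.eye(2)                            # T[z, x] = P(x_{t+1}=x | x_t=z)
O = np.array([[0.8, 0.2],
              [0.2, 0.8]])               # O[x, y] = P(y | x)
a = np.einsum('zx,xy->zxy', T, O)        # joint kernel a(x, y | z)
P0 = np.array([0.5, 0.5])
P1 = bayes_update(P0, a, y=1)            # posterior after observing y = 1
```

Starting from the uniform prior, a single observation y = 1 tilts the posterior to (0.2, 0.8): exactly the Bayes ratio 0.2 : 0.8 of the two likelihoods here.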
Theorem 15.1.2 (The optimality equation under imperfect state observation) Under the assumptions (i)-(iii) listed above, the variable (P_t, t) is sufficient, and the optimality equation for the minimal expected discounted future cost takes the form

F(P, t) = inf_u [ Σ_x P(x) c(x, u, t) + β Σ_x Σ_y Σ_z P(z) a_t(x, y | z, u) F(P′, t + 1) ]    (6)

for t < h, with the terminal condition

F(P, h) = Σ_x P(x) C_h(x).    (7)

Call a function φ(P) homogeneous of degree one if φ(λP) = λφ(P) for any positive scalar λ. We shall find it sometimes convenient to write F(P, t) as F({P(x)}, t) if we wish to indicate how P(x) transforms for a given value of x.
Theorem 15.1.3 The value function F(P, t) can be consistently extended to unnormalised distributions P by the requirement that it be homogeneous of degree one in P, when the dynamic programming equation (6) simplifies to the form

F(P, t) = inf_u [ Σ_x P(x) c(x, u, t) + β Σ_y F({Σ_z P(z) a_t(x, y | z, u)}, t + 1) ].    (8)

Proof Recursion (6) would certainly reduce to (8) if F(P, t + 1) had the homogeneity property, and F(P, t) would then share this property. But it is evident from (7) that F(P, h) has the property. □

The conclusion can be regarded as an indication of the fact that it is only the relative values of P(x_t | W_t) (for varying x_t and fixed W_t) which matter, and that the normalisation factor in (5) is then irrelevant.
Exercises and comments

(1) An alternative and in some ways more natural formulation is to regard (x_t, y_t) as jointly constituting the physical state variable, of which only the component y_t is observed. The Markov assumption (i) of the text will then be weakened to

P(x_{t+1}, y_{t+1} | X_t, Y_t, U_t) = P(x_{t+1}, y_{t+1} | x_t, y_t, u_t),

consistent with the previous assumption (i) of Section 8.3. Show that the variable (P_t, y_t, t) is sufficient, where P_t = {P_t(x_t)} = {P(x_t | W_t)} is updated by the formula

P_{t+1}(x_{t+1}) ∝ Σ_{x_t} P_t(x_t) P(x_{t+1}, y_{t+1} | x_t, y_t, u_t).
2 EXAMPLES: MACHINE SERVICING AND CHANGE-POINT DETECTION

One might say that the whole of optimal adaptive control theory is latent in equation (8), if one could only extract it! Even something like optimal statistical communication theory would be just a special case. However, we shall confine our ambitions to the simplest problem which is at all amenable.
Suppose that the dynamic state variable x represents the state of a machine; suppose that this takes integer values j = 0, 1, 2, .... Suppose that the only actions available are to let the machine run (in which case x follows a Markov chain with transition probabilities p_jk) or to service it (in which case the machine is brought to state 0). To run the machine for unit time in state j costs c_j; to service it costs d. At each stage one derives an observation y on machine state. Suppose, for simplicity, that this is discrete-valued, the probability of an observation y conditional on machine state j being p_j(y). Let P = {P_j} denote the current posterior distribution of machine state, and let γ and f(P) denote the average and transient cost under an optimal policy (presumed stationary). The dynamic programming equation corresponding to (8) is then

γ + f(P) = min[ Σ_j P_j c_j + Σ_y f({Σ_j P_j p_jk p_k(y)}_k),  d Σ_j P_j + f(δ(P)) ]    (9)

where δ(P) is the unnormalised distribution which assigns the entire probability mass Σ_j P_j to state 0. The hope is, of course, to determine the set of P-values for which the option of servicing is indicated. However, even equation (9) offers no obvious purchase for general solution.

The trivial special case is that in which there are no observations at all. Then P is a function purely of the time which has elapsed since the last service, and the optimal policy must be to service at regular intervals. The optimal length of interval is easily determined in principle, without recourse to (9).

A case which is still special, but less trivial, is that in which the machine can be in only two states, x = 0 or 1, say. We would interpret these as 'satisfactory' and 'faulty' respectively. In this case the informational state can be expressed in terms of the single number

π = P_1 / (P_0 + P_1).

This is the probability (conditional on current information) that the machine is faulty. Let us suppose that p_01 = p = 1 − q and p_10 = 0. That is, the fault-free machine can develop a fault with probability p, but the faulty machine cannot spontaneously correct itself. If we set f(P) = φ(π) and assume the normalisation φ(0) = 0 then equation (9) becomes, in this special case,

γ + φ(π) = min[ cπ + Σ_y p(y) φ(π′(y)),  d ].    (10)

Here we have assumed that c_0 = 0 (writing c for c_1), and have defined
π′(y) = (p + πq) p_1(y) / p(y),    p(y) = (1 − π)q p_0(y) + (p + πq) p_1(y).    (11)
Formula (11) gives the updating rule π → π′. The optimal decision will presumably be to service when π exceeds some threshold value, this value being in principle determinable from (10).

This two-state model can also represent the problem of change-point detection mentioned in Section 8.1. Suppose that state 0 is that in which a potential pollution source (say, a factory or a reactor) is inactive, and state 1 that in which it is active. One must decide whether to 'service' (i.e. to give the alarm and take anti-pollution measures) or not on the basis of observations y. Giving the alarm costs d; delaying the alarm when it should be given costs c per unit time. Two particular cases of this model have been analysed in the literature: those in which the observations y are Poisson or normal variables respectively, with parameters dependent upon pollution state j. The problem of determining the critical threshold value of π from (10) under these assumptions is soluble, and is referred to as the Poisson or Gaussian disorder problem, respectively.
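The threshold can be computed numerically by value iteration on a grid of π-values. The sketch below uses the two-state specialisation directly, with invented illustrative numbers for c, d, p and the observation distributions; the grid and the interpolation are numerical conveniences, not part of the model.

```python
import numpy as np

# Illustrative parameters (invented): running cost c per period while
# faulty, service cost d, fault probability p per period.
c, d, p = 1.0, 2.0, 0.1
q = 1.0 - p
p0 = np.array([0.8, 0.2])        # observation distribution in state 0
p1 = np.array([0.3, 0.7])        # observation distribution in state 1

grid = np.linspace(0.0, 1.0, 401)
phi = np.zeros_like(grid)
gamma = 0.0
for _ in range(5000):
    run = c * grid
    prior1 = p + grid * q                               # P(faulty next period)
    for y in (0, 1):
        py = (1 - grid) * q * p0[y] + prior1 * p1[y]    # predictive P(y)
        run = run + py * np.interp(prior1 * p1[y] / py, grid, phi)
    new = np.minimum(run, d + phi[0])                   # run vs service
    gamma, new = new[0], new - new[0]                   # normalise phi(0) = 0
    if np.max(np.abs(new - phi)) < 1e-12:
        phi = new
        break
    phi = new

run = c * grid
prior1 = p + grid * q
for y in (0, 1):
    py = (1 - grid) * q * p0[y] + prior1 * p1[y]
    run = run + py * np.interp(prior1 * p1[y] / py, grid, phi)
svc = (d + phi[0]) <= run            # pi-values at which we service
tau = grid[np.argmax(svc)] if svc.any() else 1.0
```

For these numbers the computed policy is of the expected threshold form: run while π is small, service once π passes an interior critical value τ.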
PART 3: BEYOND

Risk-Sensitive and H∞ Criteria
CHAPTER 16
Risk-sensitivity: The LEQG Model

1 UTILITY AND RISK-SENSITIVITY

Control optimisation has been posed in terms of cost, whereas economists work largely in terms of reward, i.e. negative cost. Since we shall be invoking some economic concepts, let us conduct the discussion in terms of reward for the moment, before reverting to the cost formulation. Suppose that R is the net monetary reward from some enterprise. One then wishes to conduct the enterprise (i.e. choose a policy) so as to maximise R. However, if R is a random variable, then it does not necessarily follow that one will wish to maximise E_π(R) with respect to policy π; one might choose rather to maximise E_π[U(R)], where U is some nonlinear function, usually monotone increasing. For example, if U had the concave form of Figure 1 then this would be an indication that a given increment in reward would be of less benefit if R were already large than if it were small. The function U is termed a utility function, and is virtually defined by the fact that, on axiomatic or behavioural grounds, one has decided that E_π[U(R)] is the quantity one wishes to maximise. The necessity for choice of a utility function arises because one is averaging over an uncertain outcome, but it would also arise if one wished to characterise the benefit derived from a reward which was distributed over time or over many enterprises.

In cost terms, one could generalise the minimisation of expected cost E_π(C) to the minimisation of the criterion E_π[L(C)]. Here L is a disutility function, again
Figure 1 The graph of a utility function U(R) of the concave increasing form usually considered. The concavity expresses a decreasing marginal rate of return of utility with reward, which induces a risk-averseness on the part of the optimiser.
Figure 2 A convex disutility function, implying an effective risk-averse attitude or pessimism.
presumably monotone increasing. One gains a feeling for the implications of such a generalisation if one considers the two cases of Figures 2 and 3. In the case of Figure 2, L is supposed convex, so that every successive increment in cost is regarded ever more seriously. In this case Jensen's inequality implies that E[L(C)] ≥ L[E(C)], so that, for a given value of E(C), a certain outcome is preferred to an uncertain outcome. That is, an optimiser with a convex disutility function is risk-averse, in that he dislikes uncertainty.

The concave disutility function of Figure 3 corresponds to the opposite attitude. In this case L is supposed concave, so that successive increments in cost are regarded ever less seriously. Jensen's inequality is then reversed, with the implication that an optimiser with a concave disutility function is risk-seeking, in that he positively welcomes uncertainty. In the transition case, when L is linear, the optimiser is risk-neutral, in that he is concerned only by the expectation of cost and not by its variability. All other cases correspond to a degree of risk-sensitivity on his part.

One can interpret risk-seeking and risk-averse attitudes on the part of the optimiser as manifestations of optimism or pessimism respectively, in that they
Figure 3 A concave disutility function, implying an effective risk-seeking attitude or optimism.
imply his belief that uncertainties tend to his advantage or disadvantage respectively. This conclusion will emerge rather more explicitly in the next section. The attitude to risk is revealed also if one converts the criterion back on to a cost scale by defining

χ_π = L^{−1}(E_π[L(C)]).    (1)

Here L^{−1} is the function inverse to L, which certainly exists if L is strictly monotonic. If L is monotone increasing then minimisation of χ_π is of course equivalent to minimisation of the expected disutility.

Suppose now that under policy π the cost C has expectation m and a small variance v. Expansion of L(C) in powers of C − m then leads to the conclusion that E_π[L(C)] = L(m) + ½L″(m)v + o(v) (under regularity conditions on differentials and moments). This in turn implies that

χ_π = m + [L″(m)/L′(m)] (v/2) + o(v).    (2)
That is, variability is weighted positively or negatively in the criterion according as the disutility function is convex or concave. This argument is less convincing than that based on Jensen's inequality, in that it makes unnecessary assumptions. However, it is illuminating in other respects; see Exercise 1.

There are now good reasons for paying particular attention to the exponential disutility function e^{−θC}, where θ is a parameter, the risk-sensitivity parameter. This function should be minimised for θ negative (when it is monotone increasing) and maximised for θ positive (when it is monotone decreasing). However, if we normalise back on to a cost scale as in (1), then we find that all cases are covered by the assertion that the normalised criterion

χ_π(θ) = −θ^{−1} log E_π(e^{−θC})    (3)

should be minimised with respect to π. The criterion reduces in the case θ = 0 to the classic risk-neutral criterion E_π(C); it corresponds to increasingly risk-seeking or risk-averse attitudes as θ increases through positive values or decreases through negative values respectively. Note that relation (2) now becomes

χ_π(θ) = m − θv/2 + o(v).    (4)
The exponential disutility function is attractive for two reasons. (i) The parameter θ places the optimiser naturally on a scale of optimism-pessimism. (ii) In this case, and essentially in this case alone, the coefficient of v in (2) is independent of m (see Exercise 1). That is, in this case there is an approximate decoupling of the aspects of expectation and variability of cost. However, the more compelling reasons are what one might term mathematical/pragmatic in character; such reasons have a way of turning out to be fundamental. (iii) Under LQG assumptions the exponential criterion leads to a
RISK-SENSITIVITY: THE LEQG MODEL
complete and attractive generalisation of LQG theory. Essentially, the expectation in (3) is then the integral of something resembling a Gaussian density. (iv) If LQG assumptions are replaced by what one might term 'large-scale' assumptions and if an exponential criterion is adopted then large-deviation theory becomes immediately applicable. It is striking that the economic concept of risk-sensitivity, interesting in itself, should mesh so naturally with the mathematics. We shall explore the LQG generalisation in the remainder of this chapter. Large-deviation concepts open a complex of ideas to which the final part of the book is devoted.

Exercises and comments
(1) Show that L″(m)/L′(m) is independent of m if and only if L(m) is a linear function of an exponential of m. A utility function is of course not significantly changed by a linear transformation, because such transformations commute with the operation of expectation.

(2) Note an implication of Exercise 8.1.3: that relation (4) is exact (i.e. there is no remainder term) if L is an exponential function and the cost is normally distributed.

(3) A classic moment inequality asserts that, if x is a nonnegative scalar random variable, then (E x^r)^{1/r} is nondecreasing in r. From this it follows that γ_π(θ) is nonincreasing in θ.

2 THE RISK-SENSITIVE CERTAINTY-EQUIVALENCE PRINCIPLE

The combination of the LQG hypotheses of Section 12.2 and the exponential criterion (3) leads us to the LEQG model already discussed in Section 12.3, and to which this chapter is devoted. In fact, we found in Section 12.3 that the most economical way of proving the certainty-equivalence principle (CEP) in the LQG case was to do so first for the LEQG model in the risk-seeking or 'optimistic' case θ > 0. Let us slightly rephrase the conclusions summarised there in Lemma 12.3.2 and Theorem 12.3.3. Define the stress

\mathbb{S} = C + \theta^{-1} D,    (5)
the linear combination of cost and discrepancy which occurs naturally in the evaluation of expectation (3). Let us also define the modified total-cost value function G(W_t) as in Section 12.3 by

e^{-\theta G(W_t)} = f(Y_t)\,\operatorname{ext}_\pi E_\pi\bigl[e^{-\theta C} \mid W_t\bigr]    (6)
where the extremisation is a maximisation or a minimisation according as θ is positive or negative. Then the conclusions of Lemma 12.3.2 and Theorem 12.3.3 can be rephrased as follows.
Theorem 16.2.1 (The risk-sensitive certainty-equivalence theorem for the risk-seeking (optimistic) case) Assume LEQG structure with θ > 0. Then the total value function has the expression

G(W_t) = g_t + \inf_{u_\tau:\,\tau\ge t}\ \inf_{x_\tau,\,y_\tau:\,\tau>t}\ \inf_{x_t}\ \mathbb{S}
       = g_t + \inf_{x_t}\Bigl\{\theta^{-1}D(X_t, Y_t \,\Vert\, U_{t-1}) + \inf_{u_\tau:\,\tau\ge t}\ \inf_{x_\tau:\,\tau>t}\bigl[C(X,U) + \theta^{-1}D(x_{t+1},\dots,x_h \mid X_t;\, U)\bigr]\Bigr\}    (7)
where g_t is a policy-independent function of t alone. The value of u_t thus determined is the LEQG-optimal value of the control at time t. If the value of u_t minimising the square bracket is denoted u(X_t, U_{t-1}) then the LEQG-optimal value of u_t is u(X̂_t(t), U_{t-1}), where X̂_t(t) is the value of X_t determined by the final X_t-minimisation.

The first equality of (7) asserts that the value function at time t is obtained by minimising the stress with respect to all process/observation variables currently unobservable and all decisions not already formed, and that the value of u_t determined in this way is optimal. This may not immediately suggest the expression of a CEP, but it is a powerful assertion which achieves what one might term conversion to free form. By this we mean that an optimisation with respect to functions u_τ(W_τ) has been replaced by an unconstrained minimisation with respect to unobservables and undetermined controls. That is, the constraint that the optimal u_t should depend only upon W_t has been achieved automatically in a free extremisation. This process of effectively relaxing constraints will be taken to its conclusion in Chapters 19-21, when we reach the full time-integral formulation.

The final assertion of the theorem looks much more like a CEP, in that it asserts that the optimal value of u_t is just the optimal value for known X_t, with X_t replaced by an estimate X̂_t(t).

The risk-neutral CEP is often confused with the separation principle, which asserts (in terms not well agreed) the separateness of the optimisations of estimation and control. There is certainly no such separation in the LEQG case. Control costs (both past and future) affect the value of the estimates, and noise statistics affect the control rule (even if the process variable is perfectly observed). However, we see how a separation principle should now be expressed.
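The free-form assertion can be tested on a toy one-step problem. The setup below is entirely our own (a scalar state with Gaussian prior, one control, a quadratic cost), not the book's: minimising the stress freely over both x and u reproduces the control obtained by direct minimisation of the exponential criterion.

```python
import math

# Toy one-step check of the free-form assertion in (7): scalar state x with
# prior N(xhat, v), so discrepancy D = (x - xhat)^2/(2v); one control u;
# cost C = r(x+u)^2/2 + q u^2/2; theta > 0.  All numbers are our own.
r, q, theta, xhat, v = 1.0, 0.5, 0.4, 2.0, 0.8

# (a) Free minimisation of the stress S = C + D/theta over (x, u):
# stationarity gives r(x+u) + (x-xhat)/(theta v) = 0 and r(x+u) + q u = 0.
a11, a12, b1 = r + 1.0 / (theta * v), r, xhat / (theta * v)
a21, a22, b2 = r, r + q, 0.0
det = a11 * a22 - a12 * a21
u_free = (a11 * b2 - a21 * b1) / det

# (b) Direct minimisation of gamma(u) = -(1/theta) log E[exp(-theta*C)],
# the expectation taken over the prior on x (Simpson quadrature).
def gamma(u, n=1200):
    s = math.sqrt(v)
    lo, hi = xhat - 12 * s, xhat + 12 * s
    h = (hi - lo) / n
    tot = 0.0
    for i in range(n + 1):
        x = lo + i * h
        w = 1.0 if i in (0, n) else (4.0 if i % 2 else 2.0)
        cost = 0.5 * r * (x + u) ** 2 + 0.5 * q * u ** 2
        tot += w * math.exp(-theta * cost - (x - xhat) ** 2 / (2 * v))
    return -math.log(tot * h / (3 * s * math.sqrt(2 * math.pi))) / theta

u_best = min((k / 100.0 for k in range(-400, 401)), key=gamma)
assert abs(u_best - u_free) < 0.02   # constrained and free optima agree
print("free-form optimal u:", round(u_free, 4))
```

The agreement is exact here because the Gaussian integral over x differs from the x-extremum of the stress only by a u-independent determinantal factor.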
If X_t is provisionally assumed known then the evaluation of the two terms inside the curly brackets in the final expression of (7), which can be regarded as concerned with estimation and control respectively, can proceed separately. The two evaluations are then coupled by the final minimisation with respect to X_t, which also yields the final estimate X̂_t. The effectiveness of this separation is much clearer in the state-structured case considered below.

The CEP in the risk-averse case, θ < 0, differs interestingly, and requires more careful statement. The distinction arises from the fact that relation (12.16) now becomes
G(W_t) = \inf_{u_t}\,\sup_{y_{t+1}} G(W_{t+1}) + \cdots    (8)
and that the order of the two extremal operations cannot in general be reversed. The consequence is that the analogue of Theorem 16.2.1 is

Theorem 16.2.2 (The risk-sensitive certainty-equivalence theorem for the risk-averse (pessimistic) case) Assume LEQG structure with θ < 0. Then the total value function has the expression
G(W_t) = g_t + \inf_{u_t}\,\sup_{y_{t+1}}\cdots\inf_{u_{h-1}}\,\sup_{y_h}\ \mathbb{S}
       = g_t + \operatorname{stat}_{x_t}\Bigl\{\theta^{-1}D(X_t, Y_t \,\Vert\, U_{t-1}) + \operatorname{stat}_{u_\tau:\,\tau\ge t}\ \operatorname{stat}_{x_\tau:\,\tau>t}\bigl[C(X,U) + \theta^{-1}D(x_{t+1},\dots,x_h \mid X_t;\, U)\bigr]\Bigr\}    (9)
where g_t is a policy-independent function of t alone. The value of u_t thus determined is the LEQG-optimal value of the control at time t. If the value of u_t extremising the square bracket is denoted u(X_t, U_{t-1}) then the LEQG-optimal value of u_t is u(X̂_t(t), U_{t-1}).

The critical value θ̄ is the largest root of |R^{-1} + J(θ)| = 0. Equivalently, it is the largest root of |I + RJ(θ)| = 0, an evaluation which is correct even if R is singular. The value θ̄ thus determined is indeed the point of 'neurotic' breakdown. It is always nonpositive if Q and R are nonnegative definite.

Note that u and ε both behave as controls in this example, in that they both appear linearly in the expression for x_τ and quadratically in the stress function. However, whereas u is always chosen to minimise stress, ε is chosen to minimise it or maximise it according as θ is positive or negative. That is, this auxiliary 'control' is chosen to help or frustrate the optimiser according as he is risk-seeking or risk-averse. We shall expand upon this point in the next section.

If we modify R to −R in the cost function then the term ½(x_τ − x̄_τ)^T R(x_τ − x̄_τ) is to be interpreted rather as a reward, which we wish to be large. The calculations above are still valid as long as P (modified by the change of sign of R) is negative definite, i.e. as long as θ does not exceed the smallest root of |I − RJ(θ)| = 0. This upper bound represents the point of 'euphoric' breakdown. It can be of either sign.

4 THE CASE OF PERFECT STATE OBSERVATION

Most of the remainder of the chapter is devoted to the case of principal interest: the state-structured regulation model in the standard time-homogeneous undiscounted formulation of equations (12.1)-(12.4). The exception is Section 12,
where we find ourselves forced to take a fundamental look at the question of discounting.

Let us first of all assume perfect state observation. The problem then reduces to the solution of equation (12) for the future stress; essentially the dynamic programming equation. This can be written
F(x_t, t) = \inf_{u_t}\ \operatorname{ext}_{x_{t+1}}\bigl[c(x_t, u_t) + \tfrac12\theta^{-1}(\varepsilon^T N^{-1}\varepsilon)_{t+1} + F(x_{t+1},\, t+1)\bigr]    (17)
where 'ext' denotes a minimisation or a maximisation according as θ is positive or negative, and ε_{t+1} and x_{t+1} are related by the plant equation (12.1). In virtue of this last fact we can rewrite (12) as
F(x_t, t) = \inf_{u_t}\ \operatorname{ext}_{\varepsilon_{t+1}}\bigl[c(x_t, u_t) + \tfrac12\theta^{-1}(\varepsilon^T N^{-1}\varepsilon)_{t+1} + F(Ax_t + Bu_t + \varepsilon_{t+1},\, t+1)\bigr]    (18)
But in this form we see that ε can be seen as a subsidiary control variable, entering the plant equation linearly just as u does and carrying a quadratic cost just as u does. It is chosen to help the optimiser if θ is positive and to oppose him if θ is negative. One might regard u and ε as the controls which are wielded by the optimiser and by Nature respectively. The optimiser effectively assumes that Nature is working with him or against him according as he is risk-seeking or risk-averse: a fair characterisation of optimism or pessimism respectively. Note that Nature makes its move first, at each stage.

Of course, ε does not appear as a control in the actual plant equation, but only in the predicted course of the optimally controlled process. In the actual plant equation it is simply random process noise, as ever. It appears as a 'control' for the predicted process because this prediction is generated by a stress extremisation; the extremisation which determines the optimal value of the current control.

The LQ character of recursion (18) implies a solution on the familiar Riccati lines, which can indeed be expressed in terms of the solution of the risk-neutral case.
Theorem 16.4.1 The solution of equation (18) for the future stress has the quadratic form

F(x_t, t) = \tfrac12 (x^T \Pi x)_t    (19)

if it has this form for t = h, and the optimal control then has the linear form

u_t = K_t x_t.    (20)

Here \Pi_t is determined by either of the alternative forms (2.25) or (6.31) of the risk-neutral Riccati equation, and K_t by either of the corresponding alternative equations (2.27) or (6.32), if \Pi_{t+1} is replaced in the right-hand side of these equations by

\bar\Pi_{t+1} = \bigl(\Pi_{t+1}^{-1} + \theta N\bigr)^{-1}.    (21)
Validity of these conclusions is subject to the proviso that the matrix J(θ) + \Pi_{t+1}^{-1} should be positive definite for all relevant t, where J(θ) is the augmented control-power matrix

J(\theta) = BQ^{-1}B^T + \theta N.    (22)

Proof This is inductive, as ever. Suppose that the relation holds at time t + 1. We leave the reader to verify the identity

\operatorname{ext}_{\varepsilon}\bigl[\varepsilon^T(\theta N)^{-1}\varepsilon + (a + \varepsilon)^T \Pi (a + \varepsilon)\bigr] = a^T \bar\Pi a.    (23)
If we perform the ε-extremisation in (23) we thus obtain

F(x_t, t) = \inf_{u_t}\bigl[c(x_t, u_t) + \tfrac12 (Ax_t + Bu_t)^T \bar\Pi_{t+1} (Ax_t + Bu_t)\bigr].
But this is just the inductive relation which held in the risk-neutral case, with the substitution of \bar\Pi_{t+1} for \Pi_{t+1}. The question of validity is covered in the following discussion. □

If we consider the solutions of the risk-neutral case in the alternative forms (6.31), (6.32) then we see that the only effect of risk-sensitivity is simply to replace the control-power matrix J = BQ^{-1}B^T by the augmented form (22), so that, for example, the Riccati equation (6.31) becomes

\Pi_t = R + A^T\bigl(\Pi_{t+1}^{-1} + J(\theta)\bigr)^{-1} A    (24)

if S has been normalised to zero. That is, the control-power matrix is augmented by a multiple of the noise-power matrix. This illustrates again the role of noise as an effective auxiliary control, working with or against the intended control according as θ is positive or negative. If the final maximisation with respect to λ of the second set of displayed equations in Section 6.4 is now to be valid then we require the final condition of the theorem. This sets a lower bound on θ; see Exercise 1.

The optimal (i.e. stress-extremising) values of u_t and ε_{t+1} are matrix multiples of x_t; if they are given these values then the quantities Ax_t + Bu_t and Ax_t + Bu_t + ε_{t+1} can be written \Gamma_t x_t and \tilde\Gamma_t x_t. We can regard \Gamma_t = A + BK_t as the actual gain matrix for the optimally controlled process and \tilde\Gamma_t as what one might call the predictive gain matrix: the gain matrix that would hold if ε_{t+1} really did take the role of an auxiliary control variable and take the value predicted for it by stress extremisation. By appealing to the alternative derivations of Section 6.4 one finds the evaluations
\Gamma_t = A - J\bigl(J(\theta) + \Pi_{t+1}^{-1}\bigr)^{-1} A, \qquad \tilde\Gamma_t = A - J(\theta)\bigl(J(\theta) + \Pi_{t+1}^{-1}\bigr)^{-1} A    (25)
if S has been normalised to zero. In other cases A should be replaced by A − BQ^{-1}S. If infinite-horizon limits exist then it is \tilde\Gamma which is necessarily a
stability matrix; Γ may not be, if the optimiser is excessively optimistic. Note the relation

\tilde\Gamma_t = (I + \theta N \Pi_{t+1})^{-1}\,\Gamma_t.
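The behaviour of the modified recursion is easily seen numerically. The sketch below iterates the scalar form of (24), Π_t = R + A²Π_{t+1}/(1 + J(θ)Π_{t+1}), with our own illustrative numbers: for an unstable plant (|A| ≥ 1) the breakdown value is θ̄ = −B²/NQ, and the backward iteration settles down above it and blows up below it.

```python
# Scalar iteration of the modified Riccati recursion (24),
#     Pi = R + A^2 * Pi / (1 + J(theta) * Pi),  J(theta) = B^2/Q + theta*N.
# Constants are our own illustrative choices; here |A| >= 1 so the
# breakdown value is theta_bar = -B^2/(N*Q) = -1.
A, B, Q, N, R = 1.2, 1.0, 1.0, 1.0, 1.0

def iterate_riccati(theta, steps=500):
    J = B * B / Q + theta * N
    pi = R
    for _ in range(steps):
        denom = 1.0 + J * pi
        if denom <= 0.0 or pi > 1e12:
            return None           # no finite nonnegative solution: breakdown
        pi = R + A * A * pi / denom
    return pi

good = iterate_riccati(-0.5)      # theta above the breakdown value
bad = iterate_riccati(-1.05)      # theta below it
assert good is not None and bad is None
print("equilibrium Pi at theta = -0.5:", round(good, 4))
```

The divergent case manifests itself as the denominator 1 + J(θ)Π crossing zero, the 'neurotic' breakdown of the previous section.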
Exercises and comments

(1) In the scalar case the infinite-horizon form of the Riccati equation (24) becomes

\Pi = R + \frac{A^2 \Pi}{1 + J(\theta)\Pi},

where J(θ) is given by (22). Assume that R > 0. Show that the equation has a finite nonnegative solution if and only if J(θ) > 0 in the case |A| ≥ 1, and J(θ) > −(1 − |A|^2)/R in the case |A| < 1. That is, the critical lower bound θ̄ is −B^2/NQ or −B^2/NQ − (1 − |A|^2)/NR according as the uncontrolled plant is unstable or stable.

5 THE DISTURBED (NON-HOMOGENEOUS) CASE

If the plant equation (12.1) is modified by the addition of a deterministic disturbance d_t:

x_t = Ax_{t-1} + Bu_{t-1} + d_t + \varepsilon_t,

then we shall expect the future stress to have the non-homogeneous quadratic form

F_t(x) = \tfrac12 x^T \Pi_t x - \sigma_t^T x + \cdots    (26)

where + ··· indicates terms independent of x. We can generalise relation (23) to obtain

\operatorname{ext}_{\varepsilon}\bigl[\varepsilon^T(\theta N)^{-1}\varepsilon + (a + \varepsilon)^T \Pi (a + \varepsilon) - 2\sigma^T(a + \varepsilon)\bigr] = a^T \bar\Pi a - 2\bar\sigma^T a + \cdots

where + ··· indicates terms independent of a, the matrix \bar\Pi is again given by (21), and

\bar\sigma = (I + \theta \Pi N)^{-1}\sigma.
From this we deduce

Theorem 16.5.1 If the plant equation includes a deterministic disturbance d then the future stress has the form (26), with \Pi_t obeying the modified Riccati equation indicated in Theorem 16.4.1 and \sigma_t obeying the backward recursion

\sigma_t = \tilde\Gamma_t^T(\sigma_{t+1} - \Pi_{t+1} d_{t+1}).    (27)

Here \tilde\Gamma_t is the predictive gain matrix defined in (25).
Verification is immediate. We see that recursion (27) differs from the corresponding risk-neutral recursion of Section 2.9 only by the substitution of \tilde\Gamma_t for the risk-neutral evaluation of \Gamma_t. This is indeed as it must be, since we have replaced optimisation with respect to a single control u by optimisation with respect to the pair (u, ε). The same argument leads to the evaluation of the optimal control as
u_t = K_t x_t + (Q + B^T \bar\Pi_{t+1} B)^{-1} B^T (I + \theta \Pi_{t+1} N)^{-1} (\sigma_{t+1} - \Pi_{t+1} d_{t+1}).    (28)
The combination of (28) and the recursion (27) gives one the explicit feedback-feedforward formula for the optimal control analogous to (2.65). As for the risk-neutral case, all these results emerge much more rapidly and cleanly (at least in the stationary case) when we adopt the factorisation techniques of Section 6.3; see Section 11 and Chapter 21.

6 IMPERFECT STATE OBSERVATION: THE SOLUTION OF THE P-RECURSION

In the case of imperfect observation one has also to solve the forward recursion (14) for the function P(x_t, W_t). Just as the solution of the F-recursion implies solution of the control-optimisation problem in the case of perfect observation, so solution of the P-recursion largely implies solution of the estimation problem. 'Largely', because the P-recursion produces only the state estimate x̂_t based upon past stress. However, the full minimal-stress estimate is then quickly derived, as we shall see in the next section.

For the standard regulation problem as formulated in equations (12.1)-(12.4) we can write the P-recursion (14) as

\theta P(x_t, W_t) = \min_{x_{t-1}}\bigl[\theta c_{t-1} + D_t + \theta P(x_{t-1}, W_{t-1})\bigr]    (29)

where

D_t = \tfrac12\bigl(\varepsilon^T N^{-1}\varepsilon + \eta^T M^{-1}\eta\bigr)_t

and ε, η are to be expressed in terms of x, y, u by the plant and observation relations (12.1) and (12.2). Now, if we take the risk-neutral limit θ → 0 then θP will have a limit, D say, which satisfies the risk-neutral form of (29)
D(x_t, W_t) = \min_{x_{t-1}}\bigl[D_t + D(x_{t-1}, W_{t-1})\bigr]    (30)

and has the interpretation

D(x_t, W_t) = D(x_t, Y_t \,\Vert\, U_{t-1}) = D(x_t \mid W_t) + \cdots = \tfrac12\bigl[(x - \hat x)^T V^{-1}(x - \hat x)\bigr]_t + \cdots    (31)
Here + ··· indicates terms not involving x_t (in fact identifiable with D(Y_t \,\Vert\, U_{t-1})), and \hat x_t and V_t are respectively the mean and covariance of x_t conditional on W_t. In this risk-neutral case the past-stress extremiser is identifiable with \hat x and, as we saw in Section 12.5, relation (31) implies the recursive updatings of \hat x and V: the Kalman filter and the Riccati recursion.

We can in fact utilise the known solution of (30) to determine that of (29). In the risk-sensitive case we can again establish that θP has the quadratic form exhibited in the final member of (31), so that

P(x, W_t) = \frac{1}{2\theta}\bigl[(x - \hat x)^T V^{-1}(x - \hat x)\bigr]_t + \cdots    (32)

where + ··· again indicates terms independent of x, irrelevant for our purposes. The quantity \hat x_t can now be identified as the estimate of x_t which extremises past stress at time t, and V_t can be said to measure the precision of this estimate, in that (\theta V_t)^{-1} measures the curvature of past stress as x_t varies from \hat x_t. Relation (29) now determines the updating rules for these quantities.
Theorem 16.6.1 Under the assumptions above the past stress has the quadratic form (32), with \hat x_0 and V_0 identified as the prescribed mean and variance of x_0 conditional on W_0. The values of \hat x_t and V_t are determined from those of \hat x_{t-1} and V_{t-1} by the Kalman filter (12.22), (12.23) and the Riccati recursion (12.24), with the modifications that V_{t-1} is replaced by

\bar V_{t-1} = \bigl(V_{t-1}^{-1} + \theta R\bigr)^{-1}    (33)

and \hat x_{t-1} by

\bar x_{t-1} = (I + \theta V_{t-1} R)^{-1}\bigl(\hat x_{t-1} - \theta V_{t-1} S^T u_{t-1}\bigr)    (34)

in the right-hand sides of these relations. For validity of these recursions it is necessary that V_{t-1}^{-1} + C^T M^{-1} C + \theta R should be positive definite for all relevant t.

Proof The quadratic character (32) of P follows again inductively from (29). Recursion (29) differs from (30) only in the addition of the term \theta c_{t-1}. If we assume that P(x_{t-1}, W_{t-1}) has the quadratic form asserted then we find that

\theta c_{t-1} + \theta P(x_{t-1}, W_{t-1}) = \tfrac12\bigl[(x - \bar x)^T \bar V^{-1}(x - \bar x)\bigr]_{t-1} + \text{terms independent of } x_{t-1},

whence it follows that (29) is just the recursion of the risk-neutral case with the substitution of \bar x_{t-1} and \bar V_{t-1} for \hat x_{t-1} and V_{t-1}. □
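A scalar numerical sketch of the modification in Theorem 16.6.1, with our own illustrative numbers and assuming the information form of the Riccati recursion: replacing V_{t-1} by (V^{-1} + θR)^{-1} before the standard update is the same as adding θR to the single-observation information C²/M, as remarked below.

```python
# Scalar sketch: plant x' = A x + noise of power N, observation y = C x +
# noise of power M, cost matrix R, S = 0, theta < 0 (risk-averse).
# step_modified applies the replacement V -> (1/V + theta*R)^{-1} (the
# modification (33)) before a standard information-form update;
# step_information instead adds theta*R to the observation information.
A, C, N, M, R, theta = 0.9, 1.0, 0.2, 0.5, 1.0, -0.3

def step_modified(V):
    Vbar = 1.0 / (1.0 / V + theta * R)
    return N + A * A / (1.0 / Vbar + C * C / M)

def step_information(V):
    return N + A * A / (1.0 / V + C * C / M + theta * R)

V = 1.0
for _ in range(50):
    assert abs(step_modified(V) - step_information(V)) < 1e-12
    V = step_information(V)
print("steady-state V:", round(V, 4))
```

With θ negative the added 'information' θR is negative, so the pessimist carries a larger steady-state V than in the risk-neutral case.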
The modified recursions are again more transparent in the alternative forms (12.59), (12.60). One sees that, as far as updating of V is concerned, the only effect of risk-sensitivity is to modify C^T M^{-1} C to C^T M^{-1} C + θR. That is, to the information matrix associated with a single y-observation one adds the matrix θR, reflecting the 'information' implied by cost-pressures (positive or negative according to the sign of θ). The passage (34) from \hat x_{t-1} to \bar x_{t-1} indicates how the estimate of x_{t-1} changes if we add present cost to past stress.

In continuous time the two forms of the modified recursions coincide and are more elegant; we set them out explicitly in Section 8.

7 IMPERFECT STATE OBSERVATION: RECOUPLING

In the risk-neutral case the recipe for the coupling of estimation and control asserted by the CEP is so immediate that one scarcely gives it thought: the optimal control is obtained by simple substitution of the estimate \hat x_t for x_t in the perfect-observation form of the optimal rule. It is indeed too simple to suggest its risk-sensitive generalisation, which we know from Theorem 16.2.3. This is that the optimal control at time t is u(\bar x_t, t), where u(x, t) is the optimal perfect-observation (but risk-sensitive) control and \bar x_t is the minimal-stress estimate of x_t: the value of x_t extremising P(x_t, W_t) + F(x_t, t).

It was the provisional specification of current state x_t which allowed us to decouple the evaluations of past and future stress; the evaluations are recoupled by the determination of \bar x_t. Now that the separate evaluations (26) and (32) of F and P have been made explicit in the last two sections, the recoupling can also be made explicit.
Theorem 16.7.1 The optimal control at time t is given by u_t = K_t \bar x_t, where

\bar x_t = (I + \theta V_t \Pi_t)^{-1}\bigl(\hat x_t + \theta V_t \sigma_t\bigr).    (35)

Here \Pi_t, K_t and \sigma_t are the expressions determined in Theorems 16.4.1 and 16.5.1, and V_t, \hat x_t those determined in Theorem 16.6.1.

This follows immediately from the CEP assertion of Theorem 16.2.3 and the evaluations of the last two sections. Note that \bar x_t is an average of the value \hat x_t of x_t which extremises past stress and the value \Pi_t^{-1}\sigma_t which extremises future stress. As one approaches the risk-neutral case the effect of past stress swamps that of future stress.
8 CONTINUOUS TIME

The continuous-time analogue of all the LEQG conclusions follows by passage to the continuous limit, and is in general simpler than its discrete-time original. The
analogues of the CEP theorems of Section 2 are obvious. Note that the u- and ε-extremisations of equation (18) are now virtually simultaneous: the optimiser and Nature are playing a differential game, with shared or opposing objectives according as θ is positive or negative. The solutions for past and future stress in the state-structured case simplify, in that the two alternative forms coalesce.

The solution for the future stress for the disturbed but otherwise time-homogeneous regulation problem is

F(x, t) = \tfrac12 x^T \Pi x - \sigma^T x + \cdots    (36)

where \Pi and \sigma are functions of time. The matrix \Pi obeys the backward Riccati equation

\dot\Pi + R + A^T\Pi + \Pi A - (S^T + \Pi B)Q^{-1}(S + B^T\Pi) - \theta\,\Pi N\Pi = 0.

This reduces to

\dot\Pi + R + A^T\Pi + \Pi A - \Pi J(\theta)\Pi = 0    (37)

if S has been normalised to zero, where the augmented control-power matrix J(θ) again has the evaluation

J(\theta) = BQ^{-1}B^T + \theta N.

The vector \sigma obeys the backward linear equation

\dot\sigma + \tilde\Gamma^T\sigma - \Pi d = 0,

where \tilde\Gamma is the predictive gain matrix

\tilde\Gamma = A - BQ^{-1}S - J(\theta)\Pi.
The optimal control in the case of perfect state observation is

u = Kx + Q^{-1}B^T\sigma    (38)

where the time-dependence of K, x and σ is understood, and

K = -Q^{-1}(S + B^T\Pi).    (39)
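The scalar form of (37), written in terms of time to go s, is easy to integrate numerically. The sketch below uses our own illustrative constants (a stable plant, A < 0), for which the breakdown value works out as θ̄ = −B²/NQ − A²/NR.

```python
# Scalar form of (37) in time to go s:
#   dPi/ds = R + 2*A*Pi - J(theta)*Pi^2,   J(theta) = B^2/Q + theta*N.
# Constants are our own; with A < 0 the breakdown value is
#   theta_bar = -B^2/(N*Q) - A^2/(N*R).
A, B, Q, N, R = -0.8, 1.0, 1.0, 0.5, 1.0
theta_bar = -B * B / (N * Q) - A * A / (N * R)

def J(theta):
    return B * B / Q + theta * N

# At theta_bar the quadratic R + 2*A*Pi - J*Pi^2 has a double root:
assert abs(4 * A * A + 4 * J(theta_bar) * R) < 1e-9

def equilibrium(theta, s_max=200.0, ds=1e-3):
    pi = 0.0
    for _ in range(int(s_max / ds)):   # forward Euler in time to go
        pi += ds * (R + 2 * A * pi - J(theta) * pi * pi)
        if pi > 1e6:
            return None                # Pi diverges: breakdown
    return pi

assert equilibrium(theta_bar + 0.2) is not None   # settles to a finite Pi
assert equilibrium(theta_bar - 0.2) is None       # no finite solution
print("theta_bar =", theta_bar)
```

Below θ̄ the quadratic on the right of the equation has no real zero, so Π grows without bound however long the horizon; this is the behaviour formalised in Section 9.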
In the case of imperfect state observation the past stress has the solution

P(x, W) = \frac{1}{2\theta}(x - \hat x)^T V^{-1}(x - \hat x) + \cdots

where the time-dependence of x, W, \hat x and V is understood. The matrix V obeys the forward Riccati equation

\dot V = N + AV + VA^T - (L + VC^T)M^{-1}(L^T + CV) - \theta V R V,

which reduces to

\dot V = N + AV + VA^T - V\bigl(C^T M^{-1} C + \theta R\bigr)V

if L has been normalised to zero. The updating formula for \hat x (the risk-sensitive Kalman filter) is

d\hat x/dt = A\hat x + Bu + d + H(y - C\hat x) - \theta V(R\hat x + S^T u)    (40)

where H = (L + VC^T)M^{-1}. Recoupling follows the discrete-time pattern exactly. That is, the optimal control is u = K\bar x + Q^{-1}B^T\sigma, where K is given by (39) and \bar x is the minimal-stress estimate of x:
\bar x = (I + \theta V \Pi)^{-1}\bigl(\hat x + \theta V \sigma\bigr).

9 SOME CONTINUOUS-TIME EXAMPLES

The simplest example is that of scalar regulation, the continuous-time equivalent of Exercise 4.1. Equation (37) for \Pi can be written

\frac{d\Pi}{ds} = R + 2A\Pi - J(\theta)\Pi^2 = f(\Pi),    (41)

say, where J(θ) = B^2 Q^{-1} + θN, and s represents time to go. Let us suppose that R > 0. We are interested in the nonnegative solutions of f(Π) = 0 which moreover constitute stable equilibrium solutions of (41), in that f′(Π) < 0. In the case J(θ) > 0 there is only one nonnegative root, and it is stable; see Figure 4. In the case J(θ) = 0 there is such a root if A < 0 and no nonnegative root at all if A ≥ 0. If J(θ) < 0, then there is no nonnegative root if A ≥ 0, but there can be one if A is negative and |J(θ)| not too large; see Figure 5. In fact, there is a root of the required character if A < 0 and R + A^2/J(\theta) < 0. To summarise:
Figure 4 The graph of f(Π) in the case J > 0; it has a single positive zero.
Figure 5 The graph of f(Π) in the case J < 0, A < 0; there is a positive zero if J exceeds a critical value.
Theorem 16.9.1 Assume the scalar regulation problem set out above, with S = 0 and R and Q both positive. Then both Π and the magnitude of K decrease as θ increases. If A ≥ 0 (i.e. the uncontrolled plant is unstable) then the breakdown value θ̄ equals −B^2/NQ, and Π becomes infinite as θ decreases to this value. If A < 0 (i.e. the uncontrolled plant is stable) then the breakdown value θ̄ equals −B^2/NQ − A^2/NR. The nonnegative equilibrium solution Π of (41) at θ = θ̄ is finite, but is unstable to positive perturbations.

A second-order example is provided by the linearised pendulum model of Section 2.8 and Exercise 5.1.5. In a stochastic version the angle of deflection α from the vertical obeys

\ddot\alpha = a\alpha + bu + \varepsilon,

where ε is white noise of power N and the coefficient a is negative or positive according as one seeks to stabilise to the hanging or the inverted position. The cost function is ½(r_1\alpha^2 + r_2\dot\alpha^2 + Qu^2), with r_1 and Q strictly positive. The analysis of Section 2.8 applies with J = b^2 Q^{-1} + θN. It follows from that analysis that there is a finite equilibrium solution Π of the Riccati equation if and only if J > 0, and that this solution is then stable. The breakdown value is thus θ̄ = −b^2/NQ, whatever the sign of a. The greater stability of the hanging position compared with the inverted position is reflected in the relative magnitudes of Π in the two cases, but the hanging position is still only neutrally stable, rather than truly stable.

Finally, a second-order example which has no steady state is provided by the inertial missile example of Exercise 2.8.4. The solution for the optimal control obtained there now becomes
u = -\frac{Ds(x_1 + x_2 s)}{Q + D(1 + \theta NQ)s^3/3}
where s is time to go. The critical breakdown value is

\bar\theta = -\frac{1}{NQ} - \frac{3}{NDs^3},

and so increases with s. The longer the time remaining, the more possibility there is for mischance, according to a pessimist.

10 AVERAGE COST
The normalised value function F(x, t) defined in the state-structured case by

e^{-\theta F(x,t)} = \operatorname{ext}_u E\bigl[e^{-\theta C_t} \mid x_t = x,\ u_t = u\bigr]    (42)

should be distinguished from the future stress defined in Section 2, which is in fact the x-dependent part of the value function. The neglected term, independent of x, is irrelevant for determination of the optimal control, but has interest in view of its interpretation as the cost due to noise in the risk-sensitive case. Let us make the distinction by provisionally denoting the value function by F_v(x, t).
Theorem 16.10.1 Consider the LEQG regulation problem in discrete time with perfect state observation and no deterministic disturbance (i.e. d = 0). Then the normalised value function F_v has the evaluation

F_v(x, t) = F(x, t) + \delta_t    (43)

where F(x, t) is the future stress, evaluated in Theorems 16.4.1 and 16.5.1, and

\delta_t = \delta_{t+1} + \frac{1}{2\theta}\log\bigl|I + \theta N\Pi_{t+1}\bigr|.    (44)

The proof follows by the usual inductive argument, applied now to the explicit form

\exp[-\theta F_v(x, t)] = \operatorname{ext}_u\,(2\pi)^{-n/2}|N|^{-1/2}\int \exp\bigl[-\tfrac12\,\varepsilon^T N^{-1}\varepsilon - \theta c(x, u) - \theta F_v(Ax + Bu + \varepsilon,\ t+1)\bigr]\,d\varepsilon

of recursion (42). The evaluation (44) of the increment in cost due to noise indeed reduces to the corresponding risk-neutral evaluation (10.5) in the limit of zero θ. It provides the evaluation

\gamma(\theta) = \frac{1}{2\theta}\log\bigl|I + \theta N\Pi\bigr|    (45)

of the average cost in the steady state, where Π is now the equilibrium solution of the Riccati equation (24) (with A and R replaced by A − BQ^{-1}S and R − S^T Q^{-1} S if S has not been normalised to zero). More generally, it provides an evaluation of the average cost for a policy u(t) = Kx(t) which is not necessarily optimal (but stabilising, in an appropriate sense) if Π is now the solution of
\Pi = R + K^T S + S^T K + K^T Q K + (A + BK)^T\bigl(\Pi^{-1} + \theta N\bigr)^{-1}(A + BK).    (46)
For more general models the average cost is best evaluated by the methods of Section 13.5. Recall that we there deduced the expression

\gamma(\theta) = \frac{1}{4\pi\theta}\int \log\bigl|I + \theta\,\mathfrak{R}\,f(\omega)\bigr|\,d\omega    (47)
for the average cost under a policy which is linear, stationary and stabilising, but otherwise arbitrary. Here f(ω) is the spectral density function of the deviation vector Δ under the policy and \mathfrak{R} the associated penalty matrix. This expression is valid for a general linear time-invariant model; there is no assumption of state structure or of perfect process observation. On the other hand, the task remains of determining f(ω) in terms of the policy and determining the class of f(ω) which can be generated as this policy is varied.

It is by no means evident that the evaluation (47) reduces to that determined by (45) and (46) in the state-structured regulation case with the policy (supposed stabilising) u = Kx. The reconciliation in the risk-neutral case is straightforward. That for the risk-sensitive case is less so and, as one might perhaps conjecture from the form of (47), follows by appeal to a canonical factorisation: see Exercise 20.2.1.

A view of system performance which has become popular over recent years is to consider the deviation vector Δ as system output, and the various disturbances to which the system may be subject (e.g. plant and observation noise) as system input. A given control rule is then characterised by the frequency response function G(iω) from input to output which it induces. One wishes to choose the rule to make G small in some norm, and there are many norms which could be chosen. So, suppose that the noise inputs can be regarded as a collective system input ζ which is white with covariance (power) matrix \mathfrak{N}, say. (A prenormalising filter which would achieve this could be incorporated in the total filter.) Expression (47) then becomes

\gamma(\theta) = \frac{1}{4\pi\theta}\int \log\bigl|I + \theta\,\mathfrak{R}\,G(i\omega)\,\mathfrak{N}\,G(-i\omega)^T\bigr|\,d\omega = \frac{1}{4\pi\theta}\int \log\bigl|I + \theta\,\mathfrak{R}\,G\,\mathfrak{N}\,\bar G^T\bigr|\,d\omega    (48)
and the problem then becomes one of minimising this expression with respect to G. Expression (48) is indeed a norm on G, generated by the notion of LEQG optimisation (in the stationary regime). To phrase the optimisation problem in this way helps in discussion of certain issues, as we shall see in the next chapter. However, it also glosses over an essential issue: how does one cope with the fact that the class of response functions G generated as policy is varied is quite restricted? It is a virtue of the dynamic programming approach that it provides an automatic recognition of the constraints implied by specification of the system; some equivalent insight must be provided in a direct optimisation.
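In the scalar case with the penalty and power matrices normalised to unity, the integral (48) is easy to evaluate numerically. The response below is a hypothetical discrete-time example of our own choosing (so the integral runs over (−π, π)), not one from the text.

```python
import math

# Normalised scalar version of (48) for the discrete-time response
# G(e^{-iw}) = 1/(1 - a e^{-iw}), for which |G|^2 = 1/(1 - 2a cos w + a^2):
#   gamma(theta) = (1/(4*pi*theta)) * Integral log(1 + theta |G|^2) dw.
# As theta -> 0 this tends to the mean-square norm (1/(4*pi)) Int |G|^2 dw,
# which for this G equals 1/(2*(1 - a^2)).
a = 0.6

def gsq(w):
    return 1.0 / (1.0 - 2.0 * a * math.cos(w) + a * a)

def entropy_criterion(theta, n=20000):
    h = 2.0 * math.pi / n
    total = sum(math.log(1.0 + theta * gsq(-math.pi + i * h)) for i in range(n))
    return total * h / (4.0 * math.pi * theta)

mean_square = 1.0 / (2.0 * (1.0 - a * a))
assert abs(entropy_criterion(1e-6) - mean_square) < 1e-4
assert entropy_criterion(0.5) < mean_square < entropy_criterion(-0.1)
print("risk-neutral mean-square value:", round(mean_square, 5))
```

The criterion is well defined only while 1 + θ|G|² remains positive over the whole frequency range, the analogue of the breakdown conditions met earlier.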
For reasons which will emerge in the next chapter, one sometimes normalises the matrices \mathfrak{R} and \mathfrak{N} to identity matrices by regarding G as the transfer function from a normalised input to a normalised output (so absorbing \mathfrak{R} and \mathfrak{N} into the definition of G). In such a case (48) reduces to

\gamma(\theta) = \frac{1}{4\pi\theta}\int_{-\pi}^{\pi} \log\bigl|I + \theta\,G\bar G^T\bigr|\,d\omega.    (49)
The criterion function (48) or (49) is sometimes termed an entropy criterion, in view of its integrated logarithmic character. However, we should see it for what it is: an average cost under LEQG assumptions. In the risk-neutral case (49) reduces simply to the mean-square norm (4\pi)^{-1}\int \operatorname{tr}(G\bar G^T)\,d\omega, also proportional to a mean-square norm for the transient response.

11 TIME-INTEGRAL METHODS FOR THE LEQG MODEL

The time-integral methods of Sections 6.3 and 6.5 are equally powerful in the risk-sensitive case, and equally well cut through detailed calculations to reveal the essential structure. Furthermore, the modification introduced by risk-sensitivity is most interesting. We shall consider this approach more generally in Chapters 21 and 23, and so shall for the moment just consider the perfect-observation version of the standard state-structured regulation model (12.1)-(12.4). The stress has the form

\mathbb{S} = \sum_\tau \bigl[c + \tfrac12\,\varepsilon^T(\theta N)^{-1}\varepsilon\bigr]_\tau + \text{terminal cost}.
This is to be extremised with respect to u and ε subject to the constraint of the plant equation (12.1). If we take account of the plant equation at time τ by a Lagrange multiplier λ_τ and extremise out ε then we are left with an expression

\sum_\tau \bigl[c(x_\tau, u_\tau) + \lambda_\tau^T(x_\tau - Ax_{\tau-1} - Bu_{\tau-1}) - \tfrac12\theta\,\lambda_\tau^T N \lambda_\tau\bigr] + \text{terminal cost}    (50)

to be extremised with respect to x, u and λ. This is the analogue of the Lagrangian expression (6.19) for the deterministic case which we found so useful, with stochastic effects taken care of by the quadratic term in λ. If we have reached time t then stationarity conditions will apply only over the time-interval τ ≥ t. The stationarity condition with respect to ε_τ implies the relation

\theta N \lambda_\tau^{(t)} = \varepsilon_\tau^{(t)} \qquad (\tau \ge t)

between the multiplier and the estimate of process noise. In the risk-neutral case θ = 0 this latter estimate will be zero, which is indeed the message of Section 10.1. Stationarity of the time-integral (50) with respect to the remaining variables at time τ leads to the simple generalisation
\begin{bmatrix} R & S^T & I - A^T\mathcal{T}^{-1} \\ S & Q & -B^T\mathcal{T}^{-1} \\ I - A\mathcal{T} & -B\mathcal{T} & -\theta N \end{bmatrix}
\begin{bmatrix} x \\ u \\ \lambda \end{bmatrix}(\tau) = 0 \qquad (\tau \ge t)    (51)
of equation (6.20) (in the special regulation case, when deterministic disturbances and command signals are set equal to zero).

The matrix M is nonnegative definite; writing M = \sum_i \lambda_i g_i g_i^T, where the eigenvalues \lambda_i are necessarily nonnegative, we have

\frac{\operatorname{tr}(\bar G^T M G)}{\operatorname{tr}(M)} = \frac{\sum_i \lambda_i |G^T g_i|^2}{\sum_i \lambda_i},

which plainly has the same sharp upper bound as does the second expression. □

One might express the first characterisation verbally as: \|G\|_\infty^2 is the maximal 'energy' amplification that G can achieve when applied to a vector. Consider now
the dynamic case, when G(iω) is the frequency response function of a filter with action G(𝒟).

Theorem 17.2.2 In the case of a filter with transfer function G(s),

||G||²_∞ = sup E|G(𝒟)δ|² / E|δ|²

where the supremum is over the class of stationary vector processes {δ(t)} of appropriate dimension for which 0 < E|δ|² < ∞.

Proof Suppose that the δ-process has spectral distribution matrix F(ω). Then we can write
E|G(𝒟)δ|² / E|δ|² = ∫tr(G dF Ḡ) / ∫tr(dF).

But the increment dF = dF(ω) is a non-negative definite matrix. We thus see from the second characterisation of the previous theorem that the sharp upper bound to this last expression (under variations of F) is the supremum over ω of the maximal eigenvalue of GḠ, which is just ||G||²_∞. The bound is attained when δ(t) is a pure sinusoid of appropriate frequency and vector amplitude. □

We thus have the dynamic extension of the previous theorem: ||G||²_∞ can be characterised as the maximal 'power' amplification that the filter G can achieve for a stationary input. Finally, there is a deterministic version of this last theorem which we shall not prove. We quote only the continuous-time case, then give a finite-dimensional proof in Exercise 1 and a counterexample in Exercise 2. Suppose that the response function G(s) belongs to the Hardy class, in that it is causal as well as stable. This implies that G(s) is analytic in the closed right half of the complex plane and (not trivially) that the maximal eigenvalue of G(s)Ḡ(s) (over eigenvalues and over complex s with non-negative real part) is attained for a value s = iω on the imaginary axis. One can assert:
Theorem 17.2.3 If the filter G belongs to the Hardy class then

||G||²_∞ = sup ∫|G(𝒟)δ|² dt / ∫|δ|² dt

where the supremum is over non-zero vector signals δ(t) of appropriate dimension.

This differs from the situation of Theorem 17.2.2 in that δ is a deterministic signal of finite energy rather than a stationary stochastic signal of finite power. An integration over time replaces the statistical expectation. The assertion is that
||G||²_∞ is the maximal amplification of 'total energy' that G can achieve when applied to a signal.

Since we wish G to be 'small' in the control context, it is apparent from the above that adoption of the H∞ criterion amounts to design for protection against the 'worst case'. This is consistent with the fact that LEQG design is increasingly pessimistic as θ decreases, and reaches blackest pessimism at the breakdown value of θ. The contrast is with the H₂ or risk-neutral case, θ = 0, when one designs with the average case in mind. However, the importance of the H∞ criterion over the past few years has derived, not from its minimax character as such, but rather from its suitability for the analysis of robustness of design. This suitability stems from the property, easily established from the characterisations of the last two theorems, that

||G₁G₂||_∞ ≤ ||G₁||_∞ ||G₂||_∞.   (6)

Exercises and comments
(1) The output from a discrete-time filter with input δ_t and transient response g_t is y_t = Σ_r g_r δ_{t−r}. We wish to find a sequence {δ_t} whose 'energy' is amplified maximally by the filter, in that (Σ_t |y_t|²)/(Σ_t |δ_t|²) is maximal. Consider the SISO case, and suppose all quantities time-periodic in that δ_t has period m and all the sums over time are restricted to m consecutive values. Show (by appeal to Theorem 17.2.1 and Exercise 13.2.1, if desired) that the energy ratio is maximised for δ_t = e^{iωt} with ω some multiple of 2π/m, and that the value of the ratio is then |G(iω)|² = |Σ_r g_r e^{−iωr}|².

(2) Consider the realisable SISO continuous-time filter with transient response e^{αt} and so transfer function G(s) = (s − α)⁻¹. Then ||G||²_∞ is just the maximal value of |G(iω)|², which is α⁻². Consider a signal δ(t) = e^{−βt} for t ≥ 0, zero otherwise, where β is positive. Show that the ratio of integrated squared output to integrated squared input is

2β(α + β)⁻² ∫₀^∞ (e^{2αt} − 2e^{(α−β)t} + e^{−2βt}) dt.

If α < 0 (so that the filter is stable) then this reduces to (α² − αβ)⁻¹, which is indeed less than α⁻². If α > 0 (so that the filter is not stable) then the expression is infinite.
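The reduction asserted in Exercise 2 can be checked numerically. The sketch below is a minimal illustration, assuming the made-up values α = −1, β = 2 (not taken from the text); it integrates the output energy of the filter directly:

```python
import numpy as np

# Filter with transient response e^{alpha t}; input delta(t) = e^{-beta t} for t >= 0.
# Made-up illustrative values: alpha = -1 (stable filter), beta = 2.
alpha, beta = -1.0, 2.0
t = np.linspace(0.0, 20.0, 200001)

y = (np.exp(alpha * t) - np.exp(-beta * t)) / (alpha + beta)   # convolution output
delta = np.exp(-beta * t)                                      # input signal

ratio = np.trapz(y**2, t) / np.trapz(delta**2, t)              # energy amplification
closed_form = 1.0 / (alpha**2 - alpha * beta)                  # the stated reduction

assert abs(ratio - closed_form) < 1e-4
assert ratio < 1.0 / alpha**2          # strictly below ||G||^2_inf = alpha^{-2}
```

With α > 0 the output-energy integral diverges, in line with the final assertion of the exercise.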
3 THE H∞ CRITERION AND ROBUSTNESS

Suppose that a control rule is designed for a particular plant, and so presumably behaves well for that plant (in that, primarily, it stabilises adherence to set points or command signals). The rule is robust if it continues to behave well even if the actual plant deviates somewhat from that specified. The concept of robustness
Figure 1 A block diagram of the system of equation (7), with plant G, controller K, control u and command signal w. If a pole of G(s) corresponding to a plant instability is cancelled by a zero of K(s), then the controlled system will not be completely stable.
thus takes account of the fact that the plant structure may never be known exactly, and may indeed vary in time. In control theory, as in statistics and other subjects, the conviction has grown that a concern for optimality must be complemented by a concern for robustness. Furthermore, both qualities must be quantified if one is to reach the right compromise between goals which are necessarily somewhat conflicting. For an example, consider the system of Figure 1, designed to make plant output v follow a command signal w. Here G and K are the transfer functions of plant and controller, and actual plant output v is distinguished from observed plant output y = v + η, where η is observation noise. The block diagram is equivalent to the relations
u = K(w − v − η),   v = Gu,   (7)

whence we deduce the expression

v − w = −(I + GK)⁻¹(w + GKη)   (8)

for the tracking error v − w in terms of the inputs to the system. For stable operation we require that the operator I + GK should have a stable causal inverse. Let us assume that stability holds; the so-called small gain theorem asserts that a sufficient condition ensuring this is that ||GK||_∞ < 1. We see from (8) that the response functions of tracking error to command signal and (negative) observation noise are respectively −S₁ and S₂, where

S₁ = (I + GK)⁻¹,   S₂ = (I + GK)⁻¹GK.

One would like both of these to be small, in some appropriate norm. They cannot both be small, however, because S₁ + S₂ = I. S₁ is known as the sensitivity of the
system; its norm measures performance in that it measures the relative tracking error. S₂ is known as the complementary sensitivity, and actually provides a measure of robustness (or, rather, of lack of robustness) in that it measures the sensitivity of performance to a change in plant specification. This is plausible, in that noise-corruption of plant output is a kind of plant perturbation. However, for an explicit demonstration, suppose the plant operator perturbed from G to G + δG. We see from the correspondingly perturbed version of equations (7) that
v = (I + GK)⁻¹GK(w − η) + (I + GK)⁻¹(δG)K(w − v − η).

The perturbed system will remain stable if the operator I + (I + GK)⁻¹(δG)K acting on v has a stable causal inverse. It follows from another application of the small gain theorem that this continued stability will hold if
||G⁻¹δG||_∞ < ||S₂||_∞⁻¹,

i.e. if the relative perturbation in plant is less than the reciprocal of the complementary sensitivity, in the H∞ norm. Actually, one should take account of the expected scale and dynamics of the inputs to the system. This is achieved by setting w = W₁w̃ and η = W₂η̃, say, where W₁ and W₂ are prescribed filters. In the statistical LEQG approach one would regard w̃ and η̃ as standard vector white noise variables. In the worst-case deterministic approach one would generate the class of typical inputs by letting w̃ and η̃ vary in the class of signals of unit total energy. Performance and robustness are then measured by the smallness of ||S₁W₁||_∞ and ||S₂W₂||_∞ respectively. Specifically, the upper bound on ||G⁻¹δG||_∞ which ensures continued stability is ||S₂W₂||_∞⁻¹.
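These quantities are easy to exhibit numerically. The sketch below is a minimal scalar illustration with a made-up plant G(s) = 1/(s + 1) and constant controller K = 4 (neither taken from the text); it evaluates S₁ and S₂ on a frequency grid, confirms S₁ + S₂ = I, and estimates the robustness bound ||S₂||_∞⁻¹:

```python
import numpy as np

# Made-up scalar example: plant G(s) = 1/(s+1), constant controller K(s) = k.
k = 4.0
w = np.linspace(0.0, 100.0, 200001)        # frequency grid for s = i*omega
G = 1.0 / (1j * w + 1.0)

S1 = 1.0 / (1.0 + G * k)                   # sensitivity
S2 = G * k / (1.0 + G * k)                 # complementary sensitivity

assert np.allclose(S1 + S2, 1.0)           # S1 + S2 = I

# H-infinity norms approximated as suprema over the grid; for this plant the
# supremum of |S2| is attained at omega = 0, with value k/(1 + k).
norm_S2 = np.abs(S2).max()
print("||S2||_inf ~", norm_S2)
print("stability retained for ||dG/G||_inf below ~", 1.0 / norm_S2)
```

Raising k improves tracking (smaller |S₁| at low frequency) but pushes ||S₂||_∞ towards 1, shrinking the tolerated relative plant perturbation, which is exactly the performance/robustness compromise discussed above.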
Of course, in a simple minimisation of quadratic costs a compromise will be struck between 'minimisation' of the two operators S₁ and S₂ in some norm. There will be a quadratic term in the tracking error v − w in the cost function, and this will lead to an expression in H₂ norms of S₁ and S₂. The more observation noise there is, the greater the consideration given to decreasing S₂, so assuring increased robustness.

The advantages of an H∞ formulation were demonstrated first in Zames (1981), who began an analysis which has since been brought to a high technical level by Doyle, Francis and Glover, among others; see Francis (1987) and Doyle, Francis and Tannenbaum (1992). The standard formulation of the problem is the system formulation of equations (6.44)/(6.45), expressed in the block diagram of Figure 4.4. The design problem is phrased as the choice of K to minimise the response of Δ to ζ in the H∞ norm, subject to the condition that 'K should stabilise the system', i.e. that all system outputs should be stabilised against all system inputs. The analysis of this formulation has generated a large specialist literature. We shall simply list a number of observations.
(i) The relation of the H∞ criterion to the LEQG criterion means that many problems already have a solution in the now well-established LEQG extension of classical LQG ideas. The need to take account of observation noise ensures a degree of robustness in rules derived on either LQG or LEQG criteria, although this is an insight which might not have been apparent before Zames.

(ii) The two problems formulated in this section are stated in input-output form, in that the plant is specified simply by its response function G rather than by a state-structured model, for example. There have been a number of attempts over the years to attack such problems directly, usually in the LQG framework, by, for example, simply seeking the filter K in (7) which minimises a quadratic criterion such as E(Δᵀ𝒬Δ). (See Newton, Gould and Kaiser (1957), Holt et al. (1960), Whittle (1963), Youla, Bongiorno and Jabr (1976).) One can make a certain amount of progress, but, as we noted in Section 6.7, a head-on approach can prove perplexing and tends to miss two insights which are revealed almost automatically in a state-space analysis of the non-stationary problem. These are (i) certainty-equivalence, with its separation of the aspects of estimation and control, and (ii) the usefulness of the introduction of conjugate variables; Lagrange multipliers associated with the constraints constituted by the prescribed dynamic relations. Exploitation of these properties simplifies the analysis radically, especially in the vector case. We consider the determination of optimal stationary controls without reduction to state structure in Chapters 18-21, but using time-integral methods which incorporate these insights by their nature and yield precisely what is needed.

(iii) The operator factorisation techniques associated with these time-integral methods are distinct from both the Wiener-Hopf methods used in a direct attack on the input-output model and the matrix J-factorisation techniques expounded by e.g.
Vidyasagar (1985) and Mustafa and Glover (1990) for solution of the H∞ problem.

(iv) Simple minimisation of Δ in some norm will in fact assure overall stability of the system (provided this is physically achievable) if Δ includes measures of relevant signals in the system. For example, the inclusion of u itself in the criteria used through the whole of this text avoids the possibility that good stability or tracking could be achieved at the expense of infinite control forces. Further, and as we have seen, such an optimisation will automatically achieve some degree of robustness if observation noise is assumed in the model. Finally, one might say that the ultimate guarantee of robustness is use of an adaptive control rule; an observation which opens wider vistas than we can explore.
PART 4
Time-integral Methods and Optimal Stationary Policies

This Part is devoted to the direct determination of the optimal stationary policy for a model which is supposed LQG or LEQG and time-homogeneous, but not necessarily state-structured. The theme is an interesting one in that the sub-themes of time-integrals, certainty equivalence, the maximum principle, transform methods, canonical factorisation and policy improvement coalesce naturally. The methods are a development of those already applied to the deterministic state-structured case in Section 6.3 and generalised in Section 6.5. The reader content to consider only the state-structured case could omit this Part without prejudice to what follows (although passing up a chance for enlightenment, in the author's view!).

The problem first to be considered is that of deriving a stationary optimal control policy for a time-homogeneous LQG model, in general neither state-structured nor perfectly observable. Our methods intrinsically assume the existence of infinite-horizon limits, so that the optimal policy is derived as the infinite-horizon limit (in fact stationary) of an optimal finite-horizon policy. Such a policy will certainly have the property of being average-optimal: i.e. of minimising the average cost incurred per unit time in the stationary regime. It will in fact also have the stronger property of optimising reaction to transients, which an average-optimal policy may not have unless plant noise is such that it can stimulate all modes of the system. We have an expression for the average cost in formula (13.24), and policy is specified by the time-invariant realisable linear operator 𝒦 which one uses to express the control u in terms of current observables. One might regard the optimisation problem simply as a matter of direct minimisation of this
expression with respect to 𝒦. This was the approach taken by a number of authors in the early days of control optimisation (see Newton, Gould and Kaiser (1957), Holt et al. (1960), Whittle (1963)), when it seemed a natural development of the techniques employed by Wiener for optimal prediction. However, while the approach yields results rather easily in the case of scalar variables, in the vector case one is led to equations which seem neither tractable nor transparent. In fact, by attacking the problem in this bull-like fashion one is forgoing all the insights of certainty equivalence, the Kalman filter etc. One could argue that, if these insights had not already been gained, they should be revealed in any natural approach. If so, then this is not one.

In fact, the seeming simplification of narrowing the problem down to one of average-optimisation blinds one to an even more direct approach. This is an approach which is familiar in the deterministic case and which turns out to be available even in the stochastic case: the extremisation of a time-integral. We use this term in a rather specific technical sense; by a time-integral we mean a sum or integral over time of a function of current variables of the model in which expectations are absent and which is such that the optimal values of decisions and estimates can be obtained by a free and unconstrained extremisation of the integral. In earlier publications on this topic the author has referred to these as 'path-integrals', but this is a usage inconsistent with the quantum-theoretic use of the term. Strictly speaking, a path-integral is an integral over paths (i.e. an expectation over the many paths which are possible) whereas a time-integral is an integral along a path. The fact which makes substantial progress possible is that a path-integral can often be expressed as an extremum over time-integrals. For example, we saw in Chapter 16 that the expectation (i.e.
the path-integral) E[exp(−θC)] could be expressed as an extremum of the stress 𝔖 = C + θ⁻¹𝔇. If one clears matrix inverses from the stress by Legendre transformations (i.e. by introducing Lagrange multipliers to take account of the constraints of plant and observation equations) then one has the expectation exactly in the form of the extremum of a time-integral. It is this reduction which we have exploited in Chapters 6 and 16, shall exploit for a general class of LQG and LEQG models in this Part, and shall extend (under scaling assumptions) to the non-LQG case in Part 5.

We saw in Section 6.3 that the state-structured LQ problem could be converted to a time-integral formulation by the introduction of Lagrange multipliers, and that the powerful technique of canonical factorisation then determined the optimal stationary policy almost immediately. We saw in Sections 6.5 and 6.6 that these techniques extended directly to models which were not state-structured. These solutions extend to the stochastic and imperfectly observed case by the simple substitution of estimates for unobservables, justified by the certainty-equivalence principle. We shall see in Chapter 20 that time-integral techniques also take care of the estimation problem (the familiar control/estimation duality
finding perfect expression) and, in Chapter 21, that these methods extend to the LEQG model. All these results are exact, but the extension to the non-LQG case of Part 5 is approximate in the same sense as is the approximation of the path-integrals of quantum mechanics by the time-integrals (action integrals) of classical mechanics, with refinement to a higher-order approximation at sensitive parts of the trajectory.

There is a case, then, for developing the general time-integral formalism first, which we do in Chapter 18. In this way one sees the general pattern, uncluttered by the special features which necessarily arise in the application to control and estimation.

There is one point which should be made. Although our models are no longer state-structured, they are not so general that they are given in input/output form. Rather, we assume plant and observation equations, as in the state-structured case, but allow variables in them to occur to any lag up to some value p. The loss of an explicit dynamic relationship which occurs when one reduces the model to input/output form has severe consequences, as we saw in Section 6.7.

This Part gives a somewhat streamlined version of the author's previous work on time-integral methods as set out in Whittle (1990a). However, the material of Sections 20.3-20.6 and 21.3 is, to the best of the author's knowledge, new. There must be points of contact in the literature, however, and the author would be grateful for notice of these.
CHAPTER 18
The Time-integral Formalism

1 QUADRATIC INTEGRALS IN DISCRETE TIME

For uniformity we shall continue to use the term 'time-integral' even in discrete time, despite the fact that the 'integral' is then a sum. Consider the integral in the variable ξ

𝔇 = Σ_τ [½ Σ_j ξ_τᵀ G_j ξ_{τ−j} − ζ_τᵀ ξ_τ] + end terms   (1)

with prescribed coefficients G and ζ. This is then a sum over time of a quadratic function of the vector sequence {ξ_τ}, the function being time-invariant in its second-degree part. So, for the models of Sections 6.3 and 6.5 ξ is the vector with subvectors x, u and λ, and the time-varying term ζ_τ would arise from known disturbances d or known command signals x^c, u^c. The matrix coefficients G_j and the vector coefficients ζ_τ are specified, but no generality is lost by imposition of the normalisation

G_{−j} = G_jᵀ.   (2)

If the sum in (1) runs over h₁ < τ < h₂, then the 'end terms' arise from contributions at times τ ≤ h₁ and τ ≥ h₂ respectively. The final term can be regarded as arising from a terminal cost and the initial term as arising from a probabilistic specification of initial conditions. Suppose the optimisation problem is such that we wish to extremise the integral with respect to the course of the variable {ξ_τ}. We cannot at the moment be more specific than 'extremise', as we see by considering again the models of Chapter 6, for which one minimised with respect to x and u and maximised with respect to λ. In any case, the stationarity condition with respect to ξ_τ is easily seen to be
Φ(𝒯)ξ_τ = ζ_τ   (3)

if τ is sufficiently remote from both h₁ and h₂ that neither end term involves ξ_τ, where

Φ(𝒯) := Σ_j G_j 𝒯^j.   (4)
The normalisation (2) has the implication

Φ̄ = Φ   (5)

indicating a kind of past/future symmetry in relations (3). Here, as ever, the conjugation operation has the effect Φ̄(z) = Φ(z⁻¹)ᵀ on z-transforms. Suppose that we regard our optimisation as a 'forward' optimisation in that, if t is the current instant of time, then ξ_τ is already determined for τ < t and we can optimise only for τ ≥ t. We shall then write equation (3) rather as

Φ(𝒯)ξ_τ^{(t)} = ζ_τ   (τ ≥ t)   (6)
to emphasise this dependence on t. That is, relation (6) determines the optimal course of the process from time t onwards for a prescribed past (at time t). This was exactly the situation which prevailed for the control optimisation of Sections 6.3 and 6.5. If the horizon h₂ becomes infinite then we may indeed demand that (6) should hold for all τ ≥ t and, if the sequence {ζ_τ} and the terminal term are sufficiently well-behaved, expect that the semi-inversion of the system (6) analogous to equation (6.27) should be valid. For the estimation problems of Chapter 20 we shall see that the same ideas apply, but with the optimisation applied to the past rather than the future (cf. Section 12.9). To consider the general formulation (1) is then already a simplification, in that we see that these two cases can be unified, and that the operator Φ(𝒯) is immediately related to the matrix coefficients G_j of the path integral. Indeed, partial summation brings this integral to the neater form

𝔇 = Σ_τ ξ_τᵀ[½Φ(𝒯)ξ_τ − ζ_τ] + end terms.
If Φ(z) possesses a canonical factorisation then, in virtue of (5), we may suppose it symmetric:

Φ(z) = φ̄(z)Φ₀φ(z)   (7)
where both φ(z) and φ(z)⁻¹ have expansions in non-negative powers of z. Granted such a factorisation, the infinite-horizon system (6) has the semi-inversion
φ(𝒯)ξ_τ^{(t)} = Φ₀⁻¹φ̄(𝒯)⁻¹ζ_τ   (τ ≥ t)   (8)
(with the understandings that 𝒯 operates on the subscript τ and that φ̄(𝒯)⁻¹ is to be expanded in non-positive powers of 𝒯) if the right-hand member of (8) is defined. This latter condition will be satisfied if ζ_τ decays as ρ^τ for increasing τ, where ρ is less than the radius of convergence of φ(z)⁻¹. The passage from (6) to (8) converts the recursion (6), symmetric in past and future in the sense (5), to a stable forward recursion, expressing ξ_τ^{(t)} for τ ≥ t in terms of past values ξ_σ^{(t)} (σ < τ); the right-hand side of (8) constitutes a known driving term for this recursion. One could complete the inversion to obtain a complete determination, but to do so is positively undesirable. Relation (8) determines ξ_τ^{(t)} explicitly (in terms of the ξ_τ known for t − p ≤ τ < t) and, as we saw for the control application of Chapter 6, this determination yields the optimal control rule in the form which is both closed-loop and natural for real-time realisation.

Of course, if (8) is to hold in the infinite-horizon limit h₂ → ∞ for system (6) then certain conditions must be satisfied. The analogue of the controllability/sensitivity conditions of Theorem 6.1.3, which were the basic ones for the existence of infinite-horizon limits, is simply that a canonical factorisation (7) should exist. Regularity conditions will also be required of the terminal contributions implied in the specification of the path integral (1). If such conditions are granted, then the infinite-horizon solution (8) is formally immediate. However, to make it workable we must find a workable way of deriving the canonical factorisation (7). We shall achieve this by returning to the recursive ideas of dynamic programming and relating the canonical factor φ to the value function. This value function need never be determined; it is enough that the dynamic programming technique of policy improvement implies a rapidly convergent algorithm for determination of the canonical factorisation. Let us begin with consideration of the 'Markov' case, for which dynamics are of order one and recursions are simple.
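For a scalar first-order illustration, the factorisation (7) and the semi-inversion (8) can be checked numerically. The sketch below uses made-up coefficients g₀ = 3, g₁ = 1 (not from the text), so that Φ(z) = g₀ + g₁(z + z⁻¹); anticipating the Riccati relation of the next section, the canonical factor is φ(z) = π + g₁z with π the root of π = g₀ − g₁²/π for which |g₁/π| < 1:

```python
import numpy as np

# Made-up scalar Markov example: Phi(z) = g0 + g1*(z + 1/z).
g0, g1 = 3.0, 1.0

# Canonical factor phi(z) = pi + g1*z, pi the root of pi = g0 - g1**2/pi
# with |g1/pi| < 1, so that (8) is a stable forward recursion.
pi_ = (g0 + np.sqrt(g0**2 - 4 * g1**2)) / 2
assert abs(g1 / pi_) < 1

# Factorisation (7): Phi(z) = (pi + g1/z) * pi^{-1} * (pi + g1*z) on |z| = 1.
for z in np.exp(1j * np.linspace(0.0, 2.0 * np.pi, 9)):
    assert abs(g0 + g1 * (z + 1 / z) - (pi_ + g1 / z) * (pi_ + g1 * z) / pi_) < 1e-12

# Semi-inversion (8): solve Phi(T) xi = zeta on a window, comparing a direct
# tridiagonal solve with the stable forward recursion.
N = 400
zeta = 0.5 ** np.arange(N)                    # decaying driving term
A = (np.diag(np.full(N, g0))
     + np.diag(np.full(N - 1, g1), -1) + np.diag(np.full(N - 1, g1), 1))
xi_direct = np.linalg.solve(A, zeta)

# Driving term Phi0^{-1} phibar(T)^{-1} zeta, with phibar^{-1} expanded
# in non-positive powers of T (i.e. in future values of zeta).
eta = np.array([np.sum((-g1 / pi_) ** np.arange(N - t) * zeta[t:]) for t in range(N)])
xi = np.zeros(N)
for t in range(N):
    xi[t] = (eta[t] - g1 * (xi[t - 1] if t > 0 else 0.0)) / pi_

assert np.allclose(xi[: N // 2], xi_direct[: N // 2], atol=1e-8)
```

The forward recursion reproduces the direct solve away from the truncation boundary, illustrating why (8) is the natural closed-loop, real-time form of the solution.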
Exercises and comments
(1) A necessary condition that Φ(z) should be invertible in stable form is that it should be non-singular for z on the unit circle, which implies that all three factors in (7) should be non-singular there. A sufficient condition for canonical factorisability is that Φ(z) should be strictly positive definite on the unit circle. This latter condition will in general not hold in our analysis, however, because the quadratic form constituted by 𝔇 has in general a mixed convex/concave character.

2 FACTORISATION AND RECURSION IN THE 'MARKOV' CASE

If G_j = 0 for |j| > p, so that interactions (cross-terms) appear in the time-integral only at time lags up to p, we shall say that we are dealing with pth-order
dynamics. We shall then refer to the case p = 1 as the Markov or state-structured case. In the Markov case the time-integral can be written

𝔇 = Σ_τ c_τ + end terms,

where

c_τ = ½ξ_τᵀG₀ξ_τ + ξ_τᵀG₁ξ_{τ−1} − ζ_τᵀξ_τ.   (9)

Consider now a recursive extremisation of the integral, with c_τ regarded as an 'instantaneous cost', and with the closing cost at the horizon point h₂ assumed to be a quadratic function of ξ_{h₂} alone. If we are interested in a forward optimisation then we can define a 'value function' F(ξ_t, t), this being the extremised 'cost' from time t onwards for given ξ_t. This will obey the dynamic programming equation
F(ξ_t, t) = ½ξ_tᵀG₀ξ_t − ζ_tᵀξ_t + stat_{ξ_{t+1}} [ξ_tᵀG_{−1}ξ_{t+1} + F(ξ_{t+1}, t + 1)]   (10)

where 'stat' signifies evaluation of the bracket at a value of ξ_{t+1} which renders it stationary. As a dynamic programming equation this has the simplifying feature that one optimises with respect to the whole vector argument ξ rather than just some part of it. In the case under consideration the value function will have the quadratic form
F(ξ, t) = ½ξᵀΠ_tξ − α_tᵀξ + δ_t   (11)

for t ≤ h₂ if it does so at h₂, with the coefficients (respectively matrix, vector and scalar) obeying the backward recursions
Π_t = G₀ − G_{−1}Π_{t+1}⁻¹G₁,
α_t = ζ_t − G_{−1}Π_{t+1}⁻¹α_{t+1},   (12)
δ_t = δ_{t+1} − ½α_{t+1}ᵀΠ_{t+1}⁻¹α_{t+1}.

The first of these is just the Riccati recursion; strikingly simpler in this general formulation than it was when written out for the state-structured case of the control problem in equations (2.25)/(2.26). The second relation likewise corresponds to the equation before (2.65). The third gives the recursion for the ξ-independent component of cost incurred which we thought too troublesome to derive in Section 2.9. The extremising value of ξ_{t+1} in (10) is determined by
Π_{t+1}ξ_{t+1} + G₁ξ_t = α_{t+1}.   (13)

Suppose that we are in the infinite-horizon limit h₂ → +∞, and that the problem is well-behaved in that Π_t has, for fixed t, a limit Π. It is under these conditions that one expects the forward recursion (13) for ξ_t to be stable in that its solution
should tend to zero with increasing t. In other words, that Π + G₁𝒯 is a stable operator, or that Γ = −Π⁻¹G₁ is a stability matrix. This leads one to the conjecture that the right canonical factor φ(z) of Φ(z) might be proportional to Π + G₁z. Indeed, we find from the equilibrium form of the Riccati equation for Π above that

Φ(z) = (Π + G_{−1}z⁻¹)Π⁻¹(Π + G₁z)   (14)

is a factorisation of Φ, and we see from (13) that this is indeed the canonical factorisation we seek if one exists. The condition for factorisability is then that the Riccati equation (the first equation of (12)) should have a solution Π for which −Π⁻¹G₁ is a stability matrix. Equation (14) thus relates the canonical factorisation to the recursive calculations familiar to us from dynamic programming. We should now like the version of this relation in higher-order cases.

3 VALUE FUNCTIONS AND RECURSIONS

In the last section we related the canonical factorisation of Φ(z) in the Markov case to the evaluation of the matrix Π of the value function. We aim now to extend that result to the case of pth-order dynamics. In doing so we may suppose that ζ ≡ 0, since Φ and its properties are determined entirely by the second-degree part of the path integral (1). In control language, we specialise to the case of regulation to zero. One can group the terms in 𝔇 to define an 'instantaneous cost' c_τ in various ways; we shall define it as
c_τ = ½ξ_{τ−p+1}ᵀG₀ξ_{τ−p+1} + Σ_{j=0}^{p−1} ξ_{τ−j}ᵀG_{p−j}ξ_{τ−p}.   (15)
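The claim that the grouping (15) merely reallocates the terms of the integral can be verified numerically. The sketch below uses made-up random coefficients (p = 2, n = 2, obeying the normalisation (2)) and a zero-padded path, and checks that Σ_τ c_τ equals the second-degree part of (1):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, T = 2, 2, 30

# Made-up coefficient matrices G_j, j = -p..p, obeying the normalisation (2).
G = {}
G0 = rng.standard_normal((n, n))
G[0] = G0 + G0.T
for j in range(1, p + 1):
    G[j] = rng.standard_normal((n, n))
    G[-j] = G[j].T

xi = np.zeros((T + 2 * p, n))             # path, zero outside the window
xi[p:T + p] = rng.standard_normal((T, n))

# Second-degree part of (1): (1/2) sum_tau sum_j xi_tau^T G_j xi_{tau-j}.
D1 = 0.5 * sum(xi[t] @ G[j] @ xi[t - j]
               for t in range(T + 2 * p)
               for j in range(-p, p + 1)
               if 0 <= t - j < T + 2 * p)

# Instantaneous cost (15), summed over tau.
def c(t):
    v = 0.5 * xi[t - p + 1] @ G[0] @ xi[t - p + 1]
    v += sum(xi[t - j] @ G[p - j] @ xi[t - p] for j in range(p))
    return v

D2 = sum(c(t) for t in range(p, T + 2 * p))
assert abs(D1 - D2) < 1e-9
```

Each cross-term G_k (k = 1, …, p) is collected exactly once by the grouping, with the two half-contributions of G_k and G_{−k} in (1) combining under the normalisation (2).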
If we define the value function

F(ξ_t, ξ_{t−1}, …, ξ_{t−p+1}; t) = stat_{ξ_τ: τ>t} [Σ_{τ>t} c_τ]   (16)
then this will obey the optimality (dynamic programming) equation

F(ξ_t, …, ξ_{t−p+1}; t) = stat_{ξ_{t+1}} [c_{t+1} + F(ξ_{t+1}, …, ξ_{t−p+2}; t + 1)].   (17)

We suppose that the terminal cost is purely quadratic with a limited dependence on the past. The assumption of pth-order dynamics then implies that F will indeed be a function of just the variables indicated in (16), and it will in fact be homogeneous quadratic (i.e. of degree two) in the ξ variables:
F(ξ_t, …, ξ_{t−p+1}; t) = ½ Σ_{j=0}^{p−1} Σ_{k=0}^{p−1} ξ_{t−j}ᵀ Π_{jk}(t) ξ_{t−k},   (18)

say. The 'optimal policy' will then be of a p-lag linear form in that the extremal criterion in (17) will determine ξ_{t+1} linearly in terms of the p previous values of ξ:

Σ_{k=0}^{p} α_k(t + 1) ξ_{t+1−k} = 0,   (19)
say, with α₀ necessarily symmetric. Our expectation is that the coefficients α_k(t + 1) will become independent of t in the infinite-horizon limit (if this exists) and that (19) will then be a stable forward recursion which, written φ(𝒯)ξ_{t+1} = 0, defines the canonical factor φ(z). The argument follows. If we substitute the form (18) into the optimality equation (17) we obtain a 'super-Riccati' equation for the p(p + 1)/2 matrices Π_{jk} which is most forbidding. The relations become much more compact in terms of the generating functions
F(w, z; t) = Σ_j Σ_k Π_{jk}(t) w^j z^k   (20)

Φ(w, z) = G₀ + Σ_{j=1}^{p} G_j z^j + Σ_{j=1}^{p} G_{−j} w^j   (21)

φ(z; t + 1) = F(0, z; t + 1) + G_p z^p   (22)

α(z; t + 1) = Σ_{k=0}^{p} α_k(t + 1) z^k   (23)
in the complex scalars z and w. We shall refer to F(w, z; t) as the value function transform. We shall now find it convenient to suppose that F is not necessarily the value function corresponding to an optimal policy, but that corresponding to an arbitrary p-lag linear policy of the form (19). It will then still have the homogeneous quadratic form (18).
Theorem 18.3.1 (i) The value function transform under the policy (19) obeys the backward recursion (dynamic programming equation)

F(w, z; t) = (wz)⁻¹[(wz)^p Φ(z⁻¹, w⁻¹) + F(w, z) − φ(w)ᵀα₀⁻¹α(z) − α(w)ᵀα₀⁻¹φ(z) + α(w)ᵀα₀⁻¹φ₀α₀⁻¹α(z)]_{t+1}   (24)

where the final subscript indicates that F, α and φ all carry the additional argument t + 1.
(ii) The optimal choice of α is

α(z; t + 1) = φ(z; t + 1) = F(0, z; t + 1) + G_p z^p.   (25)

With this choice (24) reduces to the optimality equation

F(w, z; t) = (wz)⁻¹[(wz)^p Φ(z⁻¹, w⁻¹) + F(w, z) − φ(w)ᵀφ₀⁻¹φ(z)]_{t+1}.   (26)
The proof is by heavy verification; the detail is given in Whittle (1990a), p. 164. One can see that the terms independent of α in the right-hand member of (24) originate from the presence of the cost term c_t in (18) and the translation of the t-subscripts of the ξ-arguments. The terms in α originate from the fact that ξ_{t+1} is expressed in terms of earlier ξ-values by (19). Although derivation may be heavy, we see that in (26) we have an elegant expression of the optimality equation, and in (25) an elegant expression of what will soon be identified as a canonical factor φ in terms of the value function transform F. Note the need to introduce the double generating function (21), related to Φ(z) by

Φ(z) = Φ(z⁻¹, z).   (27)
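In the scalar Markov case p = 1 this machinery can be made concrete (a sketch with made-up coefficients g₀ = 5, g₁ = 2, not from the text): F(w, z; t) reduces to the single coefficient π_t, the factor (25) becomes φ(z; t + 1) = π_{t+1} + g₁z, and the recursion (26) collapses to the Riccati recursion π_t = g₀ − g₁²/π_{t+1} of Section 2. Iterating it exhibits the convergence to the canonical factor:

```python
import cmath, math

# Made-up scalar coefficients: Phi(w, z) = g0 + g1*z + g1*w, so Phi(z) = g0 + g1*z + g1/z.
g0, g1 = 5.0, 2.0

# For p = 1 the transform recursion (26) collapses to the Riccati recursion
# pi_t = g0 - g1**2 / pi_{t+1}; iterate backwards from a terminal value.
pi_t = g0
for _ in range(60):
    pi_t = g0 - g1**2 / pi_t

assert abs(pi_t - (g0 - g1**2 / pi_t)) < 1e-12   # equilibrium Riccati equation

# phi(z) = pi + g1*z then yields the factorisation (31),
# Phi(z) = phi(1/z) * phi0^{-1} * phi(z) with phi0 = pi, checked on |z| = 1.
for k in range(8):
    z = cmath.exp(1j * math.pi * k / 4)
    Phi = g0 + g1 * z + g1 / z
    assert abs(Phi - (pi_t + g1 / z) * (pi_t + g1 * z) / pi_t) < 1e-9
```

For these values the limit is π = 4, and the iteration contracts geometrically towards it, which is the policy-improvement route to the canonical factorisation promised in Section 1.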
4 A CENTRAL RESULT: THE RICCATI/FACTORISATION RELATIONSHIP

If we assume the existence of infinite-horizon limits under an optimal policy then the assertions of Theorem 18.3.1 become even simpler.
Theorem 18.4.1 (i) In the infinite-horizon limit the optimal value function transform has the evaluation

F(w, z) = [φ(w)ᵀφ₀⁻¹φ(z) − (wz)^p Φ(z⁻¹, w⁻¹)] / (1 − wz)   (28)

in terms of

φ(z) = F(0, z) + G_p z^p.   (29)

(ii) The equation

φ(𝒯)ξ_τ = 0   (τ ≥ t)   (30)

holds along the optimal path.

The assertions also begin to have a very clear significance. If we set w = z⁻¹ in (28) we obtain

Φ(z) = φ(z⁻¹)ᵀφ₀⁻¹φ(z).   (31)

This is nothing but a canonical factorisation of Φ(z), since the forward recursion (30) is stable. We thus deduce
Theorem 18.4.2 (i) If infinite-horizon limits exist then Φ has a canonical factorisation (7), with φ given by (29) and Φ₀ = φ₀⁻¹.

6 THE INTEGRAL FORMALISM IN CONTINUOUS TIME

Consider the integral

𝔇 = ½ ∫ Σ_j Σ_k Σ_r Σ_s c_{jk}^{(rs)} ξ_j^{(r)} ξ_k^{(s)} dτ − ∫ ζᵀξ dτ,   (38)

where ξ_j^{(r)} = (d/dτ)^r ξ_j is the rth time differential of the jth component of ξ and ζ is a known vector function of time. The scalar coefficients c_{jk}^{(rs)} are supposed constant, so that the integral is time-homogeneous as far as the second-degree terms are concerned. As a matter of normalisation we can demand the symmetry

c_{jk}^{(sr)} = c_{kj}^{(rs)}.   (39)
However, this normalisation is not the only possible recasting of the integral (37). Partial integration yields the effective equality

ξ_j^{[r]} ξ_k^{[s]} ≐ −ξ_j^{[r−1]} ξ_k^{[s+1]}    (r > 0),    (40)
6 THE INTEGRAL FORMALISM IN CONTINUOUS TIME
where we mean by 'effective equality' that the substitution will change only the end-terms and so does not affect the optimality conditions at time points interior to the range of integration. We can carry the reduction (40) so far as to write the path integral (37) as

𝕌 = ∫ [ ½ Σ_j Σ_k Σ_r Σ_s c_jk^{(rs)} (−1)^r ξ_j ξ_k^{[r+s]} − ζᵀ ξ ] dτ + end effects,

or, more compactly,

𝕌 = ∫ [ ½ ξᵀ Ψ(𝒟) ξ − ζᵀ ξ ] dτ + end effects.    (41)

Here Ψ(𝒟) is an n × n matrix of operators with jkth element

ψ_jk(𝒟) = Σ_r Σ_s c_jk^{(rs)} (−𝒟)^r 𝒟^s.    (42)
Theorem 18.6.1 The condition that the ξ-path for τ ≥ t should render the integral (37) stationary can be written

Ψ(𝒟) ξ = ζ    (43)

at time points interior to the interval of integration. The matrix operator Ψ(𝒟) is Hermitian in that

Ψ(−s)ᵀ = Ψ(s).    (44)

The second assertion is an immediate consequence of (39) and (42); the first then follows from (41). We can write the symmetry condition (44) as Ψ̄ = Ψ, consistently with previous continuous-time conventions. If, as implicitly indicated, the optimisation is a forward one then, as ever, we seek a canonical factorisation
Ψ(s) = Ψ₋(s) Ψ₀ Ψ₊(s).

The Hermitian character of Ψ means that we can indeed look for a symmetric factorisation

Ψ(s) = φ̄(s) Ψ₀ φ(s),    (45)

say, where both φ(s) and φ(s)^{-1} have all their singularities strictly in the left half of the complex s-plane. If infinite-horizon limits exist then the infinite-horizon solution is again effectively given by the semi-inversion of (43) to the stable forward equation

φ(𝒟) ξ = [Ψ₀ φ̄(𝒟)]^{-1} ζ    (46)

for the optimising ξ. As in the discrete-time case, there is a natural normalisation of the factorisation (45), determined by deduction of this factorisation from the
optimality equation of a recursive approach. We cover these matters in the next section, which reveals both the analogies to and some differences from the discrete-time material of Sections 4 and 5.

7 RECURSIONS AND FACTORISATIONS IN CONTINUOUS TIME
In considering the factorisation question we can again normalise to the homogeneous ('regulation to zero') case ζ ≡ 0. There is a difference between the discrete- and continuous-time cases which does not seem more than notational, but is annoyingly persistent. It refers to degree of dynamics, and has already arisen in Section 4.7. When we said that dynamics were of order p in the discrete-time case we really meant that they were of order p at most. If they were in fact of lesser order in some relations then this did not matter; the only requirement was that a recursion such as (19) should really be a determining forward recursion, in that the coefficient a₀ of ξ at the current time-argument should be nonsingular. The corresponding requirement in the continuous-time case is that the matrix coefficient of the differentials of highest order should be nonsingular, and this order may well be different for different components of ξ. Suppose that, under the normalisation (39), the differentials of ξ_j occur in c(ξ) up to order r_j exactly (j = 1, 2, ..., n). Then what one might term a minimal-order linear policy would determine the forward evolution of ξ by a set of linear relations

ξ_j^{[r_j]} = Σ_k Σ_{s < r_k} κ_jk^{(s)} ξ_k^{[s]}    (j = 1, 2, ..., n),    (47)
say, where the coefficients κ may be time-dependent. The point is that the optimal policy will lie within this class, and that determination of the optimal relation (47) in the infinite-horizon limit implies a determination of the canonical factor φ(s). Let F(ξ, t) denote the value function under policy (47), starting from a known ξ-history at time t. Although we have loosely written this as ξ-dependent, in fact it will be dependent on the function ξ(τ) through the differentials ξ_j^{[r]} (r < r_j; j = 1, 2, ..., n) at the current instant t. Under appropriate restrictions on the form of the terminal cost it follows then that the value function under policy (47) is of the homogeneous quadratic form

F(ξ, t) = ½ Σ_j Σ_k Σ_r Σ_s Π_jk^{(rs)}(t) ξ_j^{[r]} ξ_k^{[s]}    (48)

and that F will obey the backward equation

c + ∂F/∂t + Σ_j Σ_r ξ_j^{[r+1]} ∂F/∂ξ_j^{[r]} = 0.

The consequent Lagrangian form
𝕌 = Σ_{τ=0}^{h} c(x_τ, u_τ, τ) + Σ_{τ=1}^{h} λ_τᵀ (𝒜x + ℬu − d)_τ + C_h    (3)
constitutes our time-integral. Here C_h is the closing cost at time h; we could, consistently with other assumptions, suppose it a quadratic function of ξ_τ (h − p < τ ≤ h). We again make the distinction between τ, a general running time variable, and t, which labels the moment 'now'. In other words, we assume that u_τ has already been determined for τ < t, not necessarily optimally, and that u_τ is to be determined for τ ≥ t. Extremisation of 𝕌 with respect to (x, u, λ)_τ gives a linear system of equations which we can write as

Rx_τ + Sᵀu_τ + (𝒜̄λ)_τ = 0,
Sx_τ + Qu_τ + (ℬ̄λ)_τ = 0,
(𝒜x + ℬu)_τ = d_τ    (t ≤ τ ≤ h − p).    (4)

The corresponding equations for h − p < τ ≤ h will be modified in that they will include contributions from C_h. As in Section 6.5, 𝒜̄ and ℬ̄ are the operators conjugate to 𝒜 and ℬ in that, for example, 𝒜̄ = A(𝒯^{-1})ᵀ = Σ_{r=0}^{p} A_rᵀ 𝒯^{-r}. So 𝒜 operates into the past and 𝒜̄ into the future. Note that, effectively, λ_τ = 0 for τ > h.
The superscript (t) indicates that the optimisation is one holding from time t onwards. The values of x_τ^{(t)} and u_τ^{(t)} for τ ≥ t determined by (4) and subsequent equations constitute the prediction at time t of the future course of the optimally controlled process. This will coincide with the actual course only if there is no plant noise ε. However, the determination of u_t^{(t)} thus obtained is optimal under all circumstances. If we write the equation system (4) as
Φ(𝒯) ξ_τ^{(t)} = η_τ    (5)

then this can be identified with equation (18.3) of the general path-integral treatment; here ξ = (x, u, λ) and η = (0, 0, d). The matrix function Φ(z) thus implicitly defined certainly satisfies the symmetry requirement Φ̄ = Φ of the general treatment. Suppose that Φ(z) has the canonical factorisation

Φ(z) = φ̄(z) φ₀^{-1} φ(z)    (6)
where we have supposed the normalisation of (18.31), the natural discrete-time normalisation. Then the semi-inversion

φ(𝒯) ξ_τ^{(t)} = [φ₀^{-1} φ̄(𝒯)]^{-1} η_τ    (τ ≥ t)    (7)

of (5), if valid, provides the optimal stationary solution in the infinite-horizon limit. It provides the solution in that relation (7) constitutes a stable forward relation determining the predicted course of the optimally controlled process. However, more importantly, relation (7) for τ = t gives an explicit expression for the optimal value of u_t in closed-loop form. We shall make this determination more explicit in the following section; one may well claim it as the most direct, economical and natural determination of the optimal stationary policy. The condition that a canonical factorisation (6) should exist is just the condition that infinite-horizon limits should exist in the case d ≡ 0; the case of 'regulation to zero'. The condition that the right-hand member of (7) should be finite (with the operator acting into the future) is just the condition that the optimally controlled system should be able to cope with the input d; see the discussion after Theorem 6.3.1. The continuous-time analogue is fairly immediate. The operators in the plant equation are now polynomials in the differential operator 𝒟, so that, for example, 𝒜 = A(𝒟) = Σ_r A_r 𝒟^r. Sums are replaced by integrals in (3), and relation (4) holds with the definition 𝒜̄ = A(−𝒟)ᵀ of conjugacy. In analogue to (5) one writes this system as
Φ(𝒟) ξ^{(t)} = η    (8)

where a time argument τ is understood and 𝒟 is understood as d/dτ. If Φ(s) has the canonical factorisation

Φ(s) = φ̄(s) φ₀^{-1} φ(s)    (9)

where we have supposed the normalisation of Theorem 18.7.2, then the analogue of the semi-inversion (7) is

φ(𝒟) ξ_τ^{(t)} = [φ₀^{-1} φ̄(𝒟)]^{-1} η_τ    (τ ≥ t).    (10)
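Computationally, a semi-inversion such as (7) is a pair of sweeps through the data. The following scalar sketch is our own illustration, not from the text: an anticausal sweep inverts the conjugate factor, which acts into the future, and a causal sweep then inverts the factor acting into the past.

```python
import numpy as np

def semi_invert(phi, eta):
    """Solve Phi(T) xi = eta given the scalar canonical factorisation
    Phi(z) = phi(1/z) phi0^{-1} phi(z): an anticausal (backward) sweep for
    v = phi0^{-1} phi(T) xi, then a causal (forward) sweep for xi itself.
    eta is a finite record, treated as zero outside its range."""
    phi = np.asarray(phi, dtype=float)
    p, n = len(phi) - 1, len(eta)
    v, xi = np.zeros(n), np.zeros(n)
    for t in range(n - 1, -1, -1):        # invert phi(1/T): acts into the future
        tail = sum(phi[j] * v[t + j] for j in range(1, p + 1) if t + j < n)
        v[t] = (eta[t] - tail) / phi[0]
    for t in range(n):                    # invert phi(T): acts into the past
        tail = sum(phi[j] * xi[t - j] for j in range(1, p + 1) if t - j >= 0)
        xi[t] = (phi[0] * v[t] - tail) / phi[0]
    return xi

# demo: Phi(z) = 2.18 + 0.6 (z + 1/z) has canonical factor phi(z) = 2 + 0.6 z;
# applying Phi to a unit pulse and then semi-inverting recovers the pulse.
xi0 = np.zeros(50); xi0[25] = 1.0
eta = 2.18 * xi0 + 0.6 * np.roll(xi0, 1) + 0.6 * np.roll(xi0, -1)
print(np.allclose(semi_invert([2.0, 0.6], eta), xi0))   # True
```

The same two-sweep pattern reappears for estimation in Chapter 20, with the roles of past and future reversed.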
2 EXPRESSION OF THE OPTIMAL CONTROL RULE IN DISCRETE TIME

For the LQG control problem the matrix generating function Φ(z) has the form

Φ(z) = [ R(z)   S̄(z)   Ā(z)
         S(z)   Q(z)   B̄(z)
         A(z)   B(z)   0    ]    (11)
OPTIMAL STATIONARY LQG POLICIES: PERFECT OBSERVATION
Here we have in fact generalised the cost function somewhat by replacing the matrix R by a generating function R(z), etc. This corresponds to the inclusion of lagged variables in the cost function c_t; the generalisation can be made painlessly at this point and the conclusions we now reach remain valid under it. Recall that B₀ = 0, and suppose the normalisation A₀ = I. This form is special enough that we can say something about the form of the canonical factor φ(z) in the normalised factorisation (6).
Theorem 19.2.1 The canonical factor φ of the matrix generating function (11), under the normalisation of Theorem 18.4.2, has the form

φ = [ φ_xx   φ_xu   I
      φ_ux   φ_uu   0
      A      B      0 ]    (12)

the partitioning being the (x, u, λ) partitioning of Φ, and an argument z being understood throughout. If the cost function involves lags of size p − 1 at most then the submatrices with the φ-labels are polynomials in z of degree p − 1 at most.

Proof Validity of the form (12) follows by appeal to the recursive algorithm (18.35). We leave the reader to verify that, if φ_(i) has the form (12), then so does φ_(i+1). So then does φ, by Theorem 18.5.3. The final assertion follows from the fact that, under the assumption stated,

φ_p = G_p = [ 0     0     Ā_p
              0     0     B̄_p
              A_p   B_p   0   ];

see (18.32).

Theorem 19.2.2 Suppose that infinite-horizon limits exist. Then the optimal value of u_t is given explicitly and in closed-loop form by

φ_ux(𝒯) x_t + φ_uu(𝒯) u_t = [[φ₀^{-1} φ̄(𝒯)]^{-1}]_{uλ} d_t    (13)

where the uλ subscript on the bracket indicates the extraction of the corresponding submatrix in the (x, u, λ)-partitioning of the bracketed matrix. If the cost function involves lags of size p − 1 at most then relation (13) expresses the optimal u_t in terms of (x_t, x_{t−1}, ..., x_{t−p+1}), (u_{t−1}, u_{t−2}, ..., u_{t−p+1}) and (d_t, d_{t+1}, ...).

Proof Relation (13) follows if we extract the u-subvector relationship from relation (7) at τ = t, taking account of the form asserted for φ in Theorem 19.2.1. □

The closed-loop determination (13) of the optimal control is explicit to within achievement of the canonical factorisation, and is both as neat and as explicit a solution as the general case will allow.
3 OPTIMAL CONTROL RULE IN CONTINUOUS TIME
We gain both some economy and some insight if we introduce the system notation of Section 6.6, and write the system (4) in the generalised case (11) as

[ ℛ(𝒯)   𝒜̄(𝒯) ] [ Ξ ]        [ 0 ]
[ 𝒜(𝒯)   0     ] [ λ ]_τ    = [ d ]_τ    (14)

where now Ξ = (x, u), and the plant equation reduces to 𝒜Ξ = d. The assertion (12) is then that the normalised canonical factor φ takes the form

φ(z) = [ φ_ΞΞ(z)   𝒜̄₀
         𝒜(z)      0  ]    (15)

where 𝒜̄₀ is the absolute term in the matrix polynomial 𝒜̄(z). We can follow through the proof of Theorem 19.2.1, but in the more economical system notation throughout, to deduce (15) directly. However, we have to return to the more explicit notation of relation (13) if we wish to distinguish the particular role of u, and deduce the control rule explicitly.
where 91o is the absolute term in the matrix polynomial ~(z). We can follow through the proof of Theorem 19.2.1, but in the more economical system notation throughout, to deduce (15) directly. However, we have to return to the more explicit notation of relation (13) if we wish to distinguish the particular role of u, and deduce the control rule explicitly. Exercises and comments (1) For the Markov case .sD.
(20)
Show that in the special case (19) relation (20) implies )QJ = Q + lJA. 1RA 1B, where J =I Q 1BJ rPxxA 1B. Show that in the case cJ>(s) =
R(s) [
0
A(s)
0 Q(s) B(s)
~]
B(s) , 0
special only in that S(s) has been normalised to zero, relation (20) amounts to
)Q.J= Q+BA1RA1B, where J = Q:; 1 {[rPuu(.~) B.A:; 1¢xu] [rPux B.A:; 1¢xx]A 1B}. In both cases J is indeed the return difference transfer function for the optimal control.
CHAPTER 20

Optimal Stationary LQG Policies: Imperfect Observation

1 THE PROCESS/OBSERVATION MODEL: APPEAL TO CERTAINTY EQUIVALENCE

We assume the linear model of equation (19.1) together with an observation relation:

𝒜x + ℬu = d + ε    (1)
y + 𝒞x = η.    (2)

This specification covers both the discrete- and continuous-time formulations; the time variable has not been explicitly indicated. The operator 𝒞 is, like 𝒜 and ℬ, a causal translation-invariant linear operator. Let us initially discuss the discrete-time case with pth-order dynamics, with 𝒞 having the form C(𝒯) = Σ_{r=1}^{p} C_r 𝒯^r. As ever, d is a deterministic disturbance term and ε and η are plant and observation noise respectively. We suppose that these noise terms jointly constitute Gaussian white noise with zero mean and covariance matrix

[ N    L
  Lᵀ   M ].
In this case of imperfect observation the information W_t available at time t consists of the observation and control histories Y_t and U_{t−1}, plus the complete course of the deterministic component of disturbance {d_t}. We shall ultimately be passing to the stationary regime, in which the past as well as the future is infinite, so that observation and control histories extend into the infinite past. Since the model is totally LQG we can appeal to the certainty equivalence principle of Section 11.3 to deduce the optimal control in this imperfectly observed case from that for perfect observation. We know from the analysis of Section 19.2 that, in the case of perfect observation, the optimal stationary determination of u_t is given in closed-loop form by

φ_ux(𝒯) x_t + φ_uu(𝒯) u_t = [[φ₀^{-1} φ̄(𝒯)]^{-1}]_{uλ} d_t.    (3)
The left-hand member of this relation gives the feedback component of optimal control, expressing u_t in terms of x_τ (t − p < τ ≤ t) and u_τ (t − p < τ < t). The right-hand member gives the feedforward term, in terms of d_τ (τ ≥ t). In the case of imperfect observation recursion (3) for the optimal control still holds, except that x_τ must be replaced, where it occurs, by the current projection estimate x_τ^{(t)}. We are then led to the inference problem; the determination of these estimates. The duality of estimation and control has already been demonstrated for the Markov case in Section 12.9; we shall see how this extends to dynamics of general order.
2 PROCESS ESTIMATION IN TIME-INTEGRAL FORM (DISCRETE TIME)

The characterisation that we shall take of projection estimates is not the linear least-square property asserted in (ii) of Theorem 12.6.3, but rather the dual probability-maximising (or discrepancy-minimising) property asserted in (iii) of the same theorem. It is this characterisation which yields the natural time-integral formulation. A related point is that canonical factorisations are then factorisations of something like the reciprocal of an AGF rather than (as in the LLS approach associated with the Wiener filter) a factorisation of an AGF. This means that, in the case of pth-order dynamics, the factors are polynomials of degree p. We can set up the problem rigorously under the supposition that observation began at a finite time h₁, and can then pass to the infinite-history limit h₁ → −∞, just as we passed to the infinite-horizon limit for control optimisation. Histories will then be restricted histories, so that X_t is {x_τ; h₁ − p ≤ τ ≤ t}, etc. Then the negative exponent in the Gaussian density of X_t and Y_t for prescribed U_{t−1} is the discrepancy

𝔻_t = prior terms + ½ Σ_{τ=h₁}^{t} [ε; η]_τᵀ [N L; Lᵀ M]^{-1} [ε; η]_τ,

where ε and η are expressed in terms of x, y and u by appeal to the plant and observation relations (1) and (2). The 'prior terms' reflect the distribution assumed for relevant system history before time h₁. In the case of pth-order dynamics they will constitute a quadratic function of {x_τ, y_τ, u_τ; h₁ − p ≤ τ < h₁}. The projection estimates x_τ^{(t)} are just the values of x_τ minimising 𝔻_t. Estimation thus amounts to a backward rather than a forward optimisation problem; a time-integral (or a precursor to one) is to be extremised over its course before the current moment t rather than after. Actually, we would not regard extremisation of 𝔻_t as extremisation of a time-integral, because it is subject to the constraints implied by the plant and observation relations (1) and (2). Let us eliminate these by the introduction of
Lagrangian multiplier vectors l_τ and m_τ for the constraints constituted by the relations at time τ, and so extremise a Lagrangian form

𝔻_t + Σ_τ [lᵀ(𝒜x + ℬu − d − ε) + mᵀ(y + 𝒞x − η)]_τ

with 𝔻_t expressed in terms of the noise variables as above. Minimisation of this form with respect to the noise variables reduces the problem to extremisation of the past path integral

Π_P(l, m, x) = (prior terms) + Σ_{τ=h₁}^{t} [v(l, m) − lᵀ(𝒜x + ℬu − d) − mᵀ(y + 𝒞x)]_τ.    (4)

Here

v(l, m) = ½ [l; m]ᵀ [N L; Lᵀ M] [l; m]

is the informational analogue of c(x, u). Strictly, we should give the multipliers l and m superscripts (t), to indicate that they are the multipliers associated with optimisation on the basis of observables at time t. The relation between the multipliers and the noise estimates is

[ε̂; η̂]_τ = [N L; Lᵀ M] [l; m]_τ.

The direct interpretation of the multipliers is the familiar one of sensitivities: of l_τ^{(t)} and m_τ^{(t)} as respectively the rates of change of the minimal value of 𝔻_t (subject to constraints (1) and (2)) with changes in the actual values of ε_τ and η_τ. Note one effect of the transformation from 𝔻_t to Π_P: the expression for the time-integral has been cleared of all inverses of covariance matrices. Thus one puzzling and seemingly perverse feature associated with a stochastic formulation is removed: that it actually seems to become anomalous if some noise components are zero, and so the noise covariance matrix is singular. One cannot directly relate the multipliers l_τ and λ_τ associated with the plant equation at past and future times. This is because the past optimisation takes precedence: one first minimises 𝔻 with respect to unobservables and then C with respect to undetermined controls. This is again a reflection of the degenerate character of the risk-neutral model. When we come to the risk-sensitive formulation in the next chapter we shall see that effectively there is a single time-integral which spans both past and future, with the consequence that past and future multipliers can be related. The estimates x_τ^{(t)} of process history at time t are obtained by minimising 𝔻_t with respect to the corresponding x_τ. The path integral Π_P must then be
maximised with respect to these variables and minimised with respect to the corresponding l, m. We deduce then the stationarity conditions
[ N     L     −𝒜 ] [ l ]             [ ℬu − d ]
[ Lᵀ    M     −𝒞 ] [ m ]           = [ y      ]    (h₁ ≤ τ ≤ t)    (5)
[ −𝒜̄   −𝒞̄    0  ] [ x ]_τ^{(t)}      [ 0      ]_τ

which we can write compactly as

Ψ(𝒯) ξ_τ^{(t)} = ρ_τ    (h₁ ≤ τ ≤ t),    (6)

where ξ = (l, m, x) and ρ = (ℬu − d, y, 0), together with

l_τ^{(t)} = 0,    m_τ^{(t)} = 0    (τ > t),    (7)

since terms for τ > t do not appear in Π_P. The initial conditions are provided by the stationarity conditions for τ ≤ h₁, which deviate from the pattern (5) in that they will involve the prior distribution. Note that relations (7) imply that (6) has a solution independent of the values of x_τ^{(t)} for τ > t, because x_τ^{(t)} simply does not occur in the equation system (5) for τ > t. We shall refer to this feature as the lack of forward coupling. It reflects the fact that the estimates of present and past variables do not depend upon the estimates of future variables. One can contrast it with the essential presence of backward coupling in the corresponding control relation (19.5); present control is undoubtedly expressed in terms of present and past values of the process variable. Relation (6) plus its modified version at earlier time points determines in principle all past estimates x_τ^{(t)}. We wish to solve these equations efficiently, but we really only need to solve them for the estimates x_τ^{(t)} (t − p < τ ≤ t) which are needed for implementation of the control rule. In the infinite-history limit this point is again met by appeal to a canonical factorisation; this time of Ψ(z). In the infinite-history limit h₁ → −∞ equations (5) and (6) hold for all τ ≤ t. If we assume that Ψ(z) has a canonical factorisation

Ψ(z) = ψ(z) ψ₀^{-1} ψ̄(z),    (8)
which we have taken in the normalised form analogous to (19.6), then the system (6) can be semi-inverted to

ψ̄(𝒯) ξ_τ^{(t)} = [ψ(𝒯) ψ₀^{-1}]^{-1} ρ_τ    (τ ≤ t).    (9)

Here the operator ψ̄(𝒯) acts into the future, with an effective boundary condition x_τ^{(t)} = 0 (τ > t) implied by the property of lack of forward coupling. The operator ψ(𝒯)^{-1} acts into the past. Relation (9) for τ = t determines x_t^{(t)} (and so u_t) explicitly. The relations for τ = t − 1, t − 2, ... then determine the values of x_{t−1}^{(t)}, x_{t−2}^{(t)}, ... recursively.
With this one would seem to have estimates of the process variable at the relevant times in the form one would wish. This is not quite true, however. Rather than an expression for x_t^{(t)} in the form of an infinite sum involving all past observations (which is what (9) yields if we set τ = t) we would wish to generate this estimate by some simple updating recursion, as was achieved by the Kalman filter in the state-structured case. The question is, then, whether the backward recursion with respect to τ implied by (9) can be converted into a natural forward recursion with respect to t: the pth-order equivalent of the Kalman filter. We shall see in succeeding sections that this conversion is almost immediate. Note one necessary difference between factorisations (19.6) and (8): the order of factors is now reversed, in that the factor ψ(z) = Σ_{j=0}^{p} ψ_j z^j, for which both ψ(z) and its matrix inverse have expansions in nonnegative powers of z valid in |z| ≤ 1, is now the initial factor. Otherwise the factorisation is the complete analogue of (19.6) (e.g. in that ψ₀ is the constant term in ψ(z) and is symmetric) and can be achieved by the same policy-improvement algorithm. Of course, the improvement is in inference rule rather than control rule. The existence of a canonical factorisation (8) is again the essential condition that infinite-history limits should exist for the estimation rule and the distribution of estimation errors.

Exercises and comments

(1) For the Markov case 𝒜 = I − A𝒯, ℬ = −B𝒯 and 𝒞 = −C𝒯 the normalised canonical factor ψ(z) of Ψ(z) has the form

ψ(z) = [ N + AVAᵀ    L + AVCᵀ    Az − I
         Lᵀ + CVAᵀ   M + CVCᵀ    Cz
         I           0           0      ]

where V is the limit value of the covariance matrix of the estimation error x̂ − x (cf. Exercise 19.2.1).
3 THE PARALLEL KALMAN FILTER (DISCRETE TIME)

The innovation in the observations now has the form

ζ_t = y_t − ŷ_t^{(t−1)} = y_t + 𝒞x_t^{(t−1)}.    (10)

Here the time-translation operator acts, as ever, only on the subscript, so that 𝒞x_t^{(t−1)} = Σ_{r=1}^{p} C_r x_{t−r}^{(t−1)}. We assume the normalised form (8) of the canonical factorisation.

Theorem 20.3.1 (The parallel Kalman filter) The estimates x_τ^{(t)} (τ ≤ t) are updated by the relations
x_t^{(t)} = (I − 𝒜)x_t^{(t−1)} − ℬu_t + d_t + H₀ζ_t,    (11)

x_τ^{(t)} = x_τ^{(t−1)} + H_{t−τ}ζ_t    (τ < t),    (12)

where ζ_t is the innovation (10) and the matrix coefficients H_j are determined by

H(z) := Σ_{j=0}^{∞} H_j z^j = [ψ(z)^{-1}]_{xm}.    (13)
Equation (11) is recognisable as a generalised form of the Kalman filter; it must now be supplemented by the relations (12) which update also the estimates of the lagged x-values. The real novelty of the theorem lies in the compact evaluation (13) of the coefficients H_j in terms of the canonical factor ψ. The xm subscript indicates that we extract the corresponding submatrix from the (l, m, x)-partitioned matrix. We term relation (11) plus relations (12) for 0 < j < p the parallel filter because it simultaneously updates estimates of all the components of what would in effect be a state vector. It contrasts with another possible version of the Kalman filter, the serial filter, which emerges in the next section.
Proof Note the crucial difference between relations (19.5) and (6) on which we have already commented. Relations (19.5) couple back into the past in that they involve x_τ for τ < t. However, relations (6) do not couple forward into the future; future values of x are not involved and future values of l and m are zero, as asserted in (7). This absence of forward coupling implies an effective boundary condition

ξ_τ^{(t)} − ξ_τ^{(t−1)} = 0    (τ > t).    (14)

We deduce from (6) that

Ψ(𝒯)(ξ_τ^{(t)} − ξ_τ^{(t−1)}) = ρ_τ^{(t)} − ρ_τ^{(t−1)}    (τ ≤ t).    (15)

But the right-hand member of (15) is equal to zero for τ < t and to the column vector ι_t with partition (0, ζ_t, 0) for τ = t. Premultiplying relation (15) by the operator ψ₀ψ(𝒯)^{-1} we thus deduce that

ψ̄(𝒯)(ξ_τ^{(t)} − ξ_τ^{(t−1)}) = ψ₀ [ψ(𝒯)^{-1}ι]_τ    (τ ≤ t).

This, together with the effective boundary condition (14), implies that

ξ_τ^{(t)} − ξ_τ^{(t−1)} = h_{t−τ} ι_t    (τ ≤ t),    (16)

where h_j is the coefficient of z^{−j} in the expansion of Ψ(z)^{-1} in nonpositive powers of z. This demonstrates the validity of (12) for τ ≤ t with the evaluation
(13) of the coefficients H_j. For the particular case τ = t we deduce an expression for x_t^{(t−1)} from the relation

𝒜x_t^{(t−1)} + ℬu_t = d_t,

so deducing (11) from (12) for the case τ = t. □
Exercises and comments (1) Consider again the Markov case, for which the canonical factor 1jJ is given in 1 Exercise 1 of Section 2. The xm submatrix of 1/J(zf is
H(z) = H +z 1 V(l. nTz 1) 1cT(M + cvcT) 1
(17)
whereH = (L+AVCT)(M + CVCTf 1 and S1 =A HC. ThenHo indeed has the standard evaluation H. The relations (12) for T < t are of only academic interest in this case, but we see from (17) that llj = V[nTy 1 cT(M + CVCT) 1 for j > 0. Part of the reason why this formula does not hold at j = 0 is th.at observation of y, gives some information on the value of f 1 if L =/: 0 (i.e. if plant and observation noise are correlated~ This helps in the estimation of x, but not of earlier xvalues. 4 THE SERIAL KALMAN FILTER (DISCRETE TIME) Direct operations on the relation (6) yield an interesting alternative form of the higherorder Kalman filter. Let us define Xt as '¢(ff) I p, , i.e as the solution of (18)
1/J(ff)x = P·
Theorem 20.4.1 The vector (19)
can be identified with x~'l. In particular, x1 can be identified with x~ 1 l, the estimate of current process variable. Proof It follows from (6) and (18) that '¢"0 1 1/J(ff)x~'l
=
Xr
(T
~
t)
(20)
But, because of the lack of forward couplin~ and the fact that 1/Jo is the absolute term in 1/J, relation (20) forT = t reduces to x/l = Xt· D
One might define the serial estimate of process history at time t as {x̃_τ; τ ≤ t}, and the updated or revised estimate as {x_τ^{(t)}; τ ≤ t}. Correspondingly, the best predictor of y_t at time t − 1 is −Σ_{r=1}^{p} C_r x_{t−r}^{(t−1)}, whereas the serial predictor is −Σ_{r=1}^{p} C_r x̃_{t−r}. Corresponding to the notion of the innovation ζ_t = y_t − ŷ_t^{(t−1)} is then that of the serial innovation

ζ̃_t = y_t − ỹ_t = y_t + 𝒞x̃_t.    (21)

By the same argument as that which proved the special form (19.12) for the canonical factor φ of Φ we deduce that the canonical factor ψ has the form

ψ = [ ψ_ll   ψ_lm   −𝒜
      ψ_ml   ψ_mm   −𝒞
      I      0      0  ]    (22)

where an argument 𝒯 or z is understood.
Theorem 20.4.2 (The serial Kalman filter) The variables x̃ and m̃ obey the stable forward pair of recursions

𝒜x̃ + ℬu = d + ψ_lm(𝒯)m̃    (23)
ψ_mm(𝒯)m̃ = y + 𝒞x̃    (24)

so that x̃ is determined recursively by the serial Kalman filter

𝒜x̃ + ℬu = d + ψ_lm(𝒯)ψ_mm(𝒯)^{-1}ζ̃,    (25)

itself a stable forward recursion.

Proof Written in full, relation (18) becomes

[ ψ_ll   ψ_lm   −𝒜 ] [ l̃ ]       [ ℬu − d ]
[ ψ_ml   ψ_mm   −𝒞 ] [ m̃ ]     = [ y      ]    (26)
[ I      0      0  ] [ x̃ ]_τ      [ 0      ]_τ

From this it follows that l̃ = 0 and that the equation system reduces to (23), (24). This reduced system implies the determination m̃ = ψ_mm^{-1}ζ̃ and the Kalman filter recursion (25) for x̃. Stability of all relations as forward recursions is guaranteed by the canonical character of ψ. □

Relation (25) has indeed the character of the classical Kalman filter, in that it is equivalent to the driving of a plant model by the serial innovations, or of a plant/observation model by the observations. The parallel filter of the last section has rather the character of the driving of a state-reduced plant model by the innovations; the serial filter avoids such a reduction. However, the fact that the serial filter works on serial innovations means that the driving term ψ_lm ψ_mm^{-1}ζ̃ is
s
:in general not a function of current ( alone, but also of past (. Some ;compensation is required to take account of the fact that history has been 'estimated serially. The best way to view relation (25) is in fact to revert to the ·equationpair (23), (24) and regard this as a coupled system of plant/observation .!Ilodel and compensator driven by the observations. Of course, the reason why l1 = 1Jtl is zero for all t is that the plant equation at t constitutes no essential constraint; the variable x 1 appears in this relation alone and can be given the value which satisfies it best, without affecting the estimates of earlier history or their fit. The following result clarifies the character of m1 and relates the two innovations. Theorem 20.4.3
The serial and parallel innovations are related by

ψ₀,mm^{-1}ζ = m̃ = ψ_mm(𝒯)^{-1}ζ̃.    (27)

Proof Relation (16) at τ = t plus the lack of forward coupling implies that

ψ₀(ξ_t^{(t)} − ξ_t^{(t−1)}) = ι_t.

Because ψ has the form (22) this last relation implies that l_t^{(t)} − l_t^{(t−1)} = 0, which we know, and also that ψ₀,mm(m_t^{(t)} − m_t^{(t−1)}) = ζ_t. Since m_t^{(t−1)} = 0 the first equality of (27) thus follows; the second we know from (24). □
Finally, summation of the equation x_τ^{(σ)} − x_τ^{(σ−1)} = H_{σ−τ}ζ_σ with respect to σ over the range τ < σ ≤ t leads to the conclusion

Theorem 20.4.4 The updated estimates of past process values are obtained from the serial estimates x̃ by the formula

x_τ^{(t)} = x̃_τ + Σ_{j=1}^{t−τ} H_j ζ_{τ+j} = x̃_τ + Σ_{j=1}^{t−τ} H_j ψ₀,mm m̃_{τ+j}    (τ ≤ t).    (28)
5 THE CONTINUOUS-TIME CASE; INVALIDITY OF THE SERIAL FILTER

Interesting points arise in the continuous-time case. The transfer from the discrete-time case cannot be made mechanically, and some points are delicate enough to affect implementation. When it comes to estimation then relation (5), written again as (6), still holds, with the boundary condition of lack of forward coupling. We appeal to a canonical factorisation

Ψ(s) = ψ(s) ψ₀^{-1} ψ̄(s).    (29)

However, it is now sometimes advantageous to vary the normalisation from that indicated in Theorem 18.7.2. In the next section we shall demonstrate that a
factorisation can be found of the form (22); see (35). It will then also follow that, if we again define χ(t) as the solution of (18), then x̃(t) can again be identified with the estimate of current process value x^{(t)}(t). Furthermore, it follows from the form of ψ that x̃ and m̃ obey the analogues of (23), (24)

𝒜x̃ + ℬu = d + ψ_lm(𝒟)m̃,
ψ_mm(𝒟)m̃ = y + 𝒞x̃,    (30)

so that x̃ again obeys the serial Kalman filter relation, analogous to (25),

𝒜x̃ + ℬu = d + ψ_lm(𝒟)ψ_mm(𝒟)^{-1}ζ̃.    (31)

Here ζ̃ is again the serial innovation. The true and serial innovations now have the definitions

ζ = y + 𝒞x̂,    ζ̃ = y + 𝒞x̃,    (32)

respectively. Here we have used x_r(t) to denote the rth differential 𝒟^r x(t) of x at t and x̂_r(t) to denote its projection estimate on information at t:

x̂_r(t) = [𝒟^r x^{(t)}(τ)]_{τ=t}.    (33)

Note that 𝒟 acts, as ever, on the running time argument τ in (33). However, relations (31) and (32) have only a formal validity. The innovation ζ is not differentiable with respect to time (since it has a white-noise component). Neither in fact is the serial innovation ζ̃, and the fact that equation (32) relates differentials of x̃ to differentials of ζ̃ is an indication that some of the differentials of x̃ do not exist either. Equations (30) are proper in that they define a stable filter with input y and well-defined outputs x̃ and m̃. However, the equations themselves constitute an improper realisation of this filter, in that they express relations between signals (variables) which are so ill-defined mathematically as to be hopelessly ill-conditioned physically. In order to obtain a set of updating relations in which all variables are well defined we have to revert to the parallel filter and so, effectively, to a state reduction of the model. This goes rather against our programme, one of whose elements was the refusal to resort to state reductions. However, the reduction is for purposes of proof only, and we shall see that the parallel filter, generating all the relevant estimates x̂_r, can be derived directly from the form of the canonical factor ψ.
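By way of contrast with the serial filter's failure, the parallel (state-reduced) route is perfectly well defined in continuous time; in the Markov case the stationary error covariance V entering the canonical factor (34) below solves a Riccati ODE. An illustrative sketch, our own, in conventional signs and with uncorrelated noises (L = 0):

```python
import numpy as np

def kalman_bucy_V(A, C, N, M, dt=1e-3, T=50.0):
    """Integrate the filtering Riccati ODE
        dV/dt = AV + VA' + N - VC' M^{-1} CV
    to its stationary limit: the error covariance V of the Kalman-Bucy
    filter (plant and observation noises uncorrelated, L = 0)."""
    V = np.zeros_like(N, dtype=float)
    Minv = np.linalg.inv(M)
    for _ in range(int(T / dt)):   # simple forward-Euler integration
        V = V + dt * (A @ V + V @ A.T + N - V @ C.T @ Minv @ C @ V)
    return V

# scalar check: A = -1, C = N = M = 1 gives stationary V = sqrt(2) - 1
print(kalman_bucy_V(np.array([[-1.0]]), np.array([[1.0]]), np.eye(1), np.eye(1)))
```

The fixed point of the iteration is exactly the root of the algebraic Riccati equation, so the integration step size affects only the transient, not the limit.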
6 THE PARALLEL KALMAN FILTER (CONTINUOUS TIME)

The continuous-time case has shown itself to differ from the discrete-time case in that the serial Kalman filter cannot be implemented as it stands. Turning then to
the parallel filter, we see that there is necessarily another difference. Whereas one can well consider the estimation of x at any lag (either in discrete or continuous time), one cannot consider the estimate of differentials of x of any order, because differentials of components of x will in general exist only up to order one less than the maximal order occurring in the plant equation. (Recall that the highest-order derivative has in general a white-noise component.) This is of course acceptable, in that these are the only estimates of differentials that one needs for purposes of control. However, if one restricts oneself to estimating just these differentials, then one has effectively reverted to a state-reduced treatment, which is rather against the spirit of our programme. We shall in fact appeal to a state reduction for purposes of proof, but the final conclusions do not appeal to such a reduction. That is, the continuous-time analogue of the matrices H_j occurring in the parallel filter analogous to (11), (12) will be deduced directly from the canonical factor ψ of Ψ.

Let us for the moment omit the input terms d and ℬu in the plant equation; these can be left out and later restored without affecting estimation. In the Markov case (with 𝒜(s) = sI − A, 𝒞(s) = C for constant matrices A and C) we find canonical factors
\psi(s) = \begin{bmatrix} V & L + VC^T & sI - A \\ 0 & M & C \\ I & 0 & 0 \end{bmatrix}    (34)
where V is the stationary covariance matrix of the estimation error x − x̂. We shall see that this generalises to
\psi = \begin{bmatrix} \psi_{ll} & \psi_{lm} & \mathscr{A} \\ \psi_{ml} & \psi_{mm} & C \\ I & 0 & 0 \end{bmatrix}, \qquad
\psi_0 = \begin{bmatrix} 0 & 0 & I \\ 0 & M & 0 \\ I & 0 & 0 \end{bmatrix}    (35)
where ψ and its elements have arguments s or 𝒟 as appropriate. Let us initially suppose that the dynamics are of order p exactly, so that the matrix coefficient A_p of 𝒟^p in 𝒜 is nonsingular, and can be normalised to the identity. The Kalman filter for the state-reduced model, which gives the parallel updating for the unreduced model, then has the form (analogous to (11), (12)):
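As a concrete illustration of the Markov special case, the sketch below computes a stationary error covariance V and the implied filter gain by running the Riccati differential equation to convergence. The matrices A, C, N (plant-noise covariance) and M (observation-noise covariance) are hypothetical, the cross-covariance L is taken to be zero, and the gain formula H = VCᵀM⁻¹ is the standard Kalman–Bucy one, not anything specific to this chapter's notation.

```python
import numpy as np

# Hypothetical Markov model: dx = Ax dt + plant noise (covariance N),
# dy = Cx dt + observation noise (covariance M).
A = np.array([[0.0, 1.0],
              [-1.0, -0.5]])
C = np.array([[1.0, 0.0]])
N = np.diag([0.0, 0.4])
M = np.array([[0.1]])
Minv = np.linalg.inv(M)

# Stationary error covariance V solves  AV + VA' + N - V C' M^{-1} C V = 0;
# iterate the Riccati differential equation until it settles.
V = np.eye(2)
for _ in range(50000):
    dV = A @ V + V @ A.T + N - V @ C.T @ Minv @ C @ V
    V = V + 0.01 * dV

H = V @ C.T @ Minv   # filter gain: dxhat = A xhat dt + H (dy - C xhat dt)
residual = A @ V + V @ A.T + N - V @ C.T @ Minv @ C @ V
print(np.abs(residual).max())   # essentially zero at the stationary solution
```

The iteration is simply a crude Euler integration of the Riccati flow; any algebraic-Riccati solver would serve equally well.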
THE RISK-SENSITIVE (LEQG) VERSION

Φ(z) will again have a canonical factorisation of the form (19.6) with φ₀ symmetric, and the expression (19.12) of the canonical factor φ generalises to
\phi(z) = \begin{bmatrix} \phi_{xx} & \phi_{xu} & I \\ \phi_{ux} & \phi_{uu} & 0 \\ \mathscr{A} & \mathscr{B} & \theta N \end{bmatrix}
where the submatrices are also functions of z. The optimisation of control in the case of perfect observation differs from that of the risk-neutral case only by the substitution of this revised form of the canonical factor, and the optimal control rule is then again of the form (19.13). However, when observation is imperfect then solution of the equation pair (6), (7) presents a problem which was not encountered in the risk-neutral case. The two sets of equations are coupled in both directions, forward as well as back. The system (6) is now linked to the past by the occurrence of μ_τ^{(t)} in the right-hand member at τ = t as well as by the occurrence of terms in x_τ^{(t)} for τ < t. However, the real difference lies in the system (7), which was not previously linked to the future, but now is so by the fact that λ_τ^{(t)} is nonzero for τ ≥ t and also by the occurrence of u_t in the right-hand member at τ = t. Risk-sensitivity has the effect that estimates based on information acquired from the past are also affected by costs to be incurred in the future.

In Whittle (1990a) this point was dealt with in a distinctly unsatisfactory fashion. Appeal to the canonical factorisations of Φ(z) and Ψ(z) reduced the two infinite equation systems (6), (7) to a single system of 2p vector equations, these equations being related to the determination of the estimates of x_τ (t − p < τ ≤ t) as the values minimising the sum of past and future stress. This reduced system is readily solved in the state-structured case p = 1, solution corresponding simply to the recoupling step of Theorem 16.7.1. However, the form of solution in higher-order cases was not evident, and there the matter was left.

One might think that, since plant noise ε appears as an effective auxiliary control, one might simply revert to a risk-neutral formulation in which the control u is replaced by the pair (u, ε). This view fails in the case of imperfect observation, however, because the estimates ε_τ^{(t)} for τ < t are formed retrospectively.
That is, if we regard ε as the control wielded by Nature, then Nature has the unfair advantage that she can revise past controls in her own favour. We indicate an alternative approach in the next section which yields an explicit solution in terms of a canonical factorisation, but at the cost that the function being factorised is rational rather than polynomial (for dynamics of finite order). The analysis is a pleasing and interesting one, in that it demonstrates the efficacy of the innovation concept. However, we are left, for the moment, with the conclusion that one might as well revert to a state reduction when it comes to implementation.
Notes and comments

(1) We revert to the point raised in Section 16.10: the non-obviousness of the equivalence of expressions (16.45) and (16.47) for the risk-sensitive average cost in the state-structured case, under a stabilising policy u = Kx. We may as well then normalise the model to an uncontrolled one with a stable plant equation x_t = Ax_{t−1} + ε_t and an instantaneous cost function xᵀRx. The dynamic-programming evaluation of average cost is γ = −(1/2θ) log|I + θNΠ|, where Π is the solution of
(9) (see Theorem 16.10.1). The general evaluation from Theorem 13.5.2 is γ = −(1/2θ) Abs log|P(z)|, where P(z) = I + θR𝒜⁻¹N𝒜̄⁻¹ and 𝒜 = I − Az. If we take the time-integral approach to evaluation of the predicted path for the process then the value of Φ corresponding to (18.4) is

\Phi = \begin{bmatrix} R & \bar{\mathscr{A}} \\ \mathscr{A} & -\theta N \end{bmatrix}.
We find, with some matrix manipulation, that |Φ| = |𝒜||𝒜̄||P|, so that Abs log|P| = Abs log|Φ|. But we know from the treatment of the Markov case in Section 18.2 that Φ(z) = G₋₁z⁻¹ + G₀ + G₁z has the canonical factorisation (18.14), in which the central factor has a value Q satisfying Q = G₀ − G₋₁Q⁻¹G₁. This equation implies in the present case that
Q = \begin{bmatrix} \Pi & I \\ I & -\theta N \end{bmatrix}

where Π satisfies (9). We thus have Abs log|P| = Abs log|Φ| = log|Q| = log|I + θNΠ|, whence identity of the two evaluations follows.
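The identity can be checked numerically in the scalar case. The sketch below assumes the scalar risk-sensitive Riccati equation takes the form Π = R + A²Π/(1 + θNΠ), and uses hypothetical parameter values; the dynamic-programming value then agrees with the unit-circle average of −(1/2θ) log P(z).

```python
import numpy as np

# Scalar check of the identity between the dynamic-programming and
# spectral evaluations of risk-sensitive average cost.  Assumes the
# scalar risk-sensitive Riccati equation Pi = R + A^2 Pi/(1 + theta N Pi);
# the numbers A, R, N, theta are hypothetical.
A, R, N, theta = 0.5, 1.0, 1.0, 0.2

Pi = R
for _ in range(200):                       # fixed-point iteration
    Pi = R + A**2 * Pi / (1 + theta * N * Pi)
gamma_dp = -np.log(1 + theta * N * Pi) / (2 * theta)

# Spectral evaluation: average log P(z) over the unit circle, where
# P(z) = 1 + theta R N / |1 - A z|^2.
w = np.linspace(0.0, 2 * np.pi, 4096, endpoint=False)
P = 1 + theta * R * N / np.abs(1 - A * np.exp(1j * w)) ** 2
gamma_spec = -np.log(P).mean() / (2 * theta)

print(gamma_dp, gamma_spec)                # the two evaluations agree
```

For these values Π = 1.25 exactly, and both evaluations reduce to −(1/2θ) log 1.25.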
3 A GENERAL FORMALISM

In order to distinguish the wood from the trees it is better to revert to a more abstractly posed problem, which in fact generalises that of Chapter 18 in some respects. Suppose that we are extremising a quadratic time-integral Ω, i.e. a sum over time τ of quadratic functions of a vector 'system variable' x_τ. Suppose that the column vector x_τ can be partitioned as (ξ_τ, η_τ), where the two components are distinguished by the fact that ξ is never observed, but that at time t the second component η_τ has been observed for τ ≤ t. Extremisation of the integral at time t is then conditioned by knowledge of past η values. In the control context above, we think of ξ_τ as having the components (λ, μ, x)_τ and η_τ as having the components (y_τ, u_{τ−1}). Suppose that the time-integral has the form

\Omega = \sum_\tau \bigl[\tfrac{1}{2}\gamma_\tau(\xi, \eta) - \alpha_\tau^T \xi_\tau - \beta_\tau^T \eta_\tau\bigr] + \text{(end terms)}    (10)
THE RISKSENSITIVE (LEQG) VERSION
376
where the sequences {α_τ} and {β_τ} are known and γ has the time-homogeneous quadratic form

\gamma_\tau = \begin{bmatrix} \xi_\tau \\ \eta_\tau \end{bmatrix}^T
\begin{bmatrix} \Gamma_{\xi\xi} & \Gamma_{\xi\eta} \\ \Gamma_{\eta\xi} & \Gamma_{\eta\eta} \end{bmatrix}
\begin{bmatrix} \xi_\tau \\ \eta_\tau \end{bmatrix}.

The entries in the matrix are operators, so that Γ_ξξ, for example, should be written more explicitly as Γ_ξξ(𝒯), a power series in the backward translation operator 𝒯. The integral Ω thus involves cross-terms between values of the system variable at different times. We suppose symmetry of the matrix Γ in that Γ̄′ = Γ.

As ever, we denote the values of ξ_τ and η_τ on the extremal path constrained by knowledge of {η_τ; τ ≤ t} by ξ_τ^{(t)} and η_τ^{(t)}. If x is any function of the path let us define Δx^{(t)} = x^{(t)} − x^{(t−1)}, the change in the value of x on the extremising path as the information gained at time t is added to that already available at time t − 1.

Theorem 21.3.1 Suppose that infinite-horizon limits exist. Then
(i) If x is any linear function of the path then Δx^{(t)} is a matrix multiple of the 'innovation'
\zeta_t = \eta_t - \eta_t^{(t-1)}.    (11)
Specifically,

\Delta\xi_\tau^{(t)} = H_{t-\tau}\,\zeta_t, \qquad \Delta\eta_\tau^{(t)} = K_{t-\tau}\,\zeta_t,    (12)

where K₀ = I and K_j = 0 for j < 0.
(ii) Suppose that the canonical factorisation
\Gamma_{\eta\eta}(z) - \Gamma_{\eta\xi}(z)\Gamma_{\xi\xi}(z)^{-1}\Gamma_{\xi\eta}(z) = \bar\nu(z)\,\nu_0^{-1}\,\nu(z)    (13)
holds. Then the generating functions

H(z) = \sum_{j=-\infty}^{\infty} H_j z^j, \qquad K(z) = \sum_{j=-\infty}^{\infty} K_j z^j

have the evaluations

H(z) = -\Gamma_{\xi\xi}(z)^{-1}\Gamma_{\xi\eta}(z)\,\nu(z)^{-1}\nu_0, \qquad K(z) = \nu(z)^{-1}\nu_0.    (14)
Proof Extremising Ω subject to information at time t we deduce the linear equations

\Gamma_{\xi\xi}\,\xi_\tau^{(t)} + \Gamma_{\xi\eta}\,\eta_\tau^{(t)} = \alpha_\tau \quad (\text{all } \tau)    (15)

\Gamma_{\eta\xi}\,\xi_\tau^{(t)} + \Gamma_{\eta\eta}\,\eta_\tau^{(t)} = \beta_\tau \quad (\tau > t)    (16)
which then imply that

\Gamma_{\xi\xi}\,\Delta\xi_\tau^{(t)} + \Gamma_{\xi\eta}\,\Delta\eta_\tau^{(t)} = 0 \quad (\text{all } \tau)    (17)

\Gamma_{\eta\xi}\,\Delta\xi_\tau^{(t)} + \Gamma_{\eta\eta}\,\Delta\eta_\tau^{(t)} = 0 \quad (\tau > t).    (18)
But Δη_τ^{(t)} is zero for τ < t, and for τ = t equals the 'innovation' (11). Assertion (i) then follows from this statement and equations (17), (18). If we form the generating functions H(z) and K(z) then it follows from equations (17) and (18) that

\Gamma_{\xi\xi}(z)H(z) + \Gamma_{\xi\eta}(z)K(z) = 0    (19)

\Gamma_{\eta\xi}(z)H(z) + \Gamma_{\eta\eta}(z)K(z) = G(z)    (20)

where G(z) is a function whose expansion on the unit circle contains only non-positive powers of z. Suppressing the z-argument for simplicity we then have
H = -\Gamma_{\xi\xi}^{-1}\Gamma_{\xi\eta}K,    (21)

(\Gamma_{\eta\eta} - \Gamma_{\eta\xi}\Gamma_{\xi\xi}^{-1}\Gamma_{\xi\eta})K = G.

From this last equation it follows that ν₀⁻¹νK = ν̄⁻¹G. But since one side of this last equation has an expansion on the unit circle in non-negative powers and the other in non-positive powers, they must both be constant, and the constant must be ν₀⁻¹ν₀K₀ = I. Thus ν₀⁻¹νK = I, which together with (21) implies the determinations (14) of K and H. □

The conclusions are attractive. However, if we suppose 'pth-order dynamics' in that the matrix Γ(𝒯) of the cost function (10) involves powers of 𝒯 only in the range [−p, p], then the expression factorised in (13) is not similarly restricted; it is a matrix of functions rational in z.
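The factorisation appealed to in (13) can be made concrete in the simplest scalar case. The sketch below (with hypothetical coefficients) factorises a scalar Laurent form g₀ + g₁(z + z⁻¹) as ν̄(z)ν₀⁻¹ν(z) with ν(z) = 1 + az, choosing the root with |a| < 1 so that ν has its zero outside the unit circle:

```python
import numpy as np

# Factor  c(z) = g0 + g1 (z + 1/z)  as  (1 + a/z) nu0^{-1} (1 + a z).
# Matching coefficients gives a/(1 + a^2) = g1/g0; take the root |a| < 1.
# The coefficients g0, g1 are hypothetical (g0 > 2|g1| for positivity).
g0, g1 = 2.0, 0.6
r = g1 / g0
a = (1 - np.sqrt(1 - 4 * r**2)) / (2 * r)   # small root of r a^2 - a + r = 0
nu0 = (1 + a**2) / g0

# Verify the factorisation on the unit circle.
z = np.exp(1j * np.linspace(0.0, 2 * np.pi, 64))
c = g0 + g1 * (z + 1 / z)
fact = (1 + a / z) * (1 + a * z) / nu0
print(np.abs(c - fact).max())               # ~0: the factorisation reproduces c
```

For matrix-valued or rational forms (the situation the text warns about) no such closed-form root is available and one must resort to iterative spectral-factorisation methods.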
PART 5
Near-determinism and Large Deviation Theory

Large deviation theory is enjoying an enormous vogue, both because of its mathematical content and the many ways in which this can be viewed, and because of the large range of applications for which it proves a natural tool. Chapter 22 gives an introduction to the topic, by an approach perhaps closer to that of the physicist than of the probabilist. However, some justification before even that introduction would not come amiss. The theory has a clear relevance to control theory, as we explain in Chapters 23 and 25. However, in some respects one needs a more refined treatment to capture some of the essential stochastic effects. We cover such refinements in Chapter 24, which could indeed be read directly, as it demands scarcely more of large deviation theory than we now sketch.

Large deviation theory is a shifted version (in quite a literal sense, as we shall see) of the two basic limit assertions of elementary probability theory: the law of large numbers and the central limit theorem. The second can be regarded as a refined version of the first (in a certain sense; that of weak convergence). The two assertions have process versions: the convergence of a stochastic process to a deterministic process or to a diffusion process if it is subject to an ever-increasing internal averaging in a sense which we make explicit in Section 22.3.

To see how large deviation theory extends these concepts, consider a process {x(t)} over a time interval (0, h), and suppose the values of x(h) and x(0) prescribed. Suppose further that x(h) does not lie on the limit-deterministic path starting from x(0) (and we shall soon be more specific on the magnitude of the deviation assumed). Then one can still determine a most probable path between the prescribed endpoints. Large deviation theory demonstrates that, under
appropriate assumptions, this most probable path is just the limit-deterministic path for a 'tilted' version of the process. It also provides a first approximation to the probability that x(h) takes the prescribed value, conditional on the value of x(0). This estimate is sufficient for many purposes, but can be improved if one approximates the tilted process by a diffusion rather than a deterministic process.

Let us be somewhat more specific. Consider a random scalar x which is the arithmetic average of κ independently and identically distributed scalar random variables ξ_j (j = 1, 2, ..., κ) of mean μ and variance σ². (A symbol such as N or n would be more conventional than κ, but these are already in full use.) Then x has mean μ and variance σ²/κ, and converges to μ with increasing κ in almost any stochastic sense one cares to name (the 'law of large numbers', in its various versions). In particular, for sufficiently regular functions C(x) one has

E_\kappa[C(x)] = C(\mu) + o(1)    (1)
for large κ. Here we have given the expectation operator a subscript κ to indicate that the distribution of x depends upon this parameter. A strengthening of (1) would be

E_\kappa[C(x)] = C(\mu) + O(\kappa^{-1}),    (2)

which makes clear that the remainder term in (1) may be in fact O(κ⁻¹) rather than anything weaker. One obtains stronger conclusions if one allows the function under the expectation to depend upon κ as well as upon x. For example, the central limit theorem amounts to the assertion (again for sufficiently regular C) that

E_\kappa\{C[\sqrt{\kappa}\,(x - \mu)/\sigma]\} = E[C(\eta)] + o(1)    (3)

where η is a standard normal variable. The large deviation assertion is that the x-distribution determines a function D(x), known as the rate function, such that

E_\kappa\{\exp[-\kappa C(x)]\} = \exp\bigl\{-\kappa \inf_x [C(x) + D(x)] + o(\kappa)\bigr\}.    (4)

(The precise result is Cramer's theorem, which also evaluates the rate function; see Section 22.2.) The interest is that the function under the expectation is exponential in κ, and it is this which forces the value of x contributing most to the expectation away from the central value μ. The point of the assertion is that there is indeed such a dominant value, and that it is the value minimising C(x) + D(x). This last observation partly explains the reason for the term 'large deviation', which perhaps becomes even clearer if we consider distributions. If it is proper to allow e^{−C} in (4) to be the indicator function of a set 𝒜 then (4) becomes
P_\kappa(x \in \mathscr{A}) = \exp\bigl[-\kappa \inf_{x \in \mathscr{A}} D(x) + o(\kappa)\bigr].    (5)
In considering the event x ∈ 𝒜 one is considering deviations of x from μ of order one (if μ does not itself lie in 𝒜), whereas the probable deviations are of order 1/√κ. One is then indeed considering deviations which are large relative to what is expected, and D(x) expresses the behaviour of the tails of the x-distribution well beyond the point at which the normal approximation is generally valid. Results of the type of (5) are extremely valuable for the evaluation of quantities such as the probability of transmission error in communication contexts, or the probability of system failure in reliability contexts.

If we regard C(x) as a cost function then evaluation (4) seems perfectly tailored to treatment of the risk-sensitive case. For fixed θ and large κ we would have
E_\kappa\{\exp[-\kappa\theta C(x)]\} = \exp\bigl\{-\kappa\,\mathrm{ext}_x[\theta C(x) + D(x)] + o(\kappa)\bigr\}.    (6)

If we wrote the left-hand member as exp(−κθF), so defining F as an effective cost under the risk-sensitive criterion, then we would have
F = \mathrm{ext}_x[C(x) + \theta^{-1}D(x)] + o(1),    (7)
where 'ext' indicates an infimum or a supremum according as θ is positive or negative. This opens the intriguing possibility that the risk-sensitive treatment of control in Chapter 16 has a natural non-LQG version, at least for processes which are near-deterministic in the particular sense that large deviation theory requires. The rate function D(x) is then an asymptotic version of the discrepancy function of Section 12.1. However, the hope that all the theory based on LQG assumptions has more general validity should be qualified. If we revert to the risk-neutral case θ → 0 then it will transpire that relation (7) becomes simply F = C(μ) + o(1), which is in most cases too crude to be useful: a warning that large deviation results will often be too crude. Nevertheless, they give a valuable first indication, which can be refined. Further, the rate function does in a sense encapsulate the essential stochastic features of the ξ-distribution, and results such as those of the nonlinear filtering section (Section 25.5) are not at all too crude, but again encapsulate the essentials.

In considering a single random variable x we have considered a static problem. As indicated above, these ideas can be generalised to the dynamic case, when one is concerned with a stochastic process {x(t)}. In this case one should see the variable κ as representing the physical (perhaps spatial) scale of a system, and x(t) as representing a physical average over the system at time t. We consider the process version in Section 22.4, a necessary preliminary to control applications.

As we have emphasised and shall demonstrate, a large deviation evaluation such as (5) follows from the law of large numbers, and has nothing to do with normal approximation. However, in the Gaussian case, when the ξ-variables are normally distributed, then the central limit theorem is exact in that x is normally
distributed over its whole range. The probability evaluation (5) must then necessarily be virtually coincident with the evaluation of the normal integral (after appropriate scaling of the variable) over 𝒜. Indeed, it turns out that the rate function D(x) has evaluation (x − μ)²/2σ², the familiar negative exponent in the normal density. In other words, if variables are normally distributed then large deviation theory is 'exact', in that the effective asymptotic density const. exp[−κD(x)] coincides with the actual density. As a further indication of the same point: if the ξ-variables are Gaussian and the function C(x) in (4) is quadratic then the remainder term in the right-hand member is constant and O(1), in that it is equal to ½ log[D″/(C″ + D″)], where the double-primed quantities are the second derivatives of C and D. It is for these reasons that results which we obtain as large deviation approximations coincide, under LQG assumptions, with results which we know from Chapter 16 to be exact.
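The exactness claim can be checked in the simplest quadratic case: for a Gaussian mean and quadratic cost, the extremum in (7) reproduces the exact exponent of the risk-sensitive expectation. A sketch with hypothetical values (θ > 0, so 'ext' is an infimum):

```python
import numpy as np

# With x ~ N(mu, sigma^2/kappa), C(x) = c x^2/2 and D(x) = (x-mu)^2/(2 sigma^2),
# the extremum  inf_x [C(x) + D(x)/theta]  should equal the exact exponent
# c mu^2 / (2 (1 + theta c sigma^2))  of  E exp(-kappa theta C(x)),
# obtained by direct Gaussian integration.  All numbers are hypothetical.
mu, sigma, c, theta = 1.0, 0.7, 2.0, 0.3

xs = np.linspace(-5.0, 5.0, 400001)
F_ext = np.min(c * xs**2 / 2 + (xs - mu)**2 / (2 * theta * sigma**2))
F_exact = c * mu**2 / (2 * (1 + theta * c * sigma**2))
print(F_ext, F_exact)   # agree to grid accuracy
```

The O(1) discrepancy between exponents, −½ log(1 + θcσ²) here, is exactly the constant remainder term described above.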
CHAPTER 22

The Essentials of Large Deviation Theory

1 THE LARGE DEVIATION PROPERTY

There is a gain in clarity and speed if we define the large deviation property first and then motivate it, although of course this inverts the order of both history and understanding. There are many ways of approaching the subject; we follow a route which is economical and which is natural for the control context. A treatment as brief as this necessarily lacks rigour at many points, but the fact that the argument is so natural will perhaps reassure the reader that it is also rigorisable. We proceed through a succession of cases: first the 'static' case of a single random variable, then the dynamic case of a stochastic process and finally, in Chapters 23-25, the case of a controlled stochastic process. We have already attempted a first orientation in the introductory section above. It is perhaps then appropriate to treat that introduction as the zeroth section of this chapter, and to refer to its equations as equations of this chapter.

Consider a random vector variable x whose distribution depends upon a non-negative parameter κ. One is thus considering a family of distributions indexed by κ. The corresponding probability measures and expectation operators can then be written P_κ and E_κ respectively, if one wishes to emphasise the dependence. The large deviation property then states, roughly, that there exists a function D(x), the rate function, such that (5) (in the preamble to Part 5) holds for large κ and for sufficiently regular sets 𝒜. Equivalently, at this level of rigour, (4) holds for large κ and sufficiently regular scalar functions C(x). It is relation (5) which is usually taken as the characterising property, stated more carefully as an asymptotic inequality in one direction for closed sets 𝒜 and in the other direction for open sets.
It is the expectation characterisation (4) which is more natural in our context, however, regarded as valid at least for continuous functions C obeying appropriate growth conditions. Neither property implies the other without supplementary conditions, but relation (5) would of course follow formally from (4) if one took exp[−C(x)] as the indicator function of the set 𝒜. However, such a discontinuous choice may be unacceptable, and it is because of the possible trouble associated with such discontinuities that rigour requires a more cautious formulation of assertion (5).

Why one should expect versions of either of these relations to hold and under what conditions has yet to be explained. However, note the implications. The
expression for both the probability and the expectation is exponential in κ, to first order, with the coefficient of κ in the exponent determined in terms of the rate function as indicated. Further, it is a single extremising value of x which contributes dominantly to both probability and expectation. In the case of P_κ(x ∈ 𝒜) this value of x is that value in 𝒜 which is most probable on the basis of a kind of unnormalised density exp[−κD(x)]. In the case of the expectation the value of x which contributes dominantly has to compromise between achieving a high value of exp(−C) and high probability. In either case, the operation of integration (with respect to a probability measure) has been replaced by that of extremisation.

In fact, one seems to find the large deviation property in just one class of cases. This class is characterised by three properties. (i) The parameter κ measures the scale of the stochastic system being considered. For example, both the capacity of and demand on a telephone network might be of order κ, or κ might measure the size of an insurance company. (ii) The random variable x is an average over the system of a vector-valued random variable. Thus, it might represent the instantaneous traffic being carried per unit of capacity on the various links of the telephone network, or the current dividend which the insurance company announces. (iii) The system has the stochastic homogeneity and ergodicity properties which imply that x behaves as the average of κ independently and identically distributed random variables, in that

E_\kappa(e^{\kappa\alpha x}) = e^{\kappa[\psi(\alpha) + o(1)]}    (8)

for some ψ and for all values of the row vector α for which the left-hand member is defined.

Let us note a few points of definition and notation. If ξ is a vector-valued random variable then

M(\alpha) = E(e^{\alpha\xi})

is its moment generating function (abbreviated to MGF). Here ξ is assumed to be a column vector and α then a row vector. M(α) certainly exists for purely imaginary α, and will exist for other values if the tails of the ξ-distribution decay at least exponentially fast. We shall have occasion to work with the function

\psi(\alpha) = \log M(\alpha),

the cumulant generating function of ξ (abbreviated to CGF). These functions have a number of important properties which we summarise in Appendix 3. We use the abbreviation IID for 'independently and identically distributed'.
M(o:) = E(eek{) is its moment generating function (abbreviated to MGF). Here~ is asswned to be a column vector and a: then a row vector. M(o:) certainly exists for purely imaginary a:; and will exist for other values if the tails ofthe rydistribution decay at least exponentially fast We shall have occasion to work with the function 1/J(o:) =log M(o:), the cumulant generating function of~ (abbreviated to CGF). These functions have a number of important properties which we summarise in Appendix 3. We use the abbreviation liD for 'independently and identically distributed'. 2 THE STATIC CASE: CRAMER'S THEOREM Cramer's theorem concerns the case when relation (8) holds exactly; i.e.
E""(e"ax)
= e"1/J(o)
(9)
for some ψ for all real α for which the left-hand side exists. This will be the case (at least for κ integral, a case to which we can restrict ourselves) when x is the arithmetic mean of κ IID vector random variables ξ_j with CGF ψ(α).
Theorem 22.2.1 (Cramer's theorem) Suppose that relation (9) holds for all α in some set of real values with non-empty interior. Then x has the large deviation property and the rate function has the evaluation

D(x) = \sup_\alpha[\alpha x - \psi(\alpha)].    (10)
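The Legendre-transform evaluation (10) can be checked numerically in the Gaussian case, where the supremum is attainable in closed form: with ψ(α) = αμ + α²σ²/2 the supremum is attained at α = (x − μ)/σ² and gives D(x) = (x − μ)²/2σ². A sketch with hypothetical μ, σ:

```python
import numpy as np

# Numerical Legendre transform D(x) = sup_alpha [alpha x - psi(alpha)]
# for a Gaussian summand with CGF psi(alpha) = alpha mu + alpha^2 sigma^2/2.
mu, sigma = 1.0, 2.0
alphas = np.linspace(-10.0, 10.0, 200001)

def D_numeric(x):
    return np.max(alphas * x - (alphas * mu + 0.5 * alphas**2 * sigma**2))

for x in (0.0, 1.0, 3.5):
    print(x, D_numeric(x), (x - mu)**2 / (2 * sigma**2))
```

The grid maximisation reproduces the closed form to the accuracy of the grid, and D(μ) = 0, as a rate function must satisfy at the mean.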
Cramer's original theorem asserts the large deviation property in that (5) is shown to hold if 𝒜 is an interval, or indeed any countable union of intervals. We shall rather give a proof of the expectation relation (4) which makes the line of proof for case (5) clear. Note the interesting evaluation of the rate function (10): D(x) is seen to be the negative of the minimum transform (or Legendre transform) of the CGF ψ(α). This in itself raises a number of fascinating issues, which we briefly mention in Exercise 2. We shall need to appeal to some differentiability and convexity properties of the CGF, listed in the theorems of Appendix 3. The proofs, although brief, are deferred to Appendix 3 so as not to break the argument.

Consider now the modification of the distribution of ξ by tilting, so that the expectation of a function φ(ξ) of ξ becomes rather

E^{(\alpha)}[\phi(\xi)] = \frac{E[\phi(\xi)e^{\alpha\xi}]}{M(\alpha)}.    (11)

That is, one weights the original ξ-distribution by the exponential factor e^{αξ} and then renormalises the distribution. This apparently arbitrary concept is in fact motivated naturally, both by the proofs of the theorems of Appendix 3 and by the use to which we shall shortly put it. We refer to α in (11) as the tilt parameter.
Theorem 22.2.2 (i) The mean of the α-tilted distribution is E^{(α)}(ξ) = ψ_α, the column vector of first differentials of ψ at α.
(ii) Define the convex function of the column vector a

D(a) = \sup_\alpha[\alpha a - \psi(\alpha)].    (12)

If α is chosen as the value at which the supremum is achieved in (12) then E^{(α)}(ξ) = a.
(iii) Furthermore, at points where the derivatives exist,

D_a = \alpha, \qquad D_{aa} = [\psi_{\alpha\alpha}]^{-1},    (13)

where a and α are the corresponding values determined by (12).
Proof It follows from the definition (11) that E^{(α)}(ξ) = [∂M(α)/∂α]/M(α) = ψ_α. The function D(a) defined in (12) is certainly convex, being a supremum of linear functions of a. The supremum will be attained at the value of α determined by ψ_α = a, so that the tilted distribution whose parameter has this value indeed has mean a. To prove assertion (iii), write the definition (12) as

D(a) = \sup_\alpha G(\alpha, a).
It follows then by standard arguments that D_a = G_a and D_{aa} = G_{aa} − G_{aα}G_{αα}⁻¹G_{αa}, where α is given its extremising value. These relations reduce to those asserted in (13) in the particular case (12). □

Suppose that the IID variables ξ_j are given the tilted distribution (11). The correspondingly tilted expectation of a function φ(x) of the arithmetic mean x = κ⁻¹Σ_j ξ_j can then be written in terms of the untilted expectation as

E_\kappa^{(\alpha)}[\phi(x)] = M(\alpha)^{-\kappa} E_\kappa[\phi(x)e^{\kappa\alpha x}].

We can reverse this relationship to obtain

E_\kappa[e^{-\kappa C(x)}] = E_\kappa^{(\alpha)}\bigl\{e^{-\kappa[C(x) + \alpha x - \psi(\alpha)]}\bigr\}.
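The tilting operation (11) and assertion (i) of Theorem 22.2.2 can be checked by simple importance weighting. A Monte Carlo sketch with a standard normal ξ (for which ψ(α) = α²/2, so the tilted mean should be α itself; the sample size, seed and tilt value are arbitrary):

```python
import numpy as np

# The alpha-tilted distribution: weight the xi-distribution by e^{alpha xi}
# and renormalise.  For xi ~ N(0,1) the tilted mean is psi_alpha = alpha.
rng = np.random.default_rng(0)
xi = rng.standard_normal(500000)
alpha = 0.8

weights = np.exp(alpha * xi)
tilted_mean = np.average(xi, weights=weights)   # E[xi e^{alpha xi}] / M(alpha)
print(tilted_mean)                              # close to alpha = 0.8
```

The same self-normalised weighting is the computational counterpart of the reversal relation above, and is the basis of importance-sampling estimates of small tail probabilities.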
3 OPERATORS AND SCALING FOR MARKOV PROCESSES

A very early result of this type was deduced by Miller (1961), who remarked on the surprising fact that Λ(α) need not itself be an MGF, although it shares many of the properties. An alternative expression of the rate function, also surprising, is
Note that this application corresponds to an averaging over time, and is quite distinct from our later consideration of temporal stochastic processes which show an averaging over the system.

(6) Ruin of an insurance company (Martin-Löf 1986) Suppose that an insurance company begins with capital κ and that capital then develops as a process of independent increments whose increment in unit time has CGF ψ(α). Let T be the time at which ruin occurs (i.e. when capital first runs negative). Then a classic result due to Lundberg states that E(e^{λT}) =
and of size O(κ⁻¹) in a given interval of time means that it indeed has something like an effective rate of change for large κ.
5 HAMILTONIAN ASPECTS

The extremal path x(·) determined by the evaluation of expression (35) is 'optimal' in that it compromises best between high probability and low cost (or, equivalently, low discrepancy and low cost). We shall term it limit-optimal, since we shall wish to consider stochastic deviations from it in a more refined treatment. A special case is that in which the process is considered over the time interval (0, h] and there is no cost at all except that which effectively prescribes the terminal value x(h). The initial value x(0) is of course prescribed. The limit-optimal path in this case is then the (asymptotically) most probable path between the prescribed end-values. It minimises expression (36)
for the rate function with respect to this path, and so is subject to the pair of equations

\dot x = \frac{\partial H}{\partial \alpha}, \qquad \dot\alpha = -\frac{\partial H}{\partial x}.
κ. This is exactly the condition deduced in (7.58) for the option of crashing to be cheaper than that of striving to clear the peak.

(2) Suppose that the specification (33) of the 'terminal' cost is modified to K(y) = Qy²/2s₁ for y > h. This would represent the further control cost involved if one were required to bring the particle to zero height a time s₁ after having cleared the peak. There is then an incentive to clear the peak by no more than is necessary. Show that the formula analogous to (34) is

e^{-F(x,t)/NQ} = \sqrt{\lambda}\,\exp\bigl[\tfrac12(\zeta_1^2 - \zeta_2^2)\bigr]\bigl[1 - \Phi(\zeta_2)\bigr] + e^{-K/NQ}\,\Phi(\zeta_1)
where

\zeta_1 = \frac{h - x}{\sqrt{Ns}}, \qquad \zeta_2 = \Bigl(h - \frac{s_1 x}{s + s_1}\Bigr)\Big/\sqrt{\frac{N s s_1}{s + s_1}}

and

\lambda = \frac{s}{s + s_1}.
That is, ζ₁ and ζ₂ are respectively proportional to the current distance below the peak and the distance that one would be below the peak at time t₁ if one continued on the direct flight path to the final landing point at time t₁ + s₁. Note that the exponent ½(ζ₁² − ζ₂²) reflects the difference in control costs between the straight-line paths to the final landing point from the peak and from the starting point respectively. Show that in the case K = +∞ the optimal control is
CONTROLLED FIRST PASSAGE
u = -\frac{x}{s + s_1} + \frac{\sqrt{N\lambda}}{s}\,\frac{\phi(\zeta_2)}{1 - \Phi(\zeta_2)}.
The two terms represent respectively the control that would take the particle to the final destination by the straight-line path and the correction needed to lift it over the peak. Show that in the case of ζ₂ large and positive (when the straight-line path would meet the mountain) the formula analogous to (37) is

u = \frac{h - x}{s} + N\Bigl[\frac{h(s + s_1)}{s_1} - x\Bigr]^{-1} + o(N).    (39)
The first term sets a course for the peak. The second introduces a lift inversely proportional to the amount by which the particle is currently below the straight-line path passing through peak and destination. We can recast the control rule (39) in the more significant form (38), but with the constant d now modified to d = (g + h/s₁)⁻¹, where g = (h − x)/s is the constant gradient of the optimal deterministic approach path. The term h/s₁ is correspondingly the descent rate required on the other side of the mountain. The effect is then again that one aims to clear the peak by an amount Nd, this intended clearance showing only slow variation on the optimal approach path, but decreasing as the steepness of either required approach or subsequent descent is increased.

6 CRASH AVOIDANCE FOR THE INERTIAL PARTICLE

Consider the stochastic version of the inertial particle model of Section 7.10 in which the applied forces are the sum of control forces and process noise, so that the stochastic plant equation is
\dot x = v, \qquad \dot v = u + \epsilon.
Then the condition (2) is again trivially satisfied, with κ = 1/NQ. It would be interesting to consider both landing and obstacle-avoidance for this model. However, the most natural first problem to consider is that analysed in Section 7.10: that of pulling out of a dive with minimal expenditure of control energy. That is, one begins with x > 0, v < 0 and wishes to minimise expected total control cost up to the moment when velocity v is first zero, under the condition that height x should be non-negative at this point (and so at earlier points). The moment when v becomes zero is to be regarded as the moment when one has pulled out of the dive. In a first treatment, costs incurred after this moment are not considered.

Recall the results of the deterministic case. If no control is exerted then the particle of course crashes after a time −x/v. The optimal control is u = 2v²/3x, which brings the particle out of its dive (and grazing the ground) after a time −3x/v.
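These deterministic assertions can be verified by direct integration. A sketch (the initial values are hypothetical; a crude Euler scheme is stopped at mid-path, since the control becomes 0/0 at the graze itself):

```python
# Plant xdot = v, vdot = u with the control u = 2 v^2 / (3 x).
# Starting at x = 1, v = -1 the particle should follow the cubic path
# x(t) = (1 - t/T)^3 with T = -3 x(0)/v(0) = 3, grazing as v reaches zero.
x, v = 1.0, -1.0
T = -3.0 * x / v
dt = 1e-5
t = 0.0
while t < 0.5 * T:                 # integrate to mid-path
    u = 2 * v * v / (3 * x)
    x += v * dt
    v += u * dt
    t += dt

# Exact values at t = T/2: x = (1/2)^3 = 0.125, v = -(3/T)(1/2)^2 = -0.25.
print(x, v)
```

One can confirm that x and v shrink together in the ratio required for the cubic path, so that both reach zero at t = T.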
Relation (4) now becomes

e^{-F(x,v)/NQ} = 1 - P + P e^{-K/NQ}
where P = P(x, v) is the probability of a crash (i.e. the probability for the uncontrolled process that height is negative when velocity is first zero) and K is the penalty incurred at a crash (assumed constant). If we assume K infinite then this reduces simply to 1 − P, and the optimal control is

u = -\frac{N P_v}{1 - P}.    (40)
Let us use s to denote elapsed time, with x(0) = x, v(0) = v. Then x(s) is normally distributed with mean x + vs and variance Ns³/3, and the probability that x(s) is negative is Φ(ζ), where

\zeta = -(x + vs)\sqrt{3/(Ns^3)}.    (41)
Now, the point at which stochastic variation from the limit-optimal path is likely to take the path into 𝒮 is, by the discussion of Section 2, just the grazing point of the optimal deterministic path, which we know to be at s = −3x/v. We see from (41) that this is also the value of s which maximises ζ, and so maximises the probability that x(s) ≤ 0. One is then led to a conjecture: that the grazing points of the optimally controlled deterministic path are just the points which maximise the probability that the uncontrolled stochastic path has strayed into 𝒮. The conjecture is true in a class of cases which includes this one, as we shall see from the next section, but not generally.

The value of ζ at this grazing point is

\zeta = \frac{2|v|^{3/2}}{3\sqrt{Nx}},

and we shall have P(x, v) ≈ P(x(s) ≤ 0) = Φ(ζ). (Strictly, it is the logarithms of these quantities whose ratio is asymptotically unity.) To within terms of smaller order in N we thus have
u = \sqrt{3N/s}\;\frac{\phi(\zeta)}{1 - \Phi(\zeta)}.    (42)
If ζ is large then we can appeal to the approximation (22) and deduce that
2y2
3N 2v
u"'.
3x
(43)
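Since several signs in (42) and (43) have had to be restored here, a numerical cross-check is prudent. The sketch below compares the exact expression (42), the expansion (43), and the constant-clearance form (44) of the theorem below, with d = −9x²/4v³ as reconstructed here; for the illustrative values x = 100, v = −10, N = 0.01 all three agree to within terms of order N².

```python
import math

# Cross-check of (42), (43) and (44) at illustrative values for which
# zeta is large.  All three should differ only by terms of order N^2.
x, v, N = 100.0, -10.0, 0.01

zeta = (2.0 / 3.0) * math.sqrt(-v**3 / (N * x))   # grazing-point zeta
zeta_v = -2.0 * v**2 / (3.0 * N * x * zeta)       # d(zeta)/dv
hazard = (math.exp(-zeta**2 / 2) / math.sqrt(2 * math.pi)) \
         / (math.erfc(zeta / math.sqrt(2)) / 2)   # phi(zeta)/(1 - Phi(zeta))
u_exact = -N * zeta_v * hazard                    # (42)
u_asymp = 2 * v**2 / (3 * x) - 3 * N / (2 * v)    # (43)
d = -9 * x**2 / (4 * v**3)
u_margin = 2 * v**2 / (3 * (x - N * d))           # (44)
print(u_exact, u_asymp, u_margin)
```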
This again follows the pattern: optimal deterministic control plus a cautionary height-gaining correction. In this case the correction is inversely proportional to current velocity rather than current height. As for the obstacle-avoidance problem of the last section, the form of the correction term is somewhat counter-intuitive, but is explained in the same way. It can also be given the more illuminating form of preservation of a near-constant clearance.
Theorem 24.6.2 At values of x and v for which ζ is large and positive the optimal control is

u = 2v²/[3(x − Nd)] + o(N)    (44)

for small N, where d = −9x²/4v³ (positive, since v < 0) is invariant on the deterministic path.

That is, the effect of plant noise on the optimal control is that one behaves as though one had to miss the ground by a margin of Nd, this margin being essentially constant until one is within distance o(N) of the ground. The stronger the control needed to pull out of the dive, the smaller will be the safety margin Nd thus allowed.
Proof If we equate the expressions (43) and (44) for the control then we obtain the evaluation asserted for d. It follows from the analysis of Section 7.10 that vx^{−2/3} is indeed constant on the deterministic path, whence so is d. The point of the theorem is then again that this correction is constant on the deterministic path, and so subject only to slow variation on the stochastic path. □

7 THE AVOIDANCE OF ABSORPTION GENERALLY

The conclusions of the last section generalise nicely. Suppose that we consider the stochastic version of the general avoidance problem of Section 7.11. That is, the plant equation has the linear form (17) with control cost ½∫ u^T Qu dτ, and the stopping set 𝒮 is the half-space a^T x ≥ b. As emphasised in Section 7.11, this last assumption is not as special as it may appear. Let us normalise a so that |a| = 1; a modification of b to b + d then shifts the plane a distance d in the direction normal to it. As in Section 7.11, we shall define the variable z = a^T x.

Since the process is time-homogeneous we may as well take the starting and termination points as t = 0 and t = s, so that ξ = (x, 0), ξ̄ = (x̄, s), and s is the time remaining to termination at x̄. We take the termination point s as the grazing point; the first time at which, if no further control were exerted, the uncontrolled deterministic path would henceforth avoid 𝒮. We may as well then assume that initial conditions are such that there is such an s; i.e. that x is such that some control is necessary to avoid 𝒮. We again make the assumption that the relation N = κ⁻¹J holds, and assume κ large and J fixed. We have then
ψ(x) := e…

But then the expression reduces exactly to the expression for F(x, t) asserted in Theorem 23.2.1. The corresponding evaluation of P(W(t), x) follows by direct appeal to the rate-function evaluation (22.36). The course of the Markov process with state variable (x, z) for given u(·) (not necessarily optimal) over the time interval 0 ≤ τ ≤ t will have density

f(X(t), Z(t)) = exp[−κD(t) + o(κ)]

where X(t) denotes x-history to time t, etc., and
D(t) = D₀(x(0)) + sup_{α(·), β(·)} ∫_0^t [αẋ + βż − H(x, u, α, β)] dτ.
The large deviation property then implies that (9) holds, with the identification

θP(W(t), x) = inf_{x(·)} [θC₁ + D(t)]

where C₁ is the cost up to time t and the infimum is subject to x(t) = x. But this yields exactly the evaluation (14) if we set ż = y, α = θλ^T and β = θμ^T. □

The statement that one would generally recognise as a maximum principle follows if one writes down the stationarity conditions on the path integral (14). In terms of the effective Hamiltonian
these would be

ẋ = ∂ℋ/∂λ^T,    λ̇^T = −∂ℋ/∂x,    ẏ = ∂ℋ/∂μ^T    (0 < τ < h),
μ = 0    (t < τ < h),    ℋ is maximal in u    (t ≤ τ < h),
u is prescribed    (0 ≤ τ < t),    y is prescribed    (0 < τ ≤ t),

plus the end conditions implied by the theorem.
4 EXAMPLES AND SPECIAL CASES

Under the LQG assumptions of the usual quadratic cost function and joint DCF (3) one readily finds that the assertions of the last section agree with those derived for the state-structured LEQG case in Chapter 16, known in that case to be exact.

If we go to the risk-neutral limit we find that the asymptotically optimal control is just u(x̂(t), t), where u(x, t) is the optimal risk-neutral perfect-information control and x̂(t) the value of x minimising D(W(t), x): the large deviation version of the conditionally most probable value. As indicated above, this conclusion rests on nothing more sophisticated than the facts that x(t) − x̂(t) has zero expectation and a covariance matrix of order κ⁻¹.

If one formally lets θ tend to zero in the integral of (14) one obtains what seems like an excessive collapse, in that θ⁻¹H(x, u, θλ^T, θμ^T) reduces to λ^T a(x, u) + μ^T c(x, u), where c(x, u) is the expected value of y(t) conditional on the past. That is, the process seems to reduce to determinism in observation as well as plant, which is surely too much! However, one will then be left with a term μ^T(y − c(x, u)) in the integral, whose extreme value with respect to μ will be infinite unless y is exactly equal to its expected value c(x, u). This simply reflects the fact that the rate function D for the process over a given time interval is non-zero unless the process lies on its expected (deterministic) path, so that in other cases D/θ will have an infinite limit as θ tends to zero. Consideration of the behaviour of D on its own is the theme of our final section: the consideration of process statistics unskewed by costs.
5 PURE ESTIMATION: NONLINEAR FILTERING

An interesting special case is that for which there is no aspect of cost or choice of control policy, and one is simply trying to infer the value of x(t) from knowledge of current observations W(t), possibly in the presence of past (and so known) controls. In the LQG case this inference is supplied by the Kalman filter, supplemented by the recursive calculation of associated matrices such as the covariance matrix V, as we saw in Chapter 12. The general problem is referred to as that of 'nonlinear filtering' by analogy with this case. If we specialise the material of this chapter we have then a large deviation treatment of the nonlinear filtering problem which is exact under LQG assumptions. Since the controls are known and have no cost significance we can write W(t) simply as Y(t), the observation history. Consider the expression

D₁(Y(t), x) = inf_{x(·)} [D₀(x(0)) + sup_{α(·), β(·)} ∫_0^t [αẋ + βy − H(x, u, α, β)] dτ]    (15)
where the infimum is subject to x(t) = x. Then exp[−κD₁] is the large deviation approximation to the joint probability density of x(t) and Y(t), and so provides, to within a Y(t)-dependent normalising factor, the large deviation approximation to the probability density of the current state value x(t) conditional on observation history W(t). However, as we have seen in Chapters 12 and 20 and Exercise 22.7.1, the updating of such conditional distributions is more naturally carried out in transform space. We shall in fact deduce a large deviation form of a forward updating equation for the corresponding MGF

M(α, t) = E[e^{αx(t)} | Y(t)].    (16)
*Theorem 25.5.1 Suppose that the Markov process {x(t), z(t)}, where ż = y, has DCF κH(x, u, κ⁻¹α, κ⁻¹β) with known past controls u. Then for large κ the conditional MGF (16) has the form

M(α, t) = exp[κψ(κ⁻¹α, t)]    (17)

for t > 0 if it has this form for t = 0. The unscaled CGF ψ obeys the updating equation

∂ψ(α, t)/∂t = σ(α) − σ(0)    (18)

where

σ(α) = inf_β [H(ψ′, u, α, β) − βy],  with ψ′ = ∂ψ/∂α.    (19)

Proof Relation (17) will hold at t = 0 under the assumption (4), when we shall have

ψ(α, 0) = sup_x [αx − D₀(x)].    (20)
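Relation (20) is simply a Legendre transform of the initial rate function. For a Gaussian D₀(x) = (x − m)²/2V it yields the familiar quadratic CGF αm + ½Vα², as the following crude grid search confirms (the values of m, V and α are illustrative).

```python
# psi(alpha, 0) = sup_x [alpha x - D0(x)] for the Gaussian rate
# function D0(x) = (x - m)^2 / (2 V); the supremum is attained at
# x = m + V alpha and equals alpha m + V alpha^2 / 2.
m, V, alpha = 1.3, 0.7, 0.9

xs = [m - 5 + 1e-4 * k for k in range(100001)]   # crude grid search
psi_num = max(alpha * x - (x - m)**2 / (2 * V) for x in xs)
psi_exact = alpha * m + 0.5 * V * alpha**2
print(psi_num, psi_exact)
```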
The function ψ(α, t) will differ from

Ψ(α, t) = sup_x [αx − D₁(Y(t), x)]
        = sup_{x(·)} [αx(t) − D₀(x(0)) − sup_{α(·), β(·)} ∫_0^t [αẋ + βy − H(x, u, α, β)] dτ]    (21)

only by an additive term, independent of α. The point is that exp[κΨ(κ⁻¹α, t)] is the large deviation evaluation of the conditional MGF of x(t) times the probability density of the observation history Y(t). We must divide out this latter to evaluate the MGF; equivalently

ψ(α, t) = Ψ(α, t) − Ψ(0, t).    (22)
Note the distinction in (21) between the vector function α(τ) (τ ≤ t) and the vector α; we shall see that there is an effective identification α(t) = α. The extrema with respect to x(·) and β(·) are unconstrained. A partial integration of (21) and an appeal to (20) allow us to rewrite Ψ as

Ψ(α, t) = sup_{x(·)} inf_{α(·), β(·)} [ψ(α(0), 0) + ∫_0^t [α̇x − βy + H(x, u, α, β)] dτ]    (23)
where the extremisation with respect to x(t) has indeed induced the constraint α(t) = α. Now, by the same methods by which the expression (23.10) for the value function was shown to satisfy the approximate dynamic programming equation (23.9), we deduce that expression (23) obeys the forward equation in time

∂Ψ/∂t = inf_β [H(∂Ψ/∂α, u, α, β) − βy] = σ(α).    (24)

Relations (22) and (24) imply that ψ satisfies (18)/(19). □
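As noted next, under LQG assumptions (18) is solved by a quadratic ψ whose parameters (x̂, V) follow the Kalman filter and Riccati recursions. As a concrete reminder, the sketch below runs a scalar discrete-time Kalman filter (all parameter values illustrative); its output can be verified against direct batch Gaussian conditioning.

```python
import numpy as np

# Scalar discrete-time model x_{t+1} = a x_t + e_t (var N),
# y_t = x_t + eta_t (var M), x_0 = 0 known.  The pair (xhat, V)
# parametrising the quadratic psi is carried forward by the usual
# Kalman updates.
a, N, M, T = 0.9, 0.5, 1.0, 200
rng = np.random.default_rng(1)

x, ys = 0.0, []
for _ in range(T):
    x = a * x + rng.normal(0.0, np.sqrt(N))
    ys.append(x + rng.normal(0.0, np.sqrt(M)))

xhat, V = 0.0, 0.0                     # x_0 known exactly
for y in ys:
    xhat, V = a * xhat, a * a * V + N  # time update
    K = V / (V + M)                    # Kalman gain
    xhat, V = xhat + K * (y - xhat), (1 - K) * V
print(xhat, V)
```

The recursion for (x̂, V) reproduces exactly the mean and variance obtained by conditioning the joint Gaussian of the whole state path on the observations.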
Equation (18) is indeed the precise analogue of the dynamic programming equation (23.9). However, the exact version of (18) is not, in general, the precise analogue of the exact version (23.8) of (23.9).

Relations (18) and (19) provide the updating equation for the conditioned x(t) distribution, and so can be regarded as providing the large-deviation equivalent of the Kalman filter for the general case. In the LQG case, when H has the evaluation (1), then (18) is exact. We find that (18) is solved by

ψ(α, t) = αx̂(t) + ½αV(t)α^T

if x̂(t) and V(t) obey the familiar updating equations for the state estimate and the covariance matrix of its error: yet another derivation of the Kalman filter and Riccati equation! If we regard the state estimate x̂(t) as again the conditionally most probable value then, in the large deviation approximation, it will be the value maximising inf_α [κψ(κ⁻¹α, t) − αx]. The two extremisations yield the conditions α = 0 and x̂(t) = [∂ψ(α, t)/∂α]_{α=0}, so that x̂(t) is nothing but the large deviation evaluation of the expected value of x(t) conditional on Y(t). The fact that the mode and the mean of the conditional distribution agree is an indication of the implied supposition of a high degree of regularity in the distribution.

Relations (18) and (19) are fascinating in that they supply the natural updating in the most general case to which large deviation methods are applicable. However, they do not in general supply a finite-parameter updating (i.e. ψ(α, t) does not remain within a family of functions of α specified by a finite number of t-dependent parameters). One is updating a whole distribution, and will have such a reduction only in fortunate cases; in others it can only be forced by crude approximation. The LQG case of course supplies the standard fortunate
example: the conditional distribution remains Gaussian, and is parametrised by its time-dependent mean and covariance matrix.

However, for an interesting non-Gaussian example, consider again the problem of regulating particle numbers in a chamber, associated with the unscaled DCF (22.26). Suppose that one cannot observe the actual number n of particles present in the chamber, but can only register particles which collide with a sensor, each particle in the chamber having probability intensity ν of registering. One then has essentially a running count m(t) of the number of registrations. Defining the normalised cumulative observation z(t) = m(t)/κ, we then have the normalised DCF

H(x, u, α, β) = u(e^α − 1) + ρx(e^{−α} − 1) + νx(e^β − 1)

for x = n/κ and z. Thus σ(α), as defined by (19), has the evaluation

σ(α) = u(e^α − 1) + ρψ′(e^{−α} − 1) − νψ′ + y − y log y + y log(νψ′)
and the updating relation (18) becomes

∂ψ/∂t = u(e^α − 1) + ρψ′(e^{−α} − 1) − ν(ψ′ − ψ′₀) + y log(ψ′/ψ′₀)    (25)

where ψ′ = ∂ψ/∂α and ψ′₀ = [ψ′]_{α=0}. Of course, y = ż does not exist in any naive sense (as it does not even in the conventional case (1), of observation corrupted by white noise) and we must interpret this last relation incrementally:

δψ = [u(e^α − 1) + ρψ′(e^{−α} − 1) − ν(ψ′ − ψ′₀)]δt + [log(ψ′/ψ′₀)]δz    (26)
where δz is κ⁻¹ times the number of particles registered in the time-increment δt. Remarkably, relation (26) is exact; see Exercise 1 below. However, it is still not finite-dimensional in character.

Exercises and comments

(1) Since we aim to prove that the updating equation (26) is exact, we may as well set κ = 1. Suppose that M(α, t) = ∫ e^{αx} p(dx, t). If a particle is registered (i.e. δz = 1), then, by Bayes' theorem, M changes instantaneously to

∫ νx e^{αx} p(dx, t) / ∫ νx p(dx, t) = M_α/M_{α0} = (ψ_α/ψ_{α0})M.
Thus ψ suffers an instantaneous increment of log(ψ_α/ψ_{α0}), which is just what equation (26) asserts. If no particle is registered in (t, t + δt), so that δz = 0, then over the time increment δt the characteristic function M changes to

const. ∫ (1 + A δt) e^{αx} (1 − νx δt) p(dx, t) = const. ∫ e^{αx} [1 + H(x, α)δt − νx δt] p(dx, t).
Here A is the infinitesimal generator of the x-process alone, H(x, α) = u(e^α − 1) + ρx(e^{−α} − 1) the corresponding DCF, and the constant is such as to retain the normalisation M(0, t + δt) = 1. This indeed implies relation (26) with δz = 0.

Notes on the literature
The material of this chapter is largely taken from Whittle (1991a). There is a very large literature on nonlinear filtering; an appeal to large deviation ideas is explicit in the papers by Hijab (1984) and James and Baras (1988). These present what is essentially a recursive updating of the rate function D(Y(t), x) in the case when the stochastic element of both plant and observation dynamics is a diffusion, the updating equation being characterised with some justification as 'the wave equation of nonlinear filtering'. However, this description could much more fittingly be applied to the general updating relation (18)/(19). That this is set in transform space is, as we have emphasised, natural in the inference context.
APPENDIX 1
Notation and Conventions

For discrete-time models the time variable t is assumed to take signed integer values; for continuous-time models it may take any value. A variable x which is time-dependent is a signal. In discrete time the value of x at time t is denoted x_t. More generally, if (…) is a bracketed expression then (…)_t denotes that expression with all quantities in the bracket evaluated at time t, unless otherwise indicated. In continuous time the time-dependence of x is either indicated x(t) or understood. In either discrete or continuous time the expression x^{(t)} denotes the optimal (minimal stress) estimate of x based on information at time t. The special estimate x_t^{(t)} is denoted x̂_t. A circumflex is also used to denote a Laplace transform or z-transform, in Chapter 4 alone.

If {x_t} is a sequence of variables in discrete time then X_t denotes the history {x_τ; τ ≤ t} of this sequence up to time t. The starting point of this history may depend upon circumstances; it is usually either τ = 0 or τ = −∞. The simple capital X denotes a complete realisation: the course of the x-variable over the whole relevant time interval.

There are other standard modifiers for signals. The command value of x is denoted x^c, and the deviation x − x^c denoted x•. The notation x̄ denotes either the equilibrium value or the terminal value of x, in different contexts. Do not confuse with the effect of the overbar on complex quantities and operators, where it amounts to the operation of conjugation; see below. The bold symbol x is used in a system notation, when it denotes all internal variables of the system collectively.

A matrix is denoted by an italic capital: A. Its transpose is denoted by A^T. If Q is a matrix then Q > 0 and Q ≥ 0 indicate respectively that Q is positive definite and positive semi-definite (and so understood to be symmetric). If A is a matrix with complex elements then Ā denotes the transpose of its complex conjugate.
The overbar thus combines the operations of transposition and complex conjugation, which we regard as the conjugation operation for matrices. If x is said to be a vector then it is a column vector unless otherwise indicated. If F(x) is a scalar function of x then ∂F/∂x is the row vector of first differentials of F with respect to the elements of x. We sometimes denote this by the convenient subscript notation F_x, and use F_xx to denote the square matrix of second differentials. If a(x) is itself a vector function of x then a_x denotes the matrix whose jkth element is the differential of the jth element of a with respect to the kth element of x. If H(p) is a scalar function of a row vector p then H_p is the column vector of differentials of H with respect to the elements of p.
The subscript notation is also sometimes used to distinguish submatrices in a partitioned matrix; see, for example, Sections 19.2 and 20.3.

Operators are generally denoted by a script capital. Important special operators are the identity operator ℐ, the backward shift operator 𝒯 with effect 𝒯x_t = x_{t−1} and the differential operator 𝒟 with effect 𝒟x = dx/dt. A symbol 𝒜 will denote a distributed-lag operator A(𝒯) = Σ_j A_j 𝒯^j in discrete time and a differential operator A(𝒟) = Σ_j A_j 𝒟^j in continuous time. The conjugate 𝒜̄ of 𝒜 is defined as A(𝒯⁻¹)^T in discrete time and A(−𝒟)^T in continuous time. We shall often consider the corresponding generating functions A(z) and A(s), the complex scalars z and s corresponding to 𝒯 and 𝒟 respectively. These will also be denoted by 𝒜, the interpretation (operator or generating function) being clear from the context. In this case the conjugate 𝒜̄ is to be identified with A(z⁻¹)^T and A(−s)^T in discrete- and continuous-time contexts respectively.

The shell characters 𝕊, ℂ and 𝔻 are used to denote stress and cost and discrepancy components of stress respectively. The large-deviation evaluation of 𝔻 is proportional to the rate function, which we denote by D. The cost from time t is denoted C_t, so if the process is obliged to terminate at a given horizon point h then the closing cost is denoted C_h. This does not necessarily coincide with the terminal cost 𝕂, which is the cost incurred when the process is obliged to stop (possibly before h) because it has entered a stopping set 𝒮.

The notations max or min before an expression denote the operation of taking a maximum or minimum, with respect to a variable which is indicated if this is not obvious. Correspondingly, sup and inf denote the taking of a supremum or infimum; stat denotes evaluation at a stationary point; ext denotes the taking of an extremum (of a nature specified in the context).
The expectation operator is denoted by E, and E_π denotes the expectation under a policy π. Probability measure is indicated by P(·) and probability density sometimes by f(·). Conditional versions of these are denoted by E(·|·), P(·|·) and f(·|·) respectively. The covariance matrix E{[x − E(x)][y − E(y)]^T} between two random vectors x and y is written V_xy, or cov(x, y). We write cov(x, x) simply as cov(x). If V_xy = 0 then x and y are said to be orthogonal, expressed symbolically as x ⊥ y.

The modified equality symbol := (resp. =:) in an equation indicates that the left- (right-)hand member is defined by the expression in the right- (left-)hand member.

Notations adopted as standard throughout the text are listed below, but some symbols perform multiple duty.

Abbreviations

Abs   Absolute term in a power series expansion on the unit circle
AGF   Autocovariance generating function
CEP   Certainty equivalence principle
CGF   Cumulant generating function
DCF   Derivate characteristic function
MGF   Moment generating function
Tr    Trace (of a matrix)
Standard symbols

A, B, C   Coefficient matrices in the plant and observation equations
𝒜, ℬ, 𝒞   Operator versions of these for higher-order models
(𝒜 ℬ)   The system operator
C   Cost function
c(x, u)   Instantaneous cost function or rate
𝒟   The time differential operator
D   Rate function
𝔻   The discrepancy component of stress
d   Plant disturbance; various
E   Expectation operator
E_π   Expectation operator under policy π
F   Future value function; future stress
f   Transient cost; probability density; the operator in the Riccati equation
G   Total value function; the coefficient of the quadratic terms in a time-integral; a transfer function; controllability or observability Gramian
g   A function defining a policy u_t = g(x_t)
H   Hamiltonian; the innovation coefficient in the Kalman filter; the constant factor in a canonical factorisation; the matrix coefficient of highest-order derivatives
h   Horizon; various
ℐ   The identity operator
I   The identity matrix
J   Time-integral; the control-power matrix
K   The matrix coefficient in the optimal feedback control
𝕂   A terminal cost function
L   Forward operator; covariance matrix of plant and observation noise
ℒ   The forward operator of the dynamic programming equation; Laplace transform operator
M   Forward operator in continuous time; the covariance matrix of observation noise; a moment generating function; a retirement reward; an upper bound on the magnitude of control
m   The dimension of the control variable u
N   The covariance matrix of plant noise (termed the noise power matrix in continuous time)
n   The dimension of the process variable x
P   Past stress; probability measure
p   The order of dynamics; the conjugate variable of the maximum principle
Q   The matrix of control costs
R   The matrix of process-variable costs
ℛ   The cost matrix in a system formulation
r   The dimension of the observation y
S   The cross-matrix of process- and control-variable costs
𝕊   Stress
𝒮   The stopping set
s   Time to go; the complex scalar argument of a transfer function, corresponding to 𝒟
𝒯   The backward shift operator
t   Time; the present moment
U   The complete control realisation
U_t   Control history at time t
u   The control (action, decision) variable
V   A covariance matrix, or its risk-sensitive analogue; a value function under a prescribed policy
v(λ, μ)   The information analogue of c(x, u)
W_t   Information available at time t; a noise-scaling matrix
w   The command signal
X   The complete process realisation
X_t   Process history at time t
x   The process or state variable
Y   The complete observation realisation
Y_t   Observation history at time t
y   Observation
z   The complex scalar corresponding to 𝒯
α   Discount rate; the row vector argument in transform contexts; various
β   Discount factor; the row vector argument in transform contexts; various
Γ   Gain matrix
γ   Average cost, either direct or in a risk-sensitive sense
Δ   System error; estimation error
δ   Increment (as time increment δt); Kronecker δ-function
ε   Plant noise
ζ   Innovation; primitive system input; coefficient of the linear term in a time integral
η   Observation noise
θ   Risk-sensitivity parameter
λ   The Lagrange multiplier for the plant constraint; birth rate
μ   The Lagrange multiplier for the observation constraint; death rate
ν   A Gittins index; various
ξ   A sufficient variable; the combination (x, t) of state variable and time; the argument of a quadratic time-integral
Π   The matrix of a quadratic value function
π   Policy
σ   The coefficient of the linear term in a quadratic value function; a function occurring in the updating of posterior distributions
τ   A running time variable
Φ   A matrix of operators (or generating functions) occurring in forward optimisation; the normal integral
φ   A canonical factor of Φ; the normal density; various
Ψ   A matrix of operators (or generating functions) occurring in backward optimisation; a function occurring in the updating of posterior distributions
ψ   A canonical factor of Ψ; a cumulant-generating function; an expectation over first passage
χ   A cost-renormalisation of a risk-sensitive criterion; various
Ω   The information gain matrix
ω   Frequency
Modifiers, superscripts, etc.

x̄   Equilibrium value or terminal value of the variable x
𝒜̄   Conjugate of the operator or generating function 𝒜
θ_c   The critical value of the risk-sensitivity parameter θ
x^c   The command value of x
x•   The deviation x − x^c; limit-optimal path
x^{(t)}   The best (minimum stress) estimate of x based on information at time t
x̂_t   The best estimate of x_t based on current information: x_t^{(t)}
x̃_t   The best estimate of x_t based on current information and past costs alone
APPENDIX 2
The Structural Basis of Temporal Optimisation

A stochastic model of the system will specify the joint distribution, in some sense, of all relevant random variables (e.g. process variables and observations) for prescribed values of the control variables. These latter variables are then parametrising rather than conditioning, since they affect the distribution of the former variables but are not themselves initially defined as random variables. The information at a given time consists of all variable-values which are known at that time, both observations and control values. The process model describes not merely the plant which is to be controlled, but also variables such as command signals. Either these signals have their whole course specified from the beginning (which is then part of the model specification) or they are generated by a model which is then part of the process model.

Consider optimisation in discrete time over the time interval t ≥ 0. As in the text, we shall use X_t, U_t and Y_t to denote process, control and observation histories up to time t, and W_t to denote information available at time t. More explicitly, W_t denotes the information available at the time the value of u_t is to be chosen, and so consists of (W₀, Y_t, U_{t−1}). That is, it includes recollection of previous controls and W₀, the prior information available when optimisation begins. We take for granted that it also implies knowledge of t itself: of clock time. Let us for simplicity take W₀ for granted, so that all expectations and probabilities are calculated for the prescribed W₀. Our aim is to show that the total value function

G(W_t) = inf_π E(C | W_t)    (1)

satisfies the optimality equation

G(W_t) = inf_{u_t} E[G(W_{t+1}) | W_t, u_t]    (2)

and that the infimising value of u_t in (2) is the optimal value of control at time t. These assertions were taken almost as self-evident in Chapter 8, but they require proof at two levels. One is at the level of rigour; there may be technical problems due to the facts that the horizon is infinite or that the control may take values in an infinite set. We shall not concern ourselves with these issues, but rather with the much
more fundamental one of structure. The optimality equation is only 'self-evident' because one unconsciously makes structural assumptions. These assumptions and their consequences must be made explicit. That there is need for closer thought is evident from the fact that the conditional expectations in (1) and (2) are not well-defined as they stand, because the control history U_{t−1} or U_t is not defined as a random variable. The following discussion is an improved version of the first analysis of these matters, which was given in Whittle (1982), pp. 150-2.

Complete realisations X_∞, U_∞ and Y_∞ will be denoted simply by X, U and Y. It can be assumed without loss of generality that W_∞ gives complete information on the course of all variables. We shall use naive probability notations P(x) and P(x|y) for distributions and conditional distributions of random variables x and y, as though these variables were discrete-valued. All such formalism has an evident version or interpretation in more general cases. A subscript π, as in P_π(x), indicates the distribution induced under policy π; correspondingly for expectations.

The policy π is subject to the condition of realisability: that the value of the current control u_t may depend only on current observables W_t. We shall express this by saying that u_t must be W_t-measurable. By this we do not mean measurability in the technical sense of measure theory, but in the naive structural sense, that u_t can depend on no other variable than W_t. We shall in general allow randomised policies, in that the policy π is determined by specification of a conditional probability distribution P_π(u_t|W_t) for each t. This is convenient, even though the optimal policy may be taken as deterministic, in that it expresses u_t as a function of W_t for each t. One must now distinguish between conditioning variables and parametrising variables.
A model for the stochastic dynamics of the process would imply a specification of the probability distribution of X for any given U. However, this is not a distribution of X conditional on U, because U is not even defined as a random variable until a control policy has been specified. Rather, U is a parametrising variable; a variable on which the distribution of X depends. We shall write the distribution of X conditioned by Y and parametrised by Z as P(X|Y; Z). The specification of a stochastic model for the controlled process thus corresponds to the specification of the parametrised distribution P(X|; U). The full stochastic specification of both plant equation and observation structure corresponds to specification of P(X, Y|; U). We see then that we should more properly write P_π(u_t|W_t) as P_π(u_t|; W_t), because we are simply defining a u_t-distribution which allows an arbitrary dependence upon W_t. Calculation of expectations for prescribed W₀ may also be a mixture of conditioning and parametrising. These distinctions turn out not to matter, but thanks only to the ordering imposed by a temporal structure. The following are now the basic assumptions of a temporal optimisation problem.
(i) Separateness of model and policy.

P_π(X, Y, U) = P(X, Y|; U) ∏_{t=0}^∞ P_π(u_t|; W_t).    (3)

(ii) Causality.

P(X_t, Y_t|; U) = P(X_t, Y_t|; U_{t−1}).    (4)
The assumption W_t = (W₀, Y_t, U_{t−1}) implies the further properties

(iii) Non-anticipation. W_t is (Y_t, U_{t−1})-measurable.
(iv) Retention of information. W_{t−1} and U_{t−1} are W_t-measurable.

Conditions such as these are often taken for granted and not even listed, but the dynamic programming principle is not valid without them. The fact that information at time t includes that at time t − 1 would be expressed in some of the literature by saying that in (1), for example, one is conditioning on an increasing sequence of σ-fields constituting a filtration. We mention the point only to help the reader make the connection; this is not a language that we shall need.

Condition (i) factors the joint distribution of X, Y and U under policy π into terms dependent on the model and the policy respectively. Note that it is by specification of the policy that one completes the stochastic specification which allows U to be regarded as a random variable jointly with X and Y. Note that relation (3) also expresses realisability of the policy. Condition (ii) does indeed express the vital causality condition: that the course of the process up to a given time cannot be affected by actions after that time. Here the distinction between parametrising and conditioning variables is crucial. In the sense expressed by (4) the variable x_t (for example) cannot be affected by u_τ (τ ≥ t). However, once a policy is specified, then x_t will in general show a stochastic dependence on these future control variables, simply because these future control variables will in general show a stochastic dependence upon x_t. Condition (iv) expresses the fact that information is in principle never lost or discarded. ('In principle', because assumptions of the Markov type make it unnecessary to retain all information.) In particular, past decisions are recalled.

The aspect of optimisation is introduced by requiring π to be such as to minimise E_π(C) for a suitably specified cost function C = C(W).
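For a cost that decomposes additively over time, the optimality equation (2) takes the familiar form G_t(x) = min_u [c(x, u) + E[G_{t+1}(x′) | x, u]], and is easily exercised on a toy example; the chain, costs and horizon below are illustrative, and the backward-induction value is checked against brute-force enumeration of all deterministic Markov policies.

```python
import itertools

# Toy check of the optimality equation for a perfectly observed
# two-state, two-action chain with additive costs over horizon h = 4.
P = {0: [[0.8, 0.2], [0.3, 0.7]],    # P[u][x][x']: transition matrices
     1: [[0.5, 0.5], [0.9, 0.1]]}
c = {0: [1.0, 2.0], 1: [1.5, 0.5]}   # c[u][x]: instantaneous cost
h = 4

# Backward induction on G_t(x) = min_u [c(u, x) + E G_{t+1}(x')].
G = [0.0, 0.0]
for _ in range(h):
    G = [min(c[u][x] + sum(P[u][x][x2] * G[x2] for x2 in (0, 1))
             for u in (0, 1)) for x in (0, 1)]

# Brute force over all deterministic Markov policies u = pol(t, x).
best = [float('inf'), float('inf')]
for pol in itertools.product((0, 1), repeat=2 * h):
    V = [0.0, 0.0]
    for t in reversed(range(h)):
        V = [c[pol[2 * t + x]][x]
             + sum(P[pol[2 * t + x]][x][x2] * V[x2] for x2 in (0, 1))
             for x in (0, 1)]
    best = [min(b, v) for b, v in zip(best, V)]
print(G, best)
```

The two computations agree from both starting states, as the dynamic programming principle asserts.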
Usually the observations yare regarded as subsidiary to the process variable x, in that they are observations on the process variable. This dependent role would be expressed by some property such as P(x1,Y1IX11, Y11i U11) = P(x~,yt!X11; U11)
and the assumption that the cost function C may depend on X and U but not on Y. However, there is no need to make such assumptions; one can simply regard {x_t, y_t} as a stochastic process describing system evolution (and parametrised by U) of which only the component {y_t} is observable.
APPENDIX 2
We shall now show that the dynamic programming principle follows from these assumptions. For notational simplicity we shall abbreviate P_π(u_t | W_t) to P_{πt}. An unrestricted summation Σ will be a summation over all W. A symbol (W_t) under the summation sign will indicate that the summation is to be extended over all W consistent with a given W_t. Let us define
V(π, W_t) = E_π(C | W_t),

the total value function for a specified policy π. The conditional-expectation notation is legitimate, since W_t is defined as a random variable under the policy. Then V obeys the recursion

V(π, W_t) = E_π[V(π, W_{t+1}) | W_t]     (5)

in virtue of (iv) and the properties of a conditional expectation.
Lemma A2.1
V(π, W_t) is independent of policy before time t.
Proof. We have

V(π, W_t) = [ Σ_{(W_t)} C(W) P_π(X, Y, U) ] / [ Σ_{(W_t)} P_π(X, Y, U) ].     (6)

Substituting expression (3) for P_π(X, Y, U) into this equation we find that the P_{πj} for j < t cancel out, since they depend on W only through W_t. □
Lemma A2.2  For any function φ of process history the expectation E_π[φ(X_{t+1}, Y_{t+1}) | W_t, u_t] is independent of policy and can be written

E_π[φ(X_{t+1}, Y_{t+1}) | W_t, u_t] = [ Σ_{(W_t, u_t)} φ(X_{t+1}, Y_{t+1}) P(X_{t+1}, Y_{t+1}; U_t) ] / [ Σ_{(W_t, u_t)} P(X_{t+1}, Y_{t+1}; U_t) ].     (7)
Proof. Relation (7) certainly holds if P(X_{t+1}, Y_{t+1}; U_t) is replaced by P_π(X_{t+1}, Y_{t+1}, U_t). But, by the separateness and causality assumptions (i) and (ii),

P_π(X_{t+1}, Y_{t+1}, U_t) = P(X_{t+1}, Y_{t+1}; U_t) Π_{j=0}^{t} P_{πj}.

The P_{πj} terms cancel, leaving relation (7), which certainly has the implication asserted. □
Lemma A2.3  If P_{πt} is chosen optimally, for given t, then recursion (5) becomes
V(π, W_t) = inf_{u_t} E[V(π, W_{t+1}) | W_t, u_t]
where the expectation operator is independent of π and the minimising u_t is optimal, under prescription of π at other time points.

Proof. We have
E_π[V(π, W_{t+1}) | W_t] = Σ_{u_t} P_π(u_t | W_t) E[V(π, W_{t+1}) | W_t, u_t]     (8)

where the last expectation operator is independent of π, by Lemma A2.2. Indeed, a further appeal to Lemma A2.1 indicates that the final expectation is independent of P_{πt}, so that P_{πt} occurs in expression (8) only where it appears explicitly. The assertions of the lemma then follow. □

The desired conclusions are now immediate.

Theorem A2.4  The optimal total value function G obeys the dynamic programming equation (2), the expectation operator being independent of policy. The minimising value of u_t in (2) is the optimal value of the control at time t.
This follows simply by application of Lemma A2.3 under the assumption that policy has been optimised after time t. The explicit form of the expectation in (2) follows from (7). Essentially, P_π(W_{t+1} | U_t) = P(W_{t+1}; U_t).
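The backward recursion licensed by Theorem A2.4 is easy to exhibit numerically. The following is a minimal sketch for a toy finite-state, finite-horizon problem with perfect observation, so that the information W_t reduces to the current state x_t; all the numbers and the names P, c, c_T are illustrative assumptions, not taken from the text.

```python
# A minimal sketch of the dynamic programming recursion of Theorem A2.4,
# for a toy finite-state, finite-horizon problem with perfect observation
# (so W_t reduces to the current state x_t). All quantities here are
# illustrative assumptions: random transition matrices and costs.

import numpy as np

n_states, n_actions, horizon = 3, 2, 4
rng = np.random.default_rng(0)

# P[u][x, x'] = transition probability from x to x' under control u
P = [rng.dirichlet(np.ones(n_states), size=n_states) for _ in range(n_actions)]
c = rng.uniform(0, 1, size=(n_states, n_actions))   # instantaneous cost c(x, u)
c_T = rng.uniform(0, 1, size=n_states)              # closing cost

V = c_T.copy()                       # value at the horizon is the closing cost
policy = []
for t in reversed(range(horizon)):
    # Q[x, u] = c(x, u) + E[V(x_{t+1}) | x_t = x, u_t = u]
    Q = np.stack([c[:, u] + P[u] @ V for u in range(n_actions)], axis=1)
    policy.append(Q.argmin(axis=1))  # minimising u_t is optimal (Theorem A2.4)
    V = Q.min(axis=1)                # dynamic programming update

policy.reverse()
print(V)                             # optimal value from each initial state
```

The loop is exactly the equation of Lemma A2.3: at each stage the expectation operator is fixed by the model alone, and only the minimisation over u_t involves the policy.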
APPENDIX 3
Moment Generating Functions: Basic Properties

If x is a vector random variable then its moment generating function (MGF) is defined as

M(α) = E(e^{αx}),     (1)

where α is a row vector. This certainly exists for purely imaginary α, and the characteristic function M(iθ) is studied for real θ very much in its own right. Its relation to the MGF is exactly that of the Fourier transform to the Laplace transform. M(α) will exist for a range of real α if the tails of the x-distribution decay at least exponentially fast. In such cases the distribution of x and associated random variables has stronger properties, which are characterised immediately and simply in terms of the MGF, as we saw particularly in Chapter 22.
Theorem A3.1  The moment generating function M(α) is convex for real α. The set 𝒜 of real α for which M(α) is finite is convex and contains the value zero.

Proof. M(α) is an average of functions e^{αx} which are convex in α and so is itself convex. We know already that 0 ∈ 𝒜. Jensen's inequality

M(pα + qβ) ≤ pM(α) + qM(β)     (α, β ∈ 𝒜; p, q ≥ 0; p + q = 1)

for convex functions then implies that all elements of the convex hull of 𝒜 belong to 𝒜. That is, 𝒜 is convex. □

This convexity explains why the equation M(α) = 1 has at most two real roots in the scalar case, one at α = 0 and the other of sign opposite to that of M'(0) = E(x).
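The two-root remark can be made concrete with a case where everything is explicit. For a scalar Gaussian x ~ N(μ, σ²) one has M(α) = exp(μα + σ²α²/2), finite for all real α, so M(α) = 1 has exactly the roots α = 0 and α = -2μ/σ², the latter of sign opposite to M'(0) = μ. The sketch below checks this, and the convexity of M, for assumed parameter values (not an example from the text).

```python
# Sketch: for scalar Gaussian x ~ N(mu, sigma^2) the MGF is
# M(a) = exp(mu*a + sigma^2 * a^2 / 2).  M(a) = 1 then has exactly two
# real roots, a = 0 and a = -2*mu/sigma^2, of opposite sign to M'(0) = mu.
# Parameter values are illustrative assumptions.

import math

mu, sigma = 0.5, 1.2

def M(a):
    return math.exp(mu * a + 0.5 * sigma**2 * a**2)

root = -2 * mu / sigma**2          # the nonzero root of the quadratic CGF
assert abs(M(0.0) - 1.0) < 1e-12
assert abs(M(root) - 1.0) < 1e-12
assert root * mu < 0               # opposite sign to M'(0) = E(x) = mu

# convexity check on a grid: the midpoint value never exceeds the chord
grid = [i / 10 for i in range(-30, 31)]
for a, b in zip(grid, grid[2:]):
    assert M((a + b) / 2) <= 0.5 * M(a) + 0.5 * M(b) + 1e-9
print("roots of M(a) = 1:", 0.0, root)
```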
Theorem A3.2  M(α) possesses derivatives of all orders in the interior of 𝒜, obtained by differentiating expression (1) under the expectation sign.

Proof. This follows from the absolute convergence of the expectation thus differentiated in the interior of 𝒜. □
Existence of derivatives is of course closely related to the existence of moments, which are proportional to the derivatives of M(α) at α = 0. We see from the
theorem that, if 0 is an interior point of 𝒜, then moments of all orders exist. The classic case for which no moments of integral order exist is the Cauchy distribution: a distribution of scalar x with probability density proportional to (1 + x²)⁻¹ and characteristic function exp(-|θ|).

We could have proved convexity of M(α) by appealing to the fact that the matrix of second differentials

∂²M/∂α_j ∂α_k = E(x_j x_k e^{αx})

(where α_j is the jth element of α) is plainly nonnegative definite. This is clumsy compared with the proof above, and makes an unnecessary appeal to the existence of differentials. However, we shall indeed use it in a moment. If we write the second differential in the left-hand member as M_{jk}(α) then M_{jk}(α)/M(α) can be identified as E^{(α)}(x_j x_k), where E^{(α)} is the expectation for the tilted distribution defined in Section 22.2. That is, the α-tilted expectation of a function φ(x) is

E^{(α)}[φ(x)] = E[φ(x) e^{αx}] / M(α).
The MGF of a sum of independent random variables is the product of MGFs. The cumulant generating function (abbreviated to CGF) ψ(α) = log M(α) is then a natural quantity to consider, since the CGF of a sum of independent random variables is the sum of CGFs.
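These product/sum rules are easy to verify numerically. The sketch below takes two small independent discrete distributions (the values and probabilities are arbitrary assumptions), convolves them directly, and checks that M_{x+y}(α) = M_x(α) M_y(α), hence ψ_{x+y} = ψ_x + ψ_y.

```python
# Verify that the MGF of a sum of independent random variables is the
# product of MGFs (and the CGF the sum of CGFs), for two small
# illustrative discrete distributions.

import math
from itertools import product

xs = {0.0: 0.3, 1.0: 0.7}          # distribution of x
ys = {-1.0: 0.4, 2.0: 0.6}         # distribution of y, independent of x

def mgf(dist, a):
    return sum(p * math.exp(a * v) for v, p in dist.items())

# distribution of x + y by direct convolution
zs = {}
for (vx, px), (vy, py) in product(xs.items(), ys.items()):
    zs[vx + vy] = zs.get(vx + vy, 0.0) + px * py

for a in (-1.0, 0.0, 0.5, 2.0):
    lhs = mgf(zs, a)                     # MGF of the sum
    rhs = mgf(xs, a) * mgf(ys, a)        # product of the MGFs
    assert abs(lhs - rhs) < 1e-9
    # CGF of the sum is the sum of the CGFs
    assert abs(math.log(lhs) - (math.log(mgf(xs, a)) + math.log(mgf(ys, a)))) < 1e-9
```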
Theorem A3.3  The function ψ(α) is also convex in 𝒜.

Proof. If we use the subscript notations M_j = ∂M/∂α_j and M_{jk} = ∂²M/∂α_j ∂α_k, with argument α understood, then we have

ψ_{jk} = M_{jk}/M - M_j M_k / M².

But the matrix of these second differentials is just the covariance matrix of x on the tilted distribution, and so is nonnegative definite. This proves convexity of ψ. □
Index

Allocation, of activity 269-283
Average-cost optimisation 53-56, 138, 217-221, 314-316
Autocovariance 256-257
Autocovariance generating function (AGF) 257
Autoregressive processes and representations 262
Avoidance of hazards 160-163, 428-430
Backlogging 20
Backwards translation operator See Translation operator
Bang-bang control 31, 146
Bayesian statistics 287
Behavioural formulation 129
Blackmail 216, 220
Brownian motion See Wiener process
Bush problem 145-147
Calculus of variations 17
Call routing 226-228
Cart and pendulum model 35-37, 99
Canonical factorisation of operators 118, 129, 262, 263, 336, 339, 342-344, 345, 348, 352, 354, 361
Causality 68, 451
Cayley-Hamilton theorem 99-100
Certainty equivalence 191, 230, 234-239, 298-302, 432-435
Change-point detection 289-291
Chernoff's inequality 387-388
Circulant process 261
Classic formulation 63-66
Clearance, optimal 424, 426, 428, 430
Closed-loop property 14, 25, 176, 191
Closing cost 12, 49
Command signal 64
Companion matrix 101
Compound Poisson process 184
Conditionally most probable estimate 237
Conjugate variable 135, 136
Consumption, optimal 17-19, 55-56, 143, 411-413
Control-power matrix 34, 306
Controllability 101-106
Cost, closing 12, 49
Cost, instantaneous 12
Cost, terminal 49
Cost, transient 54
Cost function 12-13
Covariance matrix 239
Cramer's theorem 380, 384-387
Crash avoidance 157-160
Cumulant generating function (CGF) 384
Derivate characteristic function (DCF) 181, 390
Detectability 109
Diffusion coefficient 187
Diffusion processes 186-187, 206-208
Direct trajectory optimisation 42-46, 131-166
Direct trajectory optimisation with LEQG structure 316-317, 371-377
Direct trajectory optimisation with LQ structure 116-121, 122-123, 126-129
Direct trajectory optimisation with LQG structure 331-369
Discounting 15, 29, 44, 49, 143, 178, 317-319
Discrepancy function 230-231
Domain of attraction 3, 93
Dosage, optimal 134-135
Drift coefficient 187
Dual control 285-286
Dual variable See Conjugate variable
Duality, of estimation and control 248-253
Dynamic lags 88-89
Dynamic programming equation 13-14, 28, 174, 176, 177, 288, 289, 339, 340, 347, 407
Eikonal equation 31
Entropy criterion 316
Equilibrium 92-93
Equilibrium point, optimal 42-46
Erlang function 228
Estimates, conditionally most probable 237
  linear least square 243
  minimal discrepancy 243
  projection 242-246
Euler condition 17
Euphoria 304
Excessive functions 51
Extinction 196-197, 207, 210
Factorisation See Canonical factorisation of operators
Feedback 14, 25, 63-66, 85-89
Feedback/feedforward rules 40, 119
Filters 67-85
Filter, action on stochastic processes 257-261
Filter inversion 72-74, 80-82
Filter, proper 84, 87
Filter, Kalman See Kalman filter
First-passage problems 30-31, 140-142, 201-205, 415-430
Final value theorem 83
Fluid approximation 392
Flypaper effect 204
Forward operator 48
Free form 299
Frequency response function 69
Future stress 305-307, 310, 311
Gain, effective 88
Gain matrix 24, 26
Gittins index 270-275
Gramian 104, 105, 108
Grazing of stopping set 157, 162
Growth, optimal 55-56, 133-135
H∞ criterion 321-329
H∞ norm 324-326
Hamiltonian 45, 136, 137, 396
Hamiltonian structure 395-397
Hardy class 325
Harvesting, optimal 25, 31-33, 61-62, 194-201, 206-213
Hedging 173
Homogeneous processes of independent increments (HPII) 182-184
Horizon 12
Imperfect observation 96-98, 229-253, 285-291, 308-312, 357-369, 431-441
Index, Gittins 270-275
Indexability 280
Infinite-horizon behaviour 26-28, 47-56, 111-115, 307, 312-313
Infinitesimal generator 177, 179, 390
Infinitely divisible distributions 183
Inertial particle 37, 155-164, 426-428
Inertialess particle 37, 153-154, 408-411, 421-426
Information 173, 449
Information state 229, 285, 288
Innovation 246-248, 361, 365, 376
Input-output formulation 63-89
Instability, of optimisation 52, 216
Instantaneous cost 12
Insurance 389
Jump process 179, 391, 397
Kalman filter 109, 230, 239-242, 248, 252, 309, 312, 361-369
Lagrange's equations 99
Laplace transform 81-83
Large deviation theory 198, 379-441
  and control 405-414
  and equilibrium distributions 399-400
  and expected exit times 401
  and first passage 415-430
  and imperfect observation 431-441
  and nonlinear filtering 437-441
  refinements 397-399
Linear least square estimate 243
Linearisation 93, 97-98
Loop operator 66
LEQG models and optimisation 234, 295-320, 316-317, 371-377
LQ models and optimisation 22-28, 33-38, 38-42, 59-61, 111-129, 147, 150-163
LQG models and optimisation 189-191, 201-202, 202-205, 331-369
LQG models with imperfect observation 229-253
Machine maintenance 221-222, 281-282, 289-291
Maximum principle (MP) 131-166
Maximum principle, risk-sensitive (RSMP) 407-408, 435-436
Minimal discrepancy estimate 243
Minimax criterion 169
Miss-distance 17
Modes, of a system 95, 106
Moment generating function (MGF) 182, 384, 455-456
Monotonicity, of operators 48
Moving average processes and representations 261-262
Multi-armed bandits 269-283
Neurotic breakdown 303
Newton-Raphson algorithm 57-59
Negative programming 50
Neighbouring optimal control 42-46
Noise-power matrix 186
Nonlinear filtering 437-441
Notation 443-447
Observable 173, 449
Observer 109
Observability 106-109
Observation, imperfect 96-98, 229-253, 285-291, 308-312, 357-369, 431-441
Occupation times 388-389
Offset 89, 116, 123-126
Open-loop control 25, 191
Operator, forward 48
Operator, loop 66
Operator, translation 27
Operators, factorisation of See Canonical factorisation of operators
Optimal stopping 205-206
Optimality equation See Dynamic programming equation
Optimality conditions See Direct trajectory optimisation
Optimisation criteria, expressions for 265-268
Optimism 302
Orthogonal random variables 239
Parametrising variables 171, 174, 450
Past stress 308-310, 311-312
Pendulum models 34-36, 84-85, 94, 95
Pessimism 303
PID controllers 99
PI/NR algorithm 344, 348
Piecewise-deterministic processes 180-181, 209-213
Plant 11, 64, 66, 85
Plant equation 11
Plant instability 128
Poisson process 183
Poisson stream 183
Pole cancellation 88
Policy 16, 174
Policy improvement 56-61, 215-228
  for call routing 226-228
  for machine maintenance 221-222
  for queueing models 222-226
Policy improvement and canonical factorisation 342-344, 348
Pontryagin maximum principle See Maximum principle
Positive programming 51
Posterior distribution 287
Prediction 264-265, 305, 350
Primer 149
Process variable 11
Production scheduling 20-21
Projection estimate 242-246
Proper filter 84-87
Queueing models 193-194, 222-226, 282
Rate function 380, 385, 394
Rational transfer function 71
Realisation 109
Recoupling 310, 374
Recurrence 219
Reference signal See Command signal
Regulation 23
Replica Markov processes 391
Reservoir optimisation 40-42, 165-166
Resonance 96
Restless bandits 277-283
Return difference 355
Riccati equation 23-24, 113, 240, 252, 305, 309, 311, 338, 340-341
Riccati equation, alternative form of 121-122, 253
Risk-sensitivity 172-173, 295-320, 406-414, 432-437
Robustness 326-328
Routh-Hurwitz criterion 109
Satellite model 98, 106, 109
Scaling of processes 195, 384, 391
Sensitivity 327-328
Separation principle 299, 432-435
Set-point 23
Shot noise 184
Small-gain theorem 327
Spectral density (function and matrix) 259-261
Stability 3, 92
Stability, internal 87
Stability, local 93
Stability matrix 40, 92
Stability, of filters 69, 71, 83-85
Stabilisability 103
State structure 11-12, 91-110, 175-176
Stationary policies 16, 115-121
Stationary processes 255-257
Stopping set 136, 139, 151, 394, 415
Stress 298
  future 301, 305-307, 311
  past 301, 308-310, 311-312
Submodularity 274
Switching locus 148
System formulation 125
Tangency condition 205-206
Temporal optimisation, bases of 449-453
Terminal cost 49
Terminal conditions 138-140, 205-206
Tilting, of a distribution 385
Time-homogeneity 16, 68
Time-integral methods 331-377, 407, 435-436 See also Direct trajectory optimisation
Time-integral methods, a generalised formulation 375-377
Time-invariance 16
Time-to-go 16
Tracking 38-42, 65, 115-121, 308
Transfer function 69, 75, 79
Transient cost 54
Transient response 68, 80
Transition intensity 179
Translation invariance 68
Translation operator 27
Transversality conditions 138-140, 163-165
Turnpike 56, 133
Twisting, of a process 395
Type number 88
Utility function 295
Value function 13, 174, 339, 347
von Neumann-Gale model 56, 134
White noise 185, 189
Wiener filter 264-265
Wiener process 184-187
Wold representation 263
z-transform 69, 77, 79
Zermelo's problem 144