Optimal Control
WILEY-INTERSCIENCE SERIES IN SYSTEMS AND OPTIMIZATION
Advisory Editors
Sheldon Ross Department of Industrial Engineering and Operations Research, University of California, Berkeley, CA 94720, USA
Richard Weber Cambridge University, Engineering Department, Management Studies Group, Mill Lane, Cambridge CB2 1RX, UK
GITTINS Multi-armed Bandit Allocation Indices
KALL/WALLACE Stochastic Programming
KAMP/HASLER Recursive Neural Networks for Associative Memory
KIBZUN/KAN Stochastic Programming Problems with Probability and Quantile Functions
VAN DIJK Queueing Networks and Product Forms: A Systems Approach
WHITTLE Optimal Control: Basics and Beyond
WHITTLE Risk-sensitive Optimal Control
Optimal Control Basics and Beyond Peter Whittle Statistical Laboratory, University of Cambridge, UK
JOHN WILEY & SONS
Copyright © 1996 by John Wiley & Sons Ltd, Baffins Lane, Chichester, West Sussex PO19 1UD, England. National 01243 779777, International (+44) 1243 779777.
All rights reserved. No part of this book may be reproduced by any means, or transmitted, or translated into a machine language without the written permission of the publisher.
Cover photograph by courtesy of The News, Portsmouth.
Other Wiley Editorial Offices
John Wiley & Sons, Inc., 605 Third Avenue, New York, NY 10158-0012, USA
Jacaranda Wiley Ltd, 33 Park Road, Milton, Queensland 4064, Australia
John Wiley & Sons (Canada) Ltd, 22 Worcester Road, Rexdale, Ontario M9W 1L1, Canada
John Wiley & Sons (SEA) Pte Ltd, Jalan Pemimpin #05-04, Block B, Union Industrial Building, Singapore 2057
Library of Congress Cataloging-in-Publication Data
Whittle, Peter. Optimal control : basics and beyond / Peter Whittle. p. cm. (Wiley-Interscience series in systems and optimization) Includes bibliographical references and index. ISBN 0 471 95679 1 (hc : alk. paper). ISBN 0 471 96099 3 (pb : alk. paper) 1. Automatic control. 2. Control theory. I. Title. II. Series. TJ213.W442 1996 629.8 dc20 95-22113 CIP
British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library. ISBN 0 471 95679 1; 0 471 96099 3 (pbk)
Typeset in 10/12pt Times by Pure Tech India Limited, Pondicherry. Printed and bound in Great Britain by Biddles Ltd, Guildford and King's Lynn. This book is printed on acid-free paper responsibly manufactured from sustainable forestation, for which at least two trees are planted for each one used for paper production.
Contents
Preface
1 First ideas
BASICS
Part 1 Deterministic Models
2 Deterministic models and their optimisation
3 A sketch of infinite-horizon behaviour; policy improvement
4 The classic formulation of the control problem; operators and filters
5 State-structured deterministic models
6 Stationary rules and direct optimisation for the LQ model
7 The Pontryagin maximum principle
Part 2 Stochastic Models
8 Stochastic dynamic programming
9 Stochastic dynamics in continuous time
10 Some stochastic examples
11 Policy improvement: stochastic versions and examples
12 The LQG model with imperfect observation
13 Stationary processes; spectral theory
14 Optimal allocation; the multi-armed bandit
15 Imperfect state observation
BEYOND
Part 3 Risk-sensitive and $H_\infty$ Criteria
16 Risk sensitivity: the LEQG model
17 The $H_\infty$ formulation
Part 4 Time-integral Methods and Optimal Stationary Policies
18 The time-integral formalism
19 Optimal stationary LQG policies: perfect observation
20 Optimal stationary LQG policies: imperfect observation
21 The risk-sensitive (LEQG) version
Part 5 Near-determinism and Large Deviation Theory
22 The essentials of large deviation theory
23 Control optimisation in the large deviation limit
24 Controlled first passage
25 Imperfect observation; non-linear filtering
Appendices
A1 Notation and conventions
A2 The structural basis of temporal optimisation
A3 Moment generating functions; basic properties
References
Index
Preface
Anyone who writes on the subject of control without having faced the responsibility of practical implementation should be conscious of his presumption, and the strength of this sense should be at least doubled if he writes on optimal control. Beautiful theories commonly wither when put to the test, usually because factors are present which simply had not been envisaged. This is the reason why the design of practical control systems still has aspects of an art, for all the science on which it now calls. Nevertheless, even an art requires guidelines, and it can be claimed that the proper function of a quest for optimality is just the revelation of fundamental guidelines. The notion of achieving optimality in systems of the degree of complexity encountered in practice is a delusion, but the attempt to optimise idealised systems does generate the fundamental concepts needed for the enlightened treatment of less ideal cases. This observation then has a corollary: the theory must be natural and incisive enough that it does generate recognisable concepts; a theory which ends in an opaque jumble of formulae has served no purpose.
'Control theory' is now understood not merely in the narrow sense of the control of mechanisms but in the wider sense of the control of any dynamic system (e.g. communication, distribution, production, financial, economic), in general stochastic and imperfectly observed. The text takes this wider view and so covers general techniques of optimisation (e.g. dynamic programming and the maximum principle) as well as topics more classically associated with narrow-sense control theory (e.g. stability, feedback, controllability). There is now a great deal of standard material in this area, and it is to this that the 'basics' component of the book provides an introduction. However, while the material may be standard, the treatment of the section is shaped considerably by consciousness of the 'beyond' component into which it leads.
There are two pieces of standard theory which impress one as complete: one is the Pontryagin maximum principle for the optimisation of deterministic processes; the other is the optimisation of LQG models (a class of stochastic models with Linear dynamics, Quadratic costs and Gaussian noise). These have appeared like two islands in a sea of problems for which little more than an ad hoc treatment was available. However, in recent years the seabed has begun to rise and depths have become shallows, shallows have become bridging dry land. The class of risk-sensitive models, LEQG models, was introduced, and it was
found that the LQG theory could be extended to these, although the mode of extension was sufficiently unevident that its perception added considerable insight. At about the same time it was found that optimisation on the $H_\infty$ criterion was both feasible, in that analytic advance was possible, and useful, in that it gave a robust criterion. Unexpectedly and beautifully, these two lines of work coalesced when it was realised that the $H_\infty$ criterion was a special case of the LEQG criterion, for all that one was phrased deterministically and the other stochastically. Finally, it was realised that, if large-deviation theory is applicable (as it is when a stochastic model is close to determinism in a certain sense), then all the exact results of the LQG theory have a version which holds in considerable generality. These successive insights revealed a structure in which concepts which had been familiar in special contexts for decades (e.g. time-integral solutions, Hamiltonian structure, certainty equivalence, solution by canonical factorisation) were seen to be closely related and to supply exactly the right view of a very general class of stochastic models. The 'beyond' component is devoted to exposition of this material, and it was the fact that such a connected treatment now seems possible which motivated the writing of this text.
Another motivation was the desire to write a successor to my earlier work Optimisation over Time (Wiley 1982, 1983). However, it is not squarely a successor. I wanted to write something much more homogeneous and tightly focused, and the restriction to the control theme provided that tightness. Remarkably, the recent advances mentioned above also induced a tightening, rather than the loosening one might have expected. For example, it turns out that the discounted cost criterion so beloved of exponents of dynamic programming is logically inconsistent outside a rather narrow context (see Section 16.12).
In control contexts it is natural to work with either total or time-averaged cost (in terminating or non-terminating situations respectively). The algorithm which emerges as natural is the iterative one of policy improvement. This has intrinsically a clear variational basis; it can also be seen as a Newton-Raphson algorithm (Section 3.5) whose second-order convergence is often rapid enough that a single iteration is enlightening (see Section 3.7 and the examples of Chapter 11); it implies similarly effective algorithms in derived work, e.g. for the canonical factorisations of Chapters 18-21.
One very important topic to which we give little space is that of dual control. By this is meant the use of control actions to evoke information as well as to govern the dynamics of the system, with its associated concepts of adaptive control, self-tuning regulators, etc. Chapter 14 on the multi-armed bandit constitutes almost the only substantial discussion. Despite the fact that the idea of dual control emerges spontaneously in any effort to optimise the running of a stochastic dynamic system, the topic seems too demanding and idiosyncratic to be treated in passing. Indeed, one may say that the treatment of this book pushes a certain line about as far as it can be taken, and that this line necessarily skirts
dual control. In all our formulations of the LQG model, the LEQG model, large-deviation versions and even minimax control we find that there is a certainty equivalence principle. The principle indeed generally takes a more sophisticated form than that familiar from the simple LQG case, but any such principle must by its nature exclude dual control: the notion that control actions affect information gained.
Another topic from which we refrain, despite the attention it has received in recent years, is the use of J-factorisation techniques and the like to determine all stabilising controls satisfying some lower bound on performance. This topic is important because of the increased emphasis given to robustness: the realisation that a control which is optimal for a specified model is of little use if its performance deteriorates rapidly with departure from that specification. However, we take reassurance from one conclusion which this body of work establishes: that if a control rule is optimised under the assumption that there is observation error then it is also proofed to some extent against errors in model specification (see Section 17.3). The factorisation techniques which we employ are those associated with the formulation of optimal control as the extremisation of a suitably defined time-integral (even in the stochastic case). This is a class of ideas completely distinct from that of J-factorisation, and with its own particular elegance.
My references to the literature are not systematic, but I have certainly given credit for all recent work for which I knew an attribution. However, there are many sections in which I have worked out my own treatment, very possibly in ignorance of existing work. Let me apologise in advance to authors thus unwittingly overlooked, and affirm my readiness to correct the record at the first opportunity.
A substantial proportion of this work was completed before my retirement in 1994 from the Churchill Chair, endowed by the Esso Petroleum Company. I am profoundly indebted to the Company for its support over my 27-year occupancy of the Chair.
CHAPTER 1
First Ideas
1 CONTROL AS AN OPTIMISATION PROBLEM
One tends to think of 'control' as meaning the control of mechanisms: e.g. the classic stabilisation of the speed of a steam engine by the centrifugal governor, the stabilisation of temperature in a central heating system, or the many automatic controls built into a modern aircraft. However, the controls built into an aircraft are modest compared with those which Nature has built into any higher organism; a biological rather than a mechanical system. This can be taken as an indication that any system operating in time, be it mechanical, electrical, biological, economic or industrial, will need continuous monitoring and correction if it is to keep on course. In other words, it needs control. The efficient running of the dynamic system constituted by an economy or a factory poses a control problem just as much as does the operation of an aircraft. The fact that control actions may be realised by procedures or by conscious decisions rather than by mechanisms is a matter of implementation rather than of principle. (Although it is also true that it is the higher-level decisions, laying out the general course one wishes the system to follow, which will be taken consciously, and it is the lower-level decisions which will be automated. The more complex the system, the more need there will be for an automated low-level decision structure which ensures that the system actually follows the course decided by higher-level policy.)
In traditional control theory the problem is regarded very much as one of stability: that departures from the desired course should certainly be corrected ultimately, and should preferably be corrected quickly, smoothly and effortlessly. Since the mid-century increasing attention has been given to more specific design criteria: control rules are chosen so as to minimise a cost function which appropriately penalises both deviation from course and excessive control action.
That is, the design problem is formulated as an optimisation problem. This has virtues, in that it leads to a sharpening of concepts; indeed, to the generation of concepts. It has faults, in that the model behind the optimisation may be so idealised that it leads to a non-robust solution: a solution which is likely to prove unacceptable if the actual system deviates at all from that supposed. However, as is usual when 'theory' is criticised, this objection is not a criticism of theory as such, but criticism of a naive theory. One may say, indeed, that optimisation exposes the weaknesses in thinking which are usually compensated for by soundness of intuition. By this is meant that, if one makes certain assumptions,
then an attempt at optimisation will go to the limit in some direction consistent with a literal interpretation of these assumptions. It is not a bad idea, then, to see how an ill-posed attempt at optimisation can reveal the pitfalls and point the way to their remedy.
2 AN EXAMPLE: THE HARVESTING OF A RENEWABLE RESOURCE
A good example of the harvesting of a renewable resource would be the operation of a fishery. Consider the simplest case, in which the description of current fish stocks is condensed to a single variable, $x$, the biomass. That is, we neglect the classification by species, age, size and location which a more adequate model would obviously require. We also neglect the effect of the seasons (although see Exercise 1) and suppose simply that, in the absence of fishing, biomass follows a differential equation
$$\dot{x} = a(x) \qquad (1)$$
where $\dot{x}$ is the rate of change of $x$ with time, $dx/dt$. The function $a(x)$ represents the rate of change of biomass, a net reproduction rate, and in practice has very much the course illustrated in Figure 1. It is initially positive and increasing with $x$, but then dips and becomes negative for large $x$, as the demands which a large biomass levies on environmental resources make themselves felt. Two significant stock levels are $x_0$ and $x_m$, distinguished in Figure 1. The stock level $x_0$ is the equilibrium level for the unharvested population, that at which the net reproduction rate is zero. The stock level $x_m$ is that at which the net reproduction rate is greatest. If stocks are depleted at a rate $u$ by fishing then the equation becomes
$$\dot{x} = a(x) - u. \qquad (2)$$
Figure 1 The postulated form of the net reproduction rate for a population. This rate is maximal at $x_m$ and it is zero at $x_0$, which would consequently be the equilibrium level of the unharvested population.
Figure 2 The values $x_1$ and $x_2$ are the possible equilibrium levels of population if harvesting is carried out at a fixed rate $u$ for $x > 0$. These are respectively unstable and stable, as is seen from the indicated direction of movement of $x$.
Note that $u$ is the actual catch rate, rather than, for example, fishing effort. Presumably a given effort yields less in the way of catch as $x$ decreases until, when $x$ becomes zero, one could catch at no faster rate than the rate $a(0)$ at which the population is being replenished from external sources (which may be zero). Suppose, nevertheless, that one prescribes a fishing policy by announcing how one will determine $u$. If one chooses $u$ varying with $x$ then one is showing some responsiveness to the current state; in control terminology one is incorporating feedback. However, let us consider the most naive policy (which is not to say that it has not been used): that which sets $u$ at a definite fixed value for $x > 0$. An equilibrium value of $x$ under this policy must satisfy $a(x) = u$, and we see from the graph of Figure 2 that this equation has in general two solutions, $x_1$ and $x_2$, say. Recall that the domain of attraction of an equilibrium point is the set of initial values $x$ for which the trajectory would lead to that equilibrium. Further, the equilibrium is stable (in a local sense) only if all points in some neighbourhood of it lie in its domain of attraction. Examining the sign of $\dot{x} = a(x) - u$, we see that the lesser value $x_1$ has only itself as domain of attraction, and so is unstable. The greater value $x_2$ has $x > x_1$ as domain of attraction, and so is stable. One might pose as a natural aspiration: to choose the value of $u$ which is largest consistent with existence of a stable equilibrium solution, and this would seem to be
$$u = u_m := a(x_m).$$
That is, the maximal value of $u$ for which $a(x) = u$ has a solution, and so for which the equilibrium operating point is such that the biomass replaces itself at the maximal rate.
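The equilibrium argument for a fixed harvest rate can be sketched numerically. The text specifies only the qualitative shape of $a(x)$, so the logistic form $a(x) = R\,x\,(1 - x/K)$ and the values of `R` and `K` below are illustrative assumptions, not part of the original model. The sketch solves $a(x) = u$ and reads local stability off the sign of $a'(x)$, which is the sign argument applied to $\dot{x} = a(x) - u$.

```python
import math

R, K = 1.0, 100.0  # assumed logistic parameters (illustrative values only)

def a(x):
    """Assumed net reproduction rate a(x) = R*x*(1 - x/K)."""
    return R * x * (1.0 - x / K)

def a_prime(x):
    """Derivative of the assumed a(x)."""
    return R * (1.0 - 2.0 * x / K)

def equilibria(u):
    """Return (x1, x2), the roots of a(x) = u with x1 <= x2, or None if none exist."""
    disc = R * R - 4.0 * (R / K) * u  # discriminant of (R/K)x^2 - R x + u = 0
    if disc < 0.0:
        return None  # u exceeds u_m = R*K/4
    root = math.sqrt(disc)
    return (R - root) * K / (2.0 * R), (R + root) * K / (2.0 * R)

def classify(x):
    """Local stability of an equilibrium of dx/dt = a(x) - u from the sign of a'(x)."""
    slope = a_prime(x)
    if slope < 0.0:
        return "stable"
    if slope > 0.0:
        return "unstable"
    return "degenerate"  # a'(x) = 0: the linear test is inconclusive

x1, x2 = equilibria(16.0)
print(x1, classify(x1))
print(x2, classify(x2))
print(equilibria(26.0))  # 26 exceeds u_m = 25 for these parameters
```

With these assumed parameters $x_m = K/2 = 50$ and $u_m = RK/4 = 25$. For $u = 16$ the roots are $x = 20$ and $x = 80$, since $a(20) = a(80) = 16$; the lesser root is classified as unstable and the greater as stable.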
Figure 3 If the fixed harvesting rate is taken as high as $u_m$, then the equilibrium at $x_m$ is only semi-stable.
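The semi-stability stated in the caption of Figure 3 can be checked by simulation. As a minimal sketch, assume again a logistic form $a(x) = R\,x\,(1 - x/K)$ (an illustrative choice; `R`, `K`, the step size and the time span are all hypothetical) and integrate $\dot{x} = a(x) - u_m$ by the Euler method from starting points on either side of $x_m$.

```python
R, K = 1.0, 100.0                 # assumed logistic parameters (illustrative only)
X_M, U_M = K / 2.0, R * K / 4.0   # maximiser of the assumed a(x), and a(x_m)

def a(x):
    """Assumed net reproduction rate."""
    return R * x * (1.0 - x / K)

def final_stock(x0, u, dt=0.01, t_end=400.0):
    """Euler integration of dx/dt = a(x) - u; harvesting applies only while x > 0."""
    x = x0
    for _ in range(int(t_end / dt)):
        if x <= 0.0:
            return 0.0  # stock exhausted
        x += dt * (a(x) - u)
    return max(x, 0.0)

print(final_stock(1.2 * X_M, U_M))   # starts above x_m
print(final_stock(0.98 * X_M, U_M))  # starts just below x_m
```

Under the assumed form, $a(x) - u_m = -(R/K)(x - x_m)^2$, which is never positive. A stock starting above $x_m$ therefore drifts down towards $x_m$ ever more slowly, while a stock starting below $x_m$ can only fall further, and reaches zero in finite time.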
However, this argument is fallacious, and its adoption is said to be the reason why the Peruvian anchovy fishery crashed between 1970 and 1973 from an annual catch of 12.3 million tons to one of 1.8 million tons (Clark, 1976). As $u$ increases to $u_m$ then $x_1$ and $x_2$ converge to the common value $x_m$. But $x_m$ has domain of attraction $x \ge x_m$, and so is only semi-stable (Figure 3). If the biomass drops at all from the value $x_m$ then it crashes to zero. In Exercise 10.4.1 we consider a stochastic model of the situation which makes the same point in another way.
We shall see in the next chapter that the policy which indeed maximises the steady-state harvest rate is that which one might expect: to fish at the maximal feasible rate (presumably greater than $u_m$) for $x > x_m$ and not to fish at all for $x < x_m$. This makes the stock level $x_m$ a stable point of the controlled system, at which one achieves an effective harvest rate of $a(x_m)$. At least, this is the optimal policy for this simple model; the model can be criticised on many grounds.
Exercises and comments
(1) One can to some extent consider seasonal effects by considering a discrete-time model
$$x_{t+1} = a(x_t) - u_t,$$
in which time $t$ moves forwards in unit steps (corresponding to the annual cycle) rather than continuously. In this case the function $a$ has the form of Figure 4 rather than of Figure 1. The same arguments can be applied as in the continuous-time case, although it is worth noting that it was this model (with $u = 0$) which provided the first and simplest demonstration of chaotic effects.
Figure 4 The form of the year-to-year reproduction rate.
(2) Suppose that the constant value presumed for $u$ when $x > 0$ exceeds $a(0)$, with $u = 0$ for $x = 0$. Then $x = 0$ is effectively a stable equilibrium point, with an effective harvest rate $u = a(0)$. This is because one harvests at the constant rate the moment $x$ becomes positive, and drives the biomass back to zero again. One has then a 'chattering' equilibrium, at which the alternation of zero and infinitesimally positive values of $x$ (and of zero and positive values of $u$) is infinitely rapid. The effective harvest rate must balance the immigration rate, $a(0)$. At this level, a fish is caught the moment it appears from somewhere. Under the policy indicated at the end of the section the equilibrium at $x_m$ is also a 'chattering' one. Practical considerations would of course smooth out both operation and solution around this transition point.
3 DYNAMIC OPTIMISATION TECHNIQUES
The crudity of the control rule of the previous section lay, of course, in the assumption of a constant harvest rate. The harvest rate must be adapted to current conditions, and in such a way as to ensure that, at the very least, a depleted population can recover. With improved dynamics it may well be possible to retain the point of maximal productivity $x_m$ as the equilibrium operating point. However, one certainly needs a basis for the deduction of good dynamic rules. There are a number of approaches, all ultimately related.
The first is the classical design approach, with its primary concern to secure stability at the desired operating point and, after that, other desirable dynamic characteristics. This shares at least one set of techniques with later approaches: the techniques needed to handle dynamic systems (see Chapters 4 and 5).
One optimisation approach is that of laying direct variational conditions on the path of the process; of requiring that there should be no variation of the path, consistent with the prescribed dynamics, which would yield a smaller cost. The optimisation problem is then cast as a problem in the calculus of variations. However, this classic calculus needs modification if the control problem is to be
accommodated naturally, and the form in which it is effective is that of the Pontryagin maximum principle (Chapter 7). This is a valuable technique, but one which would seem to be applicable only in the deterministic case. However, it has a natural version for at least certain classes of stochastic models; see Chapters 16, 18-21, 23 and 25.
Another approach is the recursive one, in which one optimises the control action at a given time on the assumption that the optimal rule for later times has already been determined. This leads to the dynamic programming technique, a technique which is central and which has the merit of being immediately applicable also in the stochastic case (see Chapter 8). It is this approach which in a sense provides the spine of our treatment, although we shall see that all other methods are related to it and sometimes provide advantageous variants of it. It is also true that there is merit in methods which display the future options for the controlled process more clearly than does the dynamic programming technique (see the certainty equivalence principles of Chapters 12 and 16).
One might say that methods which are expressed in terms of the predicted future path of the process (such as the maximum principle, the certainty-equivalence principle and the time-integral methods of Chapters 18-21) correspond to the approach of a chess-player who explores a range of future scenarios in his mind before he makes a move. The dynamic programming approach reflects the approach of the player who has built up a mental evaluation of all possible board configurations, and so can replace the long-term goal of winning by the short-term goal of choosing a move which leads to a higher-value configuration. There is virtue both in the explicit awareness of future possibilities and in the ability to be guided to the same effect by aiming for some more immediate goal.
Finally, there is the relatively naive approach of simply choosing a reasonable control rule and evaluating its performance (by, say, determination of the average cost associated with the rule under equilibrium conditions). It is seldom easy to optimise the rule at this stage; the indirect routes to optimisation are more effective and more revealing. However, there is a systematic method of improving such solutions to yield something which is well on the way to optimality. This is the technique of policy improvement (see Chapters 3 and 11), an approach also derived from dynamic programming. Judged either as an analytic or a computational technique, this may be the single most important tool. In cases where optimality may be an unrealistic ambition, even a false one, it offers a way of starting from a humble base and achieving performance comparable with the optimal. The revision of policy that it recommends can itself convey insight. Policy improvement has a good theoretical basis, has a natural expression in all the characterisations of optimality and, as an iterative technique, it shows second-order convergence to optimality.
4 ORGANISATION OF THE TEXT
Conventions on notation and standard notations are listed in Appendix 1. While much of the treatment of the text is informal, conclusions are either announced in advance or summarised afterwards in theorem-proof form. This form should be regarded as neither forbidding nor pretentious, but simply as the best way of punctuating and summarising the discussion. It is also by far the best form for readers looking for a quick reference on some point.
It does create one difficulty, however. There are theorems whose validity is completely assured by the conditions stated: mathematicians could conceive of nothing else. However, there are situations where arguments of less than full rigour have led one to considerable penetration and to what one indeed believes to be the essential insight, but for which the aspiration to full rigour would multiply the length of the treatment and obscure its point. This is particularly the case when the topic is new enough that a rigorous treatment, even if available, is itself not insightful. One would still wish to summarise assertions, however, leaving it to be understood that the truth of these is subject to technical conditions of a nature neither stated nor verified. Such summary assertions should not properly be termed 'theorems'. We cover this point by starring the second type. So, Theorem 2.3.1 is true as it stands. On the other hand, *Theorem 7.2.1 is 'essentially' valid in statement and proof, but both would need technical supplement before the star could be removed.
Exercises are in some cases substantial. In others they simply make points which, although important or interesting in themselves, would have interrupted the discussion if they had been incorporated into the main text.
Theorems carry chapter and section labels. Thus, Theorem 2.3.1 is the first theorem of Section 3 of Chapter 2. Equations are numbered consecutively through a chapter, however, without chapter label. A reference to equation (18) would thus mean equation (18) of the current chapter, but a reference to equation (3.18) would mean equation (18) of Chapter 3. A similar convention holds for figures.
BASICS
PART 1
Deterministic Models
CHAPTER 2
Deterministic Models and their Optimisation
1 STATE STRUCTURE OPTIMISATION AND DYNAMIC PROGRAMMING
The dynamic operation one is controlling is referred to as the 'process' or the 'plant' more or less interchangeably; we shall usually take 'system' as including sensors, controls and even command signals as well as plant. The set of variables which describe the evolution of the process will be collectively termed the process variable and denoted by $x$. The control variable, whose value can be chosen by the optimiser, will be denoted by $u$. This is consistent with the notation of Chapter 1.
Models are termed stochastic or deterministic according as to whether randomness enters the description or not. We shall see that the incorporation of stochastic effects (i.e. of randomness) is essentially a way of recognising that the values of certain variables may be unknown; in particular, that the future course of certain input variables may be only imperfectly predictable. We restrict ourselves to deterministic models in these first seven chapters.
We shall denote time by $t$. Physical models are naturally phrased in continuous time, when $t$ may take any value on the real axis. However, it is also useful to consider models in discrete time, when $t$ is considered to take only integer values $t = \ldots, -2, -1, 0, 1, 2, \ldots$. This corresponds to the notion that the process develops in stages, of equal length. It is a natural view in economic contexts, for example, when data become available at regular intervals, and so decisions tend to be taken at the same intervals. Even engineers operate in this way, when they work with 'sampled data'. Discretisation of time is inevitable if control values are determined digitally. There are mathematical advantages in starting with a discrete-time formulation, even if one later transfers the treatment to the more physical continuous-time formulation. We shall in general try to cover material in both versions.
There are two aspects of the model which must be specified if the control optimisation problem is to be properly formulated. The first of these is the plant equation; the dynamic evolution rule that $x$ obeys for given controls $u$. This describes the dynamics of the system which is to be controlled, and must be derived from a physical model of that system. The second aspect is the performance criterion, which usually implies specification of a cost function. This
cost function penalises all aspects of the path of the process which are regarded as undesirable (e.g. deviations from required path, lack of smoothness, depletion of resources) and the control policy is to be chosen to minimise it.
Consider first the case of an uncontrolled system in discrete time. The plant equation must then take the form of a recursion expressing $x_t$ in terms of previous $x$-values. Suppose that this recursion is first-order, so taking the form
$$x_t = a(x_{t-1}, t), \qquad (1)$$
where we have allowed dynamics also to depend upon time. In this case the variable $x$ constitutes a dynamically complete description of the state of the system, in that the future course $\{x_\tau;\ \tau > t\}$ of the process at time $t$ is determined totally by $x_t$, and is independent of the path $\{x_\tau;\ \tau < t\}$ by which $x_t$ was reached. A model with this property is said to have state structure, and the process variable $x$ can be more strongly characterised as the state variable.
State structure for a controlled process in discrete time will also require that the model is, in some sense, simply recursive. It is a property of system dynamics and the cost function jointly. We shall assume that the plant equation takes the form
$$x_t = a(x_{t-1}, u_{t-1}, t), \qquad (2)$$
analogously to (1). Further, if one is optimising over the time period $0 \le t \le h$, we shall assume that the cost function $C$ takes the additive form
$$C = \sum_{\tau=0}^{h-1} c(x_\tau, u_\tau, \tau) + C_h(x_h) = \sum_{\tau=0}^{h-1} c_\tau + C_h, \qquad (3)$$
say. The endpoint $h$ is referred to as the horizon, the point at which operations close. It is natural to regard the terms $c_\tau$ and $C_h$ as costs incurred at time $\tau$ and time $h$ respectively; we shall refer to them as the instantaneous and closing costs. We have thus assumed, not merely additivity, but also that the instantaneous cost depends only on current state, control and time, and that the closing cost depends only upon the closing state $x_h$. One would often refer to $x_h$ and $C_h$ as the terminal state and the terminal cost respectively. However, we shall encounter processes which may terminate in other ways before the horizon point is reached (e.g. by accident or by bankruptcy) and it is useful to distinguish between the cost incurred in such a physical termination and one incurred simply by the state one is left in at the expiry of the planning period.
We have yet to define what we mean by 'state structure' in the controlled case, but shall see in Theorem 2.1.1 that assumptions (2) and (3) do in fact imply the simply recursive character of the optimisation problem that one would wish. Relation (2) is of course a simple forward recursion, and the significant property of the cost function (3) turns out to be that it can be generated by a simple backward recursion. We can interpret the quantity
$$C_t = \sum_{\tau=t}^{h-1} c_\tau + C_h \tag{4}$$
as the cost incurred from time $t$ onwards. It plainly obeys the backward recursion
$$C_t = c(x_t, u_t, t) + C_{t+1}$$
t re substantial models of portfolio management and, indeed, of economic development (see Section 3.4)
We can regard the optimiser as an investor who wishes to split his capital between investment and consumption at each stage. The capital x can be taken as the state variable; a nonnegative scalar assumed to follow the plant equation
$$x_{t+1} = a(x_t - u_t). \tag{12}$$
Here $u_t$ is the amount diverted for consumption at time $t$ and $a$ is the factor by which invested capital appreciates over unit time. It is more natural to phrase the problem as one of maximising the total benefit the investor receives from consumption (which an economist would term utility) rather than of minimising cost. Let $r(u)$ be the utility he derives from consumption $u$, and suppose that the total utility is just the sum of discounted instantaneous utilities. If debt is not permitted then necessarily $x \ge 0$ and so $u_t \le x_t$; if consumption cannot be reversed to become production then $u \ge 0$. Let $F_s(x)$ be the maximal utility that can be gained with $s$ time periods to go, starting from capital $x$. The dynamic programming equation is then
$$F_s(x) = \max_{0 \le u \le x}\,[r(u) + \beta F_{s-1}(a(x-u))] \qquad (s > 0). \tag{13}$$
The concept of discounting was motivated in the last section by the notion that money invested will grow by compound interest. It may then seem perplexing to see the same feature built in twice: a discounting by a factor $\beta$ as well as the growth of invested capital by a factor $a$ (as expressed in the plant equation (12)). The more natural interpretation of $\beta$ in this case is as a survival probability; the probability that the investor will survive the next unit of time to enjoy his savings.

If we make the assumptions
$$r(u) = u^{1-\nu}, \qquad F_0(x) = \kappa_0 x^{1-\nu}$$
for the form of the utility functions then (13) is easily solved. Here $\nu$ is a constant in (0, 1). The form for $r(u)$ is plausible; there are economic reasons for assuming utility to be concave and nondecreasing. That the terminal reward $F_0$ should have the same form is perhaps debatable, but the assumption is a convenient one. We leave the reader to verify the solutions
$$F_s(x) = \kappa_s x^{1-\nu}, \qquad u_{\rm opt} = \kappa_s^{-1/\nu} x$$
for the maximal utility and the optimal consumption at horizon $s$. Here $\{\kappa_s\}$ is a sequence of constants obeying
$$\kappa_s^{1/\nu} = 1 + \gamma\,\kappa_{s-1}^{1/\nu} \qquad\text{with}\qquad \gamma = (\beta a^{1-\nu})^{1/\nu},$$
so that
$$\kappa_s^{1/\nu} = \kappa_0^{1/\nu}\gamma^s + \frac{\gamma^s - 1}{\gamma - 1}.$$
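As a numerical check on this solution, the following Python sketch (not part of the original text) assumes power utilities $r(u) = u^{1-\nu}$ and $F_0(x) = \kappa_0 x^{1-\nu}$, writes $F_s(x) = \kappa_s x^{1-\nu}$, and iterates $y_s = \kappa_s^{1/\nu}$ through $y_s = 1 + \gamma y_{s-1}$ with $\gamma = (\beta a^{1-\nu})^{1/\nu}$. It compares the iterates with the geometric-sum solution and checks, by a crude grid search over one stage of (13), that $u = x/y_1$ is the maximising consumption. The function names and the parameter values are illustrative assumptions.

```python
def kappa_roots(a, beta, nu, kappa0, horizon):
    """Iterate y_s = 1 + gamma * y_{s-1}, where y_s = kappa_s ** (1 / nu)."""
    gamma = (beta * a ** (1.0 - nu)) ** (1.0 / nu)
    y = [kappa0 ** (1.0 / nu)]
    for _ in range(horizon):
        y.append(1.0 + gamma * y[-1])
    return gamma, y


def kappa_root_closed_form(gamma, y0, s):
    """Geometric-sum solution of the same recursion (valid for gamma != 1)."""
    return y0 * gamma ** s + (gamma ** s - 1.0) / (gamma - 1.0)


def one_stage_grid_max(x, a, beta, nu, kappa_prev, n=20000):
    """Maximise u**(1-nu) + beta*kappa_prev*(a*(x-u))**(1-nu) on a grid over [0, x]."""
    best_u, best_val = 0.0, float("-inf")
    for i in range(n + 1):
        u = x * i / n
        val = u ** (1.0 - nu) + beta * kappa_prev * (a * (x - u)) ** (1.0 - nu)
        if val > best_val:
            best_u, best_val = u, val
    return best_u, best_val


# Illustrative parameter values (assumptions, not taken from the text).
a, beta, nu, kappa0 = 1.05, 0.9, 0.5, 1.0
gamma, y = kappa_roots(a, beta, nu, kappa0, horizon=10)
u_grid, f_grid = one_stage_grid_max(1.0, a, beta, nu, kappa0)
print("gamma =", gamma)
print("grid search: u =", u_grid, " F_1(1) =", f_grid)
print("formula:     u =", 1.0 / y[1], " F_1(1) =", y[1] ** nu)
```

With these values $\gamma < 1$, so the iterates $y_s$ settle to $1/(1-\gamma)$; choosing $a$ and $\beta$ with $\gamma > 1$ instead shows the geometric growth of $\kappa_s^{1/\nu}$.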
The constant $\gamma$ can be interpreted as the rate at which utility grows, so that sacrifice of an amount $\delta$ of current utility can yield utility $\gamma^s\delta$ after a time $s$ (for small $\delta$ and with optimal decisions; see Ex. 2).

If $\gamma > 1$, so that there is an incentive to delay consumption, then we see from the solution above that
$$F_s(x) \propto \gamma^{\nu s} x^{1-\nu}, \qquad u_{\rm opt} \propto \gamma^{-s} x$$
for large $s$. The second relation implies that, if utility growth is possible, then consumption is indeed delayed to almost the end of the programme.

If $\gamma < 1$ then one finds the limiting values
$$F_s(x) \to \frac{x^{1-\nu}}{(1-\gamma)^{\nu}}, \qquad u_{\rm opt} \to (1-\gamma)x$$
as $s$ becomes large. That is, if to delay consumption is unrewarding, then maximal utility over an infinite horizon is finite. Furthermore, the optimal policy then has the stationary form that one reinvests a fraction $\gamma$ of one's capital at all times. However, this stationarity does not imply constant capital; see Exercise 1.

This kind of model in general implies the conclusion that one sacrifices consumption to investment (i.e. to growth) if economic growth is possible. The conclusion is one which is modified at the individual level by finite life expectancy, as we have seen. It is also sensitive to the form of the utility function, and of course fails if there are physical limits to growth.

Exercises and comments

(1) Under the optimal policy derived above capital grows at each stage by a factor $\theta = a\gamma = (a\beta)^{1/\nu}$. This can be greater or less than unity, consistent with $\gamma < 1$.

(2) Show that $\gamma$ is indeed the growth rate of utility, in that, if one forgoes a small amount $\delta$ of current utility, one can realise an additional amount $\beta^s\theta^{s(1-\nu)}\delta = \gamma^s\delta$ after time $s$.

(3) Solve the linear case $\nu = 0$.

(4) The limit $\nu \uparrow 1$ yields the case in which $r(u)$ and $F_0(x)$ are proportional to $\log(u - u_{\min})$ and $\log(x - x_{\min})$ respectively, where $u_{\min}$ and $x_{\min}$ are minimal subsistence values, below which consumption and final capital must not fall. Solve (13) in this case.
3 PRODUCTION SCHEDULING

This is a version of an inventory problem which allows some useful reduction, but ultimately resorts to computation. We indicate variants of the problem at the end of the section.

A factory produces a single commodity and wishes to schedule its production so as to meet a known time-varying pattern of demand in the most economical fashion. Let us define the state $x_t$ at time $t$ as the stock held at the beginning of the $t$th day; $u_t$ as production during that day and $d_t$ as demand during that day. In such problems one must indicate relative timing within a stage if one is to be clear about the sequence of events. The plant equation is then
$$x_{t+1} = x_t + u_t - d_t. \tag{14}$$
Suppose that the instantaneous cost function for day $t$ is $c_1(u_t) + c_2(x_{t+1})$, where $c_1(u)$ is the cost of producing an amount $u$ in a day, and $c_2(x)$ is the cost of carrying a stock $x$ from one day to the next. The optimality equation (for costs) is then
$$F(x, t) = \min_u\,[c_1(u) + c_2(x + u - d_t) + F(x + u - d_t,\ t + 1)]. \tag{15}$$
Plausible simple forms for the cost functions are
$$c_1(u) = \begin{cases} 0 & (u \le 0) \\ a + bu & (u > 0) \end{cases} \tag{16}$$
$$c_2(x) = \begin{cases} +\infty & (x < 0) \\ cx & (x \ge 0). \end{cases} \tag{17}$$
Assumption (16) implies that one can dispose of stock without cost, but that to go into production on a given day implies a base cost $a$ and a unit cost $b$. Assumption (17) implies that negative stock ('backlogging') is forbidden, and that to carry positive stock incurs a unit cost of $c$ per day. We assume a horizon point $h$ at which negative stock is again forbidden, but positive stock valueless. There is no discounting.

It is the occurrence of the terms in $a$ and $c$ which make the problem nontrivial. If the base cost $a$ were zero then one would produce only enough to satisfy current demand. If the storage cost $c$ were zero then one would produce enough to satisfy all future demand in a single production run. With both costs present one must steer between these two extremes. The simplifying feature of the problem, as we have formulated it, can be summarised.

Theorem 2.3.1 In an optimal production programme one produces on a given day if and only if stocks are insufficient to meet that day's demand. When one produces one brings stock up to such a level that it satisfies demand for an integral number of days.
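The reduction asserted by Theorem 2.3.1 can be checked numerically. The Python sketch below (an illustration under stated assumptions, not the author's algorithm) compares a brute-force solution of the optimality equation (15) over integer stock levels with a recursion that only considers producing, from zero stock, exactly the demand of a whole number of days. The demand pattern, the cost parameters and the function names are hypothetical; demands are taken as positive integers and initial stock as zero.

```python
from functools import lru_cache


def production_cost(u, a, b):
    """c1(u): base cost a plus unit cost b when producing, nothing otherwise."""
    return a + b * u if u > 0 else 0


def brute_force_cost(d, a, b, c):
    """Optimality equation (15) over integer stock levels, starting from zero stock."""
    h, total = len(d), sum(d)

    @lru_cache(maxsize=None)
    def F(x, t):
        if t == h:
            return 0  # surplus stock at the horizon is valueless
        best = float("inf")
        for u in range(total - x + 1):
            nxt = x + u - d[t]
            if nxt >= 0:  # c2 is infinite for negative stock
                best = min(best, production_cost(u, a, b) + c * nxt + F(nxt, t + 1))
        return best

    return F(0, 0)


def reduced_cost(d, a, b, c):
    """Recursion over production days only: F(0,t) = min over tau of c(t,tau) + F(0,tau)."""
    h = len(d)
    F = [0] * (h + 1)
    for t in range(h - 1, -1, -1):
        best = float("inf")
        for tau in range(t + 1, h + 1):
            batch = sum(d[t:tau])  # produce exactly the demand of days t..tau-1
            holding = sum(c * sum(d[k + 1:tau]) for k in range(t, tau))
            best = min(best, production_cost(batch, a, b) + holding + F[tau])
        F[t] = best
    return F[0]


demand = [3, 2, 4, 1, 5]  # hypothetical demand pattern
print(brute_force_cost(demand, a=5, b=1, c=1), reduced_cost(demand, a=5, b=1, c=1))
```

The two costs agree, while the reduced recursion examines only of order $h^2$ candidate production runs rather than every stock level on every day.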
Proof Suppose that under a production programme $\{u_t\}$ one has $u_t > 0$ for a given $t$. If $x_t \ge d_t$ and one transferred the production $u_t$ to day $t+1$ one would save a cost $cu_t$ or $a + cu_t$ according as $u_{t+1}$ is zero or positive. Hence the first assertion.

For the second point, suppose that on day $t$ one produces enough to satisfy demand before time $\tau$, but not enough to satisfy demand at time $\tau$ as well. That is, $u_t = \sum_{j=t}^{\tau-1} d_j - x_t + \delta$, where $0 \le \delta < d_\tau$. Then one must produce on day $\tau$ in any case. If one decreased $u_t$ by $\delta$ and increased $u_\tau$ by $\delta$ then one would save a storage cost $c(\tau - t)\delta$, with no other effect on costs. Hence $\delta = 0$ in an optimal policy. □

Thus, if $x_t \ge d_t$ one does not produce and lets stock run down. If $x_t < d_t$ then one produces enough that demand is met up to some time $\tau - 1$ exactly (so that $x_\tau = 0$), where $\tau$ is an integer exceeding $t$. The optimal $\tau$ will be determined by the optimality equation
$$F(x, t) = \min_{\tau > t}\,[c(t, \tau) + F(0, \tau)], \tag{18}$$

(iii) If $Q = 0$ then $\Pi = R$. In this case the optimal control takes whatever value is needed to bring the state variable to zero at the next step.
6 DYNAMIC PROGRAMMING IN CONTINUOUS TIME

Control models are usually framed in continuous time, and all the material of Section 1 has a continuous-time analogue. If we look for state structure then the analogues of relations (2) and (3) are a first-order differential equation
$$\dot x = a(x, u, t), \tag{42}$$
as plant equation and a cost function
$$C = \int_0^h c(x, u, \tau)\,d\tau + C(x(h), h) \tag{43}$$
of integral form. Here $\dot x$ is the time rate of change $dx/dt$ of the state variable $x$. It thus seems that an assumption is forced upon one: that the possible values and course of $x$ are such that this rate of change can be defined; see Exercise 1. We shall usually suppose $x$ to be a vector taking values in $\mathbb{R}^n$. We shall write the value of $x$ at time $t$ as $x(t)$ rather than $x_t$, although often the time argument will be suppressed. So, it is understood that the variables are evaluated at time $t$ in the plant equation (42), and at time $\tau$ in the integral of (43). The quantity $c(x, u, \tau)$ now represents an instantaneous rate of cost, and the final term in (43) again represents the closing cost incurred at the horizon point $h$.

The general discussion of optimisation methods in Section 1 holds equally for the continuous-time case: there seem to be just two methods which are widely applicable. One is that of direct trajectory optimisation by Lagrangian methods, which we develop in Chapter 7 under its usual description of the maximum principle. The other is the recursive approach of dynamic programming.
We can derive the continuous-time dynamic programming equation formally from the discrete-time version (6). The value function $F(x, t)$ is again defined as the minimal future cost incurred if one starts from time $t$ with state value $x$. Considering events over a positive time-increment $\delta t$ we deduce (cf. (6)) that
$$F(x, t) = \inf_u\,[c(x, u, t)\,\delta t + F(x + a(x, u, t)\,\delta t,\ t + \delta t)] + o(\delta t). \tag{44}$$
Letting $\delta t$ tend to zero in this relation we deduce the continuous-time optimality equation
$$\inf_u\left[c(x, u, t) + \frac{\partial F(x, t)}{\partial t} + \frac{\partial F(x, t)}{\partial x}\,a(x, u, t)\right] = 0 \qquad (t < h). \tag{45}$$

$a'(0)$, when the recommendation is to fish the population to virtual extinction. In this the model demonstrates the crudity of the narrow accounting view with its neglect of wider social, economic or environmental considerations. We consider stochastic versions of the model in Chapter 10, and are forced by these to face the issues.

8 LQ REGULATION IN CONTINUOUS TIME
In analogy to assumptions (21) and (22) of the discrete-time case we suppose the time-homogeneous forms of plant equation
$$\dot x = Ax + Bu \tag{52}$$
and cost function
$$C = \tfrac12\int_0^h c(x, u)\,d\tau + \tfrac12\,[x^T\Pi x](h) \tag{53}$$
where the instantaneous cost rate $c(x, u)$ still has the form (23). Then the analogues of the discrete-time assertions of Theorem 2.4.2 follow immediately, either by passage to the continuous limit from these or by solution of the continuous-time dynamic programming equation (45). The cost function is quadratic in state
$$F(x, t) = \tfrac12\,x^T\Pi(t)x \tag{54}$$
and the time-dependent matrix $\Pi$ obeys the Riccati equation
(0
~
This is to be regarded as a backward equation for ll( t) with termin of IT (h). The optimal control rule is linear in curren t state
$$F_s = \mathscr{L}F_{s-1} \qquad (s > 0) \tag{1}$$
where $\mathscr{L}$ is the operator with action
$$\mathscr{L}\phi(x) = \inf_u\,[c(x, u) + \beta\phi(a(x, u))] \tag{2}$$
on a scalar function of state $\phi(x)$. The interpretation of $\mathscr{L}\phi(x)$ is that it is the cost incurred if one optimises a single stage of operation: so optimising $u_t$, say, knowing that $x_t = x$ and that one will incur a closing cost $\phi(x_{t+1})$ at time $t + 1$. The operator $\mathscr{L}$ thus transforms scalar functions of state into scalar functions of state. We shall term it the forward operator of the optimised process, since it indicates how minimal costs from time $t$ (say) depend upon a given cost function at time $t + 1$. The term is quite consistent with the fact that the dynamic programming equation is a backward equation.

Let us also consider how costs evolve if one chooses a policy $\pi$ which is not necessarily optimal. If the policy $\pi$ is Markov in that $u_t$ depends only on $t$ and the value $x$ of current state $x_t$ then the value function from time $t$ will also be a function only of these variables, $V(\pi, x, t)$, say. If the policy is also stationary then it must be of the form $u_t = g(x_t)$ for some fixed function $g$. In this case the policy is often written as $\pi = g^{(\infty)}$, to indicate indefinite application of this constant rule, and one can write the value function with time $s$ to go as $V_s(g^{(\infty)}, x)$. The backward cost recursion for this fixed policy is
$$V_{s+1}(g^{(\infty)}, x) = c(x, g(x)) + \beta V_s(g^{(\infty)}, a(x, g(x))) \qquad (s > 0), \tag{3}$$
a relation which we shall condense to
$$V_{s+1} = L(g)V_s. \tag{4}$$
Here $L(g)$ is then the forward operator for the process with policy $g^{(\infty)}$. If it can be taken for granted that this is the policy being operated then we can suppress the argument $g$ and write the recursion (4) simply as
$$V_{s+1} = LV_s.$$
The operators $L$ and $\mathscr{L}$ transform scalar functions of state to scalar functions of state. If one applies either of them to a function $\phi$, then it is assumed that $\phi$ is a scalar function of state. One important property they share is that they are monotonic. By this we mean that $\mathscr{L}\phi \ge \mathscr{L}\psi$ if $\phi \ge \psi$; similarly for $L$.

Theorem 3.1.1 (i) The operators $\mathscr{L}$ and $L$ are monotonic. (ii) If $F_1 \ge (\le)\ F_0$ then the sequence $\{F_s\}$ is monotone non-decreasing (non-increasing); correspondingly for $\{V_s\}$.

Proof $L$ is plainly monotonic. We have then
$$\mathscr{L}\phi = L\phi \ge L\psi \ge \mathscr{L}\psi,$$
if we take $u = g(x)$ as the particular control rule induced by formation of $\mathscr{L}\phi$, i.e. the minimising value of $u$ in (2). Assertion (i) is thus proven.

Assertion (ii) follows inductively. If $F_s \ge F_{s-1}$ then
$$F_{s+1} = \mathscr{L}F_s \ge \mathscr{L}F_{s-1} = F_s. \qquad \square$$

In continuous time we have (with the conventions of Section 2.6) the analogues of relations (1) and (4):
$$\frac{\partial F}{\partial s} = \mathscr{M}F, \qquad \frac{\partial V}{\partial s} = MV \qquad (s > 0). \tag{5}$$
Here $F$ and $V$ are taken as functions of $x$ and $s$ and the operators $\mathscr{M}$ and $M = M(g)$ have the actions
$$\mathscr{M}\phi(x) = \inf_u\left[c(x, u) - \alpha\phi(x) + \frac{\partial\phi(x)}{\partial x}\,a(x, u)\right],$$
$$M\phi(x) = c(x, g(x)) - \alpha\phi(x) + \frac{\partial\phi(x)}{\partial x}\,a(x, g(x)).$$
Exercises and comments

(1) Consider a time-homogeneous discrete-time model with uniformly bounded instantaneous cost function and strict discounting (so that $\beta < 1$). Suppose also (for simplicity rather than necessity) that the state and control variables can take only finitely many values. Define the norm $\|\phi\|$ of a scalar function $\phi(x)$ of state by $\sup_x|\phi(x)|$, the supremum norm. Let $\mathscr{B}$ be the class of functions bounded in this norm. Show that for $\phi$ and $\psi$ in $\mathscr{B}$ we have $\|\mathscr{L}\phi - \mathscr{L}\psi\| \le \beta\|\phi - \psi\|$. Hence show that the equilibrium optimality equation $F = \mathscr{L}F$ has a unique solution $F$ in $\mathscr{B}$, identifiable with $\lim_{s\to\infty}\mathscr{L}^s\phi$ for any $\phi$ of $\mathscr{B}$, and that the $u/x$ relation determined by $\mathscr{L}F$ defines an optimal infinite-horizon control rule.
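The contraction property of Exercise 1 is easy to observe on a small example. The Python sketch below is an illustration under assumptions of its own: a randomly generated deterministic model with finitely many states and controls, costs in [0, 1] and $\beta = 0.9$. It applies the optimised one-stage operator, checks the inequality $\|\mathscr{L}\phi - \mathscr{L}\psi\| \le \beta\|\phi - \psi\|$ for one pair of functions, and finds the fixed point of $F = \mathscr{L}F$ by value iteration from $F_0 = 0$. All names are hypothetical.

```python
import random


def make_model(n_states=6, n_controls=3, seed=0):
    """Random finite model: cost[x][u] in [0, 1] and next state nxt[x][u]."""
    rng = random.Random(seed)
    cost = [[rng.random() for _ in range(n_controls)] for _ in range(n_states)]
    nxt = [[rng.randrange(n_states) for _ in range(n_controls)] for _ in range(n_states)]
    return cost, nxt


def optimise_one_stage(phi, cost, nxt, beta):
    """Forward operator of the optimised process: min_u [c(x,u) + beta*phi(a(x,u))]."""
    return [min(c + beta * phi[y] for c, y in zip(cost[x], nxt[x]))
            for x in range(len(phi))]


def sup_norm(f, g):
    """Supremum norm of the difference of two functions of state."""
    return max(abs(p - q) for p, q in zip(f, g))


def value_iteration(cost, nxt, beta, tol=1e-12):
    """Iterate F_s = L F_{s-1} from F_0 = 0 until successive iterates agree to tol."""
    F = [0.0] * len(cost)
    while True:
        G = optimise_one_stage(F, cost, nxt, beta)
        if sup_norm(F, G) < tol:
            return G
        F = G


cost, nxt = make_model()
beta = 0.9
F = value_iteration(cost, nxt, beta)

rng = random.Random(1)
phi = [rng.uniform(-5, 5) for _ in cost]
psi = [rng.uniform(-5, 5) for _ in cost]
lhs = sup_norm(optimise_one_stage(phi, cost, nxt, beta),
               optimise_one_stage(psi, cost, nxt, beta))
print("contraction check:", lhs, "<=", beta * sup_norm(phi, psi))
print("F =", [round(v, 4) for v in F])
```

Because costs are non-negative and $F_0 = 0$, the iterates here are also monotone non-decreasing, in line with Theorem 3.1.1(ii).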
2 INFINITE-HORIZON LIMITS FOR TOTAL COST

The fact that one faces an infinite horizon, i.e. envisages indefinite operation, does not mean that the process may not terminate. For example, if one fires a guided missile, it will continue until it either strikes some object or falls back to Earth with its fuel spent. (For simplicity we exclude the possibility of escape into space.) In either case the trajectory has terminated, with a terminal cost which is a function $\mathbb{K}(x)$ of the terminal value $x$ of state.

The problem is nevertheless an infinite-horizon one if one has set no a priori bound on the time allowed. If one had set a bound, in that one considered the firing a failure if the missile were still in flight at a prescribed time $h$, then presumably one should assign a cost $C_h(x_h)$ to this contingency. The cost $C_h$ might then more appropriately be termed a closing cost, to distinguish it from the terminal cost $\mathbb{K}$, the cost of natural termination.
In the infinite-horizon case $h$ is infinite and there is no mention of a closing cost. One very regular case is that for which the total infinite-horizon cost is well defined, and the total costs $V(x)$ and $F(x)$ (under a policy $g^{(\infty)}$ and an optimal policy respectively) are finite for prescribed $x$. If instantaneous cost is non-negative then this means that the trajectory of the controlled process must be such that the cost incurred after time $t$ tends to zero with increasing $t$.

One situation for which this would occur is that envisaged above: that in which the process terminates of itself at some finite time and incurs no further cost. Another is that discussed in Exercise 1.1, in which instantaneous costs are uniformly bounded and discounting is strict, when the value at time 0 of cost incurred after time $t$ tends to zero as $\beta^t$. Yet another case is that typified by the LQ regulation problem of Section 2.3. Suppose one has a fixed policy $u = Kx$ which is stabilising, in that the elements of $(A + BK)^t$ tend to zero as $\rho^t$ with increasing $t$. Then $x_t$ and $u_t$ also tend to zero as $\rho^t$, and the instantaneous cost $c(x_t, u_t)$ tends to zero as $\rho^{2t}$. Although there is no actual termination in this case, one approaches a costless equilibrium sufficiently fast that total cost is finite.

One would hope that finiteness of total cost would imply that $V$ and $F$ could be identified respectively as the limits of $V_s = L^sC_h$ and of $F_s = \mathscr{L}^sC_h$ as $s \to \infty$, for any specified closing cost $C_h$ in some natural class $\mathscr{C}$. One would also hope that these cost functions obeyed the equilibrium forms of the dynamic programming equations
$$F = \mathscr{L}F, \tag{6}$$
$$V = LV \tag{7}$$
and that they were the unique solutions of these equations (at least in some natural class of functions). Further, that the minimising value of u in (6) would determine a stationary optimal policy. That is, that if
$$F = \mathscr{L}F = L(g)F \tag{8}$$
then $g^{(\infty)}$ is optimal.

In fact, counterexamples can be found to all these conjectures (see Exercises 1-4 below). Nevertheless, they hold under relatively mild conditions. One condition which assures a good part of them and which is natural in the control context is indeed that instantaneous costs should be non-negative: $c(x, u) \ge 0$. This is referred to as the case of negative programming in the literature; 'negative' because positive cost corresponds to negative reward. The following two results are immediately useful; more substantial analysis is postponed until Chapter 11.

Theorem 3.2.1 Suppose that $c \ge 0$. Then $V(g^{(\infty)}) = \lim_{s\to\infty} L(g)^s 0$.

Proof Here by $L^s 0$ we mean the finite-horizon cost $V_s = L^sC_h$ in the case when the closing cost function $C_h(x)$ is identically zero. Let $c_t$ be the cost incurred at time $t$ under the policy $g^{(\infty)}$ for a prescribed value of initial state. Then the assertion of the theorem simply amounts to the statement
$$\sum_{t=0}^{\infty} c_t = \lim_{s\uparrow\infty}\sum_{t=0}^{s} c_t,$$
valid for a sum of non-negative terms, whether convergent or not. □
Theorem3.2.2 Suppose that c ~ Oand that the function ¢(x) is such that¢~ Oand L(g)¢ !l'V; = L(gi+l)V; ~ L(gi+dV; ~ L(gi+IY(O)  Vt+i·
That is, $V_i > V_{i+1}$ and the iteration has produced a strict improvement. □

One would like to assert that, if equality holds, so that $V_i = \mathscr{L}V_i$, then $V_i = F$ and the policy $g_i^{(\infty)}$ is optimal. This will be true under the assumptions of Exercise 1.1, for example, but is not in general true without qualification.

One may say that value iteration carries out the operations of improvement of policy and passage to the infinite-horizon limit simultaneously. Policy improvement realises these two operations alternately, with passage to the limit telescoped to solution of the linear equation system $V = LV$. Typically, policy improvement indeed approaches optimality (when it does so) in very few iterations. The following observation partly explains why.
Theorem 3.5.2 The policy improvement algorithm is identical with application of the Newton-Raphson algorithm to the equilibrium dynamic programming equation $F = \mathscr{L}F$.
Proof Recall what is meant by the Newton-Raphson (henceforth NR) algorithm. If we have an approximate solution $F = V_i$ of $F = \mathscr{L}F$ then we improve it to a solution $V_{i+1} = V_i + \Delta_i$ by setting
$$V_i + \Delta_i = \mathscr{L}(V_i + \Delta_i), \tag{14}$$
expanding the right-hand member as far as first-order terms in $\Delta_i$ and solving the consequent linear equation for $\Delta_i$. Suppose that $\mathscr{L}V_i = L(g_{i+1})V_i$. Because $u$ is optimised out in the application of $\mathscr{L}$ the variation in $u$ induced by the variation $\Delta_i$ of $V_i$ has no first-order effect on the value of $\mathscr{L}(V_i + \Delta_i)$, and
$$\mathscr{L}(V_i + \Delta_i) = L(g_{i+1})(V_i + \Delta_i) + o(\Delta_i).$$
It follows that the linearised version of equation (14) is just
$$V_{i+1} = L(g_{i+1})V_{i+1}.$$
That is, $V_{i+1}$ is just the value function for the policy $g_{i+1}^{(\infty)}$, where $g_{i+1}$ was deduced from the value function $V_i$ exactly by the policy improvement procedure. □

This is a useful observation, because the NR algorithm is such a natural one, as we shall find. The algorithm now has a direct variational justification in this context. The equivalence also has implications for rates of convergence: in regular cases, at least, value iteration and policy improvement show respectively first- and second-order convergence; see Exercise 1.

Policy improvement can be used also for the average-cost formulation. If we assume, for simplicity, that there is only a single equilibrium point under the various policies, then in discrete time it takes the form: determination of the average cost $\gamma_i$ and transient cost function $v_i(x)$ from
$$\gamma_i + v_i = L(g_i)v_i$$
followed by determination of the improved control rule $u = g_{i+1}(x)$ from
$$L(g_{i+1})v_i = \mathscr{L}v_i.$$
By 'improvement' is meant that either the average cost has been decreased, or it is unchanged and the transient cost has been decreased. One generally hopes for improvement in the first and stronger sense; see Section 11.3. The continuous-time versions of both total and average-cost procedures will be plain.

The techniques of value iteration and policy improvement were formalised by Howard (1960). The equivalence of policy improvement and the NR algorithm was demonstrated in the LQ case by Whittle and Komarova (1988); in this case it holds in a tighter sense. However, we see from the last theorem that it holds generally. Puterman (1994) attributes the observation of the equivalence to Kalaba (1959) and Pollatschek and Avi-Itzhak (1969). However, it is only in recent years that the point and its application have bitten: see Chapter 18.
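For the discounted total-cost case, the alternation between solving the linear system $V = L(g)V$ and re-optimising the control rule can be sketched in a few lines of Python. The four-state, two-control model below is a hypothetical example chosen so that the answer can be checked by hand; the helper names, the tie-breaking tolerance and the Gauss-Jordan solver are choices made for this illustration, not anything prescribed by the text.

```python
COST = [[1.0, 2.0], [0.5, 3.0], [2.0, 0.1], [0.0, 1.0]]   # c(x, u), hypothetical
NXT = [[1, 3], [0, 2], [3, 0], [3, 2]]                    # a(x, u), hypothetical
BETA = 0.9


def solve_linear(M, b):
    """Gauss-Jordan elimination with partial pivoting for a small dense system."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def evaluate_policy(g, cost, nxt, beta):
    """Solve V = L(g)V, i.e. V(x) - beta*V(a(x,g(x))) = c(x,g(x))."""
    n = len(g)
    M = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    b = []
    for x in range(n):
        M[x][nxt[x][g[x]]] -= beta
        b.append(cost[x][g[x]])
    return solve_linear(M, b)


def improve_policy(V, g, cost, nxt, beta):
    """Pick the minimising control in each state, keeping g(x) in case of a tie."""
    new_g = []
    for x in range(len(V)):
        q = [cost[x][u] + beta * V[nxt[x][u]] for u in range(len(cost[x]))]
        best = min(range(len(q)), key=q.__getitem__)
        new_g.append(g[x] if q[g[x]] <= q[best] + 1e-12 else best)
    return new_g


def policy_improvement(cost, nxt, beta, g):
    history = []
    while True:
        V = evaluate_policy(g, cost, nxt, beta)
        history.append(V)
        new_g = improve_policy(V, g, cost, nxt, beta)
        if new_g == g:
            return g, V, history
        g = new_g


g_opt, V_opt, history = policy_improvement(COST, NXT, BETA, [0, 0, 0, 0])
print("policy:", g_opt)
print("values:", [round(v, 6) for v in V_opt])
print("policy evaluations needed:", len(history))
```

In this example three policy evaluations suffice, and each successive value function is no larger than its predecessor in any state.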
Exercises and comments

(1) Consider solution of the equation $y = f(y)$ for a scalar $y$. Let $\bar y$ be the desired solution and $\{y_i\}$ a sequence of approximate solutions. Denote the error $y_i - \bar y$ by $\epsilon_i$. The sequence of approximations is said to show $r$th-order convergence if $\epsilon_{i+1} = O(\epsilon_i^r)$ for small $\epsilon_i$ and all sufficiently large $i$. Suppose one generates the sequence by iteration: $y_{i+1} = f(y_i)$. Then, if convergence to $\bar y$ indeed occurs, one finds that $\epsilon_{i+1} = f'\epsilon_i + o(\epsilon_i)$, where $f'$ is the first derivative of $f$ at $\bar y$. If the sequence is generated by the NR algorithm one finds that
$$\epsilon_{i+1} = -\frac{f''}{2(1 - f')}\,\epsilon_i^2 + o(\epsilon_i^2).$$
Granted the convergence and differentiability assumptions, we see then that the two methods show first- and second-order convergence respectively.

(2) In the case of positive programming the analogue of the least-excessive property of Theorem 3.2.2 does not hold, with the consequence that policy improvement may not work. Consider Exercise 2.4, where one has the option of continuation or termination in each state. If one chooses termination in all states then the corresponding value function (in reward terms) is $V(x) = 1 - 1/x$. If one then performs policy improvement (i.e. chooses the action corresponding to the larger of $V(x + 1)$ or the retirement reward $1 - 1/x$) then one chooses continuation in all states. However, indefinite continuation is non-optimal.

6 POLICY IMPROVEMENT AND LQ MODELS

Policy improvement can be regarded as both an analytic and a computational technique. For LQ models it is a combination of both, in that it provides an excellent way of solving the Riccati equation iteratively. Consider the problem of undiscounted LQ regulation to zero treated in Section 2.3. For notational simplicity we shall assume the matrix $S$ normalised to zero. We know that the optimal policy has the form $u_t = Kx_t$. Assume then that the policy at stage $i$ has the form $u_t = K_ix_t$, so that the corresponding infinite-horizon value function has the form $V_i(x) = \tfrac12 x^T\Pi_ix$. Then the two steps of the policy improvement algorithm reduce to:

(i) Determine $\Pi_i$ from the linear equation
$$\Pi_i = R + K_i^TQK_i + (A + BK_i)^T\Pi_i(A + BK_i). \tag{15}$$
This will have a finite solution if the matrix $A + BK_i$ is a stability matrix; i.e. if the assumed control law is stabilising.

(ii) Determine the matrix $K_{i+1}$ of the improved control law as
$$K_{i+1} = -(Q + B^T\Pi_iB)^{-1}B^T\Pi_iA. \tag{16}$$
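The two steps are easiest to follow in the scalar case, where every matrix reduces to a number. The Python sketch below is an illustration under assumptions: a hypothetical scalar plant $x_{t+1} = Ax_t + Bu_t$ with cost weights $R$ and $Q$ and $S = 0$, and the improvement step taken in the standard form $K = -(Q + B\Pi B)^{-1}B\Pi A$. It alternates evaluation and improvement of the gain and monitors the residual of the equilibrium Riccati equation $\Pi = R + A^2\Pi - (AB\Pi)^2/(Q + B^2\Pi)$.

```python
def evaluate_gain(K, A, B, R, Q):
    """Step (i): solve Pi = R + Q*K**2 + (A + B*K)**2 * Pi for a stabilising gain K."""
    closed_loop = A + B * K
    if abs(closed_loop) >= 1.0:
        raise ValueError("gain is not stabilising")
    return (R + Q * K * K) / (1.0 - closed_loop ** 2)


def improve_gain(Pi, A, B, Q):
    """Step (ii): K = -(Q + B*Pi*B)^(-1) * B*Pi*A, written out for scalars."""
    return -B * Pi * A / (Q + B * B * Pi)


def policy_improvement(K, A, B, R, Q, steps=6):
    """Return the list of (gain, Pi) pairs visited by the iteration."""
    history = []
    for _ in range(steps):
        Pi = evaluate_gain(K, A, B, R, Q)
        history.append((K, Pi))
        K = improve_gain(Pi, A, B, Q)
    return history


def riccati_residual(Pi, A, B, R, Q):
    """Amount by which Pi fails the equilibrium Riccati equation."""
    return R + A * A * Pi - (A * B * Pi) ** 2 / (Q + B * B * Pi) - Pi


# Hypothetical scalar plant: unstable open loop (A > 1), stabilising initial gain.
A, B, R, Q = 1.2, 1.0, 1.0, 1.0
history = policy_improvement(-0.5, A, B, R, Q)
for K, Pi in history:
    print(f"K = {K:.6f}  Pi = {Pi:.6f}  residual = {riccati_residual(Pi, A, B, R, Q):.2e}")
```

The residual shrinks roughly quadratically from one step to the next, which is the behaviour one expects of a Newton-Raphson iteration.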
For a numerical example, consider the case
i],
A=[~
B=
[n,
R=
[~ ~],
Q= 1,
S=O,
which can be regarded as a discrete-time version of the stabilisation to the origin of a point-mass moving on a line, Newtonian in that its acceleration is proportional to the control force exerted on it. If we suppose initially a control rule with $K = -[0.25\ \ 1.00]$, then successive iterations modify this to $-[0.329\ \ 1.224]$ and then $-[0.328\ \ 1.220]$. The smallness of the change on the second iteration indicates that the first iteration had already brought the control very close to optimality. The value of $\Pi$ corresponding to the last (and virtually optimal) rule is
$$\Pi = \begin{bmatrix} 3.715 & 3.044 \\ 3.044 & 8.264 \end{bmatrix}.$$
The continuous-time versions of (15) and (16) are
$$R + K_i^TQK_i + (A + BK_i)^T\Pi_i + \Pi_i(A + BK_i) = 0, \tag{17}$$
$$K_{i+1} = -Q^{-1}B^T\Pi_i. \tag{18}$$
For example, consider the pendulum regulation problem of Section 2.8 with
$$A = \begin{bmatrix} 0 & 1 \\ \pm 1 & 0 \end{bmatrix}, \qquad B = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \qquad R = \begin{bmatrix} 3 & 0 \\ 0 & 0 \end{bmatrix}, \qquad Q = 1, \qquad S = 0.$$
The $\pm$ option in $A$ corresponds to consideration of the inverted or the hanging pendulum respectively. Of course, we solved this problem analytically in Section 2.8, but it is of interest to see how quickly the policy improvement algorithm will yield convergence to the known solution. For the hanging case we have the optimal solution
$$\Pi = \begin{bmatrix} 2\sqrt2 & 1 \\ 1 & \sqrt2 \end{bmatrix}, \qquad K = -[1\ \ \sqrt2],$$
where $\sqrt2 = 1.414$ to three decimal figures. The initial choice $K = -[1\ \ 1]$ iterates to $-[1\ \ 1.5]$ and $-[1\ \ 1.416]$ on the first and second steps. For the inverted version the optimal solution is
$$\Pi = \begin{bmatrix} 4.898 & 3.000 \\ 3.000 & 2.449 \end{bmatrix}, \qquad K = -[3.000\ \ 2.449],$$
to three decimal figures. The initial choice $K = -[2\ \ 1]$ iterates to $-[3.5\ \ 4.0]$, $-[3.05\ \ 2.76]$ and $-[3.001\ \ 2.467]$ on successive steps.

The numerical solution of the cart/pendulum problem was given in Section 2.8. The difficulty with this problem is to find a stable policy with which to start the algorithm; one cannot do so until one realises that the coefficients of the displacement terms in the control rule must have the sign opposite to that which one would expect on naive grounds, as noted in Section 2.8.

The problem of regulation to zero can be regarded as a total-cost problem, since a policy $u = Kx$ which stabilises the equilibrium $x = 0$ achieves finite total cost from any finite starting point $x$. If we bring in disturbances or tracking as in Section 2.9 then cost does indeed build up indefinitely with time, although we have to go to a statistical formulation of the disturbances and the reference signal before the average cost has a simple definition; see Section 10.1. However, in all of these cases the central problem remains the solution of the Riccati equation for the matrix $\Pi$, which does not differ from that for the simple regulation problem above.
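The continuous-time iteration for the pendulum can be reproduced with a short Python sketch. The matrices used here are assumptions inferred from the quoted solutions rather than read from the text: $A = [[0, 1], [\pm 1, 0]]$, $B = [0, 1]^T$, $R = \mathrm{diag}(3, 0)$, $Q = 1$, with gains entering as $u = Kx$ and negative feedback. Each step solves the Lyapunov-type equation $R + K^TQK + (A + BK)^T\Pi + \Pi(A + BK) = 0$ for the three distinct entries of the symmetric matrix $\Pi$ and then sets $K = -Q^{-1}B^T\Pi$. All function names are hypothetical.

```python
def solve_linear(M, b):
    """Gauss-Jordan elimination with partial pivoting for a small dense system."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def evaluate_gain(K, A, R, Q):
    """Solve the Lyapunov equation for Pi = [[p, q], [q, r]], with B = [0, 1]^T."""
    m11, m12 = A[0]
    m21, m22 = A[1][0] + K[0], A[1][1] + K[1]      # closed-loop matrix A + B K
    w11 = R[0][0] + Q * K[0] * K[0]                # W = R + K^T Q K
    w12 = R[0][1] + Q * K[0] * K[1]
    w22 = R[1][1] + Q * K[1] * K[1]
    system = [[2 * m11, 2 * m21, 0.0],
              [m12, m11 + m22, m21],
              [0.0, 2 * m12, 2 * m22]]
    return solve_linear(system, [-w11, -w12, -w22])   # [p, q, r]


def improve_gain(Pi, Q):
    """K = -Q^{-1} B^T Pi, which for B = [0, 1]^T is -[q, r] / Q."""
    _, q, r = Pi
    return [-q / Q, -r / Q]


def iterate(K, A, R, Q, steps):
    """Return the successive improved gains."""
    gains = []
    for _ in range(steps):
        K = improve_gain(evaluate_gain(K, A, R, Q), Q)
        gains.append(K)
    return gains


R, Q = [[3.0, 0.0], [0.0, 0.0]], 1.0               # inferred weights (assumption)
hanging = iterate([-1.0, -1.0], [[0.0, 1.0], [-1.0, 0.0]], R, Q, steps=6)
inverted = iterate([-2.0, -1.0], [[0.0, 1.0], [1.0, 0.0]], R, Q, steps=6)
print("hanging :", hanging[:2], "...", hanging[-1])
print("inverted:", inverted[:3], "...", inverted[-1])
```

Under these assumptions the hanging case reduces to Newton's iteration for $\sqrt2$ in the second component of the gain, which accounts for the very rapid convergence.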
7 POLICY IMPROVEMENT FOR THE HARVESTING EXAMPLE

Consider again the harvesting example of Section 1.2. The naive policy considered in that section turned out to be a very shortsighted one; the optimal policy (under the extreme simplifying assumptions of the model) was deduced in Section 2.7. Let us see how successful one stage of policy improvement is in taking us from the first policy towards the second.

If $V(x)$ is the infinite-horizon cost under a policy $u = g(x)$ then $V$ satisfies the equation
$$u - \alpha V + (a(x) - u)V_x = 0 \tag{19}$$
with $u$ indeed given the value $g(x)$. Recall that $x$ is the stock size, $a(x)$ the natural net reproduction rate, $u$ the catch rate and $\alpha$ the discount rate. For the naive policy $u$ is given the constant value $u_m = a(x_m)$ for all $x > 0$, where $x_m$ is the value maximising $a(x)$. For the optimal policy one harvests at the maximal feasible rate $M$ for $x > x_\alpha$ and not at all for $x \le x_\alpha$, where $x_\alpha$ is the root of $a'(x) = \alpha$.

It follows from (19), with $u = g(x)$, that
$$V_x = 1 + \frac{a(x) - \alpha V(x)}{g(x) - a(x)}. \tag{20}$$
It also follows that, for the improved policy, one will set $u$ equal to zero or $M$ according as $V_x$ is greater than or less than unity. The choice in the case of equality is immaterial; we might as well set $u$ equal to zero for definiteness. We see from (20) that this implies then that we should take $u$ equal to zero or $M$ according as $a(x) - \alpha V(x)$ and $g(x) - a(x)$ have the same sign or not (with $u$ zero in the transitional case).

For the naive policy it follows from (20), and is otherwise obvious, that
$$V(x) = \frac{u_m}{\alpha}\left(1 - e^{-\alpha T(x)}\right), \tag{21}$$
where $T(x)$ is the time taken for the stock to drop from $x$ to zero under the policy. For $x \ge x_m$ this time is infinite; for $x < x_m$
$$T(x) = \int_0^x \frac{dy}{a(x_m) - a(y)}. \tag{22}$$
For the naive policy it is also true that the expression $g(x) - a(x) = a(x_m) - a(x)$ is non-negative for all positive $x$. It thus follows that $u$ will be zero in the improved policy for all $x$ such that $a(x) \ge \alpha V(x)$. Appealing to evaluations (21) and (22) we find this to be the case if
$$\alpha\int_0^x \frac{dy}{a(x_m) - a(y)} \le \log\frac{a(x_m) - a(0)}{a(x_m) - a(x)} = \int_0^x \frac{a'(y)}{a(x_m) - a(y)}\,dy,$$
or
$$\int_0^x \frac{\alpha - a'(y)}{a(x_m) - a(y)}\,dy \le 0. \tag{23}$$
The value of $x$ separating the regimes of zero and maximal harvesting rate in the improved policy is that for which equality holds in (23). One sees that this root certainly lies between $x_m$ and $x_\alpha$. Repetition of the argument shows that the recommended transition value converges rapidly to the optimal value $x_\alpha$ with iteration of the improvement.

This argument was of course unnecessary, as the optimal policy is known and simple. However, the example is perhaps reassuring, in that it shows how a single stage of policy improvement may change a conspicuously stupid policy to one qualitatively close to the optimal.
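To see where the first improvement places the switching point, one can evaluate criterion (23) numerically. The Python sketch below assumes a logistic reproduction rate $a(x) = x(1 - x)$, which is an illustrative choice and not taken from the text; for it $x_m = 1/2$ and $x_\alpha = (1 - \alpha)/2$. The integral in (23) is approximated by the midpoint rule and its root located by bisection between $x_\alpha$ and a point just short of $x_m$. For this particular $a(x)$ the integral also has the closed form $\alpha/w + 2\log w - 2\alpha - 2\log\tfrac12$ with $w = \tfrac12 - x$, which the sketch uses only as a cross-check.

```python
import math

X_M = 0.5  # maximiser of the assumed logistic a(x)


def a(x):
    """Hypothetical logistic net reproduction rate (an assumption for illustration)."""
    return x * (1.0 - x)


def a_prime(x):
    return 1.0 - 2.0 * x


def criterion(x, alpha, n=4000):
    """Midpoint-rule value of the integral in (23) from 0 to x, for x < X_M."""
    h = x / n
    total = 0.0
    for i in range(n):
        y = (i + 0.5) * h
        total += (alpha - a_prime(y)) / (a(X_M) - a(y))
    return total * h


def criterion_closed_form(x, alpha):
    """Exact value of the same integral, valid only for the logistic a(x)."""
    w = X_M - x
    return alpha / w + 2.0 * math.log(w) - 2.0 * alpha - 2.0 * math.log(X_M)


def improved_threshold(alpha, tol=1e-10):
    """Bisect for the root of the criterion between x_alpha and just below x_m."""
    lo = (1.0 - alpha) / 2.0   # x_alpha, where a'(x) = alpha; criterion is negative here
    hi = X_M - 0.01            # stay clear of the singularity at x_m
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if criterion(mid, alpha) <= 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


alpha = 0.2
x_alpha = (1.0 - alpha) / 2.0
x_new = improved_threshold(alpha)
print("x_alpha =", x_alpha, " improved threshold =", x_new, " x_m =", X_M)
```

With $\alpha = 0.2$ the improved policy stops harvesting below a stock level lying strictly between $x_\alpha$ and $x_m$, whereas the naive policy harvests at rate $u_m$ at every positive stock level.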
CHAPTER 4

The Classic Formulation of the Control Problem: Operators and Filters

1 THE CLASSIC CONTROL FORMULATION

By the 'classic' control formulation we mean the formulation adopted when design was achieved by art (or craft) rather than by optimisation. The two approaches necessarily have a good deal of common ground, however, some of which we explore in this chapter.

A prototype example is that of a mariner trying to make a ship follow a desired course; we may as well suppose it a sailing ship, and so emphasise the classical nature of the example. The block diagram of Figure 1 represents the situation. The ship is affected by the actions taken by the mariner, in changing sail or helm. It is also subject to other and unintended disturbances, such as weather and current. The mariner observes how the ship reacts to these combined forces and revises his actions accordingly, thus closing the control loop. His actions must be based upon what he can observe: e.g. apparent speed and heading of the ship, roll and pitch, the state of the sea and weather, the hazards of the topography and, of course, the specified intended course.

Figure 1 The block diagram for a system in which a mariner controls the course of his ship by the operation of helm and sails, endeavouring to follow a set course against the disturbances offered by weather and sea. (Blocks: Ship, Mariner. Signals: wind and sea, helm and set of sail, actual course, observations, desired course.)
Figure 2 An abstract version of the block diagram of Fig. 1 corresponding to equations (1) and (2). (Blocks: Plant, Controller. Signals: d, u, x, y, w.)
In more general language the situation is that of Figure 2. The physically given system which is to be controlled is termed the plant. In our navigation example the plant is primarily the ship (although see the comments of Exercise 1), which is subject to the disturbances $d$ of sea and weather as well as to the deliberate control actions $u$. The mariner becomes the controller, the unit which determines control actions $u$ on the basis of current information. (This unit could then equally well be a person taking conscious decisions or a device following the rules built into it. The formal view is simply that a control policy formulates an action rule for all foreseeable situations. This view fails if situations arise which are not foreseeable or which are poorly quantifiable: outside our domain!) The current information will include all available observations $y$ on the state of the plant and also the command signal $w$ which specifies those aspects of the course which one wishes the plant to follow.

The block diagram of Figure 2 is equivalent to a set of mathematical relations. The plant is regarded as a system whose inputs are the control sequence $u$ and disturbances $d$ and whose outputs are the actual performance $x$ and the observation sequence $y$. However, the only output upon which decisions can be based are the observations. For control purposes the plant is thus described by an operator

The first of these corresponds to a filter which is causal, but stable only if $|A| < 1$. The second corresponds to a
74
THE CLASSIC FORMULATI ON OF THE CONTROL PROBLEM
filter which is stable if solution
IAI > 1, but is noncausal.
Indeed, it corresponds to a
00
x1
= 2:) Ar' dt+, r=1
of (20). This is mathematically acceptable if $|A| > 1$ (and d uniformly bounded, say), but is of course physically unacceptable, being noncausal.

Exercises and comments

(1) Consider the filter $d \to x$ implied by the difference equation (18). The leading coefficient $A_0$ must be nonzero if the filter is to be causal. Consider the partial power expansion
$$A(z)^{-1} = \sum_{r=0}^{t-1} g_r z^r + A(z)^{-1} \sum_{k=0}^{p-1} c_{tk} z^{t+k},$$
in which the two sums are respectively the dividend and the remainder after t steps of division of A(z) into unity by the long-division algorithm. Show, by establishing recursions for these quantities, that the solution of system (18) for general initial conditions at t = 0 is just
$$x_t = \sum_{r=0}^{t-1} g_r d_{t-r} + \sum_{k=0}^{p-1} c_{tk} x_{-k}.$$
Relation (21) illustrates this in the case p = 1.

(2) Suppose that $A(z) = \prod_{j=1}^{p} (1 - a_j z)$, so that we require $|a_j| < 1$ for all j for stability. Determine the coefficients $c_j$ in the partial fraction expansion
$$A(z)^{-1} = \sum_{j=1}^{p} c_j (1 - a_j z)^{-1}$$
in the case when the $a_j$ are distinct. Hence determine the coefficients $g_r$ and $c_{tk}$ of the partial inversion of Exercise 1.

(3) Model (20) has the frequency response function $(1 - A e^{-i\omega})^{-1}$. An input signal $d_t = e^{i\omega t}$ will indeed be multiplied by this factor in the output x if the filter is stable and sufficient time has elapsed since start-up. This will be true for an unstable filter only if special initial conditions hold (indeed, that x showed this pattern already at start-up). What is the amplitude of the response function?

4 FILTERS IN DISCRETE TIME; THE VECTOR (MIMO) CASE
Suppose now that input d is an m-vector and output x an n-vector. Then the representations (7) and (12) still hold for linear and TIL filters respectively, the first by definition and the second again by Theorem 4.2.1. The coefficients $g_{tr}$ and $g_r$ are still interpretable as transient response, but are now $n \times m$ matrices, since they must give the response of all n components of output to all m components of input.

However, for the continuation of the treatment of the last section, let us take an approach which is both more oblique and more revealing. Note, first, that we are using $\mathcal{T}$ in different senses in the two places where it occurs in (11). On the right, it is applied to the input, and so converts m-vectors into m-vectors. On the left, it is applied to output, and does the same for n-vectors. For this reason, we should not regard $\mathcal{T}$ as a particular case of a filter operator $\mathcal{G}$; we should regard it rather as an operator of a special character, which can be applied to a signal of any dimension, and which in fact operates only on the time argument of that signal and not on its amplitude.

We now restrict ourselves wholly to the TIL case. Since $\mathcal{G}$ is a linear operator, one might look for its eigenvectors; i.e. signals $\xi_t$ with the property $\mathcal{G}\xi_t = \lambda\xi_t$ for some scalar constant $\lambda$. However, since the output dimension is in general not equal to the input dimension we look rather for a scalar signal $\sigma_t$ such that $\mathcal{G}\xi\sigma_t = \eta\sigma_t$ for some fixed vector $\eta$, for any fixed vector $\xi$. The translation-invariance condition (11) implies that, if the input-output pair $\{\xi\sigma_t, \eta\sigma_t\}$ has this property, then so does $\{\xi\sigma_{t-1}, \eta\sigma_{t-1}\}$. If these sequences are unique to within a multiplicative constant for a prescribed $\xi$ then one set of signals must be a scalar multiple of the other, so that $\xi\sigma_{t-1} = z\xi\sigma_t$ for some scalar z. This implies that $\sigma_t \propto z^{-t}$, which reveals the particular role of the exponential sequences. Further, $\eta$ must then be linearly dependent upon $\xi$, although in general by a z-dependent rule, so that we can set $\eta = G(z)\xi$ for some matrix G(z). But it is then legitimate to write
$$\mathcal{G} = G(\mathcal{T}), \tag{22}$$
since $\mathcal{G}$ has this action for any input of the form $\xi z^{-t}$ or a linear combination of such expressions for varying $\xi$ and z. If G(z) has a power series expansion $\sum_r g_r z^r$ then relation (22) implies an expression (12) for the output of the filter, with $g_r$ identified as the matrix-valued transient response. We say 'a' rather than 'the' because, just as in the SISO case, there may be several such expansions, and the appropriate one must be resolved by considerations of causality and stability. These concepts are defined as before, with the lq-condition (13) modified to
[...]
where $g_{rjk}$ is the jk-th element of $g_r$. The transient response $g_r$ is obtained from G(z) exactly as in (17); by determination of the coefficient of $z^r$ in the appropriate expansion of G(z). If causality is demanded then the only acceptable expansion is that in nonnegative powers of z.
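To make the expansion in nonnegative powers concrete, here is a minimal Python sketch. It assumes, purely for illustration, a transfer function of the form $G(z) = (A_0 + A_1 z)^{-1}$; the matrices `A0` and `A1` and the function name are hypothetical and not taken from the text. The coefficients $g_r$ of $z^r$ are read off by matching powers of z in $(A_0 + A_1 z)G(z) = I$.

```python
import numpy as np

def causal_transient_response(A0, A1, n_terms):
    """Coefficients g_r of z^r in the expansion of (A0 + A1 z)^{-1}
    in nonnegative powers of z (the causal expansion).

    Matching powers of z in (A0 + A1 z) * sum_r g_r z^r = I gives
        A0 g_0 = I,   A0 g_r + A1 g_{r-1} = 0  for r >= 1.
    """
    A0_inv = np.linalg.inv(A0)  # requires A0 to be nonsingular
    g = [A0_inv]
    for _ in range(1, n_terms):
        g.append(-A0_inv @ A1 @ g[-1])
    return g

# Hypothetical coefficient matrices, chosen only for illustration.
A0 = np.eye(2)
A1 = np.array([[-0.5, 0.1],
               [0.0, -0.3]])

g = causal_transient_response(A0, A1, n_terms=60)

# Compare the truncated series with direct inversion at a test point
# inside the unit circle.
z = 0.7
G_series = sum(g_r * z**r for r, g_r in enumerate(g))
G_direct = np.linalg.inv(A0 + A1 * z)
print(np.allclose(G_series, G_direct))
```

The recursion needs only that the leading matrix be invertible; whether the resulting sequence $g_r$ decays is a separate question of stability.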
If G(z) is rational (meaning that the elements of G(z) are all rational in z) then the necessary and sufficient condition that the filter be both causal and stable is that of Theorem 4.2.3, applied to each element of G.

Again, one returns to basics and to physical reality if one sees the filter as being generated by a model. Suppose that the filter is generated by the dynamic equation (18), with x and d now understood as vectors. If this relation is to be invertible we shall in general require that input and output be of the same dimension, so that $A(\mathcal{T})$ is a square matrix whose elements are polynomial of degree p in $\mathcal{T}$. The analysis of Section 3 generalises immediately; we can summarise conclusions.

Theorem 4.4.1 The filter $\mathcal{G}$ determined by (18) has transfer function $G(z) = A(z)^{-1}$. The causal form of the filter is stable if and only if the zeros of $|A(z)|$ lie strictly outside the unit circle.
Here $|A(z)|$ is the determinant of A(z), regarded as a function of z. The first conclusion follows from $A(z)G(z) = I$, established as before. The elements of $A(z)^{-1}$ are rational in z, with poles at the zeros of the determinant $|A(z)|$. (More exactly, these are the only possible poles, and all of them occur as the pole of some element.) The second conclusion then follows from Theorem 4.2.3. The fact that z = 0 must not be a zero of $|A(z)|$ implies that $A_0$ is nonsingular. This is of course necessary if (18) is to be seen as a forward recursion, determining $x_t$ in terms of current input and past x-values.

The actual filter output may be only part of the output generated by the dynamic equation (18). Suppose we again take the car driving over the bumpy road as our example, and take the actual filter output as being what the driver observes. He will observe only the grosser motions of the car body, and will in fact observe only some lower-dimensional function of the process variable x.

Exercises and comments

(1) Return to the control context and consider the equation pair
$$x_t = Ax_{t-1} + Bu_{t-1}, \qquad y_t = Cx_{t-1} \tag{23}$$
as a model of plant and observer. In an input-output description of the plant one often regards the control u as the input and the observation y as the output, the state variable x simply being a hidden variable. Show that the transfer function $u \to y$ is $C(I - Az)^{-1}Bz^2$. What is the condition for stability of this causal filter? If a disturbance d were added to the plant equation of (23), then this would constitute a second input. If a control policy has been determined then one has a higher-order formulation; d is now the only input to the controlled system.

(2) The general version of this last model would be
$$\mathcal{A}x + \mathcal{B}u = 0, \qquad y + \mathcal{C}x = 0,$$
where $\mathcal{A}$, $\mathcal{B}$ and $\mathcal{C}$ are causal TIL operators. If $\mathcal{A} = A(\mathcal{T})$ etc. then the transfer function $u \to y$ is $C(z)A(z)^{-1}B(z)$.

5 COMPOSITION AND INVERSION OF FILTERS; z-TRANSFORMS
We assume for the remainder of the chapter that all filters are linear, translation-invariant and causal. Let us denote the class of such filters by $\mathcal{C}$. If filters $\mathcal{G}_1$ and $\mathcal{G}_2$ are applied in succession then the compound filter thus generated also lies in $\mathcal{C}$ and has action
$$\mathcal{G}_2\mathcal{G}_1 = G_2(\mathcal{T})\,G_1(\mathcal{T}). \tag{24}$$
That is, its transient response at lag r is $\sum_v g_{2v}\, g_{1,r-v}$, in an obvious terminology. However, relation (24) expresses the same fact much more neatly.

The formalism we have developed for TIL filters shows that we can manipulate the filter operators just as though the operator $\mathcal{T}$ were an ordinary scalar, with some guidance from physical considerations as to how power series expansions are to be taken. This formalism is just the Heaviside operator calculus, and is completely justified as a way of expressing identities between coefficients such as the $A_r$ of the vector version of (18) and the consequent transient response $g_r$. However, there is a parallel and useful formalism in terms of z-transforms (which become Laplace transforms in continuous time). This should not be seen as justifying the operator formalism (such justification not being needed) but as supplying useful analytic characterisations and evaluations.

Suppose that the vector system
$$A(\mathcal{T})x = d \tag{25}$$
does indeed start up at time zero, in that both x and d are zero on the negative time axis. Define the z-transforms
$$\bar{x}(z) = \sum_{t=0}^{\infty} x_t z^t, \qquad \bar{d}(z) = \sum_{t=0}^{\infty} d_t z^t \tag{26}$$
for scalar complex z. Then it is readily verified that relation (25) amounts to
$$A(z)\bar{x}(z) = \bar{d}(z)$$
with an inversion
$$\bar{x}(z) = A(z)^{-1}\bar{d}(z). \tag{27}$$
This latter expression amounts to the known conclusion $G(z) = A(z)^{-1}$ if we understand that $A(z)^{-1}$ and $\bar{d}(z)$ are to be expanded in nonnegative powers of z.
The inversion is completed by the assertion that $x_t$ is the coefficient of $z^t$ in the expansion of $\bar{x}(z)$. Analytically, this is expressed as
$$x_t = \frac{1}{2\pi i}\oint \bar{x}(z)\, z^{-t-1}\, dz \tag{28}$$
where the contour of integration is a circle around the origin in the complex plane small enough that all singularities of $\bar{x}(z)$ lie outside it.

Use of transforms supplies an alternative language in which the application of an operator, as in $A(\mathcal{T})x$, is replaced by the application of a simple matrix multiplier: $A(z)\bar{x}$. This can be useful, in that important properties of the operator $A(\mathcal{T})$ can be expressed in terms of the algebraic properties of A(z), and calculable integral expressions can be obtained for the transient response $g_r$ of (17). However, it is also true that both the operator formalism and the concept of a transfer function continue to be valid in cases where the signal transforms $\bar{x}$ and $\bar{d}$ do not exist, as we shall see in Chapter 13.

Exercises and comments

(1) There is a version of (27) for arbitrary initial conditions. If one multiplies relation (25) by $z^t$ and then sums over $t \geq 0$ one obtains the relation
$$\bar{x}(z) = A(z)^{-1}\Big[\bar{d}(z) - \sum_{k=1}^{p}\sum_{j=k}^{p} A_j x_{-k} z^{j-k}\Big]. \tag{29}$$
This is a transform version of the solution for $x_t$ of Exercise 3.1, implying that solution for all $t \geq 0$. Note that we could write it more compactly as
[...] (30)
where the operator $[\,\cdot\,]_+$ retains only the terms in nonnegative powers of z from the power series in the bracket. It is plain from this solution that stability of the filter implies also that the effect of initial conditions dies away exponentially fast with increasing time.

(2) Consider the relation $x_t = g_0 d_t + g_1 d_{t-1}$, for which the transient response is zero at lags greater than unity. There is interest in seeing whether this could be generated by a recursion of finite order, where we would for example regard relation (25) as a recursion of order p. If we define the augmented process variable $\tilde{x}$ as that with vector components $x_t$ and $d_t$ then we see that it obeys a recursion $\tilde{x}_t = A\tilde{x}_{t-1} + Bd_t$, where
$$A = \begin{bmatrix} 0 & g_1 \\ 0 & 0 \end{bmatrix}, \qquad B = \begin{bmatrix} g_0 \\ 1 \end{bmatrix}.$$
The fact that this system could not be unstable under any circumstances is reflected in the fact that $|I - Az|$ has no zeros, and so G(z) has no poles.
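The augmented recursion of Exercise 2 can be checked numerically. The sketch below is only an illustration: the values of `g0` and `g1` and the random input sequence are assumptions, and the augmented variable is taken as $(x_t, d_t)$ with the system started from rest.

```python
import numpy as np

g0, g1 = 2.0, -0.5           # hypothetical transient-response coefficients
A = np.array([[0.0, g1],
              [0.0, 0.0]])
B = np.array([g0, 1.0])

rng = np.random.default_rng(0)
d = rng.standard_normal(20)  # arbitrary input sequence d_0, ..., d_19

# Run the first-order recursion on the augmented variable (x_t, d_t),
# starting from rest (x and d zero before time 0).
state = np.zeros(2)
x_rec = []
for d_t in d:
    state = A @ state + B * d_t
    x_rec.append(state[0])
x_rec = np.array(x_rec)

# Direct evaluation of x_t = g0 d_t + g1 d_{t-1}, with d_{-1} = 0.
d_prev = np.concatenate(([0.0], d[:-1]))
x_direct = g0 * d + g1 * d_prev
print(np.allclose(x_rec, x_direct))

# A is nilpotent, so det(I - A z) equals 1 for every z: no zeros, no poles.
for z in (0.3, -2.0, 1.5 + 0.5j):
    print(np.isclose(np.linalg.det(np.eye(2) - A * z), 1.0))
```

Because $A^2 = 0$, the effect of any initial condition vanishes after two steps, which is the finite-memory property of the original relation.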
6 FILTERS IN CONTINUOUS TIME

In continuous time the translation operator $\mathcal{T}^r$ is defined for any real r; let us replace r by the continuous variable $\tau$. However, the role of the unit translation $\mathcal{T}$ must be taken over by that of an infinitesimal translation. More specifically, one must consider the rate of change with translation $\mathcal{D} = \lim_{\tau \downarrow 0} \tau^{-1}[1 - \mathcal{T}^{\tau}]$, which is just the differential operator $\mathcal{D} = d/dt$. The relation
$$\dot{x}(t - \tau) = \lim_{\delta t \downarrow 0} \frac{x(t - \tau + \delta t) - x(t - \tau)}{\delta t}$$
amounts to the operator relation
$$\frac{d}{d\tau}\mathcal{T}^{\tau} = -\mathcal{D}\mathcal{T}^{\tau}.$$
Since $\mathcal{T}^0 = 1$ this has formal solution
$$\mathcal{T}^{\tau} = e^{-\tau\mathcal{D}}, \tag{31}$$
which exhibits $\mathcal{D}$ as the infinitesimal generator of the translations. Relation (31) can be regarded as an expression of Taylor's theorem
$$e^{-\tau\mathcal{D}}x(t) = \sum_{j=0}^{\infty} \frac{(-\tau\mathcal{D})^j}{j!}\, x(t) = x(t - \tau) = \mathcal{T}^{\tau}x(t). \tag{32}$$
Note, though, that translation $x(t - \tau)$ of $x(t)$ makes sense even if x is not differentiable to any given order, let alone indefinitely.

[...] and so excludes a differentiating action for the filter. If $|G(s)| \to 0$ as $|s| \to \infty$ then the filter is said to be strictly proper. In this case even delta-functions are excluded in g; i.e. response must be smoothly distributed.

A rational transfer function G(s) is now one which is rational in s. As in the discrete-time case, this can be seen as the consequence of finite-order, finite-dimensional linear dynamics. The following theorem is the analogue of Theorem 4.2.3.
Theorem 4.9.1 Suppose that a causal filter has a response function G(s) which is rational and proper. Then the filter is stable if and only if all poles of G(s) have strictly negative real part (i.e. lie strictly in the left half of the complex plane).

Proof This is again analogous to the proof of Theorem 4.2.3. G(s) will have the expansion in partial fractions
$$G(s) = \sum_r c_r s^r + \sum_j \sum_k d_{jk}(s - s_j)^{-k-1}$$
where the ranges of summation are finite, j and k are nonnegative integers and the $s_j$ are the nonzero poles of G(s). Negative powers $s^r$ cannot occur, since these would imply a component in the output consisting of integrated input, and the integral of a bounded function will not be bounded in general. Neither can positive powers occur, because of the condition that the filter be proper. The first sum thus reduces to a constant $c_0$. This corresponds to an instantaneous term $c_0 d(t)$ in filter output $\mathcal{G}d$, which is plainly stable. The term in $(s - s_j)^{-k-1}$ gives a term proportional to $\tau^k \exp(s_j\tau)$ in the filter response; this component is then stable if and only if $\mathrm{Re}(s_j) < 0$. The condition of the theorem is thus sufficient, and necessity follows, as previously, from the linear independence of the components of filter response. □

The decay example of Exercise 7.1 illustrates these points. The transfer function $G_j(s)$ for the output $x_j$ had poles at the values $-\mu_k$ (k = 0, 1, ..., j). These are strictly negative for k < p, and so $G_j$ determined a stable filter for j < p. The final filter had a response singularity at $s = -\mu_p = 0$. This gave rise to an instability corresponding, as we saw, to the accumulation of matter in the final state.

Exercises and comments

(1) The second stock example is that of the hanging pendulum: a damped harmonic oscillator. Suppose that the bob of the pendulum has unit mass and that it is driven by a forcing term d. The linearised equation of motion (see Section 5.1) for the angle of displacement x of the pendulum is then $A(\mathcal{D})x = d$ where $A(s) = s^2 + a_1 s + a_0$. Here $a_1$ (nonnegative) represents damping and $a_0$ (positive) represents the restoring force due to gravity. If $a_1 = 0$ then the zeros of A(s) have the purely imaginary values $\pm i\sqrt{a_0}$ (corresponding to an undamped oscillation of the free pendulum, with an amplitude determined by initial conditions). If $0 < a_1 < 2\sqrt{a_0}$ then the zeros are complex with negative real part (corresponding to a damped oscillation of the free pendulum). If $a_1 \geq 2\sqrt{a_0}$ then they are negative real (corresponding to a damped nonoscillatory motion of the free pendulum). The equivalent filter is thus stable or unstable according as the pendulum is damped or not.

A damped harmonic oscillator would also provide the simplest useful model of our car driving over a bumpy road, the output variable x being the vertical displacement of the car body. If the suspension is damped lightly enough that the car shows an oscillatory response near the natural frequency of vibration $\omega_0 = \sqrt{a_0}$ then the response function $A(s)^{-1}$ will be large in modulus at $s = \pm i\omega_0$. This can be observed when one drives along an unsealed road which has developed regular transverse ridges (as can happen on a dry creek bed). There is a critical speed which must be avoided if the car is not to develop violent oscillations. The effect is enhanced by the fact that the ridges develop in response to such oscillations!
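The three damping regimes of the pendulum, and the resonance near $\omega_0$, can be seen in a short numerical sketch. The coefficient values below are hypothetical, chosen only so that $\sqrt{a_0} = 2$; the function name is likewise an assumption.

```python
import numpy as np

def pendulum_zeros(a1, a0):
    """Zeros of A(s) = s^2 + a1 s + a0."""
    return np.roots([1.0, a1, a0])

a0 = 4.0  # hypothetical restoring coefficient, so sqrt(a0) = 2
cases = {
    "undamped (a1 = 0)": 0.0,
    "lightly damped (0 < a1 < 2 sqrt(a0))": 1.0,
    "heavily damped (a1 >= 2 sqrt(a0))": 5.0,
}
for label, a1 in cases.items():
    zeros = pendulum_zeros(a1, a0)
    # A small margin guards against rounding in the undamped case.
    stable = bool(np.all(zeros.real < -1e-9))
    print(label, zeros, "stable" if stable else "not stable")

# Modulus of the response function 1/A(i w) for light damping:
# it peaks close to the natural frequency w0 = sqrt(a0).
a1 = 0.2
w = np.linspace(0.1, 4.0, 400)
gain = 1.0 / np.abs((1j * w) ** 2 + a1 * (1j * w) + a0)
w_peak = w[np.argmax(gain)]
print(w_peak)
```

The peak of the gain sharpens as `a1` is reduced, which corresponds to the critical driving speed mentioned above.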
10 SYSTEMS STABILISED BY FEEDBACK

We shall from now on often specify a filter simply by its transfer function, so that we write G rather than $\mathcal{G}$. In continuous time the understanding is then that G is understood as denoting $G(\mathcal{D})$ or G(s) according as one is considering action of the filter in the time domain or the transform domain. Correspondingly, in discrete time it denotes $G(\mathcal{T})$ or G(z), as appropriate.

We are now in a position to resume the discussion of Section 1. There the physically given system (the plant) was augmented to the controlled system by addition of a feedback loop incorporating a controller. The total system thus consists of two filters in a loop, corresponding to plant and controller, and one seeks to choose the controller so as to achieve satisfactory performance of the whole system.

Optimisation considerations, a consciousness that 'plant' constitutes a model for more than just the process under control (see Exercise 1.1) and a later concern for robustness to misspecification (see Chapter 17) lead one to modify the block diagram of Figure 2 somewhat, to that of Figure 4. In this diagram u and y denote, as ever, the signals constituted by control and observations respectively. The signal $\zeta$ combines all primitive exogenous inputs to the system: e.g. plant noise, observation noise and command signals (or the noise that drives command signals if these are generated by a statistical model).
[Figure 4: block diagram. Blocks: G, K. Signals: ζ, Δ, y, u.]
Figure 4 The block diagram corresponding to the canonical set of system equations (45). The plant G, understood in a wide sense, has as outputs the actual observations y and the vector of deviations Δ. It has as inputs the control u and the vector ζ of 'primitive' inputs to the system.
The signal $\Delta$ comprises all the 'deviations' which are penalised in the cost function. These would for example include tracking error and those aspects of the control u itself which incur cost.

Here the plant G is understood as including all given aspects of the system. These certainly include plant in the narrow sense (the process being controlled) but also the sensor system which provides the observations. They may also include subsidiary models used to predict, for example, sea and weather for the long-distance yachtsman of Section 1, or the future inflow to the reservoir of Section 2.9, or the command signal constituted by the position of a vehicle one is attempting to follow. The optimiser may be unable to exert any control upon these aspects, but he must regard them as part of the total given physical model.

As well as the control input u to this generalised plant one has the exogenous input $\zeta$. This comprises all quantities which are primitive inputs to the system; i.e. exogenous to it and not explained by it. These include statistical noise variables (white noise, which no model can reduce) and also command sequences and the like which are known in advance (and so for which no model is needed).

It may be thought that some of these inputs should enter the system at another point; e.g. that observation noise should enter just before the controller, and that a known command sequence should be a direct input to the controller. However, the simple formalism of Figure 4 covers all these cases. The input $\zeta$ is in general a vector input whose components feed into the plant at different ports. A command or noise signal destined for the controller can be routed through the plant, and either included in or superimposed upon the information stream y. As far as plant outputs are concerned, the deviation signal $\Delta$ will not be completely observable in general, but must be defined if one is to evaluate (and optimise) system performance.
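If we assume time-invariant linear structure then the block diagram of Figure 4 is equivalent to a set of relations
$$\Delta = G_{11}\zeta + G_{12}u, \qquad y = G_{21}\zeta + G_{22}u, \qquad u = Ky. \tag{45}$$
We can write this as an equation system determining the system variables in terms of the system inputs; the endogenous variables in terms of the exogenous variables:
$$\begin{bmatrix} I & 0 & -G_{12} \\ 0 & I & -G_{22} \\ 0 & -K & I \end{bmatrix} \begin{bmatrix} \Delta \\ y \\ u \end{bmatrix} = \begin{bmatrix} G_{11} \\ G_{21} \\ 0 \end{bmatrix} \zeta. \tag{46}$$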
By inverting this set of equations one determines the system transfer function, which specifies the transfer functions from all components of the system input $\zeta$ to the three system outputs: $\Delta$, y and u. The classic demand is that the response of tracking error to command signal should be stable, but this may not be enough. One will in general require that all signals occurring in the system should be finite throughout their course.

Denote the first matrix of operators in (46) by M, so that it is M which must be inverted. The simplest demand would be that the solution of (46) should be determinate; i.e. that M(s) should not be singular identically in s. A stronger demand would be that the system transfer function thus determined should be proper, so that the controlled system does not effectively differentiate its inputs. A yet stronger demand is that of internal stability; that the system transfer function should be stable.

Suppose all the coefficient matrices in (46) are rational in s. Then the case which is most clear-cut is that in which the poles of all the individual transfer functions are exactly at the zeros of $|M(s)|$, i.e. at the zeros of $|I - G_{22}(s)K(s)|$. In such a case stability of any particular response (e.g. of error to command signal) would imply internal stability, and the necessary and sufficient condition for stability would be that $|I - G_{22}(s)K(s)|$ should have all its zeros strictly in the left half-plane. In fact, it is only in quite special cases that this pattern fails. These cases are important, however, because performance deteriorates as one approaches them.

To illustrate the kind of thing that can happen, let us revert to the model (3) represented in Figure 3, which is indeed a special case of that which we are considering. The plant output y is required to follow the command signal w; both of these are observable and the controller works on their difference e. The only noise is process noise d, superimposed upon the control input.
Let us suppose for simplicity that all signals are scalar. Solution (5) then becomes
$$e = (1 + GK)^{-1}(Gd - w) \tag{47}$$
in the response function notation. Suppose that G(s) and K(s) are rational. Then the transfer function $(1 + GK)^{-1}$ of e to w is rational and proper and its only poles are precisely at the zeros of $1 + G(s)K(s)$. It is then stable if and only if these zeros lie strictly in the left half-plane. The same will be true of the response function $(1 + GK)^{-1}G$ of e to d if all unstable poles of G are also poles of GK. However, suppose that the plant response G has an unstable pole which is cancelled by a zero of the controller response K. Then this pole will persist in the transfer function $(1 + GK)^{-1}G$, which is consequently unstable. To take the simplest numerical example, suppose that $G = (s - 1)^{-1}$, $K = 1 - s^{-1}$. Then the transfer functions
$$(1 + GK)^{-1} = \frac{s}{s + 1}, \qquad (1 + GK)^{-1}G = \frac{s}{(s + 1)(s - 1)}$$
are respectively stable and unstable. One can say in such a case that the controller is such that its output does not excite this unstable plant mode, so that the mode seems innocuous. The mode is there and ready to be excited, however, and the noise does just that. Moreover, the fact that the controller cannot excite the mode means that it is also unable to dampen it.

These points are largely taken care of automatically when the controller is chosen optimally. If certain signal amplitudes are penalised in the cost function, then those signals will be stabilised to a low value in the optimal design, if they can be. If inputs are such that they will excite an instability of the total system then such instabilities will be designed out, if they can be. If inputs are such that their differentials do not exist then the optimal system will be proper, if it can be. One may say that optimality enforces a degree of robustness in that, as far as physical constraints permit, it protects against any irregularity permitted in system input which is penalised in system output. Optimisation, like computer programming, is a very literal procedure. It supplies all the protection it can against contingencies which are envisaged, none at all against others.

Exercises and comments

(1) Application of the converse to the final value theorem (Section 8) can yield useful information about dynamic lags: the limiting values for large time of the tracking error e or its derivatives. Consider a scalar version of the simple system of Figure 3. If the loop transfer function has the form $k s^{-N}(1 + o(s))$ for small s, then k is said to be the effective gain of the loop and N its type number: the effective number of integrations achieved in passage around the loop. Consider a command signal w which is equal to $t^n/n!$ for positive t and zero otherwise. Then $\bar{w} = s^{-n-1}$, and it follows from (47) and an application of the converse to the final value theorem (if applicable) that the limit of $\mathcal{D}^j e$ for large t is $\lim_{s \downarrow 0} O(s^{N-n+j})$. It thus follows that the limit offset in the jth differential of the output path y is zero, finite or infinite according as n is less than, equal to or greater than N + j. So, suppose that w is the position of a fleeing hare and y the position of a dog pursuing it. Then a zero offset for j = 1 and n = 2 would mean that, if the hare maintained a constant acceleration (!) then at least the difference in the velocities of the dog and the hare would tend to zero with time. It appears then that an increase in N improves offset. However, it also causes a decrease in stability, and N = 2 is regarded as a practical upper limit.
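Returning to the numerical example of this section, with $G = (s-1)^{-1}$ and $K = 1 - s^{-1}$, the hidden unstable mode can be exhibited by a small polynomial calculation. The sketch below is one possible way to do it, not a general-purpose tool: the helper `poles_after_cancellation` and its tolerance are assumptions introduced here, and common factors are cancelled by matching numerically computed roots.

```python
import numpy as np

def poles_after_cancellation(num, den, tol=1e-6):
    """Poles of num(s)/den(s) once common roots of numerator and
    denominator have been cancelled (roots matched within tol)."""
    zeros = list(np.roots(num))
    poles = list(np.roots(den))
    for zero in zeros:
        if not poles:
            break
        dist = [abs(zero - p) for p in poles]
        i = int(np.argmin(dist))
        if dist[i] < tol:
            poles.pop(i)  # cancel one pole against this zero
    return np.array(poles)

# Plant G = 1/(s - 1) and controller K = (s - 1)/s = 1 - 1/s,
# as coefficient lists in descending powers of s.
G_num, G_den = [1.0], [1.0, -1.0]
K_num, K_den = [1.0, -1.0], [1.0, 0.0]

# 1 + GK = (G_den K_den + G_num K_num) / (G_den K_den)
open_den = np.polymul(G_den, K_den)
closed = np.polyadd(open_den, np.polymul(G_num, K_num))

# (1 + GK)^{-1} = G_den K_den / closed
poles_S = poles_after_cancellation(open_den, closed)
# (1 + GK)^{-1} G = G_den K_den G_num / (closed * G_den)
poles_SG = poles_after_cancellation(np.polymul(open_den, G_num),
                                    np.polymul(closed, G_den))

print("poles of (1+GK)^-1  :", poles_S)
print("poles of (1+GK)^-1 G:", poles_SG)
```

The response of e to w retains only a left half-plane pole, while the response of e to d keeps a pole in the right half-plane: the mode which the controller neither excites nor damps.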
CHAPTER 5

State-structured Deterministic Models

In the last chapter we considered deterministic models in the classic input-output formulation. In this we discuss models in the more explicit state-space formulation, specialising rather quickly to the time-homogeneous linear case. The advantage of the state-space formulation is that one has a physically explicit model whose dynamics and whose optimisation can both be treated by recursive methods, without assumption of stationarity. Concepts such as those of controllability and observability are certainly best developed first in this framework. The advantage of the input-output formulation is that one can work with a more condensed formulation of the model (in that there is no necessity to expand it to a state description) and that the transform techniques then available permit a powerful treatment of, in particular, the stationary case. We shall later move freely between the two formulations, as appropriate.

1 STATE-STRUCTURE FOR THE UNCONTROLLED CASE: STABILITY; LINEARISATION

Let us set questions of control and observation to one side to begin with, and simply consider a dynamic system whose course is described by a process variable x. We have already introduced the notion of state structure for a discrete-time model in Section 2.1. The system has state structure if x obeys a simple recursion
$$x_t = a(x_{t-1}, t), \tag{1}$$
when x is termed the state variable. Dynamics are time-homogeneous if they are governed by time-independent rules, in which case (1) reduces to the form
$$x_t = a(x_{t-1}). \tag{2}$$
We have said nothing of the set of values within which x may vary. In the majority of practical cases x is numerical in value: we may suppose it a vector of finite dimension n. The most amenable models are those which are linear, and the assumption of linearity often has at least a local validity. A model which is state-structured, time-homogeneous and linear then necessarily has the form
$$x_t = Ax_{t-1} + b \tag{3}$$
where A is a square matrix and b an n-vector. If the equation $(I - A)\bar{x} = b$ has a solution for $\bar{x}$ (see Exercise 1) then we can normalise b to zero by working with a new variable $x - \bar{x}$. If we assume this normalisation performed, then the model (3) reduces to
$$x_t = Ax_{t-1}. \tag{4}$$
The model has by now been pared down considerably, but is still interesting enough to serve as a basis for elaboration in later sections to controlled and imperfectly observed versions.

We are now interested in the behaviour of the sequence $x_t = A^t x_0$ generated by (4). It obviously has an equilibrium point x = 0 (corresponding to the equilibrium point $x = \bar{x}$ of (3)). This will be the unique equilibrium point if $I - A$ is nonsingular, when the only solution of x = Ax is x = 0. Supposing this true, one may now ask whether this equilibrium is stable in that $x_t \to 0$ with increasing t for any $x_0$.

Theorem 5.1.1 The equilibrium of system (4) at x = 0 is stable if and only if all eigenvalues of the matrix A have modulus strictly less than unity.

Proof Let $\lambda$ be the eigenvalue of maximal modulus. Then there are sequences $A^t x_0$ which grow as $\lambda^t$, so $|\lambda| < 1$ is necessary for stability. On the other hand, no such sequence grows faster than $t^{n-1}|\lambda|^t$, so $|\lambda| < 1$ is also sufficient for stability. □

A matrix A with this property is termed a stability matrix. More explicitly, it is termed a stability matrix 'in the discrete-time sense', since the corresponding property in continuous time differs somewhat. Note that if the equilibrium at zero is stable then it is necessarily unique; if it is not unique then it cannot be stable (Exercise 2).

Note that $g_t = A^t$ is the transient response function of the system (4) to a driving input. The fact that stability implies exponential convergence of this response to zero also implies lq-stability of the filter thus constituted, and so of the filter of Exercise 4.4.1. The stability criterion deduced there, that $C(I - Az)^{-1}B$ should have all its singularities strictly outside the unit circle, is implied by that of Theorem 5.1.1.

All this material has a direct analogue in the continuous-time case, at least for the case of vector x (to which we are virtually forced; see the discussion of Exercise 21.1). The analogue of (2), a state-structured time-homogeneous model, is
$$\dot{x} = a(x). \tag{5}$$
(For economy of notation we use the same notation a(x) as in (2), but the functions in the two cases are quite distinct.) The normalised linear version of this model, corresponding to (4), is
$$\dot{x} = Ax. \tag{6}$$
The analogue of the formal solution $x_t = A^t x_0$ of (4) is the solution
$$x(t) = e^{tA}x(0) := \sum_{j=0}^{\infty} \frac{(tA)^j}{j!}\, x(0) \tag{7}$$
of (6). The stability criterion is also analogous.
Theorem 5.1.2 The equilibrium of system (6) at x = 0 is stable if and only if all eigenvalues of the matrix A have real part strictly less than zero.

Proof Let $\sigma$ be the eigenvalue of maximal real part. Then there are functions $x(t) = e^{tA}x(0)$ which grow as $e^{\sigma t}$, so $\mathrm{Re}(\sigma) < 0$ is necessary for stability. On the other hand, no such function grows faster than $t^{n-1}e^{\mathrm{Re}(\sigma)t}$, so $\mathrm{Re}(\sigma) < 0$ is also sufficient for stability. □
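The two stability criteria, Theorems 5.1.1 and 5.1.2, translate directly into eigenvalue tests. The following sketch is illustrative only: the matrix `A`, the initial vector `x0` and the function names are assumptions, and the matrix exponential is evaluated by a truncated form of the series (7) rather than by a library routine.

```python
import numpy as np

def is_stable_discrete(A):
    """Theorem 5.1.1: all eigenvalues of A strictly inside the unit circle."""
    return bool(np.max(np.abs(np.linalg.eigvals(A))) < 1.0)

def is_stable_continuous(A):
    """Theorem 5.1.2: all eigenvalues of A strictly in the left half-plane."""
    return bool(np.max(np.linalg.eigvals(A).real) < 0.0)

def expm_series(M, n_terms=60):
    """Truncated version of the series (7) for exp(M)."""
    term = np.eye(M.shape[0])
    total = term.copy()
    for j in range(1, n_terms):
        term = term @ M / j
        total = total + term
    return total

# Hypothetical matrix, chosen only for illustration.
A = np.array([[-0.5, 1.0],
              [0.0, -0.2]])
x0 = np.array([1.0, 1.0])

print(is_stable_discrete(A), is_stable_continuous(A))

# Discrete time: x_t = A^t x_0 after 200 steps.
x_discrete = np.linalg.matrix_power(A, 200) @ x0
# Continuous time: x(t) = exp(tA) x(0) at t = 100, evaluated as
# exp(A)^100 so that the truncated series stays accurate.
x_continuous = np.linalg.matrix_power(expm_series(A), 100) @ x0
print(np.linalg.norm(x_discrete), np.linalg.norm(x_continuous))
```

This particular matrix happens to satisfy both criteria; in general the two conditions are distinct, since one concerns the unit circle and the other the left half-plane.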
A matrix A with this property is a stability matrix in the continuous-time sense.

If a(x) is nonlinear then there may be several solutions of a(x) = 0, and so several possible equilibrium points. Recall the definitions of Section 1.2: that the domain of attraction of an equilibrium point is the set of initial values from which the path would lead to that point, and that the equilibrium is locally stable if its domain of attraction includes a neighbourhood of the point. Henceforth we shall take 'stability' as meaning simply 'local stability'. For nonlinear models the equilibrium points are usually separated (which is not possible in the linear case; see Exercise 2) and so one or more of them can be stable. Suppose that $\bar{x}$ is such an equilibrium point, and define the deviation $\Delta(t) = x(t) - \bar{x}$ from the equilibrium value. If a(x) possesses a matrix $a_x$ of first derivatives which is continuous in the neighbourhood of $\bar{x}$ and has value A at $\bar{x}$ then equation (5) becomes
$$\dot{\Delta} = A\Delta \tag{8}$$
to within a term $o(\Delta)$ in the neighbourhood of $\bar{x}$. The state variable x will indeed thus remain in the neighbourhood of $\bar{x}$ if A is a stability matrix, and it is by testing $A = a_x(\bar{x})$ that one determines whether or not $\bar{x}$ is locally stable.

The passage from (5) to (8) is termed a linearisation of the model in the neighbourhood of $\bar{x}$, for obvious reasons, and the technique of linearisation is indeed an invaluable tool for the study of local behaviour. However, one should be aware that nonlinear systems such as (2) and (5) can show limiting behaviour much more complicated than that of passage to a static equilibrium: e.g. limit cycles or chaotic behaviour. Either of these would represent something of a failure in most control contexts, however, and it is reasonable to expect that optimisation will exclude them for all but the most exotic of examples.
We have already seen an example of multiple equilibria in the harvesting example of Section 1.2. If the harvest rate was less than the maximal net reproduction rate then there were two equilibria; one stable and the other unstable. The stock example is of course the pendulum; in its linearised form the archetypal harmonic oscillator. If we suppose the pendulum undamped then the equation of motion for the angle $\alpha$ of displacement from the hanging position is
$$\ddot{\alpha} + \omega^2 \sin\alpha = 0, \tag{9}$$
where $\omega^2$ is inversely proportional to the effective length of the pendulum. There are two static equilibrium positions: $\alpha = 0$ (the hanging position) and $\alpha = \pi$ (the inverted position). Let us bring the model to state form and linearise it simultaneously, by defining $\Delta$ as the vector whose elements are the deviations of $\alpha$ and $\dot{\alpha}$ from their equilibrium values, and then retaining only first-order terms in $\Delta$. The matrix $A$ for the linearised system is then
$$A = \begin{bmatrix} 0 & 1 \\ \pm\omega^2 & 0 \end{bmatrix},$$
where the $+$ and $-$ options refer to the inverted and hanging equilibria respectively. We find that $A$ has eigenvalues $\pm\omega$ in the inverted position, so this is certainly unstable. In the hanging position the eigenvalues are $\pm i\omega$, so this is also unstable, but only just: the amplitude of the oscillation about equilibrium remains constant. Of course, one can calculate these eigenvalues simply by determining which values $\sigma$ are consistent with a solution $\alpha(t) = e^{\sigma t}$ of the linearised version of equation (9). However, we shall tend to discuss models in their state-reduced forms.

Discrete-time models can equally well be linearised; we leave details to the reader. We shall develop some examples of greater novelty in the next section, when we consider controlled processes.
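The eigenvalue calculation for the two pendulum equilibria can be checked numerically. The following is a minimal sketch, not part of the original text: the helper names `eig2` and `classify` and the value `omega = 2.0` are illustrative assumptions, and the label "marginal" is used here for the borderline case that the text describes as unstable "but only just".

```python
import cmath

def eig2(a11, a12, a21, a22):
    """Eigenvalues of a 2x2 matrix from its trace and determinant."""
    tr = a11 + a22
    det = a11 * a22 - a12 * a21
    root = cmath.sqrt(tr * tr - 4.0 * det)
    return ((tr + root) / 2.0, (tr - root) / 2.0)

def classify(eigs, tol=1e-12):
    """Continuous-time test: the sign of the largest real part decides."""
    worst = max(e.real for e in eigs)
    if worst < -tol:
        return "stable"
    if worst > tol:
        return "unstable"
    return "marginal"

omega = 2.0  # illustrative frequency, not taken from the text

hanging = eig2(0.0, 1.0, -omega**2, 0.0)   # A = [[0, 1], [-omega^2, 0]]
inverted = eig2(0.0, 1.0, omega**2, 0.0)   # A = [[0, 1], [+omega^2, 0]]

print(hanging, classify(hanging))
print(inverted, classify(inverted))
```

The hanging position yields a purely imaginary pair, so neither equilibrium passes the strict test of Theorem 5.1.2.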
Exercises and comments
(1) This exercise and the next refer to the discrete-time model (3). If $(I - A)x = b$ has no solution for $x$ then a finite equilibrium value certainly does not exist. It follows also that $I - A$ must be singular, so that $A$ has an eigenvalue $\lambda = 1$.
(2) If $(I - A)x = b$ has more than one solution then, again, $I - A$ is singular. Furthermore, any linear combination of these solutions with scalar coefficients (i.e. any point in the smallest linear manifold $\mathcal{M}$ containing these points) is a solution, and a possible equilibrium. There is neutral equilibrium between points of $\mathcal{M}$ in that, once $x_t$ is in $\mathcal{M}$, there is no further motion.
(3) Suppose that the component $x_{jt}$ of the vector $x_t$ represents the number (assumed continuous-valued) of individuals of age $j$ in a population at time $t$, and that the $x_{jt}$ satisfy the dynamic equations
$$x_{0t} = \sum_{j=0}^{\infty} a_j x_{j,t-1}, \qquad x_{jt} = b_{j-1}x_{j-1,t-1} \quad (j > 0).$$
The interpretation is that $a_j$ and $b_j$ are respectively the reproduction and survival rates at age $j$. One may assume that $b_j = 0$ for some finite $j$ if one wishes the dimension of the vector $x$ to be finite. Show that the equilibrium at $x = 0$ is stable (i.e. the population becomes extinct in the course of time) if all roots $\lambda$ of
$$\sum_{j=0}^{\infty} b_0 b_1 \cdots b_{j-1}\, a_j \lambda^{-j-1} = 1$$
are strictly less than unity in modulus. Show that the root of greatest modulus is the unique positive real root.
(4) A pattern observed in many applications is that the recursion (2) holds for a scalar $x$ with the function $a(x)$ having a sigmoid form: e.g. $x^2/(1 + x)^2$ $(x \geq 0)$. ! 0 and Jg >a are thus necessary
and sufficient for stability of this equilibrium.
(6) The eigenvalues and eigenfunctions of $A$ are important in determining the 'modes' of the system (4) or (6). Consider the continuous-time case (6) for definiteness, and suppose that $A$ has the full spectral representation $A = H\Lambda H^{-1}$, where $\Lambda$ is the diagonal matrix of eigenvalues $\lambda_j$ and the columns of $H$ (rows of $H^{-1}$) are the corresponding right (left) eigenvectors. Then, by adoption of a new state vector $\tilde{x} = H^{-1}x$, one can write the vector equation (6) as the $n$ decoupled scalar equations $\dot{\tilde{x}}_j = \lambda_j \tilde{x}_j$, corresponding to the $n$ decoupled modes of variation. An oscillatory mode will correspond to a pair of complex conjugate eigenvalues.
(7) The typical case for which a complete diagonalisation cannot be achieved is that in which $A$ takes the form
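$$A = \begin{bmatrix} \lambda & \mu \\ 0 & \lambda \end{bmatrix}$$
for nonzero $\mu$. One can imagine that two population groups both reproduce at net rate $\lambda$, but that group 2 also generates members of group 1 at rate $\mu$. There is a double eigenvalue of $A$ at $\lambda$, but
$$A^j = \lambda^{j-1}\begin{bmatrix} \lambda & j\mu \\ 0 & \lambda \end{bmatrix}, \qquad e^{At} = e^{\lambda t}\begin{bmatrix} 1 & \mu t \\ 0 & 1 \end{bmatrix}.$$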
One can regard this as a situation in which a mode of transient response $e^{\lambda t}$ (in continuous time) is driven by a signal of the same type; the effect is to produce an output proportional to $te^{\lambda t}$. If there are $n$ consecutive such stages of driving then the response at the last stage is proportional to $t^n e^{\lambda t}$. In the case when $\lambda$ is purely imaginary ($i\omega$, say) this corresponds to the familiar phenomenon of resonance of response to input of the same frequency $\omega$. The effect of resonance is that output amplitude increases indefinitely with time until other effects (nonlinearity, or slight damping) take over.

2 CONTROL, IMPERFECT OBSERVATION AND STATE STRUCTURE

We saw in Section 2.1 that achievement of 'state structure' for the optimisation of a controlled process implied conditions upon both dynamics and cost function. However, in this chapter we consider dynamics alone, and the controlled analogue of the state-structured dynamic relation (1) would seem to be
$$x_t = a(x_{t-1}, u_{t-1}, t), \tag{10}$$
which is indeed the relation assumed previously. Control can be based only upon what is currently observable, and it may well be that current state is not fully observable. Consider, for example, the task of an anaesthetist who is trying to hold a patient in a condition of light anaesthesia. The patient's body is a dynamical system, and so its 'physiological state' exists in principle, but is far too complex to be specifiable, let alone observable. The anaesthetist must then do as best he can on the basis of relatively crude indicators of state: e.g. appearance, pulse and breathing. In general we shall assume that the new observation available at time $t$ is of the form
$$y_t = c(x_{t-1}, u_{t-1}, t). \tag{11}$$
So, if the new information consisted of several numerical observations, then $y_t$ would be a vector. Note that $y_t$ is regarded as being an observation on immediate past state $x_{t-1}$ rather than on current state $x_t$. This turns out to be the formally natural convention, although it can certainly be modified. It is assumed that the past control $u_{t-1}$ is known; one remembers past actions taken. Relation (11) thus
represents an imperfect observation on $x_{t-1}$, whose nature is perhaps affected both by the value chosen for $u_{t-1}$ and by time. Information is cumulative; all past observations are supposed in principle to be available. In the time-homogeneous linear case $x$, $u$ and $y$ are vectors and relations (10) and (11) reduce to
$$x_t = Ax_{t-1} + Bu_{t-1}, \tag{12}$$
$$y_t = Cx_{t-1}. \tag{13}$$
Formal completeness would demand the inclusion of a term $Du_{t-1}$ in the right-hand member of (13). However, this term is known in value, and can just as well be subtracted out. Control does not affect the nature of the information gained in this linear case. System (12), (13) is often referred to as the system $[A, B, C]$, since it is specified by these three matrices. The dimension of the system is $n$, the dimension of $x$. The linear system is relatively tractable, which explains much of its popularity. However, for all its relative simplicity, the $[A, B, C]$ system generates a theory which as yet shows no signs of completion.

Once a particular control rule has been chosen then one is back in the situation of the last section. Suppose, for example, that current state is in fact observable, and that one chooses a control rule of the form $u_t = Kx_t$. The controlled plant equation for system (12) then becomes
$$x_t = (A + BK)x_{t-1},$$
whose solution will converge to zero if $A + BK$ is a stability matrix.

The continuous-time analogue of relations (10), (11) is
$$\dot{x} = a(x, u, t), \qquad y = c(x, u, t), \tag{14}$$
and of (12), (13)
$$\dot{x} = Ax + Bu, \qquad y = Cx + Du, \tag{15}$$
with $D$ usually normalised to zero. Note that, while the plant equation of (14) or (15) now becomes a first-order differential equation, the observation relation becomes an instantaneous relation, non-differential in form. This turns out to be the natural structure to adopt on the whole, although it can also be natural to recast the observation relation in differential form; see Chapter 25. In continuous time the system (15) with $D$ normalised to zero is also referred to as the system $[A, B, C]$.

One sometimes derives a linear system (15) by the linearisation of a time-homogeneous nonlinear system in the neighbourhood of a stable equilibrium point of the controlled system. Suppose that state and control values fluctuate about constant values $\bar{x}$ and $\bar{u}$, so that $y$ fluctuates about $\bar{y} = c(\bar{x}, \bar{u})$. Defining the transformed variables $\tilde{x} = x - \bar{x}$, $\tilde{u} = u - \bar{u}$ and $\tilde{y} = y - \bar{y}$, we obtain the system (15) in these transformed variables as a linearised version of the system (14) with the identifications
$$A = a_x, \qquad B = a_u, \qquad C = c_x, \qquad D = c_u.$$
Here the derivatives are evaluated at $\bar{x}, \bar{u}$, and must be supposed continuous in a neighbourhood of this point. The approximation remains valid only as long as $x$ and $u$ stay in this neighbourhood, which implies either that $(\bar{x}, \bar{u})$ is a stable equilibrium of the controlled system or that one is considering only a short time span. Subject to these latter considerations, one can linearise even about a time-variable orbit, as we saw in Section 2.9.

Exercises and comments
(1) Non-uniqueness of the state variable. If relations (12), (13) are regarded as just a way of realising a transfer function $C(I - Az)^{-1}Bz^2$ from $u$ to $y$ then this realisation is far from unique. By considering a new state variable $Hx$ (for square nonsingular $H$) one sees that the system $[HAH^{-1}, HB, CH^{-1}]$ realises the same transfer function as does $[A, B, C]$.
(2) A satellite in a planar orbit. Let $(r, \theta)$ be the polar coordinates of a particle of unit mass (the satellite) moving in a plane and gravitationally attracted to the origin (where the centre of mass of the Earth is supposed situated). The Newtonian equations of motion are then
$$\ddot{r} = r\dot{\theta}^2 - r^{-2} + u_r, \qquad \ddot{\theta} = -2\dot{r}\dot{\theta}/r + u_\theta/r,$$
where $-r^{-2}$ represents the gravitational force and $u_r$ and $u_\theta$ are the radial and tangential components of a control force applied to the satellite. A possible equilibrium orbit under zero control forces is the circle of radius $r = \rho$, when the angular velocity must be $\dot{\theta} = \omega = \sqrt{1/\rho^3}$. Suppose that small control forces are applied; define $x$ as the 4-vector whose components are the deviations of $r$, $\dot{r}$, $\theta$ and $\dot{\theta}$ from their values on the circular orbit and $u$ as the 2-vector whose elements are $u_r$ and $u_\theta$. Show that for the linearised version (15) of the dynamic equations one has
$$A = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 3\omega^2 & 0 & 0 & 2\omega\rho \\ 0 & 0 & 0 & 1 \\ 0 & -2\omega/\rho & 0 & 0 \end{bmatrix}, \qquad B = \begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 0 & 0 \\ 0 & 1/\rho \end{bmatrix}.$$
Note that the matrix $A$ has eigenvalues $0$, $0$ and $\pm i\omega$. The zero eigenvalues correspond to the neutral stability of the orbit, which is one of a continuous family of ellipses. The others correspond to the periodic motion of frequency $\omega$.
In deriving the dynamic equations for the following standard examples we appeal to the Lagrangian formalism for Newtonian dynamics. Suppose that the system is described by a vector $q$ of position coordinates $q_j$, that the potential and kinetic energies of the configuration are functions $V(q)$ and $T(q, \dot{q})$ and that an external force $u$ with components $u_j$ is applied. Then the dynamic equations can be written
$$\frac{d}{dt}\frac{\partial T}{\partial \dot{q}_j} - \frac{\partial T}{\partial q_j} + \frac{\partial V}{\partial q_j} = u_j \qquad (j = 1, 2, \ldots).$$
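(3) The cart and the pendulum. This is the celebrated control problem formulated in Section 2.8: the stabilisation of an inverted pendulum mounted on a cart by the exertion of a horizontal control force on the cart. In the notation of that section we have the expressions
$$V = Lmg\cos\alpha, \qquad T = \tfrac{1}{2}M\dot{q}^2 + \tfrac{1}{2}m\left[(\dot{q} + L\dot{\alpha}\cos\alpha)^2 + (L\dot{\alpha}\sin\alpha)^2\right].$$
Show that the equations of motion, linearised for small $\alpha$, are
$$(M + m)\ddot{q} + mL\ddot{\alpha} = u, \qquad \ddot{q} + L\ddot{\alpha} = g\alpha,$$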
and hence derive expressions (2.60) for the matrices $A$ and $B$ of the linearised state equations. Show that the eigenvalues of $A$ are $0$, $0$ and $\pm\sqrt{g(1 + m/M)/L}$. The zero eigenvalues correspond to the 'mode' in which the whole system moves at a constant (and arbitrary) horizontal velocity. The positive eigenvalue of course corresponds to the instability of the upright pendulum.
(4) A popular class of controllers is provided by the PID (proportional, integral, differential) controllers, for which $u$ is a linear function of current values of tracking error, its time-integral and its rate of change. Consider the equation for the controlled pendulum, linearised about the hanging position: $\ddot{\alpha} + \omega^2\alpha = u$. Suppose one wishes to stabilise this to rest, so that $\alpha$ is itself the error. Note that a purely proportional control will never stabilise it. The LQ-optimal control of Section 2.8 would be linear in $\alpha$ and $\dot{\alpha}$, and so would be of the PD form. LQ-optimisation will produce something like an integral term in the control only if there is observation error; see Chapter 12.

3 THE CAYLEY-HAMILTON THEOREM
A deeper study of the system $[A, B, C]$ takes one into the byways of linear algebra. We shall manage with a knowledge of the standard elementary properties of matrices. However, there is one result which should be formalised.

Theorem 5.3.1 Let $A$ be an $n \times n$ matrix. Then the first $n$ powers $I, A, A^2, \ldots, A^{n-1}$ constitute a basis for all the powers $A^r$ of $A$, in that scalar coefficients $c_{rj}$ exist such that
$$A^r = \sum_{j=0}^{n-1} c_{rj}A^j \qquad (r = 0, 1, 2, \ldots). \tag{16}$$
It is important that the coefficients are scalar, so that each element of $A^r$ has the same representation in terms of the corresponding elements of $I, A, \ldots, A^{n-1}$.
Proof Define the generating function
$$\Phi(z) = \sum_{j=0}^{\infty} (Az)^j = (I - Az)^{-1},$$
where $z$ is a scalar; this series will be convergent if $|z|$ is smaller than the reciprocal of the largest eigenvalue modulus of $A$. Writing the inverse as the adjugate divided by the determinant we have then
$$|I - Az|\,\Phi(z) = \mathrm{adj}(I - Az). \tag{17}$$
Now $|I - Az|$ is a polynomial with scalar coefficients $a_j$:
$$|I - Az| = \sum_{j=0}^{n} a_j z^j,$$
say, and the elements of $\mathrm{adj}(I - Az)$ are polynomials in $z$ of degree less than $n$. Evaluating the coefficient of $z^r$ on both sides of (17) we thus deduce that
$$\sum_{j=0}^{n} a_j A^{r-j} = 0 \qquad (r \geq n). \tag{18}$$
Relation (18) constitutes a recursion for the powers of $A$ with scalar coefficients $a_j$. It can be solved for $r \geq n$ in the form (16). □

The Cayley-Hamilton theorem asserts simply relation (18) for $r = n$, but this has the extended relation (18) and Theorem 5.3.1 as immediate consequences. It is sometimes expressed verbally as 'a square matrix obeys its own characteristic equation', the characteristic equation being the equation $\sum_{j=0}^{n} a_j \lambda^{n-j} = 0$ for the eigenvalues $\lambda$.
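The verbal form of the theorem can be checked on a small example. The sketch below is an illustration only: it computes the characteristic polynomial by the Faddeev-LeVerrier recursion (a technique swapped in here for convenience, not one used in the text) and then evaluates that polynomial at the matrix itself in exact rational arithmetic. The function names and the 3 x 3 test matrix are assumptions. The coefficients `c[k]` of $\det(\lambda I - A)$ relate to the text's $a_j$ by $a_j = c_{n-j}$.

```python
from fractions import Fraction

def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def char_poly(A):
    """Coefficients c[0..n] with det(lambda*I - A) = sum_k c[k]*lambda**k."""
    n = len(A)
    c = [Fraction(0)] * (n + 1)
    c[n] = Fraction(1)
    M = [[Fraction(0)] * n for _ in range(n)]
    for k in range(1, n + 1):
        AM = matmul(A, M)
        # M_k = A M_{k-1} + c_{n-k+1} I
        M = [[AM[i][j] + (c[n - k + 1] if i == j else 0) for j in range(n)]
             for i in range(n)]
        AMk = matmul(A, M)
        # c_{n-k} = -trace(A M_k) / k
        c[n - k] = -sum(AMk[i][i] for i in range(n)) / k
    return c

def poly_at_matrix(c, A):
    """Evaluate sum_k c[k] * A**k by Horner's rule."""
    n = len(A)
    P = [[Fraction(0)] * n for _ in range(n)]
    for ck in reversed(c):
        PA = matmul(P, A)
        P = [[PA[i][j] + (ck if i == j else 0) for j in range(n)]
             for i in range(n)]
    return P

# Illustrative matrix, not taken from the text.
A = [[Fraction(v) for v in row] for row in [[2, 1, 0], [0, 2, 1], [1, 0, 3]]]
c = char_poly(A)
residual = poly_at_matrix(c, A)   # should be the zero matrix
print(c, residual)
```

Because the residual vanishes, $A^n$ is a scalar-coefficient combination of lower powers, which is relation (18) at $r = n$.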
Exercises and comments
(1) Define the $n$th degree polynomial $P(z) = |I - Az| = \sum_{j=0}^{n} a_j z^j$. If we have a discrete-time system $x_t = Ax_{t-1}$ then Theorem 5.3.1 implies that any scalar linear function $\xi_t = c^T x_t$ of the state variable satisfies the equation $P(\mathcal{T})\xi_t = 0$. That is, a first-order $n$-vector system has been reduced to an $n$th-order scalar system.
One can reverse this manipulation. Suppose that one has a model for which the process variable $\xi$ is a scalar obeying the plant equation
$$P(\mathcal{T})\xi_t = bu_{t-1} \tag{$*$}$$
with $a_0 = 1$. Show that the column vector $x_t$ with elements $(\xi_t, \xi_{t-1}, \ldots, \xi_{t-n+1})$ is a state variable with plant equation $x_t = Ax_{t-1} + Bu_{t-1}$, where
$$A = \begin{bmatrix} -a_1 & -a_2 & \cdots & -a_{n-1} & -a_n \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix}, \qquad B = \begin{bmatrix} b \\ 0 \\ \vdots \\ 0 \end{bmatrix}.$$
The matrix $A$ is often termed the companion matrix of the $n$th-order system ($*$).
(2) Consider the continuous-time analogue of Exercise 1. If $x$ obeys $\dot{x} = Ax$ then it follows as above that the scalar $\xi$ obeys the $n$th-order differential equation $P(\mathcal{D})\xi = 0$. Reverse this argument to obtain a companion form (i.e. state-reduced form) of the equation $P(\mathcal{D})\xi = bu$. Note that this equation must be regarded as expressing the highest-order differential of $\xi$ in terms of lower-order differentials, whereas the discrete-time relation ($*$) expresses the least-lagged variable $\xi_t$ in terms of lagged variables.
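The passage from the scalar recursion ($*$) to the companion form can be sketched as follows. This is an illustrative check only, with hypothetical helper names `companion` and `step` and made-up coefficients; it builds $A$ and $B$ as described in Exercise 1 and confirms that the first state component reproduces the scalar path.

```python
def companion(a, b):
    """Companion-form (A, B) for sum_j a[j]*xi_{t-j} = b*u_{t-1}, with a[0] == 1."""
    n = len(a) - 1
    A = [[-a[j + 1] for j in range(n)]]          # first row: -a_1 ... -a_n
    for i in range(1, n):                        # shifted identity below it
        A.append([1.0 if j == i - 1 else 0.0 for j in range(n)])
    B = [b] + [0.0] * (n - 1)
    return A, B

def step(A, B, x, u):
    """One step of x_t = A x_{t-1} + B u_{t-1} for scalar u."""
    n = len(x)
    return [sum(A[i][j] * x[j] for j in range(n)) + B[i] * u for i in range(n)]

a = [1.0, -0.5, 0.06]                 # illustrative coefficients a_0, a_1, a_2
b = 2.0
inputs = [1.0, 0.0, -1.0, 0.5, 0.0, 0.0]

# Scalar recursion xi_t = -a_1 xi_{t-1} - a_2 xi_{t-2} + b u_{t-1}, zero start.
past = [0.0, 0.0]                     # (xi_{t-1}, xi_{t-2})
scalar_path = []
for u in inputs:
    xi = -a[1] * past[0] - a[2] * past[1] + b * u
    scalar_path.append(xi)
    past = [xi, past[0]]

# The same path read off the first component of the companion-form state.
A, B = companion(a, b)
x = [0.0, 0.0]
state_path = []
for u in inputs:
    x = step(A, B, x, u)
    state_path.append(x[0])

print(scalar_path)
print(state_path)
```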
4 CONTROLLABILITY (DISCRETE TIME)

The twin concepts of controllability and observability concern the respective questions of whether control bites deeply enough that one can bring the state variable to a specified value and whether observations are revealing enough that one can indeed determine the value of the state variable from them. We shall consider these concepts only in the case of the time-homogeneous linear system (12), (13), when they must mirror properties of the three matrices $A$, $B$ and $C$.

The system is termed $r$-controllable if, given an arbitrary value of $x_0$, one can choose control values $u_0, u_1, \ldots, u_{r-1}$ such that $x_r$ has an arbitrary prescribed value. For example, if $m = n$ and $B$ is nonsingular then the system is 1-controllable, because one can move from any value of $x_0$ to any value of $x_1$ by choosing $u_0 = B^{-1}(x_1 - Ax_0)$. As a second example, consider the system
$$x_t = \begin{bmatrix} a_{11} & 0 \\ a_{21} & a_{22} \end{bmatrix} x_{t-1} + \begin{bmatrix} 1 \\ 0 \end{bmatrix} u_{t-1},$$
for which $m = 1$, $n = 2$. It is never 1-controllable, because $u$ cannot affect the second component of $x$ in a single step. It is uncontrollable if $a_{21} = 0$, because $u$ can then never affect this second component. It is 2-controllable if $a_{21} \neq 0$, because
$$x_2 - A^2 x_0 = Bu_1 + ABu_0 = \begin{bmatrix} 1 & a_{11} \\ 0 & a_{21} \end{bmatrix}\begin{bmatrix} u_1 \\ u_0 \end{bmatrix},$$
and this has a solution for $u_0$, $u_1$ if $a_{21} \neq 0$. This argument generalises.

Theorem 5.4.1 The $n$-dimensional system $[A, B, \cdot\,]$ is $r$-controllable if and only if the matrix
$$T_r^{\text{con}} = [B, AB, A^2B, \ldots, A^{r-1}B] \tag{19}$$
has rank $n$.
We write the system as $[A, B, \cdot\,]$ rather than as $[A, B, C]$ since the matrix $C$ evidently has no bearing on the question of controllability. The matrix (19) is written in a partitioned form; it has $n$ rows and $mr$ columns. The notation $T_r^{\text{con}}$ is clumsy, but short-lived and motivated by the limitations of the alphabet; read it as 'test matrix of size $r$ for the control context'.

Proof If we solve the plant equation (12) for $x_r$ in terms of the initial value $x_0$ and subsequent control values we obtain the relation
$$x_r - A^r x_0 = Bu_{r-1} + ABu_{r-2} + \cdots + A^{r-1}Bu_0. \tag{20}$$
The question is whether this set of linear equations in $u_0, u_1, \ldots, u_{r-1}$ has a solution, whatever the value of the $n$-vector $x_r - A^r x_0$. Such a solution will always exist if and only if the coefficient matrix of the $u$-variables has rank $n$. This matrix is just $T_r^{\text{con}}$, whence the theorem follows. □

If equation (20) has a solution at all then in general it has many. We shall find a way of determining 'good' solutions in Theorem 5.4.3. Meanwhile, the Cayley-Hamilton theorem has an important consequence.

Theorem 5.4.2 If a system of dimension $n$ is $r$-controllable, then it is also $s$-controllable for $s \geq \min(n, r)$.

Proof The rank of $T_r^{\text{con}}$ is non-decreasing in $r$, so $r$-controllability certainly implies $s$-controllability for $s \geq r$. However, Theorem 5.3.1 implies that the rank of $T_r^{\text{con}}$ is constant for $r \geq n$, because it implies that the columns of $T_r^{\text{con}}$ are then linear combinations (with scalar coefficients) of the columns of $T_n^{\text{con}}$. The system is thus $n$-controllable if it is $r$-controllable for any $r$, and we deduce the complete assertion. □
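The rank test of Theorem 5.4.1 is easy to try on the two-dimensional example above. The following sketch assumes exact rational arithmetic and hypothetical helper names (`rank`, `control_test_matrix`, `is_r_controllable`); the numerical entries of $A$ are illustrative choices for $a_{11}$, $a_{21}$, $a_{22}$.

```python
from fractions import Fraction

def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def rank(M):
    """Rank by Gauss-Jordan elimination in exact rational arithmetic."""
    M = [[Fraction(v) for v in row] for row in M]
    rows, cols = len(M), len(M[0])
    r = 0
    for c in range(cols):
        pivot = next((i for i in range(r, rows) if M[i][c] != 0), None)
        if pivot is None:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(rows):
            if i != r and M[i][c] != 0:
                f = M[i][c] / M[r][c]
                M[i] = [x - f * y for x, y in zip(M[i], M[r])]
        r += 1
        if r == rows:
            break
    return r

def control_test_matrix(A, B, r):
    """T_r^con = [B, AB, ..., A^(r-1) B] as an n x (m*r) matrix."""
    n = len(A)
    block = [row[:] for row in B]
    T = [[] for _ in range(n)]
    for _ in range(r):
        for i in range(n):
            T[i].extend(block[i])
        block = matmul(A, block)
    return T

def is_r_controllable(A, B, r):
    return rank(control_test_matrix(A, B, r)) == len(A)

B = [[1], [0]]
A_coupled = [[2, 0], [3, 1]]      # a21 = 3, nonzero
A_decoupled = [[2, 0], [0, 1]]    # a21 = 0

print(is_r_controllable(A_coupled, B, 1))     # one step is never enough here
print(is_r_controllable(A_coupled, B, 2))
print(is_r_controllable(A_decoupled, B, 5))   # extra steps do not help
```

In line with Theorem 5.4.2, increasing $r$ beyond $n = 2$ cannot raise the rank of the test matrix.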
If the system is $n$-controllable then it is simply termed controllable. This is a reasonable convention, since Theorem 5.4.2 then implies that the system is controllable if it is $r$-controllable for any $r$, and that it is $r$-controllable for $r \geq n$ if it is controllable.

One should distinguish between controllability, which implies that one can bring the state vector to a prescribed value in at most $n$ steps, and the weaker property of stabilisability, which requires only that a matrix $K$ can be found such that $A + BK$ is a stability matrix, and so that the policy $u_t = Kx_t$ will stabilise the equilibrium at $x = 0$. It will be proved incidentally in Section 6.1 that, if the process can be stabilised in any way, then it can be stabilised in this way; also that controllability implies stabilisability. That stabilisability does not imply controllability follows from the case in which $A$ is a stability matrix and $B$ is zero. This system is stable, and so stabilisable, but uncontrolled, and so not controllable. Note, however, that stabilisability does not imply the existence of a control which stabilises the process to an arbitrary prescribed equilibrium point; see Sections 6.2 and 6.6.

Finally, the notion of finding a $u$-solution to the equation system (20) can be made more definite if we require that the transfer from $x_0$ to $x_r$ be achieved optimally, in some sense.

Theorem 5.4.3
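(i) $r$-controllability is equivalent to the demand that the matrix
$$G_r^{\text{con}} = \sum_{j=0}^{r-1} A^j B Q^{-1} B^T (A^T)^j \tag{21}$$
should be positive definite. Here $Q$ is a prescribed positive definite matrix.
(ii) If the process is $r$-controllable then the control which achieves the passage from prescribed $x_0$ to prescribed $x_r$ with minimal control cost $\tfrac{1}{2}\sum_{\tau=0}^{r-1} u_\tau^T Q u_\tau$ is
$$u_\tau = Q^{-1}B^T (A^T)^{r-\tau-1} (G_r^{\text{con}})^{-1} (x_r - A^r x_0) \qquad (0 \leq \tau < r). \tag{22}$$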
Proof Let us take assertion (ii) first. We seek to minimise the control cost subject to the constraint (20) on the controls (if indeed this constraint can be satisfied, i.e. if controls exist which will effect the transfer). Free minimisation of the Lagrangian form
$$\tfrac{1}{2}\sum_{\tau=0}^{r-1} u_\tau^T Q u_\tau + \lambda^T\Big(x_r - A^r x_0 - \sum_{\tau=0}^{r-1} A^{r-\tau-1}Bu_\tau\Big)$$
yields the control evaluation
$$u_\tau = Q^{-1}B^T (A^T)^{r-\tau-1}\lambda$$
in terms of the Lagrange multiplier $\lambda$. Evaluation of $\lambda$ by substitution of this solution back into the constraint (20) yields the asserted control (22).
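A minimal sketch of the control (22) follows, restricted to the case $Q = I$ and written in exact rational arithmetic. The helper names (`solve`, `min_energy_controls`, `simulate`) and the numerical system are assumptions made for illustration; the code forms the Gramian (21), solves for the multiplier $\lambda$, and then checks by simulation of (12) that the prescribed terminal state is reached.

```python
from fractions import Fraction

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

def matvec(X, v):
    return [sum(a * b for a, b in zip(row, v)) for row in X]

def solve(G, y):
    """Solve G z = y by Gauss-Jordan elimination; G must be nonsingular."""
    n = len(G)
    M = [[Fraction(v) for v in G[i]] + [Fraction(y[i])] for i in range(n)]
    for c in range(n):
        p = next(i for i in range(c, n) if M[i][c] != 0)
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for i in range(n):
            if i != c:
                f = M[i][c]
                M[i] = [a - f * b for a, b in zip(M[i], M[c])]
    return [M[i][n] for i in range(n)]

def min_energy_controls(A, B, x0, target, r):
    """Controls u_0..u_{r-1} of the form (22), taking Q = I."""
    n = len(A)
    powers = [B]                              # powers[j] = A^j B
    for _ in range(r - 1):
        powers.append(matmul(A, powers[-1]))
    G = [[Fraction(0)] * n for _ in range(n)]  # Gramian (21) with Q = I
    for P in powers:
        PPt = matmul(P, transpose(P))
        G = [[G[i][j] + PPt[i][j] for j in range(n)] for i in range(n)]
    Arx0 = list(x0)
    for _ in range(r):
        Arx0 = matvec(A, Arx0)
    lam = solve(G, [t - a for t, a in zip(target, Arx0)])
    # u_tau = B^T (A^T)^(r-tau-1) lam = (A^(r-tau-1) B)^T lam
    return [matvec(transpose(powers[r - tau - 1]), lam) for tau in range(r)]

def simulate(A, B, x0, controls):
    """Run x_t = A x_{t-1} + B u_{t-1} and return the final state."""
    x = list(x0)
    for u in controls:
        x = [p + q for p, q in zip(matvec(A, x), matvec(B, u))]
    return x

A = [[1, 0], [1, 1]]          # illustrative system with a21 = 1
B = [[1], [0]]
x0, target, r = [1, 2], [0, 0], 3
controls = min_energy_controls(A, B, x0, target, r)
final_state = simulate(A, B, x0, controls)
print(controls, final_state)
```

If the Gramian were singular, `solve` would fail to find a pivot, which mirrors the role of nonsingularity in the argument that follows.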
However, we see that (22) provides a control rule which is acceptable for general $x_0$ and $x_r$ if and only if the matrix $G_r^{\text{con}}$ is nonsingular. The requirement of nonsingularity of $G_r^{\text{con}}$ must then be equivalent to that of controllability, and so to the rank condition of Theorem 5.4.1. This is a sufficient proof of assertion (i), but we can give an explicit argument. Suppose $G_r^{\text{con}}$ singular. Then there exists a nonzero $n$-vector $w$ such that $w^T G_r^{\text{con}} = 0$, and so $w^T G_r^{\text{con}} w = 0$, or
$$\sum_{j=0}^{r-1} (w^T A^j B)\,Q^{-1}\,(w^T A^j B)^T = 0.$$
But the terms in this sum are individually non-negative, so the sum can be zero only if the terms are individually zero, which implies in turn that $w^T A^j B = 0$ $(j = 0, 1, \ldots, r-1)$. That is, $w^T T_r^{\text{con}} = 0$, so that $T_r^{\text{con}}$ is of less than full row-rank, $n$. This chain of implications is easily reversed, demonstrating the equivalence of the two conditions: $T_r^{\text{con}}$ is of rank $n$ if and only if $G_r^{\text{con}}$ is, i.e. if and only if the non-negative definite matrix $G_r^{\text{con}}$ is in fact positive definite. □

The matrix $G_r^{\text{con}}$ is known as the control Gramian. At least, this is the name given in the particular case $Q = I$ and $r = n$. As the proof will have made clear, the choice of $Q$ does not affect the definiteness properties of the Gramian, as long as $Q$ is itself positive definite. Consideration of general $Q$ has the advantage that we relate the controllability problem back to the optimal regulation problem of Section 2.4.

We shall give some continuous-time examples in the exercises of the next section, for some of which the reader will see obvious discrete-time analogues.

5 CONTROLLABILITY (CONTINUOUS TIME)

Controllability considerations in continuous time are closely analogous to those in discrete time, but there are also special features. The system is controllable if, for a given $t > 0$, one can find a control $\{u(\tau);\ 0 \leq \tau < t\}$ which takes the state value from an arbitrary prescribed initial value $x(0)$ to an arbitrary prescribed terminal value $x(t)$. The value of $t$ is immaterial to the extent that, if the system is controllable for one value of $t$, then it is controllable for any other. However, the smaller the value of $t$, and so the shorter the interval of time in which the transfer must be completed, the more vigorous must be the control actions. Indeed, in the limit $t \downarrow 0$ infinite values of $u$ will generally be required, corresponding to the application of impulses or differentials of impulses.
This makes clear that the concept of $r$-controllability does not carry over to continuous time, and also that some thought must be given to the class of controls regarded as admissible.
It follows from the Cayley-Hamilton theorem and the relation
$$e^{At} = \sum_{j=0}^{\infty} (At)^j / j! \tag{23}$$
implicit in (7) that the matrices $I, A, A^2, \ldots, A^{n-1}$ constitute a basis also for the family of matrices $\{e^{At};\ t \geq 0\}$. Here $n$ is, as ever, the dimension of the system. We shall define $T_r^{\text{con}}$ again as in (19), despite the somewhat different understanding of the matrix $A$ in the continuous-time case.
Theorem 5.5.1 (i) The $n$-dimensional system $[A, B, \cdot\,]$ is controllable if and only if the matrix $T_n^{\text{con}}$ has rank $n$.
(ii) This condition is equivalent to the condition that the control Gramian
$$G(t)^{\text{con}} = \int_0^t e^{A\tau} B Q^{-1} B^T e^{A^T\tau}\, d\tau \tag{24}$$
should be positive definite (for prescribed positive $t$ and positive definite $Q$).
(iii) If the system is controllable, then the control which achieves the passage from prescribed $x(0)$ to prescribed $x(t)$ with minimal control cost $\tfrac{1}{2}\int_0^t u^T Q u\, d\tau$ is
$$u(\tau) = Q^{-1}B^T e^{A^T(t-\tau)}\,[G(t)^{\text{con}}]^{-1}\,[x(t) - e^{At}x(0)]. \tag{25}$$

Proof If transfer from prescribed $x(0)$ to prescribed $x(t)$ is possible then controls $\{u(\tau);\ 0 \leq \tau < t\}$ exist which satisfy the equation
$$x(t) - e^{At}x(0) = \int_0^t e^{A(t-\tau)}Bu(\tau)\, d\tau, \tag{26}$$
analogous to (20). There must then be a control in this class which minimises the control cost defined in the theorem; we find this to be (25) by exactly the methods used to derive (22). This solution is acceptable if and only if the control Gramian $G(t)^{\text{con}}$ is nonsingular (i.e. positive definite); this is consequently the necessary and sufficient condition for controllability. As in the proof of Theorem 5.4.3 (i): if $G(t)^{\text{con}}$ were singular then there would be a nonzero $n$-vector $w$ for which $w^T G(t)^{\text{con}} = 0$, with the successive implications that $w^T e^{At}B = 0$ $(t \geq 0)$, $w^T A^j B = 0$ $(j = 0, 1, 2, \ldots)$ and $w^T T_n^{\text{con}} = 0$. The reverse implications also hold. Thus, an alternative necessary and sufficient condition for controllability is that $T_n^{\text{con}}$ should have full rank. □
While the Gramians $G(t)^{\text{con}}$ for varying $t > 0$ and $Q > 0$ are either all singular or all nonsingular, it is evident that $G(t)^{\text{con}}$ approaches the zero matrix as $t$ approaches zero, and that the control (25) will then become infinite.
Exercises and comments
(1) Consider the satellite example of Exercise 2.2 in its linearised form. Show that the system is controllable. Show that it is indeed controllable under tangential thrust alone, but not under radial thrust alone.
(2) Consider the two-variable system $\dot{x}_j = \lambda_j x_j + u$ $(j = 1, 2)$. One might regard this as a situation in which one has two rooms, room $j$ having temperature $x_j$ and losing temperature at a rate $-\lambda_j x_j$, and heat being supplied (or extracted) exogenously at a common rate $u$. Show that the system is controllable if and only if $\lambda_1 \neq \lambda_2$. Indeed, if $\lambda_1 = \lambda_2 < 0$ then the temperature difference between the two rooms will converge to zero in time, however $u$ is varied.
(3) The situation of Exercise 2 can be generalised. Suppose, as in Exercise 1.5, that the matrix $A$ is diagonalisable to the form $H^{-1}\Lambda H$. With the change of state variable to the set of modal variables $\tilde{x} = Hx$ considered there the dynamic equations become
$$\dot{\tilde{x}}_j = \lambda_j \tilde{x}_j + \sum_k b_{jk}u_k,$$
where $b_{jk}$ is the $jk$th element of the matrix $HB$. Suppose all the $\lambda_j$ distinct; it is a fact that the square matrix with $jk$th element $\lambda_j^{k-1}$ is then nonsingular. Use this fact to show that controllability, equivalent to the fact that $T_n^{\text{con}}$ has rank $n$, is equivalent to the fact that $HB$ should have no rows which are zero. In other words, the system is controllable if and only if there is some control input to any mode. This assertion does not, however, imply a failure of controllability if there are repeated eigenvalues.

6 OBSERVABILITY

The notion of controllability rested on the assumption that the initial value of state was known. If, however, one must rely upon imperfect observations, then it is a question whether the value of state (either in the past or in the present) can be determined from these observations. The discrete-time system $[A, B, C]$ is said to be $r$-observable if the value of $x_0$ can be inferred from knowledge of the subsequent observations $y_1, y_2, \ldots, y_r$ and subsequent relevant control values $u_0, u_1, \ldots, u_{r-2}$. Note that, if $x_0$ can be thus determined, then $x_t$ is also in principle simultaneously determinable for all $t$ for which one knows the control history.

The notion of observability stands in a dual relation to that of controllability; a duality which indeed persists right throughout the subject. We have the determination
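$$y_\tau = CA^{\tau-1}x_0 + \sum_{j=0}^{\tau-2} CA^{\tau-j-2}Bu_j$$
of $y_\tau$ in terms of $x_0$ and subsequent controls. Thus, if we define the reduced observation
$$\tilde{y}_\tau = y_\tau - \sum_{j=0}^{\tau-2} CA^{\tau-j-2}Bu_j$$
then $x_0$ is to be determined from the system of equations
$$\tilde{y}_\tau = CA^{\tau-1}x_0 \qquad (0 < \tau \leq r). \tag{27}$$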
These equations are mutually consistent, by hypothesis, and so have a solution. The question is whether this solution is unique. This is the reverse of the situation for controllability, when the question was whether equation (20) for the $u$-values had a solution at all, unique or not. Note an implication of the system (27): that the property of observability depends only upon the matrices $A$ and $C$; not at all upon $B$. We define the matrix
$$T_r^{\text{obs}} = \begin{bmatrix} C \\ CA \\ CA^2 \\ \vdots \\ CA^{r-1} \end{bmatrix}, \tag{28}$$
the test matrix of size $r$ for the observability context.
Theorem 5.6.1 (i) The $n$-dimensional system $[A, \cdot\,, C]$ is $r$-observable if and only if the matrix $T_r^{\text{obs}}$ has rank $n$.
(ii) Equivalently, the system is $r$-observable if and only if the matrix
$$G_r^{\text{obs}} = \sum_{\tau=1}^{r} (CA^{\tau-1})^T M^{-1} CA^{\tau-1} \tag{29}$$
is positive definite, for prescribed positive definite $M$.
(iii) If the system is $r$-observable then the determination of $x_0$ can be expressed
$$x_0 = (G_r^{\text{obs}})^{-1} \sum_{\tau=1}^{r} (CA^{\tau-1})^T M^{-1} \tilde{y}_\tau. \tag{30}$$
(iv) If the system is $r$-observable then it is $s$-observable for $s \geq \min(n, r)$.

Proof If system (27) has a solution for $x_0$ (which is so by hypothesis) then this solution is unique if and only if the coefficient matrix $T_r^{\text{obs}}$ of the system has rank $n$, implying assertion (i). Assertion (iv) follows by appeal to the Cayley-Hamilton theorem, as in Theorem 5.4.2. If we define the deviation $\eta_\tau = \tilde{y}_\tau - CA^{\tau-1}x_0$ then equations (27) amount to $\eta_\tau = 0$ $(0 < \tau \leq r)$. If these equations were not consistent we could still define a
'least-square' solution to them by minimising any positive-definite quadratic form in these deviations with respect to $x_0$. In particular, we could minimise $\sum_{\tau=1}^{r} \eta_\tau^T M^{-1} \eta_\tau$. This minimisation yields expression (30). If equations (27) indeed have a solution (i.e. are mutually consistent, as we suppose) and this is unique then expression (30) must equal this solution: the actual value of $x_0$. The criterion for uniqueness of the least-square solution is that $G_r^{\text{obs}}$ should be nonsingular, which is exactly condition (ii). As in Theorem 5.4.3, equivalence of conditions (i) and (ii) can be verified directly, if desired. □

Note that we have again found it helpful to bring in an optimisation criterion. This time it was a question, not of finding a 'least cost' solution when many solutions are known to exist, but of finding a 'best fit' solution when no exact solution may exist. This approach lies close to the statistical approach necessary when observations are corrupted by noise; see Chapter 12. Matrix (29) is the observation Gramian.

The continuous-time version of these results will now be apparent, with a proof which bears the same relation to that of Theorem 5.6.1 as that of Theorem 5.5.1 does to the material of Section 4.

Theorem 5.6.2 (i) The $n$-dimensional continuous-time system $[A, \cdot\,, C]$ is observable if and only if the matrix $T_n^{\text{obs}}$ defined by (28) has rank $n$.
(ii) This condition is equivalent to the condition that the observation Gramian
$$G(t)^{\text{obs}} = \int_0^t (Ce^{A\tau})^T M^{-1} Ce^{A\tau}\, d\tau \tag{31}$$
should be positive definite (for prescribed positive $t$ and positive definite $M$).
(iii) If the system is observable then the determination of $x(0)$ can be written
$$x(0) = [G(t)^{\text{obs}}]^{-1} \int_0^t (Ce^{A\tau})^T M^{-1} \tilde{y}(\tau)\, d\tau,$$
where
$$\tilde{y}(t) = y(t) - \int_0^t Ce^{A(t-\tau)}Bu(\tau)\, d\tau.$$
A way of generating real-time estimates of current state is to drive a model of the plant by the apparent discrepancy in observation. For the continuous-time model (15) this would amount to generating an estimate $\hat x(t)$ of $x(t)$ as a solution of the equation
$$\dot{\hat x} = A\hat x + Bu + H(y - C\hat x), \tag{32}$$
where the matrix $H$ is to be chosen suitably. One can regard this as the realisation of a filter whose output is the estimate $\hat x$ generated from the known inputs $u$ and
$y$. Such a relation is spoken of as an observer; it is unaffected in its performance by the control policy. We shall see in Chapter 12 that the optimal estimating relation in the statistical context, the Kalman filter, is exactly of this form. Denote the estimation error $\hat x(t) - x(t)$ by $\Delta(t)$. By subtracting the plant equation from relation (32) and setting $y = Cx$ we see that
$$\dot\Delta = (A - HC)\Delta.$$
Estimation will thus be successful (in that the estimation error will tend to zero with time) if $A - HC$ is a stability matrix. If it is possible to find a matrix $H$ such that this is so then the system $[A, \cdot, C]$ is said to be detectable; a property corresponding to the control property of stabilisability.
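The error dynamics can be illustrated numerically. The following is a minimal sketch, not taken from the text: it assumes a double-integrator plant with $A = [[0,1],[0,0]]$ and position-only observation $C = [1, 0]$, so that $A - HC$ has characteristic polynomial $s^2 + h_1 s + h_2$. The function names and the Euler step size are illustrative choices.

```python
def error_derivative(delta, h):
    """Right-hand side of d(delta)/dt = (A - H C) delta
    for the assumed A = [[0, 1], [0, 0]], C = [1, 0], H = (h1, h2)."""
    d1, d2 = delta
    h1, h2 = h
    return (-h1 * d1 + d2, -h2 * d1)


def simulate_error(delta0, h, t_end=10.0, dt=1e-3):
    """Integrate the estimation error by forward Euler steps."""
    delta = tuple(delta0)
    for _ in range(int(round(t_end / dt))):
        dd = error_derivative(delta, h)
        delta = (delta[0] + dt * dd[0], delta[1] + dt * dd[1])
    return delta


def is_stability_matrix(h):
    """For this 2x2 case, s^2 + h1 s + h2 has both zeros in the open
    left half-plane exactly when h1 > 0 and h2 > 0."""
    return h[0] > 0 and h[1] > 0


if __name__ == "__main__":
    print(simulate_error((1.0, -1.0), (3.0, 2.0)))  # error decays
    print(simulate_error((1.0, -1.0), (0.0, 0.0)))  # no correction: error drifts
```

With $H = (3, 2)$ the matrix $A - HC$ has eigenvalues $-1$ and $-2$, so the error dies away; with $H = 0$ the model runs open-loop and the error in position grows linearly.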
Exercises and comments

(1) Consider the linearised satellite model of Exercise 2.2. Show that state $x$ is observable from angle measurements alone (i.e. from observation of $\theta$ and $\dot\theta$) but not from radial measurements alone.

(2) The scalar variables $x_j$ $(j = 1, 2, \ldots, n)$ of a metabolic system obey the equations
$$\dot x_1 = 2(1 + x_n)^{-1} - x_1 + u, \qquad \dot x_j = x_{j-1} - x_j \quad (j = 2, 3, \ldots, n).$$
Show that in the absence of the control $u$ there is a unique equilibrium point in the positive orthant. Consider the controlled system linearised about this equilibrium point. Show that it is controllable, and that it is observable from measurements of $x_1$ alone.
Notes

We have covered the material which is of immediate relevance for our purposes, but this is only a small part of the very extensive theory which exists, even (and especially) for the time-homogeneous linear case. One classical piece of work is the Routh-Hurwitz criterion for stability, which states in verifiable form the necessary and sufficient conditions that the characteristic polynomial $|\lambda I - A| = 0$ should have all its zeros strictly in the left half-plane. Modern work has been particularly concerned with the synthesis or realisation problem: can one find a system $[A, B, C]$ which realises a given transfer function $G$? If one can find such a realisation, of finite dimension, then it is of course not unique (see Exercise 2.1). However, the main consideration is to achieve a realisation which is minimal in that it is of minimal dimension. One has the important and beautiful theorem: a system $[A, B, C]$ realising $G$ is minimal if and only if it is both controllable and observable. (See, for example, Brockett (1970) p. 94.)
However, when we resume optimisation, the relevant parts of this further theory are in a sense generated automatically and in the operational form dictated by the goal. So, existence theorems are replaced by explicit solutions (as Theorem 5.4.3 gave greater definiteness to Theorem 5.4.1), the family of 'good' solutions is generated by the optimal solution as the cost function is varied, and the conditions for validity of the optimal solution provide the minimal and natural conditions for existence or realisability.
CHAPTER 6

Stationary Rules and Direct Optimisation for the LQ Model

The LQ model introduced in Sections 2.4, 2.8 and 2.9 has aspects which go far beyond what was indicated there, and a theory which is more elegant than the reader might have concluded from a first impression. In Section 1 we deal with a central issue: proof of the existence of infinite-horizon limits for the LQ regulation problem under appropriate hypotheses. The consequences of this for the LQ tracking problem are considered in Section 2.

However, in Section 3 we move to deduction of an optimal policy, not by dynamic programming, but by direct optimisation of the trajectory by Lagrangian methods. This yields a treatment of the tracking problem which is much more elegant and insightful than that given earlier, at least in the stationary case. The approach is one which anticipates the maximum principle of the next chapter and provides a natural application of the transform methods of Chapter 4. As we see in Sections 5 and 6, it generalises with remarkable simplicity; we continue this line with the development of time-integral methods in Chapters 18-21. The material of Sections 3, 5, 6 and 7 must be regarded, not as a systematic exposition, but as a first sketch of an important pattern whose details will be progressively completed.

1 INFINITE-HORIZON LIMITS FOR THE LQ REGULATION PROBLEM

One hopes that, if the horizon is allowed to become infinite, then the control problem will simplify in that it becomes time-invariant, i.e. such that a time-shift leaves the problem unchanged. One hopes in particular that the optimal policy will become time-invariant in form, when it is referred to as stationary. The stationary case is the natural one in a high proportion of control contexts, where one has a system which, for practical purposes, operates indefinitely under constant conditions.

The existence of infinite-horizon limits has to be established by different arguments in different cases, and will certainly demand conditions of some kind: time homogeneity plus both the ability and the incentive to control. In this section we shall study an important case, the LQ regulation problem of Section 2.4. In this case the value function $F(x, t)$ has the form $\tfrac12 x^T \Pi_t x$ where $\Pi$ obeys the Riccati equation (2.25). It is convenient to write $F(x, t)$ rather as $F_s(x) = \tfrac12 x^T \Pi_{(s)} x$ where $s = h - t$ is the 'time to go'. The matrix $\Pi_{(0)}$ is then that
associated with the terminal cost function. The passage to an infinite horizon is then just the passage $s \to +\infty$, and infinite-horizon limits will exist if $\Pi_{(s)}$ has a limit value $\Pi$ which is independent of $\Pi_{(0)}$ for the class of terminal cost functions one is likely to consider. In this case the matrix $K_t = K_{(s)}$ of the optimal control rule (2.27) has a corresponding limit value $K$, so that the rule takes the stationary form $u_t = Kx_t$.

Two basic conditions are required for the existence of infinite-horizon limits in this case. One is that of sensitivity: that any deviation from the desired rest point $x = 0$, $u = 0$ should ultimately carry a cost penalty, and so demand correction. The other is that of controllability: that any such deviation can indeed be corrected ultimately.

We suppose $S$ normalised to zero; a normalisation which can be reversed if required by replacing $R$ and $A$ in the calculations below by $R - S^T Q^{-1} S$ and $A - BQ^{-1}S$ respectively. The Riccati equation then takes the form
$$\Pi_{(s)} = f\,\Pi_{(s-1)} \qquad (s = 1, 2, \ldots) \tag{1}$$
where $f$ has the action
$$f\,\Pi = R + A^T \Pi A - A^T \Pi B (Q + B^T \Pi B)^{-1} B^T \Pi A. \tag{2}$$
Lemma 6.1.1 Suppose that $\Pi_{(0)} = 0$, $R \ge 0$, $Q > 0$. Then the sequence $\{\Pi_{(s)}\}$ is nondecreasing (in the ordering of positive-definiteness).

Proof We have $F_1 = x^T R x \ge 0 = F_0$. Thus, by Theorem 3.1.1, $F_s(x) = x^T \Pi_{(s)} x$ is nondecreasing in $s$ for fixed $x$. That is, $\Pi_{(s)}$ is nondecreasing in the matrix-definiteness sense.

Lemma 6.1.2 Suppose that $\Pi_{(0)} = 0$, $R \ge 0$, $Q > 0$ and that the system $[A, B, \cdot]$ is either controllable or stabilisable. Then $\{\Pi_{(s)}\}$ is bounded above and has a finite limit $\Pi$.

Proof To demonstrate boundedness one must demonstrate that a policy can be found which incurs a finite infinite-horizon cost for any prescribed value $x$ of initial state. Controllability implies that there is a linear control rule (e.g. that suggested in the proof of Theorem 5.4.3) which, for any $x_0 = x$, will bring the state to zero in at most $n$ steps and at a finite cost $x^T \Pi^* x$, say. The cost of holding it at zero thereafter is zero, so we can assert that
$$x^T \Pi_{(s)} x \le x^T \Pi^* x \tag{3}$$
for all $x$. Stabilisability implies the same conclusion, except that convergence to zero takes place exponentially fast rather than in a finite time. The nondecreasing sequence $\{\Pi_{(s)}\}$ is thus bounded above by $\Pi^*$ (in the positive-definite
sense) and so has a limit $\Pi$. (More explicitly, take $x = e_j$, the vector with a unit in the $j$th place and zeros elsewhere. The previous lemma and relation (3) then imply that $\pi_{jjs}$, the $jj$th element of $\Pi_{(s)}$, is nondecreasing and bounded above by $\pi^*_{jj}$. It thus converges. By then taking $x = e_j + e_k$ we can similarly prove the convergence of $\pi_{jks}$.) □

We shall now show that, under natural conditions, $\Pi$ is indeed the limit of $\Pi_{(s)}$ for any nonnegative definite $\Pi_{(0)}$. The proof reveals more.
Theorem 6.1.3 Suppose that $R > 0$, $Q > 0$ and the system $[A, B, \cdot]$ is either controllable or stabilisable. Then
(i) The equilibrium Riccati equation
$$\Pi = f\,\Pi \tag{4}$$
has a unique nonnegative definite solution $\Pi$.
(ii) For any finite nonnegative definite $\Pi_{(0)}$ the sequence $\{\Pi_{(s)}\}$ converges to $\Pi$.
(iii) The gain matrix $\Gamma$ corresponding to $\Pi$ is a stability matrix.

Proof Define $\Pi$ as the limit of the sequence $f^{(s)}0$. We know by Lemma 6.1.2 that this limit exists, is finite and satisfies (4). Setting $u_t = Kx_t$ and $x_{t+1} = (A + BK)x_t = \Gamma x_t$ in the relation
$$x_t^T \Pi x_t = x_t^T R x_t + u_t^T Q u_t + x_{t+1}^T \Pi x_{t+1} \tag{5}$$
where $K$ and $\Gamma$ are the values corresponding to $\Pi$, we see that we can write (4) as
$$\Pi = R + K^T Q K + \Gamma^T \Pi \Gamma. \tag{6}$$
Consider the form
$$V(x) = x^T \Pi x \tag{7}$$
and a sequence $x_t = \Gamma^t x_0$, for arbitrary $x_0$. Then
$$V(x_{t+1}) - V(x_t) = -x_t^T (R + K^T Q K) x_t \le 0. \tag{8}$$
Thus $V(x_t)$ decreases and, being bounded below by zero, tends to a limit. Thus
$$x_t^T (R + K^T Q K) x_t \to 0 \tag{9}$$
which implies that $x_t \to 0$, since $R + K^T Q K \ge R > 0$. Since $x_0$ is arbitrary this implies that $\Gamma^t \to 0$, establishing (iii). We can thus deduce from (6) the convergent series expression for $\Pi$:
$$\Pi = \sum_{j=0}^{\infty} (\Gamma^T)^j (R + K^T Q K) \Gamma^j. \tag{10}$$
Note now that, for arbitrary finite nonnegative definite $\Pi_{(0)}$,
$$\Pi_{(s)} = f^{(s)}\Pi_{(0)} \ge f^{(s)}0 \to \Pi. \tag{11}$$
Comparing the minimal $s$-horizon cost with that incurred by using the stationary rule $u_t = Kx_t$ we deduce a reverse inequality
$$\Pi_{(s)} \le \sum_{j=0}^{s-1} (\Gamma^T)^j (R + K^T Q K) \Gamma^j + (\Gamma^T)^s \Pi_{(0)} \Gamma^s \to \Pi. \tag{12}$$
Relations (11) and (12) imply (ii). Finally, assertion (i) follows because, if $\tilde\Pi$ is another finite nonnegative definite solution of (4), then
$$\tilde\Pi = f^{(s)}\tilde\Pi \to \Pi. \qquad □$$
It is gratifying that proof of the convergence $\Pi_{(s)} \to \Pi$ implies incidentally that $\Gamma$ is a stability matrix. Of course, this is no more than one would expect: if the optimal policy is successful in driving the state variable $x$ to zero then it must indeed stabilise the equilibrium point $x = 0$. The proof appealed to the condition $R > 0$, which is exactly the condition that any deviation of $x$ from zero should be penalised immediately. However, we can weaken this to the condition that the deviation should be penalised ultimately.
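The assertions of Theorem 6.1.3 are easy to observe in the scalar case. The sketch below is an illustration under assumed values (an unstable plant with $a = 1.5$, $b = R = Q = 1$, none of which come from the text); it iterates the map $f$ of (2) from two different terminal values $\Pi_{(0)}$ and forms the gain $K$ and the closed-loop coefficient $\Gamma = a + bK$.

```python
def riccati_step(pi, a, b, r, q):
    """One application of the map f in (2), specialised to scalars."""
    return r + a * a * pi - (a * b * pi) ** 2 / (q + b * b * pi)


def iterate(pi0, a, b, r, q, steps):
    """Return the sequence Pi_(0), Pi_(1), ..., Pi_(steps)."""
    seq = [pi0]
    for _ in range(steps):
        seq.append(riccati_step(seq[-1], a, b, r, q))
    return seq


def gain_and_closed_loop(pi, a, b, q):
    """Scalar gain K and closed-loop coefficient Gamma = a + b K."""
    k = -(a * b * pi) / (q + b * b * pi)
    return k, a + b * k


if __name__ == "__main__":
    a, b, r, q = 1.5, 1.0, 1.0, 1.0      # assumed illustrative values
    from_zero = iterate(0.0, a, b, r, q, 60)
    from_ten = iterate(10.0, a, b, r, q, 60)
    k, gamma = gain_and_closed_loop(from_zero[-1], a, b, q)
    print(from_zero[-1], from_ten[-1], k, gamma)
```

Starting from $\Pi_{(0)} = 0$ the sequence rises monotonically, as Lemma 6.1.1 asserts; starting from $\Pi_{(0)} = 10$ it reaches the same limit, and $|\Gamma| < 1$ even though the uncontrolled plant is unstable.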
Theorem 6.1.4 The conclusions of Theorem 6.1.3 remain valid if the condition that $R > 0$ is replaced by the condition that, if $R = L^T L$ then the system $[A, \cdot, L]$ is either observable or detectable.

Proof Relation (9) now becomes
$$(Lx_t)^T (Lx_t) + (Kx_t)^T Q (Kx_t) \to 0$$
which implies that $Kx_t \to 0$ and $Lx_t \to 0$. These convergences, with the relation $x_t = (A + BK)x_{t-1}$, imply that $x_t$ ultimately enters a manifold for which
$$Lx_t = 0, \qquad x_t = Ax_{t-1}. \tag{13}$$
The observability condition implies that these relations can hold only if $x_t \equiv 0$. The detectability condition implies that we can find an $H$ such that $A - HL$ is a stability matrix, and since relations (13) imply that $x_t = (A - HL)x_{t-1}$, then again $x_t \to 0$. Thus $x_t \to 0$ under either condition. This fact established, the proof continues as in Theorem 6.1.3. □
We can note the corollaries of this result, already mentioned in Section 5.4.

Corollary 6.1.5 (i) Controllability implies stabilisability. (ii) Stabilisability to $x = 0$ by any means implies stabilisability by a control of the linear Markov form $u_t = Kx_t$.
Proof (i) The proof of Theorem 6.1.3 demonstrated that a stabilising policy of the linear Markov form could be found if the system were controllable. (ii) The optimal policy under a quadratic cost function is exactly of the linear Markov form, so, if such a policy will not stabilise the system (in the sense of ensuring a finite-cost passage to $x = 0$), then neither will any other. □

2. STATIONARY TRACKING RULES

The proof of the existence of infinite-horizon limits demonstrates the validity of the infinite-horizon tracking rule (2.67) of Section 2.9, at least if the hypotheses of the last section are satisfied and the disturbances and command signals are such that the feedforward term in (2.67) is convergent. We can now take matters somewhat further and begin, in the next section, to see the underlying structure.

In order to avoid undue repetition of the material of Section 2.9 and to link up with conventional control ideas we shall discuss the continuous-time case. The continuous-time analogue of (2.67) would be
$$u - u^c = K(x - x^c) - Q^{-1}B^T \int_0^\infty e^{\Gamma^T \tau}\, \Pi\, [d(t + \tau) - d^c(t + \tau)]\, d\tau \tag{14}$$
where $\Pi$, $K$ and $\Gamma$ are the infinite-horizon limits of Section 1 (in a continuous-time version) and the time argument $t$ is understood unless otherwise stated. We regard (14) as a stationary rule because, although it involves the time-dependent signal
$$d - d^c = d - \dot x^c + Ax^c + Bu^c, \tag{15}$$
this signal is seen as a system input on which the rule (14) operates in a stationary fashion.

A classic control rule for the tracking of a command signal $x^c$ in the case $u^c = 0$ would be simply
$$u = K(x - x^c) \tag{16}$$
where $u = Kx$ is a control which is known to stabilise the equilibrium $x = 0$. We see that (14) differs from this in the feedforward term, which can of course be calculated only if the future courses of the command signal and disturbance are known. Neither rule in general leads ultimately to perfect following ('zero offset'), although rule (14) does so if $d - d^c$, defined in (15), tends to zero with increasing time. This is sometimes expressed as the condition that all unstable modes of $(x^c, u^c, d)$ should satisfy the plant equation.

There is one point that we should cover. In most cases one will not prescribe the course of all components of the process vector $x$, but merely that of certain linear functions of this vector. For example, an aeroplane following a moving target is merely required to keep that target in its sights from an appropriate distance and angle; not to specify all aspects of its dynamic state. In such a case it is better not
to carry out the normalisation of $x$, $u$ and $d$ adopted in Section 2.9. If we assume that $u^c = 0$ and work with the raw variables then we find that the control rule (14) becomes rather

(17)

Details of derivation are omitted, because in the next section we shall develop an analysis which, at least for the stationary case, is much more direct and powerful than that of Section 2.9. Relation (17) is exactly what we want. A penalty term such as $(x - x^c)^T R (x - x^c)$ is a function only of those linear functions of $(x - x^c)$ which are penalised. The consequence is then that $Rx^c$ and $Sx^c$ are functions only of those linear functions of $x^c$ which are prescribed.

If we consider the case when $S$, $u^c$ and $d$ are zero and $x^c$ is constant then relation (17) reduces to
$$u = Kx - Q^{-1}B^T (\Gamma^T)^{-1} R x^c \tag{18}$$
which is to be compared with relation (16) and must be superior to it (in average cost terms). When we insert these two rules into the plant equation we see that $x$ settles to the equilibrium value $\Gamma^{-1}BKx^c = -\Gamma^{-1}J\Pi x^c$ for rule (16) and $\Gamma^{-1}J(\Gamma^T)^{-1}Rx^c$ for rule (18). Here $J$ is again the control-power matrix $BQ^{-1}B^T$. We obtain expressions for the total offset costs in Exercise 1 and Section 6.

Exercises and comments
(1) Verify that the offset cost under control (16) (assuming $S$ zero and $x^c$ constant) is $\tfrac12 (Ax^c)^T P (Ax^c)$, where
$$P = (\Gamma^T)^{-1}(R + K^T Q K)\Gamma^{-1} = -\Pi\Gamma^{-1} - (\Gamma^T)^{-1}\Pi.$$
We shall come to evaluations under the optimal rule in Section 6. However, if $R$ is assumed nonsingular (so that all components of $x^c$ are necessarily specified) then location of the optimal equilibrium point by the methods of Section 2.10 leads to the conclusion that the offset cost under the optimal rule (18) again has the form $\tfrac12 (Ax^c)^T P (Ax^c)$, but now with $P = (AR^{-1}A^T + BQ^{-1}B^T)^{-1}$. This is generalised in equation (43).
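The two offset-cost formulae of this exercise can be checked numerically in the scalar continuous-time case. The sketch below is only an illustration under stated assumptions: scalar $A$, $B$, $R$, $Q$ with $S = 0$, the stationary Riccati equation taken in the form $R + 2A\Pi - (B^2/Q)\Pi^2 = 0$, gain $K = -B\Pi/Q$, and the equilibrium values quoted above for rules (16) and (18). The function names and numerical values are hypothetical.

```python
import math


def stationary_quantities(A, B, R, Q):
    """Scalar continuous-time limits: J = B^2/Q, Pi, K and Gamma = A + B K.
    Pi is the positive root of R + 2 A Pi - J Pi^2 = 0 (assumed form)."""
    J = B * B / Q
    Pi = (A + math.sqrt(A * A + R * J)) / J
    K = -B * Pi / Q
    return J, Pi, K, A + B * K


def instantaneous_cost(x, u, xc, R, Q):
    """Cost rate (1/2)[R (x - xc)^2 + Q u^2] at an equilibrium point."""
    return 0.5 * (R * (x - xc) ** 2 + Q * u * u)


def offset_costs(A, B, R, Q, xc):
    """Offset costs at the equilibria reached under rules (16) and (18)."""
    J, Pi, K, Gamma = stationary_quantities(A, B, R, Q)
    x16 = B * K * xc / Gamma            # equilibrium under rule (16)
    u16 = K * (x16 - xc)
    x18 = J * R * xc / Gamma ** 2       # equilibrium under rule (18)
    u18 = -A * x18 / B                  # control holding that equilibrium
    return (instantaneous_cost(x16, u16, xc, R, Q),
            instantaneous_cost(x18, u18, xc, R, Q))


if __name__ == "__main__":
    print(offset_costs(1.0, 1.0, 2.0, 0.5, 3.0))
```

In the scalar case the exercise's expressions reduce to $P = -2\Pi/\Gamma$ for rule (16) and $P = (A^2/R + B^2/Q)^{-1}$ for rule (18), and the second cost is never larger than the first.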
3 DIRECT TRAJECTORY OPTIMISATION: WHY THE OPTIMAL FEEDBACK/FEEDFORWARD CONTROL RULE HAS THE FORM IT DOES

Our analysis of the disturbed tracking problem in Section 2.9 won through to a solution with an appealing form, but only after some rather unappealing
calculations. Direct trajectory optimisation turns out to offer a quick, powerful and transparent treatment of the problem, at least in the stationary case. The approach carries over to much more general models, and we shall develop it as a principal theme.

Consider the discrete-time model of Section 2.9, assuming plant equation (2.61) and instantaneous cost function (2.62). Regard the plant equation at time $\tau$ as a constraint and associate with it a vector Lagrange multiplier $\lambda_\tau$, so that we have a Lagrangian form
$$\mathbb{I} = \sum_\tau \left[ c(x_\tau, u_\tau, \tau) + \lambda_\tau^T (x_\tau - Ax_{\tau-1} - Bu_{\tau-1} - d_\tau) \right] + \text{terminal cost}. \tag{19}$$
Here the time variable $\tau$ runs over the time-interval under consideration, which we shall now suppose to be $h_1 < \tau < h_2$; the terminal cost is incurred at the horizon point $\tau = h_2$. We use $\tau$ to denote a running value of time rather than $t$, and shall do so henceforth, reserving $t$ to indicate the particular moment 'now'. In other words, it is assumed that controls $u_\tau$ for $\tau < t$ have been determined, not necessarily optimally, and that the time $t$ has come at which the value of $u_t$ is to be determined in the light of information currently available.

We shall refer to the form $\mathbb{I}$ of (19) as a 'time-integral' since it is indeed the discrete-time version of an integral. We shall also require of a 'time-integral' a property which $\mathbb{I}$ possesses, that one optimises by extremising the integral freely with respect to all variables except those whose values are currently known. That is, optimisation is subject to no other constraint.

The application of Lagrangian methods is certainly legitimate (at least in the case of a fixed and finite horizon) if all cost functions are nonnegative definite quadratic forms; see Section 7.1. We can make the strong statement that the optimal trajectory from time $t$ is determined by minimisation of $\mathbb{I}$ with respect to $(x_\tau, u_\tau)$ and maximisation with respect to $\lambda_\tau$ for all relevant $\tau \ge t$. This extremisation then yields a linear system of equations which we can write
$$\begin{bmatrix} R & S^T & I - A^T\mathcal{T}^{-1} \\ S & Q & -B^T\mathcal{T}^{-1} \\ I - A\mathcal{T} & -B\mathcal{T} & 0 \end{bmatrix} \begin{bmatrix} x - x^c \\ u - u^c \\ \lambda \end{bmatrix}^{(t)}_\tau = \begin{bmatrix} 0 \\ 0 \\ \bar d_\tau \end{bmatrix} \qquad (\tau \ge t) \tag{20}$$
where $\mathcal{T}$ is the backwards translation operator defined in (4.9) and $\bar d$ is the normalised disturbance
$$\bar d_\tau = d_\tau - x^c_\tau + Ax^c_{\tau-1} + Bu^c_{\tau-1} \tag{21}$$
already introduced in Section 2.9. We have added a superscript $(t)$ to emphasise that this is an optimisation from time $t$ onwards; nothing is assumed of the $(x, u)$ path before time $t$ except that it is known. The effect of this is that the optimal control rule, when we deduce it, will be in closed-loop form.
Let us write equation (20) as
$$\Phi(\mathcal{T})\, \xi^{(t)}_\tau = \zeta_\tau \tag{22}$$
where $\{\zeta_\tau\}$ is then known, and $\{\xi^{(t)}_\tau\}$, the course of the deviations from the desired path of the (henceforth) optimally-controlled process, is presumably determined by (20) plus initial conditions at $\tau = t$ and terminal conditions at $\tau = h$. Note that the matrix $\Phi$ is Hermitian, in that if we define the conjugate of $\Phi = \Phi(\mathcal{T})$ as $\bar\Phi = \Phi(\mathcal{T}^{-1})^T$ then $\bar\Phi = \Phi$.

Suppose that $\Phi(z)$, with $z$ a scalar complex variable, has a canonical factorisation
$$\Phi(z) = \Phi_-(z)\, \Phi_0\, \Phi_+(z) \tag{23}$$
where $\Phi_+(z)$ and $\Phi_+(z)^{-1}$ can be validly expanded on the unit circle wholly in nonnegative powers of $z$ and $\Phi_-$ ditto for nonpositive powers. What this would mean is that an equation such as
$$\Phi_+(\mathcal{T})\, \xi_\tau = v_\tau \tag{24}$$
for $\xi$ with known $v$ (and valid for all $\tau$ before the current point of operation) can be regarded as a stable forward recursion for $\xi$ with solution
$$\xi_\tau = \Phi_+(\mathcal{T})^{-1} v_\tau. \tag{25}$$
Here the solution is that obtained by expanding the operator $\Phi_+(\mathcal{T})^{-1}$ in nonnegative powers of $\mathcal{T}$, and so is linear in present and past $v$; see Section 4.5. We have taken recursion (24) as a forward recursion in that we have solved it in terms of past $v$; it is stable in that solution (25) is certainly convergent for uniformly bounded $v$.

Factorisation (23) then implies the rewriting of (22) as
$$\Phi_-(\mathcal{T})\, \Phi_0\, \Phi_+(\mathcal{T})\, \xi^{(t)}_\tau = \zeta_\tau, \tag{26}$$
so representing the difference equation (22) as the composition of a stable forward recursion and a stable backward recursion. The situation may be plainer in terms of the scalar example of Exercise 1. One finds generally that the optimisation of the path of a process generated by a forward recursion yields a recursion of double order, symmetric in past and future, and that if we can represent this double-order recursion as the composition of a stable forward recursion and a stable backward recursion, then it is the stable forward recursion which determines the optimal forward path in the infinite-horizon case (see Chapter 18).

Suppose we let the horizon point $h_2$ tend to $+\infty$, so that (26) holds for all $\tau \ge t$. We can then legitimately half-invert (26) to
$$\Phi_+(\mathcal{T})\, \xi^{(t)}_\tau = \Phi_0^{-1} \Phi_-(\mathcal{T})^{-1} \zeta_\tau \tag{27}$$
if $\zeta_\tau$ grows sufficiently slowly with increasing $\tau$ that the expression on the right is convergent when $\Phi_-(\mathcal{T})^{-1}$ is expanded in nonpositive powers of $\mathcal{T}$. We thus have an expression for $\xi_\tau$ in terms of past $\xi$ and present and future $\zeta$. This gives us exactly what we want: an expression for the optimal control in the desired feedback/feedforward form.

Theorem 6.3.1 Suppose that $\bar d_\tau$ grows sufficiently slowly with $\tau$ that the semi-inversion (27) is legitimate. More specifically, that the semi-inversion
$$\Phi_+(\mathcal{T}) \begin{bmatrix} x - x^c \\ u - u^c \\ \lambda \end{bmatrix}^{(t)}_\tau = \Phi_0^{-1} \Phi_-(\mathcal{T})^{-1} \begin{bmatrix} 0 \\ 0 \\ \bar d \end{bmatrix}_\tau \qquad (\tau \ge t) \tag{27'}$$
is legitimate in that the right-hand member is convergent when the operator $\Phi_-(\mathcal{T})^{-1}$ is expanded in nonpositive powers of $\mathcal{T}$. Then
(i) The determination of $u_t$ obtained by setting $\tau = t$ in relation (27') constitutes an expression of the infinite-horizon optimal control rule in feedback/feedforward form.
(ii) The Hermitian character of $\Phi$ implies that the factorisation (26) can be chosen so that $\Phi_- = \bar\Phi_+$, whence it follows that the operator which gives the feedforward component is just the inverse of the conjugate of the feedback operator.
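A scalar illustration may make the link between canonical factorisation and the stationary Riccati equation concrete. The sketch below rests on assumptions that are not spelled out in the text: a scalar plant $x_\tau = ax_{\tau-1} + bu_{\tau-1}$ with $S = 0$, for which the matrix of the extremal equations is taken as $\Phi(z) = [[R, 0, 1 - az^{-1}], [0, Q, -bz^{-1}], [1 - az, -bz, 0]]$, so that $-\det\Phi(z) = b^2R + Q(1 - az)(1 - az^{-1})$. The zero of this expression inside the unit circle is then compared with the closed-loop coefficient $\Gamma = a + bK$ obtained by iterating the Riccati recursion. Function names and numbers are illustrative.

```python
import math


def stationary_riccati(a, b, R, Q, steps=200):
    """Limit of the scalar Riccati recursion started from Pi = 0."""
    pi = 0.0
    for _ in range(steps):
        pi = R + a * a * pi - (a * b * pi) ** 2 / (Q + b * b * pi)
    return pi


def closed_loop_from_riccati(a, b, R, Q):
    """Gamma = a + b K with K = -a b Pi / (Q + b^2 Pi)."""
    pi = stationary_riccati(a, b, R, Q)
    return a * Q / (Q + b * b * pi)


def closed_loop_from_factorisation(a, b, R, Q):
    """Zero inside the unit circle of b^2 R + Q (1 - a z)(1 - a / z).
    Multiplying by z gives Q a z^2 - c z + Q a = 0, whose two roots are
    reciprocals of each other; requires a != 0."""
    c = b * b * R + Q * (1.0 + a * a)
    disc = math.sqrt(c * c - 4.0 * (Q * a) ** 2)
    return (c - disc) / (2.0 * Q * a)


if __name__ == "__main__":
    args = (1.5, 1.0, 1.0, 1.0)          # assumed a, b, R, Q
    print(closed_loop_from_riccati(*args))
    print(closed_loop_from_factorisation(*args))
```

The two routes give the same stable root, which is the scalar content of the statement that the stable forward factor determines the optimal forward path.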
Relation (27') in fact determines the whole future course of the optimally controlled process recursively, but it is the determination of the current control $u_t$ that is of immediate interest. The relation at $\tau = t$ determines $u_t$ (optimally) in terms of $x_t$ and $\bar d_\tau$ $(\tau \ge t)$; the feedback/feedforward rule. Furthermore, the symmetry in the evaluation of these two components explains the structure which began to emerge in Section 2.9, and which we now see as inevitable. We shall both generalise this solution and make it more explicit in later chapters.

The achievement of the canonical factorisation (23) is the equivalent of solving the stationary form of the Riccati recursion, and in fact the policy-improvement algorithm of Section 3.5 translates into a fast and natural algorithm for this factorisation.

The assumptions behind solution (27) are twofold. First, there is the assumption that the canonical factorisation (23) exists. This corresponds to the assumption that infinite-horizon limits exist for the original problem of Section 2.4; that of optimal regulation to zero in the absence of disturbances. Existence of the canonical factorisation is exactly the necessary and sufficient condition for existence of the infinite-horizon limits; the controllability/sensitivity assumptions of Theorem 6.1.3 were sufficient, but probably not necessary. We shall see in Chapter 18 that the policy-improvement method for deriving the optimal infinite-horizon policy indeed implies the natural algorithm for determination of the canonical factorisation.

The second assumption is that the normalised disturbance $\bar d_\tau$ should increase sufficiently slowly with $\tau$ that the right-hand member of (27), the feedforward
term, should converge. Such convergence does not guarantee that the vector of 'errors' $\Delta_t$ will converge to zero with increasing $t$, even in the case when all components of this error are penalised. There may well be nonzero offsets in the limit; the errors may even increase exponentially fast. However, convergence of the right-hand member of (27) implies that $\zeta_\tau$ increases slowly enough with time that an infinite-horizon optimisation is meaningful. Specifically, suppose that the zeros of I.,(Axr1 + u
1
t=l
One then fmds the conditions >.
1
~
x1)].
0 with equality if u1 > 0 and
(1 :!{. t 0 1 (at which a dose is administered). One can be more explicit abo ut the algorithm. Denote the times t at which X, = X'f by lt, t2, ... , going bac kwards in time. The n It = h + 1, when Xh+l = 0.
2 THE PONTRYAGIN MAXIMUM PRINCIPLE
The solution in the range
tt+l ~ t ~ t;
has the 'free solution' form X 1 = cuA 1+
c2;A 1 • The coefficients care to be chosen so that Xt =X~ fort= t;, ti+l· Once t 1, t 2 , ..• , !; have been determined then t;+I is determin ed as the smallest value t ( ~ 1) such that the free solution agreeing with at t and t; is not smaller
of than xc at any intermediate point. If this value of t seems to be t = 1 then indeed it is provided that the value of x 0 is such that d i1 X 1 > 0, i.e. such that medication should begin immediately. If, on the other hand, d i1 X1 ~ 0, then medication begins first at t;. We have used the notation d to indicate potential generalisations of the argument. The problem has something in common with the production problem of Section 4.3. In the case A = 1 it reduces to the so-called 'monotone regression' problem, in which one tries to approximate a sequence {.x. as the conjugate variable (a scalar), since we are already using p to denote price. Discounting makes the problem time-dependent. (This time-dependence is removed in a dynamic programming approach only by a renormalisation to present value as time advances.) The instantaneous value of u is such as to maximise the Hamiltonian
$$H = e^{-\alpha t} u p(u) - \lambda u. \tag{23}$$
We have $\dot\lambda = -H_x = 0$, so that $\lambda$ is constant. If $\lambda_0$ is the conjugate variable associated with time then $\lambda_0 + H = 0$. At termination we have $\lambda_0 + K_t = 0$. But $K$ is identically zero, so $\lambda_0$ is zero at termination, which implies that $H = 0$ at termination. This certainly holds if $u = 0$ at termination, and this turns out to be the only way of satisfying the terminal condition. The initial condition is
$$x = \int_0^\infty u\, dt. \tag{24}$$
Consider first the case $p(u) = u^{-\gamma}$, which in fact makes the problem a version of the consumption problem considered in Section 2.2. Maximisation of $H$ gives the rule $u = ke^{-\alpha t/\gamma}$, for some constant $k$. If $\bar t$ is the termination time then the condition $u(\bar t) = 0$ is satisfied only for $\bar t$ infinite; the resource is never released completely in finite time. Condition (24) gives the evaluation $k = \alpha x/\gamma$, so that the optimal release rule is $u = (\alpha x/\gamma)e^{-\alpha t/\gamma}$. Here $x$ is the initial stock and this is the open-loop release rule. If $x$ is taken as current stock then the optimal rule in closed-loop form is $u = \alpha x/\gamma$. If we insert the open-loop expression for $u$ into the reward function above we find the maximal return $F(x) = (\gamma/\alpha)^\gamma x^{1-\gamma}$. These conclusions are consistent with those of Section 2.2, in the case when infinite-horizon limits existed.

Consider now the case $p(u) = (1 - u/2)_+$. The rate of return $up(u)$ thus tends to zero as $u \downarrow 0$; we shall see that this has as consequence that the resource is released in a finite time. The maximal instantaneous rate of return is attained for $u = 1$, and we certainly may assume that $u < 2$, since higher values yield zero rate of return. The optimal release rate is $u = 1 - \lambda e^{\alpha t}$. The condition $u(\bar t) = 0$ then implies that $\lambda = e^{-\alpha \bar t}$, and the optimal release rule in open-loop form is
$$u = 1 - e^{-\alpha(\bar t - t)}. \tag{25}$$
Condition (24) yields the determining condition for $\bar t$
$$x = \bar t - (1 - e^{-\alpha \bar t})/\alpha \tag{26}$$
and substitution of expression (25) for $u$ into the reward function yields
$$F(x) = (1 - e^{-\alpha \bar t})^2/(2\alpha). \tag{27}$$
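Relations (25) to (27) fix $u$ and $F$ only through the termination time $\bar t$, but $\bar t$ is easily found numerically since the right-hand member of (26) increases with $\bar t$. The following minimal sketch (function names, bisection depth and quadrature size are illustrative choices, not part of the text) solves (26) by bisection and checks (27) against a direct midpoint-rule integration of the discounted reward $e^{-\alpha t}u(1 - u/2)$ along the path (25).

```python
import math


def stock_released(t_bar, alpha):
    """Right-hand member of (26): total release up to termination."""
    return t_bar - (1.0 - math.exp(-alpha * t_bar)) / alpha


def termination_time(x, alpha):
    """Solve (26) for t_bar by bisection; x > 0 is the initial stock."""
    lo, hi = 0.0, 1.0
    while stock_released(hi, alpha) < x:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if stock_released(mid, alpha) < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def value(x, alpha):
    """Maximal return according to (27)."""
    t_bar = termination_time(x, alpha)
    return (1.0 - math.exp(-alpha * t_bar)) ** 2 / (2.0 * alpha)


def value_by_quadrature(x, alpha, n=20000):
    """Integrate exp(-alpha t) u (1 - u/2) along the open-loop rule (25)."""
    t_bar = termination_time(x, alpha)
    h = t_bar / n
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * h
        u = 1.0 - math.exp(-alpha * (t_bar - t))
        total += math.exp(-alpha * t) * u * (1.0 - u / 2.0) * h
    return total


if __name__ == "__main__":
    print(termination_time(1.0, 0.5), value(1.0, 0.5),
          value_by_quadrature(1.0, 0.5))
```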
The determinations (25) and (27) of the optimal $u$ and the value function are only implicit; they are expressed in terms of $\bar t$, which is only implicitly determined as a function of $x$ by (26). The dynamic programming equation, had we taken that route, could not have been solved more explicitly. Indeed, it would have taken some ingenuity to have spotted this implicit form of the solution.

Exercises and comments

(1) Zermelo's problem A straight river has a current of speed $c(y)$, where $y$ is the distance from the bank from which a boat is leaving. The boat then crosses the river at a constant speed $v$ relative to the water, so that its position in downstream/cross-stream coordinates $(x, y)$ satisfies $\dot x = v\cos u + c(y)$, $\dot y = v\sin u$, where $u$ is the heading angle indicated in Figure 4.
(i) Suppose that the boatman wishes to be carried downstream as little as possible in crossing. Show that he should follow a heading
$$u = \cos^{-1}\left(-\frac{v}{c(y)}\right).$$
(Note the implication, that we must have $c(y) \ge v$ for all $y$. Otherwise the boatman could move upstream in the slack water as far as he liked.)
(ii) Suppose the boatman wishes to reach a given point on the opposite bank in minimal time. Show that he should follow the heading
$$u = \cos^{-1}\left(-\frac{v}{p + c(y)}\right),$$
where $p$ is a constant chosen to make the path pass through the target point.
Figure 4 The Zermelo problem of Ex. 5.1: the optimal crossing of a stream.
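Part (i) of the Zermelo exercise can be checked numerically at a fixed distance $y$ from the bank. The sketch below assumes the kinematics $\dot x = v\cos u + c$, $\dot y = v\sin u$, so that the downstream drift per unit of crossing progress is $(v\cos u + c)/(v\sin u)$; it compares a brute-force search over headings with the closed-form heading $\cos^{-1}(-v/c)$. The function names and the values $v = 1$, $c = 2$ are illustrative.

```python
import math


def drift_rate(u, v, c):
    """Downstream displacement per unit cross-stream progress, dx/dy."""
    return (v * math.cos(u) + c) / (v * math.sin(u))


def best_heading_formula(v, c):
    """Heading of part (i); needs c >= v."""
    return math.acos(-v / c)


def best_heading_by_search(v, c, n=20000):
    """Grid search over headings strictly between 0 and pi."""
    best_u, best_rate = None, float("inf")
    for i in range(n):
        u = math.pi * (i + 0.5) / n
        rate = drift_rate(u, v, c)
        if rate < best_rate:
            best_u, best_rate = u, rate
    return best_u


if __name__ == "__main__":
    v, c = 1.0, 2.0
    u_star = best_heading_formula(v, c)
    print(u_star, best_heading_by_search(v, c))
    print(drift_rate(u_star, v, c), drift_rate(math.pi / 2.0, v, c))
```

At the optimal heading the drift rate equals $\sqrt{c^2 - v^2}/v$, which is smaller than the rate $c/v$ incurred by pointing straight across.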
6 THE BUSH PROBLEM

This is one of the celebrated early problems of optimal control: to bring a mass to rest in a given position in minimal time, using a force of bounded magnitude. One example would be that of bringing the rollers of a rolling mill to rest in standard position in minimal time, the angular momentum of the rollers corresponding to the linear momentum of the mass. In the one-dimensional case the solution turns out to be simple: to apply maximal force first in one direction and then in the other, the order and duration being such that the mass will come to rest and to the desired position simultaneously. However, to prove the optimality of this manoeuvre, extreme and discontinuous in character, was beyond classical methods of variational calculus.

Consider the one-dimensional case; let $x$ denote the coordinate and $v$ the velocity. Thus $(x, v)$ is the state variable, with the plant equation
$$\dot x = v, \qquad \dot v = u, \tag{28}$$
where $u$ can be interpreted as the force applied per unit mass. We suppose that $|u| \le M$, $c = 1$ and that $\mathcal{S}$ consists of the single point $(x, v) = (0, 0)$. That is, the mass is to be brought to rest at the origin in minimal time. If $p$, $q$ are taken as the variables conjugate to $x$, $v$ respectively, then
$$H = pv + qu - 1, \tag{29}$$
$$\dot p = 0, \qquad \dot q = -p, \tag{30}$$
$$u = M\,\mathrm{sgn}(q). \tag{31}$$
If the terminal values of $p$ and $q$ are denoted $\alpha$ and $\beta$ then (30) implies that
$$p = \alpha, \qquad q = \beta + \alpha s \tag{32}$$
in terms of time-to-go $s$. Since $u = M\,\mathrm{sgn}(\beta)$ and $v = 0$ at termination, the relation $H = 0$ implies that $|\beta| = 1/M$.

Consider first the positive option, $\beta = 1/M$. If $\alpha \ge 0$ then $q \ge 0$ for $s \ge 0$, so that $u = M$, and backwards integration along the orbit yields
$$v = -Ms, \qquad x = Ms^2/2.$$
Thus $x$, $v$ lie on the parabolic locus
$$x = v^2/(2M) \qquad (x \ge 0,\ v \le 0).$$
This is the lower half of the switching locus drawn in Figure 5. If, on the other hand, $\alpha$ is negative, then $q$ changes sign at $s = \beta/|\alpha| = s_0$, say, as then does $u$. If we follow the optimal path in reverse time then in this case it
Figure 5 The Bush problem. The two half-parabolae through the origin constitute the switching locus; we illustrate a path which begins with maximal deceleration, then switches to maximal acceleration when it meets the locus.
leaves the switching locus at $s = s_0$ and then follows the lightly-drawn parabola in Figure 5. In forward time, if one started at a point on this path then one would apply maximal deceleration $u = -M$ and hold it until the switching locus was reached. One would then apply maximal acceleration $u = M$ and move along the switching locus to the origin. The case $\beta = -1/M$ leads correspondingly to the other half of the switching locus: $x = -v^2/(2M)$ $(x \le 0,\ v \ge 0)$.
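The two-phase manoeuvre can be traced explicitly. In the sketch below the two half-parabolae are combined into a single switching function $x + v|v|/(2M)$, which is an assumed but equivalent way of writing the locus; the state is propagated exactly under piecewise-constant thrust, and the function names and the starting points are illustrative. The schedule applies $-M$ until the locus is met and then $+M$ to the origin, with the mirror-image schedule for starts below the locus.

```python
import math


def switching_function(x, v, M):
    """Positive above the locus x = -v|v|/(2M), negative below it."""
    return x + v * abs(v) / (2.0 * M)


def thrust_schedule(x0, v0, M):
    """Return [(u, duration), (u, duration)] for the two thrust phases."""
    sign = 1.0 if switching_function(x0, v0, M) > 0 else -1.0
    x, v = sign * x0, sign * v0              # mirrored state lies on or above the locus
    root = math.sqrt(max(0.0, 0.5 * v * v + M * x))
    t1 = (v + root) / M                      # time at -M until the locus is met
    t2 = root / M                            # time at +M along the locus
    return [(-sign * M, t1), (sign * M, t2)]


def apply_schedule(x0, v0, schedule):
    """Propagate (x, v) exactly under piecewise-constant thrust."""
    x, v = x0, v0
    for u, t in schedule:
        x, v = x + v * t + 0.5 * u * t * t, v + u * t
    return x, v


if __name__ == "__main__":
    plan = thrust_schedule(1.0, 0.0, 1.0)
    print(plan, apply_schedule(1.0, 0.0, plan))
```

Starting from rest at $x = 1$ with $M = 1$ the plan is one unit of time at $u = -1$ followed by one unit at $u = +1$, which brings the mass to rest at the origin.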
The maximum principle yields the relations (31) and (32) and from these we have deduced the optimal rule in closed-loop form, expressed in terms of the switching locus. If one starts off the locus then one applies maximal deceleration or acceleration depending on whether one is above or below it. Once the locus is reached one applies maximal deceleration or acceleration depending upon whether one is upon the upper or the lower branch. Once the origin is reached then the force is of course removed.

The control rule we have deduced is an example of what is termed 'bang-bang' control. That is, the control variable $u$ in general takes extreme values in its permitted set $\mathcal{U}$, sometimes switching between these in a discontinuous fashion, as in this example. Small heating systems are usually run this way, with the gas flame either fully on or fully off; intermediate settings being impracticable. Fuel economy certainly requires that rocket thrust should be programmed this way: to operate in general either at full thrust (in some direction) or at zero thrust. In the Bush example the control force was not costed, merely limited. In the rocket case
there is a linear costing in addition to limitation; this implies the bang-bang character of the optimal rule, as we shall see in the next section.

The treatment of the Bush problem generalises to some degree to the $n$-dimensional case. If we work from the vector generalisations of (30) and (31) then we deduce the partial characterisation
$$u = M\,\frac{(s\alpha + \beta)^{\mathsf T}}{|s\alpha + \beta|}$$
of the control law, for fixed $\alpha$, $\beta$ subject to $|\beta| = 1/M$. As an optimal path is followed we see that the direction of the control force now in general varies all the time, with up to $n$ reversals of direction in some component. One generates all optimal orbits by variation of $\alpha$ and $\beta$. However, to deduce the closed-loop form of the control in this way seems difficult. Somewhat simpler is the LQ version of the problem, outlined in Exercise 1.

Exercises and comments

(1) Consider the vector version of the Bush problem with no bound on $u$, but with an instantaneous cost function $(L + Q|u|^2)$, penalising respectively time taken and control energy consumed during passage to termination at $x = v = 0$. Suppose, to begin with, that we prescribe the termination time $\bar t$. Show (and we shall treat a more general case in Section 8) that the optimal control is
$$u = -\frac{2}{s^2}(3x + 2vs) = -\frac{6}{s^2}\Bigl(x + \frac{vs}{2}\Bigr) - \frac{v}{s} \tag{33}$$
and that the control cost incurred is
$$F(x, v, s) = \dots \tag{34}$$
where $s = \bar t - t$ is time to go. The first term in the final expression of (33) induces correction of final position (with an inference that the effective average velocity over the remaining time interval is $v/2$) and the second term induces correction of final velocity. One can now take account of the time-cost by choosing $s$ to minimise $Ls + F(x, v, s)$. This will give an $(x, v)$-dependent evaluation of $s$ and so of $\bar t$, but the evaluation of $\bar t$ is necessarily constant along the optimal orbit.
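As a quick consistency check on the feedback law (33), one can follow a minimum-energy path in open loop and confirm that the feedback formula reproduces the same control at every instant. The sketch below treats the scalar case; the function names are hypothetical, and the linear-in-time open-loop control is an assumption derived here from the terminal conditions $x = v = 0$, not a formula quoted from the text.

```python
def closed_loop_u(x, v, s):
    """Feedback law (33): u = -(2/s^2)(3x + 2vs), with s the time to go."""
    return -2.0 * (3.0 * x + 2.0 * v * s) / s**2

def open_loop_path(x0, v0, T, tau):
    """State and control at elapsed time tau on the path from (x0, v0) at
    time 0 to the origin at time T, under a control linear in tau."""
    a = -(6.0 * x0 + 4.0 * v0 * T) / T**2   # control at tau = 0
    b = (12.0 * x0 + 6.0 * v0 * T) / T**3   # slope of the control
    u = a + b * tau
    v = v0 + a * tau + 0.5 * b * tau**2
    x = x0 + v0 * tau + 0.5 * a * tau**2 + b * tau**3 / 6.0
    return x, v, u

# Compare feedback and open-loop controls half-way along one path.
x_mid, v_mid, u_mid = open_loop_path(1.0, 0.0, 1.0, 0.5)
u_feedback = closed_loop_u(x_mid, v_mid, 0.5)
```

The agreement of the two controls along the whole path reflects the fact that the time of termination is held fixed while the time to go shrinks.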
7 ROCKET THRUST PROGRAMME OPTIMISATION

The solution of the Bush problem of Section 6 is very much what intuition might suggest, but is nevertheless remarkable for its extreme and discontinuous character. It is just for this reason that the problem was so resistant to classical optimisation methods, such as the calculus of variations. The maximum
principle has been significant in that it provided a technique for just such cases. A source of significant problems since the 'forties has been the incentive to determine a rocket thrust programme which is optimal in the time/fuel consumption needed to achieve some given manoeuvre.

We shall assume the rocket to be effectively a point mass, so that its state is specified by its coordinates in physical space ($x$), its velocity ($v$) and its mass ($m$). That is, we neglect rotation, yaw, vibration or any aspect of the rocket other than translation of its centre of mass through space and the wasting of its mass through fuel consumption. Note that $x$ in this case is not the state vector, but simply that part of the state vector $(x, v, m)$ which describes the Euclidean coordinates of the rocket in physical space.

Suppose that the rocket jet has a backward vector velocity $k$ relative to the rocket (so that the absolute velocity of the material of the jet is $v - k$) and that the rocket is subject to external forces of vector magnitude $h = h(x, v, m, t)$. Then the condition of conservation of linear momentum over a time interval of length $\delta t$ gives the equation
$$(m - \delta m)(v + \delta v) + (v - k)\,\delta m = mv + h\,\delta t$$
or
$$m\dot v = k\dot m + h, \tag{35}$$
the so-called rocket equation. We shall assume that the direction of the jet vector $k$ can be freely controlled and that the rate $\dot m$ at which mass can be expelled in the rocket jet can also be controlled within limits. If we set
$$k\dot m = u,$$
then we can regard the thrust vector $u$ as the control vector, to be chosen freely subject to
$$|u| \le M, \tag{36}$$
say. With these definitions the collective plant equation for the rocket can be written
$$\dot x = v, \qquad m\dot v = u + h, \qquad \dot m = -c|u|. \tag{37}$$
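If conjugate variables $p$, $q$, $r$ are associated with $x$, $v$, $m$ respectively, then equations (10) become for this case
$$\dot p = -\frac{q h_x}{m}, \qquad \dot q = -p - \frac{q h_v}{m}, \qquad \dot r = \frac{q(u + h - m h_m)}{m^2}. \tag{38}$$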
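If we suppose that the cost function is purely terminal then the optimal control $u$ should be such as to maximise the expression
$$\frac{qu}{m} - cr|u|. \tag{39}$$
For a given value of $|u|$ the thrust vector $u$ will then be chosen in the direction of $q^{\mathsf T}$:
$$u = |u|\,\frac{q^{\mathsf T}}{|q|},$$
and $|u|$ will be chosen to maximise $\kappa|u|$, where
$$\kappa = \frac{|q|}{m} - cr.$$
That is,
$$|u| = \begin{cases} M \\ \text{indeterminate} \\ 0 \end{cases} \quad\text{according as}\quad \kappa \begin{cases} > 0, \\ = 0, \\ < 0. \end{cases} \tag{40}$$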
The vector $q$ is often termed the primer; its direction determines the optimal thrust direction and the magnitudes of $q$ and $r$ determine the optimal thrust magnitude. The paths on which $|u|$ is respectively $M$, $0$ or intermediate in value are often called maximal thrust, null thrust and intermediate thrust arcs respectively, or simply MT, NT and IT arcs.

Equations (38) simplify somewhat if the rocket is moving in a purely gravitational field, when $h = m\gamma$, where $\gamma$ is the gravitational vector. One is thus neglecting aerodynamic effects such as drag. The vector $\gamma$ will have the form $\gamma(x) = -V_x^{\mathsf T}$ if the field is a conservative one, $V(x)$ being the potential energy associated with the gravitational field. In this case equations (38) reduce to
$$\dot p = -q\gamma_x, \qquad \dot q = -p, \qquad \dot r = \frac{qu}{m^2},$$
and the primer obeys the equation
$$\ddot q = q\gamma_x = -qV_{xx}, \tag{41}$$
where $V_{xx}$ is the matrix of second differentials of $V$ with respect to $x$. Equations (38) become particularly simple if $V$ may be assumed quadratic, when $V_{xx}$ is independent of $x$.

As a very special case, consider the problem of maximising the height reached by a sounding rocket, assuming constant gravity and neglecting (implausibly!) effects such as aerodynamic drag. The problem can be taken as a one-dimensional one in which $x$ represents height measured upward from the starting point (the ground). The quantities $v$, $p$, $q$ and $r$ will all be scalar, and $\gamma = -g$, where $g$ is the gravitational acceleration, assumed constant.
The problem is one of maximising $x$ at termination. Let $m_0$ denote the mass of the rocket structure, so that $w = m - m_0$ is the mass of fuel remaining. The terminal conditions holding are then $p = 1$, $q = 0$, and $r$ is nonpositive or nonnegative according as the fuel reserve $w$ is positive or zero. We find from equations (38) that $p = 1$, $q = s$ (= time to go) so that
$$\dots \tag{42}$$
But $u$, if nonzero, is in the direction of $q = s$ and so positive (upward-directed). Thus $\dot\kappa = -m^{-1}$ and $\kappa$ must decrease strictly through time. We thus see that the thrust programme must take the form of a phase of maximal thrust followed by a phase of null thrust, either or both of these phases possibly being of zero duration.

Denote the terminal value of $r$ by $r_0$. If $r_0 > 0$ (so that $w = 0$ at termination) then it follows from (40) and (42) that thrust is zero for $s < c m_0 r_0$ and maximal for larger $s$, and the condition $H = 0$ implies that $v = 0$ at termination. This is the usual case, in which an MT phase which exhausts fuel is followed by an NT phase during which the rocket coasts to its maximal height, when its velocity is zero.

If $r_0 \le 0$ then $\kappa \ge 0$ at termination. There is thus no NT arc; maximal thrust is applied throughout. If $r_0 = 0$ then it follows again from $H = 0$ that $v = 0$ at termination. If $r_0 < 0$ then $v < 0$ at termination. These are cases in which the thrust is insufficient to lift the rocket. If initially the rocket happens to be already rising then maximal thrust is applied until the rocket is on the point of reversing, which is taken as the terminal instant. If the rocket happens to be already falling then termination is immediate.

This last discussion illustrates the literal nature of the analysis. In discussing all the possible cases one comes across some which are indeed physically possible but which one would hardly envisage in practice.
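The advantage of burning at maximal thrust can be illustrated numerically. The sketch below is a restricted comparison, not a proof of optimality: it assumes the scalar plant $m\dot v = u - mg$, $\dot m = -cu$, a burn at one constant thrust level from launch until the fuel is exhausted, and thrust exceeding the initial weight. The function name and all parameter values are hypothetical.

```python
import math

def apex_height(thrust, m0=1.0, w0=1.0, c=0.01, g=10.0):
    """Greatest height reached when all fuel w0 is burnt at constant thrust
    from launch and the rocket then coasts (no drag, constant gravity)."""
    m_init = m0 + w0
    if thrust <= m_init * g:
        raise ValueError("thrust must exceed the initial weight for lift-off")
    burn_time = w0 / (c * thrust)                # from dm/dt = -c*u
    log_ratio = math.log(m_init / m0)
    v_burnout = log_ratio / c - g * burn_time    # integral of u/m - g
    x_burnout = ((w0 - m0 * log_ratio) / (c * c * thrust)
                 - 0.5 * g * burn_time**2)       # integral of the velocity
    return x_burnout + v_burnout**2 / (2.0 * g)  # coasting adds v^2/(2g)

full = apex_height(100.0)   # burn at an assumed maximal thrust M = 100
half = apex_height(50.0)    # same fuel, burnt at half that thrust
```

Under these assumptions the shorter, harder burn reaches the greater height, since less of the impulse is spent supporting the rocket against gravity during the burn.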
Exercises and comments

(1) An approximate reverse of the sounding rocket problem is that of soft landing: to land a rocket on the surface of a planet with prescribed terminal velocity in such a way as to minimise fuel consumption. It may be assumed that gravitational forces are vertical and constant, that there is no atmosphere and that all motion is vertical. Note that equation (42) remains valid. Hence show that the thrust programme must consist of a phase of null thrust followed by one of maximal thrust upwards (the phases possibly being of zero duration). How is the solution affected if one also penalises the time taken?
8 PROBLEMS WHICH ARE PARTIALLY LQ
LQ models can be equally well treated by dynamic programming or by the maximum principle; one treatment is in fact only a slightly disguised version of
the other. However, there is a class of partially LQ models for which the maximum principle quickly reveals some simple conclusions. We shall treat these at some length, since conclusions are both explicit and transfer in an interesting way to the stochastic case (see Chapter 24).

Assume vector state and control variables and a linear plant equation
$$\dot x = Ax + Bu. \tag{43}$$
Suppose an instantaneous cost function
$$c(u) = \tfrac12 u^{\mathsf T} Q u, \tag{44}$$
which is quadratic in $u$ and independent of $x$ altogether. We shall suppose that the only state costs are those incurred at termination. The analysis which follows remains valid if we allow the matrices $A$, $B$ and $Q$ to depend upon time $t$, but we assume them constant for simplicity. However, we shall allow termination rules which are both time-dependent and non-LQ, in that we shall assume that a terminal cost $\mathbb{K}(\xi)$ is incurred upon first entry to a stopping set of $\xi$-values $\mathcal{S}$, where $\xi$ is the combined state/time variable $\xi = (x, t)$. We assume that any constraint on the path is incorporated in the prescription of $\mathcal{S}$ and $\mathbb{K}$, so that $\xi$-values which are 'forbidden' belong to $\mathcal{S}$ and carry infinite penalty.

The model is thus LQ except at termination. The assumption that state costs are incurred only at termination is realistic under certain circumstances. For example, imagine a missile or an aircraft which is moving through a region of space which (outside the stopping set $\mathcal{S}$) is uniform in its properties (i.e. in gravitational force and air density). Then no immediate position-dependent cost is incurred. This does not mean to say that spatial position is immaterial, however; one will certainly avoid any configuration of the craft which would take it to the wrong target or (in the case of the aircraft) lead it to crash. In other words, one will try so to manoeuvre the craft that flight terminates favourably, in that the sum of control costs and terminal cost is minimal.

This will be the interest of the problem: to chart out a course which is both economical and avoids hazards (e.g. mountain peaks) which would lead to premature termination. The effect of such hazards is even more interesting in the stochastic case, when even the controlled path of the craft is not completely predictable. It is then not enough to scrape past a hazard; one must allow a safe clearance. The analysis of this section has a natural stochastic analogue, which we pursue in Chapter 24.
The Hamiltonian for the problem is
$$H(x, u, p) = \lambda^{\mathsf T}[Ax + Bu] - \tfrac12 u^{\mathsf T} Q u$$
if we take $p = \lambda^{\mathsf T}$ as the multiplier. It thus follows that on a free section of the optimal path (i.e. a section clear of $\mathcal{S}$)
$$u = Q^{-1}B^{\mathsf T}\lambda, \tag{45}$$
$$\dot\lambda = -A^{\mathsf T}\lambda. \tag{46}$$
Consider now the optimal passage from an initial position $\xi = (x, t)$ to a terminal position $\bar\xi = (\bar x, \bar t)$ by a path which we suppose to be free. We shall correspondingly denote the terminal value of $\lambda$ by $\bar\lambda$, and shall denote time-to-go $\bar t - t$ by $s$. It follows then from (45) and (46) that the optimal value of $u$ is given by
$$u = Q^{-1}B^{\mathsf T}e^{A^{\mathsf T}s}\bar\lambda. \tag{47}$$
Inserting this expression for $u$ back into the plant equation and cost function we find the expressions
$$\bar x = e^{As}x + V(s)\bar\lambda, \tag{48}$$
$$F(\xi, \bar\xi) = \tfrac12 \bar\lambda^{\mathsf T} V(s)\bar\lambda \tag{49}$$
for terminal $x$ and total cost in terms of $\bar\lambda$, where
$$V(s) = \int_0^s e^{A\tau} J e^{A^{\mathsf T}\tau}\, d\tau \tag{50}$$
and $J = BQ^{-1}B^{\mathsf T}$, as ever. In (50) we recognise just the controllability Gramian. Solving for $\bar\lambda$ from (48) and substituting in (47) and (49) we deduce
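Theorem 7.8.1 Assume the model specified above. Then
(i) The minimal cost of free passage from $\xi$ to $\bar\xi$ is
$$F(\xi, \bar\xi) = \tfrac12 (\bar x - e^{As}x)^{\mathsf T} V(s)^{-1} (\bar x - e^{As}x), \tag{51}$$
and the open-loop form of the optimal control at $\xi = (x, t)$ is
$$u = Q^{-1}B^{\mathsf T}e^{A^{\mathsf T}s}V(s)^{-1}(\bar x - e^{As}x), \tag{52}$$
where $s = \bar t - t$.
(ii) The minimal cost of passage from $\xi$ to the stopping set $\mathcal{S}$, by an orbit which is free before termination, is
$$F(\xi) = \inf_{\bar\xi \in \mathcal{S}}\,[F(\xi, \bar\xi) + \mathbb{K}(\bar\xi)], \tag{53}$$
if this can be attained.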
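The free-passage formulae of the theorem are easy to evaluate numerically. The following is a minimal sketch rather than a production routine: the helper names are hypothetical, the matrix exponential is computed by a truncated Taylor series (adequate only for small, well-scaled matrices), and the Gramian is approximated by the trapezoidal rule. The example system, a double integrator with scalar $Q$, is chosen here purely for illustration.

```python
import numpy as np

def expm_series(M, terms=30):
    """Matrix exponential by truncated Taylor series (small matrices only)."""
    result = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ M / k
        result = result + term
    return result

def gramian(A, B, Q, s, n=400):
    """V(s) = int_0^s e^{A tau} J e^{A^T tau} d tau with J = B Q^{-1} B^T,
    approximated by the trapezoidal rule on n intervals."""
    J = B @ np.linalg.inv(Q) @ B.T
    vals = []
    for tau in np.linspace(0.0, s, n + 1):
        E = expm_series(A * tau)
        vals.append(E @ J @ E.T)
    vals = np.array(vals)
    return (s / n) * (vals[0] / 2 + vals[1:-1].sum(axis=0) + vals[-1] / 2)

def free_passage(A, B, Q, x, x_bar, s):
    """Cost and control of free passage from x to x_bar with time to go s."""
    V = gramian(A, B, Q, s)
    E = expm_series(A * s)
    lam_bar = np.linalg.solve(V, x_bar - E @ x)   # terminal multiplier
    cost = 0.5 * lam_bar @ (V @ lam_bar)          # quadratic form in lam_bar
    u = np.linalg.inv(Q) @ B.T @ E.T @ lam_bar    # control at the start
    return cost, u

# Double integrator: state (position, velocity), passage to the origin.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.array([[2.0]])
cost, u = free_passage(A, B, Q, np.array([1.0, -0.5]), np.zeros(2), 2.0)
```

For this example the result can be compared with the closed form $u = -(6x + 4vs)/s^2$ for the double integrator, which gives $u = -0.5$ at $x = 1$, $v = -0.5$, $s = 2$.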
Expression (52) still gives the optimal control rule, with $\bar\xi$ determined by the minimisation in (53). This value of $\bar\xi$ will be constant along an optimal path. We have in effect used the simple and immediate consequence (47) of the maximum principle to solve the dynamic programming equation. Relation (52) is indeed the closed-loop rule which one would wish, but one would never have
imagined that it would imply the simple course (47) of actual control values along the optimal orbit. Solution of (43) and (47) yields the optimal orbit as
$$x(\tau) = e^{A(\tau - t)}x + V(\tau - t)\,e^{A^{\mathsf T}(\bar t - \tau)}\bar\lambda \qquad (t \le \tau \le \bar t)$$
where $x = x(t)$ and $\bar\lambda$ is determined by (48). For validity of evaluation (53) it is necessary that this orbit should not meet the stopping set $\mathcal{S}$ before time $\bar t$. Should it do so, then the orbit will have to break up into more than one free section, these sections being separated by grazing encounters with $\mathcal{S}$ at which special transition conditions will hold. We shall consider some such cases by example in the following sections.

9 CONTROL OF THE INERTIALESS PARTICLE

The examples we now consider are grossly simpler than any actual practical problem, but bring out points which are important for such problems. We shall be able to generalise these to the stochastic case (see Chapter 24), where they are certainly nontrivial.

Let $x$ be a scalar, corresponding to the height of an aircraft above level ground. We shall suppose that the craft is moving with a constant horizontal velocity, which we can normalise to unity, so that time can be equated to horizontal distance travelled. We suppose that the plant equation is simply
$$\dot x = u, \tag{54}$$
i.e. that velocity equals control force applied. This would represent the dynamics of a mass moving in treacle: there are no inertial effects, and it is velocity rather than acceleration which is proportional to applied force. We shall then refer to the object being controlled as an 'inertialess particle'; inertialess for the reasons stated, and a particle because its dynamic state is supposed specified fully by its position. It is then the lamest possible example of an aircraft; it not merely shows no inertia, but also no directional effects, no angular inertia and no aerodynamic effects such as lift. We shall use the term 'aircraft' for vividness, however, and as a reminder of the physical object towards whose description we aspire by elaboration of the model.

We have $A = 0$, $B = 1$. We thus see from (45), (46) that the optimal control value $u$ is constant along a free section of the orbit, whence it follows from the plant equation (54) that such sections of orbit must be straight lines. We find that $V(s) = Q^{-1}s$, so that $F(\xi, \bar\xi) = Q(x - \bar x)^2/2s$.

Suppose that the stopping set is the solid, level earth, so that the region of free movement is $x > 0$ and the effective stopping set is the surface of the ground, $x = 0$. The terminal cost can then be specified as a function $\mathbb{K}(\bar t)$ of time (i.e. distance along the ground). The expressions (53) and (52) for the value function and the closed-loop optimal control rule then become
$$F(x, t) = \inf_{s > 0}\Bigl[\frac{Qx^2}{2s} + \mathbb{K}(t + s)\Bigr], \tag{55}$$
$$u = -\frac{x}{s}. \tag{56}$$
Here the time-to-go $s$ must be determined from the minimisation in (55), which determines the optimal landing-point $\bar t = t + s$. The rule (56) is indeed consistent with a constant rate of descent along the straight-line path joining $(x, t)$ and $(0, \bar t)$.

However, suppose there is a sharp mountain between the starting point and the desired terminal point $\bar t$ determined above, sufficiently high that the path determined above will not clear it. That is, if the peak occurs at coordinate $t_1$ and has height $h$ then we require that $x(t_1) > h$. If the optimal straight-line path determined above does not satisfy this then the path must divide into two straight-line segments as illustrated in Figure 6. The total cost of this compound path is
$$F_1(x, t) = \frac{Q(h - x)^2}{2(t_1 - t)} + F(h, t_1), \tag{57}$$
where $F$ is the 'free path' value function determined in (55). It is more realistic to regard a crash on the mountain as carrying a high cost, $K_1$, say, rather than prohibited. In the stochastic case this is the view that one must necessarily take, because then the crash outcome always has positive probability. If one resigns and chooses to crash then there will be no control cost at all and the total cost incurred will be just $K_1$. One will then choose the crash option if one is in a position $(x, t)$ for which expression (57) exceeds $K_1$, i.e. for which
x 0, which we may expect. This approaches $3x/v$ as $w$ tends to zero.
11 AVOIDANCE OF THE STOPPING SET: A GENERAL RESULT

The conclusions of Theorem 7.10.1 can be generalised, with implications for the stochastic case. We consider the general model of Section 8 and envisage a situation in which the initial conditions are such that the uncontrolled path would meet the stopping set $\mathcal{S}$; one's only wish is to avoid this encounter in the most economical fashion. Presumably the optimal avoiding path will graze $\mathcal{S}$ and then continue in a zero-cost fashion (i.e. subsequently avoid $\mathcal{S}$ without further control). We shall speak of this grazing point as the 'termination' point, since it indeed marks the end of the controlled phase of the orbit.

We then consider the linear system (43) with control cost (44). Suppose also that the stopping set $\mathcal{S}$ is the half-space
$$\mathcal{S} = \{x : a^{\mathsf T}x \le b\}. \tag{72}$$
This is not as special as it may appear; if the stopping set is one that is to be avoided rather than sought then its boundary will generally be $(n - 1)$-dimensional, and can be regarded as locally planar under regularity conditions.

Let us denote the linear function $a^{\mathsf T}x$ of state by $z$. Let $F(\xi, \bar\xi)$ be the minimal cost of transition by a free orbit from an initial point $\xi = (x, 0)$ to a terminal point $\bar\xi = (\bar x, \bar t)$, already evaluated in (51). Then we shall find that, under certain conditions, the grazing point $\bar\xi$ of the optimal avoiding orbit is determined by minimising $F(\xi, \bar\xi)$ with respect to the free components of $\bar x$ at termination and maximising it (at least locally) with respect to $\bar t$. That is, the grazing point is, as far as timing goes, the most expensive point on the surface of $\mathcal{S}$ on which to terminate from a free orbit.

The principal condition required is that $a^{\mathsf T}B = 0$, implying that the component $z$ is not directly affected by the control. This is equivalent to $a^{\mathsf T}J = 0$. Certainly, if one is to terminate at a given time $\bar t$ then the value of $z$ is prescribed as $b$, but one should then optimise over all other components of $\bar x$. Let the value of $F(\xi, \bar\xi)$ thus minimised be denoted $G(\xi, \bar t)$. The assertion is then that the optimal value of $\bar t$ is that which maximises $G(\xi, \bar t)$. We need a preparatory lemma.
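Lemma 7.11.1 Add to the prescriptions (43), (44) and (72) of plant equation, cost function and stopping set the conditions that the process be controllable and that $a^{\mathsf T}B = 0$. Then
$$\frac{\partial G(\xi, \bar t)}{\partial \bar t} = \frac{(a^{\mathsf T}e^{A\bar t}x - b)\,\dot z}{a^{\mathsf T}V(\bar t)a}, \tag{73}$$
where $\dot z$ is the rate of change of $z$ at termination.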
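Proof Optimisation of $F(\xi, \bar\xi)$ with respect to the free components of $\bar x$ will imply that the $\bar\lambda$ of (47) is proportional to $a$, and so can be written $\theta a$ for some scalar $\theta$. The values of $\theta$ and $\bar t$ are related by the termination condition $z = b$, which, in virtue of (48), we can write as
$$a^{\mathsf T}e^{A\bar t}x + \theta a^{\mathsf T}V(\bar t)a = b. \tag{74}$$
We further see from (49) that
$$G(\xi, \bar t) = \tfrac12 \theta^2 a^{\mathsf T}V(\bar t)a. \tag{75}$$
Let us write $V(\bar t)$ and its derivative with respect to $\bar t$ simply as $V$ and $\dot V$. Controllability then implies that $a^{\mathsf T}Va > 0$. By replacing $\tau$ by $s - \tau$ in the integrand of (50) we see that we can write $\dot V$ as
$$\dot V = J + AV + VA^{\mathsf T}. \tag{76}$$
Differentiating (74) with respect to $\bar t$ we find that
$$(a^{\mathsf T}Va)\,\frac{d\theta}{d\bar t} + a^{\mathsf T}Ae^{A\bar t}x + \theta a^{\mathsf T}\dot V a = 0, \tag{77}$$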
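so that
$$\frac{\partial G}{\partial \bar t} = \tfrac12 \theta^2 a^{\mathsf T}\dot V a + \theta\, a^{\mathsf T}Va\,\frac{d\theta}{d\bar t} = -\theta a^{\mathsf T}Ae^{A\bar t}x - \tfrac12 \theta^2 a^{\mathsf T}\dot V a.$$
Finally, we have
$$\dot z = a^{\mathsf T}(A\bar x + J\bar\lambda) = a^{\mathsf T}A\bar x = a^{\mathsf T}Ae^{A\bar t}x + \theta a^{\mathsf T}AVa = a^{\mathsf T}Ae^{A\bar t}x + \tfrac12 \theta a^{\mathsf T}\dot V a.$$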
The second equality follows because $a^{\mathsf T}J = 0$ and the fourth by appeal to (76). We thus deduce from the last two relations that $\partial G/\partial \bar t = -\theta\dot z$. Inserting the evaluation of $\theta$ implied by (74) we deduce (73). □

Theorem 7.11.2 The assumptions of Lemma 7.11.1 imply that the grazing point $\bar\xi$ of the optimal $\mathcal{S}$-avoiding orbit is determined by first minimising $F(\xi, \bar\xi)$ with respect to $\bar x$ subject to $a^{\mathsf T}\bar x = b$ and then maximising it, at least locally, with respect to $\bar t$.

Proof As indicated above, optimality will require the restricted $\bar x$-optimisation; we have then to show that the optimal $\bar t$ maximises $G(\xi, \bar t)$. At any value $\bar t$ for which the controlled orbit crosses $z = b$ the uncontrolled orbit will lie below $z = b$, so that $a^{\mathsf T}e^{A\bar t}x - b < 0$. If, on the controlled orbit, $z$ decreases through $b$ at time $\bar t$, then one will increase $\bar t$ in an attempt to find an orbit which does not enter $\mathcal{S}$. We see from (73) that $G(\xi, \bar t)$ will then increase. Correspondingly, if $z$ crosses $b$ from below then one will decrease $\bar t$ and $G(\xi, \bar t)$ will again increase. If the two original controlled orbits are such that the $\bar t$ values ultimately coincide under this exercise, then $\dot z = 0$ at the common value, so that the orbit grazes $\mathcal{S}$ locally, and $G(\xi, \bar t)$ is locally maximal with respect to $\bar t$. The true grazing point (i.e. that for which the orbit meets $\mathcal{S}$ at no other point) will be found by repeated elimination of crossings in this fashion. □
One conjectures that $G(\xi, \bar t)$ is in fact globally maximal with respect to $\bar t$ at the grazing point, but this has yet to be demonstrated. We leave the reader to confirm that this criterion yields the known grazing point $\bar t = 3x/v$ for the crash avoidance problem of Theorem 7.10.1, for which the condition $a^{\mathsf T}B = 0$ was indeed satisfied. We should now determine the optimal $\mathcal{S}$-avoiding control explicitly.

Theorem 7.11.3 Define
$$\Delta = \Delta(x, s) = b - a^{\mathsf T}e^{As}x, \qquad \sigma = \sigma(s) = \sqrt{a^{\mathsf T}V(s)a}.$$
Then, under the conditions of Lemma 7.11.1, the optimal $\mathcal{S}$-avoiding control at $x$ is
$$u = \frac{\Delta}{\sigma^2}\,Q^{-1}B^{\mathsf T}e^{A^{\mathsf T}s}a, \tag{78}$$
where $s$ is given the value which maximises $(\Delta/\sigma)^2$. With this understanding $\Delta/\sigma^2$ is constant along the optimal orbit (before the grazing point) and $s$ is the time remaining until grazing.
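The criterion of the theorem can be tried on the crash-avoidance example. The sketch below rests on assumptions stated here rather than taken from the text: the plant is the double integrator $\dot x = v$, $\dot v = u$ with cost $\tfrac12 Qu^2$, the constraint vector is $a = (1, 0)$ with $b = 0$, and the craft is initially descending ($x > 0$, $v < 0$), so that $a^{\mathsf T}V(s)a = s^3/(3Q)$. The function names are hypothetical, and the search is restricted to times beyond the uncontrolled crossing time $-x/v$, where the relevant local maximum lies.

```python
import numpy as np

def grazing_criterion(x, v, s, Q=1.0, b=0.0):
    """(Delta/sigma)^2 with Delta = b - (x + v*s), sigma^2 = s^3/(3Q)."""
    delta = b - (x + v * s)
    return delta**2 * 3.0 * Q / s**3

def local_grazing_time(x, v, Q=1.0, n=90001):
    """Grid search for the local maximiser beyond the crossing time -x/v."""
    s_cross = -x / v
    grid = np.linspace(s_cross, 10.0 * s_cross, n)
    return float(grid[np.argmax(grazing_criterion(x, v, grid, Q))])

def avoiding_control(x, v, s, Q=1.0, b=0.0):
    """Control (Delta/sigma^2) * s / Q for this model; Q cancels."""
    delta = b - (x + v * s)
    sigma2 = s**3 / (3.0 * Q)
    return (delta / sigma2) * s / Q

s_star = local_grazing_time(2.0, -1.0)        # height 2, descent rate 1
u_star = avoiding_control(2.0, -1.0, s_star)  # upward force applied now
```

For height 2 and descent rate 1 the search returns a grazing time close to 6, that is, three times the ratio of height to descent rate. The criterion also grows without bound as $s$ tends to zero, which is why the maximum found is a local one.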
Proof Let us set $\xi = (x, 0)$, $\bar\xi = (\bar x, s)$ so that $x$ is the value of state when a time $s$ remains before termination. The cost of passage along the optimal free orbit from $\xi$ to $\bar\xi$ is
$$F(\xi, \bar\xi) = \tfrac12 \delta^{\mathsf T}V^{-1}\delta, \tag{79}$$
where $V = V(s)$ is given by (50) and $\delta = \bar x - e^{As}x$. The optimal control at time $t = 0$ for prescribed $\bar\xi$ is
$$u = Q^{-1}B^{\mathsf T}e^{A^{\mathsf T}s}V^{-1}\delta. \tag{80}$$
The quantity $V^{-1}\delta$ is the terminal value of $\lambda$ and is consequently invariant along the optimal orbit. That is, if one considers it as a function of initial value $\xi$ then its value is the same for any initial point chosen on that orbit. Specialising now to the $\mathcal{S}$-avoiding problem, we know from the previous theorem that we determine the optimal grazing point by minimising $F(\xi, \bar\xi)$ with respect to $\bar x$ subject to $z = b$ and then maximising it with respect to $s$. The first minimisation yields
$$V^{-1}\delta = \frac{\Delta}{\sigma^2}\,a, \qquad F = \frac{\Delta^2}{2\sigma^2}, \tag{81}$$
so the value of $s$ at the optimal grazing point is that which maximises $(\Delta/\sigma)^2$. Expression (78) for the optimal control now follows from (80) and (81). The identification of $(\Delta/\sigma^2)a$ with $V^{-1}\delta$ (with $s$ determined in terms of current $x$) demonstrates its invariance along the optimal orbit (before grazing). □

12 CONSTRAINED PATHS: TRANSVERSALITY CONDITIONS

We should now consider the transition rules which hold when a free orbit enters or emerges from a part of state space in which the orbit is constrained. We shall see that conclusions and argument are very similar to those which we deduced for termination in Section 3.

Consider first of all the time-invariant case. Suppose that an optimal path which begins freely meets a set $\mathcal{F}$ in state-space which is forbidden. We shall assume that $\mathcal{F}$ is open, so that the path can traverse the boundary $\partial\mathcal{F}$ of $\mathcal{F}$ for a while, during which time the path is of course constrained. One can ask: what conditions hold at the optimal points of entry to and exit from $\partial\mathcal{F}$?

Let $x$ be an entry point and $p$ and $p'$ be the values of the conjugate variable immediately before and after entry. Just as for the treatment of the terminal problem in Section 3, we can partially integrate the Lagrangian expression for minimal total cost up to the transition point and so deduce an expression whose primary dependence upon the transition value $x$ occurs in a term $(p - p')x$.
Suppose we can vary $x$ to $x + \epsilon\alpha + o(\epsilon)$, a neighbouring point in $\partial\mathcal{F}$. Then, by the same argument as that of Theorem 7.3.1 we deduce that $(p - p')\alpha$ is zero if $x$ is an optimal transition value. That is, the linear function $p\alpha$ of the conjugate variable is continuous at an optimal transition point for all directions $\alpha$ tangential to the surface of $\mathcal{F}$ at $x$. Otherwise expressed, the vector $(p - p')^{\mathsf T}$ is normal to the surface of $\mathcal{F}$ at $x$.

We deduce the same conclusion for optimal exit points by appeal to a time-reversed version of the problem. Transferring these results to the time-varying problem by the usual device of taking an augmented state variable $\xi = (x, t)$ we thus deduce

*Theorem 7.12.1 Let $\mathcal{F}$ be an open set in $(x, t)$ space which is forbidden. Let $(x, t)$ be an optimal transition point (for either entry to or exit from $\partial\mathcal{F}$) and $(p, p_0)$ and $(p', p_0')$ the values of the augmented conjugate variable immediately before and after transition. Then
$$(p - p')\alpha + (p_0 - p_0')\tau = 0 \tag{82}$$
for all directions $(\alpha, \tau)$ tangential to $\partial\mathcal{F}$ at $(x, t)$. In particular, if $t$ can be varied independently of $x$ then the Hamiltonian $H$ is continuous at the transition.

Proof The first assertion follows from the argument before the theorem, as indicated. If we can vary $t$ in both directions for fixed $x$ at transition then (82) implies that $p_0$ is continuous at transition. But we have $p_0 + H = 0$ on both sides of the transition, so the implication is that $H$ is continuous at the transition. □

One can also develop conclusions concerning the form of the optimal path during the phase when it lies in $\partial\mathcal{F}$. However, we shall content ourselves with what can be gained from discussion of a simple example in the next section.

An example we have already considered by a direct discussion of costs is the steering of the inertial particle in Section 10. For this the state variable was $(x, v)$ and $\mathcal{F}$ was $x < 0$. The boundary $\partial\mathcal{F}$ is then $x = 0$, but on this we must also require that $v = 0$, or the condition $x \ge 0$ would be violated in either the immediate past or the immediate future.

Suppose that we start from $(x, v)$ at $t = 0$ and reach $x = 0$ first at time $\bar t$ (when necessarily $v = 0$ as well). Suppose $p$, $q$ are the variables conjugate to $x$, $v$, so that the Hamiltonian is $H = pv + qu - Qu^2/2$, and so equal to $pv + q^2/2Q$ when $u$ is chosen optimally. Continuity of $H$ at a transition point, when $v$ is necessarily zero, thus amounts to continuity of $q^2$.

Let us confirm that this continuity is consistent with the previously derived condition (66). If $p$ and $q$ denote the values of the conjugate variables at $\bar t$, just before transition, then integration of equation (46) leads to
$$p(\tau) = p, \quad u(\tau) = Q^{-1}q(\tau) = Q^{-1}(q + ps), \quad v(\tau) = -Q^{-1}(qs + ps^2/2), \quad x(\tau) = Q^{-1}(qs^2/2 + ps^3/6),$$
where $s = \bar t - \tau$. The values of $p$ and $q$ are determined by the prescribed initial
values $x$ and $v$. We find $q$ proportional to $(3x + v\bar t)/\bar t^2$, and one can find a similar expression for $q$ immediately after transition in terms of the values $0$ and $w$ of terminal height and velocity. Assertion of the continuity of $q^2$ at transition thus leads exactly to equation (66).

13 REGULATION OF A RESERVOIR

This is a problem which the author, among others, has discussed as 'regulation of a dam'. However, purists are correct in their demur that the term 'dam' can refer only to the retaining wall, and that the object one wishes to control, the mass of water, is more properly referred to as the 'reservoir'.

Let $x$ denote the amount of water in the reservoir, and suppose that it obeys the plant equation $\dot x = v - u$, where $v$ is the inflow rate (a function of time known in advance) and $u$ is the draw-off rate (a quantity at the disposition of the controller). One wishes to maximise a criterion with instantaneous reward rate $g(u)$, where $g$ is concave and monotonic increasing. This concavity will (by Jensen's inequality) discourage variability in $u$. One also has the natural constraint $u \ge 0$. The state variable $x$ enters the analysis by the constraint $0 \le x \le C$, where $C$ is the capacity of the reservoir. We shall describe the situations in which $x = C$, $x = 0$ and $0 < x < C$ as full, empty and intermediate phases respectively.

One would of course wish to extend the analysis to the case for which $v$ (which depends on future rainfall, for example) is imperfectly predictable, and so supposed stochastic. This can be achieved for LQ versions of the model (see Section 2.9) but is difficult if one retains the hard constraints on $x$ and $u$ and a non-quadratic reward rate $g(u)$.

We can start from minimisation of the Lagrangian form $\int [-g(u) + p(\dot x - v + u)]\, d\tau$, so that the Hamiltonian is $H(x, u, p) = g(u) + p(v - u)$. A price interpretation would indeed characterise $p$ as an effective current price for water. We then deduce the following conclusions.
Theorem 7.13.1 An optimal draw-off programme shows the following features.
(i) The value of $u$ is the non-negative value maximising $g(u) - pu$, which then increases with decreasing $p$.
(ii) The value of $u$ is constant in any one intermediate phase.
(iii) The value of $p$ is decreasing (increasing) and so the value of $v = u$ is increasing (decreasing) in an empty (full) phase.
(iv) The value of $u$ is continuous at transition points.

Proof Assertion (i) follows immediately from the form of the Hamiltonian and the nature of $g$. Assertions (ii) and (iii) follow from extremisation of the Lagrangian form with respect to $x$. In an intermediate phase $x$ can be perturbed either way, and one deduces that $\dot p = 0$ (the free orbit condition for this particular case). Hence $p$, and so $u$, is constant in such a phase. In an empty phase perturbations of $x$ can only be non-negative, and so one can deduce only that $\dot p \le 0$. Thus $p$ is decreasing, and so $u$ is increasing. Since $u$ and $v$ are necessarily equal, if $x$ is being held constant, then $v$ must also be increasing. The analogous assertions for a full phase follow in the same way; perturbations can then only be non-positive.

The final assertion follows from continuity of $H$ at transition points. With $u$ set equal to its optimal value $H$ becomes a monotonic, continuous function of $p$. Continuity of $H$ at transition points then implies continuity of $p$, and so of $u$. □

The example is interesting for its appeal to transversality conditions, but also because there is some discussion of optimal behaviour during the empty and full phases (which constitute the boundary $\partial\mathcal{F}$ of the forbidden region $\mathcal{F}$: the union of $x < 0$ and $x > C$). Trivially, one must have $u = v$ in these phases. However, one should not regard this as the equation determining $u$. In the case $x = 0$ (say) one is always free to take a smaller value of $u$ (and so to let water accumulate and so to move into an intermediate phase). The optimal draw-off rate continues to be determined as the value extremising $g(u) - pu$; it is the development of $p$ which is constrained by the condition $x = 0$. Although the rule $u = v$ is trivial if one is determined not to leave the empty phase, the conclusion that $v$ must be increasing during such a phase (for optimality) is nontrivial.
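The dependence of the draw-off rate on the water price $p$ can be illustrated with a concrete reward function. The sketch below is purely illustrative: the reward $g(u) = \log(1 + u)$, the grid bounds and the function name are assumptions made here, not part of the model in the text. For this choice the maximiser of $g(u) - pu$ over $u \ge 0$ is $u = 1/p - 1$ when $p < 1$ and $u = 0$ otherwise.

```python
import numpy as np

def optimal_drawoff(g, p, u_max=10.0, n=10001):
    """Grid search for the non-negative u maximising g(u) - p*u."""
    u = np.linspace(0.0, u_max, n)
    return float(u[np.argmax(g(u) - p * u)])

g = np.log1p   # assumed concave, increasing reward: g(u) = log(1 + u)

# Draw-off rate for a falling sequence of water prices.
prices = [2.0, 0.5, 0.25]
rates = [optimal_drawoff(g, p) for p in prices]
```

The computed rates rise as the price falls, and the non-negativity constraint binds once the price exceeds the marginal reward at zero draw-off.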
Notes Pontryagin is indeed the originator of the principle which bears his name, and whose theory and application have been so developed by himself and others. It is notable that he held the dynamic programming principle in great scorn; M.H.A. Davis describes him memorably as holding it up 'like a dead rat by its tail' in the preface to Pontryagin et al. (1962). This was because of the occasional nonexistence of the derivative $F_x$ in the simplest of cases. However, as we have seen, it is a rat which is alive, ingenious, direct, and able to squeeze through where authorities say it cannot. The material of Section 11 is believed to be new.
PART 2
Stochastic Models
CHAPTER 8
Stochastic Dynamic Programming

A difficulty which must be faced is that of incompleteness of information. That is, one may simply not have all the information needed to make an optimal decision, and which we have hitherto supposed available. For example, it may be impossible or impracticable to observe all aspects of the process variable: the workings of even a moderate-sized plant, or of the patient under anaesthesia which we instanced in Section 5.2, are far too complex. This might matter less if the plant were observable in the technical sense of Chapter 5, so that the observations available nevertheless allowed one to build up a complete picture of the state of affairs in the course of time. However, there are other uncertainties which cannot be resolved in this way. Most systems will have exogenous inputs of some kind: disturbances, reference signals or time-varying parameters such as price or weather. If the future of these is imperfectly predictable, as is usually the case, then the basis for the methods we have used hitherto is lost.

There are two approaches which lead to a natural mathematical resolution of this situation. One is to adopt a stochastic formulation. That is, one arrives somehow at a probability model for plant and observations, so that all variables are jointly defined as random variables. The variables which are observable can then be used to make inferences on those which are not. More specifically, one chooses a policy, a control rule in terms of current observables, which minimises the expectation of some criterion based on cost. The other, quite as natural mathematically, is the minimax approach. In this one assumes that all unobservables take the worst values they can take (judged on the optimisation criterion) consistently with the values of observables. The operation of conditional expectation is thus replaced by a conditional maximisation (of cost).

The stochastic approach seems to be the one which takes account of average performance in the long run; it has the completer theory and is the one usually adopted. The minimax approach corresponds to a worst-case analysis, and is frankly pessimistic. We shall consider only the stochastic approach, but shall find minimax ideas playing a role when we later develop the idea of risk-sensitivity.

Lastly, there is a point which should be made to maintain perspective, even if it cannot be followed up in this volume. The larger the system (i.e. the greater the number of individual variables) then the more unrealistic becomes the picture that there is a central optimiser who uses all currently available information to make all necessary decisions. The physical flow of information and commands
would be excessive, as would the central processing load. This is why an economy or a biological organism is partially decentralised: some control decisions are made locally, on the basis of local information plus central commands, leaving only the major decisions to be made centrally, on aggregated information. Indeed, the more complex the system, the greater the premium on trading a loss in optimality for a gain in simplicity and, perhaps, the greater the possibility of doing so advantageously, and of recognising the essential which is to be optimised.

We use the familiar notations $E(x)$ and $E(x \mid y)$ for expectation and conditional expectation, and shall rarely make the notational distinction (which is only occasionally called for) between random variables and particular values which they may adopt. Correspondingly, $P(x)$ and $P(x \mid y)$ denote the probability (unconditional and conditional) of a particular value x, at least if x is discrete-valued. However, more generally and more loosely, we also use $P(x)$ to denote simply the probability law of x. So, the Markov property of a process $\{x_t\}$ would be expressed, whatever the nature of the state space, by $P(x_{t+1} \mid X_t) = P(x_{t+1} \mid x_t)$, where $X_t$ is the history $\{x_\tau;\ \tau \le t\}$.

1 ONE-STAGE OPTIMISATION

A special feature of control optimisation is that it is a multistage problem: one makes a sequence of decisions in time, the later decisions being in general based on more information than the earlier ones. For this very reason it is helpful to begin by considering the single-stage case, in which one only has a single decision to make.

For example, suppose that the pollution level of a water supply is being monitored. One observes pollution level y in the sample taken and has then the choice of two actions u: to raise the alarm or to do nothing. In practice, of course, one might well convert this into a dynamic problem by allowing sampling to continue over a period of time until there is a more assured basis for action one way or the other. However, suppose that action must be taken on the basis of this single observation.

A cost C is incurred; the costs of raising the alarm (perhaps wrongly) or of not doing so (perhaps wrongly). The magnitude of the cost will then depend upon the decision u and upon the unknown 'true state' of affairs. Let us denote the cost incurred if action u is taken by $C(u)$, a random variable whose distribution depends on u. One assumes a stochastic (probabilistic) model in which the value of the cost $C(u)$ for varying u and of the observable y are jointly defined as random variables. A policy prescribes u as a function $u(y)$ of the observable y; the policy is to be chosen to minimise $E[C(u(y))]$.

Theorem 8.1.1 The optimal decision function $u(y)$ is determined by choosing u as the value minimising $E[C(u) \mid y]$.

Proof If a decision rule $u(y)$ is followed then the expected cost is
\[ E[C(u(y))] = E\{E[C(u(y)) \mid y]\} \ge E\{\inf_u E[C(u) \mid y]\} \]
and the lower bound is attained by the rule suggested in the theorem. □
The theorem may seem trivial, but the reader should understand its point: the reduction of a constrained minimisation problem to a free one. The initial problem is that of minimising $E[C(u(y))]$ with respect to the function $u(y)$, so that the minimising u is constrained to be a function of y at most. This is reduced to the problem of minimising $E[C(u) \mid y]$ freely with respect to the parameter u.

One might regard u as a variable whose prescription affects the probability distribution of the cost C, just as does that of y, and so write $E[C(u) \mid y]$ rather as $E[C \mid y, u]$. However, to do this is to blur a distinction between the variables y and u. The variable y is a random variable whose specification conditions the distribution of C. The variable u is not initially random, but a variable whose value can be chosen by the optimiser and which parametrises the distribution of C. We discuss the point in Appendix 2, where a distinction is made by writing the expectation as $E[C \mid y; u]$, the semicolon separating parametrising variables from conditioning variables. However, while the distinction is important in some contexts, it is not in this, for reasons explained in Appendix 2.

The reader may be uneasy: the formulation of Theorem 8.1.1 makes no mention of an important physical variable: the 'true state' of affairs. This would be the actual level of pollution in the pollution example. It would be this variable of which the observation y is an imperfect indicator, and which in combination with the decision u determines the cost.

Suppose that the problem admits a state variable x which really does express the 'true state' of affairs, in that the cost is in fact a deterministic function $C(x, u)$ of x and u. So, if one knew x, one would simply choose u to minimise $C(x, u)$. However, one knows only y, which is to be regarded as an imperfect observation on x.
The joint distribution of x and y is independent of u, because the values of these random variables, whether observable or not, have been realised before the decision u is taken.
Theorem 8.1.2 Suppose that the problem admits a state variable x, that $C(x, u)$ is the cost function and $f(x, y)$ the joint density of x and y with respect to a product measure $\mu_1(dx)\,\mu_2(dy)$. Then the optimal value of u is that minimising $\int C(x, u) f(x, y)\, \mu_1(dx)$.

Proof Let us assume for simplicity that x and y are discrete random variables with a joint distribution $P(x, y)$; the formal generalisation is then clear. In this case
\[ E[C(u) \mid y] = \sum_x C(x, u) P(x \mid y) \propto \sum_x C(x, u) P(x, y) = \sum_x C(x, u) P(x) P(y \mid x), \tag{1} \]
where the proportionality sign indicates a factor $P(y)^{-1}$, independent of u. The third of these expressions is the analogue of the integral expression asserted in the theorem. □

We give the fourth expression in (1) because $P(x)$ and $P(y \mid x)$ are often specified on fairly distinct grounds. The conditional distribution $P(y \mid x)$ of observation on state is supplied by one's statistical model of the observation process, whose mechanism may be fairly clear. The distribution $P(x)$ constitutes the 'prior distribution' of state and its specification may be debatable; see Exercise 1.

Exercises and comments

(1) The so-called two-hypothesis two-action case is the simplest, but is both illuminating and useful. Consider the pollution example of the text, and suppose that serious pollution of the river, if it occurs, can only be due to a catastrophic failure at a factory upstream. There are then only two 'pollution states', that this failure has not occurred or that it has, corresponding to x equal to 0 or 1, say. Denote the prior probabilities $P(x)$ of these by $\pi_x$ and the probability density of the observation y conditional on the value of x by $f_x(y)$. Suppose that there are just two actions: to raise the alarm or not. The cost of raising the alarm is $C_0$ or zero according as x is 0 or 1; the cost of not raising the alarm is zero or $C_1$ according as x is 0 or 1. It follows then from the last form of the criterion that one should raise the alarm if $\pi_0 f_0(y) C_0 < \pi_1 f_1(y) C_1$. That is, if the likelihood ratio $f_1(y)/f_0(y)$ exceeds the threshold value $\pi_0 C_0 / (\pi_1 C_1)$.
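The two-hypothesis rule of Exercise 1 can be checked numerically. The sketch below is only an illustration: the Gaussian form of the densities $f_0$, $f_1$, their means, the priors and the costs are all hypothetical values chosen here, not taken from the text. It compares direct minimisation of the unnormalised expected cost $\sum_x C(x,u)P(x)P(y \mid x)$ with the likelihood-ratio threshold rule.

```python
import math

# Hypothetical illustration values (assumptions, not from the text).
PRIOR = {0: 0.95, 1: 0.05}   # pi_x: prior probabilities of the two pollution states
MEAN = {0: 1.0, 1: 3.0}      # assumed mean of the observation y under each state
SIGMA = 1.0                  # assumed common standard deviation
C0, C1 = 1.0, 20.0           # cost of a false alarm, cost of a missed failure


def density(y, x):
    """Assumed Gaussian density f_x(y) of the observation given state x."""
    z = (y - MEAN[x]) / SIGMA
    return math.exp(-0.5 * z * z) / (SIGMA * math.sqrt(2.0 * math.pi))


def cost(x, u):
    """C(x, u): u = 1 raises the alarm, u = 0 does nothing."""
    if u == 1:
        return C0 if x == 0 else 0.0
    return C1 if x == 1 else 0.0


def decide_by_expected_cost(y):
    """Minimise sum_x C(x,u) P(x) f_x(y) over u (last form of criterion (1))."""
    def score(u):
        return sum(cost(x, u) * PRIOR[x] * density(y, x) for x in (0, 1))
    return 1 if score(1) < score(0) else 0


def decide_by_likelihood_ratio(y):
    """Raise the alarm iff f_1(y)/f_0(y) exceeds pi_0 C0 / (pi_1 C1)."""
    threshold = PRIOR[0] * C0 / (PRIOR[1] * C1)
    return 1 if density(y, 1) / density(y, 0) > threshold else 0


if __name__ == "__main__":
    for y in (1.0, 1.9, 2.0, 3.0):
        print(y, decide_by_expected_cost(y), decide_by_likelihood_ratio(y))
```

Under these assumed values the two functions should agree for every y, since they are algebraically the same comparison; only the threshold form makes explicit that the data enter through the likelihood ratio alone.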
(2) Risk-sensitivity and hedging. The new effects that a stochastic element can bring can be demonstrated on a one-stage model. Suppose that, if one divides a sum of money x so that an amount $x_j$ is invested in activity j, then one receives a total return of $r = \sum_j c_j x_j$. That is, $c_j$ is the rate of return on activity j. If the $c_j$ are known then one maximises r by investing the whole sum in an activity for which the return rate $c_j$ is greatest. (If there are several activities achieving this rate then the investment can be spread over them, but there is no advantage in such diversification.) If the $c_j$ are unknown, and regarded as random variables, then one maximises the expected return by investing the whole sum in an activity for which the expected rate of return $E(c_j)$ is maximal, where this is an expectation conditional on one's information at the time of the investment decision. However, suppose one chooses rather to extremise the expected value of $\exp(\theta r)$, maximising or minimising according as the risk-sensitivity parameter $\theta$ is positive or negative. This criterion takes account of variability of return as well as of expectation, variability being welcome or unwelcome according as $\theta$ is positive or negative (the risk-seeking and risk-averse cases; see Chapter 16).
For simplicity, let us assume (rather unrealistically) that the random variables $c_j$ are independent. Then one chooses the allocation to maximise $\sum_j \theta^{-1} F_j(\theta x_j)$, where $F_j(\alpha) = \log\{E[\exp(\alpha c_j)]\}$. The functions $F_j(\alpha)$ are convex (see Appendix 3). It follows then that the functions $\theta^{-1} F_j(\theta x_j)$ are convex or concave according as one is in the risk-seeking or risk-averse case. In the first case one will invest the whole sum in an activity j for which $\theta^{-1} F_j(\theta x)$ is maximal. In the second case one spreads the investment (hedges against uncertainty) by choosing the $x_j$ so that $F_j'(\theta x_j) \le \lambda$, with equality if $x_j$ is positive (j = 1, 2, ...). Here $\lambda$ is a Lagrange multiplier chosen so that $\sum_j x_j = x$. Hedging is a very real feature in investment practice, and we see that it is induced by the two elements of uncertainty and risk-averseness.

(3) Follow through the treatment of Exercise 2 in the case (again unrealistic!) of normally distributed $c_j$, when $F_j(\alpha) = \mu_j \alpha + \tfrac{1}{2} v_j \alpha^2$. Here $\mu_j$ and $v_j$ are respectively the mean and variance of $c_j$ (conditional on information at the time of decision).

2 MULTISTAGE OPTIMISATION; THE DYNAMIC PROGRAMMING EQUATION

If we extend the analysis of the last section to the multistage case then we are essentially treating a control optimisation problem in discrete time. Indeed, the discussion will link back to that of Section 2.1 in that we arrive at a stochastic version of the dynamic programming principle. There are two points to be made, however. Firstly, the stochastic formulation makes it particularly clear that the dynamic programming principle is valid without the assumption of state structure and, indeed, that state structure is a separate issue best brought in later. Secondly, the temporal structure of the problem implies properties which one often takes for granted: this structure has to be made explicit.

Suppose, as in Section 2.1, that the process is to be optimised over the time period t = 0, 1, 2, ..., h.
Let $W_0$ indicate all the information available at time 0; it is from this and the stochastic model that one must initially infer plant history up to time t = 0, insofar as this is necessary. Let $x_t$ denote the value of the process variable at time t, and $X_t$ the partial process history $\{x_1, x_2, \ldots, x_t\}$. Correspondingly let $y_t$ denote the observation which becomes available and $u_t$ the control action taken at time t, with corresponding partial histories $Y_t$ and $U_t$. Let $W_t$ denote the information available at time t; i.e. the information on which choice of $u_t$ is to be based. Then we assume that $W_t = \{W_0, Y_t, U_{t-1}\}$. That is, current information consists just of initial information plus current observation history plus previous control history. It is taken for granted, and so not explicitly indicated, that $W_t$ also implies prescription of the stochastic model and knowledge of clock time t.
A realisable policy $\pi$ is one which specifies $u_t$ as a function of $W_t$ for t = 1, 2, ..., h − 1. One assumes a cost function C. This may be specified as a function of $X_h$ and $U_{h-1}$, but is best regarded simply as a random variable whose distribution, jointly with that of the course of the process and observations, is parametrised by the chosen control sequence $U_{h-1}$. The aim is then to choose $\pi$ to minimise $E_\pi(C)$. Define the total value function
\[ G(W_t) = \inf_\pi E_\pi[C \mid W_t], \tag{2} \]
the minimal expected cost conditional on information at time t. Here $E_\pi$ is the expectation operator induced by policy $\pi$. We term G the total value function because it refers to total cost, whereas the usual value function F refers only to present and future cost (in the case when cost can indeed be partitioned over time). G is automatically t-dependent, in that $W_t$ takes values in different sets for different t. However, the simple specification of $W_t$ as argument is enough to indicate this dependence.

*Theorem 8.2.1 (The dynamic programming principle) The total value function $G(W_t)$ obeys the backward recursion (the dynamic programming or optimality equation)
\[ G(W_t) = \inf_{u_t} E[G(W_{t+1}) \mid W_t, u_t] \qquad (t = 1, 2, \ldots, h-1) \tag{3} \]
with closing condition
\[ G(W_h) = E[C \mid W_h] \tag{4} \]
and the minimising value of $u_t$ in (3) is the optimal value of control at time t.
We prove these assertions formally in Appendix 2. They may seem plausible in the light of the discussion of the previous section, but demonstration of even their formal truth requires a consideration of the structure implicit in a temporal optimisation problem. They are in fact rigorously true if all variables take values in finite sets and if the horizon is finite; the theorem is starred only because of possible technical complications in other cases.

That some discussion is needed even of formal validity is clear from the facts that the conditioning variables $W_t$ in (2) and $(W_t, u_t)$ in (3) are a mixture of random variables $Y_t$ which are genuinely conditioning and control histories $U_{t-1}$ or $U_t$ which should be seen as parametrising. Further, it is implied that the expectations in (3) and (4) are policy-independent, the justification in the case of (4) being that all decisions lie in the past. These points are covered in the discussion of Appendix 2.

Relation (3) certainly provides a taut and general expression of the dynamic programming principle, couched, as it must be, in terms of the maximal current observable $W_t$.
3 STATE STRUCTURE

The two principal properties which ensure state structure are exactly the stochastic analogues of those assumed in Chapter 2.

(i) Markov dynamics. It is required that the process variable x should have the property
\[ P(x_{t+1} \mid X_t, U_t) = P(x_{t+1} \mid x_t, u_t), \tag{5} \]
where $X_t$, $U_t$ are now complete histories. That is, if we consider the distribution of $x_{t+1}$ conditional on process history and parametrised by control history then it is in fact only the values of process and control variables at time t which have any effect. This is the stochastic analogue of the simply-recursive deterministic plant equation (2.2), and specification of the right-hand member of (5) as a function of its three arguments amounts to specification of a stochastic plant equation.

(ii) Decomposable cost function. It is required that the cost function should break into a sum of instantaneous and closing costs, of the form
\[ C = \sum_{\tau=0}^{h-1} c(x_\tau, u_\tau, \tau) + C_h(x_h) = \sum_{\tau=0}^{h-1} c_\tau + C_h, \tag{6} \]
say. This is exactly the assumption (2.3) already made in the deterministic case.

We recall the definition of sufficiency of a variable $\xi_t$ in Section 2.1 and the characterisation of x as a state variable if $(x_t, t)$ is sufficient. These definitions transfer to the stochastic case, and we shall see, by an argument parallel to that of the deterministic case, that assumptions (i) and (ii) do indeed imply that x is a state variable, if only it is observable. A model satisfying these assumptions is often termed a Markov decision process, the point being that they define a simply recursive structure. However, if one is to reap the maximal benefit of this structure then one must make an observational demand.

(iii) Perfect state observation. It is required that the current value of state should be observable. That is, $x_t$ should be known at the time t when $u_t$ is to be determined, so that $W_t = (X_t, U_{t-1})$.

As we have seen already in the deterministic case, assumption (i) can in principle be satisfied if there is a description x of the system which is detailed enough that it can be regarded as physically complete. Whether this detailed description is immediately observable is another matter, and one to which we return in Chapters 12 and 15.

We follow the pattern of Section 2.1. Define the future cost at time t
\[ C_t = \sum_{\tau=t}^{h-1} c_\tau + C_h \]
and the value function
\[ F(W_t) = \inf_\pi E_\pi[C_t \mid W_t] \tag{7} \]
so that $G(W_t) = \sum_{\tau=0}^{t-1} c_\tau + F(W_t)$. Then the following theorem spells out the sufficiency of $\xi_t = (x_t, t)$ under the assumptions above.

Theorem 8.3.1 Assume conditions (i)-(iii) above. Then
(i) $F(W_t)$ is a function of $x_t$ and t alone. If we write it $F(x_t, t)$ then it obeys the dynamic programming equation
\[ F(x_t, t) = \inf_{u_t}\{c(x_t, u_t, t) + E[F(x_{t+1}, t+1) \mid x_t, u_t]\} \qquad (t < h) \tag{8} \]
with terminal condition
\[ F(x_h, h) = C_h(x_h). \tag{9} \]
(ii) The minimising value of $u_t$ in (8) is the optimal value of control at time t, which is consequently also a function only of $x_t$ and t.

Proof The value of $F(W_h)$ is $C_h(x_h)$, so the asserted reduction of F is valid at time h. Assume it valid at time t + 1. The general dynamic programming equation (3) then reduces to
\[ F(W_t) = \inf_{u_t}\{c(x_t, u_t, t) + E[F(x_{t+1}, t+1) \mid X_t, U_t]\} \tag{10} \]
and the minimising value of $u_t$ is optimal. But, by assumption (i), the right-hand member of (10) reduces to the right-hand member of (8). All assertions then follow by induction. □

So, again one has the simplification that not all past information need be stored; it is sufficient for purposes of optimisation that one should know the current value of state. The optimal control rule derived by the minimisation in (8) is again in closed-loop form, since the policy before time t has not been specified. It is in the stochastic case that the necessity for closed-loop operation is especially clear, since continuing stochastic disturbance of the dynamics makes use of the most recent information imperative.

At least in the time-homogeneous case it is convenient to write (8) simply as
\[ F(\cdot, t) = \mathscr{L} F(\cdot, t+1) \tag{11} \]
where $\mathscr{L}$ is the operator with action
\[ \mathscr{L}\phi(x) = \inf_u \{c(x, u) + E[\phi(x_{t+1}) \mid x_t = x, u_t = u]\}. \tag{12} \]
This is of course just the stochastic version of the forward operator already introduced in Section 3.1. As then, $\mathscr{L}\phi(x_t)$ is the minimal cost incurred if one is allowed to choose $u_t$ optimally, in the knowledge that at time t + 1 one will incur a cost of $\phi(x_{t+1})$. In the discounted case $\mathscr{L}$ would have the action
\[ \mathscr{L}\phi(x) = \inf_u \{c(x, u) + \beta E[\phi(x_{t+1}) \mid x_t = x, u_t = u]\}. \tag{13} \]
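For a finite state and action space, the backward recursion (8) with terminal condition (9) can be carried out directly. The following is a minimal sketch under stated assumptions: the function name, the dictionary representation of the transition law $P(x_{t+1} \mid x_t, u_t)$ and the two-state 'repair' example are all hypothetical illustrations introduced here, not part of the text.

```python
def backward_induction(states, actions, cost, closing_cost, transition, horizon):
    """Return (F, policy) with F[t][x] = F(x, t), computed by recursion (8)-(9).

    cost(x, u, t)    -> instantaneous cost c(x, u, t)
    closing_cost(x)  -> closing cost C_h(x)
    transition(x, u) -> dict {next_state: probability}, i.e. P(x_{t+1} | x_t, u_t)
    """
    F = [dict() for _ in range(horizon + 1)]
    policy = [dict() for _ in range(horizon)]
    for x in states:
        F[horizon][x] = closing_cost(x)            # terminal condition (9)
    for t in range(horizon - 1, -1, -1):           # work backwards in time
        for x in states:
            best_u, best_val = None, float("inf")
            for u in actions:
                expected = sum(p * F[t + 1][y] for y, p in transition(x, u).items())
                val = cost(x, u, t) + expected     # bracket in equation (8)
                if val < best_val:
                    best_u, best_val = u, val
            F[t][x] = best_val
            policy[t][x] = best_u
    return F, policy


# Hypothetical example: state 1 is 'bad' (unit running cost), state 0 is 'good'.
# Action 1 ('repair') costs 0.6 and returns the system to state 0 for certain.
def example_transition(x, u):
    if u == 1:
        return {0: 1.0}
    return {0: 0.5, 1: 0.5} if x == 0 else {1: 1.0}


F, policy = backward_induction(
    states=(0, 1),
    actions=(0, 1),
    cost=lambda x, u, t: x + 0.6 * u,
    closing_cost=lambda x: float(x),
    transition=example_transition,
    horizon=2,
)
```

The returned policy is automatically in closed-loop form: `policy[t][x]` depends only on the current state and time, as Theorem 8.3.1 asserts.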
4 THE DYNAMIC PROGRAMMING EQUATION IN CONTINUOUS TIME

It is convenient to note here the continuous-time analogue of the material of the last section and then to develop some continuous-time formalism in Chapter 9, before progressing to applications in Chapter 10. The analogues of assumptions (i)-(iii) of that section will be plain; we deduce the continuous-time analogues of the conclusions by a formal passage to the limit.

It follows by the discrete-time argument that the value function, the infimum of expected remaining cost from time t conditional on previous process and control history, is a function $F(x, t)$ of $x(t) = x$ and t alone. The analogue of the dynamic programming equation (8) for passage from t to $t + \delta t$ is
\[ F(x, t) = \inf_u \{c(x, u, t)\,\delta t + E[F(x(t + \delta t), t + \delta t) \mid x(t) = x, u(t) = u]\} + o(\delta t). \tag{14} \]
Define now the infinitesimal generator $\Lambda(u, t)$ of the controlled process by
\[ \Lambda(u, t)\phi(x) = \lim_{\delta t \downarrow 0} (\delta t)^{-1} \{E[\phi(x(t + \delta t)) \mid x(t) = x, u(t) = u] - \phi(x)\}. \tag{15} \]
That is, there is an assumption that, at least for sufficiently regular $\phi(x)$,
\[ E[\phi(x(t + \delta t)) \mid x(t) = x, u(t) = u] = \phi(x) + [\Lambda(u, t)\phi(x)]\,\delta t + o(\delta t). \]
The form of the term of order $\delta t$ defines the operator $\Lambda$; to write the coefficient of $\delta t$ as $\Lambda(u, t)\phi(x)$ emphasises that the distribution of $x(t + \delta t)$ is conditioned by the value x of $x(t)$ and parametrised by t and the value u of $u(t)$. We shall consider the form of $\Lambda$ in some particular cases in the next chapter.

*Theorem 8.4.1 Assume the continuous-time analogues of conditions (i)-(iii) of Section 3. Then x is a state variable and the value function $F(x, t)$ obeys the dynamic programming equation
\[ \inf_u \left[ c(x, u, t) + \frac{\partial F(x, t)}{\partial t} + \Lambda(u, t) F(x, t) \right] = 0 \]
(t t),
(7)
where $v_\tau^{(t)}$ is an appropriate linear predictor of $v_\tau$ based on information available at time t. That is, at time t one replaces future stochastic noise $v_\tau$ ($\tau > t$) by an 'equivalent' deterministic disturbance $v_\tau^{(t)}$ and then applies the methods of Sections 2.9 or 6.3 to deduce the optimal feedback/feedforward control in terms of this predicted disturbance. We shall see in Chapter 12 that similar considerations hold if the state vector x is itself not perfectly observable.

It turns out that $\epsilon_\tau^{(t)} = 0$ ($\tau > t$) for a white-noise input $\epsilon$ which has been perfectly observed up to time t. This explains why the closed-loop control rule was unaffected in case (1).

Once we drop LQG assumptions then treatment of the stochastic case becomes much more difficult. For general nonlinear cases there is not a great deal that can be said. We shall see in Section 7 and in Chapter 24 that one can treat some models for which LQG assumptions hold before termination, but for which rather general termination conditions and costs may be assumed. Some other models which we treat in this chapter are those concerned with the timing of a single definite action, or with the determination of a threshold for action. For systems of a realistic degree of complexity the natural appeal is often to asymptotic considerations: e.g. the 'heavy traffic' approximations for queueing systems or the large-deviation treatment of large-scale systems.
Exercises and comments

(1) Consider the closed- and open-loop forms of optimal control (2.32) and (2.33) deduced for the simple LQ problem considered there. Show that if the plant equation is driven by white noise of variance N then the additional cost incurred from time t = 0 is $DQN \sum_{s=0}^{h-1} (Q + sD)^{-1}$ or $hDN$ according as the closed- or the open-loop rule is used. These then grow as $\log h$ or as h with increasing horizon.
SOME STOCHASTIC EXAMPLES
2 OPTIMAL EXERCISE OF A STOCK OPTION

As a last discrete-time example we shall consider a simple but typical financial optimisation problem. One has an option, although not an obligation, to buy a share at price p. The option must be exercised by day h. If the option is exercised on day t then one can sell immediately at the current price $x_t$, realising a profit of $x_t - p$. The price sequence obeys the equation $x_{t+1} = x_t + \epsilon_t$, where the $\epsilon_t$ are independently and identically distributed random variables for which $E|\epsilon| < \infty$. The aim is to exercise the option optimally.

The state variable at time t is, strictly speaking, $x_t$ plus a variable which indicates whether the option has been exercised or not. However, it is only the latter case which is of interest, so x is the effective state variable. If $F_s(x)$ is the value function (maximal expected profit) with time s to go then $F_0(x) = \max\{x - p, 0\} = (x - p)_+$ and
\[ F_s(x) = \max\{x - p,\ E[F_{s-1}(x + \epsilon)]\} \qquad (s = 1, 2, \ldots). \]
The general character of $F_s(x)$ is indicated in Figure 1; one can establish the following properties inductively:
(i) $F_s(x) - x$ is nonincreasing in x;
(ii) $F_s(x)$ is increasing in x;
(iii) $F_s(x)$ is continuous in x;
(iv) $F_s(x)$ is nondecreasing in s.

For example, (iv) is obvious, since an increase in s amounts to a relaxation of the time constraint. However, for a formal proof:
\[ F_1(x) = \max\{x - p,\ E[F_0(x + \epsilon)]\} \ge \max\{x - p, 0\} = F_0(x), \]
whence $F_s$ is nondecreasing in s, by Theorem 3.1.1. Correspondingly, an inductive proof of (i) follows from
\[ F_s(x) - x = \max\{-p,\ E[F_{s-1}(x + \epsilon) - (x + \epsilon)] + E(\epsilon)\}. \]

Figure 1 The value function at horizon s for the stock option example.

We then derive
Theorem 10.2.1 There exists a nondecreasing sequence $\{a_s\}$ such that an optimal policy is to exercise the option first when $x \ge a_s$, where x is the current price and s is the number of days to go before expiry of the option.

Proof From (i) and the fact that $F_s(x) \ge x - p$ it follows that there exists an $a_s$ such that $F_s(x)$ is greater than $x - p$ if $x < a_s$ and equals $x - p$ if $x \ge a_s$. It follows from (iv) that $a_s$ is nondecreasing in s. □

The constant $a_s$ is then just the supremum of values of x for which $F_s(x) > x - p$.
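The recursion for $F_s(x)$ and the thresholds $a_s$ of Theorem 10.2.1 can be illustrated on a lattice. The sketch below assumes, purely for illustration, integer prices and a two-point distribution for $\epsilon$ with negative drift (step +1 with probability 0.4, step −1 with probability 0.6) and a strike price of 10; these values are hypothetical and not from the text. Under this assumed drift, exercise is certainly optimal once $x \ge p + s$, which bounds the search for the threshold.

```python
from functools import lru_cache

P_STRIKE = 10                       # hypothetical option price p
STEPS = ((+1, 0.4), (-1, 0.6))      # assumed distribution of epsilon (negative drift)


@lru_cache(maxsize=None)
def value(s, x):
    """F_s(x): maximal expected profit at price x with s days to go."""
    exercise = x - P_STRIKE
    if s == 0:
        return max(exercise, 0.0)                        # F_0(x) = (x - p)_+
    hold = sum(prob * value(s - 1, x + step) for step, prob in STEPS)
    return max(exercise, hold)                           # F_s = max{x - p, E F_{s-1}(x + eps)}


def threshold(s):
    """Smallest lattice price at which immediate exercise is optimal with s days to go."""
    for x in range(P_STRIKE, P_STRIKE + s + 1):
        if value(s, x) <= (x - P_STRIKE) + 1e-12:
            return x
    return None  # not reached under the assumed negative drift


thresholds = [threshold(s) for s in range(6)]

if __name__ == "__main__":
    print(thresholds)
```

On a lattice the threshold is reported as the smallest price at which $F_s(x) = x - p$, which is the discrete counterpart of the supremum characterisation of $a_s$ given in the text.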
3 A QUEUEING MODEL

Queues and systems of queues provide a rich source of optimisation models in continuous time and with discrete state variable. One must not think simply of the single queue ('line') at a ticket counter; computer and communication systems are examples of queueing models which constitute a fundamental type of stochastic system of great technological importance. However, consideration of queues which feed into each other opens too big a subject; we shall just cover a few of the simplest ideas for single and parallel queues in this section and the next chapter.

Consider the case of a so-called M/M/1 queue, with x representing the size of the queue and the control variable u being regarded as something like service effort. If we say that customers arrive at rate $\lambda$ and are served at rate $\mu(u)$ then this is a loose way of stating that the transition $x \to x + 1$ has intensity $\lambda$ and the transition $x \to x - 1$ has intensity $\mu(u)$ if $x > 0$ (and, of course, intensity zero if x = 0). We assume the process time-homogeneous, so that the dynamic programming equation takes the form
\[ \inf_u \left[ c(x, u) + \frac{\partial F(x, t)}{\partial t} + \Lambda(u) F(x, t) \right] = 0. \tag{8} \]
Here the infinitesimal generator has the action (cf. (9.2))
\[ \Lambda(u)\phi(x) = \lambda[\phi(x + 1) - \phi(x)] + \mu(u, x)[\phi(x - 1) - \phi(x)] \]
where $\mu(u, x)$ equals $\mu(u)$ or zero according as x is positive or zero. If we were interested in average-optimisation over an infinite horizon then equation (8) would be replaced by
\[ \gamma = \inf_u [c(x, u) + \Lambda(u) f(x)] \tag{9} \]
where $\gamma$ and $f(x)$ are respectively the average cost and the transient cost associated with the average-optimal policy.

In fact, we shall concern ourselves more with the question of optimal allocation of effort or of customers between several queues than with optimisation of a single queue. In preparation for this, it is useful to solve (9) for the uncontrolled case, when $\mu(u)$ reduces to a constant $\mu$ and $c(x, u)$ to $c(x)$. In fact, we shall assume the instantaneous cost proportional to the number in the queue, so that $c(x) = ax$. We leave it to the reader to verify that the solution of equation (9) is, in this reduced case,
\[ \gamma = \frac{a\lambda}{\mu - \lambda}, \qquad f(x) = \frac{a}{\mu - \lambda}\,\frac{x(x + 1)}{2}. \tag{10} \]
We have assumed the normalisation $f(0) = 0$. The solution is, of course, valid only if $\lambda < \mu$, for it is only then that queue size is finite in equilibrium.
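The verification left to the reader can also be done mechanically. The sketch below substitutes the solution (10) into the uncontrolled form of equation (9), $\gamma = c(x) + \Lambda f(x)$ with $c(x) = ax$, and checks that the residual vanishes exactly for each x. The particular values of a, $\lambda$ and $\mu$ are hypothetical, chosen only so that $\lambda < \mu$; exact rational arithmetic is used to avoid rounding.

```python
from fractions import Fraction


def average_cost(a, lam, mu):
    """gamma from (10)."""
    return a * lam / (mu - lam)


def transient_cost(x, a, lam, mu):
    """f(x) from (10), normalised so that f(0) = 0."""
    return a * x * (x + 1) / (2 * (mu - lam))


def residual(x, a, lam, mu):
    """c(x) + Lambda f(x) - gamma for the uncontrolled M/M/1 queue."""
    def f(z):
        return transient_cost(z, a, lam, mu)
    service = mu if x > 0 else 0            # no service transition from an empty queue
    generator_term = lam * (f(x + 1) - f(x)) + service * (f(x - 1) - f(x))
    return a * x + generator_term - average_cost(a, lam, mu)


# Hypothetical parameter values with lam < mu.
A, LAM, MU = Fraction(3), Fraction(2), Fraction(5)
residuals = [residual(x, A, LAM, MU) for x in range(50)]
```

Every entry of `residuals` should be exactly zero, including the boundary case x = 0 where the service term is absent.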
4 THE HARVESTING EXAMPLE: A BIRTH-DEATH MODEL

Recall the deterministic harvesting model of Section 1.2, which we shall generally associate with fisheries, for definiteness. This had a scalar state variable x, the 'biomass' or stock level, which followed the equation
\[ \dot x = a(x) - u. \tag{11} \]
Here u is the harvesting rate, which (it is supposed) may be varied as desired. The rate of return is also supposed proportional to u, and normalised to be equal to it. (The model thus neglects two very important elements: the age structure of the stock and the x-dependence of the cost of harvesting at rate u.) We suppose again that the function $a(x)$, the net reproduction rate of the unharvested population, has the form illustrated in Figure 1.1; see also Figure 2. We again denote by $x_m$ and $x_0$ the values at which $a(x)$ is respectively maximal and zero. An unharvested population would thus reach an equilibrium at $x = x_0$.

We know from the discussion of Section 2.7 that the optimal policy has the threshold form: u is zero for $x \le c$ and takes its maximal value (M, say) for $x > c$. Here c is the threshold, and one seeks now to determine its optimal value. If $a(x) > 0$ for $x \le c$ and $a(x) - M < 0$ for $x > c$ then the harvested population has the equilibrium value c and yields a return at rate $\gamma = a(c)$. If we do not discount, and so choose a threshold value which maximises this average return $\gamma$, then the optimal threshold is the value $x_m$ which maximises $a(c)$.

A threshold policy will still be optimal for a stochastic model under corresponding assumptions on birth and death rates. However, there is an effect which at first sight seems remarkable. If extinction of the population is impossible then one will again choose a threshold value which maximises average
return, and we shall see that, under a variety ofassumptions, this optimal threshold indeed approaches Xm as the stochastic model approaches determinism. However, if extinction is certain under harvesting then a natural criterion (in the absence of discounting) is to maximise the expected total return before extinction. It then turns out that the optimal threshold approaches x0 rather than Xm as the model approache s determinis m. There thus seems to be a radical discontinuity in optimal. policy between the situations in which the time to extinction is finite or infinite (with probability one, in both cases). We explain the apparent inconsistency in Section 6, and are again led to a more informed choice of criterion. We shall consider three distinct stochastic versions of the model; to follow them through at least provides exercise in the various types of continuou stime dynamics described in the last chapter. The first is a birthdeat h process. Letjbe the actual number offish; we shall set x = j JK where K is a scaling parameter, reflecting the fact that quite usual levels of stock x correspon d to large values of j. (We are forced to assume biomass proportion al to population size, since we have not allowed for an age structure). We shall suppose thatj follows a continuoustime Markov process on the nonnegative integers with possible transitions j t j + I and j t j  1 at respective probability intensities >.i and f..ti· These intensities thus correspond to populatio n birth and death rates. The net reproduction rate >.i  f..ti could be written ai, and correspon ds to Ka(x). Necessarily f..to = 0, but we shall suppose initially that >.o > 0. That is, that a zero population is replenished (by a trickle of immigration, say), so that extinction is impossible. Let 7rj denote the equilibrium distribution of population size; the probability that the population has size j in the steady state. 
Then the relation $\pi_j\lambda_j = \pi_{j+1}\mu_{j+1}$ (expressing the balance of probability flux between states $j$ and $j+1$ in equilibrium) implies that $\pi_j \propto \rho_j$, where
$$\rho_j = \frac{\lambda_0\lambda_1\cdots\lambda_{j-1}}{\mu_1\mu_2\cdots\mu_j} \qquad (j = 0, 1, 2, \ldots), \tag{12}$$
(with $\rho_0 = 1$, which is consistent with the convention that an empty product should be assigned the value unity). A threshold $c$ for the $x$-process implies a threshold $d = \kappa c$ for the $j$-process. For simplicity we shall suppose that the harvesting rate $M$ is infinite, although the case of a finite rate can be treated almost as easily. Any excess of population over $d$ is then immediately removed and one effectively has $\lambda_d = 0$ and $\rho_j = 0$ for $j > d$. The average return (i.e. expected rate of return in the steady state) on the $x$-scale is then
$$\gamma = \kappa^{-1}\pi_d\lambda_d, \tag{13}$$
SOME STOCHASTIC EXAMPLES
the term $\pi_d\lambda_d$ representing the expected rate at which excess of population over $c$ is produced and immediately harvested. Suppose now that the ratio $\theta_j = \mu_j/\lambda_j$ is effectively constant (and less than unity) for $j$ in the neighbourhood of $d$. The effect of this is that $\pi_{d-j} \approx \pi_d\theta_d^{\,j}$ $(j \le d)$, so the probability that the population is an amount $j$ below threshold falls away exponentially fast with increasing $j$. Formula (13) then becomes
$$\gamma \approx \kappa^{-1}\lambda_d(1 - \theta_d) = \kappa^{-1}(\lambda_d - \mu_d) = \kappa^{-1}a_d = a(c).$$
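The approximation $\gamma \approx a(c)$ can be checked numerically. The following is a minimal sketch, not part of the original argument: the rate functions $\lambda(x) = 2x$ and $\mu(x) = x + x^2$ (so that $a(x) = x - x^2$, $x_m = 0.5$, $x_0 = 1$), the immigration rate `lam(0) = 1` and the scale `kappa = 200` are all assumptions chosen purely for illustration. It evaluates the steady-state return $\kappa^{-1}\pi_d\lambda_d$ from the weights $\rho_j$ and searches for the best threshold.

```python
import math

def average_return(d, kappa, lam, mu):
    """Steady-state return kappa^{-1} * pi_d * lambda_d for a threshold d >= 1."""
    # log rho_j, built from rho_j / rho_{j-1} = lam(j-1) / mu(j), with rho_0 = 1
    log_rho = [0.0]
    for j in range(1, d + 1):
        log_rho.append(log_rho[-1] + math.log(lam(j - 1)) - math.log(mu(j)))
    top = max(log_rho)                         # rescale to avoid overflow
    weights = [math.exp(v - top) for v in log_rho]
    pi_d = weights[d] / sum(weights)           # states above d carry no mass
    return pi_d * lam(d) / kappa

kappa = 200
# Assumed rates (illustration only): lambda_j = kappa*lambda(j/kappa) = 2j,
# mu_j = kappa*mu(j/kappa) = j + j^2/kappa; lam(0) = 1 is an assumed trickle
# of immigration that makes extinction impossible.
lam = lambda j: 2.0 * j if j > 0 else 1.0
mu = lambda j: j + j * j / kappa

best_d = max(range(1, 2 * kappa),
             key=lambda d: average_return(d, kappa, lam, mu))
print(best_d / kappa, average_return(best_d, kappa, lam, mu))
```

Under these assumptions the maximising threshold on the $x$-scale should lie close to $x_m$ and the return close to $a(x_m)$, with discrepancies of the order of $\kappa^{-1}$.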
The optimal rule under these circumstances is then indeed to choose the threshold $c$ as the level at which the net reproduction rate is maximal, namely, $x_m$. This argument can be made precise if we pay attention to the scaling. The nature of the scaling leads one to suppose that the birth and death rates are of the forms $\lambda_j = \kappa\lambda(j/\kappa)$, $\mu_j = \kappa\mu(j/\kappa)$ in terms of functions $\lambda(x)$ and $\mu(x)$, corresponding to the deterministic equation $\dot{x} = \lambda(x) - \mu(x) = a(x)$ in the limit of large $\kappa$. The implication is then that $\lambda_j/\mu_j$ varies slowly with $j$ if $\kappa$ is large, with the consequence that the equilibrium distribution of $x$ falls away virtually exponentially as $x$ decreases from $c = d/\kappa$. The details of this verbal argument are easily completed (although anomalous behaviour at $j = 0$ can invalidate it in an interesting fashion, as we shall see). The theory of large deviations (Chapter 22) deals precisely with such scaled processes, in the range for which the scale is large but not yet so large that the process has collapsed to determinism. Virtually any physical system whose complexity grows at a smaller rate than its size generates examples of such processes. Suppose now that $\lambda_0 = 0$, so that extinction is possible (and indeed certain if, as we shall suppose, passage to 0 is possible from all states and the population is harvested above some finite threshold). Expression (12) then yields simply a distribution concentrated on $j = 0$. Let $F_j$ be the expected total return before extinction conditional on an initial population of $j$. (It is understood that the policy is that of harvesting at an infinite rate above the prescribed threshold value $d$.) The dynamic programming equation is then
(0 < j … $\lambda_j\Delta_{j+1} = \mu_j\Delta_j$, where $\Delta_j = F_j - F_{j-1}$. Using this equation to determine $\Delta_j$ in terms of $\Delta_{d+1} = 1$ and then summing to determine $F_j$, we obtain the solution (15). □
We see from (15) that the $d$-dependence of the $F_j$ occurs only through the common factor $\Pi_d$, and the optimal threshold will maximise this. The maximising value will be that at which $\lambda_d/\mu_d$ decreases from a value above unity to one below, so that $a_d = \lambda_d - \mu_d$ decreases through zero. That is, the optimal value of $c$ is $x_0$, the equilibrium level of the unharvested population. More exactly, it is less than $x_0$ by an amount not exceeding $\kappa^{-1}$. This means in fact a very low rate of catch, even while the population is viable. The two cases thus lead to radically different recommendations: that the threshold should be set near to $x_m$ or virtually at $x_0$ respectively. We shall explain the apparent conflict in the next two sections. It turns out that the issue is not really one of whether extinction is possible or not, but of two criteria which differ fundamentally and are both extreme in their way. A better understanding of the issues reveals the continuum between the two policies.
Exercises and comments
(1) Consider the naive policy first envisaged in Chapter 1, in which the population was harvested at a flat rate $u$ for all positive $x$. Suppose that this translates into an extra mortality rate of $v$ per fish in the stochastic model. The equilibrium distribution $\pi_j$ of population size is then again given by expression (12), once normalised, but with $\mu_j$ now modified to $\mu_j + v$. The roots $x_1$ and $x_2$ of $a(x) = u$, indicated in Figure 1.2 and corresponding to unstable and stable equilibria of the deterministic process, now correspond to a local minimum and a local maximum of $\pi_j$. It is on this local maximum that faith is placed, in that the probability mass is supposed to be concentrated there.
However, as $v$ (and so $u$) increases this local maximum becomes ever feebler, and vanishes altogether when $u$ reaches the critical value $u_m = a(x_m)$.
5 BEHAVIOUR FOR NEAR-DETERMINISTIC MODELS
In order to explain the apparent stark contradiction between the two policies derived in Section 4 we need to obtain a feeling for orders of magnitude of the various quantities occurring as $\kappa$ becomes large, and the process approaches determinism. We shall follow through the analysis just for the birth-death model
of the last section, but it holds equally for the alternative models of Sections 8 and 9. Indeed, all three cases provide examples of the large deviation theory of Chapter 22. Consider then the birth-death model in the case when extinction is possible. Since most of the time before extinction will be spent near the threshold value if $\kappa$ is large (an assertion which we shall shortly justify) we shall consider only $F_d$, the expected yield before extinction conditional on an initial value $d$ of $j$. Let $T_j$ denote the expected time before extinction which is spent in state $j$ (conditional on a start from $d$). Then, by the same methods which led to the evaluation (15), we find that
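$$T_j = \Pi_d\left[\sum_{k=1}^{j}\frac{\mu_1\mu_2\cdots\mu_{k-1}}{\lambda_1\lambda_2\cdots\lambda_{k-1}}\right]\left[\frac{\mu_{j+1}\mu_{j+2}\cdots\mu_d}{\lambda_j\lambda_{j+1}\cdots\lambda_d}\right], \tag{17}$$
which is consistent with expression (15) for $F_d = \lambda_d T_d$. When we see the process in terms of the scaled variable $x = j/\kappa$ we shall write
$$F_j = \kappa F(x) \quad\text{and}\quad T_j = T(x). \tag{18}$$
If we define
$$R(x) = \int_0^x \log[\lambda(y)/\mu(y)]\,dy$$
then we deduce from expression (15) that
$$F(c) = e^{\kappa R(c) + o(\kappa)}$$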
Theorem 10.8.2 Suppose that extinction is certain for the harvested process. Then the relevant solution of (43) is F(x)
= e~R(c)
(1x
where F' (c) has the evaluation
F'(c) =2M
el x for (43) of tion (45) then follows by solution The evalua this evaluation. 0 condition of certain extinction. The last assertion follows from s. The analogue of the final conclusion of Section 4 then follow ising the expression Theorem 10.8.3 The optimal value of c is that maxim value ofc is then determined by e~ c. Show then that expression (40) has the evaluation Mj(M fZ R') 1 ~ (1/ R') + 1/(M (l R')
for large K, where all functions are evaluated at c.
=a,
9 THE HARVESTING EXAMPLE: A MODEL WITH A SWITCHING ENVIRONMENT
As indicated in Section 9.2, we can use a piecewise-deterministic model to represent the effects of environmental variation. Suppose that the model has several environmental regimes, labelled by $i = 1, 2, \ldots$. In regime $i$ the population grows deterministically at net rate $a_i(x)$, but transition can take place to regime $h$ with probability intensity $\kappa\nu_{ih}$.
This is then a model whose stochasticity is of quite a different nature to that of the birth-death or diffusion models of Sections 4 and 8. It comes from without rather than within, and represents what is termed 'environmental stochasticity', as distinct from 'demographic stochasticity'. Whether this affects conclusions has to be determined.
The equivalent deterministic model would be given by equation (11) but with
$$a(x) = \sum_i p_i a_i(x) \tag{46}$$
m is in regime i. The model where Pi is the steadystate probability that the syste s between regimes take place converges to this deterministic version if transition ge regime'. This occurs in the 'avera so rapidly that one is essentially working in an scaling parameter. al limit oflarge, x:, so that K again appears as the natur for such a multiregime al A fixed threshold would certainly not be optim hold nature, but with thres of a model. It is likely that the optimal policy would be ion whether the quest a it is a different threshold in each regime. Of course, n of the rate of rvatio . Obse regime is know n at the time decisions must be made e is currently regim mine which change ofx should in principle enable one to deter x itself is of n ver, observatio in force, and so what threshold to apply. Howe . If one error ht with extreme unreliable, and estimation of its rate of change fraug optimal policy would base allowed for such imperfect observation then the bution conditional on curre nt action on a poste rior distribution (i.e. a distri ter 15). An optim al policy observables) of the values of both x and i (see Chap t should be applied only if an would probably take the form that harvesting effor hold dependent on both the estimate of the current value of x exceeded a thres probabilities of the different . precision of this estimate and the current poste rior regimes. desperately crude thoug h it We shall consider only the fixedthreshold policy, threshold value compares with must be in this case, and shall see how the optim al that for the equivalent deterministic mod el able to analysis. A value of x We shall consider a tworegime case, which is amen ot have positive probability in at which a1(x) and a2(x) have the same sign cann ~ 0 and a2(x) = JJ.(x) :E;; 0 equilibrium. Let us suppose then that a1 (x) = ..\(x) est We shall set 1112 = lit and over an interval which includes all xvalues of inter 1121
= 112·
Suppose initially that extinction is impossible, so that the aim is to maximise the expected rate of return $\gamma$ in the steady state. We shall suppose that the maximal harvest rate $M$ is infinite. For the deterministic equivalent of the process we have, by (46),
$$a(x) = \frac{\nu_2\lambda(x) - \nu_1\mu(x)}{\nu_1 + \nu_2}. \tag{47}$$
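The averaging behind (47) can be illustrated as follows. This is a minimal sketch: the function names and the particular rate functions are assumptions made for illustration, and the only fact used is that a two-state regime chain with switching intensities $\nu_1$ (from regime 1 to 2) and $\nu_2$ (from 2 to 1) spends a fraction $\nu_2/(\nu_1+\nu_2)$ of its time in regime 1. The common factor $\kappa$ in the intensities cancels.

```python
def regime_probabilities(nu1, nu2):
    """Steady-state probabilities (p1, p2) of the two-regime chain.

    nu1 is the intensity of switching 1 -> 2, nu2 that of switching 2 -> 1.
    """
    total = nu1 + nu2
    return nu2 / total, nu1 / total

def equivalent_rate(x, lam, mu, nu1, nu2):
    """Net growth rate a(x) of the equivalent deterministic model.

    Regime 1 grows at rate lam(x); regime 2 declines at rate mu(x).
    """
    p1, p2 = regime_probabilities(nu1, nu2)
    return p1 * lam(x) - p2 * mu(x)

# Hypothetical rate functions, chosen only to exercise the formula.
growth = lambda x: 2.0 * x
decline = lambda x: x * x
print(equivalent_rate(0.5, growth, decline, nu1=1.0, nu2=3.0))
```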
We shall suppose that this has the character previously assumed, see Figure 2. We also suppose that $\mu(x) = 0$ for $x \le 0$, so that $x$ is indeed confined to $x \ge 0$. The question of extinction or non-extinction is more subtle for this model. Suppose, for example, that $\lambda(0) = 0$ (so that a population cannot be replenished) and that $\mu(x)$ is bounded away from zero for positive $x$. Then extinction would be certain, because there is a non-zero probability that the unfavourable regime 2 can be held long enough that the population is run down to zero. For extinction to be impossible in an isolated population one requires that $\mu(x)$ tends to zero sufficiently fast as $x$ decreases to zero; the exact condition will emerge. Let $p_i(x)$ denote the probability/probability-density of the $i/x$ pair in equilibrium. These obey the Kolmogorov forward equations
$(0 \le x$ … for all $x$, and the machine incurs a cost at a rate $cx$ representing the effect of wear on operation. A service restores the machine instantaneously to state 0.
Suppose the machine is serviced randomly, with probability intensity $\mu$. The dynamic programming equation under this blind policy will be
$$\gamma = cx + \lambda[f(x+1) - f(x)] + \mu[f(0) - f(x)] \tag{14}$$
if costs are undiscounted. Here $\gamma$ is the average cost under the policy and $f(x)$ the transient cost in state $x$. The equation has the unique solution
$$f(x) = cx/\mu, \qquad \gamma = \lambda c/\mu \tag{15}$$
if we make the normalising assumption $f(0) = 0$.
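One can confirm by direct substitution that the stated solution satisfies the equation. The sketch below does this numerically; the function name and the parameter values are arbitrary assumptions, and the equation is taken in the form in which a service moves the state from $x$ to 0.

```python
def dp_residual(x, c, lam, mu):
    """Left side minus right side of the blind-policy equation at state x,
    evaluated at f(x) = c*x/mu and gamma = lam*c/mu."""
    f = lambda s: c * s / mu          # transient cost, normalised so f(0) = 0
    gamma = lam * c / mu              # average cost
    rhs = c * x + lam * (f(x + 1) - f(x)) + mu * (f(0) - f(x))
    return gamma - rhs

# Arbitrary illustrative parameters: cost rate c, wear rate lam, service rate mu.
worst = max(abs(dp_residual(x, c=2.0, lam=0.7, mu=1.3)) for x in range(50))
print(worst)
```

The residual should vanish up to rounding error for every state, since the term $cx$ is cancelled exactly by $\mu f(x)$.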
parameters and Suppose now that we have n such machines and distinguish the whole system the state of the ith machine by a subscript i. The average cost for will then be (16) .X;c;/ Jb; li = I =
L i
L i
with the Jb; constrained by (17) if J.l is the intensity oftotal service effort available. a machine is Specification of the Jli amounts to a random policy in which, when probability Jbd Jb· A to be serviced, the maintenance man selects machine i with stage of policy One {x;}. = x state more intelligent policy will react to system ering policy consid before ver, Howe .~~·""!!J=~~"~improvement will achieve this. ing the Jli to choos by policy m rando improvement, one could simply optimise the that this finds readily One (17). minimise expression (16) subject to constraint leads to an allocation Jli ex ~machine i for Policy improvement will recommend one to next service the which
POLICY IMPROVEMENT: STOCHASTIC VERSIONS
$$\sum_j\{c_jx_j + \lambda_j[f_j(x_j+1) - f_j(x_j)]\} + \mu[f_i(0) - f_i(x_i)]$$
is minimal; i.e. for which $f_i(x_i) = c_ix_i/\mu_i$ is greatest. If we use the optimised value of $\mu_i$ derived above then the recommendation is that the next machine $i$ to be serviced is that for which $x_i\sqrt{c_i/\lambda_i}$ is greatest. Note that this is an index rule, in that an index $x_i\sqrt{c_i/\lambda_i}$ is calculated for each machine and that machine chosen for service whose current index is greatest. The rule seems sensible: degree of wear, cost of wear and rapidity of wear are all factors which would cause one to direct attention towards a given machine. However, the rule is counter-intuitive in one respect: the index decreases with increasing $\lambda_i$, so that an increased rate of wear would seem to make the machine need service less urgently. However, the rate is already taken account of in the state variable $x_i$ itself, which one would expect to be of order $\lambda_i$ if a given time has elapsed since the last service. The deflation of $x_i$ by a factor $\sqrt{\lambda_i}$ is a reflection of the fact that one expects a quickly wearing component to be more worn, even under an optimal policy. An alternative argument in Section 14.6 will demonstrate that this policy is indeed close to optimal.
5 CUSTOMER ALLOCATION BETWEEN QUEUES
Suppose there are $n$ queues of the type considered in Section 10.3. Quantities defined for the $i$th queue will be given subscript $i$, so that $x_i$, $\lambda_i$, $\mu_i$ and $a_ix_i$ represent size, arrival rate, service rate and instantaneous cost rate for that queue. We suppose initially that these queues operate independently, and use
$$\lambda = \sum_i \lambda_i \tag{18}$$
to denote the total arrival rate of customers into the system. However, suppose that arriving customers can in fact be routed into any of the queues (so that the queues are mutually substitutable alternatives rather than components of a structured network). We look for a routing policy $\pi$ which minimises the expected average cost $\gamma_\pi = E_\pi[\sum_i a_ix_i]$. The policy implied by the specification above simply sends an arriving customer to queue $i$ with probability $\lambda_i/\lambda$; the optimal policy will presumably react to the current system state $\{x_i\}$. The random routing policy achieves an average cost of
$$\gamma = \sum_i \frac{a_i\lambda_i}{\mu_i - \lambda_i} \tag{19}$$
(see (10.10)). As in the last section, before applying policy improvement we might as well optimise the random policy by choosing the allocation rates $\lambda_i$ to minimise expression (19) subject to (18), for given $\lambda$. One readily finds that the optimal choice is
$$\lambda_i = \left(\mu_i - \sqrt{\theta a_i\mu_i}\right)_+ \tag{20}$$
where $\theta$ is a Lagrange multiplier whose value is chosen to secure equality in (18). So, speed of service and cheapness of occupation both make a queue more attractive; a queue for which $a_i/\mu_i > 1/\theta$ will not be used at all. Consider one stage of policy improvement. It follows from the form of the dynamic programming equation that, if the current system state is $\{x_i\}$, then one will send the next arrival to that queue $i$ for which
$$f_i(x_i + 1) - f_i(x_i)$$
is minimal. Here the $f_i$ are the transient costs under the random allocation policy, determined in (10.10). That is, one sends the arrival to that queue $i$ for which
$$f_i(x_i + 1) - f_i(x_i) = \frac{a_i(x_i + 1)}{\mu_i - \lambda_i}$$
is minimal. If the $\lambda_i$ have already been optimised by the rule (20) then it follows that one sends the new arrival to the queue for which $(x_i + 1)\sqrt{a_i/\mu_i}$ is minimal, although with $i$ restricted to the set of values for which expression (20) is positive. This rule seems sensible: one tends to direct customers to queues which are small, cheap and fast-serving. Note that the rule again has the index form.
6 ALLOCATION OF SERVER EFFORT BETWEEN QUEUES
We could have considered the problem of the last section under the assumption that it is server effort rather than customer arrivals which can be switched between queues. The problem could be rephrased in the form in which it often occurs: that there is a single queue to which customers of different types (indexed by $i$) arrive, and such that any customer can be chosen from the queue for the next service. Customers of different types may take different times to serve, so it is as well to make a distinction between service effort and service rate. We shall suppose that if one puts service effort $\sigma_i$ into serving customers of type $i$ (i.e. the equivalent of $\sigma_i$ servers working at a standard rate) then the intensity of service completion is the service rate $\sigma_i\mu_i$. One may suppose that a customer of type $i$ has an exponentially distributed 'service requirement' of expectation $\mu_i^{-1}$, and that this is worked off at rate $\sigma_i$ if service effort $\sigma_i$ is applied.
As in the last section, we can begin by optimising the fixed service-allocation policy, which one will do by minimising the expression
$$\sum_i \frac{a_i\lambda_i}{\sigma_i\mu_i - \lambda_i} \tag{21}$$
with respect to the service efforts subject to the constraint
$$\sum_i \sigma_i = S, \tag{22}$$
on total service effort. The optimal allocation is
$$\sigma_i = \mu_i^{-1}\left[\lambda_i + \sqrt{\theta a_i\lambda_i\mu_i}\right] \tag{23}$$
where $\theta$ is a Lagrange multiplier, chosen to achieve equality in (22). An application of policy improvement leads us, by an argument analogous to that of the previous section, to the conclusion that all service effort should be allocated to that non-empty queue $i$ for which $\mu_i[f_i(x_i - 1) - f_i(x_i)]$ is minimal; i.e. for which
$$\nu_i(x_i) = \mu_i[f_i(x_i) - f_i(x_i - 1)] = \frac{a_i\mu_ix_i}{\sigma_i\mu_i - \lambda_i} \tag{24}$$
is maximal. If the fixed efforts $\sigma_i$ have been given their optimal values (23) then the rule is: all service effort should be concentrated on that non-empty customer class $i$ for which $x_i\sqrt{a_i\mu_i/\lambda_i}$ is maximal. It is reasonable that one should direct effort towards queues whose size $x_i$ or unit cost $a_i$ is large, or for which the response $\mu_i$ to service effort is good. However, again it seems paradoxical that a large arrival rate $\lambda_i$ should work in the opposite direction. The explanation is analogous to that of the previous section: this arrival rate is already taken account of in the queue size $x_i$ itself, and the deflation of $x_i$ by a factor $\sqrt{\lambda_i}$ is a reflection of the fact that one expects a busy queue to be larger, even under an optimal policy. Of course, the notion that service effort can be switched wholly and instantaneously is unrealistic, and a policy that took account of switching costs could not be a pure index policy. Suppose that to switch an amount of service effort $u$ from one queue to another costs $c|u|$. Suppose that one application of policy improvement to a policy of fixed allocation $\{\sigma_i\}$ of service effort will modify this to $\{\sigma'_i\}$. Then the $\sigma'_i$ will minimise
$$\sum_i\left(\tfrac12 c|\sigma'_i - \sigma_i| - \nu_i\sigma'_i\right) \tag{25}$$
subject to
$$\sum_i \sigma'_i = S \tag{26}$$
and non-negativity. Here $\nu_i = \nu_i(x_i)$ is the index defined in (24) and the factor $\tfrac12$ occurs because the first sum in (25) effectively counts each transfer of effort twice. If we take account of constraint (26) by a Lagrangian multiplier $\theta$ then the differential of the Lagrangian form $L$ is
$$\frac{\partial L}{\partial\sigma'_i} = \begin{cases} L_{i+} := \tfrac12 c - \nu_i - \theta & (\sigma'_i > \sigma_i) \\ L_{i-} := -\tfrac12 c - \nu_i - \theta & (\sigma'_i < \sigma_i) \end{cases}$$
We must thus have $\sigma'_i$ equal to $\sigma_i$, not less than $\sigma_i$, not greater than $\sigma_i$ or equal to zero according as $L_{i-} < 0 < L_{i+}$, $L_{i+} = 0$, $L_{i-} = 0$ or $L_{i-} > 0$. This leads us to the improved policy. Let $\mathcal{A}_+$ be the set of $i$ for which $\nu_i$ is maximal and $\mathcal{A}_-$ the set of $i$ (possibly empty) for which
$$\nu_i < \max_j \nu_j - c. \tag{27}$$
Then all serving effort for members of $\mathcal{A}_-$ should be transferred to members of $\mathcal{A}_+$. The definitions of the $\nu_i$ will then of course be updated by substitution of the new values of service allocation. Such a reallocation of effort will occur whenever discrepancies between queues become so great that (27) holds for some $i$. The larger $c$, the less frequently will this occur.
7 REWARDED SERVICE RATHER THAN PENALISED WAITING
Suppose the model of the last section is modified in that a reward $r_i$ is earned for every customer of type $i$ whose service is completed, no cost being levied for waiting customers. This is then a completely different situation, in that queues can be allowed to build up indefinitely without penalty. It has long been known (Cox and Smith, 1961) that the optimal policy under these circumstances is to serve a customer of that type $i$ for which $r_i\mu_i$ is maximal among those customers present in the queue. This is intuitively right, and in fact holds for service requirements of general distribution. We shall see that policy improvement leads to what is effectively just this policy. For a queue of a single type served at unit rate we have the dynamic programming equation in the undiscounted case
$$\gamma = \lambda\Delta(x + 1) + \mu(x)[r - \Delta(x)], \tag{28}$$
where $x$ is queue size, $\lambda$ is arrival rate, $\mu(x)$ equals the completion rate $\mu$ if $x$ is positive and is otherwise zero, $r$ is the reward and $\Delta(x)$ is the increment $f(x) - f(x-1)$ in transient cost. Equation (28) has the general solution for $\Delta$
$$\Delta(x) = \frac{\gamma - \mu r}{\lambda - \mu} + \frac{\gamma - \lambda r}{\mu - \lambda}\left[\frac{\mu}{\lambda}\right]^{x}. \tag{29}$$
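A quick numerical check of a solution of this form is sketched below. The function names and parameter values are assumptions for illustration, and the exponent is taken so that the boundary relation at an empty queue, $\gamma = \lambda\Delta(1)$, is respected for every value of $\gamma$.

```python
def delta(x, gamma, lam, mu, r):
    """Candidate general solution for the increment Delta(x), x >= 1."""
    steady = (gamma - mu * r) / (lam - mu)
    growing = (gamma - lam * r) / (mu - lam) * (mu / lam) ** x
    return steady + growing

def residual(x, gamma, lam, mu, r):
    """gamma minus the right-hand side of the balance equation at state x."""
    completion = mu if x > 0 else 0.0            # mu(x) vanishes when x = 0
    current = delta(x, gamma, lam, mu, r) if x > 0 else 0.0
    rhs = lam * delta(x + 1, gamma, lam, mu, r) + completion * (r - current)
    return gamma - rhs

# gamma is left free here: the relation should hold for any value of it.
worst = max(abs(residual(x, gamma=0.9, lam=1.0, mu=1.5, r=2.0))
            for x in range(15))
print(worst)
```

Setting `gamma = lam * r` makes the geometric term vanish and leaves `delta` equal to `r`, while `gamma = mu * r` with `mu < lam` leaves the geometric term alone; these are the two bounded cases.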
If $\mu > \lambda$ then finiteness implies that the second term must be absent in (29), whence we deduce that
$$\gamma = \lambda r, \qquad \Delta(x) = r. \tag{30}$$
This is the situation in which it is the arrival rate which limits the rate at which reward can be earned. In the other case, $\mu < \lambda$, it is the completion rate which is the limiting factor; the queue builds up and we have
$$\gamma = \mu r, \qquad \Delta(x) = r(\mu/\lambda)^{x}. \tag{31}$$
The total reward rate for a queue of several types under a fixed allocation $\{\sigma_i\}$ of service effort is then
$$\gamma = \sum_i r_i\min[\lambda_i, \sigma_i\mu_i].$$
If we rank types in order of decreasing $r_i\mu_i$ then an optimal fixed allocation is that for which $\sigma_i = \lambda_i/\mu_i$ for $i = 1, 2, \ldots, m$, where $m$ is as large as possible consistent with (22), to allocate any remaining service effort to type $m + 1$ and zero service effort to remaining types. Now consider a policy improvement step. We should direct the next service to a customer of type $i$ which is present in the queue and for which $i$ maximises $\mu_i[r_i - \Delta_i(x_i)]$. It follows from solutions (30), (31) that this expression is zero for $i = 1, 2, \ldots, m$ and equal to $r_i\mu_i[1 - (\sigma_i\mu_i/\lambda_i)^{x_i}]$ for $i > m$. That is, one will serve any of the first $m$ types if one can. If none of these are present, then in effect one will serve the customer type present which maximises $r_i\mu_i$, because the fact that the traffic intensity for queue $m + 1$ exceeds unity means that $x_{m+1}$ will be infinite in equilibrium. It may seem strange that the order in which one serves members of the $m$ queues of highest effective value $r_i\mu_i$ appears immaterial. The point is that all arrivals of these types will be served ultimately in any case. If there were any discounting at all then one would, of course, always choose the type of highest value among those present for first service.
8 CALL ROUTING IN LOSS NETWORKS
Consider a network of telephone exchanges, with the nodes of the network (the exchanges) indexed by a variable $j = 1, 2, \ldots$. Suppose that there are $m_{jk}$ lines ('trunks') on the directed link from exchange $j$ to exchange $k$, of which $x_{jk}$ are busy at a given time. One might think then that the vector $\mathbf{x} = \{x_{jk}\}$ of these occupation numbers would adequately describe the state of the system, but we shall see that this is not quite so. Calls arrive for a $jk$ connection in a Poisson stream of rate $\lambda_{jk}$, these streams being supposed independent. Such calls, once established, terminate with probability intensity $\mu_{jk}$.
When a call arrives for a $jk$-connection, it need not be established on a direct $jk$ link. There may be no free trunks on this link, in which case one can either look for an alternative indirect routing (of which there may be many) or simply not
accept the call. In this latter case we assume that the call is simply lost: no queueing facility is provided, and the caller is assumed to drop back into the population, resigned to disappointment. We see that a full description of the state of the network must indicate how many calls are in progress on each possible route. Let $n_r$ be the number of calls in progress on route $r$. Denote the vector with elements $n_r$ by $\mathbf{n}$ and let $\mathbf{n} + e_r$ denote the same vector with $n_r$ increased by one. Let us suppose that the establishment of a $jk$-connection brings in revenue $w_{jk}$ and that one seeks a routing policy which maximises the average expected revenue. The dynamic programming equation would then be
$$\gamma = \sum_j\sum_k \lambda_{jk}\max\{0,\, w_{jk} + \max_r[f(\mathbf{n} + e_r) - f(\mathbf{n})]\} + \sum_r n_r\mu_r[f(\mathbf{n} - e_r) - f(\mathbf{n})], \tag{32}$$
where $\gamma$ and $f$ indicate average reward and transient reward respectively. Here the $r$-maximisation in the first sum is over feasible routes which establish a $jk$-connection. The zero option in this sum corresponds to rejection of the call on the grounds that connection is either impossible or uneconomic. (The difference $f(\mathbf{n}) - f(\mathbf{n} + e_r)$ can be regarded as the implied cost of establishing an incoming call along route $r$. If this exceeds $w_{jk}$ then the connection uses capacity which could be more profitably used elsewhere. We can take as a convention that this cost is infinite if the route is infeasible, i.e. requires non-existent capacity.) In the second sum $\mu_r$ is taken to equal $\mu_{jk}$ if route $r$ begins in $j$ and ends in $k$. The term indicated is included in the sum only if $\mathbf{n} - e_r \ge 0$; i.e. if there is indeed at least one call established on route $r$. Solution of this equation seems hopeless. However, the value function can be determined for the simple policy in which one uses only direct routes, accepting a call for this route if and only if there is a free trunk. We shall then apply one stage of policy improvement. The average and transient rewards on one such link (for which we shall drop the $jk$ subscripts) are determined by
$$\gamma = \lambda[w + f(x + 1) - f(x)] + \mu x[f(x - 1) - f(x)] \qquad (0 < x < m). \tag{33}$$
For $x = 0$ this equation holds with the term in $\mu$ missing; for $x = m$ it holds with the term in $\lambda$ missing. Let us define the quantity
$$\Delta(x) = f(x) - f(x + 1)$$
which can be interpreted as the cost of accepting an incoming call if $x$ trunks are currently busy. One finds then the elegant solution
$$\Delta(x) = w\,\frac{B(m, \theta)}{B(x, \theta)} \quad (0 < x < m); \qquad \gamma = \lambda w[1 - B(m, \theta)]$$
where $\theta = \lambda/\mu$ is the traffic intensity on the link and $B(m, \theta)$ is the Erlang function
$$B(m, \theta) = \frac{\theta^m/m!}{\sum_{x=0}^{m}\theta^x/x!}.$$
This is the probability that all trunks are busy, so that an incoming call cannot be accepted: the blocking probability. The formula for $\gamma$ thus makes sense. Returning to the full network, define
Then we see from (32) that one stage of policy improvement yields the revised policy: If a call comes in, assign it the feasible route for which the sum of the link costs $\Delta$ along the route is minimal, provided that there is such a feasible route and that this minimal cost does not exceed $w_{jk}$. In other cases, the call must or should be rejected. The policy seems sensible, and is attractive in that the effective cost of a route is just the sum of individual link components $\Delta$. For this latter reason the routing policy is termed separable. Separability ignores the interactions between links, and to that extent misses the next level of sophistication one would wish to reach. The policy tends to assign indirect routings more readily than does a more sophisticated policy. This analysis is taken, with minor modifications, from Krishnan and Ott (1986). Later work has taken the view that, for a large system, policies should be decentralised, assume very limited knowledge of system state and demand little in the way of real-time calculation. One might expect that performance would fall well below the full-information optimum under such constraints, but it seems, quite remarkably, that this need not be the case. Gibbens et al. (1988, 1995) have proposed the dynamic alternative routing policy under which, if a one-step route is not available, two-step routes are tested at random, the call being rejected if this search is not quickly successful. One sees how little information or processing is required, and yet performance has been shown to be close to optimal.
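The separable rule can be sketched in a few lines. This is an illustration only, not the implementation of Krishnan and Ott: the function names, the representation of a route as a list of `(x, m, theta, w)` link tuples and the example numbers are all assumptions. The Erlang function is computed by its standard recursion, and a full link is given infinite cost.

```python
def erlang_b(m, theta):
    """Erlang blocking probability B(m, theta) by the standard recursion."""
    b = 1.0                                   # B(0, theta) = 1
    for k in range(1, m + 1):
        b = theta * b / (k + theta * b)
    return b

def link_cost(x, m, theta, w):
    """Implied cost of accepting a call on a link with x of m trunks busy."""
    if x >= m:
        return float("inf")                   # no free trunk: infeasible
    return w * erlang_b(m, theta) / erlang_b(x, theta)

def choose_route(routes, w_call):
    """Index of the cheapest route, or None if the call should be rejected.

    Each route is a list of (x, m, theta, w) tuples, one per link used.
    """
    if not routes:
        return None
    costs = [sum(link_cost(*link) for link in route) for route in routes]
    best = min(range(len(routes)), key=costs.__getitem__)
    return best if costs[best] <= w_call else None

# Hypothetical example: a full direct link and an idle two-link alternative.
direct = [(2, 2, 1.0, 1.0)]
indirect = [(0, 2, 1.0, 1.0), (0, 2, 1.0, 1.0)]
print(choose_route([direct, indirect], w_call=1.0))
```

In this example the indirect route is accepted when the call is worth 1.0 but would be rejected if the call were worth less than the summed link costs.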
CHAPTER 12
The LQG Model with Imperfect Observation
1 AN INDICATION OF CONCLUSIONS
When we introduced stochastic state structure in Section 8.3 we assumed that the current value of the state variable was observable, a property generally referred to as 'perfect observation'. Note that this is not the same as 'complete information', which describes the situation for which the whole future course of the process is predictable for a given policy.
If the model is state-structured but imperfectly observed (i.e. if the current value of state is imperfectly observable) then the simple recursive treatment of Section 8.3 fails. We shall see in Chapter 15 that, in this case, an effective state variable is supplied by the so-called information state: the distribution of $x_t$ conditional on $W_t$. This means that the state space $\mathcal{X}$ with which we have to work has perforce been expanded to the space of distributions on $\mathcal{X}$, a tremendous increase in dimensionality.
However, LQG processes show a great simplification, in that for them these conditional distributions are always Gaussian, and so are parametrised by the conditional mean value $\hat{x}_t$ and the conditional covariance matrix $V_t$. Indeed, matters are even simpler, in that $V_t$ turns out to develop in time by deterministic and policy-independent rules. The only point at which we appeal to the observed course of the process is then in the calculation of $\hat{x}_t$. One can regard $\hat{x}_t$ as an estimate of the current state $x_t$ based on current information $W_t$. It is interesting that the formulation of a control rule forces one to estimate unobservables; even more interesting that the optimisation of this rule implies criteria for the optimisation of estimation.
LQG processes have already been defined by the properties of linear relations between variables, quadratic costs and Gaussian noise. We shall come shortly to a definition which expresses their essence even more briefly and exactly. If we add the feature of imperfect observation to the LQG regulation problem of Section 10.1 then we obtain what one might regard as the prototype imperfectly observed state-structured LQG process in discrete time. For this the plant and observation relations take the form
$$x_t = Ax_{t-1} + Bu_{t-1} + \epsilon_t \tag{1}$$
$$y_t = Cx_{t-1} + \eta_t \tag{2}$$
where $y_t$ is the observation which becomes available at time $t$. We suppose that the process noise $\epsilon$ and the observation noise $\eta$ jointly constitute a Gaussian white noise process with zero mean and with covariance matrix (3)
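Further, we retain the instantaneous and terminal cost functions
$$c(x, u) = \frac12\begin{bmatrix} x \\ u \end{bmatrix}^{\top}\begin{bmatrix} R & S^{\top} \\ S & Q \end{bmatrix}\begin{bmatrix} x \\ u \end{bmatrix}, \tag{4}$$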
of Section 9.1. If we can treat this model then treatment of more general cases (e.g. incorporating tracking of a reference signal) will follow readily enough. We shall see that there are two principal and striking conclusions. One is that, if $u_t = K_tx_t$ is the optimal control in the case of perfect state information, then the optimal control in the imperfect-observation case is simply $u_t = K_t\hat{x}_t$. This is a manifestation of what is termed the certainty equivalence principle (CEP). The CEP states, roughly, that one should proceed by replacing any unobservable by its current estimate, and then behaving as if one were in the perfect-observation case. It turns out to be a key concept, not limited to LQG models. On the other hand, it cannot hold for models in which policy affects the information which is gained. The other useful conclusion is the recursion for the estimate $\hat{x}_t$
$$\hat{x}_t = A\hat{x}_{t-1} + Bu_{t-1} + H_t(y_t - C\hat{x}_{t-1}) \tag{5}$$
known as the Kalman filter. This might be said to be the plant equation for the effective state variable $\hat{x}_t$; it takes the form of the original plant equation (1), but, instead of being driven by plant noise $\epsilon_t$, is driven by the innovation $y_t - C\hat{x}_{t-1}$. The innovation is just the deviation of the observation $y_t$ from the value $E(y_t \mid W_{t-1})$ that one would have predicted for it at time $t - 1$; hence the name. The matrix $H_t$ is calculated by rules depending on $V_{t-1}$. The Kalman filter provides the natural computational tool for the real-time determination of state estimates, a computation which would be realised by either a computational or an analogue model of the plant.
Finally, LQG structure has really nothing to do with state structure, and essential ideas are indeed obscured if one treats the state-structured case alone. Suppose that $X$, $U$ and $Y$ denote process, control and observation realisations over the complete course of the process. The cost function $C$ will then be a function $C(X, U)$. Suppose that the probability density of $X$ and $Y$ for a control sequence $U$ announced in advance is
$$f(X, Y \mid\,; U) = e^{-\mathbb{D}(X, Y \mid\,; U)} \tag{6}$$
(so that $U$ is a parametrising variable). This must be a density relative to an appropriate measure; we return to this point below. We shall term $\mathbb{D}$
the discrepancy, since it increases with increasing improbability of plant/observation realisations $X$ and $Y$ for given $U$. The two functions $C$ and $\mathbb{D}$ of the variables indicated characterise the cost and stochastic structure of the problem. One can say that the problem is LQG if both the cost function $C$ and the discrepancy $\mathbb{D}$ are quadratic in their arguments when the density (6) is relative to Lebesgue measure.

This characterisation indeed implies, very economically, that dynamic relations are linear, costs quadratic and noise Gaussian. It also implies that policy cannot affect information, in that the only effect of controls on observations is to add on a term which is linear in known controls, and so can be corrected out. It implies in addition that the variables take values freely in a vector space (since the density is relative to Lebesgue measure) and that the stopping rule is specification of a horizon point $h$ (since histories $X$, $Y$, $U$ are taken over a prescribed time interval).

It can be asserted quite generally (i.e. independently of LQG ideas) that the two quantities $C$ and $\mathbb{D}$ between them characterise the model. One might say that both $C$ and $\mathbb{D}$ should be small (relative to what they could be), in that one wishes to make $C$ small and expects $\mathbb{D}$ to be small.

One can define the discrepancy for any random variable; i.e. $\mathbb{D}(x)$ for $x$ alone or $\mathbb{D}(x \mid y)$ for $x$ conditioned by $y$. The fact that the discrepancy is the negative logarithm of a density relative to a measure which may be normalised in different ways means that it is determined only up to an additive constant. We shall assume it so normalised that $\inf_x \mathbb{D}(x) = 0$.

So, if $x$ is a vector normally distributed with mean $\mu$ and covariance matrix $V$, then $\mathbb{D}(x)$ is just the quadratic form $\tfrac{1}{2}(x - \mu)^\top V^{-1}(x - \mu)$. Note the general validity of formulae such as

$$\mathbb{D}(x, y) = \mathbb{D}(y) + \mathbb{D}(x \mid y).$$

It is often asked why the observation relation is taken in the form (2), with $y_t$ essentially being an observation on immediately previous state rather than on current state. The weak answer is that things work out nicely that way. Probably the right answer is that, if one regards $(x_t, y_t)$ as a joint state variable, then $y_t$ is a function of current state, uncorrupted by noise. This is an alternative expression of imperfect observation which has a good deal to recommend it: that one observes some aspect of current state without error.
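The additivity formula for discrepancies noted earlier in this section can be checked numerically in the simplest Gaussian case. The following is a minimal sketch, assuming a zero-mean bivariate normal pair of scalars; the function names and the numerical values are illustrative choices and are not taken from the text.

```python
def joint_discrepancy(x, y, vxx, vxy, vyy):
    """0.5 * [x, y] V^{-1} [x, y]^T for a zero-mean bivariate normal pair."""
    det = vxx * vyy - vxy * vxy
    return 0.5 * (vyy * x * x - 2.0 * vxy * x * y + vxx * y * y) / det


def marginal_discrepancy(y, vyy):
    """Discrepancy of y alone, normalised to have infimum zero."""
    return 0.5 * y * y / vyy


def conditional_discrepancy(x, y, vxx, vxy, vyy):
    """Discrepancy of x given y, using the conditional mean and variance."""
    cond_mean = vxy / vyy * y
    cond_var = vxx - vxy * vxy / vyy
    return 0.5 * (x - cond_mean) ** 2 / cond_var


# Illustrative values only (assumptions, not from the text).
x, y = 1.3, -0.7
vxx, vxy, vyy = 2.0, 0.6, 1.5

lhs = joint_discrepancy(x, y, vxx, vxy, vyy)
rhs = marginal_discrepancy(y, vyy) + conditional_discrepancy(x, y, vxx, vxy, vyy)
print(lhs, rhs)
```

The two printed values agree to rounding error, which is the scalar form of the decomposition of the joint discrepancy into marginal and conditional parts.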
2 LQG STRUCTURE AND IMPERFECT OBSERVATION

The treatment of imperfectly observed LQG models involves a whole train of ideas, which authors order in different ways as they seek the right development.
We shall start from what is surely the most economical characterisation of LQG structure: the assumed quadratic character of the cost $C(X, U)$ and the discrepancy $\mathbb{D}(X, Y \mid\, ; U)$ as functions of their arguments. We also regard the treatment as constrained by two considerations: it should not appeal to state structure and it should generalise naturally to what is a natural extension of LQG structure: the risk-sensitive models of Chapters 16 and 21. The consequent treatment is indeed a very economical and direct one, completed essentially in this section and the next. Sections 6-8 are added for completeness: to express the results already obtained in the traditional vocabulary of linear least square estimates, innovations, etc. Section 9 introduces the dual variables, in terms of which the duality of optimal estimation to optimal control finds its natural statement.

Note first some general points. The fact that we have written the cost $C$ as $C(X, U)$ implies that the process has been defined generally enough that the pair $(X_t, U_t)$ includes all arguments entering the cost function $C$ up to time $t$ (such as values of reference signals as well as of plant itself). The dynamics of the process and observations are specified by the probability law $P(X, Y \mid\, ; U)$, which is subject to some natural constraints. We have

$$P(X, Y \mid\, ; U) = \prod_{t=0}^{h-1} P(x_{t+1}, y_{t+1} \mid X_t, Y_t; U) = \prod_{t=0}^{h-1} P(x_{t+1}, y_{t+1} \mid X_t, Y_t; U_t) \tag{7}$$

the second equality following from the basic condition of causality; equation (A2.4). Further

$$\mathbb{D}(x_{t+1}, y_{t+1} \mid X_t, Y_t; U_t) = \mathbb{D}(x_{t+1} \mid X_t, Y_t; U_t) + \mathbb{D}(y_{t+1} \mid X_{t+1}, Y_t; U_t) = \mathbb{D}(x_{t+1} \mid X_t; U_t) + \mathbb{D}(y_{t+1} \mid X_{t+1}, Y_t; U_t) \tag{8}$$
the second equality expressing the fact that plant is autonomous and observation subsidiary to it, in that the pair $(X_t, Y_t)$ is no more informative for the prediction of $x_{t+1}$ than is $X_t$ alone. Relations (7) and (8) then imply

Theorem 12.2.1 The discrepancy has the additive decomposition

$$\mathbb{D}(X, Y \mid\, ; U) = \sum_{t=0}^{h-1} \left[ \mathbb{D}(x_{t+1} \mid X_t; U_t) + \mathbb{D}(y_{t+1} \mid X_{t+1}, Y_t; U_t) \right]. \tag{9}$$
Consider now the LQG case, when all expressions in (9) are quadratic in their arguments. The dynamics of the process itself are specified by

$$\mathbb{D}(X \mid\, ; U) = \sum_{t=0}^{h-1} \mathbb{D}(x_{t+1} \mid X_t; U_t) \tag{10}$$
and the conditional discrepancies will have the specific form

$$\mathbb{D}(x_{t+1} \mid X_t; U_t) = \tfrac{1}{2}(x_{t+1} - d_{t+1} - A_t X_t - B_t U_t)^\top N_{t+1}^{-1} (x_{t+1} - d_{t+1} - A_t X_t - B_t U_t) \tag{11}$$

for some vector $d$ and matrices $A$, $B$, $N$, in general time-dependent. Relations (10) and (11) between them imply a stochastic plant equation

$$x_{t+1} = A_t X_t + B_t U_t + d_{t+1} + \epsilon_{t+1} \qquad (0 \le t < h) \tag{12}$$

[…] $(\tau > t)$ have been substituted the values which minimise $\mathbb{D}(X \mid\, ; U)$. The value of $u_t$ thus determined, $u_{\det}(X_t, U_{t-1})$, is the optimal value of $u_t$ for the deterministic process in closed-loop form.
Proof All that we have done is to use the deterministic version of (12) to express future process variables $x_\tau$ $(\tau > t)$ in the cost function in terms of control variables $U$ and current process history $X_t$, and then minimise the consequent expression for $C$ with respect to the as yet undetermined controls $u_\tau$ $(\tau \ge t)$. However, the point of the theorem is that this future course of the deterministic process is determined (for given $U$) by minimisation of $\mathbb{D}(X \mid\, ; U)$ with respect to these future variables. □

Otherwise expressed, the 'deterministic future path' is exactly the most probable future path for given $X_t$ and $U$. In optimising the 'deterministic process' we have supposed current plant history $X_t$ known; a supposition we now drop.
Exercises and comments

(1) Note that state structure would be expressed, as far as plant dynamics go, by $P(X \mid\, ; U) = \prod_t P(x_{t+1} \mid x_t; u_t)$.

(2) Relation (12) should indeed be seen as a canonical plant equation rather than necessarily the 'physical' plant equation. The physical plant equation might have the form […] where the plant noise $\epsilon^*$ is autocorrelated. However, we can substitute $\epsilon^*_{t+1} = E(\epsilon^*_{t+1} \mid X_t, U_t) + \epsilon_{t+1}$. This standardises the equation to the form (12), since the expectation is linear in the conditioning variables and $\epsilon$ has the required orthogonality properties. The deterministic forms of the two sets of equations will be equivalent, one being derivable from the other by linear operations.

3 THE CERTAINTY EQUIVALENCE PRINCIPLE
When we later develop the ideas of projection estimates and innovations then the certainty equivalence principle is rather easily proved in a version which immediately generalises; see Exercise 7.2. Many readers may find this sufficient. However, we find it economical to give a version which does not presume this apparatus, does not assume state structure and which holds for a more general optimisation criterion.

The LQG criterion is that one chooses a policy $\pi$ to minimise the expected cost $E_\pi(C)$. However, it is actually simpler and more economical for present purposes to consider the rather more general criterion: that $\pi$ should maximise $E_\pi[e^{-\theta C}]$ for prescribed positive $\theta$. Since this second criterion function is $1 - \theta E_\pi(C) + o(\theta)$ for small $\theta$, we see that the two criteria agree in the limit of small $\theta$. For convenience we shall refer to the two criteria as the LQG-criterion and the LEQG-criterion respectively, LEQG being an established term in which the EQ stands for 'exponential of quadratic'. The move to the LEQG criterion induces a measure of 'risk-sensitivity'; of regard for the variability of $C$ as well as its expectation. We shall see in Chapters 16 and 21 that LQG theory has a complete LEQG analogue. Indeed, LQG theory appears as almost a degenerate case of LEQG theory, and it is that fact which we exploit in this section: LEQG methods provide the insightful and economical treatment of the LQG case.

We wish to set up a dynamic programming equation for the LEQG model. If we define the somewhat transformed total value function

$$e^{-G(W_t)} = f(Y_t) \sup_\pi E_\pi[e^{-\theta C} \mid W_t],$$

where $f$ is the Gaussian probability density, then the dynamic programming equation takes the convenient form

$$e^{-G(W_t)} = \sup_{u_t} \int dy_{t+1}\, e^{-G(W_{t+1})}$$
[…] $(\tau > t)$ given their deterministic prediction. The value $u(X_t, U_{t-1})$ of $u_t$ which minimises the square bracket is thus $u_{\det}(X_t, U_{t-1}) + O(\theta)$. The minimal value of the square bracket in (17) is then $O(\theta)$, with the consequence that the value of $X_t$ minimising the curly bracket is just $X_t^{(t)} + O(\theta)$, with the definition of $X_t^{(t)}$ modified to that stated in the theorem. The final determination of the minimising value of $u_t$, which we know to be LEQG-optimal at $\theta$, is then […]

4 APPLICATIONS: THE EXTENDED CEP

[…] of later decisions formed in this way are not in general the optimal values of these quantities, but they are indeed the best estimates one can form at time $t$ of what these optimal future decisions will be (an assertion we make explicit in Exercise 7.1). This is one advantage of the extended formulation; that optimisation is seen as part of a provisional forward plan. This plan will of course be updated as time moves on and one gains new information, but the fact that it has been formed gives one much more conscious appreciation of the course of events and the contingencies against which one is guarding. It is a view of optimisation which lies close to the intuitive methods by which we cope with daily life. Again, one can imagine formal estimates being replaced by informal estimates in cases where one does not have the basis for a complete stochastic model.
Of course, the point about the recursive approach is that it is economical: one does not have to hold a whole future in one's mind (or processors). The chess analogue we have already given in Section 1.3 brings home the contrast between
the two views. To play chess by the extended method is to choose one's current move after analysing all likely sequences of play some time into the future. To play chess by the recursive method is to give every board configuration $x$ a value $F(x)$ and to look for a move which changes the configuration to one of greater (or maximal possible) value. In other words, an ultimate goal (winning) is replaced by an immediate goal (improvement of configuration). Of course, the chess example is complicated by the fact that there are two players, optimising for different goals. However, there are plenty of other examples. One can say that Nature has allowed for the limited intelligence (and processing power) of her creations by replacing ultimate goals by intermediate goals. An animal does not eat in order that it may not die (a distant goal) or in order that the species may survive (even more distant) but in order that it may not feel hungry (immediate goal). Evolution has implanted a mechanism which both senses relevant aspects of state (a need for food) and evaluates them (so that a condition of extreme hunger is unpleasant).

5 CALCULATION OF ESTIMATES: THE KALMAN FILTER

In order to implement a control rule such as $u_t = K_t \hat{x}_t$ for an imperfectly observed state-structured system we have to be able to calculate $\hat{x}_t = x_t^{(t)}$. LQG assumptions will imply that the distribution of $x_t$ conditional on $W_t$ is normal, with mean $\hat{x}_t$ and covariance matrix $V_t$, say. The conditional mean value $\hat{x}_t$ is indeed identifiable with the conditionally most probable value, because it is the value which maximises $P(x_t \mid W_t)$. If the observation relation is also 'state-structured' in the appropriate sense then one can develop recursions which express $\hat{x}_t$ and $V_t$ in terms of $\hat{x}_{t-1}$, $V_{t-1}$, $y_t$ and $u_{t-1}$. The recursion for $\hat{x}_t$ is the ubiquitous Kalman filter (5). Such recursions supply the natural way of calculating the estimates, whether control is realised by digital or analogue means.
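To make the shape of these recursions concrete before they are derived, here is a minimal sketch of one filter step for the scalar case of the model (1)-(3). It assumes the gain $H_t = (L + AV_{t-1}C)(M + CV_{t-1}C)^{-1}$, which is the form consistent with the variance recursion (24); the function name `kalman_step` and all numerical values are illustrative assumptions rather than anything prescribed in the text.

```python
def kalman_step(x_hat, v, u_prev, y, A, B, C, N, L, M):
    """One scalar filter step: returns (new estimate, new variance, gain)."""
    s = M + C * v * C            # variance of the innovation y - C * x_hat
    g = L + A * v * C            # covariance between plant and observation innovations
    H = g / s                    # gain (scalar form assumed for relation (23))
    innovation = y - C * x_hat
    x_new = A * x_hat + B * u_prev + H * innovation   # estimate update, relation (22)
    v_new = N + A * v * A - g * g / s                 # variance update, relation (24)
    return x_new, v_new, H


# Illustrative parameter values only (assumptions).
A, B, C = 1.0, 0.0, 1.0
N, L, M = 1.0, 0.0, 1.0

# A single step from estimate 0 with variance 1, on observing y = 2.
first_step = kalman_step(0.0, 1.0, 0.0, 2.0, A, B, C, N, L, M)

# The variance recursion never uses the observations, so it can be run offline.
v_limit = 1.0
for _ in range(60):
    _, v_limit, _ = kalman_step(0.0, v_limit, 0.0, 0.0, A, B, C, N, L, M)

print(first_step, v_limit)
```

The second loop illustrates the remark of Section 1 that the conditional variance develops by deterministic, policy-independent rules: only the estimate depends on what is actually observed.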
Before developing them we shall clarify notation and some properties of the normal distribution. Suppose that $x$ and $y$ are a pair of random vectors (so defined as joint random variables). Let us suppose for convenience that they have zero means: $E(x) = 0$, $E(y) = 0$. Then the matrix $E(xy^\top)$ with $jk$th element $E(x_j y_k)$ is the covariance matrix between $x$ and $y$ (note that the order is relevant). We shall denote it variously

$$E(xy^\top) = \operatorname{cov}(x, y) = V_{xy}. \tag{20}$$

We shall denote $\operatorname{cov}(x, x)$ simply by $\operatorname{cov}(x)$; the covariance matrix of the random vector $x$. If $x$ is scalar then $\operatorname{cov}(x)$ is simply the variance of $x$, which we shall write as $\operatorname{var}(x)$. If $\operatorname{cov}(x, y) = 0$ then one says that $x$ and $y$ are mutually uncorrelated, and writes this $x \perp y$. The relation is indeed an orthogonality relation under conventions natural to the context.
Lemma 12.5.1 Suppose the vectors $x$ and $y$ are jointly normally distributed with zero means. Then the distribution of $x$ conditional on $y$ is normal with mean and covariance matrix given by

$$E(x \mid y) = V_{xy}V_{yy}^{-1}y, \qquad \operatorname{cov}(x \mid y) = V_{xx} - V_{xy}V_{yy}^{-1}V_{yx}. \tag{21}$$

Proof Denote the value asserted for $E(x \mid y)$ in (21) by $\hat{x}$; we shall later see this notation as consistent. Then one readily verifies that $x - \hat{x}$ and $y$ are independently (and normally) distributed, so that the unconditioned distribution of $x - \hat{x}$ is the same as its distribution conditional on the value of $y$. But one again verifies that the unconditioned distribution of $x - \hat{x}$ is normal with zero mean and covariance matrix equal to the expression for $\operatorname{cov}(x \mid y)$ asserted in (21). The conclusion of the lemma thus follows. □

We have implicitly assumed $V_{yy}$ non-singular, but the assertions of the lemma remain meaningful even if this is not the case; see Exercise 1. We come now to the principal conclusions.
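The two verifications appealed to in the proof of the lemma are easy to reproduce numerically. The sketch below assumes a scalar $x$ and a two-component $y$ with zero means; the helper `inv2` and the covariance values are illustrative assumptions. It checks that the residual $x - \hat{x}$ is uncorrelated with $y$ and that its variance equals the conditional variance given by the lemma.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


# Illustrative covariances (assumptions): x scalar, y two-dimensional.
vxx = 2.0
vxy = [0.5, -0.3]
vyy = [[1.0, 0.2], [0.2, 1.5]]

w = inv2(vyy)
# Row vector h = V_xy V_yy^{-1}, so that the estimate is x_hat = h . y
h = [vxy[0] * w[0][j] + vxy[1] * w[1][j] for j in range(2)]

# cov(x - h.y, y_j) = V_xy[j] - (h V_yy)[j]; each component should vanish.
resid_cov = [vxy[j] - (h[0] * vyy[0][j] + h[1] * vyy[1][j]) for j in range(2)]

# Conditional variance from the lemma: V_xx - V_xy V_yy^{-1} V_yx
cond_var = vxx - (h[0] * vxy[0] + h[1] * vxy[1])

# Variance of the residual computed directly: V_xx - 2 h.V_yx + h V_yy h^T
h_vyy_h = sum(h[i] * vyy[i][j] * h[j] for i in range(2) for j in range(2))
direct_var = vxx - 2.0 * (h[0] * vxy[0] + h[1] * vxy[1]) + h_vyy_h

print(resid_cov, cond_var, direct_var)
```

Observation can only reduce uncertainty here: the conditional variance lies strictly between zero and the unconditioned variance of $x$.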
Theorem 12.5.2 Assume the imperfectly observed state-structured model specified in relations (1)-(4) of Section 1. Suppose initial conditions prescribe $x_0$ as normally distributed with mean $\hat{x}_0$ and covariance matrix $V_0$. The model is then LQG and the distribution of $x_t$ conditional on $W_t$ is normal with mean $\hat{x}_t$ and covariance matrix $V_t$, say.

(i) The estimate $\hat{x}_t$ obeys the updating relation

$$\hat{x}_t = A\hat{x}_{t-1} + Bu_{t-1} + H_t(y_t - C\hat{x}_{t-1}) \tag{22}$$

(the Kalman filter), where

$$H_t = (L + AV_{t-1}C^\top)(M + CV_{t-1}C^\top)^{-1}. \tag{23}$$

(ii) The covariance matrix $V_t$ obeys the updating relation

$$V_t = N + AV_{t-1}A^\top - (L + AV_{t-1}C^\top)(M + CV_{t-1}C^\top)^{-1}(L^\top + CV_{t-1}A^\top). \tag{24}$$

Proof There are many proofs, and we shall indeed give one in Section 8 which dispenses with the assumption of normality. However, this assumption has been intrinsic to the LQG formulation and makes for what is by far the most economical proof.

The preliminary assertions of the theorem follow from previous discussion. If we denote the estimation error $x_t - \hat{x}_t$ by $\Delta_t$ then $V_t$ is exactly the covariance matrix of $\Delta_t$. The quantities $x_{t-1} - \hat{x}_{t-1}$, $x_t - Ax_{t-1} - Bu_{t-1}$ and $y_t - Cx_{t-1}$ are to be identified with $\Delta_{t-1}$, $\epsilon_t$ and $\eta_t$ respectively. These are jointly normally distributed with zero mean and covariance matrix

$$\begin{bmatrix} V & 0 & 0 \\ 0 & N & L \\ 0 & L^\top & M \end{bmatrix}$$

where we have written $V_{t-1}$ simply as $V$. This effectively gives us the joint distribution of $x_{t-1}$, $x_t$ and $y_t$ conditional on $W_{t-1}$, $u_{t-1}$. What we must do is to integrate out the unobservable $x_{t-1}$ and then condition the distribution of $x_t$ by the value of $y_t$, so obtaining the distribution of $x_t$ conditional on $(W_{t-1}, u_{t-1}, y_t) = W_t$, as desired. Integration over values of $x_{t-1}$ is equivalent to the […]

[…] is fundamental, and valid even if $V$ is singular (in that the evaluation is $+\infty$
unless $x$ lies in the orthogonal complement of the null space of $V$). It gives us the evaluation of the discrepancy

$$\mathbb{D}(x, y) = \sup_{\lambda, \mu} \left[ \lambda^\top x + \mu^\top y - \tfrac{1}{2} \begin{bmatrix} \lambda \\ \mu \end{bmatrix}^\top \begin{bmatrix} V_{xx} & V_{xy} \\ V_{yx} & V_{yy} \end{bmatrix} \begin{bmatrix} \lambda \\ \mu \end{bmatrix} \right].$$

The extremising values of $\lambda$ and $\mu$ are the differentials of $\mathbb{D}$ with respect to $x$ and $y$. If $x$ has the discrepancy-minimising value $\hat{x}$ then $\lambda$ must be zero, so the extremal equations with respect to $\lambda$ and $\mu$ become $\hat{x} = V_{xy}\mu$ and $y = V_{yy}\mu$. This amounts just to the equation (35). The analogues of $\lambda$ and $\mu$ in the dynamic context are just the Lagrange multipliers associated with plant and observation equations, and it is by their introduction that a direct trajectory optimisation finds its natural completion (see Section 9) as it did already in the case of perfect observation (Section 6.3).
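The dual evaluation of the discrepancy can be illustrated in the scalar case, where $\sup_\lambda [\lambda x - \tfrac{1}{2}V\lambda^2] = x^2/(2V)$ with the supremum attained at $\lambda = x/V$, the derivative of the discrepancy. The following is a minimal sketch under that scalar assumption; the function name and numerical values are illustrative only.

```python
def dual_objective(lam, x, v):
    """The bracketed expression lam * x - 0.5 * v * lam**2 for scalar x."""
    return lam * x - 0.5 * v * lam * lam


# Illustrative values (assumptions).
x, v = 1.2, 0.8

lam_star = x / v                        # stationary point: the derivative of x**2 / (2 v)
best = dual_objective(lam_star, x, v)   # value of the supremum
direct = 0.5 * x * x / v                # discrepancy computed from the inverse variance

# Moving away from lam_star in either direction lowers the objective.
nearby = [dual_objective(lam_star + d, x, v) for d in (-0.5, -0.1, 0.1, 0.5)]

# Singular case v = 0 with x != 0: the objective grows without bound in lam,
# which is the sense in which the evaluation is +infinity.
unbounded = [dual_objective(lam, x, 0.0) for lam in (1.0, 10.0, 100.0)]

print(best, direct, nearby, unbounded)
```

The point of the dual form is visible in the last line: it remains meaningful when the variance is zero, whereas the direct expression involving the inverse does not.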
7 INNOVATIONS

The previous section was concerned with a one-stage situation. In the temporal context we shall have a multi-stage situation, in which at every instant of time $t$ ($= 0, 1, 2, \ldots$) one receives a new vector observation $y_t$. The observation history at time $t$ is then $Y_t = \{y_\tau;\ 0 \le \tau \le t\}$. Define the innovation $\zeta_t$ at time $t$ by

$$\zeta_0 = y_0 - E(y_0), \qquad \zeta_t = y_t - \mathscr{E}(y_t \mid Y_{t-1}) \quad (t = 1, 2, 3, \ldots). \tag{40}$$

The innovation is thus the deviation of the actual observation $y_t$ at time $t$ from the projection forecast one would have made of it the moment before, at time $t - 1$. It thus represents the 'new' information gained at time $t$, in that it is that part of $y_t$ which could not have been predicted from earlier information.
Theorem 12.7.1 The innovations are mutually uncorrelated.

Proof Certainly $\zeta_t \perp Y_{t-1}$. It follows then that $\zeta_t \perp \zeta_\tau$ for $\tau < t$, because $\zeta_\tau$ is a linear function of $Y_{t-1}$. □

Now define

$$x^{(t)} = \mathscr{E}(x \mid Y_t),$$

the projection estimate of a given random vector $x$ based on information at time $t$. We shall use this superscript notation consistently from now on, to denote the optimal determination (of the quantity to which the superscript is applied, and under a criterion of optimality which may vary) based on information at time $t$. It follows, by appeal to (39), that

$$x^{(t)} = \mathscr{E}(x \mid Y_{t-1}) + \mathscr{E}(x \mid \zeta_t) = x^{(t-1)} + H_t\zeta_t \tag{41}$$
where the matrix $H_t$ in fact has the determination

$$H_t = \operatorname{cov}(x, \zeta_t)[\operatorname{cov}(\zeta_t)]^{-1}. \tag{42}$$

Equation (41) shows that the estimate of $x$ based on $Y_{t-1}$ can be updated to the estimate based on $Y_t$ simply by the addition of a term linear in the last innovation. This is elegant conceptually, and turns out to provide the natural algorithm in the dynamic context. Equation (37) of course implies further that

$$x^{(t)} = \sum_{\tau=0}^{t} \mathscr{E}(x \mid \zeta_\tau) = \sum_{\tau=0}^{t} H_\tau\zeta_\tau$$
which is nothing but the development of a projection in terms of an orthogonal basis. However, the innovations are more than just an orthogonalisation of the sequence $\{y_t\}$; the fact that the orthogonalisation is achieved by the time-ordered rule (40) means that $\zeta_t$ indeed has the character expressed in the term 'innovation'. The innovations themselves are calculated by the forward recursion

$$\zeta_t = y_t - \sum_{\tau=0}^{t-1} \mathscr{E}(y_t \mid \zeta_\tau). \tag{43}$$
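The forward recursion for the innovations is a time-ordered orthogonalisation, and can be carried out directly on a covariance matrix. The sketch below assumes zero-mean scalar observations $y_0, \ldots, y_3$ with an illustrative covariance $\operatorname{cov}(y_i, y_j) = \rho^{|i-j|}$; each innovation is represented by its vector of coefficients on the observations. The function names and the choice of covariance are assumptions made for the example.

```python
def cov_lin(a, b, gamma):
    """cov(a.y, b.y) = a^T Gamma b for coefficient vectors a and b."""
    n = len(gamma)
    return sum(a[i] * gamma[i][j] * b[j] for i in range(n) for j in range(n))


def innovations(gamma):
    """Coefficient vectors of the innovations, built by the forward recursion."""
    n = len(gamma)
    zetas = []
    for t in range(n):
        e_t = [1.0 if i == t else 0.0 for i in range(n)]   # coefficients of y_t itself
        coeff = e_t[:]
        for z in zetas:
            # Projection of y_t on an earlier innovation: cov(y_t, z) / var(z)
            h = cov_lin(e_t, z, gamma) / cov_lin(z, z, gamma)
            coeff = [c - h * zi for c, zi in zip(coeff, z)]
        zetas.append(coeff)
    return zetas


# Illustrative covariance matrix (assumption): rho ** |i - j| with rho = 0.5.
rho, n = 0.5, 4
gamma = [[rho ** abs(i - j) for j in range(n)] for i in range(n)]

zetas = innovations(gamma)
cross = [cov_lin(zetas[s], zetas[t], gamma) for t in range(n) for s in range(t)]
variances = [cov_lin(z, z, gamma) for z in zetas]

print(zetas[2], variances)
```

For this particular covariance the innovation at time $t \ge 1$ reduces to $y_t - \rho y_{t-1}$, so only the most recent observation is needed to form the forecast; the cross-covariances in `cross` vanish to rounding error, as Theorem 12.7.1 requires.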
Alternatively, we can take the dual point of view, and calculate them by minimising the discrepancy backwards in time rather than by doing least square calculations forward in time.

Theorem 12.7.2 The value of $y_t$ minimising $\mathbb{D}(Y_t)$ is $\mathscr{E}(y_t \mid Y_{t-1})$ and the minimised value is $\mathbb{D}(Y_{t-1})$. Further,

$$\mathbb{D}(Y_t) = \mathbb{D}(Y_{t-1}) + \tfrac{1}{2}\zeta_t^\top M_t^{-1}\zeta_t \tag{44}$$

where $M_t = \operatorname{cov}(\zeta_t)$.

Proof The first pair of assertions follows from the theorems of the last section, and relation (44) is nothing but relation (30) transferred to this case. □

We plainly then have the additive decomposition of the discrepancy

$$\mathbb{D}(Y_t) = \tfrac{1}{2}\sum_{\tau=0}^{t} \zeta_\tau^\top M_\tau^{-1}\zeta_\tau.$$
Exercises and comments

(1) Let $u_\tau^{(t)}$ be the estimate of the optimal value of $u_\tau$ formed at time $t$ by the application of the extended CEP in Section 4. Show that indeed $u_\tau^{(t)} = \mathscr{E}(u_\tau^{(\tau)} \mid W_t)$.
(2) A proof of the certainty equivalence principle. Consider the state-structured LQG model expressed in equations (1)-(4), the criterion being minimisation of expected cost. Let $F(x_t, t)$ and $u(x_t, t)$ be the value function and optimal control rule at time $t$ under perfect state-observation. Note that $x_{t+1}^{(t+1)} = x_{t+1}^{(t)} + z_t = A\hat{x}_t + Bu_t + z_t$ where, conditional on $(W_t, u_t)$, the random variable $z_t$ has zero expectation and a covariance matrix independent of policy or process history. Hence show, by an inductive argument, that the value function and optimal control in the case of imperfect observation are $F(\hat{x}_t, t) + \cdots$ and $u(\hat{x}_t, t)$ respectively, where $+ \cdots$ indicates a term independent of policy or process history.

8 THE KALMAN FILTER REVISITED

We can now give a quick derivation of the Kalman filter under purely second-order assumptions, when $\hat{x}_t$ is defined as the projection estimate $\mathscr{E}(x_t \mid W_t)$. The proof is in some ways more enlightening than that of Section 5, although it needs more in the way of preliminary analysis. We have

$$\hat{x}_t = \mathscr{E}(Ax_{t-1} + Bu_{t-1} + \epsilon_t \mid W_t) = Bu_{t-1} + \mathscr{E}(Ax_{t-1} + \epsilon_t \mid W_{t-1}) + \mathscr{E}(Ax_{t-1} + \epsilon_t \mid \zeta_t) = A\hat{x}_{t-1} + Bu_{t-1} + H_t\zeta_t \tag{45}$$

for appropriate $H_t$. Here $\zeta_t$ is the innovation

$$\zeta_t = y_t - \mathscr{E}(y_t \mid W_{t-1}) = y_t - C\hat{x}_{t-1}, \tag{46}$$

so the form (22) of the Kalman filter recursion follows from (45) and (46). Further

$$H_t = \operatorname{cov}(Ax_{t-1} + \epsilon_t, \zeta_t)[\operatorname{cov}(\zeta_t)]^{-1} = \operatorname{cov}(\tilde{\zeta}_t, \zeta_t)[\operatorname{cov}(\zeta_t)]^{-1}, \tag{47}$$

where

$$\tilde{\zeta}_t = Ax_{t-1} + \epsilon_t - \mathscr{E}(Ax_{t-1} + \epsilon_t \mid W_{t-1}). \tag{48}$$

But, as in Section 5, $\tilde{\zeta}_t$ and $\zeta_t$ can be written as $\epsilon_t + A\Delta_{t-1}$ and $\eta_t + C\Delta_{t-1}$ respectively, and are jointly normally distributed with zero means and with covariance matrix (25). The expression (23) for $H_t$ thus follows, as does expression (24) for […] (49)

The quantity $\zeta_t$ is certainly the innovation in the observations. The quantity $\tilde{\zeta}_t$ equals $x_t - \mathscr{E}(x_t \mid W_{t-1})$, and so can be regarded as the plant innovation.

9 ESTIMATION AS DUAL TO CONTROL

The recursion (24) for the error covariance matrix $V_t$ stands in obvious parallel to the Riccati recursion (2.25)/(2.26) of LQ control optimisation. The analogy is
complete, with only the 'dualising' modifications that transposes of covariance matrices replace cost matrices and the recursion goes forward instead of backward. In fact, there is a duality between control and estimation which is constantly in evidence, but is revealed cleanly only by adoption of the Lagrangian methods of Section 6.3.

The stochastic model must be completed by prescription of initial conditions, which in the present state-structured case means prescription of the distribution of $x_0$ conditional on $W_0$; normal with mean $\hat{x}_0$ and covariance matrix $V_0$. The expression for $\mathbb{D}_t = \mathbb{D}(X_t, Y_t \mid W_0; U_{t-1})$, the discrepancy for the realisation up to time $t$, is then

$$\mathbb{D}_t = \tfrac{1}{2}(\Delta^\top V^{-1}\Delta)_0 + \tfrac{1}{2}\sum_{\tau=1}^{t} \begin{bmatrix} \epsilon \\ \eta \end{bmatrix}_\tau^\top \begin{bmatrix} N & L \\ L^\top & M \end{bmatrix}^{-1} \begin{bmatrix} \epsilon \\ \eta \end{bmatrix}_\tau \tag{50}$$

where $\Delta$, $\epsilon$ and $\eta$ are understood as expressed in terms of $x$, $y$ and $u$ by appeal to the plant and observation relations (1), (2) and the initial relation

$$\Delta_0 = x_0 - \hat{x}_0. \tag{51}$$

The projection estimate of the course of the process up to time $t$, $\{x_\tau^{(t)};\ 0 \le \tau \le t\}$, is obtained by minimising $\mathbb{D}_t$ with respect to the corresponding $x$-values.

However, the nearer the model is to being deterministic, the less natural does this formulation seem. The covariance matrices whose inverses appear in (11) and (50) will be singular if any noise component is identically zero, and this plainly presents problems in discussing and minimising these forms.

The resolution turns out to be a continuation of the Lagrangian reformulation of Section 6.3 for the deterministic case, which indeed carries over naturally to the stochastic imperfectly observed model. That is, we view the plant and observation equations together with the initial condition (51) as constraints which are best taken care of by the introduction of Lagrange multipliers. Let us introduce multipliers $l_\tau$ to take account of constraint (51) in the case $\tau = 0$ and of the plant equation at time $\tau$ in the case $\tau > 0$; correspondingly, a multiplier $m_\tau$ to take account of the observation relation at time $\tau$. One then has the Lagrangian form

$$\mathbb{D}_t - l_0^\top(x_0 - \hat{x}_0 - \Delta_0) - \sum_{\tau=1}^{t} \left[ l_\tau^\top(x_\tau - Ax_{\tau-1} - Bu_{\tau-1} - \epsilon_\tau) + m_\tau^\top(y_\tau - Cx_{\tau-1} - \eta_\tau) \right],$$

to be minimised with respect to $x$ and the noise variables and to be maximised with respect to the multipliers. Properly, the multipliers should have a superscript $t$ to indicate that the extremisation is one based on information at time $t$. However, for notational simplicity we shall include this superscript only when the point needs emphasis.
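Theorem 12.9.1 If the noise variables are minimised out of the Lagrangian form then the problem is reduced to that of rendering the time integral

$$O(l, m, x) = \left[ \tfrac{1}{2}l^\top Vl + l^\top(x - \hat{x}) \right]_0 + \sum_{\tau=1}^{t} \left[ v(l_\tau, m_\tau) + l_\tau^\top(x_\tau - Ax_{\tau-1} - Bu_{\tau-1}) + m_\tau^\top(y_\tau - Cx_{\tau-1}) \right] \tag{52}$$

maximal with respect to the $x$ variables and minimal with respect to the $(l, m)$ variables. Here

$$v(l, m) = \tfrac{1}{2}\begin{bmatrix} l \\ m \end{bmatrix}^\top \begin{bmatrix} N & L \\ L^\top & M \end{bmatrix} \begin{bmatrix} l \\ m \end{bmatrix} \tag{53}$$

and the noise estimates are related to the multipliers by

$$V_0 l_0^{(t)} = \hat{x}_0 - x_0^{(t)} = -\Delta_0^{(t)}, \tag{54}$$

$$\begin{bmatrix} \epsilon \\ \eta \end{bmatrix}_\tau^{(t)} = -\begin{bmatrix} N & L \\ L^\top & M \end{bmatrix} \begin{bmatrix} l \\ m \end{bmatrix}_\tau^{(t)}. \tag{55}$$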
Proof Equations (54) and (55) are just the stationarity conditions for the Lagrangian form with respect to the noise variables. At the values thus determined the Lagrangian form reduces to $-O(l, m, x)$, and remaining assertions follow. □

Note that the transformation clears the form of matrix inverses. Note also that $O$ is the analogue of the Lagrangian form (6.19) for the deterministic control case, with $(l, m, x)$ replacing $(x, u, \lambda)$. The quadratic form $v(l, m)$ is just the information analogue of the instantaneous cost $c(x, u)$ defined in (4). In general, we shall see that the primal and dual variables switch roles when we move from the control problem defined on the future to the estimation problem defined on the past.

The multipliers $l$ and $m$ are to be regarded as differentials of $\mathbb{D}_t$ with respect to the values of noise variables, which in turn implies the relations (54) and (55) between multipliers and noise estimates.

By taking the extremisations of the form $O$ in various orders we can obtain the analogues of various familiar assertions. Firstly, by writing down all the stationarity conditions simultaneously we obtain the equation system

$$\begin{bmatrix} N & L & \mathscr{A} \\ L^\top & M & -C\mathscr{T} \\ \bar{\mathscr{A}} & -C^\top\mathscr{T}^{-1} & 0 \end{bmatrix} \begin{bmatrix} l \\ m \\ x \end{bmatrix}_\tau^{(t)} + \begin{bmatrix} -Bu_{\tau-1} \\ y_\tau \\ 0 \end{bmatrix} = 0 \qquad (1 \le \tau \le t) \tag{56}$$

together with the end conditions

$$V_0 l_0^{(t)} = \hat{x}_0 - x_0^{(t)}, \qquad (l, m)_\tau^{(t)} = 0 \quad (\tau > t). \tag{57}$$
In (56) we have again used the operator notation $\mathscr{A} = I - A\mathscr{T}$, with $\bar{\mathscr{A}} = I - A^\top\mathscr{T}^{-1}$ etc. Note also that the translation operator acts on the subscript $\tau$, not on the superscript $t$, which is regarded as fixed for the moment.

In equation (56) we see the complete dual of the corresponding system (6.20) for the control problem. We can use the operator and operator factorisation approach just as we did in Sections 6.3 and 6.5; the analogue of an infinite horizon will now be the passage from a start-up point at $\tau = 0$ to $\tau \to -\infty$. We shall follow through these ideas explicitly and for a more general case in Chapter 20.

Theorem 12.9.2 If one extremises out the $x$-variables in $O$ then the problem is reduced to that of minimising the form

$$\tfrac{1}{2}(l^\top Vl)_0 + \sum_{\tau=1}^{t} \left[ v(l_\tau, m_\tau) - l_\tau^\top Bu_{\tau-1} + m_\tau^\top y_\tau \right] \tag{58}$$

with respect to the variables $(l, m)$, subject to the backward equation

$$l_\tau = A^\top l_{\tau+1} + C^\top m_{\tau+1} \qquad (0 \le \tau \le t) \tag{59}$$

and end-condition $(l, m)_\tau = 0$ $(\tau > t)$.

The assertion is immediate. Its point is that it exhibits the estimation problem as the complete analogue of the control problem originally posed in Section 2.4. Equation (59) is the analogue of the plant equation, $v(l, m)$ is the analogue of the instantaneous cost, and the initial term $\tfrac{1}{2}(l^\top Vl)_0$ is the analogue of the terminal cost. The difference lies only in the occurrence of a few terms linear in $(l, m)$ in the sum (58): the observation and control variables constitute effective input terms which lead to a non-homogeneity.

However, we are in general not interested in estimating all of process history, but merely the value of current state $x_t$. For this we resort to recursive arguments, just as the optimal control was deduced by the recursive argument of the dynamic programming equation.

Theorem 12.9.3 The extremum of form (52) for prescribed $x_t$ and $l_t$ is $\left[ \tfrac{1}{2}l^\top Vl + l^\top(x - \hat{x}) \right]_t + \cdots$, where $+ \cdots$ indicates terms independent of these variables.

Proof Let us drop the $t$-subscripts for the moment. The constrained extremum is certainly of the form $\tfrac{1}{2}l^\top Pl + l^\top(x - \mu) + \cdots$ for some $P$, $\mu$. If we minimise with respect to $l$ then this becomes $-\tfrac{1}{2}(x - \mu)^\top P^{-1}(x - \mu) + \cdots$. But this must be identical with $-\tfrac{1}{2}(x - \hat{x})^\top V^{-1}(x - \hat{x}) + \cdots$, whence the assertion follows. □
Theorem 12.9.4 The recursion […] (60) holds, with

$$l_{t-1} = A^\top l_t + C^\top m_t. \tag{61}$$

The Kalman filter (22) and the forward Riccati recursion (24) for $V$ follow.

Proof It follows from the previous theorem and expression (52) for the time integral that we have the recursion

$$\left[ \tfrac{1}{2}l^\top Vl + l^\top(x - \hat{x}) \right]_t = \operatorname{ext}\left\{ \left[ \tfrac{1}{2}l^\top Vl + l^\top(x - \hat{x}) \right]_{t-1} + v(l_t, m_t) + l_t^\top(x_t - Ax_{t-1} - Bu_{t-1}) + m_t^\top(y_t - Cx_{t-1}) \right\}, \tag{62}$$

where the extremum is with respect to $l_{t-1}$, $x_{t-1}$ and $m_t$. The extremum with respect to $x_{t-1}$ yields the condition (61) and the recursion reduces to (60). The Kalman and Riccati relations then follow immediately. □
The reader may ask why we should bother with yet a third derivation of the Kalman filter. Well, the derivation is incidental to the goal; a formalism which is claimed as significant could hardly be regarded as living up to that claim unless it delivered the Kalman filter in passing. On~ point is that the Lagrangian approach, already seen as valuable for the optifiisation of deterministic control, is now seen to extend to this stochastic imperf~ctiY observed case. More sigiuficantly, it achieves the goal which has loomed ever more clearly as desirable: the characterisation of the optimisation problem as the free extremisation of a timeintegral such as (52) with respect to its variables. Constraints can of course not ~e eliminated, but they can be exhibited as natural consequ ences of a free optimisation in a higherlevel problem. This higherlevel formulation was achieved in two steps. The first was the deduction of the certainty equivalence ,principle, which absorbed the constraint constituted by the realisability condition (that control must depend on current observables alone). The second was the introduction of the dual variables I and m, which absorbed the constraints constituted by plant and observation equations. Comparison of the equation systems (6.20) and (56) demonstrates the complete symmetry between control/future and estimation/past, with primal variables (x, u) and dual variable s(/, m) switching roles in the two cases. There are other byproducts in terms of insight and solution: reduction of these equation systems by the canonical factorisation methods of Section 6.3 solves the stationary optimisation problem for stochastic dynamics of any order (see Chapter 1821). Remarkably, insight becomes complete first when we consider the risksensitive models of Chapters 16 and 21, for which past and future timeintegrals are
9 ESTIMATION AS DUAL TO CONT ROL
253
,A. assoc iated in a single integral, and the Lagrange multipliers land d. relate be the plant equation in the past and in the future can
~~""""
Exercises and comments

(1) Alternative forms. By varying the order in which the extrema are taken in the last equation, demonstrate that the Kalman filter (22) holds with the alternative evaluation of $H$ and recursion for $P$:

(63)

(64)

Here $\tilde{A} = A - LM^{-1}C$ and $\tilde{N} = N - LM^{-1}L^{T}$. Relations (63)/(64) have the same continuous-time limit version (28)/(29) as do (23)/(24), just as was the case for the corresponding control relations.
CHAPTER 13

Stationary Processes; Spectral Theory

1 STATIONARITY: THE AUTOCOVARIANCE FUNCTION

Consider a system which is intrinsically time-homogeneous and for which a stationary control rule is adopted. If this control rule is successful, in that it is stabilising, then the system will eventually settle down to steady-state behaviour. A stochastic process showing such steady-state behaviour is itself termed stationary.

More technically, a process is stationary if it is statistically invariant under time-translation. To appreciate what is meant by this, consider first the discrete-time case. Denote the process by $\{x_t\}$, so that $x$ is the process variable, and denote a realisation (i.e. the actual course of the whole process in a particular case) by $X$. Recall the backward translation operator $\mathcal{T}$ introduced in Section 4.2, which has the effect that the $t$th term in the transformed sequence $\mathcal{T}X$ is $x_{t-1}$ rather than $x_t$. More generally, the $t$th term in the sequence $\mathcal{T}^r X$ is $x_{t-r}$, for any integral $r$. One says then that the process $\{x_t\}$ is stationary if

$$E[\phi(\mathcal{T}^r X)] = E[\phi(X)] \qquad (1)$$

for any integral $r$ and for any scalar functional $\phi$ of the realisation for which the right-hand member in (1) is defined. This is exactly what is meant by 'statistical invariance under time-translation'. Note that (1) will in fact hold for all $r$ if it holds for $r = \pm 1$.

Our approach to control optimisation hitherto has been to optimise in the finite horizon initially and then see if the optimal control tended to a stationary form in the infinite-horizon limit. The process may then itself converge to stationarity in the course of time (i.e. in the infinite-history limit). However, one could move to the stationary situation immediately, by asking for the stationary rule which minimises an expected cost per unit time in the stationary state. We make some observations on this point in Exercise 2, but it is clear, in any event, that there is interest in the study of the stationary case. Indeed, under LQ assumptions there are two principal bodies of technique: one dealing with the state-structured case by recursive methods and the other dealing with the stationary case by transform methods.

If we consider LQ models then there is an assumption that $x$ is vector-valued (a column vector of dimension $n$, say) and that the only functions of $X$ which we need consider are linear/quadratic. Indeed, if the process is Gaussian, then its statistics are completely characterised by its first- and second-order moments.
As far as first-order moments are concerned, the stationarity condition (1) will imply that $E(x_t)$ is independent of $t$, so that

$$E(x_t) = \mu \qquad (2)$$

for some constant $n$-vector $\mu$. As far as second-order moments are concerned, (1) will imply that the covariance between $x_t$ and $x_{t-r}$ is a function of $r$ alone:

$$\mathrm{cov}(x_t, x_{t-r}) = v_r, \qquad (3)$$
say, for integral $r$. The quantity $v_r$ is termed the autocovariance at lag $r$ of the process. It is an $n \times n$ matrix whose $jk$th element is the covariance between the $j$th element of $x_t$ and the $k$th element of $x_{t-r}$. It provides the only measure one has of dependence between values of the process variable at different times, if one is restricted to knowledge of first- and second-order moments. It is therefore the only measure there is at all in the Gaussian case. If one regards $v_r$ as a function of the lag $r$ then it is termed the autocovariance function. The full expression

$$v_r = E[(x_t - \mu)(x_{t-r} - \mu)^T] \qquad (4)$$

for $v_r$ and stationarity imply that

$$v_{-r} = v_r^T. \qquad (5)$$
That is, the autocovariance function is invariant under the combined operations of transposition and lag-reversal. This plus the further property indicated in Exercise 1 are in fact the characterising properties of an autocovariance function.

Commonly one supposes that the process has been reduced to zero mean (by adoption of a new variable $x - \mu$). In this case (4) reduces to $v_r = E[x_t x_{t-r}^T]$.

A degenerate but important stationary process is vector white noise, $\{\epsilon_t\}$, for which $v_r$ is zero for non-zero $r$. One normally supposes that this has zero mean, and, if $v_0 = V$, then one speaks of $\{\epsilon_t\}$ as 'white noise of covariance matrix $V$'.

The corresponding assertions for the case of continuous time are largely immediate; one simply allows the time variable $t$ and the lag variable $r$ to take any value on the real line. The one point that does need special discussion is the character of white noise, already covered to some extent in Section 9.5. Discrete-time white noise $\{\epsilon_t\}$ of covariance matrix $V$ has the property that the linear form $\sum_t a_t \epsilon_t$ has zero mean and covariance matrix $\sum_t a_t V a_t^T$, at least for all sequences of matrix coefficients $\{a_t\}$ such that this last sum is convergent. The corresponding characterisation of continuous-time white noise is that the linear form $\int a(t)\epsilon(t)\,dt$ is normally distributed with zero mean and covariance matrix $\int a(t) V a(t)^T\,dt$. The additional property of normality follows from the assumed independence and stationarity properties. If $\epsilon$ had autocovariance function $v(r)$ then the relation

$$\mathrm{cov}\left(\int a(t)\epsilon(t)\,dt\right) = \iint a(t_1)\, v(t_1 - t_2)\, a(t_2)^T\,dt_1\,dt_2$$

would hold in regular cases. The evaluation asserted implies then that $v(r) = V\delta(r)$, where $\delta$ is the Dirac delta function; an indication of the exceptional character of continuous-time white noise. The matrix $V$ is the covariance matrix of the integral of $\epsilon$ over a unit time interval; it is appropriately referred to as the power matrix of $\epsilon$.

Exercises and comments
(1) Note that $\sum_j \sum_k a_j^T v_{j-k} a_k = E\left(\sum_t a_t^T x_t\right)^2 \ge 0$, for any sequence of column vectors $\{a_t\}$ for which the first sum converges.

(2) Note that there is no assumption in the dynamic programming equation that the controlled process is stationary, even when one is considering an assumed infinite-horizon limit (as with the simple rule $u_t = Kx_t$). One may say that this is because the control thus derived, although stationary, optimises passage to the steady state as well as performance in the steady state. If one tries to determine an optimal steady-state rule directly then one denies oneself an instrument (the reaction against transients) and may derive a rule less effective in dealing with such transients. For example, the rule $u_t = 0$ is optimal for the deterministic LQ problem of Section 2.4 if the equilibrium $x_t = 0$ has already been reached. It will not be optimal if $x$ is not zero, and will not even be stabilising unless $x = 0$ is a stable equilibrium of the uncontrolled process.

This is an extreme example of a policy which is 'average-cost optimal' but not optimal. If non-degenerate plant noise is present (as in Section 10.1, for $N > 0$) then there is only one policy which is average-cost optimal. This is the infinite-horizon optimal policy, which is optimal for arbitrary initial conditions, and so also copes optimally with transients.
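Property (5) and the non-negativity of Exercise 1 can be checked on a small example. The following is a minimal sketch, assuming a vector process $x_t = \epsilon_t + B\epsilon_{t-1}$ driven by white noise of covariance $V$; the matrices `B` and `V` and the helper name `ma1_autocovariance` are illustrative choices, not taken from the text.

```python
import numpy as np

def ma1_autocovariance(B, V, r):
    """Exact v_r = cov(x_t, x_{t-r}) for x_t = eps_t + B eps_{t-1} (assumed model)."""
    n = V.shape[0]
    if r == 0:
        return V + B @ V @ B.T
    if r == 1:
        return B @ V
    if r == -1:
        return V @ B.T
    return np.zeros((n, n))

B = np.array([[0.5, 0.2], [-0.3, 0.4]])   # illustrative coefficient matrix
V = np.array([[1.0, 0.3], [0.3, 2.0]])    # illustrative white-noise covariance

# Property (5): v_{-r} is the transpose of v_r.
property_5_holds = all(
    np.allclose(ma1_autocovariance(B, V, -r), ma1_autocovariance(B, V, r).T)
    for r in range(-2, 3)
)

# Exercise 1: the block matrix with (j, k) block v_{j-k} is the covariance
# matrix of the stacked vector (x_0, ..., x_{N-1}), so it cannot have a
# negative eigenvalue.
N = 5
big = np.block([[ma1_autocovariance(B, V, j - k) for k in range(N)] for j in range(N)])
min_eig = np.linalg.eigvalsh(big).min()
```

The smallest eigenvalue `min_eig` should come out non-negative up to rounding error.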
2 THE ACTION OF FILTERS: THE SPECTRAL DENSITY MATRIX

The use of $z$-transforms and Fourier transforms is as natural in this context as it was in Chapter 4, and for the same reason: because of the translation-invariant character of the process. Consider again the discrete-time case and define the autocovariance generating function (abbreviated AGF)

$$g(z) = \sum_{r=-\infty}^{\infty} v_r z^r, \qquad (6)$$

where $z$ is a complex scalar. The AGF is then an $n \times n$ matrix whose elements are functions of $z$. Property (5) implies that $g(z) = g(z^{-1})^T$, a relation which we shall write as $\bar{g} = g$, consistent with the usage of Sections 6.3 and 6.5. It follows then
that, if the doubly infinite series (6) converges at a given value of $z$, then it also converges at $z^{-1}$. In particular, if $v_r$ decays as $\rho^{|r|}$ with increasing $|r|$, for some scalar $\rho$ in $[0, 1)$, then $g(z)$ converges in the annulus $\rho < |z| < \rho^{-1}$. Note that a white noise process of covariance matrix $V$ has a constant AGF: $g(z) = V$.

The AGF may well not converge anywhere. The standard example is the simple sinusoid $x_t = \sin(\omega_0 t - \psi)$, which defines a stationary process if the phase $\psi$ is assumed to be a random variable uniformly distributed over $(-\pi, \pi)$. Its autocovariance function is $v_r = \tfrac{1}{2}\cos(\omega_0 r)$, so that the series (6) indeed converges nowhere.

A regular class of stationary process is constituted by those for which $x$ is the output of a stable system obeying finite-order linear dynamics and driven by white noise. We shall see below that $g(z)$ is then rational and necessarily convergent in an annulus $\rho < |z| < \rho^{-1}$.

The important property of the AGF is that it transforms very much more pleasantly under the action of a filter on the process than does the autocovariance function itself. Suppose that a process $\{y_t\}$ is derived as the output of a translation-invariant linear filter with stationary input $\{x_t\}$:

$$y_t = \sum_r b_r x_{t-r} = B(\mathcal{T})x_t. \qquad (7)$$

Here we have used the notation of Chapter 4, so that $b_r$ is the transient response of the filter and $B(z) = \sum_r b_r z^r$ its generating function; the transfer function of a discrete-time filter. If the sum in (7) is convergent (the appropriate sense here being that of convergence in mean square) then the output $\{y_t\}$ is defined and also stationary.

Since we are dealing with more than one process, autocovariances etc. must be labelled by the process to which they refer. Let us denote the autocovariance and the AGF for the process $\{x_t\}$ by $v_r^{(x)}$ and $g^{(x)}(z)$, etc.

Theorem 13.2.1 Suppose that the functions $g^{(x)}(z)$ and $B(z)$ are both analytic in an annulus $\rho < |z| < \rho^{-1}$. Then the sum (7) is mean-square convergent, the AGF $g^{(y)}(z)$ of the output is analytic in the same annulus and output statistics are related to input statistics by

$$v_r^{(y)} = \sum_j \sum_k b_j\, v_{r-j+k}^{(x)}\, b_k^T, \qquad (8)$$

$$g^{(y)}(z) = B(z)\, g^{(x)}(z)\, \bar{B}(z). \qquad (9)$$

Proof Relations (8) and (9) are those which would be established by formal argument: assertion (8) follows from (7) and the definition of an autocovariance; assertion (9) then follows by the formation of generating functions. Convergence of the sum (7) and validity of these manipulations is amply justified by our
very strong assumptions: that $v_r^{(x)}$ and $b_r$ converge to zero as $\rho^{|r|}$ with increasing $|r|$. □

The assumptions of the theorem are excessively strong, but make for simplicity, and are justified in the rational case which largely covers our needs. Some relaxation is indicated in Exercise 2, but substantial relaxation takes one deep into Fourier theory.

The point of the theorem is that the transformation rule (8) for the autocovariances themselves is not particularly pleasing, but the rule (9) for the transformation of AGFs is both compact and suggestive. We shall often write it simply as $g^{(y)} = B g^{(x)} \bar{B}$.

A prime example is a statistical version of a model already considered in Section 4.4:

$$A(\mathcal{T})x = \epsilon, \qquad y = Cx.$$

Here $A(\mathcal{T})$ is polynomial in $\mathcal{T}$ and $\epsilon$ is white noise of covariance $V$, so that the output $y$ is a linear function of a variable $x$ generated by a stochastic difference equation. Stability requires that $|A(z)|$ should have all its zeros strictly outside the unit circle. Under these circumstances the processes converge to stationarity with

$$g^{(y)}(z) = C A(z)^{-1} V \left(A(z^{-1})^{-1}\right)^T C^T.$$

This is certainly rational, and free of poles in a neighbourhood of the unit circle.

The AGF can be given a more physical basis if, as in Section 4.2, we see the power series (6) as a Fourier series. Let us indeed define

$$f(\omega) = g(e^{-i\omega}) = \sum_{r=-\infty}^{\infty} v_r e^{-ir\omega}, \qquad (10)$$

where the variable $\omega$ is to be regarded as 'frequency'. The transformation (10) with $\omega$ real amounts to considering $g(z)$ on the unit circle, where we expect it to converge if it converges anywhere. Suppose $x$ scalar, for simplicity. We have

$$0 \le h^{-1} E\left|\sum_{t=0}^{h-1} x_t e^{-it\omega}\right|^2 = h^{-1}\sum_{j=0}^{h-1}\sum_{k=0}^{h-1} v_{j-k}\, e^{i(k-j)\omega} = \sum_{r=-h}^{h} (1 - |r|/h)\, v_r e^{-ir\omega}.$$

This last expression will converge to $f(\omega)$ with increasing $h$ if regularity conditions are satisfied; a sufficient condition is that $\sum_r |v_r| < \infty$. We see then that, under such conditions, $f(\omega)$ is real and positive and can be identified with the expected squared modulus of the Fourier transform of the sequence $\{x_t\}$, in an appropriately normalised and limiting sense. This argument leads to an
interpretation of $f(\omega)$ as the density of the expected distribution of 'power' in the process $\{x_t\}$ over the frequency spectrum. For this reason it is termed the spectral density function in the scalar case, and the spectral density matrix in the vector case. For the sinusoidal process $\sin(\omega_0 t - \psi)$ mentioned above the Fourier series (10) of course does not converge, but limiting arguments evaluate it as $f(\omega) = \tfrac{\pi}{2}[\delta(\omega - \omega_0) + \delta(\omega + \omega_0)]$. That is, the energy of the process is indicated as being concentrated in the frequencies $\pm\omega_0$, which is indeed consistent with the nature of the process.

Relation (9) becomes, in terms of spectral densities,

$$f^{(y)}(\omega) = \beta(\omega) f^{(x)}(\omega) \beta(-\omega)^T = \beta f^{(x)} \bar{\beta}, \qquad (11)$$

where $\beta(\omega) = \sum_r b_r e^{-ir\omega}$ is the frequency response function of the filter (see Section 4.2). This could be regarded as the consequence $E[\zeta^{(y)}\bar{\zeta}^{(y)}] = \beta E[\zeta^{(x)}\bar{\zeta}^{(x)}]\bar{\beta}$ of a relation $\zeta^{(y)}(\omega) = \beta(\omega)\zeta^{(x)}(\omega)$, where $\zeta^{(x)}(\omega)$ is the (random) Fourier amplitude at frequency $\omega$ in a Fourier decomposition of $x$ into frequencies and $\bar{\zeta}^{(x)}(\omega)$ its transposed complex conjugate. In fact, such a decomposition is a delicate matter and one must be circumspect in speaking of a 'Fourier amplitude', but the view is nevertheless a correct one; see Exercise 1.

Note that we expect the relation

$$v_r = \frac{1}{2\pi}\int_{-\pi}^{\pi} e^{ir\omega} f(\omega)\,d\omega, \qquad (12)$$

inverse to (10).

The continuous-time results are formally analogous, with sums over $t$ replaced by integrals over the whole time axis and integrals over $\omega$ replaced by integrals over the infinite frequency axis $(-\infty, \infty)$. Thus, the mutually inverse pair of relations (10), (12) is replaced by

$$f(\omega) = \int_{-\infty}^{\infty} e^{-i\omega\tau} v(\tau)\,d\tau, \qquad v(\tau) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{i\omega\tau} f(\omega)\,d\omega.$$

These relations are certainly valid if $f(\omega)$ exists and is integrable. If we consider the important special case (in continuous time) of rational $f(\omega)$, then $f(\omega)$ can certainly have no singularity on the real $\omega$-axis, so integrability cannot fail for this reason. However, integrability over the infinite $\omega$-axis will also require that $f(\omega)$ should tend to zero with increasing $\omega$, and should do so sufficiently quickly. The effect of this condition is to exclude a white noise component, since white noise has a constant spectral density. If $f(\omega)$ is rational and tends to zero with increasing $\omega$ then it must indeed tend to zero at least as $\omega^{-2}$, and so is integrable.

The filter $x \to y$ expressed by $y = B(\mathcal{D})x$ has transfer function $B(s)$ and frequency response function $\beta(\omega) = B(i\omega)$. With these understandings, relation (11) holds in regular cases.
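The discrete-time relations (11) and (12) can be illustrated numerically. The sketch below assumes a scalar filter $B(z) = 1 + bz$ with white-noise input of variance $V$, so that $y_t = \epsilon_t + b\epsilon_{t-1}$; the values of `V` and `b`, the grid size and the helper `autocov_from_spectrum` are illustrative assumptions, not part of the text.

```python
import numpy as np

V, b = 2.0, 0.6   # illustrative noise variance and filter coefficient
omega = np.linspace(-np.pi, np.pi, 4096, endpoint=False)  # uniform frequency grid

beta = 1.0 + b * np.exp(-1j * omega)       # frequency response of B(z) = 1 + b z
f_y = (beta * V * np.conj(beta)).real      # relation (11) with white-noise input f_x = V

def autocov_from_spectrum(f, r):
    """Relation (12): v_r = (1/2pi) * integral of exp(i r w) f(w) over (-pi, pi).

    The mean over a uniform grid on one period equals (1/2pi) times the integral.
    """
    return (np.exp(1j * r * omega) * f).mean().real

v0, v1, v2 = (autocov_from_spectrum(f_y, r) for r in range(3))
```

Direct calculation from the moving-average form gives $v_0 = V(1 + b^2)$, $v_1 = Vb$ and $v_r = 0$ for $|r| > 1$, which is what the inversion should recover.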
the cons tant value V (where Vis the . For a white noise process the SDF f( w) has as to whether the proc ess is in (;()variance matr ix or powe r matr ix, according as an indication that the ener gy discrete or continuous time). One interprets this rang e of meaningful frequencies of the process is uniformly distributed over the cont inuo us time). The proc ess was (i.e. (11", 11") in discrete time and ( oo, oo) in belie f that white light has such a then given the nam e 'white' noise in the mist aken even over the visible range, but uniform ener gy spectrum. It certainly does not, the term 'white noise' has stuck. Exercises and comments
onar y process can be unde rstoo d (1) Circulant processes. The struc ture of a stati by assu ming it periodic, with perio d much bette r if one makes the time axis finite of the sequence {xo, x 1 , x2, ... , m. That is, the complete realisation X consists cyclic perm utati on, so that .rx = Xm 1}, and the shift oper ator ff effects a just means statistical inva rianc e {xmI,xo,Xt, ... ,Xm2}· 'Stationarity' then on. Assume stationarity in what unde r .r, and so unde r any cyclic perm utati follows. e with perio d m. Suppose, for Show that cov(xr, x,_,) is a function v, of r alon e matr ix of the mvector X. rianc simplicity, that x is scalar, and let V be the cova the kth element of the that and Show that V has eigenvalues jj = }:, v,eir ) and sums run over 11"i/m exp(2 = corresponding right eigenvector is ()ik, where() a 'spectral density' is value eigen any m consecutive integers. That is, the jth ~ (j is the finite wher El([1 with evaluated at w = 21f'j/m, and can be identified 1 • Furth ermo re, 2 1 01 x ~ 1 1 1 (j = mFour ier trans form of the sequence X in that disti nctj, kin for 0 = k) that E((j( these Four ier amplitudes are uncorrelated, in thes etO, l, .. . ,m 1. b,.r ' is to multiply (j by B(()i). The effect of a filter with oper ator B(.r ) = }:, n by white noise, and so with outp ut (2) Cons ider a discretetime SISO filter drive that this expression should }:, b,e,_,. The necessary and sufficient cond ition ld have finite seco ndo rder shou ut converge in mean square (and so that the outp therefore sufficient if one is ition moments) is that }:, is fmite. The same cond condition will also be The e. abov considers an inpu t whose SDF is boun ded zero. away from nece ssary if the SDF of the inpu t is boun ded
b;
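The eigenvalue claim of Exercise 1 is easy to check numerically for a small period. This is only a sketch under assumed values: the period `m = 8` and the periodic autocovariance `v` are illustrative, chosen symmetric so that the covariance matrix is symmetric.

```python
import numpy as np

m = 8
r = np.arange(m)
# Illustrative periodic autocovariance with v_r = v_{m-r}.
v = 0.5 ** np.minimum(r, m - r)

# Covariance matrix of X: entry (a, b) is v_{a-b}, indices taken modulo m.
Vmat = v[(r[:, None] - r[None, :]) % m]

theta = np.exp(2j * np.pi / m)
# Candidate eigenvalues f_j = sum_r v_r theta^{-jr}, the 'spectral density' at 2*pi*j/m.
f = np.array([(v * theta ** (-j * r)).sum() for j in range(m)]).real

# Residual of the eigen-relation V u_j = f_j u_j, where u_j has k-th element theta^{jk}.
residuals = [
    np.abs(Vmat @ theta ** (j * r) - f[j] * theta ** (j * r)).max()
    for j in range(m)
]
```

The residuals should vanish to rounding error, and the sorted values of `f` should agree with the eigenvalues returned by `np.linalg.eigvalsh(Vmat)`.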
3 MOVING-AVERAGE AND AUTOREGRESSIVE REPRESENTATIONS: CANONICAL FACTORISATIONS

Consider the process $\{x_t\}$ obtained as the output of a filter with a white noise input:

$$x_t = \sum_r b_r \epsilon_{t-r} = B(\mathcal{T})\epsilon_t. \qquad (13)$$
In the time-series literature such a process would be referred to as a moving average process, the words 'average' and 'moving' being a loose indication of the properties 'linear' and 'translation-invariant'. If the white-noise process has covariance matrix $V$ then it follows, as a special case of (9), that $\{x_t\}$ has AGF

$$g(z) = B(z) V \bar{B}(z). \qquad (14)$$

If the filter is causal then $b_r$ is zero for $r$ negative, so that $B(z)$ is a series in non-negative powers of $z$. One then speaks of the moving average relation itself as being 'causal' or 'realisable'.

Suppose, as in Section 4.4, that the filter is specified as an input-driven linear difference equation

$$A(\mathcal{T})x_t = \sum_{r=0}^{\infty} a_r x_{t-r} = \epsilon_t. \qquad (15)$$

We regard (15) as a relation determining $x_t$ in terms of $\epsilon_t$ and past $x$-values, so that $a_r$ is indeed zero for $r$ negative and $a_0$ is non-singular. If the relation (15) is stable then the output $\{x_t\}$ generated is stationary, and can indeed be represented as a realisable moving average of the input. That is, it can be represented in the form (13) with $B(z) = A(z)^{-1}$, where these expressions are understood as power series in non-negative powers of $z$ (see Section 4.4). Its AGF is then

$$g(z) = A(z)^{-1} V \bar{A}(z)^{-1}. \qquad (16)$$

In the time-series literature a process generated by a relation such as (15) is termed an autoregressive process, the relation itself being an autoregression. The term conveys, again loosely, the idea that the variable is linearly dependent upon its own past. The relation (15) is of course just a stochastic difference equation, linear and with constant coefficients, and of order $p$ if $a_r$ is zero for $r > p$.

Let us now reverse these deductions. If the AGF has the form (14) then the process could have been generated by a moving average relation (13); if the AGF has the form (16) then the process could have been generated by an autoregressive relation (15). In other words, one would have, respectively, moving average and autoregressive representations of the process. The point of this manoeuvre will appear in the next section. Let us make it more explicit.
Theorem 13.3.1 Suppose that the AGF $g(z)$ of a process $x_t$ can be factorised in the form (14), where both $B(z)$ and $B(z)^{-1}$ are analytic in $|z| \le 1$. Then the process has a causal moving average representation (13), with $V$ identified as the covariance matrix of the white noise process $\{\epsilon_t\}$. Furthermore, this representation is invertible, in that it can be inverted to the autoregressive relation (15), with $A(z) = B(z)^{-1}$.

Proof Define a process $\{\epsilon_t\}$ by

$$\epsilon_t = B(\mathcal{T})^{-1}x_t = A(\mathcal{T})x_t, \qquad (17)$$

where it is understood that the expansions are in non-negative powers of $\mathcal{T}$. It follows from the assumptions of the theorem that this quantity is well-defined, so that the process $\{\epsilon_t\}$ is stationary and both the relations (13) and (15) hold between the two processes. We see now by appeal to relation (9) that

$$g^{(\epsilon)} = A g \bar{A} = B^{-1} g \bar{B}^{-1} = V,$$

so that the process $\{\epsilon_t\}$ is white, with covariance matrix $V$. □

The factorisation (14) demanded in the theorem is a canonical factorisation of $g(z)$. This type of factorisation will occur repeatedly as we look, essentially, for a moving-average or autoregressive representation of the process. The factorisation is indeed possible if $g(z)$ is an AGF and is analytic in a non-empty annulus $\rho < |z| < \rho^{-1}$.

We can restate the theorem in what is a completely equivalent form, but with an emphasis whose difference will prove significant.

Theorem 13.3.2 Suppose that the AGF $g(z)$ of a process $x_t$ is such that its inverse as a matrix can be factorised in the form

$$g(z)^{-1} = \bar{A}(z) V^{-1} A(z), \qquad (18)$$

where both $A(z)$ and $A(z)^{-1}$ are analytic in $|z| \le 1$. Then the process has a stable autoregressive representation (15), with $V$ identified as the covariance matrix of the white noise process $\{\epsilon_t\}$.

This is indeed just a variation of the statement of the previous theorem. Note, however, that it is expressed in terms of a canonical factorisation of $g(z)^{-1}$ rather than of $g(z)$, and that the order of factors is reversed, in that it is the final factor rather than the initial factor which is required to have regularity properties inside the unit circle.

Of course, one factorisation immediately implies the other, but the distinction between the two versions will prove significant. Roughly, when it comes to state estimation (as it will) then factorisations (14) and (18) are respectively equivalent to the two characterisations of a projection estimate as a linear least square estimate or as a minimum discrepancy estimate. When we deal with processes whose dynamics are of finite order (as we shall) then factorisation (18) is a polynomial factorisation, whereas (14) is not.

The $\epsilon_t$ defined by (17) are in fact just the innovations of the $x$-process (assuming this to have begun in the infinite past) and the realisable moving average representation (13) is just a representation of the process in terms of its own innovations. It is a special case of a representation deduced by Wold (1938). The Wold representation breaks $x$ into the moving-average term deduced above (the purely nondeterministic component) and a term which can be perfectly predicted
by linear relations. Our factorisation hypotheses exclude the possibility of this second, deterministic, component.
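As a concrete instance of a canonical factorisation of type (14), consider the simplest scalar case. The sketch below assumes an AGF with only the lags $0$ and $\pm 1$ non-zero, $g(z) = v_1 z^{-1} + v_0 + v_1 z$, and looks for $g(z) = V(1 + bz)(1 + bz^{-1})$ with $|b| < 1$, so that $B(z) = 1 + bz$ and its inverse are both analytic in $|z| \le 1$. The function name `factorise_ma1` and the numerical values are illustrative assumptions.

```python
import math

def factorise_ma1(v0, v1):
    """Return (b, V) with v1/z + v0 + v1*z = V (1 + b z)(1 + b/z) and |b| < 1."""
    if v1 == 0:
        return 0.0, v0
    if abs(v1) >= v0 / 2:
        raise ValueError("need |v1| < v0/2 for a factorisation with |b| < 1")
    # Matching coefficients gives v0 = V (1 + b^2) and v1 = V b,
    # hence the quadratic v1 b^2 - v0 b + v1 = 0, whose two roots are reciprocal.
    disc = math.sqrt(v0 * v0 - 4 * v1 * v1)
    b = (v0 - disc) / (2 * v1)   # the root inside the unit circle
    return b, v1 / b

b, V = factorise_ma1(2.72, 1.2)
```

The other root of the quadratic, $1/b$, also reproduces $g(z)$ but gives a $B(z)$ with a zero inside the unit circle; choosing the root with $|b| < 1$ is what makes the factorisation canonical and the representation invertible.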
4 THE WIENER FILTER

Suppose that one has observed a process $\{x_t\}$ up to time $t$, and so knows its partial history $X_t = \{x_\tau;\ \tau \le t\}$. Suppose that one wishes to use the observations to estimate the value of some unknown random variable $\xi$. It is natural to consider the projection estimate $\hat{\xi}^{(t)} = \mathcal{E}(\xi \mid X_t)$, which we know, from the last chapter, to have optimality properties. We suppose all variables reduced to zero mean.

Theorem 13.4.1 Suppose that the AGF of the process $\{x_t\}$ has canonical factorisation (13). Then the projection estimate of $\xi$ in terms of $X_t$ is given by
c(t)"'
t, which is itself a consequence of the whiten assumptions.

5 OPTIMISATION CRITERIA IN THE STEADY STATE

Optimisation criteria will often have a compact expression in terms of spectral densities in the stationary regime (i.e. the steady-state limit). Whether one best goes about steady-state optimisation by a direct minimisation of these expressions with respect to policy is another matter, but the expressions are certainly interesting and have operational significance.

Consider first the LQG criterion. Let us write an instantaneous cost function such as (2.23) in the 'system' form

$$c = \tfrac{1}{2}\Delta^T \mathfrak{R} \Delta. \qquad (22)$$

Here $\Delta$ is the vector of 'deviations' which one wishes to penalise (having, for example, $x - r$ and $u - u^c$ as components) and $\mathfrak{R}$ is the associated matrix (which would be just the matrix of the quadratic form in the case (2.23)). Suppose that $\{\Delta_t\}$ has autocovariance generating function $g(z)$ in the stationary regime under a given stabilising policy $\pi$, and corresponding spectral density $f(\omega)$. Then the aim is to choose $\pi$ to minimise the average cost

$$\gamma = E[\tfrac{1}{2}\Delta^T \mathfrak{R} \Delta] = \tfrac{1}{2}E\{\mathrm{tr}[\mathfrak{R}\Delta\Delta^T]\} = \tfrac{1}{2}\mathrm{tr}[\mathfrak{R}\,\mathrm{cov}(\Delta)]. \qquad (23)$$

Here $E$ denotes an expectation under the stationary regime for policy $\pi$ and $\mathrm{tr}(P)$ denotes, as ever, the trace of the matrix $P$. For economy of notation we have not explicitly indicated the dependence of $E$, $\gamma$ and $g(z)$ upon $\pi$. Appealing to the formula

$$\mathrm{cov}(\Delta) = \frac{1}{2\pi}\int f(\omega)\,d\omega,$$

we thus deduce

Theorem 13.5.1 The criterion (23) can be expressed

$$\gamma = E[\tfrac{1}{2}\Delta^T \mathfrak{R} \Delta] = \frac{1}{4\pi}\int \mathrm{tr}[\mathfrak{R} f(\omega)]\,d\omega \qquad (24)$$

where $f(\omega)$ is the spectral density function of the $\Delta$-process under the stationary regime for policy $\pi$, and the integral is over the real frequency interval $[-\pi, \pi]$ in the case of discrete time and over the whole real line in the case of continuous time.
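Formula (24) can be checked against a direct variance calculation in a simple discrete-time case. The sketch below assumes a scalar deviation process obeying the autoregression $\Delta_t = a\Delta_{t-1} + \epsilon_t$ with noise variance $V$ and scalar cost weight $R$; these values and the grid size are illustrative assumptions only.

```python
import numpy as np

a, V, R = 0.8, 1.0, 3.0   # illustrative AR coefficient, noise variance, cost weight
omega = np.linspace(-np.pi, np.pi, 8192, endpoint=False)

# Spectral density from g = A^{-1} V Abar^{-1} with A(z) = 1 - a z, evaluated at z = exp(-i w).
f = V / np.abs(1.0 - a * np.exp(-1j * omega)) ** 2

# Formula (24): gamma = (1/4pi) * integral of R f(w) dw over [-pi, pi].
# The mean over a uniform grid on one period equals (1/2pi) times the integral.
gamma_spectral = (R * f).mean() / 2.0

# Direct evaluation from (23): gamma = (1/2) R var(Delta), with var = V / (1 - a^2).
gamma_direct = 0.5 * R * V / (1.0 - a ** 2)
```

The two evaluations should agree to within numerical integration error.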
In the discrete-time case it is sometimes useful to see expression (24) in power series rather than Fourier terms, and so to write it

$$\gamma = \tfrac{1}{2}\mathrm{Abs}\{\mathrm{tr}[\mathfrak{R} g(z)]\} = \frac{1}{4\pi i}\oint \mathrm{tr}[\mathfrak{R} g(z)]\,\frac{dz}{z}. \qquad (25)$$

Here the symbol 'Abs' denotes the operation of extracting the absolute term in the expansion of the bracketed term in powers of $z$ upon the unit circle, and the integral is taken around the unit circle in an anticlockwise direction.

If we had considered a cost function

$$C = \sum_{t=0}^{h-1} c_t \qquad (26)$$

up to a horizon $h$ then we could have regarded the average cost (23) as being characterised by the asymptotic relation

$$E(C) = h\gamma + o(h) \qquad (27)$$

for large $h$. Here $E$ is again the expectation operator under policy $\pi$, but now conditional on the specified initial conditions. The $o(h)$ term reflects the effect of these conditions; it will be zero if the stationary regime has already been reached at time $t = 0$.

Consider now the LEQG criterion introduced in Section 12.3. We saw already from that section that the LEQG model provided a natural embedding for the LQG model; we shall see in Chapters 16, 17 and 21 that it plays an increasing role as we bring in the concepts of risk-sensitivity, the $H_\infty$ criterion and large-deviation evaluations. For this criterion we would expect a relation analogous to (27):
$$E(e^{-\theta C}) = \exp[-h\theta\gamma(\theta) + o(h)]. \qquad (28)$$

Here $\gamma(\theta)$ is a type of geometric-average cost, depending on both the policy $\pi$ and the risk-sensitivity parameter $\theta$. The least ambitious aim of LEQG-optimisation in the infinite-horizon limit would be to choose the policy $\pi$ to minimise $\gamma(\theta)$. ('Least ambitious', because a full-dress dynamic programming approach would minimise transient costs as well as average costs.) We aim then to derive an expression for $\gamma(\theta)$ analogous to (24).

Theorem 13.5.2 The average cost $\gamma(\theta)$ defined by (28) has the evaluation

$$\gamma(\theta) = \frac{1}{4\pi\theta}\int \log|I + \theta\mathfrak{R} f(\omega)|\,d\omega \qquad (29)$$

for values of $\theta$ such that the symmetrisation of the matrix $I + \theta\mathfrak{R} f(\omega)$ is positive definite for all real $\omega$. Here $f(\omega)$ is the spectral density function of the $\Delta$-process under the stationary regime for the specified policy $\pi$ (assumed stationary and
linear), and the integral is again over the interval $[-\pi, \pi]$ or the whole real axis in discrete or continuous time respectively.

Here $|P|$ denotes the determinant of a matrix $P$; note that expression (29) indeed reduces to (24) in the limit $\theta \to 0$.

We shall prove an intermediate lemma for the discrete-time case before proving the theorem. Suppose that the AGF of $\{\Delta_t\}$ has a canonical factorisation (16) with $A(z)$ analytic in some annulus $|z| \le 1 + \delta$ for positive $\delta$. That is, the $\Delta$-process has an autoregressive representation. It is also Gaussian, since the policy is linear. Let us further suppose this representation so normalised that $A_0 = I$. The probability density of $\Delta_t$ conditional on past values is then

$$f(\Delta_t \mid \Delta_\tau;\ \tau < t) = [(2\pi)^m |V|]^{-1/2} \exp[-\tfrac{1}{2}\epsilon_t^T V^{-1}\epsilon_t]$$

where $\epsilon_t = A(\mathcal{T})\Delta_t$ and $m$ is the dimension of $\Delta$. This implies that

$$f(\Delta_0, \Delta_1, \ldots, \Delta_{h-1} \mid \Delta_\tau;\ \tau < 0) = [(2\pi)^m |V|]^{-h/2} \exp\left[-\tfrac{1}{2}\sum_{t=0}^{h-1}\epsilon_t^T V^{-1}\epsilon_t\right] = [(2\pi)^m |V|]^{-h/2} \exp\left[-\tfrac{1}{2}\sum_{t=0}^{h-1}\Delta_t^T M(\mathcal{T})\Delta_t + o(h)\right] \qquad (30)$$

where $M(z) = \bar{A}(z) V^{-1} A(z) = g(z)^{-1}$. Since the multivariate density (30) integrates to unity, and since the normalisation $A_0 = I$ implies the relation $\log|V| = \mathrm{Abs}\,\log|g(z)|$, we see that we can write conclusion (30) as follows.

Lemma 13.5.3 Suppose that $M(z)$ is self-conjugate, positive definite on the unit circle and analytic in a neighbourhood of the unit circle. Then

$$\int\!\!\int\cdots\int \exp\left[-\tfrac{1}{2}\sum_{t=0}^{h-1} x_t^T M(\mathcal{T}) x_t\right] dx_0\,dx_1 \cdots dx_{h-1} = (2\pi)^{hm/2}\exp\{-(h/2)\mathrm{Abs}[\log|M(z)|] + o(h)\}. \qquad (31)$$

*Proof of Theorem 13.5.2 The expectation in (28) can be expressed as the ratio of two multivariate integrals of type (31), with the identifications $M(z) = g(z)^{-1} + \theta\mathfrak{R}$ and $M(z) = g(z)^{-1}$ in numerator and denominator respectively. We thus have

$$E(e^{-\theta C}) = \exp[-(h/2)\mathrm{Abs}\{\log|g(z)^{-1} + \theta\mathfrak{R}| + \log|g(z)|\} + o(h)]$$

whence the evaluation

$$\gamma(\theta) = \frac{1}{2\theta}\mathrm{Abs}[\log|I + \theta\mathfrak{R} g(z)|]$$
and its alternative expression (29) follow. The continuous-time demonstration is analogous. □

The only reason why we have starred this proof is that the $o(h)$ term in the exponent of the final expression (30) is in fact $\Delta$-dependent, so one should go into more detail to justify the passage to the assertion (31).
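Two sanity checks on the evaluation (29) can be run numerically in the scalar discrete-time case. The sketch below rests on an assumed reading of the criterion, namely $C = \sum_t \tfrac{1}{2}R\Delta_t^2$ and $\gamma(\theta)$ defined through $E(e^{-\theta C}) = \exp[-h\theta\gamma(\theta) + o(h)]$; under that reading, independent $N(0, V)$ deviations give $E(e^{-\theta C}) = (1 + \theta RV)^{-h/2}$ exactly. The function name `gamma_leqg` and all numerical values are illustrative.

```python
import numpy as np

def gamma_leqg(theta, R, f):
    """Scalar form of (29): (1/(4 pi theta)) * integral of log(1 + theta R f(w)) dw.

    f is sampled on a uniform grid over [-pi, pi), so the grid mean equals
    (1/2pi) times the integral.
    """
    return np.log1p(theta * R * f).mean() / (2.0 * theta)

omega = np.linspace(-np.pi, np.pi, 8192, endpoint=False)
R, V, a = 3.0, 1.0, 0.8   # illustrative cost weight, noise variance, AR coefficient

# Check 1: white noise, f = V. Assumed exact value: log(1 + theta R V) / (2 theta).
theta = 0.5
gamma_white = gamma_leqg(theta, R, np.full_like(omega, V))
gamma_white_exact = np.log(1.0 + theta * R * V) / (2.0 * theta)

# Check 2: for an AR(1) spectrum, small theta should approach the LQG value (24).
f_ar = V / np.abs(1.0 - a * np.exp(-1j * omega)) ** 2
gamma_small_theta = gamma_leqg(1e-6, R, f_ar)
gamma_lqg = 0.5 * R * V / (1.0 - a ** 2)
```

The first check confirms the constant factor in (29) for a case with a closed form; the second illustrates the remark that (29) reduces to (24) as $\theta \to 0$.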
CHAPTER 14

Optimal Allocation; The Multi-armed Bandit

1 ALLOCATION AND CONTROL

This chapter is somewhat off the principal theme, although on a topic of considerable importance in its own right, and the reader may bypass it if he wishes. Allocation problems are concerned with the sharing of limited resources between various activities which are being pursued. This is not the kind of problem envisaged in classical control theory, but of course is indeed a control problem if this allocation is being varied in time to meet changing conditions. For example, the adaptive allocation of transmission capacity in a communications network provides just such an example; clearly of fundamental technological importance.

The classic dynamic allocation problem is the 'multi-armed bandit', henceforth referred to as MAB. This is the situation in which a gambler makes a sequence of plays of any of $n$ gambling machines (the 'bandits') and wishes to choose the machine which he plays at each stage so as to maximise the total expected pay-off (perhaps discounted, in the infinite-horizon case). The pay-off probability of the $i$th machine is a parameter, $\theta_i$ say, whose value is unknown. However, the gambler builds up an estimate of $\theta_i$ which becomes ever more exact as he gains more experience of the machine. The conflict, then, is between playing a machine which is known to have a good value of $\theta_i$ and experimenting with a machine about which little is known, but which just might prove even better. It is in order to resolve this conflict that one formulates the problem as an optimisation problem.

As an allocation problem this is quite special, on the one hand, but has features which confuse the issue, on the other. The resource which one is allocating is one's time (or, equivalently, effort) in that one can play only one machine at a time, and must decide which. In more general models one will be able to split the allocation at a given time.

The problem also has a feature which is fascinating but is irrelevant in the first instance: the 'state' of the $i$th machine at a given time is not its physical state, but is the state of one's information about the value of $\theta_i$, an 'informational state'. We shall see how to handle this concept in the next chapter, but this aspect should not divert one from the essential problem, which is to decide which machine to use next on the basis of something which we may indeed term the current 'state' of each machine.
In order to sideline such irrelevancies it is useful to formulate the MAB problem in greater generality. We shall suppose that one has $n$ 'projects', the $i$th of which has state value $x_i$. The current state of all projects is assumed known. One can engage only one project at a time; if one engages project $i$ then $x_i$ changes to a value $x_i'$ by time-homogeneous Markov rules (i.e. the distribution of $x_i'$ conditional on all previous state values of all projects is in fact dependent only on $x_i$), the states of unengaged projects do not change and there is a reward whose expectation is a function $r_i(x_i)$ of $x_i$ and $i$. If one envisages indefinite operation starting from time $t = 0$ and discounts rewards by a factor $\beta$ then the aim is to choose policy $\pi$ so as to maximise $E_\pi[\sum_{t=0}^{\infty} \beta^t R_t]$, where $R_t$ is the reward received at time $t$. The policy is the rule by which one decides which project to engage at any given time.

One can generalise even this formulation so as to make it more realistic in several directions, as we shall see. However, the problem as stated is the best first formalisation, and captures the essential elements of a dynamic allocation problem. The problem in this guise proved frustratingly difficult, and resisted sustained attack from the 'forties to the 'seventies. However, it had in fact been solved by Gittins about 1970; his solution became generally known about 1981, when it opened up wide practical and conceptual horizons. Gittins' solution is simple to a degree which is found amazing by anyone who knew the frustrations of earlier work on the topic.

One important feature which emerges is that the optimal policy is an index policy. That is, one can attach an index $\nu_i(x_i)$ to the $i$th project which is a function of the project label $i$ and the current state $x_i$ of the project alone. If the index is appropriately calculated (the Gittins index), then the optimal policy is simply to choose a project of currently greatest index at each stage. Furthermore, the Gittins index $\nu_i$ is determined by the statistical properties of project $i$ alone. We shall describe this determination, both simple and subtle, in the next section.

The MAB formulation must be generalised if one is to approach a problem as complicated as, for example, the routing of telephone traffic through a network of exchanges. One must allow several types of resource; these must be capable of allocation over more than one 'project' at a time; projects which are unengaged may nevertheless be changing state; projects may indeed interact. We shall sketch one direction of generalisation in Sections 5-7. The Gittins solution of the MAB problem stands as the exact solution of a 'pure' problem. The inevitable next stage in the analysis is to see how this exact solution of an idealised problem implies a solution, necessarily optimal only in some asymptotic sense, for a large and complex system.
2 THE GITTINS INDEX

The Gittins index is defined as follows. Consider the situation in which one has only two alternative actions: either to operate project $i$ or to stop operation and receive a 'retirement' reward of $M$. One has then in fact an optimal stopping problem (since once one ceases to operate the project, its state will not change and there is no reason to resume operation). Denote the value function for this problem by $\phi_i(x_i, M)$, to make the dependence on the retirement reward as well as on the project state explicit. This will obey the dynamic programming equation
\[ \phi_i(x_i, M) = \max\{M,\; r_i(x_i) + \beta E[\phi_i(x_i', M) \mid x_i]\}, \tag{1} \]
a relation which we shall abbreviate to
\[ \phi_i = \max[M, L_i \phi_i]. \tag{2} \]
Here $x_i$ and $x_i'$ are, as above, the values of the state of project $i$ before and after one stage of operation.

As a function of $M$ for fixed $x_i$ the function $\phi_i$ has the form indicated in Figure 1: non-decreasing and convex, and equal to $M$ for $M$ greater than a critical value $M_i(x_i)$. This is the range in which it is optimal to accept the retirement reward rather than to continue, and $M_i(x_i)$ is the cross-over value, at which $M$ is just large enough that the options of continuing or terminating are equally attractive.

Note that $M_i(x_i)$ is not the fair buy-out price for the project when in state $x_i$; it is more subtle than that. It is the price which is fair (in state $x_i$) if an offer is made which is to remain open, and so which the project operator is free to accept at any time in the future. It is this quantity which can be taken as the Gittins index, although usually one scales it to take
\[ \nu_i(x_i) = (1 - \beta) M_i(x_i) \tag{3} \]
as the index. One can regard $\nu$ as the size of an annuity that a capital sum $M$ would buy, so that one is rephrasing the problem as a choice between the alternatives of either operating the project with its uncertain return or of moving to the 'certain' project which offers a constant income of $\nu$.
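The calibration that defines the index can be carried out numerically for a project with finitely many states. The Python sketch below is an illustration only, under stated assumptions: the function names `stopping_value` and `gittins_index`, the transition matrix `P`, the reward vector `r` and the tolerances are hypothetical choices, not taken from the text. It solves the stopping problem by value iteration for a trial retirement reward $M$, locates the cross-over value $M(x)$ by bisection, and scales the result by $(1 - \beta)$.

```python
import numpy as np

def stopping_value(P, r, beta, M, tol=1e-12, max_iter=100_000):
    """Value iteration for phi(x, M) = max(M, r(x) + beta * E[phi(x', M) | x]).

    P is an (S, S) transition matrix, r an (S,) reward vector (assumed inputs).
    """
    phi = np.full(len(r), float(M))
    for _ in range(max_iter):
        new = np.maximum(M, r + beta * (P @ phi))
        if np.max(np.abs(new - phi)) < tol:
            return new
        phi = new
    return phi

def gittins_index(P, r, beta, x, tol=1e-9):
    """Bisection for the cross-over value M(x); returns nu(x) = (1 - beta) * M(x)."""
    # The scaled index is a discounted average of rewards, so M(x) lies in this range.
    lo = r.min() / (1.0 - beta)
    hi = r.max() / (1.0 - beta)
    while hi - lo > tol:
        M = 0.5 * (lo + hi)
        phi = stopping_value(P, r, beta, M)
        if phi[x] > M + 1e-10:
            lo = M      # continuing is strictly better, so M is below M(x)
        else:
            hi = M      # retirement is optimal, so M is at or above M(x)
    return (1.0 - beta) * 0.5 * (lo + hi)
```

For a project whose reward falls steadily with age, this calibration should simply return the immediate reward in each state, which gives a convenient check on the sketch.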
Figure 1 The graph of $\phi_i(x_i, M)$ as a function of $M$. Here $x_i$ and $M$ are respectively the state of ...

... relation (11) becomes $M + V(J) \ge \Phi + M$, or $\Phi \le V(J)$. But plainly the reverse inequality holds, so that $\Phi = V(J)$, and this will exceed $M$ if and only if $J$ is non-empty. Thus, in an optimal policy, projects are indeed abandoned once they are written off and operation continues until all projects are written off. Thus the optimal policy is a write-off policy, and expression (4) for the value function is indeed correct. However, one still has to appeal to something like the argument of the text to establish optimality of the Gittins policy.
4 EXAMPLES

The problem has been reduced to determination of the index $\nu(x)$ for an individual project, so we can drop the project index $i$. Determination of $\nu(x)$ requires solution of the stopping problem for that project, and one may well have to resort to numerical methods at this point. The fact that analytic solution is possible in only relatively few cases does not invalidate the fact that the MAB problem is essentially solved by the reduction of the $n$-project problem to the one-project problem (with a retirement option). We shall list some examples which in fact permit rapid and transparent treatment.

Let us say that a project is deteriorating if $\phi(x(t), M)$ is necessarily non-increasing in $t$. One may, for example, have a machine whose state is sufficiently indicated by its age, and whose performance deteriorates with age. We leave the reader to show, from the dynamic programming equation (2), that $\nu(x) = r(x)$ simply, where $r(x)$ is the instantaneous reward function.

If all projects were deteriorating then the optimal policy is a one-step look-ahead policy: one chooses the project $i$ for which the expected immediate reward $r_i(x_i)$ is maximal. This will ultimately lead to the situation in which the $r_i$ have been roughly equalised for all projects, and one then switches projects to keep them so. That is, it is as if one kept changing the tyres on one's car with the spare so as to keep all five tyres in an equal state of wear. Switching costs will ensure that one in fact tolerates a degree of inequality.

The opposite situation is that of an improving project, for which $\phi(x(t), M)$ is non-decreasing with time. For example, the performance of a machine may improve with age in the sense that, the longer it lasts, the more likely it is to be a good one. We leave to the reader to confirm that, in this case,
\[ \nu(x) = (1 - \beta)\, E\Big[\sum_{t=0}^{\infty} \beta^t r(x(t)) \,\Big|\, x(0) = x\Big]. \]
That is, the index is the discounted constant income equivalent of the expected discounted return from indefinite operation of the project. If all projects are improving then, once one adopts a project, one stays with it.

However, mention of 'lasting' brings home the possibility that a machine may indeed 'improve' up to the point where it fails, and is thereafter valueless. Let us denote this failure state by $x = 0$, presume it absorbing and that $r(0) = 0$. Suppose that the machine is otherwise improving in that $\phi(x(t), M)$ is non-decreasing in $t$ as long as the state $x = 0$ is avoided. Let $\sigma$ denote the random failure time, the smallest value of $t$ for which $x(t) = 0$. In this case
\[ \frac{\nu(x)}{1 - \beta} = \frac{E\big[\sum_{t=0}^{\sigma - 1} \beta^t r(x(t)) \mid x(0) = x\big]}{1 - E[\beta^{\sigma} \mid x(0) = x]}. \]
If all projects follow this 'improving through life' pattern then, once one adopts a project, one will stay with it until it fails.

Another tractable example is provided by a diffusion process in continuous time. Suppose that the state $x$ of the project takes values on the real line, that the project yields reward at rate $r(x) = x$ while it is being operated, that reward is discounted at rate $\alpha$, and that $x$ itself follows a diffusion process with drift and diffusion coefficients $\mu$ and $N$. This conveys the general idea of a project whose return improves with its 'condition', but whose condition varies randomly. The equation for $\phi(x, M)$ is then
\[ x - \alpha\phi + \mu\phi_x + \tfrac{1}{2} N \phi_{xx} = 0 \qquad (x > \xi) \tag{12} \]
where $\xi$ is the optimal breakpoint for retirement reward $M$. We find the solution of (12) to be
\[ \phi(x, M) = (x/\alpha) + (\mu/\alpha^2) + c e^{px} \tag{13} \]
where $p$ is the negative solution of
\[ \tfrac{1}{2} N p^2 + \mu p - \alpha = 0 \]
and $c$ is an arbitrary constant. The general solution of (12) would also contain an exponential term corresponding to the positive root of this last equation, but this will be excluded since $\phi$ cannot grow faster than linearly with increasing $x$. The unknowns $c$ and $\xi$ are determined by the boundary conditions $\phi = M$ and $\phi_x = 0$ at $x = \xi$ (see Exercise 10.7.2). If we substitute expression (13) into these two equations then the relation between $M$ and $\xi$ which results is equivalent to $M = M(\xi)$. We leave it to the reader to verify that the calculation yields the determination
\[ \nu(x) = \alpha M(x) = x + \frac{\mu + \sqrt{\mu^2 + 2\alpha N}}{2\alpha}. \]
The constant added to $x$ represents the future discounted reward expected from future change in $x$. This is positive even if $\mu$ is negative, a consequence of the fact that one can take advantage of a random surge against trend if this occurs, but can retire if it does not.
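The closed form for the diffusion example can be checked against the two boundary conditions at the breakpoint. The short Python sketch below is an assumption-laden illustration: the function names and the numerical values in the usage are hypothetical. It evaluates the index and then confirms that, with $M = \nu(\xi)/\alpha$ and $c$ fixed by the zero-slope condition, the solution of the form $(x/\alpha) + (\mu/\alpha^2) + c e^{px}$ does equal $M$ at $x = \xi$.

```python
import math

def diffusion_index(x, mu, N, alpha):
    """nu(x) = x + (mu + sqrt(mu^2 + 2 alpha N)) / (2 alpha) for reward rate r(x) = x."""
    return x + (mu + math.sqrt(mu * mu + 2.0 * alpha * N)) / (2.0 * alpha)

def boundary_residuals(xi, mu, N, alpha):
    """Residuals of phi = M and phi_x = 0 at x = xi, using the closed-form index."""
    # Negative root of (1/2) N p^2 + mu p - alpha = 0.
    p = (-mu - math.sqrt(mu * mu + 2.0 * alpha * N)) / N
    M = diffusion_index(xi, mu, N, alpha) / alpha
    # c is chosen so that the slope condition phi_x(xi) = 0 holds.
    c = -math.exp(-p * xi) / (alpha * p)
    phi = xi / alpha + mu / alpha**2 + c * math.exp(p * xi)
    phi_x = 1.0 / alpha + c * p * math.exp(p * xi)
    return phi - M, phi_x
```

Calling `boundary_residuals(0.7, -0.5, 2.0, 0.1)` should return two values that are zero to within rounding error, and `diffusion_index(0.0, -0.5, 2.0, 0.1)` is positive even though the drift is negative.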
5 RESTLESS BANDITS

One desirable relaxation of the basic MAB model would be to allow projects to change state even when not engaged, although of course by different rules. For example, one's knowledge of the efficacy of a medical treatment used to combat a particular infection improves as one uses it. However, it could actually deteriorate when one ceased to use the treatment if, for example, the virus causing the infection were mutating.

For a similar example, one's information concerning the position of an enemy submarine will in general improve as long as one tracks it, but would actually deteriorate if one ceased tracking. Even if the vessel were not taking deliberate evasive action its path would still not be perfectly predictable.

As a final example, suppose that one has a pool of $n$ employees of whom exactly $m$ are to be set to work at a given time. One can imagine that employees who are working produce, but at a decreasing rate as they tire. Employees who are resting do not produce, but recover. The 'project' (the employee) is thus changing state whether or not he is at work.

We shall speak of the phases when a project is in operation or not as active and passive phases. For the traditional MAB model a project was static in its passive phase. As we have seen, for many problems this is not true: the active and passive phases produce contrary movements in state space. For submarine surveillance the two phases induce gain and loss of information respectively. For the labour force the two phases correspond to tiring and recovery.

We shall refer to a project which may change even in the passive phase as a 'restless bandit', the description being a literal one for the submarine example.

The workforce example generalised the MAB model in another respect: one was allowed to engage $m$ of the $n$ projects at a time rather than just one. We shall allow this, so that, for the submarine example, we could suppose that $m$ ($< n$) aircraft are available to track the $n$ submarines. It is then a matter of allocating the surveillance effort of the $m$ aircraft in order to keep track of all $n$ submarines as well as possible.

We shall specialise the model in one respect: we shall assume rewards undiscounted, so that the problem is that of maximising the average reward. This makes for a much simpler analysis; indeed, the known solutions of a number of standard problems are greatly simpler in the undiscounted case. As we have maintained earlier, and argue on structural grounds in Section 16.9, the undiscounted case is in general the realistic one in the control context.
Let us label the phases by $k$; active and passive phases corresponding to $k = 1$ and $k = 2$ respectively. If one could operate project $i$ without constraint then it would yield a maximal average reward $\gamma_i$ determined by the dynamic programming equation
\[ \gamma_i + f_i(x_i) = \max_k \{ r_{ik}(x_i) + E_k[f_i(x_i') \mid x_i] \}. \tag{14} \]
Here $r_{ik}(x_i)$ is the expected instantaneous reward for project $i$ in phase $k$, $E_k$ is the expectation operator in phase $k$ and $f_i(x_i)$ is the transient reward for the optimised project. We shall write (14) more compactly as
\[ \gamma_i + f_i = \max[L_{i1} f_i, L_{i2} f_i] \qquad (i = 1, 2, \ldots, n). \tag{15} \]
Let $m(t)$ be the number of projects which are active at time $t$. We wish to optimise operation under the constraint $m(t) = m$ for prescribed $m$ and identically in $t$; that is, exactly $m$ projects should be active at all times. Let $R_{\mathrm{opt}}(m)$ be the optimal average return (from the whole population of $n$ projects) under this constraint. However, a more relaxed demand would be simply to require that
\[ E[m(t)] = m, \tag{16} \]
where the expectation is the equilibrium expectation for the policy adopted. Essentially, then, we wish to maximise $E(\sum_i r_i)$ subject to $E(\sum_i l_i) = n - m$. Here $r_i$ is the reward yielded by project $i$ (dependent on project state and phase) and $l_i$ is the indicator function which takes the value 1 or 0 according as to whether project $i$ is in the passive or the active phase. However, this is a constraint we could take account of by maximising $E[\sum_i (r_i + \nu l_i)]$, where $\nu$ is a Lagrangian multiplier. We are thus effectively solving the modified version of (15)
\[ \gamma_i(\nu) + f_i = \max[L_{i1} f_i, \nu + L_{i2} f_i] \qquad (i = 1, 2, \ldots, n) \tag{17} \]
where $f_i$ is a function $f_i(x_i, \nu)$ of $x_i$ and $\nu$. An economist would view $\nu$ as a 'subsidy for passivity', pitched at just the level (which might be positive or negative) which ensures that $m$ projects are active on average. Note that the subsidy is independent of the project; the constraint (16) is one on total activity, not on individual project activity. We thus see that the effect of relaxing the constraint $m(t) = m$ to the averaged version (16) is to decouple the projects; relation (17) involves project $i$ alone. This is also the point of the Gittins solution of the original MAB problem: that the $n$-project problem was decomposed into $n$ one-project problems.

A negative subsidy would usually be termed a 'tax'. We shall use the term 'subsidy' under all circumstances, however, and shall refer to the policy induced by the optimality equations (17) as the subsidy policy. This is a policy optimal under the averaged constraint (16). If we wish to be specific about the value of $\nu$ we shall refer to the policy as the $\nu$-subsidy policy. For definiteness, we shall close
the passive set. That is, if $x_i$ is such that $L_{i1} f_i = \nu + L_{i2} f_i$ then project $i$ is to be rested.

The value of $\nu$ must be chosen so that indeed $m$ projects are active on average. This induces a mild recoupling of the projects.

Theorem 14.5.1 The maximal average reward under constraint (16) is
\[ R(m) = \inf_{\nu} \Big[ \sum_i \gamma_i(\nu) - \nu(n - m) \Big] \tag{18} \]
and the minimising value of $\nu$ is the required subsidy level.

Proof This is a classic assertion in convex programming (that, under regularity conditions, the extrema attained in primal and dual problems are equal) but best seen directly. The average reward is indeed the square-bracketed expression in (18), because the average subsidy paid must be subtracted. Since
\[ \frac{\partial \gamma_i(\nu)}{\partial \nu} = E_\pi(l_i) \]
(where $\pi$ denotes policy) then
\[ \frac{\partial}{\partial \nu} \sum_i \gamma_i(\nu) = E_\pi \sum_i l_i = (n - m). \]
This equation relates $m$ and $\nu$ and yields the minimality condition in (18). The condition is indeed one of minimality, because $\sum_i \gamma_i(\nu)$ is convex increasing in $\nu$. The function $R(m)$ is concave, and represents the maximal average reward for any $m$ in $[0, n]$. □
Define now the index $\nu_i(x_i)$ of project $i$ when in state $x_i$ as the value of $\nu$ which makes $L_{i1} f_i = \nu + L_{i2} f_i$ in (17). In other words, it is the value of subsidy which makes the two phases equally attractive for project $i$ in state $x_i$. This is an obvious analogue of the Gittins index, to which it indeed reduces when passive projects are static and yield no reward. The interest is that the Gittins index (which is now characterised as a fair 'retirement income') is seen as the Lagrange multiplier associated with a constraint on average activity. The index is obviously meaningful: the greater the subsidy needed to induce one to rest a project, the more rewarding must it be to operate that project.

Suppose now that we wish to enforce the constraint $m(t) = m$ rigidly. Then a plausible policy is the index policy; at all times to choose the projects to be operated as the $m$ projects of currently greatest index (i.e. the first $m$ on a list ranking projects by decreasing index). Let us denote the average return from this policy by $R_{\mathrm{ind}}(m)$.
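The relaxed problem with a subsidy for passivity can be explored numerically for small projects. The Python sketch below is illustrative only and rests on assumptions that are not in the text: each project is given as a hypothetical tuple `(P1, r1, P2, r2)` of active and passive transition matrices and reward vectors, the matrices are taken to have strictly positive entries (so that every stationary policy has a unique equilibrium), the subsidised average reward is found by brute-force enumeration of stationary policies rather than by solving the optimality equation, and the infimum over the subsidy is approximated on a user-supplied grid.

```python
import itertools
import numpy as np

def average_reward(P, g):
    """Equilibrium average of g under an irreducible transition matrix P."""
    n = len(g)
    # Solve pi P = pi together with sum(pi) = 1 as one consistent linear system.
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(pi @ g)

def gamma(project, nu):
    """Best average of (reward + nu * passive indicator) over stationary policies."""
    P1, r1, P2, r2 = project            # phase 1 is active, phase 2 is passive
    best = -np.inf
    for passive in itertools.product([False, True], repeat=len(r1)):
        mask = np.array(passive)
        P = np.where(mask[:, None], P2, P1)   # row x follows the chosen phase
        g = np.where(mask, r2 + nu, r1)       # subsidy is paid in passive states
        best = max(best, average_reward(P, g))
    return best

def relaxed_bound(projects, m, nus):
    """Grid approximation to the infimum of sum_i gamma_i(nu) - nu * (n - m)."""
    n = len(projects)
    values = [sum(gamma(p, nu) for p in projects) - nu * (n - m) for nu in nus]
    k = int(np.argmin(values))
    return values[k], nus[k]
```

A call such as `relaxed_bound([project] * 3, 1, np.linspace(-5, 5, 201))` returns the approximate bound together with the grid value of the subsidy at which it is attained; the grid range is a guess and should be widened if the minimiser sits at an end point.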
Theorem 14.5.2
\[ R_{\mathrm{ind}}(m) \le R_{\mathrm{opt}}(m) \le R(m). \tag{19} \]

Proof The first inequality holds because $R_{\mathrm{opt}}$ is by definition the optimal average return under the constraint $m(t) = m$. The second holds because $R(m)$ is the optimal average return under the relaxed version of this constraint, $E[m(t)] = m$. □
The question now is: how close are the inequalities (19), i.e. how close is the index policy to optimality? Suppose we reduce rewards to a per project basis in that we divide through by $n$. The relation
\[ R_{\mathrm{ind}}(m)/n \le R_{\mathrm{opt}}(m)/n \le R(m)/n \tag{20} \]
then expresses inequalities between rewards (under various policies) averaged over both time and projects. One might conjecture that, if we let $m$ and $n$ tend to infinity in constant ratio and hold the population of projects to some fixed composition, then all the quotients in (20) will have limits and equality will hold throughout in this limit.

This conjecture has in fact been essentially verified in a very ingenious analysis by Weber and Weiss (1990). However, there are a couple of interesting reservations.

Let us say that a project is indexable if the set of values of state for which the project is rested increases from the empty set to the set of all states as $\nu$ increases. This implies that, if the project is rested for a given value of subsidy, then it is rested for all greater values. It also implies that, if all projects are indexable, then the projects $i$ which are active under a $\nu$-subsidy policy are just those for which
\[ \nu_i(x_i) > \nu. \]
One might think that indexability would hold as a matter of course. It does so in the classic MAB case, but not in this. Counterexamples can be found, although they seem to constitute a small proportion of the set of all examples. An example given in Whittle (1988) shows how non-indexability can come about. Let $D(\nu)$ be the set of states for which a given project is rested under the $\nu$-subsidy policy. Suppose that a given state ($x = \xi$, say) enters $D$ as $\nu$ increases. It can be that paths starting from $\xi$ with $\xi$ in $D$ show long excursions from $D$ before they return. This implies a surrender of subsidy which can become non-optimal once $\nu$ increases through some higher value, when $\xi$ will leave $D$.

Another point is that asymptotic equality in the second inequality of (20) can fail unless a certain stability condition is satisfied (explicitly, unless the solution of the deterministic version of the equations governing the distribution of index values in the population under the index policy converges to a unique equilibrium).

However, the statistics of the matter are interesting. In an investigation of over 20 000 randomly generated test problems Weber and Weiss found that about 90% were indexable, but found no counterexamples to average-optimality (i.e. of
instability of the dynamic equations). In searching a more specific set of examples they found counterexamples in fewer than one case in $10^3$, and for these the margin of suboptimality was of the order of one part in $10^5$.

The assertion that the index policy is average-optimal is then virtually true for all indexable cases; it is remarkable that an assertion can escape absolute validity by so little. The result gives some indication that asymptotic optimality can be achieved in large systems by quite simple policies.

6 AN EXAMPLE: MACHINE MAINTENANCE

The machine maintenance problem considered in Section 11.4 constitutes a good first example. This is phrased in terms of costs rather than rewards, so that it is now natural to think of $\nu$ as a cost for action rather than a subsidy for inaction; i.e. to identify it as the cost of a machine overhaul. In the notation of Section 11.4, the dynamic programming equation for a single machine analogous to equation (17) is then
\[ \gamma = \min\{\nu + cx + f(0) - f(x),\; cx + \lambda[f(x+1) - f(x)]\}. \tag{21} \]
If we normalise $f(0)$ to zero and conjecture that the optimal policy is to service the machine if $x \ge \xi$ for some critical value $\xi$ then (21) implies the equations
\[ \gamma + f(x) = \nu + cx \qquad (x \ge \xi) \]
\[ \gamma = cx + \lambda[f(x+1) - f(x)] \qquad (x \le \xi). \]
These have solution
\[ f(x) = cx + \nu - \gamma \qquad (x \ge \xi) \]
\[ f(x) = \sum_{j=0}^{x-1} (\gamma - cj)/\lambda = \frac{\gamma x}{\lambda} - \frac{cx(x-1)}{2\lambda} \qquad (x \le \xi). \]
The identity of the two solutions at $x = \xi$ thus provides the determining equation for $\gamma$ in terms of $\xi$
\[ c\xi + \nu - \gamma = \frac{\gamma \xi}{\lambda} - \frac{c\xi^2}{2\lambda}. \tag{22} \]
Here we have neglected discreteness effect by replacing $\xi(\xi - 1)$ by $\xi^2$. Now $\xi$ is determined by optimality; i.e. by the requirement that $\gamma$ should be minimal with respect to $\xi$. This is equivalent to requiring that the derivatives with respect to $\xi$ of the two sides of relation (22) should be equal (see Exercise 10.7.2), i.e. to the condition $\lambda c \approx \gamma - c\xi$. Substituting this evaluation of $\gamma$ into (22) we deduce a relation between $\nu$ and $\xi$ equivalent to $\nu \approx \nu(\xi)$. In this way we find the evaluation
\[ \nu(x) \approx c(x + \lambda) + \frac{cx^2}{2\lambda}, \]
accurate to within a discreteness effect. The last (and dominant) term in this expression is essentially the index which was deduced by policy improvement in Section 11.4.
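The approximate overhaul index is easy to evaluate and to invert. The Python sketch below is a hedged illustration with hypothetical function names. It computes the index in the continuous approximation, checks that the two branches of the transient cost agree at the critical state when $\gamma = c(\xi + \lambda)$, and inverts the quadratic to recover the service threshold implied by a given overhaul cost (assumed to be at least $c\lambda$, so that the threshold is non-negative).

```python
import math

def overhaul_index(x, c, lam):
    """nu(x) ~ c (x + lam) + c x^2 / (2 lam), neglecting the discreteness effect."""
    return c * (x + lam) + c * x * x / (2.0 * lam)

def matching_residual(xi, c, lam):
    """Mismatch at x = xi between the serviced and unserviced forms of f."""
    gamma = c * (xi + lam)                          # from lam c = gamma - c xi
    nu = overhaul_index(xi, c, lam)
    serviced = c * xi + nu - gamma                  # f(x) = c x + nu - gamma
    unserviced = gamma * xi / lam - c * xi**2 / (2.0 * lam)
    return serviced - unserviced

def service_threshold(nu, c, lam):
    """Critical state xi solving overhaul_index(xi) = nu, assuming nu >= c * lam."""
    return -lam + math.sqrt(2.0 * lam * nu / c - lam * lam)
```

For instance, with `c = 2.0` and `lam = 0.5` the index at state 3 is 25, and `service_threshold(25.0, 2.0, 0.5)` returns the state 3 again.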
7 QUEUEING PROBLEMS

Despite the fact that the technique of policy improvement produced plausible index policies for a number of queueing problems in Sections 11.5-11.7, the restless bandit technique fails to do so.

Consider, for example, the costing of service, in the hope of deducing an index for the allocation of service between queues. With the cost structure of Section 11.5 the dynamic programming equation corresponding to (17) would be
\[ \gamma = cx + \min\{\lambda\Delta(x+1),\; \nu + \lambda\Delta(x+1) - \mu(x)\Delta(x)\}. \]
Here $\nu$ is the postulated cost of employing a server, $\Delta(x)$ is the increment $f(x) - f(x-1)$ in transient cost and $\mu(x)$ equals $\mu$ or zero according as $x$ is positive or zero. One must assume that $\mu > \lambda$ if $\gamma$ is to be finite. One finds (and we leave this as an exercise to the reader) that for any non-negative $\nu$ the active set (the set in which it is optimal to employ a server) is the whole set $x > 0$ of positive $x$. One thus does not have a decision boundary which changes with changing $\nu$; the very feature on which the definition of the index was based in Section 5.

One can see how this comes about: there is no point in considering policies for which, for example, there is no service in a state $x = \xi$ for some $\xi > 0$. Such policies would present the same optimisation problem as ever for the queue on the set of states $x \ge \xi$, but with the feature that one is accepting a base-load of $\xi$ and so incurring an unnecessary constant cost of $c\xi$.

One might argue that variation of $\mu$ allows varying degrees of engagement, so that one might allow $\mu$ to vary with state with a corresponding proportional service cost. However, one reaches essentially the same conclusion in the undiscounted case: that an optimal policy serves all states $x > 0$ at a common rate.

The special character of such queueing problems becomes manifest if one considers the large-system limit $n \to \infty$ envisaged in Section 5. It is assumed that the traffic intensity for the whole system is less than unity, so that there is sufficient service capacity to cover all queues. In that case this capacity is all directed to a queue the moment it has customers, so that all queues are either empty or (momentarily) have one customer in the course of service. This rather unrealistic immediacy of response can be avoided only if response is inhibited by switching costs or slowed down by delay in either observation or transfer.

Such considerations do not hold for the pure reward problem of Section 11.7. We leave the reader to confirm that the methods of Section 6 indeed lead to the known optimal policy.
Notes on the literature

The subject has a vast literature. Although Gittins had arrived at his solution by 1970 it was published first in Gittins and Jones (1974). His later book (Gittins, 1989) gives a collected exposition of the whole subject. The proof of optimality associated with solution (4) for the value function was given in Whittle (1980). Restless bandits were introduced in Whittle (1988) and their analysis completed by Weber and Weiss (1990).
CHAPTER 15

Imperfect State Observation

We saw in Chapter 12 that, in the LQG case, there was a treatment of the case of imperfect observation which was both complete and explicit. By 'imperfect observation' we mean that the current value of the process variable (or of the state variable, in state-structured cases) is not completely observable. In practice, observation will seldom be complete.

However, the LQG case is an exception in that it is so tractable; this tractability carries no further than to the LEQG case (see Chapter 16). There are very few models with imperfect observation which can be treated both exactly and explicitly. Let us restrict attention to the state-structured case, which one might regard as the standard normalised form. Then the central formal result which emerges if the state variable is imperfectly observed is that there is still a simply recursive dynamic programming equation. However, the argument of the value function is no longer the physical state variable, but an 'informational' state variable: the distribution of the current value of physical state conditional on current information. This argument is then not, as formerly, an element of the state space $\mathcal{X}$, but a distribution on $\mathcal{X}$. This great increase in cardinality of the argument makes analysis very much more difficult.

The simplification of the LQG case is that this conditional distribution is always Gaussian, and so parametrised by its current mean $\hat{x}_t$ and covariance matrix $V_t$. The value function depends then on these arguments alone, and the validity of the certainty equivalence principle and associated ideas implies even further simplification.

Certainly, there is no general analogue of the certainty equivalence principle, and it must be said that this fact adds interest as well as difficulty. In general the choice of control will also affect the information one gains, so that control actions must be chosen with two (usually conflicting) goals in mind: to control the system in the conventional sense and to gain information on aspects of the system for which it is needed. For example, to 'feel the steering' of a strange car is effective as a long-term control measure (in that one gains necessary information on the car's driving characteristics) but not as a short-term one. Similarly, it is the conflict between the considerations of profit-maximisation on the basis of the information one has and the choice of actions to improve this information base that gives the original multi-armed bandit problem its particular character. Control with this dual goal is often referred to as dual control. (A 'duality' quite distinct from the mathematical duality between control/future and estimation/
past which has been our constant theme.) An associated concept is that of adaptive control: the system may have parameters which are not merely unobservable but also changing, and procedures must be such as to track these changes as well as possible and adapt the control rule to them. A procedure which effectively estimates an unknown parameter will of course also track a changing parameter.

The theory of dual and adaptive control requires a completely new set of ideas; it is subtle, technical and, while extensive, is as yet incomplete. For these reasons we shall simply not attempt any account of it, but shall merely outline the basic formalism and give a single tractable example.

1 SUFFICIENCY OF THE POSTERIOR DISTRIBUTION OF STATE

Let us suppose, for simplicity of notation, that all random variables are discrete-valued; the formal extension of conclusions to more general cases is then obvious. We shall use a naive notation, so that, for example, $P(x_t \mid W_t)$ denotes the probability of a value $x_t$ of the state at time $t$ conditional on the information $W_t$ available at time $t$. We are thus not making a notational distinction between a random variable and particular values of that random variable, just as the '$P$' in the above expression denotes simply 'probability of' rather than a defined function of the bracketed arguments. We shall use a more explicit functional notation when needed.

Let us consider the discrete-time case. The structural axioms of Appendix 2 are taken for granted (and so also their implication: that past controls, as parametrising variables, can be unequivocally lumped in with the conditioning variables). We assume the following modified version of the state-structure hypotheses of Section 8.3.

(i) Markov dynamics It is assumed that process variable $x$ and observation $y$ have the property
\[ P(x_{t+1}, y_{t+1} \mid X_t, Y_t, U_t) = P(x_{t+1}, y_{t+1} \mid x_t, u_t). \tag{1} \]

(ii) Decomposable cost function It is assumed that the cost function separates into a sum of instantaneous and terminal costs, of the form
\[ C = \sum_{t=0}^{h-1} \beta^t c(x_t, u_t, t) + \beta^h C_h(x_h). \tag{2} \]

(iii) Information It is assumed that $W_t = (W_0, Y_t, U_{t-1})$ and that the information available at time $t = 0$ implies a prior distribution of initial state $P(x_0 \mid W_0)$.

It is thus implied in (iii) that $y_t$ is the observation that becomes available at time $t$, when the value of $u_t$ is to be determined. Assumption (i) asserts rather more than Markov structure; it states that, for given control values $U$, the stochastic
process $\{x_t\}$ is autonomous in that the distribution of $x_{t+1}$ conditional on $X_t$ and $Y_t$ is in fact dependent only on $x_t$. In other words, the causal dependence is one-way; $y$ depends upon $x$ but not conversely. This is an assumption which could be weakened; see Exercise 1. We have included a discount factor in the cost function (2) for economy of treatment.

Assumptions (i) and (iii) imply that both state variables and observations can be regarded as random variables, and that the stochastic treatment can be started up, in that an initial distribution $P(x_0 \mid W_0)$ is prescribed for initial state $x_0$ (the prior distribution). The implication may be more radical than it seems. For example, for the original multi-armed bandit the physical state variable is in fact the parameter vector $\theta = \{\theta_i\}$; the vector of unknown success probabilities. The fact that this does not change with time, under usual assumptions, makes it no less a state variable. However, the change in viewpoint which allows one to regard this parameter as a random variable is non-trivial.

Generally, the formulation above implies that unknown plant parameters are to be included in the state variable and are to be regarded as random variables, not directly observable but of known prior distribution. That is, one takes the Bayesian point of view to inference on structure. The controversy among statisticians concerning the Bayesian formulation and its interpretation has been a battle not yet consigned to history. We shall take the pragmatic point of view that, in this context, the only formulations which lead to a natural recursive mathematical analysis of the problem are the Bayesian formulation and its minimax analogue.

We shall refer to the distribution
\[ P_t(x_t) = P(x_t \mid W_t) \tag{3} \]
as the posterior distribution of state. More explicitly, it is the posterior distribution of $x_t$ at time $t$, in that it is the distribution of $x_t$ conditional upon the information that has been gathered by time $t$. It obeys a natural forward recursion.

Theorem 15.1.1 (Bayes updating of the posterior distribution) Under the assumptions (i)-(iii) listed above the posterior distribution $P_t$ obeys the updating formula
\[ P_{t+1}(x_{t+1}) = \frac{\sum_{x_t} P_t(x_t) P(x_{t+1}, y_{t+1} \mid x_t, u_t)}{\sum_{x_t} P_t(x_t) P(y_{t+1} \mid x_t, u_t)}. \tag{4} \]
Proof We have, for fixed $W_{t+1}$ and variable $x_{t+1}$,

$$P(x_{t+1} \mid W_{t+1}) \propto P(x_{t+1}, y_{t+1} \mid W_t, u_t) = \sum_{x_t} P(x_t, x_{t+1}, y_{t+1} \mid W_t, u_t)$$

$$= \sum_{x_t} P(x_t \mid W_t, u_t)\,P(x_{t+1}, y_{t+1} \mid W_t, x_t, u_t)$$

$$= \sum_{x_t} P_t(x_t)\,P(x_{t+1}, y_{t+1} \mid x_t, u_t).$$

The last step follows from (3) and the implication of causality: that $P(x_t \mid W_t, u_t) = P(x_t \mid W_t)$. Normalising this expression for the conditional distribution of $x_{t+1}$ we deduce recursion (4). □
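The recursion (4) can be sketched in a few lines of code for a finite state space. This is an illustrative sketch only: the function name `bayes_update`, the two-state machine and all numerical probabilities are assumptions made for the example, not values from the text.

```python
def bayes_update(P, a, y, u):
    """One step of recursion (4): returns P_{t+1} given P_t, y_{t+1} and u_t.

    P maps each state z to P_t(z); a(x, y, z, u) is the joint probability of
    next state x and observation y given current state z and control u.
    """
    states = list(P)
    # Numerator of (4): sum over the current state z.
    weights = {x: sum(P[z] * a(x, y, z, u) for z in states) for x in states}
    total = sum(weights.values())  # denominator of (4): probability of y
    if total <= 0.0:
        raise ValueError("observation has zero probability under the model")
    return {x: w / total for x, w in weights.items()}


# Hypothetical two-state machine: 0 = satisfactory, 1 = faulty.
TRANSITION = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.0, 1: 1.0}}   # assumed p_jk
OBSERVATION = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.1, 1: 0.9}}  # assumed p_j(y)


def a_run(x, y, z, u):
    """Joint probability of (next state x, observation y) from state z; u is ignored."""
    return TRANSITION[z][x] * OBSERVATION[x][y]


prior = {0: 1.0, 1: 0.0}
posterior = bayes_update(prior, a_run, y=1, u="run")
print(posterior)
```

Starting from a machine known to be satisfactory, a single alarming observation moves one third of the probability mass to the faulty state under these assumed numbers.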
Just as the generic value of $x_t$ is often denoted simply by $x$, so shall we often denote the generic value of $P_t$ simply by $P$. We see from (4) that the updating formula for $P$ can be expressed

$$P(x) \to P'(x) := \frac{\sum_z P(z)\,a_t(x, y \mid z, u)}{\sum_x \sum_z P(z)\,a_t(x, y \mid z, u)} \qquad (5)$$

where $a_t(x, y \mid z, u)$ is the functional form of the conditional probability $P(x_{t+1} = x, y_{t+1} = y \mid x_t = z, u_t = u)$.

Recall now our definition of a sufficient variable $\xi_t$ in Section 2.1. Theorem 8.3.1 could be expressed as an assertion that, under the assumptions (i)-(iii) of Section 8.3, the pair $(x_t, t)$ is sufficient, where $x_t$ is the dynamic state variable. What we shall now demonstrate is that, under the imperfect-observation versions of these assumptions expressed in (i)-(iii) above, it is the pair $(P_t, t)$ which is sufficient, where $P_t$ is the posterior distribution of dynamical state $x_t$. For this reason $P$ is sometimes referred to as an 'informational state' variable or a 'hyperstate' variable, to distinguish it from $x$ itself, which still remains the underlying physical state variable.

Theorem 15.1.2 (The optimality equation under imperfect state observation) Under the assumptions (i)-(iii) listed above, the variable $(P_t, t)$ is sufficient, and the optimality equation for the minimal expected discounted future cost takes the form

$$F(P, t) = \inf_u \Big[ \sum_x P(x)\,c(x, u, t) + \beta \sum_x \sum_y \sum_z P(z)\,a_t(x, y \mid z, u)\,F(P', t+1) \Big]. \qquad (6)$$

A function $\phi(P)$ is said to be homogeneous of degree one in $P$ if $\phi(\lambda P) = \lambda \phi(P)$ for any positive scalar $\lambda$. We shall find it sometimes convenient to write $F(P, t)$ as $F(\{P(x)\}, t)$ if we wish to indicate how $P(x)$ transforms for a given value of $x$.
Theorem 15.1.3 The value function $F(P, t)$ can be consistently extended to unnormalised distributions $P$ by the requirement that it be homogeneous of degree one in $P$, when the dynamic programming equation (6) simplifies to the form

$$F(P, t) = \inf_u \Big[ \sum_x P(x)\,c(x, u, t) + \beta \sum_y F\Big( \Big\{ \sum_z P(z)\,a_t(x, y \mid z, u) \Big\},\, t+1 \Big) \Big]. \qquad (8)$$
Proof Recursion (6) would certainly reduce to (8) if $F(P, t+1)$ had the homogeneity property, and $F(P, t)$ would then share this property. But it is evident from (7) that $F(P, h)$ has the property. □

The conclusion can be regarded as an indication of the fact that it is only the relative values of $P(x_t \mid W_t)$ (for varying $x_t$ and fixed $W_t$) which matter, and that the normalisation factor in (5) is then irrelevant.

Exercises and comments

(1) An alternative and in some ways more natural formulation is to regard $(x_t, y_t)$ as jointly constituting the physical state variable, but of which only the component $y_t$ is observed. The Markov assumption (i) of the text will then be weakened to

$$P(x_{t+1}, y_{t+1} \mid X_t, Y_t, U_t) = P(x_{t+1}, y_{t+1} \mid x_t, y_t, u_t)$$

consistent with the previous assumption (i) of Section 8.3. Show that the variable $(P_t, y_t, t)$ is sufficient, where $P_t = \{P_t(x_t)\} = \{P(x_t \mid W_t)\}$ is updated by the formula

$$P_{t+1}(x_{t+1}) \propto \sum_{x_t} P_t(x_t)\,P(x_{t+1}, y_{t+1} \mid x_t, y_t, u_t).$$
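The remark that only the relative values of the posterior matter can be checked numerically: one may carry the unnormalised weights forward and normalise only at the end. The sketch below assumes a hypothetical two-state model (the probabilities and the observation sequence are invented for illustration) and compares the two routes.

```python
def unnormalised_step(P, a, y, u):
    """Numerator of the updating formula: no division by the normalising factor."""
    states = list(P)
    return {x: sum(P[z] * a(x, y, z, u) for z in states) for x in states}


def normalise(P):
    """Rescale a set of nonnegative weights to sum to one."""
    total = sum(P.values())
    return {x: w / total for x, w in P.items()}


# Hypothetical two-state model (assumed numbers, for illustration only).
TRANSITION = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.0, 1: 1.0}}
OBSERVATION = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.1, 1: 0.9}}


def a_run(x, y, z, u):
    """Joint probability of (next state x, observation y) from state z; u is ignored."""
    return TRANSITION[z][x] * OBSERVATION[x][y]


observations = [1, 0, 1, 1]
P_norm = {0: 0.9, 1: 0.1}   # normalised at every step
P_raw = dict(P_norm)        # never normalised until the end
for y in observations:
    P_norm = normalise(unnormalised_step(P_norm, a_run, y, None))
    P_raw = unnormalised_step(P_raw, a_run, y, None)

P_end = normalise(P_raw)
gap = max(abs(P_norm[x] - P_end[x]) for x in P_norm)
print(P_norm, gap)
```

The two posteriors agree to rounding error, which is the sense in which the normalisation factor in (5) is irrelevant.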
2 EXAMPLES: MACHINE SERVICING AND CHANGE-POINT DETECTION

One might say that the whole of optimal adaptive control theory is latent in equation (8), if one could only extract it! Even something like optimal statistical communication theory would be just a special case. However, we shall confine our ambitions to the simplest problem which is at all amenable.
Suppose that the dynamic state variable $x$ represents the state of a machine; suppose that this takes integer values $j = 0, 1, 2, \ldots$. Suppose that the only actions available are to let the machine run (in which case $x$ follows a Markov chain with transition probabilities $p_{jk}$) or to service it (in which case the machine is brought to state 0). To run the machine for unit time in state $j$ costs $c_j$; to service it costs $d$. At each stage one derives an observation $y$ on machine state. Suppose, for simplicity, that this is discrete-valued, the probability of an observation $y$ conditional on machine state $j$ being $p_j(y)$. Let $P = \{P_j\}$ denote the current posterior distribution of machine state, and let $\gamma$ and $f(P)$ denote the average and transient cost under an optimal policy (presumed stationary). The dynamic programming equation corresponding to (8) is then

… (9)

where $\delta(P)$ is the unnormalised distribution which assigns the entire probability mass $\sum_j P_j$ to state 0.

The hope is, of course, to determine the set of $P$-values for which the option of servicing is indicated. However, even equation (9) offers no obvious purchase for general solution. The trivial special case is that in which there are no observations at all. Then $P$ is a function purely of the time which has elapsed since the last service, and the optimal policy must be to service at regular intervals. The optimal length of interval is easily determined in principle, without recourse to (9).

A case which is still special, but less trivial, is that in which the machine can be in only two states, $x = 0$ or 1, say. We would interpret these as 'satisfactory' and 'faulty' respectively. In this case the informational state can be expressed in terms of the single number

$$\pi = \frac{P_1}{P_0 + P_1}.$$

This is the probability (conditional on current information) that the machine is faulty. Let us suppose that $p_{01} = p = 1 - q$ and $p_{10} = 0$. That is, the fault-free machine can develop a fault with probability $p$, but the faulty machine cannot spontaneously correct itself. If we set $f(P) = \phi(\pi)$ and assume the normalisation …

Let us slightly rephrase the conclusions summarised in Lemma 12.3.2 and Theorem 12.3.3. Define the stress
$$\mathbb{S} = \mathbb{C} + \theta^{-1}\mathbb{D}, \qquad (5)$$

the linear combination of cost and discrepancy which occurs naturally in the evaluation of expectation (3). Let us also define the modified total-cost value function $G(W_t)$ as in Section 12.3 by

$$e^{-\theta G(W_t)} = f(Y_t)\,\mathop{\mathrm{ext}}_{\pi} E_\pi\big[e^{-\theta \mathbb{C}} \mid W_t\big] \qquad (6)$$

where the extremisation is a maximisation or a minimisation according as $\theta$ is positive or negative. Then the conclusions of Lemma 12.3.2 and Theorem 12.3.3 can be rephrased as follows.

Theorem 16.2.1 (The risk-sensitive certainty-equivalence theorem for the risk-seeking (optimistic) case) Assume LEQG structure with $\theta > 0$. Then the total value function has the expression

$$G(W_t) = g_t + \inf_{u_\tau:\,\tau \ge t}\ \inf_{X}\ \inf_{y_\tau:\,\tau > t} \mathbb{S}$$

$$= g_t + \inf_{X_t} \Big\{ \theta^{-1}\mathbb{D}(X_t, Y_t \mid\, ; U_{t-1}) + \inf_{u_\tau:\,\tau \ge t}\ \inf_{x_\tau:\,\tau > t} \big[ \mathbb{C}(X, U) + \theta^{-1}\mathbb{D}(x_{t+1}, \ldots, x_h \mid X_t; U) \big] \Big\} \qquad (7)$$

where $g_t$ is a policy-independent function of $t$ alone. The value of $u_t$ thus determined is the LEQG-optimal value of the control at time $t$. If the value of $u_t$ minimising the square bracket is denoted $u(X_t, U_{t-1})$ then the LEQG-optimal value of $u_t$ is $u(X_t^{(t)}, U_{t-1})$ where $X_t^{(t)}$ is the value of $X_t$ determined by the final $X_t$-minimisation.

The first equality of (7) asserts that the value function at time $t$ is obtained by minimising the stress with respect to all process/observation variables currently unobservable and all decisions not already formed, and that the value of $u_t$ determined in this way is optimal. This may not immediately suggest the expression of a CEP, but it is a powerful assertion which achieves what one might term conversion to free form. By this we mean that an optimisation with respect to functions $u_t(W_t)$ has been replaced by an unconstrained minimisation with respect to unobservables and undetermined controls. That is, the constraint that the optimal $u_t$ should depend only upon $W_t$ has been achieved automatically in a free extremisation. This process of effectively relaxing constraints will be taken to its conclusion in Chapters 19-21, when we reach the full time-integral formulation.

The final assertion of the theorem looks much more like a CEP, in that it asserts that the optimal value of $u_t$ is just the optimal value for known $X_t$ with $X_t$ replaced by an estimate $X_t^{(t)}$. The risk-neutral CEP is often confused with the separation principle, which asserts (in terms not well agreed) the separateness of the optimisations of estimation and control. There is certainly no such separation in the LEQG case. Control costs (both past and future) affect the value of the estimates and noise statistics affect the control rule (even if the process variable is perfectly observed).

However, we see how a separation principle should now be expressed. If $X_t$ is provisionally assumed known then the evaluation of the two terms inside the curly brackets in the final expression of (7), which can be regarded as concerned with estimation and control respectively, can proceed separately. The two evaluations are then coupled by the final minimisation with respect to $X_t$, which also yields the final estimate $X_t^{(t)}$. The effectiveness of this separation is much clearer in the state-structured case considered below.

The CEP in the risk-averse case, $\theta < 0$, differs interestingly, and requires more careful statement. The distinction arises from the fact that relation (12.16) now becomes
$$G(W_t) = \inf_{u_t}\ \sup_{y_{t+1}} G(W_{t+1}) + \cdots \qquad (8)$$

and that the order of the two extremal operations cannot in general be reversed. The consequence is that the analogue of Theorem 16.2.1 is

Theorem 16.2.2 (The risk-sensitive certainty-equivalence theorem for the risk-averse (pessimistic) case) Assume LEQG structure with $\theta < 0$. Then the total value function has the expression

$$G(W_t) = g_t + \inf_{u_t}\ \sup_{y_{t+1}} \cdots \inf_{u_{h-1}}\ \sup_{y_h} \mathbb{S}$$

$$= g_t + \mathop{\mathrm{stat}}_{X_t} \Big\{ \theta^{-1}\mathbb{D}(X_t, Y_t \mid\, ; U_{t-1}) + \mathop{\mathrm{stat}}_{u_\tau:\,\tau \ge t}\ \mathop{\mathrm{stat}}_{x_\tau:\,\tau > t} \big[ \mathbb{C}(X, U) + \theta^{-1}\mathbb{D}(x_{t+1}, \ldots, x_h \mid X_t; U) \big] \Big\} \qquad (9)$$

where $g_t$ is a policy-independent function of $t$ alone. The value of $u_t$ thus determined is the LEQG-optimal value of the control at time $t$. If the value of $u_t$ extremising the square bracket is denoted $u(X_t, U_{t-1})$ then the LEQG-optimal value of $u_t$ is $u(X_t^{(t)}, U_{t-1})$ where $X_t^{(t)}$ is the value of $X_t$ determined by the final $X_t$-extremisation.
Here $y_h$ can be regarded as yielding complete information, and so can be identified with $X$, and the operator 'stat' simply renders the expression to which it is applied stationary with respect to the variable indicated. The first relation in (9) follows as in the risk-seeking case, but the order of the extremisations must now be observed. However, all that is necessary in applying recursions such as (8) is that quantities being minimised (maximised) should be convex (concave) in the relevant argument. The cases in which this requirement fails prove interesting; see Sections 3 and 4. The extremisation conditions in the first expression of (9) will yield linear relations which can be solved (i.e. variables eliminated) in any order. The rearrangement of extremal operations in the final expression of (9) corresponds to just such a reordering, but with the characterisations of maximality or minimality now weakened to stationarity of some kind.

Although state-structure plays no part in this formalism (and it is for this reason plus economy that we have dragged the reader through the general case) the situation does indeed become more transparent if state structure is assumed. In such a case cost and discrepancy will have the additive decompositions

$$\mathbb{C} = \sum_{t=0}^{h-1} c(x_t, u_t, t) + C_h(x_h) = \sum_{t=0}^{h-1} c_t + C_h \qquad (10)$$

$$\mathbb{D} = D(x_0) + \sum_{t=1}^{h} D(x_t, y_t \mid x_{t-1}, y_{t-1}; u_{t-1}) = D_0 + \sum_{t=1}^{h} D_t \qquad (11)$$

say. Here we have not discounted explicitly, and $D(x_0)$ is the discrepancy derived from the prior distribution of $x_0$ conditional on $W_0$.
Define now the future stress $F_t = F(x_t, t)$ and the past stress $P_t = P(x_t, W_t)$ as the values of

$$\sum_{\tau=t}^{h-1} (c_\tau + \theta^{-1} D_{\tau+1}) + C_h$$

and

$$\theta^{-1} D_0 + \sum_{\tau=0}^{t-1} (c_\tau + \theta^{-1} D_{\tau+1})$$

with all variables except $W_t$ and $x_t$ extremised out. That is, all controls from time $t$ onwards are minimised out and all unobservables at time $t$ except $x_t$ itself are minimised or maximised out, according as $\theta$ is positive or negative.
Theorem 16.2.3 (Recursions and certainty-equivalence for the state-structured case) Assume a state-structured LEQG model. Then the future and past stresses are determined by the recursions and terminal conditions

$$F(x_t, t) = \inf_{u_t}\ \mathop{\mathrm{ext}}_{x_{t+1}} \big[ c(x_t, u_t, t) + \theta^{-1} D(x_{t+1} \mid x_t; u_t) + F(x_{t+1}, t+1) \big] \qquad (0 \le t < h) \qquad (12)$$

…

… Show … nonnegative solution iff $J(\theta) > 0$ in the case $|A| \ge 1$ and $J(\theta) > -(1 - |A|)^2 / R$ in the case $|A| < 1$. That is, the critical lower bound $\bar\theta$ is $-B^2/NQ$ or $-B^2/NQ - (1 - |A|)^2/NR$ according as the uncontrolled plant is unstable or stable.

5 THE DISTURBED (NON-HOMOGENEOUS) CASE

If the plant equation (12.1) is modified by the addition of a deterministic disturbance $d_t$:

$$x_t = A x_{t-1} + B u_{t-1} + d_t + \epsilon_t$$

then we shall expect the future stress to have the non-homogeneous quadratic form

$$F(x, t) = \tfrac{1}{2} x^T \Pi_t x - \sigma_t^T x + \cdots \qquad (26)$$

where $+ \cdots$ indicates terms independent of $x$. We can generalise relation (23) to obtain

$$\mathop{\mathrm{ext}}_{\epsilon} \big[ \epsilon^T (\theta N)^{-1} \epsilon + (a + \epsilon)^T \Pi (a + \epsilon) - 2\sigma^T (a + \epsilon) \big] = a^T \tilde\Pi a - 2\tilde\sigma^T a + \cdots$$

where $+ \cdots$ indicates terms independent of $a$, the matrix $\tilde\Pi$ is again given by (21), and

$$\tilde\sigma = (I + \theta \Pi N)^{-1} \sigma.$$

From this we deduce
Theorem 16.5.1 If the plant equation includes a deterministic disturbance $d$ then the future stress has the form (26) with $\Pi_t$ obeying the modified Riccati equation indicated in Theorem 16.4.1 and $\sigma_t$ obeying the backward recursion (27). Here $\tilde\Gamma_t$ is the predictive gain matrix defined in (25).

Verification is immediate. We see that recursion (27) differs from the corresponding risk-neutral recursion of Section 2.9 only by the substitution of $\tilde\Gamma_t$ for the risk-neutral evaluation of $\Gamma_t$. This is indeed as it must be, since we have replaced optimisation with respect to a single control $u$ by optimisation with respect to the pair $(u, \epsilon)$.

The same argument leads to the evaluation of the optimal control as

$$u_t = K_t x_t + (Q + B^T \tilde\Pi_{t+1} B)^{-1} B^T (I + \theta \Pi_{t+1} N)^{-1} (\sigma_{t+1} - \Pi_{t+1} d_{t+1}). \qquad (28)$$

The combination of (28) and the recursion (27) gives one the explicit feedback-feedforward formula for the optimal control analogous to (2.65).

As for the risk-neutral case, all these results emerge much more rapidly and cleanly (at least in the stationary case) when we adopt the factorisation techniques of Section 6.3; see Section 11 and Chapter 21.
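The generalisation of relation (23) used in this section can be checked in the scalar case. The sketch below is an illustration under stated assumptions: it works with scalars, takes $\tilde\Pi = \Pi/(1 + \theta N \Pi)$ and $\tilde\sigma = \sigma/(1 + \theta N \Pi)$ as the scalar forms of the transformed coefficients, and uses invented numerical values. Terms independent of $a$ are removed by differencing against $a = 0$.

```python
def extremised_value(a, Pi, sigma, theta, N):
    """Stationary value over eps of eps^2/(theta N) + Pi (a+eps)^2 - 2 sigma (a+eps)."""
    M = 1.0 / (theta * N)
    # Stationarity: M*eps + Pi*(a + eps) - sigma = 0.
    eps = (sigma - Pi * a) / (M + Pi)
    return M * eps * eps + Pi * (a + eps) ** 2 - 2.0 * sigma * (a + eps)


def transformed_coefficients(Pi, sigma, theta, N):
    """Scalar forms of the transformed quadratic and linear coefficients."""
    scale = 1.0 + theta * N * Pi
    return Pi / scale, sigma / scale


def check(a, Pi, sigma, theta, N):
    """Difference between the two sides of the identity, a-independent terms removed."""
    Pi_t, sigma_t = transformed_coefficients(Pi, sigma, theta, N)
    lhs = extremised_value(a, Pi, sigma, theta, N) - extremised_value(0.0, Pi, sigma, theta, N)
    rhs = Pi_t * a * a - 2.0 * sigma_t * a
    return lhs - rhs


# Illustrative values: one risk-seeking (theta > 0), one risk-averse (theta < 0).
err_seeking = check(a=0.7, Pi=3.0, sigma=1.5, theta=0.5, N=2.0)
err_averse = check(a=0.7, Pi=3.0, sigma=1.5, theta=-0.1, N=2.0)
print(err_seeking, err_averse)
```

Both discrepancies vanish to rounding error, for either sign of $\theta$, so long as $1 + \theta N \Pi$ stays positive.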
6 IMPERFECT STATE OBSERVATION: THE SOLUTION OF THE P-RECURSION

In the case of imperfect observation one has also to solve the forward recursion (14) for the function $P(x_t, W_t)$. Just as the solution of the $F$-recursion implies solution of the control-optimisation problem in the case of perfect observation, so solution of the $P$-recursion largely implies solution of the estimation problem. 'Largely', because the $P$-recursion produces only the state estimate $\check{x}_t$ based upon past stress. However, the full minimal-stress estimate $\hat{x}_t$ is then quickly derived, as we shall see in the next section.

For the standard regulation problem as formulated in equations (12.1)-(12.4) we can write the $P$-recursion (14) as

$$\theta P(x_t, W_t) = \min_{x_{t-1}} \big[ \theta c_{t-1} + D_t + \theta P(x_{t-1}, W_{t-1}) \big] \qquad (29)$$
where Ct1
=![~r_J~ ~H~t~·
and $\epsilon$, $\eta$ are to be expressed in terms of $x$, $y$, $u$ by the plant and observation relations (12.1) and (12.2).

Now, if we take the risk-neutral limit $\theta \to 0$ then $\theta P$ will have a limit, $D$, say, which satisfies the risk-neutral form of (29)

$$D(x_t, W_t) = \min_{x_{t-1}} \big[ D_t + D(x_{t-1}, W_{t-1}) \big] \qquad (30)$$

and has the interpretation

$$D(x_t, W_t) = D(x_t, Y_t \mid\, ; U_{t-1}) = D(x_t \mid W_t) + \cdots = \tfrac{1}{2} \big[ (x - \hat{x})^T V^{-1} (x - \hat{x}) \big]_t + \cdots \qquad (31)$$
Here $+ \cdots$ indicates terms not involving $x_t$ (in fact identifiable with $D(Y_t \mid\, ; U_{t-1})$), and $\hat{x}_t$ and $V_t$ are respectively the mean and covariance of $x_t$ conditional on $W_t$. In this risk-neutral case $\check{x}$ is identifiable with $\hat{x}$ and, as we saw from Section 12.5, the relation (31) implies the recursive updatings of $\hat{x}$ and $V$: the Kalman filter and the Riccati recursion. We can in fact utilise the known solution of (30) to determine that of (29).

In the risk-sensitive case we can again establish that $\theta P$ has the quadratic form exhibited in the final member of (31), so that

$$P(x_t, W_t) = \frac{1}{2\theta} \big[ (x - \check{x})^T V^{-1} (x - \check{x}) \big]_t + \cdots, \qquad (32)$$

where $+ \cdots$ again indicates terms independent of $x_t$, irrelevant for our purposes. The quantity $\check{x}_t$ can now be identified as the estimate of $x_t$ which extremises past stress at time $t$, and $V_t$ can be said to measure the precision of this estimate in that $(\theta V_t)^{-1}$ measures the curvature of past stress as $x_t$ varies from $\check{x}_t$. Relation (29) now determines the updating rules for these quantities.
Theorem 16.6.1 Under the assumptions above the past stress has the quadratic form (32) with $\check{x}_0$ and $V_0$ identified as the prescribed mean and variance of $x_0$ conditional on $W_0$. The values of $\check{x}_t$ and $V_t$ are determined from those of $\check{x}_{t-1}$ and $V_{t-1}$ by the Kalman filter (12.22), (12.23) and the Riccati recursion (12.24) with the modifications that $\hat{x}_t$ is replaced by $\check{x}_t$, $V_{t-1}$ by

$$\tilde{V}_{t-1} = (V_{t-1}^{-1} + \theta R)^{-1} \qquad (33)$$

and $\hat{x}_{t-1}$ by the expression (34) in the right-hand sides of these relations. For validity of these recursions it is necessary that $V_{t-1}^{-1} + C^T M^{-1} C + \theta R$ should be positive definite for all relevant $t$.

Proof The quadratic character (32) of $P$ follows again inductively from (29). Recursion (29) differs from (30) only in the addition of the term $\theta c_{t-1}$. If we assume that $P(x_{t-1}, W_{t-1})$ has the quadratic form asserted then we find that

$$\theta c_{t-1} + \theta P(x_{t-1}, W_{t-1}) = \tfrac{1}{2} \big[ (x - \tilde{x})^T \tilde{V}^{-1} (x - \tilde{x}) \big]_{t-1} + \text{terms independent of } x_{t-1}$$

whence it follows that

$$\theta P(x_t, W_t) = \min_{x_{t-1}} \Big\{ D_t + \tfrac{1}{2} \big[ (x - \tilde{x})^T \tilde{V}^{-1} (x - \tilde{x}) \big]_{t-1} \Big\} + \cdots.$$

But this is just the recursion of the risk-neutral case with the substitution of $\tilde{x}_{t-1}$ and $\tilde{V}_{t-1}$ for $\hat{x}_{t-1}$ and $V_{t-1}$. □
The modified recursions are again more transparent in the alternative forms (12.59), (12.60). One sees that, as far as updating of $V$ is concerned, the only effect of risk-sensitivity is to modify $C^T M^{-1} C$ to $C^T M^{-1} C + \theta R$. That is, to the information matrix associated with a single $y$-observation one adds the matrix $\theta R$, reflecting the 'information' implied by cost-pressures (positive or negative according to the sign of $\theta$). The passage (34) from $\check{x}_{t-1}$ to $\tilde{x}_{t-1}$ indicates how the estimate of $x_{t-1}$ changes if we add present cost to past stress.

In continuous time the two forms of the modified recursions coincide and are more elegant; we set them out explicitly in Section 8.
7 IMPERFECT STATE OBSERVATION: RECOUPLING

In the risk-neutral case the recipe for the coupling of estimation and control asserted by the CEP is so immediate that one scarcely gives it thought: the optimal control is obtained by simple substitution of the estimate $\hat{x}_t$ for $x_t$ in the perfect-observation form of the optimal rule. It is indeed too simple to suggest its risk-sensitive generalisation, which we know from Theorem 16.2.3. This is that the optimal control at time $t$ is $u(\hat{x}_t, t)$, where $u(x_t, t)$ is the optimal perfect-observation (but risk-sensitive) control and $\hat{x}_t$ is the minimal-stress estimate of $x_t$: the value of $x_t$ extremising $P(x_t, W_t) + F(x_t, t)$.

It was the provisional specification of current state $x_t$ which allowed us to decouple the evaluations of past and future stress; the evaluations are recoupled by the determination of $\hat{x}_t$. Now that the separate evaluations (26) and (32) of $F$ and $P$ have been made explicit in the last two sections, the recoupling can also be made explicit.

Theorem 16.7.1 The optimal control at time $t$ is given by $u_t = K_t \hat{x}_t$, where

$$\hat{x}_t = (I + \theta V_t \Pi_t)^{-1} (\check{x}_t + \theta V_t \sigma_t). \qquad (35)$$

Here $\Pi_t$, $K_t$ and $\sigma_t$ are the expressions determined in Theorems 16.4.1 and 16.5.1 and $V_t$, $\check{x}_t$ those determined in Theorem 16.6.1.

This follows immediately from the CEP assertion of Theorem 16.2.3 and the evaluations of the last two sections. Note that $\hat{x}_t$ is an average of the value $\check{x}_t$ of $x_t$ which extremises past stress and the value $\Pi_t^{-1} \sigma_t$ which extremises future stress. As one approaches the risk-neutral case then the effect of past stress swamps that of future stress.
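The recoupling formula (35) is easy to verify in the scalar case, where past and future stress are the quadratics (32) and (26). The following sketch is illustrative only: the function names and all numerical values are assumptions, and the terms independent of $x$ are dropped since they do not affect the extremising value.

```python
def minimal_stress_estimate(x_check, V, Pi, sigma, theta):
    """Scalar form of (35): the value of x extremising past plus future stress."""
    return (x_check + theta * V * sigma) / (1.0 + theta * V * Pi)


def total_stress(x, x_check, V, Pi, sigma, theta):
    """x-dependent part of P(x, W) + F(x, t) in the scalar case."""
    past = (x - x_check) ** 2 / (2.0 * theta * V)
    future = 0.5 * Pi * x * x - sigma * x
    return past + future


# Illustrative (assumed) values.
x_check, V, Pi, sigma, theta = 1.0, 0.5, 2.0, 0.4, 0.3
x_hat = minimal_stress_estimate(x_check, V, Pi, sigma, theta)

# Stationarity check by central difference (exact for a quadratic, up to rounding).
h = 1e-4
slope = (total_stress(x_hat + h, x_check, V, Pi, sigma, theta)
         - total_stress(x_hat - h, x_check, V, Pi, sigma, theta)) / (2.0 * h)

# x_hat as a weighted average of the past-stress and future-stress extremisers.
w = 1.0 / (1.0 + theta * V * Pi)
average = w * x_check + (1.0 - w) * (sigma / Pi)

# Near the risk-neutral limit the past-stress estimate dominates.
x_hat_neutral = minimal_stress_estimate(x_check, V, Pi, sigma, 1e-9)
print(x_hat, slope, average, x_hat_neutral)
```

The weight on the past-stress estimate is $1/(1 + \theta V \Pi)$ in this scalar setting, which tends to one as $\theta \to 0$.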
8 CONTINUOUS TIME

The continuous-time analogue of all the LEQG conclusions follows by passage to the continuous limit, and is in general simpler than its discrete-time original. The analogues of the CEP theorems of Section 2 are obvious. Note that the $u$ and $\epsilon$ extremisations of equation (18) are now virtually simultaneous: the optimiser and Nature are playing a differential game, with shared or opposing objectives according as $\theta$ is positive or negative.

The solutions for past and future stress in the state-structured case simplify in that the two alternative forms coalesce. The solution for the forward stress for the disturbed but otherwise time-homogeneous regulation problem is

$$F(x, t) = \tfrac{1}{2} x^T \Pi x - \sigma^T x + \cdots$$

where $\Pi$ and $\sigma$ are functions of time. The matrix $\Pi$ obeys the backward Riccati equation (36). This reduces to

$$\dot\Pi + R + A^T \Pi + \Pi A - \Pi J(\theta) \Pi = 0 \qquad (37)$$

if $S$ has been normalised to zero, where the augmented control-power matrix $J(\theta)$ again has the evaluation

$$J(\theta) = B Q^{-1} B^T + \theta N.$$

The vector $\sigma$ obeys the backward linear equation

$$\dot\sigma + \tilde\Gamma^T \sigma - \Pi d = 0$$

where $\tilde\Gamma$ is the predictive gain matrix.

The optimal control in the case of perfect state observation is

$$u = Kx + Q^{-1} B^T \sigma \qquad (38)$$

where the time-dependence of $u$, $K$, $x$ and $\sigma$ is understood, and $K$ is given by (39).

In the case of imperfect state observation the past stress has the solution

$$P(x, W) = \frac{1}{2\theta} (x - \check{x})^T V^{-1} (x - \check{x}) + \cdots$$

where the time-dependence of $x$, $W$, $\check{x}$ and $V$ is understood. The matrix $V$ obeys the forward Riccati equation, which reduces to

$$\dot{V} = N + AV + VA^T - V (C^T M^{-1} C + \theta R) V$$

if $L$ has been normalised to zero. The updating formula for $\check{x}$ (the risk-sensitive Kalman filter) is

$$d\check{x}/dt = A\check{x} + Bu + d + H(y - C\check{x}) - \theta V (R\check{x} + S^T u) \qquad (40)$$

where $H = (L + VC^T) M^{-1}$.

Recoupling follows the discrete-time pattern exactly. That is, the optimal control is $u = K\hat{x} + Q^{-1} B^T \sigma$ where $K$ is given by (39) and $\hat{x}$ is the minimal-stress estimate of $x$:

$$\hat{x} = (I + \theta V \Pi)^{-1} (\check{x} + \theta V \sigma).$$
9 SOME CONTINUOUS-TIME EXAMPLES

The simplest example is that of scalar regulation, the continuous-time equivalent of Exercise 4.1. Equation (37) for $\Pi$ can be written

$$\frac{d\Pi}{ds} = R + 2A\Pi - J(\theta)\Pi^2 = f(\Pi), \qquad (41)$$

say, where $J(\theta) = B^2 Q^{-1} + \theta N$, and $s$ represents time to go. Let us suppose that $R > 0$. We are interested in the nonnegative solutions of $f(\Pi) = 0$ which moreover constitute stable equilibrium solutions of (41), in that $f'(\Pi) < 0$. In the case $J(\theta) > 0$ there is only one nonnegative root, and it is stable; see Figure 4. In the case $J(\theta) = 0$ there is such a root if $A < 0$ and no nonnegative root at all if $A \ge 0$. If $J(\theta) < 0$, then there is no nonnegative root if $A \ge 0$, but there can be if $A$ is negative and $J(\theta)$ not too large in magnitude; see Figure 5. In fact, there is a root of the required character if $A < 0$ and $R + A^2/J(\theta) < 0$. To summarise:
Figure 4 The graph of $f(\Pi)$ in the case $J > 0$; it has a single positive zero.

Figure 5 The graph of $f(\Pi)$ in the case $J < 0$, $A < 0$; there is a positive zero if $J$ exceeds a critical value.
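The case analysis of $f(\Pi) = R + 2A\Pi - J(\theta)\Pi^2$ can be sketched numerically. The functions below are a hypothetical illustration (names and test values are assumptions): `equilibrium_pi` returns the nonnegative root with $f'(\Pi) \le 0$ when one exists, and `breakdown_theta` returns the value of $\theta$ below which no such root exists, assuming $R$, $Q$ and $N$ positive.

```python
import math


def equilibrium_pi(A, B, Q, R, N, theta):
    """Nonnegative root of R + 2*A*Pi - J*Pi**2 with f'(Pi) <= 0, or None."""
    J = B * B / Q + theta * N
    if abs(J) < 1e-12:
        # f is linear: a nonnegative stable root exists only if A < 0.
        return -R / (2.0 * A) if A < 0 else None
    disc = A * A + J * R
    if disc < 0:
        return None  # no real root at all
    # At this root f'(Pi) = 2A - 2J*Pi = -2*sqrt(disc) <= 0.
    root = (A + math.sqrt(disc)) / J
    return root if root >= 0 else None


def breakdown_theta(A, B, Q, R, N):
    """Critical value of theta below which no stable nonnegative root exists."""
    base = -B * B / (N * Q)
    return base if A >= 0 else base - A * A / (N * R)


def f(Pi, A, B, Q, R, N, theta):
    """Right-hand side of the scalar Riccati equation."""
    return R + 2.0 * A * Pi - (B * B / Q + theta * N) * Pi * Pi


# Stable plant (A < 0): equilibrium persists for some theta below -B^2/(N Q).
pi_stable = equilibrium_pi(-1.0, 1.0, 1.0, 1.0, 1.0, -1.5)
# Below the breakdown value there is no equilibrium.
pi_broken = equilibrium_pi(-1.0, 1.0, 1.0, 1.0, 1.0, -2.5)
# Unstable plant (A > 0): breakdown already at -B^2/(N Q).
pi_unstable = equilibrium_pi(1.0, 1.0, 1.0, 1.0, 1.0, -1.2)
print(pi_stable, pi_broken, pi_unstable)
```

With these illustrative values the equilibrium $\Pi$ grows as $\theta$ decreases towards breakdown, but remains finite when the uncontrolled plant is stable.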
Theorem 16.9.1 Assume the scalar regulation problem set out above, with $S = 0$ and $R$ and $Q$ both positive. Then both $\Pi$ and the magnitude of $K$ decrease as $\theta$ increases.

If $A \ge 0$ (i.e. the uncontrolled plant is unstable) then the breakdown value $\bar\theta$ equals $-B^2/NQ$ and $\Pi$ becomes infinite as $\theta$ decreases to this value.

If $A < 0$ (i.e. the uncontrolled plant is stable) then the breakdown value $\bar\theta$ equals $-B^2/NQ - A^2/NR$. The nonnegative equilibrium solution $\Pi$ of (41) at $\theta = \bar\theta$ is finite, but is unstable to positive perturbations.

A second-order example is provided by the linearised pendulum model of Section 2.8 and Exercise 5.1.5. In a stochastic version the angle of deflection $\alpha$ from the vertical obeys

$$\ddot\alpha = a\alpha + bu + \epsilon$$

where $\epsilon$ is white noise of power $N$ and the coefficient $a$ is negative or positive according as one seeks to stabilise to the hanging or the inverted position. The cost function is $\tfrac{1}{2}(r_1 \alpha^2 + r_2 \dot\alpha^2 + Q u^2)$, with $r_1$ and $Q$ strictly positive. The analysis of Section 2.8 applies with $J = b^2 Q^{-1} + \theta N$. It follows from that analysis that there is a finite equilibrium solution $\Pi$ of the Riccati equation if and only if $J > 0$, and that this solution is then stable. The breakdown value is thus $\bar\theta = -b^2/NQ$, whatever the sign of $a$. The greater stability of the hanging position compared with the inverted position is reflected in the relative magnitudes of $\Pi$ in the two cases, but the hanging position is still only neutrally stable, rather than truly stable.

Finally, a second-order example which has no steady state is provided by the inertial missile example of Exercise 2.8.4. The solution for the optimal control obtained there now becomes

$$u = -\frac{D s (x_1 + x_2 s)}{Q + D(1 + \theta N Q) s^3/3},$$

where $s$ is time to go. The critical breakdown value is $\bar\theta = -1/NQ - 3/NDs^3$, and so increases with $s$. The longer the time remaining, the more possibility there is for mischance, according to a pessimist.

10 AVERAGE COST

The normalised value function $F(x, t)$ defined by

$$e^{-\theta F(x, t)} = \mathop{\mathrm{ext}}_{u} E\big[ e^{-\theta \mathbb{C}} \mid x_t = x, u_t = u \big] \qquad (42)$$
in the state-structured case should be distinguished from the future stress defined in Section 2, which is in fact the $x$-dependent part of the value function. The neglected term, independent of $x$, is irrelevant for determination of the optimal control, but has interest in view of its interpretation as the cost due to noise in the risk-sensitive case. Let us make the distinction by provisionally denoting the value function by $F_v(x, t)$.

Theorem 16.10.1 Consider the LEQG regulation problem in discrete time with perfect state observation and no deterministic disturbance (i.e. $d = 0$). Then the normalised value function $F_v$ has the evaluation

$$F_v(x, t) = F(x, t) + \delta_t \qquad (43)$$

where $F(x, t)$ is the future stress, evaluated in Theorems 16.4.1 and 16.5.1, and $\delta_t$ is given by (44).

The proof follows by the usual inductive argument, applied now to the explicit form

$$\exp[-\theta F_v(x, t)] = \mathop{\mathrm{ext}}_{u}\ (2\pi)^{-n/2} |N|^{-1/2} \int \exp\big[ -\tfrac{1}{2} \epsilon^T N^{-1} \epsilon - \theta c(x, u) - \theta F_v(Ax + Bu + \epsilon, t+1) \big]\, d\epsilon$$

of recursion (42).

The evaluation (44) of the increment in cost due to noise indeed reduces to the corresponding risk-neutral evaluation (10.5) in the limit of zero $\theta$. It provides the evaluation

$$\gamma(\theta) = \frac{1}{2\theta} \log |I + \theta N \Pi| \qquad (45)$$

of average cost in the steady state, where $\Pi$ is now the equilibrium solution of the Riccati equation (24) (with $A$ and $R$ replaced by $A - B Q^{-1} S$ and $R - S^T Q^{-1} S$ if $S$ has not been normalised to zero). More generally, it provides an evaluation of the average cost for a policy $u(t) = Kx(t)$ which is not necessarily optimal (but stabilising, in an appropriate sense) if $\Pi$ is now the solution of

$$\Pi = R + K^T S + S^T K + K^T Q K + (A + BK)^T (\Pi^{-1} + \theta N)^{-1} (A + BK). \qquad (46)$$
For more general models the average cost is best evaluated by the methods of Section 13.5. Recall that we there deduced the expression

$$\gamma(\theta) = \frac{1}{4\pi\theta} \int \log |I + \theta \mathfrak{R} f(\omega)|\, d\omega \qquad (47)$$

for the average cost under a policy which is linear, stationary and stabilising, but otherwise arbitrary. Here $f(\omega)$ is the spectral density function of the deviation vector $\Delta$ under the policy and $\mathfrak{R}$ the associated penalty matrix. This expression is valid for a general linear time-invariant model; there is no assumption of state structure or of perfect process observation. On the other hand, the task remains of determining $f(\omega)$ in terms of the policy and determining the class of $f(\omega)$ which can be generated as this policy is varied.

It is by no means evident that the evaluation (47) reduces to that determined by (45) and (46) in the state-structured regulation case with the policy (supposed stabilising) $u = Kx$. The reconciliation in the risk-neutral case is straightforward. That for the risk-sensitive case is less so and, as one might perhaps conjecture from the form of (47), follows by appeal to a canonical factorisation: see Exercise 20.2.1.

A view of system performance which has become popular over recent years is to consider the deviation vector $\Delta$ as system output, and the various disturbances to which the system may be subject (e.g. plant and observation noise) as system input. A given control rule is then characterised by the frequency response function $G(i\omega)$ from input to output which it induces. One wishes to choose the rule to make $G$ small in some norm, and there are many norms which could be chosen.

So, suppose that the noise inputs could be regarded as a collective system input which is white with covariance (power) matrix $\mathfrak{N}$, say. (A pre-normalising filter which would achieve this could be incorporated in the total filter $G$.) Expression (47) then becomes

$$\gamma(\theta) = \frac{1}{4\pi\theta} \int \log |I + \theta \mathfrak{R}\, G(i\omega)\, \mathfrak{N}\, G(-i\omega)^T|\, d\omega = \frac{1}{4\pi\theta} \int \log |I + \theta \mathfrak{R} G \mathfrak{N} \bar{G}|\, d\omega \qquad (48)$$

and the problem then becomes one of minimising this expression with respect to $G$. Expression (48) is indeed a norm on $G$, generated by the notion of LEQG optimisation (in the stationary regime). To phrase the optimisation problem in this way helps in discussion of certain issues, as we shall see in the next chapter. However, it also glosses over an essential issue: how does one cope with the fact that the class of response functions $G$ generated as policy is varied is quite restricted? It is a virtue of the dynamic programming approach that it provides an automatic recognition of the constraints implied by specification of the system; some equivalent insight must be provided in a direct optimisation.

For reasons which will emerge in the next chapter, one sometimes normalises the matrices $\mathfrak{R}$ and $\mathfrak{N}$ to identity matrices by regarding $G$ as the transfer function from a normalised input to a normalised output (so absorbing $\mathfrak{R}$ and $\mathfrak{N}$ into the definition of $G$). In such a case (48) reduces to

$$\gamma(\theta) = \frac{1}{4\pi\theta} \int \log |I + \theta G \bar{G}|\, d\omega. \qquad (49)$$

The criterion function (48) or (49) is sometimes termed an entropy criterion, in view of its integrated logarithmic character. However, we should see it for what it is: an average cost under LEQG assumptions. In the risk-neutral case (49) reduces simply to the mean-square norm $(4\pi)^{-1} \int \mathrm{tr}[G \bar{G}]\, d\omega$, also proportional to a mean-square norm for the transient response.
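The entropy criterion (49) can be evaluated numerically for a scalar transfer function. The sketch below is an illustration under stated assumptions: it takes the hypothetical first-order response $G(i\omega) = 1/(1 + i\omega)$, so that $G\bar{G} = 1/(1 + \omega^2)$, and maps the infinite frequency range to a finite one by the substitution $\omega = \tan\phi$. For this particular $G$ the integral has the closed form $(\sqrt{1 + \theta} - 1)/(2\theta)$, which the quadrature is compared against.

```python
import math


def gain_squared(omega):
    """|G(i omega)|^2 for the assumed response G(i omega) = 1/(1 + i omega)."""
    return 1.0 / (1.0 + omega * omega)


def entropy_criterion(theta, gain_sq, n=20000):
    """(1/(4 pi theta)) * integral over all real omega of log(1 + theta*|G|^2).

    Midpoint rule in phi, where omega = tan(phi) and d omega = d phi / cos(phi)^2.
    """
    h = math.pi / n
    total = 0.0
    for k in range(n):
        phi = -math.pi / 2.0 + (k + 0.5) * h
        omega = math.tan(phi)
        total += math.log1p(theta * gain_sq(omega)) / math.cos(phi) ** 2 * h
    return total / (4.0 * math.pi * theta)


def closed_form(theta):
    """Exact value of the criterion for the assumed first-order G."""
    return (math.sqrt(1.0 + theta) - 1.0) / (2.0 * theta)


gamma_seeking = entropy_criterion(1.0, gain_squared)     # theta > 0
gamma_averse = entropy_criterion(-0.5, gain_squared)     # theta < 0
gamma_neutral = entropy_criterion(1e-6, gain_squared)    # near theta = 0
print(gamma_seeking, gamma_averse, gamma_neutral)
```

Near $\theta = 0$ the value approaches $1/4$, which is the mean-square norm $(4\pi)^{-1}\int G\bar{G}\, d\omega$ for this $G$; the criterion decreases as $\theta$ increases, and the logarithm ceases to be defined once $\theta$ falls to $-1$, the reciprocal of the peak of $G\bar{G}$ with sign reversed.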
11 TIME-INTEGRAL METHODS FOR THE LEQG MODEL

The time-integral methods of Sections 6.3 and 6.5 are equally powerful in the risk-sensitive case, and equally well cut through detailed calculations to reveal the essential structure. Furthermore, the modification introduced by risk-sensitivity is most interesting. We shall consider this approach more generally in Chapters 21 and 23, and so shall for the moment just consider the perfect-observation version of the standard state-structured regulation model (12.1)-(12.4). The stress has the form

$$\mathbb{S} = \sum_\tau \big[ c + \tfrac{1}{2} \epsilon^T (\theta N)^{-1} \epsilon \big]_\tau + \text{terminal cost}.$$

This is to be extremised with respect to $u$ and $\epsilon$ subject to the constraint of the plant equation (12.1). If we take account of the plant equation at time $\tau$ by a Lagrange multiplier $\lambda_\tau$ and extremise out $\epsilon$ then we are left with an expression

$$\sum_\tau \big[ c(x_\tau, u_\tau) + \lambda_\tau^T (x_\tau - A x_{\tau-1} - B u_{\tau-1}) - \tfrac{1}{2} \theta \lambda_\tau^T N \lambda_\tau \big] + \text{terminal cost} \qquad (50)$$

to be extremised with respect to $x$, $u$ and $\lambda$. This is the analogue of the Lagrangian expression (6.19) for the deterministic case which we found so useful, with stochastic effects taken care of by the quadratic term in $\lambda$. If we have reached time $t$ then stationarity conditions will apply only over the time-interval $\tau \ge t$. The stationarity condition with respect to $\epsilon_\tau$ implies the relation

$$\epsilon_\tau = \theta N \lambda_\tau$$

between the multiplier and the estimate of process noise. In the risk-neutral case $\theta = 0$ this latter estimate will be zero, which is indeed the message of Section 10.1. Stationarity of the time-integral (50) with respect to remaining variables at time $\tau$ leads to the simple generalisation
$$\begin{bmatrix} R & S^T & I - A^T\mathscr{T}^{-1} \\ S & Q & -B^T\mathscr{T}^{-1} \\ I - A\mathscr{T} & -B\mathscr{T} & -\theta N \end{bmatrix}\begin{bmatrix} x \\ u \\ \lambda \end{bmatrix}_\tau^{(t)} = 0 \qquad (\tau \ge t) \tag{51}$$
of equation (6.20) (in the special regulation case, when deterministic disturbances [...]). The matrix operator [...]

12 WHY DISCOUNT?
[...]

[...] for which the matrix $I + \theta G\bar{G}$ [...] expression (1) is positive, finite, and nonincreasing in $\theta$.
(ii) The critical value $\theta_G$, necessarily negative, below which $\gamma_G(\theta)$ is no longer defined is determined by
$$-\theta_G^{-1} = \sigma^2(G) \tag{2}$$
where $\sigma^2(G)$ is the maximal value (over $\omega$) of the maximal eigenvalue of $G\bar{G}$.
Proof The normalisation of expression (16.48) to expression (1) implies that $\mathfrak{R} \ge 0$, and so that the cost function is nonnegative. It follows then that the criterion $\gamma_\pi(\theta)$ of (16.1.3) is nonnegative for all real $\theta$ for which it is defined; we see
THE Hoo FORMULATION
from Exercise 16.1.3 that it is also nonincreasing in $\theta$. These properties then transfer to the limit evaluation $\gamma_G(\theta)$. Integrability of $G\bar{G}$ (which implies that $G$ is proper, in the continuous-time case) implies integrability of $\log|I + \theta G\bar{G}|$ as long as the matrix $I + \theta G\bar{G}$ remains positive definite. As $\theta$ is decreased this condition fails first at $\theta = \theta_G$. As we shall see by example, the average cost $\gamma_G(\theta)$ may or may not become infinite at that value, but its definition certainly fails from that point, because the derivation of evaluation (1) in Section 13.5 required positive definiteness. □
Since the class of control rules corresponding to the feasible $G$ includes the optimal stationary rules at any given value of $\theta$ we have

Theorem 17.1.2 The breakdown value $\bar\theta$ of $\theta$ in the stationary case can be identified with $\inf_G\theta_G$, where the infimum is over feasible values of $G$. Equivalently,
$$-\bar\theta^{-1} = \inf_G\sigma^2(G). \tag{3}$$
The quantity $\sigma(G)$, the nonnegative square root of $\sigma^2(G)$, is also a norm on $G$. It is known as the $H_\infty$ norm of $G$, otherwise written
$$\|G\|_\infty = \sigma(G).$$
The conclusion of Theorem 17.1.2 can then be expressed: an optimal stationary LEQG control rule at the breakdown point $\bar\theta$ is also $H_\infty$ optimal, in that it minimises the $H_\infty$ norm of the effective transfer function.
This is the essential implication of the Glover-Doyle paper, and was found particularly striking because the $H_\infty$ criterion had attracted intense interest since about 1981, for reasons we shall explain in the next two sections. The realisation that it is so closely related to the LEQG criterion, whose investigation had proceeded completely independently, marked a real breakthrough in understanding.
We can make the immediate assertion that the $H_\infty$ problem is solved for the state-structured problem of the last chapter by the LEQG-optimal policies we deduced in Sections 16.4-16.7, applied at the breakdown point $\theta = \bar\theta$. However, we should make the connection more explicit. If the actual white-noise input has covariance matrix $\mathfrak{N}$ and if the cost attached to an output deviation $\Delta$ is $\Delta^T\mathfrak{R}\Delta$ then the form $G\bar{G}$ is replaced by $\mathfrak{R}G\mathfrak{N}\bar{G}$, to within a similarity transformation which does not affect evaluation of eigenvalues, and so of the $H_\infty$ norm. Indeed, both $\mathfrak{N}$ and $\mathfrak{R}$ could be $\omega$-dependent, corresponding to allowance of nonwhite noise and inclusion of lagged terms in the quadratic cost function. Consider the system
$$\mathscr{A}x + \mathscr{B}u = \epsilon, \qquad y + \mathscr{C}x = \eta, \qquad u = \mathscr{K}y \tag{4}$$
1 THE $H_\infty$ LIMIT
where the noise inputs are white with covariance matrix (12.3) and the instantaneous cost function is given by (12.4) (with rate interpretations in the continuous-time case). We are thus considering a pure regulation problem in which the three relations of (4) constitute respectively plant equation, observation relation and control rule. Suppose system (4) has the solution [...]. Then the optimal $H_\infty$ rule is that which chooses the operator $\mathscr{K}$ to minimise the maximal eigenvalue (maximal over eigenvalues and over $\omega$) of the matrix $\mathfrak{R}\,G(i\omega)\,\mathfrak{N}\,G(-i\omega)^T$, with $\mathfrak{R}$ and $\mathfrak{N}$ the cost and noise-covariance matrices specified by (12.4) and (12.3).
This problem is solved in the state-structured case by the rules deduced in Sections 16.4-16.7, taken in the stationary limit and at the breakdown value $\theta = \bar\theta$. It follows then that the $H_\infty$-optimal control rules are linear Markov for the case of perfect state observation, and are certainty-equivalent versions of the latter in the case of imperfect state observation. That is, the minimal stress estimate at $\bar\theta$ (a linear estimate) is substituted for $x_t$ in the rule. However, the 'canonical model' is often given in input-output form rather than state-structured form (see Section 3), which means that the dynamical structure which we have exploited is largely lost.
One may wonder whether the concept of $H_\infty$-optimality is a useful one, since it corresponds to LEQG optimality at the very point $\theta = \bar\theta$ at which LEQG-optimisation fails. As we shall see by example in Exercise 1 and know already from Theorem 16.9.1, if the plant is unstable then both $\Pi$ and $K$ become infinite at this point. If the plant is stable then finite solutions exist for these quantities, but the determination of $\Pi$ is in fact unstable to positive perturbations.
However, the point of the $H_\infty$ norm has rather been its usefulness in the treatment of robustness, a property which we develop in the next two sections.
The optimal stationary LQG rule would be said to be based on an $H_2$ criterion, in that it minimises the quadratic norm $\int\operatorname{tr}(G\bar{G})\,d\omega$.
Exercises and comments
(1) Consider the direct determination of the $H_\infty$ optimal rule in the simplest case: that of perfect-observation regulation to zero in the scalar continuous-time case. We assume the usual state-structured model with $S$ normalised to zero. If $N = ww^T$ and one adopts a stabilising policy $u = Kx$ then the transfer function from the normalised plant noise to the deviation pair $(x, u)$ is
$$G(s) = \begin{bmatrix} 1 \\ K \end{bmatrix}\frac{w}{s - \Gamma}$$
where $\Gamma = A + BK$. It follows that (with $s = i\omega$ and $\mathfrak{R} = \operatorname{diag}(R, Q)$)
$$|I + \theta\mathfrak{R}G\bar{G}| = 1 + \frac{\theta N(R + QK^2)}{\omega^2 + \Gamma^2}$$
and so that
$$\theta_G = -\left[\sup_\omega\frac{N(R + QK^2)}{\omega^2 + \Gamma^2}\right]^{-1}.$$
The maximising value of $\omega$ is zero, so the $H_\infty$ optimal value of $K$ minimises $(R + QK^2)/(A + BK)^2$ subject to stability, i.e. to $A + BK < 0$. The freely minimising value is at $K = BR/AQ$ and is acceptable if $A < 0$; we have then $-\bar\theta N = B^2/Q + A^2/R$. If $A \ge 0$ then the minimising value of $K$ (subject to stability) is $K$ infinite and of opposite sign to $B$, corresponding to $-\bar\theta N = B^2/Q$. We thus confirm the identity of the $H_\infty$ optimal rule with the LEQG-optimal rule at the breakdown point, set out in Theorem 16.9.1.
2 CHARACTERISTICS OF THE $H_\infty$ NORM
The $H_\infty$ norm was originally introduced for considerations which are not at all stochastic in nature, let alone LEQG. In order to understand these we should establish some properties of the norm.
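Before turning to these properties, the scalar calculation of Exercise 1 in the previous section can be checked numerically. The sketch below is illustrative only: the parameter values and function names are assumptions, and a crude grid search stands in for an exact optimiser. It confirms that, for $A < 0$, the minimiser of $(R + QK^2)/(A + BK)^2$ over stabilising gains is $K = BR/AQ$, with minimal value $RQ/(A^2Q + B^2R)$, and that for $A \ge 0$ the objective keeps decreasing towards $Q/B^2$ as $K$ grows in the stabilising direction.

```python
def h_inf_objective(K, A, B, Q, R):
    """(R + Q K^2) / (A + B K)^2, the quantity the H-infinity optimal K minimises."""
    return (R + Q * K * K) / (A + B * K) ** 2

def best_stabilising_gain(A, B, Q, R, lo=-50.0, hi=50.0, steps=100001):
    """Grid search over gains K with A + B K < 0 (a sketch, not an exact optimiser)."""
    best_K, best_val = None, float("inf")
    for i in range(steps):
        K = lo + (hi - lo) * i / (steps - 1)
        if A + B * K < 0:  # stability constraint
            val = h_inf_objective(K, A, B, Q, R)
            if val < best_val:
                best_K, best_val = K, val
    return best_K, best_val

if __name__ == "__main__":
    # Stable plant (A < 0): compare the grid optimum with K = BR/(AQ).
    A, B, Q, R = -1.0, 2.0, 1.0, 3.0
    print(best_stabilising_gain(A, B, Q, R), B * R / (A * Q))
    # Unstable plant (A >= 0): the grid optimum sits at the edge of the grid.
    print(best_stabilising_gain(1.0, 2.0, 1.0, 3.0), Q / B**2)
```

The reciprocal of the minimal value in the stable case reproduces $B^2/Q + A^2/R$, the quantity $-\bar\theta N$ of the exercise.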
Theorem 17.2.1 Consider the case in which $G$ is a constant matrix, so that $\|G\|_\infty^2$ is the maximal eigenvalue of $GG^T$. We then have the characterisations
$$\|G\|_\infty^2 = \sup_\delta\frac{|G\delta|^2}{|\delta|^2} = \sup_M\frac{\operatorname{tr}(GMG^T)}{\operatorname{tr}(M)} \tag{5}$$
where the suprema are respectively over nonzero vectors $\delta$ and nonzero nonnegative definite matrices $M$ of appropriate dimension.
Proof The second expression in (5) is $\sup_\delta[(\delta^TG^TG\delta)/(\delta^T\delta)]$ which is indeed the maximal eigenvalue of $G^TG$ and so also of $GG^T$. If $M$ has spectral representation $M = \sum_j\lambda_j\delta_j\delta_j^T$, where the eigenvalues $\lambda_j$ are necessarily nonnegative, then
$$\frac{\operatorname{tr}(GMG^T)}{\operatorname{tr}(M)} = \frac{\sum_j\lambda_j|G\delta_j|^2}{\sum_j\lambda_j|\delta_j|^2}$$
which plainly has the same sharp upper bound as does the second expression. □
One might express the first characterisation verbally as: $\|G\|_\infty^2$ is the maximal 'energy' amplification that $G$ can achieve when applied to a vector. Consider now the dynamic case, when $G(i\omega)$ is the frequency response function of a filter with action $G(\mathscr{D})$.
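Returning for a moment to the constant-matrix case of Theorem 17.2.1, both characterisations can be checked numerically. The sketch below is a hedged illustration: the $2\times 2$ matrix, the random sampling and the helper names are all assumptions. The maximal eigenvalue is computed in closed form for a symmetric $2\times 2$ matrix, and the trace ratio is taken in the form $\operatorname{tr}(GMG^T)/\operatorname{tr}(M)$ with $M = \sum_j\lambda_j\delta_j\delta_j^T$.

```python
import math
import random

def max_eig_sym2(p, q, r):
    """Largest eigenvalue of the symmetric 2x2 matrix [[p, q], [q, r]]."""
    return (p + r) / 2 + math.hypot((p - r) / 2, q)

def energy(G, d):
    """|G d|^2 for a 2x2 matrix G and a 2-vector d."""
    y0 = G[0][0] * d[0] + G[0][1] * d[1]
    y1 = G[1][0] * d[0] + G[1][1] * d[1]
    return y0 * y0 + y1 * y1

def vector_gain(G, d):
    """The ratio |G d|^2 / |d|^2."""
    return energy(G, d) / (d[0] ** 2 + d[1] ** 2)

def trace_ratio(G, lams, vecs):
    """tr(G M G^T) / tr(M) for M = sum_j lam_j d_j d_j^T with lam_j >= 0."""
    num = sum(l * energy(G, d) for l, d in zip(lams, vecs))
    den = sum(l * (d[0] ** 2 + d[1] ** 2) for l, d in zip(lams, vecs))
    return num / den

G = [[1.0, 2.0], [0.5, -1.0]]  # arbitrary illustrative matrix
# Entries of G^T G, whose largest eigenvalue is the squared norm.
p = G[0][0] ** 2 + G[1][0] ** 2
q = G[0][0] * G[0][1] + G[1][0] * G[1][1]
r = G[0][1] ** 2 + G[1][1] ** 2
norm_sq = max_eig_sym2(p, q, r)
top_vec = (q, norm_sq - p)  # eigenvector of G^T G for the largest eigenvalue

if __name__ == "__main__":
    random.seed(0)
    print("squared norm:", norm_sq)
    print("gain at top eigenvector:", vector_gain(G, top_vec))
    d = (random.gauss(0, 1), random.gauss(0, 1))
    print("gain at a random vector:", vector_gain(G, d))
```

Random vectors and random nonnegative combinations never exceed the bound, while the top eigenvector of $G^TG$ attains it.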
Theorem 17.2.2 In the case of a filter with transfer function $G(s)$
$$\|G\|_\infty^2 = \sup\frac{E|G(\mathscr{D})\delta|^2}{E|\delta|^2}$$
where the supremum is over the class of stationary vector processes $\{\delta(t)\}$ of appropriate dimension for which $0 < E|\delta|^2 < \infty$.
Proof Suppose that the $\delta$-process has spectral distribution matrix $F(\omega)$. Then we can write
$$\frac{E|G(\mathscr{D})\delta|^2}{E|\delta|^2} = \frac{\int\operatorname{tr}(G\,dF\,\bar{G})}{\int\operatorname{tr}(dF)}.$$
But the increment $dF = dF(\omega)$ is a nonnegative definite matrix. We thus see from the second characterisation of the previous theorem that the sharp upper bound to this last expression (under variations of $F$) is the supremum over $\omega$ of the maximal eigenvalue of $G\bar{G}$, which is just $\|G\|_\infty^2$. The bound is attained when $\delta(t)$ is a pure sinusoid of appropriate frequency and vector amplitude. □
We thus have the dynamic extension of the previous theorem: $\|G\|_\infty^2$ can be characterised as the maximal 'power' amplification that the filter $G$ can achieve for a stationary input.
Finally, there is a deterministic version of this last theorem which we shall not prove. We quote only the continuous-time case, then give a finite-dimensional proof in Exercise 1 and a counterexample in Exercise 2. Suppose that the response function $G(s)$ belongs to the Hardy class, in that it is causal as well as stable. This implies that $G(s)$ is analytic in the closed right half of the complex plane and (not trivially) that the maximal eigenvalue of $G(s)G(s)^\dagger$ (over eigenvalues and over complex $s$ with nonnegative real part) is attained for a value $s = i\omega$ on the imaginary axis. One can assert:

Theorem 17.2.3 If the filter $G$ belongs to the Hardy class then
$$\|G\|_\infty^2 = \sup\frac{\int|G(\mathscr{D})\delta|^2\,dt}{\int|\delta|^2\,dt}$$
where the supremum is over nonzero vector signals $\delta(t)$ of appropriate dimension.
This differs from the situation of Theorem 17.2.2 in that $\delta$ is a deterministic signal of finite energy rather than a stationary stochastic signal of finite power. An integration over time replaces the statistical expectation. The assertion is that
$\|G\|_\infty^2$ is the maximal amplification of 'total energy' that $G$ can achieve when applied to a signal.
Since we wish $G$ to be 'small' in the control context, it is apparent from the above that adoption of the $H_\infty$ criterion amounts to design for protection against the 'worst case'. This is consistent with the fact that LEQG design is increasingly pessimistic as $\theta$ decreases, and reaches blackest pessimism at $\theta = \bar\theta$. The contrast is with the $H_2$ or risk-neutral case, $\theta = 0$, when one designs with the average case in mind. However, the importance of the $H_\infty$ criterion over the past few years has derived, not from its minimax character as such, but rather from its suitability for the analysis of robustness of design. This suitability stems from the property, easily established from the characterisations of the last two theorems, that
$$\|G_1G_2\|_\infty \le \|G_1\|_\infty\|G_2\|_\infty. \tag{6}$$
Exercises and comments
(1) The output from a discrete-time filter with input $\delta_t$ and transient response $g_t$ is $y_t = \sum_\tau g_\tau\delta_{t-\tau}$. We wish to find a sequence $\{\delta_t\}$ whose 'energy' is amplified maximally by the filter, in that $(\sum_t|y_t|^2)/(\sum_t|\delta_t|^2)$ is maximal. Consider the SISO case, and suppose time periodic in that $g_\tau$ has period $m$ and all the sums over time are restricted to $m$ consecutive values. Show (by appeal to Theorem 17.2.1 and Exercise 13.2.1, if desired) that the energy ratio is maximised for $\delta_t = e^{i\omega t}$ with $\omega$ some multiple of $2\pi/m$, and that the value of the ratio is then $|G(i\omega)|^2 = |\sum_\tau g_\tau e^{-i\omega\tau}|^2$.
(2) Consider the realisable SISO continuous-time filter with transient response $e^{at}$ and so transfer function $G(s) = (s - a)^{-1}$. Then $\|G\|_\infty^2$ is just the maximal value of $|G(i\omega)|^2$, which is $a^{-2}$. Consider a signal $\delta(t) = e^{-\beta t}$ for $t \ge 0$, zero otherwise, where $\beta$ is positive. Show that the ratio of integrated squared output to integrated squared input is
$$2\beta(a + \beta)^{-2}\int_0^\infty\left(e^{2at} - 2e^{(a-\beta)t} + e^{-2\beta t}\right)dt.$$
If $a < 0$ (so that the filter is causal) then this reduces to $(a^2 - a\beta)^{-1}$, which is indeed less than $a^{-2}$. If $a > 0$ (so that the filter is not causal) then the expression is infinite.
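The energy ratio of Exercise 2 can be checked by direct numerical integration. This is a minimal sketch under stated assumptions: the truncation horizon `T`, the Simpson step count and the function names are illustrative choices, and the filter output is taken as the convolution $y(t) = (e^{at} - e^{-\beta t})/(a + \beta)$, which requires $a + \beta \ne 0$ and $a < 0$.

```python
import math

def simpson(f, lo, hi, n):
    """Composite Simpson rule with n (even) subintervals."""
    h = (hi - lo) / n
    s = f(lo) + f(hi)
    for k in range(1, n):
        s += (4 if k % 2 else 2) * f(lo + k * h)
    return s * h / 3

def energy_ratio(a, beta, T=60.0, n=20000):
    """Integrated squared output over integrated squared input, truncated at T.

    Filter response e^{a t} (a < 0), input e^{-beta t} for t >= 0 (beta > 0).
    """
    y = lambda t: (math.exp(a * t) - math.exp(-beta * t)) / (a + beta)
    d = lambda t: math.exp(-beta * t)
    out = simpson(lambda t: y(t) ** 2, 0.0, T, n)
    inp = simpson(lambda t: d(t) ** 2, 0.0, T, n)
    return out / inp

def energy_ratio_closed(a, beta):
    """The value (a^2 - a*beta)^{-1} stated in the exercise for a < 0."""
    return 1.0 / (a * a - a * beta)

if __name__ == "__main__":
    for a, beta in ((-1.0, 0.5), (-2.0, 3.0)):
        print(a, beta, energy_ratio(a, beta), energy_ratio_closed(a, beta), a ** -2)
```

Since $(a^2 - a\beta)^{-1}$ increases towards $a^{-2}$ as $\beta \downarrow 0$, the bound of Theorem 17.2.3 is approached but not attained by signals of this family.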
3 THE $H_\infty$ CRITERION AND ROBUSTNESS
Suppose that a control rule is designed for a particular plant, and so presumably behaves well for that plant (in that, primarily, it stabilises adherence to set points or command signals). The rule is robust if it continues to behave well even if the actual plant deviates somewhat from that specified. The concept of robustness
[Figure 1: block diagram of the feedback loop, with plant $G$, controller $K$, command signal $w$ and control $u$.]
Figure 1 A block diagram of the system of equation (7). If a pole of $G(s)$ corresponding to a plant instability is cancelled by a zero of $K(s)$, then the controlled system will not be completely stable.
thus takes account of the fact that the plant structure may never be known exactly, and may indeed vary in time. In control theory, as in statistics and other subjects, the conviction has grown that a concern for optimality must be complemented by a concern for robustness. Furthermore, both qualities must be quantified if one is to reach the right compromise between goals which are necessarily somewhat conflicting.
For an example, consider the system of Figure 1, designed to make plant output $v$ follow a command signal $w$. Here $G$ and $K$ are the transfer functions of plant and controller, and actual plant output $v$ is distinguished from observed plant output $y = v + \eta$, where $\eta$ is observation noise. The block diagram is equivalent to the relations
$$v = Gu, \qquad u = K(w - v - \eta) \tag{7}$$
whence we deduce the expression
$$v - w = -(I + GK)^{-1}(w + GK\eta) \tag{8}$$
for the tracking error $v - w$ in terms of the inputs to the system. For stable operation we require that the operator $I + GK$ should have a stable causal inverse. Let us assume that stability holds; the so-called small gain theorem asserts that a sufficient condition ensuring this is that $\|GK\|_\infty < 1$.
We see from (8) that the response functions of tracking error to command signal and (negative) observation noise are respectively
$$S_1 = (I + GK)^{-1}, \qquad S_2 = (I + GK)^{-1}GK.$$
One would like both of these to be small, in some appropriate norm. They cannot, however, both be small, because $S_1 + S_2 = I$. $S_1$ is known as the sensitivity of the
system; its norm measures performance in that it measures the relative tracking error. $S_2$ is known as the complementary sensitivity, and actually provides a measure of robustness (or, rather, of lack of robustness) in that it measures the sensitivity of performance to a change in plant specification. This is plausible, in that noise-corruption of plant output is a kind of plant perturbation. However, for an explicit demonstration, suppose the plant operator perturbed from $G$ to $G + \delta G$. We see from the correspondingly perturbed version of equations (7) that
$$v = (I + GK)^{-1}GK(w - \eta) + (I + GK)^{-1}(\delta G)K(w - v - \eta).$$
The perturbed system will remain stable if the operator $I + (I + GK)^{-1}(\delta G)K$ acting on $v$ has a stable causal inverse. It follows from another application of the small gain theorem that this continued stability will hold if
$$\|G^{-1}\delta G\|_\infty < \|S_2\|_\infty^{-1},$$
i.e. if the relative perturbation in plant is less than the reciprocal of the complementary sensitivity, in the $H_\infty$ norm.
Actually, one should take account of the expected scale and dynamics of the inputs to the system. This is achieved by setting $w = W_1\tilde{w}$ and $\eta = W_2\tilde\eta$, say, where $W_1$ and $W_2$ are prescribed filters. In the statistical LEQG approach one would regard $\tilde{w}$ and $\tilde\eta$ as standard vector white noise variables. In the worst-case deterministic approach one would generate the class of typical inputs by letting $\tilde{w}$ and $\tilde\eta$ vary in the class of signals of unit total energy. Performance and robustness are then measured by the smallness of $\|S_1W_1\|_\infty$ and $\|S_2W_2\|_\infty$ respectively. Specifically, the upper bound on $\|G^{-1}\delta G\|_\infty$ which ensures continued stability is
$$\|S_2W_2\|_\infty^{-1}.$$
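The sensitivity and complementary sensitivity can be illustrated on a scalar loop. The following is a sketch under assumptions that are not in the text: the plant $G(s) = 1/(s + 1)$, the constant controller gain $k = 4$, unit weighting filters, and a finite frequency grid (which gives only a lower bound on each supremum) are all illustrative choices, and the function names are hypothetical. It checks that $S_1 + S_2 = 1$ and compares the bound $\|S_2\|_\infty^{-1}$ with the exact stability range for a constant relative perturbation $\delta G = \rho G$.

```python
def G(s):
    """Assumed scalar plant 1/(s + 1)."""
    return 1.0 / (s + 1.0)

k = 4.0  # assumed constant controller gain, K(s) = k

def S1(s):
    """Sensitivity (1 + G K)^{-1}."""
    return 1.0 / (1.0 + G(s) * k)

def S2(s):
    """Complementary sensitivity (1 + G K)^{-1} G K."""
    return G(s) * k / (1.0 + G(s) * k)

def hinf_norm(f, omegas):
    """Grid estimate (a lower bound) of sup over w of |f(i w)|."""
    return max(abs(f(1j * w)) for w in omegas)

def closed_loop_pole(rho):
    """Pole of the loop when the plant is perturbed to (1 + rho) G: s = -1 - k (1 + rho)."""
    return -1.0 - k * (1.0 + rho)

omegas = [0.01 * i for i in range(100001)]  # 0 to 1000 rad per unit time

if __name__ == "__main__":
    n_gk = hinf_norm(lambda s: G(s) * k, omegas)
    n_s1 = hinf_norm(S1, omegas)
    n_s2 = hinf_norm(S2, omegas)
    print("||GK|| ~", n_gk, " ||S1|| ~", n_s1, " ||S2|| ~", n_s2)
    print("tolerated relative perturbation < ", 1.0 / n_s2)
    for rho in (-1.3, -1.2, 0.0, 1.2):
        print(rho, "stable" if closed_loop_pole(rho) < 0 else "unstable")
```

In this example the loop is stable although $\|GK\|_\infty$ exceeds one, which is consistent with the small gain condition being sufficient rather than necessary. The perturbation bound, by contrast, coincides with the exact stability limit $\rho > -1.25$ for constant $\rho$.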
Of course, in a simple minimisation of quadratic costs a compromise will be struck between 'minimisation' of the two operators $S_1$ and $S_2$ in some norm. There will be a quadratic term in the tracking error $v - w$ in the cost function, and this will lead to an expression in $H_2$ norms of $S_1$ and $S_2$. The more observation noise there is, the greater the consideration given to decreasing $S_2$, so assuring increased robustness.
The advantages of an $H_\infty$ formulation were demonstrated first in Zames (1981), who began an analysis which has since been brought to a high technical level by Doyle, Francis and Glover, among others; see Francis (1987) and Doyle, Francis and Tannenbaum (1992). The standard formulation of the problem is the system formulation of equations (6.44)/(6.45), expressed in the block diagram of Figure 4.4. The design problem is phrased as the choice of $K$ to minimise the response of $\Delta$ to $\zeta$ in the $H_\infty$ norm, subject to the condition that '$K$ should stabilise the system', i.e. that all system outputs should be stabilised against all system inputs.
The analysis of this formulation has generated a large specialist literature. We shall simply list a number of observations.
(i) The relation of the $H_\infty$ criterion to the LEQG criterion means that many $H_\infty$ problems already have a solution in the now well-established LEQG extension of classical LQG ideas. The need to take account of observation noise ensures a degree of robustness in rules derived on either LQG or LEQG criteria, although this is an insight which might not have been apparent before Zames.
(ii) The two problems formulated in this section are stated in input-output form, in that the plant is specified simply by its response function $G$ rather than by a state-structured model, for example. There have been a number of attempts over the years to attack such problems directly, usually in the LQG framework, by, for example, simply seeking the filter $K$ in (7) which minimises a quadratic criterion such as $E(\Delta^T\mathfrak{R}\Delta)$. (See Newton, Gould and Kaiser (1957), Holt et al. (1960), Whittle (1963), Youla, Bongiorno and Jabr (1976)). One can make a certain amount of progress, but, as we noted in Section 6.7, a head-on approach can prove perplexing and tends to miss two insights which are revealed almost automatically in a state-space analysis of the nonstationary problem. These are: (i) certainty-equivalence, with its separation of the aspects of estimation and control, and (ii) the usefulness of the introduction of conjugate variables; Lagrange multipliers associated with the constraints constituted by the prescribed dynamic relations. Exploitation of these properties simplifies the analysis radically, especially in the vector case.
We consider the determination of optimal stationary controls without reduction to state structure in Chapters 18-21, but using time-integral methods which incorporate these insights by their nature and yield precisely what is needed.
(iii) The operator factorisation techniques associated with these time-integral methods are distinct from both the Wiener-Hopf methods used in a direct attack on the input-output model and the matrix J-factorisation techniques expounded by e.g. Vidyasagar (1985) and Mustafa and Glover (1990) for solution of the $H_\infty$ problem.
(iv) Simple minimisation of $\Delta$ in some norm will in fact assure overall stability of the system (provided this is physically achievable) if $\Delta$ includes measures of relevant signals in the system. For example, the inclusion of $u$ itself in the criteria used through the whole of this text avoids the possibility that good stability or tracking could be achieved at the expense of infinite control forces. Further, and as we have seen, such an optimisation will automatically achieve some degree of robustness if observation noise is assumed in the model.
Finally, one might say that the ultimate guarantee of robustness is use of an adaptive control rule; an observation which opens wider vistas than we can explore.
PART 4
Time-integral Methods and Optimal Stationary Policies
This Part is devoted to the direct determination of the optimal stationary policy for a model which is supposed LQG or LEQG and time-homogeneous, but not necessarily state-structured. The theme is an interesting one in that the subthemes of time-integrals, certainty equivalence, the maximum principle, transform methods, canonical factorisation and policy improvement coalesce naturally. The methods are a development of those already applied to the deterministic state-structured case in Section 6.3 and generalised in Section 6.5.
The reader content to consider only the state-structured case could omit this Part without prejudice to what follows (although passing up a chance for enlightenment, in the author's view!).
The problem first to be considered is that of deriving a stationary optimal control policy for a time-homogeneous LQG model, in general neither state-structured nor perfectly observable. Our methods intrinsically assume the existence of infinite-horizon limits, so that the optimal policy is derived as the infinite-horizon limit (in fact stationary) of an optimal finite-horizon policy.
Such a policy will certainly have the property of being average-optimal: i.e. of minimising the average cost incurred per unit time in the stationary regime. It will in fact also have the stronger property of optimising reaction to transients, which an average-optimal policy may not have unless plant noise is such that it can stimulate all modes of the system.
We have an expression for the average cost in formula (13.24), and policy is specified by the time-invariant realisable linear operator $\mathscr{K}$ which one uses to express the control $u$ in terms of current observables. One might regard the optimisation problem simply as a matter of direct minimisation of this
expression with respect to $\mathscr{K}$. This was the approach taken by a number of authors in the early days of control optimisation (see Newton, Gould and Kaiser (1957), Holt et al. (1960), Whittle (1963)), when it seemed a natural development of the techniques employed by Wiener for optimal prediction. However, while the approach yields results rather easily in the case of scalar variables, in the vector case one is led to equations which seem neither tractable nor transparent.
In fact, by attacking the problem in this bull-like fashion one is forgoing all the insights of certainty equivalence, the Kalman filter etc. One could argue that, if these insights had not already been gained, they should be revealed in any natural approach. If so, then this is not one. In fact, the seeming simplification of narrowing the problem down to one of average-optimisation blinds one to an even more direct approach.
This is an approach which is familiar in the deterministic case and which turns out to be available even in the stochastic case: the extremisation of a time-integral. We use this term in a rather specific technical sense; by a time-integral we mean a sum or integral over time of a function of current variables of the model in which expectations are absent and which is such that the optimal values of decisions and estimates can be obtained by a free and unconstrained extremisation of the integral.
In earlier publications on this topic the author has referred to these as 'path-integrals', but this is a usage inconsistent with the quantum-theoretic use of the term. Strictly speaking, a path-integral is an integral over paths (i.e. an expectation over the many paths which are possible) whereas a time-integral is an integral along a path. The fact which makes substantial progress possible is that a path-integral can often be expressed as an extremum over time-integrals. For example, we saw in Chapter 16 that the expectation (i.e. the path-integral) $E[\exp(-\theta\mathbb{C})]$ could be expressed as an extremum of the stress $\mathbb{S} = \mathbb{C} + \theta^{-1}\mathbb{D}$. If one clears matrix inverses from the stress by Legendre transformations (i.e. by introducing Lagrange multipliers to take account of the constraints of plant and observation equations) then one has the expectation exactly in the form of the extremum of a time-integral.
It is this reduction which we have exploited in Chapters 6 and 16, shall exploit for a general class of LQG and LEQG models in this Part, and shall extend (under scaling assumptions) to the non-LQG case in Part 5.
We saw in Section 6.3 that the state-structured LQ problem could be converted to a time-integral formulation by the introduction of Lagrange multipliers, and that the powerful technique of canonical factorisation then determined the optimal stationary policy almost immediately. We saw in Sections 6.5 and 6.6 that these techniques extended directly to models which were not state-structured. These solutions extend to the stochastic and imperfectly observed case by the simple substitution of estimates for unobservables, justified by the certainty-equivalence principle. We shall see in Chapter 20 that time-integral techniques also take care of the estimation problem (the familiar control/estimation duality
finding perfect expression) and, in Chapter 21, that these methods extend to the LEQG model. All these results are exact, but the extension to the non-LQG case of Part 5 is approximate in the same sense as is the approximation of the path-integrals of quantum mechanics by the time-integrals (action integrals) of classical mechanics, with refinement to a higher-order approximation at sensitive parts of the trajectory.
There is a case, then, for developing the general time-integral formalism first, which we do in Chapter 18. In this way one sees the general pattern, uncluttered by the special features which necessarily arise in the application to control and estimation.
There is one point which should be made. Although our models are no longer state-structured, they are not so general that they are given in input/output form. Rather, we assume plant and observation equations, as in the state-structured case, but allow variables in them to occur to any lag up to some value $p$. The loss of an explicit dynamic relationship which occurs when one reduces the model to input/output form has severe consequences, as we saw in Section 6.7.
This Part gives a somewhat streamlined version of the author's previous work on time-integral methods as set out in Whittle (1990a). However, the material of Sections 20.3-20.6 and 21.3 is, to the best of the author's knowledge, new. There must be points of contact in the literature, however, and the author would be grateful for notice of these.
CHAPTER 18
The Time-integral Formalism
1 QUADRATIC INTEGRALS IN DISCRETE TIME
For uniformity we shall continue to use the term 'time-integral' even in discrete time, despite the fact that the 'integral' is then a sum. Consider the integral in the variable $\xi$
(1)
with prescribed coefficients $G$ and $\zeta$. This is then a sum over time of a quadratic function of the vector sequence $\{\xi_\tau\}$, the function being time-invariant in its second-degree part. So, for the models of Sections 6.3 and 6.5 $\xi$ is the vector with subvectors $x$, $u$ and $\lambda$, and the time-varying term $\zeta_\tau$ would arise from known disturbances $d$ or known command signals $x^c$, $u^c$. The matrix coefficients $G_j$ and the vector coefficients $\zeta_\tau$ are specified, but no generality is lost by imposition of the normalisation
(2)
If the sum in (1) runs over $h_1 < \tau < h_2$, then the 'end terms' arise from contributions at times $\tau \le h_1$ and $\tau \ge h_2$ respectively. The final term can be regarded as arising from a terminal cost and the initial term as arising from a probabilistic specification of initial conditions.
Suppose the optimisation problem is such that we wish to extremise the integral with respect to the course of the variable $\{\xi_\tau\}$. We cannot at the moment be more specific than 'extremise', as we see by considering again the models of Chapter 6, for which one minimised with respect to $x$ and $u$ and maximised with respect to $\lambda$. In any case, the stationarity condition with respect to $\xi_\tau$ is easily seen to be
(3)
if $\tau$ is sufficiently remote from both $h_1$ and $h_2$ that neither end term involves $\xi_\tau$, where
$$\Phi(\mathscr{T}) := \sum_j G_j\mathscr{T}^j. \tag{4}$$
The normalisation (2) has the implication = 0),
(40)
6 THE INTEGRAL FORMALISM IN CONTINUOUS TIME
where we mean by 'effective equality' that the substitution will change only the end effects, and so does not affect the optimality conditions at time points interior to the range of integration.
We can carry the reduction (40) so far as to write the integral (37) as
$$\mathbb{I} = \tfrac12\int\Big[\sum_j\sum_k\sum_r\sum_s c_{jk}^{(rs)}(-1)^r\xi_j\xi_k^{[r+s]}\Big]d\tau - \int\zeta^T\xi\,d\tau + \text{end effects},$$
or, more compactly,
$$\mathbb{I} = \int\left[\tfrac12\xi^T\Phi(\mathscr{D})\xi - \zeta^T\xi\right]d\tau + \text{end effects}. \tag{41}$$
Here $\Phi(\mathscr{D})$ is an $n \times n$ matrix of operators with $jk$th element
$$\Phi_{jk}(\mathscr{D}) = \sum_r\sum_s c_{jk}^{(rs)}(-\mathscr{D})^r\mathscr{D}^s. \tag{42}$$
Theorem 18.6.1 The condition that the $\xi$-path for $\tau \ge t$ should render the integral (37) stationary can be written
$$\Phi(\mathscr{D})\xi = \zeta \qquad (\tau \ge t) \tag{43}$$
at time points interior to the interval of integration. The matrix operator $\Phi(\mathscr{D})$ is Hermitian in that
(44)
The second assertion is an immediate consequence of (39) and (42); the first then [...]
[...] and that $F$ will obey the backward equation
$$c + \frac{\partial F}{\partial t} + \sum_j\sum_r\xi_j^{[r+1]}\frac{\partial F}{\partial\xi_j^{[r]}} = 0. \tag{48}$$
[...]

Exercises and comments
(1) The return difference equation. Consider again the classic feedback loop of Figure 4.2. The operator $\mathscr{K}\mathscr{G}$ is the loop operator, giving the effect on a signal which passes successively through the plant and the controller, and $\mathscr{J} = \mathscr{I} + \mathscr{K}\mathscr{G}$ is the return difference operator, giving the difference in effect between no traverse and one traverse of the loop. If a disturbance $d$ is superimposed on the control $u$ as it is fed into the plant then $u$ satisfies $\mathscr{J}u = d$. Let $J(s)$ be the continuous-time transfer function of $\mathscr{J}$. The return difference equation is a version of the inequality $\bar{J}QJ \ge Q$, which holds for LQ-optimal control, at least under state-structure assumptions, if $Q$ is the penalty matrix for control. It expresses the fact that the operation $\mathscr{J}^{-1}$ attenuates at all frequencies.
The relation is very easily proved from the canonical factorisation of $\Phi$ with factors of the known form (17). If the column vector with subvectors $x$, $u$ and $\lambda$ is written $\Delta$, let the solution of $\phi\Delta = 0$ in terms of $u$ be written $\Delta = Du$. Then the return difference equation amounts simply to (20)
h(j)n 1¢D = hD. Show that in the special case (19) relation (20) implies ]QJ =
Q+BA 1RA 1B,
where J = I  Q 1B"[ rPxxA 1B. Show that in the case 0 R(s) Q(s) (s) = [ 0 A(s) B(s)
~]
B(s) , 0
special only in that S(s) has been normalised to zero, relation (20) amounts to
JQ.J= Q+BA 1RA 1B, where J = Q;: 1{[¢uu(2))  B*A;: 1¢xu] [¢ux B.A;: 1¢xx]A 1B}. In both cases J is indeed the return difference transfer function for the optimal control.
CHAPTER 20
Optimal Stationary LQG Policies: Imperfect Observation
1 THE PROCESS/OBSERVATION MODEL: APPEAL TO CERTAINTY EQUIVALENCE
We assume the linear model of equation (19.1) together with an observation relation:
$$\mathscr{A}x + \mathscr{B}u = d + \epsilon \tag{1}$$
$$y + \mathscr{C}x = \eta. \tag{2}$$
This specification covers both the discrete and continuous-time formulations; the time variable has not been explicitly indicated. The operator $\mathscr{C}$ is, like $\mathscr{A}$ and $\mathscr{B}$, a causal translation-invariant linear operator. Let us initially discuss the discrete-time case with $p$th-order dynamics, with $\mathscr{C}$ having the form $C(\mathscr{T}) = \sum_{r=1}^p C_r\mathscr{T}^r$. As ever, $d$ is a deterministic disturbance term and $\epsilon$ and $\eta$ are plant and observation noise respectively. We suppose that these noise terms jointly constitute Gaussian white noise with zero mean and covariance matrix

In this case of imperfect observation the information $W_t$ available at time $t$ consists of the observation and control histories $Y_t$ and $U_{t-1}$, plus the complete course of the deterministic component of disturbance $\{d_\tau\}$. We shall ultimately be passing to the stationary regime, in which the past as well as the future is infinite, so that observation and control histories extend into the infinite past.
Since the model is totally LQG we can appeal to the certainty equivalence principle of Section 12.3 to deduce the optimal control in this imperfectly observed case from that for perfect observation. We know from the analysis of Section 19.2 that, in the case of perfect observation, the optimal stationary determination of $u_t$ is given in closed-loop form by
(3)
The left-hand member of this relation gives the feedback component of optimal control, expressing $u_t$ in terms of $x_\tau$ $(t - p < \tau \le t)$ and $u_\tau$ $(t - p < \tau < t)$. The right-hand member gives the feedforward term, in terms of $d_\tau$ $(\tau \ge t)$. In the case of imperfect observation recursion (3) for the optimal control still holds, except that $x_\tau$ must be replaced, where it occurs, by the current projection estimate $x_\tau^{(t)}$. We are then led to the inference problem: the determination of these estimates. The duality of estimation and control has already been demonstrated for the Markov case in Section 12.9; we shall see how this extends to dynamics of general order.
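The certainty-equivalence step described above (use the perfect-observation control rule, with the state replaced by its current estimate) can be illustrated in the Markov case. The following is a minimal sketch only, assuming a scalar first-order plant $x_{t+1} = ax_t + bu_t + \epsilon_t$ observed through $y_t = cx_t + \eta_t$ with uncorrelated noises of variances $n$ and $m$, and stage cost $rx^2 + qu^2$. This observation convention, the function names `lq_gain` and `simulate`, and all numerical values are illustrative assumptions; they are not the operator formulation (1), (2) of the text.

```python
import random

def lq_gain(a, b, r, q, iters=500):
    """Stationary gain k for x[t+1] = a*x[t] + b*u[t], stage cost r*x^2 + q*u^2."""
    p = r
    for _ in range(iters):
        k = a * b * p / (q + b * b * p)
        p = r + a * a * p - a * b * p * k   # scalar Riccati recursion
    return a * b * p / (q + b * b * p)

def simulate(a=1.1, b=1.0, c=1.0, r=1.0, q=1.0, n=0.1, m=0.1,
             steps=200, seed=0):
    """Run the certainty-equivalent controller; return (gain, average stage cost)."""
    rng = random.Random(seed)
    k = lq_gain(a, b, r, q)
    x = 1.0               # true state, never seen by the controller
    xhat, v = 0.0, 1.0    # prior mean and variance of the state
    total = 0.0
    for _ in range(steps):
        y = c * x + rng.gauss(0.0, m ** 0.5)   # noisy observation
        g = v * c / (c * c * v + m)            # filter gain
        xhat += g * (y - c * xhat)             # update the estimate
        v *= 1.0 - g * c                       # update its variance
        u = -k * xhat                          # estimate replaces the state
        total += r * x * x + q * u * u
        x = a * x + b * u + rng.gauss(0.0, n ** 0.5)   # plant step
        xhat = a * xhat + b * u                # predict the mean
        v = a * a * v + n                      # predict the variance
    return k, total / steps

gain, average_cost = simulate()
print(gain, average_cost)
```

The gain is computed exactly as it would be under perfect observation; only the argument fed to it changes. With the default values the open-loop plant is unstable ($a > 1$), while the closed-loop factor $a - bk$ lies inside the unit interval.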
2 PROCESS ESTIMATION IN TIME-INTEGRAL FORM (DISCRETE TIME)

The characterisation that we shall take of projection estimates is not the linear least-square property asserted in (ii) of Theorem 12.6.3, but rather the dual probability-maximising (or discrepancy-minimising) property asserted in (iii) of the same theorem. It is this characterisation which yields the natural time-integral formulation. A related point is that canonical factorisations are then factorisations of something like the reciprocal of an AGF rather than (as in the LLS approach associated with the Wiener filter) a factorisation of an AGF. This means that, in the case of $p$th-order dynamics, the factors are polynomials of degree $p$.

We can set up the problem rigorously under the supposition that observation began at a finite time $h_t$, and can then pass to the infinite-history limit $h_t \to -\infty$, just as we passed to the infinite-horizon limit for control optimisation. Histories will then be restricted histories, so that $X_t$ is $\{x_\tau;\ h_t - p \le \tau \le t\}$, etc. Then the negative exponent in the Gaussian density of $X_t$ and $Y_t$ for prescribed $U_{t-1}$ is the discrepancy
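$$\mathbb{D}_t = \text{prior terms} + \frac{1}{2}\sum_{\tau=h_t}^{t} \begin{bmatrix} \epsilon \\ \eta \end{bmatrix}_\tau^{T} \begin{bmatrix} N & L \\ L^{T} & M \end{bmatrix}^{-1} \begin{bmatrix} \epsilon \\ \eta \end{bmatrix}_\tau,$$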
where $\epsilon$ and $\eta$ are expressed in terms of $x$, $y$ and $u$ by appeal to the plant and observation relations (1) and (2). The 'prior terms' reflect the distribution assumed for relevant system history before time $h_t$. In the case of $p$th-order dynamics they will constitute a quadratic function of $\{x_\tau, y_\tau, u_\tau;\ h_t - p \le \tau < h_t\}$.

The projection estimates $x_\tau^{(t)}$ are just the values of $x_\tau$ minimising $\mathbb{D}_t$. Estimation thus amounts to a backward rather than a forward optimisation problem; a time-integral (or a precursor to one) is to be extremised over its course before the current moment $t$ rather than after.

Actually, we would not regard extremisation of $\mathbb{D}_t$ as extremisation of a time-integral, because it is subject to the constraints implied by the plant and observation relations (1) and (2). Let us eliminate these by the introduction of
Lagrangian multiplier vectors $l_\tau$ and $m_\tau$ for the constraints constituted by the relations at time $\tau$ and so extremise a Lagrangian form $\mathbb{D}_t$
+ L:W (dx +Plud E) + mT (y +