Approximate Dynamic Programming for Operations Research: Solving the curses of dimensionality
Warren B. Powell
May 2, 2005
(c) Warren B. Powell, 2005 All rights reserved.
Department of Operations Research and Financial Engineering Princeton University, Princeton, NJ 08544
Contents

1 The challenges of dynamic programming
  1.1 A dynamic programming example: a shortest path problem
  1.2 The three curses of dimensionality
  1.3 Some real applications
  1.4 Problem classes in asset management
  1.5 What is new in this book?
  1.6 The many dialects of dynamic programming
  1.7 Bibliographic notes

2 Some illustrative applications
  2.1 Deterministic problems
    2.1.1 The budgeting problem
    2.1.2 The shortest path problem
  2.2 Stochastic problems
    2.2.1 The gambling problem
    2.2.2 The batch replenishment problem
    2.2.3 The secretary problem
    2.2.4 Optimal stopping
  2.3 Bibliographic notes

3 Modeling dynamic programs
  3.1 Notational style
  3.2 Modeling time
  3.3 Modeling assets
  3.4 Illustration: the nomadic trucker
    3.4.1 A basic model
    3.4.2 A more realistic model
    3.4.3 The state of the system
  3.5 The exogenous information process
    3.5.1 Basic notation for information processes
    3.5.2 Models of information processes
  3.6 The states of our system
    3.6.1 The three states of our system
    3.6.2 Pre- and post-decision state variables
    3.6.3 Partially observable states
  3.7 Modeling decisions
    3.7.1 Decisions, actions, and controls
    3.7.2 The nomadic trucker revisited
    3.7.3 Decision epochs
    3.7.4 Policies
    3.7.5 Randomized policies
  3.8 Information processes, revisited
    3.8.1 Combining states and decisions
    3.8.2 Supervisory processes
  3.9 Modeling system dynamics
    3.9.1 A general model
    3.9.2 System dynamics for simple assets
    3.9.3 System dynamics for complex assets
  3.10 The contribution function
  3.11 The objective function
  3.12 Models for a single, discrete asset
    3.12.1 A single asset formulation
    3.12.2 A multiple asset formulation for a single asset
  3.13 A measure-theoretic view of information**
  3.14 Bibliographic notes

4 Introduction to Markov decision processes
  4.1 The optimality equations
  4.2 The optimality equations using the post-decision state variable
  4.3 Finite horizon problems
    4.3.1 The optimality equations
    4.3.2 Backward dynamic programming
  4.4 Infinite horizon problems
    4.4.1 Value iteration
    4.4.2 Policy iteration
    4.4.3 Hybrid value-policy iteration
    4.4.4 The linear programming formulation
  4.5 Why does it work?**
    4.5.1 The optimality equations
    4.5.2 Proofs for value iteration
    4.5.3 Optimality of Markovian policies
    4.5.4 Optimality of deterministic policies
  4.6 Bibliographic notes

5 Introduction to approximate dynamic programming
  5.1 The three curses of dimensionality (revisited)
  5.2 Monte Carlo sampling and forward dynamic programming
  5.3 Using the post-decision optimality equations
  5.4 Low-dimensional representations of value functions
    5.4.1 Aggregation
    5.4.2 Continuous value function approximations
    5.4.3 Algorithmic issues
  5.5 Complex resource allocation problems
  5.6 Experimental issues
    5.6.1 The initialization problem
    5.6.2 Sampling strategies
    5.6.3 Exploration vs. exploitation
    5.6.4 Evaluating policies
  5.7 Dynamic programming with missing or incomplete models
  5.8 Relationship to reinforcement learning
  5.9 But does it work?
  5.10 Bibliographic notes

6 Stochastic approximation methods
  6.1 A stochastic gradient algorithm
  6.2 Sampling random variables
  6.3 Some stepsize recipes
    6.3.1 Properties for convergence
    6.3.2 Deterministic stepsizes
    6.3.3 Stochastic stepsizes
    6.3.4 A note on counting visits
  6.4 Computing bias and variance
  6.5 Optimal stepsizes
    6.5.1 Optimal stepsizes for stationary data
    6.5.2 Optimal stepsizes for nonstationary data - I
    6.5.3 Optimal stepsizes for nonstationary data - II
  6.6 Some experimental comparisons of stepsize formulas
  6.7 Convergence
  6.8 Why does it work?**
    6.8.1 Some probabilistic preliminaries
    6.8.2 An older proof
    6.8.3 A more modern proof
    6.8.4 Proof of theorem 6.5.1
  6.9 Bibliographic notes
    6.9.1 Stochastic approximation literature
    6.9.2 Stepsizes

7 Discrete, finite horizon problems
  7.1 Applications
  7.2 Sample models
    7.2.1 The shortest path problem
    7.2.2 Getting through college
    7.2.3 The taxi problem
    7.2.4 Selling an asset
  7.3 Strategies for finite horizon problems
    7.3.1 Value iteration using a post-decision state variable
    7.3.2 Value iteration using a pre-decision state variable
    7.3.3 Q-learning
  7.4 Temporal difference learning
    7.4.1 The basic idea
    7.4.2 Variations
  7.5 Monte Carlo value and policy iteration
  7.6 Policy iteration
  7.7 State sampling strategies
    7.7.1 Sampling all states
    7.7.2 Tree search
    7.7.3 Rollout heuristics
  7.8 A taxonomy of approximate dynamic programming strategies
  7.9 But does it work?
    7.9.1 Convergence of temporal difference learning
    7.9.2 Convergence of Q-learning
  7.10 Why does it work?**
  7.11 Bibliographic notes

8 Infinite horizon problems
  8.1 Approximate dynamic programming for infinite horizon problems
  8.2 Algorithmic strategies for discrete value functions
  8.3 Value iteration
  8.4 Approximate policy iteration
  8.5 TD learning with discrete value functions
  8.6 Q-learning
  8.7 Why does it work?**
  8.8 Bibliographic notes

9 Value function approximations
  9.1 Simple aggregation
  9.2 The case of biased estimates
  9.3 Multiple levels of aggregation
    9.3.1 Combining multiple statistics
    9.3.2 The problem of correlated statistics
    9.3.3 A special case: two levels of aggregation
    9.3.4 Experimenting with hierarchical aggregation
  9.4 General regression models
    9.4.1 Pricing an American option
    9.4.2 Playing "lose tic-tac-toe"
  9.5 Recursive methods for regression models
    9.5.1 Parameter estimation using a stochastic gradient algorithm
    9.5.2 Recursive formulas for statistical estimation
    9.5.3 Recursive time-series estimation
    9.5.4 Estimation using multiple observations
  9.6 Why does it work?*
    9.6.1 Proof of Proposition 1
    9.6.2 Proof of Proposition 2
    9.6.3 Derivation of the recursive estimation equations
    9.6.4 The Sherman-Morrison updating formula
  9.7 Bibliographic notes

10 The exploration vs. exploitation problem
  10.1 A learning exercise: the nomadic trucker
  10.2 Learning strategies
    10.2.1 Pure exploration
    10.2.2 Pure exploitation
    10.2.3 Mixed exploration and exploitation
    10.2.4 Boltzman exploration
    10.2.5 Remarks
  10.3 A simple information acquisition problem
  10.4 Gittins indices and the information acquisition problem
    10.4.1 Foundations
    10.4.2 Basic theory of Gittins indices
    10.4.3 Gittins indices for normally distributed rewards
    10.4.4 Gittins exploration
  10.5 Why does it work?**
  10.6 Bibliographic notes

11 Value function approximations for resource allocation
  11.1 Value functions versus gradients
  11.2 Linear approximations
  11.3 Monotone function approximations
  11.4 The SHAPE algorithm for continuously differentiable problems
  11.5 Regression methods
  11.6 Why does it work?**
    11.6.1 The projection operation
    11.6.2 Proof of convergence of the learning version of the SPAR algorithm
  11.7 Bibliographic notes

12 The asset acquisition problem
  12.1 The single-period problem
    12.1.1 Properties and optimality conditions
    12.1.2 A stochastic gradient algorithm
    12.1.3 Nonlinear approximations for continuous problems
    12.1.4 Piecewise linear approximations
  12.2 The multiperiod asset acquisition problem
    12.2.1 The model
    12.2.2 Computing gradients with a forward pass
    12.2.3 Computing gradients with a backward pass
  12.3 Lagged information processes
    12.3.1 Modeling lagged information processes
    12.3.2 Algorithms and approximations for continuously differentiable problems
    12.3.3 Algorithms and approximations for nondifferentiable problems
  12.4 Why does it work?**
    12.4.1 Proof of convergence of the optimizing version of the SPAR algorithm
  12.5 Bibliographic references

13 Batch replenishment processes
  13.1 A positive accumulation problem
    13.1.1 The model
    13.1.2 Properties of the value function
    13.1.3 Approximating the value function
    13.1.4 Solving a multiclass problem using linear approximations
  13.2 Monotone policies
    13.2.1 Submodularity and other stories
    13.2.2 From submodularity to monotonicity
  13.3 Why does it work?**
    13.3.1 Optimality of monotone policies
  13.4 Bibliographic notes

14 Two-stage stochastic programming
  14.1 Two-stage stochastic programs with recourse
  14.2 Stochastic projection algorithms for constrained optimization
  14.3 Proximal point algorithms
  14.4 The SHAPE algorithm for differentiable functions
  14.5 Separable, piecewise-linear approximations for nondifferentiable problems
  14.6 Benders decomposition
    14.6.1 The basic idea
    14.6.2 Variations
    14.6.3 Experimental comparisons
  14.7 Why does it work?**
    14.7.1 Proof of the SHAPE algorithm
  14.8 Bibliographic notes

15 General asset management problems
  15.1 A basic model
  15.2 Sample applications
  15.3 A myopic policy
  15.4 An approximate dynamic programming strategy
    15.4.1 A linear approximation
    15.4.2 A separable, piecewise linear approximation
    15.4.3 Network structure, multicommodity problems and the Markov property
  15.5 Some numerical experiments
    15.5.1 Experiments for single commodity flow problems
    15.5.2 Experiments for multicommodity flow problems
  15.6 Bibliographic notes
Chapter 1
The challenges of dynamic programming

The optimization of problems over time arises in many settings, ranging from the control of heating systems to managing entire economies. In between are examples including landing aircraft, purchasing new equipment, managing fleets of vehicles, selling assets, investing money in portfolios or just playing a game of tic-tac-toe or backgammon. As different fields encountered these problems, they tended to discover that certain fundamental equations described their behavior. Known generally as the "optimality equations," they have been rediscovered by different fields under names like dynamic programming and optimal control.

This book focuses on a broad range of topics that arise in operations research. Most of these can be categorized as some form of asset management, with the understanding that this term refers to both physical and financial assets. We make an important distinction between problems that involve a single asset and those that involve multiple assets or asset classes. Problems involving a single asset range from selling a bond to routing a single aircraft, but also include playing a game of tic-tac-toe or planning an academic schedule to maximize the chances of graduating from college. In principle, single-asset problems could also include the problem of landing an aircraft or controlling a robot, but we avoid these examples primarily because of their emphasis on low-dimensional controls of continuously differentiable functions.

Although single-asset problems represent an important class of applications, the book continuously builds toward problems that involve the management of multiple assets. Examples include allocating resources between competing projects or activities, managing fleets of containers in international commerce, scheduling pilots over a set of flights, assigning specialists to different tasks over time, upgrading technologies (information technologies, energy generating technologies) over time, and acquiring assets of different types (capital to run a company, jets for a charter airline, oil for national energy needs) to meet demands as they evolve over time. All of these problems can be formulated as dynamic programs, which represent a mathematical framework for modeling problems where information and decisions evolve over time.
Dynamic programming is a fairly mature branch of applied mathematics, but it has struggled with the transition from theory to computation. Most of the textbooks on dynamic programming focus on problems where all the quantities are discrete. A variety of algorithms exist for these problems, but they typically suffer from what is commonly referred to as the "curse of dimensionality," which we illustrate in the next section.

Dynamic programming has its roots in several fields. Engineering and economics tend to focus on problems with continuous states and decisions (these communities refer to decisions as controls), while the fields of operations research and artificial intelligence work primarily with discrete states and decisions (or actions). Problems that are modeled with continuous states and decisions (and typically in continuous time) are usually addressed under the umbrella of "control theory," whereas problems with discrete states and decisions, modelled in discrete time, are studied at length under the umbrella of "Markov decision processes." Both of these subfields set up recursive equations that depend on the use of a state variable to capture history in a compact way. The study of asset management problems (or more broadly, "resource allocation") is dominated by the field of mathematical programming (or stochastic programming when we wish to explicitly capture uncertainty), which has evolved without depending on the construct of a state variable. Our treatment draws heavily from all three fields.
1.1 A dynamic programming example: a shortest path problem
Perhaps one of the best known applications of dynamic programming is that faced by a typical driver choosing a path in a transportation network. For simplicity (and this is a real simplification for this application), we assume that the driver has to decide at each node (or intersection) which link to traverse next (we are not going to get into the challenges of left turns versus right turns). Let I be the set of intersections. If the driver is at intersection i, he can go to a subset of intersections J_i^+ at a cost c_ij. He starts at the origin node s ∈ I and has to find his way to the destination node d ∈ I at the least cost.

The problem can be easily solved using dynamic programming. Let

    v_i = the cost to get from intersection i ∈ I to the destination node d.

We assume that v_d = 0. Initially, we do not know v_i, and so we start by setting v_i = M, where "M" is known as "big M" and represents a large number. Let J_j^- be the set of intersections i such that there is a link from i to j. We can solve the problem by iteratively computing

    v_i ← min{v_i, c_ij + v_j}    for all j ∈ I.    (1.1)
Equation (1.1) has to be solved iteratively. We stop when none of the values v_i change. It should be noted that this is not a very efficient way of solving a shortest path problem.
For example, in the early iterations, it may well be the case for a particular intersection j that v_j = M. In this case, there is no point in executing the update. In efficient implementations, instead of looping over all j ∈ I, we maintain a list of intersections j that we have already reached out to (and in particular those where we just found a better path). The algorithm is guaranteed to stop in a finite number of iterations.

Shortest path problems arise in a variety of settings that have nothing to do with transportation or networks. Consider, for example, the challenge faced by a college freshman trying to plan her schedule to graduation. By graduation, she must take 32 courses overall, including eight departmentals, two math courses, one science course, and two language courses. We can describe the state of her academic program in terms of how many courses she has taken under each of these five categories. Let S_tc be the number of courses she has taken by the end of semester t in category c ∈ {Total courses, Departmentals, Math, Science, Language}, and let S_t = (S_tc)_c be the state vector. Based on this state, she has to decide which courses to take in the next semester. To graduate, she has to reach the state S_8 = (32, 8, 2, 1, 2). We assume that she has a measurable desirability for each course she takes, and that she would like to maximize the total desirability of all her courses.

The problem can be viewed as a shortest path problem from the state S_0 = (0, 0, 0, 0, 0) to S_8 = (32, 8, 2, 1, 2). Let S_t be her current state at the beginning of semester t and let x_t represent the decisions she makes while determining what courses to take. We then assume we have access to a transition function f(S_t, x_t) which tells us that if she is in state S_t and makes decision x_t, she will land in state S_{t+1}, which we represent by simply using

    S_{t+1} = f(S_t, x_t).

In our transportation problem, we would have S_t = i if we are at intersection i, and x_t would be the decision to "go to j," leaving us in the state S_{t+1} = j. Finally, let C_t(S_t, x_t) be the contribution or reward she generates from being in state S_t and making the decision x_t. The value of being in state S_t is defined by the equation

    V_t(S_t) = max_{x_t} {C_t(S_t, x_t) + V_{t+1}(f(S_t, x_t))}    for all S_t ∈ 𝒮_t,    (1.2)

where 𝒮_t is the set of all possible (discrete) states that she can be in at the beginning of the year.
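As an aside, the iterative update in equation (1.1) is easy to sketch in code. The following is a minimal illustration, not taken from the book; the graph, node names and arc costs are made-up values chosen only to make the example runnable.

```python
# Minimal sketch of the iterative shortest path update in equation (1.1).
# The graph and costs below are hypothetical; infinity plays the role of "big M".

M = float("inf")

# cost[i][j] = cost of traversing the link from intersection i to j
cost = {
    "s": {"a": 4.0, "b": 2.0},
    "a": {"d": 3.0},
    "b": {"a": 1.0, "d": 6.0},
    "d": {},
}

# v[i] = current estimate of the cost to get from intersection i to the destination d
v = {i: M for i in cost}
v["d"] = 0.0

# Repeatedly apply v_i <- min{v_i, c_ij + v_j} until no value changes.
changed = True
while changed:
    changed = False
    for i, links in cost.items():
        for j, c_ij in links.items():
            if c_ij + v[j] < v[i]:
                v[i] = c_ij + v[j]
                changed = True

print(v["s"])  # cost of the best path from the origin s to the destination d
```

An efficient implementation would maintain a list of intersections whose values just changed, as noted above, rather than sweeping over every link on every pass.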
1.2 The three curses of dimensionality
All dynamic programs can be written in terms of a recursion that relates the value of being in a particular state at one point in time to the value of the states that we are carried into at the next point in time. For deterministic problems, this equation can be written

    V_t(S_t) = max_{x_t} {C_t(S_t, x_t) + V_{t+1}(S_{t+1}(S_t, x_t))}.    (1.3)
Equation (1.3) is known as Bellman's equation, or the Hamilton-Jacobi equation, or increasingly, the Hamilton-Jacobi-Bellman equation (HJB for short). Recursions of this sort are fundamental to many classes of dynamic decision-making problems. If we can solve this equation, we can solve our problem. In a simple example such as our shortest path problem, the algorithm is extremely easy, so a student might ask: "And people spend entire lifetimes on this equation???"

Bellman's equation is, in fact, very easy to solve if the state variable is something simple, like a street intersection, the amount of money in a retirement account, or the price of a stock. Mathematically, the problems can become quite rich and subtle when we introduce uncertainty. But computationally, the challenge arises when S_t (and therefore x_t) is not a scalar, but a vector. For our college student, the state space is approximately 33 × 9 × 3 × 2 × 3 = 5,346 (not all states are reachable). If the school adds an additional requirement that the student take at least seven liberal arts courses (to make sure that students in the sciences have breadth in their course selection), the state space grows to 5,346 × 8 = 42,768. This is the curse of dimensionality at work. In other words, while difficult theoretical questions abound, the real challenge in dynamic programming is computation.

When we introduce uncertainty, we have to find the value of x_t that maximizes the expected contribution

    V_t(S_t) = max_{x_t} E{C_t(S_t, x_t) + V_{t+1}(S_{t+1})}.    (1.4)
For scalar problems, equation (1.4) is also relatively easy to solve. There are, however, many real problems where the state variable, the random variable (over which we are taking an expectation) and the decision variable are all vectors. For example, the state variable S_t might be the amount of money we have in different investments, the number of aircraft of different types that a company owns or the location of containers distributed around the country. Random quantities can be the prices of different stocks, the demands for different types of products or the number of loads that need to be moved in containers between different regions. Finally, our decision vector x_t can be how much to invest in different stocks, how many aircraft of different types to purchase, or the flows of containers between locations. For these problems, the dimensions of these vectors can range from dozens to tens of thousands to millions.

If we applied our method for solving Bellman's equation to our college student in section 1.1, we would have to solve (1.2) for every possible value of S_t. This would require finding the expectation, which involves a summation (if we continue to assume that all quantities are discrete) over all the dimensions of our random quantities. Finally, to find the best value of x_t, we have to enumerate over all possible decisions. Each one of these steps involves enumerating all outcomes of a multidimensional vector. These problems can be computationally intractable if the number of dimensions is as small as 10 or 20.
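To see the state-space arithmetic concretely, the following back-of-the-envelope snippet (not code from the book) reproduces the counts quoted above for the college-planning example and hints at how quickly enumeration grows as categories are added.

```python
from math import prod

# Number of possible values tracked for each category of the college-planning state:
# total courses (0-32), departmentals (0-8), math (0-2), science (0-1), language (0-2).
levels = [33, 9, 3, 2, 3]
print(prod(levels))                      # 5346 states (not all of them reachable)

# Adding a liberal arts requirement tracked from 0 to 7 multiplies the count by 8.
levels_with_liberal_arts = levels + [8]
print(prod(levels_with_liberal_arts))    # 42768 states

# Each new dimension multiplies the state space; ten categories with ten
# possible values each would already give 10**10 states to enumerate.
print(10 ** 10)
```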
Figure 1.1: The major railroads in the United States have to manage complex assets such as boxcars, locomotives and the people who operate them.
1.3 Some real applications
Asset management problems arise in a broad range of applications. Assets can be physical objects including people, equipment such as aircraft, trucks or electric power transformers, commodities such as oil and food, and fixed facilities such as buildings and power generators.

An example of a very complex asset allocation problem arises in railroads. In North America, there are six major railroads (known as "Class I" railroads) which operate thousands of locomotives, many of which cost over $1 million. Decisions have to be made now to assign locomotives to trains, taking into account how the locomotives will be used at the destination. For example, a train may be going to a location that needs additional power. Or a locomotive may have to be routed to a maintenance facility, and the destination of a train may or may not offer good opportunities for getting the locomotive to the shop.

The balancing of immediate and future costs and rewards is common throughout applications involving freight transportation. In the military, the military airlift command has to make decisions about which aircraft to assign to move passengers or freight. These decisions have to balance what appears to be the best decision now (which aircraft is closest) and the downstream effects of a decision (the destination of a load of freight or passengers may not have good maintenance facilities for a particular aircraft type).

Figure 1.2: Airlift capacity can be a major bottleneck in military airlift operations.

Similar issues arise in the truckload motor carrier industry, where drivers are assigned to move loads that arise in a highly dynamic environment. Large companies manage fleets of thousands of drivers, and the challenge at any moment in time is to find the best driver. There is much more to the problem than simply finding the closest driver; each driver is characterized by attributes such as his or her home location and equipment type as well as his or her skill level and experience. As with the locomotives, there is a need to balance decisions that maximize profits now versus those that produce good long run behavior. A major challenge is getting drivers back home. In some segments of the industry, a driver may be away for two weeks or more. It is often necessary to look two or more moves into the future to find strategies for routing a driver toward his or her home.

Figure 1.3: Large truckload motor carriers have to manage fleets of as many as 10,000 drivers in a highly dynamic environment where shippers place orders at the last minute.

The electric power industry faces the problem of designing and purchasing expensive equipment used to run the electric power grid. It can take a year or more to build one of these components, and each must be designed to a set of specifications. However, it is not always known exactly how the component will be used in the future, as it may be necessary to use the component to respond to an equipment failure.

Figure 1.4: The electric power industry has to design, purchase and place expensive equipment that can be used when failures occur.

The auto industry also faces design decisions, but in this case the industry has to choose which cars to build and with what features. It is not possible to design the perfect car for each population segment, so the challenge is to design a car and hope that people are willing to compromise and purchase a particular design. With design cycles that often exceed three years, there is a tremendous amount of uncertainty in these decisions.

A second important asset class is money, which can take on the form of cash, money market certificates, bonds, stocks, futures, options and other financial instruments such as derivatives. Since physical objects and money are often interchangeable (money can be used to purchase a physical object; the physical object can be sold and turned into money), the financial community will talk about real assets as opposed to financial assets.

A third "asset" which is often overlooked is information. Consider the problem faced by the government which is interested in researching a new technology such as fuel cells or converting coal to hydrogen. There may be dozens of avenues to pursue, and the challenge is to determine which projects to invest in. The state of the system is the set of estimates of how well different components of the technology work. The government funds research to collect information. The result of the research may be the anticipated improvement, or the results may be disappointing. The government wants to plan a research program to maximize the likelihood that a successful technology is developed within a reasonable time frame (say, 20 years). Depending on time and budget constraints, the government may wish to fund competing technologies in the event that one does not work. Alternatively, it may be more effective to fund one promising technology, and then switch to an alternative if the first does not work out.
1.4 Problem classes in asset management
The vast array of applications in asset management can be divided into some major problem classes. We use these problem classes throughout the book to motivate various algorithmic strategies.

The budgeting problem. Here we face the problem of allocating a fixed resource over a set of activities, where each activity returns a reward that is a function of how much we invest in it. For example, drug companies have to decide how much to invest in different research projects or how much to spend on advertising for different drugs. Oil exploration companies have to decide how much to spend exploring potential sources of oil. Political candidates have to decide how much time to spend campaigning in different states.

Asset acquisition with concave costs. A company can raise capital by issuing stock or floating a bond. There are costs associated with these financial instruments independent of how much money is being raised. Similarly, an oil company purchasing oil will be given quantity discounts (or it may face the fixed cost of purchasing a tankerload of oil). Retail outlets get a discount if they purchase a truckload of an item. All of these are instances of acquiring assets with a concave (or more generally, non-convex) cost function, which means there is an incentive for purchasing larger quantities.

Asset acquisition with lagged information processes. We can purchase commodity futures that allow us to purchase a product in the future at a lower cost. Alternatively, we may place an order for memory chips from a factory in southeast Asia with one to two week delivery times. A transportation company has to provide containers for a shipper who may make requests several days in advance or at the last minute. All of these are asset acquisition problems with lagged information processes.

Buying/selling an asset. In this problem class, the process stops when we either buy an asset when it looks sufficiently attractive or sell an asset when market conditions warrant. The game ends when the transaction is made. For these problems, we tend to focus on the price (the purchase price or the sales price), and our success depends on our ability to trade off current value with future price expectations.

General asset allocation problems. This class encompasses the problem of managing reusable and substitutable assets over time. Applications abound in transportation and logistics. Railroads have to move locomotives and boxcars to serve different activities (moving trains, moving freight) over time. An airline has to move aircraft and pilots in order to move passengers. Consumer goods have to move through warehouses to retailers to satisfy customer demands.

Demand management. There are many applications where we focus on managing the demands being placed on a process. Should a hospital admit a patient? Should a trucking company accept a request by a customer to move a load of freight?

Shortest paths. In this problem class, we typically focus on managing a single, discrete resource. The resource may be someone playing a game, a truck driver we are trying to route to return him home, a driver who is trying to find the best path to his destination or a locomotive we are trying to route to its maintenance shop. Shortest path problems, however, also represent a general mathematical structure that applies to a broad range of dynamic programs that have nothing to do with routing a physical asset to a destination.

Dynamic assignment. Consider the problem of managing multiple resources, such as computer programmers, to perform different tasks over time (writing code or fixing bugs). Each resource and task is characterized by a set of attributes that determines the cost (or contribution) from assigning a particular resource to a particular task.
All of these problems focus on the problem of managing physical or financial assets. They provide an idea of the diversity of applications that can be studied. In each case, we have focused on the question of how to manage the asset. In addition, there are three other classes of questions that arise for each application:

Pricing. Often the question being asked is what price should be paid for an asset. The right price for an asset depends on how it is managed, so it should not be surprising that we often find asset prices as a byproduct.

Information collection. Since we are modeling sequential information and decision processes, we explicitly capture the information that is available when we make a decision, allowing us to undertake studies that change the information process. For example, the military uses unmanned aerial vehicles (UAVs) to collect information about targets in a military setting. Oil companies drill holes to collect information about underground geologic formations. Travelers try different routes to collect information about travel times. Pharmaceutical companies use test markets to experiment with different pricing and advertising strategies. In addition, the algorithmic strategies that we pursue under the umbrella of approximate dynamic programming all involve the need to explore different regions of the state space to estimate the value of being in these states. These strategies require that we understand the tradeoff between the cost (time) required to visit different states and the benefits derived from improving the precision with which we can estimate the value of being in a state.

Technology switching. The last class of questions addresses the underlying technology that controls how the physical process evolves over time. For example, when should a power company upgrade a generating plant (e.g., to burn oil and natural gas)? Should an airline switch to aircraft that fly faster or more efficiently? How much should a communications company invest in a technology given the likelihood that better technology will be available in a few years?

Most of these problems arise in both discrete and continuous versions. Continuous models would be used for money, physical products such as oil, grain and coal, or discrete products that occur in large volume (most consumer products). In other settings, it is important to retain the integrality of the assets being managed (people, aircraft, locomotives, trucks, and expensive items that come in small quantities). For example, how do we position emergency response units around the country to respond to emergencies (bioterrorism, major oil spills, failure of certain components in the electric power grid)?

What makes these problems hard? With enough assumptions, none of these problems are inherently difficult. But in real applications, a variety of issues emerge that can make all of them intractable. These include:

• Evolving information processes - We have to make decisions now before we know the information that will arrive later. This is the essence of stochastic models, and this property quickly turns the easiest problems into computational nightmares.

• High dimensional problems - Most problems are easy if they are small enough. In real applications, there can be many types of assets, producing decision vectors of tremendous size.

• Measurement problems - Normally, we assume that we look at the state of our system and from this determine what decision to make. In many problems, we cannot measure the state of our system precisely. The problem may be delayed information (stock prices), incorrectly reported information (the truck is in the wrong location), misreporting (a manager does not properly add up his total sales), theft (retail inventory), or deception (an equipment manager underreports his equipment so it will not be taken from him).

• Unknown models (information, system dynamics) - We can anticipate the future only by being able to say something about what might happen (even if it is with uncertainty) or the effect of a decision (which requires a model of how the system evolves over time).

• Missing information - There may be costs that simply cannot be computed, and are instead ignored. The result is a consistent model bias (although we do not know when it arises).

• Comparing solutions - Primarily as a result of uncertainty, it can be difficult to compare two solutions to determine which is better. Should we be better on average, or are we interested in the best worst-case solution? Do we have enough information to draw a firm conclusion?
1.5 What is new in this book?
As of this writing, dynamic programming has enjoyed a relatively long history, with many superb books. Within the operations research community, the original text by Bellman (Bellman (1957)) was followed by a sequence of books focusing on the theme of Markov decision processes. Of these, the current high-water mark is Markov Decision Processes by Puterman, which played an influential role in the writing of chapter 4. This field offers a powerful theoretical foundation, but the algorithms are limited to problems with very low dimensional state and action spaces.
This volume focuses on a field that is coming to be known as approximate dynamic programming, which emphasizes modeling and computation for much harder classes of problems. The problems may be hard because they are large (for example, large state spaces), or because we lack a model of the underlying process which the field of Markov decision processes takes for granted. Two major references precede this volume. Neuro-Dynamic Programming by Bertsekas and Tsitsiklis was the first book to appear that summarized a vast array of strategies for approximating value functions for dynamic programming. Reinforcement Learning by Sutton and Barto presents the strategies of approximate dynamic programming in a very readable format, with an emphasis on the types of applications that are popular in the computer science/artificial intelligence community.

This volume focuses on models of problems that can be broadly described as "asset management," where we cover both physical and financial assets. Many of these applications involve very high dimensional decision vectors (referred to as controls or actions in other communities) which can only be solved using the techniques from the field of mathematical programming. As a result, we have adopted a notational style that makes the relationship to the field of math programming quite transparent. A major goal of this volume is to lay the foundation, starting as early as chapter 3, for solving these very large and complex problems.

There are several major differences between this volume and these major works which precede it.

• We focus much more heavily on the modeling of problems. Emphasis is placed throughout on the proper representation of exogenous information processes and system dynamics. Partly for this reason, we present finite-horizon models first since they require more careful modeling of time than is needed for steady state models.

• Examples are drawn primarily from the classical problems of asset management that arise in operations research. We make a critical distinction between single asset problems (when to sell an asset, how to fly a plane from one location to another) and problems with multiple asset classes (how to manage a fleet of aircraft, purchasing different types of equipment, managing money in different forms of investments) by introducing specific notation for each.

• We bring together the power of approximate dynamic programming, which itself represents a merger of Markov decision processes and stochastic approximation methods, with stochastic programming and math programming. The result is a new class of algorithms for solving (approximately) complex resource allocation problems which exhibit state and action (decision) vectors with thousands or even tens of thousands of dimensions. The notation is chosen to facilitate the link between dynamic programming and math programming.

• We explicitly identify the three curses of dimensionality that arise in asset management problems, and introduce an approximation strategy based on using the post-decision state variable, which has received limited attention in the literature (and apparently no attention in other textbooks).
• The theoretical foundations of this material can be deep and rich, but our presentation is aimed at undergraduate or masters-level students with introductory courses in statistics, probability and, for the later chapters, linear programming. For more advanced students, proofs are provided in "Why does it work" sections. The presentation is aimed primarily at students interested in taking real, complex problems, producing proper mathematical models and developing computationally tractable algorithms.

Our presentation integrates the contributions of several communities. Much of the foundational theory was contributed by the probability community in the study of Markov decision processes and, in a separate subcommunity, stochastic approximation methods. We also recognize the many contributions that have emerged from the control theory community, generally in the context of classical engineering problems. Finally, we integrate important contributions from stochastic programming for solving high dimensional decision problems under uncertainty. We think that this volume, by bringing these different fields together, will contribute to the thinking in all fields.
1.6 The many dialects of dynamic programming
Dynamic programming arises from the study of sequential decision processes. Not surprisingly, these arise in a wide range of applications. While we do not wish to take anything from Bellman's fundamental contribution, the optimality equations are, to be quite honest, somewhat obvious. As a result, they were discovered independently by the different communities in which these problems arise.

These problems arise in a variety of engineering settings, typically in continuous time with continuous control parameters. These applications gave rise to what is now referred to as control theory. While uncertainty is a major issue in these problems, the formulations tend to focus on deterministic problems (the uncertainty is typically in the estimation of the state or the parameters that govern the system). Economists adopted control theory for a variety of problems involving the control of activities, from allocating single budgets to managing entire economies (admittedly at a very simplistic level). Operations research (through Bellman's work) did the most to advance the theory of controlling stochastic problems, thereby producing the very rich theory of Markov decision processes. Computer scientists, especially those working in the realm of artificial intelligence, found that dynamic programming was a useful framework for approaching certain classes of machine learning problems known as reinforcement learning.

It is not simply the case that different communities discovered the fundamentals of dynamic programming independently. They also discovered the computational challenges that arise in their solution (the "curse of dimensionality"). Not surprisingly, different communities also independently discovered classes of solution algorithms. As different communities discovered the same concepts and algorithms, they invented their own vocabularies to go with them.
As a result, we can solve the Bellman equations, the Hamiltonian, the Jacobian, the Hamilton-Jacobi equations, or the all-purpose Hamilton-Jacobi-Bellman equations (typically referred to as the HJB equations). In our presentation, we prefer the term "optimality equations."

There is an even richer vocabulary for the types of algorithms that are the focal point of this book. Everyone has discovered that the backward recursions required to solve the optimality equations in section 2.1.1 do not work as the number of dimensions increases. A variety of authors have independently discovered that an alternative is to step forward through time, using iterative algorithms to help estimate the value function. This general strategy has been referred to as forward dynamic programming, iterative dynamic programming, adaptive dynamic programming, and neuro-dynamic programming. However, the term that appears to have been most widely adopted is approximate dynamic programming. In some cases, the authors would discover the algorithms and only later discover their relationship to classical dynamic programming.

The use of iterative algorithms that are the basis of most approximate dynamic programming procedures also has its roots in a field known as stochastic approximation methods. Again, authors tended to discover the technique and only later learn of its relationship to the field of stochastic approximation methods. Unfortunately, this relationship was sometimes discovered only after certain terms became well established.

Throughout the presentation, students need to appreciate that many of the techniques in the fields of approximate dynamic programming and stochastic approximation methods are fundamentally quite simple, and often obvious. The proofs of convergence and some of the algorithmic strategies can become quite difficult, but the basic strategies often represent what someone would do with no training in the field. As a result, the techniques frequently have a very natural feel to them, and the algorithmic challenges we face often parallel problems we encounter in everyday life.

As of this writing, the relationships between control theory (engineering and economics), Markov decision processes (operations research), and reinforcement learning (computer science/artificial intelligence) are well understood by the research community. The relationship between iterative techniques (reviewed in chapter 5) and the field of stochastic approximations is also well established. There is, however, a separate community that evolved from the field of deterministic math programming, which focuses on very high dimensional problems. As early as the 1950s, this community was trying to introduce uncertainty into mathematical programs. The resulting subcommunity is called stochastic programming, which uses a vocabulary that is quite distinct from that of dynamic programming. The relationship between dynamic programming and stochastic programming has not been widely recognized, despite the fact that Markov decision processes are considered standard topics in graduate programs in operations research.

Our treatment will try to bring out the different dialects of dynamic programming, although we will tend toward a particular default vocabulary for important concepts.
important concepts using a variety of dialects. The challenge is realizing when authors are using different words to say the same thing.
1.7 Bibliographic notes
Basic references: Bellman (1957), Howard (1971), Derman (1970), Dynkin (1979), Ross (1983), Heyman & Sobel (1984), Puterman (1994), Bertsekas & Tsitsiklis (1996). Sequential allocation: Taylor (1967).
Chapter 2

Some illustrative applications

Dynamic programming is one of those incredibly rich fields that has filled the careers of many. But it is also a deceptively easy idea to illustrate and use. This chapter presents a series of dynamic programming problems that lend themselves to simple (often analytical) solutions. The goal is to teach dynamic programming by example. It is possible, after reading this chapter, to conclude that “dynamic programming is easy” and to wonder “why do I need the rest of this book?” The answer is: sometimes dynamic programming is easy and requires little more than the understanding gleaned from these simple problems. But there is a vast array of problems which are quite difficult to model, and where standard solution approaches are computationally intractable. We divide our presentation between deterministic and stochastic problems. The careful reader will pick up on subtle modeling differences between these problems. If you do not notice these, chapter 3 brings these out more explicitly.
2.1 Deterministic problems

2.1.1 The budgeting problem
A problem with a structure similar to that of the gambling problem (section 2.2.1) is the budgeting problem. Here, we have to allocate a budget of size R to a series of tasks T. Let x_t be the amount of money allocated to task t, and let C_t(x_t) be the contribution (or reward) that we receive from this allocation. We would like to maximize our total contribution:

    max_x Σ_{t∈T} C_t(x_t)    (2.1)
subject to the constraint on our available resources:

    Σ_{t∈T} x_t = R    (2.2)
In addition, we cannot allocate negative resources to any task, so we include:

    x_t ≥ 0    (2.3)
We refer to (2.1)-(2.3) as the budgeting problem (other authors refer to it as the “resource allocation problem,” a term we find too general for such a simple problem). In this example, all data is deterministic. There are a number of algorithmic strategies for solving this problem that depend on the structure of the contribution function, but we are going to show how it can be solved without any assumptions.

We will approach this problem by first deciding how much to allocate to task 1, then to task 2, and so on until the last task, T. In the end, however, we want a solution that optimizes over all tasks. Let:

    V_t(R_t) = the value of having R_t resources remaining before we solve the problem of allocating for task t.

Implicit in our definition of V_t(R_t) is that we are going to solve the problem of allocating R_t over tasks t, t+1, ..., T in an optimal way. Imagine that we somehow know the function V_{t+1}(R_{t+1}), where R_{t+1} = R_t − x_t. Then it seems apparent that the right solution for task t is to solve:

    max_{0 ≤ x_t ≤ R_t} C_t(x_t) + V_{t+1}(R_t − x_t)    (2.4)
Equation (2.4) forces us to balance the contribution that we receive from task t against what we would receive from all the remaining tasks (which is captured in V_{t+1}(R_t − x_t)). One way to solve (2.4) is to assume that x_t is discrete. For example, if our budget is R = $10 million, we might require x_t to be in units of $100,000. In this case, we would solve (2.4) simply by searching over all possible values of x_t (since it is a scalar, this is not too hard). The problem is that we do not know what V_{t+1}(R_{t+1}) is.

The simplest strategy for solving our dynamic program in (2.4) is to start by using V_{T+1}(R) = 0 (for any value of R). Then we would solve:

    V_T(R_T) = max_{0 ≤ x_T ≤ R_T} C_T(x_T)    (2.5)
for 0 ≤ R_T ≤ R. Now we know V_T(R_T) for any value of R_T that might actually happen. Next we can solve:

    V_{T−1}(R_{T−1}) = max_{0 ≤ x_{T−1} ≤ R_{T−1}} C_{T−1}(x_{T−1}) + V_T(R_{T−1} − x_{T−1})    (2.6)
Clearly, we can play this game recursively, solving:

    V_t(R_t) = max_{0 ≤ x_t ≤ R_t} C_t(x_t) + V_{t+1}(R_t − x_t)    (2.7)
for t = T − 1, T − 2, . . . , 1. Once we have computed Vt for t ∈ T , we can then start at t = 1 and step forward in time to determine our optimal allocations. This strategy is simple, easy and optimal. It has the nice property that we do not need to make any assumptions about the shape of Ct (xt ), other than finiteness. We do not need concavity or even continuity; we just need the function to be defined for the discrete values of xt that we are examining.
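To make the recursion concrete, the sketch below (our own illustration in Python, not code from the text) computes V_t(R_t) for an integer-valued budget by stepping backward from t = T and then steps forward to recover the allocations. The contribution functions and the budget in the example are invented for illustration.

```python
def solve_budgeting(contributions, R):
    """Backward recursion (2.7) for the discrete budgeting problem.
    contributions[t](x) gives C_t(x); R is the total integer budget."""
    T = len(contributions)
    # V[t][r] = best value from tasks t, t+1, ..., T-1 with r units of budget left.
    V = [[0.0] * (R + 1) for _ in range(T + 1)]   # V[T][r] = 0 plays the role of V_{T+1}
    best_x = [[0] * (R + 1) for _ in range(T)]
    for t in range(T - 1, -1, -1):                # t = T-1, ..., 0
        for r in range(R + 1):
            values = [contributions[t](x) + V[t + 1][r - x] for x in range(r + 1)]
            best = max(range(r + 1), key=lambda x: values[x])
            V[t][r], best_x[t][r] = values[best], best
    # Step forward in time to recover the optimal allocations.
    plan, r = [], R
    for t in range(T):
        x = best_x[t][r]
        plan.append(x)
        r -= x
    return V[0][R], plan

if __name__ == "__main__":
    # Three illustrative contribution functions (no concavity or continuity needed).
    C = [lambda x: 10 * x ** 0.5, lambda x: 3 * x, lambda x: 0 if x < 4 else 25]
    print(solve_budgeting(C, R=10))
```

Because the value functions are stored exhaustively over every budget level, the work grows with the number of tasks times the square of the number of budget levels, which is exactly the kind of enumeration that stops working once R_t becomes a vector.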
2.1.2 The shortest path problem
Perhaps one of the most popular dynamic programming problems is known as the shortest path problem. Although it has a vast array of applications, it is easiest to describe in terms of the problem faced by every driver when finding a path from one location to the next over a road network. Let:

    I = the set of nodes (intersections) in the network.
    L = the set of links (i, j) in the network.
    c_ij = the cost (typically the time) to drive from node i to node j, i, j ∈ I, (i, j) ∈ L.
    N_i = the set of nodes j for which there is a link (i, j) ∈ L.

We assume that a traveler at node i can choose to traverse any link (i, j) where j ∈ N_i. Assume our traveler is starting at some node r and needs to get to a destination node s at the least cost. Let:

    v_j = the minimum cost required to get from node j to node s.

Initially, we do not know v_j, but we do know that v_s = 0. Let v_j^n be our estimate, at iteration n, of the cost to get from j to s. We can find the optimal costs, v_j, by initially setting v_j^0 to a large number for j ≠ s, and then iteratively looping over all the intersections, finding the best link to traverse out of an intersection i by minimizing the sum of the outbound link cost c_ij plus our current estimate of the downstream value v_j^{n-1}. The complete algorithm is summarized in figure 2.1. This algorithm has been proven to converge to the optimal set of node values.

    Step 0. Let v_j^0 = M if j ≠ s, and v_j^0 = 0 if j = s, where “M” is known as “big-M” and represents a large number. Let n = 1.
    Step 1. Solve for all i ∈ I:
        v_i^n = min{ v_i^{n-1}, min_{j ∈ N_i} (c_ij + v_j^{n-1}) }    (2.8)
    Step 2. If v_i^n < v_i^{n-1} for any i, let n := n + 1 and return to step 1. Else stop.

    Figure 2.1: Basic shortest path algorithm

    Step 0. Let v_j^0 = M if j ≠ r, and v_j^0 = 0 if j = r, where “M” is known as “big-M” and represents a large number. Let n = 1. Set the candidate list C = {r}.
    Step 1. Choose node i ∈ C from the top of the candidate list.
    Step 2. For all nodes j ∈ N_i do:
        Step 2a. ṽ_j^n = c_ij + v_i^{n-1}    (2.9)
        Step 2b. If ṽ_j^n < v_j^{n-1}, set v_j^n = ṽ_j^n; if j ∉ C, add j to the candidate list: C = C ∪ {j} (j is assumed to be put at the bottom of the list).
    Step 3. Drop node i from the candidate list. If the candidate list C is not empty, return to step 1.

    Figure 2.2: An origin-based shortest path algorithm (here v_j is the cost of the best path found so far from the origin r to node j)

There is a substantial literature on solving shortest path problems. Because they arise in so many applications, there is tremendous value in solving them very quickly. Our basic algorithm is not very efficient because we are often solving equation (2.8) for an intersection i where v_i^{n-1} = M, and where v_j^{n-1} = M for all j ∈ N_i. A more standard strategy is to maintain a candidate list of nodes C which consists of an ordered list i_1, i_2, .... Initially the list will consist only of the origin node r. As we reach out of a node i in the candidate list, we may find a better path to some node j, which is then added to the candidate list (if it is not already there). This is often referred to as Bellman’s algorithm, although the algorithm in figure 2.1 is a purer form of Bellman’s equation for dynamic programming. A very effective variation of the algorithm in figure 2.2 is to keep track of nodes which have already been in the candidate list. If a node that was previously in the candidate list is added to it again, a very effective strategy is to add this node to the top of the list. This variant is known as Pape’s algorithm (pronounced “papa’s algorithm”). Another powerful variation, called Dijkstra’s algorithm (pronounced “Diekstra”), chooses the node from the candidate list with
the smallest value of v_i^n.

Almost any (deterministic) discrete dynamic program can be viewed as a shortest path problem. We can view each node i as representing a particular discrete state of the system. The origin node r is our starting state, and the ending state s might be any state at an ending time T. We can also have shortest path problems defined over infinite horizons, although we would typically include a discount factor.

We are often interested in problems where there is some source of uncertainty. For our shortest path problem, it is natural to view the cost on a link as random, reflecting the variability in the travel time over each link. There are two ways we can handle the uncertainty. The simplest is to assume that our driver has to make a decision before seeing the travel time over the link. In this case, our updating equation would look like:

    v_i^n = min{ v_i^{n-1}, min_{j ∈ N_i} E{ c_ij(W) + v_j^{n-1} } }
where W is some random variable describing the network. This problem is identical to our original problem; all we have to do is to let c_ij = E{c_ij(W)} be the expected cost on an arc. Alternatively, we can assume that the driver sees the realization of the travel times before choosing a link, in which case the updating equation becomes:

    v_i^n = min{ v_i^{n-1}, E{ min_{j ∈ N_i} ( c_ij(W) + v_j^{n-1} ) } }
Here, the expectation is outside of the min operator which chooses the best decision, capturing the fact that now the decision itself is random. If we go outside of our transportation example, there are many settings where the decision does not take us deterministically to a particular node j.
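Before turning to stochastic problems, here is a minimal sketch of the candidate-list procedure of figure 2.2, written by us in Python; the small test network, node names, and costs are invented for illustration.

```python
from collections import deque

def shortest_path(links, origin, destination, M=float("inf")):
    """Origin-based label-correcting algorithm in the spirit of figure 2.2.
    links[i] is a dict {j: c_ij} of outbound links from node i."""
    v = {i: M for i in links}      # v[j] = cost of best path found so far from origin to j
    v[origin] = 0.0
    candidates = deque([origin])   # the candidate list C
    while candidates:
        i = candidates.popleft()   # take the node at the top of the list
        for j, c_ij in links[i].items():
            v_tilde = v[i] + c_ij              # step 2a
            if v_tilde < v.get(j, M):          # step 2b: found a better path to j
                v[j] = v_tilde
                if j not in candidates:
                    candidates.append(j)       # add j to the bottom of the list
    return v.get(destination, M)

if __name__ == "__main__":
    # A made-up network: costs to drive between intersections r, a, b, s.
    network = {"r": {"a": 4, "b": 2}, "a": {"s": 3}, "b": {"a": 1, "s": 7}, "s": {}}
    print(shortest_path(network, "r", "s"))    # expect 6 (r -> b -> a -> s)
```

Switching to Pape's rule (put previously visited nodes at the front of the list) or Dijkstra's rule (pop the node with the smallest label) only changes how the candidate list is managed.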
2.2 Stochastic problems

2.2.1 The gambling problem
A gambler has to determine how much of his capital he should bet on each round of a game, where he will play a total of N rounds. He will win a bet with probability p and lose with probability q = 1 − p (assume q < p). Let s^n be his total capital after n plays, n = 1, 2, ..., N, with s^0 being his initial capital. For this problem, we refer to s^n as the state of the system. Let x^n be the amount he bets in round n, where we require that x^n ≤ s^{n−1}. He wants to maximize ln s^N (this provides a strong penalty for ending up with a small amount of money at the end and a declining marginal value for higher amounts). Let:

    W^n = 1 if the gambler wins the nth game, and 0 otherwise.
The system evolves according to:

    s^n = s^{n−1} + x^n W^n − x^n (1 − W^n)

Let V^n(s^n) be the value of having s^n dollars at the end of the nth game. The value of being in state s^n at the end of the nth round can be written:

    V^n(s^n) = max_{0 ≤ x^{n+1} ≤ s^n} E V^{n+1}(s^{n+1})
             = max_{0 ≤ x^{n+1} ≤ s^n} E V^{n+1}(s^n + x^{n+1} W^{n+1} − x^{n+1}(1 − W^{n+1}))

Here, we claim that the value of being in state s^n is found by choosing the decision that maximizes the expected value of being in state s^{n+1} given what we know at the end of the nth round. We solve this by starting at the end of the Nth trial, and assuming that we have finished with s^N dollars. The value of this is:

    V^N(s^N) = ln s^N

Now step back to n = N − 1, where we may write:

    V^{N−1}(s^{N−1}) = max_{0 ≤ x^N ≤ s^{N−1}} E V^N(s^{N−1} + x^N W^N − x^N(1 − W^N))
                     = max_{0 ≤ x^N ≤ s^{N−1}} [ p ln(s^{N−1} + x^N) + (1 − p) ln(s^{N−1} − x^N) ]    (2.10)

Let V^{N−1}(s^{N−1}, x^N) be the value within the max operator. We can find x^N by differentiating V^{N−1}(s^{N−1}, x^N) with respect to x^N, giving:

    ∂V^{N−1}(s^{N−1}, x^N)/∂x^N = p/(s^{N−1} + x^N) − (1 − p)/(s^{N−1} − x^N)
                                = (2 s^{N−1} p − s^{N−1} − x^N) / ((s^{N−1})^2 − (x^N)^2)

Setting this equal to zero and solving for x^N gives:

    x^N = (2p − 1) s^{N−1}

The next step is to plug this back into (2.10) to find V^{N−1}(s^{N−1}):

    V^{N−1}(s^{N−1}) = p ln(s^{N−1} + s^{N−1}(2p − 1)) + (1 − p) ln(s^{N−1} − s^{N−1}(2p − 1))
                     = p ln(2p s^{N−1}) + (1 − p) ln(2(1 − p) s^{N−1})
                     = p ln s^{N−1} + (1 − p) ln s^{N−1} + p ln(2p) + (1 − p) ln(2(1 − p))
                     = ln s^{N−1} + K
where K = p ln(2p) + (1 − p) ln(2(1 − p)) is a constant with respect to s^{N−1}. Since the additive constant does not change our decision, we may ignore it and use V^{N−1}(s^{N−1}) = ln s^{N−1} as our value function for N − 1, which is the same as our value function for N. Not surprisingly, we can keep applying this same logic backward in time and obtain:

    V^n(s^n) = ln s^n + K^n

for all n, where again, K^n is some constant that can be ignored. This means that for all n, our optimal solution is:

    x^n = (2p − 1) s^{n−1}

The optimal strategy in each round is to bet a fraction β = 2p − 1 of our current money on hand. Of course, this requires that p > .5.
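Since V^n(s) = ln s + K^n, every round reduces to the same one-round problem, which makes the closed-form fraction easy to check numerically. The sketch below simply searches a grid of betting fractions for the one maximizing p ln(1 + f) + (1 − p) ln(1 − f); the value of p and the grid resolution are arbitrary choices for illustration.

```python
import math

def best_fraction(p, grid=2001):
    """Numerically maximize p*ln(1+f) + (1-p)*ln(1-f) over fractions f in [0, 1)."""
    best_f, best_val = 0.0, float("-inf")
    for k in range(grid):
        f = k / grid * 0.999               # stay strictly inside [0, 1)
        val = p * math.log(1 + f) + (1 - p) * math.log(1 - f)
        if val > best_val:
            best_f, best_val = f, val
    return best_f

if __name__ == "__main__":
    p = 0.6
    print(best_fraction(p), 2 * p - 1)     # both close to 0.2
```

For p = 0.6 the grid search returns a fraction close to 2p − 1 = 0.2, as the derivation predicts.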
2.2.2 The batch replenishment problem
One of the classical problems in operations research is one that we refer to here as the batch replenishment problem. To illustrate the basic problem, assume that we have a single type of resource which is consumed over time. As the reserves of the resource run low, it is necessary to replenish them. In many problems, there are economies of scale in this process: it is cheaper (on an average cost basis) to increase the level of resources in one jump (see the examples below).

Example 2.1: A startup company has to maintain adequate reserves of operating capital to fund product development and marketing. As the cash is depleted, the finance officer has to go to the markets to raise additional capital. There are fixed costs of raising capital, so this tends to be done in batches.

Example 2.2: An oil company maintains an aggregate level of oil reserves. As these are depleted, it will undertake exploration expeditions to identify new oil fields, which will produce jumps in the total reserves under the company's control.
We address this problem in some depth in chapter 13. We use it here simply as an illustration of dynamic programming methods. Let:

    D̂_t = the demand for the resources during time interval t.
    R_t = the resource level at time t.
    x_t = the additional resources acquired at time t to be used during time interval t + 1.
The transition function is given by:

    R^M_{t+1}(R_t, x_t, D̂_{t+1}) = max{0, R_t + x_t − D̂_{t+1}}
Our one period cost function (which we wish to minimize) is given by:

    Ĉ_{t+1}(R_t, x_t, D̂_{t+1}) = the total cost of acquiring x_t units of the resource
                                = c^f I_{x_t > 0} + c^p x_t + c^h R^M_{t+1}(R_t, x_t, D̂_{t+1})

where:

    c^f = the fixed cost of placing an order.
    c^p = the unit purchase cost.
    c^h = the unit holding cost.
For our purposes, C_t(x_t) could be any nonconvex function; this is a simple example of one. Since the cost function is nonconvex, it helps to order larger quantities at the same time. Assume that we have a family of decision functions X^π(R_t), π ∈ Π, for determining x_t. Our goal is to solve:

    min_{π∈Π} E { Σ_{t=0}^{T} γ^t Ĉ_{t+1}(R_t, X^π(R_t), D̂_{t+1}) }
This problem class often yields policies that take a form such as “if the resource level goes below a certain amount, then order up to a fixed amount.” The basic batch replenishment problem, where Rt and xt are scalars, is quite easy (if we know things like the distribution of demand). But there are many real problems where these are vectors because there are different types of resources. The vectors may be small (different types of fuel, raising different types of funds) or extremely large (hiring different types of people for a consulting firm or the military; maintaining spare parts inventories). Even a small number of dimensions would produce a very large problem using a discrete representation.
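For the scalar case, the "order up to a fixed amount" structure is easy to experiment with by simulation. The sketch below evaluates a hypothetical (s, S) policy on a single sample path of random demands; the cost parameters, the demand distribution, and the grid of policies searched are all invented for illustration, and a serious evaluation would average over many sample paths.

```python
import random

def simulate_policy(s, S, T=50, c_f=10.0, c_p=1.0, c_h=0.2, gamma=0.95, seed=0):
    """Simulate the batch replenishment dynamics under an order-up-to policy."""
    rng = random.Random(seed)
    R, total_cost = 0, 0.0
    for t in range(T):
        x = (S - R) if R < s else 0            # decision function X^pi(R_t)
        D = rng.randint(0, 10)                 # demand D_hat_{t+1} (illustrative distribution)
        R_next = max(0, R + x - D)             # transition function R^M_{t+1}
        cost = c_f * (x > 0) + c_p * x + c_h * R_next   # one-period cost C_hat_{t+1}
        total_cost += (gamma ** t) * cost
        R = R_next
    return total_cost

if __name__ == "__main__":
    # Crude search over a grid of (s, S) pairs on a single sample path.
    best = min(((simulate_policy(s, S), s, S) for s in range(8) for S in range(8, 20)),
               key=lambda z: z[0])
    print("best discounted cost", round(best[0], 1), "with (s, S) =", best[1:])
```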
2.2.3 The secretary problem
The so-called secretary problem (Cayley (1875)) is one of the classic problems of dynamic programming. The motivation of the problem is determining when to hire a candidate for a job (presumably a secretarial position), but it can also be applied to reviewing a series of offers for an asset (such as selling your house or car). This problem provides a nice illustration of a dynamic programming problem that can be solved very easily.
Setup

Assume that we have N candidates for a secretarial position. Each candidate is interviewed in sequence and assigned a score that allows us to compare him or her to other candidates. While it may be reasonable to try to maximize the expected score that we would receive, in this case, we want to maximize the probability of hiring the best candidate out of the entire pool. We need to keep in mind that if we stop at candidate n, then we would not have even interviewed candidates n + 1, ..., N. Let:

    ω_n = the score of the nth candidate.
    s_n = 1 if the score of the nth candidate is the best so far, and 0 otherwise.
    S = the state space, given by (0, 1, ∆), where the states 0 and 1 mean that we are still searching, and ∆ means we have stopped the process.
    X = {0 (continue), 1 (quit)}, where “quit” means that we hire the last candidate interviewed.

Because the decision function uses the most recent piece of information, we define our history as:

    h_n = {ω_1, ..., ω_n}

To describe the system dynamics, it is useful to define an indicator function:

    I_n(h_n) = 1 if ω_n = max_{1≤m≤n} {ω_m}, and 0 otherwise

which tells us if the last observation is the best. Our system dynamics can now be given by:

    S_{n+1} = I_{n+1}(h_{n+1}) if x_n = 0, and S_{n+1} = ∆ if x_n = 1.

To compute the one-step transition matrix, we observe that the event that the (n + 1)st applicant is the best has nothing to do with whether the nth was the best. As a result:

    Prob[I_{n+1}(h_{n+1}) = 1 | I_n(h_n)] = Prob[I_{n+1}(h_{n+1}) = 1]

This simplifies the problem of finding the one-step transition matrix. By definition we have:

    Prob(S_{n+1} = 1 | s_n, x_n = 0) = Prob[I_{n+1}(h_{n+1}) = 1]
I_{n+1}(h_{n+1}) = 1 if the (n + 1)st candidate is the best out of the first n + 1, which clearly occurs with probability 1/(n + 1). So:

    Prob(S_{n+1} = 1 | s_n, x_n = 0) = 1/(n + 1)
    Prob(S_{n+1} = 0 | s_n, x_n = 0) = n/(n + 1)

Our goal is to maximize the probability of hiring the best candidate. So, if we do not hire the last candidate, then the probability that we hired the best candidate is zero. If we hire the nth candidate, and the nth candidate is the best so far, then our reward is the probability that this candidate is the best out of all N. This probability is simply the probability that the best candidate out of all N is one of the first n, which is n/N. So, the conditional reward function is:

    c_n(s_n, x_n | h_n) = n/N if s_n = 1 and x_n = 1, and 0 otherwise

With this information, we can now set up the optimality equations:

    V_n(s_n) = max_{x_n ∈ X} E{ c_n(s_n, x_n | h_n) + V_{n+1}(S_{n+1}) | s_n }

Solution

The solution to the problem is quite elegant, but the technique is unique to this particular problem. Readers interested in the elegant answer but not the particular proof (which illustrates dynamic programming but otherwise does not generalize to other problem classes) can skip to the end of the section. Let:

    V_n(s) = the probability of choosing the best candidate out of the entire population, given that we are in state s after interviewing the nth candidate.

Recall that implicit in the definition of our value function is that we are behaving optimally from candidate n onward. The terminal reward is:

    V_N(1) = 1
    V_N(0) = 0
    V_N(∆) = 0
The optimality recursion for the problem is given by:

    V_n(1) = max[ c_n(1, stop) + V_{n+1}(∆), c_n(1, continue) + Σ_{s′∈(0,1)} p(s′|s) V_{n+1}(s′) ]

Noting that:

    c_n(1, continue) = 0
    c_n(1, stop) = n/N
    V_{n+1}(∆) = 0
    p(s′|s) = 1/(n + 1) if s′ = 1, and n/(n + 1) if s′ = 0

we get:

    V_n(1) = max[ n/N, (1/(n + 1)) V_{n+1}(1) + (n/(n + 1)) V_{n+1}(0) ]    (2.11)

Similarly, it is easy to show that:

    V_n(0) = max[ 0, (1/(n + 1)) V_{n+1}(1) + (n/(n + 1)) V_{n+1}(0) ]    (2.12)

Comparing (2.12) and (2.11), we can rewrite (2.11) as:

    V_n(1) = max[ n/N, V_n(0) ]    (2.13)

From this we obtain the inequality:

    V_n(1) ≥ V_n(0)    (2.14)

which seems pretty intuitive (you are better off if the last candidate you interviewed was the best you have seen so far).

At this point, we are going to suggest a policy that seems to be optimal. We are going to interview the first n̄ candidates, without hiring any of them. Then, we will stop and hire the first candidate who is the best we have seen so far. The decision rule can be written as:

    x_n(1) = 0 (continue) if n ≤ n̄, and 1 (quit) if n > n̄
To prove this, we are going to start by showing that if V_m(1) > m/N for some m (or alternatively if V_m(1) = m/N = V_m(0)), then V_{m′}(1) > m′/N for m′ < m. If V_m(1) > m/N, then it means that the optimal decision is to continue. We are going to show that if it was optimal to continue at step m, then it was optimal to continue for all steps m′ < m.

Assume that V_m(1) > m/N. This means, from equation (2.13), that it was better to continue, which means that V_m(1) = V_m(0) (or there might be a tie, implying that V_m(1) = m/N = V_m(0)). This allows us to write:

    V_{m−1}(0) = (1/m) V_m(1) + ((m − 1)/m) V_m(0)
               = V_m(0)    (2.15)
               ≥ m/N    (2.16)

Equation (2.15) is true because V_m(1) = V_m(0), and equation (2.16) is true because V_m(1) ≥ m/N. Stepping back in time, we get:

    V_{m−1}(1) = max[ (m − 1)/N, V_{m−1}(0) ]
               ≥ m/N    (2.17)
               > (m − 1)/N    (2.18)

Equation (2.17) is true because V_{m−1}(0) ≥ m/N. We can keep repeating this for m − 1, m − 2, ..., so it is optimal to continue for m′ < m.

Now we have to show that if N > 2, then n̄ ≥ 1. If this is not the case, then for all n, V_n(1) = n/N (because we would never quit). This means that (from equation (2.12)):

    V_n(0) = (1/(n + 1)) ((n + 1)/N) + (n/(n + 1)) V_{n+1}(0)
           = 1/N + (n/(n + 1)) V_{n+1}(0)    (2.19)

Using V_N(0) = 0, we can solve (2.19) by backward induction:

    V_N(0) = 0
    V_{N−1}(0) = 1/N + ((N − 1)/(N − 1 + 1)) V_N(0)
               = 1/N
    V_{N−2}(0) = 1/N + ((N − 2)/(N − 2 + 1)) (1/N)
               = ((N − 2)/N) (1/(N − 2) + 1/(N − 1))
In general, we get:

    V_m(0) = (m/N) (1/m + 1/(m + 1) + ··· + 1/(N − 1))

We can easily see that V_1(0) > 1/N; but since we were always continuing, we had found that V_1(1) = 1/N. Finally, equation (2.14) tells us that V_1(1) ≥ V_1(0), which means we have a contradiction.

This structure tells us that for m ≤ n̄:

    V_m(0) = V_m(1)

and for m > n̄:

    V_m(1) = m/N
    V_m(0) = (m/N) (1/m + 1/(m + 1) + ··· + 1/(N − 1))

It is optimal to continue as long as V_m(0) > m/N, so we want to find the largest value of m such that:

    (m/N) (1/m + 1/(m + 1) + ··· + 1/(N − 1)) > m/N

or:

    1/m + 1/(m + 1) + ··· + 1/(N − 1) > 1

If N = 5, then we can solve by enumeration:

    n̄ = 1:  1/1 + 1/2 + 1/3 + 1/4 = 2.08
    n̄ = 2:  1/2 + 1/3 + 1/4 = 1.08
    n̄ = 3:  1/3 + 1/4 = 0.58

So for N = 5, we would use n̄ = 2. This means interview (and discard) two candidates, and then take the first candidate that is the best to date.

For large N, we can find a neat approximation. We would like to find m such that:

    1 ≈ 1/m + 1/(m + 1) + ··· + 1/(N − 1)
      ≈ ∫_m^N (1/x) dx
      = ln N − ln m
      = ln(N/m)
Solving for m means finding ln(N/m) = 1 or N/m = e or m/N = e−1 = 0.368. So, for large N , we want to interview 37 percent of the candidates, and then choose the first candidate that is the best to date.
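The cutoff n̄ can also be computed directly from the condition derived above. The short sketch below does this and compares the result with the N/e rule; it is just a numerical illustration of the formulas, with the values of N chosen arbitrarily.

```python
import math

def cutoff(N):
    """Largest n_bar such that 1/n_bar + 1/(n_bar+1) + ... + 1/(N-1) > 1."""
    best = 0
    for m in range(1, N):
        if sum(1.0 / k for k in range(m, N)) > 1.0:
            best = m
    return best

if __name__ == "__main__":
    for N in (5, 10, 100, 1000):
        print(N, cutoff(N), round(N / math.e))   # the cutoff approaches N/e for large N
```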
2.2.4 Optimal stopping
A particularly important problem in asset pricing is known as the optimal stopping problem. Imagine that you are holding an asset which you can sell at a price that fluctuates randomly. Let p̂_t be the price that is revealed in period t, at which point you have to make a decision:

    x_t = 1 if we sell, and 0 if we hold.

Our system has two states:

    S_t = 1 if we are holding the asset, and 0 if we have sold the asset.

If we have sold the asset, then there is nothing we can do. We want to maximize the price we receive when we sell our asset. Let the scalar V_t be the value of holding the asset at time t. This can be written:

    V_t = E{ max_{x_t ∈ {0,1}} ( x_t p̂_t + (1 − x_t) γ V_{t+1} ) }
So, we either get the price p̂_t if we sell, or we get the discounted future value of the asset. Assuming the discount factor γ < 1, we do not want to hold too long simply because the value in the future is worth less than the value now. In practice, we eventually will see a price p̂_t that is greater than the future expected value, at which point we would stop the process and sell our asset. The time τ at which we sell our asset is known as a stopping time. By definition, x_τ = 1. It is common to think of τ as the decision variable, where we wish to solve:

    max_τ E p̂_τ    (2.20)
Equation (2.20) is a little tricky to interpret. Clearly, the choice of when to stop is a random variable since it depends on the price p̂_t. We cannot optimally choose a random variable, so what is meant by (2.20) is that we wish to choose a function (or policy) that determines when we are going to sell. For example, we would expect that we might use a rule that says:

    X_t(S_t, p̄) = 1 if p̂_t ≥ p̄ and S_t = 1, and 0 otherwise    (2.21)
In this case, we have a function parameterized by p̄, and we would write our problem in the form:

    max_{p̄} E Σ_{t=1}^{∞} γ^t p̂_t X_t(S_t, p̄)
This formulation raises two questions. First, while it seems very intuitive that our policy would take the form given in equation (2.21), there is the theoretical question of whether this is in fact the structure of an optimal policy. Questions of this type can be quite theoretical in nature. Chapter 13 demonstrates how these questions can be answered in the context of a class of batch replenishment problems. The second question is how to find the best policy within this class. For this problem, that means finding the parameter p̄. For problems where the probability distribution of the random process driving prices is (assumed) known, this is a rich and deep theoretical challenge. Alternatively, there is a class of algorithms from stochastic optimization that allows us to find “good” values of the parameter in a fairly simple way.
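As a taste of the second approach, the sketch below estimates the value of the threshold policy (2.21) by simulation and then searches a coarse grid of thresholds. The price process (i.i.d. uniform prices), the discount factor, and the grid are all assumptions made purely for illustration.

```python
import random

def value_of_threshold(p_bar, gamma=0.95, T=200, n_paths=2000, seed=1):
    """Estimate E[gamma^tau * p_tau] for the policy: sell the first time p_t >= p_bar."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        for t in range(T):
            price = rng.uniform(0.0, 1.0)      # illustrative i.i.d. price process
            if price >= p_bar:                 # X_t(S_t, p_bar) = 1: sell
                total += (gamma ** t) * price
                break                          # asset sold; stop this sample path
    return total / n_paths

if __name__ == "__main__":
    # Crude search over a grid of thresholds.
    grid = [i / 20 for i in range(20)]
    best = max(grid, key=value_of_threshold)
    print("best threshold on the grid:", best)
```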
2.3 Bibliographic notes
The examples provided in this chapter represent classic problems that can be found in a number of sources. The presentation of the secretary problem is based on Puterman (1994). Nice presentations of simple dynamic programs can be found in Ross (1983), Whittle (1982) and Bertsekas (2000).
Exercises

2.1) Give an example of a sequential decision process from your own experience. Describe the decisions that have to be made, the exogenous information process, the state variable, and the cost or contribution function. Then describe the types of rules you might use to make a decision.

2.2) Repeat the gambling problem assuming that the value of ending up with S^N dollars is √(S^N).

2.3) Write out the steps of a shortest path algorithm, similar to that shown in figure 2.2, which starts at the destination and works backward to the origin.

2.4) Consider a budget problem where you have to allocate R advertising dollars (which can be treated as a continuous variable) among T = (1, 2, ..., T) products. Let x_t be the total dollars allocated to product t, where we require that Σ_{t∈T} x_t ≤ R. Further assume that the increase in sales for product t is given by √(x_t) (this is the contribution we are trying to maximize).
a) Set up the optimality equation (similar to equation (2.7)) for this problem, where the state variable R_t is the total funds remaining after allocating funds to products (1, 2, ..., t − 1).
b) Assuming you have R_T available to be allocated to the last product, what is the optimal allocation to the last product? Use this answer to write out an expression for V_T(R_T).
c) Use your answer to part (b) to find the optimal allocation to product T − 1 assuming we have R_{T−1} dollars available to be allocated over the last two products. Find the optimal allocation x_{T−1} and an expression for V_{T−1}.
d) By now, you should see a pattern forming. Propose what appears to be the functional form for V_t(R_t) and use inductive reasoning to prove your conjecture by showing that it returns the same functional form for V_{t−1}(R_{t−1}).
e) What is your final optimal allocation over all products?

2.5) Repeat exercise (2.4) assuming that the reward for product t is c_t √(x_t).

2.6) Repeat exercise (2.4) assuming that the reward (the increased sales) for product t is given by ln(x_t).

2.7) Repeat exercise (2.4) one more time, but now assume that all you know is that the reward is continuously differentiable, monotonically increasing and concave.

2.8) What happens to your answer to the budget allocation problem (for example, exercise 2.4) if the contribution is convex instead of concave?

2.9) We are now going to do a budgeting problem where the reward function does not have any particular properties. It may have jumps, as well as being a mixture of convex and concave functions. But this time we will assume that R = $30 and that the allocations x_t must be integers between 0 and 30. Assume that we have T = 5 products, with a contribution function C_t(x_t) = c_t f(x_t) where c = (c_1, ..., c_5) = (3, 1, 4, 2, 5) and where f(x) is given by:

    f(x) = 0 if x ≤ 5
         = 5 if x = 6
         = 7 if x = 7
         = 10 if x = 8
         = 12 if x ≥ 9
Find the optimal allocation of resources over the five products.

2.10) You suddenly realize towards the end of the semester that you have three courses that have assigned a term project instead of a final exam. You quickly estimate how much time each one will take to get 100 points (equivalent to an A+) on the project. You then
guess that if you invest t hours in a project, which you estimated would need T hours to get 100 points, that for t < T, your score will be:

    R = 100 √(t/T)

That is, there are declining marginal returns to putting more work into a project. So, if a project is projected to take 40 hours and you only invest 10, you estimate that your score will be 50 points (100 times the square root of 10 over 40). You decide that you cannot spend more than a total of 30 hours on the projects, and you want to choose a value of t for each project that is a multiple of 5 hours. You also feel that you need to spend at least 5 hours on each project (that is, you cannot completely ignore a project). The time you estimate to get full score on each of the three projects is given by:

    Project    Completion time T
       1             20
       2             15
       3             10

You decide to solve the problem as a dynamic program.

a) What is the state variable and decision epoch for this problem?
b) What is your reward function?
c) Write out the problem as an optimization problem.
d) Set up the optimality equations.
e) Solve the optimality equations to find the right time investment strategy.
2.11) You have to send a set of questionnaires to each of N population segments. The size of each population segment is given by w_i. You have a budget of B questionnaires to allocate among the population segments. If you send x_i questionnaires to segment i, you will have a sampling error proportional to:

    f(x_i) = 1/√(x_i)

You want to minimize the weighted sum of sampling errors, given by:

    F(x) = Σ_{i=1}^{N} w_i f(x_i)

You wish to find the allocation x that minimizes F(x) subject to the budget constraint Σ_{i=1}^{N} x_i ≤ B. Set up the optimality equations to solve this problem as a dynamic program (needless to say, we are only interested in integer solutions).

2.12) For the gambling problem of section 2.2.1, do the following:

a) Set up the dynamic programming recursion for this problem. Define your state and decision spaces, and your one period reward function.
b) Show that the optimal betting strategy is to bet a fraction β of his fortune on each play. Find β.
Chapter 3

Modeling dynamic programs

Good modeling begins with good notation. Complex problems in asset management require considerable discipline in notation, because they combine the complexities of the original physical problem with the challenge of modeling sequential information and decision processes. Students will typically find the modeling of time to be particularly subtle. In addition to a desire to model problems accurately, we also need to be able to understand and exploit the structure of the problem, which can become lost in a sea of complex notation.

It is common in textbooks on dynamic programming to quickly adopt a standard paradigm so that the presentation can focus on dynamic programming principles and properties. Our emphasis is on modeling and computation, and our goal is to solve large-scale, complex problems. For this reason, we devote far more attention to modeling than would be found in other dynamic programming texts.

The choice of notation has to balance historical patterns with the needs of a particular problem class. Notation is easier to learn if it is mnemonic (the letters look like what they mean) and compact (avoiding a profusion of symbols). Notation also helps to bridge communities. For example, it is common in dynamic programming to refer to actions using “a” (where a is discrete); in control theory a decision (control) is “u” (which may be continuous). For high dimensional problems, it is essential to draw on the field of mathematical programming, where decisions are typically written as “x” and resource constraints are written in the standard form Ax = b. In this text, many of our problems involve managing assets (resources) where we are trying to maximize or minimize an objective subject to constraints. For this reason, we adopt, as much as possible, the notation of math programming to help us bridge the fields of math programming and dynamic programming.

Proper notation is also essential for easing the transition from simple illustrative problems to the types of real world problems that can arise in practice. In operations research, it is common to refer to an asset class (in finance, it could be a money market fund, real estate or a stock; in the physical world, it could be a type of aircraft or locomotive) as a “commodity” which might be indexed as k in a set of classes K. But as projects evolve, these asset classes may pick up new dimensions. A common asset management problem in railroads is the movement of boxcars, where a distinct set of boxcar types makes up
our “commodities.” Real boxcars, however, have other attributes such as who owns them (so-called “pools”), the precise configuration of the boxcar (they vary in aspects such as the exact location of the door and locking mechanism, for example), their maintenance status, and their cleanliness. As these attributes are added to the problem, the number of boxcar types grows dramatically. It is especially important that notation be clear and elegant. Simple, textbook problems are easy. The challenge is modeling complex, realistic problems. If the foundational notation is not properly designed, the modeling of a real problem will explode into a tortuous vocabulary.
3.1 Notational style
Notation is a language: the simpler the language, the easier it is to understand the problem. As a start, it is useful to adopt notational conventions to simplify the style of our presentation. For this reason, we adopt the following notational conventions:

Variables - Variables are always a single letter. We would never use, for example, CH for “holding cost.”

Indexing vectors - Vectors are almost always indexed in the subscript, as in x_ij. Since we use discrete time models throughout, an activity at time t can be viewed as an element of a vector. So x_t would be an activity at time t, with the vector x = (x_1, x_2, ..., x_t, ..., x_T) giving us all the activities over time. When there are multiple indices, they should be ordered from outside in the general order over which they might be summed (think of the outermost index as the most detailed information). So, if x_tij is the flow from i to j at time t with cost c_tij, we might sum up the total cost using Σ_t Σ_i Σ_j c_tij x_tij. Dropping one or more indices creates a vector over the elements of the missing indices to the right. So, x_t = (x_tij)_{∀i,∀j} is the vector of all flows occurring at time t. If we write x_ti, this would be the vector of flows out of i at time t to all destinations j. Time, when present, is always the innermost index.

Superscripts - These are used to represent different flavors of variables. So, c^h (or c^hold) might be a holding cost while c^o (or c^order) could be an order cost. Note that while variables must be a single letter, superscripts may be words (although this should be used sparingly). We think of a variable like “c^h” as a single piece of notation.

Iteration counters - We always place iteration counters in the superscript, and we primarily use n as our iteration counter. So, x^n is our activity at iteration n. If we are using a descriptive superscript, we might write x^{h,n} to represent x^h at iteration n. Sometimes algorithms require inner and outer iterations. In this case, we use n to index the outer iteration and m for the inner iteration. While this will prove to be the most natural way to index iterations, students should be aware of the occasional potential for confusion where it may not be clear if the superscript n is an index (as we view it) or raising a variable to the nth power.
Sets - Sets are represented using capital letters in a calligraphic font, such as X, F or I. We generally use the lower case roman letter as an element of a set, as in x ∈ X or i ∈ I.

Exogenous information - Information that first becomes available (from outside the system) at time t is denoted using hats, for example, D̂_t or p̂_t. These are our basic random variables.

Statistics - Statistics computed using exogenous information are generally indicated using bars, for example x̄_t or V̄_t. Since these are functions of random variables, they are also random.

Index variables - Throughout, i, j, k, l, m and n are always scalar indices.

Of course, there are exceptions to every rule. It is extremely common in the transportation literature to model the flow of a type of resource (called a commodity and indexed by k) from i to j using x^k_ij. Following our convention, this should be written x_kij. Authors need to strike a balance between a standard notational style and existing conventions.
3.2 Modeling time
A survey of the literature reveals different styles of modeling time. When using discrete time, some authors start at one while others start at zero. When solving finite horizon problems, it is popular to index time by the number of time periods remaining, rather than elapsed time. Some authors index a variable, say S_t, as being a function of information up through t − 1, while others assume it includes information up through time t. The index t may be used to represent when a physical event actually happens, or when we first know about a physical event. The confusion over modeling time arises in large part because there are two processes that we have to capture: the flow of information, and the flow of physical and financial assets. There are many applications of dynamic programming to deterministic problems where the flow of information does not exist. Similarly, there are many models where the arrival of the information about a physical asset, and when the information takes effect in the physical system, are the same. For example, the time at which a customer physically arrives at a queue is often modeled as being the same as when the information about the customer first arrives. Similarly, we often assume that we can sell an asset at a market price as soon as the price becomes known. However, there is a rich collection of problems where the information process and physical process are different. A buyer may purchase an option now (an information event) to buy a commodity in the future (the physical event). Customers may call an airline (the information event) to fly on a future flight (the physical event). An electric power company has to purchase equipment now to be used one or two years in the future. All of these problems represent examples of lagged information processes and force us to explicitly model the informational and physical events.
[Figure 3.1: Relationship between discrete and continuous time for information processes (3.1a) and physical processes (3.1b).]
Notation can easily become confused when an author starts by writing down a deterministic model of a physical process, and then adds uncertainty. The problem arises because the proper convention for modeling time for information processes is different than what should be used for physical processes.

We begin by establishing the relationship between discrete and continuous time. All of the models in this book are presented in discrete time, since this is the most natural framework for computation. The relationship of our discrete time approximation to the real flow of information and physical assets is depicted in figure 3.1.

When we are modeling information, time t = 0 is special; it represents “here and now” with the information that is available at the moment. The discrete time t refers to the time interval from t − 1 to t (illustrated in figure 3.1a). This means that the first new information arrives during time interval 1. This notational style means that any variable indexed by t, say S_t or x_t, is assumed to have access to the information that arrived up to time t, which means up through time interval t. This property will dramatically simplify our notation in the future. For example, assume that f_t is our forecast of the demand for electricity. If D̂_t is the observed demand during time interval t, we would write our updating equation for the forecast using:

    f_t = (1 − α) f_{t−1} + α D̂_t    (3.1)
We refer to this form as the informational representation. Note that the forecast ft is written as a function of the information that became available during time interval t. When we are modeling a physical process, it is more natural to adopt a different convention (illustrated in figure 3.1b): discrete time t refers to the time interval between t and t + 1. This convention arises because it is most natural in deterministic models to use time to represent when something is happening or when an asset can be used. For example, let
R_t be our cash on hand that we can use during day t (implicitly, this means that we are measuring it at the beginning of the day). Let D̂_t be the demand for cash during the day, and let x_t represent additional cash that we have decided to add to our balance (to be used during day t). We can model our cash on hand using the simple equation:

    R_{t+1} = R_t + x_t − D̂_t    (3.2)
We refer to this form as the actionable representation. Note that the left hand side is indexed by t + 1, while all the quantities on the right hand side are indexed by t. This equation makes perfect sense when we interpret time t to represent when a quantity can be used. For example, many authors would write our forecasting equation (3.1) as:

    f_{t+1} = (1 − α) f_t + α D̂_t    (3.3)
This equation is correct if we interpret f_t as the forecast of the demand that will happen in time interval t. A review of the literature quickly reveals that both modeling conventions are widely used. Students need to be aware of the two conventions and how to interpret them.

We handle the modeling of informational and physical processes by using two time indices, a form that we refer to as the “(t, t′)” notation. For example:

    D̂_{tt′} = the demands that first become known during time interval t to be served during time interval t′.
    f_{tt′} = the forecast for activities during time interval t′ made using the information available up through time t.
    R_{tt′} = the resources on hand at time t that cannot be used until time t′.
    x_{tt′} = the decision to purchase futures at time t to be exercised during time interval t′.

Each of these variables can be written as vectors, such as:

    D̂_t = (D̂_{tt′})_{t′≥t}
    f_t = (f_{tt′})_{t′≥t}
    x_t = (x_{tt′})_{t′≥t}
    R_t = (R_{tt′})_{t′≥t}

Note that these vectors are now written in terms of the information content. For stochastic problems, this style is the easiest and most natural. Each one of these quantities is computed at the end of time interval t (that is, with the information up through time interval t) and represents a quantity that can be used at time t′ in the future. We could adopt the convention that the first time index uses the indexing
system illustrated in figure 3.1a, while the second time index uses the system in figure 3.1b. While this convention would allow us to easily move from a natural deterministic model to a natural stochastic model, we suspect most people will struggle with an indexing system where time interval t in the information process refers to time interval t − 1 in the physical process. Instead, we adopt the convention to model information in the most natural way, and live with the fact that product arriving at time t can only be used during time interval t + 1.

Using this convention, it is instructive to interpret the special case where t = t′. D̂_{tt} is simply demands that arrive during time interval t, where we first learn of them when they arrive. f_{tt} makes no sense, because we would never forecast activities during time interval t after we have this information. R_{tt} represents assets that we know about during time interval t and which can be used during time interval t. Finally, x_{tt} is a decision to purchase assets to be used during time interval t given the information that arrived during time interval t. In financial circles, this is referred to as purchasing on the spot market.

The most difficult notational decision arises when first starting to work on a problem. It is natural at this stage to simplify the problem (often, the problem appears simple) and then choose notation that seems simple and natural. If the problem is deterministic and you are quite sure that you will never solve a stochastic version of the problem, then the actionable representation (figure 3.1b and equation (3.2)) is going to be the most natural. Otherwise, it is best to choose the informational format. If you do not have to deal with lagged information processes (e.g. ordering at time t to be used at some time t′ in the future) you should be able to get by with a single time index, but you need to remember that x_t may mean purchasing product to be used during time interval t + 1.

As a final observation, consider what happens when we want to know the expected costs given that we make decision x_{t−1}. We would have to compute E{C_t(x_{t−1}, D̂_t)}, where the expectation is over the random variable D̂_t. The function that results from taking the expectation is now a function of information up through time t − 1. Thus, we might use the notation:

    C̄_{t−1}(x_{t−1}) = E{C_t(x_{t−1}, D̂_t)}

This can take a little getting used to. The costs are incurred during time interval t, but now we are indexing the function with time t − 1. The problem is that if we use a single time index, we are not capturing when the activity is actually happening. An alternative is to switch to a double time index, as in:

    C̄_{t−1,t}(x_{t−1}) = E{C_t(x_{t−1}, D̂_t)}

where C̄_{t−1,t}(x_{t−1}) is the expected cost that will be incurred during time interval t using the information known at time t − 1.
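Returning to the forecasting example, the distinction between the two conventions is easiest to see in code, where the arithmetic of the smoothing update in (3.1) and (3.3) is identical and only the meaning of the index changes. The sketch below is a trivial illustration of ours, with made-up demand data.

```python
def smooth(f_old, d_hat, alpha=0.2):
    """Exponential smoothing update used in both equations (3.1) and (3.3)."""
    return (1 - alpha) * f_old + alpha * d_hat

if __name__ == "__main__":
    demands = [10.0, 12.0, 9.0, 14.0]   # illustrative observations D_hat_1, ..., D_hat_4
    f = 10.0                            # initial forecast
    for t, d_hat in enumerate(demands, start=1):
        f = smooth(f, d_hat)
        # Informational convention (3.1): this number is f_t, indexed by the
        # information (everything up through interval t) used to compute it.
        # Actionable convention (3.3): the same number would be labeled f_{t+1},
        # indexed by the interval whose demand it forecasts.
        print("after observing D_hat_%d: forecast = %.2f" % (t, f))
```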
3.3 Modeling assets
Many of the models in this book are of fairly complex problems, but we typically start with relatively simple problems. We need notation that allows us to evolve from simple to complex problems in a natural way. The first issue we deal with is the challenge of modeling single and multiple assets. In engineering, a popular problem is to use dynamic programming to determine how to best land an aircraft, or control a power generating plant. In computer science, researchers in artificial intelligence might want to use a computer to play a game of backgammon or chess. We would refer to these applications as single asset problems. Our “system” would be the aircraft, and the state variable would describe the position, velocity and acceleration of the aircraft. If we were to model the problem of flying a single aircraft as a dynamic program, we would have considerable difficulty extending this model to simultaneously manage multiple aircraft. The distinction between modeling a single asset (such as an aircraft) and multiple assets (managing a fleet of aircraft) is important. For this reason, we adopt special notation when we are modeling a single asset. For example, it is quite common when modeling a dynamic program to use a variable such as St to represent the “state” of our system, where St could be the attributes describing a single aircraft, or all the attributes of a fleet of aircraft. Unfortunately, using such general notation disguises the structure of the problem and significantly complicates the challenge of designing effective computational algorithms. For this reason, if we are managing a single asset, we adopt special notation for the attributes of the asset. We let: at = Vector of attributes of the asset at time t. A = Set of possible attributes. . The attribute at can be a single element or a vector, but we will always assume that the vector is not too big (no more than 10 or 20 elements). In the case of our shortest path problem, at would be the node number corresponding to the intersection where a driver had to make a decision. If we are solving an asset selling problem, at might capture whether the asset has been sold, and possibly how long we have held it. For a college student planning her course of study, at would be a vector describing the number of courses she has taken to fulfill each requirement. There is a vast array of problems that involving modeling what we would call a single asset. If there is no interest in extending the model to handle multiple assets, then it may be more natural to use St as the state variable. Students need to realize, however, that this notational framework can be quite limiting, as we show over the course of this chapter. If we are modeling multiple assets, we would capture the resource state of our system by defining the resource vector: Rta = The number of assets with attribute a at time t.
CHAPTER 3. MODELING DYNAMIC PROGRAMS
39
Rt = (Rta )a∈A . Rt is a vector with |A| dimensions. If a is a vector (think of our college student planning her course work), then |A| may be quite large. The size of |A| will have a major impact on the algorithmic strategy. We often have to model random arrivals of assets over time. For this purpose, we might define: ˆ t = Vector of new arrivals to the system during time period t. R ˆ t may be the withdrawals from a mutual fund during time interval t (a single type of asset), R or the number of requests for a particular car model (multiple asset classes), or the number of aircraft an airline has ordered where each aircraft is characterized by a vector of attributes ˆ t mathematically, we assume it takes a set of outcomes in a a. When we are representing R ˆ t is a set that is always denoted Ω (don’t ask), with elements ω ∈ Ω. Using this notation, R ˆ t (ω) is random variable giving the number of new arrivals of each type of asset class, and R a specific sample realization. New information may be more than just a new arrival to the system. An aircraft flying from one location to the next may arrive with some sort of maintenance failure. This can be modeled as a random change in an attribute of the aircraft. We can model this type of new information by defining: ˆ ta (Rt ) = The change in the number of assets with attribute a due to R exogenous information. ˆ Rt (Rt ) = The information function capturing exogeneous changes to the resource vector. ˆ t (Rt ) is a function, where a sample realization would be written R ˆ t (Rt , ω). Here, R There are many settings where the information about a new arrival comes before the new arrival itself as illustrated in the examples. Example 3.1: An airline may order an aircraft at time t and expect the order to be filled at time t0 . Example 3.2: An orange juice products company may purchase futures for frozen concentrated orange juice at time t that can be exercised at time t0 . Example 3.3: A programmer may start working on a piece of coding at time t with the expectation that it will be finished at time t0 .
This concept is important enough that we offer the following term:
CHAPTER 3. MODELING DYNAMIC PROGRAMS
40
Definition 3.3.1 The actionable time of an asset is the time at which a decision may be used to change its attributes (typically generating a cost or reward). The actionable time is simply one attribute of an asset. For example, if at time t we own a set of futures purchased in the past with exercise dates of t + 1, t + 2, . . . , t0 , then the exercise date would be an attribute of each futures contract. When writing out a mathematical model, it is sometimes useful to introduce an index just for the actionable time (rather than having it buried as an element of the attribute vector a). Before, we let Rta be the number of resources that we know about at time t with attribute a. The attribute might capture that the resource is not actionable until time t0 in the future. If we need to represent this explicitly, we might write: Rt,t0 a = The number of resources that we know about at time t that will be actionable with attribute a at time t0 . Rtt0 = (Rt,t0 a )a∈A Rt = (Rt,t0 )t0 ≥t . Perhaps the most powerful aspect of our notation is the attribute vector a, which allows us to represent a broad range of problems using a single variable. In fact, there are six major problem classes that can be represented using the same notation: • Basic asset acquisition problems - a = {}. Here Rt is a scalar, often representing how much money (or quantity of a single type of asset) that is on hand. • Multiclass asset management - a = {k} where k ∈ K is a set of asset classes. The attribute a consists of a single, static element. • Single commodity flow problems - a = {i} where i ∈ I is a set of states of an asset. Examples include managing money that can be invested in different stocks and a fleet of identical transport containers whose only attributesare their current location. • Multicommodity flow - a = {k, i} where k ∈ K represents the asset class and i ∈ I is a set of locations or states. • Heterogeneous resource allocation problem - a = (a1 , . . . , an ). Here we have an n-dimensional attribute vector. These applications arise primarily in the management of people and complex equipment. • Multilayered resource scheduling problem - a = {ac1 |ac2 | · · · |acn }. Now the attribute vector is a concatenation of attribute vectors. The only class that we do not specifically address in this volume is the last one.
CHAPTER 3. MODELING DYNAMIC PROGRAMS
3.4 Illustration: the nomadic trucker
The “nomadic trucker” is a colorful example of a multiattribute resource which helps to illustrate some of the modeling conventions introduced in this chapter. Later, we use this example to bring out different issues that arise in approximate dynamic programming, leading up to the solution of large-scale asset management problems later in the book.

The problem of the nomadic trucker arises in what is known as the truckload trucking industry. In this industry, a truck driver works much like a taxicab. A shipper will call a truckload motor carrier and ask it to send over a truck. The driver arrives, loads up the shipper's freight and takes it to the destination where it is unloaded. The shipper pays for the entire truck, so the carrier is not allowed to consolidate the shipment with freight from other shippers. In this sense, the trucker works much like a taxicab for people. However, as we will soon see, the trucking context adds an additional level of richness that offers some relevant lessons for dynamic programming.

Our trucker runs around the United States, where we assume that his location is one of the 48 contiguous states. When he arrives in a state, he sees the customer demands for loads to move from that state to other states. There may be none, one, or several. He may choose a load to move if one is available; alternatively, he has the option of doing nothing or moving empty to another state (even if a load is available). Once he moves out of a state, all other customer demands (in the form of loads to be moved) are assumed to be picked up by other truckers and are therefore lost. He is not able to see the availability of loads out of states other than where he is located.

Although truckload motor carriers can boast fleets of over 10,000 drivers, our model focuses on the decisions made by a single driver. There are, in fact, thousands of trucking “companies” that consist of a single driver. It is also the case that a driver in a large fleet still has some flexibility over what loads he accepts and where he moves. The problem of dispatching drivers has often been described as a negotiation, implying that drivers retain some independence in how they are assigned. In chapter 15 we show that the concepts we develop here form the foundation for managing the largest and most complex versions of this problem. For now, our “nomadic trucker” represents a particularly effective way of illustrating some important concepts in dynamic programming.
3.4.1  A basic model
The simplest model of our nomadic trucker assumes that his only attribute is his location, which we assume has to be one of the 48 contiguous states. We let: I = The set of “states” (locations) that the driver can be located at. We use i and j to index elements of I. His attribute vector then consists of: a = (i)
In addition to the attributes of the driver, we also have to capture the attributes of the loads that are available to be moved. For our basic model, loads are characterized only by where they are going. Let:

b = The vector of characteristics of a load.
B = The space of possible load attributes.

For our basic problem, the load attribute vector consists of:

b = (b_1, b_2) = (The origin of the load, The destination of the load)
The set B is the set of all pairs of origins and destinations.
3.4.2  A more realistic model
We need a richer set of attributes to capture some of the realism of the real life of a truck driver. A driver’s behavior is determined in the United States by a set of rules set by the Department of Transportation (“DOT”) that limit how much a driver can drive so he does not become too tired. There are three basic limits: the amount a driver can be behind the wheel in one shift, the amount of time a driver can be “on duty” in one shift, and the amount of time that a driver can be on duty over any contiguous eight day period. These rules were revised effective in year 2004 to be as follows: a driver can drive at most 11 hours, he may be on duty for at most 14 continuous hours (there are exceptions to this rule), and the driver can work at most 70 hours in any eight day period. The last clock is reset if the driver is off-duty for 34 successive hours during any stretch (known as the “34 hour reset”). If we add these three variables, our attribute vector grows to:
a_t = (a_1, a_2, a_3, a_4), where

a_1 = The location of the driver.
a_2 = The number of hours a driver has been behind the wheel during his current shift.
a_3 = The number of hours a driver has been on-duty during his current shift.
a_4 = An eight element vector giving the number of hours the driver was on duty over each of the previous eight days.
We emphasize that element a4 is actually a vector that holds the number of hours the driver was on duty during each calendar day over the last eight days. The three attributes that capture the DOT rules affect our ability to assign drivers to certain loads. A load may have the attribute that it must be delivered by a certain time. If a driver is about to hit some limit, then he will not be able to drive as many hours as he would otherwise be able to do. We may assign a driver to a load even if we know he cannot make the delivery appointment in time, but we would need to assess a penalty.
A particularly important attribute for a driver is the one that represents his home “domicile.” This would be represented as a geographical location, although it may be represented at a different level of detail than the driver’s location (stored in a_1). We also need to keep track of how many days our driver has been away from home:

a_t = (a_1, a_2, a_3, a_4, a_5, a_6), where

a_1 = The location of the driver.
a_2 = The number of hours a driver has been behind the wheel during his current shift.
a_3 = The number of hours a driver has been on-duty during his current shift.
a_4 = An eight element vector giving the number of hours the driver was on duty over each of the previous eight days.
a_5 = The geographical location giving the driver’s home domicile.
a_6 = The number of days that the driver has been away from home.
It is typically the case that drivers like to be home over a weekend. If we cannot get him home on one weekend, we might want to work on getting him home on the next weekend. This adds a very rich set of behaviors to the management of our driver. We might think that it is best to get our driver close to home, but it does not help to get him close to home in the middle of a week. In addition, there may be a location that is fairly far away, but that is known for generating a large number of loads that would take a driver near or to his home.
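To make the attribute vector concrete, the sketch below (not from the text) codes a_1 through a_6 as a small Python record; the field names, units, and sample values are assumptions made purely for illustration.

from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class DriverAttributes:
    location: str                         # a1: one of the 48 contiguous states
    hours_driving: float                  # a2: hours behind the wheel this shift
    hours_on_duty: float                  # a3: hours on duty this shift
    duty_last_8_days: Tuple[float, ...]   # a4: hours on duty, each of the last 8 days
    domicile: str                         # a5: home domicile
    days_from_home: int                   # a6: days away from home

a = DriverAttributes("NJ", 3.0, 5.5, (10, 8, 0, 11, 9, 7, 0, 6), "PA", 4)
print(sum(a.duty_last_8_days))            # hours worked over the eight-day window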
3.4.3  The state of the system
We can now use two different methods for describing the state of our driver. The first is his attribute vector a at time t. If our only interest was optimizing the behavior of a single driver, we would probably let St be the state of the driver, although this state would be nothing more than his vector of attributes. Alternatively, we can use our resource vector notation, which allows us to scale to problems with multiple drivers:
R_{ta}^D = 1 if our trucker has attribute a, and 0 otherwise.
R_t^D = (R_{ta}^D)_{a∈A}

In the same way, we can represent the state of all the loads to be moved using:

R_{tb}^L = The number of loads with attribute b.
R_t^L = (R_{tb}^L)_{b∈B}

Finally, our complete resource vector is given by:

R_t = (R_t^D, R_t^L)
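For the basic (location-only) model, a minimal Python sketch of this resource vector might look like the following; the locations and load counts are invented for the example.

from collections import Counter

# R^D: how many trucks (here one) have each attribute vector a = (location,).
# R^L: how many loads have each attribute vector b = (origin, destination).
R_D = Counter({("NJ",): 1})
R_L = Counter({("NJ", "TX"): 2, ("NJ", "OH"): 1})

R_t = (R_D, R_L)      # the complete resource vector R_t = (R_t^D, R_t^L)
print(R_t)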
3.5  The exogenous information process
An important dimension of many of the problems that we address is the arrival of exogenous information, which changes the state of our system. While there are many important deterministic dynamic programs, exogenous information processes represent an important dimension in many problems in asset management.
3.5.1  Basic notation for information processes
We have already seen one example of an exogenous information process in the variable R̂_t. We can use this notation to represent customer demands, new equipment arriving at the company or new drivers being hired (as long as these are not decisions that we are modeling). There are, however, other forms of exogenous information: interest rates, prices, travel times, costs, equipment breakdowns, people quitting, and so on. To write down a complete model, we would need to introduce notation for each class of exogenous information. It is standard to let:

ω = A realization of all the information arriving over all time periods,
  = (ω_1, ω_2, . . . , ω_t, . . .), where:
ω_t = The information arriving during time interval t.
Ω = The set of all possible sample realizations (with ω ∈ Ω).

ω is an actual sample realization of information from the set Ω. ω is sometimes referred to as a sample path or a “scenario.” It is important to recognize, however, that ω is not a random variable. It is not meaningful, for example, to take the expected value of a function f(ω), since ω is viewed as a single, deterministic realization of the information. The mathematical representation of information that is not yet known requires the introduction of a function. Surprisingly, while ω_t is fairly standard notation for a sample realization, there is no standard notation for a generic random variable to represent unknown information. Different authors use I_t, ξ_t, ω̂_t or, fairly commonly, ω_t itself. Students need to understand the different subcommunities that work in this field. Probabilists will object to writing E[f(ω)] (the expectation of f(ω)), while others will claim that this is perfectly clear.

Whenever we have a random variable (for example, D̂ or p̂) we refer to a sample realization of the random variable by D̂(ω) or p̂(ω). It is important to recognize that D̂ and p̂ are random variables, whereas D̂(ω) and p̂(ω) are numbers. When we need a generic random variable, we suggest using:

W_t = The exogenous information becoming available during interval t.

The choice of notation W_t as a generic “information function” is not standard, but it is mnemonic (it looks like ω_t). We would then write ω_t = W_t(ω) as a sample realization. This
notation adds a certain elegance when we need to write decision functions and information in the same equation. Generic information variables, such as W_t, should be used to simplify notation, which means they are useful when there are different forms of exogenous information. If there is only one form of exogenous information (say, R̂_t), then that variable should be used as the information variable. We also need to refer to the history of our process, for which we define:

H_t = The history of the process, consisting of all the information known through time t,
    = (W_1, W_2, . . . , W_t).
ℋ_t = The set of all possible histories through time t,
    = {H_t(ω) | ω ∈ Ω}.
h_t = A sample realization of a history,
    = H_t(ω).

We sometimes need to refer to the subset of Ω that corresponds to a particular history. The following is useful for this purpose:

Ω_t(h_t) = {ω | (W_1(ω), W_2(ω), . . . , W_t(ω)) = h_t, ω ∈ Ω}        (3.4)
3.5.2  Models of information processes
Information processes come in varying degrees of complexity. Needless to say, the structure of the information process plays a major role in the models and algorithms used to solve the problem. Below, we describe information processes in increasing levels of complexity.

Processes with independent increments

A large number of problems in asset management can be characterized by what are known as processes with independent increments. What this means is that the change in the process is independent of the history of the process, as illustrated in the examples. The practical challenge we typically face in these applications is that we do not know the parameters of the system. In our price process, the price may be trending upward or downward, as determined by the parameter µ. In our customer arrival process, we need to know the rate λ (which can also be a function of time).
Example 3.4: A publicly traded index fund has a price process that can be described (in discrete time) as pt+1 = pt + σδ, where δ is normally distributed with mean µ, variance 1, and σ is the standard deviation of the change over the length of the time interval. Example 3.5: Requests for credit card confirmations arrive according to a Poisson process with rate λ. This means that the number of arrivals during a period of length ∆t is given by a Poisson distribution with mean λ∆t, which is independent of the history of the system.
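The two examples above are easy to simulate. The following Python sketch (not from the text) generates a sample path of the price process of Example 3.4 and a Poisson arrival count as in Example 3.5; the numerical parameters are arbitrary.

import random

def simulate_price_path(p0, mu, sigma, T, rng):
    # p_{t+1} = p_t + sigma * delta, where delta ~ Normal(mu, 1); the increment
    # does not depend on the history of the process.
    path = [p0]
    for _ in range(T):
        path.append(path[-1] + sigma * rng.gauss(mu, 1.0))
    return path

def poisson_arrivals(lam, dt, rng):
    # Count arrivals in an interval of length dt by accumulating exponential
    # interarrival times with rate lam.
    t, n = 0.0, 0
    while True:
        t += rng.expovariate(lam)
        if t > dt:
            return n
        n += 1

rng = random.Random(0)
print(simulate_price_path(p0=100.0, mu=0.1, sigma=2.0, T=5, rng=rng))
print(poisson_arrivals(lam=3.0, dt=2.0, rng=rng))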
State-dependent information processes

The standard dynamic programming models require that the distribution of the process moving forward be a function of the state of the system. This is a more general model than one with independent increments. Interestingly, many models of Markov decision processes use information processes that do, in fact, exhibit independent increments. For example, we may have a queueing problem where the state of the system is the number of customers in the queue. The number of arrivals may be Poisson, and the number of customers served in an increment of time is determined primarily by the length of the queue. It is possible, however, that our arrival process is a function of the length of the queue itself (see the examples for illustrations).

Example 3.6: Customers arrive at an automated teller machine according to a Poisson process, but as the line grows longer, an increasing proportion decline to join the queue (a property known as balking in the queueing literature). The apparent arrival rate at the queue is a process that depends on the length of the queue.

Example 3.7: A market with limited information may respond to price changes. If the price drops over the course of a day, the market may interpret the change as a downward movement, increasing sales and putting further downward pressure on the price. Conversely, upward movement may be interpreted as a signal that people are buying the stock, encouraging more buying behavior.
State-dependent information processes are more difficult to model and introduce additional parameters that must be estimated. However, from the perspective of dynamic programming, they do not introduce any fundamental complexities. As long as the distribution of outcomes is dependent purely on the state of the system, we can apply our standard models. It is also possible that the information arriving to the system depends on its state, as depicted in the next set of examples. This is a different form of state-dependent information process. Normally, an outcome ω is assumed to represent all information available to the system. A probabilist would
Example 3.8: A driver is planning a path over a transportation network. When the driver arrives at intersection i of the network, he is able to determine the transit times of each of the segments (i, j) emanating from i. Thus, the transit times that arrive to the system depend on the path taken by the driver.
insist that this is still the case with our driver; the fact that the driver does not know the transit times on all the links is simply a matter of modeling the information the driver uses. However, most engineering students will find it more natural to think of the information as depending on the state.

More complex information processes

Now consider the problem of modeling currency exchange rates. The change in the exchange rate between one pair of currencies is usually followed quickly by changes in others. If the Japanese yen rises relative to the U.S. dollar, it is likely that the Euro will also rise relative to it, although not necessarily proportionally. As a result, we have a vector of information processes that are correlated. In addition to correlations between information processes, we can also have correlations over time. An upward push in the exchange rate between two currencies in one day is likely to be followed by similar changes for several days while the market responds to new information. Sometimes the changes reflect long term problems in a country’s economy. Such processes may be modeled using advanced statistical models which capture correlations between processes as well as over time.

An information model can be thought of as a probability density function φ_t(ω_t) that gives the density (we would say the probability of ω if it were discrete) of an outcome ω_t in time t. If the problem has independent increments, we would write the density simply as φ_t(ω_t). If the information process is Markovian (dependent on a state variable), then we would write it as φ_t(ω_t | S_{t−1}). If the state variable requires information from history (for example, our “state variable” is the history (W_{t−1}, W_{t−2}, . . . , W_{t−T})), then we have a “history dependent” model.

In some cases with complex information models, it is possible to proceed without any model at all. Instead, we can use realizations drawn from history. For example, we may take samples of changes in exchange rates from different periods in history and assume that these are representative of changes that may happen in the future. The value of using samples from history is that they capture all of the properties of the real system. This is an example of planning a system without a model of an information process.
3.6  The states of our system
We have now established the notation we need to talk about the most important quantity in a dynamic program: its state. The state variable is perhaps the most critical piece of modeling that we will encounter when solving a dynamic program. Surprisingly, other treatments of dynamic programming spend little time defining a state variable. Bellman’s seminal text [Bellman (1957), p. 81] says “... we have a physical system characterized at any stage by a small set of parameters, the state variables.” In a much more modern treatment, Puterman first introduces a state variable by saying [Puterman (1994), p. 18] “At each decision epoch, the system occupies a state.” In both cases, the italics are in the original manuscript, indicating that the term “state” is being introduced. In effect, both authors are saying that given a system, the state variable will be apparent from the context. In this section, we take this topic a bit further. We open our discussion with a presentation on three perspectives of a state variable, after which we discuss where the state variable is measured.
3.6.1  The three states of our system
Interestingly, standard dynamic programming texts do not offer a formal definition of a state, presuming instead that it is readily apparent from an application. For many problems, this is true. Our interest, however, is in more complex problems (which require computational solutions), and in these settings, state variables become more subtle. To set up our discussion, assume that we are interested in solving a relatively complex asset management problem, one that involves multiple (possibly many) different types of assets which can be modified in various ways (changing their attributes). For such a problem, it is necessary to work with three types of states:

The state of a single resource - As a resource evolves, the state of a resource is captured by its attribute vector a.

The resource state vector - This is the state of all the different types of resources at the same time, given by R_t.

The information state - This captures what we know at time t, which includes R_t along with estimates of parameters such as prices, times and costs, or the parameters of a function for forecasting demands (or prices or times).

We have already introduced the attribute vector a for the state of an asset. Consider the problem of routing a single asset (such as an aircraft, a locomotive, a pilot, or a truck driver) over time (possibly, but not necessarily, in the presence of uncertainty). We could let a_t be the attribute vector of the asset at time t. In this setting, a_t is the state of our asset. If we have more than one asset, then R_t becomes the joint state of all our assets at time t. The dimensionality of R_t is potentially equal to the dimensionality of A. If our asset has
no attributes (for example, we are only interested in acquiring and selling a single type of asset), then |A| = 1. In some problems |A| can be quite large, which means that R_t can be a very high dimensional vector. It is common in some subcommunities to use S_t as a “state variable.” We suggest using S_t as a generic state variable when it is not important to be specific, and in particular when we may wish to include other forms of information. Typically, the other information represents what we know about various parameters of the system (costs, speeds, times, prices). To represent this, let:

θ̄_t = A vector of estimates of different problem parameters at time t.
θ̂_t = New information about problem parameters that arrives during time interval t.

We can think of θ̄_t as the state of our information about different problem parameters at time t. We can now write a more general form of our state variable as:

S_t = Our information state at time t
    = (R_t, θ̄_t).
Remark: In one of his earliest papers, Bellman struggled with the challenge of representing both the resource state and other types of information, which he did using the notation (x, I) where x was his resource state variable and I represented other types of information. The need to differentiate between the resource state and “other information” indicates the equivalence in his mind (and those of many authors to follow) between the “state of the system” and the resource state. The most notable exception to this view is the study of information acquisition problems. The best-known examples of this problem class are the bandit problems discussed in chapter 10, where the state is an estimate of a parameter. It is important to have a formal definition of a state variable. For this purpose, we offer:
Definition 3.6.1 A state variable is the minimally dimensioned function of history that is necessary and sufficient to model all future dynamics of the system. We use the term “minimally dimensioned function” so that our state variable is as compact as possible. For example, we could argue that the history ht is the information we need to model future dynamics (if we know ht , then we have all the information that has come to us during the simulation). But this is not practical. As we start doing computational work, we are going to want St to be as compact as possible. Furthermore, there are many problems where we simply do not need to know the entire history. It might be enough to know the status of all our assets at time t (the resource variable Rt ). But there are examples where this is not enough.
Assume, for example, that we need to use our history to forecast the price of a stock. Our history of prices is given by (p̂_1, p̂_2, . . . , p̂_t). If we use a simple exponential smoothing model, our estimate of the mean price p̄_t can be computed using:

p̄_t = (1 − α) p̄_{t−1} + α p̂_t

where α is a stepsize satisfying 0 ≤ α ≤ 1. With this forecasting mechanism, we do not need to retain the history of prices, but rather only the latest estimate p̄_t. As a result, p̄_t is called a sufficient statistic, which is a statistic that captures all relevant information needed to compute any additional statistics from new information. A state variable, according to our definition, is always a sufficient statistic. Consider what happens when we switch from exponential smoothing to an N-period moving average. Our forecast of future prices is now given by:

p̄_t = (1/N) Σ_{τ=0}^{N−1} p̂_{t−τ}

Now, we have to retain the N-period rolling set of prices (p̂_t, p̂_{t−1}, . . . , p̂_{t−N+1}) in order to compute the price estimate in the next time period. With exponential smoothing, we could write:

S_t = p̄_t

If we use the moving average, our state variable would be:

S_t = (p̂_t, p̂_{t−1}, . . . , p̂_{t−N+1})        (3.5)
Students need to realize that many authors say that if we use the moving average model, we no longer have a proper state variable. Rather, we would have an example of a “historydependent process” where the state variable needs to be augmented with history. Using our definition of a state variable, the concept of a history-dependent process has no meaning. The state variable is simply the minimal information required to capture what is needed to model future dynamics. State variables differ only in their dimensionality. Needless to say, having to explicitly retain history, as we did with the moving average model, produces a much larger state variable than the exponential smoothing model. The state variable is critical to the success of dynamic programming as a practical, computational tool. The higher the dimensionality of St , the more parameters we are going to have to estimate. One problem characteristic that can have a major impact on the design of a state variable is the probabilistic structure of the information process. The simplest process occurs when the information variables W1 , W2 , . . . , Wt are independent (the process may be nonstationary, but independence is really nice) or is conditionally independent given the
state variable. For example, we might decide that the price of an asset fluctuates randomly up or down from the previous period’s price. This means that the probability distribution of pˆt+1 depends only on pˆt . When our random information is customer demands, we often find that we can assume independence. When the information process is interest rates, on the other hand, it can easily be the case that the process is characterized by a fairly complex underlying structure. In this case, we may find that we get good results by assuming that the price in t + 1 depends on only a few periods of history.
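The contrast between the two forecasting rules is easy to see in code. The sketch below (not part of the text) carries the one-dimensional exponential-smoothing state alongside the N-dimensional moving-average state; the stepsize, window length, and prices are arbitrary choices.

from collections import deque

def update_smoothed_state(p_bar, p_hat, alpha=0.1):
    # S_t = p_bar_t: a single number is a sufficient statistic.
    return (1 - alpha) * p_bar + alpha * p_hat

def update_moving_average_state(window, p_hat, N=3):
    # S_t = (p_hat_t, ..., p_hat_{t-N+1}): we must carry the last N prices.
    w = deque(window, maxlen=N)
    w.append(p_hat)
    return tuple(w)

S_exp, S_ma = 100.0, (100.0,)
for p_hat in (101.0, 99.5, 102.0, 103.0):
    S_exp = update_smoothed_state(S_exp, p_hat)
    S_ma = update_moving_average_state(S_ma, p_hat)

print(S_exp)                          # one-dimensional state
print(S_ma, sum(S_ma) / len(S_ma))    # N-dimensional state and its forecast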
3.6.2  Pre- and post-decision state variables
We can view our system as evolving through sequences of new information followed by a decision followed by new information (and so on). Although we have not yet discussed decisions, for the moment let the decisions (which will often be vectors) be represented generically using x_t (we discuss our choice of notation for a decision in the next section). In this case, a history of the process might be represented using:

h_t = (S_0, x_0, W_1, x_1, W_2, x_2, . . . , x_{t−1}, W_t)

h_t contains all the information we need to make a decision x_t at time t. As we discussed before, h_t is sufficient but not necessary. We expect our state variable to capture what is needed to make a decision, allowing us to represent the history as:

h_t = (S_0, x_0, W_1, S_1, x_1, W_2, S_2, x_2, . . . , x_{t−1}, W_t, S_t)        (3.6)

The sequence in equation (3.6) defines our state variable as occurring after new information arrives and before a decision is made. For this reason, we call S_t the pre-decision state variable. This is the most natural place to write a state variable because the point of capturing information from the past is to make a decision. For most problem classes, we can design more effective computational strategies using the post-decision state variable. This is the state of the system after a decision x_t. For this reason, we denote this state variable S_t^x, which produces the history:

h_t = (S_0, x_0, W_1, S_1, x_1, S_1^x, W_2, S_2, x_2, S_2^x, . . . , x_{t−1}, S_{t−1}^x, W_t, S_t)        (3.7)
We again emphasize that our notation S_t^x means that this function has access to all the exogenous information up through time t, along with the decision x_t (which also has access to the information up through time t). Interestingly, virtually every text on stochastic, dynamic programming assumes that the state variable is the pre-decision state variable. The optimality recursion relates the pre-decision state S_{t+1} to S_t, requiring that we model both the decision x_t that is made after observing S_t as well as the information W_{t+1} that arrives during time interval t + 1.
Figure 3.2: Decision tree showing decision nodes (pre-decision state variable) and outcome nodes (post-decision state variable).

By contrast, it has always been the tradition in the decision-theory literature to model both the pre- and post-decision states when representing decision trees. Figure 3.2 shows the decision tree for a classic problem from the decision-theory literature which addresses the problem of whether we should collect information about the weather to determine if we should hold a Little League game. Squares represent nodes where decisions have to be made: Should we check the weather report? Should we schedule the game? Circles represent outcome nodes: What does the weather report say? What will the weather actually be? Decision nodes represent the pre-decision state of the system. Outcome nodes are the state of the system just before new information arrives, which is the same as immediately after a decision is made. Unless otherwise specified (not just in this volume, but throughout the dynamic programming/control communities), a state variable is the pre-decision state variable. There are specific problem classes in asset management that really need the post-decision state variable. In this case, it is more convenient notationally to simply define the state variable as the post-decision state variable, which allows us to drop the “x” superscript. The examples provide a series of illustrations. As we progress, we will see that the choice of using pre- versus post-decision state variables is governed by problem complexity and our ability to accurately approximate the future without sacrificing computational tractability. The vast majority of the dynamic programming literature uses the pre-decision state variable. Most authors do not even distinguish
Example 3.9: If we are selling an asset, the pre-decision state variable can be written as S_t = (R_t, p_t) where R_t = 1 if we are holding the asset and 0 otherwise, while p_t is the price that we can sell the asset at if we are still holding it. The post-decision variable S_t^x = R_t^x simply captures whether we are still holding the asset or not.

Example 3.10: The nomadic trucker revisited. Let R_ta = 1 if the trucker has attribute a at time t and 0 otherwise. Now let L_tb be the number of loads of type b available to be moved at time t. The pre-decision state variable for the trucker is S_t = (R_t, L_t), which tells us the state of the trucker and the loads available to be moved. Assume that once the trucker makes a decision, all the loads in L_t are lost, and new loads become available at time t + 1. The post-decision state variable is given by S_t^x = R_t^x, where R_{ta}^x = 1 if the trucker has attribute a after a decision has been made.

Example 3.11: Imagine playing backgammon where R_ti is the number of your pieces on the ith “point” on the backgammon board (there are 24 points on a board). Let d be the decision to move from one point to another, where the set of potential decisions D_t depends on the roll of the dice during the tth play. x_td is the number of pieces the player moves from one point to another, where x_t = (x_td)_{d∈D_t}. The state of our board when we make a decision is given by S_t = (R_t, D_t). The transition from S_t to S_{t+1} depends on the player’s decision x_t, the play of the opposing player, and the next roll of the dice. The post-decision state variable is simply R_t^x, which is the state of the board after a player moves.
between whether they are using a pre- or post-decision state variable. Students can easily identify which is being used: if the expectation is within the max or min operator, then the formulation is using the pre-decision state vector.
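A small Python sketch (not from the text) of Example 3.9 may help fix the distinction; the price value and the sell decision below are invented for illustration.

def pre_decision_state(R, p):
    # S_t = (R_t, p_t): whether we still hold the asset, and the observed price.
    return (R, p)

def post_decision_state(S, sell):
    # S_t^x = R_t^x: after the decision, all we carry forward is whether we
    # still hold the asset; the current price is no longer part of the state.
    R, p = S
    return 0 if (R == 1 and sell) else R

S = pre_decision_state(R=1, p=61.0)
print(S, post_decision_state(S, sell=False))   # (1, 61.0) 1
print(S, post_decision_state(S, sell=True))    # (1, 61.0) 0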
3.6.3  Partially observable states
There is a subfield of dynamic programming that is referred to as partially observable Markov decision processes where we cannot measure the state exactly, as illustrated in the examples. Example 3.12: A retailer may have to order inventory without being able to measure the precise current inventory. It is possible to measure sales, but theft and breakage introduce errors. Example 3.13: A transportation company needs to dispatch a fleet of trucks, but does not know the precise location or maintenance status of each truck. Example 3.14: The military has to make decisions about sending out aircraft to remove important military targets that may have been damaged in previous raids. These decisions typically have to be made without knowing the precise state of the targets.
Markov decision processes with partially observable states provide a nice framework for modeling systems that cannot be measured precisely. This is an important subfield of Markov decision processes, but is outside the scope of our presentation. It is tempting to confuse a post-decision state variable with a pre-decision state variable that can only be measured imperfectly. This ignores the fact that we can measure the post-decision state variable perfectly, and we can formulate a version of the optimality equations that determine the value function. In addition, post-decision state variables are often simpler than pre-decision state variables.
3.7  Modeling decisions
Fundamental to dynamic programs is the characteristic that we are making decisions over time. For stochastic problems, we have to model the sequencing of decisions and information, but there are many uses of dynamic programming that address deterministic problems. In this case, we use dynamic programming because it offers specific structural advantages, such as our budgeting problem in chapter 1. But the concept of sequencing decisions over time is fundamental to a dynamic program. It is important to model decisions properly so we can scale to high dimensional problems. This requires that we start with a good fundamental model of decisions for asset management problems. Our choices are nonstandard for the dynamic programming community, but very compatible with the math programming community.
3.7.1  Decisions, actions, and controls
A survey of the literature reveals a distressing variety of words used to mean “decisions.” The classical literature on Markov decision processes talks about choosing an action a ∈ A (or a ∈ A_s where A_s is the set of actions available when we are in state s). The optimal control community works to choose a control u ∈ U_x when the system is in state x. The math programming community wants to choose a decision represented by the vector x, while the Markov decision process community wants to choose a policy and the simulation community wants to apply a rule. Our interest is primarily in solving large-scale asset allocation problems, and for this purpose we must draw on the skills of the math programming community where decisions are typically vectors represented by x. Most of our examples focus on decisions that act on assets (buying them, selling them, or managing them within the system). For this reason,
we define:

d = A type of decision that acts on an asset (or asset type) in some way (buying, selling, or managing).
D_a = The set of potential types of decisions that can be used to act on a resource with attribute a.
x_tad = The quantity of resources with attribute a acted on with decision d at time t.
x_t = (x_tad)_{a∈A, d∈D_a}.
X_t = Set of acceptable decisions given the information available at time t.

If d is a decision to purchase an asset, then x_d is the quantity of assets being purchased. If we are moving transportation assets from one location i to another location j, then d would represent the decision to move from i to j, and x_tad would be the flow of resources. Earlier, we observed that the attribute of a single asset a might have as many as 10 or 20 dimensions, but we would never expect the attribute vector to have 100 or more dimensions (problems involving financial assets, for example, might require 0, 1 or 2 dimensions). Similarly, the set of decision types, D_a, might be on the order of 100 or 1000 for the most complex problems, but we simply would never expect sets with, say, 10^10 types of decisions (that can be used to act on a single asset class). Note that there is a vast array of problems where the size of D_a is less than 10.

Example 3.15: Assume that you are holding an asset, where you will receive a price p̂_t if you sell at time t. Here there is only one type of decision (whether or not to sell). We can represent the decision as x_t = 1 if we sell, and x_t = 0 if we hold.

Example 3.16: A taxi waiting at location i can serve a customer if one arrives, sit and do nothing or reposition to another location (without a customer) where the chances of finding one seem better. We can let D^M be the set of locations the cab can move to (without a customer), where the decision d = i represents the decision to hold (at location i). Then let d^s be the decision to serve a customer, although this decision can only be made if there is a customer to be served. The complete set of decisions is D = {d^s} ∪ D^M. x_d = 1 if we choose to take decision d.
It is significant that we are representing a decision d as acting on a single asset or asset type. The field of Markov decision processes represents a decision as an action (typically denoted a), but the concept of an action in this setting is equivalent to our vector x. Actions are typically represented as being discrete, whereas our decision vector x can be discrete or continuous. We do, however, restrict our attention to cases where the set D is discrete and finite. In some problem classes, we manage a discrete asset which might be someone playing a game, the routing of a single car through traffic, or the control of a single elevator moving up
and down in a building. In this case, at any point in time we face the problem of choosing a single decision d ∈ D. Using our x notation, we would represent this using x_{tad̂} = 1 if we choose decision d̂ and x_tad = 0 for d ≠ d̂. Alternatively, we could simply drop our “x” notation and simply let d_t be the decision we chose at time t. While recognizing that the “d” notation is perhaps more natural and elegant if we face simply the scalar problem of choosing a single decision, it greatly complicates our transition from simple, scalar problems to the complex, high-dimensional problems. Our notation represents a difficult choice between the vocabularies of the math programming community (which dominates the field of high-dimensional asset management problems) and the control community (which dominates the field of approximate dynamic programming). Since some in the control community use a for action instead of u for control, primarily to exploit results in the field of Markov decision processes, it made more sense to choose notation that was natural and mnemonic (and did not conflict with our critical notation a for the attribute of an asset).
3.7.2  The nomadic trucker revisited
We return to our nomadic trucker example to review the decisions for this application. There are two classes of decisions the trucker may choose from:

D_b^l = The decision to move a load with attribute vector b ∈ B.
D^l = (D_b^l)_{b∈B}
D_i^e = The decision to move empty to location i ∈ I.
D^e = (D_i^e)_{i∈I}
D = D^l ∪ D^e
The trucker may move “empty” to the same state that he is located in, which represents the decision to do nothing. The set D^l requires a little explanation. Recall that b is the attribute vector of a load to be moved. An element of D^l represents the decision to move a type of load. Other decision classes that could be modeled include buying and selling trucks, repairing them, or reconfiguring them (for example, adding refrigeration units so a trailer can carry perishable commodities).
3.7.3  Decision epochs
Most of our presentation in this book adopts a discrete time format. Information arrives during time interval t (between t − 1 and t), and decisions are made at time t. We typically view the times t = (1, 2, . . .) as equally spaced points in time. But there are settings where decisions are determined by exogenous events. We may have to decide whether to admit a patient to a hospital for elective surgery. The decision has to be made when the patient calls
in. We may have to decide when to sell a thinly traded stock. Such a decision is naturally made when the stock changes price. The points in time when a decision has to be made (even if the decision is to do nothing) are referred to as decision epochs. Decision epochs may occur at evenly spaced points in time or may be determined by exogenous information events. If they are determined by information events, we might define a set E with element e ∈ E. Now let t_e be the time that information event e occurs (for example, the eth phone call). Instead of indexing time by t = (1, 2, . . .), we may index time by (t_1, t_2, . . . , t_e, . . .).
3.7.4  Policies
When we are solving deterministic problems, our interest is in finding a set of decisions x_t over time. When we are solving stochastic problems (problems with dynamic information processes), the decision x_t for t ≥ 1 is a random variable. This happens because we do not know (at time t = 0) the state of our system S_t at time t. How do we make a decision if we do not know the state of the system? The answer is that instead of finding the best decision, we are going to focus on finding the best rule for making a decision given the information available at the time. This rule is commonly known as a policy:

Definition 3.7.1 A policy is a rule that determines a decision given the available information.

This definition implies that our policy produces a decision deterministically; that is, given a state S_t, it produces a single action x. There are, however, instances where S_t does not contain all the information needed to make a decision (for example, our post-decision state variable S_t^x). In addition, there are special situations (arising in the context of two-player games) where there is value in choosing a decision somewhat randomly. For our computational algorithms, there will be many instances when we want to choose what appears to be a non-optimal decision for the purpose of collecting more information.

Policies come in many forms (each with their own notation). Perhaps the most common form in introductory treatments of dynamic programming is to assume that the policy is of the “table lookup” variety. That is, given a discrete state S_t, our policy can be viewed as a simple rule of the form “if we are in state S_t we should make decision x_t.” Although different authors use different notation, it is most common to represent such a rule as a policy π ∈ Π where Π is our set of possible policies (rules) from which we have to choose. In this version, the set Π is viewed as consisting of a set of discrete policies which are typically finite. For high dimensional problems, we virtually never use policies of the table-lookup variety. Instead, these are functions that must be solved to produce a decision. For this reason, we use the notation:
X_t^π = A function returning a decision vector x, where (X_t^π)_{π∈Π} is the family of functions from which we have to choose.

Often, a policy is determined by choosing a particular function for making a decision and then tuning the parameters of the function (which could easily be continuous variables). In this setting, the set of potential policies is infinite. To illustrate, consider our budget problem from chapter 1. There, we were making decisions by solving problems of the form:

X_t^π(R_t) = arg max_{0 ≤ x_t ≤ R_t} [ C_t(x_t) + V̄_{t+1}^π(R_{t+1}) ]        (3.8)

where “arg max” means “find the value of x_t (the argument) that maximizes the expression that follows.” Here, V̄^π might be a particular value function (think of it as an approximation of the real value function). This type of approximation means that we have to estimate V̄_{t+1}(R_{t+1}) for each possible (discrete) value of R_{t+1}. If R_{t+1} is a vector (as would arise if we are managing different types of assets), this strategy means that we may have to estimate a very large number of parameters. We can simplify this problem by replacing our discrete value function with a linear approximation v̄_{t+1} R_{t+1}. Now we wish to solve:

X_t^π(R_t) = arg max_{0 ≤ x_t ≤ R_t} [ C_t(x_t) + v̄_{t+1} R_{t+1} ]        (3.9)

Our policy now consists of choosing a single parameter v̄_{t+1}. This example illustrates two policies: one that requires us to specify the value of being in each state (at least approximately) and one that only requires us to come up with a single slope. These are two classes of policies, which we might denote by Π^discrete and Π^linear. Each class contains an infinite set of parameters over which we have to search.
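The policy in equation (3.9) is straightforward to code. The sketch below (not from the text) enumerates a scalar, integer decision; the contribution function, the slope v̄_{t+1}, and the assumption that R_{t+1} = R_t − x_t are illustrative choices, not part of the model above.

import math

def linear_vfa_policy(R_t, contribution, v_bar_next):
    # X^pi(R_t) = argmax_{0 <= x <= R_t} [ C_t(x) + v_bar_{t+1} * R_{t+1} ],
    # where we assume R_{t+1} = R_t - x (spend x now, hold the rest).
    best_x, best_val = 0, float("-inf")
    for x in range(R_t + 1):                 # enumerate feasible integer decisions
        val = contribution(x) + v_bar_next * (R_t - x)
        if val > best_val:
            best_x, best_val = x, val
    return best_x

# Concave reward for spending x units now, and a slope on resources held over.
print(linear_vfa_policy(R_t=10, contribution=lambda x: 5 * math.sqrt(x), v_bar_next=1.2))  # 4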
3.7.5  Randomized policies
Assume you need to buy an asset at an auction, and you do not have the time to attend the auction yourself. Your problem is to decide which of your two assistants to send. Assistant A is young and aggressive, and is more likely to bid a higher price (but may also scare off other bidders). Assistant B is more tentative and conservative, and might drop out if he thinks the bidding is heading too high. This is an example of a randomized policy. We are not directly making a decision of what to bid, but we are making a decision that will influence the probability distribution of whether a bid will be made. In section 4.5.4, we show that given a choice between a deterministic policy and a randomized policy, the deterministic policy will always be at least as good as a randomized policy. But there are situations where we may not have a choice. In addition, there are
situations involving two-player games where a deterministic policy allows the other player to predict your response and obtain a better result.
3.8  Information processes, revisited
The introduction of decisions and policies requires that we revisit our model of the information process. We are going to want to compute quantities such as expected profits, but we cannot find an expectation using only the probability of different outcomes of the exogenous information. We also have to know something about how decisions are generated.
3.8.1  Combining states and decisions
With our vocabulary for policies in hand, we need to take a fresh look at our information process. The sequence of information (ω_1, ω_2, . . . , ω_t) is assumed to be driven by some sort of exogenous process. However, we are generally interested in quantities that are functions of both exogenous information as well as the decisions. It is useful to think of decisions as endogenous information. But where do the decisions come from? We now see that decisions come from policies. In fact, it is useful to represent our sequence of information and decisions as:

H_t^π = (S_0, X_0^π, W_1, S_1, X_1^π, W_2, S_2, X_2^π, . . . , X_{t−1}^π, W_t, S_t)        (3.10)

Now our history is characterized by a family of functions: the information variables W_t, the decision functions (policies) X_t^π, and the state variables S_t. We see that to characterize a particular history h_t, we have to specify both the sample outcome ω as well as the policy π. Thus, we might write a sample realization as:

h_t^π = H_t^π(ω)

We can think of a complete history H_∞^π(ω) as an outcome in an expanded probability space (if we have a finite horizon, we would denote this by H_T^π(ω)). Let:

ω^π = H_∞^π(ω)

be an outcome in our expanded space, where ω^π is determined by ω and the policy π. Let Ω^π be the set of all outcomes of this expanded space. The probability of an outcome in Ω^π obviously depends on the policy we are following. Thus, computing expectations (for example, expected costs or rewards) requires knowing the policy as well as the set of exogenous outcomes. For this reason, if we are interested, say, in the expected costs during time period t, some authors will write E_t^π{C_t(S_t, x_t)} to express the dependence of the expectation on the policy. However, even if we do not explicitly index the policy, it is important to understand that we need to know how we are making decisions if we are going to compute expectations or other quantities.
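In practice, such expectations are usually estimated by simulation. The following Python sketch (not from the text) rolls a policy forward along sampled exogenous information and averages the realized contribution over many sample paths; the inventory-style transition, contribution, and policy below are invented purely to make the sketch runnable.

import random

def simulate_path(policy, S0, transition, contribution, T, rng):
    # Generate one outcome omega^pi = H_T^pi(omega): the realized history
    # depends on the exogenous information and on the policy being followed.
    S, total = S0, 0.0
    for _ in range(T):
        x = policy(S)
        W = abs(rng.gauss(0.0, 1.0)) * 5.0       # exogenous demand W_{t+1}
        total += contribution(S, x)
        S = transition(S, x, W)
    return total

def estimate_expected_contribution(policy, S0, transition, contribution, T, n=1000):
    rng = random.Random(42)
    samples = [simulate_path(policy, S0, transition, contribution, T, rng) for _ in range(n)]
    return sum(samples) / n                      # Monte Carlo estimate of E^pi[...]

policy = lambda S: max(0, 10 - int(S))                   # order up to 10 units
transition = lambda S, x, W: max(0.0, S + x - W)         # serve demand, no backorders
contribution = lambda S, x: -(2.0 * x + 0.5 * S)         # ordering plus holding cost
print(estimate_expected_contribution(policy, 0.0, transition, contribution, T=20))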
3.8.2  Supervisory processes
In many instances, we are trying to control systems that are already controlled by some process, often a human. Now, we have two sets of decisions: X_t^π(S_t) made by our mathematical model and the decisions that are made by human operators. The examples (below) provide an illustration.

Example 3.17: An asset management problem in the printing industry involves the assignment of printing jobs to printing machines. An optimization model may assign a print job to one printing plant, while a knowledgeable planner may insist that the job should be assigned to a different plant. The planner may know that this particular job requires skills that only exist at the other plant.

Example 3.18: A military planner may know that it is best to send a cargo aircraft on a longer path because this will take it near an airbase where tankers can fly up and refuel the plane. Without this information, it may be quite hard for an algorithm to discover this strategy.

Example 3.19: An expert chess player may know that a sequence of steps produces a powerful defensive position.
When we combine machines and people, we actually create two decision processes: what the machine recommends and what the human implements. Since these “supervisory” decisions are exogenous (even though they have access to the machine-generated decision), we might let xˆt be the supervisory decisions (which we assume override those of the machine). One of the opportunities in machine learning is to use the sequence of decisions xˆt to derive patterns to guide the model.
3.9  Modeling system dynamics
We begin our discussion of system dynamics by introducing some general mathematical notation. While useful, this generic notation does not provide much guidance into how specific problems should be modeled. We then describe how to model the dynamics of some simple problems, followed by a more general model for complex assets.
3.9.1  A general model
The dynamics of our system is represented by a function that describes how the state evolves as new information arrives and decisions are made. The dynamics of a system can be
represented in different ways. The easiest is through a simple function that works as follows:

S_{t+1} = S^M(S_t, X_t^π, W_{t+1})        (3.11)
The function S^M(·) goes by different names such as “plant model” (literally, the model of a physical production plant), “plant equation,” “law of motion,” “transfer function,” “system dynamics,” “system model,” “transition law,” and “transition function.” We prefer “transition function” because it is the most descriptive. We choose the notation S^M(·) to reflect that this is the state transition function, which represents a model of the dynamics of the system. Below, we reinforce the “M” superscript with other modeling devices. The arguments of the function follow standard notational conventions in the control literature (state, action, information), but students will find that different authors will follow one of two conventions for modeling time. While equation (3.11) is fairly common, many authors will write the recursion as:

S_{t+1} = S^M(S_t, X_t^π, W_t)        (3.12)
If we use the form in equation (3.12), we would say “the state of the system at the beginning of time interval t + 1 is determined by the state at time t, plus the decision that is made at time t and the information that arrives during time interval t.” In this representation, t indexes when we are using the information. We refer to (3.12) as the actionable representation since it captures when we can act on the information. This representation is always used for deterministic models, and many authors adopt it for stochastic models as well. We prefer the form in equation (3.11) where time t indexes the information content of the variable or function. We refer to this style as the informational representation. In equation (3.11), we have written the function assuming that the function does not depend on time (it does depend on data that depends on time). A common notational error is to write a function, say, f_t(S_t, x_t) as if it depends on time, when in fact the function is stationary, but depends on data that depends on time. If the parameters (or structure) of the function depend on time, then we would use S_t^M(S_t, x_t, W_{t+1}) (or possibly S_{t+1}^M(S_t, x_t, W_{t+1})). If not, the transition function should be written S^M(S_t, x_t, W_{t+1}).

This is a very general way of representing the dynamics of a system. In many problems, the information W_{t+1} arriving during time interval t + 1 depends on the state S_t at the end of time interval t, but is conditionally independent of all prior history given S_t. When this is the case, we say that we have a Markov information process. When the decisions also depend only on the state S_t, then we have a Markov decision process. In this case, we can store the system dynamics in the form of a one-step transition matrix as follows:

p(s'|s, x) = The probability that S_{t+1} = s' given S_t = s and X_t^π = x.
P(x) = Matrix of elements where p(s'|s, x) is the element in row s and column s'.
There is a simple relationship between the transition function and the one-step transition matrix. Let 1_X denote the indicator function, where 1_X = 1 if X is true and 0 otherwise. The one-step transition matrix can be computed using:

p(s'|s, x) = E{ 1_{s' = S^M(S_t, x, W_{t+1})} | S_t = s }
           = Σ_{ω_{t+1} ∈ Ω_{t+1}} P(ω_{t+1}) 1_{s' = S^M(S_t, x, W_{t+1})}        (3.13)
It is common in the field of Markov decision processes to assume that the one-step transition matrix is given as data. Often, it can be quickly derived (for simple problems) using assumptions about the underlying process. For example, consider an asset selling problem with state variable S_t = (R_t, p_t) where:

R_t = 1 if we are still holding the asset, and 0 if the asset has been sold,

and where p_t is the price at time t. We assume the price process is described by:

p_{t+1} = p_t + ε̂_{t+1}

where ε̂_t is a random variable taking the value +1 with probability 0.3, 0 with probability 0.6, and −1 with probability 0.1. Assume the prices are integer and range from 1 to 100. We can number our states from 0 to 100 using:

S = {(0, −), (1, 1), (1, 2), . . . , (1, 100)}

Now assume that we adopt a decision rule for selling of the form:

X^π(R_t, p_t) = Sell asset   if p_t < p̄
              = Hold asset   if p_t ≥ p̄
Assume that p̄ = 60. A portion of the one-step transition matrix for the rows and columns corresponding to the state (0, −) and (1, 58), (1, 59), (1, 60), (1, 61), (1, 62) looks like:

P^60 =
             (0,-)  (1,58)  (1,59)  (1,60)  (1,61)  (1,62)
    (0,-)      1      0       0       0       0       0
    (1,58)     1      0       0       0       0       0
    (1,59)     1      0       0       0       0       0
    (1,60)     0      0      .1      .6      .3       0
    (1,61)     0      0       0      .1      .6      .3
    (1,62)     0      0       0       0      .1      .6
This matrix plays a major role in the theory of Markov decision processes, although its value is more limited in practical applications. By representing the system dynamics as a one-step transition matrix, it is possible to exploit the rich theory surrounding matrices in general and Markov chains in particular. In engineering problems, it is far more natural to develop the transition function first. Given this, it may be possible to compute the one-step transition matrix exactly or estimate it using simulation. The techniques in this book do not, in general, use the one-step transition matrix, but use instead the transition function directly. But formulations based on the transition matrix provide a powerful foundation for proving convergence of both exact and approximate algorithms.
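To illustrate how a transition matrix can be generated from a transition function, here is a Python sketch (not from the text) that enumerates equation (3.13) for the asset-selling example above; the clamping of prices at the boundaries 1 and 100 is an assumption added to make the sketch self-contained.

# Noise distribution for the price increment and the selling threshold.
P_EPS = {+1: 0.3, 0: 0.6, -1: 0.1}
P_BAR = 60

def policy(state):
    R, p = state
    return "sell" if (R == 1 and p < P_BAR) else "hold"

def transition(state, action, eps):
    # S^M((R, p), x, eps): selling (or having already sold) leads to the
    # absorbing state (0, None); otherwise the price moves by eps, clamped to [1, 100].
    R, p = state
    if R == 0 or action == "sell":
        return (0, None)
    return (1, min(100, max(1, p + eps)))

states = [(0, None)] + [(1, p) for p in range(1, 101)]
P = {s: {} for s in states}
for s in states:
    x = policy(s)
    for eps, prob in P_EPS.items():
        s_next = transition(s, x, eps)
        P[s][s_next] = P[s].get(s_next, 0.0) + prob

# Reproduce the row for state (1, 61) shown in the matrix above:
print([round(P[(1, 61)].get((1, p), 0.0), 1) for p in (58, 59, 60, 61, 62)])
# -> [0.0, 0.0, 0.1, 0.6, 0.3]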
3.9.2  System dynamics for simple assets
It is useful to get a feel of the system dynamics by considering some simple applications.

Asset acquisition I - Purchasing assets for immediate use

Let R_t be the quantity of a single asset class we have available at the end of a time period, but before we have acquired new assets (for the following time period). The asset may be money available to spend on an election campaign, or the amount of oil, coal, grain or other commodities available to satisfy a market. Let D̂_t be the demand for the resource that occurs over time interval t, and let x_t be the quantity of the resource that is acquired at time t to be used during time interval t + 1. The transition function would be written:

R_{t+1} = max{0, R_t + x_t − D̂_{t+1}}.
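A direct Python transcription of this transition function (not from the text; the numbers are arbitrary):

def acquisition_transition(R_t, x_t, D_hat_next):
    # R_{t+1} = max(0, R_t + x_t - D_hat_{t+1}): what is left over after we
    # acquire x_t units and then serve the demand that arrives next period.
    return max(0, R_t + x_t - D_hat_next)

print(acquisition_transition(R_t=3, x_t=5, D_hat_next=6))    # 2
print(acquisition_transition(R_t=1, x_t=2, D_hat_next=10))   # 0 (excess demand is lost)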
Asset acquisition II: purchasing futures

Now assume that we are purchasing futures at time t to be exercised at time t'. At the end of time period t, we would let R_{tt'} be the number of futures we are holding that can be
exercised during time period t'. Now assume that we purchase x_{tt'} additional futures to be used during time period t'. Our system dynamics would look like:

R_{t+1,t'} = R_{tt'} + x_{tt'}                                  if t' = t+2, t+3, . . .
           = max{0, R_{t,t+1} + x_{t,t+1} − D̂_{t+1}}            if t' = t+1
In many problems, we can purchase assets on the spot market, which means we are allowed to see the actual demand before we make the decision. This decision would be represented by xt+1,t+1 , which means the amount purchased using the information that arrived during time interval t + 1 to be used during time interval t + 1 (of course, these decisions are usually the most expensive). In this case, the dynamics would be written:
R_{t+1,t'} = R_{tt'} + x_{tt'}                                              if t' = t+2, t+3, . . .
           = max{0, R_{t,t+1} + x_{t,t+1} + x_{t+1,t+1} − D̂_{t+1}}          if t' = t+1
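A minimal Python sketch of the first of the two sets of futures dynamics above (without spot purchases, and not part of the text), storing the resource vector as a dictionary keyed by the exercise time t'; the holdings and demand are invented for the example.

def futures_transition(R_t, x_t, D_hat_next, t):
    # R_{t+1,t'} = R_{t,t'} + x_{t,t'} for t' >= t+2, while contracts
    # exercisable at t' = t+1 are first used to cover the demand D_hat_{t+1}.
    R_next = {}
    for t_prime in sorted(set(R_t) | set(x_t)):
        total = R_t.get(t_prime, 0) + x_t.get(t_prime, 0)
        if t_prime == t + 1:
            total = max(0, total - D_hat_next)
        R_next[t_prime] = total
    return R_next

R_t = {6: 2, 7: 1, 9: 4}      # futures held at t = 5, keyed by exercise date t'
x_t = {7: 3, 8: 2}            # additional futures purchased at t = 5
print(futures_transition(R_t, x_t, D_hat_next=1, t=5))   # {6: 1, 7: 4, 8: 2, 9: 4}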
Planning a path through college

Consider a student trying to satisfy a set of course requirements (for example, number of science courses, language courses, departmentals, and so on). Let R_tc be the number of courses taken that satisfy requirement c at the end of semester t. Let x_tc be the number of courses the student enrolled in at the end of semester t for semester t + 1 to satisfy requirement c. Finally let F̂_tc(x_{t−1}) be the number of courses in which the student received a failing grade during semester t given x_{t−1}. This information depends on x_{t−1} since a student cannot fail a course that she was not enrolled in. The system dynamics would look like:

R_{t+1,c} = R_{t,c} + x_{t,c} − F̂_{t+1,c}
3.9.3  System dynamics for complex assets
We adopt special notation when we are modeling the dynamics for problems with multiple asset classes. This notation is especially useful for complex assets which are represented using the attribute vector a. Whereas above we modeled the dynamics using the system state variable, with complex assets it is more natural to model the dynamics at the level of individual asset classes. Assume we have resources with attribute a, and we act on them with decision d. The result may be a resource with a modified set of attributes a'. In general, the decision will generate a contribution (or a cost) and will require some time to complete. This process is modeled using a device called the modify function:

M(t, a, d) → (a', c, τ)        (3.14)
Here, we are acting on an asset with attribute vector a with decision d using the information available at time t, producing an asset with attribute vector a′, generating cost (or contribution) c, and requiring time τ to complete. The modify function is basically our transition function at the level of an individual asset (or asset class) and a single decision acting on that asset. For many problems in the general area of asset management, this modeling strategy will seem more natural. However, it introduces a subtle discrepancy with the more classical transition function notation of equation (3.11), which includes an explicit dependence on the information W_{t+1} that arrives in the next time interval. As we progress through increasingly more complex models in this volume, we will need to model different assumptions about the information required by the modify function. A wide range of problems can be modeled as one of the three cases:

Information known at time t - M(t, a, d). The result of a decision (e.g. the attribute vector a′) is completely determined given the information at time t. For example, an aircraft with attribute a (which specifies its location among other things), sent to a particular city, will arrive with attribute a′, which might be a deterministic function of a, d and the information available at time t when the decision is made.

Information known at time t + 1 - M(t, a, d, W_{t+1}). The modify function depends on information that becomes available in the time period after decision d is made (which uses the information available at time t). For example, a funding agency may invest in a new technology, where a characterizes what we know about the technology. After the research is funded, we learn the outcome of the research (say, in the following year), which is unknown at the time the decision is made to fund the research.

Information known at time t + τ - M(t, a, d, W_{t+τ}). The modify function depends on information that becomes available at the end of an action, at time t + τ, where τ itself may be a random variable. Returning to our aircraft, the attribute vector may include elements describing the maintenance status of the aircraft and the time of arrival. The flight time may be random, and we will not learn about the mechanical status of the aircraft until it lands (at time t + τ).

In the latter two cases, the argument t refers to when the decision is made (and hence the information content of the decision), but the additional argument W_{t+1} or W_{t+τ} tells us the information we need to compute the outcome of the decision. It is sometimes convenient to refer to the attribute vector a′ using a function, so we define:

a^M(t, a, d) = the terminal attribute function.     (3.15)
We use the superscript ‘M ’ to emphasize the relationship with the modify function (students may also think of this as the “model” of the physical process). The argument t indicates the information content of the function, which is to say that we can compute the function using information that is available up through time interval t. Normally, when we make a
decision to act on an asset at time t, the transition function can use the information in the full state variable St (whereas at is the state of the asset we are acting on), so we could write aM (St , a, d) (or aM (St , d)). Example 3.20: The attributes of a taxicab can be described by its location, fuel level and how many hours the driver has been on duty. If the cab takes a customer to location j, it changes location, burns fuel and adds more hours to the time the driver has been on duty. Example 3.21: A student progressing through college can be described by the course requirements she has completed. The decision d represents the courses she decides to take, where she may drop (or fail) a course. a0 = aM (t, a, d) describes her academic progress at the end of the next semester.
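In the spirit of Example 3.20, here is a hypothetical modify function written in Python. The attribute vector, fuel burn rate and cost per hour are invented purely for illustration; the point is simply that M(t, a, d) returns the modified attributes, a cost, and a completion time τ.

from collections import namedtuple

Attribute = namedtuple("Attribute", ["location", "fuel", "hours_on_duty"])

def modify(t, a, d, travel_time, fuel_burn_rate=2.0, cost_per_hour=30.0):
    """M(t, a, d) -> (a_prime, cost, tau): act on attribute a with decision d.

    Here d is simply the destination location, and travel_time plays the role
    of tau (in general it could be random and revealed only at time t + tau).
    """
    a_prime = Attribute(location=d,
                        fuel=a.fuel - fuel_burn_rate * travel_time,
                        hours_on_duty=a.hours_on_duty + travel_time)
    cost = cost_per_hour * travel_time
    return a_prime, cost, travel_time

cab = Attribute(location="A", fuel=50.0, hours_on_duty=3.0)
print(modify(t=0, a=cab, d="B", travel_time=1.5))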
Our modify function brings out a common property of many asset management problems: an action can take more than one time period to complete. If τ > 1, then at time t + 1, we know that there will be an asset available at time t′ = t + τ in the future. This means that to capture the state of the system at time t + 1, we need to recognize that an important attribute is when the asset can be used. For algebraic purposes, it is also useful to define the indicator function:

δ_{t′,a′}(t, a, d) = 1 if M(t, a, d) = (a′, ·, t′ − t), and 0 otherwise.
Δ_t = the matrix with δ_{t′,a′}(t, a, d) in row (t′, a′) and column (a, d).

In addition to the attributes of the modified resource, we sometimes have to capture the fact that we may gain or lose resources in the process of completing a decision. We define:

γ(t, a, d) = the multiplier giving the quantity of resources with attribute a available after being acted on with decision d at time t.

The multiplier may depend on the information available at time t, but is often random and depends on information that has not yet arrived. Illustrations of gains and losses are given in the next set of examples. Using our modify function and gain, we can now provide a specific set of equations to capture the evolution of our resource vector. Remembering that R_{tt′} represents the resources we know about at time t (now) that are actionable at time t′ ≥ t, we assume that we can only act on resources that are actionable now. So, for t′ > t, the evolution of the resource vector is given by:

R_{t+1,t′a′} = R_{t,t′a′} + Σ_{a∈A} Σ_{d∈D} δ_{t′,a′}(t, a, d) x_{tad} + R̂_{t+1,t′a′}     (3.16)
Example 3.22: A corporation is holding money in an index fund with a 180 day holding period (money moved out of this fund within the period incurs a four percent load) and would like to transfer them into a high yield junk bond fund. The attribute of the asset would be a = (AssetType, Age). There is a transaction cost (the cost of executing the trade) and a gain γ, which is 1.0 for funds held more than 180 days, and 0.96 for funds held less than 180 days. Example 3.23: Transportation of liquified natural gas - A company would like to purchase 500,000 tons of liquified natural gas in southeast Asia for consumption in North America. Although in liquified form, the gas evaporates at a rate of 0.2 percent per day, implying γ = .998.
Equation (3.16) can be read as follows: the number of resources with attribute a′ (that are actionable at time t′) that we know about at time t + 1 is the sum of (i) the number of resources with attribute a′ with actionable time t′ that we know about at time t, plus (ii) the number of resources that are actionable now that will become actionable (due to our decisions) at time t′ with attribute a′, plus (iii) the number of resources with attribute a′ that are actionable at time t′ that we first learn about during time interval t + 1. A more compact form can be written if we view the actionable time t′ as a part of the attribute a′. Assume that we "act" on any resource that is not actionable by "doing nothing." In this case, we can write (3.16) as:

R_{t+1,a′}(ω) = Σ_{a∈A} Σ_{d∈D} δ_{a′}(t, a, d, ω) x_{tad} + R̂_{t+1,a′}(ω)     (3.17)
Equation (3.16) can be written in matrix form:

R_{t+1}(ω) = Δ_t x_t(ω) + R̂_{t+1}(ω)     (3.18)

or more simply:

R_{t+1} = Δ_t x_t + R̂_{t+1}     (3.19)
It is often useful to have a compact functional representation for the resource dynamics. For this reason, we introduce the notation:

R^M(R_t, x_t, ω) = Δ_t(ω) x_t + R̂_{t+1}(ω)

The superscript "M" indicates that this is really just the modify function in vector form. We are implicitly assuming that our decision x_t is derived from a deterministic function of the state of the system, although this is not always the case. If the only source of randomness
is new arrivals, then it is going to be most common that R^M(R_t, x_t, ω) will depend on information that arrives during time interval t + 1. However, there are many applications where the function Δ depends on information that arrives in later time periods.
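The following Python sketch illustrates the aggregate dynamics of equation (3.16) in its compact form (3.17): the modify function induces the indicator δ, and acting on x_{tad} units moves them to attribute a′. The attributes, decisions and quantities are hypothetical, and we assume every actionable resource appears in x_t (including "do nothing" decisions), as in (3.17).

from collections import defaultdict

def terminal_attribute(t, a, d):
    """A stand-in for a^M(t, a, d); here a is a location and d a destination."""
    return d

def resource_transition(t, x_t, R_hat_next):
    """R_{t+1,a'} = sum over (a,d) of delta_{a'}(t,a,d) x_{tad} + Rhat_{t+1,a'}."""
    R_next = defaultdict(float, R_hat_next)    # start with the new arrivals
    for (a, d), quantity in x_t.items():
        a_prime = terminal_attribute(t, a, d)  # delta picks out exactly this a'
        R_next[a_prime] += quantity
    return dict(R_next)

x = {("NYC", "CHI"): 3, ("NYC", "NYC"): 2, ("CHI", "ATL"): 1}
print(resource_transition(0, x, {"ATL": 4.0}))   # ATL: 5, CHI: 3, NYC: 2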
3.10 The contribution function
Next we need to specify the contribution (or cost, if we are minimizing) produced by the decisions we make in each time period. If we use a pre-decision state variable, and we are at time t trying to make decision x_t, we would represent our contribution function using:

Ĉ_{t+1}(S_t, x_t, W_{t+1}) = the contribution at time t from being in state S_t, making decision x_t, and then receiving the information W_{t+1}.

When we make the decision x_t, we do not know W_{t+1}, so it is common to use:

C_t(S_t, x_t) = E{Ĉ_{t+1}(S_t, x_t, W_{t+1}) | S_t}

The role that W_{t+1} plays is problem dependent, as illustrated in the examples below. There are many asset allocation problems where the contribution of a decision can be written using:

c_{tad} = the unit contribution of acting on an asset with attribute a with decision d. This contribution is incurred in period t using information available in period t.

In this case, our total contribution at time t could be written:

C_t(S_t, x_t) = Σ_{a∈A} Σ_{d∈D_a} c_{tad} x_{tad}
In general, when we use a pre-decision state variable, it is best to think of Ct (St , xt ) as an expectation of a function that may depend on future information. Students simply need to be aware that in some settings, the contribution function does not depend on future information. It is surprisingly common for us to want to work with two contributions. The common view of a contribution function is that it contains revenues and costs that we want to maximize or minimize. In many operational problems, there can be a mixture of “hard dollars” and “soft dollars.” The hard dollars are our quantifiable revenues and costs. But there are often other issues that are important in an operational setting, but which cannot always be easily quantified. For example, if we cannot cover all of the demand, we may wish to assess a penalty for not satisfying it. We can then manipulate this penalty to reduce the amount
Example 3.24: In asset acquisition problems, we order x_t in time period t to be used to satisfy demands D̂_{t+1} in the next time period. Our state variable is S_t = R_t = the product on hand after demands in period t have been satisfied. We pay a cost c^p x_t in period t and receive a revenue p min(R_t + x_t, D̂_{t+1}) in period t + 1. Our total one-period contribution function is then:

Ĉ_{t,t+1}(R_t, x_t, D̂_{t+1}) = p min(R_t + x_t, D̂_{t+1}) − c^p x_t

The expected contribution is:

C_t(S_t, x_t) = E{p min(R_t + x_t, D̂_{t+1}) − c^p x_t}
Example 3.25: Now consider the same asset acquisition problem, but this time we place our orders in period t to satisfy the known demand in period t. Our cost function contains both a fixed cost c^f (which we pay for placing an order of any size) and a variable cost c^p. The contribution function would look like:

C_t(S_t, x_t) = p min(R_t + x_t, D̂_t)                     if x_t = 0,
C_t(S_t, x_t) = p min(R_t + x_t, D̂_t) − c^f − c^p x_t     if x_t > 0.

Note that our contribution function no longer contains information from the next time period. If we did not incur a fixed cost c^f, then we would simply look at the demand D̂_t and order the quantity needed to cover it (as a result, there would never be any product left over). However, since we incur a fixed cost c^f with each order, there is a benefit to ordering enough to cover the demand now and future demands. This benefit is captured through the value function.
of unsatisfied demand. Examples of the use of soft-dollar bonuses and penalties abound in operational problems (see examples). Given the presence of these so-called “soft dollars,” it is useful to think of two contribution functions. We can let Ct (St , xt ) be the hard dollars and Ctπ (St , xt ) be the contribution function with the soft dollars included. The notation captures the fact that a set of soft bonuses and penalties represents a form of policy. So we can think of our policy as making decisions that maximize Ctπ (St , xt ), but measure the value of the policy (in hard dollars), using Ct (St , X π (St )).
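Returning to the expected contribution of Example 3.24, the following short Python sketch computes C_t(S_t, x_t) = E[p min(R_t + x_t, D̂_{t+1}) − c^p x_t] for a discrete demand distribution. The price, cost and probabilities are hypothetical.

def expected_contribution(R_t, x_t, demand_dist, price=10.0, cost=7.0):
    """Expected one-period contribution under a discrete demand distribution."""
    revenue = sum(prob * price * min(R_t + x_t, d) for d, prob in demand_dist)
    return revenue - cost * x_t

demand_dist = [(0, 0.1), (1, 0.3), (2, 0.4), (3, 0.2)]   # (demand, probability)
for x in range(4):
    print(x, round(expected_contribution(R_t=0, x_t=x, demand_dist=demand_dist), 2))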
3.11 The objective function
We are now ready to write out our objective function. Let Xtπ (St ) be a decision function (equivalent to a policy) that determines what decision we make given that we are in state St .
Example 3.26: A trucking company has to pay the cost of a driver to move a load, but wants to avoid using inexperienced drivers for their high priority accounts (but has to accept the fact that it is sometimes necessary). An artificial penalty can be used to reduce the number of times this happens. Example 3.27: A charter jet company requires that in order for a pilot to land at night, he/she has to have landed a plane at night three times in the last 60 days. If the third time a pilot landed at night is at least 50 days ago, the company wants to encourage assignments of these pilots to flights with night landings so that they can maintain their status. A bonus can be assigned to encourage these assignments. Example 3.28: A student planning her schedule of courses has to face the possibility of failing a course, which may require taking either an extra course one semester or a summer course. She wants to plan out her course schedule as a dynamic program, but use a penalty to reduce the likelihood of having to take an additional course. Example 3.29: An investment banker wants to plan a strategy to maximize the value of an asset and minimize the likelihood of a very poor return. She is willing to accept lower overall returns in order to achieve this goal and can do it by adding an additional penalty when the asset is sold at a significant loss.
Our optimization problem is to choose the best policy by choosing the best decision function from the family (Xtπ (St ))π∈Π . We are going to measure the total return from a policy as the (discounted) total contribution over a finite (or infinite) horizon. This would be written as:
F_0^π = E { Σ_{t=0}^{T} γ^t C_t^π(S_t, X_t^π(S_t)) | S_0 }     (3.20)
where γ discounts the money into time t = 0 values. In some communities, it is common to use an interest rate r, in which case the discount factor is:

γ = 1/(1 + r)
Important variants of this objective function are the infinite horizon problem (T = ∞) and the finite horizon problem (γ = 1). A separate problem class is the average reward, infinite horizon problem:

F_0^π = E { lim_{T→∞} (1/T) Σ_{t=0}^{T−1} C_t^π(S_t, X_t^π(S_t)) | S_0 }     (3.21)
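A natural way to evaluate a fixed policy against the objective in (3.20) is to simulate it forward in time and average the discounted contributions. The Python sketch below does exactly that; the order-up-to policy, contribution and transition functions are hypothetical placeholders.

import random

def simulate_policy(policy, contribution, transition, S0, T, gamma, n_runs=1000):
    """Monte Carlo estimate of the discounted objective for a fixed policy."""
    total = 0.0
    for _ in range(n_runs):
        S, value = S0, 0.0
        for t in range(T + 1):
            x = policy(t, S)
            W = random.randint(0, 3)                  # exogenous information
            value += (gamma ** t) * contribution(t, S, x, W)
            S = transition(t, S, x, W)
        total += value
    return total / n_runs

order_up_to_2 = lambda t, S: max(0, 2 - S)
profit = lambda t, S, x, W: 10 * min(S + x, W) - 7 * x
inventory = lambda t, S, x, W: max(0, S + x - W)
print(simulate_policy(order_up_to_2, profit, inventory, S0=0, T=20, gamma=0.95))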
Our optimization problem is to choose the best policy. In most practical applications,
we can write the optimization problem as one of choosing the best policy, or:

F_0^* = max_{π∈Π} F_0^π     (3.22)
Often (and this is generally the case in our discussions) a policy is characterized by a continuous parameter. It might be that the optimal policy corresponds to a value of the parameter equal to infinity. It is possible that F_0^* exists, but that an optimal "policy" does not exist (because it requires finding a parameter equal to infinity). While this is more of a mathematical curiosity, we handle these situations by writing the optimization problem as:

F_0^* = sup_{π∈Π} F_0^π     (3.23)
where “sup” is the supremum operator, which finds the smallest number greater than or equal to F0π for any value of π. If we were minimizing, we would use “inf,” which stands for “infimum,” which is the largest value less than or equal to the value of any policy. It is common in more formal treatments to use “sup” instead of “max” or “inf” instead of “min” since these are more general. Our emphasis is on computation and approximation, where we consider only problems where a solution exists. For this reason, we use “max” and “min” throughout our presentation. The expression (3.20) contains one important but subtle assumption that will prove to be critical later and which will limit the applicability of our techniques in some problem classes. Specifically, we assume the presence of what is known as linear, additive utility. That is, we have added up contributions for each time period. It does not matter if the contributions are discounted or if the contribution functions themselves are nonlinear. However, we will not be able to handle functions that look like:
F^π = E^π ( Σ_{t∈T} γ^t C_t(S_t, X_t^π(S_t)) )^2     (3.24)
The assumption of linear, additive utility means that the total contribution is a separable function of the contributions in each time period. While this works for many problems, it certainly does not work for all of them, as depicted in the examples below. In some cases these apparent instances of violations of linear, additive utility can be solved using a creatively defined state variable.
3.12 Models for a single, discrete asset
With our entire modeling framework behind us, it is useful to contrast two strategies for modeling a single (discrete) asset. The asset may be yourself (planning a path through
Example 3.30: We may value a policy of managing an asset using a nonlinear function of the number of times the price of an asset dropped below a certain amount. Example 3.31: Assume we have to find the route through a network where the traveler is trying to arrive at a particular point in time. The value function is a nonlinear function of the total lateness, which means that the value function is not a separable function of the delay on each link. Example 3.32: Consider a mutual fund manager who has to decide how much to allocate between aggressive stocks, conservative stocks, bonds, and money market instruments. Let the allocation of assets among these alternatives represent a policy π. The mutual fund manager wants to maximize long term return, but needs to be sensitive to short term swings (the risk). He can absorb occasional downswings, but wants to avoid sustained downswings over several time periods. Thus, his value function must consider not only his return in a given time period, but also how his return looks over one year, three year and five year periods.
college, or driving to a destination), a piece of equipment (an electric power generating plant, an aircraft or a locomotive), or a financial asset (where you have to decide to buy, sell or hold). There are two formulations that we can use to model a single asset problem. Each illustrates a different modeling strategy and leads to a different algorithmic strategy. We are particularly interested in developing a modeling strategy that allows us to naturally progress from managing a single asset to multiple assets.
3.12.1 A single asset formulation
Assume we have a single, discrete asset. Just before we act on an asset, new information arrives that determines the contribution we will receive. The problem can be modeled using:

a_t = the attribute vector of the asset at time t.
D_a = the set of decisions that can be used to act on the resource with attribute a.
a^M(t, a_t, d_t) = the terminal attribute function, which gives the attribute of a resource with attribute a_t after being acted on with decision d_t at time t.
W_{t+1} = new information that arrives during time period t + 1 that is used to determine the contribution generated by the asset.
C_{t+1}(a_t, d_t, W_{t+1}) = the contribution returned by acting on a resource with attribute a_t with decision d ∈ D_a, given the information that becomes available during time interval t + 1.

When we manage a single asset, there are two ways to represent the decisions and the state of the system. The first is geared purely to the modeling of a single, discrete asset:
The resource state variable: a_t.
The system state variable: S_t = (a_t, W_t).
The decision variable: d ∈ D_a.
The transition function: a_{t+1} = a^M(t, a, d) (or a_{t+1} = a^M(t, a, d, W_{t+1})).

Recall that a^M(t, a, d) is the terminal attribute function, which gives the attributes of a resource after it has been acted on. This is an output of the modify function (see equation (3.14) in section 3.9). Here, we are assuming that the outcome of the modify function is known in the next time period. There are many models where the modify function is deterministic given the information at time t. By contrast, there are many applications where the outcome is not known until time t + τ, where τ itself may be random. This is the conventional formulation used in dynamic programming, although we would customarily let the attribute vector a_t be the state S_t of our system (the single resource), which can be acted on by a set of actions a ∈ A. The optimality equations for this problem are easily stated as:

V_t(a_t) = max_{d∈D_{a_t}} E{ Ĉ_{t+1}(a_t, d, W_{t+1}) + V_{t+1}(a^M(t + 1, a_t, d)) }     (3.25)
Given the value function Vt+1 (a), equation (3.25) is solved by simply computing the total contribution for each d ∈ D and choosing the best decision. There is a vast array of dynamic programming problems where the single asset is the system we are trying to optimize. For these problems, instead of talking about the attributes of the asset we would simply describe the state of the system. In our discussion, we will occasionally use this model. The reader should keep in mind that in this formulation, we assume that the attribute vector is small enough that we can usually enumerate the complete attribute space A. Furthermore, we also assume that the decision set D is also small enough that we can enumerate all the decisions.
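A single step of equation (3.25) can be solved by simple enumeration when the decision set is small. Here is a minimal Python sketch; the decisions, contribution values and terminal attribute function are hypothetical, and V_next stands for an already-computed V_{t+1}(a).

def bellman_step(t, a, decisions, expected_contribution, terminal_attribute, V_next):
    """Return (best value, best decision) for an asset with attribute a."""
    best_value, best_decision = float("-inf"), None
    for d in decisions:
        value = expected_contribution(t, a, d) + V_next[terminal_attribute(t, a, d)]
        if value > best_value:
            best_value, best_decision = value, d
    return best_value, best_decision

V_next = {"A": 0.0, "B": 5.0, "C": 2.0}
contrib = lambda t, a, d: {"A": 0.0, "B": -3.0, "C": 1.0}[d]
move = lambda t, a, d: d
print(bellman_step(0, "A", ["A", "B", "C"], contrib, move, V_next))   # -> (3.0, 'C')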
3.12.2 A multiple asset formulation for a single asset
The second formulation is mathematically equivalent, but uses the same notation that we use for more complex problems:

The resource state variable: R_t.
The system state variable: S_t = (R_t, W_t).
The decision variable: x_t = (x_{tad})_{a∈A, d∈D_a}.
The transition function: R_{t+1,a′} = Σ_{a∈A} Σ_{d∈D_a} δ_{a′}(t, a, d) x_{tad}.
Since we have a single, discrete asset, Σ_{a∈A} R_{ta} = 1. Our optimality equation becomes:

V_t(R_t) = max_{x_t∈X_t} E{ Ĉ_{t+1}(a, x_t, W_{t+1}) + γ V_{t+1}(R_{t+1}) }     (3.26)
Our feasible region X_t is given by the set:

Σ_{d∈D_a} x_{tad} = 1     (3.27)
x_{tad} ≥ 0               (3.28)
where we assume that the set D_a includes a "do nothing" option. It might seem as though the optimality equation (3.26) is much harder to solve because we are now choosing a vector x_t ∈ X_t rather than a scalar decision d ∈ D. Of course, because of equation (3.27), we are really facing an identical problem. Since Σ_{a∈A} R_{t+1,a} = 1, we can rewrite V_{t+1}(R_{t+1}) using:

V_{t+1}(R_{t+1}) = Σ_{a′∈A} v_{t+1,a′} R_{t+1,a′}     (3.29)

where v_{t+1,a′} = V_{t+1}(R_{t+1}) if R_{t+1,a′} = 1; that is, v_{t+1,a′} is the value of having an asset with attribute a′ at time t + 1. Since:

R_{t+1,a′} = Σ_{a∈A} Σ_{d∈D} δ_{a′}(t, a, d) x_{tad}     (3.30)
we can rewrite (3.29) as:

V_{t+1}(R_{t+1}) = Σ_{a′∈A} v_{t+1,a′} Σ_{a∈A} Σ_{d∈D} δ_{a′}(t, a, d) x_{tad}
                 = Σ_{a∈A} Σ_{d∈D} v_{t+1,a^M(t,a,d)} x_{tad}     (3.31)

since, by definition of the modify function, δ_{a′}(t, a, d) = 1 only when a′ = a^M(t, a, d).
We can also simplify our contribution function by taking advantage of the fact that we have to choose exactly one decision:

C̄_t(a, x_t) = E{Ĉ_{t+1}(a, x_t, W_{t+1})}          (3.32)
            = Σ_{a∈A} Σ_{d∈D} c_{tad} x_{tad}       (3.33)
Combining (3.26) with (3.31) gives:

V_t(a_t) = max_{x_t∈X_t} { Σ_{a∈A} Σ_{d∈D} C̄_{tad} x_{tad} + γ Σ_{a∈A} Σ_{d∈D} v_{t+1,a^M(t,a,d)} x_{tad} }
         = max_{x_t∈X_t} Σ_{a∈A} Σ_{d∈D} ( C̄_{tad} + γ v_{t+1,a^M(t,a,d)} ) x_{tad}     (3.34)
Equation (3.34) makes it apparent that we are doing the same thing that we did in (3.25). We have to compute the total contribution of each decision and choose the best. The first is the most natural model for a single discrete asset, but extending it to multiple assets is extremely awkward. The second model is no harder to solve, but forms the basis for solving far larger problems (including those with multiple assets).
3.13 A measure-theoretic view of information**
For students interested in proving theorems or reading more theoretical research articles, it is useful to have a more fundamental understanding of information. When we work with random information processes and uncertainty, it is standard in the probability community to define a probability space, which consists of three elements. The first is the set of outcomes Ω, which is generally assumed to represent all possible outcomes of the information process (actually, Ω can include outcomes that can never happen). If these outcomes are discrete, then all we would need is the probability of each outcome p(ω). It is nice to have a terminology that allows for continuous quantities. We want to define the probabilities of our events, but if ω is continuous, we cannot talk about the probability of an outcome ω. However we can talk about a set of outcomes E that represent some specific event (if our information is a price, the event E could be all the prices that constitute the event that the price is greater than some number). In this case, we can define the probability of an outcome E by integrating the density function p(ω) over all ω in the event E. Probabilists handle continuous outcomes by defining a set of events F, which is literally a “set of sets” because each element in F is itself a set of outcomes in Ω. This is the reason we resort to the script font F as opposed to our calligraphic font for sets; students may find it easy to read E as “calligraphic E” and F as “script F.” The set F has the property that if an event E is in F, then its complement Ω \ E is in F, and the union of any two events EX ∪EY in F is also in F. F is called a “sigma-algebra” (which may be written σ-algebra). An understanding of sigma-algebras is not important for computational work, but can be useful in certain types of proofs, as we see in this volume. Sigma-algebras are without question one of the more arcane devices used by the probability community, but once they are mastered, they are a powerful theoretical tool. Finally, it is required that we specify a probability measure denoted P, which gives the
probability (or density) of an outcome ω, which can then be used to compute the probability of an event in F. We can now define a formal probability space for our exogenous information process as (Ω, F, P). If we wish to take an expectation of some quantity that depends on the information, say Ef(W_t), then we would sum (or integrate) over the set of outcomes ω multiplied by the probability (or density) P. It is important to emphasize that ω represents all the information that will become available, over all time periods. As a rule, we are solving a problem at time t, which means we do not have the information that will become available after time t. To handle this, we let F_t be the sigma-algebra representing events that can be created using only the information up to time t. To illustrate, consider an information process W_t consisting of a single 0 or 1 in each time period. W_t may be the information that a customer purchases a jet aircraft, or the event that an expensive component in an electrical network fails. If we look over three time periods, there are eight possible scenarios, as shown in table 3.1.

Outcome ω | W_1  W_2  W_3
    1     |  0    0    0
    2     |  0    0    1
    3     |  0    1    0
    4     |  0    1    1
    5     |  1    0    0
    6     |  1    0    1
    7     |  1    1    0
    8     |  1    1    1

Table 3.1: Set of demand outcomes

Let E_{W_1} be the set of outcomes ω that satisfy some logical condition on W_1. If we are at time t = 1, we only see W_1. The event W_1 = 0 would be written:

E_{W_1=0} = {ω | W_1 = 0} = {1, 2, 3, 4}

The sigma-algebra F_1 would consist of the events {E_{W_1=0}, E_{W_1=1}, E_{W_1∈{0,1}}, E_{W_1∉{0,1}}}. Now assume that we are in time period 2 and have access to W_1 and W_2. With this information, we are able to divide our outcomes Ω into finer subsets. Our history H_2 consists of the elementary events H_2 = {(0,0), (0,1), (1,0), (1,1)}. Let h_2 = (0,1) be an element of H_2. The event E_{h_2=(0,1)} = {3, 4}. In time period 1, we could not tell the difference between outcomes 1, 2, 3 and 4; now that we are at time 2, we can differentiate between ω ∈ {1, 2} and ω ∈ {3, 4}. The sigma-algebra F_2 consists of all the events E_{h_2}, h_2 ∈ H_2, along with all possible unions and complements. Another event in F_2 is {ω | (W_1, W_2) = (0, 0)} = {1, 2}. A third event in F_2 is the union of these two events, which consists of ω ∈ {1, 2, 3, 4} which, of course, is one of the
events in F_1. In fact, every event in F_1 is an event in F_2, but not the other way around, the reason being that the additional information from the second time period allows us to divide Ω into a finer set of subsets. Since F_2 consists of all unions (and complements), we can always take the union of events, which is the same as ignoring a piece of information. By contrast, we cannot divide F_1 into finer subsets. The extra information in F_2 allows us to filter Ω into a finer set of subsets than was possible when we only had the information through the first time period. If we are in time period 3, F_3 will consist of each of the individual elements in Ω as well as all the unions needed to create the same events in F_2 and F_1. From this example, we see that more information (that is, the ability to see more elements of W_1, W_2, . . .) allows us to divide Ω into finer-grained subsets. We see that F_{t−1} ⊆ F_t; F_t always consists of every event in F_{t−1} in addition to other, finer events. As a result of this property, F_t is termed a filtration. It is because of this interpretation that the sigma-algebras are typically represented using the letter F (which literally stands for filtration) rather than the more natural letter H (which stands for history). The fancy font used to denote a sigma-algebra is used to designate that it is a set of sets (rather than just a set). It is always assumed that information processes satisfy F_{t−1} ⊆ F_t. Interestingly, this is not always the case in practice. The property that information forms a filtration requires that we never "forget" anything. In real applications, this is not always true. Assume, for example, that we are doing forecasting using a moving average. This means that our forecast f_t might be written as f_t = (1/T) Σ_{t′=1}^{T} D̂_{t−t′}. Such a forecasting process "forgets" information that is older than T time periods.
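A computational way to see the refinement behind F_1 ⊆ F_2 ⊆ F_3 is to look at the partition of Ω induced by the observed history (W_1, . . . , W_t): the events we can form at time t are exactly the unions of these blocks. The short Python sketch below (our own illustration for the 0/1 process of Table 3.1) prints the partitions for t = 1, 2, 3, which become progressively finer.

from itertools import product
from collections import defaultdict

outcomes = list(product([0, 1], repeat=3))        # Omega: the 8 sample paths

def partition_at_time(t):
    """Group outcomes by the history (W_1, ..., W_t) observed by time t."""
    blocks = defaultdict(list)
    for i, omega in enumerate(outcomes, start=1):
        blocks[omega[:t]].append(i)
    return dict(blocks)

for t in range(1, 4):
    print("t =", t, partition_at_time(t))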
3.14 Bibliographic notes
Most textbooks on dynamic programming give very little emphasis on modeling. The multiattribute notation for multiple asset classes is based primarily on Powell et al. (2001). Figure 3.1 which describes the mapping from continuous to discrete time was outlined for me by Erhan Cinlar.
Exercises

3.1) A college student must plan what courses she takes over each of eight semesters. To graduate, she needs 34 total courses, while taking no more than five and no less than three courses in any semester. She also needs two language courses, one science course, eight departmental courses in her major and two math courses.
a) Formulate the state variable for this problem in the most compact way possible.

3.2) Assume that we have N discrete assets to manage, where R_a is the number of assets of type a ∈ A and N = Σ_{a∈A} R_a. Let R be the set of possible values of
the vector R. Show that:

|R| = (N + |A| − 1 choose |A| − 1)

where

(X choose Y) = X! / (Y!(X − Y)!)
is the number of combinations of X items taken Y at a time.
b) Give the transition function for our college student assuming that she successfully passes any course she takes. You will need to introduce variables representing her decisions.
c) Now give the transition function for our college student, but now allow for the random outcome that she may not pass every course.

3.3) A broker is working in thinly traded stocks. He must make sure that he does not buy or sell in quantities that would move the price, and he feels that if he works in quantities that are no more than 10 percent of the average sales volume, he should be safe. He tracks the average sales volume of a particular stock over time. Let v̂_t be the sales volume on day t, and assume that he estimates the average demand f_t using f_t = (1 − α)f_{t−1} + α v̂_t. He then uses f_t as his estimate of the sales volume for the next day. Assuming he started tracking demands on day t = 1, what information would constitute his state variable?

3.4) How would your previous answer change if our broker used a 10-day moving average to estimate his demand? That is, he would use f_t = 0.10 Σ_{i=1}^{10} v̂_{t−i+1} as his estimate of the demand.

3.5) The pharmaceutical industry spends millions managing a sales force to push the industry's latest and greatest drugs. Assume one of these salesmen must move between a set I of customers in his district. He decides which customer to visit next only after he completes a visit. For this exercise, assume that his decision does not depend on his prior history of visits (that is, he may return to a customer he has visited previously). Let S_n be his state immediately after completing his nth visit that day.
a) Assume that it takes exactly one time period to get from any customer to any other customer. Write out the definition of a state variable, and argue that his state is only his current location.
b) Now assume that τ_{ij} is the (deterministic and integer) time required to move from location i to location j. What is the state of our salesman at any time t? Be sure to consider both the possibility that he is at a location (having just finished with a customer) or between locations.
c) Finally, assume that the travel time τ_{ij} follows a discrete uniform distribution between a_{ij} and b_{ij} (where a_{ij} and b_{ij} are integers).
3.6) Consider a simple asset acquisition problem where x_t is the quantity purchased at the end of time period t to be used during time interval t + 1. Let D_t be the demand for the assets during time interval t. Let R_t be the pre-decision state variable (the amount on hand before you have ordered x_t) and R_t^x be the post-decision state variable.
a) Write the transition function so that R_{t+1} is a function of R_t, x_t and D_{t+1}.
b) Write the transition function so that R_t^x is a function of R_{t−1}^x, D_t and x_t.
c) Write Rtx as a function of Rt , and write Rt+1 as a function of Rtx . 3.7) As a buyer for an orange juice products company, you are responsible for buying futures for frozen concentrate. Let xtt0 be the number of futures you purchase in year t that can be exercised during year t0 . a) What is your state variable in year t? b) Write out the transition function. 3.8) A classical inventory problem works as follows. Assume that our state variable Rt is the amount of product on hand at the end of time period t and that Dt is a random variable giving the demand during time interval (t − 1, t) with distribution pd = P rob(Dt = d). The demand in time interval t must be satisfied with the product on hand at the beginning of the period. We can then order a quantity xt at the end of period t that can be used to replenish the inventory in period t + 1. Give the transition function that relates Rt+1 to Rt . 3.9) Many problems involve the movement of assets over networks. The definition of the state of the single asset, however, can be complicated by different assumptions for the probability distribution for the time required to traverse a link. For each example below, give the state of the asset: a) You have a deterministic, static network, and you want to find the shortest path from an origin node r to a destination node s. There is a known cost cij for traversing each link (i, j). b) Each day, you need to choose between one of two paths from home to work, but you do not know the travel time for each path because it is random (but the mean and variance of the distribution of travel times remains the same from one day to the next). Each time you follow a path, you get more information about the travel time over that path. You need to devise a strategy for determining which path to choose each day. c) A taxicab is moving people in a set of cities C. After dropping a passenger off at city i, the dispatcher may have to decide to reposition the cab from i to j, (i, j) ∈ C. The travel time from i to j is τij , which is a random variable with a discrete uniform distribution (that is, the probability that τij = t is 1/T , for t = 1, 2, . . . , T ). Assume that the travel time is known before the trip starts. d) Same as (c), but now the travel times are random with a geometric distribution (that is, the probability that τij = t is (1 − θ)θt−1 , for t = 1, 2, 3, . . .).
3.10) In the figure below, a sailboat is making its way upwind from point A to point B. To do this, the sailboat must tack, whereby it sails generally at a 45 degree angle to the wind. The problem is that the angle of the wind tends to shift randomly over time. The boat's skipper decides to check the angle of the wind each minute and must decide whether the boat should be on port or starboard tack. Note that the proper decision must consider the current location of the boat, which we may indicate by an (x, y) coordinate.
[Figure: the path of a sailboat tacking upwind from point A to point B, showing port and starboard tacks and the wind direction.]
3.11) What is the difference between the history of a process, and the state of a process? 3.12) As the purchasing manager for a major citrus juice company, you have the responsibility of maintaining sufficient reserves of oranges for sale or conversion to orange juice products. Each week, you can purchase up to a quantity qti at price pti from supplier i ∈ I, where the price/quantity pairs (pti , qti )i∈I fluctuate from week to week. Let xti be the amount that you decide to purchase from supplier i in week t to be used in week t + 1. Let s0 be your total initial inventory, and let Dt be the amount of product that the company needs for production during week t. If we are unable to meet demand, the company must purchase additional product on the spot market at a spot price pspot ti . a) What is the exogenous stochastic process for this system? b) What are the decisions you can make to influence the system? c) What would be the state variable for your problem? d) Write out the transition equations. e) What is the one-period contribution function?
f) Propose a reasonable structure for a decision rule for this problem, and call it X π . Your decision rule should be in the form of a function that determines how much to produce in a given period. g) Carefully and precisely, write out the objective function for this problem in terms of the exogenous stochastic process. Clearly identify what you are optimizing over. h) For your decision rule, what do we mean by the space of policies? 3.13) Customers call in to a service center according to a (nonstationary) Poisson process. Let E be the set of events representing phone calls, where te , e ∈ E is the time that the call is made. Each customer makes a request that will require time τe to complete and will pay a reward re to the service center. The calls are initially handled by a receptionist who determines τe and re . The service center does not have to handle all calls and obviously favors calls with a high ratio of reward per time unit required (re /τe ). For this reason, the company adopts a policy that the call will be refused if (re /τe ) < γ. If the call is accepted, it is placed in a queue to wait for one of the available service representatives. Assume that the probability law driving the process is known, where we would like to find the right value of γ. a) This process is driven by an underlying exogenous stochastic process with element ω ∈ Ω. What is an instance of ω? b) What are the decision epochs? c) What is the state variable for this system? What is the transition function? d) What is the action space for this system? e) Give the one-period reward function. f) Give a full statement of the objective function that defines the Markov decision process. Clearly define the probability space over which the expectation is defined, and what you are optimizing over. 3.14) A major oil company is looking to build up its storage tank reserves, anticipating a surge in prices. It can acquire 20 million barrels of oil, and it would like to purchase this quantity over the next 10 weeks (starting in week 1). At the beginning of the week, the company contacts its usual sources, and each source j ∈ J is willing to provide qtj million barrels at a price ptj . The price/quantity pairs (ptj , qtj ) fluctuate from week to week. The company would like to purchase (in discrete units of millions of barrels) xtj million barrels (where xtj is discrete) from source j in week t ∈ (1, 2, . . . , 10). Your goal is to acquire 20 million barrels while spending the least amount possible. a) What is the exogenous stochastic process for this system? b) What would be the state variable for your problem? Give an equation(s) for the system dynamics. c) Propose a structure for a decision rule for this problem and call it X π . d) For your decision rule, what do we mean by the space of policies? Give examples of two different decision rules.
e) Write out the objective function for this problem using an expectation over the exogenous stochastic process. f) You are given a budget of $300 million to purchase the oil, but you absolutely must end up with 20 million barrels at the end of the 10 weeks. If you exceed the initial budget of $300 million, you may get additional funds, but each additional $1 million will cost you $1.5 million. How does this affect your formulation of the problem? 3.15) You own a mutual fund where at the end of each week t you must decide whether to sell the asset or hold it for an additional week. Let rt be the one-week return (e.g. rt = 1.05 means the asset gained five percent in the previous week), and let pt be the price of the asset if you were to sell it in week t (so pt+1 = pt rt+1 ). We assume that the returns rt are independent and identically distributed. You are investing this asset for eventual use in your college education, which will occur in 100 periods. If you sell the asset at the end of time period t, then it will earn a money market rate q for each time period until time period 100, at which point you need the cash to pay for college. a) What is the state space for our problem? b) What is the action space? c) What is the exogenous stochastic process that drives this system? Give a five time period example. What is the history of this process at time t? d) You adopt a policy that you will sell if the asset falls below a price p¯ (which we are requiring to be independent of time). Given this policy, write out the objective function for the problem. Clearly identify exactly what you are optimizing over.
Chapter 4

Introduction to Markov decision processes

This chapter provides an introduction to what are classically known as Markov decision processes, or stochastic, dynamic programming. Throughout, we assume finite numbers of discrete states and decisions ("actions" in the parlance of Markov decision processes), and we assume we can compute a one-step transition matrix. Several well-known algorithms are presented, but these are exactly the types of algorithms that do not scale well to realistically sized problems. So why cover material that is widely acknowledged to work only on small or highly specialized problems? First, some problems have small state and action spaces and can be solved with these techniques. Second, the theory of Markov decision processes can be used to identify structural properties that can dramatically simplify computational algorithms. But far more importantly, this material provides the intellectual foundation for the types of algorithms that we present in later chapters. Using the framework in this chapter, we can prove very powerful results that will provide a guiding hand as we step into richer and more complex problems in many real-world settings. Furthermore, the behavior of these algorithms provides important insights that guide the design of algorithms for more general problems. There is a rich and elegant theory behind Markov decision processes, and this chapter is aimed at bringing it out. However, the proofs are deferred to the "Why does it work" section (section 4.5). The intent is to allow the presentation of results to flow more naturally, but serious students of dynamic programming are encouraged to delve into these proofs. This is partly to develop a deeper appreciation of the properties of the problem as well as to develop an understanding of the proof techniques that are used in this field.
4.1 The optimality equations
In the last chapter, we were able to formulate our problem as one of finding a policy that maximized the following optimization problem:
max_π E { Σ_{t=0}^{T} γ^t C_t^π(S_t, X_t^π(S_t)) }     (4.1)
For most problems, solving equation (4.1) is computationally intractable, but it provides the basis for identifying the properties of optimal solutions and finding and comparing “good” solutions to determine which is better. With a little thought, we realize that we do not have to solve this entire problem at once. Assume our problem is deterministic (as with our budgeting problem of chapter 1). If we are in state st and make decision xt , our transition function will tell us that we are going to land in some state s0 = St+1 (x). What if we had a function Vt+1 (s0 ) that told us the value of being in state s0 ? We could evaluate each possible decision x and simply choose the one decision x that had the largest value of the one-period contribution, Ct (st , xt ), plus the value of landing in state s0 = St+1 (x) which we represent using Vt+1 (St+1 (xt )). Since this value represents the money we receive one time period in the future, we might discount this by a factor γ. In other words, we have to solve: x∗t (st ) = arg max {Ct (st , xt ) + γVt+1 (St+1 (xt ))} xt
Furthermore, the value of being in state st is the value of using the optimal decision x∗t (st ). That is: Vt (st ) = max {Ct (st , xt ) + γVt+1 (St+1 (xt ))} xt
(4.2)
Equation (4.2) is known as either Bellman’s equation, in honor of Richard Bellman, or “the optimality equations” because they characterize the optimal solution. They are also known as the Hamilton-Jacobi equations, reflecting their discovery through the field of control theory, or the Hamilton-Jacobi-Bellman equations (in honor of everybody), or HJB for short. When we are solving stochastic problems, we have to model the fact that new information becomes available after we make the decision xt and before we measure the state variable St+1 . Our one period contribution function is given by: Cˆt+1 (st , xt , Wt+1 ) = The contribution received in period t + 1 given the state st and decision xt , as well as the new information Wt+1 that arrives in period t + 1. When we are making decision xt , we only know st , which means that both Cˆt+1 (st , xt , Wt+1 ) and the next state St+1 are random. If we are to choose the best decision, we need to
maximize the expected contribution:

V_t(s_t) = max_{x_t} E{ Ĉ_{t+1}(s_t, x_t, W_{t+1}) + γ V_{t+1}(S_{t+1}(x_t)) | S_t = s_t }     (4.3)
Let:

C_t(s_t, x_t) = E{Ĉ_{t+1}(s_t, x_t, W_{t+1}) | S_t = s_t}

Substituting this into (4.3) gives us what we call the expectation form of the optimality equations:

V_t(s_t) = max_{x_t} ( C_t(s_t, x_t) + γ E{V_{t+1}(S_{t+1}(x_t)) | S_t = s_t} )     (4.4)
This equation forms the basis for our algorithmic work in later chapters. Interestingly, this is not the usual way that the optimality equations are written in the dynamic programming community. We can write the expectation using:

E{V_{t+1}(S_{t+1}(x_t)) | S_t = s_t} = Σ_{s′∈S} Σ_{ω_{t+1}∈Ω_{t+1}} P(ω_{t+1}) 1_{{s′ = f_t(s_t, x_t, ω_{t+1})}} V_{t+1}(s′)     (4.5)
                                     = Σ_{s′∈S} p_t(s′|s_t, x_t) V_{t+1}(s′)     (4.6)

where the inner sum over ω_{t+1} in (4.5) is the one-step transition probability p_t(s′|s_t, x_t).
where S is the set of potential states. The reader may wish to refer back to (3.13) to review the substitution of the one-step transition matrix p_t(s′|s_t, x_t) into equation (4.5). Substituting (4.6) back into (4.4) gives us the standard form of the optimality equations:

V_t(s_t) = max_{x_t} ( C_t(s_t, x_t) + γ Σ_{s′∈S} p_t(s′|s_t, x_t) V_{t+1}(s′) )     (4.7)
While the transition matrix can, in practice, be computationally intractable, equation (4.7) offers a particularly elegant mathematical structure that is the basis for much of the theory about the properties of Markov decision processes. We can write (4.7) in a more compact form. Recall that a policy π is a rule that specifies the action xt given the state st . The probability that we transition from state St = s to St+1 = s0 can be written as: pss0 (x) = P rob(St+1 = s0 |St = s, xt = x)
We would say that "p_{ss′}(x) is the probability that we end up in state s′ if we start in state s at time t and take action x." Since the action x is determined by our policy (or decision function) X^π, it is common to write this probability as:

p^π_{ss′} = Prob(S_{t+1} = s′ | S_t = s, X^π(s) = x)

It is often useful to write this in matrix form:

P_t^π = the one-step transition matrix under policy π,

where p^π_{ss′} is the element in row s and column s′. Now let C_t(x_t) and v_{t+1} be column vectors with elements C_{ts}(x_t) and v_{s,t+1} respectively, where s ∈ S. Then (4.7) is equivalent to:

v_t = max_{x_t} ( C_t(x_t) + γ P_t^π v_{t+1} )     (4.8)
where the max operator is applied to each element in the column vector. If the decision xt is a scalar (for example, whether to sell or hold an asset), then the solution to (4.8) is a vector, with a decision xts for each state s. Note that this is equivalent to a policy - it is a rule specifying what to do in each state. If the decision xt is itself a vector, then the solution to (4.8) is a family of decision vectors xt (st ) for all st ∈ S. For example, assume our problem is to assign individual programmers to different programming tasks, where our state st captures the availability of programmers and the different tasks that need to be completed. Of course, computing a vector xt for each state st which is itself a vector is much easier to write than to implement. The vector form of Bellman’s equation in (4.8) can be written even more compactly using operator notation. Let M be the “max” (or “min”) operator in (4.8) that can be viewed as acting on the vector vt+1 to produce the vector vt . Let V be the space of value functions. Then, M is a mapping: M:V→V defined by equation (4.8). We may also define the operator Mπ for a particular policy π, which is simply the linear operator: Mπ (v) = Ct + γP π v
(4.9)
for some vector v ∈ V. We see later in the chapter that we can exploit the properties of this operator to derive some very elegant results for Markov decision processes. These proofs provide insights into the behavior of these systems, which can guide the design of algorithms. For this reason, it is relatively immaterial that the actual computation of these equations may be intractable for many problems; the insights still apply.
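The vector form of (4.8) is easy to compute for small problems. Here is a minimal numpy sketch of applying the "max" operator M once; the contributions C[x][s] and the transition matrices P[x] are hypothetical numbers chosen only to make the example run.

import numpy as np

def apply_M(C, P, v_next, gamma=0.9):
    """One application of M: v_t(s) = max_x [ C(s,x) + gamma * (P_x v_{t+1})(s) ]."""
    Q = np.array([C[x] + gamma * P[x] @ v_next for x in range(len(C))])
    return Q.max(axis=0), Q.argmax(axis=0)     # values and a greedy policy

C = np.array([[1.0, 0.0], [0.0, 2.0]])                 # rows: actions, columns: states
P = np.array([[[0.9, 0.1], [0.2, 0.8]],                # P[x=0]
              [[0.5, 0.5], [0.5, 0.5]]])               # P[x=1]
v_next = np.array([10.0, 5.0])
print(apply_M(C, P, v_next))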
4.2 The optimality equations using the post-decision state variable
In section 3.6.2, we pointed out that it is possible to capture the state of a system immediately before or immediately after a decision is made. Virtually every textbook on dynamic programming uses what we will sometimes call the pre-decision state variable. This is most natural because it has all the information we need to make a decision. The complication that arises in computational work is that if we want to make a decision that takes into account the impact on the future (which is the whole point of dynamic programming), then we have to work with the value of a decision x that puts us in state S_{t+1}(x), which is a random variable. As a result, we are forced to compute (or, more commonly, approximate) the quantity E{V_{t+1}(x) | S_t}. For some problem classes, this can cause real complications. We can circumvent this problem by formulating the optimality equations around the post-decision state variable. Recall that we can write our history of information, decisions and states as:

h_t = (S_0, x_0, W_1, S_1, x_1, S_1^x, W_2, S_2, x_2, S_2^x, . . . , x_{t−1}, S_{t−1}^x, W_t, S_t)
When we wrote our recursion around the pre-decision state variable S_t, we obtained the optimality equations that are given in equations (4.4)-(4.8). If we write the same equations around the post-decision state variable, we obtain equations of the form:

V_{t−1}^x(s_{t−1}^x) = E{ max_{x_t} [ Ĉ_t(s_{t−1}^x, W_t, x_t) + γ V_t^x(S_t^x(x_t)) ] | S_{t−1}^x = s_{t−1}^x }     (4.10)
We have indexed both the state variables and the value functions with the superscript "x" to denote when a quantity is computed for the post-decision state variable. The reader needs to keep in mind while reading equation (4.10) that the time index t always refers to the information content of the variable or function. The biggest difference in the optimality recursions is that now the expectation is outside of the max operator. Since we are conditioning on s_{t−1}^x, we need the information W_t in order to compute x_t. There is a simple relationship between V_t(S_t) and V_t^x(S_t^x) that is summarized as follows:

V_t(S_t) = max_{x_t∈X_t} ( C_t(S_t, x_t) + γ V_t^x(S_t^x(x_t)) )     (4.11)
V_t^x(S_t^x) = E{ V_{t+1}(S_{t+1}) | S_t^x }                         (4.12)
Note that from equation (4.11), Vt (St ) is a deterministic function of Vtx (St (xt )). That is, we do not have to compute an expectation, but we do have to solve a maximization problem. From equation (4.12), we see that Vtx (Stx ) is just the conditional expectation of Vt+1 (St+1 ). So now we pose the question: why would we ever want the value functions computed around a post-decision state vector? When we use the pre-decision state vector, we have one
maximization problem to solve (for a particular state). Now, we have to solve a maximization problem within an expectation. This certainly seems more complicated. The value of the post-decision optimality equations, oddly enough, arises purely for computational reasons. Keep in mind that for more complex problems, it is impossible to compute the expectation exactly. As a result, using the pre-decision state variable requires that we first approximate the value function, and then approximate the expectation. Using the post-decision state variable produces the following decision function:

X_{t+1}^π(S_t^x, W_{t+1}(ω)) = arg max_{x_{t+1}} [ Ĉ_t(S_t^x, x_{t+1}, W_{t+1}(ω)) + γ V_{t+1}^x(S_{t+1}^x(x_{t+1}, W_{t+1}(ω))) ]     (4.13)
Note that to determine the decision xt+1 , we just have to sample the information that would have been available, namely Wt+1 . The basic strategy in approximate dynamic programming is to simulate forward in time for a sample realization ω. Thus, we would simply simulate the information and solve the decision problem for a single sample realization (as we would do in any practical application).
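A rough Python sketch of this forward-pass logic follows. It is our own simplified illustration, not the book's algorithm: the post-decision value function is just a dictionary over post-decision inventory levels, and the prices, dynamics and values are hypothetical.

import random

def post_decision_state(S, x):
    return S + x                       # inventory on hand after ordering

def decide(S_post_prev, W, decisions, V_x, gamma=0.95, price=10.0, cost=7.0):
    """Sample W, form the pre-decision state, and pick the best decision as in (4.13)."""
    S = max(0, S_post_prev - W)        # pre-decision state after observing demand W
    best_x, best_val = None, float("-inf")
    for x in decisions:
        val = price * min(S_post_prev, W) - cost * x + gamma * V_x[post_decision_state(S, x)]
        if val > best_val:
            best_x, best_val = x, val
    return best_x, S

V_x = {s: 2.0 * s for s in range(10)}  # a crude post-decision value approximation
print(decide(S_post_prev=3, W=random.randint(0, 3), decisions=range(4), V_x=V_x))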
4.3 Finite horizon problems
Finite horizon problems tend to arise in two settings. First, some problems have a very specific horizon. For example, we might be interested in the value of an American option where we are allowed to sell an asset at any time t ≤ T where T is the exercise date. Another problem is to determine how many seats to sell at different prices for a particular flight departing at some point in the future. In the same class are problems that require reaching some goal (but not at a particular point in time). Examples include driving to a destination, selling a house, or winning a game. A second class of problems are actually infinite horizon, but where the goal is to determine what to do right now given a particular state of the system. For example, a transportation company might want to know what drivers should be assigned to a particular set of loads right now. Of course, these decisions need to consider the downstream impact, so models have to extend into the future. For this reason, we might model the problem over a horizon T which, when solved, yields a decision of what to do right now.
4.3.1 The optimality equations
The foundation of dynamic programming is the property that the optimality equations give you the optimal solution. Section (4.5) provides the core proofs, but there are some important principles that should be understood by any student interested in using dynamic programming.
We begin by writing the expected profits using policy π from time t onward:
F_t^π(s_t) = E^π { Σ_{t′=t}^{T−1} C_{t′}(S_{t′}, X_{t′}^π(S_{t′})) + C_T(S_T) | S_t = s_t }     (4.14)
F_t^π(s_t) is the expected total contribution if we are in state s_t in time t, and follow policy π from time t onward. If F_t^π(s_t) were easy to calculate, we would probably not need dynamic programming. Instead, it seems much more natural to calculate V_t^π recursively using:

V_t^π(s_t) = C_t(s_t, X_t^π(s_t)) + E{ V_{t+1}^π(S_{t+1}) | s_t }     (4.15)
Our first step is to establish the equivalence between F_t^π and V_t^π:

Proposition 4.3.1 F_t^π(s_t) = V_t^π(s_t).

The proof, given in section 4.5.1, uses induction: assume the result is true for V_{t+1}^π, and then show that it is true for V_t^π. Not surprisingly, inductive proofs are very popular in dynamic programming.
Proposition 4.3.1 is one of those small results that is easy to overlook. It establishes the equivalence between the value function for a policy and the value of a policy. With this result in hand, we can then establish the key theorem:

Theorem 4.3.1 Let V_t(s_t) be a solution to equation (4.3) (or (4.7)). Then

    F_t^*(s_t) = max_{π∈Π} F_t^π(s_t) = V_t(s_t)
Theorem 4.3.1 says that the value of following the optimal policy over the horizon is the same as the solution to the optimality equations, which establishes that if we solve the optimality equations, then we know the value of the optimal policy. We should also note, however, that while an optimal solution may exist, an optimal policy may not. While such issues are of tremendous importance to the theory of Markov decision policies, they are rarely an issue in practical applications. Theorem 4.3.1 also expresses a fundamental property of dynamic programs that was first observed by Bellman. It says that the optimal policy is the same as taking the best decision given the state you are in and then following the optimal policy from then on.
4.3.2
Backward dynamic programming
When we encounter a finite horizon problem, we assume that we are given the function V_T(S_T) as data. Often, we simply use V_T(S_T) = 0 because we are primarily interested in what to do now, given by x_0, or in projected activities over some horizon t = 0, 1, . . . , T^ph where T^ph is the length of a planning horizon. If we set T sufficiently larger than T^ph, then we may be able to assume that the decisions x_0, x_1, . . . , x_{T^ph} are of sufficiently high quality to be useful.

Solving a finite horizon problem, in principle, is straightforward. As outlined in figure 4.1, we simply have to start at the last time period, compute the value function for each possible state s ∈ S, and then step back another time period. This way, at time period t we have already computed V_{t+1}(S).

Step 0. Initialization: Initialize the terminal contribution V_T(s_T). Set t = T − 1.

Step 1. Calculate:

    V_t(s_t) = max_{x_t} ( C_t(s_t, x_t) + γ Σ_{s'∈S} p(s'|s_t, x_t) V_{t+1}(s') )

for all s_t ∈ S.

Step 2. If t > 0, decrement t and return to step 1. Else, stop.
Figure 4.1: A backward dynamic programming algorithm

One of the most popular illustrations of dynamic programming is the discrete asset acquisition problem (popularly known in the operations research community as the inventory planning problem). Assume that you order a quantity x_t at each time period to be used in the next time period to satisfy a demand D̂_{t+1}. Any unused product is held over to the following time period. For this, our "state variable" S_t is the quantity of inventory left over at the end of the period after demands are satisfied. The transition equation is given by S_{t+1} = [S_t + x_t − D̂_{t+1}]^+ where [x]^+ = max{x, 0}. The cost function (which we seek to minimize) is given by Ĉ_{t+1}(S_t, x_t, D̂_{t+1}) = c^h S_t + c^o I_{x_t>0}, where I_X = 1 if X is true and 0 otherwise. Note that the cost function is nonconvex. This does not cause us a problem if we solve our minimization problem by searching over different (discrete) values of x_t. Since all our quantities are scalar, there is no difficulty finding C_t(S_t, x_t) = E{Ĉ_{t+1}(S_t, x_t, D̂_{t+1})}. The one-step transition matrix is computed using:

    p(s'|s, x) = Σ_{ω∈Ω} Prob(D̂_{t+1} = ω) 1_{{s' = [s+x−ω]^+}}

where Ω is the set of (discrete) outcomes of the demand D̂_{t+1}.
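As an illustration, here is a small Python sketch of the backward recursion of figure 4.1 applied to the asset acquisition problem above. The horizon, inventory cap, costs and demand distribution are made-up values chosen only to make the sketch runnable; the state space is truncated at S_max for simplicity.

    # Backward dynamic programming sketch for the discrete asset acquisition problem
    # (illustrative data only; the real problem need not be truncated like this).
    T, gamma = 10, 0.95
    S_max, x_max = 20, 10                      # largest inventory and order quantity considered
    c_h, c_o = 1.0, 5.0                        # holding cost and fixed ordering cost
    demand = {d: 1.0 / 6.0 for d in range(6)}  # D_{t+1} uniform on {0,...,5}

    V = [dict() for _ in range(T + 1)]
    V[T] = {s: 0.0 for s in range(S_max + 1)}  # terminal values V_T(s) = 0

    for t in range(T - 1, -1, -1):             # step backward through time
        for s in range(S_max + 1):
            best = float('inf')
            for x in range(x_max + 1):
                cost = 0.0
                for d, p in demand.items():    # expectation over the demand outcomes
                    s_next = min(max(s + x - d, 0), S_max)
                    cost += p * (c_h * s + c_o * (x > 0) + gamma * V[t + 1][s_next])
                best = min(best, cost)
            V[t][s] = best                     # V_t(s) from the recursion in figure 4.1
    print(V[0][0])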
Another example is the shortest path problem with random arc costs. Assume that you are trying to get from origin node r to destination node s in the shortest time possible. As you reach each intermediate node i, you are able to observe the time required to traverse each arc out of node i. Let V_j be the expected shortest path time from j to the destination node s. At node i, you see the arc time τ̂_ij(ω) and then choose to traverse the arc (i, j*(ω)) where j*(ω) solves min_j { τ̂_ij(ω) + V_j }. We would then compute the value of being at node i using V_i = E{ min_j ( τ̂_ij(ω) + V_j ) }.
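A small Monte Carlo sketch of this calculation is given below; the successor list, downstream values and exponential arc times are illustrative assumptions only.

    import random

    # Estimate V_i = E[ min_j ( tau_ij(omega) + V_j ) ] by sampling arc times.
    def node_value(successors, V, sample_arc_time, n_samples=1000):
        total = 0.0
        for _ in range(n_samples):
            total += min(sample_arc_time(j) + V[j] for j in successors)
        return total / n_samples

    # made-up example: two downstream nodes with exponential arc times
    V = {'j1': 5.0, 'j2': 3.0}
    print(node_value(['j1', 'j2'], V, lambda j: random.expovariate(1.0)))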
4.4
Infinite horizon problems
Infinite horizon problems arise whenever we wish to study a stationary problem in steady state. More importantly, infinite horizon problems provide a number of insights into the properties of problems and algorithms, drawing off an elegant theory that has evolved around this problem class. Even students who wish to solve complex, nonstationary problems will benefit from an understanding of this problem class. We begin with the optimality equations:

    V_t(s_t) = max_{x∈X} E { C_t(s_t, x) + γ V_{t+1}(s_{t+1}) | s_t }
We can think of a steady state problem as one without the time dimension. Letting V(s) = lim_{t→∞} V_t(s_t) (and assuming the limit exists), we obtain the steady state optimality equations:

    V(s) = max_{x∈X} ( c(s, x) + γ Σ_{s'∈S} p(s'|s, x) V(s') )    (4.16)
The functions V(s) can be shown (as we do later) to be equivalent to solving the infinite horizon problem:

    max_{π∈Π} E { Σ_{t=0}^∞ γ^t C_t(S_t, X_t^π(S_t)) }
Now define:

    P^{π,t} = the t-step transition matrix, over periods 0, 1, . . . , t − 1, given policy π
            = Π_{t'=0}^{t−1} P^π_{t'}.

We further define P^{π,0} to be the identity matrix. Now let:

    c_t^π = C_t(s_t, X^π(s_t))    (4.17)
be the column vector of the expected cost of being in each state given that we choose the action x_t described by policy π. The infinite horizon, discounted value of a policy π starting at time t is given by:

    v_t^π = Σ_{t'=t}^∞ γ^{t'−t} P^{π,t'−t} c_{t'}^π    (4.18)
Assume that after following policy π0 we follow policy π1 = π2 = . . . = π. In this case, equation (4.18) can be now written as:
    v^{π_0} = c^{π_0} + Σ_{t'=1}^∞ γ^{t'} P^{π,t'} c_{t'}^π    (4.19)
            = c^{π_0} + Σ_{t'=1}^∞ γ^{t'} ( Π_{t''=0}^{t'−1} P^{π_{t''}} ) c_{t'}^π    (4.20)
            = c^{π_0} + γ P^{π_0} Σ_{t'=1}^∞ γ^{t'−1} ( Π_{t''=1}^{t'−1} P^{π_{t''}} ) c_{t'}^π    (4.21)
            = c^{π_0} + γ P^{π_0} v^{π_1}    (4.22)
Equation (4.22) shows us that the value of a policy is the single period reward plus a discounted terminal reward that is the same as the value of a policy starting at time 1. If our decision rule is stationary, then π0 = π1 = . . . = πt = π, which allows us to rewrite (4.22) as: v π = cπ + γP π v π
(4.23)
This allows us to solve for the stationary reward explicitly (as long as 0 ≤ γ < 1): v π = (I − γP π )−1 cπ
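For a small problem this evaluation is a single linear solve. The sketch below uses made-up three-state data; solving the linear system is generally preferable to forming the inverse explicitly.

    import numpy as np

    # Evaluate a fixed policy via v^pi = (I - gamma P^pi)^{-1} c^pi (illustrative data only).
    gamma = 0.9
    P_pi = np.array([[0.8, 0.2, 0.0],
                     [0.1, 0.6, 0.3],
                     [0.0, 0.3, 0.7]])      # one-step transition matrix under policy pi
    c_pi = np.array([1.0, 2.0, 0.5])        # expected one-period reward in each state

    v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, c_pi)   # solve rather than invert
    print(v_pi)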
We can also write an infinite horizon version of the optimality equations as we did earlier. Letting M be the “max” (or “min”) operator, the infinite horizon version of equation (4.9) would be written: Mπ (v) = cπ + γP π v
(4.24)
There are several algorithmic strategies for solving infinite-horizon problems. The first, value iteration, is the most widely used method. It involves iteratively estimating the value function. At each iteration, the estimate of the value function determines which decisions we will make and as a result defines a policy. The second strategy is policy iteration. At every iteration, we define a policy (literally, the rule for determining decisions) and then determine the value function for that policy. Careful examination of value and policy iteration reveals that these are closely related strategies that can be viewed as special cases of a general strategy that uses value and policy iteration. Finally, the third major algorithmic strategy exploits the observation that the value function can be viewed as the solution to a specially structured linear programming problem.

Step 0: Initialization: Set v^0(s) = 0 ∀s ∈ S. Set n = 1. Fix a tolerance parameter ε > 0.

Step 1: For each s ∈ S compute:

    v^{n+1}(s) = max_{x∈X} ( c(s, x) + γ Σ_{s'∈S} p(s'|s, x) v^n(s') )    (4.25)

Let x^{n+1} be the decision vector that solves equation (4.25).

Step 2: If ||v^{n+1} − v^n|| < ε(1 − γ)/2γ, set x^ε = x^{n+1}, v^ε = v^{n+1} and stop; else set n = n + 1 and go to step 1.

Figure 4.2: The value iteration algorithm for infinite horizon optimization
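A compact Python sketch of this algorithm, for a made-up two-state, two-action problem (P[x] is the transition matrix and c[x] the reward vector under decision x), might look as follows; none of the data come from the book.

    import numpy as np

    # Value iteration sketch in the spirit of figure 4.2 (illustrative data only).
    def value_iteration(P, c, gamma, eps=1e-4):
        v = np.zeros(c[0].shape[0])                      # Step 0: v^0 = 0
        while True:
            Q = np.array([c[x] + gamma * P[x] @ v for x in range(len(P))])
            v_new = Q.max(axis=0)                        # Step 1: Bellman update for every state
            if np.max(np.abs(v_new - v)) < eps * (1 - gamma) / (2 * gamma):
                return v_new, Q.argmax(axis=0)           # Step 2: stop; values and greedy policy
            v = v_new

    P = [np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([[0.5, 0.5], [0.6, 0.4]])]
    c = [np.array([1.0, 0.0]), np.array([0.5, 2.0])]
    print(value_iteration(P, c, gamma=0.9))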
4.4.1
Value iteration
Value iteration is perhaps the most widely used algorithm in dynamic programming because it is the simplest to implement and, as a result, often tends to be the most natural way of solving many problems. It is virtually identical to backward dynamic programming for finite horizon problems. In addition, most of our work in approximate dynamic programming is based on value iteration.

Basic value iteration

Value iteration comes in several flavors. The basic version of the value iteration algorithm is given in figure 4.2. It is easy to see that the value iteration algorithm is similar to the backward dynamic programming algorithm. Rather than using a subscript t, which we decrement from T back to 0, we use an iteration counter n that starts at 0 and increases until we satisfy a convergence criterion.

A slight variant of the value iteration algorithm provides a somewhat faster rate of convergence. In this version (typically called the Gauss-Seidel variant), we take advantage of the fact that when we are computing the expectation of the value of the future, we have
to loop over all the states s' to compute Σ_{s'} p(s'|s, x) v^n(s'). For a particular state s, we would have already computed v^{n+1}(ŝ) for ŝ = 1, 2, . . . , s − 1. By simply replacing v^n(ŝ) with v^{n+1}(ŝ) for the states we have already visited (as in figure 4.3), we obtain an algorithm that typically exhibits a noticeably faster rate of convergence.

Relative value iteration

Another version of value iteration is called relative value iteration, which is useful in problems that do not have a discount factor or where the optimal policy converges much more quickly than the value function, which may grow steadily for many iterations. The relative value iteration algorithm is shown in figure 4.4. In relative value iteration, we focus on the fact that we are more interested in the convergence of the difference |v(s) − v(s')| than we are in the values of v(s) and v(s'). What often happens is that, especially toward the limit, all the values v(s) start increasing by the same rate. For this reason, we can pick any state (denoted s* in the algorithm) and subtract its value from all the other states.

Replace Step 1 with:

Step 1': For each s ∈ S compute:

    v^{n+1}(s) = max_{x∈X} ( c(s, x) + γ [ Σ_{s'<s} p(s'|s, x) v^{n+1}(s') + Σ_{s'≥s} p(s'|s, x) v^n(s') ] )
Figure 4.3: The Gauss-Seidel variation of value iteration

Step 0: Initialization:
• Choose some v^0 ∈ V.
• Choose a base state s* and a tolerance ε.
• Let w^0 = v^0 − v^0(s*)e where e is a vector of ones.
• Set n = 0.

Step 1: Set
    v^{n+1} = M w^n,
    w^{n+1} = v^{n+1} − v^{n+1}(s*)e.

Step 2: If sp(v^{n+1} − v^n) < ε(1 − γ)/γ, go to step 3; otherwise, go to step 1.

Step 3: Set x^ε = arg max_{x∈X} { c(x) + γ P^π v^n }.

Figure 4.4: Relative value iteration
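A sketch of the relative value iteration of figure 4.4, reusing the (P, c) data conventions of the value iteration sketch above, is given below; it is illustrative only.

    import numpy as np

    # Relative value iteration sketch in the spirit of figure 4.4 (illustrative data only).
    def relative_value_iteration(P, c, gamma, eps=1e-4, base_state=0):
        n = c[0].shape[0]
        w = np.zeros(n)                                   # w^0 with v^0 = 0
        v_prev = np.zeros(n)
        while True:
            v = np.array([c[x] + gamma * P[x] @ w for x in range(len(P))]).max(axis=0)  # v^{n+1} = M w^n
            diff = v - v_prev
            if diff.max() - diff.min() < eps * (1 - gamma) / gamma:   # sp(v^{n+1} - v^n)
                return np.array([c[x] + gamma * P[x] @ w
                                 for x in range(len(P))]).argmax(axis=0)  # greedy decision at termination
            w, v_prev = v - v[base_state], v              # subtract the base state value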
To provide a bit of formalism for our algorithm, we define the span of a vector v as follows:

    sp(v) = max_{s∈S} v(s) − min_{s∈S} v(s)

Here and throughout this section, we define the norm of a vector as:

    ||v|| = max_{s∈S} v(s)
Note that the span has the following six properties: 1) sp(v) ≥ 0. 2) sp(u + v) ≤ sp(u) + sp(v). 3) sp(kv) = |k|sp(v). 4) sp(v + ke) = sp(v). 5) sp(v) = sp(−v). 6) sp(v) ≤ 2kvk. Property (4) implies that sp(v) = 0 does not mean that v = 0 and therefore it does not satisfy the properties of a norm. For this reason, it is called a semi-norm. The relative value iteration algorithm is simply subtracting a constant from the value vector at each iteration. Obviously, this does not change the optimal decision, but it does change the value itself. If we are only interested in the optimal policy, relative value iteration often offers much faster convergence, but it may not yield accurate estimates of the value of being in each state. Bounds and rates of convergence One important property of value iteration algorithms is that if our initial estimate is too low, the algorithm will rise to the correct value from below. Similarly, if our initial estimate is too high, the algorithm will approach the correct value from above. This property is formalized in the following theorem: Theorem 4.4.1 For a vector v ∈ V: a) If v satisfies v ≥ Mv, then v ≥ v ∗ . b) If v satisfies v ≤ Mv, then v ≤ v ∗ . c) If v satisfies v = Mv, then v is the unique solution to this system of equations and v = v∗.
The proof is given in section 4.5.2. The result is also true for finite horizon problems. It is a nice property because it provides some valuable information on the nature of the convergence path. In practice, we generally do not know the true value function, which makes it hard to know if we are starting from above or below (although some problems have natural bounds, such as nonnegativity). The proof of the monotonicity property above also provides us with a nice corollary. If V(s) = MV(s) for all s, then V(s) is the unique solution to this system of equations, which must also be the optimal solution. This result raises the question: what if some of our estimates of the value of being in some states are too high, while others are too low? In this case, we might cycle above and below the true estimate before settling in on the final solution.

Value iteration also provides a nice bound on the quality of the solution. Recall that when we use the value iteration algorithm, we stop when

    ||v^{n+1} − v^n|| < ε(1 − γ)/2γ    (4.26)

where γ is our discount factor and ε is a specified error tolerance. It is possible that we have found the optimal policy when we stop, but it is very unlikely that we have found the optimal value functions. We can, however, provide a bound on the gap between the solution we terminate with, v^n, and the optimal values v* by using the following theorem:

Theorem 4.4.2 If we apply the value iteration algorithm with stopping parameter ε and the algorithm terminates at iteration n with value function v^{n+1}, then:

    ||v^{n+1} − v*|| ≤ ε/2    (4.27)
Let x^ε be the decision rule that we terminate with, and let v^{π^ε} be the value of this policy. Then:

    ||v^{π^ε} − v*|| ≤ ε

The proof is given in section 4.5.2. While it is nice that we can bound the error, the bad news is that the bound can be quite poor. More important is what the bound teaches us about the role of the discount factor. We can provide some additional insights into the bound, as well as the rate of convergence, by considering a trivial dynamic program. In this problem, we receive a constant reward c at every iteration. There are no decisions, and there is no randomness. The value of this "game" is quickly seen to be:

    v* = Σ_{n=0}^∞ γ^n c = c / (1 − γ)    (4.28)
Consider what happens when we solve this problem using value iteration. Starting with v^0 = 0, we would use the iteration:

    v^n = c + γ v^{n−1}

After we have repeated this n times, we have:

    v^n = Σ_{m=0}^{n−1} γ^m c = (1 − γ^n) c / (1 − γ)    (4.29)
Comparing equations (4.28) and (4.29), we see that:

    v^n − v* = −γ^n c / (1 − γ)    (4.30)
Similarly, the change in the value from one iteration to the next is given by:

    ||v^{n+1} − v^n|| = | γ^{n+1}/(1 − γ) − γ^n/(1 − γ) | c
                     = γ^n | γ/(1 − γ) − 1/(1 − γ) | c
                     = γ^n | (γ − 1)/(1 − γ) | c
                     = γ^n c

If we stop at iteration n + 1, then it means that:

    γ^n c ≤ ε(1 − γ)/(2γ)    (4.31)
If we choose ε so that (4.31) holds with equality, then our error bound (from (4.27)) is:

    ||v^{n+1} − v*|| ≤ ε/2 = γ^{n+1} c / (1 − γ)    (4.32)

From (4.30), we know that the distance to the optimal solution is:

    |v^{n+1} − v*| = γ^{n+1} c / (1 − γ)    (4.33)
Step 0: Initialization:
• Set n = 0.
• Select a policy π^0.

Step 1: Let v^n be the solution to:

    (I − γ P^{π^n}) v^n = c^{π^n}    (4.34)

Step 2: Find a policy π^{n+1} that satisfies:

    x^{n+1}(s) = arg max_{x∈X} { c(x) + γ P^π v^n }    (4.35)

which must be solved for each state s.

Step 3: If x^{n+1}(s) = x^n(s) for all states s, then set x* = x^{n+1}; otherwise, set n = n + 1 and go to step 1.
Figure 4.5: Policy iteration

which matches our bound. This little exercise confirms that our bound on the error may be tight. It also shows that the error decreases geometrically at a rate determined by the discount factor. For this problem, the error arises because we are approximating an infinite sum with a finite one. For more realistic dynamic programs, we also have the effect of trying to find the optimal policy. When the values are close enough that we have, in fact, found the optimal policy, then we have only a Markov reward process (a Markov chain where we earn rewards for each transition). Once our Markov reward process has reached steady state, it will behave just like the simple problem we have just solved, where c is the expected reward from each transition.
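The geometric behavior is easy to verify numerically. The following few lines (with made-up values of c and γ) iterate v^n = c + γ v^{n−1} and compare the observed error with the formula in (4.30).

    # Numerical check of the constant-reward example (illustrative values only).
    c, gamma = 10.0, 0.9
    v_star = c / (1 - gamma)
    v = 0.0
    for n in range(1, 51):
        v = c + gamma * v                  # value iteration step v^n = c + gamma v^{n-1}
        if n % 10 == 0:
            print(n, v_star - v, c * gamma**n / (1 - gamma))   # error and the formula (4.30)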
4.4.2
Policy iteration
In policy iteration, we choose a policy and then find the infinite horizon, discounted value of the policy. This value is then used to choose a new policy. The general algorithm is described in figure 4.5. It is useful to illustrate the policy iteration algorithm in different settings. In the first, consider a batch replenishment problem where we have to replenish resources (raising capital, exploring for oil to expand known reserves, hiring people) where there are economies from ordering larger quantities. We might use a simple policy where if our level of resources R_t < s for some lower limit s, we order a quantity x_t = S − R_t. This is known as an (s, S) policy and is written:

    X^π(R_t) = { 0          if R_t ≥ s
               { S − R_t    if R_t < s        (4.36)
The parameters (s, S) constitute a policy π. For a given policy π^n = (s^n, S^n), we can compute a one-step transition matrix P^{π^n}. We then find the steady-state value of being in each state R given this policy by solving equation (4.34). Given these values, we can find a new set of actions x^{n+1} for each state s, which represents a new policy. It can be shown that this vector x^{n+1} will have the same structure as equation (4.36), so we can infer (s^{n+1}, S^{n+1}) from (4.35). This example illustrates policy iteration when the policy can be represented as a table lookup function.

Now, consider an asset allocation problem where we face the problem of placing expensive components of the electric power grid around the country in case of equipment failures. If there is a failure, we will use the closest available replacement. As new components become available, we will place the new component in the empty location that has the highest value. The set of "values" for each location represents a policy. Assume that each time there is a failure, an order for a new component is placed. However, it takes a year to receive delivery. If v̄_i is the value of placing a component at location i, then the vector v̄ determines the rule for how components are placed around the country as they become available (as a result, v̄ determines our policy). Let S_{ti} = 1 if there is a spare component at location i after the tth arrival of a new component (which is when we have to make a decision). Given v̄ and a model of how failures arise, we can compute the probability transition matrix P^π and use (4.34) to determine the value of being in each state S_t (which consists of the vector of elements S_{ti}). We may then use the steady state vector v^n to infer a new set of values v̄_i^{n+1} that would then determine a new policy.

The policy iteration algorithm is simple to implement and has fast convergence when measured in terms of the number of iterations. However, solving equation (4.34) is quite hard if the number of states is large. If the state space is small, we can use v^π = (I − γP^π)^{−1} c^π, but the matrix inverse can be computationally expensive. For this reason, we may use a hybrid algorithm that combines the features of policy iteration and value iteration.
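The generic loop of figure 4.5 can be sketched in a few lines of Python for a small tabular problem; the (P, c) data below are the same made-up example used earlier and are not tied to either application above.

    import numpy as np

    # Policy iteration sketch in the spirit of figure 4.5 (illustrative data only).
    def policy_iteration(P, c, gamma):
        n_states, n_actions = c[0].shape[0], len(P)
        policy = np.zeros(n_states, dtype=int)                 # Step 0: arbitrary starting policy
        while True:
            P_pi = np.array([P[policy[s]][s] for s in range(n_states)])
            c_pi = np.array([c[policy[s]][s] for s in range(n_states)])
            v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, c_pi)   # Step 1: evaluate, as in (4.34)
            Q = np.array([c[x] + gamma * P[x] @ v for x in range(n_actions)])
            new_policy = Q.argmax(axis=0)                      # Step 2: improve, as in (4.35)
            if np.array_equal(new_policy, policy):             # Step 3: stop when the policy repeats
                return policy, v
            policy = new_policy

    P = [np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([[0.5, 0.5], [0.6, 0.4]])]
    c = [np.array([1.0, 0.0]), np.array([0.5, 2.0])]
    print(policy_iteration(P, c, gamma=0.9))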
4.4.3
Hybrid value-policy iteration
Value iteration is basically an algorithm that updates the value at each iteration and then determines a new policy given the new estimate of the value function. At any iteration, the value function is not the true, steady-state value of the policy. By contrast, policy iteration picks a policy and then determines the true, steady-state value of being in each state given the policy. Given this value, a new policy is chosen. It is perhaps not surprising that policy iteration converges faster in terms of the number of iterations because it is doing a lot more work in each iteration (determining the true, steady-state value of being in each state under a policy). Value iteration is much faster per iteration, but it is determining a policy given an approximation of a value function and then performing a very simple updating of the value function, which may be far from the true value function. A hybrid strategy that combines features of both methods is to perform a somewhat more complete update of the value function before performing an update of the policy. Figure 4.6
Step 0: Initialization:
• Set n = 1.
• Select a tolerance parameter ε and inner iteration limit M.
• Select some v^0 ∈ V.

Step 1: Find a decision x^n(s) for each s that satisfies:

    x^n(s) = arg max_{x∈X} ( c(s, x) + γ Σ_{s'∈S} p(s'|s, x) v^{n−1}(s') )

which we represent as policy π^n.

Step 2: Partial policy evaluation.
a) Set m = 0 and let u^n(0) = c^{π^n} + γ P^{π^n} v^{n−1}.
b) If ||u^n(0) − v^{n−1}|| < ε(1 − γ)/2γ, go to step 3. Else:
c) While m < M do the following:
   i) u^n(m + 1) = c^{π^n} + γ P^{π^n} u^n(m) = M^{π^n} u^n(m).
   ii) Set m = m + 1 and repeat (i).
d) Set v^n = u^n(M), n = n + 1 and return to step 1.

Step 3: Set x^ε = x^{n+1} and stop.
Figure 4.6: Hybrid value/policy iteration

outlines the procedure where the steady state evaluation of the value function in equation (4.34) is replaced with a much easier iterative procedure (Step 2 in figure 4.6). This step is run for M iterations where M is a user-controlled parameter that allows the exploration of the value of a better estimate of the value function. Not surprisingly, it will generally be the case that M should decline with the number of iterations as the overall process converges.
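A sketch of this hybrid scheme, again on the made-up tabular data conventions used earlier, is given below; M controls how much inner policy evaluation is done before the policy is updated.

    import numpy as np

    # Hybrid value/policy iteration sketch in the spirit of figure 4.6 (illustrative data only).
    def hybrid_iteration(P, c, gamma, M=20, eps=1e-4, max_iters=1000):
        n_states = c[0].shape[0]
        v = np.zeros(n_states)
        policy = np.zeros(n_states, dtype=int)
        for _ in range(max_iters):
            Q = np.array([c[x] + gamma * P[x] @ v for x in range(len(P))])
            policy = Q.argmax(axis=0)                       # Step 1: greedy policy for the current v
            P_pi = np.array([P[policy[s]][s] for s in range(n_states)])
            c_pi = np.array([c[policy[s]][s] for s in range(n_states)])
            u = c_pi + gamma * P_pi @ v                     # Step 2a
            if np.max(np.abs(u - v)) < eps * (1 - gamma) / (2 * gamma):
                return policy, u                            # Step 3
            for _ in range(M):                              # Step 2c: M partial evaluation sweeps
                u = c_pi + gamma * P_pi @ u
            v = u                                           # Step 2d
        return policy, v

    P = [np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([[0.5, 0.5], [0.6, 0.4]])]
    c = [np.array([1.0, 0.0]), np.array([0.5, 2.0])]
    print(hybrid_iteration(P, c, gamma=0.9))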
4.4.4
The linear programming formulation
Theorem 4.4.1 showed us that if v ≥ c + γPv, then v is an upper bound (actually, a vector of upper bounds) on the value of being in each state. This means that the optimal solution, which satisfies v* = c + γPv*, is the smallest value of v that satisfies this inequality. We can use this insight to formulate the problem of finding the optimal values as a linear program. Let β be a vector with elements β_s > 0, ∀s ∈ S. The optimal value function can be found by solving the following linear program:

    min_v βv    (4.37)

subject to:

    v ≥ c + γPv    (4.38)
    v ≥ 0    (4.39)
The linear program has an |S|-dimensional decision vector (the value of being in each state), with the inequality constraints (4.38). This formulation has been viewed as primarily a theoretical result since it was first suggested in 1960. The linear program can only be solved for problems with relatively small numbers of states. High quality solutions can be obtained more simply using value or policy iteration. However, recent research has suggested new approximate dynamic programming algorithms based on the linear programming formulation.
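For a tabular problem the formulation can be handed directly to an LP solver. The sketch below uses scipy.optimize.linprog on the same made-up two-state example as before; the weights β are arbitrary positive numbers.

    import numpy as np
    from scipy.optimize import linprog

    # Linear programming formulation (4.37)-(4.39) for a tiny example (illustrative data only).
    gamma = 0.9
    P = [np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([[0.5, 0.5], [0.6, 0.4]])]
    c = [np.array([1.0, 0.0]), np.array([0.5, 2.0])]
    beta = np.ones(2)                                   # any strictly positive weights

    # v >= c_x + gamma P_x v for every action x, rewritten as (gamma P_x - I) v <= -c_x
    A_ub = np.vstack([gamma * P[x] - np.eye(2) for x in range(len(P))])
    b_ub = np.concatenate([-c[x] for x in range(len(P))])

    res = linprog(beta, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)], method="highs")
    print(res.x)                                        # the optimal value function v*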
4.5
Why does it work?**
The theory of Markov decision processes is especially elegant. While not needed for computational work, an understanding of why they work will provide a deeper appreciation of the properties of these problems.
4.5.1
The optimality equations
Until now, we have been presenting the optimality equations as though they were a fundamental law of some sort. To be sure, they can easily look as though they were intuitively obvious, but it is still important to establish the relationship between the original optimization problem and the optimality equations. Since these equations are the foundation of dynamic programming, it behooves us to work through the steps of proving that they are actually true. We start by remembering the original optimization problem that we are trying to solve:

    F_t^π(s_t) = E { Σ_{t'=t}^{T−1} c_{t'}(s_{t'}, X_{t'}^π(S_{t'})) + C_T(s_T) | S_t = s_t }    (4.40)
Since (4.40) is, in general, exceptionally difficult to solve, we resort to the optimality equations:

    V_t^π(s_t) = C_t(s_t, X_t^π(s_t)) + E { V_{t+1}^π(S_{t+1}) | s_t }    (4.41)
Our challenge is to establish that these are the same. In order to establish this result, it is going to help if we first prove the following:

Lemma 4.5.1 Let S_t be a state variable that captures the relevant history up to time t, and let F_{t'}(S_{t+1}) be some function measured at time t' ≥ t + 1 conditioned on the random variable S_{t+1}. Then:

    E[ E{F_{t'} | S_{t+1}} | S_t ] = E[ F_{t'} | S_t ]    (4.42)
Proof: Assume, for simplicity, that F_{t'} is a discrete, finite random variable that takes outcomes in F. We start by writing:

    E{F_{t'} | S_{t+1}} = Σ_{f∈F} f P(F_{t'} = f | S_{t+1})    (4.43)

Recognizing that S_{t+1} is a random variable, we may take the expectation of both sides of (4.43), conditioned on S_t, as follows:

    E[ E{F_{t'} | S_{t+1}} | S_t ] = Σ_{s_{t+1}∈S} Σ_{f∈F} f P(F_{t'} = f | s_{t+1}, S_t) P(S_{t+1} = s_{t+1} | S_t)    (4.44)

First, we observe that we may write P(F_{t'} = f | s_{t+1}, s_t) = P(F_{t'} = f | s_{t+1}), because conditioning on S_{t+1} makes all prior history irrelevant. Next, we can reverse the summations on the right hand side of (4.44) (some technical conditions have to be satisfied to do this, but let us assume that all our functions are "nice"). This means:

    E[ E{F_{t'} | S_{t+1} = s_{t+1}} | S_t ] = Σ_{f∈F} Σ_{s_{t+1}∈S} f P(F_{t'} = f | s_{t+1}, S_t) P(S_{t+1} = s_{t+1} | S_t)    (4.45)
                                             = Σ_{f∈F} f Σ_{s_{t+1}∈S} P(F_{t'} = f, s_{t+1} | S_t)    (4.46)
                                             = Σ_{f∈F} f P(F_{t'} = f | S_t)    (4.47)
                                             = E[ F_{t'} | S_t ]    (4.48)
which proves our result. Note that the essential step in the proof occurs in equation (4.45) when we add S_t to the conditioning. We are now ready to show:

Proposition 4.5.1 F_t^π(s_t) = V_t^π(s_t).

Proof: To prove that (4.40) and (4.41) are equal, we use a standard trick in dynamic programming: proof by induction. Clearly, F_T^π(s_T) = V_T^π(s_T) = C_T(s_T). Next, assume that it holds for t + 1, t + 2, . . . , T. We want to show that it is true for t. This means that we can write:

    V_t^π(s_t) = C_t(s_t, X_t^π(s_t)) + E [ E { Σ_{t'=t+1}^{T−1} c_{t'}(s_{t'}, X_{t'}^π(s_{t'})) + C_T(s_T(ω)) | s_{t+1} } | s_t ]    (4.49)

where the inner expectation is F_{t+1}^π(s_{t+1}).
We then use lemma 4.5.1 to write E[ E{. . . | s_{t+1}} | s_t ] = E[ . . . | s_t ]. Hence,

    V_t^π(s_t) = C_t(s_t, X_t^π(s_t)) + E [ Σ_{t'=t+1}^{T−1} c_{t'}(s_{t'}, X_{t'}^π(s_{t'})) + C_T(s_T) | s_t ]    (4.50)

When we condition on s_t, X_t^π(s_t) (and therefore C_t(s_t, X_t^π(s_t))) is deterministic, so we can pull the expectation out to the front giving:

    V_t^π(s_t) = E [ Σ_{t'=t}^{T−1} c_{t'}(s_{t'}, X_{t'}^π(s_{t'})) + C_T(s_T) | s_t ]    (4.51)
               = F_t^π(s_t)    (4.52)
which proves our result.
The expectation in equation (4.41) provides for a significant level of generality. For example, we might have a history dependent process where we would write:

    V_t^π(h_t) = C_t(h_t, X_t^π(s_t)) + E { V_{t+1}^π(h_{t+1}) | h_t }    (4.53)

where h_{t+1} = (h_t, ω_t) (if we are using the exogenous stochastic process). When we have a Markovian process, we can express the conditional expectation in (4.41) using a one-step transition matrix:

    V_t^π(s_t) = C_t(s_t, X^π(s_t)) + Σ_{s'∈S} p_t(s'|s_t, X^π(s_t)) V_{t+1}^π(s')    (4.54)
Using equation (4.41), we have a backward recursion for calculating V_t^π(s_t) for a given policy π. Now that we can find the expected reward for a given π, we would like to find the best π. That is, we want to find:

    F_t^*(s_t) = max_{π∈Π} F_t^π(s_t)

As before, if the set Π is infinite, we replace the "max" with "sup". We solve this problem by solving the optimality equations. These are:

    V_t(s_t) = max_{x∈X} ( C_t(s_t, x) + Σ_{s'∈S} p_t(s'|s_t, x) V_{t+1}(s') )    (4.55)

We are claiming that if we find the set of V_t's that solves (4.55), then we have found the policy that optimizes F_t^π. We state this claim formally as:
Theorem 4.5.1 Let V_t(s_t) be a solution to equation (4.55). Then

    F_t^*(s_t) = V_t(s_t) = max_{π∈Π} F_t^π(s_t)
Proof: The proof is in two parts. First, we show by induction that V_t(s_t) ≥ F_t^*(s_t) for all s_t ∈ S and t = 0, 1, . . . , T − 1. Then, we show that the reverse inequality is true, which gives us the result.

Part 1: We resort again to our proof by induction. Since V_T(s_T) = C_T(s_T) = F_T^π(s_T) for all s_T and all π ∈ Π, we get that V_T(s_T) = F_T^*(s_T). Assume that V_t(s_t) ≥ F_t^*(s_t) for t = n + 1, n + 2, . . . , T, and let π be an arbitrary policy. For t = n, the optimality equation tells us:

    V_n(s_n) = max_{x∈X} ( c_n(s_n, x) + Σ_{s'∈S} p_n(s'|s_n, x) V_{n+1}(s') )

By the induction hypothesis, F_{n+1}^*(s) ≤ V_{n+1}(s), so we get:

    V_n(s_n) ≥ max_{x∈X} ( c_n(s_n, x) + Σ_{s'∈S} p_n(s'|s_n, x) F_{n+1}^*(s') )

Of course, we have that F_{n+1}^*(s) ≥ F_{n+1}^π(s) for an arbitrary π. Also let X^π(s_n) be the decision that would be chosen by policy π when in state s_n. Then:

    V_n(s_n) ≥ max_{x∈X} ( c_n(s_n, x) + Σ_{s'∈S} p_n(s'|s_n, x) F_{n+1}^π(s') )    for all π ∈ Π
             ≥ c_n(s_n, X^π(s_n)) + Σ_{s'∈S} p_n(s'|s_n, X^π(s_n)) F_{n+1}^π(s')
             = F_n^π(s_n)

This means:

    V_n(s_n) ≥ F_n^π(s_n)

which proves part 1.
Part 2: Now we are going to prove the inequality from the other side. Specifically, we want to show that for any ε > 0 there exists a policy π that satisfies:

    F_n^π(s_n) + (T − n)ε ≥ V_n(s_n)    (4.56)

To do this, we start with the definition:

    V_n(s_n) = max_{x∈X} ( c_n(s_n, x) + Σ_{s'∈S} p_n(s'|s_n, x) V_{n+1}(s') )    (4.57)

We may let x_n(s_n) be the decision rule that solves (4.57). This rule corresponds to the policy π. In general, the set X may be infinite, whereupon we have to replace the "max" with a "sup" and handle the case where an optimal decision may not exist. For this case, we know that we can design a decision rule x_n(s_n) that returns a decision x that satisfies:

    V_n(s_n) ≤ c_n(s_n, x) + Σ_{s'∈S} p_n(s'|s_n, x) V_{n+1}(s') + ε    (4.58)

We can prove (4.56) by induction. Assume that it is true for t = n + 1, n + 2, . . . , T. We already know that

    F_n^π(s_n) = c_n(s_n, X^π(s_n)) + Σ_{s'∈S} p_n(s'|s_n, X^π(s_n)) F_{n+1}^π(s')

We can use our induction hypothesis, which says F_{n+1}^π(s') ≥ V_{n+1}(s') − (T − (n + 1))ε, to get:

    F_n^π(s_n) ≥ c_n(s_n, X^π(s_n)) + Σ_{s'∈S} p_n(s'|s_n, X^π(s_n)) [ V_{n+1}(s') − (T − (n + 1))ε ]
               = c_n(s_n, X^π(s_n)) + Σ_{s'∈S} p_n(s'|s_n, X^π(s_n)) V_{n+1}(s') − Σ_{s'∈S} p_n(s'|s_n, X^π(s_n)) (T − n − 1)ε
               = ( c_n(s_n, X^π(s_n)) + Σ_{s'∈S} p_n(s'|s_n, X^π(s_n)) V_{n+1}(s') + ε ) − (T − n)ε

Now, using equation (4.58), we replace the term in brackets with the smaller V_n(s_n):

    F_n^π(s_n) ≥ V_n(s_n) − (T − n)ε

which proves the induction hypothesis. We have shown that:

    F_n^*(s_n) + (T − n)ε ≥ F_n^π(s_n) + (T − n)ε ≥ V_n(s_n) ≥ F_n^*(s_n)
This proves the result.
Now we know that solving the optimality equations also gives us the optimal value function. This is our most powerful result because we can solve the optimality equations for many problems that cannot be solved any other way.
4.5.2
Proofs for value iteration
Infinite horizon dynamic programming provides a compact way to study the theoretical properties of these algorithms. The insights gained here are applicable to problems even when we cannot apply this model, or these algorithms, directly. Our first result establishes a monotonicity property that can be exploited in the design of an algorithm:

Theorem 4.5.2 For a vector v ∈ V:

a) If v satisfies v ≥ Mv, then v ≥ v*.
b) If v satisfies v ≤ Mv, then v ≤ v*.
c) If v satisfies v = Mv, then v is the unique solution to this system of equations and v = v*.

Proof: Part (a) requires that:

    v ≥ max_{π∈Π} { c^π + γ P^π v }    (4.59)
      ≥ c^{π_0} + γ P^{π_0} v    (4.60)
      ≥ c^{π_0} + γ P^{π_0} (c^{π_1} + γ P^{π_1} v) = c^{π_0} + γ P^{π_0} c^{π_1} + γ² P^{π_0} P^{π_1} v    (4.61)
Equation (4.59) is true by assumption (part (a) of the theorem) and equation (4.60) is true because π0 is some policy that is not necessarily optimal for the vector v. Using similar reasoning, equation (4.61) is true because π1 is another policy which, again, is not necessarily optimal. Using P π,(t) = P π0 P π1 · · · P πt , we obtain by induction: v ≥ cπ0 + γP π0 cπ1 + · · · + γ t−1 P π0 P π1 · · · P πt−1 cπt + γ t P π,(t) v
(4.62)
Recall that:

    v^π = Σ_{t=0}^∞ γ^t P^{π,(t)} c^{π_t}    (4.63)
Breaking the sum in (4.63) into two parts allows us to rewrite the expansion in (4.62) as:

    v ≥ v^π − Σ_{t'=t+1}^∞ γ^{t'} P^{π,(t')} c^{π_{t'+1}} + γ^t P^{π,(t)} v    (4.64)

Taking the limit of both sides of (4.64) as t → ∞ gives us:

    v ≥ lim_{t→∞} [ v^π − Σ_{t'=t+1}^∞ γ^{t'} P^{π,(t')} c^{π_{t'+1}} + γ^t P^{π,(t)} v ]    (4.65)
      ≥ v^π    ∀π ∈ Π    (4.66)
The limit in (4.65) exists as long as the reward function c^π is bounded and γ < 1. Because (4.66) is true for all π ∈ Π, it is also true for the optimal policy, which means that:

    v ≥ v^{π*} = v*

which proves part (a) of the theorem. Part (b) can be proved in an analogous way. Parts (a) and (b) mean that v ≥ v* and v ≤ v*. If v = Mv, then we satisfy the preconditions of both parts (a) and (b), which means they are both true and therefore we must have v = v*.

This result means that if we start with a vector that is higher than the optimal vector, then we will decline monotonically to the optimal solution (almost – we have not quite proven that we actually get to the optimal). Alternatively, if we start below the optimal vector, we will rise to it. Note that it is not always easy to find a vector v that satisfies either condition (a) or (b) of the theorem. In problems where the rewards can be positive and negative, this can be tricky.

We now undertake the proof that the basic value function iteration converges to the optimal solution. This is not only an important result, it is also an elegant one that brings some powerful theorems into play. The proof is also quite short. However, we will need some mathematical preliminaries:

Definition 4.5.1 Let V be a set of (bounded, real-valued) functions and define the norm of v by:

    ||v|| = sup_{s∈S} v(s)
where we replace the “sup” with a “max” when the state space is finite. Since V is closed under addition and scalar multiplication and has a norm, it is a normed linear space.
Definition 4.5.2 T : V → V is a contraction mapping if there exists a γ, 0 ≤ γ < 1, such that:

    ||Tv − Tu|| ≤ γ ||v − u||

Definition 4.5.3 A sequence v_n ∈ V, n = 1, 2, . . . is said to be a Cauchy sequence if for all ε > 0, there exists N such that for all n, m ≥ N:

    ||v_n − v_m|| < ε

Definition 4.5.4 A normed linear space is complete if every Cauchy sequence contains a limit point in that space.

Definition 4.5.5 A Banach space is a complete normed linear space.

Definition 4.5.6 We define the norm of a matrix Q as:

    ||Q|| = max_{s∈S} Σ_{j∈S} |q(j|s)|
which is to say, the largest row sum of the matrix. If Q is a one-step transition matrix, then ||Q|| = 1.

Definition 4.5.7 The triangle inequality, which is satisfied by the Euclidean norm as well as many others, means that given two vectors a and b, ||a + b|| ≤ ||a|| + ||b||.

[ . . . ]

Here, θ̄* is a deterministic quantity that does not depend on the sample path. Because of the restriction p(ω) > 0, we accept that in theory, there could exist a sample outcome that can never occur that would produce a path that converges to some other point. As a result, we say that the convergence is "almost sure," which is universally abbreviated as "a.s." Almost sure convergence establishes the core theoretical property that the algorithm will eventually settle in on a single point. This is an important property for an algorithm, but it says nothing about the rate of convergence.

Let x ∈ R^n. At each iteration n, we sample some random variables to compute the function (and its gradient). The sample realizations are denoted by ω^n. We let ω = (ω^1, ω^2, . . .) be a realization of all the random variables over all iterations. Let Ω be the set of all possible realizations of ω, and let F be the σ-algebra on Ω (that is to say, the set of all possible events that can be defined using Ω). We need the concept of the history up through iteration n. Let:

    H^n = a random variable giving the history of all random variables up through iteration n.

A sample realization of H^n would be:

    h^n = H^n(ω) = (ω^1, ω^2, . . . , ω^n)
We could then let Ωn be the set of all outcomes of the history (that is, hn ∈ H n ) and let Hn be the σ-algebra on Ωn (which is the set of all events, including their complements and unions, defined using the outcomes in Ωn ). Although we could do this, this is not the convention followed in the probability community. Instead, we define a sequence of σ-algebras F 1 , F 2 , . . . , F n as the sequence of σ-algebras on Ω that can be generated as we have access to the information through the first 1, 2, . . . , n iterations, respectively. What does this mean? Consider two outcomes ω 6= ω 0 for which H n (ω) = H n (ω 0 ). If this is the case, then any event in F n that includes ω must also include ω 0 . If we say that a function is F n -measurable, then this means that it must be defined in terms of the events in F n , which is in turn equivalent to saying that we cannot be using any information from iterations n + 1, n + 2, . . .. We would say, then, that we have a standard probability space (Ω, F, P) where ω ∈ Ω represents an elementary outcome, F is the σ-algebra on F and P is a probability measure on Ω. Since our information is revealed iteration by iteration, we would also then say that we have an increasing set of σ-algebras F 1 ⊆ F 2 ⊆ . . . ⊆ F n .
6.8.2
An older proof
Enough with probabilistic preliminaries. Let F(θ, ω) be an F-measurable function. We wish to solve the unconstrained problem:

    max_θ E { F(θ, ω) }    (6.72)
with θ¯∗ being the optimal solution. Let g(θ, ω) be a stochastic ascent vector that satisfies: g(θ, ω)T ∇F (θ, ω) ≥ 0
(6.73)
For many problems, the most natural ascent vector is the gradient itself: g(θ, ω) = ∇F (θ, ω)
(6.74)
which clearly satisfies (6.73). We assume that F(θ) = E{F(θ, ω)} is continuously differentiable and convex, with bounded first and second derivatives so that for finite M:

    −M ≤ g(θ, ω)^T ∇²F(θ) g(θ, ω) ≤ M
(6.75)
A stochastic gradient algorithm (sometimes called a stochastic approximation method) is given by: θ¯n = θ¯n−1 + αn g(θ¯n−1 , ω n )
(6.76)
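As a concrete illustration, the following sketch runs iteration (6.76) on a toy problem where F(θ, ω) = −(θ − ω)² with ω uniform on [0, 10], so that θ̄* = E[ω] = 5; the stepsize α_n = 1/n is the classic declining-stepsize choice. The problem and all numbers are assumptions made for the example.

    import random

    # Stochastic gradient iteration (6.76) on a toy problem (illustrative assumptions only).
    theta = 0.0
    for n in range(1, 1001):
        omega = random.uniform(0, 10)
        gradient = 2 * (omega - theta)        # ascent vector g(theta, omega) for F = -(theta - omega)^2
        theta = theta + (1.0 / n) * gradient  # equation (6.76) with stepsize alpha_n = 1/n
    print(theta)                              # should be close to theta* = 5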
We first prove our result using the proof technique of Blum (1954a) that generalized the original stochastic approximation procedure proposed by Robbins & Monro (1951) to multidimensional problems. This approach does not depend on more advanced concepts such as Martingales, and as a result is accessible to a broader audience. This proof helps the reader understand the basis for the conditions Σ_{n=0}^∞ α_n = ∞ and Σ_{n=0}^∞ (α_n)² < ∞ that are required of all stochastic approximation algorithms. We make the following (standard) assumptions on stepsizes:

    α_n ≥ 0,  n ≥ 1    (6.77)
    Σ_{n=1}^∞ α_n = ∞    (6.78)
    Σ_{n=1}^∞ (α_n)² < ∞    (6.79)
We want to show that under suitable assumptions, the sequence generated by (6.76) converges to an optimal solution. That is, we want to show that:

    lim_{n→∞} θ̄^n = θ̄*    (6.80)
We now use Taylor’s theorem (remember Taylor’s theorem from freshman calculus?), which says that for any continuously differentiable convex function F (θ), there exists a parameter 0 ≤ θ ≤ 1 that satisfies: F (θ) = F (θ¯0 ) + ∇F (θ¯0 + θ(θ − θ¯0 ))(θ − θ¯0 )
(6.81)
This is the first order version of Taylor’s theorem. The second order version says: 1 F (θ) = F (θ¯0 ) + ∇F (θ¯0 )(θ − θ¯0 ) + (θ − θ¯0 )T ∇2 F (θ¯0 + θ(θ − θ¯0 ))(θ − θ¯0 ) 2
(6.82)
We use the second order version. Replace θ̄^0 with θ̄^{n−1}, and replace θ with θ̄^n. Also, we can simplify our notation by using:

    g^n = g(θ̄^{n−1}, ω^n)

This means that:

    θ − θ̄^0 = θ̄^n − θ̄^{n−1} = θ̄^{n−1} + α_n g^n − θ̄^{n−1} = α_n g^n    (6.83)
From our stochastic gradient algorithm (6.76), we may write:

    F(θ̄^n, ω^n) = F(θ̄^{n−1} + α_n g^n, ω^n)
                = F(θ̄^{n−1}, ω^n) + ∇F(θ̄^{n−1}, ω^n)(α_n g^n) + ½ (α_n g^n)^T ∇²F(θ̄^{n−1} + η(α_n g^n)) (α_n g^n)    (6.84)

It is now time to use a standard mathematician's trick. We sum both sides of (6.84) to get:

    Σ_{n=1}^N F(θ̄^n, ω^n) = Σ_{n=1}^N F(θ̄^{n−1}, ω^n) + Σ_{n=1}^N ∇F(θ̄^{n−1}, ω^n)(α_n g^n)    (6.85)
                             + ½ Σ_{n=1}^N (α_n g^n)^T ∇²F(θ̄^{n−1} + η(α_n g^n)) (α_n g^n)    (6.86)

Note that the terms F(θ̄^n), n = 2, 3, . . . , N appear on both sides of (6.86). We can cancel these. We then use our lower bound on the quadratic term (6.75) to write:

    F(θ̄^N, ω^N) ≥ F(θ̄^0, ω^1) + Σ_{n=1}^N [ ∇F(θ̄^{n−1}, ω^n)(α_n g^n) + ½ (α_n)²(−M) ]    (6.87)
We now want to take the limit of both sides of (6.87) as N → ∞. In doing so, we want to show that everything must be bounded. We know that F(θ̄^N) is bounded (almost surely) because we assumed that the original function was bounded. We next use the assumption (6.13) that the infinite sum of the squares of the stepsizes is also bounded to conclude that the rightmost term in (6.87) is bounded. Finally, we use (6.73) to claim that all the terms in the remaining summation (Σ_{n=1}^N ∇F(θ̄^{n−1}, ω^n)(α_n g^n)) are positive. That means that this term is also bounded (from both above and below). What do we get with all this boundedness? Well, if

    Σ_{n=1}^∞ α_n ∇F(θ̄^n, ω^n) g^n < ∞  (almost surely)    (6.88)

and (from (6.12))

    Σ_{n=1}^∞ α_n = ∞    (6.89)

we can conclude that

    Σ_{n=1}^∞ ∇F(θ̄^n, ω^n) g^n < ∞.    (6.90)
185
Since all the terms in (6.90) are positive, they must go to zero. (Remember, everything here is true almost surely; after a while, it gets a little boring to keep saying almost surely every time. It is a little like reading Chinese fortune cookies and adding the automatic phrase “under the sheets” at the end of every fortune.) We are basically done except for some relatively difficult (albeit important if you are ever going to do your own proofs) technical points to really prove convergence. At this point, we would use technical conditions on the properties of our ascent vector g n to argue that if ∇F (θ¯n , ω n )g n → 0 then ∇F (θ¯n , ω n ) → 0 (it is okay if g n goes to zero as F (θ¯n , ω n ) goes to zero, but it cannot go to zero too quickly). This proof was first proposed in the early 1950’s by Robbins and Monro and became the basis of a large area of investigation under the heading of stochastic approximation methods. A separate community, growing out of the Soviet literature in the 1960’s, addressed these problems under the name of stochastic gradient (or stochastic quasi-gradient) methods. More modern proofs are based on the use of Martingale processes, which do not start with Taylor’s formula and do not (always) need the continuity conditions that this approach needs. Our presentation does, however, help to present several key ideas that are present in most proofs of this type. First, concepts of almost sure convergence are virtually standard. Second, it is common to set up equations such as (6.84) and then take a finite sum as in (6.86) using the alternating terms in the sum to cancel all but the first and last elements of the sequence of some function (in our case, F (θ¯n , ω n )). We then establish boundedness P∞ the n 2 of this expressionPas N → ∞, which will requirethe assumption that n=1 (α ) < ∞. Then, n the assumption ∞ n=1 α = ∞ is used to show that if the remaining sum is bounded, then its terms must go to zero. More modern proofs will use functions other than F (θ¯n ). Popular is the introduction of so-called Lyapunov functions, which are somewhat artificial functions that provide a measure of optimality. These functions are constructed for the purpose of the proof and play no role in the algorithm itself. For example, we might let T n = ||θ¯n − θ¯∗ || be the distance between our current solution θ¯n and the optimal solution. We will then try to show that T n is suitably reduced to prove convergence. Since we do not know θ¯∗ , this is not a function we can actually measure, but it can be a useful device for proving that the algorithm actually converges. It is important to realize that stochastic gradient algorithms of all forms do not guarantee an improvement in the objective function from one iteration to the next. First, a sample gradient g n may represent an appropriate ascent vector for a sample of the function F (θ¯n , ω n ) but not for its expectation. In other words, randomness means that we may go in the wrong direction at any point in time. Second, our use of a nonoptimizing stepsize, such as αn = 1/n, means that even with a good ascent vector, we may step too far, and actually end up with a lower value.
6.8.3
A more modern proof
Since the original work by Robbins and Monro, more powerful proof techniques have evolved. Below we illustrate a basic Martingale proof of convergence. The concepts are somewhat more advanced, but the proof is more elegant and requires milder conditions. A significant generalization is that we no longer require that our function be differentiable (which our first proof required). First, just what is a Martingale? Let ω1 , ω2 , . . . , ωt be a set of exogenous random outcomes, and let ht = Ht (ω) = (ω1 , ω2 , . . . , ωt ) represent the history of the process up to time t. We also let Ft be the σ-algebra on Ω generated by Ht . Further, let Ut be a function that depends on ht (we would say that Ut is a Ft -measurable function). This means that if we know ht , then we know Ut deterministically (needless to say, if we only know ht , then Ut+1 is still a random variable). We further assume that our function satisfies: E[Ut+1 |Ft ] = Ut If this is the case, then we say that Ut is a martingale. Alternatively, if: E[Ut+1 |Ft ] ≤ Ut
(6.91)
then we say that Ut is a supermartingale. If Ut is a supermartingale, then it has the property that it drifts downward. Note that both sides of equation (6.91) are random variables. The way to understand this equation is to think about an outcome ω ∈ Ω; if we fix ω, then Ut is fixed, as is Ft (that is, we have chosen a particular history from the set Ft ). The inequality in equation (6.91) is then interpreted as being true for each ω. Finally, assume that Ut ≥ 0. If this is the case, we have a sequence Ut that is decreasing but which cannot go below zero. Not surprisingly, we obtain the following key result: Theorem 6.8.1 Let Ut be a positive supermartingale. Then, Ut converges to a finite random variable U ∗ almost surely. So what does this mean for us? In our immediate application, we are interested in studying the properties of an algorithm as it progresses from one iteration to the next. Following our convention of putting the iteration counter in the superscript (and calling it n), we are going to study the properties of a (nonnegative) function U n where hn represents the history of our algorithm. We want to find a function U n and show that it is a supermartingale to establish convergence to some limit (U ∗ ). Finally, we want to show that it must converge to a point that represents the optimal solution. The first step is finding the function U n . It is important to remember that we are proving convergence, not designing an algorithm. So, it is not necessary that we be able to actually calculate U n . We want a function that measures our progress toward finding the optimum
solution. One convenient function is simply: U n = (θ¯n − θ¯∗ )2
(6.92)
where θ¯∗ is the optimal solution. Of course, we don’t know F ∗ , so we cannot actually compute U n , but that is not really a problem for us. Note that we can quickly verify that U n ≥ 0. If we can show that U n is a supermartingale, then we get the result that U n converges to a random variable U ∗ , and, of course, we would like to show that U ∗ = 0. We assume that we are still solving a problem of the form: max E[F (θ, ω)] θ∈Θ
(6.93)
where we assume that F (θ, ω) is continuous and convex (but we do not require differentiability). We are solving this problem using a stochastic gradient algorithm: θ¯n = θ¯n−1 − αn g n
(6.94)
where g n is our stochastic gradient, given by: g n = ∇x F (θ¯n−1 , ω n ) We next need to assume some properties of the stochastic gradient g n . Specifically, we need to assume that: (Assumption 1) E[g n+1 (θ¯n − θ¯∗ )|F n ] ≥ 0 (Assumption 2) kg n k ≤ C < ∞ a.s.
(6.95) (6.96)
Equation (6.95) assumes that on average, the gradient g n points toward the optimal solution θ¯∗ . This is easy to prove for deterministic, differentiable functions. While this may be harder to establish for stochastic problems or problems where F (θ) is nondifferentiable, we have not had to assume that F (θ) is differentiable. To show that U n is a supermartingale, we start with: U n+1 − U n = (θ¯n+1 − θ¯∗ )2 − (θ¯n − θ¯∗ )2 2 = (θ¯n − αn+1 g n+1 ) − θ¯∗ − (θ¯n − θ¯∗ )2 = (θ¯n − θ¯∗ )2 − 2αn+1 g n+1 (θ¯n − θ¯∗ ) + (αn+1 g n+1 )2 − (θ¯n − θ¯∗ )2 = (αn+1 g n+1 )2 − 2αn+1 g n+1 (θ¯n − θ¯∗ )
(6.97) (6.98) (6.99) (6.100)
Taking conditional expectations on both sides gives: E[U n+1 |F n ] − E[U n |F n ] = E[(αn+1 g n+1 )2 |F n ] − 2E[αn+1 g n+1 (θ¯n − θ¯∗ )|F n ]
(6.101)
We now apply assumption 1 in equation (6.95) to argue that the last term on the right hand side of (6.101) is nonnegative. Also, recognizing that E[U n |F n ] = U n is deterministic (given F n ) allows us to rewrite (6.101) as: E[U n+1 |F n ] ≤ U n + E[(αn+1 g n+1 )2 |F n ]
(6.102)
Because of the positive term on the right hand side of (6.102), we cannot directly get the result that U n is a supermartingale. But hope is not lost. We appeal to a neat little trick that works as follows. Let: W
n
∞ X
n
= U +
(αm g m )2
(6.103)
m=n+1
We are going to show that W n is a supermartingale. From its definition, we obtain: W n = W n+1 + U n − U n+1 + (αn+1 g n+1 )2
(6.104)
Taking conditional expectations of both sides gives: W n = E W n+1 |F n + U n − E U n+1 |F n + E (αn+1 g n+1 )2 |F n which is the same as: E[W n+1 |F n ] = W n − U n − E[U n+1 |F n ] + E (αn+1 g n+1 )2 |F n {z } |
(6.105)
I
We see from equation (6.102) that I ≥ 0. Removing this term, then, gives us the inequality: E[W n+1 |F n ] ≤ W n
(6.106)
This means that W n is a supermartingale. It turns out that this is all we really need because limn→∞ W n = limn→∞ U n . This means that: lim U n → U ∗
(almost surely)
n→∞
(6.107)
Now that we have the basic convergence of our algorithm, we have to ask: but what is it converging to? For this result, we return to equation (6.100) and sum it over the values n = 0 up to some number N : N X n=0
(U
n+1
n
−U ) =
N X n=0
(α
n+1 n+1 2
g
) −2
N X n=0
αn+1 g n+1 (θ¯n − θ¯∗ )
(6.108)
CHAPTER 6. STOCHASTIC APPROXIMATION METHODS
189
The left hand side of (6.108) is an alternating sum (sometimes referred to as a telescoping sum), which means that every element cancels out except the first and the last. Using this, and taking expectations of both sides gives: " E[U N +1 − U 0 ] = E
N X
#
"
(αn+1 g n+1 )2 − 2E
n=0
N X
# αn+1 g n+1 (θ¯n − θ¯∗ )
n=0
Finally, taking the limit as N → ∞ and using the limiting result from equation (6.107) gives: " U∗ − U0 = E
∞ X
# (αn+1 g n+1 )2 − 2E
"
∞ X
# αn+1 g n+1 (θ¯n − θ¯∗ )
(6.109)
n=0
n=0
Because of our supermartingale property (equation 6.107) we can argue that the left hand side of (6.109) is bounded. P Next, using assumption 2 (which ensures that g n is bounded) n 2 and the requirement that E ∞ n=1 (α ) < ∞ allows us to claim that the first term on the right hand side is also bounded. This means that the second term on the right hand side of (6.109) is also bounded. Let β n = g n+1 (θ¯n − θ¯∗ ). We have just shown that: ∞ X
αn+1 β n ≤ ∞
(6.110)
n=0
But, we have required that lim β n → 0
n→∞
P∞
n=1
αn = ∞. If
P∞
n=0
αn+1 β n is bounded which implies that (6.111)
which means g n+1 (θ¯n − θ¯∗ ) → 0. This can happen in two ways. Either g n+1 → 0 or (θ¯n − θ¯∗ ) → 0 (or both). Either way, we win. If g n+1 → 0, then this can only happen because θ¯n → θ¯∗ (under our convexity assumption, the gradient can only vanish at the minimum). In general, this will never be the case either because the problem is stochastic or F (θ) is not differentiable. In this case, we must have θ¯n → θ¯∗ , which is what we were looking for in the first place.
CHAPTER 6. STOCHASTIC APPROXIMATION METHODS
6.8.4
190
Proof of theorem 6.5.1
Proof: Let J(αt ) denote the objective function from the problem stated in (6.61). J(αt ) = = = =
h
2 i ¯ E θt (αt ) − θt 2 n−1 ¯ ˆ E (1 − αt ) θ + αt θt − θt 2 n−1 ¯ ˆ E (1 − αt ) θ − θt + αt θt − θt h 2 2 i 2 2 n−1 (1 − αt ) E θ¯ − θt + (αt ) E θˆt − θt i h n−1 ˆ ¯ − θt θt − θt +2αt (1 − αt ) E θ | {z }
(6.112)
I
The expected value of the cross-product term, I, vanishes under the assumption of independence of the observations and the objective function reduces to the following form: 2
J(αt ) = (1 − αt ) E
h
θ¯n−1 − θt
2 i
2
+ (αt ) E
θˆt − θt
2
(6.113)
In order to find the optimal stepsize, αt∗ , that minimizes this function, we obtain the first t) = 0, which gives us, order optimality condition by setting ∂J(α ∂αt − (1 −
αt∗ ) E
h
¯n−1
θ
− θt
2 i
+
αt∗ E
θˆt − θt
2
=0
(6.114)
Solving this for αt∗ gives us the following result, h
2 i n−1 ¯ E θ − θt αt∗ = h i 2 2 E θ¯n−1 − θt + E θˆt − θt
(6.115)
We observe that the least squares estimate of the stepsize suggests combining the previous estimate and the new observation in inverse proportion to their mean squared errors. It is well known that the mean-squared error of an estimate can be computed as the sum of its variance and the square of its bias from the true value (Hastie et al. (2001)). The biases in the estimates can be obtained using equations (6.62) and (6.63).
The variance of the observation, θˆt is computed as follows: h i 2 Var θˆt = E θˆt − θt = E ε2t = σ2
(6.116)
We can write the estimate from the previous iteration as: θ¯n−1 = (1 − αt )θ¯n−1 + αt θˆt h i = (1 − αt ) (1 − αn−1 )θ¯t−2 + αn−1 θˆn−1 + αt θˆt = (1 − αt )(1 − αn−1 )θ¯t−2 + (1 − αt )αn−1 θˆn−1 + αt θˆt h i = (1 − αt )(1 − αn−1 ) (1 − αt−2 )θ¯t−3 + αt−2 θˆt−2 + (1 − αt )αn−1 θˆn−1 + αt θˆt = (1 − αt )(1 − αn−1 )(1 − αt−2 )θ¯t−3 + (1 − αt )(1 − αn−1 )αt−2 θˆt−2 + (1 − αt )αn−1 θˆn−1 + αt θˆt Assume that α1 = 1. This means that when we extend the sequence back to the beginning, the first term vanishes, and we are left with: θ¯n−1 =
t−1 X
αi
i=1
=
t−1 X i=1
t−2 Y
(1 − αj )θˆi
j=i+1 t−2 Y
αi
(1 − αj ) (θi + εi )
(6.117)
j=i+1
The expected value of this estimate can be expressed as: t−1 t−2 X Y E θ¯n−1 = αi (1 − αj )θi i=1
j=i+1
(6.118)
We can compute the variance of the smoothed estimate, θ¯n−1 , as follows: h n−1 n−1 2 i n−1 ¯ ¯ Var θ = E θ − E θ¯ !2 t−1 t−2 t−1 t−2 X Y X Y = E αi (1 − αj )θˆi − αi (1 − αj )θi i=1
t−1 X
= E
i=1
=
t−1 X
2
(αi )
i=1
=
t−1 X
j=i+1 t−2 Y
αi
i=1
(1 − αj ) θˆi − θi
j=i+1
!2
j=i+1 t−2 Y
(1 − αj )2 E ε2i
j=i+1 2
(αi )
i=1 n−1 2
= λ
t−2 Y
(1 − αj )2 σ 2
j=i+1
σ
(6.119)
where: λn−1 =
t−1 X
(αi )2
i=1
t−2 Y
(1 − αj )2
j=i+1
We can compute λt and the variance of the smoothed estimate recursively using: λ = (αt )2 + (1 − αt )2 λn−1 t E θ¯t = αt θt + (1 − αt )E θ¯n−1
(6.120) (6.121)
Using these results, we can express the mean squared errors as, E
θˆt − θt
2
h i = Var θˆt = σ2
h E
θ¯n−1 − θt
2 i
(6.122)
= Var θ¯n−1 + (βt )2 = λn−1 σ 2 + (βt )2
(6.123)
We can obtain the desired result by putting together equations (6.115), (6.122) and (6.123).
6.9
Bibliographic notes
6.9.1
Stochastic approximation literature
Kushner & Clark (1978) , Robbins & Monro (1951) , Kiefer & Wolfowitz (1952) , Blum (1954b) , Wasan (1969), Dvoretzky (1956) Pflug (1988) gives an overview of deterministic and adaptive stepsize rules used in stochastic approximation methods. Pflug (1996) Spall: Overview Spall (2003)
6.9.2
Stepsizes
In forecasting, exponential smoothing is widely used to predict future values of exogenous quantities such as random demands. The original work in this area was done by Brown and Holt. Forecasting models for data with a trend component have also been developed. In most of the literature, constant stepsizes are used (see Brown (1959), Holt et al. (1960)) and Winters (1960)), mainly because they are easy to implement in large forecasting systems. These may be tuned to work well for specific problem classes. However, it has been shown that models with fixed parameters will demonstrate a lag if there are fluctuations in the mean or trend components of the observed data. There are several methods that monitor the forecasting process using the observed value of the errors in the predictions with respect to the observations. For instance, Brown (1963), Giffin (1971), Trigg (1964) and Gardner (1983) develop tracking signals that are functions of the errors in the predictions. If the tracking signal falls outside of certain limits during the forecasting process, either the parameters of the existing forecasting model are reset or a more suitable model is used. Similar updating techniques that involve stepsizes are used extensively in closely related fields such as reinforcement learning (Jaakkola et al. (1994), Bertsekas & Tsitsiklis (1996)), where convergence of the learning algorithms is dependent on the stepsizes satisfying certain conditions. Darken & Moody (1991) addresses the problem of having the stepsizes evolve at different rates, depending on whether the learning algorithm is required to search for the neigborhood of the right solution or to converge quickly to the true value. However, the rate at which the stepsize changes is a deterministic function. Darken & Moody (1991) introduce search then converge. From stochastic programming: Kesten (1958) , Mirozahmedov & Uryasev (1983) , Gaivoronski (1988) , Ruszczy´ nski & Syski (1986) Kmenta - Best linear unbiased estimators: Kmenta (1997) Hastie et al - machine learning Hastie et al. (2001) Darken and Moody STC Darken et al. (1992)
Learning rates for Q-learning are analyzed by Even-Dar & Mansour (2004); see also Jacobs (1988). Stengel (1994) covers the Kalman filter. Stepsize rules for stochastic gradient algorithms are studied by Mathews & Xie (1993) and Douglas & Mathews (1995); see also Benveniste et al. (1990).
Exercises

In all the exercises below, U is a random variable that is uniformly distributed between 0 and 1.

6.1) Let U be a random variable that is uniformly distributed between 0 and 1. Let R = -(1/\lambda) \ln U. Show that Prob[R ≤ x] = 1 - e^{-\lambda x}, which shows that R has an exponential distribution.

6.2) Let R = U_1 + U_2. Derive the probability density function for R.

6.3) Let X be an arbitrary random variable (it can be discrete or continuous). Let F(x) = 1 - Prob[X ≤ x], and let F^{-1}(y) be its inverse (that is, if y = F(x), then F^{-1}(y) = x). Show that F^{-1}(U) has the same distribution as X.

6.4) Prove that the recursive expression for \lambda^t in equation (6.67) is equivalent to the expression in equation (6.65).

6.5) Prove equation (6.46). Use the definition of \nu^n:

\[ \nu^n = E\Bigl[\bigl(\bar{\theta}^{n-1} - \hat{\theta}^n\bigr)^2\Bigr] \]

Expand the term in the expectation and reduce.

6.6) We are going to again try to use approximate dynamic programming to estimate a discounted sum of random variables (we first saw this in chapter 5):

\[ F^T = E\left[\sum_{t=0}^{T} \gamma^t R_t\right] \]

where R_t is a random variable that is uniformly distributed between 0 and 100 (you can use this information to randomly generate outcomes, but otherwise you cannot use this information). This time we are going to use a discount factor of \gamma = .95. We assume that R_t is independent of prior history. We can think of this as a single state Markov decision process with no decisions.
a) Using the fact that E R_t = 50, give the exact value for F^{100}.

b) Propose an approximate dynamic programming algorithm to estimate F^T. Give the value function updating equation, using a stepsize \alpha_t = 1/t.

c) Perform 100 iterations of the approximate dynamic programming algorithm to produce an estimate of F^{100}. How does this compare to the true value?

d) Compare the performance of the following stepsize rules: Kesten's rule, the stochastic gradient adaptive stepsize rule (use \mu = .001), 1/n^\beta with \beta = .85, the Kalman filter rule, and the optimal stepsize rule. For each one, find both the estimate of the sum and the variance of the estimate.

6.7) Consider a random variable given by R = 10U (which would be uniformly distributed between 0 and 10). We wish to use a stochastic gradient algorithm to estimate the mean of R using the iteration \bar{\theta}^n = \bar{\theta}^{n-1} - \alpha_n(\bar{\theta}^{n-1} - R^n), where R^n is a Monte Carlo sample of R in the nth iteration. For each of the stepsize rules below, use equation (6.10) to measure the performance of the stepsize rule to determine which works best, and compute an estimate of the bias and variance at each iteration. If the stepsize rule requires choosing a parameter, justify the choice you make (you may have to perform some test runs).

a) \alpha_n = 1/n.

b) Fixed stepsizes of \alpha_n = .05, .10 and .20.

c) The stochastic gradient adaptive stepsize rule (equations (6.33)-(6.34)).

d) The Kalman filter (equations (6.54)-(6.58)).

e) The optimal stepsize rule (algorithm 6.9).

6.8) Repeat (6.7) using R^n = 10(1 - e^{-0.1n}) + 6(U - 0.5).

6.9) Repeat (6.7) using R^n = 10/(1 + e^{-0.1(50-n)}) + 6(U - 0.5).

6.10) We are going to solve a classic stochastic optimization problem known as the newsvendor problem. Assume we have to order x assets after which we try to satisfy a random demand D for these assets, where D is randomly distributed between 100 and 200. If x > D, we have ordered too much and we pay 5(x - D). If x < D, we have an underage, and we have to pay 20(D - x).

a) Write down the objective function in the form \min_x E f(x, D).

b) Derive the stochastic gradient for this function.

c) Since the gradient is in units of dollars while x is in units of the quantity of the asset being ordered, we encounter a scaling problem. Let x^0 = 100 and choose as a stepsize \alpha_n = 5/n. Estimate the optimal solution using 100 iterations.
6.11) A customer is required by her phone company to commit to a minimum number of minutes per month for her cell phone. She pays 12 cents per minute of guaranteed minutes, and 30 cents per minute that she goes over her minimum. Let x be the number of minutes she commits to each month, and let M be the random variable representing the number of minutes she uses each month.

a) Write down the objective function in the form \min_x E f(x, M).

b) Derive the stochastic gradient for this function.

c) Let x^0 = 0 and choose as a stepsize \alpha_n = 10/n. Use 100 iterations to determine the optimum number of minutes the customer should commit to each month.

6.12) An oil company covers the annual demand for oil using a combination of futures and oil purchased on the spot market. Orders are placed at the end of year t - 1 for futures that can be exercised to cover demands in year t. If too little oil is purchased this way, the company can cover the remaining demand using the spot market. If too much oil is purchased with futures, then the excess is sold at 70 percent of the spot market price (it is not held to the following year, since oil is too valuable and too expensive to store). To write down the problem, model the exogenous information using:

\[ \hat{D}_t = \text{Demand for oil during year } t. \]
\[ \hat{p}^s_t = \text{Spot price paid for oil purchased in year } t. \]
\[ \hat{p}^f_t = \text{Futures price paid in year } t \text{ for oil to be used in year } t+1. \]

The decision variables are given by:

\[ \bar{\theta}^f_{t,t+1} = \text{Number of futures to be purchased at the end of year } t \text{ to be used in year } t+1. \]
\[ \bar{\theta}^s_t = \text{Spot purchases made in year } t. \]

a) Set up the objective function to minimize the expected total amount paid for oil to cover demand in a year t + 1 as a function of \bar{\theta}^f_t. List the variables in your expression that are not known when you have to make a decision at time t.

b) Give an expression for the stochastic gradient of your objective function. That is, what is the derivative of your function for a particular sample realization of demands and prices (in year t + 1)?

c) Generate 100 years of random spot and futures prices as follows:

\[ \hat{p}^f_t = 0.80 + 0.10\, U^f_t \]
\[ \hat{p}^s_{t+1} = \hat{p}^f_t + 0.20 + 0.10\, U^s_t \]

where U^f_t and U^s_t are random variables uniformly distributed between 0 and 1. Run 100 iterations of a stochastic gradient algorithm to determine the number of
futures to be purchased at the end of each year. Use \bar{\theta}^f_0 = 30 as your initial order quantity, and use as your stepsize \alpha_t = 20/t. Compare your solution after 100 years to your solution after 10 years. Do you think you have a good solution after 10 years of iterating?
Chapter 7

Discrete, finite horizon problems

A rich array of techniques has evolved in the field of approximate dynamic programming for problems with discrete states and actions. We use this framework to describe a class of techniques that depend on our ability to enumerate states and actions and to estimate value functions of the "table look-up variety," where there is an estimate of the value of being in each discrete state.

The techniques that we describe in this chapter are only interesting for problems where the state and action spaces are "not too big." While the so-called curse of dimensionality arises in a vast array of applications, there are many problems where this does not happen. A particular class of discrete problems that falls in this category consists of those that involve managing a single (discrete) asset such as equipment (locomotives, aircraft, trucks, printing presses), people (automobile drivers, equipment operators, a student planning an academic career), or a project (where a set of tasks has to be completed in order). Most game problems (chess, checkers, backgammon, tetris) also fall in this category. These are important problems, but they typically have the quality that the state and action spaces are of reasonable size.

There are numerous problems involving the management of a single asset that are important by themselves. In addition, more complex, multi-asset problems are often solved by decomposing them into single asset problems. As a result, this is the proper foundation for addressing these more complex problems. For example, the most widely used strategy for scheduling airline crews involves solving dynamic programs to schedule each individual crew, and then using an optimization package to determine which set of schedules should be used to produce the best overall solution for the airline.

This chapter focuses purely on finite horizon problems. Since most treatments of this material have been done in the context of infinite horizon problems, a word of explanation is in order. Our justification for starting with a finite horizon framework is based on pedagogical, practical, and theoretical reasons. Pedagogically, finite horizon problems require a more careful modeling of the dynamics of the problem; stationary models allow us to simplify the modeling by ignoring time indices, but this hides the modeling of the dynamics of decisions and information. By starting with a finite horizon model, the student is forced to clearly write down the dynamics of the problem, which is a good foundation for building infinite
horizon models. More practically, finite horizon models are the natural framework for a vast array of operational problems where the data is nonstationary and/or where the important behavior falls in the initial transient phase of the problem. Even when a problem is stationary, the decision of what to do now depends on value function approximations that often depend on the initial starting state. If we were able to compute an exact value function, we would be able to use this value function for any starting state. For more complex problems where we have to depend on approximations, the best approximation may quite easily depend on the initial state. The theoretical justification is that certain algorithms depend on our ability to obtain unbiased sample estimates of the value of being in a state by following a path into the future. With finite horizon models, we only have to follow the path to the end of the horizon. With infinite horizon problems, authors either assume the path is infinite or depend on the presence of zero-cost, absorbing states.
7.1 Applications
There are a wide variety of problems that can be viewed as managing a single, discrete asset. Some examples include:

The shortest path problem - Consider the problem faced by a traveler trying to get from home to work over a network with uncertain travel times. Let I be the set of intersections, and let \tau_{ij} be a random variable representing the travel time from intersection i to intersection j.

The traveling salesman problem - Here, a person or piece of equipment has to make a series of stops to do work at each stop (making sales calls, performing repairs, picking up cash from retail stores, delivering gasoline to gas stations). The attributes of our asset would include its location and the time it arrives at a location, but might also include total elapsed time and how full or empty the vehicle is if it is picking up or delivering goods.

Planning a college academic program - A student has to plan eight semesters of courses that will lead to a degree. The student has five sets of requirements to satisfy (two language courses, one science course, one writing course, four courses in a major field of study, and two courses chosen from four groups of courses to provide breadth). In addition, there is a requirement to finish a minimum number of courses for graduation. The attribute of a student can be described as the number of courses completed in each of the six dimensions (or the number of courses remaining in each of the six dimensions).

Sailing a sailboat - Sailboats moving upwind have to sail at roughly a 45 degree angle to the wind. Periodically, the boat has to change tack, which requires turning the boat by
90 degrees so the wind will come over the opposite side. Tacks take time, but they are necessary if the wind changes course so the boat can move as much as possible towards the next marker. If the wind moved in a constant direction, it would be easy to plan a triangle that required a single tack. But such a path leaves the boat vulnerable to a wind shift. A safer strategy is to plan a path where the boat does not veer too far from the line between the boat and the next mark.

The unit commitment problem - This is a class of problems that arises in the electric power industry. A unit might be a piece of power generation equipment that can be turned on or off to meet changing power needs. There is a cost to turning the unit on or off, so we must decide when future power demands justify switching a unit on.

The fuel commitment problem - A variation on the unit commitment problem is the decision of what fuel to burn in plants that can switch between coal, oil, and natural gas. Here, the issue is not only the switching costs but potential price fluctuations.

Selling an asset - A very special instance of a discrete asset problem is determining when to sell an asset. We want to sell the asset at the highest price given assumptions about how prices are evolving and the conditions under which it can be sold.

These problems all assume that the only pieces of information are the attributes of the asset itself. Even single asset problems can be hard when there is other information available to make a decision. In our traveling salesman problem, the attribute of the salesman may be her location, but other information can be the status of other requests for a visit. As we show later, the real complication is not how much information is available to make a decision, but rather how much we capture as an attribute for the purpose of estimating a value function. In the world of approximate dynamic programming, we are not looking for optimal solutions, but for solutions that are better than we would obtain without these techniques. We do not have to include all the dimensions of the state variable when we are estimating a value function; the goal is to choose the elements that contribute the most to explaining the value of being in a state.
7.2 Sample models
In this section, we present a series of real problems that involve the management of a single asset. Our goal here is to provide a basic formulation that highlights a class of applications where approximate dynamic programming can be useful.
7.2.1 The shortest path problem
An important special case of the single asset problem is the shortest path problem. Actually, any discrete dynamic program can be formulated as a shortest path problem by introducing
a "super state" that the system has to end up in and by viewing the problem in terms of the evolution of the entire system. However, given the array of real problems that fit the general framework of shortest paths, it seems more natural to work in the context of problems that are naturally formulated this way.

Shortest path problems can usually be formulated in terms of getting an asset from one state to a terminal state in either the least amount of time or at the lowest cost. These are most naturally viewed as finite horizon problems, but we have to handle the possibility that the asset may not make it to the terminal state within the horizon. This situation is easily handled, at least conceptually, by including an end-of-horizon transition from any state into the terminal state with a cost. Shortest path problems come in a variety of flavors, especially when we introduce uncertainty. Below we review some of the major classes of shortest path problems.

A deterministic shortest path problem

The best example of a shortest path problem is that faced by a driver making her way through a network. The network consists of a set of intersections I, and at each intersection, she has to decide which link (i, j) to progress down. Let q be the node her trip originates at, and let r be her intended destination. Let:

\[ \overrightarrow{I}_i = \text{The set of nodes that can be reached directly from intersection } i. \]
\[ \overleftarrow{I}_j = \text{The set of nodes from which intersection } j \text{ can be reached directly.} \]

The forward and backward reachable sets identify the set of links out of, and into, a node (assuming that there is only one link connecting two nodes). For our presentation, we assume that we are minimizing the cost of getting to the destination, recognizing that the cost may be time. Let:

\[ c_{ij} = \text{The cost of traversing link } (i,j). \]
\[ x_{ij} = \begin{cases} 1 & \text{if we traverse link } (i,j) \text{ given that we are at intersection } i \\ 0 & \text{otherwise} \end{cases} \]

Let v_i be the value of being at node i. Here, the value is the cost of getting from node i to the destination node r. If we knew all these values, we could find our optimal path to the destination by simply solving:

\[ x^*_{ij} = \arg\min \sum_{j \in \overrightarrow{I}_i} (c_{ij} + v_j)\, x_{ij} \]
Clearly, the value of being at node i should now satisfy:

\[ v_i = \sum_{j \in \overrightarrow{I}_i} (c_{ij} + v_j)\, x^*_{ij} \]

which says that the cost of getting from i to r should equal the cost of making the best decision out of i, considering both the cost of the first link out of i plus the cost of getting from the destination of the link to the final destination of the trip. This means that our values (v_i)_{i \in I} should satisfy the equation:

\[ v_i = \min \sum_{j \in \overrightarrow{I}_i} (c_{ij} + v_j)\, x_{ij} \qquad (7.1) \]

Equation (7.1) has been widely studied. Since these algorithms are applied to networks with tens or even hundreds of thousands of links, researchers have spent decades looking for algorithms that solve these equations very quickly. One algorithm that is especially poor is a straightforward application of backward dynamic programming, where we iteratively compute:

\[ v^n_i = \begin{cases} \min \sum_{j \in \overrightarrow{I}_i} (c_{ij} + v^{n-1}_j)\, x_{ij} & \forall\, i \in I \setminus r \\ 0 & i = r \end{cases} \]
We initialize v^0_i = M for all i ∈ I except for v^0_r. It turns out that this algorithm will converge in N iterations, where N is the largest number of links on a path from q to r.

A more effective algorithm uses a backward recursion. Starting at the destination node r (where we know that v_r = 0), we put node r in a candidate list. We then take the node at the top of the candidate list and look backward from that node. Any time we find a node i where v^n_i < v^{n-1}_i (that is, where we found a better value to node i), that node is added to the bottom of the candidate list (assuming it is not already in the list). After the node at the top of the list is updated, we drop it from the list. The algorithm ends when the list is empty.

Random costs

There are three flavors of problems with random arc costs:

Case I - All costs are known in advance: Here, we assume that we have a wonderful real-time network tracking system that allows us to see all the costs before we start our trip.

Case II - Costs are learned as the trip progresses: In this case, we assume that we see the actual arc costs for links out of node i when we arrive at node i.
Case III - Costs are learned after the fact: In this setting, we only learn the cost on each link after the trip is finished.

Let C_{ij} be a random variable representing the cost, with expectation \bar{c}_{ij}. Let C_{ij}(\omega) be a sample realization. For each of the three cases, we can solve the problem using different versions of the same dynamic programming recursion:

Case I - All costs known in advance: Since we know everything in advance, it is as if we are solving the problem using deterministic costs C_{ij}(\omega). For each sample realization, we have a set of node values that are therefore random variables V_i. These are computed using:

\[ V_i(\omega) = \min \sum_{j \in \overrightarrow{I}_i} \bigl(C_{ij}(\omega) + V_j(\omega)\bigr)\, x_{ij} \qquad (7.2) \]

We would have to compute V_i(\omega) for each \omega \in \Omega (or a sample \hat{\Omega}). On average, the expected cost to the destination r would be given by:

\[ v_i = E\{V_i\} \qquad (7.3) \]

Case II - Link costs become known when we arrive at the intersection: For this case, the node values are expectations, but the decisions are random variables. The node values satisfy the recursion:

\[ v_i = E\left\{ \min \sum_{j \in \overrightarrow{I}_i} \bigl(C_{ij}(\omega) + v_j\bigr)\, x_{ij}(\omega) \right\} \qquad (7.4) \]

Case III - Link costs become known only after the trip: Now we are assuming that we do not get any new information about link costs as we traverse the network. As a result, the best we can do is to use expected link costs \bar{C}_{ij}:

\[ v_i = \min \sum_{j \in \overrightarrow{I}_i} \bigl(\bar{C}_{ij} + v_j\bigr)\, x_{ij} \qquad (7.5) \]

which is the same as our original deterministic problem.
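The following sketch (our own illustration; the four-node network, its costs and the noise model are all invented) contrasts two of the information cases numerically: Case I, where each sampled cost table is optimized separately, and Case III, where the path is planned with mean costs and then evaluated against sampled costs. Case II would require a dynamic program over the information available at each node and is omitted to keep the example short.

```python
import numpy as np

# Tiny acyclic network; node 3 is the destination r. Mean arc costs are invented.
rng = np.random.default_rng(1)
arcs = {(0, 1): 4.0, (0, 2): 2.0, (1, 3): 1.0, (2, 3): 4.0}
succ = {0: [1, 2], 1: [3], 2: [3], 3: []}
r = 3

def shortest_values(cost):
    """Backward recursion (7.1) on the acyclic network for a given cost table."""
    v = {r: 0.0}
    for i in [2, 1, 0]:                          # reverse topological order
        v[i] = min(cost[(i, j)] + v[j] for j in succ[i])
    return v

def sample_costs():
    # Each realized cost is its mean plus Uniform(-1, 1) noise.
    return {a: c + rng.uniform(-1.0, 1.0) for a, c in arcs.items()}

N = 20000
# Case I: all costs seen in advance -> average of per-scenario optima (eqs. 7.2-7.3).
case1 = np.mean([shortest_values(sample_costs())[0] for _ in range(N)])

# Case III: plan with mean costs (eq. 7.5), then pay the realized costs.
v_bar = shortest_values(arcs)
def follow_mean_path(cost):
    i, total = 0, 0.0
    while i != r:
        j = min(succ[i], key=lambda j: arcs[(i, j)] + v_bar[j])  # decision from means
        total += cost[(i, j)]                                    # realized cost paid
        i = j
    return total
case3 = np.mean([follow_mean_path(sample_costs()) for _ in range(N)])

print("Case I   (costs known in advance):", round(case1, 3))
print("Case III (mean-cost plan)        :", round(case3, 3))
```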
Random arc availabilities

A second source of uncertainty arises when links may not be available at all. We can handle this using a random upper bound:

\[ U_{ij} = \begin{cases} 1 & \text{if link } (i,j) \text{ is available for travel} \\ 0 & \text{otherwise} \end{cases} \]

The case of random arc availabilities can also be modeled by using random arc costs with C_{ij} = M if U_{ij} = 0, and so the problems are effectively equivalent. The practical difference arises because if a link is not available, we face the possibility that the problem is simply infeasible.
7.2.2 Getting through college
The challenge of progressing through four years of college requires taking a series of courses that satisfy various requirements. For our example, we will assume that there are five sets of requirements: two courses in mathematics, two in language, eight departmentals (that is, courses from the department a student is majoring in), four humanities courses (from other departments), and one science course (chemistry, physics, geology, biology). The total number of courses has to add up to 32.

Each semester, a student may take three, four or five courses. Of these courses, a student will normally select courses that also satisfy the various requirements that have to be satisfied prior to graduation. We can describe the state of our student at the beginning of each semester in terms of the following vector:

\[ a_t = \text{Attribute vector for a student at the end of semester } t \]
\[ = \begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \\ a_5 \\ a_6 \end{pmatrix} = \begin{pmatrix} \text{Number of mathematics courses completed} \\ \text{Number of language courses completed} \\ \text{Number of departmentals completed} \\ \text{Number of humanities courses completed} \\ \text{Number of science courses completed} \\ \text{Total number of courses completed} \end{pmatrix} \]

\[ R_{ta} = \begin{cases} 1 & \text{if the student has attribute } a \text{ at time } t \\ 0 & \text{otherwise} \end{cases} \]
\[ R_t = (R_{ta})_{a \in \mathcal{A}} \]

We assume that R_t is a pre-decision state variable, so that it indicates the information available for making a decision. The student has to make decisions about which courses to
take at the end of each semester. For this problem, we would have eight decision epochs indexed t = (0, 1, . . . , 7), representing decisions to be made before each semester begins. x_0 represents decisions made at the beginning of her first year. We can represent our decisions using:

\[ D = \text{The set of courses in the course catalog.} \]
\[ x_{td} = \begin{cases} 1 & \text{if the student chooses course } d \text{ for semester } t+1 \\ 0 & \text{otherwise} \end{cases} \]

We also have to keep track of which courses were completed "satisfactorily." During semester t, the information W_t = (W_{td})_{d \in D} that comes in is whether the student passed or failed the course, where:

\[ W_{td} = \begin{cases} 1 & \text{if the student passes course } d \text{ taken in semester } t \\ 0 & \text{otherwise} \end{cases} \]

We can keep track of the courses that have been completed satisfactorily using:

\[ P_{td} = \begin{cases} 1 & \text{if course } d \text{ has been satisfactorily completed by the end of semester } t \\ 0 & \text{otherwise} \end{cases} \]
\[ P_{t,d} = P_{t-1,d} + x_{td} W_{td} \]
The decisions have to satisfy certain constraints:

\[ \sum_{d \in D} x_{td} \ge 3 \qquad (7.6) \]
\[ \sum_{d \in D} x_{td} \le 5 \qquad (7.7) \]
\[ x_{td} \le 1 - P_{t-1,d} \qquad (7.8) \]
\[ x_{td} \in \{0, 1\} \qquad (7.9) \]
Constraints (7.6) and (7.7) ensure that the student takes between three and five courses each semester. Constraint (7.8) prevents the student from taking a course that has already been completed. In addition to completing the necessary number of courses while meeting all requirements, the student may prefer some courses over others. These preferences are captured in the contribution function Ct (xt , Rt ), which combines a numerical preference for each course. In addition, there is a terminal reward C8 (R8 ) that captures the value of being in state R8 at the end of eight semesters. It is here that we include penalties for not taking enough courses, or not satisfying one of the requirements.
The state of our system is given by St = (Rt , Pt ) which describes the number of requirements the student has completed, and the courses she has taken. Technically, Rt can be computed from Pt , but it is useful to represent both in the state variable since Pt tells us which courses she is allowed to take (she cannot take the same course twice), while Rt indicates the degree to which she is progressing toward graduation.
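As a small illustration of the bookkeeping in this model, the sketch below (with hypothetical course identifiers and pass/fail outcomes of our own invention) checks constraints (7.6)-(7.8) for one semester's selection and applies the transition P_{t,d} = P_{t-1,d} + x_{td}W_{td}.

```python
from typing import Dict, Set

def feasible(x_t: Set[str], completed: Set[str]) -> bool:
    """Constraints (7.6)-(7.8): take 3-5 courses, none already completed."""
    return 3 <= len(x_t) <= 5 and x_t.isdisjoint(completed)

def update_completed(completed: Set[str], x_t: Set[str], passed: Dict[str, int]) -> Set[str]:
    """Transition for P_t: a course is completed if it was taken and passed."""
    return completed | {d for d in x_t if passed.get(d, 0) == 1}

# Hypothetical course identifiers and outcomes, purely for illustration.
completed = {"MAT101"}
x_t = {"MAT102", "LAN101", "SCI101"}
W_t = {"MAT102": 1, "LAN101": 0, "SCI101": 1}   # pass/fail information for semester t

assert feasible(x_t, completed)
completed = update_completed(completed, x_t, W_t)
print(completed)   # {'MAT101', 'MAT102', 'SCI101'}
```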
7.2.3 The taxi problem
In our version of the taxi problem, a taxi picks up and drops off customers and tries to maximize his total profit over time (we can view this as a steady state, infinite horizon problem or as a time-dependent, finite horizon problem). The process starts when a customer gets in the cab and tells him where he wants to go. After the cab drops off the customer, he faces the decision of where to go next to wait for a customer (sit and wait? go to a nearby hotel? head back to the airport?). We let:

\[ a_t = \text{The attributes of the taxi at the end of a trip.} \]
\[ R_{ta} = \begin{cases} 1 & \text{if the taxi has attribute } a \text{ at time } t \\ 0 & \text{otherwise} \end{cases} \]
\[ R_t = (R_{ta})_{a \in \mathcal{A}} \]

We assume that space is discretized into a set of locations, represented by:

\[ I = \text{The set of locations that an empty taxi can move to.} \]

This allows us to represent decisions using:

\[ D = \text{The set of decisions to move to a location in } I. \]

An element d ∈ D represents a decision to move to a location i_d ∈ I. After the cab makes a decision, he sits and waits for a customer to arrive. At a location, a cab will sit if no customers arrive, or move to the destination requested by the first customer who arrives and requests service (our cabbie is not allowed to turn down customers). From the perspective of a dispatcher, the cab will call in letting her know that he has a passenger and where he is going to. We can model the information that arrives when the customer boards as:
\[ W_{td} = \begin{cases} 0 & \text{no customer arrived and the cabbie remained in his location} \\ i & \text{the destination } i \text{ of the trip} \end{cases} \]
\[ W_{tf} = \text{The fare paid for the trip.} \]
Thus, Wtd > 0 means that a customer arrived, in which case Wtd is the destination requested by the customer. If Wtd = 0, then we would assume that Wtf = 0. In this problem, decisions are made at the end of a trip. If the taxi decides to change locations, we wait until the cab arrives at his new location, and then information arrives that determines whether he moves to a new location and how much money he makes. At the end of any time period, the taxi can choose to either sit and wait another time period or move to another location. We can model this problem in discrete time or at specific events (“decision epochs”). For example, we might assume that once a cab has decided to wait for a customer, he has to wait until a customer actually arrives. In this case, the only decision points are at the end of a trip (which need not occur at discrete time points). If the taxi is at location i, then he will wait a random amount of time and then serve a customer who will take him to location j with probability pij . If the cab decides to move to location k, he will then move from k to j with probability pkj . In the language of Markov decision processes, we would say that the probability that a cab at i goes to location j is given by pij (x) where x = (xd )d∈D captures the decision to stay or move to an alternate location. If the cab could not reposition to another location, we would have a classic example of a Markov chain (or, to be more precise, a semi-Markov process, because the time during which the cab waits may be an arbitrary probability distribution). Since we can make decisions that effectively change the probability of a transition, we have a Markov decision process.
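The dynamics just described are easy to simulate. The toy script below (our own, with invented locations, fares and customer-destination probabilities, and with repositioning costs ignored) runs a naive rule that sends the empty cab to the location with the highest expected next fare.

```python
import numpy as np

rng = np.random.default_rng(2)
locations = [0, 1, 2]
p = np.array([[0.1, 0.6, 0.3],     # p[i][j]: prob. a customer at i asks to go to j
              [0.5, 0.2, 0.3],
              [0.4, 0.4, 0.2]])
fare = np.array([[0., 8., 12.],    # hypothetical fares between locations
                 [8., 0., 6.],
                 [12., 6., 0.]])

def greedy_reposition(i):
    """Naive decision rule: move to the location with the highest expected fare."""
    expected = (p * fare).sum(axis=1)      # expected fare of the next trip from each location
    return int(np.argmax(expected)) if expected[i] < expected.max() else i

profit, i = 0.0, 0
for trip in range(1000):
    k = greedy_reposition(i)               # decision made at the end of a trip
    j = rng.choice(locations, p=p[k])      # customer's requested destination from k
    profit += fare[k][j]
    i = j
print("average fare per trip:", round(profit / 1000, 2))
```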
7.2.4 Selling an asset
An important problem class is determining when to sell an asset. Let:

\[ R_t = \begin{cases} 1 & \text{if we are holding the asset at time } t \\ 0 & \text{otherwise} \end{cases} \]
\[ x_t = \begin{cases} 1 & \text{if we sell the asset at time } t \\ 0 & \text{otherwise} \end{cases} \]
\[ p_t = \text{The price that we obtain for the asset if we sell at time } t. \]
\[ c_t = \text{The contribution from holding the asset during time interval } t. \]
\[ \hat{p}_t = \text{The change in price that arrives during time interval } t. \]
\[ T = \text{Time by which the asset must be sold.} \]

The state of our system is given by:

\[ S_t = (R_t, p_t) \]
which evolves according to the equations:

\[ R_t = \begin{cases} 1 & \text{if } R_{t-1} = 1 \text{ and } x_t = 0 \\ 0 & \text{otherwise} \end{cases} \]
\[ p_t = p_{t-1} + \hat{p}_t \]

In our modeling strategy, the price process (\hat{p}_t) can be quite general. There is an extensive literature on the asset selling process that assumes the sequence (\hat{p}_t) is independent, where the variance is either constant ("constant volatility") or time varying ("dynamic volatility"). We have a true state variable, and therefore a Markovian system, even if \hat{p}_t depends on p_{t-1}. If this assumption is violated (the price changes might depend on prices before time t - 1), then S_t is not a complete state variable. But even this does not prevent us from developing good approximate algorithms. For example, we may obtain price information from a real system (where we may not even have a mathematical model of the information process). We may still use our simpler state variable as a basis for building a value function approximation.

Our one-period contribution function is given by:

\[ C_t(S_t, x_t) = \begin{cases} p_t & R_t = 1 \text{ and } x_t = 1 \\ 0 & \text{otherwise} \end{cases} \]

Given a family of decision rules (X^\pi(S_t))_{\pi \in \Pi}, our problem is to solve:

\[ \max_{\pi \in \Pi} E\left\{ \sum_{t=0}^{\infty} \gamma^t C_t(S_t, X^\pi(S_t)) \right\} \]
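Before turning to solution strategies, here is a small simulation sketch of this model (our own; the price dynamics, horizon and the threshold policy are invented for illustration). It evaluates a simple decision rule X^\pi that sells the first time the price reaches a target, estimating the objective above by averaging sample paths.

```python
import numpy as np

rng = np.random.default_rng(3)
gamma, T, p0 = 0.99, 50, 100.0

def simulate(sell_threshold: float) -> float:
    """Discounted reward of one sample path under the threshold selling policy."""
    R, p = 1, p0
    for t in range(T):
        phat = rng.normal(0.0, 1.0)        # exogenous price change p_hat_t
        p = p + phat
        x = 1 if (p >= sell_threshold or t == T - 1) else 0   # must sell by time T
        if R == 1 and x == 1:
            return gamma**t * p            # contribution C_t = p_t when we sell
    return 0.0

for threshold in (100.0, 102.0, 104.0):
    values = [simulate(threshold) for _ in range(5000)]
    print(f"threshold {threshold}: mean discounted value = {np.mean(values):.2f}")
```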
7.3 Strategies for finite horizon problems
In this section, we sketch the basic strategies for using forward dynamic programming methods to approximate policies and value functions. The techniques are most easily illustrated using a post-decision state variable, after which we describe the modifications needed to handle a pre-decision state variable. We then describe a completely different strategy, known as Q-learning, to overcome the problem of approximating the expectation imbedded in a pre-decision framework.
7.3.1 Value iteration using a post-decision state variable
There are two simple strategies for estimating the value function for finite horizon problems. The first uses a single pass procedure. Here, we step forward in time using an approximation of the value function from the previous iteration. After solving a time-t problem, we use the result to update the value function approximation. The procedure is illustrated in figure 7.1.
Step 0: Initialization:
  Step 0a. Initialize \bar{V}^0_t, t ∈ T.
  Step 0b. Set n = 1.
  Step 0c. Initialize R^1_0.
Step 1: Choose a sample path \omega^n.
Step 2: Do for t = 1, 2, . . . , T:
  Step 2a: Let \omega^n_t = W_t(\omega^n).
  Step 2b: Solve:
  \[ \hat{v}^n_t = \max_{x_t \in \mathcal{X}_t} \Bigl( C_t(R^n_{t-1}, \omega^n_t, x_t) + \gamma \bar{V}^{n-1}_t\bigl(R^M(R^n_{t-1}, \omega^n_t, x_t)\bigr) \Bigr) \qquad (7.10) \]
  and let x^n_t be the value of x_t that solves (7.10).
  Step 2c: Update the value function:
  \[ \bar{V}^n_{t-1}(R^n_{t-1}) = (1 - \alpha_n)\bar{V}^{n-1}_{t-1}(R^n_{t-1}) + \alpha_n \hat{v}^n_t \qquad (7.11) \]
  Step 2d: Compute R^n_t = R^M(R^n_{t-1}, \omega^n_t, x^n_t).
Step 3. Increment n. If n ≤ N go to Step 1.
Step 4: Return the value functions (\bar{V}^n_t)^T_{t=1}.

Figure 7.1: Single pass version of the approximate dynamic programming algorithm.

A few notes are in order. We show in step 1 that we can choose the entire sample path before we start stepping forward through time. In many applications, random variables are correlated across time, or are dependent on the state. Frequently, the relevant information is very state dependent. For example, in our stochastic shortest path problem, the only information we want to see when we are at node i are the costs on links emanating from node i (so there is no point in generating costs on all the other links). For these applications, it will be much more natural to generate information as we progress.

After solving the problem at time t, we obtain an updated estimate of the value of being in state R_t, which we call \hat{v}_t. In step 2c, we then smooth over these to obtain an updated estimate of \bar{V}_{t-1}. The change in the time index can be confusing at first. \hat{v}_t is indexed by t because it is a random variable that depends on information from time interval t. The smoothing in step 2c has the effect of approximating the expectation over this information, producing an estimate of a function that depends only on information up through time t - 1.

Figure 7.1 is known as a single pass procedure because all the calculations are finished at the end of each forward pass. The updates of the value function take place as the algorithm progresses forward in time. The algorithm is fairly easy to implement, but may not provide the fastest convergence. As an alternative, we can use a double pass procedure, which is illustrated in figure 7.2. In this version, we step forward through time creating a trajectory of states, actions and outcomes. We then step backwards through time updating the value of being in a state using information from the same trajectory in the future.
Step 0. Initialize \bar{V}^0_t, t ∈ T.
  Step 0a. Set n = 1.
  Step 0b. Initialize R^1_0.
Step 1: Choose a sample path \omega^n.
Step 2: Do for t = 1, 2, . . . , T:
  Step 2a: Choose \omega^n_t = W_t(\omega^n).
  Step 2b: Find:
  \[ x^n_t = \arg\max_{x_t \in \mathcal{X}_t} \Bigl( C_t(R^n_{t-1}, \omega^n_t, x_t) + \gamma \bar{V}^{n-1}_t\bigl(R^M(R^n_{t-1}, \omega^n_t, x_t)\bigr) \Bigr) \]
  Step 2c: Compute R^n_t = R^M(R^n_{t-1}, \omega^n_t, x^n_t).
Step 3: Do for t = T, T - 1, . . . , 1:
  Step 3a: Compute \hat{v}^n_t using the decision x^n_t from the forward pass:
  \[ \hat{v}^n_t = C_t(R^n_{t-1}, \omega^n_t, x^n_t) + \gamma \hat{v}^n_{t+1} \]
  Step 3b: Update the value function approximations:
  \[ \bar{V}^n_{t-1}(R^n_{t-1}) = (1 - \alpha_n)\bar{V}^{n-1}_{t-1}(R^n_{t-1}) + \alpha_n \hat{v}^n_t \]
Step 4. Increment n. If n ≤ N go to Step 1.
Step 5: Return the value functions (\bar{V}^n_t)^T_{t=1}.

Figure 7.2: Double pass version of the approximate dynamic programming algorithm.

The result of our backward pass is \hat{v}^n_t, which is the contribution from the sample path \omega^n and a particular policy. Our policy is, quite literally, the set of decisions produced by the value function approximation \bar{V}^n. In the double pass algorithm, if we repeated step 4 over and over (for a particular initial state R_0), \bar{V}^n_0(R^n_0) would eventually converge to the correct estimate of the value of being in state R_0 and following the policy produced by the approximation \bar{V}^{n-1}_t. As a result, although \hat{v}^n_t is a valid, unbiased estimate of the value of being in state R^n_t at time t and following the policy produced by \bar{V}^n, we cannot say that \bar{V}^n_t(R^n_t) is an unbiased estimate of the value of being in state R^n_t. Rather, we can only talk about the properties of \bar{V}^n_t in the limit.
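As a concrete, runnable illustration of the single pass procedure in figure 7.1 (a sketch of our own, not code from the book), the following script applies it to the asset selling model of section 7.2.4 with a table look-up value function over discretized prices; the price dynamics, horizon, stepsize and discretization are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
T, gamma, N = 20, 0.99, 5000
P = np.arange(80, 121)                      # discretized price levels
alpha = lambda n: 1.0 / n                   # stepsize rule

# Table look-up value of the state (time, holding indicator, price index).
V = np.zeros((T + 1, 2, len(P)))

def transition(holding, p_idx, omega, x):
    """R^M: apply the price change omega, then the sell decision x."""
    p_new = int(np.clip(p_idx + omega, 0, len(P) - 1))
    return (0 if x == 1 else holding), p_new

for n in range(1, N + 1):
    holding, p_idx = 1, len(P) // 2         # initial state R_0
    for t in range(1, T + 1):
        omega = rng.integers(-2, 3)         # exogenous price change W_t
        best_v, best_x = -np.inf, 0
        for x in (0, 1) if holding else (0,):
            h2, p2 = transition(holding, p_idx, omega, x)
            reward = P[p2] if (holding and x == 1) else 0.0
            v = reward + gamma * V[t, h2, p2]
            if v > best_v:
                best_v, best_x = v, x
        # Step 2c: smooth v_hat into the value of the previous state (eq. 7.11).
        V[t - 1, holding, p_idx] += alpha(n) * (best_v - V[t - 1, holding, p_idx])
        holding, p_idx = transition(holding, p_idx, omega, best_x)

print("Estimated value of holding at price 100 at t=0:", round(V[0, 1, len(P)//2], 2))
```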
7.3.2 Value iteration using a pre-decision state variable
Our presentation up to now has focused on using a post-decision state variable, which gives us a much simpler process of finding decisions. It is much more common in the literature to formulate problems using the pre-decision state variable. The concepts we have described up to now all apply directly when we use a pre-decision state variable. For this section, we let R_t be the pre-decision state variable and represent the dynamics using the equation:

\[ R^n_{t+1} = R^M(R^n_t, x^n_t, W^n_{t+1}) \]
We remind the reader that we change the order of the arguments in the function R^M(·) when we use a pre-decision state variable. The arguments reflect the order in which events happen: we see the state R^n, we make a decision x^n, and then we see new information W^{n+1}. Since we always use only one form of the transition function in any particular application, we do not introduce a different functional name for reasons of notational simplicity. It is important for the reader to keep in mind whether a model is being formulated using the pre- or post-decision state variable.

A sketch of the general algorithm is given in figure 7.3. Although the algorithm closely parallels that for the post-decision state variable, there are important differences. The first one is how decisions are made. We can assume that we have a policy \pi that determines how we make a decision given the state R, but in the field of approximate dynamic programming, we generally have to resort to solving approximations of:

\[ x^n_t = \arg\max_{x_t} E\bigl\{ C(R^n_t, x_t, W_{t+1}) + \gamma V_{t+1}\bigl(R^M(R^n_t, x_t, W_{t+1})\bigr) \bigr\} \qquad (7.12) \]

In practice, we typically replace the value function V(\cdot) with an approximation \bar{V}(\cdot), and we approximate the expectation by taking a sample of outcomes and solving:

\[ x^n_t = \arg\max_{x_t \in \mathcal{X}_t} \sum_{\hat{\omega} \in \hat{\Omega}^n} p^n(\hat{\omega}) \Bigl( C(R^n_t, x_t, W_{t+1}(\hat{\omega})) + \gamma \bar{V}^{n-1}_{t+1}\bigl(R^M(R^n_t, x_t, W_{t+1}(\hat{\omega}))\bigr) \Bigr) \]
A second difference is that \hat{v}^n is now an approximation of an expectation. We can still do smoothing to update \bar{V}^n, but the choice of stepsize should reflect the size of the sample \hat{\Omega}. Students should pay particular attention to the indexing over time and iterations. In equation (7.11), we smoothed in \hat{v}^n_t to update our estimate of \bar{V}^n_{t-1}. In equation (7.14), we smooth in \hat{v}^n_t to update our estimate of \bar{V}^n_t. The reason is that in equation (7.14), \hat{v}^n_t is actually an approximation of the expectation of \bar{V}_t (rather than just a sample realization).

Step 0: Initialization:
  Step 0a. Initialize \bar{V}^0_t, t ∈ T.
  Step 0b. Set n = 1.
  Step 0c. Initialize R_0.
Step 1: Do while n ≤ N:
Step 2: Do for t = 0, 1, . . . , T:
  Step 2a: Choose \hat{\Omega}^n ⊆ \Omega and solve:
  \[ \hat{v}^n_t = \max_{x_t \in \mathcal{X}_t} \sum_{\hat{\omega} \in \hat{\Omega}^n} p^n(\hat{\omega}) \Bigl( C_t(R^{n-1}_t, x_t, W_{t+1}(\hat{\omega})) + \gamma \bar{V}^{n-1}_{t+1}\bigl(R^M(R^{n-1}_t, x_t, W_{t+1}(\hat{\omega}))\bigr) \Bigr) \qquad (7.13) \]
  and let x^n_t be the value of x_t that solves (7.13).
  Step 2b: Sample \omega^n_t and compute:
  \[ R^n_{t+1} = R^M(R^n_t, x_t, \omega^n_{t+1}). \]
  Step 2c: Update the value function:
  \[ \bar{V}^n_t(R^n_t) = (1 - \alpha_n)\bar{V}^{n-1}_t(R^n_t) + \alpha_n \hat{v}^n_t \qquad (7.14) \]
Step 3: Return the value functions (\bar{V}^n_t)^T_{t=1}.

Figure 7.3: Approximate dynamic programming using a pre-decision state variable.
7.3.3 Q-learning
Return for the moment to the classical way of making decisions using dynamic programming. Normally we would look to solve:

\[ x^n_t = \arg\max_{x_t \in \mathcal{X}_t} \Bigl( C_t(R_t, x_t) + \gamma E\, \bar{V}^{n-1}_{t+1}\bigl(R_{t+1}(R_t, x_t, W_{t+1})\bigr) \Bigr) \qquad (7.15) \]
Solving equation (7.15) can be problematic for two different reasons. The first is that we may not know the underlying distribution of the exogenous information process. If we do not know the probability of an outcome, then we cannot compute the expectation. These are problems where we do not have a model of the information process. The second reason is that
while we may know the probability distribution, the expectation may be computationally intractable. This typically arises when the information process is characterized by a vector of random variables. We can circumvent this problem by replacing the expectation with a (not-too-large) random sample of possible realizations, which we represent by \hat{\Omega}. We may construct \hat{\Omega} so that each outcome \omega \in \hat{\Omega} occurs with equal probability (that is, 1/|\hat{\Omega}|), or each may have its own probability p(\omega). Using this approach, we would make a decision using:

\[ x^n_t = \arg\max_{x_t \in \mathcal{X}_t} \Bigl( C_t(R_t, x_t) + \gamma \sum_{\omega \in \hat{\Omega}} p(\omega)\, \bar{V}^{n-1}_{t+1}\bigl(R_{t+1}(R_t, x_t, W_{t+1}(\omega))\bigr) \Bigr) \qquad (7.16) \]
Solving equation (7.16) can be computationally difficult for some applications. For example, if x_t is a vector, then solving the myopic problem (the value function is zero) may be a linear or integer program of significant size (that is solvable, but not easily). Solving it over a set of scenarios \hat{\Omega} makes the problem dramatically larger. One thought is to solve the problem for a single sample realization:

\[ x^n_t = \arg\max_{x_t \in \mathcal{X}_t} \Bigl( C_t(R_t, x_t) + \gamma \bar{V}^{n-1}_{t+1}\bigl(R_{t+1}(R_t, x_t, W_{t+1}(\omega))\bigr) \Bigr) \qquad (7.17) \]
The problem is that this means we are choosing xt for a particular realization of the future information Wt+1 (ω). This problem is probably solvable, but it is not likely to be a reasonable
approximation (we can always do much better if we make a decision now knowing what is going to happen in the future). But what if we choose the decision x_t first (for example, at random), and then compute the cost? Let the resulting cost be represented using:

\[ \hat{Q}^n_t(R_t, x_t, W_{t+1}(\omega^n)) = C_t(R_t, x_t) + \gamma \bar{V}^{n-1}_{t+1}\bigl(R^M(R_t, x_t, W_{t+1}(\omega^n))\bigr) \]

We could now smooth these values to obtain:

\[ \bar{Q}^n_t(R_t, x_t) = (1 - \alpha_n)\bar{Q}^{n-1}_t(R_t, x_t) + \alpha_n \hat{Q}^n_t(R_t, x_t, W_{t+1}(\omega^n)) \]

We use \bar{Q}^n_t(R_t, x_t) as an approximation of:

\[ Q_t(R_t, x_t) = E\bigl\{ C_t(R_t, x_t) + \gamma V_{t+1}\bigl(R^M(R_t, x_t, W_{t+1})\bigr) \mid R_t \bigr\} \]

The functions Q_t(R_t, x_t) are known as Q-factors, and they capture the value of being in a state and taking a particular action. We can now choose an action by solving:

\[ x^n_t = \arg\max_{x_t \in \mathcal{X}_t} \bar{Q}^n_t(R_t, x_t) \qquad (7.18) \]
This strategy is known as Q-learning. The complete algorithm is summarized in figure 7.4. A major advantage of the strategy is that we do not have to compute the expectation in equation (7.15), or even solve approximations of the form in equation (7.16). The price of this convenience is that we have significantly enlarged the statistical problem. If we let R be the state space and X be the action space, this implies that we have to learn |R| × |X | parameters. For multidimensional problems, this can be a daunting task, and unlikely to be of practical value. But there is another application. Assume, for example, that we do not know the probability distribution of the exogenous information, such as might arise with a control problem running in real time. If we choose an action xt (given a state Rt ), the physical process tells us the contribution Ct as well as the next state Rt+1 . Implicit in the generation of the state Rt+1 is both the exogenous information as well as the transition function. For problems where the state and action spaces are not too large, but where we do not have a model of the information process or transition function, Q-learning can be a valuable tool. The challenge that we face with Q-learning is that it is replacing the problem of finding a value function V (R) with that of finding a function Q(R, x) of both the state and the action. If we are working on a problem of a single resource with a relatively small attribute vector a (recall that with a single resource, the resource state space is the same as the attribute space) and not too many decision types d, this technique should work fine. Of course, if the state and action spaces are small, we can use standard backward dynamic programming techniques, but this assumes that we have a one-step transition matrix. It is for this reason that some authors describe Q-learning as a technique for problems where we are missing the transition matrix (random outcomes come from an exogenous source).
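Figure 7.4 below summarizes the complete algorithm. The following toy sketch (our own; the two-state simulator and all parameters are invented) illustrates the updates (7.19)-(7.20) when transitions and contributions come from a simulator rather than a one-step transition matrix; a small amount of random exploration is added so that every action gets sampled, something the pseudocode itself does not require.

```python
import numpy as np

rng = np.random.default_rng(5)
T, gamma, N = 5, 0.95, 20000
n_states, n_actions = 2, 2
Q = np.zeros((T + 2, n_states, n_actions))   # Q-factors indexed by time, state, action

def simulate(state, action):
    """Exogenous simulator: returns (contribution, next state); transitions are not modeled."""
    reward = [[1.0, 0.0], [0.0, 2.0]][state][action] + 0.1 * rng.standard_normal()
    next_state = rng.integers(0, n_states)
    return reward, next_state

for n in range(1, N + 1):
    alpha, R = 1.0 / (1 + n / 100), 0        # declining stepsize; start in state 0
    for t in range(1, T + 1):
        # Step 2a: (epsilon-)greedy decision from the current Q-factors.
        x = rng.integers(0, n_actions) if rng.random() < 0.1 else int(np.argmax(Q[t, R]))
        C, R_next = simulate(R, x)
        # Steps 2b-2c: sampled Q-factor and smoothing; V-bar is the max over actions.
        q_hat = C + gamma * Q[t + 1, R_next].max()
        Q[t, R, x] = (1 - alpha) * Q[t, R, x] + alpha * q_hat
        R = R_next

print("Q-factors at t=1:\n", np.round(Q[1], 2))
```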
Step 0: Initialization:
  Step 0a. Initialize an approximation for the value function \bar{Q}^0_t(R_t, x_t) for all states R_t and decisions x_t ∈ \mathcal{X}_t, t ∈ T.
  Step 0b. Set n = 1.
  Step 0c. Initialize R^1_0.
Step 1: Choose a sample path \omega^n.
Step 2: Do for t = 1, 2, . . . , T:
  Step 2a: Find the decision using the current Q-factors:
  \[ x^n_t = \arg\max_{x_t \in \mathcal{X}_t} \bar{Q}^{n-1}_t(R^n_t, x_t) \qquad (7.19) \]
  Step 2b. Compute:
  \[ \hat{Q}^n_{t+1}(R^n_t, x^n_t, W_{t+1}(\omega^n)) = C_t(R^n_t, x^n_t) + \gamma \bar{V}^{n-1}_{t+1}\bigl(R^M(R^n_t, x^n_t, W_{t+1}(\omega^n))\bigr) \qquad (7.20) \]
  Step 2c: Update \bar{Q}^{n-1}_t and \bar{V}^{n-1}_t using:
  \[ \bar{Q}^n_t(R^n_t, x^n_t) = (1 - \alpha_n)\bar{Q}^{n-1}_t(R^n_t, x^n_t) + \alpha_n \hat{Q}^n_{t+1}(R^n_t, x^n_t, W_{t+1}(\omega^n)) \]
  \[ \bar{V}^n_t(R^n_t) = \max_{x_t} \bar{Q}^n_t(R^n_t, x_t) \]
  Step 2d: Find the next state:
  \[ R^n_{t+1} = R^M(R^n_t, x^n_t, W_{t+1}(\omega^n)). \]
Step 3. Increment n. If n ≤ N go to Step 1.
Step 4: Return the Q-factors (\bar{Q}^n_t)^T_{t=1}.

Figure 7.4: A Q-learning algorithm.

Q-learning has been applied to a variety of problems. One famous illustration involved the management of a set of elevators. However, there are many problems (in particular those involving the management of multiple assets simultaneously) where the state space is already large, and the action space can be many times larger. For these problems, estimating a function Q(R, x) even when R and x have small dimensions would be completely intractable.

There is a cosmetic similarity between Q-learning and approximate dynamic programming using a post-decision state variable. The post-decision state variable requires finding \bar{V}^{x,n-1}_t(R^x_t) and then finding an action by solving:

\[ \hat{V}^{x,n}_t(R^x_{t-1}, W_t(\omega^n)) = \max_{x_t \in \mathcal{X}_t} \Bigl( C_t(R^x_{t-1}, W_t(\omega^n), x_t) + \gamma \bar{V}^{x,n-1}_t\bigl(R^x_t(R^x_{t-1}, W_t(\omega^n), x_t)\bigr) \Bigr) \]
Since we can write R_t as a function of R^x_{t-1} and W_t, we can replace (R^x_{t-1}, W_t(\omega^n)) with the pre-decision state R_t(\omega^n), giving us:

\[ \hat{V}^n_t(R^n_t(\omega^n)) = \max_{x_t \in \mathcal{X}_t} \Bigl( C_t(R^n_t(\omega^n), x_t) + \gamma \bar{V}^{n-1}_t\bigl(R^x_t(R^n_t(\omega^n), x_t)\bigr) \Bigr) \]
\[ = \max_{x_t \in \mathcal{X}_t} \hat{Q}^{x,n}_t(R^n_t(\omega^n), x_t) \]
where

\[ \hat{Q}^{x,n}_t(R^n_t(\omega^n), x_t) = C_t(R^n_t(\omega^n), x_t) + \gamma \bar{V}^{n-1}_t\bigl(R^x_t(R^n_t(\omega^n), x_t)\bigr) \qquad (7.21) \]
is a form of Q-factor computed using the post-decision value function. Although it is not the same as the original Q-factor, it is still a function that estimates the value of being in a particular state and taking an action. Thus, estimating a Q-factor can be viewed as the same as learning a post-decision value function. Viewed this way, Q-learning is easy to confuse with learning a post-decision value function. Computationally, however, they are quite different. In Q-learning, we face the challenge of estimating |R| × |X| parameters. If R^x is the state space for the post-decision state variable, we face the problem of estimating only |R^x| parameters if we use the post-decision state variable. In many applications, |R^x| is much smaller than |R| × |X|.

If g' > g (which means that aggregation g' is more aggregate than g), then the statistic \bar{v}^{(g',n)}_a is computed using observations \hat{v}^n_a that are also used to compute \bar{v}^{(g,n)}_a. We can see this relationship clearly by writing \bar{v}^{(g')}_a using:
\[ \bar{v}^{(g')}_a = \frac{1}{N^{(g')}_a} \sum_{n \in \mathcal{N}^{(g')}_a} \hat{v}^n \]
\[ = \nu^{(g)}_a + \frac{1}{N^{(g')}_a} \sum_{n \in \mathcal{N}^{(g')}_a} \mu^n_a + \frac{1}{N^{(g')}_a} \sum_{n \in \mathcal{N}^{(g')}_a} \varepsilon^n \]
\[ = \nu^{(g)}_a + \bar{\mu}^{(g')}_a + \frac{1}{N^{(g')}_a} \left( \sum_{n \in \mathcal{N}^{(g)}_a} \varepsilon^n + \sum_{n \in \mathcal{N}^{(g')}_a \setminus \mathcal{N}^{(g)}_a} \varepsilon^n \right) \]

This relationship shows us that we can write the error term at the higher level of aggregation g' as a sum of a term involving the errors at the lower level of aggregation g (for the same state a) and a term involving errors from other states a'' where G^{g'}(a'') = G^{g'}(a), given by:

\[ \bar{\varepsilon}^{(g')}_a = \frac{1}{N^{(g')}_a} \left( \sum_{n \in \mathcal{N}^{(g)}_a} \varepsilon^n + \sum_{n \in \mathcal{N}^{(g')}_a \setminus \mathcal{N}^{(g)}_a} \varepsilon^n \right) \]
\[ = \frac{N^{(g)}_a}{N^{(g')}_a}\, \bar{\varepsilon}^{(g)}_a + \frac{1}{N^{(g')}_a} \sum_{n \in \mathcal{N}^{(g')}_a \setminus \mathcal{N}^{(g)}_a} \varepsilon^n \qquad (9.9) \]

We can overcome this problem by rederiving the expression for the optimal weights. For a given (disaggregate) attribute a, the problem of finding the optimal weights (w^{(g)}_a)_{g \in \mathcal{G}} is stated as follows:

\[ \min_{w^{(g)}_a,\, g \in \mathcal{G}} \; E\left[ \frac{1}{2} \left( \sum_{g \in \mathcal{G}} w^{(g)}_a \bar{v}^{(g)}_a - \nu_a \right)^2 \right] \qquad (9.10) \]
subject to:

\[ \sum_{g \in \mathcal{G}} w^{(g)}_a = 1 \qquad (9.11) \]
\[ w^{(g)}_a \ge 0, \quad g \in \mathcal{G} \qquad (9.12) \]

Let:

\[ \bar{\delta}^{(g)}_a = \text{The error in the estimate } \bar{v}^{(g)}_a \text{ from the true value associated with attribute vector } a \]
\[ = \bar{v}^{(g)}_a - \nu_a \]

The optimal solution is given in the following theorem:

Theorem 1 For a given attribute vector, a, the optimal weights, w^{(g)}_a, g ∈ \mathcal{G}, where the individual estimates are correlated by way of a tree structure, are given by solving the following system of linear equations in (w, \lambda):

\[ \sum_{g \in \mathcal{G}} w^{(g)}_a E\bigl[\bar{\delta}^{(g)}_a \bar{\delta}^{(g')}_a\bigr] - \lambda = 0 \quad \forall\, g' \in \mathcal{G} \qquad (9.13) \]
\[ \sum_{g \in \mathcal{G}} w^{(g)}_a = 1 \qquad (9.14) \]
\[ w^{(g)}_a \ge 0 \quad \forall\, g \in \mathcal{G} \qquad (9.15) \]

Proof: The proof is not too difficult, and it illustrates how we obtain the optimal weights. We start by formulating the Lagrangian for the problem formulated in (9.10)-(9.12), which gives us:

\[ L(w, \lambda) = E\left[ \frac{1}{2}\left( \sum_{g \in \mathcal{G}} w^{(g)}_a \bar{v}^{(g)}_a - \nu_a \right)^2 \right] + \lambda\left( 1 - \sum_{g \in \mathcal{G}} w^{(g)}_a \right) \]

The first order optimality conditions are:

\[ E\left[ \left( \sum_{g \in \mathcal{G}} w^{(g)}_a \bar{v}^{(g)}_a - \nu_a \right) \bigl( \bar{v}^{(g')}_a - \nu_a \bigr) \right] - \lambda = 0 \quad \forall\, g' \in \mathcal{G} \qquad (9.16) \]
\[ \sum_{g \in \mathcal{G}} w^{(g)}_a - 1 = 0 \qquad (9.17) \]
To simplify equation (9.16), we note that:

\[ E\left[ \left( \sum_{g \in \mathcal{G}} w^{(g)}_a \bar{v}^{(g)}_a - \nu_a \right)\bigl( \bar{v}^{(g')}_a - \nu_a \bigr) \right] = E\left[ \sum_{g \in \mathcal{G}} w^{(g)}_a \bar{\delta}^{(g)}_a \bar{\delta}^{(g')}_a \right] = \sum_{g \in \mathcal{G}} w^{(g)}_a E\bigl[\bar{\delta}^{(g)}_a \bar{\delta}^{(g')}_a\bigr] \qquad (9.18) \]

Combining equations (9.16) and (9.18) gives us equation (9.13), which completes the proof.

Finding the optimal weights that handle the correlations between the statistics at different levels of aggregation requires finding E[\bar{\delta}^{(g)}_a \bar{\delta}^{(g')}_a]. We are going to compute this expectation by conditioning on the set of attributes \hat{a}^n that are sampled. This means that our expectation is defined over the outcome space \Omega_\varepsilon. The expectation is computed using:

Proposition 2 The coefficients of the weights in equation (9.13) can be expressed as follows:

\[ E\bigl[\bar{\delta}^{(g)}_a \bar{\delta}^{(g')}_a\bigr] = E\bigl[\bar{\mu}^{(g)}_a \bar{\mu}^{(g')}_a\bigr] + \frac{N^{(g,n)}_a}{N^{(g')}_a}\, E\bigl[\bigl(\bar{\varepsilon}^{(g)}_a\bigr)^2\bigr] \quad \forall\, g \le g' \text{ and } g, g' \in \mathcal{G} \qquad (9.19) \]

The proof is given in section 9.6.2.

Now consider what happens when we make the assumption that the measurement error \varepsilon^n is independent of the attribute being sampled, \hat{a}^n. We do this by assuming that the variance of the measurement error is a constant given by \sigma^2_\varepsilon. This gives us the following result:

Corollary 1 For the special case where the statistical noise in the measurement of the values is independent of the attribute vector sampled, equation (9.19) reduces to:

\[ E\bigl[\bar{\delta}^{(g)}_a \bar{\delta}^{(g')}_a\bigr] = E\bigl[\bar{\mu}^{(g)}_a \bar{\mu}^{(g')}_a\bigr] + \frac{\sigma^2_\varepsilon}{N^{(g')}_a} \qquad (9.20) \]

The proof is given in section 9.6.1.

For the case where g = 0 (the most disaggregate level), we assume that \mu^{(0)}_a = 0, which gives us:

\[ E\bigl[\bar{\mu}^{(0)}_a \bar{\mu}^{(g')}_a\bigr] = 0 \]

This allows us to further simplify (9.20) to obtain:

\[ E\bigl[\bar{\delta}^{(0)}_a \bar{\delta}^{(g')}_a\bigr] = \frac{\sigma^2_\varepsilon}{N^{(g')}_a} \qquad (9.21) \]
9.3.3 A special case: two levels of aggregation
It is useful to consider the case of two levels of aggregation, since this allows us to compute the weights analytically. For the case of two levels of aggregation, the system of linear equations given by (9.13)-(9.15) reduces to:

\[ E\bigl[(\bar{\delta}^{(0)}_a)^2\bigr] w^{(0)}_a + E\bigl[\bar{\delta}^{(0)}_a \bar{\delta}^{(1)}_a\bigr] w^{(1)}_a - \lambda = 0 \qquad (9.22) \]
\[ E\bigl[\bar{\delta}^{(0)}_a \bar{\delta}^{(1)}_a\bigr] w^{(0)}_a + E\bigl[(\bar{\delta}^{(1)}_a)^2\bigr] w^{(1)}_a - \lambda = 0 \qquad (9.23) \]
\[ w^{(0)}_a + w^{(1)}_a = 1 \qquad (9.24) \]

Solving for the weight on the disaggregate level produces:

\[ w^{(0)}_a = \frac{E\bigl[(\bar{\delta}^{(1)}_a)^2\bigr] - E\bigl[\bar{\delta}^{(0)}_a \bar{\delta}^{(1)}_a\bigr]}{E\bigl[(\bar{\delta}^{(0)}_a)^2\bigr] + E\bigl[(\bar{\delta}^{(1)}_a)^2\bigr] - 2 E\bigl[\bar{\delta}^{(0)}_a \bar{\delta}^{(1)}_a\bigr]} \qquad (9.25) \]

By contrast, if we assume the estimates at the different levels of aggregation are independent, we can use E[\bar{\delta}^{(0)}_a \bar{\delta}^{(1)}_a] = 0, which gives us:

\[ w^{(0)}_a = \frac{E\bigl[(\bar{\delta}^{(1)}_a)^2\bigr]}{E\bigl[(\bar{\delta}^{(0)}_a)^2\bigr] + E\bigl[(\bar{\delta}^{(1)}_a)^2\bigr]} \qquad (9.26) \]

To see the relationship between the two formulas, we use the following results from section 9.3.2:

\[ E\bigl[\bar{\delta}^{(0)} \bar{\delta}^{(1)}\bigr] = \frac{\sigma^2_\varepsilon}{N^{(1)}} \quad \text{from (9.21)} \]
\[ E\bigl[(\bar{\delta}^{(0)})^2\bigr] = \frac{\sigma^2_\varepsilon}{N^{(0)}} \quad \text{from (9.21)} \]
\[ E\bigl[(\bar{\delta}^{(1)})^2\bigr] = E\bigl[(\bar{\mu}^{(1)})^2\bigr] + \frac{\sigma^2_\varepsilon}{N^{(1)}} \quad \text{from (9.20)} \]
\[ \qquad\quad\;\, = E\bigl[(\bar{\mu}^{(1)})^2\bigr] + E\bigl[\bar{\delta}^{(0)} \bar{\delta}^{(1)}\bigr] \qquad (9.27) \]

Substituting (9.27) into (9.25) gives:

\[ w^{(0)}_a = \frac{E\bigl[(\bar{\mu}^{(1)})^2\bigr] + E\bigl[\bar{\delta}^{(0)} \bar{\delta}^{(1)}\bigr] - E\bigl[\bar{\delta}^{(0)}_a \bar{\delta}^{(1)}_a\bigr]}{E\bigl[(\bar{\delta}^{(0)}_a)^2\bigr] + E\bigl[(\bar{\mu}^{(1)})^2\bigr] + E\bigl[\bar{\delta}^{(0)} \bar{\delta}^{(1)}\bigr] - 2 E\bigl[\bar{\delta}^{(0)}_a \bar{\delta}^{(1)}_a\bigr]} \]
\[ \qquad = \frac{E\bigl[(\bar{\mu}^{(1)})^2\bigr]}{E\bigl[(\bar{\delta}^{(0)}_a)^2\bigr] + E\bigl[(\bar{\mu}^{(1)})^2\bigr] - E\bigl[\bar{\delta}^{(0)}_a \bar{\delta}^{(1)}_a\bigr]} \qquad (9.28) \]
From (9.28) we see that as the bias goes to zero, the weight on the disaggregate level goes to zero. Similarly, as the bias grows, the weight on the disaggregate level increases.
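The formulas above are easy to evaluate numerically. The sketch below (our own, with invented values for the noise, the sample counts and the aggregation bias) computes the weight on the disaggregate estimate using (9.25), (9.26) and (9.28), and confirms that (9.25) and (9.28) agree.

```python
import numpy as np

sigma_eps = 5.0      # std. dev. of a single observation (assumed constant)
N0, N1 = 4, 100      # observations of this attribute vs. of its aggregate cell
mu_bar_sq = 2.0**2   # E[(mu_bar^(1))^2], the squared bias of the aggregate estimate

E_d0_sq = sigma_eps**2 / N0            # E[(delta^(0))^2], from (9.21)
E_d0_d1 = sigma_eps**2 / N1            # E[delta^(0) delta^(1)], from (9.21)
E_d1_sq = mu_bar_sq + E_d0_d1          # E[(delta^(1))^2], from (9.20)

w0 = (E_d1_sq - E_d0_d1) / (E_d0_sq + E_d1_sq - 2 * E_d0_d1)   # eq. (9.25)
w0_alt = mu_bar_sq / (E_d0_sq + mu_bar_sq - E_d0_d1)           # eq. (9.28), same value
w0_indep = E_d1_sq / (E_d0_sq + E_d1_sq)                       # eq. (9.26), independence

print("weight on disaggregate estimate, (9.25):", round(w0, 3))
print("weight on disaggregate estimate, (9.28):", round(w0_alt, 3))
print("weight if independence is assumed, (9.26):", round(w0_indep, 3))
```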
9.3.4 Experimenting with hierarchical aggregation
Using a value function approximation based on a weighted, multiple aggregation formula can produce a significant increase in the quality of the solution. Figure 9.5 shows the results of estimating the value of our nomadic trucker with three attributes. We are using a pure exploration strategy so that we are focused purely on the problem of estimating the value of the attribute. Also shown are the estimates when we use a single level of aggregation (we tried each of the four levels of aggregation in isolation). The weighted estimate works the best at almost all iterations (the level three aggregation works slightly better in the very beginning). Only the pure disaggregate estimate matches the weighted combination in the limit. As the number of observations increases, all of the more aggregate estimates level off with a higher error (as we would expect). These results will, of course, vary between applications. It is not uncommon for a higher level of aggregation to work best at first. However, it appears that using a weighted combination across the levels of aggregation can be quite robust.

Figure 9.5: Using a mixture of estimates based on different levels of aggregation can give more accurate estimates than using a single, disaggregate estimate. (The figure plots the average percentage error from the exact value function against the total number of observations for the multi-attribute nomadic trucker under myopic runs, comparing the disaggregate estimate, aggregate levels 1 through 4, and the weighted combination.)
9.4 General regression models
There are many problems where we can exploit structure in the state variable, allowing us to propose functions characterized by a small number of parameters which have to be estimated statistically. Section 9.1 represented one version where we had a parameter for each (possibly aggregated) state. The only structure we assumed was implicit in the ability to specify a series of one or more aggregation functions. In this section, we are going to allow ourselves to create a much wider range of approximations by viewing the value function as nothing more than a complex quantity that we want to predict using standard statistical techniques. Using conventional statistical notation, imagine that we have a series of explanatory (independent) variables (Xi )i∈I , and a single dependent variable Y that we would like to predict given knowledge of each Xi . Further assume that we have n observations of each, and let xin be the nth observation of xi , and let
y^n be the corresponding dependent variable. For simplicity, let's now assume that we think that a reasonable model can be formed to explain Y using:

\[ Y = \theta_0 + \sum_{i \in I} \theta_i X_i \]
This is the classic linear-in-the-parameters model that is often the subject of introductory regression courses. Since we have I + 1 parameters, as long as we have at least n ≥ I + 1 observations (and some other conditions) we can find a vector \theta that minimizes the deviations between the predicted values of Y and the observed values. This is the science of regression. The art of regression is determining an appropriate set of explanatory variables.

This is exactly the approach we are going to take to approximate a value function. Our observed dependent variables are the updates to the value function that we have represented as \hat{v}^n in the past. For each observed \hat{v}^n there is a corresponding state, which in this chapter we have represented as our attribute vector \hat{a}^n. Now assume that we have, through knowledge and insight, decided that we can capture what is important through a series of functions which we are going to represent as (\phi_b(a))_{b \in B}. These functions are often referred to as features, since they are expected to capture the important aspects of a state variable. The number of these functions, given by the size of the set B, is generally constructed so that this is not too large (and certainly nowhere near the size of the state space A). For historical reasons, these functions are known in the approximate dynamic programming literature as basis functions (hence the choice of notation for indexing the functions).
9.4.1 Pricing an American option
Consider the problem of determining the value of an American put option which gives us the right to sell at $1.20 at any of four time periods. We assume a discount factor of 0.95, representing a five percent rate of return (compounded at each time period rather than continuously). If we wait until time period 4, the option must be exercised then (or expire), and we receive nothing if the price is over $1.20. At intermediate periods, however, we may choose to hold the option even if the price is below $1.20 (of course, exercising it if the price is above $1.20 does not make sense). Our problem is to determine whether to hold or exercise the option at the intermediate points.

From history, we have found 10 samples of price trajectories which are shown in table 9.2. If we wait until time period 4, our payoff is shown in table 9.3, which is zero if the price is above $1.20, and 1.20 - p_4 for prices below $1.20.

                 Stock prices
                 Time period
    Outcome     1      2      3      4
       1      1.21   1.08   1.17   1.15
       2      1.09   1.12   1.17   1.13
       3      1.15   1.08   1.22   1.35
       4      1.17   1.12   1.18   1.15
       5      1.08   1.15   1.10   1.27
       6      1.12   1.22   1.23   1.17
       7      1.16   1.14   1.13   1.19
       8      1.22   1.18   1.21   1.28
       9      1.08   1.11   1.09   1.10
      10      1.15   1.14   1.18   1.22

Table 9.2: Ten sample realizations of prices over four time periods.

              Option value at t = 4
                 Time period
    Outcome     1      2      3      4
       1        -      -      -    0.05
       2        -      -      -    0.07
       3        -      -      -    0.00
       4        -      -      -    0.05
       5        -      -      -    0.00
       6        -      -      -    0.03
       7        -      -      -    0.01
       8        -      -      -    0.00
       9        -      -      -    0.10
      10        -      -      -    0.00

Table 9.3: The payout at time 4 if we are still holding the option.

At time t = 3, we have access to the price history (p_1, p_2, p_3). Since we may not be able
to assume that the prices are independent or even Markovian (where $p_3$ depends only on $p_2$), the entire price history represents our state variable. We wish to predict the value of holding the option at time $t = 4$. Let $V_4(a_4)$ be the value of the option if we are holding it at time 4, given the state (which includes the price $p_4$) at time 4. Now let the conditional expectation at time 3 be
$$\bar{V}_3(a_3) = E\{V_4(a_4) \mid a_3\}$$
Our goal is to approximate $\bar{V}_3(a_3)$ using information we know at time 3. For our basis functions, we propose a linear regression of the form
$$Y = \theta_0 + \theta_1 X_1 + \theta_2 X_2 + \theta_3 X_3$$
where
$$Y = V_4, \qquad X_1 = p_2, \qquad X_2 = p_3, \qquad X_3 = (p_3)^2$$
Keep in mind that our explanatory variables $X_i$ must be functions of information we have at time $t = 3$, whereas we are trying to predict what will happen at time $t = 4$ (the payoff). We would then set up the data matrix given in table 9.4.
                    Regression data
            Independent variables     Dependent variable
 Outcome     X1      X2       X3              Y
    1       1.08    1.17    1.3689          0.05
    2       1.12    1.17    1.3689          0.07
    3       1.08    1.22    1.4884          0.00
    4       1.12    1.18    1.3924          0.05
    5       1.15    1.10    1.2100          0.00
    6       1.22    1.23    1.5129          0.03
    7       1.14    1.13    1.2769          0.01
    8       1.18    1.21    1.4641          0.00
    9       1.11    1.09    1.1881          0.10
   10       1.14    1.18    1.3924          0.00

Table 9.4: The data table for our regression at time 3.

We may now run a regression on this data to determine the parameters $(\theta_i)_{i=0}^{3}$.
It makes sense to consider only the paths which produce a positive value in the fourth time period. The linear regression is only an approximation, and it is best to fit the approximation in the region of prices which are the most interesting (we could use the same reasoning to include some "near misses"). For our illustration, however, we use all 10 observations, which produces the equation
$$\bar{V}_3 \approx 0.0056 - 0.1234\, p_2 + 0.6011\, p_3 - 0.3903\, (p_3)^2$$
$\bar{V}_3$ is (an approximation of) the expected payoff if we hold the option until time period 4. We can now use this approximation to help us decide what to do at time $t = 3$. Table 9.5 compares the value of exercising the option at time 3 against the value of holding the option until time 4, computed as $\gamma \bar{V}_3(a_3)$. Taking the larger of the two payouts, we find, for example, that we would hold the option given sample paths 1-4, but would exercise given sample paths 5, 7 and 9.
                        Rewards
                        Decision
 Outcome    Exercise             Hold
    1         0.03     0.04155 × .95 = 0.03947
    2         0.03     0.03662 × .95 = 0.03479
    3         0.00     0.02397 × .95 = 0.02372
    4         0.02     0.03346 × .95 = 0.03178
    5         0.10     0.05285 × .95 = 0.05021
    6         0.00     0.00414 × .95 = 0.00394
    7         0.07     0.00899 × .95 = 0.00854
    8         0.00     0.01610 × .95 = 0.01530
    9         0.11     0.06032 × .95 = 0.05731
   10         0.02     0.03099 × .95 = 0.02944

Table 9.5: The payout if we exercise at time 3, and the expected value of holding based on our approximation.

We can repeat the exercise to estimate $\bar{V}_2(a_2)$. This time, our dependent variable "Y" can be calculated in two different ways. The simplest is to take the larger of the two columns from table 9.5. So, for sample path 1, we would have $Y_1 = \max\{0.03, 0.04155\} = 0.04155$. This means that our observed value is actually based on our approximate value function $\bar{V}_3(a_3)$. This represents an implementation of our single-pass algorithm described in figure 7.1.

An alternative way of computing the observed value of holding the option at time 3 is to use the approximate value function to determine the decision, but then use the actual price we receive when we eventually exercise the option. Using this method, we receive 0.05 for the first sample path because we decide to hold the asset at time 3 (based on our approximate value function), after which the option turns out to be worth 0.05; discounted, this is worth 0.0475. For sample path 2, the option proves to be worth 0.07, which discounts back to 0.0665 (we decided to hold at time 3, and the option was worth 0.07 at time 4). For sample path 5, the option is worth 0.10 because we decided to exercise at time 3. This is exactly the double-pass algorithm given in figure 7.2.

Regardless of which way we compute the value of the problem at time 3, the remainder of the procedure is the same. We have to construct the dependent variable "Y" and regress it against our observations of the value of the option at time 3, using the price history $(p_1, p_2)$. Our only change in methodology would occur at time 1, where we would have to use a different model (because we do not have a price at time 0).
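The time-3 regression above is easy to reproduce numerically. The sketch below (a Python illustration, not part of the original text) fits the four coefficients to the data of table 9.4 and then compares exercising against holding as in table 9.5. Because $p_3$ and $(p_3)^2$ are nearly collinear over this narrow price range, the fitted coefficients are sensitive to rounding and may not match the equation in the text exactly, even though the resulting hold/exercise comparison is essentially the same.

```python
# Time-3 regression for the American put, using the prices of table 9.2.
import numpy as np

p2 = np.array([1.08, 1.12, 1.08, 1.12, 1.15, 1.22, 1.14, 1.18, 1.11, 1.14])
p3 = np.array([1.17, 1.17, 1.22, 1.18, 1.10, 1.23, 1.13, 1.21, 1.09, 1.18])
p4 = np.array([1.15, 1.13, 1.35, 1.15, 1.27, 1.17, 1.19, 1.28, 1.10, 1.22])

strike, gamma = 1.20, 0.95
Y = np.maximum(strike - p4, 0.0)                   # payoff at time 4 (table 9.3)

X = np.column_stack([np.ones_like(p2), p2, p3, p3 ** 2])
theta, *_ = np.linalg.lstsq(X, Y, rcond=None)      # (theta_0, ..., theta_3)

V3_bar   = X @ theta                               # approximate value of holding until time 4
exercise = np.maximum(strike - p3, 0.0)            # payoff if we exercise at time 3
hold     = gamma * V3_bar                          # discounted value of holding
decision = np.where(exercise > hold, "exercise", "hold")   # one decision per sample path
```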
Figure 9.6: Some tic-tac-toe boards. 9.6a gives our indexing scheme, and 9.6b is a sample board.
9.4.2 Playing "lose tic-tac-toe"
The game of “lose tic-tac-toe” is the same as the familiar game of tic-tac-toe, with the exception that now you are trying to make the other person get three in a row. This nice twist on the popular children’s game provides the setting for our next use of regression methods in approximate dynamic programming. Unlike our exercise in pricing options, representing a tic-tac-toe board requires capturing a discrete state. Assume the cells in the board are numbered left to right, top to bottom as shown in figure 9.6a. Now consider the board in figure 9.6b. We can represent the state of the board after the tth play using:
$$a_{ti} = \begin{cases} 1 & \text{if cell } i \text{ contains an "X"} \\ 0 & \text{if cell } i \text{ is blank} \\ -1 & \text{if cell } i \text{ contains an "O"} \end{cases}$$
$$a_t = (a_{ti})_{i=1}^{9}$$
We see that this simple problem has up to $3^9 = 19{,}683$ states. While many of these states will never be visited, the number of possibilities is still quite large, and seems to overstate the complexity of the game (the state space is the same if we play the original version of tic-tac-toe).

We quickly realize that what is important about a game board is not the status of every cell as we have represented it. For example, rotating the board does not change a thing, but
it does represent a different state. Also, we tend to focus on strategies (early in the game, when it is more interesting) such as winning the center of the board or a corner. We might start defining variables (basis functions) such as:

$\phi_1(a_t)$ = 1 if there is an "X" in the center of the board, 0 otherwise.
$\phi_2(a_t)$ = the number of corner cells with an "X".
$\phi_3(a_t)$ = the number of instances of adjacent cells with an "X" (horizontally, vertically or diagonally).

There are, of course, numerous such functions we can devise, but it is unlikely that we could come up with more than a few dozen (if that) which appeared to be useful. It is important to realize that we do not need a value function to tell us to make obvious moves, such as blocking your opponent after he or she gets two in a row. Once we form our basis functions, our value function approximation is given by
$$\bar{V}_t(a_t) = \sum_{b \in \mathcal{B}} \theta_{tb}\, \phi_b(a_t)$$
We note that we have indexed the parameters by time t (the number of plays) since this is likely to play a role in determining the value of the feature being measured by a basis function. We estimate the parameters θ by playing the game (and following some policy) after which we see if we won or lost. We let Y n = 1 if we won the nth game, 0 otherwise. This also means that the value function is trying to approximate the probability of winning if we are in a particular state. We may play the game by using our value functions to help determine a policy. Another strategy, however, is simply to allow two people (ideally, experts) to play the game and use this to collect observations of states and game outcomes. This is an example of supervisory learning. If we lack a “supervisor” then we have to depend on simple strategies combined with the use of slowly learned value function approximations. In this case, we also have to recognize that in the early iterations, we are not going to have enough information to reliably estimate the coefficients for a large number of basis functions.
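As an illustration, the three basis functions described above might be computed from a board vector as follows (a minimal Python sketch; the board encoding follows the definition of $a_t$ and the cell numbering of figure 9.6a).

```python
# Basis functions for a board a stored as a length-9 list with
# +1 = "X", 0 = blank, -1 = "O", cells numbered left to right, top to bottom.
def basis_functions(a):
    center  = 1.0 if a[4] == 1 else 0.0                        # phi_1: "X" in the center cell
    corners = float(sum(a[i] == 1 for i in (0, 2, 6, 8)))      # phi_2: number of corners with an "X"
    pairs = [(0, 1), (1, 2), (3, 4), (4, 5), (6, 7), (7, 8),   # horizontally adjacent cells
             (0, 3), (3, 6), (1, 4), (4, 7), (2, 5), (5, 8),   # vertically adjacent cells
             (0, 4), (4, 8), (2, 4), (4, 6)]                   # diagonally adjacent cells
    adjacent = float(sum(a[i] == 1 and a[j] == 1 for i, j in pairs))   # phi_3
    return [center, corners, adjacent]

# The value function approximation is then the dot product of theta_t with basis_functions(a_t).
```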
9.5 Recursive methods for regression models
Using regression models to estimate a value function involves all the same tools and statistical issues that students would encounter in any course on regression. The only difference in dynamic programming is that our data is usually generated inside the algorithm, which means that we have the opportunity to update our regression model after every iteration. This is both an opportunity and a challenge. Traditional methods for estimating the parameters of a regression model involve either solving a system of linear equations or solving a nonlinear programming problem to find the best parameters. Both methods are
CHAPTER 9. VALUE FUNCTION APPROXIMATIONS
263
generally too slow in the context of dynamic programming. The remainder of this section describes some simple updating methods that have been used in the context of approximate dynamic programming.
9.5.1 Parameter estimation using a stochastic gradient algorithm
In our original representation, we effectively had a basis function for each state $a$, and the parameters were the values of being in each state, given by $\bar{v}_t(R_t)$. Our updating step was given by
$$\bar{v}_a^n = \bar{v}_a^{n-1} - \alpha_n\left(\bar{v}_a^{n-1} - \hat{v}_a^n\right)$$
This update is a step in the algorithm required to solve
$$\min_v\; E\left[\tfrac{1}{2}\left(v - \hat{v}\right)^2\right]$$
where $\hat{v}$ is a sample estimate of $V(a)$. When we parameterize the value function, we create a function that we can represent using $\bar{V}_a(\theta)$. We want to find $\theta$ that solves
$$\min_\theta\; E\left[\tfrac{1}{2}\left(\bar{V}_a(\theta) - \hat{v}\right)^2\right]$$
Applying our standard stochastic gradient algorithm, we obtain the updating step
$$\bar{\theta}^n = \bar{\theta}^{n-1} - \alpha_n\left(\bar{V}_a(\bar{\theta}^{n-1}) - \hat{v}(\omega^n)\right)\nabla_\theta \bar{V}_a(\bar{\theta}^{n-1}) = \bar{\theta}^{n-1} - \alpha_n\left(\bar{V}_a(\bar{\theta}^{n-1}) - \hat{v}(\omega^n)\right)\phi(a^n)$$
where φ(a) is a |B|-element vector.
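A minimal sketch of this update (in Python, with the feature map, stepsize and observed value supplied by the surrounding algorithm) looks as follows.

```python
# One stochastic gradient step for a linear-in-the-parameters value function.
import numpy as np

def sgd_update(theta, phi_a, v_hat, alpha_n):
    """theta^n = theta^{n-1} - alpha_n * (V_bar_a(theta^{n-1}) - v_hat) * phi(a^n)."""
    error = theta @ phi_a - v_hat      # V_bar_a(theta^{n-1}) - v_hat(omega^n)
    return theta - alpha_n * error * phi_a

# Usage (hypothetical names): theta = sgd_update(theta, features(a_n), v_hat_n, 1.0 / n)
```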
9.5.2 Recursive formulas for statistical estimation
In this section, we provide a primer on recursive estimation. We assume that we are using a linear-in-the-parameters model of the form:
$$y = \theta_0 + \sum_{i=1}^{I} \theta_i x_i + \varepsilon$$
Let $y^n$ be the $n$th observation of our dependent variable (what we are trying to predict), based on the observation $(x_1^n, x_2^n, \ldots, x_I^n)$ of our independent variables (the $x_i$ are equivalent to the
basis functions we used earlier). If we define $x_0 = 1$, we can define
$$x^n = \begin{pmatrix} x_0^n \\ x_1^n \\ \vdots \\ x_I^n \end{pmatrix}$$
to be an $(I+1)$-dimensional column vector of observations. Throughout this section, and unlike the rest of the book, we use traditional vector operations, where $x^T x$ is an inner product (producing a scalar) while $x x^T$ is an outer product, producing a matrix of cross products. Letting $\theta$ be the column vector of parameters, we can write our model as
$$y = \theta^T x + \varepsilon$$
We assume that the errors $(\varepsilon^1, \ldots, \varepsilon^n)$ are independent and identically distributed. We do not know the parameter vector $\theta$, so we replace it with an estimate $\bar{\theta}^n$, which gives us the predictive formula
$$\bar{y}^n = (\bar{\theta}^n)^T x^n$$
where $\bar{y}^n$ is our predictor of $y^n$. Our prediction error is
$$\hat{\varepsilon}^n = y^n - (\bar{\theta}^n)^T x^n$$
Our goal is to choose $\theta$ to minimize the mean squared error
$$\min_\theta\; \sum_{m=1}^{n} \left(y^m - \theta^T x^m\right)^2 \qquad (9.29)$$
It is well known that this can be solved very simply. Let $X^n$ be the $n$ by $I+1$ matrix
$$X^n = \begin{pmatrix} x_0^1 & x_1^1 & \cdots & x_I^1 \\ x_0^2 & x_1^2 & \cdots & x_I^2 \\ \vdots & \vdots & & \vdots \\ x_0^n & x_1^n & \cdots & x_I^n \end{pmatrix}$$
Next, denote the vector of observations of the dependent variable as
$$Y^n = \begin{pmatrix} y^1 \\ y^2 \\ \vdots \\ y^n \end{pmatrix}$$
The optimal parameter vector $\theta^n$ (after $n$ observations) is given by
$$\theta^n = \left[(X^n)^T X^n\right]^{-1} (X^n)^T Y^n \qquad (9.30)$$
Equation (9.30) is far too expensive to be useful in dynamic programming applications. Even for a relatively small number of parameters (which may not be that small), the matrix inverse is going to be too slow for most applications. Fortunately, it is possible to compute these formulas recursively. The updating equation for $\theta$ is
$$\theta^n = \theta^{n-1} - H^n \hat{\varepsilon}^n \qquad (9.31)$$
where $H^n$ is a column vector computed using
$$H^n = \frac{1}{\gamma^n} B^{n-1} x^n \qquad (9.32)$$
where $\gamma^n$ is a scalar and $B^{n-1}$ is an $I+1$ by $I+1$ matrix. $\gamma^n$ is computed using
$$\gamma^n = 1 + (x^n)^T B^{n-1} x^n \qquad (9.33)$$
while the matrix $B^n$ is updated recursively using
$$B^n = B^{n-1} - \frac{1}{\gamma^n}\left(B^{n-1} x^n (x^n)^T B^{n-1}\right) \qquad (9.34)$$
The derivation of equations (9.31)-(9.34) is given in section 9.6.3. Equation (9.31) has the feel of a stochastic gradient algorithm, but it has one significant difference: instead of using a typical stepsize, we have the vector $H^n$. In our dynamic programming applications, the observations $y^n$ will represent estimates of the value of being in a state, and our independent variables will be either the states of our system (if we are estimating the value of being in each state) or the basis functions, in which case we are estimating the coefficients of the basis functions. The equations assume implicitly that the estimates come from a stationary series.

There are many problems where the number of basis functions can be extremely large. In these cases, even the efficient recursive expressions in this section cannot avoid the fact that we are still updating a matrix where the number of rows and columns is the number of states (or basis functions). If we are only estimating a few dozen or a few hundred parameters, this can be fine. If the number of parameters extends into the thousands, even this strategy would probably bog down. It is very important for students to work out the approximate dimensionality of the matrices before using these methods.
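The sketch below (Python) implements equations (9.31)-(9.34) directly. The initialization of $B$ as a large multiple of the identity is an assumption added here (the text does not discuss initialization), and the residual is taken as prediction minus observation so that the minus sign in (9.31) moves the parameters toward the data.

```python
# Recursive least squares following equations (9.31)-(9.34).
import numpy as np

class RecursiveLeastSquares:
    def __init__(self, dim, b0=1e4):
        self.theta = np.zeros(dim)
        self.B = b0 * np.eye(dim)            # plays the role of [(X^n)^T X^n]^{-1}

    def update(self, x, y):
        x = np.asarray(x, dtype=float)
        gamma = 1.0 + x @ self.B @ x         # equation (9.33)
        H = (self.B @ x) / gamma             # equation (9.32)
        eps = self.theta @ x - y             # residual: prediction minus observation
        self.theta = self.theta - H * eps    # equation (9.31)
        self.B = self.B - np.outer(self.B @ x, x @ self.B) / gamma   # equation (9.34)
        return self.theta
```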
9.5.3 Recursive time-series estimation
We can use our recursive formulas to estimate a more general time-series model. At iteration $n$, let the elements of our basis function be given by
$$\phi^n = (\phi_b(R^n))_{b \in \mathcal{B}}$$
which is the observed value of each function given the observed state vector $R^n$. If we wished to include a constant term, we would define a basis function $\phi = 1$. Let $n_b = |\mathcal{B}|$ be the number of basis functions used to explain the value function, so $\phi^n$ is an $n_b$-dimensional column vector.

In our most general representation, we may feel that the value function should be explained over several iterations of inputs. Let
$$\phi^{(n)} = (\phi^1, \phi^2, \ldots, \phi^n)^T$$
be the history of observations of $\phi$ over the iterations. In addition to this information, we also have the history of observations of the updated values, which we represent using
$$\hat{v}^{(n)} = (\hat{v}^1, \hat{v}^2, \ldots, \hat{v}^n)^T$$
where each $\hat{v}^n$ is a scalar. Taken together, $\phi^{(n)}$ and $\hat{v}^{(n)}$ form our population of potential explanatory variables that we can use to help us predict $\hat{v}^{n+1}$.

The standard way of stating a model that uses information from previous observations is with a backshift operator. Let $q^{-1}$ be an operator that accesses information from the previous time period, as in
$$q^{-1} X^{(n)} = X^{(n-1)}$$
Now define two series of backshift operators:
$$\hat{A}(q^{-1}) = 1 + \hat{a}_1 q^{-1} + \hat{a}_2 q^{-2} + \cdots + \hat{a}_{n_o} q^{-n_o}$$
$$\hat{B}(q^{-1}) = \hat{b}_1 q^{-1} + \hat{b}_2 q^{-2} + \cdots + \hat{b}_{n_i} q^{-n_i}$$
where $n_o$ and $n_i$ are parameters that specify how many previous iterations of output and input vectors (respectively) will be used to predict the next iteration. We only use the previous $n_o$ observations of the value function observations $\hat{v}^n$ and the previous $n_i$ observations of the basis functions $\phi$. Using the backshift operators, we can write our model as
$$\hat{A}(q^{-1})\, \hat{v}^{(n)} = \hat{B}(q^{-1})\, \phi^{(n)} + \varepsilon^n$$
or, equivalently,
$$\hat{v}^n + \hat{a}_1 \hat{v}^{n-1} + \cdots + \hat{a}_{n_o} \hat{v}^{n-n_o} = \hat{b}_1 \phi^{n-1} + \cdots + \hat{b}_{n_i} \phi^{n-n_i} + \varepsilon^n$$
CHAPTER 9. VALUE FUNCTION APPROXIMATIONS
267
where $\varepsilon^n$ is a random noise term. Since the elements $(\phi^m)_{m=n-n_i}^{n-1}$ are $|\mathcal{B}|$-dimensional vectors, the coefficients $(\hat{b}_m)_{m=1}^{n_i}$ are also each $|\mathcal{B}|$-dimensional vectors of coefficients, each of which has to be estimated. We assume that $\hat{b}_m \phi^{n-m}$ represents an inner product between the two vectors.
Stated differently, we are trying to predict $\hat{v}^n$ using
$$\hat{v}^n = \hat{b}_1 \phi^{n-1} + \cdots + \hat{b}_{n_i} \phi^{n-n_i} - \left(\hat{a}_1 \hat{v}^{n-1} + \cdots + \hat{a}_{n_o} \hat{v}^{n-n_o}\right) + \varepsilon^n$$
Let
$$\theta = (\hat{b}_1, \ldots, \hat{b}_{n_i}, \hat{a}_1, \ldots, \hat{a}_{n_o})$$
be our vector of parameters, and
$$x^n = (\phi^{n-1}, \ldots, \phi^{n-n_i}, -\hat{v}^{n-1}, \ldots, -\hat{v}^{n-n_o})$$
Our prediction error is computed using
$$\hat{\varepsilon}^n = \hat{v}^n - (\bar{\theta}^{n-1})^T x^n$$
We could take the history of observations and find $\theta$ using standard algorithms for minimizing the variance. A more efficient strategy is to use the recursive equations given in section 9.5.2.

In practice, it is not clear how much history should be used when specifying a model. This will be problem dependent, and obviously the computational complexity will rise quickly as more history is added. It is likely that we would want to use only the current observations of the basis functions (that is, $n_i = 1$), but perhaps several observations from the past history of the actual values would capture biases and trends. The appeal of incorporating a history of past estimates of value functions is that it can be a mechanism for adjusting for the bias, which minimizes the need to tune a stepsize.
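A small sketch of how the regression vector $x^n$ might be assembled is given below (in Python, with illustrative choices $n_i = 1$ and $n_o = 2$); the resulting $(x^n, \hat{v}^n)$ pairs can then be fed to a recursive least-squares routine such as the one sketched in section 9.5.2.

```python
# Assembling the time-series regression vector x^n from recent history.
import numpy as np
from collections import deque

n_i, n_o = 1, 2                    # illustrative lag choices, not prescribed by the text
phi_hist  = deque(maxlen=n_i)      # previous basis-function vectors phi^{n-1}, ...
vhat_hist = deque(maxlen=n_o)      # previous value observations vhat^{n-1}, ...

def regression_vector(phi_hist, vhat_hist):
    """x^n = (phi^{n-1}, ..., phi^{n-n_i}, -vhat^{n-1}, ..., -vhat^{n-n_o})."""
    lagged_phi = list(reversed(phi_hist))                 # most recent lag first
    lagged_v = np.array([-v for v in reversed(vhat_hist)])
    return np.concatenate(lagged_phi + [lagged_v])

# Each iteration (once the history is full): build x_n, update theta with (x_n, vhat_n)
# by recursive least squares, then append phi_n and vhat_n to the deques.
```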
9.5.4 Estimation using multiple observations
The previous methods assume that we get one observation and use it to update the parameters. Another strategy is to sample several paths and solve a classical least-squares problem for estimating the parameters. In the simplest implementation, we would choose a set of realizations $\hat{\Omega}^n$ (rather than a single sample $\omega^n$) and follow all of them, producing a set of estimates $(\hat{v}^n(\omega))_{\omega \in \hat{\Omega}^n}$ that we can use to update the value function.

If we have a set of observations, we then face the classical problem of finding a vector of parameters $\hat{\theta}^n$ that best matches all of these value function estimates. Thus, we want to solve
$$\hat{\theta}^n = \arg\min_{\hat{\theta}}\; \frac{1}{|\hat{\Omega}^n|} \sum_{\omega \in \hat{\Omega}^n} \left(\bar{V}(\hat{\theta}) - \hat{v}^n(\omega)\right)^2$$
This is the standard parameter estimation problem faced in the statistical estimation community. If $\bar{V}(\theta)$ is linear in $\theta$, then we can use the usual formulas for linear regression. If the function is more general, we would typically resort to nonlinear programming algorithms to solve the problem. In either case, $\hat{\theta}^n$ is still an update that needs to be smoothed in with the previous estimate $\bar{\theta}^{n-1}$, which we would do using
$$\bar{\theta}^n = (1 - \alpha_n)\bar{\theta}^{n-1} + \alpha_n \hat{\theta}^n \qquad (9.35)$$
One advantage of this strategy is that, in contrast with updates that depend on the gradient of the value function, updates of the form given in equation (9.35) do not encounter a scaling problem, and therefore we return to our more familiar territory where $0 < \alpha_n \le 1$. Of course, as the sample size $|\hat{\Omega}^n|$ increases, the stepsize should also be increased because there is more information in $\hat{\theta}^n$.

The usefulness of this particular strategy will be very problem dependent. In many applications, the computational burden of producing multiple estimates $\hat{v}^n(\omega),\ \omega \in \hat{\Omega}^n$, before producing a parameter update will simply be too costly.
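A sketch of this batch strategy is given below (Python); the matrix of basis-function observations and the vector of value estimates are assumed inputs produced by simulating the sample paths in $\hat{\Omega}^n$.

```python
# Batch fit over the sampled outcomes, then smooth with equation (9.35).
import numpy as np

def batch_update(theta_bar, Phi, vhat, alpha_n):
    """Phi: one row of basis functions per outcome; vhat: the corresponding value estimates."""
    theta_hat, *_ = np.linalg.lstsq(Phi, vhat, rcond=None)      # least-squares estimate theta-hat^n
    return (1.0 - alpha_n) * theta_bar + alpha_n * theta_hat    # equation (9.35)
```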
9.6 Why does it work?*
9.6.1 Proof of Proposition 1
Proof: The second term on the right hand side of equation (9.41) can be further simplified using
$$\begin{aligned}
E\left[\left(\bar{\varepsilon}_a^{(g)}\right)^2\right] &= E\left[\left(\frac{1}{N_a^{(g,n)}} \sum_{n \in \mathcal{N}_a^{(g)}} \varepsilon^n\right)^2\right] \\
&= \frac{1}{\left(N_a^{(g,n)}\right)^2} \sum_{m \in \mathcal{N}_a^{(g)}} \sum_{n \in \mathcal{N}_a^{(g)}} E\left[\varepsilon^m \varepsilon^n\right] \\
&= \frac{1}{\left(N_a^{(g,n)}\right)^2} \sum_{n \in \mathcal{N}_a^{(g)}} E\left[(\varepsilon^n)^2\right] \\
&= \frac{1}{\left(N_a^{(g,n)}\right)^2}\, N_a^{(g,n)}\, \sigma_\varepsilon^2 \\
&= \frac{\sigma_\varepsilon^2}{N_a^{(g,n)}}, \qquad \forall\, g \in \mathcal{G} \qquad (9.36)
\end{aligned}$$
Combining equations (9.19), (9.41) and (9.36) gives us the result in equation (9.20).
9.6.2 Proof of Proposition 2
We start by defining
$$\bar{\delta}_a^{(g)} = \bar{\mu}_a^{(g)} + \bar{\varepsilon}_a^{(g)} \qquad (9.37)$$
Equation (9.37) gives us
$$\begin{aligned}
E\left[\bar{\delta}_a^{(g)} \bar{\delta}_a^{(g')}\right] &= E\left[\left(\bar{\mu}_a^{(g)} + \bar{\varepsilon}_a^{(g)}\right)\left(\bar{\mu}_a^{(g')} + \bar{\varepsilon}_a^{(g')}\right)\right] \\
&= E\left[\bar{\mu}_a^{(g)}\bar{\mu}_a^{(g')} + \bar{\mu}_a^{(g)}\bar{\varepsilon}_a^{(g')} + \bar{\mu}_a^{(g')}\bar{\varepsilon}_a^{(g)} + \bar{\varepsilon}_a^{(g)}\bar{\varepsilon}_a^{(g')}\right] \\
&= E\left[\bar{\mu}_a^{(g)}\bar{\mu}_a^{(g')}\right] + E\left[\bar{\mu}_a^{(g)}\bar{\varepsilon}_a^{(g')}\right] + E\left[\bar{\mu}_a^{(g')}\bar{\varepsilon}_a^{(g)}\right] + E\left[\bar{\varepsilon}_a^{(g)}\bar{\varepsilon}_a^{(g')}\right] \qquad (9.38)
\end{aligned}$$
We note that
$$E\left[\bar{\mu}_a^{(g')}\bar{\varepsilon}_a^{(g)}\right] = \bar{\mu}_a^{(g')}\, E\left[\bar{\varepsilon}_a^{(g)}\right] = 0$$
Similarly,
$$E\left[\bar{\mu}_a^{(g)}\bar{\varepsilon}_a^{(g')}\right] = 0$$
This allows us to write equation (9.38) as
$$E\left[\bar{\delta}_a^{(g)}\bar{\delta}_a^{(g')}\right] = E\left[\bar{\mu}_a^{(g)}\bar{\mu}_a^{(g')}\right] + E\left[\bar{\varepsilon}_a^{(g)}\bar{\varepsilon}_a^{(g')}\right] \qquad (9.39)$$
We start with the second term on the right hand side of equation (9.39). Using equation (9.3), this term can be written as
$$E\left[\bar{\varepsilon}_a^{(g)}\bar{\varepsilon}_a^{(g')}\right]
= \frac{N_a^{(g,n)}}{N_a^{(g')}}\, E\left[\bar{\varepsilon}_a^{(g)}\bar{\varepsilon}_a^{(g)}\right]
+ \frac{1}{N_a^{(g')}}\underbrace{E\Bigg[\bar{\varepsilon}_a^{(g)} \cdot \sum_{n \in \mathcal{N}_a^{(g')} \setminus \mathcal{N}_a^{(g)}} \varepsilon^n\Bigg]}_{I} \qquad (9.40)$$
The term $I$ can be rewritten using
$$E\Bigg[\bar{\varepsilon}_a^{(g)} \cdot \sum_{n \in \mathcal{N}_a^{(g')} \setminus \mathcal{N}_a^{(g)}} \varepsilon^n\Bigg]
= E\left[\bar{\varepsilon}_a^{(g)}\right]\, E\Bigg[\sum_{n \in \mathcal{N}_a^{(g')} \setminus \mathcal{N}_a^{(g)}} \varepsilon^n\Bigg] = 0$$
which means
$$E\left[\bar{\varepsilon}_a^{(g)}\bar{\varepsilon}_a^{(g')}\right] = \frac{N_a^{(g,n)}}{N_a^{(g')}}\, E\left[\left(\bar{\varepsilon}_a^{(g)}\right)^2\right] \qquad (9.41)$$
Combining (9.39) and (9.41) proves the proposition.
9.6.3 Derivation of the recursive estimation equations
Here we derive the recursive estimation equations given by equations (9.31)-(9.34). To begin, we note that the matrix $(X^n)^T X^n$ is an $I+1$ by $I+1$ matrix where the element for row $i$, column $j$ is given by
$$\left[(X^n)^T X^n\right]_{i,j} = \sum_{m=1}^{n} x_i^m x_j^m$$
This term can be computed recursively using
$$\left[(X^n)^T X^n\right]_{i,j} = \sum_{m=1}^{n-1} x_i^m x_j^m + x_i^n x_j^n$$
In matrix form, this can be written
$$(X^n)^T X^n = (X^{n-1})^T X^{n-1} + x^n (x^n)^T$$
Keeping in mind that $x^n$ is a column vector, $x^n(x^n)^T$ is an $I+1$ by $I+1$ matrix formed by the cross products of the elements of $x^n$. We now use the Sherman-Morrison formula (see section 9.6.4 for a derivation) for updating the inverse of a matrix,
$$\left[A + uu^T\right]^{-1} = A^{-1} - \frac{A^{-1} u u^T A^{-1}}{1 + u^T A^{-1} u} \qquad (9.42)$$
where $A$ is an invertible $n \times n$ matrix and $u$ is an $n$-dimensional column vector. Applying this formula to our problem, we obtain
$$\left[(X^n)^T X^n\right]^{-1} = \left[(X^{n-1})^T X^{n-1} + x^n (x^n)^T\right]^{-1}
= \left[(X^{n-1})^T X^{n-1}\right]^{-1} - \frac{\left[(X^{n-1})^T X^{n-1}\right]^{-1} x^n (x^n)^T \left[(X^{n-1})^T X^{n-1}\right]^{-1}}{1 + (x^n)^T \left[(X^{n-1})^T X^{n-1}\right]^{-1} x^n} \qquad (9.43)$$
The term $(X^n)^T Y^n$ can also be updated recursively using
$$(X^n)^T Y^n = (X^{n-1})^T Y^{n-1} + x^n y^n \qquad (9.44)$$
To simplify the notation let
$$B^n = \left[(X^n)^T X^n\right]^{-1}, \qquad \gamma^n = 1 + (x^n)^T B^{n-1} x^n$$
This simplifies our inverse updating equation (9.43) to
$$B^n = B^{n-1} - \frac{1}{\gamma^n}\left(B^{n-1} x^n (x^n)^T B^{n-1}\right)$$
Combining (9.30) with (9.43) and (9.44) gives
$$\begin{aligned}
\theta^n &= \left[(X^n)^T X^n\right]^{-1} (X^n)^T Y^n \\
&= \left[B^{n-1} - \frac{1}{\gamma^n}\left(B^{n-1} x^n (x^n)^T B^{n-1}\right)\right]\left[(X^{n-1})^T Y^{n-1} + x^n y^n\right] \\
&= B^{n-1} (X^{n-1})^T Y^{n-1} - \frac{1}{\gamma^n}\, B^{n-1} x^n (x^n)^T B^{n-1}\left[(X^{n-1})^T Y^{n-1} + x^n y^n\right] + B^{n-1} x^n y^n
\end{aligned}$$
We can start to simplify by using $\theta^{n-1} = B^{n-1}(X^{n-1})^T Y^{n-1}$. We also bring the last term $B^{n-1}x^n y^n$ inside the brackets by taking the coefficient $B^{n-1}x^n$ outside the brackets and multiplying the remaining $y^n$ by the scalar $\gamma^n = 1 + (x^n)^T B^{n-1} x^n$:
$$\theta^n = \theta^{n-1} - \frac{1}{\gamma^n}\left[B^{n-1} x^n (x^n)^T \left(B^{n-1}(X^{n-1})^T Y^{n-1}\right) + B^{n-1} x^n (x^n)^T B^{n-1} x^n y^n - \left(1 + (x^n)^T B^{n-1} x^n\right) B^{n-1} x^n y^n\right]$$
Again, we use $\theta^{n-1} = B^{n-1}(X^{n-1})^T Y^{n-1}$ and observe that the two terms involving $B^{n-1} x^n (x^n)^T B^{n-1} x^n y^n$ cancel, leaving
$$\theta^n = \theta^{n-1} - \frac{1}{\gamma^n}\, B^{n-1} x^n \left((x^n)^T \theta^{n-1} - y^n\right)$$
We note that $(x^n)^T \theta^{n-1}$ is our prediction of $y^n$ using the parameter vector from iteration $n-1$ and the explanatory variables $x^n$; $y^n$ is, of course, the actual observation, so the difference is our error, $\hat{\varepsilon}^n$. Let
$$H^n = \frac{1}{\gamma^n} B^{n-1} x^n$$
We can now write our updating equation as
$$\theta^n = \theta^{n-1} - H^n \hat{\varepsilon}^n \qquad (9.45)$$

9.6.4 The Sherman-Morrison updating formula
The Sherman-Morrison matrix updating formula (see Golub & Loan (1996)) assumes that we have a matrix $A$, and that we are going to update it with the outer product of the column vector $u$ to produce the matrix $B$:
$$B = A + uu^T \qquad (9.46)$$
Pre-multiplying by $B^{-1}$ and post-multiplying by $A^{-1}$ gives
$$A^{-1} = B^{-1} + B^{-1} u u^T A^{-1} \qquad (9.47)$$
Post-multiply by $u$:
$$A^{-1} u = B^{-1} u + B^{-1} u u^T A^{-1} u = B^{-1} u \left(1 + u^T A^{-1} u\right)$$
Note that $u^T A^{-1} u$ is a scalar. Dividing through by $1 + u^T A^{-1} u$ gives
$$\frac{A^{-1} u}{1 + u^T A^{-1} u} = B^{-1} u$$
Now post-multiply by $u^T A^{-1}$:
$$\frac{A^{-1} u u^T A^{-1}}{1 + u^T A^{-1} u} = B^{-1} u u^T A^{-1} \qquad (9.48)$$
Equation (9.47) gives us
$$B^{-1} u u^T A^{-1} = A^{-1} - B^{-1} \qquad (9.49)$$
Substituting (9.49) into (9.48) gives
$$\frac{A^{-1} u u^T A^{-1}}{1 + u^T A^{-1} u} = A^{-1} - B^{-1} \qquad (9.50)$$
Solving for $B^{-1}$ gives us
$$B^{-1} = \left[A + uu^T\right]^{-1} = A^{-1} - \frac{A^{-1} u u^T A^{-1}}{1 + u^T A^{-1} u}$$
which is the desired formula.
9.7 Bibliographic notes
Strategies for aggregation range from picking a fixed level of aggregation (Whitt (1978), Bean et al. (1987)) to using adaptive techniques that change the level of aggregation as the sampling process progresses (Bertsekas & Castanon (1989), Mendelssohn (1982), Bertsekas & Tsitsiklis (1996)), but which still use a fixed level of aggregation at any given time. Bounds on state/row aggregation are given in Zipkin (1980b) and Zipkin (1980a).

LeBlanc & Tibshirani (1996) outlines a general framework for combining a collection of general regression/classification fit vectors in order to obtain a better predictive model. The weights on the estimates from the individual predictors are computed by least squares minimization, stacked regression, generalized cross-validation and bootstrapping. Adaptive regression by mixing (Yang (1999)) assigns weights to candidate models that are combined after proper assessment of the performance of the estimators, with the aim of reducing instability. The weights for combining the models are obtained as functions of the distributions of the error estimates and the variance of the random errors.

Bayesian references: Bernardo & Smith (1994), George et al. (2003).
Exercises

9.1) In a spreadsheet, create a 4 × 4 grid where the cells are numbered 1, 2, . . . , 16 starting with the upper left hand corner and moving left to right, as shown below. We are going
 1   2   3   4
 5   6   7   8
 9  10  11  12
13  14  15  16
to treat each number in the cell as the mean of the observations drawn from that cell.
Now assume that if we observe a cell, we observe the mean plus a random variable that is uniformly distributed between -1 and +1. Next define a series of aggregations where aggregation 0 is the disaggregate level, aggregation 1 divides the grid into four 2 × 2 cells, and aggregation 2 aggregates everything into a single cell. After $n$ iterations, let $\bar{v}_a^{(g,n)}$ be the estimate of cell $a$ at the $g$th level of aggregation, and let
$$\bar{v}_a^n = \sum_{g \in \mathcal{G}} w_a^{(g)}\, \bar{v}_a^{(g,n)}$$
be your best estimate of cell $a$ using a weighted aggregation scheme. Compute an overall error measure using
$$(\bar{\sigma}^2)^n = \sum_{a \in \mathcal{A}} \left(\bar{v}_a^n - \nu_a\right)^2$$
where $\nu_a$ is the true value (taken from your grid) of being in cell $a$. Also let $w^{(g,n)}$ be the average weight after $n$ iterations given to aggregation level $g$ when averaged over all cells at that level of aggregation (for example, there is only one cell for $w^{(2,n)}$). Perform 1000 iterations where at each iteration you randomly sample a cell and measure it with noise. Update your estimates at each level of aggregation, and compute the variance of your estimate with and without the bias correction.

a) Plot $w^{(g,n)}$ for each of the three levels of aggregation at each iteration. Do the weights behave as you would expect? Explain.

b) For each level of aggregation, set the weight given to that level equal to one (in other words, use a single level of aggregation) and plot the overall error as a function of the number of iterations.

c) Add to your plot the average error when you use a weighted average, where the weights are determined by equation (9.5) without the bias correction.

d) Finally, add to your plot the average error when you use a weighted average, but now determine the weights by equation (9.7), which uses the bias correction.

e) Repeat the above assuming that the noise is uniformly distributed between -5 and +5.

9.2) Prove equation 9.6.

9.3) Show that the vector $H^n$ in the recursive updating formula from equation (9.45),
$$\theta^n = \theta^{n-1} - H^n \hat{\varepsilon}^n,$$
reduces to $H^n = 1/n$ for the case of a single parameter.
Chapter 10

The exploration vs. exploitation problem

A fundamental challenge with approximate dynamic programming is that our ability to estimate a value function may require that we visit states just to estimate the value of being in the state. Should we make a decision because we think it is the best decision (based on our current estimate of the values of states the decision may take us to), or do we make a decision just to try something new? This is a decision we face in day-to-day life, so it is not surprising that we face this problem in our algorithms. This choice is known in the approximate dynamic programming literature as the "exploration vs. exploitation" problem. Do we make a decision to explore a state? Or do we "exploit" our current estimates of downstream values to make what we think is the best possible decision? It can cost time and money to visit a state, so we have to weigh that cost against the value of the information we gain in terms of improving future decisions.

Intertwined with this question is the challenge of learning. When we visit a state, what did we learn? In some problems, we obtain nothing more than another observation of the value of being in the state. But in many applications, we can use our experience of visiting one state to improve what we know about other states. When this ability is included, it can change our strategy.
10.1 A learning exercise: the nomadic trucker
A nice illustration of the explore vs. exploit problem is provided by our nomadic trucker example. Assume that the only attribute of our nomadic trucker is his location. Thus,
$a = \{i\}$, where $i \in \mathcal{I}$ is a location. At any location, we have two types of choices:

$\mathcal{D}_a^e$ = the set of locations to which a driver with attribute $a$ can move empty,
$\mathcal{D}_a^l$ = the customer orders that are available to a driver with attribute $a$,
$\mathcal{D}_a = \mathcal{D}_a^e \cup \mathcal{D}_a^l$.

The set $\mathcal{D}_a^l$ is random. As the driver arrives to location $i$, he sees a set of customer orders that are drawn from a probability distribution. The driver may choose to serve one of these orders, thereby earning a positive revenue, or he may choose to move empty to another location (that is, he may choose a decision $d \in \mathcal{D}_a^e$). Included in the set $\mathcal{D}_a^e$, where $a = i$, is location $i$ itself, representing a decision to stay in the same location for another time period. The set $\mathcal{D}_a^l$ may be empty. Each decision earns a contribution $c_{ad}$ which is positive if $d \in \mathcal{D}^l$ and negative or zero if $d \in \mathcal{D}^e$.

If the driver has attribute $a^n$ at iteration $n$, he observes a sample realization of the orders $\mathcal{D}^l(\omega^n)$ and then makes his next decision $d^n$ by solving
$$d^n = \arg\max_{d \in \mathcal{D}(\omega^n)} \left(c_{a^n,d} + \gamma \bar{V}^{n-1}\!\left(a^M(a^n, d)\right)\right)$$
Here, $a^M(a, d)$ tells us the destination that results from making a decision, and $\bar{V}^{n-1}$ is our estimate of the value of being in this state. After making a decision, we compute
$$\hat{v}_a^n = c_{a^n d^n} + \gamma \bar{V}^{n-1}\!\left(a^M(a^n, d^n)\right)$$
and then update our value function using
$$\bar{V}^n(a) = \begin{cases} (1 - \alpha_n)\bar{V}^{n-1}(a) + \alpha_n \hat{v}_a^n & \text{if } a = a^M(a^n, d^n), \\ \bar{V}^{n-1}(a) & \text{otherwise.} \end{cases}$$
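As an illustration, the pure exploitation loop described by these equations can be sketched as follows (Python). The locations, contributions and order samples are hypothetical placeholders; only the decision rule and the value-function update follow the equations above.

```python
# A minimal sketch of the nomadic trucker under a pure exploitation policy.
import random

n_locations, gamma, alpha, n_iters = 50, 0.80, 0.1, 500
V = [0.0] * n_locations                       # low initial estimates of V_bar
a = 0                                         # starting location

for n in range(n_iters):
    choices = random.sample(range(n_locations), 5) + [a]      # sampled orders plus "stay put"
    c = {j: (random.uniform(100, 500) if j != a else -50.0) for j in choices}  # placeholder contributions
    best = max(choices, key=lambda j: c[j] + gamma * V[j])    # pure exploitation decision
    v_hat = c[best] + gamma * V[best]
    V[best] = (1 - alpha) * V[best] + alpha * v_hat           # update V_bar at the state we move to
    a = best
```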
We start by initializing the value of being in each location to zero, and use a pure exploitation strategy. If we simulate 500 iterations of this process, we produce the pattern shown in figure 10.1. Here, the circles at each location are proportional to the value $\bar{V}^{500}(a)$ of being in that location. The small circles indicate places where the trucker never visited. Out of 50 cities, our trucker has ended up visiting nine.

An alternative strategy is to initialize $\bar{V}^0(a)$ to a large number. For our illustration, where rewards tend to average several hundred dollars per iteration (we are using a discount factor of 0.80), we might initialize the value function to $2000, which is higher than we would expect the optimal solution to be. Using the same strategy, visiting a state generally produces a reduction in the estimate of the value of being in the state. Not surprisingly, the logic tends to favor visiting locations we have never visited before (or have visited the least).
Figure 10.1: Using a pure exploitation strategy and low initial estimates, the nomadic trucker becomes stuck in a local solution, visiting only a handful of cities.

The resulting behavior is shown in figure 10.2. Here, the pattern of lines shows that after 500 iterations, we have visited almost every city.

How do these strategies compare? We also ran an experiment where we estimated the value functions by using a pure exploration strategy, where we ran five iterations of sampling every single location. Then, for all three methods of estimating the value function, we simulated the policy produced by these value functions for 200 iterations. The results are shown in figure 10.3. The results show that for this example, pure exploitation with a high initial estimate for the value function works better than when we use a low initial estimate, but estimating the value functions using a pure exploration strategy works best of all. Furthermore, the differences are fairly substantial.
10.2 Learning strategies
Much of the challenge of estimating a value function is identical to that facing any statistician trying to fit a model to data. The biggest difference is that in dynamic programming, we may choose what data to collect by controlling what states to visit. Further complicating the problem is that it takes time (and may cost money) to visit these states to collect the information. Do we take the time to visit the state, and better learn the value of being in the state? Or do we live with what we know?
Figure 10.2: Using a pure exploitation strategy and high initial estimates, the nomadic trucker tries to visit everywhere.
[Figure 10.3 about here: bar chart of the profit per day produced by the three strategies, labeled "Low initial", "High initial", and "Explore all".]
Figure 10.3: Expected value of policies from pure exploitation with low initial value functions, pure exploitation with high initial value function, and pure exploration.
Below we review several simple strategies, any of which can be effective for specific problem classes.
10.2.1 Pure exploration
Here, we use an exogenous process (such as random selection) to choose either a state to visit, or an action (which leads to a state). Once in a state, we sample information and obtain an estimate of the value of being in the state, which is then used to update our estimate. In a pure exploration strategy, we can guarantee that we visit every state, or at least have a chance of visiting every state. We need to remember that some problems have $10^{100}$ states or more, so even if we run a million iterations, we may sample only a fraction of the complete state space. But at least we sample a broad range of the state space. The amount of exploration we undertake depends in large part on the cost of collecting the information (how much time does it take to run each iteration) and the acceptable level of errors. The problem with a pure exploration strategy is that we may only be interested in a very tiny fraction of a large state space.
10.2.2 Pure exploitation
A pure exploitation strategy assumes that we have to make decisions by solving
$$x_t^n = \arg\max_{x_t \in \mathcal{X}_t} \left(C(R_t, x_t) + \bar{V}_t^{n-1}(R_t(x_t))\right)$$
Some authors refer to this as a greedy strategy, since we are doing the best that we think we can given what we know. A pure exploitation strategy may be needed for practical reasons. For example, consider a large resource allocation problem where we are managing a resource vector $R_t = (R_{ta})_{a \in \mathcal{A}}$, which we act on with a decision vector $x_t = (x_{tad})_{a \in \mathcal{A}, d \in \mathcal{D}}$. For some applications in transportation, the dimensionality of $R_t$ may be in the thousands, while $x_t$ may be in the tens of thousands. For problems of this size, randomly choosing an action, or even a state, means that we are sampling a tiny fraction of the state or action space, even if we run millions of iterations (very unlikely for problems of this size). For such problems, exploration can be pointless. Furthermore, an exploitation policy avoids visiting states that are unreachable or truly suboptimal.

The problem with pure exploitation is that it is quite easy to become stuck in a local solution simply because we have poor estimates of the value of being in some states. While it is easy to construct small problems where this problem is serious, the errors can be substantial on virtually any problem that lacks specific structure that can be exploited to ensure convergence. As a rule, optimal solutions are not available for large problems, so we have to be satisfied with doing the best we can do. But just because your algorithm appears
to have converged, do not fool yourself into believing that you have reached an optimal, or even near-optimal, solution.
10.2.3 Mixed exploration and exploitation
A common strategy is to mix exploration and exploitation. We might specify an exploration rate $\rho$, where $\rho$ is the fraction of iterations in which decisions should be chosen at random (exploration). The intuitive appeal of this approach is that we maintain a certain degree of forced exploration, while the exploitation steps focus attention on the states that appear to be the most valuable. This strategy is particularly popular for proofs of convergence because it guarantees that, in the limit, all (reachable) states will be visited infinitely often. This property is then used to prove that estimates will reach their true values.

In practice, using a mix of exploration steps only adds value for problems with relatively small state or action spaces. The only exception arises when the problem lends itself to an approximation which is characterized by a relatively small number of parameters.
10.2.4 Boltzmann exploration
The problem with exploration steps is that you are choosing a decision $d \in \mathcal{D}$ at random. Sometimes this means that you are choosing really poor decisions from which you learn nothing of value. An alternative is Boltzmann exploration, where from state $S$ a decision $d$ is chosen with a probability that increases with the estimated value of the decision. For example, let $Q(S, d) = R(S, d) + \bar{V}^n(S, d)$ be the value of choosing decision $d$ when we are in state $S$. Using Boltzmann exploration, we would choose decision $d$ with probability
$$P(S, d) = \frac{e^{Q(S,d)/T}}{\sum_{d' \in \mathcal{D}} e^{Q(S,d')/T}} \qquad (10.1)$$
$T$ is known as the temperature, since in physical systems electrons at high temperatures are more likely to bounce from one state to another. As the parameter $T$ increases, the probability of choosing different decisions becomes more uniform. As $T \rightarrow 0$, the probability of choosing the best decision approaches 1.0. It makes sense to start with $T$ relatively large and steadily decrease it as the algorithm progresses.

Boltzmann exploration provides for a more elegant choice of decision. Decisions which appear to be of lower value are selected with a lower probability. We focus our energy on the decisions that appear to be the most beneficial, but provide for intelligent exploration.
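A minimal sketch of sampling a decision according to equation (10.1) is shown below (Python); the Q-values and the temperature schedule are assumed to be supplied by the surrounding algorithm.

```python
# Boltzmann (softmax) exploration: sample a decision with probability
# proportional to exp(Q(S,d)/T), as in equation (10.1).
import math, random

def boltzmann_choice(decisions, Q, T):
    """decisions: list of decisions d; Q: dict d -> Q(S,d); T: temperature."""
    q_max = max(Q[d] for d in decisions)                       # subtract the max for numerical stability
    weights = [math.exp((Q[d] - q_max) / T) for d in decisions]
    return random.choices(decisions, weights=weights, k=1)[0]

# Example usage: start with a large T (near-uniform choices) and let it decay,
# e.g. T_n = T_0 / (1 + n), so the policy approaches pure exploitation.
```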
[Figure 10.4 about here: percentage of the optimal solution (roughly 70 to 100) versus the iteration number for the pure exploration and pure exploitation policies.]
Figure 10.4: Pure exploration outperforms pure exploitation initially, but slows as the iterations progress.
10.2.5 Remarks
The tradeoff between exploration and exploitation is nicely illustrated in figure 10.4, where we are estimating the value of being in each state for a small problem with a few dozen states. For this problem, we are able to compute the exact value function, which allows us to compute the value of a policy using the approximate value function as a percentage of the optimal. This graph nicely shows that pure exploration has a much faster initial rate of convergence, whereas the pure exploitation policy works better as the function becomes more accurate.

This behavior, however, is very problem dependent. The value of any exploration strategy drops as the number of parameters increases. If a mixed strategy is used, the best fraction of exploration iterations is problem dependent, and may be difficult to ascertain without access to an optimal solution. Tests on smaller, computationally tractable problems (where exploration is more useful) will not tell us the right balance for larger problems.

Consider, for example, the problem of allocating hundreds (or thousands) of different types of assets, which can be described by a resource state vector $R_t$ with hundreds (or thousands) of dimensions. There may be 10 to 100 different types of decisions with which we can act on each asset class, producing a decision vector $x_t$ with thousands or even tens of thousands of dimensions. The state space may be $10^{10}$ or more, with an even larger action space. Choosing actions (or states) at random for exploration purposes, in an algorithm where we are running thousands (or tens of thousands) of iterations, means we are sampling at most a tiny fraction of the states (and these are only being sampled once or twice).
For such large problems, an exploration strategy is going to be of little value unless we can exploit a significant amount of structure. At the same time, a pure exploitation strategy is very likely to become stuck in a local solution that may be of poor quality.
10.3 A simple information acquisition problem
Consider the situation of a company selling a product at a price $p_t$ during time period $t$. Assume that production costs are negligible, and that the company wants to sell the product at a price that maximizes revenue. Let $p^*$ be this price, which is unknown. Further assume that the lost revenue (per unit sold) is approximated by $\beta(p_t - p^*)^2$, which, of course, can only be computed if we actually knew the optimal price. In any given time period (e.g. a month) the company may conduct market research at a cost per unit sold of $c$ (assume the company continues to sell the product during this time). When the company conducts a market research study, it obtains an imperfect estimate of the optimal price, which we denote $\hat{p}_t = p^* + \varepsilon_t$, where $E\varepsilon = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$.

Let $x_t = 1$ if the company conducts a market research study during time period $t$, and 0 otherwise. We assume that our ability to estimate the correct price is independent of our pricing policy. For this reason, the market research strategy, captured by $x = (x_t)_t$, is independent of the actual observations (and is therefore deterministic). Our goal is to minimize expected costs (lost revenue plus marketing costs) per unit over a finite horizon $t = 1, 2, \ldots, T$.

Since each market research study gives us an unbiased estimate of the true optimal price, it makes sense for us to set our price to be the average over all the market research studies. Let
$$n_t = \sum_{t'=1}^{t} x_{t'}$$
be the number of market research studies we have performed up to (and including) time $t$. Thus
$$p_t = \begin{cases} \frac{n_t - 1}{n_t}\, p_{t-1} + \frac{1}{n_t}\, \hat{p}_t & \text{if } x_t = 1, \\ p_{t-1} & \text{otherwise.} \end{cases}$$
$p_t$ is an unbiased estimate of $p^*$ with variance
$$\bar{\sigma}_t^2 = \frac{\sigma^2}{n_t}$$
where we assume for simplicity that $\sigma^2$ is known. We note that our lost revenue function was conveniently chosen so that
$$E\left[\beta(p_t - p^*)^2\right] = \beta \bar{\sigma}_t^2$$
Since our decisions xt are independent of the state of our system, we can formulate the optimization problem for choosing x as follows:
$$\min_x\; F(x) = E\left[\sum_{t=1}^{T} \left(\beta(p_t - p^*)^2 + c\, x_t\right)\right] = \sum_{t=1}^{T} \left(\beta \bar{\sigma}_t^2 + c\, x_t\right)$$
We use the intuitive result (which the reader is expected to prove in the exercises) that we should perform market research for $\mu$ time periods and then stop. This means that $x_t = 1$ for $t = 1, 2, \ldots, \mu$ and $x_t = 0$ for $t > \mu$, which also implies that $n_t = t$ for $t \le \mu$. Using this behavior, we may simplify $F(x)$ to
$$F(x, \mu) = \sum_{t=1}^{\mu} \left(\beta\frac{\sigma^2}{t} + c\, x_t\right) + \sum_{t=\mu+1}^{T} \beta\frac{\sigma^2}{\mu}$$
We can solve this easily if we treat time as continuous, which allows us to write $F(x, \mu)$ as
$$\begin{aligned}
F(x, \mu) &= \int_{t=1}^{\mu} \left(\beta\frac{\sigma^2}{t} + c\right) dt + \int_{t=\mu}^{T} \beta\frac{\sigma^2}{\mu}\, dt \\
&= \left[\beta\sigma^2 \ln t + c\, t\right]_{1}^{\mu} + \beta\frac{\sigma^2}{\mu}\left[t\right]_{\mu}^{T} \\
&= \beta\sigma^2 \ln \mu + c(\mu - 1) + \beta\frac{\sigma^2}{\mu}(T - \mu)
\end{aligned}$$
Differentiating with respect to $\mu$ and setting the result equal to zero gives
$$\frac{\partial F(x, \mu)}{\partial \mu} = \beta\sigma^2 \frac{1}{\mu} + c - \beta\sigma^2 \frac{T}{\mu^2} = 0$$
Finding the optimal point $\mu^*$ at which to stop collecting information requires solving
$$c\mu^2 + \beta\sigma^2 \mu - \beta\sigma^2 T = 0$$
Applying the familiar solution to quadratic equations, and recognizing that we are interested in a positive solution, gives
$$\mu = \frac{-\beta\sigma^2 + \sqrt{(\beta\sigma^2)^2 + 4c\beta\sigma^2 T}}{2c}$$
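As a quick numerical check, the positive root above is easy to evaluate; the sketch below (Python) uses purely illustrative parameter values, which are not taken from the text.

```python
# Optimal number of information-collection periods from the quadratic above.
import math

def optimal_research_horizon(beta, sigma2, c, T):
    """Positive root of c*mu^2 + beta*sigma2*mu - beta*sigma2*T = 0."""
    b = beta * sigma2
    return (-b + math.sqrt(b * b + 4.0 * c * b * T)) / (2.0 * c)

print(optimal_research_horizon(beta=1.0, sigma2=4.0, c=0.5, T=100))   # grows with sigma2, beta and T
```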
We see from this expression that the amount of time we should be collecting information increases with σ 2 , β and T , and decreases with c, as we would expect. If there is no noise (σ 2 = 0), then we should not collect any information. Most importantly, it highlights the concept that there is an optimal strategy for collecting information, and that we should collect more information when our level of uncertainty is higher. The next section extends this basic idea to a more general (but still restrictive) class of problems.
10.4 Gittins indices and the information acquisition problem
For the most part, the balance of exploration and exploitation is ad hoc, problem dependent and highly experimental. There is, however, one body of theory that offers some very important insights into how to best make the tradeoff between exploring and exploiting. This theory is often referred to as multiarmed bandits which is the name given to the underlying mathematical model, or Gittins indices which refers to the elegant method for solving the problem.
10.4.1 Foundations
Consider the problem faced by a gambler playing a set of slot machines (“one-armed bandits”) in a casino. Now pretend that the probability of winning is different for each slot machine, but we do not know what these probabilities are. We can, however, obtain information about the probabilities by playing a machine and watching the outcomes. Because our observations are random, the best we can do is obtain statistical estimates of the probabilities, but as we play a machine more, the quality of our estimates improves. Since we are looking at a set of slot machines, the problem is referred to as the multiarmed bandit problem. This is a pure exercise in information acquisition, since after every round, our player is faced with the same set of choices. Contrast this situation with most dynamic programs which involve allocating an asset where making a decision changes the attribute (state) of the asset. In the multiarmed bandit problem, after every round the player faces the same decisions with the same rewards. All that has changed is what she knows about the system. This problem, which is extremely important in approximate dynamic programming, provides a nice illustration of what might be called the knowledge state (or information state). The difference between the state of the resource (in this case, the player) and the state of what we know has confused authors since Bellman first encountered the issue. The vast majority of papers in dynamic programming implicitly assume that the state variable is the state of the resource. This is precisely the reason that our presentation in chapter 3 adopted the term “resource state” to be clear about what we were referring to. In our multiarmed bandit problem, let Wi be the random variable that gives the amount
that we win if we play the $i$th bandit. Most of our presentation assumes that $W_i$ is normally distributed, with true mean $\theta_i$ (which is unknown) and variance $\sigma_i^2$ (which we may assume is known or unknown). Now let $(\bar{\theta}_i^n, \bar{\sigma}_i^2)$ be our estimates of the mean and variance of $W_i$ after $n$ iterations. Under our assumption of normality, the mean and variance completely determine the distribution.

We next need to specify our transition equations. When we were managing physical assets, we used equations such as $R_{t+1} = [R_t + x_t - D_{t+1}]^+$ to capture the quantity of assets available. In our bandit problem, we have to show how the estimates of the parameters of the distribution evolve over time. Now let $x_i^n = 1$ if we play the $i$th slot machine during the $n$th round, and let $W_i^n$ be the amount that we win in this round. Also let
$$N_i^n = \sum_{m=1}^{n} x_i^m$$
be the total number of times we have sampled the $i$th machine. Since the observations $(W_i^{n'})_{n'=1}^{n}$ come from the same distribution, the best estimate of the mean is a simple average, which can be computed recursively using
$$\bar{\theta}_i^n = \begin{cases} \frac{N_i^n - 1}{N_i^n}\, \bar{\theta}_i^{n-1} + \frac{1}{N_i^n}\, W_i^n & \text{if } x_i^n = 1, \\ \bar{\theta}_i^{n-1} & \text{otherwise.} \end{cases} \qquad (10.2)$$
Similarly, we would estimate the variance of $W_i$ using
$$(\hat{\sigma}_i^2)^n = \begin{cases} \frac{N_i^n - 2}{N_i^n - 1}\, (\hat{\sigma}_i^2)^{n-1} + \frac{1}{N_i^n}\left(W_i^n - \bar{\theta}_i^{n-1}\right)^2 & \text{if } x_i^n = 1, \\ (\hat{\sigma}_i^2)^{n-1} & \text{otherwise.} \end{cases} \qquad (10.3)$$
We are more interested in the variance of $\bar{\theta}_i^n$, which is given by
$$(\bar{\sigma}_i^2)^n = \frac{1}{N_i^n}\, (\hat{\sigma}_i^2)^n$$
The apparent discrepancy in the stepsizes between (10.3) and (10.2) arises because of the small-sample adjustment for variances when the mean is unknown. One challenge in using (10.3) to estimate the variance, especially for larger problems, is that the number of observations $N_i^n$ may be quite small, and often zero or 1. A reasonable approximation may be to assume (at least initially) that the variance is the same across the slot machines. In this case, we could estimate a single population variance using
$$(\hat{\sigma}^2)^n = \frac{n - 2}{n - 1}\, (\hat{\sigma}^2)^{n-1} + \frac{1}{n}\left(W_i^n - \bar{\theta}_i^{n-1}\right)^2$$
which is updated after every play. The variance of $\bar{\theta}_i^n$ would then be given by
$$(\bar{\sigma}_i^2)^n = \frac{1}{N_i^n}\, (\hat{\sigma}^2)^n$$
Even if significant differences are suspected between different choices, it is probably a good idea to use a single population variance unless $N_i^n$ is at least 10.

Under the assumption of normality, $S^n = (\bar{\theta}^n, \hat{\sigma}^n, N^n)$ is our state variable, where equations (10.2) and (10.3) represent our transition function. We do not have a resource state variable because our "resource" (the player) is always able to play the same machines after every round, without affecting the reward structure. Some authors (including Bellman) refer to $(\bar{\theta}^n, \bar{\sigma}^n)$ as the hyperstate, but given our definitions (see section 3.6), this is a classic state variable since it captures everything we need to know to model the future evolution of our system.

Given this model, it would appear that we have a classic dynamic program. We have a $2|\mathcal{I}|$-dimensional state variable, which also happens to be continuous. Even if we could model $\bar{\theta}^n$ and $\bar{\sigma}^n$ as discrete, we have a multidimensional state variable with all the computational challenges this entails. In a landmark paper (Gittins & Jones (1974)), it was shown that this problem could be solved as a series of one-dimensional problems using an index policy. That is, it is possible to compute a number $\nu_i$ for each bandit $i$, using information about only this bandit. It is then optimal to choose which bandit to play next by simply finding the largest $\nu_i$ over all $i \in \mathcal{I}$. This is known as an index policy, and the values $\nu_i$ are widely known as Gittins indices.
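A sketch of the updating equations (10.2)-(10.3) for the machine that was just played is given below (Python). Skipping the variance update until a machine has been played at least twice is an assumption added here, since the small-sample formula is not defined before then.

```python
# Recursive updates of the estimates for the machine that was just played.
def update_machine(theta_bar, sigma2_hat, N, W):
    """Return the updated (theta_bar, sigma2_hat, N) after observing winnings W."""
    N += 1                                                   # N now plays the role of N_i^n
    theta_new = ((N - 1) / N) * theta_bar + (1.0 / N) * W    # equation (10.2)
    if N >= 2:                                               # variance needs at least two observations
        sigma2_hat = ((N - 2) / (N - 1)) * sigma2_hat + (1.0 / N) * (W - theta_bar) ** 2   # equation (10.3)
    return theta_new, sigma2_hat, N

# The variance of the mean estimate is then sigma2_hat / N; machines not played are left unchanged.
```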
10.4.2 Basic theory of Gittins indices
Assume we face the choice of playing a single slot machine, or stopping and converting to a process that pays a reward $\rho$ in each time period until infinity. If we choose to stop sampling and accept the fixed reward, the total future reward is $\rho/(1 - \gamma)$. Alternatively, if we play the slot machine, we not only win a random amount $W$, we also learn something about the parameter $\theta$ that characterizes the distribution of $W$ (for our presentation, $EW = \theta$, but $\theta$ could be a vector of parameters that characterizes the distribution of $W$). $\bar{\theta}^n$ represents our state variable, and the optimality equations are
$$V(\bar{\theta}^n | \rho) = \max\left\{\rho + \gamma V(\bar{\theta}^n | \rho),\; C(\bar{\theta}^n) + \gamma E\left[V(\bar{\theta}^{n+1} | \rho) \,\middle|\, \bar{\theta}^n\right]\right\} \qquad (10.4)$$
where we have written the value function to express the dependence on $\rho$. $C(\bar{\theta}^n) = EW$ is our expected reward given our estimate $\bar{\theta}^n$. Since we have an infinite horizon problem, the value function must satisfy the optimality equations
$$V(\bar{\theta} | \rho) = \max\left\{\rho + \gamma V(\bar{\theta} | \rho),\; C(\bar{\theta}) + \gamma E\left[V(\bar{\theta}' | \rho) \,\middle|\, \bar{\theta}\right]\right\}$$
where $\bar{\theta}'$ is defined by equation (10.2). It can be shown that if we choose to stop sampling in iteration $n$ and accept the fixed payment $\rho$, then that is the optimal strategy for all future rounds. This means that starting at iteration $n$, our optimal future payoff (once we have decided to accept the fixed payment) is
$$V(\bar{\theta} | \rho) = \rho + \gamma\rho + \gamma^2\rho + \cdots = \frac{\rho}{1 - \gamma}$$
which means that we can write our optimality recursion in the form
$$V(\bar{\theta}^n | \rho) = \max\left\{\frac{\rho}{1 - \gamma},\; C(\bar{\theta}^n) + \gamma E\left[V(\bar{\theta}^{n+1} | \rho) \,\middle|\, \bar{\theta}^n\right]\right\} \qquad (10.5)$$
Now for the magic of Gittins indices. Let $\nu$ be the value of $\rho$ which makes the two terms in the brackets in (10.5) equal. That is,
$$\frac{\nu}{1 - \gamma} = C(\bar{\theta}) + \gamma E\left[V(\bar{\theta}' | \nu) \,\middle|\, \bar{\theta}\right] \qquad (10.6)$$
$\nu$ depends on our current estimate of the mean, $\bar{\theta}$, the estimate of the variance, $\bar{\sigma}^2$, and the number of observations $n$ we have made of the process. We express this dependence by writing the index as $\nu(\bar{\theta}, \bar{\sigma}^2, n)$.

Now assume that we have a family of slot machines $\mathcal{I}$, and let $\nu_i(\bar{\theta}_i, \bar{\sigma}_i^2, N_i^n)$ be the value of $\nu(\bar{\theta}_i, \bar{\sigma}_i^2, N_i^n)$ that we compute for each slot machine $i \in \mathcal{I}$, where $N_i^n$ is the number of times we have played slot machine $i$ by iteration $n$. An optimal policy for selecting slot machines is to choose the slot machine with the highest value of $\nu_i(\bar{\theta}_i, \bar{\sigma}_i^2, N_i^n)$. Such policies are known as index policies, and for this problem, the parameters $\nu_i(\bar{\theta}_i, \bar{\sigma}_i^2, N_i^n)$ are widely known as Gittins indices.

The computation of Gittins indices highlights a subtle issue when computing expectations for information-collection problems. The proper computation of the expectation required to solve the optimality equations requires, in theory, knowledge of exactly the distribution that we are trying to estimate. To illustrate, the expected winnings are given by $C(\bar{\theta}^n) = EW = \theta$, but $\theta$ is unknown. Instead, we adopt a Bayesian approach in which our expectation is computed with respect to the distribution we believe to be true. Thus, at iteration $n$ we believe that our winnings are normally distributed with mean $\bar{\theta}^n$, so we would use $C(\bar{\theta}^n) = \bar{\theta}^n$. The term $E\left[V(\bar{\theta}^{n+1}|\rho) \,\middle|\, \bar{\theta}^n\right]$ captures what we believe the effect of observing $W^{n+1}$ will have on our estimate $\bar{\theta}^{n+1}$, but this belief is based on what we think the distribution of $W^{n+1}$ is, rather than the true distribution.

The beauty of Gittins indices (or any index policy) is that they reduce an $N$-dimensional problem into a series of one-dimensional problems. The problem is that solving equation (10.5) (or equivalently, (10.6)) offers its own challenges. Finding $\nu(\bar{\theta}, \bar{\sigma}^2, n)$ requires solving the optimality equation in (10.5) for different values of $\rho$ until (10.6) is satisfied. In addition, this has to be done for different values of $\bar{\theta}$ and $n$. Although algorithmic procedures have been designed for this, they are not simple.
10.4.3 Gittins indices for normally distributed rewards
The calculation of Gittins indices is simplified for special classes of distributions. In this section, we consider the case where the observations of rewards $W$ are normally distributed. Students learn in their first statistics course that normally distributed random variables satisfy a nice property: if $Z$ is normally distributed with mean 0 and variance 1, and if $X = \mu + \sigma Z$, then $X$ is normally distributed with mean $\mu$ and variance $\sigma^2$. This property simplifies what are otherwise difficult calculations about probabilities of events. For example, computing $\mathrm{Prob}[X \ge x]$ is difficult because the normal density function cannot be integrated analytically; instead, we have to resort to numerical procedures. But because of the translation and scaling properties of normally distributed random variables, we can perform the difficult computations for the random variable $Z$ (the "standard normal deviate") and use them to answer questions about any normally distributed random variable $X$. For example, we can write
$$\mathrm{Prob}[X \ge x] = \mathrm{Prob}\left[\frac{X - \mu}{\sigma} \ge \frac{x - \mu}{\sigma}\right] = \mathrm{Prob}\left[Z \ge \frac{x - \mu}{\sigma}\right]$$
Thus, the ability to answer probability questions about $Z$ allows us to answer the same questions about any normally distributed random variable. The same property applies to Gittins indices. Although the proof requires some development, it is possible to show that
$$\nu(\bar{\theta}, \bar{\sigma}^2, n) = \bar{\theta} + \bar{\sigma}\, \nu(0, 1, n)$$
Thus, we have only to compute a "standard normal Gittins index" for problems with mean 0 and variance 1, and $n$ observations. Unfortunately, as of this writing, there do not exist easy-to-use software utilities for computing standard Gittins indices. The situation is similar to doing statistics before computers, when students had to look up the cumulative distribution of the standard normal deviate in the back of a statistics book. Table 10.1 is exactly such a table for Gittins indices. The table gives indices for both the parameters-known and parameters-unknown cases. In the parameters-known case, we assume that $\sigma^2$ is given, which allows us to estimate the variance of the estimate for a particular slot machine just by dividing by the number of observations.

Given access to a table of values, applying Gittins indices becomes quite simple.
                              Discount factor
                  Known variance          Unknown variance
Observations      0.95      0.99           0.95       0.99
      1          0.9956    1.5758            -          -
      2          0.6343    1.0415         10.1410    39.3343
      3          0.4781    0.8061          1.1656     3.1020
      4          0.3878    0.6677          0.6193     1.3428
      5          0.3281    0.5747          0.4478     0.9052
      6          0.2853    0.5072          0.3590     0.7054
      7          0.2528    0.4554          0.3035     0.5901
      8          0.2274    0.4144          0.2645     0.5123
      9          0.2069    0.3808          0.2353     0.4556
     10          0.1899    0.3528          0.2123     0.4119
     20          0.1058    0.2094          0.1109     0.2230
     30          0.0739    0.1520          0.0761     0.1579
     40          0.0570    0.1202          0.0582     0.1235
     50          0.0464    0.0998          0.0472     0.1019
     60          0.0392    0.0855          0.0397     0.0870
     70          0.0339    0.0749          0.0343     0.0760
     80          0.0299    0.0667          0.0302     0.0675
     90          0.0267    0.0602          0.0269     0.0608
    100          0.0242    0.0549          0.0244     0.0554
Table 10.1: Gittins indices for the case of observations that are normally distributed with mean 0 and variance 1, from Gittins (1989).

Instead of choosing the option with the highest θ̄_i^n (which we would do if we were ignoring the value of collecting information), we choose the option with the highest value of

    θ̄_i^n + σ̄_i^n ν(0, 1, N_i^n)

This strategy is attractive because it is simple to apply and does not require using the device of a pure exploration step. As we have pointed out, for large state and action spaces, exploration steps are of little value when the number of iterations being run is much smaller than the size of the state space. Using Gittins indices allows us to use a modified exploitation strategy, where the choice of decision requires adding the term σ̄_i^n ν(0, 1, N_i^n) to the value of being in a state. Since the indices ν(0, 1, N_i^n) decline naturally to zero (along with the standard deviation σ̄_i^n), in the limit we have a pure exploitation strategy.

Perhaps the most useful insight from the multiarmed bandit problem is that it illustrates the role that uncertainty plays in the exploration process. We have to strike a balance between choosing what appears to be the best option and what might be the best option. If an option has a somewhat lower estimated value, but its variance is so high that its upper tail exceeds the upper tail of another option, then it is something we should explore. How far we go out on the upper tail depends on the number of observations and the discount factor. As the discount factor approaches 1.00, the value of exploring goes up.
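To make the mechanics concrete, the following is a minimal sketch (not part of the original development; the function names and the nearest-entry lookup rule are assumptions made for illustration) of how the index θ̄_i^n + σ̄_i^n ν(0, 1, N_i^n) might be applied, using the known-variance, 0.95 discount-factor column of Table 10.1 as the standard Gittins indices.

    import math

    # Standard Gittins indices nu(0, 1, n) for a discount factor of 0.95,
    # taken from the known-variance column of Table 10.1.
    NU_095 = {1: 0.9956, 2: 0.6343, 3: 0.4781, 4: 0.3878, 5: 0.3281,
              6: 0.2853, 7: 0.2528, 8: 0.2274, 9: 0.2069, 10: 0.1899,
              20: 0.1058, 30: 0.0739, 40: 0.0570, 50: 0.0464,
              60: 0.0392, 70: 0.0339, 80: 0.0299, 90: 0.0267, 100: 0.0242}

    def nu(n):
        """Look up nu(0, 1, n), falling back to the nearest tabulated n."""
        if n in NU_095:
            return NU_095[n]
        closest = min(NU_095, key=lambda k: abs(k - n))
        return NU_095[closest]

    def gittins_choice(theta_bar, sigma_bar, counts):
        """Choose the slot machine with the largest index
        theta_bar[i] + sigma_bar[i] * nu(0, 1, counts[i])."""
        best, best_index = None, -math.inf
        for i in range(len(theta_bar)):
            index = theta_bar[i] + sigma_bar[i] * nu(max(counts[i], 1))
            if index > best_index:
                best, best_index = i, index
        return best

    # Example: three machines with the same estimated mean but different
    # uncertainty; the least-sampled (most uncertain) machine is chosen.
    print(gittins_choice([10.0, 10.0, 10.0], [2.0, 0.5, 1.0], [2, 50, 10]))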
10.4.4  Gittins exploration
We have to remind ourselves that Gittins indices work only for multiarmed bandit problems. This is a very special problem class, since at each iteration we face exactly the same set of choices. In addition, while our understanding of the value of each choice changes, the actual flow of rewards is the same from one iteration to the next. Not surprisingly, this describes a very small set of dynamic programs.

Consider now a more general problem where an asset with attribute a_t, after making decision d ∈ D_a, becomes an asset with attribute a'_t, where we face options d' ∈ D_{a'}. Consider the challenge of deciding which of two decisions d_1, d_2 ∈ D_a is better. Decision d_1 produces an asset with attribute a^M(a, d_1) = a'_1. Let V̄^n(a'_1) be our estimate of the value of the asset at this time. Similarly, decision d_2 produces an asset with attribute a^M(a, d_2) = a'_2 and estimated value V̄^n(a'_2). Finally, let c_{ad} be the immediate contribution of the decision. If we were using a pure exploitation policy, we would choose the decision that maximizes c_{ad} + V̄^n(a^M(a, d)). The development of Gittins indices suggests that we could apply the idea heuristically as follows:

    d = arg max_{d ∈ D_a} [ c_{ad} + V̄^n(a^M(a, d)) + σ̄^n(a^M(a, d)) ν(n) ]        (10.7)
where σ̄^n(a^M(a, d)) is our estimate of the standard deviation of V̄(a^M(a, d)), and ν(n) tells us how many standard deviations away from the mean we should consider (which is a function of the number of observations n). For this more general problem, we do not have any theory that tells us what ν(n) should be. Since strategies for balancing exploration and exploitation are largely heuristic in nature, it seems reasonable to simply adopt a heuristic rule for ν(n). An analysis of the exact Gittins indices suggests that we might use

    ν(n) = ρ^G / √n        (10.8)

where ρ^G is a parameter we have to choose using calibration experiments. The presence of √n in the denominator reflects the observation that the Gittins indices drop approximately with the square root of the number of observations. We note that we use (10.7) only to decide which state to visit next. To update the value of being in state a, we still use

    v̂_a^n = max_{d ∈ D_a} [ c_{ad} + V̄^n(a^M(a, d)) ]
We refer to the policy of using equation (10.7) to decide what state should be visited next as a Gittins exploration strategy. The attraction of a Gittins exploration policy is that it does not depend on randomly sampling states or actions. This means we may be able to use it even when our decision is a vector xt acting on a resource vector Rt , both of which may have hundreds or thousands of dimensions. A Gittins exploration strategy encourages us to visit states which might be attractive.
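As an illustration only (the function and parameter names below are not from the text), a Gittins exploration step based on (10.7) and (10.8) might look like the following sketch, which separates the decision used to choose the next state from the value used to update the current state.

    import math

    def gittins_exploration_step(contributions, value_bar, sigma_bar, n_visits, rho_g=10.0):
        """contributions[d]: immediate contribution c_ad;
        value_bar[d]: current estimate of the downstream value V_bar(a^M(a, d));
        sigma_bar[d]: estimated standard deviation of that estimate;
        n_visits: number of observations used so far.
        Returns the decision to follow and the value v_hat used to update
        the value of being in the current state."""
        nu_n = rho_g / math.sqrt(max(n_visits, 1))          # equation (10.8)
        # Decision: exploitation term plus an uncertainty bonus (equation (10.7)).
        decision = max(range(len(contributions)),
                       key=lambda d: contributions[d] + value_bar[d]
                                     + sigma_bar[d] * nu_n)
        # Value update: pure exploitation value, without the bonus.
        v_hat = max(contributions[d] + value_bar[d]
                    for d in range(len(contributions)))
        return decision, v_hat

    # Example with two decisions: d = 1 has a slightly lower estimate but a
    # much larger uncertainty, so it is chosen for exploration.
    print(gittins_exploration_step([5.0, 4.0], [20.0, 19.0], [0.5, 6.0], n_visits=4))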
Experiments with this idea using the nomadic trucker problem quickly demonstrated that the biggest challenge is estimating σ̄^n(a^M(a, d)). Simply estimating the variance of the estimate of the value of being in a state does not accurately capture the spread of the errors between the true value function (which we are able to compute for this problem) and the approximate one. It was not unusual to find that the exact value function for a state was 10 to 30 standard deviations away from our estimate. The problem is that there is a compounding effect of errors in the value function, especially when we use a pure exploitation strategy. For this reason, we found it necessary to try values of ρ^G that produce values of ν(n) much larger than what would be found in the Gittins tables.

Figure 10.5 shows the percent error in the value function obtained using a Gittins exploration strategy for both a single-attribute and a multiattribute nomadic trucker problem. For the single-attribute problem (the attribute is simply the location of the truck, of which there are only 50), a pure exploration strategy produced the most accurate value functions (figure 10.5a). This outperformed a Gittins exploration policy even with ρ^G = 500. By contrast, a Gittins exploration strategy on a multiattribute problem (with three attributes and an attribute space of over 1,000), using ρ^G ∈ {5, 10, 20, 50}, significantly outperformed both pure exploration and pure exploitation policies (figure 10.5b). For this specific example, intermediate values of ρ^G produced the best results, with ρ^G = 5 and ρ^G = 20 producing nearly equivalent results. However, the results for the single-attribute problem suggest that this is not a generalizable result.

In practice, we cannot count on having access to the optimal value function. Instead, we may have to take an approximate value function and simulate decisions using this approximation. Using this approach, we obtain an estimate of the value of a policy. Figure 10.6 shows the value of the policies produced by the value function approximations found using the exploration strategies given in figure 10.5. Figure 10.6a shows that for the single-attribute problem, the high-quality value function produced by the exploration strategy also produced the best policies. For the multiattribute problem, the better value functions produced by the Gittins exploration policies also translated into better policies, but the differences were less noticeable. Pure exploration produces the worst policies initially, but eventually catches up. Pure exploitation starts well, but tails off in later iterations. All the Gittins exploration policies perform reasonably well throughout the range tested.
10.5  Why does it work?**
This section is under construction. Planned topics include an optimal weighting strategy for hierarchical aggregation and, possibly, Tsitsiklis' "short proof" of Gittins indices.
10.6  Bibliographic notes
Bandit processes: Weber (1992), Whittle (1982)
[Figure 10.5a: Effect of Gittins exploration on a single attribute nomadic trucker problem - percentage error from the optimal versus number of iterations, for pure exploitation, Gittins exploration with ρ^G = 20, 50, 100, 500, and pure exploration.]

[Figure 10.5b: Effect of Gittins exploration on a multiattribute nomadic trucker problem - percentage error from the optimal versus number of iterations, for pure exploitation, Gittins exploration with ρ^G = 5, 10, 20, 50, and pure exploration.]

Figure 10.5: The heuristic application of Gittins to a multiattribute asset management problem produces more accurate value functions than either pure exploitation or pure exploration policies.

Q-learning for bandit processes: Duff & Barto (2003), Duff (1995), Berry & Fristedt (1985)
[Figure 10.6a: The value of the policies for the single attribute nomadic trucker - average policy value versus number of iterations.]

[Figure 10.6b: The value of the policies for the multiattribute nomadic trucker - average policy value versus number of iterations.]

Figure 10.6: The value of the policy produced by the approximate value functions created using different exploration policies.
Gittins indices: Gittins (1979), Gittins (1981), Lai & Robbins (1985), Gittins & Jones (1974), Gittins (1989)
Exercises

10.1) Joe Torre, manager of the Yankees (the greatest baseball team in the country), has to struggle with the constant game of guessing who his best hitters are. The problem is that he can only observe a hitter if he puts him in the order. He has four batters that he is looking at. The table below shows their actual batting averages (that is to say, batter 1 will produce hits 30 percent of the time, batter 2 will get hits 32 percent of the time, and so on). Unfortunately, Joe doesn't know these numbers. As far as he is concerned, these are all .300 hitters. For each at bat, Joe has to pick one of these hitters to hit. The table shows what would have happened if each batter were given a chance to hit (1 = hit, 0 = out). Again, Joe does not get to see all these numbers. He only gets to observe the outcome of the hitter who gets to hit. Assume that Joe always lets the batter with the best estimated batting average hit, and that he uses .300 as the initial estimate of each batter's average (in case of a tie, use batter 1 over batter 2 over batter 3 over batter 4). Whenever a batter gets to hit, calculate a new batting average by putting an 80 percent weight on your previous estimate of his average plus a 20 percent weight on how he did for his at bat. So, according to this logic, you would choose batter 1 first. Since he does not get a hit, his updated average would be 0.80(.300) + .20(0) = .240. For the next at bat, you would choose batter 2 because your estimate of his average is still .300, while your estimate for batter 1 is now .240. After 10 at bats, who would you conclude is your best batter? Comment on the limitations of this way of choosing the best batter. Do you have a better idea? (It would be nice if it were practical.)
Day    Batter A (.300)   Batter B (.320)   Batter C (.280)   Batter D (.260)
  1           0                 1                 1                 1
  2           1                 0                 0                 0
  3           0                 0                 0                 0
  4           1                 1                 1                 1
  5           1                 1                 0                 0
  6           0                 0                 0                 0
  7           0                 0                 1                 0
  8           1                 0                 0                 0
  9           0                 1                 0                 0
 10           0                 1                 0                 1

(The numbers in parentheses are the actual batting averages, which Joe does not know.)
10.2) There are four paths you can take to get to your new job. On the map, they all seem reasonable, and as far as you can tell, they all take 20 minutes, but the actual times vary quite a bit. The value of taking a path is your current estimate of the travel time on that path. In the table below, we show the travel time on each path if you had travelled that path. Start with an initial estimate of 20 minutes for each path, and break ties by choosing the lowest numbered path. At each iteration, take the path with the best estimated value, and update your estimate of the value of that path based on your experience. After 10 iterations, compare your estimates of each path to the estimate you obtain by averaging the "observations" for each path over all 10 days. How well did you do?

Day    Path 1   Path 2   Path 3   Path 4
  1      37       29       17       23
  2      32       32       23       17
  3      35       26       28       17
  4      30       35       19       32
  5      28       25       21       26
  6      24       19       25       31
  7      26       37       33       30
  8      28       22       28       27
  9      24       28       31       30
 10      33       29       17       29
10.3) We are going to try again to solve our asset selling problem. We assume we are holding a real asset and we are responding to a series of offers. Let p̂_t be the t-th offer, which is uniformly distributed between 500 and 600 (all prices are in thousands of dollars). We also assume that each offer is independent of all prior offers. You are willing to consider up to 10 offers, and your goal is to get the highest possible price. If you have not accepted the first nine offers, you must accept the 10th offer.

a) Write out the decision function you would use in an approximate dynamic programming algorithm in terms of a Monte Carlo sample of the latest price and a current estimate of the value function approximation.

b) Write out the updating equations (for the value function) you would use after solving the decision problem for the t-th offer.

c) Implement an approximate dynamic programming algorithm using synchronous state sampling. Using 100 iterations, write out your estimates of the value of being in each state immediately after each offer.

d) From your value functions, infer a decision rule of the form "sell if the price is greater than p̄_t."
Chapter 11

Value function approximations for resource allocation

In chapter 9, we focused on estimating the value of being in a discrete state, a problem that we posed in terms of managing a single asset. In this chapter, we turn our attention to the challenge of estimating the value of being in a state when we are managing multiple assets or asset classes. We assume throughout this chapter that we have a resource vector R_t where the number of dimensions is "not too large." Practically speaking, R_t may have hundreds or even thousands of dimensions, but problems with more than 10,000 dimensions tend to be computationally very difficult using the hardware readily available as of this writing. If R_t is discrete, we may still be facing a state space of staggering size, but we are going to treat R_t as continuous and focus on separable approximations, or approximations where the number of parameters is of a manageable size.

We consider a series of approximation strategies of increasing sophistication:

Linear approximations - These are typically the simplest nontrivial approximations, and they work well when the functions are approximately linear over the range of interest. It is important to realize that we mean "linear in the resource state," as opposed to the more classical "linear in the parameters" model that we considered earlier.

Separable, piecewise linear, concave (convex if minimizing) approximations - These functions are especially useful when we are interested in integer solutions. Separable functions are relatively easy to estimate and offer special structural properties when solving the optimality equations.

Auxiliary functions - This is a special class of algorithms that fixes an initial approximation and uses stochastic gradients to tilt the function.

General nonlinear regression equations - Here, we bring the full range of tools available from the field of regression. These techniques can be used for more general problems than just approximating V(R), but we use this setting to illustrate them.
Ultimately, the challenge of estimating value functions can draw on the entire field of statistical estimation. Approximate dynamic programming introduces some unique challenges to the problem of statistically estimating value functions, but in the end, it all boils down to statistical estimation.
11.1  Value functions versus gradients
It is common in dynamic programming to talk about the problem of estimating the value of being in a state. In the arena of asset management, it is often the case that we are more interested in estimating the derivative of the function rather than the function itself. In principle, the challenge of estimating the slope of a function is the same as that of estimating the function itself (the slope is simply a different function). However, there can be important, practical advantages to estimating slopes. First, if the function is approximately linear, it may be possible to replace estimates of the parameter at each state (or set of states) with a single parameter which is the estimate of the slope of the function. Estimating constant terms is typically unnecessary.

A second and equally important difference is that if we estimate the value of being in a state, we get one estimate of the value of being in a state when we visit that state. When we estimate a gradient, we get an estimate of a gradient for each parameter. For example, if R_t = (R_{ta})_{a∈A} is our asset vector and V_t(R_t) is our value function, then the gradient of the value function with respect to R_t would look like

    ∇_{R_t} V_t(R_t) = ( v̂_{a_1}, v̂_{a_2}, . . . , v̂_{a_|A|} )^T

where

    v̂_{a_i} = ∂V_t(R_t) / ∂R_{t a_i}
There may be additional work required to obtain each element of the gradient, but the incremental work can be far less than the work required to get the value function itself. This is particularly true when the optimization problem naturally returns these gradients (for example, dual variables from a linear program), but this can even be true when we have to resort to numerical derivatives. Once we have all the calculations to solve a problem once, solving small perturbations can be very inexpensive. There is one important problem class where finding the value of being in a state, and finding the derivative, is equivalent. That is the case of managing a single asset (see section 3.12). In this case, the state of our system (the asset) is the attribute vector a, and we are
interested in estimating the value V(a) of our asset being in state a. Alternatively, we can represent the state of our system using the vector R_t, where R_{ta} = 1 indicates that our asset has attribute a (we assume that Σ_{a∈A} R_{ta} = 1). In this case, the value function can be written

    V_t(R_t) = Σ_{a∈A} v_{ta} R_{ta}
Here, the coefficient vta is the derivative of Vt (Rt ) with respect to Rta . In a typical implementation of an approximate dynamic programming algorithm, we would only estimate the value of an asset when it is in a particular state (given by the vector a). This is equivalent to finding the derivative vˆa only for the value of a where Rta = 1. By contrast, computing the gradient ∇Rt Vt (Rt ) implicitly assumes that we are computing vˆa for each a ∈ A. There are some algorithmic strategies (see, for example, section 14.6) where this assumption is implicit in the algorithm. Computing vˆa for all a ∈ A is reasonable if the attribute state space is not too large (for example, if a is a physical location among a set of several hundred locations). If a is a vector, then enumerating the attribute space can be prohibitive (it is, in effect, the “curse of dimensionality” revisited). Given these issues, it is critical to first determine whether it is necessary to estimate the slope of the value function, or the value function itself. The result can have a significant impact on the algorithmic strategy.
11.2  Linear approximations
There are a number of problems where we are allocating assets of different types. As in the past, we let a be the attributes of an asset and Rta be the quantity of assets with attribute a in our system at time t with Rt = (Rta )a∈A . Rt may describe our investments in different asset classes (growth stocks, value stocks, index funds, international mutual funds, domestic stock funds, bond funds). Or Rt might be the amount of oil we have in different reserves or the number of people in a management consulting firm with particular skill sets. We want to make decisions to acquire or drop assets of each type, and we want to capture the impact of decisions now on the future through a value function Vt (Rt ). Rather than attempt to estimate Vt (Rt ) for each value of Rt , it may make more sense to estimate a linear approximation of the value function with respect to the resource vector. Linear approximations can work well when the single-period contribution function is continuous and increases or decreases monotonically over the range we are interested in (the function may or may not be differentiable). They can also work well in settings where the value function increases or decreases monotonically, even if the value function is neither convex nor concave, nor even continuous.
To illustrate, consider the problem of purchasing a commodity. Let

    D_t = the random demand during time interval t,
    R_t = the commodities on hand at time t to be used during time interval t + 1,
    x_t = the quantity ordered to be used during time interval t + 1,
    p_t = the market price for selling commodities during time interval t,
    c_t = the purchase cost for commodities purchased at time t.

At time t, we know the price p_t and demand D_t for time interval t, but we have to choose how much to order for the next time interval. We can do this by solving

    V_t(R_t) = max_{x_t} E { p_{t+1} min{R_t + x_t, D_{t+1}} − c_t x_t + V_{t+1}(R_{t+1}(x_t)) }        (11.1)
where R_{t+1} = [R_t + x_t − D_{t+1}]^+. Now assume that we introduce a linear value function approximation:

    V̄_{t+1}(R_{t+1}) ≈ v̄_{t+1} R_{t+1}        (11.2)
The resulting approximation can be written

    Ṽ_t(R_t) = max_{x_t} E { p_{t+1} min{R_t + x_t, D_{t+1}} − c_t x_t + v̄_{t+1} R_{t+1} }
             = max_{x_t} E { p_{t+1} min{R_t + x_t, D_{t+1}} − c_t x_t + v̄_{t+1} [R_t + x_t − D_{t+1}]^+ }        (11.3)
We assume that we can compute, or at least approximate, the expectation in equation (11.3). If this is the case, we may approximate the gradient at iteration n using a numerical derivative, as in

    v̂_t = Ṽ_t(R_t + 1) − Ṽ_t(R_t)

We may either use v̂_t as the slope of the function (that is, v̄_t = v̂_t), or we may perform smoothing on v̂_t:

    v̄_t ← (1 − α) v̄_t + α v̂_t

Linear approximations are especially useful in the context of more complex problems (for example, those involving multiple asset types). The quality of the approximation depends on how much the slope changes as a function of R_t.
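A minimal sketch of this procedure is shown below (the Monte Carlo approximation of the expectation, the demand distribution, and all prices and names are assumptions chosen for illustration): we estimate the slope v̂_t by a numerical derivative of the approximate objective in (11.3) and then smooth it into v̄_t.

    import random

    def approx_value(R, v_bar_next, p_next=12.0, c=8.0, n_samples=2000, seed=1):
        """Monte Carlo approximation of the objective in (11.3) for a fixed R,
        maximizing the order quantity x over a small grid (a crude stand-in
        for the exact maximization)."""
        rng = random.Random(seed)
        demands = [rng.randint(0, 20) for _ in range(n_samples)]
        best = float("-inf")
        for x in range(0, 30):
            total = 0.0
            for D in demands:
                sales = p_next * min(R + x, D)
                leftover = max(R + x - D, 0)
                total += sales - c * x + v_bar_next * leftover
            best = max(best, total / n_samples)
        return best

    def update_slope(v_bar_t, R, v_bar_next, alpha=0.1):
        """Numerical derivative v_hat_t = V(R+1) - V(R), followed by smoothing."""
        v_hat_t = approx_value(R + 1, v_bar_next) - approx_value(R, v_bar_next)
        return (1 - alpha) * v_bar_t + alpha * v_hat_t

    print(update_slope(v_bar_t=3.0, R=5, v_bar_next=2.0))

Using the same random seed for both evaluations (common random numbers) keeps the numerical derivative from being swamped by sampling noise.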
11.3  Monotone function approximations
There are a number of settings in asset management where we can prove that a value function is increasing or decreasing in the state variable. These are referred to as monotone functions. If the function is increasing in the state variable, we might say that it is "monotonically increasing," or that it is isotone (although the latter term is not widely used). Assume we have such a function, by which we mean that while we do not know the value function exactly, we know that V(R + 1) ≤ V(R) (for scalar R). Next, assume our current approximation V̄^{n-1}(R) satisfies this property, and that we have a new estimate v̂^n for R = R^n. If we use our standard updating algorithm, we would write

    V̄^n(R^n) = (1 − α_n) V̄^{n-1}(R^n) + α_n v̂^n

After the update, it is quite possible that our updated approximation no longer satisfies our monotonicity property. One way to maintain monotonicity is through the use of a leveling algorithm, which works as follows:

    V̄^n(r) = (1 − α_n) V̄^{n-1}(R^n) + α_n v̂^n                          if r = R^n,
    V̄^n(r) = V̄^{n-1}(r) ∨ [ (1 − α_n) V̄^{n-1}(R^n) + α_n v̂^n ]        if r > R^n,        (11.4)
    V̄^n(r) = V̄^{n-1}(r) ∧ [ (1 − α_n) V̄^{n-1}(R^n) + α_n v̂^n ]        if r < R^n,

where x ∧ y = max{x, y} and x ∨ y = min{x, y}. Equation (11.4) starts by updating the slope V̄^n(r) for r = R^n. We then want to make sure that the slopes are declining. So, if we find a slope to the right that is larger, we simply bring it down to our estimated slope for r = R^n. Similarly, if a slope to the left is smaller, we simply raise it to the slope for r = R^n. The steps are illustrated in figure 11.1.

The leveling algorithm is easy to visualize, but it is unlikely to be the best way to maintain monotonicity. For example, we may update a value at r = R^n for which there are very few observations. But because it produces an unusually high or low estimate, we find ourselves simply forcing other slopes higher or lower just to maintain monotonicity. A more elegant strategy is the SPAR algorithm, which works as follows. Assume that we start with our original set of values (V̄^{n-1}(r))_{r≥0}, and that we sample r = R^n and obtain an estimate of the slope v̂^n. After the update, we obtain the set of values (which we store temporarily in the function ȳ^n(r)):

    ȳ^n(r) = (1 − α_n) V̄^{n-1}(R^n) + α_n v̂^n    if r = R^n,
    ȳ^n(r) = V̄^{n-1}(r)                            otherwise.        (11.5)
If ȳ^n(r) ≥ ȳ^n(r + 1) for all r, then we are in good shape. If not, then either ȳ^n(R^n) < ȳ^n(R^n + 1) or ȳ^n(R^n − 1) < ȳ^n(R^n). We can fix the problem by solving the following projection problem:
[Figure 11.1a: Initial monotone function.]

[Figure 11.1b: After update of a single segment.]

[Figure 11.1c: After leveling operation.]

Figure 11.1: Steps of the leveling algorithm. Figure 11.1a shows the initial monotone function, with the observed R and the observed value of the function v̂. Figure 11.1b shows the function after updating a single segment, producing a non-monotone function. Figure 11.1c shows the function after monotonicity is restored by leveling.
    min_v ‖v − ȳ^n‖^2        (11.6)

subject to

    v(r + 1) − v(r) ≤ 0        (11.7)
Solving this projection is especially easy. Imagine that after our update, we have a violation to the left. The projection is achieved by averaging the updated cell with all the cells to the left that create a monotonicity violation. This means that we want to find the largest i ≤ R^n such that

    ȳ^n(i − 1) ≥ (1 / (R^n − i + 1)) Σ_{r=i}^{R^n} ȳ^n(r)
In other words, we can start by averaging the values for R^n and R^n − 1 and checking to see if we now have a monotone function. If not, we keep lowering the left end of the range until we either restore monotonicity or reach r = 0. If our monotonicity violation is to the right, then we repeat the process to the right. The process is illustrated in figure 11.2. We start with a monotone set of values (a), then update one of the values to produce a monotonicity violation (b), and finally average the violating values together to restore monotonicity (c).
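The following sketch is an illustration only (it assumes the values are stored in a simple list indexed by r and kept monotonically decreasing; none of the function names come from the text). It implements the leveling update (11.4) and the averaging step used to restore monotonicity after an update at r = R^n.

    def leveling_update(v_bar, r_n, v_hat, alpha):
        """Leveling update (11.4) for a monotonically decreasing function."""
        v = list(v_bar)
        new_val = (1 - alpha) * v_bar[r_n] + alpha * v_hat
        v[r_n] = new_val
        for r in range(r_n + 1, len(v)):   # values to the right may not exceed new_val
            v[r] = min(v[r], new_val)
        for r in range(r_n):               # values to the left may not fall below it
            v[r] = max(v[r], new_val)
        return v

    def monotone_projection(y, r_n):
        """Restore decreasing values after updating y[r_n] by averaging the
        updated cell with the violating cells to its left or right."""
        v = list(y)
        # Violation to the left: expand the averaging block while its left
        # neighbor is smaller than the block average.
        i = r_n
        while i > 0 and v[i - 1] < sum(v[i:r_n + 1]) / (r_n - i + 1):
            i -= 1
        if i < r_n:
            avg = sum(v[i:r_n + 1]) / (r_n - i + 1)
            for r in range(i, r_n + 1):
                v[r] = avg
        # Violation to the right: the mirror-image procedure.
        j = r_n
        while j < len(v) - 1 and v[j + 1] > sum(v[r_n:j + 1]) / (j - r_n + 1):
            j += 1
        if j > r_n:
            avg = sum(v[r_n:j + 1]) / (j - r_n + 1)
            for r in range(r_n, j + 1):
                v[r] = avg
        return v

    y = [10.0, 9.0, 12.0, 6.0, 5.0]       # an update at r = 2 broke monotonicity
    print(monotone_projection(y, 2))       # the block [0..2] is averaged to about 10.33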
11.4  The SHAPE algorithm for continuously differentiable problems
A particularly simple algorithm for approximating value functions of continuous resources starts with an initial approximation and then "tilts" this function to improve the approximation. The concept is most effective if it is possible to build an initial approximation, perhaps using some simplifications, that produces a "pretty good" solution. The idea works as follows. Assume that we have access to some sort of initial approximation that we will call V̄_t^0(R), which we assume is continuously differentiable. Since we can choose this approximation, we can further assume that the derivatives are fairly easy to compute (for example, it might be a low-order polynomial). We also assume that we have access to a stochastic gradient v̂_t = ∇V(R_{t-1}^{n-1}, ω^n). So, we have an exact gradient of our approximate function and a stochastic gradient of the real function. We now update our approximate function using

    V̄_{t-1}^n(R_{t-1}) = V̄_{t-1}^{n-1}(R_{t-1}) + α_n [ ∇V(R_{t-1}^{n-1}, ω^n) − ∇V̄_{t-1}^{n-1}(R_{t-1}^{n-1}) ] R_{t-1}
[Figure 11.2a: Initial monotone function.]

[Figure 11.2b: After update of a single segment.]

[Figure 11.2c: After projection.]

Figure 11.2: Steps of the SPAR algorithm. Figure 11.2a shows the initial monotone function, with the observed R and the observed value of the function v̂. Figure 11.2b shows the function after updating a single segment, producing a non-monotone function. Figure 11.2c shows the function after the projection operation.
Step 0. Initialize V̄^0 and set n = 1.

Step 1. Sample R^n.

Step 2. Observe a sample of the value function, v̂^n.

Step 3. Calculate the vector y^n as follows:

    y^n(r) = (1 − α_n) V̄^{n-1}(R^n) + α_n v̂^n    if r = R^n,
    y^n(r) = V̄^{n-1}(r)                            otherwise.        (11.8)

Step 4. Project the updated estimate onto the space of monotone functions, v^n = Π(y^n), by solving (11.6)-(11.7). Increase n by one and go to Step 1.
Figure 11.3: The learning form of the separable, projective approximation routine (SPAR).

The first term on the right-hand side is our current functional approximation. The second term is a linear adjustment (note that the term in parentheses is a constant) that adds to the current approximation the difference between the stochastic gradient of the real function and the exact gradient of the current approximation. This linear adjustment has the effect of tilting the original approximation. As a result, this algorithm does not change the shape of the original approximation, but it does help to fix errors in the slope of the approximation. The steps of the SHAPE algorithm are illustrated in figure 11.4. The algorithm is provably convergent if V(x, W) and V̄^0(x) are continuously differentiable (see section 14.7.1), but it can be used as an approximation even when these conditions are not satisfied.

We can illustrate the SHAPE algorithm using a simple numerical example:

    max_{x≥0} E V(x, W) = E { (1/2) ln(x + W) − 2(x + W) }

where W represents random measurement error, which is normally distributed with mean 0 and variance 4. Now assume that we start with the concave approximation

    V̄^0(x) = 6√x − 2x        (11.9)
We begin by obtaining the initial solution x^0:

    x^0 = arg max_{x≥0} ( 6√x − 2x )

Note that our solution to the approximate problem may be unbounded, requiring us to impose artificial limits. Since our approximation is concave, we can set the derivative equal to zero:
[Figure 11.4a: True function and initial approximation.]

[Figure 11.4b: Difference between the stochastic gradient of the true function and the actual gradient of the approximation.]

[Figure 11.4c: Updated approximation.]

Figure 11.4: Illustration of the steps of the SHAPE algorithm.
    ∇V̄^0(x) = 3/√x − 2 = 0,

which gives us x^0 = 2.25. Since x^0 ≥ 0, it is optimal. To find the stochastic gradient, we have to sample the random variable W. Assume that W(ω^1) = 1.75. Our stochastic gradient is then

    ∇V(x^0, W(ω^1)) = 1 / (2(x^0 + W(ω^1))) = 1 / (2(2.25 + 1.75)) = 0.1250.
Thus, while we have found the optimal solution to the approximate problem (which produces a zero slope), our estimate of the slope of the true function is positive, so we update with the adjustment

    V̄^1(x) = 6√x − 2x + α_1 (0.1250 − 0) x.

With a stepsize of α_1 = 1, for example, this gives V̄^1(x) = 6√x − 1.875x.

This algorithm is provably convergent for two-stage problems even if the original approximation is something simple such as a separable polynomial. For example, we could use something as simple as

    V̄^0(R) = − Σ_{a∈A} (R̄_a − R_a)^2

where R̄_a is a centering term.

The SHAPE algorithm is incredibly simple, but it has seen little numerical work. It is likely to be more stable than a simple linear approximation, but the best results will be obtained when information about the problem can be used to develop an initial approximation that captures the structure of the real problem. Arbitrary approximations (such as R^2) are unlikely to add much value because they contain no information about the problem.
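The first iteration of the example above can be reproduced with a short sketch (an illustration only; for simplicity the sampled gradient below uses only the (1/2) ln(x + W) term of the objective, which matches the value 0.1250 computed above, and the stepsize α_1 = 1 is an assumption made for the example).

    import math

    def grad_initial_approx(x):
        """Exact gradient of the initial approximation V0(x) = 6*sqrt(x) - 2x."""
        return 3.0 / math.sqrt(x) - 2.0

    def stochastic_grad_true(x, w):
        """Sampled gradient of the (1/2) ln(x + W) term at a realized W."""
        return 1.0 / (2.0 * (x + w))

    def shape_tilt(x, w, alpha, current_tilt=0.0):
        """One SHAPE step: the approximation V0(x) + tilt * x is tilted by the
        difference between the sampled true gradient and the exact gradient of
        the current approximation, both evaluated at the current point x."""
        grad_approx = grad_initial_approx(x) + current_tilt
        return current_tilt + alpha * (stochastic_grad_true(x, w) - grad_approx)

    # Reproduce the first iteration of the example: x0 = 2.25 and W = 1.75.
    x0 = 2.25
    tilt1 = shape_tilt(x0, w=1.75, alpha=1.0)
    print(tilt1)   # 0.125, so V1(x) = 6*sqrt(x) - 2x + 0.125x = 6*sqrt(x) - 1.875x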
11.5  Regression methods
As in chapter 9, we can create regression models where the basis functions are manipulations of the number of resources of each type. For example, we might use

    V̄(R) = θ_0 + Σ_{a∈A} θ_{1a} R_a + Σ_{a∈A} θ_{2a} R_a^2        (11.10)
where θ = (θ_0, (θ_{1a})_{a∈A}, (θ_{2a})_{a∈A}) is a vector of parameters that are to be determined. The choice of explanatory terms in our approximation will generally reflect an understanding of the properties of our problem. For example, equation (11.10) assumes that we can use a mixture of linear and separable quadratic terms.

A more general representation is to assume that we have developed a family B of basis functions (φ_b(R))_{b∈B}. Examples of basis functions are

    φ_b(R) = R_{a_b}^2,
    φ_b(R) = ( Σ_{a∈A_b} R_a )^2    for some subset A_b,
    φ_b(R) = (R_{a_1} − R_{a_2})^2,
    φ_b(R) = |R_{a_1} − R_{a_2}|.

A common strategy is to capture the number of resources at some level of aggregation. For example, if we are purchasing emergency equipment, we may care about how many pieces we have in each region of the country, and we may also care about how many pieces of a type of equipment we have (regardless of location). These issues can be captured using a family of aggregation functions G_b, b ∈ B, where G_b(a) aggregates an attribute vector a into a space A^(b), and where for every basis function b there is an element a_b ∈ A^(b). Our basis function would then be expressed using

    φ_b(R) = Σ_{a∈A} 1_{ G_b(a) = a_b } R_a
As we originally introduced in section 9.4, the explanatory variables used in the examples above, which are generally referred to as independent variables in the regression literature, are typically referred to as basis functions in the approximate dynamic programming literature. A basis function can be linear, nonlinear separable, nonlinear nonseparable, and even nondifferentiable, although the nondifferentiable case will introduce additional technical issues. The challenge, of course, is that it is the responsibility of the modeler to devise these functions for each application. We have written our basis functions purely in terms of the resource vector, but it is possible for them to be written in terms of other parameters in a more complex state vector, such as asset prices. Given a set of basis functions, we can write our value function approximation as

    V̄(R|θ) = Σ_{b∈B} θ_b φ_b(R)        (11.11)
It is important to keep in mind that V̄(R|θ) (or, more generally, V̄(S|θ)) is any functional form that approximates the value function as a function of the state vector, parameterized by θ. Equation (11.11) is a classic linear-in-the-parameters function. We are not constrained to this form, but it is the simplest and offers some algorithmic shortcuts.
The issues that we encounter in formulating and estimating V̄(R|θ) are the same ones that any student of statistical regression would face when modeling a complex problem. The major difference is that our data arrive over time (iterations), and we have to update our formulas recursively. Also, it is typically the case that our observations are nonstationary. This is particularly true when an update of a value function depends on an approximation of the value function in the future (as occurs with value iteration or any of the TD(λ) classes of algorithms). When we are estimating parameters from nonstationary data, we do not want to weight all observations equally.

The problem of finding θ can be posed in terms of solving the following stochastic optimization problem:

    min_θ E [ (1/2) ( V̄(R|θ) − V̂ )^2 ]

We can solve this using a stochastic gradient algorithm, which produces updates of the form

    θ̄^n = θ̄^{n-1} − α_n ( V̄(R^n|θ̄^{n-1}) − V̂(ω^n) ) ∇_θ V̄(R^n|θ̄^{n-1})
         = θ̄^{n-1} − α_n ( V̄(R^n|θ̄^{n-1}) − V̂(ω^n) ) φ(R^n)

If our value function is linear in R_t, we would write

    V̄(R|θ) = Σ_{a∈A} θ_a R_a
In this case, our number of parameters has shrunk from the number of possible realizations of the entire vector R_t to the size of the attribute space (which, for some problems, can still be large, but nowhere near as large as the original state space). For this problem, φ(R^n) = R^n.

It is not necessarily the case that we will always want to use a linear-in-the-parameters model. We may consider a model where the value increases with the number of resources, but at a declining rate that we do not know. Such a model could be captured with the representation

    V̄(R|θ) = Σ_{a∈A} θ_{1a} R_a^{θ_{2a}}

where we expect θ_2 < 1 to produce a concave function. Now our updating formula will look like

    θ_1^n = θ_1^{n-1} − α_n ( V̄(R^n|θ̄^{n-1}) − V̂(ω^n) ) (R^n)^{θ_2}
    θ_2^n = θ_2^{n-1} − α_n ( V̄(R^n|θ̄^{n-1}) − V̂(ω^n) ) (R^n)^{θ_2} ln R^n

where we assume the exponentiation operator in (R^n)^{θ_2} is performed componentwise.
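A minimal sketch of the recursive stochastic gradient update for the linear-in-the-parameters model (11.11) is shown below (the particular basis functions, data, and stepsize rule are illustrative assumptions, not part of the text).

    def update_theta(theta, basis, R, v_hat, alpha):
        """One stochastic gradient step for V_bar(R|theta) = sum_b theta[b] * phi_b(R),
        where basis is a list of functions phi_b and v_hat is the sampled observation."""
        phi = [phi_b(R) for phi_b in basis]
        v_bar = sum(t * p for t, p in zip(theta, phi))
        error = v_bar - v_hat
        return [t - alpha * error * p for t, p in zip(theta, phi)]

    # Example: linear plus separable quadratic basis functions, as in (11.10),
    # for a two-attribute resource vector R = (R_1, R_2).
    basis = [lambda R: 1.0,                       # constant term theta_0
             lambda R: R[0], lambda R: R[1],      # linear terms
             lambda R: R[0] ** 2, lambda R: R[1] ** 2]   # separable quadratic terms

    theta = [0.0] * len(basis)
    observations = [((3.0, 1.0), 14.0), ((1.0, 2.0), 9.0), ((2.0, 2.0), 13.0)]
    for n, (R, v_hat) in enumerate(observations, start=1):
        theta = update_theta(theta, basis, R, v_hat, alpha=0.05 / n)
    print(theta)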
We can put this updating strategy in terms of temporal differencing. As before, the temporal difference is given by

    D_τ = C_{τ+1}(R_τ, W_{τ+1}(ω^n), x_{τ+1}) + V̄_{τ+1}^{n-1}(R_{τ+1}) − V̄_τ^{n-1}(R_τ)

The original parameter updating formula (equation 7.27), which we used when we had one parameter per state, now becomes

    θ̄^n = θ̄^{n-1} + α_n Σ_{τ=t}^{T-1} λ^{τ-t} D_τ ∇_θ V̄(R^n|θ̄^{n-1})
11.6
Why does it work?**
11.6.1
The projection operation
under construction This section is taken verbatim from Powell et al. (2004). Let:
( (1 − αn )vsn + αn η n zsn = vsn
if s = sn , otherwise,
(11.12)
Let us now describe the way the projection v = ΠV (z) can be calculated. Clearly, v is the solution to the quadratic programming problem 1 kv − zk2 2 subject to: vs+1 − vs ≤ 0, min
(11.13) s = 0, . . . , M,
(11.14)
where, for uniformity, we denote v0 = B, vM +1 = −B. Associating with (11.14) Lagrange multipliers λs ≥ 0, s = 0, 1, . . . , M , we obtain the necessary and sufficient optimality conditions: vs = zs + λs − λs−1 , s = 1, 2, . . . , M, λs (vs+1 − vs ) = 0, s = 0, 1, . . . , M.
(11.15) (11.16)
CHAPTER 11. VALUE FUNCTION APPROXIMATIONS FOR RESOURCE ALLOCATION310 If i1 , . . . , i2 is a sequence of coordinates such that vi1 −1 > vi1 = vi1 +1 = · · · = c = · · · = vi2 −1 = vi2 > vi2 +1 , then, adding the equations (11.15) from i1 to i2 yields i2 X 1 c= zs . i2 − i1 + 1 s=i 1
If i1 = 1, then c is the minimum of the above average and B, and for i2 = M the maximum of −B and this average has to be taken. The second useful observation is that v n ∈ V and z n computed by (11.12) differs from v n in just one coordinate. If z n 6∈ V one of two cases must occur: either zsnn −1 < zsnn , or zsnn +1 > zsnn . If zsnn −1 < zsnn , then we search for the largest 1 < i ≤ sn for which sn
n zi−1
X 1 ≥ n zsn . s − i + 1 s=i
(11.17)
If such i cannot be found we set i = 1. Then we calculate sn
X 1 zn c= n s − i + 1 s=i s
(11.18)
and set vjn+1 = min(B, c),
j = i, . . . , sn .
(11.19)
We have λ0 = max(0, c − B), and 0 λs = λs−1 + zs − vs 0
s = 1, . . . , i − 1, s = i, . . . , sn − 1, s = sn , . . . , M.
It is straightforward to verify that the solution found and the above Lagrange multipliers satisfy conditions (11.15)–(11.16). The procedure in the case when zsnn < zsnn +1 is symmetrical: it is the same procedure applied to the graph of z rotated by π.
CHAPTER 11. VALUE FUNCTION APPROXIMATIONS FOR RESOURCE ALLOCATION311
11.6.2
Proof of convergence of the learning version of the SPAR algorithm
This section provides a detailed proof of the learning version of the SPAR algorithm. The goal of the presentation is not just to prove convergence, but to also demonstrate the proof techniques that are required. We start from the description and analysis of the basic learning algorithm for a concave piecewise linear function of one variable f : [0, M ] → IR. We assume that f is linear on the intervals [s − 1, s], s = 1, . . . , M . Let vs = f (s) − f (s − 1),
s = 1, . . . , M.
Let us note that the knowledge of the vector v allows us to reconstruct f (x), x ∈ [0, M ], except for the constant term f (0): f (x) = f (0) +
l X
vs + vl+1 (x − l),
(11.20)
s=1
where l is such that l ≤ x < l + 1. The main idea of the algorithm is to recursively update a vector v¯n ∈ IRM , n = 0, 1, . . . , in order to achieve convergence of v¯n to v (in some stochastic sense). Let us note that by the concavity of f the vector v has decreasing components: vs+1 ≤ vs ,
s = 1, . . . , M − 1.
(11.21)
We shall at first assume that there exists a constant B such that v1 ≤ B,
vM ≥ −B.
(11.22)
Clearly, the set V of vectors satisfying (11.21)–(11.22) is convex and closed. We shall therefore ensure that all our approximate slopes v¯n are elements of V as well. To this end we shall employ the operation of orthogonal projection on V ΠV (z) = arg min{kv − zk2 : v ∈ V }.
(11.23)
Let (Ω, H, IP) be the probability space under consideration. Let sn be a random variable taking values in {1, . . . , M }. Denote by F 0 the σ-algebra generated by v¯0 and, for k = 1, 2, . . . , let F n denote the σ-algebra generated by v¯0 , · · · , v¯n , s0 , . . . , sn−1 . Now, for n = 0, 1, · · · , define F s,n = σ(¯ v 0 , · · · , v¯n , s0 , · · · , sn ). Note that F n ⊂ F s,n and that sn is not measurable with respect to F n . We can not avoid this ugly notation in this version of the algorithm, although it will be a clear notation for the optimizing version of the algorithm, as in this last version, sn will be a deterministic function of v¯0 , . . . , v¯n , s0 , . . . , sn−1 . The SPAR-Exploration algorithm is given in figure 11.5.
CHAPTER 11. VALUE FUNCTION APPROXIMATIONS FOR RESOURCE ALLOCATION312
STEP 0 Set v¯0 ∈ V, n = 0. STEP 1 Sample sn ∈ {1, . . . , M }. STEP 2 Observe a real-valued random variable vˆn+1 such that IE{ˆ v n+1 | Fks } = vsn ,
a.s.
(11.24)
STEP 3 Calculate the vector z n+1 ∈ IRM as follows: ( (1 − αn )¯ vsn + αn vˆn+1 if s = sn , zsn+1 = v¯sn otherwise, where αn ∈ (0, 1] and αn is F n -measurable. STEP 4 Calculate v¯n+1 = ΠV (z n+1 ), increase n by one and go to step 1.
Figure 11.5: Separable, Projective Approximation Routine - Exploration version We will need a few more definitions and assumptions before we prove the main result of this section. We denote pns = IP{sn = s|F n },
s = 1, . . . , M.
(11.25)
if s = sn , otherwise.
(11.26)
Also, let gsn+1
( −ˆ v n+1 + v¯sn = 0
Then, for n = 0, 1, . . . , z n+1 = v¯n − αn g n+1 and v¯n+1 = ΠV (¯ v n − αn g n+1 ).
(11.27)
CHAPTER 11. VALUE FUNCTION APPROXIMATIONS FOR RESOURCE ALLOCATION313 Note that IE{gsn+1 |F n } = IE{IE{gsn+1 |F s,n }|F n } = IE{IE{(−ˆ v n+1 + v¯sn )1{sn =s} |F s,n }|F n } = IE{1{sn =s} (−IE{ˆ v n+1 |F s,n } + v¯sn )|F n } = IE{1{sn =s} (−vs + v¯sn )|F n } = (−vs + v¯sn )IE{1{sn =s} |F n } = (−vs + v¯sn )IP{sn = s|F n } = pns (−vs + v¯sn )
(11.28) (11.29) (11.30) (11.31) (11.32) (11.33)
Equality (11.28) is due to the Tower property, while equality (11.29) follows from definition (11.26). Furthermore, (11.30) is given by the Fks measurability of 1{sn =s} and v¯sn . Finally, equalities (11.31), (11.32) and (11.33) are due to assumption (11.24), F n measurability of v¯sn and to definition (11.25), respectively. Thus IE{g n+1 | F n } = P n (¯ v n − v),
P n = diag(pns )M s=1 .
(11.34)
In addition to (11.24), we assume that there exists a constant C such that for all n IE{(ˆ v n+1 )2 | Fks } ≤ C
a.s.
(11.35)
We also assume that ∞ X n=0 ∞ X
αn = ∞ a.s.,
(11.36)
IE(αn )2 < ∞,
(11.37)
n=0
lim inf pns > 0 a.s., n→∞
s = 1, . . . , M.
(11.38)
. We say that a sequence of random variables {Mn : 0 ≤ n < ∞} is a martingale (submartingale) with respect to the filtration F provided that the sequence {Mn } has three basic properties: (i) Mn is F n -adapted,
n = 0, 1 . . .
(ii) IE{Mn |F n−1 } = (≥) Mn−1 , (iii) IE{|Mn |} < ∞,
n = 1, 2, . . .
n = 0, 1 . . .
Theorem 2 Assume (11.24) and (11.35)–(11.38). Then Algorithm SPAR-Exploration generates a sequence {¯ v n } such that v¯n → v a.s.
CHAPTER 11. VALUE FUNCTION APPROXIMATIONS FOR RESOURCE ALLOCATION314 To prove the theorem, we need to use two lemmas. The Euclidean norm is the norm under consideration. P n 2 n+1 2 Lemma 11.6.1 Let S0 = 0 and Sm = m−1 k , m = 1, 2, . . . . Then {Sm } is a n=0 (α ) kg F-submartingale that converges almost surely to a finite random variable S∞ . Proof: The first submartingale property is clearly satisfied. In order to show the second one, note that {Sm } is positive and increasing. Thus, Sm ≥ Sm−1 =⇒ IE{Sm |F m−1 } ≥ IE{Sm−1 |F m−1 } = Sm−1 . The third property is obtained recalling that Sm − Sm−1 = (αm−1 )2 kg m k2 . Hence, IE{Sm − Sm−1 |F m−1 } = IE{Sm |F m−1 } − Sm−1 = IE{(αm−1 )2 kg m k2 |F m−1 }. Therefore, IE{Sm |F m−1 } = Sm−1 + IE{(αm−1 )2 kg m k2 |F m−1 }.
(11.39)
Also, note that kg m k2 = (−ˆ v m + v¯sm−1 )2 1{sm−1 =s} ≤ (−ˆ v m + B)2 1{sm−1 =s} ≤ (−ˆ v m + B)2 , where the equality is obtained from the definition of g m and since v¯ ∈ V , it is bounded by B, then the first inequality holds. This last displayed inequality together with the fact that αm−1 is F m−1 -measurable yields IE{(αm−1 )2 kg m k2 |F m−1 } = (αm−1 )2 IE{kg m k2 kF m−1 } ≤ (αm−1 )2 IE{(ˆ v m )2 |F m−1 } − 2B(αm−1 )2 IE{ˆ v m |F m−1 } + (αm−1 )2 B 2 . We also have that IE{(ˆ v m )2 |F m−1 } = IE{IE{(ˆ v m )2 |F s,m−1 }|F m−1 } ≤ IE{C|F m−1 } a.s. = C a.s.,
(11.40)
where the first equality follows from Tower property and the inequality is due to assumption (11.35). Furthermore, IE{ˆ v m |F m−1 } = IE{IE{ˆ v m |F s,m−1 }|F m−1 } = IE{vsm−1 |F m−1 } a.s. ≤ IE{B|F m−1 } a.s. = B a.s.
(11.41)
CHAPTER 11. VALUE FUNCTION APPROXIMATIONS FOR RESOURCE ALLOCATION315 The first two equalities and the inequality follow, respectively, from the Tower property, assumption (11.24) and boundedness of V . Therefore, there exists a constant C1 such that IE{(αm−1 )2 kg m k2 |F m−1 } ≤ C1 (αm−1 )2 ,
a.s.,
m = 1, 2, . . . .
The last inequality together with equation (11.39) yields IE{Sm |F m−1 } ≤ Sm−1 + C1 (αm−1 )2 ,
m = 1, 2, . . . .
Thus, taking the expected valued we obtain IE{Sm } ≤ IE{Sm−1 } + C1 IE{(αm−1 )2 } ≤ IE{Sm−2 } + C1 IE{(αm−2 )2 } + C1 IE{(αm−1 )2 } .. . ≤ IE{S0 } + C1
m−1 X
IE{(αk )2 }
n=0
≤ C2 < ∞, since IE{S0 } = 0 and by (11.37). Therefore, as Sm is positive, we have checked all three submartingale properties and {Sm } is a F-submartingale. Also, since sup IE{Sm } ≤ C2 < ∞, by the Submartingale Convergence m
a.s.
Theorem (Shiryaev, 1996, page 508), Sm −−→ S∞ , where S∞ is finite. P n n v − v, g n+1 − P n (¯ v n − v)i, m = 1, 2, . . . . Lemma 11.6.2 Let U0 = 0 and Um = m−1 n=0 α h¯ Then {Um } is a F-martingale that converges almost surely to a finite random variable U∞ . Proof: The first property is clearly satisfied. In order to show the second one, note that Um − Um−1 = αm−1 h¯ v m−1 − v, g m − P m−1 (¯ v m−1 − v)i. Then, IE{Um − Um−1 |F m−1 } = IE{αm−1 h¯ v m−1 − v, g m − P m−1 (¯ v m−1 − v)i|F m−1 } M X = αm−1 IE{ (¯ vsm−1 − vs )(gsm − pm−1 (¯ vsm−1 − vs ))|F m−1 } s
=
s=1 M X αm−1 (¯ vsm−1 s=1
− vs )[IE{gsm |F m−1 } − pm−1 (¯ vsm−1 − vs )] s
= 0, where the second equality is due to the definition of ha, bi and the last one follows from (11.34). To obtain the third property, recall that 2 Um = (Um−1 + αm−1 h¯ v m−1 − v, g m − P m−1 (¯ v m−1 − v)i)2 .
CHAPTER 11. VALUE FUNCTION APPROXIMATIONS FOR RESOURCE ALLOCATION316 Thus, taking expectations yields 2 2 IE{Um |F m−1 } = Um−1 + 2Um−1 IE{αm−1 h¯ v m−1 − v, g m − P m−1 (¯ v m−1 − v)i|F m−1 } + IE{(αm−1 )2 (h¯ v m−1 − v, g m − P m−1 (¯ v m−1 − v)i)2 |F m−1 }.
(11.42)
As the second term of the sum is equal to IE{Um − Um−1 |F m−1 }, we know it is zero. Now, let’s focus on the last term. We have h¯ v m−1 − v, g m − P m−1 (¯ v m−1 − v)i =
=
M X s=1 M X s=1
≤
(¯ vsm−1 − vs )) (¯ vsm−1 − vs )(gsm − pm−1 s (¯ vsm−1 − vs )(−ˆ v m + v¯sm−1 )1{sm−1 =s} − pm−1 (¯ vsm−1 − vs )2 |s {z }
M X s=1
≤
M X
≥0
(¯ v m−1 − v )(−ˆ v m + v¯sm−1 )1{sm−1 =s} | {z } | s {z s} ≤B
≤2B
2B(−ˆ v m + B)1{sm−1 =s}
(11.43)
s=1 m
= 2B(−ˆ v + B)
M X
1{sm−1 =s }
s=1
|
{z
=1
}
= 2B(−ˆ v m + B), where (11.43) is due to the boundedness of V . Hence, IE{(αm−1 )2 (h¯ v m−1 − v, g m − P m−1 (¯ v m−1 − v)i)2 |F m−1 } ≤ (αm−1 )2 IE{(2B(−ˆ v m + B))2 |F m−1 } = (αm−1 )2 4B 2 (IE{(ˆ v m )2 |F m−1 } − 2BIE{ˆ v m |F m−1 } + B 2 ) ≤ (αm−1 )2 4B 2 (B − 2BC + B 2 ) ≤ C3 (αm−1 )2 , by (11.40) and (11.41), where C3 is a constant. The previous inequality together with equation (11.42) yields 2 2 IE{Um |F m−1 } ≤ Um−1 + C3 (αm−1 )2 ,
m = 1, 2, . . . .
CHAPTER 11. VALUE FUNCTION APPROXIMATIONS FOR RESOURCE ALLOCATION317 Thus, taking the expected valued we obtain 2 2 IE{Um } ≤ IE{Um−1 } + C3 IE{(αm−1 )2 } 2 } + C3 IE{(αm−2 )2 } + C3 IE{(αm−1 )2 } ≤ IE{Um−2 .. .
≤
IE{U02 }
+ C3
m−1 X
IE{(αk )2 }
n=0
≤ C4 < ∞, since IE{U02 } = 0 and by (11.37). Therefore, {Um } is bounded in L2 , and thus bounded in L1 . This means we have checked all three conditions and {Um } is a F-martingale. Also, the L2 -Bounded Martingale Convera.s. gence Theorem (Shiryaev, 1996, page 510) tells us that Um −−→ U∞ , where U∞ < ∞. Proof: [Proof of theorem 2] Since V is a closed and convex set of IRM , the Projection Theorem (Bertsekas et al., 2003, page 88) tells us that ΠV : IRM → V is continuous and nonexpansive, i.e., kΠV (y) − ΠV (x)k ≤ ky − xk,
∀x, y ∈ IRM .
Thus, k¯ v n+1 − vk2 = kΠV (z n+1 ) − ΠV (v)k2 ≤ kz n+1 − vk2 = k¯ v n − αn g n+1 − vk2 = k¯ v n − vk2 − 2αn h¯ v n − v, g n+1 i + (αn )2 kg n+1 k2 , as ka − bk2 = kak2 − 2ha, bi + kbk2 . Now, if we add and subtract 2αn h¯ v n − v, P n (¯ v n − v)i we get k¯ v n+1 − vk2 ≤ k¯ v n − vk2 − 2αn h¯ v n − v, P n (¯ v n − v)i − 2αn h¯ v n − v, g n+1 − P n (¯ v n − v)i + (αn )2 kg n+1 k2 ,
(11.44)
as ha, bi + ha, ci = ha, b + ci. Now, we will prove that {k¯ v n − vk} converges almost surely. Let, for n = 0, 1, . . . , An+1 = (S n+1 − Sk ) − 2(U n+1 − Uk ), where {Sk } and {Uk } are defined as in lemmas 11.6.1 and 11.6.2. Note that An+1 is the last two terms of (11.44). Also consider B n+1 =
n X
Am+1 =
m=0 n+1
n X
(Sm+1 − Sm ) − 2(Um+1 − Um )
m=0
= (S − S0 ) − 2(U n+1 − U0 ) (telescopic sum) = S n+1 − 2U n+1
CHAPTER 11. VALUE FUNCTION APPROXIMATIONS FOR RESOURCE ALLOCATION318 P Clearly, ∞ m=0 Am+1 = B∞ = S∞ − U∞ < ∞ a.s., as both S∞ and U∞ are finite almost surely from lemmas 11.6.1 and 11.6.2. Hence, it is valid to write A
n+1
=
∞ X
∞ X
Am+1 −
m=n
Am+1 .
m=n+1
Therefore, inequality (11.44) can be rewritten as ≥0 n+1
k¯ v
}| { z v n − v, P n (¯ v n − v)i − vk ≤ k¯ v − vk − 2 αn h¯ ∞ ∞ X X + Am+1 − Am+1 . 2
n
2
m=n
m=n+1
Thus, from the positiveness of the previous inner product term, n+1
k¯ v
∞ X
2
− vk +
n
2
Am+1 ≤ k¯ v − vk +
∞ X
Am+1 .
m=n
m=n+1
We can infer from this inequality, that the sequence defined by D
n+1
n
2
= k¯ v − vk +
∞ X
Am+1 ,
n = 0, 1, . . .
m=n
P is decreasing. This sequence is also bounded, as V is bounded and ∞ m=0 Am+1 is finite almost surely. From these two facts (decreasing and bounded), we can conclude that the sequence {Dk } converges. P∞ Moreover, as m=n Am+1 → 0 as n → ∞, we can also conclude that the sequence n 2 {k¯ v − vk } or, equivalently, {k¯ v n − vk} converges almost surely, since the sum of the limits is the limit of the sums. We are finally ready to finish our proof. Recall that inequality (11.44) holds for all n. Then, k¯ v n+1 − vk2 ≤ k¯ v n − vk2 − 2αn h¯ v n − v, P n (¯ v n − v)i + An+1 n n X X n−1 2 m m m m ≤ k¯ v − vk − 2 α h¯ v − v, P (¯ v − v)i + Am+1 m=n−1
m=n−1
.. . ≤ k¯ v 0 − vk2 − 2
n X
αm h¯ v m − v, P m (¯ v m − v)i +
m=0
n X
Am+1 .
m=0
Thus, n+1
k¯ v
2
− vk + 2
n X m=0
m
m
m
m
0
2
α h¯ v − v, P (¯ v − v)i ≤ k¯ v − vk +
n X m=0
Am+1 .
CHAPTER 11. VALUE FUNCTION APPROXIMATIONS FOR RESOURCE ALLOCATION319 Passing to the limits we obtain ∞ X n+1 2 lim k¯ v − vk + 2 αm h¯ v m − v, P m (¯ v m − v)i n→∞
m=0 0
2
≤ k¯ v − vk +
∞ X
Am+1 < ∞ a.s.,
m=0
as the last sum is finite. Therefore, since {k¯ v n+1 − vk2 } is convergent, the last inequality tells us ∞ X αm h¯ v m − v, P m (¯ v m − v)i < ∞ a.s. m=0
But all the terms of this sum are positive, and from (11.38), at least one diagonal element v n } that of P m is strictly positive. Therefore, there must exist a subsequence {¯ v nk } of {¯ converges to v almost surely. Moreover, as {k¯ v n+1 − vk2 } is convergent a.s., all its subsequences converge and have the same limit. Thus, since the subsequence {k¯ v nk − vk2 } converges to zero a.s., as {¯ v nk } converges to v a.s., the whole sequence {k¯ v n+1 − vk2 } converges to zero and thus {¯ vn} converges to v a.s. It is also possible that at a given point sn ∈ {1, . . . , M − 1} we can observe two random n+1 variables: vˆn+1 satisfying (11.24) and (11.35), and vˆ+ such that, for all n, n+1 IE{ˆ v+ |F s,n } = vsn +1
a.s.
(11.45)
n+1 2 IE{(ˆ v+ ) |F s,n } ≤ C,
a.s.
(11.46)
and
The algorithm SPAR-Exploration can be easily adapted to this case, too. The only difference is Step 3, where we use both random observations, whenever they are available: n n vs + αn vˆn+1 if s = sn , (1 − α )¯ n+1 (11.47) zsn+1 = (1 − αn )¯ if sn < M and s = sn + 1, vsn + αn vˆ+ n v¯s otherwise, The analysis of this version v n+1 + v¯sn −ˆ n+1 gsn+1 = −ˆ v+ + v¯sn 0
of the method is similar to the basic case. We define if s = sn , if sn < M and s = sn + 1, otherwise.
Thus, =0
IE{g1n+1 |F n }
n+1
= IE{(−ˆ v + n+1 = IE{(−ˆ v + n n = p1 (¯ v1 − v1 ),
n+1 v+ v¯1n )1{sn =1} + (−ˆ v¯1n )1{sn =1} |F n }
+
z }| {
v¯1n ) 1{sn =0}
1{sn <M } |F n }
CHAPTER 11. VALUE FUNCTION APPROXIMATIONS FOR RESOURCE ALLOCATION320 as in the basic case, and for s = 2, . . . , M , =1
z }| { n+1 IE{gsn+1 |F n } = IE{(−ˆ v n+1 + v¯sn )1{sn =s} + (−ˆ v+ + v¯sn )1{sn =s−1} 1{sn <M } |F n } n+1 + v¯sn )1{sn =s−1} |F n } v+ = IE{(−ˆ v n+1 + v¯sn )1{sn =s} |F n } + IE{(−ˆ = pns (¯ vsn − vs ) + pns−1 (¯ vsn − vs ), where the last equality was obtained following the same reasoning as the basic case. It follows that ( pns (¯ vsn − vs ) if s = 1 IE{gsn+1 |Fk } = (11.48) n n n (ps + ps−1 )(¯ vs − vs ) if 1 < s ≤ M . Therefore, after replacing the coefficients pns by ( if s = 1 pns n p¯s = pns + pns−1 if 1 < s ≤ M , we can reduce this version of the method to the basic case analyzed earlier. If we have a separable multidimensional problem, the algorithm SPAR-Exploration can be applied component-wise. All the analysis remains true, provided the observation points sn = (sn1 , . . . , snn ) are sampled at random, and for each coordinate i, assumption (11.38) is satisfied.
11.7  Bibliographic notes
The proofs of the SPAR algorithm, for both the learning (section 11.6.2) and optimizing (section 12.4.1) versions, were adapted from the original proofs in Powell et al. (2004). Our presentation is more complete, and is designed to demonstrate the proof technique. The presentation was prepared by Juliana Nascimento.

Recursive estimation: Ljung & Soderstrom (1983), Young (1984)
Recursive least squares in dynamic programming: Nedić & Bertsekas (2003), Bradtke & Barto (1996)
SPAR: Powell et al. (2004)
SHAPE: Cheung & Powell (2000)
Leveling algorithm: Topaloglu & Powell (2003)
CAVE: Godfrey & Powell (2001)
Matrix methods: Golub & Loan (1996)
Convergence (Tsitsiklis and Van Roy): Tsitsiklis & Van Roy (1997), Jaakkola et al. (1994)
LP methods and approximate DP: de Farias & Van Roy (to appear)
Bertazzi et al. (2000)
Aggregation: Rogers et al. (1991), Wright (1994), Bertsekas & Castanon (1989), Schweitzer et al. (1985), Tsitsiklis & Van Roy (1996), Bean et al. (1987), Mendelssohn (1982), Zipkin (1980b), Zipkin (1980a)
Soft state aggregation: Singh et al. (1995)
Iterative methods for continuous problems: Luus (2000)
Exercises

11.1) Show that the scaling matrix H^n in the recursive updating formula from equation (9.45), θ^n = θ^{n-1} − H^n ε̂^n, reduces to H^n = 1/n for the case of a single parameter.
Chapter 12

The asset acquisition problem

One of the fundamental problems in asset management is the asset acquisition problem, where we face the challenge of acquiring assets that are then consumed in a random way (see the examples below).

Example 12.1: The chief financial officer for a startup company has to raise capital to finance research, development, and business startup expenses to get the company through to its initial public offering. The CFO has to decide how much capital to raise initially as well as plan the addition of new capital as needs arise. The consumption of capital depends on the actual flow of expenses, which are partially offset by growing revenues from the new product line.

Example 12.2: A business jet company sells fractional ownership in its jets. It has to order new jets, which can take a year or more to arrive, in order to meet demands as they arise. Requests for jets fluctuate due to a variety of factors, including random noise (the behavior of individuals making decisions) and the stock market.

Example 12.3: A management consulting firm has to decide how many new college graduates to hire annually to staff its needs over the course of the year.

Example 12.4: An energy company has to sign contracts for supplies of coal, oil and natural gas to serve its customers. These decisions have to be made in advance in the presence of fluctuating market prices and the aggregate demand for energy from its customers.
There are two issues that make these problems interesting. One is the presence of nonconvex cost functions, where it costs less, on average, to acquire larger quantities at the same time. This property, which arises in a large number of operational settings, creates an incentive to acquire assets now that can be held in inventory and used to satisfy demands in the future. We will defer this problem class to chapter 13 under the category of batch replenishment processes. 322
CHAPTER 12. THE ASSET ACQUISITION PROBLEM
323
The second complicating issue is the uncertainty in demand for assets and fluctuations in both the cost of purchasing as well as selling assets. This is the problem class we address in this chapter. We restrict our attention to problems where there is only one type of asset. This eliminates, for the moment, the complications that arise with substitution between asset classes (this problem is addressed in chapters 14 and 15). Instead, our focus is on modeling various types of information processes. We start with the basic problem of making the decision to acquire additional assets in one time period to meet demand in the next time period (where the assets are consumed or lost). This simple problem allows us to explore specific approximation strategies that exploit the natural concavity of the value function. The remainder of the chapter extends this core strategy to multiperiod problems where assets can be held, and to problems where orders can be made more cheaply for time periods farther in the future. This chapter is the first time that we try to solve a problem purely using continuous state and decision variables with continuous value function approximations. In addition, we will create these approximations using gradients rather than estimates of the functions themselves. This work will lay the foundation for much larger and more complex asset management problems.
12.1
The single-period problem
The core problem in asset acquisition is one that is famously known as the newsvendor problem (in the old days, this was called the “newsboy problem”). The story goes something like this: At the beginning of the day, our newsvendor has to decide how many newspapers to put in the newstand at the side of the road. Needless to say, this decision has to be made before she knows the demand for newspapers. At the end of the day, the newspapers are worthless, and remaining papers have to be discarded. It is possible that all the papers are sold out, at which point we face the problem of losing an unknown amount of demand. Of course, our interest in this problem is much broader than newspapers. For example, a growing business jet company has to sign contracts at the beginning of one year for jets that will arrive at the beginning of the next year. The company hopes that these jets will all be purchased (often in fractions of 1/8) before the end of the year. If not, the company faces an overage situation, which means the company is paying for assets that are not generating revenue. The alternative is that demand has exceeded supply, which may be turned away or satisfied by leasing aircraft at a cost that is higher than what the company receives.
CHAPTER 12. THE ASSET ACQUISITION PROBLEM
324
In our notation, we would define: x0 = The quantity the company orders initially which arrives during time interval 1 (the time between 0 and 1) that can be used to satisfy demands during this time period. D1 = The demand for assets that arise during time interval 1. cp = The unit purchase cost of assets. p = The price charged to each customer for the asset. Our contribution function is given by: C(x0 ) = E {p min[x0 , D1 ] − cp x0 }
(12.1)
This is termed the profit maximizing form of the newsvendor problem. The problem is sometimes stated in a cost-minimizing version. In this form, we define: co = The unit cost of ordering too much. cu = The unit cost of ordering too little. We would now define the cost function (to be minimized): C(x0 ) = E co [x0 − D1 ]+ + cu [D1 − x0 ]+
(12.2)
where [x]+ = max{0, x}. We are going to use the profit maximizing version in equation (12.1), but our presentation will generally work for a more general model. We assume (as occurs in the real newsvendor problem) that unused assets have no value. This is the reason it is called the one-period model. Each time period is a new problem. As we show later, solving the one-period model sets the foundation for solving multiperiod problems. At this point we are not concerned with whether the demands are discrete or continuous, but this will affect our choice of algorithmic strategies. Our starting point for analyzing and solving this problem is the stochastic optimization problem that we first posed in chapter 6, where we addressed the problem of solving: max Ef (x, W ) x
Clearly, the single-period newsvendor problem fits this basic framework where we face the problem of choosing x = x0 before we have access to the information W = D1 . This opens the door to algorithms we have already presented in chapter 6, but we can also exploit the properties of this special problem.
CHAPTER 12. THE ASSET ACQUISITION PROBLEM 150
150
100
100
325
50
50
0 0
1 1
2
3
4
5
6
7
8
9
2
3
4
5
6
7
8
9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
-50 -50
Revenue
Revenue
-100
Future value
-100
Future value Total
Total -150 -150
12.1(a)
12.1(b)
Figure 12.1: The newsvendor profit function for a sample realization (a) and an average over several samples (b).
12.1.1
Properties and optimality conditions
The first important property of the newsvendor problem is that it is continuous in x (in chapter 13, we looked at problems that are not continuous). If demands are continuous, the objective function will be continuously differentiable. If demands are discrete (but we allow the order quantity to be continuous), the objective function will be piecewise linear. The second property that we are going to exploit in our solution of the profit maximizing version of the newsvendor problem (equation (12.1)) is concavity. It is pretty easy to see the concavity of the function for a sample realization D1 (ω). Figure 12.1 shows the function (a) for a single realization, and (b) averaged over 10 realizations. For the sample in figure 12.1(a), we see the clear effect of the min operator on the revenue, whereas the costs are linear. When averaged over multiple samples, the plot takes on a smoother shape. If demands are continuous, then the profit function will be a continuously differentiable function. If demands are discrete and integer (but if we let x0 be continuous), then the profit function would be piecewise linear with kinks for integer values of x. Later, we exploit this property when we develop approximations for the value function. The basic newsvendor problem is simple enough that we can derive a nice optimality condition. It is easy to see that the gradient of the function, given a sample realization ω, is given by: ( p − cp ∇C(x0 ) = −cp
If x0 < D1 (ω) If x0 ≥ D1 (ω)
(12.3)
Since the function is not differentiable at x0 = D1 (ω), we have to arbitrarily break a tie (we could legitimately use a value for the gradient in between −cp and p − cp ). Since our function is unconstrained, we expect the gradient at the optimal solution to equal zero. If our function is piecewise linear (because demands are discrete), we would say that there
CHAPTER 12. THE ASSET ACQUISITION PROBLEM
326
∇F ( x * ) = 0
x*
∇F ( x * ) = 0
x* 12.2(a)
12.2(b)
Figure 12.2: At optimality, the gradient at the optimal solution will equal zero if demands are continuous (a) or the set of subgradients at the optimal solution will include the zero gradient (b). exists a gradient of the function that is zero at the optimal solution. The two cases are illustrated in figure 12.2. When we look at the expression for the gradient in equation (12.3), we quickly see that the gradient is never equal to zero. What we want is for the expected value of the gradient to be zero. We can write this easily as: E∇C(x0 ) = = = =
(p − cp )P[x0 < D1 ] − cp P[x0 ≥ D1 ] (p − cp )(1 − P[x0 ≥ D1 ]) − cp P[x0 ≥ D1 ] p − cp − pP[x0 ≥ D1 ] 0
Solving gives us: P[x0 ≥ D1 ] =
p − cp p
(12.4)
Equation (12.4) is well known in the operations management community as the critical ratio. It tells us that at optimality, the fraction of time we want to satisfy demand is given by (p − cp )/p. Obviously we require cp ≤ p. As cp approaches the price, our marginal profit from satisfying demand drops to zero, and we want to cover less and less demand (dropping to zero when cp = p). This is an elegant result, although not one that we can actually put into practice. But it helps to see that at optimality, our sample gradient is, in general, never equal to zero. Instead, we have to judge optimality by looking at the average value of the gradient.
CHAPTER 12. THE ASSET ACQUISITION PROBLEM
12.1.2
327
A stochastic gradient algorithm
In chapter 6, we saw that we could solve stochastic optimization problems of the form: max EF (x, W ) using iterations of the form: xn = xn−1 + αn ∇F (xn−1 , W (ω n )) This is true for unconstrained problems. At a minimum, we will have nonnegativity constraints, and in addition, we may have limits on how much we can order. Constraints like these are fairly easy to handle using a simple projection algorithm. Let ΠX be a projection operator that takes a vector from outside of the feasible region to the point in X closest to the original vector. Our algorithm would then look like: xn = ΠX xn−1 + αn ∇F (xn−1 , W (ω n )) For nonnegativity constraints, the projection operator simply takes any negative elements and makes them zero. Similarly, if we require x ≤ u, then any elements greater than u are simply set equal to u. Interestingly, this algorithm is convergent even if F (x, W ) is nondifferentiable. It turns out that our basic conditions on the stepsize required to ensure convergence for stochastic problems are also the conditions needed for nondifferentiable problems. Solving stochastic, nondifferentiable problems (which is what we often encounter with newsvendor problems) does not introduce additional complexities. Stochastic gradient algorithms can be notoriously unstable. In addition to the use of a declining stepsize, another strategy is to perform smoothing on the gradient itself. That is, we could compute: g¯n = (1 − β n )¯ g n−1 + β n ∇F (xn−1 , W (ω n )) where β is a stepsize for smoothing the gradient. We would then update our solution using: xn = xn−1 + αn g¯n with a projection operation inserted as necessary. The smoothed gradient g¯n is effectively creating a linear approximation of the value function. Since maximizing (or minimizing) a linear approximation can produce extreme results, the stepsize αn plays a critical role in stabilizing the algorithm.
CHAPTER 12. THE ASSET ACQUISITION PROBLEM
328
Consider now a minor reformulation of the newsvendor problem: max C0 (x0 ) + EV (R0 )
(12.5)
C0 (x0 ) = −cp x0 R0 = x 0 V (x0 ) = p min{R0 , D1 }
(12.6) (12.7) (12.8)
x
where
Here, we are maximizing a one-period contribution function C0 (x0 ), which captures the cost of purchasing the assets plus the expected value of the future given that we are in R0 = x0 . Now, replace the value function with a linear approximation: max C0 (x0 ) + v¯0 x0 x
(12.9)
For a given value of xn−1 , we can find the gradient of the value function given a sample 0 realization of the demand. Let: vˆn = Sample gradient of V (R0 ). ( p if R0n−1 < D1 (ω n ) = 0 if R0n−1 ≥ D1 (ω n )
(12.10)
Given the sample gradient, we would perform smoothing to approximate an expectation: v¯n = (1 − αn )¯ v n−1 + αn vˆn Of course, optimizing a linear approximation can produce extreme solutions (in fact, the problem does not even make sense unless we bound x0 from above and below), but if we take the gradient of (12.9), we can move in the direction of this gradient. Of course, the gradient of (12.9) is the same as our stochastic gradient. Linear approximations can work well in special situations. Later, we introduce variations to the newsvendor where we already have a supply of assets on hand, but may need to order more. The amount we can order may be small compared to the demand and the amount that we already have on hand. As a result, x0 may only influence our “value function” over a small range where a linear approximation is appropriate.
12.1.3
Nonlinear approximations for continuous problems
In general, nonlinear approximations are going to work better than linear ones. To produce a nonlinear approximation, we could use any of the techniques for statistically estimating
CHAPTER 12. THE ASSET ACQUISITION PROBLEM
329
value function approximations that were introduced in chapter 9. A particularly simple algorithm for problems with continuous demands (and therefore continuously differentiable value functions) is the SHAPE algorithm (section 11.4). To illustrate, start with the dynamic programming form of the newsvendor problem given in equations (12.5)-(12.6). Now, assume we have an idea of what the value function might look like, although we may not have an exact expression. It is clear, looking at equation (12.8), that the function will initially rise somewhat linearly, but level off at some point. It would be reasonable to guess at a function such as: V¯ 0 (R0 |θ) = θ0 (1 − exp (θ1 R0 ))
(12.11)
Right now, we are going to assume that we have an estimate of the parameter vector θ which would be based on an understanding of the problem. We can then use the SHAPE algorithm to refine our estimate. Using the sample gradient vˆn in equation (12.10), we would use the updating equation: V¯ n (R0 |θ) = V¯ n−1 (R0 |θ) + αn vˆn − ∇R0 V¯ n−1 (R0 |θ) R0 At iteration n, our value function will look like the original value function plus a linear correction term: V¯ n (R0 |θ) = θ0 (1 − exp (θ1 R0 )) + g¯n R0 where: n
g¯ =
n X
αn vˆn − ∇R0 V¯ n−1 (R0 |θ)
m=1
The SHAPE algorithm will work best when it is possible to exploit the problem structure to produce a “pretty good” initial approximation. The SHAPE algorithm will then tune this initial approximation, producing a convergent algorithm in certain problem settings (such as the simple newsvendor problem). An alternative is to use an arbitrary polynomial (for example, −x2 ). In this case, it is probably more effective to use the more general regression methods described in section 9.4.
12.1.4
Piecewise linear approximations
There are several reasons why we may be interested in a piecewise linear approximation. First, our demands may be discrete, in which case the value function will actually be piecewise linear in the resources (if we are allowed to cover a fraction of a demand). Second, we may wish to solve our newsvendor problem using a linear programming package (this is
CHAPTER 12. THE ASSET ACQUISITION PROBLEM
330
relevant in more general, multidimensional settings). Third (and often the most important), our underlying demand distribution may not be known, and we may have to allow for the possibility that it has an unusual shape. For example, we may face demands that are usually zero, but may be large. Or, the demands may be bimodal. In the presence of such behaviors, it may be difficult to propose reasonable polynomial approximations that will work well across all the different types of demand behaviors. We can handle such problems using piecewise linear approximations where we represent our value function in the form: V¯ n−1 (R) =
X
X a∈A
bRc
v¯n−1 (r) + (R − bRc)¯ v n−1 (dRe) ,
(12.12)
r=1
where bRc is the largest integer less than or equal to R, and dRe is the smallest integer greater than or equal to R. If our resource vector R is small and discrete, we would have a slope for every integer value of R, which is equivalent to estimating a discrete version of the value function. The form in equation (12.12) has two advantages. First, the range of R may be quite large, and we may prefer to approximate the function using only a few slopes. Second, it shifts the emphasis from approximating the value of having R assets on hand to estimating the marginal value of additional assets. This will prove to be a critical step as we move on to vector-valued problems (multiple asset types). We assume that at iteration n, we make a decision xn0 that we obtain by solving: xn0 = arg max C0 (x0 ) + V¯ n−1 (R0 ) x
(12.13)
given xn0 , we find the sample gradient vˆn (equation (12.10)) using R0n = xn0 . For simplicity, we are going to assume that we have an estimate of a slope v¯n−1 (r) for every value of r (more general applications of this idea have to handle the possibility that the function is piecewise linear over ranges of values of r). In this case, we would update our slope using: v¯n (Rn ) = (1 − αn )¯ v n−1 (Rn ) + αn vˆn With this update, we may no longer have the property that v¯n (Rn −1) ≥ v¯n (Rn ) ≥ v¯n (Rn +1), which means that our value function approximation V¯ n (R) is no longer concave. There are two problems with this state of affairs. Theoretically it is unattractive because we know the function is concave. More practically, it makes equation (12.13) much harder to solve (when we make the step to multiple asset types, the problem becomes virtually impossible). Instead, we can use the same methods described in section 11.3. Instead of estimating a monotone function, we estimate a monotone set of slopes. We would first update our slopes
CHAPTER 12. THE ASSET ACQUISITION PROBLEM
331
using: ( (1 − αn )¯ v n−1 (R) + αn vˆn y n (r) = v¯n−1 (r)
if r = Rn otherwise
(12.14)
We then check to see if we have a concavity violation (equivalently, a violation of the monotonicity of the slopes). If y n (r) ≥ y n (r + 1) for all r, then our function is still concave, and we would set: v¯n (r) = y n (r) Otherwise, we have a violation, which means that either y n (rn ) < y n (rn + 1) or y n (rn − 1) < y n (rn ). We would solve the problem by solving the projection problem: min kv − y n k2 v
subject to: v(r + 1) − v(r) ≤ 0 We solve this problem as we did in section 11.3. Assume that our violation is of the form y n (Rn ) < y n (Rn + 1), which means that our updated slope raised the value of y n (Rn ) too much. We want to find the smallest value i > Rn such that Rn
X 1 y n (r) y (i + 1) ≤ i − Rn + 1 r=i n
We are using gradients to estimate a set of slopes just as we estimated the value function directly in section 11.3. So far, it seems as if we are doing the same thing here using slopes that we did in section 11.3 using direct estimates of the function. However, there is one critical difference. When we first presented the SPAR algorithm, we presented the learning version, which means we assumed there was some exogenous process that was choosing the states for us. We also assumed that we were going to sample each state with a positive probability, which meant that in the limit, each state would be sampled infinitely often. For our newsvendor problem, we no longer have this property. Our decisions are driven by equation (12.13), which means our decision xn depends on our value function approximation. Since the states we sample are determined by the solution of an optimization problem, we refer to this as the optimizing version of the algorithm, which is summarized in figure 12.3. In fact, this is the way all of our approximate dynamic programming algorithms work that we have presented up to now. The problem with these algorithms is that in general, not only
CHAPTER 12. THE ASSET ACQUISITION PROBLEM
332
Step 0 Initialize V¯ 0 and set n = 1. Step 1 Solve: xn0 = arg max C0 (x0 ) + V¯ n−1 (R0 ) x
and set R0 = x0 . Step 2 Sample D1 (ω n ) and determine the gradient vˆn using (12.10). Step 3 Calculate the vector y n as follows: ( (1 − αn )¯ v n−1 (r) + αn vˆn n y (r) = n−1 v (r)
if r = Rn , otherwise
(12.15)
Step 4 Project the updated estimate onto the space of monotone functions: v¯n = Π(y n ), by solving (11.6)-(11.7). Set n = n + 1 and go to Step 1.
Figure 12.3: The optimizing form of the separable, projective approximation routine (SPAR). is there no proof of optimality, but it is easy to create examples where they work extremely poorly. The problem is that if you get a poor estimate of being in a state, you may avoid decisions that put you in it. In our case (assuming discrete demands), the algorithm is optimal! The formal proof is given in section 12.4.1. A natural question is: why does it work? In the learning version, we sample each state infinitely often. Since we get unbiased estimates of each state, the only nontrivial aspect of the proof is the introduction of the monotonicity preserving step. But no one would be surprised that the algorithm is convergent. When we depend on our current value function approximation to tell us what point to sample next, it would seem that a poor approximation would encourage us to sample the wrong points. For example, assume we were estimating a discrete value function, and that at each iteration, we looked at all values of x0 to choose the best decision. Then, assume that we only update the value that we observe without introducing an intermediate step such as maintaining monotonicity or concavity. It would be easy, due to statistical noise, to get a low estimate of the value of being in the optimal state. We would never visit this state again, and therefore we would never fix our error. The reason the algorithm works here is because of the concavity property. Assume we can order between 0 and 10 units of our asset, and that our initial approximation produces an optimal solution of 0 when the true optimum is 5. As we repeatedly sample the solution corresponding to x0 = 0, we eventually learn that the slope is larger than we thought (which means that the optimal solution is to the right). This moves the optimal solution to a higher value. This behavior is illustrated in figure 12.4. The function does not converge to the true function everywhere. It is entirely possible that we may have inaccurate estimates of the function as we move away from the region of optimality. However, concavity bounds these
CHAPTER 12. THE ASSET ACQUISITION PROBLEM
333
Exact
Approximate Approximate
x*
x* 12.4(a)
12.4(b)
Approximate
Approximate
x*
x* 12.4(c)
12.4(d)
Approximate After projection
Approximate
x*
x*
12.4(e)
12.4(f)
Figure 12.4: Illustration of the steps of the SPAR update logic of a concave function. errors. We do not need to get all the slopes exactly. We will, however, get the slopes at the optimal solution exactly.
CHAPTER 12. THE ASSET ACQUISITION PROBLEM
334
The use of a piecewise linear approximation will, in general, require the estimation of more parameters than would be needed if we characterized the value function using a polynomial that could be fitted using regression methods. The additional parameters are the reason the method will work well even for odd demand distributions, but the importance of this feature will depend on specific applications. This problem illustrates how we can estimate value functions using a small number of parameters. Furthermore, if we can exploit a property such as concavity, we may even be able to get optimal solutions.
12.2
The multiperiod asset acquisition problem
While the newsvendor problem can be important in its own right, there is a vast array of problems that involve acquiring assets over time and where assets that are not used in one time period remain in the next time period. In financial applications, there can also be gains and losses from one time period to the next reflecting changes in prices due to market fluctuations. In this section, we show how the multiperiod asset acquisition problem can be solved using techniques that we have already described. We present the basic model in section 12.2.1. In order to update the value function approximation, we have to compute approximations of gradients of the value functions. This can be done in two ways. Section 12.2.2 describes the simpler single-pass method. Then, section 12.2.3 describes the two-pass method which, while somewhat more difficult to implement, generally has faster rates of convergence.
12.2.1
The model
We start with a basic multiperiod model: Dt = The demand for assets during time interval t. xt = The quantity of assets acquired at time t to be used during time interval t + 1. Rt = Assets on hand at the end of time interval t that can be used during time interval t + 1. Ct (Rt−1 , xt , Dt ) = Contribution earned during time interval t. = pt min{Rt−1 , Dt } − cp xt We note that the revenue term, pt min{Rt−1 , Dt } is a constant at time t whereas in our single-period problem, this term did not even appear (see equation (12.6)). For our presentation, it is going to be useful to use both the pre- and post- decision state
CHAPTER 12. THE ASSET ACQUISITION PROBLEM
335
variables, which produces the following transition equations: x − Dt ]+ Rt = [Rt−1 Rtx = Rt + xt
Writing the optimality equation using the post-decision state variable gives x x x = E max Ct (Rt−1 , Dt , xt ) + γVt (Rt ) |Rt−1
x ) Vt−1 (Rt−1
xt
An effective solution technique is to combine the approximate dynamic programming methods described in chapter 5, the functional approximation methods of section 11 and the ideas suggested in section 12.1. Assume we have designed a functional approximation V¯t (Rt ). At iteration n, we would make decisions by solving: n xnt = arg max Ct (Rt−1 , Dt (ω n ), xt ) + γ V¯tn−1 (Rtx ) xt
(12.16)
Recall that solving (12.16) represents a form of decision function that we represent by Xtπ , where the policy is to solve (12.16) using the value function approximation V¯tn−1 (Rtx ). Assume, for example, that our approximation is of the form V¯tn−1 (Rtx ) = θ0 + θ1 Rtx + θ2 (Rtx )2 In this case, (12.16) looks like: xnt = arg max pt min{Rt−1 , Dt (ω n )} − cp xt + γ(θ0 + θ1 Rtx + θ2 (Rtx )2 ) xt
(12.17)
We can find xnt by differentiating (12.17) with respect to xt and setting the result equal to zero: 0 = = = =
x,n dCt (Rt−1 , Dt (ω n ), xt ) dV¯ n−1 (Rtx ) +γ t dxt dxt x,n n−1 n ¯ dCt (Rt−1 , Dt (ω ), xt ) dV (Rx ) dRtx +γ t x t dxt dRt dxt p x −c + γ(θ1 + 2θ2 Rt ) x − Dt (ω n )]+ + xt )) −cp + γ(θ1 + 2θ2 ([Rt−1
x where we used dRtx /dxt = 1 and Rtx = [Rt−1 − Dt (ω n )]+ + xt . Solving for xt given that x,n x Rt−1 = Rt−1 gives:
xnt
1 = 2θ2
cp − θ1 γ
n − [Rt−1 − Dt (ω n )]+
CHAPTER 12. THE ASSET ACQUISITION PROBLEM
336
If we used a piecewise linear approximation for the value function, we would have to find the slope of the sum of the one-period contribution plus the value function for each value of xt until we found a point where the slope changed from negative to positive. As we described in chapter 7, there are two methods for updating the value function: a single-pass method where updates are computed as we step forward in time (see figure 7.1), and a double-pass method (outlined in figure 7.2). These discussions were presented in the context of computing the value of being in a state. We now illustrate these calculations in terms of computing the derivative of the function, which is more useful for asset acquisition problems.
12.2.2
Computing gradients with a forward pass
The simplest way to compute a gradient is using a pure forward pass implementation. At time t, we have to solve: ∼n n n p n−1 V t (Rt−1 , Dt (ω )) = arg max pt min{Rt−1 , Dt (ω )} − c xt + γ V¯t (Rt ) xt
∼n
Here, V t (Rt−1 , Dt (ω n )) is just a placeholder. We next compute a derivative using the finite difference: ∼n
∼n
vˆtn =V t (Rt−1 + 1, Dt (ω n ))− V t (Rt−1 , Dt (ω n )) ∼n
Note that when we solve the perturbed value V t (Rt−1 +1, Dt (ω n )) that we have to reoptimize xt . For example, it is entirely possible that if we increase Rt−1 by one then xnt will decrease by one. As is always the case with a forward pass algorithm, vˆtn depends on V¯tn (Rt ), and as a result will typically be biased. Once we have a sample estimate of the gradient vˆtn , we next have to update the value function. We can represent this updating process generically using: n−1 n n n = U V (V¯t−1 , vˆt , Rt−1 ) V¯t−1
The actual updating process depends on whether we are using a piecewise linear approximation, a general regression equation, or the SHAPE algorithm. The complete algorithm is outlined in figure 12.5, which is an adaptation of our original single-pass algorithm. A simple but critical conceptual difference is that we are now explicitly assuming that we are using a continuous functional approximation.
12.2.3
Computing gradients with a backward pass
Computing the gradient using a backward pass means that we are finding the gradient of the total contribution when we are following a specific policy determined by our current value
CHAPTER 12. THE ASSET ACQUISITION PROBLEM
337
Step 0. For all t choose a function form and initial parameters for the value function V¯t0 (St ) for all time periods t. Let n = 1. Step 1. Choose ω n . Step 2: Do, for t = 1, 2, . . . , T : Step 2a. Solve:
∼n Vt
n n arg max Ct (Rt−1 , Dt (ω n ), xt ) + V¯tn−1 (RM (Rt−1 , xt , ωtn ))
xnt
=
n
= arg max pt min{Rt−1 , Dt (ω n )} − cp xt + γ V¯tn−1 (Rt )
(Rt−1 , Dt (ω ))
xt ∈Xt xt
Step 2b. Compute: ∼n
∼n
vˆtn =V t (Rt−1 + 1, Dt (ω n ))− V t (Rt−1 , Dt (ω n )) Step 2c. Update the value function: n−1 n n n V¯t−1 = U V (V¯t−1 , vˆt , Rt−1 ) n − Dt (ω n )]+ + xt . Step 2d. Compute Rtn = [Rt−1
Step 3. Let n = n + 1. If n < N , go to step 1.
Figure 12.5: An approximate dynamic programming algorithm for the multiperiod newsvendor problem function approximation. We start by writing the expected total costs from time t onward, given the information up through time t (that is, we condition on the pre-decision state variable Rt ), as x,n Ftπ (Rt−1 , Dt (ω n )) = E
=
=
( T X
) x,n n π n n n Ct0 (Rtx,n 0 −1 , Dt0 (ω ), Xt0 (Rt0 −1 , Dt0 (ω )))|Rt = (Rt−1 , Dt (ω ))
t0 =t x,n x,n Ct (Rt−1 , Dt (ω n ), Xtπ (Rt−1 , Dt (ω n ))) ) ( T X x,n n π n +E Ct0 (Rtx,n 0 −1 , Dt0 (ω ), Xt0 (Rt0 −1 , Dt0 (ω ))|Rt t0 =t+1 x,n x,n Ct (Rt−1 , Dt (ω n ), Xtπ (Rt−1 , Dt (ω n )))
π + E{Ft+1 (Rt+1 )|Rt }
(12.18)
x We would now like to compute the gradient of Ftπ (Rt ) with respect to Rt−1 for a specific n sample realization ω . We first write (12.18) for a sample realization ω as follows: x,n x,n x,n π Ftπ (Rt−1 , Dt (ω n ), ω n ) = Ct (Rt−1 , Dt (ω n ), Xtπ (Rt−1 , Dt (ω n ))) + Ft+1 (Rt+1 (ω n )) x Computing the finite difference for a unit change in Rt−1 , and using x,n x,n x Rt (Rt−1 ) = [Rt−1 − Dt (ω n )]+ + Xtπ (Rt−1 , Dt (ω n ))
CHAPTER 12. THE ASSET ACQUISITION PROBLEM
338
gives x,n x,n + 1, Dt (ω n ), ω n ) − Ftπ (Rt−1 , Dt (ω n ), ω n ) vˆtn = Ftπ (Rt−1
This can be computed recursively. Let: x,n x,n x,n + 1, Dt (ω n ))) ∆Ctπ (Rt−1 , Dt (ω n )) = Ct (Rt−1 + 1, Dt (ω n ), Xtπ (Rt−1 x,n x,n −Ct (Rt−1 , Dt (ω n ), Xtπ (Rt−1 , Dt (ω n )))
be the change in contribution if we have one more unit available. Note that this can produce a change in the amount ordered which we represent using x,n x,n x,n , Dt (ω n )) + 1, Dt (ω n )) − Xtπ (Rt−1 , Dt (ω n )) = Xtπ (Rt−1 ∆Xtπ (Rt−1 x We can now determine the change in Rtx due to a unit change in Rt−1 :
x,n x x ∆Rtx = [Rt−1 + 1 − Dt (ω n )]+ − [Rt−1 − Dt (ω n )]+ + ∆Xtπ (Rt−1 , Dt (ω n )) This allows us to write x,n n vˆtn = ∆Ctπ (Rt−1 , Dt (ω n )) + vˆt+1 ∆Rtx
(12.19)
This can be computed recursively starting with t = T (using vˆTn +1 = 0) and stepping backward in time. A complete statement of the algorithm is given in figure 12.6.
12.3
Lagged information processes
There is a much richer class of problems when we allow ourselves to make orders now for different times into the future. The difference between when we order an asset and when we can use it is called a lag, and problems that exhibit lags between when we know about them and when we can use them are referred to as lagged information processes. Time lags arise in a variety of settings. The examples illustrate this process. These are all examples of using contracts to make a commitment now (a statement of the information content) for assets that can be used in the future (when they become actionable). The “contracts” become implicit in other settings. It is possible to purchase components from a supplier in southeast Asia that may require as much as 10 weeks to be built and shipped to North America. A shipload of oil departing from the Middle East may require three weeks to arrive at a port on the Atlantic coast of the United States. A truck departing from Atlanta, Georiga will arrive in California four days later. A plane departing from New York will arrive in London in six hours. We first provide a basic model of a lagged asset acquisition process. After this, we describe two classes of value function approximations and algorithms: continuously differentiable approximations and piecewise linear, nondifferentiable approximations.
CHAPTER 12. THE ASSET ACQUISITION PROBLEM
339
Step 0. For all t choose a function form and initial parameters for the value function V¯t0 (St ) for all time periods t. Let n = 1. Step 1. Choose ω n . Step 2: Do, for t = 1, 2, . . . , T : Step 2a. Solve:
∼n Vt
n n arg max Ct (Rt−1 , Dt (ω n ), xt ) + V¯tn−1 (RM (Rt−1 , xt , ωtn ))
xnt
=
n
= arg max pt min{Rt−1 , Dt (ω n )} − cp xt + γ V¯tn−1 (Rt )
(Rt−1 , Dt (ω ))
xt ∈Xt xt
n Step 2b. Compute Rtn = [Rt−1 − Dt (ω n )]+ + xt .
Step 3: Do, for t = T, T − 1, . . . , 1 Step 3a. Compute: x,n ∆Ctπ (Rt−1 , Dt (ω n ))
x,n x,n = Ct (Rt−1 + 1, Dt (ω n ), Xtπ (Rt−1 + 1, Dt (ω n ))) x,n x,n n π − Ct (Rt−1 , Dt (ω ), Xt (Rt−1 , Dt (ω n ))) x,n x,n x,n ∆Xtπ (Rt−1 , Dt (ω n )) = Xtπ (Rt−1 + 1, Dt (ω n )) − Xtπ (Rt−1 , Dt (ω n )) x,n x x ∆Rtx = [Rt−1 + 1 − Dt (ω n )]+ − [Rt−1 − Dt (ω n )]+ + ∆Xtπ (Rt−1 , Dt (ω n ))
Step 3b. Compute: x,n n vˆtn = ∆Ctπ (Rt−1 , Dt (ω n )) + vˆt+1 ∆Rtx
Step 3c. Update the value function: n−1 n n n V¯t−1 = U V (V¯t−1 , vˆt , Rt−1 )
Step 4. Let n = n + 1. If n < N , go to step 1.
Figure 12.6: A two-pass algorithm for the multiperiod newsvendor problem Example 12.5: An orange juice products company can purchase futures contracts that allow it to buy product in later years at lower prices. Example 12.6: As a result of limited shipping capacity, shippers have to purchase space on container ships months or years in the future. Example 12.7: Airlines have to sign contracts to purchase aircraft from manufacturers in the future as a result of manufacturing bottlenecks. Example 12.8: Electric power companies have to purchase expensive equipment such as transformers for delivery 18 months in the future, but may pay a higher rate to get delivery 12 months in the future. Example 12.9: Large energy consumers will purchase energy (electricity, oil and gas) in the future to lock in lower prices.
CHAPTER 12. THE ASSET ACQUISITION PROBLEM
12.3.1
340
Modeling lagged information processes
We begin by writing the decisions we have to make using xtt0 = The quantity of assets ordered at time t (more specifically, with the information up through time t), to be used in time period t0 . The double time indices have specific roles. The first “t” represents the information content (the knowable time) while the second “t0 ” represents when the quantity in the variable can be used (the actionable time). Since the decision is made with the information that became available during time interval t, the decision has to be made at the end of the time interval. By contrast, if the product arrives during time interval t0 , we treat is as if it can be used during time interval t0 , which is as if it arrives at the beginning of the time period. The reader may wish to review the discussion of modeling time in section 3.2. The single period and multiperiod newsvendor problems which we have considered up to now used a variable xt that represented the decision made at time t (literally, with the information up through time t) for assets that could be used in time interval t + 1. We would now represent this variable as xt,t+1 . If we always order for the next time period, the double time index is a bit clumsy. With our more general model, we face a much richer set of decisions. We adopt the convention that: xt = (xtt0 )t0 ≥t which means that xt is now a vector. We note that our notation provides for the element xtt which is, literally, the assets ordered after we see the demand. This is known as purchasing on the spot market. By contrast, we would term xtt0 for t0 > t as purchasing futures. Depending on the setting, we might pay for “futures” when either we purchase them or when we receive them. Once we introduce lags, it is useful to apply this concept throughout the problem. For example, if we are able to order assets that can be used in the future, then at any time t, there will be assets that we can use now and others we know about which we cannot use until the future. We represent this state of affairs using: Rtt0 = The assets that we know about at time t that can be used in time interval t0 . Rt = (Rtt0 )t0 ≥t Time lags can also apply to costs and prices. For example, we could define: ctt0 = The cost of assets purchased at time t that can be used at time t0 . ct = (ctt0 )t0 ≥t
CHAPTER 12. THE ASSET ACQUISITION PROBLEM
341
We might even know about future demands. For this we would define: Dtt0 = Demands that become known during time interval t that need to be met during time interval t0 . Dt = (Dtt0 )t0 ≥t But for now, we are going to assume that demands have to be served as they arise, but that we can make decisions now for assets that can be used at a deterministic time in the future. Noting that Rt is now a vector, our transition function RM (Rt−1 , Wt , xt ) must be a similarly dimensioned vector-valued function. We would write: Rt,t+τ = RτM (Rt−1 , Wt , xt ) We distinguish two important cases: τ = 1, and τ > 1. These are given by: Rt,t+1 = Rt−1,t+1 + [Rt−1,t + xtt − Dt ]+ + xt,t+1 Rt,t+τ = Rt−1,t+τ + xt,t+τ τ > 1
(12.20) (12.21)
We can now write our transition equation as we did before (using Rt = RM (Rt−1 , Dt , xt )), recognizing that both sides are vectors. Assume we pay for assets when they are first purchased (rather than when they are actionable). In this case, our single period contribution function becomes: Ct (Rt−1 , xt , Dt ) = pt min{Rt−1,t + xtt , Dt (ω)} −
X
cptt0 xtt0
(12.22)
t0 ≥t
Our decision function: x∗t = arg max Ct (Rt−1 , xt , Dt ) + γ V¯t (Rt ) xt
(12.23)
is identical to what we used for the multiperiod case without lags (equation (12.16)) with the exception that xt and Rt are now vectors. This is a nice example of how notation fails to communicate notational complexity. When we have a vector, it is important to estimate how many dimensions it will have. For the problem of lagged information processes, Rt would be a vector with anywhere from a half dozen (signing futures contracts up to six years into the future) to several dozen (the number of days in advance of a shipment from one part of the world to another) or more. But it is unlikely that this issue would produce vectors with 100 or more dimensions. So, we have multiple dimensions, but not an extremely large number of dimensions (this comes later). However, as we have learned, even 10 or 20 dimensions can really explode a problem.
CHAPTER 12. THE ASSET ACQUISITION PROBLEM
12.3.2
342
Algorithms and approximations for continuously differentiable problems
We first consider algorithms that are designed for continuously differentiable, concave functions. These are the “nicest” class of functions and enjoy the vast array of algorithms from nonlinear programming. These algorithms can be divided into two broad classes: those that depend on linear approximations of the objective function (gradient algorithms) and those that depend on exact or approximate estimates of the Hessian (matrix of second derivatives) of the objective function (known as Newton algorithms). Let: Ft (xt ) = Ct (Rt−1 , xt , Dt ) + γ V¯t (RM (Rt−1 , Dt , xt ))
(12.24)
be the function we are trying to maximize. Ct is linear in xt . We assume that V¯t (Rt ) is continuously differentiable in Rt . Then, Ft (xt ) is continuously differentiable in xt as long as RM (Rt−1 , Dt , xt ) is continuously differentiable in xt . As they are written, the transition equations are not differentiable with respect to the spot purchases xtt since we may purchase more than we need which allows the purchases to spill over to the next time period (see equation (12.20)). The point at which too many spot purchases spill to the next time period is what creates the nondifferentiability. This problem is handled by splitting the spot purchases into two components: xnow = The amount of the spot purchases that must be used now. tt f uture xtt = The amount of the spot purchases that must be used in the future. xtt = xnow + xfttuture tt We can now rewrite equation (12.20) as: Rt,t+1 = Rt−1,t+1 + [Rt−1,t + xnow − Dt ]+ + xfttuture + xt,t+1 tt where xnow is constrained by: tt ! xnow − snow ≤ Dt − tt tt
Rt−1,t+1 +
X
xt00 t
t00 t
X
=
θ1tt0 Rtt0 + θ2tt0 (Rtt0 )2
(12.25)
t0 >t
Substituting (12.25) and (12.22) into (12.24) gives Ft (xt ) = pt min{Rt−1,t + xtt , Dt (ω)} −
X
cptt0 xtt0 + γ
t0 ≥t
X
θ1tt0 Rtt0 + θ2tt0 (Rtt0 )2 (12.26)
t0 >t
(12.26) is separable in xtt0 , make it possible to simply take the derivative and set it equal to zero: dV¯tt0 (Rtt0 ) dRtt0 dFt (xt ) = −cptt0 + γ dxtt0 dRtt0 dxtt0 p = −ctt0 + γ (θ1tt0 + 2θ2tt0 Rtt0 ) = 0 which gives Rtt0
1 = 2θ2tt0
1 p c 0 − θ1tt0 γ tt
From equation (12.21) we have Rt,t0 = Rt−1,t0 + xtt0 Since xtt0 ≥ 0, our solution is xtt0 = max {0, Rt,t0 − Rt−1,t0 } 1 1 p = max 0, c 0 − θ1tt0 − Rt−1,t0 2θ2tt0 γ tt
CHAPTER 12. THE ASSET ACQUISITION PROBLEM
344
Separable approximations are especially nice to work with, although the errors introduced need to be quantified. Chapter 15 demonstrates that these can work quite well on much more complex asset allocation problems. Separable approximations are also especially easy to estimate. If we use more complex, nonseparable approximations, it will be necessary to design algorithms that handle the structure of the approximation used.
12.3.3
Algorithms and approximations for nondifferentiable problems
It is possible that the value function is nondifferentiable, as would happen if our demands were discrete. In chapter 14, we consider a very elegant set of strategies known as Benders decomposition for handling this problem class. These algorithms are beyond the scope of our discussion here, so instead, we might suggest using a separable, piecewise linear approximation for the value function. We already developed a piecewise linear approximation in section 12.1.4. We can use this same approximation here by further approximating the value function as separable: V¯tn−1 (Rt ) =
X
V¯ttn−1 (Rtt0 ) 0
(12.27)
t0 ≥t
where V¯ttn−1 (Rtt0 ) = 0
X
bRtt0 c
X
a∈A
v¯ttn−1 vttn−1 0 (r) + (Rtt0 − bRtt0 c)¯ 0 (dRtt0 e)
(12.28)
r=1
V¯ttn−1 (Rtt0 ) is a piecewise linear function, where v¯ttn−1 0 0 (r) is the slope between r and r + 1. We would estimate this function just as we did in section 12.1.4. If V¯tn−1 (Rt ) is piecewise linear, then it is nondifferentiable and we have to move to a different class of functions. We can still use our gradient projection algorithm, but we have to use the same types of stepsize rules that we used when using stochastic gradients. These rules require P we nwere solvingPproblems ∞ n 2 the conditions ∞ α = ∞ and (α ) < ∞, which pushes us into rules of the general n=1 n=1 n class α = a/(b + n) (see chapter 6). A much better approach for this problem class is to solve our problem as a linear program. To do this, we need to use a common device for representing concave, piecewise linear functions. the nonlinear function V¯ (R), we introduce variables y(r) where PRmaxInstead of writing max R = r=0 y(r) and R is the largest possible value of R. We require that 0 ≤ y(r) ≤ 1. We can then write our function using:
V¯ (R) =
max R X
r=0
v¯(r)y(r)
CHAPTER 12. THE ASSET ACQUISITION PROBLEM
345
When we solve our linear program, it would be very odd if y(0) = 1, y(1) = 0, y(2) = 1, y(3) = 1, y(4) = 0. We could try to write a constraint to make sure that if y(r + 1) > 0 then y(r) = 1, but we do not have to. We will assume that v¯(0) ≥ v¯(1) ≥ . . . ≥ v¯(r). This means that we always want to maximize y(r) before allowing y(r + 1) to increase. Our only problem arises when v¯(r) = v¯(r + 1). We could handle this by defining intervals that are more than one unit (in practice, we do not use intervals of unit length anyway). But even if we did not, if v¯(r) = v¯(r + 1), then it does not matter if y(r + 1) > y(r). Using this device, we write our linear program as: " pt min{Rt−1,t + xtt , Dt (ω)} −
Ft−1 (Rt−1 ) = max xt
! X
cptt0 xtt0
t0 ≥t Rmax
+γ
X
# v¯ttn−1 (ytt0 (r))
(12.29)
y=0
subject to: ytt0 (r) ytt0 (r) xtt − stt Rtt0 − xtt0 max R X
≥ ≤ ≤ =
0 ∀t0 ≥ t, ∀r 1 ∀t0 ≥ t, ∀r Dt − Rt−1,t Rt−1,t0 t0 > t + 1
ytt0 (r) − Rtt0 = 0 ∀t0 ≥ t
(12.30) (12.31) (12.32) (12.33) (12.34)
r=0
xtt0 , stt0 ≥ 0 ∀t0 ≥ t xtt0 ≤ Utt0 ∀t0 ≥ t X xtt0 ≤ Ut
(12.35) (12.36) (12.37)
t0 ≥t
Equations (12.30) and (12.31) constrain the y(r) variables to be between 0 and 1. Equation (12.32) restricts the amount of spot purchases to be made in time period t so that there is only enough to cover the demand for time t, where stt is a slack variable. Equation (12.33) defines the actionable assets during time interval t0 , which we know about at time t, to be what we purchase at time t that is actionable during time interval t0 plus what we purchased for time interval t0 , which we knew about at time t − 1. Equation (12.34) sets up the relationship between the ytt0 (r) variables and the total flow Rtt0 . Equation (12.35) is the usual nonnegativity constraint, and equations (12.36) and (12.37) introduce possible upper bounds on the amount we can order. Note that if we did not have (12.37), the problem would decompose in the variables (xtt0 )t0 ≥t . This formulation uses a well-known device for modeling piecewise linear, concave functions. We also get something for free from our LP solver. Equations (12.32) and (12.33) represent the equations that capture the impact of the assets from the previous time period, Rt−1 , on this problem. These constraints yield the dual variables vˆtt0 , t0 ≥ t. Just as we
CHAPTER 12. THE ASSET ACQUISITION PROBLEM
346
have used them earlier, the duals vˆtt0 represent gradients of our optimization problem with respect to Rt−1,t0 . However, since our problem is not differentiable, these are subgradients, which is to say they satisfy: ¯ t−1 ) − Ft−1 (Rt−1 )) ≤ v¯t (R ¯ t−1 − Rt−1 ) (Ft−1 (R Instead of a subgradient (which we get for free from the LP package), we could use numerical derivatives to obtain, say, the true right derivative: n n vˆttn 0 = Ft−1 (Rt−1 + ett0 ) − Ft−1 (Rt−1 )
where ett0 is a suitably dimensioned vector of zeroes with a 1 in the element corresponding to (tt0 ). Finding numerical derivatives in this setting is not very expensive. The big expense is solving the linear program (although for these small problems it is quite easy – we consider much larger problems later). Once we have solved a linear program once, perturbing the right hand side and resolving is very fast, but obviously not as fast as using a dual variable (which is free). However it can, for some problems, improve the rate of convergence.
12.4
Why does it work?**
12.4.1
Proof of convergence of the optimizing version of the SPAR algorithm
As with the proof of the learning version of the SPAR algorithm (section 11.6.2), this section is designed partly to prove the convergence of the optimizing version of the SPAR algorithm, and partly to demonstrate the mathematical techniques required by the proof. The challenge in the optimizing version of the algorithm is that we no longer assume that we are going to visit each point along the curve infinitely often. The points that we visit are determined by the approximation, so we have to show that we end up visiting the optimal points infinitely often. For this reason, the steps of the proof are provided in a much greater level of detail than would normally be used in most publications. In the optimizing version of SPAR, the observation points sn , n = 0, 1, . . . are generated by solving the approximate problem xn = arg max F¯ n (x) = x∈X
n X
f¯in (xi ),
(12.38)
i=1
where each f¯in , i = 1, . . . , n is a concave, piecewise linear function, defined as (11.20) and X is a convex and closed set such that X ⊂ {x ∈ IRn : 0 ≤ xi ≤ Mi , i = 1, . . . , n}. Thus, F¯ n is a concave, piecewise linear and separable function, xni ∈ {1, . . . , Mi } and we can set sn = (xn1 , . . . , xnn ).
CHAPTER 12. THE ASSET ACQUISITION PROBLEM
347
Note that sni is now measurable with respect to F n . Also note that assumption (11.38) applied component-wise may not be satisfied. Even though, since (11.48) holds for each coordinate i, inequality (11.44) is true for each coordinate i: vin − vi )i + Ai,n+1 . vin − vi , Pin (¯ k¯ vin+1 − vi k2 ≤ k¯ vin − vi k2 − 2αn h¯
(12.39)
The matrix Pin , which is F n -measurable, is a positive diagonal matrix with entries s strictly positive if and only if the s coordinate of v¯i has a chance of being updated in the current iteration of 2, we conclude that Pthe algorithm. Proceeding exactly as in the proof of Theorem n the series ∞ A is convergent a.s. Furthermore, the sequence {k¯ v −v i k} is convergent i n=0 i,n+1 a.s., for every i = 1, . . . , n. Thus, the SPAR-Optimization and the SPAR-Exploration version only differ by the way s are obtained. Our aim is to prove that even without assumption (11.38), the sequence {sn } generated by SPAR-Optimization converges to an optimal solution of n
max F (x) = x∈X
n X
fi (xi ),
(12.40)
i=1
provided a certain stability condition is satisfied. Before we state our theorem, we should recall some definitions and remarks. 1. The subdifferential of a function h at x, denoted by ∂h(x), is the set of all subgradients of g at x. 2. The normal cone to a convex set C ⊂ IRn at x, denoted by NC (x), is the set NC (x) = {d ∈ IRn : dT (y − x) ≤ 0, y ∈ C}. 3. (Bertsekas et al., 2003, page 257) Let h : IRn → IR be a concave function. A vector x∗ maximizes h over a convex set C ⊂ IRn if and only if 0 ∈ ∂h(x∗ ) − NC (x∗ ).
(12.41)
4. An optimal point is called stable if it satisfies 0 ∈ int [∂h(x∗ ) − NC (x∗ )].
(12.42)
5. A stable point is also a solution to a perturbed problem ˜ max h(x), x∈C
˜ ∗ )) < and is a sufficiently small positive number. provided that dist(∂h(x∗ ), ∂ h(x
CHAPTER 12. THE ASSET ACQUISITION PROBLEM
348
6. Every closed and bounded set is compact. Every sequence in a compact set has a subsequence that converges to some point of the set. This point is called an accumulation point. Applying these concepts to our problem, we have that the subdifferential of F¯ n at an integer point xn is given by n n n n n n ¯2,x vn,x ¯n,x ¯1,x v2,x ∂ F¯ n (xn ) = [¯ v1,x n ] × · · · × [¯ n ] × [¯ n +1 , v n +1 , v n +1 , v n ]. n n 2 1 2 1
Furthermore, as V n × X is closed and bounded, hence compact, the sequence (¯ v n , xn ) ∈ n V × X generated by the algorithm SPAR-Optimization has accumulation points (¯ v ∗ , x∗ ). Also, as F¯ n is concave and X is convex, the solution xn of (12.38), as it is optimal, satisfies 0 ∈ ∂ F¯ n (xn ) − NX (xn ). Then, by passing to the limit, we can conclude that each accumulation point (¯ v ∗ , x∗ ) of the sequence {(¯ v n , xn )} satisfies the condition 0 ∈ ∂ F¯ ∗ (x∗ ) − NX (x∗ ).
We will also need the following lemma: Lemma 12.4.1 Assume that for each i = 1, . . . , n the conditions (11.35)–(11.37) Pm−1 (11.24), j+1 and (11.45)–(11.46) are satisfied. Define R0 = 0 and Rm = j=0 αj (kgi k−IE{kgij+1 k|Fj }), m = 1, 2, . . . . Then {Rm } is a F-martingale that converges to a finite random variable R∞ almost surely. Proof: The first martingale property is easily verified. In order to get the second one, note that Rm − Rm−1 = αm−1 (kgim k − IE{kgim k|Fm−1 }). Then, Fm−1 -mble
z
}|
{
IE{αm−1 (kgim k − IE{kgim k|Fm−1 })|Fm−1 } αm−1 (IE{kgim k|Fm−1 } − IE{kgim k|Fm−1 })
IE{Rm − Rm−1 |Fm−1 } = = = 0.
To obtain the third property, recall that 2 Rm = (Rm−1 + αm−1 (kgim k − IE{kgim k|Fm−1 }))2 .
CHAPTER 12. THE ASSET ACQUISITION PROBLEM
349
Thus, taking expectations yields 2 2 IE{Rm |Fm−1 } = Rm−1 =IE{Rm −Rm−1 |Fm−1 }=0 }| { z + 2Rm−1 IE{αm−1 (kgim k − IE{kgim k|Fm−1 })|Fm−1 } 2 + IE{αm−1 (kgim k − IE{kgim k|Fm−1 })2 |Fm−1 } 2 2 = Rm−1 + αm−1 IE{(kgim k2 |Fm−1 } 2 IE{kgim kIE{kgim k|Fm−1 }|Fm−1 } − 2αm−1 2 + αm−1 (IE{kgim k|Fm−1 })2 2 2 2 = Rm−1 + αm−1 IE{(kgim k2 |Fm−1 } − αm−1 (IE{kgim k|Fm−1 })2 2 2 ≤ Rm−1 + αm−1 IE{(kgim k2 |Fm−1 }.
(12.43)
Also, note that m−1 2 m−1 2 m kgim k2 = (−ˆ vim + v¯is ) 1{sm−1 vi+ + v¯is ) 1{sm−1 =s} + (−ˆ =s−1} i i m ≤ (−ˆ vim + B)2 1{sm−1 vi+ + B)2 1{sm−1 =s} + (−ˆ =s−1} i i m ≤ (−ˆ vim + B)2 + (−ˆ vi+ + B)2 .
Hence, m IE{(kgim k2 |Fm−1 } ≤ IE{(−ˆ vim + B)2 |Fm−1 } + IE{(−ˆ vi+ + B)2 |Fm−1 },
and since (11.24),(11.35),(11.45) and (11.46) hold, following the same steps as in (11.40) and (11.41) for the last two expectations, we know that there exists a constant C5 such that IE{(kgim k2 |Fm−1 } ≤ C5 . Therefore, 2 2 2 IE{Rm |Fm−1 } ≤ Rm−1 + C5 αm−1 .
Finally, taking the expected value we obtain 2 2 2 } + C5 IE{αm−1 } IE{Rm } ≤ IE{Rm−1 2 2 2 } + C5 IE{αm−2 } + C5 IE{αm−1 } ≤ IE{Rm−2 .. .
≤
IE{R02 }
+ C5
m−1 X
IE{(αn )2 }
n=0
≤ C6 < ∞, since IE{R02 } < ∞ and by (11.37).
CHAPTER 12. THE ASSET ACQUISITION PROBLEM
350
Therefore, {Rm } is bounded in L2 , and thus bounded in L1 . This means we have checked all three martingale properties and {Rm } is a F-martingale. Also, the L2 -Bounded Martina.s. gale Convergence Theorem tells us that Rm −−→ R∞ , where R∞ < ∞. Now we are ready to state and prove the main result of this section. Theorem 3 Assume that for each i = 1, . . . , n the conditions (11.24), (11.35)–(11.37) and (11.45)–(11.46) are satisfied. If an accumulation point (¯ v ∗ , x∗ ) of the sequence {(¯ v n , xn )} generated by the algorithm, satisfies the stability condition: 0 ∈ int [∂ F¯ ∗ (x∗ ) − NX (x∗ )],
(12.44)
then, with probability one x∗ , is an optimal solution of (12.40). Proof: Let us fix ω ∈ Ω and consider a convergent subsequence {(¯ v n (ω), xn (ω))}, along ∗ ∗ N (ω) ⊆ IN. Let us denote by (¯ v , x ) the limit of this subsequence. This limit depends on ω too, but we shall omit the argument ω to simplify notation. From the stability condition, there exists > 0 such that for all iterations n for which n ∗ |¯ vi,x ¯i,x ∗ (ω) − v ∗ | ≤ , i i
i = 1, . . . , n,
(12.45)
the solution xn of the approximate problem (12.38) is equal to x∗ , by remark 5. Then, the coefficients pni,s are equal to 1 for s = x∗i and s = x∗i + 1, and are zero otherwise, for each i, as x∗i and x∗i + 1 are the coordinates of v¯i that will be updated in the current iteration. Thus, for a fixed i, if we focus our attention on the points s = x∗i :
−2α
n
h¯ vin (ω)
−
vi , Pin (¯ vin (ω)
− vi )i = −2α
n
M X
n pnis (¯ vis (ω) − vis )2
s=1 n 2 n 2 = −2αn [(¯ vi,x vi,x ∗ (ω) − vi,x∗ ) + (¯ ∗ +1 (ω) − vi,x∗ +1 ) ] i i i i n 2 ≤ −2αn (¯ vi,x ∗ (ω) − vi,x∗ ) . i i
Thus, the last inequality together with (12.39) yields n 2 k¯ vin+1 (ω) − vi k2 ≤ k¯ vin (ω) − vi k2 − 2αn (¯ vi,x ∗ (ω) − vi,x∗ ) + Ai,n+1 (ω). i i
(12.46)
n ∗ Let n ∈ N (ω) be large enough so that |¯ vi,x vi,x ¯n converges ∗ (ω)−¯ ∗ | < /2. This n exists as v i i to v¯∗ . Consider j ≥ n such that j ∗ |¯ vi,x ¯i,x for all i = 1, . . . , n. ∗| ≤ ∗ (ω) − v i i
(12.47)
CHAPTER 12. THE ASSET ACQUISITION PROBLEM
351
Let us suppose that the x∗i th coordinate of the accumulation point is not optimal, i.e., ∗ v¯i,x ∗ 6= vi,x∗ . i i
(12.48)
We shall prove that it leads to a contradiction. We will again divide the rest of the proof in several parts. Part 1 shows that the set of consecutive j ≥ n for which condition (12.47) holds is finite. Part 1: From assumption (12.48), we can always choose a sufficiently small > 0 such that ∗ |¯ vi,x ∗ − vi,x∗ | > 2. Then, for the iterations j satisfying (12.47), we have i i j j ∗ ∗ ¯i,x |¯ vi,x vi,x ¯i,x ∗ − vi,x∗ | ∗ + v ∗ (ω) − vi,x∗ | = |¯ ∗ (ω) − v i i i i i i j ∗ ∗ ≥ |¯ vi,x vi,x ¯i,x ∗ − vi,x∗ | − |¯ ∗| ∗ (ω) − v i | i {z i} | i {z } >2
≤
> . Combining the previous inequality with the fact that inequality (12.46) holds true yields vij (ω) − vi k2 − 2αj (ω)2 + Ai,j+1 (ω). k¯ vij+1 (ω) − vi k2 ≤ k¯
(12.49)
If the set of consecutive j ≥ n for which condition (12.47) holds was infinite, then the previous inequality holds for all j ≥ n and we can write ∞ > k¯ vij (ω) − vi k2 ≥ k¯ vin+1 (ω) − vi k2 + 2αn (ω)2 − Ai,n+1 (ω) ≥ k¯ vin+2 (ω) − vi k2 + 2
n+1 X
αm (ω)2 −
n+1 X
Ai,m+1 (ω)
m=n
m=n
.. .
(12.50) ≥ lim j→∞ |
k¯ vij (ω) {z
} ∗ (ω) − v i i 1≤i≤n
is finite. We shall prove that the sum of stepsizes between n ∈ N (ω) and l(n, , ω) − 1 is at least of order , if n is large enough. We will accomplish this in the following part.
Part 2: By the definition of $l(n, \epsilon, \omega)$ we have $|\bar{v}^{l(n,\epsilon,\omega)}_{i,x^*_i}(\omega) - \bar{v}^*_{i,x^*_i}| > \epsilon$ for some $i$. Thus,

\|\bar{v}^{l(n,\epsilon,\omega)}_i(\omega) - \bar{v}^*_i\| = \sqrt{\sum_{s=1}^{M} |\bar{v}^{l(n,\epsilon,\omega)}_{i,s}(\omega) - \bar{v}^*_{i,s}|^2} \ge |\bar{v}^{l(n,\epsilon,\omega)}_{i,x^*_i}(\omega) - \bar{v}^*_{i,x^*_i}| > \epsilon.

Since $\bar{v}^n(\omega) \to \bar{v}^*$, $n \in N(\omega)$, for all sufficiently large $n \in N(\omega)$ we also have $|\bar{v}^n_{i,x^*_i}(\omega) - \bar{v}^*_{i,x^*_i}| < \epsilon/2$. Hence,

\|\bar{v}^{l(n,\epsilon,\omega)}_i(\omega) - \bar{v}^n_i(\omega)\| \ge \|\bar{v}^{l(n,\epsilon,\omega)}_i(\omega) - \bar{v}^*_i\| - \|\bar{v}^*_i - \bar{v}^n_i(\omega)\| > \epsilon/2.

… by solving the approximation

x^n_t = \arg\max_{x_t} \left( p_t \min\{R_{t-1,t}, D_t(\omega^n)\} - \sum_{t' \ge t} c^p_{tt'} x_{tt'} + \bar{V}^{n-1}_t(R_t) \right)

where $\bar{V}^{n-1}_t(R_t)$ is an approximation of the value function. Recalling that $\bar{V}^{n-1}_t$ defines a policy that we designate by $\pi^n$, let

F^\pi_t(R_{t-1}, D_t(\omega^n)) = \sum_{t'=t}^{T} C_{t'}\big(R_{t'-1}, D_{t'}(\omega^n), X^\pi_{t'}(R_{t'-1}, D_{t'}(\omega^n))\big)

be the value of the policy for sample path $\omega^n$ given $R_{t-1}$ and $D_t(\omega^n)$.

a) Show how to compute a gradient $\hat{v}^n_{tt'}$ of $F^\pi_t(R_{t-1}, D_t(\omega^n))$ with respect to $R_{t-1,t'}$, given $D_t(\omega^n)$. Your method should not depend on any functional form for $\bar{V}^{n-1}_t(R_t)$.

b) Assume now that $\bar{V}_t(R_t) = \sum_{t' > t} \bar{v}_{tt'} R_{tt'}$. Show that the gradient $\hat{v}^n_{tt'} = \hat{v}^n_{t'}$, which is to say that the gradient depends only on the actionable time $t'$.
Chapter 13

Batch replenishment processes

We often encounter processes in which the supply of an asset is consumed over time (usually in random increments) and followed by replenishments to restore the supply. The need to periodically replenish is driven by a bit of economics unique to this problem class: we assume that the cost of acquiring additional assets is concave with respect to the quantity. In this chapter, we focus on problems where the concavity arises because we have to pay a fixed cost to place an order, after which we pay a fixed unit cost. However, most of the work in this chapter will apply to general concave functions which might reflect discounts for ordering larger amounts.

Replenishment processes come in two flavors: negative drift (assets are consumed by an exogenous demand) with positive replenishment, and positive drift (assets arrive according to an exogenous process) followed by batch depletion (see examples). We use the term "replenishment" for problems where there is a negative drift (an exogenous process that consumes assets) that requires periodic replenishments, but the same basic dynamics arise when there is a positive drift (an exogenous process governing the accumulation of assets) that requires a periodic clearing of assets. The problems are not equivalent (that is, one is not simply the negative of the other), but, from a computational perspective, the issues are quite similar.
13.1 A positive accumulation problem
Positive accumulation problems arise when resources arrive to a system as a result of an exogenous process. Examples include customers returning cars to a car rental agency who then wait for a bus to return to the airport; money generated from dividend income in a stock fund that has to be reinvested; shipments arriving at a loading dock waiting for a truck to move them to another location; and complex equipment (jet engines, electric power transformers) that accumulates particles in the oil indicating that the component is degenerating and may require maintenance or repair.
Example 13.1: A software company continually updates its software product. From time to time, it ships a new release to its customer base. Most of the costs of shipping an update, which include preparing announcements, posting software on the internet, printing new CDs, and preparing manuals summarizing the new features, are relatively independent of how many changes have been made to the software.

Example 13.2: E-Z Pass is an automated toll collection system. Users provide a credit card, and the system deducts $25 to provide an initial account balance. This balance is reduced each time the traveler passes through one of the automated toll booths. When the balance goes below a minimum level, another $25 is charged to the credit card to restore the balance.

Example 13.3: Shipments accumulate at a freight dock where they are loaded onto an outbound truck. Periodically, the truck fills and it is dispatched, but sometimes it is dispatched before it is full to avoid excessive service penalties for holding the shipments.

Example 13.4: An oil company monitors its total oil reserves, which are constantly drawn down. Periodically, it acquires new reserves either through exploration or by purchasing known reserves owned by another company.
We use the context of customers (people, freight) arriving to a queue that have to be moved by a vehicle of finite capacity.
13.1.1 The model
Our model uses the following parameters:

c^r = The fixed cost of a replenishment.
c^h = Penalty per time period for holding a unit of freight.
K   = Maximum size of a batch.

Our exogenous information process consists of

A_t = Quantity of new arrivals during time interval t.

Our (post-decision) state variable is

R_t = Assets remaining at the end of time period t.

There are two decisions we have to make. The first is whether to dispatch a vehicle, and the second is how many customers to put on the vehicle. For this problem, once we make the decision to dispatch a vehicle, we are going to put as many customers as we can onto the
vehicle, so the "decision" of how many customers to put onto the vehicle seems unnecessary. It becomes more important when we consider multiple customer types later in the chapter. For consistency with this section, we define:

z_t = \begin{cases} 1 & \text{if a batch is sent at time } t \\ 0 & \text{otherwise} \end{cases}

x_t = The number of customers to put on the truck.

In theory, we might be able to put a large number of customers on the vehicle, but we may face a nonlinear cost that makes this suboptimal. For the moment, we are going to assume that we always want to put as many as we can, so we set:

x_t = \min\{z_t K, z_t (R_{t-1} + A_t)\}

X^\pi(R_{t-1}, A_t) = The decision function that returns z_t and x_t given R_t = R_{t-1} + A_t.
The transition function is described using

R_t = R_{t-1} + A_t - x_t \qquad (13.1)

The objective function is modeled using

C_t(R_{t-1}, A_t, z_t, x_t) = The cost incurred in period t, given state R_t and dispatch decision x_t
                            = c^r z_t + c^h R_t(x_t) \qquad (13.2)

Our problem is to find the policy X^\pi_t(R_t) that solves

\min_{\pi} \mathbb{E}\left\{ \sum_{t=0}^{T} C_t(R_{t-1}, A_t, X^\pi_t(R_{t-1})) \right\} \qquad (13.3)
If we are managing a single asset class, then Rt and xt are scalars and the problem can be solved using standard backward dynamic programming techniques of the sort that were presented in chapter 4 (assuming that we have a probability model for the demand). In practice, many problems involve multiple asset classes, which makes standard techniques impractical. But we can use this simple problem to study the structure of the problem.
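To make the model concrete, here is a minimal Python sketch of the dynamics (13.1) and the costs (13.2). The Poisson arrival process, the parameter values and the simple threshold dispatch rule are illustrative assumptions made for this sketch; they are not part of the model above.

import math, random

def sample_poisson(rng, lam):
    # Knuth's method; adequate for the small arrival rates used here.
    L, k, p = math.exp(-lam), 0, 1.0
    while p > L:
        k += 1
        p *= rng.random()
    return k - 1

def simulate(policy, T=100, c_r=100.0, c_h=2.0, K=20, arrival_rate=5.0, seed=0):
    """Roll the transition R_t = R_{t-1} + A_t - x_t forward and add up (13.2)."""
    rng = random.Random(seed)
    R, total_cost = 0, 0.0
    for t in range(T):
        A = sample_poisson(rng, arrival_rate)   # exogenous arrivals A_t
        z = policy(R, A)                        # dispatch decision z_t in {0, 1}
        x = min(z * K, z * (R + A))             # load as many customers as possible
        R = R + A - x                           # transition function (13.1)
        total_cost += c_r * z + c_h * R         # period cost (13.2)
    return total_cost

# Example policy: dispatch only when a full load is waiting.
print(simulate(lambda R, A: 1 if R + A >= 20 else 0))

Any policy that maps (R_{t-1}, A_t) to a dispatch decision can be plugged into this simulator, which is how the approximations developed below can be compared.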
13.1.2 Properties of the value function
Recognizing that R_t is a post-decision state variable, the value function V_t(R_t) is given by the standard equation

V_{t-1}(R_{t-1}) = \mathbb{E}\Big\{ \max_{z_t \in \{0,1\}} \big[ C_t(R_{t-1}, A_t, z_t, x_t) + \gamma V_t\big(R^M(R_{t-1}, A_t, x_t)\big) \big] \,\Big|\, R_{t-1} \Big\} \qquad (13.4)
[Plot: expected cost (vertical axis) versus the number of customers at time 0 (horizontal axis) for the optimal policy.]
Figure 13.1: Shape of the value function for the positive-drift batch replenishment problem, from Papadaki & Powell (2003).

The value function is illustrated in figure 13.1. It turns out it has a few nice properties which can be proven. The first is that it increases monotonically (rather, it never decreases). The second is that it is concave over the range R \in (nK, (n+1)K - 1) for n = 0, 1, \ldots. The third is that the function is K-convex, which means that it satisfies

V(R^+ + K) - V(R^- + K) \ge V(R^+) - V(R^-) \qquad (13.5)

For example, if we have a vehicle that can take 20 customers, K-convexity means that V(25) - V(5) \le V(35) - V(15). The value function is convex when measured on a lattice K units apart. We can exploit these properties in the design of an approximation strategy.
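These properties are easy to test numerically on a tabulated value function. The small utility below is a sketch; the tolerance is an arbitrary choice and the function V is whatever table of values one has computed.

def is_monotone(V):
    """Check that the tabulated values V[0], V[1], ... never decrease."""
    return all(V[r + 1] >= V[r] - 1e-9 for r in range(len(V) - 1))

def is_K_convex(V, K):
    """Check V(R+ + K) - V(R- + K) >= V(R+) - V(R-) for all R- <= R+ (13.5)."""
    n = len(V)
    for r_minus in range(n - K):
        for r_plus in range(r_minus, n - K):
            if V[r_plus + K] - V[r_minus + K] < V[r_plus] - V[r_minus] - 1e-9:
                return False
    return True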
13.1.3 Approximating the value function
While finding an optimal policy is nice, we are more interested in obtaining good quality solutions using methods that are scalable to more complex problems. The most important property of the function is that it rises monotonically, suggesting that a linear approximation is likely to work reasonably well. This means that our value function approximation will look like

\bar{V}_t(R_t) = \bar{v}_t R_t \qquad (13.6)
If we replace the expectation in (13.4) and use the linear approximation we obtain

\tilde{V}^n_{t-1}(R_{t-1}, A_t(\omega^n)) = \max_{z_t \in \{0,1\}} \Big( C_t(R_{t-1}, A_t(\omega^n), z_t, x_t) + \bar{v}_t R_t(R_{t-1}, A_t(\omega^n), x_t) \Big) \qquad (13.7)

Solving this is quite easy, since we only have to try two values of z_t. We can get an estimate of the slope of the function using

\hat{v}^n_t = \tilde{V}^n_{t-1}(R_{t-1} + 1, \omega^n) - \tilde{V}^n_{t-1}(R_{t-1}, \omega^n)

which we then smooth to obtain our estimate of \bar{v}^n_{t-1}:

\bar{v}^n_{t-1} = (1 - \alpha_n)\bar{v}^{n-1}_{t-1} + \alpha_n \hat{v}^n_t
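A minimal sketch of this training loop is shown below, written for the cost-minimizing form of the problem. The arrival distribution, the parameter values, the 1/n stepsize rule and the simplified time indexing are assumptions made for illustration; the essential steps are the two-way comparison of z_t, the numerical derivative, and the smoothing of the slope.

import random

T, K, c_r, c_h = 50, 20, 100.0, 2.0
v_bar = [0.0] * (T + 1)              # one slope per time period; v_bar[T] stays 0

def cost_to_go(R_prev, A, t):
    """Approximate cost for supply R_prev + A, using slope v_bar[t] downstream."""
    best = None
    for z in (0, 1):                 # only two dispatch decisions to try
        x = min(z * K, z * (R_prev + A))
        R_next = R_prev + A - x
        c = c_r * z + c_h * R_next + v_bar[t] * R_next
        if best is None or c < best[0]:
            best = (c, R_next)
    return best                      # (approximate cost, post-decision state)

def train(iterations=200, seed=1):
    rng = random.Random(seed)
    for n in range(1, iterations + 1):
        alpha = 1.0 / n              # stepsize alpha_n; one simple choice
        R = 0
        for t in range(T):
            A = rng.randint(0, 10)   # sampled arrivals A_t(omega^n)
            c_plus, _ = cost_to_go(R + 1, A, t + 1)
            c_base, R_next = cost_to_go(R, A, t + 1)
            v_hat = c_plus - c_base  # numerical derivative of the slope
            v_bar[t] = (1 - alpha) * v_bar[t] + alpha * v_hat   # smoothing step
            R = R_next
    return v_bar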
Another strategy is to recognize that the most important part of the curve corresponds to values 0 \le R \le K, over which the function is concave. We can use the techniques of chapter 11 to produce a concave approximation which would more accurately capture the function in this region.

We can compare these approximations to the optimal solution for the scalar case, since this is one of the few problems where we can obtain an optimal solution. The results are shown in figure 13.2 for both the linear and nonlinear (concave) approximations. Note that the linear case works better with fewer iterations. It is easier to estimate a single slope rather than an entire piecewise linear function. As we run more iterations, the nonlinear function works better. For some large scale problems, it may be impossible to run hundreds of iterations (or even dozens). For such applications, a linear approximation is best.
[Two panels, Deterministic and Stochastic: percentage error from optimal versus iterations for the concave and linear approximations.]
Figure 13.2: Percent error produced by linear and nonlinear approximations as a function of the training iterations, from Papadaki & Powell (2003).
13.1.4 Solving a multiclass problem using linear approximations
Now assume that we have different types of assets arriving to our queue. Using our standard notation, let R_{ta} be the quantity of assets of type a at time t (we continue to use a post-decision state variable). We are going to assume that our attribute space is not too big (dozens, hundreds, but not thousands). For this problem, we still have a scalar decision variable z_t that indicates whether we are dispatching the vehicle or not. However, now we have a nontrivial problem of determining how many customers of each type to put on the vehicle. Let:

R_{ta} = The number of customers of type a at the end of period t.
A_{ta} = The number of arrivals of customers of type a during time interval t.
x_{ta} = The number of customers of type a that we can put on the vehicle.

We let R_t, A_t and x_t be the corresponding vectors over all the customer types. We require that \sum_{a \in A} x_{ta} \le K. The transition equation is now given by

R_{t,a} = R_{t-1,a} + A_{ta} - x_{ta}

For this problem, nonlinear value functions become computationally much more difficult. Our linear value function now looks like

\bar{V}_t(R_t) = \sum_{a \in A} \bar{v}_{ta} R_{ta}
which means that we are solving, at each point in time,

\tilde{V}_{t-1}(R_{t-1}, A_t(\omega^n)) = \max_{z_t \in \{0,1\}} \Big( C_t(R_{t-1}, A_t(\omega^n), z_t, x_t) + \sum_{a \in A} \bar{v}_{ta} R_{ta}(R_{t-1}, A_t(\omega^n), x_t) \Big) \qquad (13.8)

This is characterized by the parameters (\bar{v}_{ta})_{a \in A}, which means we have to estimate one parameter per customer type. We can estimate this parameter by computing a numerical derivative for each customer type. Let e_{ta} be a |A|-dimensional vector of zeroes with a 1 in the ath element. Then compute

\hat{v}_{ta} = \tilde{V}_{t-1}(R_{t-1} + e_{ta}, A_t(\omega^n)) - \tilde{V}_{t-1}(R_{t-1}, A_t(\omega^n)) \qquad (13.9)

Equation (13.9) is a sample estimate of the slope, which we then smooth to obtain an updated estimate of the slope:

\bar{v}^n_{ta} = (1 - \alpha_n)\bar{v}^{n-1}_{ta} + \alpha_n \hat{v}^n_{ta} \qquad (13.10)
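The per-type smoothing step (13.10), together with the greedy loading rule discussed in the next paragraph, can be sketched as follows. The customer types, their values and the capacity in the example are made up for illustration.

def load_greedily(waiting, K, value_of_type):
    """Fill the vehicle with the most valuable customer types first, up to capacity K."""
    x, room = {}, K
    for a in sorted(waiting, key=value_of_type, reverse=True):
        x[a] = min(waiting[a], room)
        room -= x[a]
    return x

def update_slope(v_bar_ta, v_hat_ta, alpha):
    """Equation (13.10): smooth the sampled derivative into the slope estimate."""
    return (1.0 - alpha) * v_bar_ta + alpha * v_hat_ta

# Example: two customer types and a vehicle of capacity 20.
waiting = {"high": 15, "low": 12}
print(load_greedily(waiting, 20, value_of_type=lambda a: {"high": 9.0, "low": 3.0}[a]))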
Computing \tilde{V}_{t-1}(R_{t-1} + e_{ta}, A_t(\omega^n)) for each product type a can be computationally burdensome in applications with large numbers of product types. It can be approximated by assuming that the decision variable z_t does not change when we increment the resource vector.

Solving equation (13.8) requires that we determine both z_t and x_t. We only have two values of z_t to evaluate, so this is not too hard, but how do we determine the vector x_t? We need to know something about how the customers differ. Assume that our customers are differentiated by their "value," where the holding cost is a function of value. Given this, and given the linear value function approximation, it is not surprising that it is optimal to put as many of the most valuable customers into the vehicle as we can, and then move to the second most valuable, and so on, until we fill up the vehicle. This is how we determine the vector x_t. Given this method of determining x_t, finding the best value of z_t is not very difficult, since we simply have to compute the cost for z_t = 0 and z_t = 1.

The next question is: how well does it work? We saw that it worked quite well for the case of a single customer type. With multiple customer types, we are no longer able to find the optimal solution. The classic "curse of dimensionality" catches up with us, since we are not able to enumerate all possibilities of the resource vector R_t. As an alternative, we can compare against a sensible dispatch policy. Assume that we are going to dispatch the vehicle whenever it is full, but we will never hold it more than a maximum time \tau. Further assume that we are going to test a number of values of \tau and find the one that minimizes total costs. We call this an optimized "dispatch-when-full" (DWF) policy (a bit misleading, since the limit on the holding time means that we may be dispatching the vehicle before it is full).

When testing the policy, it is important to control the relationship of the holding cost c^h to the average dispatch cost per customer, c^r/K. For c^h < c^r/K, the best strategy tends to be to hold the vehicle until it is full. If c^h > c^r/K, the best strategy will often be to limit how long the vehicle is held. When c^h ≈ c^r/K, the strategy gets more complicated.

A series of simulations (reported in Papadaki & Powell (2003)) were run on datasets with two types of customer arrival patterns: periodic, where the arrivals varied according to a fixed cycle (a period of high arrival rates followed by a period of low arrival rates), and aperiodic. The results are given in table 13.1 for both the single and multiproduct problems, where the results are expressed as a fraction of the costs returned by our optimized DWF policy. The results show that the linear approximation always outperforms the DWF policy, even for the case c^h < c^r/K where DWF should be nearly optimal. The multiproduct results are very close to the single product results. The computational complexity of the value function approximation for each forward pass is almost the same as DWF. If it is possible to estimate the value functions off-line, then there is very little additional burden for using an ADP policy.
Method                         linear   linear   linear   linear     linear   linear   linear   linear
                               scalar   scalar   scalar   scalar     mult.    mult.    mult.    mult.
Iterations                     (25)     (50)     (100)    (200)      (25)     (50)     (100)    (200)
c^h > c^r/K   periodic         0.602    0.597    0.591    0.592      0.633    0.626    0.619    0.619
              aperiodic        0.655    0.642    0.639    0.635      0.668    0.660    0.654    0.650
c^h ≈ c^r/K   periodic         0.822    0.815    0.809    0.809      0.850    0.839    0.835    0.835
              aperiodic        0.891    0.873    0.863    0.863      0.909    0.893    0.883    0.881
c^h < c^r/K   periodic         0.966    0.962    0.957    0.956      0.977    0.968    0.965    0.964
              aperiodic        0.976    0.964    0.960    0.959      0.985    0.976    0.971    0.969
Average                        0.819    0.809    0.803    0.802      0.837    0.827    0.821    0.820

Table 13.1: Costs returned by the value function approximation as a fraction of the costs returned by a dispatch-when-full policy.
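The DWF benchmark used above can be sketched as a small stateful policy. How the holding clock is tracked and reset is one reasonable reading of the description in the text, not a specification taken from it.

class DispatchWhenFull:
    """Dispatch when a full load is waiting, or when customers have been held tau periods."""
    def __init__(self, K, tau):
        self.K, self.tau, self.held = K, tau, 0

    def decide(self, R_prev, A):
        waiting = R_prev + A
        if waiting >= self.K or (waiting > 0 and self.held >= self.tau):
            self.held = 0
            return 1            # z_t = 1: dispatch
        if waiting > 0:
            self.held += 1      # customers wait one more period
        return 0                # z_t = 0: hold

Coupled with a simulator such as the one sketched in section 13.1.1, total costs can be computed for a grid of values of tau and the best value kept, which is what "optimized DWF" refers to.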
13.2 Monotone policies
One of the most dramatic success stories from the study of Markov decision processes has been the identification of the structure of optimal policies. An example is what are known as monotone policies. Simply stated, a monotone policy is one where the decision gets bigger as the state gets bigger, or the decision gets smaller as the state gets bigger (see examples).

Example 13.5: A software company must decide when to ship the next release of its operating system. Let S_t be the total investment in the current version of the software. Let x_t = 1 denote the decision to ship the release in time period t, while x_t = 0 means to keep investing in the system. The company adopts the rule that x_t = 1 if S_t \ge \bar{S}. Thus, as S_t gets bigger, x_t gets bigger (this is true even though x_t is equal to zero or one).

Example 13.6: An oil company maintains stocks of oil reserves to supply its refineries for making gasoline. A supertanker comes from the Middle East each month, and the company can purchase different quantities from this shipment. Let S_t be the current inventory. The policy of the company is to order x_t = Q - S_t if S_t < R, where R is the reorder point and Q is the "order up to" limit. The bigger S_t is, the less the company orders.

Example 13.7: A mutual fund has to decide when to sell its holding in a company. Its policy is to sell the stock when the price \hat{p}_t is greater than a particular limit \bar{p}.
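For concreteness, the threshold rules in examples 13.5 and 13.6 can be written as one-line functions; the parameter names are simply the symbols used in the examples.

def ship_release(S_t, S_bar):
    # Example 13.5: ship (x_t = 1) once the accumulated investment reaches S_bar.
    return 1 if S_t >= S_bar else 0

def order_up_to(S_t, R, Q):
    # Example 13.6: order up to Q whenever the inventory falls below the reorder point R.
    return Q - S_t if S_t < R else 0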
In each example, the decision of what to do in each state is replaced by a function that determines the decision. The function depends on the choice of one or two parameters. So, instead of determining the right action for each possible state, we only have to determine the parameters that characterize the function. Interestingly, we do not need dynamic programming for this. Instead, we use dynamic programming to determine the structure of the
optimal policy. This is a purely theoretical question, so the computational limitations of (discrete) dynamic programming are not relevant. The study of monotone policies is included partly because it is an important part of the field of dynamic programming. It is also useful in the study of approximate dynamic programming because it yields properties of the value function. For example, in the process of showing that a policy is monotone, we also need to show that the value function itself is monotone (that is, increases or decreases with the resource state). To demonstrate the analysis of a monotone policy, we consider a classic batch replenishment policy that arises when there is a random accumulation that is then released in batches. Examples include dispatching elevators or trucks, moving oil inventories away from producing fields in tankers, and moving trainloads of grain from grain elevators.
13.2.1 Submodularity and other stories
In the realm of optimization problems over a continuous set, it is important to know a variety of properties about the objective function (such as convexity/concavity, continuity and boundedness). Similarly, discrete problems require an understanding of the nature of the functions we are maximizing, but there is a different set of conditions that we need to establish. One of the most important properties that we will need is supermodularity. Interestingly, different authors define supermodularity differently (although not inconsistently). We assume we are studying a function g(u), u \in U, where U \subseteq …

… as if they were unknown. It just means that each time we take a random sample of new information (step 2a in figure 15.5), we always sample the same information. We can also put the entire problem (over all time periods) into a linear
programming package to obtain the optimal solution. Experiments (reported in Godfrey & Powell (2002a)) were run on problems with 20, 40 and 80 locations, and 15, 30 and 60 time periods. The results are reported in table 15.1 as percentages of the optimal solution produced by the linear programming algorithm.

Figure 15.9: Illustration of a pure network for time-staged, single commodity flow problems.

                  Simulation horizon
Locations        15          30          60
   20         100.00%     100.00%     100.00%
   40         100.00%      99.99%     100.00%
   80          99.99%     100.00%      99.99%

Table 15.1: Percentage of the optimal solution.

It is not hard to see that the results are very near optimal. We know that separable, piecewise linear approximations do not produce provably optimal solutions (even in the limit) for this problem class, but it appears that the error is extremely small. We have to note, however, that a good commercial linear programming code is much faster than iteratively estimating value function approximations.

We now consider what happens when the demands are truly stochastic, which is to say that we obtain different numbers each time we sample information. For this problem, we do not have an optimal solution. Although this problem is relatively small (compared to true industrial applications), formulated as a Markov decision process it produces state spaces that are far larger than anything we could hope to solve. We can use Benders decomposition (section 14.6) for multistage problems, but experiments have shown that the rate of convergence is so slow that it does not produce reasonable results. Instead, experimental research has found that the best results are obtained using a rolling horizon procedure where at each time t, we combine the demands that are known at time t with expectations of any random demands for future time periods. A deterministic problem is then solved over a planning horizon of length T^{ph}, which typically has to be chosen experimentally.
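The rolling horizon procedure can be sketched as follows. The deterministic subproblem solver, the state-update function and the demand models are placeholders standing in for whatever implementation is available (for example, an LP over the planning horizon); their names and signatures are assumptions of this sketch.

def rolling_horizon(T, T_ph, known_demand, forecast, solve_deterministic, advance):
    """At each time t, build a deterministic problem over periods t..t+T_ph-1 using
    known demands for period t and point forecasts afterwards, solve it, and
    implement only the first-period decision."""
    state, decisions = None, []
    for t in range(T):
        horizon = list(range(t, min(t + T_ph, T)))
        demand = {tau: (known_demand(tau) if tau == t else forecast(tau)) for tau in horizon}
        plan = solve_deterministic(state, demand, horizon)   # e.g. an LP over the horizon
        state = advance(state, plan[t], t)                   # implement period t only
        decisions.append(plan[t])
    return decisions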
Figure 15.10: Percentage of posterior bound produced by a rolling horizon procedure using a point forecast of the future versus an approximate dynamic programming approximation.

For a specific sample realization, we can still find an optimal solution using a linear programming solver, but this solution "cheats" by being able to use information about what is happening in the future. This solution is known as the posterior bound since it uses information that only becomes known after the fact. The results of these comparisons are shown in figure 15.10. The experiments were run on problems with 20, 40 and 80 locations, and with 100, 200 and 400 vehicles in our fleet. Problems with 100 vehicles were able to cover roughly half of the demands, while a fleet of 200 vehicles could cover over 90 percent. The fleet of 400 vehicles was much larger than would have been necessary. The ADP approximation produced better results across all the problems, although the difference was most noticeable for problems where the fleet size was not large enough to cover all the demands. Not surprisingly, this was also the problem class where the posterior solution was relatively the best.
15.5.2 Experiments for multicommodity flow problems
Experiments for multicommodity flow problems (reported in Topaloglu & Powell (to appear)) assumed that it was possible to substitute different types of equipment for different types of demands at a single location as illustrated in figure 15.11. This means that we cannot use a vehicle at location i to serve a customer out of location j, but we can use a vehicle of type k ∈ K to serve a demand of type ` ∈ K (we assume the demand types are the same as vehicle types). Different substitution rules can be considered. For example, a customer type might allow a specific subset of vehicle types (with no penalty). Alternatively, we can assume that if we can provide a vehicle of type k to serve a customer of type k, there is no
       I                        II                       III                      IV
 1  .7  .6   0   0       1  1  1  1  1       1  .5   0   0   0       1  1  1  1  1
.8   1  .6  .4  .4       0  1  1  1  1      .5   1  .5   0   0       1  1  1  1  1
.5  .8   1  .7  .6       0  0  1  1  1       0  .5   1  .5   0       1  1  1  1  1
.3  .3  .5   1  .6       0  0  0  1  1       0   0  .5   1  .5       1  1  1  1  1
 0   0  .5  .5   1       0  0  0  0  1       0   0   0  .5   1       1  1  1  1  1

Table 15.2: Different substitution patterns used in the experiments.
substitution penalty, but there is a penalty (in the form of receiving only a fraction of the reward for moving a load of freight) if we use a vehicle of type k to serve a customer of type ℓ (k ≠ ℓ). Four substitution matrices were used (all datasets used five equipment types and five demand types). These are given in table 15.2. Matrix S^I allows all forms of substitution, but discounts the revenue received from a mismatch of equipment type and demand type. Matrix S^II would arise in settings where you can always "trade up"; for example, if we do not have equipment type 3, you are willing to accept 4 or 5. Matrix S^III limits substitution to trading up or down by one type. Matrix S^IV allows all types of substitution without penalty, effectively producing a single commodity problem. This problem captures the effect of using an irrelevant attribute when representing a resource class.
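One way to read the penalty just described is as a multiplicative discount on the contribution earned. The function below is an illustrative sketch under that reading; the per-mile form and the argument names are assumptions, not taken from the experiments.

def contribution(k, l, miles, r, S):
    """Contribution for covering a demand of type l with a vehicle of type k:
    the full rate r per mile, scaled by the substitution factor S[k][l]."""
    return r * S[k][l] * miles

# With matrix IV (all ones) every assignment earns the full contribution,
# which is why that case behaves like a single commodity problem.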
Figure 15.11: Illustration of substitution possibilities for the multicommodity flow formulation.
Problem    T    |I|   |K|   |D|    F     Demands    c     r    S
Base       60    20     5     5    200     4000      4     5    I
T 30       30    20     5     5    200     2000      4     5    I
T 90       90    20     5     5    200     6000      4     5    I
I 10       60    10     5     5    200     4000      4     5    I
I 40       60    40     5     5    200     4000      4     5    I
C II       60    20     5     5    200     4000      4     5    II
C III      60    20     5     5    200     4000      4     5    III
C IV       60    20     5     5    200     4000      4     5    IV
R 1        60    20     5     5      1     4000      4     5    I
R 5        60    20     5     5      5     4000      4     5    I
R 400      60    20     5     5    400     4000      4     5    I
R 800      60    20     5     5    200     4000      4     5    I
c 1.6      60    20     5     5    200     4000    1.6     5    I
c 8        60    20     5     5    200     4000      8     5    I

Table 15.3: Characteristics of the test problems.
Each dataset was characterized by the following parameters:

T = Number of time periods.
I = Set of locations.
K = Set of vehicle types.
D = Set of demand types.
F = Fleet size.
D = Total number of demands to be served over the horizon.
c = Cost per mile for moving a vehicle empty.
r = Contribution per mile for moving a load.
S = Choice of substitution matrix: I, II, III, IV.
A series of datasets were created by choosing a single base problem and then modifying one attribute at a time (e.g. the length of the horizon) to test the effect of that parameter. Table 15.3 summarizes all the datasets that were created. The results of the experiments are shown in table 15.4 which gives the max (highest objective function as a percent of the optimal), min, mean, standard deviation, and the CPU time (and iterations) to reach solutions that are 85th , 90th , 95th , 97.5th percent of the optimal solution. It is important to note that the approximate dynamic programming solutions were always integer, whereas the linear programming optimal solution was allowed to produce fractional solutions. If we required integer solutions, the resulting integer program would be quite large and hard to solve. As a result, some of the gap between the ADP approximation and the optimal solution can be attributed to the relaxation of the integrality requirement. Recognizing that all experimental tests depend to a degree on the structure of the problem and the choice of problem parameters, the results seem to suggest that the use of separable, piecewise linear approximations are giving high quality results.
Problem    Max.    Min.    Mean    Std. dev.    Time (sec.) to reach         No. iterations for       Time (s)
                                                 85    90    95    97.5       85   90   95   97.5     per iter.
Base      98.90   98.65   98.76     0.062        37   101   248    506         2    4    8    14        46.6
T 30      98.58   98.12   98.37     0.130        19    35   165    276         2    3   10    15        24.9
T 90      98.89   98.64   98.75     0.055        56   106   276    721         2    3    6    13        74.9
I 10      99.75   99.55   99.65     0.039        11    11    30     63         2    2    4     7        14.5
I 40      98.90   98.49   98.52     0.211       194   530   992   2276         3    6    9    17       154.2
S I       98.90   98.65   98.76     0.062        37   101   248    506         2    4    8    14        46.6
S II      99.51   99.17   99.34     0.083        59    59   433    991         2    2    7    13        75.5
S III     98.61   98.23   98.41     0.092        33    88   374    505         2    4   12    15        44.8
S IV      99.86   99.73   99.75     0.032       235   287   479    938         4    5    9    14        82.4
R 100     96.87   96.16   96.48     0.189        66   384   475      -         3   12   14     -        50.2
R 400     99.52   99.33   99.43     0.045        40    40   140    419         2    2    5    12        48.8
c 1.6     99.13   98.72   90.01     0.009        33    33    54    602         2    2    3    13        47.2
c 8       98.55   98.11   98.36     0.092        95   209   431   1274         4    7   12    30        43.7

Table 15.4: Performance of ADP approximation on deterministic, multicommodity flow datasets expressed as a percent of the linear programming solution (from Topaloglu & Powell (to appear)).

The table also shows the number of iterations and the CPU time per iteration. Caution is required when comparing CPU times against those of competing algorithms. There are many applications where it is possible to use trained value functions. In these settings, a solution may be obtained using a single forward pass through a new dataset. Real-time (sometimes known as "on-line") applications may run continuously, where each forward pass uses updated data.

The same experiments were run using stochastic demand data. As with the single commodity dataset, we used two forms of comparison. First, we compared the results obtained using the approximate dynamic programming approach to those obtained using a point forecast of the future over a planning horizon. Then, for each sample path, we computed the optimal solution using all the information. All results are reported as a percentage of the posterior optimal solution. The results are shown in figure 15.12. Again we see consistently better solutions across all the problem variations.
15.6 Bibliographic notes
Powell et al. (2004), Topaloglu & Powell (to appear), Godfrey & Powell (2002a), Godfrey & Powell (2002b), Powell et al. (2002), Spivey & Powell (2004).
Figure 15.12: Percentage of posterior bound produced by a rolling horizon procedure versus an approximate dynamic programming approximation when using stochastic demand data.
Bibliography Aalto, S. (2000), ‘Optimal control of batch service queues with finite service capacity and linear holding costs’, Mathematical Methods of Operations Research 51, 263–285. 373 Adelman, D. (2004), Price directed control of a closed logistics queueing network, Tech. rep., University of Chicago. 373 Andreatta, G. & Romeo, L. (1988), ‘Stochastic shortest paths with recourse’, Networks 18, 193–204. 227 A.Nedi¸c & D.P.Bertsekas (2003), ‘Least squares policy evaluation algorithms with linear function approximation’, Journal of Discrete Event Systems 13, 79–110. 238, 320 Bean, J., Birge, J. & Smith, R. (1987), ‘Aggregation in dynamic programming’, Operations Research 35, 215–220. 273, 321 Bellman, R. (1957), Dynamic Programming, Princeton University Press, Princeton. 10, 14, 48, 115 Bellman, R. (1971), Introduction to the Mathematical Theory of Control Processes, Vol. II, Academic Press, New York. 115 Bellman, R. & Dreyfus, S. (1959), ‘Functional approximations and dynamic programming’, Mathematical Tables and Other Aids to Computation 13, 247–251. 142 Benveniste, A., Metivier, M. & Priouret, P. (1990), Adaptive Algorithms and Stochastic Approximations, Springer-Verlag, New York. 194 Berg, M., Schouten, F. & Jansen, J. (1998), ‘Optimal batch provisioning to customers subject to a delay-limit’, Management Science 44(5), 684–697. 373 Bernardo, J. M. & Smith, A. F. M. (1994), Bayesian Theory, John Wiley and Sons, New York. 273 Berry, D. A. & Fristedt, B. (1985), Bandit Problems, Chapman and Hall, London. 292 Bertazzi, L., Bertsekas, D. & Speranza, M. G. (2000), Optimal and neuro-dynamic programming solutions for a stochastic inventory trasportation problem, Unpublished technical report, Universita Degli Studi Di Brescia. 321 Bertsekas, D. (1976), Dynamic Programming and Stochastic Control, Academic Press, New York. 115 Bertsekas, D. (2000), Dynamic Programming and Optimal Control, Athena Scientific, Belmont, Massachusetts. 29 429
Bertsekas, D. & Castanon, D. (1989), ‘Adaptive aggregation methods for infinite horizon dynamic programming’, IEEE Transactions on Automatic Control 34(6), 589–598. 273, 321 Bertsekas, D. & Castanon, D. (1999), ‘Rollout algorithms for stochastic scheduling problems’, J. Heuristics 5, 89–108. 227 Bertsekas, D. & Shreve, S. (1978), Stochastic Optimal Control: The Discrete Time Case, Academic Press, New York. 115 Bertsekas, D. & Tsitsiklis, J. (1991), ‘An analysis of stochastic shortest path problems’, Mathematics of Operations Research 16, 580–595. 115, 227 Bertsekas, D. & Tsitsiklis, J. (1996), Neuro-Dynamic Programming, Athena Scientific, Belmont, MA. 14, 115, 142, 193, 227, 238, 273 Bertsekas, D., Nedic, A. & Ozdaglar, E. (2003), Convex Analysis and Optimization, Athena Scientific, Belmont, Massachusetts. 317, 347 Bertsekas, D. P., Borkar, V. S. & Nedich, A. (2004), Improved temporal difference methods with linear function approximation, in J. Si, A. G. Barto, W. B. Powell & D. W. II, eds, ‘Handbook of Learning and Approximate Dynamic Programming’, IEEE Press. 238 Birge, J. & Louveaux, F. (1997), Introduction to Stochastic Programming, Springer-Verlag, New York. 405 Birge, J. & Wallace, S. W. (1988), ‘A separable piecewise linear upper bound for stochastic linear programs’, SIAM J. Control and Optimization 26(3), 1–14. 405 Birge, J. & Wets, R. (1989), ‘Sublinear upper bounds for stochastic programs with recourse’, Mathematical Programming 43, 131–149. 405 Blum, J. (1954a), ‘Approximation methods which converge with probability one’, Annals of Mathematical Statistics 25, 382–386. 180, 183 Blum, J. (1954b), ‘Multidimensional stochastic approximation methods’, Annals of Mathematical Statistics 25, 737–744. 193 Boyan, J. (2002), ‘Technical update: Least-squares temporal difference learning’, Machine Learning 49, 1–15. 238 Bradtke, S. J. & Barto, A. G. (1996), ‘Linear least-squares algorithms for temporal difference learning’, Machine Learning 22, 33–57. 320 Brown, R. (1959), Statistical Forecasting for Inventory Control, McGraw-Hill, New York. 193 Brown, R. (1963), Smoothing, Forecasting and Prediction of Discrete Time Series, PrenticeHall, Englewood Cliffs, N.J. 193 Cayley, A. (1875), ‘Mathematical questions with their solutions, no. 4528’, Educational Times. 22 Chen, Z.-L. & Powell, W. (1999), ‘A convergent cutting-plane and partial-sampling algorithm for multistage linear programs with recourse’, Journal of Optimization Theory and Applications 103(3), 497–524. 405
Cheung, R. K.-M. & Powell, W. B. (2000), ‘SHAPE: A stochastic hybrid approximation procedure for two-stage stochastic programs’, Operations Research 48(1), 73–79. 320, 399 Chow, G. (1997), Dynamic Economics, Oxford University Press, New York. 115 Crites, R. & Barto, A. (1994), ‘Elevator group control using multiple reinforcement learning agents’, Machine Learning 33, 235–262. 227 Dantzig, G. & Ferguson, A. (1956), ‘The allocation of aircrafts to routes: An example of linear programming under uncertain demand’, Management Science 3, 45–73. 405 Darken, C. & Moody, J. (1991), Note on learning rate schedules for stochastic optimization, in Lippmann, Moody & Touretzky, eds, ‘Advances in Neural Information Processing Systems 3’, pp. 1009–1016. 193 Darken, C., Chang, J. & Moody, J. (1992), ‘Learning rate schedules for faster stochastic gradient search’, Neural Networks for Signal Processing 2 - Proceedings of the 1992 IEEE Workshop. 193 Dayan, P. (1992), ‘The convergence of td(λ) for general λ’, Machine Learning 8, 341–362. 227 de Farias, D. & Van Roy, B. ((to appear)), ‘The linear programming approach to approximate dynamic programming’, Operations Research 00, 000–000. 320 Deb, R. (1978a), ‘Optimal control of batch service queues with switching costs’, Advances in Applied Probability 8, 177–194. 373 Deb, R. (1978b), ‘Optimal dispatching of a finite capacity shuttle’, Management Science 24, 1362–1372. 373 Deb, R. (1984), ‘Optimal control of bulk queues with compound poisson arrivals and batch service’, Operations Research 21, 227–245. 373 Deb, R. & Schmidt, C. (1987), ‘Optimal average cost policies for the two-terminal shuttle’, Management Science 33, 662–669. 373 Deb, R. & Serfozo, R. (1973), ‘Optimal control of batch service queues’, Advances in Applied Probability 5, 340–361. 373 Derman, C. (1970), Finite State Markovian Decision Processes, Academic Press, New York. 14, 115 Doob, J. L. (1953), Stochastic Processes, John Wiley & Sons. 399 Douglas, S. & Mathews, V. (1995), ‘Stochastic gradient adaptive step size algorithms for adaptive filtering’, Proc. International Conference on Digital Signal Processing, Limassol, Cyprus 1, 142–147. 194 Dreyfus, S. & Law, A. M. (1977), The art and theory of dynamic programming, Academic Press, New York. 115 Duff, M. O. (1995), Q-learning for bandit problems, Technical report, Department of Computer Science, University of Massachusetts, Amherst, MA. 292 Duff, M. O. & Barto, A. G. (2003), Local bandit approximation for optimal learning problems, Technical report, Department of Computer Science, University of Massachusetts, Amherst, MA. 292
Dvoretzky, A. (1956), On stochastic approximation, in J. Neyman, ed., ‘Proc. 3rd Berkeley Sym. on Math. Stat. and Prob.’, Berkeley: University of California Press, pp. 39–55. 180, 193 Dynkin, E. B. (1979), Controlled Markov Processes, Springer-Verlag, New York. 14, 115 Ermoliev, Y. (1971), ‘The general problem in stochastic programming’, Kibernetika. 180 Ermoliev, Y. (1983), ‘Stochastic quasigradient methods and their application to system optimization’, Stochastics 9, 1–36. 180 Ermoliev, Y. (1988), Stochastic quasigradient methods, in Y. Ermoliev & R. Wets, eds, ‘Numerical Techniques for Stochastic Optimization’, Springer-Verlag, Berlin. 180, 405 Even-Dar, E. & Mansour, Y. (2004), ‘Learning rates for q-learning’, Journal of Machine Learning Research 5, 1–25. 194 Frank, H. (1969), ‘Shortest paths in probabilistic graphs’, Operations Research 17, 583–599. 227 Frieze, A. & Grimmet, G. (1985), ‘The shortest path problem for graphs with random arc lengths’, Discrete Applied Mathematics 10, 57–77. 227 Gaivoronski, A. (1988), Stochastic quasigradient methods and their implementation, in Y. Ermoliev & R. Wets, eds, ‘Numerical Techniques for Stochastic Optimization’, SpringerVerlag, Berlin. 163, 180, 193 Gardner, E. S. (1983), ‘Automatic monitoring of forecast errors’, Journal of Forecasting 2, 1–21. 193 George, A. & Powell, W. B. (2004), Adaptive stepsizes for recursive estimation in dynamic programming, Technical report, Department of Operations Research and Financial Engineering, Princeton University. 174, 178, 179 George, A., Powell, W. B. & Kulkarni, S. (2003), The statistics of hierarchical aggregation for multiattribute resource management, Technical report, Department of Operations Research and Financial Engineering, Princeton University. 273 George, A., Powell, W. B. & Kulkarni, S. (2005), The statistics of hierarchical aggregation for multiattribute resource management, Technical report, Department of Operations Research and Financial Engineering, Princeton University. 248 Giffin, W. (1971), Introduction to Operations Engineering, R. D. Irwin, Inc., Homewood, Illinois. 193 Gittins, J. (1979), ‘Bandit processes and dynamic allocation indices’, Journal of the Royal Statistical Society, Series B 14, 148–177. 294 Gittins, J. (1981), ‘Multiserver scheduling of jobs with increasing completion times’, Journal of Applied Probability 16, 321–324. 294 Gittins, J. (1989), Multi-Armed Bandit Allocation Indices, John Wiley and Sons, New York. 289, 294 Gittins, J. C. & Jones, D. M. (1974), A dynamic allocation index for the sequential design of experiments, in J. Gani, ed., ‘Progress in Statistics’, pp. 241–266. 286, 294
Gladyshev, E. G. (1965), ‘On stochastic approximation’, Theory of Prob. and its Appl. 10, 275–278. 399 Godfrey, G. & Powell, W. B. (2002a), ‘An adaptive, dynamic programming algorithm for stochastic resource allocation problems I: Single period travel times’, Transportation Science 36(1), 21–39. 423, 427 Godfrey, G. & Powell, W. B. (2002b), ‘An adaptive, dynamic programming algorithm for stochastic resource allocation problems II: Multi-period travel times’, Transportation Science 36(1), 40–54. 427 Godfrey, G. A. & Powell, W. B. (2001), ‘An adaptive, distribution-free approximation for the newsvendor problem with censored demands, with applications to inventory and distribution problems’, Management Science 47(8), 1101–1112. 320 Golub, G. & Loan, C. V. (1996), Matrix Computations, John Hopkins University Press, Baltimore, Maryland. 272, 320 Guestrin, C., Koller, D. & Parr, R. (2003), ‘Efficient solution algorithms for factored MDPs’, Journal of Artificial Intelligence Research 19, 399–468. 227 Hastie, T., Tibshirani, R. & Friedman, J. (2001), The Elements of Statistical Learning, Springer series in Statistics, New York, NY. 190, 193 Heyman, D. & Sobel, M. (1984), Stochastic Models in Operations Research, Volume II: Stochastic Optimization, McGraw Hill, New York. 14, 115 Higle, J. & Sen, S. (1991), ‘Stochastic decomposition: An algorithm for two stage linear programs with recourse’, Mathematics of Operations Research 16(3), 650–669. 405 Holt, C., Modigliani, F., Muth, J. & Simon, H. (1960), Planning, Production, Inventories and Work Force, Prentice-Hall, Englewood Cliffs, NJ. 193 Howard, R. (1971), Dynamic Probabilistic Systems, Volume II: Semimarkov and Decision Processes, John Wiley and Sons, New York. 14, 115 Infanger, G. (1994), Planning under Uncertainty: Solving Large-scale Stochastic Linear Programs, The Scientific Press Series, Boyd & Fraser, New York. 405 Infanger, G. & Morton, D. (1996), ‘Cut sharing for multistage stochastic linear programs with interstate dependency’, Mathematical Programming 75, 241–256. 405 Jaakkola, T., Jordan, M. I. & Singh, S. P. (1994), Convergence of stochastic iterative dynamic programming algorithms, in J. D. Cowan, G. Tesauro & J. Alspector, eds, ‘Advances in Neural Information Processing Systems’, Vol. 6, Morgan Kaufmann Publishers, Inc., pp. 703–710. 193, 227, 320 Jacobs, R. A. (1988), ‘Increased rate of convergence through learning rate adaptation’, Neural Networks 1, 295 – 307. 194 Kall, P. & Wallace, S. (1994), Stochastic Programming, John Wiley and Sons, New York. 405 Kesten, H. (1958), ‘Accelerated stochastic approximation’, The Annals of Mathematical Statistics 29(4), 41–59. 193
Kiefer, J. & Wolfowitz, J. (1952), ‘Stochastic estimation of the maximum of a regression function’, Annals Math. Stat. 23, 462–466. 193 Kmenta, J. (1997), Elements of Econometrics, second edn, University of Michigan Press, Ann Arbor, Michigan. 193 Kushner, H. J. & Clark, S. (1978), Stochastic Approximation Methods for Constrained and Unconstrained Systems, Springer-Verlag, New York. 193 Kushner, H. J. & Yin, G. G. (1997), Stochastic Approximation Algorithms and Applications, Springer-Verlag, New York. 180 Lai, T. L. & Robbins, H. (1985), ‘Asymptotically efficient adaptive allocation rules’, Advances in Applied Mathematics 6, 4–22. 294 LeBlanc, M. & Tibshirani, R. (1996), ‘Combining estimates in regression and classification’, Journal of the American Statistical Association 91, 1641–1650. 273 Leslie Pack Kaelbling, M. L. L. & Moore, A. W. (1996), ‘Reinforcement learning: A survey’, Journal of Artifcial Intelligence Research 4, 237–285. 142 Ljung, l. & Soderstrom, T. (1983), Theory and Practice of Recursive Identification, MIT Press, Cambridge, MA. 320 Luus, R. (2000), Iterative Dynamic Programming, Chapman & Hall/CRC, New York. 321 Mathews, V. J. & Xie, Z. (1993), ‘A stochastic gradient adaptive filter with gradient adaptive step size’, IEEE Transactions on Signal Processing 41, 2075–2087. 194 Mendelssohn, R. (1982), ‘An iterative aggregation procedure for Markov decision processes’, Operations Research 30(1), 62–73. 273, 321 Mirozahmedov, F. & Uryasev, S. P. (1983), ‘Adaptive stepsize regulation for stochastic optimization algorithm’, Zurnal vicisl. mat. i. mat. fiz. 23 6, 1314–1325. 162, 193 Mulvey, J. M. & Ruszczy´ nski, A. J. (1995), ‘A new scenario decomposition method for large-scale stochastic optimization’, Operations Research 43(3), 477–490. 405 Neuts, M. (1967), ‘A general class of bulk queues with Poisson input’, Ann. Math. Stat. 38, 759–770. 373 Neveu, J. (1975), Discrete Parameter Martingales, North Holland, Amsterdam. 399, 400 Papadaki, K. & Powell, W. B. (2002), ‘A monotone adaptive dynamic programming algorithm for a stochastic batch service problem’, European Journal of Operational Research 142(1), 108–127. 373 Papadaki, K. & Powell, W. B. (2003), ‘An adaptive dynamic programming algorithm for a stochastic multiproduct batch dispatch problem’, Naval Research Logistics 50(7), 742–769. 359, 360, 362, 373 Pflug, G. C. (1988), Stepsize rules, stopping times and their implementation in stochastic quasi-gradient algorithms, in ‘Numerical Techniques for Stochastic Optimization’, Springer-Verlag, pp. 353–372. 193
Pflug, G. C. (1996), Optimization of Stochastic Models: The Interface Between Simulation and Optimization, Kluwer International Series in Engineering and Computer Science: Discrete Event Dynamic Systems, Kluwer Academic Publishers, Boston. 142, 193 Powell, W. B. & Humblet, P. (1986), ‘The bulk service queue with a general control strategy: Theoretical analysis and a new computational procedure’, Operations Research 34(2), 267– 275. 373 Powell, W. B., Ruszczy´ nski, A. & Topaloglu, H. (2004), ‘Learning algorithms for separable approximations of stochastic optimization problems’, Mathematics of Operations Research 29(4), 814–836. 309, 320, 321, 399, 427 Powell, W. B., Shapiro, J. A. & Sim˜ao, H. P. (2001), A representational paradigm for dynamic resource transformation problems, in R. F. C. Coullard & J. H. Owens, eds, ‘Annals of Operations Research’, J.C. Baltzer AG, pp. 231–279. 77 Powell, W. B., Shapiro, J. A. & Sim˜ao, H. P. (2002), ‘An adaptive dynamic programming algorithm for the heterogeneous resource allocation problem’, Transportation Science 36(2), 231–249. 427 Psaraftis, H. & Tsitsiklis, J. (1990), Dynamic shortest paths with Markovian arc costs, Preprint. 227 Puterman, M. L. (1994), Markov Decision Processes, John Wiley and Sons, Inc., New York. 14, 29, 48, 115, 373 Robbins, H. & Monro, S. (1951), ‘A stochastic approximation method’, Annals of Math. Stat. 22, 400–407. 180, 183, 193 Rockafellar, R. & Wets, R. (1991), ‘Scenarios and policy aggregation in optimization under uncertainty’, Mathematics of Operations Research 16(1), 119–147. 405 Rockafellar, R. T. (1976), ‘Augmented Lagrangians and applications of the proximal point algorithm in convex programming’, Math. of Operations Research 1, 97–116. 385 Rogers, D., Plante, R., Wong, R. & Evans, J. (1991), ‘Aggregation and disaggregation techniques and methodology in optimization’, Operations Research 39(4), 553–582. 321 Ross, S. (1983), Introduction to Stochastic Dynamic Programming, Academic Press, New York. 14, 29, 115 Ruszczy´ nski, A. (1980a), ‘Feasible direction methods for stochastic programming problems’, Math. Programming 19, 220–229. 180, 385 Ruszczy´ nski, A. (1980b), ‘Feasible direction methods for stochastic programming problems’, Mathematical Programming 19, 220–229. 405 Ruszczy´ nski, A. (1987), ‘A linearization method for nonsmooth stochastic programming problems’, Mathematics of Operations Research 12(1), 32–49. 180, 405 Ruszczy´ nski, A. & Syski, W. (1986), ‘A method of aggregate stochastic subgradients with on-line stepsize rules for convex stochastic programming problems’, Mathematical Programming Study 28, 113–131. 162, 193 Schweitzer, P., Puterman, M. & Kindle, K. (1985), ‘Iterative aggregation-disaggregation procedures for discounted semi-Markov reward processes’, Operations Research 33(3), 589– 605. 321
Sen, S. & Higle, J. (1999), ‘An introductory tutorial on stochastic linear programming models’, Interfaces 29(2), 33–61. 405 Shiryaev, A. (1996), Probability Theory, Vol. 95 of Graduate Texts in Mathematics, SpringerVerlag, New York. 315, 317 Si, J., Barto, A. G., Powell, W. B. & D. Wunsch II, e. (2004), Handbook of Learning and Approximate Dynamic Programming, IEEE Press, New York. 142 Sigal, C., Pritsker, A. & Solberg, J. (1980), ‘The stochastic shortest route problem’, Operations Research 28(5), 1122–1129. 227 Singh, S., Jaakkola, T. & Jordan, M. I. (1995), Reinforcement learning with soft state aggregation, in G. Tesauro, D. Touretzky & T. K. Leen, eds, ‘Advances in Neural Information Processing Systems 7’, MIT Press. 321 Spall, J. C. (2003), Introduction to stochastic search and optimization: estimation, simulation and control, John Wiley and Sons, Inc., Hoboken, NJ. 193 Spivey, M. & Powell, W. B. (2004), ‘The dynamic assignment problem’, Transportation Science 38(4), 399–419. 427 Stengel, R. (1994), Optimal Control and Estimation, Dover Publications, New York, NY. 194 Stokey, N. L. & R. E. Lucas, J. (1989), Recursive Methods in Dynamic Economics, Harvard University Press, Cambridge. 115 Sutton, R. (1988), ‘Learning to predict by the methods of temporal differences’, Machine Learning 3, 9–44. 142, 216, 227 Sutton, R. & Barto, A. (1998), Reinforcement Learning, The MIT Press, Cambridge, Massachusetts. 142 Taylor, H. (1967), ‘Evaluating a call option and optimal timing strategy in the stock market’, Management Science 12, 111–120. 14 Taylor, H. M. (1990), Martingales and Random Walks, Vol. 2, Elsevier Science Publishers B.V.,, chapter 3. 399 Topaloglu, H. & Powell, W. B. (2003), ‘An algorithm for approximating piecewise linear concave functions from sample gradients’, Operations Research Letters 31(1), 66–76. 320 Topaloglu, H. & Powell, W. B. (to appear), ‘Dynamic programming approximations for stochastic, time-staged integer multicommodity flow problems’, Informs Journal on Computing. 424, 427 Topkins, D. M. (1978), ‘Minimizing a submodular function on a lattice’, Operations Research 26, 305–321. 373 Trigg, D. (1964), ‘Monitoring a forecasting system’, Operations Research Quarterly 15(3), 271–274. 193 Trigg, D. & Leach, A. (1967), ‘Exponential smoothing with an adaptive response rate’, Operations Research Quarterly 18(1), 53–59. 164
Tsitsiklis, J. & Van Roy, B. (1997), ‘An analysis of temporal-difference learning with function approximation’, IEEE Transactions on Automatic Control 42, 674–690. 237, 238, 320 Tsitsiklis, J. N. (1994), ‘Asynchronous stochastic approximation and q-learning’, Machine Learning 16, 185–202. 227 Tsitsiklis, J. N. & Van Roy, B. (1996), ‘Feature-based methods for large scale dynamic programming’, Machine Learning 22, 59–94. 142, 321 Van Roy, B. (2001), Neuro-dynamic programming: Overview and recent trends, in E. Feinberg & A. Shwartz, eds, ‘Handbook of Markov Decision Processes: Methods and Applications’, Kluwer, Boston. 142 Van Slyke, R. & Wets, R. (1969), ‘L-shaped linear programs with applications to optimal control and stochastic programming’, SIAM Journal of Applied Mathematics 17(4), 638– 663. 405 Wallace, S. (1986a), ‘Decomposing the requirement space of a transportation problem’, Math. Prog. Study 28, 29–47. 405 Wallace, S. (1986b), ‘Solving stochastic programs with network recourse’, Networks 16, 295– 317. 405 Wallace, S. W. (1987), ‘A piecewise linear upper bound on the network recourse function’, Mathematical Programming 38, 133–146. 405 Wasan, M. (1969), Stochastic approximations, in J. T. J.F.C. Kingman, F. Smithies & T. Wall, eds, ‘Cambridge Transactions in Math. and Math. Phys. 58’, Cambridge University Press, Cambridge. 193 Watkins, C. (1989), Learning from delayed rewards, Ph.d. thesis, Cambridge University, Cambridge, UK. 227 Watkins, C. & Dayan, P. (1992), ‘Q-learning’, Machine Learning 8, 279–292. 227 Weber, R. (1992), ‘On the gittins index for multiarmed bandits’, The Annals of Applied Probability 2(4), 1024–1033. 291 Werbos, P. (1990), A menu of designs for reinforcement learning over time, in R. S. W.T. Miller & P. Werbos, eds, ‘Neural Networks for Control’, MIT PRess, Cambridge, MA, pp. 67–96. 142 Werbos, P. (1992a), Neurocontrol and supervised learning: an overview and evaluation, in D. A. White & D. A. Sofge, eds, ‘Handbook of Intelligent Control’, Von Nostrand Reinhold, New York, NY, pp. 65–86. 142 Werbos, P. J. (1987), ‘Building and understanding adaptive systems: A statistical/numerical approach to factory automation and brain research’, IEEE Transactions on Systems, Man and Cybernetics. 142 Werbos, P. J. (1992b), Approximate dynamic programming for real-time control and neural modelling, in D. J. White & D. A. Sofge, eds, ‘Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches’. 142 Werbos, P. J. (1992c), Neurocontrol and supervised learning: an overview and valuation, in D. A. White & D. A. Sofge, eds, ‘Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches’. 142
Wets, R. (1989), Stochastic programming, in ‘Handbooks in Operations Research and Management Science: Optimization’, Vol. 1, Elsevier Science Publishers B.V., Amsterdam, pp. Volume 1, Chapter 8. 405 White, C. C. (1991), ‘A survey of solution techniques for the partially observable Markov decision process’, Annals of operations research 32, 215–230. 142 White, D. A. & Sofge, D. A. (1992), Handbook of Intelligent Control, Von Nostrand Reinhold, New York, NY. 142 White, D. J. (1969), Dynamic Programming, Holden-Day, San Francisco. 115 Whitt, W. (1978), ‘Approximations of dynamic programs I’, Mathematics of Operations Research 3, 231–243. 273 Whittle, P. (1982), Optimization over time: Dynamic programming and stochastic control Volume I, John Wiley and Sons, New York. 29, 291 Winters, P. R. (1960), ‘Forecasting sales by exponentially weighted moving averages’, Management Science 6, 324–342. 193 Wright, S. E. (1994), ‘Primal-dual aggregation and disaggregation for stochastic linear programs’, Mathematics of Operations Research 19, 893–908. 321 Yang, Y. (1999), ‘Adaptive regression by mixing’, Journal of the American Statistical Association. 273 Young, P. (1984), Recursive Estimation and Time-Series Analysis, Springer-Verlag, Berlin, Heidelberg. 320 Zipkin, P. (1980a), ‘Bounds for row-aggregation in linear programming’, Operations Research 28, 903–916. 273, 321 Zipkin, P. (1980b), ‘Bounds on the effect of aggregating variables in linear programming’, Operations Research 28, 403–418. 273, 321