Multi-armed Bandit Allocation Indices
Multi-armed Bandit Allocation Indices 2nd Edition
John Gittins Department of Statistics, University of Oxford, UK
Kevin Glazebrook Department of Management Science, Lancaster University, UK
Richard Weber Statistical Laboratory, University of Cambridge, UK
A John Wiley & Sons, Ltd., Publication
This edition first published 2011
© 2011 John Wiley & Sons, Ltd

Registered office: John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.

The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book.

This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not warrant the accuracy of the text or exercises in this book. This book's use or discussion of MATLAB® software or related products does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular use of the MATLAB® software.

Library of Congress Cataloging-in-Publication Data
Gittins, John C., 1938–
Multi-armed bandit allocation indices / John Gittins, Richard Weber, Kevin Glazebrook. – 2nd ed.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-470-67002-6 (cloth)
1. Resource allocation – Mathematical models. 2. Mathematical optimization. 3. Programming (Mathematics) I. Weber, Richard, 1953– II. Glazebrook, Kevin D., 1950– III. Title.
QA279.G55 2011
519.5 – dc22
2010044409
A catalogue record for this book is available from the British Library.

Print ISBN: 978-0-470-67002-6
ePDF ISBN: 978-0-470-98004-0
oBook ISBN: 978-0-470-98003-3
ePub ISBN: 978-1-119-99021-5
Set in 10/12pt Times by Laserwords Private Limited, Chennai, India
Contents

Foreword
Foreword to the first edition
Preface
Preface to the first edition

1 Introduction or exploration
Exercises

2 Main ideas: Gittins index
2.1 Introduction
2.2 Decision processes
2.3 Simple families of alternative bandit processes
2.4 Dynamic programming
2.5 Gittins index theorem
2.6 Gittins index
2.6.1 Gittins index and the multi-armed bandit
2.6.2 Coins problem
2.6.3 Characterization of the optimal stopping time
2.6.4 The restart-in-state formulation
2.6.5 Dependence on discount factor
2.6.6 Myopic and forwards induction policies
2.7 Proof of the index theorem by interchanging bandit portions
2.8 Continuous-time bandit processes
2.9 Proof of the index theorem by induction and interchange argument
2.10 Calculation of Gittins indices
2.11 Monotonicity conditions
2.11.1 Monotone indices
2.11.2 Monotone jobs
2.12 History of the index theorem
2.13 Some decision process theory
Exercises

3 Necessary assumptions for indices
3.1 Introduction
3.2 Jobs
3.3 Continuous-time jobs
3.3.1 Definition
3.3.2 Policies for continuous-time jobs
3.3.3 The continuous-time index theorem for a SFABP of jobs
3.4 Necessary assumptions
3.4.1 Necessity of an infinite time horizon
3.4.2 Necessity of constant exponential discounting
3.4.3 Necessity of a single processor
3.5 Beyond the necessary assumptions
3.5.1 Bandit-dependent discount factors
3.5.2 Stochastic discounting
3.5.3 Undiscounted rewards
3.5.4 A discrete search problem
3.5.5 Multiple processors
Exercises

4 Superprocesses, precedence constraints and arrivals
4.1 Introduction
4.2 Bandit superprocesses
4.3 The index theorem for superprocesses
4.4 Stoppable bandit processes
4.5 Proof of the index theorem by freezing and promotion rules
4.5.1 Freezing rules
4.5.2 Promotion rules
4.6 The index theorem for jobs with precedence constraints
4.7 Precedence constraints forming an out-forest
4.8 Bandit processes with arrivals
4.9 Tax problems
4.9.1 Ongoing bandits and tax problems
4.9.2 Klimov's model
4.9.3 Minimum EWFT for the M/G/1 queue
4.10 Near optimality of nearly index policies
Exercises

5 The achievable region methodology
5.1 Introduction
5.2 A simple example
5.3 Proof of the index theorem by greedy algorithm
5.4 Generalized conservation laws and indexable systems
5.5 Performance bounds for policies for branching bandits
5.6 Job selection and scheduling problems
5.7 Multi-armed bandits on parallel machines
Exercises

6 Restless bandits and Lagrangian relaxation
6.1 Introduction
6.2 Restless bandits
6.3 Whittle indices for restless bandits
6.4 Asymptotic optimality
6.5 Monotone policies and simple proofs of indexability
6.6 Applications to multi-class queueing systems
6.7 Performance bounds for the Whittle index policy
6.8 Indices for more general resource configurations
Exercises

7 Multi-population random sampling (theory)
7.1 Introduction
7.2 Jobs and targets
7.3 Use of monotonicity properties
7.4 General methods of calculation: use of invariance properties
7.5 Random sampling times
7.6 Brownian reward processes
7.7 Asymptotically normal reward processes
7.8 Diffusion bandits
Exercises

8 Multi-population random sampling (calculations)
8.1 Introduction
8.2 Normal reward processes (known variance)
8.3 Normal reward processes (mean and variance both unknown)
8.4 Bernoulli reward processes
8.5 Exponential reward processes
8.6 Exponential target process
8.7 Bernoulli/exponential target process
Exercises

9 Further exploitation
9.1 Introduction
9.2 Website morphing
9.3 Economics
9.4 Value of information
9.5 More on job-scheduling problems
9.6 Military applications

References
Tables
Index
Foreword John Gittins’ first edition of this book marked the end of an era, the era in which a succession of investigators struggled for an understanding and ‘solution’ of the multi-armed bandit problem. My foreword to that edition celebrated the gaining of this understanding, and so it seems fitting that this should be retained. The opening of a new era was like the stirring of an ant-heap, with the sudden emergence of an avid multitude and a rush of scurrying activity. The first phase was one of exploitation, in which each worker tried to apply his special expertise in this new context. This yielded, among other things, a remarkable array of proofs of optimality of the Gittins index policy. The most elegant and insightful was certainly Richard Weber’s ‘prevailing charge’ proof (see section 2.5), expressible in a single paragraph of verbal reasoning. I must confess, however, to a lingering attachment to the dynamic programming proof (see section 4.3) which provided also the value function of the Gittins policy and a treatment immediately generalizable to the case of superprocesses. The phase of real interest was the subsequent one, of exploration. To what range of models can the Gittins technique be extended? Here the simile latent in the terms ‘bandit’ and ‘arm’ (with their gambling machine origins) begins to become quite strained. I myself spoke rather of a population of ‘projects’, only one of which could be engaged at any given time. One wishes to operate only the high-value projects, but can determine which these are only by trying them all – it is to this that the phrase ‘exploitation and exploration’ first referred. The process is then one of allocation and learning. Any project does not itself change, but its ‘informational state’ does – one’s knowledge of it as expressed by a Bayesian updating. However, situations occur for which the projects do also have a physical state, which may change by stochastic rules, or for which the population of projects changes by immigration or emigration. These are cases one would wish to cover. It is then more natural to think of the projects as ‘activities’, having their own dynamic structure, and whose performance one would wish to optimize by the appropriate allocation of given resources over activities. This is the classical economic activity analysis in a dynamic setting. However, we are now effectively adding to this the feature that the current physical states of the activities are incompletely known, and must be inferred from observation. ‘Observation time’
is then an additional resource to be allocated. Section 6.8 gives the clearest indication of such an ambition. Extension to such cases requires new tools, and Chapters 5 and 6 consider two such: study of the achievable region and Lagrangian relaxation of the optimization. These central chapters present very clearly the current state of theory in these areas, a considerable enhancement of understanding and many examples of cases of practical interest for which an optimal policy – of index character – is determined. It must be confessed, however, that the general problems of indexability and optimality remain beached on the research frontier, although one senses a rising tide which will lift them. Explicitness is also a feature of Chapters 7 and 8, which go into hard detail on the determination of indices under different statistical assumptions. This scholarly and modern work gives a satisfyingly complete, rigorous and illuminating account of the current state of the subject. It also conveys a historical perspective, from the work of Klimov and von Olivier (which appeared in print before appreciation of John Gittins’ advance had become general) to the present day. All three authors of the text have made key and recent contributions to the subject, and are in the best position to lay its immediate course. Peter Whittle
Foreword to the first edition

The term 'Gittins index' now has firm currency in the literature, denoting the concept which first proved so crucial in the solution of the long-standing multi-armed bandit problem and since then has provided a guide for the deeper understanding of all such problems. The author is, nevertheless, too modest to use the term, so I regard it as my sole role to reassure the potential reader that the author is indeed the Gittins of the index, and that this book sets forth his pioneering work on the solution of the multi-armed bandit problem and his subsequent investigation of a wide class of sequential allocation problems.

Such allocation problems are concerned with the optimal division of resources between projects which need resources in order to develop and which yield benefit at a rate depending upon their degree of development. They embody in essential form a conflict evident in all human action. This is the conflict between taking those actions which yield immediate reward and those (such as acquiring information or skill, or preparing the ground) whose benefit will come only later.

The multi-armed bandit is a prototype of this class of problems, propounded during the Second World War, and soon recognized as so difficult that it quickly became a classic, and a byword for intransigence. In fact, John Gittins had solved the problem by the late sixties, although the fact that he had done so was not generally recognized until the early eighties. I can illustrate the mode of propagation of this news, when it began to propagate, by telling of an American friend of mine, a colleague of high repute, who asked an equally well-known colleague 'What would you say if you were told that the multi-armed bandit problem had been solved?' The reply was somewhat in the Johnsonian form: 'Sir, the multi-armed bandit problem is not of such a nature that it can be solved'. My friend then undertook to convince the doubter in a quarter of an hour. This is indeed a feature of John's solution: that, once explained, it carries conviction even before it is proved.

John Gittins gives here an account which unifies his original pioneering contributions with the considerable development achieved both by himself and by other workers subsequently. I recommend the book as the authentic and authoritative source-work.

Peter Whittle
Preface The first edition of this book was published in 1989. A shortened version of the preface to that edition follows this preface. The uninitiated reader might find it helpful to read it at this point. Since 1989 the field has developed apace. There has been a remarkable flowering of different proofs of the main index theorem, each giving its own particular insight as to why the result is true. Major bodies of related theory have also emerged, notably the achievable region and restless bandit methodologies, and the discussion in economics of the appropriate balance between exploration and exploitation. These have led, for example, to improved algorithms for calculating indices and to improved bounds on the performance of index strategies when they are not necessarily optimal. These developments form the case for a new edition, plus the fact that there is an ongoing interest in the book, which is now difficult to buy. There are now three authors rather than just John Gittins. Kevin Glazebrook and Richard Weber bring familiarity with more recent developments, to which they have made important contributions. Their expertise has allowed us to include new chapters on achievable regions and on restless bandits. We have also taken the opportunity to substantially revise the core earlier chapters. Our aim has been to provide an accessible introduction to the main ideas, taking advantage of more recent work, before proceeding to the more challenging material in Chapters 4, 5 and 6. Overall we have tried to produce an expository account rather than just a research monograph. The exercises are designed as an aid to a deeper understanding of the material. The Gittins index, as it is commonly known, for a project competing for investment with other projects allows both for the immediate expected reward, and for the value of the information gained from the initial investment, which may be useful in securing later rewards. These factors are often called exploitation and exploration, respectively. In this spirit Chapter 1, which is a taster for the rest of the book, is subtitled ‘Exploration’. The mainstream of the book then continues through Chapters 2 to 4. It breaks into independent substreams represented by Chapter 5, Chapter 6, Chapters 7 and 8, and Chapter 9, which looks briefly at five further application areas.
Readers should have some prior knowledge of university-level probability theory, including Markov processes, and for a full understanding of Chapters 5 and 6 will need to know the basic ideas of linear programming. We hope the book will be of interest to researchers in chemometrics, combinatorics, economics, management science, numerical analysis, operational research, probability theory and statistics. Few readers will want to read from cover to cover. For example, both chemometricians and numerical analysts are likely to be interested mainly in Chapter 8 and the tables which it describes: chemometricians as potential users of the tables, and numerical analysts for the sake of the methods of calculation. As another example, Chapters 1, 2 and 3 could form the basis for a specialized lecture course in applied probability for final-year undergraduates. In addition to those colleagues acknowledged in the preface to the first edition we are very grateful to Anne-Marie Oreskovich and Morton Sorenson for further help in the calculation of index values. John Gittins Department of Statistics, University of Oxford Kevin Glazebrook Department of Management Science, Lancaster University Richard Weber Statistical Laboratory, University of Cambridge August 2010
Preface to the first edition A prospector looking for gold in the Australian outback has to decide where to look, and if he prospects for any length of time must repeatedly take further such decisions in the light of his success to date. His decision problem is how to allocate his time, in a sequential manner, so as to maximize his likely reward. A similar problem faces a student about to graduate from university who is looking for employment, or the manager of a commercial laboratory with several research projects competing for the attention of the scientists at his or her disposal. It is to problems like these that the indices described in this book may be applied. The problems are characterized by alternative independent ways in which time or effort may be consumed, for each of which the outcome is uncertain and may be determined only in the course of applying the effort. The choice at each stage is therefore determined partly on the basis of maximizing the expected immediate rate of return, and partly by the need to reduce uncertainty, and thereby provide a basis for better choices later on. It is the tension between these two requirements that makes the decision problem both interesting and difficult. The classic problem of this type is the multi-armed bandit problem, in which several Bernoulli processes with different unknown success probabilities are available, each success yielding the same reward. There are already two good books on this subject: by Presman and Sonin (1982), and by Berry and Fristedt (1985). Both books go well beyond the simple version of the problem just stated: most notably to include, respectively, an exponentially distributed rather than constant interval between trials, and a study of the consequences of discounting rewards in different ways. The aims of this book are to expound the theory of dynamic allocation indices, and to explore the class of problems for which they define optimal policies. These include sampling problems, like the multi-armed bandit problem, and stochastic scheduling problems, such as the research manager’s problem. Tables of index values are given for a number of sampling problems, including the classical Bayesian multi-armed bandit problem. For the most part, and except where otherwise indicated, the book is an account of original work, though much of it has appeared before in the publications bearing the author’s name which are listed in the references. It is a pleasure to acknowledge the stimulus and encouragement which this work has received over the years by conversations, and in some cases
collaboration, with Joao Amaral, Tony Baker, John Bather, Sten Bergman, Owen Davies, Kevin Glazebrook, David Jones, Frank Kelly, Aamer Khan, Dennis Lindley, Peter Nash, David Roberts and Peter Whittle. I should particularly like to thank Joao Amaral, Frank Geisler and David Jones for the substantial effort involved in calculating index values, and Brenda Willoughby for her painstaking work in typing the manuscript. The book is dedicated to Hugh and Anna, who have grown up during its preparation. John Gittins Mathematical Institute, Oxford University March 1989
1 Introduction or exploration

This book is about mathematical models for optimizing in a sequential manner the allocation of effort between a number of competing projects. The effort and the projects may take a variety of forms. Examples are: an industrial processor and jobs waiting to be processed; a server with a queue of customers; an industrial laboratory with research projects; any busy person with jobs to do; a stream of patients and alternative treatments (yes, the patients do correspond to effort – see Problem 5 later in this chapter); a searcher who may look in different places. In every case effort is treated as being homogeneous, and the problem is to allocate it between the different projects so as to maximize the expected total reward which they yield. It is a sequential problem, as effort is allowed to be reallocated in a feedback manner, taking account of the pattern of rewards so far achieved. The reallocations are assumed to be costless, and to take a negligible time, since the alternative is to impose a travelling-salesman-like feature, thereby seriously complicating an already difficult problem.

The techniques which come under the heading of dynamic programming have been devised for sequential optimization problems. The key idea is a recurrence equation relating the expected total reward (call this the payoff) at a given decision time to the distribution of its possible values at the next decision time (see equation (2.2)). Sometimes this equation may be solved analytically. Otherwise a recursive numerical solution may, at any rate in principle, be carried out. This involves making an initial approximation to the payoff function, and then successive further approximations by substituting in the right-hand side of the recurrence equation. As Bellman (1957), for many years the chief protagonist of this methodology, pointed out, using the recurrence equation involves less computing than a complete enumeration of all policies and their corresponding payoffs, but none the less soon runs into the sands of intractable storage and
processing requirements as the number of variables on which the payoff function depends increases. For the problem of allocating effort to projects, the number of variables is at least equal to the number of projects. An attractive idea, therefore, is to establish priority indices for each project, depending on its past history but not that of any other project, and to allocate effort at each decision time only to the project with the highest current index value. To calculate these indices it should be possible to calibrate a project in a given state against some set of standard projects with simple properties. If this could be done we should have a reasonable policy without having to deal with any function of the states of more than one project.

That optimal policies for some problems of effort allocation are expressible in terms of such indices is well known. The first three problems described in this chapter are cases in point. Chapters 2, 3 and 4 set out a general theory which puts these results into context. They include several theorems asserting the optimality of index policies under different conditions. In fact five of the seven problems described in this chapter may be solved fairly rapidly by using the index theorem (Theorem 2.1), and its Corollary 3.5. Problem 4 requires Theorem 3.1, which is a continuous-time version of the index theorem.

Since they may change as more effort is allocated, these priority indices may aptly be termed dynamic allocation indices. The policies that they define include, in relevant circumstances, the cμ-rule, Smith's rule, and the shortest processing time rule (SPT). The main aim of this chapter is to exhibit their range of application by describing a number of particular instances. A second aim is to give an informal introduction to the main methods available for determining indices. These are by (i) interchange arguments; (ii) exploiting any special features of the bandit processes concerned, in particular those which lead to the optimality of myopic policies; (iii) calibration by reference to standard bandit processes, often involving iteration using the dynamic programming recurrence equation; and (iv) using the fact that a dynamic allocation index may be regarded as a maximized equivalent constant reward rate.

Problem 1 Single-machine scheduling (see also Section 2.12, Section 3.2 and exercise 4.1)

There are n jobs ready to be processed on a single machine. A cost ci is incurred per unit of time until job i has been completed, and the service time (or processing time) for job i is si. In what order should the jobs be processed so as to minimize the total cost?

Problem 2 Gold-mining (see also Section 3.5.2)

A woman owns n gold mines and a gold-mining machine. Each day she must assign the machine to one of the mines. When the machine is assigned to mine i there is a probability pi that it extracts a proportion qi of the gold left in the mine, and a probability 1 − pi that it extracts no gold and breaks down permanently. To what sequence of mines on successive days should the machine be assigned so as to maximize the expected amount of gold mined before it breaks down?
Problem 3 Search (see also Section 3.5.4 and exercise 3.6)

A stationary object is hidden in one of n boxes. The probability that a search of box i finds the object if it is in box i is qi. The probability that the object is in box i is pi, and changes by Bayes' theorem as successive boxes are searched. The cost of a single search of box i is ci. How should the boxes be sequenced for search so as to minimize the expected cost of finding the object?

For Problem 1 an optimal schedule is one which processes the jobs in decreasing order of ci/si, as may be shown by the following simple interchange argument. If jobs 1 and 2 are processed as the first and second, respectively, then the total holding cost due to these two jobs is c1s1 + c2(s1 + s2). If the processing order of jobs 1 and 2 is interchanged, then this quantity becomes c2s2 + c1(s1 + s2). The holding cost due to the other jobs is unchanged. Thus, the first ordering has the smallest cost if and only if c1/s1 ≥ c2/s2. It follows that the relationship ci/si ≥ cj/sj must hold for any two jobs i and j such that i immediately precedes j in an optimal processing order of the jobs. This implies that the total holding cost is minimized by processing the n jobs in decreasing order of indices which are defined as μi = ci/si.

For Problem 2, an optimal policy is one which, when xi is the amount of gold remaining in mine i on a particular day, allocates the machine to a mine j such that

$$\frac{p_j q_j x_j}{1 - p_j} = \max_i \frac{p_i q_i x_i}{1 - p_i}.$$

For Problem 3, if box j is searched next under an optimal policy, then

$$\frac{p_j q_j}{c_j} = \max_i \frac{p_i q_i}{c_i}.$$
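To make the three index rules concrete, here is a small sketch in Python (the book itself gives no code; all numerical values below are invented purely for illustration).

```python
# Index rules for Problems 1-3; all parameter values are illustrative only.

# Problem 1: process jobs in decreasing order of c_i / s_i.
jobs = [(3.0, 2.0), (1.0, 4.0), (5.0, 1.0)]        # (cost rate c_i, service time s_i)
order = sorted(range(len(jobs)), key=lambda i: -jobs[i][0] / jobs[i][1])
print("Problem 1 processing order:", order)

# Problem 2: assign the machine to the mine maximizing p_i * q_i * x_i / (1 - p_i).
mines = [(0.9, 0.5, 100.0), (0.8, 0.7, 120.0)]     # (p_i, q_i, gold remaining x_i)
best_mine = max(range(len(mines)),
                key=lambda i: mines[i][0] * mines[i][1] * mines[i][2] / (1 - mines[i][0]))
print("Problem 2 next mine:", best_mine)

# Problem 3: search the box maximizing p_i * q_i / c_i, where p_i is the current
# posterior probability that the object is in box i.
p = [0.5, 0.3, 0.2]                                # location probabilities
q = [0.6, 0.9, 0.5]                                # detection probabilities
c = [1.0, 0.5, 0.7]                                # search costs
j = max(range(len(p)), key=lambda i: p[i] * q[i] / c[i])
print("Problem 3 next box:", j)

# After an unsuccessful search of box j, Bayes' theorem updates the p_i:
#   p_j <- p_j(1 - q_j) / (1 - p_j q_j),   p_i <- p_i / (1 - p_j q_j) for i != j.
denom = 1 - p[j] * q[j]
p = [(p[i] * (1 - q[i]) if i == j else p[i]) / denom for i in range(len(p))]
print("updated probabilities:", p)
```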
For Problems 2 and 3, a policy is specified by a sequence of numbers corresponding to the order of the mines to which the machine is allocated until it breaks down, or the order in which the boxes are to be searched until the object is found. The policies just quoted may therefore be shown to be optimal in the class C of all those policies specified by sequences in which each i (= 1, 2, . . . , n) occurs an infinite number of times, by arguments involving interchanges of adjacent numbers in such a sequence. Since any policy may be matched by a policy in C for any arbitrarily large initial portion of the specifying number sequence, it follows that the above policies are optimal in the class of all policies. The optimal policies for Problems 1, 2 and 3 may be described as follows. At each stage where a job, mine or box is to be selected there is a real-valued function defined on the set of available alternatives, and the one selected is such as to maximize this function. These functions, then, are dynamic allocation indices for the three problems. We have described these indices as functions on the sets of available alternatives. More precisely, the index for Problem 2, for example, is a function of the three variables, xi , qi and pi , which constitute the information available for mine i when the amount of gold which it contains is xi .
The point of this observation is that our indices do not simply tell us which alternative to choose at a given stage in a particular problem with n alternatives. Rather, they specify the optimal policy for a problem with any set of alternatives of a given type. This means that if we know that such an index exists for a given class of problems, without as yet having an explicit expression for it, then it is sufficient, in order to find an explicit expression, to consider problems within the class for which n = 2, and indeed even this restricted class of problems need not be solved in full generality, as our next example shows.

Problem 4 Industrial research (see also Section 3.2)

The manager of a team of industrial scientists has n research projects which may be carried out in any order. Switches from project to project cause negligible loss of time, whether these occur when a project has been successfully completed or not. The value to the manager's employer of completing project i at time t is Vi e^{−γt} (i = 1, 2, . . . , n; γ > 0). The time which the team would need to spend on project i in order to complete it is a random variable σi, which has distribution function Fi(t) and density function fi(t). What policy should the manager follow in order to maximize the expected total value generated by the n projects?

Following standard usage, a switch from one project to another project is called a preemption. We say that we have the preemptive case if preemption is allowed at any time, even if the project that is currently receiving effort has not yet been completed. We say that we have the non-preemptive case if a project, once started, must continue to receive processing effort until it has been completed.

In the non-preemptive case, the optimal schedule can be found with an interchange argument, just as in Problem 1. Suppose that projects 1 and 2 are the first two projects processed. The expected value obtained from these projects when 1 is processed before 2 is at least as great as when their order of processing is reversed if

$$E\left[V_1 e^{-\gamma \sigma_1} + V_2 e^{-\gamma(\sigma_1+\sigma_2)}\right] \ge E\left[V_2 e^{-\gamma \sigma_2} + V_1 e^{-\gamma(\sigma_1+\sigma_2)}\right],$$

which is equivalent to

$$\frac{V_1 E e^{-\gamma \sigma_1}}{E\left(1 - e^{-\gamma \sigma_1}\right)} \ge \frac{V_2 E e^{-\gamma \sigma_2}}{E\left(1 - e^{-\gamma \sigma_2}\right)}.$$

We may conclude that the expected total value obtained from the n projects is maximized by processing them in decreasing order of indices defined as

$$\mu_i = \frac{V_i E e^{-\gamma \sigma_i}}{E\left(1 - e^{-\gamma \sigma_i}\right)}.$$

We can make a connection with Problem 1. Suppose that under a given order of processing, project i completes at time ti. Then the expected total value obtained from the projects can be written as

$$E \sum_i V_i e^{-\gamma t_i} = \sum_i V_i - \gamma E \sum_i V_i t_i + o(\gamma).$$
Thus, in the limit as γ → 0, the problem of maximizing E Σi Vi e^{−γti} is identical to that of minimizing E Σi Vi ti, a measure that is identical to that in Problem 1 if we identify Vi with ci. Also, as γ → 0 we have γμi → Vi/Eσi, an index analogous to that of ci/si in Problem 1.

The characterization of an optimal policy is not as straightforward in the preemptive case, that is when arbitrary switching amongst projects is allowed. However, let us take advantage of the fact that the problem may be shown (Theorem 3.1) to have an index, and proceed to find an expression for the index by considering the case n = 2. Project 1 will be an arbitrary project of the type described. Project 2 we suppose to require a non-random time ε to complete, and V2 = λε. Let C denote the class of projects of this type for a given ε and different positive values of λ. For a set of projects all of which are in C it is best to work on them in order of decreasing λ-values. This is almost obvious, as such a policy results in the larger undiscounted values λε being discounted least by the factors e^{−γs}, and is easily checked by an interchange argument. It is also not difficult to convince oneself that in this case work on a project should not be interrupted before the project has been completed. Thus, for C the λ-values define an index.

Consider then the case n = 2, where project 1 is arbitrary and project 2 is in C. If we can find a λ for which it is optimal to work initially either on project 1 or on project 2 then, since an index is known to exist for the entire set of projects, it follows that one such index function may be defined by assigning the index value λ to project 1 as well as to project 2. In effect we may calibrate the entire set of projects by using C as a measuring device. Any monotone function of λ would of course serve equally well as an index.

By the optimality property of the index rule, it follows that if in a given problem it is optimal to work on a member of C then, if the problem is modified by adding further members of C all with the same λ-value, it remains optimal to work on these in succession until they have all been completed. Thus for calibration purposes we may suppose that several replicates of project 2 are available. For good measure suppose there is an infinite number of replicates. (Discounting ensures that the maximum expected reward over all policies remains finite.) Note, too, that since the value of ε is arbitrary we may as well work with the limiting case as ε tends to zero, so that our infinity of replicates of project 2 reduce to a continuous reward stream at the undiscounted rate λ, an object which will appear later, and termed a standard bandit process with parameter λ, or simply S(λ).

We should like, then, a criterion for deciding whether it is best to work on project 1 first, with the option of switching later to S(λ), or to start on S(λ). Denote these two alternatives by 1λ and λ. Note that the index rule never requires a switch back to project 1 in the case of positive ε, and such a switch is therefore not required for optimality either for positive ε or in the limiting case of a standard bandit process. Under 1λ, work proceeds on project 1 up to some time t (possibly ∞), or until project 1 terminates, whichever occurs first, and then switches to
S(λ), from which point 1λ and λ yield identical reward streams. Thus 1λ is strictly preferable if

$$\int_0^t V f(s) e^{-\gamma s}\, ds > \lambda \int_0^t [1 - F(s)] e^{-\gamma s}\, ds, \qquad (1.1)$$
these two expressions being, respectively, the expected reward from 1λ up to the switch to S(λ), and the expected reward from λ up to that point. The uninformative suffix 1 has been dropped in (1.1). The right-hand side of (1.1) takes the given form since λe^{−γs} δs + o(δs) is the return from S(λ) in the interval (s, s + δs), and 1 − F(s) = P{switch to S(λ) under 1λ takes place after s}. It is strictly better to start work on project 1 if and only if (1.1) holds for some t > 0, that is iff

$$\lambda < \sup_{t>0} \frac{V \int_0^t f(s) e^{-\gamma s}\, ds}{\int_0^t [1 - F(s)] e^{-\gamma s}\, ds}. \qquad (1.2)$$

It follows that the right-hand side of (1.2) is the required index for a project on which work has not yet started. For a project on which work has already continued for a time x without successful completion, the density f(s) and distribution function F(s) in (1.2) must be replaced by the conditional density f(s)/[1 − F(x)] and the conditional distribution function [F(s) − F(x)]/[1 − F(x)], giving the general expression for the index

$$\sup_{t>x} \frac{V \int_x^t f(s) e^{-\gamma s}\, ds}{\int_x^t [1 - F(s)] e^{-\gamma s}\, ds}. \qquad (1.3)$$

The completion rate (or hazard rate) is defined as ρ(s) = f(s)/[1 − F(s)]. If this is decreasing then (1.3) reduces to Vρ(x). Since ρ(x) is the probability density for completion at time x conditional on no completion before x, using Vρ(x) as an index corresponds to maximizing the expected reward during the next infinitesimal time interval, a policy sometimes termed myopic because of its neglect of what might happen later, or disregard of long-range vision, to continue the optic metaphor. What we have shown is that for Problem 4 there is no advantage to be gained from looking ahead if all the completion rates are decreasing, but otherwise there may well be an advantage. Notice, if two projects, say 1 and 2, are both projects of greatest index, so V1 ρ1(x1) = V2 ρ2(x2), then a policy of continuing the project of greatest index will need to share its effort between projects 1 and 2, in such a manner that V1 ρ1 and V2 ρ2 remain equal as they decrease. This is an issue to which we return in Section 3.3.2, when we consider jobs with continuously varying effort allocations.

For a project for which ρ(s) is nondecreasing, the supremum in (1.3) is attained as t → ∞. The index is nondecreasing, so the option of preemption is never worthwhile. In the γ → 0 limit, the index reduces to V/Ex[T], where Ex[T] denotes the expected remaining time required to complete the project, given that it has been worked upon for time x without completing. The fact that ρ(x) is nondecreasing in x implies that Ex[T] is nonincreasing in x. The index V/Ex[T] is identical to that which we found for the non-preemptive case, demonstrating that if all projects have increasing completion rates, then within the class of preemptive policies there is an optimal policy that is non-preemptive.
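The index (1.3) is easy to approximate numerically once the completion-time distribution is specified. The following Python sketch is ours, not from the text; the gamma completion-time distribution and the values of V, γ and x are illustrative assumptions. It discretizes the two integrals over a grid of upper limits t and takes the largest ratio.

```python
import numpy as np
from scipy.stats import gamma

def project_index(V, gam, x, dist, t_max=200.0, n=20000):
    """Approximate (1.3): sup over t > x of
       V * int_x^t f(s) e^{-gam*s} ds  /  int_x^t [1 - F(s)] e^{-gam*s} ds."""
    s = np.linspace(x, t_max, n)
    ds = s[1] - s[0]
    disc = np.exp(-gam * s)
    num = V * np.cumsum(dist.pdf(s) * disc) * ds   # running value of the numerator integral
    den = np.cumsum(dist.sf(s) * disc) * ds        # running value of the denominator integral
    return float(np.max(num / den))

# Illustrative values: V = 10, gamma = 0.1, completion time ~ Gamma(shape 2, scale 1),
# so the completion rate rho(s) is increasing and the index should rise with x.
dist = gamma(a=2.0)
for x in (0.0, 1.0, 2.0):
    print(x, project_index(10.0, 0.1, x, dist))
```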
The indices already derived for Problems 1, 2 and 3 are all myopic. However, later we shall meet more general versions of each of these problems for which short-sighted optimization is often not optimal. This leads to more complex expressions for the indices, which were obtained in full generality only as a result of the index theorem. In each case the need to look ahead arises from the possibility that the index for the alternative selected may increase at subsequent stages. This conflict between short-term and long-term rewards is neatly shown by the following deceptively simple-sounding problem.
Problem 5 Multi-armed bandits (see also Sections 2.6, 2.12, 7.1 and 8.4)

There are n arms (treatments) which may be pulled (allocated to patients) repeatedly in any order. Each pull may result in either a success or a failure. The sequence of successes and failures which results from pulling arm i forms a Bernoulli process with unknown success probability θi. A success at the tth pull yields a reward a^{t−1} (0 < a < 1), whilst an unsuccessful pull yields a zero reward. At time zero, each θi has a beta prior distribution and these distributions are independent for different arms. These prior distributions are converted by Bayes' theorem to successive posterior distributions as arms are pulled. Since the class of beta distributions is closed under Bernoulli sampling, these posterior distributions are all beta distributions. How should the arm to pull next at each stage be chosen so as to maximize the total expected reward from an infinite sequence of pulls?

For this problem, in contrast to the previous one, there is an obvious solution. This is at each stage to pull the arm for which the current expected value of θi is largest. Thus if the current distribution for θi is Beta(αi, βi) (i = 1, 2, . . . , n), the suggestion is that the next arm j to be pulled should be such that αj/(αj + βj) = max_i αi/(αi + βi). This policy is simply stated, intuitively plausible, and in most cases very good.

However, it is unfortunately not optimal. This is because it is myopic, and in this problem there is some conflict, though not usually a strong one, between short-term and long-term rewards. To see this, suppose n = 2 and α1/(α1 + β1) = α2/(α2 + β2). The suggested index rule is indifferent between the two arms, though if, for example, α2 + β2 ≫ α1 + β1, so the variance of θ1 is much greater than for θ2, it is much more likely that further pulls on arm 1 will lead to appreciable changes in α1/(α1 + β1) than it is for comparable changes to occur when arm 2 is pulled. This strongly suggests that it must be better to pull arm 1, as the expected immediate rewards are the same in both cases, and there is
more information to be gained by pulling arm 1; information which may be used to achieve higher expected rewards later on. This hunch turns out to be well founded, as we shall see in Section 8.4.

In fact there seems to be no simple way of expressing the optimal policy, and a good deal of effort has been invested in working it out iteratively by means of the dynamic programming recurrence equation for the expected total reward under an optimal policy. When n = 2, and with beta prior distributions with parameters αi, βi (i = 1, 2), this takes the form

$$R(\alpha_1,\beta_1,\alpha_2,\beta_2) = \max\Bigg\{ \frac{\alpha_1}{\alpha_1+\beta_1}\big[1 + aR(\alpha_1+1,\beta_1,\alpha_2,\beta_2)\big] + \frac{\beta_1}{\alpha_1+\beta_1}\, aR(\alpha_1,\beta_1+1,\alpha_2,\beta_2),$$
$$\qquad\qquad \frac{\alpha_2}{\alpha_2+\beta_2}\big[1 + aR(\alpha_1,\beta_1,\alpha_2+1,\beta_2)\big] + \frac{\beta_2}{\alpha_2+\beta_2}\, aR(\alpha_1,\beta_1,\alpha_2,\beta_2+1) \Bigg\}. \qquad (1.4)$$
The reasoning behind this is as follows. If arm 1 is pulled, the probability of a success is the expectation of θ1 with respect to a Beta(α1, β1) distribution, namely α1/(α1 + β1). This quantity is the multiplier of the first expression in square brackets on the right-hand side of (1.4), which is the expected total reward under an optimal policy after the first pull, conditional on this resulting in a success. To see this, note that the success yields an immediate reward of 1 and results, by applying Bayes' theorem, in a posterior distribution which is Beta(α1 + 1, β1). The prior distribution for θ2 remains the current distribution, as a pull on arm 1 yields no relevant information. Thus the expected total reward under an optimal policy from this point onwards is exactly the same as the expected total reward from the outset with priors which are Beta(α1 + 1, β1) and Beta(α2, β2), except for an additional factor a, namely aR(α1 + 1, β1, α2, β2). Similarly, the probability of a failure if arm 1 is pulled is β1/(α1 + β1), and after this the conditional expected total reward under an optimal policy is aR(α1, β1 + 1, α2, β2). Thus the first expression in the curly brackets is the expected total reward under an optimal policy after the first pull if this is on arm 1, and, of course, the second expression is the similar quantity when arm 2 is pulled first.

The sum of the four parameters on which the function R depends is one greater where R appears on the right-hand side of (1.4) than for the left-hand side. This means that given an approximation to R for all parameter values such that α1 + β1 + α2 + β2 = N, an approximation to R for α1 + β1 + α2 + β2 = N − 1 may be calculated using (1.4). Recall that for a beta distribution the parameter values are positive. For a given positive integer N, the number of positive integer solutions of the equation α1 + β1 + α2 + β2 = N − 1 is (N − 2)(N − 3)(N − 4)/3!, so this is the number of calculations using (1.4) that are required. This approximation for α1 + β1 + α2 + β2 = N − 1 may now be
substituted in the right-hand side of (1.4) to give an approximation for α1 + β1 + α2 + β2 = N − 2. Successive substitutions yield an approximation to R for all integer parameter values totalling N or less, and the corresponding approximation to the optimal policy. The total number of individual calculations required is

$$\sum_{r=4}^{N-1} \frac{1}{3!}(r-1)(r-2)(r-3) = \frac{1}{4!}(N-1)(N-2)(N-3)(N-4). \qquad (1.5)$$
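A direct implementation of this backward calculation is short, if memory- and time-hungry. The sketch below is our own illustration in Python: the values of N and a, and the crude boundary approximation used at parameter total N, are convenient assumptions rather than prescriptions from the text. It evaluates the recursion (1.4) by memoized recursion, which computes the same values as working down layer by layer from the boundary.

```python
from functools import lru_cache

a = 0.9      # discount factor (illustrative)
N = 40       # truncation level: parameter totals >= N use a boundary approximation

def boundary(a1, b1, a2, b2):
    # Crude approximation at the truncation boundary: behave as if the better
    # posterior mean were a known success probability from now on.
    p = max(a1 / (a1 + b1), a2 / (a2 + b2))
    return p / (1 - a)

@lru_cache(maxsize=None)
def R(a1, b1, a2, b2):
    if a1 + b1 + a2 + b2 >= N:
        return boundary(a1, b1, a2, b2)
    pull1 = (a1 / (a1 + b1)) * (1 + a * R(a1 + 1, b1, a2, b2)) \
          + (b1 / (a1 + b1)) * a * R(a1, b1 + 1, a2, b2)
    pull2 = (a2 / (a2 + b2)) * (1 + a * R(a1, b1, a2 + 1, b2)) \
          + (b2 / (a2 + b2)) * a * R(a1, b1, a2, b2 + 1)
    return max(pull1, pull2)

print(R(1, 1, 1, 1))     # value of the two-armed problem with uniform priors
print(R(2, 8, 1, 1))     # arm 1 looks poor, but arm 2 is still unexplored
```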
One of the main theorems for discounted Markov decision processes asserts that repeated iteration using an equation of the form (1.4) gives convergence to the true optimal reward function. Since large parameter values imply tight distributions for θ1 and θ2, it is also possible to give increasingly good initial approximations for R for large values of N. Thus we have a method which may be used to give increasingly accurate approximations to the entire function R by increasing the value of N from which the iterative calculations begin. For a ≤ 0.9, a good approximation may be obtained with N = 100. From (1.5) we note that this requires around 10^6 individual calculations. The storage requirement is determined by the largest array of R values required at any stage of the calculation, which is to say by the set of values corresponding to parameter values which sum to N. This set has (N − 1)(N − 2)(N − 3)/3! members – around 10^5 for N = 100. Thus the computing requirements are appreciable even when n = 2.

A similar analysis shows that for general n (> 1),

$$\text{number of individual calculations required} = \frac{(N-1)!}{(2n)!\,(N-2n-1)!},$$
$$\text{storage requirement} = \frac{(N-1)!}{(2n-1)!\,(N-2n)!}.$$
This means that for N = 100 the calculations are quite impracticable for n > 3.

These computational difficulties may be drastically reduced by taking advantage of the index theorem. For this problem it is convenient to use as a calibrating set those arms for which the probability of success is known. These are in effect discrete-time standard bandit processes. Suppose then that n = 2, arm 1 has a success probability θ which has a prior Beta(α, β) distribution, and arm 2 has a known success probability p. The dynamic programming recurrence equation for this problem may be written as

$$R(\alpha,\beta,p) = \max\left\{ \frac{p}{1-a},\; \frac{\alpha}{\alpha+\beta}\big[1 + aR(\alpha+1,\beta,p)\big] + \frac{\beta}{\alpha+\beta}\, aR(\alpha,\beta+1,p) \right\}. \qquad (1.6)$$

The expression p/(1 − a) is the expected total return from a policy which always pulls arm 2, since such a policy yields an expected reward of a^{t−1} p at the tth pull. The reason for its appearance in (1.6) is that if it is optimal to pull arm 2 once, then after this pull our information about the two success probabilities is precisely the same as it was before, and it must therefore be optimal to pull it again, repeatedly. This may be checked by an argument based on the exact analogue of (1.4) for the present problem, which is the same as (1.6) except that p/(1 − a) is replaced by p + aR(α, β, p).

Equation (1.6) may be solved iteratively for R(α, β, p) starting with an approximation for values of α and β such that α + β = N, just like equation (1.4). The set of (α, β) values for which the two expressions in curly brackets in (1.6) are equal defines those arms with beta priors whose index value is equal to p. Thus index values may be calculated as accurately as required for all beta priors by solving (1.6) for a sufficiently fine grid of values of p, thereby defining optimal policies for all n.
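The calibration idea translates directly into a short computation: for a given arm, find the value of p at which the two expressions in (1.6) balance. The Python sketch below is our own illustration; the truncation level N, the boundary approximation and the bisection tolerance are assumptions chosen for convenience, not prescriptions from the text.

```python
from functools import lru_cache

a = 0.9     # discount factor (illustrative)
N = 200     # truncation level: values with alpha + beta >= N are approximated

def R(alpha, beta, p):
    # Value function of (1.6): one arm with a Beta(alpha, beta) prior versus a known arm p.
    @lru_cache(maxsize=None)
    def val(al, be):
        if al + be >= N:
            # Boundary approximation: treat the posterior mean as a known success probability.
            return max(p, al / (al + be)) / (1 - a)
        cont = (al / (al + be)) * (1 + a * val(al + 1, be)) \
             + (be / (al + be)) * a * val(al, be + 1)
        return max(p / (1 - a), cont)
    return val(alpha, beta)

def gittins_index(alpha, beta, tol=1e-4):
    # The index is the p at which pulling the unknown arm and retiring to the known
    # arm are equally good; locate it by bisection on p.
    lo, hi = alpha / (alpha + beta), 1.0    # the index is at least the posterior mean
    while hi - lo > tol:
        p = (lo + hi) / 2
        if R(alpha, beta, p) > p / (1 - a) + 1e-12:
            lo = p        # the unknown arm is still strictly preferred: index exceeds p
        else:
            hi = p
    return (lo + hi) / 2

print(gittins_index(1, 1))   # exceeds the posterior mean 0.5, reflecting the value of exploration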
An analysis like that carried out for (1.4) shows that to solve (1.6) for a single value of p,

number of individual calculations required = (N − 1)(N − 2)/2,
storage requirement = N − 1.

For a grid of p-values, the first of these numbers is multiplied by the number of values in the grid, and the second is unchanged. Thus for n = 2 the index method requires fewer calculations for N = 100 for grids of up to a few hundred p-values than the direct method, does very much better in terms of storage requirement, and at the same time solves the problem for n > 2. In brief, Problem 5 has been brought within the bounds of computational feasibility by the index theorem. In this respect it is typical of a large number of problems to which the theorem applies, and for which it does not lead to a simple explicit solution.

Because of its status as the archetypal problem in sequential resource allocation, the multi-armed bandit has been used as the source of the generic term bandit process to describe one of the alternative reward (or cost)-generating processes between which resources are allocated. Thus each of the jobs, mines, boxes, projects, arms and locations of Problems 1 to 6, respectively, defines a bandit process.

Problem 6 Exploration (see also Section 9.6)

Each of n locations contains an unknown number of objects of value. At each point in time t = 0, 1, 2, . . . a search for objects is conducted at some chosen location. Any objects discovered during a search are removed and their value is realized. Should x objects be found at location i at time t then a return a^t(ri x − ci) is generated, where a is a discount factor satisfying 0 < a < 1, ri is the value of a single discovered object at i, and ci is the cost of a single search. Once a location has been searched sufficiently often, it may be that the prospect of making further discoveries is sufficiently discouraging that it is not cost effective to make further searches. To simplify matters, we assume, as in Problem 4, that to switch between locations is cost free and does not delay further
future searches. The goal is to determine in what order to search locations and when to stop searching altogether in order to maximize overall net returns.

Before search begins at t = 0, we express our beliefs about the number of objects at each location by assigning a prior distribution to each. It will be convenient at this point to focus discussion on the nature of a single location, and we drop the location identifier i until we need it again. We shall use Y for the unknown number of objects. We shall assume that the numbers of objects at distinct locations are independent. As searching proceeds, the prior distribution will be transformed by Bayes' theorem into a posterior distribution for the number of objects at the location remaining to be discovered. Suppose that s searches have been made at the location and that their outcomes are summarized by the vector x = (x1, x2, . . . , xs), where xt is the number of objects found when the location was searched for the tth time. We write Π|(x, s) for the posterior distribution of Y − Σ_{t≤s} xt, the number of still-undiscovered objects.

In addition to the above, we also need a detection model, namely a set of conditional probabilities for the number of objects discovered during a search. We use φ(x|y) for the probability that a search will yield x discovered objects, given that y were there before the search. One intriguing question which is raised in relation to this model is whether the discovery of a large or small number of objects at a location is an encouragement or a disincentive to revisit that location to make further searches there. Roughly speaking, small discoveries are always a disincentive to return, as the decision-maker both revises downward her beliefs about the potential value that there was at the location prior to the search as well as removing the few objects she found. The effect of a large discovery is a considerably more complex issue and will depend in important ways on the way we go about modelling prior beliefs. It must presumably be true that, over a longish run, locations become less attractive as they are emptied of objects.

In order to make progress, we seek some simplification of the problem's structure by requiring that the posterior distribution Π|(x, s) should depend upon the search history (x, s) only through σs ≡ Σ_{t≤s} xt, the total number of objects discovered, and s, the number of searches. We write Π|(x, s) ≡ Π|(σs, s) when this is the case. This requirement imposes structure on the detection model. For example, it can be shown that the probability that a search discovers nothing must take the form φ(0|y) = q^y for some q, 0 ≤ q ≤ 1. A natural choice for the detection model which guarantees the required simplification of the posterior structure takes

$$\phi(x\,|\,y) = \binom{y}{x} p^x q^{y-x}, \qquad 0 \le x \le y,$$

where p = 1 − q, namely that the conditional model is binomial B(y, p). This choice flows from an assumption of a detection rate p, constant across individual objects at the location, together with a further assumption that objects are discovered (or not) independently of one another.

The assumption of a binomial detection model has an interesting consequence which points the way ahead. The key idea flows from a result of Rao and Rubin (1964). Let U and V be random variables taking non-negative integer values such that U|V ∼ B(V, p). It follows that U and V − U are independent if and
only if V is Poisson. Moreover, when this is the case both U and V − U are also Poisson. We can interpret this result in our context by taking V as the number of objects remaining at the location, and U as the number discovered in a single search, with V − U the number remaining undiscovered after the search. It follows from this result that if the prior is Poisson(λ), say, then the sequence of posteriors computed after each successive search is deterministic in the sense that it does not depend upon the number of objects discovered in the searches. To be more precise, the posterior Π|(σs, s) is Poisson[λ(1 − p)^s] for all realizations of the search history x. The Rao and Rubin result makes it plain that this deterministic behaviour holds only for Poisson priors. In this case the optimal policy for the n location problem must itself be deterministic. In other words, the optimal policy is simply a (finite) sequence of locations to be explored, followed by a decision to quit once the sequence has been completed.

We now restore the location identifier and suppose that the prior for each location i is Poisson(λi). From the above it is readily inferred that, once location i has been searched on si occasions, then the expected net return (undiscounted) from its next search is given by

$$\mu_i(s_i) = r_i \lambda_i (1 - p_i)^{s_i} - c_i.$$

In this very simple case, a straightforward argument based on pairwise interchanges, akin to that given for Problem 4, yields the conclusion that at every stage the best location to explore next is the one with the largest associated value of the index μi(si), provided that this largest value is positive. It is optimal to stop searching altogether once all of these one-stage net returns are negative.
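A short sketch of this index policy for Poisson priors follows (Python; the location parameters are invented for illustration). It simply recomputes the one-step returns μi(si) after every search and stops when none is positive.

```python
# Index policy for Problem 6 with Poisson(lambda_i) priors: after s_i searches of
# location i, one more search has expected net return
#   mu_i(s_i) = r_i * lambda_i * (1 - p_i)**s_i - c_i,
# and it is optimal to keep searching the location of largest positive index.

locations = [                       # illustrative parameters (lambda_i, p_i, r_i, c_i)
    {"lam": 5.0, "p": 0.3, "r": 1.0, "c": 0.4},
    {"lam": 2.0, "p": 0.6, "r": 2.0, "c": 0.5},
]
searches = [0 for _ in locations]

schedule = []
while True:
    indices = [loc["r"] * loc["lam"] * (1 - loc["p"]) ** s - loc["c"]
               for loc, s in zip(locations, searches)]
    best = max(range(len(locations)), key=lambda i: indices[i])
    if indices[best] <= 0:
        break                       # all one-step net returns are negative: stop searching
    schedule.append(best)
    searches[best] += 1

print(schedule)                     # the (deterministic) optimal order of searches
```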
Although the fact that we are dealing here with a situation in which the optimal policy is a deterministic sequence of searches makes the analysis simpler conceptually, it is not the key driver of the argument underlying the interchange proof. This is, rather, the fact that the mean one-step returns from successive searches of a single location form a decreasing sequence almost surely. It turns out that this property also holds for priors whose tails are lighter than Poisson. To capture this idea, we first observe that the Poisson family may be characterized as the set of distributions Π on N for which (n + 1)Π_{n+1}/Π_n is constant. We then say that a distribution on N has a lighter (respectively heavier) tail than Poisson if (n + 1)Π_{n+1}/Π_n is decreasing (respectively increasing) in n. It is easy to show that the light and heavy tail classes are both conjugate in the sense that posteriors must be in the class if priors are. The light-tail class includes all binomial priors as well as those from the truncated Poisson class. By writing ξ(σs, s) for the mean of the posterior Π|(σs, s), where Π is a light-tail prior, it is easy to show that

$$\xi(\sigma_s, s) > \xi(\sigma'_{s'}, s'), \qquad \sigma'_{s'} \ge \sigma_s,\; s' \ge s + 1.$$
Hence, in these light-tail cases, the posterior mean number of objects remaining undiscovered, though now in general dependent on the outcomes of earlier
searches, is guaranteed to form a decreasing sequence as exploration proceeds. In such a case, the above one-step return will become

$$\mu_i(\sigma_{is_i}, s_i) = r_i\, \xi_i(\sigma_{is_i}, s_i) - c_i,$$

and the optimal search schedule when all locations are of this character will always choose a location i whose associated index μi(σ_{is_i}, si) is maximal, provided this maximal value is positive. Searching stops when all one-step returns are negative. The proof is in all essentials the same as that in the simple Poisson case. One further point about the light-tail family: it is in fact the case here that, for any fixed s, the posterior mean ξ(σs, s) decreases in σs. Hence in these cases a large discovery is a positive disincentive to return to a location for further searches.

Suppose, though, that the prior has a heavier tail than Poisson. This is so if the prior is negative binomial or a general mixture of Poissons. For such cases the posterior mean ξ(σs, s) is decreasing in s for fixed σs but is increasing in σs for fixed s. Hence any search which finds nothing will certainly lead to a decrease in the posterior mean. However, the posterior mean is increasing in the number of discovered objects. It may happen that a search which discovers a sufficiently large number of objects can yield an increase in the posterior mean. Hence for such cases the sequence of one-step mean returns is not guaranteed to be decreasing, and the very simple analysis of the earlier cases no longer suffices. However, the index theorem for multi-armed bandits guarantees that in these cases there are indeed quantities νi(σ_{is_i}, si) which play the role earlier assumed by the one-step returns μi(σ_{is_i}, si) in determining optimal policies. It can be shown that ν(σs, s) is itself increasing in σs for fixed s, and so in this case a sufficiently large discovery is an incentive to revisit the location for further searches. For further discussion of this problem see Glazebrook and Boys (1995).

Problem 7 Choosing a job

A man is faced with a number of opportunities for employment which he can investigate at a rate of one per day. For a typical job this investigation has a cost c and informs the man of the daily wage rate w, which has a prior probability distribution F(w). The job has probability ½ of proving congenial or, on the other hand, uncongenial, and the man rates these outcomes as worth g or −g per day, respectively. The congeniality becomes apparent on the Nth day in the job, where N is a random variable with a Geometric(p) distribution. The discount factor for any reward or cost which occurs on day t is a^{t−1}; unlike the other parameters, a is the same for all the jobs. Decisions to take or leave a job may be taken at any time, and with effect on the day following the decision. How should the man decide whether to take a job, and then whether to leave it if he does take it, when he discovers its congeniality?

The bandit processes for this problem are, of course, the various jobs. A (discrete-time) standard bandit process is simply a job for which the wage rate and congeniality are both known. The sum of these two quantities we term the reward rate.
The index for a bandit process is, as always, equal to the reward rate of the standard bandit process which leads to indifference between the two when these are the only bandit processes available. In the economics literature, and in the present context, this equivalent reward rate is termed the reservation wage.

Consider first a job J with a known wage rate w, but for which the congeniality is as yet unknown, together with an alternative standard job S(ν) with a known reward rate ν. Suppose that initially there is indifference between the two jobs. It follows that indifference must persist until the congeniality for job J is known. This is because after n (< N) days working on job J, the distribution of N − n, now conditional on N > n, is Geometric(p), the same as the initial unconditional distribution of N, and the rewards available from that time onwards therefore have precisely the same structure as at the outset. This is except for the discounting, which scales all future rewards downwards by the same factor when we compare the reward streams with those available after n days on job J, thus leaving the optimal policy unchanged.

We may suppose, therefore, that our man starts working on job J and carries on doing so for N days, at which point he becomes aware of the congeniality. If the job is congenial it must now be optimal to remain in it, since it was optimal to do so even without this knowledge. If the job is uncongenial it must now be optimal to switch to S(ν), since our assumption of initial indifference between the two jobs means that the switch was already optimal without this knowledge. This then is one optimal policy. The indifference assumption means that it is also optimal to work permanently on S(ν). (Note that we are endowing our man with an infinite working life, or – which in the present context is equivalent – one of L days, with a^L ≪ 1.) The expected rewards from each of these two policies may now be equated and the equation solved for ν. Before doing this, note that for either policy we have

E(Reward) = E(Reward during first N days) + (1/2) E(Reward after Nth day | J is congenial) + (1/2) E(Reward after Nth day | J is uncongenial).

Moreover, the last of the three terms on the right-hand side is the same for both policies, since in both cases S(ν) is the job our man works on after the Nth day if job J is uncongenial. Thus the equation reduces to

w[1 + a(1 − p) + a^2(1 − p)^2 + · · ·] + (1/2)(w + g)(1 + a + a^2 + · · ·) × pa[1 + a(1 − p) + a^2(1 − p)^2 + · · ·]
   = ν[1 + a(1 − p) + a^2(1 − p)^2 + · · ·] + (1/2)ν(1 + a + a^2 + · · ·) × pa[1 + a(1 − p) + a^2(1 − p)^2 + · · ·],
which on summing the geometric series becomes
(1/(1 − a(1 − p))) ( w + pa(w + g)/(2(1 − a)) ) = (1/(1 − a(1 − p))) ( ν + paν/(2(1 − a)) ).   (1.7)
The index for job J is therefore

ν = w + pag/(2(1 − a) + pa).   (1.8)
Thus it is better to take job J until its congeniality is apparent than a job with a known reward rate of w, although initially w is the expected reward rate for J. The margin of the preference for J is increasing in p, a and g. This is to be expected, for reasons to which we shall shortly return.

In this derivation of the index ν, we were able to show that only one policy starting with job J needed to be considered. Call this policy π1 and write ν1 = ν. If, for example, our model had allowed for several different degrees of congeniality this would not have been so easy. A more formal and general procedure is required, and we now digress to consider this question before evaluating the index for a job with an unknown wage rate. Equation (1.7) is an expression of the fact that the expected reward from working on job J under policy π1 is equal to that available from S(ν1) over the same random time. The factor [2(1 − a) + pa][1 − a(1 − p)]^{−1}[2(1 − a)]^{−1}, by which equations (1.8) and (1.7) differ, may be regarded as the expected discounted time for which the man works on job J under π1, and ν1 as the equivalent constant reward rate yielded by J under this policy. For any other policy π starting with job J, an equation analogous to (1.7) may be written down and solved to give an equivalent constant reward rate ν. The point to note is that unless π is chosen to maximize ν we do not have initial indifference between jobs J and S(ν), as the expected total reward is higher if the maximizing policy π1 is followed than it is if our man works throughout on S(ν). We have established then that the index defined for a bandit process in terms of calibration by standard bandit processes is equal to the maximum equivalent constant reward rate obtainable from that bandit process. This is an important and general result, not restricted to Problem 7.

Returning now to the case of a job K with an as yet unknown wage rate, it is clear that the equivalent constant reward rate is maximized by a policy of the following type: Find out the wage rate w on the first day. If w > w1 accept the job. If w2 ≥ w > w1, leave the job if it turns out to be uncongenial. If w > w2 keep the job even if it turns out to be uncongenial. Since the index ξ for job K is the maximum equivalent constant reward rate it follows that

ξ = sup_{w1,w2} [ −c + (a/(1 − a(1 − p))) ∫_{w1}^{w2} ( w + pa(w + g)/(2(1 − a)) ) dF(w) + (a/(1 − a)) ∫_{w2}^{∞} w dF(w) ]
            / [ 1 + (a/(1 − a(1 − p))) ∫_{w1}^{w2} ( 1 + pa/(2(1 − a)) ) dF(w) + (a/(1 − a)) ∫_{w2}^{∞} dF(w) ].   (1.9)
It is not difficult to show that the maximizing w1 and w2 are such that

ξ = w1 + pag/(2(1 − a) + pa) = w2 − g.
Thus the maximization in (1.9) is effectively with respect to a single variable, and readily carried out numerically. It is clear on general grounds that, like the case of a known wage rate, ξ must increase with p, a and g, though this is not particularly obvious from (1.9). The reason is that opportunities for achieving a high equivalent constant reward rate increase with each of these variables. In the case of g this is because large values of g increase the variability of rewards between different possible outcomes, and therefore increase the possibilities of including those with high reward rates and excluding those with low reward rates. In the case of p, any increase tends to make the time at which this selection can be made earlier, and therefore more significant. Large values of a, on the other hand, reduce the tendency for delays in the selection to diminish the equivalent constant reward rate which can be achieved, since they mean that rewards which occur after the selection are not totally insignificant compared with the earlier rewards; this effect is not confined to Problem 7, as Theorem 2.3 will show. Models along the lines of Problem 7 had previously appeared in the economics literature over at least a 10-year period, but the solution in terms of the indices (1.8) and (1.9) was obtained only with the help of the index theorem. Some of the relevant papers, which give references to a large number of others, are those by Berninghaus (1984), Lippman and McCall (1981), McCall and McCall (1981), Miller (1984) and Weitzman (1979).
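As a concrete illustration of carrying out this maximization, here is a minimal numerical sketch (not from the text). It discretizes a hypothetical wage distribution F – an exponential distribution with mean 10 is assumed purely for illustration, as are all parameter values – and evaluates the ratio in (1.9) on a grid of pairs (w1, w2).

import numpy as np

# Numerical sketch of the reservation wage xi in (1.9); the wage distribution
# (exponential, mean 10) and all parameter values are hypothetical.
a, p, g, c = 0.95, 0.2, 2.0, 1.0
w = np.linspace(0.0, 80.0, 2001)
dF = (np.exp(-w / 10.0) / 10.0) * (w[1] - w[0])   # approximate probability mass at each grid point

A1 = a / (1 - a * (1 - p))                 # multiplies the 'keep until congeniality is known' term
A2 = a / (1 - a)                           # multiplies the 'keep forever' term
f_num = w + p * a * (w + g) / (2 * (1 - a))            # numerator integrand on (w1, w2]
f_den = 1 + p * a / (2 * (1 - a))                      # denominator integrand on (w1, w2]

best = -np.inf
grid = range(0, len(w), 25)
for i in grid:                             # w1 = w[i]
    for j in grid:                         # w2 = w[j]
        if j < i:
            continue
        mid, hi = slice(i, j), slice(j, None)
        num = -c + A1 * np.sum(f_num[mid] * dF[mid]) + A2 * np.sum(w[hi] * dF[hi])
        den = 1.0 + A1 * f_den * np.sum(dF[mid]) + A2 * np.sum(dF[hi])
        best = max(best, num / den)
print(round(best, 4))

Using the relation just derived, ξ = w1 + pag/(2(1 − a) + pa) = w2 − g, the two-dimensional grid search can of course be reduced to a one-dimensional one.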
Exercises

1.1. Jobs 1, 2, 3, 4 are to be processed in some order by a single machine. As in Problem 1, job i has processing time si, and a cost ci is incurred per unit of time until it completes. There are precedence constraints amongst jobs such that job i cannot be started until job i − 2 is complete, i = 3, 4. Show that it is optimal to process job 1 first if

max{ c1/s1, (c1 + c3)/(s1 + s3) } ≥ max{ c2/s2, (c2 + c4)/(s2 + s4) }.

(A brute-force numerical check of this result, under illustrative data, appears after these exercises.) Extend this result to two other problems, A and B, each with 6 jobs, and constraints of the form

A: job i cannot start until job i − 3 is complete, i = 4, 5, 6;
B: job i cannot start until job i − 2 is complete, i = 3, 4, 5, 6.

The model here is further pursued in exercise 4.7.

1.2. Write down equation (1.6) with p + aR(α, β, p) in place of p/(1 − a) as suggested on page 10, and hence deduce the given form of equation (1.6).
1.3. Consider a bandit in the setup of Problem 5, with just one arm, so we do not need subscripts. Let ca^{t−1} be the cost of a pull at time t. Let R(P, a|θ) and C(P, a|θ) be the expected total reward and cost, respectively, for a known value of θ, of following policy P, and ν(P, a|θ) = R(P, a|θ)/C(P, a|θ), the profitability of P for known θ. Let R(P, a) = ER(P, a|θ), C(P, a) = EC(P, a|θ) and ν(P, a) = R(P, a)/C(P, a), where E denotes expectation with respect to the prior distribution for θ. Consider policies of the form: stop pulling after r pulls or after the first failure, whichever is soonest, and write P = r to denote such a policy. Find expressions for ν(r, a|θ), ν(1, a) and ν(2, a). Show that
(i) ν(r, a|θ) does not depend on r or on a;
(ii) ν(2, a) > ν(1, a);
(iii) ν(1, a) does not depend on a;
(iv) ν(2, a) is increasing in a.
What do these results tell us about the possibility of profiting from experience? As we shall see, Theorem 2.3 is a more general version of (iv), and the same question applies.

1.4. Show that, in equation (1.9), the maximizing w1 and w2 satisfy the stated condition.
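The check promised under exercise 1.1 is sketched below (it is not part of the text). On randomly generated processing times and costs, it enumerates all orders respecting the precedence constraints (job 3 after job 1, job 4 after job 2; indices are 0-based in the code) and verifies that, whenever the stated inequality holds, some optimal order starts with job 1.

import itertools
import random

def total_cost(order, s, c):
    # Sum of c_i times the completion time of job i, under the given order.
    t, total = 0.0, 0.0
    for j in order:
        t += s[j]
        total += c[j] * t
    return total

def feasible(order):
    # 0-based: job 2 (i.e. job 3) after job 0 (job 1); job 3 (job 4) after job 1 (job 2).
    return order.index(2) > order.index(0) and order.index(3) > order.index(1)

random.seed(0)
for _ in range(2000):
    s = [random.uniform(0.5, 5.0) for _ in range(4)]
    c = [random.uniform(0.5, 5.0) for _ in range(4)]
    lhs = max(c[0] / s[0], (c[0] + c[2]) / (s[0] + s[2]))
    rhs = max(c[1] / s[1], (c[1] + c[3]) / (s[1] + s[3]))
    orders = [o for o in itertools.permutations(range(4)) if feasible(o)]
    best = min(total_cost(o, s, c) for o in orders)
    best_1_first = min(total_cost(o, s, c) for o in orders if o[0] == 0)
    if lhs >= rhs:
        assert best_1_first <= best + 1e-9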
2
Main ideas: Gittins index

2.1 Introduction
Chapter 1 in large measure set out to dazzle the reader with the diversity of the allocation problems which succumb to a solution in terms of indices. In this chapter we introduce the key concepts of a bandit process and of a family of alternative bandit processes (FABP). For simplicity, we shall sometimes abbreviate these terms to ‘bandit’ and ‘bandits process’, respectively. We apologize to the reader who feels that it is illogical to refer to an arm of a multi-armed bandit as a bandit. It is indeed illogical, but we judge that the precedents for this contraction and its convenience outweigh the purist argument. Bandit process is the generic term for the jobs, gold mines, boxes, research projects, arms and locations of the various problems in Chapter 1. A FABP is a model for the problem when choices must be made between different bandit processes. It is a special type of exponentially discounted semi-Markov decision process (or, in discrete time, a Markov decision process). For this reason, we provide in Section 2.2 a brief synopsis of basic ideas for Markov and semi-Markov decision processes. In Section 2.3 we define the key notions of a bandit process, and the decision problem for a simple family of alternative bandit processes (SFABP). In Section 2.4 we specialize to discrete time and formulate the relevant dynamic programming equation. Its solution is presented in terms of an index policy in Section 2.5. It is proved using an argument with an economic flavour. This proof only starts the ball rolling. There remains much more to say about the index theorem and properties of the Gittins index. That discussion continues in Section 2.6. Our book contains a number of complementary proofs of the index theorem, and generalizations of it. Each proof has its merits and can help the reader to develop greater insight. A second proof, in Section 2.7, uses an interchange argument. In Section 2.8 we extend ideas to semi-Markov bandit
processes in continuous time. A third proof of the index theorem, in Section 2.9, uses an induction on the sum of the sizes of the state spaces of the bandits and a very simple interchange argument. It is only valid if the number of states of each bandit is finite, but it has the advantage that it provides an algorithm by which to calculate Gittins indices. We say more about calculation of the indices in Section 2.10. In Section 2.11 we include a discussion of monotone indices and monotone jobs. We give some notes about historical development of the index and proof of the index theorem in Section 2.12. Finally, in Section 2.13 we briefly summarize some of the standard theory of Markov and semi-Markov decision processes that we will frequently use.
2.2 Decision processes
A Markov process is a mathematical model of a system which passes through a succession of states in a manner determined by a corresponding succession of transition probability distributions which depend on the current state. This concept is one of the most powerful weapons in the probabilist’s armoury. The mathematical theory is elegant, and examples abound in all branches of the physical and economic sciences. To such a process let us now add a set of decisions available at each stage, and on which the probability distribution governing the next state of the system depends. Let us also add a set of possible rewards at each stage, so that the decision-maker receives a reward which depends on his or her decision and on the subsequent transition. These additions give us a Markov decision process, varieties of which are appropriate models for a wide range of situations where a sequence of decisions has to be made. The study of Markov decision processes began with the work of Wald (1950) and Bellman (1957) and continues in the present day. There is a substantial literature–see, for example, the book by Howard (1971) and the review by Teugels (1976). The modern theory is well explained in many books, such as those by Whittle (1983), Ross (1983) and Puterman (2005). Various different criteria may be used to choose which decision to make at each stage. We shall consider Markov decision processes with an infinite time horizon: processes, that is to say, which go on forever. Future rewards will be discounted, thereby ensuring that the total reward obtained is finite. The aim will be to find policies which maximize the expected value of this total discounted reward, which will be termed simply the payoff . Some of the most interesting applications of our theory are to non-discounted problems, whose solutions may be obtained either by letting the discount factor tend to one, or by reinterpreting the discount factor in some way. Problems 1, 2 and 3 of Chapter 1 are cases in point. It was Blackwell (1965) who proved many of the fundamental results for Markov decision processes. These results include the existence and uniqueness of a solution to the dynamic programming equation, the fact that an optimal policy
lies in the class of deterministic stationary Markov policies, and the convergence of the value iteration technique. These were proved in the first instance for a discrete-time process, and we shall be interested in their extensions to the case when the times between successive decision times are themselves random variables. Processes of this type are known as semi-Markov decision processes. The important results that we need are summarized at the end of this chapter in Theorem 2.10.
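To make this standard machinery concrete, here is a minimal value iteration sketch for a made-up three-state, two-action discounted Markov decision process; the transition matrices, rewards and discount factor are purely illustrative and are not taken from the text.

import numpy as np

# Value iteration for a toy discounted MDP: P[u] is the transition matrix under
# action u, r[u] the one-step rewards, and a the discount factor.
a = 0.9
P = np.array([[[0.7, 0.3, 0.0],
               [0.1, 0.6, 0.3],
               [0.0, 0.2, 0.8]],
              [[0.9, 0.1, 0.0],
               [0.2, 0.2, 0.6],
               [0.1, 0.0, 0.9]]])
r = np.array([[1.0, 0.5, 0.0],
              [0.8, 0.9, 0.2]])

V = np.zeros(3)
for _ in range(10000):
    Q = r + a * (P @ V)          # Q[u, x]: value of taking action u in state x
    V_new = Q.max(axis=0)
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new
policy = Q.argmax(axis=0)        # a deterministic stationary Markov optimal policy
print(np.round(V, 4), policy)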
2.3 Simple families of alternative bandit processes
A bandit process is a special type of semi-Markov decision process. We might at this point spend several pages setting out in greater detail than in the previous section some general definitions for Markov decision processes and semi-Markov decision processes. However, many readers will already be familiar with these ideas. Others will find them easy to understand within the context of the specific problems that we will be addressing. So let us proceed immediately to the description of a bandit process. For simplicity, we shall temporarily restrict attention to a bandit process that moves on a countable state space, say E. Controls are applied to a bandit process at a succession of decision times, the intervals between which are random variables. The first decision time is at time zero. Transitions between states occur instantaneously at these times, and at no other times, as do also accruals of rewards. It will prove convenient to regard the control applied at a decision time as being continuously in force up to, but not including, the next decision time. When the process is in state x at decision time t, a control must be chosen from a set of two controls, {0, 1}. The control 0 freezes the process, in the sense that when it is applied no reward accrues and the state of the process does not change. Application of control 0 also initiates a time interval in which every point is a decision time. In contrast, control 1 is termed the continuation control. If at a given decision time the bandit is in state x and the continuation control is applied, then the process yields an immediate reward a t r(x) = e−γ t r(x), where for the present we assume that r(x) is bounded. The quantities a (0 < a < 1) and γ (> 0) are termed the discount factor and discount parameter respectively. We denote by P (y|x) the probability that the state of the process immediately after time t is y, given that at time t the process is in state x. We denote by F (s|x, y) the probability that the time that elapses until the next decision time is no more than s, given that at time t the process is in state x and control u = 1 is applied, leading to a transition to state y. Just as a Markov decision process is a (semi-Markov) decision process with the property that the duration between decision times has a constant value, so a Markov bandit process is a bandit process with the property that when the control u = 1 is applied, the duration between decision times is constant which, without loss of generality, we may assume to be 1.
A policy for a decision process is any rule, including randomized rules, which for each decision time t specifies the control to be applied at time t as a function of t, the state at time t, the set of previous decision times, and of the states of the process and the controls applied at those times. Thus the control applied at time t may depend on the entire previous history of the process, though not on what happens after time t, a situation which we shall describe by saying that it is past-measurable. A policy which maximizes the expected total reward over the set of all policies for every initial state will be termed an optimal policy. Deterministic policies are those which involve no randomization. Stationary policies are those which involve no explicit time dependence. Markov policies are those for which the control chosen at time t depends only on t and the state at time t. At any stage, the total time during which the process has not been frozen by applying control 0 is termed the process time. The process time when control 1 is applied for the (i + 1)th time is denoted by t_i (i = 0, 1, 2, . . .). The state at process time t_i is denoted by x(t_i). The sequences of times t_i and states x(t_i), i = 0, 1, . . . constitute a realization of the bandit process, which thus does not depend on the sequence of controls applied. The reward at decision time t if control 1 has been applied at every previous decision time, so that process time coincides with real time, is a^t r(x(t), 1), which we abbreviate to a^t r(t). A policy which results in control 1 being applied up to a time τ, and control 0 being applied thereafter, will be termed a stopping rule, and τ the corresponding stopping time. In particular, a policy which divides the state space into a continuation set, on which control 1 is applied, and a stopping set on which control 0 is applied, is a deterministic stationary Markov stopping rule. We note that τ may take the value infinity with positive probability. Stopping rules have been extensively studied, for the most part in the context of stopping problems (e.g. Chow, Robbins and Siegmund, 1971); these may be regarded as being defined by bandit processes in which the freeze control produces nonzero reward, that is r(x, 0) ≠ 0. We now come to the central concept of a family of alternative bandit processes (FABP). This is a decision process formed by bringing together a set of n bandit processes all with the same discount factor. Time zero is a decision time for every bandit process, and at time zero, control 1 is applied to just one of them, bandit process B1 say, and control 0 is applied to all the other bandit processes until the next decision time for B1 is reached. Again, this is a decision time for every bandit process and just one of them is continued, B2 say (which may be B1 again), whilst the others are frozen by applying control 0 until the next decision time for B2. Once again, this is a decision time for every bandit process and just one of them is continued. A realization of a FABP is constructed by proceeding in this way from one decision time to the next. The state space for the FABP is the product of the state spaces for the individual bandit processes. The control set at a decision time is given by the set of bandit processes which may then be continued, application of control i to the FABP corresponding to selection of bandit process i for continuation.
There may be precedence constraints (Section 4.6) or arrivals (Section 4.8) which mean that this set does not always include all n bandit
processes. Families of alternative bandit processes all of which are always included in the control set will be described as simple families of alternative bandit processes (SFABPs). The reward which accrues at each decision time is the one yielded by the bandit process which is then continued. A stream of rewards is thereby obtained. These are accumulated in some performance measure, called the payoff , which we may define to be the expected value of a sum of undiscounted rewards over a finite horizon, an average reward over the infinite horizon, or, more usually, a sum of discounted rewards up to the infinite horizon.
2.4 Dynamic programming
In this section and the two that follow, we suppose that the state spaces of our processes are countable and we restrict attention to discrete-time Markov bandit processes. Thus the decision times are just t = 0, 1, . . . These assumptions serve to simplify our initial exposition of the main ideas. Principally, they mean that in equations like (2.1) and (2.4), below, we can sum over 0, 1, 2, . . . , rather than over some random set of decision times, and in (2.2) we can simply sum over y ∈ Ei. There is no serious difficulty in extending ideas to bandit processes having uncountable state spaces and evolving in continuous time; we do this in Section 2.8.

Consider a SFABP consisting of the n independent bandit processes B1, . . . , Bn. The state of Bi at process time t is a member of the countable state space, Ei, and is denoted xi(t). We need to distinguish this from the state of Bi at time t, which we may write as ξi(t). The state of the decision process is the vector of states ξ = (ξ1, . . . , ξn) and is a member of the product space E = E1 × · · · × En. If at decision time t we choose to apply the continuation control to Bi, which is in state ξi, then a reward ri(ξi) is obtained from it. Let us assume that |ri| is bounded (an assumption we can relax later). The state of Bi makes a transition from ξi to y, with probability Pi(y|ξi). The other bandits are frozen, in the sense that their states do not change and no rewards are obtained from them (although one can relax the second of these assumptions and allow frozen bandits to produce rewards; we see this in the tax problems of Section 4.9.1).

In choosing which of the bandits to continue we know ξ, and all other data of the problem (i.e. we know the discount factor a, and all functions ri and Pi). The aim is to maximize the infinite-horizon expected discounted sum of rewards. Let π denote a past-measurable policy, and let i_t denote the bandit process that is continued at time t under this policy. Suppose the state is ξ(t). The reward that is obtained is r_{i_t}(ξ_{i_t}(t)). In the language of Markov decision processes, we wish to find the value function, which is defined as

V(ξ) = sup_π E[ Σ_{t=0}^{∞} a^t r_{i_t}(ξ_{i_t}(t)) | ξ(0) = ξ ],   (2.1)
where the supremum is over all past-measurable policies π. The existence of V(·) is guaranteed by standard theory. Presented in this way, a SFABP is merely a special type of infinite-horizon discounted-reward Markov decision process. The natural way to address it is via the dynamic programming equation (or functional equation)

V(ξ) = max_{i∈{1,...,n}} { ri(ξi) + a Σ_{y∈Ei} Pi(y|ξi) V(ξ1, . . . , ξi−1, y, ξi+1, . . . , ξn) }.   (2.2)
Relevant results from the theory of discounted Markov decision processes (summarized in Theorem 2.10) permit us to conclude that (i) there is at least one optimal policy that is deterministic stationary and Markov; (ii) the value function V that is defined in (2.1) is the unique bounded function satisfying (2.2); and (iii) a policy is optimal if and only if for every ξ the bandit i to which the continuation control is applied is one that achieves the maximum on the right-hand side of (2.2). Furthermore, approximations to V may be computed by the method of value iteration, in which V_n is calculated from the recursion formed by replacing V with V_n and V_{n−1}, on the left and right-hand sides of (2.2), respectively. If V_0 is any bounded function, then V_n → V as n → ∞. These facts are true for general Markov decision processes under the assumptions that |ri| is bounded (which we have assumed) and that the control space is finite (which it is, since we are to continue one of n bandits).

Unfortunately, the dynamic programming equation (2.2) is not usually helpful in computing V(·) since the size of the problem grows exponentially in the number of bandits, n. Suppose, for example, that the state space of Bi is finite and of size ki = |Ei|, ki ≥ 2. In principle, a solution to (2.2) could be found by using any of the standard solution methods, such as the policy improvement algorithm, or linear programming. However, the vector state ξ = (ξ1, . . . , ξn) evolves on a state space of size k = |E| = ∏_i ki. So the number of stationary policies is n^k, where k ≥ 2^n. In a linear programming formulation the number of variables would be O(2^n). So the solution of (2.2) is computationally infeasible for even moderate sizes of k and n.
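To see the difficulty concretely, the sketch below solves (2.2) directly by value iteration on the joint state space for two very small bandits with made-up transition data; the same approach is exact, but the joint state space grows as the product of the k_i, which is what defeats it for more than a few bandits.

import itertools
import numpy as np

# Value iteration on the joint state space for a SFABP of two tiny bandits
# (all data hypothetical). Exactly one bandit is continued at each step.
a = 0.9
P = [np.array([[0.5, 0.4, 0.1],
               [0.1, 0.6, 0.3],
               [0.0, 0.1, 0.9]]),
     np.array([[0.8, 0.2],
               [0.3, 0.7]])]
r = [np.array([1.0, 0.6, 0.1]),
     np.array([0.9, 0.2])]
sizes = [3, 2]

states = list(itertools.product(*(range(k) for k in sizes)))
V = {s: 0.0 for s in states}
for _ in range(2000):
    V_new = {}
    for s in states:
        best = -np.inf
        for i in range(len(sizes)):                  # which bandit to continue
            cont = sum(P[i][s[i], y] * V[s[:i] + (y,) + s[i+1:]]
                       for y in range(sizes[i]))
            best = max(best, r[i][s[i]] + a * cont)
        V_new[s] = best
    if max(abs(V_new[s] - V[s]) for s in states) < 1e-10:
        V = V_new
        break
    V = V_new
print({s: round(v, 4) for s, v in V.items()})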
2.5 Gittins index theorem
We have defined a discrete-time SFABP as a Markov decision process in which exactly one of n constituent bandit processes (or Markov reward processes) is to be continued (or advanced) at each of the decision times 0, 1, . . . Remarkably, the optimal policy for this problem takes the form of an index policy. That is, there exists a real-valued index, say ν(Bi , ξi ), which we may compute separately for each bandit process as a function only of its current state. The optimal policy is always to continue the bandit process with the greatest index.
Assuming that the above paragraph is true, it is easy to figure out what the index must be. Consider the discrete-time standard bandit process, defined as a bandit that has just one state and pays reward λ when at a decision time it receives the continuation control. Denote this bandit process by S(λ). Now consider a discrete-time SFABP comprised from just two bandit processes: Bi and the standard bandit S(λ). If the continuation control is applied to S(λ) at all of the decision times 0, 1, . . . , then the payoff is

λ(1 + a + a^2 + · · ·) = λ/(1 − a).   (2.3)
Alternatively, suppose that the continuation control is applied to Bi at decision time 0, and then subsequently an optimal policy is followed. This policy may switch the continuation control to S(λ) at some future decision time τ (τ > 0). Having done this, the information about Bi at time τ + 1 is the same as at time τ. Since it was optimal to apply the continuation control to S(λ) at τ, it must be optimal to apply the continuation control to S(λ) at time τ + 1, and ever after. This means that the maximal payoff must be

sup_{τ>0} E[ Σ_{t=0}^{τ−1} a^t ri(xi(t)) + a^τ λ/(1 − a) | xi(0) = xi ],   (2.4)

where τ denotes a past-measurable stopping time, taking a value in the set of decision times {1, 2, . . .}. Note that t and the process time of Bi are the same, so ξi(t) = xi(t). By subtracting (2.3) from (2.4) we can deduce that the value of λ for which it is equally optimal to apply the continuation control to either of the two bandit processes initially is such that

0 = sup_{τ>0} E[ Σ_{t=0}^{τ−1} a^t ri(xi(t)) − (1 − a^τ) λ/(1 − a) | xi(0) = xi ].
Since it is the supremum of decreasing linear functions of λ, the right-hand side is convex and decreasing in λ, and has a unique root. That root can also be expressed as
sup{ λ : λ/(1 − a) ≤ sup_{τ>0} E[ Σ_{t=0}^{τ−1} a^t ri(xi(t)) + a^τ λ/(1 − a) | xi(0) = xi ] }

or as

sup{ λ : sup_{τ>0} E[ Σ_{t=0}^{τ−1} a^t [ri(xi(t)) − λ] | xi(0) = xi ] ≥ 0 }.   (2.5)
Expression (2.5) provides an economic interpretation for λ as the greatest rent that one would be willing to pay (per period) for ownership of the rewards arising from the bandit while it is continued for one or more periods. This value of λ, which we shall now denote as ν(Bi, xi), is the obvious candidate for the index for Bi when it is in state xi. Essentially, we have calibrated Bi by finding a standard bandit process S(λ) such that in a two-bandit SFABP comprised from Bi and S(λ), we are indifferent as to which of the bandits should initially receive the continuation control. From (2.5) we have also

ν(Bi, xi) = sup_{τ>0} E[ Σ_{t=0}^{τ−1} a^t ri(xi(t)) | xi(0) = xi ] / E[ Σ_{t=0}^{τ−1} a^t | xi(0) = xi ],   (2.6)

where τ is a past-measurable stopping time. Observe that ν(Bi, xi) is defined as the maximal possible value of a quotient. The numerator is the ‘expected total discounted reward over τ steps’, and the denominator is the ‘expected total discounted time over τ steps’, where τ is at least 1 step. It is useful to define some notation for the numerator and denominator. Let these be

Rτ(Bi) = E Σ_{t=0}^{τ−1} a^t ri(xi(t)),    Wτ(Bi) = E Σ_{t=0}^{τ−1} a^t = (1 − a)^{−1} E(1 − a^τ).

Let us also define

ντ(Bi) = Rτ(Bi)/Wτ(Bi),   (2.7)

so we may rewrite (2.6) as ν(Bi) = sup_{τ>0} ντ(Bi).

All the above quantities naturally depend on the initial state xi(0) of the bandit process Bi. When necessary (Bi) can be replaced with (Bi, xi) to indicate the values when xi(0) = xi. It is very important to appreciate that ν(Bi, ·) is a function that can be determined knowing only a, ri and Pi. That is, it can be computed without knowing anything about xj, rj and Pj for other bandit processes. We shall see in Sections 2.9 and 2.10 that if Bi moves on a finite state space of size ki then its ki Gittins indices (one for each state) can be computed in time O(ki^3).
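For a Markov bandit on a small finite state space, the calibration in (2.5) can be carried out numerically, and the following sketch does so for a hypothetical three-state bandit. It is illustrative only, and is not the O(ki^3) algorithm of Sections 2.9 and 2.10: for a trial charge lam it solves the optimal stopping problem with per-step reward r − lam by value iteration, and then bisects on lam.

import numpy as np

# Gittins indices via the calibration in (2.5); transition data are hypothetical.
a = 0.9
P = np.array([[0.3, 0.5, 0.2],
              [0.1, 0.4, 0.5],
              [0.0, 0.1, 0.9]])
r = np.array([1.0, 0.5, 0.1])

def continue_value(lam, x, sweeps=500):
    # phi(y) = max(0, r(y) - lam + a E[phi(next state) | y]); return the value
    # of continuing for at least one step from x.
    phi = np.zeros(len(r))
    for _ in range(sweeps):
        phi = np.maximum(0.0, r - lam + a * (P @ phi))
    return r[x] - lam + a * (P @ phi)[x]

def gittins_index(x):
    lo, hi = r.min(), r.max()          # the index always lies in this range
    for _ in range(60):                # bisection; continue_value is decreasing in lam
        mid = 0.5 * (lo + hi)
        if continue_value(mid, x) >= 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print([round(gittins_index(x), 4) for x in range(3)])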
For a SFABP F made up of bandit processes B1 , . . . , Bn , let xj be a typical element of the state space Ej of Bj , and μj a real-valued function defined on Ej . A policy for F will be described as an index policy with respect to μ1 , μ2 , . . . , μn if at any decision time the bandit process Bk selected for continuation is such that μk (xk ) = maxj μj (xj ). Theorem 2.1 The index theorem for a SFABP A policy for a simple family of alternative bandit processes is optimal if it is an index policy with respect to ν(B1 , ·), ν(B2 , ·), . . . , ν(Bn , ·). The following proof uses the economic interpretation of the index that is afforded by (2.5). It is for the index theorem for a SFABP in discrete time; in Section 2.8 we show how to extend the proof idea to a continuous-time setting. Proof (by prevailing charges argument). To kick off, let us consider the problem with just bandit Bi , which may be continued or frozen at each of the decision times t = 0, 1, . . . Let us add to our usual model a requirement that at each decision time at which the bandit is continued a fixed charge (or rent) must be paid. Call this the prevailing charge. If this charge is set too great then it will not be worthwhile to continue the bandit at all. If the charge is set to −∞ then the bandit may be continued forever, at a profit. Define the fair charge as that level of prevailing charge for which one would be indifferent between not continuing, and continuing for at least one more step, and then optimally stopping at some step after that. The fair charge, which we denote λi (xi ), is a function of the state and is given by (2.5). Suppose a point is reached at which it would be optimal to stop continuing the bandit because the prevailing charge is too great. This is time τ in (2.5). At this point the prevailing charge is reduced to the current value of its fair charge, which makes it just possible to continue without loss (i.e. with expected profit zero). Such a reduction is made whenever the prevailing charge is no longer small enough that one could continue without making a loss. Thus the prevailing charge is λ∗i (xi (t)) = mins≤t λi (xi (s)); it depends on xi (0), . . . , xi (t) (which we omit from its argument for convenience). Notice that the prevailing charge is nonincreasing in the number of units of processing that the bandit has received (i.e. in its process time). Now suppose we have a simple family of n alternative bandit processes and at each time we are to select exactly one of them to continue. The prevailing charge for each bandit is initially set equal to its fair charge, and decreased periodically thereafter, as described above. By always continuing the bandit with the greatest prevailing charge it is possible to obtain an expected profit of zero and ensure that some bandit is being continued at every step. But it is not possible to obtain an expected profit that is strictly positive, since there would then have to be a positive profit in at least one bandit. Since the prevailing charge is periodically reset to the fair charge (but no less), a strict profit from any bandit is impossible.
Note that the n nonincreasing sequences of prevailing charges are independent of policy; it is only their interleaving that depends on policy. If we adopt a policy of always continuing the bandit with the greatest prevailing charge then we interleave the n sequences of nonincreasing prevailing charges into a single nonincreasing sequence of charges, and so this maximizes the discounted sum of prevailing charges so paid. Since profit is no more than zero, this is an upper bound on the expected total discounted sum of rewards that can be obtained from the bandits under any policy. However, this upper bound is attained by the proposed policy, and so it must be optimal. The fair charge that is defined in (2.5) is the Gittins index. Over the past 40 years, many researchers have found other ways to prove this theorem. The proof above is due to Weber (1992). Each proof has its merits and is instructive. We give two more proofs in this chapter, and others in Chapters 4 and 5. For now, we restrict attention to the index theorem for a SFABP. In Chapter 4 we develop ideas and prove index theorems for FABPs that are not simple, such as jobs that are subject to precedence constraints, and bandits with arrivals. Some readers may find it helpful to see the ideas of the above proof expressed in more notation. See exercise 4.2.
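The heart of this argument – that interleaving the n policy-independent, nonincreasing sequences of prevailing charges in ‘largest head first’ order maximizes their discounted sum – is easy to check numerically. The following sketch uses randomly generated charge sequences, purely for illustration, and compares the greedy interleaving against many random ones.

import random

# Illustration of the prevailing charges argument with made-up charge sequences.
random.seed(1)
a = 0.9
charges = [sorted((random.uniform(0, 1) for _ in range(5)), reverse=True)
           for _ in range(3)]                 # three nonincreasing sequences

def discounted(seq):
    return sum(c * a ** t for t, c in enumerate(seq))

def greedy_merge(seqs):
    heads = [list(s) for s in seqs]
    out = []
    while any(heads):
        i = max((j for j in range(len(heads)) if heads[j]),
                key=lambda j: heads[j][0])    # bandit with greatest prevailing charge
        out.append(heads[i].pop(0))
    return out

best = discounted(greedy_merge(charges))
for _ in range(5000):                         # compare with random interleavings
    heads = [list(s) for s in charges]
    seq = []
    while any(heads):
        j = random.choice([k for k in range(len(heads)) if heads[k]])
        seq.append(heads[j].pop(0))
    assert discounted(seq) <= best + 1e-12
print(round(best, 4))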
2.6 Gittins index
In this section we make some important observations about the Gittins index theorem and the index defined by (2.6).
2.6.1 Gittins index and the multi-armed bandit The index theorem for a SFABP was first obtained by Gittins and Jones in 1971. They presented their index solution to the multi-armed bandit problem to the European Meeting of Statisticians in Budapest in 1972. Publication of the theorem followed in 1974(a). They gave ν the name dynamic allocation index , and that is the name that was used in the first edition of this book. However, almost all researchers in the field now use the name Gittins index . We (Glazebrook and Weber) use this name throughout what we write. Applications of Gittins indices have been made in the fields of chemometrics, economics, engineering, numerical analysis, military planning, operational research, probability, statistics and website design. We give examples throughout the remaining chapters, and especially in Chapter 9. The problem that is most closely associated in many people’s minds with the Gittins index is the multi-armed bandit problem (Problem 5). However, the reader may be interested to know that the practical problem whose study led to discovery of the index theorem was not the multi-armed bandit problem, but rather a problem of selecting chemical compounds for screening as potential drugs. This is modelled as a type of bandit process that we term a target process (see Section 7.1).
It is worth repeating the remark of Whittle (1991) that the multi-armed bandit is not necessarily the most helpful model upon which to focus attention, since it involves the irrelevant complication that the state of an arm i is not ‘physical’ but ‘informational’, this being the ‘state of knowledge’ about an unknown success probability θi . It is interesting that in the multi-armed bandit problem the application of the continuation control is associated with the notion of Bayesian ‘learning’ (as it is also in other of the Chapter 1 problems). However, the notion of learning is not really central to the story of the Gittins index and index theorem. Indeed, it is because we wish to be thinking more generally than about ‘arms’ and the multi-armed bandit problem that we have defined ‘bandit processes’ and a FABP. As we have seen in Chapter 1, there are a very large number of problems that have solutions in terms of Gittins indices. Naturally, one is curious as to whether the index theorem continues to hold if aspects of the modelling assumptions for a SFABP are changed. There are both negative and positive results to be had. We see, for example, in Chapter 3 that the assumption that the discounted rewards are summed up to an infinite horizon cannot be altered to a sum up to a finite horizon. However, we see in Chapter 4 that it is possible to find index theorems for FABPs with precedence constraints and certain types of arrivals.
2.6.2 Coins problem To gain further appreciation as to why the index theorem is true, consider the following problem. Problem 8 Coins (see Section 2.11 and exercise 2.3) Suppose we have n biased coins, pi being the probability of a head when coin i is tossed. The coins may be tossed once each in any order. Each head tossed yields a reward, the reward from a head on the tth toss being a t−1 (0 < a < 1). In what order should the coins be tossed so as to maximize the expected total reward? For this problem the answer is obvious, the coins should be tossed in decreasing order of pi . Now suppose that mn coins are arranged in an m × n matrix. The n columns of this matrix can be imagined to be n piles of m coins. Let us impose the constraint that a coin cannot be tossed until all those coins above it have been tossed and removed from its pile. Suppose that in each pile of coins j , the value of pij decreases as we move down the pile, that is, as i = 1, 2, . . . , m. Let us call this the decreasing case. Under these conditions it is still possible to toss the coins in decreasing order of pij . Since this is the optimal policy without the constraints imposed by the piles, it must remain optimal when the constraints imposed by piles are respected. The problem is essentially unchanged if the piles are extended towards infinity by adding coins for which pij = 0, for all i > m. With this extension, the n piles
define a simple family of n alternative bandit processes. Our argument shows that for this SFABP it is optimal to adopt the myopic policy, which always selects the bandit process yielding the highest immediate expected reward. An easy additional argument shows that this policy is an index policy as described in Theorem 2.1.

In the general case, that pij is not decreasing as one moves down pile j, things are more complicated. We should try to toss coins with high pij values early in the overall sequence, but how should we proceed if these desirable coins are not at the tops of their respective piles? In some circumstances the myopic policy is still optimal, but it is not difficult to construct pij for which the optimal policy depends on the second, third, or for that matter the kth coin from the top of a pile, where k is arbitrary. From the index theorem we see that the coins problem is solved by defining for pile j the index

νj = max_{τ≥1} ( Σ_{i=1}^{τ} a^{i−1} pij ) / ( Σ_{i=1}^{τ} a^{i−1} ).
Let τj∗ be the greatest positive integer for which the maximum on the right-hand side above can be achieved by taking τ = τj∗ . The optimal policy begins by identifying j for which νj is greatest and then taking τj∗ coins from the top of pile j . Once this is done, νj can be recomputed, and the process repeats. In the decreasing case νj = p1j .
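Here is a small sketch of this calculation for some hypothetical piles; it computes each νj and the largest maximizing τj*, exactly as described above.

import numpy as np

# Pile indices for the coins problem; p[i] is the head probability of the
# (i+1)th coin from the top of the pile (all values made up).
a = 0.9
piles = [np.array([0.2, 0.9, 0.1]),
         np.array([0.6, 0.5, 0.4]),
         np.array([0.3, 0.3, 0.8])]

def pile_index(p):
    disc = a ** np.arange(len(p))
    ratios = np.cumsum(disc * p) / np.cumsum(disc)   # value of the ratio for tau = 1, 2, ...
    best = ratios.max()
    tau = int(np.flatnonzero(np.isclose(ratios, best)).max()) + 1   # largest maximizing tau
    return best, tau

for j, p in enumerate(piles):
    nu, tau = pile_index(p)
    print(f"pile {j}: nu_j = {nu:.4f}, tau_j* = {tau}")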
2.6.3 Characterization of the optimal stopping time

Looking at (2.5) and (2.6), it is reasonable to ask if the supremum over τ is actually attained by some stopping time. The following lemma states that the answer to this question is yes, and also provides a characterization of τ. Since we now focus on a single bandit process, say Bi, let us drop the subscript i, denote it by B, write its state as x, a reward as r(x), and trust that by paying attention to context the reader will not be confused by the fact that elsewhere x can denote a vector of bandit states.

Lemma 2.2 The supremum in (2.6) is attained by a stopping time τ. For x(0) = ξ, this stopping time has a stopping set, say Θ0 (⊆ E), which may be chosen to be any set such that {x : ν(B, x) < ν(B, ξ)} ⊆ Θ0 ⊆ {x : ν(B, x) ≤ ν(B, ξ)}.

Proof. Fix any λ. Consider the problem, given any initial state x, of maximizing, over stopping times τ ≥ 0, a payoff of Rτ(B, x) − λWτ(B, x). This defines a Markov decision process having the dynamic programming equation

φ(x) = max{0, r(x) − λ + aE[φ(x(1))|x(0) = x]}.

Choosing 0 within the right-hand side equates to stopping. The optimal stopping time is also one that achieves the supremum within (2.5). By standard results (see Theorem 2.10 at the end of this chapter), there exists a deterministic stationary Markov optimal policy. This equates to an optimal stopping rule that is defined by a stopping set, Θ0. The set Θ0 is optimal iff

x ∈ Θ0 ⇐ 0 > r(x) − λ + aE[φ(x(1))|x(0) = x],
x ∈ Θ0 ⇒ 0 ≥ r(x) − λ + aE[φ(x(1))|x(0) = x].

Using the definition of ν(B, x) in (2.5), we see that the right-hand sides above are true iff λ > ν(B, x) and λ ≥ ν(B, x), respectively. The lemma follows by putting λ = ν(B, ξ).
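Lemma 2.2 is easy to check by brute force on a small example (the data below are hypothetical): enumerate every stopping set, compute the corresponding quotient Rτ/Wτ by solving linear equations, and confirm that stopping on first entry to {x : ν(B, x) < ν(B, ξ)} attains the supremum in (2.6).

import itertools
import numpy as np

# Brute-force check of Lemma 2.2 on a made-up three-state Markov bandit.
a = 0.9
P = np.array([[0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4],
              [0.1, 0.2, 0.7]])
r = np.array([1.0, 0.4, 0.2])
N = len(r)

def quotient(xi, stop_set):
    # R_tau / W_tau for: continue at time 0, then stop on first entry to stop_set.
    cont = [x for x in range(N) if x not in stop_set]
    R = np.zeros(N)
    W = np.zeros(N)
    if cont:
        A = np.eye(len(cont)) - a * P[np.ix_(cont, cont)]
        R[cont] = np.linalg.solve(A, r[cont])
        W[cont] = np.linalg.solve(A, np.ones(len(cont)))
    return (r[xi] + a * P[xi] @ R) / (1.0 + a * P[xi] @ W)

def index(xi):
    all_sets = itertools.chain.from_iterable(
        itertools.combinations(range(N), m) for m in range(N + 1))
    return max(quotient(xi, set(s)) for s in all_sets)

nu = np.array([index(x) for x in range(N)])
for xi in range(N):
    lemma_set = {x for x in range(N) if nu[x] < nu[xi]}
    assert np.isclose(quotient(xi, lemma_set), nu[xi])
print(np.round(nu, 4))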
2.6.4 The restart-in-state formulation

Another characterization of the index is also interesting. Let us fix a state ξ (∈ E) and define μ(x) as the maximal payoff that can be obtained from bandit Bi, when at time 0 it starts in state x and we allow the option of restarting Bi in state ξ whenever and as often as we like. This is a Markov decision process, and in particular,

μ(ξ) = sup_{τ>0} E[ Σ_{t=0}^{τ−1} a^t ri(xi(t)) + a^τ μ(ξ) | xi(0) = ξ ].

This is equivalent to

μ(ξ) = sup_{τ>0} E[ Σ_{t=0}^{τ−1} a^t r(x(t)) ] / (1 − Ea^τ) = ν(B, ξ)/(1 − a).

Katehakis and Veinott (1987) have used this restart-in-state idea as the basis for computing indices by value iteration. To calculate the index for some state ξ ∈ E, we take μ_0(·) = 0, and define for all x ∈ E,

μ_{k+1}(x) = max{ μ_k(ξ), r(x) + a Σ_y P(y|x) μ_k(y) }.
Then μk (ξ ) → μ(ξ ) as k → ∞. In passing, let us make the trivial point that we could replace ν(·) with μ(·) in the statement of the index theorem (Theorem 2.1). For completeness, we
subsequently prove Theorem 4.8, which states that if the index theorem is true for the class of all bandits processes with discount factor a, then the index must be a strictly increasing function of ν(·).
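The restart-in-state recursion is straightforward to implement; the sketch below runs it for a hypothetical three-state bandit (the same made-up data as in the earlier calibration sketch, so the two computations should agree), recovering the Gittins index of state xi as (1 − a)μ(xi).

import numpy as np

# Katehakis and Veinott (1987) restart-in-state value iteration (data hypothetical).
a = 0.9
P = np.array([[0.3, 0.5, 0.2],
              [0.1, 0.4, 0.5],
              [0.0, 0.1, 0.9]])
r = np.array([1.0, 0.5, 0.1])

def gittins_via_restart(xi, sweeps=5000):
    mu = np.zeros(len(r))
    for _ in range(sweeps):
        mu = np.maximum(mu[xi], r + a * (P @ mu))   # mu_{k+1}(x) = max{mu_k(xi), r(x) + a E mu_k}
    return (1 - a) * mu[xi]

print([round(gittins_via_restart(x), 4) for x in range(3)])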
2.6.5 Dependence on discount factor

In (2.5) we have an interpretation of the index as the greatest per period rent that one would be willing to pay for ownership of the rewards arising from the bandit as it is continued for one or more periods. This idea makes the following theorem intuitive and also suggests its proof.

Theorem 2.3 The (discrete-time) Gittins index, ν(B, x, a), is nondecreasing in the discount factor a.

Proof. The set of stopping times over which the supremum is taken in equation (2.4) and in all the other expressions up to (2.5), may be extended to include randomized stopping times without affecting the validity of any of these expressions. One way to define a randomized stopping time, say τ′, is as the minimum of a non-randomized stopping time τ and an independent geometrically distributed random variable σ such that P(σ > t) = (b/a)^t, t = 0, 1, . . ., choosing b < a. Hence P(τ′ > t) = (b/a)^t P(τ > t). This means that for any λ, and b < a,

sup_{τ>0} E[ Σ_{t=0}^{τ−1} a^t [r(x(t)) − λ] ] ≥ sup_{τ>0} E[ Σ_{t=0}^{τ−1} b^t [r(x(t)) − λ] ],

where both expectations are conditional on x(0) = x. It follows from (2.5) that ν(B, x, a) ≥ ν(B, x, b).
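A tiny example makes Theorem 2.3 concrete. For a two-state bandit in which state 0 pays r0 and then moves to an absorbing state paying r1 per period, the only stopping rules that matter are essentially ‘stop at τ = 1’ and ‘never stop’, giving index max(r0, (1 − a)r0 + a r1), which is nondecreasing in a. The numbers below are hypothetical.

# Illustration of Theorem 2.3 with a deterministic two-state bandit.
r0, r1 = 0.2, 1.0
for a in [0.1, 0.3, 0.5, 0.7, 0.9, 0.99]:
    nu = max(r0, (1 - a) * r0 + a * r1)
    print(f"a = {a:4.2f}:  nu(B, 0, a) = {nu:.4f}")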
2.6.6 Myopic and forwards induction policies If the optimal stopping time on the right-hand side of (2.6) is τ = 1, then the index evaluates to Eri (xi (t)). If this is true for all Bi and xi then the Gittins index policy is identical to a myopic or one-step-look-ahead policy. In Section 2.11 we discuss some circumstances in which this happens. When the optimal stopping time is not always τ = 1, the Gittins index policy can be viewed as generalizing the idea of a one-step-look-ahead policy, since it looks ahead to a positive stopping time τ , choosing τ to maximize ντ (B) (= Rτ (B)/Wτ (B)). We may think of ντ (B) as an equivalent constant reward rate, since a standard bandit paying a constant reward of this amount would produce the same expected reward up to time τ . Thus, we may also think of ν(B) as a maximized equivalent constant reward rate. The index ν(B) can be used in two ways. Suppose that we repeatedly carry out a procedure of identifying a bandit, B, with greatest index, and then apply the continuation control to it until the next decision time for that bandit process
is reached. This defines our usual index policy. If, instead, we identify a bandit, B, with greatest index, but then apply the continuation control to it until time τ is reached, such that ν(B) = Rτ(B)/Wτ(B), then this defines what we call a modified index policy. In fact, it follows from Lemma 2.2 that for a SFABP these two types of policy are identical. One can generalize the idea in an obvious way and similarly define an index for any Markov decision process D, by

ν(D, x) = sup_{g,τ} E_g[ Σ_{t=0}^{τ−1} a^t r(t) ] / E_g[ Σ_{t=0}^{τ−1} a^t ],
where expectations are conditional on x(0) = x, the supremum is taken over g (a policy, which may be taken to be a deterministic stationary Markov policy) and τ (a positive stopping time), and Eg denotes expectation conditional on g being employed. Given a collection of Markov decision processes, D1 , . . . , Dn , one can now define a problem in which, at each decision time, we are to choose one process to continue, and an action with which to continue it. If we choose Di so that ν(Di , xi ) = maxj ν(Dj , xj ), continue Di up to its next decision time, and then repeat the procedure, we call this a forwards induction policy (or FI policy). If, however, we continue Di up to the time τ that is maximizing in calculation of ν(Di , xi ), we call this a modified forwards induction policy (or FI* policy). Since a modified index policy is actually a modified forwards induction policy, we shall henceforth use only the latter name. Neither a one-step-look-ahead policy, forwards induction policy, or modified forwards induction policy need be optimal, but they can provide guidance towards heuristics. Further discussion of forwards induction policies continues in Section 4.5.
2.7 Proof of the index theorem by interchanging bandit portions
The reward stream from a SFABP is formed by splicing together portions of the reward streams of the constituent bandit processes. A good policy is one which places reward stream portions with a high equivalent constant reward rate early in the sequence, subject to any constraints on precedence, in particular the precedence sequence defined by the process time for reward stream portions from the same bandit process. The decision-maker’s problem is like that of the cricket captain who would like his most effective batsmen to bat early in the innings, where they are likely to have the greatest possible effect on the outcome of the match. Given two standard bandit processes S(λ1 ) and S(λ2 ) with λ1 > λ2 , the reward from selecting S(λ1 ) for a time T1 and then selecting S(λ2 ) for a time T2 is
clearly greater than the reward if the order is reversed and S(λ2) is selected for T2 followed by S(λ1) for T1. This is because it is always better for high rewards to be given priority over low rewards, thereby incurring less discounting. This principle carries over immediately to the case of two arbitrary bandit processes.

Lemma 2.4 If bandit processes B1 and B2 have indices ν(B1) and ν(B2) with ν(B1) > ν(B2), and τ is a stopping time for B1 such that ντ(B1) = ν(B1), while σ is an arbitrary stopping time for B2, then the expected reward from selecting B1 for time τ, and then B2 for time σ, is greater than from reversing the order of selection.

Proof. We have

ν(B1) > ν(B2) ⇒ Rτ(B1)/Wτ(B1) > Rσ(B2)/Wσ(B2)
             ⇔ Rτ(B1)/(1 − Ea^τ) > Rσ(B2)/(1 − Ea^σ)
             ⇔ Rτ(B1) + Ea^τ Rσ(B2) > Rσ(B2) + Ea^σ Rτ(B1).
Since B1 and B2 are stochastically independent, this is the required inequality. Figure 2.1 depicts the interchange described in Lemma 2.4. The lemma shows that, for any two independent consecutive portions from different bandit processes, the expected reward is higher if the portion with the higher equivalent constant reward rate comes before the portion with the lower equivalent constant reward rate. It is thus a reasonable conjecture that it would be optimal to select a bandit process with an index value at least as high as those of any other bandit process, and to continue it for the stopping time corresponding to the index value. As explained in Section 2.6.6, we call the policy defined by repeatedly
Figure 2.1 An interchange of portions of B1 and B2 . Vertical ticks mark decision times. Bandit B1 has process time τ after receiving the continuation control at three decision times. Bandit B2 has process time σ after receiving the continuation control at two decision times. The figure depicts semi-Markov bandits, for which the interval between decision times is not a constant.
selecting bandit processes in this way a modified forwards induction policy, and the intervals between successive selections the stages of such a policy. Judicious use of Lemma 2.4 provides a proof of the index theorem. The following is essentially the original proof idea of Gittins and Jones (1974a). It relies on Lemma 2.2, which says that we may take the optimal τ, such that ντ(B) = ν(B), to be the first time that the index for B drops below its initial value.

Proof of the index theorem by interchanging bandit portions. Let us say that a policy lies in class Π_k if at no more than k decision times does it ever apply the continuation control to a bandit that does not have greatest Gittins index. We may think of such a policy as a Gittins index policy with k options to deviate. Consider first a policy π in Π_1 and an initial state in which π immediately exercises its option to deviate. Suppose that B1 is the bandit of uniquely greatest index, but π continues B2 which has smaller index. Let ν1 denote the initial value of the index of B1. Since π belongs to Π_1, it must continue B2 for a time τ2 until its index value first drops below ν1, and then continue B1 for a time τ1 until its index value also drops below ν1. Now consider a policy π′ which interchanges these two portions of the reward streams from B1 and B2; it continues B1 for time τ1, then continues B2 for time τ2, and then subsequently is the same as π; thus the reward streams after time τ1 + τ2 are identical under the two policies. Notice that π′ is in Π_1. By Lemma 2.4, the payoff of π′ is greater than that of π.

We must deal with the small technical point that perhaps more than one bandit initially has the greatest index, say ν(B1) = ν(B3) = ν1. In this case, π might continue B2 for time τ2, and let the choice of which of B1 or B3 to continue next depend on the realization of B2 that has just been observed. However, since B1 and B3 have the same index, any policy for ordering their initial (modified forwards induction) stages is as good as the one that begins with B1. So π cannot benefit by letting the order of initial stages of B1 and B3 depend on the realized initial portion of B2. Without loss of generality, we may assume that B1 continues before B3 and the argument of the above paragraph applies. We also see that if two or more bandits are of greatest index, it does not matter which we continue first.

Thus, we may conclude that if policies are restricted to Π_1 then there is no state of the bandits in which the single option to deviate from a Gittins index policy should be taken. By an inductive argument we may conclude that the Gittins index policy is optimal within the class Π_k (i.e. it follows from the above that the last option to deviate need not be used, and so we may restrict attention to Π_{k−1}, and inductively to Π_{k−2}, . . . , Π_0). Given any positive ε there exists an ε-optimal policy in class Π_k, once we take k sufficiently large. In discrete time this is obvious, and in continuous time it is an easy consequence of the Conditions A and B which we will meet in Section 2.8. Thus, the Gittins index policy is ε-optimal for all ε > 0. This means that it is optimal.
2.8 Continuous-time bandit processes
Thus far, we have placed our discussion in discrete time, and we have assumed that all bandits move on countable state spaces. The second of these assumptions can be relaxed by saying that the state space, Θ, should be a (Borel) subset of some (complete separable metric) space, that is equipped with a σ-algebra X of subsets of Θ which includes every subset consisting of just one element of Θ. Any further functions that we introduce to our model and which depend on states must be measurable with respect to X. The last two sentences need not worry readers who are unfamiliar with these notions, since in most interesting bandit problems the state space is either countable, or some nice subset of Euclidean space (such as R+).

In continuous time, a FABP is defined to be a type of semi-Markov decision process. This is a generalization of a Markov decision process in which the distribution of the time that elapses between two successive states x and y depends on x, y and the decision action (u ∈ Ω(x)) taken in state x. For a SFABP in which bandits are semi-Markov processes and have general state spaces, this means the following. To begin, t0 = 0 is a decision time. Suppose tj is the jth decision time after t0 and that at this time bandit Bi is in state xi and is chosen for continuation. Reward ri(xi) is immediately obtained. The next decision time is t_{j+1} = tj + T. At this time a bandit is again to be chosen for continuation. This could again be Bi, or some other bandit. At time t_{j+1} the state of Bi is y, where y lies in set A (A ∈ X) with probability Pi(A|xi). The probability of the event [T ≤ s] is Fi(s|xi, y). The functions ri(·), Pi(A|·) and Fi(s|·, ·) are to be X- and X²-measurable, but otherwise can be quite general.

In Section 2.4 we made certain claims about the value function and solution of the dynamic programming equation, which are true under the assumptions of strict discounting, bounded rewards and a finite control set. In the case of a semi-Markov decision process we need in addition a small technical condition that an infinite number of transitions should not occur in finite time. This is guaranteed if there exist positive δ and ε such that, whatever the state of the bandit that is continued, we have P(T > δ) > ε (see, for example, Ross, 1970, p. 157, or Puterman, 2005, Section 11.1).

Condition A  There exist δ (> 0) and ε (> 0) such that

∫ F[(δ, ∞)|x, y, u] P(dy|x, u) > ε    (x ∈ Θ, u ∈ Ω(x))
where (δ, ∞) denotes the unbounded open interval to the right of δ. Our assumption that rewards are bounded can also be replaced by a weaker condition. Let x(0) be the state of a semi-Markov decision process D at time zero, xg (t) the state at time t if policy g is followed, and Rg (D, x) the expected total discounted reward under policy g when x(0) = x. Let [t] denote the first decision time after time t.
MAIN IDEAS: GITTINS INDEX
37
Condition B E[a [t] Rg (D, xg ([t]))|x(0) = x] tends to zero uniformly over all policies g as t tends to infinity, for all x ∈ . Considering bandit process Bi as the only bandit process available, we now define, analogous to the definitions in Section 2.5, and with the stopping time τ constrained to the set of decision times {t1 , t2 , . . .}, a tj ri (xi (tj )), Rτ (Bi ) = E tj 0
As previously, these quantities depend on the initial state xi (0) of the bandit process B, and we can write (Bi , xi ) rather than (Bi ) if we wish to make explicit that xi (0) = xi . Notice that in the continuous-time setting Wτ (B) = γ −1 E(1 − a τ ), whereas for a discrete-time Markov bandit Wτ (B) = (1 − a)−1 E(1 − a τ ). The factor of γ (1 − a)−1 by which the discrete-time bandit differs will be termed the discretetime correction factor. The proofs of the index theorem in Sections 2.7 and 2.5 can be easily extended to this setting. So far as the proof in Section 2.7 goes, the important thing is that Lemma 2.4 still holds. For the proof in Section 2.5, the fair charge should now be levied continuously at every instant that Bi is being continued, so it is defined by ⎫ ⎧ ⎡ ⎤ τ ⎬ ⎨ a tj ri (xi (tj )) − λ a t dt xi (0) = xi ⎦ ≥ 0 . λi (xi ) = max λ : sup E ⎣ ⎭ ⎩ τ >0 0 t Cd .
2.9
Proof of the index theorem by induction and interchange argument
We finish this chapter’s exposition of basic theory with a third proof of the index theorem for a SFABP. This proof, which is due to Tsitsiklis (1994), uses a particularly straightforward interchange argument. It requires an assumption that the total number of possible bandit states is finite, but it has something new to offer because it constructively picks out states one by one, in order of their decreasing index values. This provides us with a practical algorithm by which to calculate indices. It is also a proof that is easy to generalize to a FABP with arrivals, as we do in Section 4.7. We start with a simple family of n alternative bandit processes, with constituents B1 , . . . , Bn . Suppose that Bi evolves in continuous time as a semiMarkov process on a finite state space, Ei = {1, 2, . . . , N }. Let us make the simplification that all bandits are identical, with the same state space E, but in different initial states. This is without loss of generality because we might imagine that all the bandits have the state space E = E1 ∪ · · · ∪ En , where E1 , . . . , En are distinct, and that the state transition probabilities are such that states in Ei do not
MAIN IDEAS: GITTINS INDEX
41
communicate with those in Ej , j = i. By making this assumption, the notation is simplified because any two bandits are distinguished only by their states. It is convenient to imagine that the reward structure is slightly different to that described in Section 2.8. Suppose that at decision time t we apply the continuation control to Bi . Having done this, we learn the state y to which the bandit process makes its next transition, and a positive random variable T , whose distribution depends on xi (t) and y. Reward is now earned at the rate of r¯ (xi (t)) throughout the interval [t, t + T ), until the next decision time is reached at t + T . If we take r¯ (xi ) = E
T 0
r(xi (t)) !, a t dt xi (0) = xi
then this model is equivalent to the standard semi-Markov model of Section 2.8 in which a reward of r(xi (t)) is received at time t. Henceforth we drop the bars and suppose that r(xi ) denotes the constant reward rate that accrues during the interval [t, t + T ). Proof of the index theorem. The proof has two parts. We start by arguing that the optimal policy is a priority policy. Subsequently, we show that this priority policy is the same as the one induced by Gittins indices. Part I. The proof that the optimal policy is a priority policy is by induction on the size of the state space, N , using a very simple interchange argument. Suppose that r(iN ) = maxi r(i). We claim that if at decision time 0 some bandit is in state iN then it is optimal to apply the continuation control to that bandit up to the next decision time. Suppose not, and that B1 is in state iN , but that a strictly better policy, say π, initially applies the continuation control to some bandit that is in state j , where j = iN . Imagine following this policy, perhaps applying the continuation control to a sequence of bandits, until eventually a decision time is reached at which the continuation control is first applied to a bandit in state iN , which without loss of generality we may take to be B1 . It is impossible for such a time never to be reached, since that would mean forgoing a period over which one can earn reward at the greatest possible rate. Suppose this happens at time τ , and that subsequently B1 is continued up to the next decision time τ + T . Define s w(s) = 0 a t dt = (1 − a s )γ −1 . We can now write the reward over the interval [0, τ + T ) as R + a τ r(iN )w(T ),
(2.8)
where R is the total discounted reward obtained over the interval [0, τ ) from the bandits to which the continuation control is applied before being applied to B1 , and a τ r(iN )w(T ) is the reward obtained from B1 . Now consider an alternative policy π , which starts by continuing B1 up to decision time T , then applies the continuation control to the same sequence of bandits as π would have done over [0, τ ), until reaching time T + τ , and then is identical to π thereafter. Clearly, it is possible to realize π . The two
42
MULTI-ARMED BANDIT ALLOCATION INDICES
policies have the same rewards after time τ + T . The reward under π over [0, t + T ) is r(iN )w(T ) + a T R.
(2.9)
Since r(iN ) is maximal, R ≤ r(iN )w(τ ). Making use of this, we can conclude that π is at least as good as π, because the difference between (2.9) and (2.8) is ! ! r(iN )w(T ) + a T R − R + a τ r(iN )w(T ) ! ≥ r(iN ) w(T ) + a T w(τ ) − w(τ ) − a τ w(T ) ! = r(iN ) 1 − a T + a T (1 − a τ ) − (1 − a τ ) − a τ (1 − a T ) γ −1 = 0. Now we can use induction on the size of the state space, N . The above argument has shown that if the continuation control is applied to a bandit and at the next decision time this bandit is in state iN , then it is optimal to continue that bandit until eventually it reaches a decision time and its state is not iN . This observation allows us to construct a modified problem from which state iN is deleted. We proceed as follows. First, we note that it is optimal to apply the continuation control to any bandits that are in state iN until no bandits in state iN remain. If now some bandit is in state i = iN and we apply the continuation control and it reaches state N , then it is optimal to apply the continuation control again and again until it is not in state iN . Thus, we can think of a modified problem in which a decision time is reached only when the bandit reaches a decision time of the original problem and furthermore is not in state iN . We can also modify the reward rates so that the total expected discounted reward up to the next decision time in the modified problem is the same as it would be in the unmodified problem, say r(·) → r1 (·). By this construction we have a new problem in which state iN has been removed. By an inductive hypothesis, a priority policy is optimal for a state space of size N − 1. The base case of N = 1 is trivial. This proves that there is an ordering of the states such that it is optimal always to continue the bandit whose state comes first in this order. Part II. It remains to relate the priority order that has been derived above to that which is implied by the Gittins indices. We see that the optimal ordering of the states is determined by an algorithm that picks states one by one. If state iN is picked first, the next state to be picked will be the one for which the reward rate, say r1 (i), in the modified problem is greatest; suppose this is for i = iN−1 . Let us define ν(iN ) = r(iN ), ν(iN−1 ) = r1 (iN−1 ), and so on. Clearly, ν(iN ) ≥ ν(iN−1 ). We continue finding the indices in this iterative manner and so they are nonincreasing in the order that the states are picked out.
MAIN IDEAS: GITTINS INDEX
43
Because of the way that we have defined the reward rates, t E r(x(t ))a x(0) = ij ν(ij ) = rN−j (ij ) =
t ν(x)] = 1, the supremum in (2.6) is attained when τ > 1, and ν(x) > r(x). Note that if r(x(t)) ≤ r(x)(t = 1, 2, . . .) almost surely, then for all τ ≥ 2, ERτ −1 (x(1)) ≤ r(x). EWτ −1 (x(1)) So
Rτ (x) r(x) + aERτ −1 (x(1)) ν(x) = sup = r(x), = max r(x), sup τ ≥1 Wτ (x) τ ≥2 1 + aEWτ −1 (x(1))
and hence the two conditions which are equivalent in Proposition 2.5 both hold. Thus the following proposition holds. Proposition 2.7 If P [r(x(t)) ≤ r(x), t = 1, 2, . . .] = 1, the supremum in (2.6) is attained when τ = 1, and ν(x) = r(x). Proposition 2.6 may be extended to give the following. Proposition 2.8 If P [ν(x(t)) > ν(x), t = 1, 2, . . .] = 1, the supremum in (2.6) is attained when τ = ∞, and ν(x) > r(x). This includes the case when ν(x(t)) is almost surely increasing. Propositions 2.5, 2.6, 2.7 and 2.8 all have obvious counterparts for a general semi-Markov bandit process.
2.11.2 Monotone jobs In Chapter 3 we will have much to say about the special type of bandit process called a job (which we have already met in Problems 1 and 4). A job is a bandit process which yields just a single positive reward upon being completed at some random time. A useful result, which also extends to generalized jobs (ones that can also produce rewards before completion), is also an immediate consequence of Lemma 2.2. Let C denote the completion state of a job, after which no reward can be obtained.
46
MULTI-ARMED BANDIT ALLOCATION INDICES
Proposition 2.9 If P [ν(x(t)) > ν(x) > 0|x(t) = C] = 1 (t = 1, 2, . . .), then the supremum in (2.6) is attained when τ = min{t : x(t) = C}, and ν(x) > Vρ(x) (x = C). An analysis of the Gittins index for jobs proceeds by supposing that switching is allowed either on completion of a job or when its process time (i.e. the amount of service it has so far received) is an integer multiple of some time unit . For a typical job, let V e−γ t be the reward on completion at time t. Suppose that its service time has distribution function F , which is differentiable, with density f , and the completion rate is ρ(s) = f (s)/[1 − F (s)]. Letting → 0 corresponds to no restrictions on switching (the preemptive case). This means (see Section 3.2) that for x ≥ 0 we have t V x f (s)e−γ s ds ν(x) = sup t . (2.10) −γ s ds t >x x [1 − F (s)]e Our discussion of monotone indices suggests that attractive results may also be obtained for monotone jobs, appropriately defined. Appropriate restatements of Propositions 2.5–2.9 may be obtained by replacing the unit time interval between decision times by and then letting → 0. Proposition 2.5 tells us that dν(x)/dx < 0 is equivalent to ν(x) = Vρ(x) and the supremum in (2.10) being as t x. Similarly, Proposition 2.8 tells us that dν(x)/dx > 0 is equivalent to ν(x) > Vρ(x) and the supremum in (2.10) being for t > x. Proposition 2.7 gives ν(x) = Vρ(x) if ρ(x) ≥ ρ(t)(t > x). To apply Proposition 2.9, note first that ν(t) is increasing for t > x if and only if the same is true of ∞ f (s)e−γ s ds Q(t, γ ) = ∞ t , −γ s ds t [1 − F (s)]e in which case ν(x) = V Q(x, γ ). Let σ denote the total service time required by a job. Then ∞ 1 − F (s) −1 Q(t, 0) = ds = E(σ − t|σ > t). (2.11) 1 − F (t) t Thus, given a set of jobs such that Qi (t, γ ) is increasing in t for each job i, the optimal policy is to process them in order of decreasing Vi Qi (0, γ ). This is the form of the optimal policy irrespective of the behaviour of Qi (t, γ ), and it is the same as the optimal policy if preemptions are not allowed. We have shown that the effect of having Qi (t, γ ) increasing in t is to reduce the problem to the non-preemptive case. This result was first proved by Rothkopf (1966). Propositions 2.5–2.9 are all stated for single bandit processes. They are corollaries of Lemma 2.2 or, more generally, of the fact that an optimal policy in any Markov or semi-Markov decision process must make an optimal choice of control action on the right-hand side of the dynamic programming equation
MAIN IDEAS: GITTINS INDEX
47
(see Theorem 2.10 (iii)). Extensions of the propositions to exploit this greater generality may be obtained in fairly obvious fashion. This is true, in particular, for generalized jobs subject to precedence constraints in the form of an out-forest (see Theorem 4.11).
2.12
History of the index theorem
It is now 40 years since Gittins and Jones first proved the index theorem for a SFABP. The idea that an optimal policy for a sequential decision problem may sometimes take the form of an index policy was not new at the time; it had been known for some very simple scheduling problems since the 1950s. For example, for Problem 1, it was Smith (1956) who observed that, to minimize the weighted flow time of n jobs that have known processing times on a single machine, one should process them in decreasing order of an index computed as weight divided by processing time (an index priority policy known as ‘Smith’s rule’). An elementary version of the multi-armed bandit problem (Problem 5) was solved by Bellman (1956), for the special case in which there are just two drugs and the success probability of one of them is known, and by Bradt, Johnson and Karlin (1956) in the case of undiscounted rewards and a finite horizon. These authors defined and discussed indices characterizing optimal policies (but without actually calling them indices). They presaged the idea, described above, in which an arm with unknown pi can be calibrated against an arm with known pk such that one is indifferent (in a two-armed bandit problem) as regards initially pulling arm i or k. The 1970s saw rapid advances in researchers’ appreciations of how index policies can be optimal for more complicated problems. Klimov (1974), Sevcik (1974), Harrison (1975), Tcha and Pliska (1977) and Meilijson and Weiss (1977) were all obtaining results about the optimality of index policies for scheduling problems in various forms of multi-class queues, and finding indices similar to the Gittins index. They were mostly unaware of the index theorem. This was first presented by Gittins and Jones at a conference in 1972 (whose proceedings were published in 1974). It became more widely known after the publication of Gittins’ 1979 paper. Perhaps it is because the theorem is so remarkable and unexpected that many researchers have sought to re-prove it for themselves. It is impossible to list all who have studied the index theorem and contributed proofs or additional insights. At the time of the first edition of this book we cited contributions by Beale (1979), Whittle (1980), Karatzas (1984), Varaiya, Walrand and Buyukkoc (1985), Mandelbaum (1986, 1988), Chen and Katehakis (1986), Kallenberg (1986), Eplett (1986), R. P. Kertz (1986), Tsitsiklis (1986), Katehakis and Veinott (1987) (restart-in-state method), and Lai and Ying (1988). The proof of Varaiya et al. is distinctive for showing that results can be placed in the context of monotone sequences of sigma-fields, also known as filtrations, for each bandit process, rather than the Markov or semi-Markov setting to which
48
MULTI-ARMED BANDIT ALLOCATION INDICES
most other work is restricted. They also give a simple algorithm for computing indices for a bandit process with a finite number of states, and show how it may be applied in the assignment of a single server to a finite set of queues between which jobs transfer in a stochastic manner on completion. Mandelbaum shows that extensions of the theory of stopping problems, in which time is replaced by a partially ordered set, provide a convenient language for describing bandit problems, and uses this to provide some extensions to the work of Varaiya et al. He gives an elegant proof of the index theorem for a SFABP for each of which the undiscounted reward to date is a continuous function of process time. The proof starts from the discrete-time case and uses limit arguments in similar fashion to the proof of Theorem 3.1. Since the first edition, further proofs have been given by Weber (1992) (see Section 2.5), El Karoui and Karatzas (1993, 1994), Ishikida and Varaiya (1994), Tsitsiklis (1994) (see Section 2.9), Bertsimas and Ni˜no-Mora (1993, 1996) (see Section 5.3), Glazebrook and Garbe (1996), Kaspi and Mandelbaum (1998) and B¨auerle and Stidham (2001) (who treat the deterministic fluid Klimov problem via an achievable region approach). We have mentioned that Dumitriu, Tetali and Winkler (2003) have adapted Weber’s proof to solve a problem of playing golf with multiple balls. Bank and K¨uchler (2007) show that the Gittins index in continuous time can be chosen to be pathwise lower semi-continuous from the right and quasi-lower semi-continuous from the left. The multi-armed bandit problem and the Gittins index have now been explained in books by Whittle (1983, 1996), Walrand (1988), Ross (1983, Chapter 7), Puterman (2005, Chapter 3) and Pinedo (2008, Chapter 10). Mahajan and Teneketzis (2008) provide a nice modern survey of the classical multi-armed bandit problem and its variants. E. Frostig and G. Weiss (1999) have made a study of four proofs that they find especially instructive, and have elucidated the connections amongst them. The following table lists some proofs that have been notable for their originality and impact on the field.
1972 1980 1985 1988 1989 1991 1992 1993 1994
Gittins, Jones Whittle Variaya, Walrand, Buyukkoc Weiss Gittins Tsoucas Weber Bertsimas, Ni˜no-Mora Tsitsiklis
Interchange argument Dynamic programming Interchange argument Interchange argument Interchange argument Achievable region Prevailing charge argument Achievable region Interchange and induction on size of the state space
MAIN IDEAS: GITTINS INDEX
49
We should mention that there are formulations of the multi-armed bandit problem which differ from that described in Problem 5. Suppose that, under some policy, the arm that is pulled on pull t is denoted it , and rit denotes the reward obtained. In a version of the multi-armed bandit problem defined by Robbins (1952), the regret after T pulls is ρ = T max{p1 , . . . , pn } −
T
rit ,
t=1
and one seeks a strategy such that the average regret ρ/T tends to 0 as T tends to infinity, for all values of the unknown parameters p1 , . . . , pn . This is possible with any sensible strategy that pulls each arm infinitely often, although some such policies are better than others in terms of minimizing a secondary criterion, such as the worst-case expected regret, supp1 ,...,pn E(ρ). Lai and Robbins (1985) show that, asymptotically, the regret must grow logarithmically in the number of pulls, whatever the values of the unknown success probabilities. Anantharam, Varaiya and Walrand (1987a,b) find similar results when one samples m processes at a time. Other researchers have addressed similar issues. For example, Mannor and Tsitsiklis (2003) show that it suffices to pull arms O[(n/ 2 ) log(1/δ)] times to find with high probability (of at least 1 − δ) an arm whose expected reward is within of the expected reward of the best arm.
2.13
Some decision process theory
We conclude this chapter with a short statement of some well-known results for Markov decision processes and semi-Markov decision processes. See Whittle (1983) and Puterman (2005) for more details. Let D be a decision process, with state space , such that in each state x (∈ ) the set of available control actions, (x), is finite. Suppose, moreover, that D is either a (discrete-time) Markov decision process, for which the functional equation is X(s) = max r(x, u) + a X(y)P (dy|x, u) , x ∈ , (2.12) u∈(x)
or a (continuous-time) semi-Markov decision process satisfying Condition A, for which the functional equation is ∞ X(x) = max r(x, u) + a t X(y)F (dt|x, y, u)P (dy|x, u) , x ∈ . u∈(x)
t=0
(2.13)
50
MULTI-ARMED BANDIT ALLOCATION INDICES
" If is countable, then . . . P (dy|x, u) may be replaced with y . . . P (y|x, u). The objective is to maximize a payoff of the expected sum of discounted rewards up to the infinite horizon. Let R(D, x) denote the supremum, over all policies, of the payoff that can be obtained when starting in state x. Value iteration is the term coined by Howard (1960) to describe a procedure that starts with some X -measurable function X0 (·), defined on , and then defines and determines functions Xn (·) iteratively from an equation constructed by taking either (2.12) or (2.13) and substituting Xn (x) in place of X(x), and Xn−1 (y) in place of X(y), on the left- and right-hand sides, respectively. If X0 (·) = 0, then Xn (D, x) is the supremum over all policies of the payoff that can be obtained when starting in state x, and prior to the (n + 1)th decision time. Let us further suppose that Condition B holds (which is trivial if r(·, ·) is bounded). We may summarize in a single theorem some important results. Theorem 2.10 (i) There is at least one optimal policy which is deterministic, stationary and Markov. (ii) R(D, ·) is the unique solution of the functional equation which is such that E{a [t] X(xg ([t]))|x(0) = x} tends to zero uniformly over g as t tends to infinity; it is the unique bounded solution in the case that r(·, ·) is bounded. (iii) A policy is optimal if and only if, for every x in , the control which it chooses in state x is one that achieves the maximum on the right-hand side of the functional equation when X(y) is replaced by R(D, y). (iv) R(D, ·) is an X -measurable function. (v) The function R(D, ·) may be determined by value iteration at each point in for which E{a [t] X0 (xg ([t]))|x(0) = x} tends to zero uniformly over g as t tends to infinity (and so trivially, if we take X0 (·) = 0).
Exercises 2.1. A unit cube has its corners at Euclidean points (i, j, k), i, j, k ∈ {0, 1}. Two tokens are placed at (0, 0, 1) and (0, 1, 1). A gift of value £G can be claimed if either token reaches (1, 1, 1). At any time you may pay £1 and point to a token; that token will then make a random move (with equal probability) to one of its three neighbouring points. What is the least value of G for which it is worthwhile to pay for a first move? Given a greater value of G, what is your optimal strategy? This problem, and many variations of it, are discussed by Dumitriu, Tetali and Winkler (2003).
MAIN IDEAS: GITTINS INDEX
51
2.2. Let F = {Bi : 1 ≤ i ≤ n} be a SFABP, g an index policy with respect to ν(B1 , ·), . . . , ν(Bn , ·), and Ti (t) (i = 1, 2, . . . , n) the process time for Bi , at time t when policy g is applied to F . Assuming a continuous-time setting, show that it follows from (2.5) and Lemma 2.2 that the expected total reward from F under policy g may be written as ∞ Rg (F ) = E max inf ν[Bi , xi (Ti (s))] a t dt. 0
1≤i≤n
0≤s≤t
This result is due to Mandelbaum (1986) and has obvious connections with the proof in Section 2.5. 2.3. In the ‘multi-processor’ version of the coins problem there are n piles of infinitely many coins, and the sum of discounted rewards (discount factor a) is to be maximized when taking m (< n) coins from the tops of piles at each step. Let us make the additional assumption that for any two distinct piles i and j , and coins in these piles at positions k and from the top, pki > pj ⇒ (1 − a)pki > pj . Show that under this assumption a greedy algorithm is optimal. Hint: begin by analysing the problem with n = 3, m = 2, and piles having {pj 1 } = {1, 0, 0, . . .}, {pj 2 } = {b, 0, 0, . . .}, {pj 3 } = {c, c, c, . . .}, where 1 > b > c. Pandelis and Teneketzis (1991) determine a similar condition on the reward processes sufficient to guarantee the optimality of the strategy that always continues the projects having the m greatest Gittins indices. 2.4. Given a discount parameter γ , let (I ) denote the maximal payoff that can be obtained from a SFABP which has as its constituent bandits a subset I of the set of bandits S = {B1 , . . . , Bn }. Show that (·) is a submodular set function, meaning that for any I, J ⊆ S, (I ∪ J ) + (I ∩ J ) ≤ (I ) + (J ). Hint: think about interleaving sequences of prevailing charges, as described in Section 2.5 (or see Weber (1992)). Let S be a set of n jobs. Suppose that job i has deterministic service time ti and pays a reward Vi on completion. It is desired to choose I as a size m subset of S for which R(I ) is maximal. Show that this can be done by computing the optimal such subset of size m − 1 and then adding to it a job which is not yet included and whose inclusion would increase the payoff by the greatest amount. We meet this problem again in example 5.15. 2.5. Verify that the following MATLAB program calculates the Gittins indices for a discrete-time bandit process with finite state space. (In MATLAB, X = A\B gives the solution to AX = B, and R(1:N,1)./W(1:N,1) is the
52
MULTI-ARMED BANDIT ALLOCATION INDICES
vector with elements R(i,1)/W(i,1). The value nan is ‘not a number’, and is ignored when doing calculations.) % Inputs: a = 0.95
% discount factor (scalar)
r = [1 2 3]’
% state reward vector (N x 1)
P = [.1 .2 .7; .4 .4 .2; .8 .1 .1] % probability transition % matrix (N x N) N = size(r,1);
% number of states
Ph = P;
% transitions to higher index % states
f = ones(N,1);
% flag indicating non-computed % states
for j = 1:N
% there are
= N indexes to
% compute R = ((eye(N)-a*Ph)\r);
% discounted reward until
W = ((eye(N)-a*Ph)\ones(N,1));
% discounted time until stopping
% stopping nu = R(1:N,1)./W(1:N,1);
% compute ratios nu(B,i)
nu = nu.*f;
% ignore for higher states
[nu_max,I] = max(nu);
% find best
G(I) = nu_max;
% copy in the result vector
f(I) = nan;
% exclude from remaining % computations
Ph(:,I) = P(:,I);
% include in transition % matrix Ph
end % Output is G = [1.9228 1.9481 1.9850]
2.6. Consider a discrete-time bandit process B with a finite state space {1, . . . , N }. From Section 2.9 it follows that the greatest Gittins index is maxi ri . Show that the least Gittins index is mini (1 − a)R(B, i), where R(B, i) denotes the payoff achieved when starting in state i and applying the continuation control continually. 2.7. Take the setup in the previous example. The algorithm described in Section 2.9 finds the Gittins indices in decreasing order of their magnitudes. Might it be possible to find them in reverse order, beginning with the smallest Gittins index? Take a discrete-time model and suppose all rewards are non-negative. Augment the state space by a state 0, with r0 = 0. Consider an undiscounted problem, equivalent to an original problem with discount factor a and transition matrix P , in which the transition matrix is taken to be P ∗ , where P ∗ (j |i) = aP (j |i), i, j = 0, P ∗ (0|0) = 1 and P ∗ (0|i) = 1 − a for all i = 0. Clearly, 0 is the state of smallest Gittins index. If i is the state of
MAIN IDEAS: GITTINS INDEX
53
next smallest Gittins index then its index is ντi (B, i, 1), where the ‘1’ means discount factor 1, and τi is the time that starting in state i the bandit state hits 0. Prove that the state of smallest index (apart from state 0) is the one for which ντi (B, i, 1) is least (cf. the previous example). Once this state has been identified, say i1 , we may proceed to find the state of second smallest index, by now thinking of τi as the first time to return from state i to the set {0, i1 }. Continuing in the obvious manner, the algorithm terminates when the state of greatest Gittins index is identified. Prove that this works.
3
Necessary assumptions for indices 3.1
Introduction
This chapter is principally about the modelling assumptions that are necessary for the index theorem to be true. However, before embarking upon a discussion of such assumptions, we begin in Sections 3.2 and 3.3 by saying a bit more about jobs. These are the special type of bandit processes that we have already met in Problems 1 and 4. Their properties, and the problems of scheduling them, form a recurrent theme in this book. At this point we carry forward our discussion of the Gittins index for a continuous-time job, which we have already begun in the context of monotone jobs in Section 2.11.2, and then outline the proof of the index theorem for a simple family of alternative bandit processes (SFABP) of jobs. Using jobs to provide examples, Section 3.4 investigates the modelling assumptions that are necessary for an index theorem to be true. These are that (i) rewards are accumulated up to an infinite time horizon; (ii) there is constant exponential discounting; and (iii) there is only one processor/server. Section 3.5 discusses some special cases in which an index theorem remains true even when these assumptions are relaxed. Varieties of discounting are discussed in Section 3.5.1. It has also been amongst our assumptions that discounting is strict, and that unless a bandit process receives the continuation control no rewards accrue from it and its state does not change. These assumptions are not met in Problems 1, 2, 3 or 6. Nonetheless, the index theorem can be used to solve them. In Section 3.5.2 we use the notion of stochastic discounting to solve Problem 3. In Section 3.5.3 we discuss the use of Gittins indices for instances such as Problem 1, in which Multi-armed Bandit Allocation Indices, Second Edition. John Gittins, Kevin Glazebrook and Richard Weber © 2011 John Wiley & Sons, Ltd. Published 2011 by John Wiley & Sons, Ltd. ISBN: 978-0-470-67002-6
56
MULTI-ARMED BANDIT ALLOCATION INDICES
undiscounted holding cost accrues from all unfinished jobs. An index theorem holds for such problems and the index can be obtained by letting the discount parameter tend to zero. This result is applied in 3.5.4 to solve Problem 3 (Search). The chapter returns to jobs in Section 3.5.5 with some results about minimum flow time scheduling on multiple processors.
3.2
Jobs
A bandit process which yields a positive reward at a random time, and no other rewards, positive or negative, is termed a job. A simple family of such bandit processes defines a version of the well-known problem of scheduling a number of jobs on a single machine, such as we have already met in Problems 1 and 4 of Chapter 1. Many papers on scheduling problems have been written over the past 50 years. For a review of the field the reader is referred in particular to the books by Conway, Maxwell and Miller (1967), Baker (1974), Coffman (1976), and the conference proceedings edited by Dempster, Lenstra and Rinnooy Kan (1982). French (1982) gives a convenient introduction. An excellent reference for problems in machine scheduling is the book of Pinedo (2008). In addition to their importance in scheduling theory, SFABPs of job type retain sufficient generality to provide further insight into why the index theorem holds, as well as providing counter-examples to any conjecture that substantial relaxation of its conditions might be possible. The remaining sections of this chapter are devoted to carrying out this programme, and also showing that a little ingenuity considerably stretches the range of problems for which the conditions of the theorem may be shown to hold. A natural extension of the notion of a job is to allow costs and rewards to occur before completion of the job. This leads to the definition of a job as a quite general bandit process, except for the existence of a state (or set of states) C representing completion. Selection of an uncompleted job at each decision time until all the jobs have been completed may be forced by supposing that a job in state C behaves like a standard bandit process with parameter −M, where M is large, as this convention means it can only be optimal to select a completed job for continuation if every job has been completed. Jobs of this kind will be termed generalized jobs to distinguish these from the less elaborate jobs of the opening paragraph of this section. Restrictions of the form ‘Job A must be completed before job B may be started’ are termed precedence constraints. They are present in many real-life scheduling problems. Resource allocation to jobs, including generalized jobs, under precedence constraints is one of the themes of Chapter 4. The extension to generalized jobs will not be made without explicit mention, and for the moment we revert to jobs with a single reward on completion. There remains the question of the times at which the processing, or service, of a job may be interrupted. As we have mentioned in our discussion of Problem 4, one possibility is that no interruptions are allowed and service is switched from one
NECESSARY ASSUMPTIONS FOR INDICES
57
job to another only when the first job has been completed. This is termed the nonpreemptive case. At the other extreme, there may be no restrictions on switching, so that every time is a decision time. An intermediate possibility is that switching is allowed either on completion of a job or when its process time (i.e. the amount of service it has so far received) is an integer multiple of some time unit . For a typical job let V e−γ t be the reward on completion at time t, σ be the service time (i.e. the process time required for completion), F (t) = P (σ ≤ t), f (t) = dF(t)/dt (if F is differentiable) and τk = min(σ, k). The state of the job is either C (denoting completion), or k if it has so far received service for a time k without being completed. Thus ν(C) = 0 and, for the intermediate case just described, k V r e−γ t dF(t) Rτk −r (r) . (3.1) ν(r) = sup = sup k k > r Wτk −r (r) k>r [1 − F (t)]e−γ t dt r
Note that in (3.1) the stopping time τk − r is conditioned by the event (σ > r). Thus it has the probability density function f (s + r)/[1 − F (r)] in the range 0 < s < (k − r) and P [τk − r = (k − r) | σ > r] =
1 − F (k) . 1 − F (r)
With this hint, and the fact that expectation of a non-negative random variable X with distribution function F may be written as ∞ EX = [1 − F (x)]dx, (3.2) 0
the derivation of (3.1) is left as a simple manipulative exercise. Various limiting cases of (3.1) are of particular interest. Letting → ∞ corresponds to the non-preemptive case, in which the only state of interest apart from state C is state 0, and we have ∞ V 0 e−γ t dF(t) ν(0) = ∞ . (3.3) −γ t dt 0 [1 − F (t)]e Letting → 0 corresponds to the preemptive case, and so for x ≥ 0, t V x e−γ s dF(s) ν(x) = sup t t > x x [1 − F (s)]e−γ s ds t V x f (s)e−γ s ds , if F is differentiable. = sup t −γ s ds t >x x [1 − F (s)]e
(3.4)
(3.5)
This is (2.10) and also (1.3), which was derived in a different way. The nature and optimality of the corresponding index policy are considered in the next section. Notice that there is now no interval between decision times. The quantity
58
MULTI-ARMED BANDIT ALLOCATION INDICES
corresponding to the reward rate up to the next decision time is the instantaneous reward rate Vf (x)/[1 − F (x)] = Vρ(x). The quantity ρ(x) may be termed the completion rate of the job (or in other contexts the hazard rate, or the mortality rate). Note that P (Completion before x + δx | no completion before x) = ρ(x)δx + o(δx). We have already noted in our discussion of Problem 4 that for n jobs with completion times ti , and rewards Vi , an index policy maximizes the expected total reward Vi e−γ ti = Vi − γ E Vi ti + o(γ ). (3.6) E i
i
i
In the limit as γ 0 this amounts to minimizing E Vi ti ; that is, minimizing the expected total cost, when V i is the cost per unit time of any delay in completing job i. The quantity ti is termed the flow time of the policy (or schedule) and Vi ti the weighted flow time with weights Vi . Putting γ = 0 in (3.3), and using (3.2), it follows that, if preemption is not allowed, the expected weighted flow time (EWFT) is minimized by scheduling jobs in order of decreasing Vi /Eσi . This is Problem 1 once again, with the trivial extension of random service times, now solved through our general theory. The formal justification of taking the limit as γ 0 is given in Section 3.5.3.
3.3
Continuous-time jobs
3.3.1 Definition For jobs, rewards occur only on completion, though intermediate times before completion may also be designated as decision times. This section explores the consequences of increasing the density of these intermediate decision times until, in the limit, every time is a decision time. Jobs with this property will be termed continuous-time jobs. A limit argument may be used to provide an appropriate extension of Theorem 2.1. We shall outline the proof; the full proof was given in Section 5.3 of the first edition of this book. The first stage is to define an index policy based on the index (3.5) for a simple family of alternative jobs. For an alternative route via variational arguments to index theorems for continuous-time jobs, see Nash and Gittins (1977). The question of what happens when rewards as well as decisions occur continuously over time is deferred to Sections 7.5 and 7.6.
3.3.2 Policies for continuous-time jobs One difficulty becomes apparent if we suppose that there are just two jobs, 1 and 2, with decreasing completion rates ρ1 (x1 ) and ρ2 (x2 ), so that νi (xi ) = Vi ρi (xi ),
i = 1, 2.
NECESSARY ASSUMPTIONS FOR INDICES
59
Suppose V1 ρ1 (0) = V2 ρ2 (0), and initially that switching between jobs is allowed only when xi = ki (ki = 0, 1, 2, . . . ), for some time unit , or when one of the jobs terminates. This restriction on switching modifies the form of the index. We have (r+1) Vi r fi (s)e−γ s ds νi (k) = (r+1) , [1 − Fi (s)]e−γ s ds r and νi (k) = Vi ρi (k) > νi (k) >νi ((k + 1)) = Vi ρi ((k + 1))
(i = 1, 2; k = 0, 1, 2, . . .).
Thus the modified indices also decrease as the jobs age, and an optimal -policy switches between jobs so as to keep the values of Vi ρ(xi (t)) (i = 1, 2) approximately equal for all t. If we now increase the number of decision times by decreasing , this approximate equality becomes closer and closer, and it seems a fair bet that in the limit, when switching can take place at any time, we should have exact equality at all times. To achieve this we must allow policies which give some service to both jobs in any time interval, however short. So what is required is an extension of the set of allowed policies to include this sort of shared service, and a theorem to show that the corresponding index policy is optimal. We shall state and outline the proof of a theorem for a SFABP F of n jobs under regularity conditions which could almost certainly be weakened, though it is sufficiently general for practical purposes. The distribution function of the service time of each job is assumed to be piecewise twice continuously differentiable, in other words twice continuously differentiable except at points each of which lies in an open interval in which it is the only point at which the service time distribution is not twice continuously differentiable. Denote by E1 the set of these exceptional points. Service for the n jobs is assumed to be available at a constant overall rate, which, without loss of generality, we assume to be 1. Thus if ui (t) is the rate of service allocated to job i at time t then i ui (t) = 1. The function ui (·) depends on the current state of the system, which is the vector of states for the n jobs. The state xi (t) of job i at time t is its age
t
ui (s)ds, 0
if it has not been completed, and C if it has been completed. If x k is the state of the system at the kth job-completion, and this occurs at time t, the allocation vector u(s) = (u1 (s), . . . , un (s)) is x k -measurable for s > t until the (k + 1)th job-completion. The x k -measurability of u(s) means that, although every time is a decision time, in the sense that u(s) may be changed at any s, all the information
60
MULTI-ARMED BANDIT ALLOCATION INDICES
relevant to determining u(s) between successive job-completions is available at the most recent of those job-completions. Piecewise twice continuous differentiability of the service time distribution F allows us to define an index policy for F as follows. The definition is expressed recursively in a kind of pidgin programming language. (i) Set t = 0, xi (t) = 0 (i = 1, 2, . . . , n). (ii) Determine the subset G1 of jobs j such that νj (xj (t)) = max νi (xi (t)), i
and the subset G2 of G1 for which ageing beyond age xj (t) produces initially an increase in the index. (iii) IF: G1 includes just one job k, GOTO (iv). G1 includes more than one job and G2 = ∅, select a job k ∈ G2 and GOTO (iv). G1 contains more than one job and G2 = ∅, GOTO (v). (iv) Set uk (s) = 1, ui (s) = 0 (i = k, t ≤ s < v), where v = inf{s > t : νk (xk (s)) ≤ νi (xi (t)) for some i = k}. (Note that v is random, as νk (C) ≤ νi (xi (t)) for all i.) Set t = v, GOTO (ii). (v) Choose u(s) so that Vi ρi (xi (s)) = Vj ρj (xj (s))
(i, j ∈ G1 , t ≤ s < v),
i.e. so that ρj (xj (s)) ρi (xi (s)) ui (s) = Vj uj (s) (i, j ∈ G1 , t ≤ s ≤ v) , Vi dxi dxi ui (s) = 1, and ui (s) = 0 (i ∈ / G1 ), G1
where v = sup{w : νi (xi (s)) is decreasing if t ≤ s < w, and the interval (xi (t), xi (w)) includes no E1 points for job i (i ∈ G1 ); νi (xi (s)) > νj (xj (s))
(i ∈ G1 , j ∈ / G1 , (t ≤ s < w)}.
(Thus for jobs in G1 the indices in the interval [t, v] are equal, decreasing, and greater than the indices of jobs outside G1 .) Set t = v, GOTO (ii).
NECESSARY ASSUMPTIONS FOR INDICES
61
Note that this definition includes no explicit rule for stopping. The recursion stops when v = ∞ in (iv) or (v).
3.3.3 The continuous-time index theorem for a SFABP of jobs Theorem 3.1 Index policies are optimal for a SFABP consisting of jobs. Outline of proof. From Theorem 2.1 we know that index policies are optimal in the class of -policies, for which switching between jobs is restricted to decision times occurring at intervals . To prove the theorem it would suffice to show that (i) the payoff from one of these index -policies tends as 0 to the payoff from a continuous-time index policy; and (ii) the maximal payoff from a -policy tends to the continuous-time maximal payoff as 0. The proof may be completed along these lines.
3.4
Necessary assumptions
The index theorem for a SFABP (Theorem 2.1) has been proved under modelling assumptions that (i) rewards are accumulated up to an infinite time horizon; (ii) there is constant exponential discounting; and (iii) in the language of scheduling, there is just one processor/server. Using jobs as the basis for examples, we now show that these assumptions cannot be relaxed.
3.4.1 Necessity of an infinite time horizon Consider first the infinite time horizon assumption. If instead of this there is a finite time horizon T at which the reward stream is terminated, we can see what happens by putting γ = 0 and supposing that the n jobs have deterministic service timesti (i = 1, 2, . . . , n), with . We thus have the problem of ti > T maximizing S Vi with respect to S subject to S ti < T , where S is a subset of the n jobs and Vi is the reward on completion of job i. Since the total time available is limited, the general aim must clearly be to include in S those jobs for which Vi /ti is large. Indeed if either Vi or ti is the same for all the jobs then the optimal S is obtained by listing the jobs in decreasing order of Vi /t i and defining S as the first m jobs in the list, where m is maximal subject to S ti ≤ T . With arbitrary values for the Vi s and ti s, however, this simple recipe does not always work. Consider, for example, the case T = 10, n = 3, (t1 , t2 , t3 ) = (7, 2, 3), (V1 , V2 , V3 ) = (9, 3, 4). Clearly the optimal S = {1, 3}, whereas the Vi /ti criterion leads to S = {2, 3}. Moreover, by varying T and/or n it is easy to construct cases for which S includes any subset of these three jobs, so there can be no alternative index giving the optimal policy in every case. This finite-horizon, undiscounted, deterministic version of our problem is well known in the literature as the knapsack problem. We are given a set of items of differing values and weights and seek to select a subset which maximizes the
62
MULTI-ARMED BANDIT ALLOCATION INDICES
total value of the subset subject to it not exceeding a given maximum weight. In general terms any solution to the knapsack problem must strike a balance between possibly conflicting desires to (i) include items for which Vi /ti is large; and (ii) avoid undershooting the weight restriction by any substantial amount. In other words, use of Vi /ti as an index must be tempered by the requirements of good packing so as to exploit the full weight allowance. The knapsack optimization problem belongs to a class of problems known as NP-hard . To explain the meaning of this term we need to start with an explanation of the class NP, which consists of decisions problems (i.e. those whose answer is yes or no) for which, if the answer to an instance is yes then there exists a certificate by which to check that the answer is yes in a computation time that is bounded by a polynomial function of the instance’s size (i.e. the number of bits needed to describe it). A problem A is said to be polynomial-time reducible (or just reducible) to some other problem B, if any instance of A can be transformed into an instance of B, and then an answer to B transformed back to provide an answer to A, these transformations taking a total time that is bounded by a polynomial function of the size of the instance of A. A problem is NP-hard if every problem in NP is reducible to it, and is NP-complete if it is also in NP. The knapsack optimization problem is not NP-complete because it is not a decision problem. However, it has a decision version which is NP-complete. This is the problem: given data (V1 , t1 ), . . . , (Vn , tn ), and numbers and T , does there exist a subset of items S such that S Vi ≥ and S ti < T ? Clearly, the knapsack decision problem is reducible to the knapsack optimization problem. The NP-complete class contains a large number of combinatorial decision problems. Any two problems in this class are reducible to one another. Yet it is strongly believed that for no problem in this class is it possible to bound the time needed to compute the solution to an instance by a polynomial function of the instance’s size. Problems that are NP-complete quickly become computationally intractable as the size of the problem instance increases. See, for example, Papadimitriou and Steiglitz (1982) for further information about these important classes of problems. The categorization of scheduling problems as to their computational complexities, and the identification of those which are NP-hard, has been an extremely active area of research. See Lawler, Lenstra and Rinnooykan (1989) and Lawler, Lenstra, Rinnooy Kan and Shmoys (1993).
3.4.2 Necessity of constant exponential discounting The question of which discount functions are compatible with an index theorem may also be explored by considering jobs with deterministic service times. Suppose two such jobs with the same index value have parameters (V1 , t1 ) and (V2 , t2 ), and d(·) is the discount function. Thus if these jobs are carried out in succession starting at time t, the total reward from them both should not depend on which of them is done first, so that V1 d(t + t1 ) + V2 d(t + t1 + t2 ) = V2 d(t + t2 ) + V1 d(t + t1 + t2 ).
NECESSARY ASSUMPTIONS FOR INDICES
63
Hence V1 d(t + t1 ) − d(t + t1 + t2 ) = , V2 d(t + t2 ) − d(t + t1 + t2 )
(3.7)
a constant independent of t, and (3.7) must hold for any t1 , t2 ∈ R+ for suitably chosen V1 and V2 . This is true only if d(t) = A + Be−γ t for some A, B and γ (see exercise 3.3). The constant A may be set equal to zero with no loss of generality, since its value has no influence on the choice of an optimal policy. Thus, putting d(0) = 1, we have d(t) = e−γ t for some γ , so that for a general index theorem to hold the discounting must be exponential. A proof of this result using Bernoulli bandit processes with two-point prior distributions is given by Berry and Fristedt (1985) for discount functions which are restricted to their discount sequences. These are defined as those for which t+1class ofregular t d(s)/ d(s) is nonincreasing in t. s=0 s=0 Note that a finite time horizon actually amounts to a form of discounting. If no rewards are allowed to accrue after time T this is equivalent to a discount factor d(t) such that d(t) = 0 for t > T , which is a particular form of non-exponential discounting.
3.4.3 Necessity of a single processor To conclude our discussion of assumptions that are necessary for the optimality of index policies for SFABPs, and other decision processes which may be reduced to this form, we now consider what happens when, in the language of scheduling, there is more than one processor, or machine. This means modifying the rules for a family of alternative bandit processes (FABP) so that the number of bandit processes being continued at any one time is equal to the number of processors. It turns out that quite strong additional conditions are required for an index theorem to hold, Once again we proceed by looking at what happens for jobs with no preemption. Suppose that a is close to 1. For an infinite time horizon the problem thus becomes one of minimizing expected weighted flow time. Let n be the number of jobs, m the number of machines, ci the cost per unit time of delay in completing job i, and si the service time of job i. As noted earlier, when there is just one machine, an optimal schedule orders deterministic jobs in decreasing order of ci /si . When there is more than one machine an optimal schedule clearly must preserve this feature so far as the jobs processed by each individual machine are concerned. However, this condition does not guarantee optimality, and in general there is no simple condition which does. A counter-example illustrates the difficulty which arises. Counter-example 3.2 n = 3, m = 2, (c1 , s1 ) = (c2 , s2 ) = (1, 1), (c3 , s3 ) = (2, 2). Jobs J1 , J2 , J3 all have the same value of the index ci /si , and so one might suppose that any selection of two jobs for immediate service would be equally
64
MULTI-ARMED BANDIT ALLOCATION INDICES Schedule 1 J1 J2
Machine 1 Machine 2 0
J3 1
2
3
Time
Schedule 2 J1
Machine 1 Machine 2
J2 J3
0
1
2
Time
Figure 3.1 Two alternative schedules for processing three jobs on two machines. good. However, as we see in Figure 3.1, the weighted flow times of Schedules 1 and 2 are 8 and 7, respectively. Fairly obviously Schedule 2 does better than Schedule 1 because it uses all the available processing capacity until every job is finished. This observation may be formalized as a theorem, which may be proved by simple algebraic manipulations. Theorem 3.3 For m machines and n deterministic jobs with ci /si the same for all minimization of weighted flow time is equivalent to minimization of jobs, δj2 , where Sm−1 + δj is the total process time for machine j , and S = si . This result gives a useful insight as to how to construct a schedule minimizing the weighted flow time, though it does not provide an explicit solution. Even in the case that all ci /si are the same, the problem of finding a schedule that minimizes the weighted flow time is NP-hard, just like the knapsack problem. In fact, for m = 2, finding a solution amounts to finding a subset of the n jobs for which the total service time is as close as possible to S/2. This is an alternative standard specification of the knapsack problem.
3.5
Beyond the necessary assumptions
3.5.1 Bandit-dependent discount factors One way in which we might try to relax the assumption of constant exponential discounting is to imagine that a reward which is obtained at time t is discounted by a t , but that a depends on the identity of the bandit producing the reward. In fact, the following theorem holds. The original unpublished proof by Nash used the terminology of Nash (1980), but is essentially as follows. Theorem 3.4 An optimal policy for a simple family F of two alternative bandit processes A and B with discount factors a and b is one which selects A or B according as νAB (x) > or < νBA (y) when A is in state x and B is in state y,
NECESSARY ASSUMPTIONS FOR INDICES
65
where, writing x = x(0) and y = y(0), and si and ti for the ith decision time after time zero for A and B respectively, E a si rA (si ) E bti rB (ti ) νAB (x) = sup τ >0
si 0
ti 0. The generalized bandit process is equivalent to a semi-Markov SFABP in which, if at a decision time the continuation control is applied to Bi , then a reward of ri (xi ) = Ri (xi )/Q(Bi , xi ) is obtained, and the duration until the next decision time is the sum of (i) a random variable T , whose distribution depends on xi and xi , where xi is the state to which Bi makes its next transition (doing so with probability Pi (xi |xi )); and (ii) a quantity S such that a S = Q(Bi , xi )/Q(Bi , xi ), which is positive by the assumption that Q(Bi , xi )/Q(Bi , xi ) < 1. Theorem 3.4 can be obtained as an application of the above, in which the process times, say v and w, for A and B, respectively, are incorporated in the definition of their states. Let c be a discount factor, and set Q(A, x) = (b/c)v Q(B, y) = (a/c)w v RA (x) = (a/c) rA (x) RB (y) = (b/c)w rB (y), so that ct Q(B, y)RA (x) = a t rA (x(t)) and ct Q(A, x)RB (y) = bt rB (y(t)). We may take c > max(a, b) so that cS = Q(A, x )/Q(A, x) < 1, and hence S > 0. The idea of a generalized bandit process can also be used to deal with a set of jobs only one of which has to be completed. Let Fi (xi ) denote the probability that job i completes before reaching age xi , and consider the generalized bandit process with Q(Bi , xi ) = 1 − Fi (xi ) (see Glazebrook and Fay, 1987). Two other tricks which may sometimes be used for the same purpose are described in Sections 3.5.2 and 3.5.4.
3.5.2 Stochastic discounting Theorem 3.4 is for a problem to which the index theorem does not obviously apply, the method of proof being to show that the problem is equivalent to a second problem, on a different time scale, satisfying the conditions for the index theorem. A similar device may be used for the gold-mining problem (Problem 9). For convenience we restate it here. Problem 2 Gold-mining A woman owns n gold mines and a gold-mining machine. Each day she must assign the machine to one of the mines. When the machine is assigned to mine i
NECESSARY ASSUMPTIONS FOR INDICES
67
there is a probability pi that it extracts a proportion qi of the gold left in the mine, and a probability 1 − pi that it extracts no gold and breaks down permanently. To what sequence of mines on successive days should the machine be assigned so as to maximize the expected amount of gold mined before it breaks down? The difficulty here is that when the machine breaks down no further gold can be mined, either from the mine on which the breakdown occurred or from any other mine, so that the reward streams available from unselected mines do not necessarily remain frozen. This can be resolved, as pointed out by Kelly (1979), by exploiting an equivalence between discounting a future reward, and receiving the same reward with a probability less than one, which may be regarded as stochastic discounting. This is an idea that we have already used within a proof of the index theorem in Section 2.8. A deterministic policy for the gold-mining problem is a fixed schedule of the mines to be worked on each day until the machine breaks down. Under a particular policy let Ri be the amount of gold mined on day i if there is no breakdown on day i, and let Pi be the probability of no breakdown on day i, conditional on no breakdown during the previous i − 1 days. Write Ti = − loge Pi . Thus the expected total reward may be written as ∞
−
Ri exp
Tj .
j ≤i
i=1
This is also the total reward from a deterministic SFABP F for which Ri is the reward at the ith decision time (including time zero), Ti is the interval between the ith and (i + 1)th decision times, and the discount parameter is 1. There is an obvious one–one correspondence between gold mines and the bandit processes in F . For mine i let rij be the amount of gold mined on the j th day that it is worked if there is no breakdown on or before this day, pij be the probability of no breakdown on the j th day conditional on no earlier breakdown, and tij = − loge pij , generalizing somewhat the original specification of the problem. The initial index value for mine i is rij exp − tik rij pik νi0 = max N≥1
j ≤N
1 − exp
k≤j
−
k≤N
tik
= max N≥1
j ≤N
1−
k≤j
.
(3.8)
pik
k≤N
If rij pij (1 − pij )−1 is decreasing in j , the maximum occurs for N = 1 and νi0 = ri1 pi1 (1 − pi1 )−1 . If rij pij (1 − pij )−1 is increasing in j , the maximum occurs for N = ∞.
68
MULTI-ARMED BANDIT ALLOCATION INDICES
Since F is a deterministic SFABP, the expression (3.8) for the index νi0 may be generalized immediately to the situation when the sequence {(rij , pij ) : j = 1, 2, . . .} is stochastic. The index theorem (Theorem 2.1) still applies. If the rij s are all negative it is not very helpful to think of this problem in terms of gold mining, though as we shall see there are some genuine applications to research planning. Equation (3.8) holds with no restriction on the sign of the rij s, but a second derivation which works when every rij ≤ 0 is of some interest. The single-machine scheduling problem with generalized jobs was formulated as a SFABP by appending to the end of each job a standard bandit process with a large negative parameter −M. This has the effect of ensuring that every job is completed before one of these standard bandit processes is selected. For the gold-mining problem with every rij ≤ 0 a similar device may be used. Each mine corresponds to a generalized job, and breakdown of the machine to completion of the job being processed. To ensure that no further costs are incurred once a job has been completed we simply append to each job at the point of completion a standard bandit process with parameter zero. This has the desired effect of formulating the problem as an equivalent SFABP, except that the discount factor is one and the index values are all zero. We can also consider a discounted version of the problem. If mine i is mined for the j th time (and the machine is not yet broken down), then reward rij pij is obtained; subsequent to collecting this reward, the machine breaks down with probability 1 − a 1+loga pij , and so no further rewards can be obtained. We know that an index policy is optimal for such a model. See Section 2.8. One example of a gold-mining problem with non-positive rij s is the problem facing a chemist who wishes to synthesize a compound by one of n possible routes. Each route has a number of stages, at each of which synthesis is possible, −rij being the cost of the j th stage of route i. A second feature of pharmaceutical research is that compounds are subjected to a series of tests, known as screens, before being accepted for clinical trials. These tests are for therapeutic activity and absence of toxic side-effects, and result in the rejection of the vast majority of compounds. If a compound is going to fail it is of course desirable that the cost of finding this out should be as small as possible, with consequent implications for the order of the screens. For example, if a sequence of activity screens and a sequence of toxicity screens have been fixed, the problem of merging the two sequences into one may be modelled as a gold-mining problem with n = 2 and non-positive rij s. A more detailed discussion of this problem may be found in Section 3.3 of Bergman and Gittins (1985).
3.5.3 Undiscounted rewards

Problems 1 and 6 are interesting because they touch upon two further assumptions for a SFABP that we have not yet explored. One is that rewards are strictly
discounted; that is, the discount factor a is less than 1 and the discount parameter γ is strictly positive. Another is that no reward accrues from a bandit process when it is frozen. These assumptions are not met in Problems 1 and 6, and yet it is reasonable to hope that the index theorem would be useful for these problems. We have already noted in Section 3.2 that if under a given policy the decision times for a decision process are t_i, at which time a reward r(t_i) occurs (i = 0, 1, 2, . . . ), the total discounted reward is
\[
\sum_{i=0}^{\infty} r(t_i)e^{-\gamma t_i} = \sum_{i=0}^{\infty} r(t_i) - \gamma \sum_{i=0}^{\infty} t_i r(t_i) + o(\gamma).
\]
An optimal policy is one which maximizes the expectation of this quantity, or, in the limit as γ ↓ 0, minimizes the expectation of Σ_i t_i r(t_i). This is a generalization to arbitrary decision processes of the expected weighted flow time (EWFT) criterion of Section 3.2. It was conjectured there that, for a SFABP made up of jobs, this criterion is optimized by the index policy obtained by putting γ = 0 in the expression for the index in the discounted case. The conjecture will now be shown to be true, and to extend to arbitrary SFABPs under the generalized EWFT criterion. Under the following condition the proof is easy. An argument similar to that which we will use in Section 3.5.4 may be used to remove the condition.

Condition C A SFABP comprised of B_1, . . . , B_n is said to satisfy Condition C if there exists ε > 0 such that for any pair of distinct bandit processes B_i, B_j in states x, y, either ν(B_i, x) > ν(B_j, y) for all γ < ε, or ν(B_i, x) < ν(B_j, y) for all γ < ε.

Corollary 3.5 Under Condition C the generalized EWFT criterion for a SFABP is minimized by the index policy obtained by putting γ = 0 in the general expression for the index.

Proof. It follows from Theorem 2.1 that the Gittins index policy is optimal for all positive discount parameters. From Condition C it follows that this policy is the same for all γ < ε. The corollary follows by considering the limits as γ ↓ 0 of the payoffs (i) under the Gittins index policy, and (ii) under any other policy.

For any index theorem an obvious analogue of Corollary 3.5 holds, and may be proved in similar fashion (see also Theorem 5.13). Problems with state-dependent holding costs are discussed further in the context of tax problems in Section 4.9.
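The small-γ expansion underlying this limiting argument is easy to confirm symbolically; a two-reward sketch (Python with sympy, used here purely for illustration):

```python
import sympy as sp

gamma, r1, r2, t1, t2 = sp.symbols('gamma r1 r2 t1 t2', positive=True)
payoff = r1 * sp.exp(-gamma * t1) + r2 * sp.exp(-gamma * t2)

# Expands to r1 + r2 - gamma*(r1*t1 + r2*t2) + O(gamma**2),
# matching the first-order expansion displayed above.
print(sp.series(payoff, gamma, 0, 2))
```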
3.5.4 A discrete search problem

We now turn to another of our problems from Chapter 1, which we repeat here for convenience.

Problem 3 Search A stationary object is hidden in one of n boxes. The probability that a search of box i finds the object if it is in box i is q_i. The probability that the object is in box i is p_i and changes by Bayes' theorem as successive boxes are searched. The cost of a single search of box i is c_i. How should the boxes be sequenced for search so as to minimize the expected cost of finding the object?

This problem is an obvious candidate for modelling as a SFABP, the boxes corresponding to bandit processes, box i being in state p_i, and searching a box corresponding to continuation of the bandit process. The difficulty is that, each time a box is searched without finding the object, the effect of Bayes' theorem is not only to decrease the probability that the object is in that box, but also to increase the probability that it is in any of the other boxes. The state of a bandit process only changes when it is continued, so we must find some other way to proceed. A related SFABP in fact turns out to have an identical cost structure, thereby providing a solution in terms of an index.

Problem 3A The ith bandit process in a simple family F of n alternative bandit processes passes in deterministic sequence through states j = 1, 2, . . . when it is continued (i = 1, 2, . . . , n). It takes a time c_i to pass from one state to the next, and a cost p_i(1 − q_i)^{j−1} q_i accrues per unit time until the jth of these transitions.

The SFABP F has, then, a cost structure of the type covered by Corollary 3.5. Under a given policy π_A let c(i, j) be the time at which bandit process i reaches state j. Thus the total cost under policy π_A is
\[
\sum_{i=1}^{n} p_i \sum_{j=1}^{\infty} (1-q_i)^{j-1} q_i\, c(i,j). \tag{3.9}
\]
For any policy π_A for Problem 3A there is a corresponding policy π for Problem 3 which searches box i whenever policy π_A continues bandit process i, until the point when the object is found. Under policy π the expected cost of finding the object is also given by the expression (3.9), so that corresponding sets of policies are optimal for the two problems. If Condition C were to hold it would therefore follow from Corollary 3.5 that optimal policies for the search problem would be defined by the index policies for F given by letting γ ↓ 0 in the Gittins index (as defined by (2.7)) for the
equivalent reward problem with discount parameter γ. For box i after j searches of the box and the object not yet found, this index for bandit process i in state j is
\[
\nu_i(j,\gamma) = \sup_{N>j} \frac{\sum_{r=j+1}^{N} p_i (1-q_i)^{r-1} q_i \exp[-\gamma(r-j)c_i]}{\gamma^{-1}\{1-\exp[-\gamma(N-j)c_i]\}}
= \frac{p_i (1-q_i)^{j} q_i \exp(-\gamma c_i)}{\gamma^{-1}[1-\exp(-\gamma c_i)]}, \tag{3.10}
\]
where the supremum occurs at N = j + 1 because (1 − q_i)^{r−1} is a decreasing function of r. Thus
\[
\nu_i(j) = \lim_{\gamma \downarrow 0} \nu_i(j,\gamma) = \frac{p_i(1-q_i)^{j} q_i}{c_i}. \tag{3.11}
\]
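The convergence in (3.11) can be seen numerically by evaluating the closed form at N = j + 1 from (3.10) for decreasing γ (a tiny sketch in Python with invented parameter values):

```python
import math

def nu(p, q, c, j, gamma):
    # value of (3.10) at the optimal stopping point N = j + 1
    return p * (1 - q) ** j * q * math.exp(-gamma * c) / ((1 - math.exp(-gamma * c)) / gamma)

p, q, c, j = 0.4, 0.3, 2.0, 5
for gamma in [1.0, 0.1, 0.01, 0.001]:
    print(gamma, nu(p, q, c, j, gamma))
print('limit (3.11):', p * (1 - q) ** j * q / c)
```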
The expression (3.9) and subsequent calculations are expressed in terms of the initial probability p_i that the object is in box i (i = 1, 2, . . . , n). After j(i) unsuccessful searches of box i, for each i, the effect of Bayes' theorem is to transform these prior probabilities into posterior probabilities p_i proportional to p_i(1 − q_i)^{j(i)}. It thus follows from (3.11) and Corollary 3.5 that if Condition C holds then optimal policies are those conforming to the index p_i q_i/c_i. In fact, the indices defined by (3.10) do not always satisfy Condition C. This is because there are infinitely many possible values of ν_i(j, γ), for i = 1, 2, . . . , n, j(i) = 1, 2, . . ., and given an arbitrarily small γ we can find pairs of indices whose ordering changes as the discount parameter passes through this value of γ. However, an argument can be used to finesse this. Suppose that in Problem 3A the cost of p_i(1 − q_i)^{j−1} q_i which accrues per unit time until the jth transition of bandit i is replaced by 0 when j > K. Call the resulting SFABP F^K. In this notation the SFABP F becomes F^∞. Let π denote an index policy with indices that are defined by (3.11) for j ≤ K and which are 0 for j > K. Let C_g(F) denote the total cost incurred in F under some policy g. Analogous meanings are given to C_π(F), C_g(F^K) and C_π(F^K). Condition C holds for F^K since the number of possible index values for F^K is finite. So by applying Corollary 3.5 to F^K we have
\[
\inf_g C_g(F) \ge \inf_g C_g(F^K) = C_\pi(F^K) = C_\pi(F) - \delta(K),
\]
where δ(K) → 0 as K → ∞. Hence π is optimal for F . Thus if we adopt the convention that the probabilities pi change according to Bayes’ theorem, the following result, obtained by D. Blackwell (reported by Matula, 1964), is a consequence of Corollary 3.5.
Theorem 3.6 Optimal policies for Problem 3 are those conforming to the index p_i q_i/c_i.

The present derivation via the SFABP F, given by Kelly (1979), has the advantage of leading immediately to considerable generalization. A simpler, and more direct, proof is given as an exercise at the end of the chapter.

Suppose now that the probability of finding the object on the jth search of box i, conditional on not finding it before then, and provided it is in box i, is q_{ij}, and that the cost of the jth search of box i is c_{ij}. For the associated SFABP F now let bandit process i pass through the sequence of states x_i(0) (= 0), x_i(1), x_i(2), . . . Let c_{ij} be the time taken to pass from state x_i(j − 1) to x_i(j), and let a cost per unit time of
\[
p_i \prod_{r=1}^{j-1} (1-q_{ir})\, q_{ij}
\]
accrue until the state x_i(j) has been reached. This formulation allows the sequences (c_{i1}, q_{i1}), (c_{i2}, q_{i2}), . . . to be independent random processes (i = 1, 2, . . . , n). The expected total cost under a given policy π_A for F is now
\[
E\Bigl\{ \sum_{i=1}^{n} p_i \sum_{j=1}^{\infty} \prod_{r=1}^{j-1} (1-q_{ir})\, q_{ij}\, c(i,j) \Bigr\},
\]
where c(i, j) is the time at which bandit process i reaches state x_i(j). This is also the expected cost of finding the object under the corresponding policy P for the search problem. Again it follows from Corollary 3.5 that optimal policies for the search problem are those based on an index, which now takes the form
\[
\nu_i(x_i(j)) = \sup_{N>j} \frac{p_i\, E\Bigl\{\sum_{r=j+1}^{N} \prod_{s=j+1}^{r-1}(1-q_{is})\, q_{ir} \Bigm| x_i(j)\Bigr\}}{E\Bigl\{\sum_{r=j+1}^{N} c_{ir} \Bigm| x_i(j)\Bigr\}}. \tag{3.12}
\]
Here N is an integer-valued stopping variable for bandit process i, and pi is the current posterior probability that the object is in box i. The theory of optimal search for a physical object was initiated by B. O. Koopman, working for the United States Navy during the Second World War. The most complete theory is for a unique stationary object, though there is an obvious need for a theory which extends to multiple objects, including false targets, and to mobile objects, which may or may not wish to be found, depending on the friendliness or otherwise of the searcher’s intentions. Developments of the theory have taken place along each of these directions and are still continuing. The key
text, though it does not deal with objects which move so as to help or hinder the search process, is the book by Stone (1975). This is complemented by books of Gal (1980) and Alpern and Gal (2003) which treat both search and rendezvous games. Koopman (1979) sets the subject in an interesting historical perspective, and Strümpfer (1980) has compiled a useful index of about 400 books and papers. The final section of the book by Ahlswede and Wegener (1987) is also on this subject.
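To illustrate Theorem 3.6 concretely, the following sketch (plain Python; the instance is invented) builds the search order prescribed by the index p_i q_i/c_i, evaluates its expected cost by summing terms of the same form as (3.9) over a long finite prefix, and compares it with a round-robin order:

```python
def expected_cost(sequence, p, q, c):
    """Expected cost of finding the object when boxes are searched in the given (long, finite)
    order; the neglected tail is negligible provided every box recurs often enough."""
    searched = {i: 0 for i in p}
    spent = 0.0
    total = 0.0
    for i in sequence:
        searched[i] += 1
        spent += c[i]
        # probability (under the initial prior) that this very search finds the object
        total += p[i] * (1 - q[i]) ** (searched[i] - 1) * q[i] * spent
    return total

def index_sequence(p, q, c, length):
    """Search order produced by the index p_i q_i / c_i of Theorem 3.6, with the
    (unnormalized) posterior for a box multiplied by 1 - q_i after each failed search."""
    post = dict(p)
    order = []
    for _ in range(length):
        b = max(post, key=lambda b: post[b] * q[b] / c[b])
        order.append(b)
        post[b] *= 1 - q[b]
    return order

p = {1: 0.5, 2: 0.3, 3: 0.2}     # prior location probabilities
q = {1: 0.3, 2: 0.6, 3: 0.9}     # detection probabilities
c = {1: 1.0, 2: 2.0, 3: 4.0}     # search costs
print(expected_cost(index_sequence(p, q, c, 200), p, q, c))   # index order
print(expected_cost([1, 2, 3] * 67, p, q, c))                 # round robin, for comparison
```

The first printed cost should be no larger than the second, since by Theorem 3.6 the index order is optimal among all search sequences.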
3.5.5 Multiple processors

In Section 3.4.3 we observed that, in general, there is no index policy which can determine the schedule minimizing the weighted flow time of n jobs, having processing times s_1, . . . , s_n and weights c_1, . . . , c_n, when these are to be processed by m (> 1) identical machines which operate in parallel. However, there are two obvious ways that we might reduce the generality of the problem. These are the special cases in which (i) s_i = constant; or (ii) c_i = constant. In both of these cases the obvious extension of the single machine index theorem holds; the single machine index c_i/s_i becomes c_i in (i), and s_i^{-1} in (ii). For case (i) this is a statement of the obvious, and it is easily extended to allow the service times to be identically distributed independent variables. The less obvious case is (ii), for which Baker (1974) has proved the following.

Theorem 3.7 The flow time for a set of n deterministic jobs on any fixed number m of machines is minimized by scheduling in order of increasing service time. Such policies are the only optimal policies, except that for every positive integer r the jobs in the rth position from the end of the schedule for each machine may be permuted arbitrarily between machines without changing the flow time.

In view of the simplicity of the solutions for deterministic jobs when either s_i = constant or c_i = constant, it is of interest to consider whether c_i/Es_i serves as an index when service times are random, all c_i = 1, and the expected service times Es_i are equal. We might conjecture that under the stated conditions every initial selection of jobs for service is optimal. However, this is not so, as the following counter-example shows.

Counter-example 3.8 n = 4, m = 2, c_1 = c_2 = c_3 = c_4 = 1.
P(s_i = 0) = P(s_i = 2) = 1/2 (i = 1, 2), s_3 = s_4 = 1.
Let W denote the (weighted) flow time. Thus if jobs 1 and 2 are processed first, P(W = 2) = 1/4 = P(W = 10), P(W = 5) = 1/2 and so EW = 5½. If jobs 3 and 4 are processed first, P(W = 4) = 1/4 = P(W = 8), P(W = 6) = 1/2, so EW = 6. As in Counter-example 3.2, the more efficient schedule does better because it enables both machines to be used until every job has been completed.
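Both expectations can be reproduced by enumerating the four equally likely realizations of (s_1, s_2); a minimal sketch in Python (not from the book), using simple list scheduling on the two machines:

```python
from itertools import product

def flow_time(first_pair, s):
    """Total flow time on two identical machines when the two jobs in `first_pair` are started
    at time 0 and the remaining jobs are taken, in order, by whichever machine frees up first."""
    order = list(first_pair) + [j for j in (1, 2, 3, 4) if j not in first_pair]
    free = [0.0, 0.0]                      # times at which each machine becomes free
    completions = []
    for j in order:
        m = free.index(min(free))
        free[m] += s[j]
        completions.append(free[m])
    return sum(completions)

for first_pair in [(1, 2), (3, 4)]:
    ew = sum(0.25 * flow_time(first_pair, {1: s1, 2: s2, 3: 1, 4: 1})
             for s1, s2 in product([0, 2], repeat=2))
    print(first_pair, ew)                  # 5.5 with jobs 1 and 2 first, 6.0 with jobs 3 and 4 first
```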
Let us now try to bring these results together. If expected weighted flow time is to be minimized then c(t) = Σ_{i∈I(t)} c_i must be reduced rapidly, where I(t) is the set of jobs which remain uncompleted at time t. The quantity c(t) is, of course, the rate of increase of weighted flow time at time t, or simply the weight at time t. The aim then is an early reduction of weight. When there is just one machine and job i is being processed, the rate of weight reduction should be regarded as equal to c_i/Es_i for a time Es_i. This view of the matter is confirmed by the index theorem. With m machines working on a set J(t) of jobs at time t, the rate of weight reduction is Σ_{i∈J(t)}(c_i/Es_i). Again the aim is to ensure that large values of this quantity occur sooner rather than later. A perfect realization of this aim occurs if

(i) for each machine, c_{i(t)}/Es_{i(t)} is decreasing in t, where i(t) is the job on which the machine is working at time t;

(ii) (i, j) ∈ J(t) ⇒ c_i/Es_i = c_j/Es_j, for all t; and

(iii) J(t) consists of m jobs, so that no machine is idle, until the simultaneous completion of the last m jobs.

Note that condition (iii) is included in (i) and (ii) if we regard an idle machine as working on a job for which c_i = 0. For m = 1, conditions (ii) and (iii) are satisfied trivially and condition (i) is the specification of an index policy, which thus gives the desired perfect realization. For m > 1 only exceptionally is a perfect realization possible. Counter-examples 3.2 and 3.8 are two of these exceptional cases. More generally there is some degree of conflict between conditions (i), (ii) and (iii). For example, there is a conflict between (ii) and (iii) if Counter-example 3.2 is modified by increasing c_1 and c_2 by the same amount, or between (i) and (iii) if c_1 and c_2 are reduced by the same amount in Counter-example 3.8. Simple criteria for optimally resolving such conflicts are available only when the generality of the problem is reduced in some way; for example, when the s_i s are identically distributed, or the c_i s are equal and the s_i s deterministic.

To relate these insights to the general setting of a SFABP with m servers it is useful to think in terms of the stages in the realization of a bandit process which were used to define a modified forwards induction policy. For a given bandit process let the current stage at time t be the stage which finishes earliest after time t. Let ν_i^*(t) be the index value for bandit process i at the beginning of the stage which is current at time t. In Section 2.5 we called this the prevailing charge. Recall that ν_i^*(t) may be regarded as the equivalent constant reward rate for bandit process i at time t. By extension, Σ_{i∈J(t)} ν_i^*(t) is the equivalent constant reward rate generated by the SFABP at time t, where J(t) denotes the set of bandit processes in progress at time t. As discussed in Section 2.6.2 for the single server case, an optimal policy must be one for which large values of Σ_{i∈J(t)} ν_i^*(t) occur sooner rather than later. A perfect realization of this aim occurs if

(i) ν^*_{i(j,t)}(t) is decreasing in t (j = 1, 2, . . . , m), where i(j, t) is the bandit process being served by server j at t; and
(ii) (i, j) ∈ J(t) ⇒ ν_i^*(t) = ν_j^*(t) for all t.

For m = 1, condition (i) is satisfied by an index policy. This follows for reasons already discussed in Section 2.6.3 (i.e. prevailing charges are decreasing). Since condition (ii) is satisfied trivially, it follows that an index policy is a perfect realization of the aim just stated. This is what we would expect in view of the index theorem. For m > 1 there is in general some conflict between conditions (i) and (ii), as already shown by our discussion of jobs with no preemption. Note that for such a job it follows from equation (3.3) that c_i/Es_i is the limit as a → 1 of ν_i^*(t), provided t < s_i, so that our discussion of jobs with no preemption does serve as an illustration of the general problem. A further illustration is provided by the problem of minimizing weighted flow time for discrete-time jobs with no restriction on preemption. Weber (1980, 1982), in his work on multi-processor stochastic scheduling, has shown (see also Gittins, 1981) that the index rule is optimal for any m if the weights are the same for every job, the service times are independently and identically distributed, and the completion rate is a monotone function of process time. It is perhaps rather surprising that such strong conditions are required, but there seems to be no general result with significantly weaker conditions. For example, for decreasing completion rates which are not necessarily the same for every job, a myopic policy may not be optimal, even with equal weights. This is shown by the following counter-example.

Counter-example 3.9 n = 3, m = 2, c_1 = c_2 = c_3 = 1.
\[
p_{11} = p_{21} = p_{3j} = \tfrac{1}{10} \quad (j = 1, 2, \ldots), \qquad p_{1j} = p_{2j} = \tfrac{1}{100} \quad (j = 2, 3, \ldots),
\]
where p_{ij} = P(s_i = j | s_i > j − 1).

In Section 2.11.2 it is shown that for jobs with decreasing completion rates the stopping time in the definition of the index is the interval between decision times. For discrete-time jobs and no preemption restrictions this means that each of the stages of a modified forwards induction policy lasts for just one time unit. Since only integer values of t are relevant, this in turn means that ν_i^*(t) is simply the index for job i at time t. It follows from (3.1) that ν_i^*(t) = c_i p_{i,i(t)} where i(t) is the process time of job i at time t, provided job i is still uncompleted at time t. In the present case, since c_1 = c_2 = c_3 = 1, the index value for an uncompleted job i at process time j is equal to the completion rate p_{ij}. Since
p11 = p21 = p31 it follows that there are myopic policies which start by serving any two of the three jobs. It is a simple matter to check that the expected weighted flow time for the policy 12 which serves jobs 1 and 2 at time 1 is greater than for the other two myopic policies 13 and 23. This difference is to be expected since under 12 the indices for the two bandit processes continued at time 2 differ by at least 0.09, whereas under 13 and 23 these indices are equal with probability 0.892. Thus policies 13 and 23, which are in fact both optimal, do not violate condition (ii) so much as does 12.
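That check can be carried out exactly with a short recursion over the finite set of job configurations, since a job's completion rate changes only after its first unit of service. The sketch below (plain Python, written for this illustration and not taken from the book) forces the stated first step and applies the myopic rule thereafter:

```python
from functools import lru_cache

# Completion probability of a job, given whether it has been served before.
# Jobs 1 and 2: 1/10 on the first unit of service, 1/100 thereafter; job 3: 1/10 always.
def rate(job, served):
    return 0.1 if job == 3 or not served else 0.01

# A state is a frozenset of (job, served_before) pairs for the uncompleted jobs.
def transitions(state, serve):
    """Yield (probability, next_state) when the jobs in `serve` receive one unit of service."""
    serve = list(serve)
    probs = [rate(j, s) for (j, s) in serve]
    for outcome in range(2 ** len(serve)):
        prob, nxt = 1.0, set(state)
        for k, (j, s) in enumerate(serve):
            nxt.discard((j, s))
            if outcome >> k & 1:           # job j completes
                prob *= probs[k]
            else:                          # job j survives and has now been served
                prob *= 1 - probs[k]
                nxt.add((j, True))
        yield prob, frozenset(nxt)

@lru_cache(maxsize=None)
def V(state):
    """Expected remaining flow time under the myopic rule (serve the <= 2 jobs of highest rate).
    Flow time is accumulated as the number of uncompleted jobs at the start of each step."""
    if not state:
        return 0.0
    serve = sorted(state, key=lambda js: -rate(*js))[:2]
    total, p_loop = float(len(state)), 0.0
    for prob, nxt in transitions(state, serve):
        if nxt == state:
            p_loop += prob     # self-loop: no completion and no rate change
        else:
            total += prob * V(nxt)
    return total / (1 - p_loop)

def flow_time(first_pair):
    """Expected flow time when `first_pair` is served at step 1 and the myopic rule is used after."""
    start = frozenset((j, False) for j in (1, 2, 3))
    total = 3.0
    for prob, nxt in transitions(start, [(j, False) for j in first_pair]):
        total += prob * V(nxt)
    return total

for pair in [(1, 2), (1, 3), (2, 3)]:
    print(pair, round(flow_time(pair), 3))   # 12 is strictly worse; 13 and 23 coincide
```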
Exercises

3.1. Establish the identity (3.1).

3.2. Suppose that two jobs have parameters (V_i, t_i) (i = 1, 2), and they are to be carried out in succession, starting at time s. Reward obtained at time t is to be discounted by a factor d(t). Show that if, for all values of the parameters, an interchange argument is to be capable of determining which job should be served first, and this is to be independent of s, then d(t_1 + s)/d(t_2 + s) must be independent of s. Show that this is true only if d(t) = e^{−γt} for some γ.

3.3. Prove the assertion following equation (3.7) that 'this is true only if d(t) = A + Be^{−γt} for some A, B and γ'. Hint: put t_1 = δ, t_2 = 2δ in (3.7). By expanding as a power series in δ, deduce that d″(t)/d′(t) must be a constant.

3.4. Why does the second derivation of equation (3.8), given at the end of Section 3.5.2 for the case that all r_{ij} ≤ 0, not necessarily work if some of the r_{ij}s are positive?

3.5. Prove Theorem 3.3.

3.6. Deterministic policies for Problem 3 are defined by the sequence, with repetitions, in which the boxes are to be searched, up to the point when the object is found. Attention may be restricted to sequences in which each of the boxes occurs infinitely often, since otherwise there is a positive probability that the object is never found. Denote by (p, i) the pth search of box i, and by (p, i)(q, j) part of a sequence in which (p, i) immediately precedes (q, j).

Lemma 3.10 If (p, i)(q, j) occurs in a given sequence, and that sequence is better than the one obtained by replacing (p, i)(q, j) by (q, j)(p, i), then any sequence which includes (p, i)(q, j) is better than the one derived from it by interchanging (p, i) and (q, j).

Denote the relationship postulated in Lemma 3.10 between (p, i) and (q, j) by (p, i) (q, j).
For simplicity assume that (p, i) (q, j) excludes the possibility that (q, j) (p, i).

Lemma 3.11 If (p, i) (q, j) and (q, j) (r, k) then (p, i) (r, k).

Lemma 3.12 If (p, i) (q, j) then (p, i) (q + 1, j).

Prove the above three lemmas, and hence show that the optimal policy for Problem 3 is given by the index p_i q_i/c_i. Similar lemmas leading to an optimal policy in the form of an index may be established for a variety of other discrete search problems, not all of which may easily be formulated as FABPs. Lehnerdt (1982) gives an interesting account of this approach which extends, for example, to more than one hidden object, dangerous searches in which there is a possibility of the searcher being put out of action, different objective functions, and problems with precedence constraints.

3.7. Consider a SFABP that is composed of a countable infinity of identical bandits, all initially in the same state. At every instant in time the continuation control is to be applied to exactly m of the bandits, m ≥ 1. Show that the obvious generalization of the Gittins index policy is optimal; in other words the continuation control should be applied to a set of m bandits having the m greatest index values. Hint: bound the payoff by a multiple of the maximal payoff that could be obtained when m = 1, and then show that this bound can be attained. Now suppose that a switching cost of c(x, y) must be paid whenever the application of continuation control is switched from a bandit in state x to a bandit in state y. Show that an index policy is optimal. Hint: argue that it is unnecessary to ever switch back to a bandit once the continuation control is switched away from it. These results are from the work of Bergemann and Välimäki (2001).

3.8. Consider an undiscounted discrete-time setting. The age of a job is the number of units of processing it has already received. Given that its age is x it completes processing after the next unit of service with probability p(x) = 1 − q(x). Letting s denote its remaining processing time, we have
\[
Es = \sum_{t=1}^{\infty} P(s \ge t) = 1 + \sum_{t=2}^{\infty} \prod_{k=0}^{t-2} q(x+k).
\]
Suppose there are two jobs, of ages x1 and x2 . At the first step only one processor is available, but at all subsequent steps there are two processors available. Notice that the expected sum of completion times, say E(t1 + t2 ), is the same no matter which job is given service by the single processor which is available at the first step. However, E min(t1 , t2 ) differs, and this might be the time that some third job can be started. So to minimize the
expected sum of completion times of the three jobs (their flow time), we wish to minimize E min(t_1, t_2). If job 1 is served at the first step then
\[
E\min(t_1, t_2) = 1 + q(x_1) + q(x_1)\sum_{t=1}^{\infty} \prod_{k=0}^{t-1} q(x_1+1+k)\, q(x_2+k).
\]
Show that if q(·) is either monotone increasing or monotone decreasing then it is optimal to serve at the first step the job for which q(xi ) is least, i = 1, 2. Hint: take x2 = x1 + 1. This example provides an introduction to theorems about optimal policies for multi-processor stochastic scheduling (see Weber, 1982).
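For a numerical feel, the following sketch (plain Python; the monotone choice q(x) = 0.9 · 0.8^x is invented for illustration) evaluates the truncated series for the two possible first-step choices:

```python
def e_min(x_first, x_second, q, T=500):
    """E min(t1, t2) when the job of age x_first is served alone at the first step,
    using the series above truncated after T terms."""
    total = 1.0 + q(x_first)
    prod = 1.0
    for t in range(1, T + 1):
        prod *= q(x_first + t) * q(x_second + t - 1)
        total += q(x_first) * prod
    return total

q = lambda x: 0.9 * 0.8 ** x    # survival probability, decreasing in age
x1, x2 = 0, 3                   # q(x1) > q(x2)
print(e_min(x1, x2, q))         # older job kept waiting at the first step
print(e_min(x2, x1, q))         # older job (least q) served first: the smaller value
```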
4 Superprocesses, precedence constraints and arrivals

4.1 Introduction
This chapter is about families of alternative superprocesses, and families of alternative bandit processes that are not simple. The notion of a superprocess is introduced in Section 4.2. This is a generalization of a bandit process in which, if superprocess Si is chosen for continuation, then the control with which to continue it must be chosen from a set which may have more than one element. In Section 4.3 we prove an index theorem for a simple family of alternative superprocesses. The proof uses the idea from Section 2.8 (in which the index is given an interpretation of a ‘fair stake’). However, an index theorem for superprocesses holds only under a fairly strong condition, which ensures that the choice of control is the same whenever Si is in a given state and is the one selected for continuation. This condition is met by a type of stoppable bandit process, as explained in Section 4.4. A family of alternative bandit processes (FABP) is not simple if not all bandits are available at all times. This can happen because there are precedence constraints, or because bandits arrive over time. To prove index theorems in such cases, the key idea remains that of showing how reward streams arising under an arbitrary policy can be dismembered and respliced so as to form reward streams that yield at least as great a payoff, and conform to those that would arise under an index policy. However, the analysis of rearranged reward streams must be more intricate than in Section 2.7. It is couched in terms of freezing rules, promotion rules, and modified forwards induction policies. In Section 4.5 we develop these notions and reprove the index theorem for a simple family of alternative bandit processes (SFABP). This is overkill for a SFABP, but the methodology comes
into its own in Section 4.6, where we prove an index theorem for scheduling generalized jobs that are subject to precedence constraints. In Section 4.7 we obtain an index theorem for jobs with out-forest precedence constraints. Problems in which new bandits arrive over time have been called branching bandits and arm-acquiring bandits. An important feature is that the arrivals are linked to the bandit being continued. These are addressed in Section 4.8, using the proof methodology of Section 2.9. In Section 4.9 we turn to ‘tax problems’, those in which rewards (or costs) are also accrued from the bandits that are frozen. This section also includes a discussion of Klimov’s problem and the problem of scheduling to minimize expected weighted flow time in an M/G/1 multiclass queue. We conclude with Section 4.10, where we find a bound on the suboptimality of policies that do not always select the bandit with greatest index.
4.2 Bandit superprocesses
The index theorem for a SFABP has been established in Section 2.7. We also made the important observation in Section 2.6.6 that an optimal policy can also be characterized as a modified forwards induction policy. Our aim in this section is to generalize these results to a family of alternative decision processes, in a way we now describe. Recall that a SFABP F is formed from a set {B1 , . . . , Bn } of bandit processes with the same discount factor, by requiring that at each decision time the freeze control 0 is applied to all but one of the constituent bandit processes. So F generates a single reward stream depending on the policy followed. The new starting point is a set of decision processes {D1 , . . . , Dn } all with the discount factor a. From each Di we now form a second decision process Si by adding to the control set i (xi ) at state xi the freeze control 0 which, as in the case of a bandit process, leaves the state unchanged and generates zero rewards for as long as it is applied. A similar composition of the constituents {S1 , . . . , Sn } may be achieved by again requiring that at each decision time the freeze control is applied to all but one of the constituents. The only difference is that at each decision time a choice must now be made both of an Si and of a control (other than the freeze control) from its control set. The constituents Si of the composite decision process F defined in this way will be termed bandit superprocesses, or simply superprocesses, and F itself a family of alternative superprocesses. As in the case of a SFABP, if there are no additional constraints on the set of superprocesses available for selection at a decision time then we have a simple family of alternative superprocesses (SFAS). Notice that if a superprocess S is operated under a deterministic stationary Markov policy g then this gives a bandit process. Let us denote it as Sg . So we already have an index, which we may write as ν(Sg , x) for state x of Sg . An index policy for a SFAS F comprised of superprocesses S1 , . . . , Sn must pick out at each decision time both a superprocess Sj and a control from the control set for the corresponding Dj . Thus consider the functions
μ_i : (x_i, u_i) → R, x_i ∈ Θ_i, u_i ∈ Ω_i(x_i) (i = 1, 2, . . . , n), where Θ_i is the state space for S_i and Ω_i(x_i) is the control set for D_i in state x_i. An index policy for F with respect to μ_1, μ_2, . . . , μ_n is one which at any decision time when S_i is in state x_i (i = 1, 2, . . . , n) applies control u_j to superprocess S_j, for some S_j and u_j such that
\[
\mu_j(x_j, u_j) = \max_{\{i,\; u_i \in \Omega_i(x_i)\}} \mu_i(x_i, u_i).
\]
Note that this maximum exists because of our standing assumption that control sets are finite. For superprocesses we should like a function which is an index for as large as possible a set I_a of superprocesses with discount factor a. Suppose that D_g denotes decision process D when it is operated under a deterministic stationary Markov policy g. Define
\[
\nu_g(D, x) = \sup_{\tau > 0} \frac{R_\tau(D_g, x)}{W_\tau(D_g, x)}.
\]
This is the same as ν(D_g, x) above, but it is helpful in what follows to change the notation slightly. Similarly, we shall use R_{gτ}(D, x) as an alternative for R_τ(D_g, x). An obvious candidate is the function ν(·, ·, ·) defined by the equation
\[
\nu(S, x, u) = \sup_{\{g : g(x) = u\}} \nu_g(D, x) \qquad (S \in I_a,\; x \in \Theta_D,\; u \in \Omega_D(x)), \tag{4.1}
\]
where D is formed from S by removing the freeze control. In fact, Theorem 4.8 (to be proved later) states that any possible index for superprocesses must be an increasing function of ν(S, x, u). Because of Theorem 4.8, an index policy with respect to ν(·, ·, ·) will be referred to simply as an index policy. The question now is just how inclusive can we define I_a to be if ν(·, ·, ·) is to be an index for I_a. Let B_a denote the entire uncountable set of bandit processes with the discount factor a. If I_a = B_a then S ∈ I_a ⇒ S is a bandit process and ν(S, x, u) = ν(S, x, 1) = ν(S, x). Thus ν(·, ·, ·) is an index for I_a by Theorem 2.1. We can actually extend I_a well beyond B_a, though not to include every superprocess with discount factor a. Clearly any superprocess which leads to optimal policies that do not conform to the index ν(·, ·, ·) must be excluded from I_a. This possibility is illustrated by the following example (Figure 4.1).

Figure 4.1 Precedence constraints for Example 4.1.
Example 4.1 Jobs 1, 2 and 3 are subject to the above precedence constraints, so that job 3 cannot be started until jobs 1 and 2 have been completed. Jobs 1, 2 and 3 require, respectively, 1, 2 and 1 unit(s) of service time up to the first decision time after commencement of the job, and then terminate with probabilities 1/2, 1 and 1, yielding rewards 0, 1 and M (> 5⅔). After these times every job behaves like a standard bandit process with parameter 0, whether it has terminated or not. A reward at time t is discounted by the factor (1/2)^t. The three jobs form a decision process D and, with the addition of the freeze control 0, a superprocess S. The controls in a control set, other than control 0, may be identified with the jobs available for selection, and the policies of interest identified by job sequences, with the convention that a job equivalent to a 0 standard bandit process is selected only if no further positive rewards are available. Thus, if x is the initial state,
\[
\nu(S, x, 1) = \nu_{123}(D, x) = \frac{\tfrac{1}{2}\bigl\{\bigl(\tfrac{1}{2}\bigr)^{3}\cdot 1 + \bigl(\tfrac{1}{2}\bigr)^{4} M\bigr\}}{1 + \tfrac{1}{2}\bigl\{\tfrac{1}{2} + \bigl(\tfrac{1}{2}\bigr)^{2} + \bigl(\tfrac{1}{2}\bigr)^{3}\bigr\}} = \frac{2+M}{46},
\]
\[
\nu(S, x, 2) = \nu_{213}(D, x) = \frac{\bigl(\tfrac{1}{2}\bigr)^{2}\cdot 1 + \tfrac{1}{2}\bigl(\tfrac{1}{2}\bigr)^{4} M}{1 + \tfrac{1}{2} + \bigl(\tfrac{1}{2}\bigr)^{2} + \tfrac{1}{2}\bigl(\tfrac{1}{2}\bigr)^{3}} = \frac{8+M}{58}.
\]
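The two quotients just displayed, together with the total expected rewards of the two job sequences, can be checked with exact arithmetic; a small sketch (plain Python, written for this illustration) does so for a few values of M:

```python
from fractions import Fraction as F

def indices_and_payoffs(M):
    a = F(1, 2)                   # discount factor
    # index quotients of Example 4.1 (stop at t = 1 if job 1 fails to terminate, else at t = 4)
    nu1 = (a * (a**3 * 1 + a**4 * M)) / (1 + a * (a + a**2 + a**3))   # equals (2 + M)/46
    nu2 = (a**2 * 1 + a * a**4 * M) / (1 + a + a**2 + a * a**3)       # equals (8 + M)/58
    # total expected discounted rewards of the two job sequences
    payoff_123 = a**3 * 1 + a * a**4 * M
    payoff_213 = a**2 * 1 + a * a**4 * M
    return nu1, nu2, payoff_123, payoff_213

for M in [10, 21, 30]:
    nu1, nu2, p123, p213 = indices_and_payoffs(F(M))
    print(M, nu1 > nu2, p213 > p123)
# the ordering of the two indices flips only beyond M = 21,
# while the sequence 213 yields the larger payoff for every M
```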
Note that the condition M > 5⅔ ensures that the stopping times in the definitions of ν_{123}(D, x) and ν_{213}(D, x) are at process time 1 for job 1 if it fails to terminate, and otherwise are at t = 4, the completion time for job 3. Thus ν(S, x, 1) > ν(S, x, 2) ⟺ M > 21, and S is compatible with the index ν(·, ·, ·) only if any optimal policy for a SFAS F including S which selects S in state x also selects job 1 if M > 21. This condition is not satisfied, as may be seen by defining F to be the SFAS of which the superprocess S is the only member. In this case the optimal policy is either 123 or 213. Since the expected reward from job 3 is the same for both policies, and the only other reward is from job 2 and occurs earlier under 213 than under 123, policy 213 must be optimal. It follows that the superprocess S is incompatible with the index ν(·, ·, ·), and therefore must be excluded from I_{1/2}.

The key to an appropriate definition for I_a is the observation that if an index picks out a particular superprocess S in state x, then it must pick out the same control(s) from Ω_D(x) irrespective of which other superprocesses are available for selection. This suggests that we look for a condition which ensures that if an optimal policy selects S in state x it always selects the same control(s) from Ω_D(x). If this is true for every S then the index theorem for a SFABP applies. In fact it is sufficient, as shown by Whittle (1980), to ensure that this reduction occurs when the only available alternative to S is a standard bandit process. We now state this as Condition D.
Condition D Consider the SFAS {S, Λ}, where Λ is a standard bandit process with parameter λ. Suppose there exists a function g such that, for all x and λ for which it is optimal to select S when it is in state x, it is optimal to apply to S the control u = g(x) (i.e. the optimal control does not depend on λ). Superprocess S will be said to satisfy Condition D.

Note 4.2 If a superprocess S satisfies Condition D, a control u ∈ Ω_D(x) is optimal for {S, Λ} if and only if ν(S, x, u) = ν(S, x) ≥ λ. This follows by supposing that Λ in the statement of Condition D has the parameter ν(S, x).
4.3 The index theorem for superprocesses
Mirroring our definition of an index policy for a FABP in Section 2.5, we say that a policy is an index policy with respect to μ1 , μ2 , . . . , μn if at any decision time the superprocess Sk selected for continuation, with control u, is such that μk (xk , u) = maxj,u∈(xj ) μj (xj , u). Assuming Condition D, we can prove an index theorem. Theorem 4.3 Index theorem for a SFAS A policy for a simple family of alternative superprocesses comprised of S1 , . . . , Sn , which all satisfy Condition D, is optimal if it is an index policy with respect to ν(S1 , ·, ·), . . . , ν(Sn , ·, ·). As in Section 2.8, we shall make use of the fact that a simple family of alternative semi-Markov superprocesses is equivalent to a discrete-time Markov decision process with stochastic discounting. The decision times are 0, 1, 2, . . . When a superprocess is continued it makes an unsuccessful transition with a probability of 1 − a T , where T is the time until the next decision time in the equivalent semi-Markov process. In the special case that the superprocesses are discrete-time Markov decision processes, T is always 1. To simplify language, we shall say that when S makes an unsuccessful transition it ‘dies’. The objective in our problem is to maximize the expected (undiscounted) total reward obtained from a SFAS before the first constituent superprocess dies. We generalize the proof given in Section 2.8 as follows. Proof. We start with a proof for a family of n = 2 superprocesses, S1 and S2 . Suppose that they are initially in states x1 (0) and x2 (0). Given that Condition D holds, let gi (xi ) be the optimal continuation control for Si when it is selected and is in state xi . We consider a class of policies, k , in which the index policy must be followed at decision times k, k + 1, . . . That is, we compute for each superprocess its fair stake, μi (xi ), and then continue from time k the superprocess i having the greatest fair stake, applying the control gi (xi ). Extending the terminology introduced in Section 2.5 to apply to superprocesses, the fair stake, μi (xi ), is the maximum amount we would be willing to sacrifice if Si dies while we continue it, in exchange for ownership of any rewards arising from Si until either it dies or we choose to stop collecting its rewards.
Suppose that an optimal policy within 1 is in 0 , in other words that it is the index policy. If this is true then the index policy is optimal. This is an argument we have already used in Section 2.7. In brief, if the first sentence of this paragraph is true, then for any positive k the optimal policy in class k is in class k−1 and indeed in 0 . There certainly exists a k such that k contains an -optimal policy (which is true if under any policy the probability that no superprocess will have died by time k tends to 0 as k tends to infinity; this follows from Condition A). If the first sentence of this paragraph is true, then for all > 0 there is an -optimal policy in 0 . So consider the optimal policy in 1 . Without loss of generality, suppose it is optimal at decision time 0 to continue S1 , using some control u1 . Since we are restricted to class 1 , we must subsequently be using the index policy. Suppose that we continue the two superprocesses under this optimal policy until one dies. This normally ends the decision process. However, let us continue the remaining superprocess until it also dies. Suppose that the reward obtained from Si before it dies is Ri . Let Ri = ri + ri , where ri is the reward obtained from Si before the first death, and ri is the subsequent reward obtained before the second death (where either or both of ri and ri might be 0). The reward obtained from the two superprocesses prior to the first death is r1 + r2 = r1 + R2 − r2 . Since the value of ER2 is independent of the order in which rewards from S1 and S2 are obtained, we maximize E(r1 + r2 ) by maximizing E(r1 − r2 ). We now focus on showing how to do that. While S2 is being continued the value of its index, μ2 (x2 ), changes as a function of its state. The prevailing stake is μ∗2 (t) = min0≤s≤t μ2 (x2 (s)); that is, the least value of μ2 (x2 ) that has been encountered so far (the lower envelope). Suppose that S2 is continued until it dies and at that time its prevailing stake is M. Then because the prevailing stake μ∗2 is periodically reduced by just enough so that it is never unprofitable to continue S2 , the expected total reward obtained by continuing S2 until it dies is E(R2 ) = E(M). Similarly, at any time t (possibly even random) at which μ∗2 (t) = μ2 (t), the expected further reward to be obtained from S2 by continuing it until it dies is just the expected value of the prevailing stake that must be paid when it dies. By thinking of t as the time at which S1 dies, we see that E(r2 ) = E[X1 (M)M], where X1 (M) is an indicator variable for the event that S1 dies before the first time after time 0 at which its prevailing stake drops below M (equivalently, that S1 dies before S2 when the index policy is applied after time 0). We may also write E(r1 ) = E[R(M)], where R(M) denotes the expected reward obtained from S1 until it dies, or its prevailing stake drops below M. Let pM be the probability that the index of S1 drops below M before S1 dies, thus pM = E[1 − X1 (M)]. As we have said, maximizing E(r1 + r2 ) is equivalent to maximizing E(r1 − r2 ). From the above arguments, we have E(r1 − r2 ) = E[R(M) − X1 (M)M] = E R(M) − (1 − pM )M = E[R(M) + pM M] − E(M).
It may help the reader to note that in the case of Markov bandits, we have pM = E(a τ ), where τ is the optimal time of switching from the continuation of S1 to continuation of a standard bandit with parameter λ = (1 − a)M. By Condition D, the expression R(M) + pM M is maximized, for every M, by taking u = g1 (x1 (0)), provided M is sufficiently small that this quantity is greater than M. Since M ≤ μ2 (x2 (0)), our proviso holds if μ1 (x1 (0)) ≥ μ2 (x2 (0)). If this is true then we have shown that the optimal policy is indeed the index policy and we are done. If not, we reason as follows. Suppose the initial fair stakes are such that μ1 (x1 (0)) < μ2 (x2 (0)), and yet we believe that the optimal policy in class 1 is one that starts by applying some u1 to S1 . Note that our conjectured optimal policy lies in the class of policies in which all controls applied to S2 are determined by g2 , the first control applied to S1 is u1 , and all subsequent controls applied to S1 are determined by g1 . This defines a SFABP of two bandits in which we suppose S1 starts at a state x1 (0) = x1 (0) in which u1 is mandatory–which is different from any instance of state x1 (0) occurring subsequently, in which the control g1 (x1 (0)) is mandatory. The payoff of this SFABP is maximized by an index policy (by appealing to Theorem 2.1). Since we have μ1 (x1 (0) ) ≤ μ1 (x1 (0)) < μ2 (x2 (0)), it is better to start by continuing S2 . Suppose we do this, and then continue with the index policy, until eventually we are about to continue S1 for the first time, with control u1 . At this point it has a greater index than does S2 , and so the conclusion of the immediately previous paragraph can be applied: meaning that it is at least as good to use the control g1 (x1 (0)) as to use u1 . Of course we might never again reach a point where we continue S1 . Either way, we have a policy which is at least as good, and which lies in the class of policies which always use g1 and g2 to determine controls. Within this class, the index policy is optimal (again by appealing to Theorem 2.1). This completes the proof for two superprocesses. To ease understanding, the proof has been explained for two superprocesses. To prove the theorem for n superprocesses one simply needs to reread the above, letting the role of S2 be assumed by superprocesses S2 , . . . , Sn . In the context of the proof, the continuation of S2 , . . . , Sn occurs under the index policy, and so as a group they evolve as a single bandit process. We imagine continuing the n superprocesses until both S1 and the first of S2 , . . . , Sn dies. Reinterpret R2 as the reward obtained from S2 , . . . , Sn until the first of these dies; r2 and r2 become the portions of this reward that are obtained before and after the first of S1 , . . . , Sn dies, and M is now the prevailing stake of whichever of S2 , . . . , Sn is being continued when the first of them dies (or equivalently, the prevailing stake of the single bandit process which they form). Notice that the fair stake for the bandit process formed from S2 , . . . , Sn is just, say, μ(x2 , . . . , xn ) = maxi≥2 μi (xi ), since with this definition of fair stake, and consequent sequence of prevailing stakes, we can break even by always continuing a constituent having maximal prevailing stake, until the first of them dies. 
However, by definition of the fair stake for each of the constituents, it is impossible to break even by putting at risk any strictly greater stake, since to do so would mean we could break even for one of the constituents at this level of stake.
A different proof of this theorem, which uses the machinery of freezing and promotion rules in Section 4.5, can be found in the first edition of this book. It uses some ideas that are similar to those in the proof above, such as the analysis in the penultimate paragraph. However, because it involves rearranging portions of discounted reward streams, the algebraic details are more complicated than is the analysis of the expected prevailing stakes that are paid in an undiscounted setting. We conclude this section by giving a brief presentation of the means by which Whittle (1980) has also proved Theorem 4.3. The proof is elegant and provides valuable insight. Whittle’s proof of Theorem 4.3. The key idea is to guess a formula for the value function. Suppose that we add to S1 , . . . , Sn a standard bandit process S(λ). Assuming that Theorem 4.3 is true, then we know that the decision process proceeds by continuing one or the other of the Si until all have indices less than λ. From this point onwards only S(λ) is continued. For this reason, Whittle calls the standard process a retirement option, since if it is ever optimal to choose this option it must remain optimal to do so permanently and the total reward obtained thereafter will be m = λ/(1 − a). Suppose the index for Si first falls below λ when it has process time τi . Then our candidate for the value function can be written as
\[
\Phi(x, m) = E\bigl\{ R + a^{\tau_1+\cdots+\tau_n} m \bigm| x \bigr\},
\]
where x = (x_1, . . . , x_n) is the vector of initial states of S_1, . . . , S_n, and R denotes the discounted sum of rewards obtained from S_1, . . . , S_n under the index policy up to time τ = τ_1 + · · · + τ_n. If Φ(x, m) is differentiable,
\[
\frac{\partial}{\partial m}\Phi(x, m) = E\bigl\{ a^{\tau_1+\cdots+\tau_n} \bigm| x \bigr\} = \prod_{i=1}^{n} E\, a^{\tau_i} = \prod_{i=1}^{n} \frac{\partial}{\partial m}\Phi_i(x_i, m),
\]
where Φ_i(x_i, m) is the value function for the SFAS consisting of just S_i and S((1 − a)m). Assume, without loss of generality, that all rewards accruing from S_1, . . . , S_n are non-negative and bounded above by (1 − a)M. Then by integration, and taking Φ(x, M) = M, we have
\[
\Phi(x, 0) = M - \int_0^{M} \prod_i \frac{\partial}{\partial m}\Phi_i(x_i, m)\, dm. \tag{4.2}
\]
The argument thus far is intuitive. However, now taking (4.2) to define (x, 0) we have a candidate value function for the SFAS consisting of just S1 , . . . , Sn . It is well defined because each i (xi , m) is a convex function of m (being the maximum over stopping times of a linear function of m, as in (2.4)) and so is differentiable for almost all values of m. If (x, m) satisfies the dynamic programming equation, then (by Theorem 2.10 (ii)) it is indeed the value function, and so the index policy is optimal.
Assume for simplicity that the decision times are 0, 1, . . . The dynamic programming equation for the SFAS is
\[
V(x) = \max_i L_i V(x),
\]
where the operator L_i is defined by
\[
L_i f(x) = \max_{u \in \Omega_i(x_i)} \bigl[\, r_i(x_i) + a E_u f(x_1, \ldots, x_{i-1}, y, x_{i+1}, \ldots, x_n) \,\bigr]
\]
and expectation is over the state y to which S_i transitions from x_i under the choice of control u ∈ Ω_i(x_i). The operator L_i has its obvious meaning when applied to a function of x_i alone (by just erasing x_1, . . . , x_{i−1}, x_{i+1}, . . . , x_n in the form above). Let
\[
P_i(m) = \prod_{j \ne i} \frac{\partial}{\partial m}\Phi_j(x_j, m).
\]
Suppose that S_1 is the superprocess initially having the greatest index, say ν_1. If m ≥ ν_1/(1 − a) then Φ_i(x_i, m) = m, P_i(m) = 1 and dP_i(m)/dm = 0, for all i. Applying integration by parts to (4.2), and using M ≥ ν_1/(1 − a), Φ(x, M) = Φ_i(x_i, M) = M and P_i(M) = 1,
\[
\Phi(x, 0) = \Phi(x, M) + \Phi_i(x_i, 0)P_i(0) - \Phi_i(x_i, M)P_i(M) + \int_0^{M} \Phi_i(x_i, m)\,dP_i(m)
= \Phi_i(x_i, 0)P_i(0) + \int_0^{\nu_1/(1-a)} \Phi_i(x_i, m)\,dP_i(m). \tag{4.3}
\]
For any m ≤ ν_1/(1 − a) (including m = 0) we have Φ_1(x_1, m) = L_1Φ_1(x_1, m) (where it is at this point that we need Condition D). Using this fact in (4.3), with i = 1, we see that Φ(x, 0) = L_1Φ(x, 0). Finally, since L_iΦ_i(x_i, m) expresses the reward obtained by continuing S_i even when it would have been optimal to retire, we must have that for all i and m, Φ_i(x_i, m) ≥ L_iΦ_i(x_i, m). Also, dP_i(m)/dm is non-negative by convexity of the Φ_j. Using these facts in (4.3) we see that Φ(x, 0) ≥ L_iΦ(x, 0) for all i. We have now seen both that Φ(x, 0) ≥ L_iΦ(x, 0) for all i and that Φ(x, 0) = L_1Φ(x, 0). Thus Φ(x, 0) = max_i L_iΦ(x, 0), which is the statement that Φ(x, 0) satisfies the dynamic programming equation. This proves Theorem 4.3.

Eick (1988) has adapted Whittle's proof to establish a limited index theorem for bandit reward sampling processes (Section 7.1) for which the rewards do not become known immediately. R.P. Kertz (1986) has considered families of alternative bandit processes subject to global constraints, expressed as inequalities on the expectation of the integral over time of the discounted value of some
function of the state. He removes these constraints by Lagrangian methods and shows that, for a SFABP subject to global constraints, the solution may be expressed as an index policy with modified index values. He obtains a generalization of (4.2) for the optimal payoff function together with some explicit results for Bernoulli reward processes. Whittle has used the expression for the value function to obtain the optimal priority system for an M/G/1 queue under the expected weighted flow time criterion (the non-preemptive version of Theorem 4.15) from the expression for the case with Poisson arrivals of further bandit processes. This work is described in Whittle’s books (1983, 1996).
4.4 Stoppable bandit processes
Condition D is admittedly a strong condition. Indeed, one might say that it restricts attention to the most uninteresting case of a SFAS because the choice of control u ∈ i (xi ) becomes irrelevant. However, it is interesting that Condition D is sufficient for Theorem 4.3 and that something stronger is not required. Furthermore, we can describe at least one interesting model for which Condition D is guaranteed to hold. The term stoppable bandit process was coined by Glazebrook (1979) to describe the process obtained by adding to the control set of a bandit process a stop control , control 2, which causes the process to behave as a standard bandit process with a parameter μ(x) depending on the state x. Strictly speaking this is a superprocess rather than a bandit process. If control 2 is selected, the state of the stoppable bandit process, and the state of any SFAS to which it belongs, does not change, so we can, as usual with a standard bandit process, ignore the possibility of switching from the use of control 2. Thus to use control 2 has the same effect as terminating the decision process with a final reward of γ −1 μ(x), and this is why it is called the stop control. A SFAS made up of stoppable bandit processes might be a suitable model for an industrial research project in which there are several possible approaches, each represented by one of the stoppable bandit processes. Using the stop control corresponds to stopping research and exploiting commercially the knowhow gained by one of the approaches. A stoppable bandit process will be said to have an improving stopping option if μ(x(t)) is almost surely nondecreasing in process time t. The process time remains the total time for which the continuation control 1 has been applied. Lemma 4.4 Condition D holds for a stoppable bandit process S with an improving stopping option. Proof. Consider a family of two alternative superprocesses: S and a standard bandit process with parameter λ. We must show that for those values of λ for
which an optimal policy selects S when it is in state x, the control which is applied to S is independent of λ. Let R_τ(S) denote the expected reward obtained up to time τ when S is continued up to this time with the control u = 1. Suppose that for some λ_1 it is strictly optimal to select S and apply u = 1. Then there must exist τ such that
\[
R_\tau(S) + \gamma^{-1} E\, a^{\tau} \max\{\lambda_1, \mu(x(\tau))\} > \gamma^{-1} \max\{\lambda_1, \mu(x)\}. \tag{4.4}
\]
Suppose that for some λ_2 it is strictly optimal to select S and apply u = 2. Then it must be the case that
\[
R_\tau(S) + \gamma^{-1} E\, a^{\tau} \mu(x(\tau)) < \gamma^{-1} \mu(x).
\]
Multiplying this by −1 and adding the result to (4.4) we could conclude that
\[
E\, a^{\tau} [\lambda_1 - \mu(x(\tau))]^{+} > [\lambda_1 - \mu(x)]^{+}.
\]
But this is impossible for μ(x(τ)) ≥ μ(x).
From Theorem 4.3 and Lemma 4.4 it follows that index policies are optimal for simple families of stoppable bandit processes, for each of which μ(x(t)) increases with t. An interesting example of alternative stoppable bandit processes arises in the so-called buyer’s problem. Imagine that there are n suppliers of some commodity item, such as a newspaper. The value of successive items supplied by supplier i form a sequence of independent identically distributed real-valued random variables having density function fi (·|θi ), where θi is unknown, having prior density πi (θi ). The random variables may be sampled one at a time at a fixed cost per observation, and with the option of stopping at any time and collecting a reward equal to the mean of the distribution, discounted by a t , where t is the time of stopping. In the obvious way this defines a stoppable bandit process, whose state is the posterior distribution of θi . Stopping may be understood as deciding to make a purchase. However, since the stopping option need not improve with sampling, we cannot use Lemma 4.4 to deduce the optimality of index policies for a problem of sampling items from alternative suppliers, optimally stopping and then committing to make all subsequent purchases from one of them (e.g. taking out a newspaper subscription). This problem was considered in depth and in much greater generality than we have stated it here by Bergman (1981), using and extending the theoretical framework for stopping problems set out by Chow, Robbins and Siegmund (1971), each sampling process being represented by a filtration. Bergman showed that, in general, index policies are indeed not optimal, though they are optimal if initially there are an infinite number of populations available for sampling, each with the same prior information. Bather (1983) has also considered the buyer’s problem. By using the inequality set out in exercise 4.5 (but which he must have derived independently
of Whittle), he confirms Bergman’s conclusion that an index policy is optimal with an infinite number of initially indistinguishable populations (cf. exercise 3.7). This follows by introducing a standard bandit process with parameter equal to the initial common index value of each stoppable bandit process (see (4.1)). In fact it is not difficult to see that the argument holds for any infinitely replicated superprocess. Bather uses a diffusion approximation to a normal sampling process along the lines to be described in Section 7.6, to obtain an upper bound for the index in the buyer’s problem when fi (·|θi ) is the normal density with a known variance and the unknown mean θi has a normal prior. He does this using his comparison procedure, to be explained in Section 7.6.
4.5 Proof of the index theorem by freezing and promotion rules
We now turn to another proof of the index theorem for a SFABP. It uses a more complicated rearrangement of bandit portions than the simple interchange that is analysed by Lemma 2.4 and is used to prove the index theorem in Section 2.7. The proof methodology that we now describe is overkill for the index theorem for a SFABP, but it comes into its own in proofs of index theorems for FABPs that are not simple. We prove that a modified forwards policy is optimal for a SFABP. We remind the reader that this is a policy which repeatedly carries out a procedure of identifying a bandit B with greatest index and applying the continuation control to it until the stopping time τ is reached, such that ν(B) = R_τ(B)/W_τ(B). Before we begin, we need some further notation. For a FABP, F, we define the index
\[
\nu(F) = \sup_{g,\tau} \frac{R_{g\tau}(F)}{W_{g\tau}(F)}, \tag{4.5}
\]
where the supremum is taken over policy g and positive stopping time τ . The quantities Rgτ (F ) (and Wgτ (F )) denote the expected discounted reward (and expected discounted time) that is accrued up to stopping time τ under policy g. As elsewhere, we suppress the showing of dependencies upon the starting state of F . In the proof of Lemma 2.2 one sees that ν(B) = ντ (B) for some stopping time τ , by presenting ν(B) in the context of a Markov decision process and then applying Theorem 2.10 (i) (that there exists a deterministic stationary Markov optimal policy). In similar fashion it may be shown that the supremum is achieved in (4.5) for some g and τ . These then specify the first stage of a modified forwards induction policy for F .
We are now ready to present another proof of Theorem 2.1. The emphasis in this proof is on the characterization of the optimal policy as a modified forwards induction policy. So let us recapitulate it that way.

Theorem 2.1 The classes of index policies, forwards induction policies, and optimal policies are identical for a simple family F of alternative bandit processes, B1, . . . , Bn.

Proof. We know from Lemma 2.2 that, for a SFABP, the classes of index policies and modified forwards induction policies are the same. So we show that given any policy g, and a modified forwards induction policy π, there exists a policy which is as good as g and which starts with the first stage of π. Suppose that the first stage of π consists of applying the continuation control to B = B1 until time τ. Thus, we are assuming that B is the bandit of greatest index, and τ is a stopping time such that ν(B) = R_τ(B)/W_τ(B). Let us denote the set of rewards that accrue from B during this first stage of π by R. Under π, the rewards in R contribute (to the payoff) a quantity

R_τ(B) = E[ Σ_{t_i<τ} a^{t_i} r(t_i) ] = W_τ(B) ν(B).      (4.6)

… if Z_f(B) > 0, then it must be optimal to continue without introducing any period of freeze control at this point (since to freeze for time φ would reduce subsequent payoff by a factor a^φ). These observations imply that Z_f(B) is maximized by a stopping rule, from which fact the lemma follows.

Before stating the next lemma we need one preliminary idea. Consider a policy g that is applied to F up to stopping time τ. Over the period up to τ, each Bj is sometimes being continued and sometimes being frozen. Thus, we may say that (g, τ) defines a freezing rule f^j that is applied to Bj. It is part (i) of the following lemma that we need. Parts (ii) and (iii) make clear the equivalence of index policies and forwards induction policies for F.

Lemma 4.6 Suppose F is a SFABP, comprised of B1, . . . , Bn. Then

(i) ν(F) = max_j ν(Bj).

(ii) If f^j is the freezing rule for Bj defined by a policy π and stopping time σ for F such that ν_{πσ}(F) = ν(F), then ν_{f^j}(Bj) = ν(F) for all j such that W_{f^j}(Bj) > 0.

(iii) A policy for F is an index policy if and only if it is a forwards induction policy.

Proof. (i) Suppose an arbitrary policy g is applied to F, with a stopping time τ. As explained just prior to the lemma statement, the reward stream of Bj accrues as if a freezing rule f^j were applied to it (this being a randomized freezing rule that is determined by the evolution of all the bandits under g, τ). Thus

R_{gτ}(F) = Σ_j R_{f^j}(Bj) ≤ Σ_j ν(Bj) W_{f^j}(Bj) ≤ max_j ν(Bj) W_{gτ}(F),      (4.12)

where the first inequality follows from Lemma 4.5. This shows that ν(F) ≤ max_j ν(Bj). The reverse inequality is obvious, since ν(F) ≥ ν(Bj) for all j.

(ii) This follows by putting (g, τ) = (π, σ) in (4.12). For this choice the left- and right-hand sides are equal, which forces (ii) to be true.
(iii) That index policies are forwards induction policies is an immediate consequence of (i). The converse is an immediate consequence of (ii).
4.5.2 Promotion rules

Equation (4.10) makes a comparison between payoffs when the (i + 1)th reward from A, that would accrue under ∗g at time τ + si, is promoted, so that under g it accrues at an earlier time τ + si + pi. The sequence of non-positive pi, such that −τ ≤ p1 ≤ p2 ≤ · · · ≤ 0, defines what we call a promotion rule for A with respect to τ. Given a fixed realization of the first stage of π, the time τ is not a random variable. The promotion rule depends on this given realization of the first stage of π, and the evolution of A, but pi is past-measurable with respect to the full history of A up to its decision time si. Equation (4.10) is a consequence of the following lemma.

Lemma 4.7 For a bandit process B, the increase in the expected total reward which is caused by applying the promotion rule p instead of the null promotion rule (both defined with respect to the same given time τ) is less than or equal to the corresponding increase for a standard bandit process with the parameter ν(B).

Proof. For the purposes of this lemma and its proof, τ is given (not a random variable). Under the promotion rule p for B with respect to τ, the rewards of B accrue at the same times (say τ + ti + pi) as they would if a freezing rule f for which fi = τ + pi was applied to B, started at time 0. For this reason, the lemma follows if we can show that

[R_f(B) − a^τ R(B)] − ν(B)[W_f(B) − a^τ W(B)] ≤ 0,

or equivalently,

[R_f(B) − ν(B)W_f(B)] − a^τ [R(B) − ν(B)W(B)] ≤ 0.      (4.13)
We may evaluate the left-hand side of (4.13) in portions, each of which is non-positive. Consider the contribution to the left-hand side arising from those rewards accruing before the stopping time σ such that R_σ(B) = ν(B)W_σ(B). Write this as

[R_{σf}(B) − ν(B)W_{σf}(B)] − a^τ [R_σ(B) − ν(B)W_σ(B)].      (4.14)
By Lemma 4.5, the first term in square brackets is no greater than the second term in square brackets, which by definition of σ is 0. Thus (4.14) is non-positive. Note that in (4.14) the coefficient of ν(B) is −W_{σf}(B) + a^τ W_σ(B). Since fi ≤ τ, this coefficient is non-positive, and so (4.14) remains non-positive if ν(B) is replaced by anything greater.
At time σ, B is now in some state y such that ν(B, y) ≤ ν(B). We can repeat the above argument over a second stage which collects rewards accruing while B receives the continuation control over [σ, σ + σ2), where the stopping time σ2 is such that R_{σ2}(B, y) = ν(B, y)W_{σ2}(B, y). We deduce that a similar expression to (4.14) must hold, but starting from y, and with ν(B, y) replacing ν(B). However, by the remark at the conclusion of the previous paragraph, this expression remains non-positive if we increase ν(B, y) to ν(B). Thus we conclude that the contribution to the left-hand side of (4.13) due to rewards accruing over [σ, σ + σ2) is also non-positive. The lemma is proved by repeating this argument until we pass beyond time s, where f no longer has any effect upon any rewards yet to accrue (which by Condition A happens with probability 1).

In concluding this section we use Lemma 4.5 to prove a result about the uniqueness of indices. Throughout this book our chief interest is in indices which define optimal allocation rules irrespective of which particular FABP is available for selection. Thus, if Ba is the entire uncountable set of bandit processes with the discount factor a, we should like an index for each bandit process B ∈ Ba which defines optimal policies for every SFABP formed by a finite subset of Ba. Let us call such an index an index for Ba. We wish the reader to be clear that the following theorem is actually a corollary of the index theorem, Theorem 2.1, and so it holds no new content. However, it is interesting to see a stand-alone proof.

Theorem 4.8 Any index for Ba (0 < a < 1), if there is one, must be a strictly increasing function of ν(B, x).

Proof. Let us take a continuous-time model. We first note that the statement of the theorem is true if Ba is replaced by the set of all standard bandit processes with discount factor a. This is because if Λ is a standard bandit process with parameter λ (and just one state), then ν(Λ) = λ. Consider a SFABP with bandit processes B and Λ, where Λ is a standard bandit process with the parameter λ. To prove the theorem it is sufficient to show that for such a SFABP there is an optimal policy which continues Λ at all times if and only if λ ≥ ν(B). There is an obvious one–one correspondence between policies for the SFABP {B, Λ} and freezing rules for the bandit process B. The payoff from {B, Λ} under the policy which corresponds with the freezing rule f for B may be expressed as

R_f({B, Λ}) = E[ Σ_{i=0}^∞ a^{t_i+f_i} r(t_i) ] + λ E[ Σ_{i=0}^∞ a^{t_i} ∫_{f_{i−1}}^{f_i} a^t dt ]
            = R_f(B) + λ E[ Σ_{i=0}^∞ a^{t_i} ∫_{f_{i−1}}^{f_i} a^t dt ]
            = R_f(B) + λγ^{−1} − λW_f(B).
The payoff from {B, Λ} under the policy which continues Λ at all times is λγ^{−1}. Hence

λγ^{−1} ≥ R_f({B, Λ})  ⟺  λ ≥ R_f(B)/W_f(B) = ν_f(B),

provided f_0 ≠ ∞, which ensures that W_f(B) ≠ 0. It follows that the policy which continues Λ at all times is optimal if and only if

λ ≥ sup_{f : f_0 ≠ ∞} ν_f(B) = ν(B),

where the equality follows from Lemma 4.5.
4.6 The index theorem for jobs with precedence constraints
This section and the next one explore conditions under which an index theorem can be obtained for families of generalized jobs (i.e. bandit processes with a completion state, see Section 3.2) subject to precedence and arrival constraints. We do this by reworking the arguments used to prove Theorem 2.1 in Section 4.5. In those arguments, we focused upon the first stage of a modified forwards induction policy which chooses B and τ such that R_τ(B)/W_τ(B) = ν(B), and then starts by applying the continuation control to B up to time τ. This is now replaced by a policy π and stopping time σ, such that the right-hand side below is maximal over all possible π and σ, and permits us to define, as in (4.5),

ν(F) = R_{πσ}(F)/W_{πσ}(F) = sup_{g,τ} R_{gτ}(F)/W_{gτ}(F).      (4.15)
The pair π, σ defines the first stage of a modified forwards induction policy, or FI* policy. We should like to mimic the arguments of Section 4.5, in which we showed that we could improve an arbitrary policy g by shifting to the start the first stage of a modified forwards induction policy (i.e. the τ-length period of applying the continuation control to B). If analogous arguments are to work, then we need a result like (4.8), which said that R_f(B) ≤ W_f(B)ν(B) for any freezing rule f. The analogous result should say that

R_{gπσ}(F) ≤ W_{gπσ}(F) ν(F),      (4.16)
where g is any policy for F , which, if it is not π, has the effect of postponing the accrual of rewards that would have occurred during the first stage of a FI* policy (i.e. before time σ under π). The notations Rgπσ (F ) and Wgπσ (F ) denote the expected discounted rewards and the expected discounted time instants accruing under g to portions of the reward stream that would have arisen during the first stage of the FI* policy (π, σ ), had that policy been employed.
Given (4.15), and assuming (4.16), then in a manner analogous to (4.9) we would have

R_{πσ}(F) − R_{gπσ}(F) ≥ [W_{πσ}(F) − W_{gπσ}(F)] ν(F).      (4.17)

The reward under policy g can be written as

R_g(F) = R_{gπσ}(F) + E R_{hp}(G),      (4.18)

where E R_{hp}(G) denotes the contribution to payoff from rewards other than those captured by R_{gπσ}(F). These are rewards that may be viewed as having been promoted, via some promotion rule p, to occur earlier under g than they would occur under a policy that brings the first portion of a FI* policy to the start. The reward after bringing the first portion of the FI* policy to the start can be written as

R_{πσh}(F) = R_{πσ}(F) + E R_{h0}(G),      (4.19)

where E R_{h0}(G) denotes the contribution to the payoff from rewards other than those captured by R_{πσ}(F), and h takes these rewards in the same order as they would be taken under g. We wish to show that the policy πσh is at least as good as g. So we must check that πσh is implementable. To complete a proof that (4.19) is at least as great as (4.18), we also need a result analogous to (4.10). This would be

R_{hp}(G) − R_{h0}(G) ≤ [W_{hp}(G) − W_{h0}(G)] ν(G).      (4.20)
We saw that (4.10) was actually implied by (4.8). By essentially the same argument, (4.20) is implied by (4.16). We also need ν(G ) ≤ ν(F ), but this is actually strict, since σ can be characterized as the first time that, under π, we have ν(G ) < ν(F ), where G is what remains of F at time σ . So a proof that an optimal policy may begin with the first stage of a FI* policy hinges upon the truth of (4.16) and πσ h being implementable. Let us check these facts in a special case and prove the following. Theorem 4.9 The index theorem for non-preemptive scheduling of ordinary jobs with precedence constraints The classes of index policies, forwards induction policies, and optimal policies are identical for a FABP F in which all bandits are jobs, with a single reward upon completion, and where there are arbitrary precedence constraints, no arrivals, and preemption is not allowed. Proof. For this FABP, a modified forwards induction policy is a policy which chooses a subset of jobs and (respecting the precedence constraints) processes them in some order until all are complete. Without loss of generality suppose that this set of jobs is B1 , . . . , Bk , and they are processed in order 1, . . . , k. In this case, Rgπσ (F ) is the contribution to the payoff obtained from B1 , . . . , Bk , when
under g they may be processed in some other order, and periods of freezing may be introduced, say by a freezing rule f, while jobs B_{k+1}, . . . , B_n are processed. Processing in the order induced by g, the reward stream obtained from B_1, . . . , B_k can be viewed as that arising from a single bandit process, say B. So by Lemma 4.5 and the fact that ν(B) ≤ ν(F), we have

R_{gπσ}(F) = R_f(B) ≤ W_f(B)ν(B) = W_{gπσ}(F)ν(B) ≤ W_{gπσ}(F)ν(F),

which establishes (4.16). It is obvious that πσh is implementable, since the information that we need in order to ensure that h processes jobs B_{k+1}, . . . , B_n in the same order as they are processed under g is fully available when under a FI* policy we move the processing of jobs B_1, . . . , B_k to the start. The precedence constraints are respected by πσh because they are respected by both g and π.

Partial results along the lines of Theorem 4.9 were obtained by Sidney (1975), Kadane and Simon (1977) and Glazebrook and Gittins (1981). The problems of determining an optimal schedule when preemption is not allowed, and of estimating the penalty in confining attention to non-preemptive schedules when preemption is allowed, have been considered by Glazebrook in a series of papers (1980a, 1980b, 1981a, 1981b, 1982a, 1982b, 1983), Glazebrook and Gittins (1981) and Glazebrook and Fay (1987). Note that when there are precedence constraints, the calculation of indices is typically fairly complicated even with no preemptions (e.g. see exercise 4.7). If we can show that preemption, although permitted, does not occur under an optimal policy, Theorem 4.9 will still hold. This is true, for example, if the jobs are ordinary jobs, and for each job the expression

[1 − F(t)]^{−1} e^{γt} ∫_t^∞ f(s) e^{−γs} ds

is increasing in t (see Glazebrook, 1980b).

More generally, there is no reason that (4.16) should hold for an arbitrary family of jobs. However, we might make it a condition. Recall that R_{gπσ}(F) denotes the contribution to the payoff from rewards that would accrue during the first stage of a FI* policy (π, σ), but which are postponed because policy g is employed rather than (π, σ). Similarly, W_{gπσ}(F) measures expected discounted intervals of time.

Condition E The FABP F is said to satisfy Condition E if, for any initial state, any policy g, and FI* policy with first stage (π, σ),

R_{gπσ}(F) / W_{gπσ}(F) ≤ ν(F).

Equality holds if g = π.

We now have the following.
Theorem 4.10 The index theorem for generalized jobs with precedence constraints The classes of index policies, forwards induction policies, and optimal policies are identical for a family F of alternative generalized jobs with no arrivals (regarded as a SFAS), and which satisfies Condition E.

Proof. Condition E ensures (4.16). So we only need to check that the policy πσh is implementable, in other words, that we can move the first stage of FI* policy π to the front, and then take the remaining rewards in the same order as they would be taken under some arbitrary policy g (so that, conditional on g and the first stage of π, these other rewards can be viewed as arising from some residual family of alternative bandits G, to which some policy h is applied, and where promotion with respect to σ is applied under g, but not under πσh). As in the proof of Theorem 4.9, this is true for generalized jobs with precedence constraints because information about the first stage of π comes earlier under πσh than under g, and because, since both g and π respect precedence constraints, the policy πσh can take the rewards of G in the right order without violating precedence constraints.

From the discussion of Section 4.2 and Theorem 4.3 it follows that Condition D is both necessary and sufficient for a superprocess to be compatible with the optimality of a general index policy. However, the condition is not always easy to check, and it is useful to have alternative, and relatively easily checked, conditions which are sufficient for it to hold. We have seen that Condition E provides such a sufficient condition, as do the conditions for a stoppable bandit process with an improving stopping option, and those of Theorem 4.9.

Our earlier discussion of Example 4.1 may be amplified to provide an example of a sub-family of alternative bandit processes which satisfies Condition D, and is therefore compatible with a general index policy, but which does not satisfy Condition E. Suppose M > 21. Consider the sub-family F formed by adding to the superprocess S formed by the three jobs of Example 4.1 a standard bandit process with parameter φ, where 8 + M … 2 + M … φ. If λ is such that an optimal policy for {F, Λ} starts by selecting any of the bandit processes in F (i.e. if λ ≤ ν(S, x, 1)), it must select job 1. Thus F satisfies Condition D so far as the initial state x is concerned, and it is easy to check that the same is true for all other possible states of F. On the other hand, the FABP D (i.e. S without the stop control) does not satisfy Condition E, and this property is transmitted to {F, Λ}. That D does not
satisfy Condition E follows immediately from Theorem 4.10 and the fact that the index policy is not optimal for D. It may also be checked directly.

Condition E arose by thinking about what would be required to generalize the proof arguments that are used in Section 4.5. However, it is reasonable to question the value of Condition E, and the status of Theorem 4.10, since although we were able to check Condition E in the context of Theorem 4.9, it appears at first sight that Condition E might be difficult to check more generally. It would be gratifying if there were some other condition, more easily checkable, which were to imply Condition E. This is Condition F, which we now explain, and so our conditions are in the sequence F ⇒ E ⇒ D.

To explain Condition F we need some preliminaries. Consider a FABP F of generalized jobs Bj (j = 1, 2, . . . , n) subject just to precedence constraints. We shall shortly wish to let π be a modified forwards induction policy for F, and σ to be the time at which its first stage ends. However, for the moment, let π be an arbitrary policy for F, and σ a stopping time for the bandit process Fπ. Thus σ is the first time at which F reaches some designated stopping subset of its state space when π is the policy used. Let σi (i = 1, 2, . . . , n; σ1 + σ2 + · · · + σn = σ) be the process time of bandit process Bi when F first reaches this stopping set under π. If π is the policy used, σi is π-measurable in the sense that whether or not σi ≥ t depends, with probability one and for all t ∈ R+, only on the history of Fπ before the process time of Bi exceeds t. For a given set of realizations of the n alternative bandit processes, the value taken by σi is well defined whether or not π is the policy used. It is simply the process time of Bi up to the first time when F would have reached the stopping set under π. However, if some other policy h is used, σi may or may not be measurable in the above sense. We express all this by saying that σi is by definition π-measurable but may or may not be g-measurable. Speaking rather loosely, σi is not g-measurable if under policy g there is a possibility that for some t, the process time of Bi may exceed t before we know whether or not σi ≥ t. The truncated FABP defined by truncating Fπ at time σ is said to be g-measurable if σi is g-measurable for all i. We now have language in which to state Condition F.

Condition F The FABP F is said to satisfy Condition F if, for any initial state, the truncated FABP K defined by a modified forwards induction policy is g-measurable for all g.

It may help the reader if we say that the essential implication of Condition F may be appreciated as follows. Suppose that g is an arbitrary policy and π is a FI* policy for F. Condition F implies that, starting from any initial state, as we apply g to F and each decision time is reached, it is possible to say with certainty whether or not the reward that is taken at that time would also have been taken during the first stage of π, had π been employed instead of g, starting from the same initial state. Furthermore, it is possible to recognize, at the time that it is being taken, that a reward is the last reward that would have been taken
during the first stage of π. These facts all follow from the above definition of g-measurability.

We now argue that Condition F ⇒ Condition E. If Condition E does not hold then this means that for some initial state of F, some policy g for F, and some FI* policy π,

R_{gπσ}(F) > W_{gπσ}(F) ν(F).      (4.21)
Recall that R_{gπσ}(F) and W_{gπσ}(F) denote the expected discounted rewards and the expected discounted time instants accruing under g that would have accrued if π had been employed until its first stage ends at time σ. By appealing to the essential implication of Condition F, described in the previous paragraph, we may define a policy h and freezing rule f, such that R_{gπσ}(F) = R_f(F_h) and W_{gπσ}(F) = W_f(F_h). We simply take h to be the same as g, except that whenever a reward would be taken under g that would not have been taken during the first stage of π, then we pretend that this reward is not taken and the interval up to the next decision time is viewed as a period of freezing inserted by f. Note that h can be applied without violating any precedence constraints, because jobs in the set of jobs that would have been started during the first stage of π have no predecessors outside that set (since π is feasible), and all precedence constraints due to jobs within this set are met whenever one is started under h (since g is feasible). In combination with (4.21) we have ν_f(F_h) = R_f(F_h)/W_f(F_h) > ν(F). However, by an argument along lines we have made several times before (based upon considering a Markov decision process and Theorem 2.10 (i)), there must be a stopping time τ such that ν_τ(F_h) ≥ ν_f(F_h); and since ν(F) ≥ ν_τ(F_h), this gives ν(F) ≥ ν_f(F_h), a contradiction. We conclude that (4.21) cannot hold.
4.7 Precedence constraints forming an out-forest
It is not hard to find examples in which the classes of optimal policies and FI* policies are not the same (and so for which, by Theorem 4.10, Condition E must necessarily fail). For example, consider Example 4.1. Here preemption is allowed and the jobs are ordinary jobs. In this example, we may begin by working either on job 1 or on job 2. A large reward accrues if job 3 is completed. The initial choice between jobs 1 and 2 does not affect the probability of completing job 3 or the time when this occurs. Thus in the determination of an optimal policy this choice depends only on the relative profitability of time spent on job 1 or on job 2. On the other hand, in constructing the first stage of a FI* policy it is also useful to know as soon as possible if either of jobs 1 or 2 is going to remain uncompleted, and so prevent completion of job 3. As soon as this is known we will declare the first stage to have terminated, and the sooner this happens the less that our inability to complete job 3 will reduce the equivalent reward rate achieved during the first stage. Only one time unit need be spent on job 1 to find out whether it will be completed, as compared with two time units for job 2. Thus, in constructing an FI* policy, there is a reason for choosing job 1, which
is irrelevant when constructing an optimal policy. This difference leads to the counter-example. The two ways of progressing towards completion of job 3 correspond to two arcs leading to node 3 in the directed graph representing the jobs and precedence constraints of the FABP D. It is natural to consider what happens if this feature is avoided by restricting our attention to directed graphs with no circuits and for which the in-degree of each node is at most one. Figure 4.3 shows a directed graph of this type, which we shall refer to as an out-forest. Each connected component is called an out-tree.
Figure 4.3 Precedence constraints forming an out-forest of three out-trees.

A FABP with precedence constraints which may be represented by an out-forest forms a simple family of sub-families of alternative bandit processes, the number of sub-families being equal to the number of out-trees or, equivalently, the number of jobs available for service at time zero. In the case of the graph shown in Figure 4.3, there are three sub-families, as indicated. Clearly each of these sub-families of jobs is a superprocess. Provided that the FABP F has been divided into as many sub-families as possible, there is one job in each sub-family which must be completed before any other job in the sub-family may be started. This job will be termed the initial job in the sub-family, and a sub-family with the property just described will be termed a minimal sub-family (or a minimal family if it coincides with F). A policy for F will be described as a minimal sub-family index policy if at any decision time the bandit process (or job) Ji selected for continuation is an initial job of a sub-family, say Fi, such that ν(Fi) is maximal amongst such choices of an initial job. Our next result may now be stated very simply.

Theorem 4.11 The classes of minimal sub-family index policies, forwards induction policies, modified forwards induction policies, and optimal policies are identical for a family F of alternative jobs with no arrivals and precedence constraints in the form of an out-forest. The first stage of a modified forwards induction policy for F confines itself to a single minimal sub-family, and is equivalent to the first stage of a modified forwards induction policy for the minimal sub-family Fi such that ν(Fi) is maximal.

Proof. As usual, the classes of forwards induction and modified forwards induction
policies are the same (since the first stage of a modified forwards induction policy ends when ν(F) first drops below its initial value). Suppose that at time 0 the jobs available for service are J1, . . . , Jk, and these are the root jobs of k sub-families of jobs, F1, . . . , Fk. Let |F| denote the number of jobs in F. We prove the theorem by induction on the hypothesis that it is true if |F| ≤ ℓ. The base of the induction is ℓ = 1. Suppose ℓ = |F| > 0 and some policy is followed. Once a first job completes, we have a residual superprocess, say F′, with |F′| < |F|. So by the truth of the inductive hypothesis for ℓ − 1 we see that from that point onwards, processing of jobs must proceed according to the theorem. This means that in each of the original k sub-families, viewed as a superprocess, service must proceed in a manner that does not depend on how service was given to the root jobs prior to the time that the first root job completes. This makes the sub-families F1, . . . , Fk into k alternative bandit processes, and thus the initial choice as to which of J1, . . . , Jk should receive service is just a choice amongst alternative bandit processes. If this is Ji then the first stage of a modified forwards induction policy for F confines itself to Fi.

This theorem also holds for continuous-time jobs. The proof uses a similar limit argument to the one used to prove Theorem 3.1. The above proof is essentially that of Glazebrook (1976), who also extends Theorem 4.11 to the case of stoppable jobs, where these are defined as follows. Recall that a job is a bandit process with a completion state C, attainment of which means that the job behaves like a standard bandit process with the parameter −M. A stoppable bandit process is a bandit process with an additional control 2, application of which in state x causes behaviour like that of a standard bandit process with parameter μ(x). A stoppable job is a bandit process which brings these features together. State C is replaced by a completion set mc. Once the job reaches a state x in mc it never leaves that state. Control 2 remains available for states in mc. Thus the −M standard bandit process available on completion of a job, and introduced only to allow each job arbitrarily large process times, becomes irrelevant. Precedence constraints may be defined between stoppable jobs. Job J1 precedes job J2 if the freeze control must be applied to J2 until J1 reaches its completion set.

The proof of the following theorem is virtually identical to the proof of Theorem 4.11, except that the index theorem for stoppable bandit processes with improving stopping options (Theorem 4.3 and Lemma 4.4) is used in place of the index theorem for a SFABP (Theorem 2.1). We must check that the stoppable bandit processes that are formed from F1, . . . , Fk once a first job completes each have an improving stopping option. They do if we define the stopping option for a composite job to be the best of the stopping options available from those original jobs of which it is composed that are either complete or available for service. We then have the following theorem.
Theorem 4.12 The classes of minimal sub-family index policies, forwards induction policies, and optimal policies are identical for a family of alternative stoppable jobs with no arrivals and precedence constraints in the form of an out-forest, provided that each of the stoppable jobs has an improving stopping option.
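To make the out-forest structure concrete, here is a minimal sketch (in Python, with hypothetical job labels and sub-family index values that are not taken from the text) of how precedence constraints with in-degree at most one decompose a family of jobs into minimal sub-families, and how a minimal sub-family index policy chooses amongst their initial jobs.

```python
# Sketch: decompose an out-forest of precedence constraints into minimal
# sub-families and pick the initial job of the sub-family with maximal index.
# Job labels and the index values nu_F below are illustrative assumptions only.

def minimal_subfamilies(parent):
    """parent[j] is the unique predecessor of job j, or None if j is a root.
    Returns {root: set of jobs in that out-tree}, i.e. the minimal sub-families."""
    def root_of(j):
        while parent[j] is not None:
            j = parent[j]
        return j
    groups = {}
    for j in parent:
        groups.setdefault(root_of(j), set()).add(j)
    return groups

# An out-forest in the style of Figure 4.3: three out-trees rooted at jobs 1, 2, 3.
parent = {1: None, 4: 1, 5: 1, 2: None, 6: 2, 3: None, 7: 3, 8: 7}
subfamilies = minimal_subfamilies(parent)

# Hypothetical indices nu(F_i) of the three minimal sub-families.
nu_F = {1: 0.9, 2: 1.4, 3: 0.7}

# A minimal sub-family index policy starts by continuing the initial (root)
# job of the sub-family whose index is largest.
best_root = max(subfamilies, key=lambda r: nu_F[r])
print(subfamilies)
print("serve the initial job of the sub-family rooted at job", best_root)
```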
4.8 Bandit processes with arrivals
Let us start with a SFABP and add the possibility that new bandits may become available subsequent to the start. This might be because they are exogenous arrivals, or because precedence constraints have been met. It turns out that an index theorem holds provided that the distribution of bandits that are newly available at any decision time depends only on events related to the bandit that was chosen for continuation at the previous decision time. Specifically, suppose that when B is in state x and is selected for continuation at some decision time t, a random set of bandits become newly available at the next decision time, t + T. As we have done previously, let us imagine for simplicity that all bandits evolve on the same finite state space, E = {1, . . . , N}. We suppose that the distribution of the set of bandits becoming newly available at t + T depends on x, T and the state y to which B makes a transition at time t + T. Note that the distribution of arrivals does not depend on the actual time t. This model for arrivals has been variously called arm-acquiring bandits (Whittle 1981), branching bandits (Weiss 1988), and bandit process-linked arrivals (or job-linked arrivals) (Gittins 1989). Such bandits can also be viewed as a model for what happens when a job B generates a random out-forest of successors when it reaches a completion state. We shall meet branching bandits again in Example 5.4.

The model permits some convenient simplifications. If bandits are jobs then we may make the simplification that if a job is selected for service then it completes by the next decision time. The possibility that the job may not complete may be accounted for by including in the set of arriving jobs a job which represents its continuation having not completed.

It would be possible to obtain a proof of an index theorem for bandits with bandit process-linked arrivals by generalizing Theorem 4.11 to bandits with random out-forest precedence constraints. However, it is perhaps easier and more instructive to generalize a proof of the index theorem in Chapter 1. As remarked by Tsitsiklis (1994), it is straightforward to see that the proof method for the index theorem in Section 2.9 continues to work when bandit arrivals are of this type. Recall that the proof in Section 2.9 is based upon an interchange argument and induction on the size of the state space. In the presence of job-linked arrivals it remains the case that if a bandit starting in state i_N has, amongst all states, the greatest equivalent reward rate up to its next decision time, say t_1^{i_N}, that is, if i_N = arg max_{i∈E} r(i)/E[1 − exp(−γ t_1^i)], then at any decision
time t, for which some bandit is in state i_N, it is optimal to continue that bandit up to its next decision time. This permits us to remove state i_N from further consideration and reduce the decision process to one in which bandits evolve on state space {1, . . . , N} \ {i_N}, after appropriately modifying the data for rewards, state transition probabilities, and bandit process-linked arrivals of new bandits at transition times. If, furthermore, we wish to prove an index theorem when there may also be precedence constraints in the form of a (possibly random) out-forest, then this is possible. However, so that we may continue to base the proof on an induction on the size of a finite state space, it may be necessary to approximate the actual decision process of interest by its restriction to some (very large) subset of its actual state space.

We have noted that the distribution of arrivals should not depend on the actual time t. However, we may without difficulty imagine that the batch of jobs that arrives within the interval between two decision times is the result of a time-homogeneous Poisson process or, in the discrete-time case for which every decision time is integer valued, a Bernoulli process. We shall refer to this simply as Poisson or Bernoulli arrivals, and the next result follows immediately.

Theorem 4.13 The index theorem for a FABP with Poisson or Bernoulli arrivals For a FABP with Poisson arrivals (Bernoulli arrivals in the discrete-time case) and any precedence constraints in the form of a (possibly random) out-forest, the classes of forwards induction policies and optimal policies are identical.
We shall return to branching bandits in Section 5.5, where we develop bounds on the suboptimality of priority policies that do not rank states in the optimal priority. We have remarked that the bandit state of greatest index is the one having the greatest equivalent reward rate, say r(i)/E(1 − e−γ Ti ) (see Section 2.9). However, to determine index values when there are arrivals is computationally quite a task, and by no means always a tractable one. There is, however, one important case for which everything falls out very neatly, and this is the subject of the next section of this chapter.
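As an illustration of the equivalent reward rate just mentioned, the following sketch (with made-up rewards and Monte Carlo samples of the times T_i to the next decision epoch) estimates r(i)/E[1 − e^{−γT_i}] for each state and identifies the state of highest priority.

```python
import math
import random

# Illustrative data (assumptions): per-state rewards and, for each state, a
# sampler for T_i, the time until the bandit's next decision epoch.
random.seed(1)
gamma = 0.1
r = {1: 0.5, 2: 1.0, 3: 0.8}
T_sampler = {1: lambda: random.expovariate(1.0),   # mean 1
             2: lambda: random.expovariate(0.5),   # mean 2
             3: lambda: random.expovariate(2.0)}   # mean 0.5

def equivalent_reward_rate(i, n=100_000):
    # Monte Carlo estimate of r(i) / E[1 - exp(-gamma * T_i)]
    denom = sum(1.0 - math.exp(-gamma * T_sampler[i]()) for _ in range(n)) / n
    return r[i] / denom

rates = {i: equivalent_reward_rate(i) for i in r}
print(rates)
print("highest-priority state:", max(rates, key=rates.get))
```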
4.9 Tax problems
4.9.1 Ongoing bandits and tax problems

A bandit process changes state only when the continuation control is applied to it, and yields a reward only at those times. It is also of interest, as pointed out by Varaiya, Walrand and Buyukkoc (1985), to consider reward processes with similar properties except that rewards also accrue when the freeze control is applied, rather than just when the continuation control is applied; these will be termed ongoing bandit processes. A family of ongoing bandit processes might, for example, model the problem faced by a businesswoman with interests in a number of projects, each of which yields a constant cash flow except when she
spends some time on it, thereby changing its characteristics in some way. The ongoing bandit processes represent the projects; continuing one of the ongoing bandit processes is equivalent to working on the corresponding project. An ongoing bandit process differs from our usual bandit process in that while the freeze control is applied in state x, reward accrues at a rate λ(x). The nice thing about any ongoing bandit process, say B, is that there is a corresponding ordinary bandit process, say B†, which produces the same expected total return, apart from a contribution which is independent of the policy followed, if the same policy is applied to both processes. To see this, note that if control 1 is applied at time s in state x, leading to a transition to state y at time t, a reward stream at rate λ(x) from s to ∞ is lost in return for a reward stream at rate λ(y) from t to ∞. Thus the change in expected reward is the same as if control 1 is applied to a bandit process B† in state x at time s for which

λ†(x) = 0,    r†(x) = E[ λ(y) ∫_{t−s}^∞ a^u du − λ(x) ∫_0^∞ a^u du | x ] + r(x, 1).      (4.22)
It follows that if in all other aspects they are the same then the payoffs from B and B† are equivalent under any policy. This equivalence means that the index theorem for a SFABP may be extended to allow some or all of the constituent bandit processes to be ongoing bandit processes. From (2.7) and (4.22) it follows that the index for an ongoing bandit process in state x (= x(0)) is

ν(x) = sup_{τ>0} { E[a^τ λ(x(τ)) − λ(x)] + R_τ(x) } / (1 − E a^τ),

where R_τ(x) = E[ Σ_{t<τ} r(x(t)) a^t | x(0) = x ].
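For a discrete-time chain the transformation behind (4.22) is easy to carry out numerically. The sketch below uses assumed toy data and a discrete-time analogue of (4.22): continuing in state x trades the freeze-reward stream λ(x) from now on for the stream λ(y) from the next period, so the equivalent ordinary bandit pays r†(x) = r(x) + [aE(λ(y)|x) − λ(x)]/(1 − a). The Gittins indices of the resulting ordinary bandits can then be used in the usual way.

```python
import numpy as np

# Minimal sketch (assumed toy data): discrete-time analogue of (4.22).
# The ongoing bandit pays lam[x] per period while frozen; the equivalent
# ordinary bandit pays r_dagger[x] when continued and nothing while frozen.

a = 0.9                                   # discount factor
P = np.array([[0.7, 0.3, 0.0],            # transition matrix under continuation
              [0.2, 0.5, 0.3],
              [0.1, 0.2, 0.7]])
r = np.array([1.0, 0.4, 0.2])             # continuation reward r(x, 1)
lam = np.array([0.3, 0.6, 0.1])           # freeze-reward rate lambda(x)

# r_dagger(x) = r(x) + [a * E(lambda(y) | x) - lambda(x)] / (1 - a)
r_dagger = r + (a * P @ lam - lam) / (1 - a)
print(r_dagger)
```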
If γ > 0 then it matters that class i jobs are always given priority over those in class i + 1. However, in the γ ↓ 0 limit it does not matter how the jobs in classes 1, . . . , j − 1 are scheduled, because the index is just the quotient of the expected total reward collected and the expected time elapsed, over the period required to reach the situation in which there remain no jobs in any class of higher priority than j. Since the order of service does not matter in the γ ↓ 0 limit, we may suppose that the initial job and any feedback jobs are served first, until there is no more feedback. Suppose this takes time T, and that during this time the rewards collected sum to C. Once time T is reached, there will be numbers of jobs in classes 1, 2, . . . , j − 1 remaining (the ones that arrived in Poisson streams during the interval up to T). The number in class i will have mean λi T. It follows by thinking about the busy period created from serving each of these jobs that the expected time that it will take to reach a point at which there are no jobs in the system of priority higher than j is proportional to T, say βT, and that the expected reward collected over this time is also proportional to T, say αT. This means that the index for j in the γ ↓ 0 limit has the value

ν(j) = E(C + αT) / E(T + βT) = ( EC/ET + α ) / (1 + β).

This is an increasing function of EC/ET, which is the value of the index in a problem with no arrivals. Thus, we may conclude that the priority order of the classes is the same as that with no arrivals.
4.9.3 Minimum EWFT for the M/G/1 queue

The same ideas may be employed in analysing some other models. For example, we may use Klimov's model to treat a problem of minimizing the expected weighted flow time (EWFT) in the M/G/1 multi-class queue. Most readers will know that M/G/1 is standard queueing theory notation (Kendall 1951) and refers to randoM (i.e. Poisson) arrivals, General independent service times, and 1 server. Suppose there are J job classes and class i jobs arrive as a Poisson process of rate λi, and that class i jobs have independent service times all with the distribution Fi, each of them costing ci per unit time between its arrival time and its completion.

Let us ignore for the moment the assumption we have made in the Klimov model that the number of job classes is finite (from which it follows that the indices ν(i, γ) define the same priority policy for all γ sufficiently small). In our M/G/1 model we shall not permit preemption at any instant, but only at certain process times, or ages of the jobs. Each age at which preemption is allowed can be considered as a new class. So a class i job can be thought of as passing through classes (i, 0), (i, 1), (i, 2), . . . as it reaches successive ages at which preemption is allowed, until finally it completes. Jobs arrive to class (i, 0) as a Poisson process of rate λi. This defines a job-linked arrivals process. If we assume that there is a finite k such that each job is guaranteed to complete before reaching class (i, k), then there are only finitely many classes. Thus the problem of minimizing EWFT in the multi-class M/G/1 is an example of the Klimov problem and so the same conclusions can be reached. This gives us the following theorem.

Theorem 4.15 The EWFT in a busy period of an M/G/1 queue with J job classes is minimized by the index policy defined by the γ = 0 indices for the individual jobs, ignoring arrivals.

The proof given here is essentially that of Nash (1973). If we wish to relax the assumption that a job arriving to class (i, 0) completes before reaching class (i, k), then we must cope with the possibility of a countably infinite number of job classes. To ensure that an optimal priority policy for the average cost is the one defined by the ordering of the indices for small γ, it would be sufficient to assume that there exists ε > 0 such that, for all pairs of distinct states (i, s) and (j, t), the ordering of ν(i, s, γ) and ν(j, t, γ) is the same for all γ < ε. This is similar to Condition C. In fact, the first proof of the above theorem (von Olivier, 1972) was direct, rather than via the discounted version of the problem, and hence did not require an assumption like Condition C.

Bertsimas, Paschalidis and Tsitsiklis (1995) have considered the average-cost branching bandits problem and Klimov's problem. They develop solutions based on fully characterizing the achievable region. This is the method that we expound in Chapter 5. Whittle (2005, 2007) has also studied the Klimov model in the undiscounted limit. He specializes to the case of exponentially distributed service times and
observes that the state-dependent part of the cost is linear in the state occupation numbers for the multi-armed bandit, but is quadratic for the tax problem. Think, for example, of an M/G/1 queue, but with no arrivals, and in which initially there are n1 and n2 jobs of two classes, all with processing times that are exponentially distributed with mean 1, and c1 > c2. In this simple case ri = ci, and as γ ↓ 0 the index of a class i job becomes ci. The total reward obtained by the time that the system empties is linear in n1 and n2; that is, n1 c1 + n2 c2. But the total holding cost is quadratic, ½n1(n1 + 1)c1 + ½n2(n2 + 1)c2 + n1 n2 c2. Obviously, the analysis is more complicated with feedback and arrivals. Remarkably, Whittle has been able to find an explicit solution to the optimality equation for average cost by leveraging the ideas of his formula for the value function in the discounted case that we discussed in Section 4.3. In brief, suppose that starting with just one job, in class i, one operates the system until reaching a point at which it becomes optimal to stop operation and take a retirement reward ν/γ. Denote the reward thereby obtained as ν/γ + ψ_i(ν) + o(γ). Whittle shows that the state-dependent part of the average cost is a function of the vector of occupation numbers, n = (n1, . . . , nJ), of the form s⊤n + ½n⊤Qn, where Q is the matrix with elements

Q_ij = ∫_0^∞ (∂ψ_i(ν)/∂ν)(∂ψ_j(ν)/∂ν) dν

and s = (s1, . . . , sJ) is the solution of a linear system of equations in the si, involving the Qij and parameters of the problem.

Dacre, Glazebrook and Niño-Mora (1999) consider Klimov's model on m parallel machines. They bound the suboptimality gap when operating Klimov's index priority in this environment, and demonstrate its optimality in a heavy-traffic limit. Aalto, Ayesta and Righter (2009) have also discussed scheduling in the M/G/1 queue. They develop properties of the Gittins index and show that policies of 'foreground–background' and 'first-come-first-served' are optimal if and only if the service time distribution is, respectively, of 'decreasing hazard rate', and 'new better than used in expectation'.
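As a quick numerical check on the quadratic holding-cost expression in the no-arrivals example above, the following sketch (with arbitrarily chosen n1, n2, c1, c2) serves all class 1 jobs before class 2 and compares the simulated average total holding cost with ½n1(n1 + 1)c1 + ½n2(n2 + 1)c2 + n1n2c2.

```python
import random

# Sketch: verify the quadratic holding cost for the no-arrivals example.
# Parameter values are illustrative assumptions.
random.seed(0)
n1, n2 = 4, 3
c1, c2 = 2.0, 1.0          # c1 > c2, so class 1 is served first

def total_holding_cost():
    t, cost = 0.0, 0.0
    for n, c in ((n1, c1), (n2, c2)):
        for _ in range(n):
            t += random.expovariate(1.0)   # exp(1) service time
            cost += c * t                  # the job's flow time, weighted by c
    return cost

runs = 100_000
estimate = sum(total_holding_cost() for _ in range(runs)) / runs
formula = 0.5 * n1 * (n1 + 1) * c1 + 0.5 * n2 * (n2 + 1) * c2 + n1 * n2 * c2
print(estimate, formula)   # the two numbers should agree closely
```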
4.10 Near optimality of nearly index policies
The proof of the index theorem for a SFABP given in Section 4.5 quickly leads, as noted by Katehakis and Veinott (1987), to a useful bound on the suboptimality of a policy which is approximately an index policy. Let h be a policy that starts by choosing a bandit B and continuing it up to some stopping time τ such that

ν_τ(B) ≥ max_k ν(B_k) − ε.
Once τ is reached, policy h repeats in the same way, choosing a B and τ , continuing B up to τ , and then repeating this procedure forever. We call such a
policy h an ε-index* policy. This definition may fairly easily be shown to imply that h is an ε-FI* policy. By adapting the ideas used in Section 4.5 to prove Theorem 2.1 we have the following corollary.

Corollary 4.16 If h is an ε-index* policy then R_h(F) ≥ R(F) − εγ^{−1}.

Proof. Let g be an arbitrary policy. Suppose h initially continues B up to time τ. So R_τ(B) ≥ W_τ(B)[ν(B) − ε]. As in Section 4.5, we may start with g and move the τ-length portion of time that we continue B to the start of the processing, and thereby nearly derive (4.11). All the arguments are valid, except that in the second line of (4.11) the quantity W_τ(B)ν(B) must be replaced with W_τ(B)[ν(B) − ε]. This is enough to conclude that the reward obtained from policy g is no more than εW_τ(B) greater than the reward obtained from the policy which is the same as g, except that the first τ-length segment of h (over which B is continued) is moved to the start. By repeating this argument multiple times, moving successive portions of h earlier in the processing, we have that the difference between the rewards under h and g is bounded below by −ε ∫_0^∞ a^t dt = −εγ^{−1}.

Similarly, we have the following under the conditions that Theorems 4.9 and 4.10 hold. The proof closely follows that for Corollary 4.16.

Corollary 4.17 If h is an ε-FI* policy then R_h(F) ≥ R(F) − εγ^{−1}.

An alternative definition of an approximate index policy is one for which the index of the bandit process continued at each decision time is within ε of the maximal index at that time. Call such a policy an ε-index policy. Glazebrook (1982c) has established the following bound when F is Markov.

Theorem 4.18 If h is an ε-index policy for an SFABP F with the discount parameter γ, then R_h(F) ≥ R(F) − εγ^{−1}(1 − e^{−γ})^{−1}.

Which of these bounds works best clearly depends on the value of γ and on which kind of ε-approximation is most readily shown to hold in a given case. In fact, both Corollary 4.16 and Theorem 4.18 may be inferred from a more general result for branching bandits given in Theorem 5.12. There appears to be no obvious way of extending Corollary 4.16 to provide an approximation result in the most general setting covered by the index theorem for superprocesses (Theorem 4.3), though the extension is straightforward for the rather less general index theorem for generalized jobs with precedence constraints and satisfying Condition E (Theorem 4.10).
Fay and Walrand (1991) have developed results along the lines of Corollary 4.16 for Nash's generalized bandit problem, as explained in Section 3.4.1. It is interesting to note that no approximation result follows from Theorem 4.18 on letting γ ↓ 0. A different approach seems to be necessary (for example see exercise 4.1).
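A minimal numerical illustration of Corollary 4.16, with assumed parameter values: for two standard bandit processes with rates λ1 > λ2, always continuing the inferior bandit is an ε-index* policy whenever λ2 ≥ λ1 − ε, and its loss (λ1 − λ2)/γ is indeed at most εγ^{−1}.

```python
# Sketch: Corollary 4.16 for two standard bandits (continuous time, assumed values).
lam1, lam2 = 1.00, 0.96     # constant reward rates; the indices are lam1 and lam2
gamma = 0.05
eps = lam1 - lam2           # always serving bandit 2 is then an eps-index* policy

optimal = lam1 / gamma            # serve the better bandit forever
eps_policy = lam2 / gamma         # serve the worse bandit forever
print(optimal - eps_policy, eps / gamma)   # here the loss equals the bound eps/gamma
```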
Exercises

4.1. Use an interchange argument to show that for Problem 1 an ε-index policy (using index ci/si) comes within ε(Σ_i si)² of achieving the minimum cost.

4.2. Given a bandit process B, define a bandit process B*, which differs only in that if it is continued when its process time is t then the reward that accrues is min_{0≤s≤t} ν(B, x(s)) (i.e. the prevailing charge for B, as defined in Section 2.5). Given a SFABP F comprised of B1, . . . , Bn, let F* be the SFABP comprised of B1*, . . . , Bn*. Prove the index theorem by verifying the following claims: (i) for any freezing rule, Rf(Bi) ≤ Rf(Bi*); (ii) for any policy g, Rg(F) ≤ Rg(F*); (iii) Rg(F*) is maximized by the Gittins index policy; (iv) under this policy Rg(F*) = Rg(F). This exercise expresses in a different way the proof of the index theorem in Section 2.5.

4.3. Let F be a simple family of alternative sub-families Fj (j = 1, 2, . . . , m) of bandit processes. Show that F satisfies Condition E if the same is true of each of the Fj. (Thus, for example, some of the Fj may consist of jobs with no preemption allowed, and others of jobs with job-linked arrivals.)

4.4. Show that Condition F is satisfied by an FABP composed of jobs with arbitrary precedence constraints and no preemption. This leads to an alternative proof for Theorem 4.9.

4.5. Show, using the notation of Section 4.10, that
Φ(x, m) ≤ Φ_i(x_i, m) + Φ^i(x^i, m) − m,

where Φ^i(x^i, m) denotes the value function for the set of bandits when Bi is excluded. Hint: see exercise 2.4.

4.6. Use the law of large numbers to show that for the M/G/1 queue the weighted flow time per unit time over a long period is minimized by a policy which minimizes EWFT in a busy period, provided the expected duration of a busy period is finite.

4.7. A bandit process B is made up of two jobs (see Section 3.2). Job 1 must be completed before job 2 may be started, and neither of the jobs may be interrupted before completion. Find an expression for ν(B). Switching now to the criterion of minimum expected weighted flow time, let ci be the cost per unit time of the flow time of job i and let τi be its processing time (i = 1, 2). Use Corollary 3.5 to show that

ν(B) = max{ c1/Eτ1 , (c1 + c2)/E(τ1 + τ2) }.      (4.23)
Now extend Theorem 4.15 to show that the EWFT in a busy period of an M/G/1 queue, for which jobs may arrive in pairs linked in this way, is also minimized by an index policy for which the index of a linked pair of jobs is given by (4.23). For further consideration of non-preemptive optimal scheduling see Kadane and Simon (1977) and Glazebrook and Gittins (1981).

4.8. Extend the proof of Theorem 4.15 to allow each class of Poisson arrivals to be made up of indistinguishable sets of jobs subject to an out-forest of precedence constraints, rather than of individual jobs.
5 The achievable region methodology

5.1 Introduction
During the 1990s, researchers invested great energy in developing an 'achievable region approach' for optimizing stochastic systems by way of index and priority policies. Seminal papers are those by Tsoucas (1991), Bhattacharya, Georgiadis and Tsoucas (1992), Bertsimas, Paschalidis and Tsitsiklis (1995), Bertsimas and Niño-Mora (1993, 1996), Glazebrook and Garbe (1996) and Dacre, Glazebrook and Niño-Mora (1999). Conservation laws for queueing systems have also been discussed by Federgruen and Groenvelt (1988), Shanthikumar and Yao (1992), and Glazebrook and Niño-Mora (2001). Niño-Mora (2011b) has written an encyclopaedia article on the subject. Several of these papers are very long; many details are far too complicated to reprise completely within this book. However, we can certainly give the reader an appreciation of the essentials. Let us begin by quoting Bertsimas, who says that just as the minimum spanning tree problem can be formulated as a linear program over a polymatroid and solved by a greedy algorithm, there are stochastic optimization problems, such as the multi-armed bandit problem, or the Klimov problem (see Klimov, 1974), that can be formulated as a linear program over a so-called extended polymatroid and solved by a priority-index policy. This priority-index policy can be found via an adaptive greedy algorithm that runs in polynomial time (e.g. in the number of bandit states). In fact, the idea of finding a priority policy via a greedy algorithm had begun with Klimov (1974), and his work provided motivation for further research. It was Bertsimas and
Niño-Mora (1993, 1996) who developed the general theory of conservation laws and indexable systems in the form it takes in Section 5.4.

In Section 5.2 we begin with a very simple example of scheduling in a two-class M/M/1 queue. We see that the vector of the two mean queue lengths is constrained to lie in a one-dimensional simplex (a line segment), whose extreme points (endpoints of the segment) are obtained by giving priority to one class over the other. It must follow that any system-wide objective which is linear in the mean queue lengths must be optimized by one of these priority policies. This is a simple notion, but it becomes powerful when applied to many interesting control problems. The method has utility for problems with discounted returns if (i) the system-wide return can be expressed as some weighted sum of the expected discounted total efforts that are applied to different job classes, or projects; and (ii) one can develop linear constraints for these measures of effort. We see in Section 5.3 that the problems posed by simple families of alternative bandit processes (SFABPs) (or multi-armed bandit problems) can be cast in this manner. We develop a linear program whose solution provides an upper bound on the expected return from any SFABP, and the fact that this bound can be shown to be attained leads us to yet one more proof of the index theorem. The ideas are worked out in Section 5.4, to prove that systems satisfying generalized conservation laws (GCLs) have optimal policies that are index policies. The ideas are applied to branching bandits in Section 5.5, and to job selection and scheduling problems in Section 5.6. The method does not lead to the easiest proof of the index theorem for a SFABP or for branching bandits, but it can be useful in calculating, for example, the suboptimality of a priority policy based on Gittins indices in a parallel server environment in which it is not optimal. We make such calculations in Section 5.7.
5.2 A simple example
We shall first illustrate the general approach of the chapter by considering the simple case of the optimality of the cμ-rule for a two-class M/M/1 queueing system, scheduled to minimize a linear holding cost. To the authors' knowledge, this problem was first analysed in this manner by Coffman and Mitrani (1980). Customers of two types arrive at a single server according to independent Poisson streams, with λi the rate of arrival for class i, i = 1, 2. The service times of class i customers are exponentially distributed with mean μ_i^{−1}, i = 1, 2. Service times are independent of each other and of the arrival streams. The overall rate at which work arrives at the system, namely λ1/μ1 + λ2/μ2, is assumed less than 1, and this guarantees the existence of stable scheduling policies for the system. Policies for the system must be past-measurable and non-idling, while priorities between customers are imposed preemptively. The goal is to choose such a policy π to minimize the long-term holding cost rate, namely

C^{opt} = min_π { c1 Eπ(N1) + c2 Eπ(N2) },
where ci is a positive-valued class i holding cost rate, Ni is the number of class i customers in the system and Eπ is an expectation taken under the steady-state distribution when policy π is applied.

To proceed, we first develop a suitable collection of (in)equalities involving the steady-state expectations Eπ(N1) and Eπ(N2). For this, we consider the work-in-system process whose value at time t under the policy π is denoted W^π(t) and is the aggregate of the remaining service times of all customers present in the system at t. Equivalently, it is the time which would be required to clear the system in the absence of any future arrivals. Sample paths of W^π(·) consist of upward jumps at arrival epochs, the size of the jump being equal to the service requirement of the entering customer, followed by a period during which it is reduced at rate 1 as service is delivered to the system. Each such period terminates either with an empty system when W^π(t) hits the value 0, or with another upward jump when the next customer arrives. It is clear that these sample paths do not depend on the choice of policy π in the non-idling class under consideration. It must certainly be the case that in steady state the expected amount of work in the system is policy invariant. Recall that the Pollaczek–Khintchine formula for the M/G/1 queue states that if X has the distribution of a service time and λ is the arrival rate, then the mean work-in-system is λE(X²)/[2(1 − λEX)]. Using this, we may infer that

Eπ(N1)/μ1 + Eπ(N2)/μ2 = (ρ1μ1^{−1} + ρ2μ2^{−1}) / (1 − ρ1 − ρ2),
where ρi = λi/μi is the rate at which class i work enters the system.

We develop these ideas further by considering W1^π(t), the work-in-system at time t which is due to class 1 customers only. This quantity is plainly not policy invariant and will critically depend upon the degree of priority accorded to class 1 work. It is not difficult to see that, for each realization of the system, W1^π(t) is minimized at each t by the policy which gives all class 1 work absolute priority over any class 2 work. In this case the system only serves class 2 customers if W1^π(t) = 0. Denote such a policy by 1 → 2. We infer that, in the steady state, the expected amount of class 1 work in the system is minimized by 1 → 2 and have that, for all choices of non-idling π,

Eπ(N1)/μ1 ≥ ρ1μ1^{−1} / (1 − ρ1),

where the right-hand expression is the mean work-in-system of an M/M/1 system in steady state serving class 1 customers only. Similarly,

Eπ(N2)/μ2 ≥ ρ2μ2^{−1} / (1 − ρ2),

with the bound on the right-hand side achieved by the priority policy 2 → 1.

Motivated by the above, we introduce x_i^π = Eπ(Ni)/μi as the class i performance associated with policy π, i = 1, 2. We denote by X the set of
all possible pairs (x_1^π, x_2^π) as π ranges through the non-idling policy class. We call X the performance space or achievable region. The minimized cost for this problem may now be expressed by

C^{opt} = min_π { c1μ1 x_1^π + c2μ2 x_2^π } = min_{(x1,x2)∈X} { c1μ1 x1 + c2μ2 x2 }.
The right-hand minimization is of the linear objective over feasible space X. However, the above discussion of work-in-system tells us that all performance pairs (x_1^π, x_2^π) lie on the line segment P given by

P = { (x1, x2) : x1 ≥ ρ1μ1^{−1}/(1 − ρ1),  x2 ≥ ρ2μ2^{−1}/(1 − ρ2),  x1 + x2 = (ρ1μ1^{−1} + ρ2μ2^{−1})/(1 − ρ1 − ρ2) },

and hence we have that X ⊆ P. For convenience, we denote the endpoints, or extreme points, of the line segment P by A and B. We use A for the point with minimal x1-value, whose coordinates are given by

x1 = ρ1μ1^{−1}/(1 − ρ1),    x2 = (ρ1μ1^{−1} + ρ2μ2^{−1})/(1 − ρ1 − ρ2) − ρ1μ1^{−1}/(1 − ρ1),

and B for the endpoint given by

x1 = (ρ1μ1^{−1} + ρ2μ2^{−1})/(1 − ρ1 − ρ2) − ρ2μ2^{−1}/(1 − ρ2),    x2 = ρ2μ2^{−1}/(1 − ρ2).
By the above arguments, the points A and B may be identified as the performances associated with the priority policies 1 → 2 and 2 → 1 respectively. Hence the endpoints of P are in X. However, any point in P may be expressed as a convex combination of A and B and hence is easily seen to be the performance of a corresponding randomization of the priority policies 1 → 2 and 2 → 1. It therefore follows that all points in P are in X and hence that P ⊆ X, from which we can then conclude that X = P and hence that the achievable region for this problem is exactly the line segment P. It further follows that

C^{opt} = min_π { c1μ1 x_1^π + c2μ2 x_2^π } = min_{(x1,x2)∈P} { c1μ1 x1 + c2μ2 x2 },
where the right-hand optimization is a linear program (LP) which is easily solved. Call any optimizer of this LP x opt . It is easy to see that when c1 μ1 ≥ c2 μ2 , x opt = A, and a policy achieving this optimal performance is 1 → 2. When c1 μ1 ≤ c2 μ2 , x opt = B and a policy achieving this optimal performance is 2 → 1. We thus conclude that our problem is solved by the so-called cμ-rule which gives priority to the customer class with the larger ci μi -value. This is a myopic principle. The optimal policy always seeks to serve in such a way as to drive down the current holding cost at the maximum available rate. Please also note that
THE ACHIEVABLE REGION METHODOLOGY
119
it follows easily from the above discussion that when c1 μ1 = c2 μ2 , all non-idling policies are optimal. Before proceeding to a rather more challenging example of the achievable region approach, we pause to emphasize an important notational choice for the material in this chapter. In the above example and throughout Chapter 5 we shall use x π to denote a system performance under policy π. We shall also use x for algebraic variables related to performance. Two simple requirements which will govern the choice of performance in any given example are (i) it should be an expectation; and (ii) the system wide objective to be optimized should be expressible in terms of the chosen performance. Thus in the current chapter we shall not use the notation x(t) for a system state at time t.
5.3
Proof of the index theorem by greedy algorithm
In this section we sketch the key ideas of the achievable region approach by using it to give yet one more proof of the index theorem for the general multi-armed bandit problem (or simple family of alternative bandit processes). Suppose that all n bandits are identical and have the same finite state space E = {1, . . . , N }. As noted in Section 3.2 this is without loss of generality, but provides a useful simplification because bandits are distinguished only by their state. Previously, we have used x to denote a state, but as stated above it is convenient in this chapter to reserve x1 , . . . , xN for linear programming, or system performance variables. In this chapter we need only a notation for the initial state of a multi-armed bandit, and shall denote this by k = (k1 , . . . , kn ). Let Ii (t) be an indicator variable for the event that a bandit in state i is continued at time t. For a given policy π, let xiπ denote the total discounted number of times that a bandit in state i is continued under policy π. That is, ∞ xiπ = Eπ a t Ii (t) k . (5.1) t=0
Since the initial state k is to be regarded as fixed, we do not show explicitly the dependence xiπ on k. Assuming that whenever a bandit that is in state i is continued we obtain reward
ri , the multi-armed bandit problem is equivalent to the problem of maximizing i ri xiπ , by choice of (x1π , . . . , xNπ ) optimally within the space of performance vectors that are ‘achievable’ by admissible policies π. The crux of the approach is to characterize the achievable region of performance vectors as a polyhedral space that is defined by a set of linear constraints. These constraints, sometimes called conservation laws, can be obtained by thinking about some objective functions that are obviously optimized by priority policies. Let us start with one constraint that is trivial. Since one bandit is being continued at each time, we must have one simple conservation law: 1 xiπ = 1 + a + a 2 + · · · = . (5.2) 1−a i∈E
120
MULTI-ARMED BANDIT ALLOCATION INDICES
One can also see the above as the total expected discounted reward in a trivial bandit problem in which ri = 1 for all i. To find another constraint let us pick some subset of states S, S ⊂ E, and consider a new bandit problem in which one is trying to minimize an expected discounted sum of costs over the infinite horizon. Continuing a bandit in state i ∈ /S costs nothing. But if one continues a bandit with state i ∈ S then an immediate strictly positive cost, ASi , must be paid, which we set equal to the expected discounted number of steps that would be required to return this bandit’s state to within S. Let TiS be a random variable with the distribution of this number of steps. So S ASi = E 1 + a + · · · + a Ti −1 . (5.3) with −ASi for i ∈ S and We now have a problem in which ri has
beenS replaced π by 0 otherwise. We wish to minimize i∈S Ai xi . This is a nearly trivial bandit problem. One can easily prove that a policy is optimal for this bandit problem if it always gives preference to bandits whose states are not in S. We simply have to show that such a policy has a value function that satisfies the relevant dynamic programming equation. This is easy to check, working with the fact that the value function for such a policy is
TS 1 S V (k) = E a ki . 1−a i:ki ∈S /
A key fact is that once the states of all bandits are in S it does not matter which bandit we continue next. The dynamic programming equation is solved by V S (k) = 1/(1 − a) for any k such that the states of all bandits are in S. Let us define b(S) = V S (k). By the solution that we have now identified to our nearly trivial multi-armed bandit problem we have established the validity of constraints of the form ASi xiπ ≥ b(S), S ⊆ E. (5.4) i∈S
It is consistent with all the above definitions to take AE i = 1 (1 ≤ i ≤ N ), and b(E) = 1/(1 − a). Then (5.2) becomes (5.6) and we have proved the following lemma. Lemma 5.1 There exist positive ASi (i ∈ S ⊆ E), having the interpretation (5.3), such that for any scheduling policy π, ASi xiπ ≥ b(S), for all S ⊂ E, (5.5) i∈S
π AE i xi = b(E),
(5.6)
i∈E
and such that equality holds in (5.5) if π is a policy that gives priority to bandits whose states are not in S over any bandits whose states are in S.
THE ACHIEVABLE REGION METHODOLOGY
121
The lemma motivates us to consider the following linear programming relaxation of our problem. It is a relaxation because we do not require the xi to be of the form xi = xiπ , for some policy π. ri xi Primal LP : maximize
i∈E
ASi xi ≥ b(S), for all S ⊆ E,
(5.7)
AE i xi = b(E),
(5.8)
i∈S
i∈S
xi ≥ 0, for all i.
(5.9)
The optimal value of this LP is an upper bound on the optimal value for our bandit problem. The dual LP is b(S)yS Dual LP : minimize
S
ASi yS ≥ ri , for all i,
(5.10)
S : i∈S
yS ≤ 0, for all S ⊂ E,
(5.11)
yE unrestricted in sign.
(5.12)
We will identify optimal solutions for the primal and dual by finding {xi }i∈E and {yS }S⊆E , that are feasible for these problems and satisfy the complementary slackness conditions. These are sufficient conditions for feasible solutions to the primal and dual linear programs to be optimal (see Vanderbei, 2007). We begin by choosing i to maximize ri /AE i . Suppose, without loss of generality, that the maximizer is state i = N . Let us put y¯E = rN /AE N = rN , and also put y¯S = 0 for all S, S = E which include N . The dual constraints, for i = 1, 2, . . . , N − 1, are now of the form ASi yS ≥ ri − AE i y¯E . S : i∈S,N ∈S /
The right-hand side is essentially a new value of ri , and it is ≤ 0. So we can repeat the above procedure. We let SN−1 = E − {N } and then choose i to maximize ri − AE i y¯E S
Ai N −1
.
Suppose, without loss of generality, that this is for state i = N − 1. Then y¯SN −1 =
(rN−1 − AE N−1 y¯E ) S
N −1 AN−1
122
MULTI-ARMED BANDIT ALLOCATION INDICES
and y¯S = 0 for all unassigned S whose complement does not contain both N and N − 1. Note that by construction, y¯SN −1 ≤ 0. We can continue in this manner. Suppose without loss of generality that states are identified successively in the order N, N − 1, . . . , 2, 1. Let Sj = {1, 2, . . . , j }. This will turn out to be the set of j states of smallest index. The only non-zero dual variables are y¯E , y¯SN −1 , . . . , y¯S2 , y¯S1 . By the way they have been constructed, all but the first of these variables are guaranteed to be non-positive. So they satisfy (5.11)–(5.12). Let the xi = xiπ be determined by the policy π that gives priority to bandits whose states come latest in the list 1, 2, . . . , N . The complementary slackness condition between xi and (5.10) is satisfied since equality holds in (5.10). The complementary slackness condition between y¯S and (5.7) is trivially satisfied for S such that y¯S = 0. It is satisfied for S such that y¯S < 0, if for all j = 1, . . . , N − 1, Sj Ai xi = b(Sj ). i∈Sj
The above holds because of Lemma 5.1 and the fact that π gives priority to jobs in Sjc = {j + 1, . . . , N } over those in Sj = {1, . . . , j }. This shows that an optimal solution to the primal LP is the performance vector of a priority scheduling policy. Moreover, this priority policy can be identified by the adaptive greedy algorithm which we have used to constructs the y¯S . More details are given by Bertsimas and Ni˜no-Mora (1996). Note that the adaptive greedy algorithm is similar to the idea of Section 2.9, where we identified states in decreasing order of their Gittins indices. We may set out the adaptive greedy algorithm as follows. Adaptive greedy algorithm AG(A, r) Input: r = {ri , i ∈ E}; A = {ASi , i ∈ S ⊆ E} Initialization: let y¯E := max{ri /AE i : i ∈ E} choose iN ∈ E attaining the above maximum let νiN := y¯E let SN := E; let k := N Loop: while k ≥ 2 do let Sk−1 := Sk \ {ik }
Sj ri − N j =k Ai y¯Sj let y¯Sk−1 := max : i ∈ Sk−1 S Ai k−1 choose ik−1 ∈ Sk−1 attaining the above maximum let νik−1 := νik + y¯Sk−1 let k := k − 1 end {while}
THE ACHIEVABLE REGION METHODOLOGY
123
The algorithm calculates νij s, with riN = νiN ≥ νiN −1 ≥ · · · ≥ νi1 . These are in fact the Gittins indices (as has been presaged by our choice of the Greek letter ν). To see this, let us start by noting that iN must be the state of maximal Gittins index, the value of which must be νiN = riN . This is because there is no other state from which one could compute a greater equivalent constant reward rate than riN . The algorithm defines νij −1 := νij + y¯Sj −1 , and so rewriting the definition of y¯Sk−1 by substituting y¯Sj = νij − νij +1 , we have N−1
S
N rik−1 − Aik−1 νiN −
j =k Sk−1 Aik−1
νik−1 = νik + S
=
S
j Aik−1 (νij − νij +1 )
S
S
S
N −1 k−1 N k − Aik−1 ) + · · · + νik (Aik−1 − Aik−1 ) rik−1 + νiN (Aik−1
S
k−1 Aik−1
(5.13)
Consider the trajectory of a bandit which starts in state ik−1 and is continued until its state first exits the set {ik−1 , . . . , iN }. Let τj denote the random time at which its state first exits the set {ij , . . . , iN } (k ≤ j ≤ N ). We can write the above as ⎛ ⎞ τN −1 −1 τ τ N −1 k −1 E ⎝rik−1 + νiN a t + νiN −1 a t + · · · + νik at ⎠ νik−1 =
t=τN
t=1
⎛ E ⎝1 +
τN −1 −1
τN −1
at +
τk −1
at + · · · +
t=τN
t=1
t=τk+1
⎞
.
at ⎠
t=τk+1
If we assume an inductive hypothesis that νiN , . . . , νik are the N − k + 1 greatest Gittins indices, and use their interpretation as equivalent constant reward rates, then this reduces to τ −1 k rx(t) a t E t=0
νik−1 = E
τ −1 k
.
(5.14)
at
t=0
This shows that νik−1 has the interpretation of the next greatest Gittins index. While this essentially completes the proof that the Gittins index policy is optimal for a SFABP, it will assist the subsequent development if we take the ideas somewhat further. In this example we identify the system performance by x π = (x1π , . . . , xNπ ), where xiπ is as in the preceding discussion. We also write X for the achievable region consisting of the x π attained as π ranges through
124
MULTI-ARMED BANDIT ALLOCATION INDICES
the class of past-measurable policies for the multi-armed bandit. It follows from Lemma 5.1 that X is contained in the convex polytope defined by constraints (5.7)–(5.9), that is, + N S E P = x∈ R : Ai xi ≥ b(S), S ⊂ E and Ai xi = b(E) . i∈S
i∈E
Now observe that the LP in (5.7)–(5.9) may be re-expressed as ri xi . max x ∈P
i∈E
The above argument has demonstrated that the solution of this LP is of the form x = x πG , where the right-hand side denotes the system performance of a Gittins index policy, denoted here by πG . Since every extreme point of a bounded polyhedron is the unique maximizer of some linear objective, it therefore follows that the extreme points of P are the performances of Gittins index policies defined with respect to some vector of rewards r. It must then follow that all of the extreme points of P are members of X and that therefore, invoking the argument of Section 5.2 concerning convex combinations of extreme points, it must follow that P ⊆ X and hence that X = P . This last argument only requires that the system performance be an expectation, which in the current case it is. Finally, it must therefore be the case that the value of the LP in (5.7)–(5.9) and of its dual in (5.10)–(5.12) are equal to the maximal reward from the multi-armed bandit. This fact will be exploited in succeeding sections of this chapter.
5.4
Generalized conservation laws and indexable systems
A few key facts are critical to the account of the multi-armed bandit given in the preceding section. The first is that the system performance is an expectation and the second is that the objective to be optimized is linear in this performance. Finally, there is the fact that the system performance satisfies a collection of (in)equalities of the form given in Lemma 5.1. The current section points out that there are a wide range of other problems which have a formulation which shares these features and hence which have solutions in the form of priority policies with priorities determined according to a Gittins index ordering derived from an adaptive greedy algorithm. In this more general setting, such Gittins indices have sometimes been called generalized Gittins indices but we shall drop the qualifying adjective. We consider the following general setting: service is offered to N job types which may be tasks, customer classes (as in Section 5.2) or states of distinct projects (as in Section 5.3). Throughout the chapter we shall assume that this collection of job types E is finite. Policies for scheduling service must be nonidling and past-measurable. With any policy π is associated a system performance
THE ACHIEVABLE REGION METHODOLOGY
125
xiπ
π
x which is an N -vector whose ith component is a performance associated with type i jobs. This system performance must be an expectation. It may be necessary to impose conditions on the system to make sure that all of these expectations exist. We write E = {1, 2, . . . , N } for the set of all job types and S for some general subset of it. Let σ = (σ1 , σ2 , . . . , σN ) be some permutation of E. With any σ is associated a priority policy which schedules service to the job types by enforcing priorities according to σ . We shall use σ for both the permutation and the policy which it yields. We adopt the convention that under policy σ , jobs of type σN are given highest priority, followed by jobs of type σN−1 , and so on. Jobs of type σ1 have lowest priority. The following definition expresses the requirement that the system performance should satisfy a set of (in)equalities of the kind in Lemma 5.1. We say that the system obeys generalized conservation laws of type 1 , or GCL(1) for short. Definition 5.2 GCL(1) A system satisfies GCL(1) if there exist non-negative performance x π , set function b : 2E → R+ and matrix A = ASi , i ∈ E, S ⊆ E satisfying ASi > 0, i ∈ S, and ASi = 0, i ∈ / S, S ⊆ E, such that for all π we have
ASi xiπ ≥ b(S),
(5.15)
π AE i xi = b(E),
(5.16)
i∈S
and i∈E
where equality holds in (5.15) for all permutation policies σ for which (σ1 , σ2 , . . . , σ|S| ) = S. The content of Theorem 5.5 (to be stated below) is that for any system obeying GCL(1) there exists an optimal priority policy for any system-wide objective which is linear in the performance x π . Moreover, such a policy is determined by a set of Gittins indices which may be computed by an adaptive greedy algorithm. Hence for GCL(1), a set of objectives which are linear in the performance and which are subject to constraints determined by the matrix A, one for each subset S, is minimized by a scheduling policy which gives job types in S c priority over those in S. An important special case of GCL(1) occurs when ASi is simply the indicator function for S, equal to 1 for i ∈ S and 0 otherwise. Here we shall say that the system satisfies strong conservation laws of type 1 , SCL(1) for short. We have already seen that the SFABP of the preceding section armed with the performance defined there satisfies GCL(1). We now give two
126
MULTI-ARMED BANDIT ALLOCATION INDICES
further examples. The first is of a special case which satisfies SCL(1), while the second is of a general system which encompasses the SFABP and many of its generalizations discussed in Chapter 4. Example 5.3 A system consists of N jobs to be processed non-preemptively, with Pi the processing requirement of job i. Processing requirements of distinct jobs are independent. Here the class of scheduling policies is restricted to the permutation policies σ . If we use Siσ for the time at which the processing of job i begins under σ , then the performance vector x σ given by σ xiσ = E
Si +Pi
e−γ t dt ,
Siσ
i ∈ E,
where γ > 0, can easily be shown to satisfy SCL(1). Example 5.4 Bertsimas and Ni˜no-Mora (1996) have established that branching bandits satisfy GCL(1) when the system performance is designed for application to problems where the objective is the maximization of total discounted reward. A branching bandit (which we have already met in Section 4.8) is most helpfully thought of as a model for the scheduling of collections of projects of which there are n different types, labelled {1, 2, . . . , n} ≡ . Each type k project has a finite state space Ek with generic state labelled ik . The project/state pair (k, ik ), denoting a project k in state ik , is treated as a single job type in this formulation and is replaced by a single identifier, i say. It follows that the set E = {1, 2, . . . , N } of possible job types is given by E= Ek . (5.17) k∈
At each decision epoch, service is applied to exactly one job type, that is, to exactly one project in its current state. Since the systems being modelled may well incorporate exogenous arrivals of new jobs as well as jobs which leave the system upon completion of service, it is possible that the system could fall empty. We adopt the convention that, should this happen, service is deemed to be applied to a job of a specially designated type which is incapable of earning rewards. In general, the processing of a type i job takes random time vi at the end of which the processed type i job is replaced in the system by a collection consisting of φij jobs of type j , j ∈ E. The collection {(vi , φij ), j ∈ E} has a general joint distribution which is independent of all other project types and identically distributed for the same i. In general, the collection φij , j ∈ E will usually represent two different kinds of outcomes from the processing of i. The first is a transition within the project k for which i ∈ Ek . Hence service may change the state of the project to which it is applied. In the case of a SFABP, this is the sole impact of service, and hence in this case exactly one of the φij will be 1 for some j ∈ Ek , with all others equal to 0. The second kind of outcome of
THE ACHIEVABLE REGION METHODOLOGY
127
processing will be the arrival into the system of exogenous arrivals, or perhaps the emergence of new job types which become candidates for processing once the service of i has been completed. The latter may be relevant, for example, in cases where there are precedence constraints between the projects. In a natural development of the choices made in the analysis of SFABPs in Section 5.3, we take the system performance under policy π to be ∞ π −γ t xi = Eπ Ii (t)e dt , i ∈ E, 0
where Ii (t) is an indicator taking the value 1 when a type i job is processed at time t, and 0 otherwise. Note that this form of the performance will be particularly appropriate when the processing time random variables vi are continuous. Should we have a discrete-time model with all of the vi equal to 1, as in Section 5.3, then a formulation in which the performance is represented as a sum rather than an integral will be more natural. Our choice of matrix A is as follows: fix subset S ⊆ E and i ∈ S. Consider a system in which a single job of type i is present at some time, 0 say, and to which processing is given. Subsequent processing choices are made in a way that gives job types in S c priority over those in S. Precisely how this is done is immaterial. We use TiS for the first time after 0 at which only job types in S are present in the system. We then choose ASi
TS E 0 i e−γ t dt = vi −γ t , E 0 e dt
i ∈ S, S ⊆ E.
With these choices, the system satisfies GCL(1). There is a proof of this assertion given by Bertsimas and Nino-Mora (1996) which elaborates that given in outline for SFABPs in Section 5.3. We now consider a system satisfying GCL(1) and seek a scheduling policy π to maximize the linear objective R opt = max ri xiπ . π
i∈E
As before, we take X to be the set of achievable x π . In the forthcoming material, we use AG(A, r) to denote the adaptive greedy algorithm described in Section 5.3, the notation reflecting the fact that the inputs to the algorithm are the matrix A and the reward vector r. Theorem 5.5 If a system satisfies GCL(1) with performance x π , base function b and matrix A, then the achievable region X is the convex polytope P (A, b) given by + N S P (A, b) = x ∈ R : Ai xi ≥ b(S), S ⊂ E, and AE i xi = b(E) , i∈S
i∈E
128
MULTI-ARMED BANDIT ALLOCATION INDICES
whose extreme points are the performances of permutation policies. Further, R opt is achieved by a Gittins index policy πG determined by the indices (νi , i ∈ E) emerging from adaptive greedy algorithm AG(A, r). The proof used by Bertsimas and Ni˜no-Mora (1996) is essentially exactly the argument given in Section 5.3 which starts at Lemma 5.1 and which then uses complementary slackness to establish that the performance x πG solves the LP given by (5.7)–(5.9). In Section 5.3 it is only the choice of performance variable x and matrix A which are particular to the SFABP case, along with the calculations around (5.13) which are intended to clarify that the indices emerging from AG(A, r) are indeed the Gittins indices with which the reader is by now familiar from the earlier chapters of the book. We now highlight an important structural property which the matrix A may enjoy. To prepare the way for the required definition, note that for branching bandits the set E has a natural project-wise partition given by {E1 , E2 , . . . , En } ≡ , say. See (5.17). Definition 5.6 Matrix A is decomposable with respect to some partition {E1 , E2 , . . . , En } of E if S∩Ek
ASi = Ai
,
i ∈ S ∩ Ek , S ⊆ E.
It should be clear from the discussion in Section 5.3 for SFABPs that the matrix A introduced at (5.3) is indeed decomposable with respect to the partition of E induced by the state spaces of the individual projects. A major significance of decomposability is that when the adaptive greedy algorithm AG(A, r) is deployed to compute indices, it emerges that the indices for the job types in the partition set Ek may be computed with respect to attributes of project k only. This is the familiar idea that in a multi-armed bandit, an arm’s index depends only on that arm’s attributes. The above definition formalizes this key insight and computational reduction in the more general context of GCL systems. Please also note that decomposability of A with respect to some useful partition of E is by no means universal. There are many branching bandit examples where there is no helpful reduction of the computational task of obtaining Gittins indices from algorithm AG(A, r). We conclude this section by giving an alternative form of GCL which is more natural to problems involving multi-class queueing systems where the goal is the determination of a scheduling policy to minimize an expected cost, as is the case in the simple example discussed in Section 5.2. The following definition is essentially the same as that for GCL(1), but with the inequality in (5.15) reversed in (5.18). Definition 5.7 GCL(2) A system satisfies GCL(2) if there exist non-negative performance x π , set function f : 2E → R+ and matrix F = FiS , i ∈ E, S ⊆ E satisfying FiS > 0, i ∈ S, and FiS = 0, i ∈ / S, S ⊆ E,
THE ACHIEVABLE REGION METHODOLOGY
such that for all π we have
129
FiS xiπ ≤ f (S),
(5.18)
FiE xiπ = f (E),
(5.19)
i∈S
and
i∈E
where equality holds in (5.18) for all permutation policies σ for which (σ1 , σ2 , . . . , σ|S| ) = S. Note that now the objectives determined by matrix F are maximized by policies which give S c priority. This is natural, for example, in queueing examples where performances may be related to mean queue lengths. An important special case occurs when FiS is simply the indicator function for S, equal to 1 for i ∈ S and 0 otherwise. Here we shall say that the system satisfies strong conservation laws of type 2, or SCL(2) for short. We now give some examples. Example 5.8 Several multi-class queueing systems satisfying SCL(2) were described by Shanthikumar and Yao (1992). In the examples we present below, job type is identified with customer class and E = {1, 2, . . . , N }, where N is the number of classes. We shall assume that service requirements are independent of each other and of the arrival processes and that priorities are imposed preemptively. Simple work conservation principles along the lines of the simple M/M/1 example discussed in Section 5.2 yield the conclusion that for a multiclass G/G/1 system the performance xiπ = Eπ {Wi (t)},
i ∈ E,
satisfies SCL(2) for any t > 0, where Wi (t) stands for the work-in-system at time t due to class i customers. Specializing to the G/M/1 case we conclude that the performance choice xiπ = Eπ {Ni (t)/μi } , i ∈ E, also satisfies SCL(2) where Ni (t) is the number of class i customers present at time t and μ−1 is the class i mean service time. Steady state performance i measures are also available, as are those expressed in terms of response times rather than numbers in system. If we make the additional assumption that service times are identically distributed then it can be shown that a G/M/c system with the performance xiπ = Eπ {Ni (t)}, i ∈ E, also satisfies SCL(2). This is also true of a simple queueing network variant of this system. See Shanthikumar and Yao (1992) for more details.
130
MULTI-ARMED BANDIT ALLOCATION INDICES
Example 5.9 Consider the class of branching bandits introduced in Example 5.4 We introduce the matrix = {E(φij ), (i, j ) ∈ E 2 } and impose the condition that I − be nonsingular. This condition guarantees that the system remains stable under non-idling policies. We take the system performance under policy π to be T xiπ = Eπ Ii (t)tdt , i ∈ E, (5.20) 0
where T is the time at which the first busy period ends under π. Bertsimas and Ni˜no-Mora (1996) explain how this choice of performance enables us to study undiscounted branching bandits with objectives based either on job completion times or on time spent by jobs in the system. For example, they show that if we consider a single busy period of a branching bandit and write yiπ for the expectation of the total of all type i completions, then we have that yiπ differs from xiπ /E(vi ) by a constant which is policy independent. The nonsingularity of I − guarantees that E(T ) < ∞. If we now choose matrix F according to FiS =
E(TiS ) , E(vi )
i ∈ S, S ⊆ E,
where the random variables TiS are as in Example 5.4, then the system satisfies GCL(2). The proof of the following result requires only trivial modifications to that of Theorem 5.5 above. In it we consider a system satisfying GCL(2) where we seek a scheduling policy π to minimize the linear objective ci xiπ . C opt = min π
i∈E
Theorem 5.10 If a system satisfies GCL(2) with performance x π , base function f and matrix F , then the achievable region is the convex polytope P (F , f ) given by N P (F , f ) = x ∈ R+ : FiS xi ≤ f (S), S ⊂ E and FiE xi = f (E) i∈S
i∈E
whose extreme points are the performances of permutation policies. Further, C opt is achieved by a Gittins index policy πG determined by the indices (νi , i ∈ E) emerging from adaptive greedy algorithm AG(F , c). We conclude by remarking that, in light of the fact that permutation policies σ play a central role in the development of GCL systems, it is unsurprising that the theory may be readily extended to accommodate situations in which certain
THE ACHIEVABLE REGION METHODOLOGY
131
priorities are imposed. This is an extremely useful modelling device and has been used to incorporate problem features such as machine breakdowns and complex patterns of preemption/non-preemption. Suppose for definiteness that we have a system which, if unconstrained, is GCL(1) with performance x π , base function b, matrix A, reward vector r, and associated set of job types E. We now impose the requirement that E has a partition E=
M
=1
into M priority classes. We adopt the convention that job types in M have highest priority, followed by M−1 , and so on. Hence a policy for a single server system can only process a member of m at epochs for which there are no job types in ∪M =m+1 present in the system. The proof of the following result involves elementary extensions to that which yields Theorem 5.5 above. To state the theorem with the maximum of economy we need the notation m m =
. =1
Thus m is the set of job types within the m lowest priority classes. Theorem 5.11 If a system satisfies GCL(1) with performance x π , base function b, matrix A and precedence partition E = ∪M m=1 m , then the achievable region is the convex polytope P (A, b, ) given by N S∪m−1 P (A, b, ) = x ∈ R+ : Ai xi ≥ b(S ∪ m−1 ), S ⊂ m , and i∈S
m A i xi
= b(m ), 1 ≤ m ≤ M ,
i∈E
whose extreme points are the performances of policies which operate as permutations within each priority class m , 1 ≤ m ≤ M. In light
of thisπ result, the problem of determining a policy which maximizes returns i∈E ri xi for the system while respecting the imposed priorities is advanced by studying the LP given by max ri xi . x ∈P (A,b,)
i∈E
The dual of this LP is solved by an algorithm AG(A, r, ) which applies an adaptive greedy algorithm AG(Am , r m ) to determine a set of Gittins indices for each priority class in turn, starting with M . Hence optimal policies are obtained
132
MULTI-ARMED BANDIT ALLOCATION INDICES
by applying a Gittins index ordering to choose between competing job types within each priority class. Details of the algorithm AG(A, b, ) are as follows: • For priority class M apply adaptive greedy algorithm AG(AM , r M ), where S∪ AM = Ai M−1 , i ∈ S, S ⊆ M and r M = {ri , i ∈ M } . M and assoThis determines a set of Gittins indices ν1M ≤ ν2M ≤ · · · ν| M| M ciated subsets Si ⊆ M consisting of the i job types of lowest index within M . m m • For the adaptive greedy algorithm AG(A , r ), appropriate for priority class m, m < M, we need the indices νi , 1 ≤ i ≤ | |, > m and the subsets Si , 1 ≤ i ≤ | |, > m from the earlier stages. The matrix Am and the reward vector r m required for AG(Am , r m ) are given by S∪ and Am = Ai m−1 , i ∈ S, S ⊆ m
rim
= ri −
| | M
S ∪ −1
Ai j
νj +1 − νj , i ∈ m , 1 ≤ m ≤ M − 1.
=m+1 j =1
A variety of special structures yield simplification of these results. See Garbe and Glazebrook (1998a). Corresponding results hold for GCL(2) systems.
5.5
Performance bounds for policies for branching bandits
Quite apart from its intuitive power, the achievable region approach and its associated formulations create considerable opportunity for performing simple calculations of expected returns (or expected costs) under given policies. To illustrate this, we shall first explore the class of discounted branching bandits given as Example 5.4 above and demonstrate how to develop suboptimality bounds for policies for this important model class. In the process, we shall obtain a further proof of the optimality of Gittins index policies for this model class. Recall that this system with its performance ∞ xiπ = Eπ Ii (t)e−γ t dt , i ∈ E, 0
satisfies GCL(1). The form of the matrix A is given in the foregoing discussion and we continue to use b for its base function. We shall suppose that job types are numbered in descending order of the Gittins indices emerging from the adaptive greedy algorithm AG(A, r), namely such that ν1 ≤ ν2 ≤ · · · ≤ νN . Hence the policy associated with the identity permutation of E is a Gittins index policy πG . We write Si = {1, 2, · · · , i} for
THE ACHIEVABLE REGION METHODOLOGY
133
the collection of job types with the i lowest indices. It will be convenient to introduce the quantities ξi , 1 ≤ i ≤ N, related to the dual variables of the discussion in Section 5.3 and given by ξN = νN ;
ξi = νi − νi+1 , 1 ≤ i ≤ N − 1.
We note from the structure of AG(A, r) (equivalently, that the choice of dual variables described in Section 5.3 satisfies (5.10) with equality) that rj =
N
S
ξi Aj i ,
1 ≤ j ≤ N.
(5.21)
i=j
Suppose now that at time 0 the job of highest numerical type (and hence also of highest Gittins index) present in the system is of type G. Suppose also that a job of type j < G is present at 0. We write j πG for the policy which at time 0 processes the type j job and then, once this processing is complete, serves jobs according to the Gittins index ordering given by the identity permutation. We shall R π for the endeavour to obtain an expression for R πG − R j πG , where we write
π π expected system reward associated with policy π and given by R = N i=1 ri xi . In order to assist with this task, we pause to remark that the fact that the system satisfies GCL(1) means that the performance associated with any permutation policy σ satisfies a set of N linear equations with coefficients given by the matrix A. Moreover, these equations are of triangular form and hence, in principle, are easily solved. The trick in our analysis, therefore, is to express all policies of interest as permutations of E. To achieve this for j πG we use the following device: label the job of type j present in the system at time 0 and to be processed then by j 1, and all other occurrences of type j jobs in the ! system " by j 2. With this relabelling the collection of job types now becomes E \ {j } ∪ {j 1, j 2} and satisfies a modestly perturbed version of !the GCL(1) " for the original system. We introduce a collection of permutations of E \ {j } ∪ {j 1, j 2} and associated policies, by σ (g) = (1, . . . , j − 1, j 2, j + 1, . . . , g, j 1, g + 1, . . . , N ) ,
j < g ≤ G.
We note that σ (j ) is just πG , while σ (G) is j πG . The fact that the system satisfies GCL(1) yields the following equations for the performance x σ (g) associated with σ (g): k
S
σ (g)
Ai k xi
= b(Sk ),
1 ≤ k ≤ g,
i=1 g
S ∪{j 1} σ (g) xj 1
= b Sg ∪ {j 1} ,
S ∪{j 1} σ (g) xj 1
= b (Sk ∪ {j 1}) ,
S ∪{j 1} σ (g) xi
+ Aj g
S ∪{j 1} σ (g) xi
+ Aj k
Ai g
i=1 k i=1
Ai k
g + 1 ≤ k ≤ N,
134
MULTI-ARMED BANDIT ALLOCATION INDICES
where for ease of notation we have written j for j 2. We can now marginally simplify the above equations by exploiting the fact, easily deduced from the S ∪{j 1} S definition of the matrix A, that for all choices of i, k we have Ai k = Ai k . σ (g−1) We have a similar set of equations for the performance x and deduce the following for the difference δ g ≡ x σ (g−1) − x σ (g) , j + 1 ≤ g ≤ G : g
k
S
g
δk = 0,
1 ≤ k ≤ g − 1,
g
g ≤ k ≤ N.
S
Ai k δi + Aj k δj 1 = 0,
i=1 g
g
Note that these equations enable all the δk to be expressed as multiples of δj 1 . We can now combine these equations with the expression for rj in (5.21) to infer that R σ (g−1) − R σ (g) =
N
g
g
ri δi = δj 1
g−1
S
ξi Aj i ,
j + 1 ≤ g ≤ G,
i=j
i=1
and hence that G ! σ (g−1) " − R σ (g) R
R πG − R j πG = R σ (j ) − R σ (G) =
g=j +1
=
G
g−1
g
δj 1
g=j +1
S
ξi Aj i .
(5.22)
i=j
The quantity in (5.22) is exactly the expected reward lost if at time 0 a type j job is processed instead of a job of maximal Gittins index. Please note that it is g trivial to establish from direct arguments that δj 1 ≤ 0, j + 1 ≤ g ≤ G. We also have, from the definitions of the quantities concerned, that ξi ≤ 0, 1 ≤ i ≤ N − 1, S and that Aj i ≥ 0, 1 ≤ i ≤ N − 1. We infer immediately from our expression for the reward difference that R πG − R j πG ≥ 0, from which the optimality of πG follows via standard dynamic programming arguments. We have thus obtained a simple direct proof of the optimality of Gittins index policies for discounted branching bandits. But more than this, the above expression for R πG − R j πG enables us to develop simple, interpretable performance bounds for general policies for this system. For example, we can rewrite and bound (5.22) as follows: R πG − R j πG =
G
g
δj 1
g=j +1
=
G−1
G
g=j i=g+1
g−1
S
ξi Aj i
i=j S
δji 1 ξg Aj g
THE ACHIEVABLE REGION METHODOLOGY
=
G−1 #
σ (g)
xj 1
$ S − xjσ1(G) ξg Aj g
g=j
≤
−xjσ1(G)
135
G−1
G−1 S ξg Aj g
≤−
g=j
g=j
γ
ξg
=
(νG − νj ) . γ
The final inequality in the above chain combines an exact calculation for xjσ1(G) , which is trivial, with a simple upper bound for the quantities S
Aj g , j ≤ g ≤ G − 1, derived from their characterization above. We have thus secured a highly intuitive bound for R πG − R j πG which is a multiple of the difference between the maximal index available in the system at time 0 (νG ) and the index of the job actually processed then (νj ). An alternative approach to the bounding of the quantity (5.22) yields ! " R πG − R j πG ≤ (νG − rj ) 1 − E e−γ vj /γ ,
(5.23)
where vj is the processing time of a type j job. We draw these calculations together in the following result. Theorem 5.12 In a branching bandit with discounted rewards, if a job of type j is present at time 0 and a job of type G has maximal Gittins index, then we have ! −γ v " (νG − νj ) πG j πG j R −R ≤ min /γ . ; (νG − rj ) 1 − E e γ The bound given in the statement of the theorem may now be used as the basis for the construction of bounds on R πG − R π , the reward suboptimality of general past-measurable scheduling policy π. The development which yields such bounds is standard and makes use of conditioning arguments, aggregation and convergence results. Using such methods, we may obtain an extension to Theorem 4.18 to the case of branching bandits from the first bound in Theorem 5.12. Using the second bound, we can infer a similar extension to Corollary 4.16. We conclude this section by giving the outcome of a similar analysis applied to the branching bandit model with undiscounted costs given as Example 5.9 above. Recall that this system with its performance (5.20) satisfies GCL(2). The form of the matrix F is given in the foregoing discussion. We write
C π for the π π expected system cost associated with policy π and given by C = N i=1 ci xi . We again consider a scenario in which a job of type j is processed at time 0, notwithstanding the fact that the job of maximal index then is of type G > j .
136
MULTI-ARMED BANDIT ALLOCATION INDICES
From an argument which closely mirrors that given above for the discounted reward case, we conclude that C j πG − C πG = −
G g=j +1
g
δj 1
g−1
S
ξi Fj i .
i=1
g
This is easily seen to be positive since δj 1 ≥ 0, j + 1 ≤ g ≤ G, ξi ≤ 0, 1 ≤ i ≤ S N − 1, and Fj i ≥ 0, 1 ≤ i ≤ N − 1. The optimality of πG follows. Bounding this quantity yields the following result. Theorem 5.13 In a branching bandit with undiscounted costs, if a job of type j is present at time 0 and a job of type G has maximal Gittins index, then we have $ S $ # # j j C j πG − C πG ≤ min xj 1 − xjG1 Fj j (νG − νj ) ; xj 1 − xjG1 (νG − cj ) .
5.6
Job selection and scheduling problems
Consider a GCL system whose set of job types, E, has a partition expressed by E = ∪k∈ Ek ,where = {1, 2, . . . , n}. In many applications, Ek will be the state space of projects of type k. We shall refer to any subset ⊆ as a selection, to be made in advance of (optimal) scheduling of the job types in ∪k∈ Ek . Consider the following examples. Example 5.14 A SCL(1) system consists of a collection of n jobs to be processed by a single machine with a discounted reward criterion, as in Example 5.3 above. However, there is some cost ck associated with the submission of job k for service. A natural selection/scheduling problem considers the choice of a job collection
⊆ = {1, 2, . . . , n} to maximize − ck + R opt ( ), k∈
where R opt ( ) is the return achieved when the jobs in are optimally scheduled. Example 5.15 A collection of research projects, , is modelled as a SFABP and each is a candidate for inclusion in a company’s research portfolio. The chosen portfolio will consist of just m ≤ | | = n projects. We wish to choose a subset
⊆ of size m to maximize R opt ( ) over all such choices. We have already met this problem in exercise 2.4. Example 5.16 A multi-class queueing system is SCL(2). The associated set of customer class can be served at one of two service stations which need
THE ACHIEVABLE REGION METHODOLOGY
137
not be identical. We would like a partitioning of into (to be served at station 1) and \ (to be served at station 2) to minimize the overall costs when both stations are scheduled optimally. To summarize, we choose to opt opt minimize C1 ( ) + C2 ( \ ). Even simple cases of problems of this kind have been found intractable. To make progress we need to characterize optimal returns from GCL systems as a function of a selection . For definiteness, we now restrict to the case of a GCL(1) system with performance x π , base function b, matrix A and reward vector r. Other details are as in the opening paragraph to this section. We require the following two conditions: (i) Matrix A is decomposable with respect to the partition {E1 , E2 , . . . , En }. See Definition 5.6 above. (ii) The system is reducible in that the performance xiπ , i ∈ ∪k∈ Ek for the system resulting from the application of the policy π to the selection
coincides with the corresponding components of the performance for the full system when S is given priority over S c and choices within S are made according to π. It is a straightforward consequence of the above conditions, the GCL structure and Theorem 5.5 that the achievable region for the selection is given by |S | S P (A, b) = x ∈ R+ : Ai xi ≥ b(S ∪ S c ) − b S c , S ⊂ S , i∈S
and
i∈S
ASi xi = b(E) − b S c ,
from which we conclude that R opt ( ) =
max
x ∈P (A,b)
ri xi .
(5.24)
i∈S
Decomposability implies that the Gittins indices which emerge from the dual of (5.24), namely {νi , i ∈ S }, are unchanged from those which emerge from the analysis of the full system. We now number the elements of S in ascending order of their index values. It may be readily inferred from the analysis of Section 5.3 applied in the current context that the value of the dual of (5.24) may be written R
opt
( ) =
|S
|−1
! " ! c " b S ∪ Si − b S c (νi − νi+1 ) + b(E) − b S c ν|S | ,
i=0
=
|S | i=0
b(S c ∪ Si )(νi − νi+1 ),
138
MULTI-ARMED BANDIT ALLOCATION INDICES
where S0 = ∅, Si = {1, 2, . . . , i}, 1 ≤ i ≤ |S |, and we use the conventions ν0 = ν|S | = 0. Having expressed R opt in terms of the base function b, it will now be properties of the latter which will yield corresponding properties of the former. The key property is described in the following definition. Definition 5.17 Function b : 2 → R+ is supermodular with respect to {E1 , E2 , . . . , En } if, for all ⊂ , a proper subset, and subsets S, T , ς and τ of E satisfying T ⊆S⊆ Ek and τ ⊆ ς ⊆ Ek , k∈
k ∈ /
we have b (T ∪ ς) − b (T ∪ τ ) ≤ b (S ∪ ς) − b (S ∪ τ ) . That base function b has the above property follows simply from the nature of GCL(1). Everything is now in place for the following result. Theorem 5.18 If a system satisfying GCL(1) is decomposable with respect to {E1 , E2 , . . . , En } and reducible, then R opt : 2 → R+ is submodular and, further, is nondecreasing when b : 2 → R+ is nondecreasing. Details of the proof of the above theorem may be found in Garbe and Glazebrook (1998b). A very similar analysis for decomposable and reducible GCL(2) systems yields the conclusion that C opt (·) is supermodular and is also nondecreasing when the base function f is. It is now very easy to check that the necessary conditions of decomposability and reducibility are satisfied for the SCL(1) job-scheduling system described in Example 5.3 and for SFABPs, where we use the partition induced by the state spaces of individual projects. They are also satisfied for the SCL(2) multi-class queueing systems described as Example 5.8 above. It then follows that the selection element of Examples 5.14–5.16 could well involve, respectively, the maximization of a submodular function, the maximization of a submodular function over subsets of given size, and the minimization of a supermodular function. Such combinatorial optimization problems are known to be difficult in general, with NP-hard special cases. However, the work of Nemhauser and Wolsey (1981) suggests that greedy algorithms for constructing solutions should work well. This conclusion has been borne out in a wide variety of numerical examples. For example, in 80 000 randomly generated problems, all of which concern the choice of a portfolio of a fixed number of projects to maximize the expected return, a greedy approach was optimal on 79 743 occasions and was never more than 1% suboptimal. We conclude this section by noting that Dacre and Glazebrook (2002) have achieved a range of extensions of the above theorem which extend considerably its application to multi-class queueing networks.
THE ACHIEVABLE REGION METHODOLOGY
5.7
139
Multi-armed bandits on parallel machines
Since Gittins index policies are optimal when the GCLs of Section 5.4 are satisfied, it is reasonable to hope that such policies might be near optimal when GCLs come close to being satisfied, but are not exactly. This is indeed the case. Further, in many such problems we are able to bound the degree of reward/cost suboptimality of a variety of index heuristics. This approach particularly bears fruit when we try to extend the scope of index theory from the single server contexts which predominate among GCL systems to counterparts involving M servers working in parallel. Here we observe that such a server configuration (for which GCL usually fails) is close (in some sense) to a single server working at M times the speed (where GCL holds). Thus such analyses can shed light on the inefficiencies in service which result from parallelism. We conclude the current chapter by attempting to analyse the SFABP of Section 5.3 for the parallel server case. The notation we shall use for this discussion will be as in Section 5.3. We shall also assume that each arm b evolves when activated according to an irreducible time-homogeneous Markov Chain with one step transition matrix P b . Since each state space Eb is finite, irreducibility implies positive recurrence, which in turn implies that all moments of the random variables TiS are finite. To avoid triviality we assume that the number of parallel servers M exceeds the number of arms n. For general policy π and initial system state k ∈ E1 × · · · × En we develop the notation Aπ (S, k) = ASi xiπ (k), S ⊆ E, i∈E
where both the matrix A and the performance x π (k) are as in Section 5.3, but in the latter the conditioning upon initial state k is now made explicit. With job types numbered as usual in ascending order of their indices, we recall from (5.21) that since AE i = 1 for all i then we must have that rj = νN −
N−1
S
(νi+1 − νi )Aj i ,
j ∈ E.
i=j
It then follows straightforwardly that, for any policy π, R (k) = π
ri xiπ
= νN
i∈E
M 1−a
N−1
(νi+1 − νi )Aπ (Si , k),
(5.25)
(νi+1 − νi ) min Aπ (Si , k),
(5.26)
−
i=1
which in turn yields R opt (k) ≤ νN
M 1−a
−
N−1 i=1
π
140
MULTI-ARMED BANDIT ALLOCATION INDICES
and hence an upper bound for the optimal return R opt (k). It will be convenient to write A(S, k) = min Aπ (S, k). π
We point out in Section 5.3 that when M = 1 the above minimum is achieved by all policies which give job types in S c priority. One way of recovering the optimality of the Gittins index policy πG in this case is then to observe that this policy achieves all of the minima in the expression for the bound in (5.26). The above expressions immediately yield R
opt
(k) − R (k) ≤ π
N−1
(νi+1 − νi )[Aπ (Si , k) − A(Si , k)]
(5.27)
i=1
as an upper bound on the degree of reward suboptimality for policy π. To take this further we need some further notation. We shall use ib = min(i : i ∈ Eb ) for the job type of lowest index for arm b. We then number the bandits such that in ≥ in−1 ≥ · · · ≥ i1 and define the quantities ψi , i ∈ E, by ψi = 0, i ≥ in ,
and ψi = n − m + 1, im − 1 ≥ i ≥ im−1 .
In words, ψi is the number of bandits which have no intersection with Si , the set {1, 2, . . . , i} ⊆ E containing the i job types of lowest index. We use Bi for the corresponding set of bandits. It must follow from these definitions that ψi ≥ M ⇒ A(Si , k) = 0 since when ψi ≥ M there exist policies which are able to avoid use of the set Si completely, or equivalently, which focus all processing on the bandits in Bi . We now suppose that M − 1 ≥ ψi and seek to develop an effective lower S bound for A(Si , k) to feed into the bound (5.27). For bandit b ∈ / Bi we use Tkbi for a random variable whose distribution is that of the amount of processing required to move bandit b from its initial state kb until it enters the subset Si for the first time. We shall also use S S Tkbi Tk i = b∈B / i
for this quantity aggregated over all qualifying bandits. Lemma 5.19 (i) If M − 1 ≥ ψi then
S M − ψi Tk i /(M−ψi ) A(Si , k) ≥ E a 1−a
(ii) If ψi ≥ M then A(Si , k) = 0.
THE ACHIEVABLE REGION METHODOLOGY
141
Proof. Part (ii) of the result is trivial and follows from the text preceding the statement of the lemma. Hence we shall suppose that M − 1 ≥ ψi and prove part (i). This proceeds in two steps. We first observe that we can exploit the nature of the matrix A to conclude that A(Si , k) = min Aπ (Si , k) = min Eπ π
≥ min Eπ π
π
n M
j ∈Si
Ibm (t)a t k .
S T i −1 a tj 1 + a + · · · + a j k
(5.28)
b=1 m=1
In the above calculation we use tj for the time of the th occasion upon which S policy π chooses a job of type j for continuation, and Tj i for the duration of the c subsequent excursion of the corresponding arm into Si . The indicator Ibm (t) is 1 exactly when machine m processes arm b at time t and where b has paid its first visit to Si at or before time t. To see why the above inequality holds, observe first that, should j ∈ Eb ∩ Si , then the processing which arm b will receive from S tj until b next enters Si will be delivered during the interval [tj , tj + Tj i ) at the earliest. Then observe that, once arm b has paid its first visit to Si then it sojourns in Si and Sic (which may be null) will alternate. Our next step in the proof is to create a relatively simple single machine relaxation of the optimization problem in (5.28) above as follows: in the Mmachine problem, use the pair (m, t) to denote the processing slot available on machine m at time t. We consider the single machine map, φ : (m, t) → φ(m, t) = Mt + (m − 1) to a corresponding slot in a single machine problem. Recall that M − 1 ≥ ψi and that the arms which have no intersection with Si (and hence which can contribute nothing to the objective in (5.28) above) are numbered n, n − 1, . . . , n − ψi + 1. We now develop our single machine relaxation for (5.28) by describing what constitutes an admissible policy π for it as follows: 1. a policy π can only schedule bandits n, n − 1, . . . , n − ψi + 1 respectively at times φ(1, t), φ(2, t), . . . , φ(ψi , t), t ≥ 0; 2. subject only to 1. above, π schedules a single arm at each of the times {φ(m, t); 1 ≤ m ≤ M, t ≥ 0} ≡ N; 3. the allocation made at time φ(m, t) in the single machine relaxation attracts discounting a t as would be appropriate in the originating Mmachine context. To summarize, the single machine relaxation has a free choice of arm at each epoch in N save only that some bandits (those in Bi ) may only be scheduled at certain epochs. Since the value of the single machine relaxation is plainly a
142
MULTI-ARMED BANDIT ALLOCATION INDICES
lower bound for the value of the optimization problem (5.28) we infer that n M ∞ t (5.29) A(Si , k) ≥ min Eπ Ib [φ (m, t)] a k , π
b=1 m=1 t=0
where, in (5.29), the indicator Ib (s) is 1 exactly when arm b is processed at time s in the single machine relaxation, having paid its first visit to Si at or before time s. It is straightforward to show that there exists a minimizing π for (5.29) which chooses the arms in Bi whenever possible. This is scarcely surprising since these arms contribute nothing to a positive-valued objective which is to be minimized. Hence for 1 ≤ m ≤ ψi , arm n − m + 1 is optimally scheduled at all times φ(m, t), t ≥ 0. A simple pairwise interchange argument yields the additional conclusion that the remaining arms in {1, 2, . . . , n − ψi } are optimally scheduled at the remaining epochs φ(m, t), ψi + 1 ≤ m ≤ M, t ≥ 0 by any policy which gives priority to job types in Sic over those in Si . This characterization of the optimal policy for our single machine relaxation enables us to compute the minimum in (5.29). We write ⎢ ⎥ ⎢ Sc ⎥ ⎢ Tk i ⎥ Sic ⎦ (M − ψi ) + T , Tk = ⎣ M − ψi where 0 ≤ T ≤ M − ψi − 1, and x is the integer part of x. A simple calculation now yields Si M − ψi − T (1 − a) A(Si , k) ≥ E a Tk /(M−ψi ) 1−a S M − ψi Tk i /(M−ψi ) E a . ≥ 1−a
This completes the proof of the lemma.
The following expression for an upper bound for R opt (k) now follows from (5.26) and the preceding lemma. Corollary 5.20
M 1−a N−1 1
R opt (k) ≤ νN −
1−a
(νi+1 − νi )(M − ψi )E a
S
Tk i /(M−ψi )
.
i=in−M+1
Proof. We apply the above results and observe that, by the definition of the quantities concerned, M − 1 ≥ ψi ⇐⇒ i ≥ in−M+1 .
THE ACHIEVABLE REGION METHODOLOGY
143
Imagine that we have some policy π of interest for the parallel machine problem. If we can combine the above upper bound on R opt (k) with either an exact expression for R π (k) or a lower bound for it then we may be able to derive informative bounds on the reward suboptimality of π, namely, R opt (k) − R π (k). ∗ We begin with an index heuristic πG∗ for which an exact expression for R πG (k) is easily obtained. Recall that ib is the job type of lowest index associated with arm b and that the n arms are numbered such that in ≥ in−1 ≥ · · · ≥ i1 . We construct πG∗ as follows: • arm n + 1 − m with initial state (job type) kn+1−m and lowest index state in+1−m is given exclusive rights to machine m, 1 ≤ m ≤ M − 1; • the remaining arms {1, 2, . . . , n + 1 − M} are scheduled on machine M according to a Gittins index policy denoted πGM . It is plainly the case that, given the above once for all allocation of arms to machines, the optimality of Gittins index policies for the single machine problem guarantees that the subsequent scheduling is optimal. What is as yet not at all clear is that this prior allocation is an effective approach to solving the parallel machine problem. We shall see that it works extremely well when the discount rate a 1. π∗ We proceed by giving an exact expression for RmG (k), the contribution to the ∗ expected return R πG (k) from the processing on machine m, 1 ≤ m ≤ M − 1. From the expression in (5.25) we infer that, for machine m, 1 ≤ m ≤ M − 1, π∗ RmG (k)
= νN
1 1−a
−
N−1
(νi+1 − νi )An+1−m (Si , k),
i=1
where the superscript n + 1 − m indicates that we have a single machine processing arm n + 1 − m only. By elementary calculations for the single machine case, as in Section 5.3, together with the fact that, under πG∗ , machine m never processes any job type i for which i < in+1−m , we deduce that π∗ RmG (k)
= νN
1 1−a
N−1
−
(νi+1 − νi )E a
S
Tkmi
, 1 ≤ m ≤ M − 1.
i=in+1−m
(5.30) Now, considering machine M, which under πGM will never process a job type i for which i < in+1−M , we have by a similar calculation that
π∗
RMG (k) = νN = νN
1 1−a 1 1−a
−
N−1
M
(νi+1 − νi )AπG (Si , k)
i=1
−
N−1 i=in+1−M
(νi+1 − νi )E a
S
Tk i
M
,
(5.31)
144
MULTI-ARMED BANDIT ALLOCATION INDICES
where, by a minor abuse of notation, we use kM in (5.31) for the vector-valued initial state of the arms scheduled on machine M. Aggregating the expected returns from the individual machines and simplifying we infer that ∗
R πG (k) =
M
π∗
RmG (k)
m=1
= νN
M 1−a
−
1 1−a
N−1
M
S i (νi+1 − νi )E a Tkm .
i=in+1−M m=1
(5.32) If we now combine the expression in (5.32) with the upper bound in the preceding corollary, we deduce that R
opt
(k) − R
∗ πG
(k) ≤
1 1−a ×
N−1
(νi+1 − νi )
i=in+1−M
M
E a
S
Tkmi
− (M − ψi )E a
S
Tk i /(M−ψi )
.
m=ψi +1
The following result is now an immediate consequence. In it we use the notation O(1 − a) to denote a quantity which, when divided by 1 − a, tends to a constant in the limit a → 1.

Proposition 5.21
$$R^{opt}(k) - R^{\pi_G^*}(k) \le O(1-a).$$

We now proceed to consider π_G, namely the Gittins index priority policy applied in the parallel machine environment. At all decision epochs, π_G schedules for processing the M arms most preferred by the index ordering. The complexity of π_G's operation means that it is not possible to secure a helpful exact expression for R^{π_G}(k) and so we content ourselves with developing an effective lower bound. Using the preceding machinery, this will be achieved by finding upper bounds for the quantities A^{π_G}(S_i, k), 1 ≤ i ≤ N. The key details are in the next result.

Lemma 5.22
(i) If M − 1 ≥ ψ_i, then
$$A^{\pi_G}(S_i,k) \le \frac{M-\psi_i}{1-a} + O(1).$$
(ii) If ψ_i ≥ M then A^{π_G}(S_i, k) = 0.
Proof. Part (ii) of the result is trivial and follows from the definitions of the quantities involved. In order to simplify some of the upcoming calculations, we shall now impose some requirements on how policy πG should operate. None of these will have any impact on expected returns. First we shall suppose, as we did in the proof of Lemma 5.19, that should πG schedule arm n − m + 1 for processing at any time, then this is performed on machine m, 1 ≤ m ≤ ψi . We shall use T (Si ) for the first time at which πG chooses a job type in Si for processing. At T (Si ), all of the n − M arms not scheduled for processing must have current state in Si . Trivially, at all times t ≥ T (Si ) all job types in Sic present in the system will be scheduled for processing. This means, inter alia, that from T (Si ), the arms n − m + 1, 1 ≤ m ≤ ψi , will always be scheduled for processing. Hence if we write Tm for the number of decision epochs for which machine m does not process arm n − m + 1, then we must have, for all realizations of the system, that Tm ≤ T (Si ),
1 ≤ m ≤ ψi .
(5.33)
We shall now make the further assumption that, for t ≥ T(S_i), π_G operates on a minimal switching basis, namely, that an arm is only switched from processing on any machine at epochs at which π_G takes it out of processing altogether. We now write T_m(S_i), ψ_i + 1 ≤ m ≤ M, for the first time at which machine m processes an S_i-job type. Hence, we have
$$T(S_i) = \min_{\psi_i+1\le m\le M} T_m(S_i),$$
and write
$$\Delta_m(S_i) = T(S_i) - T_m(S_i), \qquad \psi_i+1\le m\le M. \qquad (5.34)$$
Conditionally upon the state of the system at T(S_i), each non-zero Δ_m(S_i) will be T_j^{S_i} for some j ∈ S_i^c. Under the minimal switching assumption operating on machines ψ_i + 1 ≤ m ≤ M, each interval [T_m(S_i), ∞) will be a disjoint union of intervals of the form [t_j, t_j + T_j^{S_i}), where j ∈ S_i and t_j is as in the proof of Lemma 5.19. It now easily follows from the above that
$$A^{\pi_G}(S_i,k) = E_{\pi_G}\!\left[\left.\sum_{m=\psi_i+1}^{M}\frac{a^{T_m(S_i)}}{1-a}\,\right|\,k\right], \qquad (5.35)$$
and from the definitions of the quantities concerned that
$$\sum_{m=1}^{\psi_i} T_m + \sum_{m=\psi_i+1}^{M} T_m(S_i) = T_k^{S_i}. \qquad (5.36)$$
We infer from (5.33), (5.34) and (5.36) that
$$T_m(S_i) \;\ge\; \frac{T_k^{S_i} - \sum_{m'=\psi_i+1}^{M}\Delta_{m'}(S_i)}{M} + \Delta_m(S_i), \qquad \psi_i+1\le m\le M,$$
and hence from (5.35) that
$$A^{\pi_G}(S_i,k) \;\le\; \frac{M-\psi_i}{1-a}\,E_{\pi_G}\!\left[\left. a^{\bigl(T_k^{S_i} - \sum_{m=\psi_i+1}^{M}\Delta_m(S_i)\bigr)/M}\,\right|\,k\right] = \frac{M-\psi_i}{1-a} + O(1),$$
as required for part (i). This concludes the proof.
The following result is now an immediate consequence of (5.27) and Lemmas 5.19 and 5.22.

Proposition 5.23
$$R^{opt}(k) - R^{\pi_G}(k) \le O(1).$$

Interestingly, from the preceding two propositions we secure a stronger result for the index heuristic π_G^*, which makes an index-based prior allocation of arms to machines, than for the natural parallel machine implementation of the Gittins index priority ordering secured by π_G. It may be conjectured that this is an artefact of the nature of the above calculations, where we were able to utilize an exact expression for R^{π_G^*}(k) but had to resort to a lower bound for R^{π_G}(k). There are examples in the doctoral thesis of Dunn (2005) which make it clear that Proposition 5.23 cannot be strengthened in general. It is true, though, that it is possible to impose additional conditions on the problem setup which then yields an O(1 − a) suboptimality bound for π_G. See, for example, Theorem 3 of Glazebrook and Wilkinson (2000). It is natural now to develop the above analyses by considering the limit a → 1. All Gittins indices will converge to finite limits, yielding limit policies for π_G^* and π_G which we shall refer to as π_G^*(1) and π_G(1) respectively. The following idea may be unfamiliar to some readers. We use the notation r(π, t) for the (undiscounted) reward earned by policy π at time t.

Definition 5.24 Policy $\tilde{\pi}$ is average-overtaking optimal if
$$\liminf_{T\to\infty}\,\frac{1}{T+1}\left\{\sum_{t=0}^{T}\sum_{s=0}^{t}E[r(\tilde{\pi},s)\,|\,k] \;-\; \sum_{t=0}^{T}\sum_{s=0}^{t}E[r(\pi,s)\,|\,k]\right\} \;\ge\; 0$$
for all policies π and initial states k.
In our problem context, average-overtaking optimality is stronger than average-reward optimality. The following theorem is a simple consequence of the preceding propositions, and utilizes the classical results of Blackwell (1962), Veinott (1966) and Denardo and Miller (1968). Theorem 5.25 Policy πG∗ (1) is average-overtaking optimal, and the policy πG (1) is average-reward optimal for our SFABP on M parallel machines. Dunn and Glazebrook (2001, 2004) have described extensions of these results to cases where the M machines operate at different speeds and also where the set of machines which are available for processing is a stochastic process.
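To make the construction of π_G^* concrete, the following sketch (in Python, with illustrative function and variable names of our own; it assumes the Gittins indices of each arm's states are supplied externally) carries out the prior allocation of arms to machines described above, and the index scheduling of the pooled arms on machine M.

def allocate_pi_g_star(arms, M, lowest_index):
    """Prior allocation of arms to machines under the heuristic pi_G^*.

    arms         : list of arm identifiers (n of them)
    M            : number of parallel machines
    lowest_index : dict arm -> Gittins index of that arm's lowest-index job type

    Returns (dedicated, pooled): dedicated[m] is the arm given exclusive rights
    to machine m (m = 1, ..., M-1); pooled is the list of remaining arms that
    share machine M under a Gittins index priority policy.
    """
    # Number the arms so that their lowest-index job types are decreasing,
    # as in the text: i_n >= i_{n-1} >= ... >= i_1.
    ranked = sorted(arms, key=lambda b: lowest_index[b], reverse=True)
    dedicated = {m: ranked[m - 1] for m in range(1, M)}
    pooled = ranked[M - 1:]
    return dedicated, pooled

def schedule_machine_M(pooled, states, gittins_index):
    # Machine M always processes the pooled arm of largest current Gittins index.
    return max(pooled, key=lambda b: gittins_index(b, states[b]))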
Exercises

5.1. Review the argument of Section 5.2 which establishes that, for the two-class single server queue considered, the mean work-in-system in steady state is policy invariant. Why does the argument break down when there are two identical servers working in parallel? Show that the argument still works in this case when μ1 = μ2.
5.2. In Section 5.3, we consider the minimization problem $\min_\pi \sum_{i\in S} A^S_i x^\pi_i$ for a SFABP with the cost coefficients $A^S_i$ given by (5.3). Because of the relation (5.6), this minimization problem is equivalent to the maximization
$$\max_\pi\left[\sum_{i\in S^c} A^E_i x^\pi_i + \sum_{i\in S}\bigl(A^E_i - A^S_i\bigr)x^\pi_i\right].$$
Use a suitable version of the adaptive greedy algorithm to obtain a set of Gittins indices to solve this maximization problem. 5.3. Formulate a finite state version of the model of Section 4.7 (of alternative jobs with no arrivals and precedence constraints in the form of an out-forest) as a branching bandit of the kind described in Example 5.4. 5.4. Review the notion of decomposability of the matrix A as given in Definition 5.6. Verify that the adaptive greedy algorithm AG(A, r) applied to the job types in E in a decomposable case yields the same Gittins indices for the partition set Ek as does the algorithm applied to the job types in Ek only. 5.5. Consider the following two models: • A SFABP is processed by a single agent or server which is prone to breakdown. Once broken, the server is repaired and processing of the bandits cannot resume until a repair is completed. Successive repair times form a sequence of independent identically distributed random variables taking values in the positive integers.
• A family of alternative jobs has constituent jobs of two distinct types. Type 1 jobs are served preemptively: their service can be interrupted at any time. Type 2 jobs must be served non-preemptively: their service, once started, must be continued through to completion without interruption. How can such models be formulated using the priority class structure described at the end of Section 5.4?
6

Restless bandits and Lagrangian relaxation

6.1 Introduction
In the various proofs of the index theorem for simple families of alternative bandit processes (SFABPs) which have been presented in the book to date, the fact that bandits not in receipt of processing are frozen in their current state plays a fundamental role. We have, though, seen in the last two chapters that it is possible to demonstrate the optimality of Gittins index policies in problem areas including multi-class queues, where systems can evolve in interesting and important ways through agencies (like an arrivals process) beyond the processing action. Nonetheless, many problem domains naturally lead to the question of whether we can incorporate restlessness (the evolution of unprocessed system components) as a central feature. It was Whittle (1988) who suggested extending the notion of a SFABP by allowing bandits not in receipt of processing to evolve, but according to a different law from that which applies when they are active. In the context of a setup in which m of n such restless bandits could be active at any time, he developed an index heuristic by means of a Lagrangian relaxation of the problem. Lagrangian relaxation is a well-known technique by which to address optimization problems with constraints. It works by moving hard constraints into the objective so as to exact a penalty on the objective if they are not satisfied (see Bertsimas and Tsitsiklis (1997) and Bertsekas (1999) for further discussion of this technique and examples of its use). The index which emerges from this analysis, widely known as the Whittle index, reduces to that of Gittins in the SFABP case. The utility of the model in many practical contexts, and the elegance and power of Whittle's proposal, has meant that it has enjoyed great popularity.
In Section 6.2 we present the basic model along with some motivating applications. The theory, including the existence of Whittle’s index, depends upon a structural property of the constituent bandits, called indexability. This is discussed in Section 6.3 where the index is introduced. The price of Whittle’s generalization is that his index heuristic is not optimal in general. However, it has been shown to be optimal under given conditions in an asymptotic regime in which the number of bandits in total (n) and the number activated (m) grow in proportion. These ideas are explained in outline in Section 6.4. We proceed to explain in Section 6.5 that indexability is often easy to establish in cases where the optimal policy for Whittle’s Lagrangian relaxation has a monotone structure for each bandit. There are many such examples including the spinning plates model discussed there. The remaining sections of the chapter all focus on the application of the Lagrangian relaxation/Whittle index methodology to problems in queueing control. In Section 6.6 we discuss how to overcome a difficulty in applying the ideas to problems of service control, which Whittle himself thought were not amenable to treatment by these methods. In Section 6.7 we use a problem involving the admission and routing of impatient customers for service as a context within which to discuss ideas about how to assess the closeness to optimality of the Whittle index heuristic. Here we consider both the tightness of the Lagrangian relaxation and approaches which utilize achievable region methodology. Finally, in Section 6.8 we point out that Whittle’s Lagrangian relaxation approach has the potential to underpin the design of effective index heuristics in a range of much more general resource allocation problems.
6.2 Restless bandits
Again, we start with a family of n alternative semi-Markov decision processes. Given their states at time t, say x1 (t), . . . , xn (t), we are to choose actions u1 (t), . . . , un (t) to apply at time t. As in the multi-armed bandit problem, we suppose that there are just two available actions, so ui (t) ∈ {0, 1}. The first generalization is that we require that at each time t exactly m (< n) of the bandits be given the action ui = 1. The second generalization is that the application of action u = 0 no longer freezes a bandit; instead, its state evolves, but in a distinct way from its continuation under the action u = 1. Once the action ui = 1 is applied to bandit i, that action must continue to be applied until the next decision time for that bandit. The same is true for action ui = 0 (so this differs from the usual bandit model, in which while ui = 0 every time is a decision time). Let us briefly give three examples of decision processes that are nicely modelled as restless bandits. The first example is along the lines envisaged by Whittle. Following his lead, we call u = 1 the active action and u = 0 the passive action. Example 6.1 The state of a bandit is a measure of its vigour. The active and passive actions correspond to notions of work and rest. Performance is positively
related to vigour (or lack of fatigue), which increases with rest and decreases with work. For example, suppose that x takes values in {1, . . . , k}. If the active action u = 1 is applied to a bandit in state x, then there accrues an immediate reward of r(x), increasing in x, but vigour decreases to max{x − 1, 1}. The passive action u = 0 produces no reward, but vigour increases to min{x + 1, k}. We pursue this story further towards the end of Section 6.5. This next model is of a type investigated by Le Ny, Dahleh and Feron (2008) and by Liu and Zhao (2010), and developed in the context of the mechanism design problems of Section 9.3. Example 6.2 The active and passive actions correspond to notions of ‘observation’ and ‘no observation’. For example, suppose that each bandit is in one of two conditions: 0 and 1, associated with being ‘bad’ or ‘good’, respectively. It moves back and forth between these two conditions independently of any actions applied by the decision maker, according to a two-state Markov chain. So far as the decision maker is concerned, the state of the bandit is the probability that it is in good condition. Under the action u = 1 the condition is observed, and if this is found to be i then x(t + 1) = pi1 . Moreover, if the condition is good then a reward is obtained. Under the action u = 0 the underlying condition of the process is not observed, and so, in a Bayesian manner, x(t + 1) = x(t)p11 + [1 − x(t)]p01 . No reward is obtained. Le Ny, Dahleh and Feron (2008) use the framework of restless bandits to model the routing of unmanned military aircraft to observe the states of targets. The aim is to collect rewards which are available when they visit the targets in a particular state. Similarly, Liu and Zhao (2010), and Ahmad, Liu, Javidi, Zhao and Krishnamachari (2009) address problems of opportunistic communication channel usage. Their model is that a number of channels independently alternate between conditions of being ‘busy’ or ‘free’, where ‘busy’ may correspond to a channel carrying some primary traffic. A secondary user, who may sense m channels in each time slot, obtains a reward whenever she senses a channel that is free. In the first of these papers, Liu and Zhao have shown that under the assumption that the condition of each arm evolves according to a homogeneous Markov chain, the restless bandit is indexable. They have found Whittle’s index in closed form and shown it corresponds to the myopic policy which senses the m channels most likely to be free. Ahmad et al . (2009) show that if the transition probability ‘free to free’ exceeds that of ‘busy to free’, that is, p11 ≥ p01 , then the Whittle index policy is optimal. These are problems of so-called sensor management (e.g. a problem of maintaining tracks on a number of targets with limited resource to track them). In a similar vein, Guha, Munagala and Shi (2009) look at a case of restless bandits in which the reward obtained by playing each of n independent arms varies according to an underlying on/off Markov process whose exact state is only revealed when the arm is played. This places
the problem in the context of partially observed Markov decision processes. Washburn (2008) has written a survey. Our final model is one introduced by Weber and Weiss (1990), and called 'dual-speed' by Glazebrook, Niño-Mora and Ansell (2002). Its continuous-time version has been analysed by Weber (2007).

Example 6.3 The active and passive actions correspond to running the process at different speeds. For example, suppose that for some 0 < ε < 1,
$$P(j\,|\,i,0) = \begin{cases}\varepsilon\,P(j\,|\,i,1), & i\ne j,\\ (1-\varepsilon) + \varepsilon\,P(i\,|\,i,1), & i = j.\end{cases}$$
Thus a bandit which is operated continuously with u = 1 has the same stationary distribution as one that is operated continuously with u = 0. But the process moves faster when u = 1.
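The dynamics of Examples 6.2 and 6.3 are simple enough to code directly. The following sketch (Python; the function names are ours) implements the Bayesian belief update of Example 6.2 and the passive transition kernel of the dual-speed bandit; note that the speed parameter ε in Example 6.3 is our reading of a symbol lost in typesetting, so treat it as an assumption.

import numpy as np

def belief_update(x, u, p01, p11, observed_condition=None):
    """One step of the belief dynamics in Example 6.2.

    u = 1 (observe): the condition i is revealed and the state jumps to p_{i1}.
    u = 0 (no observation): Bayesian prediction x -> x*p11 + (1 - x)*p01.
    observed_condition is the revealed condition (0 or 1) when u = 1.
    """
    if u == 1:
        return p11 if observed_condition == 1 else p01
    return x * p11 + (1.0 - x) * p01

def dual_speed_passive_kernel(P1, eps):
    """Example 6.3: P(j|i,0) = eps*P(j|i,1) off the diagonal and
    (1-eps) + eps*P(i|i,1) on it, i.e. eps*P1 + (1-eps)*I."""
    k = P1.shape[0]
    return eps * P1 + (1.0 - eps) * np.eye(k)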
6.3 Whittle indices for restless bandits
Let us now simplify the discussion by specializing to a discrete-time model in which each of the times 0, 1, . . . is a decision time for every bandit. We now have a family of alternative Markov decision processes and are to choose u(t) = (u_1(t), . . . , u_n(t)), subject to the constraint that u(t) lies in the set of admissible action vectors, namely those (u_1, . . . , u_n) with u_i ∈ {0, 1} for all i and Σ_i u_i = m. For simplicity, suppose there are n (restless) bandits and they are statistically identical. The dynamic programming (DP) equation for the problem of maximizing the expected total discounted sum of rewards is, for a state x = (x_1, . . . , x_n),
$$V(x) = \max_{u}\left\{\sum_i r(x_i,u_i) + a\sum_{y_1,\dots,y_n} V(y_1,\dots,y_n)\prod_i P(y_i\,|\,x_i,u_i)\right\},$$
where the maximum is taken over admissible action vectors u.
We cannot expect to solve this equation in general. Indeed, Papadimitriou and Tsitsiklis (1999) have shown that the restless bandits problem is PSPACE-hard. This notion is similar to NP-hard except that it is space (i.e. the size of computer memory) that is at issue. A decision problem is in PSPACE, if it can be solved using an amount of space that is polynomially bounded by the input size. A decision problem is PSPACE-hard if any problem in PSPACE can be reduced to it in polynomial time. Let us focus upon average reward (i.e. the limit as a → 1). This is attractive because performance does not depend on the initial state. Also, as Whittle comments (and as we have seen in Section 4.9.2), indices are often simpler in the average case. Assuming that the n bandits are statistically equivalent it is plausible that, under an optimal policy, bandit i will be given the action ui = 1 for precisely a fraction m/n of the time. This motivates interest in an upper bound on the maximal average reward that can be obtained by considering a single bandit and asking how it should be controlled if we wish to maximize
the average reward obtained from that bandit, subject to a relaxed constraint that u_i = 1 is employed for a fraction of exactly m/n of the time. So consider a stationary Markov policy for operating a single restless bandit. Let z_{xu} be the proportion of time that the bandit is in state x and that under this policy the action u is taken. An upper bound for our problem can be found from a linear program in variables {z_{xu} : x ∈ E, u ∈ {0, 1}}:
$$\text{maximize}\quad \sum_{x,u} r(x,u)\,z_{xu} \qquad (6.1)$$
subject to
$$\sum_{x,u} z_{xu} = 1, \qquad (6.2)$$
$$\sum_{x} z_{x0} \ge 1 - m/n, \qquad (6.3)$$
$$\sum_{u} z_{xu} = \sum_{y,u} z_{yu}\,P(x\,|\,y,u), \quad \text{for all } x, \qquad (6.4)$$
$$z_{xu} \ge 0, \quad \text{for all } x, u. \qquad (6.5)$$
Here (6.4) are equations that determine the stationary probabilities. Notice that we have put an inequality in (6.3). Let us justify this by making the assumption that action u = 1 (which we call the active action) is in some sense better than u = 0 (which we call the passive action). So if constraint (6.3) did not exist then we would wish to take u = 1 in all states. At optimality, (6.3) will hold with equality. The optimal value of the dual linear program is equal to g, where this can be found from the average reward dynamic programming equation
$$\phi(x) + g = \max_{u\in\{0,1\}}\left\{ r(x,u) + W(1-u) + \sum_{y}\phi(y)\,P[y\,|\,x,u]\right\}. \qquad (6.6)$$
Here W and φ(x) are the Lagrange multipliers for constraints (6.3) and (6.4), respectively. The multiplier W is positive and may be interpreted as a subsidy for taking the passive action. It is interesting to see how (6.6) can be obtained from (6.1)–(6.4). However, we might have simply taken as our starting point a problem of maximizing average reward when there is a subsidy for taking the passive action. In general, the solution of (6.6) partitions the state space E into three sets, E0 , E1 and E01 , where, respectively, the optimal action is u = 0, u = 1, or some randomization between both u = 0 and u = 1. Let us avoid uninteresting pathologies by supposing that the state space is finite, and that every pure policy gives rise to a Markov chain with one recurrent class. Then the set E01 , where there is randomization, need never contain more than one state, a fact that is known for general Markov decision processes with constraints; see Ross (1989).
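For a finite state space the linear program (6.1)–(6.5) is small enough to be solved directly. The sketch below (Python with scipy; the array conventions r[x, u] and P[u] as row-stochastic matrices, and the function name, are ours) computes the single-bandit upper bound for a given fraction α = m/n. One balance equation in (6.4) is redundant given (6.2); the solver's presolve copes with this.

import numpy as np
from scipy.optimize import linprog

def relaxed_upper_bound(r, P, alpha):
    """Optimal value of the LP (6.1)-(6.5) for one restless bandit.

    r[x, u]    : reward for action u in state x
    P[u][x, y] : transition probability x -> y under action u
    alpha      : fraction of time the active action may be used (m/n)
    """
    k = r.shape[0]
    nvar = 2 * k                          # z[x, u] flattened as z[2*x + u]
    idx = lambda x, u: 2 * x + u

    c = np.zeros(nvar)
    for x in range(k):
        for u in (0, 1):
            c[idx(x, u)] = -r[x, u]       # linprog minimizes, so negate (6.1)

    # (6.2): probabilities sum to one; (6.4): stationarity of z.
    A_eq, b_eq = [np.ones(nvar)], [1.0]
    for x in range(k):
        row = np.zeros(nvar)
        for u in (0, 1):
            row[idx(x, u)] += 1.0
            for y in range(k):
                row[idx(y, u)] -= P[u][y, x]
        A_eq.append(row)
        b_eq.append(0.0)

    # (6.3): passive action used at least a fraction 1 - alpha of the time.
    A_ub = np.zeros((1, nvar))
    for x in range(k):
        A_ub[0, idx(x, 0)] = -1.0
    b_ub = [-(1.0 - alpha)]

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=np.array(A_eq), b_eq=b_eq,
                  bounds=[(0, None)] * nvar)
    return -res.fun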
It is reasonable to expect that, as the subsidy W increases in (6.6), the set of states E0 (in which u = 0 is optimal) should increase monotonically. This need not happen in general. However, if it does then we say the bandit is indexable. Whittle defines as an index the least value of the subsidy W such that u = 0 is optimal. We call this the Whittle index , denoting it W (·), where W (x) = inf{W : x ∈ E0 (W )}. It can be used to define a heuristic policy (the Whittle index policy) in which, at each instant, one engages m bandits with the greatest indices, that is, those that are the last to leave the set E1 as the subsidy for the passive action increases. The Whittle index extends the Gittins optimal policy for classical bandits: Whittle indices can be computed separately for each bandit; they are the same as the Gittins index in the case that u = 0 is a freezing action, so that P (j |i, 0) = δij . The discussion so far begs two questions: (i) under what assumptions is a restless bandit indexable; and (ii) how good is the Whittle index policy? Might it be optimal, or very nearly optimal as n becomes large? Interestingly, there are special classes of restless bandit for which one can prove indexability directly. Bandits of the type in Example 6.2 are indexable, as is shown by Liu and Zhao (2010). The dual-speed bandits in Example 6.3 are indexable, as can be shown via conservation laws (Glazebrook, Ni˜no-Mora and Ansell 2002), or (in a continuous-time setting) by a simple direct method (Weber 2007); this latter method can also be used to show that a restless bandit is indexable if the passive action transition rate qij0 depends only on j (the destination state). The approach of Glazebrook et al . is to use conservation laws to prove that the following is sufficient for indexability. Fix a deterministic stationary Markov policy π and look at the expected total discounted time spent taking the action u = 1 if one takes u = 1 at the first step and then follows it with π. Compare that to the expected total discounted time spent taking u = 1 if one takes u = 0 at the first step and then follows it with π. Then the bandit is indexable if, for all starting states and stationary policies π, the former quantity is greater. The dual-speed bandit satisfies this condition. They make further use of conservation laws to find an expression for the reward suboptimality of a general policy. This is used to study the closeness to optimality of an index policy for dual-speed restless bandits. The strong performance of the index policy is confirmed by a computational study. Ni˜no-Mora (2001) also uses the achievable region approach to address the question of indexability. He finds sufficient conditions on model parameters under which a single restless bandit is indexable. He describes an adaptive greedy index algorithm for computing the indices, within a framework of so-called partial conservation laws. Indexability is verified by running the algorithm. The above work provides some insight, but the question of indexability is subtle, and a complete understanding is yet to be achieved.
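In practice, indexability and the Whittle indices of a small finite-state bandit can be checked by brute force: sweep the subsidy W over a grid, solve (6.6) for each W, and record the value of W at which each state first enters E0(W). The sketch below (Python; the grid, tolerances, damping and function names are ours, and a unichain model is assumed) does exactly this, raising an error if E0(W) fails to grow monotonically on the grid.

import numpy as np

def passive_set(r, P, W, iters=20000, tol=1e-9):
    """Solve the average-reward DP (6.6) with passive subsidy W by relative
    value iteration and return E0(W), the states where u = 0 is optimal."""
    k = r.shape[0]
    phi = np.zeros(k)
    for _ in range(iters):
        q0 = r[:, 0] + W + P[0] @ phi            # value of the passive action
        q1 = r[:, 1] + P[1] @ phi                # value of the active action
        new = 0.5 * phi + 0.5 * np.maximum(q0, q1)   # damping: aperiodicity transform
        new = new - new[0]                       # relative values: pin phi(0) = 0
        if np.max(np.abs(new - phi)) < tol:
            phi = new
            break
        phi = new
    q0 = r[:, 0] + W + P[0] @ phi
    q1 = r[:, 1] + P[1] @ phi
    return {x for x in range(k) if q0[x] >= q1[x] - 1e-8}

def whittle_indices(r, P, grid):
    """Scan subsidies W over 'grid' (increasing).  The index of state x is the
    least grid point at which x enters E0(W); states never entering E0 on the
    grid are left as None."""
    index = {x: None for x in range(r.shape[0])}
    previous = set()
    for W in grid:
        E0 = passive_set(r, P, W)
        if not previous <= E0:
            raise ValueError('bandit appears not to be indexable on this grid')
        for x in E0 - previous:
            index[x] = W
        previous = E0
    return index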
6.4 Asymptotic optimality
We now turn to the question of optimality or near optimality of the Whittle index policy. Taking m = αn, let $R_W^{(n)}(\alpha)$, $R_{opt}^{(n)}(\alpha)$ and $r(\alpha)$ denote, respectively, the average reward that is obtained from n restless bandits under the Whittle index policy, under an optimal policy, and from a single bandit under the relaxed policy (that the bandit receives the action u = 1 for a fraction α of the time). Then
$$R_W^{(n)}(\alpha) \;\le\; R_{opt}^{(n)}(\alpha) \;\le\; n\,r(\alpha).$$
It is plausible that the Whittle index policy should be asymptotically optimal as n becomes large, in the sense that $r(\alpha) - R_W^{(n)}(\alpha)/n \to 0$ as $n \to \infty$. Weber and Weiss (1990, 1991) have shown that this is true if certain differential equations have an asymptotically stable equilibrium point (i.e. a point to which they converge from any starting state). These are the differential equations which describe a fluid approximation to the stochastic process of the bandit states evolving under the Whittle index policy. Suppose bandits move on a state space of size k, and let z_i(t) be the proportion of the bandits in state i; the differential equations are of the form dz/dt = A(z)z + b(z), where A(z) and b(z) are constants within k polyhedral regions which partition the positive orthant of R^k. It turns out that there are examples of dual-speed restless bandits (needing k > 3) in which the differential equations have an asymptotically stable equilibrium cycle (rather than a stable equilibrium point), and this can lead to suboptimality of the Whittle index policy. However, in examples, the suboptimality was never found to be more than about one part in 10^4. It is tempting to try to generalize the idea of a Whittle index for restless bandits to problems with a discounted reward criterion, starting with the appropriate functional equation in place of (6.6) and adding a subsidy for use of the passive action. However, there is no asymptotic optimality result for this case that is analogous to the result of Weber and Weiss for the average reward case. The use of discounted versions of Whittle indices can actually end up recommending the worst of all priority policies, and a payoff that is very far from the optimum. This is because the identity of the optimal priority policy can critically depend on the starting state of the n restless bandits, whereas the ordering of Whittle indices is calculated without reference to the starting state.
6.5 Monotone policies and simple proofs of indexability
In many of the successful applications which have used Whittle indices, the undiscounted DP equation in (6.6), in which a single restless bandit is operated
to maximize a combination of returns earned and passive subsidies paid, is optimized by a policy with a monotone structure. This is particularly the case in so-called bi-directional bandits in which the active and passive actions tend to move the state of the bandit in opposite directions. The optimality of policies of monotone structure can greatly simplify the task of demonstrating indexability and of recovering the Whittle indices. To illustrate these ideas, we consider the spinning plates model of Glazebrook, Kirkbride and Ruiz-Hernandez (2006). As above, the issue of indexability requires that we focus attention on individual bandits. In this model, a bandit is best thought of as a reward-generating asset which moves on a finite state space, E = {1, . . . , k}. Should action u ∈ {0, 1} be applied to the asset while in state x ∈ E, the time to the next transition is exponentially distributed with rate λ(x)u + μ(x)(1 − u), where λ(x) is the rate at which the asset moves from x to x + 1 under the active action u = 1, and μ(x) is the rate at which the asset moves from x to x − 1 under the passive action u = 0. The upward rate λ(k) and the downward rate μ(1) are both zero. We have that r(x, 0) = r(x, 1) = r(x), x ∈ E, and hence that rewards are earned under both actions. We assume that the reward function r : E → R+ is increasing and so the effect of the active control (which may be thought of as investment in the asset) is to move the asset upward to more lucrative states. While passive, the asset tends to deteriorate downward to states which earn rewards at lower rates. The n-bandit problem concerns how the decision maker should invest in n such assets, m (< n) at a time, to maximize the average rate of return earned from them over an infinite horizon. If without loss of generality we uniformize such that $\max_{x\in E}\max\{\lambda(x),\mu(x)\} = 1$, then the DP equation (6.6) becomes
$$g = \max_{u\in\{0,1\}}\bigl\{ r(x) + W(1-u) + \lambda(x)u[\phi(x+1)-\phi(x)] + \mu(x)(1-u)[\phi(x-1)-\phi(x)]\bigr\}. \qquad (6.7)$$
In order to find solutions to (6.7), we introduce the class A of monotone policies π : E → {0, 1} for which π ∈ A if and only if there exists y ∈ {1, 2, . . . , k + 1} such that
$$\pi(x) = 0 \iff x \ge y. \qquad (6.8)$$
We denote the policy determined in (6.8) by (y). To see that there must be an optimal policy for (6.7) which is a member of A, consider some general deterministic, stationary and Markov policy π for which π(x1 ) = 1 and π(x2 ) = 0 for some 1 ≤ x1 < x2 ≤ k, where x1 is the assumed initial state of the asset. Write x π for the smallest x > x1 at which π takes the passive action 0. Plainly x π ≤ x2 . Since policy π takes the action 1 at all states in the range {x1 , . . . , x π − 1}, it
will evolve upward from its initial state x_1 until it hits state x^π for the first time. From that point it will have alternating sojourns in the states x^π (when passive) and x^π − 1 (when active). The average reward rate for the policy π when the asset starts from initial state x_1 is easily seen to be
$$\frac{[W + r(x^\pi)]\,\lambda(x^\pi - 1) + r(x^\pi - 1)\,\mu(x^\pi)}{\lambda(x^\pi - 1) + \mu(x^\pi)}.$$
But this is also the average reward rate for the monotone policy (x^π) from any initial state. Hence the best policy in the monotone class A is at least as good as π when the initial state of the asset is x_1. By systematic consideration of other cases, we establish that no policy/initial state combination has associated reward rate which can exceed that earned by the best member of A from any initial state. We may thus restrict consideration to the policy class A. We introduce the notation g(W) in recognition of the dependence of the maximal average return on the passive subsidy W. As a consequence of the fact that there always exists a member of A which achieves g(W), we have that
$$g(W) = \max_{1\le x\le k+1}\,\frac{[W + r(x)]\,\lambda(x-1) + r(x-1)\,\mu(x)}{\lambda(x-1) + \mu(x)}, \qquad (6.9)$$
and write x(W) for the smallest maximizing x in (6.9). Note that in (6.9) we set r(0), r(k + 1), λ(0) and μ(k + 1) equal to any convenient positive values. If we now introduce ϕ(x) = λ(x − 1)[λ(x − 1) + μ(x)]^{−1}, 1 ≤ x ≤ k + 1, then (6.9) simplifies to
$$g(W) = \max_{1\le x\le k+1}\,[W\varphi(x) + R(x)], \qquad (6.10)$$
where R(x) = r(x)ϕ(x) + r(x − 1)[1 − ϕ(x)] is the return rate generated by the asset under monotone policy (x). The following result flows easily from the above discussion.

Theorem 6.4
(i) If ϕ is strictly decreasing over 1 ≤ x ≤ k + 1, the asset is indexable.
(ii) If additionally,
$$W^*(x) = \frac{R(x+1) - R(x)}{\varphi(x) - \varphi(x+1)} \qquad (6.11)$$
is strictly decreasing over 1 ≤ x ≤ k, then the asset has Whittle index W (x) = W ∗ (x), 1 ≤ x ≤ k. Proof. For part (i) we observe from (6.10) that g(W ) is the upper envelope of a finite number of linear and increasing functions. It follows that g(W ) is increasing, piecewise linear and convex. Further, its right gradient is ϕ(x(W )) everywhere, and the strictly decreasing nature of ϕ implies that x(W ) is itself decreasing. At all W for which g(W ) is differentiable the monotone policy (x(W )) is strictly optimal within A.
In all cases we have that the set of states in which the passive action is optimal is given by E0 (W ) = {x(W ), . . . , k}. It follows easily from this discussion that E0 (W ) is decreasing in W and hence that the asset is indexable. This proves part (i). Under the hypotheses of part (ii) it is straightforward to check that when W = W ∗ (x), both x and x + 1 achieve the maximum in (6.10). This corresponds to a hinge point at which g(W ) is non-differentiable. Hence we have
that E_0(W^*(x)) = {x + 1, . . . , k}, with both actions u = 0 and u = 1 optimal in x. It follows that W(x) = inf{W : x ∈ E_0(W)} = W^*(x) and hence that W^* is indeed the Whittle index in this case. The numerator in (6.11) has an interpretation as the increase in return rate from the asset when the decision is made to activate/invest in the asset in state x (given that the asset is activated at all states below x) rather than not to. The denominator may be understood as the corresponding increase in activation/investment rate for the asset. Hence W^*(x) is a natural measure of increased return per unit of increased investment for the asset in state x. Niño-Mora (2001) uses the term marginal productivity indices for such quantities. Please also note that this simple form of the index is restricted to the cases for which W^*(x) is decreasing. This is sometimes called the strictly indexable case, meaning that all states have distinct indices. When the condition concerning W^*(x) in part (ii) of the above theorem does not hold, then it remains true that the hinge points of g(W) yield the index values. These are recoverable from a simple algorithm. In such cases it will be the case that distinct asset states may have a common index value. A very similar analysis is available for a simple case of Example 6.1. To achieve conformity with the spinning plates model above we assume that a bandit moves on state space E = {1, 2, . . . , k} with large states x representing high earning, more vigorous conditions for the bandit. Now under the active action u = 1, applied in any state x, the bandit experiences fatigue and moves to x − 1 at exponential rate ν(x), while under the passive action u = 0 the bandit is refreshed and moves to x + 1 at rate ρ(x). We take ν(1) = ρ(k) = 0. In contrast to the spinning plates model we now have r(x, 0) = 0, r(x, 1) = r(x), x ∈ E, and hence rewards here are only earned by the bandit when activated. As before we assume that r(x), the rate of return achieved by the bandit when active, is increasing in the state x. Whittle (1988) gave a brief discussion of a particular case of this model, which had a linear structure for both rewards and stochastic dynamics, which he called the Ehrenfest project. Under a suitable uniformization, the DP equation (6.6) becomes
$$g(W) = \max_{u\in\{0,1\}}\bigl[(1-u)\{W + \rho(x)[\phi(x+1)-\phi(x)]\} + u\{r(x) + \nu(x)[\phi(x-1)-\phi(x)]\}\bigr]. \qquad (6.12)$$
It is not difficult to show that optimal policies for (6.12) may be drawn from the monotone class B, where π ∈ B if and only if there exists y ∈ {1, 2, . . . , k + 1}
for which π(x) = 1 ⟺ x ≥ y. With this simplification of the problem, we have
$$g(W) = \max_{1\le x\le k+1}\,\{W\psi(x) + r(x)[1-\psi(x)]\},$$
where ψ(x) = ν(x)[ν(x) + ρ(x − 1)]^{−1}, 1 ≤ x ≤ k + 1. An analysis similar to the above yields the following result.

Theorem 6.5
(i) If ψ is increasing, the bandit is indexable.
(ii) If additionally,
$$W^{**}(x) := \frac{r(x)[1-\psi(x)] - r(x+1)[1-\psi(x+1)]}{\psi(x+1) - \psi(x)}$$
is increasing over 1 ≤ x ≤ k, then the bandit has Whittle index W (x) = W ∗∗ (x), 1 ≤ x ≤ k. The above examples prompt the question as to when more generally is it possible to restrict attention to monotone policies in solving the DP equation (6.6). In some very recent work, Glazebrook, Hodge and Kirkbride (2010a) have given some sufficient conditions which include both of the above examples as special cases. They discuss applications to machine maintenance, asset management and inventory routing.
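Theorems 6.4 and 6.5 give the indices in closed form once ϕ and ψ have been formed, so they are easy to evaluate numerically. The sketch below (Python; the boundary conventions follow the remarks after (6.9), and the function names and array layout are ours) returns W*(x) for the spinning plates asset and W**(x) for the work/rest bandit. Arrays are of length k+1 with position 0 unused, and the user supplies λ(k) = μ(1) = 0 and ν(1) = ρ(k) = 0 respectively.

def spinning_plates_indices(r, lam, mu):
    """W*(x), 1 <= x <= k, from Theorem 6.4; lam[k] = mu[1] = 0."""
    k = len(r) - 1
    lam0, mu_k1, r_k1 = 1.0, 1.0, 0.0        # arbitrary boundary values; they cancel
    def phi(x):                              # phi(x) = lam(x-1) / (lam(x-1) + mu(x))
        lam_prev = lam0 if x == 1 else lam[x - 1]
        mu_x = mu_k1 if x == k + 1 else mu[x]
        return lam_prev / (lam_prev + mu_x)
    def R(x):                                # return rate under monotone policy (x)
        r_x = r_k1 if x == k + 1 else r[x]
        r_prev = 0.0 if x == 1 else r[x - 1]     # r(0) is irrelevant since phi(1) = 1
        return r_x * phi(x) + r_prev * (1 - phi(x))
    return {x: (R(x + 1) - R(x)) / (phi(x) - phi(x + 1)) for x in range(1, k + 1)}

def work_rest_indices(r, nu, rho):
    """W**(x), 1 <= x <= k, from Theorem 6.5; nu[1] = rho[k] = 0."""
    k = len(r) - 1
    rho0, nu_k1 = 1.0, 1.0                   # arbitrary boundary values; they cancel
    def psi(x):                              # psi(x) = nu(x) / (nu(x) + rho(x-1))
        nu_x = nu_k1 if x == k + 1 else nu[x]
        rho_prev = rho0 if x == 1 else rho[x - 1]
        return nu_x / (nu_x + rho_prev)
    def h(x):                                # r(x)[1 - psi(x)]; r(k+1) is irrelevant
        return 0.0 if x == k + 1 else r[x] * (1 - psi(x))
    return {x: (h(x) - h(x + 1)) / (psi(x + 1) - psi(x)) for x in range(1, k + 1)}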
6.6 Applications to multi-class queueing systems
It was initially thought that the scope for the application of the ideas of this chapter to multi-class queueing systems was limited. It seemed reasonable to hope, for example, that we should be able to recover the cμ-rule discussed in Section 5.2 by means of the Lagrangian relaxation approach described here. However, Whittle (1996) concluded that the approach could not deliver this. To explain what makes this problematic, we shall discuss an extension of the setup in Section 5.2 to the service control of an N-class queueing system where the standard Markovian assumptions regarding arrivals and service completions apply (with λ_i, μ_i the rates for class i), but where the instantaneous cost rate is $\sum_i c_i(n_i)$ when the number of class i customers in the system is n_i, with each c_i(·) increasing and convex. Amongst others, van Mieghem (1995) has argued the importance of objective functions which are increasing and convex in customer delays. To guarantee that all the expectations we shall require exist, we further assume that each c_i(·) is bounded above by some polynomial of finite order. As before, service policies are preemptive. In particular, a customer's service may be interrupted at an arrival epoch. Taking each c_i(·) to be linear recovers
the classic case, though the approach of this chapter does now give us the scope to consider problems with many servers. This is, of course, at the cost of strict optimality for any index heuristic recovered. To guarantee the existence of stable policies, we require that $\sum_i \rho_i < m$, and ρ_i < 1 for each i, where ρ_i = λ_i/μ_i. The form of many server problems allowed by the analysis is such as to permit at most one server access to each class at any time. If we argue as in Section 6.2 above, then, dropping the class identifier i and imposing the uniformization λ + μ = 1, we develop the appropriate version of DP equation (6.6) for this problem as
$$\phi(n) + g = \min_{u\in\{0,1\}}\bigl\{ c(n) - W(1-u) + \lambda\phi(n+1) + \mu u\,\phi([n-1]^+) + \mu(1-u)\phi(n)\bigr\}, \qquad (6.13)$$
where [n]^+ = max(n, 0). This DP equation describes a decision problem in which the passive subsidy W is paid whenever the queue is not being served. However, the standard theory of M/M/1 queues indicates that the average rate at which such subsidies are paid must be W(1 − ρ) for any stable policy. Hence varying the subsidy W can have no impact upon optimal policies and offers no direct route to a Whittle index for this problem. However, this impasse can be broken if we consider a discounted version of the DP equation in (6.13). This is effective as an approach because, under a regime in which all costs and subsidies are discounted at rate γ, the total expected subsidy received up to an infinite horizon from a stable policy now has the form W(1 − ρ)/γ + O(1), where the O(1) term depends both upon the initial state of the queue and, most crucially for our purposes, the policy adopted. Hence varying W in a discounted version of (6.13) can (and, indeed, does) lead to an index. An appropriate Whittle index for the average cost problem can then be recovered by considering the limit γ → 0. We now supply the details. We write C(·, W, γ) : N → R for the value function (minimized cost from initial state n) for a discounted cost version of (6.13) in which the instantaneous cost rate for the queue is adjusted to γc(n). This is an appropriate scaling of costs, since it means that, as policies are varied, both changes in the total holding costs incurred and in the subsidies paid will be O(1). This will in turn guarantee good behaviour for the ratios of these quantities (which is what indices are) in the limit as γ → 0. The value function C(·, W, γ) satisfies the DP equation
$$(\lambda+\mu+\gamma)\,C(n,W,\gamma) = \min_{u\in\{0,1\}}\bigl\{\gamma c(n) - W(1-u) + \lambda C(n+1,W,\gamma) + \mu u\,C([n-1]^+,W,\gamma) + \mu(1-u)\,C(n,W,\gamma)\bigr\}. \qquad (6.14)$$
In what follows, we shall write E0 (W, γ ) for the subset of N for which u = 0 is an optimal control and achieves the minimum in (6.14). Note that 0 ∈ E0 (W, γ ) for all choices of W ∈ R+ and γ > 0. We shall be looking for indexability, which here will mean that E0 (W, γ ) is increasing in W as it ranges over R+ , for each fixed γ . When this is the case we shall recover Whittle indices for the discounted problem. We define W (n, γ ) = inf{W ; n ∈ E0 (W, γ )} and take W (0, γ ) = 0. To see what form such an index might take, we shall start off by assuming everything we would like to be true, and more besides. All such assumptions will, of course, need to be verified post hoc. We shall just give a sketch of the salient points. A full account may be found in Ansell, Glazebrook, Ni˜no-Mora and O’Keeffe (2003). We shall assume then that we do indeed have indexability and that moreover the index W (n, γ ) is increasing in n for each fixed γ . That the fair subsidy should increase with n when holding costs are not only increasing but also convex does at least seem plausible. It will turn out to be true. Under these mighty assumptions, consider first the discounted DP equation in (6.14) when the passive subsidy W is taken to be the index W ∗ (n, γ ), for some fixed n ≥ 1. We write W ∗ rather than W at this stage as we do not as yet know that an index exists. Hence W ∗ is the assumed index. It must be true that there exist (at least) two optimal policies for the problem with W = W ∗ (n, γ ). Let us call them π1 and π2 . The assumed increasing nature of the index means that both of these policies will take the active action u = 1 in all states greater than n and the passive action u = 0 in all states less than n. However, π1 will take the active action in state n while π2 will take the passive action there. Both of these policies must have the same total expected cost from any initial state, including the specified n. Consider first policy π1 operating from initial state n. It will continue to take the active action u = 1 until the queue enters the state n − 1 for the first time at T , say. Random variable T is stochastically identical to a single busy period for an M/M/1 queue with arrival rate λ and service rate μ. We write c∗ (n, γ ) for the total expected discounted cost incurred during [0, T ). From time T , policy π1 will take the passive action u = 0 until the next arrival occurs, returning the queue to its initial state n. This cycle is repeated ad infinitum. It is straightforward to show that the total expected cost incurred by policy π1 from initial state n is given by
$$\frac{c^*(n,\gamma) + E\bigl[e^{-\gamma T}\bigr]\,[\gamma c(n-1) - W^*(n,\gamma)]\,(\gamma+\lambda)^{-1}}{1 - \lambda E\bigl[e^{-\gamma T}\bigr](\gamma+\lambda)^{-1}},$$
with a similarly derived expression for the costs incurred under π_2. Equating the costs under the two policies and solving for W^*(n, γ) will give us a formula for the assumed index, namely
$$W^*(n,\gamma) = \frac{\gamma c^*(n,\gamma)}{1 - E\,e^{-\gamma T}} - \frac{E\,e^{-\gamma T}}{1 - E\,e^{-\gamma T}}\,\gamma c(n-1), \qquad n\ge 1. \qquad (6.15)$$
We take W^*(0, γ) = 0. Ansell et al. (2003) are able to show that W^*(n, γ) is increasing in n, and use direct DP arguments to establish that
$$W^*(n,\gamma) \le W \le W^*(n+1,\gamma) \;\Rightarrow\; E_0(W,\gamma) = \{0,1,\dots,n\}, \qquad n\in\mathbb{N}. \qquad (6.16)$$
It is thus clear that E_0(W, γ) is indeed increasing in W for each fixed γ and hence we have indexability for the discounted version of our problem. It follows easily from (6.16) that W(n, γ) = inf{W; n ∈ E_0(W, γ)} = W^*(n, γ), and so the expression in (6.15) really does give the discounted Whittle index we require. We can then recover a Whittle index for the average cost problem as follows:
$$W(n) = \lim_{\gamma\to 0} W(n,\gamma) = \lim_{\gamma\to 0} W^*(n,\gamma) = \frac{\mu(\mu-\lambda)}{\lambda}\,E[c(n-1+X) - c(n-1)], \qquad (6.17)$$
where X is a random variable whose distribution is that of the number in system of an M/M/1 queue in equilibrium when the arrival and service rates are λ and μ respectively. If we specialize to quadratic costs and take c(n) = b + cn + dn^2, the Whittle index for average costs becomes
$$W(n) = c\mu + \frac{d(3\lambda-\mu)\mu}{\mu-\lambda} + 2d\mu n, \qquad n\ge 1, \qquad (6.18)$$
which is indeed, as we set out to show, equal to cμ in the linear costs case d = 0. In general, from (6.18) we see that, in contrast to the special linear case, the index W(n) for convex costs depends both upon the arrival rate λ and the number in the system n. While the Whittle index policy for the N-class problem is not optimal, all the evidence we have is that it performs very strongly, even when there is just a single server. Glazebrook, Lumley and Ansell (2003) extend this analysis to the more complex case in which service times follow a general distribution and policies are non-preemptive. For that case the simplest form for the Whittle index for average costs is
$$W(n) = \frac{E[c(n+X) - c(n-1+X)]}{E(S)},$$
where S is a single service time and random variable X has the distribution of the number in system of the single class M/G/1 queue under non-idling service. This index reduces to c/E(S) when costs are linear.
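The average-cost index (6.17) is straightforward to evaluate numerically, since X is geometric for the M/M/1 queue, and for quadratic costs the result can be checked against the closed form (6.18). A small sketch (Python; the parameter values and truncation level are illustrative choices of ours):

import numpy as np

def whittle_index_mm1(cost_fn, lam, mu, n, truncate=2000):
    """Average-cost Whittle index (6.17):
    W(n) = mu*(mu - lam)/lam * E[c(n - 1 + X) - c(n - 1)],
    with X the equilibrium M/M/1 queue length, P(X = j) = (1 - rho)*rho**j."""
    rho = lam / mu
    j = np.arange(truncate)
    px = (1 - rho) * rho ** j
    ex = np.sum(px * (cost_fn(n - 1 + j) - cost_fn(n - 1)))
    return mu * (mu - lam) / lam * ex

# Quadratic costs c(n) = b + c1*n + d*n^2 should reproduce (6.18).
lam, mu, b, c1, d = 0.6, 1.0, 0.0, 2.0, 0.5
cost = lambda m: b + c1 * m + d * m ** 2
for n in (1, 2, 5):
    closed_form = c1 * mu + d * (3 * lam - mu) * mu / (mu - lam) + 2 * d * mu * n
    print(n, whittle_index_mm1(cost, lam, mu, n), closed_form)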
6.7 Performance bounds for the Whittle index policy
In Section 6.3 we mentioned the work of Niño-Mora and his espousal of the achievable region approach to address questions of indexability. While we believe
that rather simpler direct methods to demonstrate indexability and obtain Whittle indices are often available, such as those of Section 6.5, nevertheless aspects of the achievable region approach are invaluable in the development of tools for the construction of performance bounds for the Whittle index policy. There is an important link here with the methods used in Section 5.7. These exploited the structure of the adaptive greedy algorithm to obtain the expression (5.25) for the expected return under general policy π in terms of the Gittins indices νi and the quantities Aπ (Si , k) which measure the usage by policy π of the low index states (i.e. those in the sets Si ). Ni˜no-Mora (2001) develops an adaptive greedy algorithm for Whittle indices, and so much the same is possible here, though the more complex restless bandit structure may make it more challenging to derive simple, interpretable bounds. We illustrate what is possible by reference to a queueing control problem which concerns the admission control and routing of impatient customers for service. We note that customer impatience is now recognized as an important phenomenon in many queueing applications and is widely modelled. Much of the most important work in this area has been motivated by call centre applications. We cite Garnett, Mandelbaum and Reiman (2002) and Bassamboo, Harrison and Zeevi (2005) as two key contributions within what is now a vast literature. Customers arrive according to a Poisson process of rate λ and seek service which is available from S service stations. Service at station s is provided by σs servers, each of which completes the service of customer requirements at exponential rate μs . Upon arrival, a decision is made concerning whether a customer should be admitted to service at all. If a customer is discarded at this stage a penalty d is paid. Once admitted, a customer is routed to one of the stations where they remain until leaving the system. At each station, service is first-come, first-served (FCFS). Customers are impatient. If they have not been admitted to service before an exponentially distributed amount of time with mean θ −1 has elapsed, they leave the system unserved. A cost cs is incurred when any loss is incurred at station s. To avoid trivialities, the penalty d is assumed smaller than all of the cs . Every customer service successfully completed at station s earns a return rs . Decisions concerning admission to the system and subsequent routing are made in full knowledge of the system state n, which here records the numbers of customers present at each station. We write n = {ns , 1 ≤ s ≤ S}. The goal is to design a policy for admission control and subsequent routing which comes close to maximizing the expected return from the system, net of penalties. To develop suitable Whittle indices for this problem, we consider a single station (and drop the identifier s from the notation) facing the entire customer stream. For each arriving customer, a decision is made on whether to admit (the active action u = 1) or not (the passive action u = 0). A passive subsidy W is available whenever the passive action is taken. The returns and penalties on offer are as in the preceding paragraph. Hence for any customer not admitted the discard penalty d is incurred, while for admitted customers either a return r will be earned if the customer is served or a cost c incurred if not. Under a general deterministic, stationary and Markov policy π for this single station admission
control problem, we write α(π) for the average rate of service completions achieved over an infinite horizon and β(π) for the average rate of admissions. It follows that the difference β(π) − α(π) is the average rate of losses from the queue. We use g(W) for the maximum return rate achievable from the single station. Accounting for all rewards earned and penalties paid, we have
$$g(W) = \max_{\pi}\bigl[(r+c)\,\alpha(\pi) + (W-d+c)\,[\lambda-\beta(\pi)]\bigr] - c\lambda. \qquad (6.19)$$
It is straightforward to establish that this single station problem is solved by a policy with monotone structure. Hence, for any fixed W, it is optimal to admit customers provided the number at the station present upon arrival is strictly below some threshold, n_W say, and not to admit otherwise. Once this has been established, the demonstration of indexability (which here now means that the threshold n_W decreases with W) and development of the Whittle index is easy and uses arguments akin to those used in Section 6.5. To describe the Whittle index for this problem, we use {Π^n_x, 0 ≤ x ≤ n} for the stationary distribution of the number at the station under a monotone policy with threshold n. We first observe that W(n) = r + d, n < σ. When there are free servers upon arrival (n < σ), the return r upon completion is guaranteed for any admitted customer and the fair subsidy for the passive discard action compensates both for the return lost and for the discard penalty incurred. Otherwise we have
$$W(n) = d - c + (r+c)\,\mu\sum_{x=0}^{\sigma-1}\bigl(\Pi^x_0\bigr)^{-1} \times \left[\mu\sum_{x=0}^{\sigma-1}\bigl(\Pi^x_0\bigr)^{-1} + \theta\sum_{x=\sigma}^{n}\bigl(\Pi^x_0\bigr)^{-1}\right]^{-1}, \qquad n\ge\sigma. \qquad (6.20)$$
It is easy to establish that W (n) is decreasing in n and so, unsurprisingly, a station with a longer queue is less attractive. Since d < c, W (n) will certainly be negative for n large enough. To understand the form of the index, we write r ∗ (n) for the expected return from a customer admitted to a station at which n customers are already present, and observe that the quantity r ∗ (n) + d may be thought of as an individual index, namely the fair compensation due to a discarded customer for her personal lost return. However, this simple index takes no account of the negative impact of a decision to admit upon the returns available from other customers. Hence, r ∗ (n) + d may also be thought of as a myopic index as it takes no thought for future consequences. The Whittle index W (n) may be thought of as an appropriate modification to this simple index to take account of the increased system congestion attendant upon a decision to admit. Unsurprisingly, we have W (n) < r ∗ (n) + d for all n. We shall presently return to the Whittle index W (n) and exploit the fact that it may be obtained as the output from an adaptive greedy algorithm. We briefly leave the single station problem in (6.19) which defines the index and consider the full S-station admission control/routing problem outlined above.
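For numerical work, the index (6.20) only requires the stationary distributions of the station under each admission threshold, which satisfy a simple birth-death recursion. The sketch below (Python; the function names are ours, and the formula used is the reconstruction of (6.20) given above) computes W(n), returning r + d when n < σ as noted in the text.

import numpy as np

def stationary_threshold(lam, mu, sigma, theta, n):
    """Stationary distribution Pi^n of the station under admission threshold n:
    birth rate lam in states 0..n-1, death rate mu*min(x, sigma) + theta*(x - sigma)^+."""
    w = np.ones(n + 1)
    for x in range(1, n + 1):
        death = mu * min(x, sigma) + theta * max(x - sigma, 0)
        w[x] = w[x - 1] * lam / death
    return w / w.sum()

def admission_index(lam, mu, sigma, theta, r, c, d, n):
    """Whittle index W(n) of a station holding n customers, as in (6.20)."""
    if n < sigma:
        return r + d
    inv_pi0 = [1.0 / stationary_threshold(lam, mu, sigma, theta, x)[0]
               for x in range(n + 1)]
    served = mu * sum(inv_pi0[:sigma])           # mu * sum_{x=0}^{sigma-1} 1/Pi^x_0
    lost = theta * sum(inv_pi0[sigma:n + 1])     # theta * sum_{x=sigma}^{n} 1/Pi^x_0
    return d - c + (r + c) * served / (served + lost)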
Firstly, note that we shall use π_W to denote the natural Whittle index policy for our admission control/routing problem. Policy π_W will route an arriving customer to any station of largest Whittle index, unless all stations have indices which are negative, in which case the customer will not be allowed entry to service. To see why π_W has this particular form we note that the discard option, regarded now as a decision to route to a station where a loss of d is guaranteed, from the above formula, must have an associated Whittle index of zero. We now use g_s(W) for the average optimal return from the single station problem (6.19) for station s. Consider the quantity
$$R^{opt}(W) := \sum_{s=1}^{S} g_s(W) + \lambda(d-W)(S-1), \qquad (6.21)$$
which has the following interpretation: suppose that in the original S-station problem each station is empowered to take a decision concerning whether to admit each arriving customer or not. Use π to denote a policy for such a problem. Each station which chooses to admit will increment its queue length by one, and a return for service completion or a cost for a loss will be paid as in the original problem. For every station which refuses to admit, the discard penalty d is payable. Further, policy π will earn a system-wide subsidy equal to W times the difference between the aggregate discard rate for all S stations and λ(S − 1). If we use β_s(π) to denote the admission rate at station s under π, then the contribution to R^{opt}(W) from discard decisions is equal to (W − d)[λ − Σ_s β_s(π)]. We conclude that
$$R^{opt}(W) \ge R^{opt}, \qquad W\ge 0, \qquad (6.22)$$
where in (6.22) we use R opt for the optimal return for the original S-station problem. To see why this is so, observe that the setup which yields R opt (W ) concerns a class of policies which contains the admissible policies for the original problem as a sub-class. In the latter, we are required either to discard an arriving customer from all S stations (in which case we discard her altogether and pay d) or we discard her from S − 1 stations and route her to the remaining station. When the policies π are restricted to the sub-class which are admissible for the original problem, the returns and penalties which yield (6.21) exactly match those in the original problem, with the exception of the subsidy W [λ − s βs (π)] earned from discarded customers. Since, when W ≥ 0, this subsidy is guaranteed nonnegative, (6.22) follows. We call the decision problem which yields the value R opt (W ) the Lagrangian relaxation of the original S-station problem. It is possible to gain insight regarding the quality of performance of the Whittle index policy πW from the tightness of the Lagrangian relaxation, as measured, say, by R ∗ − R opt , where R ∗ := minW ≥0 R opt (W ). To encourage this line of thought we observe that we already know that R opt (W ) is achieved and the Lagrangian relaxation is solved by a policy which admits an incoming customer
to all stations whose Whittle index exceeds W. It is straightforward to establish from the above details that, if either (i) there exists a policy which is admissible for the original problem which achieves R^{opt}(0); or (ii) there exists a non-discarding policy which is admissible for the original problem which achieves R^{opt}(W) for some W > 0; then it must follow that R^* = R^{opt}, and that π_W achieves R^{opt} and solves our admission control/routing problem. We can use such ideas to argue for the optimality of the Whittle index policy π_W in regimes of light and heavy traffic, where we let λ → 0 and λ → ∞ respectively and keep other system parameters fixed. For example, in the heavy-traffic case, it is easy to show that, for all sufficiently large λ, each station s has a value n^*_s such that W_s(n) ≥ 0 ⟺ n ≤ n^*_s. Consider now the operation of the system under the policy for the Lagrangian relaxation which achieves R^{opt}(0) and which admits incoming customers to all stations whose index is positive, and hence to all stations s for which n_s ≤ n^*_s. It is clear that, for very large λ, under such a policy almost all arriving customers will either encounter a system state n in which all stations have negative index (n_s = n^*_s + 1 for all s) and will be discarded, or a system state n in which exactly one station has positive index (n_s = n^*_s for exactly one s) and will be routed there. But such a policy is admissible for the original problem; indeed it is precisely the Whittle index policy π_W. Hence it follows from (i) above that we should expect π_W to achieve optimality in the limit λ → ∞. Indeed it does. Glazebrook et al. (2009) give a rigorous argument. Given the importance of this result, we now state it in the following theorem, where we use the notations R^π_λ, R^{opt}_λ to express the dependence of the average returns concerned upon the customer arrival rate λ.

Theorem 6.6 The Whittle index heuristic π_W is asymptotically optimal in a heavy-traffic limit in the sense that
$$\lim_{\lambda\to\infty}\bigl(R^{opt}_\lambda - R^{\pi_W}_\lambda\bigr) = 0.$$
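The relaxation bound itself is also cheap to compute: each g_s(W) in (6.21) can be evaluated by enumerating admission thresholds, as justified by the monotone structure noted after (6.19), and R* then follows by searching over a grid of subsidies W. A sketch (Python; the threshold cap and W grid are our choices, and stationary_threshold is repeated from the previous sketch so that this listing is self-contained):

import numpy as np

def stationary_threshold(lam, mu, sigma, theta, n):
    w = np.ones(n + 1)
    for x in range(1, n + 1):
        w[x] = w[x - 1] * lam / (mu * min(x, sigma) + theta * max(x - sigma, 0))
    return w / w.sum()

def station_g(W, lam, mu, sigma, theta, r, c, d, max_threshold=200):
    """g_s(W) of (6.19), maximizing over admission thresholds (monotone policies)."""
    best = (W - d) * lam                     # threshold 0: discard every arrival
    for n in range(1, max_threshold + 1):
        pi = stationary_threshold(lam, mu, sigma, theta, n)
        x = np.arange(n + 1)
        alpha = mu * np.sum(np.minimum(x, sigma) * pi)   # service completion rate
        beta = lam * (1.0 - pi[n])                       # admission rate
        best = max(best, (r + c) * alpha + (W - d + c) * (lam - beta) - c * lam)
    return best

def relaxation_bound(stations, lam, W_grid):
    """R* = min_W R_opt(W), with R_opt(W) as in (6.21).  Each entry of 'stations'
    is a tuple (mu, sigma, theta, r, c, d); d is the common discard penalty."""
    S, d = len(stations), stations[0][5]
    def R_opt(W):
        return sum(station_g(W, lam, *p) for p in stations) + lam * (d - W) * (S - 1)
    return min(R_opt(W) for W in W_grid)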
The above arguments notwithstanding, an approach to demonstrating strong performance for the Whittle index policy π_W, which relies upon establishing that the Lagrangian relaxation is tight, is conservative. Numerical examples have demonstrated that π_W may well perform strongly even when R^* exceeds R^{opt} by some margin. We proceed to a stronger approach based on achievable region methodology. The details use suitable developments of the work of Niño-Mora (2001).
We again focus on an individual station and drop the station identifier s. We define the values $A^n_x$, $0 \le x \le n$, $n \in \mathbb{Z}^+$, by
$$A^n_x := 1 - \lambda\bigl(1 - \Pi^x_x\bigr)\bigl(\mu_{n+1} - \theta_{n+1}\bigr)^{-1}, \tag{6.23}$$
where in (6.23) $\mu_n = \mu\min(n,\sigma)$ and $\theta_n = \theta(n-\sigma)^+$ are, respectively, the total service rate and loss rate for the station when n customers are present. The Whittle indices W(n), $n \in \mathbb{Z}^+$, in (6.20) may be recovered from an adaptive greedy algorithm. As the sequence W(n) is decreasing, and hence we know the index ordering for the station, we are able to give a simplified version of index computation. Recall that $r^*(n)$ is the expected return for a customer admitted to the station when there are n already present. We develop the sequence $y_n$, $n \in \mathbb{Z}^+$, inductively by
$$y_0 = r^*(0)\bigl(A^0_0\bigr)^{-1}, \qquad y_n = \Bigl[r^*(n) - \sum_{x=0}^{n-1} A^n_x\,y_x\Bigr]\bigl(A^n_n\bigr)^{-1}, \quad n \ge 1.$$
The Whittle index is then recovered as
$$W(n) = d + \sum_{x=0}^{n} y_x.$$
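To make the recursion concrete, the following Python sketch (ours, not from the book) recovers W(0), ..., W(N) from it; the function names r_star and A, standing for $r^*(n)$ and $A^n_x$, are assumptions of the sketch and must be supplied for the station at hand.

def whittle_indices(r_star, A, d, N):
    """Adaptive greedy recovery of the Whittle indices W(0), ..., W(N).

    r_star(n) -- expected return for a customer admitted with n already present
    A(n, x)   -- the constant A^n_x of (6.23)
    d         -- the penalty for discarding an arriving customer
    """
    y = []
    for n in range(N + 1):
        # y_n = [r*(n) - sum_{x<n} A^n_x y_x] / A^n_n, the sum being empty for n = 0
        residual = r_star(n) - sum(A(n, x) * y[x] for x in range(n))
        y.append(residual / A(n, n))
    # W(n) = d + sum_{x=0}^{n} y_x
    indices, running = [], 0.0
    for n in range(N + 1):
        running += y[n]
        indices.append(d + running)
    return indices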
The structure of the above computations yields the equation
$$r^*(n) = W(0) - d - \sum_{x=1}^{n}\bigl[W(x-1) - W(x)\bigr]A^n_x, \tag{6.24}$$
from which we can find expressions for $R^\pi$, the average return rate for the system under policy π, along the lines of (5.25) in Section 5.7. To achieve this, we revert to the full S-station problem and develop the full sequence $\{W^*(n),\ n \in \mathbb{N}\}$ of indices from all stations (including the zero index of the discard action) in decreasing order, with equal indices from distinct stations appearing as repeats in the sequence. In particular, $W^*(1)$ is the largest index from any station, given by
$$W^*(1) = \max_{1\le s\le S} W_s(0).$$
To take things further, we use the pair (s, x) to denote 'station s with head count x'. Consider any pair of natural numbers n, n′ with n ≤ n′. We develop the constants $\{B_{nn'},\ 1 \le n \le n',\ n' \in \mathbb{N}\}$ utilizing the station-specific quantities $A^n_{s,x}$ defined above. Suppose that the index $W^*(n')$ belongs to some station s, pair (s, x) say, and that pair (s, z) satisfies $W_s(z-1) > W^*(n) \ge W_s(z)$. We then take $B_{nn'} = A^x_{s,z}$. Consider now any deterministic, stationary and Markov policy π
for the S-station problem and any n ∈ ℕ. We write $\rho^\pi_n$ for the long-run proportion of time for which state n (namely, the state with index $W^*(n)$ and in nth position in the overall index ordering) is active under policy π. We now define
$$A^\pi(n) := \sum_{n'=n}^{\infty} B_{nn'}\,\rho^\pi_{n'}$$
as a measure of the rate at which policy π activates states in position n or higher in the overall index ordering and whose indices are $W^*(n)$ or less. It is straightforward to infer from the identity in (6.24) that the average rate of return from policy π is given by
$$R^\pi = \lambda\Bigl\{W^*(1) - d - \sum_{n=2}^{\infty}\bigl[W^*(n-1) - W^*(n)\bigr]A^\pi(n)\Bigr\} = -\lambda d + \lambda\sum_{n=1}^{\infty} W^*(n)\bigl[A^\pi(n) - A^\pi(n+1)\bigr]. \tag{6.25}$$
The reader may interpret the expression in (6.25) by thinking of the positive quantity $A^\pi(n) - A^\pi(n+1)$ as (roughly) a measure of the usage made by π of the nth highest index state. It is then clear from (6.25) that high-return policies are those which make substantial use of high-index states. It is natural to seek suboptimality bounds for the index policy $\pi_W$ which measure its usage of high-index states. Proceeding much as in Section 5.7, we infer from (6.25) that
$$R^{\mathrm{opt}} - R^{\pi_W} = \lambda\sum_{n=2}^{\infty}\bigl[W^*(n-1) - W^*(n)\bigr]\bigl[A^{\pi_W}(n) - A^{\mathrm{opt}}(n)\bigr] \le \lambda\sum_{n=2}^{\infty}\bigl[W^*(n-1) - W^*(n)\bigr]\bigl[A^{\pi_W}(n) - A(n)\bigr], \tag{6.26}$$
where $A(n) := \min_\pi A^\pi(n)$ and the superscript opt denotes the application of an optimal policy for the problem. Glazebrook, Kirkbride and Ouenniche (2009) underscore the value of the bound in (6.26) by describing a simple example where it is achieved exactly and elucidates strong performance in the Whittle index heuristic $\pi_W$, yet where the Lagrangian relaxation for the problem concerned is far from tight, with $R^*$ exceeding $R^{\mathrm{opt}}$ by over 10 percent. We conclude this section by illustrating its use in giving an account of the light-traffic optimality of $\pi_W$. This result powerfully complements the earlier statement of $\pi_W$'s heavy-traffic optimality.
Theorem 6.7 The Whittle index heuristic $\pi_W$ is asymptotically optimal in a light-traffic limit in the sense that
$$\lim_{\lambda\to 0}\frac{R_\lambda^{\mathrm{opt}} - R_\lambda^{\pi_W}}{R_\lambda^{\mathrm{opt}}} = 0.$$
Proof. It is trivial that $0 \le B_{nn'} \le 1$ for all choices of n, n′ and hence that $A^{\pi_W}(2)$ is bounded above by the long-run proportion of time that the system is nonempty under $\pi_W$. However, it is plain that the number in system under $\pi_W$ is stochastically bounded above by the number present in an M/M/1 queue with arrival rate λ and service rate equal to $\min_s \mu_s$. We infer that
$$\lambda < \min_s \mu_s \;\Rightarrow\; A^{\pi_W}(2) \le \lambda\bigl(\min_s \mu_s\bigr)^{-1}.$$
It is easy to show from the definitions of the quantities concerned that $A^{\pi_W}(n)$ is decreasing in n and hence from the bound in (6.26) that
$$R_\lambda^{\mathrm{opt}} - R_\lambda^{\pi_W} \le \lambda\sum_{n=2}^{\infty}\bigl[W^*(n-1) - W^*(n)\bigr]A^{\pi_W}(n) \le \lambda W^*(1)A^{\pi_W}(2) \le \lambda^2 W^*(1)\bigl(\min_s \mu_s\bigr)^{-1}.$$
However, in the limit λ → 0 the optimal return $R_\lambda^{\mathrm{opt}}$ is of exact order λ, and the result follows.
It is possible to establish the optimality of πW in a range of other asymptotic regimes and for a range of extensions of the model discussed here. See Glazebrook et al . (2009) for details.
6.8
Indices for more general resource configurations
In Section 6.2 we considered a model in which, at each decision epoch, m of n restless bandits are made active. We conclude this chapter by pointing out that the methodology of Lagrangian relaxation allows us to develop indices which permit the construction of heuristic policies for considerably more complex resource configurations. To illustrate ideas, we shall consider a situation in which S units of some divisible resource (manpower, equipment, money) are available for distribution among n projects, where each individual project allocation at any epoch can be any number of units of resource between 0 (the project is passive) and S (the project is allowed to consume all of the available resource). Before proceeding to discuss a simple queueing control example, it will assist the
discussion if we modify the discussion around (6.6) by replacing the constraint (6.3) by
$$\sum_x z^1_x \le m/n,$$
with the consequential DP equation modified to
$$\phi(x) + g = \max_{u\in\{0,1\}}\Bigl\{r(x) - Wu + \sum_y \phi(y)P(y\mid x,u)\Bigr\}. \tag{6.27}$$
Trivially, (6.27) is obtained by subtracting W from both sides in (6.6). The average return g is now maximal for a problem in which returns from bandits are earned net of charges for bandit activation. The Lagrange multiplier W now has an economic interpretation as a charge for activity, with the Whittle index now a fair charge for activation rather than a fair subsidy for passivity. This trivial recasting of the formulation gives a more natural starting point for the more general resource allocation problems we would now like to consider.

Suppose that S servers are available to provide service at N stations, to which customers arrive in independent Poisson streams, with $\lambda_i$ the rate for station i. The choice $u_i = a$ represents an allocation of a of the S servers to station i and yields a service rate for station i of $\mu_i(a)$. The system controller uses information on the number of customers $n_i$ at each station i to determine optimal allocations of the servers to the stations at all times. The objective of such allocations is the minimization of the average linear holding cost rate incurred, with $c_i$ the cost of holding a single customer at station i for one unit of time. The problem in which each $\mu_i(\cdot)$ is increasing and convex is easy to solve, since it is straightforward to establish from the DP equations that there exist optimal policies which always allocate the S servers en bloc. Hence the problem becomes essentially one involving a single server, and is solved by the policy which allocates all S servers to whichever nonempty station has the largest associated value of $c_i\mu_i(S)$. An assumption that the $\mu_i(\cdot)$ are all increasing and concave gives a law of diminishing returns for increased allocations at all stations and seems much more plausible. Such an assumption will also encourage the distribution of the service team among several nonempty stations, not just one. We proceed to develop an index heuristic for this problem by creating a Lagrangian relaxation in which the system controller is free to allocate any number of servers (up to S) to each station and where there is a charge of W per unit of time for each server allocated. This relaxation may be decomposed into N single station problems in which the objective is the minimization of an aggregate of holding costs and charges for service. We drop the station identifier i and give the DP equation for the single station problem under the uniformization
λ + μ(S) = 1 as
$$g = \min_{u\in\{0,1,\dots,S\}}\Bigl\{cn + Wu + \lambda\bigl[\phi(n+1) - \phi(n)\bigr] + \mu(u)\bigl[\phi([n-1]^+) - \phi(n)\bigr]\Bigr\}. \tag{6.28}$$
We take forward the discussion in Section 6.2 by considering an optimal deterministic, stationary and Markov policy π(W) for (6.28) and the corresponding subset $E_a[\pi(W)] \subseteq \mathbb{Z}^+$ of station states for which the server allocation mandated by π(W) is a or less. The station is said to be indexable if there exists such a family of optimal policies for which, for each a ∈ {0, 1, ..., S − 1}, $E_a[\pi(W)]$ increases as W ranges over $\mathbb{R}^+$. When that is the case, we then have the corresponding index, defined by $W(a, n) = \inf\{W : n \in E_a[\pi(W)]\}$, interpreted as a fair charge for raising the number of servers at the station from a to a + 1 when n customers are there. Glazebrook, Hodge and Kirkbride (2010b) show that, for this model, each station is indeed indexable when service rates are concave, with W(a, n) decreasing in a (as is guaranteed from the definition of the index) and increasing in n. The latter is a special feature for this problem which is related to the existence of optimal policies for (6.28) which are monotone in state; that is, such that the minimizing u is nondecreasing in n.

Restoring the station indices, it follows from the above discussion that there exists an optimal policy for the Lagrangian relaxation of the N-station problem which, for each system state $\mathbf{n} = \{n_i, 1 \le i \le N\}$, increases the number of servers allocated to each station i until this reaches the level $a_i$ for which the corresponding fair charge $W_i(a_i, n_i)$ falls below the resource charge W. A natural greedy index heuristic $\pi_W$ for the original problem allocates the S servers in each system state $\mathbf{n}$ in decreasing order of the above station indices. The station achieving $\arg\max_i W_i(0, n_i)$ will be the first to receive a server. An alternative characterization of the resulting allocation $\pi_W(\mathbf{n}) = \mathbf{a}$ is that it maximizes the minimal index $\min_{i: a_i > 0} W_i(a_i - 1, n_i)$ over all admissible allocations $\mathbf{a}$. Numerical evidence suggests that the greedy index heuristic performs very strongly.
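By way of illustration, here is a minimal Python sketch (ours, not the authors') of the greedy index heuristic just described. It assumes a user-supplied function W(i, a, n) returning the fair charge $W_i(a, n)$; since each $W_i(\cdot, n_i)$ is decreasing in its first argument, handing out the servers one at a time to the station with the largest current marginal index reproduces the allocation in decreasing index order.

import heapq

def greedy_index_allocation(W, heads, S):
    """Allocate S servers over the stations in decreasing order of the
    station indices, as in the greedy heuristic pi_W.

    W(i, a, n) -- fair charge for raising station i's allocation from a to a + 1
                  when n customers are present (supplied by the user)
    heads      -- list of head counts n_i
    S          -- total number of servers available
    """
    N = len(heads)
    allocation = [0] * N
    # heapq is a min-heap, so store negated indices to pop the largest first
    heap = [(-W(i, 0, heads[i]), i) for i in range(N)]
    heapq.heapify(heap)
    for _ in range(S):
        _, i = heapq.heappop(heap)
        allocation[i] += 1
        if allocation[i] < S:
            # the next marginal index for station i is no larger, W(a, n) being decreasing in a
            heapq.heappush(heap, (-W(i, allocation[i], heads[i]), i))
    return allocation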
Exercises 6.1. Formulate the following single server models as restless bandits: • A SFABP with the additional feature that an additional cost ai (xi ) + bj (xj ) is incurred whenever processing is switched from bandit i in state xi to bandit j in state xj . • A family of alternative bandits, each of which is a member of one of K distinct classes. There are Bernoulli arrivals to each class, which are independent across classes. Each bandit is allowed just a single period
of processing, which may be terminated by the system controller at any time. When that period is terminated the bandit leaves the system. • A family of alternative bandits with Bernoulli arrivals as in the previous example but now with the additional feature that each arriving bandit has its own deadline for earning rewards. Once that deadline is reached, it leaves the system. 6.2. Review the admission control/routing model of Section 6.7. Prove the assertion that there exists a state monotone policy which solves (6.19). Use this fact to derive the relation n n n g(W ) = min (r + c)x μ min(x, σ ) + (W − d + c)n λ − cλ, n∈N
x=0
where nx is as described in the text of Section 6.7. Use the balance equations for the system to derive an expression for nn and show that it is decreasing in n. Use this fact to prove indexability. Hint: it may help you to review the proof of Theorem 6.4 in Section 6.5. 6.3. For the model of Section 6.7, use the Lagrangian relaxation to argue for the optimality of the Whittle index policy πW in the following asymptotic regimes: • Slow service: all service rates μs → 0 with other model parameters fixed. • Fast service from the station, station 1 say, whose index is maximal in the sense that W1 (0) = maxs Ws (0): service rate μ1 → ∞ with other model parameters fixed.
7
Multi-population random sampling (theory)
7.1
Introduction
An arm of the multi-armed bandit problem described as Problem 5 in Chapter 1 generates a sequence X1 , X2 , . . . of independent and identically distributed random variables taking the values 1 and 0 with probabilities θ and 1 − θ . The reward associated with the value X is a t−1 X if this is the outcome on the tth pull, counting pulls on every arm. This is one of the problems discussed in the present chapter, which is characterized by bandit processes defined by sequences of independent identically distributed random variables X1 , X2 , . . . When, as in the multi-armed bandit, these random variables themselves constitute a sequence of rewards, the resulting bandit process will be termed a reward process. Any function of the Xi s may also be taken to define a sequence of rewards, and thereby a bandit process. All bandit processes defined in this way will be termed bandit sampling processes. The bandit sampling processes defined by the function which takes the value 1 for the first Xi to be sampled with a value ≥ T , and 0 for all other Xi s, will in particular be investigated. In the following section it will be shown that a family of alternative bandit processes (in this chapter we shall be concerned only with simple families) of this type serves as a model for the situation where a number of populations are sampled sequentially with the aim of identifying as rapidly as possible an individual that achieves a target level T on some scale of interest. This type of bandit process will thus be termed a target process.
The Xi s for a given sampling process we suppose to be drawn from a distribution with an unknown, possibly vector-valued parameter θ belonging to a family D of distributions for which a density f (· | θ ) exists with respect to some fixed probability space (R, F1 , μ1 ). Adopting the standard Bayesian setup (e.g. see Barra, 1981), θ has a prior density π0 with respect to a probability space (R r , F2 , μ2 ), and f (x|θ ) is F2 -measurable for all x ∈ R. Thus the joint density for θ and X1 is π0 (θ )f (x|θ ), and it follows that θ has a density with respect to (R r , F2 , μ2 ) conditional on X1 taking the value x1 , which may be written as π1 (θ |x1 ) =
\pi_0(\theta)\,f(x_1\mid\theta)\Big/\!\int \pi_0(\phi)\,f(x_1\mid\phi)\,d\mu_2(\phi).
The density $\pi_1$ is termed the posterior density for θ after observing $X_1$ to take the value $x_1$. Similarly, after observing $X_i$ to take the values $x_i$ (i = 1, 2, ..., n), the posterior density for θ is
$$\pi_n(\theta\mid x_1,\dots,x_n) = \frac{\pi_{n-1}(\theta\mid x_1,\dots,x_{n-1})\,f(x_n\mid\theta)}{\int \pi_{n-1}(\phi\mid x_1,\dots,x_{n-1})\,f(x_n\mid\phi)\,d\mu_2(\phi)} = \frac{\pi_0(\theta)\prod_{i=1}^{n} f(x_i\mid\theta)}{\int \pi_0(\phi)\prod_{i=1}^{n} f(x_i\mid\phi)\,d\mu_2(\phi)}. \tag{7.1}$$
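For readers who want to experiment, (7.1) can also be applied numerically outside the conjugate families introduced below; the short Python sketch that follows (ours) approximates the posterior on a finite grid of θ values, which stands in for the measure $\mu_2$.

def posterior_on_grid(prior, likelihood, observations, theta_grid):
    """Sequential application of (7.1) on a discrete grid of theta values.

    prior(theta)         -- prior density pi_0 evaluated at theta
    likelihood(x, theta) -- sampling density f(x | theta)
    observations         -- the sampled values x_1, ..., x_n
    """
    weights = [prior(theta) for theta in theta_grid]
    for x in observations:
        weights = [w * likelihood(x, theta) for w, theta in zip(weights, theta_grid)]
        total = sum(weights)                      # normalizing constant
        weights = [w / total for w in weights]    # posterior weights, up to the grid spacing
    return weights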
The reader should not be alarmed by the appearance of the measures μ1 and μ2 in this account. We shall be concerned only with densities with respect to Lebesgue measure, counting measure and simple combinations of these two measures. These are the usual densities for continuous, discrete or mixed distributions, respectively. Equation (7.1) is sometimes written in the form πn (θ | x n ) ∝ π0 (θ )
\prod_{i=1}^{n} f(x_i\mid\theta),
the proportionality sign ∝ indicating that an F1n -measurable function of x n (= (x1 , x2 , .. . , xn )) has been omitted, one which may be determined from the fact that πn dμ2 = 1. This result is sometimes called the generalized Bayes’ theorem. Two assumptions will be made for the sake of computational tractability. They concern the important statistical concept of sufficiency. For present purposes, a statistic is any real-valued function defined for every positive integer n as a F1n -measurable function of X n = (X1 , . . . , Xn ). The statistics u1 , . . . , um are a sufficient set of statistics for random samples of size n if the distribution of Xn conditional on u1 (X n ), . . . , um (X n ) does not depend on θ . The following equivalent condition for sufficiency is due to Neyman (e.g. see Barra, 1981).
Theorem 7.1 The statistics $u_1, u_2, \dots, u_m$ are sufficient for random samples of size n iff, for almost all x and for suitably chosen functions d and h,
$$\prod_{i=1}^{n} f(x_i\mid\theta) = d\bigl(u_1(x^n), u_2(x^n), \dots, u_m(x^n), \theta\bigr)\,h(x^n),$$
where d is a Borel-measurable function defined in $\mathbb{R}^{m+r}$, h is $\mathcal{F}_1^n$-measurable, and $x^n = (x_1, x_2, \dots, x_n)$. The two simplifying assumptions are as follows.

Assumption 1 The family of distributions D is an exponential family. This means (e.g. see Ferguson, 1967) that the density f(x | θ) may be written in the form
$$f(x\mid\theta) = c(\theta)\exp\Bigl[\sum_{j=1}^{m} q_j(\theta)\,t_j(x)\Bigr]h(x).$$
Assumption 2 The prior density $\pi_0(\theta)$ may be written in the form
$$\pi_0(\theta) = k_0\,[c(\theta)]^{a}\exp\Bigl[\sum_{j=1}^{m} q_j(\theta)\,b_j\Bigr].$$
Most of the standard distributions do belong to exponential families, so Assumption 1 is not a serious restriction. It means, as Theorem 7.1 shows, that the statistics
$$\sum_{i=1}^{n} t_j(x_i) \qquad (j = 1, 2, \dots, m)$$
form a sufficient set. From equation (7.1) it therefore follows that the posterior density $\pi_n(\theta\mid x^n)$ depends on the observation vector $x^n$ only as a function of the sufficient statistics. This property gives the clue to the term sufficient, the point being that, for the purpose of calculating $\pi_n$, all the information derived from the observations is given by the sufficient statistics. Assumption 2 means that $\pi_n(\theta\mid x^n)$ may be written as
$$\pi_n(\theta\mid x^n) = k_n(x^n)\,[c(\theta)]^{a+n}\exp\Bigl[\sum_{j=1}^{m} q_j(\theta)\Bigl(b_j + \sum_{i=1}^{n} t_j(x_i)\Bigr)\Bigr], \tag{7.2}$$
where $k_n(x^n)$ is chosen so that $\int\pi_n(\theta\mid x^n)\,d\mu_2(\theta) = 1$. Thus, for any n and $x^n$, $\pi_n$ is the probability density for a distribution defined by the parameters $a + n$ and
$$b_j + \sum_{i=1}^{n} t_j(x_i) \qquad (j = 1, 2, \dots, m).$$
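In code, the conjugate updating implied by (7.2) is just bookkeeping on the hyperparameters. The sketch below (ours, with illustrative names) updates (a, b_1, ..., b_m) by adding n and the observed sufficient statistics; the Bernoulli case of Problem 5 is shown as a usage example under one possible choice of the functions $t_j$.

def conjugate_update(a, b, observations, t):
    """Update the hyperparameters of (7.2): a becomes a + n and each b_j gains
    the sufficient statistic sum_i t_j(x_i).

    t -- list of the functions t_1, ..., t_m of the exponential family
    """
    n = len(observations)
    new_b = [b_j + sum(t_j(x) for x in observations) for b_j, t_j in zip(b, t)]
    return a + n, new_b

# Bernoulli sampling: f(x | theta) = theta**x * (1 - theta)**(1 - x) can be put in
# exponential-family form with t_1(x) = x and t_2(x) = 1 - x, so b_1 and b_2 count
# successes and failures and the conjugate prior is a beta distribution.
a, b = conjugate_update(0, [0.0, 0.0], [1, 0, 1, 1], [lambda x: x, lambda x: 1 - x])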
The family P of distributions with densities of the form (7.2) is said to be conjugate to D, a concept discussed in detail by Raiffa and Schlaifer (1961). To restrict the prior distribution to P is often acceptable, since it typically allows the prior mean and variance of a parameter to be given arbitrary values, and is virtually essential for computational tractability. This sequential Bayesian setup fits into the bandit process framework if the state of a bandit sampling process is identified as the current (posterior) distribution for θ . For a target process we also need a completion state C, as in the jobs of Section 3.2, attainment of which means that an Xi above the target has been observed, and that all further rewards are zero rewards. The undiscounted reward r(S, π) from continuing the bandit sampling process when it is in state π is now the expected undiscounted reward yielded by the next Xi when π is the current density for θ . Thus
$$r(S,\pi) = \int g(x)\,f(x\mid\pi)\,d\mu_1(x), \tag{7.3}$$
where for a reward process g(x) = x, for a target process g(x) = 1 if x ≥ T and g(x) = 0 if x < T, and $f(\cdot\mid\pi) = \int f(\cdot\mid\theta)\,\pi(\theta)\,d\mu_2(\theta)$. The restriction of the prior distribution to a parametric conjugate family means that the state may be specified by the parameters of the current distribution for θ. Note that, for a target process, r(S, π) is the probability in state π that the next observation reaches the target. It will be referred to as the current probability of success (CPS). To refer to the current distribution of the unknown parameter θ as the state of a bandit process may seem a little odd. Certainly decision theorists would be more likely to reserve this description for θ itself. The reader may find it helpful to regard θ as the true or absolute state from an all-knowing divine standpoint, and π as the existential state summarizing the incomplete information about θ available to an earthly decision maker. It follows from the definition (2.7), and dividing by the discrete-time correction factor $\gamma(1-a)^{-1}$ defined in Section 2.8, which will be standard for all Markov sampling processes, that for any Markov bandit process B in state x,
$$\nu(B,x) = \sup_{\tau>0}\nu_\tau(B,x) \ge \nu_1(B,x) = r(B,x). \tag{7.4}$$
This corresponds to the facts that a reward rate per unit time of r(B, x) may be achieved by setting τ = 1, and that higher average reward rates may sometimes be achieved by sequentially adjusting τ so as to be higher or lower according to whether the prospects for high future rewards improve or deteriorate. For a bandit sampling process S, the relationship between ν(S, π) and r(S, π) is particularly instructive. Consider the limiting case when π represents precise knowledge of the true state θ . Thus the existential state coincides with the true state, and does not change as further Xi s are sampled (except when state C is
reached in the case of a target process). The expected reward on each occasion that an observation is made, except in state C, is therefore
$$r(S,\theta) = \int g(x)\,f(x\mid\theta)\,d\mu_1(x).$$
Thus, writing $\sigma = \min\{i : X_i \ge T\}$ and applying the discrete-time correction factor, it follows from (2.7) that
$$\nu_\tau(S,\theta)\ \begin{cases} = r(S,\theta) & \text{for a reward process if } 0 < \tau, \\ = r(S,\theta) & \text{for a target process if } 0 \le \tau \le \sigma, \\ < r(S,\theta) & \text{for a target process if } P(\tau > \sigma) > 0.\end{cases}$$
It follows that for either a reward or a target process,
$$\nu(S,\theta) = \sup_{\tau>0}\nu_\tau(S,\theta) = r(S,\theta).$$
Thus when the parameter θ is known precisely (7.4) holds with equality. This reflects the fact that there is no further information about θ to be learned from the process of sampling. Except in this limiting case, π expresses some uncertainty about θ , so that more reliable information about θ may be obtained by sampling, and this information then used to adjust τ so as to achieve a higher expected average reward rate than r(S, π). Thus the difference ν(S, π) − r(S, π) is caused by uncertainty about θ , and may be expected to be large when this uncertainty is large. Theorem 2.3 tells us that ν(S, π) increases with a, and so therefore does ν(S, π) − r(S, π), since r(S, π) is independent of a. This is because, the larger the discount factor, the more important are future rewards in comparison with the immediate expected reward r(S, π), and consequently the more important does it become to resolve any uncertainty about θ . The difference ν(S, π) − r(S, π) is one measure of the importance of sampling to obtain information, rather than simply to obtain the reward r(S, π). Kelly (1981) has shown that for the Bernoulli reward process (see Problem 5 and Sections 7.4 and 7.7) with a uniform prior distribution, the indices for any two states are ordered, for values of a sufficiently close to one, by the corresponding numbers of zero rewards sampled, provided these are different, irrespective of the numbers of successes. In contrast, the expected immediate reward r(S, π) is equal to the proportion of successes. This phenomenon of r(S, π) bearing virtually no relationship to the index ν(S, π), and hence to the policy which should be followed when more than one sampling process is available, means that the prospects of immediate rewards are dominated by the longer-term requirement to gather information which may be used to improve the rewards obtained later on. The remaining sections of this chapter develop the theory of sampling processes. In Section 7.2 a single-processor scheduling theorem is proved which establishes that a simple family of alternative target processes is an appropriate
model when the aim is to achieve a given target as soon as possible, irrespective of which process it comes from. Section 7.3 explores the circumstances in which the monotonicity condition for Proposition 2.7 is satisfied by a target process, thus leading to a simple expression for index values. Section 7.4 sets out two general methods for calculating index values when no simplifying monotonicity property holds, taking advantage of any location or scale invariance properties and assuming a conjugate prior distribution. This methodology is extended in Section 7.5 to semi-Markov sampling processes for which the sampling time for each observation has an exponential distribution. Section 7.6 describes the Brownian reward process, quoting asymptotic results due to Bather (1983) and Chang and Lai (1987), and showing the asymptotic relationship both of the process itself and of the asymptotic results just mentioned to the normal reward process. Section 7.7 shows that a result similar to Bather’s holds for other sampling processes, including target processes, if these are asymptotically normal in an appropriate sense. In Section 7.8 we discuss a type of bandit whose state evolves, when it is not frozen, as a diffusion process satisfying a stochastic differential equation. In both Sections 7.5 and 7.6, sampling processes are defined which allow decisions to be taken at any time. This, strictly speaking, takes us outside the scope of the theorems of Chapter 2. The theory of Section 3.3 for continuoustime jobs also does not apply, as we are not dealing with jobs as defined in Section 3.2. As in the case of continuous-time jobs, a complicating issue is that, under an index policy, effort may have to be shared. Bank and K¨uchler (2007) deal with this issue by proving that an optimal policy must follow the current leader among the Gittins indices while ensuring that whenever a bandit process does not receive the continuation control exclusively then its Gittins index is at its lowest value so far. This characterization leads to a proof of the index theorem for a simple family of alternative bandit processes (SFABP ) in continuous time. Chapter 8 describes in detail the methods and results of calculations of dynamic Gittins indices for the normal, Bernoulli and exponential reward processes, and for the exponential target process. For the Bernoulli reward process we give a MATLAB program (Section 8.4). For others, results are tabulated at the back of the book. These cases were selected mainly for their general interest, though the exponential target process is also worthy of note as the original bandit process which arose in the practical problem whose consideration led to the index theorem. This problem is that of selecting chemical compounds for screening as potential drugs from a number of families of similar compounds. The sampled values represent measurements of some form of potentially therapeutic activity. The target is a level of activity at which much more thorough testing is warranted. An exponential distribution of activity is often a reasonable assumption. Index values, which are closely related to the upper tail areas of distributions of activity, are, however, very sensitive to the form of the distribution, particularly if the target is higher than the activities already sampled. A procedure was therefore developed in which observations are first transformed to achieve as good a fit to the
exponential form as possible, and this type of distribution is then only assumed for the upper part of the distribution, all values below some threshold being treated as equivalent. This led to a model involving target processes based on a distribution with an unknown atom of probability at the origin, and an exponential distribution with an unknown parameter over R+ . Calculations of index values for target processes of this type, which will be described as Bernoulli/exponential target processes, have been reported by Jones (1975), and are described in general terms in Section 8.7. A description of the software package CPSDAI based on these ideas, which has been developed for use by chemists, is given in Gittins (2003).
7.2
Jobs and targets
In the previous section, a simple family of alternative target processes was described as a suitable model for a situation where several different populations may be sampled sequentially, and the aim is to find a sample value which exceeds the target as quickly as possible from any of the populations. In fact, target processes are jobs (see Section 3.2), and an index policy therefore maximizes the expected payoff from the set of jobs. This is an apparently different objective. The following theorem shows that the two objectives are actually equivalent. Theorem 7.2 For a discounted single machine scheduling problem in which each job yields the same reward on completion, any policy that maximizes the expected total reward must also maximize the expected reward from the first job completion. Proof. Let A be a job which yields a reward of 1 on completion, and B a bandit process which has the same properties as A except that instead of a reward on completion there is a cost of continuation. For an interval of duration h this cost is h + o(h) if the interval occurs before completion, and the cost is zero if the interval occurs after completion. For both A and B the discount parameter is γ . The theorem is a consequence of the following lemma, the superscripts distinguishing the indices (and in the proof other functions) referring to A from those referring to B. Lemma 7.3
$$\frac{\nu^A}{\gamma} + \frac{1}{\nu^B} = -1.$$
Proof. Let τ be the stopping time defined by a stopping set 0 which does not include the completion state C. Since A and B pass through the same sequence of states, τ is defined both for A and B. Let T be the (process) time when the process first reaches state C. From our assumptions about the reward for A and the costs for B, it follows that to determine ν A only stopping times of the form
τA = min(τ, T ) need be considered, and to determine νB only stopping times of the form τ if τ < T , τB = ∞ if τ ≥ T need be considered, in both cases for some τ > 0. We have τA WτAA = −RτBB = E e−γ s = γ −1 − γ −1 E e−γ τA
RτAA
=γ
=
WτBB
=
−1
−γ
−1
0
e
−γ τ
dF(τ, T ) − γ
−1
τ 0 tends to zero as γ tends to zero. This is sufficient for the applications made in this chapter.
7.3
Use of monotonicity properties
In general some values of Xi point to more favourable values of θ than do other values of Xi , and in consequence neither of the sequences ν(Si , πi ) or r(S, πi ) (i = 0, 1, 2, . . . ) is necessarily monotone. What can happen is that the prior density π0 is so favourable that, when S is a target process, any sequence of Xi s which is capable of making the posterior density πn even more favourable than π0 must include a value which exceeds the target T . This consideration leads us to look for prior densities ρ such that any sequence {Xi < T ; i = 1, 2, . . .} generates a corresponding sequence of posterior densities {πi }, π0 = π, such that r(S, π) ≥ r(S, πi )
(i ≥ 0).
Let us define a probability density π with this property to be favourable. From (7.3) it follows that, for a target process, r(S, π) ≥ 0 for any state defined by a density π. By definition of the completion state C, r(S, C) = 0. Thus if π is favourable, the sequence of rewards (or current probabilities of success) associated with the sequence of existential or completion states starting from π are all no greater than r(S, π). This is the condition for Proposition 2.7, and we have established the following result.

Proposition 7.4 If S is a target process and the probability density π for the parameter θ is favourable, then ν(S, π) = r(S, π).

Usually at least some prior probability distributions are favourable. For example this is true for the Bernoulli, normal and exponential target processes, a brief account of each of which follows. The last two are discussed in considerable detail by Jones (1975). The values assumed for the target T are not restrictive, in the Bernoulli case because there is no other nontrivial possibility, and in the other cases because of the invariance of the processes concerned under changes of scale, and in the normal case also of location. These invariance properties are discussed in Section 7.4.
Example 7.5 Bernoulli target process
T = 1. $P(X_i = 1) = 1 - P(X_i = 0) = \theta$. The family of conjugate prior distributions for θ consists of beta distributions with densities of the form
$$\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1} \qquad (0 \le \theta \le 1),$$
where α > 0 and β > 0. Thus any state π apart from the completion state may be represented by the parameters α and β, and it follows from (7.3), dropping S from the notation, that
$$r(\alpha,\beta) = \int_0^1 \theta\,\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1}\,d\theta = \frac{\alpha}{\alpha+\beta}.$$
Thus if $N = \min\{n : X_n = 1\}$, the sequence of states starting from (α, β) is {(α, β), (α, β + 1), (α, β + 2), ..., (α, β + N − 1), C}. Since
$$\frac{\alpha}{\alpha+\beta} \ge \frac{\alpha}{\alpha+\beta+n} \qquad (n \ge 0),$$
it follows that every prior distribution state (α, β) is favourable, and hence that
$$\nu(\alpha,\beta) = r(\alpha,\beta) = \frac{\alpha}{\alpha+\beta}.$$
(x ∈ R).
If the prior density for θ is the improper uniform density over the real line, it follows from equations (7.1) and (7.2) that the posterior density for θ after n values have been sampled with mean x¯ may be written as n (θ ∈ R). ¯ = (2πn−1 )−1/2 exp − (x¯ − θ )2 πn (θ |x) 2 Conjugate prior densities for θ are of this form, with n taking any (not necessarily integer) positive value. Thus an arbitrary state π (= C) may be represented by the parameters x¯ and n. We have f (x|π) = f (x|x, ¯ n)
∞ = f (x|θ )πn (θ |x)dθ ¯ −∞ ∞
n 1 (2π)−1/2 exp − (x − θ )2 (2πn)−1/2 exp − (x¯ − θ )2 dθ 2 2 −∞ 1/2 n n (x − x) ¯ 2 , = exp − 2π(n + 1) 2(n + 1)
=
MULTI-POPULATION RANDOM SAMPLING (THEORY)
183
and from (7.3)
∞
r(x, ¯ n) =
f (x|x, ¯ n)dx = x(1 ¯ + n−1 )−1/2 .
(7.6)
0
For a given initial state (x, ¯ n), the state after sampling {Xi < 0 : i = 1, 2, . . . , m} ¯ with mean X¯ is ((nx¯ + mX)(n + m)−1 , n + m). From (7.6) it follows that if x¯ ≥ 0 then ¯ r(x, ¯ n) ≥ r(nx¯ + mX)(n + m)−1 , (n + m), and hence that the state is favourable, so that ν(x, ¯ n) = r(x, ¯ n). Example 7.7 Normal target process, unknown mean and variance θ = (μ, σ ). f (x|μ, σ ) = σ −1 φ((x − μ)/σ ) (x ∈ R).
T = 0.
Conjugate prior densities for μ and σ may be written in the form 1 −n−1 2 2 2 ¯ s) ∝ σ exp − 2 (n − 1)s + n(x¯ − μ) + (x − μ) πn (μ, σ |x, , 2σ (μ ∈ R, σ > 0), which is the posterior after observations xi (i = 1, 2, . . . , n) with mean density x¯ and (n − 1)−1 (xi − x) ¯ 2 = s 2 , starting from an improper prior proportional to σ −1 . Also −n/2 (x − x) ¯ 2 n · f (x|πn ) = f (x|x, ¯ s, n) ∝ 1 + n + 1 (n − 1)s 2 (see Section 8.3 for intermediate steps), thus u = [n/(n + 1)]1/2 (x − x)s ¯ −1 has Student’s t-distribution with n − 1 degrees of freedom. Thus, as in the known variance case, the density f (x|πn ) is symmetrical about x. ¯ The variance of this density, however, now behaves rather differently. For small values of n and fixed s it decreases more rapidly than for known σ 2 as n increases, because of the increase in the degrees of freedom of the t-distribution. This alters the circumstances under which ν(πn ) = r(πn ) in a quite interesting way. For known σ 2 , if x¯ > 0 the modified value of x¯ after observing xn+1 is reduced unless xn+1 > 0, and this results in r(πn+1 ) < r(πn ), since x¯ is the mean of the density f (x|πn ). Admittedly the variance of f (x|πn+1 ) is smaller than that of f (x|πn ), which tends to increase the total probability above the target, zero, but not by enough to offset the decrease caused by the decrease in mean value. For the unknown variance case, on the other hand, the reduction in the variance from f (x|πn ) to f (x|πn+1 ) may, for small n, be sufficient to cause an increase in CPS although x¯ > 0 and xn+1 < 0. For similar reasons it may be impossible to
184
MULTI-ARMED BANDIT ALLOCATION INDICES
increase CPS with a negative xn+1 although x¯ < 0. Jones has carefully examined this phenomenon and concludes that, for the undiscounted case, if n > 2.75 ν(x, ¯ s, n) > r(x, ¯ s, n) iff x¯ < 0. If n < 2.75 there are values of x¯ which serve as counter-examples to both parts of this assertion. Example 7.8 Exponential target process T = 1. f (x|θ ) = θ e−θx
(x > 0).
Conjugate prior densities may be written in the form πn (θ |) = n θ n−1 e−θ / (n)
( > 0, n > 0),
which is the posterior density after observations xi (i = 1, 2, . . . , n) with total , starting from an improper prior density proportional to θ −1 . Thus
∞ f (x|πn ) = f (x|, n) = f (x|θ )πn (θ |)dθ 0
∞ n = θ n exp[−θ ( + x)]dx (n) 0 n n = . ( + x)n+1 n
∞ f (x|, n)dx = . r(, n) = +1 1 Thus r(, n) is an increasing function of , and the condition m Xi , n + m ≤ r(, n)|Xi < 1; i ≤ m = 1 P r + i=1
is therefore equivalent to n ( + m + 1)n+m ≥ ( + 1)n ( + m)n+m ,
(7.7)
since ( + m)n /( + m + 1)n+m is the supremum of the set of values of m r( + i=1 Xi , m + 1) which are consistent with the condition Xi < 1 (i ≤ m). Jones shows that if (7.7) holds for m = 1 then it also holds for m > 1, and that (7.7) holds if and only if r(, n) ≥ pn where, for all n, pn e−1 and pn → e−1 as n → ∞. Thus, using Proposition 7.4, ν(, n) = r(, n) ⇐⇒ r(, n) ≥ pn .
(7.8)
MULTI-POPULATION RANDOM SAMPLING (THEORY)
185
Jones goes on to discuss the Bernoulli/exponential target process for which f (x|p, θ ) = pθ e−θx , and P (X = 0) = 1 − p, where both p and θ are unknown parameters with conjugate prior distributions (i.e. beta and gamma distributions respectively). A joint conjugate prior for p and θ has parameters m, n and , and has the form of the posterior distribution for p and θ , starting from an improper prior density proportional to p −1 (1 − p)−1 θ −1 after observing m + n Xi s with m zeros and a total of . It turns out that ν(m, n, ) = r(m, n, ) ⇐⇒ r(m, n, ) ≥ pmn, where pmn → e−1 as n → ∞, for all m. The results of calculations of ν(, n) for the undiscounted exponential target process are described in Section 8.6. Jones reports calculations for the undiscounted Bernoulli/exponential target process. The method of calculation is set out in Section 8.7.
7.4
General methods of calculation: use of invariance properties
Very often none of the simplifying monotonicity properties of Section 2.11 holds for a bandit sampling process. This leaves us with two main approaches to the calculation of index values. The first approach, which we might call the direct approach, is via the defining equation for the index of a sampling process S in state π, Rτ (S, π) τ > 0 Wτ (S, π)
ν(S, π) = sup ντ (S, π) = sup τ >0
(7.9)
The second, or calibration approach uses the standard bandit processes as a measuring device as in the proof of Theorem 4.8. The functional equation (2.12) for the optimal payoff from a SFABP {S, }, consisting of the sampling process S in state π and a standard bandit process with parameter λ, takes the form
R(λ, π) = max λ + aR(λ, π), r(π) + a R(λ, πx )f (x|π)dμ1 (x) = max
λ , r(π) + a 1−a
R(λ, πx )f (x|π)dμ1 (x) ,
(7.10)
where R(λ, π) and r(π) are abbreviated notations for R({S, }, π) and r(S, π), and πx denotes the image of π under the mapping defined by Bayes’ theorem when x is the next value sampled. Note that π defines the state of {S, } as well as the state of S, since the state of does not change. From Theorems 2.1 and 2.10 it follows that ν(s, π) = λ if and only if it is optimal to select either S or for continuation; that is, if and only if the two expressions inside the square brackets on the right-hand side of equation (7.10) are equal. The calibration
approach consists of solving equation (7.10) for different values of λ, and hence finding the values of λ for which the two expressions inside the square brackets are equal as π varies throughout the family of distributions P. For a target process, the range of possible index values is from 0 to 1, and it is convenient to calibrate in terms of a standard target process for which the probability φ of reaching the target at each trial is known. For target processes our discussion will focus largely on the undiscounted case (i.e. a = 1 or γ = 0). The recurrence corresponding to (7.10) is
$$M(\phi,\pi,T) = \min\Bigl\{\phi^{-1},\ 1 + \int_{-\infty}^{T} M(\phi,\pi_x,T)f(x\mid\pi)\,d\mu_1(x)\Bigr\}, \tag{7.11}$$
where M(φ, π, T ) is the expected number of values needing to be sampled under an optimal policy to reach the target T , given two target processes, one in state π and the other a standard target process with known success probability φ. The direct approach involves maximizing with respect to a stopping set defined on an infinite state space. A complete enumeration of all the possibilities is therefore impossible, and some means of restricting the class of stopping sets to be considered must be found. For the Bernoulli reward process (i.e. an arm of the classical multi-armed bandit of Problem 5) this has been achieved by allowing the stopping set P0 ⊂ P defining a stopping time τ to depend on the process time, and by imposing the restriction τ ≤ T , where T is a non-random integer. Defining ν T (S, π) = sup ντ (S, π), 0 0. Thus the state π may be represented by the parameter values α and β, and it follows from (7.3) that for a Bernoulli reward process (dropping S from the notation),
$$r(\alpha,\beta) = \int_0^1 \theta\,\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1}\,d\theta = \frac{\alpha}{\alpha+\beta}.$$
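As a point of comparison for what follows, here is a rough Python sketch (ours, not the program of Section 8.4) of the calibration approach of Chapter 1 applied to this Bernoulli reward process: for a trial value λ of the standard bandit's parameter the payoff recursion is solved backwards from a truncation level, and the index is then located by bisection on λ. The truncation rule and the positive-integer restriction on (α, β) are assumptions of the sketch.

def gittins_bernoulli(alpha0, beta0, a=0.9, horizon=200, tol=1e-6):
    """Approximate Gittins index of state (alpha0, beta0), both positive integers,
    for the Bernoulli reward process with discount factor a, found by calibration
    against a standard bandit paying lam per step."""
    M = alpha0 + beta0 + horizon   # truncation level for alpha + beta

    def prefers_continuing(lam):
        retire = lam / (1.0 - a)
        # At the truncation level, take the better of retiring and sampling for ever.
        R = {(al, M - al): max(lam, al / M) / (1.0 - a)
             for al in range(alpha0, M - beta0 + 1)}
        for s in range(M - 1, alpha0 + beta0 - 1, -1):
            for al in range(alpha0, s - beta0 + 1):
                be = s - al
                p = al / s      # current probability of success in state (al, be)
                cont = p * (1.0 + a * R[(al + 1, be)]) + (1.0 - p) * a * R[(al, be + 1)]
                if (al, be) == (alpha0, beta0):
                    return cont >= retire     # indifference locates the index
                R[(al, be)] = max(retire, cont)

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if prefers_continuing(mid) else (lo, mid)
    return 0.5 * (lo + hi)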
Restricting $P_0(t)$ to be of the form (7.12), we now have, for positive integers α, β and N (= T), and N > α + β,
$$\nu^{N-\alpha-\beta}(\alpha,\beta) = \sup_{\phi}\ \frac{r(\alpha,\beta) + \sum_{t=1}^{N-\alpha-\beta-1} a^t \sum_{k=0}^{t} Q(\alpha,\beta,t,k,\phi)\,r(\alpha+k,\beta+t-k)}{1 + \sum_{t=1}^{N-\alpha-\beta-1} a^t \sum_{k=0}^{t} Q(\alpha,\beta,t,k,\phi)}, \tag{7.13}$$
where $Q(\alpha,\beta,t,k,\phi) = P\bigl[\pi_t = (\alpha+k,\beta+t-k) \cap \{\nu^{N-\alpha-\beta-s}(\pi_s) \ge \phi,\ 1 \le s \le t\}\mid \pi_0 = (\alpha,\beta)\bigr]$. For a given N the expression (7.13) leads to an algorithm along the following lines for calculating $\nu^{N-\alpha-\beta}(\alpha,\beta)$ for all non-negative integers α and β such that α + β < N.

(i) If α + β = N − 1, the stopping time τ in the definition of $\nu^{N-\alpha-\beta}(\alpha,\beta)$ must be equal to one. Thus
$$\nu^1(\alpha,\beta) = r(\alpha,\beta) = \frac{\alpha}{\alpha+\beta}. \tag{7.14}$$
(ii) Using (7.14), $Q(\alpha,\beta,t,k,\phi)$ may be calculated for α + β = N − 2, t = 1 and k = 0, 1. We have
$$Q(\alpha,\beta,1,1,\phi) = P[X_1 = 1\mid\pi_0=(\alpha,\beta)]\times\begin{cases}1 & \text{if } \nu^1(\alpha+1,\beta)\ge\phi,\\ 0 & \text{if } \nu^1(\alpha+1,\beta)<\phi,\end{cases}$$
$$Q(\alpha,\beta,1,0,\phi) = P[X_1 = 0\mid\pi_0=(\alpha,\beta)]\times\begin{cases}1 & \text{if } \nu^1(\alpha,\beta+1)\ge\phi,\\ 0 & \text{if } \nu^1(\alpha,\beta+1)<\phi,\end{cases}$$
and
$$P[X_1 = 1\mid\pi_0=(\alpha,\beta)] = \frac{\alpha}{\alpha+\beta}.$$
Values of ν N−α−β (α, β) for α + β = N − 2 may now be calculated from (7.13) using these quantities together with (7.14). (iii) Now knowing the function ν N−α−β (α, β) for α + β = N − 1 and α + β = N − 2, calculations similar to those described in stage (ii) of the algorithm give values of Q(α, β, t, k, φ) for α + β = N − 3, t = 1, 2, and k ≤ t. These, together with (7.14), may be substituted into (7.13) to give ν N−α−β (α, β) for α + β = N − 3. (iv) Similar calculations give, in turn, values ν N−α−β (α, β) for α + β = N − 4, N − 5, and so on, the final quantity to be calculated being ν N (1, 1). To carry out this entire calculation for a given N takes O(N 4 ) elementary steps, and the storage requirement is O(N 2 ). This may be compared with the calibration method for this problem described in Chapter 1, for which the number of elementary steps for a single value of λ (called p in Chapter 1) is O(N 2 ), and the storage requirement is O(N ). Thus the calibration method is better for large values of N . In fact, for large values of N both methods of calculation can be made more economical by storing the functions ν and Q (for the direct method) and R (for the calibration method) on reduced grids of values, and interpolating as necessary. This reduces the number of elementary steps to O(N 2 ) and the storage requirement to O(N ) for the direct method, and for the calibration method to O(N ) and O(1), respectively. Thus the calibration method retains its advantage for large N . A similar comparison between the two approaches, showing the calibration approach to be preferable for large N , may be carried out for other bandit sampling processes. The only large-scale application of the direct approach to date has in fact been for the Bernouilli reward process, with the results described in Section 8.4. The calibration method can often be simplified by exploiting the presence of a location or a scale parameter. A parametric family D of distributions for a random variable X is said (e.g. see Ferguson, 1967) to have (i) a location parameter μ if the distribution of X − μ does not depend on the parameter μ; (ii) a scale parameter σ if the distribution of X/σ does not depend on the positive parameter σ ; or (iii) joint location and scale parameters μ and σ if the distribution of (X − μ)/σ does not depend on the parameters μ and σ (> 0). Equivalently, for families D with densities f (x|θ ) with respect to Lebesgue measure, μ is
a location parameter if f(x|μ) = g(x − μ) for some g, σ is a scale parameter if $f(x\mid\sigma) = \sigma^{-1}g(x\sigma^{-1})$ for some g, and μ and σ are joint location and scale parameters if $f(x\mid\mu,\sigma) = \sigma^{-1}g[(x-\mu)\sigma^{-1}]$ for some g. For the remainder of this section densities with respect to Lebesgue measure will be assumed to exist. Location and scale parameters may be used to reduce the complexity of the functions ν(π), R(λ, π) and M(φ, π, T). Note that
$$R(\lambda,\pi) = \sup_{\tau}\bigl\{R_\tau(\pi) + \lambda(1-a)^{-1}E(a^\tau\mid\pi)\bigr\} \tag{7.15}$$
and
$$M(\phi,\pi,T) = \inf_{\tau}\bigl\{C_\tau(\pi,T) + \phi^{-1}Q_\tau(\pi,T)\bigr\}, \tag{7.16}$$
where the stopping time τ is in both cases the time at which the standard bandit process is first selected, Cτ (π, T ) is the expected number of samples drawn from the target process S before either time τ or the target T is reached, and Qτ (π, T ) = P (τ is reached before the target T ), in both cases starting from the state π. Since there are no state changes when a standard bandit process is selected, we may suppose that it is always selected, after time τ . As the following theorems and their corollaries show, considerable simplification results from the presence of location and/or scale parameters, sufficient statistics, and a conjugate prior distribution. The possible sequences of states through which a bandit sampling process may pass, starting from a given state, are in 1–1 correspondence with the sequences of Xi s. Thus stopping times which start when the process is in the given state may be defined as functions of the Xi sequences. In fact it will prove convenient in the proofs to refer to stopping times which occur after a given sequence of Xi values by the same symbol, whether the starting states are the same or not. If τ is a stopping time defined in terms of the Xi sequence, let τ (b, c) (b > 0) denote the stopping time for which bX1 + c, bX2 + c, . . . , bXτ + c is the sequence sampled before stopping. Lower case letters denote the values taken by the random variables represented by the corresponding upper case letters. Theorem 7.9 If the parameter μ for the family of distributions for a reward process is a location parameter, the sample mean X¯ is a sufficient statistic, and π0 is the (improper) uniform distribution over the real line, then πn may be ¯ and n, identified by its parameters x¯ (the value taken by X) R(λ + c, x¯ + c, n) = c(1 − a)−1 + R(λ, x, ¯ n) (c ∈ R), and ν(x, ¯ n) = x¯ + ν(0, n).
Proof. Since X¯ is a sufficient statistic it follows from Theorem 7.1 and equation (7.1) that πn depends on the vector of sampled values x only as a function of x, ¯ and may therefore be written as πn (μ|x). ¯ We have n πn (μ|x) ¯ ∝ [f (x|μ)] ¯ = [g(x¯ − μ)]n = [f (x¯ + c|μ + c)]n
∝ πn (μ + c|x¯ + c).
(7.17)
and hence πn (μ|x) ¯ = πn (μ + c|x¯ + c).
(7.18)
In (7.17) the proportionality sign covers multiplication by a positive function of x which is independent of μ. The first follows from (7.1) on putting proportionality n n = h(x , x , . . . , x )/ π0 (μ) = a constant, and since f (x |μ)/[f (x|μ)] ¯ i 1 2 n i=1 h(x, ¯ x, ¯ . . . , x), ¯ and the justification of the second proportionality is similar. The two equalities in (7.17) are consequences of the fact that μ is a location parameter. Equation (7.18) and the fact that μ is a location parameter give πn (μ|x) ¯ =
n+m i=n+1
f (xi |μ) = πn (μ + c|x¯ + c)
n+m
f (xi + c|μ + c).
(7.19)
i=n+1
Thus, writing M for the random variable taking the value μ, the joint distribution of M, Xn+1 , Xn+2 , . . . , Xn+m conditional on X¯ = x¯ is the same as the joint distribution of M − c, Xn+1 − c, Xn+2 − c, . . . , Xn+m − c conditional on X¯ = x¯ + c. It follows that E(a τ |x, ¯ n) = E(a τ (1,c) |x¯ + c), and ¯ n) = Rτ (1,c) (x¯ + c, n) − c[1 − E(a τ |x, ¯ n)]/(1 − a). Rτ (x, Thus Rτ (1,c) (x¯ + c, n) + (λ + c)(1 − a)−1 E(a τ (1,c) |x¯ + c, n) = c(1 − a)−1 + Rτ (x, ¯ n) + λ(1 − a)−1 E(a τ |x, ¯ n), and hence, taking the supremum over τ (or equivalently over τ (1, c)), it follows from (7.15) that R(λ + c, x¯ + c, n) = c(1 − a)−1 + R(λ, x, ¯ n).
(7.20)
The notion of calibration now provides the proof of the second part of the theorem. From (7.10) and Theorems 2.1 and 2.10, ν(x, ¯ n) is the unique λ for which λ = r (x, ¯ n) 1−a
nx¯ + x , n + 1 f (x |μ ) πn (μ |x¯ ) dx dμ. +a R λ, x+1
(7.21)
From (7.3) and (7.18), and since f (x + c|μ + c) = f (x|μ), we have r(x¯ + c, n) = c + r(x, ¯ n). It therefore follows from (7.20) and (7.21), again using (7.18) and f (x + c|μ + c) = f (x|μ), this time with c = −x, ¯ that ν(x, ¯ n) = x¯ + ν(0, n).
Corollary 7.10 (and 7.18, see later comment) The conclusions of the theorem hold if μ is a location parameter having a conjugate prior distribution with the parameters x¯ and n, where x¯ is a scale parameter and n ≥ 0, and if these parameters take the values (nx¯ + x)/(n + 1) and n + 1 when x is the next value sampled. Proof. The theorem follows from equations (7.18), (7.19) and (7.21), plus the fact that μ is a location parameter. These equations hold under the conditions of the corollary. Note that in Corollary 7.10, and the similar corollaries to Theorems 7.11 and 7.13, the parameter n is not restricted to integer values. Theorem 7.11 If the parameter σ (> 0) for the family of distributions for a reward process is a scale parameter, the sample mean X is a sufficient statistic, and π0 (σ ) ∝ σ −d for some d > 0, then πn may be identified by its parameters x¯ and n, R(bλ, bx, ¯ n) = bR(λ, x, ¯ n) (b > 0), and ν(x, ¯ n) = xν(1, ¯ n). Proof. This runs along similar lines to the proof of Theorem 7.9. We have ¯ ∝ σ −d [f (x|σ ¯ )]n = σ −d−n [g(xσ ¯ −1 )]n πn (σ |x) = b−n σ −d [f (bx|bσ ¯ )]n ∝ πn (bσ |bx), ¯
and hence ¯ = bπn (bσ |bx). ¯ πn (σ |x)
(7.22)
Thus, since σ is a scale parameter, πn (σ |x) ¯
n+m i=n+1
n+m
f (xi |σ ) = bm+1 πn (bσ |bx) ¯
f (bxi |bσ ).
(7.23)
i=n+1
Hence, writing for the random variable taking the value σ , the joint distribution of , Xn+1 , Xn+2 , . . . , Xn+m conditional on X¯ = x¯ is the same as the joint distribution of b−1 , b−1 Xn+1 , b−1 Xn+2 , . . . , b−1 Xn+m conditional on X¯ = bx. ¯ It follows that E(a τ |x, ¯ n) = E(a τ (b,0) |bx, ¯ n), and
Rτ (x, ¯ n) = b−1 Rτ (b,0) (bx, ¯ n).
Thus, using (7.15), R(bλ, bx, ¯ n) = bR(λ, x, ¯ n).
(7.24)
Calibration again provides a proof of the second part of the theorem. The unique λ satisfying equation (7.21) with σ in place of μ is ν(x, ¯ n). The equations (7.3), (7.22) and f (bx|bσ ¯ ) = b−1 f (x|σ ¯ ) now give r(bx, ¯ n) = br(x, ¯ n), and, again using (7.21), the parallel argument to that given for Theorem 7.9 now gives ν(x, ¯ n) = xν(1, ¯ n).
Corollary 7.12 (and 7.20, see later comment) The conclusions of the theorem hold if σ is a scale parameter having a conjugate prior distribution with the parameters x¯ and n, where x¯ is a scale parameter and n ≥ 0, and if these parameters take the values (nx¯ + x)/(n + 1) and n + 1 when x is the next value sampled. The proof follows that of Corollary 7.10, using equations (7.22) and (7.23) in place of equations (7.18) and (7.19), plus the fact that σ is a scale parameter. The proofs of the next two theorems and their respective corollaries are along similar lines to those of the last four results and are left as exercises. Theorem 7.13 If the parameters μ and σ (> 0) for the family of distributions for a reward process are joint location and scale parameters, σ is known, the sample mean X¯ is a sufficient statistic, and π0 (μ) is the improper uniform distribution over R, then πn may be identified by its parameters x¯ and n, R(bλ + c, bx¯ + c, n, bσ ) = c(1 − a)−1 + bR(λ, x, ¯ n, σ ) (b > 0, c ∈ R) and
ν(x, ¯ n, σ ) = x¯ + σ ν(0, n, 1).
In the proof, the place of equations (7.18) and (7.22) is taken by the equation ¯ σ ) = bπn (bμ + c|bx¯ + c, bσ ). πn (μ|x, Equations (7.19) and (7.23) are replaced by the equation πn (μ|x, ¯ σ)
n+m
f (xi |μ, σ )
i=n+1
= bm+1 πn (bμ + c|bx¯ + c, bσ )
n+m
f (bxi + c|bμ + c, bσ ).
i=n+1
Corollary 7.14 (and 7.22, see later comment) The conclusions of the theorem hold if μ and σ are joint location and scale parameters, with σ known and μ having a conjugate prior distribution with parameters x, ¯ σ and n (≥ 0), for which x¯ and σ are joint location and scale parameters, and x¯ and n take the values (nx¯ + x)/(n + 1) and n + 1 when x is the next value sampled. Theorem 7.15 If the parameters μ and σ (> 0) for the family of distributions for a reward process are joint location and scale parameters, the sample mean X¯ 2 −1 ¯ 2 are sufficient statistics, and π0 (μ, σ ) and variance S = (n − 1) (Xi − X) −d ∝ σ for some d > 0, then πn may be identified by the parameters x, ¯ s and n, ¯ s, n) R(bλ + c, bx¯ + c, bs, n) = c(1 − a)−1 + bR(λ, x,
(b > 0, c ∈ R)
and ν(x, ¯ s, n) = x¯ + sν(0, 1, n). In the proof, the place of equations (7.18) and (7.22) is taken by the equation πn (μ, σ |x, ¯ s) = bπn (bμ + c, bσ |bx, ¯ bs).
(7.25)
Equations (7.19) and (7.23) are replaced by the equation πn (μ, σ |x, ¯ s)
n+m
f (xi |μ, σ ) = bm+1 πn (bμ + c, bσ |bx¯ + c, bS)
i=n+1
×
n+m
f (bxi + c|bμ + c, bσ ).
i=n+1
Corollary 7.16 (and 7.24, see later comment) The conclusions of the theorem hold if μ and σ are joint location and scale parameters, having a conjugate prior distribution with parameters x, ¯ s (> 0) and n (≥ 0) which satisfies equation (7.25)
and if these parameters change by the same rules as the sample mean, standard deviation and number of observations, as further values are sampled. The next four theorems are the counterparts of Theorems 7.9, 7.11, 7.13, and 7.15 for target processes, and have similar proofs. Each theorem has a corollary which may be written out in the words of the corollary of its counterpart, and proved along similar lines. These corollaries will be referred to as Corollaries 7.18, 7.20, 7.22 and 7.24, respectively. Theorem 7.17 If the parameter μ for the family of distributions for a target process is a location parameter, the sample mean X¯ is a sufficient statistic, and π0 is the (improper) uniform distribution over the real line, then M(φ, x, ¯ n, T ) = M(φ, x¯ + c, n, T + c)
(c ∈ R),
and ν(x, ¯ n, T ) = ν(x¯ − T , n, 0). Theorem 7.19 If the parameter σ (> 0) for the family of distributions for a target process is a scale parameter, the sample mean X¯ is a sufficient statistic, and π0 (σ ) ∝ σ −d for some d > 0, then M(φ, x, ¯ n, T ) = M(φ, bx, ¯ n, bT ) (b > 0),
and ν(x, ¯ n, T ) = ν
x¯ , n, 1 . T
Theorem 7.21 If the parameters μ and σ (> 0) for a target process are joint location and scale parameters, σ is known, the sample mean is a sufficient statistic, and π0 (μ) is the improper uniform distribution over R, then M(φ, x, ¯ n, σ, T ) = M(φ, x¯ + c, n, bσ, bT + c) (b > 0, c ∈ R),
and ν(x, ¯ n, σ, T ) = ν
x¯ − T , n, 1, 0 . σ
Theorem 7.23 If the parameters μ and σ (> 0) for a target process are joint location and scale parameters, the sample mean and variance are sufficient statistics, and π0 (μ, σ ) ∝ σ −d for some d > 0, then M(φ, x, ¯ s, n, T ) = M(φ, bx¯ + c, bs, n, bT + c) (b > 0, c ∈ R),
and ν(x, ¯ s, n, σ, T ) = ν
x¯ − T , 1, n, 0 . s
7.5
Random sampling times
A sampling process for which the times U1 , U2 , . . . needed to sample a value are independent identically distributed random variables is an example of a semi-Markov bandit process. Thus (Theorem 2.1) optimal policies for sampling sequentially from a family of such processes are again index policies. SemiMarkov sampling processes are of interest as models for whole areas of research in a pharmaceutical or agrochemical laboratory. A family of two semi-Markov reward processes, for example, serves as a model for the herbicide and fungicide programmes. Each sampled value is, in accountants’ language, the expected net present value of a potential herbicide or fungicide at the time of its discovery. The aim of the model would be to indicate the relative priorities of the two research programmes. No index values for semi-Markov reward processes have as yet been calculated, though this would be feasible, particularly when the problem may be reduced by exploiting invariance properties. Provided the distributions of the Ui s and of the Xi s are independent, and the prior distributions for their parameters are also independent, the theorems of Section 7.4 continue to hold, with the obvious modification that the state space must now specify a posterior distribution for the parameters of the U -distribution. We now also have an analogue of Theorem 7.9, referring to the U -distribution instead of the X-distribution. To state this theorem the notation is extended by letting ρn be the posterior density for the parameter (∈ R+ ) of the U -distribution, and including the discount parameter γ in the notation. Since the prior densities ρ0 and π0 are by assumption independent, it follows that ρn and πn are also independent. Theorem 7.25 If the parameter σ (> 0) for the family of distributions for the sampling time of a semi-Markov reward process is a scale parameter, the sample mean U¯ is a sufficient statistic, and ρ0 (σ ) ∝ σ −d for some d > 0, then ρn may be identified by its parameters u¯ and n, and for b > 0, R(b−1 λ, bu, ¯ n, πn , b−1 γ ) = R(λ, u, ¯ n, πn , γ ), and ν(u, ¯ n, πn , γ ) = γ ν(uγ ¯ , n, πn , 1). Proof. The two parts of the theorem are direct consequences of changing the time unit, for the first part by the factor b, and for the second part by the factor γ . A further modification is to allow a switch from a sampling process to occur at any time, rather than restricting switches to the times when a new value is obtained. This works out quite neatly when the U -distribution is exponential, which means that the times when new values are obtained form a Poisson process in process time.
196
MULTI-ARMED BANDIT ALLOCATION INDICES
Let θ be the exponential parameter, so that the probability of observing a new value in any time interval (t, t + δt) is θ δt + o(δt). Let θ have the improper prior density ρ(θ |0, 0) ∝ θ −1 . Suppose that after a time t, n values have been sampled at times t1 , t1 + t2 , · · · , t1 + t2 + · · · + tn . The probability density for this outcome is n −θt θ e i exp −θ (t − ti ) = θ n e−θt , i=1
exp[−θ (t − ti )] being the probability that the (n + 1)th sampling time is greater than t − ti . Thus it follows from the generalized Bayes’ theorem that the posterior density for θ may be written as ρ(θ |t, n) ∝ θ −1 θ n e−θt ∝ t n θ n−1 e−θt / (n), which is a gamma density with parameters n and t. Thus the probability density for the further time up to the next observed value is
∞ nt n f (u|t, n) = [(n)]−1 t n θ e−θn θ n−1 e−θt dθ = . (t + n)n+1 0
Theorem 7.26 For a reward process with unrestricted switching, an exponential distribution with parameter θ for the sampling time, and an improper prior density p(θ |0, 0) ∝ θ −1 , the posterior density for θ after n observations in a time t may be identified by its parameters n and t, and for b > 0, R(b−1 λ, bt, n, πn , b−1 γ ) = R(λ, t, b, πn , γ ), and ν(t, n, πn , γ ) = γ ν(tγ , n, πn , 1). Proof. As for Theorem 7.25.
The next theorem states some fairly obvious properties of the function ν(t, n, πn , γ ) which are needed when calculating it. Theorem 7.27 If a sampling process with unrestricted switching (i) yields a reward g(x) when the value x is sampled, and g(x) increases with x; (ii) has an exponential sampling-time distribution with a parameter θ which has an (improper) prior density proportional to θ −1 ; and (iii) has a positive discount parameter γ , then ν(t, n, πn , γ ) is a decreasing function of t. Moreover, if g(x) is constant, so that the distribution of the sampled values is irrelevant and ν(t, n, πn , γ ) reduces to ν(t, n, γ ), ν(t, n, γ ) is an increasing function of n.
MULTI-POPULATION RANDOM SAMPLING (THEORY)
197
Proof. In state (t, n, πn ), the distribution function F (u|t, n) for the time U until the next observed value is 1 − t n (t + u)−n . This is decreasing in t and increasing in n. Since F (u|t, n) is a continuous function of u, for a given initial state (t1 , m, πm ) a realization of the sampling process may be defined by the sequences P1 , P2 , . . . and X1 , X2 , . . . instead of U1 , U2 , . . . and X1 , X2 , . . ., where i−1 Pi = F Ui t1 + Uj , m + i − 1 . j =1
Note that this definition means that Pi has a uniform distribution on the interval [0, 1]. For the given initial state a stopping time τ may be defined by the nonnegative functions wi {[0, 1]i−1 × [0, ∞]i−1 ∩ {wj > 0, j < i} → wi (P1 , P2 , . . . , Pi−1 , X1 , X2 , . . . , Xi−1) )} (i = 1, 2, . . .), i−1 where τ = mini j =1 Uj + wi : wi < Ui , or equivalently by the functions taking values in the interval [0, 1] pi {[0, 1]i−1 × [0, ∞]i−1 ∩ {pj > 0, j < i} → pi (P1 , P2 , . . . , Pi−1 , X1 , X2 , . . . , Xi−1) )} (i = 1, 2, . . .), where
i−1 Uj , m + i − 1 . pi = F wi t1 + j =1
Thus wi is the maximum time allowed between sampling Xi−1 and sampling Xi , before stopping occurs, and pi is the conditional probability that Xi is sampled before stopping occurs. Both wi and pi depend on the history and state of the process when Xi−1 is sampled. If the functions pi (i = 1, 2, . . . ) define τ starting from the state (t1 , m, πm ), now let σ be the stopping time for the sampling process starting from the state (t2 , m, πm ) which is defined by the same functions pi . Since m and πm are unchanged, the joint distributions of the Pi s and Xi s are the same for both starting states. The definition of σ means that the same is true for the joint distributions of the Pi s, Xi s and Yi s, where Yi = 1 if Xi is observed before stopping occurs, and Yi = 0 otherwise. Corresponding realizations starting from (t1 , m, πm ) and using τ , and starting from (t2 , m, πm ) and using σ , differ only in the values, Ui1 and Ui2 , respectively, of the sampling times of the Xi s in the two cases. If t2 < t1 , then, for given values of Pi (i ≥ 1), Ui1 > Ui2 . This is a consequence of
198
MULTI-ARMED BANDIT ALLOCATION INDICES
the fact that F (s|t, n) is a decreasing function of t. Thus Rτ (t1 , m, πm , γ ) = E
∞
−γ
g(Xi )Yi exp
∞
g(Xi )Yi exp
Uj 1
j =1
i=1
<E
i
−γ
i
Uj 2
= Rσ (t2 , m, πn , γ ).
j =1
i=1
Also σ < τ , since as well as Ui1 > Ui2 we have, in obvious notation, wi1 > wi2 , so that Wτ (t1 , m, πm , γ ) = γ −1 E(1 − e−γ τ ) > γ −1 E(1 − e−γ σ ) = Wσ (t2 , m, πm , γ ). Thus Rτ (t1 , m, πm , γ ) τ > 0 Wτ (t1 , m, πm , γ )
ν(t1 , m, πm , γ ) = sup
Rσ (t2 , m, πm , γ ) = ν(t2 , m, πm , γ ), W σ (t2 , m, πm , γ ) σ >0
< sup
as required. For the case g(x) = a constant, a virtually identical argument, which now compares realizations starting from states (t, n1 ) and (t, n2 ), and uses the fact that F (u|t, n) is increasing in n, shows that if n2 < n1 , then ν(t, n1 , γ ) > ν(t, n2 , γ ).
As a first example, suppose the rewards are all known to be equal to 1, so that the only uncertainty concerns the intervals between them. Thus πn may be dropped from the notation, as may γ , being a constant throughout a set of calculations. Index values for other values of γ then follow from Theorem 7.26. From Theorem 7.27 it follows that ν(t, n) is a decreasing function of t and an increasing function of n. Thus, given γ (> 0) and n, for some sn ν(t, n) λ according as t sn , and sn increases with n. The index ν(t, n) may be determined by calculating sn (n > 0) for different values of λ. For a given λ this may be done by the calibration procedure using the functional equation (2.13) for the optimal payoff from a SFABP consisting of the reward processes in the state (t, n) and a standard bandit process with parameter λ. This takes the form w nt n λt n R(λ, t, n) = sup e−γ (u−t) [1 + R(λ, u, n + 1)] n+1 du + e−γ (w−t) . u γ wn w≥t t
MULTI-POPULATION RANDOM SAMPLING (THEORY)
199
The supremum occurs at w = max(t, sn ). Thus, differentiating with respect to w to find the maximum, sn 1 1 + R(λ, sn , n + 1) = λ + . n γ Suppose now that the rules of the SFABP are modified so that no switching is allowed for n > N . Thus, if R(λ, t, n, N ) is the corresponding optimal payoff function, R(λ, t, n, N ) = max(Payoff if is permanently selected, Payoff if S is permanently selected) = max(λ, nt −1 )γ −1
(n ≥ N ).
This follows since
∞
Payoff if is permanently selected =
(7.26)
e−γ s λds = λγ −1 ,
0
and Payoff if S is permanently selected
∞ (Payoff if S is permanently selected|θ )ρ(θ |t, n)dθ =
0 ∞
= 0
n θ t n θ n−1 e−θt · dθ = . γ (n) tγ
The corresponding functional equation is, for 0 < n ≤ N − 1, w nt n R(λ, t, n, N ) = sup e−γ (u−t) [1 + R(λ, u, n + 1, N )] n+1 du u w≥t t λt n , (7.27) + e−γ (w−t) γ wn the supremum being at w = max(t, snN ), and, differentiating for the maximum, snN 1 1 + R(λ, snN , n + 1) = λ + . (7.28) n γ From (7.26) we have snN = nλ−1 (n > N − 1). Values of snN and of the function R(λ, ·, n, N ) may be calculated successively for n = N − 1, N − 2, . . . , 1 from (7.28), and then substituting for R(λ, ·, n + 1, N ) in the right-hand side of equation (7.27), using (7.26) to start the process. Clearly snN increases
200
MULTI-ARMED BANDIT ALLOCATION INDICES
with N and tends to sn as N → ∞. These calculations have not actually been carried out, but would be along similar lines to those described in Chapter 8. As a second example of a reward process with an exponential distribution of sampling times and unrestricted switching times, suppose that the rewards sampled also have an exponential distribution. Let θ1 and θ2 be the respective parameters of the two exponential distributions, each with (improper) prior densities proportional to θi−1 (i = 1, 2). The generalized Bayes’ theorem shows that πn is a gamma density with parameters n and , the sum of the first n sampled values. Thus the optimal payoff function for the SFABP formed by the reward process S and a standard bandit process with the parameter λ may be written as R(λ, t, n, , γ ). Theorem 7.26 gives ν(t, n, , γ ) = γ ν(tγ , n, , 1), so calculations are required only for one value of λ, which will be supposed equal to 1 and dropped from the notation. The obvious adaptation of Theorem 7.11 gives ν(t, n, , γ ) = ν(t, n, 1, γ ); this means that calculations are required only for one value of λ, which will also be supposed equal to 1 and dropped from the notation, so that R(λ, t, n, , γ ) becomes R(t, n, ). The functional equation is w
x [x + R(u, n + 1, + x)] e−(u−t) R(t, n, ) = sup w≥t
0
t
n n nt −(w−t) t × dx du + e . ( + x)n+1 un+1 wn n
n
(7.29)
From Theorems 7.27 and 7.11 (adapted), it follows that ν(t, n, ) is a decreasing function of t and an increasing function of (its dependence on n is not so clear, as this is a parameter in both posterior distributions). Thus, given n and , (7.30) ν(t, n, ) 1 according as t sn (), for some sn () which increases with . Since ν(t, n, ) = ν(t, n, 1), the index function may be easily determined from sn (). It follows from (7.30) that the supremum in (7.29) is attained when w = max[t, sn ()]. Differentiation with respect to w shows that sn () + Q(sn (), n, ) = 1 + n−1 n
where
∞
Q(u, n, ) = 0
R(u, n + 1, + x)
(n > 1),
(7.31)
n n dx. ( + x)n+1
As in the previous example, arbitrarily close approximations to sn () may be obtained by backwards induction from N , when the rules of the SFABP are modified to prohibit switching for n > N . If R(t, n, , N ) is the corresponding optimal payoff we now have R(t, n, |, N ) = max(1, t −1 ).
(7.32)
MULTI-POPULATION RANDOM SAMPLING (THEORY)
201
The corresponding functional equation is identical to (7.29), apart from the additional argument N on which the functions R and Q now also depend. The calculations for sn () now follow the pattern of those for sn in the previous example, with repeated use of the finite switching-horizon versions of (7.31) and (7.29) in turn, for n = N − 1, N − 2, . . . , 2. Note that this means iterating with a function of two variables, R(·, n·, N ), rather than of one variable, as in the case of rewards all equal to 1.
7.6
Brownian reward processes
The Brownian motion process is a random process defined on R+ , having independent normal increments on disjoint intervals, and such that (0) = 0, E(t) = θ t (t > 0), and Var[(u) − (t)] = s 2 (u − t) (0 < t < u). A standard Brownian motion process (or Wiener process) has θ = 0, s = 1, and we denote it by W . No physical particle carries out Brownian motion, though the large number of discrete, randomly directed shocks to which a gas molecule is subject mean that, over any time interval which is not extremely short, its displacement is well approximated by Brownian motion. Let us now define a Brownian reward process, in terms of , which has expected total discounted reward up to time T of
T
T
T e−γ t d (t) = e−γ t Ed (t) = e−γ t θ dt = θ γ −1 1 − e−γ T , E 0
and
Var 0
T
0
e−γ t d (t) =
T
0
e−γ t Var d (t) =
0
T
e−γ t s 2 dt = s 2 γ −1 1 − e−γ T .
0
Note that (T ) is a sufficient statistic for the process up to time T . This follows from Theorem 7.1 if we write out the joint density function for any (t1 ) (1 ≤ i ≤ r), where t1 < t2 < · · · < tr = T . The state of the process at time ¯ T ), where ¯ = (T )/T . The index for a T may therefore be expressed as (, 2 ¯ T ) will be Brownian reward process with parameters s and γ in the state (, ¯ written as νB (, T , s, γ ). If we decrease the unit of time by the factor γ , this ¯ γ T , γ −1 s 2 and 1. The maximized reward rate ¯ T , s 2 and γ to γ −1 , changes , νB is also decreased by the factor γ , so that ¯ T , s, γ ) = γ νB (γ −1 , ¯ γ T , γ −1/2 s, 1). νB (,
(7.33)
Interestingly, νB may be found from an optimal stopping problem, as we now briefly explain. The idea is due to Bather (1983). Let Z be a random process defined on the non-negative real line R+ and taking values in R2 such that, for some fixed (u, v) ∈ R2 Z(t) = (u + W (t), v − t) where W is a Wiener process.
(0 < t < v),
202
MULTI-ARMED BANDIT ALLOCATION INDICES
Let τ be a stopping time for Z (see Section 2.3) defined by a closed stopping subset of R2 , and let r be a real-valued function defined on R2 . Finally let f (u, v) = E r(Z(τ ))|Z(0) = (u, v) . Clearly, f (u, v) = r(u, v),
(u, v) ∈ .
(7.34)
Under regularity conditions it may be shown that ∂f ∂r = ∂u ∂u
and
∂f ∂r = , ∂v ∂v
(u, v) ∈ bd( ),
(7.35)
where bd( ) denotes the boundary of . It may also be shown that f satisfies the heat equation ∂f 1 ∂ 2f , = 2 2 ∂u ∂v
(u, v) ∈ / .
(7.36)
Suppose we wish to choose so as to maximize f (u, v). It may be shown along the lines of the proof of Lemma 2.2 that this is achieved simultaneously for all (u, v) ∈ R2 when = (x, y) : sup f (x, y) = r(x, y) .
¯ 1/T ). In pursuit of the Suppose γ = s = 1 and the prior of θ is normal N (, ¯ Gittins index νB (, T , 1, 1) let us look for the least retirement reward M such that τ τ −t ¯ M = sup E 0 e−t d(t) + Me−τ = sup E 0 (t)e dt + Me−τ , (7.37) τ
τ
¯ T ) at time 0. where expectation is conditional on θ having a normal prior N (, Note that for any stopping time τ
∞ −t ¯ )e−τ . ¯ dt = (τ (t)e E For τ = 0 this is E (7.37) we derive
∞ 0
τ −t ¯ Thus, subtracting ¯ from both sides of ¯ dt = . (t)e
¯ = sup E M − (τ ¯ ) e−τ = sup E max M − (τ ¯ ), 0 e−τ , M − τ
(7.38)
τ
where we can introduce a lower bound of 0 since it is always an option to never stop. ¯ for the We now make use of the interesting fact that the posterior mean parameter θ of a Brownian process at time t may itself be regarded as a Brownian process with −t −1 as the time variable. This means that, with a simple change
MULTI-POPULATION RANDOM SAMPLING (THEORY)
203
of variables, the stopping problem on the right-hand side of (7.38) is equivalent to the stopping problem described above for a process Z(t) when we take r(u, v) = max(u, 0) exp(v −1 ).
(7.39)
We explain how Bather has used this equivalence at the end of this section. Before doing so, let us return to the normal reward process with known variance σ 2 , and seek to compare its Gittins index and the Gittins index for a Brownian reward process. We may in the first instance assume that the normal reward process has σ 2 = 1, since if this is not true for X it is true for the transformed observation X = σ −1 X. The single unknown parameter μ is thus the mean of the distribution and π0 is taken to be the improper uniform density over the real line. We have f (x|μ) = (2π)−1/2 exp − 12 (x¯ − μ)2 ; hence μ is a location parameter. Also n f (xi |μ) = exp − n2 (x − μ)2 (2π)−n/2 exp[− 12 ( xi2 − nx¯ 2 )]; i=1
hence X¯ is a sufficient statistic by Theorem 7.1. ¯ ∝ π0 (μ) πn (μ|x)
n
f (xi |μ) ∝ (2πn−1 )−1/2 exp − n2 (x¯ − μ)2 ,
i=1
¯ is normal with mean x¯ and variance n−1 , using Bayes’ theorem. Thus πn (μ|x) and conjugate prior densities for μ are of this form, with n not restricted to integer values. By Theorem 7.9 ν(x, ¯ n) = x¯ + ν(0, n). If the observations are now rescaled by multiplying by σ , this has the effect of multiplying both x¯ and ν(x, ¯ n) by σ , so that, extending the notation to indicate the dependence on the variance σ 2 , ν(x, ¯ n, σ ) = x¯ + σ ν(0, n, 1). This has already been noted as part of Theorem 7.13. Note that σ is a scale parameter. Within a Brownian reward process we can embed a normal reward process. Let
θ e−γ t dt = θ γ −1 (1 − e−γ ), = T /n, μ =
Xi =
0 i
(t) exp{−γ [t − (i − 1)]}dt
(i−1)
σ 2 = Var Xi = s 2 γ −1 (1 − e−γ ).
(i = 1, 2, . . . , n),
204
MULTI-ARMED BANDIT ALLOCATION INDICES
The Xi s form a normal reward process with the discount factor a = e−γ . The posterior distribution for μ at time T if only the Xi s are observed is therefore ¯ (1 − a)−1 , s 2 γ (1 − N (x, ¯ σ 2 /n), so that the posterior distribution for θ is N (xγ −1 −1 a) n ). The posterior distribution for θ at time T when the whole process (t) is observed is obtained by letting n → ∞, and is therefore N ((T )/T , s 2 /T ). It follows that the posterior distribution for θ after observing the first n Xi s to have the mean x¯ is the same as for the fully observed Brownian motion at time T ∗ = nγ −1 (1 − a) with (T ∗ ) = nx. ¯ The index for the normal reward process in the state (x, ¯ σ 2 /n) must, then, be less than the index for the Brownian reward process in the state ((T ∗ )/T ∗ , s 2 /T ∗ ), since the first of these processes differs from the second only in (i) providing incomplete information about (t); and (ii) restricting the allowed values of the stopping time τ with respect to which the index maximizes the expected reward rate. Thus, using subscripts N and B to distinguish the normal and Brownian reward processes, and multiplying νN by the discrete-time correction factor γ (1 − a)−1 (see Section 2.8), γ νN (x, ¯ n, σ, a) < νB ((T ∗ )/T ∗ , s, γ ) 1−a n(1 − a) σ γ 1/2 xγ ¯ , , = νB ,γ . 1−a γ (1 − a)1/2
(7.40)
We have shown that νN (x, ¯ n, σ, a) = x¯ + σ νN (0, n, 1, a),
(7.41)
and it follows on letting n → ∞ that νB ((T )/T , T , s, γ ) = (T )/T + sνB (0, T , 1, γ ).
(7.42)
Thus, using (7.33), (7.41) and (7.42), the relationship (7.40) between νN and νB may be written in the form νN (0, n, 1, a) < (1 − a)1/2 νB (0, n(1 − a), 1, 1).
(7.43)
The upper bound which (7.43) provides for νN is close when the set of allowed stopping times for the Brownian motion in which the normal process is embedded are close together. For a fixed value of T = n(1 − a), this means that the bound improves as a → 1, since = −γ −1 loge a is the interval between these times and, as noted in the derivation of (7.33), the factor γ −1 may be removed by changing the units in which time is measured. Thus, writing u(·) = νB (0, ·, 1, 1), we have established, admittedly rather informally, the following theorem. Theorem 7.28 (1 − a)−1/2 νN (0, T /(1 − a), 1, a) < u(T ) and → u(T ) as a → 1. Calculations of the function u are reported in Section 8.2. It has the properties set out in the following theorem.
MULTI-POPULATION RANDOM SAMPLING (THEORY)
205
Theorem 7.29 (i) T 1/2 u(T ) is decreasing in T . (ii) T u(T ) ≤ 2−1/2 and → 2−1/2 as T → ∞. (iii) u(T ) = [−T −1 loge (−16πT 2 loge T )]1/2 + o(1) as T → 0. Parts (i) and (ii) are due to Bather (1983), who for (i) used a direct argument, and for (ii) a comparison procedure whose essence we may briefly describe as follows. The problem of finding so that (7.34), (7.35) and (7.36) are satisfied is known as a free boundary problem. In general no closed form of solution can be found, though there are exceptions, in particular when the maximal f corresponds to a separable solution of the heat equation. This observation led Bather to consider two stopping problems distinguished by subscripts 1 and 2. Suppose that the point (u, v) and the reward functions r1 and r2 are such that f2 (u, v) = r2 (u, v) = r1 (u, v)
(7.44)
and r2 (x, y) ≥ r1 (x, y)
(x ∈ R, y < v).
(7.45)
Since it is optimal to stop at (u, v) for problem 2, and the reward function for problem 2 is at least as great as it is for problem 1 at every point which the process Z may reach starting from (u, v), it must also be optimal to stop at (u, v) for problem 1. Thus if problem 2 can be solved and the conditions (7.44) and (7.45) are satisfied, the point (u, v) will have been shown to belong to the optimal stopping set 1 for the possibly less immediately tractable problem 1. The comparison procedure for a given problem (1) is to build up as large as possible a subset of the optimal stopping set 1 by finding a tractable auxiliary problem (2), for which conditions (7.44) and (7.45) are satisfied, for each point (u, v) in the subset. Part (iii) of the theorem is due to Chang and Lai (1987). They consider the stopping problem defined by (7.39) and prove (iii) by considering the asymptotic properties of a class of solutions of the heat equation, together with comparison theorems somewhat similar to the one used by Bather. They also extend this result to provide an approximation to the index for a discrete reward process governed by a distribution belonging to an exponential family. The approximation holds in the limit as the discount factor a → 1.
7.7
Asymptotically normal reward processes
The approximation (7.43) for the index of a normal reward process with known variance may be extended to the large class of reward processes that, essentially
206
MULTI-ARMED BANDIT ALLOCATION INDICES
because of the central limit theorem, may be regarded as being approximately normal. For any reward process, the index depends only on the random process defined by the mean of the current distribution for the expected reward. Thus the conditions for a good approximation based on the normal reward process are those which ensure that the change in x¯ as n increases to n + m has a distribution which for large n is close to its distribution for the normal case. The main instrument for checking whether these conditions hold is Theorem 7.32. Two well-known results also play a role, and these are stated first. Theorem 7.30 Lindberg–Levy central limit theorem If Y1 , Y2 , . . . are independent identically distributed random variables with mean μ and variance σ 2 , then n d (nσ )−1/2 (Yi − μ) −→ i=1
as n → ∞, and this convergence is uniform. Theorem 7.31 Chebyshev’s inequality If the random variable Y has mean μ and variance σ 2 then P (|Y − μ| ≥ t) ≤ σ 2 /t 2
(t > 0).
Now reverting to the setup for a sampling process introduced in Section 7.1, let be the parameter space for θ , and Pn be the joint probability distribution over × R∞ for θ, Xn+1 , Xn+2 , . . . defined by the posterior distribution n for θ and the distribution Lθ for each of the Xi s. Let hn = (x1 , x2 , . . . , xn ), the history of the process up to time n. The mean θ1 and variance θ2 of the distribution defined by f (x|θ ) are both assumed to exist. Our key asymptotic result involves the following random variables which, together with θ1 and θ2 , are defined in terms of Pn ; s is a positive constant, x¯n = n−1 (x1 + x2 + · · · + xn ). n+m n+m −1/2 −1 −1/2 −1/2 s Xi − mθ1 , U = m θ2 Xi − mθ1 , U =m i=n+1
i=n+1 −1/2 n1/2 θ2 (θ1
V = n1/2 s −1 (θ1 − x¯n ) , V = − x¯n ) , −1 −1/2 W = m + n−1 m−1/2 U + n−1/2 V . FU , FU , FV , FV and FW are their distribution functions; F1 and F2 are those of θ1 and θ2 . Conditional distribution functions are denoted by FU |θ , F2|V etc. The following conditions are required. (i) There is a sequence of Baire functions sn defined on hn (n = 0, 1, 2, . . . ) such that, given and s (both > 0), n (|θ2 − s2 | > ) → 0 as n → ∞ for any sequence hn (n = 0, 1, 2, . . . ) on which sn (hn ) takes the fixed value s for every n. A sequence of hn s with this property will be termed a fixed-s sequence.
MULTI-POPULATION RANDOM SAMPLING (THEORY)
207
(ii) FV (v) → (v) on any fixed-s sequence, v ∈ R. (iii) For any bounded interval I ⊂ R and > 0, n |FU |θ (u) − (u)| < , u ∈ I → 1 on any fixed-s sequence as m → ∞ and n → ∞ in such a fashion that mn−1 ≥ some fixed positive c. Note that Theorem 7.30 tells us that FU |θ (u) → (u) uniformly in u ∈ R as m → ∞. Condition (iii) is a related, but not equivalent, property. When the Xi s are a reward sequence and Eθ1 = x¯n , our interest, as remarked at the beginning of this section, is in showing that for large n the difference x¯ = x¯n+m − x¯n has a distribution close to its distribution for the normal case. This is the meaning of the theorem which follows. Note that x¯ = sm1/2 n−1/2 (m + n)−1/2 W . Theorem 7.32 Normal limit theorem (1) If Lθ is N (θ1 , σ 2 ), where σ 2 = s 2 is known, and 0 is the improper uniform distribution on the entire real line, then FW (w) = (w) (w ∈ R). (2) Under conditions (i), (ii) and (iii), FW (w) → (w) (w ∈ R), on any fixed-s sequence as m → ∞ and n → ∞ in such a fashion that mn−1 ≥ some fixed positive c. Proof. (1) FU |θ = (θ ) (θ ∈ ). Thus FU = and U is independent of V . Also FV = , since n is N (x¯n , σ 2 π −1 ). Part 1 of the theorem follows, since W is a linear combination of U and F with coefficients for which the sum of squares is one. (2) For simplicity of presentation, and with only an easily removable loss of generality, our proof is for the case mn−1 = c.
FW (w) = FU |θ (1 + c2 )1/2 w − cn1/2 s −1 (θ1 − x¯n ) dn (θ ). (7.46) For the case of part 1 this reduces to
∞ 1/2 1 + c2 w − cv φ(v)dv, FW (w) = −∞
(7.47)
which, as just shown, is equal to (w). The proof of part 2 consists of using conditions (i), (ii) and (iii) to obtain bounds on the difference between expressions (7.46) and (7.47). We first show that, given > 0, M may be chosen so that
∞ 1/2 2 FW (w) − 1+c w − cv dFV (v) < 3/5 (n > M). (7.48) −∞
Let ξ = −1 (1 − /20), and δ (0 < δ < 12 s) be such that −1/2
|[(1 + c2 )1/2 wsθ2
− cv] − [(1 + c2 )1/2 w − cv]| < /5
208
MULTI-ARMED BANDIT ALLOCATION INDICES 1/2
if |s − θ2 | < δ and |v| < ξ . There must be such a δ because of the uniform continuity of continuous functions on compact intervals. Now choose M so that n ({|V | ≥ ξ } ∪ {|θ2 − s 2 | ≥ δ} ∪ {FU |θ (u) − (u)| ≥ /5 for some |u| < 2(1 + c2 )1/2 |w| + cξ }) < /5
(n > M).
(7.49)
This is possible because of conditions (ii), (i) and (iii) in turn. Next note that FU |θ (1 + c2 )1/2 w − cn1/2 s −1 (θ1 − x¯n ) −1/2 −1/2 = FU |θ (1 + c2 )1/2 wsθ2 − cn1/2 θ2 (θ1 − x¯n ) . It thus follows from (7.49) that the integrand in (7.46) may be replaced by −1/2
[(1 + c2 )1/2 wsθ2
−1/2
− cn1/2 θ2
and then by
−1/2
[(1 + c2 )1/2 w − cn1/2 θ2
(θ1 − x¯n )]
(θ1 − x¯n )],
with an error at each stage of no more than /5, except on a set of n -measure of at most /5, on which the total error of the two changes is at most one. Now writing dF2|V (θ2 )dFV (v) in place of dn (θ ), which we can do since the integrand now depends on θ only through θ2 and V , and carrying out the integration with respect to θ2 , the error bound (7.48) follows. The rest of the proof is along similar lines. The convolution [(1 + c2 )1/2 w − cv]dFV (v) may be rewritten as FV [(1 + c2 )1/2 w − c−1 y] φ(y)dy. Choose N so that |FV (v) − φ(v)| < /5 for |v| ≤ (1 + c−2 )1/2 |w| + c−1 ξ and n > N , which is possible by (ii). Arguing along the lines of the previous paragraph it follows that ∞ [(1 + c2 )1/2 w − cv]dFV (v) −∞
− Since that
∞ −∞
[(1 + c
−2 1/2
)
− c y]φ(y)dy < 2/5 −1
(n > N ).
(7.50)
[(1 + c2 )1/2 w − c−1 y]φ(y)dy = (w), we have from (7.48) and (7.50) |FW (w) − (w)| <
(n > max(M, N )),
and the theorem is proved.
From the inequality (7.43) and the comments after it, it follows that, when for each n the expected value of the next reward in the sequence is x¯n , and the conditions of the theorem hold, [νa (n , n) − x¯n ]s −1 n(1 − a)1/2 → 2−1/2
(7.51)
MULTI-POPULATION RANDOM SAMPLING (THEORY)
209
as n(1 − a) → ∞ on a fixed-s sequence and a → 1. The fact that condition (iii) requires mn−1 ≥ c might at first sight be expected to invalidate this conclusion, since the calculation of νa (n , n) involves values of x¯n+m for all m > 0. The reason it does not do so is that we are considering values of a close to 1. This means that many members of the sequence x¯n+m (m = 0, 1, 2, . . . ) have an appreciable effect in the calculation of νa (n , n), and a may be chosen near enough to 1 for the higher values of m, for which the normal approximation holds, to swamp the effect of the lower values, for which it does not. Applications of the theorem may be extended by considering a function G(y, n) on R+ × R satisfying appropriate regularity conditions. Suppose, as n tends to infinity and with m ≥ 0, that G(y, n), ∂G(y, n)/∂y and n1/2 [G(y, n + m) − G(y, n)] tend uniformly in bounded y-intervals to G(y), dG(y)/ dy and 0 respectively, where G(y) is continuously differentiable. Let G = G(x¯n+m , n + m) − G(x¯n , n), and a fixed-x¯ sequence be a sequence of hn s on which x¯n = x¯ (n = 1, 2, . . . ) for some x. ¯ Corollary 7.33 conditions,
Under conditions (i), (ii) and (iii) and the above regularity > ] → 0 Pn [|G/x¯ − dG(x)/dy| ¯
on any fixed-(x, ¯ s) sequence, as m → ∞ and n → ∞ in such a fashion that mn−1 ≥ some fixed positive c. Proof. For brevity, write the p ¯ We have G/x¯ → dG(x)/dy.
conclusion
of
G = G(x¯ + x, ¯ n + m) − G(x¯ + x, ¯ n) + x¯
the
corollary
as
∂ G(x¯ + αx, ¯ n), ∂y
for some α ∈ (0, 1). To establish the corollary it is therefore sufficient to show that ∂ ¯ p dG(x) (a) G(x¯ + αx, ¯ n) → , and ∂y dy p
¯ n + m) − G(x¯ + x, ¯ n)] → 0. (b) (x) ¯ −1 [G(x¯ + x, The limit (a) follows from part 2 of the theorem, the uniform convergence of ∂G(y, n)/∂y to dG(y)/dy on bounded y-intervals, and continuity of dG/dy at y = x. ¯ From part 2 of the theorem it also follows that lim [lim inf Pn (n1/2 |x| ¯ > δ)] = 1.
δ→0 n→∞
(7.52)
Now writing the expression in (b) as (n1/2 x) ¯ −1 n1/2 [G(x¯ + x, ¯ n + m) − G(x¯ + x, ¯ n)], the limit (b) follows from (7.52), the uniform convergence to zero of n1/2 [G(y, n + m) − G(y, n)], and a further application of part 2 of the theorem.
210
MULTI-ARMED BANDIT ALLOCATION INDICES
This corollary may be applied to bandit processes for which the expected value of the next reward in the sequence, when the current distribution of θ is n , is G(x¯n , n), where G(y, n) satisfies the stated conditions. A difference G between members of the sequence of expected rewards is now well approximated for large n by multiplying the difference x, ¯ for the situation where this sequence coincides with the x¯n s, by dG(x¯n )/dy, which may be treated as independent of n provided we are considering only large values of n. Asymptotically, then, the relationship between the two expected reward sequences may be regarded as constant and linear. Under these circumstances it follows from the definition (2.7) that the same linear relationship must hold between the corresponding Gittins indices. The asymptotic result corresponding to (7.51) is therefore −1 [νa (n , n) − G(x, ¯ n)] sdG(x)/dy ¯ n(1 − a)1/2 → 2−1/2
(7.53)
as n(1 − a) → ∞ on a fixed-(x, ¯ s) sequence and a → 1.
7.8
Diffusion bandits
The properties of the Brownian reward process listed as parts (i) and (iii) of Theorem 7.28 form part of the rich harvest yielded by the relationship between the Wiener process and the heat equation expounded by Doob (1955). The possibility of using this relationship in the solution of sequential decision problems was noted by Lindley (1960) and Chernoff (1961), and discussed in detail by Chernoff (1968) and Bather (1970). Chernoff was interested in the diffusion version of the finite-horizon undiscounted two-armed bandit problem, but it was Bather (1983) who turned the comparison procedure which he had developed (1962) for finding approximate solutions to the free boundary problems associated with optimal stopping into an effective instrument for analysing the discounted Brownian reward process. Bandit processes based upon the Wiener process have also featured in the work of Karatzas (1984), who considers bandits for which the state x (∈ R) at process time t is a diffusion process satisfying the stochastic differential equation dx = μ(x)dt + σ (x)dW, where W is a Wiener process, and rewards accrue at the rate h(x) per unit time when the bandit process is selected, h being an increasing function. The general form of the stopping set for the stopping time τ which maximizes the expression τ E h(x(t))e−γ t dt + Me−γ τ 0
may be obtained along the lines described by Bensoussan and Lions (1978), leading to equations from which the Gittins index may be calculated. For the
MULTI-POPULATION RANDOM SAMPLING (THEORY)
211
case when the functions μ and σ are constants, Karatzas shows that the Gittins index becomes
∞ z ν(x) = h x+ e−z dz, β 0 where
β = [(μ2 + 2γ σ 2 )1/2 − μ]σ −2 .
As one would expect, ν(x) is increasing in each of x, μ and σ 2 . Karatzas also establishes the technically far from trivial fact that realizations of a SFABP made up of diffusion bandit processes of this type may be described in a suitable probability space when an index policy is followed. By mimicking Whittle’s proof of the index theorem, he shows that the index policy is optimal. Eplett (1986) generalizes the Karatzas assumptions by allowing x(t) to be an arbitrary continuous-time Markov process, and considers discrete-time approximations, establishing conditions under which both the indices and the optimal payoff functions for the approximations converge to those for the original setup. As he points out, this analysis is of practical importance as there must be some limit to the possibility of reallocating resources at very short notice, and it is also sometimes easier to calculate indices for discrete-time approximations. Filliger and Hongler (2007) generalize Karatzas’s results to bandit processes driven by a class of superdiffusive noise sources.
Exercises 7.1. Verify the forms given in Section 7.3 for the conjugate families of prior distributions for Bernoulli, normal (with σ 2 first known, then unknown) and exponential sampling processes. 7.2. Write out the proofs of one or two of Theorems 7.13, 7.15, 7.17, 7.19, 7.21 and 7.23. For example, Theorem 7.15 provides a suitably tricky test of understanding of the methods of these proofs. For this case you may find it helpful to note that for odd values of n, the statistics X¯ and S 2 take the values x¯ and s 2 if one of the Xi s takes the value x, ¯ half the remainder of the Xi s take the value x¯ + s, and the other Xi s take the value x¯ − s. For even values of n a slightly different set of Xi values is needed. 7.3. Verify the inequality ν(t, n1 , γ ) > ν(t, n2 , γ ) at the end of the proof of Theorem 7.27. 7.4. Verify equation (7.32).
8
Multi-population random sampling (calculations)
8.1
Introduction
This chapter sets out in some detail the methods by which the general principles described in Chapter 7 have been used to calculate Gittins indices. This is done in turn for reward processes based on a normal distribution with known variance; a normal distribution with unknown variance; a {0, 1} distribution and an exponential distribution; for undiscounted target processes based on an exponential distribution; and on an exponential distribution together with an atom of probability at the origin. The results of these calculations are also described for each case except the Bernoulli/exponential target process, for which some results are given by Jones (1975). Two approximations which should prove useful are given in Section 8.7. Tables are given at the back of the book. They seem likely to be accurate to within 1 or 2 in the last digit tabulated. Jones also gives results for the normal target process, both with known variance (1970) and with unknown variance (1975).
8.2
Normal reward processes (known variance)
Because of Theorem 7.13 and Corollary 7.14, we may without loss of generality assume the known variance σ 2 to be equal to 1. The single unknown parameter μ is thus the mean of the distribution, and π0 is taken to be the improper uniform Multi-armed Bandit Allocation Indices, Second Edition. John Gittins, Kevin Glazebrook and Richard Weber © 2011 John Wiley & Sons, Ltd. Published 2011 by John Wiley & Sons, Ltd. ISBN: 978-0-470-67002-6
214
MULTI-ARMED BANDIT ALLOCATION INDICES
density over the real line f (x|μ) = (2π)−1/2 exp − 12 (x − μ)2 ; hence μ is a location parameter. n
2 f (xi |μ) = exp − n2 (x¯ − μ)2 (2π)−n/2 exp − 12 xi − nx¯ 2 ;
i=1
hence X¯ is a sufficient statistic by Theorem 7.1. ¯ ∝ π0 (μ) πn (μ|x)
n
f (xi |μ) ∝ (2πn−1 )−1/2 exp − n2 (x¯ − μ)2 ,
i=1
¯ is normal using the generalized Bayes’ theorem (equation (7.1)). Thus πn (μ|x) with mean x¯ and variance n−1 , and conjugate prior densities for μ are of this form, with n not restricted to integer values. ∞ ¯ n) = f (x|μ)πn (μ|x)dμ ¯ f (x|πn ) = f (x|x, −∞
= =
∞ −∞
(2π)−1/2 exp − 12 (x − μ)2 (2πn−1 )−1/2 exp − n2 (x¯ − μ)2 dμ
n 2π(n+1)
1/2
r(x, ¯ n) =
n exp − 2(n+1) (x − x) ¯ 2 . ∞
−∞
xf (x|x, ¯ n)dx = x. ¯
Thus the functional equation (7.10) becomes ∞ λ x+x ¯ , x¯ + a R(λ, x, ¯ n) = max R λ, nn+1 , n + 1 f (x|x, ¯ n)dx , (8.1) 1−a −∞ and ν(x, ¯ n) is the value of λ for which the two expressions inside the square brackets are equal. The conditions of Theorem 7.9 (and of the more general Theorem 7.13) hold, so ν(x, ¯ n) = x¯ + ν(0, n) and it is sufficient to calculate ν(x, ¯ n) for x¯ = 0. From (8.1), and since R(λ + c, x¯ + c, n) = c(1 − a)−1 + R(λ, x, ¯ n), it follows that R(λ, 0, n) = max
λ ,a 1−a
∞ −∞
R λ−
x n+1 , 0, n
+ 1 f (x|0, n)dx .
(8.2)
MULTI-POPULATION RANDOM SAMPLING (CALCULATIONS)
Now defining
215
Q(λ, n) = R(λ, 0, n) − λ(1 − a)−1 ,
this gives Q(λ, n) = max 0, −λ + a
∞ −∞
Q λ−
x n+1 , n
+ 1 f (x|0, n)dx .
(8.3)
Thus, writing ξn for ν(0, n), ξn is the value of λ for which the two expressions inside the square brackets in equation (8.2) (or equivalently in equation (8.3)) are equal. The quantity Q(λ, n) is the difference between the payoff from an optimal sequential policy using a reward process S starting in the state πn = (0, n), and a standard bandit process with the parameter λ, and the payoff using only the standard bandit process. Thus Q(λ, n) = 0 if λ is large enough for the optimal policy to start (and therefore permanently continue) with . On the other hand, if λ is sufficiently large and negative for the optimal policy to keep selecting S for a long time with high probability, then Q(λ, n) is essentially the difference between always selecting S and always selecting , namely 0(1 − a)−1 − λ(1 − a)−1 . We should find, therefore, that Q(λ, n) = 0 for λ > ξn , and Q(λ, n) + λ(1 − a)−1 → 0 as λ → −∞. This means that (8.3) may be written as
1/2 n Q(λ, n) = max 0, −λ + a 2π(n + 1) ∞ x ,n +1 × Q λ− n+1 (n+1)(λ−ξn+1 ) × exp −
nx 2 dx 2(n + 1)
n(n + 1) 1/2 ξn+1 Q(y, n + 1) = max 0, −λ + a 2π −∞ 1 2 × exp − 2 n(n + 1)(y − λ ) dy .
(8.4)
The calculation of ξn was based on equation (8.4). An approximation to R(λ, 0, n), the payoff from S and under an optimal sequential policy, may be obtained by allowing only policies which select the same bandit process at all times from the point when n reaches the value N onwards. Let R(λ, 0, n, N ) denote this approximation to R(λ, 0, n), and Q(λ, n, N ) the corresponding approximation to Q(λ, n). Thus R(λ, 0, n, N ) and Q(λ, n, N ) are increasing in N , since an increase in N means an enlargement
216
MULTI-ARMED BANDIT ALLOCATION INDICES
of the set of policies over which optimization takes place, and tend to R(λ, 0, n) and Q(λ, n), respectively, as N tends to infinity. We have, for n ≥ N , R(λ, 0, n, N ) = max[λ(1 − a)−1 , 0], Q(λ, n, N ) = R(λ, 0, n, N ) − λ(1 − a)−1 = max[0, −λ(1 − a)−1 ],
(8.5)
and, for 1 < n < N + 1, n − 1 1/2 ∞ x Q(λ, n − 1, N ) = max 0, −λ + a Q λ − , n, N 2πn n n(λ−ξn N) (n − 1) x 2 dx × exp − 2n
n (n − 1) 1/2 ξn N Q (y, n, N ) = max 0, −λ + a 2π −∞ 1 2 (8.6) × exp − 2 n (n − 1) (y − λ) dy , where ξnN is the corresponding approximation to ξn , and ξnN ξn as N → ∞. Thus ξn−1N is the value of λ for which the expression after the comma in the curly brackets in equation (8.6) (which will be written as Q∗ (λ, n − 1, N )) is equal to 0. The method of calculation was a backwards induction using equation (8.6) to obtain, in succession, values of ξnN and of the function Q(·, N, N ) for n = N − 1, N − 2, . . . , 1. Equation (8.5) was used to provide values of Q(·, N, N ), and clearly ξNN = 0, since ξNN is the value of λ for which the expected payoff from a permanent choice of S is the same as from a permanent choice of , when S is in the state (0, N ). At each stage of the calculation, values of Q∗ (λ, n, N ) were obtained by numerical integration, the value ξnN of λ for which Q∗ (λ, n, N ) = 0 was found by the bisection method, and a grid of values of Q(λ, n, N ) was calculated for use in the next stage of the backwards induction. The numerical integration for values of Q∗ (λ, n, N ) was carried out using Romberg’s method (e.g. see Henrici, 1964) to evaluate separately the contributions to the total integral from those parts of the five intervals with the endpoints λ ± 6n−1 , λ ± 4n−1 , λ ± 2n−1 , that are below the point ξnN , contributions from values of y outside the range λ ± 6n−1 being negligible. The grid of values of Q(·, n, N ) was defined separately for the five intervals with the endpoints ξnN − 5n−1/2 , ξnN − 4n−1/2 , ξnN − 3n−1/2 , ξnN − 2n−1/2 , ξnN − 6n−1 , ξnN
(n > 40),
and with the endpoints ξnN − 5n−1/2 , ξnN − 4n−1/2 , ξnN − 3n−1/2 , ξnN − 2n−1/2 , ξnN − n−1/2 , ξnN
(n ≤ 40).
MULTI-POPULATION RANDOM SAMPLING (CALCULATIONS)
217
For n > 40 the numbers of values calculated in the five successive intervals were 6, 10, 16, 40 and 50, and for n ≤ 40 the numbers were 6, 10, 24, 30 and 50. Within each interval the grid was equally spaced. The values of Q(·, n, N ) required for the next set of integrals were obtained by quadratic interpolation for ξnN > λ > ξnN − 5n−1/2 . For λ < ξnN − 5n−1/2 , the approximation Q(λ, n, N ) −λ(1 − a)−1 was used. The grid values which most influence the calculation of the sequence of ξnN s are those near to the corresponding value of ξnN . Consequently, more grid values were calculated in the intervals close to ξnN than in those further away, so as to reduce interpolation errors, and the grid values close to ξnN were also calculated with higher precision by reducing the maximum error allowed in the Romberg procedure. The actual numbers of grid points were determined by carrying out repeated calculations for different numbers of points. In this way the convergence of the calculated ξnN values as the number of grid points increased was directly observed. This meant that suitable numbers could be chosen, and also provided a means of estimating bounds for the error due to interpolation between grid points. Similar iterative procedures were used to determine appropriate maximum errors for numerical integration and root-finding, and corresponding bounds for the errors in calculating ξnN . The loss of accuracy from these two causes turns out to be negligible compared with that caused by the use of a discrete grid of points. To obtain index values we need values of ξn rather than the lower bound for ξn given by ξnN . It is reasonable to conjecture that (ξn − ξnN )/ξn = O(a N−n ),
(8.7)
since this represents geometric discounting of the effect of the finite horizon using the discount factor a. Repeated calculations with different values of N show that (8.7) is approximately correct, thereby providing a means of bounding the error caused by estimating ξn by ξnN . This source of error can be made negligible by starting the calculations from a suitable value of N . Reverting now to the fuller notation of Section 7.6 (apart from the subscript N referring to the fact that we are dealing with a normal process), it follows from Theorem 7.28 and Theorem 7.29 (ii) that n(1 − a)1/2 ν(0, n, 1, a) < n(1 − a)u(n(1 − a)) ≤ 2−1/2 ,
(8.8)
that the difference of the first two of these quantities is small when a is close to 1, and that the difference of the second two is small when n(1 − a) is large. Because of the inequalities (8.8), and since, by Theorem 7.13, ν(x, ¯ n, , a) = x¯ + σ ν(0, n, 1, a), the index ν(x, ¯ n, σ, a) is shown in Tables 8.1 and 8.2 in the form of values of ν(1 − a)1/2 ν(0, n, 1, a). This normalized index shows the convergence to 2−1/2 (= 0.70711) indicated by the inequalities (8.8) and subsequent comments as a → 1 and n(1 − a) → ∞.
218
MULTI-ARMED BANDIT ALLOCATION INDICES
There is also clear evidence of convergence as n → ∞ and a remains fixed. The rate of convergence appears to be O(n−1 ). The values shown in Table 8.1 for n = ∞ were obtained by further calculations for much larger values of n than those appearing elsewhere in the table. The left-hand inequality in (8.8) means that a lower bound to T u(T ) may be read off from Table 8.1. This lower bound is shown for a selection of values of T in Table 8.2. The function u is of some interest in its own right since the indices for any Brownian reward process may be obtained from it by scaling, as shown in Section 7.6. Values of the lower bound for small T were examined for evidence of the asymptotic behaviour given in Theorem 7.29 (iii). No very striking agreement was detected, presumably because no calculations were available for small enough value of T , nor perhaps for a sufficiently near 1.
8.3
Normal reward processes (mean and variance both unknown)
The unknown parameters are the mean μ and standard deviation σ ; π0 ∝ σ −1 (σ > 0). 1 f (x|μ, σ ) = (2πσ 2 )−1/2 exp − 2 (x − μ)2 , 2σ hence μ and σ are joint location and scale parameters.
n 1 f (xi |μ, σ ) = (2πσ 2 )−n/2 exp − 2 ¯ 2 + n(x¯ − μ)2 , (xi − x) 2σ i=1
¯ 2 and X¯ are jointly sufficient statistics by hence S 2 = (n − 1)−1 (Xi − X) Theorem 7.1. The generalized Bayes’ theorem gives πn (μ, σ |x, ¯ s) ∝ π0 (μ, σ )
n
f (xi |μ, σ )
i=1
∝σ
−n−1
∞ ∞
¯ s, n) = f (x|πn ) = f (x|x, 0
∞ ∞
∝ 0
−∞
σ
1 2 2 exp − 2 (n − 1)s + n(x¯ − μ) . 2σ
−n−2
−∞
f (x|μ, σ )πn (μ, σ |x, ¯ s, n)dμdσ
1 n (x − x) ¯ 2 exp − 2 (n − 1)s 2 + 2σ n+1 nx¯ + x 2 + (n + 1) μ − dμdσ n+1
MULTI-POPULATION RANDOM SAMPLING (CALCULATIONS)
219
1 n ∝ σ −n−1 exp − 2 (n − 1)s 2 + (x − x) ¯ 2 dσ 2σ n+1 0 −∞ −n/2 n (x − x) ¯ 2 ∝ 1+ . · n + 1 (n − 1)s 2
∞ ∞
¯ −1 , it follows that t has Student’s Writing t = [n/(n + 1)]1/2 (x − x)s t-distribution with n − 1 degrees of freedom, that is, it has the density ( n2 )
t2 g(t, n − 1) = n−1 1 + n−1 [(n − 1)π]1/2 2 ∞ r(x, ¯ s, n) = xf (x|x, ¯ s, n) = x, ¯
− n2 .
−∞
and the functional equation (7.10) becomes
λ R(λ, x, ¯ s, n) = max , 1−a ∞ nx¯ + x , s(s, x, ¯ x, n + 1), n + 1 R λ, x¯ + a n+1 −∞ f (x|x, ¯ s, n)dx ,
(8.9)
where s(s, x, ¯ x, n + 1) = [n−1 (n − 1)s 2 + (n + 1)−1 (x − x) ¯ 2 ]1/2 . The two expressions inside the curly brackets are equal when λ = ν(x, ¯ s, n). The conditions of Theorem 7.15 hold, so ν(x, ¯ s, n) = x¯ + sν(0, 1, n), and it is sufficient to calculate ν(x, ¯ s, n) for x¯ = 0 and s = 1. From (8.9), and since R(bλ + c, bx¯ + c, b, s, n) = c(1 − a)−1 + bR(λ, x, ¯ s, n) (n > 1), we have
∞ λ R(λ, 0, 1, n) = max s(1, 0, x, n + 1) ,a 1−a −∞ λ − x(n + 1)−1 ×R , 0, 1, n + 1 f (x|0, 1, n)dx . s(1, 0, x, n + 1)
220
MULTI-ARMED BANDIT ALLOCATION INDICES
Writing Q(λ, n) = R(λ, 0, 1, n) − λ(1 − a)−1 , this becomes, for n > 1,
∞ Q(λ, n) = max 0, −λ + a n−1/2 (n − 1 + t 2 )1/2 ×Q
−∞
n
λ − (n + 1)−1/2 t , n + 1 g(t, n − 1)dt . (n − 1 + t 2 )1/2
1/2
(8.10)
The second expression inside the curly brackets in equation (8.10) will be denoted by Q∗ (λ, n). Writing ν(0, 1, n) = ζn , we have Q∗ (ζn , n) = 0. As for the normal reward process with known variance, calculations were based on the analogue of (8.10) when a switch from S to is allowed only for n ≤ N . Extending the notation in the obvious way, we have
∞ Q(λ, n − 1, N ) = max 0, −λ + a (n − 1)−1/2 (n − 2 + t 2 )1/2
−∞
(n − 1) λ − n−1/2 t ×Q , n, N g(t, n − 2)dt (n − 2 + t 2 )1/2 1/2
= max[0, Q∗ (λ, n − 1, N )] (2 < n < N + 1). (8.11) (8.12) Q(λ, n, N ) = max 0, −λ (1 − a)−1 (n > N − 1) , Q∗ (ζn−1,N , n − 1, N ) = 0 (2 < n < N + 1), ζnN = 0 (n ≥ N ). Also ζnN ζn as N → ∞. Equations (8.11) and (8.12) are the counterparts of equations (8.6) and (8.5) for the normal reward process with known variance. In equation (8.6) the integration over x is reduced to the range n(λ − ξnN ) < x, using the fact that Q(y, n, N ) = 0 for y ≥ ξnN . The integral in equation (8.11) cannot be reduced in the same simple way, as the function [(n − 1)1/2 λ − n−1/2 t] (n − 2 + t 2 )−1/2 is not monotone in t for all values of n and λ. There are in fact three possible cases, depending on whether the equation in t, (n − 1)1/2 λ − n−1/2 t = ζnN , 1/2 n − 2 + t2
(8.13)
has 0, 1 or 2 real roots. These lead to integrals with 2, 1 or 0 infinite endpoints, respectively. The calculations of ζN,N , ζN−1,N , ζN−2,N , . . . , ζ2,N was carried out in sequential fashion along the lines of the calculation for ζN,N , ζN−1,N , ζN−2,N , . . . , ζ1,N for the known variance case. The numerical integration for values of Q∗ (λ, n − 1, N ) proceeded by first determining the roots of equation (8.13), and hence the set A of values of t for which the integrand is nonzero, and then applying Romberg’s method to evaluate separately the contributions to the total integral which arise from the intersections of the set A with the five intervals with the
MULTI-POPULATION RANDOM SAMPLING (CALCULATIONS)
221
endpoints ±6(1 + 200n−3 ), ±4(1 + 60n−3 ), ±2(1 + 20n−3 ). The grid of values of Q(·, n, N ) was defined separately for the five intervals with the endpoints ζnN − 5n−1/2 (1 + 100n−3 ), ζnN − 4n−1/2 (1 + 60n−3 ), ζnN − 3n−1/2 (1 + 40n−3 ), ζnN − 2n−1/2 (1 + 20n−3 ), ζnN ζnN − 6n−1 (1 + 6n−3 ), for n > 40, and with endpoints defined in the same way for 21 ≤ n ≤ 40 except that ζnN − n−1/2 (1 + 6n−3 ) replaces ζnN − 6n−1 (1 + 6n−3 ). For 6 ≤ n ≤ 20 two more intervals were added, and a further two intervals for 3 ≤ n ≤ 5. The wider intervals and the additional intervals, compared with the known variance case, were to allow for the greater dispersion of the t-distribution, compared with the standard normal distribution, for small numbers of degrees of freedom. For a given value of the discount factor a, the difference in the definitions of the quantities ξn (= ν(0, n, 1)), defined in Section 8.2, and ζn (= ν(0, 1, n)), defined in this section, is that in the first case the variance of the normal reward process is known to be 1, and in the second case the sample variance is 1 and the prior distribution for the true variance expresses ignorance. In both cases the mean value of the first n rewards, and hence the expected value of subsequent rewards, is 0, so that ξn and ζn represent the extent to which the expected reward rate up to a stopping time may be increased by making the stopping time depend on the reward sequence. The scope for such an increase itself depends on the degree of uncertainty about the mean of the underlying normal distribution. This is greater if both the mean and the variance are unknown than if the variance is known, since the t-distribution, which expresses the uncertainty when the variance is unknown, has fatter tails than the normal distribution, which does so when the variance is known. Since the t-distribution also converges to a N (0, 1) distribution as the number of degrees of freedom tends to infinity, it is not surprising that the calculations show that ζn /ξn > 1 and 1 as n → ∞. For this reason the calculated values of ξn are given in Table 8.3 in terms of the ratio ζn /ξn .
8.4
Bernoulli reward processes
A Bernoulli reward process is an arm of the classical multi-armed bandit problem, given as Problem 5 in Chapter 1. This problem was first considered by Bellman (1956). Xi takes values 0 or 1 with probabilities (1 − θ ) and θ . The parameter θ has an improper prior density π0 ∝ θ −1 (1 − θ )−1 . After n observations, α of them ls, and β = n − α of them 0s, the posterior density πn ∝ θ α−1 (1 − θ )β−1 , which is a beta density with parameters α and β, mean αn−1 and variance αβn−2 (n + 1)−1 . In the notation of Section 7.7, θ1 = θ , θ2 = θ (1 − θ ) and x¯n = αn−1 . If we now put sn2 = αβn−2 , which means that a fixed-s sequence is a fixed-x¯ sequence, the conditions of Theorem 7.32 may be checked. Condition (i) follows from Chebyshev’s theorem (7.31). Note that here a fixed-s sequence defined for integer
222
MULTI-ARMED BANDIT ALLOCATION INDICES
values of n involves non-integer values of α. Since the corresponding sequence of beta distributions n is well defined, this causes no difficulty. The distribution function for a standardized beta variate tends to as the parameters tend to infinity in a fixed ratio (Cram´er, 1946). It follows that θ1 − αn−1 1/2 n (8.14) 1/2 < ν → (ν) αβn−2 (n + 1)−1 n as n → ∞ on a fixed-s sequence (v ∈ R). This, together with condition (i), implies condition (ii). Condition (iii) follows, with a little care, by applying the Lindberg–Levy central limit theorem (7.30) to FU |θ as m → ∞ with θ = αn−1 , plus the fact that n (|θ1 − αn−1 | > ) → 0 as n → ∞ if αn−1 is fixed, as it is on a fixed-s sequence. Thus the limit (7.51) holds, and takes the form [ν(α, β) − αn−1 ]n2 α −1/2 β −1/2 (1 − a)1/2 → 2−1/2
(8.15)
as n(1 − a) → ∞ on a fixed-αn−1 sequence and α → 1. Calculations for this case were carried out by the direct method described in Section 7.4. The limit (8.15) indicates that for large values of n(1 − a) and fixed a close to 1, and for values of α and β such that αn−1 is constant, ν(α, β) − αn−1 ∝ n. In fact it turns out that for any a and any constant value of αn−1 , ν(α, β) is well approximated for large values of n by an equation of the form [ν(α, β) − αn−1 ]−1 = nα −1/2 β −1/2 (A + Bn + Cn−1 )(1 − a)1/2 ,
(8.16)
where A, B and C depend on a and on αn−1 . In this case the calculation of index values involves no approximation of continuous functions by grids of values, or numerical integration, and the only important error arises from the finite value N of n from which the iterations start, and the consequent restriction on the set of allowed stopping times. As for the other calculations of index values for sampling processes, the importance of this source of error was estimated by comparing the index values calculated using different values of N . Such errors were found to be negligible in the first four significant figures of index values for a = 0.99, N = 300, n ≤ 150,
(8.17)
a ≤ 0.95, N = 300, n ≤ 200.
(8.18)
and for
Values of ν(α, n − α) for n(∈ Z) ≤ 300 were calculated starting from N = 300 for a = 0.5, 0.6, 0.7, 0.8, 0.9, 0.95 and 0.99. For each value of a, and for values of λ = αn−1 ranging between 0.025 and 0.975, a weighted least squares fit of the form A + Bn + Cn−1 to the function y(n) = [ν(nλ, n − nλ) − λ]−1 [λ(1 − λ)(1 − a)]−1/2
MULTI-POPULATION RANDOM SAMPLING (CALCULATIONS)
223
was carried out for each λ at values of n such that nλ is an integer, using the weight function [y(n)]−2 . The ranges of n-values used were chosen with a lower endpoint high enough for a good fit to be achieved, and with an upper endpoint low enough to satisfy (or almost satisfy) the appropriate constraint (8.17) or (8.18). The fitted values of A, B and C are shown in Tables 8.9, 8.10 and 8.11, and the corresponding ranges of n-values in Table 8.12. A MATLAB program for calculating the index values ν(α, β, a) is given at the end of this section. It may be run on a standard laptop and takes about 30 minutes. It uses the recurrence (1.6) and calculates values to four-figure accuracy at positive integer values of α and β with α + β ≤ 40, and for 0 < a ≤ 0.99. This has been checked by comparing the results with those calculated by the direct method. The results for a = 0.5, 0.7, 0.9 and 0.99 are given as examples in Tables 8.5, 8.6, 8.7 and 8.8, respectively. For α + β > 40 an alternative approximation to ν(α, β, a) may be obtained by assuming ν(nλ, n − nλ, a) − λ to be a linear function of [λ(1 − λ)]1/2 and interpolating (or if necessary extrapolating) the set of approximations to ν(nλ, n − nλ, a) given by the least squares fits for the required value of n and for the 13 fitted values of λ. For integer values of α and β and for values of n (= α + β), subject to the bounds (8.17) or (8.18), the index values obtained by this approximation procedure have been compared with the directly calculated values. For the most part the discrepancy is at most one in the fourth significant figure, the main exceptions being for values of αn−1 outside the range (0.025, 0.975). Because of the relationship between the indices for the Bernoulli and normal reward processes which Theorem 7.32 establishes for large n, and of the ndependence of the normal reward process index, described in Section 8.2, and since for the Bernoulli process ν(α, β) − αn−1 is small for large n, it is fairly certain that the four-figure accuracy noted in Table 8.4 when n is large and 0.025 ≤ αn−1 ≤ 0.975 holds for n > 200. Clearly the main conclusions set out in Table 8.4 for large n must also hold at non-integer values of α and β. What happens for large n and |αn−1 − 0.5| > 0.475 is much less clear, particularly when min(α, β) < 1. Calculations of index values for the Bernoulli reward process have also been reported (briefly) by Gittins and Jones (1979), Robinson (1982) and Katehakis and Derman (1986).
MATLAB program to calculate ν(α, β, a) %%%%%%%%%%%%%%%%%% %%% Initialize %%% %%%%%%%%%%%%%%%%%% % Set parameters for search routine beta = .95;
% discount rate (= a in text)
N = 750;
% starting value for a + b (alpha + beta in text)
step = 0.0001;
% step-size
224
MULTI-ARMED BANDIT ALLOCATION INDICES
% Initialize matrices G = zeros(N-1,N-1);
% array with intermediate values
Gittins = zeros(N-1,N-1);
% array with final Gittins indices
% Initialize endpoints (starting points for backward induction) for a = 1:N-1 G(a,N-a) = a/N;
% initialize endpoints at E[beta(a,b)]
end %%%%%%%%%%%%%%%%% %%% Main loop %%% %%%%%%%%%%%%%%%%% for p = step/2:step:1 % Continuation value for ‘safe’ arm safe = p/(1-beta); for nn = N-1:-1:2 for a = 1:nn-1 % Continuation value for ‘risky’ arm risky = a/nn * (1 + beta * G(a+1,nn-a)) + (nn-a)/nn * (beta * G(a,nn-a+1)); % ‘safe’ increases faster than ‘risky’ in p, so % the first time safe
risky, we get the index value
if (Gittins(a,nn-a) == 0) && (safe
risky)
Gittins(a,nn-a) = p-step/2; end % update recursion G(a,nn-a) = max(safe,risky); end end end %%%%%%%%%%%%%%%%%%%%%% %%% Output results %%% %%%%%%%%%%%%%%%%%%%%%% % Set output ranges a_out = 1:40; b_out = 1:40; Results = double([ 0 a_out ; b_out’ Gittins(a_out,b_out)’ ])
MULTI-POPULATION RANDOM SAMPLING (CALCULATIONS)
8.5
225
Exponential reward processes
There is one unknown parameter θ . f (x|θ ) = θ e−θx , so θ −1 is a scale parameter, call it σ . Since π0 (θ ) ∝ θ −1 , the prior density for the scale parameter σ is proportional to σ −1 . n
where =
f (xi |θ ) = θ n e−θ ,
i=1
n
i=1 xi ,
and is a sufficient statistic by Theorem 7.1.
πn (θ | ) ∝ π0 (θ )
n
f (xi |θ ) ∝ n θ n−1 e−θ / (n).
i=1
Thus πn (θ |) is a gamma density with the parameters n and , and conjugate prior densities for θ are of this form. ∞ f (x|θ ) πn (θ |) dθ f (x|πn ) = f (x|, n) = 0
∞
n n θ n exp [−θ ( + x)] dx = . (n) 0 ( + x)n+1 ∞ . xf (x|, n) dx = r (, n) = n−1 0 =
n
The functional equation (7.10) becomes λ , R (λ, , n) = max 1−a n−1 ∞ n −n−1 R (λ, + x, n + 1) ( + x) dx , + an
(8.19)
0
and ν(, n) is the value of λ for which the two expressions inside the square brackets are equal. The conditions of Theorem 7.11 hold, so ν(b, n) = bν(, n) and it is sufficient to calculate the sequence of values of for which ν(, n) = 1 (n = 1, 2, . . . ). These will be written as n . Defining S(, n) = R(1, , n), and S ∗ (, n) as the optimal payoff from bandit processes S and starting with S in the state (, n) and subject to the constraint that at time zero S must be selected, equation (8.19) reduces to S(, n) = max[(1 − a)−1 , S ∗ (, n)],
(8.20)
226
MULTI-ARMED BANDIT ALLOCATION INDICES
where
S ∗ (, n) = (n − 1)−1 + an n
∞
S (z, n + 1) z−n−1 dz,
(8.21)
and S ∗ (n , n) = (1 − a)−1 .
(8.22)
S ∗ (z, n + 1) is strictly increasing in z, thus S(z, n + 1) = (1 − a)−1 for z ≤ n + 1, and for < n+1 the integral from to n+1 in the expression for S ∗ (, n) may be evaluated explicitly, giving n a ∗ −1 S (, n) = (n − 1) + 1− 1−a n+1 ∞ S(z, n + 1)z−n−1 dz. (8.23) + an n n+1
As usual, the calculation of n was based on the analogues of equations (8.20) to (8.23) under the constraint that a switch from the reward process S to the standard bandit process is not allowed for n > N . These lead to an approximation nN which decreases as N increases and tends to n as N tends to infinity. The equations in question, all for n > 2, are S(, n − 1, N ) = max[(1 − a)−1 , S ∗ (, n − 1, N )], S ∗ (, n − 1, N ) = (n − 2)−1 + a(n − 1) n−1
∞
S (z, n, N ) z−n dz
(8.24)
(n ≤ N ) , (8.25)
S ∗ (n−1,N , n − 1, N ) = (1 − a)−1 , n−1 a ∗ −1 1− S (, n − 1, N ) = (n − 2) + 1−a nN ∞ + a(n − 1) n−1 S(z, n, N )z−n dz
(8.26)
nN
( < nN , n ≤ N ), ∗
−1
−1
S (, n − 1, N ) = (n − 2) (1 − a)
(8.27) (n > N ).
(8.28)
From (8.26) and (8.28) it follows that NN = N − 1. The values of N−1,N , N−2,N , . . . , 2,N were calculated in turn using equations (8.24) to (8.27). Since (n − 1)−1 is the expected reward from the next observation from S when it is in the state (, n), and since for a given value of n−1 the uncertainty associated with future rewards decreases as n increases, it follows that nN (n − 1)−1 increases with n for n ≤ N . This is because the contribution to
MULTI-POPULATION RANDOM SAMPLING (CALCULATIONS)
227
the index value arising from the possibility of learning the value of the parameter θ more precisely is less when there is not much more learning to be done, which is to say when n is large. It follows that nN increases with n, and hence that equation (8.27) is valid for values of in the neighbourhood of n−1,N . A further consequence of the fact that nN increases with n is that if S is in a state (, n) for which ≥ NN = N − 1, then ≥ Nm (n ≤ m ≤ N ), and hence the optimal policy under the constraint of no switch from S to after n reaches the value N is one which selects S at all times. This policy yields the expected payoff (n − 1)−1 (1 − a)−1 , so that S(, n, N ) = (n − 1)−1 (1 − a)−1
( ≥ NN ).
(8.29)
∗ Each successive value of nN was found by solving the equation SnN −1 ∗ = (1 − a) for . Equation (8.27) was used to calculate S (, n, N ), so that once the integral has been evaluated we are left with an nth order polynomial to solve for , for which the Newton–Raphson method was used. Equations (8.25) and (8.27) were then used to calculate a grid of values of S(, n, N ) = S ∗ (, n, N ) for > nN , to use in evaluating the integrals required in the next stage of the calculation. From (8.24) and (8.28) it follows that
S(, N, N ) = max[1, (N − 1)−1 ](1 − a)−1 ,
(8.30)
so that S(·, N, N ) has a discontinuous first derivative at the point = N − 1 = NN . Now substituting (8.30) in the right-hand side of (8.27) for n = N it follows that S ∗ (·, N − 1, N ) has a discontinuous second derivative at = NN , and from (8.24) that S(·, N − 1, N ) has a discontinuous second derivative at = NN and a discontinuous first derivative at = N−1,N . Equation (8.27) now shows that these discontinuities are inherited by S ∗ (·, N − 2, N ), though now transformed by the integral into discontinuities of the third and second derivatives respectively. Equation (8.24) transmits the properties of S ∗ (·, N − 2, N ) to S(·, N − 2, N ) for > N−2,N and adds a discontinuity of the first derivative at = N−2,N . This process clearly may be formalized as an inductive argument showing that S(·, n, N ) has a discontinuous rth order derivative at n+r−1,N (r ≤ N − n + 1). These discontinuities in the derivatives of S(, n, N ) mean that errors would arise from interpolating values of this function between a set of points which straddles discontinuity points. In order to minimize this source of error, the grid of calculated values S(, n, N ) was chosen so as to include considerable numbers of discontinuity points. The necessary integrals of the form S(z, n, N )z−n dz were evaluated numerically using Romberg’s procedure separately over several different intervals and then summing. Those intervals close to the value z = nN were chosen so as not to straddle discontinuity points. Quadratic interpolation, requiring three grid points for each interpolated value, was carried out to give the required values of S(z, n, N ) between grid points, and wherever possible the three grid points were chosen so as to avoid straddling discontinuity points. The
228
MULTI-ARMED BANDIT ALLOCATION INDICES
grid of values of S(, n, N ) was calculated for the following values of , unless ≥ NN , in which case S(, n, N ) is given by (8.29). For brevity the second suffix N has been omitted throughout. The integers in the specification indicate a corresponding number of equally spaced grid points between the two points separated by each integer; thus there were seven equally spaced grid points between n and n+1 for all n, for example. The semicolons separate grid points at which S(, n, N ) was calculated more (to the left) or less (to the right) accurately; the reason for this is that values of S(, n, N ) for close to nN have more influence on the calculation of mN for m < n, and it is therefore more important to calculate them accurately. n , 7, n+1 , 7, n+2 , 7, n+3 , 3, n+4 , 3, n+5 , 1, n+6 , 1, n+7 , 1, n+8
(all n < N ).
n+8 , n+9 , . . . , n+16 ; 7, n + 20 (2 ≤ n ≤ 6). n+8 , n+9 , . . . , n+12 ; 7, n + 20 (7 ≤ n ≤ 10). n+8 ; 7, n + 4n1/2 , 3, n + 6n1/2
(11 ≤ n ≤ 16).
n+8 , 3, n + 2n1/2 ; 7, n + 4n1/2 , 3, n + 6n1/2
(17 ≤ n ≤ 36).
n+8 , 7, n + 2n1/2 ; 7, n + 4n1/2 , 3, n + 6n1/2
(n > 36).
For values of greater than the largest grid point for a given n, S(, n, N ) was approximated as (n − 1)−1 (1 − a)−1 , corresponding to the assumption that the standard bandit process is never selected after the reward process S reaches the state (, n).This is a good approximation for large , and means that the contribution to S(z, n, N )z−n dz from this part of the range may be evaluated explicitly. In the notation of Section 7.7, θ1 = θ −1 and θ2 = θ −2 . Thus the (improper) prior density, π0 (θ ) = a constant, leads after n observations totalling to a posterior gamma density with parameters n + 1 and , with respect to which θ1 has mean x¯n (= /n) and variance (n − 1)−1 x¯n2 . With this modified prior and sn = x¯n , the conditions of Theorem 7.32 may be checked. Condition (i) follows from Chebyshev’s theorem, 7.31. Condition (ii) follows from the Lindberg–Levy central limit theorem, 7.30, when we note that V is asymptotically a standardized gamma variate whose distribution function may be expressed as the convolution of n + 1 identical exponential distribution functions. The random variables U and θ are independent with respect to Pn , and condition (iii) is thus an immediate consequence of the same theorem. Thus Theorem 7.32 holds, as therefore does the limit (7.52). The posterior distribution for θ starting from the prior density π0 (θ ) ∝ θ −1 , and after n observations totalling , is the same as with π0 (θ ) = a constant after n − 1 observations totalling . Thus, with π0 (θ ) ∝ θ −1 , the limit (7.52) holds if
MULTI-POPULATION RANDOM SAMPLING (CALCULATIONS)
229
we replace x¯n by /(n − 1) and n by n − 1. It then takes the form ν(, n) − /(n − 1) (n − 1)(1 − a)1/2 → 2−1/2 /(n − 1)
(8.31)
as n(1 − a) → ∞ on a fixed-(/(n − 1)) sequence and a → 1. Since ν(b, n) = bν(, n), and by definition ν(n , n) = 1, it follows from (8.31) that n / (n − 1) → 1 as n(1 − a) → ∞ and a → 1, and hence that (n − 1 − n )(1 − a)1/2 → 2−1/2
(8.32)
as n(1 − a) → ∞ and a → 1. Because of the limit (8.32), values of ν(, n) have been tabulated in the form (n − 1 − n )(1 − a)1/2 ; see Table 8.13. The tabulated values are accurate to the five significant figures shown. As in the normal case, the accuracy was checked by repeated calculations with different numbers of grid points, different error bounds for numerical integration and root-finding, and different starting points for dynamic programming iteration. A more detailed account of the calculations is given by Amaral (1985). From these results it seems very likely that (n − 1 − n )(1 − a)1/2 tends to a limit as n tends to infinity for every fixed a. This is the analogue of the behaviour of n(1 − a)1/2 ν(0, n, 1, a) for the normal reward process with known variance. The fact that the limits in the two cases appear to differ more widely for values of a which are not close to 1 reflects the failure of the asymptotic normality argument associated with the limit (7.51) when a is not near 1.
8.6
Exponential target process
Suppose without loss of generality that the target T = 1, as this is just a matter of appropriately scaling the unit of measurement (see Theorem 7.17). The distributional assumptions are the same as for the exponential reward process, and the functional equation (7.11) becomes M(φ, , n, 1) −1 n = min φ , 1 + n
1
−n−1
M (φ, + x, n + 1, 1) ( + x)
dx . (8.33)
0
Defining N (, n) = M(φ, , n, 1), and N ∗ (, n) as the minimum expected number of samples from S and the standard target process before reaching the target 1, starting with S in the state (, n) and subject to the constraint that S must be selected at time zero, equation (8.33) reduces to N (, n) = min[φ −1 , N ∗ (, n)],
(8.34)
230
MULTI-ARMED BANDIT ALLOCATION INDICES
where
∗
N (, n) = 1 + n
+1
n
N (z, n + 1)z−n−1 dz.
(8.35)
If (φ, n) is the solution of the equation in , N ∗ (, n) = φ −1 , we have
(8.36)
ν((φ, n)n−1 , n) = φ.
For < (φ, n + 1) < + 1, part of the integral in (8.35) may be evaluated explicitly, giving
n ∗ −1 N (, n) = 1 − φ −1 (φ, n + 1) +1 n N (z, n + 1)z−n−1 dz. (8.37) + n (φ,n+1)
For + 1 < (φ, n + 1), we have N ∗ (, n) = 1 − φ −1
+1
n
−1 .
(8.38)
Values of (φ, n) for a given φ may be determined by a backwards induction calculation using equations (8.34), (8.37) and (8.38). Repeated calculations of this type for different values of φ establish a set of curves in (, n)-space on which the index values are known. Index values at other points may then be obtained by interpolation. This procedure will now be described rather more fully. Note that the dependence of N (, n) and N ∗ (, n) on φ has been suppressed from the notation. The calculation of the quantities (φ, n) was based on the analogues of equations (8.24) to (8.28) under the constraint that a switch from the target process S to the standard bandit process is not allowed for n > N . These lead to approximations (φ, n, N ) which decrease as N increases and tend to (φ, n) as N tends to infinity. The equations in question, all for n > 1, are N (, n − 1, N ) = min[φ −1 , N ∗ (, n − 1, N )], +1 ∗ n−1 N (z, n, N )z−n dz N (, n − 1, N ) = 1 + (n − 1)
(8.39) (n ≤ N ),
(8.40) N ∗ [(φ, n − 1, N ), n − 1, N ] = φ −1 ,
(8.41)
MULTI-POPULATION RANDOM SAMPLING (CALCULATIONS)
∗
N (, n − 1, N ) = 1 − φ
−1
(φ, n, N ) +1
n−1
+ (n − 1) n−1
231
−1 N (z, n, N )z−n dz
(φ,n,N)
( < (φ, n, N ) < + 1, n ≤ N ), n−1 ∗ −1 −1 N (, n − 1, N ) = 1 − φ +1
(8.42)
( + 1 < (φ, n, N ), n ≤ N ), n−1 (n > N ). N ∗ (, n − 1, N ) = −1
(8.43) (8.44)
Equation (8.44) is a consequence of the fact that, for n > N , N ∗ (, n − 1, N ) is the expected number of samples from S required to reach the target starting from the state (, n − 1), and for a given θ the expected number required is eθ . The right-hand side of (8.44) is the expectation of eθ with respect to the gamma distribution with the parameters and n − 1, which is the current posterior distribution for θ in the given state. When S is in the state (, n) the probability that the next value sampled from S exceeds the target 1, or current probability of success (CPS), is
∞ 1
∞
f (x|, n)dx = 1
n n dx = ( + x)n+1
+1
n .
The index ν(, n) of S in the state (, n) exceeds the CPS by an amount corresponding to uncertainty about the parameter θ , and the consequent possibility of estimating its value more accurately after further sampling from S. Since this uncertainty decreases as n increases, the value of the CPS which leads to a given index value φ increases with n, and tends to φ as n tends to infinity. For a given x¯ it is easy to show that CPS is a decreasing function of n. Since CPS is an increasing function of x¯ it follows that the value of x¯ which leads to the index value φ increases with n, which is to say that (φ, n)n−1 increases with n. A simple extension of this informal argument shows that (φ, n, N )n−1 increases with n for all φ and N , and therefore (φ, n, N ) certainly increases with n. For the next few paragraphs, φ and N are fixed, and (φ, n, N ) is abbreviated as n . From (8.41) and (8.44) it follows that N = (1 − φ 1/N )−1 .
232
MULTI-ARMED BANDIT ALLOCATION INDICES
N−1 , N−2 , . . . , 1 were calculated in turn using equations (8.39) to (8.43). Each successive value of n was obtained by solving for the equation N ∗ (, n, N ) = φ −1 using the bisection method. Equations (8.40), (8.42) and (8.43) were then used to calculate a grid of values of N (, n, N ) = N ∗ (, n, N ) for > n to use in evaluating the integrals required in the next stage of the calculation. The function S(·, n, N ) which occurs in the calculation of index values for the exponential reward process has a discontinuous rth order derivative at n+r−1,N (r ≤ N − n + 1) The argument used for that case shows, when applied to the exponential target process, that N (·, n, N ) has a discontinuous rth derivative at n+r−1 (r ≤ N − n + 1). These discontinuities all originate as discontinuities of the first derivative of N (·, n, N ) at n , defined by equations (8.39) and (8.41), for all the different values of n, and are then inherited as discontinuities of successively higher order derivatives via equation (8.40) as n decreases. For N (·, n, N ) there are further discontinuities of higher order derivatives generated by the discontinuity of the first-order derivative at n . These are discontinuities transmitted on at least one occasion via the upper limit of the integral in equation (8.49), rather than exclusively via the lower limit. It is not difficult to show that N (·, n, N ) has a discontinuous rth derivative at n+r−1 − k (r ≤ N − n + 1) for any positive integer k such that n+r−1 − n > k. The most important of these additional discontinuities are those for which k = 1, as these include discontinuities of lower order derivatives than those for k > 1. Accordingly, a scheme of calculation was adopted which avoids interpolating values N (·, n, N ) over ranges which include those points n+r−1 , and n+r−1 − 1 (r ≤ N − n + 1) which are close to n . Integrals of the form N (z, n, N )z−n dz were evaluated by Romberg’s procedure over sub-intervals chosen so that those close to the value z = n do not straddle discontinuity points of the types just mentioned. Quadratic interpolation to obtain values of N (z, n, N ) between grid points was as far as possible carried out using sets of three grid points which do not straddle these discontinuity points. The grid of values of N (, n, N ) was calculated for values of based on the following pattern, unless ≥ N , when N (, n, N ) = n ( − 1)−n . n , 7, n+1 , 7, n+2 , 7, n+2 , 3, n+4 , 3, n+5 , 1, n+6 , 1, n+7 , 1, n+8
( all n < N ).
n+8 , n+9 , . . . , n+16 ; 7, (n + 20)(− loge φ)−1 −1
n+8 , n+9 , . . . , n+12 ; 7, (n + 20)(− loge φ)
(2 ≤ n < 6). (7 ≤ n ≤ 10).
n+8 ; 7, (n + 4n1/2 )(− loge φ)−1 , 3, (n + 6n1/2 )(− loge φ)−1
(11 ≤ n < 16).
n+8 , 3, (n + 2n1/2 )(− loge φ)−1 ; (n + 4n1/2 )(− loge φ)−1 , 3, (n + 6n1/2 )(− loge φ)−1
(17 ≤ n ≤ 36).
In addition, those values of m − 1 (m = 1, 2, . . . ) for which, for a given n, n < m − 1 < n+8 , were taken as grid points. The integers in the above
MULTI-POPULATION RANDOM SAMPLING (CALCULATIONS)
233
array indicate numbers of additional equally spaced grid points between the n+r values separated in the list by the integer in question. If the two n+r values are also separated by values of m − 1, the additional points were equally spaced in, for example, the interval (n+r , m − 1), so that the number of additional points in any such sub-interval was 2k − 1 for some integer k, and so that the number of additional points separating two n+r values was always at least equal to the number shown in the array. The expected reward from the next observation when the target has not yet been reached is e−θ for a given value of θ . The expectation of e−θ with respect to the current distribution n with parameters n and x¯ (= n−1 ) is (nx) ¯ n (nx¯ −n −1 n + 1) , which tends to exp(−x¯ ) as n → ∞. The functions (ny) (ny + 1)−n and exp(−y −1 ) satisfy the conditions for G(y, n) and G(y) of Corollary 7.33, as a little manipulation shows, and conditions (i) to (iii) of Theorem 7.32 were checked in the previous section. However, the situation differs from that considered in the discussion leading to the limit (7.53), because the sequence of expected rewards (nx) ¯ n (nx¯ + 1)−n terminates as soon as the target is reached. In fact (7.53) may be applied directly to a modified target process for which rewards occur for every observation Xi which exceeds the target, rather than just for the first such observation. Writing s(t) (t = 0, 1, 2 . . . ) for the sequence of states, and pt for the probability of exceeding the target with an observation taken when the state is s(t), the index for a general target process with this modification, and with the discount factor a, may be written E(p0 + ap1 + a 2 p2 + · · · + a τ −1 pτ −1 ) . E(1 + a + a 2 + · · · + a τ −1 ) τ >0
νaM = sup
(8.45)
Now note that the index of a target process without the modification and with no discounting may be written in the same form as (8.45), but with a r replaced by q0 q1 · · · qr−1 (r = 1, 2, . . . , τ − 1), where qr = 1 − pr . This follows by calibrating the target process against the class of standard target processes for which the probability of reaching the target is known and the same for each observation. The calibration procedure is similar to that used, for example, in the discussion of Problem 4 in Chapter 1. This observation suggests a method of obtaining an approximation for ν(, n) for the exponential target process. Provided qr only varies slowly with r, which is true for large values of n, we should find that ν(, n) = νqM (S, n), where q = q0 = 1 − n ( + 1)−n = q(, n). Plugging all this into (7.53), we may conjecture that [ν(, n) − 1 + q(, n)] exp(−n −1 )[1 − q(, n)]1/2 → 2−1/2 as n(1 − q) → ∞ and q → 1.
(8.46)
234
MULTI-ARMED BANDIT ALLOCATION INDICES
Now note that it follows from Chebyshev’s theorem, 7.31, that n (|x¯ −1 − θ | > ) → 0
n→∞
as
( > 0).
Thus for large n the modified target process behaves like a standard bandit process with the parameter Pθ (Xi ≥ 1|θ = x¯ −1 ) exp(−x¯ −1 ), so that
νqM (nx, ¯ n) → exp(−x¯ −1 )
n → ∞.
as
We also have 1 − q(nx, ¯ n) = (nx) ¯ n (nx¯ + 1)−1 → exp(−x −1 ) as n → ∞. It follows that the limit (8.46) may be rewritten in the form n(, n) = −
n[ν(, n) − n ( + 1)−n ] → 2−1/2 ν(, n)1/2 loge [ν(, n)]
(8.47)
as n(1 − q) → ∞ and q → 1. The limit (8.47) remains a conjecture because we have not shown that it is legitimate to replace qr by q when q is close to 1, even for large values of n. The results of calculations of the quantities (ν, n), where ν((ν, n), n) = ν, are shown in Table 8.14. Because of the conjectured limit (8.47), values of n(, n) are also tabulated in Table 8.15. The accuracy of the calculations was checked along the lines described for other cases. The tables are consistent with the conjectured limit and, more importantly, suggest that n(, n) tends to a limit as n tends to infinity for any fixed value ν of ν(, n), and that this limit does not change rapidly with ν, thus offering a means of extrapolation and interpolation to find values of ν(, n) which have not been directly calculated. In fact, for all n ≥ 10, a function of the form −1 )
A − Bn(C+Dn+En
,
where the five constants depend on ν, turns out to approximate n(x, ¯ n) well for any fixed value ν of ν(, n). Table 8.16 shows values of these constants for ν between 0.01 and 0.3. For 10 ≤ n ≤ 850 the relative error in (, n) is at most 0.6% for the tabulated values of ν. Note that (, n) = 0 (n ∈ Z) for ν ≥ 0.4. This is a consequence of (7.8).
8.7
Bernoulli/exponential target process
As in the case of the exponential target process, we suppose the observations to be scaled so that the target T = 1. The sampled random variable X for this
MULTI-POPULATION RANDOM SAMPLING (CALCULATIONS)
235
process has a density with respect to the measure μ on the non-negative real line for which μ({0}) = 1, and μ(I (a, b)) = b − a (b ≥ a ≥ 0), where I (a, b) is the open interval (a, b). There are two unknown parameters p (0 < p < 1) and θ (> 0). f (0|p, θ ) = 1 − p,
f (x|p, θ ) = pθ e−θx
(x > 0),
so that the distribution of X is made up of an atom of probability 1 − p at the origin together with an exponential distribution over R+ . As mentioned in Section 7.1, the reason for considering this particular process is that a simple family of alternative Bernoulli/exponential target processes is appropriate as a model for the search for a compound which shows activity at some target level in a commercial pharmaceutical project. Different target processes correspond to different classes of compounds, any of which may include active compounds. The distribution of activity within a class of compounds typically tails off for higher levels of activity, and the activity scale may fairly easily be transformed so as to produce an approximately exponential distribution. The atom of probability at x = 0 corresponds to the fact that compounds frequently show no observable activity. It is also frequently desirable to ignore any activity below some threshold level for the purpose of estimating the parameter θ , as extrapolation errors caused by a distribution that is only approximately exponential tend to be less serious when the range over which an exponential distribution is assumed is reduced. This may be achieved by setting the origin of the activity scale at the threshold level. Fairly extensive tables of index values for the Bernoulli/exponential target process have been calculated, and their use in new-product chemical research discussed, by Jones (1975). Here we shall simply establish the form of the functional equation and exhibit the asymptotic relationships between the index for this process, and those for the exponential target process, and for a Bernoulli reward process with discounting only when a successful trial occurs. The prior density π0 (p, θ ) ∝ p −1 (1 − p)−1 θ −1 . Also N i=1
N n f (xi |p, θ ) = p (1 − p)m θ n e−θ , n
where m is the number of zero xi s, m + n = N , and = nx¯ = n and are jointly sufficient statistics and πN (p, θ |, n) ∝ π0 (p, θ )
N
xi > 0 xi .
Hence
f (xi |p, θ )
i=1
∝
(n + m) n−1 n n−1 −θ . p (1 − p)m−1 θ e (n) (m) (n)
Thus the posterior distributions for p and θ are independent and, respectively, beta with parameters m and n, and gamma with parameters n and . Conjugate
236
MULTI-ARMED BANDIT ALLOCATION INDICES
prior distributions for p and θ are therefore of this form, though we shall widen the class of distributions under consideration by allowing the second parameter n1 of the beta distribution, and the first parameter n2 of the gamma distribution, to take different values. We have ∞ 1 f (0|πN ) = f (0|m, n1 , , n2 ) = f (0|p, θ )πN (p, θ |m, n1 , , n2 )dp dθ =
∞ 1
0
0
0
(1 − p)πN (p, θ |m, n1 , , n2 )dp dθ
0
1
1
m (n1 + m) n1 −1 (1 − p)m dp = , p m + n1 0 (n1 ) (m) ∞ 1 f (x|p, θ )πN (p, θ |m, n1 , , n2 )dp dθ f (x|πN ) = f (x|m, n1 , , n2 ) = =
0
0
(n1 + m) n1 −1 p (1 − p)m dp = p (n1 ) (m) 0 n1 n2 n2 = (x > 0). m + n1 ( + x)n2 +1
∞
θ e−θx
0
n2 n2 −1 −θ e dθ θ (n2 )
The functional equation (7.11) becomes M(φ, m, n1 , , n2 , 1) m = min φ −1 , 1 + M(φ, m + 1, n1 , , n2 , 1) m + n1 1 n1 n2 n2 + M (φ, m, n1 + 1, + x, n2 + 1, 1) dx . (8.48) m + n1 0 ( + x)n2 +1 When the two expressions inside the square brackets in (8.48) are equal, ν(m, n1 , , n2 ) = φ. Equation (8.48) may be solved by backwards induction along the lines followed for the exponential target process (Section 8.6). The starting point now is a grid of values of M(φ, m, n1 , , n2 , 1) for a given φ, for m and n1 such that m + n1 = N , and for n2 = n1 + c, where c is fixed. Successive grids for m + n1 = N − 1, N − 2 etc. may then be calculated by substituting in the righthand side of (8.48). This procedure is described in more detail by Jones. The complexity of the calculations is now of order N 2 for a single value of φ, rather than of order N as for the exponential target process, and an approximation to ν(m, n1 , , n2 ) based on the index ν(, n2 ) for the exponential process is therefore of interest. Proposition 8.1 ν(m, n1 , , n2 ) >
n1 − 1 ν(, n2 ). m + n1 − 1
MULTI-POPULATION RANDOM SAMPLING (CALCULATIONS)
237
Proof. The sequence X1 , X2 , . . . of values sampled from an exponential process which starts from the state (, n2 ) has the same distribution as the sub-sequence of positive Xi s sampled from a Bernoulli/exponential process which starts from the state (m, n1 , , n2 )–the sub-sequence, that is to say, obtained by striking out any zero-valued Xi s. If τ is a stopping time for the exponential process, let σ (τ ) denote the stopping time for the Bernoulli/exponential process defined in terms of the sub-sequence of positive corresponding Xi s. We have Rσ (τ ) (m, n1 , , n2 ) = Rτ (, n2 ), since zero-valued Xi s produce zero-valued rewards, and Wσ (τ ) (m, n1 , , n2 ) = E(T |m, n1 , , n2 )Wτ (, n2 ), where T is the interval between successive positive Xi s for the Bernoulli/exponential process. E(T |m, n1 , , n2 ) = E[E(T |p)|m, n1 , , n2 ] 1 m + n1 − 1 (m + n1 ) n1 −1 p −1 (1 − p)m−1 dp = = p . (m) (n ) n1 − 1 1 0 Thus Rσ (τ ) (m, n1 , , n2 ) Rτ (, n2 ) n1 − 1 = sup W (m, n , , n ) m + n − 1 W σ (τ ) 1 2 1 τ (, n2 ) τ >0 τ >0
ν(m, n1 , , n2 )> sup =
n1 − 1 ν(, n2 ), m + n1 − 1
(8.49)
since the stopping times of the form σ (τ ) are a subset of the stopping times for the Bernoulli/exponential process. In fact they are a subset which take no account of changes in the distribution of p as more information becomes available. Fairly obviously a higher rate of return may be achieved by using such information, hence the strict inequality in (8.49). The lower bound to ν(m, n1 , , n2 ) given by Proposition 8.1 gives a good approximation when the ignored information mentioned in the proof is relatively unimportant. This is so when p is already known fairly accurately, which is to say when m and n1 are both large. When n2 is large, a different approximation comes into play, based on the fact that a large n2 means that not much remains to be learnt about θ , which means that P (X > 1|X > 0, m, n1 , , n2 ) does not change much as n2 increases. In fact we have n2 P (X > 1|X > 0, m, n1 , , n2 ) = exp(n2 −1 ) e−θ +1 for large values of n2 . For a given initial state π(0) defined by (m, n1 , , n2 ), let π(t) denote the state at process time t, τ = min{t : π(t)∈0 } for some subset 0 of the state-space
238
MULTI-ARMED BANDIT ALLOCATION INDICES
which does not include the completion state C, and T = min{t : π(t) = C}. For t < min(τ, T ) let p(π(t)) = P X > 1|π(t) , q(π(t)) = P π(t + 1) ∈ 0 |π(t) , r(π(t)) = 1 − p(π(t)) − q(π(t)). As mentioned in Section 7.2, in determining the index value, attention may be restricted to stopping times of the form τA = min(τ, T ) for some 0 . Thus ∞ i−1 E p(π(0)) + p(π(i)) r(π(j )) i=1
ν(m, n1 , , n2 ) = sup
j =0
∞ i−1 E 1+ r(π(j ))
τ >0
.
(8.50)
i=1 j =0
If n max(m, n1 ), a good approximation to ν(m, n1 , , n2 ) may be obtained by restricting 0 to be of the form {(m , n 1 , , n 2 ) : (m , n 1 ) ∈ ∗0 ⊂ R+ × R+ }. This restriction amounts to ignoring any further information about θ and making τ depend only on the information about p obtained from the sampling process; this is reasonable if θ is known much more accurately than p, as the assumption n2 max(m, n1 ) implies. We may also assume that if (m , n 1 ) ∈ ∗0 then (m + h, n 1 ) ∈ ∗0 (h > 0) and (m , n 1 − h) ∈ ∗0 (0 < h < n 1 ), as the beta distributions for p with the parameters (m + h, n 1 ) and (m , n 1 − h) are stochastically less than the distribution with parameters (m , n 1 ), and the option of stopping is therefore relatively more attractive. With 0 of this form, and writing π = (m , n 1 , , n 2 ) (π ∈ 0 ∪ {C}), n p(π) = 1 m + n1
+ 1
n
2
,
m if (m + 1, n 1 ) ∈ ∗0 , q(π) = m + n 1 0 otherwise n 0 2 n 1 m 1− + r(π) = m + n1 +1 m + n 1
if (m + 1, n 1 ) ∈ ∗0 otherwise
.
Writing Q = n2 ( + 1)−n2 , and ignoring changes in this quantity as n2 increases, it follows on substituting expressions of the above form for p(π(t)),
MULTI-POPULATION RANDOM SAMPLING (CALCULATIONS)
239
q(π(t)) and r(π(t)) in (8.50) that, to the stated approximations, ν(m, n1 , , n2 ) = Qν(m, n1 ), where ν(m, n1 ) is the index for a Bernoulli reward process in the state (m, n1 ), for which the discount factor between successive trials is 1 if the second trial of the pair is unsuccessful and 1 − Q if it is successful. Index values for this modified Bernoulli reward process may be calculated by making the obvious changes to the procedure described in Section 7.4, again with a big improvement in the complexity of the calculation compared with backwards induction based on equation (8.48).
Exercises 8.1. Explain why you might expect the following statement to hold in respect of the Gittins indices for the Bernoulli reward process. ν(α, β + 1, a) < ν(α, β, a) < ν(α + 1, β − 1, a).
8.2. 8.3.
8.4.
8.5.
For a proof see Bellman (1956). How might these inequalities be used to change the MATLAB program in Section 8.4 so that it will run more quickly? Work through the derivations of some of the equations (8.3), (8.10), (8.19), (8.23), (8.34), (8.37) and (8.48). The quantities Q(λ, n, N ) appearing in equations (8.5) and (8.6), and in equations (8.11) and (8.12), are defined as maxima over sets of policies which increase with N . Explain why it follows from this that the associated quantities ξnN and ζnN are increasing with N , and why they tend to the limits ξn and ζn as N → ∞. Chebyshev’s theorem, 7.31, was cited as establishing condition (i) of Theorem 7.32 in the two applications discussed in Sections 8.4 and 8.5. Show that this claim is justified. Derive the expression for the Gittins index of a target process described in the paragraph following equation (8.45).
9
Further exploitation 9.1
Introduction
The first edition of this book presented the Gittins index solution to the exponentially discounted multi-armed bandit problem and its application to a wide class of sequential resource allocation and stochastic scheduling problems. Our aims in this second edition have been to give a refreshed account of the same material and to say what more has been done since 1989. There has been a remarkable flowering of new insights, generalizations and applications since publication of the first edition. We have used these freely to update the presentation, while still preserving most of the original material. New to this edition is the material on the achievable region approach to stochastic optimization problems, the construction of performance bounds for suboptimal policies, Whittle’s restless bandits, and the use of Lagrangian relaxation in the construction and evaluation of index policies. The book now includes numerous proofs of the index theorem and discusses the special insights that they provide. Many contemporary applications are surveyed, and over 100 new references have been included. In this final chapter, we briefly mention some other interesting work of an applied nature. The application in Section 9.2 is to a problem of website design. In Section 9.3 we describe how some interesting restless bandit models have been formulated in economics. The use of the Gittins index to discuss the value of information is considered in Section 9.4 We say a bit more about work on job-scheduling problems in Section 9.5. Some military applications are described in Section 9.6.
Multi-armed Bandit Allocation Indices, Second Edition. John Gittins, Kevin Glazebrook and Richard Weber © 2011 John Wiley & Sons, Ltd. Published 2011 by John Wiley & Sons, Ltd. ISBN: 978-0-470-67002-6
242
MULTI-ARMED BANDIT ALLOCATION INDICES
9.2
Website morphing
Commercial website design is big business. Websites need to be easy and attractive to use. The problem of choosing the best format has led to an application of the Bernoulli reward process multi-armed bandit. Defining our 1–0 success–failure variable to correspond to a website visit which ends with, or without, a purchase, we can identify the arms of our multiarmed bandit with the different available formats. If we suppose that the format presented to each visitor depends on the success records to date of the various formats, and that the objective is to maximize the expected total discounted number of successful visits, the correspondence with a multi-armed bandit is almost exact. In fact the proponents of real-time website morphing propose a more sophisticated approach which takes account of different cognitive types among the visitors to the website. A person’s cognitive type describes the way that person most readily assimilates and evaluates information. This is measured in several dimensions, for example in a preference either for lots of detailed upfront information, or for big picture predigested summaries, or in the degree to which an individual is likely to be influenced by the opinions of other consumers. This suggests a model in which each cognitive type forms a separate visitor stream and a separate multi-armed bandit. The difficulty is that we do not know a visitor’s cognitive type when they first access the website. Hauser, Urban, Liberali and Braun (2009) deal with this issue by having a Bayesian prior distribution over cognitive types. This is transformed into a posterior distribution by the evidence provided by the first few clicks after each visitor accesses the website. The next steps in the procedure are heuristic. With just one cognitive type we would simply assign to each successive visitor the format with the highest current Gittins index value. With several cognitive types, each format has a separate index value for each cognitive type. The expected Gittins index is evaluated using the posterior distribution over cognitive types. The website then, in real time, morphs (i.e. changes) to the format with the highest expected Gittins index. With just one cognitive type, the resulting purchase, or failure to purchase, is then used via Bayes’ theorem to update the distribution of the success probability for the format chosen. With several cognitive types, each format has a separate distribution of success probabilities for each cognitive type. Each of these distributions is updated via Bayes’ theorem after each visitor, the extent of the adjustment determined by weights given by the posterior distribution over cognitive types for the visitor. An application in a similar vein is described by Babaioff, Sharma and Slivkins (2009), who formulate a multi-armed bandit model of pay-per-click auctions for internet advertising. In each round the auctioneer selects an advertiser and shows her advertisement, which is then either clicked or not.
FURTHER EXPLOITATION
9.3
243
Economics
Economists have been attracted to the multi-armed bandit problem, especially as it allows them to model the trade-off of exploration and exploitation (two commonly used terms in the economics field); see for example March (1991), or Levinthal and Posen (2011). We note in passing that this terminology to describe the tension between a long-term and a short-term perspective was coined independently by Gittins and Jones (1974b). Bergemann and V¨alim¨aki (2008) have written a survey paper on bandit problems which is good in its coverage of applications in economics. They give an economic interpretation of the prevailing charges proof of the index theorem (see Section 2.5) by imagining that arms are owned by agents who rent them out in a competitive market of operators, using a descending price auction. Agents maximize their rental income (the prevailing charges they obtain) when the operators obtain best use of the arms by operating them to maximize the payoff. One particularly active line of research is concerned with so-called mechanism design. Suppose that each of n arms has an owner, and the state of each arm is known only to its owner. Any reward that is obtained from an arm is kept by its owner. At each step, each owner makes a claim about his arm’s Gittins index. If these claims were truthful then the Gittins index policy would maximize reward. However, the owners are self-interested and may find it advantageous to lie. The aim of a social planner is to implement some scheme that incentivizes owners to truthfully report the state of their arms. For example, if the private information about each arm corresponds to the length of a job, then owners of jobs might be required to make payments that depend on the processing times they declare. The order in which jobs are processed can also be made to depend on the declarations. Cavallo, Parkes and Singh (2006) have explored this type of problem and described an incentive-compatible mechanism for eliciting state information. They use Gittins indices, and distribute computation of the optimal policy among the agents. Along similar lines, Cavallo (2008) discusses a problem in which, during each of weeks t = 0, 1, . . . , a mobile health clinic is to be allocated to exactly one of n neighbourhoods within a city. Neighbourhood leaders make weekly claims about the value they place on being allocated the clinic. The state of neighbourhood i might encapsulate information about the value which the neighbourhood leader places on being allocated the clinic. When the clinic is allocated to a neighbourhood, then the value that the neighbourhood has for the clinic changes–perhaps a significant number of needs are not met, or the local population now knows more about the clinic. If a neighbourhood is not allocated the clinic then its state may also change–perhaps because more people are now waiting to be treated. The government would like to allocate the clinic so that it does the most good. This poses a problem that is similar to that of restless bandits except that the state of each bandit is private information of that bandit, who must be incentivized to reveal it.
244
MULTI-ARMED BANDIT ALLOCATION INDICES
9.4
Value of information
Researchers have used the multi-armed bandit problem and Gittins indices to discuss issues of imperfect information and how to assess the worth of information. We briefly mention some representative examples of the questions addressed. Berry and Kertz (1991) have studied the value of perfect information for Bernoulli bandits, comparing the optimal rewards obtained by players with perfect and imperfect information. Gittins and Wang (1992) have investigated the importance of acquiring additional information. They define the learning component of the index as the difference between the index and the expected immediate reward. They show that for two arms with the same expected immediate reward, the learning component is larger for the arm for which the reward rate is more uncertain. Bhulai and Koole (2000) look at two extreme situations, which occur when a bandit has been played N times: the situation where the decision maker stops learning and the situation where the decision maker acquires full information about that bandit. They show that the difference in reward between these lower and upper bounds goes to zero as N grows large. Pek¨oz (2003) considers a problem in which the goal is to maximize the long-run average reward per pull with the restriction that any previously acquired information is forgotten whenever a switch between arms is made. He investigates some peculiarities of optimal policies. Jones and Koo (2008) discuss bandit problems in which the sequence of discount factors is not known in advance. They compare the reward of observers who have information about the discount sequence from the start with those who do not.
9.5
More on job-scheduling problems
As we have seen, the field of job scheduling is rich in problems for which bandit processes or restless bandits can provide helpful models. Fay and Glazebrook (1989) discuss a job-scheduling problem in which there are unsuccessful job completions, but jobs are alternatives, so that only one from a set of alternative jobs must be completed successfully. Glazebrook (1991) looks at scheduling jobs on machines that are subject to breakdown, and demonstrates that, under mild restrictions, a non-preemptive policy is optimal in the class of preemptive ones. Ishikida and Wan (1997) discuss a problem of single-server scheduling of jobs that expire at predetermined dates. An index-based scheduling policy is shown to maximize the expected total reward. In a similar vein, Righter and Shanthikumar (1998) give conditions on the optimality of an index policy for multi-armed bandits when bandits are deteriorating and arms expire independently (geometrically). They also give a new simple proof of the optimality of the Gittins index policy for the classic multi-armed bandit problem (using ideas similar to Weiss (1988)).
FURTHER EXPLOITATION
245
The problem of developing optimal policies for job scheduling or a family of alternative bandit processes with switching penalties is a famously intractable one to which several important contributions have been made. See, for example, Glazebrook (1980b), Agrawal, Hedge and Teneketzis (1988), van Oyen and Teneketzis (1994), Banks and Sundaram (1994). Le Ny and Feron (2006) present a linear programming relaxation for the restless bandits problem with discounted rewards and with additional costs penalizing switching between projects. Similarly, Guha and Munagala (2009) examine a multi-armed bandit problem in which pulling an arm in state x incurs an operating cost c(x), and a switch from pulling an arm in state x to one in state y costs (x, y). They consider the finite-horizon future utilization problem, in which the aim is to maximize the reward obtained at the T th pull, subject to constraints on the total operating costs and switching costs incurred during the previous pulls. It is a problem that addresses practical problems in robot navigation, sensor networks and labour economics. Using Lagrangrian relaxation, they construct a polynomial-time computable policy that achieves a reward of at least 14 − times that achieved by an optimal policy. The paper of Jun (2004) provides a good survey of research on bandit problems with switching costs.
9.6
Military applications
Military logistics has proved a fruitful field for the application of index policies. Problem domains already mentioned in the text which provide models and analyses of relevance to military applications include the optimal exploration model of Problem 6 in Chapter 1, the discrete search models of Chapter 3, and the work on sensor management mentioned in Chapter 6 in relation to restless bandits. In this section we shall briefly describe two further problem domains. First, Glazebrook and Washburn (2004), in their review of so-called ShootLook-Shoot models, argue that the advent of modern long-range, accurate, but very expensive weapons has lent importance to analyses of approaches to shooting at targets which make good use of the information gathered on target status in the process. They describe and discuss the importance of multi-armed bandit models in this context. They cite the work of Manor and Kress (1997), who consider a firing problem in which target j has probability pj of being killed by a single shot. If the target is killed, that fact is confirmed with probability qj . Otherwise there is no feedback regarding the outcome of shots. The goal is to kill as many targets as possible with a fixed finite number of shots. This latter condition technically places the models outside of the infinite-horizon class with which we have primarily been concerned. However, the status of one-step-lookahead or myopic rules in this domain renders the horizon irrelevant. A proof is given that the myopic policy of always shooting at the target which offers the largest immediate gain is optimal. The authors argue the relevance of such models to the operation of weapon systems such as fibre-optic guided missiles. A manifest disadvantage of a myopic policy is its likely need to switch targets frequently.
246
MULTI-ARMED BANDIT ALLOCATION INDICES
Aviv and Kress (1997) explore the status of policies which do less switching. Inter alia, Glazebrook and Washburn (2004) are able to demonstrate the optimality of myopic policies for a model in which the feedback to the shooter following each shot is quite general in nature. A contribution by Barkdoll, Gaver, Glazebrook, Jacobs and Posadas (2002) exploits the notion of a superprocess discussed in Chapter 4. In their model, the shooter must not only make a choice of target at each decision epoch, but must also decide how to operate her engagement radar in support of each shot. Finally, Glazebrook, Kirkbride, Mitchell, Gaver and Jacobs (2007) demonstrate the ability of modern index theory to provide analyses of shooting problems in which a range of important problem features are modelled. Such features include • the threat posed to the shooter herself, which may be policy dependent; • the fact that the shooter may not have certainty regarding the kinds of targets she is facing; • the possibility that the targets may change over time; • the opportunity for the shooter to disengage from the current conflict in order to secure value from targets which may appear later. In their analyses, much use is made of the multiplicative reward model of Nash (1980), discussed in Section 3.5.1. In a quite different problem domain, Pilnick, Glazebrook and Gaver (1991) explore the application of index theory to a problem concerning the replenishment of a carrier battle group with ammunition (and possibly other kinds of supplies) while in an area subject to attack by enemy aircraft. One approach to such replenishment, called VERTREP (vertical replenishment), uses a logistics helicopter to transport pallets carrying ammunition from an on-station ammunition ship to the ships in the battle group. It takes time δi for the outward trip by the helicopter to ship i, ρi for the return flight and ψi is the time taken from receipt of a shipment of ammunition on ship i until it is ready for use. If we suppose that ψi ≤ ρi for all i then there will be no queues of ammunition on board ships waiting to be readied for use. Since it is unlikely that it will be possible to fully replenish the battle group in advance of the next attack, the question of priorities in replenishment is an important one. We use U (j ) = U (j1 , j2 , . . . , jn ) for the utility associated with having ji lifts of ammunition ready for use on board ship i, 1 ≤ i ≤ n, when the next attack occurs. If we suppose that T , the time of the next attack, has an exp(μ) distribution, then the goal of analysis is to develop a policy π for the scheduling of helicopter lifts which will maximize Eπ [U (j (T ))]. The nature of the best replenishment strategy will depend critically upon the nature of the utility function U , which will in turn depend upon the nature of the threat to the battle group. Consider a situation, for example, in which each defending ship has one antiaircraft missile system and where each delivery from the helicopter brings a single missile. Suppose further that pi is the single-shot kill probability of the missile
FURTHER EXPLOITATION
247
system on ship i and that θi is the probability that the attacker is engageable by ship i. If we assume that engageability by defenders is mutually exclusive (the attacker is engaged by at most one ship) and that θ0 is the probability that the attacker fails to be engaged at all, then it is natural to take the utility U (j ) to be the probability that the attacker is killed when the battle group missile state is j , namely U (j ) = 1 − θ0 −
n
θi (1 − pi )ji .
(9.1)
i=1
With this utility, the problem of determining an optimal replenishment policy may be readily formulated as a simple family of alternative bandit processes, with Gittins index νij =
θi pi (1 − pi )j −1 e−(δi +ψi )μ , 1 − e−(δi +ρi )μ
0 ≤ j ≤ Li − 1, 1 ≤ i ≤ n,
(9.2)
associated with ship i with j lifts of ammunition already on board. Note that in (9.2) we use Li for the number of lifts requested by ship i. The index νij is decreasing in j for each i and so the index policy is of one-step-look-ahead type and will involve much switching of deliveries between ships. This is entirely appropriate for the situation of mutually exclusive engagement envisaged in which there is an evident need to have adequate missile supplies on all ships with a serious chance of engaging the attacker. We could, by way of contrast, imagine a situation in which, should the attacker be engageable by one ship then he is engageable by all n in the battle group. This in turn leads to a utility U (j ) = (1 − θ0 ) 1 − ni=1 (1 − pi )ji . (9.3) which yields a multi-armed bandit with a multiplicative reward structure of the type discussed by Nash (1980). Here the index associated with ship i having j lifts of ammunition already on board is given by νij =
pi e−(δi +ψi )μ , 1 − e−(δi +ρi )μ
0 ≤ j ≤ Li − 1, 1 ≤ i ≤ n.
In this case the index νij does not depend upon j and hence there are optimal replenishment policies which are non-preemptive and supply ships exhaustively, one at a time. Under the assumption of collective engagement, it makes sense to make sure that the most effective defenders are fully replenished first. Perhaps the most likely scenario is one midway between the two extremes above, which yield the utilities in (9.1) and (9.3). We might imagine that the battle group is constituted of a collection of mutually exclusive groups of ships, with (at most) one group to be engaged by the attacker. Further, all ships in the engaged group are all individually engaged. The index analysis regards each
248
MULTI-ARMED BANDIT ALLOCATION INDICES
subgroup of ships as a superprocess. The critical Condition D in Chapter 4 is satisfied if the times δi , ρi and ψi are common to the ships in a group. This condition is plausible when the group members are in close proximity to one another. Under this condition there exists an optimal replenishment strategy of index form which is non-preemptive within each group but which is likely to involve considerable switching between groups. Pilnick et al . (1991) also discuss a very different approach to replenishment called CONREP (connected replenishment), in which a delivery ship draws alongside each receiving ship in turn and deliveries are made via wire highlines which are rigged between them. Under this scenario, such is the time involved in switching between ships that it is reasonable to suppose that a replenishment policy has the character that each receiving ship has at most a single period alongside the delivery ship. Such a situation is envisaged in the second model given as part of exercise 6.1, and a dynamic programming analysis can be developed with the use of index ideas.
Tables

Table 8.1 Normal reward process (known variance), values of n(1 − a)^{1/2} ν(0, n, 1, a) (Section 8.2).

    n      a = .5      .6       .7       .8       .9       .95      .99      .995
    1      .14542   .17451   .20218   .22582   .23609   .22263   .15758   .12852
    2      .17209   .20815   .24359   .27584   .29485   .28366   .20830   .17192
    3      .18522   .22513   .26515   .30297   .32876   .32072   .24184   .20137
    4      .19317   .23560   .27874   .32059   .35179   .34687   .26709   .22398
    5      .19855   .24277   .28820   .33314   .36879   .36678   .28736   .24242
    6      .20244   .24801   .29521   .34261   .38200   .38267   .30429   .25803
    7      .20539   .25202   .30063   .35005   .39265   .39577   .31881   .27158
    8      .20771   .25520   .30496   .35607   .40146   .40682   .33149   .28356
    9      .20959   .25777   .30851   .36105   .40889   .41631   .34275   .29428
   10      .21113   .25991   .31147   .36525   .41526   .42458   .35285   .30400
   20      .21867   .27048   .32642   .38715   .45047   .47295   .41888   .36986
   30      .22142   .27443   .33215   .39593   .46577   .49583   .45587   .40886
   40      .22286   .27650   .33520   .40070   .47448   .50953   .48072   .43613
   50      .22374   .27778   .33709   .40370   .48013   .51876   .49898   .45679
   60      .22433   .27864   .33838   .40577   .48411   .52543   .51313   .47324
   70      .22476   .27927   .33932   .40728   .48707   .53050   .52451   .48677
   80      .22508   .27974   .34003   .40843   .48935   .53449   .53391   .49817
   90      .22534   .28011   .34059   .40934   .49117   .53771   .54184   .50796
  100      .22554   .28041   .34104   .41008   .49266   .54037   .54864   .51648
  200      .22646   .28177   .34311   .41348   .49970   .55344   .58626   .56637
  300      .22678   .28223   .34381   .41466   .50219   .55829   .60270   .59006
  400      .22693   .28246   .34416   .41525   .50347   .56084   .61220   .60436
  500      .22703   .28260   .34438   .41561   .50425   .56242   .61844   .61410
  600      .22709   .28270   .34452   .41585   .50478   .56351   .62290   .62123
  700      .22714   .28276   .34462   .41602   .50516   .56431   .62629   .62674
  800      .22717   .28281   .34470   .41615   .50545   .56493   .62896   .63116
  900      .22720   .28285   .34476   .41625   .50568   .56543   .63121   .63481
 1000      .22722   .28288   .34480   .41633   .50587   .56583   .63308   .63789
    ∞      .22741   .28316   .34524   .41714   .5092    .583

(continued overleaf)
Table 8.1 Normal reward process (known variance), values of n(1 − a)^{1/2} ν(0, n, 1, a) (Section 8.2). (Continued)

a = .95, n/100 = 10, 11, ..., 36:
.56583 .56618 .56648 .56674 .56698 .56719 .56738 .56756 .56772 .56787
.56802 .56815 .56828 .56840 .56851 .56862 .56872 .56882 .56892 .56901
.56910 .56918 .56926 .56934 .56942 .56950 .56957

a = .99, n/100 = 10, 11, ..., 36:
.63308 .63468 .63611 .63737 .63852 .63956 .64054 .64144 .64229 .64309
.64385 .64458 .64527 .64594 .64658 .64720 .64780 .64838 .64895 .64949
.65003 .65055 .65106 .65156 .65205 .65253 .65299

a = .99, n/100 = 37, 38, ..., 63:
.65345 .65390 .65435 .65478 .65521 .65562 .65604 .65644 .65684 .65724
.65763 .65801 .65838 .65875 .65912 .65948 .65984 .66019 .66053 .66088
.66122 .66155 .66188 .66221 .66253 .66285 .66317

a = .99, n/100 = 64, 65, ..., 90:
.66348 .66379 .66409 .66440 .66470 .66499 .66529 .66558 .66586 .66615
.66643 .66671 .66698 .66726 .66753 .66780 .66806 .66833 .66859 .66885
.66911 .66936 .66962 .66987 .67012 .67036 .67061

a = .995, n/100 = 10, 11, ..., 20:
.63789 .64055 .64288 .64492 .64689 .64861 .65020 .65167 .65304 .65433 .65554
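The normalization in Table 8.1 makes the tabulated values independent of the mean and standard deviation of an individual arm. As a minimal illustration of how the table might be used (a sketch under assumptions, not code from the book): for a normal reward process with known standard deviation σ, posterior mean x̄ based on n observations and discount factor a, the index is x̄ + σν(0, n, 1, a), where ν(0, n, 1, a) is recovered by dividing the tabulated quantity by n(1 − a)^{1/2}. The function name, the stored slice of the table and the reading of n as the effective number of observations are assumptions of this sketch.

    import math

    # A few entries of Table 8.1, column a = 0.9: each value is the
    # tabulated quantity n * (1 - a)**0.5 * nu(0, n, 1, a).
    TABLE_8_1_A09 = {1: 0.23609, 2: 0.29485, 5: 0.36879,
                     10: 0.41526, 100: 0.49266, 1000: 0.50587}

    def normal_gittins_index(mean, n, sigma, a=0.9, table=TABLE_8_1_A09):
        """Index of a normal reward process with known variance.

        Uses the scale and location normalization described above:
            index = mean + sigma * nu(0, n, 1, a),
        where nu(0, n, 1, a) is the tabulated value divided by
        n * sqrt(1 - a).  Only exact table entries are looked up here;
        untabulated n would need interpolation.
        """
        nu0 = table[n] / (n * math.sqrt(1.0 - a))
        return mean + sigma * nu0

    # Example: posterior mean 1.2 after n = 5 observations, sigma = 2, a = 0.9:
    # 1.2 + 2 * 0.36879 / (5 * sqrt(0.1)) is approximately 1.67.
    print(normal_gittins_index(1.2, 5, 2.0))

For values of n or a that are not tabulated, interpolation in the table (or the asymptotic n = ∞ row) would be needed; the sketch simply looks up exact entries.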
Table 8.2 Brownian reward process, lower bound for Tu(T) (Sections 7.6 and 8.2).

  T        Tu(T) >        T      Tu(T) >
  0.005    0.12852        1      0.56637
  0.01     0.17192        2      0.60436
  0.02     0.22398        5      0.63789
  0.05     0.30400        10     0.65554
  0.1      0.36986        20     0.64385
  0.2      0.43617        40     0.65478
  0.5      0.51648        80     0.66806
Table 8.3 Normal reward process, ratio of indices for the cases of unknown variance and known variance (Section 8.3). For n = 2, 3, 4 the values shown are ζξ^{-1} − 1; for n ≥ 5 the values shown are 100(ζξ^{-1} − 1).

    n      a = .5      .6       .7       .8       .9       .95      .99      .995
    2       0.971    5.365    5.995    8.132   10.088   14.988   36.767   52.950
    3       0.789    0.810    0.847    0.923    1.123    1.438    2.848    3.851
    4       0.389    0.396    0.408    0.432    0.496    0.597    1.011    1.289
    5      25.681   26.003   26.601   27.842   31.228   36.487   57.509   71.073
    6      19.012   19.212   19.585   20.365   22.512   25.865   39.095   47.376
    7      15.165   15.303   15.563   16.110   17.635   20.039   29.567   35.483
    8      12.612   12.714   12.907   13.316   14.470   16.308   23.643   28.192
    9      10.797   10.882   11.025   11.345   12.257   13.722   19.624   23.292
   10       9.440    9.502    9.622    9.881   10.624   11.829   16.728   19.784
   20       4.188    4.203    4.233    4.301    4.507    4.868    6.471    7.522
   30       2.689    2.696    2.710    2.741    2.840    3.021    3.886    4.479
   40       1.980    1.984    1.992    2.010    2.068    2.179    2.738    3.136
   50       1.566    1.569    1.574    1.586    1.625    1.700    2.098    2.390
   60       1.296    1.298    1.301    1.310    1.337    1.392    1.693    1.919
   70       1.106    1.106    1.109    1.115    1.136    1.178    1.414    1.597
   80       0.963    0.964    0.966    0.972    0.987    1.020    1.212    1.364
   90       0.854    0.854    0.856    0.860    0.873    0.899    1.059    1.188
  100       0.766    0.767    0.768    0.771    0.782    0.804    0.939    1.050
  200       0.380    0.380    0.379    0.380    0.383    0.390    0.435    0.476
  300       0.253    0.251    0.252    0.252    0.254    0.257    0.286    0.305
  400       0.188    0.188    0.189    0.189    0.190    0.192    0.207    0.224
  500       0.151    0.151    0.151    0.151    0.152    0.153    0.165    0.178
  600       0.126    0.126    0.125    0.126    0.126    0.127    0.137    0.149
  700       0.108    0.107    0.108    0.108    0.108    0.109    0.119    0.129
  800       0.094    0.095    0.094    0.095    0.095    0.096    0.110    0.114
  900       0.083    0.084    0.084    0.084    0.084    0.085    0.093    0.102
 1000       0.074    0.076    0.076    0.076    0.076    0.078    0.084    0.094
Table 8.4 Bernoulli reward process, accuracy of approximation (Section 8.4).

 a               Accuracy in untabulated range
 0.5, 0.6, 0.7   4-figure accuracy for 0.025 ≤ α/(α + β) ≤ 0.975, endpoint errors at most −0.0001
 0.8             4-figure accuracy for 0.025 ≤ α/(α + β) ≤ 0.975, endpoint errors at most −0.0002
 0.9             4-figure accuracy for 0.025 ≤ α/(α + β) ≤ 0.975, endpoint errors at most −0.0003
 0.95            errors up to 0.0002 for 40 ≤ α + β ≤ 50 and 0.025 ≤ α/(α + β) ≤ 0.975, 4-figure accuracy for α + β > 50 and 0.025 ≤ α/(α + β) ≤ 0.975, endpoint errors at most −0.0005
 0.99            errors up to 0.0002 for 40 ≤ α + β ≤ 70 and 0.025 ≤ α/(α + β) ≤ 0.975, 4-figure accuracy for α + β > 70 and 0.025 ≤ α/(α + β) ≤ 0.975, endpoint errors at most −0.0009

Endpoint errors were nearly all negative. For that reason the bounds quoted are expressed as negative numbers.
Table 8.5 Bernoulli reward process, index values, a = 0.5. α
1
2
3
4
5
6
7
8
9
10
.5590 .3758 .2802 .2223 .1837 .1563 .1358 .1200 .1074 .0972 .0888 .0816 .0756 .0703 .0657 .0617 .0582 .0550 .0522 .0496
.7060 .5359 .4298 .3577 .3058 .2668 .2364 .2122 .1923 .1758 .1619 .1500 .1397 .1307 .1228 .1158 .1095 .1039 .0988 .0942
.7772 .6289 .5258 .4512 .3947 .3504 .3149 .2859 .2616 .2411 .2236 .2084 .1951 .1833 .1729 .1636 .1553 .1477 .1409 .1346
.8199 .6899 .5937 .5201 .4626 .4163 .3783 .3465 .3196 .2965 .2764 .2589 .2434 .2297 .2174 .2064 .1964 .1873 .1790 .1714
.8485 .7333 .6441 .5736 .5165 .4697 .4306 .3974 .3688 .3440 .3223 .3032 .2862 .2709 .2572 .2448 .2335 .2232 .2138 .2051
.8691 .7658 .6832 .6161 .5606 .5140 .4745 .4407 .4113 .3855 .3627 .3423 .3242 .3078 .2931 .2796 .2673 .2561 .2457 .2362
.8847 .7911 .7144 .6507 .5971 .5515 .5121 .4781 .4482 .4218 .3983 .3773 .3583 .3411 .3255 .3113 .2982 .2862 .2751 .2648
.8969 .8114 .7399 .6795 .6279 .5835 .5447 .5107 .4807 .4541 .4302 .4086 .3891 .3713 .3551 .3402 .3265 .3139 .3022 .2913
.9067 .8280 .7611 .7038 .6543 .6111 .5732 .5396 .5096 .4828 .4587 .4369 .4170 .3988 .3821 .3668 .3526 .3395 .3273 .3159
.9148 .8419 .7791 .7247 .6771 .6353 .5982 .5652 .5355 .5087 .4845 .4625 .4424 .4240 .4070 .3913 .3767 .3632 .3506 .3389
11
12
13
14
15
16
17
18
19
20
.9216 .8537 .7946 .7428 .6971 .6566 .6205 .5880 .5587 .5321 .5079 .4859 .4657 .4470 .4298 .4139 .3991 .3853 .3724 .3603
.9274 .8638 .8080 .7586 .7147 .6755 .6403 .6085 .5797 .5534 .5294 .5073 .4870 .4683 .4510 .4349 .4199 .4058 .3927 .3804
.9324 .8726 .8197 .7726 .7304 .6925 .6582 .6271 .5987 .5728 .5490 .5270 .5068 .4880 .4706 .4544 .4392 .4251 .4118 .3992
.9367 .8804 .8301 .7850 .7444 .7077 .6743 .6439 .6161 .5906 .5670 .5453 .5251 .5063 .4888 .4726 .4573 .4431 .4296 .4170
.9405 .8872 .8393 .7961 .7570 .7215 .6890 .6593 .6321 .6069 .5837 .5621 .5420 .5233 .5059 .4896 .4743 .4599 .4464 .4337
.9439 .8933 .8476 .8061 .7684 .7340 .7024 .6734 .6467 .6220 .5990 .5777 .5578 .5393 .5219 .5055 .4902 .4758 .4622 .4494
.9469 .8988 .8550 .8152 .7788 .7454 .7147 .6864 .6602 .6359 .6133 .5922 .5726 .5541 .5368 .5205 .5052 .4908 .4772 .4643
.9496 .9037 .8618 .8235 .7883 .7559 .7260 .6984 .6727 .6488 .6266 .6058 .5863 .5680 .5509 .5347 .5194 .5049 .4913 .4784
.9520 .9081 .8679 .8310 .7970 .7656 .7365 .7095 .6843 .6609 .6390 .6185 .5992 .5811 .5641 .5480 .5327 .5183 .5047 .4917
.9542 .9122 .8736 .8379 .8050 .7745 .7461 .7198 .6952 .6721 .6506 .6304 .6113 .5934 .5765 .5605 .5454 .5310 .5174 .5045
β 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 α β 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
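A hedged sketch of how index values of the kind shown in Tables 8.5 to 8.8 can be computed may be helpful here. The standard calibration argument defines the index of a Beta(α, β) Bernoulli arm as the known success probability λ of a standard arm that leaves the decision maker indifferent between the two. The code below is an illustration under stated assumptions, not the program used to produce these tables: it truncates the dynamic programming recursion once α + β has grown by a fixed horizon and then bisects on λ. Run with a sufficiently large horizon it should agree with entries such as the value 0.5590 shown for α = β = 1, a = 0.5, up to truncation and rounding.

    from functools import lru_cache

    def bernoulli_gittins_index(alpha, beta, a=0.5, horizon=200, tol=1e-5):
        # Gittins index of a Bernoulli reward process with a Beta(alpha, beta)
        # posterior and discount factor a, expressed as an equivalent known
        # success probability.  The truncation rule, horizon and tolerance
        # are assumptions of this sketch.
        limit = alpha + beta + horizon

        def value_against_standard(lam):
            # Optimal expected total discounted reward when choosing, at each
            # step, between the Bernoulli arm and permanent retirement to a
            # standard arm paying lam per period.
            @lru_cache(maxsize=None)
            def V(al, be):
                mean = al / (al + be)
                if al + be >= limit:
                    # Beyond the truncation point, ignore further learning.
                    return max(mean, lam) / (1.0 - a)
                keep = mean * (1.0 + a * V(al + 1, be)) + (1.0 - mean) * a * V(al, be + 1)
                return max(lam / (1.0 - a), keep)
            return V(alpha, beta)

        lo, hi = alpha / (alpha + beta), 1.0  # the index lies between the mean and 1
        while hi - lo > tol:
            lam = 0.5 * (lo + hi)
            if value_against_standard(lam) > lam / (1.0 - a) + 1e-12:
                lo = lam  # continuing beats retiring, so the index exceeds lam
            else:
                hi = lam
        return 0.5 * (lo + hi)

    # Example: Table 8.5 gives 0.5590 for alpha = beta = 1 at a = 0.5.
    print(round(bernoulli_gittins_index(1, 1, a=0.5), 4))

The bisection is valid because the retirement option is preferred for all λ above the index and rejected for all λ below it, so there is a single crossing point.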
Table 8.6 Bernoulli reward process, index values, a = 0.7. α
1
2
3
4
5
6
7
8
9
10
.6046 .4118 .3075 .2434 .2005 .1699 .1471 .1295 .1155 .1042 .0948 .0870 .0803 .0745 .0695 .0651 .0613 .0578 .0547 .0519
.7358 .5650 .4546 .3789 .3237 .2822 .2497 .2238 .2025 .1849 .1700 .1573 .1463 .1367 .1283 .1208 .1141 .1081 .1027 .0978
.7985 .6520 .5472 .4701 .4116 .3654 .3282 .2978 .2724 .2508 .2324 .2164 .2024 .1901 .1792 .1694 .1606 .1527 .1455 .1390
.8359 .7088 .6121 .5370 .4779 .4304 .3911 .3581 .3301 .3062 .2854 .2671 .2510 .2368 .2240 .2125 .2021 .1927 .1841 .1762
.8612 .7489 .6599 .5887 .5305 .4825 .4426 .4085 .3791 .3535 .3311 .3114 .2938 .2781 .2639 .2510 .2394 .2288 .2191 .2101
.8794 .7790 .6970 .6296 .5734 .5259 .4856 .4511 .4211 .3946 .3712 .3504 .3317 .3149 .2997 .2859 .2733 .2617 .2510 .2412
.8932 .8024 .7265 .6627 .6088 .5626 .5225 .4878 .4575 .4306 .4066 .3851 .3657 .3481 .3320 .3175 .3041 .2918 .2804 .2699
.9041 .8213 .7506 .6904 .6386 .5938 .5545 .5199 .4894 .4624 .4381 .4161 .3962 .3781 .3615 .3463 .3323 .3194 .3075 .2964
.9129 .8367 .7708 .7137 .6640 .6207 .5824 .5483 .5179 .4906 .4663 .4441 .4239 .4054 .3884 .3728 .3583 .3449 .3325 .3209
.9202 .8496 .7878 .7337 .6861 .6441 .6068 .5734 .5434 .5162 .4916 .4694 .4490 .4303 .4131 .3971 .3823 .3686 .3558 .3438
11
12
13
14
15
16
17
18
19
20
.9263 .8606 .8024 .7510 .7054 .6647 .6285 .5958 .5662 .5393 .5148 .4924 .4720 .4532 .4358 .4196 .4046 .3905 .3775 .3652
.9316 .8700 .8151 .7662 .7224 .6832 .6478 .6159 .5868 .5603 .5360 .5136 .4931 .4742 .4567 .4404 .4252 .4110 .3977 .3852
.9361 .8782 .8263 .7796 .7375 .6996 .6652 .6340 .6055 .5794 .5554 .5332 .5126 .4936 .4761 .4597 .4444 .4301 .4166 .4039
.9400 .8855 .8361 .7915 .7510 .7144 .6810 .6505 .6226 .5969 .5731 .5512 .5308 .5118 .4941 .4777 .4624 .4479 .4343 .4216
.9435 .8919 .8449 .8021 .7632 .7278 .6953 .6655 .6382 .6129 .5895 .5678 .5476 .5287 .5110 .4945 .4791 .4647 .4510 .4381
.9466 .8976 .8527 .8117 .7743 .7399 .7084 .6793 .6525 .6277 .6047 .5832 .5632 .5444 .5269 .5104 .4949 .4804 .4667 .4538
.9494 .9027 .8598 .8204 .7843 .7510 .7204 .6920 .6657 .6414 .6187 .5975 .5777 .5591 .5417 .5253 .5098 .4952 .4815 .4685
.9519 .9073 .8662 .8284 .7935 .7612 .7314 .7037 .6780 .6541 .6318 .6109 .5913 .5729 .5556 .5393 .5238 .5092 .4955 .4825
.9541 .9115 .8721 .8356 .8019 .7706 .7416 .7146 .6894 .6659 .6439 .6234 .6040 .5858 .5686 .5524 .5371 .5226 .5088 .4957
.9562 .9154 .8774 .8423 .8096 .7792 .7510 .7246 .7000 .6769 .6553 .6351 .6160 .5979 .5809 .5649 .5496 .5352 .5214 .5083
β 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 α β 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Table 8.7 Bernoulli reward process, index values, a = 0.9. α
1
2
3
4
5
6
7
8
9
10
.7029 .5001 .3796 .3021 .2488 .2103 .1815 .1591 .1413 .1269 .1149 .1049 .0965 .0892 .0828 .0773 .0724 .0681 .0642 .0608 .0576 .0548 .0522 .0498 .0477 .0457 .0438 .0421 .0406 .0391 .0377 .0364 .0352 .0341 .0330 .0320 .0311 .0302 .0294 .0286
.8001 .6346 .5163 .4342 .3720 .3245 .2871 .2569 .2323 .2116 .1942 .1793 .1664 .1552 .1453 .1366 .1288 .1218 .1155 .1098 .1046 .0998 .0955 .0915 .0878 .0844 .0812 .0783 .0756 .0730 .0706 .0683 .0662 .0642 .0623 .0606 .0589 .0573 .0558 .0543
.8452 .7072 .6010 .5184 .4561 .4058 .3647 .3308 .3025 .2784 .2575 .2396 .2239 .2100 .1977 .1867 .1768 .1679 .1598 .1524 .1456 .1394 .1337 .1284 .1236 .1190 .1148 .1108 .1072 .1037 .1004 .0974 .0945 .0918 .0892 .0868 .0845 .0823 .0802 .0782
.8723 .7539 .6579 .5809 .5179 .4677 .4257 .3900 .3595 .3332 .3106 .2906 .2729 .2571 .2431 .2304 .2190 .2086 .1991 .1904 .1824 .1750 .1682 .1619 .1560 .1506 .1455 .1407 .1362 .1320 .1280 .1243 .1207 .1174 .1142 .1112 .1084 .1056 .1031 .1006
.8905 .7869 .6996 .6276 .5676 .5168 .4748 .4387 .4073 .3799 .3558 .3343 .3155 .2985 .2831 .2692 .2565 .2450 .2344 .2247 .2157 .2074 .1997 .1926 .1859 .1796 .1738 .1683 .1631 .1583 .1537 .1493 .1452 .1413 .1377 .1341 .1308 .1276 .1246 .1217
.9039 .8115 .7318 .6642 .6071 .5581 .5156 .4795 .4479 .4200 .3951 .3729 .3529 .3348 .3187 .3039 .2904 .2780 .2665 .2559 .2462 .2371 .2287 .2208 .2134 .2065 .2000 .1940 .1882 .1828 .1777 .1728 .1682 .1639 .1597 .1558 .1520 .1484 .1450 .1417
.9141 .8307 .7573 .6940 .6395 .5923 .5510 .5144 .4828 .4548 .4296 .4069 .3864 .3678 .3507 .3351 .3210 .3079 .2958 .2846 .2742 .2645 .2554 .2470 .2390 .2316 .2246 .2180 .2117 .2058 .2002 .1949 .1899 .1851 .1806 .1762 .1721 .1681 .1644 .1608
.9221 .8461 .7782 .7187 .6666 .6212 .5811 .5454 .5134 .4853 .4601 .4371 .4163 .3972 .3798 .3638 .3490 .3352 .3227 .3110 .3001 .2899 .2803 .2714 .2629 .2550 .2475 .2405 .2338 .2275 .2215 .2158 .2104 .2053 .2003 .1957 .1912 .1869 .1828 .1789
.9287 .8588 .7956 .7396 .6899 .6461 .6071 .5723 .5409 .5125 .4871 .4642 .4432 .4240 .4062 .3898 .3747 .3606 .3475 .3353 .3240 .3134 .3035 .2941 .2853 .2770 .2691 .2617 .2546 .2479 .2416 .2356 .2298 .2244 .2191 .2141 .2094 .2048 .2004 .1962
.9342 .8695 .8103 .7573 .7101 .6677 .6300 .5960 .5652 .5373 .5116 .4886 .4677 .4483 .4303 .4137 .3983 .3840 .3706 .3581 .3463 .3353 .3250 .3154 .3062 .2976 .2894 .2816 .2743 .2673 .2606 .2543 .2482 .2425 .2370 .2317 .2267 .2218 .2172 .2128
β 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
(continued overleaf )
Table 8.7 Bernoulli reward process, index values, a = 0.9. (Continued ) α
11
12
13
14
15
16
17
18
19
20
.9389 .8786 .8230 .7728 .7276 .6869 .6502 .6171 .5870 .5594 .5342 .5109 .4898 .4705 .4525 .4357 .4201 .4056 .3920 .3792 .3672 .3560 .3453 .3353 .3259 .3170 .3085 .3005 .2929 .2856 .2787 .2721 .2658 .2597 .2540 .2485 .2432 .2381 .2333 .2286
.9429 .8864 .8340 .7863 .7431 .7039 .6682 .6360 .6065 .5795 .5546 .5316 .5103 .4908 .4728 .4561 .4404 .4257 .4119 .3989 .3868 .3753 .3644 .3542 .3445 .3352 .3266 .3183 .3105 .3030 .2958 .2890 .2825 .2762 .2702 .2645 .2590 .2537 .2486 .2438
.9463 .8932 .8437 .7982 .7568 .7191 .6846 .6531 .6242 .5977 .5732 .5505 .5294 .5097 .4916 .4749 .4591 .4444 .4305 .4174 .4050 .3934 .3824 .3720 .3621 .3527 .3437 .3352 .3271 .3195 .3121 .3051 .2984 .2919 .2858 .2798 .2742 .2687 .2634 .2584
.9494 .8993 .8522 .8089 .7691 .7327 .6993 .6684 .6404 .6143 .5902 .5678 .5469 .5275 .5092 .4923 .4766 .4618 .4479 .4347 .4222 .4104 .3993 .3887 .3787 .3691 .3600 .3513 .3431 .3351 .3276 .3205 .3136 .3070 .3006 .2945 .2887 .2830 .2776 .2724
.9521 .9046 .8598 .8184 .7802 .7450 .7126 .6827 .6551 .6296 .6058 .5838 .5631 .5439 .5258 .5087 .4929 .4781 .4642 .4509 .4384 .4265 .4153 .4046 .3944 .3847 .3755 .3667 .3583 .3502 .3425 .3351 .3280 .3213 .3148 .3086 .3026 .2968 .2912 .2859
.9545 .9094 .8667 .8270 .7902 .7562 .7248 .6957 .6686 .6436 .6203 .5985 .5782 .5591 .5412 .5243 .5083 .4934 .4795 .4662 .4537 .4417 .4304 .4196 .4093 .3995 .3902 .3813 .3727 .3645 .3567 .3492 .3420 .3350 .3284 .3221 .3159 .3100 .3043 .2988
.9567 .9137 .8729 .8347 .7993 .7665 .7360 .7076 .6812 .6566 .6338 .6123 .5922 .5733 .5555 .5388 .5229 .5079 .4939 .4807 .4681 .4561 .4447 .4339 .4235 .4136 .4042 .3952 .3865 .3782 .3703 .3627 .3553 .3483 .3415 .3350 .3287 .3227 .3169 .3113
.9587 .9176 .8785 .8419 .8077 .7758 .7462 .7186 .6928 .6686 .6462 .6251 .6053 .5866 .5690 .5524 .5367 .5217 .5076 .4943 .4817 .4697 .4583 .4474 .4370 .4271 .4175 .4084 .3997 .3913 .3833 .3756 .3682 .3610 .3541 .3475 .3411 .3349 .3290 .3233
.9604 .9212 .8836 .8483 .8153 .7845 .7556 .7288 .7036 .6799 .6578 .6371 .6175 .5991 .5817 .5652 .5496 .5348 .5207 .5073 .4946 .4827 .4712 .4603 .4499 .4399 .4303 .4211 .4123 .4039 .3957 .3879 .3804 .3732 .3662 .3595 .3530 .3468 .3407 .3349
.9621 .9244 .8883 .8543 .8224 .7925 .7644 .7382 .7136 .6905 .6686 .6483 .6290 .6108 .5935 .5772 .5618 .5470 .5330 .5197 .5070 .4949 .4835 .4726 .4621 .4521 .4425 .4333 .4244 .4159 .4077 .3998 .3922 .3849 .3779 .3711 .3645 .3582 .3521 .3461
β 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
Table 8.7 Bernoulli reward process, index values, a = 0.9. (Continued ) α
21
22
23
24
25
26
27
28
29
30
.9636 .9274 .8926 .8598 .8289 .7998 .7726 .7470 .7229 .7003 .6789 .6587 .6398 .6218 .6048 .5886 .5733 .5587 .5448 .5315 .5188 .5067 .4952 .4843 .4738 .4638 .4541 .4449 .4360 .4274 .4192 .4112 .4036 .3962 .3891 .3822 .3756 .3692 .3630 .3570
.9649 .9301 .8966 .8649 .8349 .8067 .7802 .7552 .7317 .7095 .6885 .6686 .6499 .6322 .6154 .5994 .5842 .5697 .5559 .5427 .5301 .5180 .5064 .4954 .4850 .4750 .4653 .4560 .4471 .4385 .4302 .4222 .4145 .4071 .3999 .3930 .3863 .3798 .3735 .3675
.9662 .9327 .9003 .8696 .8405 .8131 .7873 .7629 .7398 .7181 .6975 .6780 .6595 .6421 .6254 .6096 .5945 .5802 .5664 .5533 .5408 .5288 .5173 .5062 .4957 .4856 .4760 .4667 .4577 .4491 .4408 .4328 .4250 .4175 .4103 .4033 .3966 .3900 .3837 .3776
.9674 .9350 .9038 .8740 .8458 .8191 .7939 .7701 .7475 .7262 .7060 .6868 .6686 .6514 .6349 .6193 .6044 .5901 .5765 .5635 .5510 .5391 .5276 .5166 .5060 .4959 .4862 .4769 .4680 .4593 .4510 .4429 .4352 .4276 .4204 .4133 .4065 .3999 .3936 .3874
.9685 .9372 .9069 .8781 .8507 .8247 .8001 .7769 .7548 .7339 .7140 .6952 .6772 .6602 .6440 .6285 .6137 .5996 .5861 .5732 .5608 .5489 .5375 .5265 .5160 .5058 .4961 .4868 .4778 .4692 .4608 .4527 .4449 .4374 .4301 .4230 .4162 .4095 .4031 .3969
.9695 .9392 .9099 .8819 .8553 .8300 .8060 .7832 .7616 .7411 .7216 .7031 .6854 .6685 .6525 .6373 .6226 .6086 .5952 .5824 .5701 .5583 .5469 .5360 .5255 .5154 .5056 .4962 .4873 .4786 .4703 .4622 .4544 .4468 .4395 .4324 .4255 .4188 .4124 .4061
.9705 .9411 .9127 .8855 .8596 .8349 .8115 .7892 .7681 .7479 .7288 .7106 .6932 .6765 .6607 .6456 .6311 .6173 .6040 .5913 .5790 .5673 .5560 .5451 .5347 .5246 .5148 .5054 .4964 .4878 .4794 .4713 .4635 .4559 .4485 .4414 .4345 .4278 .4213 .4150
.9714 .9429 .9153 .8889 .8636 .8396 .8167 .7949 .7742 .7544 .7356 .7177 .7006 .6842 .6685 .6536 .6393 .6255 .6124 .5997 .5876 .5759 .5647 .5539 .5435 .5334 .5237 .5143 .5052 .4965 .4882 .4801 .4723 .4647 .4573 .4502 .4432 .4365 .4300 .4236
.9722 .9446 .9178 .8921 .8675 .8440 .8216 .8003 .7800 .7606 .7421 .7244 .7076 .6914 .6759 .6612 .6470 .6335 .6204 .6079 .5958 .5842 .5730 .5623 .5519 .5419 .5322 .5229 .5138 .5051 .4967 .4886 .4808 .4732 .4658 .4586 .4517 .4449 .4384 .4320
.9730 .9462 .9201 .8951 .8711 .8481 .8263 .8054 .7855 .7664 .7482 .7309 .7143 .6984 .6831 .6684 .6544 .6410 .6281 .6157 .6037 .5922 .5811 .5704 .5601 .5501 .5405 .5311 .5221 .5134 .5049 .4968 .4890 .4814 .4740 .4668 .4599 .4531 .4466 .4402
β 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
(continued overleaf )
Table 8.7 Bernoulli reward process, index values, a = 0.9. (Continued ) α
31
32
33
34
35
36
37
38
39
40
.9738 .9477 .9223 .8979 .8745 .8521 .8307 .8102 .7907 .7720 .7541 .7370 .7207 .7050 .6899 .6754 .6615 .6483 .6355 .6232 .6113 .5998 .5888 .5782 .5679 .5580 .5484 .5391 .5301 .5214 .5130 .5048 .4969 .4893 .4820 .4748 .4678 .4611 .4545 .4481
.9745 .9491 .9244 .9006 .8777 .8558 .8348 .8148 .7956 .7773 .7597 .7429 .7268 .7113 .6964 .6821 .6684 .6552 .6426 .6303 .6186 .6072 .5963 .5857 .5755 .5656 .5560 .5468 .5378 .5292 .5208 .5126 .5047 .4970 .4897 .4825 .4755 .4688 .4622 .4558
.9751 .9504 .9263 .9031 .8808 .8593 .8388 .8192 .8004 .7823 .7651 .7485 .7326 .7173 .7027 .6885 .6750 .6619 .6494 .6373 .6256 .6143 .6034 .5929 .5828 .5729 .5634 .5542 .5453 .5367 .5283 .5201 .5122 .5045 .4971 .4900 .4830 .4762 .4696 .4632
.9758 .9517 .9282 .9055 .8837 .8627 .8426 .8234 .8049 .7872 .7702 .7538 .7382 .7231 .7086 .6947 .6813 .6683 .6559 .6439 .6324 .6212 .6104 .5999 .5898 .5800 .5706 .5614 .5525 .5439 .5356 .5274 .5195 .5119 .5044 .4972 .4903 .4835 .4769 .4705
.9764 .9529 .9300 .9078 .8864 .8659 .8462 .8273 .8092 .7918 .7751 .7590 .7435 .7287 .7144 .7006 .6873 .6745 .6622 .6503 .6389 .6278 .6170 .6067 .5966 .5869 .5775 .5684 .5595 .5509 .5426 .5345 .5266 .5190 .5116 .5043 .4973 .4905 .4840 .4775
.9769 .9540 .9316 .9100 .8891 .8690 .8497 .8311 .8133 .7962 .7797 .7639 .7486 .7340 .7199 .7063 .6932 .6805 .6683 .6565 .6451 .6341 .6235 .6132 .6032 .5935 .5842 .5751 .5663 .5577 .5494 .5414 .5335 .5259 .5185 .5112 .5042 .4974 .4908 .4844
.9775 .9551 .9332 .9120 .8916 .8719 .8530 .8348 .8173 .8004 .7842 .7686 .7536 .7391 .7252 .7118 .6988 .6863 .6741 .6625 .6512 .6403 .6297 .6195 .6096 .6000 .5907 .5816 .5728 .5643 .5561 .5480 .5402 .5326 .5252 .5180 .5110 .5041 .4975 .4911
.9780 .9561 .9348 .9140 .8940 .8747 .8561 .8383 .8210 .8045 .7885 .7732 .7583 .7441 .7303 .7170 .7042 .6918 .6798 .6682 .6570 .6462 .6358 .6256 .6157 .6062 .5969 .5879 .5792 .5707 .5625 .5545 .5467 .5391 .5317 .5245 .5175 .5107 .5040 .4975
.9785 .9571 .9362 .9159 .8963 .8774 .8591 .8416 .8247 .8084 .7927 .7775 .7629 .7488 .7352 .7221 .7094 .6972 .6853 .6738 .6627 .6520 .6416 .6315 .6217 .6122 .6030 .5941 .5854 .5769 .5687 .5607 .5530 .5454 .5380 .5309 .5239 .5171 .5104 .5039
.9789 .9580 .9376 .9177 .8985 .8799 .8620 .8448 .8281 .8121 .7966 .7817 .7673 .7534 .7400 .7270 .7145 .7023 .6906 .6792 .6682 .6575 .6472 .6372 .6275 .6181 .6089 .6000 .5914 .5830 .5748 .5668 .5591 .5516 .5442 .5371 .5301 .5233 .5166 .5102
β 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
Table 8.8 Bernoulli reward process, index values, a = 0.99. α
1
2
3
4
5
6
7
8
9
10
.8699 .7005 .5671 .4701 .3969 .3415 .2979 .2632 .2350 .2117 .1922 .1756 .1614 .1491 .1384 .1290 .1206 .1132 .1066 .1006 .0952 .0903 .0858 .0817 .0780 .0745 .0713 .0684 .0656 .0631 .0607 .0585 .0564 .0545 .0526 .0509 .0493 .0478 .0463 .0449
.9102 .7844 .6726 .5806 .5093 .4509 .4029 .3633 .3303 .3020 .2778 .2571 .2388 .2228 .2086 .1960 .1847 .1746 .1654 .1570 .1494 .1425 .1361 .1302 .1248 .1198 .1151 .1107 .1067 .1029 .0994 .0960 .0929 .0900 .0872 .0846 .0821 .0797 .0775 .0754
.9285 .8268 .7308 .6490 .5798 .5225 .4747 .4337 .3986 .3679 .3418 .3187 .2982 .2799 .2637 .2491 .2359 .2239 .2130 .2031 .1940 .1856 .1778 .1107 .1640 .1578 .1521 .1467 .1416 .1369 .1325 .1283 .1244 .1206 .1171 .1138 .1106 .1076 .1048 .1021
.9395 .8533 .7696 .6952 .6311 .5756 .5277 .4876 .4520 .4208 .3932 .3685 .3468 .3274 .3097 .2938 .2792 .2659 .2539 .2428 .2325 .2231 .2142 .2061 .1985 .1914 .1848 .1786 .1727 .1673 .1621 .1572 .1526 .1483 .1441 .1402 .1365 .1329 .1296 .1264
.9470 .8719 .7973 .7295 .6697 .6172 .5710 .5300 .4952 .4640 .4359 .4108 .3882 .3677 .3491 .3324 .3170 .3028 .2898 .2778 .2666 .2564 .2468 .2379 .2295 .2217 .2143 .2075 .2010 .1949 .1891 .1837 .1785 .1736 .1690 .1645 .1603 .1563 .1525 .1488
.9525 .8857 .8184 .7561 .6998 .6504 .6061 .5665 .5308 .5002 .4722 .4469 .4239 .4030 .3839 .3663 .3501 .3355 .3218 .3092 .2974 .2864 .2762 .2666 .2577 .2493 .2414 .2340 .2269 .2203 .2140 .2080 .2024 .1971 .1920 .1871 .1825 .1781 .1739 .1699
.9568 .8964 .8350 .7773 .7249 .6776 .6352 .5970 .5625 .5310 .5034 .4782 .4551 .4340 .4145 .3967 .3801 .3648 .3505 .3374 .3252 .3138 .3031 .2930 .2836 .2747 .2663 .2583 .2509 .2439 .2372 .2308 .2248 .2190 .2135 .2083 .2033 .1985 .1940 .1896
.9603 .9051 .8485 .7949 .7456 .7004 .6599 .6230 .5895 .5589 .5307 .5057 .4827 .4615 .4420 .4238 .4070 .3914 .3769 .3633 .3505 .3387 .3277 .3173 .3074 .2981 .2894 .2811 .2732 .2658 .2587 .2520 .2456 .2395 .2337 .2282 .2229 .2178 .2129 .2083
.9631 .9122 .8598 .8097 .7631 .7203 .6811 .6456 .6130 .5831 .5556 .5302 .5072 .4862 .4666 .4484 .4314 .4156 .4009 .3870 .3741 .3619 .3503 .3396 .3295 .3199 .3109 .3023 .2941 .2864 .2790 .2719 .2652 .2588 .2527 .2469 .2414 .2360 .2309 .2260
.9655 .9183 .8693 .8222 .7781 .7373 .6997 .6653 .6337 .6045 .5776 .5527 .5295 .5083 .4889 .4707 .4537 .4378 .4228 .4088 .3957 .3833 .3716 .3605 .3500 .3402 .3309 .3221 .3136 .3056 .2980 .2907 .2838 .2771 .2708 .2647 .2588 .2533 .2479 .2428
β 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
(continued overleaf )
Table 8.8 Bernoulli reward process, index values, a = 0.99. (Continued ) α
11
12
13
14
15
16
17
18
19
20
.9675 .9234 .8775 .8331 .7912 .7522 .7161 .6826 .6521 .6236 .5973 .5728 .5501 .5288 .5091 .4911 .4741 .4581 .4431 .4290 .4157 .4031 .3913 .3801 .3694 .3593 .3497 .3405 .3320 .3238 .3159 .3084 .3013 .2944 .2878 .2815 .2755 .2697 .2641 .2587
.9693 .9278 .8846 .8426 .8027 .7653 .7307 .6984 .6685 .6408 .6150 .5911 .5687 .5477 .5280 .5096 .4929 .4769 .4619 .4477 .4343 .4216 .4096 .3983 .3875 .3772 .3675 .3582 .3493 .3408 .3328 .3252 .3178 .3108 .3040 .2975 .2913 .2853 .2796 .2741
.9709 .9318 .8909 .8509 .8129 .7771 .7437 .7125 .6833 .6564 .6312 .6077 .5856 .5650 .5456 .5273 .5100 .4943 .4793 .4651 .4517 .4389 .4268 .4153 .4044 .3941 .3842 .3747 .3657 .3571 .3489 .3409 .3335 .3263 .3194 .3128 .3064 .3002 .2943 .2886
.9722 .9352 .8964 .8584 .8220 .7877 .7554 .7253 .6970 .6706 .6460 .6229 .6012 .5808 .5617 .5436 .5265 .5103 .4955 .4814 .4679 .4551 .4430 .4314 .4204 .4099 .3999 .3904 .3813 .3725 .3642 .3561 .3484 .3410 .3340 .3273 .3208 .3145 .3084 .3026
.9735 .9383 .9014 .8651 .8302 .7972 .7660 .7369 .7094 .6835 .6595 .6369 .6155 .5955 .5766 .5587 .5418 .5258 .5105 .4965 .4831 .4703 .4582 .4466 .4355 .4249 .4148 .4052 .3960 .3872 .3787 .3706 .3628 .3553 .3480 .3410 .3344 .3281 .3219 .3160
.9746 .9411 .9059 .8711 .8377 .8059 .7758 .7474 .7208 .6956 .6719 .6498 .6289 .6091 .5904 .5728 .5560 .5402 .5251 .5106 .4974 .4846 .4725 .4609 .4498 .4391 .4290 .4193 .4099 .4010 .3925 .3843 .3764 .3688 .3615 .3544 .3476 .3410 .3348 .3288
.9756 .9436 .9099 .8766 .8444 .8137 .7846 .7571 .7312 .7067 .6834 .6617 .6412 .6217 .6033 .5859 .5693 .5536 .5387 .5244 .5107 .4981 .4860 .4744 .4632 .4526 .4424 .4326 .4233 .4143 .4056 .3974 .3894 .3817 .3743 .3672 .3603 .3537 .3472 .3410
.9765 .9458 .9136 .8816 .8506 .8210 .7928 .7660 .7408 .7169 .6942 .6728 .6527 .6336 .6154 .5982 .5818 .5662 .5514 .5372 .5237 .5107 .4987 .4871 .4760 .4654 .4552 .4454 .4359 .4269 .4182 .4098 .4018 .3941 .3866 .3794 .3724 .3657 .3592 .3529
.9774 .9479 .9169 .8861 .8563 .8276 .8003 .7743 .7497 .7264 .7043 .6832 .6634 .6446 .6267 .6097 .5935 .5781 .5634 .5494 .5359 .5231 .5107 .4992 .4882 .4775 .4673 .4575 .4480 .4390 .4302 .4218 .4137 .4059 .3984 .3911 .3841 .3773 .3707 .3644
.9781 .9498 .9200 .8903 .8615 .8338 .8072 .7820 .7579 .7352 .7136 .6930 .6734 .6550 .6373 .6205 .6045 .5893 .5747 .5608 .5475 .5347 .5225 .5107 .4997 .4891 .4789 .4691 .4596 .4505 .4417 .4333 .4251 .4172 .4096 .4023 .3952 .3884 .3818 .3754
β 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
Table 8.8 Bernoulli reward process, index values, a = 0.99. (Continued ) α
21
22
23
24
25
26
27
28
29
30
.9788 .9516 .9228 .8942 .8663 .8394 .8137 .7891 .7657 .7435 .7223 .7021 .6829 .6647 .6473 .6308 .6150 .5999 .5854 .5717 .5584 .5458 .5336 .5219 .5106 .5001 .4899 .4801 .4706 .4615 .4527 .4442 .4360 .4282 .4205 .4131 .4060 .3991 .3924 .3860
.9795 .9532 .9255 .8978 .8708 .8447 .8197 .7958 .7730 .7512 .7305 .7107 .6918 .6738 .6568 .6404 .6248 .6099 .5956 .5819 .5688 .5562 .5442 .5326 .5214 .5106 .5004 .4906 .4812 .4721 .4633 .4548 .4466 .4386 .4310 .4235 .4164 .4094 .4027 .3962
.9801 .9547 .9279 .9011 .8750 .8497 .8253 .8020 .7797 .7584 .7382 .7188 .7003 .6825 .6657 .6496 .6342 .6194 .6052 .5917 .5787 .5662 .5542 .5427 .5316 .5209 .5105 .5007 .4913 .4822 .4734 .4649 .4567 .4487 .4410 .4336 .4264 .4194 .4126 .4060
.9807 .9561 .9302 .9043 .8789 .8543 .8306 .8078 .7861 .7652 .7454 .7264 .7082 .6908 .6740 .6582 .6430 .6284 .6144 .6010 .5881 .5757 .5638 .5523 .5413 .5306 .5204 .5104 .5009 .4919 .4831 .4746 .4664 .4584 .4507 .4432 .4360 .4290 .4222 .4156
.9812 .9574 .9323 .9072 .8825 .8586 .8355 .8133 .7921 .7717 .7522 .7336 .7157 .6986 .6821 .6664 .6514 .6370 .6232 .6098 .5971 .5848 .5730 .5616 .5506 .5400 .5298 .5199 .5103 .5012 .4924 .4840 .4757 .4678 .4600 .4525 .4453 .4382 .4314 .4248
.9817 .9587 .9343 .9099 .8859 .8626 .8401 .8185 .7977 .7778 .7586 .7403 .7228 .7059 .6898 .6742 .6594 .6452 .6315 .6183 .6056 .5935 .5817 .5704 .5595 .5490 .5388 .5290 .5194 .5102 .5014 .4929 .4847 .4768 .4690 .4615 .4543 .4472 .4403 .4337
.9822 .9598 .9362 .9124 .8891 .8664 .8445 .8234 .8030 .7835 .7647 .7468 .7295 .7130 .6970 .6817 .6670 .6530 .6395 .6264 .6139 .6018 .5901 .5789 .5680 .5576 .5474 .5376 .5282 .5190 .5101 .5015 .4934 .4854 .4777 .4702 .4629 .4559 .4490 .4423
.9827 .9609 .9379 .9148 .8921 .8700 .8486 .8280 .8080 .7889 .7705 .7529 .7359 .7196 .7039 .6888 .6743 .6604 .6471 .6342 .6217 .6097 .5982 .5870 .5762 .5658 .5557 .5460 .5366 .5274 .5186 .5100 .5017 .4938 .4861 .4786 .4713 .4642 .4573 .4507
.9831 .9619 .9396 .9171 .8950 .8734 .8525 .8323 .8128 .7941 .7761 .7586 .7420 .7260 .7105 .6957 .6813 .6675 .6543 .6416 .6292 .6174 .6059 .5948 .5841 .5737 .5637 .5540 .5447 .5356 .5267 .5182 .5098 .5018 .4942 .4867 .4794 .4723 .4654 .4587
.9835 .9629 .9411 .9192 .8977 .8766 .8562 .8364 .8174 .7990 .7813 .7642 .7478 .7320 .7168 .7021 .6880 .6743 .6613 .6487 .6365 .6247 .6133 .6023 .5917 .5814 .5714 .5618 .5524 .5434 .5346 .5261 .5178 .5097 .5019 .4945 .4873 .4802 .4733 .4666
β 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
(continued overleaf )
Table 8.8 Bernoulli reward process, index values, a = 0.99. (Continued ) α
31
32
33
34
35
36
37
38
39
40
.9839 .9638 .9426 .9213 .9002 .8797 .8597 .8403 .8217 .8036 .7863 .7695 .7533 .7378 .7228 .7084 .6944 .6809 .6679 .6555 .6434 .6317 .6204 .6095 .5990 .5887 .5788 .5693 .5600 .5509 .5422 .5337 .5255 .5174 .5096 .5020 .4948 .4878 .4809 .4742
.9842 .9646 .9440 .9232 .9027 .8826 .8630 .8441 .8258 .8081 .7910 .7745 .7586 .7433 .7286 .7143 .7005 .6872 .6743 .6620 .6500 .6385 .6273 .6165 .6060 .5958 .5860 .5764 .5672 .5582 .5495 .5411 .5329 .5249 .5171 .5095 .5021 .4951 .4883 .4816
.9846 .9655 .9453 .9250 .9049 .8853 .8662 .8476 .8297 .8123 .7955 .7793 .7637 .7486 .7341 .7200 .7064 .6932 .6805 .6682 .6564 .6450 .6339 .6232 .6128 .6027 .5929 .5834 .5742 .5653 .5566 .5482 .5400 .5320 .5243 .5167 .5094 .5022 .4954 .4887
.9849 .9662 .9466 .9267 .9071 .8879 .8692 .8510 .8334 .8164 .7999 .7839 .7685 .7536 .7393 .7254 .7120 .6990 .6865 .6743 .6626 .6513 .6403 .6296 .6193 .6093 .5996 .5901 .5810 .5721 .5635 .5551 .5469 .5390 .5313 .5237 .5164 .5092 .5023 .4957
.9852 .9670 .9478 .9284 .9092 .8904 .8721 .8542 .8369 .8202 .8040 .7883 .7732 .7585 .7444 .7307 .7174 .7046 .6922 .6801 .6685 .6573 .6464 .6358 .6256 .6157 .6060 .5966 .5875 .5787 .5701 .5618 .5536 .5457 .5380 .5305 .5232 .5161 .5091 .5024
.9855 .9677 .9489 .9300 .9112 .8928 .8748 .8573 .8403 .8239 .8080 .7926 .7776 .7632 .7492 .7357 .7226 .7100 .6977 .6858 .6742 .6631 .6523 .6419 .6317 .6218 .6122 .6029 .5939 .5851 .5765 .5682 .5601 .5523 .5446 .5371 .5298 .5227 .5158 .5090
.9858 .9684 .9500 .9315 .9131 .8950 .8774 .8603 .8436 .8275 .8118 .7966 .7819 .7677 .7539 .7406 .7276 .7151 .7030 .6912 .6798 .6687 .6581 .6477 .6376 .6278 .6183 .6090 .6000 .5913 .5828 .5745 .5664 .5586 .5510 .5435 .5362 .5292 .5223 .5155
.9860 .9690 .9510 .9329 .9149 .8972 .8799 .8631 .8467 .8308 .8154 .8005 .7860 .7720 .7584 .7452 .7325 .7201 .7081 .6965 .6852 .6741 .6636 .6533 .6433 .6336 .6241 .6149 .6060 .5973 .5888 .5806 .5726 .5647 .5571 .5497 .5425 .5354 .5285 .5218
.9863 .9696 .9520 .9343 .9166 .8993 .8823 .8658 .8497 .8341 .8189 .8042 .7900 .7761 .7627 .7497 .7371 .7249 .7130 .7015 .6903 .6794 .6689 .6587 .6488 .6391 .6298 .6206 .6117 .6031 .5947 .5865 .5785 .5707 .5631 .5557 .5485 .5415 .5346 .5279
.9866 .9702 .9530 .9356 .9183 .9013 .8846 .8684 .8526 .8372 .8223 .8078 .7938 .7801 .7669 .7540 .7416 .7295 .7178 .7064 .6953 .6846 .6741 .6640 .6541 .6446 .6352 .6262 .6173 .6087 .6004 .5922 .5843 .5765 .5690 .5616 .5544 .5474 .5406 .5339
β = 1, 2, ..., 40 (row labels, read left to right along each column line above).
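Index values of the kind tabulated above can be reproduced, up to truncation error, by calibration against a standard bandit process: compare the reward process with a standard bandit paying a known rate λ, and locate the λ at which retiring to the standard bandit and continuing with the unknown arm are equally attractive. The following is a minimal sketch in Python, not the MATLAB program cited in the index; the recursion horizon, tolerance and function name are illustrative choices, and it assumes that α and β are the parameters of the Beta posterior for the unknown success probability, as in Table 8.8.

import numpy as np

def gittins_bernoulli(alpha, beta, a=0.99, horizon=750, tol=1e-4):
    # Approximate Gittins index (on the same 0-1 reward-rate scale as the
    # tables) for a Bernoulli arm whose success probability has a
    # Beta(alpha, beta) posterior, under discount factor a.  Bisect on the
    # standard-bandit rate lam at which retiring and continuing are equally
    # attractive in a dynamic programme truncated after `horizon` pulls.
    def continue_minus_retire(lam):
        V = np.zeros(horizon + 1)                 # values at the truncation depth
        for t in range(horizon - 1, -1, -1):
            s = np.arange(t + 1)                  # successes seen in the first t further pulls
            p = (alpha + s) / (alpha + beta + t)  # posterior mean at each depth-t node
            cont = p * (1.0 + a * V[1:t + 2]) + (1.0 - p) * a * V[:t + 1]
            retire = lam * (1.0 - a ** (horizon - t)) / (1.0 - a)
            if t == 0:
                return cont[0] - retire           # sign shows which option wins initially
            V = np.maximum(cont, retire)          # retirement stays available at later stages
    lo, hi = alpha / (alpha + beta), 1.0          # the index lies between the posterior mean and 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if continue_minus_retire(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# For example, gittins_bernoulli(31, 1, a=0.99) should land close to the
# corresponding alpha = 31, beta = 1 entry of Table 8.8.

The horizon must grow as a approaches 1 (several hundred steps for a = 0.99) for the truncated recursion to settle; a table such as 8.8 then corresponds to calling such a routine over a grid of (α, β) values.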
Table 8.9 Bernoulli reward process, fitted A-values (equation (8.16)). Columns: a = 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99; rows: r/n = 0.025, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.975. The values below run row by row: for each r/n in turn, the seven fitted A-values for a = 0.5 up to a = 0.99.
9.2681 7.0096 5.2973 3.9360 2.7944 2.6787 6.1145 6.8028 5.1840 4.0111 3.0983 2.6018 3.1825 10.3081 5.1912 4.0473 3.2564 2.7799 2.9147 4.4136 13.4238 4.2665 3.4833 3.0181 2.8967 3.6013 5.8120 17.1924 4.2057 3.6679 3.4095 3.5452 4.8016 7.5217 20.2994 4.3423 3.8339 3.6279 3.8171 5.0397 7.8905 21.5139 4.2710 3.7488 3.5316 3.7004 5.0189 8.0621 22.9575 5.0689 4.6046 4.4692 4.7972 6.3305 9.6717 24.9623 5.9458 5.4645 5.3591 5.7447 7.4003 10.9401 27.2915 7.6929 7.0607 6.8503 7.1881 9.0770 12.9649 31.0320 11.9057 11.1842 10.9371 11.3723 13.4908 17.9812 38.4591 17.9304 17.0536 17.4127 18.2349 21.1040 26.1323 49.5095 24.1764 28.0856 27.4593 26.7203 33.5284 40.8216 65.9106
Table 8.10 Bernoulli reward process, fitted B-values (equation (8.16)). The seven lines below are the columns for a = 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99, in that order.
9.2848 6.8134 5.1849 4.2244 3.9207 3.8095 3.8637 3.8097 3.9210 4.2241 5.1853 6.8137 9.3011
7.0047 5.1994 4.0386 3.3829 3.1798 3.1074 3.1624 3.1074 3.1798 3.3829 4.0388 5.2011 6.9875
5.3042 4.0089 3.2071 2.7739 2.6402 2.5938 2.6451 2.5940 2.6402 2.7740 3.2047 4.0082 5.2973
3.9386 3.0752 2.5684 2.3034 2.2198 2.1922 2.2363 2.1924 2.2200 2.3041 2.5692 3.0757 3.9503
2.7539 2.2996 2.0446 1.9105 1.8645 1.8525 1.8838 1.8530 1.8675 1.9114 2.0467 2.3015 2.7577
2.1869 1.9403 1.7979 1.7207 1.6918 1.6865 1.7079 1.6872 1.6967 1.7225 1.8009 1.9440 2.1901
1.7222 1.6483 1.6080 1.5820 1.5709 1.5716 1.5795 1.5745 1.5796 1.5892 1.6200 1.6793 1.7798
r/n = 0.025, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.975 (row labels, read left to right along each column line above).
Table 8.11 Bernoulli reward process, fitted C-values (equation (8.16)). Columns: a = 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99. The first block of values below runs row by row, giving for each r/n in turn the five entries for a = 0.5 to 0.9; the two 13-value columns that follow the list of r/n values are the entries for a = 0.95 and a = 0.99.
r/n 0.5004 −0.0603 0.1432 0.0912 −0.8275 0.1667 0.2466 −0.0355 −0.2108 −2.8142 −0.0331 −0.0169 −0.1904 −0.9422 −4.0195 −0.0286 −0.0080 −0.3423 −1.2125 −4.7172 −0.4630 −0.9010 −1.3409 −2.8740 −10.6097 −0.6325 −0.8255 −1.2365 −2.5030 −7.0787 −0.0068 −0.1195 −0.3624 −1.0104 −4.9064 −0.8394 −1.0776 −1.5097 −2.8257 −7.3069 −1.1022 −1.4691 −2.2118 −3.6767 −7.7188 −2.3833 −2.2668 −2.4562 −3.2635 −7.9008 −5.8849 −5.9008 −5.9116 −6.7079 −9.4755 −15.3026 −13.0777 −19.9554 −20.7245 −23.3043 21.7826 −109.5406 −83.8863 −6.3995 −68.3475
r/n = 0.025, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.975 (row labels).
−9.8562 −11.7535 −19.3552 −22.6347 −34.4107 −27.4223 −23.2428 −28.5828 −28.2543 −27.6156 −28.1731 −30.9933 −82.6998
36.7720 −119.1096 −122.1469 −139.2893 −168.5174 −152.4617 −155.0706 −150.1690 −148.8783 −151.9041 −126.5893 −79.0757 36.7720
Table 8.12 Bernoulli reward process, ranges of n-values used to fit A, B and C (equation (8.16)).
                     λ = .025 and .975   λ = .05 and .95   λ = .1 to .9
a = 0.5, 0.6, 0.7    40 to 160           20 to 180         10 to 180
a = 0.8, 0.9         40 to 160           20 to 160         10 to 160
a = 0.95             40 to 160           20 to 140         20 to 140
a = 0.99             40 to 160           40 to 160         30 to 150
Table 8.13 Exponential reward process, values of (n−1−n)(1 − a)^{1/2} (Section 8.5). The seven lines below are the columns for a = 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99, in that order.
.07460 .08883 .09476 .09797 .09998 .10135 .10235 .10310 .10369 .10620 .10699 .10737 .10760 .10775 .10786 .10794 .10800 .10805 .10827 .10834 .10838 .10840 .10842 .10842 .10843 .10843
.10921 .13189 .14145 .14666 .14993 .15217 .15379 .15502 .15599 .16012 .16143 .16206 .16244 .16269 .16287 .16300 .16310 .16319 .16355 .16367 .16374 .16377 .16380 .16381 .16382 .16384
.14797 .18309 .19831 .20675 .21209 .21578 .21848 .22054 .22216 .22921 .23148 .23261 .23328 .23373 .23404 .23428 .23447 .23462 .23528 .23550 .23561 .23568 .23572 .23575 .23578 .23579
.18418 .23769 .26214 .27609 .28513 .29148 .29619 .29983 .30273 .31582 .32025 .32251 .32387 .32479 .32545 .32595 .32634 .32665 .32807 .32855 .32878 .32893 .32902 .32909 .32914 .32918
.19738 .27612 .31598 .34010 .35638 .36821 .37725 .38441 .39025 .41849 .42913 .43486 .43847 .44095 .44277 .44417 .44527 .44617 .45035 .45181 .45256 .45301 .45331 .45353 .45370 .45383
.17342 .26250 .31249 .34452 .36705 .38395 .39721 .40797 .41693 .46337 .48279 .49390 .50119 .50638 .51028 .51332 .51577 .51778 .52757 .53119 .53309 .53427 .53507 .53567 .53615 .53650
.09412 .16397 .21213 .24689 .27340 .29456 .31203 .32684 .33966 .41557 .45428 .47938 .49747 .51135 .52242 .53153 .53920 .54575 .58211 .59815 .60745 .61371 .61814 .62163 .62426 .62643
n = 2, 3, ..., 10, 20, 30, ..., 100, 200, 300, ..., 900 (row labels, read left to right along each column line above).
Table 8.14 Exponential target process, values of (ν, n), (Section 8.6). The six lines below are the columns for ν = 0.3, 0.25, 0.2, 0.15, 0.1, 0.05, in that order.
1.2075 2.0211 2.8430 3.6683 4.4954 5.3235 6.1521 6.9812 7.8106 16.1110 24.4150 32.7200 41.0252 49.3307 57.6363 65.9419 74.2476 82.5533 165.6111 248.6693 331.7275 414.7858 497.8442 580.9025 663.9608
0.9905 1.6911 2.4017 3.1164 3.8333 4.5514 5.2702 5.9896 6.7094 13.9157 21.1267 28.3390 35.5517 42.7647 49.9778 57.1910 64.4042 71.6175 143.7515 215.8860 288.0206 360.1553 432.2900 504.4247 576.5595
0.7911 1.3863 1.9938 2.6065 3.2218 3.8388 4.4568 5.0755 5.6946 11.8979 18.1077 24.3192 30.5314 36.7440 42.9569 49.1698 55.3828 61.5959 123.7283 185.8614 247.9947 310.1281 372.2615 434.3949 496.5284
0.6031 1.0963 1.6049 2.1202 2.6390 3.1599 3.6822 4.2055 4.7295 9.9854 15.2510 20.5192 25.7886 31.0585 36.3288 41.5992 46.8699 52.1406 104.8502 157.5610 210.2721 262.9834 315.6948 368.4061 421.1176
0.4191 0.8073 1.2154 1.6324 2.0543 2.4792 2.9061 3.3343 3.7636 8.0812 12.4143 16.7519 21.0915 25.4321 29.7734 34.1150 38.4570 42.7991 86.2247 129.6528 173.0816 216.5107 259.9398 303.3691 346.7984
0.2288 0.4940 0.7863 1.0916 1.4043 1.7216 2.0421 2.3648 2.6893 5.9766 9.2927 12.6182 15.9480 19.2802 22.6138 25.9485 29.2838 32.6196 65.9896 99.3665 132.7453 166.1248 199.5048 232.8850 266.2654
n = 2, 3, ..., 10, 20, 30, ..., 100, 200, 300, ..., 800 (row labels, read left to right along each column line above).
Table 8.14 Exponential target process, values of (ν, n), (Section 8.6). (Continued) The six lines below are the columns for ν = 0.04, 0.03, 0.025, 0.02, 0.015, 0.01, in that order.
0.1884 0.4233 0.6873 0.9655 1.2518 1.5434 1.8385 2.1362 2.4358 5.4815 8.5610 11.6517 14.7477 17.8468 20.9477 24.0498 27.1529 30.2565 61.3086 92.3699 123.4339 154.4989 185.5645 216.6304 247.6965
0.1466 0.3473 0.5789 0.8264 1.0830 1.3455 1.6121 1.8817 2.1535 4.9298 7.7466 10.5775 13.4151 16.2565 19.1004 21.9459 24.7926 27.6402 56.1371 84.6472 113.1610 141.6764 170.1927 198.7094 227.2265
0.1250 0.3064 0.5197 0.7497 0.9894 1.2355 1.4861 1.7398 1.9959 4.6214 7.2919 9.9782 12.6722 15.3706 18.0719 20.7752 23.4798 26.1855 53.2680 80.3665 107.4698 134.5750 161.6814 188.7883 215.8956
0.1029 0.2630 0.4558 0.6661 0.8870 1.1147 1.3473 1.5833 1.8220 4.2801 6.7888 9.3156 11.8514 14.3924 16.9368 19.4836 22.0321 24.5818 50.1107 75.6600 101.2153 126.7734 152.3327 177.8929 203.4537
0.0801 0.2162 0.3853 0.5728 0.7717 0.9781 1.1898 1.4054 1.6239 3.8895 6.2131 8.5579 10.9133 13.2751 15.6411 18.0099 20.3809 22.7534 46.5198 70.3132 94.1149 117.9203 141.7275 165.5360 189.3457
0.0565 0.1644 0.3047 0.4642 0.6360 0.8161 1.0020 1.1924 1.3861 3.4163 5.5147 7.6389 9.7762 11.9216 14.0724 16.2269 18.3841 20.5435 42.1931 63.8809 85.5812 107.2870 128.9958 150.7064 172.4182
n = 2, 3, ..., 10, 20, 30, ..., 100, 200, 300, ..., 800 (row labels, read left to right along each column line above).
Table 8.14 Exponential target process, values of (ν, n), (Section 8.6). (Continued)
ν = 0.01, n = 2, 3, ..., 10, 20, 30, ..., 100:
0.0565 0.1644 0.3047 0.4642 0.6360 0.8161 1.0020 1.1924 1.3861 3.4163 5.5147 7.6389 9.7762 11.9216 14.0724 16.2269 18.3841 20.5435
ν = 0.01, n = 100, 200, ..., 1400:
20.5435 42.1931 63.8809 85.5812 107.2870 128.9958 150.7064 172.4182 194.1306 215.8435 237.5568 259.2704 280.9843 302.6983
ν = 0.001, n = 2, 3, ..., 10, 20, 30, ..., 100:
0.0090 0.0388 0.0882 0.1527 0.2287 0.3135 0.4052 0.5025 0.6042 1.7609 3.0338 4.3571 5.7079 7.0764 8.4572 9.8471 11.2440 12.6465
ν = 0.001, n = 100, 200, ..., 1400, 2000, 3000, 4000, 5000:
12.6465 26.8361 41.1591 55.5365 69.9429 84.3670 98.8030 113.2474 127.6979 142.1530 156.6118 171.0734 185.5373 200.0032 286.8229 431.5699 576.3392 721.1166
Table 8.15 Exponential target process: writing Λ for the corresponding entry of Table 8.14, values of n[ν − Λ^n(1 + Λ)^{−n}] ν^{−1/2} (−log ν)^{−1} (Section 8.6). The six lines below are the columns for ν = 0.3, 0.25, 0.2, 0.15, 0.1, 0.05, in that order.
.00241 .00270 .00291 .00307 .00318 .00328 .00335 .00341 .00347 .00373 .00383 .00388 .00392 .00394 .00395 .00397 .00398 .00398 .00402 .00403 .00404 .00404 .00404 .00404 .00404
.00684 .00801 .00880 .00937 .00980 .01014 .01042 .01064 .01083 .01180 .01217 .01236 .01248 .01256 .01262 .01267 .01270 .01273 .01286 .01291 .01293 .01294 .01295 .01296 .01296
.01365 .01639 .01827 .01963 .02068 .02150 .02217 .02273 .02319 .02560 .02654 .02704 .02735 .02756 .02772 .02784 .02793 .02800 .02835 .02846 .02852 .02856 .02858 .02860 .02861
.02308 .02845 .03219 .03496 .03710 .03882 .04022 .04140 .04239 .04766 .04980 .05096 .05170 .05220 .05257 .05285 .05307 .05325 .05408 .05436 .05450 .05459 .05465 .05469 .05472
.03510 .04477 .05170 .05697 .06114 .06453 .06735 .06974 .07179 .08311 .08797 .09070 .09245 .09368 .09458 .09528 .09583 .09628 .09838 .09911 .09949 .09971 .09986 .09997 .10006
.04577 .06200 .07436 .08417 .09222 .09896 .10472 .10970 .11408 .13998 .15228 .15963 .16454 .16809 .17076 .17286 .17455 .17595 .18272 .18518 .18647 .18725 .18778 .18816 .18845
n = 2, 3, ..., 10, 20, 30, ..., 100, 200, 300, ..., 800 (row labels, read left to right along each column line above).
Table 8.15 Exponential target process: writing Λ for the corresponding entry of Table 8.14, values of n[ν − Λ^n(1 + Λ)^{−n}] ν^{−1/2} (−log ν)^{−1} (Section 8.6). (Continued) The six lines below are the columns for ν = 0.04, 0.03, 0.025, 0.02, 0.015, 0.01, in that order.
.04620 .06380 .07750 .08855 .09772 .10548 .11216 .11799 .12314 .15432 .16961 .17891 .18524 .18984 .19334 .19611 .19836 .20022 .20937 .21277 .21454 .21563 .21637 .21691 .21731
.04498 .06360 .07855 .09086 .10124 .11015 .11791 .12474 .13082 .16875 .18811 .20019 .20854 .21471 .21947 .22326 .22635 .22893 .24187 .24678 .24937 .25097 .25205 .25284 .25343
.04339 .06222 .07763 .09052 .10149 .11099 .11933 .12672 .13333 .17532 .19731 .21124 .22099 .22825 .23389 .23841 .24211 .24521 .26097 .26703 .27024 .27222 .27357 .27454 .27526
.04084 .05948 .07512 .08842 .09992 .10997 .11888 .12683 .13400 .18059 .20574 .22198 .23350 .24217 .24896 .25444 .25897 .26277 .28239 .29007 .29415 .29667 .29837 .29960 .30047
.03693 .05473 .07012 .08352 .09531 .10578 .11517 .12364 .13134 .18296 .21197 .23115 .24500 .25556 .26393 .27074 .27641 .28120 .30649 .31663 .32206 .32542 .32767 .32927 .33017
.03100 .04682 .06103 .07379 .08531 .09577 .10531 .11406 .12211 .17860 .21219 .23516 .25214 .26534 .27595 .28469 .29204 .29833 .33248 .34669 .35440 .35918 .36236 .36460 .36625
n = 2, 3, ..., 10, 20, 30, ..., 100, 200, 300, ..., 800 (row labels, read left to right along each column line above).
Table 8.15 Exponential target process: writing Λ for the corresponding entry of Table 8.14, values of n[ν − Λ^n(1 + Λ)^{−n}] ν^{−1/2} (−log ν)^{−1} (Section 8.6). (Continued)
ν = 0.01, n = 2, 3, ..., 10, 20, 30, ..., 100:
.03100 .04682 .06103 .07379 .08531 .09577 .10531 .11406 .12211 .17860 .21219 .23516 .25214 .26534 .27595 .28469 .29204 .29833
ν = 0.01, n = 100, 200, ..., 1400:
.29833 .33248 .34669 .35440 .35918 .36236 .36460 .36625 .36754 .36854 .36933 .36995 .37045 .37086
ν = 0.001, n = 2, 3, ..., 10, 20, 30, ..., 100:
.00843 .01302 .01752 .02196 .02632 .03063 .03487 .03904 .04315 .08020 .11067 .13597 .15742 .17594 .19215 .20653 .21940 .23103
ν = 0.001, n = 100, 200, ..., 1400, 2000, 3000, 4000, 5000:
.23103 .30758 .34972 .37717 .39668 .41130 .42265 .43169 .43904 .44512 .45019 .45449 .45814 .46127 .47315 .47990 .48046 .47864
Table 8.16 Exponential target process, values of A, B, C, D and E (Section 8.6). Columns: ν = 0.3, 0.25, 0.2, 0.15, 0.1, 0.05, 0.04, 0.03, 0.025, 0.02, 0.015, 0.01 (left to right in each line below).
A      0.0089 0.0360 0.1032 0.2687 0.7311 2.5341 3.5126 5.1426 6.4394 8.3339 11.3539 17.1229
B      2.221 2.666 3.242 4.189 5.834 8.038 8.601 8.283 8.581 8.606 8.200 10.119
C      −1.0410 −1.0479 −1.0491 −1.0497 −1.0417 −0.9805 −0.9559 −0.8998 −0.8792 −0.8471 −0.7967 −0.7932
1000D  −0.12462 −0.04785 −0.03422 −0.06770 −0.07456 −0.17755 −0.19105 −0.29122 −0.30000 −0.33041 −0.37843 −0.23122
E      −1.4441 −1.5699 −1.7848 −2.1413 −2.6759 −3.2358 −3.3707 −3.2936 −3.3755 −3.3970 −3.3255 −3.8292
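However they are obtained (read from the tables, taken from a fitted approximation such as equation (8.16), or recomputed directly), index values are all used in the same way: at each decision time the index of every alternative bandit process is evaluated at its current state and the process with the largest index is continued. A hypothetical helper in the same spirit as the sketch above (the function name and the example states are illustrative only):

def index_policy_choice(arms, index_fn):
    # Pick the arm with the largest current index value.  `arms` maps an
    # arm label to its current state, e.g. the Beta posterior parameters
    # (alpha, beta) of a Bernoulli reward process, and `index_fn` returns
    # the index of a state, e.g. gittins_bernoulli above or a lookup with
    # interpolation in Table 8.8.  Ties are broken arbitrarily here.
    return max(arms, key=lambda arm: index_fn(*arms[arm]))

# e.g. with three Bernoulli arms and discount factor a = 0.99:
# arms = {"A": (3, 1), "B": (10, 4), "C": (1, 1)}
# chosen = index_policy_choice(arms, lambda al, be: gittins_bernoulli(al, be, a=0.99))

States or discount factors not covered by the tables could be handled by interpolation or by a fitted approximation of the kind summarized in Tables 8.9 to 8.12.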
Index Aalto, S., 111 achievable region, 48, 115, 118, 119, 123, 127, 130–132, 137, 241 approach, 110, 115, 119, 154, 162, 166 active action, 150, 153 adaptive greedy algorithm, 43, 115, 122, 124, 125, 127, 128, 130–132, 147, 163, 164, 167 admission control/routing, 163–166, 172 age of a job, 59, 77, 110 Agrawal, R. M., 245 Ahlswede, R., 73 Ahmad, S. H. A., 151 Alpern, S., 73 Amaral, J. A. F. P., xvi, 229 Anantharam, V., 49 Ansell, P. S., 152, 154, 161, 162 arm-acquiring bandits, 105 (see also branching bandits) arrivals, 97 (see also Bernoulli arrivals, branching bandits and job-linked arrivals) asymptotically optimal, 155, 166, 169 average cost, 110, 162 reward, 152, 153, 155
overtaking optimal, 146, 147 reward optimal, 147 Aviv, Y., 246 Ayesta, U., 111 Babaioff, M., 242 backwards induction, 200, 216, 230, 236, 239 Baker, A. G., xvi Baker, K. R., 56, 73 bandit process, 10 (see also generalized, standard and stoppable bandit process) linked arrivals, 105 (see also branching bandits) bandits process, 19, 38 (see also family of alternative bandit processes and multi-armed bandit) Bank, P., 48, 178 Banks, J. S., 245 Barkdoll, T., 246 Barra, J. R., 174 base function, 127 Bassamboo, A., 163 Bather, J. A., xvi, 89, 90, 178, 201, 205, 210 B¨auerle, N., 48 Bayes’ theorem, 3, 7, 8, 71, 174 Bayesian setup, 174 Beale, E. M. L., 43, 47
Bellman, R. E., 1, 20, 47, 221, 239 Ben-Israel, A., 44 Bensoussan, A., 210 Bergemann, D., 77, 243 Bergman, S. W., xvi, 68, 89 Berninghaus, S., 16 Bernoulli reward process, 44, 88, 177, 186, 187, 221, 223, 235, 239, 242, 264–276 (see also multi-armed bandit) target process, 182 Bernoulli/exponential target process, 179, 185, 213, 234 Berry, D. A., xv, 63, 244 Bertsekas, D. P., 149 Bertsimas, D., 48, 110, 115, 122, 126–128, 130, 149 Bhattacharya, P. P., 115 Bhulai, S., 244 bisection method, 232 Blackwell, D., 20, 71, 147 bound, 135, 139, 140, 143, 144, 152, 168 Boys, R. J., 13 Bradt, R. N., 47 branching bandits, 44, 80, 105, 106, 110, 112, 116, 126, 128, 130, 132, 134–136, 147 Braun, M., 242 Brownian motion, 201, 204 reward process, 178, 201–204, 210, 218, 263 busy period, 109, 110, 113, 114, 130, 161 buyer’s problem, 89 Buyukkoc, C., 43, 44, 47, 48, 106 calibration, 2, 5, 15, 26, 43, 185, 188, 191, 192, 198, 233 Cavallo, R., 243 central limit theorem, 206
Chang, F., 178, 205 Chebyshev’s inequality, 206, 239 Chen, Y. R., 43, 47 Chernoff, H., 210 Chow, Y. S., 22, 89 cμ-rule, 2, 116, 118, 159, 162 Coffman, E. G., 56, 116 coins problem, 29, 30 complementary slackness, 121, 122, 128 completion rate, 6, 46, 58, 75 set, 104 computational complexity, 10, 62, 64, 188, 236, 239 (see also computing requirement and storage requirement) computing requirement, 9 Condition A, 36, 84, 96 B, 37 C, 69, 71, 110 D, 82, 83, 85, 87, 88, 100 E, 99–102, 112, 113 F, 101, 102 conjugate prior distribution, 176, 178, 189, 211 conservation laws, 115, 119, 154 (see also GCL(1), GCL(2), SCL(1) and SCL(2)) generalized conservation laws, 116, 124, 125 strong conservation laws, 125, 129 continuation control, 21 continuous-time bandit, 36 index theorem, 61, 211 job, 55, 58, 104 policy, 58 control set, 21 Conway, R. W., 56 CPSDAI, 179
Cram´er, H., 222 current probability of success (CPS), 176, 231 customer impatience, 163 Dacre, M. J., 111, 115, 138 Dahleh, M., 151 dangerous search, 77 Davies, O. I., xvi decision time, 21 decomposable, 128, 137, 138, 147 -policy, 59, 61 Dempster, M. A. H., 56 Denardo, E. V., 147 Derman, C., 44, 223 deterministic policy, 22, 67 stationary Markov policy, 21, 22, 24 diffusion approximation, 90 bandits, 210 directed graph, 103 discount factor, 10, 13, 20–23, 32, 65, 66, 68, 80, 81 parameter, 20, 21, 67, 179 (see also arrivals discounting and stochastic discounting) discrete-time correction factor, 37, 176, 204 Doob, J. L., 210 DP, see dynamic programming, dual LP, see linear program Dumitriu, I., 40, 48, 50 Dunn, R. T., 146, 147 dynamic programming (DP), 1 (see also backwards induction, functional equation, recurrence equation and value iteration) equation, 19, 20, 24, 36, 46, 86, 87, 120, 152, 153, 156, 158, 159, 161, 170
economics, 14, 16, 243, 245 Eick, S. G., 87 El Karoui, N., 48 Eplett, W. J. R., 211 -FI∗ policy, 112 -index policy, 112 -index∗ policy, 112 equivalent reward rate, 14, 32, 74, 102, 105, 106 EWFT, 58, 69, 110, 113 existential state, 176 exploitation, xiii, 241, 243 exploration, xiii, 10, 243, 245 exponential family of distributions, 175 reward process, 225, 277 target process, 181, 229, 235, 236, 278–284 FABP, 19 (see also family of alternative bandits processes) fair charge, 27 stake, 38, 83, 85 family of alternative bandit processes (FABP), 19, 22, 23, 27, 36, 63, 70, 90, 96 (see also bandits process and multi-armed bandit) family of alternative superprocesses, 80 favourable probability distribution, 181 Fay, N. A., 66, 99, 113, 244 Federgruen, A., 115 Ferguson, T. S., 175, 188 Feron, E., 151, 245 Filliger, R., 211 filtration, 47, 89 FI∗ policy, 33, 97, 103 (see also modified forwards induction policy) fixed-s sequence, 206 Fl˚am, S. D., 44
flow time, 47, 56, 58, 63, 64, 69, 73–76, 80, 88, 110, 113 fluid approximation, 155 forwards induction, 33 (see also modified forwards induction) freeze control, 21, 22, 80–82, 93, 94, 104, 106, 107 freezing rule, 79, 86, 90, 93–97, 99, 102, 113 French, S., 56 Fristedt, B., xv, 63 Frostig, E., 48 functional equation, 23, 49, 185, 198–201, 236 Gal, S., 73 γ 0 limit, 5, 6, 44, 58, 68–70 Garbe, R., 48, 115, 132, 138 Garnett, O., 163 Gaver, D. P., 246 GCL(1), 125, 127, 128, 131, 133, 137, 138 GCL(2), 128, 130 Geisler, F., xvi generalized bandit problem, 65, 113 Bayes theorem, 174 job, 45, 47, 56, 97 Georgiadis, L., 115 Gittins index calculation of, 43, 51–53, 213 examples of, 128, 132, 137, 163, 247 theory of, xi, 19, 28, 32, 35, 43, 44, 46–48, 52, 123, 124, 130, 132–135, 140, 143, 144, 146, 239 Gittins, J. C., 28, 35, 44, 48, 58, 68, 75, 99, 114, 179, 223, 243, 244
Glazebrook, K. D., xvi, 13, 48, 66, 88, 99, 111, 114, 115, 132, 138, 146, 147, 152, 154, 156, 159, 161, 162, 166, 168, 171, 244–246 gold mining, 2, 66–68 golf with two balls, 40, 48 greedy algorithms, 51, 115, 138, 171 Groenvelt, H., 115 Guha, S., 151, 245 Harrison, J. M., 47, 163 Hauser, J. R., 242 hazard rate, 58 heavy-traffic, 166 Hedge, D., 245 Henrici, P., 216 Hodge, D. J., 159, 171 Hongler, M.-O., 211 Howard, R., 20, 50 impatient customers, 163 improving stopping option, 88, 104, 105 index policy, 24, 27, 80, 81 (see also -index, -index∗ , and minimal sub-family index policy) index theorem, 2, 28, 35, 38, 91, 100, 103, 104, 109, 178, 211 for a FABP with arrivals, 106 for a SFABP, 27 for a SFABP of continuous-time jobs, 61 for a SFAS, 83 for jobs with precedence constraints, 98 history of, 47 proof, 27, 33, 40, 83, 86, 98, 119 indexability, 124, 154, 156, 161–164, 172
indexable, 151, 154, 157–159, 171 systems, 116, 124 initial job, 103, 109 interchange argument, 3–5, 19, 20, 40, 41, 105, 113, 142 invariance, 178, 181, 185, 195 Ishikida, T., 48, 244 Jacobs, P. A., 246 Janson, S., 40 Javidi, T., 151 job, 45, 56, 58, 179 (see also continuous-time job and generalized job) selection and scheduling problems, 116, 136 linked arrivals, 105, 108, 110, 113 (see also branching bandits) jobs, 244, 245 Johnson, S. M., 47 Jones, D. M., xvi, 28, 35, 44, 179, 181, 184, 185, 213, 223, 235, 236 Jones, M. L., 244 Jun, T., 245 Kadane, J. B., 99, 114 Kallenberg, L. C., 43 Karatzas, I., 48 Karlin, S., 47 Kaspi, H., 48 Katehakis, M. N., 31, 43, 44, 111, 223 Kelly, F. P., xvi, 67, 72, 177 Kendall, D. G., 110 Kertz, R. P., 47, 87, 244 Khan, A., xvi Kirkbride, C., 156, 159, 166, 168, 171, 246 Klimov, G. P., 44, 47, 115 Klimov’s model, 108, 109, 110, 115
knapsack problem, 61, 64 Koo, R., 244 Koole, G. M., 244 Koopman, B. O., 72, 73 Kress, M., 245, 246 Krishnamachari, B., 151 Krishnamurthy, V., 44 K¨uchler, C., 48, 178 Lagrange multipliers, 88, 153, 170 Lagrangian relaxation, 149, 159, 165, 166, 168–172, 241, 245 Lai, T. L., 44, 47, 49, 178, 205 Lawler, E. L., 62 Lehnerdt, M., 77 Lenstra, J. K., 56, 62 Le Ny, J., 151, 245 Liberali, G., 242 light traffic, 166, 168, 169 Lindley, D. V., xvi, 210 linear program (LP), 115, 116, 118, 119, 121, 153 dual LP, 121 linear programming, 24, 43, 119 relaxation, 121, 245 Lions, J. L., 210 Lippman, S. A., 16 Liu, K., 151, 154 Liu, M., 151 location parameter, 188–191, 194, 203, 214 look-shoot-look models, 245 LP, see linear program Lumley, R. R., 162 Mahajan, A., 48 Mandelbaum, A., 48, 51, 163 Mannor, S., 49 Manor, G., 245 March, J. G., 243 Markov bandit process, 21
Markov (Continued) decision process (or problem), 9, 20, 21, 23 policy, 22, 50 process, 19 MATLAB program, 51, 223, 239 Matula, D., 71 Maxwell, W., 56 McCall. B. P., and J. J., 16 measurability, 101 (see also past-measurable policy) g-measurability, 101, 102 mechanism design, 151, 243 Meighem, van J. A., 159 Meilijson, I., 47 military applications, 72, 245 Miller, B. L., 147 Miller, L. W., 56 Miller, R. A., 16 minimal sub-family index policy, 103, 104 Mitchell, H. M., 246 Mitrani, I., 116 modified forwards induction, 33, 79, 80, 90, 91, 93, 97, 98, 101, 103, 104 policy, 35, 74, 75 (see also FI* policy) monotone class, 157, 158 in state, 171 indices, 44, 46 jobs, 45, 46 policy, 155–157, 159, 172 structure, 156, 164 monotonicity conditions, 45, 178, 181 (see also improving stopping option) mortality rate, 58 multi-armed bandit, 7, 28, 115, 119, 124, 139, 173 (see also Bernoulli reward process, family of alternative bandit processes and bandits process)
problem, 28, 119, 120, 150 multi-class queue, 80, 110, 129, 136, 138, 159 (see also queueing) multiple processors, 51, 73, 75, 78 (see also parallel machines problem) Munagala, K., 151, 245 myopic policy, 2, 6, 7, 30, 32, 75, 151 Nash, P., xvi, 58, 64, 65, 110, 246 Nemhauser, G. L., 138 Newton–Raphson method, 227 Neyman, J., 174 Ni˜no-Mora, J., 43, 48, 108, 111, 115, 127, 154, 158, 161, 162, 166 non-preemptive case, 46, 57, 88 non-preemptive schedule, 46 normal limit theorem, 207 reward process, 44, 213, 218, 223, 261, 263 target process, 182, 183 NP-complete, 62, 64, 138 NP-hard, 62, 64 O’Keeffe, M., 161, 162 Olivier, von G., 110 one-step-look-ahead policy, 32 ongoing bandit process, 107 Oreskovich, A., xiv Ouenniche, J., 166, 168 out-forest of precedence constraints, 47, 102, 103, 114, 147 out-tree, 103 Oyen, van M. P., 245 Pandelis, D. G., 51 Papadimitriou, C. H., 62, 152 parallel machines problem, 116, 139, 143, 144, 146, 147 (see also multiple processors)
Parkes, D. C., 243 Paschalidis, I., 110 passive action, 150, 153 past-measurable policy, 22, 23 payoff, 1, 20, 23 Pek¨oz, E. A., 244 Peres, Y., 40 performance bounds, 132, 134, 163, 241 space, 118 vector, 119, 126, 129, 139 pharmaceutical research, 68, 195, 235 Pilnick, S. E., 246, 248 Pinedo, M. L., 48, 56 Pliska, S. R., 47 Poisson arrivals, 88, 114 policy (see also continuous-time, -, deterministic, index, Markov, monotone, one-step-look-ahead, past-measurable and stationary) improvement algorithm, 24, 44 polymatroid, 115 Posadas, S., 246 precedence constraints, 56, 77, 81, 97 (see also out-forest of precedence constraints) partition, 131 preemption, 4, 57 Presman, E. L., xv prevailing charge, 27, 37, 74, 113, 243 stake, 38, 84, 86 priority classes, 131 Problems, statement of 1. single-machine scheduling, 2 2. gold mining, 2, 66 3. search, 3, 70 4. industrial research, 4 5. multi-armed bandits, 7, 49 6. exploration, 10
7. choosing a job, 13 8. coins, 29 process time, 22 promotion rule, 79, 86, 90, 95 PSPACE-hard, 152 Puterman, M. L., 20, 36, 48, 49 queueing (see also multi-class queue) control problem, 163, 169 M/G/1 queue, 88, 108, 110, 111, 113, 114, 117, 162 M/M/1 queue, 159 systems, 129, 136, 138, 159 Raiffa, H., 176 Rao, C. R., 11 recurrence equation, 1, 2, 8, 9 (see also functional equation) reducible, 137, 138 regret, 49 regular discount function, 63 Reiman, M., 163 relaxation, 141, 170 research projects, 4, 88, 136, 235 reservation wage, 14 restart in state, 31, 44, 47 restless bandit, 149–151, 155 retirement option, 43, 86 reward process, 173, 195 (see also Bernoulli, exponential and normal reward process) suboptimality, 135, 143, 154 Righter, R., 111, 244 Rinnooy Kan, A. H. G., 56, 62 Robbins, H., 22, 49, 89 Roberts, D. M., xvi Robinson, D. R., 44, 223 Romberg’s integration method, 216, 220, 227, 232 Ross, K. W., 153 Ross, S. M., 20, 36, 48 Rothkopf, M. H., 46
routing, 163 Rubin, H., 11 Ruiz-Hernandez, D., 156 sampling process, 173 scale parameter, 188 scheduling, 2, 46, 56, 63, 68, 73, 75, 99, 179, 244 Schlaifer, R., 176 SCL(1), 125, 136, 138 SCL(2), 129, 136, 138 search, 3, 70, 72, 76, 245 semi-Markov decision process, 21, 36 reward process, 195 sampling process, 197 sensor management, 151, 245 servers working in parallel, 139 Sevcik, K. C., 47 SFABP, 19 (see also simple family of alternative bandits processes) SFAS, 80 (see also simple family of alternative superprocesses) Shanthikumar, J. G., 115, 129, 244 shared service, 59 Sharma, Y., 242 Shi, P., 151 Shmoys, D. B., 62 Sidney, J. B., 99 Siegmund, D., 22, 89 Simon, H. A., 99, 114 simple family of alternative bandit processes (SFABP), 19, 119 simple family of alternative superprocesses (SFAS), 80 Singh, S., 243 Slivkins, A., 242 Smith, W. E., 47
Smith’s rule, 2, 47 Sonin, I. M., xv, 43 Sorenson, M., xiv spinning plates model, 150, 156, 158 standard bandit process, 2, 5, 9, 13, 25, 185 stationary policy, 22 Steiglitz, K., 62 Stidham, S. Jr., 48 stochastic discounting, 38, 67, 83 scheduling, 78, 241 Stone, L. D., 73 stop control, 88 stoppable bandit process, 88, 89 job, 104 stopping rule, 22 set, 22 time, 22 storage requirement, 9, 188 Str¨umpfer, J., 73 sub-family of alternative bandit processes, 113 (see also minimal sub-family of alternative bandit processes) submodular set function, 51, 113, 138 suboptimality, 116, 139, 140, 155 bounds, 132, 146 sufficient statistic, 174, 175, 195 Sundaram, R. K., 245 supermodular set function, 138 superprocess, 79–84, 87, 88, 90, 100, 103, 104 switching costs, 77, 171, 245 system performance, 119, 123–127, 130
target process, 28, 39, 40, 173, 176–179, 181, 185, 186, 189, 194, 213, 229, 230, 232–235 (see also Bernoulli, Bernoulli/exponential, exponential and normal target process) tax problem, 23, 69, 80, 106, 107, 111 Tcha, D. W., 47 Teneketzis, D., 48, 51, 245 Tetali, P., 40, 48, 50 Teugels, J. L., 20 Thron, C., 186 Tsitsiklis, J. N., 40, 47, 48, 49, 105, 110, 115, 149, 152 Tsoucas, P., 115 undiscounted rewards (see γ 0 limit) unsuccessful transition, 37 Urban, G. L., 242 V¨alim¨aki, J., 77, 243 value function, 23, 36, 86, 88, 120, 160 iteration, 24, 44, 50 of information, 244 Vanderbei, R., 121 Varaiya, P., 43, 44, 47, 49, 106 Veinott, A., 31, 44, 47, 111, 147
Wahlberg, B, 44 Wald, A., 20 Walrand, J. C., 43, 44, 47–49, 106, 113 Wan, Y. W., 244 Wang, Y. G., 44, 244 Washburn, A., 152, 245, 246 Weber, R. R., 28, 48, 51, 75, 78, 152, 154, 155 website morphing, 242 Wegener, I., 73 Weiss, G., 43, 47, 48, 105, 155, 244 Weitzman, M. I., 16 Whittle index examples of, 155, 161, 163 theorem, 86 theory of, 149–152, 154, 155, 157–160, 162–170, 172 Whittle, P., ix, x, xii, xvi, 20, 29, 44, 47, 48, 49, 82, 86, 105, 110, 111, 149, 158, 159 Wiener process, 44, 201, 210 Wilkinson, D. J., 146 Winkler, P., 40, 48, 50 Wolsey, L. A., 138 work-in-system, 117, 129 Yao, D. D., 115, 129 Yao, Y. C., 44 Ying, Z., 44, 47 Zeevi, A., 163 Zhao, Q., 151, 154