Systems & Control: Foundations & Applications

Series Editor
Tamer Başar, University of Illinois at Urbana-Champaign, Urbana, IL, USA

Editorial Board
Karl Johan Åström, Lund University of Technology, Lund, Sweden
Han-Fu Chen, Academia Sinica, Beijing, China
Bill Helton, University of California, San Diego, CA, USA
Alberto Isidori, University of Rome, Rome, Italy; Washington University, St. Louis, MO, USA
Miroslav Krstic, University of California, San Diego, CA, USA
Alexander Kurzhanski, University of California, Berkeley, CA, USA; Russian Academy of Sciences, Moscow, Russia
H. Vincent Poor, Princeton University, Princeton, NJ, USA
Mete Soner, ETH Zürich, Zürich, Switzerland; Swiss Finance Institute, Zürich, Switzerland
For further volumes: http://www.springer.com/series/4895
Daniel Hernández-Hernández
J. Adolfo Minjárez-Sosa
Editors

Optimization, Control, and Applications of Stochastic Systems
In Honor of Onésimo Hernández-Lerma

Editors

Daniel Hernández-Hernández
Department of Probability and Statistics
Center for Research in Mathematics
Mineral de Valenciana, GJ, México

J. Adolfo Minjárez-Sosa
Department of Mathematics
University of Sonora
Hermosillo, SO, México
ISBN 978-0-8176-8336-8 ISBN 978-0-8176-8337-5 (eBook) DOI 10.1007/978-0-8176-8337-5 Springer New York Heidelberg Dordrecht London Library of Congress Control Number: 2012942475 Mathematics Subject Classification (2010): 62C05, 60J05, 91A15, 91B70 © Springer Science+Business Media, LLC 2012 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein. Printed on acid-free paper Springer is part of Springer Science+Business Media (www.birkhauser-science.com)
Onésimo Hernández-Lerma, Mexico City, 2011
Foreword
Professor Onésimo Hernández-Lerma has made a great number of fundamental contributions to the field of stochastic control systems during the last 30 years. He has contributed to the development of this theory in many respects, including adaptive control, parametric estimation, recursive algorithms, infinite-dimensional linear programming, ergodic properties of Markov processes, measure theory, discrete- and continuous-time controlled Markov processes, and game theory. His research has also been concerned with important applications to the areas of queueing systems, population control, and management sciences. He has authored more than 140 papers, some of which still have a great influence on the study of controlled stochastic systems, since he was the first to mathematically formalize many fundamental results. He recently received the Scopus Prize (2008) and the Thomson Reuters Award (2009), recognizing the influence and importance of his work.

Professor Hernández-Lerma has written or coauthored 11 books or monographs on different topics. In the field of stochastic control, his book Adaptive Markov Control Processes (1989) soon became a reference for researchers and graduate students, and today it is considered a classic. In 1996 and 1999 he wrote jointly with J. B. Lasserre the books Discrete-Time Markov Control Processes: Basic Optimality Criteria and Further Topics on Discrete-Time Markov Control Processes, giving a systematic and deep treatment of controlled Markov processes on Borel spaces under several optimality criteria. The techniques required to deal with discrete- and continuous-time models are different, but in many respects the general intuition from one can be applied to the other. This intuition was firmly grounded in his most recent books, Continuous-Time Markov Decision Processes: Theory and Applications (with X. P. Guo) and Selected Topics on Continuous-Time Controlled Markov Chains and Markov Games (with T. Prieto-Rumeau).

Among the diverse optimality criteria analyzed in the work of Professor Hernández-Lerma, one of the most important is the average or ergodic index, where the asymptotic behavior of the controlled stochastic process needs to be well understood in order to verify the existence of solutions of the optimal control problem and to characterize the value function in terms of the optimality equation. The book
Markov Chains and Invariant Probabilities, written jointly with J. B. Lasserre and published in 2003, is precisely a well-established work on the ergodic behavior of Markov chains in metric spaces. Average optimality represents an interesting combination of ergodic theory and stochastic optimal control and requires effective techniques and clever ideas to deal with problems from adaptive control, partially observed processes, linear programming, and approximation procedures, among others; on this topic we can find a good number of papers by Professor Hernández-Lerma. A complete list of his publications is given below.

Professor Hernández-Lerma received his Ph.D. from Brown University in 1978, and since then he has been a regular faculty member of the Mathematics Department at the Centro de Investigación y de Estudios Avanzados (Cinvestav) of the Instituto Politécnico Nacional, where he has carried out most of his research and teaching activities. His work had a pioneering character in Mexico, where he was the first expert in stochastic control. Groups of mathematicians in Mexico were quite small in those years, and only a few people could see the potential of having a strong group in the field of stochastic optimal control. At Cinvestav he has always been recognized by his students for his excellent ability to lecture on many topics such as real analysis, probability theory, Markov processes, and stochastic calculus. Up to now he has graduated 17 Ph.D. students, a record among Mexican mathematicians, and his former students recognize his generosity, sensitivity, and his ability to propose cutting-edge research projects. The results of most of those Ph.D. theses have been published in well-established international journals.

Since 1986 he has regularly organized the Workshop on Stochastic Control, an important forum for the stochastic control group in Mexico to present new research and to interact with experts from other countries. This group has an intense research activity thanks to the inspiration of Professor Hernández-Lerma, who has also influenced a significant number of other young mathematicians and postdoctoral fellows who have visited Cinvestav over the years. He has served on numerous editorial boards, for journals such as the SIAM Journal on Control and Optimization, International Journal on Stochastic Analysis, and Journal of Dynamics and Games. He has also served as a guest editor of special volumes in top journals. Additionally, Professor Hernández-Lerma has always shown a genuine interest in probability and statistics education in Mexico and has written two monographs on these topics.

The contributions of Professor Hernández-Lerma to the development of applied mathematics in his country were recognized by the Government of Mexico in 2001, when he received the Sciences and Arts National Award, being the third mathematician to obtain such a high distinction. Also, in 2003 he received a doctorate honoris causa from the Universidad de Sonora, and in 2008 he was honored with the Medalla Lázaro Cárdenas by the Instituto Politécnico Nacional.

Guanajuato, México
Guanajuato, México
Daniel Hernández-Hernández
J. Adolfo Minjárez-Sosa
Preface
This volume presents a collection of papers by friends and colleagues of Professor Onésimo Hernández-Lerma, in his honor. The idea to compile the book first arose during the Onésimo Fest Symposium held in San Luis Potosí, Mexico, during March 16–18, 2011, to celebrate the 65th birthday of Professor Onésimo Hernández-Lerma, who has been an important contributor to stochastic optimal control. Thereafter, a group of colleagues whose research interests are in the areas of stochastic optimal control, optimization theory, and probability were invited to collaborate on this project. As a result, 33 authors from all over the world have contributed 18 chapters to the book.

All papers have been peer-reviewed and give a general view of the state of the art of several topics in the covered fields. In particular, the book presents recent developments on discrete-time Markov control or decision processes under different contexts: discounted and average criteria as well as sample-path and constrained optimality; continuous-time controlled problems for diffusion processes, jump Markov processes, and semi-Markov processes; optimal stopping problems; global optimization; and the existence of solutions of stochastic partial differential equations. Additionally, the book contains important applications to inventory control problems and financial systems. Chapters are presented in alphabetical order by first author.

We express our deep gratitude to all the people who have collaborated to make this special volume a success. Above all, we thank all the authors for their excellent and important contributions, as well as all the reviewers for their time and effort. We would also like to thank the Centro de Investigación en Matemáticas (CIMAT) and the Departamento de Matemáticas, Universidad de Sonora.

Guanajuato, México
Guanajuato, México
Daniel Hernández-Hernández
J. Adolfo Minjárez-Sosa
Contents
1  On the Policy Iteration Algorithm for Nondegenerate Controlled Diffusions Under the Ergodic Criterion
   Ari Arapostathis ......... 1

2  Discrete-Time Inventory Problems with Lead-Time and Order-Time Constraint
   Lankere Benkherouf and Alain Bensoussan ......... 13

3  Sample-Path Optimality in Average Markov Decision Chains Under a Double Lyapunov Function Condition
   Rolando Cavazos-Cadena and Raúl Montes-de-Oca ......... 31

4  Approximation of Infinite Horizon Discounted Cost Markov Decision Processes
   François Dufour and Tomás Prieto-Rumeau ......... 59

5  Reduction of Discounted Continuous-Time MDPs with Unbounded Jump and Reward Rates to Discrete-Time Total-Reward MDPs
   Eugene A. Feinberg ......... 77

6  Continuous-Time Controlled Jump Markov Processes on the Finite Horizon
   Mrinal K. Ghosh and Subhamay Saha ......... 99

7  Existence and Uniqueness of Solutions of SPDEs in Infinite Dimensions
   T.E. Govindan ......... 111

8  A Constrained Optimization Problem with Applications to Constrained MDPs
   Xianping Guo, Qingda Wei, and Junyu Zhang ......... 125

9  Optimal Execution of Derivatives: A Taylor Expansion Approach
   Gerardo Hernandez-del-Valle and Yuemeng Sun ......... 151

10 A Survey of Some Model-Based Methods for Global Optimization
   Jiaqiao Hu, Yongqiang Wang, Enlu Zhou, Michael C. Fu, and Steven I. Marcus ......... 157

11 Constrained Optimality for First Passage Criteria in Semi-Markov Decision Processes
   Yonghui Huang and Xianping Guo ......... 181

12 Infinite-Horizon Optimal Control Problems for Hybrid Switching Diffusions
   Héctor Jasso-Fuentes and George Yin ......... 203

13 Fluid Approximations to Markov Decision Processes with Local Transitions
   Alexey Piunovskiy and Yi Zhang ......... 225

14 Minimizing Ruin Probabilities by Reinsurance and Investment: A Markovian Decision Approach
   Rosario Romera and Wolfgang Runggaldier ......... 239

15 Estimation of the Optimality Deviation in Discounted Semi-Markov Control Models
   Luz del Carmen Rosas-Rosas ......... 253

16 Discrete Time Approximations of Continuous Time Finite Horizon Stopping Problems
   Lukasz Stettner ......... 265

17 A Direct Approach to the Solution of Optimal Multiple-Stopping Problems
   Richard H. Stockbridge and Chao Zhu ......... 283

18 On the Regularity Property of Semi-Markov Processes with Borel State Spaces
   Óscar Vega-Amaya ......... 301
Contributors
Ari Arapostathis  Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA
Lankere Benkherouf  Faculty of Science, Department of Statistics and Operations Research, Kuwait University, Safat, Kuwait
Alain Bensoussan  International Center for Decision and Risk Analysis, School of Management, University of Texas at Dallas, Richardson, TX, USA; WCU Distinguished Professor, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong; Ajou University, San S. Woncheon-Dong, Yeontong-Gu, Suwon, Korea
Rolando Cavazos-Cadena  Departamento de Estadística y Cálculo, Universidad Autónoma Agraria Antonio Narro, Buenavista, Saltillo, Coahuila, México
François Dufour  Institut de Mathématiques de Bordeaux, Université Bordeaux I, Talence, France; INRIA Bordeaux Sud Ouest, Team CQFD, Bordeaux, France
Eugene A. Feinberg  Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, USA
Michael C. Fu  The Robert H. Smith School of Business & Institute for Systems Research, University of Maryland, College Park, MD, USA
Mrinal K. Ghosh  Department of Mathematics, Indian Institute of Science, Bangalore, India
T. E. Govindan  ESFM, Instituto Politécnico Nacional, México D.F., México
Xianping Guo  Sun Yat-Sen University, Guangzhou, China
Gerardo Hernandez-del-Valle  Statistics Department, Columbia University, New York, NY, USA
Jiaqiao Hu  Department of Applied Mathematics and Statistics, State University at Stony Brook, Stony Brook, NY, USA
Yonghui Huang  Sun Yat-Sen University, Guangzhou, China
Héctor Jasso-Fuentes  Departamento de Matemáticas, CINVESTAV-IPN, México D.F., México
Steven I. Marcus  Department of Electrical and Computer Engineering & Institute for Systems Research, University of Maryland, College Park, MD, USA
Raúl Montes-de-Oca  Departamento de Matemáticas, Universidad Autónoma Metropolitana–Iztapalapa, México D.F., México
Alexey Piunovskiy  Department of Mathematical Sciences, University of Liverpool, Liverpool, UK
Tomás Prieto-Rumeau  Department of Statistics, UNED, Madrid, Spain
Rosario Romera  University Carlos III de Madrid, Madrid, Spain
Luz del Carmen Rosas-Rosas  Departamento de Matemáticas, Universidad de Sonora, Hermosillo, Sonora, México
Wolfgang Runggaldier  Dipartimento di Matematica Pura ed Applicata, University of Padova, Padova, Italy
Subhamay Saha  Department of Mathematics, Indian Institute of Science, Bangalore, India
Lukasz Stettner  Institute of Mathematics, Polish Academy of Sciences, Warsaw, Poland
Richard H. Stockbridge  Department of Mathematical Sciences, University of Wisconsin at Milwaukee, Milwaukee, WI, USA
Yuemeng Sun  Cornell University, Ithaca, NY, USA
Óscar Vega-Amaya  Departamento de Matemáticas, Universidad de Sonora, Hermosillo, Sonora, México
Yongqiang Wang  Department of Electrical and Computer Engineering & Institute for Systems Research, University of Maryland, College Park, MD, USA
Qingda Wei  Sun Yat-Sen University, Guangzhou, China
George Yin  Department of Mathematics, Wayne State University, Detroit, MI, USA
Junyu Zhang  Sun Yat-Sen University, Guangzhou, China
Yi Zhang  Department of Mathematical Sciences, University of Liverpool, Liverpool, UK
Enlu Zhou  Department of Industrial and Enterprise Systems Engineering, University of Illinois at Urbana-Champaign, IL, USA
Chao Zhu  Department of Mathematical Sciences, University of Wisconsin at Milwaukee, Milwaukee, WI, USA
Resume of Professor Onésimo Hernández-Lerma
Present Position:
Full Professor “3F” (top level), Department of Mathematics, CINVESTAV-IPN
• Member of the Mexican National Research System (SNI): Area I (Basic Sciences), Level 3 (top level).
Special Awards
• Sciences and Arts National Award by the Government of México (2001).
• Doctor Honoris Causa by the Universidad de Sonora (2003).
• Lázaro Cárdenas Medal (IPN) 2008.
• Scopus Prize (Elsevier) 2008.
• Thomson Reuters Award 2009.
Research interests: Optimal control of stochastic systems, Multiobjective control problems, Stochastic games, Infinite–dimensional linear programming, Markov processes.
Academic Background
• Mexican Air Force College, 1964.
• Escuela Superior de Física y Matemáticas (ESFM) of the Instituto Politécnico Nacional (IPN), Licenciado en Física y Matemáticas, 1970.
• Division of Applied Mathematics, Brown University, M.Sc., 1976.
• Division of Applied Mathematics, Brown University, Ph.D., 1978.
Employment
• Mexican Air Force, 1964–1968.
• Escuela Superior de Ingeniería Mecánica y Eléctrica (ESIME) of the IPN, 1968–1988.
• Universidad Autónoma Metropolitana, Azcapotzalco, 1974–1975.
• Centro de Investigación y de Estudios Avanzados, Mathematics Department, 1978–present. Chairman, 1992–1997, 2011–2015.
Visiting Appointments
• Universidad de los Andes (Mérida, Venezuela), 1971–1972.
• University of Texas at Austin, 1982–1983. One-month visits: 1984, 1985, 1986, 1987.
• LAAS-CNRS (Toulouse, France). Several one-month visits per year, during 1986–2001.
• Texas Tech University, 1987–1988.
• University of Padova (Padova, Italy). Two months, 1991.
• Universidad Carlos III de Madrid. Two months, 2000, 2002.
Visiting Postdoctoral Fellows
• Nadine Hilgert, INRA–ENSA.M, Montpellier, France, January–August 1999
• Xianping Guo, Zhongshan University, Guangzhou, P.R. China, August 2000–2002
• Tomás Prieto Rumeau, Universidad Complutense de Madrid, Spain, August 2003–January 2004
Former Ph.D. Students

Rolando Cavazos Cadena            CINVESTAV         1985
Roberto S. Acosta Abreu           CINVESTAV         1987
Sergio O. Esparza Núñez           Texas Tech Univ.  1989
Daniel Hernández Hernández        CINVESTAV         1993
Raúl Montes de Oca Machorro       UAM–Iztapalapa    1994
Fernando Luque Vásquez            FC–UNAM           1997
Oscar Vega Amaya                  UAM–Iztapalapa    1998
Juan González Hernández           FC–UNAM           1998
César E. Villarreal Rodríguez     CINVESTAV         1998
J. Rigoberto Gabriel Argüelles    CINVESTAV         2000
Raquiel R. López Martínez         CINVESTAV         2001
Jorge Alvarez Mena                CINVESTAV         2002
Guadalupe Carrasco Licea          FC–UNAM           2003
Mario A. Villalobos Arias         CINVESTAV         2005
Héctor Jasso Fuentes              CINVESTAV         2007
Armando F. Mendoza Pérez          CINVESTAV         2008
Beatris A. Escobedo Trujillo      CINVESTAV         2011
Associate Editor Positions
• SIAM Journal on Control and Optimization (1991–1996)
• Boletín de la Sociedad Matemática Mexicana, 3a. serie (1995–2005)
• Journal of Mathematical Systems, Estimation, and Control (1992–1998)
• Applicationes Mathematicae (Warsaw) (1999–present)
• Top, journal of the Spanish Society of Statistics and Operations Research, SEIO (2001–2006)
• Information Technology for Economics and Management (ITEM) (2001–2005)
• Morfismos, journal of the CINVESTAV Mathematics Department (1997–present)
• Revista Mexicana de Economía y Finanzas (ReMEF) (2002–2007)
• International Journal on Stochastic Analysis (2010–present)
• Journal of Dynamics and Games (2011–)
List of Publications by Onésimo Hernández-Lerma
Books and Monographs
1. Métodos de Fourier en la Física y la Ingeniería, Editorial Trillas, México, 1974.
2. Elementos de Probabilidad y Estadística, Fondo de Cultura Económica, México, 1979; 2nd printing, 1982.
3. Adaptive Markov Control Processes. Springer-Verlag, New York, 1989.
4. Annals of Operations Research Special Volumes 28 and 29 (1991) on Markov Decision Processes, with J.B. Lasserre, Guest Editors.
5. Lectures on Continuous-Time Markov Control Processes. Aportaciones Matemáticas, Advanced Texts series Vol. 3, Sociedad Matemática Mexicana, 1994.
6. Discrete-Time Markov Control Processes: Basic Optimality Criteria. Springer-Verlag, New York, 1996, with J.B. Lasserre.
7. Further Topics on Discrete-Time Markov Control Processes. Springer-Verlag, New York, 1999, with J.B. Lasserre.
8. Markov Chains and Invariant Probabilities. Birkhäuser, Basel, 2003, with J.B. Lasserre.
9. Elementos de Probabilidad y Estadística, Sociedad Matemática Mexicana, 2003, with A. Hernández del Valle.
10. Continuous-Time Markov Decision Processes: Theory and Applications. Springer-Verlag, 2009, with X.P. Guo.
11. Selected Topics on Continuous-Time Controlled Markov Chains and Markov Games. Imperial College Press, 2012, with T. Prieto-Rumeau.

Papers
1. Series de Fourier: Breve introducción histórica, Miscelánea Matemática No. 7, 1974, pp. 13–24.
2. Probabilidad y estadística en el nivel primario, Matemáticas y Enseñanza No. 2, 1975, pp. 23–42, with L.G. Gorostiza.
3. Lyapunov criteria for stability of differential equations with Markov parameters, Bol. Soc. Mat. Mexicana 24 (1979), pp. 27–48.
4. Exit probabilities for a class of perturbed degenerate systems, SIAM J. Control Optim. 19 (1981), pp. 39–51. 5. Control o´ ptimo de procesos de difusi´on markovianos, Ciencia (journal of the Mexican Academy of Sciences) 32 (1981), pp. 39–55, with W.H. Fleming. 6. Modelos matem´aticos en amibiasis, Sigma 7 (1981), pp. 131–138, with R. Cano Mancera and R. L´opez–Revilla. 7. Modelos matem´aticos de adhesi´on celular con aplicaciones a la adhesi´on de trofozo´ıtos de Entamoeba histolytica, Ciencia (journal of the Mexican Academy of Sciences) 33 (1982), pp. 107–117, with R. Cano Mancera and R. L´opez Revilla. 8. Adaptive control of service in queueing systems, Syst. Control Lett. 3 (1983), pp. 283–289, with S.I. Marcus. 9. Identification and approximation of queueing systems, IEEE Trans. Automatic Control AC–29 (1984), pp. 472–474, with S.I. Marcus. 10. Modelos matem´aticos de la adhesi´on de trofozo´ıtos de Entamoeba histolytica a eritrocitos humanos, Acta Mexicana de Ciencia y Tecnolog´ıa II (1984), No. 5, pp. 65–75, with R. Cano Mancera and R. L´opez–Revilla. 11. Control adaptable iterativo de sistemas markovianos con costo promedio, Acta Mexicana de Ciencia y Tecnolog´ıa II, No. 6 (1984), pp. 63–68, with R.S. Acota–Abreu. 12. Modelado, estimaci´on y control de recursos pesqueros, I. Modelos de poblaciones, Acta Mexicana de Ciencia y Tecnolog´ıa, II No. 7 (1984), pp. 51–61. 13. Optimal adaptive control of priority assignment in queueing systems, Syst. Control Lett. 4 (1984) 65–72, with S.I. Marcus. 14. Adaptive control of discounted Markov decision chains, J. Optim. Theory Appl. 46 (1985), pp. 227–235, with S.I. Marcus. 15. Nonstationary value–iteration and adaptive control of discounted semi– Markov processes, J. Math. Anal. Appl. 112 (1985), pp. 435–445. 16. Approximation and adaptive policies in discounted dynamic programming, Bol. Soc. Mat. Mexicana 30 (1985), pp. 25–35. 17. Iterative adaptive control of denumerable state average–cost Markov systems, Control & Cybernetics 14, No. 4 (1985), pp. 313–322, with R.S. Acosta Abreu. 18. Finite–state approximations for denumerable multidimensional state discounted Markov decision processes, J. Math. Anal. Appl. 113 (1986), pp. 382–389. 19. Filtrado estad´ıstico, Acta Mexicana de Ciencia y Tecnolog´ıa 4, Nos. 13–14 (1986), pp. 49–55. 20. Adaptive control of Markov processes with incomplete state information and unknown parameters, J. Optim. Theory Appl. 52 (1987), pp. 227–241, with S.I. Marcus. 21. Control adaptable de sistemas inciertos a tiempo discreto, Ciencia (journal of the Mexican Academy of Sciences) 38 (1987), Part I, pp. 69–80; Part II, pp. 147–154; Part III, pp. 217–228, with M. Espa˜na and D.B. Hern´andez.
22. Adaptive policies for discrete–time stochastic control systems with unknown disturbance distribution, Syst. Control Lett. 9 (1987), pp. 307–315, with S.I. Marcus. 23. Estimaci´on de par´ametros en sistemas compartamentales parcialmente observables, Acta Mexicana de Ciencia y Tecnolog´ıa 5, No. 17 (1987), pp. 33–43, with F. Maldonado Etchegaray. 24. Approximation and adaptive control of Markov processes: average reward criterion, Kybernetika (Prague) 23 (1987), pp. 265–288. 25. Continuous dependence of stochastic control models on the noise distribution, Appl. Math. Optim. 17 (1988), pp. 79–89, with R. Cavazos–Cadena. 26. Controlled Markov processes: recent results on approximation and adaptive control, Texas Tech University Mathematics Series, Visiting Scholars Lectures 1986–1987, 15 (1988), 91–117. 27. A forecast horizon and a stopping rule for general Markov decision processes, J. Math. Anal. Appl. 132 (1988), 388–400, with J.B. Lasserre. 28. Recursive nonparametric estimation of nonstationary Markov processes, Bol. Soc. Mat. Mexicana 33 (1988), 57–69, with S.O. Esparza and B.S. Duran. 29. Nonparametric adaptive control of discrete–time partially observable stochastic systems, J. Math. Anal. Appl. 137 (1989), pp. 312–334, with S.I. Marcus. 30. On rolling horizon procedures for Markov control processes, Aportaciones Matem´aticas (Soc. Mat. Mexicana), Serie Notas de Investigaci´on 4 (1989), 73–88. 31. Discretization procedures for adaptive Markov control processes, J. Math. Anal. Appl. 137 (1989), pp. 485–514, with S.I. Marcus. 32. Adaptive policies for priority assignment in discrete time queues — discounted cost criterion, Control and Cybernetics 19 (1990), 149–177, with R. Cavazos Cadena. 33. Error bounds for rolling horizon policies in discrete–time Markov control processes, IEEE Trans. Automatic Control 35 (1990), 1118–1124, with J.B. Lasserre. 34. Average cost optimal policies for Markov control processes with Borel state space and unbounded costs, Syst. Control Lett. 15 (1990), 349–356, with J.B. Lasserre. 35. Density estimation and adaptive control of Markov processes: average and discounted criteria, Acta Appl. Math. 20 (1990), 285–307, with R. Cavazos– Cadena. 36. Average cost Markov decision processes: optimality conditions, J. Math. Anal. Appl. 158 (1991), 396–406, with J.C. Hennet and J.B. Lasserre. 37. Recursive adaptive control of Markov decision processes with the average reward criterion, Appl. Math. Optim. 23 (1991), 193–207, with R. Cavazos– Cadena. 38. Recurrence conditions for Markov decision processes with Borel state space: A survey, Ann. Oper. Res. 28 (1991), 29–46, with R. Montes de Oca and R. Cavazos–Cadena.
39. Average optimality in dynamic programming on Borel spaces — unbounded costs and controls, Syst. Control Lett. 17 (1991), 237–242. 40. On integrated square errors of recursive nonparametric estimates of nonstationary Markov processes, Prob. Math. Stat. 12 (1991), 25–33. 41. Discrete–time Markov control processes with discounted unbounded costs: Optimality criteria, Kybernetika (Prague) 28 (1992), 191–213, with M. Mu˜noz de Ozak. 42. Equivalence of Lyapunov stability criteria in a class of Markov decision processes, Appl. Math. Optim. 26 (1992), 113–137, with R. Cavazos–Cadena. 43. Control o´ ptimo estoc´astico y programaci´on lineal infinita, Aportaciones Matem´aticas (Soc. Mat. Mexicana). Serie Notas de Investigaci´on 7 (1992), 109–120. 44. Value iteration and rolling plans for Markov control processes with unbounded rewards, J. Math. Anal. Appl. 177 (1993), 38–55, with J.B. Lasserre. 45. Existence of average optimal policies in Markov control processes with strictly unbounded cost, Kybernetika (Prague). 29 (1993), 1–17. 46. Monotone approximations for convex stochastic control problems, J. Math. Syst., Estimation, and Control 4 (1994), 99–140, with W. Runggaldier. 47. Linear programming and average optimality of Markov control processes on Borel spaces–unbounded costs, SIAM J. Control Optim. 32 (1994), 480–500, with J.B. Lasserre. 48. Weak conditions for average optimality in Markov control processes, Syst. Control Lett. 22 (1994), 287–291, with J.B. Lasserre. 49. Discounted cost Markov decision processes on Borel spaces: The linear programming formulation, J. Math. Anal. Appl. 183 (1994), 335–351, with D. Hern´andez–Hern´andez. 50. Conditions for average optimality in Markov control processes on Borel spaces, Bol. Soc. Mat. Mexicana 39 (1994), 39–50, with R. Montes–de–Oca and J.A. Minj´arez–Sosa. 51. Average cost Markov control processes with weighted norms: value iteration. Appl. Math. (Warsaw) 23 (1995), 219–237, with E. Gordienko. 52. Average cost Markov control processes with weighted norms: existence of canonical policies. Appl. Math. (Warsaw) 23 (1995), 199–218, with E. Gordienko. 53. Linear programming and infinite horizon problems of deterministic control theory. Bol. Soc. Mat. Mex. (3rd series) 1 (1995), 59–72, with D. Hern´andez– Hern´andez. 54. Numerical aspects of monotone approximations in convex stochastic control problems. Ann. Oper. Res. 56 (1995), 135–156, with C. Piovesan and W.J. Runggaldier. 55. Conditions for average optimality in Markov control processes with unbounded costs and controls, J. Math. Syst., Estimation, and Control 5 (1995), 459–477, with R. Montes de Oca. 56. A counterexample on the semicontinuity of minima. Proc. Amer. Math. Soc. 123 (1995), 3175–3176, with F. Luque V´asquez.
57. Invariant probabilities for Feller–Markov chains. J. Appl. Math. and Stoch. Anal. 8 (1995), 341–345, with J.B. Lasserre. 58. Value iteration in average cost Markov control process on Borel spaces. Acta Appl. Math. 42 (1996), 203–222, with R. Montes–de–Oca. 59. Average optimality in Markov control processes via discounted cost problems and linear programming. SIAM J. Control Optim. 34 (1996), 295–310, with J.B. Lasserre. 60. Existence of bounded invariant probability densities for Markov chains. Statist. Prob. Lett. 28 (1996), 359–366, with J.B. Lasserre. 61. The linear programming approach to deterministic optimal control problems. Appl. Math. (Warsaw) 24 (1996), 17–33, with D. Hern´andez–Hern´andez and M. Taksar. 62. An extension of the Vitali–Hahn–Saks Theorem. Proc. Amer. Math. Soc. 124 (1996), 3673–3676; correction, ibid. 126 (1998), p. 949, with J.B. Lasserre. 63. The linear programming formulation of optimal control problems, Texas Tech University, Math. Series, Visiting Scholars Lectures, 1993–1996, 19 (1997), 1–17. 64. Policy iteration for average cost Markov control processes on Borel spaces. Acta Appl. Math. 47 (1997), 125–154, with J.B. Lasserre. 65. Cone–constrained linear equations in Banach spaces. J. Convex Anal. 4 (1997), 149–164, with J.B. Lasserre. 66. Infinite linear programming and multichain Markov control processes in uncountable spaces. SIAM J. Control Optim. 36 (1998), 313–335, with J. Gonz´alez–Hern´andez. 67. Existence and uniqueness of fixed points for Markov operators and Markov processes. Proc. London Math. Soc. 76 (1998), 711–736, with J.B. Lasserre. 68. Infinite–horizon Markov control processes with undiscounted cost criteria: from average to overtaking optimality. Appl. Math. (Warsaw) 25 (1998), 153– 178, with O. Vega–Amaya. 69. Linear programming approximations for Markov control processes in metric spaces. Acta Appl. Math. 51 (1998), 123–139, with J.B. Lasserre. 70. Approximation schemes for infinite linear programs. SIAM J. Optim. 8 (1998), 973–988, with J.B. Lasserre. 71. Ergodic theorems and ergodic decomposition of Markov chains. Acta Appl. Math. 54 (1998), 99–119, with J.B. Lasserre. 72. Envelopes of sets of measures, tightness, and Markov control processes. Appl. Math. Optim. 40 (1999), 377–392, with J. Gonz´alez–Hern´andez. 73. Semi–Markov control models with average costs. Appl. Math. (Warsaw) 26 (1999), 315–331, with F. Luque–V´asquez. 74. Markov control processes with the expected total–cost criterion: optimality, stability, and transient models. Acta Appl. Math. 59 (1999), 229–269, with G. Carrasco and R. P´erez–Hern´andez. 75. Sample–path optimality and variance–minimization of average cost Markov control processes. SIAM J. Control Optim. 38 (2000), 79–93, with O. Vega– Amaya and G. Carrasco.
76. Fatou and Lebesgue convergence theorems for measures. J. Appl. Math. Stoch. Anal. 13 (2000), 137–146, with J.B. Lasserre. 77. Limiting optimal discounted–cost control of a class of time–varying stochastic systems. Syst. Control Lett. 40 (2000), 37–42, with N. Hilgert. 78. Constrained Markov control processes in Borel spaces: the discounted case. Math. Meth. Oper. Res. 52 (2000), 271–285, with J. Gonz´alez–Hern´andez. 79. On the classification of Markov chains via occupation measures. Appl. Math. (Warsaw) 27 (2000), 489–498, with J.B. Lasserre. 80. Zero–sum stochastic games in Borel spaces: average payoff criteria. SIAM J. Control Optim. 39 (2001), 1520–1539, with J.B. Lasserre. 81. Further criteria for positive Harris recurrence of Markov chains. Proc. Amer. Math. Soc. 129 (2001), 1521–1524, with J.B. Lasserre. 82. Strong duality of the general capacity problem in metric spaces. Math. Meth. Oper. Res. 53 (2001), 25–34, with J.R. Gabriel. 83. Limiting average cost problems in a class of discrete–time stochastic systems. Appl. Math. (Warsaw) 28 (2001), 111–123, with N. Hilgert. 84. Limiting discounted–cost control of partially observable stochastic systems. SIAM J. Control. Optim. 40 (2001), 348–369, with R. Romera. 85. On the probabilistic multichain Poisson equation. Appl. Math. (Warsaw) 28 (2001), 225–243, with J.B. Lasserre. 86. A multiobjective control approach to priority queues. Math. Meth. Oper. Res. 53 (2001), 265–277, with L.F. Hoyos–Reyes. 87. Nonstationary continuous–time Markov control processes with discounted costs on infinite horizon. Acta Appl. Math. 67 (2001), 277–293, with T.E. Govindan. 88. The Lagrange approach to infinite linear programs. Top 9 (2001), 293–314, with J.R. Gabriel and R.R. L´opez–Mart´ınez. 89. The linear programming approach, Chapter 12 in Handbook of Markov Decision Processes, edited by E. Feinberg and A. Shwartz, Kluwer, 2002, with J.B. Lasserre. 90. Strong duality of the Monge–Kantorovich mass transfer problem in metric spaces. Math. Zeit. 239 (2002), 579–591, with J.R. Gabriel. 91. Convergence of the optimal values of constrained Markov control processes. Math. Meth. Oper. Res. 55 (2002), 461–484, with J. Alvarez–Mena. 92. Continuous–time controlled Markov chains. Ann. Appl. Prob. 13 (2003), 363– 388, with X.P. Guo. 93. Performance analysis and optimal control of the Geo/Geo/c queue. Performance Evaluation 52 (2003), 15–39, with J.R. Artalejo. 94. Minimax control of discrete–time stochastic systems. SIAM J. Control Optim. 41 (2003), 1626–1659, with J.I. Gonz´alez–Trejo and L.F. Hoyos–Reyes. 95. Drift and monotonicity conditions for continuous–time controlled Markov chains with an average criterion. IEEE Trans. Automatic Control 48 (2003), 236–245, with X.P. Guo. 96. Constrained continuous–time Markov control processes with discounted criteria. J. Stoch. Anal. Appl. 21 (2003), 379–399, with X.P. Guo.
97. Zero–sum games for continuous–time Markov chains with unbounded transition and average payoff rates. J. Appl. Prob. 40 (2003), 327–345, with X.P. Guo. 98. Bias optimality versus strong O–discount optimality in Markov control processes with unbounded costs. Acta Appl. Math. 77 (2003), 215–235, with N. Hilgert. 99. Constrained average cost Markov control processes in Borel spaces. SIAM J. Control Optim. 42 (2003), 442–468, with J. Gonz´alez–Hern´andez and R.R. L´opez–Mart´ınez. 100. Continuous–time controlled Markov chains with discounted rewards. Acta Appl. Math. 79 (2003), 195–216, with X.P. Guo. 101. The Lagrange approach to constrained Markov control processes. Invited paper, Morfismos 7 (2003), 1–26, with R.R. L´opez–Mart´ınez. 102. Zero–sum games for nonhomogeneous Markov chains with expected average payoff criteria. Applied and Computational Math., 3 (2004), 10–22, with X.P. Guo. 103. The scalarization approach to multiobjective Markov control problems: why does it work?, Appl. Math. Optim. 50 (2004), 279–293, with R. Romera. 104. Multiobjective Markov control processes: a linear programming approach. Invited paper. Morfismos, 8 (2004), 1–33, with R. Romera. 105. The Laurent series, sensitive discount and Blackwell optimality for continuous– time controlled Markov chains, Math. Meth. Oper. Res. 61 (2005), 123–145, with T. Prieto–Rumeau. 106. Convergence and approximation of optimization problems, SIAM J. Optim. 15 (2005), 527–539, with J. Alvarez–Mena. 107. Bias and overtaking equilibria for zero–sum continuous–time Markov games, Math. Meth. Oper. Res. 61 (2005), 437–454, with T. Prieto–Rumeau. 108. Nonzero–sum games for continuous–time Markov chains with unbounded discounted payoffs, J. Appl. Prob. 42 (2005), 303–320, with X.P. Guo. 109. Extreme points of sets of randomized strategies in constrained optimization and control problems, SIAM J. Optim. 15 (2005), 1085–1104, with J. Gonz´alez–Hern´andez. 110. Approximation of general optimization problems. Invited paper. Morfismos, 9 (2005), 1–20, with J. Alvarez–Mena. 111. Zero–sum continuous–time Markov games with unbounded transition and discounted payoff rates, Bernoulli, 11 (2005), 1009–1029, with X.P. Guo. 112. Bias optimality for continuous–time controlled Markov chains, SIAM J. Control Optim., 45 (2006), 51–73, with T. Prieto–Rumeau. 113. On solutions to the mass transfer problem, SIAM J. Optim., 17 (2006), 485– 499, with J. Gonz´alez–Hern´andez and J.R. Gabriel. 114. Existence of Nash equilibria for constrained stochastic games, Math. Meth. Oper. Res. 63 (2006), 261–285, with J. Alvarez–Mena. 115. Asymptotic convergence of a simulated annealing algorithm for multiobjective optimization problems, Math. Meth. Oper Res., 64 (2006), 353–362, with M. Villalobos–Arias and C.A. Coello Coello.
116. Asymptotic convergence of metaheuristics for multiobjective optimization problems, Soft Computing 10 (2006), 1001–1005, with M. Villalobos–Arias and C.A. Coello Coello. 117. A survey of recent results on continuous–time Markov decision processes, Top 14 (2006), 177–261, with X.P. Guo and T. Prieto–Rumeau. 118. A unified approach to continuous–time discounted Markov control processes. Invited paper. Morfismos 10 (2006), 1–40, with T. Prieto–Rumeau. 119. Zero–sum games for continuous–time jump Markov processes in Polish spaces: discounted payoffs, Adv. Appl. Probab. 39 (2007), 645–668, with X.P. Guo. 120. Comments on: “Dynamic priority allocation via restless bandit marginal productivity indices” [TOP 15 (2007), 161–198] by J. Ni˜no–Mora. TOP 15 (2007), 208–210. 121. Ergodic control of continuous–time Markov chains with pathwise constraints, SIAM J. Control Optim. 47 (2008), 1888–1908, with T. Prieto–Rumeau. 122. Characterizations of overtaking optimality for controlled diffusion processes, Appl. Math. Optim. 57 (2008), 349–369, with H. Jasso–Fuentes. 123. Existence and regularity of nonhomogeneous Q(t)–processes under measurability conditions, J. Theoret. Probab. 21 (2008), 604–627, with L. Ye and X.P. Guo. 124. The vanishing discount approach to average reward optimality: the strongly and the weakly continuous cases, Morfismos 12 (2008), 1–15, with T. Prieto– Rumeau 125. Ergodic control, bias, and sensitive discount optimality for Markov diffusion processes, J. Stoch. Anal. Appl. 27 (2009), 363–385, with H. Jasso–Fuentes. 126. Blackwell optimality for controlled diffusion processes, J. Appl. Probab. 46 (2009), 372–391, with H. Jasso–Fuentes. 127. Variance minimization and the overtaking optimality approach to continuous– time controlled Markov chains, Math. Meth. Oper. Res. 70 (2009), 527–540, with T. Prieto–Rumeau. 128. Markov control processes with pathwise constraints, Math. Meth. Oper. Res. 71 (2010), 477–502, with A.F. Mendoza–P´erez. 129. The vanishing discount approach to constrained continuous–time controlled Markov chains, Syst. Control Lett. 59 (2010), 504–509, with T. Prieto– Rumeau. 130. Policy iteration and finite approximations to discounted continuous–time controlled Markov chains, in: Modern Trends in Controlled Stochastic Processes, edited by A.B. Piunovskiy, Luniver Press, Frome, U.K., 2010, pp. 84–101, with T. Prieto–Rumeau. 131. Overtaking optimal strategies and sensitive discount optimality for controlled diffusions, in: Modern Trends in Controlled Stochastic Processes, edited by A.B. Piunovskiy, Luniver Press, Frome, U.K., 2010, pp. 102–122, with H. Jasso–Fuentes.
132. Nonstationary discrete–time deterministic and stochastic control systems with infinite horizon, International Journal of Control 83 (2010), 1751–1757, with X.P. Guo and A. Hern´andez–del–Valle. 133. Asymptotic normality of discrete–time Markov control processes. J. Applied Probab. 47 (2010), 778–795, with A.F. Mendoza–P´erez. 134. New optimality conditions for average–payoff continuous–time Markov games in Polish spaces. Science China Math. 54 (2011), pp. 793–816, with X.P. Guo. 135. Variance–minimization of Markov control processes with pathwise constraints. Optimization. Published electronically: 30 March 2011. DOI:10.1080/ 02331934.2011.565762, with A.F. Mendoza–P´erez. 136. Nonstationary discrete–time deterministic and stochastic control systems: bounded and unbounded cases. Syst. Control Lett. 60 (2011), pp. 503–509, with X.P. Guo and A. Hern´andez–del–Valle. 137. Overtaking optimality for controlled Markov–modulated diffusions. Optimization, with B.A. Escobedo–Trujillo. Published electronically: May 3, 2011. DOI:10.1080/02331934.2011.565417. 138. An inverse optimal problem in discrete–time stochastic control. Journal of Difference Equations and Applications, with D. Gonz´alez–S´anchez. Electronic version published: 29 September 2011. DOI:10.1080/10236198.2011.613596. 139. Bias and overtaking equilibria for zero–sum stochastic differential games. J. Optim. Theory Appl., with B.A. Escobedo–Trujillo and J.D. L´opez– Barrientos. Published online: December 2011. DOI: 10.1007/s10957-0119974-4. To appear 140. Deterministic optimal policies for Markov control processes with pathwise constraints. Applicationes Mathematicae (Warsaw), with A.F. Mendoza– P´erez. 141. First passage problems for nonstationary discrete–time stochastic control systems. Euro. J. Control, with X.P. Guo and A. Hern´andez–del–Valle.
Chapter 1
On the Policy Iteration Algorithm for Nondegenerate Controlled Diffusions Under the Ergodic Criterion Ari Arapostathis
Dedicated to Onésimo Hernández-Lerma on the occasion of his 65th birthday.
1.1 Introduction

The policy iteration algorithm (PIA) for controlled Markov chains has been known since the fundamental work of Howard [2]. For controlled Markov chains on Borel state spaces, most studies of the PIA rely on blanket Lyapunov conditions [9]. A study of the PIA that treats the model of near-monotone costs can be found in [11], some ideas of which we follow closely. An analysis of the PIA for piecewise deterministic Markov processes has appeared in [6].
In this chapter we study the PIA for controlled diffusion processes X = {X_t, t ≥ 0}, taking values in the d-dimensional Euclidean space R^d, and governed by the Itô stochastic differential equation
\[ dX_t = b(X_t, U_t)\,dt + \sigma(X_t)\,dW_t . \tag{1.1} \]
All random processes in (1.1) live in a complete probability space (Ω, F, P). The process W is a d-dimensional standard Wiener process independent of the initial condition X_0. The control process U takes values in a compact, metrizable set U, and U_t(ω) is jointly measurable in (t, ω) ∈ [0, ∞) × Ω. Moreover, it is nonanticipative: for s < t, W_t − W_s is independent of
\[ F_s := \text{the completion of } \sigma\{X_0, U_r, W_r,\; r \le s\} \text{ relative to } (F, P) . \]
Such a process U is called an admissible control, and we let U denote the set of all admissible controls. We impose the following standard assumptions on the drift b and the diffusion matrix σ to guarantee existence and uniqueness of solutions to (1.1).

(A1) Local Lipschitz continuity: The functions b = [b^1, …, b^d]^T : R^d × U → R^d and σ = [σ^{ij}] : R^d → R^{d×d} are locally Lipschitz in x with a Lipschitz constant K_R depending on R > 0. In other words, if B_R denotes the open ball of radius R centered at the origin in R^d, then for all x, y ∈ B_R and u ∈ U,
\[ |b(x,u) - b(y,u)| + \|\sigma(x) - \sigma(y)\| \le K_R\, |x - y| , \]
where ‖σ‖² := trace(σσ^T).

(A2) Affine growth condition: b and σ satisfy a global growth condition of the form
\[ |b(x,u)|^2 + \|\sigma(x)\|^2 \le K_1 \bigl( 1 + |x|^2 \bigr) \qquad \forall (x,u) \in \mathbb R^d \times U . \]

(A3) Local nondegeneracy: For each R > 0, there exists a positive constant κ_R such that
\[ \sum_{i,j=1}^{d} a^{ij}(x)\, \xi_i \xi_j \ge \kappa_R\, |\xi|^2 \qquad \forall x \in B_R , \]
for all ξ = (ξ_1, …, ξ_d) ∈ R^d, where a := ½ σσ^T.

We also assume that b is continuous in (x, u).
In integral form, (1.1) is written as
\[ X_t = X_0 + \int_0^t b(X_s, U_s)\, ds + \int_0^t \sigma(X_s)\, dW_s . \tag{1.2} \]
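As a purely illustrative aside (not part of the chapter), the controlled dynamics (1.1)/(1.2) can be simulated with a simple Euler–Maruyama scheme once a stationary Markov control is fixed. In the sketch below the drift b, the diffusion σ, the control map v, the horizon, and the step size are all hypothetical placeholders chosen only for demonstration.

```python
import numpy as np

def euler_maruyama_path(b, sigma, v, x0, T=10.0, dt=1e-3, rng=None):
    """Simulate X_t = X_0 + int_0^t b(X_s, v(X_s)) ds + int_0^t sigma(X_s) dW_s
    by an Euler-Maruyama discretization (illustrative sketch only)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    d = x.size
    n_steps = int(T / dt)
    path = np.empty((n_steps + 1, d))
    path[0] = x
    for k in range(n_steps):
        u = v(x)                                    # stationary Markov control U_t = v(X_t)
        dW = rng.normal(scale=np.sqrt(dt), size=d)  # Wiener increment over [t, t + dt]
        x = x + b(x, u) * dt + sigma(x) @ dW
        path[k + 1] = x
    return path

# Hypothetical 1-d example: mean-reverting drift, constant nondegenerate diffusion, U = [-1, 1].
b = lambda x, u: -x + u
sigma = lambda x: np.array([[1.0]])
v = lambda x: np.clip(-0.5 * x, -1.0, 1.0)
X = euler_maruyama_path(b, sigma, v, x0=2.0)
```

The drift and diffusion in this toy model clearly satisfy (A1)–(A3); nothing beyond that is claimed for the scheme.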
The second term on the right-hand side of (1.2) is an Itô stochastic integral. We say that a process X = {X_t(ω)} is a solution of (1.1) if it is F_t-adapted, continuous in t, defined for all ω ∈ Ω and t ∈ [0, ∞), and satisfies (1.2) for all t ∈ [0, ∞) at once a.s.
With u ∈ U treated as a parameter, we define the family of operators L^u : C²(R^d) → C(R^d) by
\[ L^u f(x) = \sum_{i,j} a^{ij}(x)\, \frac{\partial^2 f}{\partial x_i \partial x_j}(x) + \sum_i b^i(x,u)\, \frac{\partial f}{\partial x_i}(x) , \qquad u \in U . \tag{1.3} \]
We refer to L^u as the controlled extended generator of the diffusion. Of fundamental importance in the study of functionals of X is Itô's formula. For f ∈ C²(R^d) and with L^u as defined in (1.3),
\[ f(X_t) = f(X_0) + \int_0^t L^{U_s} f(X_s)\, ds + M_t , \qquad \text{a.s.}, \tag{1.4} \]
where
\[ M_t := \int_0^t \bigl\langle \nabla f(X_s), \sigma(X_s)\, dW_s \bigr\rangle \]
is a local martingale. Krylov's extension of the Itô formula [10, p. 122] extends (1.4) to functions f in the Sobolev space W^{2,p}_loc(R^d).
Recall that a control is called stationary Markov if U_t = v(X_t) for a measurable map v : R^d → U. Correspondingly, the equation
\[ X_t = x_0 + \int_0^t b\bigl(X_s, v(X_s)\bigr)\, ds + \int_0^t \sigma(X_s)\, dW_s \tag{1.5} \]
is said to have a strong solution if given a Wiener process (W_t, F_t) on a complete probability space (Ω, F, P), there exists a process X on (Ω, F, P), with X_0 = x_0 ∈ R^d, which is continuous, F_t-adapted, and satisfies (1.5) for all t at once, a.s. A strong solution is called unique if any two such solutions X and X′ agree P-a.s. when viewed as elements of C([0, ∞), R^d). It is well known that under Assumptions (A1)–(A3), for any stationary Markov control v, (1.5) has a unique strong solution [8].
Let U_SM denote the set of stationary Markov controls. Under v ∈ U_SM, the process X is strong Markov, and we denote its transition function by P_v(t, x, ·). It also follows from the work of [5, 12] that under v ∈ U_SM, the transition probabilities of X have densities which are locally Hölder continuous. Thus L^v defined by
\[ L^v f(x) = \sum_{i,j} a^{ij}(x)\, \frac{\partial^2 f}{\partial x_i \partial x_j}(x) + \sum_i b^i\bigl(x, v(x)\bigr)\, \frac{\partial f}{\partial x_i}(x) , \qquad v \in U_{SM} , \]
for f ∈ C²(R^d) is the generator of a strongly continuous semigroup on C_b(R^d), which is strong Feller. We let P^v_x denote the probability measure and E^v_x the expectation operator on the canonical space of the process under the control v ∈ U_SM, conditioned on the process X starting from x ∈ R^d at t = 0.
In Sect. 1.2 we define our notation. Sect. 1.3 reviews the ergodic control problem for near-monotone costs and the basic properties of the PIA. Sect. 1.4 is dedicated to the convergence of the algorithm.
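As a small numerical aside closing this introduction (an assumption-laden sketch, not anything used in the chapter), the action of the generators in (1.3) can be approximated in one dimension by central finite differences, with a = σ²/2. A grid-based implementation of the algorithms discussed below would rest on this kind of discretization; the function names and grid are hypothetical.

```python
import numpy as np

def generator_1d(f, grid, a, b, u):
    """Approximate (L^u f)(x) = a(x) f''(x) + b(x, u) f'(x) at the interior
    nodes of a uniform grid by central differences; a = sigma^2 / 2.
    Illustrative sketch only."""
    h = grid[1] - grid[0]                              # uniform spacing assumed
    x = grid[1:-1]
    fp = (f[2:] - f[:-2]) / (2.0 * h)                  # first derivative
    fpp = (f[2:] - 2.0 * f[1:-1] + f[:-2]) / h**2      # second derivative
    return a(x) * fpp + b(x, u) * fp
```

For instance, with grid = np.linspace(-5, 5, 201) and f = grid**2, the routine returns samples of 2a(x) + 2x·b(x, u) up to discretization error, since f'' = 2 and f' = 2x.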
1.2 Notation

The standard Euclidean norm in R^d is denoted by |·|, and ⟨·,·⟩ stands for the inner product. The set of nonnegative real numbers is denoted by R_+, N stands for the set of natural numbers, and I denotes the indicator function. We denote by τ(A) the first exit time of the process {X_t} from the set A ⊂ R^d, defined by
\[ \tau(A) := \inf\{ t > 0 : X_t \notin A \} . \]
The open ball of radius R in R^d, centered at the origin, is denoted by B_R, and we let τ_R := τ(B_R) and τ̆_R := τ(B_R^c). The term domain in R^d refers to a nonempty, connected open subset of the Euclidean space R^d. We introduce the following notation for spaces of real-valued functions on a domain D ⊂ R^d. The space L^p(D), p ∈ [1, ∞), stands for the Banach space of (equivalence classes of) measurable functions f satisfying ∫_D |f(x)|^p dx < ∞, and L^∞(D) is the Banach space of functions that are essentially bounded in D. The space C^k(D) (C^∞(D)) refers to the class of all functions whose partial derivatives up to order k (of any order) exist and are continuous, and C_c^k(D) is the space of functions in C^k(D) with compact support. The standard Sobolev space of functions on D whose generalized derivatives up to order k are in L^p(D), equipped with its natural norm, is denoted by W^{k,p}(D), k ≥ 0, p ≥ 1.
In general if X is a space of real-valued functions on D, X_loc consists of all functions f such that fφ ∈ X for every φ ∈ C_c^∞(D). In this manner we obtain the spaces L^p_loc(D) and W^{2,p}_loc(D).
Let h ∈ C(R^d) be a positive function. We denote by O(h) the set of functions f ∈ C(R^d) having the property
\[ \limsup_{|x|\to\infty} \frac{|f(x)|}{h(x)} < \infty , \tag{1.6} \]
and by o(h) the subset of O(h) over which the limit in (1.6) is zero.
We adopt the notation ∂_i := ∂/∂x_i and ∂_{ij} := ∂²/(∂x_i ∂x_j). We often use the standard summation rule that repeated subscripts and superscripts are summed from 1 through d. For example,
\[ a^{ij}\,\partial_{ij}\varphi + b^{i}\,\partial_i \varphi := \sum_{i,j=1}^{d} a^{ij}\, \frac{\partial^2 \varphi}{\partial x_i \partial x_j} + \sum_{i=1}^{d} b^{i}\, \frac{\partial \varphi}{\partial x_i} . \]
1.3 Ergodic Control and the PIA

Let c : R^d × U → R be a continuous function bounded from below. As is well known, the ergodic control problem, in its almost sure (or pathwise) formulation, seeks to a.s. minimize over all admissible U ∈ U
\[ \limsup_{t\to\infty} \frac{1}{t} \int_0^t c(X_s, U_s)\, ds . \tag{1.7} \]
A weaker, average formulation seeks to minimize
\[ \limsup_{t\to\infty} \frac{1}{t} \int_0^t E^{U}\bigl[ c(X_s, U_s) \bigr]\, ds . \tag{1.8} \]
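For intuition only (again a hypothetical sketch, not the chapter's machinery), the pathwise criterion (1.7) can be approximated by the running time average of the cost along a single simulated trajectory, for instance an Euler–Maruyama path of the kind sketched after (1.2). The cost c, the control v, and all numerical parameters below are illustrative assumptions.

```python
import numpy as np

def ergodic_cost_estimate(b, sigma, v, c, x0, T=500.0, dt=1e-2, seed=0):
    """Finite-horizon surrogate (1/T) int_0^T c(X_s, v(X_s)) ds for the
    ergodic cost (1.7) under a stationary Markov control v.  Illustrative."""
    rng = np.random.default_rng(seed)
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    total, n_steps = 0.0, int(T / dt)
    for _ in range(n_steps):
        u = v(x)
        total += float(c(x, u)) * dt
        x = x + b(x, u) * dt + sigma(x) @ rng.normal(scale=np.sqrt(dt), size=x.size)
    return total / T

# A hypothetical quadratic running cost for the earlier 1-d example:
# c = lambda x, u: float(x @ x) + float(np.sum(np.asarray(u) ** 2))
```

Such a single-trajectory average is only a heuristic stand-in for (1.7); no claim is made about its accuracy for any particular model.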
We let ρ* denote the infimum of (1.8) over all admissible controls. We assume that ρ* < ∞.
We assume that the cost function c : R^d × U → R_+ is continuous and locally Lipschitz in its first argument uniformly in u ∈ U. More specifically, for some function K_c : R_+ → R_+,
\[ \bigl| c(x,u) - c(y,u) \bigr| \le K_c(R)\, |x - y| \qquad \forall x, y \in B_R , \; \forall u \in U , \]
and all R > 0. An important class of running cost functions arising in practice for which the ergodic control problem is well behaved is the near-monotone cost functions. Let M* ∈ R_+ ∪ {∞} be defined by
\[ M^* := \liminf_{|x|\to\infty} \, \min_{u\in U} c(x,u) . \]
The running cost function c is called near-monotone if ρ* < M*. Note that inf-compact functions c are always near-monotone.
We adopt the following abbreviated notation. For a function g : R^d × U → R and v ∈ U_SSM we let g_v(x) := g(x, v(x)), x ∈ R^d.
The ergodic control problem for near-monotone cost functions is characterized as follows.

Theorem 1.3.1. There exists a unique function V ∈ C²(R^d) which is bounded below in R^d and satisfies V(0) = 0 and the Hamilton–Jacobi–Bellman (HJB) equation
\[ \min_{u\in U} \bigl[ L^u V(x) + c(x,u) \bigr] = \rho^* , \qquad x \in \mathbb R^d . \]
The control v* ∈ U_SM is optimal with respect to the criteria (1.7) and (1.8) if and only if it satisfies
\[ \min_{u\in U} \Bigl[ \sum_{i=1}^{d} b^{i}(x,u)\, \frac{\partial V}{\partial x_i}(x) + c(x,u) \Bigr] = \sum_{i=1}^{d} b^{i}_{v^*}(x)\, \frac{\partial V}{\partial x_i}(x) + c_{v^*}(x) \]
a.e. in R^d. Moreover, with τ̆_r = τ(B_r^c), r > 0, we have
\[ V(x) = \limsup_{r\downarrow 0} \; \inf_{v\in U_{SSM}} E^v_x \Bigl[ \int_0^{\breve\tau_r} \bigl( c_v(X_t) - \rho^* \bigr)\, dt \Bigr] , \qquad x \in \mathbb R^d . \]
A control v ∈ USM is called stable if the associated diffusion is positive recurrent. We denote the set of such controls by USSM . Also we let μv denote the unique invariant probability measure on Rd for the diffusion under the control v ∈ USSM .
Recall that v ∈ U_SSM if and only if there exist an inf-compact function 𝒱 ∈ C²(R^d), a bounded domain D ⊂ R^d, and a constant ε > 0 satisfying
\[ L^v \mathcal V(x) \le -\varepsilon \qquad \forall x \in D^c . \tag{1.9} \]
It follows that the optimal control v* in Theorem 1.3.1 is stable. For v ∈ U_SSM we define
\[ \rho_v := \limsup_{t\to\infty} \frac{1}{t}\, E^v\Bigl[ \int_0^t c_v(X_s)\, ds \Bigr] . \]
A difficulty in synthesizing an optimal control v ∈ U_SM via the HJB equation lies in the fact that the optimal cost ρ* is not known. The PIA provides an iterative procedure for obtaining the HJB equation at the limit. In order to describe the algorithm, we first need to review some properties of the Poisson equation
\[ L^v V(x) + c_v(x) = \rho , \qquad x \in \mathbb R^d . \tag{1.10} \]
We need the following definition.

Definition 1.3.1. For v ∈ U_SSM, and provided ρ_v < ∞, define
\[ \Psi_v(x) := \lim_{r\downarrow 0} E^v_x \Bigl[ \int_0^{\breve\tau_r} \bigl( c_v(X_t) - \rho_v \bigr)\, dt \Bigr] , \qquad x \neq 0 . \]
For v ∈ U_SM and α > 0, let J_α^v denote the α-discounted cost
\[ J_\alpha^v(x) := E^v_x \Bigl[ \int_0^{\infty} e^{-\alpha t}\, c_v(X_t)\, dt \Bigr] , \qquad x \in \mathbb R^d . \]
We borrow the following result from [1, Lemma 7.4]. If v ∈ U_SSM and ρ_v < ∞, then there exist a function V ∈ W^{2,p}_loc(R^d), for any p > 1, and a constant ρ ∈ R which satisfy (1.10) a.e. in R^d and such that, as α ↓ 0, α J_α^v(0) → ρ and J_α^v − J_α^v(0) → V uniformly on compact subsets of R^d. Moreover,
\[ \rho = \rho_v \qquad \text{and} \qquad V(x) = \Psi_v(x) . \]
We refer to the function V(x) = Ψ_v(x) ∈ W^{2,p}_loc(R^d) as the canonical solution of the Poisson equation L^v V + c_v = ρ_v in R^d.
It can be shown that the canonical solution V to the Poisson equation is the unique solution in W^{2,p}_loc(R^d) which is bounded below and satisfies V(0) = 0. Note also that (1.9) implies that any control v satisfying ρ_v < M* is stable.
The PIA takes the following familiar form:
Algorithm (PIA).
1. Initialization. Set k = 0 and select any v_0 ∈ U_SM such that ρ_{v_0} < M*.
2. Value determination. Obtain the canonical solution V_k = Ψ_{v_k} ∈ W^{2,p}_loc(R^d), p > 1, to the Poisson equation
\[ L^{v_k} V_k + c_{v_k} = \rho_{v_k} \qquad \text{in } \mathbb R^d . \]
3. If v_k(x) ∈ Arg min_{u∈U} [ b^i(x,u) ∂_i V_k(x) + c(x,u) ] x-a.e., return v_k.
4. Policy improvement. Select an arbitrary v_{k+1} ∈ U_SM which satisfies
\[ v_{k+1}(x) \in \operatorname*{Arg\,min}_{u\in U} \Bigl[ \sum_{i=1}^{d} b^{i}(x,u)\, \frac{\partial V_k}{\partial x_i}(x) + c(x,u) \Bigr] , \qquad x \in \mathbb R^d . \]
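As an illustrative numerical aside (not the chapter's algorithm, which works directly with the diffusion), the same iteration can be run on a finite-state Markov-chain approximation: value determination solves a linear system for the average cost and relative value of the current policy, and policy improvement minimizes pointwise over the control set. The transition matrices, costs, and control set below are hypothetical inputs, and every policy is assumed to be unichain.

```python
import numpy as np

def policy_iteration_average(P, c, max_iter=50):
    """Average-cost policy iteration on a finite MDP (a discrete stand-in for the PIA).
    P[u]: (n, n) transition matrix under control u;  c[u]: (n,) running cost.
    Value determination solves the discrete Poisson equation
        rho + h(i) = c_v(i) + sum_j P_v(i, j) h(j),   h(0) = 0,
    the analogue of L^{v_k} V_k + c_{v_k} = rho_{v_k} with V_k(0) = 0."""
    controls = list(P.keys())
    n = P[controls[0]].shape[0]
    policy = [controls[0]] * n
    rho, h = np.inf, np.zeros(n)
    for _ in range(max_iter):
        Pv = np.array([P[policy[i]][i] for i in range(n)])
        cv = np.array([c[policy[i]][i] for i in range(n)])
        A = np.zeros((n + 1, n + 1))
        A[:n, :n] = np.eye(n) - Pv      # (I - P_v) h
        A[:n, n] = 1.0                  # + rho * 1
        A[n, 0] = 1.0                   # normalization h(0) = 0
        sol, *_ = np.linalg.lstsq(A, np.append(cv, 0.0), rcond=None)
        h, rho = sol[:n], sol[n]
        # Policy improvement: pointwise minimization, as in step 4 of the PIA.
        new_policy = [min(controls, key=lambda u: c[u][i] + P[u][i] @ h)
                      for i in range(n)]
        if new_policy == policy:        # step 3: current policy already minimizes
            break
        policy = new_policy
    return policy, rho, h
```

In the continuous-state setting of this chapter, P[u] would come from a discretization of the generator (1.3) (for example the finite-difference stencil sketched at the end of Sect. 1.1), and the stability requirement ρ_{v_0} < M* has no automatic discrete counterpart; the sketch makes no claim in that direction.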
Since ρ_{v_0} < M* it follows that v_0 ∈ U_SSM. The algorithm is well defined, provided v_k ∈ U_SSM for all k ∈ N. This follows from the next lemma, which shows that ρ_{v_{k+1}} ≤ ρ_{v_k}, and in particular that ρ_{v_k} < M*, for all k ∈ N.

Lemma 1.3.1. Suppose v ∈ U_SSM satisfies ρ_v < M*. Let V ∈ W^{2,p}_loc(R^d), p > 1, be the canonical solution to the Poisson equation
\[ L^v V + c_v = \rho_v \qquad \text{in } \mathbb R^d . \]
Then any measurable selector v̂ from the minimizer
\[ \operatorname*{Arg\,min}_{u\in U} \bigl[ b^{i}(x,u)\, \partial_i V(x) + c(x,u) \bigr] \]
satisfies ρ_{v̂} ≤ ρ_v. Moreover, the inequality is strict unless v satisfies
\[ L^v V(x) + c_v(x) = \min_{u\in U} \bigl[ L^u V(x) + c(x,u) \bigr] = \rho_v \qquad \text{for almost all } x . \tag{1.11} \]
Proof. Let $\mathcal{V}$ be a Lyapunov function satisfying $L^{v}\mathcal{V}(x) \le k_0 - g(x)$, for some inf-compact g such that c_v ∈ o(g) (see [1, Lemma 7.1]). For n ∈ N, define
$$\hat v_n(x) = \begin{cases} \hat v(x) & \text{if } x \in B_n,\\ v(x) & \text{if } x \in B_n^{c}. \end{cases}$$
Clearly, v̂n → v̂ as n → ∞ in the topology of Markov controls (see [1, Section 3.3]). It is evident that $\mathcal{V}$ is a stochastic Lyapunov function relative to v̂n, i.e., there exist constants kn such that $L^{\hat v_n}\mathcal{V}(x) \le k_n - g(x)$, for all n ∈ N. Since V ∈ o($\mathcal{V}$), it follows that (see [1, Lemma 7.1])
$$\frac{1}{t}\, E^{\hat v_n}_x\bigl[V(X_t)\bigr] \xrightarrow[t\to\infty]{} 0. \tag{1.12}$$
Let
$$h(x) := \rho_v - \min_{u\in U}\,\bigl[L^{u}V(x) + c(x,u)\bigr], \qquad x \in \mathbb{R}^d.$$
Also, by definition of v̂n, for all m ≤ n, we have
$$L^{\hat v_n} V(x) + c_{\hat v_n}(x) \le \rho_v - h(x)\, I_{B_m}(x). \tag{1.13}$$
By Itô's formula we obtain from (1.13) that
$$\frac{1}{t}\,E^{\hat v_n}_x\bigl[V(X_t)\bigr] - \frac{1}{t}V(x) + \frac{1}{t}\,E^{\hat v_n}_x\!\left[\int_0^t c_{\hat v_n}(X_s)\,ds\right] \le \rho_v - \frac{1}{t}\,E^{\hat v_n}_x\!\left[\int_0^t h(X_s)\,I_{B_m}(X_s)\,ds\right], \tag{1.14}$$
for all m ≤ n. Taking limits in (1.14) as t → ∞ and using (1.12), we obtain
$$\rho_{\hat v_n} \le \rho_v - \int_{\mathbb{R}^d} h(x)\, I_{B_m}(x)\, \mu_{\hat v_n}(dx). \tag{1.15}$$
Note that v → ρv is lower semicontinuous. Therefore, taking limits in (1.15) as n → ∞, we have
$$\rho_{\hat v} \le \rho_v - \limsup_{n\to\infty} \int_{\mathbb{R}^d} h(x)\, I_{B_m}(x)\, \mu_{\hat v_n}(dx). \tag{1.16}$$
Since c is near-monotone and ρ_{v̂_n} ≤ ρv < M∗, there exist R̂ > 0 and δ > 0 such that μ_{v̂_n}(B_{R̂}) ≥ δ for all n ∈ N. Then, with ψ_{v̂_n} denoting the density of μ_{v̂_n}, Harnack's inequality [7, Theorem 8.20, p. 199] implies that there exists a constant C_H = C_H(R) such that for every R > R̂, with |B_R| denoting the volume of B_R ⊂ Rd, it holds that
$$\inf_{B_R} \psi_{\hat v_n} \ge \frac{\delta}{C_H\,|B_R|} \qquad \forall n \in \mathbb{N}.$$
By (1.16) this implies that ρvˆ < ρv unless h = 0 a.e.
□
1.4 Convergence of the PIA

We start with the following lemma.

Lemma 1.4.2. The sequence {Vk} of the PIA has the following properties:
(i) For some constant C0 = C0(ρv0) we have inf_{Rd} Vk > C0 for all k ≥ 0.
(ii) Each Vk attains its minimum on the compact set
$$\mathcal{K}(\rho_{v_0}) := \bigl\{x \in \mathbb{R}^d : \min_{u\in U} c(x,u) \le \rho_{v_0}\bigr\}.$$
(iii) For any p > 1, there exists a constant C̃0 = C̃0(R, ρv0, p) such that
$$\|V_k\|_{W^{2,p}(B_R)} \le \tilde C_0 \qquad \forall R > 0.$$
(iv) There exist positive numbers αk and βk, k ≥ 0, such that αk ↓ 1 and βk ↓ 0 as k → ∞ and
$$\alpha_{k+1}V_{k+1}(x) + \beta_{k+1} \le \alpha_k V_k(x) + \beta_k \qquad \forall k \ge 0.$$
Proof. Parts (i) and (ii) follow directly from [3, Lemmas 3.6.1 and 3.6.4]. For part (iii) note first that the near-monotone assumption implies that
$$\mu_{v_k}\Bigl(\mathcal{K}\Bigl(\tfrac{M^{*}+\rho_{v_k}}{2}\Bigr)\Bigr) \ge \frac{M^{*} - \rho_{v_k}}{M^{*} + \rho_{v_k}} \qquad \forall k \ge 0,$$
and therefore
$$\mu_{v_k}\Bigl(\mathcal{K}\Bigl(\tfrac{M^{*}+\rho_{v_0}}{2}\Bigr)\Bigr) \ge \frac{M^{*} - \rho_{v_0}}{M^{*} + \rho_{v_0}} \qquad \forall k \ge 0.$$
Consequently, the functions $J^{v_k}_{\alpha} - J^{v_k}_{\alpha}(0)$ are bounded, uniformly in k and α,
uniformly on compact subsets of Rd. Hence, since $J^{v_k}_{\alpha} - J^{v_k}_{\alpha}(0) \to V_k$ weakly in $W^{2,p}(B_R)$ for any R > 0, (iii) follows from [3, Theorem 3.7.4]. Part (iv) follows as in [11, Theorem 4.4].¹ □

As the corollary below shows, the PIA always converges.

Corollary 1.4.1. There exist a constant ρ̂ and a function V̂ ∈ C²(Rd) with V̂(0) = 0, such that, as k → ∞, ρvk ↓ ρ̂ and Vk → V̂ weakly in W^{2,p}(B_R), p > 1, for any R > 0. Moreover, (V̂, ρ̂) satisfies the HJB equation
$$\min_{u\in U}\,\bigl[L^{u}\hat V(x) + c(x,u)\bigr] = \hat\rho, \qquad x \in \mathbb{R}^d. \tag{1.17}$$
Proof. By Lemma 1.3.1, ρvk is decreasing monotonically in k and hence converges to some ρ̂ ≥ ρ∗. By Lemma 1.4.2 (iii), the sequence Vk is weakly compact in W^{2,p}(B_R), p > 1, for any R > 0, while by Lemma 1.4.2 (iv), any weakly convergent subsequence has the same limit V̂. Also, repeating the argument in the proof of Lemma 1.3.1, with
$$h_k(x) := \rho_{v_{k-1}} - \min_{u\in U}\,\bigl[L^{u}V_{k-1}(x) + c(x,u)\bigr], \qquad x \in \mathbb{R}^d,$$
we deduce that for any R > 0 there exists some constant K(R) such that
$$\int_{B_R} h_k(x)\,dx \le K(R)\,\bigl(\rho_{v_{k-1}} - \rho_{v_k}\bigr) \qquad \forall k \in \mathbb{N}.$$
1 Theorem 4.4 in [11] applies to Markov chains on Borel state spaces. Also the model in [11] involves only inf-compact running costs. Nevertheless, the essential arguments can be followed to adapt the proof to controlled diffusions. We skip the details.
Therefore, hk → 0 weakly in L¹(D) as k → ∞ for any bounded domain D. Taking limits in the equation
$$\min_{u\in U}\,\bigl[L^{u}V_{k-1}(x) + c(x,u)\bigr] = \rho_{v_{k-1}} - h_k(x)$$
and using [3, Lemma 3.5.4] yields (1.17). □
It is evident that v ∈ USM is an equilibrium of the PIA if it satisfies ρv < M∗ and
$$\min_{u\in U}\,\bigl[L^{u}\Psi_v(x) + c(x,u)\bigr] = \rho_v, \qquad x \in \mathbb{R}^d. \tag{1.18}$$
For one-dimensional diffusions, one can show that (1.18) has a unique solution, and hence this is the optimal solution, with ρv = ρ∗. For higher dimensions, to the best of our knowledge, there is no such result. There is also the possibility that the PIA converges to v̂ ∈ USSM which is not an equilibrium. This happens if (1.17) satisfies
$$L^{\hat v}\hat V(x) + c_{\hat v}(x) = \min_{u\in U}\,\bigl[L^{u}\hat V(x) + c(x,u)\bigr] = \hat\rho > \rho_{\hat v}, \qquad x \in \mathbb{R}^d. \tag{1.19}$$
This is in fact the case with the example in [4]. In this example the controlled diffusion takes the form dXt = Ut dt + dWt, with U = [−1, 1] and running cost c(x) = 1 − e^{−|x|}. If we define
$$\xi_\rho := \log\tfrac{3}{2} + \log(1-\rho), \qquad \rho \in [1/3, 1),$$
and
$$V_\rho(x) := 2\int_{-\infty}^{x} e^{2|y-\xi_\rho|}\left(\int_{-\infty}^{y} e^{-2|z-\xi_\rho|}\,\bigl(\rho - c(z)\bigr)\,dz\right)dy, \qquad x \in \mathbb{R},$$
then direct computation shows that
$$\tfrac{1}{2}V_\rho''(x) - |V_\rho'(x)| + c(x) = \rho \qquad \forall \rho \in [1/3, 1),$$
and so the pair (Vρ, ρ) satisfies the HJB. The stationary Markov control corresponding to this solution of the HJB is wρ(x) = −sign(x − ξρ). The controlled process under wρ has invariant probability density ϕρ(x) = e^{−2|x−ξρ|}. A simple computation shows that
$$\int_{-\infty}^{\infty} c(x)\,\varphi_\rho(x)\,dx = \rho - \tfrac{9}{8}(1-\rho)(3\rho-1) < \rho \qquad \forall \rho \in (1/3, 1).$$
Thus, if ρ > 1/3, then Vρ is not a canonical solution of the Poisson equation corresponding to the stable control wρ . Therefore, this example satisfies (1.19) and shows that in general we cannot preclude the possibility that the limiting value of the PIA is not an equilibrium of the algorithm.
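As a small numerical aside (not part of the chapter), the strict inequality above is easy to check for a few values of ρ: the sketch below integrates c·ϕρ on a truncated grid and compares the result with ρ. The grid width and step are arbitrary choices, and only the inequality is being checked, not the closed-form expression.

```python
import numpy as np

# Quick illustrative check: under w_rho the stationary average cost stays below rho.
def avg_cost_under_w_rho(rho, half_width=40.0, n=400001):
    xi = np.log(1.5) + np.log(1.0 - rho)           # xi_rho as defined in the text
    x = np.linspace(-half_width, half_width, n)
    dx = x[1] - x[0]
    c = 1.0 - np.exp(-np.abs(x))                   # running cost c(x)
    phi = np.exp(-2.0 * np.abs(x - xi))            # invariant density under w_rho
    return np.sum(c * phi) * dx                    # Riemann-sum approximation

for rho in (0.4, 0.5, 0.7, 0.9):
    val = avg_cost_under_w_rho(rho)
    print(f"rho = {rho:.2f}:  integral = {val:.4f}   strictly below rho: {val < rho}")
```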
In [11, Theorem 5.2], a blanket Lyapunov condition is imposed to guarantee convergence of the PIA to an optimal control. Instead, we use Lyapunov analysis to characterize the domain of attraction of the optimal value. We need the following definition.

Definition 1.4.2. Let v∗ be an optimal control as characterized in Theorem 1.3.1. Let $\mathcal{V}$ denote the class of all nonnegative functions V ∈ C²(Rd) satisfying $L^{v^{*}}V \le k_0 - h(x)$ for some nonnegative, inf-compact h ∈ C(Rd) and a constant k0. We denote by o($\mathcal{V}$) the class of inf-compact functions g satisfying g ∈ o(V) for some V ∈ $\mathcal{V}$.

The theorem below asserts that if the PIA is initialized at a v0 ∈ USSM whose associated canonical solution to the Poisson equation lies in o($\mathcal{V}$), then it converges to an optimal v∗ ∈ USSM.

Theorem 1.4.2. If v0 ∈ USSM satisfies Ψv0 ∈ o($\mathcal{V}$), then ρvk → ρ∗ as k → ∞.

Proof. The proof is straightforward. By Lemma 1.4.2 (iv), V̂ ∈ o($\mathcal{V}$). Also, by (1.17), we have
$$L^{v^{*}}\hat V(x) + c_{v^{*}}(x) \ge \hat\rho, \qquad x \in \mathbb{R}^d,$$
and applying Dynkin's formula, we obtain
$$\frac{1}{t}\,E^{v^{*}}_x\bigl[\hat V(X_t)\bigr] - \frac{1}{t}\hat V(x) + \frac{1}{t}\,E^{v^{*}}_x\!\left[\int_0^t c_{v^{*}}(X_s)\,ds\right] \ge \hat\rho. \tag{1.20}$$
Since V̂ ∈ o($\mathcal{V}$), by [1, Lemma 7.1] we have
$$\frac{1}{t}\,E^{v^{*}}_x\bigl[\hat V(X_t)\bigr] \xrightarrow[t\to\infty]{} 0,$$
and thus, taking limits as t → ∞ in (1.20), we obtain ρ∗ ≥ ρ̂. Therefore, we must have ρ̂ = ρ∗. □
1.5 Concluding Remarks

We have concentrated on the model of controlled diffusions with near-monotone running costs. The case of stable controls with a blanket Lyapunov condition is much simpler. If, for example, we impose the assumption that there exist a constant k0 > 0 and a pair of nonnegative, inf-compact functions $(\mathcal{V}, h) \in C^{2}(\mathbb{R}^d) \times C(\mathbb{R}^d)$ satisfying 1 + c ∈ o(h) and such that
$$L^{u}\,\mathcal{V}(x) \le k_0 - h(x,u) \qquad \forall (x,u) \in \mathbb{R}^d \times U,$$
then the PIA always converges to the optimal solution.
Acknowledgement This work was supported in part by the Office of Naval Research through the Electric Ship Research and Development Consortium.
References 1. Arapostathis, A., Borkar, V.S.: Uniform recurrence properties of controlled diffusions and applications to optimal control. SIAM J. Control Optim. 48(7), 152–160 (2010) 2. Arapostathis, A., Borkar, V.S., Fern´andez-Gaucherand, E., Ghosh, M.K., Marcus, S.I.: Discrete-time controlled Markov processes with average cost criterion: a survey. SIAM J. Control Optim. 31(2), 282–344 (1993) 3. Arapostathis, A., Borkar, V.S., Ghosh, M.K.: Ergodic control of diffusion processes, Encyclopedia of Mathematics and its Applications, vol. 143. Cambridge University Press, Cambridge (2011) 4. Bensoussan, A., Borkar, V.: Ergodic control problem for one-dimensional diffusions with nearmonotone cost. Systems Control Lett. 5(2), 127–133 (1984) 5. Bogachev, V.I., Krylov, N.V., R¨ockner, M.: On regularity of transition probabilities and invariant measures of singular diffusions under minimal conditions. Comm. Partial Differential Equations 26(11–12), 2037–2080 (2001) 6. Costa, O.L.V., Dufour, F.: The policy iteration algorithm for average continuous control of piecewise deterministic Markov processes. Appl. Math. Optim. 62(2), 185–204 (2010) 7. Gilbarg, D., Trudinger, N.S.: Elliptic partial differential equations of second order, Grundlehren der Mathematischen Wissenschaften, vol. 224, second edn. Springer-Verlag, Berlin (1983) 8. Gy¨ongy, I., Krylov, N.: Existence of strong solutions for Itˆo’s stochastic equations via approximations. Probab. Theory Related Fields 105(2), 143–158 (1996) 9. Hern´andez-Lerma, O., Lasserre, J.B.: Policy iteration for average cost Markov control processes on Borel spaces. Acta Appl. Math. 47(2), 125–154 (1997) 10. Krylov, N.V.: Controlled diffusion processes, Applications of Mathematics, vol. 14. SpringerVerlag, New York (1980) 11. Meyn, S.P.: The policy iteration algorithm for average reward Markov decision processes with general state space. IEEE Trans. Automat. Control 42(12), 1663–1680 (1997) 12. Stannat, W.: (Nonsymmetric) Dirichlet operators on L1 : existence, uniqueness and associated Markov processes. Ann. Scuola Norm. Sup. Pisa Cl. Sci. (4) 28(1), 99–140 (1999)
Chapter 2
Discrete-Time Inventory Problems with Lead-Time and Order-Time Constraint Lankere Benkherouf and Alain Bensoussan
2.1 Introduction

Demand uncertainty of the various items in the supply chain plays an important role in the control of material flow. Moreover, information on the supplier's mode of deliveries is key in selecting appropriate policies for inventory management. In some situations, orders for products cannot be placed while waiting for the delivery of previous orders, which places an order-time constraint on deliveries. This applies where production capacity on the supplier side is limited. This situation may also arise in certain organizations where transportation capacity is limited: see Bensoussan, Çakanyidirim, and Moussaoui [4]. This chapter considers the basic inventory model of Scarf [16] and modifies it by including lead time with order-time constraints of the type mentioned above. It is well known that an (s, S) policy is optimal for the infinite-horizon stationary version of Scarf's model: see Iglehart [10], Veinott [18], Beyer and Sethi [6], and Benkherouf [3]. Also, (s, S) policies remain optimal in the presence of lead time: see [18]. This chapter shows that (s, S) policies are still optimal with the additional constraints on order time. This result, which might seem to follow straightforwardly from existing results in the literature, turned out to be delicate and to need some care.
L. Benkherouf
Department of Statistics and Operations Research, Kuwait University, Safat 13060, Kuwait
A. Bensoussan (✉)
International Center for Decision and Risk Analysis, School of Management, University of Texas at Dallas, Richardson, TX 75080, USA
The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong
Ajou University, San S. Woncheon-Dong, Yeontong-Gu, Suwon 443–749, Korea
A model related to this chapter's is discussed in [4]: a discrete-state, continuous-time inventory model with infinite horizon, where the demand process is assumed to be generated by some Poisson process. The main ingredients of the problem under consideration in this chapter are given below. The demand process is assumed to be composed of a sequence of i.i.d. random variables, D1, . . . , Dn, . . . , constructed on a probability space (Ω, A, P), where Dn represents the demand at time n. Let Fn = σ(D1, . . . , Dn) be the σ-algebra generated by the demand process and F0 = {∅, Ω}. An ordering policy V is composed of a sequence of ordering times τ1, . . . , τj, . . . , with τj being F_{τ_{j−1}}-measurable. To each ordering time τj, one associates a quantity v_{τ_j} which represents the amount ordered. The new element here is that the usual condition τj ≤ τj+1 is replaced with the constraint
$$\tau_{j+1} \ge \tau_j + L, \tag{2.1}$$
in which L is the lead time, L ≥ 1 and integer. The inventory evolves as follows:
$$y_{n+1} = y_n + v_n - D_n,$$
with y1 = x, vn = 0 whenever n is not one of the ordering times τj, and yn the inventory level at time n. To define the objective function, we introduce the cost in a single period. It is composed of a cost on the inventory
$$l(x) = h x^{+} + p x^{-}, \tag{2.2}$$
and an order cost
$$C(v) = K\,\mathbf{1}_{v>0} + cv, \tag{2.3}$$
where h, p, K, c are strictly positive known constants, x+ = max{0, x}, x− = −min{0, x}, and
$$\mathbf{1}_{v>0} = \begin{cases} 1 & \text{if } v > 0,\\ 0 & \text{otherwise.} \end{cases}$$
Costs are assumed to be additive and discounted geometrically at a known rate α, 0 < α < 1, and unmet demand is assumed to be completely backlogged. The objective function is given by the formula
$$J_x(V) = \sum_{n=1}^{\infty} \alpha^{n-1}\, E\bigl[l(y_n) + C(v_n)\bigr], \tag{2.4}$$
and the value function is
$$u(x) = \inf_{V}\, J_x(V), \tag{2.5}$$
where E is the usual expectation operator. Inventory models with deterministic lead time can be found in [18], Archibald [1], Sahin [13, 14], and Federgruen and Schechner [7]. These papers, except [18], are concerned with the computation of system performance measures for (s, S) policies. The closest to the present work is [1], where at most one outstanding order is permitted; this is equivalent to constraint (2.1). However, no proof of the optimality of (s, S) policies was provided there. In fact, the notion of policies with at most one outstanding order is not new. It can be found in Hadley and Whitin [9], where it was discussed in the context of lost-sales models: see also Kim and Park [11]. This chapter, unlike earlier work dealing with order-time constraints, with the exception of [4], provides a proof of the optimality of (s, S) policies. In the next section, we shall derive the Bellman equation associated with u(x). Section 2.3 is concerned with ground preparation for the derivation of (s, S) policies. Section 2.4 deals with existence and derivation of the pair (s, S). The optimality of (s, S) policies is shown, under some technical conditions, in Sect. 2.5. Section 2.6 treats the special case where the demand in each period is exponentially distributed. Further, this section contains some general remarks and a conclusion.
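Before turning to the dynamic programming analysis, a small simulation sketch may help fix the model's ingredients. The code below is not from the chapter; all numerical values, the demand law, and the particular ordering rule are hypothetical. It estimates by Monte Carlo the discounted cost Jx(V) of an (s, S)-type rule that respects the order-time constraint (2.1): an order is placed only if at least L periods have elapsed since the previous order, and it is received L periods after being placed.

```python
import numpy as np

# Illustrative Monte Carlo evaluation of an (s, S)-type rule under constraint (2.1).
rng = np.random.default_rng(0)
h, p, K, c = 1.0, 4.0, 10.0, 2.0        # hypothetical holding, backlog, fixed, unit costs
alpha, L = 0.95, 3                       # discount factor and lead time
s, S = 2.0, 12.0                         # hypothetical reorder point / order-up-to level

def discounted_cost(x0, horizon=400):
    y, total = x0, 0.0
    pipeline = {}                        # arrival period -> quantity on order
    last_order = -L                      # allows an order at the first period
    for n in range(1, horizon + 1):
        y += pipeline.pop(n, 0.0)        # receive any order due this period
        v = 0.0
        position = y + sum(pipeline.values())
        if n - last_order >= L and position < s:
            v = S - position
            pipeline[n + L] = v
            last_order = n               # no new order for the next L periods
        inv_cost = h * max(y, 0.0) + p * max(-y, 0.0)      # l(y_n)
        order_cost = (K + c * v) if v > 0 else 0.0          # C(v_n)
        total += alpha ** (n - 1) * (inv_cost + order_cost)
        y -= rng.exponential(scale=2.0)                     # demand D_n
    return total

costs = [discounted_cost(x0=5.0) for _ in range(2000)]
print("estimated J_x(V) for x = 5:", np.mean(costs))
```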
2.2 Dynamic Programming

It is easy to figure out the Bellman equation by considering the possibilities at the initial time. If there is no order at time 0, then (by (2.5) and (2.4)) the best which can be achieved over an infinite planning horizon is l(x) + αEu(x − D). On the other hand, if an order of size v is made at time 0, which is possible since there is no pending order, then the inventory evolves as follows:
$$y_1 = x,\quad y_2 = x - D_1,\ \ldots,\ y_{L+1} = x - (D_1 + \cdots + D_L) + v,$$
and the best which can be achieved is
$$K + cv + l(x) + \sum_{j=2}^{L} \alpha^{j-1} E\,l\bigl(x - (D_1 + \cdots + D_{j-1})\bigr) + \alpha^{L}\, E\,u\bigl(x + v - (D_1 + \cdots + D_L)\bigr).$$
From these considerations, we can easily derive the Bellman equation
$$u(x) = \min\Bigl[\, l(x) + \alpha E\,u(x - D),\; K - cx + \sum_{j=0}^{L-1} \alpha^{j} E\,l\bigl(x - D^{(j)}\bigr) + \inf_{\eta > x}\bigl[c\eta + \alpha^{L} E\,u\bigl(\eta - D^{(L)}\bigr)\bigr]\Bigr], \tag{2.6}$$
in which we use the notation
$$D^{(j)} = \begin{cases} D_1 + \cdots + D_j & \text{if } j \ge 1,\\ 0 & \text{if } j = 0. \end{cases}$$
In the next section, we shall recast the function u given in (2.6) in a form which we shall find useful later on in our analysis.
2.3 (s, S) Policy

We first transform equation (2.6) by introducing a constant s ∈ R and changing u(x) into Hs(x), using the formula
$$u(x) = -cx + \sum_{j=0}^{L-1} \alpha^{j} E\,l\bigl(x - D^{(j)}\bigr) + C_s + H_s(x), \tag{2.7}$$
where Cs will be defined below. In fact, let
$$g(x) = (1-\alpha)cx + \alpha^{L} E\,l\bigl(x - D^{(L)}\bigr);$$
then we take
$$C_s = \frac{g(s) + \alpha c\bar D}{1-\alpha},$$
where D̄ is the expected value of D, with 0 < D̄ < ∞. We note the formula
$$\sum_{j=0}^{L-1} \alpha^{j} E\,g\bigl(x - D^{(j)}\bigr) = cx\,(1-\alpha^{L}) - \alpha c\bar D\,\frac{1-\alpha^{L}}{1-\alpha} + \alpha^{L} c\bar D L + \sum_{j=0}^{L-1} \alpha^{j+L} E\,l\bigl(x - D^{(j+L)}\bigr). \tag{2.8}$$
We then check that Hs(x) is the solution of
$$H_s(x) = \min\Bigl[\, g(x) - g(s) + \alpha E\,H_s(x - D),\; K + \inf_{\eta > x}\Bigl[\sum_{j=0}^{L-1} \alpha^{j} E\bigl(g(\eta - D^{(j)}) - g(s)\bigr) + \alpha^{L} E\,H_s\bigl(\eta - D^{(L)}\bigr)\Bigr]\Bigr]. \tag{2.9}$$
On the other hand, for any s we define
$$H_s(x) = \begin{cases} g(x) - g(s) + \alpha E\,H_s(x - D), & x \ge s,\\ 0, & x < s. \end{cases} \tag{2.10}$$
A solution Hs satisfying (2.10) exists, is unique, and is continuous on R. We have used the same notation Hs(x) in (2.9) and (2.10) because we want to find s so that the solution of (2.10) is also a solution of (2.9). Let us set gs(x) = (g(x) − g(s))·1_{x≥s}; then the solution of (2.10) satisfies, for all x,
$$H_s(x) = g_s(x) + \alpha E\,H_s(x - D). \tag{2.11}$$
From this relation we deduce, after iterating, that
$$H_s(x) = \sum_{j=0}^{L-1} \alpha^{j} E\,g_s\bigl(x - D^{(j)}\bigr) + \alpha^{L} E\,H_s\bigl(x - D^{(L)}\bigr),$$
therefore also
$$\sum_{j=0}^{L-1} \alpha^{j} E\bigl(g(x - D^{(j)}) - g(s)\bigr) + \alpha^{L} E\,H_s\bigl(x - D^{(L)}\bigr) = H_s(x) + \sum_{j=0}^{L-1} \alpha^{j} E\bigl(g(x - D^{(j)}) - g(s)\bigr)\,\mathbf{1}_{x - D^{(j)} < s}.$$
Define
$$\Psi_s(x) = H_s(x) + \sum_{j=0}^{L-1} \alpha^{j} E\bigl(g(x - D^{(j)}) - g(s)\bigr)\,\mathbf{1}_{x - D^{(j)} < s}; \tag{2.12}$$
then, if we want Hs(x) to satisfy (2.9), we must have
$$H_s(x) = \min\Bigl[\, g(x) - g(s) + \alpha E\,H_s(x - D),\; K + \inf_{\eta > x} \Psi_s(\eta)\Bigr]. \tag{2.13}$$
This relation motivates the choice of s. We want to find s so that
$$K + \inf_{\eta > s} \Psi_s(\eta) = 0. \tag{2.14}$$
If such an s exists, we define S = S(s) as the point where the infimum is attained:
$$\inf_{\eta > s} \Psi_s(\eta) = \Psi_s(S(s)). \tag{2.15}$$
Note, from (2.13), that Ψs (x) coincides with Hs (x), for x > s, when L = 1. In the next section we shall show the existence of the pair (s, S) satisfying (2.14) and (2.15).
2.4 Existence and Characterization of the Pair (s, S)

Let F^{(L)}(x) and f^{(L)}(x) be the cumulative distribution and the density function of D^{(L)}, respectively, and F̄^{(L)}(x) = 1 − F^{(L)}(x), with F^{(0)}(x) = 1. Then (2.8) and (2.2) give
$$g(x) = (1-\alpha)cx + \alpha^{L} h \int_0^x (x-\xi)\,f^{(L)}(\xi)\,d\xi + \alpha^{L} p \int_x^{\infty} (\xi-x)\,f^{(L)}(\xi)\,d\xi.$$
It follows that μ(x) = g′(x) is given by
$$\mu(x) = (1-\alpha)c + \alpha^{L} h - \alpha^{L}(h+p)\,\bar F^{(L)}(x). \tag{2.16}$$
We assume that
$$(1-\alpha)c - \alpha^{L} p < 0. \tag{2.17}$$
There exists a unique s̄ > 0 such that
$$\mu(x) \le 0 \ \text{ if } x \le \bar s, \qquad \mu(x) \ge 0 \ \text{ if } x \ge \bar s. \tag{2.18}$$
Note that if L = 0, (2.17) reduces to the classical condition of optimality of (s, S) policies with no lead time.
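As an illustrative aside (not part of the chapter), s̄ is easy to compute numerically from (2.16)–(2.18): μ(s̄) = 0 is equivalent to F̄^{(L)}(s̄) = ((1 − α)c + α^L h)/(α^L(h + p)). If each period's demand is exponential with rate β (the special case treated in Sect. 2.6), then D^{(L)} is Erlang(L, β) and the survival function is available in closed form. All parameter values below are hypothetical.

```python
import math

# Illustrative computation of s_bar when one-period demand is Exponential(beta),
# so that D^(L) is Erlang(L, beta).  Parameter values are hypothetical.
alpha, L, c, h, p, beta = 0.95, 3, 2.0, 1.0, 4.0, 0.5

def erlang_survival(x, k, rate):
    """P(D^(k) > x) for an Erlang(k, rate) random variable, x >= 0."""
    return math.exp(-rate * x) * sum((rate * x) ** i / math.factorial(i) for i in range(k))

assert (1.0 - alpha) * c - alpha ** L * p < 0, "condition (2.17) must hold"
target = ((1.0 - alpha) * c + alpha ** L * h) / (alpha ** L * (h + p))

lo, hi = 0.0, 1.0
while erlang_survival(hi, L, beta) > target:      # bracket the root of mu(x) = 0
    hi *= 2.0
for _ in range(80):                                # bisection
    mid = 0.5 * (lo + hi)
    if erlang_survival(mid, L, beta) > target:
        lo = mid
    else:
        hi = mid
print("s_bar ≈", 0.5 * (lo + hi))
```

Condition (2.17) guarantees that the target value lies in (0, 1), so the bisection always finds the unique s̄ > 0 of (2.18).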
Lemma 2.4.1. The solution Hs of (2.10) satisfies, for s ≤ x,
$$H_s(x) + \sum_{j=0}^{L-1} \alpha^{j} \int_s^x \bar F^{(j)}(x-\xi)\,\mu(\xi)\,d\xi = \frac{1-\alpha^{L}}{1-\alpha}\,\bigl(g(x) - g(s)\bigr) + \sum_{j=L}^{\infty} \alpha^{j} \int_s^x F^{(j)}(x-\xi)\,\mu(\xi)\,d\xi.$$
Furthermore, the equation is still valid for s > x by dropping the integral term on the right-hand side.

Proof. The proof follows immediately by recalling the definition of μ in (2.16) and checking that Hs given by
$$H_s(x) = \sum_{j=0}^{\infty} \alpha^{j} \int_s^x F^{(j)}(x-\xi)\,\mu(\xi)\,d\xi \tag{2.19}$$
solves (2.10). □

Lemma 2.4.2. The function Ψs defined in (2.12) is given by
$$\Psi_s(x) = \frac{1-\alpha^{L}}{1-\alpha}\,\bigl(g(x) - g(s)\bigr) + \sum_{j=L}^{\infty} \alpha^{j} \int_s^x F^{(j)}(x-\xi)\,\mu(\xi)\,d\xi - \bigl((1-\alpha)c + \alpha^{L}h\bigr)\bar D \sum_{j=1}^{L-1} j\alpha^{j} + \alpha^{L}(h+p) \sum_{j=1}^{L-1} \alpha^{j} \int_0^{+\infty} \bar F^{(j)}(\zeta)\,\bar F^{(L)}(x-\zeta)\,d\zeta.$$
Proof. Note that
$$\begin{aligned}
E\bigl(g(x - D^{(j)}) - g(s)\bigr)\mathbf{1}_{x - D^{(j)} < s}
&= -E\Bigl[\mathbf{1}_{D^{(j)} > x-s}\int_{x-D^{(j)}}^{s}\mu(\xi)\,d\xi\Bigr]\\
&= -\int_{x-s}^{+\infty}\Bigl(\int_{x-\zeta}^{s}\mu(\xi)\,d\xi\Bigr)dF^{(j)}(\zeta)\\
&= -\int_{x-s}^{+\infty}\bar F^{(j)}(\zeta)\,\mu(x-\zeta)\,d\zeta\\
&= -\int_{-\infty}^{s}\bar F^{(j)}(x-\xi)\,\mu(\xi)\,d\xi\\
&= -\int_{-\infty}^{x}\bar F^{(j)}(x-\xi)\,\mu(\xi)\,d\xi + \int_{s}^{x}\bar F^{(j)}(x-\xi)\,\mu(\xi)\,d\xi.
\end{aligned} \tag{2.20}$$
The Lemma is now immediate from further direct computations on $\int_{-\infty}^{x}\bar F^{(j)}(x-\xi)\,\mu(\xi)\,d\xi$, (2.16), Lemma 2.4.1, and the definition of Ψs. This finishes the proof. □
It is easy to deduce from Lemma 2.4.2 that Ψs(x) is bounded below and tends to +∞ as x → +∞. Therefore the infimum over x ≥ s is attained, and we can define S(s) as the smallest point at which it is attained.

Proposition 2.4.1. Assume that (2.17) holds. Then one has:
1.
$$\Psi_s(S(s)) \to -\bigl((1-\alpha)c + \alpha^{L}h\bigr)\bar D \sum_{j=1}^{L-1} j\alpha^{j}, \quad \text{as } s \to +\infty. \tag{2.21}$$
2.
$$\max_{s}\,\Psi_s(S(s)) = \Psi_{\bar s}(S(\bar s)) \ge 0. \tag{2.22}$$
3.
$$\Psi_s(S(s)) \to -\infty, \quad \text{as } s \to -\infty. \tag{2.23}$$
4. There exists one and only one solution of (2.14) such that s ≤ s̄.

Proof. We compute, by Lemma 2.4.2,
$$\Psi_s'(x) = \frac{1-\alpha^{L}}{1-\alpha}\,\mu(x) + \sum_{j=L}^{\infty} \alpha^{j} \int_s^x f^{(j)}(x-\xi)\,\mu(\xi)\,d\xi - \alpha^{L}(h+p) \sum_{j=1}^{L-1} \alpha^{j} \int_0^x \bar F^{(j)}(\zeta)\,f^{(L)}(x-\zeta)\,d\zeta, \tag{2.24}$$
which can also be written as
$$\Psi_s'(x) = \gamma(x) + \sum_{j=L}^{\infty} \alpha^{j} \int_s^x f^{(j)}(x-\xi)\,\mu(\xi)\,d\xi, \tag{2.25}$$
with
$$\gamma(x) = \frac{1-\alpha^{L}}{1-\alpha}\,\bigl((1-\alpha)c - \alpha^{L}p\bigr) + \alpha^{L}(h+p) \sum_{j=0}^{L-1} \alpha^{j} \int_0^x F^{(j)}(x-\xi)\,f^{(L)}(\xi)\,d\xi. \tag{2.26}$$
The function γ is increasing in x and
γ (x) =
1 − αL ((1 − α )c − α L p) < 0, for x ≤ 0, 1−α
γ (∞) =
1 − αL ((1 − α )c + α Lh), 1−α
therefore there exists a unique s∗ such that
γ (x) < 0,
for x < s∗ ,
γ (x) > 0,
for x > s∗ ,
γ (s∗ ) = 0, s∗ > 0. Note that L−1 1 − αL μ (x) − γ (x) = α L (h + p) ∑ α j 1−α j=0
x
F¯ ( j) (x − ξ ) f (L) (ξ )dξ .
0
(2.27)
The quantity on the right-hand side of the above equality vanishes for x ≤ 0 or L = 1. Otherwise 1 − αL μ (x) − γ (x) > 0, 1−α
for x > 0, L ≥ 2.
Since γ (s∗ ) = 0, we have using (2.18), μ (s∗ ) > 0, hence 0 < s¯ < s∗ . Next, if s ≥ s∗ , Ψs (x) ≥ 0; for x ≥ s by (2.25); hence S(s) = s. Therefore, we get by (2.20) that L−1
Ψs (S(s)) = −((1 − α )c + α Lh)D¯ ∑ jα j j=1
L−1
+ α L (h + p) ∑ α j j=1
+∞ 0
F¯ ( j) (ζ )F¯ (L) (s − ζ )dζ ,
(2.28)
j this function decreases in s and converges to −((1 − α )c + α L h)D¯ ∑L−1 j=1 j α as s → +∞. This proves part (1). Consider now s¯ < s < s∗ . We note that
Ψs (s) = γ (s) < 0, Ψs (s∗ ) =
∞
∑α
j=L
j
s∗ s
f ( j) (s∗ − ξ )μ (ξ )dξ > 0,
hence in this case s¯ < s < S(s) < s∗ . If s < s, ¯ then we can claim that s < s¯ < S(s). Indeed, for s < x < s, ¯ we can see from formula (2.25) and (2.18) that Ψs (x) < 0. However, we cannot compare S(s) with s∗ in this case.
22
L. Benkherouf and A. Bensoussan
We then study the behavior of Ψs (S(s)). We have already seen that, for s > s∗ the function Ψs (S(s)) is decreasing to the negative constant −((1 − α )c + j α L h)D¯ ∑L−1 j=1 j α . In this case, it follows from (2.24) that L−1 d Ψs (S(s)) = −α L (h + p) ∑ α j ds j=1
s 0
F¯ ( j) (ζ ) f (L) (s − ζ )dζ , s > s∗ .
(2.29)
Note that s > 0. If s < s∗ , then s < S(s); therefore d ∂Ψs (S(s)), Ψs (S(s)) = ds ∂s hence
∞ d 1 − αL j ( j) Ψs (S(s)) = − μ (s) + ∑ α F (S(s) − s) . ds 1−α j=L
(2.30)
Note that at s = s∗ the two formulas in (2.29) and (2.30) coincide, since −
L−1 1 − αL μ (s∗ ) = −α L (h + p) ∑ α j 1−α j=1
s∗ 0
F¯ ( j) (ζ ) f (L) (s∗ − ζ )dζ ,
which can be easily checked by remembering the definition of s∗ (γ (s∗ ) = 0), and (2.27). It follows clearly that Ψs (S(s)) decreases on (s, ¯ s∗ ) and increases on (−∞, s). ¯ Finally Ψs (S(s)) decreases on (s, ¯ +∞) and increases on (−∞, s). ¯ So it attains its maximum at s. ¯ From the formula for Ψs¯(x); Hs¯(x); for x > s¯ and the fact that s¯ is the minimum of g, we get immediately that Ψs¯(S(s)) ¯ ≥ 0. This shows part (2) of the proposition. Finally, for s < 0 we have
Ψs (S(s)) ≤ Ψs (0) =
∞ 1 − αL (g(0) − g(s)) + ∑ α j 1−α j=L
0 s
F ( j) (−ξ )dξ
L−1
− ((1 − α )c − α Lh)D¯ ∑ jα j , j=1
which tends to be −∞ as s → −∞. This proves (3). Therefore Ψs (S(s)) is increasing from −∞ to a positive number as s grows from −∞ to s. ¯ Therefore there is one and only one s < s¯ satisfying (2.14). The proof has been completed. 2
2 Discrete-Time Inventory Problems with Lead-Time and Order-Time Constraint
23
2.5 Solution as an (s, S) Policy It remains to see whether the solution Hs of equation (2.10) where s is the solution of (2.14) is indeed a solution of (2.13). It is useful to use, instead of Ψs (x) a different function, namely
Φs (x) = Hs (x) +
L−1
∑ α j E(g(x − D( j)) − g(s))1x−D( j) <s ,
(2.31)
j=1
which differs from Ψs (x) by simply deleting the term corresponding to j = 0. Clearly
Φs (x) = Ψs (x),
for x ≥ s.
However, when x < s,
Ψs (x) = Φs (x) + g(x) − g(s), Note that in finding S it is indifferent to work with one or the other function. Note also that, when L = 1, Φs (x) = Hs (x), for all x. Now we claim that inf Ψs (η ) = inf Φs (η ).
η >x
η >x
This equality is obvious when x > s, since the functions are identical. If x < s we have inf Ψs (η ) = inf Φs (η ) = inf Φs (η ) = −K.
η >x
Indeed,
η >x
η >s
inf Ψs (η ) = min inf Ψs (η ), inf Ψs (η ) ,
η >x
η >s
xs Ψs (η ) = −K, whereas (2.10) and (2.13) give inf Ψs (η ) = inf
xx
For x < s, this relation reduces to 0 = min[g(x) − g(s), 0],
(2.32)
which is true since g(x) is decreasing for x < s.We then consider x > s. Since Hs (x) is equal to the first term of the bracket. Therefore, what we have to prove is Hs (x) ≤ K + inf Φs (η ), η >x
for x > s.
(2.33)
In fact, we have not been able to prove (2.33) for all values of L. We know it is true for L = 1. We will prove it afterwards for exponential demands. For general demand distributions we have: Proposition 2.5.2. We assume
α L ((1 − α )c − α L p)(L − 1)D¯ + (1 − α )K ≥ 0,
(2.34)
then property (2.33) is satisfied. Proof. We note that this result includes the case L = 1 in which the condition is automatically satisfied. The proof is similar to that of the case L = 1. We recall from (2.10) that Hs (x) − α EHs (x − D) = gs (x),
for all x.
(2.35)
We then find a similar equation for Φs (x). This is where it is important to consider Φs (x) and not Ψs (x), since we write the equation for any x and not just for x > s. It is easy to verify, using (2.31), that Φs (x) is the solution of
Φs (x) − α E Φs (x − D) = gs (x) + α E(g(x − D) − g(s))1x−D<s − α L E(g(x − D(L) ) − g(s))1x−D(L) <s ,
(2.36)
and again we check that this equation coincides with (2.35) when L = 1. Going back to (2.33) we recall that K + inf Φs (η ) ≥ K + inf Φs (η ) = 0, η >x
η >s
for all x > s.
(2.37)
So it is sufficient to prove (2.33) for x > x0 > s, where x0 is the first value x such that Hs (x) ≥ 0. We necessarily have Hs (x0 ) = 0. We have s < s¯ < x0 . Let us fix ξ > 0 and consider the domain x ≥ x0 − ξ . We can write, using (2.35), that for all x Hs (x) − α EHs (x − D)1x−D≥x0 −ξ ≤ gs (x), using the fact that EHs (x − D)1x−D<x0 −ξ ≤ 0.
(2.38)
Define next Ms (x) = Φs (x + ξ ) + K. We note that Ms (x) > 0 for all x. We then state, by (2.37), that Ms (x) − α EMs (x − D) = gs (x + ξ ) + α E(g(x + ξ − D) − g(s))1x+ξ −D<s − α L E(g(x + ξ − D(L) ) − g(s))1x+ξ −D(L) <s + (1 − α )K, and using the positivity of M we can assert that Ms (x) − α EMs (x − D)1x−D≥x0 −ξ ≥ gs (x + ξ ) + α E(g(x + ξ − D) − g(s))1x+ξ −D<s − α L E(g(x + ξ − D(L) ) − g(s))1x+ξ −D(L) <s + (1 − α )K.
(2.39)
We now consider the difference Ys (x) = Hs (x) − Ms (x), in the domain x ≥ x0 − ξ . We have by (2.38) and (2.39) that Ys (x) − α EYs (x − D)1x−D≥x0 −ξ ≤ gs (x) − gs (x + ξ ) − α E(g(x + ξ − D) − g(s))1x+ξ −D<s +α L E(g(x + ξ − D(L) ) − g(s))1x+ξ −D(L) <s − (1 − α )K.
(2.40)
We first check easily, that gs (x + ξ ) − gs(x) ≥ 0,
for all x ≥ x0 − ξ .
(2.41)
Consider next the function
χs (y) = α E(g(y − D) − g(s))1y−D<s − α L E(g(y − D(L) ) − g(s))1y−D(L) <s , for y ≥ x0 . We check that
χs (y) =
s +∞ y−ζ
−∞
so in fact
χs (y) =
s −∞
L (L)
−α f (η ) + α f
(η ) dη μ (ζ )dζ ,
¯ − ζ ) + α L F¯ (L) (y − ζ ))μ (ζ )dζ . (−α F(y
(2.42)
Note that in the integral, μ (ζ ) < 0, since s < s. ¯ We deduce that
χs (y) ≥ α
s −∞
¯ − ζ ))μ (ζ )dζ . (F¯ (L) (y − ζ ) − F(y
¯ − ζ ) ≥ 0, and μ (ζ ) ≥ ((1 − α )c − α L p). Further, note that F¯ (L) (y − ζ ) − F(y Therefore
χs (y) ≥ α L ((1 − α )c − α L p)
s −∞
¯ − ζ ))dζ . (F¯ (L) (y − ζ ) − F(y
It follows that for y ≥ s,
χs (y) ≥ α L ((1 − α )c − α L p) = α L ((1 − α )c − α L p)
y −∞ ∞ 0
¯ − ζ ))dζ (F¯ (L) (y − ζ ) − F(y
¯ (F¯ (L) (u) − F(u))du
¯ = α L ((1 − α )c − α L p)(L − 1)D. Thanks to assumption (2.34) we can assert from (2.40) and (2.42) that Ys (x) − α EYs (x − D)1x−D≥x0 −ξ ≤ 0,
for all x ≥ x0 − ξ .
Also Ys (x0 − ξ ) ≤ −Φs (x0 ) − K ≤ −K, since Φs (x0 ) ≥ Hs (x0 ) = 0. It follows that Ys (x) ≤ 0,
for all x ≥ x0 − ξ , 2
which is the desired result.
2.6 The Exponential Case In this section, we shall examine the special case when the demand D is exponentially distributed. That is, f (x) =
β exp(−β x), 0,
if x ≥ 0, , otherwise
(2.43)
for some β > 0. The next proposition shows that property (2.33) automatically holds in this case.
Proposition 2.6.3. We assume that the demand is distributed according to an exponential distribution, then (2.33) holds. Before we proceed to the proof of Proposition 2.6.3, the following result is needed. Lemma 2.6.3. For s ≤ s, ¯ the solution Hs of (2.10) satisfies the following properties: 1. Is constant on (−∞, s], 2. Strictly decreasing on (s, s], ¯ 3. Hs (x) → ∞, as x → ∞ Proof. It clear that (1) follows from (2.10). Property (2) can be easily deduced from (2.18) and (2.19). We claim that 1 Hs (x) ≥ (g(s) ¯ − g(s)). (2.44) 1−α Assume otherwise and define ¯ As = {x ∈ R, s ≤ x ≤ s}. Let
x∗ = min x ∈ As : Hs (x∗ ) =
1 (g(s) ¯ − g(s)) . 1−α
If follows from (2.10), the definitions of g, x∗ and (2) that ¯ − g(s)) + α Hs(x∗ ). Hs (x∗ ) = g(x∗ ) − g(s) + α EHs(x∗ − D) > (g(s) Hence, Hs (x∗ ) >
1 (g(s) ¯ − g(s)). 1−α
This is in contradiction with the definition of x∗ . Therefore (2.44) is true. This leads by (2.10) to 1 Hs (x) ≥ g(x) − g(s)) + α (g(s) ¯ − g(s)). 1−α It is then immediate to see that (3) holds. This finishes the proof.
2
Proof (Proposition 2.6.3). We are going to show that Hs (x) ≥ 0,
if x ≥ x0 ,
(2.45)
then (2.33) will follow immediately, since for x ≥ x0 Hs (x) ≤ Hs (x + ξ ) ≤ Φs (x + ξ ) ≤ Φs (x + ξ ) + K,
for ξ ≥ 0.
28
L. Benkherouf and A. Bensoussan
It can be shown using integration by parts that if f is given (2.43), then Hs is the solution of the differential equation (1 − α )β Hs(x) + Hs (x) = G(x),
(2.46)
where Hs (s) = 0 and G(x) = β (g(x) − g(s)) + μ (x). In fact the function Hs can be written explicitly. However, we content ourselves with (2.46) since this will be sufficient to proceed further in the proof. We claim that Hs has a unique minimum point. Indeed, Lemma 2.6.3 implies Hs attains a minimum belonging to the interval (s, ¯ ∞). Further, Hs has a local minimum on (s, x0 ). Moreover, it is easy to show that Hs is twice differentiable on (s, ∞). Assume now that Hs has two minima x1 and x2 , with x1 < x2 . It is clear that we can select x1 such that s < x1 < x0 . It follows that there exists an x∗ ∈ (x1 , x2 ) such that x∗ is a local maximum. Therefore, H (x∗ ) = 0, and H (x∗ ) ≤ 0. Differentiating both sides of (2.46), we get that Hs (x∗ ) = G (x∗ ) = β (μ (x∗ ) + μ (x∗ )). The definition of μ in (2.16) with (2.18) and the fact x∗ > s¯ imply that Hs (x∗ ) > 0. This leads to a contradiction. Therefore, Hs has a unique minimum and this minimum belongs to the interval (s, x0 ). Consequently, (2.45) is true. The result has been proven. 2 In this chapter a discrete-time continuous-state inventory model, with a fixed lead time of several periods, was considered. Orders for products cannot be placed while waiting for a delivery of previous orders. It was shown that the policy which minimizes the total expected inventory costs over an infinite planning horizon is of an (s, S) type: see Propositions 2.5.2 and 2.6.3. It seems natural to ask if (s, S) policies are still optimal for continuous-state, continuous-time models for inventory models with constraint (2.1). A possible starting point for such investigation is the work of Bensoussan, Liu, and Sethi [5]: see also Benkherouf and Bensoussan [2]. The answer to this question remains open. Another interesting problem is to see if requirement (2.34) can be weakened. Moreover, since the present work seems to be among the first attempts at examining the optimality of (s, S) policies for inventory models with lead-time and ordertime constraints and has only dealt with the basic stationary model of Scarf, then it may be worthwhile to look at possible extensions of the present work to the large body of existing inventory models in the literature. These may include models with Markovian demand as discussed in Sethi and Feng [15], or demand dependent on the environment as found in Song and Zipkin [17], or models with two suppliers as treated in Fox, Metters, and Semple [8]. Acknowledgments Alain Bensoussan would like to acknowledge the support of WCU (World Class University) program through the Korea Science and Engineering Foundation funded by the Ministry of Education, Science and Technology (R31-20007). The authors would also like to thank an anonymous referee for comments on an earlier version of the paper.
References
1. Archibald, B.C.: Continuous review (s, S) policies with lost sales. Management Science 27, 1171–1177 (1981)
2. Benkherouf, L., Bensoussan, A.: On the optimality of (s, S) inventory policies: a quasi-variational inequalities approach. SIAM Journal on Control and Optimization 48, 756–762 (2009)
3. Benkherouf, L.: On the optimality of (s, S) policies: a quasi-variational approach. Journal of Applied Mathematics and Stochastic Analysis, Article ID 158193, 1–9 (2008)
4. Bensoussan, A., Çakanyidirim, M., Moussaoui, L.: Optimality of (s, S) policies for continuous-time inventory problems with order constraint. Annals of Operations Research (in press)
5. Bensoussan, A., Liu, R.H., Sethi, S.P.: Optimality of an (s, S) policy with compound Poisson and diffusion demand: a QVI approach. SIAM Journal on Control and Optimization 44, 1650–1676 (2006)
6. Beyer, D., Sethi, S.P.: The classical average-cost inventory models of Iglehart and Veinott-Wagner revisited. Journal of Optimization Theory and Applications 101, 523–555 (1999)
7. Federgruen, A., Schechner, Z.: Cost formulas for continuous review inventory models with fixed delivery lags. Operations Research 31, 957–965 (1983)
8. Fox, E.J., Metters, R., Semple, J.: Optimal inventory policy with two suppliers. Operations Research 54, 389–393 (2006)
9. Hadley, G.J., Whitin, T.M.: Analysis of Inventory Systems. Prentice-Hall, New Jersey (1963)
10. Iglehart, D.L.: Optimality of (s, S) policies in the infinite horizon dynamic inventory problem. Management Science 9, 259–267 (1963)
11. Kim, D.E., Park, K.S.: (Q, r) inventory model with a mixture of lost sales and time-weighted backorders. Journal of the Operational Research Society 36, 231–238 (1985)
12. Moinzadeh, K., Nahmias, S.: A continuous review model for an inventory system with two supply modes. Management Science 34, 761–773 (1988)
13. Sahin, I.: On the stationary analysis of continuous review (s, S) inventory systems with constant lead times. Operations Research 27, 717–729 (1971)
14. Sahin, I.: On the objective function behavior in (s, S) inventory models. Operations Research 30, 709–724 (1982)
15. Sethi, S.P., Cheng, F.: Optimality of (s, S) policies in inventory models with Markovian demand. Operations Research 45, 931–939 (1997)
16. Scarf, H.: The optimality of (s, S) policies in the dynamic inventory problem. In: Arrow, K., Karlin, S., Suppes, P. (eds.) Mathematical Methods in the Social Sciences, Chap. 13. Stanford University Press, Stanford (1960)
17. Song, J.S., Zipkin, P.: Inventory control in a fluctuating demand environment. Operations Research 41, 351–370 (1993)
18. Veinott, A.F.: On the optimality of (s, S) inventory policies: new conditions and a new proof. SIAM Journal on Applied Mathematics 14, 1067–1083 (1966)
Chapter 3
Sample-Path Optimality in Average Markov Decision Chains Under a Double Lyapunov Function Condition Rolando Cavazos-Cadena and Raul ´ Montes-de-Oca
3.1 Introduction

This note concerns discrete-time Markov decision processes (MDPs) evolving on a denumerable state space. The performance index of a control policy is a (long-run) average criterion, and besides standard continuity-compactness conditions, the main structural assumption on the model is that (a) the (possibly unbounded) cost function has a Lyapunov function ℓ(·) and (b) a power of order larger than 2 of ℓ also admits a Lyapunov function [14]. Within this context, the main purpose of the paper is to analyze the sample-path average optimality of some policies whose expected optimality is well known. More specifically, the main results in this direction are as follows: (i) The stationary policy f obtained by optimizing the right-hand side of the optimality equation is sample-path optimal in the strong sense, that is, under the action of f, the observed average costs in finite times converge almost surely to the optimal expected average cost g, whereas if the system is driven by any other policy, then with probability 1, the inferior limit of those averages is at least g. (ii) The Markovian policies obtained from procedures frequently used to approximate a solution of the optimality equation, like the vanishing discount or the successive approximations methods, are sample-path average optimal in the strong sense.

R. Cavazos-Cadena (✉)
Departamento de Estadística y Cálculo, Universidad Autónoma Agraria Antonio Narro, Buenavista, Saltillo, Coahuila 25315, México
R. Montes-de-Oca
Departamento de Matemáticas, Universidad Autónoma Metropolitana–Iztapalapa, México D.F. 09340, México
The expected average criteria have been intensively studied, and a fairly complete account of the theory can be found in [9, 12, 13]; see also [1]. In this last paper, it was shown that, for a general MDP, if the optimality equation has a bounded solution, then the stationary policy f referred to in the point (i) above is optimal in the sample-path sense. In [3, 4], a similar conclusion was obtained for models with denumerable state space if the cost function has an almost monotone (or penalized) structure, in the sense that the costs are sufficiently large outside a compact set; such a conclusion was extended to models on Borel spaces by [10,16,20]. More recently, for models with denumerable state space and finite actions sets, the sample-path average criterion was studied in [15] under the uniform ergodicity assumption. On the other hand, the first result described above is an extension of Theorem 4.1 in [6], where the sample-path optimality of the policy f mentioned above was established in a weaker sense than the one used in the present work. The approach of this note relies on basic probabilistic ideas, like Kolmogorov’s inequality and the first Borel-Cantelli lemma, and was motivated by the elementary analysis of the strong law of large numbers as presented in [2]. The organization of the subsequent material is as follows: In Sect. 3.2, the decision model is presented and the conditions to obtain an (expected) average optimal stationary policy from a solution of the optimality equation are briefly described. Next, in Sect. 3.3, the idea of Lyapunov function is introduced and some of its elementary properties are established, whereas in Sect. 3.4, the basic structural restriction on the model, namely, the double Lyapunov function condition, is formulated as Assumption 3.4.1, and the main result of the chapter, solving problem (i) above, is stated as Theorem 3.4.1. The argument to establish this result relies on some properties of the sequence of innovations associated with the sequence of optimal relative costs, which are presented in Sect. 3.5, and then, the main theorem is proved in Sect. 3.6. Next, the result on the sample-path optimality of Markovian policies is stated as Theorem 3.7.1 in Sect. 3.7, and the necessary technical tools to prove that result, concerning tightness of the sequence of empirical measures and uniform integrability of the cost function, are established in Sect. 3.8. Finally, the exposition concludes with the proof of Theorem 3.7.1 in Sect. 3.9. Notation. Throughout the remainder, IN stands for the set of all nonnegative integers and the indicator function of a set A is denoted by IA , so that IA (x) = 1 if x ∈ A and IA (x) = 0 when x ∈ / A. On the other hand, for a topological space IK, the class of all continuous functions defined on IK and the Borel σ -field of IK are denoted by C (IK) and B(IK), respectively, whereas IP(IK) stands for the class of all probability measures defined in B(IK). Finally, for an event G, the corresponding indicator function is denoted by I[G].
3.2 Decision Model Let M = (S, A, {A(x)}x∈S ,C, P) be an MDP, where the state space S is a denumerable set endowed with the discrete topology and the action set A is a metric space. For each x ∈ S, A(x) ⊂ A is the nonempty subset of admissible actions at x and, defining the class of admissible pairs by IK : = {(x, a) | a ∈ A(x), x ∈ S}, the mapping C : IK → IR is the cost function, whereas P = [px y (·)] is the controlled transition law on S given IK, that is, for all (x, a) ∈ IK and y ∈ S, the relations px y (a) ≥ 0 and ∑y∈S px y (a) = 1 are satisfied. This model M is interpreted as follows: At each time t ∈ IN, the decision maker observes the state of a dynamical system, say Xt = x ∈ S, and selects an action (control) At = a ∈ A(x) incurring a cost C(x, a). Then, regardless of the previous states and actions, the state at time t + 1 will be Xt+1 = y ∈ S with probability px y (a); this is the Markov property of the decision process. Assumption 3.2.1 (i) For each x ∈ S, A(x) is a compact subset of A. (ii) For every x, y ∈ S, the mappings a → C(x, a) and a → px y (a) are continuous in a ∈ A(x). Policies. The space IHt of possible histories up to time t ∈ IN is defined by IH0 : = S and IHt : = IKt × S for t ≥ 1, whereas a generic element of IHt is denoted by ht = (x0 , a0 , . . . , xi , ai , . . . , xt ), where ai ∈ A(xi ). A policy π = {πt } is a special sequence of stochastic kernels: For each t ∈ IN and ht ∈ IHt , πt (·|ht ) is a probability measure on B(A) concentrated on A(xt ), and for each Borel subset B ⊂ A, the mapping ht → πt (B|ht ), ht ∈ IHt , is Borel measurable. The class of all policies is denoted by P and when the controller chooses actions according to π , the control At applied at time t belongs to B ⊂ A with probability πt (B|ht ), where ht is the observed history of the process up to time t. Given the policy π being used for choosing actions and the initial state X0 = x, the distribution of the state-action process {(Xt , At )} is uniquely determined [9], and such a distribution and the corresponding expectation operator are denoted by Pxπ and Exπ , respectively. Next, define F : = ∏x∈S A(x) and notice that F is a compact metric space, which consists of all functions f : S → A such that f (x) ∈ A(x) for each x ∈ S. A policy π is Markovian if there exists a sequence { ft } ⊂ F such that the probability measure πt (·|ht ) is always concentrated at ft (xt ), and if ft ≡ f for every t, the Markovian policy π is is referred to as stationary. The classes of stationary and Markovian policies are naturally identified with F ∞ and M : = ∏t=0 F , respectively, and with these conventions F ⊂ M ⊂ P. Performance Criteria. Suppose that the cost function C(·, ·) is such that Exπ [|C(Xt , At )|] < ∞,
x ∈ S,
π ∈ P,
t ∈ IN.
(3.1)
In this case, the (long-run superior limit) average cost corresponding to π ∈ P at state x ∈ S is defined by 1 π k−1 J(x, π ) := lim sup Ex ∑ C(Xt , At ) , (3.2) k→∞ k t=0 and the corresponding optimal value function is specified by J ∗ (x) := inf J(x, π ), π ∈P
x ∈ S;
(3.3)
a policy π ∗ ∈ P is (superior limit) average optimal if J(x, π ∗ ) = J ∗ (x) for every x ∈ S. The criterion (3.2) evaluates the performance of a policy in terms of the largest among the limit points of the expected average costs in finite times. In contrast, the following index assesses a policy in terms of the smallest of such limit points: 1 J− (x, π ) := lim inf Exπ k→∞ k
k−1
∑ C(Xt , At )
(3.4)
t=0
is the (long run) inferior limit average criterion associated with π ∈ P at a state x, whereas the optimal value function associated with this criterion is given by ∗ J− (x) := inf J− (x, π ),
π ∈P
x ∈ S.
(3.5)
From these specifications, it follows that ∗ J− (·) ≤ J ∗ (·),
(3.6)
and within the context described below, it will be shown that the equality holds in this last relation. Optimality Equation. A basic instrument to analyze the above average criteria is the following optimality equation:
g + h(x) = inf
a∈A(x)
C(x, a) + ∑ px y (a)h(y) ,
x ∈ S,
(3.7)
y∈S
where g ∈ IR and h ∈ C (S) are given functions, and it is supposed that Exπ [|h(Xn )|] < ∞,
x ∈ S,
π ∈ P,
n ∈ IN.
Under this condition and (3.1), a standard induction argument combining (3.7) and the Markov property yields that, for every nonnegative integer n,
n
∑ C(Xt , At ) + h(Xn+1)
(n + 1)g + h(x) ≤ Exπ
,
x ∈ S,
π ∈ P.
t=0
Moreover, if f ∈ F satisfies that g + h(x) = C(x, f (x)) + ∑ px y ( f (x))h(y),
x ∈ S,
(3.8)
y∈S
then
(n + 1)g + h(x) =
Exf
n
∑ C(Xt , At ) + h(Xn+1)
,
x ∈ S.
t=0
Therefore, assuming that the condition Exπ [h(Xn+1 )] =0 n→∞ n+1
(3.9)
lim
holds for every x ∈ S and π ∈ P, it follows that the relation 1 Ef lim n→∞ n + 1 x
n
1 Eπ inf ∑ C(Xt , At ) = g ≤ lim n→∞ n + 1 x t=0
n
∑ C(Xt , At )
(3.10)
t=0
is always valid, and then, (3.2)–(3.6) immediately yield that ∗ J− (x)
1 Ef = J (x) = g = lim n→∞ n + 1 x ∗
n
∑ C(Xt , At )
,
x ∈ S,
(3.11)
t=0
so that: (i) The superior and inferior limit average criteria render the same optimal value function, (ii) A stationary policy f satisfying (3.8) is average optimal, and (iii) The optimal average cost is constant and is equal to g.
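As a purely illustrative aside (not part of the chapter), the sketch below solves a small finite instance of the optimality equation (3.7) by relative value iteration, one of the successive-approximation schemes alluded to in this chapter; it returns the constant g and the relative value function h(·) with h fixed to 0 at a reference state. The transition law and cost matrix used here are hypothetical.

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP used only to illustrate solving (3.7).
C = np.array([[1.0, 2.0],
              [0.5, 1.5],
              [2.0, 0.2]])                                # C[x, a]
P = np.array([[[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]],
              [[0.3, 0.5, 0.2], [0.2, 0.2, 0.6]],
              [[0.4, 0.4, 0.2], [0.5, 0.3, 0.2]]])        # P[x, a, y]

h = np.zeros(3)
ref = 0                                                    # plays the role of the state z
for _ in range(2000):
    Q = C + np.einsum('xay,y->xa', P, h)                   # C(x,a) + sum_y p_xy(a) h(y)
    T = Q.min(axis=1)                                      # right-hand side of (3.7)
    g, h_new = T[ref], T - T[ref]                          # normalize so that h(ref) = 0
    if np.max(np.abs(h_new - h)) < 1e-12:
        break
    h = h_new

policy = Q.argmin(axis=1)                                  # a stationary minimizer f
print("optimal average cost g ≈", g)
print("relative values h =", h, "  minimizing actions:", policy)
```

At the fixed point, g + h(x) equals the minimized right-hand side of (3.7) with h(ref) = 0, and the recorded minimizers form a stationary policy f satisfying (3.8) for this toy model.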
3.3 Lyapunov Functions In this section, a structural condition on the model M will be introduced under which (a) the basic condition (3.1) holds, (b) the optimality equation (3.7) has a solution (g, h(·)) such that the convergence (3.9) occurs, and (c) a policy f ∈ F
satisfying (3.8) exists, so that the conclusions (i)–(iii) stated above hold. Throughout the rest of this chapter z∈S
is a fixed state
and T stands for the first return time to state z, that is, T : = min{n > 0 | Xn = z},
(3.12)
where, as usual, the minimum of the empty set is ∞. The following idea was introduced in [14] and was analyzed in [5]: Definition 3.3.1. Let D ∈ C (IK) and : S → [1, ∞) be given functions. The function is a Lyapunov function for D, or “D has the Lyapunov function ”, if the following conditions (i)–(iii) hold: (i) 1 + |D(x, a)| + ∑y =z px y (a)(y) ≤ (x) for all (x, a) ∈ IK. (ii) For each x ∈ S, the mapping f → ∑y px y ( f (x))(y) = Exf [(X1 )] is continuous in f ∈ F . (iii) For each f ∈ F and x ∈ S, Exf [(Xn )I[T > n]] → 0 as n → ∞. The sentence “D admits a Lyapunov function” means that there exists a function : S → [1, ∞) such that conditions (i)–(iii) above hold. The following simple lemma will be useful. Lemma 3.3.1. Suppose that C has the Lyapunov function (·). In this case the assertions (i) and (ii) below are valid. (i) For every n ∈ IN and π ∈ P, 1 Eπ n+1 x
n
∑ (1 + |C(Xt , At )|) + (Xn+1)
≤ B(x) : = (x) + (z),
x ∈ S;
t=0
in particular, the basic condition (3.1) holds. 1 (ii) lim Exπ [(Xn )] → 0. n→∞ n Proof. Notice that the inequality 1 + |C(x, a)| + ∑y∈S px y (a)(y) ≤ (x) + (z) is always valid, by Definition 3.3.1(i), and then, an induction argument using the Markov property yields that for arbitrary x ∈ S and π ∈ P, Exπ
n
∑ (1 + |C(Xt , At )|) + (Xn+1)
≤ (x) + (n + 1)(z),
n ∈ IN,
t=0
a relation that immediately yields part (i); a proof of the second assertion can be found in Lemma 3.2 of [6].
The following lemma, originally established by [14], shows that the existence of a Lyapunov function has important implications for the analysis of the average criteria in (3.2) and (3.4). Lemma 3.3.2. Suppose that the cost function C has a Lyapunov function . In this case, there exists a unique pair (g, h(·)) ∈ IR × C (S) such that the following assertions (i)–(v) hold: (i) g = J ∗ (x) for each x ∈ S. (ii) h(z) = 0 and |h(x)| ≤ (1 + (z)) · (x) for all x ∈ S. Therefore, by Lemma 3.3.1 (ii), the convergence (3.9) holds. (iii) The pair (g, h(·)) satisfies the optimality equation (3.7). (iv) For each x ∈ S, the mapping a → ∑y∈S px y (a)h(y) is continuous in a ∈ A(x). (v) An optimal stationary policy exists: For each x ∈ S, the term within brackets in the right-hand side of (3.7) has a minimizer f (x) ∈ A(x), and the corresponding policy f ∈ F is optimal. Moreover, (3.11) holds. A proof of this result can be essentially found in Chapter 5 of [14]; see also Lemma 3.1 in [6] for a proof of the inequality in part (ii). Remark 3.3.1. Notice that g in Lemma 3.3.2 is uniquely determined, since it is the optimal (expected) average cost at every state. The function h(·) in the above lemma is also unique, as established in Lemma A.2(iv) in [7]. Indeed, defining the relative cost function as C(·, ·) − g, the function h(·) is the optimal total relative cost incurred before the first return time to state z; more explicitly, T −1 h(x) = infπ ∈P Exπ ∑t=0 [C(Xt , At ) − g] for all x ∈ S. This section concludes with some simple but useful properties of Lyapunov functions stated in Lemma 3.3.3 below, whose statement involves the following notation. Definition 3.3.2. The class L () consists of all functions D ∈ C (IK) such that a positive multiple of is a Lyapunov function for D, that is, D ∈ L () if and only if for some c > 0,
1 + |D(x, a)| + ∑ px y (a)[c (y)] ≤ c (x), y =z
(x, a) ∈ IK.
Notice that the function c (·) inherits the properties (ii) and (iii) of the function (·) in Definition 3.3.1. Lemma 3.3.3. Suppose that : S → [1, ∞) is a Lyapunov function for a function D˜ ∈ C (IK): ˜ then is also a Lyapunov function for D0 . (i) If D0 ∈ C (IK) is such that |D0 | ≤ |D|, (ii) With the notation in Definition 3.3.2, the following properties (a) and (b) hold: (a) L () is a vector space that contains the constant functions. (b) If D1 , D2 ∈ L (), then max{D1 , D2 } and min{D1 , D2 } also belong to L ().
Proof. The first part follows directly from Definition 3.3.1. To establish part (ii), first notice that L () is nonempty since D˜ ∈ L (). Next, suppose that D1 , D2 ∈ L () and observe that 1 + |Di (x, a)| + ∑ px y (a)[ci (y)] ≤ ci (x), y =z
(x, a) ∈ IK,
i = 1, 2,
where c1 and c2 are positive constants. If d1 and d2 are real numbers, multiplying both sides of the above equality by 1 + |di|, it follows that, for every (x, a) ∈ IK, ([1 + |di|) + (1 + |di|)|Di (x, a)| + ∑ px y (a)[(1 + |di|)ci (y)] ≤ (1 + |di|)ci (x), y =z
and then, 1 + |diDi (x, a)| + ∑ px y (a)[(1 + |di|)ci (y)] ≤ (1 + |di|)ci (x), y =z
i = 1, 2.
these inequalities, it follows that 1 + [1 + |d1D1 (x, a)| + |d2D2 (x, a)|] + ∑ px y (a)[(1 + |d1|)c1 + (1 + |d2|)c2 ] (y) y =z
≤ [(1 + |d1|)c1 + (1 + |d2|)c2 ] (x), showing that 1 + |d1D1 | + |d2D2 | ∈ L (), and because the function in the left-hand side of this inclusion dominates |d1 D1 + d2 D2 | and 1, part (i) yields that (a) d1 D1 + d2 D2 ∈ L () and 1 ∈ L (), so that L () is a vector space that contains the constant functions. Finally, since | max{D1 , D2 }|, | min{D1 , D2 }| ≤ |D1 |+|D2 |, the above displayed inclusion and part (i) together imply that (b) max{D1 , D2 }, min{D1 , D2 } ∈ L ().
3.4 A Double Lyapunov Function Condition and Sample-Path Optimality In this section, a notion of sample-path average optimality is introduced, and using the idea of Lyapunov function, a structural condition on the decision model M is formulated. Under such an assumption, it is stated in Theorem 3.4.1 below that, in addition to being expected average optimal, a stationary policy f satisfying the (3.8) is also average optimal in the sample-path sense.
Definition 3.4.1. A policy π ∗ ∈ P is sample-path average optimal with optimal value g∗ ∈ IR if the following conditions (i) and (ii) are valid: (i) For each state x ∈ S,
1 n−1 ∑ C(Xt , At ) = g∗ n→∞ n t=0 lim
(ii) For every π ∈ P and x ∈ S, lim inf n→∞
∗
Pxπ -a.s. ;
1 n−1 ∑ C(Xt , At ) ≥ g∗ n t=0
Pxπ -a.s..
The existence and construction of sample-path optimal policies will be studied under the following structural condition on the model M . Assumption 3.4.1 [Double Lyapunov function condition] (i) The cost function C(·, ·) has a Lyapunov function . (ii) For some β > 2, the mapping β admits a Lyapunov function. A simple class of models satisfying this assumption is presented below. Example 3.4.1. For each t ∈ IN, let Xt ∈ S = IN be the number of customers waiting for a service at a time t in a single-server station. To describe the evolution of {Xt }, let the action set A be a compact metric space and set A(x) = A for every x. Next, let {Δt (a) |t ∈ IN, a ∈ A} and {ξt (a) |t ∈ IN, a ∈ A} be two families of independent and identically distributed random variables taking values in the set IN, and suppose that the following conditions hold: → P[ξt (a) = k], and a → (i) For each k ∈ IN, the mappings a → P[Δt (a) = k], a E[ξt (a)r ] ∈ (0, ∞), 1 ≤ r ≤ 2m + 3, are continuous, where m ∈ IN \ {0} is fixed. (ii) P[Δt (a) = 1] = μ (a) = 1 − P[Δt (a) = 0] and E[ξt (a)] − μ (a) ≤ −ρ < 0 for all a ∈ A. When the action a ∈ A is chosen, (the Bernoulli variable) Δt (a) and ξt (a) represent the number of service completions and arrivals in [t,t + 1), respectively, and the evolution of the state process is determined by Xt+1 = Xt + ξt (a) − Δt (a)IIN\{0} (Xt ) if At = a,
t ∈ IN,
(3.13)
an equation that allows to obtain the transition law and to show that px y (a) is a continuous function of a ∈ A, by the first of the conditions presented above. In the third part of the following proposition, a class of cost functions satisfying Assumption 3.4.1 is identified. Proposition 3.4.1. In the context of Example 3.4.1, the following assertions hold when z = 0 and T is the first return time in (3.12): (i) For each r = 1, 2, . . . , 2m + 2, there exist positive constants br and cr such that the function r+1 ∈ C (S) given by r+1 (x) = xr+1 + br x + cr ,
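As an illustrative aside (not in the original text), the short simulation below generates a path of the queue (3.13) under a fixed stationary policy and tracks the running sample-path average cost, which is the quantity whose almost-sure convergence this chapter studies. The arrival law (Poisson with mean 0.4), the service probabilities μ(0) = 0.6, μ(1) = 0.9, the running cost C(x, a) = x, and the threshold policy are all hypothetical choices consistent with conditions (i) and (ii) above.

```python
import numpy as np

# Illustrative simulation of the controlled queue (3.13) under a fixed policy.
rng = np.random.default_rng(1)
mu = {0: 0.6, 1: 0.9}                              # service-completion probabilities

def simulate(T=200_000, x0=0):
    x, total, averages = x0, 0.0, []
    for t in range(1, T + 1):
        a = 1 if x >= 5 else 0                     # stationary policy f(x) = 1{x >= 5}
        total += x                                  # running cost C(X_t, A_t) = X_t
        arrivals = rng.poisson(0.4)                 # xi_t(a)
        service = rng.random() < mu[a]              # Bernoulli Delta_t(a)
        x = x + arrivals - (1 if (service and x > 0) else 0)
        if t % 20_000 == 0:
            averages.append(total / t)              # observed average cost up to time t
    return averages

print("running sample-path averages:", simulate())
```

The printed running averages settle around a single value, illustrating on this toy instance the kind of sample-path convergence formalized in Definition 3.4.1 and Theorem 3.4.1.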
x ∈ S,
(3.14)
satisfies
ρ xr + 1 + E[r+1(Xt+1 ) | Xt = x, At = a] ≤ r+1 (x),
x ∈ S,
(3.15)
where ρ > 0 is the number in condition (ii) stated in Example 3.4.1. Moreover, for each x ∈ S, a → E[r+1 (Xt+1 ) | Xt = x, At = a],
a ∈ A,
is continuous,
(3.16)
and for every π ∈ P, lim E π [(1 + ρ Xnr )I[T n→∞ x
> n]] = 0.
(3.17)
Consequently, (ii) For j = 1, 2, . . . , 2m + 1, the above mapping j+1 (·) is a Lyapunov function for the cost function C j (x) : = ρ x j ,
x ∈ S.
(iii) Suppose that, for some integer j = 1, 2, . . . , m − 1, the cost function C ∈ C (IK) is such that max |C(x, a)| ≤ b1 x j + b0 , a∈A
x ∈ S,
where b0 and b1 are positive constants. In this case, the function C satisfies Assumption 3.4.1. Proof. (i) For Xt = x
= 0 and 1 ≤ r ≤ 2m + 2, the evolution equation (3.13) yields that E[(Xt+1 )r+1 |Xt = x, At = a] = E[(x + ξt (a) − Δt (a))r+1 ] ≤ xr+1 − (r + 1)ρ xr + R(x, a),
(3.18)
where supa∈A |R(x, a)| = O(xr−1 ); thus, there exists a constant b > 0 such that R(x, a) ≤ ρ xr + b for every x
= 0, and it follows that
ρ xr − b + E[(Xt+1)r+1 |Xt = x, At = a] ≤ xr+1 . When r = 0, the term R(x, a) in (3.18) is null, so that ρ + E[Xt+1 |Xt = x, At = a] ≤ x; multiplying both sides of this relation by a sufficiently large constant br such that ρ br > b + 1 and combining the resulting inequality with the one displayed above, it follows that
ρ xr + 1 + E[(Xt+1)r+1 + br Xt+1 |Xt = x, At = a] ≤ xr+1 + br x,
x ∈ S \ {0}.
Defining cr : = 1 + maxa∈A E[ξt (a)r+1 + br ξt (a)], it follows that the function r+1 in (3.14) satisfies the inequality (3.15), whereas (3.16) follows from the
continuity properties of the distributions of the departure and arrival streams in Example 3.4.1. On the other hand, (3.15) yields that the inequality E_x^π[ρ X_0^r + 1 + ℓ_{r+1}(X_1) I[T > 1]] ≤ ℓ_{r+1}(x) is always valid, and an induction argument using the Markov property leads to

E_x^π[ ∑_{t=0}^{n−1} (ρ X_t^r + 1) I[T > t] + ℓ_{r+1}(X_n) I[T > n] ] ≤ ℓ_{r+1}(x),  n ∈ IN;

taking the limit as n goes to +∞, this implies that E_x^π[∑_{t=0}^{∞} (ρ X_t^r + 1) I[T > t]] ≤ ℓ_{r+1}(x), and (3.17) follows.

(ii) The relations (3.15) and (3.16) immediately show that ℓ_{j+1} satisfies the requirements (i) and (ii) in Definition 3.3.1 of a Lyapunov function for C_j. Next, observe that ℓ_{j+1} ≤ c_0 x^{j+1} + c_1 for some constants c_0 and c_1, and then, (3.17) with j + 1 (≤ m + 2) instead of r implies that ℓ_{j+1} also satisfies the third property in Definition 3.3.1.

(iii) Observe that the condition on the function C(·, ·) can be written as |C| ≤ b_1 C_j + b_0, and then, it is sufficient to show that C_j satisfies Assumption 3.4.1, since in this case the corresponding conclusion for C follows from Lemma 3.3.3. Let the integer j between 1 and m − 1 be arbitrary and notice that part (ii) yields that ℓ_{j+1} is a Lyapunov function for C_j. Next, set β = (2j + 3)/(j + 1) > 2 and observe that (3.14) implies that there exist positive constants c_0 and c_1 such that ℓ_{j+1}^β(x) ≤ c_1 x^{2j+3} + c_0 = c_1 C_{2j+3}(x) + c_0; since 2j + 3 ≤ 2m + 1, part (ii) shows that C_{2j+3} has a Lyapunov function, and then, ℓ_{j+1}^β also admits a Lyapunov function, by Lemma 3.3.3.

The following result establishes the existence of sample-path average optimal stationary policies.

Theorem 3.4.1. Suppose that Assumptions 3.2.1 and 3.4.1 hold, and let (g, h(·)) be the solution of the optimality equation guaranteed by Lemma 3.3.2. In this case, if the stationary policy f satisfies (3.8), then f is sample-path average optimal with the optimal value g. More explicitly, for each x ∈ S and π ∈ P,

lim_{n→∞} (1/n) ∑_{t=0}^{n−1} C(X_t, A_t) = g  P_x^f-a.s.  (3.19)

and

lim inf_{n→∞} (1/n) ∑_{t=0}^{n−1} C(X_t, A_t) ≥ g  P_x^π-a.s.  (3.20)
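The content of the sample-path criterion (3.19) can be checked empirically: along a single long trajectory generated under a fixed stationary policy, the running average of the incurred costs settles at a constant value. The following sketch does exactly this; the two-line dynamics and the cost C(x, a) = x are toy stand-ins for the queue of Example 3.4.1, chosen here only for illustration.

```python
import random

def running_average_cost(step, policy, cost, x0, horizon):
    """Single-trajectory Cesaro average (1/n) * sum_{t<n} C(X_t, A_t), as in (3.19)."""
    x, total = x0, 0.0
    for _ in range(horizon):
        a = policy(x)
        total += cost(x, a)
        x = step(x, a)
    return total / horizon

if __name__ == "__main__":
    random.seed(1)
    # Toy stand-in for Example 3.4.1: Bernoulli(0.3) arrivals, Bernoulli(0.7) service attempts.
    step = lambda x, a: x + (random.random() < 0.3) - (x > 0 and random.random() < 0.7)
    avg = running_average_cost(step, policy=lambda x: 0, cost=lambda x, a: x,
                               x0=10, horizon=200_000)
    print(avg)   # stabilizes for long horizons, which is what (3.19) asserts path by path
```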
Remark 3.4.1. This theorem is related to Theorem 4.1 in [6], where MDPs with average reward criteria were considered. In the context of the present work, Theorem 4.1 in that paper establishes that, under Assumption 3.2.1, if the cost function has the Lyapunov function ℓ, then lim sup_{n→∞} n^{−1} ∑_{t=0}^{n−1} C(X_t, A_t) ≥ g P_x^π-a.s. for each
π ∈ P and x ∈ S and that the equality holds if π = f satisfies (3.8). In the present Theorem 3.4.1, the additional condition in Assumption 3.4.1(ii) is incorporated, and in this context, the stronger conclusions (3.19) and (3.20) are obtained. A proof of Theorem 3.4.1 will be given after presenting the necessary preliminaries in the following section:
3.5 Innovations of the Sequence of Optimal Relative Costs

This section contains the main technical tool that will be used to establish Theorem 3.4.1. The necessary result concerns properties of the sequence of innovations associated to {h(X_t)}, which is introduced below. Throughout the remainder of this chapter Assumptions 3.2.1 and 3.4.1 are supposed to be valid even without explicit reference, and (g, h(·)) ∈ IR × C(S) stands for the pair satisfying the optimality equation (3.7) as described in Lemma 3.3.2. Next, for each positive integer n, let F_n be the σ-field generated by the states observed and actions applied up to time n:

F_n := σ(X_t, A_t, 0 ≤ t ≤ n),  n = 1, 2, 3, . . . ,  (3.21)

and observe that for each initial state x and π ∈ P, the Markov property of the decision process yields that

E_x^π[h(X_n) | F_{n−1}] = ∑_{y∈S} p_{X_{n−1} y}(A_{n−1}) h(y).  (3.22)
Definition 3.5.1. The process {Y_k, k ≥ 1} of innovations associated to the sequence of observed optimal relative costs {h(X_k), k ≥ 1} is given by

Y_n = h(X_n) − ∑_{y∈S} p_{X_{n−1} y}(A_{n−1}) h(y),  n = 1, 2, 3, . . . .

Now, let x ∈ S and π ∈ P be arbitrary but fixed, and notice that combining the definition above with (3.21) and (3.22), it follows that (i) Y_n is F_n-measurable, and (ii) the innovations Y_n can be written as

Y_n = h(X_n) − E_x^π[h(X_n) | F_{n−1}],  (3.23)

and then, Y_n is uncorrelated with the σ-field F_{n−1} with respect to P_x^π, that is,

E_x^π[Y_n W] = 0 if W is F_{n−1}-measurable and Y_n W is P_x^π-integrable  (3.24)

([2]). The following is the main result of this section.

Theorem 3.5.1. Suppose that Assumptions 3.2.1 and 3.4.1 hold and let the pair (g, h(·)) ∈ IR × C(S) be as in Lemma 3.3.2. In this context, for each initial state x ∈ S and π ∈ P, the following convergences hold:
lim_{n→∞} h(X_n)/n = 0  P_x^π-a.s.  (3.25)

and

lim_{n→∞} (1/n) ∑_{k=1}^{n} Y_k = 0  P_x^π-a.s.  (3.26)
This theorem will be established using two elementary facts stated in the following lemmas. The first one is a criterion for almost sure convergence, which is a consequence of the first Borel–Cantelli lemma [2].

Lemma 3.5.1. Let {W_n} be a sequence of random variables defined on a probability space (Ω, F, P). In this case, if ∑_{n=1}^{∞} P[|W_n| > ε] < ∞ for each ε > 0, then lim_{n→∞} W_n = 0 P-a.s.

The second result involved in the proof of Theorem 3.5.1 is the following inequality by Kolmogorov.

Lemma 3.5.2. If n and k are two positive integers such that n > k, then for every α > 0,

P_x^π[ max_{r: k≤r≤n} |∑_{t=k}^{r} Y_t| ≥ α ] ≤ (1/α²) ∑_{t=k}^{n} E_x^π[Y_t²].

This classical result is established as Theorem 22.4 in [2] for the case in which the Y_n's are independent. In the present context, from the relations (3.27)–(3.29) below, it follows that E_x^π[Y_n²] is always finite, and then, (3.24) yields that, if n > k, then E_x^π[Y_n Y_k I[G]] = 0 for every G ∈ F_k; from this last observation, the same arguments as in the aforementioned book allow one to establish Lemma 3.5.2.

Now, let ℓ be a Lyapunov function for the cost function C such that ℓ^β admits a Lyapunov function for some β > 2, as ensured by Assumption 3.4.1. Applying Lemma 3.3.1 to the cost function ℓ^β, it follows that there exists a function b : S → (0, ∞) such that

(1/(n+1)) E_x^π[ ∑_{t=0}^{n} ℓ²(X_t) ] ≤ (1/(n+1)) E_x^π[ ∑_{t=0}^{n} ℓ^β(X_t) ] ≤ b(x),  x ∈ S,  (3.27)
where the first inequality is due to the fact that ℓ(·) ≥ 1.

Proof of Theorem 3.5.1. Let ε > 0 be arbitrary and notice that, by Lemma 3.3.2(ii),

|h(·)| ≤ c ℓ(·),  (3.28)

where c = 1 + ℓ(z). Combining this relation with Markov's inequality, it follows that, for each positive integer n,

P_x^π[|h(X_n)/n| > ε] ≤ E_x^π[|h(X_n)|^β] / (n^β ε^β) ≤ c^β E_x^π[ℓ(X_n)^β] / (n^β ε^β)
and then, (3.27) yields that

P_x^π[|h(X_n)/n| > ε] ≤ c^β (n + 1) b(x) / (n^β ε^β) ≤ 2 c^β b(x) / (n^{β−1} ε^β);

since β > 2, it follows that ∑_{n=1}^{∞} P_x^π[|h(X_n)/n| > ε] < ∞, and the convergence (3.25) follows from Lemma 3.5.1. Next, using that the (unconditional) variance of a random variable is an upper bound for the expectation of its conditional variance [19], from (3.23), it follows that

E_x^π[Y_t²] = E_x^π[(h(X_t) − E_x^π[h(X_t) | F_{t−1}])²] ≤ E_x^π[(h(X_t) − E_x^π[h(X_t)])²] ≤ E_x^π[h(X_t)²] ≤ c² E_x^π[ℓ(X_t)²];  (3.29)
see (3.28) for the last inequality. This fact and Lemma 3.5.2 together lead to

P_x^π[ max_{r: k≤r≤n} |∑_{t=k}^{r} Y_t| > α ] ≤ (c²/α²) ∑_{t=k}^{n} E_x^π[ℓ(X_t)²],

and then, (3.27) yields that

P_x^π[ max_{r: k≤r≤n} |∑_{t=k}^{r} Y_t| > α ] ≤ c² (n + 1) b(x) / α²,  α > 0,  n > k ≥ 1.  (3.30)
Using this relation with k = 1, n = m², and α = ε m², it follows that

q_m := P_x^π[ |m^{−2} ∑_{t=1}^{m²} Y_t| > ε ] ≤ P_x^π[ max_{r: 1≤r≤m²} |∑_{t=1}^{r} Y_t| > m² ε ] ≤ c² (m² + 1) b(x) / (ε² m⁴).

Therefore, ∑_{m=1}^{∞} q_m < ∞, and recalling that ε > 0 is arbitrary, an application of Lemma 3.5.1 implies that

lim_{m→∞} (1/m²) ∑_{t=1}^{m²} Y_t = 0  P_x^π-a.s.  (3.31)

On the other hand, given a positive integer m, from the inclusion
{ max_{j: 0≤j≤2m} (m² + j)^{−1} |∑_{t=m²}^{m²+j} Y_t| ≥ ε } ⊂ { max_{j: 0≤j≤2m} |∑_{t=m²}^{m²+j} Y_t| ≥ m² ε },

it follows that

p_m := P_x^π[ max_{j: 0≤j≤2m} (m² + j)^{−1} |∑_{t=m²}^{m²+j} Y_t| ≥ ε ]
     ≤ P_x^π[ max_{j: 0≤j≤2m} |∑_{t=m²}^{m²+j} Y_t| ≥ m² ε ]
     = P_x^π[ max_{r: m²≤r≤(m+1)²−1} |∑_{t=m²}^{r} Y_t| ≥ m² ε ]
     ≤ c² (m + 1)² b(x) / (ε² m⁴),

where the last inequality was obtained from (3.30) with n = (m + 1)² − 1, k = m², and α = m² ε. Thus, ∑_{m=1}^{∞} p_m < ∞, and then, Lemma 3.5.1 implies that

lim_{m→∞} max_{j: 0≤j≤2m} (m² + j)^{−1} |∑_{t=m²}^{m²+j} Y_t| = 0  P_x^π-a.s.  (3.32)

To conclude, let n be a positive integer and let m be the integral part of √n, so that

n = m² + i,  0 ≤ i ≤ 2m.

Assume that i is positive and notice that in this case

|(1/n) ∑_{t=1}^{n} Y_t| ≤ (1/n) |∑_{t=1}^{m²} Y_t| + (1/(m² + i)) |∑_{t=m²+1}^{m²+i} Y_t|,

and then,

|(1/n) ∑_{t=1}^{n} Y_t| ≤ (1/m²) |∑_{t=1}^{m²} Y_t| + max_{j: 0≤j≤2m} (1/(m² + j)) |∑_{t=m²+1}^{m²+j} Y_t|,

a relation that is also valid when i = 0, that is, if n = m². Taking the limit as m goes to +∞ in both sides of this last inequality, the convergences in (3.31) and (3.32) together imply that (3.26) holds.
3.6 Proof of Theorem 3.4.1

In this section, a criterion for the sample-path average optimality of a policy will be derived from Theorem 3.5.1 and that result will be used to establish Theorem 3.4.1. The arguments use the following notation:

Definition 3.6.1. The discrepancy function Φ : IK → IR associated to the pair (g, h(·)) in Lemma 3.3.2 is defined by

Φ(x, a) := C(x, a) + ∑_{y∈S} p_{x y}(a) h(y) − h(x) − g.
Notice that Φ is a continuous mapping, by Assumption 3.2.1 and Lemma 3.3.2(iv). Also, observe that the optimality equation (3.7) yields that
Φ (x, a) ≥ 0,
(x, a) ∈ IK.
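For a finite model in which a pair (g, h) is available (exactly or approximately), the discrepancy function is a one-line computation, and checking Φ(x, f(x)) = 0 for every x is precisely the verification required in (3.8). The helper below is a minimal sketch; the dictionary-based data layout (C, P, h keyed by states and actions) is a hypothetical convention, not part of the text.

```python
def discrepancy(C, P, h, g, x, a):
    """Phi(x, a) = C(x, a) + sum_y p_{xy}(a) * h(y) - h(x) - g, as in Definition 3.6.1.

    C[(x, a)]: one-step cost; P[(x, a)]: dict mapping next states y to p_{xy}(a);
    h[x]: relative cost; g: optimal average cost.
    """
    return C[(x, a)] + sum(p * h[y] for y, p in P[(x, a)].items()) - h[x] - g
```

With these conventions, a stationary policy f given as a dict would satisfy (3.8) when `discrepancy(C, P, h, g, x, f[x])` vanishes (up to rounding) for every state x.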
Lemma 3.6.1. Suppose that Assumptions 3.2.1 and 3.4.1 hold. In this context, a policy π* ∈ P is sample-path average optimal if and only if

lim_{n→∞} (1/n) ∑_{t=0}^{n−1} Φ(X_t, A_t) = 0  P_x^{π*}-a.s.  (3.33)

Proof. It will be verified that for all x ∈ S and π ∈ P,

lim_{n→∞} (1/n) ∑_{t=0}^{n−1} [C(X_t, A_t) − g − Φ(X_t, A_t)] = 0  P_x^π-a.s.  (3.34)

Assuming that this relation holds, the desired conclusion can be established as follows: Observing that

(1/n) ∑_{t=0}^{n−1} C(X_t, A_t) = (1/n) ∑_{t=0}^{n−1} [C(X_t, A_t) − g − Φ(X_t, A_t)] + g + (1/n) ∑_{t=0}^{n−1} Φ(X_t, A_t),

and taking the inferior limit as n goes to ∞ in both sides of this equality, the nonnegativity of the discrepancy function and (3.34) together imply that the relation

lim inf_{n→∞} (1/n) ∑_{t=0}^{n−1} C(X_t, A_t) ≥ g,  P_x^π-a.s.

is always valid. Thus, by Definition 3.4.1, π* ∈ P is sample-path average optimal if and only if

lim_{n→∞} (1/n) ∑_{t=0}^{n−1} C(X_t, A_t) = g  P_x^{π*}-a.s.,  x ∈ S,
a property that, by (3.34) with π ∗ instead of π , is equivalent to the criterion (3.33).
Thus, to conclude the argument, it is sufficient to verify the statement (3.34). To achieve this goal, notice that the definition of the discrepancy function yields that the following equality is always valid for t ≥ 1:

C(X_{t−1}, A_{t−1}) − g − Φ(X_{t−1}, A_{t−1}) = h(X_{t−1}) − ∑_{y∈S} p_{X_{t−1} y}(A_{t−1}) h(y),

a relation that, via the specification of the innovation Y_t in Definition 3.5.1, leads to

C(X_{t−1}, A_{t−1}) − g − Φ(X_{t−1}, A_{t−1}) = h(X_{t−1}) − h(X_t) + Y_t.

Therefore,

∑_{t=1}^{n} [C(X_{t−1}, A_{t−1}) − g − Φ(X_{t−1}, A_{t−1})] = h(X_0) − h(X_n) + ∑_{t=1}^{n} Y_t,

and then, for every initial state X_0 = x and π ∈ P,

(1/n) ∑_{t=1}^{n} [C(X_{t−1}, A_{t−1}) − g − Φ(X_{t−1}, A_{t−1})] = h(x)/n − h(X_n)/n + (1/n) ∑_{t=1}^{n} Y_t,  P_x^π-a.s.,

where the equality P_x^π[X_0 = x] = 1 was used; from this point, (3.34) follows directly from Theorem 3.5.1.

Proof of Theorem 3.4.1. From Definition 3.6.1 and (3.8), it follows that Φ(x, f(x)) = 0 for every state x. Thus, using that A_t = f(X_t) when the system is running under the policy f, it follows that, for every initial state x and t ∈ IN, the equality Φ(X_t, A_t) = Φ(X_t, f(X_t)) = 0 holds with probability 1 with respect to P_x^f. Therefore, the criterion (3.33) is satisfied by f, and then, Lemma 3.6.1 yields that the policy f is sample-path average optimal.
3.7 Approximation Schemes and Sample-Path Optimality

In the remainder of this chapter the sample-path optimality of Markovian policies is analyzed. The interest in this problem stems from the fact that an explicit solution (g, h(·)) of the optimality equation (3.7) is seldom available, and in this case, the sample-path average optimal policy f in (3.8) cannot be determined. When a solution of the optimality equation is not at hand, an iterative approximation procedure is implemented and (i) approximations {(g_n, h_n(·))}_{n∈IN} for (g, h(·)) are generated, and (ii) a stationary policy f_n is obtained from (g_n, h_n(·)). Such a policy f_n is "nearly optimal" in the sense that, for each fixed x, the convergence Φ(x, f_n(x)) → 0 occurs as n → ∞, and the next objective is to establish the sample-path average optimality of the Markovian policy {f_n}.
Remark 3.7.1. Two procedures that can be used to approximate the solution of the optimality equation and to generate a Markovian policy {f_n} such that the f_n's are nearly optimal are briefly described below; for details see, for instance, [1, 9, 11–13], or [17].

(i) The discounted method. For each α ∈ (0, 1), the total expected α-discounted cost at the state x under π ∈ P is given by V_α(x, π) := E_x^π[∑_{t=0}^{∞} α^t C(X_t, A_t)], whereas V_α^*(x) := inf_{π∈P} V_α(x, π), x ∈ S, is the α-optimal value function, which satisfies the optimality equation

V_α^*(x) = inf_{a∈A(x)} [ C(x, a) + α ∑_{y∈S} p_{x y}(a) V_α^*(y) ],  x ∈ S.

Now let {α_n} ⊂ (0, 1) be a sequence increasing to 1, define (g_n, h_n) := ((1 − α_n) V_{α_n}^*(z), V_{α_n}^*(·) − V_{α_n}^*(z)), and let the policy f_n be such that

V_{α_n}^*(x) = C(x, f_n(x)) + α_n ∑_{y∈S} p_{x y}(f_n(x)) V_{α_n}^*(y),  x ∈ S.  (3.35)
(ii) Value iteration. This procedure approximates the solution (g, h(·)) of the optimality equation (3.7) using the total cost criterion over a finite horizon. For each n ∈ IN \ {0} let J_n(x, π) be the total cost incurred when the system runs during n steps under policy π starting at x, that is, J_n(x, π) := E_x^π[∑_{t=0}^{n−1} C(X_t, A_t)], and let J_n^*(x) := inf_{π∈P} J_n(x, π) be the corresponding optimal value function; the sequence {J_n^*(·)} satisfies the relation

J_n^*(x) = inf_{a∈A(x)} [ C(x, a) + ∑_{y∈S} p_{x y}(a) J_{n−1}^*(y) ],  x ∈ S,  n = 1, 2, 3, . . . ,  (3.36)

where J_0^*(·) = 0, so that the functions J_n^*(·) are determined recursively, which is an important feature of the method. The approximations to (g, h(·)) are given by

(g_n, h_n(·)) := (J_n^*(z) − J_{n−1}^*(z), J_n^*(·) − J_n^*(z)),

whereas the policy f_n is such that f_n(x) is a minimizer of the term within brackets in (3.36):

J_n^*(x) = C(x, f_n(x)) + ∑_{y∈S} p_{x y}(f_n(x)) J_{n−1}^*(y),  x ∈ S,  n = 1, 2, 3, . . .
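For a model with finitely many states and actions, the recursion (3.36) and the extraction of (g_n, h_n, f_n) can be coded directly. The sketch below is a minimal illustration under that finiteness assumption; the dictionary-based inputs (states, actions, C, P, distinguished state z) are a hypothetical data layout, not notation from the text.

```python
def value_iteration(states, actions, C, P, z, n_iter):
    """Recursion (3.36) with J_0* = 0, returning (g_n, h_n, f_n) as in Remark 3.7.1(ii).

    actions[x]: finite action set A(x); C[(x, a)]: one-step cost;
    P[(x, a)][y]: transition probability p_{xy}(a); z: state used to normalize h_n.
    """
    J_prev = {x: 0.0 for x in states}
    g_n, h_n, f_n = 0.0, dict(J_prev), {}
    for _ in range(n_iter):
        J = {}
        for x in states:
            best_a, best_v = None, float("inf")
            for a in actions[x]:
                v = C[(x, a)] + sum(p * J_prev[y] for y, p in P[(x, a)].items())
                if v < best_v:
                    best_a, best_v = a, v
            J[x], f_n[x] = best_v, best_a
        g_n = J[z] - J_prev[z]                    # g_n = J_n*(z) - J_{n-1}*(z)
        h_n = {x: J[x] - J[z] for x in states}    # h_n = J_n*(.) - J_n*(z)
        J_prev = J
    return g_n, h_n, f_n
```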
Under Assumptions 3.2.1 and 3.4.1(i), the approximations (gn , hn (·)) generated by the discounted method converge pointwise to (g, h(·)), and the policies fn in (3.35) are nearly optimal in the sense that Φ (x, fn (x)) → 0 as n → ∞. Similar
conclusions hold for the value iteration scheme if, additionally, the transition law satisfies that

p_{x x}(a) > 0,  a ∈ A(x),  x ∈ S;

this requirement can be avoided if the transformation by Schweitzer (1971) is applied to the transition law and the value iteration method is applied to the transformed model; see, for instance, [7] or [8].

The following theorem establishes a sufficient condition for the sample-path optimality of a Markovian policy.

Theorem 3.7.1. Suppose that Assumptions 3.2.1 and 3.4.1 hold, and let f = {f_t} be a Markov policy such that

lim_{n→∞} Φ(x, f_n(x)) = 0,  x ∈ S,  (3.37)
where Φ is the discrepancy function introduced in Definition 3.6.1. In this case, the policy f is sample-path average optimal; see Definition 3.4.1. The proof of this result relies on some consequences of Theorem 3.4.1 which will be analyzed below.
3.8 Tightness and Uniform Integrability

This section contains the technical preliminaries that will be used to establish Theorem 3.7.1. The necessary results are concerned with properties of the sequence of empirical measures, which is now introduced.

Definition 3.8.1. The random sequence {ν_n} of empirical measures associated with the state-action process {(X_t, A_t)} is defined by

ν_n(B) := (1/n) ∑_{t=0}^{n−1} δ_{(X_t, A_t)}(B),  B ∈ B(IK),  n = 1, 2, 3, . . . ,

where δ_k stands for the Dirac measure concentrated at k, that is, δ_k(B) = 1 if k ∈ B and δ_k(B) = 0 when k ∉ B. Notice that this specification yields that, for each positive integer n and D ∈ C(IK),

ν_n(D) := ∫_{IK} D(k) ν_n(dk) = (1/n) ∑_{t=0}^{n−1} D(X_t, A_t).
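Computationally, integrating a function D against the empirical measure ν_n is nothing more than a Cesàro average along the observed trajectory; the two-line helper below (argument names are a hypothetical convention) is all that the identity above amounts to.

```python
def empirical_integral(trajectory, D):
    """nu_n(D) for trajectory = [(X_0, A_0), ..., (X_{n-1}, A_{n-1})]."""
    return sum(D(x, a) for x, a in trajectory) / len(trajectory)
```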
The main result of this section concerns the asymptotic behavior of {ν_n} and involves the following notation: Given a set S̃ ⊂ S, for each D ∈ C(IK), define the new function D_{S̃} ∈ C(IK) as follows: D_{S̃}(x, a) := max{|D(x, a)|, 1} I_{S̃}(x),
(x, a) ∈ IK.
(3.38)
Theorem 3.8.1. Suppose that Assumptions 3.2.1 and 3.4.1 hold and let x ∈ S and π ∈ P be arbitrary but fixed. In this context, for each ε > 0, there exists a finite set Fε ⊂ S such that lim sup νn (CS\Fε ) ≤ ε n→∞
Pxπ -a.s.;
(3.39)
see (3.38). Remark 3.8.1. For a positive integer r, let F1/r be the set corresponding to ε = 1/r in the above theorem and define the event Ω ∗ by
Ω∗ : =
∞
lim sup νn (CS\F1/r ) ≤ 1/r . n→∞
r=1
(i) Let the set IKr ⊂ IK be given by IKr = {(x, a) ∈ IK | x ∈ F1/r }.
(3.40)
With this notation, IKr is a compact set, since F1/r is finite, and (3.38) yields that CS\F1/r (x, a) ≥ 1 for (x, a) ∈ IK \ IKr , so that
νn (IK \ IKr ) ≤
IK
CS\F1/r (k)νn (dk) = νn (CS\F1/r ),
and then, lim sup νn (IK \ IKr ) ≤ lim sup νn (CS\F1/r ). n→∞
n→∞
Consequently, along a sample trajectory {(Xt , At )} in Ω ∗ the corresponding sequence {νn } satisfies lim supn→∞ νn (IK \ IKr ) ≤ 1/r for every r > 0 so that {νn } is tight. (ii) Given a probability measure μ defined in B(IK), a function D ∈ C (IK) is integrable with respect to μ if, and only if, for each positive integer r, there ˜ r such that exists a compact set IK ˜r IK\IK
|D(k)|μ (dk) ≤ 1/r.
Notice now that (3.38) and (3.40) yield that CS\F1/r (x, a) ≥ |C(x, a)| for (x, a) ∈ IK \ IKr , so that IK\IKr
|C(k)|νn (dk) ≤
Thus,
lim sup
IK\IKr
n→∞
IK\IKr
CS\F1/r (k)νn (dk) = νn (CS\F1/r ).
|C(k)|νn (dk) ≤ lim sup νn (CS\F1/r ), n→∞
and then, along a sample trajectory in
lim sup n→∞
IK\IKr
Ω ∗,
|C(k)|νn (dk) ≤ 1/r,
r = 1, 2, 3, . . . ,
showing that the cost function C is uniformly integrable with respect to the family {νn } of empirical measures. (iii) Since Theorem 3.8.1 yields that Pxπ [Ω ∗ ] = 1 for every x ∈ S and π ∈ P, the previous discussion can be summarized as follows: Regardless of the initial state and the policy used to drive the system, the following assertions hold with probability 1: (a) The sequence {νn } is tight and (b) the cost function is uniformly integrable with respect to {νn }. The proof of Theorem 3.8.1 relies on the following lemma. Lemma 3.8.1. Let ε > 0 be arbitrary, and suppose that Assumptions 3.2.1 and 3.4.1 hold, and let gS\F be the optimal expected average cost associated with the cost function −CS\F , that is, gS\F : = inf J(x, π , −CS\F ),
(3.41)
π ∈P
where J(x, π , −CS\F ) is given by the right-hand side of (3.2) with the function −CS\F instead of C. With this notation, there exists a finite set F ⊂ S such that gS\F ≥ −ε .
(3.42)
Proof. Let {Fk } be a sequence of finite subsets of S such that Fk ⊂ Fk+1 ,
k = 1, 2, 3, . . .,
and
∞
Fk = S;
(3.43)
k=1
from (3.38), it follows that −CS\Fk 0 as k ∞, a property that via (3.41) immediately yields that gS\Fk ≤ gS\Fk+1 ≤ 0,
k = 1, 2, 3, . . . ,
so that {gS\Fk } is a convergent sequence; set g¯ : = lim gS\Fk .
(3.44)
k→∞
To establish the desired conclusion, it is sufficient to show that g¯ = 0, since in this case (3.42) occurs when F is replaced by Fk with k large enough. To verify the above equality, let be a Lyapunov function for the function C and notice that Lemma 3.3.3 yields that, for some constant c > 0, the mapping c is a Lyapunov function for −CS\Fk , and then, this last function also satisfies Assumption 3.4.1. Thus, from Lemma 3.3.2 applied to the cost function CS\Fk , it follows that there exists a function hk : S → IR as well as a policy fk ∈ F such that gS\Fk + hk (x) = −CS\Fk (x, fk (x)) + ∑ px y ( fk (x))hk (y),
x ∈ S,
(3.45)
y∈S
where hk (·) ≤ c (·), that is, hk (·) ∈ ∏[c (x), c (x)].
(3.46)
x∈S
Using the fact that the right-hand side of this inclusion as well as F are compact metric spaces, it follows that there exists a sequence {kr } of positive integers increasing to ∞ such that the following limits exist: f¯(x) : = lim fkr (x), r→∞
¯ : = lim hk (x), h(x) r r→∞
x ∈ S.
Next, observe that (3.38) and (3.43) together yield that for each state x, CS\Fk (x, fk (x)) = 0 when k is large enough, whereas, via Proposition 2.18 in p. 232 of [18], the continuity property in Definition 3.3.1(ii) and (3.46) leads to lim
r→∞
¯ ∑ px y ( fkr (x))hkr (y) = ∑ px y ( f¯(x))h(y).
y∈S
y∈S
Replacing k by kr in (3.45) and taking the limit as r goes to ∞ in both sides of the resulting equation, (3.44) and the three last displays allow to write that ¯ g¯ + h(x) =
¯ ∑ px y ( f¯(x))h(y),
y∈S
x ∈ S.
¯ Starting from this relation, an induction argument yields that (n + 1)g¯ + h(x) = f¯ ¯ Ex [h(Xn+1 )] for every x ∈ S and n ∈ IN, that is, g¯ =
¯ 1 h(x) ¯ ¯ Exf [h(X , n+1 )] − n+1 n+1
and taking the limit as n goes to ∞, the inclusion (3.46) and Lemma 3.3.1(ii) together imply that g¯ = 0; as already mentioned, this completes the proof of the lemma. Proof of Theorem 3.8.1. Recalling that Assumption 3.4.1 is in force, let be a Lyapunov function for the cost function C such that β admits a Lyapunov function for some β > 2. As already noted, for some constant c > 0, the function c is a Lyapunov function for −CS\F , a fact that immediately implies that this last function also satisfies Assumption 3.4.1. Therefore, applying Theorem 3.4.1 with the cost function −CS\F instead of C, it follows that for every x ∈ S and π ∈ P, lim inf νn (−CS\F ) ≥ gS\F , n→∞
Pxπ -a. s. ,
and selecting F as the finite set in Lemma 3.8.1, it follows that lim inf νn (−CS\F ) ≥ −ε , n→∞
Pxπ -a. s. ,
a statement that is equivalent to (3.39).
3.9 Proof of Theorem 3.7.1 In this section a proof of the sample-path average optimality of a Markovian policy satisfying condition (3.37) will be presented. The argument combines Theorem 3.8.1 with the following lemma. Lemma 3.9.1. Let the Markovian policy f = { ft } be such that (3.37) holds, and consider a fixed sample trajectory {(Xt , At )} along which properties (i) and (ii) below hold: (i) The sequence {νn } of empirical measures is tight. (ii) At = f (Xt ) for all t ∈ N. In this context, if ν ∗ is a limit point of the sequence {νn } in the weak convergence topology, then ν ∗ is supported on IK∗ = {(x, a) ∈ IK| Φ (x, a) = 0}, that is,
ν ∗ (IK∗ ) = 1.
(3.47)
Proof. For each x ∈ S and ε > 0, define the set IK(x, ε) := {(x, a) | a ∈ A(x) and Φ(x, a) > ε}, which is an open subset of IK, since the function Φ is continuous and S is endowed with the discrete topology. In this case,
δ(Xt ,At ) (IK(x, ε )) = δ(Xt , ft (Xt )) (IK(x, ε )) = 1 ⇐⇒ Xt = x and Φ (x, ft (x)) > ε . Therefore, using condition (3.37), it follows that there exists an integer N > 0 such that δ(Xt ,At ) (IK(x, ε )) = 0, t > N, so that
νn (IK(x, ε )) =
1 n−1 1 N δ (IK(x, ε )) = ∑ (Xt ,At ) ∑ δ(Xt , ft (Xt )) (IK(x, ε )), n t=0 n t=0
n > N,
by Definition 3.8.1, and it follows that νn (IK(x, ε )) → 0 as n → ∞. Therefore, recalling that IK(x, ε ) is an open subset of IK, the fact that ν ∗ is a limit point of the sequence {νn } implies that
ν ∗ (IK(x, ε )) ≤ lim sup νn (IK(x, ε )) = 0, n→∞
x ∈ S,
ε >0
[2]. Finally, using that IK \ IK∗ = x∈S, r∈IN IK(x, 1/r) (because of the nonnegativity of Φ ), the above inequality leads to ν ∗ (IK \ IK∗ ) = 0. Proof of Theorem 3.7.1. It will be proved that the discrepancy function Φ satisfies Assumption 3.4.1.
(3.48)
Assuming that this assertion holds, the conclusion of Theorem 3.7.1 can be obtained as follows: Let f = { ft } be a Markovian policy satisfying the property (3.37) and, for S˜ ⊂ S, define
Φ(S) ˜ (x, a) : = Φ (x, a)IS˜ (x),
(x, a) ∈ IK,
(3.49)
so that the equality
Φ = Φ(S) ˜ + Φ(S\S) ˜
(3.50)
is always valid. Next, given ε > 0, observe the following facts (a) and (b): (a) The property (3.48) allows to apply Theorem 3.8.1 with the cost function C replaced by Φ to conclude that there exists a finite set Fε ⊂ S such that for each x ∈ S, the relation lim supn→∞ νn (ΦS\Fε ) ≤ ε occurs almost surely with respect
to Pxf ; since (3.49) and (3.38) together imply that Φ(S\Fε ) ≤ ΦS\Fε , it follows that lim sup νn (Φ(S\Fε ) ) ≤ ε n→∞
Pxf -a.s.
(3.51)
(b) Let {(Xt , At )} be a fixed sample trajectory along which (i) {νn } is tight, and (ii) At = ft (Xt ). Now select a sequence {nk } of positive integers such that limk→∞ nk = ∞ and lim νnk (Φ(Fε ) ) = lim sup νn (Φ(Fε ) ).
k→∞
n→∞
Because of the tightness of {νn }, taking a subsequence—if necessary—it can be assumed that {νnk } converges weakly to some ν ∗ ∈ IP(IK), and in this case, observing that Φ(Fε ) is continuous and has compact support (since Fε ⊂ S is finite), it follows that lim νnk (Φ(Fε ) ) = ν ∗ (Φ(Fε ) );
k→∞
on the other hand, using that ν ∗ is supported in the set IK∗ specified in (3.47), by Lemma 3.9.1, and that ΦFε is null on that set (see (3.47) and (3.49)), it follows that ν ∗ (ΦFε ) = 0. Combining this equality with the two last displays, it follows that lim supn→∞ νn (Φ(Fε ) ) = 0 along a sample-path satisfying conditions (i) and (ii) above. Observing that condition (i) holds almost surely with respect to Pxf , by Remark 3.8.1, and that Pxf [At = ft (Xt )] = 1 for all t, it follows that lim sup νn (Φ(Fε ) ) = 0 n→∞
Pxf -a.s. ,
a relation that, combined with (3.50), (3.51) and the nonnegativity of Φ , yields that the convergence limn→∞ νn (Φ ) = 0 occurs with probability 1 with respect to Pxf , and then, the policy f is sample-path average optimal, by Lemma 3.6.1. Thus, to conclude the argument, it is sufficient to verify (3.48). To achieve this goal, let be a Lyapunov function for the cost function C such that β admits a Lyapunov function for some β > 2, and recall that the solution (g, h(·)) of the optimality Equation (3.7) satisfies |h(·)| ≤ c(·) (3.52) for some c > 0, as well as h(z) = 0. Combining this last equality with the specification of the discrepancy function, it follows that, for all (x, a) ∈ IK, h(x) = C(x, a) − g − Φ (x, a) + ∑ px y (a)h(y). y =z
(3.53)
On the other hand, Lemma 3.3.3 yields that |C(·) − g| admits a Lyapunov function of the form c1 where c1 > 0 so that c1 (x) ≥ |C(x, a) − g| + 1 +
∑
y∈S, y
=z
px y (a)c1 (y);
multiplying both sides of this relation by a constant c2 satisfying c2 c1 > c and c2 > 1,
(3.54)
it follows that c2 c1 (x) ≥ c2 |C(x, a) − g| + c2 +
∑
y∈S, y =z
px y (a)c2 c1 (y),
and then, c2 c1 (x) ≥ |C(x, a) − g| + 1 +
∑
y∈S, y
=z
px y (a)c2 c1 (y).
Combining this inequality with (3.53), it is not difficult to obtain that ˜ ≥ 1 + Φ (x, a) + (x)
∑
˜ px y (a)(y),
(x, a) ∈ IK,
(3.55)
y∈S, y =z
˜ : = c1 c2 (·) − h(·) ≥ 0 and the inequality follows from (3.52) and (3.54). where (·) Recalling that Φ is nonnegative, the above display yields that that ˜ takes values in [1, ∞) and that ˜ satisfies the first requirement for being a Lyapunov function for ˜ ≤ c˜ , and then, ˜ inherits the Φ ; setting c˜ : = c1 c2 + c > 0, (3.53) implies that (·) second and third properties in Definition 3.3.1 from the corresponding ones of . Thus, ˜ is a Lyapunov function for Φ , and using that ˜β ≤ c˜β β , it follows that ˜β also admits a Lyapunov function, by Lemma 3.3.3. Thus, the statement (3.48) holds, and as already mentioned, this concludes the proof of Theorem 3.7.1. Acknowledgment With sincere gratitude and appreciation, the authors dedicate this work to Professor On´esimo Hern´andez-Lerma on the occasion of his 65th anniversary, for his friendly and generous support and clever guidance.
References

1. Arapostathis A., Borkar V.K., Fernández-Gaucherand E., Ghosh M.K., Marcus S.I.: Discrete time controlled Markov processes with average cost criterion: a survey. SIAM J. Control Optim. 31, 282–344 (1993)
2. Billingsley P.: Probability and Measure, 3rd edn. Wiley, New York (1995)
3. Borkar V.K.: On minimum cost per unit of time control of Markov chains. SIAM J. Control Optim. 21, 652–666 (1984)
4. Borkar V.K.: Topics in Controlled Markov Chains. Longman, Harlow (1991)
5. Cavazos-Cadena R., Hernández-Lerma O.: Equivalence of Lyapunov stability criteria in a class of Markov decision processes. Appl. Math. Optim. 26, 113–137 (1992)
6. Cavazos-Cadena R., Fernández-Gaucherand E.: Denumerable controlled Markov chains with average reward criterion: sample path optimality. Math. Method. Oper. Res. 41, 89–108 (1995)
7. Cavazos-Cadena R., Fernández-Gaucherand E.: Value iteration in a class of average controlled Markov chains with unbounded costs: necessary and sufficient conditions for pointwise convergence. J. App. Prob. 33, 986–1002 (1996)
8. Cavazos-Cadena R.: Adaptive control of average Markov decision chains under the Lyapunov stability condition. Math. Method. Oper. Res. 54, 63–99 (2001)
9. Hernández-Lerma O.: Adaptive Markov Control Processes. Springer, New York (1989)
10. Hernández-Lerma O.: Existence of average optimal policies in Markov control processes with strictly unbounded costs. Kybernetika 29, 1–17 (1993)
11. Hernández-Lerma O., Lasserre J.B.: Value iteration and rolling horizon plans for Markov control processes with unbounded rewards. J. Math. Anal. Appl. 177, 38–55 (1993)
12. Hernández-Lerma O., Lasserre J.B.: Discrete-Time Markov Control Processes: Basic Optimality Criteria. Springer, New York (1996)
13. Hernández-Lerma O., Lasserre J.B.: Further Topics on Discrete-Time Markov Control Processes. Springer, New York (1999)
14. Hordijk A.: Dynamic Programming and Potential Theory (Mathematical Centre Tract 51). Mathematisch Centrum, Amsterdam (1974)
15. Hunt F.Y.: Sample path optimality for a Markov optimization problem. Stoch. Proc. Appl. 115, 769–779 (2005)
16. Lasserre J.B.: Sample-path average optimality for Markov control processes. IEEE T. Automat. Contr. 44, 1966–1971 (1999)
17. Montes-de-Oca R., Hernández-Lerma O.: Value iteration in average cost Markov control processes on Borel spaces. Acta App. Math. 42, 203–221 (1994)
18. Royden H.L.: Real Analysis, 2nd edn. MacMillan, New York (1968)
19. Shao J.: Mathematical Statistics. Springer, New York (1999)
20. Vega-Amaya O.: Sample path average optimality of Markov control processes with strictly unbounded costs. Applicationes Mathematicae 26, 363–381 (1999)
Chapter 4
Approximation of Infinite Horizon Discounted Cost Markov Decision Processes Franc¸ois Dufour and Tom´as Prieto-Rumeau
4.1 Introduction Markov decision processes (MDPs) constitute a general family of controlled stochastic processes suitable for the modeling of sequential decision-making problems. They appear in many fields such as, for instance, engineering, computer science, economics, operations research, etc. A significant list of references on discrete-time MDPs may be found in the survey [2] and the books [1, 4, 8, 10, 11, 17, 18]. The analysis of MDPs leads to a large variety of interesting mathematical and computational problems. The corresponding theory has reached a rather high degree of maturity although the classical tools, such as value iteration, policy iteration, linear programming, and their various extensions, are generally hardly applicable in practice. Hence, solving MDPs numerically is an awkward and important problem mainly because MDPs are generally very large due to their inherent structure, and so, solving the associated dynamic programming equation leads to the well known curse of dimensionality. In order to meet this challenge, different approaches have emerged, which can be roughly classified in three classes of techniques: they are based on: (i) The approximation of the control model via state aggregation or discretization so as to provide a simpler model (ii) The approximation of the value function through adaptation of classical techniques (such as dynamic programming or value iteration)
(iii) The approximation of the optimal policy by developing efficient policy improvement steps This chapter belongs to class (i) of the methods described above. Our objective is to present an approximation procedure which transforms an infinite horizon discounted cost MDP with general state and action spaces, and possibly unbounded cost function, into a simpler model (with finite state and action spaces, and, consequently, bounded cost function). We show that the optimal value function and the optimal policies of the MDP under consideration can be approximated by the corresponding optimal value function and the optimal policies of the finite MDP. Most interestingly, explicit bounds on the approximation errors are derived. Moreover, it is well known that for infinite horizon discounted cost MDPs, the optimal policy is deterministic and stationary. It must be pointed out that our approximation procedure preserves this important property by providing a suboptimal deterministic stationary policy with guaranteed approximation error. There exists a huge literature regarding the approaches related to items (ii) and (iii), for example, reinforcement learning, neuro-dynamic programming, approximate dynamic programs, and simulation-based techniques, to name just a few. Without attempting to present an exhaustive panorama on numerical methods for MDPs, we suggest that the interested reader consults the survey [20] and the books [5,6,16, 19] and the references therein, to get a rather complete view of this research field. It is important to stress the fact that the technique developed herein is rather different to those described in (ii) and (iii) above. Indeed, the approaches related to items (ii) and (iii) are related to probabilistic approximation techniques, whereas those of item (i) are related to numerical approximation methods. Moreover, the methods of items (ii) and (iii) mainly deal with MDPs with finite state and/or finite action spaces, and, more importantly, with bounded cost functions; see, for example, the books [5, 6, 16, 19]. These two classes of approximation techniques can be roughly described as follows: (a) Numerical approximation techniques, which give actual bounds or approximations. To fix ideas, suppose that V ∗ is the optimal value that we want to approximate. In a numerical approximation scheme, given some ε > 0, we find | < ε . Such techniques can be found in, for some (number) V such that |V ∗ − V example, [9, 12, 14, 21]. (b) Probabilistic approximation techniques, which provide approximations that converge in a suitable probabilistic sense, or bounds that are satisfied with probability “close” to 1. They are usually based on simulating sample paths of the controlled process. In this setting, for given ε > 0, we find a random variable V such that E|V ∗ −V | < ε (convergence in L1 ), or such that P{|V ∗ −V | > ε } < δ for some small δ (convergence in probability). We note that, in practice, we do not obtain V , but, rather, we observe a realization of the random variable V . The probabilistic techniques in (b) are usually more efficient—in computational terms—than the numerical methods in (a), and they even succeed to break the curse of dimensionality. The price to pay, however, is that we approximate a given
value, say V ∗ , by a random observation V . Depending on the application under consideration, the fact that the methods in (a) provide a deterministic estimate with guaranteed bounds can be more attractive than the stochastic estimates with bounds valid in mean given by the approaches (b). Clearly, the techniques in (a) and (b) are of a different nature and, so, hard to compare mainly because the approximation is deterministic in case (a) and stochastic in (b). The discretization procedure (or state aggregation) to convert an MDP into a simpler optimization problem can be traced back to the middle of the 1970s; see [15, Sect. 2.3] for a rather complete account of the works developed at that time. In particular, the discretization procedure has been analyzed in a general context in [12,14,21] and for specific MDPs in [3,9]. It is important to point out the differences between the results obtained in our work and those presented in the literature. In [14], the author studies the approximation of the original decision model by means of a sequence of MDPs. Under mild hypotheses, such as continuous convergence of the data, a convergence result is established. However, as pointed in [14, p. 494], no rates of convergence are given mainly due to the fact that the problem under study is fairly general. The approximation of dynamic programs in the general framework of monotone contraction operator models is developed in [12, 21]; the analysis is then particularized to a classical MDP model. In [21], the author focuses on infinite horizon problems, while in [12], the author addresses the problem of finite-stage models. This approach is based on the construction of partitions of the state and action spaces. The corresponding convergence results and the error bounds are established under general hypotheses. As mentioned in [21, p. 236], however, this approach is “of limited practical value because the optimal return function is used” for the corresponding construction. Moreover, and more importantly, the convergence of the optimal value functions associated to the approximating models depends heavily on an appropriate and relevant choice of the partitions. In [12], the problem of how to construct the approximating decision model (and, therefore, the associated partitions) is not discussed. Nevertheless, it is mentioned that this is an important problem and that a theoretical guideline should be provided (see the introduction of [12, Sect. 4]). In [21], this problem is emphasized too, and under continuity and compactness assumptions, a convergence result is obtained (Theorem 5.2). In [3, 9], the discretization procedure of the classical MDP model is addressed under the hypothesis that the state and action spaces are compact, and the cost function is bounded. Their approaches are constructive in the sense that they are applicable in practice. Our framework here is clearly more general than the one studied in [3,9]. Indeed, we deal with an MDP with general state space (namely, a locally compact Borel space) and compact action sets, while the cost function is allowed to be unbounded. Such extensions are clearly not straightforward to obtain, and they need a careful and sharp analysis of the partitioning of the state and action spaces. Compared to [12, 14, 21], we impose stronger conditions though, as opposed to [14], we are able to derive explicit error bounds for our approximation scheme. 
Furthermore, rather than embedding our MDP into the general theory of monotone contraction operators as in [12, 21], our idea is to take into account the characteristic features of the MDP
under consideration so as to provide a more constructive solution (with the important property that it is applicable in practice). In particular, our approach provides an explicit construction of the partitions of the state and action spaces. As mentioned before, this important problem is not discussed in [12] and it is just briefly studied in [21] for a very specific model (with compact state and action spaces). Consequently, our results are different, and complementary, to those obtained in [12,21]. Moreover, the analysis in [12, 21] is done by assuming that the cost function is bounded, whereas it is briefly explained how the results could be extended to the unbounded case. We claim, however, that these extensions are not straightforward and they depend again on an appropriate and relevant choice of the partitions of the state and action spaces. The rest of the chapter is organized as follows. In Sect. 4.2, we introduce the MDP model we are interested in and state our main assumptions. Some useful preliminary results are proved in Sect. 4.3. We give our first main result in Sect. 4.4, on the approximation of the optimal value function. An approximation of a discount optimal policy is presented in Sect. 4.5. Our conclusions are stated in Sect. 4.6.
4.2 Definition of the Control Model In this section, we introduce the discrete-time Markov control model we are concerned with and state our main assumptions. We follow closely the notation of Chap. 2 in [10]. Let us consider the five tuple for a Markov control model M = (X, A, (A(x), x ∈ X), P, c), where: (a) The state space X is a locally compact Borel space. The metric on X and the family of the Borel subsets of X will be respectively denoted by dX and B(X). (b) The action space A is a locally compact Borel space. The metric on A will be denoted by dA . In the family of closed subsets of A, we will consider the Hausdorff metric, defined as dH (C1 ,C2 ) := sup inf {dA (a, b)} ∨ sup inf {dA (a, b)} a∈C1 b∈C2
b∈C2 a∈C1
= sup {dA(a,C2 )} ∨ sup {dA (C1 , b)} a∈C1
b∈C2
for closed C1 ,C2 ⊆ A (where ∨ stands for “maximum”). We know that dH is a metric, except that it might not be finite. By B(A), we will denote the family of Borel subsets of A. (c) The set of feasible actions at state x ∈ X is A(x), assumed to be a nonempty measurable subset of X. We also suppose that IK := {(x, a) ∈ X × A : a ∈ A(x)} is a measurable subset of X × A, which contains the graph of a measurable function from X to A.
(d) The transition probability function, which is given by the stochastic kernel P on X given IK. This means that B → P(B|x, a) is a probability measure on (X, B(X)) for every (x, a) ∈ IK and, in addition, that (x, a) → P(B|x, a) is measurable on IK for every B ∈ B(X). (e) The measurable cost-per-stage function c : IK → IR. The family of stochastic kernels ϕ on A given X such that ϕ (A(x)|x) = 1 for all x ∈ X is denoted by Φ . Let F be the family of measurable functions f : X → A satisfying that f (x) ∈ A(x) for all x ∈ X. (By (c) above, the set F is nonempty.) In our next definition, we introduce the class of admissible policies for the decision-maker. We use the notation IN, which stands for the set of nonnegative integers. Definition 4.2.1. Let H0 := X and Hn := IK × Hn−1 for n ≥ 1. A control policy is a sequence π = {πn }n∈IN of stochastic kernels πn on A given Hn satisfying that πn (A(xn )|hn ) = 1 for all hn ∈ Hn and n ∈ IN, where hn := (x0 , a0 , . . . , xn−1 , an−1 , xn ) (note that, on Hn , we are considering the product σ -algebra). Let Π be the class of all policies. A policy π = {πn }n∈IN is said to be deterministic if there exists a sequence { fn } in F such that πn (·|hn ) = δ fn (xn ) (·) for all n ∈ IN and hn ∈ Hn , where δa denotes the Dirac probability measure supported on a ∈ A. Finally, if the deterministic policy π = {πn }n∈IN is such that πn (·|hn ) = δ f (xn ) (·) for some f ∈ F , then we say that π is a deterministic stationary policy. The set of such policies is identified with F . Let (Ω , F ) be the canonical space consisting of the set of all sample paths Ω = (X × A)∞ = {(xt , at )}t∈IN and the associated product σ -algebra F . Here, {xt }t∈IN is the state process and {at }t∈IN stands for the action process. For every policy π ∈ Π and any initial distribution ν on X, there exists a unique probability measure Pνπ on (Ω , F ) such that, for any B ∈ B(X), C ∈ B(A), and ht ∈ Ht with t ∈ IN, Pνπ (x0 ∈ B) = ν (B),
Pνπ (at ∈ C|ht ) = πt (C|ht ),
Pνπ (xt+1 ∈ B|ht , at ) = P(B|xt , at ).
(For such a construction, see, e.g., [10, Chap. 2].) The expectation with respect to Pνπ is denoted by Eνπ . If ν = δx for some x ∈ X, then we will write Pxπ and Exπ in lieu of Pνπ and Eνπ , respectively. Next, we define our infinite horizon discounted cost Markov control problem. Suppose that 0 < α < 1 is a given discount factor. For any initial state x ∈ X, the total expected discounted cost (or discounted cost, for short) of a control policy π ∈ Π is ∞ J(x, π ) := Exπ ∑ α t c(xt , at ) . t=0
(In the sequel, we will impose conditions ensuring that this discounted cost is finite.) The optimal discounted cost function is then defined as J(x) := inf J(x, π ) for x ∈ X, π ∈Π
and we say that a policy π ∗ ∈ Π is discount optimal if J(x, π ) = J(x) for every x ∈ X. We introduce some more notation. Given t ≥ 1, an initial state x ∈ X, and a control policy π ∈ Π , let t−1 Jt (x, π ) := Exπ ∑ α k c(xk , ak ) , k=0
that is, Jt (x, π ) is the discounted cost of the policy π on the first t decision epochs. Also, let Jt (x) := inf Jt (x, π ) for x ∈ X. π ∈Π
Our assumptions on the control model use the notion of Lipschitz continuity. Given two metric spaces (X1 , ρ1 ) and (X2 , ρ2 ), and a function g : X1 → X2 , we say that g is L-Lipschitz continuous on X1 (for some so-called Lipschitz constant L > 0) if
ρ2 (g(x), g(y)) ≤ L · ρ1 (x, y) for every x and y in X1 . Typically, we will denote the Lipschitz constant of a Lipschitz continuous function g by Lg . If X1 is the product of two metric spaces: (X11 , ρ11 ) and (X12 , ρ12 ), then, when referring to Lipschitz continuity on X1 := X11 × X12 , the metric considered in X1 is the sum of ρ11 and ρ12 . We state our assumptions on the control model. Assumption A. (A1) as:4:A1For each state x ∈ X, the action set A(x) is compact. (A2) The multifunction Ψ from X to A, defined as Ψ (x) := A(x), is LΨ -Lipschitz continuous with respect to the Hausdorff metric, i.e., for some constant LΨ > 0 and every x, y ∈ X, dH (A(x), A(y)) ≤ LΨ · dX (x, y). (The sets A(x) being closed—Assumption (A1)—,the Hausdorff metric is indeed well defined.) (A3) There exists a continuous function w : X → [1, ∞) and a positive constant c such that |c(x, a)| ≤ cw(x) for all (x, a) ∈ IK. Moreover, the cost function c is Lc -Lipschitz continuous on IK. Before proceeding with our assumptions, we introduce some more notation. Given a function v : X → IR, we define Pv : IK → IR as
Pv(x, a) := X
v(y)P(dy|x, a) for (x, a) ∈ IK,
provided that the corresponding integrals are well defined and finite. The class of measurable functions v : X → IR such that ||v||w := supx∈X {|v(x)|/w(x)} is finite, with w as in Assumption (A3), denoted by Bw (X), is a Banach space and || · ||w is
indeed a norm (called the w-norm). Let Lw (X) be the family of functions in Bw (X) which, in addition, are Lipschitz continuous. We proceed with Assumption A. (A4.i) The function Pw is continuous on IK. In addition, there exists d > 0 such that Pw(x, a) ≤ dw(x) for all (x, a) ∈ K. (A4.ii) The stochastic kernel P is weakly continuous, meaning that Pv is continuous on IK for every bounded and continuous function v on X. (A4.iii) There exists a constant LP > 0 such that for every (x, a) and (y, b) in IK, and v ∈ Lw (X), with Lipschitz constant Lv , |Pv(x, a) − Pv(y, b)| ≤ LP Lv [dX (x, y) + dA (a, b)]. In this case, the stochastic kernel P is said to be LP -Lipschitz continuous. (A5) The discount factor α satisfies α d < 1 and α LP (1 + LΨ ) < 1. Remark 4.2.1. Lipschitz continuity of Ψ in Assumption (A2) and the fact that the action sets are compact (Assumption (A1)) imply that Ψ is a continuous multifunction from X to A; see Lemma 2.6 in [7]. Given x ∈ X and f ∈ F , the notation c(x, f ) := c(x, f (x)), P(·|x, f ) := P(·|x, f (x)), and Pv(x, f ) := Pv(x, f (x)) will be used in the sequel.
4.3 Preliminary Results In this section, we state some technical lemmas that will be useful in the rest of the chapter. First, we derive some properties of the Bellman operator. Lemma 4.3.1. Suppose that Assumption A is satisfied, and define the operator T as Tu(x) := min c(x, a) + α u(y)P(dy|x, a) for x ∈ X, a∈A(x)
X
where u ∈ Lw (X). (i) If u ∈ Lw (X), then Tu ∈ Lw (X) and LTu = (Lc + α LP Lu )(1 + LΨ ). (ii) The operator T is a contraction mapping with modulus α d. (iii) T has a unique fixed point u∗ ∈ Lw (X), with ||u∗ ||w ≤ c/(1 − α d). The fixed point u∗ is the limit (in the w-norm) of {T n u0 }n≥1 for every u0 ∈ Lw (X). Proof.
Part (i). Given x and y in X, we have that |Tu(x) − Tu(y)| ≤ sup
inf |Δ (x, a, y, b)| ∨ sup
b∈B(y) a∈A(x)
inf |Δ (x, a, y, b)|,
a∈A(x) b∈B(y)
where Δ (x, a, y, b) := c(x, a) − c(y, b) + α (Pu(x, a) − Pu(y, b)) for (x, a) and (y, b) in IK. By Lipschitz continuity of the cost function c and the stochastic kernel P, we obtain |Δ (x, a, y, b)| ≤ (Lc + α LP Lu ) · (dX (x, y) + dA(a, b)). Finally, recalling the definition of the Hausdorff metric, we conclude that |Tu(x) − Tu(y)| ≤ (Lc + α LP Lu ) · (dX (x, y) + dH (A(x), A(y))) ≤ [(Lc + α LP Lu )(1 + LΨ )] · dX (x, y). Hence, Tu ∈ Lw (X) and its Lipschitz constant is (Lc + α LP Lu )(1 + LΨ ). Part (ii). Proceeding as in Part (i), given u, v ∈ Lw (X), and x ∈ X, we have |Tu(x) − T v(x)| ≤ α sup P|u − v|(x, a) a∈A(x)
≤ α ||u − v||w sup Pw(x, a) ≤ α d||u − v||ww(x), a∈A(x)
by Assumption (A4.i), and the stated result follows. Part (iii). Based on Banach’s fixed point theorem, it is easily seen that, given arbitrary u0 ∈ Lw (X), the sequence {T n u0 }n∈IN is Cauchy in the w-norm. Hence, it converges to some u∗ ∈ Bw (X). Observe now that, as a consequence of Part (i) and Assumption (A5), the Lipschitz constant Ln of T n u0 verifies lim Ln =
n→∞
Lc (1 + LΨ ) . 1 − α LP(1 + LΨ )
Hence, since for every n ∈ IN and x, y ∈ X we have |T n u0 (x) − T n u0 (y)| ≤ Ln dX (x, y), it follows that |u∗ (x) − u∗ (y)| ≤
Lc (1 + LΨ ) · dX (x, y), 1 − α LP (1 + LΨ )
so that u∗ ∈ Lw (X). The fact that u∗ is the unique fixed point of T in Bw (X) follows from standard arguments. Finally, letting u0 ≡ 0, by Assumption (A3), we have that ||Tu0 ||w ≤ c, and so we obtain ||u∗ ||w ≤ ||u∗ − Tu0 ||w + ||Tu0||w ≤ α d||u∗ ||w + c, and ||u∗ ||w ≤ c/(1 − α d) follows.
2
Under Assumption A, the Assumptions 8.5.1, 8.5.2, and 8.5.3 in [11] are satisfied. Then, as a direct consequence of Lemma 4.3.1 and the results in [11, Sect. 8.3.B], we derive the following facts. The Lipschitz continuity of the optimal discounted cost functions, established below, can be found also in [13, Theorem 4.1].
Theorem 4.3.1. If Assumption A is satisfied, then the following statements hold. (i) For every t ≥ 1, the optimal discounted cost Jt in the first t decision epochs is T t 0 (that is, the t-times composition of T with itself starting from J0 ≡ 0). In particular, Jt ∈ Lw (X) with LJt =
Lc (1 + LΨ ) Lc (1 + LΨ )(1 − (α LP (1 + LΨ ))t ) ≤ 1 − α LP(1 + LΨ ) 1 − α LP(1 + LΨ )
and ||Jt ||w ≤ c(1 − (α d)t )/(1 − α d). (ii) The optimal discounted cost function J is in Lw (X), and it is the unique fixed point of the operator T in Bw (X); its Lipschitz constant is LJ =
Lc (1 + LΨ ) 1 − α LP (1 + LΨ )
(4.1)
and ||J||w ≤ c/(1 − α d). (iii) The sequence {Jt }t∈IN converges to J in the w-norm. In addition, ||Jt − J||w ≤ c(α d)t /(1 − α d)
for every t ≥ 0.
Our next result is a well-known fact, and it will be useful in the forthcoming. Lemma 4.3.2. Suppose that Assumption A is verified, and let f ∈ F and v ∈ Bw (X) be such that v(x) ≥ c(x, f ) + α
X
v(y)P(dy|x, f )
for all x ∈ X.
(4.2)
Then v(x) ≥ J(x, f ) for every x ∈ X. Proof. Integrating, recursively, the inequality (4.2) with respect to P(dy|x, f ), we obtain, for every n ≥ 1, v(x) ≥ Jn (x, f ) + α n Exf [v(xn )] for x = x0 ∈ X.
(4.3)
Now, on the one hand, |Exf [v(xn )]| ≤ ||v||w d n w(x) by Assumption (A4.i), and thus, since α d < 1, lim α n Exf [v(xn )] = 0.
n→∞
On the other hand, f Ex ∑ α t c(xt , at ) ≤ cw(x) ∑ (α d)t , t≥n
t≥n
and consequently, limn→∞ Jn (x, f ) = J(x, f ). Taking the limit as n → ∞ in (4.3), we obtain the desired result. 2
The proof of our next result can be found in Lemma 2.9 in [7]. Lemma 4.3.3. We suppose that Assumption A holds. Let K0 be an arbitrary compact subset of X and fix ε > 0. There exists a sequence {Kt }t∈IN of compact subsets of X such that, for all t ∈ IN,
sup
c (x,a)∈IKt Kt+1
w(y)P(dy|x, a) ≤ ε ,
where IKt := {(x, a) ∈ X × A : x ∈ Kt , a ∈ A(x)}.
4.4 Approximation of the Optimal Discounted Cost Function

In this section, we propose an approximation of the optimal discounted cost function J by using the fact that J is the fixed point of the operator T. Our main result is presented in Theorem 4.4.2.

Suppose that (Z, ρ) is a compact metric space and fix a constant η > 0. A finite subset Γ := {z_1, . . . , z_r} of Z is said to be associated to an η-partition of Z if there exists a finite measurable partition {Z_1, . . . , Z_r} of Z which satisfies the two conditions below.

(i) We have z_i ∈ Z_i for all 1 ≤ i ≤ r. The projection operator p_Γ : Z → Γ is then defined as p_Γ(z) := z_i if z ∈ Z_i.
(ii) For every z ∈ Z, it is ρ(z, p_Γ(z)) ≤ η.

Clearly, such partitions exist because Z is compact. Fix now arbitrary positive constants ε, β, and δ. In what follows, the constants ε, β, and δ remain fixed and, so, they will not be explicit in the notation. For ε > 0, consider the sequence of compact sets {K_t}_{t∈IN} that was constructed in Lemma 4.3.3. For every t ∈ IN, let Γ_t be a finite subset of K_t associated to a β-partition of K_t. Now, given x ∈ Γ_t for t ∈ IN, let Θ_t(x) be a finite subset of A(x) associated to a δ-partition of the compact metric action space A(x). Fix a time horizon N ≥ 1. The functions J_{N,t} on Γ_t, for 0 ≤ t ≤ N, are given by J_{N,N}(x) := 0 for x ∈ Γ_N and, for 0 ≤ t < N, by

J_{N,t}(x) := min_{a∈Θ_t(x)} [ c(x, a) + α ∑_{y∈Γ_{t+1}} J_{N,t+1}(y) P(p_{Γ_{t+1}}^{−1}{y} | x, a) ]  for x ∈ Γ_t.  (4.4)

The functions J_{N,t} are extended to K_t by letting J_{N,t}(x) := J_{N,t}(p_{Γ_t}(x)) for x ∈ K_t. Therefore, J_{N,N}(x) = 0 for all x ∈ K_N, and (4.4) becomes

J_{N,t}(x) = min_{a∈Θ_t(x)} [ c(x, a) + α ∫_{K_{t+1}} J_{N,t+1}(y) P(dy | x, a) ]  for x ∈ Γ_t.
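A minimal computational reading of this construction, written here for a one-dimensional state space with the sets K_t and A(x) taken to be intervals, is sketched below. All numerical conventions are hypothetical simplifications: the β- and δ-grids are uniform, the projection p_Γ is nearest-midpoint rounding, and the user must supply the cell probabilities P(p_{Γ_{t+1}}^{−1}{y} | x, a) through a function `P_cell`.

```python
import math

def make_grid(lo, hi, mesh):
    """Cell midpoints and cells of a uniform partition of [lo, hi] with mesh at most `mesh`."""
    n = max(1, math.ceil((hi - lo) / mesh))
    edges = [lo + i * (hi - lo) / n for i in range(n + 1)]
    mids = [(edges[i] + edges[i + 1]) / 2 for i in range(n)]
    return mids, list(zip(edges[:-1], edges[1:]))

def finite_horizon_values(K, A, c, P_cell, alpha, N, beta, delta):
    """Backward recursion (4.4) on beta-grids of K_t and delta-grids of A(x).

    K[t] = (lo, hi): interval taken to represent K_t;  A(x) = (lo, hi): interval representing A(x);
    c(x, a): cost;  P_cell(x, a, y_lo, y_hi): probability P([y_lo, y_hi] | x, a) of a partition cell.
    """
    grids, cells = [], []
    for t in range(N + 1):
        g, cl = make_grid(*K[t], beta)
        grids.append(g)
        cells.append(cl)
    J = {(N, i): 0.0 for i in range(len(grids[N]))}
    policy = {}
    for t in range(N - 1, -1, -1):
        for i, z in enumerate(grids[t]):
            best_a, best_v = None, float("inf")
            acts, _ = make_grid(*A(z), delta)          # Theta_t(z): a delta-grid on A(z)
            for a in acts:
                cont = sum(J[(t + 1, j)] * P_cell(z, a, lo, hi)
                           for j, (lo, hi) in enumerate(cells[t + 1]))
                v = c(z, a) + alpha * cont
                if v < best_v:
                    best_a, best_v = a, v
            J[(t, i)], policy[(t, i)] = best_v, best_a
    return J, policy, grids
```

The approximate value J_{N,0}(x_0) is then read off by projecting x_0 onto the grid of K_0, and `policy[(t, i)]` records the minimizing grid action, which is the ingredient used in Sect. 4.5.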
Lemma 4.4.4. If Assumption A holds, then, for every N ≥ 1 and 0 ≤ t ≤ N,
1 Lc (1 + LΨ ) αc sup |JN,t (x) − JN−t (x)| ≤ ·ε . · · (β + δ ) + 1−α 1 − α LP (1 + LΨ ) 1 − αd x∈Kt Proof. Throughout this proof, the integer N ≥ 1 remains fixed. To simplify the notation, in what follows, we will write HN,t := supx∈Kt |JN,t (x) − JN−t (x)| for 0 ≤ t ≤ N and we note that HN,N = 0. For 0 ≤ t < N and z ∈ Γt , we have JN,t+1 (y)P(dy|z, a) , JN,t (z) = min c(z, a) + α a∈Θt (z)
Kt+1
JN−t (z) = min c(z, b) + α JN−t−1 (y)P(dy|z, b) . b∈A(z)
X
Consequently, |JN,t (z) − JN−t (z)| is bounded above by max
inf |Δ (z, a, b)| ∨ sup min |Δ (z, a, b)|, b∈A(z) a∈Θt (z)
a∈Θt (z) b∈A(z)
where
Δ (z, a, b) = c(z, a) − c(z, b) + α
Kt+1
JN,t+1 (y)P(dy|z, a) − α PJN−t−1 (z, b).
Now, given z ∈ Γt , a ∈ Θt (z) and b ∈ A(z), we observe the following: Firstly, |c(z, a) − c(z, b)| ≤ Lc · dA (a, b).
(4.5)
|PJN−t−1 (z, b) − PJN−t−1 (z, a)| ≤ LP LJN−t−1 · dA(a, b).
(4.6)
Secondly,
Finally, |
Kt+1 JN,t+1 (y)P(dy|z, a) − PJN−t−1 (z, a)|
||JN−t−1 ||w
c Kt+1
w(y)P(dy|z, a) +
Kt+1
is less than or equal to
|JN,t+1 (y) − JN−t−1 (y)|P(dy|z, a),
which, by Lemma 4.3.3 and the fact that (z, a) ∈ IKt , is in turn bounded above by ||JN−t−1 ||w · ε + sup |JN,t+1 (y) − JN−t−1 (y)| = ||JN−t−1 ||w · ε + HN,t+1 . y∈Kt+1
Combining (4.5)–(4.7) yields that |Δ (z, a, b)| is bounded above by (Lc + α LP LJN−t−1 ) · dA (a, b) +
αc · ε + α HN,t+1 , 1 − αd
(4.7)
where we have used the fact that ||JN−t−1 ||w ≤ c/(1 − α d); see Theorem 4.3.1(i). On the other hand, also by Theorem 4.3.1 and recalling Lemma 4.3.1(i) and (4.1), (Lc + α LP LJN−t−1 ) ≤ (Lc + α LP LJN−t−1 )(1 + LΨ ) = LJN−t ≤ LJ . Finally, by the definition of Θt (z), max
inf dA (a, b) ∨ sup min dA (a, b) ≤ δ , b∈A(z) a∈Θt (z)
a∈Θt (z) b∈A(z)
so that, for all z ∈ Γt , |JN,t (z) − JN−t (z)| ≤ LJ · δ +
αc · ε + α HN,t+1 . 1 − αd
Suppose now that x ∈ Kt , and let z = pΓt (x) ∈ Γt . We have that |JN,t (x) − JN−t (x)| = |JN,t (z) − JN−t (x)| ≤ |JN,t (z) − JN−t (z)| + |JN−t (z) − JN−t (x)|, where |JN−t (z) − JN−t (x)| ≤ LJN−t dX (z, x) ≤ LJ · β . We conclude that, for 0 ≤ t < N, HN,t ≤ LJ · (δ + β ) +
αc · ε + α HN,t+1 , 1 − αd
with HN,N = 0. The bound on HN,t readily follows by induction: indeed, it holds for t = N, and we can prove recursively that it is satisfied for t = N − 1, . . . , 1, 0. 2 To alleviate the notation, we will write H :=
L (1 + L ) 1 αc c Ψ · (β + δ ) + · ·ε . 1 − α 1 − α LP(1 + LΨ ) 1 − αd
(4.8)
(As already mentioned, the constants β , δ , and ε are assumed to be fixed and, hence, we do not make them explicit in the notation H above.) Theorem 4.4.2. Suppose that the control model M satisfies Assumption A. Given arbitrary positive constants β , δ , and ε , consider the functions JN,t constructed above. For every N ≥ 1 and x ∈ K0 , |JN,0 (x) − J(x)| ≤ H +
c(α d)N · w(x). 1 − αd
Proof. By Theorem 4.3.1(iii), for every x ∈ K0, we have

    |JN(x) − J(x)| ≤ c(αd)^N/(1 − αd) · w(x).

The stated result is now a direct consequence of Lemma 4.4.4, with t = 0. □

As a consequence of Theorem 4.4.2, we have the following. Suppose that x0 ∈ X is an arbitrary initial state, and let η > 0 be a given precision. Letting K0 = {x0}, we can, a priori and explicitly, determine β, δ, ε, and N such that

    1/(1−α) · Lc(1 + LΨ)/(1 − α LP(1 + LΨ)) · (β + δ) + 1/(1−α) · αc/(1 − αd) · ε + c(αd)^N/(1 − αd) · w(x0) < η.

Indeed, all the constants involved in the above inequality are known (they depend on the Lipschitz constants of the control model M and on other constants introduced in Assumption A). Once we determine β, δ, ε, and N, we proceed with the constructive procedure described in this section and we compute explicitly JN,0(x0) such that |JN,0(x0) − J(x0)| < η.
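As an illustration of this a priori choice, the following sketch splits the target precision η equally among the three error terms and solves each resulting inequality. The equal split and all numerical values are arbitrary choices; the routine assumes αd < 1 and α LP(1 + LΨ) < 1, as in Assumption A.

```python
# Sketch of the a priori choice of beta, delta, epsilon and N, assuming the
# model constants (L_c, L_P, L_Psi, alpha, c, d and w(x0)) are known numbers;
# the variable names are placeholders, not the chapter's notation.
import math

def choose_parameters(eta, alpha, d, c, L_c, L_P, L_Psi, w_x0):
    """Split the target precision eta equally among the three error terms."""
    L_J = L_c * (1 + L_Psi) / (1 - alpha * L_P * (1 + L_Psi))   # Lipschitz bound for J
    beta_plus_delta = (eta / 3) * (1 - alpha) / L_J
    epsilon = (eta / 3) * (1 - alpha) * (1 - alpha * d) / (alpha * c)
    # smallest N with c*(alpha*d)**N * w(x0) / (1 - alpha*d) <= eta/3
    N = math.ceil(math.log(eta * (1 - alpha * d) / (3 * c * w_x0)) / math.log(alpha * d))
    return beta_plus_delta / 2, beta_plus_delta / 2, epsilon, max(N, 1)

# Example with made-up constants:
print(choose_parameters(eta=0.1, alpha=0.9, d=1.05, c=1.0,
                        L_c=1.0, L_P=0.5, L_Psi=0.2, w_x0=1.0))
```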
4.5 Approximation of a Discount Optimal Policy

In the previous section, we provided an approximation of the optimal discounted cost function J. In this section, we are concerned with the approximation of a discount optimal policy. There is a straightforward naive approach (that we discard): for N large enough, consider an approximation of the optimal policy of the finite horizon discounted control problem (see [7]); then, for the first N periods, use this policy, and from time N on, choose arbitrary actions. This policy is close to discount optimality, but it is not stationary. We know, however, that there exists a deterministic stationary policy that is discount optimal. Such a policy is obtained, loosely speaking, as the minimizer in the fixed point equation J = TJ. From the point of view of the decision-maker, it is more interesting to have at hand a deterministic stationary policy that is close to discount optimality, instead of a nonstationary one. Hence, our goal here is to provide such a deterministic stationary policy, based on our finite approximation scheme.

Given N ≥ 1, 0 ≤ t < N, and z ∈ Γt, we have

    JN,t(z) = min_{a∈Θt(z)} [ c(z, a) + α ∫_{Kt+1} JN,t+1(y) P(dy|z, a) ]
            = c(z, aN,t(z)) + α ∫_{Kt+1} JN,t+1(y) P(dy|z, aN,t(z))
for some aN,t(z) ∈ Θt(z). We define fN,t ∈ F as follows. Suppose that fΔ ∈ F is a fixed deterministic stationary policy, and let

    fN,t(x) := arg min_{a∈A(x)} dA(a, aN,t(pΓt(x)))   if x ∈ Kt,
    fN,t(x) := fΔ(x)                                   if x ∉ Kt.

(In particular, fN,t(z) = aN,t(z) if z ∈ Γt.) We derive from Proposition D.5 in [10] that such an fN,t exists and that it is indeed measurable.

Lemma 4.5.5. Suppose Assumption A holds. For every N ≥ 1, 0 ≤ t < N, and x ∈ Kt,

    J(x) + 2c(αd)^{N−t}/(1 − αd) · w(x) + 2εcα/(1 − αd) + LJ β + 2H ≥ c(x, fN,t) + α ∫_{Kt+1} J(y) P(dy|x, fN,t).
Proof. Throughout this proof, N ≥ 1 and 0 ≤ t < N remain fixed. Given x ∈ Kt, by Theorem 4.3.1(iii) and Lemma 4.4.4,

    J(x) ≥ JN−t(x) − c(αd)^{N−t}/(1 − αd) · w(x) ≥ JN,t(x) − c(αd)^{N−t}/(1 − αd) · w(x) − H,   (4.9)

where we use the notation introduced in (4.8). Letting z = pΓt(x), observe that JN,t(x) = JN,t(z) and also that

    JN,t(z) = c(z, fN,t) + α ∫_{Kt+1} JN,t+1(y) P(dy|z, fN,t),

where, for y ∈ Kt+1, we have JN,t+1(y) ≥ JN−t−1(y) − H, so that

    JN,t(z) ≥ c(z, fN,t) + α ∫_{Kt+1} JN−t−1(y) P(dy|z, fN,t) − H.   (4.10)

We define the function G on X as G(x) := c(x, fN,t(x)) + α PJN−t−1(x, fN,t(x)). By the Lipschitz continuity of c, JN−t−1, and the stochastic kernel P, given x ∈ Kt and letting z = pΓt(x), we obtain

    |G(x) − G(z)| ≤ (Lc + α LP LJN−t−1) · (dX(x, z) + dA(fN,t(x), fN,t(z))).
By definition,

    dA(fN,t(x), fN,t(z)) = inf_{a∈A(x)} dA(a, fN,t(z)) ≤ sup_{b∈A(z)} inf_{a∈A(x)} dA(a, b) ≤ dH(A(x), A(z)) ≤ LΨ · dX(x, z).

Hence, recalling that dX(x, z) ≤ β, this yields that

    |G(x) − G(z)| ≤ (Lc + α LP LJN−t−1)(1 + LΨ)β ≤ (Lc + α LP LJ)(1 + LΨ)β = LJ β.   (4.11)

Note also that, for all x ∈ Kt,

    c(x, fN,t) + α ∫_{Kt+1} JN−t−1(y) P(dy|x, fN,t) − G(x) ≤ εcα/(1 − αd)   (4.12)

because ||JN−t−1||w ≤ c/(1 − αd) (Theorem 4.3.1(i)). Adding and subtracting G(x) and G(z) in (4.10), we obtain from (4.11) and (4.12) that

    JN,t(z) + LJ β + 2εcα/(1 − αd) + H ≥ c(x, fN,t) + α ∫_{Kt+1} JN−t−1(y) P(dy|x, fN,t).   (4.13)

Recalling that ||JN−t−1 − J||w ≤ c(αd)^{N−t−1}/(1 − αd), the stated result is now derived from (4.9). □

Let C0 := K0 and, for t ≥ 1, let

    Ct := K0^c ∩ · · · ∩ K_{t−1}^c ∩ Kt.

Also, we define C∞ := ∩_{t∈IN} Kt^c. In plain words, if x ∈ ∪_{t∈IN} Kt, then x ∈ Ct when t is the first index for which x ∈ Kt. We have Kt ⊆ C0 ∪ · · · ∪ Ct. Then, C∞ is the complement of ∪_{t∈IN} Ct = ∪_{t∈IN} Kt. Clearly, C∞ and {Ct}t∈IN form a measurable partition of the state space X.

Given N ≥ 1, we define the following deterministic stationary policy fN ∈ F. If x ∈ Ct for some t ∈ IN, then fN(x) := fN+t,t(x), while if x ∈ C∞, then fN(x) := fΔ(x), where fΔ ∈ F is a fixed policy. Also, we define the function J̄ ∈ Bw(X) as

    J̄(x) := J(x)  if x ∈ ∪_{t∈IN} Ct,   and   J̄(x) := cw(x)/(1 − αd)  if x ∈ C∞.

Since, by Theorem 4.3.1(ii), ||J||w ≤ c/(1 − αd), it should be clear that

    J(x) ≤ J̄(x) ≤ cw(x)/(1 − αd)  for every x ∈ X,   (4.14)
and also that

    ∫_{K^c_{t+1}} J̄(y) P(dy|x, a) ≤ εc/(1 − αd)   for x ∈ Kt and a ∈ A(x).   (4.15)

Lemma 4.5.6. Suppose that Assumption A holds and fix N ≥ 1. If x ∈ ∪_{t∈IN} Ct, then

    J(x) + 2c(αd)^N/(1 − αd) · w(x) + 3εcα/(1 − αd) + LJ β + 2H ≥ c(x, fN) + α ∫_X J̄(y) P(dy|x, fN),

while if x ∈ C∞, then

    J̄(x) ≥ c(x, fN) + α ∫_X J̄(y) P(dy|x, fN).

Proof. Suppose that x ∈ Ct for some t ∈ IN. Since x ∈ Kt, it follows from Lemma 4.5.5 applied to the pair N + t, t that

    J(x) + 2c(αd)^N/(1 − αd) · w(x) + 2εcα/(1 − αd) + LJ β + 2H ≥ c(x, fN+t,t) + α ∫_{Kt+1} J(y) P(dy|x, fN+t,t).

Recall now that, by definition, fN+t,t(x) = fN(x). Note also that J̄(x) = J(x) for x ∈ Ct and J̄(y) = J(y) when y ∈ Kt+1 ⊆ C0 ∪ · · · ∪ Ct+1. Consequently,

    J(x) + 2c(αd)^N/(1 − αd) · w(x) + 2εcα/(1 − αd) + LJ β + 2H ≥ c(x, fN) + α ∫_{Kt+1} J̄(y) P(dy|x, fN).

Finally, the first statement of the lemma follows from (4.14) and (4.15). The inequality when x ∈ C∞ is derived from the bound (4.14) after some straightforward calculations. □

Theorem 4.5.3. Suppose that the control model M satisfies Assumption A. Given arbitrary positive constants β, δ, and ε, and N ≥ 1, consider the policy fN constructed above. For every x ∈ ∪_{t∈IN} Ct = ∪_{t∈IN} Kt, we have

    0 ≤ J(x, fN) − J(x) ≤ 2c(αd)^N/(1 − αd)² · w(x) + 3εcα/((1 − αd)(1 − α)) + (LJ β + 2H)/(1 − α).

Proof. As a consequence of Lemma 4.5.6,

    J̄(x) + 2c(αd)^N/(1 − αd) · w(x) + 3εcα/(1 − αd) + LJ β + 2H ≥ c(x, fN) + α ∫_X J̄(y) P(dy|x, fN)

for every x ∈ X. Hence, letting

    Ĵ(x) := J̄(x) + 2c(αd)^N/(1 − αd)² · w(x) + 3εcα/((1 − αd)(1 − α)) + (LJ β + 2H)/(1 − α)   for x ∈ X,

a direct calculation shows that the function Ĵ ∈ Bw(X) satisfies Ĵ(x) ≥ c(x, fN) + α PĴ(x, fN) for all x ∈ X. The stated result now follows from Lemma 4.3.2. □

As for Theorem 4.4.2, given some precision η > 0, we can—explicitly and a priori—determine constants β, δ, and ε, as well as N ≥ 1, such that

    2c(αd)^N/(1 − αd)² · w(x0) + 3εcα/((1 − αd)(1 − α)) + (LJ β + 2H)/(1 − α) < η.

Chapter 5
Reduction of Discounted Continuous-Time MDPs to Discrete-Time MDPs
E.A. Feinberg

Let (A, A) be a measurable space and X be a random variable with

    P{X > t} = exp( −∫₀^t λ(φ(s)) ds ),
where λ(a) is a nonnegative measurable function on A and φ is a measurable mapping of [0, ∞) to (A, A). The interpretation of this formula is that we deal with a fixed state: X is the time the process spends in this state until the jump, (A, A) is the action set, and φ is a policy that selects an action φ(t) when the process has spent time t in the state since the last jump into this state.
Define p(B) = P{φ(X) ∈ B}, B ∈ A, the probability that an action from the set B was used at the jump epoch. Let

    m(B) = E [ ∫₀^X 1{φ(s) ∈ B} ds ],   B ∈ A,

the expected time that actions from the set B were used until the jump. According to Feinberg [8, Theorem 1],

    m(B) = ∫_B p(da)/λ(a),   B ∈ A.   (5.1)

The measure m defines the expected total reward up to time X. If the reward rate under an action a is r(a), where r is a measurable function, then the expected total reward during the time interval [0, X] is

    R = ∫_A r(a) m(da).   (5.2)

If, instead of selecting actions φ(t) at times t ∈ [0, X], an action is selected randomly at time 0 according to the probability p and is followed until the jump epoch X, the expected total time m(B) during which actions a ∈ B are used is also given by (5.1). Since the expected total rewards until the jump and the distribution of the action selected at the jump epoch are the same for φ and for the policy that selects an action only at time 0 according to the probability p, these two policies yield the same expected total rewards. Of course, for multiple jumps, the analysis is more involved; it is carried out in [10], where it is shown that the corresponding policies have the same occupancy measures. This is done there without the assumption that the jump rates are uniformly bounded.

We conclude the introduction with a terminological remark. We write "occupancy measure," instead of "occupation measure" frequently used in the literature on MDPs, because the former provides a more adequate description of the corresponding mathematical object.
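Formula (5.1) is easy to test numerically. The following sketch (a toy example that is not part of the chapter) simulates the jump time X for a piecewise-constant mapping φ with two actions and compares the empirical value of m({a}) with p({a})/λ(a).

```python
# Monte Carlo check of (5.1), m(B) = \int_B p(da)/lambda(a), for a toy case:
# phi uses action 1 on [0, 1) and action 2 afterwards; rates are placeholders.
import random, math

lam = {1: 2.0, 2: 0.5}
phi = lambda s: 1 if s < 1.0 else 2

def sample_jump_time():
    # Invert P{X > t} = exp(-int_0^t lam(phi(s)) ds) for the piecewise-constant hazard.
    e = -math.log(1.0 - random.random())     # exponential(1) sample of the cumulative hazard
    if e <= lam[1] * 1.0:
        return e / lam[1]
    return 1.0 + (e - lam[1] * 1.0) / lam[2]

n = 200_000
p_hat = {1: 0.0, 2: 0.0}    # empirical distribution of phi(X)
m_hat = {1: 0.0, 2: 0.0}    # empirical expected time each action is in use
for _ in range(n):
    x = sample_jump_time()
    p_hat[phi(x)] += 1.0 / n
    m_hat[1] += min(x, 1.0) / n
    m_hat[2] += max(x - 1.0, 0.0) / n

for a in (1, 2):
    print(a, m_hat[a], p_hat[a] / lam[a])   # the two columns should nearly agree
```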
5.2 Definition of a CTMDP A CTMDP is defined by the multiplet {X, A, A(x), q(·|x, a), r(x, a), R(x, a, y)}, where: (i) X is the state space endowed with a σ -field X such that (X, X ) is a standard Borel space; that is, (X, X ) is isomorphic to a Borel subset of a Polish space. (ii) A is the action space endowed with a σ -field A such that (A, A ) is a standard Borel space.
(iii) A(x) are sets of actions available at x ∈ X. It is assumed that A(x) ∈ A for all x ∈ X and the set of feasible state-action pairs Gr(A) = {(x, a) : x ∈ X, a ∈ A(x)} is a Borel subset of X × A containing the graph of a Borel mapping from X to A. (iv) q(·|x, a) is a signed measure on (X, X ) taking nonnegative finite values on measurable subsets of X \ {x} and satisfying q(X|x, a) = 0, where (x, a) ∈ Gr(A). In addition, q(Γ |x, a) is a measurable function on Gr(A) for any Γ ∈ X . Let q(x, a) −q({x}|x, a). Since q(X|x, a) = 0, then q(X \ {x}|x, a) = q(x, a) and 0 ≤ q(x, a) < ∞ for all (x, a) ∈ Gr(A) because 0 ≤ q(X \ {x}|x, a) < ∞. (v) r(x, a) is a real-valued reward rate function that is assumed to be measurable on Gr(A); −∞ ≤ r(x, a) ≤ ∞. (vi) R(x, a, y) is the instantaneous reward collected if the process jumps from state x to state y, where y = x, and an action a is chosen in state x at the jump epoch. The function R(x, a, y) is assumed to be measurable; −∞ ≤ R(x, a, y) ≤ ∞. A signed measure is also called a kernel. A kernel q is called stable if q(x) ¯ supa∈A(x) q(x, a) < ∞, where q(x, a) −q({x}|x, a), x ∈ X. Set q(·|x, a) 0 for all (x, a) ∈ / Gr(A). Everywhere in this chapter the following assumption holds. Assumption 5.2.1 The transition kernel q is stable. Assumption 5.2.1 is needed to define and analyze the following two models considered in this chapter : the relaxed CTMDP and the MDP corresponding to the relaxed CTMDPs. It is also needed for continuity assumption in Sect. 5.5. Assumption 5.2.1 is not needed for Theorems 5.4.3 and 5.5.4, and for the statements of Theorems 5.4.1 and 5.4.2, if the two models mentioned above are excluded from their formulations. Adjoin the isolated points x∞ , a∞ to X and A, respectively, and define X∞ X ∪ {x∞ }, A∞ A ∪ {a∞ } as well as the σ -fields X∞ = σ (X , {x∞ }) and A∞ = σ (A , {a∞ }). Set A(x∞ ) a∞ and q(x∞ , a∞ ) 0. Let R+ (0, ∞), R0+ [0, ∞) and ¯ + respectively. ¯ + (0, ∞]. Let R+ and R¯ + be the Borel σ -fields of R+ and R R ¯ Define the basic measurable space (Ω , F ), where Ω = (X × R+ )∞ and F = (X × R¯ +)∞ . Following the construction of the jump Markov model in Kitaev [23], we briefly describe the stochastic process under study. For ω = {x0 , θ1 , x1 , . . .} ∈ Ω , define the random variables xn (ω ) = xn , θn+1 (ω ) = θn+1 , n ≥ 0, t0 = 0,tn = θ1 + θ2 + · · · + θn , n ≥ 1, t∞ = limn→∞ tn . Let ωn = {x0 , θ1 , . . . , θn , xn }, where n ≥ 0 (for n = 0, omit θ0 ) denote the history up to and including the nth jump. The jump process of interest is denoted by ξt (ω ):
    ξt(ω) := ∑_{n≥0} I{tn ≤ t < tn+1} xn + I{t∞ ≤ t} x∞.

The function ξt(ω) is piecewise continuous. Thus, the values ξt− are well defined when t < t∞.
A policy φ is defined by a sequence {φn, n = 0, 1, . . .} of measurable mappings of ((X × R+)^{n+1}, (X × R+)^{n+1}) → (A, A) such that φn(x0, θ1, x1, . . . , θn, xn, s) ∈ A(xn), where x0 ∈ X, xi ∈ X, and θi ∈ R+, for all i = 1, . . . , n, and s ∈ R+. A policy φ selects the actions

    at = ∑_{n≥0} I{tn < t ≤ tn+1} φn(x0, θ1, . . . , θn, xn, t − tn) + I{t∞ ≤ t} a∞,   t > 0.
We say that a policy φ is simple if φn (x0 , θ1 , . . . , θn , xn , s) = φn (xn ). Here, following the tradition, we slightly abuse notations by using the same notation φn on both sides of the last equality. Our use of the term “simple” is consistent with its use in Dynkin and Yushkevich [5], where it means a nonrandomized Markov policy for discrete time.We use this term both for discrete and continuous time. A policy is called switching simple if φn (x0 , θ1 , . . . , xn , s) = φn (xn , s). A policy φ is called deterministic if there exists a mapping φ : X → A such that φn (x0 , θ1 , . . . , θn , xn , s) = φ (xn ), n = 0, 1, . . ., for all xi ∈ X, where i = 0, . . . , n, for all θi ∈ R+ , where i = 1, . . . , n, and for all s ∈ R+ . Of course, φ is a measurable mapping and φ (x) ∈ A(x) for all x ∈ X. The same is true for mappings φn for a simple policy φ . Any policy φ defines the predictable random measure ν φ on (R+ × X, R+ × X ) defined by t∧t∞
ν φ (ω , (0,t] × Y ) =
0
q(Y \ {ξs−}|ξs− , as )ds,
t ∈ R+ , Y ∈ X .
According to Jacod [19, Theorem 3.6], this random measure and the initial probability distribution γ of the initial state x0 ∈ X uniquely define a marked point process x0, θ1, x1, θ2, . . . such that ν^φ is the compensator of its random measure. We denote by Pγ^φ and Eγ^φ the probabilities and expectations associated with this process. If γ({x}) = 1 for some x ∈ X, we shall write Px^φ and Ex^φ, respectively.

Let c+ = max{c, 0} and c− = min{c, 0} for any number c. For a positive discount rate α, consider the expected total discounted rewards

    V+(x, φ) = Ex^φ [ ∫₀^{t∞} e^{−αt} r+(ξt, at) dt + ∑_{n=1}^∞ e^{−αtn} R+(ξ_{tn−1}, a_{tn}, ξ_{tn}) ]

and

    V−(x, φ) = Ex^φ [ ∫₀^{t∞} e^{−αt} r−(ξt, at) dt + ∑_{n=1}^∞ e^{−αtn} R−(ξ_{tn−1}, a_{tn}, ξ_{tn}) ].
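For a deterministic stationary policy on a model with finitely many states and bounded rates, the sojourn time in a state x is exponential with parameter q(x, φ(x)) and the post-jump state is drawn from the normalized kernel, so the expected total discounted reward of the policy can be estimated by simulation. The sketch below uses an illustrative three-state model (all data are placeholders, not taken from the chapter) and truncates each trajectory after a fixed number of jumps.

```python
# Monte Carlo sketch of the expected total discounted reward of a deterministic
# stationary policy phi on a small finite CTMDP; the generator rates, rewards
# and the policy below are illustrative placeholders.
import random, math

alpha = 0.5
states, actions = [0, 1, 2], [0, 1]
phi = {0: 0, 1: 1, 2: 0}                           # deterministic stationary policy
q_rate = {(x, a): 1.0 + 0.5 * x + a for x in states for a in actions}
def next_state(x, a):                              # toy jump kernel: uniform over other states
    return random.choice([y for y in states if y != x])
r = lambda x, a: 1.0 + x - a                       # reward rate
R = lambda x, a, y: 0.1 * y                        # instantaneous reward at a jump

def simulate(x0, max_jumps=200):
    x, t, total = x0, 0.0, 0.0
    for _ in range(max_jumps):                     # truncation of the trajectory
        a = phi[x]
        tau = random.expovariate(q_rate[(x, a)])  # exponential sojourn time in x
        # discounted reward-rate contribution over [t, t + tau]
        total += r(x, a) * (math.exp(-alpha * t) - math.exp(-alpha * (t + tau))) / alpha
        t += tau
        y = next_state(x, a)
        total += math.exp(-alpha * t) * R(x, a, y) # instantaneous reward at the jump epoch
        x = y
    return total

print(sum(simulate(0) for _ in range(5000)) / 5000)   # estimate for initial state 0
```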
We follow the convention that (+∞) + (−∞) = −∞ throughout this chapter. For almost all problems considered in the literature, the following assumption (the so-called general convergence condition) holds:
V+ (x, φ ) < ∞
for any policy φ and for any state x ∈ X.
(5.3)
Everywhere in this chapter, we assume that the following weaker assumption holds. Assumption 5.2.2 For any policy φ and for any initial state x ∈ X, either V+ (x, φ ) < ∞ or V− (x, φ ) > −∞. The total expected discounted rewards are defined as V (x, φ ) = V+ (x, φ )+V− (x, φ ) = Exφ
∞
∑
n=1
tn
e
= Exφ
−α t
tn−1
t∞
e
−α t
0
∞
r(ξt , at )dt+ ∑ e
−α tn
R(ξtn−1 , atn , ξtn )
n=1
r(ξt , at )dt + e
−α tn
R(ξtn−1 , atn , ξtn ) ,
where the second and third equalities hold because of Assumption 5.2.2. Let Π be the set of all policies and V (x) = supφ ∈Π V (x, φ ). A policy φ is called optimal if V (x, φ ) = V (x) for all x ∈ X. Define ∗
r (x, a) =
r(x, a) +
X\{x} R(x, a, y)q(dy|x, a)
α + q(x, a)
,
x ∈ X, a ∈ A(x).
(5.4)
Following Feinberg [10], consider the occupancy measures φ M˜ x,n (Y, B) = Exφ
tn+1 tn
e−α t I{ξt ∈ Y, at ∈ B}(α + q(ξt , at ))dt,
(5.5)
where n = 0, 1, . . . , Y ∈ X , B ∈ A . Then, according to [10, Corollary 4.4 and (4.20)], V (x, φ ) =
∞
∑
n=0 X A
φ r∗ (z, a)M˜ x,n (dz, da).
(5.6)
In particular, (5.6) means that the introduced model is equivalent to the model without instant rewards at jump epochs and with the reward rate function rˆ(x, a) = r(x, a) +
R(x, a, y)q(dy|x, a), X\{x}
x ∈ X, a ∈ A(x).
(5.7)
For a standard Borel space (E, E ), denote by P(E) the set of probability measures on (E, E ). Let M (E) be the minimal σ -field on P(E) such that for any C ∈ E the function μ (C) defined on P(E) is measurable. Then (P(E), M (E)) is a standard Borel space. In particular, if (E, E ) is a Polish space, then (P(E), M (E)) is the Borel σ -field in the topology of weak convergence.
Relaxed CTMDP and relaxed policies. Consider a CTMDP with a stable kernel q. We relax the model by replacing the action set A with P(A), the action sets A(x) with P(A(x)), x ∈ X; the transition kernels q(Y |x, a) with q(Y ¯ |x, μ )
A(x)
q(Y |x, a)μ (da),
where μ ∈ P(A(x)); and the reward functions r and R with the functions r¯ and ¯ μ ) ≡ 0, respectively, where R(x, r¯(x, μ )
A(x)
rˆ(x, a)μ (da).
¯ \ {x}|x, μ ). Since q is stable, q(x, ¯ μ ) < ∞ for all x ∈ X Let again q(x, ¯ μ ) −q(X and μ ∈ P(A(x)). As shown in Feinberg [10], the relaxed CTMDP satisfies the same basic assumptions (i)–(vi) as the original model. So, we can consider a policy π for this CTMDP. A policy for the relaxed CTMDP is called a relaxed policy for the original CTMDP. If a policy is simple in the relaxed CTMDP, it is called a relaxed simple policy in the original CTMDP. Relaxed policies are usually called randomized policies in the literature. For a fixed initial state or initial distribution, a relaxed policy also defines a stochastic process up to time t∞ , and we denote by Pxπ and Exπ the probabilities and expectations for a relaxed policy π and initial state x. Let Π R be the set of all relaxed policies. If a relaxed policy π selects at any time instance and along any trajectory a measure δat concentrated at one point, then it can be interpreted as a usual policy. Following this interpretation, we write Π ⊆ Π R . Let πt (da|ω ) be the probability distribution of actions that a relaxed policy π selects at time t < t∞ along the trajectory ω . The corresponding measure is defined as πt (ω ). Here two different notations are used for the same objects because πt (B|ω ) = πt (ω )(B) for any B ∈ A . As above with the random variables ξt (ω ), we will not write ω explicitly whenever possible. Thus, we shall write πt (da) and πt instead of πt (da|ω ) and πt (ω ), respectively. The expected total discounted reward for a relaxed policy π and an initial state x is the expected total discounted reward for the relaxed CTMDP. It can be written as V (x, π ) = V+ (x, π ) + V−(x, π ). Here we use for the relaxed CTMDP the notations introduced for the original CTMDP. If either V+ (x, π ) < ∞ or V− (x, π ) > −∞, then V (x, π ) = Exπ
0
t∞
e
−α t
r¯(ξt , πt )dt
= Exπ
∞
∑
tn
n=1 tn−1
e
−α t
r¯(ξt , πt )dt ,
or, in a more explicit form, t ∞ ∞ π −α t π V (x, π ) = Ex e rˆ(ξt , a)πt (da)dt = Ex ∑ 0
A
tn
e
−α t
n=1 tn−1
A
rˆ(ξt , a)πt (da)dt .
We also recall that, following the convention in this chapter, V (x, π ) = −∞ when V+ (x, π ) = ∞ and V− (x, π ) = −∞. We also define V R (x) = supπ ∈Π R V (x, π ). A relaxed policy π is called optimal if V (x, π ) = V R (x) for all x ∈ X.
5.3 Definition of a Discrete-Time Total-Reward MDP A discrete-time MDP is defined by a multiplet {X, A, A(x), p(Y |x, a), r˜(x, a)}, where the state set X, action set A, and sets of available actions A(x), x ∈ X, have the same properties as in CTMDPs; p(·|x, a) is the transition probability from Gr(A) to X, that is, p(·|x, a) is a probability distribution on (X, X ) for any (x, a) ∈ Gr(A) and p(Y |x, a) is a measurable function on Gr(A) for any Y ∈ X ; r˜(x, a) is the one-step reward function measurable on Gr(A). A trajectory is a sequence x0 , a0 , x1 , a1 , . . . from (X × A)∞ . A policy σ is a sequence of transition probabilities σn , n = 0, 1, . . ., from (X × A)n × X to A such that σn (A(xn )|x0 , a0 , . . . , xn−1 , an−1 , xn ) = 1. A policy σ is called randomized Markov if σn (B|x0 , a0 , . . . , xn−1 , an−1 , xn ) = σn (B|xn ), xn ∈ X, n = 0, 1, . . . , for all B ∈ A . A simple policy φ is defined as a sequence of measurable mappings φn : X → A, n = 0, 1, . . . such that φn (x) ∈ A(x), x ∈ X. A deterministic (stationary deterministic) policy is defined as a measurable mapping from X to A satisfying φ (x) ∈ A(x) for all x ∈ X. This means that in state x, the action φ (x) is selected. Let Δ be the set of all policies. As usual, in view of the Ionesu Tulcea theorem [16], any initial state distribution γ and policy σ define a probability measure Pγσ on the sequence of trajectories. We denote by Eγσ the expectation with respect to Pγσ . We write Pxσ and Exσ , x ∈ X, instead of Pγσ and Eγσ , respectively, when γ ({x}) = 1. Let W+ (x, σ ) = Exσ
∞
∑ r˜+ (xn , an),
W− (x, σ ) = Exσ
n=0
n=0
and W (x, σ ) = W+ (x, σ ) + W− (x, σ ). If either W+ (x, σ ) < ∞ or W− (x, σ ) > −∞, then W (x, σ ) = Exσ
∞
∑ r˜− (xn , an ),
∞
∑ r˜(xn , an),
n=0
and W (x, σ ) = −∞, if W+ (x, σ ) = ∞ and W− (x, σ ) = −∞.
Similar to the continuous-time case, consider the value function W (x) = supσ ∈Δ W (x, σ ). A policy σ is called optimal if W (x, σ ) = W (x) for all x ∈ X. We also consider occupancy measures for the MDP: σ Mx,n (Y, B) Exσ I{xn ∈ Y, an ∈ B},
n = 0, 1, . . . , Y ∈ X , B ∈ A .
(5.8)
σ r˜(z, a)Mx,n (dz, da).
(5.9)
Then, for x ∈ X and for σ ∈ Δ , W (x, σ ) =
∞
∑
n=0 X A
For any policy σ and for a fixed x ∈ X, consider a randomized Markov policy σ ∗ such that Pσ {dxn , an ∈ B) , Pxσ − a.s. σn∗ (B|xn ) = x σ (5.10) Px {dxn } It is well known [34] that ∗
Pxσ {xn ∈ Y, an ∈ B} = Pxσ {xn ∈ Y, an ∈ B},
Y ∈X, B∈A.
This implies ∗
σ σ Mx,n = Mx,n ,
n = 0, 1, . . . ,
(5.11)
and, in view of (5.9), for any reward function r˜ W (x, σ ∗ ) = W (x, σ ).
(5.12)
5.4 Main Results

For a total-reward discounted CTMDP defined by a multiplet {X, A, A(x), q(·|x, a), r(x, a), R(x, a, y)}, consider a discrete-time MDP defined by a multiplet {X∞, A∞, A(x), p(·|x, a), r̃(x, a)}, where A(x∞) = {a∞},

    p(Y|x, a)     = q(Y \ {x}|x, a)/(α + q(x, a)),   if Y ∈ X, x ∈ X, a ∈ A(x),
    p({x∞}|x, a)  = α/(α + q(x, a)),                  if x ∈ X, a ∈ A(x),
    p({x∞}|x∞, a∞) = 1,                               (5.13)

and r̃(x, a) = r∗(x, a) (see (5.4)) and r̃(x∞, a∞) = 0.
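For a model with finitely many states and actions, the reduction (5.13) together with the one-step reward (5.4) can be computed directly. The sketch below builds the transition probabilities and one-step rewards of the corresponding MDP from the rate data; the dictionaries q, r, R and the names x_inf, a_inf are placeholders chosen for the illustration, not the chapter's notation.

```python
# Sketch of the reduction (5.4), (5.13) for a finite CTMDP.
# q[(x, a)][y] is the transition rate to y != x, so the rates out of x sum to q(x, a).

def corresponding_mdp(states, actions_of, q, r, R, alpha):
    """Return the transition probabilities p and one-step rewards r_tilde."""
    X_INF = "x_inf"                                    # absorbing state x_infinity
    p, r_tilde = {}, {}
    for x in states:
        for a in actions_of(x):
            qx = sum(rate for y, rate in q[(x, a)].items() if y != x)   # q(x, a)
            denom = alpha + qx
            row = {y: rate / denom for y, rate in q[(x, a)].items() if y != x}
            row[X_INF] = alpha / denom                 # probability of absorption
            p[(x, a)] = row
            r_tilde[(x, a)] = (r[(x, a)] +
                sum(R[(x, a, y)] * rate for y, rate in q[(x, a)].items() if y != x)) / denom
    p[(X_INF, "a_inf")] = {X_INF: 1.0}                 # x_infinity is absorbing
    r_tilde[(X_INF, "a_inf")] = 0.0
    return p, r_tilde
```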
Thus, the original CTMDP {X, A, A(x), q(·|x, a), r(x, a), R(x, a, y)} defines two other models: the relaxed CTMDP {X, P(A), P(A(x)), q̄(·|x, μ), r̄(x, μ), R̄(x, μ, y)}, where μ ∈ P(A(x)), and the discrete-time MDP {X∞, A∞, A(x), p(·|x, a), r̃(x, a)} with the transition probabilities and rewards defined in (5.4), (5.13) and in the line following (5.13). These three models have the sets of policies Π, Π^R, and Δ, respectively, where Π ⊆ Π^R. The objective criteria for the two CTMDPs are the expected total discounted rewards, and the objective criterion for the discrete-time MDP is the expected total reward. In addition, we can consider the fourth model: the MDP corresponding to the relaxed CTMDP. Let Δ^R be the set of policies for this MDP; Δ ⊆ Δ^R. Observe that for x ∈ X and μ ∈ P(A(x)), the one-step reward for this MDP is

    r̃(x, μ) = r̂(x, μ) / (α + q̄(x, μ)).
(5.14)
Observe that the sets of states, actions, and transition rates define the structure of the models and the sets of policies Π, Π^R, Δ, and Δ^R, while the reward functions r and R define the objective functions, which are functionals on these sets.

Theorem 5.4.1 Consider a CTMDP and its three associated models: the relaxed CTMDP, the corresponding MDP, and the MDP corresponding to the relaxed CTMDP. For any policy φ in the CTMDP and any state x ∈ X, in each of the three associated models there exists a policy σ such that V(x, φ) = G(x, σ), and this equality holds for all reward functions r and R, where G denotes the expected total reward in the two associated MDPs and the expected total discounted reward in the relaxed CTMDP.

Proof. (i) Consider the relaxed CTMDP. Since φ ∈ Π ⊆ Π^R, the policy φ itself is the required policy in the relaxed CTMDP. Thus, we can set σ = φ.

(ii) Fix x ∈ X. Recall that M̃^φ_{x,n}, n = 0, 1, . . ., are occupancy measures for the CTMDP defined in (5.5), when the initial state is x. Consider on (X, X) the finite nonnegative measures m̃^φ_{x,n}, n = 0, 1, . . .,
Y ∈X.
(5.15)
For the corresponding MDP, consider a randomized Markov policy σ satisfying for each B ∈ A
σn (B|z) =
φ M˜ x,n (dz, B) φ m˜ x,n (dz)
,
φ
m˜ μ ,n − a. e., n = 0, 1, . . . .
(5.16)
According to Feinberg [10, Theorem 4.5 and Lemma A.2], such a randomized Markov policy σ exists and σ φ Mx,n = M˜ x,n ,
n = 0, 1, . . . ,
(5.17)
and therefore V (x, φ ) = W (x, σ ) for any reward functions r and R. We remark that [10, Lemma A.2] deals with a SMDP. For a randomized Markov policy, decisions depend on the current state and jump number. They do not depend on the current or past values of the time parameter t. Therefore, these policies are also randomized Markov policies for the MDP corresponding to the CTMDP; see [10, Corollary A.4] with β¯ = 1, where the parameter β¯ is considered in that corollary. We remark that β¯ < 1 is chosen in [10, Corollary A.4] to ensure that the discount factor is less than 1. As shown in [10], the discount factor less than 1 can be chosen if the jump rates are uniformly bounded. Since the jump rates may be unbounded in the current chapter, it may be impossible to set the discount factor β¯ < 1, but it is possible to set the discount factor β¯ = 1. (iii) As shown in (i), a policy π belongs to the relaxed CTMDP. By applying (ii) to the relaxed CTMDP, we obtain a Markov policy with the required properties in the MDP corresponding to the relaxed CTMDP. 2 Definition 5.4.1 Consider a CTMDP and its three associated models: the relaxed CTMDP, the corresponding MDP, and the MDP corresponding to the relaxed MDP. Two of these four models are called equivalent, if for each state x ∈ X, for any policy in one of the models, there exists a policy in another model such that the values of the total-reward criteria for these policies are equal in the corresponding models, when the initial state x is fixed. If this is true for any reward functions r and R, then the models are called strongly equivalent. Theorem 5.4.2 Consider a CTMDP and its three associated models: the relaxed CTMDP, the corresponding MDP, and the MDP corresponding to the relaxed CTMDP. Then (a) The following three models are strongly equivalent: the relaxed CTMDP, the corresponding MDP, and the MDP corresponding to the relaxed CTMDP. (b) If the set A is countable, then all four models are strongly equivalent. We consider two lemmas needed for the proof of Theorem 5.4.2, Lemma 5.4.1 Theorem 5.4.2 holds under the addition condition that the reward functions r and R are nonnegative. Proof. (a) Fix the initial state x ∈ X. Observe that Δ ⊆ Δ R and that, according to Theorem 5.4.1, for any policy in the relaxed CTMDP there exists a policy with the required property in the MDP corresponding to the relaxed CTMDP. Thus, to show the strong equivalence of the MDP corresponding to the CTMDP and the MDP corresponding to the relaxed CTMDP, it is sufficient to show that for any policy in the MDP corresponding to the relaxed CTMDP, there exists a policy in the MDP corresponding to the CTMDP such that these two policies have equal objective functions and this is true for all nonnegative reward functions r and R. Observe that for an arbitrary state x ∈ X and for an arbitrary action μ ∈ P(A(x)) in the MDP corresponding to the relaxed CTMDP there exists a
randomized action π ∈ P(A(x)) in the MDP corresponding to the CTMDP such that these two actions yield the same transition probabilities and expected one-step rewards. More precisely, for μ ∈ P(A(x)), we define π ∈ P(A(x)) by
π (da) =
α+
α + q(x, a) μ (da). A(x) q(x, a) μ (da)
(5.18)
Indeed, if an action is selected according to the distribution π in the MDP corresponding to the CTMDP, then the transition probability p and expected one-step reward r˜ are p(Y |x, π ) =
A(x)
p(Y |x, a)π (da) =
A(x)
q(Y \ {x}|x, a)π (da) , α + q(x, a)
Y ∈X,
rˆ(x, a)π (da) . A(x) α + q(x, a)
r˜(x, π ) =
If an action μ is selected in the MDP corresponding to the relaxed CTMDP, then the transition probability p¯ and expected one-step reward r¯ are
p(Y ¯ |x, μ ) =
A(x) q(Y
α+
\ {x}|x, a)μ (da)
A(x) q(x, a)μ (da)
,
Y ∈X,
r¯(x, μ ) =
A(x) rˆ(x, a)μ (da)
α+
A(x) q(x, a)μ (da)
.
In view of (5.18), p(Y |x, π ) = p(Y ¯ |x, μ ) and r˜(x, π ) = r¯(x, μ ). This implies that, if the initial state is fixed, for any randomized Markov policy in the MDP corresponding to the relaxed CTMDP, there exists a randomized Markov policy in the MDP corresponding to the CTMDP such that they objective functions coincide. In addition, this is true for any nonnegative reward functions r and R. Thus, the MDP corresponding to the CTMDP and the MDP corresponding to the relaxed CTMDP are strongly equivalent. According to cases (i) and (ii) in the proof of Theorem 5.4.1, for any policy π in the relaxed CTMDP, there exists a policy with the same performance for all reward functions in the MDP corresponding to the relaxed CTMDP. Now consider a policy σ for the MDP corresponding to the CTMDP. Without loss of generality [34], let σ be randomized Markov. The policy σ can be considered as a policy that selects actions only at jump epochs and decisions depend only on the state to which the process jumps in and the jump number. In other words, σ is a policy for a SMDP defined by the underlying CTMDP model. Define the relaxed switching simple policy π by e−q(z,a)s σn (da|z) , −q(z,a)s σ (da|z) n Ae
πn (da|z, s) =
n = 0, 1, . . . , z ∈ X, s ≥ 0.
(5.19)
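For a finite action set, the relaxed switching simple policy (5.19) is simply an exponential reweighting of σn(·|z). A small sketch with placeholder data:

```python
# Sketch of (5.19): pi_n(da | z, s) is sigma_n(. | z) reweighted by exp(-q(z, a) s).
# The dictionaries sigma_n and q below are illustrative placeholders.
import math

def relaxed_switching_policy(sigma_n, q, z, s):
    """Return pi_n(. | z, s) as a dict over actions, following (5.19)."""
    weights = {a: prob * math.exp(-q[(z, a)] * s) for a, prob in sigma_n[z].items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

# Example: as s grows, the mass shifts towards the action with the smaller rate.
sigma_n = {"z": {1: 0.5, 2: 0.5}}
q = {("z", 1): 1.0, ("z", 2): 3.0}
for s in (0.0, 0.5, 2.0):
    print(s, relaxed_switching_policy(sigma_n, q, "z", s))
```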
The marked point processes defined by the policies σ and π have the same sojourn-times distributions because for every n = 0, 1, . . . they have the same hazard rates after nth jump:
−q(xn ,a)s σ (da|x ) n n A q(xn , a)e . −q(xn ,a)s σ (da|x ) e n n A
In addition, they have the same transition probabilities,
A q(Y
\ {x}|xn, a)e−q(xn ,a)θn+1 σn (da|xn ) −q(x ,a)θ , n n+1 σ (da|x ) n n Ae
at jump epochs. The initial state x is fixed, and it is the same for marked point processes defined by the policies σ and π . Thus, the policies σ and π define the same marked point processes. In addition, the reward rates are equal:
r¯(xn , πn (xn , s)) =
−q(xn ,a)s σ (da|x ) n n A r(xn , a)e . −q(xn ,a)s σ (da|x ) e n n A
Thus, for any discount rate α ∈ [0, 1], for any policy σ in the MDP corresponding to the original CTMDP, there exist a relaxed policy π in the CTMDP such that the expected total discounted rewards are equal for these two policies. We remark that this correspondence is proved in Feinberg [10, (8.5), (8.6)] by using compensators. (b) Without loss of generality, let A = {1, 2, . . .}. Fix x ∈ X and consider a randomized Markov policy σ for the MDP corresponding to a CTMDP. For n = 0, 1, . . ., we set θi (z, n) = −(α + q(z, i))−1 ln 1 −
σn (i|z) , ∑∞ k=i σn (k|z)
i = 1, 2, . . . , z ∈ X, a ∈ A, (5.20)
where 00 = 0 and ln(0) = −∞. We also set θ0 (z, n) = 0, θi (z, n) = ∑ik=1 θk (z, n). Consider a switching simple policy φ :
φn (z, s) = i for z ∈ X, n = 0, 1, . . . , θi−1 (z, n) ≤ s < θi (z, n), i = 1, 2, . . . . (5.21) According to [10, Theorem 5.2], σ satisfies (5.16) written for π = φ . In addition, V (x, φ ) = W (x, σ ) for any nonnegative reward functions r and R. In conclusion, we notice that technical conditions in Feinberg [10] and here are different. Instead of Condition 5.1 from [10], we use the assumption that A is countable. These conditions achieve the same goal: they guarantee that the functions φn (z, s) are well defined in (5.21) and measurable in (z, s). 2
Lemma 5.4.2 Assumption 5.2.2 implies that the following statements hold for all x ∈ X:

(a) Either sup_{φ∈Π} V+(x, φ) < ∞ or inf_{φ∈Π} V−(x, φ) > −∞.
(b) Either sup_{π∈Π^R} V+(x, π) < ∞ or inf_{π∈Π^R} V−(x, π) > −∞.
(c) Either sup_{σ∈Δ} W+(x, σ) < ∞ or inf_{σ∈Δ} W−(x, σ) > −∞.
(d) Either sup_{σ∗∈Δ^R} W+(x, σ∗) < ∞ or inf_{σ∗∈Δ^R} W−(x, σ∗) > −∞.
Proof. (c) Fix x ∈ X. We shall prove that if statement (c) does not hold, then Assumption 5.2.2 does not hold either. Let statement (c) does not hold. Define the function rˆ+ by (5.7) with the functions r and R replaced with the functions r+ and R+ , respectively. Similarly, define the function rˆ− by (5.7) with the functions r and R replaced with the functions r− and R− , respectively. Then define the functions r˜+ and r˜− by (5.14) with the function rˆ replaced with the functions rˆ+ and rˆ− , respectively. For σ ∈ Δ , let W++ (x, σ ) and W−− (x, σ ) be the discrete-time expected total rewards with the reward functions r˜+ and r˜− , respectively. Then W++ (x, σ ) ≥ W+ (x, σ ) and W−− (x, σ ) ≤ W− (x, σ ) for all σ ∈ Δ . Thus, since (c) does not hold, sup W++ (x, σ ) = ∞v
(5.22)
inf W−− (x, σ ) = −∞.
(5.23)
σ ∈Δ
and σ ∈Δ
The rest of the proof shows that (5.22) and (5.23) imply that Assumption 5.2.2 does not hold. Equality (5.22) means that the value for a positive dynamic programming problem is infinite. Thus, for every n = 1, 2, . . ., there exists a deterministic policy φn satisfying W++ (x, φn ) ≥ 2n . Similarly, for every n = 1, 2, . . ., there exists a deterministic policy ψn satisfying W−− (x, φn ) ≤ −2n . Consider a policy σ ∈ Δ that is defined in the following way: this policy independently selects one of the policies {φ1 , ψ1 , φ2 , ψ2 , . . .}, and policy φn and each policy ψn are selected with the probability 2−n+1, n = 1, 2, . . . (here we apply the same method as in [5, p. 108]). We have W++ (x, σ ) = ∞ and W−− (x, σ ) = −∞. We also have that W++ (x, σ ∗ ) = ∞ and W−− (x, σ ∗ ) = −∞ for a randomized Markov policy σ ∗ satisfying (5.10). Since the policy σ ∗ uses a countable number of actions at any state z, for the policy φ for the CTMDP, defined by (5.21) with σ = σ ∗ in (5.20), we have V+ (x, φ ) = ∞ and V− (x, φ ) = −∞. Thus, Assumption 5.2.2 does not hold. (a) Theorem 5.4.1 implies that if statement (a) does not hold, then both (5.22) and (5.23) hold. Therefore, Assumption 5.2.2 does not hold. (b) Lemma 5.4.1 implies that if statement (b) does not hold, then both (5.22) and (5.23) hold. Therefore, Assumption 5.2.2 does not hold. (d) The proof is the same as for (b).
Proof of Theorem 5.4.2. Lemma 5.4.1 implies the theorem under additional conditions that either both r and R are nonnegative or they both are nonpositive. Lemma 5.4.2 implies that this is also true for general functions r and R.
Let f : X → R0+ . A policy φ is called f -optimal for a CTMDP if V (x, φ ) ≥ V (x)− f (x) for all x ∈ X. A policy φ is called f -optimal for an MDP if W (x, φ ) ≥ W (x) − f (x) for all x ∈ X. Theorem 5.4.3 (a) The function V is universally measurable and V (x) = V R (x) = W (x) = W R (x) for all x ∈ X. (b) If for a function f : X → R0+ , a simple policy ϕ is f -optimal for the corresponding discrete-time MDP, it is also f -optimal for the original CTMDP. In particular, a deterministic f -optimal policy for the discrete-time MDP is f -optimal for the CTMDP. Proof. (a) Fix arbitrary x ∈ X. According to Theorem 5.4.2, V R (x) = W (x) = W R (x). Theorem 5.4.1 implies that W (x) ≥ V (x). Let σ be a policy in the MDP corresponding to the CTMDP. According to Feinberg [6, Theorem 3], for any K > 0, there exists a simple policy φ such that W (x, φ ) ≥
W (x, σ ), K,
if W (x, σ ) < ∞, if W (x, σ ) = ∞.
In addition, V (x) ≥ V (x, φ ) = W (x, φ ) for any simple policy φ . Thus, for any policy σ in the CTMDP and for any K > 0, V (x) ≥
W (x, σ ), K,
if W (x, σ ) < ∞, if W (x, σ ) = ∞.
This implies V (x) ≥ W (x). Thus, V (x) = W (x). Universal measurability of the function W follows from [7, Theorem 3.1.B]. Statement (b) follows from statement (a).
5.5 Applications: Value Iterations and Optimality of Deterministic Policies Theorems 5.4.1–5.4.3 can be used to prove the existence of optimal policies for CTMDPs, describe the structure of optimal policies, and compute them. For example, an optimal deterministic policy for the corresponding MDP is optimal for the original CTMDP. These theorems also demonstrate that there is no need to consider relaxed policies for CTMDPs, because they do not overperform standard policies. We conjecture that the assumption that A is countable is not necessary for the validity of Theorem 5.4.2.
The classic results on the convergence of value iterations and on other properties of positive dynamic programs [3, 4, 34] imply the following fact, which is a stronger version of Piunovskiy and Zhang [29, Theorem 1], where the case R(x, a, y) = R(x) was considered.

Theorem 5.5.4 Let the function r∗, defined in (5.4), be nonnegative. Then the function V is the minimal nonnegative solution to the following Bellman equation

    V(x) = sup_{a∈A(x)} [ r∗(x, a) + ∫_{X\{x}} q(dy|x, a)/(α + q(x, a)) V(y) ],   x ∈ X,   (5.24)

and it can be computed by the value iteration procedure: Wk(x) ↑ V(x) when k → ∞, where W0(x) := 0 and

    Wk+1(x) := sup_{a∈A(x)} [ r∗(x, a) + ∫_{X\{x}} q(dy|x, a)/(α + q(x, a)) Wk(y) ],   x ∈ X.   (5.25)
Proof. Since the function r∗ is nonnegative, the corresponding MDP has nonnegative one-step rewards and, according to [3, Proposition 9.14], its k-horizon values Wk form a nondecreasing sequence converging to W. According to Theorem 5.4.3, V = W.
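On a finite model, the iteration (5.25) is straightforward to implement. The sketch below reuses rate data in the same placeholder format as the earlier reduction sketch and runs a fixed number of iterations, relying on the monotone convergence Wk ↑ V guaranteed by Theorem 5.5.4.

```python
# Sketch of the value iteration (5.25) for a finite model; q, r_star and the
# action sets are placeholders in the same format used above.

def value_iteration(states, actions_of, q, r_star, alpha, iterations=200):
    W = {x: 0.0 for x in states}                      # W_0 = 0
    for _ in range(iterations):
        W_new = {}
        for x in states:
            best = -float("inf")
            for a in actions_of(x):
                qx = sum(rate for y, rate in q[(x, a)].items() if y != x)
                cont = sum(rate * W[y] for y, rate in q[(x, a)].items() if y != x)
                best = max(best, r_star[(x, a)] + cont / (alpha + qx))
            W_new[x] = best
        W = W_new
    return W
```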
As stated in Theorem 5.4.3, the function V is universally measurable and, more precisely, it is upper semianalytic; see [5] or [3] for details. The same is true for the k-horizon values Wk , k = 1, 2, . . . , of the MDP corresponding to the CTMDP. As the following theorem states, the function V satisfies the optimality equation under broad conditions. However, optimal policies may not exist under conditions of Theorem 5.5.4 even when the action set A is finite; see [4] or [9, Example 6.8]. Theorem 5.5.5 If V (x) < ∞ for all x ∈ X, then the function V satisfies the Bellman equation (5.24). If, in addition, r∗ (x, a) ≤ 0 for all (x, a) ∈ Gr(A), then V is the maximal nonpositive function satisfying (5.24). Proof. According to Theorem 5.4.3, V = W. The value function for an MDP satisfies the optimality equation [5, Sect. 6.2], and W is its maximal solution when all onestep rewards are nonpositive, [34] or [3, Proposition 9.10].
Value iterations may not converge to the optimal value function, if the reward function is not nonnegative [34, Example 6.1]. However, for MDPs with nonpositive rewards, value iterations converge to the optimal value function under compactness and continuity conditions introduced by Sch¨al [31, 32] for MDPs with compact action sets. Extensions of these conditions to MDPs with noncompact action sets are currently well-understood; see [16, Lemma 4.2.8] for MDPs with setwise continuous transition probabilities and [11, Proposition 3.1] for MDPs with weakly continuous transition probabilities, where discounting was considered. The same is true for MDPs with nonpositive one-step rewards—so-called negative MDPs.
These conditions also imply the existence of deterministic optimal policies for discounted and for negative MDPs. We present below such conditions for CTMDPs. Let q+ (Y |x, a) q(Y \ {x}|x, a), Y ∈ X . Observe that q+ (X|x, a) = q(X \ {x}|x, a) < ∞. The definition (5.4) can be rewritten as r∗ (x, a) =
r(x, a) +
R(x, a, y)q+ (dy|x, a) , α + q(x, a)
X
x ∈ X, a ∈ A(x).
(5.26)
Condition (Sc). (i) For each x ∈X, the kernel q+ (·|x, a) is setwise continuous in a ∈ A(x); that is, the function X f (y)q+ (dy|x, a) is continuous in (x, a) ∈ Gr(A) for any bounded measurable function f on X. (ii) The reward function r∗ (x, a) is bounded above on Gr(A), and for each x ∈ X, it is sup-compact on A(x), that is, the set {a ∈ A(x)| r∗ (x, a) ≥ c} is compact for each finite constant c. Condition (Wc). (i) The q+ (·|x, a) is weakly continuous in (x, a) ∈ Gr(A); that is, the function kernel + X f (y)q (dy|x, a) is continuous in (x, a) ∈ Gr(A) for any bounded continuous function f on X. (ii) The reward function r∗ (x, a) is sup-compact on Gr(A), that is, the set {(x, a) ∈ Gr(A)| r∗ (x, a) ≥ c} is compact for each finite constant c. Observe that Condition (Sc)(i) implies that q(x, a) is continuous in a ∈ A(x) for each x ∈ X. This and the inequality α > 0 imply that the transition probability p(·|x, a) is setwise continuous in a ∈ A(x) for each x ∈ X. Similarly, Condition (Wc)(i) implies that q(x, a) is continuous in (x, a) ∈ Gr(A). This and the inequality α > 0 imply that the transition probability p(·|x, a) is weakly continuous in (x, a) ∈ Gr(A). Thus, Conditions (Sc) and (Wc) imply respectively the validity of the following conditions for the corresponding MDP. These conditions are the versions of Sch¨al’s [31, 32] conditions (S) and (W) originally introduced for compact action sets. Condition (Su). (i) For each x ∈ X, the transition probability kernel p(·|x, a) is setwise continuous in a ∈ A(x). (ii) Condition (Sc) (ii) holds. Condition (Wu). (i) The transition probability p(·|x, a) is weakly continuous in (x, a) ∈ A(x). (ii) Condition (Wc) (ii) holds. Theorem 5.5.6 Let r∗ (x, a) ≤ 0 for all (x, a) ∈ Gr(A). If Condition (Sc) holds, then the value function V is measurable. If the Condition (Wc) holds, then the value function V is sup-compact. In either case, the following statements hold:
(a) There exists a deterministic optimal policy, (b) A deterministic policy is optimal if and only if for all x ∈ X V (x) =r∗ (x, φ (x)) +
q(dy|x, φ (x)) V (y) X\{x} α + q(x, φ (x))
q(dy|x, a) V (y) . X\{x} α + q(x, a)
= max r∗ (x, a) + a∈A(x)
(5.27) (5.28)
(c) The value function V can be computed by the value iteration procedure: Wk (x) V (x) when k → ∞, where W0 (x) 0 and Wk+1 , k = 1, 2, . . . , are defined in (5.25). Proof. If Conditions (Sc) or (Wc) hold for a CTMDP, then Conditions (Su) or (Wu) hold for the corresponding MDP, respectively. For MDPs, the statements of the theorem are standard facts. For example, similar facts are presented in [16, Lemma 4.2.8] for MDPs satisfying Condition (Su) and in [11, Proposition 3.1] for MDPs satisfying Condition (Wu). Though in [16] and in [11] a discount factor is considered, the arguments from [16] and [11] are applicable when r is nonpositive and conditions (Su) or (Wu) hold for the corresponding MDP.
If t∞ = ∞ (Pxπ -a.s.) for all π and x, then the problem studied in this chapter becomes an infinite-horizon problem in continuous time. Conditions for t∞ = ∞ (Pxπ a.s.) are described in the literature; see [12,14]. In addition, under certain conditions, a policy π , for which the probability of t∞ < ∞ is positive, cannot be optimal. For example, such a condition is R(x, a, y) < c < 0 for some number c. In this case, V (x, π ) = −∞ for any policy π with Pxπ {t∞ < ∞} > 0. In conclusion, we formulate the following statement. Theorem 5.5.7 Let either Condition (Sc) or Condition (Wc) be satisfied. If the following two assumptions hold: (i) The function rˆ, defined in (5.7), is bounded above, (ii) For any initial state x ∈ X and for any policy π ∈ Π , the inequality V (x, π ) > −∞ implies t∞ = ∞ (Pxπ -a.s.), then there exists a deterministic optimal policy. Proof. Let rˆ(x, a) ≤ K < ∞ for some finite constant K > 0. Let us subtract K from the reward rate r. Then the corresponding MDP will have nonpositive one-step rewards. Therefore, Theorem 5.5.6 implies the existence of a deterministic optimal policy for the CTMDP with the reduced reward rate. Assumption (ii) implies that the expected total rewards V (x, π ) will be reduced for any policy π and any for x ∈ X by the constant K/α = K 0∞ e−α t dt. Thus, an optimal policy for the CTMDP with the reduced reward rates is also optimal for the original CTMDP.
Acknowledgements This research was partially supported by NSF grants CMMI-0900206 and CMMI-0928490.
References

1. Bather, J.: Optimal stationary policies for denumerable Markov chains in continuous time. Adv. Appl. Prob. 8, 144–158 (1976)
2. Bellman, R.: Dynamic Programming. Princeton University Press, Princeton, N.J. (1957)
3. Bertsekas, D.P., Shreve, S.E.: Stochastic Optimal Control: the Discrete Time Case. Academic Press, New York (1978)
4. Blackwell, D.: Positive dynamic programming. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability 1, 415–418, University of California Press, Berkeley (1967)
5. Dynkin, E.B., Yushkevich, A.A.: Controlled Markov Processes. Springer, Berlin (1979)
6. Feinberg, E.A.: Non-randomized Markov and semi-Markov strategies in dynamic programming. Theory Probability Appl. 27, 116–126 (1982)
7. Feinberg, E.A.: Controlled Markov processes with arbitrary numerical criteria. Theory Probability Appl. 27, 486–503 (1982)
8. Feinberg, E.A.: A generalization of "expectation equals reciprocal of intensity" to nonstationary distributions. J. Appl. Probability 31, 262–267 (1994)
9. Feinberg, E.A.: Total reward criteria. In: Feinberg, E.A., Shwartz, A. (eds.) Handbook of Markov Decision Processes, pp. 173–207, Kluwer, Boston (2002)
10. Feinberg, E.A.: Continuous time discounted jump Markov Decision Processes: a discrete-event approach. Math. Oper. Res. 29, 492–524 (2004)
11. Feinberg, E.A., Lewis, M.E.: Optimality inequalities for average cost Markov decision processes and the stochastic cash balance problem. Mathematics of Operations Research 32, 769–783 (2007)
12. Guo, X., Hernández-Lerma, O.: Continuous-Time Markov Decision Processes: Theory and Applications. Springer, Berlin (2009)
13. Guo, X., Hernández-Lerma, O., Prieto-Rumeau, T.: A survey of recent results on continuous-time Markov Decision Processes. Top 14, 177–261 (2006)
14. Guo, X., Piunovskiy, A.: Discounted continuous-time Markov decision processes with constraints: unbounded transition and loss rates. Mathematics of Operations Research 36, 105–132 (2011)
15. Hernández-Lerma, O.: Lectures on Continuous-Time Markov Control Processes. Aportaciones Matemáticas 3, Sociedad Matemática Mexicana, México (1994)
16. Hernández-Lerma, O., Lasserre, J.: Discrete-Time Markov Control Processes: Basic Optimality Criteria. Springer, New York (1996)
17. Hordijk, A., van der Duyn Schouten, F.A.: Discretization procedures for continuous time decision processes. In: Transactions of the Eighth Prague Conference on Information Theory, Statistical Decision Functions, Random Processes, Volume C, pp. 143–154, Academia, Prague (1979)
18. Howard, R.: Dynamic Programming and Markov Processes. Wiley, New York (1960)
19. Jacod, J.: Multivariate point processes: predictable projections, Radon-Nikodym derivatives, representation of martingales. Z. Wahr. verw. Geb. 31, 235–253 (1975)
20. Kakumanu, P.: Continuously discounted Markov decision models with countable state and action space. Ann. Math. Stat. 42, 919–926 (1971)
21. Kakumanu, P.: Relation between continuous and discrete time Markovian decision problems. Naval. Res. Log. Quart. 24, 431–439 (1977)
22. Kallianpur, G.: Stochastic Filtering Theory. Springer, New York (1980)
23. Kitaev, Yu.M.: Semi-Markov and jump Markov controlled models: average cost criterion. Theory Probab. Appl. 30, 272–288 (1985)
24. Kitaev, Yu.M., Rykov, V.V.: Controlled Queueing Systems. CRC Press, New York (1995)
25. Lippman, S.: Applying a new device in the optimization of exponential queueing systems. Oper. Res. 23, 687–710 (1975)
26. Martin-Löf, A.: Optimal control of a continuous-time Markov chain with periodic transition probabilities. Oper. Res. 15, 872–881 (1966)
27. Miller, B.L.: Finite state continuous time Markov decision processes with a finite planning horizon. SIAM J. Control 6, 266–280 (1968)
28. Miller, B.L.: Finite state continuous time Markov decision processes with an infinite planning horizon. J. Math. Anal. Appl. 22, 552–569 (1968)
29. Piunovskiy, A., Zhang, Y.: The transformation method for continuous-time Markov decision processes. Preprint, University of Liverpool (2011)
30. Rykov, V.V.: Markov Decision Processes with finite spaces of states and decisions. Theory Prob. Appl. 11, 302–311 (1966)
31. Schäl, M.: Conditions for optimality and for the limit of n-stage optimal policies to be optimal. Z. Wahrs. verw. Gebr. 32, 179–196 (1975)
32. Schäl, M.: On dynamic programming: compactness of the space of policies. Stoch. Processes Appl. 3, 345–364 (1975)
33. Serfozo, R.F.: An equivalence between continuous and discrete time Markov decision processes. Oper. Res. 27, 616–620 (1979)
34. Strauch, R.E.: Negative dynamic programming. Ann. Math. Stat. 37, 871–890 (1966)
35. Yushkevich, A.A.: Controlled Markov models with countable state space and continuous time. Theory Probab. Appl. 22, 215–235 (1977)
36. Zachrisson, L.E.: Markov games. In: Dresher, M., Shapley, L., Tucker, A. (eds.) Advances in Game Theory, pp. 211–253. Princeton University Press, Princeton, N.J. (1964)
Chapter 6
Continuous-Time Controlled Jump Markov Processes on the Finite Horizon Mrinal K. Ghosh and Subhamay Saha
6.1 Introduction

This chapter studies continuous-time Markov decision processes and continuous-time zero-sum stochastic dynamic games. In the continuous-time setup, although the infinite horizon cases have been well studied, the corresponding literature on the finite horizon case is sparse. Infinite horizon continuous-time Markov decision processes have been studied by many authors (e.g. see [5] and the references therein). In the finite horizon case, Pliska [7] has used a semi-group approach to characterise the value function and the optimal control, but his approach yields only existential results. In this chapter, we show that the value function is a smooth solution of an appropriate dynamic programming equation. Our method of proof gives algorithms for computing the value function and an optimal control. The situation is analogous for continuous-time stochastic dynamic Markov games. In this problem as well, the infinite horizon case has been studied in the literature [6]; to our knowledge, the finite horizon case has not been studied. In this chapter, we prove that the value of the game on the finite horizon exists and is the solution of an appropriate Isaacs equation. This leads to the existence of a saddle point equilibrium. The rest of our chapter is structured as follows. In Sect. 6.2 we analyse the finite horizon continuous-time MDP. Section 6.3 deals with zero-sum stochastic dynamic games. We conclude our chapter in Sect. 6.4 with a few remarks.
6.2 Finite Horizon Continuous-Time MDP

Throughout this chapter the time horizon is T. The control model we consider is given by {X, U, (λ(t, x, u), t ∈ [0, T], x ∈ X, u ∈ U), Q(t, x, u, dz), c(t, x, u)}, where each element is described below.

The state space X. The state space X is the set of states of the process under observation, which is assumed to be a Polish space.

The action space U. The decision-maker dynamically takes his action from the action space U. We assume that U is a compact metric space.

The instantaneous transition rate λ. λ : [0, T] × X × U → [0, ∞) is a given function satisfying the following assumption:

(A1) λ is continuous and there exists a constant M such that sup_{t,x,u} λ(t, x, u) ≤ M.

The transition probability kernel Q. For fixed t ∈ [0, T], x ∈ X, u ∈ U, Q(t, x, u, ·) is a probability measure on X with Q(t, x, u, {x}) = 0. Q satisfies the following:

(A2) Q is weakly continuous, i.e. if xn → x, tn → t, un → u, then for any f ∈ Cb(X), ∫_X f(z) Q(tn, xn, un, dz) → ∫_X f(z) Q(t, x, u, dz).

The cost rate c. c : [0, T] × X × U → [0, ∞) is a given function satisfying the following assumption:

(A3) c is continuous and there exists a finite constant C̃ such that sup_{t,x,u} c(t, x, u) ≤ C̃.

Next we give an informal description of the evolution of the controlled system. Suppose that the system is in state x at time t ≥ 0 and the controller or decision-maker takes an action u ∈ U. Then the following happens on the time interval [t, t + dt]:

1. The decision-maker has to pay an infinitesimal cost c(t, x, u)dt, and
2. A transition from state x to a set A (not containing x) occurs with probability λ(t, x, u)dt ∫_A Q(t, x, u, dz) + o(dt);
101
or the system remains in state x with probability 1 − λ (t, x, u)dt + o(dt). Now we describe the optimal control problem. To this end we first describe the set of admissible controls. Let u : [0, T ] × X → U be a measurable function. Let U denote the set of all such measurable functions which is the set of admissible controls. Such controls are called Markov controls. For each u ∈ U , it can be shown that there exists is a strong Markov process {Xt } (see [1, 3]) having the generator At u f (x) = −λ (t, x, u(t, x)) f (x) +
X
f (z)Q(t, x, u(t, x), dz)
where f is a bounded measurable function. For each u ∈ U , define u V u (t, x) = Et,x
T
t
c(s, Xs , u(s, Xs ))ds + g(XT )
(6.1)
where g : X → R+ is the terminal cost function which is assumed to be bounded, u is the expectation operator under the control u with initial continuous and Et,x condition Xt = x. The aim of the controller is to minimise V u over all u ∈ U . Define T u V (t, x) = inf Et,x c(s, Xs , u(s, Xs ))ds + g(XT ) . (6.2) u∈U
t
The function V is called the value function. If u∗ ∈ U satisfies ∗
V u (t, x) = V (t, x) ∀(t, x), then u∗ is called an optimal control. The associated dynamic programming equation is ⎧ dϕ ⎪ ⎪ dt (t, x)+ infu∈U c(t, x, u)−λ (t, x, u)ϕ (t, x) ⎪ ⎪ ⎨+λ (t, x, u) ϕ (t, z)Q(t, x, u, dz) = 0 X ⎪ on X × [0, T ) and ⎪ ⎪ ⎪ ⎩ ϕ (T, x) = g(x). The importance of (6.3) is illustrated by the following verification theorem.
(6.3)
Theorem 6.2.1 If (6.3) has a solution ϕ in C^{1,0}_b([0, T] × X), then ϕ = V, the value function. Moreover, if u∗ is such that

    c(t, x, u∗(t, x)) − λ(t, x, u∗(t, x))ϕ(t, x) + λ(t, x, u∗(t, x)) ∫_X ϕ(t, z) Q(t, x, u∗(t, x), dz)
    = inf_{u∈U} [ c(t, x, u) − λ(t, x, u)ϕ(t, x) + λ(t, x, u) ∫_X ϕ(t, z) Q(t, x, u, dz) ],   (6.4)

then u∗ is an optimal control.

Proof. Applying the Itô–Dynkin formula to the solution ϕ of (6.3), we obtain

    ϕ(t, x) ≤ inf_{u∈U} E^u_{t,x} [ ∫_t^T c(s, Xs, u(s, Xs)) ds + g(XT) ].

For u = u∗ as in the statement of the theorem, we get the equality

    ϕ(t, x) = E^{u∗}_{t,x} [ ∫_t^T c(s, Xs, u∗(s, Xs)) ds + g(XT) ].

The existence of such a u∗ follows by a standard measurable selection theorem [2]. □

In view of the above theorem, it suffices to show that (6.3) has a solution in C^{1,0}_b([0, T] × X).

Theorem 6.2.2 Under (A1)–(A3), the dynamic programming equation (6.3) has a unique solution in C^{1,0}_b([0, T] × X).

Proof. Let ϕ(t, x) = e^{−γt}ψ(t, x) for some γ < ∞. Then from (6.3) we get

    e^{−γt} dψ/dt (t, x) − γe^{−γt}ψ(t, x) + inf_{u∈U} [ c(t, x, u) − λ(t, x, u)e^{−γt}ψ(t, x) + λ(t, x, u) ∫_X e^{−γt}ψ(t, z) Q(t, x, u, dz) ] = 0   on X × [0, T), and
    ψ(T, x) = e^{γT} g(x).

Thus (6.3) has a solution if and only if

    dψ/dt (t, x) − γψ(t, x) + inf_{u∈U} [ e^{γt} c(t, x, u) − λ(t, x, u)ψ(t, x) + λ(t, x, u) ∫_X ψ(t, z) Q(t, x, u, dz) ] = 0   on X × [0, T), and
    ψ(T, x) = e^{γT} g(x)
has a solution. The above differential equation is equivalent to the following integral equation: T
ψ (t, x) = eγ t g(x) + eγ t + λ (s, x, u)
u∈U
t
X
e−γ s inf eγ s c(s, x, u) − λ (s, x, u)ψ (s, x)
ψ (s, z)Q(s, x, u, dz) ds .
Let Cbunif ([0, T ] × X) be the space of bounded continuous functions ϕ on [0, T ] × X with the additional property that given ε > 0 there exists δ > 0 such that sup |ϕ (t + h, x) − ϕ (t, x)| < ε x
whenever |h| < δ .
Suppose ϕn ∈ C^unif_b([0, T] × X) and ϕn → ϕ uniformly. Then |ϕ(t + h, x) − ϕ(t, x)| ≤ |ϕ(t + h, x) − ϕn(t + h, x)| + |ϕn(t + h, x) − ϕn(t, x)| + |ϕn(t, x) − ϕ(t, x)|.
there exists δ > 0 such that sup |ϕn0 (t + h, x) − ϕn0 (t, x)| < x
Putting n = n0 , we get from the above inequality sup |ϕ (t + h, x) − ϕ (t, x)| < ε x
ε , and for this n0 , 3
ε whenever |h| < δ . 3
whenever |h| < δ .
Thus Cbunif ([0, T ]×X) is a closed subspace of Cb ([0, T ]×X), and hence it is a Banach space.
Now for ϕ ∈ Cbunif ([0, T ] × X), it follows from the assumption on Q that X ϕ (t, z) Q(t, x, u, dz) is continuous in t, x and u. Define T : Cbunif ([0, T ] × X) → Cbunif ([0, T ] × X) by T ψ (t, x) = eγ t g(x) + eγ t +λ (s, x, u)
T t
X
e−γ s inf eγ s c(s, x, u) − λ (s, x, u)ψ (s, x) u∈U
ψ (s, z)Q(s, x, u, dz) ds .
For ψ1 , ψ2 ∈ Cbunif ([0, T ] × X), we have
104
M.K. Ghosh and S. Saha
|T ψ1 (t, x) − T ψ2 (t, x)| ≤ eγ t
T t
e−γ s 2M||ψ1 − ψ2||ds
=
2M γ t −γ t e [e − e−γ T ]||ψ1 − ψ2 || γ
=
2M [1 − e−γ (T−t) ]||ψ1 − ψ2|| γ
≤
2M ||ψ1 − ψ2 ||. γ
Thus if we choose γ = 2M + 1, then T is a contraction and hence has a fixed point. Let ϕ be the fixed point. Then e−(2M+1)t ϕ is the unique solution of (6.3). 2
6.3 Zero-Sum Stochastic Game In this section, we consider a zero-sum stochastic game. The control model we consider here is given by {X,U,V, (λ (t, x, u, v),t ∈ [0, T ], x ∈ X, u ∈ U, v ∈ V, Q(t, x, u, v, dz), r(t, x, u, v)} where X is the state space as before; U and V are the action spaces for player I and player II, respectively; λ and Q denote the rate and transition kernel, respectively, which now depend on the additional parameter v; and r is the reward rate. The dynamics of the game is similar to that of MDP with appropriate modifications. Here player I receives a payoff from player II. The aim of player I is to maximise his payoff, and player II seeks to minimise the payoff to player I. Now we describe the strategies of the players. In order to solve the problem, we will need to consider Markov relaxed strategies. We denote the space of strategies of player I by U and that of player II by V where U = {u | u : [0, T ] × X → P(U) measurable} , V = {v | v : [0, T ] × X → P(V ) measurable} . Now corresponding to λ , Q and r, define
λ˜ (t, x, μ , ν ) = ˜ x, μ , ν ) = Q(t, r˜(t, x, μ , ν ) =
V U
V U
V U
λ (t, x, u, v)μ (du)ν (dv), Q(t, x, u, v)μ (du)ν (dv), r(t, x, u, v)μ (du)ν (dv),
6 Continuous-Time Controlled Jump Markov Processes on the Finite Horizon
105
where μ ∈ P(U) and ν ∈ P(V ). As in the previous section, we make the following assumptions: (A1 ) λ is continuous and there exists a finite constant M such that sup λ (t, x, u, v) ≤ M.
t,x,u,v
(A2 ) Q is weakly continuous, i.e. if xn → x, tn → t, un → u and vn → v, then for any f ∈ Cb (X) X
f (z)Q(tn , xn , un , vn , dz) →
X
f (z)Q(t, x, u, v, dz).
(A3 ) r is continuous and there exists a finite constant C˜ such that ˜ sup r(t, x, u, v) ≤ C.
t,x,u,v
If the players use strategies (u, v) ∈ U × V , then the expected payoff to player I is given by T u,v Et,x r˜(s, Xs , u(s, Xs ), v(s, Xs ))ds + g(XT ) t
where g is the terminal reward function which is assumed to be bounded and continuous. Now we define the upper and lower values for our game. Define T u,v V (t, x) = inf sup Et,x r˜(s, Xs , u(s, Xs ), v(s, Xs ))ds + g(XT ) . v∈V u∈U
t
Also define u,v V (t, x) = sup inf Et,x
u∈U v∈V
t
T
r˜(s, Xs , u(s, Xs ), v(s, Xs ))ds + g(XT ) .
The function V is called the upper value function of the game, and V is called the lower value function of the game. In the game, player I is trying to maximise his payoff and player II is trying to minimise the payoff of player I. Thus V is the minimum payoff that player I is guaranteed to receive and V is the guaranteed greatest amount that player II can lose to player I. In general V ≤ V . If V (t, x) = V (t, x), then the game is said to have a value. A strategy u∗ is said to be an optimal strategy for player I if u∗ ,v Et,x
for any t, x, v.
t
T
r˜(s, Xs , u (s, Xs ), v(s, Xs ))ds + g(XT ) ≥ V (t, x) ∗
106
M.K. Ghosh and S. Saha
Similarly, v∗ is called an optimal policy for player II if u,v Et,x
∗
T
t
r˜(s, Xs , u(s, Xs ), v∗ (s, Xs ))ds + g(XT ) ≤ V (t, x)
for any t, x, u. Such a pair (u∗ , v∗ ), if it exists, is called a saddle point equilibrium. Our aim is to find the value of the game and to find optimal strategies for both the players. To this end, consider the following pair of Isaacs equations: ⎧ dϕ ⎪ (t, x) + inf sup r˜(t, x, μ , ν ) − λ˜ (t, x, μ , ν )ϕ (t, x) ⎪ dt ⎪ ν ∈P(V ) μ ∈P(U) ⎪ ⎪ ⎨
˜ ˜ x, μ , ν , dz) = 0 +λ (t, x, μ , ν ) X ϕ (t, z)Q(t, ⎪ ⎪ ⎪ on X × [0, T ) and ⎪ ⎪ ⎩ ϕ (T, x) = g(x).
(6.5)
⎧ dψ ⎪ (t, x) + sup inf r˜(t, x, μ , ν ) − λ˜ (t, x, μ , ν )ψ (t, x) ⎪ dt ⎪ ⎪ μ ∈P(U) ν ∈P(V ) ⎪ ⎨
˜ ˜ x, μ , ν , dz) = 0 +λ (t, x, μ , ν ) X ψ (t, z)Q(t, ⎪ ⎪ ⎪ on X × [0, T ) and ⎪ ⎪ ⎩ ϕ (T, x) = g(x).
(6.6)
By Fan’s minimax theorem [4], we have that if ϕ ∈ Cb1,0 ([0, T ] × X) is a solution of (6.5), then it is also a solution of (6.6) and vice versa. The importance of Isaacs equations is illustrated by the following theorem. Theorem 6.3.1 Let ϕ ∗ ∈ Cb1,0 ([0, T ] × X) be a solution of (6.5) and (6.6). Then (i) ϕ ∗ is the value of the game. (ii) Let (u∗ , v∗ ) ∈ U × V be such that inf
ν ∈P(V )
r˜(t, x, u∗ (t, x), ν ) − λ˜ (t, x, u∗ (t, x), ν )ϕ ∗ (t, x) + λ˜ (t, x, u∗ (t, x), ν ) X
= sup
∗ ˜ ϕ (t, z)Q(t, x, u (t, x), ν , dz) ∗
inf
μ ∈P(U) ν ∈P(V )
r˜(t, x, μ , ν ) − λ˜ (t, x, μ , ν )ψ (t, x) + λ˜ (t, x, μ , ν ) X
˜ ψ (t, z)Q(t, x, μ , ν , dz)
(6.7)
6 Continuous-Time Controlled Jump Markov Processes on the Finite Horizon
107
and sup
μ ∈P(U)
r˜(t, x, μ , v∗ (t, x)) − λ˜ (t, x, μ , v∗ (t, x))ϕ ∗ (t, x) + λ˜ (t, x, μ , v∗ (t, x)) X
=
inf
˜ x, μ , v∗ (t, x), dz) ϕ ∗ (t, z)Q(t,
sup
ν ∈P(V ) μ ∈P(U)
r˜(t, x, μ , ν ) − λ˜ (t, x, μ , ν )ϕ (t, x) + λ˜ (t, x, μ , ν ) X
˜ x, μ , ν , dz) . ϕ (t, z)Q(t,
(6.8)
Then u∗ is an optimal policy for player I and v∗ is an optimal policy for player II. Proof. Let u∗ be as in (6.7) and v be any arbitrary strategy of player II. Then by Ito-Dynkin formula applied to the solution ϕ , we obtain ∗
u ,v ϕ ∗ (t, x) ≤ Et,x
T
t ∗
u ,v ≤ inf Et,x
r˜(s, Xs , u∗ (s, Xs ), v(s, Xs ))ds + g(XT )
v∈V
T
t
r˜(s, Xs , u∗ (s, Xs ), v(s, Xs ))ds + g(XT )
≤ V (t, x) . Now let v∗ be as in (6.8) and let u be any arbitrary strategy of player I. Then again by Ito’s formula we obtain ∗
ϕ (t, x) ≥
u,v∗ Et,x
≥ inf
v∈V
t
u,v∗ Et,x
T
r˜(s, Xs , u(s, Xs ), v (s, Xs ))ds + g(XT ) ∗
t
T
r˜(s, Xs , u(s, Xs ), v (s, Xs ))ds + g(XT ) ∗
≥ V (t, x) . From the above two inequalities, it follows that
ϕ ∗ (t, x) = V (t, x) = V (t, x) . Hence ϕ ∗ is the value of the game. Moreover it follows that (u∗ , v∗ ) is a saddle point equilibrium. 2 Now our aim is to find a solution of (6.5) (and hence of (6.6)) in Cb1,0 ([0, T ] × X). Our next theorem asserts the existence of such a solution.
108
M.K. Ghosh and S. Saha
Theorem 6.3.2 Under (A1 )–(A3 ), equation (6.5) has a unique solution in Cb1,0 ([0, T ] × X). Proof. Let ϕ (t, x) = e−γ t ψ (t, x) for some γ < ∞. Substituting in (6.5), we get ⎧ ⎪ e−γ t ddtψ (t, x) − γ e−γ t ψ (t, x) + inf sup r˜(t, x, μ , ν ) − λ˜ (t, x, μ , ν )e−γ t ψ (t, x) ⎪ ⎪ ν ∈P(V ) μ ∈P(U) ⎪ ⎪ ⎨
−γ t ˜ x, μ , ν , dz) = 0 +λ˜ (t, x, μ , ν ) X e ψ (t, z)Q(t, ⎪ ⎪ ⎪ on X × [0, T ) and ⎪ ⎪ ⎩ ψ (T, x) = eγ T g(x). Thus (6.5) has a solution if and only if ⎧ dψ ⎪ sup eγ t r˜(t, x, μ , ν ) − λ˜ (t, x, μ , ν )ψ (t, x) ⎪ dt (t, x) − γψ (t, x) + inf ⎪ ν ∈P(V) μ ∈P(U) ⎪ ⎪ ⎨
˜ ˜ x, μ , ν , dz) = 0 +λ (t, x, μ , ν ) X ψ (t, z)Q(t, ⎪ ⎪ ⎪ on X × [0, T ) and ⎪ ⎪ ⎩ ψ (T, x) = eγ T g(x) has a solution. The above differential equation is equivalent to the following integral equation:
ψ (t, x) = eγ t g(x) + eγ t
T
e− γ s
t
+ λ˜ (s, x, μ , ν )
X
inf
sup
ν ∈P(V ) μ ∈P(U)
eγ s r˜(s, x, μ , ν ) − λ˜ (s, x, μ , ν )ψ (s, x)
˜ x, μ , , ν , dz) ds . ψ (s, z)Q(s,
Let Cbunif ([0, T ] × X) be the same space as defined in the previous section. Define T : Cbunif ([0, T ] × X) → Cbunif ([0, T ] × X) by
T ψ (t, x) = eγ t g(x) + eγ t
T t
e− γ s
inf
sup
ν ∈P(V ) μ ∈P(U)
− λ˜ (s, x, μ , ν )ψ (s, x) + λ˜ (s, x, μ , ν ) For ψ1 , ψ2 ∈ Cbunif ([0, T ] × X), we have
eγ s r˜(s, x, μ , ν )
X
˜ x, μ , ν , dz) ds . ψ (s, z)Q(s,
6 Continuous-Time Controlled Jump Markov Processes on the Finite Horizon
|T ψ1 (t, x) − T ψ2 (t, x)| ≤ eγ t
T t
109
e−γ s 2M||ψ1 − ψ2||ds
=
2M γ t −γ t e [e − e−γ T ]||ψ1 − ψ2 || γ
=
2M [1 − e−γ (T−t) ]||ψ1 − ψ2|| γ
≤
2M ||ψ1 − ψ2 ||. γ
Thus if we choose γ = 2M + 1, then T is a contraction and hence has a fixed point. Let ϕ be the fixed point. Then e−(2M+1)t ϕ is the unique solution of (6.5). 2
6.4 Conclusion In this chapter we have established smooth solutions of dynamic programming equations for continuous-time controlled Markov chains on the finite horizon. This has led to the existence of an optimal Markov strategy for continuous-time MDP and saddle point equilibrium in Markov strategies for zero-sum games. We have used the boundedness condition on the cost function c for simplicity. For continuous-time MDP, if c is unbounded above, then we can show that V (t, x) is the minimal nonnegative solution of (6.3) by approximating the cost function c by c ∧ n for a positive integer n and then letting n → ∞. If c is unbounded on both sides and it satisfies a suitable growth condition, then again we can prove the existence of unique solutions of dynamic programming equations in C1,0 ([0, T ] × X) with appropriate weighted norm; see [5] and [6] for analogous results.
References 1. A. Arapostathis, V. S. Borkar and M. K. Ghosh, Ergodic Control of Diffusion Processes, Cambridge University Press, 2011. 2. V. E. Benes, Existence of optimal strategies based on specified information for a class of stochastic decision problems, SIAM J. Control 8 (1970), 179–188. 3. M. H. A. Davis, Markov Models and Optimization, Chapman and Hall, 1993. 4. K. Fan, Fixed-point and minimax theorems in locally convex topological linear spaces, Proc. of the Natl. Academy of Sciences of the United States of America 38 (1952), 121–126. 5. X. Guo and O. Hern´andez-Lerma, Continuous-Time Markov Decision Processes. Theory and Applications, Springer-Verlag, 2009. 6. X. Guo and O. Hern´andez-Lerma, Zero-sum games for continuous-time jump Markov processes in Polish spaces: discounted payoffs, Adv. in Appl. Probab. 39 (2007), 645–668. 7. S. R. Pliska, Controlled jump processes, Stochastic Processes Appl. 3 (1975), 259–282.
Chapter 7
Existence and Uniqueness of Solutions of SPDEs in Infinite Dimensions T.E. Govindan
7.1 Introduction In this chapter, we study a neutral stochastic partial differential equation in a real Hilbert space of the form d[x(t) + f (t, x(ρ (t)))] = [Ax(t)+a(t, x(ρ (t)))]dt +b(t, x(ρ (t)))dw(t), x(t) = ϕ (t),
t > 0;
t ∈ [−r, 0] (0 ≤ r < ∞);
(7.1) (7.2)
where a : R+ × C → X (R+ = [0, ∞)), b : R+ × C → L(Y, X), and f : R+ × C → D((−A)−α ), 0 < α ≤ 1, are Borel measurable, and −A : D(−A) ⊂ X is the infinitesimal generator of a strongly continuous semigroup {S(t),t ≥ 0} defined on X. Here w(t) is a Y -valued Q-Wiener process, and the past stochastic process {ϕ (t),t ∈ [−r, 0]} has almost sure (a.s) continuous paths with E||ϕ ||Cp < ∞, p ≥ 2. The delay function ρ : [0, ∞) → [−r, ∞) is measurable, satisfying −r ≤ ρ (t) ≤ t,t ≥ 0. To motivate this kind of equations, consider a semilinear equation with a finite delay in the deterministic case; see Hernandez and Henriquez [7]: dx(t) = Ax(t) + f (xt ) + Bu(t), dt
t > 0,
(7.3)
where x(t) ∈ X represents the state, u(t) ∈ Rm denotes the control, xt (s) = x(t + s), −r ≤ s ≤ 0, A : X → X , and B : Rm → X. The feedback control u(t) will be defined by u(t) = K0 x(t) −
d dt
t
−r
K1 (t − s)x(s)ds,
(7.4)
T.E. Govindan () ESFM, Instituto Polit´ecnico Nacional, M´exico D.F. 07738, M´exico e-mail:
[email protected] D. Hern´andez-Hern´andez and A. Minj´arez-Sosa (eds.), Optimization, Control, and Applications of Stochastic Systems, Systems & Control: Foundations & Applications, DOI 10.1007/978-0-8176-8337-5 7, © Springer Science+Business Media, LLC 2012
111
112
T.E. Govindan
where K0 : X → Rm is a bounded linear operator, and K1 : [0, ∞) → L(X, Rm ) is an operator-valued map. The closed system corresponding to the control (7.4) takes the form: t d K1 (t − s)x(s)ds = (A + BK0)x(t) + f (xt ), t > 0. x(t) + dt −r A special case of (7.1), see (7.6) below, was studied in Govindan [4, 5] and Luo [8] by using Lipschitz conditions on all the nonlinear terms and in Govindan [6] using local Lipschitz conditions on a(t, u) and b(t, u). In this chapter, our goal is to study the existence and uniqueness problem for (7.1) and (7.2) using Lipschitz conditions. The rest of the chapter is organized as follows: In Sect. 7.2, we give the preliminaries. For a general theory of stochastic differential equations in infinite dimensions, we refer to the excellent texts of Ahmed [1] and Da Prato and Zabczyk [2]. Since we work in the same framework as in Govindan [4] and Taniguchi [10], we shall be quite brief here. In Sect. 7.3, we present our main result. Some examples are given in Sect. 7.4.
7.2 Preliminaries Let X,Y be real separable Hilbert spaces and L(Y, X) be the space of bounded linear operators mapping Y into X. We shall use the same notation | · | to denote norms in X,Y and L(Y, X). Let (Ω , B, P, {Bt }t≥0 ) be a complete probability space with an increasing right continuous family {Bt }t≥0 of complete sub-σ -algebras of B. Let βn (t)(n = 1, 2, 3, . . .) be a sequence of real-valued standard Brownian motions mu√ tually independent defined on this probability space. Set w(t) = ∑∞ n=1 λn βn (t)en , t ≥ 0, where λn ≥ 0 (n = 1, 2, 3, . . .) are nonnegative real numbers and {en } (n = 1, 2, 3, . . .) is a complete orthonormal basis in Y . Let Q ∈ L(Y,Y ) be an operator defined by Qen = λn en . The above Y -valued stochastic processes w(t) is called a Q-Wiener process. Let h(t) be an L(Y, X)-valued function and let λ be a sequence 1/2 √ √ √ ∞ 2 { λ1 , λ2 , . . .}. Then we define |h(t)|λ = ∑n=1 | λn h(t)en | . If |h(t)|2λ < ∞, then h(t) is called λ -Hilbert-Schmidt operator. Next, we define the X-valued stochastic integral with respect to the Y -valued Q-Wiener process w(t); see Taniguchi [10]. Definition 7.2.1. Let Φ : [0, ∞) → σ (λ )(Y, X) be a Bt -adapted process. Then, for any Φ satisfying 0t E|Φ (s)|2λ ds < ∞, we define the X-valued stochastic integral t 0 Φ (s)dw(s) ∈ X with respect to w(t) by t t Φ (s)dw(s), h = < Φ ∗ (s)h, dw(s) >, h ∈ X, 0
0
where Φ ∗ is the adjoint operator of Φ .
7 Existence and Uniqueness of Solutions of SPDEs
113
A semigroup {S(t),t ≥ 0} is said to be exponentially stable if there exist positive constants M and a such that ||S(t)|| ≤ M exp(−at), t ≥ 0, where || · || denotes the operator norm in X. If M = 1, the semigroup is said to be a contraction. Let C := C([−r, 0]; X) denote the space of continuous functions ϕ : [−r, 0] → X endowed with the norm ||ϕ ||C = sup |ϕ (s)|. Let BT = BT (ϕ ) be the space of −r≤s≤0
measurable random processes φ (t, ω ) with a.s. continuous paths; for each t ∈ [0, T ], φ (t, ω ) is measurable with respect to Bt and φ (s, ω ) = ϕ (s, ω ) for −r ≤ s ≤ 0 with the norm ||φ ||BT = (E||φ (., ω )||Cp )1/p , 1 ≤ p < ∞. BT is a Banach space, see Govindan [4] and the references therein. If {S(t),t ≥ 0} is an analytic semigroup with an infinitesimal generator −A such that 0 ∈ ρ (−A) (the resolvent set of −A), then it is possible to define the fractional power (−A)α , for 0 < α ≤ 1, as a closed linear operator on its domain D((−A)α ). Furthermore, the subspace D((−A)α ) is dense in X and the expression ||x||α = |(−A)α x|,
x ∈ D((−A)α ),
defines a norm on D((−A)α ). Definition 7.2.2. A stochastic process {x(t),t ∈ [0, T ]}(0 < T < ∞) is called a mild solution of equation (7.1) if
(i) x(t) is Bt -adapted with 0T |x(t)|2 dt < ∞, a.s., and (ii) x(t) satisfies the integral equation x(t) = S(t)[ϕ (0) + f (0, ϕ )] − f (t, x(ρ (t))) − + +
t 0
t 0
t 0
AS(t − s) f (s, x(ρ (s)))ds S(t − s)a(s, x(ρ (s)))ds S(t − s)b(s, x(ρ (s)))dw(s),
a.s.,
t ∈ [0, T ].
(7.5)
For convenience of the reader, we will state below some results that will be needed in the sequel.
Lemma 7.2.1. [3] Let WAΦ (t) = 0t S(t − s)Φ (s)dw(s),t ∈ [0, T ]. For any arbitrary p > 2, there exists a constant c(p, T ) > 0 such that for any T ≥ 0 and a proper modification of the stochastic convolution WAΦ , one has E sup |WAΦ (t)| p ≤ c(p, T ) sup ||S(t)|| pE t≤T
t≤T
T 0
|Φ (s)|λp ds.
114
T.E. Govindan
Moreover if E 0T |Φ (s)|λp ds < ∞, then there exists a continuous version of the process {WAΦ ,t ≥ 0}. Lemma 7.2.2. [2, Theorem 6.10] Suppose −A generates a contraction semigroup. Then the process WAΦ (.) has a continuous modification and there exists a constant κ > 0 such that E sup |WAΦ (s)|2 ≤ κ E s∈[0,t]
t 0
|Φ (s)|2λ ds,
t ∈ [0, T ].
Theorem 7.2.1. [1, 9] Let −A be the infinitesimal generator of an analytic semigroup {S(t),t ≥ 0}. If 0 ∈ ρ (A), then (a) S(t) : X → Xα for every t > 0 and α ≥ 0. (b) For every x ∈ Xα , we have S(t)Aα x = Aα S(t)x. (c) For every t > 0, the operator Aα S(t) is bounded and ||Aα S(t)|| ≤ Mα t −α e−at ,
a > 0.
(d) Let 0 < α ≤ 1 and x ∈ D(Aα ); then ||S(t)x − x|| ≤ Cα t α ||Aα x||.
7.3 Existence and Uniqueness of a Mild Solution In this section we study the existence and uniqueness of a mild solution of the equation (7.1). But, before that, it shall be interesting to recall some recent existence and uniqueness results for a mild solution of a simpler equation given by d[x(t) + f (t, xt )] = [Ax(t) + a(t, xt )]dt + b(t, xt )dw(t), x(t) = ϕ (t),
t ∈ [−r, 0] (0 ≤ r < ∞);
t > 0;
(7.6) (7.7)
where xt (s) = x(t + s), −r ≤ s ≤ 0, and all the other terms are as defined before. To begin with, we present our first result for (7.6) from Luo [8]. Hypothesis (A): Let the following assumptions hold a.s.: (A1) −A is the infinitesimal generator of an analytic semigroup of bounded linear operators {S(t),t ≥ 0} in X, and the semigroup is exponentially stable. (A2) The mappings a(t, u) and b(t, u) satisfy the following Lipschitz and linear growth conditions:
7 Existence and Uniqueness of Solutions of SPDEs
115
|a(t, u) − a(t, v)| ≤ L1 ||u − v||C ,
L1 > 0,
|b(t, u) − b(t, v)|λ ≤ L2 ||u − v||C ,
L2 > 0,
|a(t, u)|2 + |b(t, u)|2λ ≤ L23 (1 + ||u||C2 ),
L3 > 0,
for all u, v ∈ C. (A3) f (t, u) is a function continuous in t and satisfies |(−A)α f (t, u) − (−A)α f (t, v)| ≤ L4 ||u − v||C , |(−A)α f (t, u)| ≤ L5 (1 + ||u||C ),
L4 > 0, L5 > 0,
for all u, v ∈ C. Theorem 7.3.2. [8] Let Assumptions (A1)–(A3) hold. Then there exists a unique mild solution {x(t), 0 ≤ t ≤ T } of the problem (7.6) and (7.7) provided L||(−A)−α || + LM1−α Γ (α )/aα < 1, where L = max{L4 , L5 } and Γ (·) is the Gamma function. In Govindan [5], a new successive approximation procedure was introduced to study the existence and uniqueness of a mild solution of equation (7.6) under the following hypothesis. Hypothesis (B): Let the following assumptions hold a.s.: (B1) Same as (A1). (B2) For p ≥ 2, the functions a(t, u) and b(t, u) satisfy the Lipschitz and linear growth conditions: |a(t, u) − a(t, v)| p ≤ L6 ||u − v||Cp , |b(t, u) − b(t, v)|λp |a(t, u)| p + |b(t, u)|λp
≤ ≤
L7 ||u − v||Cp , L8 (1 + ||u||Cp ),
L6 > 0, L7 > 0, L8 > 0,
for all u, v ∈ C. (B3) Same as (A3). Theorem 7.3.3. [5] Let Assumptions (B1)–(B3) hold. For p = 2, {S(t),t ≥ 0} is a contraction. Then there exists a unique mild solution x(t) of the problem (7.6) and (7.7) provided 1/p < α < 1, and L||(−A)−α || < 1, where L = max{L4 , L5 }.
116
T.E. Govindan
Recently, Govindan [6] studied the existence and uniqueness problem for (7.6) by using only a local Lipschitz condition. To consider this result, we need the following Hypothesis (C): Let the following assumptions hold a.s.: (C1) Same as (A1) but now the semigroup is a contraction. (C2) The functions a(t, u) and b(t, u) are continuous and that there exist positive constants Mi = Mi (T ), i = 1, 2 such that |a(t, u) − a(t, v)| ≤ M1 ||u − v||C , |b(t, u) − b(t, v)|λ ≤ M2 ||u − v||C , for all t ∈ [0, T ] and u, v ∈ C. (C3) The function f (t, u) is continuous and that there exists a positive constant M3 = M3 (T ) such that || f (t, u) − f (t, v)||α ≤ M3 ||u − v||C , for all t ∈ [0, T ] and u, v ∈ C. (C4) f (t, u) is continuous in the quadratic mean sense: lim
(t,u)→(s,v)
E|| f (t, u) − f (s, v)||2α → 0.
Theorem 7.3.4. [6] Suppose that Assumptions (C1)–(C4) are satisfied. Then, there exists a time 0 < tm = tmax ≤ ∞ such that (7.6) has a unique mild solution. Further, if tm < ∞, then limt↑tm E|x(t)|2 = ∞. Finally, we state the following hypothesis to consider the main result of the chapter. Hypothesis (D): Let the following assumptions hold a.s.: (D1) −A is the infinitesimal generator of an analytic semigroup of bounded linear operators {S(t),t ≥ 0} in X, and the semigroup is exponentially stable. (D2) For p ≥ 2, the functions a(t, u) and b(t, u) satisfy the Lipschitz and linear growth conditions: p
|a(t, u) − a(t, v)| p ≤ C1 ||u − v||C ,
C1 > 0,
|b(t, u) − b(t, v)|λp
C2 > 0,
≤
C2 ||u − v||Cp ,
for all u, v ∈ C. There exist positive constants C3 , P and δ > a and a continuous function ψ : R+ → R+ satisfying |ψ (t)|2 < P exp(−δ t), t ≥ 0 such that |a(t, u)| p + |b(t, u)|λp ≤ C3 ||u||Cp + ψ (t), for all u ∈ C.
t ≥ 0,
7 Existence and Uniqueness of Solutions of SPDEs
117
(D3) f (t, u) is a function continuous in t and satisfies |(−A)α f (t, u) − (−A)α f (t, v)| ≤ L4 ||u − v||C ,
L4 > 0,
|(−A)α f (t, u)| ≤ L5 (1 + ||u||C ),
L5 > 0,
for all u, v ∈ C. Theorem 7.3.5. Let Assumptions (D1)–(D3) hold. Suppose that for the case p = 2, the semigroup {S(t),t ≥ 0} is a contraction. Then there exists a unique mild solution x(t) of (7.1) provided L||(−A)−α || + LM1−α Γ (α )/aα < 1, where L = max{L4 , L5 }. To prove this theorem, let us introduce the following iteration procedure: Define for each integer n = 1, 2, 3, . . ., x(n) (t) = S(t)[ϕ (0) + f (0, ϕ )] − f (t, x(n)(ρ (t))) − + +
t 0
t 0
t 0
AS(t − s) f (s, x(n) (ρ (s)))ds S(t − s)a(s, x(n−1) (ρ (s)))ds S(t − s)b(s, x(n−1) (ρ (s)))dw(s),
t ∈ [0, T ],
(7.8)
and for n = 0, x(0) (t) = S(t)ϕ (0),
t ∈ [0, T ],
(7.9)
while for n = 0, 1, 2, . . ., x(n) (t) = ϕ (t),
t ∈ [−r, 0].
(7.10)
Proof of Theorem 7.3.5. Let T be any fixed time with 0 < T < ∞. Rewriting (7.8) as x(n) (s) = S(s)ϕ (0) + S(s)(−A)−α (−A)α f (0, ϕ ) − (−A)−α (−A)α f (x, x(n) (ρ (t))) + + +
s 0
s 0
s 0
(−A)1−α S(s − τ )(−A)α f (τ , x(n) (ρ (τ )))dτ S(s − τ )a(τ , x(n−1)(ρ (τ )))dτ S(s − τ )b(τ , x(n−1)(ρ (τ )))dw(τ ),
a.s. s ∈ [0, T ].
118
T.E. Govindan
By Theorem 7.2.1, Assumptions (D1) and (D3) and Luo [8], we get |x(n) (s)| ≤ Me−as |ϕ (0)| + Me−as||(−A)−α ||L5 (1 + ||ϕ ||C ) (n)
+L5 ||(−A)−α ||(1 + ||xs ||C ) s
M1−α e−a(s−τ ) (n) (1 + ||xτ ||C )dτ (s − τ )1−α 0
s
(n−1)
+ S(s − τ )a(τ , x (ρ (τ )))dτ
+
L5
0
s
(n−1)
+ S(s − τ )b(τ , x (ρ (τ )))dw(τ )
0
a.s..
Note that (−A)−α , for 0 < α ≤ 1 is a bounded operator, see Pazy [9, Lemma 6.3 p. 71]. An application of Lemma 7.2.1 (or Lemma 7.2.2 for p = 2) then yields
−α
1 − L5||(−A)
|| − L5M1−α Γ (α )/a
α
p
E sup |x(n) (s)| p 0≤s≤t
p p−1 −α α ≤3 E M|ϕ (0)| + L5 (1 + ||ϕ ||C )((M + 1)||(−A) || + M1−α Γ (α )/a ) + M p T p−1
t 0
+ M c(p, T ) p
E|a(s, x(n−1) (ρ (s)))| p ds
t
E|b(s, x
(n−1)
0
.
(ρ (s)))|λp ds
Next, by Assumption (D2), we have E sup |x(n) (s)| p ≤ 0≤s≤t
3 p−1 [1 − L5||(−A)−α || − L5M1−α Γ (α )/aα ] p
p −α α × E M|ϕ (0)| + L5 (1 + ||ϕ ||C )((M + 1)||(−A) || + M1−α Γ (α )/a + M p (T p−1 + c(p, T ))
t 0
,
(n−1) p ||C + ψ (s))ds
(C3 E||xs
n = 1, 2, 3, . . . .
Since E||ϕ ||Cp < ∞, it follows from the last inequality that E sup |x(n) (s)| p < ∞,
for all n = 1, 2, 3, . . .
and t ∈ [0, T ],
0≤s≤t
proving the boundedness of {x(n) (t)}n≥1. Let us next show that {x(n) } is Cauchy in BT . For this, consider
7 Existence and Uniqueness of Solutions of SPDEs
119
x(1) (s) − x(0) (s) = S(s) f (0, ϕ ) − f (s, x(1) (ρ (s))) + + +
s 0
s 0
s 0
(−A)S(s − τ ) f (τ , x(1) (ρ (τ )))dτ S(s − τ )a(τ , x(0) (ρ (τ )))dτ S(s − τ )b(τ , x(0) (ρ (τ )))dw(τ ).
By Assumptions (D1) and (D3), we have (1)
(0)
|x(1) (s) − x(0) (s)| ≤ M||(−A)−α ||L5 (1 + ||ϕ ||C ) + ||(−A)−α ||L4 ||xs − xs ||C (0)
+ ||(−A)−α ||L5 (1 + ||xs ||C ) Γ (α ) (0) + C3 M1−α α 1 + sup ||xs ||C a 0≤s≤t
s
+
(−A)1−α S(s − τ )(−A)α [ f (τ , x(1) (ρ (τ ))) 0
− f (τ , x
(0)
(ρ (τ )))]dτ
s
+
S(s − τ )a(τ , x(0)(ρ (τ )))dτ
0
s
+
S(s − τ )b(τ , x(0)(ρ (τ )))w(τ )
0
a.s..
Next, using Lemma 7.2.1 (or Lemma 7.2.2 for p = 2) and Assumptions (D2) and (D3), we have −p (1) (0) E||xt − xt ||Cp ≤ 3 p−1 1 − L4 ||(−A)−α || − L4 M1−α Γ (α )/aα × E M||(−A)−α ||L5 (1 + ||ϕ ||C ) (0)
+||(−A)−α ||L5 (1 + sup ||xs ||C ) 0≤s≤t
p Γ (α ) (0) +L5 M1−α α (1 + sup ||xs ||C ) a 0≤s≤t +M (T p
p−1
+ c(p, T ))
t 0
.
(0) (C3 E||xs ||Cp + ψ (s))ds
120
T.E. Govindan
Next, consider x(n) (s) − x(n−1) (s) = f (s, x(n−1) (ρ (s))) − f (s, x(n) (ρ (s))) − + +
s 0
s 0
s 0
AS(s − τ )[ f (τ , x(n) (ρ (τ ))) − f (τ , x(n−1) (ρ (τ )))]dτ S(s − τ )[a(τ , x(n−1)(ρ (τ ))) − a(τ , x(n−2) (ρ (τ )))]dτ S(s − τ )[b(τ , x(n−1)(ρ (τ ))) − b(τ , x(n−2) (ρ (τ )))]dw(τ ).
Estimating as before, we have (n)
(n−1) p ||C
[1 − L4 ||(−A)−α || − L4 M1−α Γ (α )/aα ||] p E||xt − xt ≤ 2 p−1M p T p−1C1
t 0
(n−1)
E||xs
+ 2 p−1M p c(p, T )C2
t 0
(n−2) p ||C ds
− xs
(n−1)
E||xs
(n−2) p ||C ds.
− xs
Thus, (n)
(n−1) p ||C
E||xt − xt
2 p−1 M p (C1 T p−1 + C2 c(p, T )) ≤ [1 − L4 ||(−A)−α || − L4 M1−α Γ (α )/aα ||] p
t
(n−1)
E||xs
0
(n−2) p ||C ds.
− xs
Using the familiar Cauchy formula, (n)
(n−1) p ||C
E||xt − xt
≤
[2 p−1M p (C1 T p−1 + C2 c(p, T ))]n−1 [1 − L4||(−A)−α || − L4M1−α Γ (α )/aα ||](n−1)p ×
≤
t (t − s)n−2 0
(n − 2)!
(1)
(0)
E||xs − xs ||Cp ds
[2 p−1M p (C1 T p−1 + C2 c(p, T ))]n−1 [1 − L4||(−A)−α || − L4M1−α Γ (α )/aα ||](n−1)p (1)
(0)
× E||xs − xs ||Cp
T n−1 . (n − 1)!
This shows that {xn } is Cauchy in BT . Then the standard Borel-Cantelli lemma argument can be used to show that xn (t) → x(t) as n → ∞ uniformly in t on [0, T ] and x(t) is indeed a mild solution of equation (7.1). Finally, to show the uniqueness, let x(t) and y(t) be two mild solutions of equation (7.1). Consider
7 Existence and Uniqueness of Solutions of SPDEs
121
x(s) − y(s) = f (s, y(ρ (s))) − f (s, x(ρ (s))) − + +
s 0
s 0
s 0
AS(s − τ )[ f (τ , x(ρ (τ ))) − f (τ , y(ρ (τ )))]dτ S(s − τ )[a(τ , x(ρ (τ ))) − a(τ , y(ρ (τ )))]dτ S(s − τ )[b(τ , x(ρ (τ ))) − b(τ , y(ρ (τ )))]dw(τ ).
Proceeding as before, we have E||xt − yt ||Cp ≤
2 p−1 M p (C1 T p−1 + C2 c(p, T )) [1 − L4||(−A)−α || − L4 M1−α Γ (α )/aα ||] p
t 0
E||xs − ys ||Cp ds.
Applying Gronwall’s lemma, E||xt − yt ||Cp = 0,
t ∈ [0, T ], 2
and the uniqueness follows. This completes the proof.
7.4 Applications In this section, we consider two examples as applications of the results considered in the previous section. Example 7.4.1. Consider the neutral stochastic partial functional differential equation with finite delays r1 , r2 , and r3 (r > ri ≥ 0, i = 1, 2, 3): d[z(t, x) +
3 (t) ||(−A)3/4 ||
0 −r3
z(t + u, x)du] =
∂2 z(t, x) + 1(t) ∂ x2
0 −r1
+ 2(t)z(t − r2 , x)dβ (t),
z(t + u, x)du dt t > 0, (7.11)
i : R+ → R+ ,
i = 1, 2, 3;
z(s, x) = ϕ (s, x),
ϕ (s, ·) ∈ L [0, π ], 2
z(t, 0) = z(t, π ) = 0,
ϕ (·, x) ∈ C −r ≤ s ≤ 0,
t > 0,
a.s., 0 ≤ x ≤ π;
where β (t) is a standard one-dimensional Wiener process; i (t), i = 1, 2, 3 is a continuous function; and E||ϕ ||C2 < ∞.
122
T.E. Govindan
Take X = L2 [0, π ], Y = R1 . Define −A : X → X by −A = ∂ 2 /∂ x2 with domain D(−A) = {w ∈ X : w, ∂ w/∂ x are absolutely continuous, ∂ 2 w/∂ x2 ∈ X, w(0) = w(π ) = 0}. Then −Aw =
∞
∑ n2(w, wn )wn ,
w ∈ D(−A),
n=1
where wn (x) = 2/π sin nx, n = 1, 2, 3, . . ., is the orthonormal set of eigenvectors of −A. It is well-known that −A is the infinitesimal generator of an analytic semigroup {S(t),t ≥ 0} in X and is given by ∞
S(t)w =
∑ e−n t (w, wn )wn , 2
w ∈ X,
n=1
that satisfies ||S(t)|| ≤ exp(−π 2t),t ≥ 0, and hence is a contraction semigroup. Define now f (t, zt ) = 3 (t) a(t, zt ) = 1 (t)
0 −r3
0
−r1
z(t + u, x)du, z(t + u, x)du,
b(t, zt ) = 2 (t)z(t − r2 , x). Next, || f (t, zt )||3/4
0
3 (t)
3/4
= z(t + u, x)du (−A)
3/4 ||(−A) || −r3 ≤ 3 (T )r3 ||zt ||C ,
a.s..
This shows that f : R+ × C → D((−A)3/4 ) with C4 (T ) = 3 (T )r3 . Similarly, a : R+ × C → X and b : R+ × C → L(R, X). Thus, (7.11) can be expressed as (7.6) with −A, f , a, and b as defined above. Hence, there exists a unique mild solution by Theorem 7.3.4. The existence results Theorems 7.3.2 and 7.3.3 are not applicable to (7.11). Example 7.4.2. Consider the stochastic partial differential equation with a variable delay: dz(t, x) =
∂2 z(t, x) + F(t, z( ρ (t), x)) dt ∂ x2
+ G(t, z(ρ (t), x))dβ (t),
t > 0,
(7.12)
7 Existence and Uniqueness of Solutions of SPDEs
z(t, 0) = z(t, π ) = 0, z(s, x) = ϕ (s, x),
ϕ (s, ·) ∈ L2 [0, π ],
123
t > 0,
ϕ (·, x) ∈ C, −r ≤ s ≤ 0,
0 ≤ x ≤ π;
where β (t) is a standard one-dimensional Wiener process and E||ϕ ||C2 < ∞. Let us assume that F : R+ × R → R and G : R+ × R → R satisfy the conditions: For t ≥ 0, there exists a constant k > 0 such that |F(t, z1 ) − F(t, z2 )|2 + |G(t, z1 ) − G(t, z2 )|2 ≤ k|z1 − z2 |2 ,
z1 , z2 ∈ R,
and there exist positive constants L, P, δ and a continuous function ψ : R+ → R+ satisfying |ψ (t)|2 ≤ P exp(−δ t), for all t ≥ 0 such that |F(t, z)|2 + |G(t, z)|2 ≤ L|z|2 + ψ (t),
z ∈ R.
Take X = L2 [0, π ], Y = R. Define −A : X → X by −A = ∂ 2 /∂ x2 with domain D(−A) given as before in Example 7.4.1. So, −A is the infinitesimal generator of an analytic semigroup {S(t),t ≥ 0} in X that satisfies ||S(t)|| ≤ exp(−π 2t),t ≥ 0. Define now a(t, z(ρ (t))) = F(t, z(ρ (t), x)), and
b(t, z(ρ (t))) = G(t, z(ρ (t), x)).
Next, |a(t, z(ρ (t)))|2 + |b(t, z(ρ (t)))|2λ =
π 0
[|F(t, z(ρ (t), x))|2 + |G(t, z(ρ (t), x))|2 ]dx
≤ π [L||z||C2 + ψ (t)]. This shows that a : R+ × C → X and b : R+ × C → L(R, X). Thus, (7.12) can be expressed as (7.1) with −A, a, and b as defined above and with f = 0. It can be verified as above that |a(t, z1 (ρ (t))) − a(t, z2 (ρ (t)))|2 + |b(t, z1 (ρ (t))) − b(t, z2 (ρ (t)))|2λ ≤ π k||z1 − z2 ||C2 . Hence, (7.12) has a unique mild solution by Theorem 7.3.5. Acknowledgements The author wishes to thank SIP and COFAA both from IPN, Mexico for financial support.
124
T.E. Govindan
References 1. Ahmed, N.U.: Semigroup Theory with Applications to Systems and Control, Pitman Research Notes in Math., Vol. 246, (1991). 2. Da Prato, G., Zabczyk, J.: Stochastic Equations in Infinite Dimensions, Cambridge Univ. Press, Cambridge, (1992). 3. Da Prato, G., Zabczyk, J.: A note on stochastic convolution, Stochastic Anal. Appl. 10, 143–153 (1992). 4. Govindan, T.E.: Almost sure exponential stability for stochastic neutral partial functional differential equations, Stochastics 77, 139–154 (2005). 5. Govindan, T.E.: A new iteration procedure for stochastic neutral partial functional differential equations, Internat. J. Pure Applied Math. 56, 285–298 (2009). 6. Govindan, T.E.: Mild solutions of neutral stochastic partial functional differential equations, Internat. J. Stochastic Anal. vol. 2011, Article ID 186206, 13 pages, (2011). doi:10.1155/2011/186206. 7. Hernandez, E., Henriquez, H.R.: Existence results for partial neutral functional differential equations with unbounded delay, J. Math. Anal. Appl. 221, 452–475 (1998). 8. Luo, L.: Exponential stability for stochastic neutral partial functional differential equations, J. Math. Anal. Appl. 355, 414–425 (2009). 9. Pazy, A.: Semigroups of Linear operators and Applications to Partial Differential Equations, Springer, New York, (1983). 10. Taniguchi, T.: Asymptotic stability theorems of semilinear stochastic evolution equations in Hilbert spaces, Stochastics Stochastics Reports 53, 41–52 (1995).
Chapter 8
A Constrained Optimization Problem with Applications to Constrained MDPs Xianping Guo, Qingda Wei, and Junyu Zhang
8.1 Introduction Constrained optimization problems form an important aspect in control theory, for instance, constrained Markov decision processes (MDPs) [2, 3, 10–13, 15–21, 24, 25, 27, 29–33, 35, 36], and constrained diffusion processes [4–9]. In this chapter, we are concerned with a constrained optimization problem, in which the objective function is defined on the product space of a linear space and a convex set. The constrained optimization problem is to maximize the values of the function with any fixed variable in the linear space, over a constrained subset of the convex set which is given by the function with another fixed variable from the linear space and with a given constraint. The basic idea for the constrained optimization problem comes from the studies on the discounted and average optimality for discrete- and continuous-time MDPs with a constraint. We aim to develop a unified approach to dealing with such constrained MDPs. More precisely, for discrete- and continuoustime MDPs with a constraint, the linear space can be taken as a set of some realvalued functions such as reward and cost functions in such MDPs, and the convex set can be chosen as the set of all randomized Markov policies, the set of all randomized stationary policies, or the set of all the occupation measures according to a specified case of MDPs with different criteria. The objective function can be taken as one of the expected discounted (average) criteria, while the first variable in the objective function can be taken as the reward/cost functions in MDPs and the second one as a policy in a class of policies. Thus, MDPs with a constraint can be reduced to our constrained optimization problem.
Research supported by NSFC, GDUPS, RFDP, and FRFCU. X. Guo () • Q. Wei • J. Zhang Sun Yat-Sen University, Guangzhou 510275, China e-mail:
[email protected];
[email protected];
[email protected] D. Hern´andez-Hern´andez and A. Minj´arez-Sosa (eds.), Optimization, Control, and Applications of Stochastic Systems, Systems & Control: Foundations & Applications, DOI 10.1007/978-0-8176-8337-5 8, © Springer Science+Business Media, LLC 2012
125
126
X. Guo et al.
A fundamental question on the constrained optimization problem is whether there exists a constrained-optimal solution. The Lagrange multiplier technique is a classical approach to proving the existence of a constrained-optimal solution for such an optimization problem. There are several authors using the Lagrange multiplier technique to study MDPs with a constraint; see, for instance, discretetime constrained MDPs with the discounted and average criteria [3, 32, 33] and continuous-time constrained MDPs with the discounted and average criteria [18, 20,35]. All the aforementioned works [3,18,20,32,33,35] require the nonnegativity assumption on the costs. We also apply this approach to analyze the constrained optimization problem. Following the arguments in [3, 18, 20, 32, 33, 35], we give conditions under which we prove the existence of a constrained-optimal solution to the constrained optimization problem, and also give a characterization of a constrained-optimal solution for a particular case. Then, we apply our main results to discrete- and continuous-time constrained MDPs with discounted and average criteria. More precisely, in Sect. 8.4.1, we use the results to show the existence of a constrained-optimal policy for the discounted discrete-time MDPs with a constraint in which the state space is a Polish space and the rewards/costs may be unbounded from above and from below. To the best of our knowledge, there are no any existing works dealing with constrained discounted discrete-time MDPs in Borel spaces and with unbounded rewards/costs. In Sect. 8.4.2, we investigate an application of the main results to constrained discrete-time MDPs with state-dependent discount factors and extend the results in [32] to the case in which discount factors can depend on states and rewards/costs can be unbounded from above and from below. In Sects. 8.4.3 and 8.4.4, we consider the average and discounted continuous-time MDPs with a constraint, respectively. Removing the nonnegativity assumption on the cost function as in [18, 20, 35], we prove that the results in [18, 20, 35] still hold using the results in this chapter. The rest of this chapter is organized as follows. In Sect. 8.2, we introduce the constrained optimization problem under consideration and give some preliminary facts needed to prove the existence of a constrained-optimal solution to the optimization problem. In Sect. 8.3, we state and prove our main results on the existence of a constrained-optimal solution. In Sect. 8.4, we provide some applications of our main results to constrained MDPs with different optimality criteria.
8.2 A Constrained Optimization Problem In this section, we state the constrained optimization problem under consideration and give some preliminary results needed to prove the existence of a constrainedoptimal solution. To do so, we introduce some notation below: 1. Let C be a linear space and D a convex set.
8 A Constrained Optimization Problem with Applications to Constrained MDPs
127
2. Suppose that G is a real-valued function on the product space C × D and satisfies the following property: G(k1 c1 + k2 c2 , d) = k1 G(c1 , d) + k2G(c2 , d)
(8.1)
for any c1 , c2 ∈ C, d ∈ D and any constants k1 , k2 ∈ R := (−∞, +∞). For any fixed c ∈ C, let U := d ∈ D : G(c, d) ≤ ρ , which depends on the given c and a so-called constraint constant ρ . Then, for another given r ∈ C, we consider a constrained optimization problem below: Maximize G(r, ·) over U.
(8.2)
Definition 8.2.1. d ∗ ∈ U is said to be a constrained-optimal solution to the problem (8.2) if d ∗ maximizes G(r, d) over d ∈ U, that is, G(r, d ∗ ) = sup G(r, d). d∈U
Remark 8.2.1. When D is a compact and convex metric space, and G(r, d) and G(c, d) are continuous in d ∈ D, it follows from the Weierstrass theorem [1, p.40] that there exists a constrained-optimal solution. In general, however, D may be unmetrizable in some cases, such as the set of all randomized Markov policies in continuous-time MDPs [20, p.10]; see continuous-time constrained MDPs with the discounted criteria in Sect. 8.4.4. In order to solve (8.2), we assume that there exists a subset D ⊆ D, which is assumed to be a compact metric space throughout this chapter. To analyze problem (8.2), we define the following unconstrained optimization problem by introducing a Lagrange multiplier λ ≥ 0, bλ := r − λ c,
G∗ (bλ ) := sup G(bλ , d),
(8.3)
d∈D
and then give the conditions below. Assumption 8.2.1
= 0. / (i) For each fixed λ ≥ 0, D∗λ := d λ ∈ D | G(bλ , d λ ) = G∗ (bλ ) (ii) There exists a constant M > 0, such that max |G(r, d)|, |G(c, d)| ≤ M for all d ∈ D. (iii) G(r, d) and G(c, d) are continuous in d ∈ D .
128
X. Guo et al.
Assumption 8.2.1(i) implies that there exists at least an element d λ ∈ D such that G(bλ , ·) attains its maximum. In addition, the boundedness and continuity hypotheses in Assumptions 8.2.1(ii) and (iii) are commonly used in optimization control theory. Assumption 8.2.2 For the given c ∈ C, there exists an element d ∈D (depending = 0. / on c) such that G(c, d ) < ρ , which means that d ∈ D | G(c, d) < ρ Remark 8.2.2. Assumption 8.2.2 is a Slater-like hypothesis, typical for the constrained optimization problems; see, for instance, [3, 17, 18, 20, 32, 33, 35]. In order to prove the existence of a constrained-optimal solution, we need the following preliminary lemmas. Lemma 8.2.1. Suppose that Assumption 8.2.1(i) holds. Then, G(c, d λ ) is nonincreasing in λ ∈ [0, ∞), where d λ ∈ D∗λ is arbitrary but fixed for each λ ≥ 0. Proof. For each d ∈ D, by (8.1) and (8.3), we have G(bλ , d) = G(r, d) − λ G(c, d) for all λ ≥ 0. Moreover, since G(bλ , d λ ) = G∗ (bλ ) for all λ ≥ 0 and d λ ∈ D∗λ , we have, for any h > 0, −hG(c, d λ ) = G(bλ +h , d λ ) − G(bλ , d λ ) ≤ G(bλ +h , d λ +h) − G(bλ , d λ ) ≤ G(bλ +h , d λ +h) − G(bλ , d λ +h ) = −hG(c, d λ +h), which implies that
G(c, d λ ) ≥ G(c, d λ +h ).
Hence, G(c, d λ ) is nonincreasing in λ ∈ [0, ∞).
2
Remark 8.2.3. Under Assumption 8.2.1(i), it follows from Lemma 8.2.1 that the following nonnegative constant λ := inf λ ≥ 0 : G(c, d λ ) ≤ ρ , d λ ∈ D∗λ
(8.4)
is well defined. Lemma 8.2.2. Suppose that Assumptions 8.2.1(i), (ii) and 8.2.2 hold. Then, the λ in (8.4) is finite; that is, λ is in [0, ∞). constant
Proof. Let κ := ρ − G(c, d ) > 0, with d as in Assumption 8.2.2. Since lim
2M λ →∞ λ
for the constant M as in Assumption 8.2.1(ii), there exists δ > 0 such that
=0
8 A Constrained Optimization Problem with Applications to Constrained MDPs
129
2M − κ < 0 for all λ ≥ δ . λ
(8.5)
Thus, for any d λ ∈ D∗λ with λ ≥ δ , we have
G(r, d λ ) − λ G(c, d λ ) = G(bλ , d λ ) ≥ G(bλ , d ) = G(r, d ) − λ G(c, d ). That is,
G(r, d λ ) − G(r, d ) + G(c, d ) − ρ ≥ G(c, d λ ) − ρ , λ
which, together with Assumption 8.2.1(ii) and (8.5), yields
|G(r, d λ )| + |G(r, d )| 2M −κ ≤ − κ < 0 for all λ ≥ δ . (8.6) λ λ
G(c, d λ ) − ρ ≤
λ ≤ δ < ∞. Hence, it follows from (8.6) that
2
Lemma 8.2.3. Suppose that Assumptions 8.2.1(i) and (iii) hold. If lim λk = λ , and k→∞
d λk ∈ D∗λ (for each k ≥ 1) is such that lim d λk = d ∈ D , then d ∈ D∗λ . k
k→∞
Proof. As d λk ∈ D∗λ for all k ≥ 1, by (8.1) and (8.3), we have k
G(r, d λk ) − λk G(c, d λk ) = G(bλk , d λk ) ≥ G(bλk , d) = G(r, d) − λk G(c, d) (8.7) for all d ∈ D. Letting k → ∞ in (8.7) and using Assumption 8.2.1(iii), we get G(bλ , d) = G(r, d) − λ G(c, d) ≥ G(r, d) − λ G(c, d) = G(bλ , d) for all d ∈ D. Thus, d ∈ D∗λ .
2
Lemma 8.2.4. If there exist λ0 ≥ 0 and d ∗ ∈ U such that G(c, d ∗ ) = ρ and G(bλ0 , d ∗ ) = G∗ (bλ0 ), then d ∗ is a constrained-optimal solution to the problem (8.2). Proof. For any d ∈ U, since G(bλ0 , d ∗ ) = G∗ (bλ0 ) ≥ G(bλ0 , d), we have G(r, d ∗ ) − λ0G(c, d ∗ ) ≥ G(r, d) − λ0G(c, d).
(8.8)
As G(c, d ∗ ) = ρ and G(c, d) ≤ ρ (because d ∈ U), from (8.8) we get G(r, d ∗ ) ≥ G(r, d ∗ ) + λ0(G(c, d) − ρ ) ≥ G(r, d) for all d ∈ U, which implies the desired result.
2
130
X. Guo et al.
8.3 Main Results In this section, we focus on the existence of a constrained-optimal solution. To do so, in addition to Assumptions 8.2.1 and 8.2.2, we also impose the following condition. Assumption 8.3.1
(i) For each θ ∈ [0, 1], d1 , d2 ∈ D∗ , dθ := θ d1 + (1 − θ )d2 satisfies G(bλ , dθ ) =
λ
G∗ (bλ ). (ii) G(c, dθ ) is continuous in θ ∈ [0, 1]. Remark 8.3.4. For each fixed c1 ∈ C, if G(c1 , ·) satisfies the following property G(c1 , dθ ) = θ G(c1 , d1 ) + (1 − θ )G(c1, d2 ) for all d1 , d2 ∈ D and θ ∈ [0, 1], then Assumption 8.3.1 is obviously true. Now we give our first main result on the problem (8.2). Theorem 8.3.1. Under Assumptions 8.2.1, 8.2.2, and 8.3.1, the following statements hold: λ = 0, then there exists a constrained-optimal solution d∈ D . (a) If (b) If λ > 0, then a constrained-optimal solution d ∗ ∈ D exists, and moreover, there exist a number θ ∗ ∈ [0, 1] and d 1 , d 2 ∈ D∗ such that λ
G(c, d 1 ) ≥ ρ , G(c, d 2 ) ≤ ρ , and d ∗ = θ ∗ d 1 + (1 − θ ∗)d 2 .
λ = 0: By the definition of λ , there exists a sequence d λk ∈ Proof. (a) The case ∗ Dλ ⊂ D such that λk ↓ 0 as k → ∞. Because D is compact, without loss of k generality, we may assume that d λk → d∈ D . Thus, by Lemma 8.2.1, we have G(c, d λk ) ≤ ρ for all k ≥ 1, and then it follows from Assumption 8.2.1(iii) that d∈ U. Moreover, for each d ∈ U, we have G∗ (bλk ) = G(bλk , d λk ) ≥ G(bλk , d), which, together with Assumption 8.2.1(ii), implies G(r, d λk ) − G(r, d) ≥ λk (G(c, d λk ) − G(c, d)) ≥ −2λk M.
(8.9)
Letting k → ∞ in (8.9), by Assumption 8.2.1(iii), we have ≥ G(r, d) for all d ∈ U, G(r, d) which means that d is a constrained-optimal solution. (b) The case λ ∈ (0, ∞): Since λ is in (0, ∞), there exist two sequences of positive λ , and δk ↓ λ numbers {λk } and {δk } such that d λk ∈ D∗λ , d δk ∈ D∗δ , λk ↑ k
k
8 A Constrained Optimization Problem with Applications to Constrained MDPs
131
as k → ∞. By the compactness of D , we may suppose that d λk → d 1 ∈ D and d δk → d 2 ∈ D . By Lemma 8.2.3, we have d 1 , d 2 ∈ D∗ . By Assumption 8.2.1(iii) λ and Lemma 8.2.1, we have G(c, d 1 ) ≥ ρ and G(c, d 2 ) ≤ ρ .
(8.10)
Define the following map:
θ → G(c, θ d 1 + (1 − θ )d 2) for each θ ∈ [0, 1]. Thus, it follows from Assumption 8.3.1(ii) and (8.10) that there exists θ ∗ ∈ [0, 1] such that G(c, θ ∗ d 1 + (1 − θ ∗)d 2 ) = ρ .
(8.11)
Let d ∗ := θ ∗ d 1 +(1− θ ∗)d 2 . Then, by Assumption 8.3.1(i), we have G(bλ , d ∗ ) = G∗ (bλ ), which together with (8.11) and Lemma 8.2.4 yields that d ∗ ∈ D is a constrained-optimal solution. 2 To further characterize a constrained-optimal solution, we next consider a particular case of the problem (8.2). A special case: Let X := {1, 2, . . .}, Y be a metric space, and P(Y ) the set of all probability measures on Y . For each i ∈ X, Y (i) ⊂ Y is assumed to be a compact metric space. Let D := {ψ | ψ : X → P(Y ) such that ψ (·|i) ∈ P(Y (i)) ∀ i ∈ X}, and D := {d| d : X → Y such that d(i) ∈ Y (i) ∀ i ∈ X}. Remark 8.3.5. (a) The set D is convex. That is, if ψk (k = 1, 2) are in D, and ψ p (·|i) := pψ1 (·|i) + (1 − p)ψ2(·|i) for any p ∈ [0, 1] and i ∈ X, then ψ p ∈ D. (b) A function d ∈ D may be identified with the element ψ ∈ D, for which ψ (i) is the Dirac measure at the point d(i) for all i ∈ X. Hence, we have D ⊂ D. (c) Note that D can be written as the product space D = ∏i∈X Y (i). Hence, by the compactness of Y (i) and the Tychonoff theorem, D is a compact metric space. In order to obtain the characterization of a constrained-optimal solution for this particular case, we also need the following condition. Assumption 8.3.2 For each λ ≥ 0, if d 1 , d 2 ∈ D∗λ , then d ∈ D∗λ for each d ∈ d ∈ D : d(i) ∈ {d 1 (i), d 2 (i)} ∀ i ∈ X . Then, we have the second main result on the problem (8.2) as follows. Theorem 8.3.2. (For the special case.) Suppose that Assumptions 8.2.1, 8.2.2, 8.3.1, and 8.3.2 hold for the special case. Then there exists a constrained-optimal solution d ∗ , which is of one of the following two forms (i) and (ii): (i) d ∗ ∈ D and (ii) there exist g1 , g2 ∈ D∗ , a point i∗ ∈ X, and a number θ0 ∈ [0, 1] such that λ
g1 (i) = g2 (i) for all i = i∗ , and, in addition,
132
X. Guo et al.
⎧ f or y = g1 (i) when i = i∗ , ⎨ θ0 ∗ d (y|i) = 1 − θ0 f or y = g2 (i) when i = i∗ , ⎩ 1 f or y = g1 (i) when i = i∗ .
λ be as in (8.4). If λ = 0, by Theorem 8.3.1 we have d ∗ ∈ D . Thus, Proof. Let we only need to consider the other case λ > 0. By Theorem 8.3.1(b), there exist d 1 , d 2 ∈ D∗ such that G(c, d 1 ) ≥ ρ and G(c, d 2 ) ≤ ρ . If G(c, d 1 ) (or G(c, d 2 ))= ρ , it λ
follows from Lemma 8.2.4 that d 1 (or d 2 ) is a constrained-optimal solution. Hence, to complete the proof, we shall consider the following case: G(c, d 1 ) > ρ and G(c, d 2 ) < ρ .
(8.12)
Using d 1 and d 2 , we construct a sequence {dn } as follows. For all n ≥ 1 and i ∈ X, let 1 d (i) i < n, dn (i) = d 2 (i) i ≥ n. Obviously, d1 = d 2 and lim dn = d 1 . Since d 1 , d 2 ∈ D∗ , by Assumption 8.3.2, we n→∞
λ
see that dn ∈ D∗ for all n ≥ 1. As d1 = d 2 , by (8.12) we have G(c, d1 ) < ρ . If there λ exists n∗ such that G(c, dn∗ ) = ρ , then dn∗ is a constrained-optimal solution (by Lemma 8.2.4). Thus, in the remainder of the proof, we may assume that G(c, dn ) =ρ for all n ≥ 1. If G(c, dn ) < ρ for all n ≥ 1, then by Assumption 8.2.1(iii), we have lim G(c, dn ) = G(c, d 1 ) ≤ ρ ,
n→∞
which is a contradiction to (8.12). Hence, there exists some n > 1 such that G(c, dn ) > ρ , which, together with G(c, d1 ) < ρ , gives the existence of some n such that G(c, dn) < ρ and G(c, dn+1 ) > ρ .
(8.13)
Obviously, dn and dn+1 differ in at most the point n. Let g1 := dn, g2 := dn+1 and i∗ := n. For any θ ∈ [0, 1], using g1 and g2 , we construct dθ ∈ D as follows. For each i ∈ X, ⎧ for y = g1 (i) when i = i∗ , ⎨θ dθ (y|i) = 1 − θ for y = g2 (i) when i = i∗ , ⎩ 1 for y = g1 (i) when i = i∗ . Then, we have dθ (·|i) = θ δg1 (i) (·) + (1 − θ )δg2(i) (·)
(8.14)
8 A Constrained Optimization Problem with Applications to Constrained MDPs
133
for all i ∈ X and θ ∈ [0, 1], where δy (·) denotes the Dirac measure at any point y. Hence, by (8.13), (8.14), and Assumption 8.3.1, there exists θ0 ∈ (0, 1) such that
G(c, dθ0 ) = ρ and G(bλ , dθ0 ) = G∗ (bλ ), which, together with Lemma 8.2.4, yield that dθ0 is a constrained-optimal solution. Obviously, dθ0 randomizes between g1 and g2 , which differ in at most the point i∗ , and so the theorem follows. 2
8.4 Applications to MDPs with a Constraint In this section, we show applications of the constrained optimization problem to MDPs with a constraint. In Sect. 8.4.1, we use Theorem 8.3.1 to show the existence of a constrained-optimal policy for the constrained discounted discretetime MDPs in a Polish space and with unbounded rewards/costs. In Sect. 8.4.2, we investigate an application of Theorem 8.3.2 to discrete-time constrained MDPs with state-dependent discount factors. In Sects. 8.4.3 and 8.4.4, we will improve the corresponding results in [18, 20, 35] using Theorem 8.3.2 above.
8.4.1 Discrete-Time Constrained MDPs with Discounted Criteria The constrained discounted discrete-time MDPs with a constant discount factor have been studied; see, for instance, [2, 11, 32] for the case of a countable state space and [15, 16, 24, 27, 29] for the case of a Borel state space. Except [2] dealing with the case in which the rewards may be unbounded from above and from below, all the aforementioned works investigate the case in which rewards are assumed to be bounded from above. To the best of our knowledge, in this subsection we first deal with the case in which the state space is a Polish space and the rewards may be unbounded from above and from below. The model of discrete-time constrained MDPs under consideration is as follows [22, 23]: X, A, (A(x), x ∈ X), Q(·|x, a), r(x, a), c(x, a), ρ ,
(8.15)
where X and A are state and action spaces, which are assumed to be Polish spaces with Borel σ -algebras B(X) and B(A), respectively. We denote by A(x) ∈ B(A) the set of admissible actions at state x ∈ X. Let K := {(x, a)|x ∈ X, a ∈ A(x)}, which is assumed to be a closed subset of X × A. Furthermore, the transition law Q(·|x, a) with (x, a) ∈ K is a stochastic kernel on X given K. Finally, the function r(x, a) on
134
X. Guo et al.
K denotes rewards, while the function c(x, a) on K and the number ρ denote costs and a constraint, respectively. We assume that r(x, a) and c(x, a) are real-valued Borel-measurable on K. We denote by Π , Φ , and F the classes of all randomized history-dependent policies, randomized stationary policies, and stationary policies, respectively; see [22, 23] for details. Let Ω := (X × A)∞ and F the corresponding product σ -algebra. Then, for an arbitrary policy π ∈ Π and an arbitrary initial distribution ν on X, the well-known Tulcea theorem [22, p.178] gives the existence of a unique probability measure Pνπ on (Ω , F ) and a stochastic process {(xk , ak ), k ≥ 0}. The expectation operator with respect to Pνπ is denoted by Eνπ , and we write Eνπ as Exπ when ν ({x}) = 1. Fix a discount factor α ∈ (0, 1) and an initial distribution ν on X. We define the expected discounted reward V (r, π ) and the expected discounted cost V (c, π ) as follows:
∞ ∞ π k π k V (r, π ) := Eν ∑ α r(xk , ak ) and V (c, π ) := Eν ∑ α c(xk , ak ) for all π ∈ Π . k=0
k=0
Then, the constrained optimization problem for the model (8.15) is as follows: Maximize V (r, ·) over U1 := π ∈ Π | V (c, π ) ≤ ρ . (8.16) To solve (8.16), we consider the following conditions: (B1) There exist a continuous function ω1 ≥ 1 on X and positive constants L1 , m, and β1 < 1 such that, for each (x, a) ∈ K, |r(x, a)| ≤ L1 ω1 (x), |c(x, a)| ≤ L1 ω1 (x), and
X
ω12 (y)Q(dy|x, a) ≤ β1 ω12 (x) + m.
(B2) The function ω1 is a moment function on K, that is, there exists a nondecreasing sequence of compact sets Kn ↑ K such that lim inf ω1 (x) : (x, a) ∈ / n→∞ Kn = ∞. (B3) ν (ω12 ) := X ω12 (x)ν (dx) < ∞. (B4) Q(·|x, a) is weakly continuous on K, that is, the function X u(y)Q(dy|x, a) is continuous in (x, a) ∈ K for each bounded continuous function u on X. (B5) The functions r(x, a) and c(x, a) are continuous on K. (B6) There exists π ∈ Π such that V (c, π ) < ρ . Remark 8.4.6. (a) Condition (B1) is known as the statement of the Lyapunovlike inequality and the growth condition on the rewards/costs. Conditions (B4) and (B5) are the usual continuity conditions. Condition (B6) is the Slater-like condition. (b) Conditions (B1) and (B3) are used to guarantee the finiteness of the expected discounted rewards/costs. The role of condition (B2) is to prove the compactness of the set of all the discount occupation measures in the ω1 -weak topology (see Lemma 8.4.5).
8 A Constrained Optimization Problem with Applications to Constrained MDPs
135
To state our main results of Sect. 8.4.1, we need to introduce some notation. Let ω1 be as in condition (B1). We denote by Bω1 (X) the Banach space of realvalued measurable functions u on X with the finite norm uω1 := sup ω|u(x)| (x) , that x∈X
1
is, Bω1 (X) := {u| uω1 < ∞}. Moreover, we say that a function v on K belongs to Bω1 (K) if x → sup |v(x, a)| is in Bω1 (X). We denote by Cω1 (K) the set of all a∈A(x)
continuous functions on K which also belong to B (K), and Mω1 (K) stands for ω1 the set of all measures μ on B(K) such that K ω1 (x)μ (dx, da) < ∞. Moreover, Mω1 (K) is endowed with ω1 -weak topology. Recall that the ω1 -weak topology on Mω1 (K) is the coarsest topology for which all mappings μ → K v(x, a)μ (dx, da) are continuous for each v ∈ Cω1 (K). Since X and A are both Polish spaces, by Corollary A.44 in [14, p.423] we see that Mω1 (K) is metrizable with respect to the ω1 -weak topology. By Lemma 24 in [29, p.141], it suffices to consider the discount occupation measures induced by randomized stationary policies in Φ . For each ϕ ∈ Φ , we define the discount occupation measure by
η ϕ (B × E) :=
∞
ϕ
∑ α k Pν (xk ∈ B, ak ∈ E)
for all B ∈ B(X) and E ∈ B(A).
k=0
The set of all the discount occupation measures is denoted by N , i.e., N := {η ϕ : ϕ ∈ Φ }. From the conditions (B1) and (B3), we have K
ω1 (x)η ϕ (dx, da) =
∞
ϕ
∑ α k Eν
k=0
ν (ω12 ) m 0, there exists an integer N2 > 0 such that
sup
ϕ ∈Φ KNc 2
ω1 (x)η ϕ (dx, da) ≤ ε .
(8.21)
Hence, by (8.19), (8.21), and Corollary A.46 in [14, p.424], we conclude that N is relatively compact in the ω1 -weak topology. Therefore, N is compact in the ω1 weak topology. 2 Under conditions (B1)–(B5), from Lemma 8.4.5 and (8.18), we define a realvalued function G on C × D := Cω1 (K) × N as follows: G(c, η ) :=
K
c(x, a)η (dx, da) for (c, η ) ∈ C × D = Cω1 (K) × N .
(8.22)
Obviously, the function G defined in (8.22) satisfies (8.1). Moreover, let D := N . Now we provide our main result of Sect. 8.4.1 on the existence of constrainedoptimal policies for (8.16). Proposition 8.4.1. Under conditions (B1)–(B6), there exists a constrained-optimal policy ϕ ∗ ∈ Φ for the constrained MDPs in (8.16), that is, V (r, ϕ ∗ ) = sup V (r, π ). π ∈U1
Proof. We first verify Assumption 8.2.1. From conditions (B1) and (B5), we see that → K (r(x, a) − λ c(x, a))η ϕ (dx, da) is continuous for each λ ≥ 0, the mapping η ϕ on N . Thus, Assumption 8.2.1(i) follows from the compactness of N . Moreover, by condition (B1) and (8.17), we have L1 ν (ω12 ) mL1 =: M max |G(r, η )|, |G(c, η )| ≤ + 1−α (1 − β1)(1 − α ) for all η ∈ N , and so Assumption 8.2.1(ii) follows. By conditions (B1) and (B5), we see that Assumption 8.2.1(iii) is obviously true. Secondly, Assumption 8.2.2 follows from condition (B6) and Lemma 24 in [29, p.141]. Finally, since G(c, θ η1 + (1 − θ )η2 ) = θ G(c, η1 ) + (1 − θ )G(c, η2 ) for all θ ∈ [0, 1] and η1 , η2 ∈ N , Assumption 8.3.1 is obviously true. Hence, Theorem 8.3.1 gives the existence of η ∗ ∈ N such that K
r(x, a)η ∗ (dx, da) = sup
η ∈Uo K
r(x, a)η (dx, da),
which together with Theorem 6.3.7 in [22] implies Proposition 8.4.1.
2
138
X. Guo et al.
8.4.2 Constrained MDPs with State-Dependent Discount Factors In this subsection, we use discrete-time constrained MDPs with state-dependent discount factors to present another application of the constrained optimization problem. Discrete-time unconstrained MDPs with nonconstant discount factors are studied in [26, 34]. Moreover, [36] deals with discrete-time constrained MDPs with state-dependent discount factors in which the costs are assumed to be bounded from below by a convex analytic approach, and here we use Theorem 8.3.1 above to deal with the case in which the costs are allowed to be unbounded from above and from below. The model of discrete-time constrained MDPs with state-dependent discount factors is as follows: X, A, (A(i), i ∈ X), Q(·|i, a), (α (i), i ∈ X), r(i, a), c(i, a), ρ , where the state space X is the set of all positive integers, α (i) ∈ (0, 1) are given discount factors depending on state i ∈ X, and the other components are the same as in (8.15), with ik here in lieu of xk in Sect. 8.4.1. Fix any initial distribution ν on X. The discounted criteria, W (r, π ) and W (c, π ), are defined by W (r, π )
:= Eνπ
r(i0 , a0 ) + ∑
∏ α (ik )r(in , an ) ,
∞ n−1
n=1 k=0
α (i )c(i , a ) ∏ k n n for all π ∈ Π .
∞ n−1
W (c, π ) := Eνπ c(i0 , a0 ) + ∑
n=1 k=0
Then, the constrained optimization problem is as follows: sup W (r, π ) subject to W (c, π ) ≤ ρ .
π ∈Π
(8.23)
To ensure the existence of a constrained-optimal policy π* for (8.23) (i.e., W(r,π*) ≥ W(r,π) for all π such that W(c,π) ≤ ρ), we consider the following conditions from [34]:
(C1) There exists a constant α ∈ (0,1) such that 0 < α(i) ≤ α for all i ∈ X.
(C2) There exist constants L_2 > 0 and β_2, with 1 ≤ β_2 < 1/α, and a function ω_2 ≥ 1 on X such that, for each (i,a) ∈ K, |r(i,a)| ≤ L_2 ω_2(i), |c(i,a)| ≤ L_2 ω_2(i), and
∑_{j∈X} ω_2(j) Q(j|i,a) ≤ β_2 ω_2(i).
(C3) ν(ω_2) := ∑_{i∈X} ω_2(i) ν(i) < ∞.
(C4) For each i ∈ X, A(i) is compact.
(C5) For each i, j ∈ X, the functions r(i,a), c(i,a), Q(j|i,a), and ∑_{k∈X} ω_2(k) Q(k|i,a) are continuous in a ∈ A(i).
(C6) There exists π ∈ Π such that W(c,π) < ρ.
Remark 8.4.7. Conditions (C1)–(C3) are known as the finiteness conditions. Conditions (C4) and (C5) are the usual continuity-compactness conditions. Condition (C6) is the Slater-like condition.
Under conditions (C1)–(C5), we define a real-valued function G on C × D := C_{ω_2}(K) × Π as follows:
G(c,π) := W(c,π) for (c,π) ∈ C × D = C_{ω_2}(K) × Π.   (8.24)
Obviously, since the set Π is convex, the function G defined in (8.24) satisfies (8.1). Let D := F. Then, we state our main result of Sect. 8.4.2 on the existence of constrained-optimal policies for (8.23).
Theorem 8.4.3. Suppose that conditions (C1)–(C6) hold. Then there exists a constrained-optimal policy for (8.23), which is either a stationary policy or a randomized stationary policy that randomizes between two stationary policies differing in at most one state; that is, there exist two stationary policies f^1, f^2, a state i* ∈ X, and a number p* ∈ [0,1] such that f^1(i) = f^2(i) for all i ≠ i*, and, in addition, the randomized stationary policy π^{p*}(·|i) is constrained-optimal, where
π^{p*}(a|i) = p* for a = f^1(i) when i = i*;  1 − p* for a = f^2(i) when i = i*;  1 for a = f^1(i) when i ≠ i*.
Remark 8.4.8. Theorem 8.4.3 extends the corresponding one in [32] for a constant discount factor to the case of state-dependent discount factors. Moreover, we remove the nonnegativity assumption on the costs as in [32].
We will prove Theorem 8.4.3 using Theorem 8.3.2. To do so, we introduce the notation below. For each ϕ ∈ Φ, i ∈ X, and π ∈ Π, define
W_r(i,π) := E_i^π [ r(i_0,a_0) + ∑_{n=1}^∞ ( ∏_{k=0}^{n−1} α(i_k) ) r(i_n,a_n) ],
W_c(i,π) := E_i^π [ c(i_0,a_0) + ∑_{n=1}^∞ ( ∏_{k=0}^{n−1} α(i_k) ) c(i_n,a_n) ],
bλ (i, a) := r(i, a) − λ c(i, a) for all (i, a) ∈ K,
W^λ(i,π) := E_i^π [ b^λ(i_0,a_0) + ∑_{n=1}^∞ ( ∏_{k=0}^{n−1} α(i_k) ) b^λ(i_n,a_n) ],
W_λ^*(i) := sup_{π∈Π} W^λ(i,π), and
u(i,ϕ) := ∫_{A(i)} u(i,a) ϕ(da|i), for u(i,a) = b^λ(i,a), r(i,a), c(i,a),
Q(j|i,ϕ) := ∫_{A(i)} Q(j|i,a) ϕ(da|i) for j ∈ X.
Then, we give three lemmas below, which are used to prove Theorem 8.4.3.
Lemma 8.4.6. Under conditions (C1)–(C5), the following assertions hold:
(a) |W(r,π)| ≤ L_2 ν(ω_2)/(1 − αβ_2) and |W(c,π)| ≤ L_2 ν(ω_2)/(1 − αβ_2) for all π ∈ Π.
(b) For each fixed ϕ ∈ Φ, W_u(·,ϕ) (u = r, c) is the unique solution in B_{ω_2}(X) to the following equation:
v(i) = u(i,ϕ) + α(i) ∑_{j∈X} v(j) Q(j|i,ϕ)  for all i ∈ X.
(c) W(r,f) and W(c,f) are continuous in f ∈ F.
Proof. For the proofs of (a) and (b), see Theorem 3.1 in [34].
(c) We only prove the continuity of W(r,f) in f ∈ F because the other case is similar. Let f_n → f as n → ∞, and fix any i ∈ X. Choose any subsequence {W_r(i,f_{n_m})} of {W_r(i,f_n)} converging to some point v(i) as m → ∞. Then, since X is denumerable, the Tychonoff theorem, together with part (a) and f_n → f, gives the existence of a subsequence {W_r(j,f_{n_k}), j ∈ X} of {W_r(j,f_{n_m}), j ∈ X} such that
lim_{k→∞} W_r(j,f_{n_k}) =: v′(j), v′(i) = v(i), and lim_{k→∞} f_{n_k}(j) = f(j) for all j ∈ X.   (8.25)
Furthermore, by Theorem 3.1 in [34], we have |v′(j)| ≤ L_2 ω_2(j)/(1 − αβ_2) for all j ∈ X, which implies that v′ ∈ B_{ω_2}(X). On the other hand, for the given i ∈ X and all k ≥ 1, by part (b) we have
W_r(i,f_{n_k}) = r(i,f_{n_k}) + α(i) ∑_{j∈X} W_r(j,f_{n_k}) Q(j|i,f_{n_k}).   (8.26)
Then, it follows from (8.25), (8.26), condition (C5), and Lemma 8.3.7 in [23, p.48] that
v′(i) = r(i,f) + α(i) ∑_{j∈X} v′(j) Q(j|i,f).
Hence, part (b) yields
v′(i) = W_r(i,f).   (8.27)
Thus, as the above subsequence {W_r(i,f_{n_m})} and i ∈ X are arbitrarily chosen and (by (8.27)) all such subsequences have the same limit W_r(i,f), we have
lim_{n→∞} W_r(i,f_n) = W_r(i,f) for all i ∈ X.
Therefore, from condition (C3) and Theorem A.6 in [22, p.171], we get
lim_{n→∞} W(r,f_n) = ∑_{i∈X} ( lim_{n→∞} W_r(i,f_n) ) ν(i) = ∑_{i∈X} W_r(i,f) ν(i) = W(r,f),
which gives the desired conclusion, W(r,f_n) → W(r,f) as n → ∞. □
Lemma 8.4.7. Suppose that conditions (C1), (C2), (C4), and (C5) hold. Then we have:
(a) W_λ^* is the unique solution in B_{ω_2}(X) of the following equation:
v(i) = sup_{a∈A(i)} { b^λ(i,a) + α(i) ∑_{j∈X} v(j) Q(j|i,a) }  for all i ∈ X.   (8.28)
(b) There exists a function f* ∈ F such that f*(i) ∈ A(i) attains the maximum in (8.28) for each i ∈ X, that is,
W_λ^*(i) = b^λ(i,f*) + α(i) ∑_{j∈X} W_λ^*(j) Q(j|i,f*)  for all i ∈ X,   (8.29)
and f* ∈ F is optimal. Conversely, if f* ∈ F is optimal, it satisfies (8.29).
Proof. See Theorem 3.2 in [34]. □
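For a model with finitely many states and actions, the fixed point W_λ^* of (8.28) can be computed by ordinary value iteration, since 0 < α(i) ≤ α < 1 makes the dynamic-programming operator a contraction. The sketch below, with randomly generated illustrative data, is only a numerical companion to Lemma 8.4.7; the model and the routine name solve_W_lambda are assumptions of this example.

```python
import numpy as np

# Hypothetical finite model: 3 states, 2 actions, state-dependent discount factors.
nS, nA = 3, 2
rng = np.random.default_rng(1)
alpha = np.array([0.5, 0.7, 0.6])                       # alpha(i)
r = rng.normal(size=(nS, nA))                           # r(i, a)
c = rng.uniform(size=(nS, nA))                          # c(i, a)
Q = rng.dirichlet(np.ones(nS), size=(nS, nA))           # Q(.|i, a)

def solve_W_lambda(lam, tol=1e-10, max_iter=10_000):
    """Value iteration for v(i) = max_a { b_lam(i,a) + alpha(i) * sum_j v(j) Q(j|i,a) }, cf. (8.28)."""
    b = r - lam * c                                      # b^lambda(i, a)
    v = np.zeros(nS)
    for _ in range(max_iter):
        q = b + alpha[:, None] * (Q @ v)                 # q[i, a]
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    f_star = q.argmax(axis=1)                            # a maximizer f*(i), cf. (8.29)
    return v, f_star

v, f_star = solve_W_lambda(lam=0.4)
print(v, f_star)
```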
Remark 8.4.9. Under conditions (C1), (C2), (C4), and (C5), for each λ ≥ 0, let
D_λ^*(e) := { f ∈ F : W^λ(i,f) = W_λ^*(i) for all i ∈ X }.
Then, it follows from Lemma 8.4.7 that D_λ^*(e) ≠ ∅, and that f ∈ D_λ^*(e) if and only if f ∈ F satisfies (8.29).
Lemma 8.4.8. Suppose that conditions (C1)–(C5) hold. For each f_1, f_2 ∈ D_λ^*(e) (with any fixed λ ≥ 0) and 0 ≤ p ≤ 1, define a policy π^p by π^p(·|i) := p δ_{f_1(i)}(·) + (1 − p) δ_{f_2(i)}(·) for all i ∈ X. Then,
(a) W^λ(i,π^p) = W_λ^*(i) for all i ∈ X.
(b) W(c,π^p) is continuous in p ∈ [0,1].
Proof. (a) Since
b^λ(i,π^p) = p b^λ(i,f_1) + (1 − p) b^λ(i,f_2),   (8.30)
Q(j|i,π^p) = p Q(j|i,f_1) + (1 − p) Q(j|i,f_2),   (8.31)
by Lemma 8.4.7 and the definition of D_λ^*(e), we have
W_λ^*(i) = b^λ(i,f_l) + α(i) ∑_{j∈X} W_λ^*(j) Q(j|i,f_l)  for all i ∈ X and l = 1, 2,
which together with (8.30) and (8.31) gives
W_λ^*(i) = b^λ(i,π^p) + α(i) ∑_{j∈X} W_λ^*(j) Q(j|i,π^p)  for all i ∈ X.   (8.32)
Therefore, by Lemma 8.4.6(b) and (8.32), we have W^λ(i,π^p) = W_λ^*(i) for all i ∈ X, and so part (a) follows.
(b) For any p ∈ [0,1] and any sequence {p_m} in [0,1] such that lim_{m→∞} p_m = p, by Lemma 8.4.6 we have
W_c(i,π^{p_m}) = c(i,π^{p_m}) + α(i) ∑_{j∈X} W_c(j,π^{p_m}) Q(j|i,π^{p_m})  for all i ∈ X and m ≥ 1.   (8.33)
Hence, as in the proof of Lemma 8.4.6, from (8.33), the definition of π^{p_m}, and Theorem A.6 in [22, p.171], we have
lim_{m→∞} W(c,π^{p_m}) = W(c,π^p),
and so W(c,π^p) is continuous in p ∈ [0,1]. □
Proof of Theorem 8.4.3. By Lemmas 8.4.6–8.4.8 and (8.24), we see that Assumptions 8.2.1 and 8.3.1 hold. Moreover, Assumptions 8.2.2 and 8.3.2 follow from condition (C6) and Lemma 8.4.7, respectively. Hence, by Theorem 8.3.2, we complete the proof. □
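Although the proof above is nonconstructive, Lemmas 8.4.7 and 8.4.8 suggest a simple computational recipe in the finite case: for a fixed λ, evaluate two λ-optimal stationary policies whose costs bracket ρ, and use the continuity of W(c, π^p) in p (Lemma 8.4.8(b)) to locate p* by bisection. The sketch below is only a rough illustration under extra assumptions (exact policy evaluation, mixing f^1 and f^2 at every state rather than at a single state, and monotonicity of W(c, π^p) in p on the example); it is not the construction used in the proof. It reuses the hypothetical model of the previous sketch.

```python
import numpy as np

def evaluate(policy_probs, alpha, u, Q, nu):
    """Exact evaluation of W(u, pi) for a randomized stationary policy.

    policy_probs[i, a] = pi(a|i); solves the linear system from Lemma 8.4.6(b).
    """
    nS = len(alpha)
    u_pi = np.einsum('ia,ia->i', policy_probs, u)             # u(i, pi)
    Q_pi = np.einsum('ia,iaj->ij', policy_probs, Q)           # Q(j|i, pi)
    W = np.linalg.solve(np.eye(nS) - alpha[:, None] * Q_pi, u_pi)
    return float(nu @ W)

def mix(f1, f2, p, nA):
    """Randomized stationary policy pi^p putting mass p on f1(i) and 1-p on f2(i)."""
    nS = len(f1)
    probs = np.zeros((nS, nA))
    probs[np.arange(nS), f1] += p
    probs[np.arange(nS), f2] += 1.0 - p
    return probs

def tune_p(f1, f2, rho, alpha, c, Q, nu, nA, iters=60):
    """Bisection on p so that W(c, pi^p) is close to rho, assuming
    W(c, f1) <= rho <= W(c, f2) and monotonicity of W(c, pi^p) in p."""
    lo, hi = 0.0, 1.0                                          # p = 1 corresponds to the feasible policy f1
    for _ in range(iters):
        p = 0.5 * (lo + hi)
        if evaluate(mix(f1, f2, p, nA), alpha, c, Q, nu) <= rho:
            hi = p                                             # still feasible, push toward f2
        else:
            lo = p
    return hi
```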
8.4.3 Continuous-Time Constrained MDPs with Average Criteria
In this subsection, removing the nonnegativity assumption on the cost function as in [20, 35], we will prove that the corresponding results in [20, 35] still hold using Theorem 8.3.2 above. The model of continuous-time constrained MDPs is of the form [18, 20, 30, 35]:
X, A, (A(i), i ∈ X), q(·|i,a), r(i,a), c(i,a), ρ,   (8.34)
where X is assumed to be a denumerable set. Without loss of generality, we assume that X is the set of all positive integers. Furthermore, the transition rates q(j|i,a) satisfy q(j|i,a) ≥ 0 for all (i,a) ∈ K and j ≠ i. We also assume that the transition rates q(j|i,a) are conservative, i.e., ∑_{j∈X} q(j|i,a) = 0 for all (i,a) ∈ K, and stable, which means that q*(i) := sup_{a∈A(i)} −q(i|i,a) < ∞ for all i ∈ X. In addition, q(j|i,a) is measurable in a ∈ A(i) for each fixed i, j ∈ X. The other components are the same as in (8.15), with a state i here in lieu of a state x in Sect. 8.4.1. We denote by Π_m, Φ and F the classes of all randomized Markov policies, randomized stationary policies, and stationary policies, respectively; see [18, 20, 30, 35] for details. To guarantee the regularity of the Q-process, we impose the following drift condition from [30]:
(D1) There exists a nondecreasing function ω_3 ≥ 1 on X such that lim_{i→∞} ω_3(i) = ∞.
(D2) There exist constants γ_1 ≥ κ_1 > 0 and a state i_0 ∈ X such that
∑_{j∈X} q(j|i,a) ω_3²(j) ≤ −κ_1 ω_3²(i) + γ_1 I_{i_0}(i)  for all (i,a) ∈ K,
where IB (·) denotes the indicator function of any set B. Let T := [0, ∞), and let (Ω , B(Ω )) be the canonical product measurable space with (X × A)T being the set of all maps from T to X × A. Fix an initial distribution ν on X. Then, under conditions (D1) and (D2), by Theorem 2.3 in [20, p.14], for each policy π ∈ Πm , there exist a unique probability measure Pνπ on (Ω , B(Ω )) and a stochastic process {(x(t), a(t)),t ≥ 0}. The expectation operator with respect to Pνπ is denoted by Eνπ . For each π ∈ Πm , we define the expected average criteria, V (r, π ) and V (c, π ), as follows:
V(r,π) := lim inf_{T→∞} (1/T) E_ν^π [ ∫_0^T r(x(t),a(t)) dt ],
V(c,π) := lim sup_{T→∞} (1/T) E_ν^π [ ∫_0^T c(x(t),a(t)) dt ].
Then, the constrained optimization problem for the average criteria is as follows:
sup_{π∈Π_m} V(r,π)  subject to  V(c,π) ≤ ρ.   (8.35)
To guarantee the existence of a constrained-optimal policy π ∗ for (8.35) (i.e., V (r, π ∗ ) ≥ V (r, π ) for all π such that V (c, π ) ≤ ρ ), we need the following conditions from [20, 30, 35]:
(D3) There exists a constant L_3 > 0 such that |r(i,a)| ≤ L_3 ω_3(i) and |c(i,a)| ≤ L_3 ω_3(i) for all (i,a) ∈ K.
(D4) For each i ∈ X, A(i) is compact.
(D5) For each i ∈ X, q*(i) ≤ ω_3(i).
(D6) ν(ω_3²) := ∑_{i∈X} ω_3²(i) ν(i) < ∞.
(D7) For each i, j ∈ X, the functions r(i,a), c(i,a), q(j|i,a), and ∑_{k∈X} ω_3(k) q(k|i,a) are continuous in a ∈ A(i).
(D8) For each ϕ ∈ Φ, the corresponding Markov process with transition rates q(j|i,ϕ) is irreducible, where q(j|i,ϕ) := ∫_{A(i)} q(j|i,a) ϕ(da|i) for all i, j ∈ X.
(D9) There exists ϕ ∈ Φ such that V(c,ϕ) < ρ.
From conditions (D1), (D2), and (D8), by Theorem 4.2 in [28], for each ϕ ∈ Φ, the corresponding Markov process with transition rates q(j|i,ϕ) has a unique invariant probability measure μ_ϕ on X. Moreover, under conditions (D1)–(D4), (D7), and (D8), by Theorem 7.2 in [30] we have
V(r,ϕ) = lim_{T→∞} (1/T) E_ν^ϕ [ ∫_0^T r(x(t),a(t)) dt ] = ∑_{i∈X} r(i,ϕ) μ_ϕ(i)   (8.36)
and
V(c,ϕ) = lim_{T→∞} (1/T) E_ν^ϕ [ ∫_0^T c(x(t),a(t)) dt ] = ∑_{i∈X} c(i,ϕ) μ_ϕ(i),   (8.37)
where r(i,ϕ) := ∫_{A(i)} r(i,a) ϕ(da|i) and c(i,ϕ) := ∫_{A(i)} c(i,a) ϕ(da|i) for all i ∈ X.
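For a finite state space, (8.36) and (8.37) reduce the evaluation of V(r,ϕ) and V(c,ϕ) to linear algebra: solve μ_ϕ q(·|·,ϕ) = 0 together with ∑_i μ_ϕ(i) = 1 and average the rates. The three-state generator, rates, and policy in the sketch below are hypothetical.

```python
import numpy as np

# Hypothetical generator q(j|i, phi(i)) for a fixed stationary policy phi:
# rows sum to zero (conservative), off-diagonal entries nonnegative.
q_phi = np.array([[-2.0,  1.5,  0.5],
                  [ 1.0, -3.0,  2.0],
                  [ 0.5,  2.5, -3.0]])
r_phi = np.array([ 1.0, -0.5,  2.0])     # r(i, phi)
c_phi = np.array([ 0.2,  1.0,  0.7])     # c(i, phi)

# Invariant probability measure: mu q_phi = 0, sum(mu) = 1.
A = np.vstack([q_phi.T, np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
mu, *_ = np.linalg.lstsq(A, b, rcond=None)

V_r = float(mu @ r_phi)                  # (8.36)
V_c = float(mu @ c_phi)                  # (8.37)
print(mu, V_r, V_c)                      # feasibility of phi means V_c <= rho
```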
For each ϕ ∈ Φ, we define the average occupation measure μ̂_ϕ by
μ̂_ϕ({i} × B) := μ_ϕ(i) ϕ(B|i)  for all i ∈ X and B ∈ B(A(i)).
The set of all the average occupation measures is denoted by N_1, i.e., N_1 := { μ̂_ϕ : ϕ ∈ Φ }. Then, we have the following result.
Lemma 8.4.9. Suppose that conditions (D1)–(D8) hold. Then, for any π ∈ Π_m with V(c,π) ≤ ρ, there exists μ̂_{ϕ_0} ∈ N_1 such that V(r,ϕ_0) ≥ V(r,π) and V(c,ϕ_0) ≤ ρ.
Proof. For each fixed π ∈ Π_m with V(c,π) ≤ ρ and n ≥ 1, we define a measure μ_n by
μ_n({i} × B) := (1/n) ∫_0^n E_ν^π [ I_{{i}×B}(x(t),a(t)) ] dt  for all i ∈ X and B ∈ B(A(i)).   (8.38)
Then, by conditions (D2) and (D6), Lemma 6.3 in [20, p.90], and (8.38), we have
∑_{i∈X} ∫_{A(i)} ω_3²(i) μ_n(i,da) = (1/n) ∫_0^n E_ν^π [ ω_3²(x(t)) ] dt
≤ (1/n) ∫_0^n ∑_{i∈X} ( e^{−κ_1 t} ω_3²(i) + (γ_1/κ_1)(1 − e^{−κ_1 t}) ) ν(i) dt
≤ ν(ω_3²) + γ_1/κ_1 < ∞.   (8.39)
On the other hand, from conditions (D1) and (D4), we see that the sets {(i,a) ∈ K : ω_3²(i) ≤ z ω_3(i)} are compact in K for each z ≥ 1. Hence, by (8.39) and Corollary A.46 in [14, p.424], we conclude that the sequence {μ_n} is relatively compact in the ω_3-weak topology. Thus, there exist a subsequence {μ_{n_l}} of {μ_n} and a probability measure μ ∈ M_{ω_3}(X × A) such that μ_{n_l} converges to μ in the ω_3-weak topology. By condition (D4), we see that K is closed. Then, since μ_{n_l} weakly converges to μ, by Theorem A.38 in [14, p.420], we have
0 = lim inf_{l→∞} μ_{n_l}(K^c) ≥ μ(K^c) ≥ 0,
which implies μ (K) = 1. Moreover, for each bounded function v on X, a direct calculation together with condition (D5), the Fubini theorem, the Kolmogorov forward equation, and Theorem 2.3 in [20, p.14] gives
∑_{i∈X} ∫_A ( ∑_{j∈X} q(j|i,a) v(j) ) μ_{n_l}(i,da)
= (1/n_l) ∫_0^{n_l} ∑_{i∈X} ∫_A ( ∑_{j∈X} q(j|i,a) v(j) ) P_ν^π(x(t) = i, a(t) ∈ da) dt
= (1/n_l) ∫_0^{n_l} ∑_{k∈X} ∑_{i∈X} ∫_A ( ∑_{j∈X} q(j|i,a) v(j) ) P_k^π(x(t) = i) π_t(da|i) ν(k) dt
= (1/n_l) ∫_0^{n_l} ∑_{k∈X} ∑_{j∈X} v(j) ( ∑_{i∈X} q(j|i,π_t) p^π(0,k,t,i) ) ν(k) dt
= (1/n_l) ∫_0^{n_l} ∑_{k∈X} ∑_{j∈X} v(j) ( ∂ p^π(0,k,t,j)/∂t ) ν(k) dt
= (1/n_l) ∑_{k∈X} ∑_{j∈X} v(j) ( p^π(0,k,n_l,j) − p^π(0,k,0,j) ) ν(k)
= (1/n_l) E_ν^π [ v(x(n_l)) ] − (1/n_l) ∑_{k∈X} v(k) ν(k),   (8.40)
where p^π(0,i,t,j) denotes the minimal transition function with transition rates q(j|i,π_t) := ∫_{A(i)} q(j|i,a) π_t(da|i) for all i, j ∈ X and t ≥ 0. Letting l → ∞ in (8.40), by conditions (D5), (D7), and (8.39), we have
∑_{i∈X} ∫_A ( ∑_{j∈X} q(j|i,a) v(j) ) μ(i,da) = 0
for each bounded function v on X, which together with Lemma 4.6 in [30] yields μ ∈ N_1. Hence, there exists ϕ_0 ∈ Φ such that μ = μ̂_{ϕ_0}. Furthermore, by condition (D3), (8.38), and (8.39), we have
V(c,π) ≥ lim sup_{n→∞} ∑_{i∈X} ∫_A c(i,a) μ_n(i,da)  and  V(r,π) ≤ lim inf_{n→∞} ∑_{i∈X} ∫_A r(i,a) μ_n(i,da),
which together with conditions (D3) and (D7) yield
ρ ≥ V(c,π) ≥ ∑_{i∈X} ∫_A c(i,a) μ̂_{ϕ_0}(i,da) = V(c,ϕ_0)  and  V(r,π) ≤ ∑_{i∈X} ∫_A r(i,a) μ̂_{ϕ_0}(i,da) = V(r,ϕ_0).
This completes the proof of the lemma. □
By Lemma 8.4.9 we see that the constrained optimization problem (8.35) is equivalent to the following form:
sup_{ϕ∈Φ} V(r,ϕ)  subject to  V(c,ϕ) ≤ ρ.   (8.41)
Under conditions (D1)–(D8), from (8.41) we define a real-valued function G on C × D := C_{ω_3}(K) × Φ as follows:
G(c,ϕ) := V(c,ϕ) for (c,ϕ) ∈ C × D = C_{ω_3}(K) × Φ.   (8.42)
Then, by (8.36) and (8.37), we see that the function G defined in (8.42) satisfies (8.1). Moreover, let D := F. Now we state our main result of Sect. 8.4.3 on the existence of constrained-optimal policies for (8.35).
Proposition 8.4.2. Suppose that conditions (D1)–(D9) hold. Then there exists a constrained-optimal policy for (8.35), which may be a stationary policy or a randomized stationary policy that randomizes between two stationary policies differing in at most one state.
Proof. It follows from Lemma 7.2, Theorem 7.8, and Lemma 12.5 in [20] that Assumptions 8.2.1 and 8.3.2 hold. Obviously, condition (D9) implies Assumption 8.2.2. Finally, from Lemma 12.6 and the proof of Theorem 12.4 in [20], we see that Assumption 8.3.1 holds. Hence, by Theorem 8.3.2, we complete the proof. 2 Remark 8.4.10. Proposition 8.4.2 shows that the nonnegativity assumption on the costs as in [20, 35] is not required.
8.4.4 Continuous-Time Constrained MDPs with Discounted Criteria
In this subsection, we consider the following discounted criteria J(r,π) and J(c,π) in (8.43) below for the model (8.34), in lieu of the average criteria above. Removing the nonnegativity assumption on the cost function as in [18, 20], we will prove that the corresponding results in [18, 20] still hold using Theorem 8.3.2 above. With the same components as in the model (8.34), we consider the following drift condition from [18, 20]:
(E1) There exist a function ω_4 ≥ 1 on X and constants γ_2 ≥ 0, κ_2 ≠ 0, and L > 0 such that q*(i) ≤ L ω_4(i) and
∑_{j∈X} ω_4(j) q(j|i,a) ≤ κ_2 ω_4(i) + γ_2  for all (i,a) ∈ K.
Fix a discount factor α > 0 and an initial distribution ν on X. For each π ∈ Π_m, we define the discounted criteria, J(r,π) and J(c,π), as follows:
J(r,π) := E_ν^π [ ∫_0^∞ e^{−αt} r(x(t),a(t)) dt ],  J(c,π) := E_ν^π [ ∫_0^∞ e^{−αt} c(x(t),a(t)) dt ].   (8.43)
Then, the constrained optimization problem for the discounted criteria is as follows:
sup_{π∈Π_m} J(r,π)  subject to  J(c,π) ≤ ρ.   (8.44)
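The discounted criteria in (8.43) can likewise be approximated by simulating the Q-process under a stationary policy: between jumps the state is constant, so the discounted integral over each holding interval is available in closed form. The sketch below truncates the horizon (a source of small bias) and, in the commented usage line, reuses the hypothetical generator of the previous sketch.

```python
import numpy as np

def estimate_J(q_phi, r_phi, alpha, nu, horizon=50.0, paths=5000, seed=0):
    """Monte Carlo estimate of J(r, phi) = E_nu[ int_0^inf e^{-alpha t} r(x(t), phi) dt ].

    Between jumps the state is constant, so each holding interval [t0, t1)
    contributes r(i) * (exp(-alpha t0) - exp(-alpha t1)) / alpha.
    """
    rng = np.random.default_rng(seed)
    n = len(r_phi)
    total = 0.0
    for _ in range(paths):
        i = rng.choice(n, p=nu)
        t = J = 0.0
        while t < horizon:
            rate = -q_phi[i, i]                       # exit rate under phi
            t_next = t + (rng.exponential(1.0 / rate) if rate > 0 else horizon)
            t_next = min(t_next, horizon)
            J += r_phi[i] * (np.exp(-alpha * t) - np.exp(-alpha * t_next)) / alpha
            t = t_next
            if t >= horizon:
                break
            jump = np.maximum(q_phi[i], 0.0)          # jump law proportional to q(j|i), j != i
            i = rng.choice(n, p=jump / jump.sum())
        total += J
    return total / paths

# Example usage with the hypothetical generator q_phi, r_phi of the previous sketch:
# estimate_J(q_phi, r_phi, alpha=1.0, nu=np.array([1.0, 0.0, 0.0]))
```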
To ensure the existence of a constrained-optimal policy π ∗ for (8.44) (i.e., J(r, π ∗ ) ≥ J(r, π ) for all π such that J(c, π ) ≤ ρ ), we consider the following conditions from [18, 20]: (E2) There exists a constant L4 > 0 such that |r(i, a)| ≤ L4 ω4 (i) and |c(i, a)| ≤ L4 ω4 (i) for all (i, a) ∈ K.
(E3) The positive discount factor α verifies that α > κ_2, with κ_2 as in (E1).
(E4) ν(ω_4) := ∑_{i∈X} ω_4(i) ν(i) < ∞.
(E5) For each i ∈ X, A(i) is compact.
(E6) For each i, j ∈ X, the functions r(i,a), c(i,a), q(j|i,a), and ∑_{k∈X} ω_4(k) q(k|i,a) are continuous in a ∈ A(i).
(E7) There exist a nonnegative function ω′ on X and constants γ_3 ≥ 0, κ_3 > 0, and L′ > 0 such that q*(i) ω_4(i) ≤ L′ ω′(i) and
∑_{j∈X} ω′(j) q(j|i,a) ≤ κ_3 ω′(i) + γ_3  for all (i,a) ∈ K.
(E8) There exists π ∈ Π_m such that J(c,π) < ρ.
Note that the set Π_m is convex; that is, if π^1 and π^2 are in Π_m and, for any p ∈ [0,1], i ∈ X, and t ∈ [0,∞), π_t^p(·|i) := p π_t^1(·|i) + (1 − p) π_t^2(·|i), then π^p ∈ Π_m. Under conditions (E1)–(E6), we define a real-valued function G on C × D := C_{ω_4}(K) × Π_m as follows:
G(c,π) := J(c,π) for (c,π) ∈ C × D = C_{ω_4}(K) × Π_m.   (8.45)
Obviously, the function G defined in (8.45) satisfies (8.1). Moreover, let D := F. Now we state our main result of Sect. 8.4.4 on the existence of constrainedoptimal policies for (8.44). Proposition 8.4.3. Suppose that conditions (E1)–(E8) hold. Then there exists a constrained-optimal policy for (8.44), which may be a stationary policy or a randomized stationary policy that randomizes between two stationary policies differing in at most one state. Proof. It follows from Theorems 6.5 and 6.10 and Lemma 11.6 in [20] that Assumptions 8.2.1 and 8.3.2 hold. Obviously, condition (E8) implies Assumption 8.2.2. Finally, from Lemma 11.7 and the proof of Theorem 11.4 in [20], we see that Assumption 8.3.1 holds. Hence, by Theorem 8.3.2, we complete the proof. 2 Remark 8.4.11. Proposition 8.4.3 shows that the nonnegativity assumption on the costs as in [18, 20] is not required.
References 1. Aliprantis, C., Border, K. (2007). Infinite Dimensional Analysis. Springer-Verlag, New York. 2. Altman, E. (1999). Constrained Markov Decision Processes. Chapman and Hall/CRC Press, London. 3. Beutler, F.J., Ross, K.W. (1985). Optimal policies for controlled Markov chains with a constraint. J. Math. Anal. Appl. 112, 236–252.
4. Borkar, V., Budhiraja, A. (2004). Ergodic control for constrained diffusions: characterization using HJB equations. SIAM J. Control Optim. 43, 1467–1492. 5. Borkar, V.S., Ghosh, M.K. (1990). Controlled diffusions with constraints. J. Math. Anal. Appl. 152, 88–108. 6. Borkar, V.S. (1993). Controlled diffusions with constraints II. J. Math. Anal. Appl. 176, 310–321. 7. Borkar, V.S. (2005). Controlled diffusion processes. Probab. Surv. 2, 213–244. 8. Budhiraja, A. (2003). An ergodic control problem for constrained diffusion processes: existence of optimal Markov control. SIAM J. Control Optim. 42, 532–558. 9. Budhiraja, A., Ross, K. (2006). Existence of optimal controls for singular control problems with state constraints. Ann. Appl. Probab. 16, 2235–2255. 10. Feinberg, E.A., Shwartz, A. (1995). Constrained Markov decision models with weighted discounted rewards. Math. Oper. Res. 20, 302–320. 11. Feinberg, E.A., Shwartz, A. (1996). Constrained discounted dynamic programming. Math. Oper. Res. 21, 922–945. 12. Feinberg, E.A., Shwartz, A. (1999). Constrained dynamic programming with two discount factors: applications and an algorithm. IEEE Trans. Autom. Control. 44, 628–631. 13. Feinberg, E.A. (2000). Constrained discounted Markov decision processes and Hamiltonian cycles. Math. Oper. Res. 25, 130–140. 14. Follmer, ¨ H., Schied, A. (2004). Stochastic Finance: An Introduction in Discrete Time. Walter de Gruyter, Berlin. 15. Gonzalez-Hern ´ andez, ´ J., Hernandez-Lerma, ´ O. (1999). Envelopes of sets of measures, tightness, and Markov control processes. Appl. Math. Optim. 40, 377–392. 16. Gonzalez-Hern ´ andez, ´ J., Hernandez-Lerma, ´ O. (2005). Extreme points of sets of randomized strategies in constrained optimization and control problems. SIAM J. Optim. 15, 1085–1104. 17. Guo, X.P. (2000). Constrained denumerable state non-stationary MDPs with expected total reward criterion. Acta Math. Appl. Sin. (English Ser.) 16, 205–212. 18. Guo, X.P., Hernandez-Lerma, ´ O. (2003). Constrained continuous-time Markov controlled processes with discounted criteria. Stochastic Anal. Appl. 21, 379–399. 19. Guo, X.P. (2007). Constrained optimization for average cost continuous-time Markov decision processes. IEEE Trans. Autom. Control. 52, 1139–1143. 20. Guo, X.P., Hernandez-Lerma, ´ O. (2009). Continuous-time Markov Decision Processes: Theory and Applications. Springer-Verlag, Berlin Heidelberg. 21. Guo, X.P., Song, X.Y. (2011). Discounted continuous-time constrained Markov decision processes in Polish spaces. Ann. Appl. Probab. 21, 2016–2049. 22. Hernandez-Lerma, ´ O., Lasserre, J.B. (1996). Discrete-Time Markov Control Processes: basic optimality criteria. Springer-Verlag, New York. 23. Hernandez-Lerma, ´ O., Lasserre, J.B. (1999). Further Topics on Discrete-Time Markov Control Processes. Springer-Verlag, New York. 24. Hernandez-Lerma, ´ O., Gonzalez-Hern ´ andez, ´ J. (2000). Constrained Markov control processes in Borel spaces: the discounted case. Math. Methods Oper. Res. 52, 271–285. 25. Huang, Y., Kurano, M. (1997). The LP approach in average reward MDPs with multiple cost constraints: the countable state case. J. Inform. Optim. Sci. 18, 33–47. 26. Gonzalez-Hern ´ andez, ´ J., Lopez-Mart ´ i´nez, R.R., Perez-Hern ´ andez, ´ J.R. (2007). Markov control processes with randomized discounted cost. Math. Methods. Oper. Res. 65, 27–44. 27. Lopez-Mart ´ i´nez, R.R., Hernandez-Lerma, ´ O. (2003). The Lagrange approach to constrained Markov processes: a survey and extension of results. Morfismos. 7, 1–26. 28. 
Meyn, S.P., Tweedie, R.L. (1993). Stability of Markov processes III: Foster-Lyapunov criteria for continuous-time processes. Adv. Appl. Prob. 25, 518–548. 29. Piunovskiy, A.B. (1997). Optimal Control of Random Sequences in Problems with Constraints. Kluwer, Dordrecht. 30. Prieto-Rumeau, T., Hernández-Lerma, O. (2006). Ergodic control of continuous-time Markov chains with pathwise constraints. SIAM J. Control Optim. 45, 51–73.
31. Puterman, M.L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York. 32. Sennott, L.I. (1991). Constrained discounted Markov chains. Probab. Engrg. Inform. Sci. 5, 463–475. 33. Sennott, L.I. (1993). Constrained average cost Markov chains. Probab. Engrg. Inform. Sci. 7, 69–83. 34. Wei, Q.D., Guo, X.P. (2011). Markov decision processes with state-dependent discount factors and unbounded rewards/costs. Oper. Res. Lett. 39, 369–374. 35. Zhang, L.L., Guo, X.P. (2008). Constrained continuous-time Markov decision processes with average criteria. Math. Methods Oper. Res. 67, 323–340. 36. Zhang, Y. (2011). Convex analytic approach to constrained discounted Markov decision processes with non-constant discount factors. Top. doi: 10.1007/s11750-011-0186-8.
Chapter 9
Optimal Execution of Derivatives: A Taylor Expansion Approach Gerardo Hernandez-del-Valle and Yuemeng Sun
9.1 Introduction The problem of optimal execution is a very general problem in which a trader who wishes to buy or sell a large position K of a given asset S—for instance, wheat, shares, derivatives, etc.—is confronted with the dilemma of executing slowly or as quick as possible. In the first case, he/she would be exposed to volatility, and in the second, to the laws of offer and demand. Thus, the trader must hedge between the market impact (due to his trade) and the volatility (due to the market). The main aim of this chapter is to study and characterize the so-called Markowitzoptimal open-loop execution trajectory of contingent claims. The problem of minimizing expected overall liquidity costs has been analyzed using different market models by [1, 6, 8], and [2], just to mention a few. However, some of these approaches miss the volatility risk associated with time delay. Instead, [3, 4] suggested studying and solving a mean-variance optimization for sales revenues in the class of deterministic strategies. Further on, [5] allowed for intertemporal updating and proved that this can strictly improve the meanvariance performance. Nevertheless, in [9], the authors study the original problem of expected utility maximization with CARA utility functions. Their main result states that for CARA investors there is surprisingly no added utility from allowing for intertemporal updating of strategies. Finally, we mention that the Hamilton-Jacobi-Bellman approach has also recently been studied in [7].
G. Hernandez-del-Valle () Statistics Department, Columbia University, New York, NY 10027, USA e-mail:
[email protected] Y. Sun Cornell University, Theca, NY 14853, USA e-mail:
[email protected] D. Hern´andez-Hern´andez and A. Minj´arez-Sosa (eds.), Optimization, Control, and Applications of Stochastic Systems, Systems & Control: Foundations & Applications, DOI 10.1007/978-0-8176-8337-5 9, © Springer Science+Business Media, LLC 2012
The chapter is organized as follows: in Sect. 9.2, we state the optimal execution contingent claim problem. Next, in Sect. 9.3, we provide its closed form solution. In Sect. 9.4, a numerical example is studied, and finally we conclude in Sect. 9.5 with some final remarks and comments.
9.2 The Problem
The Model. A trader wishes to execute K = k_0 + ··· + k_n units of a contingent claim C with underlying S by time T. The quantity to optimize is given by the so-called execution shortfall, defined as
Y = ∑_{j=0}^{n} k_j C_j − K C_0,
and the problem is then to find k_0, ..., k_n that attain the minimum
min_{k_0,...,k_n} ( E[Y] + λ V[Y] ),
for some λ > 0. Assuming the derivative C is smooth in terms of its underlying S, it follows from the Taylor series expansion that
C_j = f(S_0) + f′(S_0)(S̃_j − S_0) + (1/2) f″(S_0)(S̃_j − S_0)² + R_3,
where S̃ is the effective price and R_3 is the remainder, which is o((S̃_j − S_0)³). Hence,
∑_{j=0}^{n} k_j C_j = ∑_{j=0}^{n} k_j f(S_0) + f′(S_0) ∑_{j=0}^{n} k_j (S̃_j − S_0) + (1/2) f″(S_0) ∑_{j=0}^{n} k_j (S̃_j − S_0)² + ∑_{j=0}^{n} k_j R_3
= K C_0 + f′(S_0) ( ∑_{j=0}^{n} k_j S̃_j − K S_0 ) + (1/2) f″(S_0) ∑_{j=0}^{n} k_j (S̃_j − S_0)² + ∑_{j=0}^{n} k_j R_3.
That is,
Y = ∑_{j=0}^{n} k_j C_j − K C_0 = f′(S_0) ( ∑_{j=0}^{n} k_j S̃_j − K S_0 ) + (1/2) f″(S_0) ∑_{j=0}^{n} k_j (S̃_j − S_0)² + ∑_{j=0}^{n} k_j R_3.   (9.1)
Note that if we use only the first-order approximation, then our optimization problem has already been solved and corresponds to the trading trajectory of [4].
9.3 Second-Order Taylor Approximation
In this section, we extend the market impact model of [4] to the case of a contingent claim. We provide our main result, the closed-form objective function, obtained by adopting a second-order Taylor approximation.
9.3.1 Effective Price Process
Let ξ_1, ξ_2, ... be a sequence of i.i.d. Gaussian random variables with mean zero and variance 1, and let the execution times be equally spaced, that is, τ := T/n. Then, the price and "effective" processes are respectively defined as
S_j = S_{j−1} − τ g(k_j/τ) + σ τ^{1/2} ξ_j,   S̃_j = S_j − h(k_j/τ),
and the permanent and temporary market impact will be modeled, for simplicity, as
g(k_j/τ) = α k_j/τ,   h(k_j/τ) = β k_j/τ,
for some constants α and β. Hence, letting
x_j := K − ∑_{m=0}^{j} k_m  and  W_j := ∑_{m=1}^{j} ξ_m, i.e., W_j ∼ N(0, j), Cov(W_j, W_i) = min(i, j),
it follows that
S̃_j − S_0 = σ τ^{1/2} W_j − α(K − x_j) − (β/τ) k_j.   (9.2)
9.3.2 Second-Order Approximation
From (9.1) and (9.2), the second-order approximation of the execution shortfall Y is given by
Y ≈ f′(S_0) ∑_{j=0}^{n} k_j [ σ τ^{1/2} W_j − α(K − x_j) − (β/τ) k_j ]
+ (1/2) f″(S_0) ∑_{j=0}^{n} k_j [ σ τ^{1/2} W_j − α(K − x_j) − (β/τ) k_j ]².   (9.3)
Next, expanding the squared term, we get
( σ τ^{1/2} W_j − α(K − x_j) − (β/τ) k_j )² = σ²τ W_j² + α²(K − x_j)² + (β²/τ²) k_j² − 2(βσ/τ^{1/2}) k_j W_j − 2ασ τ^{1/2}(K − x_j) W_j + 2(αβ/τ) k_j (K − x_j).
Thus the expected value of Y is approximately
E[Y] = f′(S_0) ∑_{j=0}^{n} k_j [ −α(K − x_j) − (β/τ) k_j ] + (1/2) f″(S_0) ∑_{j=0}^{n} k_j [ σ²τ j + α²(K − x_j)² + (β²/τ²) k_j² + 2(αβ/τ) k_j (K − x_j) ].   (9.4)
To compute the variance V[Y] we rearrange (9.3) as
Y ≈ ∑_{j=0}^{n} ν_j k_j W_j + ∑_{j=0}^{n} η_j k_j W_j² + D,
where D denotes all the deterministic terms and
ν_j := f′(S_0) σ τ^{1/2} − f″(S_0) [ ασ τ^{1/2}(K − x_j) + (βσ/τ^{1/2}) k_j ],   η_j := (1/2) f″(S_0) σ²τ.
It follows that the variance of Y is
V[Y] = V[ ∑_{j=0}^{n} ν_j k_j W_j ] + V[ ∑_{j=0}^{n} η_j k_j W_j² ] + 2 Cov( ∑_{j=0}^{n} ν_j k_j W_j, ∑_{j=0}^{n} η_j k_j W_j² )
= ∑_{j=0}^{n} ν_j² k_j² j + 2 ∑_{0≤i<j≤n} ν_i k_i ν_j k_j · i + ∑_{j=0}^{n} η_j² k_j² · 2j² + 2 ∑_{0≤i<j≤n} η_i k_i η_j k_j · 2i²,   (9.5)
and the last term equals zero.
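A quick numerical sanity check of (9.4) and (9.5) is to simulate the Brownian increments, evaluate Y through the second-order approximation (9.3), and compare the empirical mean and variance with the closed-form expressions; the schedule and parameter values below are arbitrary.

```python
import numpy as np

# Illustrative parameters (hypothetical); f1 = f'(S0), f2 = f''(S0).
n, K, tau = 5, 100.0, 1.0
alpha, beta, sigma = 0.1, 0.5, 0.5
f1, f2 = 0.5, 0.2
k = np.full(n + 1, K / (n + 1))

j = np.arange(n + 1)
x = K - np.cumsum(k)                                   # x_j = K - sum_{m<=j} k_m
d = alpha * (K - x) + (beta / tau) * k
nu_j = f1 * sigma * np.sqrt(tau) - f2 * sigma * np.sqrt(tau) * d
eta = 0.5 * f2 * sigma ** 2 * tau

# Closed-form mean (9.4) and variance (9.5)
EY = f1 * np.sum(k * (-d)) + 0.5 * f2 * np.sum(k * (sigma ** 2 * tau * j + d ** 2))
VY = np.sum(nu_j ** 2 * k ** 2 * j) + np.sum(eta ** 2 * k ** 2 * 2 * j ** 2)
for i in range(n + 1):
    for m in range(i + 1, n + 1):
        VY += 2 * nu_j[i] * k[i] * nu_j[m] * k[m] * i
        VY += 2 * eta * k[i] * eta * k[m] * 2 * i ** 2

# Monte Carlo check via (9.3)
rng = np.random.default_rng(0)
xi = rng.standard_normal((200_000, n + 1)); xi[:, 0] = 0.0
W = np.cumsum(xi, axis=1)                              # W_0 = 0, W_j ~ N(0, j)
A = sigma * np.sqrt(tau) * W - d
Y = f1 * (A * k).sum(axis=1) + 0.5 * f2 * ((A ** 2) * k).sum(axis=1)
print(EY, Y.mean())                                    # should agree up to Monte Carlo error
print(VY, Y.var())
```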
9.3.3 Optimal Trading Schedule for the Second-Order Approximation
To find the optimal trading schedule for the second-order approximation of Y, we need to find the sequence k_0, ..., k_n such that E[Y] + λ V[Y] is minimized for a given λ, where E[Y] and V[Y] are as in (9.4) and (9.5), respectively. After some simplification,
E[Y] + λ V[Y] = f′(S_0) ∑_{j=0}^{n} k_j [ α(x_j − K) − (β/τ) k_j ]
+ (1/2) f″(S_0) ∑_{j=0}^{n} k_j [ σ²τ j + α²(K − x_j)² + (β²/τ²) k_j² + (2αβ/τ)(K − x_j) k_j ]
+ λ ∑_{j=0}^{n} j k_j [ ν_j² k_j + 2 j k_j η_j² + 2ν_j ∑_{m=j+1}^{n} ν_m k_m + 4 j η_j ∑_{m=j+1}^{n} η_m k_m ].
9.4 Numerical Solution
For Y as in (9.3), the optimization problem we aim to solve is
min_{k_0, k_1, ..., k_n} ( E[Y] + λ V[Y] )  subject to  ∑_{j=0}^{n} k_j = K.
We solve the problem using fmincon in Matlab.
Example 9.4.1. For this example let
n = 2; K = 1000; τ = 1; δ = f′(S_0) = 0.5; γ = f″(S_0) = 0.2; α = 0.1; β = 0.5; λ = 0.4; σ = 0.5.
The optimal trading strategy is
k_0 = 333.3348, k_1 = 333.3336, k_2 = 333.3316,
and the optimal objective function is 5.5736 × 10^8.
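As a rough stand-in for the fmincon computation, the sketch below assembles E[Y] + λV[Y] from (9.4)–(9.5) and minimizes it with SciPy's SLSQP under the budget constraint ∑ k_j = K, using the parameter values of Example 9.4.1. Since the authors' exact Matlab setup is not reproduced here, the resulting numbers should be read as indicative only.

```python
import numpy as np
from scipy.optimize import minimize

n, K, tau = 2, 1000.0, 1.0
alpha, beta, sigma, lam = 0.1, 0.5, 0.5, 0.4
f1, f2 = 0.5, 0.2                                      # delta = f'(S0), gamma = f''(S0)

def objective(k):
    j = np.arange(n + 1)
    x = K - np.cumsum(k)
    d = alpha * (K - x) + (beta / tau) * k
    nu_j = f1 * sigma * np.sqrt(tau) - f2 * sigma * np.sqrt(tau) * d
    eta = 0.5 * f2 * sigma ** 2 * tau
    EY = f1 * np.sum(k * (-d)) + 0.5 * f2 * np.sum(k * (sigma ** 2 * tau * j + d ** 2))
    VY = np.sum(nu_j ** 2 * k ** 2 * j) + np.sum(eta ** 2 * k ** 2 * 2 * j ** 2)
    for i in range(n + 1):
        for m in range(i + 1, n + 1):
            VY += 2 * nu_j[i] * k[i] * nu_j[m] * k[m] * i
            VY += 2 * eta * k[i] * eta * k[m] * 2 * i ** 2
    return EY + lam * VY

cons = {"type": "eq", "fun": lambda k: np.sum(k) - K}
res = minimize(objective, x0=np.full(n + 1, K / (n + 1)), constraints=[cons], method="SLSQP")
print(res.x, res.fun)      # compare with k_j ~ 333.33 and the objective value in Example 9.4.1
```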
Remark 9.4.1. The trading trajectory has a downward trend. Intuitively, and in contrast to executing the entire position in a single transaction, our result suggests splitting the overall position into almost even trades. The linear assumption that we made on the temporary and the permanent impacts seems to explain the almost equal execution quantities.
9.5 Concluding Remarks In this work, we study the Markowitz-optimal execution trajectory of contingent claims. In order to do so, we use a second-order Taylor approximation with respect to the contingent claim C evaluated at the initial value of the underlying S. We obtain the closed form objective function given a risk averse criterion. Our approach allows us to obtain the explicit numerical solution and we provide an example. Acknowledgements The authors were partially supported by Algorithmic Trading Management LLC.
References 1. A. Alfonsi, A. Fruth and A. Schied (2010). Optimal execution strategies in limit order books with general shape functions, Quantitative Finance, 10, no. 2, 143–157. 2. A. Alfonsi and A. Schied (2010). Optimal trade execution and absence of price manipulations limit order books models, SIAM J. on Financial Mathematics, 1, pp. 490–522. 3. R. Almgren and N. Chriss (1999). Value under liquidation, Risk, 12, pp. 61–63. 4. R. Almgren and N. Chriss (2000). Optimal execution of portfolio transactions, J. Risk, 3 (2), pp. 5–39. 5. R. Almgren and J. Lorenz (2007). Adaptive arrival price, Trading, no. 1, pp. 59–66. 6. D. Bertsimas and D. Lo (1998). Optimal control of execution costs, Journal of Financial Markets, 1(1), pp. 1–50. 7. P.A. Forsyth (2011). Hamilton Jacobi Bellman approach to optimal trade schedule, Journal of Applied Numerical Mathematics, 61(2), pp. 241–265. 8. A. Obizhaeva, and J. Wang (2006). Optimal trading strategy and supply/demand dynamics. Journal of Financial Markets, forthcoming. 9. A. Schied, T. Sch¨oneborn and M. Tehranchi (2010). Optimal basket liquidation for CARA investors is deterministic, Applied Mathematical Finance, 17, pp. 471–489.
Chapter 10
A Survey of Some Model-Based Methods for Global Optimization Jiaqiao Hu, Yongqiang Wang, Enlu Zhou, Michael C. Fu, and Steven I. Marcus
10.1 Introduction Global optimization aims at characterizing and computing global optimal solutions to problems with nonconvex, multimodal, or badly scaled objective functions; it has applications in many areas of engineering and science. In general, due to the absence of structural information and the presence of many local extrema, global optimization problems are extremely difficult to solve exactly. There are many different types of methods in the literature on global optimization, which can be categorized based on different criteria. For instance, they can be classified either based on the properties of problems to be solved (combinatorial or continuous, nonlinear, linear, convex, etc.) or by the properties of algorithms that search for new candidate solutions such as deterministic or random search algorithms. Random search algorithms can further be classified as instance-based or model-
J. Hu Department of Applied Mathematics and Statistics, State University at Stony Brook, Stony Brook, NY 11794, USA e-mail:
[email protected] Y. Wang • S.I. Marcus () Department of Electrical and Computer Engineering & Institute for Systems Research, University of Maryland, College Park, MD 20742, USA e-mail:
[email protected];
[email protected] E. Zhou Department of Industrial and Enterprise Systems Engineering, University of Illinois at Urbana-Champaign, IL 61801, USA e-mail:
[email protected] M.C. Fu The Robert H. Smith School of Business & Institute for Systems Research, University of Maryland, College Park, MD 20742, USA e-mail:
[email protected] D. Hern´andez-Hern´andez and A. Minj´arez-Sosa (eds.), Optimization, Control, and Applications of Stochastic Systems, Systems & Control: Foundations & Applications, DOI 10.1007/978-0-8176-8337-5 10, © Springer Science+Business Media, LLC 2012
based algorithms according to the mechanism of generating new candidate solutions [46]. Instance-based algorithms maintain a single solution or population of candidate solutions, and the construction of new generate of candidate solutions depends explicitly on the previously generated solutions. Some well-known instance-based algorithms include simulated annealing [25], genetic algorithms [16,36], tabu search [15], nested partitions [35], generalized hill climbing [22, 23], and evolutionary programming [12]. Model-based search algorithms are a class of new solution techniques and were introduced only in recent years [18, 27, 32–34, 42]. In modelbased algorithms, new solutions are generated via an intermediate probabilistic model that is updated or induced from the previously generated solutions. Thus, there is only an implicit/indirect dependency among the solutions generated at successive iterations of the algorithm. Specific model-based algorithms include annealing adaptive search (AAS) [31, 41], the cross-entropy (CE) method [32– 34], and estimation of distribution algorithms (EDAs) [27, 42]. Instance-based algorithms have been extensively studied in past decades. After briefly reviewing some model-based algorithms, this chapter focuses on several model-based methods that have been developed recently.
10.2 Global Optimization and Previous Work
10.2.1 Problem Statement
In many engineering design and optimization applications, we are concerned with finding parameter values that achieve the optimum of an objective function. Such problems can be mathematically stated in the generic form:
x* ∈ arg max_{x∈X} H(x),   (10.1)
where x is a vector of n decision variables, the solution space X is a nonempty (often compact) subset of ℜn , and the objective function H : X → ℜ is a bounded deterministic function. Throughout this chapter, we assume that there exists a global optimal solution to (10.1), i.e., ∃ x∗ ∈ X such that H(x) ≤ H(x∗ ) ∀x = x∗ , x ∈ X. In practice, this assumption can be justified under fairly general conditions. For example, for continuous optimization problems with compact solution spaces, the existence of an x∗ is guaranteed by the well-known Weierstrass theorem, whereas in discrete optimization, the assumption holds trivially when X is a (nonempty) finite set. Note that no further structural assumptions, such as convexity or differentiability, are imposed on the objective function, and there may exist many locally optimal solutions. In other words, our focus is on general global optimization problems with little known structure. This setting arises in many complex systems of interest, e.g.,
when the explicit form of H is not readily available and the objective function values can only be assessed via “black-box” evaluations.
10.2.2 Previous Work on Random Search Methods In this section, we review a class of global optimization algorithms collectively known as random search methods. A random search method usually refers to an algorithm that is iterative in nature, and uses some sort of randomized mechanism to generate a sequence of iterates, e.g., candidate solutions or probabilistic models, in order to successively approximate the optimal solution. What type of iterates an algorithm produces and how these iterates are generated are what differentiates approaches. A major advantage of stochastic search methods is that they are robust and easy to implement, because they typically only rely on the objective function values rather than structural information such as convexity and differentiability. This feature makes these algorithms especially prominent in optimization of complex systems with little structure. From an algorithmic point of view, a random search algorithm can further be classified as being either instance-based or model-based [46]. In instanced-based algorithms, an iterate comprises a single or a set/population of candidate solution(s), and the construction of new candidate solutions depends explicitly on previously generated solutions. Such algorithms can be represented abstractly by the following framework: 1. Given a set/population of candidate solutions Y (k) (which might be a singleton set), generate a set of new candidate solutions X (k) according to a specified random mechanism. 2. Update the current population Y (k+1) based on population Y (k) and candidate solutions in X (k) ; increase the iteration counter k by 1 and reiterate from Step 1. Thus the two major steps in an instance-based algorithm are the generation step that produces a set of candidate solutions, and the selection/update step that determines whether a newly generated solution in X (k) should be included in the next generation. Over the past few decades, a significant amount of research effort has been centered around instance-based methods, with numerous algorithms proposed in the literature and their behaviors relatively well studied and understood. Some well-known examples include simulated annealing [25], genetic algorithms [16,36], tabu search [15], nested partitions [35], generalized hill climbing [22, 23], and evolutionary programming [12]. We focus on model-based methods, which differ from instance-based approaches in that candidate solutions are generated at each iteration by sampling from an intermediate probability distribution model over the solution space. The idea is to iteratively modify the distribution model based on the sampled solutions to bias the future search toward regions containing high-quality solutions. In its most basic
from, a model-based algorithm typically consists of the following two steps: let gk be a probability distribution on X at the kth iteration of an algorithm: 1. Randomly generate a set/population of candidate solutions X (k) from gk . 2. Update gk based on the sampled solutions in X (k) to obtain a new distribution gk+1 ; increase k by 1 and reiterate from step 1. The underlying idea is to construct a sequence of iterates (probability distributions) {gk } with the hope that gk → g∗ as k → ∞, where g∗ is a limiting distribution that assigns most of its probability mass to the set of optimal solutions. So it is the probability distribution (as opposed to candidate solutions as in instance-based algorithms) that is propagated from one iteration to the next. Clearly, the two key questions one needs to address in a model-based algorithm are how to generate samples from a given distribution gk and how to construct the distribution sequence {gk }. In order to address these questions, we provide brief descriptions of three model-based algorithms: annealing adaptive search (AAS) [31, 41], the cross-entropy (CE) method [32–34], and estimation of distribution algorithms (EDAs) [27, 42]. The annealing adaptive search algorithm was originally introduced in Romeijn and Smith [18] as a means to understand the behavior of simulated annealing. The algorithm generates candidate solutions by sampling from a sequence of Boltzmann distributions parameterized by time-dependent temperatures. As the temperature decreases to zero, the sequence of Boltzmann distributions becomes more concentrated on the set of optimal solutions, so that a solution sampled at later iterations will be close to the global optimum with high probability. For the class of Lipschitz optimization problems, it is shown that the expected number of iterations required by AAS to achieve a given level of precision increases at most linearly in the problem dimension [31, 41]. However, the idealized AAS is not intended to be a practically useful algorithm, because the problem of sampling exactly from a given Boltzmann distribution is known to be extremely difficult. This implementation issue has motivated a number of algorithms that approximate AAS, where a primary focus has been on the design and refinement of Markov chain-based sampling techniques embedded within the AAS framework [40, 41]. The CE method was motivated by an adaptive algorithm for estimating probabilities of rare events in complex stochastic networks [32], which involves variance minimization. It was later realized [33] that the method can be modified to solve combinatorial and continuous optimization problems. The CE method uses a family of parameterized probability distributions on the solution space and tries to find the parameter of the distribution that assigns maximum probability to the set of optimal solutions. Implicit in CE is an optimal importance sampling distribution concentrated only on the set of optimal solutions. The key idea is to use an iterative scheme to successively estimate the optimal parameter that minimizes the Kullback-Leibler (KL) divergence between the optimal distribution and the family of parameterized distributions. Although there have been extensive developments regarding implementation and successful practical applications of CE (see [34]), the literature analyzing the convergence properties of the CE method is relatively
sparse, with most of the existing results limited to specific settings (see, e.g., [17] for a convergence proof of a variational version of CE in the context of estimation of rare event probabilities, and [7] for probability one convergence proofs of CE for discrete optimization problems). General convergence and asymptotic rate results for CE were recently obtained in [21] by relating the algorithm to recursions of stochastic approximation type (see Sect. 10.6). EDAs were first introduced in the field of evolutionary computation. They inherit the spirit of the well-known genetic algorithms (GAs), but eliminate the crossover and mutation operators to avoid the disruption of partial solutions. In EDAs, a new population of candidate solutions are generated according to the probability distribution induced or estimated from the promising solutions selected from the previous generation. Unlike CE, EDAs often take into account the interrelations between the underlying decision variables needed to represent the individual candidate solutions. At each iteration of the algorithm, a high-dimensional probabilistic model that better represents the interdependencies between the decision variables is induced; this step constitutes the most crucial and difficult part of the method. We refer the reader to [27] for a review of the way in which different probabilistic models are used as EDA instantiations. A proof of convergence of a class of EDAs, under the idealized infinite population assumption, can be found in [42]. There are many other model-based algorithms proposed for global optimization. Some interesting examples include ant colony optimization (ACO) [9], probability collectives (PCs) [39], and particle swarm optimization (PSO) [24]. We do not provide a comprehensive description of all of them, but instead present some recently developed frameworks and approaches that allow us to view these algorithms in a unified setting. These approaches, including model reference adaptive search (MRAS) [18], the particle-filtering (PF) approach [43], the evolutionary games approach [38], and the stochastic approximation gradient approach [20, 21], will be discussed in detail in the following sections.
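Before turning to MRAS, it may help to see the two-step model-based scheme of Sect. 10.2.2 in executable form. The sketch below runs a cross-entropy-style loop with an independent Gaussian model and elite-fraction selection on a toy objective; the test function, population size, elite fraction, and smoothing constant are illustrative choices, not the specific algorithms analyzed in this chapter.

```python
import numpy as np

def H(x):
    # Hypothetical multimodal test objective on R^2 (to be maximized).
    return -np.sum(x ** 2, axis=-1) + 2.0 * np.exp(-np.sum((x - 1.0) ** 2, axis=-1))

def model_based_search(dim=2, pop=200, elite_frac=0.1, iters=60, smooth=0.7, seed=0):
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), 5.0 * np.ones(dim)          # initial model g_0
    for _ in range(iters):
        X = mu + sigma * rng.standard_normal((pop, dim))   # step 1: sample X^(k) from g_k
        scores = H(X)
        elite = X[np.argsort(scores)[-int(elite_frac * pop):]]
        # step 2: refit the model on the promising samples (with smoothing) to get g_{k+1}
        mu = smooth * elite.mean(axis=0) + (1 - smooth) * mu
        sigma = smooth * elite.std(axis=0) + (1 - smooth) * sigma
    return mu

print(model_based_search())   # the model mean concentrates near a high-quality solution
```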
10.3 Model Reference Adaptive Search As we have seen from Sect. 10.2, model-based algorithms differ from each other in the choices of the distribution sequence {gk }. Examples of the {gk } sequence include (a) Boltzmann distributions, used in AAS; (b) optimal importance sampling measure, primarily used in the CE method; and (c) proportional selection schemes, used in EDAs, ACOs, and PCs. However, in all the above cases, the construction of gk often depends on the objective function H, whose explicit form may not be available. In addition, since gk may not have any special structure, sampling exactly from the distribution is in general intractable. To address these computational challenges arising in modelbased methods, we have formalized in [18] a general approach called model reference adaptive search (MRAS), where the basic idea is to use a convenient parametric distribution as a surrogate to approximate gk and then sample candidate
solutions from the surrogate distribution. More specifically, the method starts by specifying a family of parameterized distributions { f_θ, θ ∈ Θ } (with Θ being the parameter space) and then projects g_k onto the family to obtain a sampling distribution f_{θ_k}, where the projection is implemented at each iteration by finding an optimal parameter θ_k that minimizes the Kullback-Leibler (KL) divergence between g_k and the parameterized family [34], i.e.,
θ_k = arg min_{θ∈Θ} D(g_k, f_θ) := arg min_{θ∈Θ} ∫_X ln( g_k(x)/f_θ(x) ) g_k(dx).   (10.2)
The idea is that the parameterized family is specified with some structure (e.g., family of normal distributions parameterized by means and variances) so that once its parameter is specified, sampling from the corresponding distribution can be performed relatively easily and efficiently. Another advantage is that the task of constructing the entire surrogate distribution now simplifies to the task of finding its associated parameters. Roughly speaking, each sampling distribution f_{θ_k} obtained via (10.2) can be viewed as a compact approximation of g_k, and consequently the entire sequence { f_{θ_k} } may (hopefully) retain some nice properties of the distribution sequence {g_k}. Thus, to ensure the convergence of the MRAS method, it is intuitively clear that the sequence {g_k} should be chosen in a way so that it can be shown to converge to a limiting distribution concentrated only on the set of optimal solutions. Since the distribution g_k is primarily used to guide the parameter updating process and to express the desired properties of the MRAS method, it is called the reference distribution. We now provide a summary of the MRAS method:
0. Select a sequence of reference distributions {g_k} with desired convergence properties and choose a parameterized family { f_θ }.
1. Given θ_k, sample N candidate solutions X_k^1, ..., X_k^N from f_{θ_k}.
2. Update the parameter θ_{k+1} by minimizing the KL divergence
θ_{k+1} = arg min_θ D(g_{k+1}, f_θ);
increase k by 1 and reiterate from step 1.
Note that the algorithm above assumes that the expectation/integral involved in the KL divergence (cf. (10.2)) can be evaluated exactly. In practice, it is often estimated by an empirical average based on samples obtained at step 1. The MRAS framework accommodates many algorithms aforementioned in Sect. 10.2. For example, when Boltzmann distributions are used as reference models, the resulting algorithm becomes AAS with an additional projection step. The algorithm instantiation considered in [18] uses the following recursive procedure to construct the g_k sequence:
g_{k+1}(x) = H(x) g_k(x) / ∫_X H(x) g_k(dx),   (10.3)
where g_0(x) is a given initial distribution on X and we have assumed for simplicity that H(x) > 0 for all x ∈ X to prevent negative probabilities. This form of reference distributions has also been used in a class of EDAs with proportional selection schemes. It weights the new distribution g_{k+1} by the value of the objective function H(x), so that each iteration of (10.3) improves the expected performance in the sense that
E_{g_{k+1}}[H(X)] := ∫_X H(x) g_{k+1}(dx) = ∫_X H²(x) g_k(dx) / ∫_X H(x) g_k(dx) ≥ E_{g_k}[H(X)],
so solutions with better performance are given more probability under g_{k+1}. This results in a {g_k} sequence that converges to a degenerate distribution at the optimal solution. Furthermore, it is shown in [18] that the CE method can also be recovered by replacing g_k in the right-hand side of (10.3) with f_{θ_k}. In other words, there is a sequence of reference distributions implicit in CE that takes the form
g_{k+1}(x) = H(x) f_{θ_k}(x) / ∫_X H(x) f_{θ_k}(dx).   (10.4)
Since g_{k+1} in (10.4) is obtained by tilting the sampling distribution f_{θ_k} with the objective function H, it improves the expected performance of f_{θ_k}, i.e.,
E_{g_{k+1}}[H(X)] = ∫_X H²(x) f_{θ_k}(dx) / ∫_X H(x) f_{θ_k}(dx) ≥ ∫_X H(x) f_{θ_k}(dx) := E_{θ_k}[H(X)].
H(x) fθk (dx) := Eθk [H(X)].
Therefore, it is reasonable to expect that the projection of gk+1 on the parameterized family, fθk+1 , also improves fθk , i.e., Eθk+1 [H(X)] ≥ Eθk [H(X)]. This view of CE leads to an important monotonicity property of the method, generalizing that of [34], which is only proved for the one-dimensional case.
10.3.1 Convergence Result For the family of natural exponential distributions (NEFs), the optimization problem involved at step 2 of the MRAS method can be solved analytically in closed form, which makes the approach very convenient to implement in practice. We recall the definition of NEFs. Definition 10.3.1. A parameterized family { fθ , θ ∈ Θ ⊆ ℜd } is said to belong to the natural exponential family if there exist mappings Γ : ℜn → ℜd and K : ℜd → ℜ such that each fθ in the family can be represented in the form fθ (x) = exp θ T Γ (x) − K(θ ) , where K(θ ) is a normalization constant given by K(θ ) = ln X exp(θ T Γ (x))dx.
164
J. Hu et al.
The function K(θ ) plays an important role in the theory of NEFs. It is strictly convex in the interior of Θ with gradient ∇θ K(θ ) = Eθ [Γ (X)] and Hessian matrix Covθ [Γ (X)]. We define the mean vector function m(θ ) := Eθ [Γ (X)]. Since the Jacobian of m(θ ) is strictly positive definite, we have from the inverse function theorem that m(θ ) is a one-to-one invertible function of θ . Generally speaking, m(θ ) can be viewed as a transformed version of the sufficient statistic Γ (x), whose value contains all necessary information to estimate the parameter θ . For example, for the univariate normal distribution N(μ , σ 2 ) with mean μ and variance σ 2 , it can be seen that Γ (x) = (x, x2 )T and θ = ( σμ2 , − 2σ1 2 )T . Thus, m(θ ) = Eθ [Γ (X)] becomes (μ , σ 2 + μ 2 )T , which can be uniquely solved for μ and σ 2 given the value of m(θ ). When NEFs are used as the parameterized family, we have the following convergence theorem for the instantiation of MRAS considered in [18]. Theorem 10.3.1. When {gk } in (10.3) are used as reference distributions in MRAS, let {θk } be the sequence of parameters generated by the algorithm based on the sampled candidate solutions. Under appropriate assumptions (see [18]), lim m(θk ) = Γ (x∗ ) w.p.1.
k→∞
The interpretation of Theorem 10.3.1 relies on the parameterized family used in MRAS and, in particular, on the specific form of the sufficient statistic Γ (x). We consider two special cases of Theorem 10.3.1. (a) In continuous optimization when multivariate normal distributions with mean vector μ and covariance matrix Σ are used as the parameterized family, then it is easy to show that Theorem 10.3.1 implies limk→∞ μk = x∗ and limk→∞ Σk = 0n×n w.p.1, where 0n×n represents an n-by-n zero matrix. In other words, the sequence of sampling distributions { fθk } will converge to a delta distribution with all probability mass concentrated on x∗ . (b) For a discrete optimization problem with feasible domain X that contains l distinct values denoted by x1 , . . . , xl , the parameterized family can be specified in terms of an l-by-1 probability vector Q, whose ith entry qi represents the probability that a (random) solution will take the ith value xi . A probability mass function on X, when parameterized by Q, can thus be expressed as l
fθ (x) = ∏ qi
I{x=xi }
:= eθ
T Γ (x)
,
i=1
where I{·} is the indicator function, θ = [ln q1 , . . . , ln ql ]T , and the sufficient statistic Γ (x) = [I{x = x1 }, . . . , I{x = xl }]T . Therefore, a simple application of Theorem 10.3.1 yields l
∑ ∏(qki )I{x=xi } I{x = x j } = I{x∗ = x j } ∀ j w.p.1, k→∞ lim
x∈X i=1
10 A Survey of Some Model-Based Methods for Global Optimization
165
where qki is the ith entry of the probability vector Qk obtained at the kth iteration of the algorithm. This in turn implies that limk→∞ qki = I{x∗ = xi } w.p.1., i.e., the sequence of Qk will convergence to a degenerate probability vector assigning unit mass to x∗ . We remark that Theorem 10.3.1 does not address the convergence rate of the algorithm. Moreover, the proof techniques used in [18] cannot be directly carried over to analyze other algorithms such as CE, due to the dependency of gk on the parameterized family (cf. (10.4)). In Sect. 10.6, we show that with some appropriate modifications of the MRAS method, we can arrive at a general framework linking model-based methods to recursive algorithms of stochastic approximation type, which makes the convergence and convergence rate analysis of these algorithms more tractable.
10.4 Particle-Filtering Approach Filtering refers to the estimation of an unobserved state in a dynamical system based on noisy observations that arrive sequentially in time (c.f. [8] for an introduction). The idea behind the particle-filtering approach is to transform the optimization problem into a filtering problem. Using a novel interpretation, the distribution sequence {gk } in model-based optimization corresponds to the sequence of conditional distributions of the unobserved state given the observation history in filtering, and hence, {gk } is updated from a Bayesian perspective. A class of simulationbased filtering techniques called particle filtering can then be employed to sample from {gk }, leading to a framework for model-based optimization algorithms. More specifically, the optimization problem (10.1) can be transformed into a filtering problem by choosing an appropriate state-space model, such as the following: Xk = Xk−1 , k = 1, 2, . . . , Yk = H(Xk ) − Vk , k = 1, 2, . . . ,
(10.5)
where Xk ∈ ℜn is the unobserved state, Yk ∈ ℜ is the observation, and {Vk , k = 1, 2, . . .} is an i.i.d. sequence of nonnegative random variables that have a p.d.f. ϕ . A prior distribution on X0 is denoted by g0 . The goal of filtering is to compute the conditional density gk of the unobserved state Xk given the past observations {Y1 = y1 , . . . ,Yk = yk } for k = 1, 2, . . .. Let F denote the σ -field of Borel sets of ℜn . Then the conditional density gk satisfies P(Xk ∈ A|Y1:k = y1:k ) =
A
gk (x)dx, ∀A ∈ F ,
166
J. Hu et al.
where Y1:k = {Y1 , . . . ,Yk }, and y1:k = {y1 , . . . , yk }. Using Bayes rule, the evolution of gk (x) can be derived as follows: gk (x) = p(x|y0:k−1 , yk ) =
p(yk |x)p(x|y0:k−1 ) p(yk |y0:k−1 )
=
ϕ (H(x) − yk )gk−1 (x) , ϕ (H(x) − yk )gk−1 (x)dx
(10.6)
where the last line uses the density functions induced by (10.5). The intuition of (10.5) and (10.6) and their connection with optimization can be explained as follows: the unobserved state {Xk} is constant, with the underlying value being the optimum x∗, which needs to be estimated; the observations {yk} are noisy observations of the optimal function value H(x∗) and come from the sample function values in an optimization algorithm; and the conditional density gk is a density estimate of the optimum x∗ at iteration k based on the sample function values {y1 , . . . , yk}. Equation (10.6) implies that gk is tilted toward the more promising region where H(x) is greater than yk, since ϕ(H(x) − yk) is positive if H(x) ≥ yk and is zero otherwise. Hence, randomization in the optimization algorithm is brought in by the randomness of Vk, and the choice of the p.d.f. ϕ of Vk results in different sample selection or weighting schemes in the algorithm.
In order to ensure that the resulting optimization algorithm monotonically approaches the optimum, the following general condition (C) on ϕ is imposed:
(C) The p.d.f. ϕ(·) is positive, strictly increasing, and continuous on its support [0, ∞).
It is shown in [45] that if ϕ satisfies this condition, then for an arbitrary, fixed observation sequence {y1 , y2 , . . .}, the estimate of the function value is monotonically increasing, i.e., Egk+1[H(X)] ≥ Egk[H(X)]. Hence, it has the same monotonicity property as MRAS and CE. Furthermore, the estimate of the optimal function value asymptotically converges to the true optimal function value, as stated in the following theorem, which is also shown in [45].
Theorem 10.4.2. Suppose the following conditions hold:
(i) For all H(x) < H(x∗), the set {z ∈ X : H(z) ≥ H(x)} has strictly positive measure with respect to the initial sampling distribution, i.e., ∫_{z∈X:H(z)≥H(x)} g0(z) dz > 0.
(ii) There is a unique optimum x∗, and H(x) is continuous at x∗.
(iii) ϕ satisfies the condition (C).
Then for an arbitrary, fixed observation sequence {y1 , y2 , . . .},
limk→∞ Egk[H(X)] = H(x∗).
The conditions (i) and (ii) ensure that any neighborhood of the optimum always has a positive probability to be sampled. The result implies that the samples drawn from gk in the limit will be concentrated on the optimum.
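As a small numerical illustration of the recursion (10.6) (not part of the survey; the objective H, the noise density ϕ, and the rule for generating yk are all illustrative assumptions), one can propagate a grid-discretized version of gk and observe that the estimated function value tends to increase toward H(x∗):

import numpy as np

H = lambda x: np.exp(-0.5 * (x - 2.0) ** 2)      # toy objective with optimum at x* = 2

x = np.linspace(-5.0, 5.0, 4001)                 # grid over the solution space
dx = x[1] - x[0]
g = np.full_like(x, 1.0 / (x[-1] - x[0]))        # uniform prior g_0

B = 1.0                                          # support bound of the noise density
phi = lambda u: np.where((u >= 0) & (u <= B), 2.0 * u / B ** 2, 0.0)  # increasing p.d.f. on [0, B]

rng = np.random.default_rng(0)
for k in range(20):
    # "observation" y_k: function value of a point sampled from the current model
    xi = rng.choice(x, p=g * dx / np.sum(g * dx))
    y = H(xi)
    unnorm = phi(H(x) - y) * g                   # numerator of the Bayes update (10.6)
    if unnorm.sum() > 0:                         # keep the previous model if all weights vanish
        g = unnorm / (unnorm.sum() * dx)
    print(k, (g * H(x) * dx).sum())              # E_{g_k}[H(X)] typically increases toward H(x*)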
10.4.1 Algorithms
The distribution sequence {gk} in general does not have a closed-form solution. Various numerical filtering methods (cf. [5] for a recent survey) are available to approximate {gk} numerically. However, the most akin to model-based optimization algorithms is the particle-filtering technique, which is a more recent class of approximate filtering methods based on sequential Monte Carlo (SMC) simulation (cf. the tutorial [1] and the more recent tutorial [11] for a quick reference and the book [10] for a more comprehensive account). Despite its abundant successful applications in many areas, particle filtering has rarely been explored in optimization.
The basic particle filter is a sequential importance sampling resampling algorithm, each iteration of which is composed of an importance sampling step to propagate the particles (i.e., samples) from the previous iteration to the current one, a Bayes updating step to update the weights of the particles, and a resampling step to generate new particles in order to prevent sample degeneracy. Applying it to the distribution sequence {gk} specified in (10.6) leads to the particle filtering for optimization (PFO) framework as follows:
0. Initialization. Specify g0, and draw i.i.d. samples {X_1^i , i = 1, . . . , N_1} from g0. Set k = 1.
1. Bayes updating. Take yk to be a sample function value H(X_k^i) according to a certain rule. Compute the weight w_k^i for sample X_k^i according to
   w_k^i ∝ ϕ(H(X_k^i) − yk), i = 1, 2, . . . , N_k,
   and normalize the weights so that they sum to 1.
2. Resampling. Generate i.i.d. samples {X_{k+1}^i , i = 1, . . . , N_{k+1}} from the weighted samples {(w_k^i , X_k^i), i = 1, . . . , N_k} using the regularized method, the density projection method, or the resample-move method.
3. Stopping. If a stopping criterion is satisfied, then stop; else, increase k by 1 and reiterate from step 1.
Note that the simple method of sampling with replacement cannot be used in the resampling step, since it does not generate new values for the samples and hence does not explore new candidate solutions for the purpose of optimization. Several other known resampling methods can be used to generate new candidate solutions and can also be easily implemented, including the regularized method [28], the density projection method [44], and the resample-move method [13]. The regularized method draws new i.i.d. samples from a continuous mixture distribution, where each continuous kernel of the mixture distribution is centered at each sample
X_k^i, and the weight of that kernel is equal to the probability mass w_k^i of X_k^i. The density projection method resembles MRAS and CE in finding a parameterized density fθk by minimizing the KL divergence between the discrete distribution {(w_k^i , X_k^i)} and the parameterized family. The resample-move method applies a Markov chain Monte Carlo (MCMC) step to move the particles after they are generated by sampling with replacement. Depending on the resampling method, the convergence properties of the different instantiations of PFO are also slightly different, but all readily follow from the existing convergence results for the corresponding particle filters in the literature [6, 14, 44] under suitable assumptions.
We end this section with a final remark that the PFO framework provides a new perspective on CE and MRAS. We will use the truncated selection scheme for sample selection as an illustration. Suppose that the objective function H(x) is bounded by H1 ≤ H(x) ≤ H2. In the state-space model (10.5), let the observation noise Vk follow a uniform distribution U(0, H2 − H1); then ϕ, the p.d.f. of Vk, satisfies
ϕ(u) = 1/(H2 − H1) if 0 ≤ u ≤ H2 − H1, and ϕ(u) = 0 otherwise.     (10.7)
Since yk is a sample function value, the inequality H(x) − yk ≤ H2 − H1 holds with probability 1, so substituting (10.7) into (10.6) yields
gk(x) = I{H(x) ≥ yk} gk−1(x) / ∫ I{H(x) ≥ yk} gk−1(x) dx.
The standard CE method can be viewed as PFO with the above choice of distribution sequence {gk} and the density projection method for resampling, so the samples {X_k^i} are generated from fθk−1 and the weights of the samples are computed according to w_k^i ∝ I{H(X_k^i) ≥ yk}. However, the approximation of gk−1 by fθk−1 introduces an approximation error, which is accumulated to the next iteration. This approximation error can be corrected by taking fθk−1 as an importance density and hence can be taken care of by the weights of the samples. That is, in the case of MRAS or CE in which the sequence {yk} is monotonically increasing, the weights are computed according to
w_k^i = gk(X_k^i) / fθk−1(X_k^i) ∝ I{H(X_k^i) ≥ yk} / fθk−1(X_k^i).
This instantiation of PFO coincides with an instantiation of MRAS. More details on a unifying perspective on EDAs, CE, and MRAS are given in [45].
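The following sketch (our illustration, not code from [45]) instantiates PFO with the truncated selection weights above and with density projection onto a Gaussian family, for which the KL projection reduces to weighted moment matching; this is essentially a standard CE iteration. The objective H and all constants are assumptions made only for the example.

import numpy as np

def H(x):                      # toy multimodal objective on the real line
    return np.exp(-0.5 * (x - 2.0) ** 2) + 0.4 * np.exp(-0.5 * (x + 3.0) ** 2)

rng = np.random.default_rng(1)
mu, sigma = 0.0, 5.0           # parameters of the Gaussian sampling density f_theta
N, rho = 200, 0.1              # sample size and quantile level

for k in range(30):
    xs = rng.normal(mu, sigma, size=N)            # importance sampling: particles from f_{theta_{k-1}}
    y = np.quantile(H(xs), 1.0 - rho)             # y_k: (1 - rho)-quantile of the sampled function values
    w = (H(xs) >= y).astype(float)                # Bayes updating: w_k^i ∝ I{H(X_k^i) >= y_k}
    w /= w.sum()
    # Density projection (resampling step): minimizing the KL divergence to a Gaussian
    # reduces to matching the weighted mean and variance of the particles.
    mu = np.sum(w * xs)
    sigma = np.sqrt(np.sum(w * (xs - mu) ** 2)) + 1e-6
print(mu)                                          # concentrates near the global optimum x* = 2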
10.5 Evolutionary Games Approach
The main idea of the evolutionary games approach is to formulate the global optimization problem as an evolutionary game and to use dynamics from evolutionary game theory to study the evolution of the candidate solutions. Searching for the optimal solution is carried out through the dynamics of reaching equilibrium points in evolutionary games. Specifically, we establish a connection between evolutionary game theory and optimization by formulating the global optimization problem as an evolutionary game with a continuous strategy space. We show that there is a strong connection between a particular equilibrium set of the replicator dynamics and the global optimal solutions. By using Lyapunov theory, we also show that this particular equilibrium set is asymptotically stable under mild conditions. Based on the connection between the equilibrium points and the global optimal solutions, we develop a model-based evolutionary optimization (MEO) algorithm.
First, we set up an evolutionary game with a continuous strategy space. Let B be the Borel σ-field on X, the strategy space of the game; for each t, let Pt be a probability measure defined on (X, B). Let Δ denote the set of all strategies (probability measures) on X. Each point x ∈ X can be viewed as a pure strategy. Roughly speaking, the fraction of agents playing the pure strategy x at time t is Pt(dx). An agent playing the pure strategy x obtains a fitness φ(H(x)), where φ(·) : ℜ → ℜ+ is a strictly increasing function. An appropriately chosen φ(·) can facilitate the expression of the model updating rule presented later. Let X be a random variable with probability distribution Pt. The fractions of agents adopting different strategies in the continuous game are described by the probability measure Pt defined on the strategy space X, so the average payoff of the whole population is given by
EPt[φ(H(X))] = ∫_X φ(H(x)) Pt(dx).
In evolutionary game theory [29], the evolution of this probability measure is governed by some dynamics such as the so-called replicator dynamics. Let A be a measurable set in X. If the replicator dynamics with a continuous strategy space is adopted, we have
Ṗt(A) = ∫_A ( φ(H(x)) − EPt[φ(H(X))] ) Pt(dx).     (10.8)
From (10.8), we can see that if φ(H(x)) outperforms EPt[φ(H(X))] at x, the probability measure around x will increase. If there exists a probability density function pt such that Pt(dx) = pt(x) μ(dx), where μ(·) is the Lebesgue measure on (X, B), then (10.8) becomes
ṗt(x) = ( φ(H(x)) − EPt[φ(H(X))] ) pt(x),     (10.9)
which governs the evolution of the probability density function on the continuous strategy space. When pt (x) is used as our model to generate candidate solutions for the global optimization problem (10.1), the differential equation (10.9) can be used to update the model pt (x), with the final goal of making the probability density function pt (x) converge to a small set containing the global optimal solution. Then, the global optimization problem can be easily solved by sampling from the obtained probability density function.
10.5.1 Convergence Analysis
In this section, we study the properties of the equilibrium points of (10.8) and their connection with the global optimal solutions of the optimization problem, by employing the tools of equilibrium analysis in game theory and stability analysis in dynamical systems. Assume that the optimization problem (10.1) has m global optimal solutions {xi , i = 1, . . . , m}. It is easy to see that P(x) = δ(x − xi) for i = 1, . . . , m are equilibrium points of (10.8), and we might guess that there is a strong connection between the equilibrium points of (10.8) and the optimal solutions of the global optimization problem (10.1). We enforce the following assumption on the function φ.
Assumption 10.5.1 φ(·) is a continuous and strictly increasing function; there exist constants L and M such that L ≤ φ(H(x)) ≤ M for all x ∈ X.
The following theorem shows that the overall fitness of the strategy (probability measure) Pt is monotonically increasing over time.
Theorem 10.5.3. Let Pt be a solution of the replicator dynamics (10.8). Under Assumption 10.5.1, the average payoff of the entire population EPt[φ(H(X))] is monotonically increasing with time t. If Pt is not an equilibrium point of (10.8), then EPt[φ(H(X))] is strictly increasing with time t.
To further study the properties of the equilibrium points of the replicator dynamics (10.8), the Prokhorov metric is used to measure the distance between different strategies (probability measures):
ρ(P, Q) := inf{ε > 0 : Q(A) ≤ P(Aε) + ε and P(A) ≤ Q(Aε) + ε, ∀A ∈ B},
where Aε := {x : ∃ỹ ∈ A, d(ỹ, x) < ε}, in which d is a metric defined on X. Then, the convergence ρ(Qn , Q) → 0 is equivalent to the weak convergence of Qn to Q [3].
Definition 10.5.2. Let E be a set in Δ. For a point P ∈ Δ, define the distance between P and E as ρ(P, E) = inf{ρ(P, Q), Q ∈ E}. E is called Lyapunov stable if for all ε > 0 there exists η > 0 such that ρ(P0 , E) < η =⇒ ρ(Pt , E) < ε for all t > 0.
Definition 10.5.3. Let E be a set in Δ. E is called asymptotically stable if E is Lyapunov stable and there exists η > 0 such that ρ(P0 , E) < η =⇒ ρ(Pt , E) → 0 as t → ∞.
Definition 10.5.4. Δ0 ⊂ Δ is the set containing all P0 for which there exists an xk such that P0(Ã) > 0 for any set Ã ∈ B that contains xk and has positive Lebesgue measure μ(Ã) > 0. Let C = {P : P = limt→∞ Pt starting from some P0 ∈ Δ0}.
To present the main convergence result, we also need the following assumption.
Assumption 10.5.2 There is a finite number of global optimal solutions {x1 , . . . , xm} for the optimization problem (10.1), where m is a positive integer.
Theorem 10.5.4. If Assumptions 10.5.1 and 10.5.2 hold, then for any P ∈ C there exist αi ≥ 0, i = 1, . . . , m, with ∑_{i=1}^m αi = 1 such that P(x) = ∑_{i=1}^m αi δ(x − xi); the set C can be represented as C = {P : P(x) = ∑_{i=1}^m αi δ(x − xi) for some αi ≥ 0, i = 1, . . . , m, with ∑_{i=1}^m αi = 1}; and, in addition, the set C is asymptotically stable.
10.5.2 Model-Based Evolutionary Optimization
From the above analysis, we know that the global optimal solutions can be obtained by generating samples from equilibrium distributions of the replicator dynamics (10.8); these equilibrium distributions can be approached by following trajectories of (10.8) starting from P0 ∈ Δ0. Note that by Theorem 10.5.4, the equilibrium points obtained by starting from P0 ∈ Δ0 are of the form P(x) = ∑_{i=1}^m αi δ(x − xi), where ∑_{i=1}^m αi = 1 and αi ≥ 0 for i = 1, . . . , m, which suggests using a sum of Dirac functions to approximate pt. Assume a group of candidate solutions {x_t^i , i = 1, . . . , N} is generated from pt; then the probability density function pt can be approximated by p̂t(x) = ∑_{i=1}^N w_t^i δ(x − x_t^i), where δ denotes the Dirac function and {w_t^i} are weights satisfying ∑_{i=1}^N w_t^i = 1. If we use this approximation p̂t as our probabilistic model and substitute it into (10.9), we have
∂w_t^i/∂t = ( φ(H(x_t^i)) − ∑_{j=1}^N w_t^j φ(H(x_t^j)) ) w_t^i ,   ∀i = 1, . . . , N.     (10.10)
The corresponding discrete-time version of (10.10) is
w_{k+1}^i = [ φ(H(x_k^i)) / ∑_{j=1}^N w_k^j φ(H(x_k^j)) ] w_k^i ,   ∀i = 1, . . . , N.     (10.11)
We can let φ(·) be an exponential function so that the denominator on the right-hand side of (10.11) is not equal to zero. Although an updated density approximation p̂_{k+1}(x) = ∑_{i=1}^N w_{k+1}^i δ(x − x_k^i) is obtained, it cannot be used directly to generate new candidate solutions. We construct a new continuous density to approximate p̂_{k+1},
which is done by projecting p̂_{k+1} onto some parameterized family of distributions gθ. The idea of projection onto a parameterized family has also been used in CE and MRAS, as discussed above. Specifically, we minimize the KL divergence between the parameterized distribution gθ and p̂_{k+1}:
θ_{k+1} = arg min_{θ∈Θ} D( p̂_{k+1} , gθ ),     (10.12)
where Θ is the domain of θ. After some algebraic operations, we can show that solving (10.12) is equivalent to maxθ∈Θ ∑_{i=1}^N w_{k+1}^i ln gθ(x_k^i).
All of the above analysis is carried out when replicator dynamics, e.g., (10.8) and (10.9), are used. There are other dynamics in evolutionary game theory, such as imitation dynamics, logit dynamics, and Brown–von Neumann–Nash dynamics, that can be used to update the weights {w_k^i}. To present the algorithm in a more general setting, the updating of the weights is denoted by
w_k^i = Dd( φ(H(x_{k−1}^i)) I{H(x_{k−1}^i) ≥ γ_{k−1}} ,  ∑_{j=1}^N w_{k−1}^j φ(H(x_{k−1}^j)) I{H(x_{k−1}^j) ≥ γ_{k−1}} ,  w_{k−1}^i ),     (10.13)
where γ_{k−1} is a constant that is used to select good candidate solutions, and Dd is a function of three variables that represents the updating rule. For example, when Dd is derived from the replicator dynamics, we have
w_k^i = [ (1/N) φ(H(x_{k−1}^i)) I{H(x_{k−1}^i) ≥ γ_{k−1}} / ∑_{j=1}^N (1/N) φ(H(x_{k−1}^j)) I{H(x_{k−1}^j) ≥ γ_{k−1}} ] w_{k−1}^i ,   ∀i = 1, . . . , N.
Based on the above analysis, a Monte Carlo simulation version of the MEO algorithm is given as follows.
Model-Based Evolutionary Optimization Algorithm (MEO)
0. Initialization. Specify N as the total number of candidate solutions generated at each iteration. Choose ρ ∈ (0, 1] and an initial gθ0 defined on X. Set k = 0, w_0^i = 1/N for i = 1, . . . , N, and γ0 = −∞.
1. Quantile calculation. Generate N candidate solutions {x_k^i , i = 1, . . . , N} from gθk. Calculate the (1 − ρ)-quantile γk of the function values {H(x_k^i), i = 1, . . . , N}. If k > 1 and γk < γk−1, set γk = γk−1 and w_{k−1}^i = 1/N for i = 1, . . . , N. Set k = k + 1 and go to Step 2.
2. Updating the probabilistic model. The discrete approximation of the model is p̂k(x) = ∑_{i=1}^N w_k^i δ(x − x_{k−1}^i), where the weights {w_k^i} are updated according to (10.13).
3. Density projection. Construct gθk by projecting the density p̂k = ∑_{i=1}^N w_k^i δ(x − x_{k−1}^i) onto the parameterized family: θk = arg maxθ∈Θ ∑_{i=1}^N w_k^i ln gθ(x_{k−1}^i).
4. Stop if some stopping criterion is satisfied; otherwise go to Step 1.
Generally, it is not easy to solve the optimization problem (10.12), which depends on the choice of gθ . However, for gθ in an exponential family, analytical solutions can be obtained. A comprehensive exposition of the evolutionary games approach is given in [37, 38].
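As a rough sketch (not the authors' implementation) of one possible MEO instantiation, the following code assumes that gθ is a univariate Gaussian, for which the projection (10.12) reduces to weighted moment matching, that φ(·) = exp(·), and that the previous weights are uniform as in Step 0; the objective H and all constants are illustrative.

import numpy as np

def H(x):
    return -x ** 2 + 4.0 * np.sin(3.0 * x)             # toy objective on the real line

rng = np.random.default_rng(2)
N, rho = 100, 0.2
mu, sigma = 0.0, 3.0                                   # g_theta: initial Gaussian model
gamma = -np.inf

for k in range(40):
    xs = rng.normal(mu, sigma, size=N)                 # candidate solutions from g_theta_k
    vals = H(xs)
    gamma = max(gamma, np.quantile(vals, 1.0 - rho))   # monotone quantile threshold gamma_k
    fit = np.exp(vals) * (vals >= gamma)               # phi = exp, truncated below gamma_k
    if fit.sum() == 0.0:                               # no candidate reaches the threshold: resample
        continue
    w = fit / fit.sum()                                # replicator-type weights (uniform previous weights)
    mu = np.sum(w * xs)                                # Gaussian projection: weighted moment matching
    sigma = np.sqrt(np.sum(w * (xs - mu) ** 2)) + 1e-3
print(mu)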
10.6 Stochastic Approximation Approach In this section, we present a stochastic approximation framework to study modelbased algorithms [21]. The framework is based on the MRAS method presented in Sect. 10.3 and is intended to combine the robust features of model-based algorithms encountered in practice with rigorous convergence guarantees. Specifically, by exploiting a natural connection between model-based algorithms and the wellknown stochastic approximation (SA) method [2,4,26,30], we show that, regardless of the type of decision variables involved in (10.1), algorithms conforming to the framework can be equivalently formulated in the form of a generalized stochastic approximation procedure on a transformed continuous parameter space for solving a sequence of stochastic optimization problems with differentiable structures. This viewpoint, which is new to this type of random search algorithms, allows us to study the asymptotic convergence and rate properties of these algorithms by using existing theory and tools from SA. The key idea that leads to the proposed framework is based on replacing the reference sequence {gk } in the original MRAS method by a more general distribution sequence in the recursive form: gˆk+1 (x) = αk gk+1 (x) + (1 − αk ) fθk (x), αk ∈ (0, 1) ∀ k,
(10.14)
which is a mixture of the reference distribution gk+1 and the sampling distribution fθk obtained at the kth iteration. Such a mixture gˆk+1 retains the properties of gk+1 while, on the other hand, ensures that its difference from fθk is only incremental. Thus, the intuition is that if one were to replace gk+1 with gˆk+1 in minimizing the KL divergence D(gˆk+1 , fθ ), then the new sampling distribution fθk+1 obtained would also stay close to the current sampling distribution fθk . When {gˆk } instead of {gk } is used at step 2 of MRAS to minimize the KL divergence, the following lemma reveals a key link between the two successive mean vector functions of the projected probability distributions [21]. Lemma 10.6.1. If fθ belongs to NEFs and the new parameter θk+1 obtained via minimizing D(gˆk+1 , fθ ) is an interior point of the parameter space Θ for all k, then m(θk+1 ) − m(θk ) = −αk ∇θ D(gk+1 , fθ )|θ =θk .
(10.15)
Basically, Lemma 10.6.1 states that regardless of the specific form of gk , the mean vector function m(θk ) (i.e., a one-to-one transformation of θk ) is updated at each step
along the gradient descent direction of the time-varying objective function for the minimization problem minθ D(gk+1 , fθ). In particular, in the case of the CE method, i.e., when gk+1 in (10.15) takes the form gk+1(x) = H(x) fθk(x) / ∫_X H(x) fθk(x) dx (cf. (10.4)), it can be seen that recursion (10.15) becomes
m(θk+1) − m(θk) = αk ∇θ ln Eθ[H(X)]|θ=θk .     (10.16)
Hence, m(θk) is updated along the gradient direction of the objective function for the maximization problem maxθ ln Eθ[H(X)], the optimal solution to which is a sampling distribution fθ∗ that assigns maximum probability to the set of optimal solutions of (10.1). Note that the parameter sequence {αk} turns out to be the gain sequence for the gradient iteration, so that the special case αk ≡ 1 corresponds to the original MRAS method. This suggests that all model-based algorithms that fall under the MRAS framework can be equivalently viewed as gradient-based recursions on the parameter space Θ for solving a sequence of optimization problems with differentiable structures. This new interpretation of model-based algorithms provides a key insight into how these algorithms address hard optimization problems with little structure. In actual implementation, when integrals/expectations are replaced by sample averages based on Monte Carlo sampling, (10.15) and (10.16) become recursive algorithms of stochastic approximation type with direct gradient estimation. Thus, it is clear that the rich body of tools and results from stochastic approximation can be incorporated into the framework to analyze model-based algorithms.
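To illustrate the gradient interpretation (10.16), the following sketch (an assumption-laden illustration, not code from [21]) runs a Monte Carlo version of the recursion for a Gaussian sampling distribution with fixed variance, whose mean plays the role of the parameter; the weighted-sample step is the familiar smoothed CE update written as a stochastic-approximation iteration. The objective, gain sequence, and sample size are all illustrative choices.

import numpy as np

def H(x):                                   # nonnegative objective (CE reference g_{k+1} ∝ H f_theta)
    return np.exp(-0.5 * (x - 1.5) ** 2)

rng = np.random.default_rng(3)
mu, sigma = -4.0, 2.0                       # Gaussian f_theta with fixed sigma; the mean is the parameter
for k in range(1, 200):
    a_k = 5.0 / (k + 10.0)                  # gain sequence alpha_k of the stochastic approximation recursion
    xs = rng.normal(mu, sigma, size=100)
    h = H(xs)
    if h.sum() == 0.0:
        continue
    # Monte Carlo estimate of the ascent direction in (10.16): for a Gaussian mean parameter it is
    # proportional to the H-weighted sample mean minus the current mean.
    grad_step = np.sum(h * (xs - mu)) / np.sum(h)
    mu = mu + a_k * grad_step               # smoothed update: mu_{k+1} = (1 - a_k) mu_k + a_k * weighted mean
print(mu)                                   # drifts toward the maximizer of ln E_theta[H(X)]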
10.6.1 Convergence of the CE Method
The convergence of the CE algorithm has recently been studied in [19, 21] by casting a Monte Carlo version of recursion (10.16) in the form of a generalized Robbins–Monro algorithm in terms of the true gradient, bias, and an error term due to random sampling, and then following the arguments of the ordinary differential equation (ODE) approach [2, 4]. The main convergence results are summarized below, where for notational convenience we define η := m(θ) and ηk := m(θk).
Theorem 10.6.5. (Convergence of CE) Under some regularity conditions (see [21]), the sequence of iterates {ηk} generated by the CE algorithm converges w.p.1 to a compact connected internally chain recurrent set of the ODE
dη(t)/dt = L(η(t)),  t ≥ 0,     (10.17)
where L(η) := ∇θ ln Eθ[H(X)]|θ=m−1(η) .
Theorem 10.6.5 indicates that the long-run behavior (e.g., local/global convergence) of CE is primarily governed by the asymptotic solution of an underlying ODE. This result formalizes our prior observation in [18], which provides counterexamples indicating that CE and its variants are in general local improvement methods. Under the more stringent assumption that {ηk} converges to a unique limiting point η∗, the following asymptotic normality result was obtained in [21].
Theorem 10.6.6. (Asymptotic normality of CE) Under some appropriate conditions (see Theorem 4.1 of [21]),
k^{τ/2} (ηk − η∗) → N(0, Σ) in distribution as k → ∞,
where τ ∈ (0, 1) is some appropriate constant and Σ is a positive definite covariance matrix.
10.6.2 Model-Based Annealing Random Search
To further illustrate the stochastic approximation approach, we present an algorithm instantiation of the framework called model-based annealing random search (MARS) [20]. MARS can essentially be viewed as an implementable version of the annealing adaptive search (AAS) algorithm, in that it provides an alternative approach to address the implementation difficulty of AAS (cf. Sect. 10.2). The basic idea is to use a sequence of NEF distributions to approximate the target Boltzmann distributions and then use the sequence as surrogate distributions to generate candidate points. Thus, by treating the Boltzmann distributions as reference distributions, candidate solutions are drawn at each iteration of MARS indirectly from a Boltzmann distribution by sampling exactly from its approximation. This is in contrast to Markov chain-based techniques [41] that aim to sample directly from the Boltzmann distributions. The MARS algorithm is conceptually very simple and is summarized below:
0. Choose a parameterized family {fθ}, an annealing schedule used in the Boltzmann distribution, and a gain sequence {αk}.
1. Given θk, sample N candidate solutions Xk1 , . . . , XkN from fθk.
2. Update the parameter θk+1 = arg minθ D(ĝk+1 , fθ); increase k by 1 and reiterate from step 1.
At Step 2 of MARS, the reference distribution is given by ĝk+1(x) = αk ḡk+1(x) + (1 − αk) fθk(x), where ḡk+1 is an empirical estimate, based on the sampled solutions Xk1 , . . . , XkN, of the true Boltzmann distribution
gk+1(x) := e^{H(x)/Tk} / ∫_X e^{H(x)/Tk} dx ,
and {Tk} is a sequence of decreasing temperatures that controls how fast the sequence of Boltzmann distributions will degenerate.
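As a concrete illustration, the following sketch (illustrative only; the schedules for Tk, αk, and Nk and the objective are assumptions, not those analyzed in [20]) runs MARS with a univariate Gaussian family, for which minimizing D(ĝk+1 , fθ) amounts to matching the first two moments of the mixture ĝk+1:

import numpy as np

def H(x):
    return np.sin(x) + np.exp(-0.2 * x ** 2)        # toy objective

rng = np.random.default_rng(4)
mu, var = 0.0, 9.0                                  # f_theta: Gaussian surrogate distribution
for k in range(1, 100):
    N_k = 50 + 5 * k                                # increasing sample size
    T_k = 10.0 / np.log(k + 1.0)                    # decreasing annealing temperature (illustrative schedule)
    a_k = 1.0 / k ** 0.8                            # gain sequence alpha_k
    xs = rng.normal(mu, np.sqrt(var), size=N_k)
    w = np.exp((H(xs) - H(xs).max()) / T_k)         # Boltzmann weights (shifted for numerical stability)
    w /= w.sum()
    m1 = np.sum(w * xs)                             # moments of the empirical Boltzmann estimate g-bar_{k+1}
    m2 = np.sum(w * xs ** 2)
    # KL projection of the mixture a_k * g-bar_{k+1} + (1 - a_k) * f_theta_k onto the Gaussian family
    # reduces to matching the first and second moments of that mixture.
    new_m1 = a_k * m1 + (1 - a_k) * mu
    new_m2 = a_k * m2 + (1 - a_k) * (var + mu ** 2)
    mu, var = new_m1, max(new_m2 - new_m1 ** 2, 1e-8)
print(mu)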
Under its equivalent gradient interpretation, Lemma 10.6.1 shows that the mean vector function m(θk+1) of the new distribution fθk+1 obtained at step 2 of MARS can be viewed as an iterate generated by a gradient descent algorithm for solving the iteration-varying minimization problem minθ D(ḡk+1 , fθ) on the parameter space Θ, i.e.,
m(θk+1) − m(θk) = −αk ∇θ D(ḡk+1 , fθ)|θ=θk .     (10.18)
Note that since the reference distribution ḡk+1 may change shape with k, a primary difference between MARS and CE is that the gradient in (10.18) is time-varying, versus stationary in (10.16). Stationarity in general only guarantees local convergence, whereas the time-varying feature of MARS provides a viable way to ensure that the algorithm escapes from local optima, leading to global convergence. By the properties of NEFs, recursion (10.18) can be further written as
m(θk+1) − m(θk) = −αk ( [m(θk) − Egk+1[Γ(X)]] + [Egk+1[Γ(X)] − Eḡk+1[Γ(X)]] )
                = −αk ∇θ D(gk+1 , fθ)|θ=θk − αk ( Egk+1[Γ(X)] − Eḡk+1[Γ(X)] ).
This becomes a Robbins–Monro-type stochastic approximation algorithm in terms of the true gradient and a noise term due to the approximation error between gk+1 and ḡk+1. Thus, in light of the existing theory of stochastic approximation, the convergence analysis of MARS essentially boils down to inspecting whether the Boltzmann distribution gk+1 can be closely approximated by its empirical estimate ḡk+1. The following results are obtained in [20].
Theorem 10.6.7. (Global convergence of MARS) Under some appropriate conditions (see Theorem 3.1 of [20]),
limk→∞ m(θk) = Γ(x∗) w.p.1.
Theorem 10.6.8. (Asymptotic normality of MARS) Let αk = a/k^α and let the sample size increase polynomially, Nk = ck^β, for constants a > 0, c > 0, α ∈ (1/2, 1), and β > α. Under some additional conditions on {Tk},
k^{(α+β)/2} ( m(θk) − Γ(x∗) ) → N(0, Σ) in distribution as k → ∞,
where Σ is some positive definite covariance matrix. Numerical results on high-dimensional multi-extremal benchmark problems reported in [20] show that MARS may yield high-quality solutions within a modest number of function evaluations and provide superior performance over some of the existing algorithms.
10.7 Conclusions
We reviewed several recent contributions to model-based methods for global optimization, including algorithms and convergence results for model reference adaptive search, the particle-filtering approach, the evolutionary games approach, and the stochastic approximation gradient approach. These approaches analyze model-based methods from different perspectives, providing useful tools to explore properties of the updating mechanism of probabilistic models and to facilitate proofs of convergence of model-based algorithms.
Acknowledgements This work was supported in part by the National Science Foundation (NSF) under Grants CNS-0926194, CMMI-0856256, CMMI-0900332, CMMI-1130273, CMMI-1130761, EECS-0901543, and by the Air Force Office of Scientific Research (AFOSR) under Grant FA9550-10-1-0340.
References 1. Arulampalam, S., Maskell, S., Gordon, N.J., Clapp, T.: A tutorial on particle filters for on-line non-linear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing 50(2), 174–188 (2002) 2. Benaim, M.: A dynamical system approach to stochastic approximations. SIAM Journal on Control and Optimization 34, 437–472 (1996) 3. Billingsley, P.: Convergence of Probability Measures. John Wiley & Sons, Inc., New York (1999) 4. Borkar, V.: Stochastic approximation: a dynamical systems viewpoint. Cambridge University Press; New Delhi: Hindustan Book Agency (2008) 5. Budhiraja, A., Chen, L., Lee, C.: A survey of numerical methods for nonlinear filtering problems. Physica D: Nonlinear Phenomena 230, 27–36 (2007) 6. Chopin, N.: Central limit theorem for sequential Monte Carlo and its applications to Bayesian inference. The Annals of Statistics 32(6), 2385–2411 (2004) 7. Costa, A., Jones, O., Kroese, D.: Convergence properties of the cross-entropy method for discrete optimization. Operations Research Letters (35), 573–580 (2007) 8. Davis, M.H.A., Marcus, S.I.: An introduction to nonlinear filtering. The Mathematics of Filtering and Identification and Applications. Amsterdam, The Netherlands, Reidel (1981) 9. Dorigo, M., Gambardella, L.M.: Ant colony system: A cooperative learning approach to the traveling salesman problem. IEEE Trans. on Evolutionary Computation 1, 53–66 (1997) 10. Doucet, A., deFreitas, J.F.G., Gordon, N.J. (eds.): Sequential Monte Carlo Methods In Practice. Springer, New York (2001) 11. Doucet, A., Johansen, A.M.: A tutorial on particle filtering and smoothing: Fifteen years later. Handbook of Nonlinear Filtering. Cambridge University Press, Cambridge (2009) 12. Eiben, A., Smith, J.: Introduction to Evolutionary Computing. Natural Computing Series, Springer (2003) 13. Gilks, W., Berzuini, C.: Following a moving target - Monte Carlo inference for dynamic Bayesian models. Journal of the Royal Statistical Society 63(1), 127–146 (2001) 14. Gland, F.L., Oudjane, N.: Stability and uniform approximation of nonlinear filters using the Hilbert metric and application to particle filter. The Annals of Applied Probability 14(1), 144– 187 (2004) 15. Glover, F.W.: Tabu search: A tutorial. Interfaces 20, 74–94 (1990)
16. Goldberg, D.E.: Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Boston, MA (1989) 17. Homem-De-Mello, T.: A study on the cross-entropy method for rare-event probability estimation. INFORMS Journal on Computing 19, 381–394 (2007) 18. Hu, J., Fu, M.C., Marcus, S.I.: A model reference adaptive search method for global optimization. Operations Research 55(3), 549–568 (2007) 19. Hu, J., Hu, P.: On the performance of the cross-entropy method. In: Proceedings of the 2009 Winter Simulation Conference, pp. 459–468 (2009) 20. Hu, J., Hu, P.: Annealing adaptive search, cross-entropy, and stochastic approximation in global optimization. Naval Research Logistics 58, 457–477 (2011) 21. Hu, J., Hu, P., Chang, H.: A stochastic approximation framework for a class of randomized optimization algorithms. IEEE Transactions on Automatic Control (forthcoming) (2012) 22. Jacobson, S., Sullivan, K., Johnson, A.: Discrete manufacturing process design optimization using computer simulation and generalized hill climbing algorithms. Engineering Optimization 31, 247–260 (1998) 23. Johnson, A., Jacobson, S.: A class of convergent generalized hill climbing algorithms. Applied Mathematics and Computation 125, 359–373 (2001) 24. Kennedy, J., Eberhart, R.: Particle swarm optimization. In Proceedings of IEEE International Conference on Neural Networks. IEEE Press, Piscataway, NJ, pp. 1942–1948 (1995) 25. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by simulated annealing. Science 220, 671–680 (1983) 26. Kushner, H.J., Clark, D.S.: Stochastic Approximation Methods for Constrained and Unconstrained Systems and Applications. Springer-Verlag, New York (1978) 27. Larra˜naga, P., Lozano, J. (eds.): Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Kluwer Academic Publisher, Boston, MA (2002) 28. Musso, C., Oudjane, N., Gland, F.L.: Sequential Monte Carlo Methods in Practice. SpringerVerlag, New York (2001) 29. Oechssler, J., Riedel, F.: On the dynamics foundation of evolutionary stability in continuous models. Journal of Economic Theory 107, 223–252 (2002) 30. Robbins, H., Monro, S.: A stochastic approximation method. Annals of Mathematical Statistics 22, 400–407 (1951) 31. Romeijn, H., Smith, R.: Simulated annealing and adaptive search in global optimization. Probability in the Engineering and Informational Sciences 8, 571–590 (1994) 32. Rubinstein, R.Y.: Optimization of computer simulation models with rare events. European Journal of Operations Research 99, 89–112 (1997) 33. Rubinstein, R.Y.: The cross-entropy method for combinatorial and continuous optimization. Methodology and Computing in Applied Probability 2, 127–190 (1999) 34. Rubinstein, R.Y., Kroese, D.P.: The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation, and Machine Learning. Springer, New York, NY (2004) 35. Shi, L., Olafsson, S.: Nested partitions method for global optimization. Operations Research 48(3), 390–407 (2000) 36. Srinivas, M., Patnaik, L.M.: Genetic algorithms: A survey. IEEE Computer 27, 17–26 (1994) 37. Wang Y., Fu, M.C., Marcus, S.I.: Model-based evolutionary optimization. In Proceedings of the 2010 Winter Simulation Conference. IEEE Press, Piscataway, NJ, pp. 1199–1210 (2010) 38. Wang, Y., Fu, M.C., Marcus, S.I.: An evolutionary game approach for model-based optimization. Working paper (2011) 39. Wolpert, D.H. Finding bounded rational equilibria part i: Iterative focusing. 
In Proceedings of the Eleventh International Symposium on Dynamic Games and Applications, T. Vincent (Editor), Tucson AZ, USA (2004) 40. Zabinsky, Z., Smith, R., McDonald, J., Romeijn, H., Kaufman, D.: Improving hit-and-run for global optimization. Journal of Global Optimization 3, 171–192 (1993) 41. Zabinsky, Z.B.: Stochastic Adaptive Search for Global Optimization. Kluwer, The Netherlands (2003)
42. Zhang, Q., Mühlenbein, H.: On the convergence of a class of estimation of distribution algorithm. IEEE Trans. on Evolutionary Computation 8, 127–136 (2004) 43. Zhou, E., Fu, M.C., Marcus, S.I.: A particle filtering framework for randomized optimization algorithms. In: Proceedings of the 2008 Winter Simulation Conference. IEEE Press, Piscataway, NJ, pp. 647–654 (2008) 44. Zhou, E., Fu, M.C., Marcus, S.I.: Solving continuous-state POMDPs via density projection. IEEE Transactions on Automatic Control 55(5), 1101–1116 (2010) 45. Zhou, E., Fu, M.C., Marcus, S.I.: Particle filtering framework for a class of randomized optimization algorithms. Under review (2012) 46. Zlochin, M., Birattari, M., Meuleau, N., Dorigo, M.: Model-based search for combinatorial optimization: A critical survey. Annals of Operations Research 131, 373–395 (2004)
Chapter 11
Constrained Optimality for First Passage Criteria in Semi-Markov Decision Processes Yonghui Huang and Xianping Guo
11.1 Introduction
In the field of Markov decision problems (MDPs), the control horizon is usually a fixed finite interval [0, T] or the infinite interval [0, +∞). In many real applications, however, the control horizon may be a random duration [0, τ], where the terminal time τ is a random variable at which the state of the controlled system changes critically and control beyond τ may no longer be meaningful or necessary. For example, in insurance systems [27], control after the time when the company is bankrupt becomes unnecessary. Therefore, it makes better sense to consider the problem on [0, τ], where τ represents the bankruptcy time of the company. Such situations motivate first passage problems in MDPs [13, 15, 19, 21, 22, 28], in which one generally aims at maximizing/minimizing the expected reward/cost over a first passage time to some target set.
This chapter is devoted to studying constrained optimality for first passage criteria, for which the dynamics of the system are described by semi-Markov decision processes (SMDPs). The state space is assumed to be denumerable, while the action set is general. Both reward and cost rates are possibly unbounded. A key feature of our model is that the discount rate is state-action dependent, and furthermore, the undiscounted case is allowed. This feature makes our model more general, since a state-action-dependent discount rate captures practical cases such as the interest rate in economic and financial systems [2, 9, 17, 23, 26], which can be adjusted according to the real circumstances. We aim to maximize the expected reward obtained during a first passage time to some target set, subject to the constraint that the associated expected cost over this first passage time does not exceed a given bound. An interesting special case is that in which the reward rates are uniformly
equal to one, which corresponds to a stochastic time optimal control problem with a target set; see Remark 11.2.4(d) for details.
Previously, Beutler and Ross [3] consider constrained SMDPs with the long-run average criteria. They suppose that the state space of the SMDP is finite and the action space is a compact metric space. A Lagrange multiplier formulation involving a dynamic programming equation is utilized to relate the constrained optimization to an unconstrained optimization parametrized by the multiplier. This approach leads to a proof of the existence of a semi-simple optimal constrained policy; that is, there is at most one state for which the action is randomized between two possibilities, and at all other states an action is uniquely chosen. Feinberg [4] further investigates constrained average reward SMDPs with finite state and action sets, developing a technique of state-action renewal intensities and providing linear programming algorithms for the computation of optimal policies. On the other hand, Feinberg [5] deals with constrained infinite horizon discounted SMDPs. Compared with the existing works above, however, our main interest in this chapter is to analyze the constrained optimality for first passage criteria in SMDPs, which, to the best of our knowledge, is an issue not yet explored.
To obtain the existence of a constrained first passage optimal policy, we first give suitable conditions and then employ the so-called Lagrange multiplier technique to analyze the constrained control problem. Based on the Lagrange multiplier technique, we transform the constrained control problem into an unconstrained one, prove that a constrained optimal policy exists, and show that the constrained optimal policy randomizes between two stationary policies differing in at most one state.
The rest of this chapter is organized as follows. In Sect. 11.2, we formulate the control model, followed by the optimality conditions and the main results on the existence of constrained optimal policies. In Sect. 11.3, some technical preliminaries are given, and the proof of the main result is presented in Sect. 11.4.
11.2 The Control Model
The model of constrained SMDPs considered in this chapter is specified by the eight objects
{E, B, (A(i) ⊂ A, i ∈ E), Q(·, · | i, a), r(i, a), c(i, a), α(i, a), γ},
(11.1)
where E is the state space, a denumerable set; B ⊂ E is the given target set, such as the set of all bad states or of good states of a system; A is the action space, a Borel space endowed with the Borel σ -field A ; and A(i) ∈ A is the set of admissible actions at state i ∈ E. The transition mechanism of the SMDPs is defined by the semi-Markov kernel Q(·, ·|i, a) on R+ × E given K, where R+ = [0, +∞), and K = {(i, a) | i ∈ E, a ∈ A(i)} is the set of feasible state-action pairs. It is assumed that (1) Q(·, j|i, a) (for any fixed j ∈ E and (i, a) ∈ K) is a nondecreasing, right continuous real function on R+ such that Q(0, j|i, a) = 0; (2) Q(t, ·|·, ·) (for each fixed t ∈ R+ ) is a sub-stochastic kernel on E given K; and (3) P(·|·, ·) := Q(∞, ·|·, ·) is a stochastic kernel on E given K. If action a ∈ A(i) is selected in state i, then Q(t, j | i, a) is the
joint probability that the sojourn time in state i is not greater than t ∈ R+ , and the next state is j. Moreover, r(i, a) and c(i, a) in (11.1) denote the reward and cost rate functions on K valued in R = (−∞, +∞), respectively, which are both assumed to be measurable on A(i) for each fixed i ∈ E. In addition, α (i, a) represents the discount rate, which is a measurable function from K to R+ . Finally, γ is a given constraint constant. Remark 11.2.1. Compared with the models of the standard constrained discounted and average criteria [3–5], in this model (11.1), we introduce a target set B ⊂ E of the controlled system, and furthermore, the discount rate α (i, a) here is state-action dependent and may be equal to zero (i.e., the undiscounted case is allowed). To state the constrained SMDPs we are concerned with, we need to introduce the classes of policies. For each n ≥ 0, let Hn be the family of admissible histories up to the nth jump (decision epoch), that is, Hn = (R+ × K)n × (R+ × E), for n = 0, 1, . . .. Definition 11.2.1. A randomized history-dependent policy, or simply a policy, is a sequence π = {πn , n ≥ 0} of stochastic kernels πn on A given Hn satisfying
πn (A(in ) | hn ) = 1 ∀ hn = (t0 , i0 , a0 , . . . ,tn−1 , in−1 , an−1 ,tn , in ) ∈ Hn ,
n = 0, 1, . . . .
The class of all policies is denoted by Π. To distinguish the subclasses of Π, we let Φ be the family of all stochastic kernels ϕ on A given E such that ϕ(A(i) | i) = 1 for all i ∈ E, and F the set of all functions f : E → A such that f(i) is in A(i) for every i ∈ E. A policy π = {πn} ∈ Π is said to be randomized Markov if there is a sequence {ϕn} of ϕn ∈ Φ such that πn(· | hn) = ϕn(· | in) for every hn ∈ Hn and n ≥ 0. We denote such a policy by π = {ϕn}. A randomized Markov policy π = {ϕn} is said to be randomized stationary if every ϕn is independent of n. In this case, we write π = {ϕ, ϕ, . . .} as ϕ for simplicity. Further, a randomized Markov policy π = {ϕn} is said to be deterministic Markov if there is a sequence {fn} of fn ∈ F such that ϕn(· | i) is the Dirac measure at fn(i) for all i ∈ E and n ≥ 0. We write such a policy as π = {fn}. In particular, a deterministic Markov policy π = {fn} is said to be (deterministic) stationary if the fn are all independent of n. Similarly, we write π = {f, f, . . .} as f for simplicity. We denote by ΠRM, ΠRS, ΠDM, and ΠDS the families of all randomized Markov, randomized stationary, deterministic Markov, and stationary policies, respectively. Obviously, Φ = ΠRS ⊂ ΠRM ⊂ Π and F = ΠDS ⊂ ΠDM ⊂ Π.
Let P(E) denote the set of all probability measures on E. For each (s, μ) ∈ R+ × P(E) and π ∈ Π, by the well-known Tulcea's theorem ([10, Proposition C.10]), there exist a unique probability space (Ω, F, P^π_{(s,μ)}) and a stochastic process {Tn, Jn, An, n ≥ 0} such that, for each i, j ∈ E, t ∈ R+, C ∈ A, and n ≥ 0,
P^π_{(s,μ)}(T0 = s, J0 = i) = μ(i),     (11.2)
P^π_{(s,μ)}(An ∈ C | hn) = πn(C | hn),     (11.3)
P^π_{(s,μ)}(Tn+1 − Tn ≤ t, Jn+1 = j | hn, an) = Q(t, j | in, an),     (11.4)
where Tn, Jn, and An denote the nth decision epoch, the state, and the action chosen at the nth decision epoch, respectively. The expectation operator with respect to P^π_{(s,μ)} is denoted by E^π_{(s,μ)}. In particular, if μ is the Dirac measure δi(·) concentrated at some state i ∈ E, we write P^π_{(s,μ)} and E^π_{(s,μ)} as P^π_{(s,i)} and E^π_{(s,i)}, respectively. For simplicity, P^π_{(0,μ)} and E^π_{(0,μ)} are denoted by P^π_μ and E^π_μ, respectively. Without loss of generality, in the following we always set the initial decision epoch T0 = 0 and omit it.
Remark 11.2.2. (a) The construction of the probability space (Ω, F, P^π_{(s,μ)}) and the above properties (11.2)–(11.4) follow from those in Limnios and Oprisan [18, p. 33] and Puterman [24, pp. 534–535]. (b) Let X0 := 0 and Xn := Tn − Tn−1 (n ≥ 1) denote the sojourn times between decision epochs (jumps). Then the stochastic process {Tn, Jn, An, n ≥ 0} may be rewritten as {Xn, Jn, An, n ≥ 0}.
To avoid the possibility of an infinite number of decision epochs within finite time, we make the following assumption, which says that the system does not have accumulation points.
Assumption 11.2.1 For all μ ∈ P(E) and π ∈ Π, P^π_μ({limn→∞ Tn = ∞}) = 1.
To verify Assumption 11.2.1, we can use the sufficient condition below.
Condition 11.2.2 There exist constants δ > 0 and ε > 0 such that Q(δ, E | i, a) ≤ 1 − ε for all (i, a) ∈ K.
Remark 11.2.3. In fact, Condition 11.2.2 is the standard regularity condition widely used in SMDPs [5, 16, 20, 24, 25], which implies Assumption 11.2.1 above.
Under Assumption 11.2.1, we define an underlying continuous-time state-action process {Z(t), W(t), t ∈ R+} corresponding to the stochastic process {Tn, Jn, An} by
Z(t) = Jn, W(t) = An, for Tn ≤ t < Tn+1, t ∈ R+ and n ≥ 0.
Definition 11.2.2. The stochastic process {Z(t), W(t)} is called a (continuous-time) SMDP.
For the target set B ⊂ E, we consider the random variable
τB := inf{t ≥ 0 | Z(t) ∈ B} (with inf ∅ := ∞),
which is the first passage time of the process {Z(t), t ∈ R+} into the set B. Now, fix an initial distribution μ ∈ P(E). For each π ∈ Π, the expected first passage reward and cost criteria are defined as follows:
Vr(μ, π) := E^π_μ [ ∫_0^{τB} e^{−∫_0^t α(Z(u),W(u))du} r(Z(t), W(t)) dt ],     (11.5)
Vc(μ, π) := E^π_μ [ ∫_0^{τB} e^{−∫_0^t α(Z(u),W(u))du} c(Z(t), W(t)) dt ].     (11.6)
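As an illustration of the criteria (11.5) and (11.6) (a hypothetical toy instance, not an example from the chapter), the expected first passage reward of a fixed stationary policy can be estimated by Monte Carlo simulation of the embedded chain with exponential sojourn times; the kernel, rates, and policy below are illustrative assumptions only.

import numpy as np

rng = np.random.default_rng(5)
states = [0, 1, 2, 3]
B = {3}                                 # target (e.g., failure) set
P = {0: [0.0, 0.7, 0.2, 0.1],           # embedded transition law P(j | i, f(i)) under a fixed stationary policy f
     1: [0.3, 0.0, 0.5, 0.2],
     2: [0.1, 0.4, 0.0, 0.5]}
lam = {0: 1.0, 1: 2.0, 2: 0.5}          # sojourn time in state i ~ Exp(lam[i]) (one choice of semi-Markov kernel)
r = {0: 1.0, 1: 2.0, 2: 0.5}            # reward rates r(i, f(i))
alpha = {0: 0.1, 1: 0.05, 2: 0.2}       # state-dependent discount rates alpha(i, f(i))

def one_run(i):
    total, disc = 0.0, 1.0
    while i not in B:                           # run until the first passage time tau_B
        x = rng.exponential(1.0 / lam[i])       # sojourn time in state i
        a = alpha[i]
        # contribution of this sojourn: accumulated discount times integral of e^{-a t} r over [0, x]
        total += disc * r[i] * ((1 - np.exp(-a * x)) / a if a > 0 else x)
        disc *= np.exp(-a * x)
        i = rng.choice(states, p=P[i])
    return total

est = np.mean([one_run(0) for _ in range(5000)])    # Monte Carlo estimate of V_r(delta_0, f)
print(est)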
To introduce the constrained problem, for the constraint constant γ in (11.1), let U := {π ∈ Π | Vc(μ, π) ≤ γ} be the set of “constrained” policies. We assume that U ≠ ∅ throughout the following. Then the optimization problem we are interested in is to maximize the expected first passage reward Vr(μ, π) over the set U, and our goal is to find a constrained optimal policy as defined below.
Definition 11.2.3. A policy π∗ ∈ U is called constrained optimal if Vr(μ, π∗) = supπ∈U Vr(μ, π).
Remark 11.2.4. (a) It is worthwhile to point out that the expected first passage reward criterion Vr(μ, π) defined in (11.5) is different from the usual discounted reward criteria [11, 12, 24] and the total reward criteria without discount [6, 11, 24]. In fact, the former is concerned with the performance of the system during a first passage time to some target set, while the latter are concerned with the performance of the system over an infinite horizon. However, if the target set B = ∅ (and thus τB ≡ ∞) and, furthermore, the discount factor α(i, a) is state-action independent (say, α(i, a) ≡ α), then the expected first passage reward criterion (11.5) above reduces directly to the standard infinite horizon expected discounted reward criteria or expected total reward criteria [6, 11, 12, 14, 24].
(b) Note that the case without discount, that is, α(i, a) ≡ 0, is allowed in the context of this chapter; see Remark 11.2.5 for further details.
(c) When the constraint constant γ in (11.1) is sufficiently large so that U = Π, the constrained first passage optimization problem (recall Definition 11.2.3) reduces to the usual unconstrained first passage optimization problems [13, 15, 19, 21, 22, 28].
(d) In real situations, the target set B usually represents the set of failure states of a system, and thus τB denotes the working life (functioning life) of the system. Therefore, our aim is to maximize the expected reward Vr(μ, π) obtained before the system fails, subject to the associated cost Vc(μ, π) incurred before the failure of the system not exceeding some constraint constant γ. In particular, if the reward rate r(i, a) ≡ 1 and the discount factor α(i, a) ≡ 0, our aim reduces to maximizing the expected working life of the system, subject to the associated cost Vc(μ, π) incurred before the failure of the system not exceeding some constraint constant γ.
To obtain the existence of a constrained optimal policy, we need several sets of conditions.
Assumption 11.2.3 There exist constants M > 0, 0 < β < 1, and a weight function w ≥ 1 on E such that for every i ∈ Bc := E − B,
(a) supa∈A(i) |r̃(i, a)| ≤ Mw(i) and supa∈A(i) |c̃(i, a)| ≤ Mw(i), where
r̃(i, a) := r(i, a) ∫_0^∞ e^{−α(i,a)t} (1 − D(t | i, a)) dt,
c̃(i, a) := c(i, a) ∫_0^∞ e^{−α(i,a)t} (1 − D(t | i, a)) dt, and
D(t | i, a) := Q(t, E | i, a).
(b) supa∈A(i) ∑_{j∈Bc} w(j) m(j | i, a) ≤ β w(i), where m(j | i, a) := ∫_0^∞ e^{−α(i,a)t} Q(dt, j | i, a).
Remark 11.2.5. (a) In fact, Assumption 11.2.3 is a condition that ensures that the first passage criteria (11.5) and (11.6) are finite and that the dynamic programming operators are contracting; see Lemmas 11.3.1–11.3.2 below.
(b) Assumption 11.2.3(a) shows that the cost function is allowed to have neither upper nor lower bounds, while the ones in the existing works [3–5, 7, 8, 12] for the standard constrained expected discount criteria are assumed to be bounded or nonnegative (bounded below).
(c) Note that the case without discount, that is, "α(i, a) ≡ 0", is allowed in Assumption 11.2.3. In this case, Assumption 11.2.3(b) reduces to the existence of a constant 0 < β < 1 such that
supa∈A(i) ∑_{j∈Bc} w(j) P(j | i, a) ≤ β w(i)  ∀ i ∈ Bc  (with P(j | i, a) := Q(∞, j | i, a)),     (11.7)
which can still be verified. This is because the restrictions in Assumption 11.2.3(b) are imposed on the data of the set Bc rather than on the entire space E. However, if the restrictions in Assumption 11.2.3(b) were imposed on the data of the entire space E, that is, if there existed a constant 0 < β < 1 such that
supa∈A(i) ∑_{j∈E} w(j) P(j | i, a) ≤ β w(i)  ∀ i ∈ E,     (11.8)
then (11.8) would fail to hold. Indeed, by taking infi w(i) on the two sides of (11.8), we can conclude from (11.8) that "β ≥ 1", which contradicts "0 < β < 1".
Assumption 11.2.4 (a) For each i ∈ Bc, A(i) is compact. (b) The functions r̃(i, a), c̃(i, a), and m(j | i, a) defined in Assumption 11.2.3 are continuous in a ∈ A(i) for each fixed i, j ∈ Bc. (c) The function ∑_{j∈Bc} w(j) m(j | i, a) is continuous in a ∈ A(i), with w as in Assumption 11.2.3.
Remark 11.2.6. Assumption 11.2.4 consists of compactness-continuity conditions for the first passage criteria, similar to the standard compactness-continuity conditions for discount and average criteria; see, for instance, Beutler and Ross [3] and Guo and Hernández-Lerma [7, 8]. The difference between them lies in that the former only imposes restrictions on the data of the set Bc, while the latter focus on the data of the entire space E.
Assumption 11.2.5 (a) ∑_{j∈Bc} w(j)μ(j) < ∞. (b) U0 := {π ∈ Π | Vc(μ, π) < γ} ≠ ∅.
Remark 11.2.7. (a) Assumption 11.2.5(a) is a condition on the “tails” of the initial distribution μ, whereas Assumption 11.2.5(b) is a Slater-like hypothesis, typical for constrained problems; see, for instance, Beutler and Ross [3], Guo and Hernández-Lerma [7, 8], and Zhang and Guo [29]. (b) It should be noted that the conditions in Assumptions 11.2.3–11.2.5 are all imposed on the data of the set Bc rather than the entire space E and thus can be fulfilled in greater generality.
Our main result is stated as follows.
Theorem 11.2.1. Suppose that Assumptions 11.2.1–11.2.5 hold. Then there exists a constrained optimal policy that is either a stationary policy or a randomized stationary policy mixing two stationary policies which differ in at most one state; that is, there exist two stationary policies f1, f2, a state i∗ ∈ Bc, and a number p ∈ [0, 1] such that f1(i) = f2(i) for all i ≠ i∗ and, in addition, the randomized stationary policy ϕp(· | i) is constrained optimal, where
ϕp(a | i) = { p, if a = f1(i∗);  1 − p, if a = f2(i∗);  1, if a = f1(i) = f2(i), i ≠ i∗. }     (11.9)
Proof. See Sect. 11.4.
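To make the form (11.9) concrete, the following small sketch (purely illustrative; the policies, state, and mixing probability are hypothetical and not taken from the chapter) constructs and samples from the randomized stationary policy ϕp that mixes two stationary policies differing only at the state i∗:

import numpy as np

def make_phi_p(f1, f2, i_star, p, rng=np.random.default_rng(6)):
    # Randomized stationary policy of the form (11.9): at state i_star choose f1(i_star)
    # with probability p and f2(i_star) with probability 1 - p; elsewhere f1 and f2 agree
    # and the common action is chosen deterministically.
    def phi(i):
        if i == i_star:
            return f1(i) if rng.random() < p else f2(i)
        return f1(i)                      # = f2(i) for i != i_star
    return phi

# Illustrative stationary policies on a denumerable state space (actions encoded as integers).
f1 = lambda i: 0
f2 = lambda i: 0 if i != 5 else 1         # differs from f1 only at the state i_star = 5
policy = make_phi_p(f1, f2, i_star=5, p=0.3)
print([policy(5) for _ in range(10)])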
11.3 Technical Preliminaries
This section provides some technical preliminaries necessary for the proof of Theorem 11.2.1 in Sect. 11.4. To begin with, we define the w-norm of every real-valued function u on E by
‖u‖w := supi∈E |u(i)|/w(i),
where w is the so-called weight function on E as in Assumption 11.2.3. Let Bw(E) := {u : ‖u‖w < ∞} be the space of w-bounded functions on E.
Lemma 11.3.1. Suppose that Assumptions 11.2.1 and 11.2.3 hold. Then:
(a) For each i ∈ E and π ∈ Π, |Vr(i, π)| ≤ Mw(i)/(1 − β) and |Vc(i, π)| ≤ Mw(i)/(1 − β). Hence, Vr(·, π) ∈ Bw(E) and Vc(·, π) ∈ Bw(E).
(b) For all i ∈ E, π ∈ Π, and u ∈ Bw(E),
limn→∞ E^π_i [ ∏_{k=0}^{n−1} e^{−α(Jk,Ak)Xk+1} 1{J0∈Bc,...,Jn∈Bc} u(Jn) ] = 0,
where 1D is the indicator function on a set D. Proof. (a) By the definition of Vr (i, π ), we see that Vr (i, π ) can be expressed as below: Vr (i, π ) =
Eiπ
= Eiπ = Eiπ = Eiπ
e
− 0t α (Z(u),W (u))du
0
∞
t
0 α (Z(u),W (u))du
e−
0
∞
∑
Tn+1
e−
r(Z(t),W (t))dt
1{τB >t} r(Z(t),W (t))dt
t
0 α (Z(u),W (u))du
n=0 Tn ∞
∑
Xn+1
e−
Tn +t 0
dt1{J0 ∈Bc ,...,Jn ∈Bc } r(Jn , An )
α (Z(u),W (u))du
n=0 0
=
τB
∞
Tn
∑ e− 0
Eiπ
α (Z(u),W (u))du
1{J0 ∈Bc ,...,Jn ∈Bc } r(Jn , An )
n=0
Xn+1
= Eiπ = Eiπ
e
dt
∞ n−1
∑
∏ e−α (Jk ,Ak )Xk+1 1{J0 ∈Bc,...,Jn ∈Bc } r(Jn , An )
Xn+1
n=0 k=0
0
n−1
∞
k=0
Xn+1 0
e−α (Jn ,An )t dt|X0 , J0 , A0 , . . . , Xn , Jn , An
∞ n−1
∑ ∏ e−α (Jk ,Ak )Xk+1 1{J0 ∈Bc,...,Jn ∈Bc } r(Jn , An )
n=0 k=0
×Eiπ = Eiπ
e−α (Jn ,An )t dt
∑ Eiπ ∏ e−α (Jk ,Ak )Xk+1 1{J0 ∈Bc ,...,Jn ∈Bc } r(Jn , An) ×
= Eiπ
− TTnn +t α (Z(u),W (u))du
0
n=0
dt1{J0 ∈Bc ,...,Jn ∈Bc } r(Jn , An )
∞ n−1
Xn+1 0
e−α (Jn ,An )t dt|X0 , J0 , A0 , . . . , Xn , Jn , An
∑ ∏ e−α (Jk ,Ak )Xk+1 1{J0 ∈Bc,...,Jn ∈Bc } r(Jn , An )
n=0 k=0
× Eiπ
=
∞
e
−α (Jn ,An )t
0
∞ n−1
∑ ∏e
(1 − D(t | Jn , An ))dt
−α (Jk ,Ak )Xk+1
1{J0 ∈Bc ,...,Jn ∈Bc } r(Jn , An ) ,
(11.10)
n=0 k=0
where the third equality follows from Assumption 11.2.1 and the ninth equality is due to the property (11.4). We now show that for each n = 0, 1, . . ., Eiπ
n−1
∏e
−α (Jk ,Ak )Xk+1
1{J0 ∈Bc ,...,Jn ∈Bc } w(Jn ) ≤ β n w(i).
(11.11)
k=0
Indeed, (11.11) is trivial for n = 0. Now, for n ≥ 1, it follows from the property (11.4) and Assumption 11.2.3(b) that Eiπ
−α (Jk ,Ak )Xk+1 c c e 1 w(J ) n {J0 ∈B ,...,Jn ∈B } ∏
n−1 k=0
=
Eiπ
n−1
∏ e−α (Jk ,Ak )Xk+1 1{J0 ∈Bc ,...,Jn ∈Bc } w(Jn )
Eiπ
k=0
| T0 , J0 , A0 , . . . , Tn−1 , Jn−1 , An−1 ]] n−2 = Eiπ ∏ e−α (Jk ,Ak )Xk+1 1{J0 ∈Bc ,...,Jn−1 ∈Bc } Eiπ e−α (Jn−1 ,An−1 )Xn 1{Jn ∈Bc } w(Jn ) k=0
| T0 , J0 , A0 , . . . , Tn−1 , Jn−1 , An−1 =
Eiπ
n−2
∏ e−α (Jk ,Ak )Xk+1 1{J0 ∈Bc ,...,Jn−1 ∈Bc}
k=0
× = Eiπ
∑
∞
j∈Bc 0
e−α (Jn−1 ,An−1 )t w( j)Q(dt, j | Jn−1 , An−1 )
n−2
∏ e−α (Jk ,Ak )Xk+1 1{J0 ∈Bc ,...,Jn−1 ∈Bc}
k=0
≤ β Eiπ
∑
w( j)m( j | Jn−1 , An−1 )
j∈Bc
−α (Jk ,Ak )Xk+1 c c e 1 w(J ) n−1 . {J0 ∈B ,...,Jn−1 ∈B } ∏
n−2 k=0
Iterating (11.12) yields (11.11).
(11.12)
190
Y. Huang and X. Guo
Moreover, observe that Assumption 11.2.3(a) and (11.11) gives Eiπ
n−1
∏e
−α (Jk ,Ak )Xk+1
1{J0 ∈Bc ,...,Jn ∈Bc } | r(Jn , An )|
k=0
≤ MEiπ
−α (Jk ,Ak )Xk+1 c c e 1 w(J ) n {J0 ∈B ,...,Jn ∈B } ∏
n−1 k=0
≤ M β w(i), n
which together with (11.10) yields |Vr (i, π )| ≤
∞
∑
Eiπ
n=0
≤
n−1
∏e
−α (Jk ,Ak )Xk+1
1{J0 ∈Bc ,...,Jn ∈Bc } | r(Jn , An )|
k=0
∞
∑ Mβ n w(i) = Mw(i)/(1 − β ).
n=0
Thus, we get sup |Vr (i, π )|/w(i) ≤ Mw(i)/(1 − β ), i∈E
which shows that Vr (·, π ) ∈ Bw (E). Similarly, the conclusion for Vc (·, π ) can be obtained. (b) Since |u(i)| ≤ uw w(i) for all i ∈ E, it follows from (11.11) above that Eiπ
n−1
∏ e−α (Jk ,Ak )Xk+1 1{J0 ∈Bc ,...,Jn ∈Bc }|u(Jn )|
k=0
≤ uw Eiπ
−α (Jk ,Ak )Xk+1 c c e 1 w(J ) ≤ uwβ n w(i), n {J0 ∈B ,...,Jn ∈B } ∏
n−1 k=0
and so part (b) follows.
2
Remark 11.3.8. In fact, Lemma 11.3.1 here for first passage criteria in SMDPs is similar to Lemma 3.1 in Huang and Guo [15]. The main difference between them is due to that the discount factor α (i, a) here is state-action dependent, and the reward rate here is possibly unbounded (while the ones in Huang and Guo [15] are not). For ϕ ∈ Φ , we define the dynamic programming operators H ϕ and H on Bw (E) as follows: for u ∈ Bw (E), if i ∈ Bc ,
11 Constrained Optimality in Semi-Markov Decision Processes
ϕ
H u(i) := a∈A(i)
r(i, a) +
∑
191
u( j)m( j | i, a) ϕ (da | i),
(11.13)
j∈Bc
Hu(i) := sup r (i, a) +
∑
u( j)m( j | i, a) ,
(11.14)
j∈Bc
a∈A(i)
and if i ∈ B, H ϕ u(i) = Hu(i) := 0. Lemma 11.3.2. Suppose that Assumptions 11.2.1 and 11.2.3 hold. Then: (a) For each ϕ ∈ Φ , Vr (·, ϕ ) is the unique solution in Bw (E) to the equation Vr (i, ϕ ) = H ϕ Vr (i, ϕ ) ∀i ∈ E. (b) If, in addition, Assumption 11.2.4 also holds, Vr∗ (i) := supπ ∈Π Vr (i, π ) is the unique solution in Bw (E) to equation Vr∗ (i) = HVr∗ (i) ∀i ∈ E. ∗
Moreover, there exists an f ∗ ∈ F such that Vr∗ (i) = H f Vr∗ (i), and such a policy f ∗ ∈ F satisfies Vr (i, f ∗ ) = Vr∗ (i) for every i ∈ E. Proof. (a) From Lemma 11.3.1, it is clear that Vr (·, ϕ ) ∈ Bw (E). We now establish the equation Vr (i, ϕ ) = H ϕ Vr (i, ϕ ). It is obviously true when i ∈ B, and for i ∈ Bc , by (11.10), we have Vr (i, ϕ ) ϕ
= Ei
∞ n−1
∑
∏ e−α (Jk ,Ak )Xk+1 1{J0 ∈Bc ,...,Jn ∈Bc}r(Jn , An )
n=0 k=0
=
ϕ Ei ϕ
∞ n−1 ϕ −α (Jk ,Ak )Xk+1 1{J0 ∈Bc ,...,Jn ∈Bc } r(Jn , An ) | T0 , J0 , A0 , T1 , J1 Ei ∑ ∏ e n=0 k=0
ϕ
r(J0 , A0 ) + e−α (J0,A0 )X1 1{J0 ∈Bc ,J1 ∈Bc } Ei = Ei 1{J0 ∈Bc }
∞ n−1
∑ ∏ e−α (Jk ,Ak )Xk+1
n=1 k=1
1{J2 ∈Bc ,...,Jn ∈Bc } r(Jn , An ) | T0 , J0 , A0 , T1 , J1
=
a∈A(i)
ϕ (da | i) r(i, a)+
∑c
j∈B
∞
e 0
−α (i,a)t
Q(dt, j
ϕ | i, a)Ei
∞ n−1
∑ ∏ e−α (Jk ,Ak )Xk+1
n=1 k=1
1{J0 ∈Bc ,...,Jn ∈Bc } r(Jn , An ) | T0 = 0, J0 = i, A0 = a, T0 = t, J1 = j
192
Y. Huang and X. Guo
=
a∈A(i)
ϕ (da | i) r(i, a) +
ϕ
∞ n−1
∑c m( j | i, a)E j ∑ ∏ e−α (Jk ,Ak )Xk+1
j∈B
n=0 k=0
1{J0 ∈Bc ,...,Jn ∈Bc } r(Jn , An )
=
a∈A(i)
ϕ (da | i) r(i, a) +
∑
m( j | i, a)Vr ( j, ϕ ) ,
j∈Bc
where the fifth equality is due to the properties (11.2)–(11.4) and the fact that the policy ϕ is Markov. Hence, we obtain that Vr(i, ϕ) = H^ϕ Vr(i, ϕ), i ∈ E.

To complete the proof, we need only show that H^ϕ is a contraction from Bw(E) to Bw(E), and thus H^ϕ has a unique fixed point in Bw(E). Indeed, for an arbitrary function u ∈ Bw(E), by Assumption 11.2.3 it is clear that

|H^ϕ u(i)| = | ∫_{A(i)} [ r(i, a) + ∑_{j∈B^c} u(j) m(j | i, a) ] ϕ(da | i) |
≤ ∫_{A(i)} [ |r(i, a)| + ∑_{j∈B^c} |u(j)| m(j | i, a) ] ϕ(da | i)
≤ ∫_{A(i)} [ M w(i) + ‖u‖_w β w(i) ] ϕ(da | i)
= (M + β ‖u‖_w) w(i) ∀i ∈ B^c,

which implies that H^ϕ u ∈ Bw(E), that is, H^ϕ maps Bw(E) to itself. Moreover, for any u, u′ ∈ Bw(E), we have

|H^ϕ u(i) − H^ϕ u′(i)| = | ∫_{A(i)} ∑_{j∈B^c} (u(j) − u′(j)) m(j | i, a) ϕ(da | i) |
≤ ∫_{A(i)} ∑_{j∈B^c} |u(j) − u′(j)| m(j | i, a) ϕ(da | i)
≤ ∫_{A(i)} ‖u − u′‖_w β w(i) ϕ(da | i)
= β ‖u − u′‖_w w(i) ∀i ∈ B^c.

Hence, ‖H^ϕ u − H^ϕ u′‖_w ≤ β ‖u − u′‖_w, and thus H^ϕ is a contraction from Bw(E) to itself. By Banach's Fixed Point Theorem, H^ϕ has a unique fixed point in Bw(E), and so the proof of part (a) is complete.
(b) Under Assumption 11.2.4, using arguments similar to the proof of part (a), H is a contraction from Bw(E) to itself, and thus, by Banach's Fixed Point Theorem, H has a unique fixed point u* in Bw(E), that is, Hu* = u*. Hence, to prove part (b), we need to show that:
(b1) Vr* ∈ Bw(E), with w-norm ‖Vr*‖_w ≤ M/(1 − β).
(b2) Vr* = u*.
In fact, (b1) is an immediate consequence of Lemma 11.3.1(a). Thus, it remains to prove (b2). To this end, we show that u* ≤ Vr* and u* ≥ Vr*, respectively. It is clear that u*(i) = Vr*(i) = 0 for every i ∈ B. Hence, in the following, we restrict our argument to the case of i ∈ B^c.

(i) This part is to show that u* ≤ Vr*. By the equality u* = Hu* and the measurable selection theorem [10, Proposition D.5, p. 182], there exists an f ∈ F such that

u*(i) = r(i, f) + ∑_{j∈B^c} u*(j) m(j | i, f) ∀i ∈ B^c.          (11.15)

Iteration of (11.15) yields

u*(i) = E_i^f [ ∑_{m=0}^{n−1} ∏_{k=0}^{m−1} e^{−α(J_k,A_k)X_{k+1}} 1_{{J_0∈B^c,…,J_m∈B^c}} r(J_m, f) ] + E_i^f [ ∏_{k=0}^{n−1} e^{−α(J_k,A_k)X_{k+1}} 1_{{J_0∈B^c,…,J_n∈B^c}} u*(J_n) ] ∀i ∈ B^c, n = 1, 2, …,

and letting n → ∞ we get, by Lemma 11.3.1(b),

u*(i) = E_i^f [ ∑_{m=0}^∞ ∏_{k=0}^{m−1} e^{−α(J_k,A_k)X_{k+1}} 1_{{J_0∈B^c,…,J_m∈B^c}} r(J_m, f) ] = Vr(i, f) ∀i ∈ B^c.
Thus, by the definition of Vr*, we see that u* ≤ Vr*.

(ii) This part is to show that u* ≥ Vr*. Note that u* = Hu* implies that

u*(i) ≥ r(i, a) + ∑_{j∈B^c} u*(j) m(j | i, a) ∀i ∈ B^c, a ∈ A(i),          (11.16)

which gives

1_{{J_n∈B^c}} u*(J_n) ≥ 1_{{J_n∈B^c}} r(J_n, A_n) + 1_{{J_n∈B^c}} ∑_{j∈B^c} u*(j) m(j | J_n, A_n) ∀n ≥ 0.          (11.17)

Hence, for any initial state i ∈ B^c and policy π ∈ Π, using properties (11.2)–(11.4) yields
1_{{J_n∈B^c}} u*(J_n) ≥ E_i^π [ 1_{{J_n∈B^c}} r(J_n, A_n) + e^{−α(J_n,A_n)X_{n+1}} 1_{{J_n∈B^c, J_{n+1}∈B^c}} u*(J_{n+1}) | T_0, J_0, A_0, …, T_n, J_n, A_n ] ∀n = 0, 1, …,

which gives

∏_{k=0}^{n−1} e^{−α(J_k,A_k)X_{k+1}} 1_{{J_0∈B^c,…,J_n∈B^c}} u*(J_n)
≥ E_i^π [ ∏_{k=0}^{n−1} e^{−α(J_k,A_k)X_{k+1}} 1_{{J_0∈B^c,…,J_n∈B^c}} r(J_n, A_n) + ∏_{k=0}^{n} e^{−α(J_k,A_k)X_{k+1}} 1_{{J_0∈B^c,…,J_{n+1}∈B^c}} u*(J_{n+1}) | T_0, J_0, A_0, …, T_n, J_n, A_n ] ∀n = 0, 1, … .

Therefore, taking the expectation E_i^π and summing over m = 0, 1, …, n − 1, we obtain

u*(i) ≥ E_i^π [ ∑_{m=0}^{n−1} ∏_{k=0}^{m−1} e^{−α(J_k,A_k)X_{k+1}} 1_{{J_0∈B^c,…,J_m∈B^c}} r(J_m, A_m) ] + E_i^π [ ∏_{k=0}^{n−1} e^{−α(J_k,A_k)X_{k+1}} 1_{{J_0∈B^c,…,J_n∈B^c}} u*(J_n) ] ∀n = 1, 2, … .

Finally, letting n → ∞ in the latter inequality and using Lemma 11.3.1(b), it follows that u*(i) ≥ Vr(i, π); since i and π were arbitrary, we conclude that u* ≥ Vr*.

Combining (i) with (ii) yields u* = Vr*, and thus Vr* = HVr*. Finally, it follows from Vr* = HVr* and the measurable selection theorem that there exists an f* ∈ F such that Vr* = H^{f*} Vr*. This fact together with part (a) implies that Vr(·, f*) = Vr*. □

Remark 11.3.9. Note that Lemma 11.3.2 also holds, with the obvious changes, for the expected first passage cost Vc.

Note that F can be written as the product space F = ∏_{i∈E} A(i). Hence, by Assumption 11.2.4(a) and Tychonoff's theorem, F is a compact metric space.

Lemma 11.3.3. Suppose that Assumptions 11.2.1–11.2.4 and 11.2.5(a) hold. Then the functions Vr(μ, f) and Vc(μ, f) are continuous in f ∈ F.

Proof. We only prove the continuity of Vr(μ, f) in f ∈ F because the other case is similar. Let fn → f as n → ∞ and fix any i ∈ E. Let v(i) := lim sup_{n→∞} Vr(i, fn).
Then, by Theorem 4.4 in [1], there exists a subsequence {Vr(i, f_{n_m})} (depending on i) of {Vr(i, fn)} such that Vr(i, f_{n_m}) → v(i) as m → ∞. Then, by Lemma 11.3.1(a), we have Vr(j, f_{n_m}) ∈ [−Mw(j)/(1 − β), Mw(j)/(1 − β)] for all j ∈ E and m ≥ 1, and so Vr(·, f_{n_m}) is in the product space ∏_{j∈E} [−Mw(j)/(1 − β), Mw(j)/(1 − β)] for each m ≥ 1. Since E is denumerable, the Tychonoff theorem shows that the space ∏_{j∈E} [−Mw(j)/(1 − β), Mw(j)/(1 − β)] is compact, and thus there exists a subsequence {Vr(·, f_{n_k})} of {Vr(·, f_{n_m})} converging to some point u in ∏_{j∈E} [−Mw(j)/(1 − β), Mw(j)/(1 − β)], that is, lim_{k→∞} Vr(j, f_{n_k}) = u(j) for all j ∈ E, which, together with fn → f and lim_{m→∞} Vr(i, f_{n_m}) = v(i), implies that

v(i) = u(i),  lim_{k→∞} Vr(j, f_{n_k}) = u(j),  and  lim_{k→∞} f_{n_k}(j) = f(j),  for all j ∈ E.          (11.18)
Moreover, by Lemma 11.3.1(a), we have

|u(j)| ≤ Mw(j)/(1 − β), for all j ∈ E,          (11.19)

which implies that u ∈ Bw(E). On the other hand, for k ≥ 1 and the given i ∈ B^c, by Lemma 11.3.2(a), we have

Vr(i, f_{n_k}) = r(i, f_{n_k}) + ∑_{j∈B^c} Vr(j, f_{n_k}) m(j | i, f_{n_k}).          (11.20)

Then, under Assumptions 11.2.3 and 11.2.4, from (11.18)–(11.20) and Lemma 8.3.7 (i.e., the Generalized Dominated Convergence Theorem) in [11], we get

u(i) = r(i, f) + ∑_{j∈B^c} u(j) m(j | i, f).          (11.21)

Thus, by Lemma 11.3.2(a) and (11.18), we conclude that

lim sup_{n→∞} Vr(i, fn) = v(i) = u(i) = Vr(i, f).          (11.22)

Similarly, we can prove that lim inf_{n→∞} Vr(i, fn) = Vr(i, f), which together with (11.22) implies that lim sup_{n→∞} Vr(i, fn) = lim inf_{n→∞} Vr(i, fn) = Vr(i, f), and so

lim_{n→∞} Vr(i, fn) = Vr(i, f).          (11.23)
Moreover, noting that Vr(i, fn) = Vr(i, f) = 0 for every i ∈ B and n ≥ 0, it follows from Assumption 11.2.5(a) and Lemma 8.3.7 in [11] again that

lim_{n→∞} Vr(μ, fn) = lim_{n→∞} ∑_{i∈E} [Vr(i, fn)] μ(i) = lim_{n→∞} ∑_{i∈B^c} [Vr(i, fn)] μ(i) = ∑_{i∈B^c} [lim_{n→∞} Vr(i, fn)] μ(i) = ∑_{i∈B^c} Vr(i, f) μ(i) = Vr(μ, f),          (11.24)

which gives the stated result: Vr(μ, fn) → Vr(μ, f), as n → ∞. □
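For readers who want to experiment numerically, the fixed-point characterization Vr(·, f) = H^f Vr(·, f) of Lemma 11.3.2(a) suggests computing Vr(·, f) for a stationary policy f by successive approximation, since H^f is a β-contraction in the w-norm. The sketch below is only illustrative and assumes a finite truncation of the state space; the arrays `m` (the discounted kernel m(j | i, f(i)) restricted to B^c) and `r` (the rewards r(i, f(i))) are hypothetical placeholders, not objects defined in the chapter.

```python
import numpy as np

def evaluate_policy(r, m, tol=1e-10, max_iter=10_000):
    """Successive approximation for V(i) = r(i) + sum_j m[i, j] V(j) on B^c.

    r : (n,) rewards r(i, f(i)) for states i in B^c (states in the target set B
        have value 0 and are simply left out of the arrays).
    m : (n, n) sub-stochastic discounted kernel m(j | i, f(i)) restricted to B^c;
        the contraction condition of Assumption 11.2.3 plays the role of a
        discount factor beta < 1.
    """
    v = np.zeros_like(r)
    for _ in range(max_iter):
        v_new = r + m @ v          # one application of the operator H^f
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v

# toy example with 3 transient states
r = np.array([1.0, 0.5, 2.0])
m = 0.9 * np.array([[0.2, 0.5, 0.3],
                    [0.1, 0.6, 0.3],
                    [0.3, 0.3, 0.4]])
print(evaluate_policy(r, m))
```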
To analyze the constrained control problem (recall Definition 11.2.3), we introduce a Lagrange multiplier λ ≥ 0 as follows. For each i ∈ E and a ∈ A(i), let

b^λ(i, a) := r(i, a) − λ c(i, a).          (11.25)

Furthermore, for each policy π ∈ Π and i ∈ E, let

V_{b^λ}(i, π) := E_i^π [ ∫_0^{τ_B} e^{−∫_0^t α(Z(u),W(u)) du} b^λ(Z(t), W(t)) dt ],          (11.26)

V_{b^λ}(μ, π) := ∑_{j∈E} V_{b^λ}(j, π) μ(j),          (11.27)

V*_{b^λ}(i) := sup_{π∈Π} V_{b^λ}(i, π),  V*_{b^λ}(μ) := sup_{π∈Π} V_{b^λ}(μ, π).          (11.28)

Remark 11.3.10. Notice that, for each i ∈ B, V_{b^λ}(i, π) = 0. Therefore, we have

V_{b^λ}(μ, π) = ∑_{j∈B^c} V_{b^λ}(j, π) μ(j),  V*_{b^λ}(μ) = ∑_{j∈B^c} V*_{b^λ}(j) μ(j).
Under Assumptions 11.2.1–11.2.4, by Lemma 11.3.2(b), we have

V*_{b^λ}(i) = 0 for i ∈ B,  and  V*_{b^λ}(i) = sup_{a∈A(i)} [ b̃^λ(i, a) + ∑_{j∈B^c} V*_{b^λ}(j) m(j | i, a) ] for i ∈ B^c,          (11.29)

where b̃^λ(i, a) := b^λ(i, a) ∫_0^∞ e^{−α(i,a)t} (1 − D(t | i, a)) dt. Moreover, for each i ∈ E, the maximum in (11.29) is realized by some a ∈ A(i), that is, the set

A*_λ(i) := A(i) for i ∈ B,  and  A*_λ(i) := { a ∈ A(i) | V*_{b^λ}(i) = b̃^λ(i, a) + ∑_{j∈B^c} V*_{b^λ}(j) m(j | i, a) } for i ∈ B^c          (11.30)

is nonempty. In other words, the following sets
F*_λ := { f ∈ F | f(i) ∈ A*_λ(i) ∀i ∈ E }          (11.31)

and

Φ_λ := { ϕ ∈ Φ | ϕ(A*_λ(i) | i) = 1 ∀i ∈ E }          (11.32)

are nonempty. The next lemma shows that Φ_λ is convex.

Lemma 11.3.4. Under Assumptions 11.2.1–11.2.4, the set Φ_λ is convex.

Proof. For each ϕ1, ϕ2 ∈ Φ_λ and p ∈ [0, 1], let
ϕ^p(· | i) := p ϕ1(· | i) + (1 − p) ϕ2(· | i), ∀i ∈ E.          (11.33)

Hence, ϕ^p(A*_λ(i) | i) = p ϕ1(A*_λ(i) | i) + (1 − p) ϕ2(A*_λ(i) | i) = 1, and so Φ_λ is convex. □

Notation. For each λ ≥ 0, we take an arbitrary but fixed policy f^λ ∈ F*_λ, and denote Vr(μ, f^λ), Vc(μ, f^λ), and V_{b^λ}(μ, f^λ) by Vr(λ), Vc(λ), and Vb(λ), respectively. By Lemma 11.3.2, we have that V_{b^λ}(i, f) = V*_{b^λ}(i) for all i ∈ E and f ∈ F*_λ. Hence, Vb(λ) = V_{b^λ}(μ, f^λ) = V*_{b^λ}(μ).

Lemma 11.3.5. If Assumptions 11.2.3–11.2.4 and 11.2.5(a) hold, then Vc(λ) is nonincreasing in λ ∈ [0, ∞).

Proof. By (11.5), (11.6), and (11.25)–(11.26), for each π ∈ Π we obtain V_{b^λ}(μ, π) = Vr(μ, π) − λ Vc(μ, π) ∀λ ≥ 0. Moreover, noting that Vb(λ) = V_{b^λ}(μ, f^λ) = V*_{b^λ}(μ) for all λ ≥ 0 and f^λ ∈ F*_λ, we have, for any h > 0,

−h Vc(λ) = V_{b^{λ+h}}(μ, f^λ) − Vb(λ) ≤ Vb(λ + h) − Vb(λ) ≤ Vb(λ + h) − V_{b^λ}(μ, f^{λ+h}) = −h Vc(λ + h),

which gives that −h Vc(λ) ≤ −h Vc(λ + h). Hence, Vc(λ) is nonincreasing in λ ∈ [0, ∞). □
Lemma 11.3.6. Suppose that Assumptions 11.2.1–11.2.4 hold. If lim_{k→∞} λk = λ, and f^{λk} ∈ F*_{λk} is such that lim_{k→∞} f^{λk} = f, then f ∈ F*_λ.
Proof. Since f^{λk} ∈ F*_{λk}, for each i ∈ B^c and π ∈ Π, we have

V*_{b^{λk}}(i) = Vr(i, f^{λk}) − λk Vc(i, f^{λk}) ≥ V_{b^{λk}}(i, π) = Vr(i, π) − λk Vc(i, π).          (11.34)

Letting k → ∞ in (11.34) and by Lemma 11.3.3, we obtain V_{b^λ}(i, f) ≥ V_{b^λ}(i, π) ∀i ∈ B^c and π ∈ Π, which together with the fact that A*_λ(i) = A(i) for each i ∈ B implies that f ∈ F*_λ. □

Under Assumptions 11.2.1–11.2.4 and 11.2.5(a), it follows from Lemma 11.3.5 that the following nonnegative constant
λ̄ := inf{λ ≥ 0 | Vc(λ) ≤ γ}          (11.35)

is well defined.

Lemma 11.3.7. Suppose that Assumptions 11.2.1–11.2.5 hold. Then the constant λ̄ in (11.35) is finite, that is, λ̄ ∈ [0, ∞).

Proof. Suppose that λ̄ = ∞. By Assumption 11.2.5(b), there exists a policy π ∈ Π such that Vc(μ, π) < γ. Let d := γ − Vc(μ, π) > 0. Then, for any λ > 0, we have

V_{b^λ}(μ, π) = Vr(μ, π) − λ Vc(μ, π) = Vr(μ, π) − λ(γ − d).          (11.36)

As λ̄ = ∞, by (11.35) and Lemma 11.3.5, we obtain Vc(λ) > γ for all λ > 0. Therefore, Vb(λ) = Vr(λ) − λ Vc(λ) < Vr(λ) − λγ. Since Vb(λ) = V*_{b^λ}(μ) ≥ V_{b^λ}(μ, π), from (11.36), we have

Vr(λ) − λγ > Vb(λ) ≥ V_{b^λ}(μ, π) = Vr(μ, π) − λ(γ − d) ∀λ > 0,          (11.37)

which gives

Vr(λ) > Vr(μ, π) + λ d ∀λ > 0.          (11.38)

On the other hand, by Lemma 11.3.1 and Assumption 11.2.5(a), we have

max{ |Vr(μ, π)|, |Vr(λ)| } ≤ M [ ∑_{j∈B^c} w(j) μ(j) ] / (1 − β) := M̃ < ∞          (11.39)

for all λ > 0. The latter inequality together with (11.38) gives that

2M̃ > λ d ∀λ > 0,          (11.40)

which is clearly a contradiction; for instance, take λ = 3M̃/d > 0. Hence, λ̄ is finite. □
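As a computational aside, since Vc(λ) is nonincreasing (Lemma 11.3.5) and λ̄ is finite (Lemma 11.3.7), λ̄ can in principle be located by a simple bisection once a routine for evaluating Vc(λ) (i.e., solving the unconstrained λ-problem and reading off the first passage cost of a policy f^λ) is available. The sketch below is only illustrative; `evaluate_Vc` is a hypothetical placeholder for such a routine and is not something defined in the chapter.

```python
def find_lambda_bar(evaluate_Vc, gamma, lam_hi=1.0, tol=1e-8):
    """Bisection for lambda_bar = inf{lam >= 0 : Vc(lam) <= gamma}.

    Relies only on Vc being nonincreasing in lam (Lemma 11.3.5) and on
    lambda_bar being finite (Lemma 11.3.7).
    """
    if evaluate_Vc(0.0) <= gamma:          # then lambda_bar = 0
        return 0.0
    while evaluate_Vc(lam_hi) > gamma:     # grow the bracket until Vc(lam_hi) <= gamma
        lam_hi *= 2.0
    lam_lo = 0.0
    while lam_hi - lam_lo > tol:
        mid = 0.5 * (lam_lo + lam_hi)
        if evaluate_Vc(mid) <= gamma:
            lam_hi = mid
        else:
            lam_lo = mid
    return lam_hi
```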
11.4 Proof of Theorem 11.2.1

In this section, we prove Theorem 11.2.1 by using the Lagrange approach and the following lemma.

Lemma 11.4.8. If there exist λ0 ≥ 0 and π* ∈ U such that Vc(μ, π*) = γ and V_{b^{λ0}}(μ, π*) = V*_{b^{λ0}}(μ), then π* is constrained optimal.

Proof. For any π ∈ U, since V_{b^{λ0}}(μ, π*) = V*_{b^{λ0}}(μ) ≥ V_{b^{λ0}}(μ, π), we have

Vr(μ, π*) − λ0 Vc(μ, π*) ≥ Vr(μ, π) − λ0 Vc(μ, π).          (11.41)

Noting that Vc(μ, π*) = γ and Vc(μ, π) ≤ γ (by π ∈ U), from (11.41) we get

Vr(μ, π*) ≥ Vr(μ, π) + λ0 (γ − Vc(μ, π)) ≥ Vr(μ, π) ∀π ∈ U,

which means that π* is constrained optimal. □
Proof of Theorem 11.2.1. By Lemma 11.3.7, the constant λ̄ ∈ [0, ∞). Thus, we shall consider the two cases: λ̄ = 0 and λ̄ > 0.

The case of λ̄ = 0: By (11.35), there exists a sequence f^{λk} ∈ F*_{λk} such that λk ↓ 0 as k → ∞. Because F is compact, we may assume that f^{λk} → f̃ without loss of generality. Thus, by Lemma 11.3.5, we have Vc(μ, f^{λk}) ≤ γ for all k ≥ 1, and then it follows from Lemma 11.3.3 that f̃ ∈ U. Moreover, for each π ∈ U, we have that Vb(λk) = V_{b^{λk}}(μ, f^{λk}) ≥ V_{b^{λk}}(μ, π). Hence, by Lemma 11.3.1(a) and (11.39),

Vr(μ, f^{λk}) − Vr(μ, π) ≥ λk [Vc(μ, f^{λk}) − Vc(μ, π)] ≥ −2 λk M̃.          (11.42)

Letting k → ∞ in (11.42), by Lemma 11.3.3, we have Vr(μ, f̃) − Vr(μ, π) ≥ 0 ∀π ∈ U, which means that f̃ is a constrained optimal stationary policy.

The case of λ̄ > 0: First, if there is some λ ∈ (0, ∞) satisfying Vc(λ) = γ, then there exists an associated f^λ ∈ F*_λ such that Vc(λ) = Vc(μ, f^λ) = γ and V*_{b^λ}(μ) = V_{b^λ}(μ, f^λ). Thus, by Lemma 11.4.8, f^λ is a constrained optimal stationary policy.

Now, suppose that Vc(λ) ≠ γ for all λ ∈ (0, ∞). Then, as λ̄ is in (0, ∞), there exist two nonnegative sequences {λk} and {δk} such that λk ↑ λ̄ and δk ↓ λ̄, respectively. Since F is compact, we may take f^{λk} ∈ F*_{λk} and f^{δk} ∈ F*_{δk} such that f^{λk} → f^1 ∈ F and f^{δk} → f^2 ∈ F, without loss of generality. By Lemma 11.3.6, we have that f^1, f^2 ∈
F*_{λ̄}. By Lemmas 11.3.3 and 11.3.4, we obtain that Vc(μ, f^1) ≥ γ and Vc(μ, f^2) ≤ γ. If Vc(μ, f^1) = γ (or Vc(μ, f^2) = γ), by Lemma 11.4.8, we have that f^1 (or f^2) is a constrained optimal stationary policy. Hence, to complete the proof, we shall consider the following case:

Vc(μ, f^1) > γ and Vc(μ, f^2) < γ.          (11.43)

Now, using f^1 and f^2, we construct a sequence of stationary policies {f_n} as follows. For each n ≥ 1 and i ∈ E, let

f_n(i) := f^1(i), if i < n;  f^2(i), if i ≥ n,

where, without loss of generality, the denumerable state space is assumed to be the set {1, 2, …}. Obviously, f_1 = f^2 and lim_{n→∞} f_n = f^1. Hence, by Lemma 11.3.3, lim_{n→∞} Vc(μ, f_n) = Vc(μ, f^1). Since f^1, f^2 ∈ F*_{λ̄} (just mentioned), by (11.31), we see that f_n ∈ F*_{λ̄} for all n ≥ 1. As f_1 = f^2, by (11.43), we have Vc(μ, f_1) < γ. If there exists n* such that Vc(μ, f_{n*}) = γ, then by Lemma 11.4.8 and f_{n*} ∈ F*_{λ̄}, f_{n*} is a constrained optimal stationary policy. Thus, in the remainder of this section, we may assume that Vc(μ, f_n) ≠ γ for all n ≥ 1. If Vc(μ, f_n) < γ for all n ≥ 1, then lim_{n→∞} Vc(μ, f_n) = Vc(μ, f^1) ≤ γ, which contradicts (11.43). Thus, there exists some n ≥ 1 such that Vc(μ, f_n) > γ, which together with Vc(μ, f_1) < γ gives the existence of some ñ such that

Vc(μ, f_ñ) < γ and Vc(μ, f_{ñ+1}) > γ.          (11.44)

Obviously, the stationary policies f_ñ and f_{ñ+1} differ in at most the state ñ. Here, it should be pointed out that ñ must be in B^c. Indeed, if ñ ∈ B, we have Vc(ñ, f_ñ) = Vc(ñ, f_{ñ+1}) = 0, which implies that Vc(μ, f_ñ) = Vc(μ, f_{ñ+1}) and thus leads to a contradiction to (11.44).

For any p ∈ [0, 1], using the stationary policies f_ñ and f_{ñ+1}, we construct a randomized stationary policy ϕ^p as follows. For each i ∈ E,

ϕ^p(a | i) = p, if a = f_ñ(ñ) when i = ñ;  1 − p, if a = f_{ñ+1}(ñ) when i = ñ;  1, if a = f_ñ(i) when i ≠ ñ.          (11.45)

Since f_ñ, f_{ñ+1} ∈ F*_{λ̄}, by Lemma 11.3.4, we have V_{b^{λ̄}}(μ, ϕ^p) = V*_{b^{λ̄}}(μ) for all p ∈ [0, 1]. We also have that Vc(μ, ϕ^p) is continuous in p ∈ [0, 1]. Indeed, for any p ∈ [0, 1] and any sequence {p_m} in [0, 1] such that lim_{m→∞} p_m = p, as in the proof of Lemma 11.3.2, we have
Vc(i, ϕ^{p_m}) = ∑_{a∈A(i)} ϕ^{p_m}(a | i) [ c(i, a) + ∑_{j∈B^c} Vc(j, ϕ^{p_m}) m(j | i, a) ] ∀i ∈ B^c.          (11.46)

Hence, as in the proof of Lemma 11.3.3, from (11.45) and (11.46), we obtain lim_{m→∞} Vc(μ, ϕ^{p_m}) = Vc(μ, ϕ^p), and so Vc(μ, ϕ^p) is continuous in p ∈ [0, 1].

Finally, let p_0 = 0 and p_1 = 1. Then, Vc(μ, ϕ^{p_0}) = Vc(μ, f_{ñ+1}) > γ and Vc(μ, ϕ^{p_1}) = Vc(μ, f_ñ) < γ. Therefore, by the continuity of Vc(μ, ϕ^p) in p ∈ [0, 1], there exists a p* ∈ (0, 1) such that Vc(μ, ϕ^{p*}) = γ. Since V_{b^{λ̄}}(μ, ϕ^{p*}) = V*_{b^{λ̄}}(μ), by Lemma 11.4.8, we have that ϕ^{p*} is a constrained optimal stationary policy, which randomizes between the two stationary policies f_ñ and f_{ñ+1} that differ in at most the state ñ ∈ B^c. □
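The last step of the proof is constructive and easy to mimic numerically: Vc(μ, ϕ^p) is continuous in p, lies above γ at p = 0 and below γ at p = 1, so the mixing weight p* can be found by bisection. The sketch below is only illustrative; `Vc_of_p`, which should evaluate p ↦ Vc(μ, ϕ^p) (e.g., by solving the fixed-point system (11.46)), is a hypothetical placeholder and not part of the chapter.

```python
def find_mixing_weight(Vc_of_p, gamma, tol=1e-10):
    """Find p* in (0, 1) with Vc(mu, phi^{p*}) = gamma by bisection.

    Assumes Vc_of_p(0) > gamma > Vc_of_p(1) and that Vc_of_p is continuous,
    as established in the proof of Theorem 11.2.1.
    """
    lo, hi = 0.0, 1.0                  # Vc_of_p(lo) > gamma, Vc_of_p(hi) < gamma
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if Vc_of_p(mid) > gamma:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```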
References

1. Aliprantis, C.D., Burkinshaw, O.: Principles of Real Analysis. Third edition. Academic Press, San Diego, CA (1998)
2. Berument, H., Kilinc, Z., Ozlale, U.: The effects of different inflation risk premiums on interest rate spreads. Phys. A 333, 317–324 (2004)
3. Beutler, F.J., Ross, K.W.: Time-average optimal constrained semi-Markov decision processes. Adv. in Appl. Probab. 18, 341–359 (1986)
4. Feinberg, E.A.: Constrained semi-Markov decision processes with average rewards. Z. Oper. Res. 39, 257–288 (1994)
5. Feinberg, E.A.: Continuous time discounted jump Markov decision processes: a discrete-event approach. Math. Oper. Res. 29, 492–524 (2004)
6. Guo, X.P.: Constrained denumerable state non-stationary MDPs with expected total reward criterion. Acta Math. Appl. Sinica (English Ser.) 16, 205–212 (2000)
7. Guo, X.P., Hernández-Lerma, O.: Constrained continuous-time Markov control processes with discounted criteria. Stochastic Anal. Appl. 21, 379–399 (2003)
8. Guo, X.P., Hernández-Lerma, O.: Continuous-Time Markov Decision Processes: Theory and Applications. Springer-Verlag, Berlin Heidelberg (2009)
9. Haberman, S., Sung, J.: Optimal pension funding dynamics over infinite control horizon when stochastic rates of return are stationary. Insur. Math. Econ. 36, 103–116 (2005)
10. Hernández-Lerma, O., Lasserre, J.B.: Discrete-Time Markov Control Processes: Basic Optimality Criteria. Springer-Verlag, New York (1996)
11. Hernández-Lerma, O., Lasserre, J.B.: Further Topics on Discrete-Time Markov Control Processes. Springer-Verlag, New York (1999)
12. Hernández-Lerma, O., González-Hernández, J.: Constrained Markov control processes in Borel spaces: the discounted case. Math. Methods Oper. Res. 52, 271–285 (2000)
13. Huang, Y.H., Guo, X.P.: Optimal risk probability for first passage models in semi-Markov decision processes. J. Math. Anal. Appl. 359, 404–420 (2009)
14. Huang, Y.H., Guo, X.P.: Discounted semi-Markov decision processes with nonnegative costs. Acta Math. Sinica (Chinese Series) 53, 503–514 (2010)
15. Huang, Y.H., Guo, X.P.: First passage models for denumerable semi-Markov decision processes with nonnegative discounted costs. Acta Math. Appl. Sinica 27, 177–190 (2011)
16. Huang, Y.H., Guo, X.P.: Finite horizon semi-Markov decision processes with application to maintenance systems. European J. Oper. Res. 212, 131–140 (2011)
17. Lee, P., Rosenfield, D.B.: When to refinance a mortgage: a dynamic programming approach. European J. Oper. Res. 166, 266–277 (2005)
18. Limnios, N., Oprisan, J.: Semi-Markov Processes and Reliability. Birkhäuser, Boston (2001)
19. Lin, Y.L.: Continuous time first arrival target models (1): discounted moment optimal models. Acta Math. Appl. Sinica (Chinese Series) 14, 115–124 (1991)
20. Lippman, S.A.: Semi-Markov decision processes with unbounded rewards. Management Science 19, 717–731 (1973)
21. Liu, J.Y., Huang, S.M.: Markov decision processes with distribution function criterion of first-passage time. Appl. Math. Optim. 43, 187–201 (2001)
22. Liu, J.Y., Liu, K.: Markov decision programming: the first passage model with denumerable state space. Systems Sci. Math. Sci. 5, 340–351 (1992)
23. Newell, R.G., Pizer, W.A.: Discounting the distant future: how much do uncertain rates increase valuation. J. Environ. Econ. Manage. 46, 52–71 (2003)
24. Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York (1994)
25. Ross, S.M.: Average cost semi-Markov decision processes. J. Appl. Probab. 7, 649–656 (1970)
26. Sack, B., Wieland, V.: Interest-rate smoothing and optimal monetary policy: a review of recent empirical evidence. J. Econ. Bus. 52, 205–228 (2000)
27. Yong, J.M., Zhou, X.Y.: Stochastic Controls: Hamiltonian Systems and HJB Equations. Springer-Verlag, New York (1999)
28. Yu, S.X., Lin, Y.L., Yan, P.F.: Optimization models for the first arrival target distribution function in discrete time. J. Math. Anal. Appl. 225, 193–223 (1998)
29. Zhang, L.L., Guo, X.P.: Constrained continuous-time Markov control processes with average criteria. Math. Meth. Oper. Res. 67, 323–340 (2008)
Chapter 12
Infinite-Horizon Optimal Control Problems for Hybrid Switching Diffusions

Héctor Jasso-Fuentes and George Yin
12.1 Introduction Owing to the emerging applications arising in manufacturing and production planning, biological and ecological systems, communication networks, and financial engineering, resurgent attention has been drawn to the study of regime-switching diffusion systems, their asymptotic properties, and the associated control and optimization problems. In such processes, continuous dynamics and discrete events coexist and interact. The use of these hybrid models stems from their ability to provide more realistic formulation for real-world applications in which the usual stochastic differential equation models alone are no longer adequate. The regimeswitching diffusions blend both continuous and discrete characteristics, where the discrete event process is modeled as a random switching process to represent random environment and other random influences. In this chapter, we focus on controlled dynamic systems whose dynamics are given by the aforementioned switching diffusion processes. We focus on optimal controls of such systems in an infinite horizon. This chapter is dedicated to Professor On´esimo Hern´andez-Lerma on the occasion of his 65th birthday, who has made many important contributions to infinite-horizon optimal controlled diffusions. Before proceeding further, we briefly review some relevant literature. Discounted and average reward optimality has been extensively studied for different classes of dynamic systems: for Markov decision processes (MDPs), see [20, 43, 45] and the
H. Jasso-Fuentes
Departamento de Matemáticas, CINVESTAV-IPN, México D.F. 07000, México
e-mail: [email protected]

G. Yin
Department of Mathematics, Wayne State University, Detroit, MI 48202, USA
e-mail: [email protected]
references therein. For continuous-time Markov chains, we can mention the works [16, 17]. In [52, 53], the authors treated singularly perturbed MDPs in discrete and continuous time. Concerning continuous time, in [19, 40], the analysis was given for general controlled Markov processes, whereas the papers in [2, 4–6, 30, 31, 49] considered controlled diffusion process, and [14, 15] for switching diffusions. Dealing with bias optimality, discrete-time MDPs were treated in [21, 22, 57]. From another angle, [38], studied this criterion in the context of continuous-time Markov chains. Other related papers are [26, 27], where the authors considered controlled diffusion processes. The recent papers [12, 29] study this criteria for switching diffusions. Strong overtaking optimality was originally introduced in [44]. This notion was relaxed in [1] and in [51]. Many classes of controlled systems both deterministic and/or stochastic were considered subsequently; see [7, 55] for an overview. Other related papers of overtaking optimality include [9, 21, 33] for discrete-time problems, [38] for continuous-time MDP, [47, 48] for continuoustime deterministic systems, and [26, 34] for controlled diffusion processes. For switching diffusions, the recent papers [12, 29] are the first works dealing with bias and overtaking optimality for this class of dynamic systems. Blackwell optimality was introduced in [3]. This concept has been studied extensively for MDPs by [8, 10, 11, 24, 25, 32, 50]. For continuous-time models, there exists only a handful of papers. For instance, [28, 42] deal with controlled diffusion processes, whereas [39] consider continuous-time controlled Markov chains. To the best of our knowledge, the only paper dealing with hybrid switching diffusions is [29]. In this chapter, we begin with the so-called basic criteria including discounted reward and average reward per unit time criteria. Nevertheless, it is known that the basic criteria have drawbacks. The discounted rewards ignore the future activities and actions, whereas average reward per unit time pays no attention to finite horizons. Thus, the so-called advanced criteria become popular. Not only is the consideration of advance criteria interesting from a mathematical point of view, but it is a practically useful consideration. It would be ideal to document the results obtained so that it will be beneficial to both researchers and practitioners. Since the initiation of the study of stochastic control, there have been numerous papers published in this subject concentrating on infinite horizon control systems. Nevertheless, the results for advanced criteria are still scarce. There have been only a handful of papers focusing on advanced criteria. The rest of this chapter is organized as follows: Sect. 12.2 introduces the control model together with the main assumptions and ergodicity. Section 12.3 concerns the study of basic optimality criteria including the ρ -discounted reward criterion and the ergodic criterion. Section 12.4 focuses on the advanced or selective criteria, including bias optimality, overtaking optimality, and sensitive discount optimality. Finally, Sect. 12.5 concludes this chapter.
12.2 Controlled Switching Diffusions and Ergodicity

Suppose that W(·) is a d-dimensional Wiener process and that α(·) is a continuous-time Markov chain with a finite state space M = {1, 2, …, l} and a generator Q = (q_{ij}). Throughout this chapter, as a standing assumption, we assume that α(·) and W(·) are independent and that α(·) is irreducible. Let b : R^n × M × U → R^n and σ : R^n × M → R^{n×d}. Consider the controlled hybrid diffusion process given by

dx(t) = b(x(t), α(t), u(t)) dt + σ(x(t), α(t)) dW(t),          (12.1)

with x(0) = x, α(0) = i, and t ≥ 0. The set U ⊂ R^m is called the control (or action) set, and u(·) is a U-valued stochastic process representing the controller's action at each time t ≥ 0.

Notation. For vectors and matrices, we use the usual Euclidean norms |x|^2 := ∑_i x_i^2 and |A|^2 := Tr(AA′) = ∑_{i,j} A_{ij}^2, where A′ is the transpose of A = (A_{ij}) and Tr(B) denotes the trace of the square matrix B, respectively. As usual, the gradient and the Hessian matrix of a function ϕ are represented by ∇ϕ and H_ϕ, respectively.

The following assumption ensures the existence of a unique strong solution of (12.1).

Assumption 12.2.1. (a) The control set U is compact.
(b) For each i ∈ M, b(·, i, ·) and σ(·, i) are continuous on R^n × U and on R^n, respectively. Moreover, there exist positive constants K_{i,1} and K_{i,2} such that for each x and y in R^n,

sup_{u∈U} |b(x, i, u) − b(y, i, u)| ≤ K_{i,1} |x − y|          (12.2)

and

|σ(x, i) − σ(y, i)| ≤ K_{i,2} |x − y|.          (12.3)

(c) There exists a positive constant K_{i,3} such that for each i ∈ M, the matrix a := σσ′ satisfies

x′ a(y, i) x ≥ K_{i,3} |x|^2 for all x, y ∈ R^n (uniform ellipticity).          (12.4)
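As a computational aside, trajectories of the regime-switching diffusion (12.1) can be approximated by an Euler-Maruyama scheme in which the Markov chain α(·) is advanced from its generator Q between diffusion steps. The sketch below is purely illustrative; the drift `b`, diffusion `sigma`, policy `f`, and generator `Q` are hypothetical placeholders assumed to satisfy the standing assumptions, not objects specified in the chapter.

```python
import numpy as np

def simulate_switching_diffusion(b, sigma, f, Q, x0, i0, T=1.0, dt=1e-3, rng=None):
    """Euler-Maruyama approximation of dx = b(x, i, u) dt + sigma(x, i) dW,
    with the regime i(t) updated by a small-dt approximation of the chain with
    generator Q (a numpy array whose rows sum to zero)."""
    rng = np.random.default_rng(rng)
    n_steps = int(T / dt)
    x, i = np.array(x0, dtype=float), i0
    path = [(0.0, x.copy(), i)]
    for k in range(n_steps):
        u = f(x, i)                                   # stationary Markov policy
        S = sigma(x, i)                               # n x d diffusion matrix
        dW = rng.normal(scale=np.sqrt(dt), size=S.shape[1])
        x = x + b(x, i, u) * dt + S @ dW              # diffusion step in regime i
        # one-step regime switch: P(i -> j) ~ q_ij * dt for j != i
        probs = np.maximum(Q[i] * dt, 0.0)
        probs[i] = 1.0 - np.sum(np.delete(probs, i))
        i = rng.choice(len(probs), p=probs)
        path.append(((k + 1) * dt, x.copy(), i))
    return path
```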
Control Policies and Extended Generator. Throughout this chapter, we consider only (non-randomized) Markov control policies, also named simply Markov policies, which consist of all U-valued measurable functions on [0, ∞) × R^n × M. A special case of this class is the so-called stationary Markov policy, or simply stationary policy, consisting of all U-valued measurable functions on R^n × M. Observe that a control policy u(t), t ≥ 0, in (12.1) becomes u(t) := f(t, x(t), α(t)) (or u(t) := f(x(t), α(t))) when we consider Markov (or stationary) policies. By a slight abuse of terminology, we call f itself a Markov policy (or stationary policy). We will denote by M and by F the families of Markov and stationary control policies, respectively.
Let C^2(R^n × M) be the space of real-valued functions h on R^n × M such that h(x, i) ∈ C^2(R^n) for each i ∈ M. For each x ∈ R^n, i ∈ M, u ∈ U, and h ∈ C^2(R^n × M), let Qh(x, i) := ∑_{k=1}^l q_{ik} h(x, k), and

L^u h(x, i) := ⟨b(x, i, u), ∇h(x, i)⟩ + (1/2) Tr[H_h(x, i) a(x, i)] + Qh(x, i).          (12.5)

Moreover, for each f ∈ F, x ∈ R^n, and i ∈ M, we define

L^f h(x, i) := L^{f(x,i)} h(x, i).          (12.6)

Existence and Uniqueness of Solutions. Under Assumption 12.2.1, for each Markov policy f ∈ M, there exists a unique strong solution (x^f(·), α(·)), which is a Markov-Feller process. For f ∈ F, the operator in (12.6) corresponds to the infinitesimal generator of (x^f(·), α(·)), and its transition probability and conditional expectation are given by P^f_{x,i}(t, B) := P((x^f(t), α(t)) ∈ B | x^f(0) = x, α(0) = i), for every Borel set B ⊂ R^n × M, x ∈ R^n, and i ∈ M, and E^f_{x,i}[ϕ(x(t), α(t))] := ∑_{k=1}^l ∫_{R^n} ϕ(y, k) P^f_{x,i}(t, dy × {k}), respectively, where ϕ is a real-valued function on R^n × M such that E^f_{x,i}[ϕ] < ∞ (for more details on all these statements, we refer to [36, 54]). As in the previous paragraphs, we sometimes write (x^f(·), α(·)) instead of (x(·), α(·)) to emphasize the dependence of the solution of the system (12.1) on f.

Positive Recurrence and Ergodicity. We begin by providing conditions that guarantee positive recurrence and exponential ergodicity of the controlled process.

Assumption 12.2.2. There exists a function w ≥ 1 in C^2(R^n × M) and constants d ≥ c > 0 such that for each i ∈ M,
(a) lim_{|x|→∞} w(x, i) = +∞.
(b) For all x ∈ R^n and u ∈ U,

L^u w(x, i) ≤ −c w(x, i) + d.          (12.7)
Based on [13, 37, 56], Assumption 12.2.2 ensures that, for each f ∈ F, the hybrid diffusion (x^f(·), α(·)) is Harris positive recurrent with a unique invariant probability measure μ_f for which

μ_f(w) := ∑_{k=1}^l ∫_{R^n} w(y, k) μ_f(dy, k) < ∞.          (12.8)
We now introduce the concept of a w-weighted norm, where w is the function in Assumption 12.2.2.

Definition 12.2.3. Denote by Bw(R^n × M) the normed linear space of real-valued measurable functions v on R^n × M with finite w-norm defined as
‖v‖_w := sup_{i∈M} sup_{x∈R^n} |v(x, i)|/w(x, i). Use M_w(R^n × M) to denote the normed linear space of finite signed measures μ on R^n × M such that

‖μ‖_w := ∑_{k=1}^l ∫_{R^n} w(y, k) |μ|(dy, k) < ∞,          (12.9)

where |μ| := μ^+ + μ^− denotes the total variation of μ.

The next result provides a bound of E^f_{x,i} w(x(t), α(t)) in the sense of (12.10); see Propositions 2.5 and 2.7 in [29] and also [12, 26, 27].

Proposition 12.2.4. Suppose that Assumption 12.2.2(b) holds. Then:
(a) For every f ∈ F, x ∈ R^n, i ∈ M, and t ≥ 0,

E^f_{x,i} w(x(t), α(t)) ≤ e^{−ct} w(x, i) + (d/c)(1 − e^{−ct}).          (12.10)
(b) For every v ∈ Bw(R^n × M), x ∈ R^n, i ∈ M, and f ∈ F,

lim_{T→∞} (1/T) E^f_{x,i}[v(x(T), α(T))] = 0.          (12.11)
Assumption 12.2.5 concerns w-exponential ergodicity for the process (x(·), α(·)). Sufficient conditions for Assumption 12.2.5 are given, for instance, in [12, 29].

Assumption 12.2.5. For each f ∈ F, the process (x^f(·), α(·)) is w-exponentially ergodic, that is, there exist positive constants η and δ such that

sup_{f∈F} | E^f_{x,i} v(x(t), α(t)) − μ_f(v) | ≤ η e^{−δt} ‖v‖_w w(x, i)          (12.12)

for all x ∈ R^n, i ∈ M, v ∈ Bw(R^n × M), and t ≥ 0, where μ_f(v) := ∑_{k=1}^l ∫_{R^n} v(y, k) μ_f(dy, k).
12.3 Basic Optimality Criteria

The basic optimality criteria consist of the ρ-discounted criterion and the ergodic (or average) reward criterion. For related references, see, for instance, [4, 14, 16, 19, 20, 30, 43, 45, 49, 52, 53] for the discounted case and [5, 6, 15, 17, 19, 20, 31, 43, 45, 52, 53] for the ergodic case. In this section, we give a brief account of these two basic criteria. All of our results are based mainly on the following papers [12, 14, 15, 29]. For the average reward criterion, our main tool is the dynamic programming method, based on the Hamilton-Jacobi-Bellman equations (12.24) and (12.25).
Discounted Reward. Define the reward rate r as a real-valued function on R^n × M × U, whose properties are detailed in Assumption 12.3.1 below. For R > 0, define the set B_R := {x ∈ R^n | |x| < R} and denote by B̄_R its closure.

Assumption 12.3.1. (a) For each i ∈ M, the reward rate r(x, i, u) is continuous on R^n × U and locally Lipschitz in x, uniformly in u ∈ U; that is, for each R > 0, there exists a constant K_i(R) such that

sup_{u∈U} |r(x, i, u) − r(y, i, u)| ≤ K_i(R) |x − y| for all |x|, |y| ≤ R.          (12.13)

(b) r(·, ·, u) is in Bw(R^n × M) uniformly in u; that is, there exists C > 0 such that, for all x ∈ R^n,

sup_{u∈U} |r(x, i, u)| ≤ C w(x, i).          (12.14)
For each Markov policy f ∈ M, t ≥ 0, x ∈ R^n, and i ∈ M, we write r(t, x, i, f) := r(x, i, f(t, x, i)), which reduces to r(x, i, f) := r(x, i, f(x, i)) if f ∈ F. Using this notation, we define the expected ρ-discounted reward as follows.

Definition 12.3.2. Given the discount factor ρ > 0, let

Vρ(x, i, f) := E^f_{x,i} [ ∫_0^∞ e^{−ρt} r(t, x(t), α(t), f) dt ]          (12.15)

be the ρ-discounted reward using the policy f ∈ M, given the initial data x ∈ R^n and i ∈ M. The corresponding optimal value function is defined by Vρ*(x, i) := sup_{f∈M} Vρ(x, i, f), and a policy f* ∈ M is said to be ρ-discounted optimal if Vρ(x, i, f*) = Vρ*(x, i) for all x ∈ R^n and i ∈ M.

The next proposition establishes that Vρ is bounded in an appropriate sense; see [29, Proposition 3.3] for a proof.

Proposition 12.3.3. Suppose that Assumptions 12.2.1, 12.2.2, 12.2.5, and 12.3.1 are satisfied. Then Vρ(·, ·, f) is in Bw(R^n × M) for all f ∈ F; in fact, for every x ∈ R^n and i ∈ M,

sup_{f∈F} Vρ(x, i, f) ≤ C̄ w(x, i), with C̄ := C(ρ + d)/(ρc).          (12.16)
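In practice, (12.15) can also be estimated crudely by Monte Carlo once a path simulator is available, for example the Euler-Maruyama sketch given after Assumption 12.2.1 (wrapped so that it takes only the initial condition and horizon). The snippet below is only an illustration, with a truncated horizon whose neglected tail is of order e^{−ρ·horizon}; `simulate`, `reward`, and `f` are hypothetical placeholders.

```python
import numpy as np

def estimate_discounted_reward(simulate, reward, f, rho, x0, i0,
                               horizon=50.0, dt=1e-3, n_paths=200):
    """Crude Monte Carlo estimate of the rho-discounted reward (12.15),
    truncating the infinite horizon at `horizon`."""
    total = 0.0
    for _ in range(n_paths):
        path = simulate(x0, i0, T=horizon, dt=dt)   # list of (t, x, i) tuples
        acc = 0.0
        for t, x, i in path[:-1]:
            acc += np.exp(-rho * t) * reward(x, i, f(x, i)) * dt
        total += acc
    return total / n_paths
```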
The following theorem concerns the existence of ρ -discounted optimal policies; see [14] for a proof. Theorem 12.3.4. Suppose that conditions 12.2.1, 12.2.2, 12.2.5, 12.3.1 are satisfied. Then, for fixed ρ > 0, the set of ρ -discounted optimal policies is nonempty. Ergodic Reward. From Sect. 12.2, for each stationary policy, f ∈ F, (x f (·), α (·)) is positive recurrent and has a unique invariant measure μ f . The next definition is concerned with the ergodic criterion together with average optimal policies.
Definition 12.3.5. For each f ∈ M, x ∈ R^n, i ∈ M, and T > 0, let

J_T(x, i, f) := E^f_{x,i} [ ∫_0^T r(t, x(t), α(t), f) dt ].          (12.17)

The long-run average reward, also known as ergodic reward or gain of f, given the initial conditions x(0) = x and α(0) = i, is

J(x, i, f) := lim inf_{T→∞} (1/T) J_T(x, i, f).          (12.18)

The function

J*(x, i) := sup_{f∈M} J(x, i, f) for x ∈ R^n, i ∈ M          (12.19)

is referred to as the optimal gain or optimal average reward. If there is a policy f* ∈ M for which J(x, i, f*) = J*(x, i) for all x ∈ R^n and i ∈ M, then f* is called average optimal. Denote by F_ao the set of all average optimal policies; later on, we will see that this set is nonempty. Given f ∈ F, we define

r̄(f) := μ_f(r(·, f)) = ∑_{k=1}^l ∫_{R^n} r(y, k, f) μ_f(dy, k).          (12.20)
The following result shows that, under our ergodicity assumptions, the gain in (12.18) becomes a constant and that it is uniformly bounded; see Proposition 3.6 in [29] for a proof (see also [12]).

Proposition 12.3.6. Under Assumptions 12.2.1, 12.2.2, 12.2.5, and 12.3.1, we have:
(a) J(x, i, f) = r̄(f), for all x ∈ R^n, i ∈ M.
(b) r̄(f) is uniformly bounded by

r̄(f) ≤ C d/c,          (12.21)

where c, d, and C are the constants in (12.7) and in (12.14), respectively. Therefore,

r* := sup_{f∈F} r̄(f) < ∞.          (12.22)

To find average optimal policies, we restrict ourselves to the set F instead of considering the whole space M. The reason is that the optimal gain (12.19) is attained in the set F of stationary policies, and it coincides with the constant r* in (12.22), that is,

sup_{f∈M} J(x, i, f) = sup_{f∈F} J(x, i, f) = r* for all x ∈ R^n, i ∈ M.          (12.23)
For more details, see [6, 34, 35] for the diffusion case and [15] for the hybrid switching diffusion case.

The following theorem, established in [29], provides the existence and uniqueness of a solution of the average reward HJB equation defined in (12.24).

Theorem 12.3.7. If Assumptions 12.2.1, 12.2.2, 12.2.5, and 12.3.1 are satisfied, then the following assertions hold:
(a) There exists a unique pair (r*, h) consisting of the constant r* in (12.22) and a function h of class C^2(R^n × M) ∩ Bw(R^n × M), with h(0, i_0) = 0, that satisfies the average reward HJB equation

r* = max_{u∈U} [r(x, i, u) + L^u h(x, i)] for all x ∈ R^n and i ∈ M.          (12.24)

(b) There exists a policy f ∈ F which attains the maximum in (12.24), that is,

r* = r(x, i, f) + L^f h(x, i) for all x ∈ R^n and i ∈ M;          (12.25)

such policies are called canonical policies.
(c) A policy is average optimal if and only if it is canonical.
(d) There exists an average optimal policy.
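Although the chapter's setting is a controlled diffusion, the structure of (12.24)-(12.25) is easy to experiment with on a finite-state, finite-action continuous-time Markov chain, where the generator L^u reduces to a matrix. The policy-iteration sketch below is only a discrete-state illustration of the HJB/canonical-policy mechanism, under the assumption that every stationary policy yields an irreducible chain; the lists `generators` and `rewards` are hypothetical model data, not objects from the chapter.

```python
import numpy as np

def average_reward_policy_iteration(generators, rewards, n_iter=50):
    """Policy iteration for a finite-state, finite-action CTMC analogue of
    (12.24)-(12.25): maximize the gain r_bar(f).

    generators : list where generators[a] is the n x n generator under action a.
    rewards    : list where rewards[a] is the length-n reward-rate vector.
    """
    n = generators[0].shape[0]
    n_actions = len(generators)
    policy = np.zeros(n, dtype=int)                  # start from action 0 everywhere

    def evaluate(policy):
        G = np.array([generators[policy[i]][i] for i in range(n)])
        r = np.array([rewards[policy[i]][i] for i in range(n)])
        A = np.vstack([G.T, np.ones(n)])             # invariant law: mu G = 0, mu 1 = 1
        mu = np.linalg.lstsq(A, np.r_[np.zeros(n), 1.0], rcond=None)[0]
        g = float(mu @ r)                            # gain, cf. (12.20)
        B = np.vstack([G, mu])                       # Poisson equation G h = g - r, mu h = 0
        h = np.linalg.lstsq(B, np.r_[g - r, 0.0], rcond=None)[0]
        return g, h

    for _ in range(n_iter):
        g, h = evaluate(policy)
        # improvement step: maximize r(i, a) + (G_a h)(i), cf. (12.24)
        scores = np.array([[rewards[a][i] + generators[a][i] @ h
                            for a in range(n_actions)] for i in range(n)])
        new_policy = scores.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    g, h = evaluate(policy)
    return policy, g, h
```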
12.4 Advanced Criteria We have mentioned in the introduction that the average reward in (12.18) is undesirable in that it practically ignores the finite-horizon total reward. To overcome this difficulty, we need to consider more selective criteria. In this section, we study bias optimality including some of its characterizations and sensitive discount optimality which allows us to avoid the problem that average reward criteria face. We will give conditions for the existence and characterizations of bias and mdiscount optimal policies for −1 ≤ m ≤ +∞. We also give further characterizations of these policies. Our characterizations lead to the relation with the concept of overtaking optimality. This concept was defined separately and independently. Remark 12.4.1. Throughout the rest of this chapter, we assume that Assumptions 12.2.1, 12.2.2, 12.2.5 and 12.3.1 hold. Bias Optimality. In this section, we study bias optimality and some of its characterizations which lead to the concept of overtaking optimality. Our approach to work is the well-known dynamic programming method. All of the results established here will be only stated. Their correspondent proofs can be seen in the following two papers [12, 29].
We next establish the concept of bias of each stationary policy. For each f ∈ F, we define the bias of f as the function

h_f(x, i) := E^f_{x,i} [ ∫_0^∞ [r(x(t), α(t), f) − r̄(f)] dt ] for all x ∈ R^n and i ∈ M.          (12.26)

Note that we can interpret the bias as the expected total difference between the immediate reward r(x(t), α(t), f) and the ergodic reward r̄(f).

Remark 12.4.2. It is easy to see that h_f is in Bw(R^n × M) for each f ∈ F. Namely, by the exponential ergodicity property in (12.12), we have

|h_f(x, i)| ≤ ‖r(·, ·, f)‖_w w(x, i) η ∫_0^∞ e^{−δt} dt ≤ δ^{−1} C w(x, i) η (by (12.14)).          (12.27)

The following lemma establishes that the pair (r̄(f), h_f), consisting of the constant r̄(f) in (12.20) and the bias (12.26), is the solution of the so-called Poisson equation; see [18, 29].

Lemma 12.4.3. For each f ∈ F, the pair (r̄(f), h_f) is the unique solution of the Poisson equation

r̄(f) = r(x, i, f) + L^f h_f(x, i) for all x ∈ R^n and i ∈ M,          (12.28)

with the condition

∑_{k=1}^l ∫_{R^n} h_f(y, k) μ_f(dy, k) = 0.          (12.29)
Moreover, h f belongs to the space C2 (Rn × M) ∩ Bw (Rn × M). The following proposition relates the bias h f to the solution h of the average reward HJB equation (12.24) when f is average optimal; see Proposition 4.5 in [29]. Proposition 12.4.4. If a policy fˆ ∈ F is average optimal, then its bias h fˆ and any function h in the HJB equation (12.24) coincide up to an additive constant. In fact, h fˆ(x, i) = h(x, i) − μ fˆ(h) for all x ∈ Rn and i ∈ M.
(12.30)
Fixed (x, i) ∈ Rn × M, we define the set U 0 (x, i) := {u ∈ U | r∗ = r(x, i, u) + Lu h(x, i)} .
(12.31)
Remark 12.4.5. (i) From Theorem 12.3.7, U 0 is well defined. (ii) It is easy to see that f ∈ Fao if and only if f (x, i) ∈ U 0 (x, i) for all x ∈ Rn and i ∈ M.
The following result establishes some properties of U 0 . A complete guide of multifunctions and selectors is given, for instance, in [20, Appendix D]. Proposition 12.4.6. Under Assumptions 12.2.1 and 12.3.1, the multifunction (x, i)
→ U 0 (x, i) satisfies the following properties: − (a) For each (x, i) ∈ Rn × M, U 0 (x, i) is a nonempty compact set. (b) For each i ∈ M, U 0 (·, i) is continuous. (c) U 0 (·, i) is Borel measurable, for each i ∈ M. The proofs of (a) and (b) follow from the continuity of u −→ r(x, i, u) + Lu h(x, i) and from the compactness of U, whereas a proof of part (c) is given, for instance, in [23, 46]. Existence of Bias Optimal Policies. The main objective of this part is to show the existence of bias optimal policies, defined in (12.33). We will use the dynamic programming method based on the bias optimality HJB equations described in (12.36)–(12.37), below. To begin with, we state the concept of bias optimality as follows. ˆ i) ∈ C2 (Rn × M) defined by Definition 12.4.7. Consider the function h(x, ˆ i) := sup h f (x, i) for all x ∈ Rn and i ∈ M. h(x, f ∈Fao
(12.32)
Then hˆ is called the optimal bias function. Moreover, if there exists a policy f ∗ ∈ Fao for which ˆ i) for all x ∈ Rn and i ∈ M, h f ∗ (x, i) = h(x, (12.33) then f ∗ is referred to as a bias optimal policy. Remark 12.4.8. (a) Definition 12.4.7 above suggests to maximize the bias h f defined in (12.26) over the class of stationary policies f ∈ F. This definition seems not to be useful at first sight since we are maximizing a kind of error!— see expression (12.26). It will help us, however, to obtain this class of policies in order to get another family of policies so-called overtaking optimal policies that we are going to analyze later on. (b) Denote by Fbias the set of all bias optimal policies. Suppose that Fbias is nonempty. Then Lemma 12.4.3 ensures that the optimal bias function hˆ belongs to C2 (Rn × M) ∩ Bw (Rn × M). Using (12.30), we can rewrite (12.32) by ˆ i) = h(x, i) + sup μ f (−h) for all x ∈ Rn and i ∈ M. h(x, f ∈Fao
(12.34)
This last relation indicates that to search for bias optimal policies, we need only look for policies f ∈ Fao that maximizes the second term on the right-hand side of (12.34). If we define a new control problem, which is referred to as the “bias problem” with:
• The control system (12.1)
• The action set U^0(·, ·) defined in (12.31)
• The reward rate r̃(x, i, u) := −h(x, i) for all x ∈ R^n and i ∈ M          (12.35)
then, to find bias optimal policies, we need only solve the bias problem (12.35) with average criterion given by μ_f(−h).

Our next theorem, whose proof is provided, for instance, in [29, Theorem 4.11] or in [12, Theorem 7.2], ensures both the existence of solutions to (12.36)-(12.37) and the existence of a policy f ∈ F_ao that maximizes the right-hand side of these equations. This result is based on the dynamic programming method that uses the bias optimal HJB equations (12.36)-(12.37).

Theorem 12.4.9. (a) The triplet (r*, ĥ, φ), consisting of the constant (12.22), the optimal bias function (12.32), and some function φ ∈ C^2(R^n × M), solves the bias optimal HJB equations

r* = max_{u∈U} {r(x, i, u) + L^u h(x, i)}, and          (12.36)

h(x, i) = max_{u∈U^0(x,i)} {L^u φ(x, i)}.          (12.37)

(b) There exists a policy f ∈ F_ao that maximizes the right-hand side of (12.36)-(12.37). That is,

r* = r(x, i, f) + L^f h(x, i), and          (12.38)

h(x, i) = L^f φ(x, i).          (12.39)
(c) A policy f satisfies (12.38)–(12.39) if and only if it is bias optimal. (d) The set of optimal bias policies is nonempty. Bias Optimality vs Overtaking Optimality. One of the most important features of bias optimality is its closed relation with overtaking optimality, also known as catching up optimality. This concept was introduced in [44] and later through the papers [1, 51]. The idea of overtaking optimality arose from problems of economic growth or capital accumulations. Nowadays, there has been a considerable work dealing with this concept by using different type of dynamic models; see, for instance, [1, 12, 26, 29, 38, 41, 47, 48]. An overview of results can be found in [7, 55]. Let us define now the concept of overtaking optimality. Definition 12.4.10. A policy f ∗ is said to be overtaking optimal in M if for every ε > 0, x ∈ Rn , i ∈ M, and f ∈ M, there exists Tε = Tε (x, i, f ∗ , f ) such that JT (x, i, f ∗ ) ≥ JT (x, i, f ) − ε , for all T ≥ Tε . If ε = 0, then f ∗ is said to be strongly overtaking optimal in M.
(12.40)
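In a finite-state analogue, overtaking optimality can be visualized directly by comparing the finite-horizon rewards J_T of two stationary policies as T grows; the curves grow with slopes equal to the gains r̄(f) and, by the bias decomposition derived just below in (12.41), differ asymptotically by the biases. The snippet below is purely illustrative and assumes SciPy is available; the generator/reward pairs `G1, r1` and `G2, r2` are hypothetical data for two stationary policies.

```python
import numpy as np
from scipy.linalg import expm

def JT(G, r, T, n_steps=400):
    """E_x[integral_0^T r(X_t) dt] for a finite CTMC, by midpoint quadrature of e^{Gt} r."""
    dt = T / n_steps
    return sum(expm(G * (k + 0.5) * dt) @ r for k in range(n_steps)) * dt

G1 = np.array([[-1.0, 1.0], [0.5, -0.5]]); r1 = np.array([2.0, 0.0])
G2 = np.array([[-0.2, 0.2], [2.0, -2.0]]); r2 = np.array([2.0, 0.0])
for T in (1.0, 5.0, 20.0, 50.0):
    # compare the two policies from initial state 0 as the horizon grows
    print(T, JT(G1, r1, T)[0], JT(G2, r2, T)[0])
```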
By using Ito’s formula for hybrid switching diffusions [36] combined with (12.28), we can easily obtain the relation JT (x, i, f ) = T r¯( f ) + h f (x, i) − Ex,i h f (x(T ), α (T )) f
(12.41)
for each f ∈ F, x ∈ Rn , i ∈ M, and T > 0. Also, the w-exponential ergodicity (12.12) implies f Ex,i h f (x(T ), α (T )) → μ f (h f ) = 0.
(12.42)
These two facts in (12.41)–(12.42) are key points for the equivalence of bias optimality and overtaking optimality. The following theorem establishes this fact. For more details, we refer [29, Theorem 4.10] or [12, Theorem 6.4]. Theorem 12.4.11. A policy f ∗ ∈ F is bias optimal if and only if it is overtaking optimal. Sensitive Discount Optimality. We have studied the concept of bias optimality and its equivalence with overtaking optimality. Each of these concepts constitutes a refinement on the set of average optimal policies because any (stationary) policy corresponding to some of these criteria (e.g., bias and overtaking) turns out to be average optimal, and, in addition, it has some other special feature, depending on what context we are dealing with. In this section, we introduce another refinement of the average reward criterion called sensitive discount optimality, which includes m-discount optimality for m ≥ −1, and Blackwell optimality for the case m = +∞. Sensitive discount optimality is a very useful tool since it gives interesting connections to some of the optimality criterion we have analyzed so far. Its characterizations are given by means of a nested system of “bias-like” optimal HJB equations. m-Discounted Optimality. We begin this part by recalling the ρ -discounted reward Vρ given in Definition 12.3.2. Indeed, for ρ > 0, x ∈ Rn , i ∈ M, and f ∈ M, we f ∞ −ρ t [ 0 e r(t, x(t), α (t), f )dt] and its associated value have defined Vρ (x, i, f ) := Ex,i function by Vρ∗ (x, i) = sup f ∈M Vρ (x, i, f ). The following definition introduces the concept of m-discount optimality for each m ≥ −1. Definition 12.4.12. Let m ≥ −1 be an integer. A policy f ∗ ∈ M is said to be mdiscount optimal if for all x ∈ Rn , i ∈ M, and f ∈ M,
lim inf ρ −m Vρ (x, i, f ∗ ) − Vρ (x, i, f ) ≥ 0. ρ ↓0
(12.43)
We now proceed to generalize the basic criteria (12.15) and (12.18) by using reward rates different from the function r in Assumption 12.3.1. We do need, however, functions satisfying a growth rate similar to Assumption 12.3.1(b). A formal statement of this functions is given next.
Definition 12.4.13. Let w be the function defined in Assumption 12.2.2. We define Bw (Rn × M × U) as the space of real-valued functions v on Rn × M × U such that sup {v(x, i, u)} ≤ Cv w(x, i),
(12.44)
u∈U
where Cv is a positive constant dependent on v. Remark 12.4.14. (a) Observe that Bw (Rn × M) is a subset of Bw (Rn × M × U) because we can write v ∈ Bw (Rn × M) as v(x, i) = v(x, i, u) for all u ∈ U. (b) By Assumption 12.3.1(b), the reward rate r is in Bw (Rn × M × U). For each f ∈ M, we shall write v(t, x, i, f ) := v(x, i, f (t, x, i)). Similarly, v(x, i, f ) := v(x, i, f (x, i)) for f ∈ F. The following definition is a generalization of the basic criteria (12.15) and (12.18) when we use v rather than r. Definition 12.4.15. Given f ∈ M, x ∈ Rn , i ∈ M, and v ∈ Bw (Rn × M × U), we define: (a) For each ρ > 0, the ρ -discounted v-reward by Vρ (x, i, f , v) :=
f Ex,i
∞
e
−ρ t
0
v(t, x(t), α (t), f )dt .
(12.45)
(b) The v-gain of f is given by J(x, i, f , v) := lim inf T →∞
1 f E T x,i
T
0
v(t, x(t), α (t), f )dt .
(12.46)
Also, given f ∈ F ⊂ M, we define the constant v( ¯ f ) := μ f (v(·, ·, f )) =
l
∑
n k=1 R
v(y, k, f )μ f (dy, k),
(12.47)
where μ f is the invariant probability measure of (x f (·), α (·)). Under Assumptions 12.2.1, 12.2.2, 12.2.5, and 12.3.1, for v ∈ Bw (Rn × M ×U), we can follow the same analysis as given in [29, Proposition 3.6] to conclude that the v-gain (12.46)
turns f T out to be the constant (12.47); i.e., lim infT →∞ T1 Ex,i v(x(t), α (t), f )dt = v( ¯ f ), 0 for f ∈ F. Furthermore, as in the proof of Proposition 3.6 (b) in [29], we can verify that v¯ is uniformly bounded; in fact, sup f ∈F v( ¯ f ) ≤ Cv dc . For each f ∈ F, consider the operator A f on Bw (Rn × M ×U) defined as follows: ∞
A f v(x, i) :=
0
f Ex,i v(x(t), α (t), f ) − v( ¯ f ) dt,
where v¯ is the constant in (12.47) (compare with the bias h f in (12.26)).
(12.48)
By (12.12), the operator A f is bounded by A f v(x, i) ≤ δ −1 η Cv w(x, i);
(12.49)
in other words, A f is uniformly bounded in the norm · w in that A f v ≤ δ −1 η Cv . w
(12.50)
Also observe that for each f ∈ F, A f maps the space Bw (Rn × M × U) into Bw (Rn × M). On the other hand, by using the properties of the invariant measure μ f , we can easily deduce that
μ f (A f v) = 0,
for all f ∈ F and v ∈ Bw (Rn × M × U).
(12.51)
Moreover, for any v ∈ Bw (Rn × M × U), f ∈ F, and k = 0, 1, . . . , we define the function hkf v ∈ Bw (Rn × M) by n hkf v(x, i) := (−1)k Ak+1 f v(x, i) for all x ∈ R , i ∈ M,
(12.52)
where Ak+1 is the (k + 1) composition of A f with itself. In addition, we consider f the special case when v ≡ r. In this case, we simply write hkf r := hkf . Hence, the definition of the bias function in (12.26) coincides with the term h0f in (12.52). Furthermore, h1f = A f (−h0f ) becomes the bias when the reward rate is −h0f (compare with the “bias” problem (12.35)). By using induction, it can be proven that (12.52) is equivalent to hkf = A f (−hk−1 f ).
(12.53)
Finally, from (12.51), we can easily deduce that
μ f (hkf ) = 0,
for all f ∈ F, k = 0, 1, . . . .
(12.54)
Poisson Equation and Average Reward HJB Equation. The concept of mdiscount optimality can be associated to the solution of a system of m average reward HJB equations, defined in (12.58)–(12.60). In this section, we show the existence and uniqueness of a solution to this system of equations. Our approach is based on another auxiliary system of equations the so-called −1th, 0th, . . . , mth Poisson equations, which are defined as follows. Definition 12.4.16. Given f ∈ F, and m ≥ −1, we define the −1th, 0th,. . .,mth Poisson equations by g = r(x, i, f ) + L f h0 (x, i) for all x ∈ Rn and i ∈ M.
(12.55)
217
(12.56)
···
f m+1
(x, i) for all x ∈ Rn and i ∈ M.
(12.57)
The following result is concerned with the existence and uniqueness of a solution for (12.55)–(12.57). See [29, Theorem 5.7]. Theorem 12.4.17. Fix m ≥ −1. Then the constant g ∈ R and the functions h0 , h1 , . . . , hm in C2 (Rn × M) ∩ Bw (Rn × M) are solutions to the Poisson equation (12.55)–(12.57) if and only if g = r¯( f ), hk = hkf , for 0 ≤ k ≤ m, and hm+1 = hm+1 + z for z ∈ R. f Definition 12.4.18. For all x ∈ Rn , i ∈ M, and some fixed m ≥ −1, we define the −1th, 0th,. . ., mth average reward HJB equation by g = max[r(x, i, u) + Lu h0 (x, i)], u∈U
h0 (x, i) = max [Lu h1 (x, i)], u∈U 0 (x,i)
··· hm (x, i) =
(12.58) (12.59)
···
max [Lu hm+1 (x, i)],
u∈U m (x,i)
(12.60)
where, if we denote U −1 (x, i) := U for all x ∈ Rn and i ∈ M, then the set U k (x, i), for 0 ≤ k ≤ m, consists of the elements (controls) u ∈ U k−1 (x, i) attaining the maximum in the (k − 1)th average reward HJB equation. Remark 12.4.19. From Assumptions 12.2.1 and 12.3.1, we can prove by induction that for each m ≥ 0, the set U m (x, i) is compact for each x ∈ Rn and i ∈ M. Furthermore, for each i ∈ M, the mapping x −→ U m (x, i) is upper semicontinuous and Borel measurable (see also Proposition 12.4.6). Definition 12.4.20. For a fixed integer m ≥ −1, let us denote by Fm the set of stationary policies f ∈ F such that f (x, i) ∈ U m+1 (x, i) for each x ∈ Rn , i ∈ M; that is, f ∈ Fm if and only if it attains the maximum in the −1th, 0th, . . . , mth average reward HJB equations (12.58)–(12.60). Our next result determines the existence and uniqueness of solutions to the average reward HJB equations defined above. In addition, it shows the existence of stationary policies f ∈ F that maximize the right-hand side of such equations. See [29, Theorem 5.11]. Theorem 12.4.21. Fixed m ≥ −1: (a) There exists a unique solution g ∈ R h0 , . . . , hm+1 satisfying the average reward HJB equations (12.58)–(12.60), where hm+1 is unique up to additive constants. (b) The set Fm is nonempty.
(c) If f ∈ Fm , then the solution of (12.58)–(12.60) turns out g = r¯( f ), h0 = h0f , . . . , hm+1 = hm+1 + z, for some constant z. f Laurent series. In this section, we prove that the expected ρ -discounted v-reward defined in (12.45) can be expressed in terms of a Laurent series. This property will be a key point for the existence of m-discount optimal policies. Recall that Akf is the k-composition of the operator A f in (12.48) by itself. We now establish our main result of this section with the constant δ given in (12.12). See [29, Theorem 5.12]. Theorem 12.4.22. Fix f ∈ F and v ∈ Bw (Rn × M). Then, for all x ∈ Rn , and i ∈ M, the ρ -discounted v-reward (12.45) can be expressed through the following Laurent series: ∞ 1 ¯ f ) + ∑ (−ρ )k Ak+1 (12.61) Vρ (x, i, f , v) = v( f v(x, i). ρ k=0 Moreover, this series converges in the w-norm for all 0 < ρ < δ . The following definition is associated with the residual terms of the Laurent series (12.61). Definition 12.4.23. For each v ∈ Bw (Rm × M × U), f ∈ F, and k = 0, 1, . . . , we define the k-residual of the Laurent series (12.61) by
Ψk ( f , v, ρ ) :=
∞
∑ (−ρ ) j A j+1 f v.
(12.62)
j=k
Our next result shows that the above residual terms are bounded in the w-norm. See [29, Proposition 4.14]. Proposition 12.4.24. Consider a positive constant γ < δ , where δ is the constant in (12.12). Then, for all ρ ≤ γ , and k ≥ 0, sup Ψk ( f , v, ρ ) w ≤ f ∈F
Cv η k δ (δ − γ )
ρ k.
(12.63)
A special case of (12.61) is v ≡ r. In this case, the ρ -discounted reward in (12.15) is expressed by Vρ (x, i, f ) =
∞ 1 r¯( f ) + ∑ ρ k hkf (x, i) ρ k=0
(12.64)
for all f ∈ F, x ∈ Rn , i ∈ M, hkf as in (12.53) and ρ satisfying the condition of Theorem 12.4.22. Existence and Characterizations of m-Discounted Optimal Policies. In this part, we show the existence of m-discount optimal policies for m ≥ −1. To this end, we mainly use the results developed in this section, which lead us to state the
characterizations of m-discount optimality. Before stating our main results, we shall introduce an important concept concerned with a lexicographical order. Definition 12.4.25. For any two vectors a and b in Rn , we say that a is lexicographical greater than or equal to b, denoted by a b, if the first nonzero component of a − b is positive. We also write a b when a b and a = b. The following theorem provides some characterizations of m-discount optimal policies. See [29, Theorem 5.16]. Theorem 12.4.26. Fixed m ≥ −1, the following statements are equivalents: (a) f lexicographically maximizes the terms r¯( f ), h0f , . . . , hmf of the Laurent series (12.64). (b) f belongs to Fm . (c) f is m-discount optimal. The next corollary is a straightforward consequence of Theorems 12.4.21 and 12.4.26. Corollary 12.4.27. For each m ≥ −1, there exists an m-discount optimal policy. Blackwell Optimality. In the last sections, we analyzed m-discount optimality for any arbitrary m ≥ −1. In this section, we focus on the study of Blackwell optimality, which is based on m-discount optimality. Blackwell optimality was introduced by [3]. Later on, the concept of Blackwell optimality was studied for different classes of control systems; see for instance, [8, 10, 11, 24, 25, 32] for discrete-time MDPs, [39] for continuous-time controlled Markov chains, [28, 42] for controlled diffusion processes, and [29] for switching diffusions. To proceed, the definition of Blackwell optimality is given next. Definition 12.4.28. A policy f ∈ M is called Blackwell optimal if and only if, for any other policy fˆ ∈ M, and any initial states x ∈ Rn , i ∈ M, there exists a positive constant ρ ∗ = ρ ∗ (x, i, fˆ) such that Vρ (x, i, f ) ≥ Vρ (x, i, fˆ), for all 0 < ρ ≤ ρ ∗ .
(12.65)
We now introduce some preliminary results necessary for our analysis; see [29, Proposition 5.27]. Proposition 12.4.29. (a) Consider the set U m (·, ·) in Definition 12.4.18. Then, the sequence {U m }m is decreasing. Furthermore, for each x ∈ Rn and i ∈ M, U ∞ (x, i) :=
∞ m=1
is a nonempty set.
U m (x, i) = lim U m (x, i) m→∞
(12.66)
(b) Consider the set Fm in Definition 12.4.20. Hence, there exists a policy f ∈ F contained in F∞ :=
∞ m=1
Fm = lim Fm . m→∞
(12.67)
We now establish sufficient and necessary conditions for a policy f ∈ F to be Blackwell optimal; see [29, Theorem 5.28]. Theorem 12.4.30. A policy f ∈ F is Blackwell optimal if and only if it is contained in F∞ defined in (12.67). Theorem 12.4.30 and the definition of F∞ in (12.67) lead to the existence of Blackwell optimal policies, as established in the following corollary. Corollary 12.4.31. The set of Blackwell optimal policies is nonempty.
12.5 Concluding Remarks In this chapter, we have studied several classes of infinite-horizon optimal control problems for a general class of hybrid switching diffusions. The analysis was done by using different infinite-horizon criteria, namely, basic and advanced criteria. The motivation for studying advanced criteria stems from the objective of improving the basic criteria. Our analysis provides a summary for the existence and characterizations of optimal control policies related to a wide variety of infinite-horizon problems. We can extend the results to the study for more advanced criteria, for instance, the problem of finding average optimal policies that, in addition, minimize the asymptotic variance. Currently, as an ongoing project, we are studying zero-sum and non-zero-sum stochastic hybrid differential games. One of the disadvantages of our approach is the assumption of uniform ellipticity hypothesis, which excludes singular systems (e.g., piecewise deterministic systems). The degenerate cases and singular systems deserve further thoughts and careful considerations. Acknowledgements The research of the first author was supported in part by the Air Force Office of Scientific Research under FA9550-10-1-0210. The research of the second author was supported in part by the National Science Foundation under DMS-0907753 and in part by the Air Force Office of Scientific Research under FA9550-10-1-0210.
References 1. Atsumi, H. (1965). Neoclassical growth and the efficient program of capital accumulation. Rev. Econom. Studies 32, pp. 127–136. 2. Basak, G.K., Borkar, V.S., Ghosh, M.K. (1997). Ergodic control of degenerate diffusions. Stoch. Anal. Appl. 15, pp. 1–17.
12 Advanced Criteria for Controlled Hybrid Diffusions
221
3. Blackwell D. (1962). Discrete dynamic programming. Annals of mathematical statistics. 33, pp. 719–726. 4. Borkar, V.S. (2005). Controlled diffusion processes. Probab. Surveys 2, pp. 213–244. 5. Borkar, V.S., Ghosh, M.K. (1988). Ergodic control of multidimensional diffusions I: The existence of results. SIAM J. Control Optim 26, pp. 112–126. 6. Borkar, V.S., Ghosh, M.K. (1990). Ergodic control of multidimensional diffusions II: Adaptive control. Appl. Math. Opt. 21, pp. 191–220. 7. Carlson, D.A., Haurie, D.B., Leizarowitz, A. (1991). Infinite Horizon Optimal Control: Deterministic and Stochastic Systems. Springer-Verlag, Berlin. 8. Carrasco, G., Hern´andez-Lerma, O. (2003). Blackwell optimality for Markov control processes in Borel spaces. Preliminary report. 9. Dana, R.A., Le Van, C. (1990). On the Bellman equation of the overtaking criterion. J. Optim. Theory Appl. 67, pp. 587–600. 10. Dekker R., Hordijk A. (1988). Average, sensitive and Blackwell optimal policies in denumerable Markov decision chains with unbounded rewards. Math. Oper. Res. 13, pp. 395–420. 11. Dekker R., Hordijk A. (1992). Recurrence conditions for averageand Blackwell optimality in denumerable state Markov decision chains. Math. Oper. Res. 17, pp. 271–289. 12. Escobedo-Trujillo, B.A., Hern´andez-Lerma, O. (2011). Overtaking optimality for controlled Markov-modulated diffusions. Submitted. 13. Fort, G., Roberts, G.O. (2005). Subgeometric ergodicity of strong Markov processes. Ann. Appl. Probab. 15, pp. 1565–1589. 14. Ghosh, M.K., Arapostathis, A., Marcus, S.I. (1993). Optimal control of switching diffusions to flexible manufacturing systems. SIAM J. Control Optim. 31, pp. 1183–1204. 15. Ghosh, M.K., Arapostathis, A., Marcus, S.I. (1997). Ergodic control of switching diffusions. SIAM J. Control Optim. 35, pp. 1952–1988. 16. Guo, X.P., Hern´andez-Lerma, O. (2003). Continuous-time controlled Markov chains with discounted rewards. Acta Appl. Math. 79, pp. 195–216. 17. Guo, X.P., Hern´andez-Lerma, O. (2003). Drift and monotonicity conditions for continuoustime controlled Markov chains with average criterion. IEEE Trans. Automatic Control. 48, pp. 236–244. 18. Hashemi, S.N., Heunis, A.J. (2005). On the Poisson equation for singular diffusions. Stochastics 77, pp. 155–189. 19. Hern´andez-Lerma, O. (1994). Lectures on Continuous-Time Markov Control Processes. Sociedad Matem´atica Mexicana, Mexico City. 20. Hern´andez-Lerma, O., Lasserre, J.B. (1996). Discrete-Time Markov Control Processes. Springer, New York. 21. Hern´andez-Lerma, O., Lasserre, J.B. (1999). Further Topics on Discrete-Time Markov Control Processes. Springer, New York. 22. Hilgert, N., Hern´andez-Lerma, O. (2003). Bias optimality versus strong 0-discount optimality in Markov control processes with unbounded costs. Acta Appl. Math. 77, 215–235. 23. Himmelberg, C.J., Parthasarathy, T., Van Vleck, F.S. (1976). Optimal plans for dynamic programming problems. Math. Oper. Res. 1, pp. 390–394. 24. Hordijk, A., Yushkevich, A.A. (1999). Blackwell optimality in the class of stationary policies in Markov decision chains with a Borel state and unbounded rewards. Math. Meth. Oper. Res. 49, pp. 1–39. 25. Hordijk, A., Yushkevich, A.A. (1999). Blackwell optimality in the class of all policies in Markov decision chains with a Borel state and unbounded rewards. Math. Meth. Oper. Res. 50, pp. 421–448. 26. Jasso-Fuentes, H., Hern´andez-Lerma, O. (2008). Characterizations of overtaking optimality for controlled diffusion processes. Appl. Math. Optim. 57, pp. 349–369. 27. 
Jasso-Fuentes, H., Hern´andez-Lerma, O. (2009). Ergodic control, bias and sensitive discount optimality for Markov diffusion processes. Stoch. Anal. Appl. 27, pp. 363–385. 28. Jasso-Fuentes, H., Hern´andez-Lerma, O. (2009). Blackwell optimality for controlled diffusion processes. J. Appl. Probab. 46, pp. 372–391.
222
H. Jasso-Fuentes and G. Yin
29. Jasso-Fuentes, H., Yin, G. (2012). Advanced criteria for controlled Markov-modulated diffusions in an infinite horizon: overtaking, bias, and Blackwell optimality. Submitted. 30. Kushner, H.J. (1967). Optimal discounted stochastic control for diffusions processes. SIAM J. Control 5, pp. 520–531. 31. Kushner, H.J. (1978). Optimality conditions for the average cost per unit time problem with a diffusion model. SIAM J. Control Optim 16, pp. 330–346. 32. Lasserre J.B. (1988). Conditions for the existence of average and Blackwell optimal stationary policies in denumerable Markov decision processes. J. Math. Analysis Appl. 136, pp. 479–490. 33. Le Van, C., Dana, R.-A. (2003). Dynamic Programming in Economics. Kluwer, Dordrecht. 34. Leizarowitz, A. (1988). Controlled diffusion process on infinite horizon with the overtaking criterion. Appl. Math. Optim. 17, pp. 61–78. 35. Leizarowitz, A. (1990). Optimal controls for diffusion in Rd —min-max max-min formula for the minimal cost growth rate. J. Math. Anal. Appl. 149, pp. 180–209. 36. Mao, X., Yuan, C. (2006). Stochastic differential equations with markovian switching. Imperial College Press, London. 37. Meyn, S.P., Tweedie, R.L. (1993). Stability of Markovian processes III: Foster-Lyapunov criteria for continuous-time processes. Adv. Appl. Prob. 25, pp. 518–548. 38. Prieto-Rumeau, T., Hernandez-Lerma, O. (2005). Bias and overtaking equilibria for zero-sum continuous-time Markov games. Math. Meth. Oper. Res. 61, 437–454. 39. Prieto-Rumeau, T., Hernandez-Lerma, O. (2005). The Laurent series, sensitive discount and Blackwell optimality for continuous-time controlled Markov chains. Math. Meth. Oper. Res. 61, 123–145. 40. Prieto-Rumeau, T., Hern´andez-Lerma, O. (2006). A unified approach to continuous-time discounted Markov control processes. Morfismos 10. pp. 1–40. 41. Prieto-Rumeau, T., Hern´andez-Lerma, O. (2006). Bias optimality of continuous-time controlled Markov chains. SIAM J. Control Optim. 45, pp. 51–73. 42. Puterman, M.L. (1994). Sensitive discount optimality in controlled one-dimensional diffusions. Ann. Probab. 2, pp. 408–419. 43. Puterman, M.L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming, J. Wiley, New York. 44. Ramsey, F.P. (1928). A mathematical theory of savings. Economics J. 38, pp. 543–559. 45. Ross, S. (1984). Introduction to Stochastic Dynamic Programming, Academic Press, New York. 46. Sch¨al, M. (1975). Conditions for optimality and for the limit of n-stage optimal policies to be optimal. Z. Wahrs. verw. Gerb. 32, 179–196. 47. Tan, H., Rugh, W.J. (1998). Nonlinear overtaking optimal control: sufficiency, stability, and approximation. IEEE Trans. Automatic Control 43, 1703–1718. 48. Tan, H., Rugh, W.J. (1998). On overtaking optimal tracking for linear systems. Syst. Control Lett. 33, 63–72. 49. Tarres, R. (1985). Asymptotic evolution of a stochastic problem. SIAM J. Control Optim. 23, pp. 614–631. 50. Veinott, A.F. Jr. (1969). Discrete dynamic programming with sensitive discount optimality criteria. Ann. Math. Statist. 40, 1635–1660. 51. von Weizs¨acker, C.C. (1965). Existence of optimal programs of accumulation for an infinite horizon. Rev. Economic Stud. 32, pp. 85–104. 52. Yin, G., Zhang, Q. (1998). Continuous-Time Markov Chains and Applications: A Singular Perturbation Approach, Springer-Verlag, New York. 53. Yin, G., Zhang, Q. (2005), Discrete-time Markov Chains: Two-time-scale Methods and Applications, Springer, New York. 54. Yin, G., Zhu, C. (2010). Hybrid switching diffusions. 
Springer, New York. 55. Zaslavski, A.J. (2006). Turnpike Properties in the Calculus of Variations and Optimal Control. Springer, New York.
12 Advanced Criteria for Controlled Hybrid Diffusions
223
56. Zhu, C., Yin, G. (2007). Asymptotic properties of hybrid diffusion systems. SIAM J. Control Optim. 46, pp. 1155–1179. 57. Zhu, Q., Guo, X.P. (2005). Another set of conditions for strong n (n = −1, 0) discount optimality in Markov decision processes. Stoch. Anal. Appl. 23, 953–974.
Chapter 13
Fluid Approximations to Markov Decision Processes with Local Transitions Alexey Piunovskiy and Yi Zhang
13.1 Introduction Markov decision processes (MDPs) model many practical problems that arise from queueing systems, telecommunication, inventories, and so on, see [6, 7, 15]. The fundamental results about an MDP model are the existence of an optimal policy and the sufficiency of the deterministic stationary policies out of the more general class of randomized history-dependent ones. On the other hand, from practical point of view, it is at least of equal importance to know how to obtain an optimal or nearly optimal policy. It is known that practically, the policy iteration and value iteration procedures fail to cope with MDP models with large state and action spaces. So for random walks, it is often the case that a deterministic continuous model is taken for analysis even when the underlying problem is in stochastic nature. This is called a fluid approximation. Fluid approximations are widely used to solve practical problems; examples in the contexts of epidemiology and telecommunication can be found in [11, 16], respectively. In inventory control, the well-known (deterministic) economic-order quantity model can be viewed as a fluid approximation, too, cf. [14]. On the one hand, such fluid models can often be solved much more easily than the corresponding stochastic models. On the other hand, in most cases, they are applied without formal analytical justifications, or the justification focuses on the trajectory level, by showing the trajectory of a scaled stochastic model converges in some
A. Piunovskiy () • Y. Zhang Department of Mathematical Sciences, University of Liverpool, Liverpool, L69 7ZL, UK e-mail:
[email protected];
[email protected] D. Hern´andez-Hern´andez and A. Minj´arez-Sosa (eds.), Optimization, Control, and Applications of Stochastic Systems, Systems & Control: Foundations & Applications, DOI 10.1007/978-0-8176-8337-5 13, © Springer Science+Business Media, LLC 2012
225
226
A. Piunovskiy and Yi Zhang
sense to the one of the fluid model, and this is mainly considered for a continuoustime model, see [8–10] and the references therein. For the justification on the objective level, we refer the reader to [3, 4, 13] and the references therein. In this chapter, we justify (at the level of objective functions) fluid approximations to a discrete-time MDP model with an undiscounted total cost criterion. This is done for an uncontrolled discrete-time model in [12] under more restrictive conditions. The argument is based on [1, 12, 13]. The rest of this chapter is organized as follows. We describe the concerned MDP model in Sect. 13.2 and formulate the main statements in Sect. 13.3, where two sections are devoted to the standard fluid approximation and the refined fluid approximation. We finish this chapter with a conclusion.
13.2 MDP Model The MDP model under consideration is defined by the following elements: • X = {0, 1, 2, . . . } is the state space. • A is the action space, which can be an arbitrary non-empty Borel space, whose topology is omitted from the explicit presentation. • p(z|x, a) is the one-step transition probability, a stochastic kernel on X given X × A and (Borel) measurable in a ∈ A. • r(x, a) is the one-step cost, which is (Borel) measurable in a ∈ A. Assume that the real measurable functions q+ (y, a), q− (y, a) and ρ (y, a) on [0, ∞) × A are given such that q+ (0, a) = q− (0, a) = ρ (0, a) = 0, and on (0, ∞) × A, q+ (y, a) > 0, q− (y, a) > 0, and q+ (y, a) + q− (y, a) ≤ 1. Then we make the MDP model with the absorbing state zero and local transitions only by defining the onestep transition probability and cost via ⎧ + q (x, a), if z = x + 1; ⎪ ⎪ ⎨ − if z = x − 1; q (x, a), p(z|x, a) = + − ⎪ 1 − q (x, a) − q (x, a), if z = x; ⎪ ⎩ 0 otherwise, r(x, a) = ρ (x, a). Let ϕ : X → A be a deterministic stationary policy. For any fixed initial state x and ϕ policy ϕ , the standard canonical construction gives a strategic measure Px on the space of histories in the form of x0 , a0 , x1 , a1 , . . . , and the corresponding expectation ϕ is denoted by Ex , see [7]. We denote the controlled process by {Xt ,t = 0, 1, . . . } and the action process by {At ,t = 0, 1, . . . }. Then the MDP model is the following optimization problem, which is well defined after we impose some conditions below: V ϕ (x) := Exϕ
∞
∑ r(Xt , At )
t=0
→ inf, ϕ
13 Fluid Approximations to Markov Decision Processes with Local Transitions
227
where the infimum is taken over the class of deterministic stationary policies only for simplicity and that under very general conditions, they suffice for the underlying optimization problem, see [1] for more details. In this chapter, we shall actually scale the above described MDP model such that for any fixed scaling parameter n = 1, 2, . . . , the elements of the n-MDP model are as follows: • X = {0, 1, 2, . . . } and A remain as the state and action spaces. • ⎧ + q (x/n, a), if z = x + 1; ⎪ ⎪ ⎨ − q (x/n, a), if z = x − 1; n p(z|x, a) = + (x/n, a) − q−(x/n, a), if z = x; ⎪ 1 − q ⎪ ⎩ 0 otherwise •
is the one-step transition probability. n r(x, a) = ρ (x/n,a) is the one-step cost, which is measurable in a ∈ A. n
The n-MDP model reads n ϕ
V (x)
:= Exϕ
∞
∑
t=0
n
r(Xt , At ) → inf . ϕ
Below, we impose some conditions to guarantee that the n-MDP model under consideration is absorbing in the sense of [1]. To be exact, that means given any ϕ initial state x, Ex [T0 ] < ∞, where T0 := inf{t > 0 : Xt = 0}. Here and below, the context always makes it clear what the scaling parameter is so that the controlled and action processes in the n-MDP model are still denoted by {Xt ,t = 0, 1, . . . } and {At ,t = 0, 1, . . . } for brevity. The above scaling is called the fluid scaling. Its intuitive meaning together with its importance is now explained via the following example, where an uncontrolled situation is considered for simplicity. Accordingly, simpler denotations are employed. We remark that the example is better understood in the context of telecommunication, where fluid models are widely used to solve satisfactorily practical problems of stochastic nature, see [2, 16] and the reference therein. Example 13.2.1. Suppose information packets, 1 kilobit (KB) each, arrive at a router (switch) at the (constant) rate q+ > 0 megabit/second (MB/s), and are served at the (constant) rate q− > q+ MB/s, where q+ + q− ≤ 1. We observe the process up to the moment when the router buffer is empty. Let the holding cost be h per MB per second, so that ρ (y) = hy, where y is the amount of information (MB). For simplicity, we consider the uncontrolled model so that the denotation of the policy ϕ does not appear. One can consider batch arrivals and batch services of 1,000 packets every second; then n = 1 and r(x) = hx = ρ (x). On the other hand, it would be more accurate to consider particular packets; then 1 probabilities q+ and q− will be the same, but the time unit is 1000 s, so that the arrival and service rates (MB/s) remain the same. Remembering that, we consider
228
A. Piunovskiy and Yi Zhang
information up to the individual packets (cf. batches) and the time unit is the cost function for the n-model will obviously change: n r(x) = where n = 1,000.
hx 1 1000 1000
=
1 1000 s, ρ (x/n) n ,
2
The goal of this chapter is to estimate (from the above) the differences between ϕ n ϕ n V ( X0 ) := En X0
∞
∑ n r(Xt , At )
t=0
and the objective functions of two related deterministic continuous models, namely, the standard fluid model and the refined fluid model, which are simpler to solve. So they are regarded and used as the fluid approximations to the original (stochastic) MDP model, see [11, 16] for examples. In greater detail, under some conditions, we provide explicit upper boundary estimates of the absolute differences between the objective functions of the stochastic and the corresponding fluid models in the initial data, which are understood as the level of accuracy of such fluid approximations. In a nutshell, under the imposed conditions, the absolute differences go to zero as fast as 1n , with n being the scaling parameter.
13.3 Main Statements 13.3.1 Standard Fluid Approximation Firstly, we motivate the standard fluid model by using Example 13.2.1. Then we give its formal definition and obtain its level of accuracy in approximating the n-MDP model. Example 13.2.1 continued. Consider the situation in Example 13.2.1. The total holding cost of the n-model nV (x) up to the absorption at the state zero satisfies the equation (cf.[12, (10)])
ρ ( nx ) + q+ nV (x + 1) + q− nV (x − 1) − (q+ + q− ) nV (x) = 0, n x = 1, 2, . . . ; nV (0) = 0.
(13.1)
Since we measure information in MB, it is reasonable to introduce the function v(y) ˆ such that nV (x) = v(x/n), ˆ where v(y) ˆ is the total holding cost up to the absorption given the initial queue was y MB. After the obvious rearrangements of (13.1), we obtain that v(x/n) ˆ satisfies
ρ
x n
+
q+ vˆ nx + 1n − vˆ nx 1 n
+
q− vˆ nx − 1n − vˆ nx 1 n
= 0,
13 Fluid Approximations to Markov Decision Processes with Local Transitions
229
This is a version of the Euler method for solving the differential equation
ρ (y) + (q+ − q− )
dv(y) = 0. dy
(13.2)
Thus, we expect that nV (x) = v(x/n) ˆ ≈ v(x/n) at least for a big value of n.
2
The above example reveals that as far as the objective function is concerned, the n-MDP model can be approximated by a deterministic continuous model specified by a differential equation, at least for a big value of n. This gives the rise to the following standard fluid model. The standard fluid model: vψ (y0 ) := subject to
∞
ρ (y(τ ), ψ (y(τ )))dτ → inf ψ
0
dy = q+ (y, ψ (y)) − q− (y, ψ (y)), with the given initial state y(0). dτ
Here, ψ is a measurable mapping from [0, ∞) to A. Later, we often omit the argument τ from y(τ ) for brevity. Under the conditions of Theorem 13.3.1 below, it can be seen that vψ (y) =
y 0
ρ (z, ψ (z)) dz, q− (z, ψ (z)) − q+ (z, ψ (z))
(13.3)
cf. (13.2). Theorem 13.3.1 (cf. Theorem 1 in [12]). Let n = 1, 2, . . . and a policy ψ for the fluid model be fixed, and ϕˆ be given by ϕˆ (x) := ψ (x/n). Suppose q− (y, ψ (y)) |ρ (y, ψ (y))| ≥ η˜ > 1, sup < ∞, y>0 q+ (y, ψ (y)) ηy y>0
q− (y, ψ (y)) > q > 0, inf
where q > 0 is a constant, and η ∈ (1, η˜ ). Then: nV ϕˆ (x)|
(a) supx=1,2,... |
η x/n
< ∞, i.e., nV ϕˆ (·) is n η -bounded, where n η (x) := η x/n .
(b) If additionally the functions q+ (y, ψ (y)), q− (y, ψ (y)), ρ (y, ψ (y)) are piecewise continuously differentiable, then for an arbitrarily fixed yˆ ≥ 0 lim
sup
n→∞ x∈{0,1,...,[ny]} ˆ
| nV ϕˆ (x) − vψ (x/n)| = 0,
where the function [·] takes the integer part of its argument.
230
A. Piunovskiy and Yi Zhang
(c) If furthermore functions ρ (y, ψ (y)), q− (y, ψ (y)), q+ (y, ψ (y)) are continuously 2ψ v (y) differentiable with uniformly bounded derivatives so that supy>0 d dy 2 := C < ∞, then for any arbitrarily fixed yˆ ≥ 0, C(3η˜ + 1) (yˆ + 1)η˜ yˆ n ϕˆ . V (x) − vψ (x/n) ≤ 2γ η˜ n x∈{0,1,...,[ny]} ˆ sup
The detailed proof of the above theorem can be found in [12], see the proof of Theorem 1 therein. The next example shows that the condition supy>0 |ρ (y,ηψy(y))| < ∞ in the above theorem is important. Example 13.3.2 (cf. Example 3 in [12]). Let A = [1, 2], q+ (y, a) = ad + , q− (y, a) = ad − for y > 0, where d − > d + > 0 are fixed numbers such that 2(d + + d − ) ≤ 1. Put − 2 ρ (y, a) = a2 γ y , where γ > 1 is a constant. So η˜ = dd + > 1. To solve the fluid model vψ (y) → infψ , we use the dynamic programming approach. One can see that the Bellman function v∗ (y) := infψ vψ (y) has the form ∗
v (y) =
y
inf
0 a∈A
ρ (u, a) du, q− (u, a) − q+(u, a)
and satisfies the Bellman equation inf
a∈A
dv∗ (y) + q (y, a) − q−(y, a) + ρ (y, a) = 0; v∗ (0) = 0, dy
cf. [13, Lemma 2] and the “incidental” statement in its proof. Here, we remark ρ (u,a) that the function infa∈A q− (u,a)−q+ (u,a) is universally measurable, see [5, Chap. 7] for more details. Hence, the function ∗
v∗ (y) = vψ (y) =
y 0
2
γu du − d − d+
is well defined, and ψ ∗ (y) ≡ 1 is the optimal policy. ∗ We notice that the condition supy>0 |ρ (y,ψη y (y))| < ∞ is not satisfied, while all the other requirements of Theorem 13.3.1 are met. On the other hand, for the policy given by ϕˆ (x) = ψ ∗ ( nx ) ≡ 1 and n = 1, 2, . . . , nV ϕˆ (x) = n E ϕˆ [ ∞ n r(X , A )] satisfies the following equation ∑t=0 t t x 2
γ( n ) + d + nV ϕˆ (x + 1) + d − nV ϕˆ (x − 1) − (d + + d −)nV ϕˆ (x) = 0; nV ϕˆ (0) = 0, n
13 Fluid Approximations to Markov Decision Processes with Local Transitions
231
cf. (13.1). But then, this equation does not admit non-negative finitely valued solutions, because if we put nV ϕˆ (0) = 0, nV ϕˆ (1) = b, where b ∈ [0, ∞) is a nonnegative constant, then for any x = 1, 2, . . . , n ϕˆ
V (x) = b
x−1 1 η˜ x − 1 2 − + γ ( j/n) (η˜ x− j − 1), ∑ η˜ − 1 nd (η˜ − 1) j=1
and thus, for a big enough value of x, we obtain nV ϕˆ (x) < 0. Therefore, Theorem 13.3.1 does not hold; nV ϕˆ (x) = ∞ for all x = 1, 2, . . . . 2
13.3.2 Refined Fluid Approximation Under the conditions of Theorem 13.3.1 except for q− (y, ψ (y)) > q, (13.3) may not hold, which in comparison with (13.2), suggests that the standard fluid approximation may fail to be accurate in this case; since q− (y) can now approach zero, and q+ (y) < q− (y), it could happen that the standard fluid model does not get absorbed at the state zero, while for any fixed n = 1, 2, . . . , the stochastic process {n Xt ,t = 0, 1, . . . } indeed gets absorbed at the state zero, so that the standard fluid model and the n-MDP model could behave qualitatively differently. Example 13.3.4 below illustrates this situation. Nevertheless, in this case, the refined fluid model introduced below still approximates well the n-MDP model under the following condition. Let ψ (·) be a measurable mapping from [0, ∞) to A. We formulate the following condition. −
|ρ (y,ψ (y))| ψ (y)) Condition A. (a) infy>0 qq+ (y, ≥ η˜ > 1, supy>0 {q+ (y,ψ (y))+q − (y,ψ (y))}η y ≤ C1 < (y,ψ (y)) ˜ ∞, where η ∈ (1, η ). (b) For any n, there exists an (n-dependent) constant K(n) such that n
lW (x) := K(n)q− (x/n, ψ (x/n))η x/n − 1 > 0, x = 1, 2, . . . ,
and |ρ (x/n, ψ (x/n))| |ρ (x/n, ψ (x/n))| = sup < ∞, n l (x) − (x/n, ψ (x/n))η x/n − 1 K(n)q W x=1,2,... x=1,2,... sup
where η ∈ (1, η˜ ) comes from part (a) of this condition. (c) There exist points y1 , y2 , . . . , yl , . . . with yl → ∞ as l → ∞ such that on each of (y,ψ (y)) the intervals (0, y1 ), (y1 , y2 ), . . . , the function q− (y,ψ ρ(y))−q + (y,ψ (y)) is Lipschitz continuous. Simple sufficient conditions for Condition A(b) are given below: see Proposition 13.3.1 and its proof.
232
A. Piunovskiy and Yi Zhang
Refined fluid model: ∞ 0
subject to
ρ (y, ψ (y))
q+ (y, ψ (y)) + q− (y, ψ (y))
du → inf ψ
q+ (y, ψ (y)) − q− (y, ψ (y))
dy = , with a given initial state y(0). du q+ (y, ψ (y)) + q− (y, ψ (y))
One can show under Condition A that v˜ψ (y0 ) :=
∞ 0 +
ρ (y, ψ (y)) dτ = + q (y, ψ (y)) + q− (y, ψ (y)) −
ψ (y))−q (y,ψ (y)) Note that qq+ (y, ≤ (y,ψ (y))+q− (y,ψ (y)) at the state zero in finite time.
1−η˜ 1+η˜
y0 0
ρ (z, ψ (z))
q− (z, ψ (z)) − q+ (z, ψ (z))
dz.
< 0, so that y(·) in the fluid model is absorbed
Theorem 13.3.2. Let n = 1, 2, . . . and yˆ > 0 be fixed, ψ a measurable mapping from [0, ∞) to A, and ϕˆ (·) given by ϕˆ (x) := ψ (x/n). Suppose Condition A is satisfied (y,ψ (y)) for ψ so that there exist an integer L such that the function q− (y,ψρ(y))−q + (y,ψ (y)) is Lipschitz continuous with the common Lipschitz constant D on the intervals (0, y1 ), (y1 , y2 ), . . . , (yL−1 , yL ), (yl , yL+1 ) with yL < yˆ + 1 ≤ yL+1 . Then ε (n) n ϕˆ . V (x) − v˜ψ (x/n) ≤ 2 x∈{0,1,...,[yn]} ˆ sup
Here and below, we put 2K1 2K2 + n + 2K3 (η 1/n − 1), n η˜
ε (n) :=
η˜ + 1 ˆ [D(yˆ + 1) + 3C1Lη y+1 ], η˜ − 1 ˆ η˜ + 1 η˜ 2 2(η˜ + 1) η y+1 C1 1 + , K2 := η˜ − 1 (η˜ − 1) ln η η˜ − η ˆ η˜ + 1 2 3C1 Lη y+1 . K3 := η˜ − 1 ln η K1 :=
Proof. Consider an n-birth-and-death process model whose birth rate is nq+ (x/n, ϕˆ (x)), death rate is nq− (x/n, ϕˆ (x)) and cost rate is ρ (x/n, ϕˆ (x)). The underlying process is denoted by {Yt ,t ≥ 0}. Then we have n
W ϕˆ (x) := Exϕˆ
0
∞
ρ (Yt /n, ϕˆ (Yt ))dt = nV ϕˆ (x), x = 0, 1, . . . .
13 Fluid Approximations to Markov Decision Processes with Local Transitions
233
Indeed, both V ϕˆ (x) and nW ϕˆ (x) are given by the unique n η -bounded solution to the equation 0 = ρ (x/n, ϕˆ (x)) + V (x + 1)nq+(x/n, ϕˆ (x)) + nq−(x/n, ϕˆ (x))V (x − 1) −n(q+ (x/n, ϕˆ (x)) + q− (x/n, ϕˆ (x)))V (x), x = 1, 2, . . . ; 0 = V (0) by Remark 13.3.1, [12, Lem. 1, Lem. 2, Lem. 4] and [13, Lem. 1]. See also [1, Chap. 7]. It remains to apply [13, Thm.1]. 2 nΠ
For any n = 1, 2, . . . , let us consider a subclass of deterministic stationary policies whose elements are ϕ such that the following are satisfied: q− (x/n, ϕ (x)) ≥ η˜ > 1, x=1,2,... q+ (x/n, ϕ (x)) inf
|ρ (x/n, ϕ (x))|
sup x=1,2,...
{q+ (x/n, ϕ (x)) + q−(x/n, ϕ (x))}η x
≤ C1 < ∞,
where η ∈ (1, η˜ ), and there exists an (ϕ , n-dependent) constant K ϕ (n) satisfying K ϕ (n)q− (x/n, ϕ (x))η x/n − 1 > 0, x = 1, 2, . . . , and sup x=1,2,...
|ρ (x/n, ϕ (x))| K ϕ (n)q− (x/n, ϕ (x))η x/n − 1
< ∞.
Note that if there exists a ψ satisfying Condition A, then the set n Π is non-empty as ϕˆ (x) := ψ (x/n) ∈ n Π for all n = 1, 2, . . . . Under Condition B below, for any n = 1, 2, . . . , n Π coincides with the whole class of deterministic stationary policies, see Proposition 13.3.1 below. Remark 13.3.1. For any fixed n = 1, 2, . . . , one can verify that under a fixed policy ϕ ∈ n Π , the n-MDP model admits a Lyapunov function n
lL (x) := Dη x/n , x = 1, 2, . . . , n lL (0) := 2,
where D ≥ 1 is a big enough constant, and a weight function n
lW (x) := K ϕ (n)q− (x/n, ϕ (x))η x/n − 1, x = 1, 2, . . . , n l2 (0) := 1,
cf. [1, Chap.7] and [12, Con.1].
234
A. Piunovskiy and Yi Zhang −
|ρ (y,a)| Condition B. (a) infy>0,a∈A qq+ (y,a) ≥ η˜ > 1, supy>0,a∈A (q+ (y,a)+q − (y,a))η y ≤ C1 < ∞, (y,a) where η ∈ (1, η˜ ). (b) lim infy→∞ infa∈A q− (y, a) > 0.
Proposition 13.3.1. Suppose Condition B is satisfied. Then for any deterministic stationary policy ϕ , it holds that ϕ ∈ n Π for all n = 1, 2, . . . . Proof. Let n = 1, 2, . . . be fixed and consider an arbitrarily fixed ϕ . Then under Condition B, there exists a ζ > 0 such that q− (x/n, ϕ (x)) > ζ , x = 1, 2, . . . . So there exists a constant K ϕ (n) > 0 such that q− (x/n, ϕ (x))η x/n − K ϕ1(n) > ζ˜ , x = 1, 2, . . . where ζ˜ > 0. Thus, K ϕ (n)q− (x/n, ϕ (x))η x/n − 1 > 0, and sup x=1,2,...
≤
|ρ ( nx , ϕ (x))|
x
K ϕ (n)q− ( nx , ϕ (x))η n − 1 1 K ϕ (n)
sup x=1,2,...
=
|ρ ( nx , ϕ (x))| 1 sup x K ϕ (n) x=1,2,... q− ( nx , ϕ (x))η n − K ϕ1(n)
|ρ (x/n, ϕ (x))| (q+ (x/n, ϕ (x)) + q−(x/n, ϕ (x)))η x/n
2q− (x/n, ϕ (x))η x/n x/n − 1 − x=1,2,... q (x/n, ϕ (x))η K ϕ (n)
× sup ≤
2C1 q− (x/n, ϕ (x))η x/n sup − < ∞. ϕ K (n) x=1,2,... q (x/n, ϕ (x))η x/n − K ϕ1(n)
The other requirements for a policy to be in m Π are satisfied by ϕ is evident.
2
∗
Theorem 13.3.3. Suppose the policy ψ solves the refined fluid model and satisfies Condition A. Then for any fixed n = 1, 2, . . . , sup x∈{0,1,...,[ny]} ˆ
n ϕ ∗ V (x) − inf nV ϕ (x) ≤ ε (n), ϕ ∈n Π
where ϕ ∗ (x) := ψ ∗ (x/n), x = 0, 1, 2, . . . . Proof. It can be shown that ∀ ϕ ∈ n Π , nV ϕ (x) ≥ v˜ψ (x/n) − ε (n) 2 . The proof is similar to the one of Theorem 13.3.2. On the other hand, from Theorem 13.3.2, we ∗ ∗ n ϕ n have infϕ ∈n Π nV ϕ (x) ≤ nV ϕ ≤ v˜ψ (x/n) + ε (n) 2 2 ≤ infϕ ∈ Π V (x) + ε (n). ∗
The above theorem asserts that if one solves the refined fluid model and obtains the optimal policy ψ ∗ , then the policy given by ϕ ∗ (x) = ψ ∗ (x/n) is ε (n)-optimal in the underlying n-MDP, and ε (n) goes to zero as n grows large. The next proposition comes from [13, Lem.3].
13 Fluid Approximations to Markov Decision Processes with Local Transitions
235
Proposition 13.3.2. Under Condition B, assume that there exist finite intervals (0, y 1 ), (y 1 , y 2 ), . . . , with lim j→∞ y j = ∞, such that on each of them, the function ρ (y,a) q− (y,a)−q+ (y,a)
is Lipschitz continuous with respect to y for each a ∈ A, and the Lipschitz constants are a-independent. Then for any fixed yˆ > 0, there exists an ψ ∗ satisfying Condition A and solving the refined fluid model on [0, y]. ˆ Now, we give an example where the main results (Theorems 13.3.2 and 13.3.3) of this work are applicable. In fact, by Propositions 13.3.1 and 13.3.2, it suffices to verify Condition B and the other condition of Proposition 13.3.2. Example 13.3.3. Consider a discrete-time single-server queueing system, where during each time step, the probability of having an arrival is given by the function a , and the probability of having a service completion (given there q+ (y, a) = 2y+2+a
2y+2 is at least one job) is given by the function q− (y, a) = 2y+2+a , where y > 0 and 1 a ∈ A := [ 2 , 1]. Suppose the cost function is given by ρ (y, a) = 2y − a, which means that we aim at minimizing the holding cost, which is incurred at a rate of 2£ per time step, while admitting more jobs is encouraged. The state zero is taken as the absorbing state, so that q− (0, 0) = q+ (0, 0) = ρ (0, 0) = 0. It is easy to see that
q− (y, a) 2 + 2y = inf = 2 =: η˜ > 1, + 1 1 a y>0,a∈[ 2 ,1] q (y, a) y>0,a∈[ 2 ,1] inf
sup y>0,a∈[ 12 ,1]
|ρ (y, a)| (q+ (y, a) + q−(y, a))(1.5)y
=
|ρ (y, a)| < ∞, (1.5)y
1 < η := 1.5 < η˜ , lim inf inf q− (y, a) > 0, y→∞ a∈[ 1 ,1] 2
and the function given by (2y − a)(2y + 2 + a) 4a ρ (y, a) = = 2y + a − q− (y, a) − q+ (y, a) 2y + 2 − a 2y + 2 − a is obviously Lipschitz in y > 0 for any a ∈ [ 12 , 1]. Hence, Condition B and the other condition of Proposition 13.3.2 are satisfied by this example. 2 The next example indicates that the condition q− (y, ψ (y)) > q > 0 is important for the standard fluid model to approximate the underlying MDP model accurately. It also illustrates that the standard and the refined fluid models can behave qualitatively differently.
236
A. Piunovskiy and Yi Zhang
Fig. 13.1 The graph of q− (y)
Example 13.3.4 (cf. Example 1 in [13]). For brevity, we deal with an uncontrolled model, i.e., A is taken as a singleton, so that denotations such as q− (y), q+ (y)ρ (y) are used for brevity. We put q− (y) = 0.1I{y ∈ (0, 1]} + 0.125(y − 1)2I{y ∈ (1, 3]} + 0.5I{y > 3}, q+ (y) = 0.2q−(y), ρ (y) = 8q− (y). Clearly, q− (y) is not separated from zero, see Fig. 13.1, while Condition A is satisfied. For the original fluid model, we have ddyτ = −0.1 (y − 1)2 , and, if the initial state 10 y0 = 2, then y(τ ) = 1 + τ +10 , so that limτ →∞ y(τ ) = 1. On the other hand, since q− (y), q+ (y) > 0 for y > 0, and there is a negative trend, the state process Xt in the n-stochastic model starting from n X0 /n = y0 = 2 will be absorbed at zero, see [1, Lem. 7.2, Def. 7.4], while the moment of the absorbtion is postponed for later and later as n → ∞ because the process spends more and more time in the neighborhood of 1, see Figs. 13.2 and 13.3. When using the original fluid model, we have v(2) =
∞ 0
ρ (y(τ ))dτ = 10 = lim lim E2n T →∞ n→∞
∞
∑ I{t/n ≤ T }
n
r(Xt−1 , At ) .
t=1
When using the refined fluid model, we can calculate
ρ (y) q+ (y)+q− (y)
=
8 1.2
for y > 0
so that the process in the refined fluid model is absorbed at the and y(u) = 2 − state zero at the time moment u = 3. Therefore, 2 3 u,
lim nV (2n) = v(2) ˜ =
n→∞
∞ 0
ρˆ (y(u))du =
= lim lim E2n n→∞ T →∞
∞
3 8 0
1.2
∑ I{t/n ≤ T }
n
du = 20
r(Xt−1 , At ) = v(2).
t=1
So the standard fluid model fails to be accurate in this example.
2
13 Fluid Approximations to Markov Decision Processes with Local Transitions
237
2 1.8 1.6 1.4 1.2 1 0.8 0.6 0.4 0.2 0
0
50
100
150
200
250
300
Fig. 13.2 The state process in the n-stochastic model and its fluid approximation, n = 7 2 1.8 1.6 1.4 1.2 1 0.8 0.6 0.4 0.2 0
0
50
100
150
200
250
300
Fig. 13.3 The state process in the n-stochastic model and its fluid approximation, n = 15
238
A. Piunovskiy and Yi Zhang
13.4 Conclusion In this chapter, the convergence of the objective function of a scaled absorbing MDP model, with a total undiscounted cost, to the one of the (standard and refined) fluid model is shown. The upper boundary estimate of the rate of convergence is presented based on the initial data, which is of order 1n , where n is the scaling parameter. Hence, the level of accuracy of the fluid approximation is obtained. By examples, we also show that the standard fluid model may fail to approximate the n-MDP model, while the refined fluid model is still accurate. Acknowledgements Mr. Mantas Vykertas kindly helped us improve the English presentation of this chapter. We thank the referee for valuable comments, too.
References 1. Altman, E.: Constrained Markov Decision Processes. Chapman and Hall/CRC, Boca Raton (1999) 2. Avrachenkov, K., Ayesta, U., Piunovskiy, A.: Convergence of trajectories and optimal buffer sizing for AIMD congestion control. Perform. Evaluation. 67, 501–527 (2010) 3. Avrachenkov, K., Piunovskiy, A., Zhang, Y.: Asymptotic fluid optimality and efficiency of tracking policy for bandwidth-sharing networks. J. Appl. Probab. 48, 90–113 (2011) 4. B¨auerle, N.: Optimal control of queueing networks: an approach via fluid models. Adv. Appl. Prob. 34, 313–328 (2002) 5. Bertsekas, D., Shreve, S.: Stochastic Optimal Control. Academic Press, NY (1978) 6. Jacko, P., Sans´o, B.: Optimal anticipative congestion control of flows with time-varying input stream. Perform. Evaluation. 69, 86–101 (2012) 7. Hern´andez-Lerma, O., Lasserre, J.: Discrete-time Markov Control Processes. Springer-Verlag, NY (1996) 8. Dai, J.: On positive Harris recurrence of multiclass queueing networks: a unified approach via fluid limit models. Ann. Appl. Prob. 5, 49–77 (1995) 9. Foss, S., Kovalevskii, A.: A stability criterion via fluid limits and its application to a Polling system. Queueing. Syst. 32, 131–168 (1999) 10. Mandelbaum, A., Pats, G.: State-dependent queues: approximations and applications. In Kelly, F., Williams, R. (eds.) Stochastic Networks, pp. 239–282. Springer, NY (1995) 11. Piunovskiy, A., Clancy, D.: An explicit optimal intervention policy for a deterministic epidemic model. Optim. Contr. Appl. Met. 29, 413–428 (2008) 12. Piunovskiy, A.: Random walk, birth-and-death process and their fluid approximations: absorbing case. Math. Meth. Oper. Res. 70, 285–312 (2009) 13. Piunovskiy, A., Zhang, Y.: Accuracy of fluid approximations to controlled Birth-and-Death processes: absorbing case. Math. Meth. Oper. Res. 73, 159–187 (2011) 14. Piunovskiy, A., Zhang, Y.: On the fluid approximations of a class of general inventory leveldependent EOQ and EPQ models. Adv. Oper. Res. (2011) doi: 10.1155/2011/301205 15. Puterman, M.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, NY (1994) 16. Zhang, Y., Piunovskiy, A., Ayesta, U., Avrachenkov, K.: Convergence of trajectories and optimal buffer sizing for MIMD congestion control. Com. Com. 33, 149–159 (2010)
Chapter 14
Minimizing Ruin Probabilities by Reinsurance and Investment: A Markovian Decision Approach Rosario Romera and Wolfgang Runggaldier
14.1 Introduction Consider a classical Cram´er-Lundberg model Nt
Xt = x − ct − ∑ Yi ,
(14.1)
i=1
where the claim number process {Nt } is a Poisson process with intensity λ , and the claim payment {Yt } is a sequence of independent and identically distributed (i.i.d.) positive random variables independent of {Nt } and with support on the positive half line. Let c be the constant premium rate (income) paid by the insurer. We assume the safety loading condition c > λ E[Y ]. One of the key quantities in the classical risk model is the ruin probability, as a function of the initial reserve x,
ψ (x) = Pr{Xt < 0 for some t > 0}. In general, it is very difficult to derive explicit and closed expressions for the ruin probability. The pioneering works on approximations to the ruin probability were achieved by Cram´er and Lundberg as early as the 1930s under Cram´er-Lundberg condition. This condition is to assume that there exists a constant κ > 0 called adjustment coefficient, satisfying the following Lundberg equation: ∞ 0
c ¯ = , eκ y F(y)dy λ
R. Romera () University Carlos III, Madrid 28903, Spain e-mail:
[email protected] W. Runggaldier Dipartimento di Matematica Pura ed Applicata, University of Padova, Padova 35121, Italy e-mail:
[email protected] D. Hern´andez-Hern´andez and A. Minj´arez-Sosa (eds.), Optimization, Control, and Applications of Stochastic Systems, Systems & Control: Foundations & Applications, DOI 10.1007/978-0-8176-8337-5 14, © Springer Science+Business Media, LLC 2012
239
240
R. Romera and W. Runggaldier
with F(y) = Pr{Y ≤ y}. Under this condition, the Cram´er-Lundberg asymptotic formula states that if ∞ 0
where P(y) =
1 y ¯ F(s)ds E[Y ] 0
yeκ y dP(y) < ∞,
is the equilibrium distribution of F, then
ψ (x) ≤ e−κ x , x ≥ 0.
(14.2)
The insurer has now the possibility to reinsure the claims. In the case of proportional reinsurance Schmidli [22] showed that there exists an optimal strategy that can be derived from the corresponding Hamilton-Jacobi-Bellman equation. Hipp and Vogt [15] derived by similar methods the same result for excess of loss reinsurance. In Sch¨al [21], the control problem of controlling ruin by investment in a financial market is studied. The insurance business is described by the usual Cram´er-Lundberg-type model, and the risk driver of the financial market is a compound Poisson process. Conditions for investment to be profitable are derived by means of discrete-time dynamic programming. Moreover, Lundberg bounds are established for the controlled model. Diasparra and Romera [3, 4] consider a discrete-time process driven by proportional reinsurance and an interest rate process which behaves as a Markov chain. To reduce the risk of ruin, the insurer may reinsure a part or even all of the reserve. Recursive and integral equations for the ruin probabilities are given, and generalized Lundberg inequalities for the ruin probabilities are derived. We consider a discrete-time insurance risk/reserve process which can be controlled by reinsurance and investment in the financial market, and we study the ruin probability problem in the finite-horizon case. Although controlling a risk/reserve process is a very active area of research (see [2, 6, 16, 24, 26], and references therein), obtaining explicit optimal solutions minimizing the ruin probability is in general a difficult task even for the classical Cram´er-Lundberg risk process. Thus, an alternative method commonly used in ruin theory is to derive inequalities for ruin probabilities. The inequalities can be used to obtain upper bounds for the ruin probabilities [8, 23, 27], and this is the approach followed in this chapter. The basis of this approach is the well-known fact that in the classical Cram´er-Lundberg model, if the claim sizes have finite exponential moments, then the ruin probability decays exponentially as the initial surplus increases (see for instance the book by Asmussen [1]). For the heavy-tailed claims’ case, it is also shown to decay with a rate depending on the distribution of the claim size (see, e.g., [7]). Paulsen [18] reviews general processes for the ruin problem when the insurance company invests in a risky asset. Xiong and Yang [25] give conditions for the ruin probability to be equal to 1 for any initial endowment and without any assumption on the distribution of the claim size as long as it is not identically zero. Control problems for risk/reserve processes are commonly formulated in continuous time. Sch¨al [20] introduces a formulation of the problem where events (arrivals of claims and asset price changes) occur at discrete points in time that may be
14 Minimizing Ruin Probabilities
241
deterministic or random, but their total number is fixed. Diasparra and Romera [3] consider a similar formulation in discrete time. Having a fixed total number of events implies that in the case of random time points the horizon is random as well. In this chapter, we follow an approach inspired by Edoli and Runggaldier [5] who claim that a more natural way to formulate the problem in case of random time points is to consider a given fixed time horizon so that also the number of event times becomes random, and this makes the problem nonstandard. Accordingly, it is reasonable to assume that also the control decisions (level of reinsurance and amount invested) correspond to these random time points. Notice that this formulation can be seen equivalently in discrete or continuous time. The stochastic elements that affect the evolution of the risk/reserve process are thus the timing and size of the claims, as well as the dynamics of the prices of the assets in which the insurer is investing. This evolution is controlled by the sequential choice of the reinsurance and investment levels. Claims occur at random points in time and also their sizes are random, while asset price evolutions are usually modeled as continuous-time processes. On small time scales, prices actually change at discrete random time points and vary by tick size. In the proposed model, we let also asset prices change only at discrete random time points with their sizes being random as well. This will allow us to consider the timing of the events, namely, the arrivals of claims and the changes of the asset prices, to be triggered by a same continuous-time semiMarkov process (SMP), that is, a stochastic process where the embedded jump chain (the discrete process registering what values the process takes) is a Markov chain, and where the holding times (time between jumps) are random variables, whose distribution function may depend on the two states between which the move is made. Since between event times the situation for the insurer does not change, we shall consider controls only at event times. Under this setting, we construct a methodology to achieve an optimal solution that minimizes the bounds of the ruin probability previously derived. Admissible strategies ranging in a compact set are in fact reinsurance levels and investment positions. From a general perspective, and due to the Markovian feature of the risk process, it seems quite natural to look at the minimization of the ruin probability as a Markov decision problem (MDP) for which suitable MDP techniques may hold. Although this is not a standard approach in actuarial risk models, we present in this chapter a connection between our optimization problem and the use of MDP techniques. Many of the most relevant contributions in the literature related to MDP techniques have been developed by Onesimo Hern´andez-Lerma and his coauthors, and some of them have inspired the optimization part of this chapter [9–14]. The rest of the chapter is organized as follows: Section 14.2 describes the elements of the considered model and introduces the formulation of the risk process. Section 14.3 presents the previous recursive relations on ruin probabilities needed to derived our main contributions on the exponential bounds for the ruin probabilities. In Sect. 14.4, the derivation of the reinsurance and investment policy that minimizes an exponential bound is described in connection with MDP techniques, namely, policy improvement and value iteration.
242
R. Romera and W. Runggaldier
14.2 The Risk Process We start this section by fixing the elements of the model studied in this chapter. According to the model proposed in Romera and Runggaldier [19], we consider a finite time horizon T > 0. More precisely, to model the timing of the events (arrival of claims and asset price changes), inspired by Sch¨al [21], we introduce the process {Kt }t>0 for t ≤ T , a continuous-time SMP on {0, 1}, where Kt = 0 holds for the arrival of a claim, and Kt = 1 for a change in the asset price. The embedded Markov chain, that is, the jump chain associated to the SMP {Kt }t>0 , evolves according to a transition probability matrix P = pi j i, j∈{0,1} that is supposed to be given, and the holding times (time between jumps) are random variables whose probability distribution function may depend on the two states between which the move is made. Let Tn be the random time of the n − th event, n ≥ 1, and let the counting process Nt denote the number of events having occurred up to time t defined as follows: Nt =
∞
∑ 1{Tj ≤t}
(14.3)
j=1
and so Tn = min{t ≥ 0|Nt = n}.
(14.4)
We introduce now the dynamics of the controlled risk process Xt for t ∈ [0, T ] with T a given fixed horizon. For this purpose, let Yn be the n − th (n ≥ 1) claim payment represented by a sequence of i.i.d. random variables with common probability distribution function (p.d.f.) F(y). Let Zn be the random variable denoting the time between the occurrence of the n − 1st and nth (n ≥ 1) jumps of the SMP {Kt }t>0 . We assume that {Zn } is a sequence of i.i.d. random variables with p.d.f. G(.). From this, we may consider that the transition probabilities of the SMP {Kt }t>0 are P{KTn+1 = j, Zn+1 ≤ s|KTn = i} = pi j G(s). Notice that for a full SMP model, the distribution function G(.) depends also on i and j. While the results derived below go through in the same way also for a Gi j (.)depending on i, j, we restrict ourselves to independent G(.). A specific form of SMP arises, for example, as follows: let Nt1 and Nt2 be independent Poisson processes with intensities λ 1 and λ 2 , respectively. We may think of Nt1 as counting the number of claims and Nt2 that of price changes and we have that Nt = Nt1 + Nt2 is again a Poisson process with intensity λ = λ 1 + λ 2 . It then follows easily that ⎧ ⎪ ⎨ pi j
=
λj λ
:= p j ,
⎪ ⎩ G(s) = 1 − e−λ s .
∀i,
14 Minimizing Ruin Probabilities
243
The risk process is controlled by reinsurance and investment. In general, this means that we may choose adaptively at the event times TNt (they correspond to the jump times of Nt ) the retention level (or proportionality factor or risk exposure) bNt of a reinsurance contract as well as the amount δNt to be invested in the risky asset, namely, in SNt with St denoting discounted prices. For the values b that the various bNt may take, we assume that b ∈ (bmin , 1] ⊂ (0, 1], where bmin will be introduced below, and for the values of δ for the various δNt , we assume δ ∈ [δ , δ¯ ] with δ ≤ 0 and δ > 0 exogenously given. Notice that this condition allows also for negative values of δ meaning that, see also [22], short selling of stocks is allowed. On the other hand, with an exogenously given upper bound δ¯ , it might occasionally happen that δNt > XNt , implying a temporary debt of the agent beyond his/her current wealth in order to invest optimally in the financial market. By choosing a policy that minimizes the ruin probability, this debt is however only instantaneous and with high probability leads to a positive wealth already at the next event time. Assume that prices change only according to SNt +1 − SNt WN +1 = e t − 1 KTNt +1 , SNt
(14.5)
where Wn is a sequence of i.i.d. random variables taking values in [w, w] ¯ with w < 0 < w, ¯ where one may also have w = −∞, w¯ = +∞ and with p.d.f. H(w). For simplicity and without loss of generality, we consider only one asset to invest in. An immediate generalization would be to allow for investment also in the money market account. Let c be the premium rate (income) paid by the customer to the company, fixed in the contract. Since the insurer pays to the reinsurer a premium rate, which depends on the retention level bNt chosen at the various event times TNt , we denote by C(bNt ) the net income rate of the insurer at time t ∈ [0, T ]. For b ∈ (bmin , 1], we let h(b,Y ) represent the part of the generic claim Y paid by the insurer, and in what follows, we take the function h(b,Y ) to be of the form h(b,Y ) = b ·Y (proportional reinsurance). We shall call policy a sequence π = (bn , δn ) of control actions. Control actions over a single period will be denoted by φn = (bn , δn ). According to the expected value principle with safety loading θ of the reinsurer, for a given starting time t < T , the function C(b) can be chosen as follows: C(b) := c − (1 + θ )
E{Y1 − h(b,Y1)} , 0 < t < T, E{Z1 ∧ (T − t)}
(14.6)
We use Z1 and Y1 in the above formula since, by our i.i.d. assumption, the various Zn and Yn are all independent copies of Z1 and Y1 . Notice also that, in order to keep formula (14.6) simple and possibly similar to standard usage, in the denominator of the right-hand side, we have considered the random time Z1 between to successive events, while more correctly, we should have taken the random time between two successive claims, which is larger. For this, we can however play with the safety loading factor. As explained in Romera and Runggaldier [19], we can now define
244
R. Romera and W. Runggaldier
bmin := min{b ∈ (0, 1] | c ≥ C(b) ≥ c∗ }, where c∗ ≥ 0 denotes the minimal value of the premium considered by the insurer. We have to consider the following technical restrictions: Assumption 14.2.1. Let (i) The random variables (Zn ,Yn ,Wn )n≥1 are, conditionally on Kt , mutually independent.
(ii) E erY1 < +∞ for r ∈ (0, r¯) with r¯ ∈ (0, ∞). E{Y1 } (iii) c − (1 + θ ) E{Z ∧T } ≥ 0. 1
Notice that, since the support of Y1 is the positive half line, we have limr↑¯r {E{erY1 }} = ∞ (¯r may be equal to +∞, e.g., if the support of Y1 is bounded). In the given setting, we obtain for the insurance risk process (surplus) X the following one-step transition dynamics between the generic random times Tn and Tn+1 when at Tn a control action φ = (b, δ ) is taken for a certain b ∈ (bmin , 1] ⊂ (0, 1], and δ ∈ [δ , δ¯ ], XTn+1 = XTn + C(b)Zn+1 − (1 − KTn+1 )h(b,Yn+1 ) + KTn+1 δ (eWn+1 − 1).
(14.7)
Definition 14.2.1. Letting U := [bmin , 1] × [δ , δ ], we shall say that a control action φ = (b, δ ) is admissible if (b, δ ) ∈ U. Notice that U is compact. We want now to express the one-step dynamics in (14.7) when starting from a generic time instant t < T with a capital x. For this purpose, note that if, for a given t < T one has Nt = n, the time TNt is the random time of the n − th event and Tn ≤ t ≤ Tn+1 . Since, when standing at time t, we observe the time that has elapsed since the last event in TNt , it is not restrictive to assume that t = TNt [see the comment below after (14.8)]. Furthermore, since Zn ,Yn ,Wn are i.i.d., in the one-step random dynamics for the risk process Xt , we may replace the generic (Zn+1 ,Yn+1 ,Wn+1 ) by (Z1 ,Y1 ,W1 ). We may thus write XNt +1 = x + C(b)Z1 − (1 − KTNt +1 )h(b,Y1 ) + KTNt +1 δ (eW1 − 1)
(14.8)
for 0 < t < T, T > 0 and with Xt = x ≥ 0 (recall that we assumed t = TNt ). Notice that, if we had t
= TNt and therefore t > TNt , the second term on the right in (14.8) would become C(b)[Z1 − (t − TNt )], and (14.8) could then be rewritten as XNt +1 = [x − C(b)(t − TNt )] + C(b)Z1 − (1 − KTNt +1 )h(b,Y1 ) + KTNt +1 δ (eW1 − 1) with the quantity [x − C(b)(t − TNt )], which is known at time t, replacing x. This is the sense in which above, we mentioned that it is not restrictive to assume that t = TNt . In what follows, we shall work with the risk process Xt , (or XNt ) as defined by (14.8). For convenience, we shall denote by (bn , δn) the values of φ = (b, δ ) at t = TNt . Accordingly, we shall also write (bNt , δNt ) for bTNt , δTNt . Following [24], we shall also introduce an absorbing (cemetery) state κ , such that if XNt < 0 or XNt = κ , then XNt +1 = κ , ∀t ≤ T. The state space is then R ∪ {κ }.
14 Minimizing Ruin Probabilities
245
14.3 Ruin Probabilities We present first the general expression of the ruin probability corresponding to the risk model (14.8). Thus, using the policy π , given the initial surplus x at time t and initial event k ∈ {0, 1} for the Markov chain Kt at time t, the ruin probability is given by N T
ψ π (t, x; k) := Pπ
{Xs < 0 |XNt = x, Kt = k} .
(14.9)
s=Nt +1
Note that the finite-horizon character of the considered model imposes a specific definition for the ruin probabilities. In order to obtain recursive relations for the ruin probability, we specify some notation and introduce the basic definitions concerning the finite-horizon ruin probabilities when one or n intra-event periods are considered. Given a policy π , namely. a predictable process pair πt := (bt , δt ) with (bt , δt ) in U [of which in the definitions below we need just to consider the generic individual control action φ = (b, δ )], we introduce the following functions: uπ (y, z, w, k) : = (1 − k)by − C(b)z − kδ (ew − 1),
τ π (y, w, k, x) : =
(1 − k)by − kδ (ew − 1) − x , C(b)
(14.10)
so that uπ (y, z, w, k) < x ⇐⇒ z > τ π (y, w, k, x). The ruin probability over one intra-event period (namely, the period between to successive event times) when using the control action φ = (b, δ ) is, for a given initial surplus x at time t ∈ (0, T ) and initial event KTNt = k ∈ {0, 1},
ψ1π (t, x; k) :=
w¯ ∞
1
∑
h=0
pk,h
w
0
G(τ π (y, w, h, x) ∧ (T − t))dF(y)dH(w).
(14.11)
We want to obtain a recursive relation for ⎧ ⎫ ⎨(Nt +n)∧N ⎬ T ψnπ (t, x; k) : = Pπ {Xk < 0} | XNt = x, KTNt = k ⎩ k=N +1 ⎭ t
π : = Px,k
⎧ ⎨(Nt +n)∧N T ⎩
k=Nt +1
⎫ ⎬
{Xk < 0} , ⎭
(14.12)
namely, for the ruin probability when at most n events are considered in the interval [t, T ] and a policy π is adopted. In Romera and Runggaldier [19], the following recursive relation is obtained:
246
R. Romera and W. Runggaldier
Proposition 14.3.1. For an initial surplus x at a given time t ∈ [0, T ], as well as an initial event KTNt = k and a given policy π , one has
ψnπ (t, x, k) = P{NT − Nt > 0} ∑1h=0 pk,h
w¯ ∞ w
0
+P{NT − Nt > 1} ∑1h=0 pk,h ·
w¯ ∞ T −t w
0
τ π (y,w,h,x)
G(τ π (y, w, h, x) ∧ (T − t))dF(y)dH(w)
π ψn−1 (t + z, x − uπ (y, z, w, h), h)dG(z)dF(y)dH(w)
(14.13)
from which it immediately also follows that 1
ψ1π (t, x, k) = P{NT − Nt = 1} ∑ pk,h
w¯ ∞
h=0
w
0
G(τ π (y, w, h, x) ∧ (T − t))dF(y)dH(w) (14.14)
since in the case of at most one jump, one has that P{NT − Nt > 0} = P{NT − Nt = 1} and P{NT − Nt > 1} = 0.
14.4 Minimizing the Bounds In the following, we base ourselves on results in Diasparra and Romera [3,4] that are here extended to the general setup of this chapter to obtain the exponential bounds and then to minimize them. To stress the fact that the process X defined in (14.7) corresponds to the choice of a specific policy π , in what follows, we shall use the notation X π . Given a policy πt = (bt , δt )and defining for t ∈ [0, T ], the random variable
Vtπ := C(b)(Z1 ∧ (T − t)) − 1{Z1≤T −t} (1 − KTNt +1 )bY1 + KTNt +1 δ eW1 − 1 , (14.15) where b = bt and δ = δt let, for r ∈ (0, r¯) and k ∈ {0, 1}, π
lrπ (t, k) := Et,k {e−rVt } − 1,
(14.16)
where, for reasons that should become clear below, we distinguish the dependence of l π on r from that on (t, k). Remark 14.4.1. Notice that, by its definition, lrπ (t, k) is, for given π and r ∈ (0, r¯), a bounded function of (t, k) ∈ [0, T ] × {0, 1}.Given its continuity in r, it is uniformly bounded on any compact subset of (0, r¯), for example, on [η , r¯ − η ] for η ∈ (0, r¯). Having fixed η > 0,denote this bound by L, that is, sup
(t,k)∈[0,T ]×{0,1}, r∈[η ,¯r −η ]
|lrπ (t, k)| ≤ L.
(14.17)
14 Minimizing Ruin Probabilities
247
Definition 14.4.2. We shall call a policy π admissible, and denote the set of admissible policies by A, if at each t ∈ [0, T] the corresponding control action (b_t, δ_t) ∈ U, and for any t ∈ [0, T] and k ∈ {0, 1}, E_{t,k}{V_t^π} > 0 for all π ∈ A.

Notice that A is nonempty since (see Assumption 14.2.1(iii)) it contains at least the stationary policy (b_{N_t}, δ_{N_t}) ≡ (b_min, 0). Following Romera and Runggaldier [19], we obtain the following result:

Proposition 14.4.2. For each (t, k) and each π ∈ A, we have that:
(a) As a function of r ∈ (0, r̄), l_r^π(t, k) is convex with a negative slope at r = 0.
(b) The equation l_r^π(t, k) = 0 has a unique positive root in (0, r̄) that we simply denote by R^π, so that the defining relation for R^π is

l^π_{R^π}(t, k) = 0.   (14.18)
Notice that R^π actually depends also on (t, k), but for simplicity of notation, we denote it just by R^π. In view of the main result of this section, namely, Theorem 14.4.1 below, we first obtain [19]:

Lemma 14.4.1. Given a surplus x > 0 at a given initial time t ∈ [0, T] and an initial event k ∈ {0, 1}, we have

ψ_1^π(t, x, k) ≤ e^{−R^π x}   (14.19)

for each π ∈ A, where R^π is the unique positive root of (14.18) that depends on t and k but is independent of x.

Lemma 14.4.2. For given (t, x, k), we have

ψ_n^π(t, x, k) ≤ γ_n e^{−R^π x}   (14.20)
for all n ∈ N, π ∈ A, where R^π is the unique positive solution with respect to r of l_r^π(t, k) = 0 (see (14.18)), and γ_n is defined recursively by

γ_1 = 1,   γ_n = γ_{n−1} P{N_T − N_t > 1} + P{N_T − N_t = 1}.   (14.21)
Remark 14.4.2. Due to the defining relations (14.21), it follows immediately that γn ≤ 1 for all n ∈ N. In fact, using forward induction, we see that the inequality is true for n = 1, and assuming it true for n − 1, we have
γ_n = γ_{n−1} P{N_T − N_t > 1} + P{N_T − N_t = 1} ≤ P{N_T − N_t > 0} ≤ 1.   (14.22)
We come now to our main result in this section, namely, Theorem 14.4.1, whose proof follows immediately from Lemma 14.4.2 by noticing that (see Remark 14.4.2) γ_n ≤ 1.
Theorem 14.4.1. Given an initial surplus x > 0 at a given time t ∈ [0, T], we have, for all n ∈ N, any initial event k ∈ {0, 1}, and all π ∈ A,

ψ_n^π(t, x, k) ≤ e^{−R^π x},

with R^π that may depend on (t, k) in [0, T] × {0, 1}.
14.4.1 Minimizing the Bounds by a Policy Improvement Procedure

As mentioned previously, it is in general a difficult task to obtain an explicit solution to the given reinsurance-investment problem in order to minimize the ruin probability, and this even for a classical risk process. We shall thus choose the reinsurance level and the investment in order to minimize the bounds that we have derived. By Theorem 14.4.1, this amounts to choosing a strategy π ∈ A such that, for each pair (t, k), the value of R^π is as large as possible. This strategy will thus depend in general also on t and k but not on the level x of wealth. By Proposition 14.4.2, this R^π is, for each π ∈ A, the unique positive solution of the equation l_r^π(t, k) = 0, where l_r^π(t, k) is, as a function of r ∈ [0, r̄] (and for every fixed (t, k)), convex with a negative slope at r = 0. To obtain, for a given (t, k), the largest value of R^π, it thus suffices to choose π ∈ A that minimizes l_r^π(t, k) just at r = R^π. This, in fact, appeals also to intuition since, according to the definition in (14.16), minimizing l_r^π(t, k) amounts to penalizing negative values of X_t^π = x + V_t^π, thereby minimizing the possibility of ruin.

Concerning the minimization of l_r^π(t, k) at r = R^π, notice that decisions concerning the control actions φ = (b, δ) have to be made only at the event times T_n. The minimization of l_r^π(t, k) with respect to π ∈ A has thus to be performed only for pairs (t, k) corresponding to event times, namely, those of the form (T_n, K_{T_n}), thus leading to a policy π with individual control actions φ_{T_n} = (b_{T_n}, δ_{T_n}). Our problem of determining an investment and reinsurance policy that minimizes the bounds on the ruin probability may thus be solved by solving the following subproblems:

1. For a given policy π̄ ∈ A, determine l_r^{π̄}(t, k) for pairs (t, k) of the form (T_n, K_{T_n}).
2. Determine R^{π̄}(T_n, K_{T_n}) that is the solution with respect to r of l_r^{π̄}(T_n, K_{T_n}) = 0.
3. Improve the policy by minimizing l^π_{R^{π̄}}(T_n, K_{T_n}) with respect to π ∈ A.

This leads to a policy improvement-type approach; more precisely, one can proceed as follows:

• Start from a given policy π^0 (e.g., the one requiring minimal reinsurance and no investment in the financial market).
• Determine R^{π^0} corresponding to π^0 for the various (T_n, K_{T_n}).
• For r = R^{π^0}, determine π^1 that minimizes l^π_{R^{π^0}}(T_n, K_{T_n}).
• Repeat the procedure until a stopping criterion is met (notice that by the above procedure R^{π^n} > R^{π^{n−1}}).
One crucial step in this procedure is determining the function lrπ (t, k) corresponding to a given π ∈ A, and this will be discussed in the next section.
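A schematic implementation of this improvement loop might look as follows (a minimal sketch, not from the chapter). The routine l_r is a hypothetical user-supplied evaluation of l_r^π(t, k) for the stationary policy using the action passed as its first argument, actions is a finite grid of candidate control actions in U, and Proposition 14.4.2 is assumed, so that l_r changes sign on [η, r̄ − η].

import numpy as np
from scipy.optimize import brentq

def improve_policy(actions, l_r, t, k, r_bar, max_iter=50, eta=1e-6):
    # start, e.g., from minimal reinsurance and no investment (first action on the grid)
    phi = actions[0]
    R = brentq(lambda r: l_r(phi, t, k, r), eta, r_bar - eta)   # root R^pi of l_r = 0
    for _ in range(max_iter):
        phi_new = min(actions, key=lambda a: l_r(a, t, k, R))   # improvement: minimize l at r = R^pi
        R_new = brentq(lambda r: l_r(phi_new, t, k, r), eta, r_bar - eta)
        if R_new <= R + 1e-12:                                  # stopping criterion: R no longer increases
            break
        phi, R = phi_new, R_new
    return phi, R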
14.4.2 Computing the Value Function in the Policy Improvement Procedure

Recall again that the decisions have to be made only at the event times over a given finite horizon, and consequently, the function l_r^π(t, k) has to be computed only for pairs of the form (T_n, K_{T_n}). The number of these events is however random and may be arbitrarily large; furthermore, the timing of these events is random as well. On the other hand, notice that if we can represent the function l_r^π(t, k) to be minimized as the fixed point of a contraction operator involving expectations of functions of a Markov process, then the computation can be performed iteratively as in value iteration algorithms. For this purpose, recalling that the Z_n are i.i.d. random variables with probability distribution function G(·) and that, for given π ∈ A and r ∈ [η, r̄ − η], the functions l_r^π(t, k) are bounded by some L (see Remark 14.4.1), we start with the following:

Definition 14.4.3. For given π ∈ A, define T^π as the operator acting on bounded functions v(t, k) of (t, k) in the following way:

T^π(v)(t, k) = 1_{{t≤T}} E^π_{t,k}[ 1_{{t+Z_1 ≤ T}} v(t + Z_1, K_{t+Z_1}) + 1_{{t ≤ T ≤ t+Z_1}} ( e^{−rC(b)(T−t)} − 1 ) ]
 = ∑_{h=0}^{1} p_{k,h} [ ∫_0^{T−t} v(t + z, h) dG(z) + Ḡ(T − t) ( e^{−rC(b)(T−t)} − 1 ) ],

with Ḡ(z) = 1 − G(z) and where, given π_t = (b_t, δ_t), the value of b is b = b_t.

The following lemma is now rather straightforward:

Lemma 14.4.3. For a given π ∈ A and any value of the parameter r ∈ [η, r̄ − η], the function l_r^π(·) is a fixed point of T^π, that is,

l_r^π(t, k) = T^π(l_r^π)(t, k).   (14.23)
On the basis of the above definitions and results, we may now consider the following recursive relations:

l_r^{π,0}(T_n, k) = Ḡ(T − T_n) [ e^{−rC(b_n)(T − T_n)} − 1 ],
l_r^{π,m}(T_n, k) = T^π(l_r^{π,m−1})(T_n, k)   for m = 1, 2, . . .   (14.24)
that we may view as a value iteration algorithm. Since between any event time T_n and the terminal time T there may be any number of events occurring, to obtain l_r^π(·), the recursions in (14.24) would have to be iterated an infinite number of times. If however the mappings T^π are contracting in the sense that

‖T^π(v_1) − T^π(v_2)‖ ≤ γ ‖v_1 − v_2‖   (14.25)
for bounded functions v1 (.) and v2 (.) and with γ < 1, then lrπ ,m (Tn , k) approximates lrπ (Tn , k) arbitrarily well in the sup-norm, provided m is sufficiently large. The above assumption can be seen to be satisfied in various cases of practical interest [19].
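As a numerical illustration of the recursions (14.24), not part of the chapter, one may discretize the time variable and iterate the operator of Definition 14.4.3 until the iterates stabilize. In the sketch below, g_pdf and G_bar are the assumed density and survival function of G, C_b stands for C(b_t), taken constant under the given policy, and p is the 2×2 matrix (p_{k,h}); all names are illustrative.

import numpy as np

def value_iteration_l(r, C_b, p, g_pdf, G_bar, T, n_grid=400, m_iter=200, tol=1e-8):
    ts = np.linspace(0.0, T, n_grid)
    # l^{pi,0}(t, k) = G_bar(T - t) * (exp(-r*C_b*(T - t)) - 1), the same for k = 0, 1
    base = np.array([G_bar(T - t) * (np.exp(-r * C_b * (T - t)) - 1.0) for t in ts])
    l = np.vstack([base, base])                      # rows correspond to k = 0, 1
    for _ in range(m_iter):
        l_new = np.empty_like(l)
        for k in (0, 1):
            for i, t in enumerate(ts):
                zs = ts[i:] - t                      # z ranges over [0, T - t]
                integrals = [np.trapz(l[h, i:] * g_pdf(zs), zs) if zs.size > 1 else 0.0
                             for h in (0, 1)]
                l_new[k, i] = sum(p[k][h] * (integrals[h] + base[i]) for h in (0, 1))
        if np.max(np.abs(l_new - l)) < tol:          # contraction (14.25) ensures convergence
            break
        l = l_new
    return ts, l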
14.4.2.1 Reduction of Dimensionality and Particular Cases

For the policy improvement and value iteration-type procedure in the previous section, the "Markovian state" was seen to be the tuple (T_n, K_{T_n}), which makes the problem two dimensional. It is shown in Romera and Runggaldier [19] that in the particular case when the inter-event time and the claim size distributions are (negative) exponential, a case that has been most discussed in the literature under different settings, the state space can be further reduced to only the time variable t (the sequence of event times T_n is then in fact a Markov process by itself), and so the optimal policy becomes dependent only on the event time. This particular case can also be shown [19] to be a concrete example where the mappings T^π are contracting as assumed in (14.25). Still for this particular case, it can furthermore be shown (see again [19]) that the fixed point l_r^π of the mapping T^π, which, as discussed above, depends here only on t, can be computed as a semianalytic solution involving a Volterra integral equation.

We conclude this section by pointing out that, although by our procedure we minimize only an upper bound on the ruin probability, the optimal upper bound can also be used as a benchmark with respect to which other standard policies may be evaluated. Finally, as explained in Romera and Runggaldier [19], our procedure also allows us to obtain some qualitative insight into the impact that investment in the financial market may have on the ruin probability. It turns out, in fact, in line with some of the findings in the more recent literature (see, e.g., [17]), that the choice of investing also in the financial market has little impact on the ruin probability unless, as we do here, one allows also for reinsurance.

Acknowledgements Rosario Romera was supported by Spanish MEC grant SEJ2007-64500 and Comunidad de Madrid grant S2007/HUM-0413. Part of the contribution by Wolfgang Runggaldier was obtained while he was visiting professor in 2009 for the chair in Quantitative Finance and Insurance at the LMU University in Munich funded by LMU Excellent. Hospitality and financial support are gratefully acknowledged.
References

1. Asmussen, S., Ruin Probabilities. World Scientific, River Edge, NJ, 2000.
2. Chen, S., Gerber, H. and Shiu, E., Discounted probabilities of ruin in the compound binomial model. Insurance: Mathematics and Economics, 2000, 26, 239–250.
3. Diasparra, M.A. and Romera, R., Bounds for the ruin probability of a discrete-time risk process. Journal of Applied Probability, 2009, 46(1), 99–112.
4. Diasparra, M.A. and Romera, R., Inequalities for the ruin probability in a controlled discrete-time risk process. European Journal of Operational Research, 2010, 204(3), 496–504.
5. Edoli, E. and Runggaldier, W.J., On Optimal Investment in a Reinsurance Context with a Point Process Market Model. Insurance: Mathematics and Economics, 2010, 47, 315–326.
6. Eisenberg, J. and Schmidli, H., Minimising expected discounted capital injections by reinsurance in a classical risk model. Scandinavian Actuarial Journal, 2010, 3, 1–22.
7. Gaier, J., Grandits, P. and Schachermayer, W., Asymptotic ruin probabilities and optimal investment. Annals of Applied Probability, 2003, 13, 1054–1076.
8. Grandell, J., Aspects of Risk Theory. Springer, New York, 1991.
9. Guo, X.P. and Hernandez-Lerma, O., Continuous-Time Markov Decision Processes: Theory and Applications. Springer-Verlag, Berlin Heidelberg, 2009.
10. Hernandez-Lerma, O. and Lasserre, J.B., Discrete-Time Markov Control Processes: Basic Optimality Criteria. Springer-Verlag, New York, 1996.
11. Hernandez-Lerma, O. and Lasserre, J.B., Further Topics on Discrete-Time Markov Control Processes. Springer-Verlag, New York, 1999.
12. Hernandez-Lerma, O. and Romera, R., Limiting discounted-cost control of partially observable stochastic systems. SIAM Journal on Control and Optimization, 2001, 40(2), 348–369.
13. Hernandez-Lerma, O. and Romera, R., The scalarization approach to multiobjective Markov control problems: Why does it work? Applied Mathematics and Optimization, 2004, 50(3), 279–293.
14. Hernandez-Lerma, O. and Prieto-Rumeau, T., Selected Topics on Continuous-Time Controlled Markov Chains and Markov Games. Imperial College Press, 2012.
15. Hipp, C. and Vogt, M., Optimal dynamic XL insurance. ASTIN Bulletin, 2003, 33(2), 93–207.
16. Huang, T., Zhao, R. and Tang, W., Risk model with fuzzy random individual claim amount. European Journal of Operational Research, 2009, 192, 879–890.
17. Hult, H. and Lindskog, F., Ruin probabilities under general investments and heavy-tailed claims. Finance and Stochastics, 2011, 15(2), 243–265.
18. Paulsen, J., Sharp conditions for certain ruin in a risk process with stochastic return on investment. Stochastic Processes and Their Applications, 1998, 75, 135–148.
19. Romera, R. and Runggaldier, W., Ruin probabilities in a finite-horizon risk model with investment and reinsurance. UC3M Working Papers, Statistics and Econometrics, 2010, 10–21.
20. Schäl, M., On Discrete-Time Dynamic Programming in Insurance: Exponential Utility and Minimizing the Ruin Probability. Scandinavian Actuarial Journal, 2004, 3, 189–210.
21. Schäl, M., Control of ruin probabilities by discrete-time investments. Mathematical Methods of Operations Research, 2005, 62, 141–158.
22. Schmidli, H., Optimal proportional reinsurance policies in a dynamic setting. Scandinavian Actuarial Journal, 2001, 12, 55–68.
23. Schmidli, H., On minimizing the ruin probability by investment and reinsurance. Annals of Applied Probability, 2002, 12, 890–907.
24. Schmidli, H., Stochastic Control in Insurance. Springer, London, 2008.
25. Xiong, S. and Yang, W.S., Ruin probability in the Cramér-Lundberg model with risky investments. Stochastic Processes and their Applications, 2011, 121, 1125–1137.
26. Wang, R., Yang, H. and Wang, H., On the distribution of surplus immediately after ruin under interest force and subexponential claims. Insurance: Mathematics and Economics, 2004, 34, 703–714.
27. Willmot, G. and Lin, X., Lundberg Approximations for Compound Distributions with Insurance Applications. Lecture Notes in Statistics 156, Springer, New York, 2001.
Chapter 15
Estimation of the Optimality Deviation in Discounted Semi-Markov Control Models Luz del Carmen Rosas-Rosas
15.1 Introduction

Semi-Markov control models (SMCMs) are a class of continuous-time stochastic control models where the distribution of the times between consecutive decision epochs (holding or sojourn times) is arbitrary and the actions or controls are selected at the transition times. The evolution is as follows: At the time of the nth decision epoch T_n, the system is in the state x_n = x and the controller chooses a control a_n = a. Then, the system remains in the state x during a nonnegative random time δ_{n+1} with distribution F(· | x, a), and the following happens: (1) an immediate cost D(x, a) is incurred; (2) the system jumps to a new state x_{n+1} = y according to a transition law Q(· | x, a); and (3) a cost rate d(x, a) is imposed until the transition occurs. Once the transition to state y occurs, the process is repeated. The actions are selected according to rules π known as control policies, and the costs are accumulated throughout the evolution of the system in an infinite horizon using a discounted cost criterion, where the total cost under the policy π and initial state x is denoted by V_α(x, π). Thus, the control problem is to find a policy π* such that

V_α(x, π*) = inf_{π∈Π} V_α(x, π).
However, there exist situations where the holding time distribution F is unknown to the controller, and therefore it is necessary to implement approximation procedures together with control schemes to obtain nearly optimal policies. Let us assume that it is possible to get an approximate distribution F̃(· | x, a). Under this setting, considering the control problem for F̃ and the corresponding
total cost Ṽ_α, we can obtain a policy π̃ minimizing Ṽ_α. Then, if the controller uses π̃ to control the original semi-Markov process, our main objective is to estimate the optimality deviation of the control policy π̃, defined as

D(x, π̃) := V_α(x, π̃) − inf_{π∈Π} V_α(x, π).
Specifically, our main result states that

D(x, π̃) ≤ ψ(d(F, F̃)),

where ψ : [0, ∞) → [0, ∞) is a function such that ψ(s) → 0 as s → 0, and d is some "distance" between F and F̃. Moreover, our result can be used to construct asymptotically discounted optimal (ADO) policies, as is shown in [7]. Indeed, if we consider the particular case when F has an unknown density g independent of the state-action pairs, and the holding times are observable, it is possible to apply some statistical density estimation method to obtain a consistent estimator g_n. Then, using g_n as the approximate holding time distribution, the resulting policy π̃ = {f_n} will be ADO for the original SMCM (see Sect. 15.3). Similar estimation problems of the deviation of optimality have been studied by several authors, but for Markov control processes and/or controlled diffusion processes, and under different names, namely, robustness, perturbation, or stability index (see, e.g., [1, 2, 4, 10–12]). In addition, asymptotic optimality implementing estimation and control procedures has been analyzed, for instance, in [3, 6, 9].

The chapter is organized as follows: Section 15.2 contains the SMCM and the required assumptions, whereas in Sect. 15.3, the main result and a special case of asymptotic optimality are introduced. Finally, the proofs are presented in Sect. 15.4.
15.2 Semi-Markov Control Models

A SMCM is defined by the collection M = (X, A, {A(x) : x ∈ X}, Q, F, D, d), where X is the state space and A is the control (or action) space; both are assumed to be Borel spaces. For each x ∈ X, we associate a nonempty measurable subset A(x) of A, whose elements are the admissible controls for the controller when the system is in state x. The set K = {(x, a) : x ∈ X, a ∈ A(x)} of admissible state-action pairs is assumed to be a Borel subset of X × A. The transition law Q(· | ·) is a stochastic kernel on X given K, and F(· | x, a) is the distribution function of the holding time at state x ∈ X when the control a ∈ A(x) is chosen. Finally, the cost functions D and d are possibly unbounded and measurable real-valued functions on K. On the other hand, we assume that the costs are continuously discounted, that is, for a discount factor α > 0, a cost K incurred at time t is equivalent to a
cost K exp(−αt) at time 0. In this sense, according to points (1) and (3) of the interpretation of the SMCM given in the introduction, the one-stage cost takes the form

c(x, a) := D(x, a) + d(x, a) ∫_0^∞ ( ∫_0^t e^{−αs} ds ) F(dt | x, a),   (x, a) ∈ K,

which, it is not difficult to see, may be rewritten as

c(x, a) = D(x, a) + τ_α(x, a) d(x, a),   (x, a) ∈ K,   (15.1)
where

τ_α(x, a) := (1 − β_α(x, a)) / α   and   β_α(x, a) := ∫_0^∞ e^{−αt} F(dt | x, a),   (x, a) ∈ K.   (15.2)

Optimality Criterion. We denote by Π the set of all admissible control policies, and by F ⊂ Π the subset of stationary policies (see, for instance, [5] for definitions). As usual, each stationary policy π ∈ F is identified with a measurable function f : X → A such that f(x) ∈ A(x), x ∈ X, so that π is of the form π = {f, f, . . .}. In this case, we denote π by f. For each x ∈ X and π ∈ Π, we define the total expected discounted cost as

V_α(x, π) := E_x^π [ ∑_{n=0}^{∞} e^{−α T_n} c(x_n, a_n) ].
Hence, a policy π* ∈ Π is said to be optimal if, for all x ∈ X,

V_α(x) := V_α(x, π*) = inf_{π∈Π} V_α(x, π),

and V_α(x) is called the optimal value function for the model M.

Assumptions. We shall require two sets of assumptions. The first one ensures that in a bounded time interval, there are at most a finite number of transitions of the process. Assumption 15.2.2 contains standard continuity and compactness conditions to guarantee the existence of minimizers.

Assumption 15.2.1 There exist positive constants ζ and ε such that, for every (x, a) ∈ K, F(ζ | x, a) ≤ 1 − ε.
and Vα (x) is called the optimal value function for the model M . Assumptions. We shall require two sets of assumptions. The first one ensures that in a bounded time interval, there are at most a finite number of transitions of the process. Assumption 15.2.2 contains standard continuity and compactness conditions to guarantee the existence of minimizers. Assumption 15.2.1 There exists positive constants ζ and ε , such that for every (x, a) ∈ K, F (ζ | x, a) ≤ 1 − ε . Remark 15.2.1. As was noted in [7], Assumption 15.2.1 implies that
ρ_α := sup_{(x,a)∈K} β_α(x, a) < 1.
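For instance (an illustration not taken from the chapter), if the holding times are exponential, F(t | x, a) = 1 − e^{−λ(x,a)t} with rate λ(x, a) > 0, then

β_α(x, a) = λ(x, a) / (λ(x, a) + α)   and   τ_α(x, a) = 1 / (λ(x, a) + α),

and Assumption 15.2.1 holds precisely when λ(x, a) ≤ ζ^{−1} log(1/ε) for all (x, a) ∈ K; in that case ρ_α ≤ λ̄ / (λ̄ + α) < 1 with λ̄ := ζ^{−1} log(1/ε), in agreement with Remark 15.2.1.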
Assumption 15.2.2
(a) The function F(t | x, a) is continuous on a ∈ A(x) for every x ∈ X and t ∈ R.
(b) For each x ∈ X, A(x) is a compact set, and the cost functions D(x, a) and d(x, a) are lower semicontinuous (l.s.c.) on A(x). Moreover, there exist a measurable function W : X → [1, ∞) and positive constants b, c̄_1, and c̄_2 such that 1 ≤ b < (ρ_α)^{−1} and

sup_{a∈A(x)} |D(x, a)| ≤ c̄_1 W(x),   sup_{a∈A(x)} |d(x, a)| ≤ c̄_2 W(x)   ∀x ∈ X,   (15.3)

∫_X W(y) Q(dy | x, a) ≤ b W(x),   x ∈ X, a ∈ A(x).   (15.4)

(c) For each x ∈ X,

a ↦ ∫_X W(y) Q(dy | x, a)

is a continuous function on A(x).
(d) For every bounded and continuous function u on X,

a ↦ ∫_X u(y) Q(dy | x, a)

is a bounded and continuous function on A(x).

We denote by B_W the normed linear space of all measurable functions u : X → R with norm

‖u‖_W := sup_{x∈X} |u(x)| / W(x) < ∞.   (15.5)
15.3 Optimality Deviation

Let F̃(· | x, a) be a distribution function that approximates the holding time distribution F(· | x, a), (x, a) ∈ K. We consider the SMCM

M̃ = (X, A, {A(x) : x ∈ X}, Q, F̃, D, d),

where X, A, A(x), Q, D, and d are as in the model M. The functions c̃, τ̃_α, β̃_α, Ṽ_α(x, π), Ṽ_α(x), and ρ̃_α are defined accordingly. Furthermore, we assume that Assumptions 15.2.1 and 15.2.2 are satisfied for the model M̃. Hence, as a consequence, we have the following result (see, e.g., [7, 8]):

Proposition 15.3.1. Suppose that Assumptions 15.2.1 and 15.2.2 hold. Then:
(a) There exist stationary optimal policies f ∗ and f˜ for the models M and M˜, respectively.
(b) The value functions V_α and Ṽ_α satisfy the corresponding optimality equations, that is,

V_α(x) = min_{a∈A(x)} [ c(x, a) + β_α(x, a) ∫_X V_α(y) Q(dy | x, a) ],   x ∈ X,

and

Ṽ_α(x) = min_{a∈A(x)} [ c̃(x, a) + β̃_α(x, a) ∫_X Ṽ_α(y) Q(dy | x, a) ],   x ∈ X.
According to Proposition 15.3.1, we have that, for x ∈ X,

V_α(x) = inf_{π∈Π} V_α(x, π) = V_α(x, f*)

and

Ṽ_α(x) = inf_{π∈Π} Ṽ_α(x, π) = Ṽ_α(x, f̃).

Hence, our objective is to estimate the optimality deviation of the stationary policy f̃. We can now state our main result as follows:

Theorem 15.3.1. Under Assumptions 15.2.1 and 15.2.2,

D(x, f̃) := V_α(x, f̃) − V_α(x) ≤ M* ‖F̃ − F‖ W(x)   ∀x ∈ X,   (15.6)

for some positive constant M*, where

‖F̃ − F‖ := sup_{(x,a)∈K} ∫_0^∞ e^{−αt} |μ_{F̃}(· | x, a) − μ_F(· | x, a)|(dt),

and |μ_{F̃}(· | x, a) − μ_F(· | x, a)| represents the total variation of the signed measure μ_{F̃}(· | x, a) − μ_F(· | x, a), μ_{F̃} and μ_F being the measures induced, respectively, by the distribution functions F̃ and F for every fixed pair (x, a) ∈ K.
15.3.1 Asymptotic Optimality

As in [7], we consider the special case where the distributions F and F̃ have densities g and g̃, respectively, which are independent of the state-action pairs (x, a). That is,

F(t | x, a) = ∫_0^t g(s) ds   and   F̃(t | x, a) = ∫_0^t g̃(s) ds.
Then, clearly, β_α and τ_α, as well as β̃_α and τ̃_α, are independent of (x, a). That is,

β_α(x, a) := β_α = ∫_0^∞ e^{−αs} g(s) ds,   τ_α(x, a) := τ_α = (1 − β_α)/α,

and similarly for β̃_α and τ̃_α. Furthermore, note that

β̃_α − β_α = ∫_0^∞ e^{−αs} {g̃(s) − g(s)} ds,

and in such case, (15.6) becomes

D(x, f̃) ≤ M* ‖g̃ − g‖ W(x)   ∀x ∈ X,   (15.7)

where

‖g̃ − g‖ := ∫_0^∞ e^{−αs} |g̃(s) − g(s)| ds.
In particular, assuming that the holding times are observable, let δ_1, . . . , δ_n be independent realizations (observed up to the moment of the nth decision epoch) of random variables with the unknown density g, and let g_n(s) := g_n(s; δ_1, . . . , δ_n), s ∈ R_+, be an arbitrary estimator of g which is itself a density and satisfies

E[ ∫_0^∞ |g_n(s) − g(s)| ds ] → 0   as n → ∞.   (15.8)

Letting

f_n := f^{g_n}   (15.9)

be the optimal stationary policy corresponding to the density g_n, where the minimization is done almost surely (a.s.) with respect to the probability measure induced by the common distribution of the random variables δ_1, . . . , δ_n, and applying (15.7) with g_n instead of g̃, but observing that g_n is a random variable, we obtain

E[D(x, f_n)] ≤ M* E[‖g − g_n‖] W(x) → 0   as n → ∞, x ∈ X.
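The following is a minimal sketch (not from the chapter) of how such an estimator g_n and the plug-in quantities β_n and τ_n of Sect. 15.4 might be produced from observed sojourn times. The histogram estimator, the truncation level s_max, and all names are illustrative assumptions, not constructions prescribed by the text.

import numpy as np

def estimate_density(samples, n_bins=50):
    heights, edges = np.histogram(samples, bins=n_bins, density=True)
    def g_n(s):
        s = np.asarray(s, dtype=float)
        idx = np.clip(np.searchsorted(edges, s, side="right") - 1, 0, n_bins - 1)
        return np.where((s >= edges[0]) & (s <= edges[-1]), heights[idx], 0.0)
    return g_n

def beta_tau(g_n, alpha, s_max, n_grid=2000):
    s = np.linspace(0.0, s_max, n_grid)
    beta = np.trapz(np.exp(-alpha * s) * g_n(s), s)   # beta_n = int e^{-alpha s} g_n(s) ds (truncated at s_max)
    return beta, (1.0 - beta) / alpha                  # tau_n = (1 - beta_n)/alpha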
Moreover, it is possible to prove that this fact yields the following result:
Theorem 15.3.2. The policy π˜ = { fn } is (pointwise) asymptotically discounted optimal, which means that for each x ∈ X, E [Φ (x, fn )] → 0 as n → ∞, where for each (x, a) ∈ K,
Φ(x, a) := c(x, a) + β_α ∫_X V_α(y) Q(dy | x, a) − V_α(x),   (15.10)
is the well-known discrepancy function.
15.4 Proofs

We will use repeatedly the following inequalities: For u ∈ B_W and (x, a) ∈ K,

|u(x)| ≤ ‖u‖_W W(x)   (15.11)

and

∫_X |u(y)| Q(dy | x, a) ≤ b ‖u‖_W W(x).   (15.12)
The relation (15.11) is a consequence of the definition of ‖u‖_W in (15.5), whereas (15.12) follows from (15.4). Furthermore, from Assumption 15.2.2(b), and following a straightforward calculation (see, e.g., [3, 5]), we can prove that, for all x ∈ X and π ∈ Π,

sup_{n≥0} E_x^π[W(x_n)] < ∞,

which in turn implies that there exist positive constants M and M̃ such that, for all π ∈ Π,

V_α(x) ≤ V_α(x, π) ≤ M W(x)   (15.13)

and, similarly,

Ṽ_α(x) ≤ Ṽ_α(x, π) ≤ M̃ W(x).

On the other hand, observe that for all (x, a) ∈ K,

|β̃_α(x, a) − β_α(x, a)| ≤ ‖F̃ − F‖.

Hence, from (15.3) and the definition of the costs c and c̃ [see (15.1) and (15.2)], we obtain

|c(x, a) − c̃(x, a)| ≤ (c̄_2/α) ‖F̃ − F‖ W(x)   ∀(x, a) ∈ K.   (15.14)
In addition, for any function u ∈ B_W, we have

|β̃_α(x, a) − β_α(x, a)| ∫_X |u(y)| Q(dy | x, a) ≤ b ‖u‖_W ‖F̃ − F‖ W(x)   (15.15)

and

β_α(x, a) ∫_X |u(y)| Q(dy | x, a) ≤ b ‖u‖_W ρ_α W(x)   (15.16)

for all (x, a) ∈ K, where ρ_α is defined in Remark 15.2.1 (similarly, we obtain (15.15) and (15.16) substituting β_α and ρ_α by β̃_α and ρ̃_α, respectively). Thus, denoting u_1(x) := V_α(x, f̃), where f̃ is the optimal policy for the model M̃, from (15.11), we obtain

β̃_α(x, f̃) ∫_X |V_α(y, f̃) − Ṽ_α(y, f̃)| Q(dy | x, f̃) = β̃_α(x, f̃) ∫_X |u_1(y) − Ṽ_α(y)| Q(dy | x, f̃)
 ≤ b ρ̃_α ‖u_1 − Ṽ_α‖_W W(x),   x ∈ X.   (15.17)
Lemma 15.4.1. For all x ∈ X, there exist positive constants M1∗ and M2∗ , such that
‖u_1 − Ṽ_α‖_W ≤ M_1^* ‖F̃ − F‖   (15.18)

and

‖Ṽ_α − V_α‖_W ≤ M_2^* ‖F̃ − F‖.   (15.19)
Proof. First observe that, from the Markov property, for any stationary policy f ∈ F and for each x ∈ X,

V_α(x, f) = c(x, f) + β_α(x, f) ∫_X V_α(y, f) Q(dy | x, f).
Similarly for V˜α . Then,
|u_1(x) − Ṽ_α(x)| = |V_α(x, f̃) − Ṽ_α(x, f̃)|
 ≤ | c(x, f̃) + β_α(x, f̃) ∫_X V_α(y, f̃) Q(dy | x, f̃) − [ c̃(x, f̃) + β̃_α(x, f̃) ∫_X Ṽ_α(y, f̃) Q(dy | x, f̃) ] |.

Adding and subtracting the term

β̃_α(x, f̃) ∫_X V_α(y, f̃) Q(dy | x, f̃),
we obtain, for all x ∈ X,
|u_1(x) − Ṽ_α(x)| ≤ |c(x, f̃) − c̃(x, f̃)| + |β_α(x, f̃) − β̃_α(x, f̃)| ∫_X |V_α(y, f̃)| Q(dy | x, f̃)
 + β̃_α(x, f̃) ∫_X |V_α(y, f̃) − Ṽ_α(y, f̃)| Q(dy | x, f̃).

Hence, from (15.14), (15.15), and (15.17), we have (recall W(·) ≥ 1)

‖u_1 − Ṽ_α‖_W ≤ (c̄_2/α) ‖F̃ − F‖ + b ‖u_1‖_W ‖F̃ − F‖ + b ρ̃_α ‖u_1 − Ṽ_α‖_W.

Now, since b ρ̃_α < 1 (see Assumption 15.2.2(b)), we get

‖u_1 − Ṽ_α‖_W ≤ M_1^* ‖F̃ − F‖,

where

M_1^* := (c̄_2/α + b ‖u_1‖_W) / (1 − b ρ̃_α),
which proves (15.18). On the other hand, to prove (15.19), we have from Proposition 15.3.1(b),
|Ṽ_α(x) − V_α(x)| = | min_{a∈A(x)} [ c̃(x, a) + β̃_α(x, a) ∫_X Ṽ_α(y) Q(dy | x, a) ]
 − min_{a∈A(x)} [ c(x, a) + β_α(x, a) ∫_X V_α(y) Q(dy | x, a) ] |
 ≤ sup_{a∈A(x)} [ |c̃(x, a) − c(x, a)| + | β̃_α(x, a) ∫_X Ṽ_α(y) Q(dy | x, a) − β_α(x, a) ∫_X V_α(y) Q(dy | x, a) | ].

Now, it is easy to see that, by adding and subtracting the term

β_α(x, a) ∫_X Ṽ_α(y) Q(dy | x, a),

and by applying similar arguments as in the proof of (15.18), we get

|Ṽ_α(x) − V_α(x)| ≤ (c̄_2/α) ‖F̃ − F‖ W(x) + b ‖Ṽ_α‖_W ‖F̃ − F‖ W(x) + b ρ_α ‖Ṽ_α − V_α‖_W W(x).
Since W(·) ≥ 1, this relation implies

‖Ṽ_α − V_α‖_W ≤ (c̄_2/α) ‖F̃ − F‖ + b ‖Ṽ_α‖_W ‖F̃ − F‖ + b ρ_α ‖Ṽ_α − V_α‖_W,

which yields

‖Ṽ_α − V_α‖_W ≤ M_2^* ‖F̃ − F‖,

where

M_2^* := (c̄_2/α + b ‖Ṽ_α‖_W) / (1 − b ρ_α).   □
15.4.1 Proof of Theorem 15.3.1

In order to prove the main result, first observe that Ṽ_α(x, f̃) = Ṽ_α(x), x ∈ X. Then,

D(x, f̃) = V_α(x, f̃) − V_α(x) = V_α(x, f̃) − Ṽ_α(x, f̃) + Ṽ_α(x) − V_α(x),   x ∈ X.

Hence, from (15.11), Lemma 15.4.1 yields

D(x, f̃) ≤ |V_α(x, f̃) − Ṽ_α(x, f̃)| + |Ṽ_α(x) − V_α(x)| ≤ M* ‖F̃ − F‖ W(x)   ∀x ∈ X,

where M* := M_1^* + M_2^*.   □
15.4.2 Proof of Theorem 15.3.2
First, for every n ∈ N, we define (see (15.10)) the approximate discrepancy function
Φ_n(x, a) := c_n(x, a) + β_n ∫_X V_n(y) Q(dy | x, a) − V_n(x),   (x, a) ∈ K,   (15.20)
where, for each n ∈ N,

c_n(x, a) := D(x, a) + τ_n d(x, a),   (x, a) ∈ K,
τ_n := (1 − β_n)/α,
β_n := ∫_0^∞ e^{−αs} g_n(s) ds,
V_n(x) := inf_{π∈Π} V_n(x, π),
V_n(x, π) := E_x^π [ ∑_{k=0}^{∞} e^{−α T_k} c_n(x_k, a_k) ].
Note that, for each n ∈ N, (15.9), (15.20), and Proposition 15.3.1 imply that
Φ_n(x, f_n) = 0   a.s., x ∈ X.

Hence, and since Φ_n is a nonnegative function, for each n ∈ N, we can write the following:

Φ(x, f_n) = Φ(x, f_n) − Φ_n(x, f_n) ≤ sup_{a∈A(x)} |Φ(x, a) − Φ_n(x, a)|   a.s. ∀x ∈ X.   (15.21)
In particular observe that from (15.10) and (15.20), for each n ∈ N, we have
|Φ(x, a) − Φ_n(x, a)| ≤ |c(x, a) − c_n(x, a)| + |V_α(x) − V_n(x)|
 + | β_α ∫_X V_α(y) Q(dy | x, a) − β_n ∫_X V_n(y) Q(dy | x, a) |   a.s.

Then, by adding and subtracting the term

β_α ∫_X V_n(y) Q(dy | x, a),

we get

|Φ(x, a) − Φ_n(x, a)| ≤ |c(x, a) − c_n(x, a)| + |V_α(x) − V_n(x)|
 + β_α ∫_X |V_α(y) − V_n(y)| Q(dy | x, a) + |β_α − β_n| ∫_X |V_α(y)| Q(dy | x, a)   a.s.
Therefore, using (15.11), (15.14), (15.15), (15.17), and (15.19), it follows that for every n ∈ N,

|Φ(x, a) − Φ_n(x, a)| ≤ (c̄_2/α) ‖g_n − g‖ W(x) + M_2^* ‖g_n − g‖ W(x)
 + b ρ_n M_2^* ‖g_n − g‖ W(x) + b ‖V_α‖_W ‖g_n − g‖ W(x)   a.s.

Hence, considering that b ρ_n < 1 a.s., and from (15.21), for each x ∈ X, we obtain

E[Φ(x, f_n)] ≤ M_0^* W(x) E[‖g_n − g‖]   ∀n ∈ N,

where

M_0^* := c̄_2/α + 2M_2^* + b ‖V_n‖_W.
Consequently, since Φ(·, ·) ≥ 0, from (15.8) we obtain, for every x ∈ X,

E[Φ(x, f_n)] → 0   as n → ∞.   □
References

1. Abbad, M.; Filar, J.A. Perturbation and stability theory for Markov control processes. IEEE Trans. Automat. Control. 1992, 37: 1415–1420.
2. Gordienko, E.; Lemus-Rodríguez, E. Estimation of robustness for controlled diffusion processes. Stochastics. 1999, 17: 421–441.
3. Gordienko, E.; Minjárez-Sosa, J.A. Adaptive control for discrete-time Markov processes with unbounded costs: discounted criterion. Kybernetika. 1998, 34: 217–234.
4. Gordienko, E.; Salem-Silva, F. Robustness inequality for Markov control processes with unbounded costs. Systems Control Lett. 1998, 33: 125–130.
5. Hernández-Lerma, O.; Lasserre, J.B. Further Topics on Discrete-Time Markov Control Processes; Springer, New York, 1999.
6. Hilgert, N.; Minjárez-Sosa, J.A. Adaptive policies for time-varying stochastic systems under discounted criterion. Math. Meth. Oper. Res. 2001, 54: 491–505.
7. Luque-Vásquez, F.; Minjárez-Sosa, J.A. Semi-Markov control processes with unknown holding times distribution under a discounted criterion. Math. Meth. Oper. Res. 2005, 61: 455–468.
8. Luque-Vásquez, F.; Robles-Alcaraz, M.T. Controlled semi-Markov models with discounted unbounded costs. Bol. Soc. Mat. Mexicana. 1994, 39: 51–68.
9. Minjárez-Sosa, J.A. Approximation and estimation in Markov control processes under a discounted criterion. Kybernetika. 2004, 40(6): 681–690.
10. Montes-de-Oca, R.; Sakhanenko, A.I.; Salem-Silva, F. Estimates for perturbations of general discounted Markov control chains. Appl. Math. 2003, 30(3): 287–304.
11. Van Dijk, N.M. Perturbation theory for unbounded Markov reward processes with applications to queueing. Adv. Appl. Probab. 1988, 20: 99–111.
12. Van Dijk, N.M.; Puterman, M.L. Perturbation theory for Markov reward processes with applications to queueing systems. Adv. Appl. Probab. 1988, 20: 79–98.
Chapter 16
Discrete Time Approximations of Continuous Time Finite Horizon Stopping Problems Lukasz Stettner
16.1 Introduction

We assume that on a locally compact metric space E, we are given a right continuous standard Markov process (x(t)) with weakly Feller transition operator (P_t), i.e., with transition operator P_t that for t ≥ 0 transforms the class C_0(E) of continuous functions on E vanishing at infinity into itself (see [3] for the properties of such Markov processes). Let C([0, T] × E) denote the class of continuous bounded functions on [0, T] × E. For given functions f, g, h ∈ C([0, T] × E) and a discount factor α > 0, consider the value function of the following continuous-time cost functional:
w(s, x) = sup_{τ ≤ T−s} E_{sx} [ ∫_0^τ e^{−αu} f(s + u, x(u)) du + χ_{τ<T−s} e^{−ατ} g(s + τ, x(τ)) + χ_{τ=T−s} e^{−α(T−s)} h(T, x(T − s)) ],   (16.1)

where the supremum is taken over stopping times τ ≤ T − s. The assumption α > 0 is made for simplicity of the presentation only. In the chapter, we consider two methods to approximate the value function w. The first one will be based on a solution to a suitable penalty equation. For a given β > 0, we shall find a solution w^β to the following equation,
w^β(s, x) = E_{sx} [ ∫_0^{T−s} e^{−αu} ( f(s + u, x(u)) + β (g(s + u, x(u)) − w^β(s + u, x(u)))^+ ) du + e^{−α(T−s)} h(T, x(T − s)) ],   (16.2)

and then shall approximate its intensity version using discretized intensities. The second method consists in direct discretization of stopping times. For a given small Δ, we consider the family of stopping times T_Δ of the discretized Markov process (x(nΔ)) taking values in {0, Δ, 2Δ, . . . , nΔ, . . .}. Assuming that s is a multiple of Δ, we define a discrete time optimal stopping problem

w_Δ(s, x) = sup_{τ ∈ T_Δ} E_{sx} [ ∑_{i=0}^{τ/Δ − 1} e^{−αiΔ} f̄_Δ(s + iΔ, x(iΔ)) + χ_{τ<T−s} e^{−ατ} g(s + τ, x(τ)) + χ_{τ=T−s} e^{−α(T−s)} h(T, x(T − s)) ].
18 On the Regularity Property of Semi-Markov Processes
This condition by itself does not imply the regularity of the SMP (see the example in Ross [7, p. 87]); however, it does provided that C ∈ B^+ and a suitable recurrence condition holds, e.g., the embedded Markov chain is Harris recurrent. To see that this is true, note that

sup_{x∈C} Δ(x) ≤ (1 − ε) + ε exp(−γ) < 1,

which implies that L = {x ∈ X : Δ(x) < 1} ∈ B^+. Hence, from Theorem 18.3.2, the SMP is regular.

Acknowledgements The author takes this opportunity to thank Professor Onésimo Hernández-Lerma for his constant encouragement and support during the author's academic career and, in particular, for his valuable comments on an early version of the present work. Thanks are also due to the referees for their useful observations, which improve the writing and organization of this note.
References

1. Bhattacharya, R.N., Majumdar, M. (1989), Controlled semi-Markov models–The discounted case, J. Statist. Plann. Inf. 21, 365–381.
2. Çinlar, E. (1975), Introduction to Stochastic Processes, Prentice-Hall, Englewood Cliffs, New Jersey.
3. Hernández-Lerma, O., Lasserre, J.B. (2003), Markov Chains and Invariant Probabilities, Birkhäuser, Basel.
4. Jaśkiewicz, A. (2004), On the equivalence of two expected average cost criteria for semi-Markov control processes, Math. Oper. Research 29, 326–338.
5. Limnios, N., Oprişan, G. (2001), Semi-Markov Processes and Reliability, Birkhäuser, Boston.
6. Meyn, S.P., Tweedie, R.L. (1993), Markov Chains and Stochastic Stability, Springer-Verlag, London.
7. Ross, S.M. (1970), Applied Probability Models with Optimization Applications, Holden-Day, San Francisco.
8. Schäl, M. (1992), On the second optimality equation for semi-Markov decision models, Math. Oper. Research 17, 470–486.