Simulation and the Monte Carlo Method
REUVEN Y. RUBINSTEIN
Technion, Israel Institute of Technology
John Wiley & Sons
New York, Chichester, Brisbane, Toronto, Singapore
A NOTE TO THE READER
This book has been electronically reproduced from digital information stored at John Wiley & Sons, Inc. We are pleased that the use of this new technology will enable us to keep works of enduring scholarly value in print as long as there is a reasonable demand for them. The content of this book is identical to previous printings.
Copyright © 1981 by John Wiley & Sons, Inc. All rights reserved. Published simultaneously in Canada.
Reproduction or translation of any part of this work beyond that permitted by Sections 107 or 108 of the 1976 United States Copyright Act without the permission of the copyright owner is unlawful. Requests for permission or further information should be addressed to the Permissions Department, John Wiley & Sons, Inc.
Library of Congress Cataloging in Publication Data:
Rubinstein, Reuven Y. Simulation and the Monte Carlo method. (Wiley series in probability and mathematical statistics) Includes bibliographies and index. 1. Monte Carlo method. 2. Digital computer simulation. I. Title. II. Series. QA298.R8 519.2'82 81-1873 ISBN 0-471-08917-6 AACR2
To my wife Rina and to my friends Eitan Finkelstein and Alexandr Lerner, Russian refuseniks.
Preface

In the last 15 years more than 3000 articles on simulation and the Monte Carlo method have been published. There is a real need for a book providing detailed treatment of the statistical aspects of these topics. This book attempts to fill this need, at least partially. I hope it will make the users of simulation and the Monte Carlo method more knowledgeable about these topics.

It is assumed that the readers are familiar with the basic concepts of probability theory, mathematical statistics, and integral and differential equations, and that they have an elementary knowledge of vector and matrix operators. Sections 6.5, 6.6, 7.3, and 7.6 require more sophistication in probability, statistics, and stochastic processes; they can be omitted on a first reading. Since most complex simulations are implemented on digital computers, a rudimentary acquaintance with computer programming will probably be an asset to the readers of this book, though no computer programs are included.

Chapter 1 describes concepts such as systems, models, and the ideas of Monte Carlo and simulation. A discussion of these concepts seems necessary as there is no uniform terminology in the literature. Instead of giving rigid definitions, I try to make clear what I mean when I use these terms. In addition to the terminology, some examples and ideas of simulation and Monte Carlo methods are given. Chapter 2 deals with several alternative methods for generating random and pseudorandom numbers on a computer, as well as several statistical methods for testing the “randomness” of pseudorandom numbers. Chapter 3 describes methods for generating random variables and random vectors from different probability distributions. Chapter 4 provides a basic treatment of Monte Carlo integration, and Chapter 5 provides a solution of linear, integral, and differential equations by Monte Carlo methods.
It is shown that, in order to find a solution by Monte Carlo methods, we must choose a proper distribution and present
the problem in terms of its expected value. Then, taking a sample from this distribution, we can estimate the expected value. In addition, variance reduction techniques (importance sampling, control variates, stratified sampling, antithetic variates, etc.) are discussed. Chapter 6 deals with simulating regenerative processes and in particular with estimating some output parameters of the steady-state distribution associated with these processes. Simulation results for several practical problems are presented, and variance reduction techniques are given as well. Chapter 7 discusses random search methods, which are also related to Monte Carlo methods. In this chapter I describe how random search methods can be successfully applied for solving complex optimization problems.

The final version of this book was written during my 1980 summer visit at IBM Thomas J. Watson Research Center. I express my gratitude to the Computer Sciences Department for their hospitality and for providing a rich intellectual environment.

A number of people have contributed corrections and suggestions for improvement of the earlier draft of the manuscript, especially P. Feigin, I. Kreimer, O. Maimon, H. Nafetz, G. Samorodnitsky, and E. Yaschin from Technion, Israel Institute of Technology, and P. Heidelberger and S. Lavenberg of IBM Thomas J. Watson Research Center. It is a pleasure to acknowledge my debt to them. I would also like to express my indebtedness to Beatrice Shube of John Wiley & Sons and to Eliezer Goldberg of Technion for their efficient editorial guidance. Many thanks to Marylou Dietrich of IBM and to Eva Gaster of Technion for their excellent typing.

Finally, I thank the following authors and publishers for granting permission for publication of the cited material: Pages 12-17 based on Handbook of Operations Research, Foundations and Fundamentals, edited by Joseph J. Moder and Salah E. Elmaghraby, Van Nostrand Reinhold Company, 1978, pp. 570-573. Pages 23-25 based on D. E.
Knuth, The Art of Computer Programming: Seminumerical Algorithms, Vol. 2, Addison-Wesley, Reading, Massachusetts, 1969, pp. 155-156. Pages 199-208 based on R. Y. Rubinstein, Selecting the best stable stochastic system, in Stochastic Processes and their Applications, 1980 (to appear). Pages 253-255 based on R. Y. Rubinstein and I. Weisman, The Monte Carlo method for global optimization, Cahiers du Centre d'Etudes de Recherche Operationelle, 21, No. 2, 1979, pp. 143-149.
Pages 248-251 based on R. Y. Rubinstein and A. Kornovsky, Local and integral properties of a search algorithm of the stochastic approximation type, Stochastic Processes Appl., 6, 1978, pp. 129-134.
REUVEN Y. RUBINSTEIN
Haifa, Israel
March 1981
Contents

1. SYSTEMS, MODELS, SIMULATION, AND THE MONTE CARLO METHODS
1.1 Systems
1.2 Models
1.3 Simulation and the Monte Carlo Methods
1.4 A Machine Shop Example
References

2. RANDOM NUMBER GENERATION
2.1 Introduction
2.2 Congruential Generators
2.3 Statistical Tests of Pseudorandom Numbers
  2.3.1 Chi-Square Goodness-of-Fit Test
  2.3.2 Kolmogorov-Smirnov Goodness-of-Fit Test
  2.3.3 Cramer-von Mises Goodness-of-Fit Test
  2.3.4 Serial Test
  2.3.5 Run-Up-and-Down Test
  2.3.6 Gap Test
  2.3.7 Maximum Test
Exercises
References

3. RANDOM VARIATE GENERATION
3.1 Introduction
3.2 Inverse Transform Method
3.3 Composition Method
3.4 Acceptance-Rejection Method
  3.4.1 Single-Variate Case
  3.4.2 Multivariate Case
  3.4.3 Generalization of von Neumann’s Method
  3.4.4 Forsythe’s Method
3.5 Simulation of Random Vectors
  3.5.1 Inverse Transform Method
  3.5.2 Multivariate Transformation Method
  3.5.3 Multinormal Distribution
3.6 Generating from Continuous Distributions
  3.6.1 Exponential Distribution
  3.6.2 Gamma Distribution
  3.6.3 Beta Distribution
  3.6.4 Normal Distribution
  3.6.5 Lognormal Distribution
  3.6.6 Cauchy Distribution
  3.6.7 Weibull Distribution
  3.6.8 Chi-Square Distribution
  3.6.9 Student’s t-Distribution
  3.6.10 F Distribution
3.7 Generating from Discrete Distributions
  3.7.1 Binomial Distribution
  3.7.2 Poisson Distribution
  3.7.3 Geometric Distribution
  3.7.4 Negative Binomial Distribution
  3.7.5 Hypergeometric Distribution
Exercises
References

4. MONTE CARLO INTEGRATION AND VARIANCE REDUCTION TECHNIQUES
4.1 Introduction
4.2 Monte Carlo Integration
  4.2.1 The Hit or Miss Monte Carlo Method
  4.2.2 The Sample-Mean Monte Carlo Method
  4.2.3 Efficiency of Monte Carlo Method
  4.2.4 Integration in Presence of Noise
4.3 Variance Reduction Techniques
  4.3.1 Importance Sampling
  4.3.2 Correlated Sampling
  4.3.3 Control Variates
  4.3.4 Stratified Sampling
  4.3.5 Antithetic Variates
  4.3.6 Partition of the Region
  4.3.7 Reducing the Dimensionality
  4.3.8 Conditional Monte Carlo
  4.3.9 Random Quadrature Method
  4.3.10 Biased Estimators
  4.3.11 Weighted Monte Carlo Integration
  4.3.12 More about Variance Reduction (Queueing Systems and Networks)
Exercises
References
Additional References

5. LINEAR EQUATIONS AND MARKOV CHAINS
5.1 Simultaneous Linear Equations and Ergodic Markov Chains
  5.1.1 Adjoint System of Linear Equations
  5.1.2 Computing the Inverse Matrix
  5.1.3 Solving a System of Linear Equations by Simulating a Markov Chain with an Absorbing State
5.2 Integral Equations
  5.2.1 Integral Transforms
  5.2.2 Integral Equations of the Second Kind
  5.2.3 Eigenvalue Problem
5.3 The Dirichlet Problem
Exercises
References

6. REGENERATIVE METHOD FOR SIMULATION ANALYSIS
6.1 Introduction
6.2 Regenerative Simulation
6.3 Point Estimators and Confidence Intervals
6.4 Examples of Regenerative Processes
  6.4.1 A Single Server Queue GI/G/1
  6.4.2 A Repairman Model with Spares
  6.4.3 A Closed Queueing Network
6.5 Selecting the Best Stable Stochastic System
6.6 The Regenerative Method for Constrained Optimization Problems
6.7 Variance Reduction Techniques
  6.7.1 Control Variates
  6.7.2 Common Random Numbers in Comparing Stochastic Systems
Exercises
References

7. MONTE CARLO OPTIMIZATION
7.1 Random Search Algorithms
7.2 Efficiency of Random Search Algorithms
7.3 Local and Integral Properties of Optimum Trial Random Search Algorithm RS4
  7.3.1 Local Properties of the Algorithm
  7.3.2 Integral Properties of the Algorithm
7.4 Monte Carlo Method for Global Optimization
7.5 A Closed Form Solution for Global Optimization
7.6 Optimization by Smoothed Functionals
Appendix
Exercises
References

INDEX
CHAPTER 1
Systems, Models, Simulation, and the Monte Carlo Methods

In this chapter we discuss the concepts of systems, models, simulation, and Monte Carlo methods. This discussion seems necessary in the absence of a unified terminology in the literature. We do not give rigid definitions, however, but explain what we mean when using the above-mentioned terms.

1.1 SYSTEMS
By a system we mean a set of related entities, sometimes called components or elements. For instance, a hospital can be considered as a system, with doctors, nurses, and patients as elements. The elements have certain characteristics, or attributes, that have logical or numerical values. In our example an attribute can be, for instance, the number of beds, the number of X-ray machines, skill, quantity, and so on. A number of activities (relations) exist among the elements, and consequently the elements interact. These activities cause changes in the system. For example, the hospital has X-ray machines that have an operator. If there is no operator, the doctors cannot have X-rays of the patients taken.

We consider both internal and external relationships. The internal relationships connect the elements within the system, while the external relationships connect the elements with the environment, that is, with the world outside the system. For instance, an internal relationship is the relationship or interaction between the doctors and nurses, or between the nurses and the patients. An external relationship is, for example, the way in which the patients are delivered to the emergency room. We can represent a system by a diagram, as in Fig. 1.1.1.

Fig. 1.1.1 Graphical representation of a system (input, system, output, and a feedback loop).

The system is influenced by the environment through the input it receives from the environment. When a system has the capability of reacting to changes in its own state, we say that the system contains feedback. A nonfeedback, or open-loop, system lacks this characteristic. For an example of feedback consider a waiting line; when there are more than a certain number of patients, the hospital can add more staff to handle the increased workload.

The attributes of the system elements define its state. In our example the number of patients waiting for a doctor describes the system's state. When a patient arrives at or leaves the hospital, the system moves to a new state. If the behavior of the elements cannot be predicted exactly, it is useful to take random observations from the probability distributions and to average the performance of the objective.

We say that a system is in equilibrium or in the steady state if the probability of being in some state does not vary in time. There are still actions in the system, that is, the system can still move from one state to another, but the probabilities of its moving from one state to another are fixed. These fixed probabilities are limiting probabilities that are realized after a long period of time, and they are independent of the state in which the system started. A system is called stable if it returns to the steady state after an external shock in the system. If the system is not in the steady state, it is in a transient state.

We can classify systems in a variety of ways. There are natural and artificial systems, adaptive and nonadaptive systems. An adaptive system reacts to changes in its environment, whereas a nonadaptive system does not.
Analysis of an adaptive system requires a description of how the environment induces a change of state. Suppose that over a period of time the number of patients increases. If the hospital adds more staff to handle the increased workload, we say that the hospital is an adaptive system.
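The idea of a steady state described above can be illustrated with a small numerical sketch. The three-state chain below (0, 1, or 2 patients waiting) and its transition probabilities are hypothetical, chosen only for illustration; the function names are likewise assumptions. The sketch iterates the state probabilities from two different starting states and suggests that, for this particular chain, both runs approach the same limiting probabilities.

```python
# Hypothetical three-state model: the state is the number of patients waiting
# (0, 1, or 2). The transition probabilities are illustrative assumptions,
# not values taken from the text.
P = [
    [0.6, 0.4, 0.0],
    [0.3, 0.4, 0.3],
    [0.0, 0.5, 0.5],
]


def step(dist, matrix):
    """Advance the state probabilities by one time step: new = dist * matrix."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def limiting_distribution(start, matrix, steps=200):
    """Iterate the state probabilities for a fixed number of steps."""
    dist = list(start)
    for _ in range(steps):
        dist = step(dist, matrix)
    return dist


# Start once from an empty waiting line and once from a full one.
from_empty = limiting_distribution([1.0, 0.0, 0.0], P)
from_full = limiting_distribution([0.0, 0.0, 1.0], P)
print([round(p, 4) for p in from_empty])
print([round(p, 4) for p in from_full])
```

Once the probabilities have settled, applying `step` again leaves them essentially unchanged, which is the sense in which the probability of being in a given state no longer varies in time even though the system keeps moving between states.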
1.2 MODELS
The first step in studying a system is building a model. The importance of models and model-building has been discussed by Rosenbluth and Wiener [32], who wrote:

No substantial part of the universe is so simple that it can be grasped and controlled without abstraction. Abstraction consists in replacing the part of the universe under consideration by a model of similar but simpler structure. Models ... are thus a central necessity of scientific procedure.
A scientific model can be defined as an abstraction of some real system, an abstraction that can be used for prediction and control. The purpose of a scientific model is to enable the analyst to determine how one or more changes in various aspects of the modeled system may affect other aspects of the system or the system as a whole. A crucial step in building the model is constructing the objective function, which is a mathematical function of the decision variables. There are many types of models. Churchman et al. [4] and Kiviat [15] described the following kinds:
1 Iconic models: those that pictorially or visually represent certain aspects of a system.
2 Analog models: those that employ one set of properties to represent some other set of properties that the system being studied possesses.
3 Symbolic models: those that require mathematical or logical operations and can be used to formulate a solution to the problem at hand.
In this book, however, we are concerned only with symbolic models (which are also called abstract models), that is, we deal with models consisting of mathematical symbols or flowcharts. All other models (iconic, analog, verbal, physical, etc.), although no less important, are excluded from this book. There are many advantages to using mathematical models. According to Fishman [8] they do the following:
1 Enable investigators to organize their theoretical beliefs and empirical observations about a system and to deduce the logical implications of this organization.
2 Lead to improved system understanding.
3 Bring into perspective the need for detail and relevance.
4 Expedite the analysis.
5 Provide a framework for testing the desirability of system modifications.
6 Allow for easier manipulation than the system itself permits.
7 Permit control over more sources of variation than direct study of a system would allow.
8 Are generally less costly than the system.

An additional advantage is that a mathematical model describes a problem more concisely than, for instance, a verbal description does. On the other hand, there are at least three reservations in Fishman’s monograph [8], which we should always bear in mind while constructing a model. First, there is no guarantee that the time and effort devoted to modeling will return a useful result and satisfactory benefits. Occasional failures occur because the level of resources is too low. More often, however, failure results when the investigator relies too much on method and not enough on ingenuity; the proper balance between the two leads to the greatest probability of success. The second reservation concerns the tendency of an investigator to treat his or her particular depiction of a problem as the best representation of reality. This is often the case after much time and effort have been spent and the investigator expects some useful results. The third reservation concerns the use of the model to predict beyond the range of its applicability without proper qualification.

Mathematical models can be classified in many ways. Some models are static, others are dynamic. Static models are those that do not explicitly take time-variation into account, whereas dynamic models deal explicitly with time-variable interaction. For instance, Ohm’s law is an example of a static model, while Newton’s law of motion is an example of a dynamic model. Another distinction concerns deterministic versus stochastic models. In a deterministic model all mathematical and logical relationships between the elements are fixed. As a consequence these relationships completely determine the solutions. In a stochastic model at least one variable is random.
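The two distinctions just drawn (static versus dynamic, deterministic versus stochastic) can be made concrete with a short sketch. The function names, the parameter values, the free-fall setting used for Newton's law, and the Gaussian perturbation of the resistance are all illustrative assumptions and do not come from the text.

```python
import random


def ohm_voltage(current, resistance):
    """Static, deterministic model: Ohm's law V = I * R, with no time variable."""
    return current * resistance


def falling_velocity(duration, dt=0.01, g=9.81):
    """Dynamic, deterministic model: Newton's second law for a body falling
    from rest without drag, dv/dt = g, integrated with explicit Euler steps."""
    velocity = 0.0
    for _ in range(round(duration / dt)):
        velocity += g * dt
    return velocity


def noisy_voltage(current, resistance, noise_sd, rng):
    """Stochastic variant of Ohm's law: the resistance is perturbed by a
    Gaussian error (an illustrative assumption), so outputs vary per call."""
    return current * rng.gauss(resistance, noise_sd)


rng = random.Random(1)  # fixed seed so that a run can be repeated exactly
print(ohm_voltage(2.0, 5.0))
print(falling_velocity(1.0))
print([noisy_voltage(2.0, 5.0, 0.1, rng) for _ in range(3)])
```

Calling the deterministic functions twice with the same arguments returns the same value, whereas successive calls to `noisy_voltage` differ because one of its variables is random.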
While building a model, care must be taken to ensure that it remains a valid representation of the problem. In order to be useful, a scientific model necessarily embodies elements of two conflicting attributes: realism and simplicity. On the one hand, the model should serve as a reasonably close approximation to the real system and incorporate most of the important aspects of the system. On the other hand, the model must not be so complex that it is impossible to understand and manipulate. Being a formalism, a model is necessarily an abstraction. Often we think that the more details a model includes the better it resembles reality. But adding details makes the solution more difficult and
converts the method for solving a problem from an analytical to an approximate numerical one. In addition, it is not even necessary for the model to approximate the system to indicate the measure of effectiveness for all various alternatives. All that is required is that there be a high correlation between the prediction by the model and what would actually happen with the real system. To ascertain whether this requirement is satisfied or not, it is important to test and establish control over the solution.

Usually, we begin testing the model by re-examining the formulation of the problem and revealing possible flaws. Another criterion for judging the validity of the model is determining whether all mathematical expressions are dimensionally consistent. A third useful test consists of varying input parameters and checking that the output from the model behaves in a plausible manner. The fourth test is the so-called retrospective test. It involves using historical data to reconstruct the past and then determining how well the resulting solution would have performed if it had been used. Comparing the effectiveness of this hypothetical performance with what actually happened then indicates how well the model predicts the reality. However, a disadvantage of retrospective testing is that it uses the same data that guided formulation of the model. Unless the past is a true replica of the future, it is better not to resort to this test at all.

Suppose that the conditions under which the model was built change. In this case the model must be modified and control over the solution must be established. Often, it is desirable to identify the critical input parameters of the model, that is, those parameters subject to changes that would affect the solution, and to establish systematic procedures to control them.
This can be done by sensitivity analysis, in which the respective parameters are varied over their ranges to determine the degree of variation in the solution of the model.

After constructing a mathematical model for the problem under consideration, the next step is to derive a solution from this model. There are analytic and numerical solution methods. An analytic solution is usually obtained directly from its mathematical representation in the form of a formula. A numerical solution is generally an approximate solution obtained as a result of substitution of numerical values for the variables and parameters of the model. Many numerical methods are iterative, that is, each successive step in the solution uses the results from the previous step. Newton’s method for approximating the root of a nonlinear equation can serve as an example. Two special types of numerical methods are simulation and the Monte Carlo methods. The following section discusses these.
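As a minimal sketch of an iterative numerical method, the following applies Newton's method to the equation x^2 - 2 = 0. The function name `newton_root`, the tolerance, the iteration limit, and the test equation are assumptions made for illustration; they are not taken from the text.

```python
def newton_root(f, f_prime, x0, tol=1e-12, max_iter=50):
    """Approximate a root of f by Newton's method, starting from x0.

    Each iterate is computed from the previous one:
    x_next = x - f(x) / f'(x).
    """
    x = x0
    for _ in range(max_iter):
        slope = f_prime(x)
        if slope == 0.0:
            raise ZeroDivisionError("derivative vanished; choose another x0")
        x_next = x - f(x) / slope
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    raise RuntimeError("no convergence within max_iter iterations")


# Solve x**2 - 2 = 0 starting from x0 = 1; the positive root is sqrt(2).
root = newton_root(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
print(root)
```

The result is an approximation whose accuracy is governed by the stopping tolerance, in contrast to an analytic solution given by a formula.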
1.3 SIMULATION AND THE MONTE CARLO METHODS
Simulation has long been an important tool of designers, whether they are simulating a supersonic jet flight, a telephone communication system, a wind tunnel, a large-scale military battle (to evaluate defensive or offensive weapon systems), or a maintenance operation (to determine the optimal size of repair crews). Although simulation is often viewed as a “method of last resort” to be employed when everything else has failed, recent advances in simulation methodologies, availability of software, and technical developments have made simulation one of the most widely used and accepted tools in system analysis and operations research. Naylor et al. [28] define simulation as follows:

Simulation is a numerical technique for conducting experiments on a digital computer, which involves certain types of mathematical and logical models that describe the behavior of a business or economic system (or some component thereof) over extended periods of real time.
This definition is extremely broad, however, and can include such seemingly unrelated things as economic models, wind tunnel testing of aircraft, war games, and business management games. Naylor et al. [28] write:

The fundamental rationale for using simulation is man’s unceasing quest for knowledge about the future. This search for knowledge and the desire to predict the future are as old as the history of mankind. But prior to the seventeenth century the pursuit of predictive power was limited almost entirely to the purely deductive methods of such philosophers as Plato, Aristotle, Euclid, and others.
Simulation deals with both abstract and physical models. Some simulation with physical and abstract models might involve participation by real people. Examples include link-trainers for pilots and military or business games. Two types of simulation involving real people deserve special mention. One is operational gaming, the other man-machine simulation. The term “operational gaming” refers to those simulations characterized by some form of conflict of interest among players or human decisionmakers within the framework of the simulated environment, and the experimenter, by observing the players, may be able to test hypotheses concerning the behavior of the individuals and/or the decision system as a whole.
In operational gaming a computer is often used to collect, process, and produce information that human players, usually adversaries, need to make decisions about system operation. Each player’s objective is to perform as well as possible. Moreover, each player’s decisions affect the information that the computer provides as the game progresses through simulated time. The computer can also play an active role by initiating predetermined or random actions to which the players respond. War games and business management games are commonly discussed in operational gaming literature (see, e.g., Morgenthaler [23] and Shubik [38]).

Military gaming is essentially a training device for military leaders; it enables them to test the effects of alternative strategies under simulated war conditions. For example, the Naval Electronic Warfare Simulator, developed in the 1950s, consisted of a large analog computer designed primarily to assess ship damage and to provide information to two opposite forces regarding their respective effectiveness in a naval engagement [14, pp. 15, 16]. The exercise, which is one form of simulation gaming, has been used as an educational device for naval fleet officers in the final stages of their training.

Business games are also a type of educational tool, but for training managers or business executives rather than military leaders.

A business game is a contrived situation which imbeds players in a simulated business environment, where they must make management-type decisions from time to time, and their choices at one time generally affect the environmental conditions under which subsequent decisions must be made. Further, the interaction between decisions and environment is determined by a refereeing process which is not open to argument from the players [30, pp. 7, 8].
In man-machine simulation there is no need for gaming. While interacting with the computer, real people in the laboratory perform the data reduction and analysis. The following two examples are drawn from Fishman [8]: The Rand Systems Research Laboratory employed simulation to generate stimuli for the study of information processing centers [14, p. 16]. The principal features of a radar site were reproduced in the laboratory, and by carefully controlling the synthetic input to the system and recording the behavior of the human detectors it was possible to examine the relative effectiveness of various man-machine combinations and procedures. In 1956 Rand established the Logistics System Laboratory under U.S. Air Force sponsorship [10]. The first study in this laboratory involved
simulation of two large logistics systems in order to compare their effectiveness under different management and resource utilization policies. Each system consisted of men and machines, together with policy rules for the use of such resources in simulated stress situations such as war. The simulated environment required a specified number of aircraft in flying and alert states, while the system’s capability to meet these objectives was limited by malfunctioning parts, procurement and transportation delays, and the like. The human participants represented management personnel, while higher echelon policies in the utilization of resources were simulated on the computer. The ultimate criteria of the effectiveness of each system were the number of operationally ready aircraft and the dollar cost of maintaining this number. Although the purpose of the first study in this laboratory was to test the feasibility of introducing new procedures into an existing air force logistics system and to compare the modified system with the original one, the second laboratory problem had quite a different objective. Its purpose was to improve the design of the operational control system through the use of simulation.

Naylor et al. [28] describe many situations where simulation can be successfully used. We mention some of them. First, it may be either impossible or extremely expensive to obtain data from certain processes in the real world. Such processes might involve, for example, the performance of large-scale rocket engines, the effect of proposed tax cuts on the economy, and the effect of an advertising campaign on total sales. In this case we say that the simulated data are necessary to formulate hypotheses about the system. Secondly, the observed system may be so complex that it cannot be described in terms of a set of mathematical equations for which analytic solutions are obtainable. Most economic systems fall into this category.
For example, it is virtually impossible to describe the operation of a business firm, an industry, or an economy in terms of a few simple equations. Simulation has been found to be an extremely effective tool for dealing with problems of this type. Another class of problems that leads to similar difficulties is that of large-scale queueing problems involving multiple channels that are either parallel or in series (or both). Thirdly, even though a mathematical model can be formulated to describe some system of interest, it may not be possible to obtain a solution to the model by straightforward analytic techniques. Again, economic systems and complex queueing problems provide examples of this type of difficulty. Although it may be conceptually possible to use a set of mathematical equations to describe the behavior of a dynamic system
operating under conditions of uncertainty, present-day mathematics and computer technology are simply incapable of handling a problem of this magnitude. Fourth, it may be either impossible or very costly to perform validating experiments on the mathematical models describing the system. In this case we say that the simulation data can be used to test alternative hypotheses. In all these cases simulation is the only practical tool for obtaining relevant answers. Naylor et al. [28] have suggested that simulation analysis might be appropriate for the following reasons:
1 Simulation makes it possible to study and experiment with the complex internal interactions of a given system whether it be a firm, an industry, an economy, or some subsystem of one of these.
2 Through simulation we can study the effects of certain informational, organizational, and environmental changes on the operation of a system by making alterations in the model of the system and observing the effects of these alterations on the system's behavior.
3 Detailed observation of the system being simulated may lead to a better understanding of the system and to suggestions for improving it, suggestions that otherwise would not be apparent.
4 Simulation can be used as a pedagogical device for teaching both students and practitioners basic skills in theoretical analysis, statistical analysis, and decision making. Among the disciplines in which simulation has been used successfully for this purpose are business administration, economics, medicine, and law.
5 Operational gaming has been found to be an excellent means of stimulating interest and understanding on the part of the participant, and is particularly useful in the orientation of persons who are experienced in the subject of the game.
6 The experience of designing a computer simulation model may be more valuable than the actual simulation itself. The knowledge obtained in designing a simulation study frequently suggests changes in the system being simulated. The effects of these changes can then be tested via simulation before implementing them on the actual system.
7 Simulation of complex systems can yield valuable insight into which variables are more important than others in the system and how these variables interact.
8 Simulation can be used to experiment with new situations about which we have little or no information so as to prepare for what may happen.
9 Simulation can serve as a “preservice test” to try out new policies and decision rules for operating a system, before running the risk of experimenting on the real system.
10 Simulations are sometimes valuable in that they afford a convenient way of breaking down a complicated system into subsystems, each of which may then be modeled by an analyst or team that is expert in that area [23, p. 373].
11 Simulation makes it possible to study dynamic systems in either real time, compressed time, or expanded time.
12 When new components are introduced into a system, simulation can be used to help foresee bottlenecks and other problems that may arise in the operation of the system [23, p. 375].
Computer simulation also enables us to replicate an experiment. Replication means rerunning an experiment with selected changes in parameters or operating conditions being made by the investigator. In addition, computer simulation often allows us to induce correlation between the random number sequences used in different runs to improve the statistical analysis of the output of a simulation. In particular, a negative correlation is desirable when the results of two replications are to be summed, whereas a positive correlation is preferred when the results are to be differenced, as in the comparison of experiments.

Simulation does not require that a model be presented in a particular format. It permits a considerable degree of freedom so that a model can bear a close correspondence to the system being studied. The results obtained from simulation are much the same as observations or measurements that might have been made on the system itself. To demonstrate the principles involved in executing a discrete simulation, an example of simulating a machine shop is given in Section 1.4.

Many programming systems have been developed, incorporating simulation languages. Some of them are general-purpose in nature, while others are designed for specific types of systems. FORTRAN, ALGOL, and PL/1 are examples of general-purpose languages, while GPSS, SIMSCRIPT, and SIMULA are examples of special simulation languages.

Simulation is indeed an invaluable and very versatile tool in those problems where analytic techniques are inadequate. However, it is by no means ideal. Simulation is an imprecise technique. It provides only statistical estimates rather than exact results, and it only compares alternatives rather than generating the optimal one. Simulation is also a slow and costly way to study a problem. It usually requires a large amount of time and great expense for analysis and programming. Finally, simulation yields only numerical data about the performance of the system, and sensitivity
analysis of the model parameters is very expensive. The only possibility is to conduct series of simulation runs with different parameter values.

We have defined simulation as a technique of performing sampling experiments on the model of the system. This general definition is often called simulation in a wide sense, whereas simulation in a narrow sense, or stochastic simulation, is defined as experimenting with the model over time; it includes sampling stochastic variates from probability distributions [19]. Therefore stochastic simulation is actually a statistical sampling experiment with the model. This sampling involves all the problems of statistical design analysis. Because sampling from a particular distribution involves the use of random numbers, stochastic simulation is sometimes called Monte Carlo simulation.

Historically, the Monte Carlo method was considered to be a technique, using random or pseudorandom numbers, for solution of a model. Random numbers are essentially independent random variables uniformly distributed over the unit interval (0, 1). Actually, what are available at computer centers are arithmetic codes for generating sequences of pseudorandom digits, where each digit (0 through 9) occurs with approximately equal probability (likelihood). Consequently, the sequences can model successive flips of a fair ten-sided die. Such codes are called random number generators. Grouped together, these generated digits yield pseudorandom numbers with any required number of elements. We discuss random and pseudorandom numbers in the next chapter.

One of the earliest problems connected with the Monte Carlo method is the famous Buffon’s needle problem. The problem is as follows. A needle of length $l$ units is thrown randomly onto a floor composed of parallel planks of equal width $d$ units, where $d > l$. What is the probability that the needle, once it comes to rest, will cross (or touch) a crack separating the planks on the floor?
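The needle experiment is easy to imitate on a computer. The following Python sketch is only an illustration: the function name `buffon_estimate`, its parameters, and the fixed seed are assumptions made here, not part of the text. Each throw is represented by the distance from the needle's center to the nearest crack and by the acute angle between the needle and the cracks, both sampled uniformly; the needle crosses a crack when the projected half-length reaches that distance.

```python
import math
import random


def buffon_estimate(n_throws, needle_len=1.0, plank_width=2.0, seed=12345):
    """Estimate the crossing probability as (number of hits) / (number of throws)."""
    if not 0 < needle_len < plank_width:
        raise ValueError("the problem assumes 0 < l < d")
    rng = random.Random(seed)  # fixed seed so that the run can be reproduced
    hits = 0
    for _ in range(n_throws):
        # Distance from the needle's center to the nearest crack: uniform on [0, d/2].
        y = rng.uniform(0.0, plank_width / 2.0)
        # Acute angle between the needle and the cracks: uniform on [0, pi/2].
        theta = rng.uniform(0.0, math.pi / 2.0)
        # The needle touches or crosses the crack if its projected half-length reaches y.
        if y <= (needle_len / 2.0) * math.sin(theta):
            hits += 1
    return hits / n_throws


if __name__ == "__main__":
    print(buffon_estimate(200_000))
```

With $l = 1$ and $d = 2$ the returned fraction settles near 0.318 as the number of throws grows; the error shrinks only in proportion to the inverse square root of the number of throws.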
It can be shown that the probability of the needle hitting a crack is $P = 2l/(\pi d)$, which can be estimated as the ratio of the number of throws hitting the crack to the total number of throws.

In the beginning of the century the Monte Carlo method was used to examine the Boltzmann equation. In 1908 the famous statistician Student used the Monte Carlo method for estimating the correlation coefficient in his t-distribution. The term “Monte Carlo” was introduced by von Neumann and Ulam during World War II, as a code word for the secret work at Los Alamos; it was suggested by the gambling casinos at the city of Monte Carlo in Monaco. The Monte Carlo method was then applied to problems related to the atomic bomb. The work involved direct simulation of behavior concerned with random neutron diffusion in fissionable material. Shortly thereafter Monte Carlo methods were used to evaluate complex multidimensional integrals and to solve certain integral equations, occurring in physics, that were not amenable to analytic solution.

The Monte Carlo method can be used not only for solution of stochastic problems, but also for solution of deterministic problems. A deterministic problem can be solved by the Monte Carlo method if it has the same formal expression as some stochastic process. In Chapter 4 we show how the Monte Carlo method can be used for evaluating multidimensional integrals and some parameters of queues and networks. In Chapter 5 the Monte Carlo method is used for solution of certain integral and differential equations.

Another field of application of the Monte Carlo methods is sampling of random variates from probability distributions, which Morgenthaler [23] calls model sampling. Chapter 3 deals with sampling from various distributions.

The Monte Carlo method is now the most powerful and commonly used technique for analyzing complex problems. Applications can be found in many fields from radiation transport to river basin modeling. Recently, the range of applications has been broadening, and the complexity and computational effort required has been increasing, because realism is associated with more complex and extensive problem descriptions.

Finally, we mention some differences between the Monte Carlo method and simulation:

1 In the Monte Carlo method time does not play as substantial a role as it does in stochastic simulation.
2 The observations in the Monte Carlo method, as a rule, are independent. In simulation, however, we experiment with the model over time so, as a rule, the observations are serially correlated.
3 In the Monte Carlo method it is possible to express the response as a rather simple function of the stochastic input variates. In simulation the response is usually a very complicated one and can be expressed explicitly only by the computer program itself.

1.4 A MACHINE SHOP EXAMPLE
This example is quoted from Gordon [11, pp. 570-573]. For better understanding of the example an important distinction to be made is whether an entity is permanent or temporary. Permanent entities can be compactly and efficiently represented in tables, while temporary entities will be volatile records and are usually handled by the list processing technique described later.
Consider a simple machine shop (or a single stage in the manufacturing process of a more complex machine shop). The shop is to machine five types of parts. The parts arrive at random intervals and are distributed randomly among the different types. There are three machines, all equally able to machine any part. If a machine is available at the time a part arrives, machining begins immediately. If all machines are busy upon arrival, the part will wait for service. On completion of machining the part will be dispatched to a certain destination, depending on its type. The progress of the part is not followed after it is dispatched from the shop. However, a count of the number of parts dispatched to each destination is kept.

Clearly, there are two types of elements in the system: parts and machines. There will be a stream of temporary elements, that is, the parts that enter and leave the system. There is no point in representing the different types of parts as different elements; rather, the type is an attribute of the parts. As indicated before, it is simpler to consider the group of machines as a single permanent element, having as attributes the number of machines and a count of the number currently busy. The activities causing changes in the system are the generation of parts, waiting, machining, and departing.

(a) System Image

A set of numbers is needed to record the state of the system at any time. This set of numbers is called the system image, since it reflects the state of the system. The simulation proceeds by deciding, from the system image, when the next event is due to occur and what type of event it will be; testing whether it can be executed; and executing the changes to the image implied by the event. The image must have a number representing clock time, and this number is advanced, in uneven steps, with the succession of events in the system.
For each part record, there are four numbers to represent the part type, the arrival time, the machining time, and the time the part will next be involved in an event. The first three of these items are random variates derived by the methods described in Chapters 3 and 4. The next event time, in general, depends on the state of the system, and must be derived as the simulation proceeds.

The organization used for the system image is illustrated in Fig. 1.4.1. There are four frames in this figure, representing successive states of the system. The frames are read from left to right and from top to bottom. The frame in the top left corner is the initial state. The description of the system image is made in terms of that particular frame.
[Fig. 1.4.1 Machine shop example. Four frames of the system image. Each frame lists the next arrival, the waiting parts, and the parts being machined (columns: part type, machine time, arrival time, next event time), followed by the clock time and the five counters.]
The top line of the system image represents the part due to enter the system next. As shown here, it is a type 2 part, will require 75 minutes of machining, and is due to arrive at time 1002. This, of course, is also its next event time. Below the next arrival listing is an open-ended list of the parts that have arrived and are now waiting for service. Currently, there are two waiting parts. As indicated, they are listed in order of arrival. Because the waiting parts are delayed, it is not possible to predict a next event time for them. It is necessary to see whether there is a waiting part when a machine finishes, and to offer service to the first part in the waiting line.

The next rows of numbers represent the parts now being machined, in this case limited to three. Once machining begins, the time to finish can be derived and entered as the next event time. Three parts are occupying the machines at this time and they have been listed in the order in which they will finish. Finally, a number represents the clock time, here set to an initial value of 1000, and there are five counters showing how many parts of each type have been completed.

Note that it is not customary to precalculate all the random variates. Instead, each is calculated at the time it is needed, so a simulation program continually switches between the examination and manipulation of the system image and the subroutines that calculate the random variates.

(b) The Simulation Process

Looking now at the system image in Fig. 1.4.1, assume all events that can be executed up to time 1000 have been processed. It is now time to begin one more cycle. The first step is to find the next potential event by scanning all the event times. Because of the ordering of the parts being machined, it is, in fact, necessary only to compare the time of the next arrival with the first listed time in the machining section.
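The system image and one event cycle can be sketched in Python as follows. This is an illustration under stated assumptions rather than Gordon's program: the names `Part`, `SystemImage`, `start_machining`, `step`, and `demo` are hypothetical, ties between an arrival and a completion are resolved in favor of the arrival, and the part types, arrival times, and machining times of the parts already in the shop are made-up values chosen so that the clock starts at 1000 with a type 2 part due at 1002.

```python
import bisect
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Part:
    ptype: int                        # part type, 1..5
    machine_time: int                 # minutes of machining required
    arrival_time: int                 # time the part enters the shop
    next_event: Optional[int] = None  # unknown while the part is waiting


@dataclass
class SystemImage:
    clock: int
    next_arrival: Part
    waiting: List[Part]      # listed in order of arrival
    machining: List[Part]    # listed in order of finish time
    counters: List[int]      # completed parts, by type
    n_machines: int = 3


def start_machining(image, part):
    """Put a part on a free machine, keeping the list ordered by finish time."""
    part.next_event = image.clock + part.machine_time
    finish_times = [p.next_event for p in image.machining]
    image.machining.insert(bisect.bisect_right(finish_times, part.next_event), part)


def step(image, make_arrival):
    """One cycle: find the next event, advance the clock, update the image."""
    arrival = image.next_arrival
    first_done = image.machining[0] if image.machining else None
    if first_done is None or arrival.next_event <= first_done.next_event:
        image.clock = arrival.next_event
        if len(image.machining) < image.n_machines:
            start_machining(image, arrival)
        else:
            arrival.next_event = None        # cannot be predicted while waiting
            image.waiting.append(arrival)
        image.next_arrival = make_arrival()  # generate the successor arrival
        return "arrival"
    image.clock = first_done.next_event
    image.machining.pop(0)
    image.counters[first_done.ptype - 1] += 1
    if image.waiting:
        start_machining(image, image.waiting.pop(0))
    return "completion"


def demo():
    # Hypothetical future arrivals; a real program would draw random variates here.
    future = iter([Part(1, 68, 1018, 1018), Part(3, 50, 1030, 1030)])
    image = SystemImage(
        clock=1000,
        next_arrival=Part(2, 75, 1002, 1002),
        waiting=[Part(4, 84, 976), Part(5, 43, 990)],
        machining=[Part(2, 62, 941, 1003), Part(1, 81, 936, 1017),
                   Part(3, 68, 972, 1040)],
        counters=[12, 22, 20, 31, 15],
    )
    events = [step(image, lambda: next(future)) for _ in range(3)]
    return image, events
```

Calling `demo()` runs three cycles from the initial state. Only the head of the machining list is ever compared with the next arrival, which is the saving obtained from keeping that list ordered by finish time.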
With the numbers shown in frame 1, the next event is the arrival of a part at time 1002, so the clock is updated to this time in the second frame. The arriving part finds all machines busy and must join the waiting line. The successor to the part just arrived is generated and inserted as the next future arrival, due to arrive at time 1018.

Another cycle can now begin. The next event is the completion of machining a part at time 1003. The third frame of Fig. 1.4.1 shows the state of the system at the end of this event. The clock is updated to 1003 and the finished part is removed from the system, after incrementing by 1 the counter for that part type. There is a waiting part, so machining is started on the first part in the waiting line, and its next event time, derived from the machining time of 84, is calculated as 1087. In this case the new part for machining has the largest
finish time, and it joins the end of the list of parts being machined. The records in the waiting line and the machine segment are all moved down one line. There is then another completion at 1017 that, as before, leads to a counter being incremented and service being offered to the first part in the waiting line. In this case, however, the machining time is short enough for the new part to finish ahead of one whose machining started earlier, so, instead of being the last listed part, the new part becomes the second in the list. This is shown in the last frame of Fig. 1.4.1.

(c) Statistics Gathering

The purpose of the simulation, of course, is to learn something about the system. In this case only the counts of the number of completed parts by type have been kept. Depending upon the purpose of the simulation study, other statistics could be gathered. Simulation language programs include routines for collecting certain typical statistics. Among the commonly used types of statistics are the following:
1 Counts. Counts give the number of elements of a given type or the number of times some event occurred.
2 Utilization of equipment. This can be counted in terms of the fraction of time the equipment is in use or in terms of the average number of units in use.
3 Distributions. This means distributions of random variates, such as processing times and response times, together with their means and standard deviations.

(d) List Processing

In the machine shop example it was convenient to describe the records as though they were located in one of three places, corresponding to whether they represented parts that were arriving, waiting, or being processed. The simulation was described in terms of moving the records from one place to the next, possibly with some resorting. A computer program that used this approach would be very inefficient because of the large amount of data movement involved. Much better control and efficiency are obtained by using list processing.

With this technique each record consists of a number of contiguous words (or bytes), some of which are reserved for constructing a list of the records. Each record contains, in a standard position, the address of the next record in the list. This is called a pointer. A special word, called a header, located in a known position, contains a pointer to the first record in the list. The last record in the list has an end-of-list symbol in place of its pointer. If the list happens to be empty, the end-of-list symbol appears in the header. The pointers, beginning from the header, place the records in a specific order, and allow a program to search the records by following the chain of
pointers. These lists, in fact, are usually called chains. There may be another set of pointers tracing through the chain from end to beginning so that a program can move along the chain in either direction. It is also possible for a record to be on more than one chain, simply by reserving pointer space for each possible chain.

Removing or adding a record, or reorganizing the order of a chain, now becomes a matter of manipulating pointers. To remove C from a chain of the records A, B, C, D, ..., the pointer of B is redirected to D. If the record is being discarded, its storage space would probably be returned to another chain from which it can be reassigned later. To put the record Z between B and C, the pointer of B is directed to Z and the pointer of Z is set to indicate C. Reordering a chain consists of a series of removals and insertions.

As can be seen, list processing does not require that records be physically moved. It therefore provides an efficient way of transferring records from one category to another by moving them on and off chains, and it can easily manage lists that are constantly changing size; these are two properties that are very desirable in simulation programming. Therefore list processing is used in the implementation of all major discrete system simulation languages, including the GPSS and SIMSCRIPT simulation programs.

REFERENCES
1 Ackoff, R. L., Towards a system of systems concepts, Manage. Sci., 17, 1971, 661-671.
2 Burt, J. M., D. P. Gaver, and M. Perlas, Simple stochastic networks: Some problems and procedures, Nav. Res. Logist. Quart., 17, 1970, 439-459.
3 Chorafas, D. N., Systems and Simulation, Academic, New York, 1965.
4 Churchman, C. W., R. L. Ackoff, and E. L. Arnoff, Introduction to Operations Research, Wiley, New York, 1959.
5 Emshoff, J. R. and R. L. Sisson, Design and Use of Computer Simulation Models, Macmillan, New York, 1970.
6 Ermakov, J. M., Monte Carlo Method and Related Questions, Nauka, Moscow, 1976 (in Russian).
7 Evans, G. W., G. F. Wallace, and G. L. Sutherland, Simulation Using Digital Computers, Prentice-Hall, Englewood Cliffs, New Jersey, 1967.
8 Fishman, G. S., Concepts and Methods in Discrete Event Digital Simulation, Wiley, New York, 1973.
9 Fishman, G. S., Principles of Discrete Event Simulation, Wiley, New York, 1978.
10 Geisler, M. A., The use of man-machine simulation for support planning, Nav. Res. Logist. Quart., 7, 1960, 421-428.
11 Gordon, G., System Simulation, Prentice-Hall, Englewood Cliffs, New Jersey, 1969.
12 Handbook of Operations Research, Foundations and Fundamentals, edited by J. J. Moder and S. E. Elmaghraby, Van Nostrand Reinhold, New York, 1978.
13 Hammersley, J. M. and D. C. Handscomb, Monte Carlo Methods, Wiley, New York; Methuen, London, 1964.
14 Harman, H. H., Simulation: A survey, Report SP-260, System Development Corporation, Santa Monica, California, 1961.
15 Hillier, F. S. and G. J. Lieberman, Introduction to Operations Research, Holden-Day, San Francisco, California, 1968, Chapter 14.
16 Hollingdale, S. H. (Ed.), Digital Simulation in Operations Research, American Elsevier, New York, 1967.
17 IBM Corporation, Bibliography on Simulation, Form No. 320-0924, 112 East Post Road, White Plains, New York, 1966.
18 Kiviat, P. J., Digital Computer Simulation: Modeling Concepts, Report RM-5378-PR, The Rand Corporation, Santa Monica, California, 1967.
19 Kleijnen, J. P. C., Statistical Techniques in Simulation, Part I, Marcel Dekker, New York, 1974.
20 Lewis, P. A. W., Large-scale Computer-Aided Statistical Mathematics, Naval Postgraduate School, Monterey, California, in Proc. Computer Science and Statistics: 6th Annual Symp. Interface, Western Periodical Co., Hollywood, California, 1972.
21 Lucas, H. C., Performance evaluation and monitoring, Comput. Surv., 3, 1971, 79-91.
22 Maisel, H. and G. Gnugnoli, Simulation of Discrete Stochastic Systems, Science Research Associates, Palo Alto, California, 1972.
23 Morgenthaler, G. W., The theory and application of simulation in operations research, in Progress in Operations Research, edited by R. L. Ackoff, Wiley, New York, 1961.
24 McLeod, J. (Ed.), Simulation, McGraw-Hill, New York, 1968.
25 McMillan, C., Jr., and R. Gonzalez, Systems Analysis: A Computer Approach to Decision Models, Rev. ed., Richard D. Irwin, Homewood, Illinois, 1965.
26 Mikhailov, G. A., Some Problems in the Theory of the Monte-Carlo Method, Nauka, Novosibirsk, U.S.S.R., 1974 (in Russian).
27 Mize, J. H. and J. G. Cox, Essentials of Simulation, Prentice-Hall, Englewood Cliffs, New Jersey, 1968.
28 Naylor, T. J., J. L. Balintfy, D. S. Burdick, and K. Chu, Computer Simulation Techniques, Wiley, New York, 1966.
29 Naylor, T. J., Computer Simulation Experiments with Models of Economic Systems, Wiley, New York, 1971.
30 Proc. Conf. Business Games, sponsored by the Ford Foundation and School of Business Administration, Tulane University, April 26-28, 1961.
31 Reitman, J., Computer Simulation Applications: Discrete-Event Simulation for the Synthesis and Analysis of Complex Systems, Wiley, New York, 1971.
32 Rosenblueth, A. and N. Wiener, The role of models in science, Philos. Sci., XII, No. 4, Oct. 1945, 316-321.
33 Smith, J., Computer Simulation Models, Hafner, New York, 1968.
34 Sobol, J. M., Computational Method of Monte Carlo, Nauka, Moscow, 1973 (in Russian).
35 Shreider, Y. A. (Ed.), Method of Statistical Testing: Monte Carlo Method, Elsevier, Amsterdam, 1964.
36 Stephenson, R. E., Computer Simulation for Engineers, Harcourt Brace Jovanovich, New York, 1971.
37 Shubik, M., On gaming and game theory, Manage. Sci., Professional Series, 18, 1972, 37-53.
38 Shubik, M., A Preliminary Bibliography on Gaming, Department of Administrative Sciences, Yale University, New Haven, Connecticut, 1970.
39 Shubik, M., Bibliography on simulation, gaming, artificial intelligence and allied topics, J. Amer. Stat. Assoc., 55, 1960, 736-751.
40 Tocher, K. D., The Art of Simulation, D. Van Nostrand, Princeton, New Jersey, 1963.
41 Yakowitz, S. J., Computational Probability and Simulation, Addison-Wesley, Reading, Massachusetts, 1977.
CHAPTER 2
Random Number Generation

2.1 INTRODUCTION
In this chapter we are concerned with methods of generating random numbers on digital computers. The importance of random numbers in the Monte Carlo method and simulation has been discussed in Chapter 1. The emphasis in this chapter is mainly on the properties of numbers associated with uniform random variates. The term random number is used instead of uniform random number.

Many techniques for generating random numbers have been suggested, tested, and used in recent years. Some of these are based on random phenomena, others on deterministic recurrence procedures. Initially, manual methods were used, including such techniques as coin flipping, dice rolling, card shuffling, and roulette wheels. It was believed that only mechanical (or electronic) devices could yield “truly” random numbers. These methods were too slow for general use, and moreover, sequences generated by them could not be reproduced. Shortly following the advent of the computer it became possible to obtain random numbers with its aid.

One method of generating random numbers on a digital computer consists of preparing a table and storing it in the memory of the computer. In 1955 the RAND Corporation published [46] a well known table of a million random digits that may be used in forming such a table. The advantage of this method is reproducibility; its disadvantage is its lack of speed and the risk of exhausting the table.

In view of these difficulties, John von Neumann [56] suggested the mid-square method, using the arithmetic operations of a computer. His idea was to take the square of the preceding random number and extract the
middle digits; for example, if we are generating four-digit numbers and arrive at 5232, we square it, obtain 27,373,824; the next number consists of the middle four digits, namely 3738, and the procedure is repeated. This raises a logical question: how can such sequences, defined in a completely deterministic way, be random? The answer is that they are not really random, but only seem so, and are in fact referred to as pseudorandom or quasi-random; still we call them random, with the appropriate reservation. Von Neumann’s method likewise proved slow and awkward for statistical analysis; in addition the sequences tend to cyclicity, and once a zero is encountered the sequence terminates.

We say that the random numbers generated by this or any other method are “good” ones if they are uniformly distributed, statistically independent, and reproducible. A good method is, moreover, necessarily fast and requires minimum memory capacity. Since all these properties are rarely, if ever, realized, some compromise must be found. The congruential methods for generating pseudorandom numbers, discussed in the next section, were designed specifically to satisfy as many of these requirements as possible.
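The mid-square rule described in this section takes only a few lines of code. The sketch below is a Python illustration; the function name `mid_square` and its parameters are assumptions made here, and the square is padded with leading zeros to twice the number of digits before the middle digits are extracted.

```python
def mid_square(seed, count, digits=4):
    """Return `count` successive mid-square numbers, starting after `seed`."""
    values = []
    x = seed
    for _ in range(count):
        # Square the current number and pad it to 2 * digits characters.
        square = str(x * x).zfill(2 * digits)
        # Keep the middle `digits` characters as the next number.
        start = digits // 2
        x = int(square[start:start + digits])
        values.append(x)
    return values


if __name__ == "__main__":
    print(mid_square(5232, 2))  # [3738, 9726]
```

Starting from 0 the function returns only zeros, which illustrates the degeneracy noted in the text.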
2.2 CONGRUENTIAL GENERATORS
The most commonly used present-day method for generating pseudorandom numbers is one that produces a nonrandom sequence of numbers according to some recursive formula based on calculating the residues modulo some integer $m$ of a linear transformation. It is readily seen from this definition that each term of the sequence is available in advance, before the sequence is actually generated. Although these processes are completely deterministic, it can be shown [31] that the numbers generated by the sequence appear to be uniformly distributed and statistically independent.

Congruential methods are based on a fundamental congruence relationship, which may be expressed as [32]

$$X_{i+1} \equiv (aX_i + c) \pmod{m}, \qquad i = 1, \ldots, n, \tag{2.2.1}$$

where the multiplier $a$, the increment $c$, and the modulus $m$ are nonnegative integers. The modulo notation (mod $m$) means that

$$X_{i+1} = aX_i + c - mk_i, \tag{2.2.2}$$

where $k_i = [(aX_i + c)/m]$ denotes the largest integer not exceeding $(aX_i + c)/m$.

Given an initial starting value $X_0$ (also called the seed), (2.2.2) yields a congruence relationship (modulo $m$) for any value $i$ of the sequence $\{X_i\}$.
Generators that produce random numbers according to (2.2.1) are called mixed congruential generators. The random numbers on the unit interval (0, 1) can be obtained by

$$U_i = \frac{X_i}{m}. \tag{2.2.3}$$
Clearly, such a sequence will repeat itself in at most $m$ steps, and will therefore be periodic. For example, let $a = c = X_0 = 3$ and $m = 5$; then the sequence obtained from the recursive formula $X_{i+1} \equiv 3X_i + 3 \pmod 5$ is $X_i = 3, 2, 4, 0, 3$.

It follows from (2.2.2) that $X_i < m$ for all $i$. This inequality means that the period of the generator cannot exceed $m$, that is, the sequence $\{X_i\}$ contains at most $m$ distinct numbers (the period of the generator in the example is 4, while $m = 5$). Because of the deterministic character of the sequence, the entire sequence recurs as soon as any number is repeated. We say that the sequence “gets into a loop,” that is, there is a cycle of numbers that is repeated endlessly. It is shown [31] that all sequences having the form $X_{i+1} = f(X_i)$ “get into a loop.” We want, of course, to choose $m$ as large as possible to ensure a sufficiently large sequence of distinct numbers in a cycle.

Let $p$ be the period of the sequence. When $p$ equals its maximum, that is, when $p = m$, we say that the random number generator has a full period. It can be shown [31] that the generator defined in (2.2.1) has a full period, $m$, if and only if:

1 $c$ is relatively prime to $m$, that is, $c$ and $m$ have no common divisor other than 1.
2 $a \equiv 1 \pmod g$ for every prime factor $g$ of $m$.
3 $a \equiv 1 \pmod 4$ if $m$ is a multiple of 4.

Condition 1 means that the greatest common divisor of $c$ and $m$ is unity. Condition 2 means that $a = g[a/g] + 1$. Let $g$ be a prime factor of $m$; then denoting $k = [a/g]$, we may write

$$a = 1 + gk. \tag{2.2.4}$$

Condition 3 means that

$$a = 1 + 4[a/4] \tag{2.2.5}$$

if $m/4$ is an integer.

Greenberger [19] showed that the correlation coefficient between $X_i$ and $X_{i+1}$ lies between the values

$$\frac{1}{a} - \frac{6c}{am}\left(1 - \frac{c}{m}\right) \pm \frac{a}{m}$$

and that its upper bound is achieved when $a = m^{1/2}$ irrespective of the value of $c$.
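The recursion (2.2.1) and the three full-period conditions can be checked numerically for small moduli. The Python sketch below is an illustration only; the names `lcg`, `period`, `prime_factors`, and `has_full_period` are assumptions made here. The period is found by brute force, which is practical only when $m$ is small.

```python
from itertools import islice
from math import gcd


def lcg(a, c, m, x0):
    """Yield X_1, X_2, ... from X_{i+1} = (a X_i + c) mod m."""
    x = x0
    while True:
        x = (a * x + c) % m
        yield x


def period(a, c, m, x0):
    """Length of the loop the sequence eventually enters (brute force)."""
    first_seen = {}
    x, step = x0, 0
    while x not in first_seen:
        first_seen[x] = step
        x = (a * x + c) % m
        step += 1
    return step - first_seen[x]


def prime_factors(n):
    """Set of prime factors of n, by trial division."""
    factors, p = set(), 2
    while p * p <= n:
        while n % p == 0:
            factors.add(p)
            n //= p
        p += 1
    if n > 1:
        factors.add(n)
    return factors


def has_full_period(a, c, m):
    """Check conditions 1 to 3 for a full period m."""
    if gcd(c, m) != 1:                                   # condition 1
        return False
    if any((a - 1) % g != 0 for g in prime_factors(m)):  # condition 2
        return False
    if m % 4 == 0 and (a - 1) % 4 != 0:                  # condition 3
        return False
    return True


if __name__ == "__main__":
    print(list(islice(lcg(3, 3, 5, 3), 4)))               # [2, 4, 0, 3]
    print(period(3, 3, 5, 3), has_full_period(3, 3, 5))   # 4 False
    print(period(5, 1, 16, 0), has_full_period(5, 1, 16)) # 16 True
```

The first case is the example of the text ($a = c = X_0 = 3$, $m = 5$), which fails condition 2 and has period 4. The second case, $a = 5$, $c = 1$, $m = 16$, is a small hypothetical generator that satisfies all three conditions.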
Since most computers utilize either a binary or a decimal digit system, we select $m = 2^\beta$ or $m = 10^\beta$, respectively, where $\beta$ denotes the word length of the particular computer. We discuss both cases separately in the following.

For a binary computer we take $m = 2^\beta$. It follows from conditions 1 to 3 that, for $m = 2^\beta$, a full period is obtained when the parameter $c$ is odd and

$$a \equiv 1 \pmod 4, \tag{2.2.6}$$

which can be achieved by setting $a = 2^r + 1$, $r \ge 2$. It is noted in the literature [25, 35, 44] that good statistical results can be achieved while choosing $m = 2^{35}$, $a = 2^7 + 1$, and $c = 1$.

For a decimal computer $m = 10^\beta$. In order to generate a sequence with a full period, $c$ must be a positive number not divisible by $g = 2$ or $g = 5$, and the multiplier $a$ must satisfy the condition $a \equiv 1 \pmod{20}$, or alternatively, $a = 10^r + 1$, $r > 1$. Satisfactory statistical results have been achieved [1] by choosing $a = 101$, $c = 1$, $r \ge 4$. In this case $X_0$ had little or no effect on the statistical properties of the generated sequences.

The second widely used generator is the multiplicative generator

$$X_{i+1} \equiv aX_i \pmod m, \tag{2.2.7}$$

which is a particular case of the mixed generator (2.2.1) with $c = 0$. It can be shown [1, 2, 5, 31] that, generally, a full period cannot be achieved here, but a maximal period can, provided that $X_0$ is relatively prime to $m$ and $a$ meets certain congruence conditions.

For a binary computer we again choose $m = 2^\beta$, and it is shown [31] that the maximal period is achieved when $a = 8r \pm 3$. Here $r$ is any positive integer. The procedure for generating pseudorandom numbers on a binary computer* can be written as:

1 Choose any odd number as a starting value $X_0$.
2 Choose an integer $a = 8r \pm 3$, where $r$ is any positive integer. Choose $a$ close to $2^{\beta/2}$ (if $\beta = 35$, $a = 2^{17} + 3$ is a good selection).
3 Compute $aX_0$ using fixed point integer arithmetic. This product will consist of $2\beta$ bits, from which the high-order $\beta$ bits are discarded, and the low-order $\beta$ bits represent $X_1$.
4 Calculate $U_1 = X_1/2^\beta$ to obtain a uniformly distributed variable.
5 Each successive random number $X_{i+1}$ is obtained from the low-order bits of the product $aX_i$.

*This procedure and the one that follows are reproduced almost verbatim from Ref. 31.
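The five steps of the binary procedure can be sketched in Python as follows. The names `multiplicative_binary` and `cycle_length` are hypothetical, and a bit mask stands in for the discarding of the high-order $\beta$ bits that a fixed-word machine performs automatically. A deliberately small word length ($\beta = 10$) is used so that the cycle can be counted by brute force.

```python
from itertools import islice


def multiplicative_binary(a, beta, x0):
    """Yield (X_i, U_i) pairs from X_{i+1} = a X_i mod 2**beta."""
    if x0 % 2 == 0:
        raise ValueError("step 1: the starting value must be odd")
    if a % 8 not in (3, 5):
        raise ValueError("step 2: the multiplier must have the form 8r +/- 3")
    mask = (1 << beta) - 1           # keeps the low-order beta bits
    x = x0
    while True:
        x = (a * x) & mask           # steps 3 and 5: discard the high-order bits
        yield x, x / (1 << beta)     # step 4: U_i = X_i / 2**beta


def cycle_length(a, beta, x0):
    """Number of steps until the sequence returns to its (odd) starting value."""
    mask = (1 << beta) - 1
    x, n = (a * x0) & mask, 1
    while x != x0:
        x = (a * x) & mask
        n += 1
    return n


if __name__ == "__main__":
    # beta = 10, so 2**(beta/2) = 32; a = 35 = 8*4 + 3 is close to it.
    print(cycle_length(35, 10, 1))                        # 256
    print([x for x, _ in islice(multiplicative_binary(35, 10, 1), 3)])
```

For this small modulus the cycle contains 256 numbers, one quarter of $m = 1024$, which is why the multiplicative generator is said to reach a maximal rather than a full period. Every generated $X_i$ stays odd.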
For a decimal computer $m = 10^\beta$. It is shown in Ref. 49 that the maximal period is achieved when $a = 200r \pm p$, where $r$ is any positive integer and $p$ is any of the following 16 numbers: {3, 11, 13, 19, 21, 27, 29, 37, 53, 59, 61, 67, 69, 77, 83, 91}. The procedure for generating random numbers on a decimal computer can be written as:

1 Choose any odd integer not divisible by 5 as a starting value $X_0$.
2 Choose an integer $a = 200r \pm p$ for a constant multiplier, where $r$ is any integer and $p$ is any of the values 3, 11, 13, 19, 21, 27, 29, 37, 53, 59, 61, 67, 69, 77, 83, 91. Choose $a$ close to $10^{\beta/2}$. (If $\beta = 10$, $a = 100{,}000 \pm 3$ is a good selection.)
3 Compute $aX_0$ using fixed point integer arithmetic. This product will consist of $2\beta$ digits, from which the high-order $\beta$ digits are discarded, and the low-order $\beta$ digits are the value of $X_1$. Integer multiplication instructions automatically discard the high-order digits.
4 The decimal point must be shifted $\beta$ digits to the left to convert the random number (which is an integer) into a uniformly distributed variate defined over the unit interval, $U_1 = X_1/10^\beta$.
5 Each successive random number $X_{i+1}$ is obtained from the low-order digits of the product $aX_i$.
Another type of generator, in which $X_{i+1}$ depends on more than one of the preceding values, is the additive congruential generator [17]

$$X_{i+1} \equiv X_i + X_{i-k} \pmod m, \qquad k = 1, 2, \ldots, i - 1. \tag{2.2.8}$$

In the particular case $k = 1$ we obtain the well known Fibonacci sequence, which behaves like sequences produced by the multiplicative congruential method with $a = (1 + \sqrt{5})/2$. Unfortunately, a Fibonacci sequence is not satisfactorily random, but its statistical properties improve as $k$ increases.

RESUME: We have seen that a sequence of pseudorandom numbers produced by a congruential generator is completely defined by the numbers $X_0$, $a$, $c$, and $m$. In order to obtain satisfactory statistical results our choice must be based on the following six principles:*

1 The number $X_0$ may be chosen arbitrarily. If the program is run several times and a different source of random numbers is desired each time, set $X_0$ equal to the last value attained by $X$ on the preceding run, or (if more convenient) set $X_0$ equal to the current date and time.

*These six principles are reproduced by permission from Knuth [31, pp. 155-156].
2 The number $m$ should be large. It may conveniently be taken as the computer's word length, since this makes the computation of $(aX + c) \pmod m$ quite efficient. The computation of $(aX + c) \pmod m$ must be done exactly, with no roundoff error.
3 If $m$ is a power of 2 (i.e., if a binary computer is being used), pick $a$ so that $a \pmod 8 = 5$. If $m$ is a power of 10 (i.e., if a decimal computer is being used), choose $a$ so that $a \pmod{200} = 21$. This choice of $a$, together with the choice of $c$ given below, ensures that the random number generator will produce all $m$ different possible values of $X$ before it starts to repeat.
4 The multiplier $a$ should be larger than $\sqrt{m}$, preferably larger than $m/100$, but smaller than $m - \sqrt{m}$. The best policy is to take some haphazard constant to be the multiplier, such as $a = 3{,}141{,}592{,}621$ (which satisfies both of the conditions in 3).
5 The constant $c$ should be an odd number when $m$ is a power of 2 and, when $m$ is a power of 10, should also not be a multiple of 5.
6 The least significant (right-hand) digits of $X$ are not very random, so decisions based on the number $X$ should always be primarily influenced by the most significant digits. It is generally better to think of $X$ as a random fraction $X/m$ between 0 and 1, that is, to visualize $X$ with a decimal point at its left, than to regard $X$ as a random integer between 0 and $m - 1$. To compute a random integer between 0 and $k - 1$, we would multiply by $k$ and truncate the result.
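The six principles can be illustrated with a short Python sketch of the mixed generator. The modulus $m = 2^{32}$, the constant $c = 1$, and the function names are assumptions chosen here for illustration; only the multiplier 3,141,592,621 is taken from principle 4.

```python
def mixed_generator(x0, a=3_141_592_621, c=1, m=2**32):
    """Yield X_1, X_2, ... from X_{i+1} = (a * X_i + c) mod m.

    Python integers are exact, so the product is computed with no
    roundoff error, as principle 2 requires.
    """
    x = x0 % m
    while True:
        x = (a * x + c) % m
        yield x


def random_integer(x, m, k):
    """Map X in [0, m) to an integer in [0, k - 1].

    This multiplies the fraction X/m by k and truncates (principle 6),
    so the result is driven by the most significant digits of X.
    """
    return (k * x) // m
```

For a toy modulus such as `m=16` with `a=5` and `c=1` (so that $c$ is odd and $a \equiv 1 \pmod 4$), the generator visits all 16 residues before repeating.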
Finally, we present in this section the IBM System/360 Uniform Random Number Generator, a multiplicative congruential generator that utilizes the full word size, which is equal to 32 bits with 1 bit reserved for algebraic sign. Therefore an obvious choice for $m$ is $2^{31}$. A pure congruential generator ($c = 0$) with $m = 2^{k}$ ($k > 0$) can have a maximum period length of $m/4$. Thus the maximum period length is $2^{31}/4 = 2^{29}$. The period length also depends on the starting value. When the modulus $m$ is prime, the maximum possible period length is $m - 1$. The largest prime less than or equal to $2^{31}$ is $2^{31} - 1$. Hence, if we choose $m = 2^{31} - 1$, the uniform random number generator will have a maximum period length of $m - 1 = 2^{31} - 2$, which is only the upper bound on the period length. The maximum period length depends on the choice of the multiplier. Note that the conditions ensuring a maximum period length do not necessarily guarantee good statistical properties for the generator, although the choice of the particular multiplier $7^5$ does satisfy some known conditions regarding the statistical performance of the generated sequence.

The System/360 Generator can be described as follows. Choose any $X_0 > 0$. For $n \ge 1$,

$$X_n = 7^5 X_{n-1} \pmod{2^{31} - 1} = 16{,}807\, X_{n-1} \pmod{2^{31} - 1}.$$

The random numbers are (see (2.2.3))

$$U_n = X_n/(2^{31} - 1).$$

The results of the statistical tests of the System/360 Uniform Random Number Generator indicate that it is very satisfactory. Versions of this generator are used in the IBM SL/MATH package, the IBM version of APL, the Naval Postgraduate School random number generator package LLRANDOM, and the International Mathematics and Statistics Library (IMSL) package. The generator is also used in the simulation programming language SIMPL/I. The assembly language subroutines GGL1 and GGL2 of IBM Corporation (1974) also implement this generator, as well as the FORTRAN subroutine GGL.

2.3 STATISTICAL TESTS OF PSEUDORANDOM NUMBERS
In this section we describe some statistical tests for checking independence and uniformity of a sequence of pseudorandom numbers produced by a computer program. As mentioned earlier, a sequence of pseudorandom numbers is completely deterministic, but insofar as it passes the set of statistical tests, it may be treated as one of "truly" random numbers, that is, as a sample from $\mathcal{U}(0, 1)$. Our object in this section is to provide some idea of these tests rather than present rigorous proofs. For a more detailed discussion of this topic the reader is referred to Fishman [11] and Knuth [31].

2.3.1 Chi-Square Goodness-of-Fit Test
The chi-square goodness-of-fit test, proposed by Pearson in 1900, is perhaps the best known of all statistical tests. Let $X_1, \ldots, X_N$ be a sample drawn from a population with unknown cumulative distribution function (c.d.f.) $F_X(x)$. We wish to test the null hypothesis

$$H_0: F_X(x) = F_0(x), \quad \text{for all } x,$$

where $F_0(x)$ is a completely specified c.d.f., against the alternative

$$H_1: F_X(x) \ne F_0(x), \quad \text{for some } x.$$

Assume that the $N$ observations have been grouped into $k$ mutually exclusive categories, and denote by $N_j$ and $Np_j^0$ the observed number of trial outcomes and the expected number for the $j$th category, $j = 1, \ldots, k$, respectively, when $H_0$ is true.
The test criterion suggested by Pearson uses the following statistic:

$$Y = \sum_{j=1}^{k} \frac{(N_j - Np_j^0)^2}{Np_j^0}, \tag{2.3.1}$$

which tends to be small when $H_0$ is true and large when $H_0$ is false. The exact distribution of the random variable $Y$ is quite complicated, but for large samples its distribution is approximately chi-square with $k - 1$ degrees of freedom [15]. Under the $H_0$ hypothesis we expect

$$P\left(Y > \chi^2_{1-\alpha}\right) = \alpha, \tag{2.3.2}$$

where $\alpha$ is the significance level, say 0.05 or 0.1; the quantile $\chi^2_{1-\alpha}$ that corresponds to probability $1 - \alpha$ is given in the tables of chi-square distribution.

When testing for uniformity we simply divide the interval [0, 1] into $k$ nonoverlapping subintervals of length $1/k$ so that $Np_j^0 = N/k$. In this case we have

$$Y = \frac{k}{N} \sum_{j=1}^{k} \left(N_j - \frac{N}{k}\right)^2, \tag{2.3.3}$$

and (2.3.2) can again be applied for testing random number generators. To ensure the asymptotic properties of $Y$ it is often recommended in the literature to choose $N > 5k$ and $k > 1000$, where $k$ is taken as a power of 2 for a binary computer and a power of 10 for a decimal computer.
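As a sketch of how the uniformity statistic (2.3.3) might be computed, the following Python function counts the observations falling in each of $k$ equal subintervals. The function name and its arguments are assumptions made for this illustration; the critical value still has to be looked up in a chi-square table with $k - 1$ degrees of freedom.

```python
def chi_square_uniformity(sample, k):
    """Return Y = (k/N) * sum_j (N_j - N/k)**2 for a sample in [0, 1]."""
    n = len(sample)
    counts = [0] * k
    for u in sample:
        # Index of the subinterval containing u; u == 1.0 goes to the last one.
        j = min(int(u * k), k - 1)
        counts[j] += 1
    expected = n / k
    return (k / n) * sum((nj - expected) ** 2 for nj in counts)
```

The hypothesis of uniformity is rejected at level $\alpha$ when the returned value exceeds the tabulated quantile; for example, with $k = 10$ and $\alpha = 0.05$ the tabulated value for 9 degrees of freedom is about 16.92.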
2.3.2 Kolmogorov-Smirnov Goodness-of-Fit Test

Another test well known in statistical literature is the one proposed by Kolmogorov and developed by Smirnov. Let $X_1, \ldots, X_N$ again denote a random sample from unknown c.d.f. $F_X(x)$. The sample cumulative distribution function, denoted by $F_N(x)$, is defined as

$$F_N(x) = \frac{1}{N}(\text{number of } X_i \text{ less than or equal to } x) = \frac{1}{N} \sum_{i=1}^{N} I_{(-\infty, x]}(X_i),$$

where $I_{(-\infty, x]}(X)$ is the indicator random variable (r.v.), that is,

$$I_{(-\infty, x]}(X) = \begin{cases} 1, & X \le x \\ 0, & \text{otherwise.} \end{cases} \tag{2.3.4}$$

For fixed $x$, $F_N(x)$ is itself an r.v., since it is a function of the sample.
Let us show that $F_N(x)$ has the same distribution as the sample mean of a Bernoulli distribution, namely

$$P\left[F_N(x) = \frac{k}{N}\right] = \binom{N}{k} [F_X(x)]^{k} [1 - F_X(x)]^{N-k}, \quad k = 0, 1, \ldots, N. \tag{2.3.5}$$

Denote $V_i = I_{(-\infty, x]}(X_i)$; then $V_i$ has a Bernoulli distribution with parameter $P(V_i = 1) = P(X_i \le x) = F_X(x)$. Since $\sum_{i=1}^{N} V_i$ has a binomial distribution with parameters $N$ and $F_X(x)$, and since $F_N(x) = (1/N) \sum_{i=1}^{N} V_i$, the result follows immediately. From (2.3.5) we see that

$$E[F_N(x)] = F_X(x) \tag{2.3.6}$$

and

$$\operatorname{var} F_N(x) = \frac{1}{N} F_X(x)[1 - F_X(x)]. \tag{2.3.7}$$

Equations (2.3.6) and (2.3.7) show that, for fixed $x$, $F_N(x)$ is an unbiased and consistent estimator of $F_X(x)$ irrespective of the form of $F_X(x)$. Since $F_N(x)$ is the sample mean of the random variables $I_{(-\infty, x]}(X_i)$, $i = 1, \ldots, N$, it follows from the central-limit theorem that $F_N(x)$ is asymptotically normally distributed with mean $F_X(x)$ and variance $(1/N) F_X(x)[1 - F_X(x)]$.

We are interested in estimating $F_X(x)$ for every $x$ (rather than for a fixed $x$) and in finding how close $F_N(x)$ is to $F_X(x)$ jointly over all values $x$. The result

$$\lim_{N \to \infty} P\left[\sup_{-\infty < x < \infty} |F_N(x) - F_X(x)| > \varepsilon\right] = 0 \tag{2.3.8}$$

is known as the Glivenko-Cantelli theorem, which states that for every $\varepsilon > 0$ the step function $F_N(x)$ converges uniformly to the distribution function $F_X(x)$. Therefore for large $N$ the deviation $|F_N(x) - F_X(x)|$ between the true function $F_X(x)$ and its statistical image $F_N(x)$ should be small for all values of $x$. The random quantity

$$D_N = \sup_{-\infty < x < \infty} |F_N(x) - F_X(x)|, \tag{2.3.9}$$

which measures how far $F_N(x)$ deviates from $F_X(x)$, is called the Kolmogorov-Smirnov one-sample statistic. Kolmogorov and Smirnov proved that, for any continuous distribution $F_X(x)$,

$$\lim_{N \to \infty} P\left(\sqrt{N}\, D_N \le x\right) = H(x) = 1 - 2 \sum_{j=1}^{\infty} (-1)^{j-1} e^{-2 j^2 x^2}, \quad x > 0. \tag{2.3.10}$$
The function $H(x)$ has been tabulated and the approximation was found to be sufficiently close for practical applications, so long as $N$ exceeds 35. The c.d.f. $H(x)$ does not depend on the one from which the sample was drawn; that is, the limiting distribution of $\sqrt{N}\, D_N$ is distribution-free. This fact allows $D_N$ to be broadly used as a statistic for goodness-of-fit. For instance, assume that we have the random sample $X_1, \ldots, X_N$ and wish to test $H_0: F_X(x) = F_0(x)$ for all $x$, where $F_0(x)$ is a completely specified c.d.f. (in our case $F_0(x)$ is the uniform distribution in the interval (0, 1)). If $H_0$ is true, which means that we have a good random number generator, then

$$\sqrt{N}\, D_N = \sqrt{N} \sup_{-\infty < x < \infty} |F_N(x) - F_0(x)|$$

is approximately distributed as the c.d.f. $H(x)$. If $H_0$ is false, which means that we have a bad random number generator, then $F_N(x)$ will tend to be near the true c.d.f. $F_X(x)$ rather than near $F_0(x)$, and consequently $\sup_{-\infty < x < \infty} |F_N(x) - F_0(x)|$ will tend to be large. Hence a reasonable test criterion is to reject $H_0$ if $\sup_{-\infty < x < \infty} |F_N(x) - F_0(x)|$ is large. The Kolmogorov-Smirnov goodness-of-fit test with significance level $\alpha$ rejects $H_0$ if and only if $\sqrt{N}\, D_N > x_{1-\alpha}$, where the quantile $x_{1-\alpha}$ is given in the tables of $H(x)$.

Before we leave the chi-square and Kolmogorov-Smirnov tests, a word is in order on the similarity and difference between them. The similarity lies in the fact that both of them indicate how well a given set of observations (pseudorandom numbers) fits some specified distribution (in our case the uniform distribution); the difference is that the Kolmogorov-Smirnov test applies to continuous (jumpless) c.d.f.'s and the chi-square to distributions consisting exclusively of jumps (since all the observations are divided into $k$ categories). Still the chi-square test may be applied to a continuous $F_X(x)$, provided its domain is divided into $k$ parts and the variables within each part are disregarded. This is essentially what we did earlier when testing whether or not the sequence obtained from the random number generator comes from the uniform distribution. When applying the chi-square test allowance must be made for its sensitivity to the number of classes and their widths, arbitrarily chosen by the statistician.

Another difference is that chi-square requires grouped data whereas Kolmogorov-Smirnov does not. Therefore when the hypothesized distribution is continuous Kolmogorov-Smirnov allows us to examine the goodness-of-fit for each of the $n$ observations, instead of only for $k$ classes, where $k \le n$. In this sense Kolmogorov-Smirnov makes more complete use of the available data.
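For the uniform case $F_0(x) = x$ the supremum in $D_N$ is attained at one of the sample points, so the statistic can be computed from the sorted sample. The sketch below assumes this standard simplification; the function names and the truncation of the series for $H(x)$ after a fixed number of terms are choices made here, not part of the text.

```python
import math


def ks_statistic_uniform(sample):
    """Return D_N = sup_x |F_N(x) - x| for a sample taken in [0, 1]."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        # F_N jumps from (i - 1)/n to i/n at the i-th order statistic.
        d = max(d, i / n - x, x - (i - 1) / n)
    return d


def kolmogorov_cdf(x, terms=100):
    """Approximate H(x) = 1 - 2 * sum_{j>=1} (-1)**(j-1) * exp(-2 j^2 x^2).

    The truncated series is adequate for moderate x; it converges
    slowly when x is very close to zero.
    """
    if x <= 0.0:
        return 0.0
    s = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * x * x)
            for j in range(1, terms + 1))
    return 1.0 - 2.0 * s
```

A test at level $\alpha$ would compare `math.sqrt(n) * ks_statistic_uniform(sample)` with the quantile $x_{1-\alpha}$ of $H$, or equivalently check whether `kolmogorov_cdf` of that product exceeds $1 - \alpha$.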
$y_{1-\alpha}$, where the quantity $y_{1-\alpha}$ can be found from the appropriate tables.

2.3.4 Serial Test [31]
The serial test is used to check the degree of randomness between successive numbers in a sequence and represents an extension of the chi-square goodness-of-fit test. Let $X_1 = (U_1, \ldots, U_k)$, $X_2 = (U_{k+1}, \ldots, U_{2k})$, ..., $X_N = (U_{(N-1)k+1}, \ldots, U_{Nk})$ be a sequence of $N$ $k$-tuples. We wish to test the hypothesis that the r.v.'s $X_1, X_2, \ldots, X_N$ are independent and uniformly distributed over the $k$-dimensional unit hypercube. Dividing this hypercube into $r^k$ elementary hypercubes, each with volume $1/r^k$, and denoting by $V_{j_1, \ldots, j_k}$ the number of $k$-tuples falling within the element

$$\left(\frac{j_i - 1}{r}, \frac{j_i}{r}\right], \quad i = 1, \ldots, k; \quad j_i = 1, \ldots, r,$$

we have that the statistic

$$Y = \frac{r^k}{N} \sum_{j_1, \ldots, j_k = 1}^{r} \left(V_{j_1, \ldots, j_k} - \frac{N}{r^k}\right)^2 \tag{2.3.13}$$

has an asymptotic chi-square distribution with $r^k - 1$ degrees of freedom. Since there are $r^k$ hypercubes within which $X_i$ may fall, the question of available space arises. If $k = 3$ and $r = 1000$, the serial test requires $1000^3 = 10^9$ counters, a problematic requirement in terms of both storage and search. In these circumstances the test is rarely used for $k > 2$.

2.3.5 The Up-and-Down Test [43]
For this test the magnitude of each element is compared with that of its immediate predecessor in the given sequence. If the next element is larger, we have a run-up; if smaller, a run-down. We thus observe whether the sequence increases or decreases and for how long. A decision concerning the pseudorandom number generator may then be based on the number and length of the runs. For example, the following seven-term sequence

0.2 0.4 0.1 0.3 0.6 0.7 0.5

consists of a run-up of length 1, followed by a run-down of length 1, followed by a run-up of length 3, and finally a run-down of length 1, and may be characterized by the binary symbols 1 0 1 1 1 0, where 1 denotes a run-up and 0 a run-down.

More generally, suppose there are $N$ terms, say $X_1 < X_2 < \cdots < X_N$ when arranged in order of magnitude; the time-ordered sequence of observations represents a permutation of these $N$ numbers. There are $N!$ permutations, each of them representing a possible set of sample observations. Under the null hypothesis each of these alternatives is equally likely to occur. The test of randomness, using runs-up and runs-down for the sequence $X_1, \ldots, X_N$ of dimension $N$, is based on the derived sequence of dimension $N - 1$, whose $i$th element is 0 or 1 depending on whether $X_{i+1} - X_i$, $i = 1, \ldots, N - 1$, is negative or positive. A large number of long runs should not occur in a "truly" random sample. The test rejects the null hypothesis if there are at least $r$ runs of length $t$ or more, where both $r$ and $t$ are determined by the desired significance level. The means, variances, and covariances of the numbers of runs of length $t$ or more are given in Levene and Wolfowitz [34].
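The derived binary sequence and the run lengths can be extracted with a few lines of Python. This is a sketch only; the function name is an assumption, and ties (equal neighbours), which have probability zero for continuous data, are arbitrarily counted as run-downs.

```python
def up_down_runs(seq):
    """Return (bits, runs) for the up-and-down test.

    bits -- derived sequence of length N - 1 (1 = up, 0 = down)
    runs -- list of (symbol, length) pairs in time order
    """
    bits = [1 if later > earlier else 0
            for earlier, later in zip(seq, seq[1:])]
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([bit, 1])     # start a new run
    return bits, [(symbol, length) for symbol, length in runs]
```

Applied to the seven-term sequence of the text, the function reproduces the symbols 1 0 1 1 1 0 and the run lengths 1, 1, 3, 1.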
The expected numbers of occurrences of runs in a "truly" random sample are [43]

$(2N - 1)/3$ for total runs,
$(5N + 1)/12$ for runs of length 1,
$(11N - 14)/60$ for runs of length 2,
$2[(k^2 + 3k + 1)N - (k^3 + 3k^2 - k - 4)]/(k + 3)!$ for runs of length $k$.

... $> 0$ such that $X_n = X_{2n}$; the smallest such value of $n$ lies in the range $\mu \le n \le \mu + \lambda$, and the value of $X_n$ is unique in the sense that, if $X_n = X_{2n}$ and $X_r = X_{2r}$, then $X_r = X_n$ (hence $r - n$ is a multiple of $\lambda$). From Knuth [31].
2 Prove that the middle-square method using $2n$-digit numbers to the base $\beta$ has the following disadvantage: if ever a number $X$, whose most significant $n$ digits are zero, appears, then the succeeding numbers will get smaller and smaller until zero occurs repeatedly. From Knuth [31].
3 A sequence generated as in exercise 1 must begin to repeat after at most $m$ values have been generated. Suppose we generalize the method so that $X_{i+1}$ depends on $X_{i-1}$ as well as on $X_i$; formally, let $f(x, y)$ be a function such that, if $0 \le x, y < m$, then $0 \le f(x, y) < m$. The sequence is constructed by selecting $X_0$ and $X_1$ arbitrarily, and then letting

$$X_{i+1} = f(X_i, X_{i-1}), \quad \text{for } i > 0.$$

Show that the maximum period conceivably attainable in this case is $m^2$. From Knuth [31].
4 Given the two conditions that $c$ is odd and $a \pmod 4 = 1$, prove that they are necessary and sufficient to guarantee the maximum length period in the sequence

$$X_{i+1} = aX_i + c \pmod m$$

when $m = 2^{e}$, $e \ge 2$. From Knuth [31].

5 Prove that the sequence

$$X_{i+1} = aX_i + c \pmod m,$$

with $m = 10^{e}$, $e > 3$, and $c$ not a multiple of 2 and not a multiple of 5, will have a full period if and only if $a \pmod{20} = 1$. From Knuth [31].

6 Show that the random function

$$S_N(x) = \frac{1}{N} \sum_{i=1}^{N} I(x - X_i), \quad \text{where } I(t) = \begin{cases} 1, & \text{if } t \ge 0 \\ 0, & \text{if } t < 0, \end{cases}$$

is the empirical distribution function of a sample $X_1, X_2, \ldots, X_N$. This is done by showing that $S_N(x) = F_N(x)$ for all $x$.
Again assume that the joint density of the random variables $Y_1 = g_1(X_1, \ldots, X_n), \ldots, Y_k = g_k(X_1, \ldots, X_n)$ is desired, where $k$ is an integer satisfying $1 \le k \le n$. If $k < n$, we introduce additional new random variables $Y_{k+1} = g_{k+1}(X_1, \ldots, X_n), \ldots, Y_n = g_n(X_1, \ldots, X_n)$ for judiciously selected functions $g_{k+1}, \ldots, g_n$; then we find the joint distribution of $Y_1, \ldots, Y_n$; finally, we find the desired marginal distribution of $Y_1, \ldots, Y_k$ from the joint distribution of $Y_1, \ldots, Y_n$. This device of possibly introducing additional random variables makes the transformation $y_1 = g_1(x_1, \ldots, x_n), \ldots, y_n = g_n(x_1, \ldots, x_n)$ a transformation from an $n$-dimensional space to an $n$-dimensional space. Henceforth we assume that we are seeking the joint distribution of $Y_1 = g_1(X_1, \ldots, X_n), \ldots, Y_n = g_n(X_1, \ldots, X_n)$ (rather than the joint distribution of $Y_1, \ldots, Y_k$) when we are given the joint probability density of $X_1, \ldots, X_n$.

We state our results for $n = 2$. The generalization for $n > 2$ is straightforward. Let $f_{X_1, X_2}(x_1, x_2)$ be given. Set $K = \{(x_1, x_2) : f_{X_1, X_2}(x_1, x_2) > 0\}$. We want to find the joint distribution of $Y_1 = g_1(X_1, X_2)$ and $Y_2 = g_2(X_1, X_2)$ for known functions $g_1(x_1, x_2)$ and $g_2(x_1, x_2)$. Now suppose that $y_1 = g_1(x_1, x_2)$ and $y_2 = g_2(x_1, x_2)$ defines a one-to-one transformation that maps $K$ onto, say, $D$. Then $x_1$ and $x_2$ can be expressed in terms of $y_1$ and $y_2$; so we can write, say, $x_1 = \varphi_1(y_1, y_2)$ and $x_2 = \varphi_2(y_1, y_2)$. Note that $K$ is a subset of the $x_1 x_2$ plane and $D$ is a subset of the $y_1 y_2$ plane consisting of points $(y_1, y_2)$ for which there exists an $(x_1, x_2) \in K$ such that $(y_1, y_2) = [g_1(x_1, x_2), g_2(x_1, x_2)]$. The determinant

$$J = \begin{vmatrix} \dfrac{\partial x_1}{\partial y_1} & \dfrac{\partial x_1}{\partial y_2} \\[2mm] \dfrac{\partial x_2}{\partial y_1} & \dfrac{\partial x_2}{\partial y_2} \end{vmatrix}$$

is called the Jacobian of the transformation and is denoted by $J$. The above discussion permits us to state Theorem 3.5.2.
Theorem 3.5.2 Let $X_1$ and $X_2$ be jointly continuous random variables with density function $f_{X_1, X_2}(x_1, x_2)$. Set $K = \{(x_1, x_2) : f_{X_1, X_2}(x_1, x_2) > 0\}$. Assume that:

1 $y_1 = g_1(x_1, x_2)$ and $y_2 = g_2(x_1, x_2)$ defines a one-to-one transformation of $K$ onto $D$.
2 The first partial derivatives of $x_1 = \varphi_1(y_1, y_2)$ and $x_2 = \varphi_2(y_1, y_2)$ are continuous over $D$.
3 The Jacobian of the transformation is nonzero for $(y_1, y_2) \in D$.

Then the joint density of $Y_1 = g_1(X_1, X_2)$ and $Y_2 = g_2(X_1, X_2)$ is given by

$$f_{Y_1, Y_2}(y_1, y_2) = |J|\, f_{X_1, X_2}(\varphi_1(y_1, y_2), \varphi_2(y_1, y_2))\, I_D(y_1, y_2), \tag{3.5.8}$$

where $I_D(y_1, y_2)$ is the indicator function of $D$.

The proof is essentially the derivation of the formulas for transforming variables in double integrals. For proof, the reader is referred to Neuts [25].

For the single variate case the transformation formula (3.5.8) becomes

$$f_Y(y) = \left|\frac{dx}{dy}\right| f_X(\varphi(y))\, I_D(y). \tag{3.5.9}$$

Here $f_X(x)$ is the given p.d.f., $f_Y(y)$ is the desired p.d.f., $x = \varphi(y)$ is the inverse of $y = g(x)$, and $D$ is the range of $Y = g(X)$. We can see that (3.5.9) is a particular case of (3.5.8).
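Example 1 Let $Z_1$ and $Z_2$ be two independent standard normal random variables. Let $Y_1 = Z_1 + Z_2$ and $Y_2 = Z_1/Z_2$. Then

$$z_1 = \frac{y_1 y_2}{1 + y_2}, \qquad z_2 = \frac{y_1}{1 + y_2},$$

$$J = \begin{vmatrix} \dfrac{y_2}{1 + y_2} & \dfrac{y_1}{(1 + y_2)^2} \\[2mm] \dfrac{1}{1 + y_2} & -\dfrac{y_1}{(1 + y_2)^2} \end{vmatrix} = -\frac{y_1}{(1 + y_2)^2},$$

and by (3.5.8)

$$f_{Y_1, Y_2}(y_1, y_2) = \frac{1}{2\pi} \frac{|y_1|}{(1 + y_2)^2} \exp\left[-\frac{1}{2} \frac{(1 + y_2^2)\, y_1^2}{(1 + y_2)^2}\right].$$

To find the marginal distribution of, say, $Y_2$, we must integrate out $y_1$, that is

$$f_{Y_2}(y_2) = \frac{1}{2\pi (1 + y_2)^2} \int_{-\infty}^{\infty} |y_1| \exp\left[-\frac{1}{2} \frac{(1 + y_2^2)\, y_1^2}{(1 + y_2)^2}\right] dy_1.$$

Let

$$u = \frac{1}{2} \frac{(1 + y_2^2)}{(1 + y_2)^2}\, y_1^2;$$

then

$$du = \frac{(1 + y_2^2)}{(1 + y_2)^2}\, y_1\, dy_1,$$

and so

$$f_{Y_2}(y_2) = \frac{1}{2\pi (1 + y_2)^2} \cdot \frac{2 (1 + y_2)^2}{1 + y_2^2} \int_0^{\infty} e^{-u}\, du = \frac{1}{\pi (1 + y_2^2)},$$

a Cauchy density. In other words, the ratio of two independent standard normal random variables has a Cauchy distribution. To generate an r.v. from a Cauchy distribution we generate $Z_1$ and $Z_2$ from $N(0, 1)$ and take their ratio.

Example 2 Let $X_i$ have a gamma distribution

$$f_{X_i}(x_i) = \begin{cases} \dfrac{x_i^{n_i - 1} e^{-x_i}}{\Gamma(n_i)}, & x_i \ge 0, \; n_i > 0 \\ 0, & \text{otherwise} \end{cases}$$

with parameters $n_i$ and 1 for $i = 1, 2$, and assume $X_1$ and $X_2$ are independent. Suppose now that the distribution of $Y_1 = X_1/(X_1 + X_2)$ is desired. We have only the one function $y_1 = g_1(x_1, x_2) = x_1/(x_1 + x_2)$, so we have to select the other to use the transformation technique. Since $x_1$ and $x_2$ occur in the exponent of their joint density as their sum, $x_1 + x_2$ is a good choice. Let $y_2 = x_1 + x_2$; then $x_1 = y_1 y_2$, $x_2 = y_2 - y_1 y_2$, and

$$J = \begin{vmatrix} y_2 & y_1 \\ -y_2 & 1 - y_1 \end{vmatrix} = y_2.$$

Hence

$$f_{Y_1, Y_2}(y_1, y_2) = \left[\frac{\Gamma(n_1 + n_2)}{\Gamma(n_1)\Gamma(n_2)}\, y_1^{n_1 - 1} (1 - y_1)^{n_2 - 1}\right] \left[\frac{y_2^{n_1 + n_2 - 1} e^{-y_2}}{\Gamma(n_1 + n_2)}\right], \quad 0 < y_1 < 1, \; y_2 > 0.$$

It turns out that $Y_1$ and $Y_2$ are independent and that $Y_1$ has a beta distribution with parameters $n_1$ and $n_2$. Thus to generate a random variate from beta distribution we generate two gamma variates $X_1$ and $X_2$, then calculate $X_1/(X_1 + X_2)$.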
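Both examples translate directly into generators. The sketch below leans on Python's standard `random` module for the normal and gamma variates, which is an assumption of convenience here and not a method described in the text; the function names are likewise hypothetical.

```python
import random


def cauchy_variate(rng=random):
    """Ratio of two independent N(0, 1) variates (Example 1)."""
    while True:
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        if z2 != 0.0:              # guard against division by zero
            return z1 / z2


def beta_variate(n1, n2, rng=random):
    """X1 / (X1 + X2) with X1, X2 independent gamma variates (Example 2)."""
    x1 = rng.gammavariate(n1, 1.0)   # shape n1, scale 1
    x2 = rng.gammavariate(n2, 1.0)   # shape n2, scale 1
    return x1 / (x1 + x2)
```

Passing a seeded `random.Random` instance as `rng` makes the output reproducible, which is convenient when checking, for instance, that the sample mean of `beta_variate(n1, n2)` is close to $n_1/(n_1 + n_2)$.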
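3.5.3 Multinormal Distribution

A random vector $X = (X_1, \ldots, X_n)$ has a multinormal distribution if the p.d.f. is given by

$$f_X(x) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left[-\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right] \tag{3.5.10}$$

and denoted by $N(\mu, \Sigma)$. Here $\mu = (\mu_1, \ldots, \mu_n)$ is the mean vector, $\Sigma$ is the covariance $(n \times n)$ matrix

$$\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2n} \\ \vdots & \vdots & & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn} \end{pmatrix}, \tag{3.5.11}$$

which is positive definite and symmetric, $|\Sigma|$ is the determinant of $\Sigma$, and $\Sigma^{-1}$ is the inverse matrix of $\Sigma$. Inasmuch as $\Sigma$ is positive definite and symmetric, there exists a unique lower triangular matrix

$$C = \begin{pmatrix} c_{11} & 0 & \cdots & 0 \\ c_{21} & c_{22} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ c_{n1} & c_{n2} & \cdots & c_{nn} \end{pmatrix} \tag{3.5.12}$$

such that

$$\Sigma = CC^T. \tag{3.5.13}$$

Then the vector $X$ can be represented as

$$X = CZ + \mu, \tag{3.5.14}$$

where $Z = (Z_1, \ldots, Z_n)$ is a normal vector with zero mean and covariance matrix equal to the identity matrix, that is, all components $Z_i$, $i = 1, \ldots, n$, of $Z$ are distributed according to the standard normal distribution $N(0, 1)$.

In order to obtain $C$ from $\Sigma = CC^T$ the so-called "square root method" can be used, which provides a set of recursive formulas for computation of the elements of $C$. It follows from (3.5.14) that

$$X_1 = c_{11} Z_1 + \mu_1.$$

Therefore $\operatorname{var} X_1 = \sigma_{11} = c_{11}^2$ and

$$c_{11} = \sigma_{11}^{1/2}. \tag{3.5.15}$$

Proceeding with (3.5.14) we obtain

$$X_2 = c_{21} Z_1 + c_{22} Z_2 + \mu_2 \tag{3.5.16}$$

and

$$\operatorname{var} X_2 = \sigma_{22} = \operatorname{var}(c_{21} Z_1 + c_{22} Z_2). \tag{3.5.17}$$

From (3.5.15) and (3.5.16)

$$E[(X_1 - \mu_1)(X_2 - \mu_2)] = \sigma_{12} = E[c_{11} Z_1 (c_{21} Z_1 + c_{22} Z_2)]. \tag{3.5.18}$$

From (3.5.17) and (3.5.18)

$$c_{21} = \frac{\sigma_{12}}{\sigma_{11}^{1/2}}, \tag{3.5.19}$$

$$c_{22} = \left(\sigma_{22} - \frac{\sigma_{12}^2}{\sigma_{11}}\right)^{1/2}. \tag{3.5.20}$$

Generally, $c_{ij}$ can be found from the following recursive formula:

$$c_{ij} = \frac{\sigma_{ij} - \sum_{k=1}^{j-1} c_{ik} c_{jk}}{\left(\sigma_{jj} - \sum_{k=1}^{j-1} c_{jk}^2\right)^{1/2}}, \tag{3.5.21}$$

where

$$\sum_{k=1}^{0} c_{ik} c_{jk} = 0, \quad 1 \le j \le i \le n.$$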
Algorithm MN-1 describes the necessary steps for generating a multinormal variate.

Algorithm MN-1
1 Compute the elements $c_{ij}$ of the lower triangular matrix $C$ from (3.5.21).
2 Generate $Z = (Z_1, \ldots, Z_n)$ from $N(0, 1)$.
3 $X \leftarrow CZ + \mu$.
4 Deliver $X$.
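A minimal Python sketch of Algorithm MN-1 follows. The function names and the use of nested lists for matrices are assumptions made for this illustration, and `random.gauss` stands in for whatever $N(0, 1)$ generator is available.

```python
import math
import random


def square_root_method(sigma):
    """Return lower triangular C with sigma = C C^T, using (3.5.21)."""
    n = len(sigma)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(c[i][k] * c[j][k] for k in range(j))
            if i == j:
                # Diagonal element: c_jj = (sigma_jj - sum_k c_jk^2)^(1/2).
                c[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                # The denominator of (3.5.21) equals c_jj, already computed.
                c[i][j] = (sigma[i][j] - s) / c[j][j]
    return c


def multinormal_variate(mu, sigma, rng=random):
    """Deliver X = C Z + mu with Z a vector of independent N(0, 1) variates."""
    c = square_root_method(sigma)
    z = [rng.gauss(0.0, 1.0) for _ in mu]
    return [mu[i] + sum(c[i][k] * z[k] for k in range(i + 1))
            for i in range(len(mu))]
```

When many variates are needed for the same $\Sigma$, it would be natural to compute $C$ once and reuse it rather than recomputing it in every call as this sketch does.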
3.6 GENERATING FROM CONTINUOUS DISTRIBUTIONS

This section describes generating procedures for various single-variate continuous distributions.

3.6.1 Exponential Distribution

An exponential variate $X$ has p.d.f.

$$f_X(x) = \begin{cases} \dfrac{1}{\beta} e^{-x/\beta}, & 0 \le x < \infty, \; \beta > 0 \\ 0, & \text{otherwise} \end{cases} \tag{3.6.1}$$

denoted by $\exp(\beta)$.

Procedure E-1 By inverse transform method

$$U = F_X(X) = 1 - e^{-X/\beta}, \tag{3.6.2}$$

so that

$$X = -\beta \ln(1 - U). \tag{3.6.3}$$

Since $1 - U$ is distributed in the same way as $U$, we have

$$X = -\beta \ln U. \tag{3.6.4}$$
For sampling purposes we may assume $\beta = 1$: if $V$ is sampled from the standard exponential distribution $\exp(1)$, then $X = \beta V$ is from $\exp(\beta)$.

Algorithm E-1
1 Generate $U$ from $\mathcal{U}(0, 1)$.
2 $X \leftarrow -\beta \ln U$.
3 Deliver $X$.
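Algorithm E-1 can be sketched in Python as below. The function names are assumptions; note that `random.random()` may return exactly 0, so the sketch feeds $1 - U$ (which lies in $(0, 1]$ and has the same distribution) to the logarithm.

```python
import math
import random


def exponential_from_uniform(u, beta=1.0):
    """Inverse transform X = -beta * ln(u) for a given u in (0, 1]."""
    return -beta * math.log(u)


def exponential_variate(beta=1.0, rng=random):
    """Deliver one variate from exp(beta), i.e. with mean beta."""
    u = 1.0 - rng.random()      # in (0, 1], avoids log(0)
    return exponential_from_uniform(u, beta)
```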
Although this technique seems very simple, the computation of the natural logarithm on a digital computer includes a power series expansion (or some equivalent approximation technique) for each uniform variate generated.

Procedure E-2 We now prove a proposition that can be useful for generating from exponential distribution $\exp(1)$.
Proposition Let $U_1, \ldots, U_n, U_{n+1}, \ldots, U_{2n-1}$ be independent uniformly distributed random variables, and let $U_{(1)}, \ldots, U_{(n-1)}$ represent the order statistics corresponding to the random sample $U_{n+1}, \ldots, U_{2n-1}$. Assume $U_{(0)} = 0$ and $U_{(n)} = 1$; then the r.v.'s

$$Y_k = (U_{(k-1)} - U_{(k)}) \ln \prod_{i=1}^{n} U_i, \quad k = 1, \ldots, n \tag{3.6.5}$$

are independent and distributed $\exp(1)$.

Proof Denote

$$X_k = U_{(k)} - U_{(k-1)}, \quad k = 1, \ldots, n - 1$$

and

$$X_n = -\ln \prod_{i=1}^{n} U_i.$$

It will be shown in Section 3.6.2 that $X_n$ is from the Erlang distribution, that is,

$$f_{X_n}(x) = \frac{x^{n-1} e^{-x}}{(n - 1)!}, \quad x \ge 0. \tag{3.6.6}$$

It is also known (Feller [11]) that the vector $(X_1, \ldots, X_{n-1})$ is distributed

$$f_{X_1, \ldots, X_{n-1}}(x_1, \ldots, x_{n-1}) = (n - 1)! \tag{3.6.7}$$

inside the simplex

$$\sum_{k=1}^{n-1} x_k \le 1, \quad x_k \ge 0, \quad k = 1, \ldots, n - 1.$$

Hence

$$f_{Y_1, \ldots, Y_n}(y_1, \ldots, y_n) = \prod_{k=1}^{n} e^{-y_k}.$$
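The proposition can be read as a recipe that turns $2n - 1$ uniform variates into $n$ exponential variates with a single logarithm of a product. A small Python sketch follows; the function name and the choice to pass the two groups of uniforms as separate lists are assumptions made here.

```python
import math


def exponentials_from_uniforms(us_product, us_order):
    """Return Y_1, ..., Y_n of (3.6.5).

    us_product -- U_1, ..., U_n, each in (0, 1], used in the product
    us_order   -- U_{n+1}, ..., U_{2n-1}, whose order statistics are used
    """
    n = len(us_product)
    if len(us_order) != n - 1:
        raise ValueError("need exactly n - 1 uniforms for the order statistics")
    # ln of the product, accumulated as a sum of logs to avoid underflow.
    log_prod = sum(math.log(u) for u in us_product)
    # U_(0) = 0, U_(1) <= ... <= U_(n-1), U_(n) = 1.
    points = [0.0] + sorted(us_order) + [1.0]
    return [(points[k - 1] - points[k]) * log_prod for k in range(1, n + 1)]
```

Since the spacings sum to one, the returned values always add up to $-\ln \prod_{i=1}^{n} U_i$, which gives a quick consistency check.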
We accept the sequence $\{X_i\}$ if $N$ is odd; otherwise we reject it and repeat the process until $N$ turns out odd. Let $T$ be the number of sequences rejected before an odd $N$ appears ($T = 0, 1, \ldots$) and let $X_0$ be the value of the first variable in the accepted sequence; then $Y = T + X_0$ is from $\exp(1)$. It is shown in Ref. 34 that generation of one exponential variate in such a way requires on the average $(1 + e)(1 - e^{-1})^{-1} \approx 6$ random numbers.
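3.6.2 Gamma Distribution

A random variable $X$ has a gamma distribution if its p.d.f. is defined as

$$f_X(x) = \begin{cases} \dfrac{x^{\alpha - 1} e^{-x/\beta}}{\beta^{\alpha} \Gamma(\alpha)}, & 0 \le x < \infty, \; \alpha > 0, \; \beta > 0 \\ 0, & \text{otherwise,} \end{cases}$$

and is denoted by $G(\alpha, \beta)$. Note that for $\alpha = 1$, $G(1, \beta)$ is $\exp(\beta)$. Inasmuch as the c.d.f. does not exist in explicit form for gamma distribution, the inverse transform method cannot be applied. Therefore alternative methods of generating gamma variates must be considered.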
Procedure G-1 One of the most important properties of gamma distribution is the reproductive property, which can be successfully used for gamma generation. Let $X_i$, $i = 1, \ldots, n$, be a sequence of independent random variables from $G(\alpha_i, \beta)$. Then $X = \sum_{i=1}^{n} X_i$ is from $G(\alpha, \beta)$ where $\alpha = \sum_{i=1}^{n} \alpha_i$. If $\alpha$ is an integer, say, $\alpha = m$, a random variate from gamma distribution $G(m, \beta)$ can be obtained by summing $m$ independent exponential random variates $\exp(\beta)$, that is,

$$X = \beta \sum_{i=1}^{m} (-\ln U_i) = -\beta \ln \prod_{i=1}^{m} U_i, \tag{3.6.10}$$

which is called Erlang distribution and denoted $\operatorname{Er}(m, \beta)$. Algorithm G-1 describes generating r.v.'s from $\operatorname{Er}(m, \beta)$.

Algorithm G-1
1 $X \leftarrow 0$.
2 Generate $V$ from $\exp(1)$.
3 $X \leftarrow X + V$.
4 If $\alpha = 1$, $X \leftarrow \beta X$ and deliver $X$.
5 $\alpha \leftarrow \alpha - 1$.
6 Go to step 2.
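Formula (3.6.10) and Algorithm G-1 can be sketched in Python as follows. The names are assumptions; the sum of logarithms is used in place of the logarithm of the product, which is mathematically the same but avoids floating-point underflow when $m$ is large.

```python
import math
import random


def erlang_from_uniforms(us, beta=1.0):
    """Return -beta * ln(prod(us)) for uniforms us in (0, 1], as in (3.6.10)."""
    return -beta * sum(math.log(u) for u in us)


def erlang_variate(m, beta=1.0, rng=random):
    """Deliver one variate from Er(m, beta) for a positive integer m."""
    us = [1.0 - rng.random() for _ in range(m)]   # each in (0, 1]
    return erlang_from_uniforms(us, beta)
```

The cost grows linearly with $m$, since one uniform variate and one logarithm are needed per unit of the shape parameter.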
It is not difficult to see that the mean computation (CPU) time for generation from Erlang distribution is an increasing linear function of $\alpha$. However, if $\alpha$ is nonintegral, (3.6.10) is not applicable and some difficulties arise while generating gamma variates. For some time no exact method was known and approximate techniques were used. The most common method was the so-called probability switch method [24]. Let $m = [\alpha]$ be the integral part of $\alpha$ and let $\delta = \alpha - m$. With probability $\delta$, generate a random variate from $G(m + 1, \beta)$. With probability $1 - \delta$, generate a random variate from $G(m, \beta)$. This mixture of gamma variates with integral shape parameters will approximate the desired gamma distribution. This technique will only work when $\alpha \ge 1$. In the particular case when $\alpha = m + \tfrac{1}{2}$, gamma variables can be generated exactly by adding half the square of a standard normal variate to the variate generated in (3.6.10).
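Procedure G-2 Johnk [16] suggested a technique that exactly generates* variates from $G(\delta, \beta)$, where $0 < \delta < 1$.

Theorem 3.6.1 Let $W$ and $V$ be independent variates from beta distribution $\operatorname{Be}(\delta, 1 - \delta)$ (see Section 3.6.3) and $\exp(1)$, respectively. Then $X = \beta V W$ is a variate with $G(\delta, \beta)$.

Proof Let $u = v$ and let $x = \beta v w$. Then $w = x/(\beta u)$ and $v = u$. The Jacobian of this transformation is

$$J = \begin{vmatrix} \dfrac{\partial w}{\partial x} & \dfrac{\partial w}{\partial u} \\[2mm] \dfrac{\partial v}{\partial x} & \dfrac{\partial v}{\partial u} \end{vmatrix} = \begin{vmatrix} \dfrac{1}{\beta u} & -\dfrac{x}{\beta u^2} \\[2mm] 0 & 1 \end{vmatrix} = \frac{1}{\beta u}. \tag{3.6.11}$$

The joint distribution of $(u, x)$ is therefore given by

$$f(u, x) = \begin{cases} \dfrac{1}{\Gamma(\delta)\Gamma(1 - \delta)} \left(\dfrac{x}{\beta u}\right)^{\delta - 1} \left(1 - \dfrac{x}{\beta u}\right)^{-\delta} \dfrac{e^{-u}}{\beta u}, & 0 \le x \le \beta u < \infty \\ 0, & \text{otherwise.} \end{cases} \tag{3.6.12}$$

*It is understood that when we say a method "exactly generates" random variables on a computer, the exactness is limited by the computer used and by the randomness of the underlying pseudorandom number generator.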
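The marginal distribution for $X$ is

$$f_X(x) = \int_{x/\beta}^{\infty} f(u, x)\, du = \frac{x^{\delta - 1} e^{-x/\beta}}{\beta^{\delta} \Gamma(\delta)}, \quad x \ge 0,$$

where $f(u, x)$ is the joint density (3.6.12), which is $G(\delta, \beta)$. Q.E.D.

Algorithm G-2
1 Generate two variates $W$ and $V$ from $\operatorname{Be}(\delta, 1 - \delta)$ and $\exp(1)$, respectively.
2 Compute $X = \beta V W$, which is from $G(\delta, \beta)$.
3 Deliver $X$.

To generate a variate from $G(\alpha, \beta)$ we generate an r.v. $Y$ from $\operatorname{Er}(m, 1)$, then compute $X = \beta(Y + VW)$, which is from $G(\alpha, \beta)$. Here $\alpha = \delta + m$.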
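The combination of an Erlang variate with Johnk's fractional part can be sketched as follows. The function name is an assumption, and the $\operatorname{Be}(\delta, 1 - \delta)$ variate is drawn here with `random.betavariate` from Python's standard library purely as a stand-in; the text generates it by the procedures of Section 3.6.3.

```python
import math
import random


def gamma_johnk(alpha, beta=1.0, rng=random):
    """Deliver X = beta * (Y + V * W) from G(alpha, beta).

    Y is Er(m, 1) with m = [alpha]; V is exp(1); W is Be(delta, 1 - delta)
    with delta = alpha - m.  For integer alpha only Y is used.
    """
    m = int(alpha)
    delta = alpha - m
    # Erlang part: sum of m standard exponentials (zero when m == 0).
    y = -sum(math.log(1.0 - rng.random()) for _ in range(m))
    if delta > 0.0:
        w = rng.betavariate(delta, 1.0 - delta)   # stand-in beta generator
        v = -math.log(1.0 - rng.random())
        y += v * w
    return beta * y
```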
Recently, a number of procedures for sampling from $G(\alpha, \beta)$, based on the acceptance-rejection method, were suggested by Ahrens and Dieter [3], Cheng [9], Fishman [13], Tadikamalla [30, 31], and Wallace [35]. Let us consider some of them.

Procedure G-3 Wallace [35] suggested a procedure for generating from $G(\alpha, 1)$ with $\alpha > 1$ based on both the acceptance-rejection and probability switch methods. Let

$$f_X(x) = C h(x) g(x),$$

where $h(x)$ is a mixture of two Erlang distributions $\operatorname{Er}(m, 1)$ and $\operatorname{Er}(m + 1, 1)$ equal to

$$h(x) = P \frac{x^{m-1} e^{-x}}{(m - 1)!} + (1 - P) \frac{x^{m} e^{-x}}{m!}, \quad x \ge 0, \tag{3.6.13}$$

$$C = \frac{(m - 1)!\, m^{\delta}}{\Gamma(\alpha)}, \tag{3.6.14}$$

and

$$g(x) = \left(\frac{x}{m}\right)^{\delta} \left[1 + \left(\frac{x}{m} - 1\right)\delta\right]^{-1}. \tag{3.6.15}$$

It can be found from (3.4.10) that the optimal $P$ is equal to $1 - \delta$, where $\delta = \alpha - [\alpha]$. It follows from (3.6.14) that the mean number of trials $C$ is a monotone decreasing function of $m$ for a fixed $\delta$ and

$$\lim_{m \to \infty} \frac{(m - 1)!\, m^{\delta}}{\Gamma(m + \delta)} = 1,$$

that is, asymptotically the execution time does not depend on $\delta$ and achieves optimal efficiency $C = 1$. Algorithm G-3 describes Wallace's procedure.

Algorithm G-3
1 Compute $\delta = \alpha - m$, where $m = [\alpha]$.
2 Generate $U_1, \ldots, U_{m+1}$ from $\mathcal{U}(0, 1)$.
3 With probability $1 - \delta$ compute $V = -\ln \prod_{i=1}^{m} U_i$.
4 With probability $\delta$ compute $V = -\ln \prod_{i=1}^{m+1} U_i$.
5 Generate another uniform variate $U$ from $\mathcal{U}(0, 1)$.
6 If $U \le (V/m)^{\delta} / [1 + ((V/m) - 1)\delta]$, deliver $V$ as an r.v. from $G(\alpha, 1)$.
7 Go to step 2.
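Algorithm G-3 can be sketched in Python as below. The function name is an assumption, and the sketch draws only as many uniforms as the selected Erlang component needs instead of always generating $m + 1$ of them; this does not change the distribution of $V$.

```python
import math
import random


def gamma_wallace(alpha, rng=random):
    """Deliver one variate from G(alpha, 1), alpha >= 1, by Algorithm G-3."""
    m = int(alpha)
    if m < 1:
        raise ValueError("alpha must be at least 1")
    delta = alpha - m
    while True:
        # Probability switch: Er(m + 1, 1) with probability delta, else Er(m, 1).
        terms = m + 1 if rng.random() < delta else m
        v = -sum(math.log(1.0 - rng.random()) for _ in range(terms))
        u = rng.random()
        # Acceptance test U <= g(V) with g from (3.6.15).
        if u <= (v / m) ** delta / (1.0 + (v / m - 1.0) * delta):
            return v
```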
The following three procedures are reproduced with little change from Ref. 12.
Procedure G-4 Fishman [13] describes another procedure for generating from $G(\alpha, 1)$, $\alpha \ge 1$:

$$g(x) = \frac{x^{\alpha - 1} \exp[-x(1 - 1/\alpha)]}{\alpha^{\alpha - 1} \exp(1 - \alpha)}, \tag{3.6.16}$$

$$h(x) = \frac{1}{\alpha} e^{-x/\alpha}. \tag{3.6.18}$$

The probability of success on a trial is

$$\frac{1}{C} = \frac{\Gamma(\alpha)}{\alpha^{\alpha} e^{1 - \alpha}}. \tag{3.6.19}$$

For large $\alpha$ the mean number of trials is

$$C \approx e \left(\frac{\alpha}{2\pi}\right)^{1/2}. \tag{3.6.20}$$

It is not difficult to see that the condition $U \le g(Y)$, where the r.v. $Y = \alpha V_1$ has the p.d.f. (3.6.18), can be written as $V_2 \ge (\alpha - 1)(V_1 - \ln V_1 - 1)$, where $V_1$ and $V_2$ are independent r.v.'s from $\exp(1)$.

Algorithm G-4
1 $A \leftarrow \alpha - 1$.
2 Generate $V_1$ and $V_2$ from $\exp(1)$.
3 If $V_2 < A(V_1 - \ln V_1 - 1)$, go to step 2.
4 Deliver $X = \alpha V_1$ as a variate from $G(\alpha, 1)$.
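A Python sketch of Fishman's procedure follows; the function name is an assumption. The candidate is the exponential variate $Y = \alpha V_1$ with p.d.f. $h(x) = (1/\alpha) e^{-x/\alpha}$, so the sketch returns $\alpha V_1$ once the test on $V_1$ and $V_2$ succeeds.

```python
import math
import random


def gamma_fishman(alpha, rng=random):
    """Deliver one variate from G(alpha, 1), alpha >= 1, by acceptance-rejection."""
    a = alpha - 1.0
    while True:
        v1 = -math.log(1.0 - rng.random())   # exp(1)
        v2 = -math.log(1.0 - rng.random())   # exp(1), plays the role of -ln U
        if v1 <= 0.0:
            continue                          # ln(v1) undefined; redraw
        if v2 >= a * (v1 - math.log(v1) - 1.0):
            return alpha * v1                 # accepted candidate Y = alpha * V1
```

Because the mean number of trials grows like $\sqrt{\alpha}$, this sketch is attractive only for moderate values of $\alpha$.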
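Procedure G-5 This procedure is due to Cheng [9] and describes gamma generation $G(\alpha, 1)$ for $\alpha > 1$ with execution time asymptotically independent of $\alpha$. In Cheng's procedure

$$h(x) = \begin{cases} \dfrac{\lambda \mu x^{\lambda - 1}}{(\mu + x^{\lambda})^2}, & x \ge 0 \\ 0, & \text{otherwise,} \end{cases} \tag{3.6.21}$$

$$C = \frac{4 \alpha^{\alpha}}{\Gamma(\alpha) e^{\alpha} \lambda}, \tag{3.6.22}$$

$$g(x) = x^{\alpha - \lambda} (\mu + x^{\lambda})^2\, \frac{e^{\alpha - x}}{4 \alpha^{\alpha + \lambda}}, \tag{3.6.23}$$

where $\mu = \alpha^{\lambda}$ and $\lambda = (2\alpha - 1)^{1/2}$.

The execution time, measured by the mean number of trials $C$, is a monotonically decreasing function of $\alpha$ such that, for $\alpha = 1$, $C = 1.47$, and for $\alpha = 2$, $C = 1.25$; asymptotically

$$\lim_{\alpha \to \infty} C = \frac{2}{\sqrt{\pi}} \approx 1.13. \tag{3.6.24}$$

Let $b = \alpha - \ln 4$ and $d = \alpha + \lambda$. Then Cheng's algorithm can be written as follows.

Algorithm G-5
1 Sample $U_1$ and $U_2$ from $\mathcal{U}(0, 1)$.
2 $V \leftarrow \lambda^{-1} \ln[U_1/(1 - U_1)]$.
3 $X \leftarrow \alpha e^{V}$.
4 If $b + dV - X \ge \ln(U_1^2 U_2)$, deliver $X$.
5 Go to step 1.

Procedure G-6 Ahrens and Dieter [3] suggested an alternative procedure for generating from $G(\alpha, 1)$ with $\alpha > 1$ and execution time independent of $\alpha$ asymptotically and equal to $\lim_{\alpha \to \infty} C = \sqrt{\pi}$. Their procedure makes use of the truncated Cauchy distribution. Let

$$h(x) = \frac{h_1(x)}{1 - H_1(0)}, \quad x \ge 0, \tag{3.6.25}$$

and

$$g(x) = \left(\frac{x}{\gamma}\right)^{\gamma} e^{-(x - \gamma)} \left[1 + \left(\frac{x - \gamma}{\mu}\right)^2\right], \tag{3.6.26}$$

where

$$C = \frac{\pi \mu [1 - H_1(0)]\, \gamma^{\gamma} e^{-\gamma}}{\Gamma(\alpha)}, \tag{3.6.27}$$

and

$$h_1(x) = \frac{1}{\pi \mu \{1 + [(x - \gamma)/\mu]^2\}}, \tag{3.6.28}$$

$$H_1(x) = \frac{1}{2} + \frac{1}{\pi} \tan^{-1}\left(\frac{x - \gamma}{\mu}\right), \quad -\infty < x < \infty \tag{3.6.29}$$

are the p.d.f. and c.d.f. of the Cauchy distribution, respectively, with parameters $\gamma = \alpha - 1$ and $\mu = (2\alpha - 1)^{1/2}$. It follows from (3.6.25) and (3.6.28) that $h(x)$ is the truncated Cauchy distribution with parameters $\gamma$ and $\mu$. To apply the acceptance condition $U \le g(Y)$, we have to generate an r.v. $Y$ from the truncated Cauchy distribution $h(y)$. The c.d.f. of $Y$ is

$$H(y) = \frac{H_1(y) - H_1(0)}{1 - H_1(0)}, \quad y \ge 0, \tag{3.6.30}$$

where $H_1(y)$ is given in (3.6.29). Substituting (3.6.29) in (3.6.30) and using the inverse transform formula $Y = H^{-1}(U)$, we obtain

$$Y = \mu \tan \pi \left\{U[1 - H_1(0)] + H_1(0) - \tfrac{1}{2}\right\} + \gamma, \tag{3.6.31}$$

where by (3.6.29)

$$H_1(0) = \frac{1}{2} - \frac{1}{\pi} \tan^{-1}\left(\frac{\gamma}{\mu}\right). \tag{3.6.32}$$

It is readily seen that the condition $U \le g(Y)$ is equivalent to

$$-V = \ln U \le \gamma \ln \frac{Y}{\gamma} - (Y - \gamma) + \ln\left[1 + \left(\frac{Y - \gamma}{\mu}\right)^2\right].$$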
a-+m
P-m
Thus for large a or /3 Johnk's procedure is not efficient.
Table 3.6.3 The Mean Number of Trials as a Function of α and β

α \ β      1      3      5
1          2      4      6
3          4     20     56
5          6     56    252
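Algorithm Be-4
1 $j \leftarrow 1$.
2 Generate $U_j$ and $U_{j+1}$ from $\mathcal{U}(0, 1)$.
3 $Y_1 \leftarrow U_j^{1/\alpha}$.
4 $Y_2 \leftarrow U_{j+1}^{1/\beta}$.
5 If $Y_1 + Y_2 \ge 1$, go to step 2.
6 $j \leftarrow j + 2$.
7 Deliver $X = Y_1/(Y_1 + Y_2)$.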
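Johnk's procedure can be sketched as below. The function name `beta_johnk` and the seed are assumptions made for illustration. The loop draws fresh uniforms on every pass, so the index j of the algorithm is implicit.

```python
import random

def beta_johnk(alpha, beta, rng):
    """Johnk-type rejection sampler for Be(alpha, beta) (illustrative sketch)."""
    while True:
        y1 = rng.random() ** (1.0 / alpha)   # Y1 <- U_j^(1/alpha)
        y2 = rng.random() ** (1.0 / beta)    # Y2 <- U_{j+1}^(1/beta)
        s = y1 + y2
        if 0.0 < s < 1.0:                    # reject when Y1 + Y2 >= 1
            return y1 / s                    # deliver X = Y1 / (Y1 + Y2)

rng = random.Random(2024)
xs = [beta_johnk(2.0, 3.0, rng) for _ in range(20000)]
mean = sum(xs) / len(xs)
```

For α = 2 and β = 3 the sample mean should be near α/(α + β) = 0.4. Only about one pair in ten is accepted for these parameters, which illustrates why the procedure becomes inefficient for large α or β.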
Procedure Be-5 This procedure is based on the results of examples 6 and 7 from Section 3.4.3. As follows from (3.4.20) and (3.4.24), the efficiencies of the acceptance-rejection method AR-3 are, respectively,
$$ \frac{1}{C} = \frac{\Gamma(\alpha)\Gamma(\beta+1)}{\Gamma(\alpha+\beta)} \tag{3.6.54} $$
$$ \frac{1}{C} = \frac{\Gamma(\alpha+1)\Gamma(\beta)}{\Gamma(\alpha+\beta)} \tag{3.6.55} $$
in examples 6 and 7. For integer α and β we have, respectively,
$$ \frac{1}{C} = \frac{(\alpha-1)!\,\beta!}{(\alpha+\beta-1)!} \tag{3.6.56} $$
$$ \frac{1}{C} = \frac{\alpha!\,(\beta-1)!}{(\alpha+\beta-1)!} \tag{3.6.57} $$
In both cases (3.6.56) and (3.6.57) the efficiencies are a little higher than in Johnk's procedure Be-4 (see (3.6.53)). It is interesting to note that for β > α it is more efficient to represent $f_X(x)$ in the form of (3.4.18) through (3.4.20), and for α > β it is more efficient to represent $f_X(x)$ in the form of (3.4.22) through (3.4.24).
Procedure Be-6 In this procedure $h(x)$ is Be(m, n), that is,
$$ h(x) = \frac{(m+n-1)!}{(m-1)!\,(n-1)!}\,x^{m-1}(1-x)^{n-1}, \qquad 0 \le x \le 1, \tag{3.6.58} $$
where $m = [\alpha]$ and $n = [\beta]$. Then
$$ \frac{f_X(x)}{h(x)} = \frac{B(m,n)}{B(\alpha,\beta)}\,x^{\delta_1}(1-x)^{\delta_2}, \tag{3.6.59} $$
where $\delta_1 = \alpha - m$, $\delta_2 = \beta - n$, and $B(r, s) = \Gamma(r)\Gamma(s)/\Gamma(r+s)$. It is quite easy to prove that the function $y = x^{\delta_1}(1-x)^{\delta_2}$ is concave on [0, 1] and achieves its unique maximum
$$ y^* = \frac{\delta_1^{\delta_1}\delta_2^{\delta_2}}{(\delta_1+\delta_2)^{\delta_1+\delta_2}} \quad\text{at the point}\quad x^* = \frac{\delta_1}{\delta_1+\delta_2}. $$
Now we set
$$ C = \frac{B(m,n)}{B(\alpha,\beta)}\,y^* $$
and
$$ g(x) = \frac{x^{\delta_1}(1-x)^{\delta_2}}{y^*}. $$
The efficiency of the procedure is
$$ \frac{1}{C} = \frac{B(\alpha,\beta)}{B(m,n)\,y^*}. $$
It is easy to see that
(3.6.61)
Comparing (3.6.61) with (3.6.56) and (3.6.57), we can also readily prove that Procedure Be-6 is more efficient than Procedure Be-5 for α ≥ 2, β ≥ 2.
Algorithm Be-6
1 Generate U from $\mathcal{U}(0, 1)$.
2 Generate Y from Be(m, n).
3 If $U \le g(Y)$, deliver Y.
4 Go to step 1.
Remark If $\delta_1 = 0$, then $g(x) = (1-x)^{\delta_2}$, $y^* = 1$, and $C = B(m,n)/B(m,\beta)$. If $\delta_2 = 0$, then $g(x) = x^{\delta_1}$, $y^* = 1$, and $C = B(m,n)/B(\alpha,n)$. If $\delta_1 = \delta_2 = 0$, then $C = 1$.
3.6.4 Normal Distribution
A random variable X has a normal distribution if the p.d.f. is
$$ f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right], \qquad -\infty < x < \infty, \tag{3.6.62} $$
and is denoted $N(\mu, \sigma^2)$. Here μ is the mean and σ² is the variance. Since $X = \mu + \sigma Z$, where Z is the standard normal variable denoted by N(0, 1), we consider only generation from N(0, 1). As we mentioned in Section 3.2, the inverse transform method cannot be applied to the normal distribution and some alternative procedures have to be employed. We consider some of them. More about generation from the normal distribution can be found in Fishman [12].
Procedure N-1 This approach is due to Box and Muller [6]. Let us prove that, if $U_1$ and $U_2$ are independent random variates from $\mathcal{U}(0, 1)$, then the variates
$$ Z_1 = (-2\ln U_1)^{1/2}\cos 2\pi U_2, \qquad Z_2 = (-2\ln U_1)^{1/2}\sin 2\pi U_2 \tag{3.6.63} $$
are independent standard normal deviates. To see this let us rewrite the system (3.6.63) as
$$ Z_1 = (2V)^{1/2}\cos 2\pi U, \qquad Z_2 = (2V)^{1/2}\sin 2\pi U, \tag{3.6.64} $$
where $V = -\ln U_1$ is from exp(1) and $U = U_2$. It follows from (3.6.64) that
$$ Z_1^2 + Z_2^2 = 2V \quad\text{and}\quad \frac{Z_2}{Z_1} = \tan 2\pi U. $$
The Jacobian of the transformation is
$$ \left|\frac{\partial(v, u)}{\partial(z_1, z_2)}\right| = \frac{z_1^2 + z_2^2}{4\pi v} = \frac{1}{2\pi}, $$
and
$$ f(z_1, z_2) = \frac{1}{2\pi}e^{-v} = \frac{1}{2\pi}\exp\left[-\frac{z_1^2 + z_2^2}{2}\right] = \left[\frac{1}{\sqrt{2\pi}}e^{-z_1^2/2}\right]\left[\frac{1}{\sqrt{2\pi}}e^{-z_2^2/2}\right]. \tag{3.6.65} $$
The last formula represents the joint p.d.f. of two independent standard normal deviates.
Algorithm N-1
1 Generate two independent random variates $U_1$ and $U_2$ from $\mathcal{U}(0, 1)$.
2 Compute $Z_1$ and $Z_2$ simultaneously by substituting $U_1$ and $U_2$ in the system of equations (3.6.63).
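A minimal Python sketch of Algorithm N-1 follows. The function name `box_muller` and the seed are hypothetical. The first uniform is taken as `1 - random()` so that it lies in (0, 1] and the logarithm is always defined.

```python
import math
import random

def box_muller(rng):
    """Return a pair of independent N(0, 1) deviates from two uniforms, as in (3.6.63)."""
    u1 = 1.0 - rng.random()                 # in (0, 1], avoids log(0)
    u2 = rng.random()
    r = math.sqrt(-2.0 * math.log(u1))      # radius (2V)^(1/2) with V = -ln U1
    angle = 2.0 * math.pi * u2              # angle 2*pi*U2
    return r * math.cos(angle), r * math.sin(angle)

rng = random.Random(7)
zs = [z for _ in range(10000) for z in box_muller(rng)]
mean = sum(zs) / len(zs)
var = sum((z - mean) ** 2 for z in zs) / (len(zs) - 1)
```

The sample mean and variance of the 20000 deviates should be close to 0 and 1, respectively.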
Procedure N-2
This procedure is based on the acceptance-rejection method. Let the r.v. X be distributed
$$ f_X(x) = \sqrt{\frac{2}{\pi}}\,e^{-x^2/2}, \qquad x \ge 0. \tag{3.6.66} $$
Since the standard normal distribution is symmetrical about zero, we can assign a random sign to the r.v. generated from (3.6.66) and obtain an r.v. from N(0, 1). To generate an r.v. from (3.6.66) write $f_X(x)$ as
$$ f_X(x) = C\,h(x)\,g(x), $$
where
$$ h(x) = e^{-x}, \qquad x \ge 0, \tag{3.6.67} $$
$$ C = \sqrt{\frac{2e}{\pi}}, \tag{3.6.68} $$
$$ g(x) = \exp\left[-\frac{(x-1)^2}{2}\right]. \tag{3.6.69} $$
The efficiency of the method is equal to $\sqrt{\pi/2e} \approx 0.76$. The acceptance condition $U \le g(Y)$ is
$$ U \le \exp[-(Y-1)^2/2], \tag{3.6.70} $$
which is equivalent to
$$ -\ln U \ge \frac{(Y-1)^2}{2}, \tag{3.6.71} $$
where Y is from exp(1). Since $-\ln U$ is also from exp(1), the last inequality can be written
$$ V_1 \ge \frac{(V_2 - 1)^2}{2}, \tag{3.6.72} $$
where both $V_1 = -\ln U$ and $V_2 = Y$ are from exp(1).
Algorithm N-2
1 Generate $V_1$ and $V_2$ from exp(1).
2 If $V_1 < (V_2 - 1)^2/2$, go to step 1.
3 Generate U from $\mathcal{U}(0, 1)$.
4 If $U \ge 0.5$, deliver $Z = -V_2$.
5 Deliver $Z = V_2$.
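The acceptance-rejection scheme above can be sketched as follows. The delivered value is taken to be the accepted exponential candidate $V_2 = Y$ with a random sign attached; the function name `normal_n2` and the seed are illustrative assumptions.

```python
import random

def normal_n2(rng):
    """Half-normal by rejection from exp(1), then a random sign (sketch of Algorithm N-2)."""
    while True:
        v1 = rng.expovariate(1.0)            # V1 plays the role of -ln U
        v2 = rng.expovariate(1.0)            # V2 is the candidate Y
        if v1 >= (v2 - 1.0) ** 2 / 2.0:      # acceptance condition (3.6.72)
            break
    return -v2 if rng.random() >= 0.5 else v2

rng = random.Random(99)
zs = [normal_n2(rng) for _ in range(20000)]
mean = sum(zs) / len(zs)
var = sum((z - mean) ** 2 for z in zs) / (len(zs) - 1)
```

About 76 percent of the candidate pairs are accepted, in line with the stated efficiency.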
Remark In order to obtain Algorithm N-2 we can represent $f_X(x)$ as
$$ f_X(x) = C\,h_{Y_1}(x)\,(1 - H_{Y_2}(T(x))), $$
where
$$ h_{Y_1}(x) = h(x) = e^{-x}, \qquad H_{Y_2}(T(x)) = 1 - e^{-T(x)}, \qquad T(x) = \tfrac{1}{2}(x-1)^2, $$
and then apply Algorithm AR-3'.
Procedure N-3 In this procedure we make use of the logistic distribution [32]
$$ h(x) = \frac{e^{-x/\beta}}{\beta(1 + e^{-x/\beta})^2}, \qquad -\infty < x < \infty. \tag{3.6.73} $$
It is shown numerically in Ref. 32 that
$$ \beta = 0.626657, \qquad \frac{1}{C} = 0.9196 \tag{3.6.74} $$
and
$$ g(x) = 0.25\,[1 + \exp(-1.5957x)]^2 \exp\left(-\frac{x^2}{2} + 1.5957x\right). \tag{3.6.75} $$
Algorithm N-3 is as follows.
Algorithm N-3
1 Generate $U_1$ and $U_2$ from $\mathcal{U}(0, 1)$.
2 $Y \leftarrow -0.626657\ln(1/U_1 - 1)$.
3 If $U_2 \le g(Y)$, deliver Y.
4 Go to step 1.
Procedure N-4
This procedure is based on the relationship between the normal distribution, the chi-squared distribution, and a vector uniformly distributed on the n-dimensional unit sphere. Let $Z_1, \ldots, Z_n$ be i.i.d. r.v.'s distributed N(0, 1) and let $X = (\sum_{i=1}^{n} Z_i^2)^{1/2}$; then it can be shown by the multivariate transformation method that the vector
$$ Y = (Y_1, \ldots, Y_n) = \left(\frac{Z_1}{X}, \ldots, \frac{Z_n}{X}\right) \tag{3.6.76} $$
is distributed uniformly on the n-dimensional unit sphere.* Now taking into account that $X^2 = \sum_{i=1}^{n} Z_i^2$ has the chi-squared distribution with n degrees of freedom (see Section 3.6.8), the algorithm for generating from N(0, I), where I is a unit matrix of size n, is as follows.
Algorithm N-4
1 Generate a random vector $Y = (Y_1, \ldots, Y_n)$ uniformly distributed on the n-dimensional unit sphere.
2 Generate a chi-square distributed random variate $X^2$ with n degrees of freedom.
3 $Z_k \leftarrow XY_k$, $k = 1, \ldots, n$.
4 Deliver $Z = (Z_1, \ldots, Z_n)$.
Since the efficiency of the algorithm for generating $Y = (Y_1, \ldots, Y_n)$ (see example 5, Section 3.4.2) decreases when n increases, it would be interesting to find the optimal n in order to minimize the CPU time while sampling from N(0, 1).
*An alternative algorithm for generating a vector uniformly distributed on the n-dimensional unit sphere is given in example 5, Section 4.3.2.
Procedure N-5
This procedure relies on the central limit theorem, which says that if $X_i$, $i = 1, \ldots, n$, are i.i.d. r.v.'s with $E(X_i) = \mu$ and $\operatorname{var}(X_i) = \sigma^2$, then
$$ Z = \frac{\sum_{i=1}^{n} X_i - n\mu}{n^{1/2}\sigma} \tag{3.6.77} $$
converges asymptotically with n to N(0, 1). Consider the particular case when all $X_i$, $i = 1, \ldots, n$, are from $\mathcal{U}(0, 1)$. We find that $\mu = \tfrac{1}{2}$, $\sigma^2 = \tfrac{1}{12}$, and
$$ Z = \frac{\sum_{i=1}^{n} U_i - n/2}{(n/12)^{1/2}}. \tag{3.6.78} $$
A good approximation can already be obtained for n = 12. In this case
$$ Z = \sum_{i=1}^{12} U_i - 6. \tag{3.6.79} $$
Algorithm N-5 is straightforward.
Algorithm N-5
1 Generate 12 uniformly distributed random variates $U_1, \ldots, U_{12}$ from $\mathcal{U}(0, 1)$.
2 $Z \leftarrow \sum_{i=1}^{12} U_i - 6$.
3 Deliver Z.
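Algorithm N-5 takes only a few lines of Python. The function name `normal_n5` and the seed are hypothetical. The sketch also makes one limitation of the approximation visible: the generated values can never fall outside [-6, 6].

```python
import random

def normal_n5(rng):
    """Approximate N(0, 1) deviate as the sum of 12 uniforms minus 6, as in (3.6.79)."""
    return sum(rng.random() for _ in range(12)) - 6.0

rng = random.Random(31)
zs = [normal_n5(rng) for _ in range(20000)]
mean = sum(zs) / len(zs)
var = sum((z - mean) ** 2 for z in zs) / (len(zs) - 1)
```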
Procedure N-6
Another approximation technique for generating from N(0, 1) is given in Tocher [33]; it makes use of the following approximation:
$$ \sqrt{\frac{2}{\pi}}\,e^{-x^2/2} \approx \frac{2k\,e^{-kx}}{(1 + e^{-kx})^2} \tag{3.6.80} $$
for $x > 0$ and $k = \sqrt{8/\pi}$. The c.d.f. for the approximation is
$$ F(x) = \frac{1 - e^{-kx}}{1 + e^{-kx}}. $$
The inverse transformation is
$$ X = \frac{1}{k}\ln\frac{1 + U}{1 - U}. \tag{3.6.81} $$
Attaching a random sign to this variate we obtain the desired variate.
Algorithm N-6
1 Generate $U_1$ and $U_2$ from $\mathcal{U}(0, 1)$.
2 $X \leftarrow \sqrt{\pi/8}\,\ln[(1 + U_1)/(1 - U_1)]$.
3 If $U_2 \le 0.5$, deliver $Z = -X$.
4 Deliver $Z = X$.
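3.6.5 Lognormal Distribution
Let X be from $N(\mu, \sigma^2)$. Then $Y = e^X$ has the lognormal distribution with p.d.f.
$$ f_Y(y) = \begin{cases} \dfrac{1}{\sqrt{2\pi}\,\sigma y}\exp\left[-\dfrac{(\ln y - \mu)^2}{2\sigma^2}\right], & 0 < y < \infty \\ 0, & \text{otherwise.} \end{cases} $$
Algorithm LN-1
1 Generate Z from N(0, 1).
2 $X \leftarrow \mu + \sigma Z$.
3 $Y \leftarrow e^X$.
4 Deliver Y.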
3.6.6 Cauchy Distribution
An r.v. X has a Cauchy distribution denoted by C ( a , p ) if the p.d.f. is equal to
B
= n[
p + ( x - *,'I
,
a>o,p>o.-oo<x; go to 1. 3 If Y:+ 4 X t h Y l / Y, + a. 5 Deliver.'A The efficiency of the algorithm is P( r: t. Y; 5;)
+,
so the algorithm is relatively efficient.
3.6.7 Weibull Distribution
An r.v. has a Weibull distribution if the p.d.f. is equal to
$$ f_X(x) = \begin{cases} \dfrac{\alpha}{\beta^{\alpha}}\,x^{\alpha-1}e^{-(x/\beta)^{\alpha}}, & x \ge 0 \\ 0, & \text{otherwise} \end{cases} $$
and is denoted by $W(\alpha, \beta)$. To generate X by the inverse transformation method note that
$$ U = F_X(x) = 1 - e^{-(x/\beta)^{\alpha}}, \tag{3.6.87} $$
so
$$ X = \beta(-\ln(1 - U))^{1/\alpha}. \tag{3.6.88} $$
Since $1 - U$ is also from $\mathcal{U}(0, 1)$, we have
$$ X = \beta(-\ln U)^{1/\alpha} \tag{3.6.89} $$
or
$$ \left(\frac{X}{\beta}\right)^{\alpha} = -\ln U. \tag{3.6.90} $$
Taking into account that $-\ln U$ is from exp(1), the algorithm for generating an r.v. from a Weibull distribution can be written as follows.
Algorithm W-1
1 Generate V from exp(1).
2 $X \leftarrow \beta V^{1/\alpha}$.
3 Deliver X.
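As an illustration, the inverse transform for the Weibull distribution reduces to a one-line transformation of an exp(1) variate. The function name `weibull_w1` and the parameter values are assumptions chosen for the example.

```python
import math
import random

def weibull_w1(alpha, beta, rng):
    """Weibull W(alpha, beta) variate via X = beta * V^(1/alpha) with V from exp(1)."""
    v = rng.expovariate(1.0)
    return beta * v ** (1.0 / alpha)

rng = random.Random(5)
xs = [weibull_w1(2.0, 3.0, rng) for _ in range(20000)]
mean = sum(xs) / len(xs)
expected_mean = 3.0 * math.gamma(1.0 + 1.0 / 2.0)   # beta * Gamma(1 + 1/alpha)
```

The sample mean can be compared with the theoretical mean $\beta\Gamma(1 + 1/\alpha)$, which is about 2.66 for these parameters.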
3.6.8 Chi-square Distribution
Let $Z_1, \ldots, Z_k$ be from N(0, 1). Then
$$ Y = \sum_{i=1}^{k} Z_i^2 \tag{3.6.91} $$
has the chi-square distribution with k degrees of freedom and is denoted $\chi^2(k)$.
Formula (3.6.91) says, "the sum of the squares of independent standard normal random variables has a chi-square distribution with degrees of freedom equal to the number of terms in the sum". One approach for generating a chi-square variate from $\chi^2(k)$ is to generate k standard normal random variables and then apply (3.6.91). Another approach makes use of the fact that $\chi^2(k)$ is a particular case of a gamma density with gamma parameters α and β equal, respectively, to k/2 and 2. Consider two cases.
CASE 1 If k is even, then Y can be computed as
$$ Y = -2\ln\left(\prod_{i=1}^{k/2} U_i\right). \tag{3.6.92} $$
Formula (3.6.92) requires k/2 uniform variates compared to k in (3.6.91). It also requires one logarithmic transformation, compared to k logarithmic and k cosine or sine transformations for generating $Z_i$ from N(0, 1), $i = 1, \ldots, k$ (see (3.6.63) and (3.6.64)).
CASE 2 If k is odd, then
$$ Y = -2\ln\left(\prod_{i=1}^{k/2 - 1/2} U_i\right) + Z^2, \tag{3.6.93} $$
where Z is from N(0, 1) and $U_i$ is from $\mathcal{U}(0, 1)$. For k > 30 the normal approximation for chi-square variates can be used based on the following formula [24]:
$$ Z = \sqrt{2Y} - \sqrt{2k - 1}. $$
Solving for Y, the chi-square variate, we obtain
$$ Y = \frac{(Z + \sqrt{2k - 1})^2}{2}. \tag{3.6.94} $$
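The two cases (3.6.92) and (3.6.93) can be combined in one short routine. This is a sketch under stated assumptions: the name `chi_square` is hypothetical, and the standard library call `rng.gauss` stands in for whichever N(0, 1) generator from Section 3.6.4 is preferred. The running product of uniforms can underflow for very large k, so the sketch is meant for moderate degrees of freedom.

```python
import math
import random

def chi_square(k, rng):
    """Chi-square variate with k degrees of freedom via (3.6.92) and (3.6.93)."""
    prod = 1.0
    for _ in range(k // 2):
        prod *= 1.0 - rng.random()       # uniform on (0, 1], keeps the log finite
    y = -2.0 * math.log(prod)
    if k % 2 == 1:                       # odd k: add one squared normal deviate
        z = rng.gauss(0.0, 1.0)
        y += z * z
    return y

rng = random.Random(11)
even = [chi_square(4, rng) for _ in range(20000)]
odd = [chi_square(5, rng) for _ in range(20000)]
mean_even = sum(even) / len(even)
mean_odd = sum(odd) / len(odd)
```

The sample means should be close to the degrees of freedom, 4 and 5.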
Remark Let $Y_1$, $Y_2$, and $Y_3$ be chi-square random variables with degrees of freedom 2(α + β), 2α, and 2β, respectively; then
has a beta density with parameters a and p. Applying formula (3.6.92), we get
3.6.9 Student's t Distribution
Let Z have a standard normal distribution, let Y have a chi-square distribution with k degrees of freedom, and let Z and Y be independent; then
$$ X = \frac{Z}{\sqrt{Y/k}} \tag{3.6.95} $$
has a Student's t distribution with k degrees of freedom. To generate X we simply generate Z as described in Section 3.6.4 and Y as described in Section 3.6.8 and apply (3.6.95). For $k \ge 30$ the normal approximation can be used.
3.6.10 F Distribution
Let $Y_1$ be a chi-square random variable with $k_1$ degrees of freedom; let $Y_2$ be a chi-square random variable with $k_2$ degrees of freedom, and let $Y_1$ and $Y_2$ be independent. Then the random variable
$$ X = \frac{Y_1/k_1}{Y_2/k_2} \tag{3.6.96} $$
is distributed as an F distribution with $k_1$ and $k_2$ degrees of freedom. To generate an F variate we first produce two chi-square variates and then use (3.6.96).
Remark 1. If X has an F distribution with $k_1$ and $k_2$ degrees of freedom, then 1/X has an F distribution with $k_2$ and $k_1$ degrees of freedom.
Remark 2. If X is an F-distributed random variable with $k_1$ and $k_2$ degrees of freedom, then
$$ Y = \frac{(k_1/k_2)X}{1 + (k_1/k_2)X} \tag{3.6.97} $$
has a beta density with parameters $\alpha = k_1/2$ and $\beta = k_2/2$.
3.7 GENERATING FROM DISCRETE DISTRIBUTIONS
In this section we describe several procedures for generating stochastic variates from most of the well known discrete distributions. We start with the inverse transform method, which is generally easily implemented and is widely used. Let X be a discrete r.v. with probability mass function (p.m.f.)
$$ \Pr(X = x_k) = P_k, \qquad k = 0, 1, \ldots \tag{3.7.1} $$
and with c.d.f.
$$ g_k = \Pr(X \le x_k) = \sum_{i=0}^{k} P_i. \tag{3.7.2} $$
Then, with $g_{-1} = 0$,
$$ \Pr(g_{k-1} < U \le g_k) = g_k - g_{k-1} = P_k, \tag{3.7.3} $$
where U is from $\mathcal{U}(0, 1)$. Thus
$$ X = \min\{x_k : g_{k-1} < U \le g_k\}. \tag{3.7.4} $$
Algorithm IT-2, which is called the inverse transform algorithm, describes generating discrete r.v.'s. This algorithm is based on logical comparison of U with the $g_k$'s and is as follows.
Algorithm IT-2
1 $C \leftarrow P_0$.
2 $B \leftarrow C$.
3 $K \leftarrow 0$.
4 Generate U from $\mathcal{U}(0, 1)$.
5 If $U \le B$ ($U \le g_k$), deliver $X = x_K$.
6 $K \leftarrow K + 1$.
7 $C \leftarrow A_{k+1}C$ ($P_{k+1} = A_{k+1}P_k$).
8 $B \leftarrow B + C$ ($g_{k+1} = g_k + P_{k+1}$).
9 Go to step 5.
Here $P_0$ and $A_{k+1} = P_{k+1}/P_k$ are distribution dependent. The recurrent formulas
$$ P_{k+1} = A_{k+1}P_k \tag{3.7.5} $$
$$ g_{k+1} = g_k + P_{k+1} \tag{3.7.6} $$
in steps 7 and 8 are straightforward to calculate. Most discrete r.v.'s are nonnegative integer valued, that is, $x_k = k$, $k = 0, 1, \ldots$. Later, we consider only these r.v.'s. It is easy to see that the mean number of trials
$$ C = 1 + \sum_{k=1}^{\infty} x_kP_k = 1 + \sum_{k=1}^{\infty} kP_k = 1 + E(X) \tag{3.7.7} $$
is equal to the expected value plus one additional trial. Table 3.7.1 represents the values of $P_0$, $A_{k+1}$, and C for most well known discrete distributions. In order to generate an r.v. from a specified discrete distribution, we take the corresponding values $P_0$ and $A_{k+1}$ from Table 3.7.1 and then run Algorithm IT-2.
Table 3.7.1 (values of $P_0$, $A_{k+1}$, and C for the well known discrete distributions)
In many cases we can improve the efficiency of the inverse transform method IT-2 by starting the search of X at k = m, m being an interior point (for example, mode, median, etc.), rather than at k = 0. We assume that tables of $P_k$ and $g_k$ are available. The procedure is as follows. If $U \ge g_m$, then
$$ g_{m+i} = g_{m+i-1} + P_{m+i}, \tag{3.7.8} $$
$$ P_{m+i} = P_{m+i-1}A^{+}_{m+i}, \qquad i = 1, 2, \ldots, \tag{3.7.9} $$
and if $U < g_m$, then
$$ g_{m-i} = g_{m-i+1} - P_{m-i+1}, \qquad P_{m-i} = P_{m-i+1}A^{-}_{m-i}, \qquad i = 1, 2, \ldots, m, \tag{3.7.10} $$
where $A^{+}_{k+1}$ and $A^{-}_{k-1}$ are distribution dependent and their values are available to compute. Algorithm IT-3 describes the necessary steps.
Algorithm IT-3
1 $D \leftarrow g_m$.
2 $E \leftarrow P_m$.
3 Generate U from $\mathcal{U}(0, 1)$.
4 $K \leftarrow m$.
5 If $U > g_m$, go to step 12.
6 $D \leftarrow D - E$ ($g_{k-1} = g_k - P_k$).
7 If $U > D$, deliver $X = K$; go to step 1.
8 $K \leftarrow K - 1$.
9 If $K = 0$, deliver $X = K$; go to step 1.
10 $E \leftarrow EA^{-}_{k-1}$ ($P_{k-1} = A^{-}_{k-1}P_k$).
11 Go to step 6.
12 $K \leftarrow K + 1$.
13 $E \leftarrow EA^{+}_{k+1}$ ($P_{k+1} = A^{+}_{k+1}P_k$).
14 $D \leftarrow D + E$.
15 If $U \le D$, deliver $X = K$.
16 Go to step 12.
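The two search strategies can be compared side by side with a small sketch. The function names `it2` and `it3`, the callables `ratio_up` and `ratio_down`, and the choice of a Poisson distribution with λ = 4 are all assumptions made for illustration; the ratios used are $P_{k+1}/P_k = \lambda/(k+1)$ and $P_{k-1}/P_k = k/\lambda$. The tests against a zero probability are floating-point safeguards added here and are not part of the printed algorithms.

```python
import math

def it2(p0, ratio_up, u):
    """Sequential search from k = 0; ratio_up(k) returns P_{k+1} / P_k."""
    c, b, k = p0, p0, 0
    while u > b and c > 0.0:          # c > 0 guards against round-off near u = 1
        c *= ratio_up(k)              # P_{k+1} = A_{k+1} P_k
        k += 1
        b += c                        # g_{k+1} = g_k + P_{k+1}
    return k

def it3(m, p_m, g_m, ratio_up, ratio_down, u):
    """Search started at an interior point m; ratio_down(k) returns P_{k-1} / P_k."""
    d, e, k = g_m, p_m, m
    if u > g_m:                       # search upward from m
        while True:
            e *= ratio_up(k)
            k += 1
            d += e
            if u <= d or e == 0.0:
                return k
    while True:                       # search downward from m
        d -= e                        # d is now g_{k-1}
        if u > d:
            return k
        k -= 1
        if k == 0:
            return 0
        e *= ratio_down(k + 1)        # e becomes P_k for the decremented k

lam, m = 4.0, 4
probs = [math.exp(-lam) * lam ** k / math.factorial(k) for k in range(m + 1)]
up = lambda k: lam / (k + 1)
down = lambda k: k / lam
grid = [(i + 0.5) / 1000.0 for i in range(1000)]
same = all(it2(probs[0], up, u) == it3(m, probs[m], sum(probs), up, down, u) for u in grid)
```

Both functions return the same value for the same U; only the number of comparisons differs, which is what the analysis of the mean number of trials quantifies.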
Table 3.7.2 represents the values of $P_0$, m (mode), $A^{+}_{k+1}$, and $A^{-}_{k-1}$ for most well known discrete distributions. It is easy to see that for an integer m the number of trials (number of logical comparisons of U with the $g_k$'s) is the following r.v.:
$$ N = \begin{cases} 2 + (m - X), & \text{if } X = 0, 1, \ldots, m \\ 1 + (X - m), & \text{if } X = m+1, m+2, \ldots \end{cases} \tag{3.7.11} $$
The mean number of trials is
$$ C = \sum_{x=0}^{m}(2 + m - x)P_x + \sum_{x=m+1}^{\infty}(1 + x - m)P_x \tag{3.7.12} $$
$$ = g_m + 1 + mg_m - m(1 - g_m) - 2\sum_{x=0}^{m} xP_x + E(X) = 1 + E(X) - \gamma(m), \tag{3.7.13} $$
where
$$ \gamma(m) = 2\sum_{x=0}^{m} xP_x - g_m + m - 2mg_m. \tag{3.7.14} $$
Table 3.7.2 Discrete Unimodal Distributions (columns: distribution, notation $P_x$, $P_0$, modal value m, $A^{+}_{k+1}$, $A^{-}_{k-1}$; rows: binomial, Poisson, negative binomial, hypergeometric)
It follows from (3.7.7) and (3.7.13) that Algorithm IT-3 is more efficient than Algorithm IT-2 for m such that γ(m) > 0. However, γ(m) is not necessarily positive for each m. The following example illustrates this point.
Example 1 Assume that the r.v. X has the following p.m.f.:
$$ P_x = \begin{cases} \tfrac{1}{4}, & x = 0 \\ \tfrac{1}{2}, & x = 1 \\ \tfrac{1}{4}, & x = 2 \\ 0, & \text{otherwise.} \end{cases} $$
Let m = 1; then $\gamma(1) = 2\sum_{x=0}^{1} xP_x - g_1 + 1 - 2g_1 = 1 - \tfrac{3}{4} + 1 - \tfrac{3}{2} = -0.25 < 0$, and therefore Algorithm IT-2 is more efficient than Algorithm IT-3. Nevertheless, in many cases it is possible to choose the starting point m in such a way that γ(m) > 0, and therefore it is possible for IT-3 to be more efficient than IT-2.
Lemma 3.7.1
If there exists m > 0 such that
$$ P_0 \le \sum_{x=1}^{m}(2x - 1)P_x \tag{3.7.15} $$
for $g_m \le \tfrac{1}{2}$, then γ(m) > 0.
Proof Condition $P_0 \le \sum_{x=1}^{m}(2x - 1)P_x$ is equivalent to
$$ 2\sum_{x=0}^{m} xP_x - g_m > 0, \tag{3.7.16} $$
and, correspondingly, condition $g_m \le \tfrac{1}{2}$, m > 0, is equivalent to
$$ m - 2mg_m > 0. \tag{3.7.17} $$
Both (3.7.16) and (3.7.17) yield γ(m) > 0. Q.E.D.
Note 1 We can see that Lemma 3.7.1 is valid if $P_0 \le \sum_{x=1}^{m} P_x$. This condition is not restrictive and holds for practically all discrete distributions.
Lemma 3.7.2 γ(m) achieves its maximum at points $m_0$ or $m_0 + 1$, where $m_0 = \max\{m : g_m \le \tfrac{1}{2}\}$, depending, correspondingly, on whether $g_{m_0} + g_{m_0+1} \ge 1$ or $g_{m_0} + g_{m_0+1} < 1$.
Proof It is straightforward to obtain from (3.7.14) that
$$ \Delta\gamma(m) = \gamma(m+1) - \gamma(m) = 1 - g_m - g_{m+1}. \tag{3.7.18} $$
For $m < m_0$ we have $g_m + g_{m+1} \le 1$, and therefore $\Delta\gamma(m) \ge 0$; for $m > m_0$ we have, correspondingly, $g_m + g_{m+1} > 1$ and $\Delta\gamma(m) < 0$. At $m = m_0$ the sign of $\Delta\gamma(m_0) = 1 - g_{m_0} - g_{m_0+1}$ decides between the two candidates. Therefore γ(m) is a unimodal function with the maximum at point $m_0$ or $m_0 + 1$, depending on whether $g_{m_0} + g_{m_0+1} \ge 1$ or $g_{m_0} + g_{m_0+1} < 1$. Q.E.D.
Note 2 In other words, Lemma 3.7.2 says that γ(m) achieves its maximum at the median or at a point neighboring the median on the left. As a corollary from these two lemmas we obtain the following theorem.
Theorem 3.7.1 The optimal starting point in Algorithm IT-3 is either $m_0 = \max\{m : g_m \le \tfrac{1}{2}\}$, if $P_0 \le \sum_{x=1}^{m_0}(2x - 1)P_x$ and $g_{m_0} + g_{m_0+1} \ge 1$, or $m_0 + 1$, if $P_0 \le \sum_{x=1}^{m_0+1}(2x - 1)P_x$ and $g_{m_0} + g_{m_0+1} < 1$.
Note 3 Theorem 3.7.1 is valid not only for integer nonnegative valued r.v.'s, but for any discrete r.v. with values $x_0, x_1, \ldots$, since Algorithm IT-3 is determined not by the sequence $x_0, x_1, \ldots$, but by its indices 0, 1, ....
In the rest of this chapter we consider some alternative procedures for generating discrete r.v.'s. Generally, procedures for generating discrete variates are simpler than procedures for generating continuous variates, and we describe them only briefly.
3.7.1 Binomial Distribution
An r.v. X has a binomial distribution if the p.m.f. is equal to
$$ P_x = \binom{n}{x}p^x(1-p)^{n-x}, \qquad x = 0, \ldots, n \tag{3.7.19} $$
and is denoted by B(n, p). Here 0 < p < 1 is the probability of success in a single trial, and n is the number of trials. To apply the inverse transform method IT-2 we must check the following condition after step 5: if K = n - 1, terminate the procedure with X = n.
It is also worthwhile to note that, if Y is from B(n, p), then n - Y is from B(n, 1 - p). Hence for purposes of efficiency we generate X from B(n, p) according to
$$ X \sim B(n, p) \ \text{if } p \le \tfrac{1}{2}; \qquad Y \sim B(n, 1 - p),\ X = n - Y \ \text{if } p > \tfrac{1}{2}. \tag{3.7.20} $$
For larger n the inverse-transform procedure becomes time consuming, and we can consider the normal distribution as an approximation to the binomial. As n increases the distribution of
$$ Z = \frac{X - np + 0.5}{(np(1-p))^{1/2}} \tag{3.7.21} $$
approaches N(0, 1). To obtain a binomial variate we generate Z from N(0, 1), solve (3.7.21) with respect to X, and round to a nonnegative integer, that is,
$$ X = \max\{0, [-0.5 + np + Z(np(1-p))^{1/2}]\}, \tag{3.7.22} $$
where [a]denotes the integer part of a. We should consider replacing the binomial with the approximate normal when np > 10 forp >-f and n(1 - p ) > 10 forp 0
(3.7.24)
It is well known (Feller [11]) that, if the time intervals between events are from exp(1/λ), the number of events occurring in a unit interval of time is from P(λ). Mathematically, it can be written
$$ \sum_{i=0}^{X-1} T_i \le 1 < \sum_{i=0}^{X} T_i, \tag{3.7.25} $$
where $T_i$, $i = 0, 1, \ldots, X$, are from exp(1/λ) and an empty sum is zero. Since $T_i = -(1/\lambda)\ln U_i$, the last formula can be written as
$$ -\sum_{i=0}^{X-1}\ln U_i \le \lambda < -\sum_{i=0}^{X}\ln U_i, \qquad X = 0, 1, \ldots \tag{3.7.26} $$
or
$$ \prod_{i=0}^{X-1} U_i \ge e^{-\lambda} > \prod_{i=0}^{X} U_i. \tag{3.7.27} $$
The following algorithm is written with respect to (3.7.25):
1 $A \leftarrow 1$.
2 $K \leftarrow 0$.
3 Generate $U_K$ from $\mathcal{U}(0, 1)$.
4 $A \leftarrow U_KA$.
5 If $A < e^{-\lambda}$, deliver X = K.
6 $K \leftarrow K + 1$.
7 Go to step 3.
For large λ (λ > 10) we can approximate the Poisson distribution by the normal distribution. As λ increases, the distribution of
$$ Z = \frac{X - \lambda + 0.5}{\lambda^{1/2}} \tag{3.7.28} $$
approaches N(0, 1). To obtain a Poisson variate we generate Z from N(0, 1); then by analogy with (3.7.22) we obtain
$$ X = \max\{0, [\lambda + Z\lambda^{1/2} - 0.5]\}, \tag{3.7.29} $$
where [a] is the integer part of a. It is shown in Ref. 22 that, if m is the mode, then for large λ the mean execution time in Algorithm IT-3 is similar to (3.7.23) and is equal to
(3.7.30)
The mean numbers of trials in Algorithms IT-2 and IT-3 are proportional, respectively, to λ and $\lambda^{1/2}$, and therefore Algorithm IT-3 is again essentially more efficient than Algorithm IT-2.
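The product-of-uniforms algorithm for the Poisson distribution can be sketched as follows. The function name `poisson_product` and the seed are hypothetical. The loop multiplies uniforms until the running product drops below $e^{-\lambda}$, and the number of factors used before that happens is the delivered variate.

```python
import math
import random

def poisson_product(lam, rng):
    """Poisson(lam) variate: count uniforms until their product falls below exp(-lam)."""
    threshold = math.exp(-lam)
    a, k = 1.0, 0
    while True:
        a *= rng.random()          # A <- U_K * A
        if a < threshold:          # stopping rule of step 5
            return k
        k += 1

rng = random.Random(17)
xs = [poisson_product(3.5, rng) for _ in range(20000)]
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)
```

For a Poisson variate both the mean and the variance equal λ, so both sample statistics should be near 3.5. The expected number of uniforms per variate is λ + 1, which is why the normal approximation becomes attractive for large λ.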
3.7.3 Geometric Distribution
An r.v. has the geometric distribution if the p.m.f. is equal to
..., o < p
O , ~ ,..., , ~ = n=, ~
j - 1
such that*
1. $p_i > 0$ if $h_i \ne 0$,
2. $P_{ij} > 0$ if $a_{ij} \ne 0$, $i, j = 1, \ldots, n$, (5.1.11)
where $p_i$ and $P_{ij}$ are, respectively, the initial distribution and the transition probabilities of the Markov chain.
We first consider the problem of estimating $(h, x^{(k+1)})$, which approximates $(h, x)$. Let k be a given integer and let us simulate the Markov chain (5.1.10), (5.1.11) for k units of time. We associate with the Markov chain a particle that passes through the sequence of states $i_0, i_1, \ldots, i_k$. Define
$$ W_m = \frac{a_{i_0i_1}a_{i_1i_2}\cdots a_{i_{m-1}i_m}}{P_{i_0i_1}P_{i_1i_2}\cdots P_{i_{m-1}i_m}}, \tag{5.1.12} $$
which can be written recursively
$$ W_m = W_{m-1}\frac{a_{i_{m-1}i_m}}{P_{i_{m-1}i_m}}, \qquad W_0 = 1. \tag{5.1.13} $$
We also define the random variable (r.v.)
$$ \eta_k(h) = \frac{h_{i_0}}{p_{i_0}}\sum_{m=0}^{k} W_mf_{i_m} \tag{5.1.14} $$
associated with the sample path $i_0 \to i_1 \to \cdots \to i_k$, which has probability $p_{i_0}P_{i_0i_1}P_{i_1i_2}\cdots P_{i_{k-1}i_k}$. Now we are able to prove the following
Proposition 5.1.1
$$ E[\eta_k(h)] = \left(h, \sum_{m=0}^{k} A^mf\right) = (h, x^{(k+1)}), \tag{5.1.15} $$
that is, $\eta_k(h)$ is an unbiased estimator of the inner product $(h, x^{(k+1)})$.
*The Markov chain need not be homogeneous; we are considering the homogeneous case for simplicity only.
Proof Each path $i_0 \to i_1 \to \cdots \to i_k$ will be realized with probability
$$ P(i_0, i_1, \ldots, i_k) = p_{i_0}P_{i_0i_1}\cdots P_{i_{k-1}i_k}. \tag{5.1.16} $$
While simulating the M.C. (5.1.10)-(5.1.11), since the r.v. $\eta_k(h)$ is defined along the path $i_0 \to i_1 \to \cdots \to i_k$, we have
$$ E[\eta_k(h)] = \sum_{i_0=1}^{n}\cdots\sum_{i_k=1}^{n}\eta_k(h)\,p_{i_0}P_{i_0i_1}\cdots P_{i_{k-1}i_k}, \tag{5.1.17} $$
which, together with (5.1.12) through (5.1.14), gives
$$ E[\eta_k(h)] = \sum_{m=0}^{k}\sum_{i_0=1}^{n}\cdots\sum_{i_k=1}^{n} h_{i_0}a_{i_0i_1}\cdots a_{i_{m-1}i_m}f_{i_m}P_{i_mi_{m+1}}\cdots P_{i_{k-1}i_k}. \tag{5.1.18} $$
Using the property $\sum_{j=1}^{n} P_{ij} = 1$, the last formula can be written as
$$ E[\eta_k(h)] = \sum_{m=0}^{k}\sum_{i_0=1}^{n}\cdots\sum_{i_m=1}^{n} h_{i_0}a_{i_0i_1}\cdots a_{i_{m-1}i_m}f_{i_m}. \tag{5.1.19} $$
Taking into account that
$$ \sum_{i_1=1}^{n}\cdots\sum_{i_m=1}^{n} a_{i_0i_1}\cdots a_{i_{m-1}i_m}f_{i_m} = (A^mf)_{i_0} $$
and
$$ \sum_{i_0=1}^{n} h_{i_0}(A^mf)_{i_0} = (h, A^mf), $$
we immediately obtain
$$ E[\eta_k(h)] = \left(h, \sum_{m=0}^{k} A^mf\right) = (h, x^{(k+1)}). $$
Q.E.D.
To estimate $(h, x^{(k+1)})$ we simulate N random paths $i_0^{(s)} \to i_1^{(s)} \to \cdots \to i_k^{(s)}$, $s = 1, 2, \ldots, N$, of length k each and then find the sample mean
$$ \theta_k = \frac{1}{N}\sum_{s=1}^{N}\eta_k^{(s)}(h). \tag{5.1.20} $$
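The Procedure for Estimating $(h, x^{(k+1)})$
1 Choose any integer k > 0.
2 Simulate N independent random paths $i_0^{(s)} \to i_1^{(s)} \to \cdots \to i_k^{(s)}$, $s = 1, \ldots, N$, of the Markov chain (5.1.10)-(5.1.11).
3 Find
$$ \eta_k^{(s)}(h) = \frac{h_{i_0^{(s)}}}{p_{i_0^{(s)}}}\sum_{m=0}^{k} W_m^{(s)}f_{i_m^{(s)}}, \tag{5.1.21} $$
where
$$ W_m^{(s)} = W_{m-1}^{(s)}\frac{a_{i_{m-1}^{(s)}i_m^{(s)}}}{P_{i_{m-1}^{(s)}i_m^{(s)}}}, \qquad W_0^{(s)} = 1. \tag{5.1.22} $$
4 Calculate
$$ \theta_k = \frac{1}{N}\sum_{s=1}^{N}\eta_k^{(s)}(h), \tag{5.1.23} $$
which is an unbiased estimator of the inner product $(h, x^{(k+1)})$.
Taking the limit of (5.1.15), we obtain
$$ \lim_{k\to\infty} E[\eta_k(h)] = \left(h, \sum_{m=0}^{\infty} A^mf\right) = (h, x), \tag{5.1.24} $$
provided that the von Neumann series $A + A^2 + \cdots$ converges. Thus if the path $i_0 \to i_1 \to \cdots \to i_k \to \cdots$ is infinitely long, we obtain an unbiased estimator of $(h, x)$. The sample mean is then of the form
$$ \theta = \frac{1}{N}\sum_{s=1}^{N}\eta_{\infty}^{(s)}(h), \tag{5.1.25} $$
where
$$ \eta_{\infty}^{(s)}(h) = \frac{h_{i_0^{(s)}}}{p_{i_0^{(s)}}}\sum_{m=0}^{\infty} W_m^{(s)}f_{i_m^{(s)}} \tag{5.1.26} $$
and
$$ W_m^{(s)} = W_{m-1}^{(s)}\frac{a_{i_{m-1}^{(s)}i_m^{(s)}}}{P_{i_{m-1}^{(s)}i_m^{(s)}}}, \qquad W_0^{(s)} = 1. \tag{5.1.27} $$
We note that the inner products $(h, \sum_{m=0}^{k} A^mf)$ for different h and f can be found from (5.1.23) by using the same random paths $i_0^{(s)} \to i_1^{(s)} \to \cdots \to i_k^{(s)}$, $s = 1, \ldots, N$, of the M.C. (5.1.10)-(5.1.11).
Remark In the particular case where $A = aP$, $0 < a < 1$, we have $W_m = a^m$ and
$$ \eta_k(h) = \frac{h_{i_0}}{p_{i_0}}\sum_{m=0}^{k} a^mf_{i_m}. $$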
Adjoint System of Linear Equations
Let us define for the system of linear equations (5.1.2) an associated system of linear equations (5.1.28) x* = A'x* + h where A' = llaf, I!: is the transpose of A. It is readily seen that (5.1.29)
-
Indeed, we have from (5.1.2) and (5.1.28) (x*,x) = ( x * , A x ) + (x*, f ) and (x,x*) = ( A ' x * , x } + ( h , x ) , respectively. Now (5.1.29) follows because ( A ' x * , x ) = ( x * , A x ) . We call the pair (5.1.2) and (5.1.28) adjoint systems. A direct consequence of (5.t.29) is that there exists another unbiased estimator of (h, x ), which can be written as (5.1.30)
where
are defined on the sample path I , , 4 i , .--,* . ' + i k , which is obtained from the Markov chain defined by the following:
p * = 114; 11; n
n
I:p:=t, 1-1
XP;=l, J-
p *i > - 0,
P,; 2 0,
1
such that 1.
p,?>O,
iff;.+O
2.
P,;>O,
ifa,!,#O,i,j=l.
...,n .
i , j = 1 ,..., n,
164
!.INEAR EQUATIONS A N D MARKOV CHAINS
In the particular case for which P in (5.1.10)-(5.1.11) is a doubly stochastic matrix, that is,
$$ \sum_{j=1}^{n} P_{ij} = 1 \quad\text{and}\quad \sum_{i=1}^{n} P_{ij} = 1, \tag{5.1.31} $$
$P^*$ can be chosen equal to $P'$. Assuming also $A' = A$, then together with (5.1.31) we obtain $P' = P$, and (5.1.30) becomes
$$ \eta_k^*(h) = \frac{f_{i_0}}{p_{i_0}}\sum_{m=0}^{k} W_mh_{i_m}. \tag{5.1.32} $$
Comparing (5.1.14) with (5.1.32), we can see that even in this case, that is, when $A' = A$ and $P' = P$, $\eta_k^*(h) \ne \eta_k(h)$. The difference between $\eta_k^*(h)$ and $\eta_k(h)$ is in the terms f and h, which are interchanged.
We return now to the original problem (5.1.2) of estimating all coordinates $x_j$ of the vector x. In order to estimate the jth coordinate $x_j$ of x we assume
$$ h' = e_j = (0, \ldots, 0, 1, 0, \ldots, 0), $$
with the 1 in position j, and start simulating the M.C. from the state j, that is, $p_{i_0} = p_j = 1$. The corresponding path is then $j \to i_1 \to i_2 \to \cdots \to i_k$. Denoting
$$ \eta_k(e_j) = \sum_{m=0}^{k} W_mf_{i_m}, \tag{5.1.33} $$
where
$$ W_m = W_{m-1}\frac{a_{i_{m-1}i_m}}{P_{i_{m-1}i_m}}, \qquad W_0 = 1, \quad i_0 = j, $$
we immediately obtain the corollary
$$ E[\eta_k(e_j)] = x_j^{(k+1)} \tag{5.1.34} $$
and also
$$ \theta_k(e_j) = \frac{1}{N}\sum_{s=1}^{N}\eta_k^{(s)}(e_j) \approx x_j^{(k+1)}. \tag{5.1.35} $$
It follows from (5.1.33) that, in order to estimate all the components $x_j$, $j = 1, 2, \ldots, n$, of the vector x, we have to simulate n random paths $j \to i_1 \to i_2 \to \cdots \to i_k$, $j = 1, 2, \ldots, n$, of the Markov chain (5.1.10)-(5.1.11), each time starting from a new state $i_0 = j$.
Looking carefully at (5.1.33), we find that all $\eta_k(e_j)$, $j = 1, 2, \ldots, n$, are similar. They differ only in the initial terms $a_{i_0i_1}/P_{i_0i_1}$ and $f_{i_0}$, which are associated with the choice of the initial state $i_0$. Thus for $\eta_k(e_j)$ and $\eta_k(e_r)$ we have $a_{ji_1}/P_{ji_1}$, $f_j$ and $a_{ri_1}/P_{ri_1}$, $f_r$, respectively. We now turn to the question of whether or not all the components $x_j$ of x can be estimated simultaneously by simulating one path. The answer is affirmative. We start this topic with the following
Definition The path $i_0 \to i_1 \to \cdots \to i_T$ will be called covering if it has visited each state $j = 1, \ldots, n$ at least once.
Let $i_0 \to i_1 \to \cdots \to i_T \to \cdots$ be an infinite realization from the Markov chain (5.1.10)-(5.1.11). Because our Markov chain is ergodic, each state will be visited infinitely many times and the first hitting time of the state j, $T_j = \min\{t : i_t = j\}$, is finite almost surely (a.s.). With this result in hand the procedure for finding all the estimates $\hat\eta_k(e_j)$, $j = 1, 2, \ldots, n$, from one realization can be written in the following way:
1 Simulate a covering path
$$ i_0 \to i_1 \to \cdots \to i_{T_j} \to \cdots \to i_T \to \cdots \to i_{T+k}, \tag{5.1.36} $$
where $T = \max_j\{T_j\} = \max_j\min_t\{t : i_t = j\}$, $j = 1, \ldots, n$, and k is some fixed number.
2 Find the first hitting time $T_j = \min\{t : i_t = j\}$ for each state $j = 1, \ldots, n$, separately.
3 Take the subpath $i_{T_j} \to i_{T_j+1} \to \cdots \to i_{T_j+k}$ (which is part of the generated path) for each state $j = 1, \ldots, n$, separately.
4 Calculate all
$$ \hat\eta_k(e_j) = \sum_{m=T_j}^{T_j+k} W_mf_{i_m}, \qquad j = 1, \ldots, n, \tag{5.1.37} $$
where
$$ W_m = W_{m-1}\frac{a_{i_{m-1}i_m}}{P_{i_{m-1}i_m}}, \qquad W_{T_j} = 1, \tag{5.1.38} $$
are defined on the same path (5.1.36) starting at different points $T_j$ associated with the first hitting time. Each subpath $i_{T_j} \to i_{T_j+1} \to \cdots \to i_{T_j+k}$ is of the same length k. Thus $i_0 \to i_1 \to \cdots \to i_T$ will be a covering path of minimal length (in a given realization).
5 Simulate N such independent random paths $i_0^{(s)} \to i_1^{(s)} \to \cdots \to i_{T^{(s)}+k}^{(s)}$, $s = 1, \ldots, N$, and find
$$ \hat\theta_k(e_j) = \frac{1}{N}\sum_{s=1}^{N}\hat\eta_k^{(s)}(e_j), \tag{5.1.39} $$
which estimates $x_j$.
Therefore all r.v.'s $\hat\eta_k(e_j)$, $j = 1, \ldots, n$, are defined on the same path and calculated according to the same formula (5.1.37). The only difference between them is the starting point, which is determined by the first hitting time $T_j$ and is a random variable.
Proof The proof of this proposition is based on the strong Markov property, which is given in Ref. 2, Proposition 1.22, p. 117, and which states that for any homogeneous Markov chain and any bounded function g defined on the state space, we have
$$ E[g(i_t, i_{t+1}, \ldots) \mid i_t = j] = E[g(i_0, i_1, \ldots) \mid i_0 = j]. $$
In our notation
$$ E[\hat\eta_k(e_j) \mid T_j = t] = E[g(i_t, i_{t+1}, \ldots, i_{t+k}) \mid i_t = j] = E[g(i_0, \ldots, i_k) \mid i_0 = j]. $$
By Proposition 5.1.1, $E[g(i_0, \ldots, i_k) \mid i_0 = j] = x_j^{(k+1)}$. Since $E[\hat\eta_k(e_j) \mid T_j = t]$ does not depend on t, we have $E[\hat\eta_k(e_j) \mid T_j = t] = E[\hat\eta_k(e_j)] = x_j^{(k+1)}$.
Q.E.D.
(5.1.41)
Proof Similarly,
$$ \operatorname{var}\hat\theta_k(e_j) = \frac{1}{N}\left\{E\{[\hat\eta_k(e_j)]^2\} - [x_j^{(k+1)}]^2\right\} = \frac{1}{N}\operatorname{var}\hat\eta_k(e_j). $$
Now again using Proposition 1.22 of Ref. 2 (p. 117), we have
$$ E\{[\hat\eta_k(e_j)]^2 \mid T_j = t\} = E[g^2(i_t, \ldots, i_{t+k}) \mid i_t = j] = E[g^2(i_0, \ldots, i_k) \mid i_0 = j] = E\{[\eta_k(e_j)]^2\}. $$
Therefore $\operatorname{var}[\hat\theta_k(e_j)] = \operatorname{var}[\theta_k(e_j)]$. Q.E.D.
To compare the efficiencies of the two methods we use (4.2.28), which can be written
$$ \varepsilon = \frac{t\operatorname{var}\theta_k(e_j)}{\hat t\operatorname{var}\hat\theta_k(e_j)}, \tag{5.1.42} $$
and assume without loss of generality N = 1. Since $\operatorname{var}\hat\theta_k(e_j) = \operatorname{var}\theta_k(e_j)$, we have $\varepsilon = t/\hat t$. In the first case we have n trajectories each of length k, so the total length of these trajectories is nk. In the second case we have one trajectory of length $\max_{j=1,\ldots,n}\{T_j\} + k$, with mean $E(\max_{j=1,\ldots,n}\{T_j\}) + k$. It is obvious that the second algorithm is on the average more efficient when n > 1 and $k \ge (n-1)^{-1}E[T]$, where $T = \max_{j=1,\ldots,n}\{T_j\}$, because this is exactly the condition $nk \ge E[T] + k$. Because the first hitting time $T_j$, $j = 1, \ldots, n$, of each state is finite a.s., it can be proven that
$$ \varepsilon = \frac{nk}{T + k} \to n \quad\text{a.s. as } k \to \infty, \tag{5.1.43} $$
that is, asymptotically the method of covering path is n times more efficient than the standard Monte Carlo method. The efficiency of the second method can be improved if we can find $i_0 = l$ such that
$$ E\left[\max_{j=1,\ldots,n}\{T_j\} \mid i_0 = l\right] = \min_{i_0=1,\ldots,n} E\left[\max_{j=1,\ldots,n}\{T_j\} \mid i_0\right] $$
and then take this $i_0 = l$ as a starting point of the path or, equivalently, choose the initial distribution as
$$ p_{i_0} = \begin{cases} 1, & \text{if } i_0 = l \\ 0, & \text{if } i_0 \ne l. \end{cases} $$
5.1.2 Computing the Inverse Matrix
It follows from (5.1.4) that
$$ x = \sum_{m=0}^{\infty} A^mf = B^{-1}f, $$
where $B^{-1} = \|b^{-1}_{ij}\|_1^n = I + A + A^2 + \cdots$. The jth coordinate of x is
$$ x_j = \sum_{r=1}^{n} b^{-1}_{jr}f_r. \tag{5.1.44} $$
Setting
$$ f = e_r = (0, \ldots, 0, 1, 0, \ldots, 0), $$
with the 1 in position r, we obtain
$$ x_j = b^{-1}_{jr}, \tag{5.1.45} $$
and the estimator $\eta_k(e_j)$ in (5.1.33) becomes
$$ \eta_k(b^{-1}_{jr}) = \sum_{m:\,i_m = r} W_m. \tag{5.1.46} $$
Here the summation with respect to $W_m$ is taken over the indices $i_m = r$, that is, when the particle visits the state r. The sample mean is then
$$ \theta_k(b^{-1}_{jr}) = \frac{1}{N}\sum_{s=1}^{N}\sum_{m:\,i_m^{(s)} = r} W_m^{(s)}, \tag{5.1.47} $$
where $s = 1, 2, \ldots, N$ is the path number. Thus setting
$$ h' = h_j = e_j = (0, \ldots, 0, 1, 0, \ldots, 0) \tag{5.1.48} $$
and
$$ f = f_r = e_r = (0, \ldots, 0, 1, 0, \ldots, 0), $$
we can estimate all the elements $b^{-1}_{jr}$ of the inverse matrix $B^{-1}$ by (5.1.47). Inasmuch as the problem of determining $b^{-1}_{jr}$ is a particular case of the problem of finding $x_j$, we can estimate all the elements $b^{-1}_{jr}$ of the jth row of the inverse matrix $B^{-1}$ simultaneously with $x_j$. Thus the Monte Carlo method provides a way of estimating a single element or any collection of the elements of $B^{-1}$. This desirable feature differentiates the Monte Carlo method from other numerical methods in which, as a rule, all the elements of $B^{-1}$ are computed simultaneously. By solving the adjoint system we can estimate simultaneously all the elements $b^{-1}_{jr}$ of the rth column of the inverse matrix $B^{-1}$. It follows also from (5.1.36) through (5.1.39) that all the elements $b^{-1}_{jr}$ can be estimated simultaneously with the $x_j$'s from the covering path.
Before leaving this section we want to turn the readers' attention to the analogy that exists in calculating integrals and solving systems of linear equations by Monte Carlo methods. Calculating the integral
$$ I = \int g(x)\,dx, $$
we introduce any p.d.f. f x ( x ) such that
where X is distributed with p.d.f.j,(x) andf,(x) > 0 when g(x) # 0. Then taking a sample N fromf,(x), we estimate the integral I by (see (4.3.4))
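The integral estimator just described can be illustrated with a minimal Python sketch. The integrand, the sampling density, the sample size, and the function name `importance_sampling_integral` are all assumptions chosen for illustration; they do not come from the text.

```python
import math
import random


def importance_sampling_integral(g, sample, pdf, n, rng):
    """Average g(X) / f_X(X) over n independent draws X ~ f_X."""
    total = 0.0
    for _ in range(n):
        x = sample(rng)
        total += g(x) / pdf(x)
    return total / n


# Illustrative problem (an assumption, not from the text):
# I = integral over (0, inf) of x * exp(-x) dx = 1,
# sampled from the exponential density f_X(x) = exp(-x).
rng = random.Random(12345)
estimate = importance_sampling_integral(
    g=lambda x: x * math.exp(-x),
    sample=lambda r: r.expovariate(1.0),
    pdf=lambda x: math.exp(-x),
    n=50_000,
    rng=rng,
)
print(estimate)
```

The density is positive wherever the integrand is nonzero, as the condition on $f_X$ requires; a density that better matches the shape of $g$ would give a smaller variance.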
While solving the system of linear equations we introduce any ergodic Markov chain (5.1.10)-(5.1.11). Then simulating our Markov chain, we obtain the path $i_0 \to i_1 \to \cdots \to i_k$ with probability

$$P(i_0, i_1, \ldots, i_k) = p_{i_0} P_{i_0 i_1} \cdots P_{i_{k-1} i_k}.$$

The element $x_j^{(k+1)}$ of the vector $x^{(k+1)}$ can be written (see (5.1.7)) as

$$x_j^{(k+1)} = f_j + \sum_{i_1} a_{j i_1} f_{i_1} + \sum_{i_1, i_2} a_{j i_1} a_{i_1 i_2} f_{i_2} + \cdots + \sum_{i_1, \ldots, i_k} a_{j i_1} a_{i_1 i_2} \cdots a_{i_{k-1} i_k} f_{i_k} = E[\eta_k(e_j)],$$

where $\eta_k$ is distributed according to $p_{i_0} P_{i_0 i_1} \cdots P_{i_{k-1} i_k}$. Here $i_0 = j$ and $p_{i_0} = 1$. Now considering $N$ random paths $j^{(s)} \to i_1^{(s)} \to i_2^{(s)} \to \cdots \to i_k^{(s)}$, $s = 1, \ldots, N$, we can estimate $x_j^{(k+1)}$ by (5.1.39). Comparing (4.3.4) and (5.1.39), we realize that both problems of calculating the integral and solving the system of linear equations can be reduced to the problem of estimating the expected value of some random function. In our case the random functions are $g(X)/f_X(X)$ and $\eta_k(e_j)$, respectively. These results allow us to suggest a general Monte Carlo procedure for solving different problems, which can be written as:

1 Find a suitable distribution associated with the problem.
2 Take a sample from this distribution.
3 Substitute the values from the sample in a proper formula, which estimates the solution.

5.1.3 Solving a System of Linear Equations by Simulating a Markov Chain with an Absorbing State
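Another possibility of estimating $\langle h, x \rangle$ is by simulating a Markov chain with an absorbing state, as was suggested by Forsythe and Leibler. Let the transition matrix be

$$P = \begin{pmatrix} P_{11} & \cdots & P_{1n} & P_{1,n+1} \\ \vdots & & \vdots & \vdots \\ P_{n1} & \cdots & P_{nn} & P_{n,n+1} \\ 0 & \cdots & 0 & 1 \end{pmatrix} \tag{5.1.50}$$

with

$$P_{ij} \ge 0, \quad i, j = 1, \ldots, n, \qquad P_{i,n+1} = g_i = 1 - \sum_{j=1}^{n} P_{ij} \ge 0,$$

$$\sum_{i=1}^{n} p_i = 1, \quad p_i \ge 0, \quad i = 1, 2, \ldots, n,$$

which is essentially an augmented (5.1.10) matrix. Here $p_i$ and $P_{ij}$ are, respectively, the initial and the transition probabilities. Assume also:

1 $p_i > 0$, if $h_i \ne 0$,
2 $P_{ij} > 0$, if $a_{ij} \ne 0$, $i, j = 1, 2, \ldots, n$. (5.1.51)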
The state $n + 1$ is called an absorbing state of the Markov chain (5.1.50)-(5.1.51). It is well known (Cinlar [2]) that, if there exists a state $i$, $i = 1, \ldots, n$, such that $P_{i,n+1} > 0$, then all the random paths $i_0 \to i_1 \to \cdots \to i_{(k)}$ terminate in state $n + 1$ a.s. and the expected time of termination of each random path is finite, that is, $E(k) < \infty$.

We start to simulate our Markov chain (5.1.50)-(5.1.51) by choosing the initial state $i_0$ according to the probability $p_{i_0}$, $i_0 = 1, 2, \ldots, n$, where $\sum_{i_0} p_{i_0} = 1$. Consider now a particle that is in state $i_0$. The particle either will be absorbed from state $i_0$ with probability $g_{i_0}$ or will pass to another state $i_1$ with probability $P_{i_0 i_1}$. Generally, if at time $m - 1$ the particle arrived at the state $i_{m-1}$, then it will either be absorbed from there with probability $g_{i_{m-1}}$ or will continue along the random path to the next state $i_m$ with probability $P_{i_{m-1} i_m}$. The random path $i_0 \to i_1 \to \cdots \to i_{(k)}$ has probability

$$p_{i_0} P_{i_0 i_1} \cdots P_{i_{k-1} i_k}\, g_{i_k},$$

where

$$g_{i_k} = 1 - \sum_{j=1}^{n} P_{i_k j}$$

is the probability of absorption from state $i_k$. Consider any r.v. $\eta_{(k)}$ which is defined on the path $i_0 \to i_1 \to \cdots \to i_{(k)}$. The expectation of $\eta_{(k)}$ is

$$E[\eta_{(k)}] = \sum_{k=0}^{\infty} \sum_{i_0=1}^{n} \cdots \sum_{i_k=1}^{n} \eta_k\, p_{i_0} P_{i_0 i_1} \cdots P_{i_{k-1} i_k}\, g_{i_k},$$

where $\eta_k$ is defined on the path that terminates exactly after $k$ units of time. Let

$$\eta_{(k)}(h) = \frac{h_{i_0}}{p_{i_0}}\, W_k\, \frac{f_{i_k}}{g_{i_k}}, \tag{5.1.52}$$

where $W_m$ is the same as in (5.1.12).

Proposition 5.1.4

$$E[\eta_{(k)}(h)] = \langle h, x \rangle, \tag{5.1.53}$$

that is, $\eta_{(k)}(h)$ is an unbiased estimator of the inner product $\langle h, x \rangle$, provided $E(k) < \infty$.
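Proof. Substituting (5.1.52) in (5.1.54) and taking (5.1.12) into account, we obtain

$$E[\eta_{(k)}(h)] = \sum_{k=0}^{\infty} \sum_{i_0=1}^{n} \cdots \sum_{i_k=1}^{n} h_{i_0}\, a_{i_0 i_1} \cdots a_{i_{k-1} i_k}\, f_{i_k}. \tag{5.1.55}$$

Now comparing (5.1.55) with (5.1.19), we immediately obtain (5.1.53). Q.E.D.

The procedure for estimating $\langle h, x \rangle$ is:

1 Simulate $N$ independent random paths $i_0^{(s)} \to i_1^{(s)} \to \cdots \to i_{(k)}^{(s)}$, $s = 1, \ldots, N$, from the Markov chain (5.1.50)-(5.1.51).
2 Determine

$$\eta_{(k)}^{(s)}(h) = \frac{h_{i_0}}{p_{i_0}}\, W_k^{(s)}\, \frac{f_{i_k}}{g_{i_k}},$$

where $W_k^{(s)}$ is the same as in (5.1.12).
3 Estimate $\langle h, x \rangle$ by

$$\frac{1}{N} \sum_{s=1}^{N} \eta_{(k)}^{(s)}(h).$$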
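In the particular case where $a_{ij} \ge 0$ and $\sum_{j=1}^{n} a_{ij} < 1$, the matrix $P$ in (5.1.50) can be chosen as

$$P = \begin{pmatrix} a_{11} & \cdots & a_{1n} & P_{1,n+1} \\ \vdots & & \vdots & \vdots \\ a_{n1} & \cdots & a_{nn} & P_{n,n+1} \\ 0 & \cdots & 0 & 1 \end{pmatrix},$$

that is, $P_{ij} = a_{ij}$, $i, j = 1, \ldots, n$. In this case $W_m \equiv 1$ and

$$\eta_{(k)}(h) = \frac{h_{i_0}}{p_{i_0}}\, \frac{f_{i_k}}{g_{i_k}}.$$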
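The particular case $P_{ij} = a_{ij}$ lends itself to a short Python sketch. The matrix `a` and vector `f` below are assumptions: they are chosen so that $x = Ax + f$ has the solution $(7.5, 8.75)$ and so that the resulting chain has absorption probabilities $g_i = 0.3$. The function name `absorbing_chain_estimate` is hypothetical. Each path starts in state $j$ (that is, $h = e_j$ and $p_j = 1$), so the score of a path reduces to $f_{i_k}/g_{i_k}$.

```python
import random


def absorbing_chain_estimate(a, f, j, n_paths, rng):
    """Estimate x_j of x = A x + f using P_ij = a_ij and g_i = 1 - sum_j a_ij.

    Assumes a_ij >= 0 and every row sum of A is strictly less than 1.
    With W_m = 1 each path scores f[i_k] / g[i_k], where i_k is the last
    state visited before absorption.
    """
    n = len(a)
    g = [1.0 - sum(row) for row in a]
    total = 0.0
    for _ in range(n_paths):
        state = j
        while True:
            u = rng.random()
            cumulative = 0.0
            next_state = None
            for target in range(n):
                cumulative += a[state][target]
                if u < cumulative:
                    next_state = target
                    break
            if next_state is None:  # the particle is absorbed from `state`
                total += f[state] / g[state]
                break
            state = next_state
    return total / n_paths


# Illustrative data (assumptions): x = A x + f has solution (7.5, 8.75).
a = [[0.5, 0.2], [0.3, 0.4]]
f = [2.0, 3.0]
rng = random.Random(2024)
x_hat = [absorbing_chain_estimate(a, f, j, 20_000, rng) for j in range(2)]
print(x_hat)
```

Starting every path in the state of interest estimates one coordinate at a time; a general initial distribution $p$ with the factor $h_{i_0}/p_{i_0}$ would estimate $\langle h, x \rangle$ instead.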
There are, however, few applications of these techniques. The reason is that the Monte Carlo method is not competitive with classical numerical analysis in solving systems of linear equations. Still, there are some situations where the Monte Carlo method can be successfully used:

1 The size of the matrix $A = \|a_{ij}\|_1^n$ is sufficiently large ($n > 10^3$), and only a very rough approximation is required.
2 It is necessary to find $\langle h, x \rangle$ for different $h$ and $f$, where $x = Ax + f$. As mentioned above, such problems can be solved (estimated) simultaneously by simulating only one Markov chain.
5.2 INTEGRAL EQUATIONS
One of the most fruitful applications of Monte Carlo methods is in the solution of integral equations. The reason is that such equations cannot be solved efficiently by classical numerical analysis. The idea of solving integral equations by a Monte Carlo method is similar to that of solving simultaneous linear equations. Both methods use Markov chains for simulation. There exists ample literature on solving integral equations by Monte Carlo methods (see [3, 7-9]). Its history is connected with the problem of neutron transport, which is described in Spanier and Gelbard's monograph [9]. One of the earliest methods for solving integral equations by a Monte Carlo method was proposed by Albert [1] and was later developed in Refs. 3, 7, and 8. Before proceeding with this topic we need some background on integral transforms.

5.2.1 Integral Transforms
Throughout this section we follow Sobol [8]. Let $K$ be an integral operator such that

$$K\psi(x) = \int_D K(x, x_1)\,\psi(x_1)\,dx_1, \quad x, x_1 \in D, \tag{5.2.1}$$

which maps the function $\psi(x)$ into $K\psi(x)$. $K\psi(x)$ is usually called the first iteration of $\psi$ with respect to the kernel $K$. The second iteration is

$$K^2\psi(x) = \int_D K(x, x_1)\, K\psi(x_1)\,dx_1 = \int_D \int_D K(x, x_1)\, K(x_1, x_2)\,\psi(x_2)\,dx_1\,dx_2. \tag{5.2.2}$$

Proceeding recursively we obtain

$$K^k\psi(x) = \int_D K(x, x_1)\, K^{k-1}\psi(x_1)\,dx_1, \tag{5.2.3}$$

the $k$th iteration of $\psi$ with respect to the kernel $K$. We can estimate such integrals by quadrature methods or by Monte Carlo methods, as described in Chapter 4. However, there exists another Monte Carlo method of estimating such integrals, a method that is similar to the method of solving systems of simultaneous linear equations and that is based on simulating a Markov chain. Before describing the method let us introduce some notation and make some assumptions.
For any two functions $h(x)$ and $\psi(x)$ their inner product is denoted by $\langle h, \psi \rangle$, where

$$\langle h, \psi \rangle = \int h(x)\,\psi(x)\,dx. \tag{5.2.4}$$

We assume that

$$\psi(x) \in L_2(D), \tag{5.2.5}$$

$$h(x) \in L_2(D), \tag{5.2.6}$$

$$K(x, x_1) \in L_2(D \times D), \tag{5.2.7}$$

that is,

$$\int \psi^2\,dx < \infty, \tag{5.2.8}$$

$$\int h^2\,dx < \infty, \tag{5.2.9}$$

and

$$\int\!\!\int K^2\,dx\,dx_1 < \infty, \tag{5.2.10}$$

respectively. It is easy to prove, using the Cauchy-Schwarz inequality, that, if conditions (5.2.8) and (5.2.9) are met, then $|\langle h, \psi \rangle| < \infty$. Indeed,

$$|\langle h, \psi \rangle| \le \left( \int h^2\,dx \right)^{1/2} \left( \int \psi^2\,dx \right)^{1/2} < \infty.$$

In Exercise 2 the reader is asked to prove $K\psi(x) \in L_2(D)$, given (5.2.5) and (5.2.7). With these results we can return to our problem of evaluating $K^k\psi$. As we mentioned before, the method of evaluating $K^k\psi$ is similar to those for solving the system of linear equations described in Section 5.1.1. From now on we consider the problem of finding the inner product $\langle h, K^k\psi \rangle$, which is similar to the problem of finding $\langle h, \sum_{m=0}^{k} A^m f \rangle$. The reader is asked to keep this similarity in mind. By analogy with (5.1.10) and (5.1.11) let us introduce any continuous Markov chain
satisfying $\int P(x, y)\,dy = 1$, $\int p(x)\,dx = 1$, such that

1 $p(x) > 0$, if $h(x) \ne 0$,
2 $P(x, y) > 0$, if $K(x, y) \ne 0$, (5.2.13)

where $p(x)$ and $P(x, y)$ are, respectively, the initial and the transition densities of the Markov chain (5.2.12)-(5.2.13). By analogy with Proposition 5.1.1 we can readily prove the following
(5.2.14)
(5.2.15)
(5.2.16)
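Assuming for some given $y$ that $h(x) = p(x) = \delta(x - y)$, where $\delta(\cdot)$ is Dirac's delta function, we immediately obtain $E[\eta_k(h)] = K^k\psi(y)$. The procedure for estimating the inner product $\langle h, K^k\psi \rangle$ is:

1 Choose any Markov chain (5.2.12)-(5.2.13), that is, densities with $p(x) > 0$ if $h(x) \ne 0$ and $P(x, y) > 0$ if $K(x, y) \ne 0$.
2 Simulate $N$ independent random paths $x_0^{(s)} \to x_1^{(s)} \to \cdots \to x_k^{(s)}$, $s = 1, 2, \ldots, N$, from the Markov chain (5.2.12)-(5.2.13).
3 Find

$$\eta_k^{(s)}(h) = \frac{h(x_0^{(s)})}{p(x_0^{(s)})}\, W_k^{(s)}\, \psi(x_k^{(s)}), \quad s = 1, \ldots, N, \tag{5.2.17}$$

where

$$W_k^{(s)} = \prod_{m=1}^{k} \frac{K(x_{m-1}^{(s)}, x_m^{(s)})}{P(x_{m-1}^{(s)}, x_m^{(s)})}. \tag{5.2.18}$$

4 Calculate

$$\theta_k = \frac{1}{N} \sum_{s=1}^{N} \eta_k^{(s)}(h) \approx \langle h, K^k\psi \rangle, \tag{5.2.19}$$

which is an unbiased estimator of the inner product $\langle h, K^k\psi \rangle$.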
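The four steps above can be sketched in Python for a toy problem. Everything specific here is an assumption made for illustration: the domain $D = [0, 1]$, the kernel $K(x, y) = xy$, the functions $h = \psi = 1$, the uniform initial and transition densities, and the function name `estimate_inner_product`. For this choice the exact value for $k = 2$ is $\int\!\int\!\int x_0 x_1^2 x_2 \,dx_0\,dx_1\,dx_2 = 1/12$.

```python
import random


def estimate_inner_product(h, psi, kernel, k, n_paths, rng):
    """Estimate <h, K^k psi> on D = [0, 1] with uniform densities p = P = 1.

    Because p(x) = 1 and P(x, y) = 1 on [0, 1], the weight W_k is simply
    the product of kernel values along the path.
    """
    total = 0.0
    for _path in range(n_paths):
        x = rng.random()
        score = h(x)  # h(x_0) / p(x_0) with p = 1
        for _step in range(k):
            y = rng.random()
            score *= kernel(x, y)  # K(x_{m-1}, x_m) / P(x_{m-1}, x_m) with P = 1
            x = y
        total += score * psi(x)
    return total / n_paths


rng = random.Random(99)
theta_2 = estimate_inner_product(
    h=lambda x: 1.0,
    psi=lambda x: 1.0,
    kernel=lambda x, y: x * y,
    k=2,
    n_paths=40_000,
    rng=rng,
)
print(theta_2, 1.0 / 12.0)
```

Uniform densities trivially satisfy the positivity conditions in (5.2.13); other densities change only the variance, not the expected value.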
5.2.2 Integral Equations of the Second Kind

Consider the following integral equation of the second kind:

$$z(x) = \int_D K(x, x_1)\, z(x_1)\,dx_1 + f(x), \tag{5.2.20}$$

which can be written as

$$z = Kz + f. \tag{5.2.21}$$
Let us assume that f ( x ) E L2(D ) , K ( x , x , ) E L2(DX D ) , and I X . I = s u P ~ I K ( x , Y ) l d L - ' / 2 = t ( x ) ,
m-cm
(5.2.43)
where $z(x)$ is the eigenfunction corresponding to $\lambda$. We can estimate $\langle h, K^m f \rangle$ and $K^m f$ simultaneously by a Monte Carlo method as described in Section 5.2.1. For further discussion of eigenvalue problems we refer the reader to Hammersley and Handscomb [5] and Sobol [8].

Until now we have not made any special assumption about our Markov chains. We have required only that the estimators $\eta_k(h)$ and $\eta_{(k)}(h)$ be unbiased. It is clear that the variance of both $\eta_k(h)$ and $\eta_{(k)}(h)$ depends on the transition probabilities $P_{ij}$. Since in solving linear and integral equations we have, respectively, sums and integrals to deal with, it should be possible to use some of the variance reduction techniques of Chapter 4 for better efficiency. In this context the reader is referred to Michailov [7] and Ermakov [3].
5.3 THE DIRICHLET PROBLEM
One of the earliest and most popular illustrations of the Monte Carlo method is the solution of Dirichlet's problem [4]. Dirichlet's problem is to find a continuous and differentiable function $u$ over a given domain $D$ with boundary $D_0$, satisfying

$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = F(x, y), \quad (x, y) \in D, \tag{5.3.1}$$

and

$$u(x, y) = g(x, y), \quad \text{for } (x, y) \in D_0, \tag{5.3.2}$$

where $g = g(x, y)$ is some prescribed function. Equation (5.3.1) with $F(x, y) = 0$ is called the Laplace equation; with $F(x, y) \ne 0$ it is known as the Poisson equation. Generally, there is no analytical solution known to this problem and we have to apply a numerical method. We usually start by covering $D$ with a grid and replacing the differential equation by its finite-difference approximation. Let us denote the closure of $D$ by $\bar{D}$, that is, $D \cup D_0 = \bar{D}$, and
the coordinates of the grid by $x_\alpha = \alpha h$ and $y_\beta = \beta h$, where $h$ is the step size. Taking the two-dimensional case for convenience, we call the point $(x_\alpha, y_\beta) \in \bar{D}$ an interior point of $\bar{D}$ if the four neighbor points of $(x_\alpha, y_\beta)$, namely, $(x_\alpha - h, y_\beta)$, $(x_\alpha + h, y_\beta)$, $(x_\alpha, y_\beta - h)$, and $(x_\alpha, y_\beta + h)$, also belong to $\bar{D}$. We call $(x_\alpha, y_\beta) \in \bar{D}$ a boundary point if there are not four neighbor points that belong to $\bar{D}$. Taking this definition into account, we have for any interior point

$$\frac{u_{\alpha+1,\beta} - 2u_{\alpha\beta} + u_{\alpha-1,\beta}}{h^2} + \frac{u_{\alpha,\beta+1} - 2u_{\alpha\beta} + u_{\alpha,\beta-1}}{h^2} = F_{\alpha\beta}, \quad (x_\alpha, y_\beta) \in D, \tag{5.3.3}$$

which is the finite-difference equation of (5.3.1). Here $u_{\alpha\beta} = u(x_\alpha, y_\beta)$, $F_{\alpha\beta} = F(x_\alpha, y_\beta)$, $u_{\alpha\pm1,\beta} = u(x_\alpha \pm h, y_\beta)$, and $u_{\alpha,\beta\pm1} = u(x_\alpha, y_\beta \pm h)$. The last equation can be rewritten as

$$u_{\alpha\beta} = \tfrac{1}{4}\left(u_{\alpha-1,\beta} + u_{\alpha+1,\beta} + u_{\alpha,\beta-1} + u_{\alpha,\beta+1} - h^2 F_{\alpha\beta}\right). \tag{5.3.4}$$

The boundary condition (5.3.2) is then

$$u_{\alpha\beta} = g_{\alpha\beta}, \quad (x_\alpha, y_\beta) \in D_0. \tag{5.3.5}$$

It is not difficult to see that by numbering all the points $(x_\alpha, y_\beta) \in \bar{D}$ in any order we can rewrite (5.3.4) and (5.3.5) as

$$u_i = \sum_{j=1}^{n} a_{ij} u_j + f_i, \quad i = 1, 2, \ldots, n. \tag{5.3.6}$$

Here $n$ is the number of mesh points $(x_\alpha, y_\beta) \in \bar{D}$, which is also equal to the order of the matrix $\|a_{ij}\|_1^n$. The matrix $\|a_{ij}\|$ has a specific structure: all diagonal elements are equal to zero; each row corresponding to an interior point of $D$ has four elements equal to $\tfrac{1}{4}$, all other elements being zero; each row corresponding to a boundary point of $D_0$ also contains elements equal to $\tfrac{1}{4}$ or zero, but the number of $\tfrac{1}{4}$ elements equals the number of neighboring points, which is always less than 4. Thus the Dirichlet problem is approximated by a system of linear equations (5.3.6), which can be solved by the Monte Carlo methods described in Section 5.1.3.
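A minimal Python sketch of the random-walk idea behind this approximation follows. It is a simplified variant, not the formulation above: it assumes $F = 0$ (Laplace's equation), a square grid, and walks that are absorbed as soon as they reach the boundary, where they score the boundary value $g$. The grid size, the boundary function, and the name `dirichlet_walk` are illustrative assumptions. The boundary function $g(x, y) = x$ is used because its harmonic extension $u(x, y) = x$ also satisfies the difference equation (5.3.4) exactly, so the estimate can be checked.

```python
import random


def dirichlet_walk(start, size, boundary_value, n_walks, rng):
    """Estimate u at an interior node of the grid {0, ..., size}^2 when F = 0.

    Each walk moves to one of the four neighbours with probability 1/4 and
    stops at the first boundary node, scoring g at that node.
    """
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    total = 0.0
    for _ in range(n_walks):
        a, b = start
        while 0 < a < size and 0 < b < size:
            da, db = rng.choice(steps)
            a, b = a + da, b + db
        total += boundary_value(a, b)
    return total / n_walks


rng = random.Random(7)
size = 10  # step size h = 1 / size on the unit square
u_hat = dirichlet_walk((3, 5), size, lambda a, b: a / size, 5_000, rng)
print(u_hat)  # should be close to x = 0.3 at node (3, 5)
```

Each walk is a path of an absorbing Markov chain with transition probabilities $\tfrac{1}{4}$ between neighboring interior nodes, so the weights $W_m$ are identically 1.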
EXERCISES

1 Describe an algorithm for simulating an ergodic Markov chain.
2 Prove that $K\psi(x) \in L_2(D)$, given (5.2.5) and (5.2.7).
Prove that E [ q i ( + k I] K ' # > b k 5 Prove (52.30). given Zz-olK m j l < 03. ie
a
Gin.
< 00.
6 Prove (5.2.35), given x:,,lKmjl
7 Consider the recursive formula (5.2.23)

$$z^{k+1} = K z^k + f.$$

Assume $z^0 = \psi(x)$, where $\psi(x)$ is any function. Then

$$z^{k+1} = \sum_{m=0}^{k} K^m f + K^{k+1}\psi.$$

Define

$$\eta_k(h) = \frac{h(x_0)}{p(x_0)} \left[ \sum_{m=0}^{k} W_m f(x_m) + W_{k+1}\,\psi(x_{k+1}) \right]$$

and prove that

$$E[\eta_k(h)] = \langle h, z^{k+1} \rangle.$$

8 Prove (5.1.43), that is, prove that asymptotically the method of the covering path is $n$ times more efficient than the standard Monte Carlo method.
9 Consider the system of linear equations $x = Ax + f$ whose exact solution is $x = (x_1, x_2) = (7.5, 8.75)$. Simulate the following Markov chain with an absorbing state:

$$P = \begin{pmatrix} 0.5 & 0.2 & 0.3 \\ 0.3 & 0.4 & 0.3 \\ 0 & 0 & 1 \end{pmatrix}$$

and estimate the exact solution $x = (x_1, x_2) = (7.5, 8.75)$ by making a run of 1000 replications of the Markov chain.
REFERENCES

1 Albert, G. E., A general theory of stochastic estimates of the Neumann series for solution of certain Fredholm integral equations and related series, in Symposium on Monte Carlo Methods, edited by M. A. Meyer, Wiley, New York, 1956, pp. 37-46.
2 Cinlar, E., Introduction to Stochastic Processes, Prentice-Hall, Englewood Cliffs, New Jersey, 1975.
3 Ermakov, J. M., Monte Carlo Method and Related Questions, Nauka, Moscow, 1976 (in Russian).
4 Forsythe, S. E. and R. A. Leibler, Matrix inversion by a Monte Carlo method, Math. Tables Other Aids Comput., 4, 1950, 127-129.
5 Hammersley, J. M. and D. C. Handscomb, Monte Carlo Methods, Methuen, London, 1964.
6 Halton, J. H., A retrospective and prospective survey of the Monte Carlo method, Soc. Indust. Appl. Math. Rev., 12, 1970, 1-63.
7 Michailov, G. A., Some Problems in the Theory of the Monte Carlo Method, Nauka, Novosibirsk, U.S.S.R., 1974 (in Russian).
8 Sobol, J. M., Computational Methods of Monte Carlo, Nauka, Moscow, 1973 (in Russian).
9 Spanier, J. and E. M. Gelbard, Monte Carlo Principles and Neutron Transport Problems, Addison-Wesley, Reading, Massachusetts, 1969.
CHAPTER 6

Regenerative Method for Simulation Analysis

6.1 INTRODUCTION
It has already been mentioned in Chapter 1 that many real-world problems are too complex to be solved by analytical methods and that the most practical approach to their study is through simulation. In this chapter we consider simulation of a stochastic system, that is, of a system with random elements. Simulation of such systems can be considered as a statistical experiment, in which we seek valid statistical inferences about some unknown parameters associated with the output of the system (or the associated model) being simulated. However, classical methods of statistics are often unsuitable for estimating these parameters. The reason, as we see later, is that the observations made on the simulated system are highly correlated and nonstationary in time; under these circumstances it is difficult (actually impossible) to carry out adequate statistical analyses of the simulated data. To overcome these difficulties a procedure based on regenerative phenomena, called the regenerative method, has recently been developed.

Historically, Cox and Smith [4] were the first to suggest use of regenerative phenomena for simulating a queueing system with Poisson arrivals. This idea was extended by Kabak [39] and Poliak [59]. Quite recently, Crane and Iglehart [6-9] developed a methodology for the regenerative method, based on a unified approach to analyze the output of simulations of those systems that have the property of self-regeneration, that is, of invariably returning (at particular times) to the conditions under which the future of the simulation becomes a probabilistic replica of its past. In other words, if the simulation output is viewed as a stochastic process, the regenerative property means that at those particular times the future behavior of the process is independent of its past behavior, and is governed by the same probability law, that is, at those times the stochastic process "starts afresh probabilistically." Crane and Iglehart showed that a wide variety of problems, such as communication networks, queues, maintenance and inventory control systems, can be cast into a common framework using regenerative phenomena; they then proposed a simple technique for obtaining point estimators and confidence intervals for parameters associated with the simulation output. The regenerative method also provides answers to the following important problems: how and when to start the simulation, how long to run it, when to begin collecting data, and how to deal with highly correlated data.

The theory and practice of the regenerative method are now in the process of rapid development. The list of references contains about 100 relevant papers known to the author. An excellent introduction to the regenerative method can be found in Crane and Lemoine's book [10]. Iglehart's forthcoming monograph [38] will present a rigorous development of both the theory and practice. Many other recently obtained results, in particular regarding simulation of response time in networks of queues, are to be found in Iglehart and Shedler's monograph [37].

This chapter is organized as follows. The basic ideas of the regenerative method are discussed in Section 6.2. Section 6.3 deals with statistical problems, in particular with the confidence interval for the expected values of some functions defined on the steady-state distribution of the process being simulated. In Section 6.4 the ideas of the regenerative method are illustrated for a single-server queue, a repairman system, and a closed queueing system. Choice of the best among a set of competing systems is the subject of Section 6.5. Section 6.6 deals with a linear programming problem in which the coefficients are unknown and presents the output parameters of regenerative processes. Variance reduction techniques in regenerative simulation are the subject of Section 6.7.
6.2 REGENERATIVE SIMULATION
We start this section with the definition of a regenerative process. Roughly speaking, a stochastic process $\{X(t): t \ge 0\}$ is called regenerative if there exist certain random times $0 \le T_0 < T_1 < T_2 < \cdots$ forming a renewal process* such that at each such time the future of the process becomes a probabilistic replica of the process itself. Informally, this means that at these times the future behavior of the process is independent of its past behavior and is invariably governed by the same law. In other words, the part of the process $\{X(t): T_{i-1} < t \le T_i\}$ defined between any pair of successive times is a statistically independent probabilistic replica of any other part of the same process defined between any other pair of successive times. The times $\{T_i: i \ge 0\}$ are called regeneration times and the time between $T_{i-1}$ and $T_i$ is referred to as the length of the $i$th cycle. Formally [5], a stochastic process $\{X(t): t \ge 0\}$ is regenerative if there exists a sequence $T_0, T_1, \ldots$ of stopping times† such that:

1 $T = \{T_i: i = 0, 1, \ldots\}$ is a renewal process.
2 For any $l, m \in \{0, 1, \ldots\}$ and $t_1, \ldots, t_l > 0$, the random vectors $\{X(t_1), \ldots, X(t_l)\}$ and $\{X(T_m + t_1), \ldots, X(T_m + t_l)\}$ are identically distributed and the processes $\{X(t): t < T_m\}$ and $\{X(T_m + t): t \ge 0\}$ are independent.

*A sequence of random variables $\{T_n: n \ge 0\}$ is a renewal process provided that $T_0 = 0$ and $T_n - T_{n-1}$ $(n \ge 1)$ are i.i.d. r.v.'s.
For example, let $\{X_n: n \ge 0\}$ be an irreducible, aperiodic, and positive recurrent Markov chain with a countable state space $I = \{0, 1, \ldots\}$, and let $j$ be a fixed state; then every time at which state $j$ is entered is a time of regeneration. Let us select a fixed state of the Markov chain (M.C.), say state 0. We then obtain a sequence of stopping times $\{T_i: i \ge 0\}$ such that $0 = T_0 < T_1 < T_2 < \cdots$ and $X_{T_i} = 0$ almost surely (a.s.); that is, once the system enters state 0, the simulation can proceed without any knowledge of its past history. For another example, let us consider the queue size at time $t$ for a GI/G/1 queueing system. Suppose the time origin is taken to be an instant of departure at which time the departing customer leaves behind exactly $j$ customers. Then every time a departure occurs leaving behind $j$ customers, the future of the stochastic process after such a time obeys exactly the same probability law as when the process started at time zero. More examples of regenerative processes are considered in Section 6.4. It is shown in Ref. 8 that under certain mild regularity conditions the process $\{X(t): t \ge 0\}$ has a limiting steady-state distribution in the sense that there exists a random vector $X$ such that

$$\lim_{t \to \infty} P\{X(t) \le x\} = P\{X \le x\}.$$

†A random variable $T$ taking values in $[0, +\infty)$ is a stopping time [5] for a stochastic process $\{X(t): t \ge 0\}$, provided that for every finite $t \ge 0$, the occurrence or nonoccurrence of the event $\{T \le t\}$ can be determined from the history $\{X(s): s \le t\}$ of the process up to time $t$.
This type of convergence is known as weak convergence and is denoted $X(t) \Rightarrow X$ as $t \to \infty$. The random vector $X$ is called the steady-state vector. Let $f: R^k \to R$ be a given real-valued measurable function, and suppose we wish to estimate the value $r = E\{f(X)\}$, where $X$ is the steady-state vector. For the M.C. $\{X_n: n \ge 0\}$ we have

$$r = E\{f(X)\} = \sum_{i \in I} f(i)\,P(X = i) = \sum_{i \in I} f(i)\,\pi_i. \tag{6.2.1}$$

Here, $\pi = \{\pi_i = P(X = i): i \in I\}$ is the steady-state (stationary) distribution of the regenerative process $\{X_n: n \ge 0\}$, and $f(i)$ can be interpreted as the penalty (reward) paid in state $i$. To find $r$ we can solve the following linear system of stationary equations, $\pi = \pi P$, where $P = \{P_{ij}: i, j \in I\}$ is the transition matrix, and then apply (6.2.1). Let us assume that the values $f(i)$ are known but the transition matrix is unknown. It is clear that the value $r$ cannot be found analytically, since $\pi$ is determined by $P$, and simulation must be used. Another case is when $P$ is known but the state space is very large; in this case it may be quite difficult to solve the system $\pi = \pi P$, and we must resort to simulation again. Possible functions $f$ of interest are the following:
1 If $f(i) = 1$ for $i = j$ and $f(i) = 0$ otherwise, then $E\{f(X)\} = \pi_j$.
2 If $f(i) = 1$ for $i \ge j$ and $f(i) = 0$ otherwise, then $E\{f(X)\} = P\{X \ge j\}$.
3 If $f(i) = i^p$, $p > 0$, then $E\{f(X)\} = E\{X^p\}$.
4 If $f(i) = c_i$ = cost of being in state $i$, then $E\{f(X)\} = \sum_{i \in I} c_i\,P(X = i)$ (the stationary expected cost per unit time).
Let $\tau_i$ denote the interval between the $i$th and the $(i+1)$th regeneration times, that is, $\tau_i = T_{i+1} - T_i$, $i \ge 0$; $\tau_i$ is referred to as the length of the $i$th cycle. Next, assume $E(\tau_i) < \infty$, and define

$$Y_i = \int_{T_i}^{T_{i+1}} f(X(t))\,dt \tag{6.2.2}$$

or

$$Y_i = \sum_{j=T_i}^{T_{i+1}-1} f(X_j), \tag{6.2.3}$$

depending on whether the process $\{X(t): t \ge 0\}$ is continuous-time or discrete-time. In other words, $Y_i$ is the penalty (reward) during the cycle of length $\tau_i = T_{i+1} - T_i$. Naturally, $Y_i$ is a random variable (r.v.) because $\tau_i$ and $f(X_j)$ are. We now formulate two fundamental propositions that are used extensively in the rest of this chapter.

Proposition 6.2.1. The sequence $\{(Y_i, \tau_i): i \ge 1\}$ consists of independent and identically distributed (i.i.d.) random vectors.

Proposition 6.2.2. If $\tau_1$ is aperiodic,* $E(\tau_1) < \infty$, and $E\{|f(X)|\} < \infty$, then

$$r = E\{f(X)\} = \frac{E(Y_1)}{E(\tau_1)}. \tag{6.2.4}$$

There is an analogous ratio formula when $\tau_1$ is periodic. For proof of these propositions the reader is referred to [5]. Proposition 6.2.1 says that the behavior patterns of the system during different cycles are statistically independent and identically distributed. Proposition 6.2.2 enables us to estimate the value $r = E(Y_1)/E(\tau_1)$ (which is the same as $r = E(Y_i)/E(\tau_i)$) by classical statistical methods, and to find point estimators and confidence intervals for $r$. These two problems are the subject of the next section.

6.3 POINT ESTIMATORS AND CONFIDENCE INTERVALS [12, 28]
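In this section we consider several point estimators and confidence intervals for the ratio $E(Y_1)/E(\tau_1)$. The problem we consider is as follows: given the i.i.d. sequence of random vectors $\{(Y_i, \tau_i): i \ge 1\}$, find point estimators and construct $100(1 - \delta)\%$ confidence intervals for the ratio $E(Y_1)/E(\tau_1)$.

*The random variable $\tau_1$ is periodic with period $\lambda > 0$ if, with probability 1, it assumes values in the set $\{0, \lambda, 2\lambda, \ldots\}$ and $\lambda$ is the largest such number. If there is no such $\lambda$, then $\tau_1$ is said to be aperiodic.

Let $Z_i = Y_i - r\tau_i$. It is readily seen that the $Z_i$'s are i.i.d. r.v.'s, since the vectors $(Y_i, \tau_i)$ also are. Note also that

$$E(Z_i) = 0 \tag{6.3.1}$$

and

$$\sigma^2 = \operatorname{var}(Z_i) = \operatorname{var}(Y_i) - 2r \operatorname{cov}(Y_i, \tau_i) + r^2 \operatorname{var}(\tau_i). \tag{6.3.2}$$

Denote $\bar{Y} = (1/n)\sum_{i=1}^{n} Y_i$ and $\bar{\tau} = (1/n)\sum_{i=1}^{n} \tau_i$; then by virtue of the central limit theorem (c.l.t.) we have

$$\frac{n^{1/2}(\bar{Y} - r\bar{\tau})}{\sigma} \Rightarrow N(0, 1), \quad \text{as } n \to \infty, \tag{6.3.3}$$

where $\Rightarrow$ denotes weak convergence and it is assumed that $\sigma^2 < \infty$. The last formula can be rewritten as

$$\frac{n^{1/2}(\hat{r} - r)}{\sigma/\bar{\tau}} \Rightarrow N(0, 1), \quad \text{as } n \to \infty, \tag{6.3.4}$$

where $\hat{r} = \bar{Y}/\bar{\tau}$. Inasmuch as $\sigma$ is unknown, we cannot obtain a confidence interval for $r$ directly from (6.3.4). However, we can estimate $\sigma^2$ in (6.3.2) from the sample, that is, by

$$s^2 = s_{11} - 2\hat{r} s_{12} + \hat{r}^2 s_{22}, \tag{6.3.5}$$

where

$$s_{11} = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2, \qquad s_{22} = \frac{1}{n-1} \sum_{i=1}^{n} (\tau_i - \bar{\tau})^2,$$

and

$$s_{12} = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})(\tau_i - \bar{\tau}).$$

It is straightforward to see that $s^2 \to \sigma^2$ a.s. as $n \to \infty$, so (6.3.4) can be rewritten as

$$\frac{n^{1/2}(\hat{r} - r)}{s/\bar{\tau}} \Rightarrow N(0, 1), \quad \text{as } n \to \infty, \tag{6.3.6}$$

and the $100(1 - \delta)\%$ confidence interval for $r = E(Y_1)/E(\tau_1)$ is

$$\left[ \hat{r} - \frac{z_\delta s}{\bar{\tau} n^{1/2}},\ \hat{r} + \frac{z_\delta s}{\bar{\tau} n^{1/2}} \right], \tag{6.3.7}$$

where $z_\delta = \Phi^{-1}(1 - \delta/2)$, $\Phi$ is the standard normal distribution function, and $\hat{r} = \bar{Y}/\bar{\tau}$ is the point estimator of $E(Y_1)/E(\tau_1)$. The procedure for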
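obtaining a $100(1 - \delta)\%$ confidence interval for $r$ can be written as follows:

1 Simulate $n$ cycles of the regenerative process.
2 Compute the sequence $\tau_1, \ldots, \tau_n$ and the associated sequence $Y_1, \ldots, Y_n$ (use (6.2.2) and (6.2.3), respectively, for a continuous-time or discrete-time process).
3 Compute $\bar{Y} = (1/n)\sum_{i=1}^{n} Y_i$ and $\bar{\tau} = (1/n)\sum_{i=1}^{n} \tau_i$ and find the point estimator by

$$\hat{r} = \frac{\bar{Y}}{\bar{\tau}}. \tag{6.3.8}$$

4 Construct the confidence interval by

$$\left[ \hat{r} - \frac{z_\delta s}{\bar{\tau} n^{1/2}},\ \hat{r} + \frac{z_\delta s}{\bar{\tau} n^{1/2}} \right],$$

where $z_\delta = \Phi^{-1}(1 - \delta/2)$ and $\Phi$ is the standard normal distribution.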
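Steps 3 and 4 of this procedure can be sketched in Python as follows. The function name `regenerative_interval` is hypothetical, and the sample variances and covariance are assumed to use the divisor $n - 1$. The input data are the five cycles $(Y_i, \tau_i)$ that appear in the queueing example of Section 6.4.1.

```python
import math
from statistics import NormalDist


def regenerative_interval(ys, taus, delta):
    """Classical ratio estimator and its 100(1 - delta)% confidence interval."""
    n = len(ys)
    y_bar = sum(ys) / n
    tau_bar = sum(taus) / n
    r_hat = y_bar / tau_bar
    # Sample variances and covariance (divisor n - 1 is an assumption).
    s11 = sum((y - y_bar) ** 2 for y in ys) / (n - 1)
    s22 = sum((t - tau_bar) ** 2 for t in taus) / (n - 1)
    s12 = sum((y - y_bar) * (t - tau_bar) for y, t in zip(ys, taus)) / (n - 1)
    s2 = s11 - 2.0 * r_hat * s12 + r_hat ** 2 * s22
    z = NormalDist().inv_cdf(1.0 - delta / 2.0)
    half_width = z * math.sqrt(max(s2, 0.0)) / (tau_bar * math.sqrt(n))
    return r_hat, s2, (r_hat - half_width, r_hat + half_width)


ys = [10, 0, 30, 50, 60]
taus = [2, 1, 3, 4, 5]
r_hat, s2, interval = regenerative_interval(ys, taus, delta=0.10)
print(r_hat, s2, interval)
```

With only five cycles the normal approximation is crude; the example is meant to show the arithmetic, not to recommend such a short run.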
It is readily seen that $\hat{r} = \bar{Y}/\bar{\tau}$, referred to as the classical estimator [28], is a biased but consistent estimator of $E(Y_1)/E(\tau_1)$. Iglehart [28] suggested, for the same purpose, the following alternatives:

BEALE ESTIMATOR
(6.3.9)
FIELLER ESTIMATOR
(6.3.10)
where
JACKKNIFE ESnMATOR
(6.3.11)
where
190
REGENERATIVE METHOD FOR SIMULATION ANALYSIS
TIN ESTlMATOR
(6.3.12)
Let us now cite some results from Ref. 28. The four point estimators (6.3.9) through (6.3.12) as well as the classical estimator are biased. Their expected value can be expressed as

$$E[\hat{r}(n)] = r + \frac{c_1}{n} + \frac{c_2}{n^2} + o\!\left(\frac{1}{n^2}\right). \tag{6.3.13}$$

The point estimators (6.3.9), (6.3.11), and (6.3.12) have been suggested in order to reduce the bias in (6.3.13) to order $1/n^2$. For the jackknife method $c_1 = 0$. The reader is asked to prove that for both the Beale and Tin estimators $c_1/n$ is also equal to zero. Since both $n^{1/2}(\hat{r} - \hat{r}_b) \to 0$ and $n^{1/2}(\hat{r} - \hat{r}_t) \to 0$ a.s. as $n \to \infty$, we can replace $\hat{r}$ by $\hat{r}_b$ or $\hat{r}_t$ both in (6.3.6) for the c.l.t. and in (6.3.7) for the confidence interval without changing the results. For the jackknife method formulas (6.3.6) and (6.3.7) can be written, respectively, as
9-r /2$
as n-+
* N ( O , 1)
00
(6.3.14)
i
where
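The Fieller method yields the following $100(1 - \delta)\%$ confidence interval:

$$I_f = \left[ \hat{r}_f - \frac{D^{1/2}}{\bar{\tau}^2 - k_\delta s_{22}},\ \hat{r}_f + \frac{D^{1/2}}{\bar{\tau}^2 - k_\delta s_{22}} \right], \tag{6.3.15}$$

where $\hat{r}_f$ is the Fieller estimator (6.3.10),

$$D = (\bar{Y}\bar{\tau} - k_\delta s_{12})^2 - (\bar{\tau}^2 - k_\delta s_{22})(\bar{Y}^2 - k_\delta s_{11}),$$

and

$$k_\delta = \frac{[\Phi^{-1}(1 - \delta/2)]^2}{n}.$$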
The performances of these estimators were compared numerically (via simulating several stochastic models), and the following results were obtained [28]. For short runs the jackknife method is recommended both for point estimators and confidence intervals because it produces slightly better statistical results than other methods. Two minor drawbacks of the jackknife method are a large memory requirement and slightly more complex programming. Additional storage addresses of the order of $2n$ are required, where $n$ is the number of cycles observed. Where the storage requirement for the jackknife method is excessive, the Beale or Tin methods are recommended for point estimates and the classical method for the confidence intervals. The Fieller method is recommended for neither point estimates nor confidence intervals. It is found to be heavily biased for short runs and more complicated than the classical method.

The above mentioned five point estimators were based on simulating $n$ cycles of regenerative processes. Another possibility is to consider point estimators based on the simulation for a fixed (but large) length of time $t$. In this case the number of cycles $N_t$ in the interval $(0, t]$ is a random variable given by

$$N_t = \sum_{i=1}^{\infty} 1_{[0,t]}(T_i),$$

where $1_{[0,t]}$ is the indicator function of the interval $[0, t]$. Replacing $n$ by $N_t$, we can modify all the point estimators (6.3.8) through (6.3.12), preserving their consistency. For example, for the classical estimator we have

$$\hat{r}(N_t) = \frac{\sum_{i=1}^{N_t} Y_i}{\sum_{i=1}^{N_t} \tau_i}.$$

Thus, asymptotically, there is little difference between point estimators based on simulating $n$ regenerative cycles and those based on simulating for a fixed length of time $t$. The c.l.t. in this case is

$$\frac{t^{1/2}\,[\hat{r}(N_t) - r]}{\sigma / [E(\tau_1)]^{1/2}} \Rightarrow N(0, 1), \quad \text{as } t \to \infty. \tag{6.3.16}$$
Recently Heidelberger and Meketon [32] considered estimators based on simulations for a relatively short length of time $t$. They defined the estimators

$$\hat{r}(N_t) = \frac{\sum_{i=1}^{N_t} Y_i}{\sum_{i=1}^{N_t} \tau_i} \tag{6.3.17}$$

and

$$\hat{r}(N_t + 1) = \frac{\sum_{i=1}^{N_t + 1} Y_i}{\sum_{i=1}^{N_t + 1} \tau_i}. \tag{6.3.18}$$

They then showed that

$$E\{\hat{r}(N_t)\} = r + O\!\left(\frac{1}{t}\right) \tag{6.3.19}$$

and

$$E\{\hat{r}(N_t + 1)\} = r + O\!\left(\frac{1}{t^2}\right), \tag{6.3.20}$$

so that a bias reduction is achieved by continuing the simulation until the first regeneration after time $t$. The bias reduction is comparable to that of the jackknife, Beale, and Tin estimators since $t$ is proportional to the number of cycles. Table 6.4.3 lists empirical results from simulations of a closed queueing network model for these estimators.

We turn now to the problem of determining run length. The $100(1 - \delta)\%$ confidence interval for a large but fixed number of cycles has a width approximately equal to

$$\frac{2\sigma\,\Phi^{-1}(1 - \delta/2)}{E(\tau_1)\, n^{1/2}}. \tag{6.3.21}$$

In terms of duration time $t$, (6.3.21) can be written as (see [24])

$$\frac{2\sigma\,\Phi^{-1}(1 - \delta/2)}{[E(\tau_1)]^{1/2}\, t^{1/2}}. \tag{6.3.22}$$

Note that neither $\sigma$ nor $E(\tau_1)$ is known in advance. Hence it may be worthwhile to take a small sample and obtain rough estimates for $\sigma$ and $E(\tau_1)$. Such estimates would form a basis for a final decision on run length and level of confidence. We wish to emphasize that all ratio estimators described in this section are designed for simulations with a fixed number of cycles $n$ or a fixed run length $t$. An alternative possibility would be to consider procedures based on sequential stopping rules.
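The run-length calculation implied by (6.3.21) can be sketched in Python. The function name `cycles_for_width` and the pilot values passed to it are assumptions for illustration; in practice $\sigma$ and $E(\tau_1)$ would be replaced by rough estimates from a small pilot sample, as suggested above.

```python
import math
from statistics import NormalDist


def cycles_for_width(sigma, mean_cycle_length, delta, width):
    """Smallest n for which 2 * sigma * z / (E(tau_1) * sqrt(n)) <= width."""
    z = NormalDist().inv_cdf(1.0 - delta / 2.0)
    return math.ceil((2.0 * sigma * z / (mean_cycle_length * width)) ** 2)


# Hypothetical pilot estimates: sigma = 10, E(tau_1) = 3, 90% confidence,
# and a target interval width of 1.
n_required = cycles_for_width(sigma=10.0, mean_cycle_length=3.0, delta=0.10, width=1.0)
print(n_required)
```

Because the width decreases only as $n^{-1/2}$, halving the target width roughly quadruples the required number of cycles.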
6.4 EXAMPLES OF REGENERATIVE PROCESSES

In this section we consider three examples of regenerative processes, taken from Refs. 6, 10, and 49: a single server queue, a repair model with spares, and a closed queueing network.

6.4.1 A Single Server Queue GI/G/1 [6]
This example was described in Section 4.3.12, and will be briefly recapitulated here. Let Cr: and S, be the waiting time and service time, respectively, of the ith customer in a singe server queue. Let A,+, be the time between the arrival of the ith and ( i + 1)th customers. We assume that {S,,i 2 0) are i.i.d. with E ( 8 ) = p-' and that { A , , i 2 I } are i.i,d. with .!?(A,) = A - ' , Let the traffic intensity p be defined by p = A/p. We assume that customer number 0 arrives at time 0 to an empty system. Let X , = S,- I - A , for i 1 1. The waiting time process {W,, i 2 0) can be defined recursively by w,-0 ~ = ( F V , - ~ + X , ) +i ,> I . It i s known [36]that, if p < 1, there exists an infinite number of indices i such that W,3:0 and a random variable W such that W,* W, as i 4 to. Thus we choose zero state as our return state and regenerations occur whenever a customer arrives to find an empty queue. We are interested in estimating E ( W ) , which is finite if E(S:) < m. Since no analytical results are available for calculating the steady-state waiting time E ( W ) , we estimate it via simulation by making use of the classical estimator (6.3.8). The simulation results are shown in Fig. 6.4.1, We see that the customers I, 3,4, 7, 11, and 16 find the server idle, that is, W,= W,= W,= W, = W , ,= W,, = 0, while customers 2, 5, 6, 8, 9, 10,. 12, 13, 14, and 15 find the server busy and wait in the queue before being SC?fVed.
It follows from Fig. 6.4.1 that the simulation data contain five complete cycles with the following pairs $\{(Y_i, \tau_i), i = 1, \ldots, 5\}$: $(Y_1, \tau_1) = (10, 2)$, $(Y_2, \tau_2) = (0, 1)$, $(Y_3, \tau_3) = (30, 3)$, $(Y_4, \tau_4) = (50, 4)$, and $(Y_5, \tau_5) = (60, 5)$. The sixth cycle will start with the arrival of customer 16. Using the classical estimator $\hat{r} = \sum_{i=1}^{5} Y_i \big/ \sum_{i=1}^{5} \tau_i$, we obtain

$$\hat{r} = \frac{10 + 0 + 30 + 50 + 60}{2 + 1 + 3 + 4 + 5} = \frac{150}{15} = 10.$$

Fig. 6.4.1 Sample output of queueing simulation (horizontal axis: customer number).
This result can also be obtained by using the sample-mean estimator

$$\bar{r} = \frac{1}{N} \sum_{j=1}^{N} W_j = \frac{1}{15} \sum_{j=1}^{15} W_j = \frac{150}{15} = 10.$$
Here $N = \sum_{i=1}^{5} \tau_i = 15$ is the length of the run and $\sum_{j=1}^{15} W_j = \sum_{i=1}^{5} Y_i$. A logical question arises. If both point estimators, the classical estimator $\hat{r}$ and the sample-mean estimator $\bar{r}$, are equal (we assume that the length of the run N is equal to n complete cycles, n < N), why do we need all the ratio estimators (6.3.8) through (6.3.12), (6.3.17), and (6.3.18), based on the regenerative phenomena? The answer can be found if we consider not only point estimators for $r = E(W)$ but confidence intervals as well. In order to construct confidence intervals in the sense of classical statistics, the simulation data must form a sequence of i.i.d. samples from the same underlying probability distribution. The simulation data from the queueing system is the sequence of waiting times $W_1, \ldots, W_N$. Note, however, that if we start our simulation with an empty queueing system, then the first few waiting times tend to be short, that is, they are correlated, and as a rule, the sample-mean estimator $\bar{r}$ will be a biased estimator of $r = E(W)$.
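The grouping of waiting times into cycles and the two point estimates can be sketched in a few lines of Python. The individual waiting times below are hypothetical (the figure gives only the cycle totals); they are chosen so that the cycles reproduce the pairs $(Y_i, \tau_i)$ quoted in the text. The function names are illustrative, not taken from the book.

```python
def split_into_cycles(waits):
    """Group waiting times into regeneration cycles (Y_i, tau_i).

    A new cycle starts whenever a customer finds the system empty (W = 0).
    The first waiting time is assumed to be 0, and the trailing incomplete
    cycle is discarded.
    """
    cycles = []
    total, length = 0.0, 0
    for w in waits:
        if w == 0 and length > 0:
            cycles.append((total, length))
            total, length = 0.0, 0
        total += w
        length += 1
    return cycles


def classical_estimator(cycles):
    """Ratio estimator: sum of the Y_i divided by sum of the tau_i."""
    return sum(y for y, _ in cycles) / sum(tau for _, tau in cycles)


# Hypothetical waiting times for customers 1..16 (customer 16 opens cycle 6).
waits = [0, 10, 0, 0, 10, 20, 0, 10, 20, 20, 0, 10, 20, 20, 10, 0]
cycles = split_into_cycles(waits)
r_hat = classical_estimator(cycles)   # ratio estimator over 5 complete cycles
r_bar = sum(waits[:15]) / 15          # sample mean over the same 15 customers
print(cycles, r_hat, r_bar)
```

Both estimates coincide here because the run ends exactly at a regeneration point; the difference between the two approaches appears only when confidence intervals are constructed.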
Table 6.4.1 Simulation Results for the M/M/1 Queue

Parameter          Theoretical Value   Point Estimate   Confidence Interval
E(W)               0.100               0.110            [0.096, 0.123]
E(W^2)             0.040               0.046            [0.035, 0.056]
E{(W - 0.1)^+}     0.120               0.133            [0.116, 0.148]
E(tau_1)           2.000               2.110            [2.012, 2.207]
sigma(W)           0.173               0.182            [0.141, 0.271]

Source: Ref. 6.
Note: Number of cycles n = 2000, level of confidence 100(1 - δ) = 90%, number of replications N = 10, λ = 5, μ = 10.
To overcome this difficulty we can run the model until it reaches the steady state and then start collecting and updating the simulation data. The problem of determining the steady-state distribution is a difficult one and requires considerable computation (CPU) time; moreover, even if we start from it, $W_i$ and $W_{i+1}$ will still be correlated (if $W_i$ is short, then $W_{i+1}$ will also tend to be short, and vice versa). Since the r.v.'s $W_1, \ldots, W_N$ are correlated, classical statistical methods cannot be applied in constructing confidence intervals for $r = E(W)$. Still, this difficulty can be overcome by using the regenerative property, namely, by grouping the simulation data into independent pairs (blocks) $(Y_1, \tau_1), \ldots, (Y_n, \tau_n)$, which yields different ratio estimators (see (6.3.8) through (6.3.12), (6.3.17), and (6.3.18)) and the associated confidence intervals by means of classical statistics. Table 6.4.1 presents simulation results for the queueing system M/M/1 with λ = 5, μ = 10 based on a run of 2000 cycles. Confidence intervals at the 90% level are given for the parameters $E(W)$, $E(W^2)$, $E\{(W - 0.1)^+\}$, $E(\tau_1)$, and $\sigma(W)$. The function $(W - 0.1)^+$ may be interpreted as a penalty for long waiting time.
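As an illustration of the grouping into independent blocks, the following Python sketch simulates the M/M/1 waiting time process by the recursion $W_i = (W_{i-1} + X_i)^+$, collects the pairs $(Y_i, \tau_i)$, and forms a ratio estimate with a confidence interval. The interval uses the usual regenerative form $\hat{r} \pm z\, s / (\bar{\tau}\sqrt{n})$ with $s^2 = s_{11} - 2\hat{r}s_{12} + \hat{r}^2 s_{22}$; this form is assumed here rather than quoted from (6.3.8), and the function names and the seed are illustrative.

```python
import math
import random


def simulate_cycles(lam, mu, n_cycles, rng):
    """Simulate regeneration cycles of the M/M/1 waiting time process."""
    cycles = []
    w = 0.0                      # the first customer finds an empty system
    total, length = 0.0, 0
    while len(cycles) < n_cycles:
        total += w
        length += 1
        x = rng.expovariate(mu) - rng.expovariate(lam)   # X = S - A
        w = max(w + x, 0.0)
        if w == 0.0:             # next customer finds the server idle
            cycles.append((total, length))
            total, length = 0.0, 0
    return cycles


def regenerative_ci(cycles, z=1.645):
    """Ratio estimate of E(W) and an approximate 90% confidence interval."""
    n = len(cycles)
    y_bar = sum(y for y, _ in cycles) / n
    t_bar = sum(t for _, t in cycles) / n
    r_hat = y_bar / t_bar
    s11 = sum((y - y_bar) ** 2 for y, _ in cycles) / (n - 1)
    s22 = sum((t - t_bar) ** 2 for _, t in cycles) / (n - 1)
    s12 = sum((y - y_bar) * (t - t_bar) for y, t in cycles) / (n - 1)
    s2 = max(s11 - 2.0 * r_hat * s12 + r_hat ** 2 * s22, 0.0)
    half = z * math.sqrt(s2) / (t_bar * math.sqrt(n))
    return r_hat, (r_hat - half, r_hat + half)


cycles = simulate_cycles(lam=5.0, mu=10.0, n_cycles=2000, rng=random.Random(1))
r_hat, interval = regenerative_ci(cycles)
print(r_hat, interval)
```

With λ = 5 and μ = 10 the estimate should fall near the theoretical value E(W) = 0.1, although the exact figures depend on the random seed.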
6.4.2 A Repairman Model with Spares [10]
We now consider a repairman problem with n operating units and m spares (Fig. 6.4.2). Each of the operating units fails with rate λ. A failed unit enters a queue for service from one of s repairmen on a first-in-first-out (FIFO) basis and is replaced by a spare (if available). The distribution of the i.i.d. repair times is exponential with mean $\mu^{-1}$ for each repairman. A repaired unit enters the pool of spares unless there are fewer than n units in operation, in which case it immediately becomes operational. Denoting by $X(t)$ the number of units in service or waiting in the queue for service, $\{X(t), t \ge 0\}$ is a birth and death process with state space $I = \{0, 1, \ldots, m + n\}$, and

$$\lambda_i = \begin{cases} n\lambda, & 0 \le i < m, \\ (n + m - i)\lambda, & m \le i \le m + n. \end{cases}$$

Fig. 6.4.2 Repairman model with spares.
0, k = I , . . . ,m, n = 1,2,. . ., that is, that the cycle is taken into account only if it is of some minimal length (which can be considered as the sensitivity threshold of the measuring instrument). Remark 2
Remark 3 The r.v.'s $Y^k[n]$ and $\tau^k[n]$, $k = 1, \ldots, N$, $n \ge 1$, defined in (6.5.12), store the information obtained up to and including the nth step. We should also note that, for each fixed k, only $\nu^k[n]$ summands in both $Y^k[n]$ and $\tau^k[n]$ are nonzero.
Theorem 6.5.1 If the values of the function f are uniformly bounded by some constant D and if there exists a unique optimal solution $p^*$ of the problem (6.5.3)-(6.5.4), then for any initial distribution $p[0] \in S_{\varepsilon[0]}$ the sequence $\{p[n]\}_{n=1}^{\infty}$ generated by the algorithm (6.5.9)-(6.5.14) converges to $p^*$ with probability 1.
Corollary The theorem remains valid if we assume that the values of the function f cannot be observed directly, but are measured with a random noise. In other words,

$$f(x^i) = E_\xi\{Q(x^i, \xi)\}, \quad i = 1, \ldots, N,$$

where ξ is a random vector with an unknown time-independent probability distribution function. In this case we can consider another random process:

$$\{U^i(t) : t \ge 0\} = \{(X^i(t), \xi)\}, \quad i = 1, \ldots, N.$$

If $\{X^i(t) : t \ge 0\}$ is regenerative, then $\{U^i(t) : t \ge 0\}$ is also regenerative and the values of Q are uniquely defined for each value of the steady-state r.v. $U^i$ of the process $\{U^i(t) : t \ge 0\}$, and

$$E\{f(X^i)\} = E\{E_\xi\{Q(X^i, \xi)\}\} = E\{Q(U^i)\}, \quad i = 1, \ldots, N.$$
Proof of the Theorem Before proving the theorem, let us introduce some notation. Let

$$\xi_i[n] = r_i[n] - r_i, \qquad \xi[n] = \max_i |\xi_i[n]|, \qquad (6.5.17)$$

where $r_i = E\{f(X^i)\}$, $i = 1, \ldots, N$, and $n = 1, 2, \ldots$.
6.5
203
SELECTING THE BEST STABLE STOCHASTIC SYSTEM
On the nth step the state of the algorithm can be described by a 4N-dimensional vector $Z[n] = (p[n], \tau[n], Y[n], \nu[n])$, where $\tau[n] = (\tau^1[n], \ldots, \tau^N[n])$, $Y[n] = (Y^1[n], \ldots, Y^N[n])$, and $\nu[n] = (\nu^1[n], \ldots, \nu^N[n])$. We first prove the following lemma.

Lemma. For any $Z[0]$ such that $p[0] \in S_{\varepsilon[0]}$ and $\tau[0] > 0$,

$$\sum_{n=1}^{\infty} \gamma[n]\, E\{|\xi_i[n]| \mid Z[0]\} < \infty, \quad i = 1, \ldots, N. \qquad (6.5.18)$$
Proof Without loss of generality set i = 1 and define

$$Z_n^1 = Y_n^1 - r_1 \tau_n^1, \quad n = 1, 2, \ldots.$$

If a cycle of the regenerative process $\{X^1(t) : t \ge 0\}$ was not carried out on the nth step, then $Z_n^1 = 0$. For all n such that a cycle of $\{X^1(t) : t \ge 0\}$ was performed on the nth step, the $Z_n^1$ are i.i.d. r.v.'s with $E(Z_n^1) = 0$ and variance $\sigma_1^2$. Define also

$$z^1[n] = z^1[n - 1] + Z_n^1, \quad n = 1, 2, \ldots, \qquad z^1[0] = Y^1[0] - r_1 \tau^1[0].$$

Then by the Cauchy-Schwarz inequality,

$$E\{|\xi_1[n]| \mid Z[0]\} \le E^{1/2}\{(z^1[n])^2 \mid Z[0]\}\, E^{1/2}\{(\tau^1[n])^{-2} \mid Z[0]\} \le \big((z^1[0])^2 + n\sigma_1^2\big)^{1/2} E^{1/2}\{(\tau^1[0] + \tau_0 \nu^1[n])^{-2} \mid Z[0]\},$$

where $\tau_0$ was defined in Remark 2. Since by (6.5.6) $\nu^1[n] \ge n\varepsilon[n]$, we have

$$E\{|\xi_1[n]| \mid Z[0]\} \le \big((z^1[0])^2 + n\sigma_1^2\big)^{1/2} \big(\tau^1[0] + \tau_0 n \varepsilon[n]\big)^{-1}.$$

Thus for n large enough

$$E\{|\xi_1[n]| \mid Z[0]\} \le A_1\, \varepsilon^{-1}[n]\, n^{-1/2}, \qquad (6.5.19)$$

where $A_1 = A_1(Z[0])$. Inequality (6.5.19) and condition (6.5.16) imply the convergence of the series (6.5.18). Q.E.D.
Corollary For any state Zjn] of the algorithm on the nth step, 00
c
m=n+ I
Y [ ~ l ~ ( ~ [ ~ J l 0 , j = 1,. ..,M. The operator [ - I
+
is defined by
(6.6.5) Now instead of the original LP problem, the following problem is solved:
(6.6.6)
,,
where p satisfies (6.6.3)and the sequences (pj[n]}2- j = 1,.
. .,N,satisfy
the following conditions:
Now we propose an adaptive algorithm that converges with probability one to the optimal solution of the LP problem (6.6.1)-(6.6.3). The algorithm is similar to the algorithm (6.5.8)-(6.5.16) and is based on a step-by-step correction of the probability vector $p[n]$, where n denotes the step number. As in the algorithm (6.5.8)-(6.5.16) there exists a mechanism, provided by (6.6.12) below, that ensures that $p_i[n] \ge \varepsilon[n]$, $i = 1, \ldots, N$, where $\{\varepsilon[n]\}_{n=1}^{\infty}$ is a monotone decreasing sequence of positive numbers, subject to (6.6.16) through (6.6.21) below. On the nth step the ith system, $i \in \{1, \ldots, N\}$, is chosen by simulating the distribution $p[n - 1]$. We denote this event by $X[n] = X^i$. One cycle of the process $\{X^i(t) : t \ge 0\}$ is carried out. Denote by $\nu^i[n]$, $i = 1, \ldots, N$, the total number of cycles made by the ith system up to and including the nth step. We check whether or not the inequality $\nu^k[n - 1] \ge n\varepsilon[n]$, $k \in \{1, \ldots, i - 1, i + 1, \ldots, N\}$, is satisfied for all systems. If for some indices $k_1, \ldots, k_s \in \{1, \ldots, i - 1, i + 1, \ldots, N\}$ this inequality does not hold, then one additional cycle is carried out for each system $k_1, \ldots, k_s$, so that ultimately

$$\nu^k[n] \ge n\varepsilon[n], \quad k = 1, \ldots, N, \qquad (6.6.8)$$

holds. We record also

$$\tau_n^k = T^k_{\nu^k[n]} - T^k_{\nu^k[n] - 1}, \quad k = i, k_1, \ldots, k_s, \qquad (6.6.9)$$

the lengths of the cycles performed, and for each k calculate M + 1 numbers
Y ~ ~ ~ ~ ~ ~ ~ ' ~ ~ ~k =(I ,~k ,,..., ~ ( k,, , ) j =) O~, l z, ..., , M, (MI-
I
(6.6.10)
if the process { X k ( t ): r 2 0) is continuous-time. In the case of discrete-parameter processes the integral should be replaced by the corresponding sum over the (vk[n])th cycle. Set also
Y,"'
= T," = 0 ,
ifk#i,k,
,...,k , , j = 0 , 1 ,..., M. (6.6.11)
The new distribution p i n ] is updated according to the following recurrence formula: P [ " l =%Jn)(PE"-
11 - Y [ 4 B ( n l Q ) .
(6.6.12)
REGENERATIVE METHOD FOR CONSTRAINED OPTiMlZATION PROBLEMS
211
Here S, is a simplex in RN: H
s,-
1
Pk= 1 , 0 < e < P k - - n"-
1-;
n]iO%'[']Lo
(6.6.17)
Q,
2Y[4 RQ
(6.6.18)
=a
I
00
5: ( y f f l ] p ~ ~ [ n ] 2 E - - " n I- ] )
Tm-,:X,,= 0},
m 2 0.
We say that a regeneration occurs at time $T_m$, and the time between $T_m$ and $T_{m+1}$, that is, $\tau_m = T_{m+1} - T_m$, is referred to as the length of the mth cycle. Let k be some positive integer and let $r_v = \pi f_v = E\{f_v(X)\}$, $v = 0, 1, \ldots, k$. For each $m \ge 0$ and $v = 0, 1, \ldots, k$, define $Y_m(v)$ by

$$Y_m(v) = \sum_{n = T_m}^{T_{m+1} - 1} f_v(X_n).$$

It follows from Proposition 6.2.2 that, if $\pi|f_v| < \infty$, then

$$r_v = \frac{E\{Y_m(v)\}}{E\{\tau_m\}}. \qquad (6.7.6)$$

Let $Z_m(v) = Y_m(v) - r_v \tau_m$. By (6.7.6) we have for each $m \ge 0$ and each $v = 0, 1, \ldots, k$

$$E\{Z_m(v)\} = 0. \qquad (6.7.7)$$

Define

$$\hat{r}_v(M) = \frac{\sum_{m=1}^{M} Y_m(v)}{\sum_{m=1}^{M} \tau_m} \qquad (6.7.8)$$
and
for each $v = 0, 1, \ldots, k$. Then $\hat{r}_v(M) \to r_v$ a.s. as $M \to \infty$ and $\bar{r}_v(N) \to r_v$ a.s. as $N \to \infty$. Observe that $\hat{r}_v(M)$ is an estimator for $r_v$ based on M cycles of the process and $\bar{r}_v(N)$ is an estimator for $r_v$ based on N transitions of the process. Because $\{Z_m(v) : m \ge 0\}$ are i.i.d., it is readily possible to prove the following two c.l.t.'s:
217
VARIANCE REDUCTION TECHNIQURS
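Proposition 6.7.1 Let $\Sigma_k$ be a $(k + 1) \times (k + 1)$-dimensional covariance matrix of the $Z_m(v)$'s, whose (i, j)th entry is $\sigma_{ij} = E[Z_m(i) Z_m(j)]$. If $E\{|f_v(X)|\} < \infty$ for each $v = 0, 1, \ldots, k$, then

$$\big[\sqrt{M}(\hat{r}_0(M) - r_0), \ldots, \sqrt{M}(\hat{r}_k(M) - r_k)\big] \Rightarrow N\big(0, \Sigma_k / E^2(\tau_1)\big) \qquad (6.7.12)$$

and

$$\big[\sqrt{N}(\bar{r}_0(N) - r_0), \ldots, \sqrt{N}(\bar{r}_k(N) - r_k)\big] \Rightarrow N\big(0, \Sigma_k / E(\tau_1)\big). \qquad (6.7.13)$$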
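The proof of this proposition is given in Ref. 24. Now let β be a (k + 1)-dimensional row vector of real numbers whose vth entry is β(v). Let $r$, $\hat{r}(M)$, and $\bar{r}(N)$ denote (k + 1)-dimensional column vectors whose vth entries are $r_v$, $\hat{r}_v(M)$, and $\bar{r}_v(N)$, respectively. A simple application of the continuous mapping theorem (Theorem 5.1 of Billingsley [1]) yields the following.

Proposition 6.7.2 Let $\sigma_k^2(\beta) = \beta \Sigma_k \beta^t = \sum_{i=0}^{k} \sum_{j=0}^{k} \beta(i)\sigma_{ij}\beta(j)$. Under the hypotheses of Proposition 6.7.1,

$$\frac{\sqrt{M}\,(\beta\hat{r}(M) - \beta r)}{\sigma_k(\beta)/E(\tau_1)} \Rightarrow N(0, 1) \qquad (6.7.14)$$

and

$$\frac{\sqrt{N}\,(\beta\bar{r}(N) - \beta r)}{\sigma_k(\beta)/E^{1/2}(\tau_1)} \Rightarrow N(0, 1), \qquad (6.7.15)$$

where $\beta r = \sum_{v=0}^{k} \beta(v) r_v$ is the inner product of β and r.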
In order to form confidence intervals for the $r_v$'s (or for linear combinations of the $r_v$'s) it is necessary to know the $\sigma_{ij}$ as well as $E(\tau_1)$. These constants are usually unknown and must be estimated. In addition β may be a fixed, but unknown, vector, so it too must be estimated. The following proposition, the proof of which is also given in Ref. 24, tells us that we may replace these quantities in Proposition 6.7.2 by any sequence of strongly consistent estimators while preserving the asymptotic normality.

Proposition 6.7.3 Suppose that $\hat{\tau}_1(M) \to E(\tau_1)$ a.s., that $\hat{\sigma}_{ij}(M) \to \sigma_{ij}$ a.s. for each i and j, and that $\hat{\beta}(i, M) \to \beta(i)$ a.s. for each i. Let $\hat{\Sigma}_k(M)$ be the matrix whose (i, j)th entry is $\hat{\sigma}_{ij}(M)$, let $\hat{\beta}(M)$ be the vector whose ith component is $\hat{\beta}(i, M)$, and let $\hat{\sigma}_k^2(\hat{\beta}, M) = \hat{\beta}(M)\hat{\Sigma}_k(M)\hat{\beta}^t(M)$. Then

$$\frac{\sqrt{M}\,\big(\hat{\beta}(M)\hat{r}(M) - \hat{\beta}(M) r\big)}{\hat{\sigma}_k(\hat{\beta}, M)/\hat{\tau}_1(M)} \Rightarrow N(0, 1) \quad \text{as } M \to \infty. \qquad (6.7.16)$$
We turn now to the problem of choosing the functions $f_v$ with a view to achieving variance reductions. Heidelberger [24-26] suggested several ways of choosing $f_v$, $v = 0, \ldots, k$. We consider only one of them [24]. Let

$$f_v = P^v f, \quad v = 0, 1, \ldots, k, \qquad (6.7.17)$$

where $P^v$ is the v-step transition function of the process. It is shown in Ref. 24 that in this case, that is, when $f_v = P^v f$, all $r_v = \pi f_v$, $v = 0, 1, \ldots, k$, are equal to r, and if $E\{|f(X)|\} < \infty$, then $\pi f = \pi(Pf)$. Since $r_v = r$, $v = 0, 1, \ldots, k$, it is obvious that

$$\hat{r}_v(M) = \frac{\sum_{m=1}^{M} Y_m(v)}{\sum_{m=1}^{M} \tau_m} \xrightarrow{\text{a.s.}} \frac{E\{Y_1(v)\}}{E(\tau_1)} = \pi f_v = r. \qquad (6.7.18)$$
Therefore each $\hat{r}_v(M)$, $v = 0, 1, \ldots, k$, is a strongly consistent estimator for r, and we can use any one of them for this purpose. However, better results can be achieved by using all of them simultaneously, for instance, using (6.7.5), which can be written as

$$\hat{r}_\beta(M) = \sum_{v=0}^{k} \beta(v)\hat{r}_v(M), \qquad (6.7.19)$$

where $\sum_{v=0}^{k} \beta(v) = 1$. Variance reduction can be achieved if we choose the β(v)'s so as to minimize the asymptotic variance $\sigma_k^2(\beta)$ of $\hat{r}_\beta(M)$. Mathematically it can be written as

$$\text{minimize } \sigma_k^2(\beta) = \beta\Sigma_k\beta^t \qquad (6.7.20)$$

$$\text{subject to } \sum_{v=0}^{k} \beta(v) = 1. \qquad (6.7.21)$$
The solution of this problem, which can be obtained using Lagrange multipliers, is

$$\beta^* = \frac{e\Sigma_k^{-1}}{e\Sigma_k^{-1}e^t}, \qquad (6.7.22)$$

$$\sigma_k^2(\beta^*) = \frac{1}{e\Sigma_k^{-1}e^t}, \qquad (6.7.23)$$

where e denotes the (k + 1)-dimensional row vector each of whose components is 1, and where t is the transpose operation. Formulas (6.7.10) and (6.7.11) can now be rewritten as (6.7.24) and (6.7.25), where $\bar{r}_{\beta^*}(N) = \sum_{v=0}^{k} \beta^*(v)\bar{r}_v(N)$, and both $\hat{r}_{\beta^*}(M) \to r$ a.s. as $M \to \infty$ and $\bar{r}_{\beta^*}(N) \to r$ a.s. as $N \to \infty$.
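The constrained minimization of $\beta\Sigma_k\beta^t$ subject to $\sum_v \beta(v) = 1$ has the closed-form Lagrange solution $\beta^* = e\Sigma_k^{-1}/(e\Sigma_k^{-1}e^t)$ with minimum value $1/(e\Sigma_k^{-1}e^t)$. A minimal pure-Python sketch of this computation follows; it assumes a symmetric positive definite covariance matrix supplied as a list of rows, and the helper names `solve` and `optimal_weights` are illustrative.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def optimal_weights(sigma):
    """Return (beta*, minimal variance) for a symmetric covariance matrix."""
    ones = [1.0] * len(sigma)
    x = solve(sigma, ones)        # x = Sigma^{-1} e^t (Sigma is symmetric)
    total = sum(x)                # e Sigma^{-1} e^t
    return [v / total for v in x], 1.0 / total


# Two uncorrelated estimates with variances 1 and 4 (hypothetical values).
beta, variance = optimal_weights([[1.0, 0.0], [0.0, 4.0]])
print(beta, variance)
```

In this uncorrelated case the weights are inversely proportional to the variances, and the combined variance is smaller than either individual variance.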
Since the covariance matrix $\Sigma_k$ is in general unknown, it is necessary to estimate it. If $\hat{\Sigma}_k(M)$ is any estimator such that $\hat{\Sigma}_k(M) \to \Sigma_k$ a.s. as $M \to \infty$, then it is clear that $\hat{\Sigma}_k^{-1}(M) \to \Sigma_k^{-1}$ a.s. as $M \to \infty$. Letting

$$\hat{r}_{\hat{\beta}^*}(M) = \sum_{v=0}^{k} \hat{\beta}^*(v, M)\,\hat{r}_v(M), \qquad (6.7.26)$$

and applying Proposition 6.7.3, we have (6.7.27)
where $\hat{\sigma}_k(\hat{\beta}^*, M) \to \sigma_k(\beta^*)$ a.s. as $M \to \infty$ and $\hat{\tau}_1(M)$ is any sequence of numbers such that $\hat{\tau}_1(M) \to E(\tau_1)$ a.s. as $M \to \infty$. A corresponding c.l.t. exists for the $\bar{r}_v(N)$'s as well. This method is called the "method of multiple estimates" because it combines several different estimates of the same quantity. In order to apply this method the functions $f_v$ must be computed (usually before the start of the simulation). For computational efficiency $f_v$ can be defined recursively by $f_0 = f$ and $f_v = Pf_{v-1}$ for $v \ge 1$. This saves having to compute the v-step transition function $P^v$, a potentially large computational economy. If the state space is finite and the transition matrix is sparse, the work involved in calculating $f_v$ for a few values of v may not be too heavy. We note that to form the estimates $\bar{r}_v(N)$ (or $\hat{r}_v(M)$) we must evaluate $f_v(X_n)$ for each value of v and each transition n. This tends to increase the amount of time needed for each transition simulated. However, if the variance reduction obtained is sufficiently large, the potential savings in the number of transitions that need to be simulated will more than offset the extra work per transition. We also note that additional work must be done at the end of each cycle to update the estimates of the covariance matrix $\Sigma_k$ (using no variance reducing technique, we need only update $\sigma_{00}$). It is shown (see [24]) that $\sigma_k^2(\beta^*) \to 0$ as $k \to \infty$. For many types of Markov chains we can expect substantial variance reductions even when k is relatively small (say 2 or 3). For countable I we have

$$f_k(i) = \sum_{j=0}^{\infty} p_{ij}^{(k)} f(j) = E[f(X_k) \mid X_0 = i]. \qquad (6.7.28)$$
Thus if the Markov chain makes transitions only to "neighboring" states and if f(j) is close to f(i) for j close to i, it can be seen from (6.7.28) that, for small k, $f_k(i)$ and $f(i)$ should be nearly the same. This means that $\bar{r}_k(N)$ and $\bar{r}_0(N)$ will be highly correlated, a condition that generally results in good variance reduction. Many queueing networks exhibit this special type of structure. Ideally, we would like to be able to have the "optimal" value of k in the sense that, for a given computer budget, we would like to pick the value that yields the narrowest confidence intervals for r (part of the budget must be allocated to calculation of the $f_v$'s). To perform such an optimization we would have to know $\sigma_k^2(\beta^*)$ for each $k \ge 0$. These quantities are generally unknown, and even to estimate them would require calculating the $f_v$'s and then simulating the Markov process for an additional number of cycles. The disadvantage of such a procedure is that the cost of computing the $f_v$ may be higher than the gain achieved through variance reduction. Generally speaking, the success of this technique depends on our ability to compute and store efficiently the functions $f_v$. The method of multiple estimates can be extended to certain types of continuous-time processes such as continuous-time Markov chains and semi-Markov processes (see [24]). To find out the efficiency of this method Heidelberger [24] considered the following four examples: the queue length process in a finite capacity M/M/1 queue, the queue length process in the repair problem with spares, and the waiting time processes in both M/M/1 and M/M/2 queues. These processes were chosen because analytic results are readily available, thereby making a comparison between analytic and simulation results possible. Despite their simplicity, these processes are by no means "easy" to simulate, in particular the heavily loaded queues, which require very large run lengths to get good simulation estimates.

The simulation results, which are also presented in Ref. 24, show that for all four examples substantial variance reduction was obtained. However, as this method entails additional computations both before and during the course of simulation, we would recommend using it only when it is computationally advantageous to do so. In the case of a Markov chain it is likely that the method will be most effective if the transition matrix of the process is sparse, in which case the preliminary calculations can be carried out with relative ease. It is for this type of process that the method is recommended.
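The recursive definition $f_0 = f$, $f_v = Pf_{v-1}$ is easy to sketch for a finite chain with a sparse transition matrix. In the Python fragment below the matrix is stored as a dictionary of rows, so only nonzero transition probabilities are visited; the three-state chain and the function f(i) = i are hypothetical illustrations, not one of Heidelberger's examples.

```python
def multiple_estimate_functions(P, f, k):
    """Return [f_0, f_1, ..., f_k] with f_0 = f and f_v = P f_{v-1}.

    P maps each state i to a dict {j: p_ij} of its nonzero transitions;
    f maps each state to a real value.
    """
    fs = [dict(f)]
    for _ in range(k):
        prev = fs[-1]
        fs.append({i: sum(p * prev[j] for j, p in row.items())
                   for i, row in P.items()})
    return fs


# Random walk on {0, 1, 2} that stays put at the boundaries with probability 1/2.
P = {0: {0: 0.5, 1: 0.5},
     1: {0: 0.5, 2: 0.5},
     2: {1: 0.5, 2: 0.5}}
f = {0: 0.0, 1: 1.0, 2: 2.0}

fs = multiple_estimate_functions(P, f, k=2)
# This chain is doubly stochastic, so its stationary distribution is uniform.
means = [sum(fv.values()) / 3 for fv in fs]
print(fs, means)
```

Each $f_v$ has the same stationary mean, which is the property $r_v = r$ that lets the estimates be combined, while the functions themselves are progressively smoother.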
Example 2 We consider now another example of variance reduction, taken from Ref. 45. Before starting this example we need more mathematical background on the regenerative method. Let X again be the steady-state vector of the regenerative process $\{X(t) : t \ge 0\}$, let f and g be given real-valued measurable functions, and suppose we want to estimate (6.7.29)
It follows from Proposition 6.2.2 that, if $E\{|f(X)|\} < \infty$ and $E\{|g(X)|\} < \infty$, then
0 such that $g(x) \le g(x^*)$ for all $x \in D$ satisfying $\|x - x^*\| < \delta$. We define a local minimum in a similar way, but with the sense of the inequality $g(x) \le g(x^*)$ reversed. If the inequality $g(x) \le g(x^*)$ is replaced by a strict inequality

$$g(x) < g(x^*), \quad x \in D, \ x \ne x^*,$$

we have a strict local maximum; and if the sense of the inequality $g(x) < g(x^*)$ is reversed, we have a strict local minimum. We say that the function g has a global (absolute) maximum (strict global maximum) at $x^* \in D$ if $g(x) \le g(x^*)$ [$g(x) < g(x^*)$] holds for every $x \in D$. A similar definition holds for a global minimum (strict global minimum). A global maximum at $x^*$ implies that g(x) takes on its greatest value $g(x^*)$ at that point no matter where else we may search in the set D. A local maximum, on the other hand, only guarantees that the value of g(x) is a maximum with respect to other points nearby, specifically in a region about $x^*$.
Thus a function may have many local maxima, each with a different value of g(x), say, $g(x_j^*)$, $j = 1, \ldots, k$. The global maximum can always be chosen from among these local maxima by comparing their values and choosing one such that

$$g(x^*) = \max_{j = 1, \ldots, k} g(x_j^*),$$

where $x^* \in \{x_j^*, j = 1, \ldots, k\}$.

It is clear that every global maximum (minimum) is also a local maximum (minimum); however, the converse of this statement is, in general, not true. If g(x) is a convex function in $R^n$ and $D \subset R^n$ is a convex set, then every local minimum of g at $x \in D$ is also a global minimum of g over D [2].

7.1 RANDOM SEARCH ALGORITHMS
Consider the following deterministic optimization problem:

$$\max_{x \in D \subset R^n} g(x) = g(x^*) = g^*, \qquad (7.1.1)$$

where g(x) is a real-valued bounded function defined on a closed bounded domain $D \subset R^n$. It is assumed that g achieves its maximum value at a unique point $x^*$. The function g(x) may have many local maxima in D but only one global maximum. When g(x) and D have some attractive properties, for instance, g(x) is a differentiable concave function and D is a convex region, then, as previously mentioned, a local maximum is also a global maximum and problem (7.1.1) can be solved explicitly by mathematical programming methods (see Avriel [2]). If the problem cannot be solved explicitly, then numerical methods, in particular Monte Carlo methods, can be applied. For better understanding of the subsequent text we describe an iterative gradient algorithm, assuming for simplicity that the set $D = R^n$. According to the gradient algorithm, we approximate the point $x^*$ step by step. If on the ith iteration (i = 1, 2, ...) we have reached point $x_i$, then the next point $x_{i+1}$ is chosen as

$$x_{i+1} = x_i + \alpha_i \nabla g(x_i), \quad \alpha_i > 0, \qquad (7.1.2)$$

where

$$\nabla g(x) = \left(\frac{\partial g(x)}{\partial x_1}, \ldots, \frac{\partial g(x)}{\partial x_n}\right)$$

is the gradient of g(x), where $\partial g(x)/\partial x_k$, $k = 1, \ldots, n$, are the partial derivatives, and where $\alpha_i > 0$ is the step parameter. If the function g(x) is not differentiable or if the analytic expression of g(x) is not given explicitly (only the values of g(x) can be observed at each point $x \in D$), then the finite difference gradient algorithm

$$x_{i+1} = x_i + \alpha_i \hat{\nabla} g(x_i) \qquad (7.1.3)$$

can be applied. In (7.1.3)

$$\hat{\nabla} g(x) = \left(\frac{g(x_1 + \beta_1, x_2, \ldots, x_n) - g(x_1 - \beta_1, x_2, \ldots, x_n)}{2\beta_1}, \ldots, \frac{g(x_1, \ldots, x_n + \beta_n) - g(x_1, \ldots, x_n - \beta_n)}{2\beta_n}\right)$$

is the finite difference estimate of the gradient $\nabla g(x)$. Under some rather mild conditions (see Avriel [2]) on g(x) and $\alpha_i$, the algorithm (7.1.3) converges to the local extremum $x^*$. In the case where either g(x) or the region D is nonconvex, the classical numerical optimization methods fail. However, Monte Carlo methods, in particular random search algorithms, can be applied. If we assume, for instance, that g(x) is a multiextremal function, then procedures (7.1.2) and (7.1.3) converge only to one of the local extrema, subject to the choice of the initial point $x_0$ from which the algorithms (7.1.2) and (7.1.3) start. We consider several random search algorithms capable of finding the extremum $x^*$ for complex nonconvex functions. The random search algorithms have been described in many papers and books (see Ermolyev [9], Katkovnik [17], Rastrigin [28], and Rubinstein [31-36]) and successfully implemented for various complex optimization problems. We now consider several random search algorithms.

Random Search Double Trial Algorithm (Algorithm RS-1)

$$x_{i+1} = x_i + \frac{\alpha_i}{2\beta_i}\big[g(x_i + \beta_i\Xi_i) - g(x_i - \beta_i\Xi_i)\big]\Xi_i, \quad \alpha_i > 0, \ \beta_i > 0. \qquad (7.1.4)$$
According to this algorithm, at the ith iteration we generate a random vector $\Xi_i$ continuously distributed on the n-dimensional unit sphere, calculate the increment (see Fig. 7.1.1)

$$\Delta g_i = g(x_i + \beta_i\Xi_i) - g(x_i - \beta_i\Xi_i), \qquad (7.1.5)$$

and choose the next point according to (7.1.4). It is not difficult to see that this algorithm generalizes the gradient algorithm (7.1.3). Only in the particular case where $\Xi_i$ is taken in the direction of the gradient do procedures (7.1.3) and (7.1.4) coincide.

Fig. 7.1.1 Graphical representation of the double trials random search algorithm RS-1.
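To make the relation between the finite difference gradient step and the double trial random search step concrete, here is a minimal Python sketch of both. The central-difference form of the gradient estimate, the constant step parameters, the test function, and the function names are assumptions made for illustration; a direction uniformly distributed on the unit sphere is obtained by normalizing a vector of independent standard normal variates.

```python
import math
import random


def unit_vector(n, rng):
    """Draw a direction uniformly distributed on the unit sphere in R^n."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def fd_gradient_step(g, x, alpha, beta):
    """One finite difference gradient step using central differences."""
    grad = []
    for k in range(len(x)):
        up = x[:k] + [x[k] + beta] + x[k + 1:]
        down = x[:k] + [x[k] - beta] + x[k + 1:]
        grad.append((g(up) - g(down)) / (2.0 * beta))
    return [xk + alpha * gk for xk, gk in zip(x, grad)]


def rs1_step(g, x, alpha, beta, rng):
    """One double trial random search step: two trials along +/- a random direction."""
    xi = unit_vector(len(x), rng)
    plus = [xk + beta * c for xk, c in zip(x, xi)]
    minus = [xk - beta * c for xk, c in zip(x, xi)]
    scale = alpha * (g(plus) - g(minus)) / (2.0 * beta)
    return [xk + scale * c for xk, c in zip(x, xi)]


def g(x):
    """Concave test function with its maximum at the origin."""
    return -sum(c * c for c in x)


rng = random.Random(0)
x = [3.0, -4.0]
for _ in range(200):
    x = rs1_step(g, x, alpha=0.25, beta=0.1, rng=rng)
print(x, g(x))
```

The random search step needs only two function evaluations per iteration regardless of the dimension n, whereas the finite difference gradient step needs 2n.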
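Nonlinear Tactic Random Search Algorithm (Algorithm RS-2)

$$x_{i+1} = x_i + \frac{\alpha_i}{\beta_i} Y_i\, \text{Sign}\, Y_i\, \Xi_i, \quad \alpha_i > 0, \ \beta_i > 0, \qquad (7.1.6)$$

where

$$Y_i = g(x_i + \beta_i\Xi_i) - g(x_i) \qquad (7.1.7)$$

and

$$\text{Sign}\, Y_i = \begin{cases} 1, & \text{if } Y_i > 0, \\ 0, & \text{if } Y_i \le 0. \end{cases}$$

According to this algorithm, we perform a trial step in the random direction $\Xi_i$ and check $\text{Sign}\, Y_i$. If $Y_i > 0$, then $x_{i+1} = x_i + (\alpha_i/\beta_i) Y_i \Xi_i$. If $Y_i \le 0$, then $x_{i+1} = x_i$ and no iteration is made.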
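A minimal Python sketch of one RS-2 step is given below: a single trial is made in a random direction and the point moves only when the trial increment is positive. The linear test function, the constant values of α and β, and the function names are assumptions made for illustration.

```python
import math
import random


def unit_vector(n, rng):
    """Draw a direction uniformly distributed on the unit sphere in R^n."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def rs2_step(g, x, alpha, beta, rng):
    """One nonlinear tactic step: move only if the trial increment is positive."""
    xi = unit_vector(len(x), rng)
    trial = [xk + beta * c for xk, c in zip(x, xi)]
    y = g(trial) - g(x)
    if y <= 0:
        return x                  # unsuccessful trial: the point is unchanged
    return [xk + (alpha / beta) * y * c for xk, c in zip(x, xi)]


def g(x):
    """Linear test function; every successful trial direction is an ascent direction."""
    return 2.0 * x[0] - x[1]


rng = random.Random(0)
x0 = [0.0, 0.0]
x = x0
history = [g(x)]
for _ in range(100):
    x = rs2_step(g, x, alpha=0.5, beta=0.1, rng=rng)
    history.append(g(x))
print(x, history[-1])
```

For a linear function the sequence of function values never decreases; for a general g the step length α/β must be small enough that a successful trial is not followed by an overshoot.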
Linear Tactic Random Search Algorithm (Algorithm RS-3)
This algorithm contains the following steps:
1 $i \leftarrow 0$; generate $\Xi_0$.
2 Calculate the increment $Y_i = g(x_i + \beta_i\Xi_i) - g(x_i)$.
3 If $Y_i \le 0$, go to step 6.
4 $x_{i+1} = x_i + \dfrac{\alpha_i}{\beta_i} Y_i \Xi_i$, $\alpha_i > 0$, $\beta_i > 0$. (7.1.8)
5 Go to step 7.
6 $x_{i+1} \leftarrow x_i$; $i \leftarrow i + 1$; generate $\Xi_i$.
7 Go to step 2.

Thus if $Y_i > 0$, we perform as many iterations as possible in the initially chosen random direction $x_i + \beta_i\Xi_i$; if $Y_i \le 0$, we generate a new random vector $\Xi_i$ and perform only one iteration according to the nonlinear tactic random search algorithm RS-2. It is not difficult to see that search in the same direction versus choice of a new direction is subject to the shape of g(x). The flatter the gradient lines, the more iterations will be performed according to step 4 and correspondingly the fewer iterations according to step 6. In the particular case where g(x) is a linear function, all iterations will be performed according to step 4 in the direction of the vector $x_0 + \alpha_0\beta_0^{-1}Y_0\Xi_0$, where $\Xi_0$ is the first random vector such that $Y_0 > 0$, and no iteration will be performed according to step 6. This is the reason why this algorithm is called a linear tactic random search algorithm.
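The control flow of the linear tactic algorithm can be sketched as a loop that keeps the current direction while trials succeed and draws a new direction after a failure. The Python fragment below is one reading of the steps above, under the assumption that the iteration counter advances after every trial and that the direction is retained after a successful step; the constant α and β, the test function, and the function names are illustrative.

```python
import math
import random


def unit_vector(n, rng):
    """Draw a direction uniformly distributed on the unit sphere in R^n."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def rs3(g, x0, alpha, beta, n_iter, rng):
    """Linear tactic random search: reuse a direction for as long as it succeeds."""
    x = list(x0)
    xi = unit_vector(len(x), rng)                 # step 1
    reused = 0
    for _ in range(n_iter):
        trial = [xk + beta * c for xk, c in zip(x, xi)]
        y = g(trial) - g(x)                       # step 2
        if y > 0:                                 # step 4: move, keep the direction
            x = [xk + (alpha / beta) * y * c for xk, c in zip(x, xi)]
            reused += 1
        else:                                     # step 6: stay, draw a new direction
            xi = unit_vector(len(x), rng)
    return x, reused


def g(x):
    """Linear test function."""
    return 2.0 * x[0] - x[1]


x_final, reused = rs3(g, [0.0, 0.0], alpha=0.5, beta=0.1, n_iter=50,
                      rng=random.Random(0))
print(x_final, g(x_final), reused)
```

For the linear test function the first successful direction is kept for all remaining iterations, which is the behavior that gives the algorithm its name.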
Optimum Trial Random Search Algorithm (Algorithm RS-4)
This algorithm comprises the following steps:
1 Choose N > 1 independent random points $x_i + \beta_i\Xi_{ik}$ on the sphere $\{x_i + \beta_i\Xi_i\}$, where $\Xi_i$ is a random vector continuously distributed on the unit sphere with realizations $\Xi_{ik}$, $k = 1, \ldots, N$.
2 Consider the sequence of increments $Y_{ik} = g(x_i + \beta_i\Xi_{ik}) - g(x_i)$, $k = 1, \ldots, N$.
3 Find $Y_i^* = \max_{k} Y_{ik}$ and let $\Xi_i^*$ denote the direction that has produced this maximum.
4 The point $x_{i+1}$ is chosen according to the following iterative procedure:

$$x_{i+1} = x_i + \alpha_i\beta_i^{-1} Y_i^* \Xi_i^*, \quad \alpha_i > 0, \ \beta_i > 0. \qquad (7.1.11)$$

Thus the next point $x_{i+1}$ is chosen in the direction $\Xi_i^*$ of the greatest increase of the function g(x), that is, the vector $\Xi_i^*$ corresponds to the optimal trial among those available.
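The optimum trial step can be sketched as follows: N trial directions are drawn, the one with the largest increment is selected, and the point is moved along it by an amount proportional to that increment. The Python code below is a minimal sketch under the assumption of constant α and β and a linear test function; the names are illustrative.

```python
import math
import random


def unit_vector(n, rng):
    """Draw a direction uniformly distributed on the unit sphere in R^n."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def rs4_step(g, x, alpha, beta, n_trials, rng):
    """One optimum trial step: move along the best of n_trials random directions."""
    gx = g(x)
    best_y, best_xi = None, None
    for _ in range(n_trials):
        xi = unit_vector(len(x), rng)
        y = g([xk + beta * c for xk, c in zip(x, xi)]) - gx
        if best_y is None or y > best_y:
            best_y, best_xi = y, xi
    return [xk + (alpha / beta) * best_y * c for xk, c in zip(x, best_xi)]


def g(x):
    """Linear test function."""
    return 2.0 * x[0] - x[1]


rng = random.Random(0)
x = [0.0, 0.0]
history = [g(x)]
for _ in range(30):
    x = rs4_step(g, x, alpha=0.5, beta=0.1, n_trials=5, rng=rng)
    history.append(g(x))
print(x, history[-1])
```

Each iteration costs N + 1 function evaluations, so the gain from a better direction has to be weighed against the extra observations; this trade-off is what the efficiency comparison of Section 7.2 addresses.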
Statistical Gradient Random Search Algorithm (Algorithm RS-5)
This algorithm can be described as follows.
1 Choose N > 1 independent random points $x_i + \beta_i\Xi_{ik}$ on the sphere $\{x_i + \beta_i\Xi_i\}$, where $\Xi_i$ is a random vector continuously distributed on the unit sphere with realizations $\Xi_{ik}$, $k = 1, \ldots, N$.
2 Calculate the sequence of increments

$$Y_{ik} = g(x_i + \beta_i\Xi_{ik}) - g(x_i), \quad k = 1, \ldots, N. \qquad (7.1.12)$$

3 Set

$$\bar{Y}_i = \frac{1}{N}\sum_{k=1}^{N} Y_{ik}\Xi_{ik}. \qquad (7.1.13)$$

4 The point $x_{i+1}$ is chosen according to

$$x_{i+1} = x_i + \alpha_i\beta_i^{-1}\bar{Y}_i, \quad \alpha_i > 0, \ \beta_i > 0. \qquad (7.1.14)$$
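The four steps above can be sketched in Python as follows. The sketch averages the trial directions weighted by their increments and moves along the average; for a linear function g(x) = bx the expected value of the average divided by β is b/n, which the usage example checks numerically. Constant α and β, the test function, and the function names are assumptions made for illustration.

```python
import math
import random


def unit_vector(n, rng):
    """Draw a direction uniformly distributed on the unit sphere in R^n."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def statistical_gradient(g, x, beta, n_trials, rng):
    """Average of the trial directions weighted by their increments (steps 1-3)."""
    gx = g(x)
    avg = [0.0] * len(x)
    for _ in range(n_trials):
        xi = unit_vector(len(x), rng)
        y = g([xk + beta * c for xk, c in zip(x, xi)]) - gx
        for k, c in enumerate(xi):
            avg[k] += y * c / n_trials
    return avg


def rs5_step(g, x, alpha, beta, n_trials, rng):
    """One statistical gradient step (step 4)."""
    avg = statistical_gradient(g, x, beta, n_trials, rng)
    return [xk + (alpha / beta) * a for xk, a in zip(x, avg)]


def g(x):
    """Linear test function with gradient b = (2, -1)."""
    return 2.0 * x[0] - x[1]


rng = random.Random(0)
# With many trials the scaled average approaches b / n = (1.0, -0.5) for n = 2.
direction = [a / 0.1 for a in statistical_gradient(g, [0.0, 0.0], 0.1, 5000, rng)]
x_new = rs5_step(g, [0.0, 0.0], alpha=0.5, beta=0.1, n_trials=10, rng=rng)
print(direction, x_new)
```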
Thus given $x_i$, the next point $x_{i+1}$ is chosen in the direction $\bar{Y}_i$, which is the result of averaging the sample $\Xi_{i1}, \ldots, \Xi_{iN}$ weighted with the corresponding increments (7.1.12). In the particular case where N = n and

$$\Xi_{ik} = e_k = (0, \ldots, 0, 1, 0, \ldots, 0), \quad k = 1, \ldots, n,$$

with the 1 in the kth position, we obtain the following finite difference gradient algorithm:

$$x_{i+1} = x_i + \alpha_i \hat{\nabla} g(x_i), \qquad (7.1.15)$$

where

$$\hat{\nabla} g(x) = \left(\frac{g(x_1 + \beta_1, x_2, \ldots, x_n) - g(x)}{\beta_1}, \ldots, \frac{g(x_1, \ldots, x_n + \beta_n) - g(x)}{\beta_n}\right).$$

It is not difficult to prove that for a linear function the direction of $\bar{Y}_i$, on the average, coincides with that of the gradient of g(x). This is the reason why the algorithm is called the "statistical gradient algorithm."

Consider the following stochastic optimization problem:

$$\max_{x \in D \subset R^n} E[\phi(x, W)] = \max_{x \in D \subset R^n} g(x) = g(x^*) = g^*. \qquad (7.1.16)$$

Here $\phi(x, W)$ is a function of two variables, x and W, $x^*$ is the optimal point of g(x), which is assumed to be unique, and W is an r.v. with unknown p.d.f. $f_W(w)$. We assume that at each point $x \in D$ only an individual realization of $\phi(x, W)$ can be observed. It is clear that, if the p.d.f. $f_W(w)$ is unknown, problem (7.1.16) cannot be solved analytically. However, numerical methods can be applied.
One widely used numerical method for solving (7.1.16) is the stochastic approximation method. This method was originated by Robbins and Monro [30], who suggested a procedure for finding a root of a regression function measured with noise. Kiefer and Wolfowitz [19] considered a procedure for finding $x^*$ in the optimization problem (7.1.16) where $x \in R^1$. The procedures of Robbins-Monro and Kiefer-Wolfowitz were generalized by Dvoretzky [8]. Hundreds of papers and many books have been written in the past 15 years about stochastic approximation procedures, their convergence, and their applications. The reader is referred to Wilde [44] and Wasan [43]. We consider the following algorithm:

$$x_{i+1} = x_i + \alpha_i \hat{\nabla}\phi(x_i, W_i), \qquad (7.1.17)$$

where

$$\hat{\nabla}\phi(x, W) = \left(\frac{\phi(x_1 + \beta_1, x_2, \ldots, x_n, W_{11}) - \phi(x_1 - \beta_1, x_2, \ldots, x_n, W_{12})}{2\beta_1}, \ldots, \frac{\phi(x_1, x_2, \ldots, x_n + \beta_n, W_{n1}) - \phi(x_1, x_2, \ldots, x_n - \beta_n, W_{n2})}{2\beta_n}\right)$$

is the estimate of the gradient $\nabla g(x)$. It is readily seen that in the absence of noise, that is, when W = 0, $\hat{\nabla}\phi(x, W) = \hat{\nabla}g(x)$ and (7.1.17) coincides with (7.1.3). In addition, if the realizations of the noise are independent and E(W) = 0, then $\hat{\nabla}\phi(x, W)$ is an unbiased estimator of $\nabla g(x)$. Proof of convergence of algorithm (7.1.17) to $x^*$, subject to some conditions on the sequences $\{\alpha_i\}_{i=1}^{\infty}$ and $\{\beta_i\}_{i=1}^{\infty}$ and the function $\phi(x, W)$, can be found, for instance, in Dvoretzky [8], Gladyshev [13], and Wasan [43]. It is not difficult to understand that the random search algorithms can also be used for solving problem (7.1.16). For instance, by analogy with (7.1.17) the random search double trial algorithm (Algorithm RS-1) can be written as

$$x_{i+1} = x_i + \frac{\alpha_i}{2\beta_i}\big[\phi(x_i + \beta_i\Xi_i, W_{i1}) - \phi(x_i - \beta_i\Xi_i, W_{i2})\big]\Xi_i. \qquad (7.1.18)$$

We can see that, for the same reasons as the random search algorithm (7.1.4) extends the gradient algorithm (7.1.3), the random search algorithm (7.1.18) extends the stochastic approximation algorithm (7.1.17). Proof of convergence of (7.1.18) to $x^*$ can be found in Rubinstein [31]. In analogy with (7.1.18) we can adapt any of the random search algorithms RS-2 through RS-5 for solving problem (7.1.16).
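A minimal Python sketch of the noisy double trial random search is given below. The source does not state the conditions on the step sequences; the sketch assumes the commonly used choices $\alpha_i = 1/(i + 1)$ and $\beta_i = i^{-1/4}$, for which $\sum \alpha_i = \infty$ and $\sum (\alpha_i/\beta_i)^2 < \infty$. The noisy test function, its noise level, and the function names are likewise hypothetical.

```python
import math
import random


def unit_vector(n, rng):
    """Draw a direction uniformly distributed on the unit sphere in R^n."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def noisy_rs1(phi, x0, n_iter, rng):
    """Double trial random search using only noisy observations phi(x, rng)."""
    x = list(x0)
    for i in range(1, n_iter + 1):
        alpha = 1.0 / (i + 1)          # assumed step sequence
        beta = i ** -0.25              # assumed trial-length sequence
        xi = unit_vector(len(x), rng)
        plus = [xk + beta * c for xk, c in zip(x, xi)]
        minus = [xk - beta * c for xk, c in zip(x, xi)]
        # Two independent noisy observations, one per trial point.
        scale = alpha * (phi(plus, rng) - phi(minus, rng)) / (2.0 * beta)
        x = [xk + scale * c for xk, c in zip(x, xi)]
    return x


def phi(x, rng):
    """Noisy observation of g(x) = -|x|^2 with additive N(0, 0.1^2) noise."""
    return -sum(c * c for c in x) + rng.gauss(0.0, 0.1)


def g(x):
    """Noise-free objective, used here only to check the result."""
    return -sum(c * c for c in x)


x_final = noisy_rs1(phi, [3.0, -4.0], n_iter=2000, rng=random.Random(0))
print(x_final, g(x_final))
```

The decreasing sequences damp the effect of the observation noise, at the price of slower progress than in the deterministic case.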
7.2 EFFICIENCY OF THE RANDOM SEARCH ALGORITHMS
The random search algorithms can be compared according to different criteria. Usually, they are compared according to their local and integral properties [28, 29]. Local properties are associated with a single iteration of the random search algorithm, integral properties with many iterations. Comparing different algorithms according to integral properties we usually define:
1 The initial condition from which search starts.
2 A set of test functions (linear, quadratic, parabolic, multiextremal, etc.) for which the extremum is sought.
3 Some criteria that must be achieved during optimization. The following criteria can be used. Find an index k corresponding to the best algorithm among S algorithms available, such that: (a) where the number of iterations i is given.
It is readily seen that the first three problems are associated with finding the best algorithm when the number of iterations i is given; the last two involve finding the best algorithm that hits, in the minimum number of iterations, a given region $R_1$ or $R_2$ containing the extremum point $x^*$. In Section 7.3 we consider some local and integral properties of Algorithm RS-4. Generally, the problem of comparison of different algorithms according to their integral properties is difficult to solve. Some attempts to overcome this difficulty have been made by Rastrigin [28]. Another interesting problem is how to find the optimal combination of algorithms, each of which is capable of finding the extremum of g(x). This problem is solved in Rubinstein [33] and uses Bellman's principle of optimality. Now we consider some local properties of the random search algorithms, assuming that some point $x_i$ has been reached, and that we are allowed to make only a single step (iteration). Let $x_{i+1}^{(s)}$, $s = 1, \ldots, S$, be the point (the state of the system) after this single iteration. Let us define the efficiency of the random search algorithms as (7.2.1) where
that is, where Axfa)is the projection of the vector x!:), - x, on the direction of the vector x, - x*. and 4(*'is the number of observations (measurements) of g(x) required for the algorithms in the ith step. For simplicity we consider only the case where g ( x ) is approximately a linear function, which is the same as to assume that in Taylor expansion g ( ~ , + , ) = ~ ( ~ , + ~ X , ) ~ ~ ( ~ , ) + ~ ~ ~ (7.2.2) , ~ V ~ ( ~
Therefore at each iteration made by the random search algorithms, we approximate g(x) linearly on the interval $\Delta x_i$. It is proven in [32] that, for a rather wide class of functions optimized by random search algorithms under the conditions [two conditions on infinite sums over $i = 1, 2, \dots$; illegible in this copy], there exists a number $I$, sufficiently large, such that for $i \ge I$ a linear approximation of g(x), that is, (7.2.2), is valid. Substituting (7.2.2) in any of the four random search Algorithms RS-1, RS-2, RS-4, and RS-5 (see, respectively, (7.1.4), (7.1.6), (7.1.11), and (7.1.14)), we readily obtain

    $g(x_{i+1}^{(s)}) = g(x_i) + a_i^{(s)} \|\nabla g(x_i)\| \cos\varphi_i^{(s)} + o(\Delta x_i^{(s)})$,    (7.2.3)

where the remaining quantities are defined in (7.2.4), and s = 1, 2, 4, 5 corresponds to RS-1, RS-2, RS-4, and RS-5. The distribution of $\varphi_i^{(s)}$ depends on the specific algorithm and on the distribution of the random vector $\Xi_i^{(s)}$. Let us assume without loss of generality that $a_i^{(s)} = 1$. Then, taking into account that for a linear function g(x) the direction of the vector $x^* - x_i$ coincides with the direction of the gradient $\nabla g(x_i)$, we can express the efficiency $C_n$ (see (7.2.1)) as
    $C_n = \dfrac{E(\cos\varphi_i^{(s)})}{E(l_i^{(s)})}$.    (7.2.5)
We consider here only the efficiencies of the random search Algorithms RS-1 and RS-4, assuming that the vector $\Xi$ is uniformly distributed on the surface of the unit n-dimensional sphere.

(a) The Double Trial Random Search Algorithm RS-1. It follows from (7.1.4), (7.2.3), and (7.2.4) that the step is governed by $\varphi_i^{(1)}$, where $\varphi_i^{(1)}$ is a random angle between the vector $\Xi_i^{(1)}$, uniformly distributed on the n-dimensional sphere, and the vector $\nabla g(x_i)$. We assume here that the direction of the gradient corresponds to $\varphi_i^{(1)} = 0$. Furthermore, it follows from (7.2.5) that the distribution of $\varphi_i^{(1)}$ does not depend on i; therefore the index i can be omitted. We also omit for convenience the index (1) in $\varphi^{(1)}$. It is shown in the Appendix that $\varphi$ has a p.d.f.*

    $h_n(\varphi) = B_n \sin^{n-2}\varphi$,  $-\dfrac{\pi}{2} \le \varphi \le \dfrac{\pi}{2}$,    (7.2.6)

where

    $B_n = \dfrac{\Gamma(n/2)}{\sqrt{\pi}\,\Gamma((n-1)/2)}$.    (7.2.7)

*We use for convenience $-\pi/2 \le \varphi \le \pi/2$ rather than $0 \le \varphi \le \pi$ (see Appendix).
Since for Algorithm RS-1 we need two observations of g(x), namely $g(x + \beta\Xi)$ and $g(x - \beta\Xi)$, the efficiency $C_n$ (see (7.2.5)) is

    $C_n^{(1)} = \dfrac{E(\cos\varphi)}{2}$.    (7.2.8)
The expected value and the variance of $\cos\varphi$ are given, respectively, by (7.2.9) and (7.2.10). Substituting (7.2.9) in (7.2.8), we obtain

    $C_n = \dfrac{B_n}{n-1}$,    (7.2.11)

so that $E(\cos\varphi) = 2B_n/(n-1)$, and the following relationships can also be easily verified:
Table 7.2.1 and Fig. 7.2.1 represent the efficiency $C_n$ and $\mathrm{var}(\cos\varphi) = \sigma^2$ as a function of space size n, from which it follows that, as n increases, both the efficiency and the variance decrease. When $n \to \infty$, $E(\cos\varphi) \to 0$ and $C_n \to 0$, that is, the random search Algorithm RS-1 becomes inefficient.

Table 7.2.1 The Efficiency and $\sigma^2$ as Functions of n for Algorithm RS-1

     n     C_n       sigma^2    C_n/sigma
     2     0.3184    0.5995     0.4112
     3     0.25      0.416      0.3876
     4     0.2125    0.314      0.3792
     5     0.1875    0.26       0.3677
     6     0.1702    0.221      0.3602
     7     0.1556    0.1957     0.3518
     8     0.1452    0.166      0.3564
     9     0.1367    0.1401     0.3652
    10     0.1294    0.1344     0.3529
    11     0.123     0.2268     0.3538

Fig. 7.2.1 The efficiency and $\sigma^2$ as functions of n for Algorithm RS-1.
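The closed form (7.2.11) can be checked numerically. The following is a minimal sketch, not part of the original text: it assumes that $B_n = \Gamma(n/2)/(\sqrt{\pi}\,\Gamma((n-1)/2))$ and that the double trial of RS-1 moves along whichever of $\pm\Xi$ gives the better observation, so that the projected step is $|\cos\varphi|$. The function names are illustrative only.

```python
import math
import random


def b_n(n: int) -> float:
    """Normalizing constant B_n of the density B_n * sin(phi)**(n - 2) on [0, pi], n >= 2."""
    return math.gamma(n / 2) / (math.sqrt(math.pi) * math.gamma((n - 1) / 2))


def rs1_efficiency(n: int) -> float:
    """Closed form (7.2.11): C_n = B_n / (n - 1)."""
    return b_n(n) / (n - 1)


def rs1_efficiency_mc(n: int, samples: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of E|cos(phi)| / 2 for a direction uniform on the unit sphere."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        # A normalized Gaussian vector is uniformly distributed on the sphere.
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(c * c for c in z))
        # Gradient taken along the first axis; the double trial keeps the better of +z and -z.
        total += abs(z[0]) / norm
    # Two observations of g are spent per step.
    return total / samples / 2.0


if __name__ == "__main__":
    for n in (2, 3, 4, 5, 11):
        print(n, round(rs1_efficiency(n), 4), round(rs1_efficiency_mc(n), 4))
```

Under these assumptions the closed form gives 0.25 for n = 3 and 0.1875 for n = 5, in agreement with the efficiency column of Table 7.2.1.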
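(b) The Optimum Trials Random Search Algorithm RS-4. It follows from (7.1.11), (7.2.3), and (7.2.4) that the step is governed by $\varphi_i^{(4)}$, where $\cos\varphi_i^{(4)} = \max(\cos\varphi_{i1}, \dots, \cos\varphi_{iN})$. Since the distribution of $\varphi_i^{(4)}$ does not depend on the step number i, we can again omit the index i. We also omit for convenience the index (4) in $\varphi^{(4)}$. To find the efficiency of Algorithm RS-4 let us find the distribution of $V = \cos\varphi$, where $\varphi$ is distributed (compare with (7.2.6) and (7.2.7)) with p.d.f.

    $h_n(\varphi) = B_n \sin^{n-2}\varphi$,  $0 \le \varphi \le \pi$,

and

    $B_n = \dfrac{\Gamma(n/2)}{\sqrt{\pi}\,\Gamma((n-1)/2)}$.

By the transformation method (see Section 3.5.2) we obtain

    $P_n(v) = B_n (1 - v^2)^{(n-3)/2}$,  $-1 \le v \le 1$.    (7.2.12)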
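The c.d.f. and p.d.f. of $V_N^0 = \max(V_1, \dots, V_N)$ are, respectively,

    $F_N^0(v) = [F_n(v)]^N$    (7.2.13)

and

    $P_N^0(v) = N[F_n(v)]^{N-1} P_n(v)$,    (7.2.14)

where $F_n$ is the c.d.f. corresponding to $P_n$. The expected value and the variance of $V_N^0$ are, respectively,

    $E(V_N^0) = \int_{-1}^{1} v\,P_N^0(v)\,dv$    (7.2.15)

and

    $\mathrm{var}(V_N^0) = \int_{-1}^{1} v^2 P_N^0(v)\,dv - [E(V_N^0)]^2$.    (7.2.16)

For n = 3 we have

    $P_N^0(v) = \dfrac{N}{2^N}(1 + v)^{N-1}$,    (7.2.17)

    $E(V_N^0) = \dfrac{N-1}{N+1}$,    (7.2.18)

    $\mathrm{var}(V_N^0) = \dfrac{4N}{(N+1)^2(N+2)}$.    (7.2.19)

It follows from (7.2.5) that the efficiency of Algorithm RS-4 is

    $C_n = \dfrac{E(V_N^0)}{N}$.    (7.2.20)

For n = 3 we obtain

    $C_3 = \dfrac{N-1}{N(N+1)}$.    (7.2.21)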
The optimal value of $C_3$ equals 1/6 and is achieved when N is equal to 2 or 3. Generally, it is difficult to find $C_n$ and $\mathrm{var}(V_N^0)$ for n > 3. Table 7.2.2 and Fig. 7.2.2 represent simulation results for $C_n$ and $\mathrm{var}(V_N^0)$ as a function of n for the optimal number of trials $N^*$ on the basis of 100 runs. It is interesting to note that the optimal $N^* = 2$ and does not depend on n. Comparing Algorithms RS-1 and RS-4 for a linear function, we conclude that RS-1 is more efficient than RS-4 for all n > 1. The variance associated with Algorithm RS-4 for the optimal $N^* = 2$ is always less than that associated with RS-1. The intuitive explanation can be given as follows. Taking two random trials according to Algorithm RS-1, we always find a feasible random direction toward the extremum, which is generally not true for Algorithm RS-4. Indeed, the probability of finding such a direction (success) in N independent trials is equal to

    $P(N) = 1 - (1 - p)^N$.

Here p is the probability of success in a single trial. Taking into account that for a linear function p = 1/2, we obtain, for the optimal $N^* = 2$, $P(N^* = 2) = 3/4$, that is, the probability of a success in Algorithm RS-4 is equal to 3/4. Defining the efficiency as $C_n^{(s)}/\sigma^{(s)}$, where $\sigma^{(s)} = (\mathrm{var}(\cos\varphi^{(s)}))^{1/2}$, we see from Tables 7.2.1 and 7.2.2 that both Algorithms RS-1 and RS-4 have approximately the same efficiency.

Table 7.2.2 The Efficiency and $\mathrm{var}(V_N^0)$ as Functions of n for Algorithm RS-4

     n     C_n      var(V_N^0)    N*    C_n/sigma
     3     0.198    0.236         2     0.4075
     4     0.159    0.171         2     0.3845
     5     0.137    0.134         2     0.3743
     6     0.121    0.110         2     0.3647
     7     0.109    0.053         2     0.3575
     9     0.092    0.070         2     0.3478
    11     0.081    0.050         2     0.3622

Note: The sample size is equal to 100.

Fig. 7.2.2 The efficiency and $\mathrm{var}(V_N^0)$ as functions of n for Algorithm RS-4 (the sample size is equal to 100).
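The n = 3 case of Algorithm RS-4 can be verified with exact arithmetic. The sketch below is illustrative and rests on two assumptions: that (7.2.21) reads $C_3 = (N-1)/(N(N+1))$, the form consistent with the optimum being attained at N = 2 or 3, and that the single-trial success probability for a linear function is p = 1/2. The function names are hypothetical.

```python
from fractions import Fraction


def rs4_efficiency_n3(num_trials: int) -> Fraction:
    """Efficiency of RS-4 for n = 3 with N trials: (N - 1) / (N (N + 1))."""
    return Fraction(num_trials - 1, num_trials * (num_trials + 1))


def success_probability(num_trials: int, p: Fraction = Fraction(1, 2)) -> Fraction:
    """Probability of at least one improving direction in N independent trials."""
    return 1 - (1 - p) ** num_trials


if __name__ == "__main__":
    for n_trials in range(1, 7):
        print(n_trials, rs4_efficiency_n3(n_trials), success_probability(n_trials))
```

Scanning N = 1, 2, ... shows the efficiency rising from 0 to 1/6 at N = 2 and N = 3 and then decreasing, while the success probability at N = 2 is 3/4.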
7.3 LOCAL AND INTEGRAL PROPERTIES OF THE OPTIMUM TRIAL RANDOM SEARCH ALGORITHM RS-4
This section is based on Ref. 35.

7.3.1 Local Properties of the Algorithm
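The term "local properties" refers here to convergence of the vector $x_{i+1} - x_i$ to the direction of greatest increase of the function g(x), as the number of trials m tends to infinity. Assume that g(x) is a continuous function and

    $\phi(x, W) = g(x) + W$,  $x \in D \subset R^n$,    (7.3.1)

that is, each measurement of the function g(x) is accompanied by additive noise W, and assume that the vector $\Xi$ is continuously distributed on the unit sphere with a density $f(\Xi)$. Let B be the set on the surface of the unit sphere defined by the condition $f(\Xi) > 0$ and let $\bar{B}$ be the closure of B. Let us also assume that the maximum

    $\max_{\Xi \in \bar{B}} g(x + \beta\Xi) = g(x + \beta\Xi^0)$,  $\Xi^0 \in \bar{B}$,    (7.3.2)

occurs at the unique point $x + \beta\Xi^0$. We are concerned with the asymptotic behavior of the sequence of optimum-trial directions $\{\Xi_m^0\}_{m=1}^{\infty}$, where $\Xi_m^0$ is the direction with the largest measured value among the first m trials:

    $g(x + \beta\Xi_m^0) + W_m^0 = \max_{1 \le j \le m} [g(x + \beta\Xi_j) + W_j]$.    (7.3.3)

Theorem 7.3.1 Vector $\Xi^0$ is almost surely (a.s.) the only limiting vector of the sequence $\{\Xi_m^0\}_{m=1}^{\infty}$ if and only if the noise W satisfies the following property: for a.s. any sequence $\{W_k\}_{k=1}^{\infty}$ of realizations of W and for any c > 0, there exists a natural number $K_c$ (which depends on the sequence) such that

    $W_k < \bar{W}_k + c$,  $k \ge K_c$,    (7.3.4)

where

    $\bar{W}_k = \max_{1 \le j \le k-1} W_j$.    (7.3.5)

Proof (1) Sufficiency. Suppose, to the contrary, that $\Xi^0$ is not the only limiting vector. Then there exist $\delta > 0$ and a subsequence of optimum-trial directions $\{\Xi_{m_k}\}$ such that

    $g(x + \beta\Xi_{m_k}) + W_{m_k} > g(x + \beta\Xi_j) + W_j$,  $1 \le j \le m_k - 1$,    (7.3.6)

and at the same time $\Xi_{m_k} \notin S(\Xi^0, \delta)$, where $S(\Xi^0, \delta)$ denotes the $\delta$-neighborhood of $\Xi^0$.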
Continuity of g(x) implies that we can choose $\eta > 0$ and $\delta_1 < \delta$ such that

    $\inf_{\Xi \in \bar{B} \cap S(\Xi^0, \delta_1)} g(x + \beta\Xi) > \sup_{\Xi \in \bar{B} \setminus S(\Xi^0, \delta)} g(x + \beta\Xi) + 2\eta$.    (7.3.7)

(a) The case of unbounded noise. Assume the sequence $\{W_{m_k}\}_{k=1}^{\infty}$ is unbounded and satisfies

    $W_{m_k} < \bar{W}_{m_k} + \eta$,  $k \ge K_\eta$.    (7.3.8)

Denote by $\bar{m}_k$ the number of the trial in which the maximum $\bar{W}_{m_k}$ is achieved, that is, $W_{\bar{m}_k} = \bar{W}_{m_k}$ and $\bar{m}_k < m_k$ hold. The sequence of indices $\{\bar{m}_k\}_{k=1}^{\infty}$ is a.s. unbounded, because $\{W_{m_k}\}_{k=1}^{\infty}$ is a.s. unbounded. Therefore the event

    $\Xi_{\bar{m}_{k_0}} \in \bar{B} \cap S(\Xi^0, \delta_1)$

will a.s. occur for some $k_0 > K_\eta$, since at each trial there is a constant nonzero probability of its occurrence. Comparing the results obtained in trials $\bar{m}_{k_0}$ and $m_{k_0}$, it follows from (7.3.7) and (7.3.8) that

    $g(x + \beta\Xi_{\bar{m}_{k_0}}) + W_{\bar{m}_{k_0}} > g(x + \beta\Xi_{m_{k_0}}) + W_{m_{k_0}} + \eta$,

which contradicts (7.3.6). Q.E.D.

(b) The case of bounded noise. If $\sup W = W_{\max} < \infty$, then the sequence $\{W_m\}_{m=1}^{\infty}$ a.s. contains an infinite subsequence $\{W_{m_i}\}_{i=1}^{\infty}$ such that

    $W_{\max} - \eta < W_{m_i} \le W_{\max}$,  $i = 1, 2, \dots$.

On the other hand, there exists a.s. a particular subscript $m_{i_0}$ such that

    $\Xi_{m_{i_0}} \in \bar{B} \cap S(\Xi^0, \delta_1)$.

Thus for any $m > m_{i_0}$ satisfying $\Xi_m \notin S(\Xi^0, \delta)$,

    $g(x + \beta\Xi_{m_{i_0}}) + W_{m_{i_0}} > g(x + \beta\Xi_m) + W_{\max} + \eta > g(x + \beta\Xi_m) + W_m$,

which contradicts (7.3.6). Q.E.D.
(2) Necessity. Assume that the set $\mathcal{C}$ of sequences $\{W_k\}_{k=1}^{\infty}$ not satisfying the theorem's condition has a probability $P(\mathcal{C}) > 0$. For each sequence from $\mathcal{C}$ there exists a number c > 0 and a subsequence $\{W_{k_i}\}_{i=1}^{\infty}$ such that

    $W_{k_i} \ge \bar{W}_{k_i} + c$.    (7.3.9)

Our task now is to prove that with probability $P(\mathcal{C})$ the vector $\Xi^0$ is not the only limiting vector of the sequence $\{\Xi_m^0\}_{m=1}^{\infty}$. What we actually prove is a somewhat stronger statement: namely, that the set of limiting vectors contains the set

    $V_c = \bar{B} \cap \{\Xi \mid g(x + \beta\Xi^0) - c < g(x + \beta\Xi) \le g(x + \beta\Xi^0)\}$.    (7.3.10)

To prove this statement it suffices to show that for any $y \in V_c$ and any $\delta > 0$ the sequence $\{\Xi_m^0\}_{m=1}^{\infty}$ will visit the neighborhood $S(y, \delta)$ infinitely often. Indeed, for any trial there exists a constant positive probability of entering the set $S(y, \delta) \cap V_c$. This implies that the subsequence of trials $\{k_i\}$ satisfying (7.3.9) a.s. contains a new subsequence $\{k_{i_l}\}$ such that $\Xi_{k_{i_l}} \in S(y, \delta) \cap V_c$ holds. The vectors $\Xi_{k_{i_l}}$ will be optimum-trial directions, since for any j, $1 \le j \le k_{i_l} - 1$,

    $g(x + \beta\Xi_{k_{i_l}}) + W_{k_{i_l}} > g(x + \beta\Xi^0) + \bar{W}_{k_{i_l}} \ge g(x + \beta\Xi_j) + W_j$.
Q.E.D.

Remark In the case without noise (W = 0 a.s.) we can explicitly calculate the number of trials required to enter a prescribed $\delta$-neighborhood $S(\Xi^0, \delta)$ of the point $\Xi^0$ with a prescribed probability p. Define

    $a = P\{\Xi \in S(\Xi^0, \delta)\}$,

that is, a is the probability of visiting $S(\Xi^0, \delta)$ at each single trial. The probability of visiting $S(\Xi^0, \delta)$ at least once by making m trials is equal to

    $p_m = 1 - (1 - a)^m$.    (7.3.11)

Thus if we want $p_m \ge p$, it suffices to produce

    $m \ge \dfrac{\ln(1 - p)}{\ln(1 - a)}$    (7.3.12)

trials. In the case where p = 1 - a,

    $m \ge \dfrac{\ln a}{\ln(1 - a)}$.    (7.3.13)

Table 7.3.1 shows some values of m as a function of a.
Table 7.3.1 Dependence of m on a

    a        m
    0.500       1
    0.200       8
    0.100      22
    0.050      58
    0.020     194
    0.010     458
    0.005    1057
    0.002    3104
    0.001    6903
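The bound on the number of trials follows directly from requiring $1 - (1 - a)^m \ge p$. A minimal sketch (the function name `trials_needed` is illustrative, not from the text) computes the smallest integer m that satisfies it:

```python
import math


def trials_needed(a: float, p: float) -> int:
    """Smallest integer m with 1 - (1 - a)**m >= p, for 0 < a < 1 and 0 < p < 1."""
    if not (0.0 < a < 1.0 and 0.0 < p < 1.0):
        raise ValueError("a and p must lie strictly between 0 and 1")
    m = math.ceil(math.log(1.0 - p) / math.log(1.0 - a))
    return max(m, 1)


if __name__ == "__main__":
    # The case p = 1 - a tabulated in Table 7.3.1.
    for a in (0.5, 0.2, 0.1, 0.05, 0.02, 0.01, 0.005, 0.002, 0.001):
        print(a, trials_needed(a, 1.0 - a))
```

The tabulated values agree with the ratio $\ln a / \ln(1 - a)$ up to rounding; the sketch always rounds up so that the probability requirement is met.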
7.3.2 Integral Properties of the Algorithm

The term "integral properties" refers to convergence of Algorithm RS-4 to the point of extremum $x^*$.
Theorem 7.3.2 Suppose that g(x) has bounded second derivatives. Let

    $E(\|\xi_i\|^2 \mid x_0, x_1, \dots, x_i) \le h_i^2 < \infty$    (7.3.14)

for $\|x_j\| \le r < \infty$, $j = 0, 1, \dots, i$, where $\xi_i = \beta_i^{-1}\Xi_i^0$. Let the normalizing factor $\gamma_i$ satisfy the condition

    $0 < \gamma_i(\tau_i\|x_i\| + h_i) < \infty$,    (7.3.15)

where $\tau_i = 1$ if $\|V_i\| > 0$ and $\tau_i = 0$ if $\|V_i\| = 0$ ($V_i$ is defined in (7.3.18)), and let $\alpha_i$ and $\beta_i$ be such that

    $\alpha_i \ge 0$,  $\beta_i \ge 0$,  $\sum_{i=1}^{\infty} \alpha_i\beta_i < \infty$,  $\sum_{i=1}^{\infty} \alpha_i^2 < \infty$,  $\sum_{i=1}^{\infty} \alpha_i = \infty$;    (7.3.16)

then the optimal trial random search algorithm

    $x_{i+1} = \pi(x_i - \alpha_i\gamma_i\xi_i)$    (7.3.17)

converges a.s. to $x^*$. Here $\pi(\cdot)$ denotes the projection operator on D (i.e., for every $x \in R^n$, $\pi(x) \in D$ and $\|x - \pi(x)\| = \min_{y \in D}\|x - y\|$).

Proof Since g(x) has bounded second derivatives, it is readily shown that

    $E(\xi_i \mid x_i) = C_i\nabla g(x_i) + \beta_i V_i$,    (7.3.18)

where $C_i$ and the vector $V_i$ have bounded components, that is, $C_i < \infty$, $\|V_i\| < \infty$. Further, convergence of (7.3.17) to $x^*$ follows from Ref. 10, Theorem 1.
7.4 MONTE CARLO METHOD FOR GLOBAL OPTIMIZATION
(a) Deterministic Optimization Problem. The problem of finding the global extremum of g(x) (see (7.1.1)) has been approached in a number of different ways. The earliest methods were associated with the grid technique, in which the function was evaluated at equispaced points throughout D. We shall consider only Evtushenko's algorithm [11] in such a deterministic sense. Some other deterministic approaches for global optimization are given in Dixon [7], Shubert [37], and Strongin [39]. Evtushenko makes the following assumptions about the function and the objective:

1 The function satisfies the Lipschitz condition, that is, $|g(x_1) - g(x_2)| \le L\|x_1 - x_2\|$ for all $x_1, x_2 \in D$, where L > 0.
2 Each $x \in D_\varepsilon$, where $D_\varepsilon = \{x : |g(x) - g(x^*)| < \varepsilon\}$, is accepted as an approximation for $x^*$.

Evtushenko's algorithm is as follows.

Algorithm GI-1
1 Evaluate the function at N equispaced points $x_1, \dots, x_N$ throughout D and define $y_k = g(x_k)$, $k = 1, \dots, N$.
2 Estimate $g^*$ by $M_N = \max(y_1, \dots, y_N)$.

The theoretical background to this approach is very simple. Let $V_i$ be the sphere $\|x - x_i\| \le r_i$, where $r_i = L^{-1}(M_N - g(x_i) + \varepsilon)$. Then for any $x \in V_i$, $g(x) \le g(x_i) + L r_i = M_N + \varepsilon$. Hence if the spheres $V_i$, $i = 1, \dots, N$, cover the whole set D, then $M_N$ cannot differ from $g^*$ by more than $\varepsilon$, and the problem is solved. In the simplest case where D is an interval, $a \le x \le b$, Evtushenko proposed the following procedure:

    $x_1 = a + \dfrac{\varepsilon}{L}$,  $M_1 = g(x_1)$,

    $x_{k+1} = x_k + \dfrac{2\varepsilon + M_k - g(x_k)}{L}$,  $M_k = \max(g(x_k), M_{k-1})$.

The number of function evaluations required to solve the problem is greatest in the case of a monotonically increasing function, namely

    $N = \dfrac{L(b - a)}{2\varepsilon}$.
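The one-dimensional covering procedure can be sketched as follows. This is an illustrative sketch written for maximization, so the step uses the running record $M_k$ minus $g(x_k)$; the function name, argument names, and test function are assumptions, and the last point is clipped to b, a detail the text does not specify.

```python
import math


def evtushenko_max_1d(g, a, b, lipschitz, eps):
    """Return (record, argmax estimate, evaluations) with max g - record <= eps on [a, b]."""
    x = min(a + eps / lipschitz, b)
    gx = g(x)
    record, best_x, evals = gx, x, 1
    # Right end of the stretch on which g <= record + eps is already certified.
    covered = x + (record + eps - gx) / lipschitz
    while covered < b:
        # Same as x_{k+1} = x_k + (2*eps + M_k - g(x_k)) / L, clipped to b.
        x = min(covered + eps / lipschitz, b)
        gx = g(x)
        evals += 1
        if gx > record:
            record, best_x = gx, x
        covered = x + (record + eps - gx) / lipschitz
    return record, best_x, evals


if __name__ == "__main__":
    f = lambda t: math.sin(3.0 * t) + 0.5 * t  # |f'| <= 3.5 on the whole line
    print(evtushenko_max_1d(f, 0.0, 4.0, 3.5, 0.01))
```

The step grows wherever the current value lies well below the record, which is why a monotonically increasing function, for which every new value is the record, is the most expensive case.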
Most algorithms for global optimization contain random elements and are related to the Monte Carlo method. We consider some such algorithms. Brooks [4] suggested, for solving problem (7.1.1), the following "pure" random search algorithm.

Algorithm GI-2
1 Generate $X_1, \dots, X_N$ from any p.d.f. $f_X(x)$ such that $f_X(x) > 0$ when $x \in D$.
2 Find $Y_k = g(X_k)$, $k = 1, \dots, N$.
3 Estimate $g^*$ by $M_N = \max(Y_1, \dots, Y_N)$.
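The three steps of the pure random search translate almost line by line into code. The sketch below is illustrative: the sampler argument stands for any density that is positive on D, and the function name and the test function are assumptions.

```python
import random


def pure_random_search(g, sample, n_points, rng):
    """Estimate the maximum of g from n_points independent draws; returns (M_N, best point)."""
    best_x, best_y = None, float("-inf")
    for _ in range(n_points):
        x = sample(rng)   # step 1: draw X_k from a density positive on D
        y = g(x)          # step 2: Y_k = g(X_k)
        if y > best_y:    # step 3: keep the running maximum M_N
            best_x, best_y = x, y
    return best_y, best_x


if __name__ == "__main__":
    g = lambda x: -(x - 0.3) ** 2
    uniform_on_d = lambda rng: rng.uniform(0.0, 1.0)
    print(pure_random_search(g, uniform_on_d, 2000, random.Random(1)))
```

Because $M_N$ is a running maximum, it never decreases as N grows, which is the monotonicity behind the almost sure convergence stated in Proposition 7.4.1.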
This algorithm was also discussed in Ref. 36. Our nomenclature follows that reference, and our discussion is based on it. Let P be the probability measure defined on $\mathcal{B}$, the Borel $\sigma$-field of D, so that $(D, \mathcal{B}, P)$ is a probability space. Let $g^{-1}(a, b] = \{x \in D : a < g(x) \le b\}$, and let $F(y) = P(Y_1 \le y)$; then

    $F(y) = P\{g(X_1) \le y\} = P(g^{-1}(-\infty, y])$

and $Y_1, \dots, Y_N$ are independent identically distributed (i.i.d.) random variables (r.v.'s) on $R^1$ with a cumulative probability distribution function (c.d.f.) F.

Proposition 7.4.1 Suppose P assigns a positive probability to every neighborhood of $x^*$, and suppose g is continuous at $x^*$; then

    $\lim_{N \to \infty} M_N = g^*$  a.s.    (7.4.1)

Proof It is clear that $F(g^*) = 1$ and for each $\delta > 0$ we have $1 - F(g^* - \delta) = P\{g^* - \delta < g(X_1) \le g^*\} > 0$ by our assumption. Let $A_N(\delta)$ be the event $\{M_N \le g^* - \delta\}$; then $P\{A_N(\delta)\} = F^N(g^* - \delta)$ and $\sum_{N=1}^{\infty} P\{A_N(\delta)\} = F(g^* - \delta)/(1 - F(g^* - \delta)) < \infty$. By the Borel-Cantelli lemma $P\{M_N \le g^* - \delta \text{ infinitely often}\} = 0$ for all $\delta > 0$ and thus (7.4.1) follows. Q.E.D.

The choice of P, and consequently the resulting F, depends on our prior knowledge of $x^*$. If it is known that a certain region is more likely to include $x^*$, then it would be more efficient to assign a higher probability to that region. If nothing is known a priori about $x^*$, a uniform distribution over D can be assumed. In guaranteeing (7.4.1) the exact choice of P is immaterial. However, the rate of convergence is determined by the properties of F. For example, by a theorem of Gnedenko [14], if there exists a constant $\alpha > 0$ such that

    $\lim_{\delta \downarrow 0} \dfrac{1 - F(g^* - c\delta)}{1 - F(g^* - \delta)} = c^{\alpha}$,  $\forall c > 0$,    (7.4.2)

then

    $\lim_{N \to \infty} P\{(M_N - g^*)/a_N \le y\} = \exp(-(-y)^{\alpha})$,  $y < 0$,    (7.4.3)

with $a_N$ determined by $F(g^* - a_N) = (N - 1)/N$. Some more properties of $M_N$ are listed below.

1 Geometric distribution. Let $N_\delta$ be the first N for which $M_N > g^* - \delta$. Then $N_\delta$ is a geometric r.v., that is,

    $P\{N_\delta = k\} = F^{k-1}(g^* - \delta)[1 - F(g^* - \delta)]$,  $k = 1, 2, \dots$.    (7.4.4)

Consequently, it is well known that

    $E N_\delta = \dfrac{1}{1 - F(g^* - \delta)} \equiv \eta_\delta$

and

    $P\{N_\delta \le k\} = 1 - F^k(g^* - \delta) \equiv P_{\delta,k}$.

It is clear that $\eta_\delta \to \infty$ as $\delta \downarrow 0$, and thus $P_{\delta,[\eta_\delta]+1} = 1 - (1 - 1/\eta_\delta)^{[\eta_\delta]+1} \to 1 - e^{-1} \approx 0.63$ (here $[\eta]$ is the integer part of $\eta$). Hence $\eta_\delta = E N_\delta$ is approximately a 63% confidence bound for $N_\delta$, the number of trials necessary to make $M_N > g^* - \delta$ ($\delta > 0$ small). Let $a = 1 - F(g^* - \delta)$; then $P_{\delta,k} = 1 - (1 - a)^k$. For every given pair (a, p) the smallest k for which $P_{\delta,k} \ge p$ is $k(a, p) = \ln(1 - p)/\ln(1 - a)$, and Table 7.3.1 with k(a, 1 - a) = m can be used again.

2 Lack of memory. It is well known that (7.4.4) implies

    $P\{N_\delta > k + m \mid N_\delta > m\} = P\{N_\delta > k\}$.    (7.4.5)

In terms of $M_N$ we thus have

    $P\{M_{k+m} \le g^* - \delta \mid M_m \le g^* - \delta\} = P\{M_k \le g^* - \delta\}$,

because the events $\{N_\delta > k\}$ and $\{M_k \le g^* - \delta\}$ are identical. It follows that, given m successive failures (to enter $\{y : y > g^* - \delta\}$), the conditional distribution of the number of further trials necessary for the first success equals its unconditional distribution. In particular we have

    $E(N_\delta \mid M_m \le g^* - \delta) = m + E N_\delta$.    (7.4.6)

3 Poisson approximation. If (7.4.2) or (7.4.3) holds, then $Z_{\delta,N}$, the number of $Y_i$, $i = 1, 2, \dots, N$, for which $Y_i > g^* - \delta$, is asymptotically Poisson distributed. More precisely, for fixed N and $\delta > 0$, $Z_{\delta,N}$ is a binomial r.v. with parameters N and $p = 1 - F(g^* - \delta)$. When (7.4.2) holds, by substituting $\delta = a_N$ in (7.4.2), we obtain $N[1 - F(g^* - c a_N)] \to c^{\alpha}$, which implies that $Z_{c a_N, N}$ converges in distribution to a Poisson r.v. with parameter $c^{\alpha}$.
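The 63% statement for the geometric waiting time is easy to confirm numerically. In this small sketch, a denotes the single-trial probability $1 - F(g^* - \delta)$ of exceeding $g^* - \delta$; the function names are illustrative.

```python
import math


def hit_probability(a: float, k: int) -> float:
    """P{N_delta <= k} = 1 - (1 - a)**k for single-trial success probability a."""
    return 1.0 - (1.0 - a) ** k


def probability_within_mean(a: float) -> float:
    """Probability that the first success occurs within [1/a] + 1 trials."""
    eta = 1.0 / a  # expected number of trials E N_delta
    return hit_probability(a, math.floor(eta) + 1)


if __name__ == "__main__":
    for a in (0.1, 0.01, 0.001):
        print(a, round(probability_within_mean(a), 4), round(1.0 - math.exp(-1.0), 4))
```

As a decreases, the printed probability approaches $1 - e^{-1}$, so the expected waiting time $1/a$ covers roughly 63% of all runs.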
The problem of finding the global maximum of g(x) can be reduced to that of finding the mode of a density function associated with g(x). Indeed, if $g(x) \ge 0$, $x \in D$, then $f_X(x) = c^{-1}g(x)$, where $c^{-1} = (\int g(x)\,dx)^{-1}$, is a density function, and the problems of finding the global maximum of g(x) and finding the mode of $f_X(x)$ are equivalent. This can be solved by one of the methods mentioned in Refs. 41, 42, and 46. If g(x) is unrestricted in sign but bounded, that is, if $|g(x)| \le k$, then $f_X(x) = c^{-1}(g(x) + k)$, where $c^{-1} = [\int (g(x) + k)\,dx]^{-1}$, is again a density function.

A natural extension of the "pure" random search algorithm GI-2 is the so-called multistart algorithm [7], which is probably the one most frequently used in practice for global optimization. In this approach we use any iterative procedure (gradient, random search, etc.) for local optimization and run it from a number of different starting points $x_{0j}$, $j = 1, \dots, N$. The set of all terminating points hopefully includes the global maximum $x^*$. The multistart algorithm is as follows.

Algorithm GI-3
1 Generate $X_{01}, \dots, X_{0N}$ from any p.d.f. $f_X(x) > 0$, $x \in D$ (usually $X_0$ is chosen to be uniformly distributed over D).
2 Consider $X_{01}, \dots, X_{0N}$ as the starting points, then apply N times a local optimization algorithm (gradient, random search, etc.) and find the local extrema $x_1^*, \dots, x_N^*$ of g(x) associated with $X_{01}, \dots, X_{0N}$.
3 Estimate $x^*$ by the point among $x_1^*, \dots, x_N^*$ at which g is largest.
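The multistart idea can be sketched in one dimension with any local optimizer. The sketch below is illustrative only: the local step is a simple step-halving hill climb chosen for brevity, not a method prescribed by the text, and all names and the test function are assumptions.

```python
import math
import random


def local_ascent(g, x, lo, hi, step=0.1, tol=1e-6):
    """Hill climb: move to an improving neighbor, halve the step when none improves."""
    gx = g(x)
    while step > tol:
        moved = False
        for cand in (x - step, x + step):
            if lo <= cand <= hi:
                g_cand = g(cand)
                if g_cand > gx:
                    x, gx, moved = cand, g_cand, True
                    break
        if not moved:
            step /= 2.0
    return x, gx


def multistart(g, lo, hi, n_starts, rng):
    """Run the local search from uniform random starts and keep the best terminating point."""
    results = [local_ascent(g, rng.uniform(lo, hi), lo, hi) for _ in range(n_starts)]
    return max(results, key=lambda r: r[1])


if __name__ == "__main__":
    g = lambda x: math.cos(3.0 * x) - 0.1 * x * x  # global maximum at x = 0
    print(multistart(g, -4.0, 4.0, 60, random.Random(0)))
```

Each start converges to the local maximum of its own region of attraction, so the method succeeds as soon as one start falls in the region belonging to the global maximum.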
Let us define $D_j$ as the set of starting points $X_0$ from which the algorithm will converge to the j-th local maximum. We call $D_j$ the region of attraction of the j-th local maximum. Let us assume that the number of local maxima is finite, and let $X_0$ be uniformly distributed over D; then the probability of at least one $X_{0j}$, from a sequence of N points drawn at random over D, falling in the region of attraction $D^*$ of the global maximum equals

    $1 - \left(1 - \dfrac{m(D^*)}{m(D)}\right)^N$,    (7.4.7)

where m(D) is the measure of D.

A more sophisticated approach to the global optimization problem was suggested by Chichinadze [5], who introduced a probability function P(v) as the probability of g(x) < v, that is, if m(V) is the measure of the level set

    $V = \{x : g(x) < v\}$,

then

    $P(v) = \dfrac{m(V)}{m(D)}$.    (7.4.8)

The function P(v) is, of course, not available, but if we calculate g(x) at N points distributed at random over D, and count the number M of these points for which g(x) < v, then M/N approximates P(v). It is not difficult to see that the global maximum corresponds to P(v) = 1 and the global minimum to P(v) = 0. To find the solution of P(v) = 1, Chichinadze suggested approximating P(v) by a linear combination of a set of given polynomial functions $P_i(v)$, $i = 1, \dots, k$,

    $P(v) = \sum_{i=1}^{k} \lambda_i P_i(v)$.    (7.4.9)

The range of v was divided at the points $v_j$, $j = 1, \dots, s$, and the optimal values of $\lambda_i$ were determined by minimizing the criterion (7.4.10), in which $M_j$ is the number of points for which $g(x) < v_j$, $j = 1, \dots, s$. The root $v^*$ of P(v) = 1 was then determined to obtain an estimate of the global maximum of g(x).

Considerable attention has been paid in multiextremal optimization to the random search algorithms. Gaviano [12] showed that if

    $x_{i+1} = x_i + a_i\Xi_i$    (7.4.11)

and

    $a_i = \arg\left(\text{global} \max_{a} g(x_i + a\Xi_i)\right)$,    (7.4.12)

then

    $\lim_{i \to \infty} P\{|g(x_i) - g(x^*)| < \varepsilon\} = 1$    (7.4.13)

for every $\varepsilon > 0$. Here $\Xi$ is a vector uniformly distributed on the surface of a unit n-dimensional sphere. If D is a finite space and if a bound on the first derivative of g(x) is known, then Evtushenko's [11] or Shubert's [37] one-dimensional global optimization techniques could be used to find the optimal $a_i$. However, for a general function, a global optimization along the lines of (7.4.12) is difficult to perform.

Matyas [22] proved the convergence to $x^*$ of the following random search algorithm.

Algorithm GI-4
1 Generate $Y_1, Y_2, \dots$ from an n-dimensional normal distribution with zero mean and covariance matrix $\Sigma$, that is, $Y \sim N(0, \Sigma)$.
2 Select an initial point $x_1 \in D$.
3 Compute $g(x_1)$.
4 $i \leftarrow 1$.
5 If $x_i + Y_i \in D$, go to step 8.
6 $x_{i+1} \leftarrow x_i$.
7 Go to step 10.
8 Compute $g(x_i + Y_i)$.
9 $x_{i+1} \leftarrow x_i + Y_i$ if $g(x_i + Y_i) \ge g(x_i) - \varepsilon$, where $\varepsilon > 0$; otherwise $x_{i+1} \leftarrow x_i$.
10 $i \leftarrow i + 1$.
11 Go to step 5.
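The steps of Algorithm GI-4 can be sketched as a short loop. This is an illustrative sketch under stated assumptions: the covariance matrix is taken to be $\sigma^2 I$, the loop is stopped after a fixed number of iterations, and the best point visited is recorded separately because the acceptance rule tolerates decreases smaller than $\varepsilon$. All names are hypothetical.

```python
import random


def matyas_search(g, x0, in_domain, sigma, eps, iterations, rng):
    """Gaussian random search; returns the best point visited and its function value."""
    x, gx = list(x0), g(x0)
    best_x, best_g = list(x), gx
    for _ in range(iterations):
        # Step 1: Y_i ~ N(0, sigma^2 I) (isotropic covariance is an assumption).
        trial = [xi + rng.gauss(0.0, sigma) for xi in x]
        if not in_domain(trial):
            continue  # steps 5 to 7: stay at x_i when x_i + Y_i leaves D
        g_trial = g(trial)  # step 8
        if g_trial >= gx - eps:  # step 9: acceptance test
            x, gx = trial, g_trial
            if gx > best_g:
                best_x, best_g = list(x), gx
    return best_x, best_g


if __name__ == "__main__":
    g = lambda p: -(p[0] ** 2 + p[1] ** 2)
    box = lambda p: all(-2.0 <= c <= 2.0 for c in p)
    print(matyas_search(g, [1.5, -1.5], box, 0.2, 1e-3, 5000, random.Random(0)))
```

Rejected and infeasible trials leave the current point unchanged, so every iterate stays inside D.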
According to this algorithm, a step is made from the point $x_i$ in the direction $Y_i$ only if $x_i + Y_i \in D$ and $g(x_i + Y_i) \ge g(x_i) - \varepsilon$.

The following procedure, based on cluster analysis, was introduced into global optimization by Becker and Lago [3].

Algorithm GI-5
1 Select N points uniformly distributed in D.
2 Take $N_1 < N$ of these points with the greatest function values.
3 Apply a cluster analysis to these $N_1$ points, grouping them into discrete clusters; then find the boundaries of each cluster and define a new domain $D_1 \subset D$, which hopefully contains the global maximum.
4 Replace D by $D_1$ and perform steps 1 through 3 several times.
This is a heuristic algorithm and its ability to find the global maximum depends on the cluster analysis technique used in step 3 and on the parameters N and $N_1$. There exists a positive probability of missing the global maximum. However, in practice this technique is widely used for global optimization. More on cluster analysis for global optimization can be found in Gomulka [15], Price [27], and Törn [40].

(b) Stochastic Optimization Problem. Consider the stochastic optimization problem (7.1.16), assuming that

    $g(x, W) = g(x) + W$,    (7.4.14)

which means that g(x) is measured with some error W. The following Monte Carlo algorithm, which is similar to Algorithm GI-2, can be used for estimating $g^*$ in (7.1.16).

Algorithm GI-2'
1 Generate $X_1, \dots, X_N$ from any probability density function (p.d.f.) $f_X(x)$ ($f_X(x) > 0$, $x \in D$).
2 Find $Y_k = g(X_k, W_k) = g(X_k) + W_k$, $k = 1, \dots, N$.
3 Estimate $g^*$ by $M_N = \max(Y_1, \dots, Y_N)$.
Let the $W_k$ be i.i.d. r.v.'s with a given c.d.f. H. We also assume that the $W_k$ and the $X_k$ are independent and that $W_+ = \inf\{u : H(u) = 1\} \le \infty$. The following proposition is proven in Ref. 36.

Proposition 7.4.2 Under the conditions of Proposition 7.4.1,

    $\lim_{N \to \infty} M_N = g^* + W_+$  a.s.    (7.4.15)

Proof Let $\epsilon_N = \max_{1 \le i \le N} W_i$. We say that $\{\epsilon_N\}$ is stable if there exists a sequence of constants $\{q_N\}$ such that for all $\delta > 0$

    $\lim_{N \to \infty} P\{|\epsilon_N - q_N| > \delta\} = 0$.    (7.4.16)

We consider three cases.

1 $W_+ < \infty$, in which case our estimate for $g^*$ is $M_N - W_+$, and we certainly have

    $\lim_{N \to \infty} (M_N - W_+) = g^*$  a.s.    (7.4.17)

2 $W_+ = \infty$, but $\{\epsilon_N\}$ is stable, in which case (7.4.16) implies

    $\lim_{N \to \infty} (M_N - q_N) = g^*$  in probability,    (7.4.18)

and $q_N$ is determined by $H(q_N) = (N - 1)/N$. A necessary and sufficient condition for case 2 is [14]

    $\lim_{u \to \infty} \dfrac{1 - H(u + \delta)}{1 - H(u)} = 0$,  $\forall \delta > 0$.    (7.4.19)

We thus see that, if $W_+$ and $q_N$ are known, we still have convergent algorithms in (7.4.17) and (7.4.18).

3 $W_+ = \infty$, but $\{\epsilon_N\}$ is not stable. Here we have by (7.4.15) $M_N \to \infty$ a.s. Q.E.D.

The following examples will demonstrate these ideas.

1 If the $W_i$ are normally distributed with mean 0 and variance $\sigma^2$, then (7.4.19) holds and $\{\epsilon_N\}$ is stable with $q_N = \sigma(2\log N)^{1/2}$.
2 Suppose that the $W_i$'s have the generalized double exponential distribution with exponent $\alpha$. Then by (7.4.19) $\{\epsilon_N\}$ is not stable for $\alpha \le 1$, but is stable for $\alpha > 1$ with $q_N = (\log(N/2))^{1/\alpha}$.

Algorithm GI-3 can also be adapted for the stochastic optimization problem (7.1.16) by rewriting step 2 as follows:

2 Consider $X_{01}, \dots, X_{0N}$ as the starting points; then apply N times a local iterative procedure (stochastic approximation, random search, etc.) that is able to find the associated local extrema $x_1^*, \dots, x_N^*$ of $E[g(x, W)] = g(x)$.

(c) Constrained Optimization. Consider the following constrained optimization problem: find the global extremum of

    $g_0(x)$    (7.4.20)

subject to

    $g_k(x) \le 0$,  $k = 1, \dots, m$.    (7.4.21)

We assume that the convex programming methods (see Avriel [2]) cannot be applied because the convexity assumptions do not hold either for the region $D = \{x : g_k(x) \le 0, k = 1, \dots, m\}$ or for the function $g_0(x)$.
Let us consider two cases.

1 If the region $D = \{x : g_k(x) \le 0, k = 1, \dots, m\}$ is known, and we can readily generate r.v.'s on D, then Algorithms GI-2 through GI-5 can be directly applied for finding the global extremum of (7.4.20) and (7.4.21).
2 If the region $D = \{x : g_k(x) \le 0, k = 1, \dots, m\}$ is either not known explicitly or is complex, but another region $D_1$ that contains D and has a simple shape is known, then we generate r.v.'s on $D_1$ and accept or reject them according to whether $X \in D$ or $X \in (D_1 - D)$. Next we can again apply Algorithms GI-2 through GI-5.
7.5 A CLOSED FORM SOLUTION FOR GLOBAL OPTIMIZATION
This section is based on the results of Meerkov [23] and Pincus [25]. Both papers deal with multiextremal optimization and use the classical Laplace formula for certain integrals. We follow Pincus [25]. Consider the optimization problem

    $\min_{x \in D \subset R^n} g(x) = g(x^*) = g^*$,

where g(x) is a continuous function, D is a closed bounded domain, and $x^*$ is the unique optimum point. Pincus [25] proved the following theorem.

Theorem 7.5.1 Let $g(x) = g(x_1, \dots, x_n)$ be a real-valued continuous function over a closed bounded domain $D \subset R^n$. Further, assume there is a unique point $x^* \in D$ at which $\min_{x \in D} g(x)$ is attained (there are no restrictions on relative minima). Then the coordinates $x_i^*$ of the minimization point are given by

    $x_i^* = \lim_{\lambda \to \infty} \dfrac{\int_D x_i \exp(-\lambda g(x))\,dx}{\int_D \exp(-\lambda g(x))\,dx}$,  $i = 1, \dots, n$.    (7.5.1)

In particular the theorem is valid when D is convex and the objective function g is strictly convex. The proof of the theorem is based on the Laplace formula, which for sufficiently large $\lambda$ can be written as

    $\int_D x_i \exp(-\lambda g(x))\,dx \approx x_i^* \exp(-\lambda g(x^*))$,    (7.5.2)

    $\int_D \exp(-\lambda g(x))\,dx \approx \exp(-\lambda g(x^*))$.    (7.5.3)

We now outline a Monte Carlo method based on the work of Metropolis et al. [24] (see also [26]) for evaluating the coordinates of the minimization point $x^* = (x_1^*, \dots, x_n^*)$, that is, for approximating the ratio appearing on the right-hand side of (7.5.1). For fixed $\lambda$, (7.5.1) can be written as

    $x_i^* \approx \dfrac{\int_D x_i \exp(-\lambda g(x))\,dx}{\int_D \exp(-\lambda g(x))\,dx}$,  $i = 1, \dots, n$.    (7.5.4)

For large $\lambda$ the major contribution to the integrals appearing in (7.5.1) comes from a small neighborhood of the minimizing point $x^*$. Metropolis' sampling procedure [24], described below, is based on simulating a Markov chain that spends, in the long run, most of the time visiting states near the minimizing point and is more efficient than a direct Monte Carlo method, which estimates the numerator and the denominator separately. The idea of the method is to generate samples with density

    $\pi(x) = \dfrac{\exp(-\lambda g(x))}{\int_D \exp(-\lambda g(x))\,dx}$,    (7.5.5)

where the denominator of (7.5.5) is not known. This is done as follows. Partition the region D into a finite number N of mutually disjoint subregions $D_j$ and replace integrals over D by the corresponding Riemann sums using the partition $\{D_j\}$. Fix a point $y^j = (y_1^j, \dots, y_n^j) \in D_j$. Then construct an irreducible ergodic Markov chain $\{X_k\}$ with state space $\{y^1, \dots, y^N\}$ and with transition probabilities $p_{ij}$, $1 \le i, j \le N$, satisfying $\pi_j = \sum_{i} \pi_i p_{ij}$, $j = 1, \dots, N$, where $\pi_j = \exp[-\lambda g(y^j)] / \sum_{h=1}^{N} \exp[-\lambda g(y^h)]$; that is, $\{\pi_j\}$ is the invariant distribution for the Markov chain. It should be noted that, in the last expression for $\pi_j$, we have assumed for simplicity that all subregions $D_j$ have equal volumes. Then using the strong law of large numbers for Markov chains, we have with probability 1

    $\dfrac{1}{m} \sum_{k=1}^{m} X_k \to \sum_{j=1}^{N} y^j \pi_j$  as  $m \to \infty$.    (7.5.6)
The sampling error for each component $X_i^k$ of the vector $X_k$ is (see [26])

    $E\left[\left(m^{-1} \sum_{k=1}^{m} X_i^k - \mu_i\right)^2\right] \le c/m$,

where c is a positive number and

    $\mu_i = \sum_{j=1}^{N} y_i^j \pi_j$.

From Chebyshev's inequality we have

    $P\left\{\left|m^{-1} \sum_{k=1}^{m} X_i^k - \mu_i\right| \ge \varepsilon\right\} \le \dfrac{c}{m\varepsilon^2}$.

We now turn to the question of how Metropolis constructs a Markov chain with the required invariant distribution. He starts with a symmetric transition probability matrix $P^* = (p_{ij}^*)$, $1 \le i, j \le N$, that is, $p_{ij}^* = p_{ji}^*$, $p_{ij}^* > 0$, $\sum_{j} p_{ij}^* = 1$, and with the known ratios $\pi_j/\pi_i$, and defines the transition matrix of the Markov chain $\{X_k\}$ as follows:

    $p_{ij} = p_{ij}^*\,\pi_j/\pi_i$,  if $\pi_j/\pi_i < 1$, $i \ne j$,
    $p_{ij} = p_{ij}^*$,  if $\pi_j/\pi_i \ge 1$, $i \ne j$,    (7.5.7)
    $p_{ii} = p_{ii}^* + \sum_{j:\,\pi_j < \pi_i} p_{ij}^*(1 - \pi_j/\pi_i)$,  $i = j$.

It is shown in Ref. 16 that a Markov chain with the above transition matrix has the invariant distribution $\{\pi_j\}$, that is, $\pi_j = \sum_i \pi_i p_{ij}$. A chain with such a transition matrix can be realized as follows. Given that the chain is in state $y^i$ at time k, that is, $\{X_k = y^i\}$, the state at time k + 1 is determined by choosing a new state according to the distribution $\{p_{ij}^*, j = 1, \dots, N\}$. If the state chosen is $y^j$, we calculate the ratio $\pi_j/\pi_i$. If $\pi_j/\pi_i \ge 1$, we accept $y^j$ as the new state at time k + 1; if $\pi_j/\pi_i < 1$, we take $y^j$ as the state of the Markov chain at time k + 1 with probability $\pi_j/\pi_i$ and $y^i$ as the new state at time k + 1 with probability $1 - \pi_j/\pi_i$. It is also shown in Ref. 16 that this procedure leads to a Markov chain with transition matrix $P = (p_{ij})$.

It should be noted that (7.5.1) can be useful not only for finding the global optimum in a multiextremal problem, but also for solving nonlinear equations (see [20]) and some kinds of problems in statistical mechanics as well (see [16]).
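The sampling procedure can be sketched for a one-dimensional grid of states. This is a minimal illustration under stated assumptions: the symmetric proposal matrix is taken to be uniform over all grid points ($p_{ij}^* = 1/N$), the value of $\lambda$, the burn-in length, and the test function are arbitrary choices, and the function name is hypothetical.

```python
import math
import random


def metropolis_mean(g, grid, lam, steps, rng, burn_in=0):
    """Ergodic average of a Metropolis chain with invariant weights exp(-lam * g(y_j))."""
    i = rng.randrange(len(grid))
    total, count = 0.0, 0
    for k in range(steps):
        j = rng.randrange(len(grid))  # symmetric proposal: uniform over all states
        delta = g(grid[j]) - g(grid[i])
        # pi_j / pi_i = exp(-lam * delta); accept surely if >= 1, else with that probability.
        if delta <= 0.0 or rng.random() < math.exp(-lam * delta):
            i = j
        if k >= burn_in:
            total += grid[i]
            count += 1
    return total / count


if __name__ == "__main__":
    grid = [k / 100 for k in range(101)]
    # Multiextremal test function with its unique global minimum at x = 0.7.
    g = lambda x: (x - 0.7) ** 2 + 0.05 * (1.0 - math.cos(40.0 * (x - 0.7)))
    print(metropolis_mean(g, grid, 200.0, 40000, random.Random(0), burn_in=1000))
```

Only differences of g enter the acceptance test, so the unknown normalizing sum never has to be evaluated, which is the point of sampling instead of estimating the numerator and denominator separately.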
7.6 OPTIMIZATION BY SMOOTHED FUNCTIONALS
Consider the following stochastic optimization problem (see (7. I . 16)) min E,[ + ( x , w ) ]= rnin "g(x) =g(x*) (7.1.16') xEDcR"
xEDcR
where +(x, W )is a stochastic function with unknown p.d.f. p ( x ) , D is a convex bounded domain, and x * is the unique optimal point. We also assume that g ( x ) is bounded for each x E D and var,[+(x, W ) ]< 00. For solving this problem let us introduce the following convolution function: OD
f(x,P)
= J r nA ( o , P ) g ( x - w ) d v = I _ _ X ( ( x - o ) , P ) g ( o ) d t b -m
(7.6.1 )
which is called a smoothed functional [18]. In order for $\hat g(x, \beta)$ to have nice smoothing properties, let us make some assumptions about the kernel $\hat h(v, \beta)$.
1 $\hat h(v, \beta) = (1/\beta^n)\, h(v/\beta) = (1/\beta^n)\, h(v_1/\beta, \ldots, v_n/\beta)$ is a piecewise differentiable function with respect to $v$.
2 $\lim_{\beta \to 0} \hat h(v, \beta) = \delta(v)$, where $\delta(v)$ is Dirac's delta function.
3 $\lim_{\beta \to 0} \hat g(x, \beta) = g(x)$, if $x$ is a point of continuity of $g(x)$.
4 $\hat h(v, \beta)$ is a p.d.f., that is, $\hat g(x, \beta) = E_V[g(x - V)]$.
We assume that the original function $g(x)$ is not "well behaved." For instance, it can be a multiextremal function or have a fluctuating character (see Fig. 7.6.1). We expect "better behavior" from the smoothed function $\hat g(x, \beta)$ than from the original one. The idea of smoothed functionals is as follows: for a given function $g(x)$ construct a smoothed function $\hat g(x, \beta)$ and, operating only with $\hat g(x, \beta)$, find the extremum for $g(x)$. In other words, while operating only with $\hat g(x, \beta)$, we want to avoid all fluctuations and local extrema of $g(x)$ and find $x^*$.
Fig. 7.6.1 A badly "behaved" function.
It is obvious that the effect of smoothing depends on the parameter $\beta$: for large $\beta$ the effect of smoothing is large, and vice versa. When $\beta \to 0$ it follows from condition 2 that $\hat g(x, \beta) \to g(x)$ and that there is no smoothing. It is intuitively clear that, to avoid fluctuations and local extrema, $\beta$ has to be sufficiently large at the start of the optimization. However, on approaching the optimum we can reduce the effect of smoothing by letting $\beta$ vanish, since at the extremum point $x^*$ we want coincidence of both extrema, $g(x)$ and $\hat g(x, \beta)$. Accordingly, we speak of a set of smoothed functions $\hat g(x, \beta_s)$, $s = 1, 2, \ldots$, while constructing an iterative procedure for finding $x^*$. Before describing the iterative procedure for solving the problem (7.1.16), we derive some attractive properties of $\hat g(x, \beta)$.
PROPERTY 1 If $g(x)$ is convex, then $\hat g(x, \beta)$ is also convex.
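Before turning to the proof, a small numerical sketch may help fix the idea of smoothing discussed above. It estimates the expectation form of condition 4 by a sample mean, with the smoothing variable written as $\beta Z$ and $Z$ standard normal. Everything specific here is an assumption made for illustration: the one-dimensional test function `g`, the Gaussian kernel, the grid, the sample size, and the helper names `smoothed` and `count_local_minima`. The same draws are reused at every grid point so that the estimated curve is itself a smooth function of $x$.

```python
import math
import random

def g(x):
    # Hypothetical multiextremal test function (not from the text).
    return x * x + math.sin(5.0 * x)

def smoothed(func, x, beta, noise):
    """Sample-mean estimate of E[func(x - beta * Z)] over the draws in noise."""
    return sum(func(x - beta * z) for z in noise) / len(noise)

def count_local_minima(values):
    """Count strict interior local minima of a sequence of grid values."""
    return sum(1 for a, b, c in zip(values, values[1:], values[2:])
               if b < a and b < c)

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(2000)]   # common random numbers
grid = [-3.0 + 0.03 * k for k in range(201)]         # grid on [-3, 3]

for beta in (0.0, 0.3, 1.0):
    curve = [smoothed(g, x, beta, noise) for x in grid]
    print(beta, count_local_minima(curve))
```

With $\beta = 0$ the estimate reproduces `g` itself and its several local minima; for a sufficiently large $\beta$ the oscillating term is averaged out and a single minimum remains, which is the behavior the text asks of the smoothed function at the start of the optimization.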
The proof of this property is straightforward. For $0 < \lambda < 1$,
$$\lambda \hat g(x, \beta) + (1 - \lambda)\, \hat g(y, \beta) - \hat g(\lambda x + (1 - \lambda) y, \beta) = \int \hat h(v, \beta)\, [\lambda g(x - v) + (1 - \lambda)\, g(y - v) - g(\lambda x + (1 - \lambda) y - v)]\, dv. \tag{7.6.2}$$
The convexity of $g(x)$ implies
$$g(\lambda x + (1 - \lambda) y - v) = g(\lambda (x - v) + (1 - \lambda)(y - v)) \le \lambda g(x - v) + (1 - \lambda)\, g(y - v). \tag{7.6.3}$$
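Substituting (7.6.3) in (7.6.2) and taking into account that $\hat h(v, \beta) \ge 0$, we obtain the proof immediately.
PROPERTY 2 It is readily seen that the gradient of the smoothed function $\hat g(x, \beta)$ may be expressed as
$$\hat g_x(x, \beta) = \frac{\partial}{\partial x} \int_{-\infty}^{\infty} \hat h(x - v, \beta)\, g(v)\, dv = \int_{-\infty}^{\infty} \hat h_x(x - v, \beta)\, g(v)\, dv \tag{7.6.4}$$
and is called a smoothed gradient. Using the right-hand side of (7.6.4), together with condition 1, we obtain
$$\hat g_x(x, \beta) = \frac{1}{\beta} \int_{-\infty}^{\infty} h_v(v)\, g(x - \beta v)\, dv, \tag{7.6.5}$$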
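where
$$h_v(v) = \left( \frac{\partial h(v)}{\partial v_1}, \ldots, \frac{\partial h(v)}{\partial v_n} \right) \tag{7.6.6}$$
is the gradient of $h(v)$ and $\partial h(v)/\partial v_k$, $k = 1, \ldots, n$, are the partial derivatives. It is important to note that, to find a gradient of the smoothed function $\hat g(x, \beta)$, we do not need to know the gradient of $g(x)$, which sometimes does not exist at all. We consider also the following smoothed function:
$$\hat g(x, \beta) = \frac{1}{2} \int_{-\infty}^{\infty} \hat h(v, \beta)\, [g(x + v) + g(x - v)]\, dv. \tag{7.6.7}$$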
By analogy with (7.6.4) and (7.6.5) we can obtain the smoothed gradient for $\hat g(x, \beta)$:
$$\hat g_x(x, \beta) = \frac{1}{2\beta} \int_{-\infty}^{\infty} h_v(v)\, [g(x - \beta v) - g(x + \beta v)]\, dv. \tag{7.6.8}$$
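The step "by analogy" can be written out as follows. This is a sketch of one possible derivation, not necessarily the author's: it assumes the factor $\tfrac{1}{2}$ in (7.6.7), assumes that differentiation and integration may be interchanged, and writes $\hat h_v$ and $h_v$ for gradients with respect to the kernel argument.

```latex
\begin{align*}
\frac{\partial}{\partial x}\int \hat h(v,\beta)\,g(x-v)\,dv
  &= \frac{\partial}{\partial x}\int \hat h(x-u,\beta)\,g(u)\,du
   = \int \hat h_v(v,\beta)\,g(x-v)\,dv,\\
\frac{\partial}{\partial x}\int \hat h(v,\beta)\,g(x+v)\,dv
  &= \frac{\partial}{\partial x}\int \hat h(u-x,\beta)\,g(u)\,du
   = -\int \hat h_v(v,\beta)\,g(x+v)\,dv,\\
\hat h_v(v,\beta) &= \beta^{-n-1}\,h_v(v/\beta)
  \quad\text{(condition 1)},\\
\int \hat h_v(v,\beta)\,g(x\mp v)\,dv
  &= \frac{1}{\beta}\int h_w(w)\,g(x\mp\beta w)\,dw
  \quad (v=\beta w,\ dv=\beta^{n}\,dw),\\
\hat g_x(x,\beta)
  &= \frac{1}{2\beta}\int h_v(v)\,\bigl[g(x-\beta v)-g(x+\beta v)\bigr]\,dv .
\end{align*}
```

The first two lines move the dependence on $x$ from $g$ to the kernel by a change of variable, which is why no derivative of $g$ is needed; the last line combines them with the factor $\tfrac{1}{2}$.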
Now we give two examples of kernels $\hat h(v, \beta)$, which satisfy conditions 1 through 4, and find their smoothed gradients according to (7.6.8).
Example 1 Let $h(v)$ be an n-dimensional standard multinormal distribution
$$h(v) = \frac{1}{(2\pi)^{n/2}} \exp\left( -\frac{1}{2} \sum_{i=1}^{n} v_i^2 \right). \tag{7.6.9}$$
Then the smoothed gradient of $g(x)$ is
$$\hat g_x(x, \beta) = \frac{1}{2\beta} \int_{-\infty}^{\infty} v\, h(v)\, [g(x + \beta v) - g(x - \beta v)]\, dv. \tag{7.6.10}$$
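Example 2 Let
$$h(v) = \begin{cases} \dfrac{\Gamma(n/2)}{2\pi^{n/2}}, & \|v\| = 1, \\ 0, & \text{otherwise}, \end{cases} \tag{7.6.11}$$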
that is, let the random vector $v$ be uniformly distributed over the surface of the unit sphere. The smoothed gradient equals
$$\hat g_x(x, \beta) = \frac{n}{2\beta} \int_{\|v\| = 1} v\, h(v)\, [g(x + \beta v) - g(x - \beta v)]\, dv. \tag{7.6.12}$$
Having $\hat g_x(x, \beta)$ at our disposal, we can construct, for instance, an iterative gradient algorithm
$$x_{i+1} = \pi\bigl(x_i - \alpha\, \hat g_x(x_i, \beta_i)\bigr), \qquad \alpha > 0 \tag{7.6.13}$$
and find the conditions under which $x_i$ converges to $x^*$ in the deterministic optimization problem $\min_{x \in D \subset R^n} g(x) = g(x^*)$, which is a particular case of (7.1.16), with $p(w)$ being a Dirac $\delta$ function. Here $\pi(\cdot)$ denotes the projection operation on $D$ (i.e., for every $x \in R^n$, $\pi(x) \in D$ and $\|x - \pi(x)\| = \min_{y \in D} \|x - y\|$), and $\alpha$ is a step parameter. Since $g(x)$ is not a "well behaved" function, the multiple integrals $\hat g_x(x, \beta)$ and $\hat g(x, \beta)$ are usually not available in explicit form and numerical methods have to be used. One of them is, as we know, the Monte Carlo method. For instance, an estimator of $\hat g_x(x, \beta)$ can be found by the sample-mean Monte Carlo method (see Section 4.2.2)
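and is called the parametrical statistical gradient (PSG) [18]. Here $f(v)$ is a p.d.f. from which a sample of length $N$ is taken. Assuming that $f(v) = h(v)$, we obtain, respectively, the PSG in examples 1 and 2, as
$$\hat g_x(x, \beta) \approx \frac{1}{2N\beta} \sum_{i=1}^{N} V_i\, [g(x + \beta V_i) - g(x - \beta V_i)] \tag{7.6.15}$$
and
$$\hat g_x(x, \beta) \approx \frac{n}{2N\beta} \sum_{i=1}^{N} V_i\, [g(x + \beta V_i) - g(x - \beta V_i)]. \tag{7.6.16}$$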
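The two sample-mean estimators and the projected iteration (7.6.13) can be sketched together as below. This is a minimal illustration under stated assumptions rather than the book's procedure: it uses the factor $1/(2N\beta)$ for the normal kernel and $n/(2N\beta)$ for the sphere kernel, takes $D$ to be a box so that the projection is a coordinate-wise clip, and uses a hypothetical schedule $\beta_i = 1/(i+1)$. All function names (`psg`, `project_box`, `minimize`, and the two samplers) and the quadratic test function are invented for the example.

```python
import math
import random

def gaussian_direction(n, rng):
    """Draw V from the n-dimensional standard normal distribution."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def sphere_direction(n, rng):
    """Draw V uniformly on the unit sphere by normalizing a normal vector."""
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    r = math.sqrt(sum(c * c for c in z))
    return [c / r for c in z]

def psg(g, x, beta, n_samples, rng, kernel="gaussian"):
    """Sample-mean estimate of the smoothed gradient at x."""
    n = len(x)
    draw = gaussian_direction if kernel == "gaussian" else sphere_direction
    scale = 1.0 if kernel == "gaussian" else float(n)  # assumed factor n for the sphere
    acc = [0.0] * n
    for _ in range(n_samples):
        v = draw(n, rng)
        x_plus = [xi + beta * vi for xi, vi in zip(x, v)]
        x_minus = [xi - beta * vi for xi, vi in zip(x, v)]
        diff = g(x_plus) - g(x_minus)      # needs only values of g, no gradient
        for k in range(n):
            acc[k] += v[k] * diff
    c = scale / (2.0 * beta * n_samples)
    return [c * a for a in acc]

def project_box(x, lo, hi):
    """Projection onto the box [lo, hi]^n, an assumed choice of D."""
    return [min(max(xi, lo), hi) for xi in x]

def minimize(g, x0, alpha, betas, n_samples, rng, lo, hi, kernel="gaussian"):
    """Projected gradient iteration with one smoothing parameter per step."""
    x = list(x0)
    for beta in betas:
        grad = psg(g, x, beta, n_samples, rng, kernel)
        x = project_box([xi - alpha * gi for xi, gi in zip(x, grad)], lo, hi)
    return x

rng = random.Random(1)
quad = lambda x: sum((xi - 1.0) ** 2 for xi in x)   # hypothetical test function
betas = [1.0 / (s + 1) for s in range(200)]         # hypothetical decreasing schedule
print(minimize(quad, [3.0, -2.0], 0.1, betas, 50, rng, -5.0, 5.0))
```

For a quadratic test function the symmetric difference is exactly linear in $V_i$, so both estimators are unbiased for the true gradient; this gives a quick way to check the scaling factors. The choice of step size and of the schedule for $\beta_i$ is left open here, since the convergence conditions are the subject of the text that follows.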
The r.v.'s in (7.6.15) and (7.6.16) are generated from (7.6.9) and (7.6.11), respectively. By analogy with (7.6.7) the smoothed gradient of $\phi(x, W)$ is
..., prove that $\hat g(x, \beta) \ge g(x)$.
6 Prove (7.6.4) and (7.6.5).
7 Prove that, if $h_n(\varphi) = B_n |\sin^{n-1} \varphi|$, $0 \le \varphi \le 2\pi$, then $P_n(v)$, where $v = \cos \varphi$, is distributed according to (7.2.12).