VARIATIONAL METHODS IN STATISTICS
This is Volume 121 in
MATHEMATICS IN SCIENCE AND ENGINEERING
A Series of Monographs and Textbooks
Edited by RICHARD BELLMAN, University of Southern California
The complete listing of books in this series is available from the Publisher upon request.

VARIATIONAL METHODS IN STATISTICS
Jagdish S. Rustagi
Department of Statistics
The Ohio State University
Columbus, Ohio
ACADEMIC PRESS   New York   San Francisco   London   1976
A Subsidiary of Harcourt Brace Jovanovich, Publishers
COPYRIGHT © 1976, BY ACADEMIC PRESS, INC.
ALL RIGHTS RESERVED. NO PART OF THIS PUBLICATION MAY BE REPRODUCED OR TRANSMITTED IN ANY FORM OR BY ANY MEANS, ELECTRONIC OR MECHANICAL, INCLUDING PHOTOCOPY, RECORDING, OR ANY INFORMATION STORAGE AND RETRIEVAL SYSTEM, WITHOUT PERMISSION IN WRITING FROM THE PUBLISHER.
ACADEMIC PRESS, INC.
111 Fifth Avenue, New York, New York 10003
United Kingdom Edition published by ACADEMIC PRESS, INC. (LONDON) LTD. 24/28 Oval Road, London NW1

Library of Congress Cataloging in Publication Data
Rustagi, Jagdish S
Variational methods in statistics.
(Mathematics in science and engineering series ; )
Includes bibliographies and index.
1. Mathematical statistics. 2. Calculus of variations. I. Title. II. Series.
QA276.R88 519.5’35 75-13092
ISBN 0-12-604560-7
PRINTED IN THE UNITED STATES OF AMERICA
TO
Kamla Pradip, Pramod, and Madhu
Contents

Preface
Acknowledgements

Chapter I    Synopsis
1.1 General Introduction
1.2 Classical Variational Methods
1.3 Modern Variational Methods
1.4 Linear Moment Problems
1.5 Nonlinear Moment Problems
1.6 Optimal Designs for Regression Experiments
1.7 Theory of Optimal Control
1.8 Miscellaneous Applications of Variational Methods in Statistics
References

Chapter II    Classical Variational Methods
2.1 Introduction
2.2 Variational Problem
2.3 Illustrations in Statistics
2.4 Euler-Lagrange Equations
2.5 Statistical Application
2.6 Extremals with Variable End Points
2.7 Extremals with Constraints
2.8 Inequality Derived from Variational Methods
2.9 Sufficiency Conditions for an Extremum
References

Chapter III    Modern Variational Methods
3.1 Introduction
3.2 Examples
3.3 Functional Equations of Dynamic Programming
3.4 Backward Induction
3.5 Maximum Principle
3.6 Dynamic Programming and Maximum Principle
References

Chapter IV    Linear Moment Problems
4.1 Introduction
4.2 Examples
4.3 Convexity and Function Spaces
4.4 Geometry of Moment Spaces
4.5 Minimizing and Maximizing an Expectation
4.6 Application of the Hahn-Banach Theorem to Maximizing an Expectation Subject to Constraints
References

Chapter V    Nonlinear Moment Problems
5.1 Introduction
5.2 Tests of Hypotheses and Neyman-Pearson Lemma
5.3 A Nonlinear Minimization Problem
5.4 Statistical Applications
5.5 Maximum in the Nonlinear Case
5.6 Efficiency of Tests
5.7 Type A and Type D Regions
5.8 Miscellaneous Applications of the Neyman-Pearson Technique
References

Chapter VI    Optimal Designs for Regression Experiments
6.1 Introduction
6.2 Regression Analysis
6.3 Optimality Criteria
6.4 Continuous Normalized Designs
6.5 Locally Optimal Designs
6.6 Spline Functions
6.7 Optimal Designs Using Splines
Appendix to Chapter VI
References

Chapter VII    Theory of Optimal Control
7.1 Introduction
7.2 Deterministic Control Process
7.3 Controlled Markov Chains
7.4 Statistical Decision Theory
7.5 Sequential Decision Theory
7.6 Wiener Process
7.7 Stopping Problems
7.8 Stochastic Control Problems
References

Chapter VIII    Miscellaneous Applications of Variational Methods in Statistics
8.1 Introduction
8.2 Applications in Reliability
8.3 Bioassay Application
8.4 Approximations via Dynamic Programming
8.5 Connections between Mathematical Programming and Statistics
8.6 Stochastic Programming Problems
8.7 Dynamic Programming Model of Patient Care
References

Index
Preface
Calculus of variations is an important technique of optimization. Our attempt in this book is to develop an exposition of calculus of variations and its modern generalizations in order to apply them to statistical problems. We have included an elementary introduction to Pontryagin’s maximum principle as well as Bellman’s dynamic programming. Other variational techniques are also discussed. The reader is assumed to be familiar with elementary notions of probability and statistics. The mathematical prerequisites are advanced calculus and linear algebra. To make the book self-contained, statistical notions are introduced briefly so that the reader unfamiliar with statistics can appreciate the applications of variational methods to statistics. Advanced mathematical concepts are also introduced wherever needed. However, well-known results are sometimes stated without proof to keep the discussion within reasonable limits. The first two chapters of the book provide an elementary introduction to the classical theory of calculus of variations, maximum principle, and dynamic programming. The linear and nonlinear moment problems are discussed next, and a variety of variational techniques to solve them are given. One of the first nonclassical variational results is that given by the Neyman-Pearson lemma, and it is utilized in solving certain moment problems. A few problems of testing statistical hypotheses are also given. The techniques utilized in finding optimal designs of regression experiments are generally variational, and a brief discussion of optimization problems under various criteria of optimality is provided. Variational methods have special significance in stochastic control theory,
optimal stopping problems, and sequential sampling. Certain aspects of these problems are discussed. In addition, applications of variational methods in statistical reliability theory, mathematical programming, controlled Markov chains, and patient monitoring systems are given. The main concern of the author is to provide those statistical applications in which variational arguments are central to the solution of the problem. The reader may discover many more applications of these methods in his own work. The references provided are not exhaustive and are given only as sources of supplementary study, since many of the areas in statistics discussed here are still under vigorous development. Applications of variational methods in other engineering sciences, economics, or business are not emphasized, although some illustrations in these areas are discussed. Exhaustive expositions of such applications are available elsewhere. The chapters are so arranged that most of them can be studied independently of each other. The material collected here has not appeared previously in book form, and some of it is being published for the first time. Part of the material has been used by the author over the years in a course in optimizing methods in statistics at The Ohio State University. The book can be used as a text for such a course in statistics, mathematics, operations research, or engineering science departments in which courses on optimization are taught. It may also be used as a reference. Readers are invited to send their comments to the author.
Acknowledgements
I am greatly indebted to Professor Richard Bellman, who invited me to write this book. Professor Herman Chernoff has been a source of great inspiration and advice throughout the planning and development of the material, and I am highly obliged to him. I am grateful to Professor Stefan Drobot for introducing me to variational methods. I would also like to acknowledge the help of Professor D. Ransom Whitney for continuous advice and encouragement. Dr. Vinod Goyal helped me in the preparation of Chapter II; Professors Bernard Harris and Paul Feder provided comments on earlier drafts of some chapters, and I am very grateful to them. I am obliged to Mr. Jerry Keiper, who suggested many corrections and improvements in the manuscript. I would also like to thank the many typists, among them Betty Miller, Diane Marting, and especially Denise Balduff, who have worked on the manuscript. The work on the book began while I had a research grant from the Office of Scientific Research, United States Air Force, at The Ohio State University, and I am very much obliged to both for their support. I am also grateful to the editorial staff of Academic Press for various suggestions regarding the production of the book.
CHAPTER I
Synopsis
1.1 General Introduction
Variational methods refer to the technique of optimization in which the object is to find the maximum or minimum of an integral involving unknown functions. The technique is central to the study of functional analysis in the same way that the theory of maxima and minima is central to the study of calculus. During the last two centuries variational methods have played an important role in the solution of many physical and engineering problems. In the past few decades variational techniques have been developed further and have been applied successfully to many areas of knowledge, including economics, statistics, control theory, and operations research. Calculus of variations had its beginnings in the seventeenth century when Newton used it for choosing the shape of a ship’s hull to assure minimum drag of water. Several great mathematicians, including Jean Bernoulli, Leibnitz, and Euler, contributed to its development. The concept of variation was introduced by Lagrange. In this book we first introduce the basic ideas of the classical theory of calculus of variations and obtain the necessary conditions for an optimum. Such conditions are known as Euler or Euler-Lagrange equations. In recent years Bellman’s introduction of the technique of dynamic programming has resulted in the solution of many variational problems and has provided practical answers to a large class of optimization problems. In addition to Bellman’s technique we also discuss the maximum principle of Pontryagin, which has been regarded as a culmination of the efforts of the mathematicians in
the last century to rectify the rule of Lagrange multipliers. It gives a rigorous development of a class of variational problems with special applications in control theory. We include a brief introduction to both of the above variational techniques. Many problems in the study of moments of probability distributions are variational. The Tchebycheff-type inequalities can be seen to result from optimization problems that are variational. Many other problems in this category, in which one wants bounds on the expected value of the largest order statistic or the expected value of the range in a random sample from an unknown population, have had important applications in statistics. In addition to the classical theory of calculus of variations, the methods of the geometry of moment spaces have proved fruitful in characterizing their solutions. A brief introduction to these topics is also provided. One of the first nonclassical variational results is stated in the form of the Neyman-Pearson lemma, which arises in statistical tests of hypotheses. While introducing the fundamental concepts of most powerful tests, Neyman and Pearson provided an extension of the classical calculus of variations result, and this technique has been applied to a large variety of problems in optimization, especially in economics, operations research, and mathematical programming. We include a brief discussion of the Neyman-Pearson technique, which has also become a very useful tool in solving nonlinear programming problems. Many problems arising in the study of optimal designs of regression experiments can be solved by variational methods. The criteria of optimality generally result in functionals that must be minimized or maximized. We describe variational techniques used in stochastic control problems, controlled Markov chains, and stopping problems, as well as the application of variational methods to many other statistical problems, such as in obtaining efficiencies of nonparametric tests. 
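The remark that the Neyman-Pearson optimization is carried out over unknown sets can be made concrete on a small discrete sample space. The following is only a minimal sketch: the binomial hypotheses, the sample size, and all function and variable names are illustrative assumptions and do not come from the text. It forms the region of largest likelihood ratios and checks by exhaustive search that no other region of the same or smaller size has greater power.

```python
from itertools import combinations
from math import comb


def binom_pmf(n, p):
    """Return the Binomial(n, p) probabilities of the outcomes 0..n."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


n = 6
p0 = binom_pmf(n, 0.5)  # null hypothesis (illustrative choice)
p1 = binom_pmf(n, 0.7)  # alternative hypothesis (illustrative choice)

# Neyman-Pearson region: the two outcomes with the largest ratios p1/p0.
ratio_order = sorted(range(n + 1), key=lambda k: p1[k] / p0[k], reverse=True)
np_region = set(ratio_order[:2])
alpha = sum(p0[k] for k in np_region)     # size of the test
np_power = sum(p1[k] for k in np_region)  # power of the test

# Exhaustive search over every region whose size does not exceed alpha.
best_power = 0.0
for r in range(n + 2):
    for region in combinations(range(n + 1), r):
        size = sum(p0[k] for k in region)
        if size <= alpha + 1e-12:
            best_power = max(best_power, sum(p1[k] for k in region))

print(sorted(np_region), alpha, np_power, best_power)
```

The search runs over all 128 subsets of the sample space, which is feasible only because the example is tiny; the lemma is what removes the need for such a search in general.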
In the next seven sections, we provide a brief introduction to the topics discussed in the book.

1.2 Classical Variational Methods
Calculus of variations, developed over the past two hundred years, has been applied in many disciplines. The basic problem is that of finding an extremum of an integral involving an unknown function and its derivative. The methods using variations are similar to those using differentials and make problems in optimization of integrals easy to solve. In Chapter II, we discuss the classical approach to variational problems. We obtain the necessary conditions for an extremum in the form of the Euler differential equation. The sufficient conditions are too involved in general, and we consider them only in the case in which the functions involved are convex or concave. Such assumptions guarantee
the existence and uniqueness of the optimum in various cases. We give a brief introduction to the statistical problems that have variational character. Many modern variational problems are discussed later in the book. Wherever possible, applications from statistical contexts illustrate the theory. The classical theory of calculus of variations is extensively discussed, and there are excellent textbooks available. In Section 2.2, we state the variational problem of optimizing the functional
$$W[y] = \int_a^b L[x, y(x), y'(x)]\,dx$$
over the class of continuous and differentiable functions y(x). Various other restrictions on the functions y(x) may be imposed. This class is generally called the admissible class. In the optimization process, a distinction must be made between a global optimum and a local optimum. For example, W[y] has a global minimum for $y = y_0(x)$ if $W[y] \ge W[y_0(x)]$ for all y in the admissible class. However, the local minimum may satisfy such a condition only in a neighborhood of the function $y_0(x)$. Such concepts for a strong and weak local minimum are also defined in this section. The impetus for the development of calculus of variations came from problems in applied mechanics. However, the techniques have been utilized with increasing frequency in other disciplines such as economics, statistics, and control theory. In Section 2.3, we give a few illustrations of variational problems arising from statistical applications. Statistical notions are briefly introduced in this section; further discussion of these can be found in introductory statistics texts. The necessary conditions for a weak local extremum of the integral of the Lagrangian L(x, y, y') are obtained in terms of a partial differential equation, called the Euler equation or Euler-Lagrange equation,

$$\frac{\partial L}{\partial y} - \frac{d}{dx}\,\frac{\partial L}{\partial y'} = 0.$$
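There is an integral analog of this differential equation,

$$\frac{\partial L}{\partial y'} = \int_a^x \frac{\partial L}{\partial y}\,dx + c,$$

where c is a constant.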
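A small numerical illustration may help fix the idea of the Euler equation. The sketch below assumes a specific Lagrangian that is not taken from the text, $L = y'^2 + y^2$ on [0, 1] with y(0) = 0 and y(1) = 1, for which the Euler equation is $y'' = y$ and the extremal is $\sinh x / \sinh 1$. The integral is replaced by a finite sum, the stationarity conditions of that sum are solved as a tridiagonal system, and the result is compared with the extremal. All function names are hypothetical.

```python
import math


def solve_discrete_euler(n):
    """Solve -y[i-1] + (2 + h^2) y[i] - y[i+1] = 0 with y[0] = 0, y[n] = 1."""
    h = 1.0 / n
    m = n - 1                       # number of interior unknowns
    diag = [2.0 + h * h] * m        # main diagonal; off-diagonals are all -1
    rhs = [0.0] * m
    rhs[-1] = 1.0                   # boundary value y[n] = 1 moved to the right side
    for i in range(1, m):           # forward elimination (Thomas algorithm)
        w = -1.0 / diag[i - 1]
        diag[i] += w                # diag[i] - w * (-1)
        rhs[i] -= w * rhs[i - 1]
    inner = [0.0] * m
    inner[-1] = rhs[-1] / diag[-1]
    for i in range(m - 2, -1, -1):  # back substitution
        inner[i] = (rhs[i] + inner[i + 1]) / diag[i]
    return [0.0] + inner + [1.0]


def functional(y):
    """Discretized W[y]: squared difference quotients plus squared interior values."""
    n = len(y) - 1
    h = 1.0 / n
    slope_part = sum((y[i + 1] - y[i]) ** 2 / h for i in range(n))
    value_part = sum(h * y[i] ** 2 for i in range(1, n))
    return slope_part + value_part


n = 100
y = solve_discrete_euler(n)
exact = [math.sinh(i / n) / math.sinh(1.0) for i in range(n + 1)]
max_error = max(abs(a - b) for a, b in zip(y, exact))

# Perturb the solution by a function that vanishes at both end points.
bumped = [y[i] + 0.01 * math.sin(math.pi * i / n) for i in range(n + 1)]
print(max_error, functional(y), functional(bumped))
```

Because this Lagrangian is convex in (y, y'), the stationary function is a minimum, so the perturbed curve should give a larger value of the discretized integral.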
The Euler equation is derived in Section 2.4 through the considerations of variations. The variation of the functional is defined, and the fundamental lemmas of calculus of variations are stated and proved. One of the earliest problems in calculus of variations is the brachistochrone problem, in which one considers the path of a particle moving under gravity along a wire from a point A
to a point B so as to make the travel time from A to B a minimum. The solution of the brachistochrone problem is given. It is well known that the form of the wire is a cycloid. A statistical illustration of the Euler equation is given in Section 2.5. We consider a problem of time series describing the input and output of a system with an impulse response. The estimation of the impulse response so as to minimize the mean-square error in the sense of Wiener leads to a variational problem. Euler equations result in Wiener-Hopf integral equations that can be solved in many cases. In Section 2.6, we discuss the optimization problem with variable endpoints. It is assumed that a and b, the limits in the integral to be optimized, are no longer fixed and move along certain specified curves. Such situations occur quite frequently in applications. The necessary conditions for a weak local extremum are obtained. They involve not only the Euler differential equation but also additional relations satisfied by the curves, generally known as the transversality conditions. Constrained optimization problems require additional considerations, and they are discussed in Section 2.7. The general theory of Lagrange multipliers as introduced in differential calculus also applies to the problems of calculus of variations. An illustration is provided in which bounds of the mean of the largest order statistic are obtained under the restriction that every distribution has mean zero and variance one. The solution is also given for the maximum of Shannon information in which the probability density involved has a given variance. The solution in this case turns out to be the normal distribution. In Section 2.8, the Hamiltonian function is introduced. It provides a simpler form for the Euler equation. The device reduces the second-order Euler equation to first-order equations involving the Hamiltonian function.
This reduction simplifies the solution in many cases. An application of the Hamiltonian function is given to obtain Young's famous inequality. An elementary introduction to the sufficiency theory for the variational problem is given in Section 2.9. The general treatment in calculus of variations for finding sufficient conditions of optimality is quite involved, and we do not give a general discussion of this topic. In case the Lagrangian is convex or concave in (y, y'), the sufficiency conditions for a global extremum can be easily derived. We provide such a discussion. A detailed exposition of sufficiency conditions in the classical case is given by Hadley and Kemp (1971), for example.

1.3 Modern Variational Methods
We discuss Bellman's technique of dynamic programming and the maximum principle of Pontryagin in Chapter III. Modern control theory in engineering and
applications in modern economics result in variational problems that can be solved by the above techniques. Although the maximum principle gives a rigorous mathematical development for the existence and for the necessary conditions of the solution of a general control problem, it is the technique of dynamic programming that provides the answer in practice. Before we consider the techniques, we give a few examples from control theory for illustrative purposes in Section 3.2. These examples introduce the functional equation of dynamic programming. An example is also given in which the functional equation arises from other considerations. Many examples of this nature are found in the literature. The functional equation of dynamic programming and Bellman’s optimality principle are given in Section 3.3. For a process with discrete states, the multistage decision process reduces to the consideration of optimal policy at any given stage through the application of the optimality principle. This principle states that an optimal policy has the property that whatever the initial state and initial decisions are, the remaining decisions also constitute the optimal policy with respect to the state resulting from the first decision. The application of this principle results in reducing the dimension of the optimization problem, and the solution of the problem becomes computationally feasible. When the process is stochastic and the criterion of control is in terms of expectations, a functional equation approach can be similarly used. Bellman and Dreyfus (1962) provide methods of numerical solutions for many dynamic programming problems. The backward induction procedure utilized in the functional equation approach of dynamic programming is discussed in Section 3.4. Using sequential sampling in the statistical tests of hypotheses, the optimal decision making requires a similar process of backward induction. In statistical contexts backward induction was introduced by Arrow et al. 
(1949). It has found various uses in other contexts, such as in stopping problems studied by Chernoff (1972) and Chow et al. (1970). Section 3.5 discusses the maximum principle of Pontryagin. The principle has been used in providing the rigorous theory of optimal processes. It provides the necessary conditions for optimality. The problem considered is that of optimizing an integral of a known function of a vector x, called the state vector, and the vector u, the control vector, when the vectors x and u satisfy constraints in terms of first order differential equations satisfying certain boundary conditions. The theorem stated gives the necessity of the condition that corresponds to the Euler equation. We give an example arising in the consideration of the time optimal case. This problem is concerned with the minimization of total time taken by the control when the system state vector x and the control vector u satisfy a given differential equation. The general theory of optimal processes and other ramifications of the maximum principle are given in a book by Pontryagin et al. (1962).
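The optimality principle and the backward induction procedure described above can be sketched on a toy deterministic control process. Everything in the example is an illustrative assumption rather than material from the text: the integer state, the three admissible controls, the quadratic costs, the horizon of four stages, and the function names. The value function is built from the last stage backward and is then checked against direct enumeration of every control sequence.

```python
from itertools import product

CONTROLS = (-1, 0, 1)  # admissible controls (illustrative)
HORIZON = 4            # number of decision stages (illustrative)


def stage_cost(x, u):
    return x * x + u * u


def terminal_cost(x):
    return x * x


def backward_induction(x0):
    """Return the minimal total cost from x0 and the optimal rule at each stage."""
    reach = abs(x0) + HORIZON
    value = {x: terminal_cost(x) for x in range(-reach, reach + 1)}
    policy = []
    for t in range(HORIZON - 1, -1, -1):
        reach = abs(x0) + t  # states that can be occupied at stage t
        new_value, rule = {}, {}
        for x in range(-reach, reach + 1):
            # Functional equation: V_t(x) = min over u of [cost + V_{t+1}(x + u)].
            costs = {u: stage_cost(x, u) + value[x + u] for u in CONTROLS}
            rule[x] = min(costs, key=costs.get)
            new_value[x] = costs[rule[x]]
        value = new_value
        policy.insert(0, rule)
    return value[x0], policy


def brute_force(x0):
    """Minimal total cost found by enumerating every control sequence."""
    best = None
    for plan in product(CONTROLS, repeat=HORIZON):
        x, total = x0, 0
        for u in plan:
            total += stage_cost(x, u)
            x += u
        total += terminal_cost(x)
        best = total if best is None else min(best, total)
    return best


cost, policy = backward_induction(3)
print(cost, brute_force(3), policy[0][3])
```

Enumeration examines 3^4 sequences for each starting state, whereas backward induction evaluates three controls per state per stage; this is the reduction in the dimension of the problem that the optimality principle provides.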
The relationship between dynamic programming and the maximum principle is discussed in Section 3.6. In the case of the time optimal problem, the functional equation of dynamic programming is the same as that obtained by the maximum principle if certain extra differentiability conditions are assumed. Such conditions may not always be satisfied. The relationship is discussed by Pontryagin et al. (1962).

1.4 Linear Moment Problems
Moment problems have been of interest to mathematicians for a considerable period of time. In its earliest form the moment problem was concerned with finding a distribution function having a prescribed set of moments. A recent survey is given by Shohat and Tamarkin (1943). Various optimizing problems involving distribution functions with prescribed moments arise in statistical contexts, especially in nonparametric theory. In Chapter IV, we consider the variational problem in which the Lagrangian is a linear function of the distribution function. We utilize the geometry of moment spaces, as well as the Hahn-Banach theorem, in providing solutions to the linear moment problem. In Section 4.2, a few examples arising from the applications in statistical bioassay and cataloging problems are given. Both of these examples lead to a general problem of finding the bounds of the integral $\int g(x)\,dF(x)$, where g(x) is a given function with certain properties and F(x) is a distribution function having prescribed moments. The bioassay problem is a special case of the above general problem with g(x) = 1 - e-@, and the cataloging problem also reduces to a similar g(x). First we consider the case in which the random variable having the distribution function F(x) is finite valued. The results are then extended to the case in which the random variable takes values on the positive real line. In Section 4.3, we give an introduction to the concepts of convexity and concavity. The celebrated theorem of Hahn-Banach is stated and proved. The notion of a convex set is very important in the discussion of the optimization problems considered here. Many results are given in such cases. For example, the sufficiency conditions in the classical theory of calculus of variations are given for the convex and concave functions. Many problems give unique solutions in such cases. We can consider the abstract notion of an integral as a linear functional defined over a function space.
An elementary introduction to linear spaces is given. A few pertinent notions are provided to prove the Hahn-Banach theorem for extension of linear functionals. This is an important theorem in function spaces and plays essentially the same role as the separating hyperplane theorem in the theory of convex sets. The Hahn-Banach theorem is then applied to finding solutions of the linear moment problem. We consider the geometry of moment spaces in Section 4.4. An exhaustive
account of the moment spaces is given by Karlin and Studden (1966). In the linear moment problem, the solution is characterized in terms of extreme points of a convex set. These extreme points in the convex set, generated by the class of all distribution functions, correspond to one-point distributions. Therefore, in many cases the optimizing solutions are obtained in terms of these distributions. In problems of regression designs also, the technique of geometry of moments is quite useful, and we use it to obtain certain admissible designs. Section 4.5 discusses the main optimization problem of Chapter IV. We consider the expectation of the function g(x), given by E[g(X)], where the random variable X has a set of prescribed moments. It is shown that the set in (k + 1)-dimensional space with coordinates $[E[g(X)], E(X), \ldots, E(X^k)]$ is convex, closed, and bounded. The existence of a minimizing and maximizing distribution is easily obtained from the above fact. The solution is then characterized in terms of discrete distributions. The results are applied to the examples of Section 4.2, and complete solutions in some simple cases are provided. In Section 4.6, we consider the same problem as in Section 4.5, but with the application of the Hahn-Banach theorem. In moment problems the Hahn-Banach theorem has been applied by many authors. The results of Isaacson and Rubin (1954) are generalized. The conditions on g(x) are more general than assumed in Section 4.5, and the solutions are available in some cases for which the geometry of moments does not provide the answer.

1.5 Nonlinear Moment Problems
Many statistical applications require optimization of integrals of the following nature. Minimize

$$\int \varphi(x, F(x))\,dx$$
over a class of distribution functions F(x) with given moments when $\varphi(x, y)$ is a known function with certain desirable properties. Nonlinear problems of this nature can be reduced to the study of linear moment problems involving the optimizing function when $\varphi(x, y)$ is convex or concave in y. First we reduce the nonlinear problem to a linear one and then apply the Neyman-Pearson technique for the final solution. This ingenious approach works in many cases and is given in Chapter V. Applications of the nonlinear moment problem are made to obtain bounds for the mean range of a sample from an arbitrary population having a given set of moments. In Section 5.2, the fundamental problem of testing statistical hypotheses is discussed. Relevant notions are introduced, and an elementary introduction to
the solution of the problem of testing a simple hypothesis versus a simple alternative is given. The classical theory of Neyman and Pearson is given, and the Neyman-Pearson lemma is proved. This lemma is one of the first nonclassical variational results, and it has important applications in statistics as well as in other fields. The Euler equation in calculus of variations gives necessary conditions for an extremum of an integral in which the limits of integration are given constants or variables. In the optimization problem of Neyman-Pearson, the integrals of known functions are defined over unknown sets, and the optimization is to be done over these sets. In this section various generalizations of the Neyman-Pearson lemma are also given. The application of the Neyman-Pearson technique to the duality theory of linear and nonlinear programming is receiving serious attention by many researchers. We consider some of these problems in Chapter VIII. Other applications of the lemma are given in Section 5.8. We discuss the nonlinear problem introduced at the beginning of this section in Section 5.3. Assuming that $\varphi(x, y)$ is strictly convex in y and twice differentiable in y, we can prove the existence of the minimizing cumulative distribution function $F_0(x)$. We then reduce the problem to that of minimizing the integral

$$\int \left.\frac{\partial \varphi(x, y)}{\partial y}\right|_{y = F_0(x)} F(x)\,dx$$
over a wider class of distribution functions. This process linearizes the problem and is much easier to deal with by the Neyman-Pearson technique. Similar approaches are commonly made in solving certain nonlinear programming problems. The solution can be obtained by a judicious choice of $F_0(x)$. This approach avoids the technical details of satisfying the necessary and sufficient conditions needed for obtaining the solution through the application of classical calculus of variations. Also the present approach takes care of inequality constraints without much difficulty. This approach seems similar in spirit to the Pontryagin maximum principle discussed in Chapter III. In Section 5.4, we consider some statistical applications. The functions of the form

$$\varphi(x, y) = (y - kx)^2$$

occur in a problem of obtaining bounds of the Wilcoxon-Mann-Whitney statistic. The complete solution is given, including the case in which the constraints on the distribution function are inequality constraints:

$$F(x) \ge x.$$
Another example, in which $\varphi(x, y) = y^n$, is also given. Many other examples for the nonlinear case are given by Karlin and Studden (1966).
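The reason a convex integrand can be replaced by a linear one may be sketched in two lines. What follows is the standard first-order convexity argument and is not claimed to be the proof given in Chapter V; it assumes that $\varphi(x, y)$ is convex and differentiable in y, writes $\varphi_y$ for $\partial\varphi/\partial y$, and takes the linearized integrand to be $\varphi_y(x, F_0(x))\,F(x)$.

```latex
% Assumptions (not from the text): \varphi(x, y) is convex and differentiable in y,
% \varphi_y = \partial\varphi/\partial y, and F, F_0 belong to the same admissible class.
\begin{align*}
\varphi(x, F(x)) &\ge \varphi(x, F_0(x)) + \varphi_y(x, F_0(x))\,[F(x) - F_0(x)], \\
\int \varphi(x, F(x))\,dx - \int \varphi(x, F_0(x))\,dx
  &\ge \int \varphi_y(x, F_0(x))\,F(x)\,dx - \int \varphi_y(x, F_0(x))\,F_0(x)\,dx .
\end{align*}
```

Hence, if $F_0$ minimizes the linear functional $F \mapsto \int \varphi_y(x, F_0(x))\,F(x)\,dx$ over the admissible class, the right side is nonnegative for every admissible F, and $F_0$ also minimizes the original integral. For the quadratic example $\varphi(x, y) = (y - kx)^2$, the weight in the linear functional is $\varphi_y(x, F_0(x)) = 2\,[F_0(x) - kx]$.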
The problem of maximizing the integral
$$\int \varphi(x, F(x))\,dx$$
over the same class of distribution functions considered before is discussed in Section 5.5. Using the same condition that $\varphi(x, y)$ is strictly convex in y, we find by simple arguments that the maximizing distribution must be a discrete distribution, as in the case of the linear moment problem. Results of Chapter IV are used to simplify the problem further, and a few examples are provided to illustrate the results. One of the most common uses of the variational method is in finding the efficiency of tests in statistics. We give an introduction to such optimization problems in Section 5.6. Variational arguments are fruitfully applied by Chernoff and Savage (1958) in their famous paper in which they introduce a statistic for testing a wide variety of nonparametric statistical hypotheses. We give an expression for the large sample efficiency of the Fisher-Yates-Terry-Hoeffding statistic and the t-test for testing the hypothesis of equality of the location parameters of two populations. The variational technique provides the lower bound as well as the distribution for which the bound is attained. Another example of a similar kind is given in which the asymptotic efficiency of the Wilcoxon test with respect to the t-test is obtained. In Section 5.7, we introduce the problem of determining regions of type A and type D arising in testing statistical hypotheses. Consideration of unbiased tests (that is, tests for which the power function has a minimum at the null hypothesis) requires that the curvature of the power function be studied in the neighborhood of the null hypothesis. Such considerations for obtaining best unbiased tests lead to type A regions, and the problems are interesting variational problems studied by Neyman and Pearson (1936). In the case of several parameters, consideration of best unbiased tests results in the study of Gaussian curvature of the power function in the neighborhood of the null hypothesis. The variational problems so introduced are studied.
We give an example in which the parameters of the bivariate normal distribution are tested. Miscellaneous applications of the Neyman-Pearson technique are provided in Section 5.8. Problems considered are from stockpiling, mathematical economics, dynamic programming, combination of weapons, and discrete search.

1.6 Optimal Designs for Regression Experiments
Design of experiments is an important branch of statistics. Its development, in the hands of Sir Ronald A. Fisher, not only improved the method of efficient experimentation in many applied sciences but also provided a large number of mathematical problems in combinatorial theory and optimization. In
Chapter VI, we consider the problems arising from performing regression experiments. In such experiments the main objective is to study a response y as a function of an independent variable x. The choice of levels of x when the total number of observations is given becomes an optimization problem. The problem considered is the allocation of observations to various levels of x in order to optimize a certain criterion. In Section 6.2, the problem of linear regression and least-squares estimation is discussed. We give the covariance matrix of the estimated parameters, and this covariance matrix is utilized in developing the optimality criteria. Suppose the linear model is given by
$E(Y) = \theta_1 f_1(x) + \theta_2 f_2(x) + \cdots + \theta_k f_k(x) = \theta' f(x).$

A design of experiment will be represented by $x_1, x_2, \ldots, x_k$ with associated integers $n_1, n_2, \ldots, n_k$ such that $\sum_{i=1}^{k} n_i = n$. That is, the design specifies performing $n_i$ experiments at level $x_i$. Or we can allocate proportion $p_i = n_i/n$ of the observations at $x_i$, $i = 1, 2, \ldots, k$. A typical optimization problem is to find a design that minimizes some characteristic of the estimate of $\theta$. We give estimates of the parameter $\theta$ in the case of linear models and also discuss an approximate theory of nonlinear models.

Section 6.3 discusses a few optimality criteria in regression designs. Suppose the covariance matrix of the estimator $\hat\theta$ is given by $D(\hat\theta)$. Then, one of the most commonly used criteria is the D-optimality criterion, which requires choosing a design that minimizes the determinant of $D(\hat\theta)$. The minimax criterion requires that the maximum of the quadratic form $f'(x) D(\hat\theta) f(x)$ be minimized. Since the diagonal elements of the matrix $D(\hat\theta)$ represent variances of the components of $\hat\theta$, there is another criterion that minimizes the sum of variances of these components or, essentially, the trace of the matrix $D(\hat\theta)$. This is known as the A-optimality criterion. There are many more criteria, but we do not discuss them in this section. Various connections among these criteria are available in the literature, and a few are discussed in Section 6.4. We also give an elegant geometrical method, due to Elfving (1951), for obtaining D-optimal designs in a simple case.

In Section 6.4, continuous normalized designs are introduced. As seen above, a design is represented by a probability distribution with probabilities $p_1, \ldots, p_k$ at $x_1, \ldots, x_k$. In many problems the solution becomes easier if a continuous version of the design is used. The design corresponds to a continuous density function in this case. The criterion for D-optimality reduces to the minimization of an expectation of a known function.
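As a rough numerical illustration of the three criteria just described, the following Python sketch compares two designs for a quadratic model. Everything specific in it is an assumption made for illustration: the model $f(x) = (1, x, x^2)'$ on $[-1, 1]$, the two candidate designs, the normalization $D(\hat\theta) = M^{-1}$ with $M = \sum_i p_i f(x_i) f'(x_i)$ (that is, $\sigma^2/n$ taken as one), and the function names. The maximum of the quadratic form is only approximated on a grid.

```python
def info_matrix(points, weights):
    """M = sum_i p_i f(x_i) f(x_i)' for the assumed model f(x) = (1, x, x^2)."""
    m = [[0.0] * 3 for _ in range(3)]
    for x, p in zip(points, weights):
        f = (1.0, x, x * x)
        for r in range(3):
            for c in range(3):
                m[r][c] += p * f[r] * f[c]
    return m


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def inverse3(m):
    """Inverse of a nonsingular 3x3 matrix via the adjugate formula."""
    d = det3(m)
    cof = [[0.0] * 3 for _ in range(3)]
    for r in range(3):
        for c in range(3):
            rows = [i for i in range(3) if i != r]
            cols = [j for j in range(3) if j != c]
            minor = (m[rows[0]][cols[0]] * m[rows[1]][cols[1]]
                     - m[rows[0]][cols[1]] * m[rows[1]][cols[0]])
            cof[r][c] = (-1) ** (r + c) * minor
    # The inverse is the transposed cofactor matrix divided by the determinant.
    return [[cof[c][r] / d for c in range(3)] for r in range(3)]


def criteria(points, weights):
    """Return (det D, trace D, grid maximum of f'(x) D f(x)) with D = M^{-1}."""
    d = inverse3(info_matrix(points, weights))
    trace_d = d[0][0] + d[1][1] + d[2][2]
    worst = 0.0
    for i in range(201):  # grid on [-1, 1] with spacing 0.01
        x = -1.0 + i / 100.0
        f = (1.0, x, x * x)
        q = sum(f[r] * d[r][c] * f[c] for r in range(3) for c in range(3))
        worst = max(worst, q)
    return det3(d), trace_d, worst


# Two hypothetical designs: equal weights on {-1, 0, 1} and on {-1, -0.5, 0.5, 1}.
design_a = criteria([-1.0, 0.0, 1.0], [1 / 3, 1 / 3, 1 / 3])
design_b = criteria([-1.0, -0.5, 0.5, 1.0], [0.25, 0.25, 0.25, 0.25])
```

Under these assumptions the three-point design gives the smaller determinant, the smaller trace, and the smaller maximum of the quadratic form, so it is preferred by all three criteria in this particular comparison.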
Similar problems are discussed in Chapter IV. We also discuss some important connections between D-optimality and minimaxity. Another criterion of linear optimality is also introduced, and some of the other criteria are obtained as special cases. In nonlinear models, obtaining optimal designs becomes difficult; therefore, asymptotic theory is considered for locally optimal designs, that is, designs in
the neighborhood of the known parameters. Locally optimal designs are discussed in Section 6.5. Direct application of the calculus of variations in some examples has been recently made by Feder and Mezaki (1971). A few examples, including those discussed by Chernoff (1962) for accelerated life testing, are also discussed in this section.

In Section 6.6, we give an introduction to spline functions. The recent application of splines to optimal designs is due to Studden (1971). A spline function is given by polynomials over subintervals and is continuous at the end points and satisfies smoothness conditions such as the existence of higher order derivatives. Splines have been used extensively in approximation theory and have also been applied to problems of control theory, demography, and regression analysis. A brief discussion of the application of splines to optimal designs is given in Section 6.7. Kiefer (1959) introduced the concept of admissible designs in the same spirit as that of an admissible rule in statistical decision theory. We give a brief introduction to the theory of admissible designs using splines.

1.7 Theory of Optimal Control
Many stochastic control problems are variational problems, and dynamic programming methods are commonly used in solving them. In Chapter VII, we discuss the stochastic control problems, giving various forms in which they arise. The techniques of backward induction, statistical decision theory, and controlled Markov chains are also discussed.

The deterministic control process is introduced in Section 7.2. The discrete control problem is given to introduce the multistage decision process. In the discussion of the dynamic programming technique and the maximum principle, we have already seen the basic structure of a control process. An example of feedback control with time lag is provided to illustrate the basic elements of the control process so introduced. The direct variational argument is used to obtain the differential equations that provide the solution to the problem.

In Section 7.3, the controlled Markov chain theory is given. Markov chains naturally occur in the consideration of the stochastic analog of difference equations such as

$y_{n+1} = y_n + u_n,$

where $\{u_n\}$ is a sequence of independent and identically distributed random variables. Then the sequence $\{y_n\}$ is a Markov chain. The study of controlled Markov chains depends heavily on the technique of dynamic programming. The functional equations are derived in this section for the examples given. The study of controlled Markov chains is also known as discrete dynamic programming or Markovian decision processes.
The concepts of statistical decision theory are needed in studying the stopping problems connected with the Wiener process. Elements of this theory are discussed in Section 7.4. The statistical problems are formulated in terms of a game between nature and the statistician, and various criteria for finding optimal decisions, such as those of minimax and Bayes, are defined. Examples are given to illustrate the Bayes strategy in the case of the normal distribution and its continuous version, the Wiener process.

In Section 7.5, we further develop the statistical decision theory for the sequential case. The Bayes rule in the case of sequential sampling consists of a stopping rule as well as a terminal decision rule. The stopping rule is obtained with the help of backward induction if it is assumed that sampling must terminate after a finite number of observations. The sophisticated use of backward induction for this purpose is due to Arrow et al. (1949) and was formalized into dynamic programming by Bellman while the latter was studying multistage processes. An example for testing the hypothesis about the normal mean is given, and the problem is reduced to that of a stopping problem for a Wiener process. Chernoff (1972) has made an extensive study of such problems.

The Wiener process is introduced in Section 7.6. In many discrete problems concerning the normal distribution, the continuous versions lead to the Wiener processes. The properties of the Wiener process are given, as is the simple result corresponding to the standardizing of the normal distribution. That is, a Wiener process with drift $\mu(t)$ and variance $\sigma^2(t)$ can be transformed into a process with drift zero and variance one.

Many problems of sequential analysis and control theory reduce to stopping problems. The stopping problems are also of independent interest in other applications. Such problems are discussed in Section 7.7. Examples of many interesting stopping problems are given by Chow et al. (1970).
Stopping problems for the Wiener process have been discussed by Chernoff (1972). Let a system be described by a process $Y(s)$. Let the stopping cost be $d(y, s)$ when $Y(s) = y$. The problem of optimal stopping is to find a procedure $S$ such that $E[d(Y(S), S)]$ is minimized. For the Wiener process, the technique reduces to that of finding the solution of the heat equation with given boundary values. Such boundary value problems also arise in other contexts. The characterization of continuation sets and stopping sets is made in terms of the solutions of the heat equation. We derive the equation and describe the free boundary problem of the heat equation. The necessary condition for the optimization problem leads to the free boundary solution of the heat equation, and a theorem is stated to provide the sufficient condition for the optimization problem. A simple example is given to illustrate the theory developed by Chernoff. Continuous versions of controlled Markov chains lead to the study of the stopping problems in Wiener processes.

An example of rocket control is given in Section 7.8. The solution of the problem is reduced to the study of the stopping problem of the Wiener process in its continuous version.
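The idea of obtaining a stopping rule by backward induction can be sketched on a toy problem that is far simpler than the Wiener process problems described above. The setup is an assumption chosen for illustration and is not taken from the text: independent draws, uniform on (0, 1), are observed one at a time, at most a fixed number may be observed, and on stopping the reward is the current draw. A reward is maximized here, whereas the problems above minimize a cost. The function name is hypothetical.

```python
def stopping_values(horizon):
    """Values of the optimal rule with k draws remaining, for k = 1, ..., horizon.

    Assumed toy problem: i.i.d. uniform(0, 1) draws, reward equal to the draw
    at which we stop, and the last available draw must be accepted.
    """
    values = [0.5]  # one draw left: it must be accepted, so the value is E[X]
    for _ in range(horizon - 1):
        v = values[-1]
        # With one more draw available the value is
        # E[max(X, v)] = v * P(X < v) + E[X; X >= v] = v**2 + (1 - v**2) / 2.
        values.append((1.0 + v * v) / 2.0)
    return values


values = stopping_values(3)
```

The induction runs backward from the last stage: the value with one draw remaining is fixed first, and each earlier stage compares the current draw with the value of continuing. The resulting rule is to stop as soon as the current draw is at least the value attached to the draws still remaining.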
1.8 Miscellaneous Applications of Variational Methods in Statistics
A few applications of variational techniques not covered in earlier chapters are discussed in Chapter VIII. It is not possible to include a large number of possible applications available in the literature. The topics chosen are based on their current interest in statistics and their potential application to future directions of research. We have included applications in bioassay, reliability theory, mathematical programming, and approximations through splines.

In Section 8.2, we discuss some of the important inequalities in the theory of reliability. The case in which the failure distributions have increasing or decreasing failure rates is specially treated. Roughly speaking, by failure rate or hazard rate we mean the conditional probability of failure of an item at time $t$ given that it has survived until time $t$. Increasing failure rate distributions provide a realistic model in reliability in many instances. If $F(x)$ denotes the distribution function of time to failure, then $\bar F(x) = 1 - F(x)$ is the probability of survival until time $x$ and is a measure of the reliability of an item. The bounds on this probability can be obtained by variational methods. It is not difficult to see that the class of distributions with increasing failure rate is not convex and hence the methods of geometry of moment spaces used to solve such problems in Chapter IV cannot be used directly. Modifications of these methods are required, and the results can be extended to more general cases. We also give bounds for $E[\varphi(x, F(x))]$, where $\varphi(x, y)$ is a known function convex in $y$. Such functions are also considered in the general moment problem in Chapter V. In their monograph on mathematical models of reliability, Barlow and Proschan (1967) give a detailed account of the increasing and decreasing failure rate distributions. Many variational results are also given in their discussion.

The efficiency of the Spearman estimator is discussed in Section 8.3.
In statistical bioassay, nonparametric techniques are increasingly being used. A simple estimator for the location parameter of a tolerance distribution in bioassay is the Spearman estimator. It is shown that the asymptotic efficiency of the Spearman estimator, when compared with the asymptotic maximum likelihood estimator in terms of Fisher information, is less than or equal to one. The bounds are attained for the logistic distribution by using straightforward variational methods.

Spline functions are introduced in Chapter VI and are applied to a problem of optimal experimental design. In Section 8.4, the splines are applied to problems of approximation, using the technique of dynamic programming. The splines are also used in developing models of regression, especially in data analysis, according to a recent study made by Wold (1974). An example is given in which the exact solution of an approximate problem is much easier to obtain than an approximate solution of an exact problem. The dynamic programming procedure is used to solve the optimization problem. This problem arises in the consideration of the best spline approximation $s(x)$ of a function $u(x)$ such that
$\int [s(x) - u(x)]^2\,dx$ is minimized. Since splines are defined over subintervals, the problem reduces to the study of the optimization of a finite sum. The dynamic programming technique becomes highly appropriate in such a case.

In Section 8.5, we consider a few connections between mathematical programming methods and statistics. The scope of these applications is very large, and only a few cases are discussed. The application of the Neyman-Pearson lemma in developing the duality of nonlinear programming problems has been investigated by Francis and Wright (1969) and many others. We give an introduction to this theory. There are many applications of mathematical programming methods to statistics. Some of the moment problems can also be reduced to those of programming problems, and the duality theory for these problems leads to interesting results. An interesting example of minimizing an expectation with countable moment constraints on the distribution function is given. This leads to an infinite linear programming problem. The problem of finding the minimum variance, unbiased estimate of a binomial parameter is also solved through a constrained programming problem.

An important class of optimization problems arises in stochastic programming. We give a brief account in Section 8.6. There is extensive literature on this topic, so we consider only a few examples. The first example concerns a stochastic linear program in which the stochastic elements enter into the objective function. These elements may enter in the constraints, and the problem then becomes that of chance constrained programming. An example of such a problem is given.

An illustration of the application of the dynamic programming technique to the solution of an important decision-making problem in patient care is provided in Section 8.7.
The process of patient care in the operating room, recovery room, or an out-patient clinic exhibits the elements of a control process and is amenable to treatment by dynamic programming. The basic objective in providing such care by the physician or the nurse is to restore homeostasis. Therefore, an objective function is formulated in terms of the physiological variables of the patient at any given time and the variables desired to restore homeostasis. The discussion follows along the lines of Rustagi (1968).
References

Arrow, K. J., Blackwell, D., and Girshick, M. A. (1949). Bayes and minimax solutions of sequential decision problems, Econometrica 17, 213-244.
Barlow, R., and Proschan, F. (1967). Mathematical Theory of Reliability. Wiley, New York.
Bellman, R., and Dreyfus, S. (1962). Applied Dynamic Programming. Princeton Univ. Press, Princeton, New Jersey.
Chernoff, H. (1962). Optimal accelerated life designs for estimation, Technometrics 4, 381-408.
Chernoff, H. (1972). Sequential Analysis and Optimal Design. Soc. Ind. Appl. Math., Philadelphia, Pennsylvania.
Chernoff, H., and Savage, I. R. (1958). Asymptotic normality and efficiency of certain nonparametric test statistics, Ann. Math. Statist. 29, 972-994.
Chow, Y. S., Robbins, H., and Siegmund, D. (1970). Optimal Stopping. Houghton-Mifflin, New York.
Elfving, G. (1951). Optimal allocation in linear regression theory, Ann. Math. Statist. 23, 255-262.
Feder, P. I., and Mezaki, R. (1971). An application of variational methods to experimental design, Technometrics 13, 771-793.
Francis, R., and Wright, G. (1969). Some duality relationships for the generalized Neyman-Pearson problem, J. Optimization Theory Appl. 4, 394-412.
Hadley, G., and Kemp, C. M. (1971). Variational Methods in Economics. American Elsevier, New York.
Isaacson, S., and Rubin, H. (1954). On minimizing an expectation subject to certain side conditions, Tech. Rep. No. 25, Appl. Math. Statist. Lab., Stanford Univ., Stanford, California.
Karlin, S., and Studden, W. J. (1966). Tchebycheff Systems: With Applications in Analysis and Statistics. Wiley (Interscience), New York.
Kiefer, J. (1959). Optimal experimental designs, J. Roy. Statist. Soc. Ser. B 21, 273-319.
Neyman, J., and Pearson, E. S. (1936, 1938). Contributions to the theory of testing statistical hypotheses. I. Unbiased critical regions of type A and type A1. II. Certain theorems on unbiased critical regions of type A. III. Unbiased tests of simple statistical hypotheses specifying the value of more than one unknown parameter. Statist. Res. Mem. 1, 1-37; 2, 25-57.
Pontryagin, L. S., Boltyanskii, V. G., Gamkrelidze, R. V., and Mischenko, E. F. (1962). The Mathematical Theory of Optimal Processes. Wiley (Interscience), New York.
Rustagi, J. S. (1968). Dynamic programming model of patient care, Math. Biosci. 8, 141-149.
Shohat, J. A., and Tamarkin, J. D. (1943). The Problem of Moments. Amer. Math. Soc., Providence, Rhode Island.
Studden, W. J. (1971). Optimal designs and spline regression. In Optimizing Methods in Statistics (J. S. Rustagi, ed.). Academic Press, New York.
Wold, S. (1974). Spline functions in data analysis, Technometrics 16, 1-11.
CHAPTER II

Classical Variational Methods

2.1 Introduction
During the development of the calculus of variations in the last two centuries, the primary impetus has come from problems in applied mechanics. The subject soon became an area of study in mathematics, and a large number of mathematicians contributed to its development. The earliest problems were concerned with finding the maxima and minima of integrals of functions with or without constraints. Since an integral is just a simple functional defined on the space of functions under study, the variational techniques play the same part in functional analysis as the theory of maxima and minima does in the differential calculus.

In view of new technological development in the past few decades, problems of optimization have been encountered in many diverse fields. There arose a series of problems in economics, business, space technology, and control theory that required solutions through variational techniques. Recent advances were needed in order to solve some of the important problems in the above areas, and we shall discuss some of these topics in the next chapter.

In this chapter, we discuss first the Euler-Lagrange equation and the concept of variations. Many aspects of the Euler-Lagrange equations are discussed. We offer a few examples in statistics in which variational techniques are needed. A few results in variational calculus with constraints and with variable boundary points are also given. The sufficient conditions are obtained for an extremum in which the Lagrangian is convex or concave. The Hamiltonian functions are
introduced, and Young’s inequality is derived. The Euler equation is intimately connected with partial differential equations, but we do not pursue the subject in this book.
2.2 Variational Problem

Let $y(x)$ be a real-valued function defined for $x_1 < x < x_2$ having a continuous derivative. Let a given function $L(x, y, z)$ (called the Lagrangian) be continuous and twice differentiable in its arguments. We denote by $y'(x)$ the derivative of $y$ with respect to $x$. Suppose now a functional $W[y(x)]$ is defined as

$W[y(x)] = \int_{x_1}^{x_2} L[x, y(x), y'(x)]\,dx.$ (2.2.1)
The earliest problem of calculus of variations is to optimize the functional (2.2.1) over the class of all continuous and differentiable functions $y(x)$, $x_1 < x < x_2$, under possibly some more restrictions. The class over which the optimization of the functional $W[y(x)]$ is made is generally known as the admissible class of functions. The class of admissible functions $\mathcal{A}$ is sometimes restricted to piecewise continuous functions in general variational problems.
Definition A function $f$ defined on a closed interval $[a, b]$ is called piecewise continuous on $[a, b]$ if the following conditions hold:
(i) $f(x)$ is bounded on $[a, b]$,
(ii) $\lim_{x \to x_0^+} f(x)$ exists $\forall x_0 \in [a, b)$ and $\lim_{x \to x_0^-} f(x)$ exists $\forall x_0 \in (a, b]$, and
(iii) $f(x)$ is continuous on $(a, b)$.
Definition $f$ is said to be piecewise continuous on an arbitrary subset $S$ of the reals if it is piecewise continuous on $[a, b]$ for all $[a, b] \subset S$.

Most often in variational problems, one is concerned with obtaining the global maximum or minimum of the functional $W[y]$. For a global minimum, for example, the problem is to find a $y_0$ in the admissible class $\mathcal{A}$ such that

$W[y_0] \le W[y]$ (2.2.2)

for all functions $y \in \mathcal{A}$. Not only does one need to show that a $y_0$ with property (2.2.2) exists, but also that it is unique. In many problems, a characterization of the optimizing $y_0(x)$ is necessary.

In contrast to the problem of finding the global maximum or minimum (called, for simplicity, the global extremum), one may be satisfied in many situations with the local (relative) extremum.
Let the distance between two functions $u(t)$, $v(t)$ be given by

$d_0(u, v) = \sup_{t \in [x_1, x_2]} |u(t) - v(t)|.$ (2.2.3)
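For a concrete feel for the distance (2.2.3), a grid approximation can be sketched in a few lines of Python. The function name and the grid size are assumptions; the supremum is replaced by a maximum over evenly spaced points, so the result is only a lower approximation of $d_0$ in general.

```python
def sup_distance(u, v, x1, x2, grid=1000):
    """Approximate d_0(u, v) = sup |u(t) - v(t)| over [x1, x2] on an even grid."""
    step = (x2 - x1) / grid
    return max(abs(u(x1 + i * step) - v(x1 + i * step)) for i in range(grid + 1))


# Distance between t**2 and t on [0, 1]; the largest gap occurs at t = 0.5.
gap = sup_distance(lambda t: t * t, lambda t: t, 0.0, 1.0)
```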
The strong local minimum of a functional $W[y]$ is given by $y_0$ if

$W[y_0] \le W[y]$ (2.2.4)

for all $y \in \mathcal{A}$ such that there exists a $\delta$ with
u) 0. Notice that under this condition, the class of distributions is not empty. For the general case in which k arbitrary moments are given, the conditions for the existence of a distribution function form a classical problem that has been studied in the literature, e.g., Shohat and Tamarkin (1943).

In many nonparametric statistical problems, one is concerned with the minimization of the expectation of the range or the expectation of the extreme order statistics over the class of admissible distribution functions with the first two given moments. That is, we want to minimize, say, the expectation of the largest order statistic:

Minimize $\int_{-\infty}^{\infty} x\,d[F(x)]^n.$ (2.3.2)
Integrating by parts the integral in (2.3.2), we have the above problem reduced to maximizing $\int [F(x)]^n\,dx$. Here we have

$W[y] = \int y^n\,dx,$ (2.3.3)

and the Lagrangian is given by

$L(x, y, y') = y^n.$ (2.3.4)
Similarly the minimization of the expectation of the smallest order statistic requires the maximization of the integral

$\int [1 - F(x)]^n\,dx,$ (2.3.5)

and the Lagrangian in this case is

$L(x, y, y') = (1 - y)^n.$ (2.3.6)
For the minimum or maximum of the expectation of the range of the sample, we need to consider the integral

$W[F(x)] = \int [1 - F^n(x) - (1 - F(x))^n]\,dx.$

The Lagrangian in this case is

$L(x, y, y') = y^n + (1 - y)^n.$ (2.3.7)
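The integrals appearing in this example are easy to check numerically for a distribution with bounded support. The Python sketch below does this with a midpoint rule. The function name, the choice of the uniform distribution on (0, 1), and the number of steps are assumptions made for illustration. For that distribution the expected range of a sample of size $n$ is known to be $(n - 1)/(n + 1)$, and the expected largest value, $n/(n + 1)$, equals the upper end point minus $\int F^n\,dx$.

```python
def order_statistic_integrals(cdf, lo, hi, n, steps=20000):
    """Midpoint-rule values over [lo, hi] of the integrals of F^n, (1 - F)^n,
    and 1 - F^n - (1 - F)^n, for a cdf supported on [lo, hi]."""
    h = (hi - lo) / steps
    int_fn = 0.0
    int_sn = 0.0
    for i in range(steps):
        y = cdf(lo + (i + 0.5) * h)
        int_fn += y ** n * h
        int_sn += (1.0 - y) ** n * h
    # The integral of the constant 1 over [lo, hi] is simply hi - lo.
    return int_fn, int_sn, (hi - lo) - int_fn - int_sn


# Uniform(0, 1) sample of size 5: F(x) = x on [0, 1].
int_fn, int_sn, expected_range = order_statistic_integrals(lambda x: x, 0.0, 1.0, 5)
expected_max = 1.0 - int_fn
```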
Example 2.3.2 Let a random sample $X_1, X_2, \ldots, X_m$ be given from a population with cdf $F(x)$. An independent sample $Y_1, Y_2, \ldots, Y_n$ is also given from another population with cdf $G(y)$. Suppose we want to test the hypothesis $F = G$ against the alternative $F \ne G$.
One of the tests used for this hypothesis is given by the Wilcoxon-Mann-Whitney statistic; for reference see Fraser (1957). The test statistic is given by $U$, where

$U$ = number of pairs $(X_i, Y_j)$ such that $Y_j < X_i$; $i = 1, 2, \ldots, m$, $j = 1, 2, \ldots, n$.

Let $p = 1 - \int G[F^{-1}(x)]\,dx$ and by assuming that $L(t) = G(F^{-1}(t))$, we have

$\int L(t)\,dt = 1 - p.$ (2.3.8)
It is well known that E(U) = m n p , and the variance of U , given by V(U),is I
-(m-
1)(1-p)2+(n-1)(1-p)2+p(l-p)}.
There is considerable interest in finding the lower and upper bounds of $V(U)$ over the class of distribution functions $L(t)$, $0 < t < 1$. That is, the problem of interest is to find the extrema of the integral

$\int_0^1 (L(t) - kt)^2\,dt$ (2.3.9)

over the class of functions $L(t)$, each of which is also a cdf, with side restriction (2.3.8). Notice that the Lagrangian in this case is given by

$L(x, y, y') = (y - kx)^2.$ (2.3.10)
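The statistic $U$ of this example is simple to compute directly from its definition as a count of pairs. The short Python sketch below does so. The function name and the sample values are made up for illustration, and the ratio $U/(mn)$ is shown only as the natural sample proportion of pairs with $Y_j < X_i$.

```python
def wmw_u(xs, ys):
    """Count the pairs (x_i, y_j) with y_j < x_i."""
    return sum(1 for x in xs for y in ys if y < x)


# Hypothetical samples with m = 3 and n = 4.
xs = [1.2, 3.4, 2.2]
ys = [0.5, 2.0, 4.0, 3.0]
u = wmw_u(xs, ys)
proportion = u / (len(xs) * len(ys))
```

The double loop costs $mn$ comparisons, which is adequate for small samples; for large samples a rank-based computation is usually preferred.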
Example 2.3.3 In information theory, one of the first problems is to maximize Shannon's information given by

$-\int f(x) \log f(x)\,dx$

for all pdf's $f(x)$, which are assumed to satisfy certain moment constraints. Assuming that $y = F(x)$ and that $F(x)$ is absolutely continuous, the above problem is a variational problem with a Lagrangian given by
$L(x, y, y') = -y' \log y'.$

Sometimes one is interested in finding the lower bound of the Fisher information given by

$\int \frac{[f'(x)]^2}{f(x)}\,dx,$
where the pdf $f(x)$ has given mean and variance. In this case, assume that $y = f(x)$; then the Lagrangian is given by

$L(x, y, y') = y'^2 / y.$

2.4 Euler-Lagrange Equations
In this section, we derive the necessary conditions for an extremum of the integral

$W[y(x)] = \int_{x_1}^{x_2} L[x, y(x), y'(x)]\,dx$ (2.4.1)
and introduce the concept of variations. In order to study the extremal function, we introduce a parameter $\alpha$ in the definition of the function $y(x)$ as follows: Let

$y(x) \equiv Y(x, \alpha), \quad x_1 < x < x_2$

and

$-\infty$
0. Then since $\varphi(x)$ is continuous, there is an interval $[a, b]$ around $c$ such that $\varphi(x) > 0$. That is, $\varphi(x) > 0$ for $a < x < b$. Let

$\eta(x) = (x - a)^2 (x - b)^2$ for $a \le x \le b$, and $\eta(x) = 0$ elsewhere.

Then

$\int_{x_1}^{x_2} \varphi(x)\eta(x)\,dx = \int_a^b (x - a)^2 (x - b)^2 \varphi(x)\,dx > 0.$

This is a contradiction, proving the lemma.
Lemma 2.4.2 Let $\varphi(x)$ be continuous in $[x_1, x_2]$. If $\int_{x_1}^{x_2} \varphi(x)\eta'(x)\,dx = 0$ for every differentiable function $\eta(x)$ such that $\eta(x_1) = \eta(x_2) = 0$, then there exists a constant $c$ such that $\varphi(x) = c$ for all $x \in [x_1, x_2]$.
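Proof Define

$c = (x_2 - x_1)^{-1} \int_{x_1}^{x_2} \varphi(x)\,dx$ and $\eta(x) = \int_{x_1}^{x} (\varphi(t) - c)\,dt.$

Notice that $\eta(x)$ so defined is differentiable and satisfies the hypotheses of the lemma. The integral

$\int_{x_1}^{x_2} (\varphi(x) - c)\eta'(x)\,dx = \int_{x_1}^{x_2} \varphi(x)\eta'(x)\,dx - c[\eta(x_2) - \eta(x_1)] = 0.$

Also

$\int_{x_1}^{x_2} (\varphi(x) - c)\eta'(x)\,dx = \int_{x_1}^{x_2} (\varphi(x) - c)^2\,dx.$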
Hence $\varphi(x) - c = 0$ or $\varphi(x) = c$ for all $x \in [x_1, x_2]$.

We now prove the following theorem for the necessary conditions for a weak local extremum.

Theorem 2.4.1 A necessary condition that $W[y]$ has a weak local extremum at $y_0$ is that $\delta W[y_0, \delta y] = 0$ for all variations $\delta y$.

Proof Suppose first that $y_0$ is the weak local minimum; then

$W[y] - W[y_0] \ge 0$

for all admissible functions $y$ in a $d_1$-neighborhood of $y_0$. Now from (2.4.14), we see that $\delta W[y_0, \delta y] + o(\alpha) \ge 0$ for small $\alpha$. For all $\eta(x)$, continuous and differentiable, $W[y] - W[y_0] - o(\alpha)$ is linear with respect to $\alpha$. Therefore, $\delta W[y_0, \delta y]$ is linear with respect to $\alpha$. Let $\alpha A_0 = \delta W[y_0, \delta y]$. Then $\alpha A_0 + o(\alpha) \ge 0$; hence $A_0 = -o(1)$. But $A_0$ is constant. Therefore, $A_0 = 0$ and $\delta W[y_0, \delta y] = 0$.
Theorem 2.4.2 If $W[y]$ has a weak local extremum at $y_0$, then $\delta L/\delta y_0 = 0$ and

(2.4.16)

Proof From Theorem 2.4.1

using (2.4.10) and (2.4.14), for all $\eta(x)$ that are continuous and differentiable. Take $\eta(x_1) = \eta(x_2) = 0$ so that, using Lemma 2.4.1, we have
Definition The differential equation

$\frac{\partial L}{\partial y} - \frac{d}{dx}\left(\frac{\partial L}{\partial y'}\right) = 0$ (2.4.17)

is called the Euler-Lagrange equation or Euler equation. The Euler-Lagrange equation is really a second order differential equation, as can be seen by simplifying the differential operator. We will see later that the Hamiltonian transformations reduce the equation into two first order differential equations.
Special Cases of Euler-Lagrange Equation

The following three cases are of interest in applications.

Case (i) $\partial L/\partial y = 0.$ (2.4.18)
Then Eq. (2.4.17) reduces to

$\frac{d}{dx}\left(\frac{\partial L}{\partial y'}\right) = 0,$

so that

$\frac{\partial L}{\partial y'} = c.$ (2.4.19)
Case (ii) $\partial L/\partial y' = 0.$ (2.4.20)

Then we have

$\partial L/\partial y = 0.$ (2.4.21)
Case (iii) If $L$ does not explicitly depend on $x$,

$\partial L/\partial x = 0.$ (2.4.22)

Then the Euler-Lagrange equation reduces to

$L - y'\,\partial L/\partial y' = c.$ (2.4.23)
To show (2.4.23), we proceed as follows.
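One standard way to carry out this step is sketched below. It assumes that $y$ is twice differentiable, that $L$ has continuous second partial derivatives, and that the Euler-Lagrange equation is taken in its usual form $\partial L/\partial y - (d/dx)(\partial L/\partial y') = 0$. Differentiating $L - y'\,\partial L/\partial y'$ along an extremal gives

```latex
\begin{aligned}
\frac{d}{dx}\left[L - y'\frac{\partial L}{\partial y'}\right]
 &= \frac{\partial L}{\partial x}
    + \frac{\partial L}{\partial y}\,y'
    + \frac{\partial L}{\partial y'}\,y''
    - y''\frac{\partial L}{\partial y'}
    - y'\frac{d}{dx}\left(\frac{\partial L}{\partial y'}\right) \\
 &= \frac{\partial L}{\partial x}
    + y'\left[\frac{\partial L}{\partial y}
    - \frac{d}{dx}\left(\frac{\partial L}{\partial y'}\right)\right] = 0.
\end{aligned}
```

The first term on the last line vanishes by (2.4.22) and the bracket vanishes by the Euler-Lagrange equation, so the quantity being differentiated does not change with $x$.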
Hence $L - y'\,\partial L/\partial y' = c$.
Example 2.4.1 (Brachistochrone problem) One of the earliest problems that led to the development of variational methods is the following. The path of a particle moving under the force of gravity from a given point A to a given point B along a wire, say, is to be obtained so that the time it takes to travel from A to B is a minimum, as shown in Fig. 2.1. Let $m$ be the mass and $v$ the velocity of the particle. Let $g$ be the gravitational constant. The law of conservation of energy gives

$\tfrac{1}{2} m v^2 - m g y = 0,$

giving

$v = (2gy)^{1/2}$ or $ds/dt = (2gy)^{1/2},$
Figure 2.1
where $s$ is the distance traveled in time $t$. The time taken is obtained from $dt = ds/v$, where $ds$ is the element of the arc traveled in time $dt$. Let A represent $x = 0$ and B be given by $x = x_1$. Then the total time taken, which is a function of the path $y(x)$, is given by

$T[y] = \int_0^{x_1} \frac{ds}{v} = \int_0^{x_1} \left[\frac{1 + y'^2}{2gy}\right]^{1/2} dx,$

using the well-known formula

$(ds/dx)^2 = 1 + (dy/dx)^2.$
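The problem then is to find $\min T[y] = T[y_0]$ over the class of functions $y$ on $0 \le x \le x_1$. Here, apart from the constant factor $(2g)^{-1/2}$,

$L(x, y, y') = \left[\frac{1 + y'^2}{y}\right]^{1/2}.$

Since $\partial L/\partial x = 0$, using (2.4.23) we have

$\left[\frac{1 + y'^2}{y}\right]^{1/2} - \frac{y'^2}{[y(1 + y'^2)]^{1/2}} = \text{const.}$

or

$[y(1 + y'^2)]^{-1/2} = c^{-1/2},$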
where $c$ is some constant. Hence $y' = [(c - y)/y]^{1/2}$. Solving the above differential equation in the parametric form, we obtain the solution as

$x = (c/2)(\varphi - \sin \varphi),$ (2.4.24a)

$y = (c/2)(1 - \cos \varphi).$ (2.4.24b)

The curve so defined is a cycloid and gives the form of the wire.
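The cycloid solution can be checked numerically. The Python sketch below evaluates $y(1 + y'^2)$ along the curve (2.4.24a)-(2.4.24b), which should equal the constant $c$, and compares the descent time along the cycloid with the time along the straight chord joining the same end points. The value of $g$, the choice $c = 2$, the end point at parameter value $\pi$, and all function names are assumptions made for illustration. The chord time uses the elementary formula for uniform acceleration from rest along an incline.

```python
import math

G = 9.81  # assumed value of g, in m/s^2


def cycloid_point(c, phi):
    """Point on the cycloid (2.4.24a)-(2.4.24b); y is measured downward."""
    return (c / 2) * (phi - math.sin(phi)), (c / 2) * (1 - math.cos(phi))


def first_integral(c, phi):
    """Evaluate y * (1 + y'^2) on the cycloid, with y' = (dy/dphi) / (dx/dphi)."""
    _, y = cycloid_point(c, phi)
    slope = math.sin(phi) / (1 - math.cos(phi))
    return y * (1 + slope ** 2)


def cycloid_time(c, phi_end, steps=10000):
    """Midpoint-rule approximation of the integral of ds / v along the cycloid."""
    h = phi_end / steps
    total = 0.0
    for i in range(steps):
        phi = (i + 0.5) * h
        dx = (c / 2) * (1 - math.cos(phi))  # dx/dphi
        dy = (c / 2) * math.sin(phi)        # dy/dphi
        _, y = cycloid_point(c, phi)
        total += math.hypot(dx, dy) / math.sqrt(2 * G * y) * h
    return total


def straight_line_time(x1, y1):
    """Descent time from rest along the chord from (0, 0) to (x1, y1)."""
    length = math.hypot(x1, y1)
    return length * math.sqrt(2 / (G * y1))


c = 2.0
x_end, y_end = cycloid_point(c, math.pi)
t_cycloid = cycloid_time(c, math.pi)
t_chord = straight_line_time(x_end, y_end)
```

With these choices the cycloid time agrees with the closed form $\pi (c/2g)^{1/2}$ and is shorter than the chord time, as the minimizing property requires for this pair of end points.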
Remark The generalization of the above problem can be made in several directions. One generalization common in applications is that the Lagrangian is a function of several variables and their derivatives. Another generalization is concerned with studying the Lagrangian involving derivatives of higher order. We do not discuss any of the above generalizations here.

2.5 Statistical Application
We consider an important application of the Euler equation in statistical time series (Jenkins and Watts, 1968) in this section.¹ The problem is concerned with the estimation of the impulse response in a linear system. Let the system be described at time $t$ by the time series

$X(t)$ = input of the system,
$Y(t)$ = output of the system.

Assume that $X(t)$ has mean $\mu_X$ and $Y(t)$ has mean $\mu_Y$. The linear system is described by

$Y(t) - \mu_Y = \int_0^\infty h(u)[X(t - u) - \mu_X]\,du + Z(t),$ (2.5.1)
where $h(u)$ is the system impulse response and $Z(t)$ is the error term. Figure 2.2 gives a schematic representation of the above system. One of the many criteria for optimization is the Wiener minimum mean-square criterion. This criterion requires choosing $h(u)$ such that

$W[h(u)] = E[Z(t)]^2$ (2.5.2)

is minimized.

Figure 2.2

¹ I am indebted to Edward Dudewicz for pointing out this application.
Let the time series $X(t)$ and $Y(t)$ be stationary, and let the covariance between $X(t)$ and $Y(t + u)$ be denoted by

$\gamma_{XY}(u) = E[X(t) - \mu_X][Y(t + u) - \mu_Y]$ (2.5.3)

and between $X(t)$, $X(t + u)$ by

$\gamma_{XX}(u) = E[X(t) - \mu_X][X(t + u) - \mu_X].$

Similarly, $\gamma_{YY}(0)$ denotes the variance of $Y(t)$, and $\gamma_{XX}(0)$ denotes the variance of $X(t)$. Then the criterion (2.5.2) reduces to the minimization of

$W[h(u)] = E\left[Y(t) - \mu_Y - \int_0^\infty h(u)[X(t - u) - \mu_X]\,du\right]^2 = \gamma_{YY}(0) - 2\int_0^\infty h(u)\gamma_{XY}(u)\,du + \int_0^\infty\int_0^\infty h(u)h(v)\gamma_{XX}(u - v)\,du\,dv.$ (2.5.4)
We use the variational approach to find $h(u)$ in order to minimize (2.5.4). The result is stated in the following theorem.

Theorem 2.5.1 The function $h$ that minimizes $W[h]$ satisfies the Wiener-Hopf integral equation

$\gamma_{XY}(u) = \int_0^\infty h(v)\gamma_{XX}(u - v)\,dv, \quad u \ge 0.$
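A discrete, finite-lag analogue of this integral equation may help fix ideas. In the Python sketch below the integral is replaced by a sum over a few lags, so that the weights solve the linear system $\sum_j h_j \gamma_{XX}(k - j) = \gamma_{XY}(k)$. The truncation to three lags, the autocovariance $\gamma_{XX}(k) = \rho^{|k|}$, the weights used to generate $\gamma_{XY}$, and all function names are assumptions made for illustration; they are not part of the theorem.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def gamma_xx(lag, rho=0.6):
    """Assumed input autocovariance, gamma_XX(k) = rho^|k|."""
    return rho ** abs(lag)


def wiener_hopf_weights(gamma_xy, n_lags):
    """Solve sum_j h_j gamma_XX(k - j) = gamma_XY(k) for k = 0, ..., n_lags - 1."""
    a = [[gamma_xx(k - j) for j in range(n_lags)] for k in range(n_lags)]
    return solve_linear(a, gamma_xy)


# Build gamma_XY from hypothetical weights, then recover them from the equations.
true_h = [1.0, 0.5, 0.25]
gamma_xy = [sum(h * gamma_xx(k - j) for j, h in enumerate(true_h)) for k in range(3)]
recovered = wiener_hopf_weights(gamma_xy, 3)
```

Because the cross covariances were generated from the assumed weights, solving the system returns those weights up to rounding error. With estimated covariances the same system would give estimated weights.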
Proof Let $h_0$ be the minimizing function and let

$h(u) = h_0(u) + \epsilon g(u).$ (2.5.5)
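We have

$W[h] = \gamma_{YY}(0) - 2\int_0^\infty [h_0(u) + \epsilon g(u)]\gamma_{XY}(u)\,du + \int_0^\infty\int_0^\infty [h_0(u) + \epsilon g(u)][h_0(v) + \epsilon g(v)]\gamma_{XX}(u - v)\,du\,dv.$ (2.5.6)

We obtain a necessary condition for a minimum if

$\left.\frac{\partial W}{\partial \epsilon}\right|_{\epsilon = 0} = 0$

holds for all $g$. From (2.5.6), we obtain

$\left.\frac{\partial W}{\partial \epsilon}\right|_{\epsilon = 0} = -2\int_0^\infty g(u)\gamma_{XY}(u)\,du + \int_0^\infty\int_0^\infty [h_0(u)g(v) + g(u)h_0(v)]\gamma_{XX}(u - v)\,du\,dv.$

Since $\gamma_{XX}$ is an even function, we have

$\left.\frac{\partial W}{\partial \epsilon}\right|_{\epsilon = 0} = 0 = -2\int_0^\infty g(u)\left[\gamma_{XY}(u) - \int_0^\infty h_0(v)\gamma_{XX}(u - v)\,dv\right]du.$ (2.5.7)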
1 OD
Yxy(4=
ho(u)yxx(u - 4 du,
(2.5.8)
0
which is in the form of a well-known Wiener-Hopf integral equation and can be solved in many cases of interest. 2.6
Extremals with Variable End Points
In the previous section, we considered the case of fixed end points $x_1$ and $x_2$. In many applications, especially in mechanics, the points $(x_1, y_1)$ and $(x_2, y_2)$ may be known to lie on certain given curves $C_1$ and $C_2$, respectively. Assume that the equations of the curves are given by

$$C_1(x, y) = 0 \quad\text{and}\quad C_2(x, y) = 0. \tag{2.6.1}$$
We assume further that the extremal can be written explicitly as $(x, y(x))$ and

$$y(x_1) = y_1 \quad\text{and}\quad y(x_2) = y_2. \tag{2.6.2}$$

We consider the problem of finding extremals $y^0$ for optimizing (2.2.1) where $x_1$ and $x_2$ satisfy (2.6.2). Notice that when the curves $C_1$ and $C_2$ shrink to points, the problem reduces to the case discussed earlier. Our first aim will be to obtain the total variation of the functional $W$, as defined earlier in (2.4.14), in an alternative form and then give conditions for an extremum. The concept of variation is not only applicable to finding extrema of functionals but has many other applications.

We assume here that both $x$, and $y$ as a function of $x$, vary. We obtain the total variation of the functional $W$ in terms of the total variation of $y$, which in turn is given in terms of the variation (differential) of $x$. Let $X(x, \alpha)$ be a function with parameter $\alpha$ such that $X(x, 0) = x^0$, the value for which the extremum occurs. Similarly $X(x_1, 0) = x_1^0$ and $X(x_2, 0) = x_2^0$. Now

$$X(x, \alpha) = X(x, 0) + \alpha\left(\frac{\partial X}{\partial \alpha}\right)_{\alpha=0} + o(\alpha) = x^0 + \Delta x^0 + o(\alpha), \tag{2.6.3}$$

where $\Delta x^0 = \delta x^0$ is the (total) variation of $X$ at $x^0$. Similarly,

$$Y(X(x, \alpha)) = Y(x^0 + \Delta x + o(\alpha)) = Y(x^0) + \Delta x\, Y'(x^0) + o(\alpha). \tag{2.6.4}$$

Also, we find that

$$\begin{aligned} Y(X(x, \alpha), \beta) &= Y(x^0, \beta) + \Delta x\, Y'(x^0, \beta) + o(\alpha) \\ &= y^0(x^0) + \beta y_0'(x^0) + o(\beta) + \Delta x\,[y'(x^0) + O(\beta)] + o(\alpha) \\ &= y^0(x^0) + \delta y(x^0) + \Delta x\, y'(x^0) + o(\alpha) + o(\beta). \end{aligned} \tag{2.6.5}$$

The quantity

$$\Delta y = Y - y^0 = \delta y + \Delta x\, y' + o(\alpha) + o(\beta)$$

is the total variation of $y$.
Definition The total variation of $W$ is defined by

$$\Delta W = W[Y] - W[y] + o(\alpha) + o(\beta). \tag{2.6.6}$$
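To obtain the total variation $\Delta W$, we proceed as follows.

$$\Delta W = \int_{X_1}^{X_2} dX\, L(X, Y(X), Y'(X)) - \int_{x_1}^{x_2} dx\, L(x, y, y') + o(\alpha, \beta), \tag{2.6.7}$$

where $X_1 = X(x_1, \alpha)$ and $X_2 = X(x_2, \alpha)$. Or

$$\Delta W = \int_{x_1}^{x_2} \Big[\frac{dX}{dx}\, L(x + \Delta x,\, y + \Delta y,\, y' + \Delta(y')) - L(x, y, y')\Big]\,dx + o(\alpha, \beta). \tag{2.6.8}$$

Since $X = x + \Delta x + o(\alpha)$, $\partial X/\partial x = 1 + (d/dx)\,\Delta x + o(\alpha)$. Expanding (2.6.8) and retaining terms containing $\Delta x$, we have

$$\Delta W = \int_{x_1}^{x_2} dx\,\Big\{\Big[1 + \frac{d}{dx}(\Delta x)\Big]\Big[L(x, y, y') + \frac{\partial L}{\partial x}\,\Delta x + \frac{\partial L}{\partial y}\,\Delta y + \frac{\partial L}{\partial y'}\,\Delta(y')\Big] - L(x, y, y')\Big\} + o(\alpha, \beta).$$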
Since Ay = 6y + y' A x t o(a,0) and A O ' ) = SO')+ y"Ax X,
Let
and since
we have X,
XI
+ o(a,p), we have
1
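Since $\delta y = \Delta y - y'\,\Delta x$,

$$\Delta W = \int_{x_1}^{x_2} dx\,\frac{\delta L}{\delta y}\,\delta y + \Big[\Big(L - y'\frac{\partial L}{\partial y'}\Big)\Delta x + \frac{\partial L}{\partial y'}\,\Delta y\Big]_{x_1}^{x_2} + o(\alpha, \beta), \tag{2.6.9}$$

where

$$\frac{\delta L}{\delta y} = \frac{\partial L}{\partial y} - \frac{d}{dx}\frac{\partial L}{\partial y'}. \tag{2.6.10}$$

We can now give the necessary conditions for a weak local extremum of $W$ with variable end points.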
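Necessary Conditions for a Weak Local Extremum

(1) $\delta L/\delta y = 0$, $\quad x_1 < x < x_2$, (2.6.11)

(2) $\Big(L - y'\dfrac{\partial L}{\partial y'}\Big)\Big|_{x_1}\Delta x_1 + \dfrac{\partial L}{\partial y'}\Big|_{x_1}\Delta y_1 = 0$, (2.6.12)

(3) $\Big(L - y'\dfrac{\partial L}{\partial y'}\Big)\Big|_{x_2}\Delta x_2 + \dfrac{\partial L}{\partial y'}\Big|_{x_2}\Delta y_2 = 0$, (2.6.13)

where $\Delta x_1 = \Delta x|_{x_1}$ and $\Delta y_1 = \Delta y|_{x_1}$, and similarly for $\Delta x_2$ and $\Delta y_2$.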
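Proof (1) From Eq. (2.6.9), $\Delta W = 0$ implies that

$$\int_{x_1}^{x_2} dx\,\frac{\delta L}{\delta y}\,\delta y + \Big[\Big(L - y'\frac{\partial L}{\partial y'}\Big)\Delta x + \frac{\partial L}{\partial y'}\,\Delta y\Big]_{x_1}^{x_2} = 0.$$

Since $\delta y$ and $\Delta x$ are linearly independent, we have, using $\delta y = \beta y_0'(x)$,

$$\beta\int_{x_1}^{x_2} dx\,\frac{\delta L}{\delta y}\,y_0'(x) = 0.$$

This is true for all $\beta$, $y_0'$ such that $y_0'(x_1) = y_0'(x_2) = 0$, so

$$\int_{x_1}^{x_2} dx\,\frac{\delta L}{\delta y}\,y_0'(x) = 0.$$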
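Hence $\delta L/\delta y = 0$ for all $x_1 < x < x_2$. (2) and (3) both follow from the assumption that $(\Delta x_1, \Delta y_1)$ and $(\Delta x_2, \Delta y_2)$ are independent.

Suppose now that the end points are given implicitly to move along the curves given by

$$C_1(x_1, y_1) = 0, \qquad C_2(x_2, y_2) = 0. \tag{2.6.14}$$

We have, by introducing variations, $C_1(x_1 + \Delta x_1, y_1 + \Delta y_1) = 0$ or

$$\frac{\partial C_1}{\partial x_1}\,\Delta x_1 + \frac{\partial C_1}{\partial y_1}\,\Delta y_1 = 0. \tag{2.6.15}$$

Similarly, expanding $C_2(x_2 + \Delta x_2, y_2 + \Delta y_2)$, we have

$$\frac{\partial C_2}{\partial x_2}\,\Delta x_2 + \frac{\partial C_2}{\partial y_2}\,\Delta y_2 = 0. \tag{2.6.16}$$

From (2.6.12) and (2.6.15), the determinant

$$\begin{vmatrix} L - y'(x_1)\,\partial L/\partial y'|_{x_1} & \partial L/\partial y'|_{x_1} \\ \partial C_1/\partial x_1 & \partial C_1/\partial y_1 \end{vmatrix} \tag{2.6.17}$$

vanishes. Similarly, (2.6.13) and (2.6.16) give

$$\begin{vmatrix} L - y'(x_2)\,\partial L/\partial y'|_{x_2} & \partial L/\partial y'|_{x_2} \\ \partial C_2/\partial x_2 & \partial C_2/\partial y_2 \end{vmatrix} = 0. \tag{2.6.18}$$

We also have

$$C_1(x_1, y_1) = 0, \tag{2.6.19}$$

$$C_2(x_2, y_2) = 0. \tag{2.6.20}$$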
Equations (2.6.17)-(2.6.20) determine the weak local extremals in case there are variable end points. The conditions (2.6.17) and (2.6.18) are generally known as the transversality conditions.

The transversality conditions are sometimes written as

$$\Big[L + (\varphi_1' - y')\,\frac{\partial L}{\partial y'}\Big]_{x = x_1} = 0, \tag{2.6.21}$$

$$\Big[L + (\varphi_2' - y')\,\frac{\partial L}{\partial y'}\Big]_{x = x_2} = 0, \tag{2.6.22}$$

where $\varphi_1(x)$ describes the curve $C_1$: $C_1(x_1, \varphi_1(x_1)) = 0$ and $\varphi_2(x)$ describes the curve $C_2$: $C_2(x_2, \varphi_2(x_2)) = 0$.

Consider an example in which the extremal is to be obtained when the Lagrangian is given by $L = (1 + y'^2)^{1/2}/y$. The end points move along the curves $x_1^2 + y_1^2 = 1$ and $y_2 = 5 + x_2$.
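For the Lagrangian of this example, the transversality condition can be explored numerically. The sketch below assumes the condition takes the standard form $L + (\varphi' - y')\,\partial L/\partial y' = 0$ at an end point, where $\varphi'$ is the slope of the end-point curve. The function names and the sample values of $y$, $y'$, and $\varphi'$ are hypothetical. For $L = (1 + y'^2)^{1/2}/y$ the residual is proportional to $1 + \varphi' y'$, so it vanishes exactly when the extremal meets the curve at a right angle.

```python
import math


def lagrangian(y, yp):
    """L = sqrt(1 + y'^2) / y, the Lagrangian of the example."""
    return math.sqrt(1.0 + yp * yp) / y


def dL_dyp(y, yp):
    """Partial derivative of L with respect to y'."""
    return yp / (y * math.sqrt(1.0 + yp * yp))


def transversality_residual(y, yp, phip):
    """L + (phi' - y') * dL/dy' at an end point (assumed standard form)."""
    return lagrangian(y, yp) + (phip - yp) * dL_dyp(y, yp)


# Orthogonal crossing (phi' = -1 / y'): the residual should vanish.
residual_orthogonal = transversality_residual(y=0.8, yp=0.75, phip=-1.0 / 0.75)

# Non-orthogonal crossing: the residual is (1 + phi' * y') / (y * sqrt(1 + y'^2)).
residual_oblique = transversality_residual(y=0.8, yp=0.75, phip=1.0)
```

Under this reading, an extremal ending on the line $y_2 = 5 + x_2$ (slope 1) would need slope $-1$ there.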
2.7 Extremals with Constraints

Suppose $y = (y^1, y^2, \ldots, y^n)$ represents a set of $n$ functions, each $y^i$ being piecewise continuous on $[x_1, x_2]$. The object here is to find the extremum of the integral

$$W[y(x)] = \int_{x_1}^{x_2} L[x, y, y']\,dx \tag{2.7.1}$$

with the following $m$ constraints satisfied by the extremal $y^0 = (y^{01}, \ldots, y^{0n})$,

$$F_a(x, y^0, y^{0\prime}) = 0, \qquad a = 1, 2, \ldots, m, \quad m < n. \tag{2.7.2}$$

The constraints may be in the form

$$F_a'[x, y^0, y^{0\prime}] = 0. \tag{2.7.3}$$

Alternatively, the constraints may also be given so as to involve the integrals of known functions,

$$\int_{x_1}^{x_2} F_a(x, y^0, y^{0\prime})\,dx = l_a. \tag{2.7.4}$$

The method given here is that of Lagrange multipliers as commonly used in elementary analysis. Given constants $\lambda = (\lambda^1, \lambda^2, \ldots, \lambda^m)$, consider the integral

$$\int_{x_1}^{x_2} \Big(L + \sum_{a=1}^{m} \lambda^a F_a\Big)\,dx = \int_{x_1}^{x_2} \Lambda(x, y, y', \lambda)\,dx. \tag{2.7.5}$$

Then the local extremum is given by the equations

$$\delta\Lambda/\delta y = 0. \tag{2.7.6}$$
From (2.7.2), we obtain the m equations
(2.7.7) Assuming that the Jacobian
(2.7.8)
so that $y^{m+1}, y^{m+2}, \ldots, y^n$ can be obtained uniquely in terms of $y^1, y^2, \ldots, y^m$, extremals are then given by Eqs. (2.7.6) and (2.7.7).

Example 2.7.1 Consider the problem of obtaining bounds of the mean of the largest order statistic as considered in (2.3.2) in the form

$$n\int_0^1 x(F)\,F^{n-1}\,dF. \tag{2.7.9}$$

It is assumed that $x$ is a function of $F$. We have the constraints

$$\int_0^1 x(F)\,dF = 0, \tag{2.7.10}$$

$$\int_0^1 x^2(F)\,dF = 1. \tag{2.7.11}$$

We consider the Lagrangian

$$\Lambda(x, y, y', \lambda) = n x y^{n-1} + \lambda_1 x + \lambda_2 x^2. \tag{2.7.12}$$

The Euler equation gives

$$n y^{n-1} + \lambda_1 + 2\lambda_2 x = 0. \tag{2.7.13}$$

A form of admissible $F$ is then obtained from (2.7.13) as

$$x(F) = -\frac{n F^{n-1} + \lambda_1}{2\lambda_2}. \tag{2.7.14}$$

The constants $\lambda_1$ and $\lambda_2$ can then be obtained from the constraints (2.7.10) and (2.7.11). In this form the Euler equation provides in some sense a sort of sufficiency condition, since the function (2.7.14) does maximize (2.7.9). In such a derivation, the existence of the constants $\lambda_1$ and $\lambda_2$ is not guaranteed. In Chapter V we shall consider the problem from another point of view and will also demonstrate the existence of the solution. We also consider many problems such as the above without using the Euler equation.
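As a rough numerical check of Example 2.7.1, one can solve the Euler equation $nF^{n-1} + \lambda_1 + 2\lambda_2 x = 0$ for $x$ and fix the multipliers from the two constraints. Working this through (my own calculation, not taken from the text) gives $\lambda_1 = -1$ and $\lambda_2 = -(n-1)/(2\sqrt{2n-1})$, hence $x(F) = \sqrt{2n-1}\,(nF^{n-1} - 1)/(n-1)$ and a maximal value $(n-1)/\sqrt{2n-1}$. The sketch below verifies the constraints and the objective by midpoint quadrature. The helper names and the choice $n = 5$ are illustrative.

```python
import math


def extremal_x(F, n):
    """Candidate extremal x(F) from the Euler equation with the
    multipliers fixed by the constraints (assumed derivation)."""
    return math.sqrt(2 * n - 1) * (n * F ** (n - 1) - 1.0) / (n - 1)


def midpoint_integral(func, steps=20000):
    """Midpoint-rule approximation of the integral of func over [0, 1]."""
    h = 1.0 / steps
    return h * sum(func((i + 0.5) * h) for i in range(steps))


n = 5
mean_constraint = midpoint_integral(lambda F: extremal_x(F, n))        # should be near 0
second_moment = midpoint_integral(lambda F: extremal_x(F, n) ** 2)     # should be near 1
objective = midpoint_integral(lambda F: n * extremal_x(F, n) * F ** (n - 1))
closed_form = (n - 1) / math.sqrt(2 * n - 1)
```

The quadrature only confirms that this candidate satisfies the constraints and attains the stated value. It does not by itself establish that the value is a maximum.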
Example 2.7.2 Consider the problem of maximizing Shannon information
$p_i \ge 0$, $\sum_i p_i = 1$. If the number of classes is finite, say $S$, then we assume $p_i = 0$ for $i > S$. Let $n_r$ be the number of classes occurring exactly $r$ times in the sample. Then

$$\sum_{r=0}^{\infty} r\,n_r = N, \tag{4.2.8}$$

and the number of distinct classes observed in the sample is

$$d = \sum_{r=1}^{N} n_r. \tag{4.2.9}$$

Let the coverage of the sample be denoted by

$$c = \sum_{j \in A} p_j,$$

where the set $A$ consists of all classes from which at least one representative has been observed. Suppose further that the number of distinct classes that will be observed in a second sample of size $\alpha N$, $\alpha \ge 1$, is to be predicted. Then denoting by $d(\alpha)$ and $c(\alpha)$ the number of classes and coverage obtained from $\alpha N$ observations, the problem is to find upper and lower predictors of $E(d(\alpha))$, that is, to find the minimum and maximum of the expected value of $d(\alpha)$ when some moments of the distribution with respect to which the expected value is taken are given. Define the random variables

$$Y_j = \begin{cases} 1 & \text{if the $j$th class occurs in the sample,} \\ 0 & \text{otherwise.} \end{cases}$$
Then
so that 00
E(d(a)) u
1 [l-(1
j=1
For large N ,
so that
Let (4.2.13) Then (4.2.13) defines a cumulative distribution and is unknown to the experimenter, since p I , p z , . . . are unknown. Now let the random variable Zf) be defined as
zi"'
=
1
if the classj occurs r times in the sample,
0
otherwise.
Then
c zi"' 03
E(nr) =
j=1
(4.2.1 4)
or
or (4.2.1 5)
Similarly,

$$E(c) = \sum_{j=1}^{\infty} p_j\,[1 - (1 - p_j)^N] = 1 - \sum_{j=1}^{\infty} p_j (1 - p_j)^N, \tag{4.2.16}$$

so that

$$E(c) \approx 1 - \sum_j p_j \exp(-N p_j). \tag{4.2.17}$$
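The quality of the exponential approximation can be gauged on a small example. The sketch below is an illustration only: it assumes a sample of $N$ independent draws from class probabilities $p_j$, so that class $j$ is seen with probability $1 - (1 - p_j)^N$, and it uses a hypothetical uniform distribution over 500 classes. The function names are mine. It compares the exact expected number of distinct classes and the exact expected coverage with their approximations based on $\exp(-N p_j)$.

```python
import math


def expected_distinct(p, n_obs):
    """Exact E(d): each class j is seen with probability 1 - (1 - p_j)^N."""
    return sum(1.0 - (1.0 - pj) ** n_obs for pj in p)


def expected_distinct_approx(p, n_obs):
    """Large-N approximation using (1 - p_j)^N ~ exp(-N p_j)."""
    return sum(1.0 - math.exp(-n_obs * pj) for pj in p)


def expected_coverage(p, n_obs):
    """Exact E(c) = 1 - sum_j p_j (1 - p_j)^N."""
    return 1.0 - sum(pj * (1.0 - pj) ** n_obs for pj in p)


def expected_coverage_approx(p, n_obs):
    """Approximation E(c) ~ 1 - sum_j p_j exp(-N p_j)."""
    return 1.0 - sum(pj * math.exp(-n_obs * pj) for pj in p)


# Hypothetical population: 500 equally likely classes, sample size 200.
p = [1.0 / 500] * 500
N = 200
d_exact, d_approx = expected_distinct(p, N), expected_distinct_approx(p, N)
c_exact, c_approx = expected_coverage(p, N), expected_coverage_approx(p, N)
```

For equal class probabilities the coverage is simply the fraction of classes seen, which gives a quick consistency check between the two exact formulas.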
Using (4.2.12) and (4.2.15) and the definition of F(c) in (4.2.13 ) , the expression for E(d(a)) can be written as m
Similarly, m
The problem here then is to find bounds for the integrals
(4.2.19) m
r
(4.2.20) with restrictions on the cumulative distribution function F(x). We consider a general problem of obtaining bounds for the integral
dF(x) with constraints on F(x).
(4.2.21)
4.3 Convexity and Function Spaces
In problems of finding extrema, the notions of convexity and concavity are extensively used. This section provides some relevant definitions and a few pertinent theorems. In later sections, use will be made of the terminology of linear function spaces, linear functionals, and some extensions. We also give the Hahn-Banach theorem, which is commonly utilized in extrema problems.

A set $S$ of the Euclidean space $R^n$ is called convex if for all $x, y \in S$, the line segment joining them is also in $S$. That is, $\lambda x + (1 - \lambda) y \in S$ for $0 \le \lambda \le 1$. A real-valued function $f$ defined on a convex set is called a convex function if

$$f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2) \tag{4.3.1}$$

for $0 \le \lambda \le 1$. If the function $f$ is twice differentiable, then the criterion (4.3.1) can be replaced by the condition that

$$\partial^2 f/\partial x^2 \ge 0. \tag{4.3.1a}$$
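The defining inequality for a convex function lends itself to a simple randomized check. The sketch below is a heuristic illustration, not a proof of convexity: it samples pairs of points and weights and looks for a violation of $f(\lambda x_1 + (1-\lambda)x_2) \le \lambda f(x_1) + (1-\lambda) f(x_2)$. The function name, sample sizes, and tolerance are arbitrary choices.

```python
import math
import random


def is_convex_on_samples(f, lo, hi, trials=10000, seed=0, tol=1e-9):
    """Return False if a sampled triple (x1, x2, lam) violates the
    convexity inequality on [lo, hi]; True if no violation is found."""
    rng = random.Random(seed)
    for _ in range(trials):
        x1, x2, lam = rng.uniform(lo, hi), rng.uniform(lo, hi), rng.random()
        lhs = f(lam * x1 + (1.0 - lam) * x2)
        rhs = lam * f(x1) + (1.0 - lam) * f(x2)
        if lhs > rhs + tol:
            return False
    return True


# x^2 has nonnegative second derivative, so no violation should appear.
convex_square = is_convex_on_samples(lambda x: x * x, -5.0, 5.0)

# sin is concave on [0, pi], so a violation is found quickly.
convex_sine = is_convex_on_samples(math.sin, 0.0, math.pi)
```

A `True` result only means that no counterexample was sampled, whereas a `False` result is a genuine counterexample to convexity on the interval.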
It can be verified that a convex or concave function defined in the $n$-dimensional Euclidean space $R^n$ is continuous.

In the space $R^n$, for any point $x$ not equal to zero and any constant $c$, the set of points $\{y : \sum_{i=1}^n x_i y_i = c\}$ is called a hyperplane. A sphere in $R^n$ is the set of points $\{x : |x - x_0| < r\}$, where $x_0$ is the center and $r$ is the radius of the sphere. A point $x$ is said to be a boundary point of a convex set $S$ if every sphere with center at $x$ contains points of $S$ as well as points outside $S$. A point of $S$ that is not a boundary point is called an interior point.

An important concept in convex sets is that of extreme point. This is a point $x$ of a convex set $S$ that is not interior to any line segment of $S$. In other words, $x$ is an extreme point of $S$ if there do not exist two points $x_1, x_2 \in S$ with $x_1 \ne x_2$ and $x = \lambda x_1 + (1 - \lambda) x_2$, $0 < \lambda < 1$.

If $x$ is a boundary point of a convex set $S$, then the hyperplane $\sum_{i=1}^n u_i y_i = c$ containing $x$ such that $\sum_{i=1}^n u_i y_i \le c$ for all $y \in S$ is called a supporting hyperplane of $S$ at $x$, and $\sum_{i=1}^n u_i y_i \le c$ is the halfspace determined by the supporting hyperplane. A point $x$ of $R^n$ is said to be spanned by a set of points $x_1, x_2, \ldots, x_p$ in $R^n$ if there are nonnegative quantities $a_1, \ldots, a_p$ with $\sum_{i=1}^p a_i = 1$ such that $x = \sum_{i=1}^p a_i x_i$.

For later reference, we state a few results in convex sets without proof. For a detailed study of this subject, see Blackwell and Girshick (1954).

Theorem 4.3.1 A closed and bounded convex set in $R^n$ is spanned by its extreme points, and every spanning set contains the extreme points.

Theorem 4.3.2 (i) A closed, convex set is the intersection of all the halfspaces determined by its supporting hyperplanes.
(ii) Every boundary point of a convex set lies in some supporting hyperplane of the set.
(iii) Two nonintersecting closed, bounded convex sets can be separated by a hyperplane.
Theorem 4.3.3 For a point $x$ in a closed and bounded convex set in $R^n$, $b(x)$, the least number of extreme points of the convex set needed to span $x$, is not greater than $n + 1$. That is, $b(x) \le n + 1$.

Function Spaces

Spaces whose elements are functions defined over some fixed set are known as function spaces. Their study is involved in the development of the variational methods. We have already seen that an extremum problem is concerned with the choice of a function from a given class of functions such that an integral over the given class is minimized or maximized. An integral is a special functional, that is, a real-valued function defined on the class of functions, and finding an extremum of a functional is similar to finding the extremum of a function of a real variable. That is, variational methods are central to functional analysis in the same way that the theory of maxima and minima is central to the calculus of functions of a real variable.

We define a few of the central notions of function spaces and state some pertinent theorems in the sequel. There are excellent books available in functional analysis, and the interested reader is referred to them for details.

A real linear space is a set $R$ of elements $x, y, z, \ldots$ for which operations of addition $(+)$ and multiplication $(\cdot)$ by the numbers $\alpha, \beta, \ldots$ are defined obeying the following system of axioms:

$x + y = y + x$ is in $R$,
$(x + y) + z = x + (y + z)$,
$\exists\, 0 \in R \ni x + 0 = x$ for any $x \in R$,
$\forall x \in R$, $\exists\, {-x} \in R \ni x + (-x) = 0$,
$\exists\, 1 \in R \ni \forall x \in R$, $1 \cdot x = x$,
$\alpha(\beta x) = (\alpha\beta) x$,
$(\alpha + \beta) x = \alpha x + \beta x$,
$\alpha(x + y) = \alpha x + \alpha y$.
A linear space is called a normed linear space if a nonnegative number $\|x\|$ is assigned to every $x \in R$ such that

(i) $\|x\| = 0$ if and only if $x = 0$,
(ii) $\|\alpha x\| = |\alpha|\,\|x\|$,
(iii) $\|x + y\| \le \|x\| + \|y\|$.

A functional is a real-valued function defined on a function space.
A functional $W[y]$ is said to be continuous at the point $y_0 \in R$ if for any $\epsilon > 0$, there is a $\delta$ such that

$$|W[y] - W[y_0]| < \epsilon \tag{4.3.2}$$

if $\|y - y_0\| < \delta$. The inequality (4.3.2) is equivalent to the two inequalities

$$W[y_0] - W[y] < \epsilon \tag{4.3.3}$$

and

$$W[y_0] - W[y] > -\epsilon. \tag{4.3.4}$$

The functional is called lower semicontinuous at $y_0$ if (4.3.2) is replaced by (4.3.3) only and upper semicontinuous at $y_0$ if (4.3.2) is replaced by (4.3.4) in the definition of the continuity of the functional $W[y]$. An important well-known result for lower semicontinuous functions is stated in the following lemma.
Lemma 4.3.1 If $f(x)$ is a lower semicontinuous function defined on a compact set, then $f(x)$ achieves its infimum. (In particular, $f$ is bounded below.)

A functional $L$ is called a linear functional if for any $x_1, x_2 \in R$

$$L(\alpha_1 x_1 + \alpha_2 x_2) = \alpha_1 L(x_1) + \alpha_2 L(x_2)$$

for all real numbers $\alpha_1$ and $\alpha_2$. Let $X^*$ be the space of continuous linear functionals $f$ defined on the function space $X$ such that,
llfll
= sup If(x)l, Ilxll is said to converge to x E X if for every f E X * ,
>
f(Xd
+
f(x>.
A functional $L$ on a linear space $M$ is called additive if

$$L(f + g) = L(f) + L(g)$$

for all $f, g \in M$.
L is called subadditive if
LCf+g) for all a 2 0, and if L is an additive, homogeneous functional on N such that L ( f ) < p ( f ) for every f E N, then there is an additive, homogeneous extension Lo of L to the whole of M such that L o ( f )< p ( f ) for everyfEM.
Proof (a) LetfoEM\Nandletf;gEN.Then
Lk)-L(f) =Lk-f)G p k - f ) =P[k+fo)+(-f-fo)l so -P(-f-fo)
-
L(f) G p ( g + fo) -
(5.2.1 5)
subject to conditions (5.2.13) and z(S) € A . Let the set of m t n points given by
be denoted by A when the set S varies over all possible Bore1 subsets of .X.The setd< can then be seen to be closed, bounded, and convex. It also follows from the Lyapounov theorem, as shown by Halmos (1948) and Karlin and Studden (1966). Let Ac be the subset ofdilsuch that (5.2.13) holds. We state without proof the generalized Neyman-Pearson lemma. Theorem 5.2.3 Generalized Neyman-Pearson lemma (Existence) If there exists a set S satisfying (5.2.13) and cp(z) is continuous o n A n A c ,and if A is closed, then there exists a set $' maximizingcp(z(S)) subject to the condition that y(S) = c and z(S) E A . (Necessity) If S o is a set for which z(So) is an interior point of A , ifcp(z) is defined inAlc n A , and cp'(z) exists at z = z(So), then a necessary condition that S o maximizes h(z(S)) subject to conditions y(S) = c and z(S) E A is given by the following. There exists constants k l , k Z ,. . . , k , such that n
m
n
m
(5.2.16)
and
where ai = 6h/6zi , i = 1 , 2 , . . . , IZ
(Sufficiency) If the set So satisfies the side conditions y(s) = c , z(s) E A , and if h(z) is defined, quasiconcave in a dC n A and is differentiable at z=z(So), then a sufficient condition that So maximizes h(z(S)) subject to z(S) E A is that So satisfies (5.2.16).
5.3 A Nonlinear Minimization Problem

Let $\varphi$ be a continuous and bounded function defined on a unit square $\{(x, y) : 0 \le x \le 1,\ 0 \le y \le 1\}$. Assume further that $\varphi$ is strictly convex and twice differentiable in $y$. We consider the problem of minimizing and maximizing

$$I(F) = \int_0^1 \varphi(x, F(x))\,dx \tag{5.3.1}$$

over the class of cumulative distribution functions with specified moments, such as in

$$\int_0^1 F(x)\,dx = c_1 \quad\text{and}\quad \int_0^1 x F(x)\,dx = c_2. \tag{5.3.2}$$

These restrictions on the cumulative distribution function $F(x)$ are similar to the ones in (4.3.1), and we assume that the constants $c_1$, $c_2$ are such that the class of distribution functions satisfying (5.3.2) is not empty. For simplicity we have assumed in (5.3.2) that only two moments are given. The results obtained are valid for the case in which $k$ moments of the cumulative distribution function are given. Let $\mathcal{A}$ denote this class, and we call the cumulative distribution functions in $\mathcal{A}$ admissible.

The existence of an admissible minimizing or maximizing cumulative distribution function $F_0(x)$ can be seen in the same way as discussed in Section 4.4 using Theorem 4.4.1, since $\varphi$ is assumed continuous and bounded. The minimizing cumulative distribution function is also unique, as we shall see below. It is a result of the strict convexity of the function $\varphi$ in $y$ and is shown by contradiction. Suppose $F_0(x)$ is not unique and that there is another admissible minimizing cumulative distribution function $F_1(x)$, with $F_1(x) \ne F_0(x)$. Let

$$M = \min_{F \in \mathcal{A}} \int_0^1 \varphi(x, F(x))\,dx.$$
V.
NONLINEAR MOMENT PROBLEMS
0
= AM
0
+ (1 - X)M = M ,
and hence we have a contradiction. The above results are expressed in the following theorem.
Theorem 5.3.1 If cp(x,y) is strictly convex in y and is continuous and bounded on a unit square, then there exists a unique F o ( x ) € d , which minimizesI(F) given in (5.3.1).
Reduction of the Nonlinear Problem to a Linear One

We use the notation

$$\varphi_y(x, y) = \frac{\partial \varphi(x, y)}{\partial y}$$

in the following lemma.

Lemma 5.3.1 $F_0(x)$ minimizes (5.3.1) if and only if

$$\int_0^1 \varphi_y(x, F_0)\,F(x)\,dx \ge \int_0^1 \varphi_y(x, F_0)\,F_0(x)\,dx. \tag{5.3.3}$$

Proof Let $F(x)$ be any other admissible cumulative distribution function. Define $I(\lambda)$ for $0 \le \lambda \le 1$ as follows.

$$I(\lambda) = \int_0^1 \varphi(x, \lambda F_0(x) + (1 - \lambda) F(x))\,dx. \tag{5.3.4}$$

By the assumption of twice differentiability of $\varphi$, $\varphi_y$ exists and is continuous in $y$ and hence $I(\lambda)$ is differentiable, giving

$$I'(\lambda) = \int_0^1 \varphi_y(x, \lambda F_0(x) + (1 - \lambda) F(x))\,(F_0(x) - F(x))\,dx. \tag{5.3.5}$$
Figure 5.1

Since $\varphi$ is strictly convex in $y$, $I(\lambda)$ is a strictly convex function of $\lambda$. If $F_0(x)$ minimizes (5.3.1), $I(\lambda)$ achieves its minimum at $\lambda = 1$. This is possible if and only if $I'(\lambda)|_{\lambda=1} \le 0$, as shown in Figure 5.1. That is,
0, we can normalize it so as to make it one. Therefore, (5.3.8) becomes
$$\int_0^1 (\varphi_y(x, F_0) + \eta_1 + \eta_2 x)\,F(x)\,dx \ge \int_0^1 (\varphi_y(x, F_0) + \eta_1 + \eta_2 x)\,F_0(x)\,dx$$

or $F_0$ minimizes

$$I_1(F) = \int_0^1 (\varphi_y(x, F_0) + \eta_1 + \eta_2 x)\,F(x)\,dx \tag{5.3.11}$$

among the class of all distribution functions on $[0, 1]$. Retracing our steps, we can see that if an admissible $F_0(x)$ minimizes (5.3.11), then $F_0(x)$ minimizes $I_0(F)$ over the class of all admissible cumulative distribution functions. The above results are summarized in the following lemma.
Lemma 5.3.4 $F_0(x)$ minimizes $I_0(F)$ if $F_0(x)$ is an admissible cumulative distribution function minimizing $I_1(F)$, and any $F_0(x)$ that minimizes $I_0(F)$ minimizes $I_1(F)$ for some $\eta_1$ and $\eta_2$.

Characterization of the Solution

The solution of the above problem will now be characterized in terms of the solution $F_{\eta_1\eta_2}(x)$ of the equation for $y$ given by

$$\varphi_y(x, y) + \eta_1 + \eta_2 x = 0. \tag{5.3.12}$$

Let

$$A(x) = \varphi_y(x, F_0(x)) + \eta_1 + \eta_2 x \tag{5.3.13}$$

and let

$$S = \{x : A(x) \ne 0,\ 0 \le x \le 1\}. \tag{5.3.14}$$

We show below that the $F_0$-measure of the set $S$ is zero, and the integral of $A(x)$ over intervals where $F_0(x)$ is constant is zero. The solution of the minimizing cumulative distribution function can be obtained in terms of the solution $F_{\eta_1\eta_2}(x)$ modified so as to make it a cumulative distribution function. Essentially the solution coincides with $F_{\eta_1\eta_2}(x)$, except on intervals in which it is constant. We give below a series of lemmas that lead to the characterization of the solution.
Lemma 5.3.5 If $F_0$ minimizes $I_1(F)$, then the set $S$ has $F_0$-measure zero.

Lemma 5.3.6 If for $0 < c < 1$, $F_0(x) < c$ for $x < a$, $F_0(x) = c$ for $a \le x < b$, and $F_0(x) > c$ for $x > b$, then

$$\int_a^b A(x)\,dx = 0. \tag{5.3.15}$$
Lemma 5.3.7 If $F_0$ minimizes $I_1(F)$, then $F_0(x)$ has no jumps in the open interval $(0, 1)$ and hence $A(x)$ is continuous on $(0, 1)$.

For proof of the lemmas, the interested reader is referred to Rustagi (1957).

Let $F_{\eta_1\eta_2}(x)$ be defined for $0 < F_{\eta_1\eta_2}(x) < 1$ such that $y = F_{\eta_1\eta_2}(x)$ satisfies Eq. (5.3.12). Since $\varphi_y(x, y)$ is continuous and strictly increasing in $y$, $F_{\eta_1\eta_2}(x)$ is also continuous whenever $F_{\eta_1\eta_2}(x)$ is defined. We define a function $G_{\eta_1\eta_2}(x)$ on $[0, 1]$ that is continuous on $[0, 1]$ such that

and $G_{\eta_1\eta_2}(1) = 1$.

Theorem 5.3.2 If $F_0$ minimizes $I_1(F)$, then for $0 < x < 1$, $F_0(x)$ coincides with $G_{\eta_1\eta_2}(x)$ except on intervals on which $F_0(x)$ is constant.

Proof From Lemmas 5.3.6 and 5.3.7 we know that $F_0(x)$ has no jumps in $(0, 1)$, and $F_0(x)$ is increasing whenever $A(x) \ne 0$. Hence $F_0(x)$ remains constant until it intersects with $F_{\eta_1\eta_2}(x)$.

Notice that if $G_{\eta_1\eta_2}(x)$ is a cumulative distribution function, then $F_0(x) = G_{\eta_1\eta_2}(x)$. The solution in general may not be completely specified. However, there are many special cases of interest in which $G_{\eta_1\eta_2}(x)$ is the solution of the problem. A possible solution is expressed in Fig. 5.2.
Figure 5.2 (curves shown: $F_0(x)$, $F_{\eta_1\eta_2}(x)$, and $G_{\eta_1\eta_2}(x)$)
Theorem 5.3.3 If $\varphi_y(x, y)$ is a nonincreasing function in $x$ and $\eta_2 < 0$, then $F_0(x) = G_{\eta_1\eta_2}(x)$, $0 \le x \le 1$.

Proof Since $\varphi_y(x, y)$ is nonincreasing in $x$ and $\eta_2$ is negative, it is clear from Eq. (5.3.13) that $A(x)$ is a nonincreasing function of $x$. Therefore, $F_{\eta_1\eta_2}(x)$ will be an increasing function of $x$. Thus $G_{\eta_1\eta_2}(x)$ is a cumulative distribution function and hence is the solution of the minimizing problem.

Theorem 5.3.4 If $\varphi(x, y)$ is a function of $y$ alone, say $\psi(y)$, then, corresponding to the solution $F_0(x)$ of minimizing $I(F)$, $\eta_2$ is negative. Further, the minimizing cumulative distribution function is $G_{\eta_1\eta_2}(x)$ for some $\eta_1$ and $\eta_2$.

Proof Suppose $\eta_2 \ge 0$. Then $\eta_1 + \eta_2 x$ is nondecreasing. Also as $\psi'(y)$ is nondecreasing in $y$, $\psi'(F_0(x)) + \eta_1 + \eta_2 x$ is also a nondecreasing function of $x$. Hence the solution of (5.3.12) with the conditions of the theorem is nonincreasing. Therefore, from Theorem 5.3.2, $F_0(x)$ is constant on $[0, 1)$. But such $F_0(x)$ is not admissible, and there is a contradiction. From Theorem 5.3.3 the solution is given by $G_{\eta_1\eta_2}(x)$.
5.4
Statistical Applications
Example 5.4.1 In the problem of obtaining bounds of the variance of the Wilcoxon-Mann-Whitney statistic, we need the extrema of the integral

$$I(F) = \int_0^1 (F(x) - kx)^2\,dx \tag{5.4.1}$$

with side conditions

$$\int_0^1 F(x)\,dx = 1 - p, \tag{5.4.2}$$

where $F(x)$ is a cumulative distribution function on $(0, 1)$, as discussed in Chapter II. Now $\varphi(x, F(x)) = (F(x) - kx)^2$; thus $\varphi$ satisfies the conditions of Theorem 5.3.1. Therefore, the admissible cumulative distribution function $F_0(x)$ exists, is unique, and leads to the minimization of the integral

$$\int_0^1 (F_0(x) - kx + \lambda)\,F(x)\,dx, \tag{5.4.3}$$

as seen from (5.3.11), over the class of all distribution functions on $(0, 1)$.
Let $g_\lambda(t)$ be the value of $y$ for which $y - kt + \lambda = 0$. Then $g_\lambda(t)$ is a straight line with a positive slope, and its points of intersection with the straight lines $y = 0$ and $y = 1$ are given by

$$x_1 = \lambda/k, \tag{5.4.4}$$

$$x_2 = (\lambda + 1)/k. \tag{5.4.5}$$

Let

$$F_0(x) = \begin{cases} 0, & \text{if } x < \max(0, x_1), \\ kx - \lambda, & \text{if } \max(x_1, 0) \le x < \min(1, x_2), \\ 1, & \text{if } \min(x_2, 1) \le x. \end{cases} \tag{5.4.6}$$

We give the various possible cases of (5.4.6), obtaining the value of $\lambda$ so as to satisfy the constraint (5.4.2).
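Case (i) $x_1 \le 0$, $x_2 \ge 1$, so that

$$F_0(x) = \begin{cases} 0, & \text{if } x < 0, \\ kx - \lambda, & \text{if } 0 \le x < 1, \\ 1, & \text{if } 1 \le x, \end{cases} \tag{5.4.7}$$

giving $\lambda = k/2 - 1 + p$, and the minimum value of (5.4.1) is obtained as

$$I(F_0) = \tfrac{1}{4}(k - 2 + 2p)^2. \tag{5.4.8}$$

Notice that $k \le 2p$ and $k \le 1$.

Case (ii) $0 < x_1 \le 1$, $0 \le x_2 \le 1$.

$$F_0(x) = \begin{cases} 0, & \text{if } x < x_1, \\ kx - \lambda, & \text{if } x_1 \le x < x_2, \\ 1, & \text{if } x_2 \le x. \end{cases} \tag{5.4.9}$$

We find $\lambda = pk - \tfrac{1}{2}$ and

$$I(F_0) = \frac{1}{k}\Big(pk - \frac{1}{2}\Big)^2 - \frac{(1 - k)^3}{3k}. \tag{5.4.10}$$

Here $k \ge 1$, and $2p > 1/k$.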
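Cases (i) and (ii) can be checked by direct numerical integration. The sketch below is an illustration under my reading of the formulas: it takes $F_0$ to be the line $kx - \lambda$ clamped to $[0, 1]$, uses the values $\lambda = k/2 - 1 + p$ and $\lambda = pk - 1/2$, and compares the quadrature value of $\int_0^1 (F_0 - kx)^2\,dx$ with the closed forms $\tfrac14(k - 2 + 2p)^2$ and $(pk - \tfrac12)^2/k - (1-k)^3/(3k)$. The parameter values and function names are hypothetical.

```python
def f0(x, k, lam):
    """The line k*x - lam clamped to [0, 1], on 0 <= x <= 1."""
    return min(1.0, max(0.0, k * x - lam))


def midpoint_integral(func, steps=100000):
    """Midpoint-rule approximation of the integral of func over [0, 1]."""
    h = 1.0 / steps
    return h * sum(func((i + 0.5) * h) for i in range(steps))


def check_case(k, lam):
    """Return (integral of F0, integral of (F0 - k x)^2) over [0, 1]."""
    constraint = midpoint_integral(lambda x: f0(x, k, lam))
    value = midpoint_integral(lambda x: (f0(x, k, lam) - k * x) ** 2)
    return constraint, value


# Case (i): x1 <= 0 and x2 >= 1 (illustrative values k = 0.6, p = 0.5).
k1, p1 = 0.6, 0.5
lam1 = k1 / 2 - 1 + p1
constraint_i, value_i = check_case(k1, lam1)
closed_i = 0.25 * (k1 - 2 + 2 * p1) ** 2

# Case (ii): 0 < x1 <= 1 and 0 <= x2 <= 1 (illustrative values k = 2, p = 0.5).
k2, p2 = 2.0, 0.5
lam2 = p2 * k2 - 0.5
constraint_ii, value_ii = check_case(k2, lam2)
closed_ii = (p2 * k2 - 0.5) ** 2 / k2 - (1 - k2) ** 3 / (3 * k2)
```

In both cases the first returned quantity should reproduce the side condition $1 - p$, which confirms that the chosen $\lambda$ is consistent with (5.4.2) for these parameter values.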
Case (iii) x 1 Q 0,O < x2 Q 1.
ioy
F~(x)= kx-A,
1,
if
x 0
then
H'(yo(t)e"')Qo(t)= k
(ii) y o ( t ) = 0
then
H'(yO(t)e"')Qo(r)G k .
(i)
The constant k can be determined from the constraint (5.8.1).
Combination of Weapons (Karlin et al.)

Determining a weapons combination that optimizes a certain objective function depending on the individual merits of the weapons leads to problems solved by the Neyman-Pearson method. For simplicity we consider here only a single weapon being used against an advancing enemy at distance $s$. Let the accuracy of the weapon be given in terms of the probability of destroying the enemy when the enemy is at the distance $s$. We use the notation

$a(s)$: accuracy,
$p(s)$: firing policy giving the rate of fire,
$g(s)$: gain if the enemy is destroyed at distance $s$,
$F(s)$: the probability that the enemy survives to a distance $s$.
Assume that O=
g(s)(l -F(s))ds.
0
For a given policy p(s), the survival probability F(s) is given by m
(5.8.1 1) The problem is to find an optimal policy po(s) that maximizes &I). maximize
That is,
m
m
(5.8.1 2) with constraints (5.8.8) and (5.8.9) The integrand can be easily seen to be a concave function of p and hence, by the same arguments discussed in Section 5.3, we find that the optimal policy po(s) exists and is unique. Using the results of Lemma 5.7.1, we have the reduced problem of finding a maximizing policy so as to minimize m
m
m
with constraints (5.8.8) and (5.8.9). Let S
m
and, since H(s) 2 0, (5.8.13) is minimized if m
(5.8.14)
H(s)a(u)p(u)du 0
is minimized. Note that H(s) involves po(s). The Neyman-Pearson lemma then gives the existence of a constant c such that
:I
if H(s)a (s) > c , if H(s)u (s) < C,
Po@) =
arbitrary,
if
(5.8.15)
H(s)u(s) = C.
In special cases (5.8.15) gives a complete characterization of po(s). Example 5.8.1
Let 6 = 1 = M . ks,
s { { yuo-s),
O<sso.
(1,
Exercise
s < so,
=
ks,
1 kd,
sd.
Show that po(s) defined below is the optimal policy. (0' p o ( s ) = { 1, I
(0,
s<so, so<sto,
where t o = ( h 2 t d2)'I2,
so = -6 t (ij2 t d2)'I2
and the probability of survival is, s < so,
( so/to,
F(s) = {
so
s/to,
(1,
< s < to,
s >to.
Mathematical Economics (Bellman et al.) Let xi(t), i = 1,2, . . . ,N be outputs of a system and let each xi be divided into two parts yi and zi such that y i is reinvested to increase future output and zj is the profit. Assume that the change in the output is determined by ,N
(5.8. 7 )
and
Xj(0) = c j .
(5.8. 8)
The total profit is given by T
(5.8.1 9 )
The problem is to find an optimal policy so as to maximize (5.8.19) and satisfy constraints (5.8.17) and (5.8.18). For simplicity, assume $N = 1$ and $a_{11} > 0$. We then have

$$\frac{dx_1}{dt} = a_{11}\,y_1(t), \tag{5.8.20}$$

$$x_1(0) = c_1. \tag{5.8.21}$$

The total profit is

$$J(y_1) = \int_0^T [x_1(t) - y_1(t)]\,dt = \int_0^T \Big[c_1 + a_{11}\int_0^t y_1(s)\,ds - y_1(t)\Big]\,dt, \tag{5.8.22}$$

since, from (5.8.20), we have

$$x_1(t) = c_1 + a_{11}\int_0^t y_1(s)\,ds.$$

We have the additional constraints

$$0 \le y_1(t) \le c_1 + a_{11}\int_0^t y_1(s)\,ds.$$

(5.8.22) can be rewritten as follows:

$$J(y_1) = c_1 T + \int_0^T [a_{11}(T - t) - 1]\,y_1(t)\,dt, \tag{5.8.23}$$

since, integrating by parts,

$$\int_0^T\!\!\int_0^t y_1(s)\,ds\,dt = \int_0^T (T - t)\,y_1(t)\,dt.$$

Let $T_1$ be the value of $t$ for which $a_{11}(T - t) - 1 = 0$; that is, $T_1 = (a_{11}T - 1)/a_{11}$. Using the Neyman-Pearson technique, the optimizing policy is then given by

$$y_1(t) = \begin{cases} c_1 + a_{11}\int_0^t y_1(s)\,ds, & 0 \le t \le T_1, \\ 0, & T_1 < t \le T, \end{cases}$$

when $a_{11}T - 1 > 0$. When $a_{11}T - 1 < 0$, $y_1(t) = 0$ everywhere in $(0, T)$. The generalization to the case $N \ge 2$ is similarly obtained. The interested reader may consult the details in Bellman et al. (1954).

We consider below the case in which the objective function is a linear functional, and linear constraints of physical origin are introduced. The complexity is greatly increased in this case. Now let the differential equation for $x(t)$, absolutely continuous, be given by $dx/dt = -x(t) + y(t)$,
with
0 0
(6.4.12)
from (6.4.9). By Theorem 6.4.1 (v), we notice that any design $\epsilon$ can be represented by a set of $[m(m + 1)/2 + 1]$ designs $M(\epsilon(x_i))$. We may consider that $\epsilon$ consists of a finite number $n$ of points. Hence

$$\operatorname{tr}[M^{-1}(\epsilon_0) M(\epsilon)] - m = \sum_{i=1}^{n} p_i\, f'(x_i) M^{-1}(\epsilon_0) f(x_i) - m.$$

But $\epsilon_0$ is minimax and hence

$$f'(x) M^{-1}(\epsilon_0) f(x) \le m. \tag{6.4.13}$$

The inequality (6.4.13) shows

$$\sum_i p_i\, f'(x_i) M^{-1}(\epsilon_0) f(x_i) - m \le m \sum_i p_i - m = 0$$

and hence a contradiction. Therefore, $\epsilon_0$ is D-optimal.
Corollary If $\epsilon_0$ is a D-optimal or a minimax design,

$$\max_x f'(x) M^{-1}(\epsilon_0) f(x) = m.$$
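The corollary can be illustrated on quadratic regression over $[-1, 1]$, with $f(x) = (1, x, x^2)'$ and $m = 3$. The sketch below assumes, as a standard textbook fact not stated in this passage, that the design putting mass $1/3$ at each of $-1$, $0$, $1$ is D-optimal for this model. The helper names are hypothetical. It forms $M(\epsilon_0)$, inverts it by Gauss-Jordan elimination, and evaluates $f'(x) M^{-1}(\epsilon_0) f(x)$ on a grid.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def f(x):
    """Regression functions for the quadratic model: (1, x, x^2)."""
    return [1.0, x, x * x]


def information_matrix(points, weights):
    """M(eps) = sum_i p_i f(x_i) f(x_i)'."""
    m = [[0.0] * 3 for _ in range(3)]
    for x, w in zip(points, weights):
        fx = f(x)
        for i in range(3):
            for j in range(3):
                m[i][j] += w * fx[i] * fx[j]
    return m


def variance_function(x, m_inv):
    """d(x) = f'(x) M^{-1} f(x)."""
    fx = f(x)
    return sum(fx[i] * m_inv[i][j] * fx[j] for i in range(3) for j in range(3))


# Assumed D-optimal design: equal mass at -1, 0, 1.
points, weights = [-1.0, 0.0, 1.0], [1.0 / 3, 1.0 / 3, 1.0 / 3]
m_inv = invert(information_matrix(points, weights))
grid = [-1.0 + 2.0 * i / 2000 for i in range(2001)]
max_d = max(variance_function(x, m_inv) for x in grid)
```

For this design the variance function works out to $3 - 4.5x^2 + 4.5x^4$, which equals 3 at the three support points and is smaller elsewhere on $[-1, 1]$.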
Many other illustrations have been developed by Kiefer and Karlin and Studden based on the equivalence of the D-optimality and minimax optimality criteria. The minimax criterion is a reasonable one when the range of the level of x is given and the experimenter wishes to minimize the worst that can happen in his ability to predict over this range. This criterion does not provide a good procedure for extrapolation. The D-optimality criterion in which the generalized variance of the estimate is being minimized may prove to be a poor criterion, according to Chernoff (1972, p. 37). There seems to be no meaningful justification for D-optimality, since it abandons to the vagaries of the mathematics of the problem the scientist’s function of specifying the loss associated with guessing wrong. The invariance property of D-optimality under nonsingular transformations of the parameters disguises its shortcomings. Since a D-optimum design minimizes the generalized variance, under assumptions of normality it minimizes the volume of the smallest invariant confidence region of 0 = (el, . . . ,OS)’ for a given confidence coefficient. It follows from the result on type D regions discussed in Section 5.7 that for given variance, a D-optimum design achieves a test whose power function has maximum Gaussian curvature at the null hypothesis among all locally unbiased tests of a given size. The reader may further refer to Kiefer (1959) for details. It should also be remarked here that the equivalence of D-optimal and minimax designs does not hold when the consideration is restricted to discrete designs. We will have occasion to refer to this problem in a later section.
Criterion of Linear Optimality

A general criterion of optimality of designs has been developed by Federov (1972). Let $L$ be a linear functional defined on the set of positive semidefinite matrices such that $L(A) \ge 0$ whenever $A$ is positive semidefinite.

Definition A design $\epsilon$ is called linear optimal, or L-optimal for short, if it minimizes $L[D(\epsilon)]$, where $D(\epsilon)$ is the covariance matrix of the parameter vector of the model.

The generalization of Theorem 6.4.2 to linear optimality has been made by Federov. The proof follows the same variational technique used in the proof of the theorem. We state the results formally in the following theorem.
Theorem 6.4.3 The following conditions are equivalent:

(i) $\epsilon_0$ minimizes $L[M^{-1}(\epsilon)]$.
(ii) $\epsilon_0$ minimizes $\max_{x \in \mathcal{X}} L[M^{-1}(\epsilon) f(x) f'(x) M^{-1}(\epsilon)]$.
(iii) $L[M^{-1}(\epsilon_0)] = \max_{x \in \mathcal{X}} L[M^{-1}(\epsilon_0) f(x) f'(x) M^{-1}(\epsilon_0)]$.

Further, the set of designs satisfying (i), (ii), and (iii) is convex.

Since the trace of a matrix satisfies the conditions of the linear functional $L$, A-optimal designs obtained by minimizing the trace of the dispersion matrix are special cases of L-optimal designs. It is easy to see that

$$\operatorname{tr}(A + B) = \operatorname{tr} A + \operatorname{tr} B, \qquad \operatorname{tr}(kA) = k \operatorname{tr} A,$$

and $\operatorname{tr}(A) \ge 0$ if $A$ is a positive semidefinite matrix. Therefore trace is a linear functional. Linear optimality also extends to $A_l$-optimality, where only $l$ parameters out of a given $m$ ($l < m$) are of interest to the experimenter and the optimal criterion requires the minimization of the sum of the variances of these $l$ estimates.

In the next section we discuss criteria of local optimality. When the model considered is nonlinear, the criteria of optimality discussed so far involve the parameters to be estimated. Hence we need other criteria so as to remove the dependence on the parameters. One general approach is to consider asymptotic theory and the Fisher information matrix in the neighborhood of the known value of the parameters. Use of variational techniques in such cases will be discussed in the next section.

6.5 Locally Optimal Designs
So far we have considered optimal designs for the assumed linear model. In the case of nonlinear models, it is not possible to arrive at the common criteria of optimality discussed so far. Since simple estimates for the parameters cannot be obtained, the study of the covariance matrix of the estimates is out of the question. However, designs that are optimal for a given value of the parameter, or in a small neighborhood of the parameter, are possible. Such designs are generally known as locally optimal. In this section we discuss a few simple problems and obtain locally optimal designs for them. Variational methods can be usefully employed in studying the asymptotic theory of such designs. Elfving's geometrical technique discussed earlier is also variational and can also be used in obtaining locally optimal designs. Feder and Mezaki (1971) have given direct variational approaches to obtain locally optimal designs for studying a variety of problems. We restrict our approach to D-optimal designs only; however, the approach can be extended to other optimality criteria.

Consider again a nonlinear regression model

$$Y_i = \eta(X_i, \theta) + e_i. \tag{6.5.1}$$

We assume as before that the errors $e_i$ are uncorrelated and have the same variance $\sigma^2$. The parameter $\theta$ has $k$ components $\theta_1, \ldots, \theta_k$, and $\eta(x, \theta)$ is nonlinear in general. The least-squares estimates for $\theta$ can be obtained by minimizing

$$\sum_{i=1}^{n} [Y_i - \eta(X_i, \theta)]^2. \tag{6.5.2}$$
In the Appendix to this chapter we show that the normal equations obtained by equating to zero the partial derivatives of (6.5.2) with respect to $\theta_1, \ldots, \theta_k$ do not lead to explicit solutions, and therefore the estimates cannot be given explicitly. See, for example, Eq. (6.A.3). However, it is possible to obtain the asymptotic variance and covariance matrix of the estimates. We denote the derivatives of $\eta(X_i, \theta)$ with respect to $\theta$ at $\theta = \theta_0$ by

$$g_{ij}(\theta_0) = \left. \frac{\partial \eta(X_i, \theta)}{\partial \theta_j} \right|_{\theta = \theta_0}, \qquad i = 1, 2, \ldots, n, \quad j = 1, 2, \ldots, k. \tag{6.5.3}$$

Let the matrix of the partial derivatives $g_{ij}(\theta_0)$ be denoted by $X_0$. Under fairly general conditions, it is well known that the asymptotic distribution of the least-squares estimates of $\theta$ is $k$-dimensional multivariate normal with mean $\theta_0$ and covariance matrix $\sigma^2 (X_0 X_0')^{-1}$, where $\sigma^2$ is the common variance of the $e_i$. When the matrix $X_0 X_0'$ is singular, generalized inverse or other perturbation techniques could be considered. The designs that maximize the determinant of the matrix $X_0 X_0'$, and hence minimize the determinant of the asymptotic covariance matrix, are called locally D-optimal.

In what follows we consider locally optimal designs for both linear and nonlinear models. The classical variational techniques are used in obtaining the locally optimal designs. The discrete designs are first reduced to continuous cases, and the variational method is then utilized.

Example 6.5.1
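Consider the model such that

$$\eta(x, \theta) = \theta_1 + \theta_2 (x - \bar{x}), \qquad 0 \le x \le 1.$$

Here $k = 2$. Then the determinant of the matrix $X_0 X_0'$ in this case is given by

$$n \sum_{i=1}^{n} x_i^2 - \Bigl( \sum_{i=1}^{n} x_i \Bigr)^2. \tag{6.5.4}$$

Locally optimal designs are obtained by maximizing (6.5.4) with constraints $0 \le x_1 \le \cdots \le x_n \le 1$.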
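As a small numerical sketch of this maximization, the determinant $n \sum x_i^2 - (\sum x_i)^2$ of the straight-line model can be searched exhaustively over ordered $n$-point designs restricted to a grid on $[0, 1]$. The grid spacing, the choice $n = 4$, and the function names below are illustrative assumptions, not part of the text.

```python
from itertools import combinations_with_replacement


def det_information(xs):
    """Determinant n * sum(x^2) - (sum x)^2 for the straight-line model."""
    n = len(xs)
    return n * sum(x * x for x in xs) - sum(xs) ** 2


def best_design_on_grid(n, grid):
    """Exhaustive search over ordered n-point designs with points on `grid`."""
    return max(combinations_with_replacement(grid, n), key=det_information)


# Hypothetical setting: n = 4 observations, grid of step 0.1 on [0, 1].
grid = [i / 10 for i in range(11)]
best = best_design_on_grid(4, grid)
best_value = det_information(best)
```

On this grid the search places half of the observations at each end of the interval, which is the behavior one expects when maximizing a quantity proportional to the sample variance of the design points.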
Consider a continuous analog of the discrete design as follows. Let $x(t)$ be a right continuous and nondecreasing function with $0 \le x(t) \le 1$, and let $x_i = x(i/n)$. The function $x(t)$ here plays the same role as a distribution function on $(0, 1)$. Now (6.5.4) can be written as

$$n^2 \Bigl\{ \frac{1}{n} \sum_{i=1}^{n} x^2(i/n) - \Bigl[ \frac{1}{n} \sum_{i=1}^{n} x(i/n) \Bigr]^2 \Bigr\}. \tag{6.5.5}$$

Utilizing the continuous transformation, we have the approximate value of (6.5.5) given by

$$n^2 \Bigl\{ \int_0^1 x^2(t)\, dt - \Bigl[ \int_0^1 x(t)\, dt \Bigr]^2 \Bigr\}. \tag{6.5.6}$$

The optimization problem reduces to finding a right continuous function $x(t)$ on $(0, 1)$ such that it maximizes (6.5.6). The existence of a maximizing function $x_0(t)$ is guaranteed by the continuity of the transformation from $x(t)$ to $J(x)$, where

$$J(x) = \int_0^1 x^2(t)\, dt - \Bigl( \int_0^1 x(t)\, dt \Bigr)^2. \tag{6.5.7}$$

It can be verified that the set of points $J(x)$, as $x$ varies over the class of right continuous nondecreasing functions on $(0, 1)$, is closed and bounded by results shown in Chapter V. We now utilize variational methods directly to characterize the maximizing function $x_0(t)$. The proportion of observations such that $x(t) < y$ for $0 < y < 1$ can be regarded as asymptotically equivalent to $\sup\{t : x(t) < y\}$. Let

$$S_\delta = \{ t : \delta < x_0(t) < 1 - \delta \} \qquad \text{for } \delta > 0,$$

and let $\xi(t)$ be a bounded function on $(0, 1)$ such that

$$\xi(t) = 0 \qquad \text{if } t \notin S_\delta. \tag{6.5.8}$$
Consider a variation of the function $x_0(t)$ for sufficiently small $\epsilon$ in the form $x_0(t) + \epsilon \xi(t)$. The details of such an approach are discussed in Chapter II, where we derived the Euler-Lagrange equation using variations. Let

$$\Phi(\epsilon) = J(x_0 + \epsilon \xi).$$

The derivative of $\Phi(\epsilon)$ is given by

$$\Phi'(\epsilon) = 2 \int_0^1 [x_0(t) + \epsilon \xi(t)]\, \xi(t)\, dt - 2 \int_0^1 [x_0(t) + \epsilon \xi(t)]\, dt \int_0^1 \xi(t)\, dt,$$

or

$$\Phi'(\epsilon) = 2 \int_0^1 \Bigl\{ x_0(t) + \epsilon \xi(t) - \int_0^1 [x_0(s) + \epsilon \xi(s)]\, ds \Bigr\} \xi(t)\, dt.$$

Since $x_0(t)$ is the solution of the problem, the maximum of $\Phi(\epsilon)$ is attained at $\epsilon = 0$. Therefore,

$$\Phi'(0) = 2 \int_0^1 \Bigl[ x_0(t) - \int_0^1 x_0(s)\, ds \Bigr] \xi(t)\, dt = 0. \tag{6.5.9}$$
Now (6.5.9) is satisfied for all $\xi(t)$ and, since $\xi(t) = 0$ when $t$ does not belong to the set $S_\delta$ as assumed in (6.5.8), we have

$$x_0(t) - \int_0^1 x_0(s)\, ds = 0, \qquad t \in S_\delta,$$

or

$$x_0(t) = \int_0^1 x_0(s)\, ds, \qquad t \in S_\delta.$$

…

… $x = (u, v)'$, $\theta = (\theta_1, \theta_2)'$. To obtain locally optimal designs for this nonlinear function, we introduce two functions $u_0(t)$ and $v_0(t)$ so as to maximize the corresponding measure of information. The complete solution in this case requires numerical methods and is not pursued here further. Similar nonlinear models can also be treated alternatively by the geometrical method of Elfving.

Chernoff (1953) generalized the results of Elfving in a paper on locally optimal designs for estimating parameters. He has since applied the theory to some practical problems of accelerated life testing (see Chernoff, 1962). An example illustrating the theory is discussed here. A random variable $T$ has an exponential probability density function $f(t)$ when

$$f(t) = \begin{cases} \mu^{-1} \exp(-t/\mu), & t \ge 0, \\ 0, & \text{otherwise}. \end{cases}$$
The mean of the random variable $T$ is $\mu$ and the variance is $\mu^2$. The exponential distribution has been extensively used as a model of the lifetimes of many mechanical or electronic devices. For studying accelerated life tests, an assumption can be made that a device may have a lifetime with mean

$$\mu(x) = 1/(\theta_1 x + \theta_2 x^2), \qquad 0 < x < x^*,$$

the quantity $x$ to be selected by the experimenter. It is also assumed that the cost of experimentation at level $x$ is

$$c(x) = c/(\theta_1 x + \theta_2 x^2)$$

for some given constant $c$. With the above assumptions, the lifetime $T$ has the probability density function

$$f(t; \theta, x) = \begin{cases} (\theta_1 x + \theta_2 x^2) \exp[-(\theta_1 x + \theta_2 x^2) t], & t \ge 0, \\ 0, & \text{otherwise}. \end{cases}$$

…

6.6 Spline Functions

… $(x_n, y_n)$
having knots at $x_1, x_2, \ldots, x_n$. In applications in which smooth functions are needed for interpolation, one considers the criterion of minimizing $\sigma[g(x)]$, where

$$\sigma[g(x)] = \int_a^b [g^{(k)}(x)]^2\, dx. \tag{6.6.2}$$

In Theorem 6.6.1, we prove that a spline of degree at most $k$ is needed to minimize (6.6.2). Let the class of functions on $(a, b)$ whose $k$th derivatives are step functions be denoted by $C^{k-1}(a, b)$.

Lemma 6.6.1 Let $s(x)$ be a spline of degree $k$ and $f(x)$ be a function with the following properties:

(i) $f(x) \in C^{k-1}(a, b)$ and $f^{(k)}(x) = d^k f(x)/dx^k$ is continuous in each open interval $(\xi_i, \xi_{i+1})$, $i = 0, 1, \ldots, n$, with $\xi_0 = a$ and $\xi_{n+1} = b$.

(ii) $f^{(k-r-1)}(x)\, s^{(k+r)}(x) = 0$, $r = 0, 1, 2, \ldots, k - 2$, for $x = a$ and $x = b$.

(iii) $f(a)\, s^{(2k-1)}(a - 0) = f(b)\, s^{(2k-1)}(b + 0) = 0$.

Then

$$\int_a^b f^{(k)}(x)\, s^{(k)}(x)\, dx = (-1)^k (2k - 1)! \sum_{i=1}^{n} c_i f(\xi_i).$$
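Proof Integrating by parts the integral

$$\int_a^b f^{(k)}(x)\, s^{(k)}(x)\, dx, \tag{6.6.3}$$

we have

$$\int_a^b f^{(k)}(x)\, s^{(k)}(x)\, dx = \sum_{r=0}^{k-2} (-1)^r \bigl[ f^{(k-r-1)}(b)\, s^{(k+r)}(b) - f^{(k-r-1)}(a)\, s^{(k+r)}(a) \bigr] + (-1)^{k-1} \int_a^b f'(x)\, s^{(2k-1)}(x)\, dx. \tag{6.6.4}$$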
The summation in (6.6.4) vanishes due to hypothesis (ii) of the lemma. Since the knots of $s(x)$ and $s^{(2k-1)}(x)$ are the same, we have

$$\int_a^b f'(x)\, s^{(2k-1)}(x)\, dx = \sum_{i=0}^{n} q_i \bigl[ f(\xi_{i+1}) - f(\xi_i) \bigr], \tag{6.6.5}$$

where $q_i$ is the constant value of $s^{(2k-1)}(x)$ on $(\xi_i, \xi_{i+1})$, $i = 0, 1, \ldots, n$. Using hypothesis (iii) of the lemma, the right-hand side of (6.6.5) reduces to

$$-\sum_{i=1}^{n} f(\xi_i) (q_i - q_{i-1}). \tag{6.6.6}$$

Differentiating Eq. (6.6.1) with respect to $x$, $(2k - 1)$ times, we have

$$s^{(2k-1)}(\xi_i + 0) - s^{(2k-1)}(\xi_i - 0) = (2k - 1)!\, c_i, \qquad i = 1, 2, \ldots, n. \tag{6.6.7}$$

Using (6.6.6), (6.6.7), and (6.6.4), the lemma is proved.
Lemma 6.6.2 If $s(x)$ is a natural spline with $k \ge 1$ and if $f(x) \in C^{k-1}(a, b)$ is such that $f^{(k)}(x)$ is continuous in each interval $(\xi_i, \xi_{i+1})$, $i = 1, 2, \ldots, n$, then

$$\int_a^b f^{(k)}(x)\, s^{(k)}(x)\, dx = (-1)^k (2k - 1)! \sum_{i=1}^{n} c_i f(\xi_i), \tag{6.6.8}$$

and, if $f(x) = 0$ at every knot of $s(x)$,

$$\int_a^b f^{(k)}(x)\, s^{(k)}(x)\, dx = 0. \tag{6.6.9}$$
Proof For natural splines, conditions (ii) and (iii) of Lemma 6.6.1 are satisfied and hence (6.6.8) follows from Lemma 6.6.1. Equation (6.6.9) follows directly from (6.6.8) since $f(\xi_i) = 0$ for $i = 1, 2, \ldots, n$.

Theorem 6.6.1 For $a < x_1 < x_2 < \cdots < x_n < b$ and $1 \le k \le n$, let $s(x)$ be a natural spline interpolating the data points

$$(x_1, y_1), \ldots, (x_n, y_n). \tag{6.6.10}$$

Let $f(x)$ be any other function interpolating the data points (6.6.10) such that $f(x) \in C^{k-1}(a, b)$ and $f^{(k)}(x)$ is piecewise continuous; then

$$\sigma[s(x)] \le \sigma[f(x)],$$

with equality only for $s(x) = f(x)$, where $\sigma$ is defined in (6.6.2).
Proof Using $f(x) - s(x)$ in place of $f(x)$ in Eq. (6.6.9), we have, from Lemma 6.6.2,

$$\int_a^b s^{(k)}(x) \bigl[ f^{(k)}(x) - s^{(k)}(x) \bigr]\, dx = 0.$$

Now

$$\sigma[f(x)] = \sigma[f(x) - s(x) + s(x)] = \sigma[f(x) - s(x)] + \sigma[s(x)] \tag{6.6.11}$$

by definition, since the cross-product term vanishes by the preceding equation. Since the first term on the right of (6.6.11) is nonnegative,

$$\sigma[f(x)] \ge \sigma[s(x)].$$

The equality also follows from (6.6.11) when $s(x) = f(x)$.
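The inequality of Theorem 6.6.1 can be checked numerically in the simplest case $k = 1$, where the criterion is the integral of the squared first derivative. The sketch below assumes that the natural spline for $k = 1$ is the piecewise linear interpolant that stays constant outside the extreme knots, and compares it with a parabola through the same points. The data points, the interval, and the competing parabola are hypothetical choices made only for illustration.

```python
# Illustrative check of sigma[s] <= sigma[f] for k = 1, where sigma[g] is
# the integral over (A, B) of g'(x)^2 (the k = 1 case of the criterion).
A, B = -1.0, 3.0                                  # hypothetical interval (a, b)
DATA = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]       # hypothetical data points


def spline_slope(x):
    """Derivative of the piecewise linear interpolant of DATA, taken to be
    constant (slope zero) outside the first and last knots."""
    for (x0, y0), (x1, y1) in zip(DATA, DATA[1:]):
        if x0 < x < x1:
            return (y1 - y0) / (x1 - x0)
    return 0.0


def parabola_slope(x):
    """Derivative of f(x) = x * (2 - x), which also interpolates DATA."""
    return 2.0 - 2.0 * x


def sigma(slope, cells=4000):
    """Midpoint-rule approximation of the integral of slope(x)^2 on (A, B)."""
    h = (B - A) / cells
    return sum(slope(A + (i + 0.5) * h) ** 2 for i in range(cells)) * h


sigma_spline = sigma(spline_slope)
sigma_parabola = sigma(parabola_slope)
```

The number of cells is chosen so that the knots fall on cell boundaries, which keeps the midpoint rule exact for the piecewise constant slope of the spline.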
Remarks (1) Splines are the least oscillatory functions for interpolating data. In particular the above theorem shows that a differentiable function $f(x)$ such that $f(x_i) = y_i$, $i = 1, 2, \ldots, n$, and which minimizes

$$\int_a^b [f''(x)]^2\, dx$$

is a cubic spline ($m = 3$) of the form (6.6.1) with knots at $x_1, \ldots, x_n$ and is linear below $x_1$ and above $x_n$.

(2) In using the functions

$$1, x, x^2, \ldots, x^m, \quad (x - \xi_1)_+^m, (x - \xi_2)_+^m, \ldots, (x - \xi_n)_+^m \tag{6.6.12}$$

we may be interested in knowing for which set of values $x_1, x_2, \ldots, x_{n+m+1}$ one can interpolate an arbitrary set of values $y_1, \ldots, y_{n+m+1}$ using a unique linear combination of functions in (6.6.12). This is the case if and only if $x_i$ …

6.7 Optimal Designs Using Splines

… 0. The admissibility of a design will be defined now.
Definition An experiment $\varepsilon$ is admissible if there is no other experiment which is better. Otherwise $\varepsilon$ is called inadmissible.

The term experiment and the distribution $\xi(x)$ are used synonymously in our discussion. We give, in the following, a characterization of optimal admissible designs for polynomial regression, due to Kiefer (1959). The results can be extended to spline regression, such as studied by Studden and Van Arman (1969). The problem of finding admissible designs is reduced in this case to optimization in moment spaces, and therefore we use results discussed in Chapter IV.

Consider a vector $t' = (t_1, t_2, \ldots, t_{k+1})$ with $k + 1$ components. Let $t_1 = 1$, $t_{r+1} = u$, and all other components be zero. Since, from (6.7.2), the matrix

$$M(\varepsilon) = (\mu_{i+j-2}(\varepsilon)),$$

we have for the above vector $t$,

$$t' M(\varepsilon) t = \mu_0(\varepsilon) + 2u \mu_r(\varepsilon) + u^2 \mu_{2r}(\varepsilon). \tag{6.7.4}$$

Since $\mu_0(\varepsilon_1) = \mu_0(\varepsilon_2) = 1$, we have

$$t' [M(\varepsilon_1) - M(\varepsilon_2)] t = 2u \bigl( \mu_r(\varepsilon_1) - \mu_r(\varepsilon_2) \bigr) + u^2 \bigl( \mu_{2r}(\varepsilon_1) - \mu_{2r}(\varepsilon_2) \bigr).$$

Assuming that $M(\varepsilon_1) - M(\varepsilon_2)$ is nonnegative definite, we have

$$2u [\mu_r(\varepsilon_1) - \mu_r(\varepsilon_2)] + u^2 [\mu_{2r}(\varepsilon_1) - \mu_{2r}(\varepsilon_2)] \ge 0 \tag{6.7.5}$$

for all $u$. The inequality (6.7.5) then gives

$$\mu_r(\varepsilon_1) = \mu_r(\varepsilon_2) \tag{6.7.6}$$

for $r = 0, 1, 2, \ldots, k$. Similarly, taking the vector $t$ such that $t_{q+1} = 1$ and $t_{s+1} = u$, with all other components equal to zero and $q < s$, we have

$$\mu_r(\varepsilon_1) = \mu_r(\varepsilon_2) \tag{6.7.7}$$

for $r = 0, 1, 2, \ldots, 2k - 1$. Therefore,

$$t' [M(\varepsilon_1) - M(\varepsilon_2)] t \ge 0$$
if and only if

$$\mu_{2k}(\varepsilon_1) \ge \mu_{2k}(\varepsilon_2).$$

That is, the condition of admissibility of an experiment $\varepsilon$ can be obtained by finding an experiment $\varepsilon$ such that $\mu_{2k}(\varepsilon)$ is maximum while its first $(2k - 1)$ moments are specified. Using results of Theorem 4.5.6, we find that for such an experiment the distribution function $\xi(x)$ has $k + 1$ jumps, including jumps at the points $-1$ and $1$. We state the result formally in the following theorem.

Theorem 6.7.1 The design with distribution function $\xi(x)$ for the polynomial regression model in (6.7.1) defined on $[-1, 1]$ is admissible if and only if in the interior of the interval $[-1, 1]$, $\xi(x)$ has at most $k - 1$ jumps.

Consider the case of the regression model in which splines are used. That is, the regression functions are, for example, of the forms

$$1, x, x^2, \ldots, x^n,$$

together with truncated power functions $(x - \xi_i)_+^{j}$ of degree at most $n$, $i = 1, 2, \ldots, h$, with $h$ knots $\xi_1, \xi_2, \ldots, \xi_h$ so that

$$-1 = \xi_0 < \xi_1 < \xi_2 < \cdots < \xi_h < \xi_{h+1} = 1.$$
The regression functions with the above splines give a polynomial of degree $n$ on each of the intervals $(\xi_i, \xi_{i+1})$, $i = 0, 1, \ldots, h$. From Theorem 6.7.1, we have that the admissible spline design has at most $n - 1$ jumps in the interval $(\xi_i, \xi_{i+1})$, $i = 0, 1, 2, \ldots, h$. When $n = 1$, the regression function is linear on $(\xi_i, \xi_{i+1})$ and continuous at $\xi_i$. That is, we have the set of regression functions

$$1, x, (x - \xi_1)_+, (x - \xi_2)_+, \ldots, (x - \xi_h)_+.$$

In this case the admissible design does not have a single jump in the interior of the interval $(\xi_i, \xi_{i+1})$, except possibly at the end points $\xi_i$ and $\xi_{i+1}$. Therefore, the possible jumps are at the points

$$-1, \xi_1, \xi_2, \ldots, \xi_h, 1.$$

The reader interested in the general discussion of spline regression will find the paper by Studden and Van Arman (1969) very illuminating. The formulas for allocation of observations at given points can be obtained in terms of Lagrange interpolating polynomials. An exhaustive treatment of the problem is given in the books by Federov (1972) and Karlin and Studden (1966).
Appendix to Chapter VI

Assume that

$$y_i = \eta(X_i, \theta) + \epsilon_i, \tag{6.A.1}$$

where $\eta(X_i, \theta)$ is a known function of $X_i$ and $\theta$ such that

$$X_i = (X_{1i}, X_{2i}, \ldots, X_{ki})', \qquad \theta = (\theta_1, \theta_2, \ldots, \theta_k)', \qquad i = 1, 2, \ldots, n.$$

We assume again that the errors $\epsilon_i$ are uncorrelated with the same variance $\sigma^2$. It is generally assumed for many purposes of statistical inference that the errors $\epsilon = (\epsilon_1, \ldots, \epsilon_n)'$ have a normal distribution with means 0 and covariance $\sigma^2 I$, where $I$ is the identity matrix. The least-squares estimates are obtained by minimizing the sum of squares

$$S(\theta) = \sum_{i=1}^{n} [y_i - \eta(X_i, \theta)]^2. \tag{6.A.2}$$

The $k$ normal equations are obtained by differentiating (6.A.2) and equating the derivatives to zero:

$$\sum_{i=1}^{n} [y_i - \eta(X_i, \theta)] \frac{\partial \eta(X_i, \theta)}{\partial \theta_j} = 0, \qquad j = 1, 2, \ldots, k. \tag{6.A.3}$$

Obviously these equations cannot always be solved explicitly.
There are many approximate methods of solving the system (6.A.3) in general. We describe an iterative procedure below that essentially requires the linearization of the problem using the Taylor expansion of $\eta(X, \theta)$ at an initial value $\theta = \theta_0 = (\theta_{10}, \ldots, \theta_{k0})'$ and then using the results of linear least squares successively. The iterative methods depend heavily on the initial value $\theta_0$, and hence care must be taken in choosing $\theta_0$. We have, by Taylor's expansion,

$$\eta(X_i, \theta) = \eta(X_i, \theta_0) + \sum_{j=1}^{k} (\theta_j - \theta_{j0}) \left. \frac{\partial \eta(X_i, \theta)}{\partial \theta_j} \right|_{\theta = \theta_0} + R, \tag{6.A.4}$$

where $R$ is the remainder of the expansion. We shall be using the approximation with $R = 0$ in further analysis. Let $\eta_i^0 = \eta(X_i, \theta_0)$, $\theta - \theta_0 = \beta_0 = (\beta_{10}, \ldots, \beta_{k0})'$, and

$$z_{ji}^0 = \left. \frac{\partial \eta(X_i, \theta)}{\partial \theta_j} \right|_{\theta = \theta_0}.$$

Then the model (6.A.1), using the approximation (6.A.4), becomes

$$Y_i = \eta_i^0 + \sum_{j=1}^{k} \beta_{j0} z_{ji}^0 + \epsilon_i$$

or

$$Y - \eta_0 = Z_0 \beta_0 + \epsilon, \tag{6.A.5}$$

where

$$Z_0 = (z_{ji}^0), \qquad i = 1, \ldots, n, \quad j = 1, \ldots, k. \tag{6.A.6}$$

Therefore the estimate of $\beta_0$ is given by

$$b_0 = (Z_0' Z_0)^{-1} Z_0' (Y - \eta_0). \tag{6.A.7}$$

The next trial value of $\theta$ will now be obtained as $\theta_1 = \theta_0 + b_0$, and so on. In general, at any stage $i$, $\theta_{i+1} = \theta_i + b_i$ and $b_i = (Z_i' Z_i)^{-1} Z_i' (Y - \eta_i)$. The process continues until convergence; that is, until the ratio of the $(j + 1)$st estimate to the $j$th estimate is within $1 - \delta$ and $1 + \delta$ for a preassigned $\delta$. There are many other computational procedures available in the literature, but we do not pursue them here.
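The iteration described above can be sketched in a few lines for $k = 2$. The model $\eta(x, \theta) = \theta_1 e^{-\theta_2 x}$, the noise-free data, the starting value, and all function names are hypothetical choices for illustration; the $2 \times 2$ system $Z'Z\, b = Z'(Y - \eta)$ is solved by Cramer's rule so that only the standard library is needed.

```python
import math


def eta(x, theta):
    """Hypothetical nonlinear model eta(x, theta) = theta1 * exp(-theta2 * x)."""
    return theta[0] * math.exp(-theta[1] * x)


def jacobian_row(x, theta):
    """Partial derivatives of eta with respect to theta1 and theta2."""
    e = math.exp(-theta[1] * x)
    return (e, -theta[0] * x * e)


def gauss_newton(xs, ys, theta, delta=1e-10, max_iter=50):
    """Repeat theta <- theta + (Z'Z)^{-1} Z'(Y - eta) until the ratio of
    successive estimates lies within (1 - delta, 1 + delta)."""
    for _ in range(max_iter):
        rows = [jacobian_row(x, theta) for x in xs]
        resid = [y - eta(x, theta) for x, y in zip(xs, ys)]
        # Elements of Z'Z and Z'(Y - eta) for two parameters.
        a11 = sum(z1 * z1 for z1, _ in rows)
        a12 = sum(z1 * z2 for z1, z2 in rows)
        a22 = sum(z2 * z2 for _, z2 in rows)
        g1 = sum(z1 * r for (z1, _), r in zip(rows, resid))
        g2 = sum(z2 * r for (_, z2), r in zip(rows, resid))
        det = a11 * a22 - a12 * a12
        step = ((a22 * g1 - a12 * g2) / det, (a11 * g2 - a12 * g1) / det)
        new_theta = (theta[0] + step[0], theta[1] + step[1])
        converged = all(
            abs(new / old - 1.0) < delta for new, old in zip(new_theta, theta)
        )
        theta = new_theta
        if converged:
            break
    return theta


# Noise-free data generated from theta = (2.0, 0.5), an assumed true value.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [eta(x, (2.0, 0.5)) for x in xs]
estimate = gauss_newton(xs, ys, (1.5, 0.3))
```

The ratio test assumes that no component of the estimate is zero; with a poor starting value the iteration may fail to converge, which is the dependence on $\theta_0$ noted in the text.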
References

Atwood, C. L. (1969). Optimal and efficient design of experiments, Ann. Math. Statist. 40, 1570-1602.
Box, G. E. P. (1954). The exploration and exploitation of response surfaces, Biometrics 10, 16-60.
Box, G. E. P., and Draper, N. R. (1968). Evolutionary Operations, A Statistical Method for Process Improvement. Wiley, New York.
Box, G. E. P., and Wilson, K. B. (1951). On the experimental attainment of optimum conditions, J. Roy. Statist. Soc. 13, 1-45.
Box, M. J. (1971). Simplified experimental designs, Technometrics 13, 19-31.
Chernoff, H. (1953). Locally optimal designs for estimating parameters, Ann. Math. Statist. 24, 586-602.
Chernoff, H. (1959). Sequential design of experiments, Ann. Math. Statist. 30, No. 3, 755-770.
Chernoff, H. (1962). Optimal accelerated life designs for estimation, Technometrics 4, No. 3, 381-408.
Chernoff, H. (1963). Optimal Design of Experiments, Proc. 8th Conf. Design Experiments Army Res. Develop. Testing, pp. 303-315.
Chernoff, H. (1969). Sequential designs. In The Design of Computer Simulation Experiments (T. H. Naylor, ed.), pp. 99-120. Duke Univ. Press, Durham, North Carolina.
Chernoff, H. (1972). Sequential Analysis and Optimal Design. Soc. Ind. Appl. Math., Philadelphia.
Chernoff, H., Bessler, S., and Marshall, A. W. (1962). Accelerated life test, Technometrics 4, No. 3, 367-380.
Elfving, G. (1952). Optimum allocation in linear regression theory, Ann. Math. Statist. 23, 255-262.
Elfving, G. (1955). Geometric allocation theory, Skand. Actuarietidskr. 38, 170-190.
Elfving, G. (1956). Selection of nonrepeatable observations for estimation, Proc. 3rd Berkeley Symp. Math. Statist. Probl. 1, 69-75.
Elfving, G. (1959). Design of linear experiments. Probability and Statistics (Cramér Volume) (U. Grenander, ed.), pp. 58-74. Wiley, New York.
Feder, P. I., and Mezaki, R. (1971). An application of variational methods to experimental design, Technometrics 13, 771-793.
Federov, V. V. (1972). Theory of Optimal Experiments. Academic Press, New York.
Fisher, R. A. (1947). The Design of Experiments (4th edition). Oliver and Boyd, Edinburgh.
Greville, T. N. E. (ed.) (1969). Theory and Applications of Spline Functions. Academic Press, New York.
Hoel, P. G., and Levine, A. (1964). Optimal spacing and weighing in polynomial prediction, Ann. Math. Statist. 35, 1553-1560.
Hotelling, H. (1941). Experimental determination of the maximum of a function, Ann. Math. Statist. 12, 20-45.
Hotelling, H. (1944). Some improvements in weighing and other experimental techniques, Ann. Math. Statist. 15, 297-306.
Karlin, S., and Studden, W. J. (1966). Optimal experimental designs, Ann. Math. Statist. 37, 783-815.
Kiefer, J. (1953). Sequential minimax search for a maximum, Proc. Amer. Math. Soc. 4, 502-506.
Kiefer, J. (1959). Optimal experimental designs, J. Roy. Statist. Soc. Series B 21, 273-319.
Kiefer, J. (1961). Optimum designs in regression problems: II, Ann. Math. Statist. 32, 298-325.
Kiefer, J., and Wolfowitz, J. (1959). Optimal designs in regression problems, Ann. Math. Statist. 30, 271-294.
Kiefer, J., and Wolfowitz, J. (1960). The equivalence of two extremum problems, Can. J. Math. 12, 363-366.
Kiefer, J., and Wolfowitz, J. (1965). On the theorem of Hoel and Levine on extrapolation, Ann. Math. Statist. 36, 1627-1655.
Kiefer, J., Farrel, R., and Walran, A. (1965). Optimum multivariate designs, Proc. 5th Berkeley Symp. Math. Stat. Probl. 1, 113-138.
Murty, V. N., and Studden, W. J. (1972). Optimal designs for estimating the slope of a polynomial regression, J. Amer. Statist. Ass. 67, 869-873.
Nalimov, V. V., Golikova, T. I., and Mikeshina, N. G. (1970). On practical use of the concept of D-optimality, Technometrics 12, 799-812.
Rivlin, T. J. (1969). An Introduction to the Approximation of Functions. Blaisdell, Waltham, Massachusetts.
Schoenberg, I. J. (1969). Approximations with Special Emphasis on Spline Functions. Academic Press, New York.
Spendley, N., Hext, G. R., and Himsworth, F. R. (1962). Sequential application of simplex designs in optimization and evolutionary operations, Technometrics 4, 441-459.
Studden, W. J. (1968). Optimal designs on Tchebycheff points, Ann. Math. Statist. 39, 1435-1447.
Studden, W. J. (1971a). Optimal designs and spline regression. In Optimizing Methods in Statistics (J. S. Rustagi, ed.). Academic Press, New York.
Studden, W. J. (1971b). Elfving's Theorem and optimal designs for quadratic loss, Ann. Math. Statist. 42, 1621-1631.
Studden, W. J., and Van Arman, D. J. (1969). Admissible designs for polynomial spline regression, Ann. Math. Statist. 40, 1557-1569.
CHAPTER VII

Theory of Optimal Control

7.1 Introduction
While introducing the optimizing techniques of dynamic programming and the maximum principle in Chapter II, we discussed many problems of control theory. In this chapter many other problems of deterministic and stochastic control theory are discussed. In the development of the mathematical theory of control processes, the methods of dynamic programming are extensively used.

The problem of control is basic in many areas of human endeavor, including industrial processes, mechanical equipment, and even the national economy. Models of control theory for many simple situations are developed in this chapter. The most common technique in solving these problems is that of dynamic programming, and it has been applied to a large variety of problems.

In statistical applications, a large class of problems related to sequential sampling can be solved by backward induction and therefore have direct bearing on dynamic programming. There are many questions related to stopping strategies in dynamical systems, which can be treated by dynamic programming. The continuous versions of many of the above discrete problems can be transformed in terms of the Wiener process, and the related questions are solved with the help of the heat equation. In this chapter we treat the simple aspects of statistical decision theory as they arise in the study of Bayes solutions for the Wiener process.

An exhaustive study has recently been made of controlled Markov chains. Markov chains form the basic model of multistage decision processes, and we give a brief introduction to the study of control processes that can be thought of as Markov chains. For exhaustive treatment of the above control problems the reader should consult Bellman (1967, 1971), Chernoff (1972), and Kushner (1971), among many others. We follow Chernoff (1972) in the subsequent development. Many other references related to problems discussed here are given at the end of the chapter.
7.2 Deterministic Control Process
In developing the procedures of dynamic programming and the maximum principle discussed earlier in this book, the motivation came from many control problems. The language of control theory was utilized in obtaining solutions to problems that arise in many areas other than control theory. In this section we discuss the main problems of deterministic control theory. The solutions of these problems involve some of the variational techniques discussed so far in the book.

A deterministic control process is concerned with the mechanical, electrical, economic, or biological behavior of a process in which the outcome is completely determined when a given control is applied. There is no random element in the process, and we do not assume that the input or outcome depends on a chance mechanism. Control processes with stochastic elements will be discussed in the next section.

The general structure of a deterministic control process is described by several quantities. The state variables of the process describe the state of the process at any given instant of time, space, etc. The state variables may be scalars, vectors, or continuous functions. The control variables reflect the decisions taken at any stage and again may be scalars, vectors, or continuous functions. An essential part of the process is the system of dynamic equations, which give the behavior of the process when the process is in a given state and a given control is applied. In most practical situations such equations are difference or differential equations. Physical and mechanical systems are most often governed by known relations. However, there are many important processes, such as those of economic and biological significance, in which such relations are not completely known. Statistical problems enter into the estimation of these relationships. For our present purposes we assume that the system is governed by a known set of difference or differential equations.

For any effective control, it is necessary to have in mind some kind of objective or criterion function. The performance of a process is measured by this criterion function. The optimization problem depends on the objective of the control. In a rocket control problem, it may be minimizing the distance by which the rocket misses the target; but in the case of an economic control problem, it may be the maximization of profit.

Many of the processes have an element of feedback in them. Here, the performance of the process is used to guide the system back to its proper functioning. Some processes have a time lag built into them, since a control applied at a given time $t$ will take effect after several units of time have elapsed, rather than immediately. We also consider an example of this type in our discussion.

Complete solutions of the deterministic control problems are available when the process is governed by linear difference or differential equations and the objective function is quadratic. The situation is quite similar to that in statistics, in which the theory of linear models is fairly complete when the criterion of minimization is quadratic. Many of these aspects in statistics are studied under the least-squares theory of inference. In connection with optimal design of regression experiments, we have discussed some elements of least-squares theory in Chapter VI.

Consider a multistage decision process with $N$ stages. Let $X_0, \ldots, X_N$ be the sequence of vectors describing the process at stages $0, 1, 2, \ldots, N$ respectively, with $X_0$ as the initial state of the process. Let the controls at stages $1, 2, \ldots, N$ be given by $U_1, U_2, \ldots, U_N$. A schematic representation of this process is given in Fig. 7.1.
Figure 7.1 (schematic representation of the multistage decision process)
Suppose the state vector $X_n$ at stage $n$ depends only on the state vector at stage $n - 1$ and the control at stage $n$; then, the equation governing the system is of the type

$$X_n = f_n(X_{n-1}, U_n), \qquad n = 1, 2, \ldots, N. \tag{7.2.1}$$

Let the criterion function be

$$g(X_N) = |X_N|^2. \tag{7.2.2}$$

The object is to find the sequence of optimal controls $U_1, \ldots, U_N$ that minimizes $g(X_N)$. Such a sequence is called an optimal policy.
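The backward-induction computation of an optimal policy for a system of the type (7.2.1) with criterion (7.2.2) can be sketched on a toy problem. The scalar integer state, the dynamics $X_n = X_{n-1} + U_n$, the control set $\{-1, 0, 1\}$, and the function name are assumptions made only for illustration.

```python
CONTROLS = (-1, 0, 1)   # hypothetical admissible controls (unit steps only)


def solve(x0, n_stages):
    """Backward induction for X_n = X_{n-1} + U_n with criterion X_N ** 2.

    table[n][x] holds (minimal terminal cost, optimal next control) for
    every state x that can be reached at stage n from x0.
    """
    def reachable(n):
        return range(x0 - n, x0 + n + 1)

    table = [dict() for _ in range(n_stages + 1)]
    for x in reachable(n_stages):
        table[n_stages][x] = (x * x, None)
    # Work backwards: only the next control is searched at each stage,
    # because the best continuation is already tabulated.
    for n in range(n_stages - 1, -1, -1):
        for x in reachable(n):
            table[n][x] = min((table[n + 1][x + u][0], u) for u in CONTROLS)
    # Forward pass: read the optimal policy off the tables.
    x, policy = x0, []
    for n in range(n_stages):
        u = table[n][x][1]
        policy.append(u)
        x += u
    return table[0][x0][0], policy


cost_far, policy_far = solve(5, 3)
cost_near, policy_near = solve(1, 3)
```

Each backward step searches over a single control rather than over all remaining controls jointly, at the price of storing one table per stage.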
The optimal policy will be obtained by the procedure of backward induction of dynamic programming as discussed earlier. The basic steps in this procedure are the following. Suppose $X_{N-1}$ is somehow determined; we find $U_N$ for each possible $X_{N-1}$ so as to minimize $|X_N|^2$. Since $X_N = f_N(X_{N-1}, U_N)$ from (7.2.1), we can find such a $U_N$. Tabulate, for every $X_{N-1}$, the value of $U_N$ and the minimum value of $g(X_N)$. Supposing now that $X_{N-2}$ is available, we can find the optimizing values of $U_{N-1}$ and $U_N$. Since $U_{N-1}$ determines $X_{N-1}$, and since $U_N$ has been tabulated for every $X_{N-1}$ together with the minimum of $g(X_N)$, a search is simply made over the values of $U_{N-1}$ to find that $X_{N-1}$ for which the minimum is attained. The above procedure reduces the dimensionality of the problem, since the minimization is over $U_{N-1}$ alone in place of $U_{N-1}$ and $U_N$ jointly. This process is repeated backwards until one reaches $X_0$. But $X_0$ is known, and an optimal $U_1$ can be chosen by means of the tables for $X_1$. In this way, the above numerical procedure determines the optimal policy $U_1, \ldots, U_N$.

We remark here that the numerical solution of the optimization problems requires the storage of functions. This may become a serious restriction on the computation of the optimal policy in higher dimensions through the technique of dynamic programming.

In Examples 3.2.2 and 3.2.3, we considered the basic elements of the control problem and showed the existence of the solution. The characterization of the solution was also given. We consider an example with lag in the following.

Example 7.2.1 (Process with time lag) In some practical applications, the effect of the control may take some time to affect the behavior of the process. For example, when a certain drug is given to a patient and the physiological variables are being monitored, say, every minute, it will take several minutes before the drug's effect is seen. We consider a simple example in which time lag matters.
Suppose the differential equation governing the system for $0 \le t \le T$ is given by

$$x'(t) = a_1 x(t) + a_2 x(t - 1) + u(t), \qquad 1 \le t \le T, \tag{7.2.3}$$

with $x(t) = g(t)$, $0 \le t \le 1$. The criterion function is taken as

$$J(x, u) = \int_1^T [x^2(t) + u^2(t)]\, dt. \tag{7.2.4}$$

Assume that $x_0(t)$, $u_0(t)$ are the minimizing functions. Consider

$$x(t) = x_0(t) + \epsilon v(t), \qquad u(t) = u_0(t) + \epsilon w(t). \tag{7.2.5}$$

Then (7.2.4) becomes, on simplification,

$$J(x, u) = J(x_0, u_0) + 2\epsilon \int_1^T (x_0 v + u_0 w)\, dt + \epsilon^2 J(v, w). \tag{7.2.6}$$
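Since $x_0$, $u_0$ minimize $J$, the coefficient of $\epsilon$ in (7.2.6) must vanish; that is,

$$\int_1^T (x_0 v + u_0 w)\, dt = 0.$$

Also, from constraints (7.2.3), we have

$$v'(t) = a_1 v(t) + a_2 v(t - 1) + w(t), \qquad 1 \le t \le T,$$

with $v(t) = 0$ for $0 \le t \le 1$. Integrating by parts,

$$\int_1^T u_0(t) v'(t)\, dt = u_0(t) v(t) \Big|_1^T - \int_1^T u_0'(t) v(t)\, dt$$

and

$$\int_1^T u_0(t) v(t - 1)\, dt = \int_0^{T-1} u_0(t + 1) v(t)\, dt.$$

Substituting in the first-order condition, we find

$$\int_1^{T-1} v(t) [x_0(t) - u_0'(t) - a_1 u_0(t) - a_2 u_0(t + 1)]\, dt + \int_{T-1}^{T} v(t) [x_0(t) - u_0'(t) - a_1 u_0(t)]\, dt + u_0(T) v(T) = 0. \tag{7.2.9}$$

Hence we have

$$x_0(t) - u_0'(t) - a_1 u_0(t) - a_2 u_0(t + 1) = 0, \qquad 1 \le t \le T - 1,$$

$$x_0(t) - u_0'(t) - a_1 u_0(t) = 0, \qquad T - 1 \le t \le T,$$

…

7.3 Controlled Markov Chains

… where $y_{n+1}$ can be calculated without knowing $y_1, \ldots, y_{n-1}$ if $y_n$ is known. For example, consider

$$y_{n+1} = a y_n + \xi_n, \tag{7.3.4}$$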
where $\{\xi_n\}$ is a sequence of independent, identically distributed random variables. Then the sequence $\{y_n\}$ becomes a Markov chain.

When a decision is made regarding a process and the decision, or control, affects the probabilities of the process, the process is called a controlled process. The case of a controlled Markov chain is clearly evident from Example 3.2.3, considered earlier, and we now discuss it further.
Example 7.3.1 Let there be two coins 1 and 2 with probabilities $p_1$ and $p_2$, respectively, of falling heads. At each time $n = 1, 2, \ldots$, one of the coins is selected and tossed. Let $X_n$ be the number of heads accumulated till time $n$. Assume the trials are independent, so that $X_n$ depends on $X_{n-1}$ and not on $X_{n-2}, \ldots, X_1$. Therefore, $X_n$ is a Markov chain. Let $u_n(x)$ denote the decision (control) for choice of a coin at time $n + 1$, when $X_n = x$. Therefore,

$$P\{X_{n+1} = X_n\} = 1 - p_i \qquad \text{if } u_n(X_n) = i, \quad i = 1, 2.$$

Similarly,

$$P\{X_{n+1} = X_n + 1\} = p_i \qquad \text{if } u_n(X_n) = i, \quad i = 1, 2.$$

These transition probabilities can be written explicitly as functions of $u_n(X)$ by introducing the Kronecker delta $\delta_{ij}$, where $\delta_{ij} = 1$ if $i = j$ and 0 otherwise:

$$p_{i,i}[u_n(X)] = (1 - p_1)\, \delta_{u_n(X),1} + (1 - p_2)\, \delta_{u_n(X),2},$$

$$p_{i,i+1}[u_n(X)] = p_1\, \delta_{u_n(X),1} + p_2\, \delta_{u_n(X),2}. \tag{7.3.5}$$
Then X , is a controlled Murkov chain with transition probabilities given by (7.3.5). In general, the problem for finite time interval [O,n] with discrete times 0, 1,2, . . . ,n has the following framework. The state of the system is a Markov
7.3.
CONTROLLED MARKOV CHAINS
179
chain with N states. Let u'(i) = control when the state is i and there are r units of time left to control, also denoted by un-,.(i). n, = control policy consisting of the matrix of controls {ur(i)}; r = 1, 2 , . . . , n a n d i = 1 , 2 , . . . , N. pii [u'(i)] = transition probability that X I = j given that Xlp1 = i and control u r ( i ) i s u s e d a t t i m e I - l ; i , j = 1 , 2 , . . . ,N , I = 0 , 1 , 2 , . . . , n . A general controlled Markov chain will be given by the difference equation
$$X_{m+1} = f(X_m, u_m, \xi_m) = f(X_m, \xi_m), \qquad m = 1, 2, \ldots, n, \qquad (7.3.6)$$

where $\xi_m$ are independent and identically distributed random variables. Let the criterion function be assumed to be of the form

$$V(\pi_n, X_0) = E\left\{\sum_{m=0}^{n-1} k[X_m, u_{n-m}(X_m)] + k_0(X_n)\right\}, \qquad (7.3.7)$$
where $k$ and $k_0$ are some given functions. The general form of the criterion is given by Kushner (1971). The basic problem of optimal control is to find $\pi_n$ such that $V(\pi_n, X_0)$ is minimized when constraints (7.3.6) are satisfied. From (7.3.7), we have
If the minimum of $V(\pi_n, X_0)$ over all $\pi_n$ is denoted by $V_n^0$, the principle of optimality of dynamic programming leads to the functional equation

$$V_n^0 = \min E\{k(X_0, u_n(X_0)) + V_{n-1}^0\}.$$

In the finite case, with restrictions of continuity on the functions $k$ and $p_{ij}$ and other reasonable conditions, the existence of an optimal policy can be proved by backward induction. An important example is the case of a system that is described by a linear system of difference equations and has an objective function given by a quadratic form. We discuss it in Example 7.3.2.
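The backward induction argument can be sketched numerically for a small chain. The code below is only an illustration under stated assumptions: the data layout (`P[u][i][j]` for transition probabilities under control `u`, `k[i][u]` for the stage cost, `k0[i]` for the terminal cost) and the two-state example are hypothetical and are not taken from the text.

```python
def backward_induction(P, k, k0, n):
    """Minimal expected cost by backward induction on a finite controlled chain.

    P[u][i][j] : probability of moving from state i to j under control u
    k[i][u]    : stage cost in state i when control u is used
    k0[i]      : terminal cost in state i
    Returns (V, policy): V[r][i] is the minimal expected cost with r steps
    left, and policy[r - 1][i] is a minimizing control with r steps left.
    """
    n_states = len(k0)
    controls = range(len(P))
    V = [list(k0)]          # no steps left: only the terminal cost remains
    policy = []
    for _ in range(n):
        prev = V[-1]
        values, choices = [], []
        for i in range(n_states):
            costs = [
                k[i][u] + sum(P[u][i][j] * prev[j] for j in range(n_states))
                for u in controls
            ]
            best = min(controls, key=lambda u: costs[u])
            values.append(costs[best])
            choices.append(best)
        V.append(values)
        policy.append(choices)
    return V, policy


# Hypothetical two-state example: control 0 keeps the state, control 1 switches it.
P = [
    [[1.0, 0.0], [0.0, 1.0]],   # control 0: stay
    [[0.0, 1.0], [1.0, 0.0]],   # control 1: switch
]
k = [[1.0, 1.5], [0.0, 2.0]]    # k[i][u]
k0 = [0.0, 0.0]
V, policy = backward_induction(P, k, k0, n=2)
```

With one step left it is cheapest to stay in state 0, but with two steps left it pays to switch to the costless state 1; the optimal control therefore depends on the time remaining, which is why the policy is a matrix of controls.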
Example 7.3.2 Let the system be described by

$$X_{m+1} = AX_m + BU_m + \xi_m, \qquad m = 0, 1, \ldots, n, \quad \text{and} \quad X_0 = c, \qquad (7.3.8)$$

where $A$ and $B$ are matrices and $X_m$ and $U_m$ denote the state and control vectors. We also assume that $\xi_m$ are independently and identically distributed vectors with mean 0 and covariance matrix $G$. Let $G$ be nonsingular. Let

$$k(x, v) = x'Rx + v'Qv \quad \text{and} \quad k_0(x) = x'R_0x,$$

so that

$$V(\pi_n, X_0) = E\left\{\sum_{m=0}^{n-1}[X_m'RX_m + U_m'QU_m] + X_n'R_0X_n\right\},$$

where $Q$ is positive definite and $R$, $R_0$ are nonnegative definite matrices. Let

$$V_n(x) = \min_{\pi_n} V(\pi_n, x),$$

so that

$$V_{n+1}(x) = \min_u E[V_n(X_1) + x'Rx + u'Qu],$$

with $V_0(x) = x'R_0x$. The dynamic programming solution of the problem is obtained in terms of

$$V_n(x) = x'P_nx + d_n. \qquad (7.3.10)$$

(7.3.11) (7.3.12)

Here $P_n$ satisfies the functional equation

$$P_{n+1} = A'P_nA + R - A'P_nB(B'P_nB + Q)^{-1}B'P_nA. \qquad (7.3.13)$$

The optimal control is given by

$$u_n(x) = -(B'P_nB + Q)^{-1}B'P_nAx. \qquad (7.3.14)$$
The proof is by induction and results from the following arguments. If $n = 0$, the system stops at 0 since no control is used. Thus (7.3.10) and (7.3.11) hold for $n = 0$. Assuming that (7.3.10) holds for $n$, we have

$$V_{n+1}(x) = \min_v E[(Ax + Bv + \xi_n)'P_n(Ax + Bv + \xi_n) + x'Rx + v'Qv]$$
$$= \min_v\,[v'(B'P_nB + Q)v + v'B'P_nAx + x'A'P_nBv] + x'(A'P_nA + R)x + \operatorname{tr}(P_nG).$$

Now the minimizing value of $v$ is given by

$$u_n(x) = -(B'P_nB + Q)^{-1}B'P_nAx.$$

Using (7.3.13), we can verify that

$$V_{n+1}(x) = x'P_{n+1}x + d_{n+1},$$

completing the proof.
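A minimal numerical sketch of the recursion (7.3.13) and the feedback gain in (7.3.14) is given below for the scalar case, in which $A$, $B$, $Q$, $R$, and $R_0$ are numbers. The scalar restriction, the function name `riccati_scalar`, and the parameter values are assumptions made for illustration; the start $P_0 = R_0$ follows from $V_0(x) = x'R_0x$.

```python
def riccati_scalar(a, b, q, r, r0, n):
    """Iterate the scalar form of (7.3.13), starting from P_0 = r0.

    Returns (P, gains), where P[m] is P_m and gains[m] is the factor K_m
    in the optimal control u_m(x) = -K_m * x of (7.3.14).
    """
    P = [r0]
    gains = []
    for _ in range(n):
        p = P[-1]
        gain = b * p * a / (b * p * b + q)      # (B'PB + Q)^{-1} B'PA
        gains.append(gain)
        P.append(a * p * a + r - a * p * b * gain)
    return P, gains


# Illustrative values only.
P, gains = riccati_scalar(a=1.0, b=1.0, q=1.0, r=1.0, r0=0.0, n=3)
```

As a check on the induction step, each $P_{m+1}$ equals the cost coefficient obtained by substituting the optimal control back into the quadratic, that is, $(a - bK_m)^2P_m + r + qK_m^2$ in the scalar case.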
7.4 Statistical Decision Theory
In statistical theory, the major problems of inference are concerned with the estimation and testing of hypotheses about the parameters involved in the probability model. The general problem of inference is viewed as a process of decision making under uncertainty, and statistics is described as a science of making decisions under uncertainty, with or without observations. The scope of statistics so described includes a wide variety of problems in the real world. The object of our discussion in this section is to introduce briefly the structure of statistical problems in a more general framework and give a few elementary notions needed to discuss some of the control problems introduced earlier. For a comprehensive discussion of statistical theory, the reader should consult books by Blackwell and Girshick (1954), Ferguson (1967), and DeGroot (1970), among many others.

According to the decision theory formulation, a statistical problem is a game between nature and the statistician. The unknown state of nature represents the parameter involved in the probability law of the random variable that the statistician can observe. By spying on nature, the statistician can collect data and construct a function to make a decision about $\theta$. He is supposed to incur a loss (risk) due to a wrong decision. The assumption of this loss function allows him to choose an optimal way to deal with the loss. There are several ways to select an optimal strategy, and we shall discuss some of them in this section. Consider the following quantities involved in the statistical problem:

$\Theta$: set of unknown states of nature $\theta$.
$\mathcal{X}$: set of random observable $X$.
$f(x \mid \theta)$: the probability density function of the random variable $X$ when nature is in state $\theta$.
$A$: set of actions $a$ taken by the statistician.
$L(\theta, a)$: loss to the statistician when he takes action $a$ while nature is in state $\theta$.
$D$: class of decision functions $d$ such that $d: \mathcal{X} \to A$. That is, $d(x) = a$; when the statistician observes $X = x$, he takes an action based on the decision rule $d$.

For a given decision rule $d$, the loss $L(\theta, d(X))$ becomes random, and we generally use its expectation, called the risk to the statistician. That is,

$$R(\theta, d) = E_\theta L(\theta, d(X)).$$
The statistical problem is then described by the triple $(\Theta, D, R)$, and the basic problem of statistical inference becomes the optimal choice of $d$ in the class $D$ so as to optimize $R(\theta, d)$ in some suitable way. Many criteria have been developed to answer this question. The minimax criterion stipulates that the statistician take a decision $d_0$ that minimizes, over the class $D$, the maximum of $R(\theta, d)$ over the class $\Theta$. The minimax rule is the rule of the pessimist and is not always favored since it takes care of the worst possible situation. A rule that has many other desirable properties is the Bayes rule. We say that $d_0$ is Bayes when $d_0$ minimizes the expected risk, the expectation taken with respect to a prior distribution of $\theta$. When the loss is squared,

$$L(\theta, a) = (\theta - a)^2, \qquad (7.4.1)$$
assuming that $\Theta$ and $A$ are the sets of real numbers, it can be readily seen that the Bayes rule is given by the mean of the posterior distribution. If $\xi(\theta)$ is the prior density of $\theta$, the posterior distribution of $\theta$, given $X = x$, is obtained by Bayes' theorem,

$$\xi(\theta \mid x) = \frac{\xi(\theta) f(x \mid \theta)}{\int \xi(\theta) f(x \mid \theta)\, d\theta}. \qquad (7.4.2)$$

We assume for simplicity that the random variable $X$ and the parameter $\theta$ have continuous probability distributions having a probability density. For the case in general, the expressions can be suitably modified.

The Bayes risk is given by

$$r(d) = \int R(\theta, d)\,\xi(\theta)\, d\theta. \qquad (7.4.3)$$
The Bayes rule $d_0$ is obtained by minimizing $r(d)$; that is, we minimize

$$\iint L(\theta, d(x))\,\xi(\theta) f(x \mid \theta)\, dx\, d\theta. \qquad (7.4.4)$$

Expression (7.4.4) can be minimized when

$$\int L(\theta, d(x))\,\xi(\theta \mid x)\, d\theta \qquad (7.4.5)$$

is minimized. When the loss is given by (7.4.1), the optimal rule is given by

$$d_0(x) = \int \theta\,\xi(\theta \mid x)\, d\theta. \qquad (7.4.6)$$
The mean of the posterior distribution, therefore, plays an important role in obtaining Bayes rules.
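The role of the posterior mean under squared loss can be checked numerically. The sketch below replaces the continuous densities of the text by a prior on a finite grid of parameter values, which is an assumption made purely to keep the example short; the function names and the numbers are likewise hypothetical.

```python
def posterior(prior, likelihood):
    """Posterior weights on a finite grid: prior times likelihood, normalized."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    total = sum(joint)
    return [j / total for j in joint]


def posterior_expected_loss(action, thetas, post):
    """Posterior expectation of the squared loss (theta - action)^2."""
    return sum(w * (t - action) ** 2 for t, w in zip(thetas, post))


def bayes_estimate(thetas, post):
    """Posterior mean, the Bayes rule under squared loss."""
    return sum(t * w for t, w in zip(thetas, post))


# Illustrative values: three parameter values and the likelihood of one observed x.
thetas = [0.0, 1.0, 2.0]
prior = [0.25, 0.5, 0.25]
likelihood = [0.1, 0.4, 0.5]

post = posterior(prior, likelihood)
d0 = bayes_estimate(thetas, post)
loss_at_d0 = posterior_expected_loss(d0, thetas, post)
```

Moving the action away from `d0` in either direction increases the posterior expected loss, which is the discrete counterpart of minimizing (7.4.5).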
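Example 7.4.1 Let $X$ have a normal distribution with mean $\mu$ and variance $\sigma^2$. If

$$\phi(x) = (2\pi)^{-1/2}\exp(-x^2/2),$$

we assume that

$$f(x \mid \mu) = \frac{1}{\sigma}\,\phi\!\left(\frac{x - \mu}{\sigma}\right). \qquad (7.4.7)$$

Suppose that interest centers around the estimation of $\mu$ with one observation $x$. Assume that

$$\xi(\mu) = \frac{1}{\sigma_0}\,\phi\!\left(\frac{\mu - \mu_0}{\sigma_0}\right). \qquad (7.4.8)$$

Now, to find $\xi(\mu \mid x)$, we note that the joint distribution of $x$ and $\mu$ is

$$f(x, \mu) = \xi(\mu) f(x \mid \mu) = \frac{1}{\sigma\sigma_0}\,\phi\!\left(\frac{x - \mu}{\sigma}\right)\phi\!\left(\frac{\mu - \mu_0}{\sigma_0}\right).$$

The marginal density of $X$ is given by

$$f(x) = \int f(x, \mu)\, d\mu.$$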
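Since

$$(x - \mu)^2\sigma^{-2} + (\mu - \mu_0)^2\sigma_0^{-2} = \mu^2(\sigma^{-2} + \sigma_0^{-2}) - 2\mu(x\sigma^{-2} + \mu_0\sigma_0^{-2}) + x^2\sigma^{-2} + \mu_0^2\sigma_0^{-2},$$

the posterior density $\xi(\mu \mid x) = f(x, \mu)/f(x)$ is obtained on simplification. The posterior distribution of $\mu$ is normal with mean

$$E(\mu \mid X) = \frac{x\sigma^{-2} + \mu_0\sigma_0^{-2}}{\sigma^{-2} + \sigma_0^{-2}} \qquad (7.4.11)$$

and

$$V(\mu \mid X) = (\sigma^{-2} + \sigma_0^{-2})^{-1}. \qquad (7.4.12)$$

Example 7.4.2 Suppose a sequence of independent random variables $X_1, X_2, \ldots, X_n$ have the normal distributions $(1/\sigma_i)\,\phi((x - \mu)/\sigma_i)$, $i = 1, 2, \ldots, n$, respectively. Let the prior distribution of $\mu$ be the same as (7.4.8). Then it follows that the posterior distribution of $\mu$ given $X_1 = x_1, \ldots, X_n = x_n$ is normal with mean $Y_n$, where

$$Y_n = \frac{\mu_0\sigma_0^{-2} + \sum_{i=1}^n x_i\sigma_i^{-2}}{\sigma_0^{-2} + \sum_{i=1}^n \sigma_i^{-2}}, \qquad (7.4.13)$$

and variance $S_n$, where

$$S_n^{-1} = \sigma_0^{-2} + \sigma_1^{-2} + \cdots + \sigma_n^{-2}. \qquad (7.4.14)$$

If $X_1, \ldots, X_n$ is a random sample from a normal distribution $(1/\sigma)\,\phi((x - \mu)/\sigma)$, expressions (7.4.13) and (7.4.14) have the simple forms

$$Y_n = \frac{\mu_0\sigma_0^{-2} + \sigma^{-2}\sum_{i=1}^n x_i}{\sigma_0^{-2} + n\sigma^{-2}}, \qquad (7.4.15)$$

$$S_n^{-1} = \sigma_0^{-2} + n\sigma^{-2}. \qquad (7.4.16)$$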
If the sequence of random variables $Y_n$, as defined in terms of the $X_i$'s from Eq. (7.4.13), is considered, we find that, for $m < n$, $Y_n - Y_m$ is normal with
or

$$Y_m = E[Y_n \mid x_1, \ldots, x_m]. \qquad (7.4.17)$$

That is, $Y_m$ is the regression of $Y_n$ on $x_1, \ldots, x_m$. Thus

$$Y_n = Y_m + u,$$

where $u$ has mean 0 and is uncorrelated with $Y_m$. Therefore,

$$V(Y_n) = V(Y_m) + V(u).$$

Also

$$\mu = Y_n + v,$$

where $v$ has mean 0 and is uncorrelated with $Y_n$,

$$V(\mu) = V(Y_n) + V(v).$$

That is,

$$\sigma_0^2 = V(Y_n) + S_n.$$

Thus

$$V(u) = V(Y_n) - V(Y_m) = \sigma_0^2 - S_n - (\sigma_0^2 - S_m) = S_m - S_n.$$

Normality follows from the linearity of the function involved in $Y_n$. The above result is stated in the following lemma.
Lemma 7.4.1 For $m < n$, $Y_n - Y_m$ as defined in (7.4.13) is normally distributed with mean 0 and variance $S_m - S_n$, and $Y_n - Y_m$ is independent of $Y_m$.

Lemma 7.4.1 shows that $Y(s)$ behaves like a normal process with independent increments starting from $Y_0 = \mu_0$, and the conditional probability distribution of $Y_n - Y_m$ given $Y_m$ is normal with mean 0 and variance $S_m - S_n$.

Continuous Version

Consider the continuous version of the discrete process $\sum_{i=1}^n x_i$ as a process $X(t)$ such that the mean and variance are given in terms of $dX(t)$:

$$E[dX(t)] = \mu\, dt \quad \text{and} \quad V[dX(t)] = \sigma^2\, dt. \qquad (7.4.18)$$
Given that $\mu$ has a prior distribution that is normal with mean $\mu_0$ and variance $\sigma_0^2$, the posterior distribution of $\mu$ is obtained in terms of $Y_n$ and $S_n$, which have a continuous version given in terms of a stochastic process $Y(s)$ with parameter $s$. Lemma 7.4.1 shows that $Y(s)$ is a normal process with independent increments having

$$E[dY(s)] = 0 \quad \text{and} \quad V[dY(s)] = -ds, \qquad (7.4.19)$$

with

$$Y(s_0) = \mu_0, \qquad s_0 = \sigma_0^2, \qquad \text{and} \qquad s^{-1} = \sigma_0^{-2} + t\sigma^{-2}.$$

Essentially, $Y(s)$ can be regarded as a Gaussian process in the $-s$ scale. The above process will be utilized in reducing the sequential analysis problem to that of a stopping problem when a continuous version is used. In a more general setting, we use the following result (Chernoff, 1972).

Lemma 7.4.2 Let $X(t)$ be a Gaussian process of independent increments with mean

$$E[dX(t)] = \mu\, dH(t),$$

and variance

$$V[dX(t)] = dV(t),$$

where $V(t)$ is a nondecreasing function of $t$. Let $\mu$ have a normal prior distribution with mean $\mu_0$ and variance $\sigma_0^2$. Then the posterior distribution of $\mu$ with given $X(t)$ is also normal with mean $Y(s)$ and variance $s$ given by

$$s^{-1} = \sigma_0^{-2} + \int_0^t \cdots$$

$Y(s)$ is a Wiener process with 0 drift in the $-s$ scale originating at $(Y_0, s_0) = (\mu_0, \sigma_0^2)$.
In the next section, we discuss elements of sequential decision theory. The stopping rule is obtained in terms of backward induction, which has also been utilized in dynamic programming.
7.5 Sequential Decision Theory

When the possibility of sequential sampling is introduced, the statistical decision problem assumes a more complicated structure. At each stage of sampling, the statistician has to decide whether to stop sampling or continue taking observations and, if he stops sampling, what terminal decision to take. The cost of taking an observation also enters the picture. It is intuitively clear that the more expensive the sampling, the fewer observations should be taken. For simplicity of discussion we assume that the cost of observation is a constant $c$. Let the stopping rule be given by

$$\varphi(x) = (\varphi_0, \varphi_1(x_1), \varphi_2(x_1, x_2), \ldots),$$

where $\varphi_i(x_1, \ldots, x_i)$ is the conditional probability that the statistician will stop sampling given that he has observed $X_1 = x_1, \ldots, X_i = x_i$. $\varphi_0$ represents the probability of taking no observations at all. The terminal decision rule is given by

$$d(x) = (d_0, d_1(x_1), d_2(x_1, x_2), \ldots),$$

where $d_i(x_1, \ldots, x_i)$ gives a rule for $\theta$ once a sample $X_1 = x_1, \ldots, X_i = x_i$ is observed and a stopping rule, $\varphi_i(x_1, \ldots, x_i)$, is used. Therefore, a decision rule $\delta(x)$ has two components given by

$$\delta(x) = [\varphi(x), d(x)]. \qquad (7.5.1)$$
The risk in a sequential decision problem is a function of $\delta(x)$ and the cost of taking $j$ observations. The Bayes rule is obtained in two steps. First, a terminal decision is obtained for a given prior distribution and a given stopping rule, so as to minimize the risk. Then, the stopping rule is obtained so as to minimize the risk obtained after using the optimal terminal decision rule. Although the argument is circular, it is possible to show that the rule so obtained is the Bayes rule. We show below the procedure of obtaining the Bayes stopping rule given in terms of backward induction, first proposed in this connection by Arrow et al. (1949). The generalization of the backward induction to that of the dynamic programming procedure by Bellman has already been discussed in several contexts. For more detailed discussion, see Blackwell and Girshick (1954) and Ferguson (1967), among many others.

The basic process is the following. Consider the case in which the sampling must stop after $N$ observations. The case of unlimited sampling can then be treated by taking the limit as $N \to \infty$. We assume that

$$\varphi_N(x_1, \ldots, x_N) = 1,$$
since sampling must stop when $N$ observations are taken. At the $(N - 1)$st observation, we stop if the conditional expected loss of stopping immediately, given $X_1 = x_1, \ldots, X_{N-1} = x_{N-1}$, is less than the conditional expected loss, given $X_1 = x_1, \ldots, X_{N-1} = x_{N-1}$, of taking one more observation and then stopping, plus the cost $c$ of the additional observation. This gives $\varphi_{N-1}$. Knowing $\varphi_N$ and $\varphi_{N-1}$, we can proceed inductively to obtain $\varphi_{N-2}$ and so on. That is, $\varphi_N, \varphi_{N-1}, \ldots, \varphi_0$ are determined. This procedure is formally described below. Assume that $\xi(\theta)$ is the prior distribution of $\theta$ and

$$\varphi_{N-1} = \begin{cases} 1, & \text{if } \ldots, \\ 0, & \text{if } \ldots > \ldots. \end{cases} \qquad (7.5.3)$$

Suppose, for $j = 1, 2, \ldots, N$, we define recursively

so that the optimal stopping rule is given by

$$\varphi_{j-1}(x_1, \ldots, x_{j-1}) = \begin{cases} \text{any}, & \text{if } \ldots, \\ 0, & \text{if } \ldots, \end{cases} \qquad (7.5.5)$$

for $j = 1, 2, \ldots, N$. The stopping rule so obtained is the Bayes rule. The terminal decision is chosen so that after stopping at $j = J$, it is regarded as a fixed sample rule for $J$
observations. The procedure jointly giving the stopping and terminal rule provides the optimal decision rule for a sequential analysis problem. Our object in the following will be to introduce the sequential analysis problem so that it can be reduced to that of a stopping problem. The discrete problem can be transformed into a continuous problem to take care of more general situations.
Example 7.5.1 The test of the hypothesis of the mean $\mu$ of a normal distribution will be reduced to a stopping problem. Let

$$H_0: \mu > 0, \qquad H_1: \mu < 0, \qquad (7.5.6)$$

and let

$$L(\mu, a_i) = \begin{cases} -k\mu, & \text{when } \mu < 0, \\ k\mu, & \text{when } \mu > 0, \end{cases}$$

where $a_i$ = accept $H_i$, $i = 0, 1$. Let the cost of each observation be $c$. Hence the cost is $cn$ if we stop after $n$ observations and the decision is correct, and $cn + k|\mu|$ if the decision is incorrect. In this case the optimal stopping rule can be obtained by the backward induction procedure as described above.
Example 7.5.2 Consider a continuous analog of Example 7.5.1. Let $\mu$ have a prior density $(1/\sigma_0)\,\phi((\mu - \mu_0)/\sigma_0)$. Then, from (7.4.13) and (7.4.14), we have the posterior distribution of $\mu$ given $X_1 = x_1, \ldots, X_n = x_n$,

$$s_n^{-1/2}\,\phi[(\mu - y_n)s_n^{-1/2}].$$

The risk using the posterior density, or the posterior risk, of the decision to accept $H_1$ and stop at $n$ is $d(y_n, s_n)$, where

$$d(y, s) = \int_{-\infty}^{0} k|\mu|\,s^{-1/2}\,\phi[s^{-1/2}(\mu - y)]\, d\mu, \qquad (7.5.7)$$

since the loss is $k|\mu|$ when $\mu < 0$ and 0 elsewhere. Let $v = (\mu - y)s^{-1/2}$; then

$$d(y, s) = -\int_{-\infty}^{-u} k(y + s^{1/2}v)\,\phi(v)\, dv, \qquad (7.5.8)$$
where $u = s^{-1/2}y$. Simplifying, we have

$$d(y, s) = -ks^{1/2}\int_{-\infty}^{-u} (u + v)\,\phi(v)\, dv. \qquad (7.5.9)$$

Integrating by parts, we have

$$d(y, s) = ks^{1/2}\{\phi(u) - u[1 - \Phi(u)]\},$$

where

$$\Phi(u) = \int_{-\infty}^{u} \phi(t)\, dt.$$

Thus

$$d(y, s) = ks^{1/2}\psi_+(u), \qquad (7.5.10)$$

where

$$\psi_+(u) = \phi(u) - u[1 - \Phi(u)], \qquad (7.5.11)$$

where $u \ge 0$. Similarly, the posterior risk of rejecting the hypothesis $H_1$ and stopping at $n$ is

$$ks^{1/2}\psi_-(u), \qquad (7.5.12)$$

where

$$\psi_-(u) = \phi(u) + u\Phi(u), \qquad \text{where } u < 0. \qquad (7.5.13)$$

The cost of sampling is $cn$ or

$$c\sigma^2(s_n^{-1} - \sigma_0^{-2}), \qquad (7.5.14)$$

from (7.4.16). Hence the risk associated with stopping at the $n$th observation is $d(y_n, s_n)$, where $d(y, s)$ is given by using (7.5.10), (7.5.12), and (7.5.14),

$$d(y, s) = ks^{1/2}\psi(y/s^{1/2}) + c\sigma^2(s^{-1} - \sigma_0^{-2}), \qquad (7.5.15)$$

with

$$\psi(u) = \begin{cases} \psi_+(u), & \text{when } u > 0, \\ \psi_-(u), & \text{when } u < 0. \end{cases}$$
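The stopping risk of (7.5.15) is easy to evaluate numerically from (7.5.11) and (7.5.13). The sketch below uses the standard library error function for the normal distribution function; the function names and the parameter values are assumptions for illustration, and the value at $u = 0$, where the two branches agree, is assigned to the first branch.

```python
import math


def normal_pdf(u):
    """Standard normal density."""
    return math.exp(-u * u / 2.0) / math.sqrt(2.0 * math.pi)


def normal_cdf(u):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


def psi(u):
    """psi_+ of (7.5.11) for u >= 0 and psi_- of (7.5.13) for u < 0."""
    if u >= 0:
        return normal_pdf(u) - u * (1.0 - normal_cdf(u))
    return normal_pdf(u) + u * normal_cdf(u)


def stopping_risk(y, s, k, c, sigma, sigma0):
    """Risk of stopping at the point (y, s), as in (7.5.15)."""
    decision_part = k * math.sqrt(s) * psi(y / math.sqrt(s))
    sampling_part = c * sigma ** 2 * (1.0 / s - sigma0 ** -2)
    return decision_part + sampling_part


# Illustrative values: stopping before any observation (s = sigma0^2), with y = 0.
risk_at_start = stopping_risk(y=0.0, s=1.0, k=1.0, c=0.1, sigma=1.0, sigma0=1.0)
```

Since $\psi_-(-u) = \psi_+(u)$, the function `psi` is symmetric about zero and decreases as $|u|$ grows, so the decision part of the risk is largest when the posterior mean sits on the boundary between the two hypotheses.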
The sequential analysis problem for testing a hypothesis about the mean p has thus been reduced to that of a stopping problem with a given stopping risk. The study of the stopping problems will be made for the continuous versions
of the above discrete problem. For that purpose, we introduce the notion of Wiener process in the next section.

7.6 Wiener Process

The Wiener process (Brownian motion process) holds an important place in the study of stochastic processes. However, the usefulness of the Wiener process in control theory and statistics arises from the fact that it provides a method of generating continuous time analogs to discrete processes, especially of the type

$$x_{n+1} = g(x_n, \xi_n), \qquad (7.6.1)$$

where $\xi_n$ is a sequence of independent random variables. We use the Wiener process in reducing the sequential analysis problem to that of a continuous case and give the characterization of stopping problems. A stochastic process, that is, a family of random variables $X(t)$ for $t$ in the interval $(0, T)$, is called stationary if for

$$0 < t_1 < t_2 < \cdots < t_k < T, \qquad (7.6.2)$$

the probability distribution of $X(t_1 + \tau), X(t_2 + \tau), \ldots, X(t_k + \tau)$ is the same as that of $X(t_1), X(t_2), \ldots, X(t_k)$ for all $\tau$ such that 0
(7.7.3)
Figure 7.3 represents a typical situation.
[Figure 7.3]
Example 7.7.3 The continuous analog of the sequential analysis problem discussed in Section 7.5 as Example 7.5.2 is the stopping problem with the stopping risk

$$d(y, s) = ks^{1/2}\psi(y/s^{1/2}) + c\sigma^2(s^{-1} - \sigma_0^{-2}),$$

as given by (7.5.15).

Example 7.7.4
The continuous time version of Example 7.7.2 can be seen = y o , we have (n an integer)
as follows. For the discrete case, withy-, d(y, n) =
{ mm - nn- ,y 2 ,
for
n=O,y 0, elsewhere,
$F(x) = 1 - e^{-\lambda x}$, and (8.2.2) Conversely, one can show that, if the failure rate is constant, the failure probability distribution is exponential. The exponential distribution arises in reliability problems quite naturally, and we shall discover it again when we consider bounds for the probability of survival. In some applications, the failure rate goes on increasing or decreasing. Such distributions have many interesting properties. We define formally the increasing failure rate distributions that have found extensive applications.

Definition A continuous distribution function $F(x)$ is said to have an increasing failure rate, or simply IFR, if and only if
$$\frac{F(t + x) - F(t)}{1 - F(t)} \qquad (8.2.3)$$

is a monotone increasing function of $t$ for $x > 0$ and $t \ge 0$ such that $F(t) < 1$.

For the discrete case, let the time to failure $X$ have the distribution

$$P(X = k) = p_k, \qquad k = 0, 1, 2, \ldots.$$

Then the distribution function is IFR if and only if

$$\frac{p_k}{\sum_{j=k}^{\infty} p_j} \qquad (8.2.4)$$

is monotone increasing in $k$ for $k = 0, 1, 2, \ldots$. Analogously, the notion of distributions with decreasing failure rate can be defined. Decreasing failure rate distributions may be denoted by DFR. Although
there is a large class of either IFR or DFR distributions, there are many distributions that are neither. Also, a distribution may be IFR for some parameter values, while for others it may be DFR. One of the central problems of reliability is the probability of survival until time $x$, and this probability may sometimes be taken as the reliability of an item. It is of interest, therefore, to find upper and lower bounds on $1 - F(x)$. Bounds on the distribution function $F(x)$ have been studied extensively under moment constraints by several researchers, and the whole group of Tchebycheff-type inequalities belongs to this problem area. A recent survey is given by Karlin and Studden (1966). In this section, inequalities are discussed for $1 - F(x)$ when the distributions are IFR. In the classical study of Tchebycheff-type inequalities or in their generalizations such as discussed in Chapter IV, the geometry of moment spaces is utilized. The arguments depend heavily on the property that the class of distribution functions is convex. The class of distribution functions having a prescribed set of moments is also convex. However, the class of IFR (or DFR) distributions is not convex. Therefore, classical methods of geometry of moment spaces cannot be directly applied to the above problem. We first state an important inequality, called Jensen's inequality.
Theorem 8.2.1 If $\varphi(x)$ is a convex (concave) function of $x$, where $x$ is a point in $n$-dimensional Euclidean space, then

$$E[\varphi(X)] \ge (\le)\ \varphi[E(X)]. \qquad (8.2.5)$$

Inequality (8.2.5) is essentially a variational result. We consider finding the lower bound of $E[\varphi(X)]$. That is, find the minimum, over the class of distribution functions $F(x)$, of

$$\int \varphi(x)\, dF(x) \qquad (8.2.6)$$

such that

$$\int x\, dF(x) = \mu. \qquad (8.2.7)$$

Jensen's inequality, restated, says that the lower bound of the integral (8.2.6) is $\varphi(\mu)$. Such problems have been discussed in Chapter IV. Utilizing these results, since $\varphi$ is convex, in the case in which $X$ is a real-valued random variable, the one-point distribution $F_0(x)$ with

$$F_0(x) = \begin{cases} 0, & x < \mu, \\ 1, & x \ge \mu, \end{cases} \qquad (8.2.8)$$

provides the lower bound.
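The variational reading of Jensen's inequality can be illustrated with a discrete distribution. In the sketch below, the three-point distribution, the convex function $x^2$, and the helper name `expectation` are all assumptions chosen for the example; the one-point distribution at the mean attains the lower bound.

```python
def expectation(values, probs, g=lambda x: x):
    """Expectation of g(X) for a discrete distribution on `values`."""
    return sum(p * g(v) for v, p in zip(values, probs))


def square(x):
    """A convex function used for the illustration."""
    return x * x


# A three-point distribution (illustrative values).
values = [0.0, 1.0, 4.0]
probs = [0.5, 0.25, 0.25]

mean = expectation(values, probs)                 # plays the role of mu
lhs = expectation(values, probs, square)          # E[phi(X)]
rhs = square(mean)                                # phi(E[X])

# The one-point distribution at the mean has the same mean and attains the bound.
one_point = expectation([mean], [1.0], square)
```

Any other distribution with the same mean gives a value of `lhs` at least as large as `one_point`, which is the minimum of (8.2.6) subject to (8.2.7).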
The upper bound, in case $\varphi(x)$ is concave, is similarly obtained. In case the distribution function $F(x)$ is absolutely continuous, we note that

$$r(x) = -\frac{d}{dx}\{\log[1 - F(x)]\}. \qquad (8.2.9)$$

Therefore, the property that $F(x)$ is IFR is equivalent in this case to $\log(1 - F(x))$ being concave. In our subsequent discussion, logconcavity and logconvexity of $1 - F(x)$ may replace the property of being IFR and DFR, respectively. A large number of results are given by Barlow and Proschan (1967) for bounds on $1 - F(x)$, but we discuss only two for illustrative purposes. The interested reader should consult their monograph. We denote $1 - F(x)$ by $\bar{F}(x)$,

$$\bar{F}(x) = 1 - F(x). \qquad (8.2.10)$$
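The relation (8.2.9) gives a simple numerical check of whether a distribution is IFR or DFR on a grid. The sketch below differentiates the logarithm of the survival function by central differences; the choice of distributions (an exponential and two Weibull-type survival functions), the grid, and the function names are assumptions made for illustration and are not taken from the text.

```python
import math


def failure_rate(survival, x, h=1e-6):
    """Approximate -d/dx log(1 - F(x)) by a central difference."""
    log_survival = lambda t: math.log(survival(t))
    return -(log_survival(x + h) - log_survival(x - h)) / (2.0 * h)


def is_nondecreasing(values, tol=1e-6):
    """True if the sequence never drops by more than `tol`."""
    return all(b >= a - tol for a, b in zip(values, values[1:]))


grid = [0.5, 1.0, 1.5, 2.0]

exponential = lambda x: math.exp(-2.0 * x)           # constant failure rate 2
weibull_shape_2 = lambda x: math.exp(-x ** 2)        # failure rate 2x
weibull_shape_half = lambda x: math.exp(-math.sqrt(x))

rates_exponential = [failure_rate(exponential, x) for x in grid]
rates_shape_2 = [failure_rate(weibull_shape_2, x) for x in grid]
rates_shape_half = [failure_rate(weibull_shape_half, x) for x in grid]
```

On this grid the exponential has a constant rate, the shape-2 case has an increasing rate (IFR), and the shape-one-half case has a decreasing rate (DFR), which illustrates how one family can be IFR for some parameter values and DFR for others.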
Theorem 8.2.2 If $F(x)$ is IFR and

$$\int x\, dF(x) = \mu,$$

then
if
xl2 dt,
'i
i = 1 , 2, . . . , n - I .
Here $y_i$ is the value of $s_i(t)$ at $t_i$, and $z_i$ is the value of the derivative of $s_i(t)$ at $t_i$. The functional equation of dynamic programming can be obtained from (8.4.16) as

$$F_i(y_i, z_i) = \min_{y_{i+1},\, z_{i+1}} \left\{ \int_{t_i}^{t_{i+1}} [u(t) - s_i(t)]^2\, dt + F_{i+1}(y_{i+1}, z_{i+1}) \right\}. \qquad (8.4.17)$$

The functional equation (8.4.17) can be further simplified, leading to explicit solutions of the problem.

For further details, the reader may consult the original paper of Bellman et al. (1974).
8.5 Connections between Mathematical Programming and Statistics

The central problem of statistical inference, that of testing statistical hypotheses, and the development of the Neyman-Pearson theory were briefly
introduced in Chapter IV. The application of the Neyman-Pearson technique has already been made in solving some nonlinear moment problems as discussed in Chapter V. The Neyman-Pearson problem has recently been applied to many mathematical programming problems, especially in duality theory. Recent references in this connection are Francis and Wright (1969), Vyrsan (1967), Francis and Meeks (1972), and Francis (1971). Mathematical programming methods have been widely used in statistics in many contexts. For a recent survey, see Wagner (1959, 1962), Karlin (1959), Krafft (1970), and Vajda (1972). In using the language of mathematical programming, we have two problems, primal and dual, and the solution of one provides the solution of the other. For many complicated optimization problems this duality theory simplifies their solution. Interesting new optimization problems arise as a result of the duality property. We discuss below an example in which the Neyman-Pearson problem is the dual to a problem of sufficient interest, called the primal problem. We consider first the Neyman-Pearson problem and describe a duality theory for it. It will be seen that the development resembles the one in Chapter V in which the nonlinear moment problem is solved by first linearizing it and then applying the Neyman-Pearson technique, as considered by Rustagi (1957) and Karlin (1959).
Neyman-Pearson Problem

Consider a given function $\varphi(x, y)$, which is strictly concave in $y$ and differentiable in $y$. Let $\psi_1(x, y), \ldots, \psi_m(x, y)$ be a given set of $m$ functions such that $\psi_i(x, y)$ is convex and differentiable in $y$ for $i = 1, 2, \ldots, m$. Let $\lambda_1, \ldots, \lambda_m$ be $m$ nonnegative real numbers and let

$$V(x, y, \lambda) = \varphi(x, y) - \sum_{i=1}^{m} \lambda_i\psi_i(x, y), \qquad (8.5.1)$$

where $\lambda = (\lambda_1, \ldots, \lambda_m)'$. By the assumptions above, the function $V(x, y, \lambda)$ is concave and differentiable in $y$. Let $l(x)$ and $u(x)$ be given functions and denote by $W$ the class of functions $f$ such that

$$l(x) \le f(x) \le u(x).$$

Let the subsets $S_1$, $S_2$, and $S_3$ be defined as
s1 = {x
S3
:
= {x :
2 (x, y, A) < 0, aY
Z(X)
2
0,
(8.6.3)
with constraints and where $x$ is an $m$-dimensional vector, $A$ is an $n \times m$ matrix, and $b$ and $c$ are given vectors. The solution to the problem is obtained by considering the extreme points of the convex set given by (8.6.2) and (8.6.3). Many times the constraint (8.6.2) is in terms of inequalities. The minimum, then, can be obtained by the consideration of the value of the functional (8.6.1) over these extreme points. Well-known procedures, such as the simplex method developed by Dantzig, are available to obtain the numerical solutions. Suppose now that $c$ is random. One approach to the solution of the problem is to find the minimum of $E(c)'x$ over the set defined by the same constraints. Recently, Rozanov (1974) has suggested the following criterion for this stochastic programming problem. Let the value of the objective function at an extreme point of the convex set generated by the restriction be denoted by the random variable $Z_i$. It is assumed that the distribution of $Z_i$ is somehow known. Then the Rozanov criterion is to find the minimum so as to minimize the probabilities of extreme points; that is, minimize $P(Z_i)$ over $i$. (8.6.4) Numerical procedures resembling the simplex method of deterministic linear programming have been developed recently by Dantzig (1974).
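The expected-cost approach can be sketched for a tiny problem whose extreme points are known in advance. In the code below the feasible set, its two vertices, and the expected cost vector are hypothetical values chosen for illustration; for problems of realistic size the vertices would not be enumerated and a simplex-type routine would be used instead.

```python
def best_vertex(vertices, expected_c):
    """Return the extreme point minimizing the expected linear cost E(c)'x."""
    def value(x):
        return sum(ci * xi for ci, xi in zip(expected_c, x))
    return min(vertices, key=value)


# Hypothetical feasible set {x : x1 + x2 = 1, x >= 0}; its extreme points are listed.
vertices = [(1.0, 0.0), (0.0, 1.0)]
expected_c = (2.0, 3.0)          # assumed mean of the random cost vector c

x_star = best_vertex(vertices, expected_c)
optimal_value = sum(ci * xi for ci, xi in zip(expected_c, x_star))
```

Because the objective is linear in $x$, replacing $c$ by its mean reduces the stochastic problem to a deterministic linear program, and the minimum is again attained at an extreme point.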
We consider below the problem of chance constrained programming. Here the constraints are satisfied with specified probabilities. Many procedures of chance constrained programming have been developed by Charnes et al. (1971).

Example 8.6.1 The problem is to minimize E(X)
(8.6.5)
P { X < Y} 0.
(8.6.7)
subject to constraints and For simplicity assume that Y is a continuous random variable having the probability
{ 0,
fCv) = 2 y ,
O