Lecture Notes in Control and Information Sciences 225
Editor: M. Thoma

A.S. Poznyak and K. Najim

Learning Automata and Stochastic Optimization

Springer

Series Advisory Board
A. Bensoussan, M.J. Grimble, P. Kokotovic, H. Kwakernaak, J.L. Massey, Y.Z. Tsypkin

Authors
A.S. Poznyak, CINVESTAV, Instituto Politecnico Nacional, Departamento de Ingenieria Electrica, Seccion de Control Automatico, A.P. 14-740, 07000 México D.F., México
K. Najim, Institut National Polytechnique de Toulouse, Toulouse, France

ISBN 3-540-76154-3 Springer-Verlag Berlin Heidelberg New York

British Library Cataloguing in Publication Data
Pozniak, A.S. (Aleksandr Semenovich)
Learning automata and stochastic optimization. - (Lecture notes in control and information sciences; 225)
1. Mathematical optimization 2. Machine theory 3. Stochastic processes
I. Title II. Najim, Kaddour
519.2'3
ISBN 3540761543

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms of licences issued by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to the publishers.

© Springer-Verlag London Limited 1997
Printed in Great Britain

The use of registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant laws and regulations and therefore free for general use. The publisher makes no representation, express or implied, with regard to the accuracy of the information contained in this book and cannot accept any legal responsibility or liability for any errors or omissions that may be made.

Typesetting: Camera ready by authors
Printed and bound at the Athenaeum Press Ltd, Gateshead
Printed on acid-free paper
To Tatyana and Michelle
Contents

0.1 Notations
0.2 Introduction

1 Stochastic Optimization
  1.1 Introduction
  1.2 Standard deterministic and stochastic nonlinear programming problem
  1.3 Standard stochastic programming problem
  1.4 Randomized strategies and equivalent smoothed problems
  1.5 Caratheodory theorem and stochastic linear problem on finite set
  1.6 Conclusion
  References

2 On Learning Automata
  2.1 Introduction
  2.2 Learning automaton
  2.3 Environment
  2.4 Reinforcement schemes
  2.5 Hierarchical structure of learning automata
  2.6 Normalization and projection
    2.6.1 Normalization procedure
    2.6.2 Projection algorithm
  2.7 Conclusion
  References

3 Unconstrained Optimization Problems
  3.1 Introduction
  3.2 Statement of the Stochastic Optimization Problem
  3.3 Learning Automata Design
  3.4 Nonprojectional Reinforcement schemes
    3.4.1 Bush-Mosteller reinforcement scheme
    3.4.2 Shapiro-Narendra reinforcement scheme
    3.4.3 Varshavskii-Vorontsova reinforcement scheme
  3.5 Normalization Procedure and Optimization
  3.6 Reinforcement Schemes with Normalized Environment response
    3.6.1 Bush-Mosteller reinforcement scheme with normalization procedure
    3.6.2 Shapiro-Narendra reinforcement scheme with normalization procedure
    3.6.3 Varshavskii-Vorontsova reinforcement scheme with normalization procedure
  3.7 Learning Stochastic Automaton with Changing number of Actions
  3.8 Simulation results
    3.8.1 Learning automaton with fixed number of actions
    3.8.2 Learning automaton with changing number of actions
  3.9 Conclusion
  References

4 Constrained Optimization Problems
  4.1 Introduction
  4.2 Constrained optimization problem
    4.2.1 Problem formulation
    4.2.2 Equivalent linear programming problem
  4.3 Lagrange multipliers using regularization approach
  4.4 Optimization algorithm
    4.4.1 Learning automata
    4.4.2 Algorithm
  4.5 Convergence and convergence rate analysis
  4.6 Penalty function
  4.7 Properties of the optimal solution
  4.8 Optimization algorithm
  4.9 Numerical examples
    4.9.1 Lagrange multipliers approach
    4.9.2 Penalty function approach
  4.10 Conclusion
  References

5 Optimization of Nonstationary Functions
  5.1 Introduction
  5.2 Optimization of Nonstationary Functions
  5.3 Nonstationary learning systems
  5.4 Learning Automata and Random Environments
  5.5 Problem statement
  5.6 Some Properties of the Reinforcement Schemes
  5.7 Reinforcement Schemes with Normalization Procedure
    5.7.1 Bush-Mosteller reinforcement scheme
    5.7.2 Shapiro-Narendra reinforcement scheme
    5.7.3 Varshavskii-Vorontsova reinforcement scheme
  5.8 Simulation results
  5.9 Conclusion
  References

Appendix A: Theorems and Lemmas
Appendix B: Stochastic Processes

Index
0.1 Notations

The superscript convention will be used to index the discrete-time step. Throughout this book we use the following notation:

arg min_x f(x) : minimizing argument of f(x)
F_{n-1} : σ(u_1, ..., u_n; ξ_1, ..., ξ_{n-1})
c_n(i) : conditional mathematical expectations of the environment responses
ξ̄_n(i) : arithmetic averages
c(i) : limit of c_n(i)
d_i : diameter of the subset X_i
E : expectation operator
env X : envelope of the set X
e^N : (1, ..., 1)^T ∈ R^N
e(u_n) : (0, 0, ..., 0, 1, 0, ..., 0)^T ∈ R^N (the s-th component of this vector is equal to 1 if u_n = u(s) and the other components are equal to zero)
F_x(x) : distribution function of the vector x
F : σ-algebra (σ-field: a set of subsets of Ω for which probabilities are defined)
f_0(·) : objective function
f_j(·) : constraints (j = 1, ..., m)
L_i : Lipschitz constants
L(x, Λ) : Lagrange function
L_{μδ}(p, u) : penalty function
N : number of automaton actions
N(j) : number of actions of the subset V(j)
P : probability measure
p_n : probability distribution at time n
p_n(α) : probability associated with the optimal action
p* : optimal strategy
q_n : probability distribution defined over all possible action subsets
(Ω, F, P) : probability space
U : set of actions
u_n : automaton output (action) at time n
u(α) : optimal action
v_j : slack variables (j = 1, ..., m)
V : set of action subsets
V(j) : subset of actions
V_0(·) : objective function
V_j(·) : constraints (j = 1, ..., m)
W_n : Lyapunov function
{x_i} : quantification of the compact set X
x* : optimal solution
y_n : observation at time n
γ_n : correction factor
δ(·) : δ-function
ξ_n^j : multi-teacher environment responses at time n (j = 1, ..., m)
λ(j) : Lagrange multipliers (j = 1, ..., m)
μ : penalty coefficient
ξ_n : environment response
s_n : normalized environment response
O(x) : O(x)/x bounded when x → 0
o(x) : o(x)/x → 0 when x → 0
Π : projection operator
φ_0(·) : objective function
φ_j(·) : constraints (j = 1, ..., m)
σ_n²(i) : conditional variances of the environment responses
Φ_n : loss function at time n
Ω : basic space (events space)
ω_n : observation noise at time n
0.2 Introduction

In the last decades there has been a steadily growing need for, and interest in, computational methods for solving optimization problems with or without constraints. They play an important role in many fields (chemistry, mechanics, electrical engineering, economics, etc.). Optimization techniques have been gaining greater acceptance in many industrial applications, motivated by the increased interest in improved economy and better utilization of existing material resources. Euler said: "Nothing happens in the universe that does not have a sense of either certain maximum or minimum". In this book we are primarily concerned with the use of learning automata as a tool for solving many optimization problems. Learning systems have made a significant impact on many areas of engineering problems including modelling, control, optimization, pattern recognition, signal processing and diagnosis. They are attractive and provide interesting methods for solving complex nonlinear problems characterized by a high level of uncertainty. Learning systems are expected to provide the capability to adjust the probability distribution on-line, based on the environment response. They are essentially feedback systems. The optimization problems are modelled as the behaviour of a learning automaton, or a hierarchical structure of learning automata, operating in a random environment. We report new and efficient techniques to deal with different kinds (unconstrained, constrained) of stochastic optimization problems. The main advantage of learning automata over other optimization techniques is their general applicability, i.e., there are almost no conditions concerning the function to be optimized (continuity, differentiability, convexity, unimodality, etc.). This book presents a solid background for the design of optimization procedures. Ample references are given at the end of each chapter to allow the reader to pursue his interest further. Several applications are described throughout the book to bridge the gap between theory and practice. Stochastic processes such as martingales have extensive applications in stochastic problems. They arise naturally whenever one needs to consider mathematical expectations with respect to increasing information patterns. They will be used, together with Lyapunov functions, to state several theoretical results concerning the convergence and the convergence rate of learning systems. The Lyapunov approach offers a shortcut to proving global stability (convergence) of dynamical systems. A Lyapunov function often measures the energy of a physical system. This book consists of five chapters. Chapter one deals with stochastic optimization problems. It is shown that stochastic optimization problems (with or without constraints) can be reduced to optimization problems on
finite sets and then solved using learning automata (operating in S-model environments, i.e., with continuous environment response), which are essentially sequential machines. An introduction to learning automata and several notions and definitions are presented in Chapter two. Both single and hierarchical structures of learning automata are considered. It is shown that the synthesis of several reinforcement schemes can be associated with the optimization of some functional. Chapters one and two constitute the core of this book. Chapter three is concerned with stochastic unconstrained optimization problems. Learning automata with fixed and changing numbers of actions are used as a tool for solving this kind of optimization problem. A transformation called the "normalization procedure" is introduced to ensure that the environment response (automaton input) belongs to the unit segment. Several analytical results dealing with the convergence characteristics of the optimization algorithms are stated. The stochastic constrained optimization problems are treated in Chapter four. This stochastic optimization problem is formulated and solved as the behaviour of variable-structure stochastic automata operating in multi-teacher environments. Two approaches are considered: Lagrange multipliers and penalty functions. The properties of the optimal solutions are presented. The convergence of the derived optimization algorithms is stated and the associated convergence rates are estimated. Chapter five deals with the optimization of nonstationary (time-varying) functions, which can be encountered in several engineering problems. These optimization problems are associated with the behaviour of learning automata with continuous inputs (S-model environment) operating in nonstationary environments. Numerical examples illustrate the performance of the optimization algorithms described in chapters three, four and five. Two appendices end this book. The first one contains several lemmas and their proofs; the second one contains some definitions and properties concerning different kinds of convergence and martingales. A detailed table of contents provides a general idea of the scope of the book. The content of this book, in the opinion of the authors, represents an important step in the systematic use of learning automata concepts in stochastic optimization problems. We would like to thank our doctoral students E. Ikonen and E. G. Ramires, who have carried out most of the simulation studies presented in this book.
Professor A.S. Poznyak, Instituto Politecnico Nacional, México, Mexico
Professor K. Najim, Institut National Polytechnique de Toulouse, France
1 Stochastic Optimization

1.1 Introduction

Because of its practical significance, there has been considerable interest in the literature in the field of optimization. This field has seen tremendous growth in several industries (chemistry, data communication, etc.). Any problem of optimization concerns the minimization of some function over some set which can be specified by a number of conditions (equality and inequality constraints) [6]-[7]. Maximization can be converted to minimization by a change of sign. The optimization problems associated with real systems involve a high level of uncertainty, disturbances, and noisy measurements. In several optimization problems there exists no information (probability distribution, etc.), or the information that exists is incomplete. Learning systems are adaptive machines [8]-[5]. They improve their behaviour with time and are useful whenever complete knowledge about an environment is unknown, expensive to obtain or impossible to quantify. A learning automaton is connected in a feedback loop to the random medium (environment), where the input of one is the output of the other. At every sampling period (iteration), the automaton chooses an action from a finite action set, on the basis of the probability distribution. The selected action causes the reaction of the environment, which in turn is the input signal for the automaton (i.e., the environment establishes the relation between the actions of the automaton and the signal received at its input, which can be binary or continuous). With a given reinforcement scheme, the learning automaton recursively updates its probability distribution on the basis of the environment response. One of our first goals in this chapter will be to present the various stochastic optimization problems in a unified manner and to show how learning automata can be used to solve this kind of optimization problem efficiently. Optimization methods based on learning systems belong to the class of random search techniques. The main advantage of random search over other direct search techniques is its general applicability, i.e., there are almost no conditions concerning the function to be optimized (continuity, unimodality, convexity, differentiability, etc.). The standard programming problem will be stated in the next section.
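As an illustration of this feedback loop, here is a minimal sketch in Python; the two-action environment and the reward-type update are invented for the example (a linear scheme of the kind analysed in Chapter 3, not a procedure taken from the book):

```python
import random

def learning_automaton_step(p, gamma):
    """One interaction of a variable-structure automaton with an S-model
    environment: sample an action, observe a response in [0, 1], and
    update the probability distribution (linear reward-type sketch)."""
    N = len(p)
    # choose an action index according to the current distribution p
    i = random.choices(range(N), weights=p)[0]
    # hypothetical S-model environment: response in [0, 1], smaller is better
    xi = random.random() * (0.2 if i == 0 else 1.0)
    # reward-oriented update: move probability mass toward the chosen
    # action when the reward (1 - xi) is large; the simplex is preserved
    for j in range(N):
        target = 1.0 if j == i else 0.0
        p[j] += gamma * (1.0 - xi) * (target - p[j])
    return p

p = [0.5, 0.5]
for n in range(2000):
    p = learning_automaton_step(p, gamma=0.05)
print(p)  # p[0] should approach 1: action 0 yields smaller responses
```

The update adds γ(1 - ξ)(e(u_n) - p) to p, so the components stay nonnegative and sum to one after every step.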
1.2 Standard deterministic and stochastic nonlinear programming problem

In this section, our main emphasis will be on the standard programming problem [5]-[6], which can be stated as follows:

inf_x f_0(x)    (1.1)

f_j(x) ≤ 0, (j = 1, ..., m)    (1.2)

x ∈ X ⊆ R^M    (1.3)
The function f_0(x) is an objective function, sometimes also called a criterion function or cost function. In the case of an unconstrained problem, the optimal solution requires the gradient of f_0(x) at the extremum to be null. Notice that equality constraints of the form ψ_s(x) = 0, (s = 1, ..., m_1) can be transformed into inequality constraints.
1.3 Standard stochastic programming problem

Chance constraints of the form

P{ω : ψ_j(ξ_n^0, ξ_n^1, ..., ξ_n^j) ≥ a_j} ≤ ε_j    (1.18)

belong to the class of constraints (1.3). Indeed, this chance constraint can be written as follows:

E{ξ̃_n^j} ≤ 0    (1.19)

where

ξ̃_n^j = χ{ψ_j(ξ_n^0, ξ_n^1, ..., ξ_n^j) ≥ a_j} - ε_j    (1.20)

and the indicator function χ(·) is defined as follows:

χ(A) = 1 if A is true, and 0 otherwise.

Remark 2. Substituting (1.20) into (1.3) leads to

limsup_{n→∞} (1/n) Σ_{t=1}^n χ{ψ_j(ξ_t^0, ξ_t^1, ..., ξ_t^j) ≥ a_j} ≤ ε_j    (1.21)
In the stationary case, i.e., when the probability distribution of the vector (ξ_t^0, ξ_t^1, ..., ξ_t^m)^T is stationary, and in view of the strong law of large numbers [11]-[12], it follows that inequalities (1.21) and (1.18) are equivalent, i.e.,

limsup_{n→∞} (1/n) Σ_{t=1}^n χ{ψ_j(ξ_t^0, ξ_t^1, ..., ξ_t^j) ≥ a_j} =
= limsup_{n→∞} (1/n) Σ_{t=1}^n E{χ{ψ_j(ξ_t^0, ξ_t^1, ..., ξ_t^j) ≥ a_j}} =
= P{ω : ψ_j(ξ^0, ξ^1, ..., ξ^j) ≥ a_j} ≤ ε_j

Stated in this way, the problem is equivalent to the standard deterministic programming problem (1.1)-(1.3). Before applying deterministic optimization techniques, it is necessary to calculate the integrals in (1.16) and (1.17). These integrals involve complex operations and cannot be calculated exactly. Several approaches have been proposed to avoid this problem. One approach consists of using the average of the observations {y_n(ω)} given by (1.10), i.e.,
lim_{n→∞} (1/n) Σ_{t=1}^n y_t^0(ω) → inf a.s.    (1.22)

lim_{n→∞} (1/n) Σ_{t=1}^n y_t^j(ω) ≤ 0 a.s., (j = 1, ..., m)    (1.23)

where {x_n} belongs to the class of all realizable strategies (1.4).
This stochastic programming problem (1.22)-(1.23) demands considerable computational effort. In fact, it is necessary to solve it (parallel optimization) for different realizations. Another approach is based on the implementation of stochastic approximation techniques [13]-[14]-[15] using the observations (measurements) of the gradients (1.11)-(1.12). The main ideas of this approach will be considered in this book.
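As a small illustration of this averaging idea, the probability in the chance constraint (1.18) can be estimated by the empirical frequency appearing in (1.21); the function psi, the threshold and the noise model below are invented for the example:

```python
import random

def estimate_chance_constraint(psi, a_j, n=10000):
    """Empirical estimate of P{psi(xi) >= a_j} via the indicator
    function chi of (1.20), averaged over n realizations as in (1.21)."""
    hits = 0
    for _ in range(n):
        xi = random.gauss(0.0, 1.0)   # hypothetical random factor
        hits += 1 if psi(xi) >= a_j else 0
    return hits / n

# example: psi(xi) = xi**2 with threshold a_j = 4 (a Gaussian tail event)
p_hat = estimate_chance_constraint(lambda x: x * x, 4.0)
print(p_hat)  # compare against a tolerance level eps_j, e.g. 0.05
```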
1.4 Randomized strategies and equivalent smoothed problems

The link between the standard nonlinear programming and smoothed stochastic optimization problems will be stated. For all x ∈ X, the nonconvex set Φ of the values of the functions φ_j(x), (j = 0, ..., m) is defined as follows:

Φ = {v ∈ R^{m+1} : v_j = φ_j(x) (j = 0, ..., m), x ∈ X}    (1.24)
A typical nonconvex set is depicted in Figure 1.1. In this Figure we have considered only one constraint. The coordinates of the point A are respectively the value of the cost function and the value of the considered constraint for a given value of x.
FIGURE 1.1. The set of possible values of φ_j(x), (j = 0, ..., m).

For the standard nonlinear programming problem, the functions are assumed to be convex. The corresponding set Φ of all the possible values (φ_j(x)) in the (m+1)-dimensional space is convex too and, as a consequence, many classical optimization techniques [16], including the stochastic approximation methods [13]-[14]-[15], can be used. For nonconvex functions φ_j(x), (j = 0, ..., m), the set Φ is nonconvex too. To use the standard optimization techniques for solving this problem, which is naturally multimodal, it is necessary to solve several locally convex problems and to select the optimum among the obtained solutions. Now, let us consider the stochastic nonlinear problem (1.16)-(1.17) where at least
• one of the functions φ_j(x) (j = 0, ..., m) is nonconvex
• one of the functions φ_j(x) (or all of them) is nonsmooth, i.e., we must use only observations of zero order (1.10).
To solve such a nonlinear stochastic programming problem, which at first glance seems to be very complex, we will use the approach presented in [17]-[18]-[19]. Let us introduce the distribution function F_x(z) of the vector x = x(ω) ∈ X. It is assumed to be given on the same probability space (Ω, F, P). Consider the following constrained optimization problem:
Φ_0(p) := ∫_{x∈X} φ_0(x)p(x)dx → inf_{p(x)}    (1.25)

Φ_j(p) := ∫_{x∈X} φ_j(x)p(x)dx ≤ 0 (j = 1, ..., m)    (1.26)

p(x) ≥ 0 ∀x ∈ X ⊆ R^M    (1.27)

∫_X ··· ∫ p(v)dv = 1    (1.28), (1.29)
We will consider not only absolutely continuous distribution functions satisfying (1.27) but also other distribution functions like the δ-functions. The δ-function operates as follows: for p(z) = δ(z - z_0),

∫_{-∞}^{∞} ··· ∫_{-∞}^{∞} φ(x)δ(x - x_0)dx = φ(x_0)    (1.30)
for any function φ(x), x ∈ R^M. Notice that, independently of the convexity properties of the initially defined functions φ_j(x) (j = 0, ..., m), the smooth problem (1.25)-(1.26) subject to the probability measure constraints (1.28)-(1.29) is always a linear programming problem in the space of distributions p(x) (including δ-functions (1.30)). The relation between the standard nonlinear programming problem (1.16)-(1.17) and the corresponding smooth problem (1.25)-(1.29) will be stated by the following theorem.

Theorem 1. If

1. the initial standard stochastic nonlinear programming problem (1.16)-(1.17) has a solution x* (not necessarily unique), i.e., there exists a point x* ∈ X such that

φ_0(x*) ≤ φ_0(x) ∀x ∈ X    (1.31)

φ_j(x*) ≤ 0, (j = 1, ..., m)    (1.32)

2. the constraints (1.17) of the initial problem (1.16) satisfy Slater's condition for x = x̄ ∈ X

3. the set X is compact (closed and bounded)

4. φ_j(x) (j = 1, ..., m) are continuous functions on X

then the corresponding smooth problem (1.25)-(1.29) has a solution too, i.e.,

Φ_0(p*) ≤ Φ_0(p) ∀p ∈ S_X    (1.33)

Φ_j(p*) ≤ 0, (j = 1, ..., m)    (1.34)

p, p* ∈ S_X := {p(x) : p(x) ≥ 0 ∀x ∈ X ⊆ R^M, ∫_X p(x)dx = 1}

and the minimal values φ_0(x*) and Φ_0(p*) of the cost functions φ_0(x) (1.31) and Φ_0(p) (1.33) are connected by the following relation:

φ_0(x*) - Φ_0(p*) = (1 - z_2*)φ_0(x*) - (1 - z_1*)φ_0(x̄) - (z_2* - z_1*)φ_0(x**)    (1.35)
where

x** = arg min_{x ∈ X\X_0} φ_0(x)    (1.36)

X_0 := {x ∈ X : φ_j(x) ≤ 0 (j = 1, ..., m)}    (1.37)
and (z_1*, z_2*) are the solution of the linear programming problem

g_0(z_1, z_2) := φ_0(x̄) + z_1[φ_0(x*) - φ_0(x̄)] + z_2[φ_0(x*) - φ_0(x**)] → inf_{z_1, z_2}

g_j(z_1, z_2) := φ_j(x̄) + z_1[φ_j(x*) - φ_j(x̄)] + z_2[φ_j(x*) - φ_j(x**)] ≤ 0, z_1, z_2 ≥ 0    (1.38)

hence,

Φ_j(Σ_{s=1}^{m+2} α_s δ(x - x_s)) = Σ_{s=1}^{m+2} α_s φ_j(x_s)

and, taking into account that the linear combination Σ_{s=1}^{m+2} α_s δ(x - x_s) of δ-functions with α ∈ S^{m+2} is a distribution, we derive that

Φ ⊆ conv Φ ⊆ Φ̃    (1.55)
Let us consider any p(x) ∈ S_X. We have

Φ_j(p) = ∫_{x∈X} φ_j(x)p(x)dx    (1.56)

Let us prove that any p(x) corresponding to (1.56) can be expressed in the following form:

p(x) = Σ_{s=1}^{m+2} α_s δ(x - x_s), x, x_s ∈ X, α ∈ S^{m+2}    (1.57)
where the points {x_s} satisfy the assumption of this theorem. To prove (1.57), let us again use the Caratheodory theorem [6]. According to (1.52) the set Φ is convex; then for any fixed distribution p the corresponding vector

Φ̄(p) := (Φ_0(p), ..., Φ_m(p)) ∈ R^{m+1}

can be expressed as follows:

Φ̄(p) = Σ_{s=1}^{m+2} α_s Φ̄(p_s)    (1.58)
where p_s is a distribution function. Because of the continuity of the functions φ_j(x) there exist points x_s such that Φ_j(p_s) = φ_j(x_s) (j = 0, ..., m) and, taking into account (1.56), we conclude that

Φ_j(p_s) = φ_j(x_s) = ∫_{x∈X} φ_j(x)δ(x - x_s)dx

and hence, from (1.58), we finally derive that

Φ̄(p) = Σ_{s=1}^{m+2} α_s Φ̄(p_s) = Σ_{s=1}^{m+2} α_s ∫_{x∈X} φ̄(x)δ(x - x_s)dx = Σ_{s=1}^{m+2} α_s φ̄(x_s)

where φ̄(x) := (φ_0(x), ..., φ_m(x)) ∈ R^{m+1}. So, the representation (1.57) holds. It means that

Φ̃ ⊆ conv Φ    (1.59)
Combining (1.55) and (1.59) we finally obtain

Φ̃ = conv Φ

Theorem is proved. •
Remark 4. The statement (1.59) can also be proved by using Kall's theorem [20] (see Appendix A).
Indeed, to prove the correctness of assertion (1.57), let us first state the similar representation for (m+3) points (a linear combination of a nonminimal number of points), i.e.,

p(x) = Σ_{s=1}^{m+3} α_s δ(x - x_s), x, x_s ∈ X, α ∈ S^{m+3}    (1.60)

We must prove that such α ∈ S^{m+3} and x_s ∈ X exist. Substituting (1.60) into (1.56), we obtain

Φ_j(p) = Σ_{s=1}^{m+3} α_s φ_j(x_s)    (1.61)

and, eliminating α_{m+3} (α_{m+3} = 1 - Σ_{s=1}^{m+2} α_s) from (1.61), we obtain

Φ_j(p) - φ_j(x_{m+3}) = Σ_{s=1}^{m+2} α_s [φ_j(x_s) - φ_j(x_{m+3})]    (1.62)
According to the continuity property of the functions φ_j(x) (j = 0, ..., m) we can find points {x_s}_{s=1,...,m+1} such that det A_{m+1,m+1} ≠ 0, where A_{m+1,m+1} is the submatrix containing the first (m+1) columns of the matrix A_{m+1,m+2} = ||a_{js}||, a_{js} := φ_j(x_s) - φ_j(x_{m+3}), j = 0, ..., m; s = 1, ..., m+2. Then, we can rewrite the last relation (1.62) in the following matrix form:

b = A_{m+1,m+2} · α    (1.63)

where

b^T = (b_0, ..., b_m), b_j := Φ_j(p) - φ_j(x_{m+3})

A_{m+1,m+2} = [a_{js}], a_{js} := φ_j(x_s) - φ_j(x_{m+3}) (j = 0, ..., m; s = 1, ..., m+2)

α^T = (α_1, ..., α_{m+2}), α_s ≥ 0

Applying Kall's theorem [20] (see Appendix A), for the existence of the nonnegative vector α and the points x_s (s = 1, ..., m+1) satisfying the linear equation (1.63), it is necessary and sufficient that there exist μ ≥ 0 and λ_l ≤ 0 (l = 1, ..., m+1) such that

μ b = Σ_{l=1}^{m+1} λ_l A_l
where A_l is the l-th column of the matrix A_{m+1,m+2}. This relation can always be verified if the points x_s (s = 1, ..., m+3) are selected according to the following rule:

sign(φ_j(x_l) - φ_j(x_{m+3})) = -sign(φ_j(x_{m+2}) - φ_j(x_{m+3})) ∀l = 1, ..., m+1

According to the Caratheodory theorem, any linear combination of (m+3) points can be represented as a linear combination of only (m+2) points (1.61). Hence the formulation (1.63) takes place. The next important statement follows directly from this theorem.
Corollary 2. For the class of continuous functions, the stochastic programming problem (1.25)-(1.26) is equivalent to the following nonlinear programming problem:

Σ_{s=1}^{m+2} p(s)φ_0(x_s) → inf_{p∈S^{m+2}, x_s∈X}    (1.64)

Σ_{s=1}^{m+2} p(s)φ_j(x_s) ≤ 0 (j = 1, ..., m)
Let us denote the corresponding minimal point by p_s**, x_s** (s = 1, ..., m+2). This point may not be unique. The most important consequence of this theorem is the following one: the optimization problem is entirely characterized by
• a set of (m+2) vectors x_k ∈ X
• (m+2) values of the probability vector p_k ≥ 0, Σ_{k=1}^{m+2} p_k = 1, (k = 1, ..., m+2)
and can be reformulated as follows:

inf_{z∈Z} h_0(z)    (1.65)

subject to

h_j(z) ≤ 0, (j = 1, ..., m)    (1.66)

z = {x_0, ..., x_{m+1}, p_0, ..., p_{m+1}}    (1.67)

where z ∈ R^s, s = (m+2)(n+1),

x_k ∈ X, p_k ≥ 0, Σ_{k=1}^{m+2} p_k = 1

and

h_j(z) = Σ_{k=1}^{m+2} φ_j(x_k)p_k, (j = 0, ..., m)    (1.68)
The next theorem shows that this nonlinear programming problem (1.64), given on the compact set X, in the case of Lipschitzian functions φ_j(x) (j = 0, ..., m) can be approximated by a corresponding linear stochastic problem on a finite set.
Theorem 4. If

1. X is a compact set of diameter D which can be partitioned into a number of nonempty subsets X_k (k = 1, ..., N) having no intersection, i.e.,

X = ∪_{k=1}^N X_k, X_i ∩ X_j = ∅ (i ≠ j)    (1.69)

2. the functions φ_j(x) (j = 0, ..., m) are Lipschitzian on each subset X_k, i.e.,

|φ_j(x′) - φ_j(x″)| ≤ L_j^k ||x′ - x″||, ∀x′, x″ ∈ X_k    (1.70)

then there exist fixed points {x_k*} (x_k* ∈ X_k, k = 1, ..., N) and a large enough positive integer N such that the discrete distribution

p_ε*(x) = Σ_{k=1}^N p*(k)δ(x - x_k*)

with the vector p* ∈ S^N which is a solution of the linear stochastic programming problem

Σ_{k=1}^N p(k)φ_0(x_k*) → inf_{p∈S^N}    (1.71)

Σ_{k=1}^N p(k)φ_j(x_k*) ≤ 0 (j = 1, ..., m)    (1.72)

satisfies the constraints (1.72) with ε-accuracy, i.e.,

Φ_j(p_ε*) ≤ ε_j (j = 1, ..., m)    (1.73)

and the corresponding loss function Φ_0(p_ε*) will deviate from the minimal value Φ_0(p**) of the initial programming problem (1.64) by not more than ε, i.e.,

|Σ_{k=1}^N p*(k)φ_0(x_k*) - Φ_0(p**)| ≤ ε    (1.74)

where

ε_j := (D/N) max_{k=1,...,N} L_j^k (j = 0, ..., m)    (1.75)
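For a concrete feel for the finite linear problem (1.71)-(1.72), here is a minimal sketch using scipy's linprog; the grid, the functions phi0 and phi1, and all numerical values are invented for illustration and are not taken from the book:

```python
import numpy as np
from scipy.optimize import linprog

# hypothetical grid points x_k and function values phi_j(x_k)
x = np.linspace(0.0, 1.0, 11)
phi0 = (x - 0.3) ** 2          # objective values phi_0(x_k)
phi1 = 0.5 - x                 # one constraint: mean of phi_1 must be <= 0

# minimize sum_k p(k) phi0(x_k)  s.t.  sum_k p(k) phi1(x_k) <= 0,
# sum_k p(k) = 1, p(k) >= 0  -- the problem (1.71)-(1.72)
res = linprog(c=phi0,
              A_ub=phi1.reshape(1, -1), b_ub=[0.0],
              A_eq=np.ones((1, x.size)), b_eq=[1.0],
              bounds=[(0.0, None)] * x.size)
print(res.x)   # optimal discrete distribution p*
```

Consistent with the Caratheodory argument above, the LP optimum is attained at a vertex, so the optimal p* is supported on very few grid points.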
Proof. Let p**, x_s** (s = 1, ..., m+2) be the optimal point (solution) of the problem (1.71), (1.72), and let the partition of the compact X be realized such that each subset X_k contains not more than one of the extremal points x_s**. Hence, for any (j = 0, ..., m) we obtain

Σ_{k=1}^N p(k)φ_j(x_k*) = Σ_{k=1}^N p(k)[φ_j(x_k*) - Σ_{s=1}^{m+2} φ_j(x_s**)χ(x_s** ∈ X_k)] + Σ_{k=1}^N p(k) Σ_{s=1}^{m+2} φ_j(x_s**)χ(x_s** ∈ X_k)

Let us denote by x_{s_k} the optimal point x_s** ∈ X_k. Then, for any p ∈ S^N,

Σ_{k=1}^N p(k)φ_j(x_k*) = Σ_{k=1}^N p(k)[φ_j(x_k*) - φ_j(x_{s_k})] + Σ_{k=1}^N p(k)φ_j(x_{s_k})    (1.76)

For (j = 1, ..., m) and for any p ∈ S^N satisfying the constraints (1.72), we have

|Σ_{k=1}^N p(k)φ_j(x_k*) - Σ_{k=1}^N p(k)φ_j(x_{s_k})| ≤ Σ_{k=1}^N p(k)|φ_j(x_k*) - φ_j(x_{s_k})| ≤ (D/N) max_{k=1,...,N} L_j^k = ε_j
N k=l N
2) is the subset F ~ = { P : X E D,~, p(i) > 0} of one of the hyperplane Dn defined as follows N
On=pn:Epn(i)=l,
pn(i)>_O,
i ..... 1,..., N
i=1
The projection of p~ is defined as follows:
II(pn) = p ~ :II P,~ - P,~ I1....
P,~ - Y II
(2.9)
It is obvious that p~ E F~k for a certain k. Note that finding p~ = H(pn) is equivalent to finding the point on the simplex S which is closest to the projection pn(Dn) of the point Pn onto
D~. From the definition (2.9), it follows that minll y - p~ I]=m!nll (y - pn(D,~)) + (B,~(D,~) - p~) ]]= yES
yE
2.6. Normalization and projection
37
=[[ (p,~(D,J - p ~ ) [[2 + [[ (y-pn(D,~)) [[2 The following lemma gives the tool for calculating the projection II(pn) of
Pn. L e m m a 1. The face F N-1 closest to the point p,~(D,~) has an orthogonal vector
aN , = (1, 1, ..., 1, 0,, t ..., i)
(2.10)
~7 J
The irwIex j corresponds to the smallest component of the vector
(point) p,~(D,), i.e., (2.11)
Proof. The distance V between a given point z = (zl, z2, ...) E Dn and the vertex
is equal to N
V~(z, r~)
= (zk - 1) 5 + ~
N
(z~) 2 = ~
i~k
(z,) 2 + (1 - 2z~)
i=1
Then,
V ~ ( z , r ~ ) - V 2 ( z , r ~ ) = 2 (zL- zk) i.e., The most distant vertex corresponds to the smallest component
p,~(k) : ::mzin pn(l) Consequently, the face FN- 1 which lies opposite to the vertex and is closest to the point z has an orthogonal vector aN-1 (2.10). [] The projection of Pn(Drn) onto the hyperplane p,~(Dm) (1 min --p6S N
E
N
c(i)p(i)
> min --p6S N
i=l
E i=1
min
i=l,...,N
N
---- c(&) min E P ( i ) = c(c~) p 6 S N i=1
and
N
=
e(i)p*(i) i=1
c(i)p(i) =
3.3. Learning Automata Design
49
To prove the first equality in (3.16), let us rewrite q)n (3.4) in the following form: N
¢2n = E fn(i)~,~(i)
(3.18)
i=1
where the sequences {fn(i)} and {~n(i)} are defined as follows
f,~(i) E
1 '~ := - E n
X(ut = u(i))
t=l
~(u(i)'w)X(Ut=u(i))
*=~ ~
~X(ut=u(i))>O
if
0
if ~
x(ut = u(i)) = o
t=l
Taking into account assumption (H1), and according to lemma A.12 [14], for almost all co e / 3 ~ : =
co :
~(ut
= u(i))
..... o o
t=l
we have lim ~ ( i ) = T~
OC
c(i)
For almost all co ~ Bi we evidently have timsup l~r~(i)l < oo and, hence, lim A ( i ) = 0 But the vector fn which components are fn(i), belongs to the simplex S/v, and as a result any partial limit • of a the sequence {On} may be expressed, with probability 1, in the following form: N
• = Ep(i)c(~),
p c s N
i=t
where p is a partial limit of the sequence {fn}. Hence, N
a.$.
t i m s u p On > l i m i n f On > min n ~
n~o~
-- PESN
Ep(i)e(i ) i=1
and this inequality turn to an equality if
p,~(i) :=
P
{un =
u(i)I~-,~-, } =
p*(i),
(i = 1, ...,N)
50
3.3. Learning Automata Design
Theorem is proved. • The next theorem is fundamental. It shows that if a nonstationary strategy {Pn} which converges to the optimal strategy p* is used, then the sequence {¢n} of loss functions reaches its minimal value c(c0. T h e o r e m 2, If under assumptions (H1), (H2) a reinforcement scheme generates a sequence {Pn} , which converges with probability one to the optimal strategy p* (3.17), then, the sequences {¢n} of toss functions reaches its minimal value c(~) with probability one, i.e. if Ilp,~ - p*ll = oo~(1) ~+7,--+ ~ OQ o
then,
lim On ~ " c(c~)
(3.19)
rb~oo
Proof.
In view of the strong law of large numbers [21]-[22] (assumptions under consideration), we have 1E~ ~
1
n
n
t=l
E {~tlgr,~_~} ~:2~ 0 n~oo
t=l
But 1 E n
E {~t [.T',~_,}
=
t=l
- Z n
1
=
=
n
i
- ~c(i)p'(i) n
=
~(i)p~(i)
t=l
n
t=l
c@+o~(i)
n
+ - ~c(i) ~
[p,~(i) - p * ( i ) l =
t=l
c(~)
Theorem is proved. • C o r o l l a r y 1. If there exists a continuous function W(p,~), which on the trajectories {Pn} possesses the following properties: 1. W(p) >_ W(p*) = 0 Vp E S iv 2. w ( p ~ )
~
o
then, this trajectory {Pn} is asymptotically optimal, i.e., p,~ a-. 8-, ~ p Tb~
O0
$
3.4. Nonprojectional Reinforcement schemes
51
Proof.
follows immediately from the property of continuity and the considered assumptions. • This scalar function W ( p ) is called L y a p u n o v f u n c t i o r ~ It will be used for the convergence analysis (some interesting properties) of different reinforcement schemes realizing an asymptotically optimal behaviour of the considered class of Learning Stochastic Automata. Notice that the main obstacle to using the Lyapunov approach is finding a suitable Lyapunov function.
3.4
Nonprojectional
Reinforcement
schemes
Let us now consider some nonprojectional learning algorithms (reinforcement schemes) of the type (3.6), and the following Lyapunov function: p(a) p(c~)
1 -
w~ .- - -
(3.20)
It is obvious t h a t this function satisfies the first condition of the previous corollary. We would like to check the validity of the second condition for several commonly used reinforcement schemes applied to nonbinary environment responses, and to show then how to use them for optimization purposes.
3.4.1
B U S H - M O S T E L L E R REINFORCEMENT SCHEME
The Bush-Mosteller scheme [18]-[14] is described by the following recurrent equation:
P~+I = P~ + ~ [e(u~) -- p n ÷ ~,~(e N - N e ( u , ~ ) ) / ( N - 1)]
(3.21)
where
(i) ~(n) e ~ = [0,1], (ii) e g ---- (1, ..., 1) T • R N , (iii) e ( u ( n ) ) = (0, 0, ..., O, 1, O, ..., O) T • R N (the i th component of this vector is equal to 1 if u ( n ) = xi and the other components are equal to zero),
(iv) ~(n) • (0,1).
52
3.4. Nonprojectional Reinforcement schemes
The key to being able to analyse the behaviour of a learning a u t o m a t o n using the Bush-Mosteller reinforcement scheme is the following theorem.
Theorem
3.
If
1. the environment response sequence { ~ } satisfies assumptions (H1),(H2) 2. the optimal action u,~ = u(~) is unique, i.e., rain c(i) > c(a) = 0
3. the correction factor is selected as follows: 3'~--
nA-a
, 3' E (O, 1 ) , a > 3'
then, the Lyapunov function (3.20) possesses the property W(p~) ~ " 0 and, hence, the reinforcement scheme (3.21) generates asymptotically an optimal sequence {Pn}, such that p ~ - ~a . 8p. Proof. Let us consider the following estimation:
p,~+l(i) = p,~(i) + %~[X(U,~ = u(i)) - pn(i)+ +~n [1 -- N x ( u n = u ( i ) ) ] / ( N -- 1)] = pn(i) (1 -- ~ ) +
+~--1
{ ~ + X(u,~ = u(i)) IN(1 - ~n) " 1]} >
(3.22)
n
>_p . ( i ) (1 - ~.) >_... > pl (i) 1 ] ( 1 - ~ ) > 0 t=l
l~¥om l e m m a A.4 [14], it follows t h a t
H'~o - ~ ) _ > t=l
{ ~.\o/ (a-:-x'~ "y ~-~ ,
, a>~
(0, 1)
"y= i, a > O
(3.23)
3.4. Nonprojectional Reinforcement schemes
53
The reinforcement scheme (3.21) leads to E { 1 =pn+l(°~) P~--~I (--~) /.T'~}..s.=
"{'
°~ ]~E -~
-i=1
1/.r,,_,
Au,, = u(i
,
-
E{P'~+I(a)/~'£--IAu'~ =u(i)}
1
)
;.(i)° ~ p,~(i)+s~
(3.24)
where
s~:~E =
p~
+~(a)
1
}
E{p.~+,(,~)/~-.._, n~. = ~,(i)}/~"-' A~ ="(i) ;..(~)
(3.25)
Let us now estimate the term sn (3.25): N 8n : E E { i=1
rn/~n-I
A~-_-~(i)} p~(i)--
N
EE{Y./~._,A.,, =.(~)};.(~)+ (3.2~) i~#a -kS{ E{pn+l(OL)/~-'~)-7-~..'~__ '~n-l-p'~A+l'/$'n ~'~= "//,(C~)}iX ~i-zU-'-~" pn+l~ (Ot)/-~"n--1A'n =='U,(OL)}X =
xp~(a) where y . = E{pn+l(a)/~',~-i Au~ = u(i)} - p n + l ( a ) p~+l (a)E {p,~+l(a)/:F,~_l A u~ = u(i)} Taking into account that P n + l (Ce) : pn(c~) + "Tn[Si,a -- p n ( i ) -[- ~n [1 -- N S i , a ] / ( N - 1)] --
= f
L we derive
p,~(a)(1 - ~.,) + %~ if u(i) = u(a) (~,~ '~'~" O) p,,(a)(1 - %~) +'7,~,~/(N - 1) if u(i) ¢ u(a)
54
3.4. Nonprojectional Reinforcement schemes
and from (3.26) we obtain N
8n = E E
"" {
= ~(~)},:',,(':)"-'
~(i) - ~,,,
=.:,.,,, Z E
,,-
{ gn/'~n-1A'-
}
/-~o-, A ~ =~'(~1
P'(i)
,,,..~. =
E{.., r
where
N-1 A,~ =: p,~(a)(1 - "y,~) and
B~ = (c(i) ~,,) -
Using then Cauchy-Bounyakovskii inequality we get ~f2C
1
N
S n ~ (/~--_1)2 E
2,
"y~C
,
[pn(a)(1--'Tn)+
1
N ~] Ep,~(i )=
1
--< ( / Y - - l ) 2 p2n(o0(1 - q'n) 2 [pn(a)(1 - q'n) +
"7~C
,,:#.
W,,
(N- 1)2 p~(~)(1_ ~)2 [p~(.)(1-~) + ~ ] where
c :::m~x ,,o V/E {(~(') - ~,,)~/-~,,-, A-° = ~,(,)} =m~", ,~,,,
o~ := E {(4i)
- ~.) ./.r,._, A ~ = ~(i)} c- :=max
c(i)
(3.27)
3.4. Nonprojectional Reinforcement schemes
55
Using the lower estimations (3.22) and (3.23) in (3.27), we finally obtain
sn 1
Corollary is proved. •

The next subsection deals with the analysis of the Shapiro-Narendra reinforcement scheme using the normalization procedure.
3.6.2 SHAPIRO-NARENDRA REINFORCEMENT SCHEME WITH NORMALIZATION PROCEDURE

The Shapiro-Narendra scheme [19]-[14] is described by:

p_{n+1} = p_n + γ_n(1 - ξ_n)[e(u_n) - p_n], ξ_n ∈ [0, 1]    (3.86)

We shall analyse its behaviour when the environment responses ξ_n are constructed on the basis of the realizations of the function to be optimized (Figure 3.2) and the normalization procedure (Figure 3.1).
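Here is a hedged sketch of the update (3.86) driven by realizations of the function to be optimized; the running min-max scaling used below is only a stand-in for the normalization procedure of Figure 3.1, and the objective f is invented for the example:

```python
import random

def shapiro_narendra_step(p, gamma, xi, chosen):
    """One update of (3.86): p <- p + gamma (1 - xi) [e(u_n) - p]."""
    step = gamma * (1.0 - xi)
    return [pj + step * ((1.0 if j == chosen else 0.0) - pj)
            for j, pj in enumerate(p)]

def f(x):                                 # hypothetical noisy objective
    return (x - 2.0) ** 2 + 0.5 * random.gauss(0.0, 1.0)

actions = [0.0, 1.0, 2.0, 3.0, 4.0]       # discretized search points x(i)
p = [1.0 / len(actions)] * len(actions)
lo, hi = float("inf"), float("-inf")
for n in range(1, 5000):
    i = random.choices(range(len(actions)), weights=p)[0]
    y = f(actions[i])
    lo, hi = min(lo, y), max(hi, y)
    xi = 0.5 if hi == lo else (y - lo) / (hi - lo)   # normalized response
    p = shapiro_narendra_step(p, gamma=0.9 / (n + 1), xi=xi, chosen=i)
print(actions[max(range(len(p)), key=p.__getitem__)])  # likely 2.0
```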
Theorem 7. For the Shapiro-Narendra scheme [19] (3.86), assume that the optimal point x(α) is single and suppose that assumptions (H3), (H4) hold. In addition, suppose that the correction factor satisfies the following conditions:

1. Σ_{n=1}^∞ γ_n = ∞, Σ_{n=1}^∞ γ_n² Π_{t=1}^n (1 - γ_t)^{-1} < ∞    (3.87)

2. lim_{n→∞} γ_n Π_{t=1}^n (1 - γ_t)^{-1} := c < p_1(α)Δ*    (3.88)

3. Π_{t=1}^n (1 - γ_t) ≥ O(n^{-τ}), τ ∈ (0, 1/2)    (3.89)

Then this reinforcement scheme selects asymptotically the optimal point x(α), i.e.,

p_n(α) → 1 a.s. as n → ∞

and the loss function Φ_n tends to its minimal possible value f(x(α)), with probability one.

Proof. The proof is similar in structure to the proof of the previous theorem. Let us estimate the lower bounds of the probabilities p_{n+1}(i):
p_{n+1}(i) = p_n(i) + γ_n(1 - ξ_n)[χ(u_n = u(i)) - p_n(i)] =
= p_n(i)[1 - γ_n(1 - ξ_n)] + γ_n(1 - ξ_n)χ(u_n = u(i)) ≥    (3.90)
≥ p_n(i)(1 - γ_n) ≥ ... ≥ p_1(i) Π_{t=1}^n (1 - γ_t)

From (3.87) and (3.90) it follows that

Σ_{n=1}^∞ p_n(i) = ∞, ∀i = 1, ..., N

As a result, according to the Borel-Cantelli lemma [22], we obtain

Σ_{n=1}^∞ χ(u_n = u(i)) = ∞, ∀i = 1, ..., N    (3.91)

Let us consider again the following Lyapunov function:

W_n := (1 - p_n(α))/p_n(α)

Notice that in view of assumption (3.89), the relations (3.72) and (3.75) are fulfilled, i.e.,

ξ̄_n(i) - f(x(i)) = o_ω(n^{-(1/2-τ)}) a.s.    (3.92)

Taking into account the Shapiro-Narendra scheme (3.86) and (3.92), it follows:
evaluating the conditional expectation of W_{n+1} term by term, one obtains, for n ≥ n_0(ω), n_0(ω) < ∞,

E{W_{n+1} | F_n} ≤ W_n [1 - γ_n(Δ* + o(1))] + O(γ_n² Π_{t=1}^n (1 - γ_t)^{-1}) a.s.    (3.94)

Inequality (3.90) gives

γ_n p_n^{-1}(α) ≤ γ_n p_1^{-1}(α) Π_{t=1}^n (1 - γ_t)^{-1}

so that, taking into account the conditions (3.87)-(3.89), and in view of the Robbins-Siegmund theorem [23], we obtain W_n → 0 a.s. and p_n(α) → 1 a.s. Theorem is proved. •
Corollary. If we select γ_n = γ/(n + a), γ ∈ (0, 1], a ≥ γ, then assumptions (3.87), (3.88) and (3.89) will be satisfied if

lim_{n→∞} γ_n Π_{t=1}^n (1 - γ_t)^{-1} = c = 0, τ = γ

and:

1. the rate of optimization will be given by n^ν W_n → 0 a.s., where

ν < min{1 - 3γ, γΔ*} := ν(γ) < ν*    (3.95)

2. the maximum optimization rate ν* is equal to

max_γ ν(γ) = ν* = Δ*/(Δ* + 3)    (3.96)

and is reached for

γ* = 1/(Δ* + 3)    (3.97)

Proof follows directly from inequality (3.94), which can be rewritten as follows:

E{W_{n+1} | F_n} ≤ W_n [1 - (γ/n)(Δ* + o(1))] + O(1/n^{2-3γ}) a.s.    (3.98)

Applying then lemma A.14 in [14] for

u_n := W_n, α_n := (γ/n)(Δ* + o(1)), β_n := O(1/n^{2-3γ}), w_n := n^ν

we obtain the desired result. Corollary is proved. •

These statements show that a learning automaton using the Shapiro-Narendra reinforcement scheme with the normalization procedure described above has an asymptotically optimal behaviour, and the global optimization objective is achieved with ε-accuracy. The Varshavskii-Vorontsova reinforcement scheme will be considered in the following.
3.6.3 VARSHAVSKII-VORONTSOVA REINFORCEMENT SCHEME WITH NORMALIZATION PROCEDURE

The Varshavskii-Vorontsova scheme [20]-[14] is described by:

p_{n+1} = p_n + γ_n p_n^T e(u_n)(1 - 2ξ_n)[e(u_n) - p_n]    (3.99)
This scheme belongs to the class of nonlinear reinforcement schemes. We shall analyse its ability to solve stochastic optimization problems on discrete sets. The following theorem covers all the specific properties of the Varshavskii-Vorontsova scheme.
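For illustration, here is a minimal sketch of one update of (3.99); note that the nonlinear factor p_n^T e(u_n) equals p_n(u_n), the probability of the action just chosen (all numerical values below are invented):

```python
def varshavskii_vorontsova_step(p, gamma, xi, chosen):
    """One update of (3.99):
    p <- p + gamma * p[chosen] * (1 - 2 xi) * (e(u_n) - p)."""
    step = gamma * p[chosen] * (1.0 - 2.0 * xi)
    return [pj + step * ((1.0 if j == chosen else 0.0) - pj)
            for j, pj in enumerate(p)]

# one illustrative update: a favourable response (xi < 1/2) reinforces u_n,
# an unfavourable one (xi > 1/2) penalizes it
p = varshavskii_vorontsova_step([0.25, 0.25, 0.25, 0.25],
                                gamma=0.1, xi=0.2, chosen=2)
print(p)   # probability of action 2 increases, the others shrink
```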
Theorem 8. For the Varshavskii-Vorontsova scheme (3.99), assume that the optimal point x(α) is single and suppose that assumptions (H3), (H4) hold. In addition, suppose that the correction factor satisfies the following conditions:

1. Σ_{n=1}^∞ γ_n = ∞, Σ_{n=1}^∞ γ_n² Π_{t=1}^n (1 - γ_t)^{-1} < ∞    (3.100)

2. b := (Σ_{i≠α} [1 - 2Δ(i)]^{-1})^{-1} < 0    (3.101)

3. Π_{s=1}^n (1 - γ_s) ≥ O(n^{-τ}), τ ∈ (0, 1/2)    (3.102)

Then the optimal point x(α) is asymptotically selected, i.e.,

p_n(α) → 1 a.s.

and the loss function Φ_n tends to its minimal possible value f(x(α)), with probability one.

Proof. Let us estimate the lower bounds of the probabilities p_{n+1}(i):

p_{n+1}(i) = p_n(i) + γ_n p_n^T e(u_n)(1 - 2ξ_n)[χ(u_n = u(i)) - p_n(i)] ≥
≥ p_n(i)[1 - γ_n p_n^T e(u_n)(1 - 2ξ_n)] ≥ p_n(i)[1 - γ_n p_n^T e(u_n)] ≥    (3.103)
≥ p_n(i)(1 - γ_n) ≥ ... ≥ p_1(i) Π_{t=1}^n (1 - γ_t)
From (3.100) and (3.103) it follows that

Σ_{n=1}^∞ p_n(i) = ∞, ∀i = 1, ..., N

As a result, and according to the Borel-Cantelli lemma [22], we obtain

Σ_{n=1}^∞ χ(u_n = u(i)) = ∞, ∀i = 1, ..., N    (3.104)

Let us consider again the Lyapunov function

W_n := (1 - p_n(α))/p_n(α)

Notice that in view of assumption (3.102), relations (3.72) and (3.75) are fulfilled, i.e.,

ξ̄_n(i) - f(x(i)) = o_ω(n^{-(1/2-2τ)}) a.s., E{ξ_n | F_{n-1} ∧ u_n = u(i)} = Δ(i) + o(n^{-(1/2-2τ)})    (3.105)
un = u ( i
)} =
1 ) = z~(i) + o (~-3-7~-2~
(3.105)
Taking into account the Varshavskii-Vorontsova scheme (3.99) and (3.105), it follows: N
E{WnTl/J~n } a.s. E E {Wn+i/jTn_ 1A u n __--I~(i)} pn(i)= i=l
N { 1 - 1/.T',~_,Au,~ = u(i)} p,~(i)+ i#a pn(c~)[1-'Tnpn(i))(1- 2&)]
=EE
+E
1 p , ( a ) + ? ~ p , ~ ( a ) ) ( 1 - 2#~) (1 - p ~ ( a ) )
-
-
l/:1zn-1A un = u(i) } x
xp,,(~) = =~E
{IE ~ l+~,~p~(i))(l-2g~)+O(~X)-1/ 1
-E {.,,.(1- ~.,)(1- p.,(,~//+ o(~//.~,._, A"o = ,~(,/} = .
p--~
~ nl--2-r j -~-
-'3'n [1 -- 2A(o~)l (1 - p , j a ) )
-
-
+ o( , n l"~' _ 2~. r , -]- O ( ' ~ 2 ) ~--
88
3.6. Reinforcement Schemes with Normalized
Environment response
N
pn(a) (1 + ?,~)] +
= W,~ [1 -
p,~(z)) [1 - 2A(i)]
Otnl...27yp~(o~) + O(',/~p,~ (a)) =
(3.106)
a.8.
for n >_ no(w), no(w) < cx) . Let us now m a x i m i z e the second t e r m in (3.106) with respect to t h e c o m p o n e n t s Pn(i) (i :/: a ) under the constraint N
Zp,~(i)
-- 1 - p,~(a)
(3.107)
i :~ ct
To do t h a t , let us introduce the following variables
xi ::= pn(i),
a~ := ~n [1 -- 2A(i)],
i ~ a
It follows t h a t N
N
Ep~(i))
[1 - 2 A ( i ) ]
aix~ := F(x)
= E
(3.108)
To m a x i m i z e the function F ( x ) under the constraint (3.107), let us int r o d u c e the following Lagrange function:
L(x, A) := F ( x ) - A
xi - (1 - x,~)
T h e o p t i m a l solution ( x ' , A*) satisfies the following o p t i m a l i t y conditions: 0
oxiL(x*,A*) == 2aixi - A*
0
Vi ~ a
N
~AL(x*, A*)= Zx~-
(1-x~)
= 0
i#a
F r o m these o p t i m a l i t y conditions, it can be seen t h a t
*
N =
xi
A* 2ai
~
A* E
(2a~)-I
~¢~
=
1
- -
x~, A* --
1 xa N ~ (2ai) -1 -
Hence F(x)
< F(x*)
-
1 - - Xo~
-
-
E a~ 1 i#a
( N - 1) = ~ ( N - 1) (1 - ~,~)
3.6. Reinforcement Schemes with Normalized Environment response
89
From this inequality we derive ~
N 2 • 7~b (N E p ~ ( ~ ) ) [ 1 - 2A(i)] < = -~
1) (1 - p,~(a))
(3.109)
Ibl (N - 1) W,~
where b :=
< 0
[1 - 2 A ( i ) ] - '
(3.110)
iCa Substituting (3.109) into (3.106), we obtain a.8.
E {W~+I/~}
~ W~ [1 -
p,~(a)
(1 + ~ )
- T~ Ibl ( N - 1)] +
+or ~n ~+o(~p~l(~)) < , nl-2Tpn(ol)" _< W~ [1 -
~ Ibl (g - 1)1 + o ( ~
1-[(1 - ~ ) - ' ) + O(~ rI(1 - ~d -~) t=l
t=l
(3.111) Taking into account the conditions (3.100) and (3.102), and in view of Robbins-Siegmund theorem [23], we obtain W~ ~-~" 0,
p,(a)
~2; 1
Theorem is proved. •
Corollary 9. Condition (3.101) will be satisfied, for example, for the class of environments for which

Δ(i) > 1/2 ∀i ≠ α

This means that the minimal value of the optimized function must be strictly different enough from its values at the other points. The following corollary gives an example of a class of correction factors (gain sequences) satisfying condition (3.101).
Corollary 10. If we select

γ_n = γ/(n + a), γ ∈ (0, 1]

then the rate of optimization can be estimated as follows:

n^ν W_n → 0 a.s.

where the order ν of optimization satisfies the following conditions:
0~
_~ O.5 .o o_
0
I
0
500
1000
, I
1500 S-N
c-.,
_~ 0.5
,o EL
O
O
1 ~
I
I
I
500
1000
1500
'
'
V-V
_~ O.5 2 EL 0 0
500
i 1000 Iteration number
i 1500
FIGURE 3.4. Evolution of the probabilities associated with the optimal action.
the algorithm and by the fact that at each time (iteration) an action is randomly selected on the basis of the environment response and of the probability distribution p_n. The next subsection presents some numerical simulations, from which we can verify the viability of the design and analysis concerning the use of learning automata with changing number of actions, given in the second part of this chapter.
FIGURE 3.5. Evolution of the loss functions.
3.8.2 LEARNING AUTOMATON WITH CHANGING NUMBER OF ACTIONS

At every iteration, the optimization algorithm based on learning automata with changing number of actions performs the following steps (a sketch of the full loop is given after step 5 below):

Step 1. Choice of a subset V(j) on the basis of the probability distribution Q_n. The technique used by the algorithm to select one subset V(j) among W subsets is based on the generation of a normally distributed random variable (any standard machine routine, e.g., RANDU, can be used to generate the random variable).

Step 2. Choice of an action u(i) on the basis of the probability distribution p_n, using the same procedure as in step 1.

Step 3. Compute the S-model environment response.
FIGURE 3.6. Evolution of the probabilities associated with the optimal action.

Step 4. Use of this response to adjust the probability distribution according to the given reinforcement scheme.

Step 5. Return to step 1.
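A minimal sketch of steps 1-5, assuming the simplest possible modelling of the subsets V(j), the subset distribution Q_n and its reinforcement; none of the identifiers or constants below come from the book:

```python
import random

def f(x):                                    # objective to minimize
    return (x - 168.5) ** 2 / 1000.0

actions = [108.5 + 10.0 * k for k in range(10)]   # discretized interval
W = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]            # two action subsets V(j)
Q = [0.5, 0.5]                                    # subset distribution Q_n
p = [0.1] * 10                                    # global action distribution
lo, hi = float("inf"), float("-inf")
for n in range(1, 3000):
    j = random.choices(range(len(W)), weights=Q)[0]        # step 1
    sub = W[j]
    w = [p[i] for i in sub]                                # step 2: p_n
    i = random.choices(sub, weights=w)[0]                  # restricted to V(j)
    y = f(actions[i])                                      # step 3
    lo, hi = min(lo, y), max(hi, y)
    xi = 0.5 if hi == lo else (y - lo) / (hi - lo)         # S-model response
    for k in range(10):                                    # step 4: Bush-
        e = 1.0 if k == i else 0.0                         # Mosteller update
        p[k] += 0.05 * (e - p[k] + xi * (1.0 - 10.0 * e) / 9.0)
    Q[j] += 0.05 * (1.0 - xi) * (1.0 - Q[j])               # reinforce subset
    Q[1 - j] *= 1.0 - 0.05 * (1.0 - xi)                    # (two-subset case)
print(actions[max(range(10), key=p.__getitem__)])          # near x* = 168.5
```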
The interval [108.5, 198.5] was partitioned uniformly into 10 values (it is up to the designer to specify the number of actions of the automaton). The action number can be selected according to the desired accuracy. No prior information was utilized in the initialization of the probabilities. The learning automaton started with a purely uniform distribution over the actions, i.e.,
p_0(i) = 1/N, (i = 1, ..., N)
FIGURE 3.7. Evolution of the loss functions.

The theory of learning systems assumes that very little a priori information is known about the random environment (the function to be optimized). Nevertheless, any available information can be introduced through the initial probability distribution or in the design of the environment response (normalization, etc.). The value of the correction factor γ was determined empirically; the value γ = 0.35 was chosen after some experimentation. Several experiments have been carried out. Figure 3.8 shows the evolution of the probability associated with the optimal state selected by the automaton. This optimal state corresponds to the value of x (x* = 168.5) which minimizes the function f(x) in the interval [108.5, 198.5]. The evolution of the loss function is depicted in Figure 3.9. To test the performance of the optimization algorithm in the case of noisy observations (although data from real systems are usually passed through a linear filter to remove aspects of the data related to disturbances), a zero-mean disturbance was added to the function f(x). The dispersion of this disturbance is equal to 0.02. Figures 3.10 and 3.11 illustrate the performance of this optimization algorithm. Figure 3.10 represents the evolution of the probability associated
FIGURE 3.8. Evolution of the probability associated with the optimal action.

with the optimal action (solution). After approximately 300 iterations, this probability tends to one. The loss function, which tends to zero, is depicted in Figure 3.11. As a result of the effect of the noise (perturbation), the time needed for obtaining the optimal solution increases. The decay of the loss function is clearly revealed in Figure 3.11. The nature of convergence of the optimization algorithm is clearly reflected by Figures 3.4-3.11. However, not only is the convergence of the optimization algorithm important, but the convergence speed is also essential. It depends on the number of operations performed by the algorithm during an iteration as well as the number of iterations needed for convergence. The studied reinforcement schemes behave as expected. They need fewer programming steps and less storage memory. In summary, to verify the results obtained by theoretical analysis, a number of simulations have been carried out with a multimodal function. Notice that the suitable choice of the number of actions N is an important issue and depends on the desired accuracy. In order to overcome the high dimensionality of the number of actions, a hierarchical structure of learning
FIGURE 3.9. Evolution of the loss function.

automata can be considered. In this section we have presented the practical aspects associated with the use of learning automata with fixed and time-varying number of actions for solving unconstrained stochastic optimization problems. The presented simulation results illustrate the performance of this optimization approach, and show that complex structures such as multimodal functions can be learned by learning automata. Other applications were carried out on the basis of the algorithms presented in this chapter. They concern process control and optimization [25], neural network synthesis [6] [26], fuzzy logic processor training [27] and model order selection [24]. In [25] a learning automaton with continuous input has been used to solve an optimization problem related to the static control of a continuous stirred tank fermenter. In this study, the discrete values of the manipulated variable (dilution rate) were associated with the actions of the automaton. Neural network synthesis can be stated as an optimization problem involving numerous local optima and presenting a nonlinear character. The experimental results presented in [6] and [26] show the performance of learning automata when used
FIGURE 3.10. Evolution of the probability associated with the optimal action.

as optimizers. In [27] a novel method to train the rule base of a fuzzy logic processor is presented. This method is based on learning automata. Compared to more traditional gradient-based algorithms, the main advantages are that the gradient need not be computed (the calculation of the gradients is the most time-consuming task when training logic processors with gradient methods), and that the search for the global minimum is not fooled by local minima. An adaptive selection of the optimal order of linear regression models using a variable-structure stochastic learning automaton has been presented in [24]. The Akaike criterion has been derived and evaluated for stationary and nonstationary cases. The Bush-Mosteller reinforcement scheme with normalized automaton input has been used to adjust the probability distribution. Simulation results have been carried out to illustrate the feasibility and the performance of this model order selection approach. It has been shown that this approach, while requiring from the user no special skills in stochastic processes, can be a practical tool that opens the way to new applications in process modelling, control and optimization. The authors would like to thank Dr. Enso Ikonen for his assistance in carrying out these simulation results.
FIGURE 3.11. Evolution of the loss function.
The reinforcement schemes implemented in this chapter give very satisfactory results, and the analytically predictable behaviour is one of their important advantages over heuristic approaches like genetic algorithms and simulated annealing.
3.9 Conclusion
The aim of this chapter is the development of a useful framework for computational methods for stochastic unconstrained optimization problems on finite sets using variable-structure stochastic automata, which incorporate the advantages and flexibility of some of the more recent developments in learning systems. Learning stochastic automata characterized by continuous input and a fixed or changing number of actions have been considered. The analysis of these stochastic automata has been given in a concise and unified manner, and some properties involving the convergence and the convergence rate have been stated. A comparison between the performance of learning automata with and without a changing number of actions has been made. Several simulation results have been carried out to show the performance of learning automata when they are used for multimodal function optimization. In the next chapter we turn to the development of optimization algorithms, on the basis of learning automata, for solving stochastic constrained optimization problems on finite sets.
References

[1] Bi-Chong W, Luus R 1978 Reliability of optimization procedures for obtaining global optimum. AIChE Journal 24:619-626
[2] Dorea C C Y 1990 Stopping rules for a random optimization method. SIAM J. Control and Optimization 28:841-850
[3] Sorensen D C 1982 Newton's method with a model trust region modification. SIAM Journal Numer. Anal. 19:409-426
[4] Dolan W B, Cummings P T, Le Van M D 1989 Process optimization via simulated annealing: application to network design. AIChE Journal 35:725-736
[5] McMurtry G J, Fu K S 1966 A variable structure automaton used as a multimodal searching technique. IEEE Trans. on Automatic Control 11:379-387
[6] Kurz W, Najim K 1992 Synthese neuronaler Netze anhand strukturierter stochastischer Automaten. Nachrichten Neuronale Netze Journal 2:2-6
[7] Kushner H J 1972 Stochastic approximation algorithms for the local optimization of functions with nonunique stationary points. IEEE Trans. on Automatic Control 17:646-654
[8] Poznyak A S, Najim K, Chtourou M 1993 Use of recursive stochastic algorithm for neural networks synthesis. Applied Mathematical Modelling 17:444-448
[9] Najim K, Chtourou M 1994 Neural networks synthesis based on stochastic approximation algorithm. International Journal of Systems Science 25:1219-1222
[10] Narendra K S, Thathachar M A L 1989 Learning Automata: an Introduction. Prentice-Hall, Englewood Cliffs
[11] Baba N 1984 New Topics in Learning Automata Theory and Applications. Springer-Verlag, Berlin
[12] Lakshmivarahan S 1981 Learning Algorithms Theory and Applications. Springer-Verlag, Berlin
[13] Najim K, Oppenheim G 1991 Learning systems: theory and application. IEE Proceedings-E 138:183-192
[14] Najim K, Poznyak A S 1994 Learning Automata: Theory and Applications. Pergamon Press, Oxford
[15] Thathachar M A L, Harita B R 1987 Learning automata with changing number of actions. IEEE Trans. Syst. Man, and Cybern. 17:1095-1100
[16] Najim K, Poznyak A S 1996 Multimodal searching technique based on learning automata with continuous input and changing number of actions. IEEE Trans. Syst. Man, and Cybern. 26:666-673
[17] Poznyak A S, Najim K 1997 Learning automata with continuous input and changing number of actions. To appear in International Journal of Systems Science
[18] Bush R R, Mosteller F 1958 Stochastic Models for Learning. John Wiley & Sons, New York
[19] Shapiro I J, Narendra K S 1969 Use of stochastic automata for parameter self optimization with multimodal performance criteria. IEEE Trans. Syst. Man, and Cybern. 5:352-361
[20] Varshavskii V I, Vorontsova I P 1963 On the behavior of stochastic automata with variable structure. Automation and Remote Control 24:327-333
[21] Ash R B 1972 Real Analysis and Probability. Academic Press, New York
[22] Doob J L 1953 Stochastic Processes. John Wiley & Sons, New York
[23] Robbins H, Siegmund D 1971 A convergence theorem for nonnegative almost supermartingales and some applications. In: Rustagi J S (ed) Optimizing Methods in Statistics. Academic Press, New York
[24] Poznyak A S, Najim K, Ikonen E 1996 Adaptive selection of the optimal order of linear regression models using learning automata. Int. J. of Systems Science 27:151-159
[25] Najim K, Mészáros A, Rusnák A 1997 A stochastic optimization algorithm based on learning automata. Journal a
[26] Najim K, Chtourou M, Thibault J 1992 Neural network synthesis using learning automata. Journal of Systems Engineering 2:192-197
[27] Ikonen E, Najim K 1997 Use of learning automata in distributed fuzzy logic processor training. IEE Proceedings-E
4 Constrained Optimization Problems

4.1 Introduction
In the previous chapter, we discussed how to solve stochastic unconstrained optimization problems using learning automata. The goal of this chapter is to solve stochastic constrained optimization problems using two different approaches: Lagrange multipliers and penalty functions. Optimization is a large field of numerical research. Engineers (electrical, chemical, mechanical, etc.) and economists are often confronted with constrained optimization, where there are a priori limitations (constraints) on the allowed values of the independent variables or of some functions of these variables. There is now a strong and growing interest within the engineering community in the use of efficient optimization techniques for several purposes. Optimization techniques are commonly used as a framework for the formulation and solution of design problems [1]-[2]. For example, in the context of control, the objective of an on-line optimization scheme is to track the real process optimum as it changes with time. This must be achieved while allowing for disturbances to the process and ensuring that process constraints are not violated. In recent years, learning automata [3]-[4] have attracted the attention of scientists and technologists from a number of disciplines, and have been applied to a wide variety of practical problems in which a priori information is incomplete. In fact, observations measured from natural phenomena possess an inherent probabilistic structure. We hasten to note that in stochastic control theory it is assumed that the probability distributions are known. Unfortunately, it is the exception rather than the rule that the statistical characteristics of the considered processes are a priori known. In real situations, modelling always involves some element of approximation, since all real systems are, to some extent, nonlinear, time-varying and disturbed [5]. Descent algorithms such as steepest descent, conjugate gradient, quasi-Newton methods, methods of feasible directions, the projected gradient method with trust region, etc. are commonly used to solve optimization problems with or without constraints [6]-[7]-[8]-[9]-[10]-[11]-[12]-[13]. They are based on direct gradient measurements. When only noisy measurements are
available, algorithms based on gradient approximation from measurements, stochastic approximation algorithms and random search techniques [14] are appropriate for solving stochastic optimization problems. The main advantage of random search over other direct search techniques is its general applicability, i.e., there are almost no conditions concerning the function to be optimized (continuity, unimodality, convexity, etc.). Methods based on learning systems [3]-[15]-[16] belong to this class of random search techniques. The aim of this chapter is the development of a useful framework for computational methods for stochastic constrained optimization problems on finite sets using variable-structure stochastic automata, which incorporate the advantages and flexibility of some of the more recent developments in learning systems [4]. In recent years, learning systems [4]-[17]-[18] have been fairly exhaustively studied. This interest is essentially due to the fact that learning systems provide interesting methods for solving complex nonlinear problems characterized by a high level of uncertainty. Learning is defined as any relatively permanent change in behaviour resulting from past experience, and a learning system is characterized by its ability to improve its behaviour with time, in some sense tending towards an ultimate goal. They have been used for solving unconstrained optimization problems [4]. This stochastic constrained optimization problem is equivalent to a stochastic linear programming problem which is formulated and solved as the behaviour of a variable-structure stochastic automaton in a multi-teacher environment [4]-[19]. A reinforcement scheme derived from the Bush-Mosteller scheme [15] is used to adapt the probabilities associated with the actions of the automaton. The learning automaton operates in a multi-teacher environment. The environment response, which is constructed from the available data (cost function and constraint realizations), is normalized and used as the automaton input. In this chapter we shall be concerned with stochastic constrained optimization problems on finite sets using two approaches: Lagrange multipliers and penalty functions. The first part of this chapter is dedicated to the Lagrange multipliers approach. It is shown that the considered optimization problem is equivalent to a stochastic linear programming problem given on a multidimensional simplex. The Lagrange function associated with the latter is not strictly convex and, as a result of this fact, any attempt to apply directly the gradient technique for finding its saddle-point is doomed to failure. To avoid this problem we have introduced a regularization term in the corresponding Lagrange function, as in [20]. The Lipschitzian property of the saddle-point of the regularized Lagrange function with respect to the regularization parameter is proved. The second part of this chapter presents an alternative approach for solving the same stochastic constrained optimization problem. It is based on learning automata and the penalty function approach [12]-[21]-[22]-[23]-
[24]-[25]-[26]-[27]. The environment response is processed using a new normalization procedure. The properties of the optimal solution as well as the basic convergence characteristics of the optimization algorithm are stated. Our main theoretical results (convergence analysis) are stated using martingale theory and a Lyapunov approach. In fact, martingales arise naturally whenever one needs to consider conditional expectations with respect to increasing information patterns [5]-[28].
4.2 Constrained optimization problem

4.2.1 PROBLEM FORMULATION
Let us consider the following sequence of random vectors

$$\xi_n^j(u,\omega),\quad u \in U := \{u(1),\dots,u(N)\};\ \omega \in \Omega;\ n = 1,2,\dots;\ j = 0,1,\dots,m$$

which are given on a probability space $(\Omega,\mathcal{F},P)$, where

- $u$ is a discrete variable defined on the set $U$
- $\omega$ is a random variable defined on the set of elementary events $\Omega$.

Let us define the Borel functions $\Phi_n^j$ $(j = 0,1,\dots,m)$ as follows:

$$\Phi_n^j := \frac{1}{n}\sum_{t=1}^{n}\xi_t^j(u_t,\omega) \qquad (4.1)$$

where the sequence $\{u_t\}_{t=1,\dots,n}$ of random variables is also defined on $(\Omega,\mathcal{F},P)$ and represents the optimization sequence ("optimization strategy"), which depends on the available information, i.e.,

$$u_t = u_t\left(\xi_s^j, u_s, \omega \mid j = 0,1,\dots,m;\ t = 1,\dots,n;\ s = 1,\dots,n-1\right) \qquad (4.2)$$

Let us consider the following assumptions:

- (H1) The conditional mathematical expectation of the random variables $\xi_n^j(u_n,\omega)$ for any fixed random variable $u_n = u(i)$ is stationary, i.e.,

$$E\left\{\xi_n^j(u_n,\omega)\mid u_n = u(i) \wedge \mathcal{F}_{n-1}\right\} \stackrel{a.s.}{=} v_i^j \quad (j = 0,1,\dots,m;\ i = 1,\dots,N)$$

where

$$\mathcal{F}_{n-1} := \sigma\left(\xi_t^j, u_s, \omega \mid j = 0,1,\dots,m;\ t = 1,\dots,n-1;\ s = 1,\dots,n-1\right)$$

is the $\sigma$-algebra generated by the corresponding data.
- (H2) For any fixed random variable $u_n = u(i)$, the absolute values of the random variables $\xi_n^j(u_n,\omega)$ are uniformly bounded with probability one, i.e.,

$$\limsup_{n\to\infty}\left|\xi_n^j(u_n,\omega)\right| \stackrel{a.s.}{\le} C^+ < \infty \quad (j = 0,\dots,m)$$

The main stochastic optimization problem treated in this chapter may now be stated: under assumptions (H1) and (H2), find the optimization strategy $\{u_n\}$ which minimizes, with probability 1, the average loss function

$$\limsup_{n\to\infty}\Phi_n^0 \to \inf_{\{u_n\}} \qquad (4.3)$$

under the constraints

$$\limsup_{n\to\infty}\Phi_n^j \stackrel{a.s.}{\le} 0 \quad (j = 1,\dots,m) \qquad (4.4)$$
From (4.3) and (4.4) it follows that the sequence $\{\Phi_n^0\}$ and the sequences $\{\Phi_n^j\}$ $(j = 1,\dots,m)$ correspond, respectively, to the realizations of the loss function $\Phi_n^0$ and of the constraints defined by the functions $\Phi_n^j$ $(j = 1,\dots,m)$.

Remark. The "chance constraints" [25]-[29]

$$P\left\{\omega : \phi_j(\Phi_n^0,\Phi_n^1,\dots,\Phi_n^m) \ge a_j\right\} \le \beta_j \qquad (4.5)$$

belong to the class of constraints (4.4). Indeed, this chance constraint can be written as follows:

$$E\{\tilde\Phi_n^j\} \le 0 \qquad (4.6)$$

where

$$\tilde\Phi_n^j = \chi\left\{\phi_j(\Phi_n^0,\Phi_n^1,\dots,\Phi_n^m) \ge a_j\right\} - \beta_j \qquad (4.7)$$

and the indicator function $\chi(\cdot)$ is defined as follows:

$$\chi(A) = \begin{cases} 1 & \text{if } A \text{ is true}\\ 0 & \text{otherwise} \end{cases}$$

Substituting (4.7) into (4.4) leads to

$$\limsup_{n\to\infty}\frac{1}{n}\sum_{t=1}^{n}\left[\chi\left\{\phi_j(\Phi_t^0,\dots,\Phi_t^m) \ge a_j\right\} - \beta_j\right] = \limsup_{n\to\infty}\frac{1}{n}\sum_{t=1}^{n}E\left\{\chi\left\{\phi_j(\Phi_t^0,\dots,\Phi_t^m) \ge a_j\right\}\right\} - \beta_j$$
$$= P\left\{\omega : \phi_j(\Phi_n^0,\dots,\Phi_n^m) \ge a_j\right\} - \beta_j \le 0 \qquad (4.8)$$

which recovers (4.5). The problem (4.3)-(4.4) is asymptotically equivalent to a linear programming problem which will be stated in the next subsection.

4.2.2 EQUIVALENT LINEAR PROGRAMMING PROBLEM
Consider the following stochastic programming problem on a multidimensional simplex:

$$V_0(p) := \sum_{i=1}^{N} v_i^0\, p(i) \to \inf_{p\in S^N} \qquad (4.9)$$

$$V_j(p) := \sum_{i=1}^{N} v_i^j\, p(i) \le 0 \quad (j = 1,\dots,m) \qquad (4.10)$$

$$S^N := \left\{p = (p(1),\dots,p(N))^T \in \mathbf{R}^N \;\middle|\; p(i) \ge 0,\ \sum_{i=1}^{N} p(i) = 1\right\} \qquad (4.11)$$

where $S^N$ denotes the simplex in the Euclidean space $\mathbf{R}^N$ and $p$ is the probability distribution along which the optimization is done. The following theorem states the asymptotic equivalence between the basic problem (4.3)-(4.4) and the linear programming problem (4.9)-(4.11).

Theorem 1. Under assumptions (H1) and (H2):

1. The basic stochastic constrained optimization problem (4.3)-(4.4) has a solution if and only if the corresponding linear programming problem (4.9)-(4.11) has a solution too (maybe not unique).

2. These solutions coincide with probability one, i.e.,

$$\inf_{\{u_n\}}\Phi_n^0 \stackrel{a.s.}{=} \inf_{p\in S^N} V_0(p)$$

such that the constraints (4.4) and (4.10) are simultaneously fulfilled.
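Before turning to the proof, it may help to see the equivalence concretely: once the values $v_i^j$ are known, the linear program (4.9)-(4.11) can be solved directly, and its solution is the distribution that the learning automaton is expected to approach. The following sketch is illustrative only; the data $v^0$ and $\bar V$ are hypothetical, and scipy is assumed to be available.

```python
# Illustrative sketch: solving the equivalent linear program (4.9)-(4.11)
# numerically (hypothetical payoff values, not from the text).
import numpy as np
from scipy.optimize import linprog

v0 = np.array([3.0, 1.0, 2.5, 4.0])        # loss values v_i^0, i = 1..N
V = np.array([[0.5, -1.0, 0.2, -0.3]])     # constraint values v_i^j (m = 1)

N = len(v0)
# minimize V_0(p) = v0^T p  subject to  V p <= 0,  sum(p) = 1,  p >= 0
res = linprog(c=v0,
              A_ub=V, b_ub=np.zeros(V.shape[0]),
              A_eq=np.ones((1, N)), b_eq=np.array([1.0]),
              bounds=[(0.0, 1.0)] * N)
print("optimal distribution p*:", res.x)
print("optimal average loss   :", res.fun)
```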
Proof. Assume that the basic problem (4.3)-(4.4) has a solution. Then, according to Lemma A.4-1 (Appendix A), the sequence $\{u_n\}$ satisfies asymptotically ($n \to \infty$) the constraints (4.4) if and only if the sequence

$$f_n(i) := \frac{1}{n}\sum_{t=1}^{n}\chi(u_t = u(i)) \qquad (4.12)$$

satisfies the set of constraints (4.10)-(4.11). Hence,

$$\inf_{\{u_n\}}\Phi_n^0 \stackrel{a.s.}{=} V_0(p^*) \qquad (4.13)$$

where $p^* \in S^N$ is the solution of the linear programming problem (4.9)-(4.11). Let us now consider the stationary optimization strategy $\{u_n\}$ for which

$$P\left\{\omega : u_n = u(i) \mid \mathcal{F}_{n-1}\right\} \stackrel{a.s.}{=} p^*(i)$$

According to the strong law of large numbers for random sequences containing martingale differences (see Lemma A.4-2 in Appendix A), we derive

$$\lim_{n\to\infty}\Phi_n^j \stackrel{a.s.}{=} V_j(p^*) \quad (j = 1,\dots,m)$$

Theorem is proved. ∎

The next section deals with the properties of the Lagrange function and its saddle-point.
4.3 Lagrange multipliers using regularization approach

It is well known [11]-[26]-[27]-[8] that the stochastic linear programming problem discussed above is equivalent to the problem of finding the saddle-point of the following Lagrange function:

$$L(p,\lambda) := V_0(p) + \sum_{j=1}^{m}\lambda(j)\,V_j(p) \qquad (4.14)$$

given on the set $S^N \times \mathbf{R}_+^m$, where

$$\mathbf{R}_+^m := \left\{\lambda = (\lambda(1),\dots,\lambda(m)) : \lambda(i) \ge 0,\ i = 1,\dots,m\right\} \qquad (4.15)$$

In other words, the initial problem is equivalent to the following min-max problem:

$$L(p,\lambda) \to \inf_{p\in S^N}\ \sup_{\lambda\in\mathbf{R}_+^m} \qquad (4.16)$$
According to the Lagrange multipliers theory [7]-[26], a vector $p^* \in S^N$ is the solution of the linear programming problem (4.9)-(4.11) if and only if there exists a vector $\lambda^* \in \mathbf{R}_+^m$ such that for any other vectors $p \in S^N$ and $\lambda \in \mathbf{R}_+^m$ the following saddle-point inequalities hold:

$$L(p^*,\lambda) \le L(p^*,\lambda^*) \le L(p,\lambda^*) \qquad (4.17)$$

The Lagrange function $L(p,\lambda)$ is not strictly convex and, as a consequence, any attempt to apply the gradient technique directly for finding its saddle-point is doomed to failure. The following example illustrates this fact.

Example. Consider the function $L(p,\lambda)$ whose arguments belong to the one-dimensional space ($N = m = 1$), with $V_0(p) = 0$, i.e., $L(p,\lambda) = \lambda p$. Applying directly the gradient technique (the Arrow-Hurwicz-Uzawa algorithm) [31] for finding its saddle-point, which is equal to $p^* = \lambda^* = 0$, we obtain

$$p_n = p_{n-1} - \gamma_n\nabla_p L(p_{n-1},\lambda_{n-1}) = p_{n-1} - \gamma_n\lambda_{n-1}$$
$$\lambda_n = \lambda_{n-1} + \gamma_n\nabla_\lambda L(p_{n-1},\lambda_{n-1}) = \lambda_{n-1} + \gamma_n p_{n-1} \qquad (4.18)$$
$$\gamma_n \in \mathbf{R}^1,\quad \gamma_n \ge 0,\quad \sum_n\gamma_n = \infty$$

It is easy to show that this procedure generates a divergent sequence $\{p_n,\lambda_n\}$ which does not tend to the saddle-point $p^* = \lambda^* = 0$. Indeed,

$$P_n := p_n^2 + \lambda_n^2 = (1+\gamma_n^2)\,P_{n-1} = P_0\prod_{t=1}^{n}(1+\gamma_t^2) \ge P_0 > 0$$
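This divergence, and the effect of the regularization introduced next, are easy to reproduce numerically. The following sketch is illustrative: the step size and the value $\delta = 0.1$ are choices of ours (a constant correction factor is used in place of $\gamma_n$), and the regularized gradients are those of (4.19) below.

```python
# Minimal sketch: plain Arrow-Hurwicz-Uzawa on L(p, lam) = lam * p spirals
# away from the saddle-point (0, 0); the delta-regularized version contracts.

def iterate(n_steps, gamma, delta):
    p, lam = 1.0, 1.0
    for _ in range(n_steps):
        grad_p   = lam + 2.0 * delta * p     # d/dp   [lam*p + delta*(p^2 - lam^2)]
        grad_lam = p   - 2.0 * delta * lam   # d/dlam [lam*p + delta*(p^2 - lam^2)]
        p, lam = p - gamma * grad_p, lam + gamma * grad_lam
    return p**2 + lam**2                     # squared distance to the saddle-point

print("plain gradient (delta = 0)  :", iterate(2000, 0.05, 0.0))   # grows without bound
print("regularized    (delta = 0.1):", iterate(2000, 0.05, 0.1))   # tends to 0
```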
One approach for avoiding this problem consists of introducing a regularization term in the corresponding Lagrange function [7]-[20]:

$$L_\delta(p,\lambda) = L(p,\lambda) + \delta\left(\|p\|^2 - \|\lambda\|^2\right) \qquad (4.19)$$

For each regularizing parameter $\delta > 0$, this regularized Lagrange function $L_\delta(p,\lambda)$ is strictly convex in $p$ and concave in $\lambda$. The next theorem describes the dependence of the saddle-point $(p^*(\delta),\lambda^*(\delta))$ of this regularized function on the regularizing parameter $\delta$, and analyses its asymptotic behaviour when $\delta \to 0$.

Theorem 2. For any sequence $\{\delta_n\}$ such that $0 < \delta_n \downarrow 0$, the coordinates of the saddle-points $(p^*(\delta_n),\lambda^*(\delta_n))$ of $L_{\delta_n}(p,\lambda)$ are uniformly bounded and converge to the saddle-point of $L(p,\lambda)$ of minimal norm.
Proof. Combining the saddle-point inequalities for $L(p,\lambda)$ and $L_{\delta_n}(p,\lambda)$ and using the fact that $\delta_n > 0$, this inequality leads to

$$\|p^*(\delta_n)\|^2 + \|\lambda^*(\delta_n)\|^2 \le \frac{2}{\delta_n}\left[L(p^*,\lambda^*(\delta_n)) - L(p^*(\delta_n),\lambda^*)\right] + \|p^*\|^2 + \|\lambda^*\|^2 \le \|p^*\|^2 + \|\lambda^*\|^2$$

which establishes the uniform boundedness of the saddle-point coordinates. Theorem is proved. ∎

For any $\delta > 0$, the regularized Lagrange function is strictly convex and has the following useful property, given in the next lemma.

Lemma 1. For any $p \in S^N$ and any $\lambda \in \mathbf{R}_+^m$ the following inequality holds:

$$(p - p^*(\delta))^T\nabla_p L_\delta(p,\lambda) - (\lambda - \lambda^*(\delta))^T\nabla_\lambda L_\delta(p,\lambda) \ge \delta\left(\|p - p^*(\delta)\|^2 + \|\lambda - \lambda^*(\delta)\|^2\right)$$

where $(p^*(\delta),\lambda^*(\delta))$ is the saddle-point of the regularized Lagrange function (4.19).

Proof. From (4.19) we obtain

$$(p - p^*(\delta))^T\nabla_p L_\delta(p,\lambda) - (\lambda - \lambda^*(\delta))^T\nabla_\lambda L_\delta(p,\lambda) = L_\delta(p,\lambda^*(\delta)) - L_\delta(p^*(\delta),\lambda) + \delta\left(\|p - p^*(\delta)\|^2 + \|\lambda - \lambda^*(\delta)\|^2\right)$$

The desired result follows immediately from the inequality

$$L_\delta(p,\lambda^*(\delta)) \ge L_\delta(p^*(\delta),\lambda)$$

which corresponds to the saddle-point condition (4.17). Lemma is proved. ∎

The next theorem states the Lipschitzian property for the saddle-point $(p^*(\delta),\lambda^*(\delta))$ of the regularized Lagrange function $L_\delta(p,\lambda)$ with respect to the regularization parameter $\delta$.

Theorem 3. Under the same conditions as in Theorem 2, there exists a positive constant $C$ such that for any nonnegative parameters $\delta_t$ and $\delta_s$ the following inequality holds:

$$\|p^*(\delta_t) - p^*(\delta_s)\| + \|\lambda^*(\delta_t) - \lambda^*(\delta_s)\| \le C\,|\delta_t - \delta_s|$$
Proof. Let us consider the following sets:

$$Z_0 := \left\{(p,\lambda) : \sum_{i=1}^{N} p(i) = 1\right\}$$
$$Z_1(i_1,\dots,i_s) := \left\{(p,\lambda) : p(i_k) = 0,\ k = 1,\dots,s\right\} \cap Z_0$$
$$Z_2(j_1,\dots,j_r) := \left\{(p,\lambda) : \lambda(j_k) = 0,\ k = 1,\dots,r\right\} \cap Z_0$$
$$Z_3(i_1,\dots,i_s;\ j_1,\dots,j_r) := Z_1(i_1,\dots,i_s) \cap Z_2(j_1,\dots,j_r)$$

The total number of these sets is equal to $2^m(2^N - 1)$. They will be denoted by $\mathcal{G}_k$ $(k = 1,\dots,2^m(2^N-1))$. Let us associate with each set $\mathcal{G}_k$ the problem $\mathcal{P}_k$ of finding the saddle-point of the regularized Lagrange function $L_\delta(p,\lambda)$ on $\mathcal{G}_k$. The corresponding solution will be denoted by $(p(\mathcal{G}_k),\lambda(\mathcal{G}_k))$. It is clear that the saddle-point $(p^*(\delta),\lambda^*(\delta))$ of the regularized Lagrange function $L_\delta(p,\lambda)$, defined on the set $S^N \times \mathbf{R}_+^m$, coincides with the solution $(p(\mathcal{G}_\ell),\lambda(\mathcal{G}_\ell))$ of one of the previous problems $\mathcal{P}_\ell$. Notice that each problem $\mathcal{P}_k$ is a convex optimization problem with equality constraints and, as a consequence, we can use the Lagrange technique for deriving the necessary optimality conditions. The optimality conditions are:

$$\nabla_p\tilde L_\delta(p,\lambda,\mu) = 0,\quad \nabla_\lambda\tilde L_\delta(p,\lambda,\mu) = 0,\quad \nabla_\mu\tilde L_\delta(p,\lambda,\mu) = 0 \qquad (4.22)$$

where

$$\tilde L_\delta(p,\lambda,\mu) := L_\delta(p,\lambda) - \mu_0\left(\sum_i p(i) - 1\right) - \sum_k\mu_{1k}\,p(i_k) - \sum_k\mu_{2k}\,\lambda(j_k)$$

and $\mu_{1k}$ and $\mu_{2k}$ are Lagrange multipliers. The linear system of equations (4.22) is equivalent to

$$v^0 + \sum_{j=1}^{m}\lambda_j v^j + \delta p - \mu_0 e_N - \sum_k\mu_{1k}\,e(i_k) = 0$$
$$\bar V p - \delta\lambda - \sum_k\mu_{2k}\,e(j_k) = 0,\qquad (p,\lambda) \in \mathcal{G}_k \qquad (4.23)$$

where

$$e_N := (1,\dots,1)^T \in \mathbf{R}^N,\qquad e(j_k) := (0,\dots,0,\underbrace{1}_{j_k},0,\dots,0)^T$$

From (4.23) it follows that the expression of the optimal solution (saddle-point) can be written in the following parametric form:

$$p(i) = \frac{\displaystyle\sum_{s=0}^{m+N} b^p_{is}\,\delta^s}{\displaystyle\sum_{s=0}^{m+N} a^p_{is}\,\delta^s},\qquad \lambda(j) = \frac{\displaystyle\sum_{s=0}^{m+N} b^\lambda_{js}\,\delta^s}{\displaystyle\sum_{s=0}^{m+N} a^\lambda_{js}\,\delta^s} \qquad (4.24)$$

For any $\delta > 0$, the corresponding Lagrange function is strictly convex and has a unique optimal point. It follows that the denominator of (4.24) is not equal to 0. When $\delta$ is small enough, the problem to be solved remains the same ($\mathcal{P}_\ell$) as $\delta$ decreases. Let us assume that the $r'_p$ first coefficients $b^p_{is}$ $(s = 0,1,\dots,r'_p)$ are equal to zero and, similarly, that the $r''_p$ first coefficients $a^p_{is}$ $(s = 0,1,\dots,r''_p)$ are also equal to zero. It follows that

$$p(i) = \delta^{r_p}\,\frac{b^p_{i,r'_p+1} + \displaystyle\sum_{s=1}^{m+N-r'_p-1} b^p_{i,s+r'_p+1}\,\delta^s}{a^p_{i,r''_p+1} + \displaystyle\sum_{s=1}^{m+N-r''_p-1} a^p_{i,s+r''_p+1}\,\delta^s},\qquad r_p := r'_p - r''_p \qquad (4.25)$$

The same form is valid for the vector $\lambda$:

$$\lambda(j) = \delta^{r_\lambda}\,\frac{b^\lambda_{j,r'_\lambda+1} + \displaystyle\sum_{s=1}^{m+N-r'_\lambda-1} b^\lambda_{j,s+r'_\lambda+1}\,\delta^s}{a^\lambda_{j,r''_\lambda+1} + \displaystyle\sum_{s=1}^{m+N-r''_\lambda-1} a^\lambda_{j,s+r''_\lambda+1}\,\delta^s},\qquad r_\lambda := r'_\lambda - r''_\lambda \qquad (4.26)$$

In view of Theorem 2 (boundedness of the coordinates of the saddle-point) we conclude that $r_p \ge 0$ and $r_\lambda \ge 0$. The assertion of this theorem follows directly from (4.25)-(4.26) and the boundedness of the parameter $\delta$. Theorem is proved. ∎

The next section deals with the use of learning automata for solving the equivalent optimization problem stated above.
4.4 Optimization algorithm

It has been shown in Section 4.2 that the stochastic constrained optimization problem (4.3)-(4.4) is asymptotically equivalent to the problem of determining the saddle-points of the regularized Lagrange function $L_\delta(p,\lambda)$, using the realizations of the cost function and the constraints. This equivalent problem may be formulated and solved as the behaviour of a variable-structure stochastic automaton in a multi-teacher environment [4]-[19] (Figure 4.1).

[Figure 4.1 is a block diagram in which the responses of Teacher 1, Teacher 2, ..., Teacher m+1 feed a normalization procedure whose output drives the learning automaton.]

FIGURE 4.1. Multi-teacher environment.

Referring to the schematic block diagram of the learning automaton operating in a multi-teacher environment (Figure 4.1), we note that the normalization procedure acts as a mapping from the teacher responses ($\xi_n^j$, $j = 0,\dots,m$) to the learning automaton input ($\xi_n$).
4.4.1 LEARNING AUTOMATON

The role of the environment (medium) is to establish the relation between the actions of the automaton and the signals received at its input. For easy reference, the mathematical description of a stochastic automaton is given below. An automaton is an adaptive discrete machine described by

$$\left\{\Xi,\ U,\ \mathcal{P},\ \{\xi_n\},\ \{u_n\},\ \{p_n\},\ T\right\}$$

where:

(i) $\Xi$ is the bounded set of automaton inputs.

(ii) $U$ denotes the set $\{u(1),u(2),\dots,u(N)\}$ of actions of the automaton.

(iii) $\mathcal{P} = (\Omega,\mathcal{F},P)$ is a probability space.

(iv) $\{\xi_n\}$ is a sequence of automaton inputs (environment responses, $\xi_n \in \Xi$) provided by the environment in a binary (P-model environment) or continuous (S-model environment) form.

(v) $\{u_n\}$ is a sequence of automaton outputs (actions).

(vi) $p_n = [p_n(1),p_n(2),\dots,p_n(N)]^T$ is the conditional probability distribution at time $n$:

$$p_n(i) = P\left\{\omega : u_n = u(i) \mid \mathcal{F}_{n-1}\right\},\qquad \sum_{i=1}^{N} p_n(i) = 1\ \ \forall n$$

where $\mathcal{F}_n = \sigma(\xi_1,u_1,p_1;\dots;\xi_n,u_n,p_n)$ is the $\sigma$-algebra generated by the corresponding events ($\mathcal{F}_n \in \mathcal{F}$).

(vii) $T$ represents the reinforcement scheme (updating scheme) which changes the probability vector $p_n$ to $p_{n+1}$:

$$p_{n+1} = p_n + \gamma_n\,T_n\left(p_n;\ \{\xi_t\}_{t=1,\dots,n};\ \{u_t\}_{t=1,\dots,n}\right),\qquad p_n(i) > 0\ \ \forall i = 1,\dots,N \qquad (4.27)$$

where $\gamma_n$ is a scalar correction factor and the vector $T_n(\cdot) = [T_n^1(\cdot),\dots,T_n^N(\cdot)]^T$ satisfies the following conditions (for preserving the probability measure):

$$\sum_{i=1}^{N} T_n^i(\cdot) = 0\ \ \forall n \qquad (4.28)$$

$$p_n(i) + \gamma_n T_n^i(\cdot) \in [0,1]\ \ \forall n,\ \forall i = 1,\dots,N \qquad (4.29)$$

This is the heart of the learning automaton.
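The definition above translates directly into code. The following sketch is illustrative (not taken from the book): it implements a variable-structure automaton whose default correction term is of the Bush-Mosteller type, one standard choice among many, and the assertions check the measure-preserving conditions (4.28)-(4.29).

```python
# Minimal sketch of the automaton {Xi, U, P, {xi_n}, {u_n}, {p_n}, T}.
import numpy as np

class LearningAutomaton:
    def __init__(self, n_actions, rng=None):
        self.N = n_actions
        self.p = np.full(n_actions, 1.0 / n_actions)   # p_1(i) = 1/N
        self.rng = rng or np.random.default_rng()

    def choose(self):
        """Draw the action u_n according to the current distribution p_n."""
        return self.rng.choice(self.N, p=self.p)

    def update(self, action, xi, gamma):
        """Apply p_{n+1} = p_n + gamma_n * T_n(...), with xi in [0, 1]."""
        e = np.zeros(self.N); e[action] = 1.0
        T = (e - self.p) + xi * (np.ones(self.N) - self.N * e) / (self.N - 1)
        assert abs(T.sum()) < 1e-12                    # condition (4.28)
        self.p += gamma * T
        assert np.all((self.p >= 0) & (self.p <= 1))   # condition (4.29)
        self.p /= self.p.sum()                         # guard against rounding drift
```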
4.4.2 ALGORITHM
Let $u_n = u(i)$ be the action selected at time $n$. For a fixed vector $\lambda_n$ and a fixed regularization parameter $\delta = \delta_n$, the environment response (automaton input) $\xi_n$ is defined as

$$\xi_n := \frac{1}{p_n(i)}\left[\xi_n^0 + \sum_{j=1}^{m}\lambda_n(j)\,\xi_n^j\right] + \tau_n,\qquad u_n = u(i) \qquad (4.30)$$

The normalization construction implies the recursion

$$p_n(i) \ge p_{n-1}(i)\,(1-\gamma_{n-1}) + \gamma_{n-1}\min\{\,\cdot\,;\,\cdot\,\} \ge \min\left\{p_{n-1}(i);\ \tau_{n-1}\right\} \qquad (4.35)$$

From the definitions of $\xi_{n-1}^+$ and $\xi_{n-1}^-$ we derive the corresponding bounds. Consequently, it follows that

$$p_n(i) \ge \min\{p_{n-1}(i);\tau_{n-1}\} \ge \min\{\min\{p_{n-2}(i);\tau_{n-2}\};\tau_{n-1}\} = \min\{p_{n-2}(i);\tau_{n-1}\} \ge \dots \ge \min\{p_1(i);\tau_{n-1}\} = \tau_{n-1}$$

Finally,

$$p_n(i) \ge \tau_{n-1},\quad \forall i = 1,\dots,N \qquad (4.36)$$
Consider now the automaton input $\xi_n$. Using the lower bound (4.36) together with the uniform boundedness of the responses (assumption (H2)), a direct estimation of the numerator and denominator in (4.30) yields

$$\xi_n \ge \frac{(\xi_n^- + \tau_n)\,\gamma_n\,\tau_n^{-1}}{1+(N-2)\tau_n} \ge 0 \qquad\text{and}\qquad \xi_n \le \frac{(\xi_n^+ + \tau_n)\,\gamma_n\,\tau_n^{-1}}{1+(N-2)\tau_n} \le \frac{1+\tau_n(N-2)}{1+(N-2)\tau_n} = 1$$
Lemma is proved. ∎

As might be expected from the construction of the automaton input (normalized environment response), the signal input $\xi_n$ belongs to the interval $[0,1]$. This lemma leads to the following corollary.

Corollary 1. The conditional mathematical expectation of $\xi_n$ is given by an affine transformation of the corresponding regularized Lagrange function (4.20):

$$T_n(i) := E\left\{\xi_n \mid (u_n = u(i)) \wedge \mathcal{F}_n\right\} = p_n^{-1}(i)\left[\alpha_n\,L_{\delta_n}(p_n,\lambda_n) + \beta_n\right]$$

where $\alpha_n$ and $\beta_n$ are defined in the previous lemma. The asymptotic properties of this optimization algorithm are stated in the next section.
4.5 Convergence and convergence rate analysis
In this section we prove the convergence of the optimization algorithm described in the previous section and we estimate its convergence rate. The next theorems are our main results in this section.

Theorem 4. Consider the optimization algorithm (4.33)-(4.34) under assumptions (H1) and (H2), and assume in addition that there exist two nonnegative sequences $\{\gamma_n\}$ and $\{\tau_n\}$ satisfying the corresponding step-size conditions. Then the algorithm converges with probability one.

Proof. Consider the Lyapunov function $W_n := \|p_n - p_n^*\|^2 + \|\lambda_n - \lambda_n^*\|^2$, where $(p_n^*,\lambda_n^*) := (p^*(\delta_n),\lambda^*(\delta_n))$. Using the elementary inequality $\|a+b\|^2 \le (1+\varepsilon)\|a\|^2 + (1+\varepsilon^{-1})\|b\|^2$, which is valid for any $\varepsilon = \varepsilon_n > 0$, it follows that

$$W_{n+1} \le (1+\varepsilon_n)\left\|p_n - p_n^* + \gamma_n\Delta_n^p\right\|^2 + (1+\varepsilon_n^{-1})\left\|p_{n+1}^* - p_n^*\right\|^2 + (1+\varepsilon_n)\left\|\lambda_n - \lambda_n^* + \gamma_n\Delta_n^\lambda\right\|^2 + (1+\varepsilon_n^{-1})\left\|\lambda_{n+1}^* - \lambda_n^*\right\|^2$$

In view of Theorem 2, we derive

$$W_{n+1} \le (1+\varepsilon_n)W_n + 2\gamma_n(1+\varepsilon_n)\left[(p_n - p_n^*)^T\Delta_n^p - (\lambda_n - \lambda_n^*)^T\Delta_n^\lambda\right] + \gamma_n^2(\cdots) \qquad (4.37)$$

where

$$\Delta_n^p := e(u_n) - p_n + \xi_n\,\frac{e_N - N e(u_n)}{N-1}$$
Let us calculate the conditional mathematical expectations and estimate the conditional second moments of the vectors $\Delta_n^p$ and $\Delta_n^\lambda$. Notice that $p_n$ and $\lambda_n$ are $\mathcal{F}_n$-measurable. Hence we have:

1. Conditional mathematical expectations.

1) Vectors $\Delta_n^p$:

$$E\left\{\Delta_n^p \mid \mathcal{F}_n\right\} = \sum_{i=1}^{N}E\left\{\Delta_n^p \mid (u_n = u(i)) \wedge \mathcal{F}_n\right\}p_n(i) = \sum_{i=1}^{N}T_n^p(i)\,p_n(i)\,\frac{e_N - N e(u(i))}{N-1} \qquad (4.38)$$

where $T_n^p(i) := E\{\xi_n \mid (u_n = u(i)) \wedge \mathcal{F}_n\}$.

2) Vectors $\Delta_n^\lambda$:

$$E\left\{\Delta_n^\lambda \mid \mathcal{F}_n\right\} = \sum_{i=1}^{N}E\left\{\Delta_n^\lambda \mid (u_n = u(i)) \wedge \mathcal{F}_n\right\}p_n(i) \qquad (4.39)$$

2. Estimation of the conditional second moments.

1) Vectors $\Delta_n^p$:
$$E\left\{\|\Delta_n^p\|^2 \mid \mathcal{F}_n\right\} \le \mathrm{const} < \infty$$

4.6 Penalty function approach

The stochastic constrained optimization problem stated above can also be treated by minimizing the penalty function

$$V_0(p) + \mu\sum_{j=1}^{m}\left[V_j(p)\right]_+ \qquad (4.59)$$

where the penalty coefficient $\mu \to \infty$. The optimal solution of this penalty function is equivalent to the optimal solution of the following penalty function:

$$\mu V_0(p) + \sum_{j=1}^{m}\left[V_j(p) + \tilde v_j\right]^2,\qquad 0 < \mu \to 0 \qquad (4.60)$$
This penalty function is more useful (from a practical point of view) because $\mu \to 0$ instead of tending to infinity. Let us consider the following regularized penalty function [24]:

$$L_{\mu,\delta}(p,\tilde v) := \mu V_0(p) + \frac{\delta}{2}\left(\|p\|^2 + \|\tilde v\|^2\right) + \frac{1}{2}\left\|\bar V p + \tilde v\right\|^2 \qquad (4.61)$$

where

$$\bar V := \left[\bar v_1,\dots,\bar v_N\right] \in \mathbf{R}^{m\times N},\qquad \bar v_i := \left(v_i^1,\dots,v_i^m\right)^T \qquad (4.62)$$

We assume that the parameters $\mu$ (penalty parameter) and $\delta$ (regularization parameter) are positive and tend monotonically to zero, i.e.,

$$0 < \mu_n \downarrow 0,\qquad 0 < \delta_n \downarrow 0 \qquad (4.63)$$

For any $\mu > 0$ and $\delta > 0$, this problem is strictly convex. Its solution is unique and will be denoted by

$$\left(p^*(\mu,\delta),\ \tilde v^*(\mu,\delta)\right) \in S^N \times \mathbf{R}_+^m \qquad (4.64)$$

The properties of the optimal solution are presented in the next section.
4.7 Properties of the optimal solution
The next theorem states the properties of this solution (4.64) when $\delta$ and $\mu$ tend to zero.

Theorem 6. Let us assume that:

1) the set $P^*$ is not empty and Slater's condition

$$V_j(\tilde p) < 0 \quad (j = 1,\dots,m) \qquad (4.65)$$

is fulfilled for some $\tilde p \in S^N$;

2) the parameters $\mu$ and $\delta$ are time-varying, i.e., $\mu = \mu_n$, $\delta = \delta_n$ $(n = 1,2,\dots)$,
such that condition (4.67) on their rates holds. Then the solution $p^*(\mu_n,\delta_n)$ converges to the solution $p^{**} \in P^*$ of minimal weighted norm.

Proof. From the optimality conditions and the convexity of the problem one obtains the chain of inequalities

$$0 \ge \left(p^*(\mu_n,\delta_n) - p^*\right)^T\left[I + \bar V^T\bar V\right]p^*(\mu_n,\delta_n) + \mu_n\,\bar v_0^T\left(p^*(\mu_n,\delta_n) - \pi_{P^*}\{p^*(\mu_n,\delta_n)\}\right) + \mu_n\,\bar v_0^T\left(\pi_{P^*}\{p^*(\mu_n,\delta_n)\} - p^*\right)$$

where $\bar v_0^T := (v_1^0,\dots,v_N^0)$ and $\pi_{P^*}\{\cdot\}$ denotes the projection onto $P^*$. From (4.76) it follows that

$$0 \ge \left(p^*(\mu_n,\delta_n) - p^*\right)^T\left[I + \bar V^T\bar V\right]p^*(\mu_n,\delta_n) - \frac{\mu_n}{\delta_n}\,\|\bar v_0\|\,C\,(\cdots)$$

For $n \to \infty$ and in view of assumption (4.67), we finally obtain

$$0 \ge \left(p_\infty - p^*\right)^T\left[I + \bar V^T\bar V\right]p_\infty \qquad (4.81)$$

for any partial limit $p_\infty$. This inequality can be interpreted as the necessary condition of optimality of the strictly convex function $f(x) := \frac{1}{2}x^T\left(I_N + \bar V^T\bar V\right)x$ on the convex compact set $P^*$ [26]:

$$(x - x^*)^T\nabla f(x^*) \ge 0\quad \forall x \in P^*,\qquad x^* = \arg\min_{x\in P^*} f(x)$$

But any strictly convex function has a unique minimum. It follows that $x^* = \arg\min_{x\in P^*} f(x) = p_\infty$ is unique (all partial limits of the sequence $\{p^*(\mu_n,\delta_n)\}$ coincide). Because $f(p^*)$ is equal to the weighted norm of the vector $p^* \in P^*$, we conclude that $p_\infty = p^{**}$. Theorem is proved. ∎

The next theorem states the Lipschitzian property of the optimal solution of the penalty function (4.61) with respect to the parameters $\mu$ and $\delta$.

Theorem 7. Under the assumptions of Theorem 6, there exist two positive constants $C_1$ and $C_2$ such that

$$\left\|p^*(\mu_1,\delta_1) - p^*(\mu_2,\delta_2)\right\| \le C_1\,|\mu_1 - \mu_2| + C_2\,|\delta_1 - \delta_2|$$
constant C1 and C2 such, that [[P*(~I, 51) -- P*(#2, (~2)1t 0
Notice that tl~n,- ~,112 -< ~'~ wn, 3'n,
II~n, - ',n,+, II ~ *. lien, - ~11 ~ + 2 I1~;, - ~5+11[~
1
2) the mean squares convergence is guaranteed, i.e.,
E
0
for 2 + 6 + U + T = 1, 26 < 1, 0 > 1, 2# > U+~', 26 > U+T (4.104) Proof. Notice that
1
~ : 0(~) From the conditions of theorem 7 and (4.102) follows the desired result. • The next theorem gives the estimation of the order of convergence rate of this optimization algorithm. T h e o r e m 9. Under the conditions of the previous theorem and for the class of parameters (4.102) there exist u > 0 such that W~ :
o~o
,E
14
:o
(1)
(4.1o5)
146 4.8. Optimization algorithm where the order y o f convergence rate satisfies the f o l l o w i n g upper estimation 1
u < u*('y,) 0 For independent centered normal distributed noises, it follows:
(5.5)
5.2. Optimization of Nonstationary Functions
163
ft[u[i))
uli)
) C t
FIGURE 5.1. Nonstationary function.
n1 ~-~.?t(i ) _= n ~ t~l
[ft(u(i)) + E {~t/ut -= u(i) A .~'l- 1}]
----
t~l
1 '~ = - ~_~ (u(i) - u* sin cot - e) 2 = (u(i) - c) 2 n
t=l
--2 (u(i) -- c) u* _1 ~ n
sin cot + (u*) 2 i ~ n
t=l
sin2 cot
t=l
and hence, assumption ( H 2 ) is fulfilled 1
r~
n ~?t(i)
-'~ (u(i)
7t$---}00
c) 2 + (u*)2 := B i ) 2
t=l
E x a m p l e 2 Independent Gaussian noises .Af(atcr2) where at is periodi-
cally time-varying and satisfies the .following constraint 1
'~ t=l
For this example, the previous assumptions hold. T h r o u g h o u t this chapter we will assume t h a t these two assumptions ( ( I l l ) a n d ( I t 2 ) ) are satisfied. Under the~e assumptions we will be inter-
164 5.3. Nonstationary learning systems ested in the following stochastic optimization problem given on a finite set: lim 1 -
n~o~
n
t=l
y~ --~ inf {ut}
(5.6)
This stochastic optimization problem (5.6) presents the correct mathematical formulation of the nonstationary optimization problem (5.1), using the observations (5.2). Hereafter we will associate these observations of the disturbed flmction to be optimized with the response of an environment response. Consequently, we can associate this stochastic optimization problem on finite sets (5.6) to the behaviour of a learning automaton operating in a random environment which is asymptotically stationary in average i.e.,
yt = ~
(5.7)
The next section presents an overview of nonstationary environments.
5.3
N o n s t a t i o n a r y learning s y s t e m s
Learning systems are an efficient tool to deal with a large number of engineering problems [5]-[6]-[7]-[8]. A learning system interacts with an environment and learns the optimal action which the environment offers. Most of the available studies relate to the behaviour of learning automata in stationary environments [5]-[8]. The problem concerning the behaviour of learning automata in nonstationary environments is difficult, and few results are known [7]-[91-[101-[111-[12]-[13]. Narendra and Viswanathan [13] considered periodically changing nonstationary random medium with unknown period. A nonstationary environment in which one action u s continues to have the minimum penalty ca even though all the penalty probabilities keep changing with time, i.e., cry(t, w)+ 6 < cj(t, co) holds for some ~, some 6 > 0, and for all j (j # a), and for all random factors co; has been studied by Baba and Sawaragi [14]. These results have been extended to the nonstationary multi-teacher environment [12]. Several basic norms (expediency, optimality, etc.) of the learning behaviour of stochastic automaton in multi-teacher environment are given in [12]. A variable-structure stochastic automaton where the penalty probabilities have been assumed to be time-varying and depending on the input action chosen, wz~s introduced by Narendra and Thathaehar [15]. Srikantakumar and Narendra [9] have developed an adaptive routine in telephone networks using learning methods. They have considered a nonstationary environment for which reward probabilities ci (p) and their derivatives are Lipschitz functions of all their arguments and
Oci (p) 0pi
>0Viand
Ocj (p) Oci (p) for j # i Opi <
0 Vi=l,...,N). 5.7.1
BUSH-I~'IOSTELLER REINFORCEMENT SCHEME
The Bush-Mosteller scheme [18]-[5] is described by:
pn+l = pn +'y~ [e(un) - pn +'~n(eN - N e ( u , ~ ) ) / ( N - 1 ) ]
(5.40)
5.7. Reinforcement Schemes with Normalization Procedure
177
where
~ ~(~)
• --
[0,1], 8 ~ • [ 0 , 1 ] (0,..., 0,1, 0,..., 0) T , u~ = u(i) i
eN
---- (1, ..., I)T e R g
T h e reinforcement scheme above is frequently used in the case of stat i o n a r y environments. We shall analyze its behaviour in a s y m p t o t i c a l l y s t a t i o n a r y (in t h e average sense) mediums. This question is addressed in t h e following t h e o r e m . 1. For the Bush-Mosteller scheme (5.40), condition (5.35) is satisfied, and if assumptions (H3) and (H4) hold, and the optimal action is single, i.e.,
Theorem
min ~ ( i ) := ~* > 0
(5.41)
then,, the properties (['1) and (P2) are fulfilled and the loss function • ,~ tends to its minimal possible value "5(a) (5.31), with probability one.
Proof. Let us consider t h e following estimation:
p,~+l(i) = p,~(i) + ~ [ X ( u ~ = u(i) ) - p,~(i)+
[1 - NX(u,~ = u ( i ) ) ] / ( N - 1)] =
= p~(i) (1 - ~ ) +
n
~ Pn(i) ( ] - - - ~ f n ) ~ - - " ' ' - - ~ p i ( i ) l - I ( 1 - - ~ t ) t=l
F r o m l e m m a A.4 [5], it follows t h a t
H(!_~)_ t=l
>
~ (~-~V ~?j [
~-~ ,
, a>~
7=1,
(0,1) a>O
(5.4~)
178 5.7. Reinforcement Schemes with Normalization Procedure Substitution of (5.43) into (5.42) leads to the desired result (5.35). In view of (5.27) and assumptions (H1), (H2) we deduce Ar~(~)
=
( I - - 1 ) ~,~_1(c0 + 1A'~n -< a.s.
1
1 -constn
< (1 - n ) ~n_l(a) 71-
Using Lemma A.3-2 (see Appendix A) for u~ = A~(a),
as = 1,
~,~ = n1 c ° n s t '
u~ = d -~ (~ • (0, g )
we derive 1 ~'n(~) a~s " 0(--~_~)
(5.44)
The reinforcement scheme (5.40) and (5.44) lead to E {[ 1 - p ~ + l ( a ) ] / $ ' ~ }
a~. [1 - P ~ + I ( a ) ] -
N
A-o N
_
_
p.(i) ( 1 - ~ ) >... >_pl(i)l-[(1-~)
(5.48)
+'y,,(1-$,~)Z(u,,=u(i)) >__
t=l
180 5.7. Reinforcement Schemes with Normalization Procedure Substituting (5.43) into (5.48) leads to the desired result (P1). In view of (5.27) and assumptions (H3) and (H4), and according to the proof of the previous theorem, we deduce that (5.44) ~s fulfilled. Let us consider the following Lyapunov function W,~ .-- 1--p~(c~) p~(~) Taking into account the Shapiro Narendra scheme and (5.44), it follows:
N
Z{WnT1/ffrn} a'~-s"EE{Wnq_l/ffYn_ 1 A1tn i=1
N(
:= ~
+
/t(/)} pn(i)
I.
(
, ~(p = E
+
1
-1)
) )
- 1
p~(i)+
-1
p~(~) =
p~(i)+
.(a)[1 --Tn] + ~/~,(i) p~(a)
( p.(~)
1
-1)
+~. (1-p.(~))
pn(O~)+"/n ow(1)
Replacing A(i) by its minimal value &* leads to the following inequality:
E{l;Vn+l/.Fn}a~.(
1
_
-
p~(~)[1 - ~] + ~ a * p~(~)
(1 -
p~(c~)) p~(c~)
-~ p~(c~) + % (1 [ 1 = w~ 1 - ~ + ~ *
[1 - 3'~] + "Y~ o ~ ( 1 )
pn(oL))
-p~(a) +
- 1) (1 - p~(c~))+ =
P2~(a)[1- "/~] ] +7= oo~(1) = p~(~-S(i - - - ~ ) + ~ +3'n o~(1)
(5.49)
2i2 ~p:l(~) >_p7!(~)lim n~l-[(1-n~) -1 =p~l(~) c t=l
(5.50)
-
for n > no(~v), no(w) < Condition (5.47) gives a.8.
"f~
oo . n
5.7. Reinforcement Schemes with Normalization Procedure
181
Inequalities (5.50) and (5.49) imply E{PVn+I/5,} < W,~ 1 . y ~ + . y n 5 * -- ( 1 - - T n ) + p i l ( a ) o
.
.
.
.
.
.
c+o(1)
+
(1) =
. 1 q- p l l ( a ) e
o(1)
+ % o~(1) =
Taking into account the normalization procedure (A* E (0, 1)), assumption (5.47) and Robbins-Siegmund theorem [23], we obtain a.8.
W,~ --~ 0,
p,~(a) ~
1
The property (P1) is then fulfilled, To prove the fulfillment of property (P2) we use Lemma 1 and Toeplitz temma:
E{Tt 1/.~t-1} a's'=
Tt-1 .... lira _ 1 £
lira-1£ n--,oo
~
rt-*c~
n
t=~ 0
t=rb 0
N
[ 1 - ~(i)] [e(u(i))- lira P~] ~o¢lim p~(i)a.,._
a.~. lim E{Tn/Jrn} = E i=l
°g
-
0
Theorem is proved. The following corollary deals with the constraints associated with the correction factor %. C o r o l l a r y 1. For the correction factor (5.22), assumption (5.47) will be
satisfied if "f
--: 1,
lim
%
II( 1
t=l
- - ")'t) - 1
:
C := a -1
"~
pl(cO
-v..
A
These statements show that, a learning automaton using the ShapiroNarendra reinforcement scheme with the normalization procedure described above, has an asymptotically optimal behaviour. The Varshavskii-Vorontsova reinforcement scheme will be considered in the following.
182 5.7. Reinforcement Schemes with Normalization Procedure 5.7.3
VARSHAVS K I I - V O RONTSOVA REINFORCEMENT SCHEME
The Varshavskii-Vorontsova scheme [20]-[5] is described by:
This scheme belongs to the class of nonlinear reinforcement schemes. We shall analyze its behaviour in asymptotically stationary (in the average sense) mediums. The following theorem cover all the specific properties of the Varshavskii-Vorontsova scheme. Its significance will be discussed below. T h e o r e m 3. For the Varshavskii-Vorontsova scheme (5.51), condition (5.35) is satisfied and if assumptions (H3), (H4) and the condition (5.41) hold, and in addition if the nonstationary environment satisfies the following condition:
b:=
[ 1 - 2~(i)] -1
< 0
(5.52)
then, the properties (P1) and (P2) are fulfilled and the loss function ¢n tends to its minimal possible value "5(a), with probability one. (5.3I) Proof. Let us estimate the lower bounds of the probabilities: p~+, (i) = p~(i) + ~ p ~ e(u~)(1 - 2~)[X(u~ = u(i))
>_ pn(i) [ 1 - ~',~p,~T e(u~)(1
-- 2~',~)
-p~(i)] =
] _>
> p~(i) [1 - ~p~ e(~)] > > p~(i) (1
-
~1 _>... _> p,(i) [[(1
-
~)
t=l
Substituting (5.43) into (5.53) gives us the desired result (5.35).
(5.53)
5.7. Reinforcement Schemes with Normalization Procedure
183
For convergence analysis, let us introduce the following L y a p u n o v function: - p,~(a) W•
1
.-
reinforcement scheme (5.51), (5.44) and simple calculations lead to
The
N
E{ W~,~+1/~,,}a.8. = ZE{W,,+,/-~",~-i Au~ =
N(
,
( +
1
(
1
1 + ~,,~ (1 - p , ~( a ) )
>_~(~), n0(~0)
-1) p~(~)=
pn(a)) p~(a)
p,,(i)
+
p,~(i)=
)
p,~(a) + %~(1 - 2 A ( a ) + o ~ ( 1 ) ) (1 -
for n
u(i)}
i
- 1
)
+ ~,~ o , , ( 1 )
< ~.
Let us now m a x i m i z e the first t e r m in (5.54) with respect to the c o m p o nents p~(i) (i ¢ a ) under the constraint N
~p~(i)
= 1 - p,~(~)
(5.551
To do t h a t , let us introduce the following variables
x, := p~(i),
a,
[1 - 2 ~ ( i ) ] ,
:= ~
i
#
It follows t h a t N
,#~
(
p,~(i) i -- ~fn
)
1
N
=~
1
X,
a,~
F(x) "
(5.56)
-
To m a x i m i z e the function F(x) under t h e constraint (5.55), let us introduce t h e following L a g r a n g e function:
L(x, )0 := F(x) - ;~
xi -
(1 - x,~)
184
5.7.
Reinforcement Schemes with Normalization Procedure
T h e optimal solution (x*, A*) saGsfies the following optimality conditions:
0•
i L(x*'A*)-
1 (1-a~x~)2
A* = 0
Yi¢~
N
L(x*, A* ) =
i i#a
From these optimality conditions, it can be seen t h a t X i --
1 l-x~
1 - x~ )_ 1
V/~- __ (i
N
'
N
al ~ ai_ 1 i¢c~
~ ai 1 i~a
Hence
F(x) % then n
+a
l+a
'~ ) ~ > H (1 "), + 1
(a_~ -
n
~) > \~-~-~]
k=l
b) for "/= 1 and a > 0, then a
n÷a
k=l
Proof. The proof of assertion a) is evident. Using the convexity property of the function x ln x, it follows that (x + Ax) ln(x + ,~x) -- x In x >_ A x ( l + x in x) Taking into account this inequality, we obtain
n +~----'/+ 1
In
1
n-+ ~/ a
dx
>_
1
>
(1-v(k))=exp k=l
In kk=l
1
kT~
-
196 Appendix A: T h e o r e m s a n d L e m m a s
> exp -
{/( )} In
1
7
0
d~
=
\~1
z + a
Lemma is proved. ∎

Lemma A.4-1. Under assumptions (H1) and (H2):

1) the random variables $\Phi^j(\omega)$ $(j = 0,\dots,m)$ are the partial limits of the sequences $\{\Phi_n^j\}$ $(j = 0,\dots,m)$ for almost all $\omega \in \Omega$ if and only if they can be written in the following form:

$$\Phi^0(\omega) = \sum_{i=1}^{N}v_i^0\,p(i) := V_0(p)$$
$$\Phi^j(\omega) = \sum_{i=1}^{N}v_i^j\,p(i) := V_j(p)\quad (j = 1,\dots,m)$$

where the random vector $p = p(\omega) \in S^N$ (defined by equation (4.11)) is a limit point of the vector sequence $f_n = (f_n(1),\dots,f_n(N))^T$ defined by equation (4.12);

2) for almost all $\omega \in \Omega$,

$$\Phi^j(\omega) \in \left[\min_i v_i^j,\ \max_i v_i^j\right]\quad (j = 0,\dots,m)$$

Proof. Let us rewrite $\Phi_n^j$ (equation (4.1)) in the following form:

$$\Phi_n^j = \sum_{i=1}^{N}f_n(i)\,\hat\Phi_n^j(u(i),\omega)$$

where

$$\hat\Phi_n^j(u(i),\omega) = \begin{cases}\left(\displaystyle\sum_{t=1}^{n}\chi(u_t = u(i))\right)^{-1}\displaystyle\sum_{t=1}^{n}\chi(u_t = u(i))\,\xi_t^j(u(i),\omega), & \displaystyle\sum_{t=1}^{n}\chi(u_t = u(i)) > 0\\[2mm] 0, & \displaystyle\sum_{t=1}^{n}\chi(u_t = u(i)) = 0\end{cases}$$

are the current average loss functions for $u_t = u(i)$ $(i = 1,\dots,N)$. According to Lemma A.12 (Najim and Poznyak, 1994) (see also Ash, 1972), for almost all

$$\omega \in B_i := \left\{\omega : \sum_{t=1}^{n}\chi(u_t = u(i)) \xrightarrow[n\to\infty]{} \infty\right\}$$

we have $\lim_{n\to\infty}\hat\Phi_n^j(u(i),\omega) = v_i^j$. For almost all $\omega \notin B_i$ we have

$$\limsup_{n\to\infty}\left|\hat\Phi_n^j(u(i),\omega)\right| < \infty\qquad\text{and}\qquad \lim_{n\to\infty}f_n(i) = 0$$

The vector $f_n$ also belongs to the simplex $S^N$. It follows that any partial limit $\Phi^j(\omega)$ of a sequence $\{\Phi_n^j\}$ can be expressed in the following form:

$$\Phi^j(\omega) = \sum_{i=1}^{N}v_i^j\,p(i) := V_j(p)\quad (j = 0,\dots,m)$$

where $p$ is a partial limit of the sequence $\{f_n\}$ and, consequently, $\Phi^j(\omega) \in [\min_i v_i^j,\ \max_i v_i^j]$ $(j = 0,\dots,m)$. Lemma is proved. ∎

Lemma A.4-2. Let us assume that:

1. the control strategy $\{u_n\}$ is stationary, i.e.,

$$P\left\{\omega : u_n = u(i) \mid \mathcal{F}_{n-1}\right\} \stackrel{a.s.}{=} p(i)$$

2. the random variables $\xi_n^j(u(i),\omega)$ $(j = 0,\dots,m;\ i = 1,\dots,N)$ have stationary conditional mathematical expectations and uniformly bounded conditional second moments, i.e.,

$$E\left\{\xi_n^j(u(i),\omega) \mid u_n = u(i) \wedge \mathcal{F}_{n-1}\right\} \stackrel{a.s.}{=} v_i^j,\qquad E\left\{\left(\xi_n^j(u(i),\omega)\right)^2 \mid u_n = u(i) \wedge \mathcal{F}_{n-1}\right\} \stackrel{a.s.}{\le} \mathrm{const} < \infty\quad \forall i$$

3. each action is selected infinitely often, i.e.,

$$\sum_{t=1}^{n}\chi(u_t = u(i)) \xrightarrow[n\to\infty]{a.s.} \infty$$

Then

$$s_n^j := \frac{\displaystyle\sum_{t=1}^{n}\xi_t^j(u(i),\omega)\,\chi(u_t = u(i))}{\displaystyle\sum_{t=1}^{n}\chi(u_t = u(i))} \xrightarrow[n\to\infty]{a.s.} v_i^j$$

Proof. From the recurrent form of $s_n^j$,

$$s_n^j = s_{n-1}^j\left[1 - \frac{\chi(u_n = u(i))}{\sum_{t=1}^{n}\chi(u_t = u(i))}\right] + \frac{\xi_n^j(u(i),\omega)\,\chi(u_n = u(i))}{\sum_{t=1}^{n}\chi(u_t = u(i))}$$

we derive

$$E\left\{\left(s_n^j - v_i^j\right)^2 \mid \mathcal{F}_{n-1}\right\} \stackrel{a.s.}{\le} \left(s_{n-1}^j - v_i^j\right)^2\left[1 - \frac{\chi(u_n = u(i))}{\sum_{t=1}^{n}\chi(u_t = u(i))}\right]^2 + \left[\frac{\chi(u_n = u(i))}{\sum_{t=1}^{n}\chi(u_t = u(i))}\right]^2\mathrm{Const}(\omega)$$

where $o_\omega(1)$ is a random sequence tending to zero with probability one, and $\mathrm{Const}(\omega)$ is an almost surely bounded and positive random variable. Observe that

$$\sum_{n}\left[\frac{\chi(u_n = u(i))}{\sum_{t=1}^{n}\chi(u_t = u(i))}\right]^2 < \infty\quad a.s.$$

In view of the Robbins-Siegmund theorem (Robbins and Siegmund, 1971) (or Lemma A.9 in Najim and Poznyak, 1994), the previous inequality leads to the desired result. Lemma is proved. ∎

The lemma used in Chapter 5 is stated and proved in the following.

Lemma A.5-1 (the matrix version of Abel's identity).

$$\sum_{t=n_0}^{n}A_tB_t = A_n\sum_{t=n_0}^{n}B_t - \sum_{t=n_0}^{n}\left[A_t - A_{t-1}\right]\sum_{s=n_0}^{t-1}B_s,\qquad A_t \in \mathbf{R}^{m\times k},\quad B_t \in \mathbf{R}^{k\times l}$$

Proof. For $n = n_0$ we obtain

$$A_{n_0}B_{n_0} = A_{n_0}B_{n_0} - \left[A_{n_0} - A_{n_0-1}\right]\sum_{s=n_0}^{n_0-1}B_s = A_{n_0}B_{n_0}$$

The sum $\sum_{s=n_0}^{n_0-1}B_s$ in the previous equality is zero by virtue of the fact that the upper limit of this sum is less than the lower limit. We use induction: we note that Abel's identity is true for $n_0$, we assume that it is true for $n$, and we prove that it is true for $n+1$:

$$\sum_{t=n_0}^{n+1}A_tB_t = \sum_{t=n_0}^{n}A_tB_t + A_{n+1}B_{n+1} = A_n\sum_{t=n_0}^{n}B_t - \sum_{t=n_0}^{n}\left[A_t - A_{t-1}\right]\sum_{s=n_0}^{t-1}B_s + A_{n+1}B_{n+1}$$
$$= \left(A_{n+1}\sum_{t=n_0}^{n+1}B_t\right) - \left(A_{n+1} - A_n\right)\sum_{t=n_0}^{n}B_t - \sum_{t=n_0}^{n}\left[A_t - A_{t-1}\right]\sum_{s=n_0}^{t-1}B_s$$
$$= A_{n+1}\sum_{t=n_0}^{n+1}B_t - \sum_{t=n_0}^{n+1}\left[A_t - A_{t-1}\right]\sum_{s=n_0}^{t-1}B_s$$

Abel's identity is proved. ∎
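The identity can be spot-checked numerically with random matrices; the dimensions below are arbitrary illustrative choices.

```python
# Numerical spot-check of the matrix Abel identity.
import numpy as np

rng = np.random.default_rng(0)
n0, n, m, k, l = 2, 7, 3, 4, 2
A = {t: rng.standard_normal((m, k)) for t in range(n0 - 1, n + 1)}
B = {t: rng.standard_normal((k, l)) for t in range(n0, n + 1)}

S = lambda t: sum((B[s] for s in range(n0, t)), np.zeros((k, l)))  # partial sums

lhs = sum(A[t] @ B[t] for t in range(n0, n + 1))
rhs = A[n] @ S(n + 1) - sum((A[t] - A[t - 1]) @ S(t) for t in range(n0, n + 1))
print("identity holds:", np.allclose(lhs, rhs))
```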
Appendix B: Stochastic Processes

In this appendix we shall review the important definitions and some properties concerning stochastic processes. A stochastic process $\{x_n,\ n \in N\}$ is a collection (family) of random variables indexed by a real parameter $n$ and defined on a probability space $(\Omega,\mathcal{F},P)$, where $\Omega$ is the space of elementary events $\omega$, $\mathcal{F}$ the basic $\sigma$-algebra and $P$ the probability measure. A $\sigma$-algebra $\mathcal{F}$ is a set of subsets of $\Omega$ (a collection of subsets). $\mathcal{F}(x_n)$ denotes the $\sigma$-algebra generated by the set of random variables $x_n$. The $\sigma$-algebra represents the knowledge about the process at time $n$. A family $\mathcal{F} = \{\mathcal{F}_n,\ n \ge 0\}$ of $\sigma$-algebras satisfies the standard conditions: $\mathcal{F}_s \subseteq \mathcal{F}_n \subseteq \mathcal{F}$ for $s \le n$, $\mathcal{F}_0$ is augmented by the sets of measure zero of $\mathcal{F}$, and $\mathcal{F}_n^+ = \bigcap_{s>n}\mathcal{F}_s$.
Let $\{x_n\}$ be a sequence of random variables with distribution functions $\{F_n\}$; we say that:

Definition 7. $\{x_n\}$ converges in distribution (law) to a random variable with distribution function $F$ if the sequence $\{F_n\}$ converges to $F$. This is written $x_n \xrightarrow{law} x$.

Definition 8. $\{x_n\}$ converges in probability to a random variable $x$ if, given $\varepsilon,\delta > 0$, there exists $n_0(\varepsilon,\delta)$ such that for all $n > n_0$, $P(|x_n - x| > \varepsilon) < \delta$. This is written $x_n \xrightarrow{P} x$.

Definition 9. $\{x_n\}$ converges almost surely (with probability 1) to a random variable $x$ if, given $\varepsilon,\delta > 0$, there exists $n_0(\varepsilon,\delta)$ such that $P(|x_n - x| < \varepsilon\ \ \forall n > n_0) > 1 - \delta$. This is written $x_n \xrightarrow{a.s.} x$.

Definition 10. $\{x_n\}$ converges in quadratic mean to a random variable $x$ if $\lim_{n\to\infty}E\left[(x_n - x)^T(x_n - x)\right] = 0$. This is written $x_n \xrightarrow{q.m.} x$.

The relationships between these convergence concepts are summarized in the following:

1. convergence in probability implies convergence in law;
2. convergence in quadratic mean implies convergence in probability;
3. convergence almost surely implies convergence in probability.

In general, the converse of these statements is false. Stochastic processes such as martingales have extensive applications in stochastic problems. They arise naturally whenever one needs to consider mathematical expectations with respect to increasing information patterns. They will be used to state several theoretical results concerning the convergence and the convergence rate of learning systems.

Definition 11. A sequence of random variables $\{x_n\}$ is said to be adapted to the sequence of increasing $\sigma$-algebras $\{\mathcal{F}_n\}$ if $x_n$ is $\mathcal{F}_n$-measurable for every $n$.
Definition 12. A stochastic process $\{x_n\}$ is a martingale if $E\{|x_n|\} < \infty$ and $E\{x_{n+1}\mid\mathcal{F}_n\} \stackrel{a.s.}{=} x_n$.

Definition 13. A stochastic process $\{x_n\}$ is a supermartingale if $E\{x_{n+1}\mid\mathcal{F}_n\} \stackrel{a.s.}{\le} x_n$.

Definition 14. A stochastic process $\{x_n\}$ is a submartingale if $E\{x_{n+1}\mid\mathcal{F}_n\} \stackrel{a.s.}{\ge} x_n$.
The following theorems are useful for convergence analysis.

Theorem (Doob, 1953). Let $\{x_n,\mathcal{F}_n\}$ be a nonnegative supermartingale such that $x_n \ge 0$ and $\sup_n E\{x_n\} < \infty$. Then there exists a random variable $x$ (defined on the same probability space) such that $E\{x\} < \infty$ and $x_n \xrightarrow{a.s.} x$.