MARKOV PROCESSES AND LEARNING MODELS
This is Volume 84 in MATHEMATICS IN SCIENCE AND ENGINEERING A series of monograp...
26 downloads
602 Views
3MB Size
Report
This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!
Report copyright / DMCA form
MARKOV PROCESSES AND LEARNING MODELS
This is Volume 84 in MATHEMATICS IN SCIENCE AND ENGINEERING A series of monographs and textbooks Edited by RICHARD BELLMAN, University of Southern California The complete listing of the books in this series is available from the Publisher upon request.
MARKOV PROCESSES AND LEARNING MODELS M. FRANK NORMAN Department of Psychology University of Pennsylvania Philadelphia, Pennsylvania
@
ACADEMIC PRESS New York and London
1972
COPYRIGHT 0 1972, BY ACADEMIC PRESS, INC. ALL RIGHTS RESERVED NO PART OF THIS BOOK MAY BE REPRODUCED IN A N Y FORM, BY PHOTOSTAT, MICROFILM, RElXIEVAL SYSTEM, OR ANY OTHER MEANS, WITHOUT W R m N PERMISSION FROM THE PUBLISHERS.
ACADEMIC PRESS, INC. 111 Fifth Avenue, New
York, New York 10003
United Kingdom Edition published by ACADEMIC PRESS, INC. (LONDON) LTD.
24/28 Oval Road, London NWI
LIBRARY OF CONaRESS CATALOQ CARD NUMBER:70-182638
AMS (MOS) 1970 Subject Classifications: 60B10,60F05, 60J05,60J20,60J70,92A25; 60J35,60J60,62M05,62M10, 62M15,92A10 PRINTED IN THE U N m STATES OF AMERICA
TO Sandy
This page intentionally left blank
0 Contents
xi
PREFACE
0 Introduction 0.1 Experiments and Models 0.2 A General Theoretical Framework 0.3 Overview
1 12 13
Part I DISTANCE DIMINISHING MODELS 1 Markov Processes and Random Systems with Complete
Connections 1 . 1 Markov Processes 1.2 Random Systems with Complete Connections
21 24
2 Distance Diminishing Models and Doeblin-Fortet Processes 2. I Distance Diminishing Models 2.2 Transition Operators for Metric State Spaces
30 37 Vii
...
CONTENTS
Vlll
3 The Theorem of Ionescu Tulcea and Marinescu, and Compact Markov Processes 3.1 3.2 3.3 3.4 3.5 3.6 3.7
4
A Class of Operators The Theorem of Ionescu Tulcea and Marinescu Compact Markov Processes: Preliminaries Ergodic Decomposition Subergodic Decomposition Regular and Absorbing Processes Finite Markov Chains
Distance Diminishing Models with Noncompact State Spaces 66 71
4.1 A Condition on p 4.2 Invariant Subsets
5
43 45 50 52 56 61 63
Functions of Markov Processes
5.1 5.2 5.3 5.4 5.5 5.6 5.7
Introduction Central Limit Theorem Estimation of pu Estimation of u2 A Representation of u2 Asymptotic Stationarity Vector Valued Functions and Spectra
13 75 80 84
90 93 94
6 Functions of Events 6.1 Theprocess X,,'=(En,Xn+,) 6.2 Unbounded Functions of Several Events
98 100
Part I1 SLOW LEARNING 7 Introduction to Slow Learning 7.1 Two Kinds of Slow Learning 7.2 Small Probability 7.3 Small Steps: Heuristics
109 111 114
ix
CONTENTS
8 Transient Behavior in the Case of Large Drift 8.1 8.2 8.3 8.4 8.5
A General Central Limit Theorem Properties of f ( t ) Proofs of (A) and (B) Proof of (C) Near a Critical Point
116 120 124 126 133
9 Transient Behavior in the Case of Small Drift 9.1 Diffusion Approximation in a Bounded Interval 9.2 Invariance 9.3 Semigroups
137 141 146
10 Steady-State Behavior 10.1 A Limit Theorem for Stationary Probabilities 10.2 Proof of the Theorem 10.3 A More Precise Approximation to E(X.7
152 154 157
1 1 Absorption Probabilities 11.1 Bounded State Spaces 11.2 Unbounded State Spaces
163 169
Part 111 SPECIAL MODELS 12 The Five-Operator Linear Model 12.1 12.2 12.3 12.4
Criteria for Regularity and Absorption The Mean Learning Curve Interresponse Dependencies Slow Learning
176 179 183 189
13 The Fixed Sample Size Model 13.1 13.2 13.3 13.4
Criteria for Regularity and Absorption Mean Learning Curve and Interresponse Dependencies Slow Learning Convergence to the Linear Model
196 199 202 205
CONTENTS
X
14 Additive Models 14.1 14.2 14.3 14.4 14.5
Criteria for Recurrence and Absorption Asymptotic A , Response Frequency Existence of Stationary Probabilities Uniqueness of the Stationary Probability Slow Learning
21 1 214 217 218 221
15 Multiresponse Linear Models 15.1 Criteria for Regularity 15.2 The Distribution of Y. and Y,
226 229
16 The Zeaman-House-Lovejoy Models 16.1 A Criterion for Absorption 16.2 Expected Total Errors 16.3 The Overlearning Reversal Effect
235 237 239
17 Other Learning Models 17.1 17.2 17.3 17.4
Suppes’ Continuous Pattern Model Successive Discrimination Signal Detection: Forced-Choice Signal Detection: Yes-No
244 241 249 252
18 Diffusion Approximation in a Genetic Model and a
Physical Model
18.1 Wright’s Model 18.2 The Ehrenfest Model
257 262
References
263
LISTOF SYMBOLS
269
INDEX
21 1
0 Preface
This monograph presents some developments in probability theory that were motivated by stochastic learning models, and describes in considerable detail the implications of these developments for the models that instigated them. No attempt is made to establish the psychological utility of these models, but ample references are provided for the reader who wishes to pursue this question. In doing so he will quickly become aware that the difficulty of deriving predictions from these models has prevented them from developing to their full potential, whatever that may be. Since I am more a probabilist than a psychologist, I am in a position to regard this difficulty as a challenge rather than a nuisance. The book has four main parts: Introduction (Chapter 0), Distance Diminishing Models (Part I), Slow Learning (Part I]), and Special Models (Part 111). Parts I and I1 and Chapter 14 in Part I11 (Additive Models) constitute the theoretical core. From the mathematical point of view, Part I develops a theory of Markov processes that move by random contraction of a metric space. Transition by random translation on the line is considered in Chapter 14. Part I1 presents an extensive theory of diffusion approximation of discrete time Markov processes that move “by small steps.” Parts I and I1 are practically independent, so it would be almost as natural to read them in reverse order. Chapters 12-16 of Part Ill consider the various special models described in the Introduction in the light of Parts I and 11. In addition, some xi
xii
PREFACE
special properties of these models are obtained by ad hoc calculation. Chapter 17 takes up a number of other learning models briefly, and Chapter 18 spells out the implications of Part I1 for a population genetic model of S. Wright and for the Ehrenfest model for heat exchange. The chapters of Part I11 are almost independent, except that Chapter 13 depends heavily on Chapter 12. Open mathematical problems are easily visible throughout the book, but especially in Chapters 14-16. The mathematical prerequisites for close reading of Parts I and I1 are analysis (integration, metric topology, functional analysis) and probability at about the first-year graduate level. The texts of Royden (1968) and Breiman (1968) would be excellent preparation. An acquaintance with the language and notation of these subjects will suffice if the reader is willing to skip most of the proofs. The prerequisites for Part I11 (excepting Chapter 14) are less stringent, and it can be used for reference purposes without studying Parts I and I1 systematically. No previous exposure to mathematical learning theory is assumed, though it would be useful. The glimpse of this subject in Chapter 0 is adequate preparation for the rest of the book. The reader is referred to Atkinson, Bower, and Crothers (1965) for a balanced and thorough introduction. Those who are familiar with mathematical learning theory will notice that the emphasis on “continuous state models” in this book, at the expense of “finite state models,” is the reverse of the emphasis in the psychological literature. In particular, the book makes no contribution to the analysis of models with very few states of learning. These models are quite well understood mathematically, and they have been extremely fruitful psychologically. Concerning the numbering of formal statements, Theorem 5.2.3 is the third theorem of Section 5.2. Within Chapter 5, it is referred to as Theorem 2.3. Equations, lemmas, and definitions are numbered in the same way. The symbol Isignifies the end of a proof. I have not attempted to reconstruct the history of the topics treated. In most cases, only the main source or immediate antecedent of a development is cited in the text. These sources often give additional information concerning prior and complementary results. The publications of the following individuals deserve special mention in connection with the parts indicated: C. T. Ionescu Tulcea (Part I), A. Khintchine (Part 11), and R. R. Bush and F. Mosteller (Part 111). Much of the research reported in this volume is my own, and a significant portion appears here for the first time. My work has been generously supported by the National Science Foundation under grants GP-7335 and GB-7946X to the University of Pennsylvania. Most of the writing was done in particularly pleasant circumstances at Rockefeller University on a Guggenheim Fellowship.
PREFACE
...
xlll
It is a pleasure to acknowledge the encouragement and assistance of a number of my teachers and colleagues, especially R. C. Atkinson, K. L. Chung, W. K. Estes, M. Kac, S. Karlin, and P. Suppes. I am also much obliged to R. Bellman for suggesting that I write a book for this series. The manuscript was expertly typed by Mrs. Maria Kudel and Miss Mary Ellen O’Brien. Finally, the book is dedicated to my wife, Sandy, in gratitude for her inexhaustible affection and patience.
This page intentionally left blank
0 0 Introduction
0.1. Experiments and Models In this section we describe a few more or less standard experimental paradigms, and certain special models for subjects’ behavior in such experiments. These models have stimulated and guided the development of the general mathematical theories presented in Parts I and I1 of this volume. In addition, they are of substantial psychological and mathematical interest in their own right. The implications of Parts I and I1 for these models, as well as numerous results of a more specialized character, are collected in Part 111. All but one of the experiments to be described consist of a sequence of trials. On each trial the subject is presented with a stimulus configuration, makes a response, and an outcome ensues. There are at least two response alternatives, and there may be infinitely many. The outcome or payoff typically has obvious positive or negative value to the subject, and would be expected to influence future responses accordingly. If an outcome raises, or at least does not lower, the probability of a response in the presence of a stimulus, it is said to reinforce the association between stimulus and response. The probabilities of different stimuli, and the conditional probabilities of 1
2
0. INTRODUCTION
various outcomes, given the stimulus and response, are prescribed by the experimenter and are constant throughout the experiment. Simple learning experiments are those in which only one stimulus configuration is ever presented. The complementary case of two or more stimulus configurations is called discrimination or identification learning, since the subject must distinguish between stimuli if he is to be able to make appropriate responses (i.e., responses that yield preferred outcomes) to each. A. SIMPLE LEARNING WITH TWO RESPONSES
Three of the six models described in this section relate to experiments of this type. The two responses are denoted A and A , ,and, in the most general case considered, response Ai can be followed by either of two outcomes, Oi, or Oio, where Oij reinforces A j . The reinforcement probability parameters are
,
Rij
= P(OijIA3.
Of course, if there is no outcome Oij in a particular experiment, we take nij = 0, and, conversely, if nij = 0, we need not include 0, in the description of the experiment. The notation 0, emphasizes that the outcome 0,, that follows A , and reinforces A l (perhaps presentation of food or money) may be of a totally different character than the outcome Ool that follows A, and reinforces A , (perhaps no food or loss of money). However, it is convenient, for most purposes, to redefine the outcomes in terms of their supposed reinforcing effects. Thus we introduce the new outcomes 0, and 0,, where Oj indicates reinforcement of Aj, irrespective of the preceding response. Thus Ai Oj (“Ai is followed by Oj”) means the same thing as Ai Oij. The trial number is indicated by a subscript. Thus A, or Ojn denotes occurrence of A i or Oj on trial n. We always call the first trial “trial 0,” so n = 0,1,2, .... This is slightly more convenient mathematically than beginning the trial numbers with 1.
EXPERIMENTS. (i) Paired-associate learning (see Kintsch, 1970). A human subject is required to learn a “correct” response to each of a sequence of stimuli presented repetitively, as when a student learns the vocabulary of a foreign language from a deck of cards. Though this is basically a complex discrimination learning experiment with multiple responses, as a first approximation we may focus on successive presentations of the same item, ignoring interitem interactions, and code the subject’s responses on that item as “correct” (A,) or “incorrect” (A,). If the subject is told the correct response after each of his responses, then 0, is the only outcome, R,, = nl0= 1, and we have an example of continuous reinforcement.
3
0.1. EXPERIMENTS AND MODELS
(ii) Prediction experiments (see Estes, 1964; Myers, 1970). A human subject is asked to predict which of two lights, 0 and 1, will flash on each trial. Response A i is prediction of light i, and outcome Oj is flashing of lightj. Monetary payoffs are sometimes used (with, of course, a larger prize for a correct prediction), but more often they are not. The special case of noncontingent outcomes, where the outcome probabilities do not depend on the response made (nl = nol= n), has received most attention experimentally. After performance has stabilized, it is often found that the proportion of A , responses is very close to the probability n that A , is reinforced, at least when the average of several subjects’ data is considered. This probability matching is somewhat surprising, since frequency of correct prediction is maximized by always predicting the light which flashes most frequently. As one might expect, experiments with monetary payoffs tend to produce behavior closer to this optimal strategy. The condition P(A1n) - P(O1n) 4 0 as n+ 00 defines a form of probability matching that is applicable to arbitrary outcome probabilities. Since P(O1n) =
1111 P(A1n)
+ no1(1-p(A1n)),
we have P(Aln)-P(Oln) = (nOl+n10)P(A1n)-*ol. Therefore, if nol +n,, > 0, probability matching in the above sense is equivalent to P ( A l n )+ I, where 1= ~01/(~01+~10)~
Hence we refer to I as the probability matching asymptote. It is a useful reference point for asymptotic performance. (iii) T-Maze experiments (see Bitterman, 1965; Weinstock, North, Brody, and LoGuidice, 1965). On each trial a rat is placed at the bottom of a Tshaped alley, and proceeds to the end of the right or left arm (response A , or A , , respectively), where he may or may not find food. Finding food reinforces the response just made, and finding no food reinforces the other response, though in some variants of this experiment it appears that the effect of not finding food may be nil or almost nil. Rats show no tendency to match response to outcome probability. Practically all of them develop a decided preference for the response with the highest probability of yielding food. (iv) Avoidance learning (see Bush and Mosteller, 1959; Theios, 1963; Hoffman, 1965). A dog or rat may avoid an electric shock if he jumps over
4
0. INTRODUCTION
a barrier ( A , ) shortly after a warning signal. Otherwise (A,) he must clear the barrier to escape the shock. The only possible outcomes are avoidance and shock, and both appear to reinforce jumping, so no, = n 1 = 1 .
,
MODELS. All of the models presented below allow for the possibility that an outcome will have no effect on response probability on some trials. This state of affairs is called ineffective conditioning and denoted C;, otherwise, conditioning is effective (Cl). The probability of effective conditioning is a function only of the preceding response and outcome: cij = P ( C , piO j ) .
In all three models, the effects of a subject’s experience before trial n are summarized by a random variable that represents his probability of making response A , on trial n. In the linear and stimulus sampling models, this In additive models, this variable is called random variable is denoted X,,. p,,, and X,, is a certain transform of p n that is more useful for most purposes. (i) A stimulus sampling model. Stimulus sampling theory was introduced by Estes in 1950. Atkinson and Estes (1963) gave a unified presentation of the theory, and Neimark and Estes (1967) collected many relevant papers. In the models considered here, it is postulated that the stimulus configuration consists of a population of N stimulus elements, each of which is conditioned to either A , or A , . The subject samples s of these without replacement, and makes response A , with probability m / s if the sample contains m elements conditioned to A , . If A j is effectively reinforced, everything in the sample becomes conditioned to A j . If conditioning is ineffective, no element changes its state of conditioning. Various sampling mechanisms have been considered, but it is usually assumed that the sample size s is fixed throughout the experiment, and all elements are equally likely to be sampled. We will restrict our attention to thisfixed sample size model. In the special case s = 1, each “stimulus element” may be interpreted as a pattern of stimulation, and this special case is consequently referred to as the pattern model. Let X,, be the proportion of elements in the stimulus population conditioned to A I at the beginning of trial n. It is easy to compute the conditional probability, given X,,, of any succession of occurrences on trial n. For example, the probability of obtaining a sample with m elements conditioned to A , , making A , and having A, effectively reinforced, is
,
(N2)(
N(1 -Xn> s-m ) m
-~ S
I O C I O .
5
0.1. EXPERIMENTS A N D MODELS
This event produces the decrement AX,, = -m/N in X,,. The variable X, gives the subject’s A I response probability before sampling, in the sense that
(ii) A linear model. Linear models were introduced by Bush and Mosteller in 1951, and reached a high degree of development in a treatise published soon thereafter (Bush and Mosteller, 1955). The effectiveness of conditioning mechanism was first used in conjunction with linear models by Estes and Suppes (1959a). The five-operator linear model postulates a linear transformation
,
( o < e , 1 < I)
x’=(i-e)x+el
(1.2)
of A response probability x for each response-outcome-effectiveness sequence. Rewriting this as x’ -
= e(n-x),
we see that the change in x i s a proportion 8 of the distance to the fixed point A. If conditioning is ineffective, then 8 = 0 and 1is immaterial. If 0, C , occurs, then x’-x>O for all O < x < 1, so that, if 8 > 0 , we must have A = 1. Similarly I = 0 for 0, C , . Thus
AX,
=
eii(1-Xn)
if
Ain01nCln,
--ei0xn
if
A i n 00, C 1 n 9
I 0
if Con.
The conditional probabilities that various operators are applicable are given by expressions like f‘(A I n 00, C1n IXn) =
X n I[ 10 c10
that are common to all three of the models discussed here. If Oij = 1, AinOjnC,,,produces complete learning, in the sense that X,,, I = j , irrespective of X , . This fact permits us to regard certain “all-or-none” models as linear models. Norman (1964) has given an example in which this point of view is rather natural. It is shown in Chapter 13 that the linear and stimulus sampling models are quite closely related. For example, predictions of the stimulus sampling model converge, as N + co and s / N + 8, to those of the linear model with 8, = 8 for all i and j .
6
0. INTRODUCTION
(iii) Additive models. Luce (1959) proposed that the A , response probability variable p be represented
P
=
+00)
01
in terms of the strength ui > 0 of A i . In terms of the relative strength u = u , /uo or x = In u, this becomes
p
=
+
u/(u 1)
=
e“/(e“ + 1).
(1.3)
Beta models postulate that a learning experience multiplies each ui by a positive constant pi. Thus u is multiplied by p = PI /So, and b = In p is added to x . If the experience reinforces A , , then j? 2 1 or b 2 0 ; if it reinforces A , , then /3 < 1 or b < 0. For the jue-operator beta model, the value X,,of x on trial n thus satisfies
AX,, = b,
if A , Oj, C , , ,
(1.4)
where bi, 2 0 and bio < 0. Alternatively, un+1
= Bijvn
(1.5)
or
if AinOjnC,,,,where p i , 2 1 and pi, < 1. Of course, AX,, do,, and Ap, are all 0 if Con. Generalizing slightly, we may consider additive models, in which a learning experience effects a change x’ = x + b in the variable x (equivalently u’ = pu, where u = e“ and j= eb),just as in the beta model, but the A , response probability variable p = p ( x ) need not be of the form (1.3). It is natural to assume that p is continuous and strictly increasing with p ( - 00) = 0 and p ( a ) = 1, though some of our results require less. Given such qualitative restrictions, the precise form of p has remarkably little influence on those aspects of these models’ behavior that are considered in Chapter 14. Fiue-operator additiue models satisfy (1.4) and (1.5). I f p is strictly increasing, there will also be an equation analogous to (1.6).
SYMMETRY. Symmetries in the experimental situation permit us to reduce the number of distinct parameters in these models. One type of symmetry is especially prevalent. In prediction experiments, and in T-maze experiments where the same amounts of food are used to reward left and right choices, it is natural to assume that the response-outcome pairs A , O , and A,O, that involve “success” are equally effective for learning, and that the same is true of the “failure” pairs A , 0, and A o O , . To say that they are equally
7
0.1. EXPERIMENTS A N D MODELS
effective means, first, that they have equal probabilities of producing effective conditioning, thus
cI1 = coo = c
and
col = cl0 = c*.
( 1 *7)
In the linear and beta models, equal effectiveness of two response-outcome pairs also implies that the corresponding operators u and ii on A l response probability x are complementary. This means that the new value 1 - C(x) = 1 - i i ( 1 - ( 1 - x ) )
of 1 - x produced by ii is the same function of 1 - x that u is of x :
u(x) = 1 - a ( 1 - X ) .
If ii(x) = ( I -8)x, then u(x) = 1 - ( i - e ) ( i - x )
= (i-elx+e,
while if D(P) =
BP BP+l-P’
then
Thus complementarity of the success operators and complementarity of the failure operators reduce, respectively, to =
e
and
e,,
=
el,
=
B
and
Po1
=
W l O
b , , = -boo = b
and
b,, = -bl0 = b*
ell = e,,
=
e*
(1.8)
in the linear model, and 811
=
l/BOO
=
B*
or, alternatively, (1.9)
in the beta model. This characterization of complementarity is valid for all additive models with p strictly increasing and p ( - x ) = 1 - p ( x ) . The further condition that success and failure be equally effective, an extremely stringent symmetry condition, would take the form c = c* and, also, 8 = O* in the linear model and b = b* in additive models.? These assumptions are sometimes convenient mathematically, but most of our results do not require them. ?The pattern model is exceptional, since success has no effect and cI1has no role.
8
0. INTRODUCTION
B. SIMPLE LEARNING WITH MANYRESPONSES As in the experiments described previously, the subject confronts the same stimulus configuration on every trial. But on trial n he makes a choice Y,, from a set Y of alternatives that may have more than two elements, perhaps even a continuum of elements. This is followed by an outcome Z,, from a set Z of possibilities. The conditional probability distribution D(Yn9
A)
= P(Zn E AIYn)
of Z , given Y,, is normally specified by the experimenter, and, at any rate, it does not vary over trials. Linear models for such experiments take the following form. Let X,, be a subject’s choice distribution on trial n ; i.e., p ( y n ~ A I x n= ) xn(A),
where A is any (measurable) subset of Y . For every response-outcome pair e = ( y , z ) , there is a 0 < 8, < 1 and a probability I, on Y such that, if Y,,= y and Z , = z, then
x,,,= (
+
1 - 8 ~ ~ e e I~e .
If 8, > 0, A, represents the asymptote of X,, under repeated occurrence of e. In typical experimental situations, one has enough intuition about this asymptote to place some restrictions on the form of I e . t
A CONTINUOUS PREDICTION EXPERIMENT. Suppes (1959) considered a situation in which a subject predicts where on the rim of a large disk a spot of light will appear. Here Y,, is the subject’s prediction and Z,, is the point subsequently illuminated on trial n. Suppes assumed that Be = 0 does not depend on e and I, = I ( z , .) does not depend o n y . In addition, the distribution I ( z , .) is symmetric about z and has mode at z. Suppes and Frankmann (1961) and Suppes, Rouanet, Levine, and Frankmann (1964) report experimental tests of this model and a closely related stimulus sampling model, also due to Suppes (1960),-that is described in Section 17.1. A pigeon pecks a lighted “key” in a small experiFREE-RESPONDING. mental chamber (or “Skinner box”). Occasional pecks are reinforced by brief presentation of a grain hopper. In the experiments considered here, the experimenter specifies the probability u ( y ) that a peck y seconds after the last one (that is, a y-second interresponse time or IRT) is reinforced. Such experiments do not have trials in the usual sense, but one can consider each response as a choice of an IRT from Y = ( 0 , ~ ) .In Norman’s (1966) linear t A different type of multiresponse linear model has been considered by Rouanct and Rosen berg ( 1964).
9
0.1. EXPERIMENTS A N D M O D E L S
model, Yo is the time until the first response, Y, is the nth IRT for n 2 1, A’, is a subject’s distribution of Y,,, and
1
if
0
otherwise.
Y, is reinforced,
m.
Clearly, n(Y,(1 1) = It is assumed that the entire effect of nonreinforcement is to decrease the rate of responding. Thus
x,,,
= (1
-e*)x, + o*T*
if 2, = 0, where T* has a very large expectation. Reinforcement of a y-second IRT is supposed to result in a compromise between two effects: an increase in the rate of responding, and an increase in the frequency of approximately y-second IRT’s. If T is a probability on Y with a very small expectation, A ( y , .) is a probability on Y with mode near y , and 0 < u < 1, then this compromise is represented by the assumption that
x,,,= (1-8)Xn+
8[(1-a)r+aA(y,
*)I,
if Y, = y and 2, = 1. The case where y is a scale parameter of A ( y , .) [i.e., A ( y , A ) = q ( A / y ) for some probability q on Y ] is especially interesting. Comparison of this model with data from some standard schedules of reinforcement indicates that realistic values of 8 and 8* have 8*/8 extremely small.
c. DISCRIMINATION LEARNING One of the lateral arms of a T-maze is white, the other is black, and the positions of the two are interchangeable. The black arm is placed to the left on a randomly chosen half of the trials. This stimulus configuration is denoted ( B , W ) , the other, (W, B ) . The rat’s choice of arms may be described either with respect to brightness ( B or W ) or position ( L or R). Reward could be correlated with either brightness or position, but let us consider an experiment in which the rat is ,fed if and only if he chooses the black arm. In many experiments of this type, training is continued until performance meets some criterion, and then the correct response is reversed, that is, switched to W . An interesting question is the effect of extra trials before reversal on the number of errors in the new problem. Though such overtraining leads to more errors early in reversal, it sometimes produces fewer total errors in reversal-the overlearning reversal efect. Clearly the rat must learn to observe or attend to brightness, the relevant stimulus dimension, if he is to be fed consistently. And, in so far as perceptual learning of this type takes place in overtraining, an overlearning reversal
10
0. INTRODUCTION
effect is a possibility. The concept of attention to a stimulus dimension is central to the model described below. This model is a specialization, to the case of two stimulus dimensions, of a theory proposed by Zeaman and House (1963) in the context of discrimination learning in retarded children. See Estes (1970, Chapter 13) for a full discussion of the model in that context. The animal is supposed to attend to either brightness (br) or position (PO), and u denotes the probability of the former. If he attends to brightness, he chooses black with probability y. If he attends to position, his probability of going left is z. To summarize: y = P(BIbr),
u = P(br),
and
z = P(L1po).
The values of u, y, and z on trial n are V,, Y,, and Z,. Trial-to-trial changes in these variables are determined by the same considerations that govern changes in response probability in simple learning with two responses. The probability v of attending to brightness increases if the rat attends to brightness and is fed, or does not attend to brightness and is not fed. Otherwise it decreases. The conditional response probabilities y and z change only if the rat attends to the corresponding dimension. Since food is always associated with B, y = P(B1br) increases whenever the rat notices brightness. If he attends to position on a (B, W) trial, z = P(Llpo) increases, since food is (fortuitously!) on the left. On (W, B) trials, z decreases. These changes are summarized in Table 1, which also gives the conditional probabilities of various events, given V, = u, Y, = y , and Z , = z. Choices are specified only as B or W , but, since stimulus configurations are given, laterality is implicit. TABLE 1 EVENTEFFECTS A N D PROBABILITIES I N A DISCRIMINATION LEARNING MODEL" ~~
Event
.
U
~
Y
Z
Probability
0 0 0 0
VYP U(l-Y)/2 VYI2 4 1- Y I P (1 - u) z / 2 (1 - u)(l - 4 1 2 (l-u)(l-z)/2 (1-v)2/2
+ +
-
-
"Notation: + indicates an increase, - a decrease, 0 no change.
11
0.1. EXPERIMENTS AND MODELS
Zeaman and House stipulated that all of these changes are effected by linear operators like those in two-choice simple learning models, and, further, that there are only two learning rate parameters, one for u and one for y and z. We will forego the latter restriction at this juncture. Thus the first two entries under u in Table 1 mean that
where 0 < 42< 1. This rather complex model describes the rat’s choices with respect to both brightness and position. For example, P(BIu,y,z) = P(B1br)u = yu
+ P(B1po) (1 -u)
+ +(l - u ) ,
and, similarly, P(LIu,y,z) = z(1-u)
so that
+ +u,
+
P(Bn) = E(YnVn) E(1 - v , ) / 2 and P(Ln) = E(Zn(1-K))
+ E(V,)/2*
Note that the probability of a black (hence correct) response depends only on u and y. There is a simpler description of the transitions of these variables than that given in Table 1. Collapsing the pair of rows corresponding to each attention-response specification, we obtain the reduced model of Table 2. This reduction presupposes a natural lateral symmetry condition : The operators on u (and y) in the rows to be combined must have the same learning rate parameter, as well as the same limit point. TABLE 2 EVENT EFFECTS AND
PROBABlLmES FOR THE
REDUCEDMODEL
12
0. INTRODUCTION
This model was proposed by Lovejoy (1966, Model I) as an explanation of the overlearning reversal effect. Though he noted that his theory was “quite close” to that of Zeaman and House, the precise nature of the relationship seems to have eluded him, since he later faulted his model for having “. . .completely disregarded the fact that sometimes an animal chooses one side and sometimes the other ...” (Lovejoy, 1968, p.17). This was held to be a serious omission because of the frequent occurrence of strong lateral biases or “position habits” early in acquisition. Within the full model of Table 1, these would be reflected in values of v near zero and values of z near zero or one. Henceforth we will refer to both the full and reduced models described in this subsection as Zeaman-House-Lovejoy (or ZHL) models. The merits and limitations of these models are discussed in Sutherland and Mackintosh’s (197 1) treatise on discrimination learning in animals. The experimental situation that we have considered is an example of a simultaneous discrimination procedure, since both values of the relevant brightness dimension are present on each trial. In the comparable successive procedure, food would be available on the left, say, if both lateral arms are black, and on the right if both are white. A model for successive discrimination, due to Bush (1965) and built on the same psychological base as the ZHL models, is presented in Section 17.2.
0.2. A General Theoretical Framework All of the examples given in the preceding section have the following structure. At the beginning of trial n, the subject is characterized by his state of learning X,,, which takes on values in a state space X . On this trial, an event En occurs, in accordance with a probability distribution p ( X n , G ) = P(En E GIXJ
over subsets G of an event space E. This, in turn, effects a transformation
xn+1
= U ( X n , En)
of state. In all of the examples, the subject’s response is one coordinate of the event E n , and, in the simple learning models, its outcome is another. Additional coordinates are as follows: number of elements in the stimulus sample conditioned to A , (fixed sample size model), effectiveness of conditioning (two-choice simple learning models), state of attention (ZHL models), and stimulus configuration (full ZHL model). In all of the two-choice simple learning models, A’,, can be taken to be a subject’s A , response probability, though we prefer a certain transform of this variable for additive models. Thus A’,, is one dimensional. I n the multichoice models, X,, is the choice distribution on trial n. Even though X,, is
13
0.3. 0 VER VIEW
multidimensional in this case, all of its “coordinates” X,,(A) are of the same type-probabilities of sets of responses. Thus all of our examples of simple learning models are considered uniprocess models. The coordinates of the state variables and xn = (5,Y,) of the ZHL models, on the other hand, do not possess this homogeneity. One of them, 5 , describes a perceptual learning process, while the others describe response learning, but under the influence of different stimuli. These are, therefore, examples of multiprocess models. Though there has been a tendency for uniprocess and multiprocess models to be used in conjunction with simple and discrimination learning, respectively, the association is not inevitable. For an example of a multiprocess simple learning model, see Bower (1959). In Sections 17.3 and 17.4, we consider the uniprocess models of Atkinson and Kinchla (1965) and Kac (1962, 1969) for signal detection experiments with two stimulus conditions. Intermediate in generality between the various special models introduced in Section 0.1 and the general framework described above are two classes of models that will play a prominent role in subsequent developments. The finite state models are those whose state spaces have only a finite number of elements. In the fixed sample size model, for example, Xn =
(5,Yn, 2,)
x = { j / N : j = O , ...,N}. In distance diminishing models, X is a metric space with metric d. Typically, all of the event operators u( ., e ) are nonexpansive:
d(u(x,e),u(y,e))< d ( x , y )
for all x a n d y ,
and some are contractive:
d(u(x,4,u ( y , 4) < d ( x , y ) if x z Y . The precise definition of this class of models is given in Section 2.1. For example, the operators (1.2) of the five-operator linear model satisfy ~x’-Y’l = (1 -e)lx--,q
.
Such an operator is nonexpansive, and contractive if 8 > 0. Given slight restrictions on their parameters, all of the uniprocess linear models discussed above, as well as the reduced ZHL model, are distance diminishing.
0.3. Overview In the general theoretical framework described at the beginning of the last section, the sequence X,, of states is a Markov process. The event process En is not Markovian, but the sequence X,,’ = (En,X,,, of event-state pairs
14
0. INTRODUCTION
is. When E,, includes a specification of the subject's response Y, on trial n, as is usually the case, we may consider Y, as a function either of E,,, Y,, = g(E,,), or of the Markov process X,,', Y,, =f(X,,'). Thus the study of learning models quickly leads to Markov processes. Most of this volume is given over to the systematic investigation of Markov processes arising in or suggested by learning models. Considerations specific to events and responses provide a closely related secondary focus. Some of our results are rather general, and this generality has been emphasized in order to heighten mathematical interest and facilitate use in areas other than learning theory. But applications of general theorems to particular learning models and specific psychological problems are not neglected, and we have not hesitated to include results applicable only to special models. In addition to this multiplicity of levels of abstraction, a variety of different mathematical techniques and viewpoints are employed. We have been quite impressed by the range of analytic and probabilistic tools that yield important insights into the behavior of a few simple models. Here is a brief survey of the contents of the book. Part I is concerned with distance diminishing models and the much simpler finite state models. Chapter 1 gives background material on Markov processes in abstract spaces and on our general theoretical framework. Chapter 2 provides preliminary results for distance diminishing models, and for a class of Markov processes in metric spaces-Doeblin-Fortet processes-that includes their state seThe ergodic theory of compact Markov processes, that is, quences x,,. Doeblin-Fortet processes in compact state spaces, is presented in Chapter 3. This theory is completely analogous to the theory of finite Markov chains, which is a special case. Some comparable results for distance diminishing models in noncompact state spaces are given in Chapter 4. Chapter 5 contains a law of large numbers, a central limit theorem, and some estimation techniques for certain bounded real valued functionsf(X,) of regular Markov processes X,,. This theory is applied in Chapter 6 to the processes X,,'= (En,X,,,,) in distance diminishing and finite state models for which X,, is regular. Part I1 deals with slow learning, to which Chapter 7 gives a full introduction. To study learning by small steps, we consider a family X." of Markov processes indexed by a parameter 8 such that AX." = O(@, and take limits as 8+0. Diffusion approximations to the distribution of A': for the transient phase of the process, when n is not too large, are obtained in Chapters 8 and 9. Approximations to stationary distributions and absorption probabilities are considered in Chapters 10 and 11, respectively. The form of these approximations is determined by the drqt E(AX."IX: = x), and by the conditional variance (or covariance matrix in higher dimensions) of AX.", given :'A = x. Some special considerations apply to the case of small (O(8')) drift.
15
0.3. 0 VER VIEW
In Part TI1 the methods of Parts I and IT and some special techniques are applied to various special models. In order to gain a definite impression of the types of results obtained in Part 111 (and, therefore, in Parts I and I1 as well), let us consider some examples pertaining to the symmetric case of the five-operator linear model. In this model, A response probability satisfies the stochastic difference equation
’ AX,,
=
.
with probability X n n l , c ,
O(I-X,,),
- e*x,,
with probability X,n,, c*, with probability ( ~ - X , , ) T E , ~ C * ,(3.1)
8*(1-X,),
- OX,, ,
with probability (1 - A’,) no, c ,
, o
otherwise.
for which f ( 0 ) = x . Then xn
-f(ne)
=
O(@,
as long as n8 remains bounded. Furthermore, var(X,,) = O ( 8 ) , and the distribution of approaches normality as 8 + 0. The results that follow require certain restrictions on the model’s parameters. Suppose, for simplicity, that success is effective, in the sense that 8c > 0, and that either outcome can follow either response (nij > 0 for all i andj). The cases of effectivefailure (0*c* > 0) and ineffectivefailure (B*c* = 0) must be distinguished. The former arises more frequently in practice.
16
0. INTRODUCTION
When 8*c* > 0, the process X,, has no absorbing states and is regular, so that the distribution of X,, converges (weakly) to a limit p that does not depend on Xo = x . Let x , be the expectation of p : lim xn =
xm =
n-1 m
I
Yp(dY),
and let A,, be a subject’s proportion of A , responses in the first n trials. Then A,,, is asymptotically normally distributed as n -P co, with mean xm and variance proportional to l / n . The proportionality constant o2 can be consistently estimated on the basis of a single subject’s data. The asymptotic A , response probability x , is bounded by the probability matching asymptote 1 = no,/(no,+nl0) and the unique root I in (0,l) of the quadratic equation w ( I ) = 0. In fact, if A is the “better” response, in the sense that n I 1> roo(or, equivalently, no, > nl0), and if 8* < 1, then
,
1 < X,
e*c*,
1 = X, = I
if 8c
1 > X,
if 8c .c 8*c*.
> Iz
= tI*c*,
The quantity I is an appropriate approximation to x, when 8 and 8* are small. For w and thus I are independent of 8 along any line 8*/8 = k, and x, +I as 8 -+ 0. In addition, the asymptotic distribution p of A’,, is approximately normal with variance proportional to 8 when 8 is small. The behavior of X,, is radically different when O*c* = 0. Both 0 and 1 are absorbing states, and P,(lim X,, = 0 or 1) = I n- m
for any initial state x . The probability
4(x)
= P,( lim X,, = 1) n- m
that the process is attracted to 1 is fundamental. When roo= n 11 , 4 ( x ) = x . We now describe an approximation to 4 that is valid when 0 is small and noo-n,l = o(e). Suppose that n,, is fixed and that roo approaches it along a line ((.oo In1 1) - l)/O = k
9
where k#O. This constant is a measure of the relative attractiveness of A . and A,. It follows that
E(AX,,JX,,= x ) = e2a(x)
17
0.3. 0 VER VIEW
and
E ( ( A X , J ~ I= X x~ )
= 02b(x)
+ 0(e3),
where b ( x ) = n , , c x ( l - x ) and a ( x ) = - k b ( x ) . As 0 4 0 , ~ ( X ) + $ ( X ) where , $ ( x ) is the solution of the differential equation 1 2
d2$
- b ( x ) 2( x )
with $ ( O ) = O
dx
d$ + a(x)(x) = 0 dx
and $ ( l ) = 1 ; i.e., $ ( x ) = ( e 2 k x -l)/(e2k- 1 ) .
Note that
$(f)
=
l/(ek++),
which is very small when k is large. Thus, if a subject has no initial response bias, and learning is slow, it is very unlikely that he will be absorbed on A , when A, is much more attractive. This is, of course, just what we would expect intuitively.
This page intentionally left blank
Part I 0 DISTANCE DIMINISHING MODELS
This page intentionally left blank
1 0 Markov Processes and Random Systems with Complete Connections
1.1. Markov Processes
Our starting point is a measurable space (X, a), and a stochastic kernel K defined on X x 99. For every x E X, K ( x , .) is a probability on 99, while for of random every B E $3, K ( .,B) is %measurable. A sequence .?i? = {Xn}nro P),with values in ( X , W), is a Markov vectors on a probability space (Q, 9, process with (stationary) transition kernel K if P(Xn+, EBIXn, . . . , X o ) = K(Xn,B)
(1.1)
almost surely (as.), for each n 2 0 and B E 99. The process has state space (X, 99) and initial distribution p o ( B ) = P ( X o E B). For any (X,99), K, and p o , there exists a corresponding (Q, 9, P) and T !. (Neveu, 1965, V.2). Moreover the distribution of such a process, which = 99 x 99 x is the probability Q on gW given by
Q(W = p ( g E B ) , is completely determined by p o and K. In fact, Q is the only probability on 21
22
1.
MARKOV PROCESSES A N D LEARNING MODELS
93* such that
for any sequence Bn in 93 such that Bn = X for n > k. We sometimes write Qx(B) or Px(X E B) when po = 6, [the probability concentrated at x, also denoted 6(x, .)I and we wish to call attention to x. For a (real or complex) scalar valued functionfon A,' let If1 be its supremum norm, H ( f ) the closed convex hull of its range, and osc(f) the diameter of its range or, equivalently, the diameter of H ( f ) . Thus
If I
=
SUPlf(X)I xcx
9
H ( f ) = CO (rangef) , and Let B(X) be the Banach space of bounded 93'-measurable scalar valued functions on X under the supremum norm. Let M ( X ) be the Banach space of finite signed measures on B, under the norm lpl = total variation of p ,
and let P ( X ) be the probability measures on 93. If p then
Jf dP E H ( f ) .
E
P ( X ) and f E B(X), (1.2)
If p E M ( X ) with p ( X ) = 0, then and
for f~ B ( X ) . The transition operators (1.5)
and
1.1.
23
M A R K 0 V PROCESSES
on B ( X ) and M ( X ) , respectively, generalize left and right multiplication by the transition matrix in the theory of finite Markov chains. The second is the adjoint of the first, in the sense that 0 1 9
(1.7)
U f ) = (TP,f)
3
where
OL,f)
=
s
fdP
for ~ L ME ( X ) and f e B ( X ) . Both U and T are positive ( U f 2 0 if f > O , Tp 2 0 if p 2 0) and both are contractions (I 111, I TI < 1). In addition, Tp E P ( X ) if p E P ( X ) , and Uf =f if f is constant. More generally, Uf(x)E H ( f ) by (1.2); hence, H(Uf) = H ( f ) .
(1.8)
It follows that osc(Uf)< osc(f). The powers of the transition operators satisfy
and T"p(B) =
s
(1.10)
p ( d x ) K(")(x,B ) ,
where K(") is the n-step transition kernel, defined recursively by
K'O'(x,
a)
= 6,
and
K'"+"(x,B) =
s
K(x,dy)K'"'(y, B ) .
The probabilistic significance of K"), U j , and Ti is clear from the formulas
P(X,,+jE BIX,,, ..., Xo) = K(j)(X,,,B ) E(f(Xn+j)(Xn,* * * , X o )
=
Uif(XJ
a.s., a.s.3
and pn+j = Tjpn 9 where 2"= {X,,},,,,, is any Markov process with transition kernel K, and p,, is the distribution of X,,. The first equation is obtained by applying the second to the indicator function I,(x) of B :
24
I . MARKOV PROCESSES A N D LEARNING MODELS
if X E B .
(I
A set B E W is stochastically closed if B # 0 and K ( x , B) = 1 for all x E B. Thus P(X,,+ E BIX,,) = 1 a s . when X,, E B. If a stochastically closed set contains a single element a, we say that a is an absorbing state. A probability p is stationary if Tp = p. If the initial distribution po of X is stationary, then p,,= p, for all n. In fact, S is a strictly stationary stochastic process. 1.2. Random Systems with Complete Connections
In this section we begin with two measurable spaces (X,W) and ( E , Y ) , a stochastic kernel p on X x Y, and a transformation u of X x E into X that ’ and a.Following Iosifescu (1963; see is measurable with respect to W x 3 Iosifescu and Theodorescu, 1969), we call the system ((X, W), (E, Y ) , p ,u ) a (homogeneous) random system with complete connections. A sequence Y = X , , E,, X I , E l , ... of random vectors on a probability space (Q,.F,P) is an associated stochastic process if X,, and En take on values in (X, W )and (E, ’3), respectively,
Xn+1
= u(Xn, En)
(2.1)
and
P(EnEAIXn,En-,,
.a*)
=p(Xn,A)
(2.2)
as., for each A E Y. The processes = {X,,}nro and 8 = are, respectively, state and event sequences, (X, W) is the state space, and (E, ’3) is the event space. The distribution p, of X , is the initial distribution of an associated stochastic process. The concept of a random system with complete connections may be regarded as a generalization and formalization of the notion of a stochastic learning model. Thus we will often call such a system a learning model or simply a mode1.t In this context, the state of learning X,, characterizes a subject’s response tendencies on trial n, and the event En specifies those occurrences on trial n that affect subsequent behavior. Typically En includes a specification of the subject’s response and its observable outcome or payoff. When the subject is in state x , the event has distribution p ( x , .), and the transformation of state associated with the event e is u( -,e). Three classes
t This terminology is slightly at variance with that in Chapter 0. For example, “the fiveoperator linear model” of Chapter 0 is a family of models indexed by the parameters &,, c,,, and n,, according to the present terminology.
25
1.2. RANDOM SYSTEMS WITH COMPLETE CONNECTIONS
of learning models with which we will be especially concerned are the distance diminishing models defined in the next section, the additive models discussed in Chapter 14, and the finite state models. DEFINITION 2.1. A finite state model is a learning model for which X is a finite set and W contains all its subsets.t
There is a stochastic process Y associated with any random system with complete connections and any initial distribution p o (Neveu, 1965, V. 1). The distribution Q of any such process is the unique extension to L% x Y x W x of the measure on cylinder sets given by
=
Lo Lo PO (dxO)
p (xO
9
deO)
1
BI
(.
(XO
eO), d x l )
1.
P (xk
dek)
where B,, E W and A,, E Y for n < k, and B,, = X and A,, = E for n > k. Let p1 = p , and, for k 2 1, let
for x E X and A equation
E
p((En,
9
(2.3)
Yk+'.Then pk is a stochastic kernel on X x Y k ,and the *..Y
J % + k - l ) E A I X n , & - l ~ .**) = p k ( X n , A )
generalizes (2.2). Let p m ( x ,.) be the distribution of 8 = 6,; i.e.,
(2.4)
when Xo has distribution
P m (x, 4 = P X ( 8 E A ) for A E 9".Then p m is a stochastic kernel on X x Y" which extends pk in the sense that, if A E Y k , Pm(x,A
x E m ) = pk(x,A).
(2.5)
If S is the shift operator in E m :S{e,,},,20= {e,,+l}n,O, and bN= S N b = { E N + n } n 3 0 , then, a.s.3 P(BNE AIXN,E N - , , ...) = p m ( xA~) ,.
(2.6)
One of our prime objectives is to study state sequences from various classes of learning models. Such sequences are interesting in their own right, and
t Similarly, when E is finite, we always assume that all subsets are in 9.
26
1.
M A R K O V PROCESSES A N D LEARNING M O D E L S
provide an indispensable tool for the study of event sequences. The following simple observation is fundamental.
THEOREM 2.1. An associated state sequence X of a random system with complete connections is Markovian, with transition operator
clf(x) =
J P(X,de)f(u(x,e))*
In finite state models, X is a finite Markov chain. Proof. As a consequence of (2.1), E(f(Xn+,)IXn,X n - 1 , ...) = E(f(u(Xn9 En))lxn, for YE B ( X ) , thus E(f(Xn+1)Ixn, Xn-
I
1,
xn-1,
..*)
...) = uf(Xn)
by (2.2). A useful representation of the powers of the transition operator
w - ( x )= = (eo, ..
where e"
. , en-
s
(I is
Pn (x,W f ( u(x,4 )9
(2.7)
and u ( x , e") is defined iteratively:
u(u(x,e"), en) = u ( x , e n + ' ) .
(2.8)
Joint measurability of u ( x , e ) ensures that u ( x , e") is measurable with respect to B x X " , so the integral on the right in (2.7) makes sense and defines a measurable function of x. For any A E g'",
P ~ ( X , S - ~=A P) x ( L f NA~) =
= E x ( P ( s N €AIX,))
Ex(Pm(xN, A ) )
= U N p m( x ,A ) .
(2.9)
The sequence X,,' = ( E n ,X,,, ,) is also Markovian, and the application of an appropriate Markov process theory to this process in Section 6.1 yields additional information about event sequences in distance diminishing and finite state models. The Markov process X: = (X,,, En) can be used in the same way for finite state models. In finite state models with finite event spaces, both X,,' and XL are finite Markov chains. Let S'={Xn'JnbO and .""= {X:}nao
*
THEOREM 2.2. The process X' is Markovian, with transition operator (2.10)
27
1.2. RANDOM SYSTEMS WITH COMPLETE CONNECTIONS
J ~o(dx)f'(x).I n
=
E(f'(X0))=
Note that, if j - B~( X ' ) depends only on x, U ' f ( e , x ) = Uf(x). Thus, for any f E B(X'),
U'"f(e, x)
=
U'"- 'f'(x,e)
=
un-' f ' ( x ) .
(2.1 1)
THEOREM 2.3. The process X" is Markovian with transition operator
and in it ial distribution
P W )= where
f(4=
s
1
Po(dx)f,(x)9
P(X,
d e ) f ( x ,e ) '
The simple proof is omitted. It is easily shown that
U""f= (u"-'f) 0 for f E B ( X " ) and n >, 1, where g 0 u ( x , e ) = g ( u ( x , 4).
o
(2.12)
indicates composition of functions; i.e.,
REDUCTION.An important tool in the study of learning models with multidimensional state and event spaces is the reduction procedure that was applied
28
I . MARKOV PROCESSES A N D LEARNING MODELS
to the full ZHL model in Section 0.1. Here we describe this procedure in abstract terms and note its properties. The starting point is a learning model ((X, W), (E, Y),p, u), with which are associated state and event sequences Xn and En. In addition, there are measurable spaces ( X * , W*) and ( E * , Y*), and measurable transformations @ and Y of X and E onto X * and E*, respectively. Intuitively, x* = @(x) and e* = Y ( e ) represent simplified state and event variables. In the full ZHL model, they are projections: @ ( w , z ) = (V,Y)
and Y(S, a, r ) = (a,r ) 9
where s = (B, W) or (W, B), a = br or PO, and r = B or W. We now give conditions under which Xn* = @(A',,) and En* = "(En) are state and event sequences for a learning model. Suppose that @(u(x,e))depends only on x* and e*, and that, for any A * € Y*, p ( x , " - ' ( A * ) ) depends only on x*. In other words, there are functions u* on X * x E* and p* on X * x Y* such that
u*(@(x),W e ) ) = @(.(.,e))
(2.13)
and p * ( @ ( x ) , A * )= p ( x , Y - ' ( A * ) ) .
(2.14)
Since di and Y are onto, these functions are unique and p* (x*, .) is a probability.
THEOREM 2.4. If u* and p * ( - , A ) are measurable, so that ((A'*, a*), (E*, Y*), u*, p * ) is a learning model, then X,* = di(X,) and En* = Y(E,) are associated state and event sequences. Proof. Substituting X , and En for x and e in (2.13) and (2.14) we obtain U*(Xn*,
En*) = @(Xn+l)= X,*+l
and p*(Xn*, A*) = P(E, E Y--'(A*)IX,, E n - ' , ...) = P(En*~A*IXn En-', , ...)
a s . And it follows from the latter equation that p*(Xn*, A*) = P(En*E A*IX,*, En*--',...).
1.2. RANDOM SYSTEMS WITH COMPLETE CONNECTIONS
29
The significance of this is that the reduced model of Theorem 2.4 may be far simpler than the model with which we began, and hence may yield useful information about Xn* and En*.This is certainly the case for the ZHL models, since, as we shall see in Chapter 16, the reduced model is distance diminishing while the full model is not. It is not difficult to verify that the conditions for applicability of Theorem 2.4 are satisfied by the ZHL models. The most important of these conditions are (2.13) and (2.14).
2 0 Distance Diminishing Models and Doeblin-Fortet Processes
This chapter begins our study of a class of random systems with complete connections with metric state spaces, and a class of Markov processes in metric spaces that includes their state sequences.
2.1. Distance Diminishing Models Roughly speaking, a distance diminishing model is a random system with complete connections with a metric d on its state space X , such that the distance d(x’, y’) between x‘ = u(x, e) and y’ = u(y, e) tends to be less than the distance d ( x , y ) between x and y. It might be assumed, for example, that the event operators u ( . , e) are uniformly distance diminishing: d ( u ( x , 4,~
( v4 ,) < rd(x,y)
for some r < 1 and all x , y E X. A more general condition, first suggested by Isaac (1962) in a more restricted context, is
/
30
p ( x , de) ~
xe), u (, v ,e)>
1, ri+
< rirj.
32
2.
DISTANCE DIMINISHING MODELS
where
we have
It follows from Inequality (1.1.4) that
and
Proof of Lemma 1.2. Clearly
where x' = u(x,e') and y' = u ( y , e'). The inner integral is at most 9d(x', y'), and / p i (x,de') d(x', y')
so q
< ri9d(x,y). 1
< rid(x,y )
33
2.1. DISTANCE DIMINISHING MODELS
Taking i = 1 in Lemmas 1.1 and 1.2, and recalling our assumption that R, < 00 and r, c 00, we see that Rj < 00 and 5 < 00 for all j 2 1 in a distance diminishing model. The assumption that r, < 1 permits us to say more. THEOREM 1.1
r* = lim rjlj < 1 j-r
Q)
and Proof. By Lemma 1.2, for any n 2 j 2 1, r,, < ri'r,,,, where q = [ n / j ] and m = n-jq <j. Thus
lirnsupr,'l" n-r m
< rj'j
for all j 2 1. Therefore the limit r* exists and r* < r 1,l i for a l l j > 1. Taking j = k we see that r* < 1. Since Rj+ 2 R,, sup R, = lim R,, and (1 -rJRi+j
< Ri
by Lemma 1.1. If ri c 1 (e.g., i = k) we obtain supRj
< Ri/(l-ri) < 00
(1 *7)
on letting j + 00. It remains to show that R, = supR,. As a consequence of (1.2.5), R, 2 Rj for all j 2 1 ; hence, R, 2 sup R j . To obtain the converse, let A E Y", x , y E X, and E > 0 be given. Since Y" is generated by the algebra '3' = '3' (here we identify A E '3' and A x E mE SOo),there is an A' E '3' such that
uT=,
p , (x, A A A')
+ p , ( y , A AA') < E
(Halmos, 1950, Theorem 13.D). Then
I ~ P , ( x , Y , A )1 < 1
(x, A ) - P m (x, A ' ) 1
+ I6Pm(x,y,A')I + IPm(y,A')-Pm(y,A)I,
< supRjd(x,y) by (1.2.5), so ld~co(x,~,AG ) I E + supRjd(x,y)* arbitrary, 16pm(x,y,A)I < sup R,d(x,y). Thus
and Idp,(x,y,A')I
Since E is that R , = sup R,.
R,
< sup R j ,
so
34
2.
DISTANCE DIMINISHING MODELS
The key to understanding the state sequence in a distance diminishing model is to restrict the transition operator U to bounded Lipschitz functions. The following notation will be needed. For f E B ( X ) ,
and
Ilf’ll = m ( f )+ If1 . Let
U X )=
{f:llfll + W
l +
I)lf
I < max(r19 2 R , + 1 ) l l f II .
Thus U is bounded on L ( X ) . Putting .j = k in (1.10) we obtain (1.9) with r = r , < 1 and R = 2 R i < o o . I It follows immediately from (1.9) and I Uf I < If I that
I UYll < rllf II + R‘lf
I 1
(1.11)
where R’ = R + I . This inequality, in a slightly different setting, is one of the main hypotheses of the lonescu Tulcea-Marinescu theorem (Theorem 3.2. I), which is, in turn, the cornerstone of the theory of Doeblin-Fortet processes in compact state spaces (“compact Markov processes”) presented in Chapter 3. COMPACT STATE A N D FINITE EVENT SPACES. Both the five-operator linear model and the reduced ZHL model have compact state spaces and finite event spaces. Proposition 1 gives a relatively simple criterion for such a model to be distance diminishing. If w maps X into X , let
PROPOSITION I. A model with compact state space and finite event space is distance diminishing if r n ( p ( . , e ) ) < 00 and / ( u ( . , e ) ) 0.
36
2. DISTANCE DIMINISHING MODELS
Proof. Clearly
and
< supr(u(.,e)) 0. XSX'
If
E X,
there is an x E X' such that y
E
T ( x ) . Thus
g(y,k) 2 s ( r , i ) > g(x,j)/2 2 g / 2 .
Therefore r,
< 1 -g/2
0; thus p,,(x) = p ( x , A", B ) E L ( X ) . Also pn(x)+ZB(x) for all x E X ; thus U w p n ( x ) + K" ( x , B) for each x E X . Therefore
IK'" (x, B ) - K" ( y ,B)I = lim
n+ w
Iu"Pn(X)
- u"pn(y)l
< R*d(x,y)
(2.14)
by (2.13). The inequality (2.14) holds trivially for B = X and B = 0; hence it is valid for all closed sets B and x , y E X . If A is an arbitrary element of 99, then, by the regularity of K w ( x , .)+ K" (y, .) =C(.), there is a closed subset B = of A such that C(A -B) E. Thus
-=
IK" (x, A ) - K"
s
0,A)I
I(K" ( x ,A ) - K" ( Y , A ) ) - (K" (x, B ) - K r n( y , B))I
+ IK" (X, B ) - K" ( Y , B)I
< C ( A - B ) + IK"(x,B) - K"(y,B)I < E + R * d ( ~ , y ) by (2.14). Since E is arbitrary, we conclude that
m ( K " ( - , A ) ) < R* for all A €99.
[
We close this section with a lemma that will be needed shortly. A real valued function f on X is upper semicontinuous if { x : f ( x )< y} is open for all y. Clearly continuity implies this.
LEMMA2.3. If U is a transition operator for a metric state space such that Ufis upper semicontinuous for allfe L ' ( X ) , then K ( . , B) is upper semicontinuous for all closed sets B. Proof. The sequence {A,,} defined in the second paragraph of the preceding proof is increasing, so {p,,} is decreasing. Thus the sequence {Up,} of upper semicontinuous functions decreases to K ( - ,B). But upper semicontinuity is preserved under decreasing pointwise limits. [
3 0 The Theorem of Ioneseu Tulcea and Marinescu, and Compact Markov Processes
The material in this chapter and the next gives information about certain Doeblin-Fortet operators and corresponding Markov processes, thus for state sequences in certain distance diminishing models. The results in Section 3.1 are applicable to all Doeblin-Fortet operators, while those in later sections of this chapter require compactness of ( X , d), so that corresponding Markov processes are compact according to Definition 3.1. Chapter 4 treats distance diminishing models whose state spaces satisfy weaker conditions such as bounded ness. 3.1. A Class of Operators
Generalizing the situation in Chapter 2, we consider two complex Banach spaces ( B , 1 .I) and ( L , 1) 11) with L c B. B is not necessarily a function Space, though we use the notation f for one of its elements. It is assumed that +
(a) i f f , € L, YE B, lim,,+m[f,-fl and llfll < C.
= 0,
and Ilf,ll < C for all n, then f E L
43
44
3. COMPACT MARKOV PROCESSES
II-II
A linear operator U from L into L is bounded with respect to both where the latter is the restriction of 1. I to L. In addition, and 1.
IL,
(b) H=supnBolUnI,< CQ; and (c) there is a k 2 1, an r < 1, and an R < 00 such that
II UYII G r I l f II + R I f I for a l l f e L. Only (b) and (c) are needed for the next lemma. LEMMA 1.1. For all m 2 0 and f E L,
II UrnYIlG rrnI l f I1 + R' I f
I 9
(1.2)
where R' = (1 -r)-'RH. Furthermore, J = supnBo11 U"II < 00. Proof. Equation (1.2), which is obtained by iterating (l.l), implies that sup, 11 U'"'f11 < 00 for each f E L. By the uniform boundedness principle, the supremum D of (1 UrnkIIis finite (Dunford and Schwartz, 1958,t Corollary 11.3.21). The lemma then follows from
Definition 2.2.1 is applicable in the present context. The following useful condition for aperiodicity generalizes Lemma 1 of Norman (1970a). THEOREM 1.1. If there is a sequence 6, with limit 0, and, for each f there is a U "f E B such that
lUY- U"f I G 6, I l f I1
E
L,
(1.3)
for all n 2 I , then U is aperiodic. Proof. By Lemma 1.1, 11 U"fl1 < Jll f 11 for all n ; hence, U m fE L and by (a). The operator Urn is clearly linear. By means of arguments similar to those in the proof of Lemma 2.2.1, but using (1.3) instead of 11 U" - U" 11 -+ 0, we obtain (2.2.l), (2.2.2), and (2.2.3). Replacing f by V"J n 2 1, in (1.2), and noting that UV = V 2 , we get
11 U"f (1 < Jll f 11
IIVmk+nfll G rrnIIVnfII
+ R' IVY1
< r"W I l f II + 6, R' I l f II . Therefore IIVrnk+"ll< r " W + 6 , R ' .
t This convenient reference on functional analysis is cited repeatedly in this chapter.
3.2. THEOREM OF IONESCU TWLCEA A N D MARINESCU
45
For arbitrary j > 1 let m = [ j / 2 k ] and n =j - m k , so that j = mk+n. Then m , n - + c oasj-,co, SO IIV'II+O. I 3.2. The Theorem of Ionescu Tulcea and Marinescu
To the conditions (a), (b), and (c) of the last section, we now add
(d) if L' is a bounded subset of (L, 11. in (4 I * I).
II),
then UkL' has compact closure
Under these hypotheses, Ionescu Tulcea and Marinescu (1950) obtained the representation of U" given in the important theorem that follows. For any complex number 1, let
D(1) = { f E L : U f = 1 f } , so that I is an eigenvalue of U if and only if D(1) # fa. THEOREM 2.1. The set G of eigenvalues of U of modulus 1 has only ajinite number p of elements. For each 1 E G , D(1) is jinite dimensional. There are bounded linear operators U,, 1 E G, and V on L such that
U,U,< = O
if 1 # L', U: = U,,
(2.2)
UAV = VUA = 0,
(2.3)
UAL = D ( 1 ) ,
(2.4)
r ( V ) < 1.
(2.5)
xAeG
The operator l"UAhas a finite-dimensional range, hence is strongly compact. Since IIV"II c 1 for n sufficiently large, it follows that U is quasistrongly compact (see Lotve, 1963, Section 32.2). The proof of the theorem is broken down into a number of lemmas. LEMMA 2.1. If 111 = 1, D(1) is finite dimensional.
LEMMA2.2. There are only a finite number of eigenvalues of modulus 1. Note that U ' = U k satisfies (b), (c), and (d) with k ' = 1. Furthermore, U f = 1 f implies that U ' f = 1% and IAk1 = 111. Thus D(1) cD'(Ak), so the former is finite dimensional if the latter is. Also d c G', so G is finite if G' is. Hence it suffices to prove these lemmas for k = 1. Proof of Lemma 2.1. Iff E D, where
D = D(1) n {f: I f
I < I},
46
3.
COMPACT M A R K 0 V PROCESSES
then so that
Ilf II G
R/(1 - r ) *
Thus D is bounded, and, by (d), U D has compact closure in 1 . I. But D c AX'UD, so D = B is compact. It follows that D(1) is finite dimensional (Dunford and Schwartz, 1958, Theorem IV.3.5). I
Proof of Lemma 2.2. Suppose not. Let 1,, A 2 , ... be a sequence of distinct elements of G and let f.E D(1,,),f.# 0. The f, are linearly independent. Let S(n) be the linear span of f l ,f 2 , ... ,f.. Then S(n - 1) is a proper subspace of S(n), so that there is a sequence g,, such that g,, E S(n), lgnl = 1, and Ig,,-fl 23 for all f e S ( n - 1) (Dunford and Schwartz, 1958, Lemma VII .4.3). Iff = aifi E S(m), it is easy to see that zUnf E S(m) for all complex z and n 2 0, and that 1;"U"f- f E S ( m - 1) for all n 2 0. By Lemma 1.1
xy=
< r"-'
Ill,:"U"-'gjll
llgjll
+ R' < 1 + R'
for n sufficiently large. Thus (d) implies that there is an increasing sequence
ji and a sequence ni such that
hi = 1r"IU"ig. Ji J1
converges in B. However, d i = Ihi+l-
Qil =
Ihi+I-gji+I-hi+gji+II,
and hiE S ( j i ) c S ( j i + ,- 1). hi+, - gji+,E S ( j i +I - 1) Thus d i >, f. This contradiction establishes the lemma. I
LEMMA 2.3. If 111 = 1 and 1 is not an eigenvalue of U k ,then (21- U k ) L= L. Proof. Again it suffices to consider k = 1. Since 11 U"I( is bounded, r ( U ) < 1, and so (A'I-U>L= L for 11'1 > 1 . For g E L we seek an f E L such that (11- (I)f = g. If g = 0, let f = 0. If g # 0, let 11,,1> 1, I,, + 1, and f,E L with (1,,I- U ) f .=g. Clearly f,# 0, so we may put f.' =f,/lf,l.Then
s.'= K'uf,'+G'g/lf.l7 so that
(2.6)
47
3.2. THEOREM OF IONESCU TULCEA A N D MARINESCU
Therefore
IIXII G
(2.7)
( R + llgll/lf.l)/(1-d*
[.fI
Suppose now that + 00. By (2.7), IlXIl is bounded, so that, for some subsequence f i . , Ufi, converges in B. By (2.6), f i , does too, and the limit f' belongs to L by (a). Since 1U1, < 00, U'i,+ U f ' , and (2.6) yields f'= A-'Uf '. Now 1 = Ifi,l+ If 'I, so f' # 0, and I is an eigenvalue of U . This contradicts our assumption, and rules out the possibility that 00. Applying this result to subsequences of f., we see that d = sup. < 00. Multiplication of (2.7) by then yields
[.fI
[.fI
If.l+
Consequently Uf. has a subsequence Uf.. that converges in B, and f..= I; (U'. +g) converges in I . I to an element f of L. Then U.' + Uf in I f 1, and f = A-'(Uf+g).
I
If 111 = 1 let
LEMMA2.4. For every 111 = 1 and f E L there is a U, f E L to which U,"f converges in B. The linear operators U, have I U,IL< H and 1 U,ll < J , and satisfy (2.1).
Thus U is orderly in the sense of Definition 2.2.1. When, and only when, 1 is the only eigenvalue of modulus 1, U is aperiodic and IIU"- UIJI+ O geometrically. Since compactness implies completeness and separability, Theorem 2.2.1 is applicable to U . Theorem 2.2.2 is also. Theorem 3.2 summarizes some of these results in our present notation.
52
3. COMPACT MARKOV PROCESSES
THEOREM 3.2. The operator U , is the transition operator .for a stochastic kernel K , . For any B E 9, K , ( . , B ) E D(1), and, for any x E X , K , (x, .) is stationary. 3.4. Ergodic Decomposition
The key concept in the analysis of compact Markov processes is that of an ergodic kernel. A subset F of X is an ergodic kernel if it is stochastically and topologically closed, and if it has no stochastically and topologically closed proper subsets. LEMMA 4.1. Any stochastic kernel defined on the Bore1 subsets 9 of a compact metric space ( X , d ) possesses an ergodic kernel. Any two distinct ergodic kernels are disjoint.
Proof. If F, and F2 are ergodic kernels and F = F1 n F2#a,then F is a stochastically and topologically closed subset of Fl and F2.Thus F, =
F = F2. The ergodic kernels are clearly the minimal elements of the collection 8 of stochastically and topologically closed subsets of X under the natural ordering c. Zorn’s lemma (Dunford and Schwartz, 1958, Theorem 1.2.7) implies that there is at least one such minimal element. For since X E 8,8 is not empty. If d is a totally ordered subset of 8,it has the finite intersection The Lindelof theorem (Dunford and Schwartz, property; hence A = n d # 0. 1958, 1.4.14) implies that A = n d ’ for some countable subcollection d’ of d , from which it follows that A is stochastically closed. It is obviously topologically closed; hence A E 8. Clearly A is a lower bound for d .
I
If F E 9 is stochastically closed, then K ; ( x , F) = 1 for all x E F, so that K , (x, F) = 1 if F is also topologically closed. It follows that the functions K l (., F), for different ergodic kernels F, are linearly independent. By Theorem 3.2, all such functions belong to D(1), so the number i of ergodic kernels does not exceed the dimension d of D( 1). Let Fl ,F2,... , Fibe the ergodic kernels, F = 4 , and
ui=
g j ( x ) = K1 (x, 4)*
THEOREM 4.1. The functions gj are a basis for D( l), so there are d ergodic kernels. A compact Markov process is ergodic if and only i f there is a unique ergodic kernel. The second statement follows from the first. There is a unique ergodic kernel if and only if D(l) = U , L ( X ) is one dimensional, i.e., contains only constants. Our proof of the first statement is based on a corollary to the following lemma.
53
3.4. ERGODIC DECOMPOSITION
LEMMA 4.2. Iff is upper semicontinuous and f(x) < Uf(x) for all x E X, then f is constant on each ergodic kernel, and f attains its maximum on F. Proof.
Let
Cj = {x E 4 :f ( x ) = max f ( y ) } . YG4
Since f restricted to F j is upper semicontinuous and 4 is compact, Cj is a nonempty topologically closed subset of 4 . If x E Cj we have n
since 4 is stochastically closed. Hence K ( x , Cj)= 1, and Cj is stochastically closed. But 5 is an ergodic kernel, so Cj = 4 , and f is constant on F j . Similarly, if A = {x:f ( x ) = max f ( y ) } , YeX
A is stochastically and topologically closed. Lemma 4.1 implies that there is an ergodic kernel for the compact metric space ( A , d ) and the restriction to A of the kernel K, and it is easily seen that this set is an ergodic kernel for ( X , d ) and K unrestricted. Hence A 3 f;. for some j , and f attains its
maximum on F.
I
COROLLARY. I f f E D ( I ) , f is constant on each ergodic kernel. /f 111 = 1 and f E D(A), then I f(.)l is constant on each ergodic kernel, and f(x)=O for all x E X if f ( x ) = 0 for all x E F. ProoJ Iff E D(1),then ref€ D(1) and imf E D(1),so both are constant on each ergodic kernel by Lemma 4.2. If 111 = 1 and f E D(A), then g = If(.)l is continuous and satisfies g < Ug. I
Proof of Theorem 4.1. Suppose that f E D ( 1 ) . Let ipj be the value off i g j . Then 6 E D ( 1 ) and 6 vanishes throughout F. on F;. and let 6 = Thus 6 = 0; that is, f = ipjgj. Therefore the linearly independent functions gj span ~ ( 1 ) . I
f-x$=,
xi=,
The expansion of K , (. ,B ) in terms of the gj turns out to be especially interesting. For any probability p on the Bore1 subsets of a separable metric space, there is a smallest closed set with probability 1. This set, called the support of p, is the set of points x such that p ( 0 ) > 0 for all open sets 0 containing x (Parthasarathy, 1967, Theorem 2.1 of Chapter 11). THEOREM 4.2. There is a unique stationary probability p j with p j ( @ = 1. The ergodic kernel t;;. is the support of p j , and {p,, ... , p d } is a basis for {p E M ( X ) : Tp = p}. For all x E X and B E 93, d
(4.1)
54
3. COMPACT MARKOV PROCESSES
Proof. Let p l ( B ) , ... ,p d ( B ) be the unique constants such that (4.1) holds for all x E X . Taking x E 4 , we obtain K, ( x , B) = p j ( B ) for all B E 99. By Theorem 3.2, p j is a stationary probability, and, clearly, p j ( F i ) = d,, so the p j are linearly independent. Suppose that p E M ( X ) and Tp = p. Then T,"p = p. But, for YE L ( X ) ,
(T,"P,f)= (P,V,"f)
+
01,U l f )
=
(TI P 9 - f )
as n + 00. Thus ( p , f ) = (TIp , J ) for all f E L ( X ) and
Therefore the p j are a basis for {p E M ( X ) : Tp = p}, and pi is the only stationary probability with p j ( 4 ) = 1 . Since, by Lemmas 2.2.3 and 4.3, the support Sj of p j is a stochastically as well as topologically closed subset of
4, sj=4. I
LEMMA 4.3. Let K be a stochastic kernel in a separable metric space ( X , d ) , such that K ( - , B ) is upper semicontinuous for B closed. Then the support S of any stationary probability p on 99 is stochastically closed. Proof. Clearly 1 = p ( S ) = j p(dx) K ( x , S). Thus p ( S ' ) = 1, where
S ' = { x : K ( x , S ) =l }
= X-{x:K(x,S)
>) = 1,
(5.2)
from which it follows that Ufl f 2 (4 = 1,2 2 f
l W f 2
(4
and
UfC'(X) = A ; ' f ; ' ( x ) . Sincef, f2 and f;' E L ( X ) , 1,1, and 1;' E G, and G is a group, as claimed. We now show that any finite subgroup G of 111 = 1 is C , for some p. For real t, let e ( t )= exp(i27rt). If 1E G, then, since G is finite, 1 is a root of unity. Hence there are positive integers o(1) and r ( 1 ) such that r(1) 6 o(1), (r(A),o(A))= 1, and 1 = e(r(A)/o(L)). Also e( llo(1)) E G . For there are integers a and b such that 1 = ar(A)+bo(A); hence l/o(A) = ar(I)/o(A)+b, so that e(llo(1))= 2"' G . It follows that Co(d)c G. Let p = max,,,o(1). Then for any 1 E G, o(1)lp. For there are integers a and b such that ao(1)+bp = (o(A),p),and e ( I / o ( l ) )and e ( l / p )E G. Thus e ( t ) E G, where
59
3.5. SUBERGODIC DECOMPOSITION
Since o(e(t))< p , =
r(e(t))/o(e(t>)> I/o(e(t))2
UP,
so, by (5.3), o ( I ) < ( o ( I ) , p ) ; i.e., o ( I ) = ( o ( I ) , p ) .Thus o ( I ) l p , as claimed. But then I E C p , and G c C,. Therefore G = C p . I
Proof of Lemma 5.3. Uniqueness. Given Y ', ... , Y4,let w = exp(i2n/q) and
c w"I,,. 4
g =
m= I
Since the Yj are compact, d(Yj, Y k )> 0 if j # k , so that g E L ( X ) . Then UIym= Iym- 1, and
c O"Iym-, 4
ug =
m=1
= wg.
Clearly g # 0, so w E G = C p . Since q > p , we must have q = p . If I E G, D ( I ) is one dimensional. For let f E D ( I ) , f # 0, and let xo E X. Then f ( x o ) # 0, and, if f ' E D ( I ) and c =f'(xo)lf(xo), A ( x ) = f ' ( x ) - cf(x) belongs to D(A) and vanishes at x o . Since lA (x)l = l A l , f'= cf. Let I = exp(i2n/p) and
f=
P 1 I"&
n= 1
If g is as in the first paragraph of the proof,f,g constant c such that
E
D ( I ) , so there is a complex (5.4)
g = cf.
For any xo E X there are 1 < m , n < p such that x,, E Y" n X". Evaluating (5.4) at xo we obtain I" = cA" or c = I-' where v = n - m . Thus P m= I
I"I,, =
P
1 I"-'I,,
n= I
=
c A"lp+.. P
m= 1
Equating powers of A we obtain Y" = X"'", for all m.
Existence. Let I
= exp(i2n/p), f E
X"
=
D ( I ) , f # 0, xo E X , and
{ x :f ( x ) = I"f(X,)}
for m > 0. Note that X ' , .. . , X p are topologically closed and pairwise disjoint, and that Xm+,= X". By (5.2), K ( x , X " + ' ) = 1 for x E X". Since xo E X o , it follows by induction that X" # 0.The set
X'
=
P
(J X"
rn= 1
60
3.
COMPACT MARKOV PROCESSES
is stochastically and topologically closed. Since X is an ergodic kernel, X'=X. I
Conclusion of proof of Theorem 5.1. Let pi be the integer p and 4" the set X" obtained by applying Lemmas 5.2 and 5.3 to Kj and 4. If qj 2 p j and Y,!", m = 1, ... ,q j , are nonempty, topologically closed, and pairwise disjoint, with union 4 and K(x,YY+') = 1 for X E Y,!", then the Y,!" are topologically closed in (4,d j ) and K j ( x ,Y Y + l )= 1 for x E Y,!". Thus the uniqueness assertion of Lemma 5.3 gives qj =pi and Y,!"= FY". This is the uniqueness claimed by the theorem. Since Gj = C p J ,(5.1) follows from Lemma 5.1.
I
We now sketch some ramifications of the subergodic decomposition. Though these have their own interest, they play no role in subsequent developments. For any n 2 1, U" is a Doeblin-Fortet operator. The subergodic kernel 4" is an ergodic kernel for K(PJ).Let g,!" be the Up' invariant Lipschitz function that is one on 4", and let p,!" be the T p Jinvariant probability with support 4"'.If X is a compact Markov process with transition kernel K and initial distribution a, then g,!"(x)= P(d(X,, F;'"')
+0
as n + 0 0 ) .
(5.5)
The set 4"'is the unique ergodic kernel of K(PJ)14m, p,!"l4" is its only stationary probability, and the corresponding (Doeblin-Fortet) operator is regular. If I P J = 1, let
We have U g r = g : - ' and Tp,?'=py+', from which it follows that UhjPl= and T V ~ = I, ~ V ~ ,The ~ . functions hj,' and g j belong to D ( 1 ) and agree on F, so, by the Corollary to Lemma 4.2,
hj,] = g j , j = 1,..., d . (5.6) Equations (5.5) and (5.6) amplify (4.3). Similarly vj,' is a stationary probability with vj,, (4)= 1 ; so, according to Theorem 4.2, vj,] = p j ,
j
%f =
(/
The linear operator
is bounded on L ( X ) . For I E G, let
=
1,..., d .
fdvj,l) hj,l
61
3.6. REGULAR A N D ABSORBING PROCESSES
Iff
E L(X),
a direct analysis of U" for n large shows that
U;f(x) for all x E F. Since U f and U ; f belong to D(A), the Corollary to Lemma 4.2 implies that U,f = U i f . Thus U, = U,l, from which it follows that Uf(X)
=
3.6. Regular and Absorbing Processes Two types of aperiodic processes, regular and absorbing processes, are especially important in applications. According to Definition 2.2.1 and the U111 -,0 as n -+ co, paragraph that follows it, regularity means that 11 17"and U ,f ( x ) does not depend on x. Theorems 4.1 and 5.1 show that a compact Markov process is regular if and only if it has but one ergodic kernel, and this kernel has period 1. Alternatively, there is a unique stationary probability p, and the distribution p,, of X,, converges to p for any initial distribution po . DEFINITION 6.1. A Doeblin-Fortet operator for a compact state space ( X , d ) (or a corresponding compact Markov process X,) is absorbing if all ergodic kernels are unit sets: 4 = {aj}. Aperiodicity is obvious, and, by Theorem 4.3, X,, converges with probability 1 to a random absorbing state a / , j = j ( w ) . If there are two or more absorbing states, the probability of convergence to each of them depends on the initial distribution. In particular, such a process is not regular. We now give useful criteria for a compact Markov process to be regular or absorbing. These criteria are expressed in terms of the support u,,(x) of K(")(x, a).
THEOREM 6.1. A process is regular if and only if there is a y
EX
such that
d(a,,(x),y ) -+ 0 as n -, co for all x E X. THEOREM 6.2. A process is absorbing if and only if there are absorbing states a , , ,..,ai such that, for eoery x there is a j =j ( x ) for which d(un(x),aj)
0 as n
-+
00.
(6.2)
62
3.
COMPACT M A R K 0 V PROCESSES
Proof of Theorem 6.1. Suppose that (6.1) obtains. For any x E n = kpj we have K(")(x,Fm)= 1. Thus an(x)c &"', so that
emand
d(&",y) 4 d(an(x),y)*
It then follows from (6.1), on letting n -+ 00, that d(4"',y) = 0; i.e., y E 4m. Therefore y belongs to all subergodic kernels. Since the subergodic kernels are disjoint, there must be only one, and (I is regular. Suppose, conversely, that U is regular with ergodic kernel F. Let y E F and, for any E > 0, let 0 be the open sphere with radius E and center y. Then liminfK(")(x,0)2 p ( 0 ) > 0 , n-+ m
the latter since the stationary probability p has support F. Thus K(")(x,0 )> 0, an(x) n 0 # @, and d(o,(x),y) < E for all n sufficiently large. I
Proof of Theorem 6.2. Assume (6.2). If F' is an ergodic kernel, let x E F'. Then an(x)c F', so that (6.2) implies that aj E F'. Since { a j } is stochastically and topologically closed, { a j } = F', and (I is absorbing. Suppose that U is absorbing. For any x E X there is a j =j(x) such that gj(x) > 0. If 0 is the open E sphere about a j , then
lim inf P ( x , 0)3 K , (x, 0)3 gj(x) . n-
m
Therefore K(")(x,0)> 0 and d(a,,(x),a j ) < E for n sufficiently large.
I
Application of these criteria is facilitated by the following interrelationship among the sets a,,(x). THEOREM 6.3. For all m,n 2 0 and x
E
X,
(6.4)
63
3.7. FINITE M A R K 0 V CHAINS
since y E a,,,(x) implies that a =I an(y) and thus K("'(y,a)= 1. Therefore 5 3 am+n(x)* To prove the reverse inclusion, we note first that 1= =
p + n )
1
(
~
ur7n + n ( x ) )
~(m'(x7 dv) ~ ( n ' ( y a7m + n < X I )
9
so that K("')(x,a*)= 1, where 6*
= { y : K'"'(y,o,+,(x)) =
l}.
Since a,,,+,(x) is closed, K("'(y,a,,,+,,(x))is upper semicontinuous by Lemma ) that 2.2.3, and a* is closed. Thus a* 3 a m ( x ) , so that y ~ a , , , ( x implies a,+,(x) 3 an(y).In other words, a,+,(x) 3 a; hence o,,,+~(x)3 5. This completes the proof of (6.3). If a,,,(x) and a n ( y ) are finite, 6 is too. Hence 5 = a , and (6.4) obtains. Using this equation and induction, we derive finiteness of a,,(x) from finiteness of q ( x ) . [
3.7. Finite Markov Chains When the state space X of a Markov process X = {Xn}nrois a finite set, the process is afinite Markov chain. We noted in Section 3.3 that X is a compact space and X a compact process with respect to any metric d. In this section we collect various observations concerning specialization of the theory of compact Markov processes to this case. Though such specialization is an easy and instructive exercise, direct approaches to the theory of finite Markov chains (e.g. Kemeny and Snell, 1960; Feller, 1968; Chung, 1967) are much simpler. The space B ( X ) = L ( X ) of all complex valued functions or vectors on X is finite dimensional, so the norms 1 . I and I/. 11 on this space are equivalent. The same is true of the corresponding norms on (bounded) linear operators on these spaces. Thus the same operators are orderly, ergodic, aperiodic, and regular with respect to the two norms. The fact that X is aperiodic or regular in I . I for certain finite state learning models is the basis for the treatment of their event sequences in Chapter 6. One wants to have matrix interpretations of the various statements about operators in the preceding sections. For every operator W on B ( X ) , there is a unique complex valued function or matrix w = &(W) on X x X such that WfW =
c W(XJ)f(Y)
YEX
64
3.
COMPACT MARKOV PROCESSES
or Wf= w . f , where dot denotes matrix multiplication. Thus, if (I and K are the transition operator and transition kernel of X,the transition matrix P = &(U) has values or elements P ( x , y ) = K ( x , { y } ) . The mapping & is an algebra isomorphism between operators and matrices : A(aV+bW) = aA(V)
+ b&(W),
and A ( V W ) = d ( V )* A ( W ) .
In fact, it is not difficult to show that it is an isometry, IwI = IWI, with respect to the norm
on matrices. These facts yield the desired matrix interpretations. For example, in the aperiodic case
U" = u, + V",
where 1 V"I + 0 geometrically, so, if Pl = & ( V , ) and u = A (V), P" = P,
+ v",
where Iu" I + 0 geometrically. All subsets of X are topologically closed, so an ergodic kernel is a stochastically closed set with no stochastically closed proper subsets. The support of a measure on Xis just the set of its atoms. Theorem 3.4.2 gives the following representation of the matrix P, = A ( U l ) :
In the ergodic case d = 1, P,( x , y ) = p, ({y}) for all x ; i.e., all rows of PI are equal to the stationary distribution. If x,, ,x E X , d(xn, x ) + 0 as n + co if and only if x , = x for (all sufficiently) large n. Similarly, if A c X, d(x,,, A) + 0 if and only if x,, E A for large n. Thus, for example, in Theorem 3.4.3
In, = { o : X,,(0) E 5 for large n} . The fact that the empirical distribution v,, of X,, ..., Xn-,converges weakly to pj(", (Theorem 3.4.4) with probability 1 is equivalent, in this context, to v,,({x})+ pj(-)({x))with probability 1 for all x E X. The words "topologically closed" can be deleted from Theorem 3.5.1. In (3.5.5) g?(x) = P(X, E F;""
for large n).
3.7. FINITE MARKOV CHAINS
65
If P " ( x , y )> 0 we say that y can be reached from x in n steps, and if this holds for some n 2 0 we say that y can be reached from x. Theorems 3.6.1 and 3.6.2 can be rephrased in these terms as follows. A finite Markov chain is regular if and only if there is a y E X and, for every x E X, an integer N,, such that y can be reached from x in n steps if n 2 N, . A finite Markov chain is absorbing if and only if, from any state x, some absorbing state uj can be reached.
4 0 Distance Diminishing Models with Noncompact State Spaces
In this chapter we consider a distance diminishing model ((X, d), (E, Y), p , u ) whose state sequences have transition operator U.The state space is not assumed to be compact. Theorems 1.1 and 2.2 give conditions for regularity of U . The key assumptions in Theorem 1.1 are that ( X , d ) is bounded and p ( x , .) has a lower bound that does not depend on x. Theorem 2.2 assumes “regularity in X ’ ” for an “invariant” subset X’ of X , such that d(x, X’) is bounded. 4.1. A Condition on p
Boundedness of ( X , d) means, of course, that b =
SUP
X.Y€X
d(x,y)
0 such that p ( x , A ) 2 av(A) for all A E Y.
66
4.1.
67
A CONDITION ON p
If p‘(x, A) = v(A),then p i ( x , A) = v’(A), where V j on Y j is the jth Cartesian power of v on 9.Consequently, if $ is defined by (2.1.1)with p i in place of p i ,
The following condition supplements the assumption r, < 1 in the definition of a distance diminishing model (Definition 2.1.1). (b) There is a k’ 2 1 such that r’ = r;. < 1.
If, in addition to (a) and (b), 4 4 x 9 4,U ( Y , 4) < d(X,Y)
(1.3)
for all e E E and x , y E A’, we can take k = k’ in Definition 2.1.1.When a = 1 this is obvious, since p = p ‘ , so that 9 = $. In any case p j ( x , A ) 2 a’v’(A)
for all j 2 1 and A E 3’. Thus, if a < 1, the equation P j ( X , A ) = a’d(A)
+ (l-a’)pj*(x,A)
defines a stochastic kernel pi* on X x Yj, and rj
< a’$ + (1 - u j ) ~ * ,
(1.5)
where q* is defined by (2.1.1)with pi* in place of p i . However (1.3) implies that d ( u ( x ,21,U ( Y , 4 )< d ( x , y )
for allj, e j , x, and y , so that $* < 1 for any stochastic kernel pi*. Hence (1.5) and (b) yield rk,< I , as claimed. The following result is a refinement due to Norman (1970a, Theorem 1) of Theorem 1 of Ionescu Tulcea (1959). THEOREM 1.1.
Under (1. l), (a), and (b), U is regular.
The theorem is proved by combining several of our previous results with the following lemma.
1.1 LEMMA s = supsj jr 1
where
< 1,
68
4 . MODELS WITH NONCOMPACT STATE SPACES
Proof of Lemma 1.1. First we define a family of stochastic kernels vj,. on X x Y j . For 1 <j < n let v ~ , ~ (.)x = , vj, and for j > n let vj,.(x,A) =
S S ...,
v"(de") pj-"(x', de*j-")I,(e'),
where x r = u(x,e") and e*j-" = (en, ej-l). If j < n , then clearly 6vj,,(x,y, A) = 0 for all x , y , and A. If j > n, then 6vj,,(x,y,A) =
Thus, by (1.1.4),
I S
v"(de") 6pj-"(x', y', de*j-")I,(d).
16vj,n(X,y,A)I
'lTd) = (d/n)a2+ o(l/n),
(2.17) (2.18)
and E(IACj13) = o((d/n)%)
(2.19)
for j < q. Let hj(A) = h,!(A) = E(exp(iAcy))
cj. Then
be the characteristic function of hi+ (A)
=
E(exp(iACj)exp(U ACj))
=
E(exp(iACj)E(exp(iA ACj)I$d)).
Using the Taylor expansion exp(it)
=
+ it - t2/2 + o(ltI3)
1
in conjunction with (2.17) and (2.18), we obtain E(exp(iAACi)l8d)
=
1 - (A2/2)(d/n)a'
+ o(l/$) + o(E(ldcj13 15.d)).
(2.20)
80
5. FUNCTIONS OF MARKOV PROCESSES
Substitution into (2.20) and application of (2.19) yield =
+
(1 - (A2/2) (d/n)aZ)hj(A) O(l/&
Thus, by iteration,
h,(l)
=
+
(1 - (12/2) ( d / n ) o 2 r O(q/&)
+ O((d/n)%).
+ O(q(d/n)%).
Since &/n+ 1, the first term on the right converges to exp(-lZaZ/2), and the third converges to zero. Clearly
q/&-
nx-e,
so the second term on the right converges to zero if 5 > 3. Choosing 5 in this way we obtain + exp(-A2a2/2),
h&)
from which Theorem 2.3 follows via the continuity theorem (Breiman, 1968, Theorem 8.28). I 5.3. Estimation of pu
+.
The natural estimator of the asymptotic autocovariance p. of Yn and Yn is the sample autocovariance given by
3. and
p-,
=
1
n-#
2 (yi-y.)(yi+u-y.)
n-'-u
=-
i=o
(3.1)
Pu for u 2 0, where Y. = Sn/n.
This estimator is consistent in both the quadratic mean (q.m.) and a s . senses. THEOREM 3.1. As n + 00, P,,+p,, in quadratic mean and almost surely.
Proof. Since p - . = p U , we can assume that u 2 0 , and, since neither 6, nor pu is affected if a constant is added to f, we can assume that U "f= 0 without loss of generality. Clearly
P.
= P.*
+ E,,
(3.2)
where
(3.3)
5.3. ESTIMATION OF
81
P.
(3.4)
and Y” = s,,n-,/(n Consider
E,
- 24).
first. By the Schwarz inequality,
E((Y’Y.)2)< E“(Y‘4) E“( Y.4) = O(l/(n-u))O(l/n)
= O(l/(n-u)n),
as a consequence of (2.14) and (2.15). The same estimate applies to the other two terms on the right in (3.4), so E(E.z) = 0(l/(n - u ) n) .
Thus E, + 0 q.m. as n + 00. Theorem 2.2 implies that E, only to show that pu* + pu q.m. and a s . If
(3.5) +0 as.,
so it remains
then
For k 2 u, E(Wi Wi + k ) = E(Wi E W i + =
Iq+u))
~ ( 0 ( 1 ) O ( a ~= ) ) O(ak)
by (2.7). While for u 2 k 2 0, E(WiWi+,) = O(1).
Therefore pu* - p, + 0 q.m. and a s . as a consequence of Lemma 3.1.
I
LEMMA 3.1. Let {Wi}i,o be a real valued stochastic process such that E(Wi W i + & ) for k 2 0, where a, 2 0 and
ah
82
5. FUNCTIONS OF MARKOV PROCESSES
Then
If cn+
> cn > 0 and
then i=O
as., as n+
w, + 0
00.
Lemma 3.1 and (2.5) yield an alternative proof of Theorem 2.1. The lemma gives much stronger estimates of the magnitude of Wi than Theorems 2.1 and 3.1 require. For a fuller development of the method used to prove (3.8), see Serfling (1970a,b).
xi
Proof.
Note first that, for any real numbers b, , ... ,bn-
(3.9) where
Thus
G
n- I
1 ali-jlG A n ,
i,j=O
and (3.7) is proved. Let
T, =
n- 1
1 WJCi
i=o
5.3. ESTIMATION OF
83
P.
E((T,-K)')
xi
< A i = mCh n .';c (mvn)-1
(3.10)
Since c c 00, cr2 c m, and T. is q.m. Cauchy. Let T be its q.m. limit. Letting m + m in (3.10) we see that E((T-T,)')
< A C';c Ian
for n 2 2. Thus the sequence E((T-
T21)')
= O(j-')
is summable, from which it follows that T21-+ T a.s. Once it is established that T.+ T as., (3.8) follows via Kronecker's lemma (Neveu, 1965, p.147). To complete the proof that Tm-+T as., it remains only to show that Aj =
max lT,,-T2,1 + O
2Jc n C 21 + 1
a.s. as j + 00. For any 0 < m <j, the sum T2,+- T21of 2' terms can be partitioned into 2j-" blocks of 2"' terms each. Let v,k be the kth such block. Clearly Tn - T ' , =
c' Vdm m
where the prime indicates that the sum need not include all m, and k, is chosen appropriately. Hence, by the Schwarz inequality,
Thus
so that
84
5. FUNCTIONS OF MARKOV PROCESSES
by (3.10), where 2j< i c 2j”. It follows that
for j 3 1. Therefore
and A,+O a s . as j + m .
I 5.4. Estimation of a’
If rs’ > 0 and s is a consistent estimator of a’, it follows from Theorem 2.3 that (S,-ny)/(ns)%
N
N(O, 1 ) .
(4.1)
This fact can serve as a basis for inference about y . Norman’s (1971b) discussion of inference about the mean of a second order stationary time series extends easily to the case at hand. Although (4.1) does not presuppose stationarity of { Y j } j , o , one expects the approximation to be better, for fixed n, when the process is nearly stationary than when it is not. Thus, in practice, it is advisable to disregard the initial, grossly transient, segment of the process. This amounts to considering a shifted process { Y N + j } j r O , to which our assumptions are equally applicable, and which approaches stationarity as N-P 00 according to Theorem 7.1. The purpose of this section is to display a suitable estimator s and to estimate its mean square error. From among a number of closely related possibilities, we single out the truncated estimator
[see (3.1) for 6.1 on the basis of its simplicity and small bias. The truncation point t = t, < n must be chosen judiciously. The danger of choosing t too large is amply illustrated by the fact that, if t = n- 1, s =n
( jy= (o Y j - Y . ) ) z = 0 .
To achieve consistency, t should be small in comparison to n. THEOREM 4.1. s + 0’ in quadratic mean if t + 00 and tln +0 .
5.4.
ESTIMATION OF
O2
THEOREM 4.2. If 0 < c < ln(l/r(V)) and t - (1/2c) Inn = 0 (1 ), then n
-E((s- a’)’) In n
+
204
-. C
For r ( V ) , see (2.2.4). Neither theorem requires o’ > 0, but, if this condition holds, (4.5) can be rewritten in the instructive form E(((s/02)- I)’)
-
4t/n.
As we will see in Section 5.1, s is an example of a spectral estimator, and our proofs of the above theorems differ from arguments familiar in the theory of such estimators (Hannan, 1970, Chapter V) only in the simplifications attendant to this special case and the slight complications due to nonstationarity. A result concerning spectral estimators due to Parzen (1958, Eq.(5.6)) suggests that no estimator of o2 has mean square error of smaller order of magnitude than (In n)/n under our assumptions. Proofs.
As usual, we can assume that U*f = 0. Clearly s-0’
=A + ~ - 8 ~ - 8 2 ,
(4.6)
where
& =
y!!!&u
I4
[for pu* and
E,
see (3.2), (3.3), and (3.4)]
and
We postpone until later the proof that (4.7)
86
5. FUNCTIONS OF MARKOV PROCESSES
as t -+
00
and t/n + 0. As a consequence of (3.9,
Thus E(&2)=
o(t/n)
if t/n 40. Clearly
hence
6: = o(t/n)
(4.9)
62 = O(u2').
(4.10)
as n + c o . Also
so that
Theorem 4.1 follows immediately from (4.6)and (4.7)-(4.10). If t satisfies (4.4), where u = e-c, then (4.10)yields
:6
= o(t/n).
(4.11)
Using this in conjunction with (4.7)-(4.9) and (4.6), we get lim(n/t)E((s-a')')
= lim(n/r)E(d2) =
-
4a4,
and (4.5)follows on noting that t
(1/2c) Inn.
Inequalities (4.3)are equivalent to I > u > r ( ~ ) . I The remainder of the section is devoted to the proof of (4.7). Since
(4.12)
5.4.
ESTIMATION OF
87
ul
where
Also, (4.13) where (4.14) and
a = pu pu - E(Yi Yj)pu - E(Yk Yi)pu + o(,..Jvj)+I~l
= -pupu
= -pupu+
+ ,-pVO+lul)
o*
(4.15)
by (2.8). Combining (4.13), (4.14), and (4.19, we obtain E((YiYj-pu) (YkYi-pu)) = P k - i P t - j
+ p l - i p k - j + 4 + O*-
In view of (4.12) it suffices to establish the following points: ( l / n t )C
~k - i P I - j
= (l/nt)
C
(l/nt)CCq
+
PI-i ~ k j - +
204
9
0,
(4.16) (4.17)
and (l/nt)CCO* + 0 as t + co and t/n+O. The last of these presents no difficulty, since
= O(l)O(n) = O(n)
and the other component of O* is similarly bounded. Proof of (4.17). Clearly "- 1
According to (4.14), q is the difference between two terms, each of which are invariant under permutations of i, j, k, and I, so q also has this property.
88
FUNCTIONS OF MARKOV PROCESSES
5.
Thus
ICCqI
4!c'141
(4.18)
Y
where C'is the sum over O < i < j < k < l < n - l . Over this range, Pk-iPl-j
= o(a'+)
(4.19)
and P1-iPk-j
=
o(d-i)
(4.20)
by (2.9). The equation preceding (2.6) gives
Q = '(Yi 5)~u + E(YiS(Xj))
(4.21)
where g = fvk-jCfV'-kf).
The first term on the right in (4.21) is pupu
+ O(aj+'-k),
as a consequence of (2.8) and (2.9), while the second is
E(Yi u j - ' g ( x , ) ) = E(Y,) u w g
+ E(YiVj-ig(xi))
- o(a'+'-i + &i).
Thus Q
- PuPu =
~ ( ~ i + l - -+ k ai+l-i
Substituting this estimate, (4.19), and (4.20) into (4.14), we obtain
x'
=
~ ( ~ i + l -+ k ai+l-i
Now of each of these powers of q are 0 (n), and (4.17) follows.
c(
+ a'-').
is O(n). Hence
x'141 and, by (4.18),
Proof of (4.16). The equality in (4.16) is obtained by interchanging k and I and replacing - 0 by v in Ex. The inner sum of the term on the left is
1
i.k:
Pk-iPk-i+u-u'
0 6 i, i + Y,k , k + u < n
If k is replaced by i+d, this reduces to
5.4.
89
ESTIMATION OF az
where Mn(u, u, d ) is the number of i (possibly 0) such that O 1). Thus, by the dominated convergence theorem, F,(d,W)
as t +
00
+
J-;z(Y)dY
and t/n+0, for every d and w.
=2
90
5. FUNCTIONS OF MARKOV PROCESSES
Since 0 < M. < n, it is clear from (4.24) that
IF,(d,w)l
< (2t+l)/t
Q 3.
Also
Thus the dominated convergence theorem can be applied to the right-hand side of (4.23) to obtain the limit
cW
d,W=
w
PdPd+w
= 2a4
given in (4.16). This completes the proof of (4.7). 5.5. A Representation of a2
To obtain (4.1), it was assumed that a2> 0, so it is useful to have simple criteria for positivity of this constant. Since
< Po
1P.l
for all u, positivity of po is certainly necessary for positivity of 02, but, in general, it is not sufficient. Theorem 4.1 gives a representation of o2 that permits us to show (Theorem 4.2) that p o > 0 implies o2 > 0 for the important special case of indicator functions f = I , of B E 9, provided, of course, that I , E L. It is anticipated that Theorem 4.1 will be useful for establishing positivity of a2 for other functions f E L. For any g E L, let J ( g ) (4 =
j
K(x,
(dY)-WW2
= Ug2 ( x ) - (Ug)2 ( x )
Clearly J ( g ) E L. Iff
EL
with U"f
f* =
= 0,
.
let
c u'f= pf. W
j=O
W
j=O
The series converges absolutely in L, and L is complete, so f * E L. Clearly (I-U)f*
THEOREM 5.1.
a2= U"Jcf*).
=f
91
5.5. A REPRESENTATION OF u2
This theorem generalizesa formula of Frtchet (1952, p.84) for finite Markov chains.
Proof
since the second series converges in L and Urnis continuous on L. But
c m
j=
by (5.1), so that
- 00
v q - = 2f* -f = f* + Uf*
2 fvijy
j=-m
+ Uf*)
= (f*- Uf*) (f* =
cf*)2
- (Vf*)Z.
Thus
2
=
Uyf*)’- Urn(Uf*)’
= UV(f*)’
- Urn(Uf*)’= U r n J ( f * ) .
a
Let po and 17’ correspond to IB€ L, or, equivalently, to f = IB-b, where b = UmIB.Clearly 0 Q b Q 1 and po =
U*f’ = b(1-b),
so po > 0 is equivalent to 0 < b < 1. THEOREM 5.2. I f 0 < b < 1, then u’ > 0.
Proof. Assume u’ = 0. We will show that b E‘ (0,l). First,
E(If*(xj+,)-uf*(xj>I’) = E ( J ( f * )(xi)) = E(U’Jcf*)
=
(XO))
WJ(f*) WO))
by Theorem 5.1, so
E(If*(xj+,)-uf*(xj>I’) =O(4. Using (5.1) to eliminate Uf* on the left, and taking square roots, we obtain
92
5. FUNCTIONS OF MARKOV PROCESSES
where Yj =f ( X j ) and Yj* =f*(Xj). Thus where
so that sn,n
in probability as n+ For any k,m 2 0,
- (Yn* - Y?n)
+
0
00.
E(Y,*kY2*,”)= E(U”(f*kU”f*m)(-Yo))
PkPm as n-, 00, where pk = Umf*k.Since Yn* = 0(1), it follows easily that there is a unique probability 11 on the set R of real numbers such that fEm <jp(d 0, liminfP(ISn,nI < E ) 2 n-
m
v{t:
<E}
=
PXP{(t*s):
=
/P(mPit:
It-sl < E l
It-sl < E l
= 6.
For any q in the support of p, the integrand is positive. Thus 6 > 0, and P(Isn,nl
<E)
>0
93
5.6. ASYMPTOTIC STATIONARITY
where
(5.3) can be rewritten But Tn(wi)is an integer, so b is within 2e of an integer. Since this holds for all E 0, b is an integer, and b E’ (0,l). I
=-
5.6. Asymptotic Stationarity
If the distribution po of Xo is stationary, then {Xn}n,Ois a (strictly) stationary process, so (Yn}n,ois too. If po is not stationary, we still expect {Yn}n,oto be nearly stationary if a sufficient number of observations at the beginning of the process are disregarded. In other words, we expect the shifted process gN= {YN+n}n,O to approach stationarity as N + 03. Theorem 6.1 gives a result of this type. THEOREM 6.1. The jinite-dimensional distributions of gNconverge weakly
to those of a stationary process gym as N -+
00.
The proof is based on the following lemma, special cases of which have been noted and used in preceding sections. LEMMA 6.1. For any k 2 1 and g o , ...,gk-
E L,
Proof of Lemma 6.1. For k = 1 we get F ( . ;g o ) = g o E L. Suppose, inductively, that the assertion of the lemma holds for some k L 1. Then Ik- I
= F ( x ; g d , ...,S i - l ) ,
\
(6.1)
wheregf=gi for O < i < k - l , and &-1
=gk-IUgkEL.
By hypothesis, the function on the right in (6.1) belongs to L, so the assertion of the lemma holds for k + 1.
94
5. FUNCTIONS OF M A R K O Y PROCESSES
Proof of Theorem 6.1. It follows immediately from the lemma that
* uwF;(* ;90, gk- 1) as N + 00. For any nonnegative integers m,, ...,m K , and distinct nonnegative integers nl,... ,nK, we can apply this to k- 1 = max(n,, ... ,nK) ***
and
9i =
i # nj, allj,
to obtain convergence of
Since the variables Y, are bounded (by If I), such convergence of moments implies weak convergence of the joint distribution p(nl,
e e . 3
nK)
of YN+nl,
*.*)
yN+m,-
But the distributions D N ( n , ,...,nK), for fixed N, are consistent, so the asymptotic distributions D m ( n i ,- . - , n ~ ) are too. Thus there is a distribution Dm on R m with these finite-dimensional distributions (Neveu, 1965, p.82). Clearly DN(nl4- 1 ,
...,nK+ 1) = DN+'(nl,
nK) 3
SO
Drn(n,-k 1 , ... ,n ~ 1) + = D"(n1,
...,n ~ ) ,
and any process OY" with distribution D" is stationary.
I
5.7. Vector Valued Functions and Spectra This section sketches some generalizations of the results of previous sections. Proofs are omitted, since they are slight extensions of arguments given earlier.
5.7.
VECTOR VALUED FUNCTIONS AND SPECTRA
95
If I is a positive integer and fi E L for 1 < i < I, let
f=
P,, = lim cov(Y,,, Y,,+,,) n- 03
exists, P,, = O(d"I), the matrix
z=
f
(7.1)
P,,
U=--m
is positive semidefinite, and the distribution of Z,, converges to the multivariate normal distribution with mean 0 and covariance matrix Z:
z,,
-
N(0,Z). (7.3) This multivariate central limit theorem can be proved by applying Theorem 2.3, the univariate case, to 2*f, for each I-vector 2. Just as in Theorem 6.1, the finite dimensional distributions of {YN+,,},,20 converge to those of a stationary process as N + m . In the remainder of the section, it is assumed that L c Bc(X), rather than B'(X), and that L is closed under complex conjugation (denotedf). If W and Y are complex valued random variables, let cov(W, Y) = E((W - E(W))(Y - E( Y))) .
For anyf,g E L, the limit ~u
cf,8) = n-limw cov(f (xn),( x n + u)
exists, and p,cf,g) = O(alUI).The j , kth element of the matrix Pu in (7.1) is p,,cf,fk). For real 2,
C u= W
d L g ; 2) = (1/2n)
w
puCf,g)e-"A
is the cross spectral density function off and g, and a(f,f; 2 ) is the spectral density function ofJ See Hannan (1970) for a full discussion of spectra and their estimation. Note that
2lroy;f; 0)
= cTz
96
5. FUNCTIONS OF MARKOV PROCESSES
is the asymptotic variance in the univariate central limit theorem, while 2xauj,f,; O) = z j k
is the j , kth element of the asymptotic covariance matrix of the multivariate central limit theorem (7.3). The quantity
is the jinite Fourier transform of f ( X , ) , . .. ,f ( X n - ,). The following relation between the cross spectral density function and finite Fourier transform generalizes Theorem 2.2:
cov(w(f;
4, w ( g ; 4)= ocf g ; 4+ O ( l / n ) .
The natural estimator
where
converges to p . u , g ) in quadratic mean and almost surely as n + co. The estimator
which generalizes s in (4.2), converges to a(f, g; A) in quadratic mean as t + co and rln-+O. If t is chosen in accordance with (4.3) and (4.4), and --R
,,which are Markovian according to Theorems 1.2.2 and 1.2.3. We first consider 9'for distance diminishing models whose state sequences are regular with respect to L ( X ) . Then we give comparable results for 9'and 9"for learning models in which X is regular with respect to B ( X ) .
98
99
6.1. THE PROCESS Xa' = (En,X*+ 1)
Let U be the transition operator for a state sequence % of a distance diminishing model, let U' be the transition operator for S', and let X' = E x X. For f E B(X') (real or complex) let m ' 0 = supm(f(e, 9) 9 CEE
Ilfll'
= m'Cf)
and
L' = {f
E B(X'):
+ If1
9
Ilfll' < a}.
THEOREM 1.1. If U is regular with respect to the norm II.II on L(X), then 3', U', L', and 1) 11' satisfy the assumptions of Chapter 5 [i.e., (a), (b), and (c) of Section 5.1 or the corresponding assumptions in the complex case considered in Section 5.73.
-
It follows that all of the theorems of Chapter 5 are applicable to f(X,,') iff E L'. We now call attention to some interesting subclasses of L'. If f ( e ,x) = g(e)h(x), where g E B ( E ) and h e L ( X ) , then f E E . In fact, If1 = JgIIhl and m'Cf) = 191 m(h), so 1) f 11' = 191 llhll < co. I f f = h [i.e., g(e)= 11, then m'Cf) = m(h) and 11f 11' = llhll. I f f = g [i.e., h(x) = I], then m'Cf) = 0 and (1 f 11' = 191. Thus B ( E ) and L ( X ) are naturally isometrically embedded in L', and all theorems of Chapter 5 are applicable to g(E,), g E B(E), and to h(X,,), h E L ( X ) . In the latter case, these results can be obtained more simply by applying the same theorems to S, U, L ( X ) , and 11 11. So the primary interest in Theorem 1.1 is that it gives asymptotic properties of g(E,,).
-
COROLLARY.All of the results of Chapter 5 are applicable to g(E,,) for 9 E B(E). Proof of Theorem 1.1. By Theorem 1.2.2, assumption (a) of Section 5.1 is satisfied. We omit the elementary verification of (b) (and its complex analog). IffEL',
fw-fw
=
s
P ( X , a
+
(f(e,u(x,e))-f(e,u(y,e)))
( P k de) - P ( Y , d 4 ) f (e,4%4)
so that (1.2.10) yields
m'(U'f) = mCf') < m ' ( f ) r , + 2 1 f ) R l .
Thus U' is a bounded linear operator on L'. If U is aperiodic with limit U", let U'"f(e, x ) = U"f'(x).
9
100
6. FUNCTIONS OF EVENTS
Then
11 ury- Ulrnf1Ir
=
11 un-y - U“flII
by (1.2.1 I), so that
IIU’Y-
urrnfll’ < IIU”-’ - U r n [Il/f’I < lIu”-’- U”ii i i ~ ’ l l ’ l l f l l ’ -
Thus IIU’”-UrrnII--rOasn+co. It follows immediately from ( I . 1) that, if U“h is constant for each h E L ( X ) , U‘“f is constant for each f E E . This completes the verification of (c). Finally, it is clear that, when L‘ c B‘(X’), L‘ is closed under conjugation (with Iifll’ = ilfll’). I There are interesting learning models whose transition operators U are regular with respect to the norm 1.1 on B ( X ) . Finite state models furnish many examples, and Suppes’ stimulus sampling model for a continuum of responses is another. Criteria for regularity in this sense are given in the last paragraph of Section 3.7 and in Theorem 4.1.2. According to Theorem 1.2 below, such regularity is inherited by both U ‘ and U“. THEOREM 1.2. If U is regular on B ( X ) , then the assumptions of Chapter 5 are satisjied by %‘, U ‘ , L’ = B ( X ’ ) ,and II.1)’ = I I, and by X , U“, L” = B ( X ” ) , and 11 .I]‘’ = I . I. A s a consequence, all results of Chapter 5 are applicable to g(&) for E B(E).
-
Proof. Only (c) need be checked. If U‘“ is defined by (l.l),
11 u y -
Ulrnf11’
=
1un-y - Urnf’l
by (1.2.1 I). Thus
1IU’”S- V r n f11’
< 1un-1 n), thus
Replacing B by
B we obtain
IK,("'(X,B )
- L,,(x, B)I
< I&,
C)
- P(nc)l/2.
(2.4)
So (2.3) follows from the form of the Poisson theorem given by Lemma 2.1. I
113
7.2. SMALL PROBABILITY
LEMMA2.1. lB(n,c) - P(nc)l/2 < nc(1 -e-'). Proof. If Pi and Qi are any Bore1 probabilities on the line,
where
*
IPi * Pz-Qi * Q21 G denotes convolution. For
PI * Pz - Q i
and
* Q2
+ IPz-QzI
If'i-Qil
*
= PI (Pz-Qz)
3
(2.5) (2.6)
+ ( P i - Q i ) * Qz
G su~I(Pz-Qz)(B-x)l G IPZ-QZI/~,
so that
IPi * (Pz-QJl G I P z - Q z l . Using (2.6) inductively we obtain IP*"-Q*"l 4 nlP-Ql. Since B(1, c)*, = B(n, c) and P(c)*" = P(nc), this yields IB(n,c)-P(nc)l G n l B ( l , c ) - P P ( c ) l .
But I B ( ~ , c ) - P ( c ) ~=/ ~
1
i: bj > PI
(bj-pj),
where b j = B ( l , c ) ( { j } )andpi = P ( c ) ( { j } ) ,and
bo= 1 - c < e-' = p o ,
so I B ( I , c ) - P ( c ) 1 / 2 = b , - p i = c(l-e-').
I
The bounds in (2.3) and (2.5) are useful only when n is not too large relative to I/c, as, for example, when nc is bounded or converges to t < 00. Using the fact that the distribution functions of B(n,c) and P(nc) both approximate the normal distribution function with mean and variance nc when nc is large and c is small, it can be shown that sup IB(n, c) - P(nc)l n2O
as c
+0
0. Thus, according to (2.4), sup IK,'"'(x, B) - L,, (x, B)I
n.*,B
as c +0. For explicit bounds, see Vervaat (1 969).
-+
0
114
7. INTRODUCTION TO SLOW LEARNING
7.3. Small Steps: Heuristics The assumptions of the theory of learning by small steps will be spelled out in detail in subsequent chapters, but enough was said in Section 7.1 to support an informal discussion of the kinds of results to be obtained in Chapters 8 and 9. Let {X,"},,, be a state sequence corresponding to the value 8 of the learning rate parameter. In general, X," is an N-dimensional random (column) vector. We do not work directly with X," in Chapter 8, but, rather, with a linear normalization
z." = ( X , , @ - f ( n W J B , where the function f is defined in (B) of Theorem 8.1.1. In Chapter 9, no spatial normalization is necessary. The time scales in Chapters 8 and 9 must also be handled differently. Let = 8 and Y; = Z,,@in the former case, and let 7 = 0 2 and Y;=X," in the latter. Just as in the case of learning with small probability, we will obtain approximations to Y (Y;) (the distribution of Y;) by distributions of the form Y ( Y ( n r ) ) ,where Y(t), t 2 0, is a Markov process with continuous time parameter. However, unlike the pseudoPoisson limit in the previous case, the process Y(t) is a diftiion; i.e., it has continuous sample paths. This reflects the basic assumption that AX,,@ is small when 8 is small. We will see in subsequent chapters that the variables Y; satisfy the fundamental equations
+ E((dY;)21Y; = y > = rb(y,n,r) + E(dY,'I y.' = y )
=
W ( y ,Iz, 7 )
,
O(?)
O(?),
(3.1)
and
E(lAY;I31 Y J = y ) = O ( 7 ) , where y2 = yy*, I yl = y*y (* indicates transposition),
4y,n,7)
+
a(r,t),
and
b(y,n,r)
+
b(y,t)
(3.2)
as 7 + 0 and nr+ t. For any t and 7 , let n = [ t / r ] and ' ( t ) = Y;. Then Y'(t+r) = Y i + l ,so (3.1) can be rewritten
+ o(l), 7-'E((Y'(t+.r) - Y'(t))2lY'(t) = y ) = b ( y , n , z ) + o ( l ) , t-'E(Y'(t+r)
- Y'(t)lY'(t) = y )
= a(y,n,r)
and
?-'E(IY'(t+r) - Yr(t)I31Y'(t) = y ) = o(1).
(3.3)
1 I5
7.3. SMALL STEPS: HEURISTICS
Now nr -,t as 7 +0, so (3.2) applies and (3.3) suggests that as r-0, where Y(t) is a diffusion such that limr-'E(Y(t+r)1-0
Y(t)lY(t) = y )
lirnr-'E((Y(t+r) - Y(t))'IY(t) r-0
=
a(y,t),
= y) =
b(y,t),
(3.5)
and
More generally, we might expect convergence of the higher-dimensional distributions, where 0 < I , < t 2 < ... < t,, or even convergence of the distribution, over a suitable function space, of entire trajectories of the process, 9(Y'(t), t Q T) + Y(Y(t), t < T) as r + 0. If, instead of Y'(t), we consider the random polygonal curve with vertices Y*(nr)= Y i , i.e., P(t) =
(
(n+l)r-t
) + (?) Y,r
p(t)
Y,+,
a)
for nr < t Q (n+ l ) r , the natural path space is the set C([O, of continuous (vector valued) functions on [0, TI . For the sake of simplicity, we will focus our attention on the distribution at a single value of t in the chapters that follow. The importance of this circle of ideas for us is that we shall now not be surprised to find that Y ( Y ' ( t ) )converges when (3.1) and certain auxiliary conditions obtain. The Y(t)), while intuitively satisfying, is not characterization of the limit as 9( always convenient for proving convergence, so alternative descriptions of the limit will be used later. Furthermore, plausible though the transition from (3.1) to (3.4) may be, the theorems currently available to justify it (e.g., Theorem 1 on p. 460 of Gikhman and Skorokhod, 1969) do not cover the cases considered in subsequent chapters. Thus we will have to provide our own theorems of this type.
8 0 Transient Behavior in the Case of Large Drift
8.1. A General Central Limit Theorem We now introduce the notation and state precisely the assumptions of Theorem 1.1, which is a general central limit theorem for learning by small steps. Let J be a bounded set of positive real numbers with in f J = 0. Let N be a positive integer, and let RN be the set of N-tuples of real numbers, regarded as column vectors. For every 8 E J , {X,"}.,, is a Markov process with stationary transition probabilities and state space a subset Z, of RN. Let I be the smallest closed convex set including all lo,O E J. Let H: be the normalized increment of X,",
H," = AX,"/B , and let w ( x , O), S ( x , 0), s(x, 0), and r ( x , 0) be, respectively, its conditional mean vector, cross moment matrix, covariance matrix, and absolute third moment, given X," = x : w ( x , e ) = E(H;IX," = x ) , s ( x , e)
=
E ( ( H , B ) ~ I X=. BXI,
s ( x , e) = E((H: - w ( x , e))zIx: =
e)
= s ( ~ , - w2(x, e) ,
and 116
117
8.1. A GENERAL CENTRAL LIMIT THEOREM
Here x 2 = xx* and 1x1’ = x * x for x For any N x N matrix A, let
RN,where
E
*
indicates transposition.
N
We assume that I,,approximates I as 0 -,0 in the sense that, for any x E I, (a.1) lim inf Ix-yI
=
8-0 Y S I ~
0.
Next we suppose that there are functions w ( x ) and s(x) on I that approximate w ( x , 0 ) and s(x,0) when 0 is small, and that, in the former case, the error is 0 (0) : (a.2) S U P ~ W ( X-, w~ () x ) ~= O(0) XSI,
and (a.3)
E
= sup Is@, xel,
0) - s ( x ) ~+ 0
as 840. Let S(X) = s(x)
+ wZ(x).
The function w is assumed to be differentiable, in the sense that there is an N x N matrix valued function on I such that IW(Y) - w ( x ) - w’(x) (Y-XII
(b.1) lim
=
l Y - XI
Y-x Ye1
for all x E I. We assume that w’(x) is bounded, (b.2)
CL =
SUPIW‘(X)J
)* + s(fW)
9
and the initial value is 0. Part (A) means that there is a constant CT such that o,(O,x) < CTO for all O E J, x E 1 0 , and n < T/O. In (B), (1.2) has an analogous interpretation. If d is any metric on probabilities over RN such that P,-+ P weakly implies d(P,, P ) + O (e.g., Dudley's metric described in Section 2.2), it follows from (C) that
as 8 4 0 . The routine argument, which we omit, is based on the compactness of the interval 0 < t < T and the ball 1x1 < T, and the joint weak continuity of 9 ( t , x). Another corollary of Theorem 1.1 is convergence of finite dimensional distributions
az.",, ..., z:,> as O + O and n j O + t j . If (c) is replaced by the stronger assumption
it can also be shown that the distribution over C([O,T I ) of the random polygonal curve Z e ( t ) with vertices Ze(nO)= 2." converges weakly to the distribution of a diffusion as O+O. If 10 = I for all 0, (a.1) is certainly satisfied. And if, in addition, (7.1.2) holds, w(x, O), s(x, O), and r(x, 0) do not depend on 8, so (a.2) and (a.3) are satisfied, and the supremum over O in (c) can be omitted. A simple and important example of this sort is the five-operator linear model with Oij = Oqij, where qij 2 0 and 0 < O < max-'qij. In this case, w(x) and s(x) are polynomials, so that (b) holds, and I H.1 < maxqij as., so (c) does too. Thus Theorem 1.1 is applicable. Another example satisfying 10 = I and (7.1.2) is the standard multidimensional central limit theorem. Let W, , W,, ... be a sequence of independent and identically distributed random vectors, and let
120
8.
TRANSIENT BEHAVIOR WITH LARGE DRIFT
(X,” = 0). Then {X:},,, is a Markov process in I = RN. Since H: we have = w, w(x, e) = ~(4.1
s(x, 0)
and
=
= W,,
,,
covariance matrix of Wj = s ,
r(x,O) = E(1413) = r .
Thus, assuming r < 03, the remaining assumption (b) is trivially satisfied. Clearlyf(t,O) = t w and g ( f ,x) = f s , so (C) with x,q = x = 0 and 8 = l/n gives
as n+ 03. This result falls short of the standard central limit theorem only because the assumption r < co is stronger than necessary. In fact, the Lyapounov condition (c) can be replaced by the Lindeberg condition
without altering the conclusions of Theorem 1.1, and this more general condition is equivalent to E(l4.1’) < 03 in the case considered above. Theorem 1.1 was stated by Norman (1968c, Theorem 3. l), and proved in the special case N = 1, I,q= I, and L?(H:IX:=x) = U(x). Proofs of (A) and (B)are given in Section 8.3, the proof of (C) in Section 8.4. The latter is rather different from the comparable proof (that of Theorem 2.3) in the paper mentioned above. 8.2. Properties off(?)
A number of arguments in the proofs of (B) and (C) of Theorem 1.1 hinge on properties of f ( f ) that follow from its characterization as the solution of (1.1). To avoid repeated interruptions of the main development, these properties are collected in this section. The simple proofs given are quite standard. Weconsiderfirst w(x). Forx,yE I , a n d O < p < l,leth(p)= w(x+p(y-x)). It follows from (b. 1) that h’(p) = w‘(x + p (Y - x)) (Y - X I ;
hence, by (b.3), h’ is continuous. Thus the fundamental theorem of calculus gives w(y) - w ( x ) = h(1) - h ( O )
t If F c 8,IF = I&)
=
II
w‘(x+p(y-x))dp(y-x),
is 1 or 0, depending on whether or not w E F.
(2.1)
121
8.2. PROPERTIES OF f(t)
and (b.2) then implies that w is Lipschitz, IW(Y)
- w(x)l G alv-xl*
Also,
so that I4.Y)
- w ( x ) - w ’ ( 4 (Y - 41 G (PP) lY - XI
(2.3)
as a consequence of (b.3). Now IW(X,e)I
G r(x,8)% G r%
(2.4)
[see (c)], from which it follows, via (a.1) and (a.2), that (2.5)
Iw(x)l G r%.
If I = RN,the method of successive approximations (see, for example, the proof of Theorem 6 in Chapter 6 of Birkhoff and Rota, 1969) can be used in conjunction with (2.2) to construct a solution f(t) = f ( t , x ) to (1.1) with f(0)= x . The more devious existence proof for general I is given in the next section. Whether or not I = R N , there is at most one solution. For if f(t) and f ( r ) both satisfy (1.1) and start at x , then
f(t)-
m
=
j 0‘ ( w ( f ( 4
so that
-m G
If@)
a
- w ( f ( 4 ) )du
9
1‘ 0
If&)-f@)ldu
by (2.2). Thusf(t) = f ( r ) follows from this form of Gronwall’s lemma (Hille, 1969, Corollary 2 of Theorem 1.5.7). GRONWALL’S LEMMA.
If c, k 2 0, and h ( t ) 2 0 is continuous, then
for 0 < t < T implies
h ( t ) < cek‘ for 0 < t < T.
122
8. TRANSIENT BEHAVIOR WITH LARGE DRIFT
Throughout the rest of the section we suppose that, for every X E Z , f ( r ) = f ( r , x) is a solution of (1.1) with f(0) = x . For 6 > 0
so I f ( t + 6 ) - f ( t ) l G 6r”
by (2.5). Also,
thus
by (2.6). Next, note that
Therefore,
so that
(2.10) Note that (ear- I)/t is an increasing function of t ; hence ear- 1 G Ct, where C = (eaT- l)/T, for t < T. From (2.6) and (2.9) we get If(t,Y) -f(U7X)I
< If(t7Y) -f(&Y)l I f ( U , Y ) < r%It-ul + e a ” l y - x l ,
-f(U7X)I
which converges to 0 as t + u and y + x. Thus f(-,.) is jointly continuous.
123
8.2. PROPERTIES OF f ( r )
The existence of solutions A ( t ) = A ( t , x) and B(t) = B ( t , x ) of the linear differential equations (1.3) and B‘(t) =
w’(f(0)B(t)
(2.1 I)
with A (0) and B(0) the identity matrix can be proved by successive approximations, and, in view of (b.2), the uniqueness proof given above for (1.1) applies here also. Both A ( t ) and B(t) are invertible; in fact, (2.12)
B(t)* = A(t)-’,
since
d dt
- B(t)*A(t) =
($
B(r)).A(t)
+ B(t)* -dtdA @ )
=0
and B(O)*A(0) is the identity matrix. The matrix B(t, x) is the differential as we will see below. This fact is needed in Section 8.5. at x off(t, Clearly, a),
F(t) = f ( t , Y ) -f ( 4 4 - a t , x) (Y -x)
Thus
by (2.3), so that
for t < T by (2.9). Thus Gronwall’s lemma gives IF(t)l
< (/?/4a)( e 2 a T - l ) l y - x 1 2 e a r
for t < T. Taking T = t we obtain I f ( t , y ) - f ( t , x ) - B(r,x) ( y - x ) l
< ( 8 / 4 4 ) e u f ( e 2 a ‘ - l ) l y - x ~ 2(2.13) .
a.
124
TRANSIENT B E H A V I O R W I T H LARGE DRIFT
8.3. Proofs of (A) and (B)
In the rest of this chapter, we use the notation K for bounds that do not depend on 0, n, and x (e.g., a, p, and r ) , and C for bounds that are valid for all x and nO < T. In this section the process X,, = X," has initial state x E I,, H,, = AXn/O, p,,= ~,,(O,X), w,, = w , , ( ~ , x )and , Y,, = X,,-p,. Proof of (A). Clearly W,+I
=
+ 2E(Yn*AY,)+ E(lAY,IZ).
0,
(3.1)
Since X,, E I, and I is closed and convex, p,,E I, and w(p,) is defined. We have E(Yn*AYn)/O = E(Yn*(Hn-
since E(Yn*)= 0,
WO1.I))
E(Yn*dYn)/O = E(Yn*(w(Xn, 0 ) - w(pn))) because E(H,,IX,,) = w(X,,, O), E(Y,,*AY,,)/O= E(Yn*(M'(Xn,0 ) - w(Xn>))+ E(Yn*(w(Xn)- ~ ( p , , ) ) )
< E(IYnI Iw(xn, 0 ) - w(Xn>I)+ E(lYnl Iw(Xn) - w(pn)I) < KOE(IYnI) + KE(IYn12) by (a.2) and (2.2), and
E(Y,*AY,,)/O < KO
+ Kw,,
since E(lY,,l) < ( 1 + 0 , , ) / 2 . Finally,
E(lAYn12)/e2< E(IHn12)
as a consequence of (c). Using (3.2) and (3.3) in (3.1) we obtain
< (l+KO)wn + KO2,
which, on iteration, yields
n-
I
< KO2 1 (1 +KO)',
on
since w0 = 0. Thus
j=O
< e((i+Ke)n-
0,
1) (3.4)
125
8.3. PROOFS OF ( A ) A N D ( B )
Proof of (B). To begin with,
so that
If f ( t ) =f(t, x ) satisfies (1.1) with f(0)= x , then, according to (2.7) with 6 = 8 and t = no, v, =f(n8) also satisfies (3.5). Since vo = x = p o , we expect to find v, G p,,. However, at this point the existence of f ( t , x) can only be guaranteed when I = RN. Thus we shall extend w ( x ) to all of RN, define f in terms of the extended function, prove (1.2), and then show that f ( r ) E I, so that f satisfies (1.1). According to (2.2) and (2.5), w ( x ) is bounded and Lipschitz. Thus all of the coordinate functions w j ( x ) of w ( x ) are bounded Lipschitz real valued functions; hence they possess bounded Lipschitz extensions 4.( x ) to RN (Dudley, 1966, Lemma 5). The function W ( x ) with coordinates bV.(x) maps RN into RN and satisfies (2.2) and (2.5) for suitable constants u and r. It follows that, for every x E I (in fact, for every x E RN), there is a unique function f ( t ) =f(t, x ) such that
f’(0 = WfW) andf(0)
f(no, 4,
=x,
and that (2.7) holds with Win place of w. Thus, if v,
+
+
v,+ = v,, ew(v,) k, e2, where 1k.l < K. Since p,,E I, (3.5) can be rewritten in the form pn+ 1 = p n
lcnl
+ eW(pn) + Cne’
>
< C. Subtracting (3.6) from (3.7) and letting d, = p,-
so that
dn+i = dn + e(W(pn) - W(V,))+ C, 8’, Idn+ll G (l+KB)ldnl
+ C8’.
= vn(& x ) =
(3.6)
(3.7)
v, , we obtain
126
8. TRANSIENT BEHAVIOR WITH LARGE DRIFT
If vo = p o = x E l o , then do = 0, so iteration yields 1d.l G OC(eKne-1) G 8C(eKT-1)
for nO G T, and (1.2) is proved. Suppose now that x E I and t > 0. We wish to show that f ( t , x ) E I, so that f(.,x) satisfies (I. 1). As a consequence of (a. l), there is a function x. on J such that X0 E I0 and xe + x as 8 -+ 0. If n = [ t / O ] then f(n8, xe) -+f(t, x) as 8 4 0 . But P n (6, ~ 0 )
X&
+
0
by (1.2), so pn(8,x,)+f(t, x ) as 8 4 0 . Since p,,(O,xO)E I and Z is closed, E I, as claimed. I
f(r,x)
8.4. Proof of (C)
In this section, o(8) denotes a random variable such that max E(lo(O)l)/O-+ 0
n =sTI0
as O + 0. The properties of Z,, = 2." that are crucial for (C) are given in the following lemma.
8.4. PROOF OF (C)
127
(4.6)
128
8.
TRANSIENT BEHAVIOR WITH LARGE DRIFT
where But
r = wm(8,
xn)
+ Ip m (0, Xn) - vm (0, I xn)
G cmO2 by (3.4) and (1.2). Thus E(IZ,,+,,, - Z,,12)< Cm8
I
if (n+m)8 < T, by (4.7).
+ C(m8)2E(IX,- v,12)/e< cm8
Note that a: and b: are bounded and that and
a:-+a(t)
b:--, b(t)
(4.8)
as 8 -+ 0 and n8 + t, where a ( t ) = w f ( f ( t ,x ) ) and b(t) = s(f(r, x)) are continuous. Thus (C) of Theorem 1.1 follows from Lemma 4.2. LEMMA 4.2. Suppose that Z, = Z,: 8 E J, n8 < T, is any family of stochastic processes in RN,with Z, = 0, that satisfies (4.1)-(4.4). Suppose further that a, = a: and b, = b,8 are bounded, and that (4.8) obtains as 8+0 and nO-+t, where a ( t ) and b(t) are continuous. Suppose, finally, that X ( t ) is continuous with ~ ( 0=)0. Then
Z" as 8 + 0 and n8 + t, where
s(0 =
-
NO,9(0)
CA (24) A 0)-'I*b(u) CA (24) A (0-'I d.4,
and A ( t ) is the solution of A'(t)
=
-a(t)*A(t)
with A (0) the identity matrix. Note that Lemma 4.2 does not assume that Z, is Markovian. The lemma is proved below by a variant of the method of Rosen (1967a,b). Proof. Using the third-order expansion of exp(iu) for real u we obtain, for any 1 E RN, E(exp(iA*dZ,)IZ,)
= 1
+ i1*E(dZnIZ,)
- A*E((dZn)*IZn)1/2+ kll13E(IdZn13IZn),
129
8.4. PROOF OF ( C )
where Ikl
< 6 ; hence E(exp(iA*dZ,,)IZ,,)
=
1 + iOA*a,,Z,, - 8A*b,,A/2 + o(8)
by (4.1)-(4.3). Substituting this into E(exp(iA*Z,,+,)) = E(exp(iA*Z,) E(exp(iA*dZ,)lZ,,)), we obtain E(exp(iA*Z,,+,)) = E(exp(iA*Z,))
+ E(o(8))
+ 8A*a,,E (exp(iA*Z,,) iZ,,) - 8A*b,,AE(exp(iA*Z,,))/2 .
Summation over n then yields
(4.9) where E =
'?'E(o(B)).
(4.10)
j=O
Let he(A,t) = E(exp(iA*Z,,)),
a e ( t )= a,",
and
be(t) = b,",
where n = [t/8]. Note that
In these notations, (4.9) becomes
+
- leA*be(u)Ahe@, u ) du .
(4.1 1)
Let 8, be a sequence in J that converges to 0. Suppose that, for all t < T, 9(Z,") converges weakly to a probability distribution Y (t), when 8 + 0 along this sequence. We will show that 9 ( t ) = N(O,g(t)). This requires a somewhat lengthy argument. Let h(A, t) be the characteristic function of 9 ( t ) . Then he(& t) + h(A, t). Furthermore, (4.4) yields (4.12)
130
8.
TRANSIENT BEHAVIOR WITH LARGE DRIFT
from which it follows that
V h(A, t ) = and
s
exp (U*z) i z 9 (t) (dz) ,
(4.14)
VhO(A,t ) + Vh(A,t ) (see Lobve, 1963, Theorem A, p. 183). Next we show that h(A, - ) and Vh(A, .) are continuous. Let u > t and let m = [u/8] - [t/8], so that [u/8] = m+n. Then he(n, 24)
- hyn, t )
= E( exp (iA*Zn)(exp (iA*(Zn+
-zn>) - I)) ,
so that
(4.15) (4.16)
by the Schwarz inequality, and
IVh(1,u)- Vh(A,t)l G (121 Ixl%+1)X(u-t)%. As a consequence of (4.12) IVhe(A,t)I G E(IznI) G IxI“, so the integrands in (4.11) are bounded over u and 8. And, referring to (4.10),
I4 G j“‘E(lo(e)I) = 0 G nOmaxE(Io(8)1)/8 -+ 0 jaTl0
8.4. PROOF
131
OF (C)
as 8 -+ 0. Thus we can let 8 -+ 0 along 8, in (4.11) to obtain h(I,t) - 1 =
I’
I * a ( u ) V h ( I , ~ ) du3
I’
I*b(u)Ih(rZ,~)d~.
Since the integrands are continuous in u, differentiation yields
ah - ( I , t ) = I*a(t) Vh(1,t ) - +I*b(t)Ih(I, r ) . at
(4.17)
To solve (4.17), let 5 E RN,I(?) = A (t) 5, and F(r) = h(A(t),t). Then
X(t) = -a(t)*l(t). From this and (4.17) we obtain F’(t) = X(t)* V h ( I ,t ) =
+ ah (A, t )
- $I(r)*b(t)A ( t ) F(t) .
The unique solution of this ordinary differential equation with F(O)= h((,O)= 1 is
For
t =A(t)-’2,
2 ( t ) =I, and 2(u) = A ( u ) A ( t ) - ’ I , so
h(A,t) = exp(-3A*gO)I),
the characteristic function of N(O,g(t)). Thus Y ( t )=N(O,g(t)), as claimed. Suppose now that there is some t ’ < T such that U(Zz.),where n‘= [ t ’ p ] , does not converge to N(O,g(t’)) as 8 -+ 0. We will show that this leads to a contradiction. There is a bounded continuous function G on RN such that E(G(Z:.)) *
/
G ( z ) N ( O , g ( t ’ ) (dz) ) = 1.
Hence, there is a 6 > 0 and, for every j , a 8; < l / j such that (4.18)
IE(G(Z,B,))- 11 2 6
if 8 = 8 .; Let n = [ t / 8 ] . Using (4.12) and the diagonal method, we can construct a subsequence 0, of 0; such that U(Z,B)converges weakly to a probability A?(?) as 8+0 for all rational t < T. If h ( I , t) is the characteristic function of A?(?),(4.16) applies to rational t and u. Since h(I, is uniformly continuous on the rationals, it has a unique continuous extension to all of a )
132
8.
TRANSIENT BEHAVIOR WITH LARGE DRIFT
[O, T I , for which the same notation will be used. The extension satisfies (4.16); hence h(A, u ) + h(A, t) uniformly in 111 < K as u + t. Taking t irrational and u rational, the continuity theorem implies that h(A, t) is the characteristic function of a probability 9 ( t ) for all 0 < t < T. For any 0 < t, u < T, t) - h e ( At)l
< ( h ( Lt ) - h ( 5 u)l
+ Ih(A, u ) - he@,u)l + Ihe(A,u) - he(& t)l .
Estimating the third term on the right by (4.19, taking u rational, and letting 8 + 0 along 8, we obtain limsuplh(A,t)-he(A,t)l
< lh(A,t)-h(A,u)l + IAlx(lu-tl)".
Since the right side converges to 0 as u + I, he (A, t) -+ h (A, t) and thus 9(Z:) -, 9 ( t ) for all t < T, as 8 + 0 along 8 k . It follows that 9 ( t ) =N(O,g(t)) for all t < T; in particular, for t = t'. Hence 9 ( Z : , ) +N(O,g(t')), and E(G(Z:,)) + 1 as 8+0 along 8,. Since 8, is a subsequence of 8,l, this contradicts (4.18). Our conclusion is that 9(Z:)+N(O,g(t)) for all t < T as 8 4 0 . It remains only to relax the strict dependence of n on 8 (n = [t/8]). Let n' be any integer between 0 and T/8. As in (4.15), IE(exp(iA*Z,,))- E(exp(iA*Z,,.))I
< 121 x ( [ n 8- n'dl)",
which converges to 0 if 8+0 and n'B+t. Thus 9(Z,,.)-+N(O,g(t)).
I
Alternative proof of Theorem 5.2.3. Lemma 4.2 can be used in place of the final two paragraphs of Section 5.2, which complete the proof of Theorem 5.2.3. This simple exercise throws additional light on that theorem, and provides an example of an application of Lemma 4.2 to a non-Markovian process. In the notation of the final paragraphs of Section 5.2, we have a family {t;i"};= of normalized partial sums indexed by n, and we wish to show that C," N ( 0 , a') as n+ oc). Taking the conditional expectation given cj in (5.2.18) and letting 8 = d/n we obtain
-
+
E ( ( A C ~ )= ~ ~6 C0 ~~ )o(e). It follows from (5.2.19) that E(lACjI3
and, if
ICj)
=
~(e),
5 > -f,(5.2.17) yields E(ACjlCj) = o ( e ) .
These equations are instances of (4.2), (4.3), and (4.1), where aj = 0 and
133
8.5. NEAR A CRITICAL POINT
bj = 0'. Thus (4.8) holds, with a ( t ) = 0 and b(t) = . ' a
From (5.2.16) we get
E(ICj+,c-CjI') G K k e ,
-
which is (4.4) with x ( f )= Kt. Thus Lemma 4.2 is applicable, and g ( t ) = fa'. Since q8-r 1 as n-r 00, C4 N(O,a'), as claimed. I 8.5. Near a Critical Point
Instead of subtracting f(n0, x,), where X,B = x,, from X,", as in (C)of Theorem 1.1, it is natural to consider the alternative normalization z." = (X." -f(nO,
X))/fl,
where x = lime-,ox, . Clearly
4'= (x.- f W ,xo>)/Je + ( f W ,xe>-f(ne,
-x))/fl.
The first term is asymptotically N(O,g(t)) by Theorem 1.1, and, since z: = (xe - x>/Je,
l(f(ne,xtl) - f ( n O 9 x ) ) / J 8 - B(ne>z:l
< cJ81Z:l2
by (2.13). Since B ( . ) (see (2.11)) is continuous,
(f(ne, xe) -f(ne, x ) ) / J B as 8 -+ 0, no -r t , and z:
-r z.
z."
-
-+
B ( f )z
Thus Theorem 1.1 has the corollary (5.1)
N ( B ( t ) z , g ( t ) )= Y ( t )
as 8-0, nO-rt, and z:+z. This result has special interest when x is a critical point, that is, a point such that w ( x ) = 0. In this case f ( t ) = x, so that z."
=
(5.2)
(X."-X)/Je.
Furthermore, B ( t ) = e'W'(X), A(t) =
e-lw'(x)*
(5.3) 9
and
Note that s(x) = S ( x ) - w ' ( x )
= S(x).
In the case N
=
I , g(t) = f s ( x ) if
134
8. TRANSIENT BEHAVIOR WITH LARGE DRIFT
w’(x) = 0, and g ( t ) = s ( x ) ( e 2 f w ‘ ( x ) 1- ) / 2 w ’ ( x )
(5.4)
if w’(x)#O. The result (5.1) suggests the possibility that, if z ; + z , Ywe = lim L?(z;) n- 03
-
lim Y ( t )= p(00)
I+ w
(5.5)
as 8 + 0. Elaborations of this statement assume rather different forms depending on the sign of w ’ ( ~ )so, we examine the cases w ’ ( ~) 0 separately. If w’(x)c 0, then, according to (5.3) and (5.4), B(r)+O and
s(t)
+
I
a2 = s ( x ) / 2 w’(x)I 9
(5.6)
so 9(00) N= ( 0 , 0 2 ) . If x is the only critical point [hence w ( y ) > O for < x and w ( y ) c 0 for y > X I , and if the asymptotic distribution 92 exists (perhaps in the sense of Cesaro convergence), then we expect it to be the unique stationary probability of the Markov process {z,,@}.,, . A generalization of (5.5) that does not presuppose the existence of or uniqueness of the stationary probability is
y
pe
+
N(0,a2)
(5.7)
as O+O, where is any stationary probability of z:, or, equivalently, the normalization (5.2) of any stationary probability of X;. We do not require z;+z for (5.7), since po does not depend on zt and cr2 does not depend on z. A theorem of this type is given in Section 10.1. For any y E R’, Y ( t )( Y , 00)
=
@((B(t)z-y)/(g(t))’/;),
where @ is the standard normal distribution function. If w’(x)>O and s ( x ) > 0, g(z) + 00 and B ( ~ ) / ( d t ) ) ” 1/a; +
thus lim Y ( t )(y, 00)
1-m
=
@(z/a)
for all y. If X; + f.00 as. as n + a,then lim Y(z;)
n- w
so (5.5) suggests
( y , 00) =
P(X; + a),
8.5. NEAR A CRITICAL POINT
135
as 8 --f 0 and (xe- x ) / f i - , z. A theorem of this type for additive models is proved in Section 14.5 with the aid of Theorem 11.2.1 and Lemma 5.1 below. For proofs of (5.7) and (5.8), we need the following lemma about the ,,('= The lemma is applicable to any N. Unlike moments of LIZAX,,('/@. Lemma 4.1, nothing is assumed about the distribution of z t ; thus we are free to consider a stationary initial distribution later. Furthermore, o(8) is a function of zn and 8 such that
and
(5.14)
and
a.
136
TRANSIENT BEHAVIOR WITH LARGE DRIFT
Thus o(e) in (5.11) satisfies
lo(e)l/e G E + K
From (5.13) we get
It-
w’(x) (A’, - x)l
J ~ .
< KO + KO 1z.I 2
by (a.2) and (2.3). Thus IC/JB-w’(x)znI
KJB(l+Izn12),
which implies (5.10). Finally, (5.12) follows directly from (c) of Section 8.1.
9 0 Transient Behavior in the Case of Small Drift
9.1. Diffusion Approximation in a Bounded Interval The corollary (8.5.1) of Theorem 8. I. 1 gives an approximation to the distribution of X: when 6' is small, X: = xe is near a critical point x [i.e., a point such that w ( x ) = 01, and no is bounded. Some one-dimensional cases where w'(x) # 0 and where this approximation continues to be useful when no is unbounded are described roughly in Section 8.5, and will be discussed further in subsequent chapters. In the case of small drift [i.e., w ( x ) = 01, all points of I are critical, and w'(x) = 0. For a critical point x with w'(x) = 0, (8.5.1) gives <X,"-x>/Je
or
-
N ( ( x e - x ) / J e , nes(x))
x,,@- N(x0, ne2s(x)),
which suggests that
9 (X.")-b N ( x , W ) ) as 8-0, x ~ - * x ,and nO--roo in such a way that n 0 2 + t . If s ( x ) > O and I 137
138
9.
TRANSIENT BEHAVIOR WITH S M A L L DRIFT
is bounded, this conjecture is definitely wrong, since Y(X,") is confined to I and the normal distribution N ( x , t s ( x ) ) is not. However, we shall show in this chapter that Y ( X , " ) does converge as 8 -+ 0, X g x, and ne2 -+ t, in many cases of slow learning with small drift in bounded intervals. The limiting distribution is Y ( X ( t ) ) , where x(t),t 2 0, is a certain diffusion in I with X ( 0 ) = x. Under hypotheses (a)-(c) of Section 8.1, with w ( x ) = 0, the conditional moments of AX," are of the form -+
E(AX,IX,
= x) =
O(r),
E((AXJ2IX" = x) = zS(x)
+
0(2),
and E(lAX,131X" = x) = O ( 7 9 ,
where 7 = 02. In order to describe the asymptotic behavior of X." as nz -+ t, we must specify E(AX,IX, = x) more precisely as 7a(x)+o(r),and we must impose various conditions on a ( x ) and S(x) [which corresponds to b(x) below]. These remarks serve to place the assumptions that follow within the framework of the preceding chapter. However, these assumptions are self-contained, and it is not necessary to refer to Chapter 8 in order to apply the theory in this chapter. Let J be a bounded set of positive real numbers with infimum 0, and let I = [do,d,] be a closed bounded interval. For every T E J, let K, be a transition kernel in a Bore1 subset I, of I. Corresponding Markov processes are denoted .;'A Suppose that
+ 0(Z), x ) = rb(x) +
E(dX,'IX,' = X) = T U ( X ) E((AX,')2IX,'=
0(7),
(1.1)
and where
o(7)
is uniform over I,; i.e.,
as r-0. It is assumed that a has three (bounded) continuous derivatives throughout I (a E C 3 ) . We emphasize that a(j)(x),j = 1,2,3, exists and is continuous even at the boundaries x = di. Furthermore, a(do)2 0 and a(dl) < 0. Our conditions on b are rather unusual but, nonetheless, quite general, as we will see. It is assumed that b admits a factorization
139
9.1. DIFFUSION APPROXIMATION
where oi E C3, oi(di)= 0, oi(x)> 0 for do < x < d l , P (4 = 6 0 (x)/(ao (4+ 0 1 (x))
is nondecreasing over this interval, and, letting p (4)= lim P (XI x-d,
7
C3. These conditions imply that b E C3, b(di)=0, and b ( x ) > O for do < x c d l . A very broad class of functions b having suitable factorizations are those of the form
PE
b(x) = (x-do)’(dl -x)’~(x), (1.3) where j and k are positive integers, h E C3, and h(x) > 0 for all x E I. Let no (x) = (X -do)’ (h(x))’
and CT~ (x)
= (d1- X ) k (h(X))’.
Since f i is in c3,cri is too; since ao/alis increasing, p is also increasing; and since ao+ol is positive throughout I, p E C3. Another interesting class of examples are those with b(di)= 0, b ( x ) > 0 on (do,dl), and PE C3. Here we can take oi = f l to obtain p ( x ) = 3. Let $33 be the collection of Bore1 subsets of I. A transition probability is a function P on [O,oo) x I x 9 such that P ( t ) = P ( t ; ., .) is a stochastic kernel, P(0; x, .) = S,, and the Chapman-Kolmogorov equation
s
P ( s ; X, dy) P ( t ; y , A ) = P(s+t; X, A )
holds for all s, t, x, and A. Theorem 1.1 is our main result on transient behavior for slow learning with small drift. I . I . If a and b satisfy the conditions given above, there is one THEOREM and only one transition probability P such that
and
I40
9.
where o(r) is uniform ouer x
E I.
TRANSIENT BEHAVIOR WITH S M A L L DRIFT
If(l.1) holds,
Kj")(x,, .) = U ( X ; l X ; = x,) + P ( r ; X , .)
(1.6)
weakly, as r + 0,x, + x, and nr + t. The uniqueness of P follows immediately from (1.6). For if Q is a transition probability that satisfies (l.5), K, = Q ( r ) is a family of transition kernels in I that satisfy (l.l), and K Y ) = Q(nr) by (1.4). Taking r = t/n and x, = x in(l.6)weobtainQ(t; x , - ) + P ( r ; x, .)asn+co;i.e.,Q(t; x , . ) = P ( r ; x , .). As a consequence of the last equation in (lS), there is a diffusion X ( t ) , r > 0, with X ( 0 ) = x and P ( X ( s + t ) E AIX(s)) = P ( t ; X(s), A ) a.s. (Lamperti, 1966, Theorem 24.1). In particular, 9 ( X ( r ) ) = P(r; X , .).
Under the same hypotheses as in Theorem 1.1, it can be shown that
U K , , ...,X;J+Y(X(tl),
... ,X(t,))
as r+O, x , + x , and njz+tj, j = 1, ..., k. If o ( r ) in (1.1) is replaced by O(r' +') for some v > 0, then U(Xr(t), r G T ) -+ =Y(x(t), r G T), where Y ( r ) is the random polygonal line with vertices Xr(nr) = X;. The family of linear models discussed in the last three paragraphs of Chapter 0 provides a simple illustration of Theorem 1.1. Since failure is ineffective (O*c* = 0), (0.3.1) specializes to
L
O( 1 - X,)
AX,, =
with probability X,, n 1 c ,
-ex,
with probability (1 - X,,) zooc ,
(1.7)
otherwise.
It is assumed that c > 0 and n 1 > 0 are fixed, and that xoo approaches n 1 as O + O along the line 1coo/x11 =
1
+ Ok,
where, for our present purposes, k can be any real constant. Taking and di = i, (1.1) holds with
(1.8) '5
= 8'
and b(x) = Bx(1-x), (1.9) where B = x II c and A = - knI c. The special case of Theorem 1. I corresponding to such functions a and b is contained in an earlier result (Norman, 1971a, Theorem 4.3). a(x) = A x ( l - x )
,
141
9.2. I N V A R I A N C E
9.2. Pnvariance In this section we introduce a special family of transition kernels L, satisfying ( I . l), and show that, if K , is any other such family, then K p ) ( x , .) L?)(x, .) converges to zero, uniformly over x E I, and nr < v, as 7 --f 0. Thus the behavior of K:“) when 7 is small is invariant over all families of kernels satisfying (1.1). Such invariance is at once an important implication of Theorem 1 . 1 and a basic component of its proof. The remainder of the proof of Theorem 1.1 is given in the next section. The family L, is distinguished by its exceptional simplicity, which permits us to establish a crucial property (Lemma 2.2) by a rather direct computation. Let u1 = g 1 , uo = -no, e = & UJX) =
ui(x, 7) = x
p 1 = p , and p o = 1- p . Then
+~
+
( x ) eui(x),
+ 7ayx) + eu;(x) 3 1 - z l a ’ ~ - elu;l > o
u;(x) = 1
if T < T ~ for , some ro > 0 sufficiently small. The condition ‘c < t ois assumed in all that follows. Now u i ( d o ) 2 d o , since a ( d o ) 2 0 and v i ( d o ) 2 0 ; and u i ( d l )< d , , since a(d,) < 0 and ui(d,)< 0. Since ui is increasing, ui maps I into I. Thus the requirement that a transition from x to ui(x) occurs with probability p i ( x ) defines a transition kernel L = L, in I. The corresponding transition operator V = V, is
Vf(x) = 1f (ui (XI) pi (XI 3
where the summation is over i = 0 and 1. Since ui E C3 and p i E C3, V maps C = C ( I ) into C and C 3 into C 3 .
LEMMA 2.1. L, satisfies (1.1). Proof. First,
+ C ui(x) pi(x)
(ui(x)- x ) pi(x) = ~ a ( x ) 0
9
and C u i ( x ) p i ( x ) = o l ( x ) ~ ( x-) ~ o ( x ) q ( x = ) 0
as a consequence of the definition of p (q = 1 - p ) . Next,
+ r2aZ(x),
~ ( u i ( x ) - x ) z p i ( x )= r C U 2 ( X ) p i ( X )
while la1
-= 00, and C 0:
( x ) pi( x ) =
0 1’
( x )P ( x )
+ 00’( x ) 4 (XI
= 01(x)ao(x)q(x) + 0o(x)o,(x)P(x)
(2.1)
142
9.
TRANSIENT BEHAVIOR WITH SMALL DRIFT
by (2.1), so that
C 0; (x) Pi (x) =
01
(x) DO (x) = b (x)
by (1.2). The last equation in (l.l), with O ( t % ) in place of o ( T ) , is a consequence of the boundedness of a and ui. I ForfE C3, let
llfll
=
If’l + Is”l + IPl.
LEMMA2.2. There is a y > 0 such that
II~:fII
for all n 2 0,
t
< t o ,and f E C3.
G
eyrir
llfll
Proof. We often suppress the argument x in the following derivation, writing, for example, p i instead of pi(x). This renders the notation ) g ) ambiguous, so we denote the supremum norm Iglm. Clearly,
(W’ (XI = Cf’(ui)ui’pi + 1fC.i) PI
*
Since ui’ 2 0, ICf’(ui)ui‘Pil G If‘lmCu;Pi
< If’I Also, pd = -P;,
m(
1 + 0 1 0;Pi + la’ I m )*
SO
Cf(.i> pi’ = ( f ( u , ) - f ( u o ) ) P;
However, p ; = p’ 2 0, and hence
and thus
-
(2.3)
143
9.2. IN VARIANCE
We next show that there is a constant c < 00 such that
c
I(VS)”I,
1 ~ 1 , ( 1 + ~+4~
IYL
(2.7)
for all T < 70 and f E C 3 . We use the notation c below for a number of different bounds that do not depend on x , t, o r 5 By the chain and product rules for differentiation,
(vf)“(x) = Cf”#;’pi
+ C f ’ ( u i >#;‘pi
+ 2 C f ’ ( u i )ui’pi’ + Cf(xi)P;I
*
Since u! = 1 + Oq’ + 0 (7) and u;’ = Our + 0 (T), this can be written ( V ~ ) ” ( X=) A1
+ + B1 + B2 + C , A2
(2.8)
where A , = Cf”(ui>U12Pi, A , = 2 C f ’ ( U i ) pi’
9
B1 = Cf’(ui>O(2ui’~i’ + VYPi) B2 = C f ( X 3 P:
9
9
and ICI
0 if do is regular. Now m ( d , ) = co if d , is an exit boundary. Thus, to show that a(d,) = 0 for such a boundary, it suffices to show that a(d,) < 0 implies m ( d , ) < 00. If a(d,) < 0 there is a 5 < d, and an a < 0 such that a ( x ) < a if x > 5. In particular, B’(x) < 0, so that eB@) is decreasing for x 2 5. Thus de”(”)/dx is integrable over [5, d,), and m‘(x) = a(x)-’deB(”)/dx
is too. The proof that a(do) = 0 if do is an exit boundary is similar.
1
Since T,, t 2 0, is a conservative semigroup, there is a transition probability P that satisfies (3.1) for all f~ C and x E I. 3.2. P satisfies (1.5). LEMMA
Proof. Let
R, = T-’(T,-E) - r , and let f(y) = ( y - x ) ’ . Since f ( x ) 3.1,
6,(X) =
so that 16,(x)1
T-’
< IsZ,fl.
s
= 0,
(Y-X)2P(T;
and I‘f(x) = b ( x ) by (B) of Lemma
X,
dy) - b ( X ) = n , f ( X ) ,
But f =J2-2xY+
where Y ( y )= y and 1(y) = 1, and I6r(x)I
1,
a, is linear with sZr 1 = 0, so
< IQzJ21 + 21x1 I Q r 9 I
and, letting c be the maximum of 21x1 for x E I,
141 < IR,J21
+ ClR,Yl.
B y ( A ) o f L e m m a 3 . 1 , Y j E 9 ( T ) f o r a l l j > O ; thus ISZ,JjI-,Oand ISJ+O as T+O. This establishes the second equation in (1.5). ~ , obtain Applying the same argument tof(y) = y - x andf(y) = ( y - ~ ) we where the first equation in (1.5) and le;l+O, E / ( X ) = T-’
s
Iy - Xl’P(T; X , d y ) .
150
9. TRANSIENT BEHAVIOR WITH S M A L L DRIFT
But
G
(kr'l l$1>",
by the Schwarz inequality, and
I d so that le:I in (1.5).
I&l + 1%
is bounded as r + 0. Hence
+ 0,
which is the last equation
This completes the proof of Theorem 1.1. The proof of Lemma 3.2 uses only the properties of TI given in Lemma 3.1. Since, according to Theorem 1.1 , there is only one transition probability P satisfying (1.5), we conclude that there is only one semigroup TI with these properties. Let D be the set of all f E C2 for whichf" is Lipschitz; i.e.,
D
={ f ~ C ~ : m ( f " ) < ~ } ,
where
For f E D,let
Ilf II
=
['"fI
If'l + I f 7 + mCf")-
Clearly C3 c D and m ( f " ) = on C 3 , so this seminorm is an extension of 11 defined previously on C3. It can be shown that, if fk E D, {)If k l l > k > O is bounded, and Ifk- f I + O as k+ 00, then f E D,and
Ilf II
G liminfllfkll * k-
03
Thus, taking n = [ t / r ] in (2.2) and letting r + 0, we see that, for any f~ C3 and t 2 0, Ttf E D and
ll T t f II
G eytIl f
II
f
(3.8)
This bound is very interesting in its own right. Furthermore, were it available from another source, T, could be used in place of V, in Section 9.2. This approach would be reminiscent of that of Khintchine (1948, Section 1 of Chapter 3), who postulated properties analogous to (3.8).
DENSITIES. For do < x c d, , the restriction of the distribution P ( t ; X, .) to ( d o ,d,) has a density [ ( t ;x, .) with respect to Lebesgue measure. TO describe this density, we need the boundary terminology and the notations 9, gi, r*,m(x), and p ( x ) from the proof of Lemma 3.1.
151
9.3. SEMIGROUPS
Let 8,= gi unless di is an exit boundary, in which case
8, = {f€9 : f ( d * )= O} . Let r? be the restriction of boundary,
r* to 8on 8,. If neither do nor d, is a natural
where 0 2 ,Il >A, > ... are the eigenvalues of r?, and rP1, 4,, ... are the corresponding eigenfunctions, normalized so that 4: dm = 1 (Elliott, 1955). The series converges uniformly over t 2 6 and eo < x, y < el , for any 6 > 0 and do < eo < el < d , . A counterpart of (3.9) for natural boundaries is given by McKean (1956). If di is not an exit boundary,
h,(t; x)
=
P ( t ; x, {di}) = 0.
For an exit boundary d,, the following formulas relate hi to (: (3.10)
if the other boundary is not exit, and (3.11)
if both boundaries are exit, where
p o = 1-pl , and the integrals are over (do,dl). We omit the proofs of these equalities.
10 0 Steadystate Behavior
10.1. A Limit Theorem for Stationary Probabilities
Assumptions (a), (b), and (c) of Section 8.1 are in force throughout this chapter, and N = 1. We are concerned here with the limit as 6 + 0 of stationary probabilities of nonabsorbing or recurrent processes. Typically, the distribution of X: converges as n-, 00 to a stationary probability that does not depend on the distribution of X,", but, whether or not this is the case, stationary probabilities represent the possible modes of steady-state behavior of {X:}.,,. Let do be the set of stationary probabilities (with finite fourth moments if I is unbounded) of the transition kernel K, of this process. Naturally, we assume -./B # 0. The processes considered in this chapter are distinguished by the existence of a (critical) point I in the interior of I, such that E(dX:IX; = x ) is positive when x < I and negative when x > I if 6 is small. More precisely:
(d) There is an interior point I of I, such that w(x)
if x < I,
w(x) = 0
if x = A ,
I ,
w(x)
152
>0
10.1. A THEOREM FOR STATIONARY PROBABILITIES
153
and wf(A)< 0 . Theorem 1.1 requires additional assumptions when I is unbounded, as in additive learning models. Henceforth, we assume that, for any 8 E J, = -L4(Xt) E J e . Then U(X,B)= p e for all n ; in fact, X." and 2,"
= <X."-A)/Je
are strictly stationary processes.
THEOREM 1.1. If (i) I = [c, d ] is bounded, or (ii.1) Q ( x , ~ = ) E((H,B)41X: = x ) is bounded, and (ii.2) there are A > 0 and B > 0 such that (A-x)w(x,e) 2 B
for all 8 E J and x E Ie with Ix-AI > A , then
(A) E((X,B-A)2) = 0 ( 8 ) , and (B) 2," --N(0,02) as 8 --* 0, where o2 = s(A)/21wf(A)1= S(A)/2lwf(A)1. Theorem 1.1 under condition (i) is similar to the central limit theorem of Norman and Graham (1968), while (ii) generalizes Theorem 6 of Norman (1970b). Note that (A) implies E(X:) - L = O ( J e ) .
(1.1)
An improved estimate, valid under (i), (ii.l), and additional conditions, is obtained in Section 10.3. Note also that (B) differs only notationally from (8.5.7). The five-operator linear model with Oij=8qij and with nij and cij fixed provides the simplest example of a family X: of processes to which Theorem 1.1 can be applied. We have already noted that it satisfies (a), (b), and (c) of Section 8.1. For this model E(AX,IX,= x ) =
ew(x)
=~
8 1 1 ~ , , - ~ o o ~ o o ~ ~ ~ ~- ~ 1~ 0 ~~i + 0 ~8 2 70 , ~ o , where nij= nijcij, so the quadratic polynomial w has w(0) = qol no,and -w(l) =qlonl0. If both of these quantities are positive, (d) is satisfied. Furthermore, neither 0 nor 1 is absorbing, and it will be shown in Section
154
10. STEAD Y-STATE BEHA VIOR
12.1 that there is a unique stationary probability pe, to which U(X,,!) converges as n+ co, regardless of U ( X t ) . Since I = [O, I], all of the conditions of (i) of Theorem 1.1 are met. 10.2. Proof of the Theorem
As in Chapter 8, we denote a variety of bounds by K. Proof of (A) under (i). Clearly
(xn+l-1)2 = (Xn-1)2
+ 2(Xn-1)dXn + (dXn)2,
so, taking expectations on both sides, canceling E((X,,-A)’) on the left and right, and dividing by 20 we obtain
o = E((x,,-1) W ( X , , , el) + e E p ( x , ,, q / 2 .
Since S(x, 0) < K by (c) of Section 8.1,
0
0)) + KO.
(2.1)
< E((Xn-A) (w(xn, 0) - w(X,)))+ KO < KO,
(2.2)
< E((X,,-A)w(X,,
Thus E((A-Xn) w(Xn))
by (a.2) and the boundedness of I. Let
Since p is positive and continuous by (d), and I is compact, there is a k > 0 such that p ( x ) > k for all x E I. Thus (A-X,,)w(X,,) = (l-X,,)Zp(X,,) > k(X,,-A)’, which yields (A) on substitution into (2.2).
I
Proof of (A) under (ii). Since, by (ii.2),
(Xn-1)w(Xn, 0) < 0 a s . for IX,,-Al > A, (2.1) yields
o < E((x,,-+(x,,, e ) q + K e ,
where
G = IIX,-A16A.
(2.3)
155
10.2. PROOF OF THE THEOREM
Thus
~((1 - x,,)w (x,,)G) G E((x,,- 1)(W (x,,, e) - W (x,,)) G)
< KO.
But (2.3) is valid for IX,,-1l < A , so a = E ( ( ~ zx,,)~G) G
+ Ke
~e .
(2.4)
The binomial expansion of ((X,,-1)+AX,,)4 is (X,,,
- 4 4
=
+ 4(Xn-1)3dXn + 6(X,,-1)2(AX,,)2 + 4(X,,-1)
(Xn-L)4
+ (AX,J4.
(2.5)
156
10. STEADY-STATE BEHAVIOR
Now (xn-43W(x,,,
e)
e)((i-c)+c)
=( ~ , , - 4 3 ~ ( ~ , , ,
< - ~ ( x ~ - , q ~ ( i+-( -x ,c, -)4 3 ( W ( x n , e ) -
by (ii.2) and (d), so that (X,,-43w(X,,, 8)
< -B(X,,-A)’(l-G)
W(X,,))G
+ KO(X,,-A)’
by (a.2) of Section 8.1. This inequality and (2.7) yield B/3 = BE((Xn-A)’(1-G))
Thus
< KOE((X,,-I)’) + KO3.
B/3 < K8(a+/3) G KOp
+ KO2
by (2.4), and B/3 < BPI2
for 8 < 6 = B/2K. Thus
+ KO3
+ KO’
/3< KO2. Adding this to (2.4), we obtain
(A).
I
Part (B) of Theorem 1.1 is proved by noting that, as a consequence of (A) and Lemma 8.5.1, Lemma 2.1 below applies to z.”. LEMMA2.1. For every 8 E J, let {z.”},,,~ be a real valued stationary process. Suppose that
and a.s., where a < 0 and as 8+0. Then as 8+0, where
(i’
= b/21al.
This lemma is a steady-state analog of Lemma 8.4.2. Both results apply to non-Markovian processes. The proof of Lemma 2.1 is much simpler than that of Lemma 8.4.2.
157
10.3. APPRO XIMA TION TO E(X.9
Proof.
As in the proof of Lemma 8.4.2, E(eiodzn(z,,)= 1
where Ikl 0,
U"h(x) < h ( x )
if
E
< 0.
U"h(x) 2 h ( x )
if
E
> 0,
U w h ( x ) < h(x)
if
E
< 0,
Since h is continuous, (1.2) implies that (1.12)
where U" = U," is the transition operator with kernel Kra. , As a consequence of (1.3), U"h(x) - &(x) =
thus IU"h(x)-&(x)l
< aK"(x,I?+uZ;)
=a
b y (1.10) and (1.11). Henceforth, we assume X E I , * , so that h(x)=g,(x).
BY U.W,
4r(x) = U"h(x) + ( $ r ( X > - Umh(x))
2 ge(x)-a
if
E
> 0,
if
E
< 0,
and
+,(XI < g,(x) + a for
7
< T ( E , a). Finally, M-4 - v(x)2 - Ige(x)- W)l- a
if
E
> 0,
M x )- y(x)
0 for i = 0,1, so that the model is distance diminishing by Theorem 1.1. Suppose now that mi < 1. We will show that d(a,,(x),i')+O as n- 00 for all x E X , so that regularity follows from Theorem 3.6.1. Consider first the case x # i'.
(1) If Oii. < 1, let xo = x and
x" = u(x"-1, ( i , i ' , 1)). Since x"-' # i ' , x" E a,(x"-'). By induction, X"E a,,(x). But Oii, > 0, so d(a,,(x),i')
< d(x", i ' ) =
(2) If
(i-eii,yyx-q
-, 0.
nii.< 1, there is an event e such that Ju(x,e) - i'I 2 Ix - i'l
and p ( x , e) > 0, so that u(x, e) E a, (x), Therefore
x" = u(x"- ',e) # i' and
12.2.
179
THE MEAN LEARNING CURVE
= 1, so that i’ E o1(x“-’), we have i’ E a,,(x) for all n > 1. Since i‘ is not absorbing, there is a y # i‘ in a,(i’). Thus on(?)T> on- (y), and
x” E a,,(x). Thus if Bii.
d(o,,(i’),i’)< d(a,,- (y),i’) + 0 as n + co by the previous case. When oo= o1= 1, the process X,, moves to 0 or 1 on its first step and thereafter alternates between these two states. Clearly the process is ergodic, with ergodic kernel (0, l } and period 2. Such cyclic models are of no interest psychologically.
12.2. The Mean Learning Curve The mean learning curve
xn =
f‘(A1n)
= E(Xn)
is the traditional starting point for both mathematical and empirical studies of models for learning in two-choice experiments. The standard tactic for obtaining information about x,, is to use the equation
Ax,
=
E(W(Xn))
9
where W(Xn) = E(AXn I Xn)
7
to obtain a relation between x,, and x , , + ~ Clearly . E ( A , , , A X , , I X , = ~=)
ellx~xnll -e,oxxnlo = elx’x--lx,
E(A,,,AX,~X,,= x)
-e,,xx~noo + e,,
=
x’X’no1= w,,x’ - e,xx‘.
These equations are at the root of most of the computations in this chapter. Adding them, we obtain W(X) = ( e , - e , ) x x ~ - ~ l x + w , x ~
Let
6
=
el - e,,
0
= w1
+ w,,
and, if o > 0 (i.e., 0 and 1 are not both absorbing),
I
=
0010.
180
12. FIVE-OPERATOR LINEAR MODEL
In terms of these quantities, if w = 0,
("
W ( x ) - 6xx' = w o x ' - 0,x =
o(l-x)
Substitution into (2.1) then yields
if o > 0. if o = 0,
AX,, - 6E(X,,X,,') = w ~ x , , ' W,X, =
o(l-x,,)
if o > 0,
(2.3)
(2.4)
which is the relation sought. We fist consider the asymptotic A, response probability x, = limx,. n-r
4)
Later in the section we return to the problem of computing or approximating x,,. It is assumed that the model is distance diminishing and (in the case of no absorbing states) noncyclic, so that the limit x , exists. If both 0 and 1 are absorbing states, the process X,,is absorbing. Thus the probability is 1 that X,,converges to either 0 or 1, and x,
- P(X,,-P 1)
depends on the distribution of X,. The only case in which x , is known exactly is 6 = 0. Then (2.4) gives Axn = 0, so that x, = xo . An approximation to x , when 6, 8 , 1 , and 8, are small is given in Section 12.4. If i is the only absorbing state, X,, -+ i as., and x, = i. The quantity x , is of particular interest when there are no absorbing states. Let 21,
= (l/n)
n- 1
C Aim
m=O
be the proportion of A t responses in the first n trials. The corollary to Theorem 6.1.1 implies that Al,,+x, as., and where m
cT2
=x,x:+2cpj
j= 1
and pi = lim P ( A , , , A l , , + j ) - x, 2 . n-+ m
Once
a2 has
been estimated, this result can be used to construct confidence
181
12.2. T H E MEAN LEARNING CURVE
intervals for or test hypotheses about x,, on the basis of a single subject’s proportion A,,, of A, responses. When a formula for uz like (3.23)is available, the problem of estimating u2 reduces to that of estimating the parameters that appear therein. A model-free approach to the estimation of o2 is described in Section 5.4. It is worth noting that the quantities x,, p j , and u2 do not depend on the distribution of X , . Since W is quadratic with W(0)= w o> 0 and W(1) = -wl< 0, this function has a unique root A in (0,l).
THEOREM 2.1. If there are no absorbing states, and BOl < 1 or el, < 1, then l<X, 0 ,
l=x,=A l>x,>A
fi if
6=0, 6 0. To obtain the relations between x , and 1 listed in (2.6), it remains only to show that a > 0. If a=O, then p({O,l})= 1. If p ( { i } ) = 1, then, by Lemma 3.4.3, i is absorbing, contrary to assumption. Hence, p({i}) < 1 for both i, and (0, l}
182
12. FIVE-0 PERA TOR LINEAR MODEL
is the support of p. Therefore, (0,I } is stochastically closed, according to Lemma 3.4.3. If e = ( i , i ' , l), wi > 0 implies u ( i , e ) E o1( i ) , and u ( i , e) # i. Thus u ( i , e ) = i f ; that is, Bii, = 1, for both i. Since we are assuming that this is not the case, we must have u > 0. Note that u - x,x,I
=
-pp(dx)
+ x$
= -v,,
where
(2.9) is the variance of p. Thus (2.8) yields 0 = 6(x,x,'-v,)
+ w(l-x,),
or W(x,) = 6u, . If v, = 0, x , would support p, hence would be absorbing, contrary to assumption. Thus v, > 0, and the right-hand relations in (2.6) follow immediately. I We now give some results for x, analogous to those in Theorem 2.1. There are no restrictions on the parameters of the model other than those specifically mentioned below. Note first that, when 6 = 0 and w > 0, (2.4) gives x,+1
- 1 = (1-0) (xn-l),
so that x, = l + ( l - w y ( x o - l ) .
(2.10)
Consider now the case 6 > 0 (6 c 0 is similar). Rewriting (2.4) x,+, = 6E(X,X,,')
+ (1-w)x, +
0 0 ,
and noting that E(X, X,,') < x, x,,', we obtain
(2.1 1) where t ( x ) = (1 -w)x
+ wo
and u(x) = 6xx'+(l-o)x+w,.
Let 1, = Izo = x o ,
In+1 = t(1,) , and A n + 1 = u(Ln). When w = O , I n = x o , and, when w > 0, I, is the quantity on the right in (2.10). If 0 c w < 2, I,,+ 1 as n + co. The function u is just the expected
183
12.3. INTERRESPONSE DEPENDENCIES
operator
E(X,+,IX,, =x)
=x
+ W(x) = u(x),
so I,,is the expected operator approximation to x, (Sternberg, 1963, Section 3.2). If I,,converges to a limit A as n 4 co, then A = u(A); i.e., W ( I )= 0.
If o < 1, then I,
THEOREM 2.2. Suppose 6 > 0. condition 0+6 < 1, x,, < A,,.
< x,,.
Under the stronger
Proof. If w < 1, dt(x)/dx 2 0. Thus I,, < x,, implies In+, = t(C) < f(Xn)
G X,+l by (2.1 1). Also du dx
- (x)
= h(1-24
+ (1-0) > 1 - (o+6)
for x < 1. Thus w + 6 < 1 implies du(x)/dx > 0 for 0 < x 0 < I, < 1, the above argument yields x, < I,. I
< 1.
Noting that
An approximation to x,,, valid when the parameters Oij are small, is described in Section 12.4. This approximation is closely related to A,.
12.3. Interresponse Dependencies This section treats the relationship between responses on two different trials. Successive trials are considered first. Let a, denote response alternation between trials n and n + 1, i.e.,
184
12. FIVE-OPERATOR LINEAR MODEL
But, by (2.2),
+ o1xn. Equation (3.1) is obtained by adding these equations. I P(AinAon+I) = (1-fll)E(xnxn’)
To use (3.1) as it stands, we must know x,, and E(X,,X i ) . The only case in which a simple exact formula for x,, is available is 6 = 0. If, in addition,
6, = s1- so = 0 , where
E(XnX;) can be calculated from (3.15). Equations (3.9H3.11) and Theorem 3.2 give expressions for some derivatives of P(a,,) under these rather restrictive conditions. Our immediate objective is to treat the “difficult” case 6 # 0 by an altogether different method. Some slow learning approximations to quantities involving alternations are given in Section 12.4. When 6 # 0, comparison of (2.4) and (3.1) suggests that the troublesome term E(X,,X,,’) be eliminated between them to obtain the relation
6P(un)-t5xn
= 6(o,xi’+o,xn)-t(o,x,’-o,xn)
= -2(1-e,)o0X;
+ 2(1-e0)w1xn
(3.5)
between the mean learning and alternation curves. Even though (3.5) relates the “observable” quantities P(u,,), x,,, and x,,+ it is not easy to test directly. We now note some immediate consequences that are more amenable to comparison with data. It is assumed that the model is distance diminishing and noncyclic. Then the limit P(u,) = lim P(un) n-r
00
185
12.3. INTERRESPONSE DEPENDENCIES
exists, and (3.5) yields
~ ( a ,= ) -2(1-e1)0,X;
+2(1-eo)01x,.
(3.6)
This result is interesting only in the case of no absorbing states. If there are absorbing states, P(a,) = 0. In fact, E(#an) < 00, by Theorem 6.2.2, where #an is the total number of trials on which a, occurs. If 0 is the only absorbing state, E(#A In) < 00 too. Summation of (3.5) over 0 < n < N yields
6
N
C P(an)-t(xN+1-~0)
n=O
= -2(1-e,)00
1 x; + 2(1-e0)0, N
n=O
c x,. N
n=O
When N+ co this becomes 6E(#an) = 2 ( 1 - 8 0 ) ~ 1E(#A,n) - 5x0
(3.7)
if 0 is the only absorbing state, and 6E(#an) = t ( x a -xO)
(3.8)
if both 0 and 1 are absorbing. For a derivation of (3.8) via the functional equation (6.2.2), see Norman (1968d, Theorem 2). We now turn our attention to the case 6 = 0. If there are no absorbing states, (3.1) yields P ( ~ , )= 2(1-
e,)
E(X,
x,') + 2wir,
(3.9)
where E ( X , X , l ) = limn+mE(X,X,,'). Summation of (3.1) over all n gives
as a consequence of (2.10), if 0 is the only absorbing state, and
in the case of two absorbing states. The quadratic means in these equations can be calculated explicitly when 6, = 0. Let g = s,
+ 2(0-t0-?1),
186
12. FIVE-OPERATOR LINEAR MODEL
where ti =
e;, nii,.
THEOREM 3.2. Suppose that 6 = 0 and S2 = 0. I f there are no absorbing states, gE(x,x;)
=
~ro~2-e01-e10~.
(3.12)
I f 0 is the only absorbing state, m
g
C E(XnXn’) = E(X0 Xd) + ( 1 - e l o b 0 .
n=O
(3.13)
Finally, if both endpoints are absorbing, (3.14)
X,+,X,+,
=
XnXn’ +(Xn’-Xn)dXn-(dXn)~;
thus E(X,+,X;.,IX”=x)
= xx’+(x’-x)w(x)-z(x),
where Z ( x )= E((AX,J2IXn=x). Since 6 = 0, (2.3) yields (x’-x)W(x) = oox’+w1x-22wxx’. And Z ( X ) = x(e:, ~
’ ~+e:,x2nl0) n,, + x’(e;, xt2nO1 +e;ox2noo)
+ x’(soX2-2tox+to) - 2(t, +t,)xx’ + fox’ + t , x ,
= x(s1x’z-2t,x’+t,) = SiXX’
since s1 = so. Therefore, since ti = Oii. m i ,
E ( X n + , X : , + , ( X n = x= ) ~ ~ ‘ - g ~ ~ ~ + ( i - e ~ , ) ~ ~ ~ ’ + ( i - e , ~ (3.15)
so that AE(X,X;) = - g E ( ~ n ~ ~ ) + ( i - e o l ) o+o( ix-~- e , , ) ~ , x , . (3.16)
Letting n+co in (3.16), we obtain (3.12). Summation of (3.16) over n 2 0 yields (3.13) and (3.14). I Some light is shed on the meaning of the condition 6 = 6, = 0 by considering the symmetric case. The analog of (2.7) is
s,
= (e+-e*Zc*)
(Icol -lllo).
(3.17)
187
12.3. INTERRESPONSE DEPENDENCIES
From this it follows that 6 = 6, = 0 if and only if nol = nlo, or 8 = 8* and For the latter condition clearly implies the former, and, under the former, nol # nl0 implies c = c*.
ec = 8*c*
and
82c = 8*%*.
Since the model is distance diminishing, either Bc or 8*c* is positive; thus both are positive. Dividing the second equation by the first yields 8 = 8*, from which c = c* follows. The condition n l o = nol means that the probability nii that Ai is “successful” does not depend on i. Yellott (1969) showed that such noncontingent success schedules are especially useful in assessing the relative merits of the linear model with 8 = 8* and c = c* and the pattern model with c = c*. Either x l 0 = nol, or 8 = B* and c = c* is compatible with no absorbing states. In the second case, (3.12) reduces to (3.18) Generally speaking, explicit computation in symmetric models with 6 = 6, = 0 is limited mainly by one’s stamina. When 8 = B* and c = c*, such computations often simplify somewhat if outcomes are noncontingent (no1 = n11).
AUTOCOVARIANCES. We now obtain expressions for the asymptotic response autocovariance function pj and the important quantity a’ [see (2.5)] when 6 = 0. Like (3.9), these expressions involve E ( X , X,’), our formulas C(3.12) and (3.18)] for which require 6 , =O. THEOREM 3.3. If 6 = 0 and Xn has no absorbing states and is noncyclic, then pi
( i - ~ ) j - ~ ( ( i - ~ ) z-r ( i - e i ) ~ ( x m x ; ) )
=
(3.19)
for j > 1, and = 2((1-42)zzt - (~-+)E(X~X;))/W.
Proof. For j’ ‘(A
(3.20)
I, In
A 1 n +j)
=
E ( E ( AI n A 1 n + jl En Xn+ 1))
=
E ( A I n E ( A 1 n + jl Xn+ 1))
=
E(Aln(z+(l-~)j-l(Xn+l-~)))
by (2.10), so that P(AInA1 n + j ) =
z‘(A1n)
+ (1-~)’-’(E(A1nXn+1)-
Ip(Aln))* (3.2’)
188
12. FIVE-OPERATOR LINEAR MODEL
And, by a computation like that in (3.3), E(A,nXn+,) = (l-e,)E(Xn?
+
(e1-01)~,.
(3.22)
Substituting this expression into (3.21) and taking the limit, we obtain = (i-w)j-l((i-el)E(X~z)+(el-wl)i-iz),
from which (3.19) follows. Consequently, o2 =
ir
+ (214 ((1 -w ) ii' - (1 -e,) E ( X , x;)),
which reduces to (3.20).
I
When 8, = 8 and c,, = c for all i andj, (3.18) and (3.20) can be combined to yield
We conclude this section by considering two special topics very briefly. REINFORCEMENT. There is one class of models for which a CONTINUOUS great many analytic results are known even when 6 # 0. These are the continuous reinforcement models (i.e., noo= nl0= 1) with coo = cl0 = 1. The reader is referred to Bush (1959), Tatsuoka and Mosteller (1959), and Sternberg C1963, Eq. (80)] for this interesting development. A number of predictions for continuous reinforcement models with coo = cl0 and Oo0 = Ol0 (and thus 6 = 62= 0) are given by Norman (1964). OF OPERATORS. It may happen that one or both responses in CONTINUA a two-response experiment have more than two experimenter-defined outcomes, or, alternatively, one such outcome can produce several effects. Say response A, can be followed by outcomes Oi4,where a belongs to a discrete index set d ,with probability n,(a)[E4ni(a) = 13, in which case the operator
~ ( x , ( i , a )= ) ( l - e i a ) ~+ Y i a ,
with yi4 = Oi4 or 0, is applied. In fact, we can even consider an arbitrary (measurable) index space d, in which case n, is a probability on the index space. Most of the results of this and the preceding sections carry over to this generalized linear model if we take
189
12.4. SLOW LEARNING
where
oia= lu(i, (i, a)) - il =
if i = O , if i = 1 .
O,-y,
Since the behavior of state and response sequences depends only on the distribution of (&, wia) induced by n,, specification of this distribution, rather than Z7, itself, would suffice for the study of these processes. 12.4. Slow Learning
This section presents various approximations that apply when the stepsize parameters Oij are small. Two different types of variation of the model’s parameters are considered. The first of these produces large drift while the second leads to small drift. LARGEDRIFT. Here we assume that 8, = 8qij, where 8 varies and qij 2 0 is fixed, as are cij and q j . Under these conditions, the following quantities do not depend on 8: di =
8
Bi/8 =
= s/e =
1i qij
nij
9
8, -do,
ai = o,/e = qii, nii., w(x) =
w(x)/e=
F
~
+ E , x’ - E , x , ~
#
(4-1)
and
s(x) = E((mn/e)21xn =x) =
q : 1 n , , x ~ z x + q : , n , , x 3+q:ln,,X13 +~:ono,xzx’.
Let s(x) = S(x) - wz (x). According to Theorem 8.1.1, if X , = x as., and if n8 is bounded, as 8
-
--f
(XI-fW))/Je “ O , g ( W ) 0, where f and g satisfy the differential equations
(4.2)
and
t In this section, prime is used to denote reflection about f ( x ’ = 1 - x) and d/dt to denote differentiation,with one exception: w’ is the derivative of w.
190
12. N YE-OPERATOR LINEAR MODEL
and the initial conditions f ( 0 ) = x and g(0) = 0. The normality assertion of (4.2) and the precise value of g(n@ are of less importance than the fact that the distribution of X , is tightly clustered about f(n0) when 0 is small. We now make some observations about f and g . Except where indicated, these do not depend on the special form of w and s. First, if w(x) = 0, then f ( t ) = x and Zw‘(x)t
s ( t > = s(x) (e
- 1)/2w’(x) ,
assuming w’(x)# 0. When w’(x) = 0, g ( t ) = ts(x). If w(x) # 0, then the quantity in (4.3) never vanishes (Norman, 1968c, Lemma 5.1). Thus we can integrate
= 1 to obtain H (f ( t ) ) = t, where
For the linear model, w is at worst a quadratic polynomial, so H i s easily computed and then inverted to givef. The most difficult case is that in which w(x) = 6 ( X - - I )(x-C)
has distinct roots 1 and (. Let 1 be the root such that w’(I) = 6 ( I - 5 ) < 0.
(4.6)
Using the partial fraction representation 1 w(u)
-=-
1
b(l-C)(ull
in ( 4 . 9 , we obtain
A)
or
I-(
f ( t )- I = z ( t )- 1
-
(4.7)
00. Of course, 0 < 1 < 1, but it is possible that 0 < < 1 also. For example, in the two-absorbing-barrier case o1= w o = O , I = 1 and C = O if 6 > 0 , while 1 = 0 and ( = I if 6 < 0 .
As a consequence of (4.6), f ( t )-P 1 as t --f
191
12.4. SLOW LEARNING
When W(X) # 0, f is strictly monotonic, so we can write g ( t ) = G ( f ( t ) ) . As a consequence of (4.4) and (4.3), wcf)
dG
-gcf) = 2 W ’ W G ( f ) + s(f).
The solution with G(x) = 0 is Gcf) = w ’ v ) l - d u . No absorbing states. The models considered above have no absorbing states if and only if G I > 0 and Go> 0. In this case, w has a unique root I in [O, 13, and, in fact, 0 < 1 < 1. Since W = Ow, 1 is also the unique root of W, referred to in Theorem 2.1, for any 8 > 0. Assuming that 8qii. < 1,
9((Xn- I ) / @ )
-+
92
as n-, co, where YWe does not depend on 9 ( X o ) . By Theorem 10.1.1,
92 + N(O,a2) as O+O, where az = S(R)/2Iw’(1)1.
Clearly 0’ > 0. As in the transient case, asymptotic normality is not as sigabout R when 8 is small. nificant as the clustering of limn-tmY((X,,) Let x, = lim E(XJ n-r m
and E((Xm-I)2)= lim E ( ( X , - ~ ) ’ ) . n-r m
As a consequence of (10.3.1),
E((x, - 112) = ea2 while (10.3.3) gives X,
=
I
+ (e) ,
+ ey + o(e),
(4.9) (4.10)
where
+
y = a28/w’(I).
(4.1 1)
Thus R Oy is a better approximation to x, than 1 is, when 8 is small. Note that y has the same sign as -8 (or -a), as Theorem 2.1 requires.
192
12. FIVE-OPERATOR LINEAR MODEL
It is fairly obvious that P(u,) + 2AV as 0 + 0. A more precise approximation to P(u,) can be derived from (4.9) and (4.10). Letting n + 00 in (3. l), we obtain P(u,) = t E ( X , X,')
Since
+
0 0 xm'
+
0 , x,
.
+ (1-24 (x-A) - ( x - A ) 2 , E ( X , Xd) = 21' + (1 - 2 4 (x, -A) - E((X, -A)'). xx' = M'
Thus, by (4.9) and (4.10),
E(X, X.J = AV + ep + o(e),
where But
t = 2- O(d, +do),
p = (1-24y - 0 2 = 0 2 z / W ' ( A ) . and
0,~;+0,x,
=
e(z,z+a,A)+o(e),
so
+
P ( ~ , )= 2 ~ z c2p - (0, +do) AV
+ zoz + a,A] e + o(e).
(4.12)
SMALLDRIFT. Except in special cases to be identified below, it is necessary to let both nijand Oij depend on an auxiliary parameter 8 in order to obtain slow learning with small drift in linear models. Let T = 02, and suppose that nijand eij vary with 0 in such a way that the following conditions are met: 8ii
(4.13)
= 0(0) ,
+ o(T)
(4.14)
el, n,, - 8oonoo = T a + o(T),
(4.15)
0:
=
Tpi
mi = Oii. nii.= 7ai
9
+ o(T),
e:.nii.= o(z).
(4.16) (4.17)
In these equations, i = 1 and 0, a is any real constant, ai 2 0, pi 2 0, and Po > 0 or p1> 0. The quantity Oij nij,takes into account the two learning-rate parameters Oij and cij associated with Ai O j , as well as the probability nij that Oj follows A i . Thus it is natural to refer to it as the weight of Oj given A i . According to (4.19, the difference between the weights of success given A , and A . is very small (O(T)).And (4.16) implies that the weight of failure given either response is very small. This might mean that Oii, = O(T), Oii. = O(0) and
193
12.4. SLOW LEARNING
Hiis= 0(8), or Hi,, = O(T).If Oii, = o(1) or a, = 0, then (4.16) implies (4.17).
Finally, we note that, when
Pi > 0, (4.13) and (4.14) imply liminfn,, > 0. e-+o
It is interesting to inquire as to the conditions under which our “large drift” scheme satisfies (4.13H4.17). Substituting 8, = @,, into (4.15) and (4.16), we obtain the equations v11n11
=~
tto1n01
= t t l o n l o = 0,
0 0 ~ 0 0
and which are sufficient as well as necessary. We now show that, under (4.13)-(4.17), all of the hypotheses of Theorem 9.1.1 are satisfied. Note first that 6 = ~ * T + o ( T ) , where a* =a+a,-ao. Therefore, letting a(x) = a * x x ’ + a o x ‘ - a l x ,
E(dXn(Xn= x ) = W ( X ) = t a ( x ) Similarly, (4.14) and (4.17) imply that E((dX,)2JXn= x ) = t b ( x )
+ O(T).
(4.18)
+ O(T),
(4.19)
where b(x) = ( 8 1 x ‘ + p o x ) x x ’ .
Finally, 8iU,, = O ( T ) by (4.13) and (4.14), and 0;. U,,, = 4 7 ) by (4.17), so
IX,
~ ( 1 ~ ~ ~ = x1 ) 3=
(4.20)
o(T).
All o(r)’s are uniform over x, and the functions a and b satisfy the requirements of the theorem. It follows that, if Xo = x a s . and nr is bounded, then X,, P ( n z ; x , -) as 8+0, where P is the transition probability that satisfies (9.1.5).
-
Two absorbing states. Suppose that (4.13), (4.14), and (4.15) hold, and that wo = w l= 0 for all 8, so that 0 and 1 are absorbing. Then (4.16) and (4.17) are satisfied, and a ( x ) in (4.18) reduces to a ( x ) = axx’. If, in addition, Po > 0 and P1> 0, let
44
= WX)/b(X) =
W(P1 x’ + P o x )
Y
and let y be the solution of (4.21)
194
12. FIVE-OPERATOR LINEAR MODEL
with ~ ( 0= )0 and ~ ( 1 = ) 1. According to Theorem 11.1.1, d(x) = Px(X,+ 1 as n + a)+ ~ ( x )
as 0-0. Combining this with (3.8) we obtain Ex(#an)
N
(4.22)
~(V(X)-X)/Z~
as 8+0, if a#O. To justify application of Theorem 11.1.1, it is necessary to verify that the o(r)’s in (4.18H4.20) satisfy
as r+O. In the case of (4.18), o ( t ) = (6-ra)xx’
;
hence IO(~)/WX)l
G l 6 l t - .I/minB*, I
and (4.23) follows. The other two equations can be handled similarly, using
E(px,Ijjxn= x)
= (e~lnllx+-l
It is not difficult to solve (4.21). Clearly, *(x) dx
where C > 0. If
=
Cexp-
# Po, and p = 2a/(P,
+e
s
~ o n o o ~ - ~ ~ ~ ~
r,
-Po), this yields
2 (x) = C(P1 x’+Pox)P. dx Thus, if p # - 1, (4.24)
where h(x) = (/31xf+/lox)P+1.
When 8 , = Po and holds with
c(
# 0, as in the example at the end of Chapter 0, (4.24) h(x) = exp -2ax/B1.
13 0 The Fixed Sample Size Model
In the first three sections of this chapter we will give an account of the fixed sample size model that closely parallels the treatment of the linear model in the last chapter. In fact, most of our results for the linear model apply without change to this stimulus sampling model. The close relationship between these models is further emphasized in Section 13.4, where it is shown that the distributions of the state and event sequences of certain linear models are limits of the corresponding distributions for sequences of fixed sample size models. We recall that, in the fixed sample size model, the state variable x represents the proportion of elements in the total stimulus population conditioned to A,. The event variable e = (m, i, j, k) gives the number of elements in the sample conditioned to A , , and the response, outcome, and effectiveness of reinforcement indices. It is a finite state model defined formally as follows. State space: X = { v / N :0 < v GN}. Event space : E = {(m,i, j , k ) : 0 < m G s, 0 < i,j, k < l}. Transformation of x corresponding to e = (my i, j , k ) : u(x,e) = x
+ k e ( j - m/s),
8 = s/N. 195
196
13. FIXED SAMPLE SIZE MODEL
Probability of e given x : p ( x , e) = H(m,x ) L(m/s; i, j , k), where H(m, x ) is the hypergeometric distribution
and L is the event probability function for the linear model:
U y ; iJ,M
=
[ynljclj
if i = 1, k = 1 ,
yqjc;j
if i = 1, k = 0,
y’nojcoj
if i = 0, k = 1,
(y’nojcbj
if i = 0, k = 0 .
The restrictions on the model’s parameters are 1<S
< N,
0
< 7Cij,Cij < 1,
IRij j
= 1.
The formula given above for u(x, e) applies only if p ( x , e) > 0. The definition in other cases is arbitrary. For example, if m >Nx, j = 0, and k = 1, the formula gives u(x,e) = x
- m/N < 0 .
However, in this case
so p ( x , e ) = 0. With the following notations, many of the formulas of the last chapter become applicable to the fixed sample size model:
ei = c(ni,,+nil), mi = en,,., where [ = (s- l ) / ( N - I ) ,
It is understood that c = O if N = 1. Then pattern model), in which case Bi = 0 too.
nij= 7rijcij.
c=
0 if and only if s = 1 (the
13.1. Criteria for Regularity and Absorption
A state sequence X,,for a fixed sample size model is a finite Markov chain. The theory of such chains is discussed in Section 3.7, and in standard sources (e.g., Kemeny and Snell, 1960).
13.1. REGULARITY AND ABSORPTION
197
The following simple properties of the transition kernel K of X,,are used repeatedly below. LEMMA1.1. If Uii,> 0 and x # i’, then, starting at x, the process can move closer to i’:
LEMMA1.2. If (s- 1)nii > 0 and 0 < x < 1, then the process can move closer to i :
K ( x , { y : ly-il < Ix-it}) > 0 . Proofs. In the first case, some event of the following sort has positive probability : The sample contains elements conditioned to A,, response A , is made, and Ai. is effectively reinforced. Any such event moves the process toward i’. In the second case, since O < x < 1 and s 2 2, a sample can be drawn containing elements conditioned to each response. Then A, can be made and effectively reinforced. As for the linear model, the presence of absorbing states exerts a decisive influence on X,,. The criterion for i to be absorbing is the same as in the linear model.
LEMMA 1.3. The state i is absorbing if and only if o,= 0. A state 0 < x < 1 is absorbing if and only if nii,= 0 and (s- 1) n,,= 0 for i = 0 and 1. Then all states are absorbing and the model is said to be trivial. Proof. If Hi,, = 0 , i is certainly absorbing, while if n,,.>O,it is not, according to Lemma 1.1. But wi > 0 if and only if nii,> 0. If 0 < x < 1 is absorbing, nii.= 0 and (s- 1) nii= 0 follow from Lemmas 1.I and 1.2. Suppose, conversely, that the latter conditions hold. If s = 1, then the response made is the one to which the sampled element is conditioned, and a change of state occurs only if the other response is effectively reinforced, which has probability 0. If s # 1, then nii= 0, i = 0,1, and no response is effectively reinforced with positive probability. Thus, again, all states are absorbing.
I
Theorems 1.1 and 1.2 differ only slightly from the comparable theorems (12.1.2 and 12.1.3) for the linear model. All states can be absorbing in Theorem 1.1 ,but the distance diminishing assumption of Theorem 12.1.2 rules this out. And there is a cyclic case in addition to the one (q= wo = 1) given by Theorem 12.1.3. THEOREM 1.1. Ifthere is an absorbing state, then X,, is an absorbing process.
198
13. FIXED SAMPLE SIZE MODEL
THEOREM1.2. If there are no absorbing states, then either (a) no,= n,, = 1 and s = 1 or N, in which case X,,is ergodic with period 2, or (b) X,, is
regular.
Proof of Theorem 1.1. Suppose i' is absorbing and i is not. The latter implies nii.> 0, according to Lemma 1.3. Thus, if x # i', K(")(x,i') = K ( " ) ( x{, i ' } ) > 0 for some n < N by Lemma 1.1. Hence the criterion for absorption given at the end of Section 3.7 is met. If some 0 < x < 1 is absorbing, Lemma 1.3 shows that all states are ab>0 sorbing. Suppose now that this is not the case. By the same lemma, njSj or (s- l ) n j j > 0 for some j . Lemmas 1.1 and 1.2 then imply that, if 0 < x < 1, K ( " ) ( x , j> ) 0 for some n < N - 1. Thus the process is absorbing if both 0 and 1 are absorbing states. I Proof of Theorem 1.2. By Lemma 1.1, both 0 and 1 can be reached from any state, so X,, is ergodic, and both 0 and 1 belong to the single ergodic kernel F. Clearly K ( 0 , s / N ) > 0 and K(s/N,0) > 0, so K("(0,O) > 0. Since the period p of F divides all return times, p < 2. If n,. < 1 for some i, then
K(i, i) = 1 - n,,> 0, so p = 1 and the process is regular. If 1 < s < N , then K ( 3 ) ( 00) , > 0, so, again, p = 1. For no,> 0 and n,, > 0 imply K(O,s/N)> 0 and K(l/N, 0) > 0. If, in addition, 1 < s < N , then starting at x = s / N , a sample with s-1 elements conditioned to A , can be drawn, A, made, and A, effectively reinforced, leading to state 1/N. Thus K(s/N, 1/N) > 0, and ~(y0,o2 ) K(O,s / ~ K) ( ~ / Ni , / ~K(I/N, ) 0) > 0,
as claimed. Suppose now that no,= n,, = 1. If s = N , then K(0,l) = K(1,O) = 1, so that p = 2. If s = 1, then K ( x , { y :[ y - x l
so, again, p = 2.
=
l/N}) = 1 ,
I
Part I of this volume showed that the theories of finite state models and distance diminishing models with compact state spaces are completely parallel. Thus the remarks in Sections 12.2 and 12.3 about the asymptotic behavior of distance diminishing linear models are applicable to nontrivial fixed sample size models with the same absorbing states. For example, the proportion A,, of A , responses in the first n trials is asymptotically normal with mean P(A,,) and variance d / n , where c2 is given by (12.2.5), if the model has no absorbing states and is noncyclic.
199
13.2. MEAN LEARNING CURVE
13.2. Mean Learning Curve and Interresponse Dependencies Our first order of business in this section is to find suitable expressions for P(Ain Ojn CknlXn) 9
E((dXn)AinOjn Cknlxn)9
and various quantities that derive from them. These formulas are compared with analogous expressions for the linear model. Some differences are noted, but there are important similarities that are exploited in the remainder of the section. Iff is any complex valued function on [- 1, I], E(f(dXn)Ain Ojn Cknlxn = X ) = M C f ( k e ( j - m / s ) ) U m / s ; Lj,k)] 9 (2.1)
where M denotes expectation with respect to the distribution H ( m , x). If
f ( y ) = 1, (2.1) and M(m/s)= x yield
P(Ain Ojn Cknlxn = X ) = L ( x ; i,j,k)
just as in the linear model. Summing over j and k for i = 1, we obtain P(AIn(X,= x ) = x ,
so that P ( A I n ) = E(X”) = xn .
If f ( y )= y, (2.1) reduces to E ( ( d X n ) A i n ~ j n C k=nx~)X=n kOM[(j-m/s)L(m/s;i , j , k ) ] .
From 1 N-s sN-1
M ( ( m / s - x ) ’ ) = - -XX’
[Wilks, 1962, Eq. (6.1.5)], it follows that
M ( ( 1-m/s)m/s) = rxx’ , M((m/s)’) = x - r x x ‘ , M ( ( 1 - m/s)’) = x’
where
- rxx‘ ,
(2.2)
200
13. FIXED SAMPLE SIZE M O D E L
Thus
-e(l-rx')xnlo
E((~xn)Ai,,oj,,ck,,Ix,, =x) = .
- cxx'noo
cx' -e(i-rxi)
E(AX,,IX,,= x,Ai,,oj,c,)
=
.
e(i-rx)
- rx
j
k
I
o
I
1
1
o
e(i-rx)x;n,,
I
i
(2.5)
0 0 1
1
1
1
1
o
I
o
1
I
0 0 1
Though all rows are linear functions of x, the second and third differ essentially from the corresponding expressions for any five-operator linear model. This is most striking for the pattern model where they are, respectively, - 1/N and 1/N for all x. Returning to (2.5), and adding the first and second and the third and fourth rows, we get
E(& AX,,IX,, = X) =
(O1x'-ol)x
if i = 1 ,
(oo-Oox)x'
if i = 0,
(2.6)
which is identical to the corresponding linear model expression (12.2.2). Most of the formulas in Sections 12.2 and 12.3 follow directly from (12.2.2), and thus apply to the fixed sample size model. We now discuss these results in more detail. Considering Section 12.2 first, the expressions (12.2.3) for W ( x )= E(dX,,IX,,=x) and (12.2.4) for Ax,, apply here. As in the linear model, 6=01-Oo, W = O ~ + W ~ and , 1 = oolo = ~ o l / ~ ~ o * + ~ l o ~ ~
If the condition Oii. < 1 in Theorem 12.2.1 is replaced by s < N , the system
201
13.2. MEAN LEARNING CURVE
(12.2.6) of bounds for x , is valid in the present context. Clearly,
6 =~
C ~ ~ 1 1 + ~ 1 0 ~ - ~ ~ 0 0 + ~ 0 1 ~ 1 ,
so that 6 = 0 if and only if s = 1 or n 1 1
+ n10 = no0 + n o 1 *
(2.7)
In the symmetric case, cii = c, cii, = c*, we have 6 = r(C-c*)(~01-~10),
which is analogous to (12.2.7). Finally, the formula (12.2.10) for x,, when 6 = 0 and the bounds for x,, when 6 > 0 given by Theorem 12.2.2 apply here. Turning to Section 12.3, the expression (12.3.1) for P(a,,) is valid for the fixed sample size model, as is the relation (12.3.5) between P(a,,) and x,, when 6 # 0 and its corollaries (12.3.6H12.3.8). When 6 = 0, P(a,) and E(#a,,) are related to E(X,,X,,’) according to (12.3.9H12.3.11). Under the same assumption (no analog of “S2 = 0” is needed), the expression (12.3.15) for E(X,,+l X i + l I X n = x ) applies, with =
mi+ 2(1-r)0
and Bii. replaced by 8, as is shown in the next paragraph. The formulas for E(X,X,’) and C,,E(X,,X,,’) in Theorem 12.3.2 follow. In the symmetric case with c = c*, we have this analog to (12.3.18):
Finally, Theorem 12.3.3 on response autocovariance when 6 = 0 holds here.
Proof of (22.3.15). We will obtain an expression for Z(X)
=
E((AXn)21Xn= x )
from which (12.3.15) follows just as in the proof of Theorem 12.3.2. Letting Y = mls, E((AXn)2A1nIXn = x )
+ = e 2 w ( Y ’ Z y (rill ) +nlo)- ~ M ( Y ’ Y ) ~+, ,M ( Y ) ~ , , I
= ~2CM(Y’ZY)nll M((1 -Y’~2Y)n103
=
e 2 ( n , l + n l o ) ~ ( ~ ’ Z Y ) - 2e 0~, ox .l x x ’ +
(2.8)
Similarly, E ( ( A X , , ) ~ A ~ , ~=Xx,), = e 2 ( n o o + n o l ) ~ (-y2r00xx’+ 2~’) e O o ~ t . (2.9)
202
13. FIXED SAMPLE SIZE MODEL
As was noted earlier, if 6 = 0 and s # 1, (2.7) holds. Thus
e2(nl1 + nlo)M ( Y f 2 Y )+ e2w o o + no11M(YZY’) =
+nio) M ( Y ~ ’ )= Be+’,
e2(nil
and this is valid even when s = 1, since both sides vanish in that case. Therefore, addition of (2.8) and (2.9) yields
qX) = (ee, - 2 ~ 0xx’ ) + eml x + emox’ ,
the desired equation.
I
In the fixed sample size model with 6 = 0, just as in certain linear models satisfying this condition, one can obtain complicated formulas for practically any quantity of interest by rather straightforward computation. Some ingenious work has led to elegant expressions for the distribution of NX, in the pattern model. This development was begun by Estes (1959) and carried forward by Chia (1970). Estes also showed that the asymptotic distribution of N X , is binomial, with parameters N and 1. The assumption that col = c l 0 in these papers is unessential.
13.3. Slow Learning The implications of Part I1 for the fixed sample size model are essentially the same as for the linear model (see Section 12.4), but slight additional effort is required to verify the hypotheses of the relevant theorems. This section is mainly devoted to checking these hypotheses. We assume throughout that s is fixed, so 0 + 0 if and only if N + co.
LARGEDRIFT.
For easy reference we quote (12.2.3):
E(dX,IX, = x) = W(X) = 6xx’ Division by 8 yields w(x,e) = (s/e)xx’
for x in
+ 0 0 x’ - 0
+ no,x‘ - n , , x ,
le = { v / N :0 < v
0 and n,,> 0, so that there are no absorbing states, w has a unique root 1 in [0,1]. Thus Theorems 10.1.1 and 10.3.1 apply to the stationary distribution pe of X,. Since W(x) is not quite proportional to w(x), the root 1 = 1, of W ( x ) [or of w ( x , 8 ) ] , which figures in (12.2.6), need not equal the root 1 of w ( x ) ; however, &+A as 8 + 0 . The formula (12.4.11) for y is no longer valid, since v ( x ) need not vanish, but (12.4.12) holds if we t a k e p = ( l - U ) y - a Z and replace 8,+8, and ai by
and Hiit,respectively.
SMALLDRIFT. Under the conditions
n,,= B +we), n,,- no,= a0 + o(8), n,,,= Oa, + o(e),
B > 0,
(3.7) (3.8) (3.9)
and s > 1, which are analogous to (12.4.13)-(12.4.17), Theorem 9.1.1 is applicable, with T = 02, a(x) =
s- 1
(a+a,-a,)xx' S
+ aox' - a, x ,
and s- 1
b(x) = p-xx'. S
In the case
no,= n,,= 0 of two absorbing barriers, a ( x ) reduces to a(x) =
s- 1
-axx' , S
and Theorem 11.1.1 applies also. We will verify the hypotheses of the latter theorem only. Since (3.10) (3.8) gives
6
s- 1
= r-a+o(r) S
205
13.4. CONVERGENCE TO THE LINEAR MODEL
and E(dX,IX,,
=x) = 7 4 x )
+ O(T).
Note that O ( T ) in the last equation has a factor xx’, so max lo(~)/~b(x)I+ 0
(3.1 1)
o<x< I XSl,
as T +0. Next, by (3.9, = BM(YY’)
+ w Y ’ 2 Y ) (1 + M(VZY‘)t o
Y
where y = m/s and t i = n,,-fl.But (2.4) and (3.10) imply that s- 1 M(yy‘) = -xx’ S
+ o(l)xx’,
and (3.7) yields M(Y’2Y)ltll + W Y 2 Y ’ ) l t 0 l G K8M(YY’) < K8xx’.
Therefore
E((dXn)21Xn= x ) = rb(x) + o ( 7 ) ,
(3.12)
and (3.11) holds. Finally,
~ ( l d x lx,, , , ~=~X) Q eE((dx,,)2IX,,= X) = O ( T ) by (3.12). Thus (11.1.4) and (11.1.5) hold, and the other conditions of Theorem 11.1.1 are easy to check. 13.4. Convergence to tbe Linear Model
We begin by noting that, according to (2.3), the variance ofm/s is small when s and N are large: M ( ( m / s - x ) 2 ) Q 1/4s.
(4.1)
Thus m/s tends to approximate x under these conditions, so that, referring to the definition of u at the beginning of the chapter, u(x,(m,i,j,k)) approximates x + M ( j - x), the corresponding event operator for the linear model with Bu = 8. Furthermore, the probability of A, Oj, C,,, given X,, = x is L ( x ; i,j , k) in either model. This suggests that, if s and N approach infinity in such a way that 8 + 8*, while the distribution p of Xo converges to a distribution p*, then the joint distribution 9 ( n ; s, N , p ) of X o , Eo, Xl , ..., X,,for the fixed sample size model should converge, in some sense,
206
13. FIXED SAMPLE SIZE MODEL
to the comparable distribution g ( n ; 0*,p*) for the linear model with Bij = O* and initial distribution p*. In this section, it is shown that this convergence does indeed occur. Of course, En above refers to (Al,,, Oln, Cln), that part of the event for the fixed sample size model that has a counterpart in the linear model. This notation is used throughout the section. The intuitive notion that the linear model is a stimulus sampling model with an infinity of stimulus elements is deeply ingrained in this subject. On the mathematical side, a technical report by Estes and Suppes (1959b) presents a general but arduous approach to convergence of event probabilities. Theorem 4.1, the proof of which is not difficult, is the first result of this kind to be published. THEOREM 4.1. For any n 2 0, 0 < 8* < 1, and distribution p* on [0, g ( n ; s , N , p ) converges to B ( n ; O * , p * ) as s+00, 8+8*, and p - + p * .
13,
The convergence to which the theorem refers is weak sequential convergence in the following sense. Let s = s j , N = Nj y and p = p j be any sequences such that sj -+ co, s j l N j + 0*,and p j converges weakly to p* (see E A] be Section 2.2) as j + 00. Let &';-I = ( E o , ... ,En- 1), and let [S",-' E A and zero otherwise. Then the random variable that is one if as j 3 00, for any
A c
=
( 0 , l ) x (0,l) x
(3n times)
and bounded continuous real valued function h. The cases A h ( x , , ... ,x,) = 1 are, of course, of particular interest. Proof.
=rnand
Let us denote E ( f ( d xn) Ain Ojn Ckn Ixn = X )
by J ( x ) in the fixed sample size model and by J * ( x ) in the linear model. An expression for J ( x ) is given by (2.1), and J * ( x ) = f(kO*(j - x ) ) L ( x ; i,j , k ) . We first show that, iff has a bounded second derivative,
+
J ( x ) = J * ( x ) E(X),
(4.3)
where E ( X ) -+ 0 uniformly in x as s -,00 and 8 + 8*. If g has a bounded second derivative on [0,1], the second-order Taylor expansion g(Y) = g w
+ ( Y - x ) g ' ( x ) + Y19"1(Y-x)2,
207
13.4. CONVERGENCE TO T H E LINEAR M O D E L
where IyI
< 3 and lg”1= sup, Ig”(x)l, yields IM(g(m/s)) -g(x)l G lg”lM((m/s-x)2)/2
G lg”1/8s by (4.1). Iff has a bounded second derivative on [- 1,1] and
W)
9 w =f ( W i - Y ) )
3
where L ( y ) = L ( y ; i , j , k ) , it is easy to see that 197 G
If”l + 2lf’l
Y
so that
IM [f(Wj- m/s)>L(m/s)l -f(Wi-4 )U X ) l
(171+ 21f’I>/8s.
G
But
I f ( W j - x > ) W ) -f(ke*(j-x))Ux)l
G 10-
O*l If’l .
Thus
IJW - J*(x)l G
(If”l+21f’0/8s
+ lo - O*I If’l,
as required. The proof of the theorem now proceeds by induction. All limits below are taken along an arbitrary but fixed sequence (s, N, p) = ( s j , N j , p j ) with the required asymptotic behavior. Since Q(0; s, N , p ) = p and 9(0;0*, p*) = p*, there is nothing to prove when n = 0. Suppose that (4.2) holds for some n 2 0 and all A and h. We must show that n can be replaced by n+ 1. It clearly suffices to consider unit sets A = {(eo,... ,en)}. Furthermore, it follows from the continuity theorem for multidimensional distributions (Breiman, 1968, Theorem 11.6) that only functions of the form 1 n+l
\
need be considered. Now
where ( i , j , k ) andf(y) in the definition of J ( x ) are en and exp((- l ) “ t n+ ly); [S”,-’= e;-’]exp(i
208 by (4.3), where IS1 < max,
13. FIXED SAMPLE SIZE MODEL
IE(x)~
0; and
by the induction hypothesis (4.2), since J*(x) is bounded and continuous. The quantity on the right reduces to
14 0 Additive Models
In additive models for simple two-choice experiments, A, response probability is a Bore1 measurable function p ( x ) on the state space X = R', and state transformations are translations : u(x, e) = x
+ be.
An event e = (i, z ) is determined by a response Ai and a consequence z, which is drawn from a measurable space @,a) in accordance with a specified distribution
n , ( D ) = P(z E DIA,). Thus the distribution p ( x , .) of e given x satisfies p(x,{(l,z):zED)) = P(x)nl(D)
and p ( x , {(O,z): E D ) ) = q(x)n,(Q,
where q ( x ) = 1 -p(x). It is easy to see that n,and be affect the distribution of the (Markovian) sequence (X,,, AIn) of states and responses only through 209
210
14. ADDITIVE MODELS
the conditional distribution
Qi(B) = P(be E BIA,) = n i ( { z :biz E B } ) of be given A , . Though it is natural to require that p meet various criteria of smoothness, it is unnecessary to do so for Sections 14.1 and 14.2. Of course, 0 < p ( x ) < 1. The interpretation of x or u = ex as an “ A , response strength” variable suggests that p be strictly increasing, with p ( c o ) = 1 and p ( - co) = 0. The latter conditions are assumed, but, in place of monotonicity, it suffices for the time being to suppose that p ( x ) is bounded away from 0 and 1 on finite intervals. Equivalently, p(x,,)+ 1 if and only if x,,+ 00, and p(x,,)+ 0 if and only if x,, + - 03. All of these conditions are satisfied in beta models, which are additive models with p ( x ) = u/(u+ 1). Our final assumption is that the moment generating functions m
Mi(A) = E(expAb,lAi) = /-meAyQi(dy)
exist, for I in some open interval A containing 0. The significance of these quantities is suggested by the equation E((Un.1
/on)AIun)
= MI (I)pn + Mo(I)qn
as., where u,, = ex”, p,, =p(X,,), and q,, = q(X,,), which shows that they determine the tendency of u.” to increase or decrease. If k 2 0 and A E A, fy”eayQi(dy)and M / ” ) ( I )exist and are equal. In particular, if mi = E(be1Ai) =
/Ygi(&),
then m i =M/(O), so that mi determines the departure of M i ( l ) from M,(O) = 1, when 111 is small. If mi < 0, then M i ( I ) 1 for I > 0 sufficiently small, while if m i> 0, then M , ( I ) < 1 for I < 0 sufficiently large. Actually, ) M i is convex, so that Mi(A)< 1 implies since M,”(x)= ~ y 2 e A y Q i ( d2y0, that M i ( o ) < 1 whenever w is between I and 0. Henceforth it is understood that an argument of M iis an element of A. Analogous to five-operator linear models, five-operator additive models are defined by z = ( j ,k), where Aj is the response reinforced and k = 1 or 0 depending on whether or not conditioning is effective. Thus, bijo= 0, and, letting b,, = b i j , , b,, 2 0 and bio < 0. Also,
-=
ni( j ,k ) = nij
(“
1 - cij
if k
= 1,
if k = 0,
where 7cij is the probability of reinforcing A j after A i , and cij is the attendant
21 1
14.1. RECURRENCE A N D ABSORPTION
probability of effective conditioning. In this case, mi = the linear model, the symmetric case, defined here by co, = c1, = c*,
c,, = coo = c , 611 =
- b 00
=
cjbijnijcij. As for
bol
b,
=
-blo = b*,
is of particular interest. Much of this chapter follows Norman (1970b). 14.1. Criteria for Recurrence and Absorption In this section we establish Theorem 1.1, which shows that the qualitative asymptotic behavior of p , (or X,, or u,) is determined by the signs of the mean increments mi.
=-
THEOREM l.l.(A) r f m , 0 and ml > 0, limp, = 1 a s . (B) If mo < 0 and m, < 0, limp, = 0 a.s. (C) Zf m, < 0 and m, > 0, g o ( x ) + g , ( x ) = 1, where g,(x) = P,(limp,=i). In addition, g o ( x ) > 0, g , ( x ) > 0, g o ( x )+ 1 as x + - co, and g , (x)+ 1 as
x + co.
(D) Zf m, > 0 and m, < 0, then lim supp,
=1
and lim infp, = 0 a.s.
The proof is based on a number of lemmas, some of which will find other uses later. Let
F,(u) = M , ( A ) p ( x )
+ Mo@)q(x),
so that
E((un + 1 Note that F,(u) + MI (A) as u + 00.
Ion) =
FA (un)
(1.1)
LEMMA1.1. I f m , < O , liminfX, O a n d M l ( A ) < l , there are constants B(A) and C(A)< 1 such that E,(u:)
< ua C"(A) + B(A).
(1.2)
Similarly limsup X , > - 00 if mo > 0, and (1.2) holds if A < 0 and Mo (A) < 1. Proof.
For V sufficiently large,
c(n)= supF,(u) < 1 . uav
212
14. ADDITIVE MODELS
for u, 2 V. But for u, < V,
0, then lirn supX, = 03 implies limX, = 00 a.s. If A < 0, M,(IZ) < 1, and V is sufficiently large that FA@)< 1 for all u 2 V, then 90(4
0.
An analogous result holds if mo < 0. ProoJ
Let H,= min(v:,
V'). This process is a supermartingale. For
E(Hn+1lun,
00)
< E(ui+1Iun) < V: = H,
a s . if u, 2 V, while
E(Hn+IIvn, ***,
00)
< V A= Hn
a.s. if u, < V. Thus
E(Hn+,(un, .-.s 00)
0 or m, > 0, P(limsupX,, E X) = 0. LEMMA
Similarly if m, < 0 or m, < 0, the probability that liminfX,, is real is 0. Proof. If m,> 0, then Qi([2z,00)) > 0 for some E > 0. For any x ,
P(Xn+, > x + & I X n , *.*,Xo) = Q,(Cx-Xn+E,0O))Pn
2
Qi(C%
+ Q~(Cx-Xn+&,m))qn
c o ) ) ~ ,+ Q o ( C k
a))Bx
a s . if IX,,-xl < E, where 01, = infly-xl 0, thus Y x > 0.
It follows that IX,,-xl < E infinitely often (Lo.) implies
c P(X,+, > OD
n=O
X+EIX,,
..., X,)
=
00
as. The latter event is a s . equivalent to X,, 2 X + E i.0. (Neveu, 1965, Corollary to Proposition IV.6.3). Thus the probability that IX,,-xl < E i.0. but not X,, 2 X + E i.0. is 0. However, llimsupX,,-xl < E implies the latter event, so
P(llimsupX,,-x( < E ) = 0. Since X is a denumerable union of intervals of the form ( x - E , x + E ) , the lemma follows. I Proof of Theorem 1.1. Suppose m ,> 0. By Lemma 1.3, limsup X,, = - 00 or limsupX,, = 00 a s . By Lemma 1.2, IimsupX,, = 00 implies limX,, = 00 a s . ; hence lim sup X,, = - co or lim X,, = co a s . When m, > 0, lim sup X,, > - co a s . by (an analog of) Lemma 1.1, so that, when m, > 0 and m, > 0, IimX,, = 00 as. This proves (A), and (B) is similar. Returning to the case where our only assumption is m 1 0, we have limX, = - co or limX, = 03 a.s. Thus go(x)+gl ( x ) = 1. It follows from (1.3) that go( x )+ 0 ; hence g , ( x )-+ 1 as x -+ co. In particular, there is a constant d such that g, ( x ) > 0 for x 2 d. It is easily shown that, for any n > 1
=-
214
14. ADDITIVE MODELS
and x E X,
Px(dXj > E , j = 0,...,n - 1) 3 (6c)n, where c = inf,,,p(y) > 0 and 6 = Q,( [ E , m)) > 0 for E sufficiently small. Hence Px(Xn2 d) > 0 if n is sufficiently large that x+ns 2 d. But g , ( x ) = E x ( g ,(X,,)), so g , ( x ) > 0 for all x E X. The other assertions in (C) follow similarly from m, < 0. Suppose m, < 0. Then lim inf Xn = f00 a s . (Lemma 1.3) and lim inf Xn < 00 a s . (Lemma l.l), so liminfX, = - 03 a s . Thus liminfp,, = 0 as. Similarly, m, > 0 implies limsupp,, = 1 as., and (D) is proved. I If p(x) = O(e0X) as x + - m for some ct > 0, as in the beta model, the conclusion in (B) of Theorem 1.1 can be strengthened to C;=,p,, < m as. In fact
for all x E X (Norman, 1970b, Theorem 2). The recurrent case m , < 0, mo > 0 is the subject of the next three sections. 14.2. Asymptotic A , Response Frequency
As in zero-absorbing-barrier linear and stimulus sampling models, A , response frequency
and average A, response probability
admit strong laws of large numbers in recurrent additive models. Furthermore, there is a simple expression
r = mo/(mo-m,)
(2.1)
for the a s . limit of these sequences in the additive case. Theorem 2.1 also converge to 0, shows that the expectations of &((Aln-[) and f i ( ( p , , - [ ) and their second moments are bounded. Presumably their distributions are asymptotically normal under suitable additional conditions yet to be found. It would also be of interest to have laws of large numbers and central limit theorems for other functions g(E,,) andf(X,,) of events and states.
14.2. A , RESPONSE FREQUENCY
215
from which it follows that E ( V ,V,)= 0 and E ( W ,W,) = 0 if m # n. Furthermore, if si = Jx’ Q,(dx) and maxs, = s,
an3
Note that
where 6 = m , - m , . Now if I > O is sufficiently small,
216
14. ADDITIVE MODELS
by (1.2). Adding these equations, and observing that 8 + e - y > y 2 , we obtain A2Ex(X,Z)< 2coshlx
+ B,
where B = B(A)+B(-A). Thus EX(X:)
< (2coshAK+B)/12 = J
for all n 2 0 and 1x1 < K. Clearly
so X,,/n + 0 a s . This, in conjunction with (2.7) and (2.9), yields
But then (2.8) implies A,,,+ [ as., and (A) is proved. The expectation of (2.9) is 0 = Ex(vn)= E,(X,,)/n - x/n
(2.10)
jj,, + [
as.
+ 6(E(p,,) - 0;
therefore 6lEx(Pn) - tl Q (1x1+ IEx<Xn>l)/n Q (K+J")/n
for 1x1 < K , as a consequence of (2.10). Thus (2.4) is established. Equation (2.9) and Minkowski's inequality yield
0 and let r = bc/v. The quantities bc and b*c* determine the efficacy of success and failure, respectively, and r measures their relative magnitude. A simple computation shows that
m 1 = v(r-nlo)
and
mo = v(nol-r).
Suppose that nO12 n l o , or, equivalently, n,, 2 no,. This means that A , is more likely to be successful than A,. The opposite case is similar. Then recurrence (m,c 0, mo > 0) occurs when r c n l 0 . Under this condition
C
= (~01-r)/(n0,+~,~-2r).
If nol = n l 0 , ( = 3. Otherwise 3 is a strictly increasing convex function that runs from n o l / ( ~ o+nlo) l = 1 at r = 0 to 1 at r = A,,. Thus the model pre-
14.3. EXISTENCE OF STATIONARY PROBABILITIES
217
dicts that asymptotic A , response frequency C will overshoot the probability matching asymptote 1 over the range 0 < r < n l 0 . If nlo< r < nol, then m , > 0 and m, > 0, so p,,+ 1 a s . according to Theorem 1.1. Thus success probability n1 pn A,, qn is maximized asymptotically. Finally, if r > nol,m , > 0 and mo < 0, so p,,+ 1 or pn+ 0 as., and both limits have positive probability.
+
14.3. Existence of Stationary Probabilities If p is continuous, the Markov process X,,always possesses a stationary probability in the recurrent case. Furthermore, the moment generating functions of stationary probabilities are bounded by the function B(A) of Lemma 1.1. Let d' be the set of stationary probabilities. THEOREM 3.1. If mo > 0, m , < 0, and p is continuous, then d is not empty. In fact, for every stochastically and topologically closed subset Y of X , there is u p E d s u c h that p(Y) = 1. Z f A > 0 andM,(A) < 1, or A 0 and n 2 0. Therefore, F ( ( , is uniformly tight, thus conditionally weakly compact (Parthasarathy, 1967, Theorem 11.6.7). Suppose now that Y is stochastically and topologically closed (e.g., Y = X).Let 5 E Y and pj = Pi( 0 and M I (A) < 1, or A < 0 and M o (A) < 1. For any d > 0, let F ( x ) = min (e", d). Then
s s Fdp
=
FdT"p
=
s
U"Fdp.
Since U"F(x)< d for all n and x, Fatou's lemma gives
1 0 for all x E.'A Since p ( x ) > 0 and q ( x ) > 0, 0 = K(x,B) = p(x)Q,(B-x)
+ q(x)Qo(B-x)
implies that Q,(B-x) = 0; thus L ( B - x ) = 0, so that L ( B ) = 0. If A and B are stochastically closed, x E A, and y E B, then K ( x , 1)= 0 and K ( y , B") = 0, so L ( 1 ) = 0 and L(B")= 0. Therefore, A n B # 0, and K is indecomposable. I
14.4.
UNIQUENESS OF THE STATIONARY PROBABILITY
219
However, indecomposability is not necessary for uniqueness in recurrent ) 1, where b , = m , < 0 and additive models. Consider the case Q i ( { b i }= bo = mo > 0, so that X,, has the simple transition law AX,, =
b,
with probabilityp(X,,) ,
bo
with probability q(X,,).
Let G = {nob,
(4.1)
+ n, b , :no, n, are integers}.
Then x + G is stochastically closed for any x E X , and the collection of distinct, hence disjoint, sets of this form is nondenumerable. So X,, is certainly not indecomposable. But uniqueness can obtain. THEOREM 4.2. Zfp is continuous and nondecreasing,and g b o/ b , is irrational, then there is a unique stationary probability. According to Theorem 3.1, a necessary condition for uniqueness is irreducibility, the nonexistence of disjoint stochastically and topologically closed subsets of X. If b o / b , is rational, the sets x + G are topologically closed, so uniqueness fails. On the other hand, if b , / b , is irrational, X is the only stochastically and topologically closed set. For if Y is stochastically closed and x E Y, then x + G + c Y, where
G + = { n , b , + n , b l : n o , n l 20}. And x + G+ is dense, so Y = X if Y is topologically closed. Hence if bo/bl is irrational, irreducibility holds as weil as uniqueness. It remains to be seen whether irreducibility is sufficient for uniqueness in the general recurrent additive model with continuous nondecreasing p. The proof of Theorem 4.2., which is due to James Pickands, 111 and the author, is published here for the first time. Proof. The continuity of T ensures that B is weakly closed, and the inequality
jcoshIxp(dx)
< ( B ( I ) + B(-4)/2,
valid for all p E 9 if I > 0 is sufficiently small according to Theorem 3.1, implies that B is weakly compact. Therefore, by the Krein-Milman theorem (Dunford and Schwartz, 1958, Theorem V.8.4) applied to the finite signed measures on X with the weak topology, the convex set B is the closed convex hull of its extremal points. Hence it suffices to show that there is a unique extremal stationary probability. A stationary probability p is extremal if and only if it is ergodic, that is, B"f+ s f d p in L,(p) for every YE L,(p) (Rosenblatt, 1967, Theorem 2).
220
14. ADDITIVE MODELS
In order to study the dependence of P ' f ( x ) on x , we define a collection of Markov processes Xn(x) on the same probability space. Let t n ,n20, be a sequence of independent random variables, each uniformly distributed on [0,1]. Let Xo(x) = x , and, given Xn(x), let
where pn(x)= p ( X , ( x ) ) . It is easy to see that, for each x E X , X n ( x ) has the transition law (4.1). The processes Xn(x) have the following important property: If 0
< X - Y < bo + (bll = b ,
then Xn(x)-Xn(y) = x - y
or
x-y-b
(4.2)
for all n 2 0. For this is certainly true for n = 0. Suppose, inductively, that Xn(x)- X n ( y )= x - y . Then pn(x)2 pn(y), since p is nondecreasing. If tn
Pn(x), AXn(x)=AXn(Y), SO X n + , ( x ) - X n + 1 ( y ) = x - Y . If Pn(x)> tn>pn(y), AXn(x)=bl and dXn(y)=bo, SO Xn+l(X)-Xn+1(y)= x - y - b . The case X n ( x ) - X n ( y ) = x - y - b is similar, so (4.2) follows.
Suppose now that f is a bounded nondecreasing diffrrentiable function on X such that If'l < co. Then, if 0 < x - y < b,
v f ( x )- U'lf(Y) = E ( 4 , where A = f ( X n( x ) )-f(Xn ( Y ) )
d 1TKx-Y) by (4.2). Thus U " f ( 4 - U"f(Y) G I S ' l ( X - Y ) and
0"fW - PfW G
1f'KX-Y)
(4.3)
for all n > 0. Let p1 and p, be extremal, hence ergodic, stationary probabilities. Since O"f-+jfdpl in L z ( p l ) , there is a subsequence n' such that P ' f - + j f d p I plas. But j f d p , in L, &), so there is a subsequence n* of n', such that v n * f - + j f d p on f a set A i with p f ( A f ) = 1, i=1,2. The support of pi is X . For it is topologically closed by definition and stochastically closed by Lemmas 2.2.3 and 3.4.3, and we have already noted that X is the only sto-
v".'f+
221
14.5. SLOW LEARNING
chastically and topologically closed set when bo/bl is irrational. It follows that A i is dense. If O < x - y < b , X E A ] ,and Y E A , , then letting n+00 along the subsequence n* in (4.3) we obtain
J f 4 - Jf&, G
If'l(X-Y).
Since there are points y in A, with x - y positive and arbitrarily small, jfdcll G f f & z . BY symmetry, S f d h G Jfdcll so Y
Jfh
= JfdP,
(4.4)
for all bounded nondecreasing differentiable functions with If'l < 00. If g is such a function, then, for any y E X and 0 > 0, f U W
=d(X-Y)/4
is too, and, if g ( c o ) = 1 and g ( x ) = 0 for x G 0,
monotonically as a+O. Thus, applying (4.4) to fu, and taking the limit as a+O, we obtain pl(y, co) = p , ( y , 00) for all y E X , from which it follows that p1 = p 2 . I If a recurrent additive model with p continuous has a unique stationary probability p, then B"f(x) + J f d p for all x E X and bounded continuous f. For we saw in the proof of Theorem 3.1 that { P ( x , is weakly conditionally compact, and that every weak subsequential limit is stationary, hence is p. It follows easily that R"(x, - ) converges weakly to p. One wonders whether, under these conditions, Pf converges uniformly over compact sets. This is true for f = p by (B) of Theorem 2.1. In closing this section, we note the lack of criteria for aperiodicity, that is, convergence of K ( " ) ( x ,.) as opposed to P ( x , .). Our only result worth mentioning is negative: X,, is not aperiodic under the hypotheses of Theorem 4.2. 14.5. Slow Learning
Suppose that be = he, where 8 > 0, a, is a fixed real measurable function, and p ( x ) and U i are also fixed. This section describes the limiting behavior of X,, = X," and p,, = p(X,,) as 8 + 0.
14. ADDITIVE MODELS
If
Li(o) = E(expwae)Ai)< co for 101 < E, then M i ( l ) = Li(81) < co for 111 < C / O . The moments of be and a, are linearly related:
mi = E(belAi) = OE(aelAi) = 8iiii, si = ,qb:lAi)
=
= e2E(a:IA,) =
e2ji,
~ ( l b , l 3 1= ~~ e )3 ~ ( l ~ ~ l = 3 l8~ 3~ ~)
In addition, w(x) =
E(Axn/ep-,= x)
=
~ .
E ,+ ~ ii-ioq,
q(Axn/e)21xn = x ) = s , p + s,q, iyX)= ~(l~x~pl~lx,, = x) = plp + iOq.
s(x) =
(5.1)
To apply Theorem 8.1.1, we take Z = Ze = R’, and note that the approximation assumption (a) is trivially fulfilled. Assuming that p ( x ) has two bounded derivatives, as in beta models, the smoothness conditions (b) on w ( x ) and s(x) = S ( x ) - w ( x ) ~are satisfied. Finally, (c) follows from F(x) 6 maxi ii. As a consequence of Theorem 8.1.1, (Xn -f/Je N(O, g (no)) when 8 + 0 and n8 is bounded. As usual, f ’ ( r ) = w(f(t)) and f(0)= x = X , . If w ( x ) # 0, H ( f ( t ) ) = t and g ( t ) = G ( f ( t ) ) , where H and G are given by (12.4.5) and (12.4.8), respectively. Letting P ( t ) = p ( f ( t ) ) , it follows that N
(Pn-P(ne))/JB
Clearly P ( 0 ) = p ( x ) , and
-
~(07p’(f(ne))’g(n~)).
P’(t) = p ’ ( f ) ( n i , P+iii,(l-P)).
In beta models, p’ = pq, so P satisfies the autonomous differential equation P’(t) = P ( 1- P) (Z’ P + 5,(1 - P)).
We mention that, if p ’ ( y ) > 0 for all y, and p has certain other regularities, asymptotic normality of p n as O + 0 can be obtained by applying Theorem 8.1.1 directly to pn. In the remainder of the section, we develop approximations to stationary probabilities and absorption probabilities. Suppose that p’(x) > 0 for all x E R’. If iii, and iiio have opposite signs, there is a unique 5 E R’ such that P(t) =
r = =o/(~o-fi,)
223
14.5. SLOW LEARNING
or, equivalently, w(5) = 0. Also,
w e )=
- fio)P’( 0 and initial state x, A’,,-i _+ 03 a.s. by (C) of Theorem 1.1, and we wish to approximate g , (x) = P,(X,,+oo). Note that X,,+03 if and only if z,,+ 00, so 91
where z = ( x the parameter
= pz(zn+
()/p. By applying
T =8
= b(z),
Theorem 11.2.1 to the process z,, and (not 8’!), we shall show that i SUP Is1(4- W/4l -0
XER~
as O+O. Here 4j is the standard normal cumulative distribution function. Since is a critical point of w, Lemma 8.5.1 yields
0,
as O+O. Thus (11.1.4) and (11.2.2) obtain, and the function r(z) defined
224
14. ADDITIVE MODELS
below (11.1.5) is z/oz. It follows that B(z) =
l
r ( y ) d y = z2/2o2,
and e-B(') is integrable over R', as Theorem 11.2.1 requires. The function @(z/o) is the solution of V(Z)
with limits 0 and 1 at --oo 4 = 4 0 satisfies
+ r(z)$'(z) = 0 ,
and a.Thus, it remains only to show that
lim 4e(z) = lim 1 - #e(z) = 0.
z--m
2-
OD
(5.2)
0-0
0-0
Proof. The calculation that follows will put us in position to use Lemma 1.2. There is a constant c > 0 such that, for 101 < 612,
L,(w) < 1 + OE,+ C 0 2 .
Hence, if ldel
0. Taking 1= - 1/@, we obtain
< - @ ( y ( P - c ) - c@) if @ 0. We assume that X is convex, so that u(x, e ) E X , and separable, so that u is jointly measurable. Both conditions are satisfied if, for example, Y is countably generated and X is the set of probabilities absolutely continuous with respect to a fixed a-finite measure. Some interesting characterizations of transformations of probabilities of the form x+(1-8)x+81 are given by Blau (1961). The linear free-responding model and Suppes’ continuous response model discussed in Chapter 0 are of this type. In the free-responding model, y is an interresponse time and z is the reinforcement indicator variable, so Y = (0,oo) and 2 = (0,I}. Also, O,, = 8 and Oy0 = 8* do not depend on y , and
1, =
(
T*
if z = 0 ,
(l-a)T+ctA(y,-)
if z = 1 ,
where T and T* are probabilities on Y, and A is a stochastic kernel on Y x Y. The probability l7 ( y , { 1)) of reinforcing a y-second interresponse time is denoted u ( y ) . In Suppes’ model, y and z are the predicted and “correct” points on, say, the unit circle Y = Z in the complex plane, and 1, = 1(z, .) does not depend on y . Generalizing Suppes’ assumption that 8, is constant, we assume that 8, = 8 ( z ) depends only on z. The five-operator linear model is also a special case; take Y = (0, l}, Z = (0, l} x (0, l}, Oijk = ke,,, lij,((j})= 1, and 1 7 ( i ; j , k ) = n i j c i j if k = 1 and nij(l-cij) if k = 0 . Of course, here the real variable x({ 1)) characterizes the probability x, and it is natural to describe the model in terms of this variable, as we do in Chapter 12. Response and outcome random variables are denoted Y, and 2,; thus En = (Y,,, ZJ. Other useful notations are
and
15.1. Criteria for Regularity We first give a sufficient condition for the model to be distance diminishing. THEOREM 1.1.
If inf,,
8, > 0, the model is distance diminishing.
15.1.
227
CRITERIA FOR REGULARITY
Proof.
=
.s
sup p(x,de) (1 -8,)
= 1
- inf p(x, de) 8,. x
s
But p ( x , de) 8, =
1
x(dy) 8, 2 inf 8, ,
so that r,
< 1 -info,
< 1.
I
Theorems 1.2 and 1.3 give criteria for X , to be regular with respect to the bounded Lipschitz functions L ( X ) . The condition inf 8, > 0 is assumed in Theorem 1.2 and follows from the assumptions of Theorem 1.3, so both relate to distance diminishing models. Both theorems are corollaries of results in Chapter 4. THEOREM 1.2. If info, > 0, and if there are a > 0 and 4 E P ( Y ) such that & ( A ) 2 ac$(A), for all A E g and e E E with 8, > 0, then X,,is regular. THEOREM 1.3. I f there are a > 0 and v E P ( Z ) such that n ( y , B ) 2 av(B), for all y E Y and B E 2,i f y,(A) = y(z, A ) does not depend on y [thus 8, = 8(z) does not either], and if J v(dz)8 ( z ) > 0, then X,,is regular. In the free-responding model,
A, 2 (1 - a ) ( t A t*), where T A T * is the infimum o f t and t * (see Neveu, 1965, p. 107), so I, has a lower bound of the type required by Theorem 1.2 if a 1 and t and T* are not mutually singular. The condition in Theorem 1.3 that the reinforcement function have a response independent component av is very natural in applications of Suppes’ model. The extreme case of noncontingent outcomes, n ( y , .) = v, has received much attention in the experimental literature.
-=
Proof of Theorem 1.2. Clearly X is bounded: d(x,x’)
< 1x1 + I x ’ ~
= 2.
Let A” be the convex hull of {A,: 8, > 0). Then X’ is invariant (Definition
228
IS. MULTIRESPONSE LINEAR MODELS
4.2. l), so, by Theorem 4.2.2, it suffices to show that the transition operator U‘ for the corresponding submodel is regular. If X E X’, there are nonnegative pi with Cpi = 1, and e(i) E E with Oe([) > 0, such that x = C p IAe(i). Thus x ( 4 3 CPIa+(A) = a + ( 4 9
so that p(x,A) 2 a v w , where
v ( 4 = /+(dY) p ( Y , d Z ) M Y , z ) . Thus the submodel satisfies (a) of Section 4.1. Furthermore, r i = Sv(de)(l-ee) = 1 - J+(dy)ey
< 1 - info,
< 1.
Therefore U‘ is regular, according to Theorem 4.1.1.
Proof of Theorem 1.3. The transition operator
1
is the same as for the (reduced) model with X* = X , E* = 2, u*(x,z) = e), and p*(x, B) = ~ ( d yn)( y , B). But
U(X,
P*(X,
B) 2 av(B) ,
and
so U is regular by Theorem 4.1.1. When 0, > 0 for all y E Y, Ay(4
I =YY(4PY
is a stochastic kernel. The following lemma is needed in the next section.
LEMMA1.1. Under the hypotheses of either Theorem 1.2 or Theorem 1.3, there is a c > 0 and a probability $ on ?Y such that Iy(4
for all y and A.
3 c$(4
(1.1)
229
15.2. DISTRIBUTION OF Y, AND Y,
Proof. Under the hypotheses of Theorem 1.2,
AY(4 =6 2
/
e,. > 0
~(Y,dz)~y,~y,(A)
2 a$(A). Under the assumptions of Theorem 1.3,
where c = a/supO,
and
15.2. The Distribution of Ynand Y ,
Let xn be the unconditional distribution of Yn;i.e., xn(A) = P(YnE A ) .
Then, for any boundzd measurable function g on Y,
230
IS. MULTIRESPONSE LINEAR MODELS
which is analogous to (12.2.4). If 0, = 0 for all y, then JX,,(dy)O,= 0 a.s., so that x,+~is obtained from x,, via the linear transformation
Constancy of 8, generalizes the condition 6 = 0 in Chapter 12. It is satisfied in the free-responding model with u(y) = p, for then 0, = Op + 0*( 1 - p). It holds also in Suppes’ model if O(z) = O or if n ( y , .) = v. In the latter case,
Ay = for all y, so
J v(dz)y(z, .)/a
= (l-d)X,
X,+l
or
x,
=
=
x
+ ox
x + (l-a)”(xo-x).
Suppose now that the model is distance diminishing and that X,, is regular. By Theorem 6.2.1, the model is uniformly regular with 4,,= O(cin), where a < 1. From Section 6.2, we recall that the sequence 8, = {F,},,, of coordinate functions on E m is a stationary process, with respect to the asymptotic distribution r of bN = {E,,+N}n,Oas N+ 00. Let Y , + , and Z , + , be the coordinates of F,: F, = (Y,+,,, Zm+,,),and let x, be the distribution of Y , = Y,,,:
x,(A)
=
r(Y, € A )
for A E CY. Then
Ixn(A)-xm(A)I G 4n,
so Ixn- x, I + 0 geometrically as n + 00. If the real valued measurable function g on Y is positive or integrable,
1
g ( y )x,
a.
(49 = l
d Y m(ern)>r(de“)
is denoted For example, if Y = (0,a),the real powers g ( y ) = 9 are of special interest, while if Y is the unit circle, we might consider g ( y ) = (argy)k. The asymptotic expectation can be estimated as accurately as desired by the corresponding sample mean from a single subject: If IS(Y)I X,(dY) .c 00,
s
(1In)Sn = (I/n)
n- I
1g(Yj)
+
j=O
a s . as n+ a, by Theorem 6.2.3. And, if Jgz(y)x,(dy) < 00, Theorem
231
15.2. DISTRIBUTION OF Y. A N D Y ,
6.2.4shows that
<sn-ng)/J;; where
0’
N
N(O,O’)
is the usual sum of the autocovariances Pj = mv(g(Yw), g(Ym+j))*
At the end of the section, we consider a case in which Y = (0, GO), and determine the powers b for which j y x , ( d y ) < 00. To get information about x , , we return to (2.1). Letting n + a and canceling x, (A) yields
where
Neglecting the term C(A), we obtain the nonlinear integral equation x*(4
1
X*(dY) 0, =
j
X*(dY) Y,(A).
If there is a unique solution x*, it is analogous to the approximation 1 to x, = P ( A , , ) in Theorem 12.2.1. Since that approximation is good for small 0,, we expect x , G x* when ee is small.
Throughout the remainder of the section, it is assumed that 0, = 0 > 0 for all y E Y.Then C ( A ) = 0, so that
We now solve this equation under the additional condition (1. l), which, according to Lemma 1.1, is no more restrictive than our sufficient conditions for regularity. If c = 1 in (l.l), then I, = $ and x , = $. If c < 1, let 5 be the stochastic kernel defined by
2,(4
= c*(4
+ U-c)t(y,’4.
Substitution of this expression into (2.3) yields x,(4 =c*(4
+ (l-c)jx,(dMy,A),
which leads, on iteration, to x,(A) = c
k
1 (1-C)j
j=O
s
*(dyp(y,A)
+ (l-c)k+I
s
(2.4)
x,(dy)p+”(y,A).
232
15. M U L TIRESPONSE LINEAR MODELS
Since the last term is bounded by (1 - c ) ~ + ~and , c > 0, passage to the limit yields
where {‘(y, A ) = c
c (1m
=O
c)Jt”’(y, A).
In the free-responding model with u ( y ) = p, yv = (l-p)O*t* +pO((l-ct)t+aA(y,-)),
so that we can take C+
=
8-1((1-p)e*~*+ p e ( l - a ) 7 )
in (l.l), and 5 = A. The case A ( y , A) = q ( A / y ) is of special interest. ConA) = q(A/y). Then sider now the general model with Y = (0,oo) and
e(~,
t w , 4 = tl’(A/Y)
9
where qo = 6 , , q1 = q, and q1 is the jth convolution power of q :
+“(A)
= Jtl(dY)b(A/y).
Also,
tYY, A ) = tt’(A/Y)
9
where q’(A) = c
c (1 -c)’q’(A) m
/=o
The solution (2.5) takes a particularly simple form in terms of the Mellin transforms a m (B) =
J
y~xm (dY)
I,$@), and fi(j?). Here B may be positive or negative, and 0 < am(j?)< GO. Since the Mellin transform of a convolution is the product of the transforms (whether or not these are finite), (2.5) yields
am(B>
=
I,$(B)?w.
15.2. DISTRIBUTION OF Y. AND Y ,
233
And it is easy to show that (2.6) can be transformed term by term to obtain A
a’@) = c
i (1-C)’Cfi(B)l’.
1-0
Thus R,(p) < 00 if and only if $(B) < m and (1 -c)q(B) < 1, in which case
This formula can also be obtained directly, by transforming both sides of (2.4). If @(po)<mfor some p o > O , then q(/I)+l as BJO, so (l-c)q(B) 0, as would normally be the case in the free-responding model, @(B)+ 00 as /?+ m, so that 9, (B) = 00 for /3 sufficiently large. Similar considerations apply to negative /?.
16 0 The Zeaman-House-Lovejoy Models
The Zeaman-House-Lovejoy or ZHL models were described in Section 0.1. We recall that a rat is fed when he chooses the black arm (B) of a Tmaze, instead of the white one (W). According to the models, he attends to brightness (br), rather than position (PO), with probability u, and, given attention to brightness, he enters the black arm with probability y: u
=
P(br),
y = P(Blbr).
In the full model, there is another variable, z, the probability of going left given attention to position, and x = (u, y, z ) is the state variable. Observe that, if u = 1 and y = 1, the rat attends to brightness and makes the correct response with probability 1. The consequent state transformation does not change x, so all states x = (1, I , z), 0 < z < 1, are absorbing. But Theorem 3.4.1 implies that a distance diminishing model with compact state space has at most a finite number of absorbing states. Therefore, the full ZHL model is not distance diminishing. Furthermore, very little is known about it, beyond what follows from our study of the reduced model. The smallstep-size theory of Chapter 8 is applicable, but its implications have not been worked out in detail. 234
235
16.1. CRITERION FOR ABSORPTION
The reduced ZHL model has state variable x = ( u , y ) and state space X = [0,1] x [O, 11, with Euclidean metric d. Its events, event probabilities, and state transformations are listed in Table 1, which is an amplification of Table 2 of Section 0.1. The row numbers in Table 1 provide convenient labels for events. Of course, all $J’S and 8’s are in the closed unit interval. TABLE 1 EVENTEFFECXS AND PROBABILITIES FOR THE REDUCED ZHL MODEL
(e= 1 - 0 ) Event
Av
Probability
AY
The relationship between stochastic processes associated with the full and reduced ZHL models is exactly what would be expected intuitively. Let Xn = K, Yn7 2,)
and
En = (s,,, a,,, r,,)
be any state and event sequences for the full model. Thus s,, is the stimulus configuration [(B, W) or (W, B)] on trial n, a,, is the dimension attended to, and r,, is the rat’s choice (B or W). Then, by Theorem 1.2.4,
X,,
=
(Y,, Y,,)
and
E,, = (a,,, rn)
are state and event sequences for the reduced model. Thus any theorem concerning V,, Y,,, a,,, and r,, in the reduced model applies also to the full model. Bearing this in mind, we restrict our attention in the remainder of this chapter to the reduced model, and refer to it simply as “the ZHL model.” 16.1. A Criterion for Absorption
THEOREM 1.1. ing.
If 4, , $J., , 0, , 8, > 0, the Z H L model is distance diminish-
THEOREM 1.2. Its state sequences are absorbing, with single absorbing state 5 = (1,l). In particular, X,, + 5 a s . as n + CQ, by Theorem 3.4.3. Also, since P(t,PO
W ) = 0,
236
16. ZEAMAN-HOUSE-LOVEJOY MODELS
E(#po u W )< co by Theorem 6.2.2, where # P O u W is the total number of trials on which either a perceptual or an overt error occurs. In particular, # P O u W < co as.; that is, subjects learn with probability 1. This conclusion may appear obvious, but no simple, direct proof is known. Proof of Theorem 1.1. We shall use Proposition 1 of Section 2.1. The functions p ( .,e) have bounded partial derivatives, so m ( p ( ., e)) < a.All state transformations have the form u(x,e) =
Thus u(x, e)
- u(x', e)
(( 1 - 4 ) u ++Y * ). (1 -8)Y
=
d 2 ( u ( x , e), u(x', e)) = (1-4)z(u-u')2
+ (~-e)~(y--y')~
< (1 - (4 A e))2d2(x,x') , and z(u(-, e))
< 1 - (4 A 0) < 1.
Furthermore, z(u(., 1)) < 1, since > 0 and > 0. Let P = { x : u > O and y > O } . If X E P, p ( x , l ) > O , so we can take j = 1 and d = 1 in Proposition 1. For x E' P, we display, in the next paragraph, a succession e" = (e,,, ... , en- I ) of events such that u ( x , en) E P and p.(x,e")>O. Then, i f j = n + l and d = ( e , , ..., e n - , , l),
+ 0 .
Theorem 2.1 gives the expected total numbers of overt and perceptual errors under these conditions. THEOREM 2.1. For any 0 d v,y d 1,
E,(#w) = 014
+2gp
and
where ij = 1- u.
E,(#PO)
= 2ijl4
+2gp,
(2.2)
Proof. Events 2, 3, and 4 have probability 0 at c, so, as a consequence of Theorem 6.2.2, gk((x)= E,(#k) is the unique Lipschitz function such
238
16. ZEAMAN-HOUSE-LOVEJO Y MODELS
that
+ Q?k(X)
(2.3) and gk(5)= 0. This characterization is now used to establish formulas for these expectations. Table 1 gives g d x ) = P(X,k)
UY - Y = W y l x ) = e(yvY
+j v j ) = ego ;
hence J - UJ = e j v
and j / e = JV
+ uj/e.
Since Jv = p(x, 2), and j j / O is Lipschitz and vanishes at 5, 92
(4 = Y/O
*
Returning to Table 1,
uv - v
= E(dulx)
= #.J(f%y- u 2 j
- vu/2 + i?/2).
Rewriting y in the first term on the right as 1-1, we get
uv - v
= 4(-uj
from which it follows that
fi/+ = - vJ
+ 6/2),
+ fi/2 + uq4.
When (2.5) is added to this equation, the result is
+ jp = fi/2 + u(fi/+ +jle) .
But p ( x , 3) = p ( x , 4 ) = 612, and f i / + + j j / O is Lipschitz and vanishes at g3 (4 = 94(x) =
El4 + J P *
5, so (2.8)
Formulas (2.1) and (2.2) follow from (2.6) and (2.8) on noting that
W # W ) = 9 2 (4 + 9 4 w and &(#PO)
=
9 3
(4 + g 4 ( 4
*
I
We now consider reversal of the correct response. If, on trial n and thereafter, food is placed in the white arm, (2.1) implies that the expected total
239
16.3. 0 VERLEARNING REVERSAL EFFECT
umber of (overt) errors on the new problem, given X,,, is E / 4 + 2 Y n / 8 . Thus the unconditional expected number of errors is
+
Y, = E ( R ) / ~ 2 ~ ( ~ , , ) / e .
The change in Y,, as the result of one additional trial before reversal is then AY,, = -E(AV,)/$
+ 2E(AYn)/8.
In view of (2.4) and (2.7), AY,, = E(3V, 7,- V,,/2). This formula is needed for the study of the overlearning reversal effect in the next section. 16.3. The Overlearning Reversal Effect
To apply Theorem 8.1.1 to the ZHL model, we simply assume that the parameters c # ~= ~ y,8 and 8 , = q j 8 depend linearly on a parameter 8 > 0. Then lo= I = X is the closed unit square, 9(AXn/81Xn= x ) does not depend on 8, W(X)
= E(AX,,/OlX,,= X )
and S ( X ) = E((AX,,/8)21X,,= X )
have polynomial coordinate functions, and AX,,/8 is bounded, so it is easy to see that all of the assumptions of the theorem hold. Let X , = x as., and let f satisfy
f’w = W ( f ( 0 )
(3.1)
with f(0)= x. The theorem says that, for any T < 00, E(Xn) -f(n@
=
o(e)
(3.2)
and (3.3) uniformly over x E X and n8 < T. Furthermore, X,, is approximately normally distributed when 8 is small and n8 is bounded. We now revert to the special case 8, = O2 = 8 and 4j = 4, 1 <j < 4, considered in the last section, and assume that 4 = 78, where y > 0. This parameter is thus the ratio of the perceptual and response learning rates. We will study the asymptotic behavior of f(t), and then use the information
240
16. ZEAMAN-HOUSE-LO V U O Y MODELS
obtained to show which values of y produce an overlearning reversal effect when 8 is small. Let o ( t ) and y(r) be the coordinate functions of f(r):f(t) = (v(r), y ( t ) ) . It follows from (2.4) and (2.7) that the coordinate equations of (3.1) are d ( t ) = y(ij/2
- ujj)
(3.4)
and (3.5)
y’(t) = ojj.
Since f ( t ) E A,‘ 0 < v(t), y(r) < 1. The values u = 1 and y = 1 are associated with perfect learning. It is clear from (3.5) that y ( t ) is nondecreasing. Thus, if y ( 0 ) = 1, y ( t ) E 1, and (3.4) gives 8 ( t ) = ij(0)e-yr/Z.Throughout the remainder of the section, it is assumed that y ( 0 ) < 1. According to Lemma 3.1, o ( t ) and y ( t ) both converge to 1 as r + a.It should be noted that u ( t ) need not be monotonic. By (3.4), v’(0)aO if u(0) G v, and o’(0) < 0 if u ( 0 ) > v , where v = 1/(1 +2jj(O)).
In the former case, u’(t) > 0 for all t > 0. In the latter, there is an s > 0 such that u’(t) < 0 for t < s, o’(s) = 0, and d ( t ) > 0 for > s. Since these facts are not used below, their proofs are omitted.
3.1. For any y > 0, LEMMA jj(r)
as
t+
a,where u =
-
(3.6)
ue-‘
J(0)exp2(jj(O)+E(O)/y).
If y c 2, for some fl> 0. If y = 2, ij(t)
and, if y > 2, ij(t)
-
-
2ate-‘,
-e-r Y-2
*
(3.9)
Proof. Equations (3.4) and (3.5) are equivalent to ij’(t) = ~ ( u -j a/2)
(3.10)
241
16.3. 0 VERLEARNING REVERSAL EFFECT
and (3.1 1)
j’(t) = - u j .
+
Thus, if Z(t) = ij/y j , Z’(t) = -ii/2.
(3.12)
Therefore, Z ( t ) 2 0 is nonincreasing, so that Z(m) = limZ(t) 1- 00
exists. Since y ( m ) exists, u(m) does too. Integrating both sides of (3.12), we obtain Z(0) - Z ( t ) =
hence Z(0)
- Z(m)
=
sb
C(s) ds/2;
dm
iqs) &/2.
Finiteness of the integral implies u(m) = 1. It follows from (3.1 1) that j ( t ) = j ( 0 ) exp - C u ( s ) & t
= ji(o)exp(-r)expS 0 ii(s)&.
Thus, j(t)
-
j(0)exp2(Z(0) -Z(m)) exp(-t).
In particular, y ( m ) = 1, so Z(m) = 0, and (3.6) is proved. Solving (3.10) for ij in terms of q = y u j , we obtain (3.13) If y < 2, (3.6) implies that eY“/’q(s) is integrable over the whole positive axis. Furthermore,
fi
= ij(0)
+ ~ O O e Y s / zdsq (>s )0.
For this is certainly true if Is(0)> 0, and, if ij(0)= 0, then q(s) > 0 for s sufficiently small, which implies positivity of B. Thus (3.7) obtains.
242
16. ZEAMAN-HOUSE-LO VEJO Y MODELS
If y = 2 , eYS12q(s)+2aas s + m ; hence ii(t)et/t = ij(O)/t
+
+ 2a
when t + 00, as required for (3.8). Finally, if y > 2,
This estimate and (3.13) yield (3.9). Throughout what follows, the initial value x = (u,y) of Xn=X." and f ( t ) is fixed and independent of 8, with y < 1. At the end of the last section, we considered Y , = Y:, the expected number of errors following a reversal before trial n, and obtained the expression (2.9) for the change in this quantity due to one additional trial before reversal. This expression and Lemma 3.1 are the basis for Theorem 3.1.
THEOREM 3.1. If y < 3 ( y > 3), there is a To with the following property. For any T > T o , there is an E = E~ > 0 such that, if 8 < E , then AY: < 0 (AY; > 0),for all To < no< T. The theorem is concerned with large values of n (n 2 T o / @ ,or, loosely speaking, with overtraining. It says that, when 8 is small, overtraining facilitates reversal (the overlearning reversal effect) if y < 3, but retards reversal if y > 3. This is consonant with the intuition that, when the perceptual learning rate $ = 78 is small relative to the response learning rate 8, the gain in attention to the relevant stimulus dimension with overtraining should more than compensate for the slight further strengthening of incorrect stimulus-response associations. It is, perhaps, surprising that the overlearning reversal effect is predicted for values of y almost as large as 3. Of course, to obtain an experimentally measurable effect of overtraining, it is necessary to give a reasonable number k of overtraining trials. The change in expected total errors due to overtraining is then
Proof. According to (2.9), ΔY_n^θ = E(P(X_n)), where P is the polynomial

    P(x) = 3vȳ − v̄/2.
Clearly

    |ΔY_n^θ − P(f(nθ))| ≤ |E(P(X_n)) − P(μ_n)| + |P(μ_n) − P(f(nθ))|,

where μ_n = E(X_n). Since the first partial derivatives of P are bounded over X,

    |P(μ_n) − P(f(nθ))| ≤ C|μ_n − f(nθ)| ≤ Cθ

by (3.2). And, since the second partial derivatives of P are bounded,

    |E(P(X_n)) − P(μ_n)| ≤ C E(|X_n − μ_n|²) ≤ Cθ

by (3.3). Thus, for any T < ∞, there is a C = C_T such that

    |ΔY_n^θ − P(f(nθ))| ≤ Cθ    (3.14)

for all nθ ≤ T. Let

    p(t) = P(f(t)) = 3v(t)ȳ(t) − v̄(t)/2.
It follows easily from Lemma 3.1 that, as t → ∞,

    p(t)e^{γt/2} → −β/2            if γ < 2,
    p(t)e^t/t    → −α              if γ = 2,
    p(t)e^t      → 2α(γ−3)/(γ−2)   if γ > 2.
Therefore, if γ < 3, there is a T₀ such that p(t) < 0 for t ≥ T₀; and, if γ > 3, there is a T₀ such that p(t) > 0 for t ≥ T₀. Suppose that γ < 3, and let T > T₀. Since −p(t) is continuous and strictly positive on [T₀, T],

    δ = min_{T₀ ≤ t ≤ T} (−p(t)) > 0.

Taking ε = δ/C_T in (3.14), we obtain ΔY_n^θ ≤ p(nθ) + C_T θ < −δ + δ = 0 whenever θ < ε and T₀ ≤ nθ ≤ T. The case γ > 3 is treated in the same way.
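The critical value γ = 3 can also be located numerically. The following sketch (illustrative Python, not from the text; the initial values are arbitrary, and a crude Euler scheme suffices for a sign check) integrates (3.10)–(3.11) and evaluates p(t) = 3v(t)ȳ(t) − v̄(t)/2 at a large t:

```python
def p_at_large_t(gamma, vbar=0.5, ybar=0.5, T=30.0, h=1e-3):
    # Euler integration of (3.10)-(3.11); crude, but adequate for a sign check
    for _ in range(int(T / h)):
        v = 1.0 - vbar
        dvbar = gamma * (v * ybar - vbar / 2.0)  # (3.10)
        dybar = -v * ybar                        # (3.11)
        vbar += h * dvbar
        ybar += h * dybar
    return 3.0 * (1.0 - vbar) * ybar - vbar / 2.0  # p(T)

for gamma in (2.5, 3.0, 3.5):
    # negative for gamma = 2.5, positive for 3.5, near zero at the critical 3
    print(gamma, p_at_large_t(gamma))
```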
… (b′) of Theorem 4.1.2 obtains. When U is regular on B(X), the model is uniformly regular (Theorem 6.2.1), so that {E_{j+n}}_{n≥0} possesses an asymptotic distribution Γ as j → ∞. If g is a measurable real valued function on Y^{Z_+}, g_j = g(E_j, E_{j+1}, …) …

… and F_i is strictly increasing,
    σ(x) = p₁ F₁(x) + p₀ (1 − F₀(x)) > 0

and

    p̃(x) = p₁ F₁(x)/σ(x)

is strictly increasing. It is also continuous, with p̃(−∞) = 0 and p̃(∞) = 1. Thus the Markov process X̃_n, with transitions

    ΔX̃_n = A₁   with probability p̃(X̃_n),
            A₀   with probability 1 − p̃(X̃_n),

has a unique stationary probability, according to Theorem 14.4.2. If K̃ is its transition kernel, then

    K(x, B) = σ(x) K̃(x, B) + (1 − σ(x)) δ_x(B).
It follows that μ → μ̃, where

    μ̃(B) = ∫_B σ(x) μ(dx) / ∫ σ(x) μ(dx),

is a one-to-one mapping of stationary probabilities of K into stationary probabilities of K̃. Thus K has at most one stationary probability. The argument in the second to last paragraph of Section 14.4 then yields weak convergence of K^n(x, ·) to the unique stationary probability μ.

If A₁/A₀ is rational, let j and k be relatively prime positive integers such that A₀/|A₁| = j/k, and let

    ε = A₀/j = |A₁|/k.

It is not difficult to show that

    G⁺ = {n₁A₁ + n₀A₀ : n₁, n₀ ≥ 0} = εZ,    (4.3)
where Z is the set of all integers. Clearly C_x = x + G⁺ is stochastically closed, so the restriction K_x of K to C_x is the transition kernel for a denumerable Markov chain. As a consequence of (4.3), any two states communicate, so

    lim_{n→∞} K_x^n(y, z) = π(z)

for all y, z ∈ C_x (Parzen, 1962, Theorem 8A, p. 249). But

    K_x^n(y, {z : |z| ≤ k}) → 1    as k → ∞,

uniformly in n, from which it follows that Σ_{z∈C_x} π(z) = 1, and

    lim_{n→∞} K_x^n(y, A) = Σ_{z∈A} π(z) = π(A)

for any A ⊂ C_x. The distribution π is the unique stationary probability of K_x, and π(z) > 0 for all z ∈ C_x (Parzen, 1962, Theorem 8B, p. 249, and Theorem 8C, p. 251). Therefore

    lim_{n→∞} K^n(x, B) = π(B ∩ C_x) = K^∞(x, B)

for all Borel sets B, K^∞(x, ·) is the unique stationary probability of K with support C_x, and K^∞(x + ε, ·) = K^∞(x, ·).
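The lattice structure of K^∞(x, ·) is easy to exhibit by simulation. In the sketch below everything is a hypothetical choice rather than data from the text: F₁ and F₀ are taken to be normal distribution functions centered at +1 and −1, p₁ = p₀ = 1/2, and A₀ = 0.2, A₁ = −0.3, so that j = 2, k = 3, and ε = 0.1. The empirical distribution of X_n then sits on the lattice x + 0.1Z, as the argument above predicts.

```python
import math, random
from collections import Counter

def Phi(z):  # standard normal distribution function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def F1(x): return Phi(x - 1.0)  # hypothetical "signal" distribution function
def F0(x): return Phi(x + 1.0)  # hypothetical "noise" distribution function

p1 = p0 = 0.5          # presentation probabilities
A0, A1 = 0.2, -0.3     # A0/|A1| = 2/3 = j/k, so eps = A0/2 = |A1|/3 = 0.1

random.seed(1)
x0 = x = 0.05
N, counts = 200_000, Counter()
for _ in range(N):
    u = random.random()
    if u < p1 * F1(x):
        x += A1        # downward step
    elif u < p1 * F1(x) + p0 * (1.0 - F0(x)):
        x += A0        # upward step
    # with the remaining probability 1 - sigma(x), X is unchanged
    counts[round((x - x0) / 0.1)] += 1

for k in sorted(counts):  # every visited state lies on x0 + 0.1*Z
    print(f"x = {x0 + 0.1 * k:+.2f}   relative frequency = {counts[k] / N:.4f}")
```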
To study slow learning, we fix ρ = A₁/A₀ < 0 and let A₀ = θ → 0. The moments

    w(x) = E(ΔX_n/θ | X_n = x) = ρ p₁ F₁(x) + p₀ (1 − F₀(x))

and

    S(x) = E((ΔX_n/θ)² | X_n = x) = ρ² p₁ F₁(x) + p₀ (1 − F₀(x))

do not depend on θ, so (a) of Section 8.1 is satisfied, with I₀ = R¹. If F_i has two bounded derivatives, w(x) and s(x) = S(x) − w²(x) satisfy the smoothness condition (b). Finally, |ΔX_n/θ| is bounded by the maximum of 1 and |ρ|, so (c) holds, and Theorem 8.1.1 is applicable. Thus, if X₀^θ = x does not depend on θ,
(X_n^θ − f(nθ))/θ^{1/2} is asymptotically normal as θ → 0, where f is the solution of f′ = w(f), f(0) = x. Moreover, f(t) converges, as t → ∞, to the unique zero ξ of w, and the corresponding asymptotic variance converges to

    σ² = s(ξ)/2|w′(ξ)|.    (4.4)

In view of (4.4), one expects the limiting distributions K^∞(x, ·) = K_θ^∞(x, ·) to be asymptotically normal, with mean ξ and variance θσ², as θ → 0. It follows from Theorem 10.1.1 that this is the case.
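For the same hypothetical normal F₁ and F₀ as in the previous sketch (with p₁ = p₀ = 1/2 and the arbitrary choice ρ = −1.5, so A₀ = θ and A₁ = ρθ), ξ and σ² can be computed directly, and a long simulation at a small θ compared with the normal approximation of Theorem 10.1.1. An illustrative sketch, not part of the original text:

```python
import math, random

def Phi(z): return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
def F1(x): return Phi(x - 1.0)
def F0(x): return Phi(x + 1.0)

p1 = p0 = 0.5
rho = -1.5  # rho = A1/A0 < 0

def w(x): return rho * p1 * F1(x) + p0 * (1.0 - F0(x))
def S(x): return rho * rho * p1 * F1(x) + p0 * (1.0 - F0(x))
def s(x): return S(x) - w(x) ** 2

# w decreases from p0 > 0 to rho*p1 < 0, so it has a unique zero xi
lo, hi = -20.0, 20.0
for _ in range(80):  # bisection
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if w(mid) > 0 else (lo, mid)
xi = 0.5 * (lo + hi)
dw = (w(xi + 1e-6) - w(xi - 1e-6)) / 2e-6  # numerical w'(xi)
sigma2 = s(xi) / (2.0 * abs(dw))           # sigma^2 of (4.4)

theta = 0.01
print("xi =", xi, "  theta*sigma^2 =", theta * sigma2)

random.seed(2)
x, xs = 0.0, []
for n in range(600_000):
    u = random.random()
    if u < p1 * F1(x):
        x += rho * theta  # step A1 = rho*theta
    elif u < p1 * F1(x) + p0 * (1.0 - F0(x)):
        x += theta        # step A0 = theta
    if n >= 100_000:      # discard a transient
        xs.append(x)
m = sum(xs) / len(xs)
print("simulated mean =", m)
print("simulated variance =", sum((y - m) ** 2 for y in xs) / len(xs))
```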
18. Diffusion Approximation in a Genetic Model and a Physical Model
This chapter treats diffusion approximation in a population genetic model of S. Wright and in the Ehrenfest model for heat exchange. Since sequences of states in these models are finite Markov chains, their "small step size" theory has much in common with that of stimulus sampling models (Section 13.3).
18.1. Wright's Model

Our description of this model is based on Ewens (1969, Section 4.8). The reader should also consult that volume for references to the extensive literature on diffusion approximation in population genetics. The work of Karlin and McGregor (1964a,b) is of particular interest.

Consider a diploid population of M individuals. At a certain chromosomal locus there are two alleles, A₁ and A₂. The number i of A₁ genes in the population thus ranges from 0 to 2M, and the proportion, x = i/2M, from 0 to 1. The A₁ gene frequencies i (or x) in successive generations are assumed to form a Markov chain, with binomial transition probabilities

    p_{ij} = \binom{2M}{j} π_i^j (1 − π_i)^{2M−j},
where

    π_i = (1 − u) π_i* + v (1 − π_i*)

and

    π_i* = [(1+s₁)x² + (1+s₂)x(1−x)] / [(1+s₁)x² + 2(1+s₂)x(1−x) + (1−x)²].
The constant u represents the probability that an A₁ gene mutates to A₂, while v is the probability that A₂ mutates to A₁. The genotypes A₁A₁, A₁A₂, and A₂A₂ have fitnesses 1+s₁, 1+s₂, and 1, respectively, so s₁ and s₂ control selective pressures.

Let X_n be the proportion of A₁ genes in generation n. The standard formulas for the mean and variance of the binomial distribution give

    E(X_{n+1} | X_n = x) = π_i    (1.1)

and

    var(X_{n+1} | X_n = x) = π_i(1 − π_i)/2M.    (1.2)

Thus E(ΔX_n | X_n = x) = π_i − x and, since π_i(1 − π_i) = x(1 − x) + O(|π_i − x|),

    var(ΔX_n | X_n = x) = x(1 − x)/2M + O(|π_i − x|/2M).    (1.3)
The theory of Part II has a number of implications for the behavior of X_n = X_n^M as M → ∞ and the parameters u, v, and s_i → 0. In order to study small values of u, v, and s_i, we let u = ūp, v = v̄p, and s_i = s̄_i p, where ū ≥ 0, v̄ ≥ 0, and s̄_i are fixed, and p → 0. Under these conditions, it is shown later that

    E(ΔX_n | X_n = x) = p w(x) + O(p²),    (1.4)

where

    w(x) = v̄ − (ū + v̄)x + x(1 − x)(s̄₂ + (s̄₁ − 2s̄₂)x).

Also,

    var(ΔX_n | X_n = x) = x(1 − x)/2M + O(p/2M).    (1.5)

It remains only to specify the relative magnitudes of p and 1/2M. We will consider two possibilities: p = 1/(2M)^{1/2} and p = 1/2M. These will be referred to as the cases of large and small drift, respectively. The results
described next follow immediately from the indicated theorems in Part II. The hypotheses of these theorems are verified at the end of the section.

LARGE DRIFT: TRANSIENT BEHAVIOR. Let I_M = {i/2M : 0 ≤ i ≤ 2M} …
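Expansion (1.4) can be checked numerically from the definitions of π_i and π_i* alone. In the sketch below (illustrative Python, not from the text), the constants ū, v̄, s̄₁, s̄₂ and the point x are arbitrary choices; the printed ratio (E(ΔX_n | X_n = x) − pw(x))/p² stays bounded as p → 0, in agreement with the O(p²) error term.

```python
def pi_star(x, s1, s2):
    # fitnesses 1+s1, 1+s2, 1 for genotypes A1A1, A1A2, A2A2
    num = (1 + s1) * x * x + (1 + s2) * x * (1 - x)
    den = (1 + s1) * x * x + 2 * (1 + s2) * x * (1 - x) + (1 - x) ** 2
    return num / den

def pi_i(x, u, v, s1, s2):
    # selection followed by mutation: u is A1 -> A2, v is A2 -> A1
    ps = pi_star(x, s1, s2)
    return (1 - u) * ps + v * (1 - ps)

def w(x, ub, vb, s1b, s2b):
    # the drift function appearing in (1.4)
    return vb - (ub + vb) * x + x * (1 - x) * (s2b + (s1b - 2 * s2b) * x)

ub, vb, s1b, s2b, x = 1.0, 2.0, 0.8, 0.3, 0.3  # arbitrary fixed constants
for p in (1e-2, 1e-3, 1e-4):
    drift = pi_i(x, ub * p, vb * p, s1b * p, s2b * p) - x  # E(dX_n | X_n = x)
    err = drift - p * w(x, ub, vb, s1b, s2b)
    print(p, drift, err / p ** 2)  # err/p^2 stays bounded: the O(p^2) of (1.4)
```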