MATHEMATICAL APPROACHES TO NEURAL NETWORKS
North-Holland Mathematical Library Board of Advisory Editors: M. Artin, H. Bass, J. Eells, W. Feit, P.J. Freyd, F.W. Gehring, H. Halberstam, L.V. Hormander, J.H.B. Kemperman, H.A. Lauwerier, W.A.J. Luxemburg, F.P. Peterson, I.M. Singer and A.C. Zaanen
VOLUME 51
NORTH-HOLLAND AMSTERDAM LONDON NEW YORK TOKYO
Mathematical Approaches to Neural Networks
Edited by
J.G. TAYLOR
Centre for Neural Networks, Department of Mathematics
King's College London, London, U.K.
1993
NORTH-HOLLAND AMSTERDAM LONDON NEW YORK TOKYO
ELSEVIER SCIENCE PUBLISHERS B.V. Sara Burgerhartstraat 25 P.O. Box 211, 1000 AE Amsterdam, The Netherlands
Library of Congress Cataloging-in-Publication Data

Mathematical approaches to neural networks / edited by J.G. Taylor.
p. cm. -- (North-Holland mathematical library ; v. 51)
Includes bibliographical references.
ISBN 0-444-81692-5
1. Neural networks (Computer science)--Mathematics. I. Taylor, John Gerald, 1931- . II. Series.
QA76.87.M38 1993
006.3--dc20 93-34573
CIP
ISBN: 0 444 81692 5
© 1993 ELSEVIER SCIENCE PUBLISHERS B.V. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior written permission of the publisher, Elsevier Science Publishers B.V., Copyright & Permissions Department, P.O. Box 521, 1000 AM Amsterdam, The Netherlands. Special regulations for readers in the U.S.A. - This publication has been registered with the Copyright Clearance Center Inc. (CCC), Salem, Massachusetts. Information can be obtained from the CCC about conditions under which photocopies of parts of this publication may be made in the U.S.A. All other copyright questions, including photocopying outside of the U.S.A., should be referred to the copyright owner, Elsevier Science Publishers B.V., unless otherwise specified. No responsibility is assumed by the publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein.
This book is printed on acid-free paper Printed in The Netherlands
Preface

The subject of Neural Networks is coming of age, after its inception 50 years ago in the seminal work of McCulloch and Pitts. A distinguished gallery of workers (some of whom are included in this volume) have contributed to building the edifice, which is now proving of value in a wide range of academic disciplines and in important applications in industrial and business tasks. These two strands of neural networks are thus firstly that appertaining to living systems, their explanation and modelling, and secondly that of dedicated tasks to which living systems may be ill adapted or which involve uncertain rules in noisy environments. The progress being made in both these approaches is considerable, but both still stand in need of a theoretical framework of explanation underpinning their usage and allowing the progress being made to be put on a firmer footing. The purpose of this book is to attempt to provide such a framework.

Mathematics is rightly regarded as the queen of the sciences, and it is through mathematical approaches to neural networks that a suitable explanatory framework is expected to be found. Various approaches are available, and are contained in the contributions presented here. These span a broad range from single neuron details, through numerical analysis and functional analysis, to dynamical systems theory. Each of these avenues provides its own insights into the way neural networks can be understood, both for artificial networks and for simplified simulations of living ones. The breadth and vigour of the contributions underline the importance of the ever-deepening mathematical understanding of neural networks.

I would like to take this opportunity to thank the contributors for their contributions and the publishers, especially Dr Sevenster, for his forbearance over a rather lengthy gestation period.
J G Taylor King's College, London 28.6.93
Table of Contents

Control Theory Approach
P.J. Antsaklis ... 1

Computational Learning Theory for Artificial Neural Networks
M. Anthony and N. Biggs ... 25

Time-summating Network Approach
P.C. Bressloff ... 63

The Numerical Analysis Approach
S.W. Ellacott ... 103

Self-organising Neural Networks for Stable Control of Autonomous Behavior in a Changing World
S. Grossberg ... 139

On-line Learning Processes in Artificial Neural Networks
T.M. Heskes and B. Kappen ... 199

Multilayer Functionals
D.S. Modha and R. Hecht-Nielsen ... 235

Neural Networks: The Spin Glass Approach
D. Sherrington ... 261

Dynamics of Attractor Neural Networks
T. Coolen and D. Sherrington ... 293

Information Theory and Neural Networks
J.G. Taylor and M.D. Plumbley ... 307

Mathematical Analysis of a Competitive Network for Attention
J.G. Taylor and F.N. Alavi ... 341
Mathematical Approaches to Neural Networks J.G. Taylor (Editor) © 1993 Elsevier Science Publishers B.V. All rights reserved.
Control theory approach

Panos J. Antsaklis
Department of Electrical Engineering, University of Notre Dame, Notre Dame, Indiana 46556, USA
Abstract The control of complex dynamical systems is a very challenging problem, especially when there are significant uncertainties in the plant model and the environment. Neural networks are being used quite successfully in the control of such systems, and in this chapter the main approaches are presented and the advantages and drawbacks are discussed. Traditional control methods are based on firm and rigorous mathematical foundations, developed over the last hundred years, and so it is very desirable to develop corresponding results when neural networks are used to control dynamical systems.
1. INTRODUCTION Problems studied in Control Systems theory involve dynamical systems and require real time operation of control algorithms. Typically, the system to be controlled, called the plant, is described by a set of differential or difference and perhaps nonlinear equations; the equations for the decision mechanism, the controller, are then derived using one of the control design methods. The controller is implemented in hardware or software to generate the appropriate control signals; actuators and sensors are also necessary to translate the control commands into control actions and the values of measured variables into appropriate signals. Examples of control systems are the autopilots in airplanes, the pointing mechanisms of space telecommunication antennas, speed regulators of machines on the factory floor, controllers for emissions control and suspension systems in automobiles, and controllers for temperature and humidity regulators at home, to mention but a few. The model of the plant to be controlled can be quite poor, either because of lack of knowledge of the process to be controlled, or by choice to reduce the complexity of the control design. Feedback is typically used in control systems to deal with uncertainties in the plant and the environment and to achieve robustness in stability and performance. If the control goals are demanding, that is, the control specifications are tight, while the uncertainties are large, then fixed robust controllers may not be adequate. Adaptive control may be used in this case, where the new plant parameters are identified on line and this information is used to change the coefficients of the controller. The area is based on firm mathematical foundations, although in practice engineering skill and intuition are used to make the theoretical methods applicable to real practical systems, as is the case in many engineering disciplines.
Intelligent Autonomous Control Systems. In recent years it has become quite apparent that in order to achieve high autonomy in control systems, that is, to be able to control effectively under significant uncertainties, even for example when certain types of failures occur (such as faults in the control surfaces of an aircraft), one needs to implement methods beyond conventional control methods. Decision mechanisms such as planning and expert systems are needed, together with learning mechanisms and sophisticated FDI (Failure Diagnosis and Identification) methods. One therefore needs to adopt an interdisciplinary approach involving concepts and methods from areas such as Computer Science and Operations Research in addition to Control Systems, and this leads to the area of Intelligent Autonomous Control Systems; see Antsaklis and Passino (1992a) and the references therein. A hierarchical functional intelligent controller architecture, as in Fig. 1, appears to offer advantages; note that in the figure the references to pilot, vehicle and environment etc. come from the fact that such a functional architecture refers to a high autonomy controller for future space vehicles, as described in Antsaklis and Passino (1992b) and Antsaklis, Passino and Wang (1991). A three level architecture in intelligent controllers is quite typical: The lower level is called the Execution level and this is where the numerical algorithms are implemented in hardware or software, that is, this is where conventional control systems reside; these are systems characterized by continuous states. The top level is the Management level where symbolic systems reside, which are systems with discrete states. The middle level is the Coordination level where both continuous and discrete state systems may be found. See Antsaklis and Passino (1992b) and the references therein for details.

Figure 1. A hierarchical functional architecture for the intelligent control of high autonomy systems. [Figure: three levels linking the pilot and crew/ground station/on-board systems, management-level decision making and learning, coordination-level adaptive control, execution-level algorithms in hardware and software, and the vehicle and environment.]
Neural Networks in Control Systems. At all levels of the intelligent controller architecture there appears to be room for potential applications of neural networks. Note that most of the uses of neural networks in control to date have been in the Execution and Coordination levels - they have been used mostly as plant models and as fixed and adaptive controllers. Below, in the rest of the Introduction, a brief summary of the research activities in the area of neural networks in control is given. One should keep in mind that this is a rapidly developing field. Additional information, beyond the scope of this contribution, can be found in Miller, Sutton and Werbos (1990), in Antsaklis (1990), Antsaklis (1992) and in Warwick (1992), which are good starting sources; see also Antsaklis and Sartori (1992). It is of course well known to the readers that neural networks consist of many interconnected simple processing elements called units, which have multiple inputs and a single output. The inputs are weighted and added together. This sum is then passed through a nonlinearity called the activation function, such as a sigmoidal function like f(x) = 1/(1 + e^{-x}) or f(x) = tanh(x), or a gaussian-type function, such as f(x) = exp(-x^2), or even a hard limiter or threshold function, such as f(x) = sign(x) for x ≠ 0. The terms artificial neural networks or connectionist models are typically used to describe these processing units and to distinguish them from biological networks of neurons found in living organisms. The processing units or neurons are interconnected, and the strengths of the interconnections are denoted by parameters called weights. These weights are adjusted, depending on the task at hand, to improve performance. They can be either assigned values via some prescribed off-line algorithm, while remaining fixed during operation, or adjusted via a learning process on-line. Neural networks are classified by their network structure topology, by the type of processing elements used, and by the kind of learning rules implemented. Several types of neural networks appear to offer promise for use in control systems. These include the multi-layer neural network trained with the backpropagation algorithm commonly attributed to Rumelhart et al. (1986), the recurrent neural networks such as the feedback network of Hopfield (1982), the cerebellar model articulation controller (CMAC) model of Albus (1975), the content-addressable memory of Kohonen (1980), and the gaussian node network of Moody and Darken (1989). The choice of which neural network to use and which training procedure to invoke is an important decision and varies depending on the intended application. The type of neural network most commonly used in control systems is the feedforward multilayer neural network, where no information is fed back during operation. There is however feedback information available during training. Supervised learning methods, where the neural network is trained to learn input/output patterns presented to it, are typically used. Most often, versions of the backpropagation algorithm are used to adjust the neural network weights during training. This is generally a slow and very time consuming process as the algorithm usually takes a long time to converge. However other optimization methods such as conjugate directions and quasi-Newton have also been implemented; see Hertz et al. (1991), Aleksander and Morton (1990).
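As a concrete illustration of the activation functions just listed, the following short sketch (in Python, which is not used in the original text; the specific weights and inputs are arbitrary illustrative choices) evaluates the sigmoid, hyperbolic tangent, Gaussian and hard-limiter nonlinearities at a unit's weighted input sum.

```python
import numpy as np

def sigmoid(x):
    # Logistic sigmoid: f(x) = 1 / (1 + e^(-x)), values in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def gaussian(x):
    # Gaussian-type activation: f(x) = exp(-x^2)
    return np.exp(-x ** 2)

def hard_limiter(x):
    # Threshold function: f(x) = sign(x) for x != 0
    return np.sign(x)

# A single unit: weighted sum of inputs passed through a nonlinearity.
weights = np.array([0.5, -1.2, 0.8])
inputs = np.array([1.0, 0.3, -0.7])
s = weights @ inputs  # weighted sum of the unit's inputs

for f in (sigmoid, np.tanh, gaussian, hard_limiter):
    print(f.__name__, f(s))
```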
Most often the individual neuron activation functions are sigmoidal functions, but signum or radial basis Gaussian functions are also used. Note
that in this work the emphasis is on multilayer neural networks. The reader should keep in mind that there are additional systems and control results involving recurrent networks, especially in system parameter identification; one should also mention the work in associative memories, which are useful in the higher levels of intelligent control systems. One property of multilayer neural networks central to most applications to control is that of function approximation. Such networks can generate input/output maps which can approximate any continuous function with any desired accuracy. One may have to use a large number of neurons, but any desired approximation of a continuous function can be accomplished with a multilayer network with only one hidden layer of neurons, or two layers of neurons and weights; if the function has discontinuities, a two hidden layer network may be necessary - see below, Section 2.2. To avoid large numbers of processing units and the corresponding prohibitively large training times, a smaller number of hidden layer neurons is often used and the generalization properties of the neural network are utilized. Note that the number of inputs and outputs in the neural network are determined by the nature of the data presented to the neural network and the type of output desired from the neural network, respectively. To model the input/output behavior of a dynamical system, the neural network is trained using input/output data and the weights of the neural network are adjusted most often using the backpropagation algorithm. The objective is to minimize the output error (sum of squares) between the neural network output and the output of the dynamical system (output data) for a specified set of input patterns. Because the typical application involves nonlinear systems, the neural network is trained for particular classes of inputs and initial conditions. The underlying assumption is that the nonlinear static map generated by the neural network can adequately represent the system's behavior in the ranges of interest for the particular application. There is of course the question of how accurately a neural network, which realizes a static map, can represent the input/output behavior of a dynamical system. For this to be possible one must provide to the neural network information about the history of the system - typically delayed inputs and outputs. How much history is needed depends on the desired accuracy. There is a tradeoff between accuracy and computational complexity of training, since the number of inputs used affects the number of weights in the neural network and subsequently the training time. One sometimes starts with as many delayed signals as the order of the system and then modifies the network accordingly; it also appears that using a two hidden layer network, instead of a one hidden layer network, has certain computational advantages. The number of neurons in the hidden layer(s) is typically chosen based on empirical criteria and one may iterate over a number of networks to determine a neural network that has a reasonable number of neurons and accomplishes the desired degree of approximation. When a multilayer neural network is trained as a controller, either an open or closed loop controller, most of the issues are similar to the above. The difference is that the desired output of the neural network, that is, the appropriate control input to the plant generated by the controller, is not readily available, but has to be derived from the known desired plant output.
For this, one may use the mathematical model of the plant if available, or some approximation based
on certain knowledge of the process to be controlled; or one may use a neural model of the dynamics of the plant or even of the dynamics of the inverse of the plant if such models have been derived. Neural networks may be combined to both identify and control the plant, thus implementing an adaptive controller. In the above, the desired outputs of the neural networks are either known or they can be derived or approximated. Then, supervised learning via the backpropagation algorithm can be used to train the neural networks. Typical control problems which can be solved in this way are problems where a desired output is known. Such is the case in designing a controller to track a desired trajectory; the error then to be minimized is the sum of the squares of the errors between the actual and desired points along the trajectory. There are control problems where no desired trajectory is known but the objective is to minimize, say, the control energy needed to reach some goal state(s). This is an example of a problem where minimization over time is required and the effect of present actions on future consequences must be used to solve it. Two promising approaches for this type of problem are either constructing a model of the process and then using some type of backpropagation through time procedure, or using an adaptive critic and utilizing methods of reinforcement learning. These are discussed below. Neural networks can also be used to detect and identify system failures, and to help store information for decision making, thus providing for example the knowledge to decide when to switch to a different controller among a finite number of controllers. In general there are potential applications of neural networks at all levels of hierarchical intelligent controllers that provide a higher degree of autonomy to systems. Neural networks are useful at the lowest Execution level, where the conventional control algorithms are implemented via hardware and software, through the Coordination level, to the highest Organization level, where decisions are being made based on possibly uncertain and/or incomplete information. One may point out that at the Execution level, the conventional control level, neural network properties such as the ability for function approximation and the potential for parallel implementation appear to be most relevant. In contrast, at higher levels, abilities such as pattern classification and the ability to store information in, say, an associative memory appear to be of most interest. When neural networks are used in the control of systems it is important that results and claims are based on firm analytical foundations. This is especially important when these control systems are to be used in areas where the cost of failure is very high, for example when human life is threatened, as in aircraft, nuclear plants etc. It is also true that without a good theoretical framework it is unlikely that the area will progress very far, as intuitive invention and tricks cannot be counted on to provide good solutions to controlling complex systems under a high degree of uncertainty. The analytical heritage of the control field was in fact pioneered by the use of a differential equation model by J.C. Maxwell to study certain stability problems in Watt's flyball governor in 1868, and this was a case where the theoretical study provided the necessary knowledge to go beyond what the era of Intuitive Invention in control could provide.
In a control system which contains neural networks it is in general hard to guarantee typical control systems properties such as stability. The main
reason is the mathematical difficulties associated with the study of nonlinear systems controlled by highly nonlinear neural network controllers - note that the control of linear systems is well understood and neural networks are typically used to control highly nonlinear systems. In view of the mathematical difficulties encountered in the past in the adaptive control of linear systems controlled by linear controllers, it is hardly surprising that the analytical study of nonlinear adaptive control using neural networks is a difficult problem indeed. Some progress has been made in this area and certain important theoretical results have begun to emerge, but clearly the overall area is still at its early stages of development. In Section 2, the different approaches used in the modeling of dynamical systems are discussed. The function approximation properties of multilayer neural networks are discussed at length, radial basis networks and the Cerebellar Model Articulation Controller (CMAC) are introduced, and the modeling of the inverse dynamics of the plant, used in certain control methods, is also discussed. In Section 3, the use of neural networks as controllers in problems which can be solved by supervised learning is discussed; such control problems for example would be following a given trajectory while minimizing some output error. In Section 4, control problems which involve minimization over time are of interest; an example would be minimizing the control energy to reach a goal state - there is no known desired trajectory in this case. Methods such as backpropagation through time and adaptive critic with reinforcement learning are briefly discussed. Section 5 discusses other uses of neural networks in the failure detection and identification area (FDI), and in higher level control. Sections 6 and 7 contain the concluding remarks and the references respectively. 2. MODELING OF DYNAMICAL SYSTEMS
2.1 Modeling the dynamics of the plant In this approach, the neural network is trained to model the plant's behavior, as in Fig. 2. The input to the neural network is the same input used by the plant. The desired output of the neural network is the plant's output.
Figure 2. Modeling the plant's dynamics.

The signal e = y - ŷ from the summation in Fig. 2 is the error between the plant's output y and the actual output ŷ of the neural network. The goal in training the neural network is to minimize this error. The method to accomplish this varies with the type of neural network used and the type of training algorithm chosen. In the figure, the use of the error to aid in the training of the neural network is denoted by the arrow passing through the neural network at an angle. Once the neural network has been successfully trained, it is actually an analytical model of the plant that can be further used to design a controller or to test various control techniques via simulation of this neural plant emulator. This type of approach is discussed in Section 3. In Fig. 2, the type of plant used is not restricted. The plant could be a very well behaved single-input single-output system, or it could be a nonlinear multi-input multi-output system with coupled equations. The actual plant or a digital computer simulation of the plant could be used. The plant may also operate in continuous or discrete time; although for training the neural network, discrete samples of the plant's inputs and outputs are often used. If the plant is time-varying, the neural network clearly needs to be updated online, and so the typical plant considered is time invariant or, if it is time-varying, it changes quite slowly. The type of information supplied to the neural network about the plant may vary. For instance, the current input, previous inputs, and previous outputs can be used as inputs to the neural network. This is illustrated in Fig. 3 for a plant operating in discrete time. The boxes with the "Δ" symbol indicate the time delay. The bold lines stress the fact that signals with varying amounts of delay can be used. The plant's states, derivatives of the plant's variables, or other measures can be used as the neural network's inputs. This type of configuration is conducive to training a neural network when the information available about the plant is in the form of an input-output table.
Figure 3. Modeling the discrete time plant's dynamics using delayed signals.

Training a neural network in this manner, by using input-output pairs, can be viewed as a form of pattern recognition, where the neural network is being trained to realize some (possibly unknown) relation between two sets. If a multi-layer neural network is used to model the plant via the configuration depicted in Fig. 3, a dynamic system identification can be performed with a
static model. The past history information needed to be able to model a dynamic system via a static model is provided by delayed input and output signals. If the back-propagation algorithm is used in conjunction with a multilayer neural network, considerations need to be made concerning which among the current and past values of the inputs and outputs to utilize in training the neural network; this is especially important when the identification is to be on line. In Narendra and Parthasarathy (1990) it is shown that when a series-parallel identification model is used (and the corresponding delayed signals), then the usual backpropagation algorithm can be employed to train the network; when however a parallel identification model is used, then a recurrent network results and some type of backpropagation through time, see Section 4, should be used. A moving window of width p time steps could be employed in which only the most recent values are used. An important question to be addressed here concerns the number of delays of previous inputs and outputs to be used as inputs to the neural network; most often the number of delays is taken to be equal to the order of the plant, at least initially. If there is some a priori knowledge of the plant's operation, this should be incorporated into the training. This knowledge can be embedded in a linear or nonlinear model of the plant, or incorporated via some other means; see Sartori and Antsaklis (1992a). A possible way of utilizing this information via a plant model is illustrated in Fig. 4; this can be viewed as modeling the unmodelled dynamics of the plant with a neural network.
Figure 4. Using a priori knowledge of the plant.

Modeling the plant's behavior via a multilayer sigmoidal neural network has been studied by a number of researchers; see among others Narendra and Parthasarathy (1990), Parthasarathy (1991), Bhat et al. (1990), Qin, Su and McAvoy (1992), Hou and Antsaklis (1992). In general, the results show that neural networks can be very good models of dynamical systems behavior. This
is of course true for stable plants, for certain ranges of inputs and initial conditions, and for time invariant or slowly varying systems.
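To make the delayed-signal arrangement of Fig. 3 concrete, the following sketch (in Python; the plant model, its order and the input sequence are illustrative assumptions, not taken from the text) builds training pairs from delayed plant inputs and measured outputs, in the spirit of a series-parallel identification model, so that a static network can be fitted to one-step-ahead plant behavior.

```python
import numpy as np

def plant(y_prev, y_prev2, u_prev):
    # Hypothetical second-order nonlinear discrete-time plant (illustration only).
    return 0.6 * y_prev - 0.1 * y_prev2 + np.tanh(u_prev)

# Simulate the plant on a training input sequence.
T = 200
u = np.random.uniform(-1.0, 1.0, T)
y = np.zeros(T)
for k in range(2, T):
    y[k] = plant(y[k - 1], y[k - 2], u[k - 1])

# Series-parallel regressors: delayed plant inputs and *measured* plant
# outputs feed the static network; the target is y[k]. The number of
# delays here matches the assumed plant order (two).
X = np.array([[y[k - 1], y[k - 2], u[k - 1]] for k in range(2, T)])
t = y[2:]

# Any static function approximator can now be trained on (X, t) to
# minimize the sum-of-squares one-step prediction error.
print(X.shape, t.shape)  # (198, 3) (198,)
```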
2.2 Function approximation

Neural networks are useful as models of dynamical systems because of their ability to be universal function approximators. In particular, it turns out that feedforward neural nets can approximate arbitrarily well any continuous function; this in fact can be accomplished using a feedforward network with a single hidden layer of neurons with a linear output unit. More specifically, consider the nonlinear map g: R^m → R where

g(u) = \sum_{i=1}^{n} w_{i1} S(w^i u)   (1)
with u ∈ R^m the input vector, w^i = [w_{i1}, ..., w_{im}] the input layer weights and w_{i1} the output layer (first layer) weights of unit i. S(·) is the unit activation function. Such maps are generated by networks with a single hidden layer and a linear output unit. Consider now S(x) to be a sigmoid function, defined as a function for which lim_{x→∞} S(x) = 1 and lim_{x→-∞} S(x) = 0. Hornik et al. (1989) have shown that when the activation sigmoid function S(x) is non-decreasing, the above net can approximate an arbitrary continuous function f(u) uniformly on compact sets; in fact it is shown that the set of functions g above is dense in the set of the continuous functions mapping elements of compact sets in R^m into the real line. Cybenko (1989) extended this result to continuous activation sigmoid functions S(x); Jones (1990) showed that the result is still valid when S(x) is bounded. Typical proofs of this important approximation result are based on the Stone-Weierstrass theorem and use approximations of the function via trigonometric functions, which in turn are approximated by sums of sigmoidal functions. Similar results have appeared in Funahashi (1989), among others. An important point is that the exact form of the activation function S(x) is not important in the proof of this result; however it may affect the number of neurons needed for a desired accuracy of approximation of a given function f. These results show that typically one increases accuracy by adding more hidden units; one then stops when the desired accuracy has been achieved. It should be noted that for a finite number of hidden units, depending on the function, significant errors may occur; this reminds us of the Gibbs phenomenon in Fourier Series. How many hidden units does one need in the hidden layer? This is a difficult question to answer and it has attracted much attention. Several authors have shown (with different degrees of ease and generality) that p-1 neurons in the hidden layer suffice to store p arbitrary patterns in the network; see Nilsson (1965), and Baum (1988), Sartori and Antsaklis (1991) for constructive proofs. The answer to the original question also depends of course on the kind of generalization achieved by the network; also note that certain sets of p patterns may be realizable by fewer than p-1 hidden neurons. The question of adequate approximation becomes more complicated in control applications where the functions which must be approximated by the neural network may be functions that generate the control signals, the range and the shape of which may not be known in advance. Because of this, the guidelines to select the appropriate number of hidden neurons are rather empirical at the moment.
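As an illustration of this one-hidden-layer approximation property, the sketch below (Python; the target function, network size and the simple least-squares fit of the output weights are illustrative choices, not taken from the text) fits a map of the form of equation (1) to a continuous function on a compact interval.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target continuous function on a compact set.
f = np.sin
u = np.linspace(-3.0, 3.0, 200)

# One hidden layer of n sigmoidal units, linear output unit (equation (1)).
n = 30
w_in = rng.normal(size=n)   # input weights w^i (scalar input here)
b = rng.normal(size=n)      # hidden thresholds
S = lambda x: 1.0 / (1.0 + np.exp(-x))  # non-decreasing sigmoid

# Hidden layer outputs for all sample points.
H = S(np.outer(u, w_in) + b)   # shape (200, n)

# Fit the linear output weights w_{i1} by least squares
# (for simplicity; backpropagation training would also do).
w_out, *_ = np.linalg.lstsq(H, f(u), rcond=None)

g = H @ w_out
print("max approximation error:", np.max(np.abs(g - f(u))))
```

Adding more hidden units (increasing n) typically reduces the achievable error, in line with the discussion above.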
The above discussion dealt with continuous functions f, which can be approximated by a neural network with one hidden layer of neurons. When the function under consideration has discontinuities, then two hidden layers may have to be used; Cybenko (1988), Chester (1990), Sontag (1990). In control considerations, Sontag has pointed out that one may need a two hidden layer neural network to stabilize certain plants. It is true of course that a two hidden layer network can approximate any continuous function as well. In addition, experimental evidence tends to show that using a two hidden layer network has advantages over a one hidden layer network, as it requires shorter training time and overall fewer weights. Because of this a two hidden layer network is many times the network of choice. There are many other issues relevant to function approximation and control applications, such as issues of network generalization, input representation and preprocessing, optimal network architectures, methods to generate networks, and methods of pruning and weight decay. These topics are of great interest to the area of neural networks at large, and a number of them are currently attracting significant research efforts. We are not directly addressing these topics here; the interested reader should consult the vast literature on the subject.
2.3 Radial basis networks

To approximate desired functions, networks involving activation functions other than sigmoids can be used. Consider again a feedforward neural network with one hidden layer and a linear output unit. Assume that the hidden neurons have radial basis functions as activation functions, in which case the neural network implements the nonlinear map g: R^m → R where

g(u) = \sum_{i=1}^{n} w_{i1} G(||u - c_i||)   (2)
with u ∈ R^m the input vector and w_{i1} the output layer (first layer) weights of unit i. G(·) is a radially symmetric activation function, typically the Gaussian function

G(||u - c_i||) = exp(-s_i ||u - c_i||^2)   (3)

where s_i = 1/σ_i^2. The vectors c_i, i = 1, ..., n are the centers of the Gaussian functions, and if for a particular value of the input u = c_i, then the ith unit gives an output of +1. The deviation σ_i controls the width of the Gaussian, and for ||u - c_i|| large, more than 3σ_i, the output of the neuron is negligible; in this way, practically only inputs in the locality of the center of the Gaussian contribute to the neuron output. It is known from Approximation Theory, see for example Poggio and Girosi (1990), that radially symmetric functions of this form can likewise approximate continuous functions arbitrarily well.

[A gap in the scanned source follows; the text resumes within the chapter "Computational Learning Theory for Artificial Neural Networks" by M. Anthony and N. Biggs, in its discussion of the growth function and VC dimension. In the notation there, for any λ > 0 the state λw defines the same function as the state w.]

Suppose that H is a hypothesis space defined on the example space X, and let x = (x_1, x_2, ..., x_m) be a sample of length m of examples from X. We define Π_H(x), the number of classifications of x by H, to be the number of distinct vectors of the form

(h(x_1), h(x_2), ..., h(x_m))
as h runs through all hypotheses of H. Although H may be infinite, H|_x, the hypothesis space obtained by restricting the hypotheses of H to domain E_x = {x_1, x_2, ..., x_m}, is finite and is of cardinality Π_H(x). Note that for any sample x of length m, Π_H(x) ≤ 2^m. An important quantity, and one which shall turn out to be crucial in applications to potential learnability, is the maximum possible number of classifications by H of a sample of a given length. We define the growth function Π_H by
Π_H(m) = max {Π_H(x) : x ∈ X^m}.

We have used the notation Π_H for both the number of classifications and the growth function, but this should cause no confusion. We noted above that the number of possible classifications by H of a sample of length m is at most 2^m, this being the number of binary vectors of length m. We say that a sample x of length m is shattered by H, or that H shatters x, if this maximum possible value is attained; that is, if H gives all possible classifications of x. Note that if the examples in x are not distinct then x cannot be shattered by any H. When the examples are distinct, x is shattered by H if and only if for any subset S of E_x, there is some hypothesis h in H such that for 1 ≤ i ≤ m, h(x_i) = 1 ⟺ x_i ∈ S. S is then the subset of E_x comprising the positive examples of h. Based on the intuitive notion that a hypothesis space H has high expressive power if it can achieve all possible classifications of a large set of examples, we use as a measure of this power the Vapnik-Chervonenkis dimension, or VC dimension, of H, defined as follows. The VC dimension of H is the maximum length of a sample shattered by H; if there is no such maximum, we say that the VC dimension of H is infinite. Using the notation introduced in the previous section, we can say that the VC dimension of H, denoted VCdim(H), is given by

VCdim(H) = max {m : Π_H(m) = 2^m},
where we take the maximum to be infinite if the set is unbounded.
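These definitions can be checked by brute force in small finite cases; the sketch below (Python; the hypothesis space of one-dimensional threshold functions and the sample points are illustrative choices, not from the text) computes Π_H(x) and tests shattering directly.

```python
# Hypothesis space H: one-dimensional threshold functions
# h_theta(x) = 1 iff x >= theta, sampled over a grid of thresholds.
thetas = [t / 10.0 for t in range(-50, 51)]
H = [lambda x, t=t: 1 if x >= t else 0 for t in thetas]

def classifications(H, xs):
    # Pi_H(x): distinct vectors (h(x1), ..., h(xm)) as h runs over H.
    return {tuple(h(x) for x in xs) for h in H}

def shattered(H, xs):
    # x is shattered iff all 2^m classifications are achieved.
    return len(classifications(H, xs)) == 2 ** len(xs)

print(shattered(H, [0.0]))        # True: a single point is shattered
print(shattered(H, [0.0, 1.0]))   # False: (1, 0) is never achieved
# Hence the VC dimension of threshold functions on R is 1,
# and the growth function is Pi_H(m) = m + 1:
print(len(classifications(H, [0.0, 1.0, 2.0])))  # 4
```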
A result which is often useful is that if H is a finite hypothesis space then H has VC dimension at most log |H|; this follows from the observation that if d is the VC dimension of H and x ∈ X^d is shattered by H, then |H| ≥ |H|_x| = 2^d. (Here, and throughout, log denotes logarithm to base 2.) Consider now the perceptron P_n with n inputs. The set of positive examples of the function h_w computed by the perceptron in state w = (α_1, α_2, ..., α_n, θ) is the closed half-space l_w^+ consisting of y ∈ R^n such that \sum_{i=1}^{n} α_i y_i ≥ θ. This is bounded by the hyperplane l_w with equation \sum_{i=1}^{n} α_i y_i = θ. Roughly speaking, l_w divides R^n into the set of positive examples of h_w and the set of negative examples of h_w.
We shall use the following result, known as Radon's Theorem, in which, for S ⊆ R^n, conv(S) denotes the convex hull of S. Let n be any positive integer, and let E be any set of n + 2 points in R^n. Then there is a non-empty subset S of E such that

conv(S) ∩ conv(E \ S) ≠ ∅.
A proof is given, for example, by Grunbaum (1967).

Theorem 4 For any positive integer n, let P_n be the real perceptron with n inputs. Then VCdim(P_n) = n + 1.
+
Proof Let x = (11 , x 2 , . . . , x n + 2 ) be any sample of length n 2. As we have noted, if two of the examples are equal then x cannot be shattered. Suppose then that the set Ex of examples in x consists of n 2 distinct points in R". By Radon's Theorem, there is a non-empty subset S of Ex such that conv(S) n conv(Ex \ S) # 0. Suppose that there is a hypothesis h, in P, such that S is the set of positive examples of h, in Ex. Then we have S C_ 12, E x \ S C R"
+
\Is.
Since open and closed half-spaces are convex subsets of R", we also have conv( S) C I s ,
conv( Ex \ S ) C R"
\ It.
Therefore con.( S) n conv(E x \ S) G 1: n R"
\ 1:
= 0,
which is a contradiction. We deduce that no such h , exists and therefore that x is not shattered by P,,. Thus no sample of length n 2 is shattered by P, and the VC dimension of P,, is at most n 1.
+
+
It remains to prove the reverse inequality. Let o denote the origin of R^n and, for 1 ≤ i ≤ n, let e_i be the point with a 1 in the ith coordinate and all other coordinates 0. Then P_n shatters the sample x = (o, e_1, e_2, ..., e_n) of length n + 1. To see this, suppose that S is a subset of E_x = {o, e_1, ..., e_n}. For i = 1, 2, ..., n, let α_i be 1 if e_i ∈ S and -1 otherwise, and let θ be -1/2 if o ∈ S, 1/2 otherwise. Then it is straightforward to verify that if w is the state w = (α_1, α_2, ..., α_n, θ) of P_n then the set of positive examples of h_w in E_x is precisely S. Therefore x is shattered by P_n and, consequently, VCdim(P_n) ≥ n + 1. □
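The construction in the second half of this proof is easy to verify computationally; the following sketch (Python, illustrative, with n = 4 as an arbitrary choice) runs over every subset S of {o, e_1, ..., e_n} and checks that the prescribed weights and threshold classify exactly S as positive.

```python
import numpy as np
from itertools import product

n = 4
points = [np.zeros(n)] + [np.eye(n)[i] for i in range(n)]  # o, e_1, ..., e_n

for membership in product([False, True], repeat=n + 1):
    S = {i for i, in_S in enumerate(membership) if in_S}  # indices: 0 is o, i is e_i
    # Weights from the proof: alpha_i = 1 if e_i in S else -1;
    # threshold theta = -1/2 if o in S else 1/2.
    alpha = np.array([1.0 if (i + 1) in S else -1.0 for i in range(n)])
    theta = -0.5 if 0 in S else 0.5
    positives = {i for i, y in enumerate(points) if alpha @ y >= theta}
    assert positives == S  # every subset is realized, so x is shattered

print("all", 2 ** (n + 1), "classifications realized: VCdim(P_n) >= n + 1")
```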
The growth function Π_H(m) of a hypothesis space of finite VC dimension is a measure of how many different classifications of an m-sample into positive and negative examples can be achieved by the hypotheses of H, while the VC dimension of H is the maximum value of m for which Π_H(m) = 2^m. Clearly these two quantities are related, because the VC dimension is defined in terms of the growth function. But there is another, less obvious, relationship: the growth function Π_H(m) can be bounded by a polynomial function of m, and the degree of the polynomial is the VC dimension d of H. Explicitly, we have the following theorem. The first inequality is due to Sauer (1972) and is usually known as Sauer's Lemma. The second inequality is elementary - a proof was given by Blumer et al. (1989).

Theorem 5 (Sauer's Lemma) Let d ≥ 1 and m ≥ 1 be given integers and let H be a hypothesis space with VCdim(H) = d. Then for m ≥ d,

Π_H(m) ≤ \sum_{i=0}^{d} \binom{m}{i} ≤ (em/d)^d,

where e is the base of natural logarithms. □
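The two bounds in Sauer's Lemma are easy to tabulate; the short sketch below (Python, illustrative, with d = 3 as an arbitrary choice) compares the binomial sum with (em/d)^d and with the trivial bound 2^m.

```python
from math import comb, e

def sauer_bound(m, d):
    # First bound of Sauer's Lemma: sum of binomial coefficients.
    return sum(comb(m, i) for i in range(d + 1))

d = 3
for m in (3, 5, 10, 20, 40):
    poly = (e * m / d) ** d  # second, elementary bound (valid for m >= d)
    print(m, sauer_bound(m, d), round(poly, 1), 2 ** m)
# For m >= d the growth function is polynomial of degree d in m,
# eventually far below the exponential 2^m.
```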
We have motivated our discussion of VC dimension by describing it as a measure of the expressive power of a hypothesis space. We shall see that it turns out to be a key parameter for quantifying the difficulty of pac learning. Our first result along these lines is that finite VC dimension is necessary for potential learnability.

Theorem 6 If a hypothesis space has infinite VC dimension then it is not potentially learnable.

Proof Suppose that H has infinite VC dimension, so that for any positive integer m there is a sample z of length 2m which is shattered by H. Let E = E_z be the set of examples in this sample and define a probability distribution μ on X by μ(x) = 1/2m if x ∈ E and μ(x) = 0 otherwise. In other words, μ is uniform on E and zero elsewhere. We observe that μ^m is uniform on E^m and zero elsewhere. Thus, with probability one, a randomly chosen sample x of length m is a sample of examples from E. Let s = (x, t(x)) ∈ S(m, t) be a training sample of length m for a target concept t ∈ H. With probability 1 (with respect to μ^m), we have x_i ∈ E for 1 ≤ i ≤ m. Since z is shattered by H, there is a hypothesis h ∈ H such that h(x_i) = t(x_i) for each x_i (1 ≤ i ≤ m), and h(x) ≠ t(x) for all other x in E. It follows that h is consistent with s, whereas h has error at least 1/2 with respect to t. We have shown that for any positive integer m, and any target concept t, there is a probability distribution μ on X such that the set

{s : for all h ∈ H, er_s(h) = 0 ⟹ er_μ(h) < 1/2}

has probability zero. Thus, H is not potentially learnable. □
The converse of the preceding theorem is also true: finite VC dimension is sufficient for potential learnability. This result can be traced back to the statistical researches of Vapnik and Chervonenkis (1971) (see also Vapnik (1982) and Vapnik and Chervonenkis (1981)). The work of Blumer et al. (1989) showed that it is one of the key results in Computational Learning Theory. We now give some indication of its proof. Suppose that the hypothesis space H is defined on the example space X, and let t be any target concept in H, μ any probability distribution on X and ε any real number with 0 < ε < 1. The objects t, μ, ε are to be thought of as fixed, but arbitrary, in what follows. The probability of choosing a training sample for which there is a consistent, but ε-bad, hypothesis is

μ^m {s ∈ S(m, t) : there is h ∈ H such that er_s(h) = 0, er_μ(h) ≥ ε}.

Thus, in order to show that H is potentially learnable, it suffices to find an upper bound f(m, ε) for this probability which is independent of both t and μ and which tends to 0 as m tends to infinity. The following result, of the form just described, is due to Blumer et al. (1989), and generalises a result of Haussler and Welzl (1987). Better bounds have subsequently been obtained by Anthony, Biggs and Shawe-Taylor (1990) (see also Shawe-Taylor, Anthony and Biggs (1993)), but the result presented here suffices for the present discussion.
Theorem 7 Suppose that H is a hypothesis space defined on an example space X, and that t, μ, and ε are arbitrary, but fixed. Then

μ^m {s ∈ S(m, t) : there is h ∈ H such that er_s(h) = 0, er_μ(h) ≥ ε} < 2 Π_H(2m) 2^{-εm/2}

for all positive integers m ≥ 8/ε. □
The right-hand side is the bound f(m, ε) as postulated above. If H has finite VC dimension then, by Sauer's Lemma, Π_H(2m) is bounded by a polynomial function of m, and therefore f(m, ε) is eventually dominated by the negative exponential term. Thus the right-hand side, which is independent of t and μ, tends to 0 as m tends to infinity and, by the above discussion, this establishes potential learnability for spaces of finite VC dimension.
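To see the bound of Theorem 7 at work numerically, the sketch below (Python, illustrative; the values of d, ε and δ are arbitrary choices) replaces Π_H(2m) by the Sauer bound (2em/d)^d and searches for a sample length at which f(m, ε) falls below a target confidence δ.

```python
from math import e

def f_bound(m, d, eps):
    # Theorem 7 bound with Pi_H(2m) replaced by Sauer's (e*2m/d)^d,
    # valid once 2m >= d.
    return 2 * (2 * e * m / d) ** d * 2 ** (-eps * m / 2)

d, eps, delta = 10, 0.1, 0.01
m = max(int(8 / eps), d)  # Theorem 7 requires m >= 8/eps
while f_bound(m, d, eps) >= delta:
    m += 1
print("sample length at which the bound drops below delta:", m)
# The polynomial factor is eventually crushed by the 2^(-eps*m/2) term.
```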
At this point it is helpful to introduce a new piece of terminology. Suppose that real numbers δ and ε with 0 < δ, ε < 1 are given. The sample complexity m_L(T, δ, ε) of a learning algorithm L for a concept space T is the least sample length for which the output of L is probably approximately correct: the probability that L(s) has error less than ε is greater than 1 - δ whenever m ≥ m_L(T, δ, ε); in other words, a sample of length m_L(T, δ, ε) is sufficient to ensure that the output hypothesis L(s) is pac, with the given values of δ and ε. In practice we often omit T when this is clear and we usually deal with a convenient upper bound m_0 ≥ m_L, rather than m_L itself; thus m_0(δ, ε) will denote any value sufficient to ensure that the pac conclusion, as stated above, holds for all m ≥ m_0. The following result follows from Theorem 7.
Theorem 8 There is a constant K such that if hypothesis space H has VC dimension d ≥ 1 and the concept space C is a subset of H, then any consistent learning algorithm L for (C, H) is pac, with sample complexity

m_0(δ, ε) = (K/ε) (d log(1/ε) + log(1/δ))

for 0 < δ, ε < 1.

[A gap in the scanned source follows; the text resumes in the discussion of the VC dimension of feedforward linear threshold networks, where the computation nodes of a network (N, A) are numbered 1, 2, ..., z in such a way that every arc joins a node to a higher-numbered node.] (This may be done by numbering first those computation nodes which are linked only to input nodes, then those which are linked only to input nodes and already-numbered computation nodes, and so on.) For each state w ∈ Ω, corresponding to an assignment of weights and thresholds to all the arcs and computation nodes, we let w^l denote the part of w determined by the thresholds on computation nodes 1, 2, ..., l and the weights on arcs which terminate at
those nodes. Then for 2 ≤ l ≤ z we have the decomposition w^l = (w^{l-1}, ξ_l), where ξ_l stands for the weights on arcs terminating at l and the threshold at l. In isolation, the output of a computation node l is a linear threshold function, determined by ξ_l, of the
outputs of all those nodes j for which (j, l) is an arc; some of these may be input nodes and some may be computation nodes with j < l. We denote the space of such functions by H_l and the growth function of this 'local hypothesis space' by Π_l. Suppose that x = (x_1, x_2, ..., x_m) is a sample of inputs to the network. (Each example x_i is a |J|-vector of real numbers, where J is the set of input nodes.) For any computation node l (1 ≤ l ≤ z), we shall say that states w_1, w_2 of the network are l-distinguishable by x if the following holds. There is an example in x such that, when this example is input, the output of at least one of the computation nodes 1, 2, ..., l is different when the state is w_1 from its output when the state is w_2. In other words, if one has access to the signals transmitted by nodes 1 to l only, then, using the sample x, one can differentiate between the two states. We shall denote by S_l(x) the number of different states which are mutually l-distinguishable by x.
Lemma 10 With the notation defined as above, for any sample x of length m and for 1 ≤ l ≤ z we have

S_l(x) ≤ Π_1(m) Π_2(m) ⋯ Π_l(m).
Proof We prove the claim by induction on l. For l = 1 we have S_1(x) ≤ Π_1(x), because two states are 1-distinguishable if and only if they give different classifications of the training sample at node 1. Thus S_1(x) ≤ Π_1(m). Assume, inductively, that the claim holds for l = k - 1, where 2 ≤ k ≤ z. The decomposition w^k = (w^{k-1}, ξ_k) shows that if two states are k-distinguishable but not (k - 1)-distinguishable, then they must be distinguished by the action of the node k. For each of the S_{k-1}(x) (k - 1)-distinguishable states there are thus at most Π_k(m) k-distinguishable states. Hence S_k(x) ≤ S_{k-1}(x) Π_k(m). By the inductive assumption, the right-hand side is at most Π_1(m) Π_2(m) ⋯ Π_k(m). The result follows. □

If H is the hypothesis space of N then Π_H(x) is the number of states which are mutually distinguishable by x. Thus,

Π_H(m) ≤ Π_1(m) Π_2(m) ⋯ Π_z(m),
for any positive integer m. The next result follows from this observation and the previous result.

Corollary 11 Let (N, A) be a feedforward linear threshold network with z computation nodes, and let W = |N \ J| + |A| be the total number of variable weights and thresholds. Let H be the hypothesis space of the network. Then for m > W, we have

Π_H(m) ≤ (zem/W)^W.

Proof Certainly, W ≥ d(i) + 1 for 1 ≤ i ≤ z, where d(i) denotes the number of arcs terminating at node i, and so, for each such i and for m > W, Π_i(m) ≤ (em/(d(i) + 1))^{d(i)+1}, by Sauer's Lemma and since the VC dimension of H_i is d(i) + 1. It follows that

Π_H(m) ≤ \prod_{i=1}^{z} (em/(d(i) + 1))^{d(i)+1}.
From this one can obtain the desired result. We omit the details here; these may be found in Baum and Haussler (1989) or Anthony and Biggs (1992). □

Theorem 12 The VC dimension of a feedforward linear threshold network with z computation nodes and a total of W variable weights and thresholds is at most 2W log(ez).
Proof Let H be the hypothesis space of the network. By the above result, we have, for m ≥ W, Π_H(m) ≤ (zem/W)^W, where W is the total number of weights and thresholds. Now a short calculation, whose details we omit, shows that (zem/W)^W < 2^m when m = 2W log(ez), for any z ≥ 1. Therefore, Π_H(m) < 2^m when m = 2W log(ez), and the VC dimension of H is at most 2W log(ez), as claimed. □

Notice that the upper bound on the VC dimension depends only on the 'size' of the network; that is, on the number of computation nodes and the number of arcs. That it is independent of the structure of the network - the underlying directed graph - suggests that it may not be a very tight bound. In their paper, Baum and Haussler (1989) showed that certain simple networks have VC dimension at least a constant multiple of the number of weights. More recently, Bartlett (1992) obtained similar results for wider classes of networks. However, in a result which shows that the upper bound is essentially the best that can be obtained, Maass (1992) has shown that there is a constant c such that for infinitely many values of W, some feedforward linear threshold network with W weights has VC dimension at least cW log W. (The networks for which Maass showed this to be true have 4 layers.) If, as in Baum and Haussler (1989), we substitute the bound of Corollary 11 directly into the result of Theorem 7 then we can derive a better upper bound on sample complexity than would result from substituting the VC dimension bound into Theorem 8. Indeed, the former method gives a bound involving a log(z/ε) term, while the latter yields a bound depending on log z log(1/ε). With this observation and the previous results, we have the following result on sufficient sample size.

Theorem 13 Let (N, A) be a feedforward linear threshold network having z computation nodes and W variable weights and thresholds. Then for all 0 < δ, ε < 1, a sample length of order (1/ε)(W log(z/ε) + log(1/δ)) suffices for any consistent learning algorithm for the hypothesis space of the network to be pac.

[A gap in the scanned source follows. The passage resumes with a corresponding lower bound:] there is a constant c > 0 such that for infinitely many W, there is a network with W weights for which the sufficient sample size must grow at least in proportion to W log W.
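As a quick illustration of Theorem 12, the sketch below (Python; the example architectures are arbitrary choices, not from the text) counts the weights and thresholds of fully connected layered threshold networks and evaluates the 2W log2(ez) bound on the VC dimension.

```python
from math import e, log2

def vc_upper_bound(layers):
    # layers = (inputs, hidden..., outputs) for a fully connected
    # feedforward linear threshold network.
    z = sum(layers[1:])                       # computation nodes
    arcs = sum(a * b for a, b in zip(layers, layers[1:]))
    W = arcs + z                              # weights plus thresholds
    return W, z, 2 * W * log2(e * z)          # Theorem 12 bound

for layers in [(10, 5, 1), (30, 20, 10, 1)]:
    W, z, bound = vc_upper_bound(layers)
    print(layers, "W =", W, "z =", z, "VCdim <=", round(bound, 1))
```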
6. THE COMPUTATIONAL COMPLEXITY OF LEARNING
Thus far, a learning algorithm has been defined as a function mapping training samples into hypotheses. We shall now be more specific about the algorithmics. If pac learning by a learning algorithm is to be of practical value, it must, first, be possible to implement the learning algorithm on a computer; that is, it must be computable and therefore, in a real sense, an algorithm, not just a function. Further, it should be possible to implement the algorithm 'quickly'. The subject known as Complexity Theory deals with the relationship between the size of the input to an algorithm and the time required for the algorithm to produce its output for an input of that size. In particular, it is concerned with the question of when this relationship is such that the algorithm can be described as 'efficient'. Here, we shall describe the basic ideas in a very simplistic way. More details may be found in the books by Garey and Johnson (1979), Wilf (1986), and Cormen, Leiserson and Rivest (1990). The size of an input to an algorithm will be denoted by s. For example, if an algorithm has a binary encoding as input, the size of an input could be the number of bits it contains. Equally, if the input is a real vector, one could define the size to be the dimension of the vector. Let A be an algorithm which accepts inputs of varying size s. We say that the running time of A is O(f(s)) if there is some constant K such that, for any input of size s, the number of operations required to produce the output of A is at most K f(s). Note that this definition is 'device-independent' because the running time depends only on the number of operations carried out, and not on the actual speed with which such an operation can be performed. Furthermore, the running time is a worst-case measure; we consider the maximum possible number of operations taken over all inputs of a given size.
There are good reasons for saying that an algorithm with running time O(s^r), for some fixed integer r ≥ 1, is 'efficient'. Such an algorithm is said to be a polynomial time
algorithm, and problems which can be solved by a polynomial time algorithm are usually regarded as 'easy'. Thus, to show that a problem is easy, we should present a polynomial time algorithm for it. On the other hand, if we wish to show that a given problem is 'hard', it is enough to show that if this problem could be solved in polynomial time then so too could another problem which is believed to be hard. One standard problem which is believed to be hard is the graph k-colouring problem for k ≥ 3. Let G be a graph with vertex-set V and edge-set E, so that E is a subset of the set of 2-element subsets of V. A k-colouring of G is a function χ: V → {1, 2, ..., k} with the property that, whenever ij ∈ E, then χ(i) ≠ χ(j). The graph k-colouring problem may formally be stated as:

GRAPH k-COLOURING
Instance A graph G = (V, E).
Question Is there a k-colouring of G?

When we say that GRAPH k-COLOURING is 'believed to be hard', we mean that it belongs to a class of problems known as the NP-complete problems. This class of problems is very extensive, and contains many famous problems in Discrete Mathematics. Although it has not yet been proved, it is conjectured, and widely believed, that there is no polynomial time algorithm for any of the NP-complete problems. This is known as the 'P ≠ NP conjecture'. We shall apply these ideas in the following way. Suppose that Π is a problem in which we are interested, and Π_0 is a problem which is known to be NP-complete. Suppose also that we can demonstrate that if there is a polynomial time algorithm for Π then there is one for Π_0. In that case our problem Π is said to be NP-hard. If the P ≠ NP conjecture is true, then proving that a problem Π is NP-hard establishes that there is no polynomial time algorithm for Π. We now wish to quantify the behaviour of learning algorithms with respect to n, and it is convenient to make the following definitions. We say that a union of hypothesis spaces H = ∪ H_n is graded by example size n, when H_n denotes the space of hypotheses defined on examples of size n. For example, H_n may be the space P_n of the perceptron, defined on real vectors of length n. By a learning algorithm for H = ∪ H_n, we mean a function L from the set of training samples for hypotheses in H to the space H, such that when s is a training sample for h ∈ H_n, it follows that L(s) ∈ H_n. That is, we insist that L preserves the grading. (Analogously, one may define, more generally, a learning algorithm for (C, H) when each of C and H are graded.) An example of a learning algorithm defined on the graded perceptron space P = ∪ P_n is the perceptron learning algorithm of Rosenblatt (1959). (See also Minsky and Papert (1969).) Observe that this algorithm acts in essentially the same manner on each P_n; the 'rule' is the same for each n. Consider a learning algorithm L for a hypothesis space H = ∪ H_n, graded by example size. An input to L is a training sample, which consists of m examples of size n together with the m single-bit labels. The total size of the input is therefore m(n + 1), and it would be possible to use this single number as the measure of input size. However, there is some advantage in keeping track of m and n separately, and so we shall use the notation R_L(m, n) to denote the worst-case running time of L on a training sample of m examples of size n.
A learning algorithm L for ∪ H_n is said to be a pac learning algorithm if L acts as a pac learning algorithm for each H_n. The sample complexity provides the link between the running time R_L(m, n) of a learning algorithm (that is, the number of operations required to produce its output on a sample of length m when the examples have size n) and its running time as a pac learning algorithm (that is, the number of operations required to produce an output which is probably approximately correct with given parameters). Since a sample of length m_0(H_n, δ, ε) is sufficient for the pac property, the number of operations required is at most R_L(m_0(H_n, δ, ε), n).
Until now, we have regarded the accuracy parameter E as fixed but arbitrary. It is clear that decreasing this parameter makes the learning task more difficult, and therefore the running time of an efficient pac learning algorithm should be constrained in some appropriate way as e-l increases. We say that a learning algorithm L for H = H,, is eficient with respect to accuracy and example size if its running time is polynomial in m and the sample complexity m L ( H , , 6 , e ) depends polynomially on n and e - l .
We are now ready to consider the implications for learning of the theory of NP-hard problems. Let H = ∪H_n be a hypothesis space of functions, graded by the example size n. The consistency problem for H may be stated as follows.
H-CONSISTENCY
Instance A training sample s of labelled examples of size n.
Question Is there a hypothesis in H_n consistent with s?

In practice, we wish to produce a consistent hypothesis, rather than simply know whether or not one exists. In other words, we have to solve a 'search' problem, rather than an 'existence' problem. But these problems are directly related. Suppose that we consider only those s with length bounded by some polynomial in n. Then, if we can find a consistent hypothesis in time polynomial in n, we can answer the existence question by the following procedure. Run the search algorithm for the time (polynomial in n) in which it is guaranteed to find a consistent hypothesis if there is one; then check the output hypothesis explicitly against the examples in s to determine whether or not it is consistent. This checking can be done in time polynomial in n also. Thus if we can show that a restricted form of the existence problem is NP-hard, this means that there is no polynomial time algorithm for the corresponding search problem (unless P = NP). If there is a consistent learning algorithm L for a graded hypothesis space H = ∪H_n such that VCdim(H_n) is polynomial in n and the algorithm runs in time polynomial in the sample length m, then the results presented earlier show that L pac learns H_n with running time polynomial in n and ε⁻¹, and so is efficient with respect to accuracy and example size. Roughly speaking, we may say that an efficient 'consistent-hypothesis-
finder' is an efficient 'pac learner'. It is natural to ask to what extent the converse is true. It turns out that efficient pac learning does imply efficient consistent-hypothesis-finding, provided we are prepared to accept a randomised algorithm. A full account of the meaning of this term may be found in the book of Cormen, Leiserson and Rivest (1990), but for our purposes the idea can be explained in a few paragraphs. We suppose that there is available some form of random number generator which, given any integer I ≥ 2, produces a stream of integers i in the range 1 ≤ i ≤ I, each particular value being equally likely. This could be done electronically, or by tossing an I-sided die. A randomised algorithm A is allowed to use these random numbers as part of its input. The computation carried out by the algorithm is determined by its input, so that it depends on the particular sequence produced by the random number generator. It follows that we can speak of the probability that A has a given outcome, by which is meant the proportion of sequences which produce that outcome. We say that a randomised algorithm A 'solves' a search problem Π if it behaves in the following way. The algorithm always halts and produces an output. If A has failed to find a solution to Π then the output is simply no. But, with probability at least 1/2 (in the sense explained above), A succeeds in finding a solution to Π and its output is this solution. The practical usefulness of a randomised algorithm stems from the fact that repeating the algorithm several times dramatically increases the likelihood of success. If the algorithm fails at the first attempt, which happens with probability at most 1/2, then we simply try again. The probability that it fails twice in succession is at most 1/4. Similarly, the probability that it fails in k attempts is at most (1/2)^k, which approaches zero very rapidly with increasing k. Thus in practice a randomised algorithm is almost as good as an ordinary one - provided of course that it has polynomial running time. We have the following theorem of Pitt and Valiant (1988) (see also Natarajan (1989) and Haussler et al. (1988)).
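Before stating the theorem, the amplification argument just described can be captured in a few lines of Python; the randomised solver below is a hypothetical callable which, by convention, returns None when its output is 'no'.

def amplify(randomised_solver, instance, trials):
    """Run a randomised search algorithm up to `trials` times.
    If each run independently succeeds with probability >= 1/2,
    the probability that all runs fail is at most (1/2)**trials."""
    for _ in range(trials):
        result = randomised_solver(instance)
        if result is not None:     # a solution was found
            return result
    return None                    # every trial answered 'no'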
Theorem 14 Let H = ∪H_n be a hypothesis space and suppose that there is a pac learning algorithm for H which is efficient with respect to accuracy and example size. Then there is a randomised algorithm which solves the problem of finding a hypothesis in H_n consistent with a given training sample of a hypothesis in H_n, and which has running time polynomial in n and m (the length of the training sample).

Proof Suppose that s* is a training sample for a target hypothesis t ∈ H_n, and that s* contains m* distinct labelled examples. We shall show that it is possible to find a hypothesis consistent with s* by running the given pac learning algorithm L on a related training sample. Define a probability distribution μ on the example space X by μ(x) = 1/m* if x occurs in s* and μ(x) = 0 otherwise. We can use a random number generator with output values i in the range 1 to m* to select an example from X according to this distribution: simply regard each random number as the label of one of the m* equiprobable examples. Thus the selection of a training sample of length m for t, according to the probability distribution μ, can be simulated by generating a
sequence of m random numbers in the required range. Let L be a pac learning algorithm as postulated in the statement of the Theorem. Then, when δ, ε are given, we can find an integer m₀(n, δ, ε) for which the probability (with respect to training samples s ∈ S(m₀, t)) that the error of L(s) is less than ε is greater than 1 − δ. Suppose we specify the confidence and accuracy parameters to be δ = 1/2 and ε = 1/m*. Then if we run the given algorithm L on a training sample of length m₀(n, 1/2, 1/m*), drawn randomly according to the distribution μ, the pac property of L ensures that the probability that the error of the output is less than 1/m* is greater than 1 − 1/2 = 1/2. Since there are no examples with probability strictly between 0 and 1/m*, this implies that the probability that the output agrees exactly with the training sample is greater than 1/2. The procedure described in the previous paragraph is the basis for a randomised algorithm L* for finding a hypothesis which agrees with the given training sample s*. In summary, L* consists of the following steps.
• Evaluate m₀ = m₀(n, 1/2, 1/m*).
• Construct, as described, a sample s of length m₀, according to μ.
• Run the given pac learning algorithm L on s.
• Check L(s) explicitly to determine whether or not it agrees with s*.
• If L(s) does not agree with s*, output no. If it does, output L(s).
As we noted, the pac property of L ensures that L* succeeds with probability greater than 1/2. Finally, it is clear that, since the running time of L is polynomial in m and its sample complexity m₀(n, 1/2, 1/m*) is polynomial in n and m* = 1/ε, the running time of L* is polynomial in n and m*. □
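The proof just given is constructive, and one trial of L* translates directly into code. In this sketch the pac algorithm L, its sample-size function m0 and the hypotheses it returns are hypothetical callables; random.choice simulates drawing from the uniform distribution μ on the m* examples of s*.

import random

def consistent_via_pac(L, m0, s_star, n):
    """One trial of the randomised consistent-hypothesis finder L*.
    s_star: list of m* distinct labelled examples (x, b).
    L: pac learning algorithm, mapping a training sample to a hypothesis.
    m0: sample-size function m0(n, delta, epsilon)."""
    m_star = len(s_star)
    m = m0(n, 0.5, 1.0 / m_star)        # delta = 1/2, epsilon = 1/m*
    s = [random.choice(s_star) for _ in range(m)]  # simulate drawing from mu
    h = L(s)
    # Check the output explicitly against s*.
    if all(h(x) == b for (x, b) in s_star):
        return h
    return None                          # output 'no'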
7. HARDNESS RESULTS FOR NEURAL NETWORKS

The fact that computational complexity-theoretic hardness results hold for neural networks was first shown by Judd (1988). In this section we shall prove a simple hardness result along the lines of one due to Blum and Rivest (1988).
The machine has n input nodes and k + 1 computation nodes (k ≥ 1). The first k computation nodes are 'in parallel' and each of them is connected to all the input nodes. The last computation node is the output node; it is connected by arcs with fixed weight 1 to the other computation nodes, and it has fixed threshold k. The effect of this arrangement is that the output node acts as a multiple AND gate for the outputs of the other computation nodes. We shall refer to this machine (or its hypothesis space) as P_n^k.
A state ω of P_n^k is described by the thresholds θ_l (1 ≤ l ≤ k) of the first k computation nodes and the weights w(i, l) on the arcs (i, l) linking the input nodes to the computation nodes. We shall use the notation α^(l) for the n-vector of weights on the arcs terminating at l, so that α_i^(l) = w(i, l). The set Ω of such states provides a representation Ω → P_n^k
in the usual way. We shall prove that the consistency problem for P^k = ∪P_n^k is NP-hard (provided k ≥ 3), by reducing GRAPH k-COLOURING to it. Let G be a graph with vertex-set V = {1, 2, ..., n} and edge-set E. We construct a training sample s(G) as follows. For each vertex i ∈ V we take as a negative example the vector v_i, which has 1 in the ith coordinate position, and 0's elsewhere. For each edge ij ∈ E we take as a positive example the vector v_i + v_j. We also take the zero vector o = 00...0 to be a positive example.
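The construction of s(G) is mechanical, and a short sketch makes the reduction concrete; vectors are represented as tuples, and the function name is ours.

def training_sample(n, edges):
    """Construct s(G) for a graph on vertices 1..n with the given edge set.
    Returns a list of (vector, label) pairs; label 1 marks a positive example."""
    def unit(i):
        return tuple(1 if p == i else 0 for p in range(1, n + 1))
    s = [(tuple(0 for _ in range(n)), 1)]           # zero vector: positive
    s += [(unit(i), 0) for i in range(1, n + 1)]    # each v_i: negative
    s += [(tuple(a + b for a, b in zip(unit(i), unit(j))), 1)
          for (i, j) in edges]                      # each v_i + v_j: positive
    return s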
Theorem 15 There is a function in P_n^k which is consistent with s(G) if and only if the graph G is k-colourable.

Proof Suppose h ∈ P_n^k is consistent with the training sample. By the construction of the network, h is a conjunction h = h₁ ∧ h₂ ∧ ... ∧ h_k of linear threshold functions. (That is, h(x) = 1 if and only if h_i(x) = 1 for all i between 1 and k.) Specifically, there are weight-vectors α^(1), α^(2), ..., α^(k) and thresholds θ₁, θ₂, ..., θ_k such that

h_l(y) = 1 ⟺ ⟨α^(l), y⟩ ≥ θ_l    (1 ≤ l ≤ k).
Note that, since o is a positive example, we have 0 = ⟨α^(l), o⟩ ≥ θ_l for each l between 1 and k. For each vertex i, h(v_i) = 0, and so there is at least one function h_f (1 ≤ f ≤ k) for which h_f(v_i) = 0. Thus we may define χ : V → {1, 2, ..., k} by

χ(i) = min{f | h_f(v_i) = 0}.

It remains to prove that χ is a colouring of G. Suppose that χ(i) = χ(j) = f, so that h_f(v_i) = h_f(v_j) = 0. In other words,

⟨α^(f), v_i⟩ < θ_f,  ⟨α^(f), v_j⟩ < θ_f.

Then, recalling that θ_f ≤ 0, we have

⟨α^(f), v_i + v_j⟩ < θ_f + θ_f ≤ θ_f.

It follows that h_f(v_i + v_j) = 0 and h(v_i + v_j) = 0. Now if ij were an edge of G, then we should have h(v_i + v_j) = 1, because we assumed that h is consistent with the training sample. Thus ij is not an edge of G, and χ is a colouring, as claimed.
Conversely, suppose we are given a colouring χ : V → {1, 2, ..., k}. For 1 ≤ l ≤ k define the weight-vector α^(l) as follows: α_i^(l) = −1 if χ(i) = l and α_i^(l) = 1 otherwise. Define the threshold θ_l to be −1/2. Let h₁, h₂, ..., h_k be the corresponding linear threshold functions, and let h be their conjunction. We claim that h is consistent with s(G). Since 0 ≥ θ_l = −1/2 it follows that h_l(o) = 1 for each l, and so h(o) = 1. In order to evaluate h(v_i), note that if χ(i) = f then

⟨α^(f), v_i⟩ = α_i^(f) = −1 < −1/2,

so h_f(v_i) = 0 and h(v_i) = 0, as required. Finally, for any colour l and edge ij we know that at least one of χ(i) and χ(j) is not l. Hence

⟨α^(l), v_i + v_j⟩ = α_i^(l) + α_j^(l),

where either both of the terms on the right-hand side are 1, or one is 1 and the other is −1. In any case the sum exceeds the threshold −1/2, and h_l(v_i + v_j) = 1. Thus h(v_i + v_j) = 1. □
The proof that the decision problem for consistency in P^k is NP-hard for k ≥ 3 follows directly from this result. If we are given an instance G of GRAPH k-COLOURING, we can construct the training sample s(G) in polynomial time. If the consistency problem could be solved by a polynomial time algorithm A, then we could answer GRAPH k-COLOURING in polynomial time by the following procedure: given G, construct s(G), and run A on this sample. The above result tells us that the answer given by A is the same as the answer to the original question. But GRAPH k-COLOURING is known to be NP-complete, and hence it follows that the P^k-CONSISTENCY problem is NP-hard if k ≥ 3. (In fact, the same is true if k = 2. This follows from work of Blum and Rivest (1988).) Thus, fixing k, we have a very simple family of feedforward linear threshold networks, each consisting of k + 1 computation nodes (one of which is 'hard-wired' and acts simply as an AND gate) for which the problem of 'loading' a training sample is computationally intractable.

Theorem 14 enables us to move from this hardness result for the consistency problem to a hardness result for pac learning. The theorem tells us that if we could pac learn P_n^k with running time polynomial in ε⁻¹ and n then we could find a consistent hypothesis, using a randomised algorithm with running time polynomial in m and n. In the language of Complexity Theory this would mean that the latter problem is in RP, the class of problems which can be solved in 'randomised polynomial time'. It is thought that RP does not contain any NP-hard problems - this is the 'RP ≠ NP' conjecture, which is considered to be as reasonable as the 'P ≠ NP' conjecture. Accepting this, it follows that there is no polynomial time pac learning algorithm for the graded space P^k = ∪P_n^k when k ≥ 2.
This may be regarded as a rather pessimistic note, but it should be emphasised that the 'non-learnability' result discussed above is a worst-case result and indicates that training feedforward linear threshold networks is hard in general. This does not mean that a particular learning problem cannot be solved in practice.

8. EXTENSIONS AND GENERALISATIONS
The basic pac model is useful, but it has clear limitations. A number of extensions to the basic model have been made in the last few years. In this section, we briefly describe some of these. It is not possible to give all the details here; the reader is referred to the references cited for more information.
8.1 Stochastic concepts
The results presented so far have nothing to say if there is some form of 'noise' present during the learning procedure. Further, the basic model applies only to the learning of functions: each example is either a positive example or a negative example of the given target concept, not both. But one can envisage situations in which the 'teacher' has difficulty classifying some examples, so that the labelled examples presented to the 'learner' are not labelled by a function, the same example being on occasion presented by the 'teacher' as a positive example and on other occasions (possibly within the same training sample) as a negative example. For example, in the context of machine vision, if the concept is a geometrical figure then points close to the boundary of the figure may be difficult for the teacher to classify, sometimes being classified as positive and sometimes as negative. Alternatively, the problem may not lie with the teacher, but with the 'concept' itself. This may be ill-formed and may not be a function at all. To deal with these situations, we have the notion of a stochastic concept, introduced by Blumer et al. (1989). A stochastic concept on X is simply a probability distribution P on X × {0,1}. Informally, for finite or countable X, one interprets P((x, b)) to be the probability that x will be given classification b. This can be specialised to give the standard pac model, as follows. Suppose we have a probability distribution μ on X and a target concept t; then (see Anthony and Shawe-Taylor (1990), for example) there is a probability distribution P on X × {0,1} such that for all measurable subsets S of X,

P({(x, t(x)) | x ∈ S}) = μ(S);  P({(x, b) | x ∈ S, b ≠ t(x)}) = 0.

In this case, we say that P corresponds to t and μ. What can be said about 'learning' a stochastic concept by means of a hypothesis space H of {0,1}-valued functions? The error of h ∈ H with respect to the target stochastic concept is the probability
of misclassification by h of a further randomly drawn training example. If P is truly stochastic (and not merely the stochastic representation of a function, as described above) it is unlikely that this error can be made arbitrarily small. As earlier, the observed error of h on a training sample s = ((x₁, b₁), (x₂, b₂), ..., (x_m, b_m)) is defined to be

er_s(h) = (1/m) |{i | h(x_i) ≠ b_i}|.
Clearly this may be non-zero for all h (particularly if the same example occurs twice in the sample, but with different labels). What should 'learning' mean in this context? What we should like is that there is some sample size m₀, independent of the stochastic concept P, such that if a hypothesis has 'small' observed error with respect to a random sample of length at least m₀ then, with high probability, it has 'small' error with respect to P. The following result follows from one of Vapnik (1982) and was first presented in the context of computational learning theory by Blumer et al. (1989). (The result presented here is a slight improvement due to Anthony and Shawe-Taylor (1990).)
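Both the observed error and a simple source of stochastic concepts (classification noise, discussed below) are easy to express in code; the sketch uses our own names and represents a sample as a list of (x, b) pairs.

import random

def observed_error(h, sample):
    """Fraction of labelled examples (x, b) in the sample with h(x) != b."""
    return sum(1 for (x, b) in sample if h(x) != b) / len(sample)

def noisy_sample(t, xs, flip_prob):
    """Label each x by the stochastic concept obtained from target t
    by flipping the label t(x) in {0,1} with probability flip_prob."""
    return [(x, t(x) if random.random() > flip_prob else 1 - t(x)) for x in xs]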
Theorem 16 Let H be a hypothesis space of {0,1}-valued functions defined on an input space X. Let P be any probability measure on S = X × {0,1} (that is, P is a stochastic concept on X), let 0 < ε < 1 and let 0 < γ ≤ 1. Then the P^m-probability that, for s ∈ S^m, there is some hypothesis from H such that er_P(h) > ε and er_s(h) ≤ (1 − γ)er_P(h) is at most

4 Π_H(2m) exp(−γ²εm/4).

Furthermore, if H has finite VC dimension d, then there is m₀ = m₀(δ, ε, γ) such that if m > m₀ then, for s ∈ S^m,

er_s(h) ≤ (1 − γ)ε ⟹ er_P(h) < ε

with probability at least 1 − δ.

The notion of a stochastic concept can be applied to a number of situations. As already indicated, it can be useful when the target 'concept' is not a function. It can also be useful when there is 'classification noise' (see Angluin and Laird (1988)), that is, where there is an underlying target function, but the randomly chosen examples have their labels 'flipped' occasionally. This corrupts the training data and results in a stochastic concept. Additionally, in the standard pac model, we have assumed that the concept space is a subset of the hypothesis space. Suppose this is not so and t ∈ C \ H. Then there can be no h ∈ H such that the error of h with respect to t is 0 for all probability distributions μ on X. However, Theorem 16 is applicable. Suppose μ is a distribution on X and take P to be the stochastic concept corresponding to t and μ. Since the sample size given in Theorem 16 is independent of the stochastic concept P, we obtain a type of learnability result when H has finite VC dimension: there is a sample size m₀(δ, ε) such that if a randomly drawn training sample of t of length m₀ is presented, then with probability at least 1 − δ, if h ∈ H is correct on a fraction of at least 1 − ε/2 of the sample, then h has error at most ε. (Here, we have taken γ = 1/2 for simplicity.) In fact, somewhat more can be said about 'learning' in this case. Suppose we have a concept space C and a hypothesis space H defined on X, not necessarily such that C ⊆ H. Suppose that μ is a probability distribution on X. For any target t ∈ C, let opt_H(t) = inf_{h ∈ H} er_μ(h, t). Let us say that a learning algorithm L for (C, H) is a probably approximately optimal algorithm if for any 0 < δ, ε < 1, there is m₀(δ, ε) such that for any t ∈ C and any μ, if a training sample s for t of length at least m₀ is randomly drawn then with probability at least 1 − δ, er_μ(L(s)) < opt_H(t) + ε.
We say that a hypothesis space H has the uniform convergence of errors (UCE) property if the following holds. Given real numbers δ and ε (0 < δ, ε < 1), there is a positive integer m₀(δ, ε) such that, for any probability distribution P on S = X × {0,1},

P^m({s ∈ S^m | for all h ∈ H, |er_P(h) − er_s(h)| < ε}) > 1 − δ.
This means, roughly speaking, that one can guarantee with high confidence that the observed errors of functions in H are within ε of their actual errors. Results of Vapnik and Chervonenkis (1971) on the uniform convergence of relative frequencies to probabilities show that if H has finite VC dimension then H has the UCE property. It follows, upon taking P to be the stochastic concept corresponding to t and μ, that there is a probably approximately optimal learning algorithm for (C, H): the algorithm which returns the hypothesis which minimises the observed error. Of course, this minimisation may be a computationally intractable problem, but the emphasis here is on whether such learning is theoretically possible.

8.2 Variations on the pac model
We have observed that pac learning may be computationally intractable, due to the difficulty of the associated consistency problem. Many researchers have looked at a number of ways of varying the model to make learning easier. We shall now briefly discuss two important variations: distribution-dependent models and models which allow queries. Perhaps the main attraction of the definition of pac learning is the 'distribution-free' criterion: the sample size is independent of the probability distribution. The proofs of the standard computational hardness results for pac learning and the lower bounds on sample complexity involve the use of very particular probability distributions, so the theory presented earlier is very dependent on this criterion. If we know in advance what the distribution on the example space is, or if we know that it is one of a particular set of distributions, then the full strength of the pac definition is not needed. Generally, suppose C is a concept space, H is a hypothesis space, and P is a class of probability distributions on X. We may say that a learning algorithm L for (C, H) is a (C, H, P) pac learning algorithm if there is a sample size m₀(δ, ε) such that for any t ∈ C and any μ ∈ P, with μ^m-probability at least 1 − δ, if a training sample s is presented, er_μ(L(s)) < ε. This is a weakening of the pac criterion, in that m₀ need be uniform only over P and not over all distributions. The case P = {μ}, in which P consists of just one distribution, has been studied by Benedek and Itai (1988, 1991), where a characterisation for learnability in terms of ε-covers is obtained. In general, finite VC dimension is not necessary for (C, H, P) pac learning. (The theory of previous sections shows that it is necessary when P consists of all distributions.) In addition, learning with respect to a particular distribution may be computationally feasible in situations where standard pac learning is NP-hard. For further discussion of distribution-dependent learning, we refer the reader to the papers of Benedek and Itai (1992), Ben-David, Benedek and Mansour (1989), Bertoni et al. (1992), Kharitonov (1993), Li and Vitanyi (1989), Linial, Mansour and Nisan (1989).
In the standard pac framework, the learning algorithm receives labelled examples and forms a hypothesis only on the basis of these. The learning algorithm has no control over the sequence of training examples. Clearly, it might be possible to make learning easier if we allow L to 'ask questions', such as: 'is x a positive example?', for a particular x chosen by the algorithm. This type of query is known as a membership query and, intuitively, one might expect that it makes it easier for the learning algorithm to converge to the target concept. The idea of learning with this and other types of query in addition to random labelled examples (and sometimes in place of random labelled examples) was mentioned by Valiant (1984a) and has been studied extensively in recent years. We refer the reader to the papers by Angluin (1988), Angluin, Frazier and Pitt (1990), Baum (1990, 1991), Maass and Turan (1990, 1992), and the survey by Angluin (1992) and the references therein.

8.3 The graph dimension
As far as applications to artificial neural networks are concerned, the most significant and important extension of the basic pac model is to the learning of general function spaces. The basic model concerns {0,1}-valued functions only; that is, it is concerned only with classification problems. We have seen how it applies to feedforward linear threshold networks having one output node. But what can be said about learning and generalisation in feedforward linear threshold networks having more than one output node, or in artificial neural networks with sigmoid activation functions and a real-valued output? To deal with these problems and others, the pac model has been extended in a number of ways. Most of the relevant work has been on sample complexity rather than computational complexity, with a few notable exceptions, such as the paper of Kearns and Schapire (1990). Here, we concentrate on the problem of sufficient sample size. In what follows, certain technical measure-theoretic conditions have to be satisfied; we shall not discuss these, but refer the reader to the paper of Haussler (1992) or to the book by Pollard (1984). Suppose that C, H are sets of functions from an example space X into a set Y, with C ⊆ H, and suppose that t ∈ C. Suppose also that there is a probability distribution μ on X. Generalising in the obvious way from our previous definitions, we may define the error of h ∈ H with respect to t to be

er_μ(h) = μ({x ∈ X | h(x) ≠ t(x)}).
That is, h is erroneous on example x if h(x) ≠ t(x). When Y = ℝ, for example, this may seem a little coarse; we shall later discuss an alternative approach. With this measure of error, we may define learning as earlier. For h ∈ H, let Θ_h be the function from X × Y to {0,1} defined by

Θ_h(x, y) = 1 ⟺ h(x) = y,

and let Θ_H = {Θ_h : h ∈ H}. Now, it can be shown that if the hypothesis space Θ_H is pac learnable (in the usual sense), then so too is H, by any consistent learning
algorithm; see Natarajan (1989) and Anthony and Shawe-Taylor (1990). Furthermore, the sample complexity of any consistent learning algorithm for H (when regarded as a 'pac' algorithm) can be bounded by an expression involving the VC dimension of Θ_H. (A suitable upper bound is any upper bound on the sample complexity of a consistent learning algorithm for Θ_H, such as the expression of Theorem 8 with d = VCdim(Θ_H).) This quantity VCdim(Θ_H) is known as the graph dimension of H (Natarajan (1989)) and is denoted Gdim(H). Clearly, when Y = {0,1}, the graph dimension and the VC dimension coincide. We remark that the idea of stochastic concept can be extended to 'stochastic function'. Indeed, the distribution P discussed above is an example of a stochastic function. More generally, the analysis presented earlier for (standard) stochastic concepts extends to stochastic functions; see Anthony and Shawe-Taylor (1990) and Buescher and Kumar (1992), for example, and later in this section, where a more general framework is described. We see from this discussion that if Gdim(H) is finite then H is pac learnable by any consistent algorithm. It is natural to ask whether finite graph dimension is a necessary condition for learnability in this generalised model. However, Natarajan showed this not to be the case: there are pac learnable function spaces with infinite graph dimension (see Natarajan (1991)). Natarajan finds a weaker necessary condition for learnability, showing that a certain measure, now known as the Natarajan dimension, must be finite for H to be pac learnable. More recently, Ben-David, Cesa-Bianchi and Long (1992) have shown that when Y is finite, the finiteness of the graph dimension is a necessary and sufficient condition for H to be pac learnable. Furthermore, they show that the Natarajan dimension is finite if and only if the graph dimension is finite, so that in this case Natarajan's necessary condition and the sufficient condition given by the graph dimension match. In fact, they obtain a 'meta-result', characterising those measures of dimension which themselves characterise learnability in the case of finite Y. The graph dimension has been applied to obtain bounds on the required sample size for learning in artificial neural networks. Natarajan (1989) obtained a result bounding the graph dimension of a linear threshold network (not necessarily a feedforward one) with any number of output nodes and with {0,1}-valued inputs. For feedforward linear threshold networks with real inputs, Shawe-Taylor and Anthony (1991) (see also Anthony and Shawe-Taylor (1990)) generalised the result of Baum and Haussler (1989) presented in Theorem 12. Specifically, they showed that if a feedforward linear threshold network has any number k of output nodes then the graph dimension of the space of {0,1}^k-valued functions it computes is at most 2W log(ez), where z, W are, as earlier, the number of computation nodes and the number of variable weights and thresholds. Thus, the same upper bound on sample size as presented in Theorem 13 holds for this more general case.
8.4 The pseudo-dimension
We have seen that the graph dimension can be used to measure the expressive power of a hypothesis space of functions, in somewhat the same way as the VC dimension is used for {0,1}-valued hypothesis spaces. But there are other such measures, as discussed by Haussler (1992) and Ben-David, Cesa-Bianchi and Long (1992), for example. In passing, we have already mentioned the Natarajan dimension. We now introduce a very useful dimension, known as the pseudo-dimension. This was introduced by Pollard (1984) and is defined whenever the set of functions maps into Y ⊆ ℝ. (More generally, it may be defined when Y is any totally ordered set, but this shall not concern us here.) Let H be a set of functions from X to ℝ. For any x = (x₁, x₂, ..., x_m) ∈ X^m, and for h ∈ H, let h(x) = (h(x₁), h(x₂), ..., h(x_m)) and let H(x) = {h(x) : h ∈ H}. We say that x is pseudo-shattered by H if some translate r + H(x) of H(x) intersects all orthants of ℝ^m. In other words, x is pseudo-shattered by H if there are r₁, r₂, ..., r_m ∈ ℝ such that for any b ∈ {0,1}^m, there is h_b ∈ H with h_b(x_i) ≥ r_i ⟺ b_i = 1. The largest d such that some sample of length d is pseudo-shattered is the pseudo-dimension of H and is denoted by Pdim(H). (When this maximum does not exist, the pseudo-dimension is taken to be infinite.) When Y = {0,1}, the definition of pseudo-dimension reduces to the VC dimension. Furthermore, when H is a vector space of real functions, then the pseudo-dimension of H is precisely the vector-space dimension of H; see Haussler (1992).
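For a finite set of hypotheses and a short sample, pseudo-shattering can be tested by brute force. The sketch below (our own construction, exponential in the sample length and so usable only for tiny cases) searches for thresholds r_i among the observed values h(x_i); this suffices, since any witnessing r_i can be moved up to the nearest observed value without changing any of the comparisons.

from itertools import product

def pseudo_shatters(hypotheses, xs):
    """Decide whether the sample xs is pseudo-shattered by the finite
    set `hypotheses` of real-valued functions."""
    m = len(xs)
    # Candidate thresholds for each coordinate: the observed values there.
    candidates = [sorted({h(x) for h in hypotheses}) for x in xs]
    for r in product(*candidates):
        patterns = {tuple(1 if h(x) >= ri else 0 for x, ri in zip(xs, r))
                    for h in hypotheses}
        if len(patterns) == 2 ** m:    # every b in {0,1}^m is realised
            return True
    return False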
8.5 A framework for learning function spaces
When considering a space H of functions from X to ℝ^k, it seems rather coarse to say that a hypothesis h is erroneous with respect to a target t on example x unless h(x) and t(x) are precisely equal. For example, with a neural network having k real-valued outputs, it is extremely demanding that each of the k outputs be exactly equal to that which the target function would compute. Up to now, this is the definition of error we have used. There are other ways of measuring error, if one is prepared to ask not 'is the output correct?' but 'is the output close?' in some sense. Haussler (1992) has developed a 'decision-theoretic' framework encompassing many ways of measuring error by means of loss functions. We shall describe this framework in a way which also subsumes the discussion on stochastic concepts. First, we need some definitions. A loss function is, for our purposes, a non-negative bounded function l : Y × Y → [0, M] (for some M). Informally, the loss l(y, y') is a measure of how 'bad' the output y is, when the desired output is y'. An example of a loss function is the discrete loss function, defined by l(y, y') = 1 unless y = y', in which case l(y, y') = 0. Another useful loss function is the L¹-loss, which is defined when Y ⊆ ℝ^k. This is given by

l(y, y') = (1/k) Σᵢ |yᵢ − y'ᵢ|.
In both of these examples, the loss function is actually a metric, but there is no need for this. For example, a loss function which is not a metric and which has been usefully applied by Kearns and Schapire (1990) is the L²-loss or quadratic loss, defined on ℝ^k by

l(y, y') = (1/k) Σᵢ (yᵢ − y'ᵢ)².

There are many other useful loss functions, such as the L^∞-loss, the logistic loss and the cross-entropy loss. In order to simplify our discussion here, we shall concentrate largely on the L¹-loss, which seems appropriate when considering artificial neural networks. The reader is referred to the influential paper of Haussler (1992) for far more detailed discussion of the general decision-theoretic approach and its applications. As in our discussion of stochastic concepts, we consider probability distributions P on X × Y. Suppose that l : Y × Y → [0, M] is a particular loss function. For h ∈ H, we define the error of h with respect to P (and l) to be

er_{P,l}(h) = ∫ l(h(x), y) dP(x, y),
the expected value of l(h(x), y). When P is the stochastic concept corresponding to a target function t and a probability distribution μ on X, then this error is E_μ l(h(x), t(x)), the average loss in using h to approximate to t. Note that if l is the discrete loss then this is simply the μ-probability that h(x) ≠ t(x), which is precisely the measure of error used in the standard pac learning definition. Suppose that a sample s = ((x₁, y₁), ..., (x_m, y_m)) of points from X × Y is given. The observed error (or empirical loss) of h on this sample is

er_{s,l}(h) = (1/m) Σ_{j=1}^{m} l(h(x_j), y_j).
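These definitions translate directly into code. The sketch below implements the discrete, L¹ and L² losses (with the 1/k normalisation adopted above, which we read as the convention keeping the loss within [0, M]) together with the empirical loss er_{s,l}.

def discrete_loss(y, y_prime):
    return 0 if y == y_prime else 1

def l1_loss(y, y_prime):
    """L1-loss on R^k, with y and y_prime given as equal-length sequences."""
    return sum(abs(a - b) for a, b in zip(y, y_prime)) / len(y)

def l2_loss(y, y_prime):
    """Quadratic loss on R^k."""
    return sum((a - b) ** 2 for a, b in zip(y, y_prime)) / len(y)

def empirical_loss(h, sample, loss):
    """Observed error er_{s,l}(h) of h on sample s = [(x, y), ...]."""
    return sum(loss(h(x), y) for (x, y) in sample) / len(sample)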
The aim of learning in this context is to find, on the basis of a 'large enough' sample s, some L(s) ∈ H which has close to optimal error with respect to P; specifically, if δ, ε > 0 and if, as in our discussion of probably approximately optimal learning, opt_H(P) = inf{er_{P,l}(h) : h ∈ H}, then we should like to have

er_{P,l}(L(s)) < opt_H(P) + ε,
with probability at least 1 − δ. As before, 'large enough' means at least m₀(δ, ε), where this is independent of P. As for the standard pac model and the stochastic pac model described earlier, this can be guaranteed provided we have a 'uniform convergence of errors' property. Extending the earlier definition, we say that a hypothesis space H of functions from X to Y has the uniform convergence of errors (UCE) property if for 0 < δ, ε < 1, there is a positive integer m₀(δ, ε) such that, for any probability distribution P on X × Y,

P^m({s | for all h ∈ H, |er_{P,l}(h) − er_{s,l}(h)| < ε}) > 1 − δ.
If this is the case, then a learning algorithm which outputs a hypothesis minimising the observed error will be a probably approximately optimal learning algorithm; see Haussler (1992) for further discussion. We should note that minimisation of observed error is not necessarily the 'simplest' way in which to produce a near-optimal hypothesis. Buescher (1992) has obtained interesting results along these lines.

8.6 The capacity of a function space
An approach to ensuring that a space of functions has the UCE property, which is described in Haussler (1992) and which follows Dudley (1984), is to use the notion of the capacity of a function space. For simplicity, we shall focus here only on the cases in which Y is a bounded subset of some ℝ^k, Y ⊆ [0, M]^k, and we shall use the L¹-loss function, which from now on will be denoted simply by l. Observe that the loss function maps into [0, M] in this case. We first need the notion of an ε-cover of a subset of a pseudo-metric space. A pseudo-metric σ on a set A is a function from A × A to ℝ such that

σ(a, b) = σ(b, a) ≥ 0,  σ(a, a) = 0,  σ(a, b) ≤ σ(a, c) + σ(c, b)

for all a, b, c ∈ A. An ε-cover for a subset W of A is a subset S of A such that for every w ∈ W, there is some s ∈ S such that σ(w, s) ≤ ε. W is said to be totally bounded if it has an ε-cover for all ε > 0. When W is totally bounded, we denote by N(ε, W, σ) the size of the smallest ε-cover for W. To apply this to learning theory, suppose that H maps X into [0, M]^k and that μ is a probability distribution on X. Define the pseudo-metric d_μ on H by d_μ(f, g) = E_μ(l(f(x), g(x))). We shall define the ε-capacity of H to be

C_H(ε) = sup_μ N(ε, H, d_μ),
where the supremum is taken over all probability distributions μ on X. If there is no finite ε-cover for some μ, or if the supremum does not exist, we say that the ε-capacity is infinite. The definition just given is not quite the same as the definition given by Haussler; here, we take a slightly more direct approach because we are not aiming for the full generality of Haussler's analysis. Results of Haussler (1992) and Pollard (1984) provide the following uniform bound on the rate of convergence of observed errors to actual errors.

Theorem 17 With the notation of this section, if P is any probability distribution on S = X × Y, then

P^m({s | there is h ∈ H with |er_{P,l}(h) − er_{s,l}(h)| > ε}) ≤ 4 C_H(ε/16) e^{−ε²m/64M²}

for all 0 < ε < 1. □
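For a finite (or finitely sampled) set of functions, an ε-cover under a given pseudo-metric can be constructed greedily, which gives a practical upper bound on N(ε, W, σ). A sketch, with our own naming; the result is both an ε-cover and an ε-separated set.

def greedy_cover(points, dist, eps):
    """Greedily build an eps-cover of `points` under pseudo-metric `dist`:
    any point further than eps from all chosen centres becomes a centre."""
    cover = []
    for p in points:
        if all(dist(p, s) > eps for s in cover):
            cover.append(p)
    return cover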
When k = 1 and H maps into [0, M], the capacity can be related to the pseudo-dimension of H. Haussler (1992) (see also Pollard (1984)) showed that if d = Pdim(H) then

C_H(ε) ≤ 2 (2eM/ε · ln(2eM/ε))^d.

This, combined with the above result, shows that, in this case, H has the UCE property and that the sufficient sample size is of order

(M²/ε²) (d log(M/ε) + log(1/δ)).
Thus, if H is a space of real functions and Pdim(H) is finite, then the learning algorithm which outputs the hypothesis with minimum observed error is a probably approximately optimal learning algorithm with sample complexity m₀(δ, ε). Thus, in a sense, for pac learning hypothesis spaces of real functions, the pseudo-dimension takes on a role analogous to that taken by the VC dimension for standard pac learning problems.
8.7 Applications to artificial neural networks
We now illustrate how these results have been applied to certain standard types of artificial neural network. We shall consider here the feedforward 'sigmoid' networks. In his paper, Haussler (1992) shows how the general framework and results can also be applied to radial basis function networks and networks composed of product units. Referring back to our definition of a feedforward network, we assumed at that point that each activation function was a linear threshold function. Suppose instead that each activation function f is a 'smooth' bounded monotone function. In particular, suppose that f takes values in a bounded interval [α, β] and that it is differentiable on ℝ with bounded derivative, |f′(x)| ≤ B for all x. (We shall call such a function a sigmoid.) The standard example of such a function is

f(x) = 1/(1 + e^{−(x−θ)}),

where θ is known as the threshold, and is adjustable. This type of sigmoid function, which we shall call a standard sigmoid, takes values in (0, 1) and has derivative bounded by 1/4. By proving some 'composition' results on the capacity of function spaces and by making use of the pseudo-dimension and its relationship to capacity for real-valued function spaces, Haussler obtained bounds on the capacity of feedforward artificial neural networks with general sigmoid activation functions. It is not possible to provide all the details here; we refer the reader to his paper. Before stating the next result, we need a further definition. The depth of a particular computation node is the number of arcs in the longest directed path from an input node to the node. The depth of the network is the largest depth of any computation node in the network. We have the following special case of a result of Haussler (1992).
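As a brief aside before the theorem, the standard sigmoid and the bound on its derivative can be written out directly: f′(x) = f(x)(1 − f(x)), which attains its maximum 1/4 where f = 1/2.

import math

def standard_sigmoid(x, theta=0.0):
    """f(x) = 1 / (1 + exp(-(x - theta))), taking values in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(x - theta)))

def sigmoid_derivative(x, theta=0.0):
    """f'(x) = f(x) * (1 - f(x)) <= 1/4 for all x."""
    f = standard_sigmoid(x, theta)
    return f * (1.0 - f)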
Theorem 18 Suppose that (N, A) is a feedforward sigmoid network of depth d, with z computation nodes, n input nodes, any number of output nodes, and W adjustable weights and thresholds. Let Δ be the maximum in-degree of a computation node. Suppose that each activation function maps into the interval [α, β]. Let H be the set of functions computable by the network on inputs from [α, β]^n when the variable weights are constrained to be at most V in absolute value. Then for 0 < ε ≤ β − α,

C_H(ε) ≤ (2e(β − α)d(ΔVB)^{d−1}/ε)^{2W},

where B is a bound on the absolute values of the derivatives of the activation functions. Further, for fixed V, there is a constant K such that for any probability distribution P on X × ℝ^k, the following holds: provided

m ≥ (K/ε²)(W log((β − α)dΔVB/ε) + log(1/δ)),

then with P^m-probability at least 1 − δ, a sample s from (X × ℝ^k)^m satisfies |er_{s,l}(h) − er_{P,l}(h)| < ε for all h ∈ H. Moreover, there is K₁ such that if

m ≥ (K₁/ε)(W log((β − α)dΔVB/ε) + log(1/δ))

and er_{s,l}(h) = 0 then, with probability at least 1 − δ, er_{P,l}(h) < ε. □
This result shows that the space of functions computed by a certain type of sigmoid network has the UCE property. It provides an upper bound on the order of sample size which should be used in order to be confident that the observed error is close to the actual error. In particular, therefore, it follows that a learning algorithm which minimises observed error is probably approximately optimal, with sample complexity bounded by the bounds in the theorem. The presence of the bound B on the absolute values of the derivatives of the activation functions means that this theorem does not apply to linear threshold networks, where the activation functions are not differentiable. Nonetheless, the sample size bounds are similar to those obtained for linear threshold networks. Furthermore, in the theorem, there is assumed to be some uniform upper bound V on the maximum magnitude of the weights. Recently, Macintyre and Sontag (1993) have shown that if every activation function is the standard sigmoid function and if there is one output node, then such a bound is not necessary. They show that the set of functions computed by a standard
sigmoid network with unrestricted weights (and on unrestricted real inputs) has finite pseudo-dimension.
We remark that the results presented concerning sigmoid neural networks are upper-bound results. One cannot easily give lower bounds on the sample size as for standard pac learning. One reason for this is that, although pac learnable function spaces from X to finite Y can be characterised (as in Ben-David, Cesa-Bianchi and Long (1992)), no matching necessary and sufficient conditions are known for the more general problem of pac learning when Y is infinite. In other words, it is an open problem to determine a single parameter which quantifies precisely the learning capabilities of a general function space.
REFERENCES
Angluin (1988): D. Angluin, Queries and concept learning. Machine Learning, 2(4): 319-342.
Angluin (1992): D. Angluin, Computational learning theory: survey and selected bibliography. In Proceedings of the Twenty-Fourth Annual ACM Symposium on the Theory of Computing.
Angluin, Frazier and Pitt (1990): D. Angluin, M. Frazier and L. Pitt, Learning conjunctions of Horn clauses. In Proceedings of the Thirty-First IEEE Symposium on Foundations of Computer Science, IEEE Computer Society Press, Washington DC. (See also: Machine Learning, 9 (2-3), 1992: 147-164.)
Angluin and Laird (1988): D. Angluin and P. Laird, Learning from noisy examples. Machine Learning, 2: 343-370.
Anthony and Biggs (1992): M. Anthony and N. Biggs, Computational Learning Theory: an Introduction. Cambridge University Press.
Anthony, Biggs and Shawe-Taylor (1990): M. Anthony, N. Biggs and J. Shawe-Taylor, The learnability of formal concepts. In Proceedings of the Third Workshop on Computational Learning Theory. Morgan Kaufmann, San Mateo, CA.
Anthony and Shawe-Taylor (1990): M. Anthony and J. Shawe-Taylor, A result of Vapnik with applications. Technical report CSD-TR-628, Royal Holloway and Bedford New College, University of London. To appear, Discrete Applied Mathematics.
Bartlett (1992): P.L. Bartlett, Lower bounds on the Vapnik-Chervonenkis dimension of multi-layer threshold networks. Technical report IML92/3, Intelligent Machines Laboratory, Department of Electrical Engineering and Computer Engineering, University of Queensland, Qld 4072, Australia, September 1992.
Baum (1990): E.B. Baum, Polynomial time algorithms for learning neural nets. In Proceedings of the Third Workshop on Computational Learning Theory. Morgan Kaufmann, San Mateo, CA.
Baum (1991): E.B. Baum, Neural net algorithms that learn in polynomial time from examples and queries. IEEE Transactions on Neural Networks, 2: 5-19.
Baum and Haussler (1989): E.B. Baum and D. Haussler, What size net gives valid generalization? Neural Computation, 1: 151-160.
Ben-David, Benedek and Mansour (1989): S. Ben-David, G. Benedek and Y. Mansour, A parameterization scheme for classifying models of learnability. In Proceedings of the Second Workshop on Computational Learning Theory. Morgan Kaufmann, San Mateo, CA.
Ben-David, Cesa-Bianchi and Long (1992): S. Ben-David, N. Cesa-Bianchi and P. Long, Characterizations of learnability for classes of {0, ..., n}-valued functions. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, ACM Press, New York.
Benedek and Itai (1988): G. Benedek and A. Itai, Learnability by fixed distributions. In Proceedings of the 1988 Workshop on Computational Learning Theory. Morgan Kaufmann, San Mateo, CA.
Benedek and Itai (1991): G. Benedek and A. Itai, Learnability with respect to fixed distributions. Theoretical Computer Science, 86 (2): 377-389.
Benedek and Itai (1992): G. Benedek and A. Itai, Dominating distributions and learnability. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, ACM Press, New York.
Bertoni et al. (1992): A. Bertoni, P. Campadelli, A. Morpurgo and S. Panizza, Polynomial uniform convergence and polynomial sample learnability. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, ACM Press, New York.
Billingsley (1986): P. Billingsley, Probability and Measure. Wiley, New York.
Blum and Rivest (1988): A. Blum and R.L. Rivest, Training a 3-node neural network is NP-complete. In Proceedings of the 1988 Workshop on Computational Learning Theory. Morgan Kaufmann, San Mateo, CA. (See also: Neural Networks, 5 (1), 1992: 117-127.)
Blumer et al. (1989): A. Blumer, A. Ehrenfeucht, D. Haussler and M. Warmuth, Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4): 929-965.
Buescher (1992): K.L. Buescher, Learning and smooth simultaneous estimation of errors based on empirical data (PhD thesis). Report UILU-ENG-92-2246, DC-144, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign.
Buescher and Kumar (1992): K.L. Buescher and P.R. Kumar, Learning stochastic functions by smooth simultaneous estimation. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, ACM Press, New York.
Cormen, Leiserson and Rivest (1990): T.H. Cormen, C.E. Leiserson and R.L. Rivest, Introduction to Algorithms. MIT Press, Cambridge, MA.
Dudley (1984): R.M. Dudley, A course on empirical processes. Lecture Notes in Mathematics, 1097: 2-142. Springer Verlag, New York.
Ehrenfeucht et al. (1989): A. Ehrenfeucht, D. Haussler, M. Kearns and L. Valiant, A general lower bound on the number of examples needed for learning. Information and Computation, 82 (3): 247-261.
Garey and Johnson (1979): M. Garey and D. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, San Francisco.
Grunbaum (1967): B. Grunbaum, Convex Polytopes. John Wiley, London.
Haussler (1992): D. Haussler, Decision theoretic generalizations of the pac model for neural net and other learning applications. Information and Computation, 100: 78-150.
Haussler et al. (1988): D. Haussler, M. Kearns, N. Littlestone and M. Warmuth, Equivalence of models for polynomial learnability. In Proceedings of the 1988 Workshop on Computational Learning Theory. Morgan Kaufmann, San Mateo, CA. (Also, Information and Computation, 95 (2), 1991: 129-161.)
Haussler and Welzl (1987): D. Haussler and E. Welzl, Epsilon-nets and simplex range queries. Discrete & Computational Geometry, 2: 127-151.
Judd (1988): J.S. Judd, Learning in neural networks. In Proceedings of the 1988 Workshop on Computational Learning Theory. Morgan Kaufmann, San Mateo, CA.
Kearns and Schapire (1990): M. Kearns and R. Schapire, Efficient distribution-free learning of probabilistic concepts. In Proceedings of the Thirty-First IEEE Symposium on Foundations of Computer Science, IEEE Computer Society Press, Washington DC.
Kharitonov (1993): M. Kharitonov, Cryptographic hardness of distribution specific learning. To appear, Proceedings of the Twenty-Fifth Annual ACM Symposium on the Theory of Computing, 1993.
Li and Vitanyi (1989): M. Li and P. Vitanyi, A theory of learning simple concepts under simple distributions and average case complexity for the universal distribution. In Proceedings of the Thirtieth IEEE Symposium on Foundations of Computer Science, IEEE Computer Society Press, Washington DC.
Linial, Mansour and Nisan (1989): N. Linial, Y. Mansour and N. Nisan, Constant depth circuits, Fourier transforms, and learnability. In Proceedings of the Thirtieth IEEE Symposium on Foundations of Computer Science, IEEE Computer Society Press, Washington DC.
Maass (1992): W. Maass, Bounds for the computational power and learning complexity of analog neural nets. Manuscript, Institute for Theoretical Computer Science, Technische Universitaet Graz, Austria, 1992. To appear, Proceedings of the Twenty-Fifth Annual ACM Symposium on the Theory of Computing, 1993.
Maass and Turan (1990): W. Maass and G. Turan, On the complexity of learning from counterexamples and membership queries. In Proceedings of the Thirty-First IEEE Symposium on Foundations of Computer Science, IEEE Computer Society Press, Washington DC.
Maass and Turan (1992): W. Maass and G. Turan, Lower bound methods and separation results for on-line learning models. Machine Learning, 9 (2-3): 107-145.
Macintyre and Sontag (1993): A. Macintyre and E.D. Sontag, Finiteness results for
sigmoidal "neural" networks (extended abstract). To appear, Proceedings of the Twenty-Fifth Annual ACM Symposium on the Theory of Computing.
Minsky and Papert (1969): M. Minsky and S. Papert, Perceptrons. MIT Press, Cambridge, MA. (Expanded edition 1988.)
Natarajan (1989): B.K. Natarajan, On learning sets and functions. Machine Learning, 4: 67-97.
Natarajan (1991): B.K. Natarajan, Machine Learning: A Theoretical Approach. Morgan Kaufmann.
Pitt and Valiant (1988): L. Pitt and L.G. Valiant, Computational limitations on learning from examples. Journal of the ACM, 35 (4): 965-984.
Pollard (1984): D. Pollard, Convergence of Stochastic Processes. Springer-Verlag, New York.
Rosenblatt (1959): F. Rosenblatt, Two theorems of statistical separability in the perceptron. In Mechanisation of Thought Processes: Proceedings of a Symposium Held at the National Physical Laboratory, November 1958. Vol. 1. HM Stationery Office, London.
Sauer (1972): N. Sauer, On the density of families of sets. Journal of Combinatorial Theory (A), 13: 145-147.
Shawe-Taylor and Anthony (1991): J. Shawe-Taylor and M. Anthony, Sample sizes for multiple output threshold networks. Network, 2: 107-117.
Shawe-Taylor, Anthony and Biggs (1993): J. Shawe-Taylor, M. Anthony and N. Biggs, Bounding sample size with the Vapnik-Chervonenkis dimension. To appear, Discrete Applied Mathematics, Vol. 41.
Valiant (1984a): L.G. Valiant, A theory of the learnable. Communications of the ACM, 27 (11): 1134-1142.
Valiant (1984b): L.G. Valiant, Deductive learning. Philosophical Transactions of the Royal Society of London A, 312: 441-446.
Vapnik (1982): V.N. Vapnik, Estimation of Dependences Based on Empirical Data. Springer Verlag, New York.
Vapnik and Chervonenkis (1971): V.N. Vapnik and A.Ya. Chervonenkis, On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16 (2): 264-280.
Vapnik and Chervonenkis (1981): V.N. Vapnik and A.Ya. Chervonenkis, Necessary and sufficient conditions for the uniform convergence of means to their expectations. Theory of Probability and its Applications, 26 (3): 532-553.
Wilf (1986): H.S. Wilf, Algorithms and Complexity. Prentice-Hall, New Jersey.
Mathematical Approaches to Neural Networks
J.G. Taylor (Editor)
© 1993 Elsevier Science Publishers B.V. All rights reserved
Time-summating network approach

Paul C. Bressloff

GEC-Marconi Ltd., Hirst Research Centre, East Lane, Wembley, Middx. HA9 7PP, United Kingdom.
Abstract
A review of the dynamical and computational properties of time-summating neural networks is presented.

1 INTRODUCTION
The formal neuron used in most artificial neural networks is based on a very simple model of a real neuron due to McCulloch and Pitts [1]. In this model, the output of the neuron is binary-valued, indicating whether or not its activation state exceeds some threshold, and the activation state at any discrete time is equal to the linear sum of inputs to the neuron at the previous time-step. (We shall refer to such a neuron as a standard binary neuron.) The simplicity of networks of these neurons has allowed many analytical and numerical results to be obtained. In particular, statistical-mechanical techniques, which exploit the analogy between standard binary networks and magnetic spin systems, have been applied extensively to the study of the collective behaviour of large networks [2,3]. Moreover, Gardner [4] has developed statistical-mechanical methods to analyse the space of connection weights between neurons and thus determine quantities such as the optimal capacity for the classification and storage of random, static patterns. However, networks of standard binary neurons tend to be rather limited in terms of (i) the efficiency with which they can process temporal sequences of patterns, and (ii) the range of dynamical behaviour that they can exhibit. These limitations are related to the fact that there is no memory of inputs to the neuron beyond a single time-step. A simple way to incorporate such a memory is to take the activation state of each neuron to be a slowly decaying function of time - a time-summating binary neuron. In this paper, we consider the consequences of this simple modification of the McCulloch-Pitts model for the deterministic dynamics (section 2), stochastic dynamics (section 3) and temporal sequence
processing abilities (section 4) of neural networks. One of the interesting features of time-summating neurons is that they incorporate, albeit in simplified form, an important temporal aspect of the process by which real neurons integrate their inputs. In section 5, we describe an extension of the time-summating model that takes into account spatial aspects of this process such as the geometry of the dendritic tree and soma.

2 DETERMINISTIC DYNAMICS
Consider a fully-connected network of N standard binary-threshold neurons [1,5] and denote the output of neuron i, i = 1, ..., N, at the mth time step by a_i(m) ∈ {0,1}. The binary-valued output indicates whether or not the neuron has fired at time m. The neurons are connected by weights w_ij that determine the size of an input to neuron i arising from the firing of neuron j. In this simple model, the activation state of neuron i at time m is equal to the linear sum of all the inputs received at the previous time-step,

V_i(m) = Σ_{j≠i} w_ij a_j(m−1) + I_i    (2.1)
where I_i denotes some fixed external input. Each neuron fires whenever its activation state exceeds a threshold h_i,

a_i(m) = θ(V_i(m) − h_i)    (2.2)

where θ(x) = 1 if x ≥ 0 and θ(x) = 0 if x < 0. Note that the external inputs may be absorbed into the thresholds h_i. Equations (2.1) and (2.2) determine the dynamics on the discrete space of binary outputs {0,1}^N. (Unless otherwise stated, we shall assume throughout that the neurons update their states in parallel.) The number of possible states of the network is finite, equal to 2^N. Therefore, in the absence of noise, there is a unique transition from one state to the next and the long-term behaviour is cyclic. This follows from the fact that a finite-state system must return to a state previously visited after a finite number of time-steps (≤ 2^N). Hence, the dynamics is restricted to attracting cycles consisting of simple sequences of states, i.e., a given state only occurs once per cycle. Complex sequences, on the other hand, contain repeated states so that there is an ambiguity as to which is the successor of each of these states (see Figure 1); such ambiguities cannot be resolved by a standard binary network. From a computational viewpoint, if each attracting cycle is interpreted as a stored temporal sequence of patterns, then there are severe
limitations on the range of sequences that can be stored.
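A minimal simulation of equations (2.1) and (2.2) makes the finite-state argument concrete: iterating the parallel update must revisit a state within 2^N steps, after which the trajectory repeats, so the attractor is a simple cycle. (A sketch with our naming conventions; weights, thresholds and states are plain Python containers.)

def step(state, w, h):
    """Parallel update: a_i = theta(V_i - h_i), with
    V_i = sum_{j != i} w[i][j] * a_j (external inputs absorbed into h)."""
    n = len(state)
    return tuple(1 if sum(w[i][j] * state[j] for j in range(n) if j != i)
                      >= h[i] else 0 for i in range(n))

def find_cycle(state, w, h):
    """Iterate from a tuple `state` until a state repeats;
    return the attracting cycle (a simple sequence of states)."""
    seen, trajectory = {}, []
    while state not in seen:
        seen[state] = len(trajectory)
        trajectory.append(state)
        state = step(state, w, h)
    return trajectory[seen[state]:]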
Figure 1. Example illustrating the difference between a simple sequence
ABCD... and a complex sequence ABCAD... In the latter case there is an ambiguity concerning the successor of state A.

To allow the occurrence of complex sequences, it is necessary to introduce some memory of previous inputs that extends beyond a single time-step. The simplest way to achieve this is to modify the network at the single neuron level by taking the activation state to be a slowly decaying function of time with decay rate k_i < 1, say. Equation (2.1) then becomes

V_i(m) = k_i V_i(m−1) + Σ_{j≠i} w_ij a_j(m−1) + I_i    (2.3)
We shall refer to a formal neuron satisfying equations (2.2) and (2.3) as a time-summating, binary neuron. The decay term k_i V_i(m−1) may be viewed as a positive feedback along a delay line of weight k_i (see Figure 2); this should be distinguished from models in which the output of the neuron is fed back rather than the value of the activation state. Note that the decay term incorporates an important temporal feature of biological neurons [6]: there is a persistence of cell activity over extended periods due to the leaky-integrator characteristics of the cell surface. If we interpret the activation state of our formal neuron as a mean soma potential then, crudely speaking, we may relate the decay rate k_i to the electrical properties of the cell surface, k_i = exp(−1/R_iC_i), where R_i and C_i are the leakage resistance and capacitance respectively. (See also section 5.) Recent neurophysiological evidence suggests that the time constant of certain cortical neurons is of the
order of hundreds of milliseconds [7]. Since a single time-step corresponds to a few milliseconds, i.e. a refractory period, the decay rate k_i could be close to unity.
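For concreteness, the activity trace of a single time-summating neuron obeying equation (2.3) can be simulated in a few lines (a sketch only; the input spike train, unit weight and parameter values are invented for illustration):

    import numpy as np

    k, I = 0.9, 0.0                          # decay rate and external input (arbitrary)
    inputs = np.zeros(100)
    inputs[10] = 1.0
    inputs[15] = 1.0                         # two input spikes on a single line
    V = np.zeros(101)
    for m in range(1, 101):
        # equation (2.3) for one input line with unit weight
        V[m] = k * V[m - 1] + inputs[m - 1] + I
    # the trace decays geometrically, retaining a memory of both spikes
    print(V[16], V[50])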
Figure 2. A time-summating, binary-threshold neuron.
It follows from equation (2.3) that the activation state depends on the previous history of inputs. Assuming that V_i(0) = 0, we have

    V_i(m) = Σ_{n=0}^{m-1} k_i^n [ Σ_{j≠i} w_ij a_j(m-1-n) + I_i ]                    (2.4)
Such an activity trace allows a network of time-summating neurons to resolve the ambiguities arising from complex sequences, provided that incoming activity is held over a long enough period [8,9]. Moreover, a time-summating network can be trained to store such sequences using perceptron-like learning algorithms that are guaranteed to converge to a solution set of weights if one exists [8]. As in the case of standard binary networks [4], statistical-mechanical techniques may be used to analyse the performance of a time-summating network in the thermodynamic limit [8,11]. One of the features that emerges from such an analysis is the nontrivial contribution from intrinsic temporal correlations that are set up between the activation states of the neurons due to the presence of activity traces. (We shall consider temporal sequence processing in section 4). Another important difference between time-summating and standard networks is that the former can display complex dynamics, including frequency-locking and chaos, at both the single neuron and network levels [12-16]. There is a great deal of current interest in the behaviour of networks of oscillatory and chaotic elements. For example, recent neurophysiological experiments [17,18] suggest that the phase-synchronisation and desynchronisation of neuronal firing patterns could be
used to determine whether or not activated features have been stimulated by a single object. This process would avoid the combinatorial explosion associated with the use of "grandmother cells" (the binding problem [19]). Time-summating networks provide a discrete-time framework for studying such phenomena.
Figure 3. A time-summating neuron with inhibitory feedback.
To illustrate the above, consider a single time-summating neuron with fixed external input I and inhibitory feedback whose activation state evolves according to the Nagumo-Sato equation [20]

    V(m) = F(V(m-1)) = kV(m-1) - w a(m-1) + I
(2.5)
where a(m) = θ(V(m) - h). The operation of the neuron is shown in Figure 3. We shall assume that the feedback is mediated by an inhibitory interneuron that fires whenever the excitatory neuron fires. (A more detailed model that takes into account the dynamics of the interneuron essentially displays the same behaviour. Note that the coupling of an excitatory neuron with an inhibitory neuron using delay connections forms the basic oscillatory element of a continuous-time model used to study stimulus-induced phase synchronisation in oscillator networks [21,22]). The map F of equation (2.5) is piecewise linear with a single discontinuity at V = 0, as shown in Figure 4. Assuming that w > 0 (inhibitory feedback) and 0 < I < w, then all trajectories converge to the interval Σ = [V_-, V_+], where V_- = I - w and V_+ = I. (For values of I outside [0,w], the dynamics is trivial). The dynamics on Σ has been analysed in detail elsewhere [13,23,24]. In particular, the map F is equivalent to a circle map with a discontinuity at V = V_+. Such a circle map is obtained by imposing on Σ the equivalence relation V(m) ≡ V(m) + V_+ - V_-.
The activation state may then be viewed as a phase variable. To describe the behaviour on Σ it is useful to introduce the average firing-rate

    ρ(V) = lim_{M→∞} (1/M) Σ_{n=1}^{M} θ(F^n(V) - h)                    (2.6)

(assuming that the limit exists), where θ(F^n(V) - h) is the output of the neuron
at time n given the initial state V. In terms of the equivalent circle map description, ρ(V) is a rotation number.
Figure 4. Map F describing the dynamics of a time-summating neuron with inhibitory feedback; for w > 0 and 0 < I < w all trajectories converge to the bounded interval Σ = [V_-, V_+].
It can be shown that the average firing-rate is independent of the initial point V, ρ(V) = ρ̄, and that the dynamics is either periodic or quasiperiodic depending on whether ρ̄ is a rational or irrational number. Moreover, as a function of the external input I, ρ̄ forms a "devil's staircase" [23]. That is, ρ̄ is a continuous, monotonic function of I which assumes rational values on non-empty intervals of I and is irrational on a Cantor set of I. If ρ̄ is rational, ρ̄ = p/q, then there is a periodic orbit of period q which is globally attracting. On the other hand, when ρ̄ is irrational there are no periodic points and the attractor is a Cantor set [24]. Note that in the limit k → 0 (standard binary neuron), the devil's staircase structure disappears,
and the neuron simply becomes a bistable element alternating between its on and off states, i.e. ρ̄ = 1/2 independently of I.
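The staircase is straightforward to observe numerically by estimating the average firing-rate (2.6) over a long orbit of the map (2.5). The sketch below is our own illustration; the parameter values and orbit lengths are arbitrary choices:

    import numpy as np

    def firing_rate(I, k=0.5, w=1.0, h=0.0, M=20000, burn=1000):
        """Estimate the average firing-rate (2.6) of the Nagumo-Sato map (2.5)."""
        V, fires = 0.0, 0
        for m in range(M + burn):
            a = 1.0 if V >= h else 0.0       # a(m) = theta(V(m) - h)
            if m >= burn:
                fires += a
            V = k * V - w * a + I            # equation (2.5)
        return fires / M

    for I in np.linspace(0.05, 0.95, 10):
        print(f"I = {I:.2f}   rho = {firing_rate(I):.4f}")

Plotting ρ̄ against a fine grid of I values exhibits the characteristic plateaus at rational rotation numbers.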
Figure 5. The map F_γ for γ = 25.0, with two critical points at V = ±V* and an unstable fixed point at V = V_0.
Another interesting feature of the above model is that, with slight modifications, chaotic dynamics can occur leading, amongst other things, to a break up of the devil's staircase structure [12,13]. For example, suppose that we replace the step function in equation (2.5) by a sigmoid of gain γ so that [12]

    V(m) = F_γ(V(m-1)) = kV(m-1) - w/(1 + e^{-γV(m-1)}) + I
(2.7)
We shall briefly discuss the dynamics of F_γ as a function of the external input I. Assume that K ≡ wγ/2k - 1 > 1 and 0 < I < w. Then F_γ has two critical points at ±V*, where V* = γ^{-1} log[K + (K² - 1)^{1/2}], as shown in Figure 5. There is also a fixed point, denoted V = V_0, which lies in the interval [-V*, V*]. For γ >> 1 (high gain) there exists a range of values of I for which the fixed point is unstable and all trajectories converge to the interval Ω = F_γ([-V*, V*]), on which the dynamics is either periodic or chaotic. The chaotic dynamics arises from the fact that for γ >> 1 the negative gradient
branch of the graph of F_γ has an average slope of modulus greater than unity, which can lead to a positive Liapunov exponent λ(V(0)), where [25]

    λ(V(0)) = lim_{M→∞} (1/M) Σ_{m=0}^{M-1} log |F_γ'(V(m))|
(2.8)
We note that the circle map equivalent to F_γ is nonmonotonic, which is a well known scenario for chaotic dynamics, as exemplified by the sine circle map x → F(x) = x + a + k sin(2πx)/2π (mod 1) [26]. Recently, a network of coupled circle maps has been used to model the synchronisation and desynchronisation of neuronal activity patterns [16]. The basic idea is to associate with each neuron a phase variable θ_i and an activity s_i = 0, 1. In terms of the time-summating model, s_i = 1 indicates that the value of the external input to the ith neuron lies within the parameter regime for which the neuron operates as an oscillator (active mode), with V_i interpreted as the phase variable θ_i. (If I < 0 in equation (2.5) or (2.7) then V_i converges to a stable fixed point corresponding to a passive mode of the neuron and s_i = 0). The dynamics of the phases for all active neurons is taken to be [16]

    θ_i(m+1) = (1/(1+ε)) [ F(θ_i(m)) + ε F(φ_i(m)) ]                    (2.9)
where F is a circle map such as the sine map or the one equivalent to F_γ of equation (2.7), and φ_i(m) is the weighted average of the phases of the other active neurons,

    φ_i(m) = Σ_{j≠i} w_ij θ_j(m) / Σ_{j≠i} w_ij                    (2.10)
Suppose that all neurons are coupled such that w_ij = w for all i, j, i ≠ j. For large N, the stability of the strongly correlated state (θ_i(m) = θ_j(m) for all i, j) is determined completely by the properties of the underlying one-dimensional map F [16]. To show this, define δθ_i(m) = θ_i(m) - φ(m), where φ(m) is the average phase of the network, N^{-1} Σ_i θ_i(m). Linear stability analysis then gives

    δθ_i(m+1) = (1+ε)^{-1} F'(φ(m)) δθ_i(m)                    (2.11)

Using the definition of the Liapunov exponent λ (cf. equation (2.8)), it
follows that the coherent state is stable provided ε > e^λ - 1. In Ref. [16], a Hebb-like learning rule is used to organise a network of neurons into strongly coupled groups in which neurons within a group have completely synchronised (chaotic) time series along the lines of the above coherent state, whereas different groups are uncorrelated; each group corresponds to a separate object. The learning rule takes the explicit form
where γ determines the learning-rate, λ is a "forgetting" term and s_i = 1 if a neuron is activated by an input pattern (object). The function Φ restricts the weights to the interval [a,b]; Φ(x) = x for x ∈ [a,b] and 0 otherwise. After training, if a number of patterns are presented simultaneously to the network, then each of these patterns may be distinguished by the synchronisation of the corresponding subpopulation of active neurons. In this approach, chaos performs two functions. First, it allows separate groups to become rapidly decorrelated in the presence of arbitrarily small differences in initial conditions. Second, it enables a large number of different groups to be independently synchronised.
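The stability condition ε > e^λ - 1 can be checked numerically once λ is estimated via equation (2.8). The following sketch is our own illustration: the sine circle map stands in for F, and the parameter values are arbitrary:

    import numpy as np

    a, K = 0.5, 1.2                          # sine map parameters (illustrative)
    F  = lambda x: (x + a + K * np.sin(2 * np.pi * x) / (2 * np.pi)) % 1.0
    dF = lambda x: 1.0 + K * np.cos(2 * np.pi * x)

    # Liapunov exponent, cf. equation (2.8)
    x, lam, M = 0.3, 0.0, 20000
    for _ in range(M):
        lam += np.log(abs(dF(x)))
        x = F(x)
    lam /= M
    print("lambda =", lam)
    print("coherent state stable for eps >", np.exp(lam) - 1.0)

For λ ≤ 0 the coherent state is stable for any ε > 0; a positive λ gives the nontrivial threshold quoted above.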
3 STOCHASTIC DYNAMICS
It is well known that the most significant source of intrinsic noise in biological neurons arises from random fluctuations in the number of packets of chemical neurotransmitters released at a synapse on arrival of an action potential [27]. Such noise can be incorporated into an artificial neural network by taking the connection weights to be discrete random variables independently updated at every time-step according to fixed probability distributions [28,29,30]. That is, each connection weight has the form w(m) = εu(m), where |ε| is related to post-synaptic efficacy (the efficiency with which transmitters are absorbed onto the post-synaptic membrane), sign(ε) determines whether the synapse is excitatory or inhibitory, and u(m) corresponds to the number of packets released at time m. Following Ref. [31], we shall take the release of chemicals to be governed by a Binomial process. Before discussing the stochastic dynamics of time-summating networks with synaptic noise, it is useful to consider the more familiar case of standard binary networks.
3.1 Standard binary networks. Incorporating synaptic noise into a standard binary neural network leads to the stochastic equations

    a_i(m) = θ( Σ_{j≠i} ε_ij u_ij(m-1) + η_i(m-1) - h_i )                    (3.1)
where u_ij(m) = 0 if a_j(m) = 0, whereas u_ij(m) is generated by a Binomial distribution when a_j(m) = 1. Thus, for a given state a(m) = a, the conditional probability that u_ij(m) = u_ij is given by

    Prob[u_ij(m) = u_ij | a] = a_j C(L, u_ij) λ_ij^{u_ij} (1 - λ_ij)^{L - u_ij} + (1 - a_j) δ_{u_ij,0}                    (3.2)
where the λ_ij are constants satisfying 0 ≤ λ_ij ≤ 1 and L is the maximum number of packets that can be released at any one time (assumed to be synapse-independent). Note that a random fluctuation η_i(m) of the threshold h_i has also been included in equation (3.1). We shall take the probability distribution function of η_i(m) to be a sigmoid,

    ψ(η) ≡ Prob[η_i(m) ≤ η] = (1 + e^{-βη})^{-1}                    (3.3)
where β^{-1} is a "temperature" parameter. Let p(i|a) be the conditional probability that neuron i fires given that the state of the network at the previous time-step is a. We may obtain p(i|a) by averaging the right-hand side of equation (3.1) over the distributions (3.2) and (3.3). This leads to the result,

    p(i|a) = ⟨ ψ( Σ_{j≠i} ε_ij u_ij - h_i ) ⟩                    (3.4)

where ⟨·⟩ denotes the average with respect to the distribution (3.2),
since ψ(V) = 1 - ψ(-V) when ψ is a sigmoid function. (In the absence of synaptic noise, the conditional probability reduces directly to that of the Little model [32]). Introducing the probability P_m(a) that the state of the network at time m is a, we may describe the dynamical evolution of the network in terms of the homogeneous Markov chain

    P_{m+1}(b) = Σ_a Q_ba P_m(a)
(3.5)
where Q_ba is the time-independent transition probability of going from state a to state b in one time-step, and satisfies

    Q_ba = Π_{i=1}^{N} { b_i p(i|a) + [1 - b_i][1 - p(i|a)] }                    (3.6)
Since the Markov chain generated by equations (3.4) and (3.6) is irreducible when β^{-1} > 0 and λ_ij > 0 (there is a nonzero probability that every state may be reached from every other state in a finite number of time-steps), and assuming that N is finite, we may apply the Perron-Frobenius theorem [32]: If Q is the transition matrix of a finite irreducible Markov chain with period d then (i) the d complex roots of unity, λ_1 = 1, λ_2 = ω, ..., λ_d = ω^{d-1}, where ω = e^{2πi/d}, are eigenvalues of Q, and (ii) the remaining eigenvalues λ_{d+1},... satisfy |λ_j| < 1; (a Markov chain is said to have period d if, for each state a, the probability of returning to a after m time-steps is zero unless m is an integer multiple of d). For non-zero temperatures, the Markov chain is aperiodic (d = 1) so that there is a nondegenerate eigenvalue of Q satisfying |λ| = 1, whilst all others lie inside the unit circle. By expanding the solution of equation (3.5) in terms of the generalised eigenvectors of Q, it follows that there is a unique limiting distribution P_∞(a) such that
    lim_{m→∞} P_m(a) = P_∞(a)                    (3.7)

independently of the initial distribution, where P_∞ is the unique eigenvector of Q with eigenvalue unity. Equation (3.7) implies that time-averages are independent of initial conditions and may be replaced by ensemble averages over the limiting distribution P_∞. That is, for any well-behaved state variable X,
    lim_{M→∞} (1/M) Σ_{m=1}^{M} X(a(m)) = Σ_a X(a) P_∞(a)                    (3.8)

Note that in practice time-averages are defined over a finite-time interval τ_obs. These averages may be replaced by ensemble averages provided that τ_obs >> τ_max, the maximum relaxation time characterising the rate of fluctuations of the system. Although techniques have been developed to analyse P_∞ [34], the explicit form for P_∞ tends to be rather complicated except for the special cases in which detailed balance holds. For then there exists some function f
such that Q_ba f(a) = Q_ab f(b), and equation (3.5) has the stationary solution P*(a) = f(a)/Σ_b f(b). Since, by the Perron-Frobenius theorem, the limiting distribution is unique, and hence equal to P*, we obtain the Gibbs distribution

    P_∞(a) = e^{-βH(a)} / Σ_b e^{-βH(b)}                    (3.9)
where H(a) = -β^{-1} log f(a) is an effective Hamiltonian. An example of a network for which detailed balance holds is the Little [32] model with symmetric weights, w_ij = w_ji. In this particular case, f(a) = Π_i cosh β(Σ_j w_ij a_j - h_i) [35]. One of the consequences of equation (3.7) is that, as it stands, the network cannot display any long-range order in time, since any injection of new information produces fluctuations about the limiting distribution that are then dissipated. Therefore, to operate the network as an associative memory, it is necessary to use one of the following schemes: (a) Introduce an external input I = (I_1,...,I_N) and take the network to be a continuous mapper [36] in which the limiting distribution P_∞ is considered as a function of I. (See also section 3.2). (b) Take the zero noise limit β^{-1} → 0, λ_ij → 1, so that equation (3.1) reduces to equations (2.1) and (2.2), with w_ij = ε_ij L, and the many attracting cycles of the deterministic system emerge. For small but nonzero noise, these cycles will persist for extended lengths of time, with the noise inducing transitions between cycles. (c) Take the thermodynamic limit N → ∞, leading to a breaking of the ergodicity condition (3.7). This forms the basis of statistical-mechanical approaches to neural networks [2,3]. In contrast to (b), which views noise in terms of its effects on the underlying deterministic system, the statistical-mechanical approach is concerned with emergent properties arising from the collective behaviour of large systems with noise. Such behaviour may be analysed using mean field theory. To discuss the large-N limit, we shall follow the statistical dynamical approach of Ref. [30]. First, on setting u_ij(m) = ū_ij(m) a_j(m), where ū_ij(m) is generated by a Binomial distribution B(L, λ_ij), we may rewrite equation (3.1) as

    a_i(m) = f_{ω(m-1),i}(a(m-1))                    (3.10)
with ω(m) denoting the set of random parameters (ū_ij(m), η_i(m)) and f_{ω,i}(a) = θ(Σ_j ū_ij ε_ij a_j + η_i - h_i). We then introduce the notion of a macroscopic variable along the lines of Amari et al. [37]: A finite collection of state variables X_r, r = 1,...,R, is said to be a closed set of macroscopic variables if there exists a set of functions Φ_r, r = 1,...,R, such that for arbitrary a,

    lim_{N→∞} μ_ω[X_r(f_ω(a))] = Φ_r(X_1(a),...,X_R(a))                    (3.11a)
    lim_{N→∞} var_ω[X_r(f_ω(a))] = 0                    (3.11b)
where μ_ω and var_ω denote respectively the mean and variance with respect to the distribution ρ of ω. Equation (3.11) implies that

    lim_{N→∞} Prob[ |X_r(f_ω(a)) - Φ_r(X(a))| > δ ] = 0   for all δ > 0                    (3.12)

Equations (3.11) and (3.12) also hold if a is replaced by the dynamical variable a(m) satisfying (3.10), since the random parameters ω(m) are updated independently at each time-step. Hence, equations (3.10)-(3.12) lead to the dynamical mean field equations

    X_r(m+1) = Φ_r(X(m))                    (3.13)
where X_r(m) = X_r(a(m)). Equation (3.13) determines the long-term behaviour of the network in the limit N → ∞. Suppose, for simplicity, that the set {X_r, r = 1,...,R} completely characterises the macroscopic dynamics of the system. Moreover, assume that there exists a number of stationary solutions to (3.13) that are stable fixed points, denoted X^(α). Each such solution satisfies X_r^(α) = Φ_r(X^(α)) and the eigenvalues λ_r of the Jacobian A_rs = ∂Φ_r(X^(α))/∂X_s satisfy the stability criterion |λ_r| < 1. Assuming that X(0) ∈ A_α, where A_α is the basin of attraction of X^(α), the time-average of X(m) is given by

    lim_{M→∞} (1/M) Σ_{m=0}^{M-1} X(m) = X^(α)
(3.14)
Broken ergodicity is reflected by the existence of more than one fixed point, since there is then a dependence on initial conditions. Note that broken ergodicity can only occur, strictly speaking, in infinite systems; in a finite system the entire state space is accessible. Hence the limit M → ∞ must be taken after the limit N → ∞ in equation (3.14).
A simple example of the above is provided by a fully-connected Little-Hopfield model [32,38] with threshold noise. Introducing, for convenience, "spin" variables S_i = 2a_i - 1, the network evolves according to the equations

    S_i(m) = sign( Σ_{j≠i} w_ij S_j(m-1) + η_i(m-1) )                    (3.15)
with η_i generated according to equation (3.3) and w_ij of the "Hebbian" form

    w_ij = (1/N) Σ_{μ=1}^{R} ξ_i^μ ξ_j^μ                    (3.16)
for R random, unbiased patterns ξ^μ, i.e. ξ_i^μ = ±1 with equal probability. For finite R, a finite set of macroscopic variables satisfying equation (3.11) may be defined in terms of the overlaps

    X_μ(m) = N^{-1} Σ_i ξ_i^μ S_i(m)
(3.17)
The corresponding dynamical mean field equations are

    X_μ(m+1) = ⟨⟨ ξ_μ tanh β( Σ_ν ξ_ν X_ν(m) ) ⟩⟩                    (3.18)

where ⟨⟨g(ξ)⟩⟩ = Π_ν [ Σ_{ξ_ν = ±1} g(ξ)/2 ], i.e. for large N we may assume that strong self-averaging over the random patterns ξ^μ holds. By studying the stability of the fixed points of (3.18), the pattern storage properties of the network may be determined. (The results are identical to those obtained using equilibrium statistical mechanics [39], i.e. the minima of the free energy derived in [39] correspond exactly to the stable fixed points of equation (3.18), and the Hessian of the free energy equals δ_μν - A_μν, where A_μν is the Jacobian of Φ_μ). For example, consider solutions in which there is only a non-zero overlap with a single memory, X_μ = X δ_{μ,1}. This is a solution to (3.18) provided that X = tanh βX, and X ≠ 0 only if T ≡ β^{-1} < 1. There are 2R degenerate solutions corresponding to the
R memories ξ^μ and their opposites -ξ^μ. The Jacobian is given by A_μν = δ_μν β(1 - X²), such that the solutions are always stable for T < 1. (See Ref. [39] for a more detailed analysis based on the statistical-mechanical approach). Unfortunately, the statistical dynamics of a fully-connected Hopfield-Little model with parallel dynamics becomes much more complicated when the number of patterns to be stored becomes infinite in the large-N limit, i.e. R = αN. For then one finds that long-time correlations build up, leading to a rapidly increasing number of order parameters or macroscopic variables. This renders exact treatments ineffective after a few time-steps [40]. (Alternatively, one can consider sparsely-connected networks [41,30] in which the number of parameters becomes tractable). Also note that for more general choices of the weights w_ij, it is possible that the resulting dynamical mean field equations exhibit periodic and chaotic behaviour [42].
3.2 Time-summating binary networks. Introducing synaptic and threshold noise along the lines of section 3.1, the stochastic dynamics of a time-summating network is given by [15]

    V_i(m) = F_{α(m-1),i}(V(m-1))                    (3.19)
where α(m) denotes the set of integers α_ij(m), i, j = 1,...,N, i ≠ j, corresponding to the number of packets released into synapse (ij) at the mth time-step, and

    F_{α,i}(V) = k_i V_i + Σ_{j≠i} ε_ij α_ij + I_i                    (3.20)
For a given state V(m) = V, the probability that α(m) = α is

    ρ_α(V) = Π_{i≠j} [ ψ(V_j) C(L, α_ij) λ_ij^{α_ij} (1 - λ_ij)^{L - α_ij} + (1 - ψ(V_j)) δ_{α_ij,0} ]                    (3.21)
where ψ is the sigmoid function of equation (3.3) and h_j = 0 for convenience. Let Ω denote the index set {0,...,L}^χ, where χ is the number of connections in the network. The set F = {(F_α, ρ_α) | α ∈ Ω} defines a random Iterated Function System (IFS) [43] on the space of activation states M ⊂ ℝ^N. That is, F consists of a finite, indexed set of continuous mappings on a metric space, together with a corresponding set of probabilities for choosing one such map per iteration. (It is sufficient to endow M with the Euclidean metric. Note, however, that the dynamics is independent of the particular metric chosen: the introduction of a metric structure allows certain
mathematical results to be proven, and is useful for characterising the geometrical aspects of a system's attractor). The dynamics described by equation (3.19) corresponds to an orbit of the IFS F. In other words, a particular trajectory of the dynamics is specified by a particular sequence of events {α(m), m = 0, 1,...; α(m) ∈ Ω} together with the initial point V(0). An important feature of F is that it is a hyperbolic IFS (using Barnsley's terminology [43]); the affine maps F_α of equation (3.20) are contraction mappings on M, i.e. the contraction ratio λ_α of F_α, defined by

    λ_α = sup_{V ≠ V'} ||F_α(V) - F_α(V')|| / ||V - V'||                    (3.22)

satisfies λ_α < 1 for all α ∈ Ω. This result holds since the decay factors in equation (3.20) satisfy k_i < 1 and λ_α = k ≡ max_i(k_i). By the contraction mapping theorem [43], there exists a unique fixed point V_α of F_α such that lim_{m→∞} (F_α)^m(V) = V_α for all V ∈ M. This may be seen immediately using equation (3.20), with V_{α,i} = (I_i + Σ_{j≠i} ε_ij α_ij)/(1 - k_i). The fact that F is hyperbolic allows us to apply a number of known results concerning the limiting behaviour of random IFS's [43-45]. To proceed, it is convenient to consider the evolution of probability distributions on M that is generated by performing a large number of trials and following the resulting ensemble of trajectories. The stochastic dynamics of this ensemble is then described in terms of the sequence of probability measures {μ_m, m = 0,1,...} on M, where

    μ_m(A) = Prob[V(m) ∈ A]
(3.23)
is the probability of a trajectory passing through the (Borel) subset A of M at time m, with μ_m(M) = 1. (We cannot assume that the measures μ_m are absolutely continuous with respect to Lebesgue measure and hence introduce smooth probability densities on M such that dμ_m(V) = p_m(V)dV. For, as will be made clear below, there is the possibility of fractal-like structures emerging). The sequence of measures {μ_m} describes a linear Markov process. Introduce the time-independent transition probability Q(B|V) that, given V(m) = V at time m, V(m+1) belongs to the subset B. This is equal to the probability of choosing a map F ∈ {F_α, α ∈ Ω} such that F(V) ∈ B. Thus

    Q(B|V) = Σ_{α∈Ω} ρ_α(V) χ_B(F_α(V))
(3.24)
where χ_B is the indicator function defined by χ_B(V) = 1 if V ∈ B and 0 otherwise. Given an initial probability measure μ_0, Q generates the sequence of measures {μ_m} according to

    μ_{m+1}(B) = ∫_M Q(B|V) dμ_m(V)
(3.25)
Such a sequence then determines the evolution of the output states of the network by projection. That is, the probability P_m(a) that the network has the output configuration a at time m is given by

    P_m(a) = ∫_M Π_i [ a_i ψ(V_i) + (1 - a_i)(1 - ψ(V_i)) ] dμ_m(V)                    (3.26)
However, the sequence {P_m} induced by {μ_m} does not generally evolve according to a Markov chain, which reflects the fact that the activation states are functions of all previous output states, see equation (2.4). An exception occurs in the limit k_i → 0, when the projection of equation (3.25) reduces to the Markov chain (3.5). Using the results of Refs. [43,44], it can be shown [15] that, in the presence of synaptic (λ_ij > 0) and threshold (β^{-1} > 0) noise, the limiting behaviour of the associated IFS F is characterised by a unique invariant measure μ_F with

    lim_{m→∞} μ_m = μ_F                    (3.27)
independently of the initial distribution μ_0. Moreover, μ_F satisfies the condition [44,45] that, for almost all trajectories, time averages are equal to space averages,

    lim_{M→∞} (1/M) Σ_{m=0}^{M-1} f(V(m)) = ∫_M f(V) dμ_F(V)                    (3.28)
for all continuous functions f: M → ℝ. An equivalent result to (3.28) is that the frequency with which an orbit visits a subset B is μ_F(B),

    lim_{M→∞} #{V(m) ∈ B : 1 ≤ m ≤ M}/M = μ_F(B)                    (3.29)
From equation (3.26), it follows that equations (3.7) and (3.8) hold with P_∞
replaced by the projected distribution P_F,

    P_F(a) = ∫_M Π_i [ a_i ψ(V_i) + (1 - a_i)(1 - ψ(V_i)) ] dμ_F(V)                    (3.30)
One of the interesting features of random IFS's is that the invariant measure (if it exists) often has a rich fractal-like structure. (This is a major reason why IFS's have attracted interest lately within the context of image generation and data compression [46,47]). We shall illustrate this with the simple example of a single time-summating neuron with inhibitory feedback (Figure 3). Incorporating synaptic (and threshold) noise, the stochastic evolution of the excitatory neuron's membrane potential is given by V(m) = kV(m-1) - εu(m) + I, where u(m) = u with probability p(u) if a(m-1) = θ(V(m-1) + η(m-1)) = 1 and u(m) = 0 if a(m-1) = 0. For simplicity we shall assume that u is generated according to a Binomial distribution with L = 1, i.e. p(u) = λ^u (1 - λ)^{1-u} for some λ, 0 < λ < 1. Moreover, the probability distribution of the random threshold η(m) is taken to be sigmoidal, equation (3.3). The dynamics corresponds to an IFS G consisting of two maps F_0, F_1 : [V_1, V_0] → [V_1, V_0], where V_{0,1} are the fixed points of
Figure 6. The invariant measure of the random IFS consisting of the two maps F_0, F_1 with F_0(V) = kV + 1 - k and F_1(V) = kV. The associated probabilities are Φ_0 = Φ_1 = 1/2. (a) k = 0.52.
F_{0,1}, with associated probabilities Φ_0, Φ_1, such that

    F_0(V) = kV + I,        Φ_0(V) = 1 - λψ(V)
    F_1(V) = kV - ε + I,    Φ_1(V) = λψ(V)                    (3.31)
A reasonable approximation of the resulting invariant measure μ_G may be obtained by plotting a frequency histogram displaying how often an orbit {V(m)} visits a particular subinterval of [V_1, V_0]. This is a consequence of equation (3.29). Without loss of generality, we consider the high temperature limit β → 0, in which ψ(V) → 1/2 for all V, and set ε = I = 1 - k, so that V_0 = 1, V_1 = 0. The invariant measure μ_G in the case λ = 1 (no synaptic noise) has been an object of interest for over 50 years [48] and many of its mathematical properties are still not very well understood. For k < 1/2 the support of μ_G is a Cantor set. On the other hand, for k ≥ 1/2 the support of μ_G is the whole unit interval, and for many values of k the measure has a fractal-like structure. In Figure 6 we display the frequency histogram representation of μ_G for λ = 1 and (a) k = 0.52, (b) k = 0.6, and (c) k = 0.9. It is clear that μ_G becomes progressively smoother as k → 1. (In the presence of synaptic noise, 0 < λ < 1, similar behaviour occurs, but the histograms are no longer symmetric about V = 1/2).
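The histogram construction is simple to reproduce. The sketch below is our own, under the assumptions just stated (Φ_0 = Φ_1 = 1/2 and the maps of the Figure 6 caption); the orbit length and bin count are arbitrary:

    import numpy as np

    def invariant_histogram(k, M=200000, bins=200):
        """Approximate the invariant measure via orbit visit frequencies (3.29)."""
        rng = np.random.default_rng(1)
        V, visits = 0.5, np.zeros(bins)
        for m in range(M):
            if rng.random() < 0.5:           # each map chosen with probability 1/2
                V = k * V + 1.0 - k          # F0
            else:
                V = k * V                    # F1
            visits[min(int(V * bins), bins - 1)] += 1
        return visits / M

    for k in (0.52, 0.6, 0.9):
        h = invariant_histogram(k)
        print(f"k = {k}: histogram max/min = {h.max():.4f}/{h.min():.4f}")

For k just above 1/2 the histogram is highly irregular, while for k near 1 it is visibly smoother, in line with the behaviour described above.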
Figure 6 continued. (b) k = 0.6.
Returning to equation (3.28), we see that, as in the case of standard binary networks, it is necessary to operate the network according to one of the schemes discussed at the end of section 3.1. Consider the problem of training a stochastic time-summating network to operate as a continuous mapper (scheme (a) of section 3.1). In this mode of operation, one can formulate learning in terms of the following inverse problem: For fixed threshold noise, decay factors k_i and external inputs I_i, there is a family of IFS's F = {(F_α, ρ_α) | α ∈ Ω} parametrised by the set Γ = {ε_ij, λ_ij | i, j = 1,...,N, i ≠ j}; find a set Γ such that the resulting invariant measure μ_F is "sufficiently close" to some desired measure labelled by the external input I. One of the potential applications of the IFS formalism is that a number of techniques have been developed for solving the inverse problem, e.g. the Collage theorem [43]. These exploit the self-similarity of fractal structures inherent in typical IFS's. It would be interesting to see whether or not such techniques are practical within the neural network context. (See also the analysis of associative reinforcement learning in terms of IFS's [49]).
Figure 6 continued. (c) k = 0.9.
We end this section by briefly discussing the behaviour of stochastic time-summating networks in the large-N limit (scheme (c) of section 3.1). First, it is useful to reformulate the dynamics in a similar fashion to section
3.1. That is, set u_ij(m) = ū_ij(m) a_j(m), a_j(m) = θ(V_j(m) + η_j(m)), and write

    V_i(m) = F_{ω(m-1),i}(V(m-1)),    F_{ω,i}(V) = k_i V_i + Σ_{j≠i} ε_ij ū_ij θ(V_j + η_j) + I_i
Macroscopic variables may then be defined along the lines of (3.11) with f_ω and a replaced by F_ω and V. A simple example [14] is given by an homogeneous inhibitory network with threshold noise in which ε_ij ū_ij(m) → -w/N for all i, j, with w fixed, and k_i = k, I_i = I for all i. The long-term macroscopic behaviour of the network is governed by the single dynamical mean-field equation X(m+1) = F_β(X(m)), where X(m) is the mean activation state N^{-1} Σ_i V_i(m) and F_β is the map in (2.7) with gain γ = β. The existence of periodic and chaotic solutions to this equation implies that, in the large-N limit, the network exhibits macroscopic behaviour in which asymptotic stability, i.e. convergence to a unique invariant measure (equation (3.27)), no longer holds. For in the thermodynamic limit, X(m) is equal to the ensemble average N^{-1} ∫ Σ_i V_i dμ_m(V), given the initial condition N^{-1} ∫ Σ_i V_i dμ_0(V) = X(0); if (3.27) held then X(m) would converge to a fixed point corresponding to the ensemble average over the invariant measure. It remains to be seen whether or not complex dynamical behaviour at the macroscopic level in a time-summating neural network can be exploited to the same degree as the fixed point behaviour of Hopfield-Little networks in the context of the storage and retrieval of static patterns. One of the issues that would need to be tackled is the appropriate choice of learning-rule.
4 TEMPORAL SEQUENCE PROCESSING
A major limitation of standard feedforward neural networks is that they are not suitable for processing temporal sequences, since there is no direct mechanism for correlating separate input patterns belonging to the same sequence. A common approach to many temporal sequence processing tasks is to convert the temporal pattern into a spatial one by dividing the sequence into manageable segments using a moving window, and to temporarily store each sequence segment in a buffer. The resulting spatial pattern may then be presented to the network in the usual way and learning algorithms such as back-error-propagation applied accordingly [50]. However, there are a number of drawbacks with the buffer method: (i) Each element of the buffer is connected to all the units in the subsequent layer, so that the number of weights increases with the size of the buffer, which may lead to long training times due to the poor scaling of learning algorithms; (ii) the buffer must be sufficiently large to accommodate the
largest possible sequence, which must be known in advance; (iii) the buffer converts temporal shifts to spatial ones so that, for example, the representation of temporal correlations is obscure; and (iv) the buffer is inefficient if the output response to each pattern of the input sequence is required rather than just the final output. The deficiencies of the buffer method suggest that a more flexible representation of time is needed. One simple approach is to introduce into the network a layer of time-summating neurons [8,9], each of which builds up an activity trace consisting of a decaying sum of all previous inputs to that neuron (cf. equation (2.4)), thus forming an internal representation of an input sequence. The inclusion of such a layer eliminates the need for a buffer and allows the network to operate directly in the time domain. Such networks have been applied to the classification of speech signals [51,52], motion detection [53], and the storage and recall of complex sequences [8].
4.1 Classification and storage of temporal sequences. Consider a single time-summating binary neuron with N input lines that is required to learn M sequences of input-output mappings {I^μ(m); m = 1,...,R} → {o^μ(m); m = 1,...,R}, μ = 1,...,M, where I^μ = (I_1^μ,...,I_N^μ) and o^μ(m) = 0 or 1. We shall reformulate this problem in terms of a classification task to which the perceptron learning theorem [54] may be applied. First, define a new set of inputs of the form [9]

    î_j^μ(m) = Σ_{r=0}^{m-1} k^r I_j^μ(m-r)                    (4.1)
where k is the decay-rate of the time-summating neuron. The activation state at time m is taken to be V(m) = kV(m-1) + Σ_j w_j I_j^μ(m) = Σ_j w_j î_j^μ(m). (Our definition of V(m) differs slightly from that of equation (2.3)). The output at time m is a(m) = θ(V(m) - h), where h is a fixed threshold. Divide the RM inputs î^μ(m), μ = 1,...,M, m = 1,...,R, into two sets F+ and F-, where î^μ(m) ∈ F+ if o^μ(m) = 1 and î^μ(m) ∈ F- otherwise. Learning then reduces to the problem of finding a set of weights {w_j, j = 1,...,N} such that the sets F+ and F- are separated by a single hyperplane in the space of inputs î^μ(m) - linear separability. In other words, the weights must satisfy the RM conditions

    Σ_{j=1}^{N} w_j î_j^μ(m) > h + δ  if î^μ(m) ∈ F+,     Σ_{j=1}^{N} w_j î_j^μ(m) < h - δ  if î^μ(m) ∈ F-                    (4.2)
The perceptron convergence theorem [54] for the time-summating neuron may be stated as follows [9]: Suppose that the weights are updated according to the perceptron learning-rule. That is, at each iteration choose an input î^μ(m) from either F+ or F- and update the weights according to the rule

    w_j → w_j + ( o^μ(m) - θ(w·î^μ(m) - h) ) î_j^μ(m)                    (4.3)

If there exists a set of weights that satisfy equation (4.2) for some δ > 0, then the perceptron learning-rule (4.3) will arrive at a solution of (4.2) in a finite number of time-steps, independent of N.
Figure 7. Example of (a) separable and (b) non-separable sets F+ and F- associated with the sequences of input-output mappings defined in the text.
The above result implies that a time-summating binary neuron can learn the set of mappings {I^μ(t); t = 1,...,R} → {o^μ(t); t = 1,...,R}, μ = 1,...,M, provided that the associated classes F+ and F- are linearly separable. We shall illustrate this with a simple example for N = 2 [9]. Define the vectors A = (1,0) and B = (0,1), and consider the mappings A B → 1 0 and B A → 0 0. This is essentially an ordering problem, since the pattern A produces the output 1 or 0 depending on whether it precedes or follows the pattern B. (Thus it could not be solved by a standard perceptron). Using equation (4.1) we introduce the four vectors
    î¹(1) = A = (1,0),   î¹(2) = kA + B = (k,1),   î²(1) = B = (0,1),   î²(2) = kB + A = (1,k)                    (4.4)
It is clear that the sets F+ = {î¹(1)} and F- = {î¹(2), î²(1), î²(2)} are linearly separable (Figure 7a) and, hence, that the neuron can learn the above mappings. On the other hand, the neuron is not able to learn the mappings A B → 1 1 and B A → 0 0, since the associated sets F± cannot be linearly separated by a single line, see Figure 7b. This is analogous to the exclusive-OR problem for the perceptron [54].
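The ordering example can be verified directly. The sketch below is our own (the training loop, threshold and decay-rate are arbitrary choices); it builds the vectors (4.4) and applies the perceptron rule (4.3):

    import numpy as np

    k, h = 0.5, 0.5                          # decay rate and threshold (arbitrary)
    A, B = np.array([1.0, 0.0]), np.array([0.0, 1.0])

    def hat_inputs(seq):                     # equation (4.1)
        out, acc = [], np.zeros(2)
        for I in seq:
            acc = k * acc + I
            out.append(acc.copy())
        return out

    # mappings A B -> 1 0 and B A -> 0 0
    data = (list(zip(hat_inputs([A, B]), [1, 0]))
            + list(zip(hat_inputs([B, A]), [0, 0])))

    w = np.zeros(2)
    for epoch in range(100):
        for x, target in data:
            out = 1 if w @ x - h >= 0 else 0
            w += (target - out) * x          # perceptron rule (4.3)
    print(w, [(1 if w @ x - h >= 0 else 0) for x, _ in data])

Since the classes of Figure 7a are separable, the rule converges to weights reproducing the target outputs.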
Figure 8. The sets F+ and F- associated with the mapping C C C → 1 0 1, C = (1,1), for (a) a single time-summating binary neuron and (b) a two-layer network in which the time-summating input neurons have different decay-rates, k_1 ≠ k_2.
Another example of a non-separable problem is shown in Figure 8a, which describes the sets F+ and F- associated with the mapping C C C → 1 0 1, where C = (1,1). One way to handle this and the previous example is to use a feedforward network with an input layer of time-summating neurons. More specifically, consider a two-layer network consisting of N time-summating neurons, each with a linear output function, connected to a single standard binary neuron as output (Figure 9). For an input sequence (I(1),...,I(R)), the activation state of the jth input neuron is V_j(m) = Σ_r k_j^r I_j(m-r) and the activation state of the output neuron is V(m) = Σ_j w_j V_j(m). In the special case k_j = k, the network is equivalent, in terms of its input-output transformations, to the time-summating neuron considered
above. However, the network representation immediately suggests a useful generalisation in which the decay-rate associated with each input line is site-dependent. (The perceptron convergence theorem still holds when k is replaced by k_j in (4.1)). It is clear that the network can solve a wider range of tasks than a single time-summating binary neuron. A simple example of this is the problem of Figure 8; the sets F± become linearly separable when the two input neurons have different decay-rates (Figure 8b).
Figure 9. A two-layer time-summating network.
One can also use a time-summating neural network to solve non-linearly separable tasks by introducing one or more hidden layers. In the example of Figure 7b this may be achieved with two hidden units that separate the classes F± by two lines. However, as with standard networks, it is necessary to replace the perceptron learning-rule with an algorithm such as back-error-propagation (BEP) [50]. In order to implement BEP, the threshold function of all binary neurons must be replaced by a monotonically increasing, smooth function such as a sigmoid. Consider, for example, a three-layer network with an input layer of linear time-summating neurons, a hidden layer and an output layer, both of which consist of standard neurons with sigmoidal output functions f, f(x) = 1/(1 + e^{-x}). For an input sequence (I(1),...,I(R)), the input-output transformation realised by the network at time m is

    O_i(m) = f( Σ_j w_ij f( Σ_p w_jp V_p(m) ) ),    V_p(m) = Σ_r k_p^r I_p(m-r)                    (4.5)
where w_jp is the weight from the pth input unit to the jth hidden unit and w_ij is the weight from the jth hidden unit to the ith output unit.
Given a desired output sequence (o(1),...,o(R)), the network can be trained using a form of BEP that minimises the error E = Σ_{i,m} (O_i(m) - o_i(m))². That is, the weights and decay-rates, denoted collectively by ζ, are changed according to a gradient descent procedure, Δζ = -η ∂E/∂ζ, where η is the learning-rate. The gradients ∂E/∂ζ are calculated iteratively by back-propagating errors from the output to the hidden layer [50]. For the weights we have,

    ∂E/∂w_ij = -Σ_m ε_i(m) f( Σ_p w_jp V_p(m) ),    ∂E/∂w_jp = -Σ_m ε_j(m) V_p(m)                    (4.6)
where ε_i(m) and ε_j(m) are the errors at the output and hidden layers respectively,

    ε_i(m) = (o_i(m) - O_i(m)) f'( Σ_j w_ij O_j^h(m) ),    ε_j(m) = f'( Σ_p w_jp V_p(m) ) Σ_i w_ij ε_i(m)                    (4.7)

with O_j^h(m) = f(Σ_p w_jp V_p(m)) denoting the output of the jth hidden unit.
Similarly, for the decay-rates,

    ∂E/∂k_p = -Σ_m ( Σ_j ε_j(m) w_jp ) Σ_{r≥1} r k_p^{r-1} I_p(m-r)                    (4.8)
(Note that the above implementation of BEP differs from that of Mozer [52], who considers neurons that have their output rather than their activation state fed back as input). So far we have only discussed the problem of learning sequences of input-output mappings. However, with small modifications, a two-layer, time-summating neural network can also be trained to store and retrieve temporal sequences [8,9]. Suppose that the input layer consists of N linear time-summating neurons and the output layer of N standard binary neurons; all neurons in the input layer are connected to all neurons in the output layer. Thus, we effectively have N independent networks of the form shown in Figure 9. The network stores a sequence (I(0),...,I(R)) by learning to output the pattern I(m+1) on presentation of the previous patterns I(0),...,I(m), for m = 0,...,R-1. For each output neuron, this is achieved by applying a perceptron-like learning-rule of the form (4.3) to the set of weights connected to that neuron. In other words, for each i = 1,...,N,

    w_ij → w_ij + ( I_i(m+1) - θ( Σ_j w_ij V_j(m) - h ) ) V_j(m)                    (4.9)
where w_ij is the weight from the jth input neuron to the ith output neuron. A schematic diagram of the learning-phase is shown in Figure 10. Once the network has been successfully trained, the full sequence may be recalled by simply seeding the network with I(0) and feeding the output of the network at each subsequent time-step back to the input layer via delay lines. During recall, all weights are held fixed. The presence of the time-summating neurons in the input layer, with the possible addition of one or more hidden layers, allows the disambiguation of complex sequences. (Recall the discussion of complex sequences in section 2). This may be shown [9] using geometrical arguments similar to those illustrated in Figures 7 and 8.
Figure 10. Schematic diagram of the learning-phase of a two-layer time-summating network for the storage of a temporal sequence {I(0),...,I(R)}. The network learns to match its actual output O(m) at time m with the desired output, which is the next pattern of the sequence, I(m+1). Dotted line indicates feedback during recall.
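A minimal version of this storage-and-recall scheme is sketched below (our own illustration; the one-hot sequence and parameters are invented, and convergence is guaranteed only when the induced classes are linearly separable):

    import numpy as np

    k, h, N = 0.5, 0.0, 3
    # a short sequence I(0),...,I(3) of one-hot patterns (illustrative)
    seq = [np.array(p, dtype=float) for p in
           [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 0, 0)]]

    def traces(patterns):
        """Activity traces V(m) of the time-summating input layer."""
        acc, out = np.zeros(N), []
        for I in patterns:
            acc = k * acc + I
            out.append(acc.copy())
        return out

    W = np.zeros((N, N))
    for epoch in range(200):                 # learning phase, cf. rule (4.9)
        V = traces(seq[:-1])
        for m in range(len(seq) - 1):
            out = (W @ V[m] - h >= 0).astype(float)
            W += np.outer(seq[m + 1] - out, V[m])

    # recall phase: seed with I(0), feed outputs back via the delay lines
    acc, I = np.zeros(N), seq[0]
    recalled = [I]
    for m in range(len(seq) - 1):
        acc = k * acc + I
        I = (W @ acc - h >= 0).astype(float)
        recalled.append(I)
    print(all((r == s).all() for r, s in zip(recalled, seq)))   # True if stored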
pattern within a sequence depends on subsequent rather than prior patterns of that sequence. Nevertheless it is possible to extend the timesummating model so that such problems may be handled. (See section 5).
pattern within a sequence depends on subsequent rather than prior patterns of that sequence. Nevertheless it is possible to extend the time-summating model so that such problems may be handled. (See section 5).
4.2 Temporal correlations. In section 4.1, we showed how time-summating neurons can solve certain tasks in the time domain by storing previous inputs in the form of an activity trace. Another consequence of the activity trace is that temporal correlations are set up between the activation states of the neuron on presentation of an input sequence [11,9]. To illustrate this point, consider a time-summating binary neuron (or its network equivalent) being presented with sequences of input patterns of the form {Y(1),...,Y(R)}, where Y(t) = I(t) +

as required. On the other hand, if u is orthogonal to x,

    (I - ηxx^T)u = u - ηxx^Tu = u.
That there can be no more eigenvalues follows from the fact that x together with the set of all vectors orthogonal to x span the input vector space.
∎
Recall that the matrix norm of any m×n matrix A corresponding to the Euclidean norm in ℝ^n is defined by

    ||A||_2 = max_{||v||_2 = 1} ||Av||_2 .

This expression defines a norm on the mn-dimensional space of matrices, with the property that for any n-vector v, ||Av||_2 ≤ ||A||_2 ||v||_2.
Lemma 2.2
Provided 0 ≤ η ≤ 2/||x||², we have ||B||_2 = ρ(B) = 1, where ρ(B) is the spectral radius of B, i.e. the magnitude of its eigenvalue of largest modulus.
Proof The 2-norm and spectral radius of a symmetric matrix are the same: see [Isaacson and Keller, 1966, p10, equation (11)], noting that the eigenvalues of A² are the squares of those of A. That the spectral radius is 1 follows from Lemma 2.1.
∎
Now suppose we actually have t pattern vectors x_1,...,x_t. We will assume temporarily that these span the space of input vectors, i.e. if the x's are n-vectors, then the set of pattern vectors contains n linearly independent ones. (This restriction will be removed later.) For each pattern vector x_p, we will have a different matrix B, say B_p = (I - η x_p x_p^T). Let A = B_t B_{t-1} ... B_1.
Lemma 2.3
If 0 < η < 2/||x_p||² holds for each training pattern x_p, and if the x_p span, then ||A||_2 < 1.
Proof By definition, there exists v such that ||A||_2 = ||Av||_2 and ||v||_2 = 1. Thus ||A||_2 = ||B_t B_{t-1} ... B_1 v||_2 ≤ ||B_t B_{t-1} ... B_2||_2 ||B_1 v||_2 (from the definition of the norm). We identify two cases:
Case 1) If v^T x_1 ≠ 0, then ||B_1 v||_2 < 1, since the component of v in the direction of x_1 is reduced (see Lemma 2.1: if this is not clear, write v in terms of x_1 and the perpendicular component, and apply B_1 to it). On the other hand, ||B_t B_{t-1} ... B_2||_2 ≤ ||B_t||_2 ||B_{t-1}||_2 ... ||B_2||_2 = 1.
Case 2) If v^T x_1 = 0, then B_1 v = v (Lemma 2.1). Hence ||A||_2 = ||B_t B_{t-1} ... B_2 v||_2, and we may carry on removing B's until Case 1 applies. ∎
Remark
In theory, at least, one could compute the η which is optimal in the sense of minimising ||A||_2. However this is unlikely to be worthwhile unless we can get an efficient algorithm. A common way to apply the delta rule is to apply the patterns x_1, x_2, ..., x_t in order, and then to start again cyclically with x_1. The presentation of one complete set of patterns is called an epoch. Assuming this is the strategy employed, iteration (2.1) yields
    w_{k+1} = A w_k + η h                    (2.2a)
where A is as defined above and

    h = y_1 (B_t B_{t-1} ... B_2) x_1 + ... + y_{t-1} B_t x_{t-1} + y_t x_t .                    (2.2b)
Here, of course, y_p denotes the target y value for the pth pattern, not the pth element of a vector. Note that the B's, and hence h, depend on η and the x's, but not on the current w. Since δw in the delta rule is proportional to the error in the outputs, we get a fixed point of (2.1) only if all these errors can be made zero, which obviously is not true in general. Hence the iteration (2.1) does not in fact converge in the usual sense. On the other hand, we have shown (Lemma 2.3) that provided the x_p span the space of input vectors, then ||A||_2 < 1 for sufficiently small η. Hence the mapping F(w) = Aw + ηh satisfies

    ||F(w_1) - F(w_2)||_2 = ||A(w_1 - w_2)||_2 ≤ ||A||_2 ||w_1 - w_2||_2 ,

i.e. it is contractive with contraction parameter ||A||_2. It follows from the Contraction Mapping Theorem that the iteration (2.2a) does have a fixed point. The Contraction Mapping Theorem may be found in most textbooks of functional analysis or dynamical systems: see for instance [Vidyasagar, 1978, p73]. The theorem also guarantees that the fixed point is unique. Now if there exists a w that makes all the errors zero, then it is easy
to verify that this w is a fixed point of (2.1) and hence also of (2.2a). Otherwise, (2.1) has no fixed points, and the fixed point of (2.2a) depends on η: we denote it by w(η). In the limit, as the iteration (2.1) runs through the patterns, it will generate a limit cycle of vectors w_k, returning to w(η) after the cycle of t patterns has been completed.
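These limit cycles are easy to exhibit numerically. The sketch below is our own illustration (the patterns and targets are invented, and chosen so that no exact solution exists); it runs the cyclic delta rule and compares the end-of-epoch iterate with the least squares solution w*:

    import numpy as np

    X = np.array([[1.0, 1.0], [1.0, -1.0], [1.0, 0.0]])   # patterns x_p as rows (invented)
    y = np.array([1.0, -1.0, 0.5])                        # inconsistent targets (invented)

    eta = 0.1                                 # satisfies eta < 2/||x_p||^2 for all p
    w = np.zeros(2)
    for epoch in range(200):
        for x, t in zip(X, y):
            w += eta * (t - w @ x) * x        # per-pattern delta rule, iteration (2.1)
        if epoch % 50 == 49:
            print(epoch + 1, w)               # end-of-epoch iterate approaches w(eta)

    w_star = np.linalg.lstsq(X, y, rcond=None)[0]
    print("w* =", w_star)

The end-of-epoch vectors settle to a fixed w(η) close to w*, with an O(η) discrepancy, as Theorem 2.4 below makes precise.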
2.1.1. Dependence of the limit cycle on η
Since w(η) is a fixed point of (2.2a), we have (writing h = h(η) and A = A(η) to emphasise the dependence)

    w(η) = A(η)w(η) + ηh(η) .                    (2.3)
Now what can we conclude about w(η)? Let us denote by w* the weight vector w (unique since the x_p span) that minimises the least squares error

    ε² = Σ_{p=1}^{t} (y_p - w^T x_p)² .                    (2.4a)

Denote by X the matrix whose columns are x_1, x_2, ..., x_t, and let

    L = XX^T = Σ_{p=1}^{t} x_p x_p^T .
Then w* satisfies the normal equations

    Lw* = Σ_{p=1}^{t} y_p x_p = h(0) .                    (2.4b)
The second equality follows from (2.2b), observing that all the B matrices tend to the identity as η → 0. On the other hand, from (2.3) we get H(η)w(η) = h(η), where H(η) = (I - A(η))/η. Assuming L^{-1} exists, define the condition number κ(L) by κ(L) = ||L^{-1}||_2 ||L||_2. Since L is symmetric and positive definite, it is easy to see that κ(L) is equal to the ratio of the largest and smallest eigenvalues of L (compare [Isaacson and Keller, 1966, p10, equation (11)]). A standard result on the solutions of linear equations [Isaacson and Keller, 1966, p37] gives, provided ||L - H(η)||_2 < 1/||L^{-1}||_2,

    ||w(η) - w*||_2 / ||w*||_2 ≤ κ(L) [ ||L - H(η)||_2/||L||_2 + ||h(η) - h(0)||_2/||h(0)||_2 ] / [ 1 - ||L^{-1}||_2 ||L - H(η)||_2 ] .                    (2.5)

But

    A(η) = (I - η x_t x_t^T)(I - η x_{t-1} x_{t-1}^T) ... (I - η x_1 x_1^T),

and considering powers of η in this product we obtain

    A(η) = I - ηL + O(η²) .

Thus H(η) = L + O(η). Also an examination of the products in (2.2b) reveals h(η) = h(0) + O(η). This gives the first part of the following theorem.
Theorem 2.4 Suppose that the pattern vectors x_p span, and that w* is (as above) the weight vector which minimises the least squares error of the outputs over all patterns. If w(η) is a weight vector obtained by applying the delta rule with fixed η until the limit cycle is achieved, then as η → 0,
a) w(η) - w* = O(η);
b) if ε(η) is the root mean square error corresponding to w(η) (see (2.4a)), and ε* is the corresponding error for w*, then ε(η) - ε* = O(η²).
Proof a) follows from the remarks immediately preceding the theorem. The condition that the x_p span is necessary and sufficient for L to be non-singular: see (2.7) below. Note, however, that it does not really matter if we look only at the end of the epoch of patterns: the result will apply to any w in the limit cycle. b) is simply the observation that for a least squares approximation problem, the vector of errors for the best vector w* is orthogonal to the space of possible w's, so an O(η) error in w* yields an O(η²) increase in the root mean square error. We will omit the details of this argument. ∎
Remark In actual fact, Theorem 2.4 is not quite as satisfactory as it may appear, since the bound (2.5) depends on ||L^{-1}||_2. Although the spanning of the patterns x_p is sufficient to guarantee the existence of L^{-1}, in practice the norm can be very large, as we shall see later. These results are illustrated by the following small numerical experiment. This used four input patterns each with a single output. These were
The first test used patterns 1 to 3 only. Since the input patterns are independent 3-vectors, the least squares error can in this case be made zero, and it is easily verified that in fact this occurs with w = (8). The spectral radius and ||A||_2 for various values of η are shown in Table 2.1.

Table 2.1
η       ρ(A)     ||A||_2
0.1     0.967    0.967
0.2     0.928    0.929
0.3     0.881    0.888
0.4     0.823    0.848
0.5     0.750    0.866
0.6     0.654    0.939
0.7     0.525    1.035
With η = 0.7 we would expect that once the algorithm had settled down, the error would be almost halved after each epoch (complete cycle of patterns), and this indeed proved to be the case. The algorithm converged to the expected weight vector, and the root sum of squares of the output errors was 5.33E-3 after 10 epochs, and 9.34E-9 after 30 epochs. We then repeated the tests with all 4 patterns. In this case, the output errors cannot all be made zero, and we expect a limit cycle. The corresponding values of spectral radius and two norm are given in Table 2.2. For small η, A is, of course, nearly symmetric, and the spectral radius and norm are almost the same. However, this property degenerates quite rapidly with increasing η, and for η > 0.2734088 (7dp), the largest eigenvalue of A occurs as a complex conjugate pair. It would of course be unwise to make general deductions from this very small and simple example, but it is interesting to note that for this data, at least, ρ(A) is significantly smaller than ||A||_2 when η is greater than about 0.3.
Table 2.2
η       ||A||_2   ρ(A)
0.1     0.935     0.933
0.2     0.868     0.852
0.3     0.816     0.725
0.4     0.819     0.640
0.5     0.866     0.612
0.6     0.929     0.677
0.7     1.006     0.786
The delta rule algorithm itself was run for this data with several values of η. With η = 0.35 (the approximate minimiser of the two norm) the final value of the root sum of squares error was 0.53502, and convergence was obtained to 3dp after only 15 epochs. With η = 0.1 the final error obtained was 0.50157, but convergence to 3dp took 55 epochs. The exact least squares minimum is 0.5, and we observe that the error for η = 0.1 is correct to nearly 3dp, as suggested by Theorem 2.4b). The matrix L = XX^T in this case is

    ( 4  2  2 )
    ( 2  2  1 )
    ( 2  1  2 )
It has eigenvalues 6.372 (3dp), 1 and 0.628 (3dp). Therefore κ(L) = 10.15 (2dp). An interesting, and as yet unexplained, observation from the experiments is that when the limit cycle is reached, the error in the output for each individual pattern, after the correction for that pattern has been applied, was the same for all patterns.
2.1.2. When the patterns do not span
One case we have not yet considered is when the x_p fail to span. At first sight this may seem unlikely, but in some applications, for instance in vision, it is quite possible to have more data per pattern than the number of distinct patterns. In this case we have more free weights than desired outputs, so we would normally expect to be able to get zero error. However this may not always be possible: it depends on whether the vector of required outputs is in the column span of the matrix whose rows are the input patterns. Since the row and column rank of a matrix are equal, we can guarantee zero error for any desired outputs if: i) the number of weights ≥ the number of patterns, and ii) the patterns are linearly independent.
(For a discussion of interpolation properties of semilinear feedforward networks, see [Martin, J.M., 1990].) Now if the patterns do not span the input space, Lemma 2.3 fails to apply. However this is only a minor complication. Any weight vector w can be written as w = w̄ + w^c, where w̄ ∈ span(x_1,...,x_t) and w^c is in the orthogonal complement of this space. It follows from Lemma 2.1 that the matrix A simply leaves the w^c part invariant. Thus, a simple extension of the arguments of Lemma 2.3 and the remarks following it shows that the mapping F(w) = Aw + ηh is contractive over the set of vectors which share a common vector w^c. Thus we will get convergence to a limit cycle. Note, however, that the particular limit cycle obtained depends in this case on the w^c part of the initial weight vector. A more serious problem is that L will be rank deficient. In this case there is no unique best w*, and (2.5) fails. This problem can in principle be tackled using the singular value decomposition tools introduced in Section 3, but we omit the discussion in this paper. There is another reason for looking closely at the properties of L, as we shall now see.
2.2. The “epoch method”
Since we are assuming that we have a fixed and finite set of patterns x_p, p = 1,...,t, an alternative strategy is not to update the weight vector until the whole epoch of patterns has been presented. This idea is initially attractive, since it can be shown that this actually generates the steepest descent direction for the least squares error. We will call this the "epoch method" to distinguish it from the usual delta rule. This leads to the iteration

    w_{k+1} = Ω w_k + η h(0),                    (2.6)

where Ω = (I - ηXX^T) = (I - ηL). (2.6) is, of course, the equivalent of (2.2a), not (2.1), since it corresponds to a complete epoch of patterns. There is no question of limit cycling, and, indeed, a fixed point will be a true least squares minimum. Unfortunately, however, there is a catch! To see what this is,
we need to examine the eigenvalues of Ω. Clearly L = XX^T is symmetric and positive semi-definite, so it has real non-negative eigenvalues. In fact, provided the x_p span, it is (as is well known) strictly positive definite. To see this we note that for any vector v,

    v^T L v = Σ_{p=1}^{t} v^T (x_p x_p^T) v = Σ_{p=1}^{t} (v^T x_p)² .                    (2.7)
Since the patterns span, at least one of the quantities in brackets must be non-zero. The eigenvalues of Ω are 1 - η times the corresponding eigenvalues of XX^T, and for a strictly positive definite matrix all the eigenvalues must be strictly positive. Thus we have, for η sufficiently small, ρ(Ω) = ||Ω||_2 < 1. Hence the iteration (2.6) will converge, provided the patterns span and η is sufficiently small. But how small does η have to be? (Recall that for the usual delta rule we need only the condition of Lemma 2.3.) To answer this question we need more precise estimates for the spectrum of L and the norm of Ω. From these we will be able to see why the epoch algorithm does not always work well in practice. Suppose L = XX^T has eigenvalues {λ_j, j = 1,...,n}, with λ_1 ≥ λ_2 ≥ ... ≥ λ_n > 0. The eigenvalues of Ω are (1 - ηλ_1) ≤ (1 - ηλ_2) ≤ ... ≤ (1 - ηλ_n), and ρ(Ω) = max{ |1 - ηλ_1|, |1 - ηλ_n| }. (Observe that Ω is positive definite for small η, but ceases to be so when η becomes large.) Now

    λ_1 = max_{||v||_2 = 1} v^T L v .                    (2.8)

Thus from (2.7),

    v^T L v = Σ_{p=1}^{t} (v^T x_p)² ≤ Σ_{p=1}^{t} ||x_p||_2²   for ||v||_2 = 1.

Hence

    λ_1 ≤ Σ_{p=1}^{t} ||x_p||_2² .                    (2.9)

Note that this upper bound, and hence the corresponding value of η required, is computable. On the other hand, we can get a lower bound by substituting a particular v into the expression on the right-hand side of (2.8). For instance, we have for any k, k = 1,...,t,

    λ_1 ≥ ||x_k||_2^{-2} Σ_{p=1}^{t} (x_k^T x_p)² .

We next consider some special cases.
Case 1: The x_p collapse towards a single vector. This situation can arise in practice
when the neural net is required to separate two classes of vectors which lie close together, in the sense that the angle between their means is small. For definiteness, suppose for some fixed v, ||v||_2 = 1, x_p = v + εe_p, where e_p is the pth standard basis vector. Then considering (2.8) and (2.9) we see that lim_{ε→0} λ_1 = t. Also, lim_{ε→0} λ_n = 0: this follows simply from the fact that the rank of XX^T collapses to 1.
Case 2: The x_p cluster around two vectors u and v which are mutually orthonormal. If these represent two classes which are to be separated, we are in the ideal situation for machine learning. However, even in this case the behaviour of the epoch method is less than ideal. If the clusters are of equal size, we have from (2.8)
lim_{ε→0} λ_1 ≥ t/2, and again lim_{ε→0} λ_n = 0, since the rank of L = XX^T collapses to 2.
Case 3: The example considered at the end of 2.1.1. Here n = 3 and, as was described above, λ_1 = 6.372, λ_3 = 0.628. From these values, the eigenvalues of Ω = (I - ηL) are easily calculated; ρ(Ω) is given in Table 2.3.
Table 2.3
η       ρ(Ω)
0.1     0.937
0.2     0.874
0.3     0.912
0.4     1.550
0.5     2.186
0.6     2.823
0.7     3.461
A comparison of this table with Table 2.2 clearly illustrates the rapid growth of ρ(Ω) compared with ρ(A) as η increases from zero. In these cases we will need a relatively small value of η to get the spectral radius less than 1. In practice, the epoch method does tend to be less stable than the ordinary delta rule. An interesting open question is whether we always have ρ(A) < ρ(Ω). Note also that
λ_n will be very small, and the corresponding eigenvalue of Ω close to 1: however, since small eigenvalues correspond to vectors nearly orthogonal to the span of the patterns, the exact effect of this is not immediately clear. This issue is addressed further in Section 3.1: see equation (3.3). Another way of looking at these results is to consider the iterations as approximations to the gradient descent differential equation. Provided the x_p span, this differential equation (for the linear perceptron) obviously has a single, asymptotically stable fixed point which is globally attractive. However, at least in the three cases above, the differential equation is stiff. (Compare Section 1.2). This stiffness is severe in the first two cases, and marginal in Case 3. The epoch iteration (2.6) is simply Euler's approximation to the descent differential equation. Euler's method is not stiffly stable, and so only mimics the behaviour of the differential equation for very small η. The iteration (2.2) provides a kind of stiffly stable approximation to the differential equation, albeit having the disadvantage of an oscillatory solution.
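The contrast between the two iterations is easy to check directly. The sketch below is our own (the patterns are invented for illustration); it computes ρ(A) and ||A||_2 for the cyclic rule, and ρ(Ω) for the epoch rule, over a range of η:

    import numpy as np

    X = np.array([[1.0, 1.0], [1.0, -1.0], [1.0, 0.0]])   # patterns x_p as rows (invented)
    n = X.shape[1]
    L = X.T @ X                               # L = sum_p x_p x_p^T

    for eta in (0.1, 0.3, 0.5, 0.7):
        A = np.eye(n)
        for x in X:                           # A = B_t ... B_1, applying B_1 first
            A = (np.eye(n) - eta * np.outer(x, x)) @ A
        Omega = np.eye(n) - eta * L           # epoch iteration matrix
        rhoA = max(abs(np.linalg.eigvals(A)))
        normA = np.linalg.norm(A, 2)
        rhoO = max(abs(np.linalg.eigvals(Omega)))
        print(f"eta={eta}: rho(A)={rhoA:.3f}  ||A||2={normA:.3f}  rho(Omega)={rhoO:.3f}")

As η grows, ρ(Ω) typically exceeds 1 well before ρ(A) does, consistent with the instability of the epoch method noted above.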
2.3. Generalisation to non-linear systems
As is well known, the usefulness of linear neural systems is limited, since many pattern recognition problems are not linearly separable. We need to generalise to non-linear systems such as the backpropagation algorithm for the multi-layer perceptron (or semilinear feedforward) network. Clearly we can only expect this type of analysis to provide a local result: as discussed in Section 1, global behaviour is likely to be more amenable to dynamical systems or control theory approaches. Nevertheless, a local analysis can be useful in discussing the asymptotic behaviour near a local minimum. The obvious approach to this generalisation is to attempt the "next simplest" case, i.e. the backpropagation algorithm. However, this method looks complicated when written down explicitly: in fact much more complicated than it actually is! A more abstract line of attack turns out to be both simpler and more general. We will define a general non-linear delta rule, of which backpropagation is a special case. Suppose the input patterns x to our network are in ℝⁿ, and we have a vector w of parameters in ℝᴹ describing the particular instance of our network, i.e. the vector of synaptic weights. For a single layer perceptron with m outputs, the "vector" w is the m×n weight matrix, and thus M = mn. For a multilayer perceptron, w is the Cartesian product of the weight matrices in each layer. For a general system with m outputs, the network computes a function G: ℝᴹ × ℝⁿ → ℝᵐ, say
v = G(w,x)
where v ∈ ℝᵐ. We equip ℝᴹ, ℝᵐ and ℝⁿ with suitable norms ||·||. Since these spaces are finite dimensional, it does not really matter which norms, but for definiteness say the Euclidean norm ||·||₂. For pattern xₚ, denote the corresponding output by vₚ, i.e.
vₚ = G(w,xₚ) .
We assume that G is Fréchet differentiable with respect to w, and denote by D = D(w,x) the m×M matrix representation of the derivative with respect to the standard basis. Readers unfamiliar with Fréchet derivatives may prefer to think of this as the gradient
vector: for m = 1 it is precisely the row vector representing the gradient when G is differentiated with respect to the elements of w. Thus, for a small change δw and fixed x, we have (by the definition of the derivative)
G(w + δw, x) = G(w,x) + D(w,x)δw + o(||δw||) .
(2.10)
On the other hand for given w, corresponding to a particular pattern xₚ, we have a desired output yₚ and thus an error εₚ given by
εₚ² = (yₚ − vₚ)ᵀ(yₚ − vₚ) = qₚᵀqₚ ,
say.
(2.11)
The total error is obtained by summing the εₚ² over the t available patterns, thus
ε² = Σₚ₌₁ᵗ εₚ² .
An ordinary descent algorithm will seek to minimise ε². However, the class of methods we are considering generates, not a descent direction for ε², but rather successive steepest descent directions for εₚ². Now for a change δqₚ in qₚ we have from (2.11)
δ(εₚ²) = (qₚ + δqₚ)ᵀ(qₚ + δqₚ) − qₚᵀqₚ = 2δqₚᵀqₚ + δqₚᵀδqₚ .
Since yₚ is fixed,
δqₚ = −δvₚ = −D(w,xₚ)δw + o(||δw||)
by (2.10). Thus
δ(εₚ²) = −2( D(w,xₚ)δw )ᵀ( yₚ − G(w,xₚ) ) + o(||δw||) = −2δwᵀD(w,xₚ)ᵀ( yₚ − G(w,xₚ) ) + o(||δw||) .
Hence, ignoring the o(||δw||) term, and for a fixed size of small change δw, the largest decrease in εₚ² is obtained by setting
δw = η D(w,xₚ)ᵀ( yₚ − G(w,xₚ) ) .
This is the generalised delta rule. Compare this with the single output linear perceptron, for which the second term in this expression is scalar, with G(w,xₚ) = wᵀxₚ, and the derivative is the gradient vector (considered as a row vector) obtained by differentiating this with respect to w, i.e. xₚᵀ. Thus we indeed have a generalisation of (2.1).
Hence, given a kth weight vector wₖ, we have
wₖ₊₁ = wₖ + δwₖ = wₖ + η D(wₖ,xₚ)ᵀ( yₚ − G(wₖ,xₚ) ) .  (2.12)
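The update (2.12) is easy to state in code. The sketch below is ours: G and D are placeholders for any differentiable network and its Jacobian, and the linear special case shows how (2.1) is recovered.

```python
import numpy as np

def delta_rule_step(w, x, y, G, D, eta):
    """One step of (2.12): w <- w + eta * D(w,x)^T (y - G(w,x))."""
    return w + eta * D(w, x).T @ (y - G(w, x))

# Special case: single-output linear perceptron, G(w,x) = w^T x, D(w,x) = x^T,
# which is exactly the ordinary delta rule (2.1).
G_lin = lambda w, x: np.array([w @ x])
D_lin = lambda w, x: x.reshape(1, -1)

w = np.zeros(3)
w = delta_rule_step(w, np.array([1.0, 0.0, 2.0]), np.array([1.0]), G_lin, D_lin, 0.1)
```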
The backpropagation rule [Rumelhart and McClelland, 1986, pp. 322–328] is, of course, a special case of this. To proceed further, we need to make evident the connection between (2.12) and (2.1). However, there is a problem in that, guided by the linear case considered above, we actually expect a limit cycle rather than convergence to a minimum. Nevertheless it is necessary to fix attention on some neighbourhood of a local minimum of (2.11), say w*: clearly we cannot expect any global contractivity result, as in general (2.12) may have many local minima, as is well known in the backpropagation case. Now from (2.10) and (2.12) we obtain (assuming continuity and uniform boundedness of D in a neighbourhood of w*)
wₖ₊₁ = wₖ + η D(wₖ,xₚ)ᵀ( yₚ − G(w*,xₚ) − D(w*,xₚ)(wₖ − w*) ) + o(||wₖ − w*||)
    = (I − η D(wₖ,xₚ)ᵀD(w*,xₚ)) wₖ + η D(wₖ,xₚ)ᵀ( yₚ − G(w*,xₚ) + D(w*,xₚ)w* ) + o(||wₖ − w*||) .  (2.13)
The connection between (2.13) and (2.1) is now clear. Observe that the update matrix (I − η D(wₖ,xₚ)ᵀD(w*,xₚ)) is not exactly symmetric in this case, although it will be nearly so if wₖ is close to w*. More precisely, let us assume that D(w,x) is Lipschitz continuous at w*, uniformly over the space of pattern vectors x. Then we have
wₖ₊₁ = (I − η D(w*,xₚ)ᵀD(w*,xₚ)) wₖ + η D(w*,xₚ)ᵀ( yₚ − G(w*,xₚ) + D(w*,xₚ)w* ) + o(||wₖ − w*||) .  (2.14)
Suppose we apply the patterns x₁,…,xₜ cyclically, as for the linear case. If we can prove that the linearised part (i.e. what we would get if we applied (2.14) without the o term) of the mapping wₖ ↦ wₖ₊ₜ is contractive, it will follow by continuity that there is a neighbourhood of w* within which the whole mapping is contractive. This is because, by hypothesis, we have only a finite number of patterns. To establish contractivity of the linear part, we may proceed as follows. First observe that D(w*,xₚ)ᵀD(w*,xₚ) is positive semi-definite. Thus for η sufficiently small, ||I − ηD(w*,xₚ)ᵀD(w*,xₚ)||₂ ≤ 1. We may decompose the space of weight vectors into the span of the eigenvectors corresponding to zero and non-zero eigenvalues respectively. These spaces are orthogonal complements of each other, as the matrix is symmetric. On the former space, the iteration matrix does nothing. On the latter space it is contractive provided
η < 2/ρ( D(w*,xₚ)ᵀD(w*,xₚ) ) .  (2.15)
We may then proceed in a similar manner to Lemma 2.3, provided the contractive subspaces for each pattern between them span the whole weight space. If this condition fails then a difficulty arises, since the linearised product mapping will have norm 1, so the
non-linear map could actually be expansive on some subspace. For brevity, we will not pursue this detail here. In the case of a single output network, D(w,x) is simply a row vector, and any acceleration strategy for the linear algorithm based on Lemmas 2.1 and 2.3 should be fairly easy to generalise to the non-linear case. Even for the multi-output case, (2.15) suggests using ρ( D(wₖ,xₚ)D(wₖ,xₚ)ᵀ ) to control the choice of learning rate η. This matrix only has the same order as the number of outputs, which in many applications is quite small. It is regrettable that (2.14) has to be based on nearness to a local minimum w*, but it is difficult to see how to avoid this. The fact that an algorithm generates a descent direction for some Liapunov function is not sufficient to force contractivity. The latter property in Lemma 2.3 arises from the convexity of the quadratic error surface. Nearness to a local minimum in (2.14) enforces this property locally. Nevertheless, the fact that the generalised delta rule only uses information local to a particular pattern xₚ, rather than the whole pattern matrix X, would still seem to be a very desirable property to have in view of the results of Sections 2.1 and 2.2. There is little to say about the special case of backpropagation, other than that for a multilayer perceptron the derivative D(w,x) is relatively easy to calculate: this is what makes backpropagation of the error possible.
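For a multi-output network this suggests the following inexpensive learning-rate check (our sketch; D is a stand-in Jacobian, not a particular network's). The non-zero eigenvalues of DᵀD and DDᵀ coincide, so the small m×m product suffices.

```python
import numpy as np

def max_stable_eta(D):
    """Bound from (2.15), computed via the m x m matrix D D^T."""
    return 2.0 / np.linalg.eigvalsh(D @ D.T)[-1]

D = np.random.default_rng(1).standard_normal((2, 50))   # m = 2 outputs, M = 50 weights
print(max_stable_eta(D))
```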
3. FILTERING, PRECONDITIONING AND THE SINGULAR VALUE DECOMPOSITION
Equation (2.14) shows that the non-linear delta rule can be locally linearised, and then behaves like the linear case. For simplicity, therefore, we shall largely restrict attention in this section to linear networks.
3.1. Singular values and principal components
As is apparent from Section 2 (see e.g. (2.7), (2.13)), matrices of the form YYᵀ are of importance in the context of non-linear least squares. We also pointed out after (2.7) that an analysis of the case when the matrix XXᵀ is rank deficient, or nearly so, is important in studying these problems. Not surprisingly, therefore, this problem has received considerable attention in the literature of both numerical analysis and multivariate statistics. (See e.g. the chapters by Wilkinson and Dennis in [Jacobs, 1977], pages 3–53 and 269–312 respectively; also Chapter 6 of [Ben-Israel and Greville, 1974].) The exposition given here is based on that of Wilkinson, who deals, however, with complex matrices. We will consider only real matrices. The key construction is the singular value decomposition (SVD).
Theorem 3.1 (Singular Value Decomposition) Let Y be any m×n real matrix. Then
a) Y may be factorised in the form
Y = PSQᵀ  (3.1)
where P and Q are orthogonal matrices (m×m and n×n respectively), and S is an m×n matrix which is diagonal in the sense that sᵢⱼ = 0 if i ≠ j. The diagonal elements σᵢ = sᵢᵢ are non-negative, and may be assumed ordered so that σ₁ ≥ σ₂ ≥ … ≥ σ_min(m,n) ≥ 0. These σ's are called the singular values of Y. (Some authors, including [Ben-Israel and Greville, 1974], define the singular values to be the non-zero σ's.)
b) The columns of Q are eigenvectors of YᵀY, the columns of P are eigenvectors of YYᵀ, and the non-zero singular values are the positive square roots of the non-zero eigenvalues of YᵀY, or equivalently of YYᵀ.
c) ||Y||₂ = σ₁.
d) If n ≤ m, then for any v ∈ ℝⁿ, ||Yv||₂ ≥ σₙ||v||₂.
Proof In fact, b) is used to prove a). We consider first the case n ≥ m. The matrix YᵀY is n×n, symmetric and positive semi-definite. Then with Q as defined in b), and assuming that the eigenvalues σᵢ² of YᵀY are arranged in non-increasing order, we have
QᵀYᵀYQ = diag(σᵢ²) .  (3.2)
If Y is of rank r, then the last n − r of the σᵢ are zero. Let qᵢ denote the ith column of Q, and
pᵢ = Yqᵢ/σᵢ ,  i = 1,…,r.
It follows from (3.2) that the pᵢ form an orthonormal set. If r < m, extend this set to an orthonormal basis for ℝᵐ by adding additional vectors pᵣ₊₁,…,pₘ. Then for i = 1,…,n we have: ith column of YQ = Yqᵢ = σᵢpᵢ. Thus if P is the orthogonal matrix formed from the columns pᵢ, and S is as defined in a),
YQ = PS ,  or  Y = PSQᵀ .
This completes the proof of a) for n ≥ m. The final part of b), namely that the pᵢ are eigenvectors of YYᵀ, follows from the observation that
YYᵀ = PSQᵀQSᵀPᵀ = PSSᵀPᵀ .
Transposing (3.1) gives Yᵀ = QSᵀPᵀ, from which the case n < m may be deduced. c) is simply the observation [Isaacson and Keller, 1966, p. 10] that ||Y||₂² = ρ(YᵀY), where ρ denotes the spectral radius.
Note that in d) the condition n ≤ m is essential. Otherwise, v could be in the kernel of Y even if σₙ ≠ 0.
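Theorem 3.1 is easy to check numerically; the following sketch (ours) verifies parts a)–c) for a random real matrix using a library SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((4, 3))                    # m = 4, n = 3

P, sigma, Qt = np.linalg.svd(Y)                    # Y = P S Q^T, sigma descending
S = np.zeros((4, 3)); np.fill_diagonal(S, sigma)

assert np.allclose(Y, P @ S @ Qt)                  # a): the factorisation
eig = np.sort(np.linalg.eigvalsh(Y.T @ Y))[::-1]   # eigenvalues of Y^T Y
assert np.allclose(sigma, np.sqrt(np.maximum(eig, 0)))   # b): sigma_i^2 = eigenvalues
assert np.isclose(np.linalg.norm(Y, 2), sigma[0])  # c): ||Y||_2 = sigma_1
```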
It is important to emphasise that Theorem 3.1b) should not be used to compute the singular value decomposition (3.1) numerically, as it is numerically unstable. A popular stable algorithm is due to [Golub and Reinsch, 1970]. As in (2.4), we denote by w* a weight vector which minimises the total least squares error for a single output linear perceptron. We observed that a fixed point of the "epoch method" (2.6) will satisfy the normal equations (2.4b). In the discussion of (2.6), we pointed out that in certain plausible learning situations the matrix X, whose columns are the input patterns, may be rank deficient or nearly so, with the result that the iteration matrix R in (2.6) may have an eigenvalue of 1 or very close to it. The remark was made that this might not matter in practice: we can use the SVD to investigate this further. Replacing Y by X in Theorem 3.1, we write X = PSQᵀ, where P and Q are orthogonal and S is diagonal (but not necessarily square). To be consistent with notation used later, we will call the singular values of X (the diagonal elements of S) ν₁, ν₂,…. If we here denote by y the column vector of outputs (y₁,…,yₜ)ᵀ corresponding to the input patterns x₁,…,xₜ, then the total least squares error ε (2.4a) may be re-written
ε² = ||Xᵀw − y||₂² = ||SᵀPᵀw − Qᵀy||₂² ,
since Q is orthogonal. Now set z = Pᵀw and u = Qᵀy. Then if X has rank r, i.e. r non-zero singular values, then
ε² = Σᵢ₌₁ʳ (νᵢzᵢ − uᵢ)² + Σᵢ₌ᵣ₊₁ᵗ uᵢ² .  (3.3)
It is now obvious that the least squares problem is solved by setting zᵢ* = uᵢ/νᵢ, i = 1,…,r, choosing the other zᵢ* arbitrarily (but sensibly zero), and then setting w* = Pz*. The minimum error is then given by the second sum on the right-hand side of (3.3). We note that if r is less than the dimension of the weight vector, then w* is not unique, but if the undetermined zᵢ are set to zero as suggested, then we get as w* the solution with minimal two-norm. In matrix form, let S# be the matrix defined by sᵢᵢ# = 1/νᵢ, i = 1,…,r, and sᵢⱼ# = 0 otherwise. Then z* = S#u, or
w* = PS#Qᵀy .
The matrix (Xᵀ)# = PS#Qᵀ is called the Moore–Penrose pseudoinverse of Xᵀ: its properties are discussed at length in [Ben-Israel and Greville, 1974]. However, (3.3) makes apparent a fundamental problem with the least squares approach when X is nearly, but not exactly, rank deficient. As indicated in Section 2 in the
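The construction of w* translates directly into code. This sketch (ours; the tolerance for the numerical rank is an arbitrary choice) returns the minimal two-norm least squares weights, and is equivalent to applying the Moore–Penrose pseudoinverse.

```python
import numpy as np

def min_norm_weights(X, y, tol=1e-12):
    """w* = P S# Q^T y for the error ||X^T w - y||_2 (patterns are columns of X)."""
    P, nu, Qt = np.linalg.svd(X, full_matrices=True)
    r = int(np.sum(nu > tol * nu[0]))      # numerical rank
    u = Qt @ y                             # u = Q^T y
    z = np.zeros(X.shape[0])
    z[:r] = u[:r] / nu[:r]                 # z_i* = u_i / nu_i, the rest set to zero
    return P @ z                           # minimal two-norm solution w* = P z*

X = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0]])            # n = 2, t = 3
y = np.array([1.0, 0.0, 1.0])
print(np.allclose(min_norm_weights(X, y), np.linalg.pinv(X.T) @ y))   # True
```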
discussion of the epoch method, this is likely to occur even in problems that are "good" from the learning point of view. Very small, but non-zero, singular values have a large effect both on w* itself and on the error as given by (3.3), although they correspond to "noise" in the pattern matrix X: i.e. they do not give us useful information about separating the input patterns. Indeed the small singular values correspond to similarities in the patterns, not differences. The decomposition of a data matrix by the SVD in this way, in order to determine the important differences, is called by statisticians principal component analysis, and sometimes in the medical field factor analysis. Now let us take another look at the iteration (2.6), namely
wₖ₊₁ = Rwₖ + ηXy ,
where R = (I − ηXXᵀ). In terms of the notation developed here, we have
wₖ₊₁ = (I − ηPSSᵀPᵀ)wₖ + ηPSQᵀy ,
or, with the obvious extension of notation,
zₖ₊₁ = (I − ηSSᵀ)zₖ + ηSu .  (3.4)
At this point the notation becomes a little messy: let us denote by (zₖ)ᵢ the ith element of zₖ. These elements are decoupled by the SVD; more specifically, (3.4) when written elementwise gives
(zₖ₊₁)ᵢ = (1 − ηνᵢ²)(zₖ)ᵢ + ηνᵢuᵢ  for i = 1,…,r, and
(zₖ₊₁)ᵢ = (zₖ)ᵢ  for i = r+1,…,M.
As expected, the iteration has no effect on components corresponding to zero singular values. Moreover, the good news is that with a suitable choice of η the iteration will converge rapidly for those components that "matter", i.e. those connected with large singular values. The bad news is that this will not be apparent from an examination of the least squares error (3.3), as this will be dominated by the slowly convergent terms. This is unfortunate, as many published software packages for backpropagation, for example that accompanying [Rumelhart and McClelland, 1987], use the least mean square error for termination. Various authors, e.g. [Sontag and Sussman, 1991], have suggested that the use of this criterion is not ideal for solving the pattern classification problem. An interesting question for further study might be to see if the delta rule classifies better if it is not iterated to convergence, but only for the principal components as defined in Section 3.2. If so, what stopping condition could be used?
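The decoupling is easy to visualise numerically (our illustration, with made-up singular values): components tied to large νᵢ converge in a few epochs, while a near-zero singular value leaves its component, and hence the total error, creeping.

```python
import numpy as np

nu = np.array([3.0, 1.0, 0.05])            # one near-zero singular value
u = np.ones(3)
eta = 1.0 / nu[0]**2                       # stable for the largest component
z = np.zeros(3)
for _ in range(200):
    z = (1 - eta * nu**2) * z + eta * nu * u
print(nu * z - u)                          # first two residuals ~ 0; third still ~ -0.95
```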
3.2. Data Compression, Perceptrons, and Matrix Approximations
Another important property of the SVD is its use in solving matrix approximation problems. It is possible to use perceptrons for data compression: one of the few occasions on which cascading linear networks is useful. This approach has been discussed in a connectionist context by [Baldi and Hornik, 1989] using the language of principal component analysis. In fact their result is equivalent to a standard result of approximation theory, although the paper is of course of value in pointing out the connection. As before, let us consider a perceptron with input vectors x₁,…,xₜ, and assume that these vectors are in ℝᴹ. Instead of a single output, we now consider t output vectors y₁,…,yₜ in ℝⁿ. The weights thus form an n×M matrix W. The intention here is that n < M. For instance, the input vectors might be bit-mapped images, which we wish to store in compressed form. To decompress the y vectors, we feed them (without thresholding) into a further perceptron with M×n weight matrix V, producing (unthresholded) output vectors o₁,…,oₜ. The idea is that each oᵢ should be a good approximation to the corresponding xᵢ. Of course, rank(VW) ≤ n < M. There is no point in trying to make VW approximate I if there is no commonality in the input patterns: the compression task is then hopeless. So, let us again form the matrix X whose columns are the xᵢ's. Our aim is now to choose W and V to minimise ||X − VWX||_S, where ||·||_S denotes the Schur norm. (The Schur norm of a matrix is simply the square root of the sum of the squares of all its elements.) Matrix approximation problems of this type are discussed in some detail in Chapter 6 of [Ben-Israel and Greville, 1974]: we will give the solution here just for this particular problem. We first observe that for any matrix A and compatible orthogonal matrix O, ||OA||_S = ||A||_S, since, indeed, multiplication of each column of A by O preserves the sum of squares. Similarly for postmultiplication. Now, as above, let X = PSQᵀ be the SVD of X, and suppose rank(X) = r. The crucial stage is to find a matrix H satisfying rank(H) ≤ n and which minimises ||X − PHPᵀX||_S. Once we have H it is not difficult to factorise PHPᵀ to get W and V. But
||X − PHPᵀX||_S = ||(I − PHPᵀ)X||_S = ||(I − PHPᵀ)PSQᵀ||_S = ||(P − PH)S||_S = ||(I − H)S||_S ,
and
||(I − H)S||_S² = Σᵢ νᵢ²( (1 − hᵢᵢ)² + Σ_{j≠i} hⱼᵢ² ) ,
where, as above, we denote the singular values of X by νᵢ. Obviously, at the minimum H is diagonal. But we require rank(H) ≤ n. Thus the minimising H is obtained by setting hᵢᵢ = 1, i = 1,…,min(r,n), and the other diagonal elements to zero. If r ≤ n, there is no loss of information in the compression, and the patterns xₚ are reconstructed exactly. If r > n, then the total error over all patterns is given by Σᵢ₌ₙ₊₁ʳ νᵢ².
It remains to perform the factorisation VW = PHPᵀ. While the choice of H is unique, this is not so for the factorisation. However, since PHPᵀ is symmetric, it makes sense to set V = Wᵀ. In fact we have HHᵀ = H, whence PHPᵀ = PHHᵀPᵀ = PH(PH)ᵀ. PH has (at most) n non-zero columns: we may take these as V, and make W = Vᵀ = the first n rows of Pᵀ. Thus the rows of W are those eigenvectors of XXᵀ corresponding to the largest singular values: the principal components. The effect of W is to project the input patterns xₚ onto the span of these vectors. Specifically, if Y is the matrix whose columns are the compressed patterns yᵢ, i = 1,…,t, and G is the matrix formed from the first n rows of the identity matrix, then
Y = WX = GPᵀX = GSQᵀ .
The importance of the matrices P, Q and S arising from the SVD is clearly illustrated here. Of course, calculation of the SVD is not a natural connectionist way to calculate V and W: as [Baldi and Hornik, 1989] point out, they can be computed using backpropagation. The importance of the SVD is not restricted to discrete time semilinear feedforward networks: see for example [Keen, 1991], where it is shown to be the crucial factor in determining the behaviour of a continuous time representation of a neural feature detector. We remark in passing that the SVD can also be used to solve matrix approximation problems in ||·||₂: indeed the construction of the Moore–Penrose pseudoinverse described above can be regarded in this light.
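A compact numerical version of the compression result (ours; the sizes are arbitrary): W is the first n rows of Pᵀ, V = Wᵀ, and the squared Schur-norm error equals the tail sum of squared singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 20))            # M = 8, t = 20 patterns as columns
n = 3                                       # compressed dimension

P, nu, Qt = np.linalg.svd(X, full_matrices=False)
W = P[:, :n].T                              # compressor: first n rows of P^T
V = W.T                                     # decompressor: V = W^T
err2 = np.linalg.norm(X - V @ W @ X, 'fro')**2
print(np.isclose(err2, np.sum(nu[n:]**2)))  # True: error = sum_{i>n} nu_i^2
```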
3.3. Filters, Preprocessing and Preconditioning
Many authors have commented on the advisability of performing some preprocessing of the input patterns before feeding them to the network. Often the preprocessing suggested is linear. At first sight this seems to be a pointless exercise, for if the raw input data vector is x, with dimension l, say; the preprocessing operation is represented by the n×l matrix T; W is the input matrix of the net; and we denote by the vector h the input to the next layer of the net; then
h = WTx .  (3.5)
Obviously, the theoretical representational power of the network is the same as one with unprocessed input and input matrix WT. However, this does not mean that these preprocessing operations are useless. We can identify at least the following three uses of preprocessing.
i) To reduce work by reducing dimension and possibly using fast algorithms (e.g. the FFT or wavelet transform). (So we do not want to increase the contraction parameter in the delta rule iteration.)
ii) To improve the search geometry by removing principal components of the data and corresponding singular values that are irrelevant to the classification problem.
iii) To improve the stability of the iteration by removing near zero singular values
(which correspond to noise) and clustering the other singular values near to 1: in the language of numerical analysis, to precondition the iteration. We will not address all three of these points directly here. Instead we will derive some theoretical principles with the aid of which the issues may be attacked.
3.3.1. Filters and stability of learning
The first point to consider is the effect of the filter on the stability of the learning process. For simplicity, we consider only the linear case here. We hope, of course, that a suitable choice of filter will make the learning properties better, but the results here show that whatever choice we make, the dynamics will not be made much worse unless the filter has very bad singular values. In particular, we show that if the filter is an orthogonal projection, then the gradient descent mapping with filtering will be at least as contractive as the unfiltered case. Considering first the "epoch method" (2.6), we see from (2.6) that the crucial issue is the relationship between the unfiltered update matrix R = (I − η₁XXᵀ) and its filtered equivalent R' = (I − η₂TXXᵀTᵀ), say. Note that these operators may be defined on spaces of different dimension: indeed, for a sensible filtering process we would expect the filter T to involve a significant dimension reduction. Note also that we have subscripted the learning rates η since they might be different. A natural question is to try to relate the norms of these two operators, and hence the rates of convergence of the corresponding iterations. As in Section 2, we suppose L = XXᵀ has eigenvalues {λⱼ, j = 1,…,n}, with
λ₁ ≥ λ₂ ≥ … ≥ λₙ .
(Note here we assume the x's span, so λₙ ≠ 0. In terms of the singular values νᵢ of X, νᵢ² = λᵢ.)
The eigenvalues of R are (1 − η₁λ₁) ≤ (1 − η₁λ₂) ≤ … ≤ (1 − η₁λₙ), and
ρ(R) = max{ |1 − η₁λ₁| , |1 − η₁λₙ| } ,
with a similar result for the filtered iteration matrix R'. Hence we need to relate the eigenvalues of XXᵀ with those of TXXᵀTᵀ = L', say. Let L' have eigenvalues μ₁ ≥ μ₂ ≥ … ≥ μₙ, and let T have singular values σ₁ ≥ σ₂ ≥ … ≥ σₙ > 0. Note that we are assuming T has full rank n.
Proposition 3.2 With the notation above, μ₁ ≤ σ₁²λ₁ and μₙ ≥ σₙ²λₙ.
Proof The first inequality is straightforward. Since L and L' are symmetric,
μ₁ = ||L'||₂ = ||TXXᵀTᵀ||₂ ≤ ||T||₂ ||XXᵀ||₂ ||Tᵀ||₂ = σ₁²λ₁ .
The second inequality is only slightly more difficult. Let uₙ be the normalised eigenvector of L' corresponding to μₙ. Then
μₙ = uₙᵀL'uₙ = ||XᵀTᵀuₙ||₂² .
But ||XᵀTᵀuₙ||₂² ≥ λₙσₙ², by a double application of Theorem 3.1d). ∎
This result means that ||R'||₂ cannot be much larger than ||R||₂ if T has singular values close to 1. Many of the most useful filters are projections (although many others, e.g. edge detectors, are not). Projections can be defined in arbitrary vector spaces, and orthogonal projections in general inner product spaces, but for simplicity we will here consider only ℝⁿ with its usual inner product.
Definition 3.3 a) A linear mapping P: ℝⁿ → ℝⁿ is said to be a projection if P(Pv) = Pv for all v ∈ ℝⁿ.
b) A projection is said to be orthogonal if (v − Pv)ᵀPv = 0 for all v ∈ ℝⁿ.
Given a subspace S of ℝⁿ, the mapping that takes each vector v to its best least squares approximation from S is an orthogonal projection onto S: in fact it is the only such orthogonal projection with respect to the standard inner product. The orthogonality condition in Definition 3.3b) is simply the normal equations for the least squares approximation problem. We list some properties of projections in the next lemma.
Lemma 3.4 For any projection (excluding the trivial one Pv = 0 for all v):
a) All the eigenvalues of P are 0 or 1.
b) For any norm ||·|| on ℝⁿ we have, for the corresponding operator norm, ||P|| ≥ 1.
c) If P is an orthogonal projection, ||P||₂ = 1, and indeed all the non-zero singular values of the matrix representation of P with respect to an orthonormal basis are 1.
Proof a) An immediate consequence of Definition 3.3a) is that any eigenvalue λ of P must satisfy λ² = λ. The only solutions of this equation are 0 and 1.
b) Recall ||P|| ≥ ||Pw|| for all w ∈ ℝⁿ satisfying ||w|| = 1. Choose w = Pv/||Pv|| for any v ≠ 0.
c) Clearly it is sufficient to prove this for a particular orthonormal basis, since changes between such bases are represented by orthogonal matrices, which will not change the singular values (compare Theorem 3.1). Let S be the image of P, and construct an orthonormal basis for S, say {s₁,…,sᵣ}, where r is the dimension of S and r ≤ n. Extend this basis to an orthonormal basis for ℝⁿ by adding extra vectors {sᵣ₊₁,…,sₙ}. With respect to this basis we have Psⱼ = sⱼ, j = 1,…,r, so the first r columns of the matrix representation of P are the first r columns of the n-dimensional identity matrix: indeed this is true for any projection onto S, not just the orthogonal projection. However, for an orthogonal projection and for j > r, we have
||Psⱼ||₂² = sⱼᵀPsⱼ − (sⱼ − Psⱼ)ᵀPsⱼ = sⱼᵀPsⱼ ,
by Definition 3.3b). Since Psⱼ ∈ S and sⱼ is orthogonal to all elements of S by construction, the first term on the right-hand side is zero. Thus Psⱼ = 0. Hence the last n − r columns of the matrix representation of P are zero vectors. ∎
In practical applications, we do not retain the whole of ℝⁿ when using orthogonal projections. In fact the more usual approach is to start by selecting an orthonormal basis and deciding which r of the n basis vectors to "keep". We may combine Proposition 3.2 and Lemma 3.4c) as follows.
Corollary 3.5 Let {s₁,…,sₙ} be an orthonormal basis for ℝⁿ. Suppose we express each pattern vector x in terms of {s₁,…,sₙ} and then discard the components corresponding to {sᵣ₊₁,…,sₙ}. (Hence x is represented in compressed form by an r-vector, s say, and the problem dimension has been reduced from n to r.) If T is the matrix representing the filter which takes x to s, then (taking η₂ = η₁)
||R'||₂ ≤ ||R||₂ .
Proof Clearly T represents the non-zero part of an orthogonal projection. Thus all its singular values are 1. The result now follows from Proposition 3.2.
∎
This result means that we can apply orthogonal projections to reduce the dimension of our problem without a deleterious effect on the contractivity of the steepest descent operator. We now give some concrete examples.
Examples
i) The discrete Fourier transform. The basis functions sⱼ, whose kth element is defined by (sⱼ)ₖ = e^{2πi(j−1)(k−1)/n}, where i² = −1, are orthogonal in ℂⁿ. Thus we may smooth and compress our patterns xₚ by expressing them in terms of this basis by means
of the Fast Fourier Transform algorithm, and then deleting terms involving high frequencies. (Complex numbers can, of course, be avoided by using sines and cosines.)
ii) Other filters based on orthogonal expansions can in principle also be used: the wavelet basis is of course particularly attractive. In general, the basis functions will not be orthogonal with respect to the discrete inner product, since they are normally defined with respect to the inner product of functions defined on an interval. Nonetheless, it is reasonable to suppose that when the projection operators and inner product are discretised, we will end up with a filter matrix T with singular values close to 1, although a further study of this question would certainly be useful.
iii) A less obvious example concerns pixel averaging in image processing. A common way of reducing the size of a grey scale image is to take pixels in two-by-two blocks and average the grey levels for the four pixels, thus reducing the linear resolution by two and the number of pixels by four. This is not a mapping from ℝⁿ to itself, but we can make it one by replicating each averaged pixel four times, or equivalently, in each block of four pixels we replace each pixel by the average of the four grey levels in the block. Thus if a block of four pixels initially has grey levels g₁, g₂, g₃, g₄, after averaging each pixel will have grey level (g₁+g₂+g₃+g₄)/4. This is obviously a projection: less obviously it is an orthogonal projection. For in each block of pixels we have
(v − Pv)ᵢ = gᵢ − (g₁+g₂+g₃+g₄)/4 ,  i = 1,2,3,4.
Hence
(v − Pv)ᵀPv = [(g₁+g₂+g₃+g₄)/4] · [ g₁+g₂+g₃+g₄ − 4(g₁+g₂+g₃+g₄)/4 ] = 0 .
Thus if correctly implemented, pixel averaging should not reduce the rate of convergence of the delta rule. This appears to be contrary to the results given in [Hand, Evans and Ellacott, 1991], in which it was reported that for a backpropagation net the rate of convergence was degraded. This must be due either to the effect of the non-linearities, or, more likely, to the way in which the iteration was implemented. The authors intend to reconsider the experimental results in the light of this theoretical development. It is unfortunate that we have not as yet obtained similar results for the iteration (2.1). The very stability of (2.1), together with the fact that it contains information only about one pattern, makes it much more difficult to obtain corresponding bounds. However, since (2.1) and (2.6) are asymptotically the same for small η, it is to be expected that "good" filters for (2.6) will also be "good" for (2.1). There is no reason in principle why the results of this section cannot be extended to non-linear networks via (2.14), although they might be more difficult to apply in practice. The crucial issue is not the effect of the filter on a pattern x, but rather on the Fréchet derivative matrix D(w,x). (2.13) shows that principal components of D(w,x) correspond to important directions in the topology of the search space.
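The pixel-averaging claim is also easy to verify in code (our check, on a single two-by-two block; the full image operator is block diagonal with this 4×4 block).

```python
import numpy as np

P = np.full((4, 4), 0.25)                   # each grey level -> block average

assert np.allclose(P @ P, P)                # a projection: P(Pv) = Pv
v = np.random.default_rng(0).standard_normal(4)
assert np.isclose((v - P @ v) @ (P @ v), 0.0)   # orthogonal: (v - Pv)^T Pv = 0
print(np.linalg.svd(P, compute_uv=False))   # singular values: 1, 0, 0, 0
```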
3.3.2. Data compression and the choice of a filter
How do we go about choosing a suitable preprocessing strategy? Usually we search for some insight as to what features of a particular problem are important. This insight may be based on biological analogy, such as attempts to mimic the processing carried out in the human visual cortex; on signal processing grounds (e.g. Fourier, Gaussian or wavelet filters); or on the basis of some mathematical model. However, the key issues are the effect on the learning geometry and the learning dynamics. What algebraic properties should the filter have? One such property has been addressed in the previous section: it should have non-zero singular values close to one, unless an alternative choice can be shown to cluster the singular values of the iteration matrix. Another issue to consider is the reduction in work resulting from compressing the input data. Recall the situation described at (3.5): our raw input data vector is x, with dimension l, say; the preprocessing operation is represented by the n×l matrix T; W is the input matrix of the net; and we denote by the vector h the input to the next layer of the net. Then
h = WTx .  (3.5)
In a typical vision application, we may find that the dimension l of the input pattern x is much greater than the dimension of h, which is the number of nodes in the next layer of the network. This means that WT has very small rank compared with its number of columns. We would like to choose T so that there exists W such that
Vx = WTx  (3.6)
for any V (of appropriate size) and input pattern x, but for which the number of rows of T is not much greater than the rank of W, i.e. the dimension of h. Such a choice would mean that the number of free parameters in the learning system (the number of elements of W compared with V) has been drastically reduced, while the achievable least squares error and the representational power of the network is unchanged. Since (3.6) is to hold for each pattern x, we actually want
VX = WTX ,  (3.7)
where, as previously, X is the matrix whose columns are the patterns. Obviously, we cannot, in general, satisfy (3.7) exactly. However, an approximate solution to this problem is once again provided by the singular value decomposition. Suppose we express X as PSQᵀ, where P and Q are orthogonal and S is the diagonal matrix of singular values of X, say ν₁ ≥ ν₂ ≥ … ≥ ν_l ≥ 0. Then
VX = VPSQᵀ .
Suppose we wish T to be an r×l matrix, with r < l, so that rank T ≤ r. It is natural to choose T = GPᵀ, where G is the r×l matrix with gᵢⱼ = 1 if i = j and 0 otherwise. (T thus represents a truncated Karhunen–Loève expansion.) We obtain
WTX = WGPᵀPSQᵀ = WGSQᵀ .
Clearly the best possible choice of W here is the first r columns of VP, and with this choice the maximum error in replacing VX by WTX will be negligible provided r is sufficiently large that νᵣ₊₁ is negligible compared with ν₁. Moreover, T is an orthogonal projection, so Corollary 3.5 applies. The Karhunen–Loève expansion is thus the natural choice of linear filter for rank reduction. (Non-linear filters such as edge detectors, radial basis networks, or adaptive filters such as that of [Lenz and Osterberg, 1990] could be better, of course.) But the expansion does not "solve" the problem of linear filtering. As has already been pointed out, retaining just the principal components of X may not retain the information that is relevant to a particular classification problem: the components corresponding to smaller singular values may in fact be those that we require! We can only be confident in removing singular values at the level of random noise or rounding error. Even ignoring this problem, there are other objections to routine use of the Karhunen–Loève expansion:
i) Although optimal for rank reduction, the Karhunen–Loève expansion is not optimal as a preconditioner. It does remove very small singular values, but leaves the others unchanged, while we wish to cluster the eigenvalues of XXᵀ.
ii) Determining the Karhunen–Loève expansion is difficult and expensive computationally. Moreover, in a practical learning situation we may not have the whole matrix X available anyway. Even assuming that we do have all the learning patterns x available at once, the very large matrix X will be difficult to construct and store.
iii) The expansion requires a priori knowledge of the actual data: it does not give us a filter that could be hardwired into the learning as part of (say) a robot vision system.
The first point implies that even if we do compute the expansion, we may need to combine it with a preconditioner: we will consider preconditioning below. Now the second point. The standard algorithms for singular value decomposition are not easily implementable on neural networks. As we have seen, the principal component space of X can be computed by backpropagation, but in the linear case this is a computation as expensive as the problem we are trying to solve. In the non-linear case, which is the one of most practical interest, computation of the expansion might be worthwhile, but is still likely to prove challenging and costly. One possible way out of this difficulty has been suggested by [Wickerhauser, 1990]. He recommends a preliminary compression in terms of a wavelet basis which is chosen optimally in the sense of an entropy-like measure called the theoretical dimension. While Wickerhauser uses this as an intermediate stage to compute the Karhunen–Loève expansion, in the neural net context it might be more sensible just to accept the degree of compression provided by the wavelets. A more simple-minded approach, of course, is to compute the filter T on the basis of a representative subset of the data rather than all of it. But we then have the problem of how to pick such a subset.
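As a sketch of the "representative subset" idea (ours; the subsampling rule is an arbitrary choice), the filter T = GPᵀ can be fitted to a fraction of the patterns and then applied to all of them.

```python
import numpy as np

def kl_filter(X_subset, r):
    """Truncated Karhunen-Loeve filter T = G P^T from a subset of the patterns."""
    P, _, _ = np.linalg.svd(X_subset, full_matrices=False)
    return P[:, :r].T                        # the first r principal directions

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 1000))          # l = 50 inputs, t = 1000 patterns
T = kl_filter(X[:, ::10], r=5)               # fit on every tenth pattern
compressed = T @ X                           # r x t compressed data
```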
The third point is the most fundamental. We do not really want a filter tied to a particular set of data: we want a fixed strategy for preprocessing that will cover a large class of different, albeit related, learning problems, such as that used by mammals for vision. Obviously, to have any hope of success we must assume that there is some common structure in the data. What type of structure should we be looking for? As far as data compression is concerned, the information that is most likely to be available and usable is some condition that will guarantee the rapid convergence of some orthogonal expansion or similar approximation process. In practical applications, the pattern vectors x are likely to be spatial or temporal samples of some non-discrete data. In vision applications, grey scale images come from A to D conversion of raster scan TV images. In speech recognition, the incoming analogue waveform is either sampled in time or passed through some frequency analysis process. In many cases, therefore, it should be possible to employ some measure of smoothness: differentiability or modulus of continuity in the space or time domain, or conditions on the rate of convergence to zero at ±∞ of the Fourier transform. Given such conditions on the data, rate of convergence results abound in the approximation theory textbooks for the classical basis functions (polynomials, splines, Fourier series) and are beginning to appear for the newer bases such as wavelets or radial basis functions ([Wickerhauser, 1990], [Powell, 1990]). (Note that this linear use of radial basis functions is different from the radial basis neural nets found in the literature, which use the basis in non-linear fashion to aid the classification problem.) Of course, the use of orthogonal expansions to preprocess neural net data is commonplace on an ad hoc basis, but a rigorous analysis of the data compression and preconditioning effects would seem to be overdue.
3.3.3. The optimal preconditioner
As in the case of data compression, an ideal choice of filter to act as a preconditioner would not require knowledge of the particular data set under consideration. But while the requirement (rapid convergence) for a data compressor is obvious, there does not seem to be any clear handle on the preconditioning problem. Therefore we only consider preconditioners based on a known data matrix X. First observe that the theoretically optimal preconditioner for the iteration (2.6) is both easily described and completely useless! Suppose, as above, X has singular value decomposition PSQᵀ. We set T to be the Moore–Penrose pseudoinverse of X (see the remarks and definition after equation (3.3)), i.e.
T = X# = QS#Pᵀ .
Then
TX = QS#PᵀPSQᵀ = QS#SQᵀ .
Thus
L' = TXXᵀTᵀ = QS#SSᵀS#ᵀQᵀ ,
and S#SSᵀS#ᵀ is a diagonal matrix with diagonal elements either 0 or 1. Thus all the eigenvalues of L' are either 0 or 1, and indeed, if the x's span so that XXᵀ has no zero
eigenvalues, then all the eigenvalues of L' are 1. With η = 1, the iteration (2.6) will converge in a single iteration. This is not surprising, since once we know X#, the least squares solution for w may be given explicitly. For the non-linear case we would need to compute local pseudoinverses for the matrices D(w,x) (compare (2.13) and (2.14)). This amounts to local solution of the tangent linear least squares problem at each iterate, and if we are going to go to such lengths, we would be better off using a standard non-linear least squares solver. Moreover, in practice, as we have seen, XXᵀ is likely to have small eigenvalues, so a stable computation of X# is likely to be difficult. A modification of the approach which might be slightly more practicable is just to remove the large eigenvalues of XXᵀ, based on computation of the dominant singular values, and corresponding singular vectors, of X. We will present an algorithm for removing the principal components one at a time. It should be emphasised that this algorithm has not been tested even for the linear case, and some care would be needed to make it work for non-linear networks. Moreover, whether an approach based on removal of individual singular values is going to be very useful is debatable: it may help if the data matrix X is dominated by a few principal components with large singular values, but otherwise it is likely to be too inefficient. (Methods for simultaneous computation of more of the spectrum, e.g. Arnoldi iteration, do exist. However, they are of course more complicated.) In spite of these reservations, the algorithm here is presented as an indication that a trivially parallelizable method is at least in principle possible, and to indicate the tools with the help of which a practicable method might be constructed. The first stage is to compute the largest eigenvalue and corresponding eigenvector of XXᵀ. This may be carried out by the power method [Isaacson and Keller, 1966, p. 147] at the same time as the ordinary delta rule iteration: we start with an arbitrary vector u₀ and simply perform the iteration
uₖ₊₁ = XXᵀuₖ / ||XXᵀuₖ||₂ .
Since
XXᵀ = Σₚ₌₁ᵗ xₚxₚᵀ ,
the iteration can be performed by running through the patterns one at a time, just as for the delta rule itself. The sequence uₖ will converge to a normalised eigenvector p₁ of XXᵀ corresponding to the largest eigenvalue λ₁ of XXᵀ. λ₁ itself is conveniently estimated from the Rayleigh quotient uₖᵀXXᵀuₖ: see [Isaacson and Keller, 1966, p. 142]. Note that since XXᵀ is symmetric and positive definite, repeated eigenvalues will not cause problems. Having determined p₁ and λ₁, we set
T = I + (λ₁^(−1/2) − 1) p₁p₁ᵀ .  (3.8)
We have XXᵀ = PDPᵀ where, of course, p₁ is the first column of P and λ₁ the first element of the diagonal matrix D. Since P is an orthogonal matrix,
I = PPᵀ = Σᵢ₌₁ⁿ pᵢpᵢᵀ ,
and, of course,
XXᵀ = PDPᵀ = Σᵢ₌₁ⁿ λᵢpᵢpᵢᵀ .
Hence, with T as in (3.8), we have
TXXᵀ = λ₁^(1/2) p₁p₁ᵀ + Σᵢ₌₂ⁿ λᵢpᵢpᵢᵀ ,
since the pᵢ's are orthonormal. By a similar calculation,
TXXᵀTᵀ = p₁p₁ᵀ + Σᵢ₌₂ⁿ λᵢpᵢpᵢᵀ .
Thus TXXᵀTᵀ has the same eigenvectors as XXᵀ, and the same eigenvalues but with λ₁ replaced by 1. As indicated by (3.5), each pattern xₚ should then be multiplied by T, and, since we are now iterating with different data, the current weight estimate w should be multiplied by T⁻¹. It is easy to check that
T⁻¹ = I + (λ₁^(1/2) − 1) p₁p₁ᵀ .
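Putting the pieces together, here is an illustrative version of the whole step (ours; as stressed above, the algorithm itself is untested in the original): power iteration for (p₁, λ₁), then the rank-one correction (3.8), after which the eigenvalue λ₁ of TXXᵀTᵀ has become 1.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 40))

u = rng.standard_normal(5)
for _ in range(500):                         # power method on X X^T;
    u = X @ (X.T @ u)                        # can be accumulated pattern by pattern
    u /= np.linalg.norm(u)
lam1 = u @ (X @ (X.T @ u))                   # Rayleigh quotient estimate

T = np.eye(5) + (lam1**-0.5 - 1.0) * np.outer(u, u)   # the correction (3.8)
before = np.linalg.eigvalsh(X @ X.T)
after = np.linalg.eigvalsh(T @ (X @ X.T) @ T.T)
print(before[-1], "->", after)               # lambda_1 replaced by ~1, rest unchanged
```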
Since λ₁ is a "large" eigenvalue, this process is well conditioned. Observe that calculation of Tx and T⁻¹w can each be achieved with the evaluation of only a single inner product. The process can of course be repeated to convert further eigenvalues to 1. Basically the same idea can be used for the iteration (2.2). However, there is a problem in that the matrix A is not exactly symmetric, although it is nearly so for small η. This could be overcome by computing the right as well as the left eigenvectors of the iteration matrix A, but unfortunately this would require presenting the patterns in reverse order: somewhat inconvenient for a neural system. Another possibility is to perform two cycles of the patterns, with the patterns in reverse order on the second cycle. The composite iteration matrix AᵀA will then be symmetric. However, since we have seen that (3.8) processes
principal components of the data, it might be better just to use the preconditioner (3.8) instead. Although space and the requirements of simplicity do not permit a full discussion here, there is no reason in principle why this algorithm should not be applied to the non-linear case. However, a consideration of (2.14) indicates that it would not be appropriate just to process the input data: preconditioning based on the entire derivative D(w,x) is required. For the general projected descent algorithm (2.4), this could be very complicated. However, the special semi-linear structure of the multi-layer perceptron and similar architectures suggests that we think of the preconditioning as defining a "two-ply" network. (3.5) amounts to a factorisation of the input layer of an MLP, which we could think of as being made up of two sub-layers, with units between whose activation function is simply the identity. Similarly we could factorise every layer into two sub-layers or "plies". In the composite network, the top ply of each layer would be trained by normal backpropagation, whereas the lower ply, trained by an algorithm such as that outlined above, could be thought of as a "slow learning" feature detector whose purpose is to "tune" the network to the general properties of the data being learnt. Note that its properties depend only on the input patterns and the network state, not explicitly on the output values. But further consideration of this idea must be deferred to another occasion.

4. FUTURE DIRECTIONS
We have considered here only the simplest architectures, in order to demonstrate the kind of results that can be proved and the tools that might be used to prove them. Still wanting are detailed analyses of the backpropagation algorithm and other learning algorithms, and of the effect of various filters on non-linear learning. Progress in this direction should yield much improved algorithms and filters, together with a better understanding of the dynamics of learning. Moreover, in this paper we have largely restricted discussion to feedforward networks, although Section 2.3 is much more general. When discussing recursive networks, there are two dynamical processes to consider: the learning dynamics and the intrinsic dynamics of the network itself. It is not sufficient in practice to prove (e.g. using a Liapunov function) that a network is stable, if convergence is very slow. The tools presented in this paper can also be used to analyse and improve the asymptotic behaviour of the network itself, particularly for discrete time realisations. Thus the reader is hopefully convinced of the usefulness of the numerical analysis approach when discussing neural networks. The author does not wish to imply that this is the only mathematical technique of importance. There is a real need to weld the various techniques of analysis into a single coherent theory.
REFERENCES
Baldi, P and Hornik, K, 1989: "Neural networks and principal component analysis: learning from examples without local minima", Neural Networks, vol. 2, no. 1.
Ben-Israel, A and Greville, T N E, 1974: "Generalised inverses, theory and applications", Wiley.
Bunday, B D, 1984: "Basic optimisation methods", Edward Arnold, England.
Ellacott, S W, 1990: "An analysis of the delta rule", Proceedings of the International Neural Net Conference, Paris, pp 956–959, Kluwer Academic Publishers.
Golub, G H and Reinsch, C, 1970: "Singular value decomposition and least squares solutions", Numerische Mathematik, vol. 14, pp 403–420.
Hand, C, Evans, M and Ellacott, S W, 1991: "A neural network feature detector using a multi-resolution pyramid", in "Neural networks for images, speech and natural language", eds. B Linggard, C Nightingale, in press.
Isaacson, E and Keller, H B, 1966: "Analysis of numerical methods", Wiley.
Jacobs, D (ed), 1977: "The state of the art in numerical analysis", Academic Press.
Keen, T K, 1991: "Dynamics of learning in linear feature discovery networks", Network, vol. 2, pp 85–105.
Lenz, R and Osterberg, M, 1990: "Learning filter systems", Proceedings of the International Neural Net Conference, Paris, pp 989–992, Kluwer Academic Publishers.
Martin, J M, 1990: "On the interpolation properties of feedforward layered neural networks", Report NWC TP 7094, Naval Weapons Center, China Lake, CA 93555-6001, USA.
Oja, E, 1983: "Subspace methods of pattern recognition", Research Studies Press, Letchworth, England.
Powell, M J D, 1990: "The theory of radial basis function approximation in 1990", Report DAMTP 1990/NA11, Dept. of Applied Maths. and Theoretical Physics, Silver Street, Cambridge, CB3 9EW, England.
Rumelhart, D E and McClelland, J L, 1986: "Parallel and distributed processing: explorations in the microstructure of cognition", vols. 1 and 2, MIT.
Rumelhart, D E and McClelland, J L, 1987: "Parallel and distributed processing: explorations in the microstructure of cognition", vol. 3, MIT.
Sontag, E D and Sussman, H J, 1991: "Backpropagation separates where perceptrons do", Rutgers Center for Systems and Control, Dept. of Math., Rutgers University, New Brunswick, NJ 08903, USA.
Vidyasagar, M, 1978: "Nonlinear systems analysis", Prentice Hall.
Wickerhauser, M V, 1990: "A fast approximate Karhunen–Loève expansion", preprint, Dept. of Math., Yale University, New Haven, Connecticut 06520.
Mathematical Approaches to Neural Networks
J.G. Taylor (Editor)
© 1993 Elsevier Science Publishers B.V. All rights reserved.
SELF-ORGANIZING NEURAL NETWORKS FOR STABLE CONTROL OF AUTONOMOUS BEHAVIOR IN A CHANGING WORLD
S. Grossberg†
Department of Cognitive and Neural Systems, Boston University, Boston, MA, USA
1. INTRODUCTION: NONLINEAR MATHEMATICS FOR DESCRIBING AUTONOMOUS BEHAVIOR IN A NONSTATIONARY WORLD
The study of neural networks is challenging in part because the field embraces multiple goals. Neural networks to explain mind and brain are not evaluated by the same criteria as artificial neural networks for technology. Both are ultimately evaluated by their success in handling data, but data about behaving animals and humans may bear little resemblance to data that evaluate benchmark performance in technology. Although most artificial neural networks have been inspired by ideas gleaned from mind and brain models, technological applications can sometimes be carried out in an off-line setting with carefully selected data and complete external supervision. The living brain is, in contrast, designed to operate autonomously under real-time conditions in nonstationary environments that may contain unexpected events. Whatever supervision is available derives from the structure of the environment itself. These facts about mind and brain subserve much of the excitement and the intellectual challenge of neural networks, particularly because many important applications need to be run autonomously in nonstationary environments that may contain unexpected events. What sorts of intuitive concepts are appropriate for analysing autonomous behavior that is capable of rapid adaptation to a changing world? What sorts of mathematics can express and analyse these concepts? I have been fortunate to be one of the pioneers who has participated in the discovery and development of core concepts and models for the neural control of real-time autonomous behavior. A personal perspective on these developments will be taken in this chapter. Such a perspective has much to recommend it at this time. So many scientific communities and intellectual traditions have recently converged on the neural network field that a consistent historical viewpoint can simplify understanding. When I began my scientific work as an undergraduate student in 1957, the modern field of neural networks did not exist. My main desire was to better understand how we

† This research was supported in part by the Air Force Office of Scientific Research (AFOSR F49620-92-5-0225), DARPA (AFOSR 90-0083 and ONR N00014-92-J-4015), and the Office of Naval Research (ONR N00014-91-J-4100). The authors wish to thank Cynthia Bradford and Diana J. Meyers for their valuable assistance in the preparation of the manuscript.
humans manage to cope so well in a changing world. This required study of psychological data to become familiar with the visible characteristics of our behavioral endowment. It required study of neurobiological data to better understand how the brain is organized. New intuitive concepts and mathematical models were needed whereby to analyse these data and to link behavior to brain. New mathematical methods were sought to analyse how very large numbers of neural components interact over multiple spatial and temporal scales via nonlinear feedback interactions in real time. These methods needed to show how neural interactions may give rise to behaviors in the form of emergent properties. Essentially no one at that time was trained to individually work towards all of these goals. Many experimentalists were superb at gathering one type of psychological or neurobiological data, but rarely read broadly about other types of data. Few read across experimental disciplines. Even fewer knew any mathematics or models. The people who were starting to develop Artificial Intelligence favored symbolic mathematical methods. They typically disparaged the nonlinear differential equations that are needed to describe adaptive behavior in real time. Even the small number of people who used differential equations to describe brain or behavior often restricted their work to linear systems and avoided the use of nonlinear ones. It is hard to capture today the sense of overwhelming discouragement and ridicule that various of these people heaped on the discoveries of neural network pioneers. Insult was added to injury when their intellectual descendants eagerly claimed priority for these discoveries when they became fashionable years later. Their ability to do so was predicated on a disciplinary isolation of the psychological, neurobiological, mathematical, and computational communities that persisted for years after a small number of pioneers began their work to achieve an interdisciplinary synthesis. Some of the historical factors that influenced the development of neural network research are summarized in Carpenter and Grossberg (1991) and Grossberg (1982a, 1987, 1988). The present discussion summarizes several contributions to understanding how neural models function autonomously in a stable fashion despite unexpected changes in their environments. The content of these models consists of a small set of equations that describe processes such as activation of short term memory (STM) traces, associative learning by adaptive weights or long term memory (LTM) traces, and slow habituative gating or medium term memory (MTM) by chemical modulators and transmitters; a larger set of modules that organize processes such as cooperation, competition, opponent processing, adaptive categorization, pattern learning, and trajectory formation; and a still larger set of neural systems or architectures for achieving general-purpose solutions of modal problems such as vision, speech, recognition learning, associative recall, reinforcement learning, adaptive timing, temporal planning, and adaptive sensory-motor control. Each successive level of model organization synthesizes several units from the previous level.

2. THE ADDITIVE AND SHUNTING MODELS
Two of the core neural network models that I introduced and mathematically analysed in their modern form are often called the additive model and the shunting model. These models were originally derived in 1957-1958 when I was an undergraduate at Dartmouth College.
They describe how STM and LTM traces interact during network processes of activation, associative learning, and recall (Figure 1). It took ten years from their initial discovery and analysis to get them published in the intellectual climate of the 1960’s
(Grossberg, 1967, 1968a, 1968b). A monograph (Grossberg, 1964) that summarizes some of these results was earlier distributed to one hundred laboratories of leading researchers from the Rockefeller Institute where I was then a graduate student.

Figure 1. STM traces (or activities or potentials) xᵢ at cells (or cell populations) vᵢ emit signals along the directed pathways (or axons) eᵢⱼ, which are gated by LTM traces (or adaptive weights) zᵢⱼ before they can perturb their target cells vⱼ. (Reprinted with permission from Grossberg, 1982c.)

Additive STM Equation
dxᵢ/dt = −Aᵢxᵢ + Σⱼ₌₁ⁿ fⱼ(xⱼ)Bⱼᵢzⱼᵢ⁺ − Σⱼ₌₁ⁿ gⱼ(xⱼ)Cⱼᵢzⱼᵢ⁻ + Iᵢ .  (1)

Equation (1) for the STM trace xᵢ includes a term for passive decay (−Aᵢxᵢ), positive feedback (Σⱼ₌₁ⁿ fⱼ(xⱼ)Bⱼᵢzⱼᵢ⁺), negative feedback (−Σⱼ₌₁ⁿ gⱼ(xⱼ)Cⱼᵢzⱼᵢ⁻), and input (Iᵢ). Each feedback term includes a state-dependent nonlinear signal (fⱼ(xⱼ), gⱼ(xⱼ)), a connection, or path, strength (Bⱼᵢ, Cⱼᵢ), and an LTM trace (zⱼᵢ⁺, zⱼᵢ⁻). If the positive and negative feedback terms are lumped together and the connection strengths are lumped with the LTM traces, then the additive model may be written in the simpler form

dxᵢ/dt = −Aᵢxᵢ + Σⱼ₌₁ⁿ fⱼ(xⱼ)zⱼᵢ + Iᵢ .  (2)
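A minimal simulation of the lumped additive model (2) (ours; the decay rates, weights, signal function and inputs are all illustrative choices) shows how the STM traces settle under Euler integration.

```python
import numpy as np

n, dt = 3, 0.01
A = np.ones(n)                               # passive decay rates A_i
z = np.array([[0.0,  0.5, -0.3],
              [0.2,  0.0,  0.4],
              [-0.1, 0.3,  0.0]])            # z[j, i]: lumped LTM trace from v_j to v_i
I = np.array([1.0, 0.0, 0.0])                # external inputs I_i
f = np.tanh                                  # state-dependent nonlinear signal

x = np.zeros(n)                              # STM traces x_i
for _ in range(2000):
    x = x + dt * (-A * x + f(x) @ z + I)     # dx_i/dt = -A_i x_i + sum_j f(x_j) z_ji + I_i
print(x)
```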
Early applications of the additive model included computational analyses in vision, associative pattern learning, pattern recognition, classical and instrumental conditioning, and the learning of temporal order in applications to language and sensory-motor control (Grossberg, 1969a, 1969b, 1969c, 1970a, 1970b, 1971a, 1972a, 1972b, 1974; Grossberg and Pepe, 1971). The additive model has continued to be a cornerstone of neural network research to the present day; see, for example, Amari and Arbib (1982) and Grossberg (1982a). Some physicists unfamiliar with the classical status of the additive model in neural network theory erroneously called it the Hopfield model after they became acquainted with Hopfield’s first application of the additive model in Hopfield (1984), twenty-five years after its discovery; see Section 20. The classical McCulloch-Pitts (1943) model has also erroneously been called the Hopfield model by the physicists who became acquainted with the McCulloch-Pitts model in Hopfield (1982). These historical errors can ultimately be traced to the fact that many physicists and engineers who started studying neural networks in the 1980’s generally did not know the field’s scholarly literature. These errors are
gradually being corrected as new neural network practitioners learn the history of their craft.

A related network equation was found to more adequately model the shunting dynamics of individual neurons (Hodgkin, 1964; Kandel and Schwartz, 1981; Katz, 1966; Plonsey and Fleming, 1969). In such a shunting equation, each STM trace is restricted to a bounded interval [-D_i, B_i], and automatic gain control, instantiated by multiplicative shunting terms, interacts with balanced positive and negative feedback signals and inputs to maintain the sensitivity of each STM trace within its interval.

Shunting STM Equation

    dx_i/dt = -A_i x_i + (B_i - x_i)[I_i + Σ_{j=1}^n f_j(x_j) C_ji z_ji^(+)] - (x_i + D_i)[J_i + Σ_{j=1}^n g_j(x_j) E_ji z_ji^(-)].  (3)
Variations of the shunting equation (3) were also studied (Ellias and Grossberg, 1975) in which the reaction rate of inhibitory STM traces y_i was explicitly represented, as in the system (4)-(5).
Several LTM equations have been useful in applications. Two particularly useful variations have been:

Passive Decay LTM Equation

    dz_ij/dt = -K_ij z_ij + L_ij f_i(x_i) h_j(x_j),  (6)

and

Gated Decay LTM Equation

    dz_ij/dt = h_j(x_j)[-K_ij z_ij + L_ij f_i(x_i)].  (7)
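Before turning to their properties, a minimal numerical sketch contrasts the two decay laws during a retention interval; all parameter values are hypothetical:

    # Contrast of passive decay (6) with gated decay (7) for one LTM trace,
    # during a retention period in which the Hebbian signal f_i(x_i) is silent.
    K, L, dt = 0.5, 1.0, 0.01

    z6 = z7 = 1.0                       # both traces start at 1 after prior learning
    for step in range(2000):
        gate = 1.0 if step < 200 else 0.0   # sampling signal h_j(x_j): open, then closed
        f_i = 0.0                           # postsynaptic signal silent: pure retention
        z6 += dt * (-K * z6 + L * f_i * gate)   # (6): decay is always on
        z7 += dt * gate * (-K * z7 + L * f_i)   # (7): decay only while the gate is open

    print("passive decay (6):", round(z6, 4))   # ~ exp(-10): memory fades
    print("gated decay  (7):", round(z7, 4))    # ~ exp(-1):  memory persists

Once the gate h_j(x_j) closes, the trace in (7) stops changing, whereas the trace in (6) continues to decay exponentially.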
In both equations, a nonlinear learning term f_i(x_i)h_j(x_j), often called a Hebbian term after Hebb (1949), is balanced by a memory decay term. In (6), memory decays passively at a constant rate -K_ij. In (7), memory decay is gated on and off by one of the nonlinear signals. When the gate opens, z_ij tracks f_i(x_i) by steepest descent. A key property of
both equations is that the size of an LTM trace z_ij can either increase or decrease due to learning. Neurophysiological support for an LTM equation of the form (7) was reported two decades after it was first introduced (Levy, 1985; Levy, Brassel, and Moore, 1983; Levy and Desmond, 1985; Rauschecker and Singer, 1979; Singer, 1983).

Extensive mathematical analyses of these STM and LTM equations in a number of specialized circuits led gradually to the identification of a general class of networks for which one could prove invariant properties of associative spatiotemporal pattern learning and recognition (Grossberg, 1969a, 1971b, 1972c, 1982). These mathematical analyses helped to identify those features of the models that led to useful emergent properties. They sharpened intuition by showing the implications of each idea when it was realized within a complex system of interacting components. Some of these results are summarized below.

3. UNITIZED NODES, SHORT TERM MEMORY, AND AUTOMATIC ACTIVATION
The neural network framework and the additive laws were derived in several ways (Grossberg, 1969a, 1969b, 1969f, 1974). My first derivation in 1957-1958 was based on classical list learning data (Grossberg, 1961, 1964) from the serial verbal learning and paired associate paradigms (Dixon and Horton, 1968; Jung, 1968; McGeogh and Irion, 1952; Osgood, 1953; Underwood, 1966). List learning data force one to confront the fact that new verbal units are continually being synthesized as a result of practice, and need not be the obvious units which the experimentalist is directly manipulating (Young, 1968). All essentially stationary concepts, such as the concept of information itself (Khinchin, 1967), hereby became theoretically useless. By putting the self-organization of individual behavior in center stage, I realized that the phenomenal simplicity of familiar behavioral units, and the synthesis of these units into new representations which themselves achieve phenomenal simplicity through experience, should be made a fundamental property of the theory.

To express the phenomenal simplicity of familiar behavioral units, I represented them by indecomposable internal representations, or unitized nodes, v_i, i = 1, 2, ..., n. This hypothesis gained support from the (now classical) paper of Miller (1956) on the Magic Number Seven, which appeared at around the time I was doing this derivation. In this work, Miller described how composites of familiar units can be "chunked", or unitized, into new units via the learning process. Miller used the concept of information to analyse his results. This concept cannot, however, be used to explain how chunking occurs. A neural explanation of the Magic Number Seven is described in Grossberg (1978a, 1986); see also Cohen and Grossberg (1986).

Data concerning the manner in which humans learn serial lists of verbal items led to the first derivation of the additive model. These data were particularly helpful because the different error distributions and learning rates at each list position suggested how each list item dynamically senses and learns from a different spatiotemporal context. It was, for example, known that practicing a list of items such as AB could also lead to learning of BA, a phenomenon called backward learning. A list such as ABC can obviously also be learned, however, showing that the context around item B enables forward learning of BC to supersede backward learning of BA.
To simplify the discussion of such interactive phenomena, I will consider only associative interactions within a given level in a coding hierarchy, rather than the problem of how coding hierarchies develop and interact between several levels. All of these conclusions have been generalized to a hierarchical setting (Grossberg, 1974, 1978a, 1980a).

4. BACKWARD LEARNING AND SERIAL BOWING
Backward learning effects and, more generally, error gradients between nonadjacent, or remote, list items (Jung, 1968; McGeogh and Irion, 1952; Murdock, 1974; Osgood, 1953; Underwood, 1966) suggested that pairs of nodes v_i and v_j can interact via distinct directed pathways e_ij and e_ji over which adaptive signals can travel. An analysis of how a node v_i could know where to send its signals revealed that no local information exists at the node itself whereby such a decision could be made. By the principle of sufficient reason, the node must therefore send signals towards all possible nodes v_j with which it is connected by directed paths e_ij. Some other variable must exist that discriminates which combination of signals can reach their target nodes based on past experience. These auxiliary variables turned out to be the long term memory traces. The concept that each node sends out signals to all possible nodes subsequently appeared in models of spreading activation (Collins and Loftus, 1975; Klatsky, 1980) to explain semantic recognition and reaction time data.

The form that the signaling and learning laws should take was suggested by data about serial verbal learning. During serial learning, a subject is presented with one list item at a time and asked to predict the next item before it occurs. After a rest period, the list is presented again. This procedure continues until a fixed learning criterion is reached. A main paradox about serial learning concerns the form of the bowed serial position curve which relates cumulative errors to list positions (Figure 2a). This curve is paradoxical for the following reason. If all that happened during serial learning was a build-up of interference at each list position due to the occurrence of prior list items, then the error curve should be monotone increasing (Figure 2b). Because the error curve is bowed, and the degree of bowing depends on the length of the rest period, or intertrial interval, between successive list presentations, the nonoccurrence of list items after the last item occurs somehow improves learning across several prior list items. Internal events thus continue to occur during the intertrial interval. The nonoccurrence of future items can hereby reorganize the learning of a previously occurring list.

The bowed serial position curve showed me that a real-time dynamical theory was needed to understand how these internal events continue to occur even after external inputs cease. It also showed that these internal events can somehow operate "backwards in time" relative to the external ordering of observable list items. These backward effects suggested that directed network interactions exist whereby a node v_i could influence a node v_j, and conversely. Many investigators attributed properties like bowing to one or another kind of rehearsal (Klatsky, 1980; Rundus, 1971). Just saying that rehearsal causes bowing does not explain it, because it does not explain why the middle of the list is less rehearsed. Indeed the middle of the list has more time to be rehearsed than does the end of the list before the next learning trial occurs. In the classical literature, the middle of the list was also said to experience maximal proactive interference (from prior items) and retroactive interference (from future items), but this just labels what we have to explain (Osgood, 1953; Underwood, 1966).
Figure 2. (a) The cumulative error curve in serial verbal learning, plotting cumulative errors against list position, is a skewed bowed curve. Items between the middle and end of the list are hardest to learn. Items at the beginning of the list are easiest to learn. (b) If position-dependent difficulty of learning were all due to interference from previously presented items, the error curve would be monotone increasing. (Reprinted with permission from Grossberg, 1982b.)

The severity of such difficulties led the serial learning expert Young (1968) to write: "If an investigator is interested in studying verbal learning processes ... he would do well to choose some method other than serial learning" (p. 146). Another leading verbal learning expert, Underwood (1966), wrote: "The person who originates a theory that works out to almost everyone's satisfaction will be in line for an award in psychology equivalent to the Nobel prize" (p. 491). It is indicative of the isolated role of real-time modelling in psychology at that time that a theory capable of clarifying the main data effects was available but could not yet get published. Similar chunking and backward effects also occur in a wide variety of problems in speech, language, and adaptive sensory-motor control, so avoiding serial learning will not make the problem go away. Indeed these phenomena may all generally be analysed using the same types of mechanisms.

5. THE NEED FOR A REAL-TIME NETWORK THEORY

The massive backward effect that causes the bowed serial curve forced the use of a real-time theory that can parameterize the temporal unfolding of both the occurrences and the nonoccurrences of events. The existence of facilitative effects due to nonoccurring items also showed that traces of prior list occurrences must endure beyond the last item's presentation time, so they can be influenced by the future nonoccurrences of items. This fact led to the concept of activations, or short term memory (STM) traces, x_i(t) at the nodes v_i, i = 1, 2, ..., n, which are turned on by inputs I_i(t), but which decay at a rate slower than the input presentation rate. As a result, in response to serial inputs, patterns of STM activity are set up across the network's nodes. The combination of serial inputs, distributed internodal signals, and spontaneous STM changes at each node changes the STM pattern as the experiment proceeds. A major task of neural network theory was thus to learn how to think in terms of distributed pattern transformations, rather than just in terms of distributed feature detectors or other local entities. When I first realized this, it was quite a radical notion. Now it is so taken for granted that most people do not realize that it was once an exciting discovery.
Figure 3. Suppose that items r_1, r_2, r_3, r_4, ... are presented serially to nodes v_1, v_2, v_3, v_4, ..., respectively. Let the activity of node v_i at time t be described by the height of the histogram beneath v_i at time t. If each node is initially excited by an equal amount and its excitation decays at a fixed rate, then at every time (each row) the pattern of STM activity across nodes is described by a recency gradient. (Reprinted with permission from Grossberg, 1982b.)
6. EVENT TEMPORAL ORDER VS. LEARNED TEMPORAL ORDER

The general philosophical interest of the bowed error curve can be appreciated by asking: What is the first time a learning subject can possibly know that item r_n is the last list item in a newly presented list r_1 r_2 ... r_n, given that a new item is presented every w time units until r_n occurs? The answer obviously is: not until at least w time units after r_n has been presented. Only after this time passes and no item r_{n+1} is presented can r_n be correctly reclassified from the list's "middle" to the list's "end". The nonoccurrence of future items reclassifies r_n as the "end" of the list. Parameter w is under experimental control and is not a property of the list ordering per se. Spatiotemporal network interactions thus parse a list in a way that is fundamentally different from the parsing rules that are natural to apply to a list of symbols in a computer. Indeed, increasing the event presentation rate, or intratrial interval, w during serial learning can flatten the entire bowed error curve and minimize the effects of the intertrial interval between successive list presentations (Jung, 1968; Osgood, 1953).

To illustrate further the difference between computer models and a real-time network approach, suppose that after a node v_i is excited by an input I_i, its STM trace gets smaller through time due to either internodal competition or to passive trace decay. Then in response to a serially presented list, the last item to occur always has the largest STM trace; in other words, at every time a recency gradient obtains in STM (Figure 3). Given this natural assumption, which, however, is not always true (Bradski, Carpenter, and Grossberg, 1992; Grossberg, 1978a, 1978b), how do the generalization gradients of
Figure 4. At each node v_j, the LTM pattern z_j = (z_j1, z_j2, ..., z_jn) that evolves through time is different. In a list of length n = L whose intertrial interval is sufficiently long, the LTM pattern at the list beginning (j = 1) is a primacy gradient. At the list end (j = L), a recency gradient evolves. Near the list middle (j = L/2), a two-sided gradient is learned. These gradients are reflected in the distribution of anticipatory and perseverative errors in response to item probes at different list positions. (Reprinted with permission from Grossberg, 1982b.)

errors at each list position get learned (Figure 4)? In particular, how does a gradient of anticipatory, or forward, errors occur at the beginning of the list, a gradient of perseverative, or backward, errors occur at the end of the list, and a two-sided gradient of anticipatory and perseverative errors occur near the middle of the list (Osgood, 1953)? Otherwise expressed, how does a temporal succession of STM recency gradients generate an LTM primacy gradient at the list beginning but an LTM recency gradient at the list end? I call this STM-LTM order reversal. This property immediately rules out any linear theory, as well as any theory which restricts itself to nearest neighbor associative links.

7. MULTIPLICATIVE SAMPLING BY SLOWLY DECAYING LTM TRACES OF RAPIDLY EVOLVING STM PATTERNS

The STM and LTM properties depicted in Figures 3 and 4 can be reconciled by positing the existence of STM traces and LTM traces that evolve according to different time scales and rules. Indeed, this reconciliation was one of the strongest arguments that I knew for these rules until neurobiological data started to support them during the 1980's. Suppose that the STM trace of each active node v_j can send out a sampling signal S_j along each directed path e_jk towards the node v_k, k ≠ j. Suppose that each path e_jk contains an LTM trace z_jk at its terminal point, where z_jk can compute, using only local operations, the product of signal S_j and STM trace x_k. Also suppose that the LTM trace decays slowly, if at all, during a single learning trial. The simplest law for z_jk that satisfies these constraints is

    dz_jk/dt = -c z_jk + d S_j x_k,  j ≠ k;  (8)
cf. equation (6). To see how this rule generates an LTM primacy gradient at the list beginning, we need to study the LTM pattern (z_12, z_13, ..., z_1n) and to show that z_12 > z_13 > ... > z_1n. To see how the same rule generates an LTM recency gradient at the list end, we need to study the LTM pattern (z_n1, z_n2, ..., z_{n,n-1}) and to show that z_n1 < z_n2 < ... < z_{n,n-1}. The two-sided gradient at the list middle can then be understood as a combination of these effects.
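A small simulation sketch of equation (8) makes the order reversal visible before the mechanism is traced analytically below; the parameters, the exponential STM decay, and the choice S_j = x_j are illustrative assumptions:

    import numpy as np

    # Serial presentation of a list to nodes v_1 ... v_n, one item every w time units.
    # STM: dx_k/dt = -A*x_k + I_k;  LTM (8): dz_jk/dt = -c*z_jk + d*S_j*x_k, j != k,
    # with S_j = x_j (no threshold).  All parameter values are illustrative.
    n, w, dt = 5, 1.0, 0.01
    A, c, d = 1.0, 0.01, 1.0
    x = np.zeros(n)
    z = np.zeros((n, n))

    for j in range(n + 3):                   # n items, then an empty intertrial interval
        for _ in range(int(w / dt)):
            I = np.zeros(n)
            if j < n:
                I[j] = 1.0                   # item r_{j+1} excites node v_{j+1}
            dz = -c * z + d * np.outer(x, x)     # outer(S, x) with S = x
            np.fill_diagonal(dz, 0.0)            # no self-sampling (j != k)
            x += dt * (-A * x + I)
            z += dt * dz

    print("z_1k (primacy gradient):", z[0, 1:].round(4))    # decreasing
    print("z_nk (recency gradient):", z[-1, :-1].round(4))  # increasing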
By (8), node v_1 sends out a sampling signal S_1 shortly after item r_1 is presented. After rapidly reaching peak size, signal S_1 gradually decays as future list items r_2, r_3, ... are presented. Thus S_1 is largest when trace x_2 is maximal, S_1 is smaller when both traces x_2 and x_3 are active, S_1 is smaller still when traces x_2, x_3, and x_4 are active, and so on. Consequently, the product S_1 x_2 in row 2 of Figure 3 exceeds the product S_1 x_3 in row 3 of Figure 3, which in turn exceeds the product S_1 x_4 in row 4 of Figure 3, and so on. Due to the slow decay of each LTM trace z_1k on each learning trial, z_12 adds up the products S_1 x_2 in successive rows of column 1, z_13 adds up the products S_1 x_3 in successive rows of column 2, and so on. An LTM primacy gradient z_12 > z_13 > ... > z_1n is hereby generated. This gradient is due to the way signal S_1 multiplicatively samples the successive STM recency gradients and the LTM traces z_1k sum up the sampled STM gradients.

By contrast, the signal S_n of a node v_n at the end of the list samples a different set of STM gradients. This is because v_n starts to sample (viz., S_n > 0) only after all past nodes v_1, v_2, ..., v_{n-1} have already been activated on that trial. Consequently, the LTM traces (z_n1, z_n2, ..., z_{n,n-1}) of node v_n encode a recency gradient x_1 < x_2 < x_3 < ... < x_{n-1} at each time. When all the recency gradients are added up through time, the total effect is a recency gradient in v_n's LTM pattern.

In summary, nodes at the beginning, middle, and end of the list encode different LTM gradients because they multiplicatively sample and store STM patterns at different times. Similar LTM gradients obtain if the sequences of nodes which are active at any time selectively excite higher-order nodes, or chunks, which in turn sample the field of excited nodes via feedback signals (Grossberg, 1974, 1978a).

8. MULTIPLICATIVE LTM GATING OF STM-ACTIVATED SIGNALS

Having shown how STM patterns may be read into LTM patterns, we now need to describe how a retrieval probe r_m can read v_m's LTM pattern back into STM on recall trials, whereupon some of the STM traces can be transformed into observable behavior. In particular, how can LTM be read into STM without distorting the learned LTM gradients? The simplest rule generates an STM pattern which is proportional to the LTM pattern that is being read out, and allows distinct probes to each read their LTM patterns into STM in an independent fashion.

To achieve faithful read-out of the LTM pattern (z_m1, z_m2, ..., z_mn) by a probe r_m that turns on signal S_m, let the product S_m z_mi determine the growth rate of x_i. Then LTM trace z_mi gates the signal S_m along e_mi before the gated signal reaches v_i. The independent action of several probes implies that the gated signals S_m z_mi are added, so that the total effect of all gated signals on v_i is Σ_{m=1}^n S_m z_mi. The simplest equation for the STM trace x_i that abides by this rule is the additive equation

    dx_i/dt = -a x_i + b Σ_{m=1}^n S_m z_mi + I_i,  (9)
where -a is the STM decay rate, S_m is the mth sampling signal, z_mi is the LTM trace of pathway e_mi, and I_i is the ith experimental input; cf. equation (2). The reaction of equations (8) and (9) to serial inputs I_i is much more complex than is their response to an isolated retrieval probe r_m. Due to the fact that STM traces may decay slower than the input presentation rate, several sampling signals S_m can be simultaneously active, albeit in different phases of their growth and decay. In fact, this in-
teraction leads to properties that mimic list learning data, but first a technical problem needs to be overcome.

9. BEHAVIORAL CHOICE AND COMPETITIVE INTERACTIONS

Once one accepts that patterns of STM traces are evolving through time, one also needs a mechanism for choosing those activated nodes which will influence observable behavior. Lateral inhibitory feedback signals were derived as a choice mechanism (Grossberg, 1968, 1969b, 1970a). The simplest extension of (9) which includes competitive interactions is

    dx_i/dt = -a x_i + Σ_{m=1}^n S_m^+ b_mi^+ z_mi - Σ_{m=1}^n S_m^- b_mi^- + I_i,  (10)
where S_m^+ b_mi^+ (S_m^- b_mi^-) is the excitatory (inhibitory) signal emitted from node v_m along the excitatory (inhibitory) pathway e_mi^+ (e_mi^-); cf. equation (1). Correspondingly, equation (8) is generalized to

    dz_jk/dt = -c z_jk + d_jk S_j^+ x_k.  (11)
The asymmetry between terms Σ_{m=1}^n S_m^+ b_mi^+ z_mi and Σ_{m=1}^n S_m^- b_mi^- in (10) suggested a modification of (10) and a definition of inhibitory LTM traces analogous to the excitatory LTM traces (8), where such inhibitory traces exist (Grossberg, 1969d). Because lateral inhibition can change the sign of each x_i from positive to negative in (10), and thus change the sign of each z_jk from positive to negative in (8), some refinements of (10) and (8) were needed to prevent absurdities like the following: S_j^+ < 0 and x_k < 0 implies dz_jk/dt > 0; and S_m^+ < 0 and z_mi < 0 implies dx_i/dt > 0. Signal thresholds accomplished this in the simplest way. Letting [ξ]^+ = max(ξ, 0), define the threshold-linear signals

    S_j^+ = [x_j(t - τ_j^+) - Γ_j^+]^+  (12)

and

    S_j^- = [x_j(t - τ_j^-) - Γ_j^-]^+  (13)

in (10) and (11), and modify (10) to read

    dx_i/dt = -a x_i + Σ_{m=1}^n S_m^+ b_mi^+ z_mi - Σ_{m=1}^n S_m^- b_mi^- + I_i.  (14)
Sigmoid, or S-shaped, signals were also soon mathematically shown to support useful computational properties (Grossberg, 1973). These additive equations and their variants have been used by many subsequent modellers.

10. THE SKEWED BOW: SYMMETRY-BREAKING BETWEEN FUTURE AND PAST
One of the most important contributions of neural network models has been to show how behavioral properties can arise as emergent properties due to network interactions. The bowed error curve is perhaps the first behaviorally important emergent property that was derived from a real-time neural network. It results from forward and backward interactions among all the STM and LTM variables across the network.
To explain the bowed error curve, we need to compare the LTM patterns z_j = (z_j1, z_j2, ..., z_jn) that evolve at all list nodes v_j. In particular, we need to explain why the bowed curve is skewed; that is, why the list position where learning takes longest occurs nearer to the end of the list than to its beginning (Figure 2a). This skewing effect contradicts learning theories that assume forward and backward effects are equally strong, or symmetric (Asch and Ebenholtz, 1962; Murdock, 1974). This symmetry-breaking between the future and the past, by favoring forward over backward associations, makes possible the emergence of a global "arrow in time," or the ultimate learning of long event sequences in their correct order, much as we learn the alphabet ABC ... Z despite the existence of backward learning.

A skewed bowed error curve does emerge in the network, and predicts that the degree of skewing will decrease, and the relative learning rate at the beginning and end of the list will reverse, as the network's arousal level increases or its signal thresholds Γ_i^+ decrease to abnormal levels (Grossberg and Pepe, 1971). The arousal and threshold predictions have not yet been directly tested to the best of my knowledge. Abnormally high arousal or low thresholds generate a formal network syndrome characterized by contextual collapse, reduced attention span, and fuzzy response categories that resemble aspects of simple schizophrenia (Grossberg and Pepe, 1970; Maher, 1977).

To understand intuitively what is involved in this explanation of bowing, note that by equation (14), each correct LTM trace z_{j,j+1} (z_12, z_23, z_34, ..., z_{n-1,n}) that is activated by list item r_j may grow at a comparable rate, albeit w time units later than the previous correct LTM trace. However, the LTM patterns z_1, z_2, ..., z_n differ at every list position, as in Figure 4. Thus when a retrieval probe r_j reads its LTM pattern z_j into STM, the entire pattern must influence overt behavior to explain why bowing occurs. The relative size of the correct LTM trace z_{j,j+1} compared to all other LTM traces in z_j will influence its success in eliciting r_{j+1} after competitive STM interactions occur. A larger z_{j,j+1} relative to the sum of all other z_jk, k ≠ j, j+1, should yield better performance of r_{j+1} given r_j, other things being equal. To measure the distinctiveness of a trace z_jk relative to all traces in z_j, I therefore defined the relative LTM traces

    Z_jk = z_jk (Σ_{m≠j} z_jm)^{-1}.  (15)
Equation (15) provides a convenient measure of the effect of LTM on STM after competition acts. By (15), the ordering within the LTM gradients of Figure 4 is preserved by the relative LTM traces; for example, if z_12 > z_13 > ... > z_1n, then Z_12 > Z_13 > ... > Z_1n, because all the Z_1k's have the same denominator. Thus all conclusions about LTM gradients are valid for relative LTM gradients, which are also sometimes called stimulus sampling probabilities.

In terms of the relative LTM traces, the issue of bowing can be mathematically formulated as follows. Define the bowing function B_i(t) = Z_{i,i+1}(t). Function B_i(t) measures how distinctive the ith correct association is at time t. After a list of n items is presented with an intratrial interval w and a sufficiently long intertrial interval W elapses, does the function B_i((n - 1)w + W) decrease and then increase as i increases from 1 to n? Does the minimum of the function occur in the latter half of the list? The answer to both of these questions is "yes."
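For illustration, B_i can be computed within the same kind of simulation sketched in Section 7; the parameters below are again hypothetical:

    import numpy as np

    # Re-run the serial-learning sketch of Section 7 for a longer list and compute
    # the relative LTM traces (15) and the bowing function B_i = Z_{i,i+1}.
    # Parameters are illustrative; S_j = x_j keeps sampling active well beyond 2w.
    n, w, dt, A, c, d = 8, 1.0, 0.01, 1.0, 0.01, 1.0
    x, z = np.zeros(n), np.zeros((n, n))
    for j in range(n + 5):                   # list items, then an intertrial interval
        for _ in range(int(w / dt)):
            I = np.zeros(n)
            if j < n:
                I[j] = 1.0
            dz = -c * z + d * np.outer(x, x)
            np.fill_diagonal(dz, 0.0)
            x += dt * (-A * x + I)
            z += dt * dz

    Z = z / z.sum(axis=1, keepdims=True)     # relative LTM traces (15); z_jj = 0
    B = np.array([Z[i, i + 1] for i in range(n - 1)])
    print("bowing function B_i:", B.round(3))    # dips in the interior of the list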
To understand why this happens, it is necessary to understand how the bow depends upon the ability of a node v_i to sample incorrect future associations, such as r_i r_{i+2}, r_i r_{i+3}, ..., in addition to incorrect past associations, such as r_i r_{i-1}, r_i r_{i-2}, .... As soon as S_i becomes positive, v_i can sample the entire past field of STM traces at v_1, v_2, ..., v_{i-1}. However, if the sampling threshold is chosen high enough, S_i might shut off before r_{i+2} occurs. Thus the sampling duration has different effects on the sampling of past than of future incorrect associations. For example, if the sampling thresholds of all v_i are chosen so high that S_i shuts off before r_{i+2} is presented, then the function B_i(∞) decreases as i increases from 1 to n. In other words, the monotonic error curve of Figure 2b obtains because no node v_i can encode incorrect future associations. Even if the thresholds are chosen so that incorrect future associations can be formed, the function B_i((i + 1)w), which measures the distinctiveness of z_{i,i+1} just before r_{i+2} occurs, is again a decreasing function of i. The bowing effect thus depends on threshold choices which permit sampling durations that are at least 2w in length.

The shape of the bow also depends on the duration of the intertrial interval, because before the intertrial interval occurs, all nodes build up increasing amounts of associative interference as more list items are presented. The first effect of the nonoccurrence of items after r_n is presented is the growth through time of B_{n-1}(t) as t increases beyond the time nw when item r_{n+1} would have occurred in a larger list. The last correct association is hereby facilitated by the absence of interfering future items during the intertrial interval. This facilitation effect is a nonlinear property of the network. Bowing is also a nonlinear phenomenon in the theory, because it depends on a comparison of ratios of integrals of sums of products as they evolve through time.

Mathematical theorems about the bowed error curve and other list learning properties were described in Grossberg (1969c) and Grossberg and Pepe (1971), and reviewed in Grossberg (1982a, 1982b). These results illustrated how STM and LTM processes interact as unitized events occur sequentially in time. Other mathematical studies analysed increasingly general constraints under which distributed STM patterns could be encoded in LTM without bias by arbitrary numbers of simultaneously active sampling nodes acting in parallel. Some of these results are summarized in the next section.

11. ABSOLUTELY STABLE PARALLEL PATTERN LEARNING
Many features of system (10) and (12)-(14) are special; for example, the exponential decay of STM and LTM and the signal threshold rule. Because associative processing is ubiquitous throughout phylogeny and within functionally distinct subsystems of each individual, a more general mathematical framework was needed. This framework needed to distinguish universally occurring associative principles that guarantee essential learning properties from evolutionary variations that adapt these principles to realize specialized skills. I approached this problem from 1967 to 1972 in a series of articles wherein I gradually realized that the mathematical properties used to globally analyze specific learning examples were much more general than the examples themselves. This work culminated in my universal theorems on associative learning (Grossberg, 1969d, 1971a, 1972a). The theorems say that if certain associative laws were invented at a prescribed time during evolution, then they could achieve unbiased associative pattern learning in essentially any
later evolutionary specialization. To the question: Was it necessary to re-invent a new learning rule to match every perceptual or cognitive refinement, the theorems said "no". They enabled arbitrary spatial patterns to be learned by arbitrarily many, simultaneously active sampling channels that are activated by arbitrary continuous data preprocessing in an essentially arbitrary anatomy. Arbitrary space-time patterns can also be learned given modest constraints on the temporal regularity of stimulus sampling. The universal theorems thus describe a type of parallel processing whereby unbiased associative pattern learning occurs despite mutual crosstalk between nonlinear feedback signals. These results obtain only if the network's main computations, such as spatial averaging, temporal averaging, preprocessing, gating, and cross-correlation, are computed in a canonical ordering. This canonical ordering constitutes a general purpose design for unbiased parallel pattern learning, as well as a criterion for whether particular networks are acceptable models for this task. The universality of the design mathematically takes the form of a classification of oscillatory and limiting possibilities that is invariant under evolutionary specializations.

The theorems can also be interpreted in another way that is appropriate in discussions of self-organizing systems. The theorems are absolute stability or global content addressable memory theorems. They show that evolutionary invariants of associative learning obtain no matter how system parameters are changed within this class of systems. Absolutely stable learning is an important property in a self-organizing system because parameters may change in ways that cannot be predicted in advance, notably when unexpected environments act on the system. Absolute stability guarantees that the onset of self-organization does not subvert the very learning properties that make stable self-organization possible.

The systems that I considered constitute the generalized additive model

    dx_i/dt = -A_i x_i + Σ_{k∈J} B_ki z_ki + I_i(t)  (16)

    dz_ji/dt = -C_ji z_ji + D_ji x_i,  (17)
where i and j parameterize arbitrarily large, not necessarily disjoint, sets of sampled and sampling cells, respectively. As in my equations for list learning, A_i is an STM decay rate, B_ki is a nonnegative performance signal, I_i(t) is an input function, C_ji is an LTM decay rate, and D_ji is a nonnegative learning signal. Unlike the list learning equations, A_i, B_ki, C_ji, and D_ji may be continuous functionals of the entire history of the system. Equations (16) and (17) are thus very general, and include many of the specialized associative learning models in the literature. For example, although (16) does not seem to include inhibitory interactions, such interactions may be lumped (say) into the STM decay functional A_i. The choice

    A_i = a_i - (b_i - c_i x_i) G_i(x_i) + Σ_{k=1}^n f_k(x_k) d_ki  (18)
describes the case wherein system nodes compete via shunting, or membrane equation, interactions (Cole, 1968; Grossberg, 1973; Kandel and Schwartz, 1981; Plonsey and Fleming, 1969). The performance, LTM decay, and learning functionals may include slow threshold changes, nonspecific Now Print signals, signal velocity changes, presynaptic modulation, arbitrary continuous rules of dendritic preprocessing and axonal signaling, as well as many other possibilities (Grossberg, 1972a, 1974). Of special importance are the variety of LTM decay choices that satisfy the theorems. For example, a gated LTM law like
    dz_ji/dt = [x_j(t - τ_j) - Γ_j(y^t)]^+ (-d_j z_ji + e_j x_i)  (19)
achieves an interference theory of forgetting, rather than exponential forgetting, since (d/dt) z_ji = 0 except when v_j is sampling (Adams, 1967); cf. equation (7). Equation (19) also allows the vigor of sampling to depend on changes in the threshold Γ_j(y^t) that are sensitive to the prior history y^t = (x_i, z_ji : i ∈ I, j ∈ J) of the system before time t, as in the model of Bienenstock, Cooper, and Munro (1982).

In this generality, too many possibilities exist to as yet prove absolute stability theorems. Indeed, if the performance signals B_ji from a fixed sampling node v_j to all the sampled nodes v_i, i ∈ I, were arbitrary nonnegative and continuous functionals, then the irregularities in each B_ji could override any regularities in z_ji within the gated performance signal B_ji z_ji from v_j to v_i. One further constraint was used to impose some spatiotemporal regularity on the sampling process, as indicated in the next section.

12. LOCAL SYMMETRY, ACTION POTENTIALS, AND UNBIASED LEARNING

Absolute stability obtains even if different functionals B_j, C_j, and D_j are assigned to each node v_j, j ∈ J, just so long as the same functional is assigned to all pathways e_ji, i ∈ I. Where this is not globally true, one can often partition the network into maximal subsets where it is true, and then prove unbiased pattern learning in each subset. This restriction is called the property of local symmetry axes, since each sampling cell v_j can act as a source of coherent history-dependent waves of STM and LTM processing. Local symmetry axes still permit (say) each B_j to obey different history-dependent preprocessing, threshold, time lag, and path strength laws among arbitrarily many mutually interacting nodes v_i. When local symmetry axes are imposed on the generalized additive model in (16) and (17), the resulting class of systems takes the form

    dx_i/dt = -A_i x_i + Σ_{j∈J} B_j z_ji + I_i(t)  (20)

    dz_ji/dt = -C_j z_ji + D_j x_i.  (21)
A change of variables shows, moreover, that constant interaction coefficients b_ji between pairs v_j and v_i of nodes can depend on i ∈ I without destroying unbiased pattern learning in the systems

    dx_i/dt = -A_i x_i + Σ_{j∈J} B_j b_ji z_ji + I_i(t)  (22)

and

    dz_ji/dt = -C_j z_ji + D_j x_i.  (23)
By contrast, the systems (22) and

    dz_ji/dt = -C_j z_ji + D_j b_ji x_i  (24)

are not capable of unbiased parallel pattern learning (Grossberg, 1972a). A dimensional analysis showed that (22) and (23) hold if action potentials transmit the network's intercellular signals, whereas (22) and (24) hold if electrotonic propagation is used. The cellular property of an action potential was hereby formally linked to the network property of unbiased parallel pattern learning.

13. THE UNIT OF LTM IS A SPATIAL PATTERN

These global theorems proved that "the unit of LTM is a spatial pattern". This result was surprising to me, even though I had discovered the additive model. The result illustrates how rigorous mathematics can force insights that go beyond unaided intuition. In the present instance, it suggested a new definition of spatial pattern and showed how the network learns "temporally coherent spatial patterns" that may be hidden in its distributed STM activations through time. This theme of temporal coherence, first mathematically discovered in 1966, has shown itself in many forms since, particularly in recent studies of attention, resonance, and synchronous oscillations (Crick and Koch, 1990; Eckhorn, Bauer, Jordan, Brosch, Kruse, Munk, and Reitbock, 1988; Eckhorn and Schanze, 1991; Gray and Singer, 1989; Gray, Konig, Engel, and Singer, 1989; Grossberg, 1976c; Grossberg and Somers, 1991, 1992).

To illustrate the global theorems that have been proved, I consider first the simplest case, wherein only one sampling node v_0 exists (Figure 5a). Then the network is called an outstar because it can be drawn with the sampling node at the center of outward-facing adaptive pathways (Figure 5b) such that the LTM trace z_0i in the ith pathway samples the STM trace x_i of the ith sampled cell, i ∈ I. An outstar is thus a neural network of the form

    dx_i/dt = -A x_i + B z_0i + I_i(t)  (25)

and

    dz_0i/dt = -C z_0i + D x_i,  (26)
where A, B, C, and D are continuous functionals such that B and D are nonnegative. Despite the fact that the functionals A, B, C, and D can fluctuate in complex system-dependent ways, and the inputs I_i(t) can also fluctuate wildly through time, an outstar can learn an arbitrary spatial pattern

    I_i(t) = θ_i I(t),  (27)

where θ_i ≥ 0 and Σ_{k∈I} θ_k = 1, with a minimum of oscillations in its pattern variables X_i = x_i(Σ_{k∈I} x_k)^{-1} and Z_i = z_0i(Σ_{k∈I} z_0k)^{-1}. These pattern variables learn the temporally coherent weights θ_i in a spatial pattern and factor the input activation I(t) that energizes the process into the learning rate. The Z_i's are the relative LTM traces (15) that played such a central role in the explanation of serial bowing. The limiting and oscillatory behaviors of the pattern variables have a classification that is independent of particular
Figure 5. (a) The minimal anatomy capable of associative learning. For example, during classical conditioning, a conditioned stimulus (CS) excites a single node, or cell population, v_0, which thereupon sends sampling signals to a set of nodes v_1, v_2, ..., v_n. An input pattern representing an unconditioned stimulus (UCS) excites the nodes v_1, v_2, ..., v_n, which thereupon elicit output signals that contribute to the unconditioned response (UCR). The sampling signals from v_0 activate the LTM traces z_0i, i = 1, 2, ..., n. The activated LTM traces can learn the activity pattern across v_1, v_2, ..., v_n that represents the UCS. (b) When the sampling structure in (a) is redrawn to emphasize its symmetry, the result is an outstar, whose sampling source is v_0 and whose sampled border is the set of nodes {v_1, v_2, ..., v_n}. (Reprinted with permission from Grossberg, 1982b.)
choices of A, B, C, D, and I. These properties are thus evolutionary invariants of outstar learning. The following theorem summarizes, albeit not in the most general known form, some properties of outstar learning.

One of the constraints in this theorem is called a local flow condition. This constraint says that a performance signal B can be large only if its associated learning signal D is large. When local flow holds, pathways which have lost their plasticity can be grouped into the total input pattern that is registered in STM for encoding in LTM by other pathways. If the threshold of the performance signal B is no smaller than the threshold of the learning signal D, then local flow is assured. Such a threshold inequality occurs automatically if the LTM trace z_ji is physically interpolated between the axonal signal and the postsynaptic target cell v_i. That is why the condition is called a local flow condition.
Such a geometric interpretation of the location of the LTM trace gives unexpected support to the hypothesis that LTM traces are localized in the synaptic knobs or postsynaptic membranes of cells undergoing associative learning. Here again a network property gives new functional meaning to a cellular property.

Theorem 1 (Outstar Pattern Learning)

Suppose that

(I) the functionals are chosen to keep system trajectories bounded;

(II) a local flow condition holds:

    ∫_0^∞ B(t) dt = ∞  only if  ∫_0^∞ D(t) dt = ∞;  (28)
(III) the UCS is practiced sufficiently often: there exist positive constants K_1 and K_2 such that for all T ≥ 0,

    f(T, T + t) ≥ K_1  if  t ≥ K_2,  (29)

where

    f(U, V) = ∫_U^V I(ξ) exp[-∫_ξ^V A(η) dη] dξ.  (30)
Then, given arbitrary continuous and nonnegative initial data in t ≤ 0 such that Σ_j z_0j(0) > 0:

(A) practice makes perfect: The LTM ratios Z_i(t) are monotonically attracted to the UCS weights θ_i if

    [Z_i(0) - X_i(0)][X_i(0) - θ_i] ≥ 0,  (31)

or may oscillate at most once due to prior learning if (31) does not hold, no matter how wildly A, B, C, D, and I oscillate;

(B) the UCS is registered in STM and partial learning occurs: The limits Q_i = lim_{t→∞} X_i(t) and P_i = lim_{t→∞} Z_i(t) exist, with

    Q_i = θ_i,  for all i.  (32)
(C) If, moreover, the CS is practiced sufficiently often, then perfect learning occurs: if

    ∫_0^∞ D(t) dt = ∞,  then  P_i = θ_i,  for all i.  (33)
Remarkably, similar global theorems hold for systems (20)-(21) wherein arbitrarily many sampling cells can be simultaneously active and mutually signal each other by complex feedback rules (Geman, 1981; Grossberg, 1969d, 1971a, 1972a, 1980b). This is because all systems of the form (20)-(21) can factorize information about how STM and LTM pattern variables learn pattern θ_i from information about how fast energy I_i(t) is being pumped into the system to drive the learning process. The pattern variables Z_ji therefore oscillate at most once even if wild fluctuations in input and feedback signal
energies occur through time. In the best theorems now available, only one hypothesis is not known to be necessary and sufficient (Grossberg, 1972a, 1982a).

When many sampling cells v_j can send sampling signals to each sampled cell v_i, the outstar property that each relative LTM trace Z_ji = z_ji(Σ_{k∈I} z_jk)^{-1} oscillates at most once fails to hold. This is so because the Z_ji of all active nodes v_j track X_i = x_i(Σ_k x_k)^{-1}, while X_i tracks θ_i and the Z_ji of all active nodes v_j. The oscillations of the functions Y_i = max{Z_ji : j ∈ J} and y_i = min{Z_ji : j ∈ J} can, however, be classified much as the oscillations of each Z_i can be classified in the outstar case. Since each Z_ji depends on all z_jk for variable k, each Y_i and y_i depends on all z_jk for variable j and k. Since also each X_i depends on all x_k for variable k, the learning at each v_i is influenced by all x_k and z_jk. No single cell analysis can provide an adequate insight into the dynamics of this associative learning process. The main computational properties emerge through interactions on the network level. Because the oscillations of all X_i, Y_i, and y_i relative to θ_i can be classified, the following generalization of the outstar learning theorem holds.

Theorem 2 (Unbiased Parallel Pattern Learning)

Suppose that

(I) the functionals are chosen to keep system trajectories bounded;

(II) every sampling cell obeys a local flow condition:
    for every j,  ∫_0^∞ B_j dt = ∞  only if  ∫_0^∞ D_j dt = ∞;  (34)
(III) the UCS is presented sufficiently often: there exist positive constants K_1 and K_2 such that (29) holds.

Then, given arbitrary nonnegative and continuous initial data in t ≤ 0 such that Σ_i x_i(0) > 0 and all Σ_i z_ji(0) > 0:

(A) the UCS is registered in STM and partial learning occurs: The limits Q_i = lim_{t→∞} X_i(t) and P_ji = lim_{t→∞} Z_ji(t) exist, with

    Q_i = θ_i,  for all i.  (35)
(B) If the jth CS is practiced sufficiently often, then it learns the UCS pattern perfectly: if

    ∫_0^∞ D_j dt = ∞,  then  P_ji = θ_i,  for all i.  (36)
Because LTM traces z_ji gate the performance signals B_j which are activated by a retrieval probe r_j, the theorem enables any and all nodes v_j which sampled the pattern θ during learning trials to read it out accurately on recall trials. The theorem does not deny that oscillations in overall network activity can occur during learning and recall, but shows that these oscillations merely influence the rates and intensities of learning and recall. In particular, phase transitions in memory can occur, and the nature of the phases can depend on a complex interaction between network rates and geometry (Grossberg, 1969g, 1982a).
Neither Theorem 1 nor Theorem 2 assumes that the CS and UCS are presented at correlated times. This is because the UCS condition keeps the baseline STM activity of sampled cells from ever decaying below the positive value K_1 in (29). For purposes of space-time pattern learning, this UCS uniformity condition is too strong. In Grossberg (1972a), I used a weaker condition which guarantees that CS-UCS presentations are well enough correlated to guarantee perfect pattern learning of a given spatial pattern by certain cells v_a, even if other spatial patterns are presented at irregular times when they are sampled by distinct cells v_b.

14. PATTERN CALCULUS: RETINA, COMMAND CELL, REWARD, ATTENTION, MOTOR SYNERGY
Three simple but fundamental facts emerge from the mathematical analysis of pattern learning: the unit of LTM is a spatial pattern θ = (θ_i : i ∈ I); suitably designed neural networks can factorize the invariant pattern θ from its fluctuating energy; and the size of a node's sampling signal can render it adaptively sensitive or blind to a pattern θ. These concepts helped me to think in terms of pattern transformations, rather than in terms of feature detectors, computer programs, linear systems, or other types of analysis. When I confronted data about other behavioral problems with these pattern processing properties, the conceptual pressure that was generated drove me into a wide-ranging series of specialized investigations.

What is the minimal network that can discriminate θ from background input fluctuations? It looks like a retina, and the θ_i's became reflectances (Grossberg, 1970a, 1972b, 1976b, 1983). What is the minimal network that can encode and/or perform a space-time pattern, or ordered series of spatial patterns? Called an avalanche, it looks like an invertebrate command cell (Grossberg, 1969e, 1970b). How can one synchronize CS-UCS sampling if the time intervals between CS and UCS presentations are unsynchronized? This analysis led to psychophysiological mechanisms of reward, punishment, and attention (Grossberg, 1971b, 1972c, 1972d, 1975). What are the associative invariants of motor learning? Spatial patterns become motor synergies wherein fixed relative contraction rates across muscles occur, and temporally synchronized performance signals read out the synergy as a unit (Grossberg, 1970a, 1974).

15. SHUNTING COMPETITIVE NETWORKS OR ADDITIVE NETWORKS?
These specialized investigations repeatedly led to consideration of competitive systems. For example, the same competitive normalization property that arose during modeling of receptor-bipolar-horizontal cell interactions in the retina (Grossberg, 1970a, 1972b) also arose in studies of the decision rules needed to release the right amount of incentive motivation in response to interacting drives and conditioned reinforcer inputs within midbrain reinforcement centers (Grossberg, 1972c, 1972d). Because these problems were approached from a behavioral perspective, I knew what interactive properties the competition had to have. I typically found that shunting competition had all the properties that I needed, whereas additive competition often did not. Additive networks approximate shunting networks when their activities are far from cell saturation levels (B_i and D_i in equation (3)). When this is not the case, the automatic gain control properties of shunting networks play a major role, as the next section shows.
16. THE NOISE-SATURATION DILEMMA: PATTERN PROCESSING BY COMPETITIVE NETWORKS

One basic property that was shared by all these systems concerned the manner in which cellular tissues process input patterns whose amplitudes may fluctuate over a much wider range than the cellular activations themselves. This theme is invisible to theories based on binary codes, feature detectors, or additive models. All cellular systems need to prevent sensitivity loss in their responses to both low and high input intensities. Mass action, or shunting, competition enables cells to elegantly solve this problem using automatic gain control by lateral inhibitory signals (Grossberg, 1970a, 1970b, 1973, 1980a). Additive competition fails in this task because it does not, by definition, possess an automatic gain control property.

Suppose that the STM traces or activations x_1, x_2, ..., x_n at a network level fluctuate within fixed finite limits at their respective network nodes, as in (3). Setting a bounded operating range for each x_i enables fixed decision criteria, such as output thresholds, to be defined. On the other hand, if a large number of intermittent input sources converge on the nodes through time, then a serious design problem arises, due to the fact that the total input converging on each node can vary wildly through time. I have called this problem the noise-saturation dilemma: If the x_i are sensitive to large inputs, then why do not small inputs get lost in internal system noise? If the x_i are sensitive to small inputs, then why do they not all saturate at their maximum values in response to large inputs?

Shunting cooperative-competitive networks possess automatic gain control properties capable of generating an infinite dynamic range within which input patterns can be effectively processed, thereby solving the noise-saturation dilemma. The simplest feedforward network will be described to illustrate how it solves the sensitivity problem raised by the noise-saturation dilemma. Let a spatial pattern I_i = θ_i I of inputs be processed by the cells v_i, i = 1, 2, ..., n. Each θ_i is the constant relative size, or reflectance, of its input I_i, and I is the variable total input size. In other words, I = Σ_{k=1}^n I_k, so that Σ_{k=1}^n θ_k = 1. How can each cell v_i maintain its sensitivity to θ_i when I is parametrically increased? How is saturation avoided? To compute θ_i = I_i(Σ_{k=1}^n I_k)^{-1}, each cell v_i must have information about all the inputs I_k, k = 1, 2, ..., n. Moreover, since θ_i = I_i(I_i + Σ_{k≠i} I_k)^{-1}, increasing I_i increases θ_i, whereas increasing any I_k, k ≠ i, decreases θ_i. When this observation is translated into an anatomy for delivering feedforward inputs to the cells v_i, it suggests that I_i excites v_i and that all I_k, k ≠ i, inhibit v_i. This rule represents the simplest feedforward on-center off-surround anatomy.

How does the on-center off-surround anatomy activate and inhibit the cells v_i via mass action? Let each v_i possess B excitable sites of which x_i(t) are excited and B - x_i(t) are unexcited at each time t. Then at v_i, I_i excites the B - x_i unexcited sites by mass action, and the total inhibitory input Σ_{k≠i} I_k inhibits the x_i excited sites by mass action. Moreover, excitation x_i can spontaneously decay at a fixed rate A, so that the cell can return to an equilibrium point (arbitrarily set equal to 0) after all inputs cease. These rules say that

    dx_i/dt = -A x_i + (B - x_i) I_i - x_i Σ_{k≠i} I_k.  (37)
Equation (37) is perhaps the simplest example that illustrates the utility of shunting networks (3). If a fixed spatial pattern I_i = θ_i I is presented and the background input I
is held constant for awhile, each x_i approaches an equilibrium value. This value is easily found by setting dx_i/dt = 0 in (37). It is

    x_i = θ_i BI/(A + I).  (38)

Equation (38) represents another example of the factorization of pattern (θ_i) and energy (BI(A + I)^{-1}). As a result, the relative activity X_i = x_i(Σ_{k=1}^n x_k)^{-1} equals θ_i no matter how large I is chosen; there is no saturation. This is due to automatic gain control by the inhibitory inputs. In other words, Σ_{k≠i} I_k multiplies x_i in (37). The total gain in (37) is found by writing

    dx_i/dt = -(A + I) x_i + B I_i.  (39)

The gain is the coefficient of x_i, namely -(A + I), since if x_i(0) = 0,

    x_i(t) = (B I_i/(A + I))(1 - e^{-(A+I)t}).  (40)
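A quick numerical check of (38), with illustrative values, shows reflectance processing at every total intensity:

    import numpy as np

    # Equilibrium (38) of the feedforward shunting network (37), at several total
    # input intensities I.  Illustrative values A = B = 1.
    A, B = 1.0, 1.0
    theta = np.array([0.5, 0.3, 0.2])
    for I in [0.1, 1.0, 10.0, 1000.0]:
        x = theta * B * I / (A + I)              # equilibrium STM pattern (38)
        print(f"I={I:7.1f}  X_i={(x / x.sum()).round(2)}  total={x.sum():.3f}")
    # The relative activities X_i equal theta at every intensity (no saturation),
    # while the total activity B*I/(A+I) remains normalized below B.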
Both the steady state and the gain of x_i depend on the input strength. This is characteristic of shunting networks but not of additive networks. The simple law (38) combines two types of information: information about pattern θ_i, or "reflectances", and information about background activity, or "luminance". In visual psychophysics, the tendency towards reflectance processing helps to explain brightness constancy (Grossberg and Todorović, 1988). Another property of (38) is that the total activity

    x = Σ_{k=1}^n x_k = BI/(A + I)  (41)

is independent of the number of active cells. This normalization rule is a conservation law which says, for example, that in a network that receives a fixed total luminance, making one part of the field brighter tends to make another part of the field darker. This property helps to explain brightness contrast, as well as brightness assimilation and the Craik-O'Brien-Cornsweet effect (Grossberg and Todorović, 1988).

Equation (38) can be written in another form that expresses a different physical intuition. If we plot the intensity of an on-center input in logarithmic coordinates K_i, then K_i = ln(I_i) and I_i = exp(K_i). Also write the total off-surround input as J_i = Σ_{k≠i} I_k. Then (38) can be written in logarithmic coordinates as

    x_i(K_i, J_i) = B e^{K_i}/(A + e^{K_i} + J_i).  (42)
How does the response x_i at v_i change if we parametrically change the off-surround input J_i? The answer is that x_i's entire response curve to K_i is shifted. Its range of maximal sensitivity shifts with the off-surround intensity, but its dynamic range is not compressed. Such a shift occurs, for example, in the Weber-Fechner law (Cornsweet, 1970), in bipolar cells of the Necturus retina (Werblin, 1971), and in a modified form in the psychoacoustic data of Iverson and Pavel (1981). The shift property says that

    x_i(K_i + S, J_i^(2)) = x_i(K_i, J_i^(1))  (43)

for all K_i ≥ 0, where the amount of shift S caused by changing the total off-surround input from J_i^(1) to J_i^(2) is predicted to be

    S = ln[(A + J_i^(2))/(A + J_i^(1))].  (44)
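The shift (44) can be checked directly from (42); the parameter and off-surround values below are hypothetical:

    import numpy as np

    # Numerical check of the shift property (43)-(44) for the response curve (42),
    # x(K, J) = B*exp(K) / (A + exp(K) + J).
    A, B, J1, J2 = 1.0, 1.0, 2.0, 20.0
    x = lambda K, J: B * np.exp(K) / (A + np.exp(K) + J)
    S = np.log((A + J2) / (A + J1))              # predicted shift (44)
    K = np.linspace(0.0, 5.0, 11)
    print(np.allclose(x(K + S, J2), x(K, J1)))   # True: the curve shifts, undistorted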
Equation (37) is a special case of a law that occurs in vivo; namely, the membrane equation on which cellular neurophysiology is based. The membrane equation describes the voltage V(t) of a cell by the law

    C dV/dt = (V^+ - V) g^+ + (V^- - V) g^- + (V^p - V) g^p.  (45)

In (45), C is a capacitance; V^+, V^-, and V^p are constant excitatory, inhibitory, and passive saturation points, respectively; and g^+, g^-, and g^p are excitatory, inhibitory, and passive conductances, respectively. We will scale V^+ and V^- so that V^+ > V^-. Then in vivo V^+ ≥ V(t) ≥ V^- and V^+ > V^p ≥ V^-. Often V^+ represents the saturation point of a Na^+ channel and V^- represents the saturation point of a K^+ channel. To see why (37) is a special case of (45), suppose that (45) holds at each cell v_i. Then at v_i, V = x_i. Set C = 1 (rescale time), V^+ = B, V^- = V^p = 0, g^+ = I_i, g^- = Σ_{k≠i} I_k, and g^p = A.

There is also symmetry-breaking in (45), because V^+ - V^p is usually much larger than V^p - V^-. This symmetry-breaking operation, which is usually mentioned in the experimental literature without comment, achieves an important noise-suppression property when it is coupled to an on-center off-surround anatomy. For example, in the network

    dx_i/dt = -A x_i + (B - x_i) I_i - (x_i + C) Σ_{k≠i} I_k,  (46)
both depolarized potentials (0 < x_i ≤ B) and hyperpolarized potentials (-C ≤ x_i < 0) can occur. The equilibrium activity in response to a spatial pattern I_i = θ_i I is

    x_i = [(B + C) I/(A + I)] [θ_i - C/(B + C)].  (47)

Parameter C(B + C)^{-1} is an adaptation level which θ_i must exceed in order to depolarize x_i and thereby generate an output signal from v_i. In order to inhibit uniform input patterns that do not carry discriminative feature information, we would want θ_i = 1/n for all i to imply that all x_i = 0. This occurs if B = (n - 1)C, so that B ≫ C and thus

    V^+ - V^p ≫ V^p - V^-.

The reflectance processing and Weber law properties, the total activity normalization property, and the adaptation level property of (46) set the stage for the design and classification of more complex feedforward and feedback on-center off-surround shunting networks during the early 1970's.
17. SHORT TERM MEMORY STORAGE AND CAM

Feedback networks are capable of storing memories in STM for far longer than a passive decay rate, such as A in (37), would allow, yet can also be rapidly reset. The
simplest feedback competitive network capable of solving the noise-saturation dilemma is defined by the equations

    dx_i/dt = -A x_i + (B - x_i)[f(x_i) + I_i] - x_i [Σ_{k≠i} f(x_k) + J_i],  (48)

i = 1, 2, ..., n. Suppose that the inputs I_i and J_i acting before t = 0 establish an arbitrary initial activity pattern (x_1(0), x_2(0), ..., x_n(0)) before being shut off at t = 0. How does the choice of the feedback signal function f(w) control the transformation and storage of this pattern as t → ∞? The answer is determined by the choice of the function g(w) = w^{-1} f(w), which measures how much f(w) deviates from linearity at prescribed activity levels w.

The network's responses to these choices may be summarized using the functions X_i = x_i(Σ_{k=1}^n x_k)^{-1} and x = Σ_{k=1}^n x_k. The relative activity X_i of the ith node computes how the network transforms the input pattern through time. The functions X_i play the role for feedback networks that the reflectances θ_i in (38) play for feedforward networks; also recall Theorems 1 and 2. The total activity x measures how well the network normalizes the total network activity and whether the pattern is stored (x(∞) = lim_{t→∞} x(t) > 0) or not (x(∞) = 0). Variable x plays the role of the total input I in (38). In Grossberg (1973) the following types of results were proved about these systems: Linear signals lead to perfect pattern storage and noise amplification. Slower-than-linear signals lead to pattern uniformization and noise amplification. Faster-than-linear signals lead to winner-take-all choices, noise suppression, and total activity quantization in a network that behaves like an emergent finite state machine. Sigmoid signals lead to partial contrast enhancement, tunable filtering, noise suppression, and normalization. See Grossberg (1981, 1988) for reviews.

All of these networks function as a type of global content addressable memory, or CAM, since all trajectories converge to equilibrium points through time. The equilibrium point to which the network converges in response to an input pattern plays the role of a stored memory. Both linear and sigmoid signals can be chosen to create networks with infinitely many, indeed nondenumerably many, equilibria. Faster-than-linear signals give rise to only finitely many equilibria as part of their winner-take-all property.

In summary, several factors work together to generate desirable pattern transformation and STM storage properties. The dynamics of mass action, the geometry of competition, and the statistics of competitive feedback signals work together to define a unified network module whose several parts are designed in a coordinated fashion through development.
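As an illustration of the winner-take-all case, the following sketch integrates (48) with zero inputs after t = 0 and a faster-than-linear signal; all parameter values are hypothetical:

    import numpy as np

    # Feedback competitive network (48) after inputs shut off (I_i = J_i = 0):
    # dx_i/dt = -A*x_i + (B - x_i)*f(x_i) - x_i * sum_{k != i} f(x_k).
    # A faster-than-linear signal f(w) = w^2 yields a winner-take-all choice.
    # Parameters are illustrative; B^2 > 4A, so a positive stored equilibrium exists.
    A, B, dt = 0.1, 1.0, 0.001
    f = lambda w: w**2
    x = np.array([0.20, 0.25, 0.30, 0.35])   # arbitrary initial STM pattern at t = 0
    for _ in range(100000):
        fx = f(x)
        x += dt * (-A * x + (B - x) * fx - x * (fx.sum() - fx))
    print("stored STM pattern:", x.round(3))  # only the largest initial activity survives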
163
18. EVERY COMPETITIVE SYSTEM INDUCES A DECISION SCHEME

As solutions to specialized problems involving competition accumulated, networks capable of normalization, sensitivity changes via automatic gain control, attentional biases, developmental biases, pattern matching, shift properties, contrast enhancement, edge and curvature detection, tunable filtering, multistable choice behavior, normative drifts, traveling waves, synchronous oscillations, hysteresis, and resonance began to be classified within the framework of additive or shunting feedforward or feedback competitive networks. As in the case of associative learning, the abundance of special cases made it seem more and more imperative to find a mathematical framework within which these results could be unified and generalized. I also began to realize that many of the pattern transformations and STM storage properties of specialized examples were instances of an absolute stability property of a general class of networks. This unifying mathematical theme can be summarized intuitively as follows: every competitive system induces a decision scheme that can be used to prove global limit and oscillation theorems, notably absolute stability theorems (Grossberg, 1978c, 1978d, 1980c). This decision scheme interpretation provides a geometrical way to think about a Liapunov functional that is naturally associated with every competitive system. A competitive dynamical system is, for present purposes, defined by a system of differential equations

dx_i/dt = f_i(x_1, x_2, …, x_n),   (49)

where

∂f_i/∂x_j ≤ 0,   i ≠ j,   (50)

and the f_i are chosen to generate bounded trajectories. By (50), increasing the activity x_j of a given population can only decrease the growth rates of other populations, i ≠ j, or may not influence them at all. No constraint is placed upon the sign of ∂f_i/∂x_i. Typically, cooperative behavior occurs within a population and competitive behavior occurs between populations, as in the on-center off-surround networks (48). The method makes mathematically precise the intuitive idea that a competitive system can be understood by keeping track of who is winning the competition; the decision scheme makes this intuition precise. To define it, write (49) in the form
dx_i/dt = a_i(x_i) M_i(x),   x = (x_1, x_2, …, x_n),   (51)

which factors out the amplification function a_i(x_i) ≥ 0. Then define
M⁺(x) = max{M_i(x) : i = 1, 2, …, n}   (52)

and

M⁻(x) = min{M_i(x) : i = 1, 2, …, n}.   (53)
These variables track the largest and smallest rates of change, and are used to keep track of who is winning. Using these functions, it is easy to see that there exists a property of ignition: Once a trajectory enters the positive ignition region
R⁺ = {x : M⁺(x) ≥ 0}   (54)

or the negative ignition region

R⁻ = {x : M⁻(x) ≤ 0},   (55)

it can never leave it. If x(t) never enters the set

R* = R⁺ ∩ R⁻,   (56)
then each variable x_i(t) converges monotonically to a limit. The interesting behavior in a competitive system occurs in R*. In particular, if x(t) never enters R⁺, then each x_i(t) decreases to a limit; the competition never gets started. The set
S⁺ = {x : M⁺(x) = 0}   (57)

acts like a competition threshold, which is called the positive ignition hypersurface. We therefore consider a trajectory after it has entered R*. For simplicity, redefine the time scale so that the trajectory is in R* at time t = 0. The Liapunov functional for any competitive system is then defined as
L(x_t) = ∫₀^t M⁺(x(v)) dv.   (58)

The Liapunov property is a direct consequence of positive ignition:

(d/dt) L(x_t) = M⁺(x(t)) ≥ 0.   (59)
This functional provides the "energy" that forces trajectories through a series of competitive decisions, which are also called jumps. Jumps keep track of the state which is undergoing the maximal rate of change at any time ("who's winning"). If M⁺(x(t)) = M_i(x(t)) for times S ≤ t < T but M⁺(x(t)) = M_j(x(t)) for times T ≤ t < U, then we say that the system jumps from node v_i to node v_j at time t = T. A jump from v_i to v_j can only occur on the jump set

J_ij = {x ∈ R* : M⁺(x) = M_i(x) = M_j(x)}.   (60)

The Liapunov functional L(x_t) moves the system through these decision hypersurfaces through time. The geometry of S⁺, S⁻, and the jump sets J_ij, together with the energy defined by L(x_t), can be used to globally analyse the dynamics of the system. In particular, due to the positive ignition property (59), the limit

lim_{t→∞} L(x_t) = ∫₀^∞ M⁺(x(v)) dv   (61)

always exists, and is possibly infinite.
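The decision-scheme bookkeeping can be illustrated concretely. The sketch below integrates the May-Leonard voting-paradox system discussed in the next section (with assumed illustrative parameters alpha = 0.5, beta = 2.0, which generate a jump cycle), computes M⁺ and M⁻ as in (52)-(53), and logs a jump whenever the index attaining M⁺ changes.

import numpy as np

alpha, beta = 0.5, 2.0

def M(x):
    # dx_i/dt = x_i * M_i(x): cyclic competition among three populations
    return np.array([1 - x[i] - alpha * x[(i + 1) % 3] - beta * x[(i + 2) % 3]
                     for i in range(3)])

x = np.array([0.5, 0.2, 0.1])
dt, winner, jumps = 0.01, None, []
for step in range(60000):
    m = M(x)
    x += dt * x * m                     # a_i(x_i) = x_i, as in (51)
    leader = int(np.argmax(m))          # who attains M+ ("who's winning")
    if leader != winner:
        jumps.append((round(step * dt, 2), winner, leader))
        winner = leader

print("M+ =", round(M(x).max(), 4), " M- =", round(M(x).min(), 4))
print("jumps logged:", len(jumps), "first few:", jumps[:4])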
19. LIMITS AND OSCILLATIONS: CONSENSUS AND CONTRADICTION
The following results illustrate the use of these concepts (Grossberg, 1978c):

Theorem 3: Given any initial data x(0), suppose that

∫₀^∞ M⁺(x(v)) dv < ∞.   (62)

Then the limit x(∞) = lim_{t→∞} x(t) exists.

Corollary 1: If in response to initial data x(0), all jumps cease after some time T < ∞, then x(∞) exists.

Speaking intuitively, this result means that after all local decisions, or jumps, have been made in response to an initial state x(0), then the system can settle down to a
global decision, or equilibrium point x(∞). In particular, if x(0) leads to only finitely many jumps because there exists a jump tree, or partial ordering of decisions, then x(∞) exists. This fact led to the analysis of circumstances under which no jump cycle, or repetitive series of jumps, occurs in response to x(0), and hence that jump trees exist. These results included examples of nonlinear Volterra-Lotka equations with asymmetric interaction coefficients all of whose trajectories approach equilibrium points (Grossberg, 1978c). Thus symmetric coefficients were shown not to be necessary for global approach to equilibrium, or a global CAM property, to obtain. Further information may be derived from (62). Since M⁺(x(t)) ≥ 0 for all t ≥ 0, it also follows that lim_{t→∞} M⁺(x(t)) = 0. This tells us to look for the equilibrium points x(∞) on the positive ignition hypersurface S⁺ in (57):

Corollary 2: If ∫₀^∞ M⁺(x(t)) dt < ∞, then x(∞) ∈ S⁺.

Thus the positive ignition surface is the place where the competition both ignites and its memories are stored if no jump cycle exists. Using this result, an analysis was made of conditions under which no jump cycle exists in response to any initial vector x(0), and hence all trajectories approach an equilibrium state. The same method was also used to prove that a competitive system can generate sustained oscillations if it contains globally inconsistent decisions. These results provide examples where asymmetric coefficients do lead to oscillations. Here, in response to initial data x(0),
J,"
M + ( z ( v ) ) d v= 00)
(63)
thus infinitely many jumps occur, hence a jump cycle occurs, and the trajectory undergoes undamped oscillations. This method was used to provide a global analysis of the oscillations taking place in a variety of competitive systems, including the Volterra-Lotka systems that model the voting paradox (Grossberg, 1978c, 1980c; May and Leonard, 1975). Using this method, a large new class of nonlinear competitive networks was identified all of whose trajectories converge to one of possibly infinitely many equilibrium points (Grossberg, 1978d). These are the adaptation level systems

dx_i/dt = a_i(x)[b_i(x_i) - c(x)],   (64)
which were identified through an analysis of many specialized networks. In system (64), each state-dependent amplification function a_i(x) and self-signal function b_i(x_i) can be chosen with great generality without destroying the system's ability to reach equilibrium, because there exists a state-dependent adaptation level c(x) against which each b_i(x_i) is compared. Such an adaptation level c(x) defines a strong type of long-range symmetry within the system. Equation (64) is a feedback analog of the feedforward adaptation level equation (47). The examples which motivated the analysis of (64) were additive networks (65) and shunting networks (66), in which the symmetric coefficients B̄_ki, C_ki, and E_ki took on different values when k = i and when k ≠ i. Examples in which the symmetric coefficients varied with |k - i| in a graded fashion were also studied through computer simulations (Ellias and Grossberg, 1975; Levine and Grossberg, 1976). An adequate global mathematical convergence proof was announced in Grossberg (1982b) and elaborated in Cohen and Grossberg (1983). A special case of my theorem concerning these adaptation level systems is the following.

Theorem 4 (Absolute Stability of Adaptation Level Systems): Suppose that

(I) Smoothness: the functions a_i(x), b_i(x_i), and c(x) are continuously differentiable;

(II) Positivity: a_i(x) > 0 if x_i > 0 and x_j ≥ 0, j ≠ i; a_i(x) = 0 if x_i = 0 and x_j ≥ 0, j ≠ i; and, for sufficiently small λ > 0, there exists a continuous function ā_i(x_i) such that ā_i(x_i) ≥ a_i(x) if x ∈ [0, λ]ⁿ and

∫₀^λ dw / ā_i(w) = ∞;

(III) Boundedness: for each i = 1, 2, …, n,

limsup_{x_i→∞} b_i(x_i) < c(0, 0, …, ∞, 0, …, 0),

where ∞ is in the ith entry of (0, 0, …, ∞, 0, …, 0);

(IV) Competition:

∂c/∂x_i > 0,   x ∈ R₊ⁿ,   i = 1, 2, …, n;

(V) Decision Hills: the graph of each b_i(x_i) possesses at most finitely many maxima in every compact interval.

Then the pattern transformation is stored in STM because all trajectories converge to equilibrium points; that is, given any x(0) > 0, the limit x(∞) = lim_{t→∞} x(t) exists.

This theorem intuitively means that the decision schemes of adaptation level systems are globally consistent and give rise to a global CAM.
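A small simulation illustrates the theorem. The sketch below integrates an adaptation level system (64) with assumed illustrative choices a_i(x) = x_i (so the positivity integral diverges), adaptation level c(x) = Σ_k x_k (so ∂c/∂x_i = 1 > 0), and bounded "decision hill" self-signals b_i satisfying hypotheses (I)-(V); every trajectory settles to an equilibrium at which each surviving b_i(x_i) equals the adaptation level c(x).

import numpy as np

def b(i, w):
    # bounded, with finitely many maxima in every compact interval
    return 2.0 + 0.5 * np.sin((i + 1) * w)

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 1.0, size=4)
dt = 0.001
for _ in range(200000):
    bs = np.array([b(i, x[i]) for i in range(4)])
    x += dt * x * (bs - x.sum())        # dx_i/dt = a_i(x)[b_i(x_i) - c(x)]

print("x(inf) ~", np.round(x, 4))
# at equilibrium each surviving x_i satisfies b_i(x_i) = c(x)
print("b_i(x_i):", np.round([b(i, x[i]) for i in range(4)], 4),
      " c(x):", round(x.sum(), 4))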
In the proof of Theorem 4, it was shown that each x_i(t) gets trapped within a sequence of decision boundaries that get laid down through time at the abscissa values of the highest peaks in the graphs of the functions b_i in (64). The size and location of these peaks reflect the statistical rules, which can be chosen to be extremely complex, that give rise to the output signals from the totality of cooperating subpopulations within each node v_i. In particular, a b_i with multiple peaks can be generated when a population's positive feedback signal function is a multiple-sigmoid function which adds up output signals from multiple randomly defined subpopulations within v_i. After all the decision boundaries get laid down, each x_i is trapped within a single valley of its b_i graph. This valley acts, in some respects, like a classical potential. After all the x_i get trapped in such valleys, the function

B[x(t)] = max{b_i(x_i(t)) : i = 1, 2, …, n}   (73)

is a Liapunov function. This Liapunov property was used to complete the proof of the theorem. Adaptation level systems exclude distance-dependent interactions. To overcome this gap, Michael Cohen and I (Cohen and Grossberg, 1983; see also Grossberg, 1982b) studied the absolute stability of the symmetric networks (74),
where F_ij = F_ji. The adaptation level model (64) is in some ways more general and in some ways less general than model (74). Cohen and I began our study of (74) with the hope that we could use the symmetric coefficients in (74) to prove that no jump cycles exist, and thus that all trajectories approach equilibrium as a consequence of Theorem 3. Such a proof would be part of a more general theory and, by using geometrical concepts such as jump set and ignition surface, it would clarify how to perturb off the symmetric coefficients without generating oscillations. As it turned out, the global Liapunov method that I developed in the 1970's sensitized us to think in that direction. We soon discovered a general class of symmetric models and a global Liapunov function for every model in the class. In each of these models, the Liapunov function was used to prove that all trajectories approach equilibrium points. This CAM model, which is now often called the Cohen-Grossberg model, was designed to include additive networks (65) and shunting networks (66) with symmetric coefficients.

20. COHEN-GROSSBERG CAM MODEL AND THEOREM

The Cohen-Grossberg model includes any dynamical system that can be written in the form

dx_i/dt = a_i(x_i)[b_i(x_i) - Σ_{j=1}^n c_ij d_j(x_j)].   (75)
Each such model admits the global Liapunov function

V = -Σ_{i=1}^n ∫₀^{x_i} b_i(ξ) d_i′(ξ) dξ + (1/2) Σ_{j,k=1}^n c_jk d_j(x_j) d_k(x_k)   (76)

if the coefficient matrix C = ((c_ij)) and the functions a_i, b_i, and d_j obey mild technical conditions, including

Symmetry: c_ij = c_ji;   (77)
Positivity: a_i(x_i) ≥ 0;   (78)

Monotonicity: d_j′(x_j) ≥ 0.   (79)

Differentiating V along trajectories implies that

dV/dt = -Σ_{i=1}^n a_i(x_i) d_i′(x_i) [b_i(x_i) - Σ_{j=1}^n c_ij d_j(x_j)]².   (80)
If (78) and (79) hold, then dV/dt ≤ 0 along trajectories. Once this basic property of a Liapunov function is in place, it is a technical matter to rigorously prove that every trajectory approaches one of a possibly large number of equilibrium points. For expository vividness, the functions in the Cohen-Grossberg model (75) are called the amplification function a_i, the self-signal function b_i, and the other-signal functions d_j. Specialized models are characterized by particular choices of these functions.
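The Liapunov property can be checked numerically. The sketch below uses assumed illustrative choices a_i = 1, b_i(x) = -x, d_j = tanh, and a random symmetric matrix c_jk, for which (76) has the closed form noted in the comment, and asserts that V never increases along an Euler trajectory of (75).

import numpy as np

rng = np.random.default_rng(1)
n = 5
C = rng.normal(size=(n, n))
C = (C + C.T) / 2                       # Symmetry (77)

def V(x):
    th = np.tanh(x)
    # -int_0^{x} b(s) d'(s) ds = int_0^{x} s sech^2(s) ds
    #                          = x tanh(x) - log cosh(x)
    return np.sum(x * th - np.log(np.cosh(x))) + 0.5 * th @ C @ th

x = rng.normal(size=n)
dt, prev = 0.01, V(x)
for _ in range(3000):
    x += dt * (-x - C @ np.tanh(x))     # (75) with a_i = 1
    assert V(x) <= prev + 1e-8          # dV/dt <= 0, as in (80)
    prev = V(x)

print("final V:", round(prev, 5), "  equilibrium x ~", np.round(x, 3))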
A. Additive Model

Cohen and Grossberg (1983, p. 819) noted that "the simpler additive neural networks . . . are also included in our analysis". The additive equation (2) can be written using the coefficients of the standard electrical circuit interpretation (Plonsey and Fleming, 1969) as

C_i dx_i/dt = -x_i/R_i + Σ_{j=1}^n f_j(x_j) T_ji + I_i.   (81)
Substitution into (75) shows that

a_i(x_i) = 1/C_i   (constant!),   (82)

b_i(x_i) = -x_i/R_i + I_i   (linear!),   (83)

c_ij = -T_ij,   d_j(x_j) = f_j(x_j).   (84)
Thus in the additive case, the amplification function (82) is a positive constant, hence satisfies (78), and the self-signal term (83) is linear. Substitution of (82)-(83) into (76) leads directly to the equation

V = -Σ_{i=1}^n ∫₀^{x_i} (-ξ/R_i + I_i) f_i′(ξ) dξ - (1/2) Σ_{j,k=1}^n T_jk f_j(x_j) f_k(x_k).   (86)
This Liapunov function for the additive model was later published by Hopfield (1984). In Hopfield's treatment, the integral term is written using an inverse f_i⁻¹(V). Cohen and Grossberg (1983) showed, however, that although f_i(x_i) must be nondecreasing, as in (79), it need not have an inverse in order for (86) to be valid.

B. Shunting Model

All additive models lead to constant amplification functions a_i(x_i) and linear self-feedback functions b_i(x_i). The need for the more general model (75) becomes apparent when the shunting STM equation (3) is analysed. Consider, for example, a class of shunting models:
dx_i/dt = -A_i x_i + (B_i - x_i)[I_i + f_i(x_i)] - (x_i + C_i)[J_i + Σ_{j=1}^n D_ij g_j(x_j)].   (87)
In (87), each x_i can fluctuate within the finite interval [-C_i, B_i] in response to the constant inputs I_i and J_i, the state-dependent positive feedback signal f_i(x_i), and the negative feedback signals D_ij g_j(x_j). It is assumed that

D_ij = D_ji ≥ 0   (88)

and that

g_j(x_j) ≥ 0.   (89)
In order to write (87) in Cohen-Grossberg form, it is convenient to introduce the variables

y_i = x_i + C_i.   (90)

In applications, C_i is typically nonnegative. Since x_i can vary within the interval [-C_i, B_i], y_i can vary within the interval [0, B_i + C_i] of nonnegative numbers. In terms of these variables, (87) can be written in the form

dy_i/dt = a_i(y_i)[b_i(y_i) - Σ_{j=1}^n c_ij d_j(y_j)],   (91)
where

a_i(y_i) = y_i   (nonconstant!),   (92)

b_i(y_i) = (1/y_i)[A_i C_i - (A_i + J_i) y_i + (B_i + C_i - y_i)(I_i + f_i(y_i - C_i))]   (nonlinear!),   (93)

c_ij = D_ij,   (94)

and

d_j(y_j) = g_j(y_j - C_j)   (noninvertible!).   (95)
Unlike the additive model, the amplification function a_i(y_i) in (92) is not a constant. In addition, the self-signal function b_i(y_i) in (93) is not necessarily linear, notably because the feedback signal f_i(y_i - C_i) is often nonlinear in applications of the shunting model; in particular, it is often a sigmoid or multiple-sigmoid signal function.
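The change of variables is easy to verify numerically. The sketch below draws assumed random illustrative parameters and confirms that (90) together with (92)-(95) reproduces the shunting dynamics (87) to machine precision.

import numpy as np

rng = np.random.default_rng(3)
n = 3
A = rng.uniform(0.5, 1.5, n); B = rng.uniform(1.0, 2.0, n)
C = rng.uniform(0.0, 0.5, n)
I = rng.uniform(0.0, 1.0, n); J = rng.uniform(0.0, 1.0, n)
D = rng.uniform(0.0, 1.0, (n, n)); D = (D + D.T) / 2         # (88)
f = lambda u: u ** 2 / (1.0 + u ** 2)      # sigmoid positive feedback
g = lambda u: np.maximum(u, 0.0)           # nonnegative, thresholded (89)

x = rng.uniform(0.0, 1.0, n)
lhs = -A * x + (B - x) * (I + f(x)) - (x + C) * (J + D @ g(x))   # (87)

y = x + C                                                        # (90)
b = (A * C - (A + J) * y + (B + C - y) * (I + f(y - C))) / y     # (93)
rhs = y * (b - D @ g(y - C))               # (91) with (92), (94), (95)
print("max discrepancy:", np.abs(lhs - rhs).max())               # ~ 1e-16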
[Figure 6 here. Panel labels: at the top level, (1) normalize total activity, (2) contrast enhance, (3) store in STM; in the plastic synaptic strengths, LTM (1) computes a time-average of the presynaptic signal and postsynaptic STM and (2) multiplicatively gates signals; at the bottom level, the input pattern I_i(t) has its total activity normalized.]
Figure 6. The basic computational rules of self-organizing feature maps were established by 1976. (Reprinted with permission from Grossberg, 1976b.)
Property (78) follows from the fact that a_i(y_i) = y_i ≥ 0. Property (79) follows from the assumption that the negative feedback signal function g_j is monotone nondecreasing. Cohen and Grossberg (1983) proved that g_j need not be invertible. A signal threshold may exist below which g_j ≡ 0 and above which g_j may grow in a nonlinear way. The inclusion of nonlinear signals with thresholds better enables the model to deal with fluctuations due to subthreshold noise. These results show that adaptation level and distance-dependent competitive networks represent stable neural designs for competitive decision-making and CAM. The fact that adaptation level systems have been analyzed using Liapunov functionals whereas distance-dependent, and more generally, symmetric networks have been analyzed using Liapunov functions shows that the global convergence theory of competitive systems is still incomplete. Global limit theorems for cooperative systems were also subsequently discovered (Hirsch, 1982, 1985, 1989), as were theorems showing when closely related cooperative-competitive systems could oscillate (Cohen, 1988, 1990). Major progress has also been made on explicitly constructing dynamical systems with prescribed sets of equilibrium points, and only these equilibrium points (Cohen, 1992). This is an exciting area for intensive mathematical investigation. Additive and shunting networks have also found their way into many applications. Shunting networks have been particularly useful in understanding biological and machine vision, from the earliest retinal detection stages through higher cortical filtering and grouping processes (Gaudiano, 1992a, 1992b; Grossberg and Mingolla, 1985a, 1985b; Nabet and Pinter, 1991), as well as perceptual and motor oscillations (Cohen, Grossberg, and Pribe, 1993; Gaudiano and Grossberg, 1991; Grossberg and Somers, 1991, 1992; Somers and Kopell, 1993).
21. COMPETITIVE LEARNING AND SELF-ORGANIZING FEATURE MAPS
Once mathematical results were available that clarified the global dynamics of associative learning and competition, the stage was set to combine these mechanisms in models of cortical development, recognition learning, and categorization. One major source of interest in such models came from neurobiological experiments on geniculocortical and retinotectal development (Gottlieb, 1976; Hubel and Wiesel, 1977; Hunt and Jacobson, 1974). My own work on this problem was stimulated by such neural data, and by psychological data concerning perception, cognition, and motor control. Major constraints on theory construction also derived from my previous results on associative learning. During outstar learning, for example, no learning of a sampled input pattern θ_i in (27) occurs, i = 1, 2, …, n, when the learning signal D(t) = 0 in equation (26). This property was called stimulus sampling. It showed that activation of an outstar source cell enables it to selectively learn spatial patterns at prescribed times. This observation led to the construction of more complex sampling cells and networks, called avalanches, that are capable of learning arbitrary space-time patterns, not merely spatial patterns, and to a comparison of avalanche networks with properties of command cells in invertebrates (Grossberg, 1969e, 1970b, 1974). Activation of outstars and avalanches needs to be selective, so as not to release, or recall, learned responses in inappropriate contexts. Networks were needed that could selectively filter input patterns so as to activate outstars and avalanches only under appropriate stimulus conditions. This work led to the introduction of instar networks in Grossberg (1970a, 1972b), to the description of the first self-organizing feature map in Malsburg (1973), and to the development of the main equations and mathematical properties of the modern theory of competitive learning, self-organizing feature maps, and learned vector quantization in Grossberg (1976a, 1976b, 1976c, 1978a). Willshaw and Malsburg (1976) and Malsburg and Willshaw (1977, 1981) also made a seminal contribution at this time to the modelling of cortical development using self-organizing feature maps. In addition, the first self-organizing multilevel networks were constructed in 1976 for the learning of multidimensional maps from Rⁿ to Rᵐ, for any n, m ≥ 1 (Grossberg, 1976a, 1976b, 1976c). The first two levels F1 and F2 constitute a self-organizing feature map such that input patterns to F1 are categorized at F2. Levels F2 and F3 are built out of outstars so that categorizing nodes at F2 can learn output patterns at F3. Hecht-Nielsen (1987) later called such networks counterpropagation networks and claimed that they were a new model. The name instar-outstar map has been used for these maps since the 1970's. Recent popularizers of back propagation have also claimed that multilevel neural networks for adaptive mapping were not available until their work using back propagation in the last half of the 1980's. Actually, back propagation was introduced by Werbos (1974), and self-organizing mapping networks that were proven to be stable in sparse environments were available in 1976. An account of the historical development of self-organizing feature maps is provided in Carpenter and Grossberg (1991). The main processing levels and properties of self-organizing feature maps are summarized in Figure 6, which is reprinted from Grossberg (1976b).
In such a model, an input pattern is normalized and registered as a pattern of activity, or STM, across the feature detectors of level F1. Each F1 output signal is multiplied, or gated, by the adaptive weight, or LTM trace, in its respective pathway, and all these LTM-gated inputs are added up
at their target F2 nodes, as in equations (1)-(3). Lateral inhibitory, or competitive, interactions within F2 contrast-enhance this input pattern; see Section 17. Whereas many F2 nodes may receive inputs from F1, lateral inhibition allows a much smaller set of F2 nodes to store their activation in STM. Only the F2 nodes that win the competition and store their activity in STM can influence the learning process. STM activity opens a learning gate at the LTM traces that abut the winning nodes, as in equation (7). These LTM traces can then approach, or track, the input signals in their pathways by a process of steepest descent. This learning law has thus often been called gated steepest descent, or instar learning. As noted in Section 2, it was introduced into neural network models in the 1960's (e.g., Grossberg, 1969d). Because such an LTM trace can either increase or decrease to track the signals in its pathway, it is not a Hebbian associative law (Hebb, 1949). It has been used to model neurophysiological data about hippocampal LTP (Levy, 1985; Levy and Desmond, 1985) and adaptive tuning of cortical feature detectors during the visual critical period (Rauschecker and Singer, 1979; Singer, 1983), lending support to the 1976 prediction that both systems would employ such a learning law (Grossberg, 1976b, 1978a). Hecht-Nielsen (1987) has called the instar learning law Kohonen learning after Kohonen's use of the law in his applications of self-organizing feature maps in the 1980's, as in Kohonen (1984). The historical development of this law, including its use in self-organizing feature maps in the 1970's, does not support this attribution. Indeed, after self-organizing feature map models were introduced and computationally characterized in Grossberg (1976b, 1978a), Malsburg (1973), and Willshaw and Malsburg (1976), these models were subsequently applied and specialized by many authors (Amari and Takeuchi, 1978; Bienenstock, Cooper and Munro, 1982; Commons, Grossberg, and Staddon, 1991; Grossberg, 1982a, 1987; Grossberg and Kuperstein, 1986; Kohonen, 1984; Linsker, 1986; Rumelhart and Zipser, 1985). They exhibit many useful properties, especially if not too many input patterns, or clusters of input patterns, perturb level F1 relative to the number of categorizing nodes in level F2. It was proved that under these sparse environmental conditions, category learning is stable, with LTM traces that track the statistics of the environment, are self-normalizing, and oscillate a minimum number of times (Grossberg, 1976b, 1978a). Also, the category decision rule, as in a Bayesian classifier, tends to minimize error. It was also proved, however, that under arbitrary environmental conditions, learning becomes unstable. Such a model could forget your parents' faces. Although a gradual switching off of plasticity can partially overcome this problem, such a mechanism cannot work in a recognition learning system whose plasticity is maintained throughout adulthood. This memory instability is due to basic properties of associative learning and lateral inhibition.
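The winner-take-all categorization and instar update just described can be sketched algorithmically. In the following minimal sketch, the shunting dynamics of F1 and F2 are replaced by algebraic normalization and an argmax choice, and the network sizes, learning rate, and input distribution are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)
n_in, n_cat, lr = 8, 3, 0.1
W = rng.uniform(0.4, 0.6, size=(n_cat, n_in))   # bottom-up LTM traces

for _ in range(500):
    I = rng.uniform(size=n_in)
    I = I / I.sum()                     # F1 normalizes the input pattern
    T = W @ I                           # LTM-gated inputs summed at F2
    J = int(np.argmax(T))               # lateral inhibition: winner-take-all
    # instar learning: only the winner's gate opens, and w_J tracks the
    # signal, moving either up or down (hence non-Hebbian)
    W[J] += lr * (I - W[J])

print("learned prototypes (one row per F2 category):")
print(np.round(W, 3))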
An analysis of this instability, together with data about categorization, conditioning, and attention, led to the introduction of Adaptive Resonance Theory, or ART, models that stabilize the memory of self-organizing feature maps in response to an arbitrary stream of input patterns (Grossberg, 1976c). A central prediction of ART, from its inception, has been that adult learning mechanisms share properties with the adaptive mechanisms that control developmental plasticity, in particular that "adult attention is a continuation on a developmental continuum of the mechanisms needed to solve the stability-plasticity dilemma in infants" (Grossberg, 1982b, p. 335). Recent experimental results concerning the neural control of learning have provided increasing support for this
hypothesis (Kandel and O'Dell, 1992).

22. ADAPTIVE RESONANCE THEORY
In an ART model, as shown in Figure 7a, an input vector I registers itself as a pattern X of activity across level F1. The F1 output vector S is then transmitted through the multiple converging and diverging adaptive filter pathways emanating from F1. This transmission event multiplies the vector S by a matrix of adaptive weights, or LTM traces, to generate a net input vector T to level F2. The internal competitive dynamics of F2 contrast-enhance vector T. Whereas many F2 nodes may receive inputs from F1, competition or lateral inhibition between F2 nodes allows only a much smaller set of F2 nodes to store their activation in STM. A compressed activity vector Y is thereby generated across F2. In the ART 1 and ART 2 models (Carpenter and Grossberg, 1987a, 1987b), the competition is tuned so that the F2 node that receives the maximal F1 → F2 input is selected. Only one component of Y is nonzero after this choice takes place. Activation of such a winner-take-all node defines the category, or symbol, of the input pattern I. Such a category represents all the inputs I that maximally activate the corresponding node. So far, these are the rules of a self-organizing feature map. In a self-organizing feature map, only the F2 nodes that win the competition and store their activity in STM can immediately influence the learning process. In an ART model (Carpenter and Grossberg, 1987a, 1992), learning does not occur as soon as some winning F2 activities are stored in STM. Instead, activation of F2 nodes may be interpreted as "making a hypothesis" about an input I. When Y is activated, it rapidly generates an output vector U that is sent top-down through the second adaptive filter. After multiplication by the adaptive weight matrix of the top-down filter, a net vector V inputs to F1 (Figure 7b). Vector V plays the role of a learned top-down expectation. Activation of V by Y may be interpreted as "testing the hypothesis" Y, or "reading out the category prototype" V. An ART network is designed to match the "expected prototype" V of the category against the active input pattern, or exemplar, I. Nodes that are activated by I are suppressed if they do not correspond to large LTM traces in the prototype pattern V. Thus F1 features that are not "expected" by V are suppressed. Expressed in a different way, the matching process may change the F1 activity pattern X by suppressing activation of all the feature detectors in I that are not "confirmed" by hypothesis Y. The resultant pattern X* encodes the cluster of features in I that the network deems relevant to the hypothesis Y based upon its past experience. Pattern X* encodes the pattern of features to which the network "pays attention." If the expectation V is close enough to the input I, then a state of resonance develops as the attentional focus takes hold. The pattern X* of attended features reactivates hypothesis Y which, in turn, reactivates X*. The network locks into a resonant state through the mutual positive feedback that dynamically links X* with Y. In ART, the resonant state, rather than bottom-up activation, drives the learning process. The resonant state persists long enough, at a high enough activity level, to activate the slower learning process; hence the term adaptive resonance theory. ART systems learn prototypes, rather than exemplars, because the attended feature vector X*, rather than the input I itself, is learned. These prototypes may, however, also be used to encode individual exemplars, as described below.
23. MEMORY STABILITY AND 2/3 RULE MATCHING

This attentive matching process is realized by combining three different types of inputs at level F1 (Figure 7): bottom-up inputs, top-down expectations, and attentional gain control signals. The attentional gain control channel sends the same signal to all F1 nodes; it is a "nonspecific", or modulatory, channel. Attentive matching obeys a 2/3 Rule (Carpenter and Grossberg, 1987a): an F1 node can be fully activated only if two of the three input sources that converge upon it send positive signals at a given time. The 2/3 Rule allows an ART system to react to bottom-up inputs, since an input directly activates its target F1 features and indirectly activates them via the nonspecific gain control channel to satisfy the 2/3 Rule (Figure 7a). After the input instates itself at F1, leading to selection of a hypothesis Y and a top-down prototype V, the 2/3 Rule ensures that only those F1 nodes that are confirmed by the top-down prototype can be attended at F1 after an F2 category is selected. The 2/3 Rule enables an ART network to realize a self-stabilizing learning process. Carpenter and Grossberg (1987a) proved that ART learning and memory are stable in arbitrary environments, but become unstable when 2/3 Rule matching is eliminated. Thus a type of matching that guarantees stable learning also enables the network to pay attention.

24. PHONEMIC RESTORATION AND PRIMING

2/3 Rule matching in the brain is illustrated by experiments on phonemic restoration (Repp, 1991; Samuel, 1981a, 1981b; Warren, 1984; Warren and Sherman, 1974). Suppose that a noise spectrum replaces a letter sound in a word heard in an otherwise unambiguous context. Then subjects hear the correct letter sound, not the noise, to the extent that the noise spectrum includes the letter formants. If silence replaces the noise, then only silence is heard. Top-down expectations thus amplify expected input features while suppressing unexpected features, but do not create activations not already in the input. 2/3 Rule matching also shows how an ART system can be primed. This property has been used to explain paradoxical reaction time and error data from priming experiments during lexical decision and letter gap detection tasks (Grossberg and Stone, 1986; Schvaneveldt and MacDonald, 1981). Although priming is often thought of as a residual effect of previous bottom-up activation, a combination of bottom-up activation and top-down 2/3 Rule matching was needed to explain the complete data pattern. This analysis combined bottom-up priming with a type of top-down priming; namely, the top-down activation that prepares a network for an expected event that may or may not occur. The 2/3 Rule clarifies why top-down priming, by itself, is subliminal (and in the brain unconscious), even though it can facilitate supraliminal processing of a subsequent expected event.

25. SEARCH, GENERALIZATION, AND NEUROBIOLOGICAL CORRELATES

The criterion of an acceptable 2/3 Rule match is defined by a parameter ρ called vigilance (Carpenter and Grossberg, 1987a, 1992). The vigilance parameter is computed in the orienting subsystem A. Vigilance weighs how similar an input exemplar must be to a top-down prototype in order for resonance to occur. Resonance occurs if ρ|I| - |X*| ≤ 0. This inequality says that the F1 attentional focus X* inhibits A more than the input I excites it. If A remains quiet, then an F1 ↔ F2 resonance can develop.
Figure 7. ART search for an F2 recognition code: (a) The input pattern I generates the specific STM activity pattern X at F1 as it nonspecifically activates the orienting subsystem A. X is represented by the hatched pattern across F1. Pattern X both inhibits A and generates the output pattern S. Pattern S is transformed by the LTM traces into the input pattern T, which activates the STM pattern Y across F2. (b) Pattern Y generates the top-down output pattern U which is transformed into the prototype pattern V. If V mismatches I at F1, then a new STM activity pattern X* is generated at F1. X* is represented by the hatched pattern. Inactive nodes corresponding to X are unhatched. The reduction in total STM activity which occurs when X is transformed into X* causes a decrease in the total inhibition from F1 to A. (c) If the vigilance criterion fails to be met, A releases a nonspecific arousal wave to F2, which resets the STM pattern Y at F2. (d) After Y is inhibited, its top-down prototype signal is eliminated, and X can be reinstated at F1. Enduring traces of the prior reset lead X to activate a different STM pattern Y* at F2. If the top-down prototype due to Y* also mismatches I at F1, then the search for an appropriate F2 code continues until a more appropriate F2 representation is selected. Then an attentive resonance develops and learning of the attended data is initiated. (Reprinted with permission from Carpenter, Grossberg, and Rosen, 1991.)
[Figure 8 here. Panel labels: ART 1 (binary) vs. Fuzzy ART (analog); row labels: category choice and match criterion; ART 1 uses the intersection operator (∩), Fuzzy ART the minimum operator (∧).]
Figure 8. Comparison of ART 1 and Fuzzy ART. (Reprinted with permission from Carpenter, Grossberg, and Rosen, 1991.)

Vigilance calibrates how much novelty the system can tolerate before activating A and searching for a different category. If the top-down expectation and the bottom-up input are too different to satisfy the resonance criterion, then hypothesis testing, or memory search, is triggered. Memory search leads to selection of a better category at level F2 with which to represent the input features at level F1. During search, the orienting subsystem interacts with the attentional subsystem, as in Figures 7c and 7d, to rapidly reset mismatched categories and to select other F2 representations with which to learn about novel events, without risking unselective forgetting of previous knowledge. Search may select a familiar category if its prototype is similar enough to the input to satisfy the vigilance criterion. The prototype may then be refined by 2/3 Rule attentional focussing. If the input is too different from any previously learned prototype, then an uncommitted population of F2 cells is selected and learning of a new category is initiated. Because vigilance can vary across learning trials, recognition categories capable of encoding widely differing degrees of generalization or abstraction can be learned by a single ART system. Low vigilance leads to broad generalization and abstract prototypes. High vigilance leads to narrow generalization and to prototypes that represent fewer input exemplars, even a single exemplar. Thus a single ART system may be used, say, to recognize abstract categories of faces and dogs, as well as individual faces and dogs. A single system can learn both, as the need arises, by increasing vigilance just enough to activate A if a previous categorization leads to a predictive error (Carpenter and Grossberg, 1992; Carpenter, Grossberg, and Reynolds, 1991; Carpenter, Grossberg, Markuzon, Reynolds, and Rosen, 1992). ART systems hereby provide a new answer to whether the brain learns
prototypes or exemplars. Various authors have realized that neither one nor the other alternative is satisfactory, and that a hybrid system is needed (Smith, 1990). ART systems can perform this hybrid function in a manner that is sensitive to environmental demands. These properties of ART systems have been used to explain and predict a variety of cognitive and brain data that have, as yet, received no other theoretical explanation (Carpenter and Grossberg, 1991; Grossberg, 1987a, 1987b). For example, a formal lesion of the orienting subsystem creates a memory disturbance that remarkably mimics properties of medial temporal amnesia (Carpenter and Grossberg, 1987c, 1993; Grossberg and Merrill, 1992). These and related data correspondences to orienting properties (Grossberg and Merrill, 1992) have led to a neurobiological interpretation of the orienting subsystem in terms of the hippocampal formation of the brain. In applications to visual object recognition, the interactions within the F1 and F2 levels of the attentional subsystem are interpreted in terms of data concerning the prestriate visual cortex and the inferotemporal cortex (Desimone, 1992), with the attentional gain control pathway interpreted in terms of the pulvinar region of the brain. The ability of ART systems to form categories of variable generalization is linked to the ability of inferotemporal cortex to form both particular (exemplar) and general (prototype) visual representations.

26. A CONNECTION BETWEEN ART SYSTEMS AND FUZZY LOGIC

Fuzzy ART is a generalization of ART 1 that incorporates operations from fuzzy logic (Carpenter, Grossberg, and Rosen, 1991). Although ART 1 can learn to classify only binary input patterns, Fuzzy ART can learn to classify both analog and binary input patterns. Moreover, Fuzzy ART reduces to ART 1 in response to binary input patterns. As shown in Figure 8, the generalization to learning both analog and binary input patterns is achieved by replacing appearances of the intersection operator (∩) in ART 1 by the MIN operator (∧) of fuzzy set theory. The MIN operator reduces to the intersection operator in the binary case. Of particular interest is the fact that, as parameter α approaches 0, the function T_j which controls category choice through the bottom-up filter reduces to the operation of fuzzy subsethood (Kosko, 1986). T_j then measures the degree to which the adaptive weight vector w_j is a fuzzy subset of the input vector I. In Fuzzy ART, input vectors are normalized at a preprocessing stage (Figure 9). This normalization procedure, called complement coding, leads to a symmetric theory in which the MIN operator (∧) and the MAX operator (∨) of fuzzy set theory (Zadeh, 1965) play complementary roles. The categories formed by Fuzzy ART are then hyper-rectangles. Figure 10 illustrates how MIN and MAX define these rectangles in the 2-dimensional case. The MIN and MAX values define the acceptable range of feature variation in each dimension. Complement coding uses on-cells (with activity a in Figure 9) and off-cells (with activity a^c in Figure 9) to represent the input pattern, and preserves individual feature amplitudes while normalizing the total on-cell/off-cell vector. The on-cell portion of a prototype encodes features that are critically present in category exemplars, while the off-cell portion encodes features that are critically absent. Each category is then defined by an interval of expected values for each input feature.
For instance, Fuzzy ART would encode the feature of "hair on head" by a wide interval ([A, 1]) for the category "man", whereas the feature "hat on head" would be encoded by a wide interval ([0, B]). On the other hand, the category "dog" would be encoded by two narrow intervals, [C, 1] for hair and [0, D] for hat, corresponding to narrower ranges of expectations for these two features.
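The box geometry follows directly from fast learning with complement coding, as the following sketch illustrates; the two-feature exemplars are illustrative, and the weight is updated by the MIN rule of the fast-learn limit of equation (106).

import numpy as np

def complement_code(a):
    return np.concatenate([a, 1.0 - a])     # (a, a^c)

exemplars = np.array([[0.2, 0.7], [0.3, 0.9], [0.25, 0.8]])
M = exemplars.shape[1]

w = complement_code(exemplars[0])           # fast commit: w = I
for a in exemplars[1:]:
    w = np.minimum(w, complement_code(a))   # fast learning: w <- I ^ w

u, v = w[:M], 1.0 - w[M:]                   # box corners [u_m, v_m] per feature
print("expected-value intervals:",
      [(round(lo, 2), round(hi, 2)) for lo, hi in zip(u, v)])
# -> [(0.2, 0.3), (0.7, 0.9)]: MIN and MAX bound each feature's variation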
[Figure 9 here. Labels: a^c = (1 - a_1, …, 1 - a_M); |I| = M.]
Figure 9. Complement coding uses on-cell and off-cell pairs to normalize input vectors. (Reprinted with permission from Carpenter, Grossberg, and Rosen, 1991.)

Learning in Fuzzy ART is stable because all adaptive weights can only decrease in time. Decreasing weights correspond to increasing sizes of category "boxes". This theorem is proved in Carpenter, Grossberg, and Rosen (1991). Smaller vigilance values lead to larger category boxes. Learning stops when the input space is covered by boxes. The use of complement coding works with the property of increasing box size to prevent a proliferation of categories. With fast learning, constant vigilance, and a finite input set of arbitrary size and composition, it has been proved that learning stabilizes after just one presentation of each input pattern. A fast-commit slow-recode option combines fast learning with a forgetting rule that buffers system memory against noise. Using this option, rare events can be rapidly learned, yet previously learned memories are not rapidly erased in response to statistically unreliable input fluctuations. The equations that define the Fuzzy ART algorithm are listed in Section 29.

27. FUZZY ARTMAP AND FUSION ARTMAP: SUPERVISED INCREMENTAL LEARNING, CATEGORIZATION, AND PREDICTION

Individual ART modules typically learn in an unsupervised mode. ART systems capable of supervised learning, categorization, and prediction have also recently been introduced (Asfour, Carpenter, Grossberg, and Lesher, 1993; Carpenter and Grossberg, 1992; Carpenter, Grossberg, and Reynolds, 1991; Carpenter, Grossberg, Markuzon, Reynolds, and Rosen, 1992; Carpenter, Grossberg, and Iizuka, 1992). Unlike many supervised learning networks, such as back propagation, these ART systems are capable of functioning in either an unsupervised or supervised mode, depending on whether environmental feedback is available. When supervised learning of Fuzzy ART controls category formation, a predictive error can force the creation of new categories that could not otherwise be learned, due to the monotone increase in category size through time in the unsupervised case. Supervision permits the creation of complex categorical structures without a loss of stability. The main additional ingredients whereby Fuzzy ART modules are combined into supervised ART architectures are now summarized.
[Figure 10 here. Panel labels: ∧ = fuzzy AND (conjunction), ∨ = fuzzy OR (disjunction); for x = (x_1, x_2) and y = (y_1, y_2), (x ∧ y)_i = min(x_i, y_i) and (x ∨ y)_i = max(x_i, y_i).]

Figure 10. Fuzzy AND and OR operations generate category hyper-rectangles. (Reprinted with permission from Carpenter, Grossberg, and Rosen, 1991.)
The simplest supervised ART systems are generically called ARTMAP. An ARTMAP that is built up from Fuzzy ART modules is called a Fuzzy ARTMAP system. Each Fuzzy ARTMAP system includes a pair of Fuzzy ART modules (ART_a and ART_b), as in Figure 11. During supervised learning, ART_a receives a stream {a^(p)} of input patterns and ART_b receives a stream {b^(p)} of input patterns, where b^(p) is the correct prediction given a^(p). These modules are linked by an associative learning network and an internal controller that ensures autonomous system operation in real time. The controller is designed to create the minimal number of ART_a recognition categories, or "hidden units," needed to meet accuracy criteria. As noted above, this is accomplished by realizing a Minimax Learning Rule that conjointly minimizes predictive error and maximizes predictive generalization. This scheme automatically links predictive success to category size on a trial-by-trial basis using only local operations. It works by increasing the vigilance parameter ρ_a of ART_a by the minimal amount needed to correct a predictive error at ART_b (Figure 12). Parameter ρ_a calibrates the minimum confidence that ART_a must have in a recognition category, or hypothesis, that is activated by an input a^(p) in order for ART_a to accept that category, rather than search for a better one through an automatically controlled process of hypothesis testing. As in ART 1, lower values of ρ_a enable larger categories to form. These lower ρ_a values lead to broader generalization and higher code compression. A predictive failure at ART_b increases the minimal confidence ρ_a by the least amount needed to trigger hypothesis testing at ART_a, using a mechanism called match tracking (Carpenter, Grossberg, and Reynolds, 1991). Match tracking sacrifices the minimum amount of generalization necessary to correct the predictive error. Speaking intuitively,
Figure 11. Fuzzy ARTMAP architecture. The ART_a complement coding preprocessor transforms the M_a-vector a into the 2M_a-vector A = (a, a^c) at the ART_a field F_0^a. A is the input vector to the ART_a field F_1^a. Similarly, the input to F_1^b is the 2M_b-vector (b, b^c). When a prediction by ART_a is disconfirmed at ART_b, inhibition of map field activation induces the match tracking process. Match tracking raises the ART_a vigilance ρ_a to just above the F_1^a match ratio |x^a|/|A|. This triggers an ART_a search which leads to activation of either an ART_a category that correctly predicts b or to a previously uncommitted ART_a category node. (Reprinted with permission from Carpenter, Grossberg, Markuzon, Reynolds, and Rosen, 1992.)
match tracking operationalizes the idea that the system must have accepted hypotheses with too little confidence to satisfy the demands of a particular environment. Match tracking increases the criterion confidence just enough to trigger hypothesis testing. Hypothesis testing leads to the selection of a new ART_a category, which focuses attention on a new cluster of a^(p) input features that is better able to predict b^(p). Due to the combination of match tracking and fast learning, a single ARTMAP system can learn a different prediction for a rare event than for a cloud of similar frequent events in which it is embedded.
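The match tracking cycle may be sketched as follows, in an assumed simplified form: the helper predicts_correctly stands in for the ART_b and map field check, and the choice parameter 0.01 is an illustrative value.

import numpy as np

def fuzzy_and(p, q):
    return np.minimum(p, q)

def match_tracking_search(I, W, predicts_correctly, rho_baseline, eps=1e-3):
    rho, disabled = rho_baseline, set()
    while len(disabled) < len(W):
        T = [fuzzy_and(I, w).sum() / (0.01 + w.sum())
             if j not in disabled else -1.0 for j, w in enumerate(W)]
        J = int(np.argmax(T))                   # choose the ART_a hypothesis
        match = fuzzy_and(I, W[J]).sum() / I.sum()
        if match < rho:                         # fails vigilance: reset
            disabled.add(J)
            continue
        if predicts_correctly(J):
            return J, rho                       # accepted hypothesis
        rho = match + eps                       # match tracking: raise rho_a
        disabled.add(J)                         #   just above the match ratio
    return None, rho                            # recruit an uncommitted node

# illustrative run: category 0 is chosen first but predicts wrongly, so
# vigilance is raised and the search moves on to category 1
W = [np.array([0.5, 0.1, 0.1, 0.5]), np.array([0.8, 0.3, 0.3, 0.8])]
I = np.array([0.8, 0.2, 0.2, 0.8])
print(match_tracking_search(I, W, lambda J: J == 1, rho_baseline=0.5))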
Figure 12. Match tracking: (a) A prediction is made by ART_a when the baseline vigilance ρ_a is less than the analog match value. (b) A predictive error at ART_b increases the baseline vigilance value of ART_a until it just exceeds the analog match value, and thereby triggers hypothesis testing that searches for a more predictive bundle of features to which to attend. (Reprinted with permission from Carpenter and Grossberg, 1992.)

A generalization of Fuzzy ARTMAP, called Fusion ARTMAP, has also recently been introduced to handle multidimensional data fusion, classification, and prediction problems (Asfour, Carpenter, Grossberg, and Lesher, 1993). In Fusion ARTMAP, multiple data channels process different sorts of input vectors in their own ART modules before all the ART modules cooperate to form a global classification and prediction. A predictive error simultaneously raises the vigilance parameters of all the component ART modules. The module with the poorest match of input to prototype is driven first to reset and search. As a result, the channels whose data are classified with the least confidence are searched before more confident classifications are reset. Channels which provide good data matches may thus not need to create new categories just because other channels exhibit poor matches. Using this parallel match tracking scheme, the network selectively improves learning where it is poor, while sparing the learning that is good. Such an automatic credit assignment has been shown in benchmark studies to generate more parsimonious classifications of multidimensional data than are learned by a one-channel Fuzzy ARTMAP. Two benchmark studies using Fuzzy ARTMAP are summarized below to show that even a one-channel network has powerful classification capabilities.

28. TWO BENCHMARK STUDIES: LETTER AND WRITTEN DIGIT RECOGNITION

As summarized in Table 1, Fuzzy ARTMAP has been benchmarked against a variety of machine learning, neural network, and genetic algorithms with considerable success.
ARTMAP BENCHMARK STUDIES

1. Medical database - mortality following coronary bypass grafting (CABG) surgery: FUZZY ARTMAP significantly outperforms LOGISTIC REGRESSION, ADDITIVE MODEL, BAYESIAN ASSIGNMENT, CLUSTER ANALYSIS, CLASSIFICATION AND REGRESSION TREES, EXPERT PANEL-DERIVED SICKNESS SCORES, and PRINCIPAL COMPONENT ANALYSIS.
2. Mushroom database: DECISION TREES (90-95% correct) vs. ARTMAP (100% correct), with a training set an order of magnitude smaller.
3. Letter recognition database: GENETIC ALGORITHM (82% correct) vs. FUZZY ARTMAP (96% correct).
4. Circle-in-the-Square task: BACK PROPAGATION (90% correct) vs. FUZZY ARTMAP (99.5% correct).
5. Two-Spiral task: BACK PROPAGATION (10,000-20,000 training epochs) vs. FUZZY ARTMAP (1-5 training epochs).

Table 1

An illustrative study used a benchmark machine learning task that Frey and Slate (1991) developed and described as a "difficult categorization problem" (p. 161). The task requires a system to identify an input exemplar as one of 26 capital letters A-Z. The database was derived from 20,000 unique black-and-white pixel images. The difficulty of the task is due to the wide variety of letter types represented: the twenty "fonts represent five different stroke styles (simplex, duplex, complex, and Gothic) and six different letter styles (block, script, italic, English, Italian, and German)" (p. 162). In addition, each image was randomly distorted, leaving many of the characters misshapen. Sixteen numerical feature attributes were then obtained from each character image, and each attribute value was scaled to a range of 0 to 15. The resulting Letter Image Recognition file is archived in the UCI Repository of Machine Learning Databases and Domain Theories, maintained by David Aha and Patrick Murphy ([email protected]). Frey and Slate used this database to test performance of a family of classifiers based on Holland's genetic algorithms (Holland, 1980). The training set consisted of 16,000 exemplars, with the remaining 4,000 exemplars used for testing. Genetic algorithm classifiers having different input representations, weight update and rule creation schemes, and system parameters were systematically compared. Training was carried out for 5 epochs, plus a sixth "verification" pass during which no new rules were created but a large number
of unsatisfactory rules were discarded. In Frey and Slate's comparative study, these systems had correct prediction rates that ranged from 24.5% to 80.8% on the 4,000-item test set. The best performance (80.8%) was obtained using an integer input representation, a reward sharing weight update, an exemplar method of rule creation, and a parameter setting that allowed an unused or erroneous rule to stay in the system for a long time before being discarded. After training, the optimal case, that had an 80.8% performance rate, ended with 1,302 rules and 8 attributes per rule, plus over 35,000 more rules that were discarded during verification. (For purposes of comparison, a rule is somewhat analogous to an ART_a category in ARTMAP, and the number of attributes per rule is analogous to the size of ART_a category weight vectors.) Building on the results of their comparative study, Frey and Slate investigated two types of alternative algorithms, namely an accuracy-utility bidding system, that had slightly improved performance (81.6%) in the best case; and an exemplar/hybrid rule creation scheme that further improved performance, to a maximum of 82.7%, but that required the creation of over 100,000 rules prior to the verification step. Fuzzy ARTMAP had an error rate on the letter recognition task that was consistently less than one third that of the three best Frey-Slate genetic algorithm classifiers described above. In particular, after 1 to 5 epochs, individual Fuzzy ARTMAP systems had a robust prediction rate of 90% to 94% on the 4,000-item test set. A voting strategy consistently improved this performance. This voting strategy is based on the observation that ARTMAP fast learning typically leads to different adaptive weights and recognition categories for different orderings of a given training set, even when the overall predictive accuracy of all simulations is similar. The different category structures cause the set of test items where errors occur to vary from one simulation to the next. The voting strategy uses an ARTMAP system that is trained several times on input sets with different orderings. The final prediction for a given test set item is the one made by the largest number of simulations. Since the set of items making erroneous predictions varies from one simulation to the next, voting cancels many of the errors. Such a voting strategy can also be used to assign confidence estimates to competing predictions given small, noisy, or incomplete training sets. Voting consistently eliminated 25%-43% of the errors, giving a robust prediction rate of 92%-96%. Moreover, Fuzzy ARTMAP simulations each created fewer than 1,070 ART_a categories, compared to the 1,040-1,302 final rules of the three genetic classifiers with the best performance rates. Most Fuzzy ARTMAP learning occurred on the first epoch, with test set performance on systems trained for one epoch typically over 97% that of systems exposed to inputs for five epochs. Rapid learning was also found in a benchmark study of written digit recognition, where the correct prediction rate on the test set after one epoch reached over 99% of its best performance (Carpenter, Grossberg, and Iizuka, 1992). In this study, Fuzzy ARTMAP was tested along with back propagation and a self-organizing feature map. Voting yielded Fuzzy ARTMAP average performance rates on the test set of 97.4% after an average number of 4.6 training epochs. Back propagation achieved its best average performance rates of 96% after 100 training epochs.
Self-organizing feature maps achieved a best level of 96.5%, again after many training epochs. In summary, on a variety of benchmarks (see also Table 1, Carpenter, Grossberg, and Reynolds, 1991, and Carpenter et al., 1992), Fuzzy ARTMAP has demonstrated either much faster learning, better performance, or both, than alternative machine learning, genetic, or neural network algorithms.
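The voting strategy described above is independent of the classifier's internals, and can be sketched as follows; train and predict are placeholders for any Fuzzy ARTMAP implementation, and only the reordering-and-majority logic is shown.

import numpy as np
from collections import Counter

def vote(train, predict, X_train, y_train, X_test, n_voters=5, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_voters):
        order = rng.permutation(len(X_train))   # a different input ordering
        models.append(train(X_train[order], y_train[order]))
    winners = []
    for x in X_test:
        ballots = Counter(predict(m, x) for m in models)
        winners.append(ballots.most_common(1)[0][0])   # majority prediction
    return winners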
ARTMAP can autonomously learn about:
(A) RARE EVENTS: need FAST learning.
(B) LARGE NONSTATIONARY DATABASES: need STABLE learning.
(C) MORPHOLOGICALLY VARIABLE EVENTS: need MULTIPLE SCALES of generalization (fine/coarse).
(D) ONE-TO-MANY AND MANY-TO-ONE RELATIONSHIPS: need categorization, naming, and expert knowledge.
To realize these properties ARTMAP systems:
(E) PAY ATTENTION: ignore masses of irrelevant data.
(F) TEST HYPOTHESES: discover predictive constraints hidden in data streams.
(G) CHOOSE BEST ANSWERS: quickly select globally optimal solution at any stage of learning.
(H) CALIBRATE CONFIDENCE: measure on-line how well a hypothesis matches the data.
(I) DISCOVER RULES: identify transparent IF-THEN relations at each learning stage.
(J) SCALE: preserve all desirable properties in arbitrarily large problems.

Table 2

Perhaps more importantly, Fuzzy ARTMAP can be used in an important class of applications where many other adaptive pattern recognition algorithms cannot perform well (see Table 2). These are the applications where very large nonstationary databases need to be rapidly organized into stable variable-compression categories under real-time autonomous learning conditions.

29. SUMMARY OF THE FUZZY ART ALGORITHM

ART field activity vectors: Each ART system includes a field F0 of nodes that represent a current input vector, and a field F1 that receives both bottom-up input from F0 and top-down input from a field F2 that represents the active code, or category. The F0 activity vector is denoted I = (I_1, …, I_M), with each component I_i in the interval [0, 1], i = 1, …, M. The F1 activity vector is denoted x = (x_1, …, x_M) and the F2 activity vector is denoted y = (y_1, …, y_N). The number of nodes in each field is arbitrary.

Weight vector: Associated with each F2 category node j (j = 1, …, N) is a vector
w_j ≡ (w_j1, …, w_jM) of adaptive weights, or LTM traces. Initially

w_j1(0) = … = w_jM(0) = 1;   (96)

then each category is said to be uncommitted. After a category is selected for coding it becomes committed. As shown below, each LTM trace w_ji is monotone nonincreasing through time and hence converges to a limit. The Fuzzy ART weight vector w_j subsumes both the bottom-up and top-down weight vectors of ART 1.

Parameters: Fuzzy ART dynamics are determined by a choice parameter α > 0; a learning rate parameter β ∈ [0, 1]; and a vigilance parameter ρ ∈ [0, 1].

Category choice: For each input I and F2 node j, the choice function T_j is defined by
T_j(I) = |I ∧ w_j| / (α + |w_j|),   (97)

where the fuzzy AND operator ∧ is defined by

(p ∧ q)_i ≡ min(p_i, q_i)   (98)

and where the norm |·| is defined by

|p| ≡ Σ_{i=1}^M |p_i|   (99)

for any M-dimensional vectors p and q. For notational simplicity, T_j(I) in (97) is often written as T_j when the input I is fixed. The system is said to make a category choice when at most one F2 node can become active at a given time. The category choice is indexed by J, where

T_J = max{T_j : j = 1 … N}.   (100)
If more than one T_j is maximal, the category j with the smallest index is chosen. In particular, nodes become committed in order j = 1, 2, 3, …. When the Jth category is chosen, y_J = 1; and y_j = 0 for j ≠ J. In a choice system, the F1 activity vector x obeys the equation

x = I if F2 is inactive;   x = I ∧ w_J if the Jth F2 node is chosen.   (101)
Resonance or reset: Resonance occurs if the match function 11 A wjI/III of the chosen category meets the vigilance criterion:
that is, by (6), when the J i h category is chosen, resonance occurs if
Learning then ensues, as defined below. Mismatch reset occurs if

|I ∧ w_J| / |I| < ρ;   (104)

that is, if

|x| = |I ∧ w_J| < ρ|I|.   (105)

Then the value of the choice function T_J is set to 0 for the duration of the input presentation to prevent the persistent selection of the same category during search. A new index J is then chosen, by (100). The search process continues until the chosen J satisfies (102).

Learning: Once search ends, the weight vector w_J is updated according to the equation

w_J^(new) = β (I ∧ w_J^(old)) + (1 − β) w_J^(old).   (106)
Fast learning corresponds to setting β = 1. The learning law used in the EACH system of Salzberg (1990) is equivalent to equation (106) in the fast-learn limit with the complement coding option described below.

Fast-commit slow-recode option: For efficient coding of noisy input sets, it is useful to set β = 1 when J is an uncommitted node, and then to take β < 1 after the category is committed. Then w_J^(new) = I the first time category J becomes active. Moore (1989) introduced the learning law (106), with fast commitment and slow recoding, to investigate a variety of generalized ART 1 models. Some of these models are similar to Fuzzy ART, but none includes the complement coding option. Moore described a category proliferation problem that can occur in some analog ART systems when a large number of inputs erode the norm of weight vectors. Complement coding solves this problem.

Input normalization/complement coding option: Proliferation of categories is avoided in Fuzzy ART if inputs are normalized. Complement coding is a normalization rule that preserves amplitude information. Complement coding represents both the on-response and the off-response to an input vector a (Figure 8). To define this operation in its simplest form, let a itself represent the on-response. The complement of a, denoted by a^c, represents the off-response, where

a_i^c ≡ 1 − a_i.   (107)

The complement coded input I to the field F1 is the 2M-dimensional vector

I = (a, a^c) = (a_1, ..., a_M, a_1^c, ..., a_M^c).   (108)

Note that

|I| = |(a, a^c)| = M,   (109)

so inputs preprocessed into complement coding form are automatically normalized. Where complement coding is used, the initial condition (96) is replaced by

w_j1(0) = ... = w_j,2M(0) = 1.   (110)
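Stated in code, one Fuzzy ART input presentation takes only a few lines. The following Python sketch is ours, written directly from equations (96)-(110); the array layout, function names, and default parameter values are illustrative assumptions, not part of the original specification.

import numpy as np

def complement_code(a):
    # Equations (107)-(108): I = (a, a^c) with a_i^c = 1 - a_i.
    return np.concatenate([a, 1.0 - a])

def fuzzy_art_present(I, W, alpha=0.001, beta=1.0, rho=0.75):
    # One input presentation. I: complement-coded input of length 2M.
    # W: N x 2M weight matrix, rows initialized to 1 as in (110).
    match = np.minimum(I, W)                          # fuzzy AND (98), componentwise min
    T = match.sum(axis=1) / (alpha + W.sum(axis=1))   # choice function (97)
    norm_I = I.sum()                                  # |I| = M by (109)
    for _ in range(W.shape[0]):
        J = int(np.argmax(T))                         # (100); ties go to the smallest index
        if match[J].sum() >= rho * norm_I:            # vigilance criterion (102)-(103)
            # Learning (106): w_J <- beta (I ^ w_J) + (1 - beta) w_J.
            W[J] = beta * match[J] + (1.0 - beta) * W[J]
            return J, W
        T[J] = 0.0                                    # mismatch reset (104)-(105)
    raise RuntimeError("no node passed the vigilance test")

Because uncommitted nodes keep all weights at 1, their match ratio is |I|/|I| = 1 ≥ ρ, so the search is guaranteed to terminate; with fast learning (β = 1) the winning node simply stores I ∧ w_J.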
30. FUZZY ARTMAP ALGORITHM

The Fuzzy ARTMAP system incorporates two Fuzzy ART modules ART_a and ART_b that are linked together via an inter-ART module F^ab called a map field. The map field is used to form predictive associations between categories and to realize the match tracking rule, whereby the vigilance parameter of ART_a increases in response to a predictive mismatch at ART_b. The interactions mediated by the map field F^ab may be operationally characterized as follows.

ART_a and ART_b: Inputs to ART_a and ART_b are in the complement code form: for ART_a, I = A = (a, a^c); for ART_b, I = B = (b, b^c) (Figure 10). Variables in ART_a or ART_b are designated by subscripts or superscripts "a" or "b". For ART_a, let x^a ≡ (x_1^a, ..., x_2Ma^a) denote the F1^a output vector; let y^a ≡ (y_1^a, ..., y_Na^a) denote the F2^a output vector; and let w_j^a ≡ (w_j1^a, w_j2^a, ..., w_j,2Ma^a) denote the jth ART_a weight vector. For ART_b, let x^b ≡ (x_1^b, ..., x_2Mb^b) denote the F1^b output vector; let y^b ≡ (y_1^b, ..., y_Nb^b) denote the F2^b output vector; and let w_k^b ≡ (w_k1^b, w_k2^b, ..., w_k,2Mb^b) denote the kth ART_b weight vector. For the map field, let x^ab ≡ (x_1^ab, ..., x_Nb^ab) denote the F^ab output vector, and let w_j^ab ≡ (w_j1^ab, ..., w_j,Nb^ab) denote the weight vector from the jth F2^a node to F^ab. Vectors x^a, y^a, x^b, y^b, and x^ab are set to 0 between input presentations.

Map field activation: The map field F^ab is activated whenever one of the ART_a or ART_b categories is active. If node J of F2^a is chosen, then its weights w_J^ab activate F^ab. If node K in F2^b is active, then the node K in F^ab is activated by 1-to-1 pathways between F2^b and F^ab. If both ART_a and ART_b are active, then F^ab becomes active only if ART_a predicts the same category as ART_b via the weights w_J^ab. The F^ab output vector x^ab obeys

x^ab = y^b ∧ w_J^ab   if the Jth F2^a node is active and F2^b is active
x^ab = w_J^ab         if the Jth F2^a node is active and F2^b is inactive
x^ab = y^b            if F2^a is inactive and F2^b is active
x^ab = 0              if F2^a is inactive and F2^b is inactive.   (111)

By (111), x^ab = 0 if the prediction w_J^ab is disconfirmed by y^b. Such a mismatch event triggers an ART_a search for a better category, as follows.

Match tracking: At the start of each input presentation the ART_a vigilance parameter ρ_a equals a baseline vigilance ρ̄_a. The map field vigilance parameter is ρ_ab. If
|x^ab| < ρ_ab |y^b|,   (112)

then ρ_a is increased until it is slightly larger than |A ∧ w_J^a| |A|^(-1), where A is the input to F1^a, in complement coding form. Then

|x^a| = |A ∧ w_J^a| < ρ_a |A|,   (113)

where J is the index of the active F2^a node, as in (105). When this occurs, ART_a search leads either to activation of another F2^a node J with

|x^a| = |A ∧ w_J^a| ≥ ρ_a |A|   (114)
and

|x^ab| = |y^b ∧ w_J^ab| ≥ ρ_ab |y^b|,   (115)

or, if no such node exists, to the shut-down of F2^a for the remainder of the input presentation.

Map field learning: Learning rules determine how the map field weights w_jk^ab change through time, as follows. Weights w_jk^ab in F2^a → F^ab paths initially satisfy

w_jk^ab(0) = 1.

During resonance with the ART_a category J active, w_J^ab approaches the map field vector x^ab. With fast learning, once J learns to predict the ART_b category K, that association is permanent; i.e., w_JK^ab = 1 for all time.
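The supervised loop can be sketched the same way. The Python fragment below is our own simplification: it treats the active ART_b category K as given, represents y^b as one-hot, and renders the map field test (112) together with the match tracking step (113)-(115), reusing the conventions of the Fuzzy ART sketch given after equation (110); parameter values are again illustrative.

import numpy as np

def artmap_present(A, K, Wa, Wab, alpha=0.001, beta=1.0,
                   rho_a_bar=0.0, rho_ab=0.95, eps=0.001):
    # A: complement-coded ART_a input; K: index of the active ART_b category.
    # Wa: ART_a weights (Na x 2Ma); Wab: map field weights (Na x Nb), initially 1.
    rho_a = rho_a_bar                     # vigilance starts at its baseline (match tracking)
    norm_A = A.sum()
    match = np.minimum(A, Wa)
    T = match.sum(axis=1) / (alpha + Wa.sum(axis=1))
    while True:
        J = int(np.argmax(T))
        if T[J] <= 0.0:
            return None                   # F2a shuts down for this input presentation
        if match[J].sum() < rho_a * norm_A:
            T[J] = 0.0                    # ART_a mismatch reset
            continue
        # With y^b one-hot at K, |y^b ^ w_J^ab| = w_JK^ab and |y^b| = 1,
        # so test (112) reads: map field mismatch iff w_JK^ab < rho_ab.
        if Wab[J, K] < rho_ab:
            rho_a = match[J].sum() / norm_A + eps    # raise vigilance just above (113)
            T[J] = 0.0
            continue
        Wa[J] = beta * match[J] + (1.0 - beta) * Wa[J]   # ART_a learning (106)
        Wab[J] = 0.0
        Wab[J, K] = 1.0                   # fast map field learning: permanent J -> K link
        return J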
REFERENCES
Adams, J.A. (1967). Human memory. New York: McGraw-Hill.
Amari, S.-I. and Arbib, M. (Eds.) (1982). Competition and cooperation in neural networks. New York, NY: Springer-Verlag.
Amari, S.-I. and Takeuchi, A. (1978). Mathematical theory on formation of category detecting nerve cells. Biological Cybernetics, 29, 127-136.
Asch, S.E. and Ebenholtz, S.M. (1962). The principle of associative symmetry. Proceedings of the American Philosophical Society, 106, 135-163.
Asfour, Y.R., Carpenter, G.A., Grossberg, S., and Lesher, G. (1993). Fusion ARTMAP: A neural network architecture for multi-channel data fusion and classification. Technical Report CAS/CNS TR-93-004. Boston, MA: Boston University. Submitted for publication.
Bienenstock, E.L., Cooper, L.N., and Munro, P.W. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience, 2, 32-48.
Bradski, G., Carpenter, G.A., and Grossberg, S. (1992). Working memory networks for learning multiple groupings of temporal order with application to 3-D visual object recognition. Neural Computation, 4, 270-286.
Carpenter, G.A. and Grossberg, S. (1987a). A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, 37, 54-115.
Carpenter, G.A. and Grossberg, S. (1987b). ART 2: Stable self-organization of pattern recognition codes for analog input patterns. Applied Optics, 26, 4919-4930.
Carpenter, G.A. and Grossberg, S. (1987c). Neural dynamics of category learning and recognition: Attention, memory consolidation, and amnesia. In S. Grossberg (Ed.), The adaptive brain, I: Cognition, learning, reinforcement, and rhythm. Amsterdam: Elsevier/North-Holland, pp. 238-286.
Carpenter, G.A. and Grossberg, S. (Eds.) (1991). Pattern recognition by self-organizing neural networks. Cambridge, MA: MIT Press.
Carpenter, G.A. and Grossberg, S. (1992). Fuzzy ARTMAP: Supervised learning, recognition, and prediction by a self-organizing neural network. IEEE Communications Magazine, 30, 38-49.
Carpenter, G.A. and Grossberg, S. (1993). Normal and amnesic learning, recognition, and memory by a neural model of cortico-hippocampal interactions. Technical Report CAS/CNS TR-92-021. Boston, MA: Boston University. Trends in Neurosciences, in press.
Carpenter, G.A., Grossberg, S., Markuzon, M., Reynolds, J.H., and Rosen, D.B. (1992). Fuzzy ARTMAP: A neural network architecture for incremental supervised learning of analog multidimensional maps. IEEE Transactions on Neural Networks, 3, 698-713.
Carpenter, G.A., Grossberg, S., and Reynolds, J.H. (1991). ARTMAP: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network. Neural Networks, 4, 565-588.
Carpenter, G.A., Grossberg, S., and Rosen, D.B. (1991). Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system. Neural Networks, 4, 759-771.
Carpenter, G.A., Grossberg, S., and Iizuka, K. (1992). Comparative performance measures of Fuzzy ARTMAP, learned vector quantization, and back propagation for handwritten character recognition. Proceedings of the international joint conference on neural networks, I, 794-799. Piscataway, NJ: IEEE Service Center.
Cohen, M.A. (1988). Sustained oscillations in a symmetric cooperative-competitive neural network: Disproof of a conjecture about a content addressable memory. Neural Networks, 1, 217-221.
Cohen, M.A. (1990). The stability of sustained oscillations in symmetric cooperative-competitive networks. Neural Networks, 3, 609-612.
Cohen, M.A. (1992). The construction of arbitrary stable dynamics in nonlinear neural networks. Neural Networks, 5, 83-103.
Cohen, M.A. and Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13, 815-826.
Cohen, M.A. and Grossberg, S. (1986). Neural dynamics of speech and language coding: Developmental programs, perceptual grouping, and competition for short term memory. Human Neurobiology, 5, 1-22.
Cohen, M.A., Grossberg, S., and Pribe, C. (1993). A neural pattern generator that exhibits frequency-dependent bi-manual coordination effects and quadruped gait transitions. Technical Report CAS/CNS TR-93-004. Boston, MA: Boston University. Submitted for publication.
Cole, K.S. (1968). Membranes, ions, and impulses. Berkeley, CA: University of California Press.
Collins, A.M. and Loftus, E.F. (1975). A spreading-activation theory of semantic memory. Psychological Review, 82, 407-428.
Commons, M.L., Grossberg, S., and Staddon, J.E.R. (Eds.) (1991). Neural network models of conditioning and action. Hillsdale, NJ: Lawrence Erlbaum Associates.
Cornsweet, T.N. (1970). Visual perception. New York, NY: Academic Press.
Crick, F. and Koch, C. (1990). Some reflections on visual awareness. Cold Spring Harbor symposium on quantitative biology, LV, The brain. Plainview, NY: Cold Spring Harbor Laboratory Press, 953-962.
Desimone, R. (1992). Neural circuits for visual attention in the primate brain. In G.A. Carpenter and S. Grossberg (Eds.), Neural networks for vision and image processing. Cambridge, MA: MIT Press, pp. 343-364.
Dixon, T.R. and Horton, D.L. (1968). Verbal behavior and general behavior theory. Englewood Cliffs, NJ: Prentice-Hall.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., and Reitboeck, H.J. (1988). Coherent oscillations: A mechanism of feature linking in the visual cortex? Biological Cybernetics, 60, 121-130.
Eckhorn, R. and Schanze, T. (1991). Possible neural mechanisms of feature linking in the visual system: Stimulus-locked and stimulus-induced synchronizations. In A. Babloyantz (Ed.), Self-organization, emerging properties, and learning. New York, NY: Plenum Press, pp. 63-80.
Ellias, S. and Grossberg, S. (1975). Pattern formation, contrast control, and oscillations in the short term memory of shunting on-center off-surround networks. Biological Cybernetics, 20, 69-98.
Frey, P.W. and Slate, D.J. (1991). Letter recognition using Holland-style adaptive classifiers. Machine Learning, 6, 161-182.
Gaudiano, P. (1992a). A unified neural model of spatio-temporal processing in X and Y retinal ganglion cells. Biological Cybernetics, 67, 11-21.
Gaudiano, P. (1992b). Toward a unified theory of spatio-temporal processing in the retina. In G. Carpenter and S. Grossberg (Eds.), Neural networks for vision and image processing. Cambridge, MA: MIT Press, pp. 195-220.
Gaudiano, P. and Grossberg, S. (1991). Vector associative maps: Unsupervised real-time error-based learning and control of movement trajectories. Neural Networks, 4, 147-183.
Geman, S. (1981). The law of large numbers in neural modelling. In S. Grossberg (Ed.), Mathematical psychology and psychophysiology. Providence, RI: American Mathematical Society, pp. 91-106.
Gottlieb, G. (Ed.) (1976). Neural and behavioral specificity (Vol. 3). New York, NY: Academic Press.
Gray, C.M., Konig, P., Engel, A.K., and Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334-337.
Gray, C.M. and Singer, W. (1989). Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proceedings of the National Academy of Sciences, 86, 1698-1702.
Grossberg, S. (1961). Senior Fellowship thesis, Dartmouth College.
Grossberg, S. (1964). The theory of embedding fields with applications to psychology and neurophysiology. New York: Rockefeller Institute for Medical Research.
Grossberg, S. (1967). Nonlinear difference-differential equations in prediction and learning theory. Proceedings of the National Academy of Sciences, 58, 1329-1334.
Grossberg, S. (1968a). Some physiological and biochemical consequences of psychological postulates. Proceedings of the National Academy of Sciences, 60, 758-765.
Grossberg, S. (1968b). Some nonlinear networks capable of learning a spatial pattern of arbitrary complexity. Proceedings of the National Academy of Sciences, 59, 368-372.
Grossberg, S. (1969a). Embedding fields: A theory of learning with physiological implications. Journal of Mathematical Psychology, 6, 209-239.
Grossberg, S. (1969b). On learning, information, lateral inhibition, and transmitters. Mathematical Biosciences, 4, 255-310.
Grossberg, S. (1969c). On the serial learning of lists. Mathematical Biosciences, 4, 201-253.
Grossberg, S. (1969d). On learning and energy-entropy dependence in recurrent and nonrecurrent signed networks. Journal of Statistical Physics, 1, 319-350.
Grossberg, S. (1969e). Some networks that can learn, remember, and reproduce any number of complicated space-time patterns, I. Journal of Mathematics and Mechanics, 19, 53-91.
Grossberg, S. (1969f). On the production and release of chemical transmitters and related topics in cellular control. Journal of Theoretical Biology, 22, 325-364.
Grossberg, S. (1969g). On variational systems of some nonlinear difference-differential equations. Journal of Differential Equations, 6, 544-577.
Grossberg, S. (1970a). Neural pattern discrimination. Journal of Theoretical Biology, 27, 291-337.
Grossberg, S. (1970b). Some networks that can learn, remember, and reproduce any number of complicated space-time patterns, II. Studies in Applied Mathematics, 49, 135-166.
Grossberg, S. (1971a). Pavlovian pattern learning by nonlinear neural networks. Proceedings of the National Academy of Sciences, 68, 828-831.
Grossberg, S. (1971b). On the dynamics of operant conditioning. Journal of Theoretical Biology, 33, 225-255.
Grossberg, S. (1972a). Pattern learning by functional-differential neural networks with arbitrary path weights. In K. Schmitt (Ed.), Delay and functional-differential equations and their applications. New York: Academic Press. Reprinted in S. Grossberg (1982), Studies of mind and brain, pp. 157-193. Boston, MA: Reidel Press.
Grossberg, S. (1972b). Neural expectation: Cerebellar and retinal analogs of cells fired by learnable or unlearned pattern classes. Kybernetik, 10, 49-57.
Grossberg, S. (1972c). A neural theory of punishment and avoidance, I: Qualitative theory. Mathematical Biosciences, 15, 39-67.
Grossberg, S. (1972d). A neural theory of punishment and avoidance, II: Quantitative theory. Mathematical Biosciences, 15, 253-285.
Grossberg, S. (1973). Contour enhancement, short term memory, and constancies in reverberating neural networks. Studies in Applied Mathematics, 52, 217-257. Reprinted in S. Grossberg (1982), Studies of mind and brain, pp. 332-378. Boston, MA: Reidel Press.
Grossberg, S. (1974). Classical and instrumental learning by neural networks. In R. Rosen and F. Snell (Eds.), Progress in theoretical biology. New York: Academic Press. Reprinted in S. Grossberg (1982), Studies of mind and brain, pp. 65-156. Boston, MA: Reidel Press.
Grossberg, S. (1975). A neural model of attention, reinforcement, and discrimination learning. International Review of Neurobiology, 18, 263-327. Reprinted in S. Grossberg (1982), Studies of mind and brain, pp. 229-295. Boston, MA: Reidel Press.
Grossberg, S. (1976a). On the development of feature detectors in the visual cortex with applications to learning and reaction-diffusion systems. Biological Cybernetics, 21, 145-159.
Grossberg, S. (1976b). Adaptive pattern classification and universal recoding, I: Parallel development and coding of neural feature detectors. Biological Cybernetics, 23, 121-134.
Grossberg, S. (1976c). Adaptive pattern classification and universal recoding, II: Feedback, expectation, olfaction, and illusions. Biological Cybernetics, 23, 187-202.
Grossberg, S. (1976d). On the development of feature detectors in the visual cortex with applications to learning and reaction-diffusion systems. Biological Cybernetics, 21, 145-159.
Grossberg, S. (1978a). A theory of human memory: Self-organization and performance of sensory-motor codes, maps, and plans. In R. Rosen and F. Snell (Eds.), Progress in theoretical biology, Vol. 5. New York: Academic Press. Reprinted in S. Grossberg (1982), Studies of mind and brain, pp. 498-639. Boston, MA: Reidel Press.
Grossberg, S. (1978b). Behavioral contrast in short term memory: Serial binary memory models or parallel continuous memory models? Journal of Mathematical Psychology, 3, 199-219.
Grossberg, S. (1978c). Decisions, patterns, and oscillations in nonlinear competitive systems with applications to Volterra-Lotka systems. Journal of Theoretical Biology, 73, 101-130.
Grossberg, S. (1978d). Competition, decision, and consensus. Journal of Mathematical Analysis and Applications, 66, 470-493.
Grossberg, S. (1980a). How does a brain build a cognitive code? Psychological Review, 87, 1-51.
Grossberg, S. (1980b). Intracellular mechanisms of adaptation and self-regulation in self-organizing networks: The role of chemical transducers. Bulletin of Mathematical Biology, 42, 365-396.
Grossberg, S. (1980c). Biological competition: Decision rules, pattern formation, and oscillations. Proceedings of the National Academy of Sciences, 77, 2338-2342.
Grossberg, S. (1981). Adaptive resonance in development, perception, and cognition. In S. Grossberg (Ed.), Mathematical psychology and psychophysiology. Providence, RI: American Mathematical Society.
Grossberg, S. (1982a). Studies of mind and brain: Neural principles of learning, perception, development, cognition, and motor control. Boston, MA: Reidel Press.
Grossberg, S. (1982b). Associative and competitive principles of learning and development: The temporal unfolding and stability of STM and LTM patterns. In S.-I. Amari and M. Arbib (Eds.), Competition and cooperation in neural networks. New York: Springer-Verlag.
Grossberg, S. (1982c). A psychophysiological theory of reinforcement, drive, motivation, and attention. Journal of Theoretical Neurobiology, 1, 286-369.
Grossberg, S. (1983). The quantized geometry of visual space: The coherent computation of depth, form, and lightness. Behavioral and Brain Sciences, 6, 625-657.
Grossberg, S. (1984). Some psychophysiological and pharmacological correlates of a developmental, cognitive, and motivational theory. In J. Cohen, R. Karrer, and P. Tueting (Eds.), Brain and information: Event related potentials, 425, 58-151, Annals of the New York Academy of Sciences. Reprinted in S. Grossberg (Ed.), The adaptive brain, Volume I, 1987. Amsterdam: Elsevier/North-Holland.
Grossberg, S. (1986). The adaptive self-organization of serial order in behavior: Speech, language, and motor control. In E.C. Schwab and H.C. Nusbaum (Eds.), Pattern recognition by humans and machines, Volume 1: Speech perception, pp. 187-294. New York, NY: Academic Press. Reprinted in S. Grossberg (Ed.), The adaptive brain, Volume II, 1987. Amsterdam: Elsevier/North-Holland.
Grossberg, S. (1988). Nonlinear neural networks: Principles, mechanisms, and architectures. Neural Networks, 1, 17-61.
Grossberg, S. and Kuperstein, M. (1986). Neural dynamics of adaptive sensory-motor control. Amsterdam: Elsevier/North-Holland; expanded edition, 1989, Elmsford, NY: Pergamon Press.
Grossberg, S. and Merrill, J.W.L. (1992). A neural network model of adaptively timed reinforcement learning and hippocampal dynamics. Cognitive Brain Research, 1, 3-38.
Grossberg, S. and Mingolla, E. (1985a). Neural dynamics of form perception: Boundary completion, illusory figures, and neon color spreading. Psychological Review, 92, 173-211.
Grossberg, S. and Mingolla, E. (1985b). Neural dynamics of perceptual grouping: Textures, boundaries, and emergent segmentations. Perception and Psychophysics, 38, 141-171.
Grossberg, S. and Pepe, J. (1970). Schizophrenia: Possible dependence of associational span, bowing, and primacy versus recency on spiking threshold. Behavioral Science, 15, 359-362.
Grossberg, S. and Pepe, J. (1971). Spiking threshold and overarousal effects in serial learning. Journal of Statistical Physics, 3, 95-125.
Grossberg, S. and Somers, D. (1991). Synchronized oscillations during cooperative feature linking in a cortical model of visual perception. Neural Networks, 4, 453-466.
Grossberg, S. and Somers, D. (1992). Synchronized oscillations for binding spatially distributed feature codes into coherent spatial patterns. In G.A. Carpenter and S. Grossberg (Eds.), Neural networks for vision and image processing. Cambridge, MA: MIT Press, pp. 385-406.
Grossberg, S. and Stone, G.O. (1986). Neural dynamics of word recognition and recall: Attentional priming, learning, and resonance. Psychological Review, 93, 46-74.
Grossberg, S. and Todorović, D. (1988). Neural dynamics of 1-D and 2-D brightness perception: A unified model of classical and recent phenomena. Perception and Psychophysics, 43, 241-277.
Hebb, D.O. (1949). The organization of behavior. New York, NY: Wiley Press.
Hecht-Nielsen, R. (1987). Counterpropagation networks. Applied Optics, 26, 4979-4984.
Hirsch, M.W. (1982). Systems of differential equations which are competitive or cooperative, I: Limit sets. SIAM Journal of Mathematical Analysis, 13, 167-179.
Hirsch, M.W. (1985). Systems of differential equations which are competitive or cooperative, II: Convergence almost everywhere. SIAM Journal of Mathematical Analysis, 16, 423-439.
Hirsch, M.W. (1989). Convergent activation dynamics in continuous time networks. Neural Networks, 2, 331-350.
Hodgkin, A.L. (1964). The conduction of the nervous impulse. Liverpool, UK: Liverpool University Press.
Holland, J.H. (1980). Adaptive algorithms for discovering and using general patterns in growing knowledge bases. International Journal of Policy Analysis and Information Systems, 4, 217-240.
Hopfield, J.J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554-2558.
Hopfield, J.J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088-3092.
Hubel, D.H. and Wiesel, T.N. (1977). Functional architecture of macaque monkey visual cortex. Proceedings of the Royal Society of London (B), 198, 1-59.
Hunt, R.K. and Jacobson, M. (1974). Specification of positional information in retinal ganglion cells of Xenopus laevis: Intraocular control of the time of specification. Proceedings of the National Academy of Sciences, 71, 3616-3620.
Iverson, G.J. and Pavel, M. (1981). Invariant properties of masking phenomena in psychoacoustics and their theoretical consequences. In S. Grossberg (Ed.), Mathematical psychology and psychophysiology. Providence, RI: American Mathematical Society, pp. 17-24.
Jung, J. (1968). Verbal learning. New York: Holt, Rinehart, and Winston.
Kandel, E.R. and O'Dell, T.J. (1992). Are adult learning mechanisms also used for development? Science, 258, 243-245.
Kandel, E.R. and Schwartz, J.H. (1981). Principles of neural science. New York, NY: Elsevier/North-Holland.
Katz, B. (1966). Nerve, muscle, and synapse. New York, NY: McGraw-Hill.
Khinchin, A.I. (1967). Mathematical foundations of information theory. New York, NY: Dover Press.
Klatsky, R.L. (1980). Human memory: Structures and processes. San Francisco, CA: W.H. Freeman.
Kohonen, T. (1984). Self-organization and associative memory. New York, NY: Springer-Verlag.
Kosko, B. (1986). Fuzzy entropy and conditioning. Information Sciences, 40, 165-174.
Levine, D. and Grossberg, S. (1976). On visual illusions in neural networks: Line neutralization, tilt aftereffect, and angle expansion. Journal of Theoretical Biology, 61, 477-504.
Levy, W.B. (1985). Associative changes at the synapse: LTP in the hippocampus. In W.B. Levy, J. Anderson, and S. Lehmkuhle (Eds.), Synaptic modification, neuron selectivity, and nervous system organization. Hillsdale, NJ: Lawrence Erlbaum Associates, pp. 5-33.
Levy, W.B., Brassel, S.E., and Moore, S.D. (1983). Partial quantification of the associative synaptic learning rule of the dentate gyrus. Neuroscience, 8, 799-808.
Levy, W.B. and Desmond, N.L. (1985). The rules of elemental synaptic plasticity. In W.B. Levy, J. Anderson, and S. Lehmkuhle (Eds.), Synaptic modification, neuron selectivity, and nervous system organization. Hillsdale, NJ: Lawrence Erlbaum Associates, pp. 105-121.
Linsker, R. (1986). From basic network principles to neural architecture. Proceedings of the National Academy of Sciences, 83, 7508-7512, 8390-8394, 8779-8783.
Maher, B.A. (1977). Contributions to the psychopathology of schizophrenia. New York, NY: Academic Press.
Malsburg, C. von der (1973). Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14, 85-100.
Malsburg, C. von der and Willshaw, D.J. (1981). Differential equations for the development of topological nerve fibre projections. In S. Grossberg (Ed.), Mathematical psychology and psychophysiology. Providence, RI: American Mathematical Society, pp. 39-48.
May, R.M. and Leonard, W.J. (1975). Nonlinear aspects of competition between three species. SIAM Journal on Applied Mathematics, 29, 243-253.
McCulloch, W.S. and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115-133.
McGeogh, J.A. and Irion, A.L. (1952). The psychology of human learning, Second edition. New York: Longmans and Green.
Miller, G.A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63, 81-97.
Moore, B. (1989). ART 1 and pattern clustering. In D. Touretzky, G. Hinton, and T. Sejnowski (Eds.), Proceedings of the 1988 connectionist models summer school. San Mateo, CA: Morgan Kaufmann, pp. 174-185.
Murdock, B.B. (1974). Human memory: Theory and data. Potomac, MD: Erlbaum Press.
Nabet, B. and Pinter, R.B. (1991). Sensory neural networks: Lateral inhibition. Boca Raton, FL: CRC Press.
Norman, D.A. (1969). Memory and attention: An introduction to human information processing. New York, NY: Wiley and Sons.
Osgood, C.E. (1953). Method and theory in experimental psychology. New York, NY: Oxford Press.
Plonsey, R. and Fleming, D.G. (1969). Bioelectric phenomena. New York, NY: McGraw-Hill.
Rauschecker, J.P. and Singer, W. (1979). Changes in the circuitry of the kitten's visual cortex are gated by postsynaptic activity. Nature, 280, 58-60.
Repp, B.H. (1991). Perceptual restoration of a "missing" speech sound: Auditory induction or illusion? Haskins Laboratories Status Report on Speech Research, SR-107/108, 147-170.
Rumelhart, D.E. and Zipser, D. (1985). Feature discovery by competitive learning. Cognitive Science, 9, 75-112.
Rundus, D. (1971). Analysis of rehearsal processes in free recall. Journal of Experimental Psychology, 89, 63-77.
Salzberg, S.L. (1990). Learning with nested generalized exemplars. Boston, MA: Kluwer Academic Publishers.
Samuel, A.G. (1981a). Phonemic restoration: Insights from a new methodology. Journal of Experimental Psychology: General, 110, 474-494.
Samuel, A.G. (1981b). The role of bottom-up confirmation in the phonemic restoration illusion. Journal of Experimental Psychology: Human Perception and Performance, 7, 1124-1131.
Schvaneveldt, R.W. and MacDonald, J.E. (1981). Semantic context and the encoding of words: Evidence for two modes of stimulus analysis. Journal of Experimental Psychology: Human Perception and Performance, 7, 673-687.
Singer, W. (1983). Neuronal activity as a shaping factor in the self-organization of neuron assemblies. In E. Basar, H. Flohr, H. Haken, and A.J. Mandell (Eds.), Synergetics of the brain. New York, NY: Springer-Verlag, pp. 89-101.
Smith, E.E. (1990). In D.O. Osherson and E.E. Smith (Eds.), An invitation to cognitive science. Cambridge, MA: MIT Press.
Somers, D. and Kopell, N. (1993). Rapid synchronization through fast threshold modulation. Biological Cybernetics, in press.
Underwood, B.J. (1966). Experimental psychology, Second edition. New York: Appleton-Century-Crofts.
Warren, R.M. (1984). Perceptual restoration of obliterated sounds. Psychological Bulletin, 96, 371-383.
Warren, R.M. and Sherman, G.L. (1974). Phonemic restorations based on subsequent context. Perception and Psychophysics, 16, 150-156.
Werblin, F.S. (1971). Adaptation in a vertebrate retina: Intracellular recordings in Necturus. Journal of Neurophysiology, 34, 228-241.
Werbos, P. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. Thesis, Cambridge, MA: Harvard University.
Willshaw, D.J. and Malsburg, C. von der (1976). How patterned neural connections can be set up by self-organization. Proceedings of the Royal Society of London (B), 194, 431-445.
Young, R.K. (1968). Serial learning. In T.R. Dixon and D.L. Horton (Eds.), Verbal behavior and general behavior theory. Englewood Cliffs, NJ: Prentice-Hall.
Zadeh, L. (1965). Fuzzy sets. Information and Control, 8, 338-353.
Mathematical Approaches to Neural Networks
J.G. Taylor (Editor)
© 1993 Elsevier Science Publishers B.V. All rights reserved.
On-line learning processes in artificial neural networks

Tom M. Heskes and Bert Kappen
Department of Medical Physics and Biophysics, University of Nijmegen, Geert Grooteplein 21, 6525 EZ Nijmegen, The Netherlands.
Abstract
We study on-line learning processes in artificial neural networks from a general point of view. On-line learning means that a learning step takes place at each presentation of a randomly drawn training pattern. It can be viewed as a stochastic process governed by a continuous-time master equation. On-line learning is necessary if not all training patterns are available all the time. This occurs in many applications when the training patterns are drawn from a time-dependent environmental distribution. Studying learning in a changing environment, we encounter a conflict between the adaptability and the confidence of the network's representation. Minimization of a criterion incorporating both effects yields an algorithm for on-line adaptation of the learning parameter. The inherent noise of on-line learning makes it possible to escape from undesired local minima of the error potential on which the learning rule performs (stochastic) gradient descent. We try to quantify these often made claims by considering the transition times between various minima. We apply our results to the transitions from "twists" in two-dimensional self-organizing maps to perfectly ordered configurations. Finally, we discuss the capabilities of on-line learning for global optimization.
1 Introduction

1.1 Why a theory for on-line learning?
In neural network models, learning plays an essential role. Learning is the mechanism by which a network adapts itself to its environment. The result of this adaptation process, in both natural and artificial systems, is that the network obtains a representation of its environment. This representation is encoded in its plasticities, such as synapses and thresholds. The function of a neural network can be described in terms of its input-output relation, which in turn is fully determined by the architecture of the network and by the learning rule. Examples of such functions may be classification (as in multi-layered perceptrons), feature extraction (as in networks that perform a principal component analysis), recognition, transformation for motor tasks, or memory. The representation that the network has learned of the environment enables the network to perform its function in a way that is "optimally" suited for the environment on which it is taught. Despite the apparent differences in their functionalities, most learning rules in the current network literature share the following properties.
1. Neural networks learn from examples. An example may be a picture that must be memorized, or a combination of input and desired output of the network that must be learned. The total set of examples or stimuli is called the training set or the environment of the neural network.

2. The learning rule contains a global scale factor, the "learning parameter". It sets the typical magnitude of the weight changes at each learning step.

In this chapter, we set up and work out a theoretical framework based on these two properties. It covers both supervised learning (learning with "teacher", e.g., backpropagation [55]; for a review see [33, 65]) and unsupervised learning (learning without "teacher", e.g., Kohonen learning [37]; for a review see [6]). The approach taken in this chapter is therefore quite general. It includes and extends results from studies on specific learning rules (see e.g. [3, 53, 9, 48]).
1.2 Outline of this chapter
In artificial neural networks, on-line learning is modeled by randomly drawing examples from the environment. This introduces stochasticity in the learning process. The learning process becomes a discrete-time Markov process¹, which can be transformed into a continuous-time master equation. The study of learning processes becomes essentially a study of a particular class of master equations. In section 2 we point out the correct way to approximate this master equation by a Fokker-Planck equation in the limit of small learning parameters. We discuss the consequences of this approach in the case of just one fixed point of the (average) learning dynamics. Section 3 is more like an intermezzo. Here we discuss two other approaches. The Langevin approach, which leads to an equilibrium Gibbs distribution, has become very popular in the neural network literature. However, on-line learning, as we define it, cannot be formulated in terms of a Langevin equation, does not lead to a Gibbs distribution, and is therefore more difficult to study. We will also discuss the more "mathematical" approach, which describes on-line learning using techniques from stochastic approximation theory. The mathematical approach has led to many important and rigorously proven theorems, some of which will be mentioned in section 3. On-line learning, if compared with batch-mode learning, where a learning step takes place on account of the whole training set, is necessary if not all training patterns are available all the time. This is not only the case for biological learning systems, but also in many practical applications, especially in applications such as financial modeling, economic forecasting, robot control, etcetera, when the training patterns are drawn from a time-dependent environmental distribution. This notion leads to the study of on-line learning in a changing environment in section 4. Using the same techniques as in section 2, we encounter a conflict between the adaptability and the confidence or accuracy of the network's representation. Minimization of a suitable criterion, the so-called "misadjustment", leads to an optimal learning parameter for learning in a changing environment. The derivation of the optimal learning parameter in section 4 is nice, but of little practical use. To calculate this learning parameter, one needs detailed information about the neural network and its environment, information that is usually not available. In section 5 we try to solve this problem by considering the statistics of the weights. This yields an autonomous algorithm for learning-parameter adjustment.

¹The underlying assumption is that subsequent stimuli are uncorrelated. This is the case for almost all artificial neural network learning rules. However, for biological learning processes and for some applications subsequent stimuli may be correlated. Then the results of our analysis do not apply.
Another argument in favor of on-line learning is the possibility to escape from undesired local minima of the energy function or error potential on which the learning rule performs (stochastic) gradient descent. In section 6 we try to quantify these often made claims by considering the transition times between various minima of the error potential. Starting from two hypotheses, based on experimental observations and theoretical arguments, we show that these transition times scale exponentially with some constant, the so-called "reference learning parameter", divided by the learning parameter. Well-known examples of undesired fixed points of the average learning dynamics are topological defects in self-organizing maps. Using the theory of section 6, we calculate in section 7.1 the reference learning parameters for the transitions from "twists" in two-dimensional maps to perfectly ordered configurations. We compare the theoretically obtained results with results obtained from straightforward simulations of the learning rule. Finally, we discuss in section 8 to what extent on-line learning might be used as a global optimization method. We derive cooling schedules that guarantee convergence to a global minimum. In these cooling schedules, the reference learning parameters discussed in section 6 play an important role. We compare the optimization capabilities of on-line backpropagation and "Langevin-type" learning for a specific example with profound local minima.
2 Learning processes and their average behavior

2.1 From random walk to master equation
Let the adaptive elements of a neural network, such as synapses and thresholds, be given by a weight vector² w = (w_1, ..., w_N)^T ∈ ℝ^N. At distinct iteration times w is changed due to the presentation of a training pattern x̄ = (x_1, ..., x_n)^T ∈ ℝ^n, which is drawn at random according to a probability distribution ρ(x̄). The new weight vector w' = w + Δw depends on the old weight vector and on the training pattern:

Δw = η f(w, x̄).   (1)

The function f is called the learning rule, η the learning parameter. Because of the random pattern presentation, the learning process is a stochastic process. We have to talk in terms of probabilities, averages, and fluctuations. The most obvious probability to start with is the probability p_i(w) to be in state w after i iterations. This probability obeys a random walk equation

p_i(w') = ∫ d^N w T(w'|w) p_{i-1}(w),   (2)
with T(w'|w) the transition probability to "walk" in one learning step from state w to state w':

T(w'|w) = ∫ d^n x̄ ρ(x̄) δ^N(w' − w − η f(w, x̄)).   (3)
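Equations (1)-(3) translate directly into a Monte Carlo experiment: iterate the learning rule on randomly drawn patterns for an ensemble of networks and read off the distribution p_i(w). A minimal Python sketch follows; the linear rule f(w, x) = x − w and the uniform pattern distribution are our illustrative choices, not prescribed by the text.

import numpy as np

rng = np.random.default_rng(0)

def ensemble(eta=0.05, n_steps=2000, n_nets=5000):
    w = np.zeros(n_nets)                          # ensemble of 1-D "networks"
    for _ in range(n_steps):
        x = rng.uniform(-1.0, 1.0, size=n_nets)   # patterns drawn from rho(x)
        w += eta * (x - w)                        # learning step (1)
    return w

w = ensemble()
print("ensemble mean:", w.mean(), " ensemble std:", w.std())
# For this rule the stationary spread is of order sqrt(eta): the
# fluctuations vanish only in the limit of small learning parameters.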
The random walk equation (2) gives a description in discrete time steps. Bedeaux, Lakatos-Lindenberg, and Shuler [7] showed that a continuous-time description can be obtained through the assignment of random values Δt to the time interval between two succeeding iteration steps. If these Δt are drawn from a probability density

ρ(Δt) = (1/τ) exp[−Δt/τ],

²We use the notation A^T to denote the transpose of the matrix or vector A.
the probability φ(i, t) that after time t there have been exactly i transitions follows a Poisson process. The probability P(w, t), that a network is in state w at time t, reads

P(w, t) = Σ_{i=0}^∞ φ(i, t) p_i(w).

This probability function can be differentiated with respect to time, yielding the master equation

∂P(w', t)/∂t = ∫ d^N w [W(w'|w) P(w, t) − W(w|w') P(w', t)],   (4)

with the transition probability per unit time

W(w'|w) = (1/τ) T(w'|w).   (5)
Through τ we have introduced a physical time scale. Here we have presented a nice mathematical trick to transform a discrete-time random walk equation into a continuous-time master equation. It is valid for all values of τ and η. For the rest of this chapter we will choose τ = 1, i.e., the average time between two learning steps is our unit of time. For notational convenience we introduce the averages over the ensemble of learning networks Ξ(t),

⟨Φ(w)⟩_{Ξ(t)} ≡ ∫ d^N w P(w, t) Φ(w),

and over the set of training patterns Ω,

⟨Ψ(x̄)⟩_Ω ≡ ∫ d^n x̄ ρ(x̄) Ψ(x̄),

for arbitrary functions Φ(w) and Ψ(x̄). The dynamics of equation (4) cannot be solved in general. We will point out the incorrect (section 2.2) and the correct (section 2.3) way to approximate this master equation for small learning parameters η. To simplify the notation, we will only consider the one-dimensional case. In our discussion of the asymptotic dynamics (section 2.4), we will generalize to N dimensions.
2.2 The Fokker-Planck approximation of the Kramers-Moyal expansion

A totally equivalent description of the master equation is given by its full Kramers-Moyal expansion
∂P(w, t)/∂t = Σ_{n=1}^∞ ((−1)^n / n!) (∂/∂w)^n [a_n(w) P(w, t)],   (6)

with the so-called jump moments

a_n(w) ≡ ∫ dw' (w' − w)^n T(w'|w) = η^n ⟨f^n(w, x̄)⟩_Ω ≡ η^n ã_n(w),   (7)

where all ã_n are of order 1, i.e., independent of η. By terminating this series at the second term, one obtains the Fokker-Planck equation

∂P(w, t)/∂t = −(∂/∂w)[a_1(w) P(w, t)] + (1/2)(∂²/∂w²)[a_2(w) P(w, t)].   (8)
In one dimension, the equilibrium distribution of the Fokker-Planck equation can be written in closed form:

P_eq(w) = (N / a_2(w)) exp[ 2 ∫^w dw' a_1(w') / a_2(w') ],   (9)

with N a normalization constant. Because of the convenience and the simplicity of the result, the Fokker-Planck approach is very popular, also in the neural network literature on on-line learning processes [23, 44, 50, 53]. However, it is incorrect! Roughly speaking, this approximation is possible if and only if the average step size ⟨Δw⟩ and the variance of the step size ⟨(Δw − ⟨Δw⟩)²⟩ are proportional to the same small parameter [14]. Learning rules of the type (1) have ⟨Δw⟩ = O(η) but ⟨(Δw − ⟨Δw⟩)²⟩ = O(η²) and thus do not satisfy this so-called "scaling assumption". To convince ourselves, we substitute the equilibrium distribution (9) into the Kramers-Moyal expansion (6) and notice that the third, fourth, ..., ∞ terms are all of the same order as the first and second order terms: formally there is no reason to break off the Kramers-Moyal series after any number of terms.
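To make the scaling argument concrete, consider the linear rule f(w, x̄) = x − w (the example used again in section 4); the short computation below is ours, but it follows directly from definition (7):

a_1(w) = η ⟨x − w⟩_Ω = η (⟨x⟩_Ω − w),   a_2(w) = η² ⟨(x − w)²⟩_Ω,

so that

⟨Δw⟩ = a_1(w) = O(η),   ⟨(Δw − ⟨Δw⟩)²⟩ = η² Var_Ω(x) = O(η²):

the mean and the variance of a learning step scale with different powers of η, which is exactly the violation of the scaling assumption described above.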
2.3 A small fluctuations expansion

Intuitively, a stochastic process can often be viewed as an average, deterministic trajectory with stochastic fluctuations around this trajectory. Using Van Kampen's system size expansion [63] (see also [14]), it is possible to obtain the precise conditions under which this intuitive picture is valid. We will refer to this as the small fluctuations expansion. It consists of the following steps.
1. Following Van Kampen, we make the "small fluctuations Ansatz", i.e., we choose a new variable ξ such that

w = φ(t) + √η ξ,   (10)

with φ(t) a function to be determined. Equation (10) says that the time-dependent stochastic variable w is given by a deterministic part φ(t) plus a term of order √η containing the (small) fluctuations. A posteriori, this Ansatz should be verified. The function Π(ξ, t) is the probability P(w, t) in terms of the variable ξ:

Π(ξ, t) = P(φ(t) + √η ξ, t).
2. Using simple chain rules for differentiation, we transform the Kramers-Moyal expansion (6) for P(w, t) into a differential equation for Π(ξ, t):

∂Π(ξ, t)/∂t − (1/√η)(dφ(t)/dt) ∂Π(ξ, t)/∂ξ = Σ_{n=1}^∞ ((−1)^n / n!) η^{n/2} (∂/∂ξ)^n [ã_n(φ(t) + √η ξ) Π(ξ, t)].

3. We choose the function φ(t) such that the lowest order terms on the left- and right-hand side cancel, i.e.,

dφ(t)/dt = η ã_1(φ(t)).   (11)

This is called the deterministic equation.

4. We make a Taylor expansion of ã_n(φ(t) + √η ξ) in powers of √η. After some rearrangements we obtain

∂Π(ξ, t)/∂t = Σ_{m=2}^∞ η^{m/2} Σ_{n=1}^{m} ((−1)^n / (n!(m−n)!)) ã_n^{(m−n)}(φ(t)) (∂/∂ξ)^n [ξ^{m−n} Π(ξ, t)].

5. In the limit η → 0 only the term m = 2 survives on the right-hand side. This is called the linear noise approximation. The remaining differential equation for Π(ξ, t) is the Fokker-Planck equation

∂Π(ξ, t)/∂t = −η ã'_1(φ(t)) (∂/∂ξ)[ξ Π(ξ, t)] + (η/2) ã_2(φ(t)) ∂²Π(ξ, t)/∂ξ²,   (12)
where the prime denotes differentiation with respect to the argument.

6. From equation (12) we calculate the dynamics of the average fluctuations ⟨ξ⟩_{Ξ(t)} and the size of the fluctuations ⟨ξ²⟩_{Ξ(t)}:

d⟨ξ⟩/dt = η ã'_1(φ(t)) ⟨ξ⟩,   d⟨ξ²⟩/dt = 2η ã'_1(φ(t)) ⟨ξ²⟩ + η ã_2(φ(t)).   (13)

7. We started with the Ansatz that ξ is of order 1. From equation (13) we conclude that the final result is consistent with the Ansatz, provided that both evolution equations converge, i.e., that

ã'_1(φ(t)) < 0.   (14)

So, there are regions of weight space where the small fluctuations expansion is valid (ã'_1 < 0) and where it is invalid (ã'_1 ≥ 0).
Let us summarize what we have done so far. We have formulated the learning rule (1) in terms of a discrete-time Markov process (2). Introducing Poisson distributed time steps, we have transformed this discrete random walk equation into a continuous-time master equation (4). Making a small fluctuations Ansatz for small learning parameters η, we have derived equation (11) for the deterministic behavior and equation (12) for the probability distribution of the fluctuations around this deterministic behavior. At the same time we have derived the condition (14) which must be satisfied for this description to be valid in the limit of small learning parameters η. Now that we have made a rigorous expansion of the master equation, we can refine our bold statement that the Fokker-Planck approximation is incorrect. If we substitute the small fluctuations

Here θ(x) = 1 for x > 0 and θ(x) = 0 for x < 0. So, x is drawn with equal probability from the interval [vt − 1, vt + 1]. The input standard deviation χ = 1/√3 is constant. The aim of this learning rule is to make w coincide with the mean value of the probability distribution ρ(x, t), i.e., the fixed point w*(t) of the deterministic equation (27) obeys
w*(t) = ⟨x⟩_{Ω(t)} = vt.
So, ẇ*(t) = v: the rate of change of the fixed point solution is equal to the rate of change of the environment. Straightforward calculations show that the evolution of the bias m(t) and the variance Σ²(t) is governed by

dm(t)/dt = −η m(t) − v,   dΣ²(t)/dt = −η(2 − η) Σ²(t) + η² [χ² + m²(t)].
This set of differential equations decays exponentially to the stationary solution

m = −v/η,   Σ² = (η²χ² + v²) / (η(2 − η)).   (29)
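The stationary solution (29) is easy to check numerically. The following Python experiment is our own; it draws Poisson distributed time steps as in section 2.1 and uses the same uniform moving input distribution, so the measured bias and variance agree with (29) to leading order in η:

import numpy as np

rng = np.random.default_rng(1)
eta, v, n_steps = 0.05, 0.01, 400000
chi2 = 1.0 / 3.0                      # variance of a uniform input on [vt-1, vt+1]
w, t, samples = 0.0, 0.0, []
for step in range(n_steps):
    t += rng.exponential(1.0)         # Poisson event times (tau = 1)
    x = v * t + rng.uniform(-1.0, 1.0)
    w += eta * (x - w)                # time-dependent Grossberg learning
    if step > n_steps // 2:           # discard the transient
        samples.append(w - v * t)
d = np.array(samples)
print("bias     ", d.mean(), " predicted", -v / eta)
print("variance ", d.var(),  " predicted", (eta**2 * chi2 + v**2) / (eta * (2 - eta)))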
Figure 1: Probability distribution for time-dependent Grossberg learning. Learning parameter η = 0.05, standard deviation χ = 1.0. The input probability ρ(x, t) is drawn for reference (solid box). Zero velocity (solid line). Small velocity: v = 0.01 (dashed line). Large velocity: v = 0.1 (dash-dotted line).

Note that this behavior is really different from the behavior in a fixed environment. In a fixed environment (v = 0) the asymptotic bias is negligible if compared with the variance⁴. However, in a changing environment (v > 0), the bias is inversely proportional to the learning parameter η, and can become really important if this learning parameter is chosen too small. In figure 1 we have shown the (simulated) probability density P(w − w*(t)) for three different values of the speed v. For zero velocity the bias is zero and the distribution is sharply peaked. For a relatively small velocity, the influence on the width of the distribution is negligible, but the effect on the bias is clearly visible. For a relatively large speed, the variance is also affected and can get pretty large. A good measure for the learning performance is the misadjustment defined in equation (28). In the limit t → ∞, we can neglect the exponential transients to the stationary state (29). We obtain

E = (η³χ² + 2v²) / (η²(2 − η)).

This misadjustment is sketched in figure 2, together with simulation results. For small learning parameters the bias dominates the misadjustment and we have

E ≈ v²/η² for η ≪ (v/χ)^{2/3}.
On the other hand, for larger learning parameters the variance yields the most important contribution: E ≈ ηχ²/2 for (v/χ)^{2/3} ≪ η ≪ 1.

... every ε > 0, and every F ∈ C(X, ℝ), there exists a G ∈ S(X, ℝ) such that ‖F − G‖_K < ε. We need some more terminology. A function ψ : ℝ → [0, 1] is a squashing function if it is nondecreasing, lim_{λ→∞} ψ(λ) = 1, and lim_{λ→−∞} ψ(λ) = 0. For example, the familiar sigmoid function (ψ(λ) = (1 + exp(−λ))^{-1}), the cosine squasher of Gallant and White [8], and the Cantor-Lebesgue function [7] are all squashing functions. We now state the central result of this section. (The proof can be found in the Appendix.)
Theorem 3.1 For a normed linear space X and a squashing function ψ, multilayer functionals MF_ψ(X, ℝ) are uniformly dense on compact sets in C(X, ℝ).

In the sense made precise by the above theorem, multilayer functionals are universal functional approximators. Theorem 3.1 requires only that X be a normed linear space. This is a relatively mild condition, and hence the theorem applies to a large number of practical spaces. We now consider some examples of normed linear spaces and comment on their dual spaces.

1. Finite-dimensional Euclidean space
X = ℝⁿ. The norm on X is the usual norm ‖u‖_X = (Σ_{i=1}^n u_i²)^{1/2}, where u = (u_1, ..., u_n) ∈ ℝⁿ. Theorem 3.1, in this case, reduces precisely to the well known result of Hornik, Stinchcombe, and White [11].
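The theorem can be watched at work numerically. The sketch below is our own illustration, not a construction from the text: it assumes the multilayer functionals have the one-hidden-layer form G(u) = Σ_j β_j ψ(L_j(u) + θ_j) with continuous linear functionals L_j, discretizes input signals on a grid (so the L_j become dot products), draws random L_j and θ_j, and fits the output weights β_j by least squares to approximate the functional F(u) = ∫₀¹ u²(t) dt.

import numpy as np

rng = np.random.default_rng(2)
M, q, n_train = 64, 200, 2000             # grid points, hidden units, training signals
tgrid = np.linspace(0.0, 1.0, M)

def random_signals(n):
    # Smooth random inputs u(t) = c0 + c1 sin(2 pi t) + c2 cos(2 pi t).
    c = rng.normal(size=(n, 3))
    return (c[:, :1] + c[:, 1:2] * np.sin(2 * np.pi * tgrid)
            + c[:, 2:3] * np.cos(2 * np.pi * tgrid))

def F(U):
    return (U ** 2).mean(axis=1)          # F(u) = integral of u^2 over [0, 1]

G = rng.normal(size=(q, M)) / M           # discretized linear functionals L_j
theta = rng.normal(size=q)
psi = lambda s: 1.0 / (1.0 + np.exp(-s))  # the sigmoid squashing function

def hidden(U):
    return psi(U @ G.T + theta)

U = random_signals(n_train)
beta, *_ = np.linalg.lstsq(hidden(U), F(U), rcond=None)
U_test = random_signals(500)
err = hidden(U_test) @ beta - F(U_test)
print("rms approximation error:", np.sqrt((err ** 2).mean()))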
2. The space of compactly supported continuous functions
Let V be a topological space, and let C(V, ℝ) denote the set of all continuous functions on V. The support of f ∈ C(V, ℝ) is defined as the smallest closed set outside of which f vanishes. We now define

C_c(V, ℝ) = {f ∈ C(V, ℝ) | support of f is compact}.   (15)

Then X = C_c(V, ℝ) is a normed linear space with the uniform norm given by

‖f‖_X = sup_{t ∈ V} |f(t)|,   (16)

where f ∈ C_c(V, ℝ). If V = I = [0, 1], then V is compact and C_c(I, ℝ) = C(I, ℝ). This is precisely the space considered in Section 2, where it was established that continuous linear functionals on C(I, ℝ) are elements of the space of Lebesgue-Stieltjes integrals.
3. L^p spaces
Let V be a set equipped with a σ-algebra M and a measure μ. Then (V, M, μ) is called a measure space. If f is a real valued measurable function on V, then for 1 ≤ p < ∞ we define

‖f‖_p = (∫_V |f|^p dμ)^{1/p},   (17)

and if p = ∞

‖f‖_∞ = inf{a ≥ 0 | μ({t : |f(t)| > a}) = 0}.   (18)

Now, we define

L^p(V, M, μ) = {f : V → ℝ | f is measurable and ‖f‖_p < ∞}.   (19)

These L^p(V, M, μ) are normed linear spaces. Let V ⊂ ℝ, let M denote the Borel σ-algebra on V, and let μ denote the Lebesgue measure m. Then L^p(V, M, μ) is denoted simply by L^p(V). If μ is a σ-finite measure and 1 ≤ p < ∞, then (L^p)* ≅ L^q, where p^{-1} + q^{-1} = 1.

4. ℓ^p spaces
In example 3, let V ⊂ ℤ, the set of all integers, let M = P(V ∩ ℤ), the power set of V ∩ ℤ, and select μ to be the counting measure; then L^p(V ∩ ℤ, P(V ∩ ℤ), μ) is denoted by ℓ^p(V).
In example 3, let V C R, let M denote the Borel o-algebra on V , and let p denote the Lebesgue measure m. Let k E N the set of all natural numbers, then the space of all functions f E L'(V) whose distributional derivatives i3"f are also contained in L*(V) for la1 5 k are known as Sobolev spaces, and are denoted by 'Hk(V). 'Hk(V)is a Hilbert space with the following inner product
where f , g E 'Hk(V).'Hk(V)are normed linear spaces with the norm of any element f defined as (f,f)'/'.
'Hk(V)is a Hilbert space and therefore is its self-dual. C(I,R), LP(V), P(V n Z), and 'Hk(V)where I = [0,1], V c R, and 1 5 p 5 00 have more than exemplar value, they will denote various possible admissible sets of input signal functions (or sequences) to a general system in Section 4. Together they encompass almost all functions of engineering interest.
3.B A Conjecture

Theorem 3.1 assumes that the hidden unit activation function ψ is a squashing function. But in the case of multilayer feedforward networks it is known that an essentially arbitrary nonlinearity is sufficient for universal function approximation [12]. We conjecture that such a result also holds for multilayer functionals.
Conjecture 3.1 For a normed linear space X, and any continuous, bounded, nonconstant function ψ, multilayer functionals MF_ψ(X, ℝ) are uniformly dense on compact sets in C(X, ℝ).
It may be possible to establish Conjecture 3.1 by an extension of the results of Cybenko [6] and Hornik [12] from ℝⁿ to arbitrary normed linear spaces.
3.C Relation to Volterra Functionals
Volterra functionals have a rich history of applications in nonlinear systems theory. Originally, Volterra [34] defined Volterra functionals on the space of compactly supported continuous functions on the real line (denoted C(I, ℝ)). His definition was inspired by the structure of polynomial functions on the real line. By a use of the Fréchet-Weierstrass theorem he also established that Volterra functionals are uniformly dense on compact sets in C(C(I, ℝ), ℝ) (compare Theorem 3.1). Wiener [39], seeking alternatives to linear filtering and the Gaussianity assumption, introduced orthogonalized Volterra functionals, which are called Wiener functionals. Some other notable contributions to Volterra functionals can be found in Barrett [3], Gallman and Narendra [9], Koh and Powers [15], Morhac [17], Palm and Poggio [22], Porter [24], Prenter [25], and Root [26]. Interested readers may also refer to the books by Banks [2], Rugh [29], and Schetzen [32].

Let X be a normed linear space; then we define X^i = ∏_{j=1}^i X to mean the i-dimensional Cartesian product of X [7]. X^i may be identified with the set of ordered i-tuples of the elements of X. X^i is a normed linear space with the norm of any element (u_1, ..., u_i) defined as

‖(u_1, ..., u_i)‖_{X^i} = Σ_{j=1}^i ‖u_j‖_X,   (21)

where ‖·‖_{X^i} denotes the norm on X^i, and ‖·‖_X denotes the norm on X. For u ∈ X, we define u^i to mean the ordered i-tuple (u, ..., u) ∈ X^i. Let (X^i)* denote the dual of X^i. The elements of (X^i)* are continuous linear functionals on X^i and are denoted by L^i. We now define Volterra functionals on an arbitrary normed linear space. To our knowledge, such a general definition has not appeared in the literature before.
Definition 3.1 Let X denote an arbitrary normed linear space; then Volterra functionals of integer order n on X are defined by

VF_n(X, ℝ) = {F : X → ℝ | F(u) = β + Σ_{i=1}^n L^i(u^i), u ∈ X, β ∈ ℝ, u^i = (u, ..., u) ∈ X^i, and L^i ∈ (X^i)* for i = 1, ..., n}.   (22)
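For the finite-dimensional case X = ℝ^M, a continuous linear functional on X^i is an i-linear form, so a second-order Volterra functional reduces to the familiar kernel representation. A minimal Python sketch of the n = 2 case of (22), with arbitrarily chosen kernels:

import numpy as np

rng = np.random.default_rng(3)
M = 32
beta = 0.5
h1 = rng.normal(size=M)            # first-order kernel: L^1(u) = h1 . u
h2 = rng.normal(size=(M, M))       # second-order kernel: L^2(u, u) = u^T h2 u

def volterra2(u):
    # F(u) = beta + L^1(u^1) + L^2(u^2)
    return beta + h1 @ u + u @ h2 @ u

print(volterra2(rng.normal(size=M)))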
Clearly, VF_n(X, ℝ) ⊂ C(X, ℝ), that is, Volterra functionals are continuous. The set of all Volterra functionals on X is given by

VF(X, ℝ) = ⋃_{n=1}^∞ VF_n(X, ℝ).   (23)
Some properties of Volterra functionals are obvious.
Proposition 3.1 For a normed linear space X, Volterra functionals VF(X, ℝ) are uniformly dense on compact sets in C(X, ℝ).

This follows since Volterra functionals separate points and contain constants, and form a closed subalgebra of C(K, ℝ) for any compact subset K of X. Thus, we can apply the Stone-Weierstrass theorem to arbitrary compact subsets K of X in the spirit of Lemma A.1.
Corollary 3.1 For a normed linear space X and a squashing function ψ, multilayer functionals MF_ψ(X, ℝ) are uniformly dense on compact sets in Volterra functionals VF(X, ℝ).

The result follows trivially as a corollary to Theorem 3.1. Thus, any Volterra functional can be approximated to an arbitrary degree of accuracy by some multilayer functional. This implies that multilayer functionals can directly replace Volterra functionals in the existing applications [15, 29, 32]. Recently, Jones [14] and Barron [4] have established that multilayer feedforward networks bypass the curse of dimensionality in representing functions on ℝⁿ with bounded spectral norms. Notably, polynomials do not possess this advantageous property [4]. This leads us to conjecture that multilayer functionals (which are analogous to multilayer feedforward networks) on an arbitrary normed linear space X provide a more compact representation of some subclass of functionals on X than Volterra functionals (which are analogous to polynomials) on X. Precise characterization of this conjecture, and the proof thereof, is left as an open problem. We are now equipped to introduce multilayer operators, and establish their role in system representation.
4 Multilayer Operators in System Representation

In this section, we introduce some notation, definitions, and assumptions necessary to introduce the notion of a general input-output representation for time-invariant systems. To incorporate the notion of time, we will be concerned here only with operators mapping real valued functions of time into real valued functions of time.
245
4.A
System and Time-Invariance
In order to develop a single theory for both continuous-time as well as discrete-time systems, we define Z as the set of all possible time-instances. For continuous-time systems, we set 5 = W ,while for discrete-time systems, we set 5 = Z the set of all integers. Let J c Z denote an interval of time over which the behavior of a class of systems is of interest. Let U and Y denote some sets of real valued functions over J. In the sequel, U will serve to denote the set of all admissible inputs to the class of systems of interest, and Y will serve to denote the set of possible outputs of the class of systems of interest. For all t E J, let Jt be some subset of Z. We define Jt - t to mean a subset of E obtained by translating each element of Jt by - t . Let u E U be an input signal function, then we define the restriction ut of u to Jt as u t ( s ) = u(t
+ s),
(24)
where t E J and s E Jt - t . Clearly, the restriction ut is defined over the interval 51 - t . We define the restriction U,of U to 51 as the set of restrictions ut for all u E U to Jt. Clearly, Ut is some set of real valued functions over Jt - t . For all t E J, let Ft denote a functional mapping Ut to the real line. A system is simply an operator mapping U to Y.In other words, a system S acts on an input function u in U to emit an output function y = S(u) in Y.Without any loss of generality, output of any system S at every t E J can be given by (25)
Y ( t )= N u t ) ,
where the interval J c 5 and the set { Jt, Ut, Ft},,J are individually specified for each system S [3,22,41]. { J , Jt, Ut,Ft}tEJis called an input-output representation for the system
S. In this paper, we are interested only in time-invariant systems. Concept of timeinvariance is introduced in the following definition.
Definition 4.1 A system S : U + Y described by { J , Jt, U,,Ft),,j is defined to be timeinvariant, if 1 . For all t E J , Jt - t are all identical and are denoted by Jo. 2. U is such that for all t E J , the restrictions Ut are all identical and
QIY
denoted by
uo . 3. For all t E J , functionals Ft : Ut + W are all identical and are denoted by Fo.
A time-invariant system S can be completely described by {J,Jo, UO, Fo}. Jo c Z has the interpretation of characterizing the memory of the system S. If JOhas a finite length, then the system is said to have a finite memory. On the other hand, if JO has an infinite length, then the system is said to have an infinite memory. If Jo is a subset of Z n (-co,O] then the system is said to be causal, and if Jo is not a subset of Z n (-oo,O] then the system is said to be noncausal. In the subsequent treatment, however, we place no restrictions on Jo, and hence it can be chosen at will. In this sense, the developed theory deals with finite or infinite memory systems, as well as with causal or noncausal systems.
246
U0describes the nature of inputs allowed over the memory Jo of the system. Once UO is specified, Definition 4.1 automatically provides every restriction Ut for all t E J, and hence also U. Thus, for a time-invariant system Uo serves as the fundamental input space. Uo can be picked at will, provided it is a normed linear space of real valued functions over Jo c Z. Please refer to Section 3 for examples of some practically important normed linear spaces. U0 is a normed linear space, and thus has the norm topology. The norm topology on Uo,and consequently on Ut for all t E J automatically induces a topology on U generated by the following open sets {v E
uI
I1vt - utllu,, < n-l for all t E J } ,
(26)
where n E N, u E U and 11 . Il", denotes the norm on Uo. The topology on U allows us to talk about convergence of elements of U,and hence about the continuity of systems defined on U. The functional Fo in Definition 4.1 is termed the characteristic functional of system S . We now state some relationships between Fo and S. (The proof can be found in the Appendix.)
Proposition 4.1 Let S : U -+ Y be a time-invariant system (as in Definition 4.1) described by { J, Jo,VO,Fo}, then 1 . S is linear iff Fo is linear, and is nonlinear iff Fo is nonlinear.
2. If Fo is continuous, then Y
c B(J, R). B(J,R) denotes
the set of all bounded real
valued functions on J .
3. If Fo is continuous, and Y has the relative topology derived f r o m the topology of uniform convergence on B ( J ,R), then S is continuous. We now make a pragmatic assumption (dictated only by the availability of tools rather than any fundamental property of systems) on the characteristic functional Fo.
Definition 4.2 A time-invariant system S characterized by { J , Jo, Uo, Fo} (as in Definition 4.1) is defined t o be characterized by a continuous functional, if Fo : Uo -+ IR is continuous.
For any system described by Definitions 4.1 and 4.2,it follows from Proposition 4.1 that for all u E U its output y must be in B(J,IR). Thus, for all bounded inputs' the system emits a (uniformly) bounded output and hence is BIB0 stable [41]. In this sense, the theory developed can deal only with stable systems. Moreover, for any system described by Definitions 4.1 and 4.2,it also follows from Proposition 4.1 that the system is continuous. To complete the list of all the advertised qualities of the representable systems in the theory, we only need to show the theory can deal with linear or nonlinear systems.
11.
lu is said to be a bounded input if its restrictions uI for all t E J are such that IIutllu, JJu, denotes the norm on UI.
< m, where
247
4.B Linear or Nonlinear Systems First, we define multilayer operators and Volterra operators. Definition 4.3 Let { J , Jo, UO}be specified, then 1. The set MO+(U,Y ) of multilayer operators is defined to be the class of systems of the form { J, Jo, U o , N } , where N E M3+(uo,%), and rl, is a continuous squashing
function. 2. The set YO(U,Y) of Volterra operators is defined to be the class of systems of the form { J, Jo, UO,V } , where Y E V3((vo,32). Clearly, multilayer operators MO+(U,Y ) and Volterra operators YU(U,Y ) meet the requirements set forth in Definitions 4.1 and 4.2, and are thus stable and continuous. Let 7C(U, Y ) denote the class of all time-invariant systems characterized by continuous functionals. Then, trivially 7C(U, Y) 2 C(U0,n). Also, let CTC(U, Y)C 7C(U, Y ) d e note the class of systems characterized by a linear functional. Then, trivially C7C(U, Y ) 2
(UO)'. Thus, every element of 7 C ( U ,Y ) can be completely characterized by some continuous functional on 170. But, Theorem 3.1 asserts that any continuous functional on UOcan be uniformly approximated on compact sets by some multilayer functional on UO. This observation, finally, makes precise the role of multilayer functionals and multilayer operators in system representation. This is now formally stated. (The proof can be found in the Appendix.) T h e o r e m 4.1 Let { J, Jo, UO} be specified, then 1. (Linear Systems) If S E L7C(U,Y), then there ezists a L E (UO)' such that for all t E J and for all (I E U , the output y of the system is given b y
Y ( t )= L(.t). (27) 2. (Linear/Nonlinear Systems) If S E 7C(U, Y ) , then for every continuous squashing function rl,, for every c > 0 , and for every U such that its restrictions Utfor all t E J are subsets of a compact set K c Uo, there exists a multilayer operator 0 E MO+(U,Y) such that (28) I N u ) - O(u)II < €1 where 1 ) . 11 denotes the uniform norm on Y c B(J,32). Theorem 4.1 assures us that if we assume that U is such that all its restrictions {Ut)t~j are subsets of a compact set in Uo, then all all time-invariant systems characterized by continuous functionals (continuous-time or discrete-time, finite memory or infinite memory, causal or noncausal, and linear or nonlinear) on U can be uniformly approximated by multilayer operators. We hope that these theoretical guarantees will convince the reader that the quest for representing arbitrary systems by multilayer operators is sound. Needless to say, Volterra operators also enjoy the same representation properties as multilayer operators. However, unlike multilayer operators, Volterra operators cannot be expected to escape the curse of dimensionality for some restricted class of operators, and to our knowledge results in such a general setting have not previously been derived for Volterra operators.
248
4.C
Concept of Initial State
Let S be a time-invariant system described by {J,Jo, UO, Fo}. Let J = 3 n [to, t l ] , where - - o o < t o < t l 0, where 11. I1.y denotes the norm on X . Define L ( u ) = = 2(I,ullx:ll"llx)IIu - UIIX 0. Also, by the triangle 2(llullx:!lY,lw~F("). Then1 U" inequality, L(u - w) 5 2. Now, define an &ne functional A by A(u) = 0 L ( u ) = L ( u ) , then cos oA separates points. = We now apply the Stone-Weierstrass Theorem to conclude that MF,,(K,P) C ( K ,8).But, I< is some arbitrary compact set of X, and hence the proof.
+
+
'
+
255
Remark 1 Lemma A . l also holds for M . F ~ , ( Xa) , and M F e w ( X ,8). L e m m a A.2 Given an affine functional A on X, a compact subset K of X, a squashing function and a el > 0, there exists a N E M F + ( X , % )such that
11,
114 0 A - N l l ~ < €1, where
+ 3n/2) + 1)
if X 5 -n/2 if -n/2 I if X 2 n/2
5 r/2
is the cosine squasher of Gallant and White [8]. Proof
c,"=;'
Without loss of generality, let c1 < 1. We want to construct a N(u) = pJ+(A,(u)) E M F + ( X ,82) such that Il4oA -NII,y < el. We need to find B , , and A, for j = 1,. .. ,Q - 1.
Let e = 2/3el. Pick Q such that l / Q < e/2. For j = 1,. . . ,Q - 1 set p, = 1/Q. Pick M > 0 such that $(u) 5 e/(2Q) for u 5 -M, and $(ti) 2 1 - e/(2Q) for u 2 M. Since 1c, is a squashing function such an M can be found. For j = 1,. ..,Q - 1 set r, = X such that 4(X) = j / Q , and set f Q = X such that 4(X) = 1- 1/(2Q). Let the &ne functional A be of the form A(u) = bA LA(u), where u E X, bA E 92, and LA E X'. Let us define
+
(4 0 A)(E,dj = t u E X l 4 ( C ) < (4 0 N u ) I 4(4J and
(4 A ) ( c . d ) = {u E xl4(c) < (4 A)(u) < d(d))* Now we can partition the space X into Q 1 disjoint sets such that
+
X = (4 0 A)(-co,r11U (4 0 A)(rI,rz]U . * .U (4 0 A)(rQ-I,TQlU (4 0 A)(rQ,+co). On each of the sets, (40 A)(,,,,,,] for j = 1,. . . ,Q - 1, we will approximate the action of 4 o A by tC, o Ar,,r,tl. We now look for such affine functionals Arj,3+1. For all u E (4oA)(7,,rJ+g]74(rj) < (4oA)(u) I 4(rj+1). But, (4oA)(u) = 4 ( b ~ + L ( u ) ) ,
and 4 is nondecreasing. This implies that r, < bA b3,r,tl E a, and L,,r,tlE X', such that -M < arithmetic reveals that the choices
+ I;A(u)
5 rj+1. We wish to find b,,3+1+ L , , r , + l ( U ) I M . Some
and will ~ f f i c eNow, . define (u) = b,,,,,+, +L,,,,,+, (u). Then, N(u) = xi"=;'pj+(Ar,,3tl (u)) is the desired approximation. After some rather lengthy arithmetic, i t can be verified that
II4 0 A - NIIK < €1.
256
Remark 2 Lemma A.2 is simply a generalization of Lemma A.2 in Hornik, Stinchcombe, and White (111. They develop the lemma for the simple case when X = 8'. The essence of the lemma is that cosine squasher 4 can be approximated to an arbitrary degree of accuracy by a superposition of a finite number of scaled and f i n e l y shifted copies of any squashing function $.
Lemma A.3 Given an affine functional A; on X, a compact subset K of X, a squashing function $, and a € 2 > 0, there exists a N; E M 3 + ( X ,92) such that IICOSOA;-XI(K
0 such that -2rM 5 A,(u) 5 2u(M + 1) [27,Theorem 4.151. By a result of Gallant and White [8] on the interval [-27rM,2r(M+l)] the cosine function can be represented by a superposition of a finite number of scaled and af6nely shifted copies of the cosine squasher 4. (For the definition of 4, see Lemma A.2.) Explicitly,
c M
cos(u) =
2[4(-u
+ r/2 - 2mr) + d(u - 3r/2 + 2mr)l - 2(2M + 1 ) + 1.
m=-M
Thus, we can write
c 2[4(-A;(u)+ M
cos(A,(u))=
r / 2 - 2mr) + gl(A;(u)- 3u/2 +2mr))- 2(2M
+ 1 ) + 1.
m=-M
+
Now, we will use Lemma A.2 2(2M 1) times with €1 = 4(2G+l)to approximate each 4 term in the above representation of cos(A;(u))by an element of M 3 + ( X , R ) . For m = -M,. . . ,M , let N;,m,l(~),N,,,,2(u)E M F + ( X ,R) denote approximations to d(-A,(u) r / 2 - 2mr) and d(A,(u)- 3r/2 2mr) respectively. The approximations are obtained by applying Lemma A.2 such that
+
+
lld(-As(.)
+ r / 2 - 2m*) - ~ , m , l l l K
0, Lemma A.l tells us that there exists a N ( u ) E M Fc a(X,R) such that llF - N l l ~< 4 2 . We now need a multilayer functional G(u) E &,a(+) such that /IN - G l l ~< ~ / 2 .Then by triangle inequality we can conclude that IIF - G l l ~< e. Let N(u) = Cy=!=, flj cos(Aj(u)). Let fl = supj flj. Apply Lemma A.3 to each term cos(Aj(u)) with €2 = &j to obtain a Nj(u) such that 11 cos oAj - Njll~< €2. Define G(u)= fljNj(u). Then, we have the required result JIM- G l l ~ < e/2.
Proof of Proposition 4.1 1. Obvious. 2. For any u E U,we have ut E Ut for all t E J. But, U,is assumed to be a normed linear space. Therefore llulll < 00 for all t E J. Then by continuity of F, we have F(u,) = y(t) < 00 for all t E J.
3. Under the topology defined on U: + vt for all t E J.
U by Equation 26, for
u,v
E U,we say
u + v if
Let u" E U + u E U,then continuity of F implies that F(u;) + F(u,) for all t E J. Thus, S(u") -+ S(u)pointwise for every t E J . But, Y c B(J,a) is assumed to have the relative topology derived from the topology of uniform convergence on B(J,R). Thus, when u" + u in the topology on U,S(u") -+ S(u) in the topology on Y.Thus, by definition S is continuous.
Proof of Theorem 4.1 1. Obvious.
2. Since S E 7C(U,Y),by definition there exists a Fo E C(U0,a)characterizing S. Moreover from Theorem 3.1, for any compact subset K of VO,every e > 0, and every continuous squashing function there exists a N E MF+(Uo,R) such that llF0 - N l l K < €. But, set of inputs U is assumed such that its restrictions U,for all t E J are subsets of K. Therefore, for every t E J and for every u E U,we have IFo(u:)-N(ut)l < e, and consequently supteJ IFO(ut) - N(u:)l< e.
+
But, supteJ IFo(t)-N(t)l = ~ ~ S ( u ) - 0 ( uwhere ) ~ ~ ,0 denotes the multilayeroperator described by {J,Jo,Vo,n/} and 11 11 denotes the uniform norm on Y C B(J, a). Thus, the result.
258
References 111 A. D. Back and A. C. Tsoi, "FIR and IIR synapses, a new neural network architecture for time series modelling," Neural Comput., vol. 3, no. 3, pp. 352362, 1991. [2] S. P. Banks, Mathematical Theories of Nonlinear Systems, New York: PrenticeHall, 1988. [3] J. F. Barrett, "The use of functionals in the analysis of nonlinear physical systems," Journal of Electronics and Control, vol. 15, pp. 567-615, 1963. [4] A. R. Barron, "Universal approximation bounds for superpositions of a sigmoidal function", University of Illinois at Urbana-Champaign, Department of Statistics, tech. rep. 58, 1991. [5] A. R. Barron, "Approximation and estimation bounds for artificial neural networks", University of Illinois at Urbana-Champaign, Department of Statistics, tech. rep. 59, 1991. [6] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Mathematics of Control, Signals, and Systems, vol. 2, pp. 303-314, 1989. [7] G. B. Folland, Real Analysis, New York: John Wiley & Sons, 1984. [8] A. R. Gallant, and H. White, "There exists a neural network that does not make avoidable mistakes," in IEEE Second International Conference on Neural Networks, San Diego, CA, New York: IEEE Press, vol. 1, pp. 657-664, 1988. [9] P. G. Gallman and K. S. Narendra, "Representations of nonlinear systems via the Stone-Weierstrass theorem," Automatica, vol. 12, pp. 619-622, 1976.
[lo] R. Hecht-Nielsen, Neurocomputing, Reading, MA: Addison-Wesley, 1991. [ll] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, pp. 359-366, 1989. [12] K. Hornik, "Approximation capabilities of multilayer feedforward networks," Neural Networks, vol. 4, pp. 251-257, 1991. [13] J. L. Hudson, M. Kube, R. A. Adomaitis, I. G. Kevrekidis, A. S. Lapedes, and R. M. Farber, "Nonlinear signal processing and system identification: applications to time series from electrochemical reactions," Chemical Engineering Science, vol. 45, no. 8, pp. 2075-2081, 1990. [14] L. K. Jones, "A Simple Lemma on Greedy Approximation in Hilbert Space and Convergence Rates for Projection Pursuit Regression and Neural Network Training," Ann. Statist., vol. 20, no. 1, pp. 608-613, 1992. I151 T. Koh and E. J. Powers, "Second-order Volterra filtering and its application to nonlinear system identification," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-33, no. 6, pp. 1445-1455, Dec. 1985. [16] R. J. Marks 11, Introduction to Shannon Sampling and Interpolation Theory, New York: Springer-Verlag, 1991.
259
[17] M. MorhiE, ”A fast algorithm of nonlinear Volterra filtering,” IEEE Transactions on Signal Processing, vol. 39, no. 10, pp. 2353-2356, Oct. 1991. [18] K. S. Narendra and K. Parthasarathy, “Identification and control of dynamical systems,” IEEE Transactions on Neural Networks, vol. 1, no. 1, pp. 4-27, Mar.1990. [19] H. J. Nussbaumer, Fast Fourier Transforms and Convolution Algorithms, Berlin: Springer-Verlag, 1981. [20] A. V. Oppenheim and D. H. Johnson, “Discrete representation of signals,” Proceedings of the IEEE, vol. 60, no. 6, pp. 681-691, June 1972. [21] A. V. Oppenheim and R. W. Schafer, Digital Signal Processing, Englewood Cliffs, NJ: Prentice-Hall, 1975. [22] G. Palm and T. Poggio, “The Volterra representation and the Wiener expansion: validity and pitfalls,” SIAM Journal of Applied Mathematics, vol. 33, no. 2, pp. 195-216, Sep. 1977. [23] F. J. Pineda, “Recurrent backpropagation and the dynamical approach to adaptive neural computation,” Neural Comput., vol. 1, pp. 161-172, 1989. [24] W. A. Porter, “An overview of polynomic system theory,” Proceedings of the IEEE, vol. 64, no. 1, pp. 18-23, Jan. 1976. [25] P. M. Prenter, “A Weierstrass theorem for real, separable Hilbert spaces,” Journal of Approzimation Theory, vol. 3, pp. 341-351, 1970. [26] W. L. Root, “On the modeling of systems for identification. Part I: crepresentations of classes of systems,” SIAM Journal of Control, vol. 13, no. 4, pp. 927-944, 1975. (271 W. Rudin, Principles of Mathematical Analysis, New York: McGraw Hill, 1964. [28] W. Rudin, Functional Analysis, New York: McGraw Hill, 1991. [29] W. J. Rugh, Nonlinear System Theory: The Volterra/Wiener Approach, Baltimore: The Johns Hopkins University Press, 1981. [30] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” in Parallel Distributed Processing: Ezplorations in the Microstructure of Cognition, D. E. Rumelhart and J. L. McClelland, Eds., vol. 1, pp. 318-362, Cambridge, MA: MIT Press, 1986. [31] I. W. Sandberg, “Approximations for Nonlinear Functionals,” IEEE Transactions on Circuits and Systems-I: Fundamental Theory and Applications, vol. 39, no. 1, pp. 65-67, Jan. 1992. [32] M. Schetzen, The Volterra and Wiener Theories of Nonlinear Systems, New York: Wiley, 1980. [33] C. E. Shannon, “Communication in the presence of noise,” Proceedings of the Institute of Radio Engineers, vol. 37, no. 1, pp. 10-21, 1948. [34] V. Volterra, Theory of Functionals and of Integral and Integro-Differential Equations, New York: Dover Publications, 1959.
260
[35]E. A. Wan, "Temporal backpropagation for FIR neural networks," Proc. IEEE Int. Joint Conf. Neural Networks, vol. 1, pp. 575-580,1990. [36]H. White, "Parametric statistical estimation with artificial neural networks," in Mathematical Perspectives on Neural Networks, P. Smolensky, M. C. Mozer, and D. E. Rumelhart, Eds., Hilldale, NJ: L. Erlbaum Associates, 1992.
[37]H. White, Artificial Neural Networks: Approximation d Learning Theory, Cambridge, MA: Blackwell Publishers, 1992. [38]B. Widrow and S. D. Stearns, Adaptive Signal Processing,. Englewood Cilffs, NJ: Prentice-Hall, 1985. [39]N. Wiener, Selected Papers of Norbert Wiener, Cambridge, MA: MIT Press, 1964. [40] R. J. Williams and D. Zipser, "A learning algorithm for continually running fully recurrent neural networks," Neural Comput., vol. 1, pp. 270-280,1989. 1411 J. C. Willems, The Analysis of Feedback Systems, Cambridge, MA: MIT Press, 1971.
Mathematical Approaches to Neural Networks J.G. Taylor (Editor) 0 1993 Elsevier Science Publishers B.V. All rights reserved.
26 I
Neural networks: the spin glass approach David Sherrington Department of Physics, University of Oxford, Theoretical Physics, 1 Keble Road, Oxford, OX1 3NP Abstract A brief overview is given of the conceptual basis for and the mathematical formulation of the fruitful transfer of techniques developed for the theory of spin glasses to the analysis of the performance, potential and training of neural networks. 1. INTRODUCTION
Spin glasses are disordered magnetic systems. Their relevance to neural networks lies not in any physical similarity, but rather in conceptual analogy and in the transfer of mathematical techniques developed for their analysis to the quantitative study of several aspects of neural networks. This chapter is concerned with the basis and application of this transfer. A brief introduction to spin glassses in their conventional manifestation is appropriate to set the scene - for a fuller consideration the reader is referred to more specialist reviews (MBzard et. al. 1987, Fischer and Hertz 1991, Binder and Young 1986, Sherrington 1990, 1992). At a microscopic level spin glasses consist of many elementary atomic magnets (spins), fixed in location but free to orient, interacting strongly but randomly with one another through pairwise forces. Individually these forces try to orient their spin pairs either parallel or antiparallel, but collectively they lead to conflicts, or frustration, with regard to the global orientations. The consequence is a system with many non-equivalent metastable global states and consequentially many interesting physical properties. Most of the latter will not concern us here, but the many-state structure has relevance for analogues in neural memory and the mathematical techniques devised to analyze spin glasses have direct applicability. Neural networks also involve the cooperation of many relatively simple units, the neurons, under the influence of conflicting interactions, and they possess many different global asymptotic behaviours in their dynamics. In this case the conflicts arise from a mixture of excitatory and inhibitory synapses, respectively increasing and decreasing the tendency of a post-synaptic neuron to fire if the pre-synaptic neuron fires. The recognition of a conceptual relationship between spin glasses and recurrent neural networks, together with a mathematical mapping between idealizations of each (Hopfield 1982), provided the first hint of what has turned out to be a fruitful transplantation. In fact, there are now two main respects in which spin glass analysis has been of value in considering neural networks for storing and interpreting static data. The first concerns the macroscopic asymptotic behaviour of a neural network of given architecture
262
and synaptic efficacies. The second concerns the choice of efficacies in order to optimize various performance measures. Both will be discussed in this chapter. We shall discuss networks suggested as idealizations of neurobiological structures and also those devised for applied decision making. We shall not, however, dwell on the extent to which these idealizations are faithful, or otherwise, to nature. Although neural networks can also be employed to store and analyse dynamical information, and techniques of non-equilibrium statistical mechanics are being applied to their analysis, we shall restrict discussion in this chapter to static information, albeit stored in dynamic networks. An accompanying chapter (Coolen and Sherrington 1992) gives a brief introduction to dynamics. 2. TYPES OF NEURAL NETWORK
There are two principal types of neural network architecture which have been the subject of active study. The first is that of layered feedforward networks in which many input neurons drive various numbers of hidden units eventually to one or few output neurons, with signals progressing only forward from layer to layer, never backwards or sideways within a layer. This is the preferred architecture of many artificial neural networks for application as expert systems, with the interest lying in training and operating the networks for the deduction of appropriate few-state conclusions from the simultaneous input of many, possibly corrupted, pieces of data. The second type is of recurrent networks where there is no simple feedforward-only or even layered operation, but rather the neurons drive one another collectively and repetitively without particular directionality. In these networks the interest is in the global behaviour of all the neurons and the associative retrieval of memorized states from initialisations in noisy representations thereof. These networks are often referred to as attractor neural networks’. They are idealizations of parts of the brain, such as cerebral cortex. Both of the above can be considered as made up from simple ‘units’ in which a single neuron receives input from several other neurons which collectively determine its output. That output may then, depending upon the architecture considered, provide part of the inputs to other neurons in other units. Many specific forms of idealized neuron are possible, but here we shall concentrate on those in which the neuron state (activity) can be characterized by a single real scalar. Similarly, many types of rule can be envisaged relating the output state of a neuron to those of the neurons which input directly to it. We shall concentrate, however, on those in which the efferent (post-synaptic) behaviour is determined from the states of the afferent (pre-synaptic) neurons via an ‘effective field’ hi =
C J ; j ~-j W;, j#i
‘They are often abbreviated as ANN, but we shall avoid this notation since it is also common for artificial neural networks.
263
where aj measures the firing state of neuron j , J;j is the synaptic weight from j to i and W; is a threshold. For example, a deterministic perceptron obeys the output-input relation
where a,!is the output state of the neuron. More generally one has a stochastic rule, where f(h;)is modified in some random fashion at each step. Specializing/approximately further to binary-state (McCulloch-Pitts) neurons, taken to have a;= fl denoting firing/non-firing, the standard deterministic perceptron rule is
(3)
u: = sgn(h;).
Typical stochastic extensions modify (3) to a random update rule, such as the Glauber rule, 1 2
a;+ u,!with probability -[1+ tanh(ph;u:)],
(4)
or the Gaussian rule ai -+
a: = sgn(h;
+ Tz),
(5)
where z is a Gaussian-distributed random variable of unit variance and T = p-' is a measure of the degree of stochasticity, with T = O(p = m) corresponding to determinism. In a network of such units, updates can be effectuated either synchronously (in parallel) or randomly asynchronously. More generally, a system of binary neurons satisfies local rules of the form
where the u ; ~..., aiC= +l are the states of the neurons feeding neuron i, Rj and Ri are independent tunable stochastic operators randomly changing the signs of their operands, and F; is a Boolean function of its arguments (Aleksander 1988, Wong and Sherrington 1988, 1989). The linearly-separable synaptic form of (2)-(5) is just a small subset of possible Boolean forms. 3. ANALOGY BETWEEN MAGNETISM AND NEURAL NETWORKS
In order to prepare for later transfer of mathematical techniques from the theory of spin glasses to the analysis of neural networks, in this section we give a brief outline of the relevant physical and conceptual aspects of disordered magnets which provide the stimulus for that transfer. 3.1 Magnets
A common simple model magnet idealizes the atomic magnetic moments to have only two states, spin up and spin down, indicated by a binary (Ising) variable a;= f l , where i
264
labels the location and u the state of the spin. A global microstate is a set { u ; } ; i= 1,...N where N is the number of spins. The energy of such a state is typically given by
1
E({u;}) = - - ~ J ; ~ u ; u-, C b;u; 2
ij
(7)
I
where the J;j (known as exchange interactions) correspond to contributions from pairwise forces and the b; t o local magnetic fields. The prime indicates exclusion of i = j . The set of d microstates is referred t o as ‘phase space’. The standard dynamics of such a system in a thermal environment at temperature T is a random sequential updating of the spin states according t o rule (4) with W; -+ -b;. Thus, with this identification, there is a mathematical mapping between the statistical thermodynamics of the spin system and the dynamics of a corresponding recurrent neural network. The converse is not necessarily true since the above spin model has J;j = Jj;, whereas no such restriction need apply t o a general neural network. However, for developmental purposes, we shall assume this symmetry initially, lifting the restriction later in our discussion. Magnetic systems of this kind have been much studied. Let us first concentrate on their asymptotic behaviour. This leads t o a thermodynamic state in which the system randomly passes through the microstates with a Gibbs probabalistic distribution
and the system can be viewed as equivalent to an ensemble of systems with this distributiona. At high temperatures T, effectively all the microstates of any energy are equally likely t o be accessed in a finite time and there are no serious barriers to a change of microstate. At low enough temperatures, however, there is a spontaneous breaking of the phase space symmetry on finite timescales and only a sub-set of microstates is effectively available in a physical measurement on a very large (N -+ co)system. The onset of such a spontaneous separation of phase space is known as a ‘phase transition’. A common example is the onset of ferromagnetism in a system with positive exchange intGactions {J;j} and b = 0. Beneath a critical temperature T,,despite the probabalistic symmetry between a microstate and its mirror image with every spin reversed, as given by (8), there is an effective barrier between the sub-sets of microstates with overall spin up and those with overall spin down which cannot be breached in a physical time, and hence the thermal dynamics is effectively confined t o one or other of these subsets. The origin of this effect lies in the fact that for T < T, the most probable microstates have non-zero values of the averu;,strongly peaked around two values f m ( T ) , while the age magnetization m = N-’ probability of a state of different Im( is exponentially smaller. To go from m = m ( T ) t o m = -m(T) would require the system to pass through intermediate states, such as m = 0, of probability which is vanishingly small as N ---f 00. For T = 0 the dynamics (3) leads t o a minimum of E ( a ) , which would have m = fl for the ferromagnet, with no means of further change.
xi
ZFora further, more complete, discussion of equivalences between temporal and ensemble averages see the subject of ‘ergodicity’ in texts on statistical mechanics
265
A useful picture within which to envisage the effect of spontaneous symmetry-breaking is of an effective energy landscape which incorporates the randomizing tendencies of temperature as well as the ordering tendencies of the energy terms of (7). This is known as a free energy landscape and it evolves with temperature. At high temperature it is such that all energetically equivalent states are equally accessible, but at low temperature it splits into disconnected regions separated by insurmountable ridges. If the system under consideration is the ferromagnet with only positive and zero Jij and without magnetic fields, b = 0, the phase space is thus split into two inversion symmetry related parts. If, however, the Jij are of random sign, but frozen (or quenched), then the resultant low temperature state can have many non-equivalent disconnected regions, or basins, in its free-energy structure; this is the case for spin glasses. Thus, if one starts the dynamical system in a microstate contained within one of the disconnected sub-sets, then in a physical time its evolution will be restricted to that subspace. The system will iterate towards a distribution as given by (8) but restricted to microstates within the sub-space.
3.2 Neural networks Thus one arrives at a potential scenario for a recurrent neural network capable of associatively retrieving any of several patterns { Q } ; p = 1,...p. This is to choose a system in which the J i j are such that, beneath an appropriate temperature (stochasticity) T , there are p disconnected basins, each having a macroscopic overlap3 with just one of the patterns and such that if the system is started in a microstate which is a noisy version of a pattern it will iterate towards a distribution with a macroscopic overlap with that pattern and perhaps, for T -+ 0, to the pattern itself. To store many non-equivalent patterns clearly requires many non-equivalent basins and therefore requires competition among the synaptic weights/exchange interactions {J;j}4. The mathematical machinery devised to study ordering in random magnets is thus a natural choice to consider for adaptation for the analysis of retrieval in the corresponding neural networks. An introduction to this adaptation is the subject of the next section. However, before passing to that analysis a further analogy and stimulus for mathematical transfer will be mentioned. This second area for transfer concerns the choice of {Jij}to achieve a desired network performance. Provided that performance can be quantified, the problem of choosing the optimal { J ; j } is equivalent to one of minimizing some effective energy function in the space of all { J ; j } . The performance requirements, such as which patterns are to be stored and with what quality, impose ‘costs’ on the J;j combinations, much as the exchange interactions do on the spins in (7), and there are normally conflicts in matching local (few-J;j) with global (all-J;j)optimization. Thus, the global optimization problem is conceptually isomorphic with that of finding the ground state of a spin glass, and again a conceptual and mathematical transfer has proved valuable. 8A precise definition of overlap is given later in eqn (9). With the normalization used there an overlap is macroscopic if it is of order 1.
4Note that this concept applies even if there is no Lyapunov or energy function. The expression ‘basin’ refers to a restricted microscopic phase space of the {u},even in a purely dynamical context.
266
4. STATISTICAL PHYSICS OF RETRIEVAL In this section we consider the use of techniques of statistical physics, particularly as developed for the study of spin glasses, for the analysis of the retrieval properties of simple recurrent neural networks. Let us consider such a network of N binary-state neurons, characterized by state variables n; = f l , i = 1,...N, interacting via stochastic synaptic operations (as discussed in section 2) and storing, or attempting to store, p patterns {(f} = {fl};p = 1,...p. Interest is in the global state of the network. Its closeness to a pattern can be measured in terms of the corresponding (normalized) overlap
I
or in terms of the (complementary) fractional Hamming distance
which measures the average number of differing bits. To act as a retrieving memory the phase space of the system must separate so as to include effectively non-communicating sub-spaces, each with macroscopic O ( N o )overlap with a single pattern. 4.1 The Hopfield model
A particularly interesting example for analysis was proposed by Hopfield. It employs symmetric synapses Jij = Jj; and randomly asynchronously updating dynamics, leading to the asymptotic activity distribution (over all microstates)
where E ( a ) has the form of eqn. (7). This permits the applications of the machinery of equilibrium statistical mechanics to study retrieval behaviour. In particular, one studies the resultant thermodynamic phase structure with particular concern for the behaviour of the m w . 4.2 Statistical Mechanics
Basic statistical mechanics for the investigation of equilibrium thermodynamics proceeds by introducing the partition function
Several other quantities of thermodynamic interest, such as the average thermal energy and the entropy, follow immediately; for example
( E ) = 2-l
{W
a aP
E(u)exp ( - P E ( u ) ) = --enZ.
267
Others can be obtained by the inclusion of small generating fields; for example, for any observable O(u),
In particular, the average overlap with pattern p follows from
where
Spontaneous symmetry breaking is usually monitored implicitly, often signalled by divergencies of appropriate response functions or fluctuation correlations in the highsymmetry phase. However, it can be made explicit by the inclusion of infinitesimal symmetry-breaking fields; for example k! = &” will pick out the v t h sub-space if the phase space is disconnected, even for h --t Of, but will be inconsequential for k --t O+ if phase space is connected. 4.3 H e b b i a n Synapses
Hopfield proposed the simple synaptic form .7;j
= N-’
cg y (
1 - bij),
r
inspired by the observations of Hebb; we shall refer to this choice as Hebbian. Let us turn to the analysis and implications of this choice, with all the {Wi} taken t o be zero and for random uncorrelated patterns {tP}. For a system storing just a single pattern, the problem transforms immediately, under u; --t u;&, to a pure ferromagnetic Ising model with J;j = N - ’ . The solution is well known and m satisfies the self-consistency equation m = tanh
(pm),
(19)
with the physical solution rn = 0 for T > 1 (P < 1) and a symmetry-breaking phase transition to two separated solutions f l m ( , with m # 0, for T < 1. For general p one may express exp(-PE(u)) for the Hopfield-Hebb model in a separable form,
268 =
/cfi
d f i p ( P N / 2 ~ ) ) )e ~ p [ ~ ( - N P ( f i ” ) ~-/pfhpc 2 u; T > 0.46 Only type (i) solutions remain, each equally stable and with extensive barriers between them. 4.
T>1
Only the paramagnetic solution (all m” = 0) remains. Thus we see that retrieval noise can serve a useful purpose in eliminating or reducing spurious hybrid solutions in favour of unique retrieval. 4.5 Extensive numbers of patterns
The analysis of the last section shows no dependence of the critical temperatures on p. This is correct for p independent of N (and N + co). However, even simple signal-to-noise arguments demonstrate that interference between patterns will destroy retrieval, even at T = 0, for p large enough and scaling appropriately with N. Geometrical, informationtheoretical and statistical-mechanical arguments (to be discussed later) in fact show that the maximum pattern storage allowing retrieval scales as p = aN, where a is an N independent storage capacity. Thus we need to be able to analyse retrieval for p of order N, which requires a different method than that used in (23) - (25). One is available from the theory of spin glasses. This is the so called replica theory (Edwards and Anderson 1975, Sherrington and Kirkpatrick 1975, Kirkpatrick and Sherrington 1978, MCzard et al. 1987). As noted earlier, physical quantities of interest are obtained from ln2. This will depend on the specific set of { J ; j } , which will itself depend on the patterns {t”} to be stored. Statistically, however, one is interested not in a particular set of { J i j } or {t:} but in relevant averages over generic sets, for example over all sets of p patterns drawn randomly from the 2N possible pattern choices. Furthermore, the pattern averages of most interest are self-averaging5,strongly peaked around their most probable values. Thus, we may ignore fluctuations of en2 over nominally equivalent sets of pattern choices and hence ~ , ( ){(I means an average over the specific pattern choices. consider ( l n Z ) { ~where Although in principle one might envisage the calculation first of ln2 for a particular pattern choice and then its average, in practice this would be essentially impossible for large p without some other approximation‘. Rather, one would like to average formally over the patterns { max ( 1 ,J , / J ) (ii) ferromagnetic, m # 0 , q # 0, for T < J o / J and J o / J greater than the T-dependent value of 0 ( 1 ) , and (iii) spin glass, m = 0 , q # 0 for T < 1 and J,/J less than a T-dependent value of O(1). Within the spin glass problem the greatest interest is the third of these, interpreted as frozen order without periodicity, but for neural networks the interest is in an analogue of the second, ferromagnetism. 4.7. Replica analysis of the Hopfield model
Let us now turn to the neural network problem. In place of the order parameter m one now has all the overlap parameters m”. However, since we are principally interested in retrieving symmetry-breaking solutions, we can concentrate on extrema with only one, or a few, rn” macroscopic ( O ( N o ) )and the rest microscopic ( 5 O(N-f)). This enables one to obtain self-consistent equations for the overlaps with the nominated (potentially macroscopically overlapped or condensed) patterns
where the 1 , ...a label the nominated patterns and ( )T denotes the thermal (symmetrybroken) average at fixed {(}, coupled with a spin-glass like order parameter
and a mean-square average of the overlaps with the un-nominated patterns (itself expressible in terms of 9). Retrieval corresponds to a solution with just one m” non-zero. For the case of the Hopfield-Hebb model the analysis follows readily from an extension of (21). Averaging over random patterns yields ‘in fact, within the RS ansats the physical extremum is found from mazimizing the substituted g with respect to q; this is because the number of (ap) combinations n(n - 1)/2 becomes negative in the limit n
+ 0.
273
(Zn) = exp (-np/3/2) {ma}
1fi
fi{dmp(/3N/2n)! exp [ - N ~ ( / 3 ~ ( r n a ) ' / 2
p=la=l
a
-1
+N-'
en cosh i
(BEmpu:))]}.
(47)
P
To proceed further we separate out the condensed and non-condensed patterns and carry out a sequence of manipulations to obtain an extremally dominated form analagous to eqn (32). Details are deferred to Appendix A, but yield a form
(Z"){,> = (@N/2x)"/'
1 fi p,a=l
where
drnw
1n
dqapdrape-Np*
(4)
is intensive. (48) is thus extremally dominated. At the extremum
Within a replica-symmetric ansatz m p = mp, qap = q, rap = r , self-consistency equations follow relatively straightforwardly. For the retrieval situation in which only one m P is macroscopic (and denoted by m below) they are dz
1 15
m= q=
exp ( - z 2 / 2 ) tanh [p(z&
exp (-.'/a)
+ m)]
tanh2p(z&G+ m)]
(52) (53)
where T
= q ( l - p(1 - q ) ) - 2
(54)
Retrieval corresponds to a solution m # 0. There are two types of non-retrieval solution, (i) m = 0 , q = 0, called paramagnetic, in which the system samples all of phase space, (ii) m = 0,q # 0, the spin glass solution, in which the accessible phase space is restricted but not correlated with a pattern. Fig. 1 shows the phase diagram (Amit et. al. 1985); retrieval is only possible provided the combination of fast (stochastic) noise T and slow (pattern interference) noise a is not too great. There are also (spurious) solutions with more than one m p # 0, but these are not displayed in the figure. In the above analysis, replica symmetry was assumed. This can be checked for stability against small fluctuations by expanding the effective free energy functional
214
F({m"}, { q " P } ) to second order in E" = m" - m, quo = q"P - q and studying the resultant normal mode spectra (de Almeida and Thouless 1978). In fact, it turns out to be unstable in the spin glass region and in a small part of the retrieval region of ( T , a ) space near the maximum a for retrieval. A methodology for going beyond this ansatz has been developed (Parisi 1979) but is both subtle and complicated and is beyond our present scope. However, it might be noted as (i) corresponding to a further hierarchical disconnectedness of phase space (Parisi 1983), and (ii) giving rise to only relatively small changes in the critical retrieval capacity. For the example of section 4.6 replica-symmetry breaking changes the critical boundary between spin-glass and ferromagnet to J,/J = 1. A similar procedure may be used, at least in principle, to analyze retrieval in other networks with J;j = Jj;. Transitions between non-retrieval (m = 0) and retrieval m # 0 may be either continuous or discontinuous; for the fully connected Hopfield-Hebb model the transition is discontinuous but for its dilutely, but still symmetrically, connected counterpart the transition is continuouss (Kanter 1989, Watkin and Sherrington 1991).
10
+
05 -
I
0
0
0 05
0 10
a
f
015
a c = 0.138
Figure 1. Phase diagram of the Hopfield model (after Amit et. al. 1985). T, indicates the limit of retrieval solutions, between T, and Tgthere are spin-glass like non-retrieval solutions, above Tgonly paramagnetic non-retrieval.
4.8 Dilute asymmetric connectivity W e might note that a second type of network provides for relatively straightforward analysis of retrieval, including not only that of the asymptotic retrieval overlap (the m
obtained in the last section) but also the size of the basin from which retrieval is possible (i.e. the minimum initial overlap permitting asymptotic retrieval). This is the dilute 'The dilute case also has greater RS-breaking effects (Watkin and Sherrington 1991)
275
asymmetric network (Derrida et. al 1987) in which synapses are only present with a probability C / N and C is sufficiently small compared with N that self-correlation via synaptic loops is inconsequential. C 0.
216
(Wong and Sherrington 1990a). Thus (50) and (52) can be used to determine the retrieval of any such network, given p ( A ) . In particular, this provides a convenient measure for assessing different algorithms for { J i j } . Of course, for a given relationship between { J ; j } and { t } , p ( h ) follows directly; for the Hebb rule (18), p ( A ) is a Gaussian of mean a-f and standard deviation unity.
1
1
7
f (m), m
/ I
f (m). m
0 mB
mo
m'
1
0
m
1
Figure 2 Schematic illustrations of (a) iterative retrieval; 0, m* are stable fixed points, asymptotically reached respectively from initial states in 0 _< m < m ~mg , < m _< 1, (b) variation of f ( m )with capacity or retrieval temperature, showing the occurrence of a phase transition between retrieval and non-retrieval.
5. STATISTICAL MECHANICS OF LEARNING In the last section we considered the problem of assessing the retrieval capability of a system of given architecture, local update rule and algorithm for {Jjj}. Another important issue is the converse; how to choose/train the { J i j } , and possibly the architecture, in order to achieve the best performance as assessed by some measure. Various such performance measures are possible; for example, in a recurrent network one might ask for the best overlap improvement in one sweep, or the best asymptotic retrieval, or the largest size of attractor basin, or the largest storage capacity, or the best resistance to damage; in a feedforward network trying to learn a r d e from examples one might ask for the best performance on the examples presented, or the best ability to generalize. Statistical mechanics, again largely as originally developed for spin glasses, has played an important role in assessing what is achievable in such optimization and also provides a possible mechanism for achieving such optima (although there may be other algorithms which are quicker to attain the goals which have been shown to be accessible). Thus in this section we discuss the statistical physics of optimization, as applied to neural networks.
211
5.1 Statistical physics of optimization Consider a problem specifiable as the minimization of a function Eia)({b}) where the {a} are quenched parameters and the {b} are the variables to be adjusted, and furthermore, the number of possible values of {b} is very large. In general such a problem is hard. One cannot try all combinations of {b} since there are too many. Nor can one generally find a successful iterative improvement scheme in which one chooses an initial value of {b} and gradually adjusts the value so as to accept only moves reducing E. Rather, if the set {a} imposes conflicts, the system is likely to have a ‘landscape’ structure for E as a function of {b} which has many valleys ringed by ridges, so that a downhill start from most starting points is likely to lead one to a secondary higher-E (local) minimum and not a true (global) minimum or even a close approximation to it. To deal with such problems computationally the technique of simulated annealing was invented (Kirkpatrick et. al. 1983). In this technique one simulates the probabalistic energy-increase (hill-climbing) procedure used by a metallurgist to anneal out the defects which typically result from rapid quenches (downhill only). Specifically, one treats E as a microscopic ‘energy’, invents a complementary ‘temperature’, the annealing temperature TA,and simulates a stochastic thermal dynamics in {b} which iterates to a distribution of the Gibbs form
Then one reduces TA gradually to zero. The actual dynamics has some freedom - for example for discrete variables Monte Car10 simulations with a heat bath algorithm (Glauber 1963), such as (4), or with a Metropolis algorithm (Metropolis et. al. 1953), both lead to (60). For continuous variables Langevin = -vbE(b) ~ ( t )where , ~ ( tis) white noise of strength TA, would dynamics with also be appropriate. Computational simulated annealing is used to determine specific {J}to store specific pattern sets with specific performance measures (sometimes without the limit TA -+ 0 in order to further simulate noisy data). It is also of interest, however, to consider the generic results on what is achievable and its consequences, averaged over all equivalently chosen pattern sets. An additional relevance lies in the fact that there exist algorithms which can be proven to achieve certain performance measures if they are achiewabk (and the analysis indicates if this is the case). The analytic equivalent of simulated annealing defines a generalized partition function
+
where we use C to denote an appropriately constrained sum or integral, from which the average Lenergy’at temperature TAfollows from
218
and the minimum E from the zero ‘temperature’ limit, Em,,,= lim ( E ) T ~ . TA-0
As noted earlier, we are often interested in typicallaverage behaviour, as characterized by averaging the result over a random choice of {a} from some distribution. Hence we require to study (!nZn)(,,},which naturally suggests the use of replicas again. In fact, the replica procedure has been used to study several hard combinatorial optimization problems, such as various graph partitioning (Fu and Anderson 1986, Kanter and Sompolinsky 1987, Wong and Sherrington 1987) and travelling salesman (M6zard and Parisi 1986) problems. Here, however, we shall concentrate on neural network applications. 5.2. Cost functions ddpendent on stability fields
One important class of training problems for pattern-recognition neural networks is that in which the objective can be defined as minimization of a cost function dependent on patterns and synapses only through the stability fields; that is, in which the ‘energy’ to be minimized can be expressed in the form
E&({JH = -
cC 9 ( A 3 P
(64)
i
The reason for the minus sign is that we are often concerned with maximizing performance functions, here the g(A). Before discussing general procedure, some examples of g(A) might be in order. The original application of this technique to neural networks concerned the maximum capacity for stable storage of patterns in a network satisfying the local rule =
sgn
(CJijuj) i#i
(66)
(Gardner 1988). Stability is determined by the A:; if A: > 0, the input of the correct bits of pattern p to site i yields the correct bit as output. Thus a pattern p is stable under the network dynamics if
A: > 0;all i.
(67)
A possible performance measure is therefore given by (64) with g ( A ) = -@(-A)
where @(x)
= 1;z > 0
0 ; x < 0.
(68)
279
g(At) is thus non-zero (and negative) when pattern p is not stably stored at site i. Choosing the { J ; j } such that the minimum E is zero ensures stability. The maximum capacity for stable storage is the limiting value for which stable storage is possible. An extension is to maximal stability (Gardner and Derrida, 1988). In this case the performance measure employed is g(A) = -O(K
-
A)
(70)
and the search is for the maximum value of n for which Em;,, can be held to zero for any capacity a, or, equivalently, the maximum capacity for which Em;,,= 0 for any n. All patterns are then stored with stability fields greater than IC. In fact, for synapses restricted only by the spherical constraint"
C J;"j= N , j#i
with J;j and Jj; independent, the stability field n and the storage ratio a = p / N at criticality can be shown to be related by
For n = 0, the conventional problem of minimal stability, this reduces to the usual a, = 2 (Cover 1965). Yet another example is to consider a system trained to give the greatest increase in overlap with a pattern in one step of the dynamics, when started in a state with overlap mt. In this case, for the update rule (5) the appropriate performance function, averaged over all specific starting states of overlap mt, is (Wong and Sherrington 1990a)
where
This performance function is also that which would result from a modification of eqn (68) in which A: is replaced by
(r
is the result of randomly distorting ,$ with a (training) noise dt = (1 - mt)/2 where and E A is averaged over all implementations of the randomization (with fixed m t ) . This is referred to as 'training with noise' and is based on the physical concept of the use of such "Note that this is a different normalization than that used in eqn (18).
280
noise to spread out information in a network, perhaps with an aim towards producing better generalization, association or stability. 5.3 Methodology
Let us now turn to the methodology. For specific pattern sets we could proceed by computational simulated annealing, as discussed in the first part of section 5.1. Analytically, we require ( l n Z A { ( } ) { , , , where
from which the average minimum cost is given by
( P n Z A ) ( o is obtained via the replica procedure, (26), averaging over the {tp} to yield a replica-coupled effective pure system which is then analyzed and studied in the limit n + 0. The detailed calculations are somewhat complicated and are deferred to Appendix B. However we note here that the general procedure is analagous to those of sections (4.5) - (4.7) but with the {J} as the annealed variables, the as the quenched ones and the retrieval temperature replaced by the annealing temperature. For systems with many neurons the relevant integrals are again extremally dominated, permitting steepest descent analysis. New order parameters are again introduced, including an analogue of the spin glass order parameter q"@;here
{t}
where ( )eg is an average against the effective system resulting from averaging over the patterns; cf eqn (35). Again a mathematical simplification is consequential upon a replica-symmetric ansatz. The net result of such an analysis is that the local field distribution p ( A ) in the optimal configuration is given", within RS theory, for synapses obeying the spherical rule (64) by (Wong and Sherrington 1990)
where Dt = dt exp(-t2/2)
/&
(80)
laNote that when the expression for the partition function is extremally dominated, any other thermal measure is similarly dominated and is often straightforward to evaluate; this is the case here with
( p ( A ) ) { € l= ( N p ) - ' ( x x 6 ( A - A:)){€),
as demonstrated in Appendix B.
28 1 and X ( t ) is the value of X giving the largest value of [g(X) implicitly by a-l =
1Dt(X(t)
-q
- (A
- t ) ’ / 2 ~ where ] 7 is given
2 .
The same expressions apply to single perceptrons storing random input-output associations, where the index i can be dropped and AP = q” CjJj(,”/(CJ:): where {(,”}; j = 1,...N are the inputs and 7’’ the output of pattern p, and for dilute networks where N is replaced by the connectivity C. Immediately, one gets the one-step update of any network optimized as above. Thus, for the dynamics of (5) m’ = / d A p ( A ) erf [mA/(2(1 - m2
+T2))i].
(82)
For a dilute asymmetric network this applies to each update step, as in (57). Wong and Sherrington (1990a,b) have used the above method to investigate how the p(A) and the resultant retrieval behaviour depend on training noise, via (66). They have demonstrated that infinitesimal training noise yields the same p ( A ) as the maximum stability rule, while the limit of very strong training noise yields that of the Hebb rule. The former gives perfect retrieval for T = 0 and a < 2 but has only narrow basins of attraction for a > 0.42, while the Hebb rule has only imperfect retrieval, and that only for a < 0.637, but has wide basins of attraction. Varying mt gives a method of tuning performance between these limits. Similarly, for general T one can determine the optimal mt for the best retrieval overlap or basin size and the largest capacity for retrieval with any mt (Wong and Sherrington 1990b); for example for maximum capacity it is better to use small training noise for low T,high training noise for higher T. Just as in the replica analysis of retrieval, the assumption of replica symmetry for qap of (71) needs to be checked and a more subtle ansatz employed when it is unstable against q 9”p;9pp small. In fact, it should also be tested even when small fluctuations q p p small fluctuations are stable (since large ones may not be). Such effects, however, seem to be absent or small for many cases of continuous { J ; j } , while for discrete {Jij} they are more important’*. --f
+
5.4 Learning a rule
So far, our discussion of optimal learning has concentrated on recurrent networks and on training perceptron units for association of given patterns. Another important area of practical employment of neural networks is as expert systems, trained to try to give correct few-option decisions on the basis of many observed pieces of input data. More precisely, one tries to train a network to reproduce the results of some usually-unknown rule relating many-variable input to few-variable output, on the basis of training with a few examples of input-output sets arising from the operation of the rule (possibly with error in this training data). “For discrete { J , l } there is first order replica-symmetry breaking (Krauth and MCnard 1989) and small fluctuation analysis is insufficient.
282
To assess the potential of an artifical network of some structure to reproduce the output of a rule on the basis of examples, one needs to consider the training of the network with examples of input-output sets generated by known rules, but without the student network receiving any further information, except perhaps the probability that the teacher rule makes an error (if it is allowed to do so). Thus let us consider first a deterministic teacher rule 9 = V({€I),
(83)
relating N elements of input data deterministic student network
((:,ti
..(:)
to a single output B”, being learned by a
9 = B({€)).
(84)
B is known whereas V is not. Training consists of modifying B on the basis of examples drawn from the operation of V. Problems of interest are to train B to give (i) the best possible performance on the example set, (ii) the best possible performance on any random sample drawn from the operation of V, irrespective of whether it is a member of the training set or not. The first of these refers to the ability of the student to learn what he is taught, the second t o his ability to generalise from that training. Note that the relative structures of teacher and student can be either such that the rule is learnable, or not (for example, a perceptron is incapable of learning a parity rule (Minsky and Papert 1969)). The performance on the training set p = 1, ...p can be assessed by a training error
where e(z,y) is zero if z = y, positive otherwise. We shall sometimes work with the fractional training error Et
(86)
= Et/p.
The corresponding average generalisation error is
A common choice for
e
is quadratic in the difference (z - y). With the scaling
e(z,y) = (2 - YI2/4
(88)
one has for binary outputs, 9 = 6 1 , e(z,y) = @(-ZY),
so that if E l ( { ( } ) is a perceptron,
(89)
283 then e’ =
@(-A’)
where now.
A” = qJ”1J j ( r / ( CJ:)i, i
i
making Et analagous t o E A of eqn (64) with ‘performance function’ (68). This we refer t o as minimal stability learning. Similarly t o section 4, one can extend the error definition to e” = @(n
- A”)
(93)
and, for learnable rules, look for the solution with the maximum K for zero training error. This is maximal stability learning. Minimizing Et can proceed as discussed above, either simulationally or analytically. Note, however, that for the analytic study of average performance the ( 7 , t ) combinations are now related by the rule V, rather than being completely independent. eg follows from the resultant distribution p ( A ) . For the case in which the teacher is also a perceptron, the rule is learnable and therefore the student can achieve zero training error. The resultant generalization error is however, not necessarily zero. For continuous weights Jj the generalization error with the above two training error formulations scales as 1/a for large a,where p = a N , with the maximal stability form (Opper et. al. 1990) yielding a smaller multiplicative factor than the minimal stability form (Gyorgy and Tishby 1990). Note, however, that maximal stability training does not guarantee the best generalization; that has been obtained by Watkin (1992), on the basis of Bayesian theory, as the ‘centre of gravity’ of the possible J space permitted by minimal stability. For a perceptron student with binary weights { J j = fl}, learning from a similarly constrained teacher, there is a transition from imperfect t o perfect generalization a t a critical number of presented patterns p = a , N . This is because, in order for the system t o have no training error, beyond critical capacity it must have exactly the same weights as the teacher. Just as in the case of recurrent networks for association, it can be of interest t o consider rule-learning networks trained with randomly corrupted data or with unreliable (noisy) teachers or students. Another possibility is to train a t finite temperature; that is t o keep the annealing temperature finite rather than allowing it to tend to zero. Analysis for PA small is straightforward and shows that for a student perceptron learning to reproduce a teacher perceptron the generalization error scales as eg 1/(/3~a), so that increasing a leads to qualitatively similar performance as a zero-temperature optimized network with p/N =& = (Sompolinsky et. al. 1990). There are many other rules which can be analyzed for possible reproduction by a singlelayer perceptron, some learnable, some not, and attention is now also turning towards the analysis of multilayer perceptrons, but for further details the reader is referred elsewhere (Seung et. al. 1992, Watkin et. al. 1992, Barkai et. al. 1992, Engel et. al. 1992).
-
284 6. CONCLUSION
In this chapter we have tried to introduce the conceptual and mathematical basis for the transfer of techniques developed for spin glasses to the quantitative analysis of neural networks. The emphasis has been on the underlying theme and the principles behind the analysis, rather than the presentation of all the intricacies and the applications. For further details the reader is referred to texts such as those of Amit (1989), Miiller and Reinhardt (1990), Hertz et. al. (1991), the review of Watkin et. al. (1992) and to the specialist research literature. We have restricted discussion to networks whose local dynamics is determined by pairwise synaptic forces and in applications have employed binary neurons and zero thresholds, but all of these restrictions can be lifted in principle and mostly in practice. For example, with binary neurons the synaptic update rule includes only a small subset of all Boolean rules and it is possible to extend the analysis of retrieval in a dilute network and the optimization of rules for pattern association to the general set of Boolean rules (Wong and Sherrington 1989a, 198913). Binary neurons can be replaced by either continuous-valued or multi-state discrete ones. Thresholds can be considered as arising from extra neurons in a fixed activity state. We have discussed only a rather limited set of the types of problems which can be envisaged for neural networks. In particular we have discussed only the storage and retrieval of static data and only one-step or asymptotic retrieval (but see also the accompanying chapter on dynamics: Coolen and Sherrington 1992). Statistical mechanical techniques have been applied t o networks with temporally structured attractors and to the issue of competition between such attractors and ones associated with static associations. Indeed, the study of more sophisticated aspects of dynamics is an active and growing one. Also, we have discussed only supervised learning whilst it is clear that unsupervised learning is also of major biological and engineering relevance - again, there has been and continues to be statistical mechanical transfer to this area also. We have not discussed the problem of optimizing architecture, except insofar as this is implicit in the inclusion of the possibility of Jij = 0. Nor have we discussed training algorithms other than that of simulated annealing, but we note again that there exist other algorithms for certain problems which are known to work zf a solution ezists, while the analytic theory can show if one does. Similarly, we have not discussed the rate of convergence of any algorithms, either to optimum or specified sub-optimal performance. However, overall, it is hoped that it will be apparent that the statistical physics developed for spin glasses has already brought to the subject of neural networks both new conceptual viewpoints and new techniques, particularly oriented towards the quantitative study of typical rather than worst cases and allowing for the consideration of imprecise information and assessment of the resilience of solutions. 
There is much further potential for the application of the statistical physics of disordered systems to neural networks and possibly also for the converse, where we note in conclusion that the corresponding investigations of spin glasses, started almost two decades ago, led to a major reconsideration of both tenets and techniques of statistical physics, and neural networks could provide an interesting sequel to the fascinating developments which unfolded in that study.
285
Appendix A Here we consider in more detail the derivation of eqns (48)-(54), starting from eqn (47). For the non-condensed patterns, p > s, only small r n p contribute and the corresponding en cosh can be expanded to second order to approximate
The resultant Gaussian form in r n p is inconvenient for direct integration since it would yield an awkward function of u's. However, C usuf may be effectively decoupled by the introduction of a spin-glass like order parameter qa' via the identities
In eqn(47) the m w ; p > s integrations now yield the u-independent result where (2n/NP) t ( p - " ) ( det A)- i(p-a),
A"' = (1 - P)S,p - Pf',
(A-5)
while the a; contributions now enter in the form
which is separable in i. Further anticipating the result that the relevant r' scales as p = a N , (A.6) has the consequence that (Z"){,l is extremally dominated, as in (48). Re-scaling
yields eqn (48) with
286
(-4.8) (4)
P=1
{OP}
where the {u”} summations are now single-site. Minimizing @ with respect to {ma}, {qaa}, {T”@} yields the dominant behaviour, with (49)-(51) providing an interpretation of the extremal values, as follows from an analagous evaluation of the right hand sides of those equations, which are again extremally dominated. Explicit evaluation and the limit n --t 0 are facilitated by the replica symmetric ansatz m p = mP7 -
TaB
(-4.9)
q,
(A.lO)
= T.
(A.ll)
With these assumptions the exponential in the last term of (A.8) may be written as (A.12)
where we have re-expressed the exp(C Cn cosh) to obtain a form with a linear u dependence in an exponential argument. The second term of (A.12) can similarly be transformed t o a linear form by the use of (22), thereby permitting the separation and straightforward execution of the sums of {u“} in (A.7). Also, with the use of (A.lO) the evaluation of CnA is straightforward and in the limit n + 0 yields (A.13)
Thus {mP}, q and
9({ma},q,T)
T
are given as the extrema of
= a/2
+ C(m’)2/2 + aPT(1-
q)/2
P
+ ( a / 2 P ) ( M l - 4 1 - 4)) - P d ( 1 - P(1 - 4)))
-p-’
/ d~e”’/~(Cn[2cosh
@(z&
+ 2 m”~P)])p=*l,r=~... (A.14) p=l
Specializing to the case of retrieval, with only one m’ macroscopic (i.e. s = l ) , there result the self-consistency equations (52)-(54).
287
Appendix B In this appendix we demonstrate how to perform the analytic minimization of a cost function of the general form E$>({JH= -
cd A P ) P
where e
A’ = (:
C Jj[$, j=1
with respect to
Jj
which satisfy spherical constraints
The {(} are random quenched fl. The solution of this problem also solves that of eqn (57) since the J;j of (57) are uncorrelated in i and thus the problem separates in the i label; note that J;j and Jj; are optimized independently. The method we employ is the analytic simulated annealing discussed in section 5.1, with the minimum with respect to the {J}then averaged over the choice of {(}. Thus we require (.!nZ{t)){t)where
{t}
In order to evaluate the average we separate out the explicit ( dependence via delta functions S ( P - ( P cjJ,(j”/Ct)and express all the delta functions in exponential integral represent ation,
exp(iq5”(XP - ( ” c J j ( T / C ! ) ) .
(B.5)
i
Replica theory requires ( Z c > ) { oand therefore the introduction of a dummy replica index on J , E , X and 4; we use a = 1, ..A. For the case in which all the ( are independently distributed and equally likely to be fl,the ( average involves
For large C (and q5.J exponentiation) by
I O(1)) the cosine can be approximated
(after expansion and re-
288
where
9 is conveniently re-expressed as
where we have used the fact that only C Jj" = C contributes to 2. In analogy with the procedure of Appendix A, we eliminate the term in Cj J;Jf in favour of a spin glass like order parameter/variable qa@introduced via the identities 1=
1
dqaP6(qaP - C-'
= (C/27r)
c JTJ;)
(B.ll)
j
1
dzaPdqaPexp(izaP(CqaP-
c J;J;)).
(B.12)
j
The j = 1,...C and p = 1,...p contribute only multiplicatively in relevant terms and (Z"){,i can be written as
where =)/)n d J a e x p ( - ~ a " ( ( J a ) 2 exp G J ( { E " } , { Z " ~ OL
a
-
1)
+ C zaPJaJP)
(B.14)
a>
(B.18)
where i ( t ) is the value of X which maximizes (g(X) - (A - t)’/27); i.e. the inverse function of t(X) given by
- 7g’(X)l+x.
t(X) =
(B.19)
Extremizing the expression in (A.15) with respect to q, or equivalently to 7,gives the (implicit) determining equation for 7
/ Dt(i(t)
(B.20)
- t ) 2 = a.
The average minimum cost follows from (B.21) This can be obtained straightforwardly from (B.16). Similarly, any measure f(h”))~,)(C} may be obtained by means of the generating functional procedure of eqn (14). Alternatively, they follow from the local field distribution p ( X ) defined by
((x;=l
(B.22) which is given by p(X) =
J DtqX - X(t)).
(B.23)
A convenient derivation of (B.23) proceeds as follows. The thermal average for fixed {(} is given by P
(P-’ C &(A - A ” ) ) T A ”=l
=
C(P-’C &(A (4
fi
- A”))eXP(PA
C g(A”))/Z. ”
(B.24)
290
Multiplying numerator and denominator by 2”-’ and taking the limit n (p-’
c
&(A - A’)),
c1
= lim
c
(p-’
n-10 {JQ};a=l, ...n
c P
which permits straightforward averaging over domination as above, so that
exP(c[PAg(Aa) a
&(A - A”’))eXp(PA
---t
g(A””))
0 gives
(B.25)
P”
{ I } .There then results the same extremal
+ i A n Y - (4’”)2/21 - a
0) or anti-ferromagnet (J < 0); for q; = [; E {-1,l) (random) and J > 0 we recover the Liittinger (1976) or Mattis (1976) model
295
(equivalently: the Hopfield (1982) model with only one stored pattern). Note, however, that the interaction matrix is non-symmetric as soon as a pair ( i j ) exists, such that v;(j # qjt; (in general, therefore, equilibrium statistical mechanics does not apply). The local fields become h;(u) = Jv;m(u) with m ( u ) E & x k (kuk. Since they depend on the microscopic state u only through the value of m , the latter quantity appears to constitute a natural macroscopic level of description. The ensemble probability of finding the macroscopic state m ( u ) = m is given by Pt[ml
= C P t ( U ) 6 [m-m(u)l U
Its time derivative is obtained by inserting (1):
Inserting the expressions (2) for the transition rates and the local fields gives:
In the thermodynamic limit N + 00 only the first term survives. The solution of the resulting differential equation for Pt [m] is:
Pt[m]= / d m o %[m016 [m-m'(t)l
This solution describes deterministic evolution, the only uncertainty in the value of m is due to uncertainty in initial conditions. If at t = 0 the quantity m is known exactly, this will remain the case for finite timescales; m turns out to evolve in time according to (3).
2.2 Arbitrary S y n a p t i c Interactions
We will again define our macroscopic dynamics according to the master equation (l),but we will now allow for less trivial choices of the interaction matrix. We want to calculate the evolution in time of a given set of macroscopic state variables n(u)f (CiI(u), . ..,Cin(u))in the thermodynamic limit N + 00. At this stage there are no restrictions yet on the form or the number n of these state variables nk.0); such conditions, however, naturally arise if we require the evolution of the variables 0 to obey a closed set of deterministic laws, as we will show below. The ensemble probability of finding the system in macroscopic state f2 is given by:
296
The time derivative of this distribution is obtained by inserting (1)and can be written as
This expansion (to be interpreted in a distributional sense, i.e. only to be used in expressions of the form JdnPt(n)G(f?) with sufficiently smooth functions G ( n ) , so that all derivatives are well-defined and finite) w i l l only make sense if the single spin-flip shifts Ajk in the state variables n~, are sufficiently small. This is to be expected from a physical point of view: for finite N any state variable &(c)can only assume a finite number of possible values; only in the limit N + 00 may we expect smooth probability distributions for our macroscopic quantities (the probability distribution of state variables which only depend on a small number of spins, however, will not become smooth, whatever the system size). The first (I = 1)term in the series (4) is the flow term; retaining only this term leads us to a Liouville equation which describes deterministic flow in n space, driven by Including the second ( I = 2) term as well leads us to a Fokker-Planck the flow field dl). equation which (in addition to the flow) describes diffusion in 0 space of the macroscopic probability density Pt [n],generated by the diffusion matrix F$). According t o (4) a sufficient condition for a given set of state variables n(u)to evolve in time deterministically in the limit N + 00 is:
(since now for N + 00 only the 1 = 1 term in (4) is retained). In the simple case where the state variables n k are of the same type in the sense that $ shifts Ajk are of the same order in the system size N (i.e. there is a monotonic function AN such that Ajk = AN) for all jk), for instance, the above criterion becomes:
If for a given set of macroscopic quantities the condition (5) is satisfied we can for large N describe the evolution of the macroscopic probability density by the Liouville equation:
the solution of which describes deterministic flow:
291
d -6)*(t) = F(l)[f?*(t);t] dt
n*(o) = no
(7)
In taking the limit N + 00, however, we have to keep in mind that the resulting deterministic theory is obtained by taking this limit for finite t. According to (4) the 1 > 1 terms do come into play for sufficiently large times t ; for N + 00, however, these times diverge by virtue of (5). The equation (7) governing the (deterministic) evolution in time of the macroscopic state variables 6) on finite timescales w i l l in general not be autonomous; tracing back the origin of the explicit time dependence in the right-hand side of (7) one finds that in order to calculate F(') one needs to know the microscopic probability density p * ( u ) . This, in turn, requires solving the master equation (1) (which is exactly what one tries to avoid). However, there are elegant ways of avoiding this pitfall. We will now discuss two constructions that allow for the elimination of the explicit time dependence in the right-hand side of (7) and thereby turn the state variables 6) and their dynamic equations (7) into an autonomous level of description. The first way out is to choose the macroscopic state variables $2 in such a way that there is no explicit time dependence in the flow field F(') [f?;t ] (if possible). According to the definition of the flow field this implies making sure that there exists a vector field # [n]such that N
(with Aj ( h a , . ..,Aj,,)) in which case the time dependence of F ( l )drops out and the macroscopic state variables f? evolve in time according to:
This is the construction underlying the approach in papers like Buhmann and Schulten (1988), Riedel et al (1988), Coolen and Ruijgrok (1988). The advantage is that no restrictions need to be imposed on the initial microscopic configuration; the disadvantage is that for the method to apply, a suitable separable structure of the interaction matrix is required. If, for instance, the macroscopic state variables S l k depend linearly on the microscopic state variables u (i.e. n ( m ) $ Cj"=,,fjuj), we obtain (with the transition rates (2)):
in which case it turns out that the only further condition necessary for (8) to hold is that all local fields hk must (in leading order in N) depend on the microscopic state u only through the values of the macroscopic state variables n (since the local fields depend linearly on u this, in turn, implies that the interaction matrix must be separable). If it is not possible to find a set of macroscopic state variables that satisfies both conditions (5,8), additional assumptions or restrictions are needed. One natural assumption that allows us to close the hierarchy of dynamical equations and obtain an autonomous
298 flow for the state variables n is to assume equipartitioning of probability in the subshells of the ensemble, which allows us to make the replacement:
n-
with the result
Whether or not the above way of closing the set of equations is allowed will depend on w j ( u ) A j ( u )is constant within the extent t o which the relevant stochastic vector the 0-subshells of the ensemble. At t = 0 there is no problem, since one can always choose the initial microscopic distribution n(a)to obey equipartioning. In the case of extremely diluted networks, introduced by Derrida et al (1987), this situation is subsequently maintained by assuring that, due t o the extreme dilution, no correlations can build up in finite time and equipartitioning will be sustained (see also the review paper by Kree and Zippelius 1991). The advantage of extreme dilution is that less strict requirements on the structure of the interaction matrix are involved; the disadvantage is that the required sparseness of the interactions (compared to the system size) does not correspond t o biological reality.
c:,
3. SEPARABLE MODELS
In this section we will show how the formalism described in the previous section can be applied t o networks for which the matrix of interactions Jij has a separable form (which includes most symmetric and non-symmetric Hebbian type attractor models). We will restrict ourselves t o models with Wi = 0; the introduction of non-zero thresholds is straightforward and does not pose new problems.
3.1 Description at the Level of Sublattice Magnetisations The following type of models was introduced by van Hemmen and Kiihn (1986) (for symmetric choices of the kernel Q). The dynamical properties (for arbitrary choices of the kernel Q) were studied by Riedel et al (1988):
1
J'3. . = - -Q N
(€.. 0. If we take the residual yi = I ; - ?i to be the output of our transform, the LMS linear prediction gives us E [ y i ~ i - j ]= 0 (21) for all j > 0, and therefore E [Yiyk] = 0 (22) for all k < i, since Y k = x k - (ulzk-1 a2xk-2 ..’). Thus linear predictive coding has given us the uncorrelated outputs we need.
+
+
317
Figure 5: Linear decorrelating networks ( M = 2).
3.3
Local Decorrelating Algorithms
One of the early suggestions for learning in neural networks was Hebb's (191 principle, that the effectiveness of the connection between two units should be increased when they are both active a t the same time. This has been used as the basis of a number of artificial neural network learning algorithms, so-called Hebbian algorithms, which increase a connection weight in proportion to the product of the unit activations at each end of the connection. If the connection weight decreases (or increases its inhibition) in proportion to the product of the unit activations, this is called anti-Hebbian learning. A number of anti-Hebbian algorithms have been proposed to perform decorrelation of output units. For example, Barlow and Foldigk [lo] have suggested a network with linear recurrent lateral inhibitory connections (Fig. 5(a)) with an anti-Hebbian local learning algorithm. In vector notation, we have an M-dimensional input vector z, an M-dimensional output vector y, and an M x M lateral connection matrix V. For a fixed input, the lateral connections cause the output values to evolve according to the expression (yt)t+l
= 2, - Cut3(y3)t
i.e.
&+l
=X-V&
(23)
3
at time step t , which settles to an equilibrium when 2 = .r - Vy, which we can write as
+
provided ( I M V) is positive definite. We assume that this settling happens virtually instantaneously. The matrix V is assumed to be symmetrical so that the inhibition from unit i to unit j is the same as the inhibition from j t o i, and for the moment we assume that there are no connections from a unit back to itself, so the diagonal entries of V are zero. Barlow and FoldiAk [lo] suggested that for each input z, the weights uj3 between different units should altered by a small change i#j Avij = V Y i Y j where 7 is a small update factor. In vector notation this is A V = Fffdiag(ygT)
(25)
318
since the diagonal entries of V remain fixed at zero. This algorithm converges when E(y,y,) = 0 for all i # j , and thus causes the outputs to become decorrelated [lo]. Atick and Redlich (71 considered a similar network, but with an integrating output d y / d t = c - Vy leading to y = V-'a when it has settled. They show that a similar algorithm for the lateral inhziitory connections between different output units leads to decorrelated outputs, while reducing a information-theoretic redundancy measure. The algorithms considered so far simply decorrelate their outputs, but ignore what happens to the diagonal entries of the covariance matrix. For a signal with statistics which are position-independent, such as images on a regularly-spaced grid of receptors, we can consider the problem in the spatial frequency domain. Decorrelation is optimal, as we have seen above, and the variance of all the outputs will happen to be equal. If we do not have position-independent statistics, we can go back to the power-limited noisy channel argument, but use the actual output covariance matrix instead of working in the frequency domain. For small output noise, we can express the transmitted information as
I ( q , X ) = 1/2logdetC,, - 1/210gdetCo
(27)
and the power cost as S, = Trace(Cy).
Using the Lagrange multiplier technique again, we wish to maximise
J = I ( @, X ) - 1/2XST which leads to the condition [30]
(29)
cy = l/XI&f.
In other words, not only should the outputs be decorrelated, but they should all have the same variance, E(y:) = 1 / X . The Barlow and FoldiAk [lo] algorithm can be modified to achieve this, if self-inhibitory connections from each unit back to itself are allowed [30]. The algorithm becomes A V ; ~= vyiyj - (l/X)S,j
i.e.
A V = v(g/gT- (l/X)I&f)
(31)
which monotonically increases J as it progresses. This is perhaps a little awkward, since the self-inhibitory connections have a different update algorithm to the normal lateral inhibitory connections. As an alternative, a linear network wit.11 inhibitory interneurons (Fig. 5(b)) can he used. After an initial transient, this network settles to y=c-Vg and z = V T y(32) i.e. y = ( I + VVT)-'c (33) where v i j is now the weight of the excitatory (positive) connection from yi to z j , and also the weight of the inhibitory (negative) connection back from to yi. Suppose that the weights in this network are updated according to the algorithm Avij = ? ~ ( y i-~ l/Xlfij) j
(34)
319
r - - - - -
I - - - - - - ,
sh
*
G(f) -
which is a Hebbian (or anti-Hebbian) algorithm with weight decay, and is
in vector notation. Then the algorithm will converge when Cy = l / A I n 4 , which is precisely what we need to maximise J . In fact, this algorithm will also monotonically increase J as it progresses. This network suggests that inhibitory interneurons, which are found in many places in sensory systems, may be performing some sort of decorrelation task. Not only does the condition of decorrelated equal variance output optimize information transmission for a given power cost, but it can be achieved by various biologically-plausible Hebb-like algorithms.
3.4
Optimal filtering
Srinivasan, Laughlin and Dubs [44]suggested that predictive coding is used in the fly’s visual system to perform decorrelation. They compared measurements from the fly with theoretical results based on predictive coding of typical scenes, and found reasonably good agreement at both high and low light levels. However, they did find a slight mismatch, in that the surrounding inhibition was a little more diffuse than t,he theory predicted. A possible problem with the original predictive coding approach is that only the output noise is considered in the calculation of information: the input noise is assumed to be part of the signal. At low light levels, where the input noise is a significant proportion of the input, the noise is simply considered t o change the input power spectrum, making it flatter [44].This assumption means that the predictive coding is an approximation to a true optimal filter: the approximation is likely to be worse for either high frequency components, where the original signal power spectral density is small, or for low light conditions, where all signal compoiient,s are small. In fact, it is possible to analyse the system for both input and output noise (Fig. 6). We can take a similar Lagrange multiplier approach as before, and attempt to maximise transmitted information for a fixed power cost. Omitting the details, we get the following quadratic equation to solve for this optimal filter a t every frequency f [33]
320
Figure 7: Typical optimal filter solution, for equal white receptor and channel noise. where R, is the channel signal to noise power spectral density ratio &IN,, and R, is the receptor signal to noise power spectral density ratio S,/N,, and y is a Lagrange multiplier which determines the particular optimal curve to be used. This leads to a non-zero filter gain Gh whenever R, > [(y/N,) - 11-l. For constant N, (corresponding to a flat channel noise spectrum) there is therefore a certain cut-off point below which noisy input signals will be suppressed. Fig. 7 shows a typical optimal solution, has been investigating modifications to the together with its asymptotes. Plumbley [29,31] decorrelating algorithms mentioned above which may learn to approximate this optimal filtering behaviour. Atick and Redlich [5] used a similar optimal filtering approach in their consideration of the mammalian visual system, minimising redundancy for fixed information rather than maximising information for fixed power. They compared their theory with the spatiotemporal response of the human visual system, and found a very good match [4]. These results suggest very strongly that economical transmission of information is a major factor in the organization of the visual system, and perhaps other sensory systems as well.
4
Principal Component Analysis and Infomax
Principal component analysis (PCA) is widely used for dimension reduction in data analysis and pre-processiiig, and is used under a variet,y of names such as the (discrete) Karhunen Lokve Transform (KLT), factor analysis, or the Hotelling Transform in image processing. Its primary use is to provide a reduction in the number of parameters used to represent a quantity, while minimising the error introduced by so doing. In the case
32 1
Figure 8: The Oja Neuron. of PCA, a purely linear transform is used to reduce the dimensionality of the data, and it is the transform which minimises the mean squared reconstruction error. This is the error which we get if we transform the output y back into the input domain to try to reconstruct the input g so that the error is minimised. Linsker’s principal of maximum information preservation, “Infomax” , can be applied to a number of different forms of neural network. The analysis, however, is much simpler when we are dealing with simple networks, such as binary or linear systems. It is instructive to look a t the linear case of PCA in some detail, since much effort in other fields has been directed at linear systems. We should not be too surprised to find a neural network system which can perform KLT and PCA. From one point of view, these conventional data processing methods let us know what to expect from a linear unsupervised neural network. However, the information theoretic approach to the neural network system can help us with the conventional data processing methods. In particular, we shall find that a dilemma in the use of PCA, known as the scaling problem, can be clarified with the help of information theory.
4.1
The Linear Neuron
Arguably the simplest form of unsupervised neural network is an N-input, single-output linear neuron (Fig. 8). Its output response y is simply the sum of the inputs zi multiplied by their respective weights wi,i.e. N
or, in vector notation, y =ZTZ
where u, = [wl, . . . ,w ~ and] = ~ [XI,.. . ,Z N ] ~are column vectors. The output y is thus the dot product -?:.u) of the input c with the weight vector u.If 0 is a unit vector, i.e. = 1, y is the component of .?: in the direction of u) (Fig. 9).
10
322
”?
x
Figure 9: Output y as a component of g,with unit weight vector.
We thus have a simple neuron which finds the component of the input g in a particular direction. We would now like to have a neural network learning rule for this system, which will modify the weight vector depending on the inputs which are presented to the neuron.
4.2
The Oja Principle Component Finder
A very simple form of Hebbian learning rule would be to update each weight by the product of the activations of the units at either end of the weight. For the single linear neuron (Fig. 8), this would result in a learning algorithm of the form AIUi
= qx,y
(39)
or in vector notation
A 0 = qgy. Unfort,unately, this learning algorithm alone would cause any weight to increase without bound, so some modificat,ion has to he used to prevent the weights from becoming too large. One possible solution is to limit t,he absolute vaJues that each weight 2ui can take 1461, while another is to renormalise the weight vector 0 to have unit length after each update [23]. ,4n alteruat,ive is to use a weight decay term which causes the weight vector to tend to haw unit length as the algorithm progresses, without explicitly normalising it. To see how t,liis works, consider the following weight update algorithm, due to Oja [23]:
A g = ~ ( gy w-y2) -
q(zzT w - w(OT g r T a)).
(41)
When t,he weight vector is small, the update algorithm is dominated by the first term on t,lie right hand side, which causes the weight to increase a.. for the unmodified Hebbian algorithm. However, as the weight vector increases, the second term (the ‘weight decay’
323
term) on the right hand side becomes more significant, and this tends t o keep the weight vector from becoming too large. To find the convergence conditions of the Oja algorithm, let us consider the average weight update over some number of input presentations. We shall assume the input vectors c have zero mean, and we shall also assume that the weight update factor is so small that the weight itself can be regarded as approximately constant over this number of presentations. Thus the mean update is given by
where X = gTC,u, and C, = E ( z g T )is the covariance matrix of the input data c. When the algorithm has converged, the average value of A 0 will be zero, so we have
C& = u x
(43)
i.e. the weight vector 0 is an eigenvector of the input covariance mat,rix C,. A perturbation analysis confirms that the only stable solution is for u to be the principal eigenvector of C,. To find the eventual length of 0 we simply substitute (43) into the expression for A , and we find that x = WT(C,W) = Z.T(aX) (44) i.e. provided X is non-zero, uTu,= 1 so the final weight vector has unit length. We have therefore seen that as the Oja algorithm progresses, the weight vector will converge to the normalised principal eigenvector of the input covariance matrix (or its negative) [23]. The component of the input which is extracted by this neuron, to be transmitted through its output y, is called t,he principal component of the input, and is the component with largest variance for any unit length weight vector.
4.3
Reconstruction Error
For out single-output syst,em, suppose we wish to find the best estimate 2 of the input g from the single output y = a T g . We form our reconstruction using the vector
as
follows: ?=gy
(45)
where 21 is to be adjusted to minimise the mean squared error
If we minimise 6 with respect to 14 for a given weight vector 0, we get a minimum for
E
at
324
where C, = E [ u T ]as before (assuming that g has zero mean). Our best estimate of g is then given by
where the matrix
is a projection operator, a matrix operator which has the property that Q2= Q. This means that the best estimate of the reconstruction vector &, from the output yx = I&, is 2, itself. Once this is established, it is possible to minimise E with respect to the original weight vector w. Provided the input covariance matrix C, is positive definite, this minimum occurs when the weight vector is the principal eigenvector of C,. Thus PCA minimises mean squared reconstruction error.
4.4
The Scaling Problem
Users of PCA are sometimes presented with a problem known as the scaling problem. The result of PCA, and related transforms such as KLT, is dependent on the scaling of the individual input components xi. When all of the input components come from a related source, such as light level receptors in an image processing system, then it is obvious that all the inputs should have the same scaling. However, when different inputs represent unrelated quantities, then the relative scaling which each input should be given is not so apparent. As an extreme example of this problem, consider two uncorrelated inputs which initially have equal variance. Whichever input has the largest scaling will become the principal component. While this extreme situation is unusual, the scaling problem does cause PCA to produce scaling-dependent results, which is rather unsatisfactory. Typically, this dilemma is solved by scaling each input to have the same variance as each other [47]. However, there is also a related problem which arises when multiple readings of the same quantity are available. These readings can either be averaged to form a single reading, or they can be used individually as separate inputs. If same-variance scaling is used, these two options again produce inconsistent results. Thus although PCA is used in many problem areas, these scaling problems may lead us not to trust it to give us a consistent result in an unsupervised learning system.
4.5
Information Maxmization
We have seen that the Oja neuron learns to perform a principal component analysis of its input, but that principal component analysis itself suffers from an inconsistency problem when the scaling of the input components is not well defined. In order to gain some insight to this problem, we shall apply Linsker’s Znfomax principle [21] to this situation. Consider a system with input X and output Y . Linsker’s Infomax principle states that a network should adjust itself so that the information I ( X ,Y ) transmitted to its output ’I’ about its input X should be maximised. This is equivalent to the information in the input S about the output ’I7, since Z( 4 , for which (6) has a solution, in particular by choosing A and B large enough or a close enough to 1. These features are expected to extend to more general non-linear competitive implementation of the model.
2.2
A Simplified Global Competitive Model
The features of the model of Figure 1 we have explored so far lead to competition between thalamic neurons for which there is lateral inhibitory connection between corresponding cells on the NRT net. However, NRT cells have not been observed to have connections across the whole NRT sheet, in spite of their extensive dendrite trees (especially in the rostra1 part of NRT). Thus the range over which this competition can effectively take place will be limited. The localised properties of such competition have already been analysed over a decade ago (Ermentrout and Cowan 1978), in terms of the even more simplified model of a single sheet with excitatory input and a Mexican hat style of lateral coupling; this model results from that of Figure 1 by collapsing the thalamic cells onto the corresponding NRT ones, and further violating Dale’s Law. The range over which the competition can occur is of order that of the range of connections on the sheet. This feature of the sheet is not satisfactory for global control of the form of global guidance described earlier. It would seem that without considerable extensions of the lateral connections on NRT (or a small enough sheet, as might be the case in the rat) global guidance could not arise, and attentional control would be weak. A number of localised activities could then be supported on cortex, in disagreement to the ability to attend only to a single object at a time. The important feature of NRT, noted at least for more advanced mammals, above the rat, in Table 1, is the presence of dendrc-dendritic synapses. These latter allow the NRT to be considered as a totally connected net even without the presence of axon collaterals of the form considered in Figure 1. Such dendro-dendritic synapses arise between horizontal cells in the outer plexiform layer of the retina (Dowling 1987), in the form of electrical gap junctions. They were modelled as linear resistors in the mathematical model of the retina in Taylor (1990). On NRT, the dendro-dendritic synapses appear to be chemical ones (as high-magnification electromicrographs show vesicles on or on both sides of the synapses). Such synapses need to be modelled in a non-linear fashion. In general, the dendro-dendritic synaptic contributions to the membrane potential v T ( r ) at
352
a particular cell at the point r on the NRT sheet might be approximated as a sum of contributions from the nearby synapsing cells, each depending on the membrane potential differences between the cells. Thus, a typical form would be
in terms of some non-linear function F. Working with only small changes of potentials, F can be linearised, to give the contributions
where G = F’(0) is positive in the case of inhibitory action in the dendredendritic synapse. For values of r’ close to r, the continuum limit can be taken of the NRT system, and the expression (8) reduces to (Taylor 1990): -a2V2vT
+ O(a4),
(9)
where Vz is the two-dimensional Laplacian operator in Cartesian coordinates and a is a real constant determined by the average spacing between the neurons of the net. The resulting equations for the action of linearised cells is the extension of (2) by adding the dendredendritic synaptic contribution on the R.H.S. This leads to (1) and (2) modified by (9): VT
VN
=I
+A .
VN,
+ a Z V Z v N= B ’vT - c - v N .
(1’)
(2’)
Upon neglect of the lateral connection matrix C, and with A and B diagonal, we obtain the simpler equation V N 4- b 2 V Z V N = J, (10) where bZ = (1 - AB)-’aZ,J = (1 - AB)-’I, and we assume AB < 1 to prevent infinite gain in the thalamus-NRT feedback loop system. The expression (10) is the basis of the simple version of the global gating model. What can we deduce about the dependence of the response of the NRT cells’ potentials (and hence that of the thalamic cells by (1)) from (10). We claim that (10) instantiates a form of competition in the spatial wavelength parameters of incoming inputs J. Physical systems with this underlying description have been investigated in a number of cases: spatially inhomogeneous superconducting states on tunnel injection of quasparticles (Iguchi and Langenburg 1980); a Peierls insulator under strong dynamic excitation of electron-hole pairs or in the presence of electromagnetic radiation (Berggren and
353 Huberman 1978). These models, and related ones for growth and dispersal in a population (Cohen and Murrray 1981; Murray 1989). We may see precise forms of competition arising from the NRT modelled by (11) or (10) by looking at these equations for inputs made of sums of plane waves. Thus, if J is composed of a set of separate waves of wavenumbers kl, . . .,kn, so J = Cj cj sin kj . r, then for suitable coefficients kj, from (lo), n
'J sink, - r ,
j=1
Thus, NRT activity will also display the same spatial oscillation as the input, but now with amplification of those waves with 1 k? = -
b2'
3
Such augmentation corresponds to a process of competition on the space of the Fourier transform of inputs, where the Fourier transform f(k) for an input I(r) involves the global recombination 1 f(k) = - d2r eik'rI(r). 2* It is in this manner that we can see how NRT can exercise global control, by singling out those components f(k) of I(r) by (13) for which (12) is true. Other values of k do not have such amplification. In other words, it would appear that the NRT would oscillate spatially with wavelength 2rb, with net strength given by the component of the input with the same wavelength. The way in which global control arises now becomes clear. Only those inputs which have special spatial wavelength oscillations are allowed through to the cortex, or are allowed to persist in those regions of cortex strongly connected to the NRT: the thalamusNRT system acts as a spatial Fourier filter. There is evidence for this in that feature detectors occur in a regular manner across striate cortex (Hubel and Weisel 1962) as well as facecoding appears to have a spatial lattice structure (Harries and Perrett 1991). Other explanations of the spatial periodicity of striate cortex feature detectors have been proposed (Durbin and Mitchison 1990) but these are consistent with our present proposal which may only add a further spatial instability to that explored in those references by non-NRT processes. It would seem that the model of Figure 1 with dendro-dendritic synapses, as described by equations (1) and (2), can satisfy criterion (b) of $2, at least to a limited extent. The model can be extended immediately to allow activity (c) of $2 to be implemented by addition of a further direct input arising from a net assessment of peripheral visual inputs.
1
354
There are known diffuse neurochemically identifiable afferents (cholinergic, GABAergic, serotonergic, and noradregernic) which enter NRT. These are known to modify NRT cell firing. In some parts, these inputs may be used to switch off on-going NRT activity (by inhibition) or win any on-going competition (by excitation). This could be modelled by an additional input on the R.H.S. of (2). It is relevant to note that NRT activity is expected to be more sensitive to such inputs than that arising indirectly from T in (1). There is still the question of stability of the solutions. We also have to discuss how flexible is the competitive process. That will be in terms of the more complete model which we outline in the next section. This will indicate that global control may also depend on the input amplitude, as well as their wavelengths, in the non-linear regime.
3
Modelling the Global Competitive Gate
3.1
The Model
3.1.1
Formulation
We discussed in the previous section various simplified versions of the overall model to be presented in this chapter. There were also some unanswered questions raised by these models, and in particular, that of the stability and flexibility of the system. In this section, we wish to present the more complete model and discuss in general how it functions. Furthermore, we will show how it possesses features able to give a broad range of responses to inputs. Most specifically, we wish to show that parameter ranges can be varied by internal and external factors. The basis of the model is best seen in terms of its wiring diagram in Figure 2. The model is an extension of that in Figure 1 by means of : (i) addition of the cortical layer C, (ii) addition of interneurons IN in thalamus, (iii) extension of the input to IN cells as well as the T cells, and NRT feedback solely taken to IN cells. The latter conectivity is an approximation to the results of the analysis of Steriade and colleagues (1986), in which the effect of GABA on GABAergic cells (from N cell feedback to IN cell) is claimed to be an order of magnitude more effective than of GABA on excitatory cells. We note that the model may be regarded as the framework for more extensive modelling with layered cortex and NRT. However, at this stage, too much complexity would be introduced too rapidly by such structural richness. We now turn to the detail of the equations used to describe the thalamus-NRT-
355
cortex interacting system, whose general structure is shown in Figure 2. Excitatory neurons are assumed to be at a coordinate position r (using the same coordinate frame for thalamus, NRT and cortex) on the thalamic and cortical sheets and to have membrane potentials labelled uT(r) and uc(r), and outputs f,(u;), (i = T and C,respectively), where f;(z) = [l exp with threshold 0; and temperature 5";. Inhibitory neurons in the thalamus and on the NRT (with the same coordinate positions) have membrane potentials denoted by vT(r), v N ( r ) , with similar sigmoidal outputs g T , g N to the other neurons. Connection weights from the j'th excitatory (inhibitory) neuron at r' to the i'th excitatory (inhibitory) neuron at r are denoted by WtE(r, r'), W r ( r , r'), W4E(r,r'), W4r(r,r'). The resulting leaky integrator neurons (LINs) satisfy
+
,'-I)?(
1 G T ( r ) = --uT(r) 7T
+1c
[@g(r,r')fC(UC(r'))
1 + w%r, r ' ) g T ( v T ( r ' ) ) ] + -l(r) TT
(17)
r'
(where we are using the notation 4 = &/at). We have rescaled all the connection weights, and the input on the thalamic cells, so that the decay constants r;, r+ drop out for stationary activity. Moreover the dendredenritic contribution has yet to be included in (15), it being given by the analysis of Taylor (1990), as cited earlier, to be equal in the linearised limit to -Gz(NvN(r) -
vN(r')).
(18)
A purely rectangular net with four neighbours at horizontal and vertical distances a*, b* (so r' = (r (a*, b i ) ) ) gives, in the continuum limit (Taylor 1990), the dendro-dendritic term is 1 + G z ( v . V V N -A2 . V z v ~+) higher order terms (19) 2 where v = (a+ - a _ , b+ - b - ) T , A' = (a: + a!, b: b'_)=, V2 = ( 3 z , d i ) T , and G2 is a negatively-valued constant in (19). The equation (15) thus reduces to a negative Laplacian net, with a linear derivative term proportional to v . The inputs to the net depend, however, in a non-linear manner on ON as given by (14), (16), (17).
+
+
+
356 3.1.2
Analysis of the Model
The model of Figure 2, expressed mathematically by equations (14)-(19), is expected to have similar properties to those of the various models associated with Figure 1 and discussed in 52. That is clear for the asymptotic solutions satisfying the equations obtained by setting the left hand sides of (14)-(17) all to zero. The main differences between this static system and the earlier equations of 52 are: (i) the presence of the C-layer and its feedback, (ii) non-linearity of all neurons, (iii) the presence of the inhibitory interneurons IN, and NRT feedback to them rather than the T-cells (as in Figure 1). We will briefly discuss each of these features in turn. The cortical layer can be seen as a mechanism for achieving extra input amplification, by means of the thalamus-cortex-thalamus feedback loop, as has already been discussed by La Berge and colleagues (1992). Indeed, the thalamus-NRT feedback loop is functioning in a similar manner, where the factor 1/(1 - AB) arises as the gain factor in the model of 52. This amplification may make the competition on NRT more effective (La Berge et d.1992). This will be borne out by simulation of our complete system. Nontrivial transforms in C will be expected to modify this result, but will not be considered here. The non-linear functionality of neurons was already argued to be a means of increasing the sensitivity of the system in 52. We will appeal to this same argument here. The inhibitory interneurons have been included so as to be able to implement Dale’s Law properly. This is clearly satisfied in Figure 2 and equations (14)-(17). The effect of NRT activity on the IN cells will therefore be that of disinhibition, which is a mode of action used in numerous parts of the brain. Its net effect, for suitable parameter ranges of cell thresholds is similar to that of excitation of NRT activity directly on the thalamic relay cells, as observed experimentally for example by Steriade et al. (1986). In this range of parameters, then, the disinhibitory NRT-IN-T activity of Figure 2 reduces to the net excitatory NRT-T activity of Figure 1. It would appear that the model of Figure 2 should therefore have similar properties to that of Figure 1, in particular supporting both local and global competition and being able to account for endogenous activity by addition of direct input to NRT on the R.H.S. of (15). However, there is still the crucial question of stability and more detailed inputdependence, of plane waves excited globally across the NRT. Stability can be analysed by looking at higher order terms that might arise in (19), and also consider a linearised analysis of the lateral connection term involving the con-
351
nection weight W"(r,r') on the R.H.S. of (15). The former of these gives (assuming symmetry in the x and y directions)
ahd the latter, for a Gaussian spread of lateral inhibition, the convolution product
where
with b < a. The dispersion relation for the temporal dependence ex* of the NRT membrane potential in (15), neglecting all but the lateral connections as a first approximation (since the other terms do not play a crucial role) becomes for the Fourier mode k, A = -G2k. k - c2(k. k)2 - F ( k ) ,
(23)
with
In Figure 3(a), the general shape of F ( k ) is plotted, and in Figure 3(b), that of A(k). Instability arises for values of k for which X(k) 2 0. This is the interval (kl,k z ) in Figure 3(b). As argued with clarity in Murray (1989),inhomogeneous NRT activity, with wave numbers in the interval (kl,kz) will be expected to grow. The stability of the resulting globally inhomogeneous activity depends on the more detailed non-linear system (14)(18), and presently can only be ascertained by simulation. Results of the latter will be presented in $3.2; they will show the system is stable. The dependence of the winning activity on the maximum of the wavelength number being singled out by the competition on NRT will be discussed further in the next sub-section. There is finally the non-trivial dependence on the inputs of the inhomogeneous mode winning the competition on the NRT sheet. Thus, whilst equation (23) would seem to indicate lack of dependence of this mode on the input, the full non-linear system does indeed have such a dependence. There does not seem to be much mathematical analysis of this problem in the literature, although dependence on boundary and initial conditions for reaction-diffusion models has been studied (Arcuri and Murray 1986). We will turn to answer this question by simulation in the next sub-section; we propose to consider it mathematically elsewhere.
358
3.2 3.2.1
Simulations The Simulation Model
The simulation model we investigated is illustrated in Figure 2. It is essentially onedimensional, which corresponds to lines of thalamic, NRT and cortical neurons. In future work, it is intended to extend the simulation work to the more realistic case of tww dimensional sheets of these neurons. A simplified version of Figure 2 is presented in Figure 4. It is useful for simulation purposes to think of the boxed subsystem in Figure 4 as a single module that can be linearly replicated (with bidirectional lateral links among the NRT units). This allows the size of the simulation to be scaled to suit the available computing power. Within each module, we have inter-unit signal flow as specified by the arrow-tipped curves. Every tip labelled with a ‘+’ signifies an excitatory signal, while every ‘-’ corresponds to an inhibitory signal. Within every module, the time evolution of each neural unit’s voltage is determined by the solution of equations (14)-(17). Within a computing context, of course, a discretised version of these equations has to be integrated over some suitable small time step (we used the RungeKutta fourthmder integration routine). Such a scheme entails in practice that we replace the hard-limiting non-linearity by a smooth analytical function. We adopted the following form, as used by La Berge et al. (1992): f(.A)
= hAY(.A
- 0,)
[I - exp { - p A ( . A
- OA)}]
(25)
where h_A is a scaling parameter, Y(·) is the step function, θ_A is the threshold for unit A and ρ_A is its inverse temperature. In order to proceed with the simulations, it was found necessary to obtain limits on the ranges of the large number of parameters involved. To obtain an idea of what constitutes 'good' parameter ranges, consider the equations (14)-(17) in the static limit (obtained by setting du_i/dt = dv_i/dt = 0 or, equivalently, by allowing the τ_i's, i = T, N, C, to go to zero):

u_C = w_CC Y(u_C − θ_C),   (14′)
v_N = w_NT Y(u_T − θ_T) + w_NC Y(u_C − θ_C),   (15′)
v_T = w_TN Y(v_N − θ_N),   (16′)
u_T = I_k − w_TT Y(v_T − θ_T) + w_TC Y(u_C − θ_C).   (17′)
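To make the non-linearity concrete, here is a minimal Python sketch of the smoothed activation (25) that replaces the hard limiter in the dynamical versions of these equations (our illustration; the parameter values are arbitrary). The hard-limiting function Y is recovered in the limit ρ_A → ∞.

import numpy as np

def f(u, h=1.0, theta=0.1, rho=10.0):
    """Smoothed threshold of the form (25): zero below theta, rising
    towards the scale h above it; rho plays the role of the inverse
    temperature. All parameter values here are illustrative assumptions."""
    x = u - theta
    return h * np.heaviside(x, 0.0) * (1.0 - np.exp(-rho * x))

# As rho grows, f approaches the hard step h * Y(u - theta):
for rho in (1.0, 10.0, 100.0):
    print(rho, f(np.array([0.0, 0.2, 1.0]), rho=rho))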
Note that in these equations we have taken the full non-linearity Y(·) for the function in (25). To distinguish between zero and finite external inputs, we write I_k⁰ for the case I_k = 0 and I_k^f for the case I_k ≠ 0. In order to set up the subsystem so that we can achieve global control,
we require that units T_k, N_k and C_k be switched 'off' whenever unit IN_k is 'on' in the case of zero external input, and vice versa in the case of finite external input². Looking at (17′) separately for the cases I_k = I_k⁰ and I_k = I_k^f, it is easy to show that the thresholds of each of the units have to lie in definite ranges; these form the constraints (26) referred to below.
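As a sketch of how such ranges arise (our illustration only, with assumed unit states; the actual constraints (26) cover all four unit types), consider (17′) with the interneuron 'on' and the cortex 'off' for zero input, and the reverse for finite input:

% A sketch only: two representative switching conditions from (17'),
% under assumed interneuron and cortical states.
\begin{align*}
I_k = I_k^{0} = 0,\ Y(v_T-\theta_T)=1,\ Y(u_C-\theta_C)=0
  \;&\Rightarrow\; u_T = -w_{TT} < \theta_T && \text{(relay off)},\\
I_k = I_k^{f},\ Y(v_T-\theta_T)=0,\ Y(u_C-\theta_C)=1
  \;&\Rightarrow\; u_T = I_k^{f} + w_{TC} > \theta_T && \text{(relay on)},
\end{align*}

so that, under these assumptions, the relay threshold is bracketed as −w_TT < θ_T < I_k^f + w_TC.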
3.2.2 Simulation Results
The series of simulations we carried out reflects in part the major aim of establishing the existence of the global control mode of the NRT. In Figures 5-15, we present the results obtained from actual simulation runs. The simulation code we developed was modular in design, allowing easy scaling to more powerful simulations, which could, with some reconfiguration, also be performed on parallel processing machines. It is hoped in future to extend the simulations to the more realistic case of two-dimensional sheets of thalamus, NRT and cortical tissue, and this can easily be accommodated. The x-axes in these plots represent the spatial positions of a line of 100 neurons. The y-axes represent OUT(u_T) and OUT(v_T) for the thalamic excitatory and inhibitory neurons respectively (where OUT has its usual ANN interpretation), and user-scaled³ raw voltage output from the u_N and u_C excitatory neurons in the NRT and cortex respectively. Time delays for signal propagation between neurons were set to zero (although they could easily be incorporated in more realistic future runs). The very simplest situation one can envisage with this system is that in which there is no lateral coupling between NRT neurons. Such a system (Figure 5) is very useful for evaluating the effect of feedback strength between neurons of the three layers. We see that each vertical module acts in essence like an amplifier for its particular input signal. Since there is a large number of free parameters in this system (between 20 and 40, depending on how the simulations are configured), it is useful to determine suitable ranges for their allowed values, in addition to the constraints in (26). We do this by introducing lateral connectivity (given by equations (18), (22)) and experimenting with different values of the amplitudes, thresholds and temperatures of the neurons, and of the range of the spreads for the lateral terms.

²The 'on' state corresponds to OUT(unit) = 1, while the 'off' state is OUT(unit) = 0, with OUT having its usual artificial neural network meaning.
³As a result, the vertical scales in the various plots are not in proportion. This, however, is not a concern, since we are interested in the general behaviour of the system rather than in particular values of the output voltages.
The last of these turns out to be a stability parameter, in that excessively long-ranged influence of the lateral connectivity terms leads to unrealistically large output voltages in the NRT neurons. We identify such behaviour as the non-linear regime of operation. This behaviour is illustrated in Figures 6-8, where we successively increase the spread from 5 to 50. Every positively valued segment of the NRT wave acts to allow an input through to the cortex, while every negatively valued segment acts to restrict it. This is best illustrated in Figure 7. For moderate values of the spread, we find that the spread has a second attribute, as a mechanism for local control. For large values of the spread, the NRT activity begins to grow disproportionately (Figure 8). The spread also plays a part in the outcome of amplitude competition, as in Figure 9, for instance. Of the two plateaus representing strong and weak inputs, the stronger input dominates the weaker one. The spatial wavelengths set up in the NRT region are much smaller than the spread of the plateaus, yet exercise strong control over the allowed cortical response. There is partially global control over the allowed set of inputs propagating up to the cortex, determined principally by wavelike activity in the NRT, itself a function of the spread. It is interesting to note here the relative effects of the difference-of-Gaussians (DOG) and dendro-dendritic terms in the emergence of waves on the NRT sheet. Figure 10 illustrates the result we obtain by eliminating the dendro-dendritic term, while Figure 11 shows the activity upon removal of the DOG term. We see that the DOG term is clearly the less influential in both spatial waveform generation and the development of strong patterns of activity in the cortical layer. This is to be expected, however, since the axon collaterals in the DOG representation are not long-ranged. The global control mechanism predicted in §2 is illustrated in Figures 12-15. The spatially global wave of activity on the NRT exercises wavelength-dependent control on the signals propagating forth to the cortex. The persisting cortical activity does not reflect the inputs very strongly (Figure 13), as it did for the partially global control in Figure 9, being influenced instead by the oscillatory character of the NRT activity. A significant aspect of this system, when operating in a global control phase, is its sensitive dependence on classes of inputs. We have seen that during such a phase the cortex only sees the winner of the competition taking place on the NRT. Equivalently, the NRT (according to the theory of §2) is acting like a non-linear filter that allows one out of all the Fourier components presented to it to propagate through. We expect therefore that the selection of this particular Fourier component would be critically influenced by factors such as the wavelength and the amplitude of the input (we noted this towards the end of §2 as well). It is difficult to establish with certainty, however, which of these two is the more significant, since our simulation results show only trends in the output, and not actual magnitudes thereof, as mentioned earlier.
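The original simulation code is not reproduced in the chapter, so the following Python sketch is purely illustrative of how the modularised system of Figure 4 can be integrated. The leaky first-order form of the unit dynamics, every weight value, the tonic interneuron bias, the lumped lateral kernel standing in for the DOG and dendro-dendritic terms of (18) and (22), and the plateau inputs are all our assumptions; only the overall architecture (a line of 100 thalamic, interneuron, NRT and cortical units, stepped with fourth-order Runge-Kutta) follows the text. With these particular guesses the run illustrates the kind of amplitude gating seen in Figure 9: the strong plateaus reach the cortex while the weak ones are suppressed.

import numpy as np

N_UNITS, TAU, DT = 100, 10.0, 0.1     # a line of 100 neurons, as in the plots
W = dict(TT=0.4, TC=1.0, CT=1.0, NT=0.8, NC=0.8, INI=1.0, IN_N=1.2)

def f(u, h=1.0, theta=0.1, rho=10.0):  # smoothed activation (25), as above
    x = u - theta
    return h * np.heaviside(x, 0.0) * (1.0 - np.exp(-rho * x))

def lateral(uN, spread=5.0):
    # Lumped lateral NRT term: local excitation with longer-range inhibition,
    # standing in (as an assumption) for the DOG and dendro-dendritic terms.
    x = np.arange(-3 * int(spread), 3 * int(spread) + 1)
    g = lambda s: np.exp(-(x / s) ** 2) / (s * np.sqrt(np.pi))
    kernel = g(spread / 4.0) - 1.2 * g(spread)
    return np.convolve(uN, kernel, mode="same")

def deriv(state, I):
    # Leaky first-order dynamics for the four unit types of Figure 4.
    uT, vIN, uN, uC = state
    duT = (-uT + I - W["TT"] * f(vIN) + W["TC"] * f(uC)) / TAU           # relay
    dvIN = (-vIN + 0.3 + W["INI"] * I - W["IN_N"] * f(uN)) / TAU         # interneuron
    duN = (-uN + W["NT"] * f(uT) + W["NC"] * f(uC) + lateral(uN)) / TAU  # NRT
    duC = (-uC + W["CT"] * f(uT)) / TAU                                  # cortex
    return np.array([duT, dvIN, duN, duC])

state = np.zeros((4, N_UNITS))
I = np.where(np.arange(N_UNITS) % 20 < 10, 0.6, 0.2)   # strong/weak plateaus
for _ in range(2000):                                   # classic RK4 stepping
    k1 = deriv(state, I)
    k2 = deriv(state + 0.5 * DT * k1, I)
    k3 = deriv(state + 0.5 * DT * k2, I)
    k4 = deriv(state + DT * k3, I)
    state += (DT / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
print("cortical output:", np.round(f(state[3]), 2))     # strong plateaus pass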
4 Discussion
The theoretical and simulation results presented in the previous section show that the simplified model of the thalamus-NRT-cortex complex does achieve the requirements (a), (b) and (c) outlined in §2, allowing rapid local and global competition to occur between thalamic inputs. There is still much analysis to be done on this model, both theoretically and by simulation. Thus the following questions, among many, need to be answered:

1. the delineation of the domain of parameter space in which the competition between inputs occurs in momentum space rather than on the separate coordinate-dependent amplitudes, due to the non-linearities, or vice versa;
2. the crucial parameters on which the speed of the outcome of the competition depends;

3. the effects of including more realistic properties of neurons, such as stochasticity, spikes, neuronal geometry, etc.;

4. the effect of the more complete structure of cortex and NRT, and of more realistic
modelling of thalamic glomeruli, on the nature of the system;
5. the effects of information transformation in cortex on the details of the operation of the competition.

The answer to 2 is of relevance to ongoing experimental work on the bringing to sensory awareness of sensations of touch by direct cortical electrical stimulation (Libet et al. 1964), as has been pointed out recently by one of us (Taylor 1993a). The answer is expected to be in terms of an exponential increase in time, with a time-constant depending upon the injected current, as the stability analysis of §2.1 indicates. The answer to 3 is relevant to the level of neuro-biological realism of the modelling, but is not expected to make much difference to the principles of the system. Questions 4 and 5 may allow considerable extension of the model, especially if one takes account of cortical activity at working memory centres being injected as thalamic input; these working memory units may, along with primary cortical input areas, be considered as the source of sensory awareness (Taylor 1993b). Finally, question 1 requires further work
that is intertwined with the investigations of the other questions. We hope to be able to consider answers to these questions in due course.
Acknowledgements

One of us (F.N.A.) would like to thank the Science and Engineering Research Council (SERC) of the U.K. for grant GR/F92251, which enabled part of this work to be carried out.
Figure 1. The structure of the first model of the thalamus-NRT-cortex complex, in which the cortex is dropped, and competition occurs only between the inputs I_j to the thalamic relay cells T_j; the strengths O_j of the outputs indicate the winners and losers of the competition carried out between the corresponding NRT neurons N_j by means of their inhibitory lateral connections.
Figure 2. The wiring diagram of the main model of the thalamus-NRT-cortex complex. Input I_j is sent both to the thalamic relay cell T_j and to the inhibitory interneuron IN_j, which latter cell also feeds to T_j. Output from T_j goes up to the corresponding cortical cell C_j, which returns its output to T_j. Both the axons T_jC_j and C_jT_j send axon collaterals to the corresponding NRT cell N_j. There is axonal output from N_j to IN_j, as well as collaterals to neighbouring NRT cells; there are also dendro-dendritic synapses between the NRT cells.
Figure 3a. Dependence of the Fourier transform F(k) of the lateral NRT connection weights on the wave variable k. Initially positive, the value of F(k) becomes negative, so producing possible instability according to equation (23).
Figure 3b. Dispersion relation for u_N in wavenumber space. The interval (k₁, k₂) contains the wave numbers for which instability occurs, and spatially inhomogeneous activity is expected to arise.
Figure 4. Schematic of the "modularised" version of Figure 2, as used in obtaining the results of the computer simulations.
Figure 5. Simulation run with no lateral coupling between NRT neurons. The input essentially feeds through to the cortex, as might be expected.
[Plot legend: NRT (volts); OUT(Thalam.); OUT(Inhib.); Input (volts).]
Figure 6. Simulation run with lateral connectivity (both dendro-dendritic and DOG) introduced. The value chosen for the spread is small here. Wave activity is beginning to appear on the NRT.
Figure 7. As for Figure 6, but with moderate values for the spread. The NRT is clearly influencing what is allowed to propagate through to the cortex.
Figure 8. As for Figure 6, but with large values for the spread. The activity on the NRT is beginning to take on a non-linear mode of operation. There is still, however, control over what is allowed to go through to the cortex.
Figure 9. Simulation run showing amplitude competition. Of the strong and weak inputs being fed in, only the strong survives the journey to the cortex. There is partially global control exercised over this by the activity on the NRT.
Figure 10. Simulation run showing the development of activity on NRT with only a DOG form for the lateral connectivity. (Compare with Figure 11).
Figure 11. Simulation run showing the development of activity on NRT with only a dendro-dendritic form for the lateral connectivity. (Compare with Figure 10).
Figure 12. Simulation run showing full global control with a spatially constant input. The activity on the cortex reflects the activity on the NRT, and is not dependent on the form of the input.
Figure 13. Simulation run showing full global control with semi-constant spatial input. Again, the cortex activity is influenced by the NRT alone.
Figure 14. Simulation run showing full global control with short-wavelength periodic input.
Figure 15. Simulation run showing full global control with medium-wavelength periodic input.
5 References
Ahlsén, G. and Lindström, S. (1982). Mutual Inhibition between Perigeniculate Neurons, Brain Res., 236, 482-486.
Arcuri, P. and Murray, J. D. (1986). Pattern sensitivity to boundary and initial conditions in reaction-diffusion models, J. Math. Biol., 24, 141-165.
Avanzini, G., de Curtis, M., Panzica, F. and Spreafico, R. (1989). Intrinsic Properties of Nucleus Reticularis Thalami Neurons of the Rat Studied in vitro, J. Physiol., 416, 111-122.
Barbaresi, P., Spreafico, R., Frassoni, C. and Rustioni, A. (1986). GABA-ergic neurons are present in the dorsal column nuclei but not in the ventroposterior complex of rats, Brain Res., 382, 305-326.
Berggren, K. F. and Huberman, B. A. (1978). Peierls state far from equilibrium, Phys. Rev. B, 18, 3369-3375.
Cohen, D. S. and Murray, J. D. (1981). A Generalised Diffusion Model for growth and dispersal in a population, J. Math. Biol., 12, 237-249.
Crabtree, J. W. (1989). Evidence for topographic maps within the visual and somatosensory sectors of the thalamic reticular nucleus: A comparison of cat and rabbit, Soc. Neurosci. Abs., 15, 1393.
Crabtree, J. W. (1991). Maps within the cat's somatosensory thalamus, Soc. Neurosci. Abs., 17, 623.
Crabtree, J. W. (1992). The somatotopic organization within the rabbit's thalamic reticular nucleus, Eur. J. Neurosci., 4, 1343-1351.
Crabtree, J. W. and Killackey, H. P. (1989). The topographic organization and axis of projection within the visual sector of the rabbit's thalamic reticular nucleus, Eur. J. Neurosci., 1, 94-109.
Deschênes, M., Madariaga-Domich, A. and Steriade, M. (1989). Dendro-dendritic synapses in the cat reticularis thalami nucleus: A structural basis for thalamic spindle synchronisation, Brain Res., 334, 165-168.
Douglas, R. J. and Martin, K. A. (1991). A Functional Microcircuit for Cat Visual Cortex, J. Physiol., 440, 735-769.
Dowling, J. (1987). The Retina, Harvard University Press.
Durbin, R. and Mitchison, G. (1990). A dimension reduction framework for understanding cortical maps, Nature, 343, 644-647.
Ermentrout, G. B. and Cowan, J. D. (1978). Some Aspects of the 'Eigenbehaviour' of Neural Nets, Studies in Mathematics, Mathematical Assoc. of America, 15, 67-117.
Harries, M. H. and Perrett, D. I. (1991). Visual Processing of Faces in Temporal Cortex: Physiological Evidence for a Modular Organisation and Possible Anatomical Correlates, J. Cog. Neurosci., 3, 9-23.
Harris, R. M. and Hendrickson, A. E. (1987). Local circuit neurons in the rat ventrobasal thalamus - A GABA immunocytochemical study, Neurosci., 21, 229-236.
Hornik, K., Stinchcombe, M. and White, H. (1989). Multilayer feedforward networks are universal approximators, Neural Networks, 2, 359-368.
Hubel, D. H. and Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex, J. Physiol., 160, 106-154.
Iguchi, I. and Langenberg, D. N. (1980). Diffusive Quasiparticle Instability toward Multiple-Gap States in a Tunnel-Injected Nonequilibrium Superconductor, Phys. Rev. Lett., 44, 486-489.
Jones, E. G. (1975). Some Aspects of the Organisation of the Thalamic Reticular Complex, J. Comp. Neurol., 162, 285-308.
Koch, C. and Ullman, S. (1985). Shifts in Selective Visual Attention: Towards the Underlying Circuitry, Human Neurobiol., 4, 219-227.
La Berge, D. (1990). Thalamic and Cortical Mechanisms of Attention suggested by recent Positron Emission Tomographic Experiments, J. Cog. Neurosci., 2, 358-373.
La Berge, D., Carter, M. and Brown, V. (1992). A Network Simulation of Thalamic Circuit Operations in Selective Attention, Neural Computation, 4, 318-331.
Libet, B., Alberts, W. W., Wright, E. W., Delattre, L. D., Levin, G. and Feinstein, B. (1964). Production of threshold levels of conscious sensation by electrical stimulation of human somato-sensory cortex, J. Neurophys., 27, 546-578.
Liljenström, H. (1991). Modelling the Dynamics of Olfactory Cortex using Simplified Network Units and Realistic Architectures, Int. J. Neurosci., 1-2 (to appear).
Linsker, R. (1988). Self-organisation in a Perceptual Network, Computer, 21, 105-117.
Llinás, R. and Ribary, U. (1991). Ch. 7 in Induced Rhythms in the Brain, eds. E. Başar and T. Bullock, Birkhäuser, Boston.
Martin, K. A. (1988). From Single Cells to Single Circuits in the Cerebral Cortex, Quart. J. Exp. Physiol., 73, 637-702.
McCormick, D. A. and Prince, D. A. (1986). Acetylcholine induces burst firing in thalamic reticular neurons by activating a potassium conductance, Nature, 319, 402-405.
Montero, V. M., Guillery, R. W. and Woolsey, C. N. (1977). Retinotopic organization within the thalamic reticular nucleus demonstrated by a double label autoradiographic technique, Brain Res., 138, 407-421.
Murray, J. D. (1989). Mathematical Biology, Springer-Verlag.
Ohara, P. T. and Lieberman, A. R. (1985). The Thalamic Reticular Nucleus of the Adult Rat: experimental anatomical studies, J. Neurocyt., 14, 365-411.
Ohara, P. T. (1988). Synaptic Organisation of the Thalamic Reticular Nucleus, J. Elect. Mic. Tech., 10, 283-292.
Paré, D., Steriade, M., Deschênes, M. and Oakson, G. (1987). Physiological Characteristics of Anterior Thalamic Nuclei, a group devoid of inputs from Reticular Thalamic Nucleus, J. Neurophysiol., 57, 1669-1685.
Posner, M. I. and Petersen, S. E. (1990). The Attention System of the Human Brain, Ann. Rev. Neurosci., 13, 25-42.
Scheibel, A. B. (1980), in Reticular Formation Revisited, eds. J. A. Hobson and B. A. Brazier, Raven Press, New York.
Spreafico, R., De Curtis, M., Frassoni, C. and Avanzini, G. (1988). Electrophysiological Characteristics of Morphologically identified Reticular Thalamic Neurons from Rat Slices, Neurosci., 27, 629-638.
Spreafico, R., Battaglia, G. and Frassoni, C. (1991). The Reticular Thalamic Nucleus (RTN) of the Rat: Cytoarchitectural, Golgi, Immunocytochemical and Horseradish Peroxidase Study, J. Comp. Neurol., 304, 478-490.
Steriade, M., Domich, L. and Oakson, G. (1986). Reticularis Thalami Neurons Revisited: Activity Changes During Shifts in States of Vigilance, J. Neurosci., 6, 68-81.
Steriade, M., Curró Dossi, R. and Oakson, G. (1991). Fast Oscillations (20-40 Hz) in thalamocortical systems and their potentiation by mesopontine cholinergic nuclei in the cat, Proc. Nat. Acad. Sci., 88, 4396-4400.
Taylor, J. G. (1990). A Silicon Model of Vertebrate Retinal Processing, Neural Networks, 3, 171-178.
Taylor, J. G. (1993a). A Competition for Sensory Awareness?, King's College London Preprint.
Taylor, J. G. (1993b). Goals, Drives and Consciousness, King's College London Preprint.
Treisman, A. and Gelade, G. (1980). A feature-integration theory of attention, Cognitive Psychology, 12, 97-136.
Wörgötter, F., Niebur, E. and Koch, C. (1991). Isotropic Connections Generate Functional Asymmetrical Behaviour in Visual Cortical Cells, J. Neurophysiol., (in press).
Yingling, C. D. and Skinner, J. E. (1977). Gating of Thalamic Inputs to Cerebral Cortex by Nucleus Reticularis Thalami, Prog. Clin. Neurophysiol., 1, 70-96.