Communicated by Wulfram Gerstner
ARTICLE
Lower Bounds for the Computational Power of Networks of Spiking Neurons

Wolfgang Maass
Institute for Theoretical Computer Science, Technische Universitaet Graz, Klosterwiesgasse 32/2, A-8010 Graz, Austria

We investigate the computational power of a formal model for networks of spiking neurons. It is shown that simple operations on phase differences between spike-trains provide a very powerful computational tool that can in principle be used to carry out highly complex computations on a small network of spiking neurons. We construct networks of spiking neurons that simulate arbitrary threshold circuits, Turing machines, and a certain type of random access machines with real valued inputs. We also show that relatively weak basic assumptions about the response and threshold functions of the spiking neurons are sufficient to employ them for such computations.

1 Introduction and Basic Definitions
There is substantial evidence that timing phenomena such as temporal differences between spikes and frequencies of oscillating subsystems are integral parts of various information processing mechanisms in biological neural systems (for a survey and references see, e.g., Kandel et al. 1991; Abeles 1991; Churchland and Sejnowski 1992; Aertsen 1993). Furthermore, simulations of a variety of specific mathematical models for networks of spiking neurons have shown that temporal coding offers interesting possibilities for solving classical benchmark problems such as associative memory, binding, and pattern segmentation (for an overview see Gerstner et al. 1993). Very recently one has also started to build artificial neural nets that model networks of spiking neurons (see, e.g., Murray and Tarassenko 1994; Watts 1994). Some aspects of these models have also been studied analytically (see, e.g., Gerstner and van Hemmen 1994; Gerstner 1995), but almost nothing is known about their computational complexity (see Judd and Aihara 1993, for some first results in this direction). In this article we investigate a simple formal model SNN for networks of spiking neurons that allows us to model the most important timing phenomena of neural nets, and we prove lower bounds for its computational power. Quite a number of different mathematical models for networks of spiking neurons have previously been introduced within the frameworks

Neural Computation 8, 1-40 (1996)
© 1995 Massachusetts Institute of Technology
of theoretical physics and theoretical neurobiology (see, e.g., Lapicque 1907; Buhmann and Schulten 1986; Crair and Bialek 1990; Gerstner 1991; Gerstner et al. 1993; for a survey and the relationship between these and related models see, e.g., Tuckwell 1988; and Gerstner 1995). The computational model SNN that we consider in this article is most closely related to the spike response model of Gerstner (1991) and Gerstner et al. (1993). Similarly as in Buhmann and Schulten (1986), we consider in this article only the deterministic case (which corresponds to the limit case β → ∞ for the inverse temperature β in the spike response model, and respectively, the noise-free case). We refer to Maass (1995d) for results about the computational power of the noisy version of this model. However, in contrast to these preceding models we do not fix particular (necessarily somewhat arbitrarily chosen) response and threshold functions in our model SNN. Instead, we want to be able to use the SNN model as a framework for investigating the computational power of various different response and threshold functions. In addition, we would like to make sure that various different response and threshold functions observed in specific biological neural systems are in fact special cases of the response and threshold functions in the formal model SNN.

1.1 Definition of a Spiking Neuron Network (SNN). An SNN N consists of

- a finite directed graph (V, E) (we refer to the elements of V as "neurons" and to the elements of E as "synapses")
- a subset V_in ⊆ V of input neurons
- a subset V_out ⊆ V of output neurons
- for each neuron v ∈ V − V_in a threshold function Θ_v : R+ → R ∪ {∞} (where R+ := {x ∈ R : x ≥ 0})
- for each synapse (u, v) ∈ E a response function ε_{u,v} : R+ → R and a weight function w_{u,v} : R+ → R.
We assume that the firing of the input neurons u ∈ V_in is determined from outside of N, i.e., the sets F_u ⊆ R+ of firing times ("spike trains") for the neurons u ∈ V_in are given as the input of N. Furthermore we assume that a set T ⊆ R+ of potential firing times has been fixed (we will consider only the cases T = R+ and T = {i · p : i ∈ N} for some p > 0). For a neuron v ∈ V − V_in one defines its set F_v of firing times recursively. The first element of F_v is inf{t ∈ T : P_v(t) ≥ Θ_v(0)}, and for any s ∈ F_v the next larger element of F_v is inf{t ∈ T : t > s and P_v(t) ≥ Θ_v(t − s)}, where the potential function P_v : R+ → R is defined by

  P_v(t) := 0 + Σ_{u : (u,v) ∈ E} Σ_{s ∈ F_u : s < t} w_{u,v}(s) · ε_{u,v}(t − s)

[the trivial summand 0 makes sure that P_v(t) is well-defined even if F_u = ∅ for all u with (u,v) ∈ E].
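As an illustration, the recursive firing rule above can be simulated on a discrete time grid. The following sketch is not part of the paper; the function names and the particular piecewise-linear EPSP and threshold function are hypothetical choices that merely satisfy the definitions (the EPSP has linearly rising and falling segments, and the threshold is "infinite" only during the refractory period, with Θ(0) finite as required by the basic assumptions in Section 2).

```python
import math

def simulate_neuron(spike_trains, weights, responses, theta, t_end, dt=0.001):
    """Firing times F_v of one non-input neuron v, per the recursive rule:
    v fires at the first grid point t with P_v(t) >= Theta_v(time since last firing)."""
    firings = []
    for i in range(int(t_end / dt) + 1):
        t = i * dt
        # potential P_v(t): trivial summand 0 plus all postsynaptic potentials
        P = 0.0
        for u, F_u in spike_trains.items():
            P += sum(weights[u] * responses[u](t - s) for s in F_u if s < t)
        x = None if not firings else t - firings[-1]
        if P >= (theta(0.0) if x is None else theta(x)):
            firings.append(t)
    return firings

# Hypothetical piecewise-linear EPSP and threshold (cf. the basic assumptions):
def eps(x):
    if 0.0 <= x < 1.0:
        return x            # linearly rising segment
    if 1.0 <= x < 2.0:
        return 2.0 - x      # linearly falling segment
    return 0.0

def theta(x, tau_ref=0.5):
    # "infinite" during the refractory period, Theta(0) finite
    return math.inf if 0.0 < x < tau_ref else 0.8

F_v = simulate_neuron({"u": [0.0]}, {"u": 1.0}, {"u": eps}, theta, t_end=3.0)
```

With a single input spike at time 0 and weight 1, the potential first reaches the threshold 0.8 at t ≈ 0.8, and the declining EPSP never triggers a second firing, so F_v contains exactly one firing time.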
The firing times ("spike trains") F_v of the output neurons v ∈ V_out that result in this way are interpreted as the output of N. Regarding the set T of potential firing times we consider in this article primarily the case T = R+ (SNN with continuous time), and only in Corollary 2.5 the case T = {i · p : i ∈ N} for some p > 0 (SNN with discrete time).

Our subsequent assumptions about the threshold functions Θ_v will imply that for each SNN N there exists a bound τ_N ∈ R with τ_N > 0 such that Θ_v(x) = ∞ for all x ∈ (0, τ_N) and all v ∈ V − V_in (τ_N may be interpreted as the minimum of all "refractory periods" τ_ref of neurons in N). Furthermore we assume that all "input spike trains" F_u with u ∈ V_in satisfy |F_u ∩ [0, t]| < ∞ for all t ∈ R+. On the basis of these assumptions one can also in the continuous case easily show that the firing times are well-defined for all v ∈ V − V_in (and occur in distances of at least τ_N).

In models for biological neural systems one assumes that if x time-units have passed since its last firing, the current threshold Θ_v(x) of a neuron v is "infinite" for x < τ_ref (where τ_ref = refractory period of neuron v), and then approaches quite rapidly from above some constant value. A neuron v "fires" (i.e., it sends an "action potential" or "spike" along its axon) when its current membrane potential P_v(t) at the axon hillock exceeds its current threshold Θ_v. P_v(t) is the sum of various postsynaptic potentials w_{u,v}(s) · ε_{u,v}(t − s). Each of these terms describes an excitatory (EPSP) or inhibitory (IPSP) postsynaptic potential at the axon hillock of neuron v at time t, as a result of a spike that had been generated by the "presynaptic" neuron u at time s, and which has been transmitted through a synapse between both neurons. Recordings of an EPSP typically show a function that has a constant value c (c = resting membrane potential; e.g., c = −70 mV) for some initial time interval (reflecting the axonal and synaptic transmission time), then rises to a peak value, and finally drops back to the same constant value c. An IPSP tends to have the negative shape of an EPSP (see Fig. 3). For the sake of mathematical simplicity we assume in the SNN model that the constant initial and final value of all response functions ε_{u,v} is equal to 0 (in other words, ε_{u,v} models the difference between a postsynaptic potential and the resting membrane potential c).

Different presynaptic neurons u generate postsynaptic potentials of different sizes at the axon hillock of a neuron v, depending on the size, location, and current state of the synapse (or synapses) between u and v. This effect is modeled by the weight factors w_{u,v}(s). The precise shapes of threshold, response, and weight functions may vary among different biological neural systems, and even within the same system. Fortunately one can prove significant upper bounds for the computational complexity of SNNs N without any assumptions about the specific shapes of these functions of N. Instead, for such upper bounds one only has to assume that they are of a reasonably simple mathematical structure (see Maass 1994b, 1995c).
To prove lower bounds for the computational complexity of an SNN N one is forced to make more specific assumptions about these functions. However, we show in this article that significant (and in some cases optimal, see Section 3) lower bounds can be shown under some rather weak basic assumptions about these functions, which will be further relaxed in Section 4. These basic assumptions (see Section 2) mainly require that EPSPs have an arbitrarily small time segment where they increase linearly, and some arbitrarily small time segment where they decrease linearly. Since the computational power of SNNs may potentially increase through the use of time-dependent weights, lower bounds for their computational power are more significant if they do not involve the use of time-dependent weights. Hence we will assume throughout this article that all weight functions w_{u,v}(s) have a constant value w_{u,v}, which does not depend on the time s. Apart from the abovementioned condition on the existence of linear segments in EPSPs, the basic assumptions that underlie the lower bound results of this article involve no other significant conditions on the shape of response and threshold functions. Hence one may argue that these basic assumptions are biologically plausible. In addition, we will show in Section 4 that the same lower bounds can be shown if also phenomena such as "adaptation" of neurons, or a "reset" of the potential after a firing are taken into account. Thus the more critical points with regard to the biological interpretation of these lower bound results appear to be the relatively simple firing mechanism of the SNN model, which, for example, ignores for the sake of simplicity nonlinear interactions among postsynaptic potentials such as integration of potentials within the dendritic tree of a neuron, and various possible sources of "imprecision" in the determination of the firing times.
The latter issue can partially be taken into account by considering the variation of the SNN model with discrete firing times as in Corollary 2.5 (although the implicit global synchronization of this version is not completely satisfactory). In this variation of the SNN model with discrete firing times i · p for i ∈ N one can view a firing of a neuron at time i · p as representing a somewhat imprecise firing time in a small interval around time i · p.

The computational complexity of another neural network model where timing plays an important role has previously been investigated by Judd and Aihara (1993). Their model PPN is also motivated by biological spiking neurons, but it employs a quite different firing mechanism. There are no response functions in this model, and instead of integrating all incoming EPSPs and IPSPs in order to determine whether it should "fire," a neuron in a PPN randomly selects a single one of the incoming "stimulations" of maximal size, and determines on the basis of that stimulation whether it should fire. Consequently, computations in this model PPN proceed quite differently from computations in models of spiking neurons such as the spike response model of Gerstner and van Hemmen (1994), or the model SNN considered here. Judd and Aihara (1993) construct PPNs that can simulate Turing machines that use at most a constant number s of cells on their tapes, where s is bounded by the number of neurons in the simulating PPN. However a Turing machine with a constant bound s on its number of tape cells is just a special case of a finite automaton, and hence this result does not show that a PPN of finite size can have the computational power of an arbitrary Turing machine. In contrast to the quoted result about PPNs, it is shown in Theorem 2.1 of this article that with arbitrary response and threshold functions that satisfy the basic assumptions of Section 2, one can construct for any given Turing machine M an SNN N_M of finite size that can simulate any computation of M in real-time (even if the number of tape cells that M uses is much larger than the number of neurons in N_M). In addition, at the end of Section 4 we will describe a way in which a simulation of arbitrary Turing machines can also be accomplished by finite SNNs whose response and threshold functions are piecewise constant. If we understand the model of Judd and Aihara (1993) correctly (their description is somewhat unclear), then our method for proving this (see also Maass and Ruf 1995) can also be used to show that with the help of a module that decides whether two neurons have fired simultaneously, one can simulate (although not in real-time) any Turing machine M (where M may use an unbounded number of tape cells) by some PPN P_M of finite size, thereby improving the lower bound for the computational power of PPNs due to Judd and Aihara (1993), from finite automata to Turing machines.

The focus in the investigation of computations in biological neural systems differs in two essential aspects from that of classical computational complexity theory.
First, the timing constraints for computations in biological neural systems are often tighter than for computations in digital computers, and many complex computations have to be carried out in "real-time" with relatively slow "switching elements." Secondly, one is not only interested in separate computations on unrelated inputs, but also in the ability of the system to learn to react appropriately to a sequence of related tasks. Hence the custom to evaluate the computational power in terms of "complexity classes" such as P or P/poly appears to be less suitable for the investigation of models for biological neural systems, and we therefore resort to an analysis in terms of refined concepts such as "real-time computations" and "real-time simulations." In this way we get not only information about the relationship between the "large scale" complexity classes (e.g., polynomial time) for these models for biological neural systems, but also about their behavior in terms of common notions of "low-level complexity" such as sublinear or real-time. Furthermore, with the help of our refined analysis of real-time simulations one also gets information about the "adaptive" or "learning" abilities of the considered models. Assume for example that ((x(i), y(i)))_{i∈N} is the protocol of some real-time "learning process" of a system M, where the y(i) are the "responses" of M to a sequence x(i) of "stimuli." If one
has shown that another model M' can simulate M in real-time, then this entails that the same "learning process" can also be carried out in real-time by M'.

1.2 Definition of Real-Time Computation and Real-Time Simulation. Fix some arbitrary (finite or infinite) input alphabet A_in and output alphabet A_out (for example they can be chosen to be {0,1}, {0,1}* or R). We say that a machine M processes a sequence ((x(i), y(i)))_{i∈N} of pairs (x(i), y(i)) ∈ A_in × A_out in real-time r, if M outputs y(i) for every i ∈ N within r computation steps after having received input x(i) [for i > 0 we assume that x(i) is presented at the next step after M has given output y(i − 1)]. We say that a machine M' simulates a machine M in real-time (with delay factor ∆) if for every r ∈ N and every sequence that is processed by M in real-time r, M' can process the same sequence in real-time ∆ · r. In the case of SNNs M we count each spike in M as a computation step.
We first would like to point out that these notions contain the usual notions of a computation and simulation as special cases. Let {0,1}* be the set of all binary sequences of finite length. If M computes a Boolean function F : {0,1}* → {0,1} in time t(n) (in the usual sense of computational complexity theory), then one can identify each input (z_1, ..., z_n) ∈ {0,1}* with an infinite sequence (x(i))_{i∈N} where x(i) = z_i for i ≤ n and x(i) = B for i > n (assume that M gets one input bit per step, B := "blank"). Furthermore one can set y(i) = B for those steps i where M's computation is not yet finished, and y(i) = F((z_1, ..., z_n)) for all later i [in particular for all i ≥ t(n)]. Obviously M processes this sequence ((x(i), y(i)))_{i∈N} in real-time 1. Hence, if another machine M' can simulate M in real-time with delay factor ∆, then M' can compute the same function F : {0,1}* → {0,1} in time ∆ · t(n). This implies that a real-time simulation is a special case of a linear-time simulation. In particular, every computational problem that can be solved by M within a certain time complexity can be solved by M' within the same time complexity (up to a constant factor). In addition, the remarks before the definition imply that when we show that M' can simulate M in real-time, we may conclude that any adaptive behavior (or learning algorithm) of M can also be implemented on M'. Finally we would like to point out that for the investigation of specific computational and learning problems on specific models for biological neural nets one would like to also eventually get estimates for the size of the constant r in real-time processing and the size of the delay factor ∆ in a real-time simulation.
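The identification of an ordinary time-t(n) computation with a real-time-1 processed sequence can be made concrete in a few lines. This sketch is only illustrative (the name realtime_protocol and the blank marker B are ours, not the paper's; indices are 0-based rather than 1-based):

```python
B = "B"   # blank symbol

def realtime_protocol(z, F, t):
    """First t+1 steps of the sequence ((x(i), y(i))) for input z = z_1...z_n:
    x(i) = z_i while input bits remain and blank afterwards; y(i) = blank
    while the computation of F is still "running", and F(z) from step t onward."""
    n = len(z)
    return [
        (z[i] if i < n else B, F(z) if i >= t else B)
        for i in range(t + 1)
    ]

parity = lambda z: z.count("1") % 2
steps = realtime_protocol("101", parity, t=5)   # a machine running in time t(n) = 5
```

The last pair is (B, 0): the input is exhausted, and the value F(z) has been delivered, so the whole computation is indeed processed in real-time 1.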
Such refined analysis (which will not be carried out in this paper) appears to be also of interest, since it is likely to throw some light on the specific advantages and disadvantages of different models for biological neural systems (e.g., networks of spiking
neurons versus analog neural nets), which are shown in Maass (1994b, 1995c) to be equivalent with regard to the preceding notion of a real-time simulation. In contrast to the usual notion of a simulation, a real-time simulation of another computational model M by an SNN implies that the simulation of each computation step of M requires only a fixed number of spikes in the SNN. In particular, the required number of spikes does not become larger for the simulation of later computation steps of M.

1.3 Input and Output Conventions. For simulations between SNNs and Turing machines one may either assume that the SNN gets an input (or produces an output) from {0,1}* in the form of a spike train (i.e., one bit per unit of time), or that the input (output) of the SNN is encoded into the phase difference of just two spikes. The former convention is suitable for comparisons with Turing machines that receive a single input bit and produce a single output bit at each computation step. For comparisons with Turing machines that start with the whole input written on a specified tape, and have their whole output written on another tape when the machine halts, it is more adequate to assume that the SNN receives at the beginning of a computation the whole tape content of the input tape encoded into the time difference φ between two spikes (using the same encoding as we will use in Section 2 to represent the content of a stack), and that the SNN also provides the final content of the output tape in the same form. Real-valued input or output for an SNN is always encoded into the phase difference of two spikes.
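To make the phase-difference convention concrete, here is one way a bit string could be packed into a single time difference φ ∈ (0, T) and recovered again. This is only a plausible sketch, not necessarily the encoding used in Section 2; restricting to the base-4 digits 1 and 3 is a common device in constructions of this kind, since it keeps distinct encoded values bounded away from each other, so that a decoder can tolerate small perturbations of spike timing.

```python
def encode(bits, T=1.0):
    """Map a bit string to a phase difference in (0, T): bit b becomes the
    base-4 digit 2*b + 1, i.e. phi = T * sum_i (2*b_i + 1) * 4**(-i)."""
    phi = 0.0
    for i, b in enumerate(bits, start=1):
        phi += (2 * b + 1) * 4.0 ** (-i)
    return phi * T

def decode(phi, T=1.0):
    """Recover the bit string from the phase difference."""
    bits, x = [], phi / T
    while x > 1e-9:
        x *= 4.0
        digit = int(x + 1e-9)          # always 1 or 3 for valid encodings
        bits.append((digit - 1) // 2)
        x -= digit
    return bits

assert decode(encode([1, 0, 1, 1])) == [1, 0, 1, 1]
```

Since the digits 4^(-i) are exactly representable in binary floating point, short strings round-trip exactly; a stack content, as used for the Turing machine simulation, can be maintained in the same way by pushing and popping leading digits.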
1.4 Notation. We employ in this article the following common notation: We write N for the set of natural numbers (including 0), Q for the set of rational numbers, and R for the set of real numbers. R+ is defined as {x ∈ R : x ≥ 0}. For any x ∈ R+ we write ⌈x⌉ for the least n ∈ N with n ≥ x. {0,1}* denotes the set of all binary strings of finite length. For any set S we write ∃x ∈ S(...) instead of ∃x(x ∈ S and ...), and ∀x ∈ S(...) instead of ∀x(x ∈ S ⇒ ...). For two functions f, g : N → N we write f = O(g) if there is some constant c such that f(n) ≤ c · g(n) for all except possibly finitely many n ∈ N.

1.5 Structure of This Article. In Section 2 we specify our basic assumptions about the response and threshold functions of an SNN, and we construct SNNs that can simulate in real-time arbitrary threshold circuits and Turing machines. In Section 3 we relate the computational power of SNNs for real-valued inputs to a specific type of random access machine. In Section 4 we discuss variations of the preceding constructions
for related models of spiking neurons, and in Section 5 we outline some conclusions from the results in this article.

2 Simulation of Threshold Circuits and Turing Machines by Networks of Spiking Neurons
To carry out computations on an SNN, some assumptions have to be made about the structure of the response and threshold functions of its neurons. It is obvious that for example neurons with response function ε_{u,v} such that ε_{u,v}(s) = 0 for all s ≥ 0 cannot carry out any computation. We will specify in the following a set of basic assumptions, which suffice for the constructions in this article. Some variations of these conditions will be discussed in Section 4.

We assume that there exist some arbitrary given constants Δ_min, Δ_max ∈ R with 0 ≤ Δ_min < Δ_max so that we can choose for each "synapse" (u, v) ∈ E an individual "delay" Δ_{u,v} ∈ [Δ_min, Δ_max] with ε_{u,v}(x) = 0 for all x ∈ [0, Δ_{u,v}]. This parameter Δ_{u,v} corresponds in biology to the time span between the firing of the presynaptic neuron u and the moment when its effect reaches the trigger zone (axon hillock) of the postsynaptic neuron v. This time span is known to vary for individual neurons in biological neural systems, depending on the type of synapse and the geometric constellation. The constants Δ_min and Δ_max can be interpreted as biological constraints on the possible lengths of such time spans. No requirements about Δ_min and Δ_max are needed for our construction, except that Δ_min < Δ_max.

We assume that except for their individual delays the response functions ε_{u,v} (as well as the threshold functions Θ_v) are stereotyped, i.e., that their shape is determined by some general functions ε^E, ε^I, and Θ, which do not depend on u or v. More precisely, we assume that we can decide for any pair (u, v) ∈ E whether ε_{u,v} should represent an excitatory "EPSP response function," or an inhibitory "IPSP response function." In the EPSP case we assume that

  ε_{u,v}(Δ_{u,v} + x) = ε^E(x)   for all x ∈ R+

and in the IPSP case we assume that

  ε_{u,v}(Δ_{u,v} + x) = ε^I(x)   for all x ∈ R+.

In either case we assume that

  ε_{u,v}(x) = 0   for all x ∈ [0, Δ_{u,v}].

Furthermore, we assume for all neurons v ∈ V − V_in that

  Θ_v(x) = Θ(x)   for all x ∈ R+.
Figure 1: Illustration of our notation for the basic assumptions on Θ, ε^E, ε^I (the functions shown are quite arbitrary and complicated, but nevertheless they satisfy our basic assumptions).

We assume that the three functions ε^E : R+ → R+, ε^I : R+ → {x ∈ R : x ≤ 0}, and Θ : R+ → R+ ∪ {∞} are some arbitrary functions with the following properties: There exist some arbitrary strictly positive real numbers τ_ref, τ_end, σ₁, σ₂, σ₃, τ₁, τ₂, τ₃, L, s_up, s_down with 0 < τ_ref < τ_end, σ₁ < σ₂ < σ₃, τ₁ < τ₂ < τ₃ (see Fig. 1 for an illustration), which satisfy the following five conditions:

1. Θ(x) ≥ Θ(0) > 0 for all x ∈ R+, Θ(x) = ∞ for all x ∈ (0, τ_ref), and Θ(x) = Θ(0) < ∞ for all x ∈ [τ_end, ∞).

2. ε^E(0) = ε^E(x) = 0 for all x ∈ [σ₃, ∞), and there exists some ε_max so that ∃x ∈ R+ [ε^E(x) = ε_max] and ∀y ∈ R+ [ε^E(y) ≤ ε_max].

3. ε^E(σ₁ + z) = ε^E(σ₁) + s_up · z for all z ∈ [−L, L].

4. ε^E(σ₂ + z) = ε^E(σ₂) − s_down · z for all z ∈ [−L, L].

5. ε^I(0) = ε^I(x) = 0 for all x ∈ [τ₃, ∞), ε^I(x) < 0 for all x ∈ (0, τ₃), and ε^I is nonincreasing in [0, τ₁] and nondecreasing in [τ₂, τ₃].

We assume in addition that Θ(0), ε^E(σ₁), ε^E(σ₂), s_up, s_down ∈ Q. It should be pointed out that no conditions about the smoothness, the continuity, or the number of extrema of the functions Θ, ε^E, ε^I are made in the preceding basic assumptions. However, if one demands in addition that ε^E is piecewise linear and continuous, then conditions (3) and (4) become redundant. The assumption that Θ(0), ε^E(σ₁), ε^E(σ₂), s_up,
Figure 2: Examples of mathematically very simple functions ε^E, ε^I, and Θ that satisfy the basic assumptions.
s_down are rationals will be needed only to ensure that certain weights can be chosen to be rationals (see Section 2.9). Examples of mathematically particularly simple (piecewise linear) functions ε^E, ε^I, and Θ that satisfy all of the above conditions are exhibited in Figure 2. The subsequent construction shows that neurons with the very simple response and threshold functions from Figure 2 can, in principle, be used to build an artificial neural network with some finite number m of spiking neurons that can simulate in real time any other digital computer (even computers that employ many more than m memory cells or computational units).

We have formulated the preceding basic assumptions on the response and threshold functions in a rather general fashion to make sure that they can in principle be satisfied by a wide range of EPSPs, IPSPs, and threshold functions that have been observed in a number of biological neural systems (see, e.g., Fig. 3). The currently available findings about biological neural systems (see, e.g., Kandel et al. 1991, and the discussions in Valiant 1994) indicate that in general a single EPSP alone cannot cause a neuron to fire. In fact, it is commonly reported that 30 to 100 EPSPs have to arrive within a short time span at a neuron to trigger its firing. These reports indicate that the weights w_{u,v} in our model should be assumed to be relatively small, since they cannot amplify a single EPSP to yield an arbitrarily high potential P_v. Hence for the sake of biological plausibility one should
Figure 3: Inhibitory and excitatory postsynaptic potentials at a biological neuron. [After Schmidt (1978), Fundamentals of Neurophysiology. Springer-Verlag, Berlin.]
assume that the values of all weights w_{u,v} in an SNN belong to some bounded interval [0, w_max]. For simplicity we assume in the following that w_max = 1. This convention just amounts to a certain scaling of the values of the response functions in relation to the threshold functions. In any version of this model where a single neuron is not able to cause the firing of another neuron, one necessarily has to assume that each input spike is simultaneously received by several neurons (since otherwise it cannot have any effect).

In spite of this convention we will occasionally assign much larger values to certain weights w_{u,v}. We will then (silently) assume that u does in fact represent an assembly of ⌈w_{u,v}⌉ neurons that all fire concurrently (⌈x⌉ is defined as the least natural number ≥ x). Furthermore, we assume in those situations that all edges from neurons in this assembly to neuron v have the same delay, and the same weight w_{u,v}/⌈w_{u,v}⌉ ∈ [0, 1]. The main difference between this type of construction and a construction with arbitrarily large weights is that in our setup the (virtual) use of large weights blows up the number of neurons that are needed.

Theorem 2.1. If the response and threshold functions of the neurons satisfy the previously described basic assumptions, then one can build from such neurons for any given d ∈ N an SNN N_TM(d) of finite size that can simulate with a suitable assignment of rational values from [0, 1] to its weights any Turing machine with at most d tapes in real-time. Furthermore N_TM(2) can compute any function F : {0,1}* → {0,1}* with a suitable assignment of real values from [0, 1] to its weights.

The proof of Theorem 2.1 is rather complex. Therefore we have divided it into Sections 2.1 to 2.10, which are devoted to different aspects of the modules of the construction. Several of these modules are also useful for other constructions.
The global construction of N_TM(d) with the properties claimed in Theorem 2.1 is described in Section 2.10.
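The assembly convention above is easy to state computationally: a virtual weight w > 1 is realized by ⌈w⌉ concurrently firing copies of the presynaptic neuron, each connected with an admissible weight in [0, 1]. A minimal sketch (the function name split_weight is ours, not the paper's):

```python
import math

def split_weight(w):
    """Realize a virtual weight w > 0 by an assembly of ceil(w) synchronously
    firing neurons, each contributing the admissible weight w / ceil(w)."""
    n = math.ceil(w)
    per_neuron = w / n
    assert 0.0 <= per_neuron <= 1.0
    return n, per_neuron

n, wi = split_weight(3.7)   # 4 assembly neurons, each of weight 0.925
```

Since all edges of the assembly carry the same delay, the summed postsynaptic potential at v is identical to what a single edge of weight w would produce; only the neuron count grows.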
We will discuss in Section 4 some methods for alternative constructions of N_TM(d) that are based on different assumptions about response and threshold functions.
2.1 Conditions on the Neurons. We assume that we can decide for any pair (u, v) of neurons whether there should be a "synapse" between both neurons (i.e., (u, v) ∈ E). Self-referential edges of the form (u, u) will not be needed. In this proof the weights w_{u,v} on edges (u, v) are always assumed to be time invariant, and they are only assigned values from [0, 1]. We assume that the response and threshold functions satisfy the previously described basic assumptions.
2.2 Delay- and Inhibition Modules. We will construct in this section two very simple modules that will be used frequently (and often silently) in the subsequent constructions. From the general point of view the existence of these two modules shows that our very weak assumptions about Δ_min and Δ_max (we have only required that 0 ≤ Δ_min < Δ_max) as well as our very weak assumptions about the shape of ε^I in condition (5) are in fact sufficient to create in an SNN arbitrarily long delays, and arbitrarily fast appearing or arbitrarily fast disappearing inhibitions of arbitrarily long duration.

A "delay module" is simply a chain u₁, ..., u_{k+1} of neurons so that (u_i, u_{i+1}) ∈ E, ε_{u_i,u_{i+1}} is an EPSP response function, and w_{u_i,u_{i+1}} := Θ(0)/ε_max for i = 1, ..., k. Since each delay Δ_{u_i,u_{i+1}} can be chosen arbitrarily from [Δ_min, Δ_max], the total "delay" between the firing of u₁ and the arrival of an EPSP at u_{k+1} can be chosen to assume any value in a certain interval of length k · (Δ_max − Δ_min). It will cause no problem that the total transmission time from u₁ to u_{k+1} grows along with k, since in the subsequent constructions time will essentially be considered only modulo a certain constant T_PM.

We next construct for any given real numbers δ, λ > 0 and κ < 0 "inhibition modules" I_{δ,κ,λ} and I^{δ,κ,λ}. I_{δ,κ,λ} can be used to transmit to any desired neuron v a volley of IPSPs that sum up to a potential which changes from its initial value 0 to some value ≤ κ within a time interval of length δ, and then maintains a value ≤ κ for at least the following time interval of length λ. I_{δ,κ,λ} consists of a neuron u that transmits EPSPs simultaneously to several "relay neurons" u₁, ..., u_l, which are triggered by this EPSP to send an IPSP to some given neuron v. If l and the delays between the neurons are chosen appropriately [as a function of δ, κ, λ, ε^I(δ), and the parameter τ₁], this module will have the desired effect on neuron v.
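A delay module realizes any total delay in an interval of length k · (Δ_max − Δ_min). Choosing the k individual edge delays for a given target total can be sketched greedily (the function choose_delays is our own illustration, not part of the construction):

```python
def choose_delays(target, k, d_min, d_max):
    """Pick k per-edge delays in [d_min, d_max] summing exactly to `target`,
    assuming k*d_min <= target <= k*d_max (else no assignment exists)."""
    assert k * d_min <= target <= k * d_max
    delays, remaining = [], target
    for edges_left in range(k, 0, -1):
        # smallest feasible delay that still lets the remaining edges reach the target
        d = max(d_min, remaining - (edges_left - 1) * d_max)
        delays.append(d)
        remaining -= d
    return delays

delays = choose_delays(2.5, k=3, d_min=0.5, d_max=1.0)   # -> [0.5, 1.0, 1.0]
```

Each step keeps the residual target reachable by the remaining edges, which is exactly the interval argument made above for the chain u₁, ..., u_{k+1}.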
Dually, one can also build for any δ, λ > 0 and κ < 0 an inhibition module I^{δ,κ,λ} that sends IPSPs to any specified neuron v whose sum stays ≤ κ for a time interval of length ≥ λ, and then returns to 0 within a time
Figure 4: Graph structure of an oscillator consisting of one neuron (a) and two neurons (b). interval of length 5 6. Here we exploit the fact that according to condition (5) the function ~ ' ( xis) nondecreasing and strictly negative for x E [7*.n ) . 2.3 Oscillators. Consider subgraphs of an SNN of the structure shown in Figure 4. Both types of subgraphs can be used to build an oscillator. The first one is somewhat simpler, but we will not use it in our construction since it would require a self-referential edge (ZI.11) E E. In the second type of oscillator (Fig. 4b) we assume that w ~ , , ~U ,I ,~ ~2. ~ , O(O)/E,,,~~, and that both E,,,,, and E,,,,, are EPSP response functions. Thus after an initial EPSP through edge a both neurons will fire periodically. More precisely, z) will fire at times t o i. 7r for i = 1.2. . . ., until it is halted by an IPSP through edge 11. We refer to 7r as the oscillation period of this oscillator. We will distinguish one such oscillator as the "pacemnker" for the constructed SNN, which we denote by PM. We write T ~ Mfor its oscillation period. We assume that the oscillation of PM is started at "time 0 by the first input spike to the SNN, and that it continues without interruption throughout the computation of the SNN. PM emits EPSPs through edge e, which will then be broadcast as a timing standard throughout the SNN. We will say in the following that some other neuron ZJ in the SNN fires "at unit fimr" or "synchronously" if the considered firing of z, occurs at a time point t of the form i . T P M for some i E N. In N T M ( d ) we will use oscillators in two ways as storage devices. First we use them as "registers" for storing a bit (via their two states dormant/oscillating), for example in the control of h $ ~ ( d ) .Second we
use oscillators O with oscillation period π_PM to store arbitrary numbers φ ∈ [0, π_PM] via their phase difference to PM (i.e., neuron v of oscillator O fires at time points of the form i · π_PM + φ with i ∈ N). In this way oscillators can, for example, store the time difference between two input spikes to the SNN, and the program and tape content of a simulated Turing machine, respectively.

2.4 Synchronization Modules. A characteristic feature of a computation on a feedforward Boolean circuit of the usual type is that the timing of its computation steps is independent of the values of the bits that occur in the computation. For example, the timing of the output signal of an OR gate does not depend on the values of its input bits. This feature is very useful, since with its help one can arrange that all input bits for Boolean gates on higher levels of the circuit arrive simultaneously, and therefore it allows us to build complex circuits from simple modules. If one wants to carry out computations on an SNN with single spikes, one would like to interpret the firing of a neuron at a certain time as the bit "1" and nonfiring as "0." Thus one might, for example, want to simulate an OR gate by a neuron v that fires whenever it receives at least one EPSP. However, when that neuron receives two EPSPs simultaneously (corresponding to two input bits being 1) it would in general fire slightly earlier than in a situation where it receives just a single EPSP. This effect is a consequence of having EPSP response functions ε_{u,v}(x) that are not piecewise constant. In addition, if v has already fired just before, then the fact that Θ(x) is in general not piecewise constant also contributes to this effect.
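This timing effect can be made concrete with a toy calculation. The following sketch is our own illustration, not the paper's model: the piecewise linear EPSP kernel, the threshold value, and the time grid are all made-up parameters.

```python
# Toy illustration (made-up kernel and threshold): a neuron that fires as soon
# as its summed EPSPs reach the threshold fires earlier when two EPSPs arrive
# simultaneously than when only one arrives, because the EPSP response
# function is not piecewise constant around the crossing point.

def epsp(t):
    """Hypothetical EPSP: 0 before arrival, linear rise for 2 time units, then flat."""
    return 0.0 if t < 0 else min(t, 2.0)

def first_crossing(n_epsps, theta, dt=0.001, steps=5000):
    """First grid time at which n_epsps simultaneous EPSPs reach threshold theta."""
    for i in range(steps):
        t = i * dt
        if n_epsps * epsp(t) >= theta:
            return t
    return None

theta = 1.0
t_one = first_crossing(1, theta)  # single EPSP crosses at about t = 1.0
t_two = first_crossing(2, theta)  # two EPSPs cross at about t = 0.5
assert t_two < t_one
```

The crossing time thus depends on how many input bits are 1, which is exactly the effect that the synchronization module below has to neutralize.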
Unfortunately this effect makes it impossible to simulate on an SNN in a straightforward manner a multilayer Boolean circuit (where the bit "1" is signaled by a spike, and "0" by the absence of a spike at the corresponding time): the input "bits" for neurons that simulate Boolean gates on higher layers of the circuit will in general not arrive at the same time. Furthermore it is not possible to correct this problem by employing delay modules of the type that we had constructed in Section 2.2, since the required length of the delays depends on the current values of the input bits. We will solve this problem with the help of the synchronization module constructed here. In fact, we will show in the next section that with the help of this module an SNN suddenly gains the full computational power of a Boolean feedforward threshold circuit, and therefore is able to carry out within a small number of "cycles" substantially more complex computations than a regular Boolean circuit. At first sight it appears to be impossible to build a synchronization module without postulating the existence of an EPSP response function that has segments of length ≥ π_PM where it is constant, or increases or decreases linearly. However the following "double-negation trick" allows us to build a synchronization module without any additional assumptions.
Figure 5: Structure of a synchronization module.
Consider the graph of an SNN on the left hand side of Figure 5. We arrange that as long as no EPSP is transmitted through its "input edge" e, the neuron u fires regularly with period π_PM as a result of EPSPs from the pacemaker PM. These firings induce the inhibition module I2 to send IPSPs to neuron v that "cancel out" the EPSPs that arrive at v directly from PM. Therefore in the absence of an input through edge e this neuron v does not fire. Assume now that at some arbitrary time point an (unsynchronized) EPSP arrives through edge e. This EPSP triggers the inhibition module I1, which then sends out IPSPs that prevent neuron u from firing for a time interval of some fixed length > π_PM. Therefore at least one of the EPSPs that arrive at neuron v from PM is not cancelled out by IPSPs from the inhibition module I2, and neuron v emits at least one synchronized spike (i.e., v fires at least once, and with a proper choice of delays only at unit times of the form i · π_PM with i ∈ N). A closer look shows that the mechanism of this module is in fact a bit more delicate. It can, in principle, happen that at neuron u the beginning or the end of a negative potential from I1 coincides with an EPSP from PM in such a way that it leads to a small shift in some firing time of u (besides canceling other firings of u). This could shift the time interval of the activity of I2 by a certain amount ρ. One has to make sure that this shift cannot lead to a competition at neuron v between the negative
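The intended behavior of this module can be sketched with a small discrete-time simulation. This is our own simplification: time is collapsed onto the grid of unit times i · π_PM, and the modules I1 and I2 are reduced to boolean bookkeeping, so the delicate continuous-time shifts discussed next do not appear here.

```python
# Discrete-time sketch of the synchronization mechanism: u fires every unit
# step via PM; I2 (driven by u) cancels PM's EPSP at v; an input spike on
# edge e triggers I1, silencing u, so some PM EPSPs reach v uncancelled and
# v emits spikes only at unit times. inhibit_len is a made-up parameter.

def synchronize(input_arrival, n_steps=10, inhibit_len=2):
    """Return the unit times at which the output neuron v fires.
    input_arrival: (possibly fractional) arrival time of the EPSP on edge e."""
    v_firings = []
    for i in range(n_steps):
        u_silenced = input_arrival <= i < input_arrival + inhibit_len
        u_fires = not u_silenced          # u fires with period pi_PM via PM
        pm_epsp_cancelled = u_fires       # I2 cancels PM's EPSP at v iff u fired
        if not pm_epsp_cancelled:
            v_firings.append(i)           # v fires, and only at unit times
    return v_firings

assert synchronize(input_arrival=3.4) == [4, 5]   # unsynchronized input, synchronized output
assert synchronize(input_arrival=100) == []       # no input -> v stays silent
```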
potential from I2 and the EPSP from PM that results in an unsynchronized firing of v. One can solve this technical problem by designing I1 and I2 so that their output is the superposition of the output of a module I^{δ,λ,κ} and of a module I^{δ',λ',κ'}. In this way their strongly negative output potential (of value ≤ κ) both builds up and disappears at neuron v within time intervals of length δ. This parameter δ provides then an upper bound for the length ρ of the possible time shifts of these negative potentials. By choosing δ sufficiently small (and by arranging the lengths and delays of these inhibitions appropriately), for any arrival time of an input spike through edge e and for any EPSP from PM the resulting inhibition from I2 either cancels the corresponding firing of v, or it lets v fire without shifting its firing time (canceling some other firings of v instead). For that purpose one chooses the weight w ∈ [0,1] on the edge from PM to v so that the resulting function w · ε_E crosses Θ(0) while it is in the middle of its linearly increasing segment [see condition (3) of our basic assumptions]. The timing of this synchronization module can be specified with more precision as soon as one selects concrete response and threshold functions that satisfy our basic assumptions. However, the preceding analysis shows that it will do its job in any case. One should keep in mind that our basic assumptions are relatively weak. For example, they do not even prescribe the relationships between the sizes of the parameters σ3, τ3, and T_end.

One chooses weights w_{u1,v}, w_{u2,v} > 0 so that w_{u2,v} · s_up = w_{u1,v} · s_down and w_{u1,v} · ε_E(σ1) + w_{u2,v} · ε_E(σ2) = Θ(0) (see Fig. 6). According to our general convention at the beginning of this section we actually have to replace in the case w_{u1,v} > 1 the neuron u1 by an assembly of ⌈w_{u1,v}⌉ neurons with weights from [0, 1] on their edges to v. However, for the sake of simplicity, we will ignore this trivial complication in the following.
We arrange that for an arbitrarily given parameter D > 0 inhibition modules I^{δ,λ,κ} and I^{δ',λ',κ'} (with suitable values of their parameters) are triggered by spikes from PM to send IPSPs to v so that v is not able to fire within the time intervals [t* − L/2 − D, t* − L/2) and (t* + L/2, t* + L/2 + D], even if the firing time t2 of neuron u2 is arbitrarily shifted, but so that
these inhibition modules have no effect on the potential P_v at neuron v during the time interval [t* − L/4, t* + L/4]. Consider now what happens if the phase difference φ of the oscillator O is not fixed at φ = 0, but assumes any value in [0, L/2]. Then by choice of the parameters w_{u1,v}, w_{u2,v}, and t*, and by the conditions (3) and (4) of our basic assumptions, the sum of the EPSPs from u1 and u2 at neuron v has in any case a constant value within the time interval [t* − L/2, t* + L/2]. Furthermore, this constant value is ≥ Θ(0) if and only if φ ≥ α. Hence the neuron v will fire within the time interval [t* − L/2, t* + L/2] if and only if φ ≥ α. Furthermore, by the choice of the inhibition modules the neuron v fires within the time interval [t* − L/2, t* + L/2] if and only if it fires within the time interval [t* − D, t* + D].

We now assume that some arbitrary real number β > 0 is given, and we construct a module that carries out the operation MULTIPLY(β). This module also consists of neurons u1, u2, v with (u_i, v) ∈ E for i = 1, 2 so that u1 is triggered to fire at some time t1 by a spike from the pacemaker PM, and u2 is triggered to fire at some time t2 by a spike from an oscillator O that has oscillation period π_PM and some phase difference φ ∈ [0, min(L/2, L/2β)] to PM. We want to achieve that for any value φ ∈ [0, min(L/2, L/2β)] of this phase difference the "output neuron" v of this module fires at a time t + β · φ, where t does not depend on φ. The construction of the module for the operation MULTIPLY(β) is slightly different for the two cases β > 1 and β ∈ (0, 1]. We consider first the case β > 1. Assume for the moment that the phase difference φ between O and PM has value 0, and choose delays Δ_{ui,v} so that there exists for t̂_i := t_i + Δ_{ui,v} some t* ≥ max(t̂_1, t̂_2) with t* − t̂_1 = σ2 and t* − t̂_2 = σ1. Furthermore, we choose weights w_{ui,v} > 0 so that
w_{u1,v} · ε_E(t* − t̂_1) + w_{u2,v} · ε_E(t* − t̂_2) = Θ(0)   (2.1)
and

w_{u2,v} · s_up / (w_{u2,v} · s_up − w_{u1,v} · s_down) = β   (2.2)

Since β > 1, equation 2.2 implies that 0 < w_{u1,v} · s_down < w_{u2,v} · s_up.
We arrange that at a suitable time an inhibition module I^{δ,λ,κ} sends IPSPs to v, which make it impossible for v to fire during the time interval [t* − T_end, t* − L/2) (no matter at what time u2 fires), but which do not influence the potential P_v(t) at times t ≥ t*. Furthermore, we arrange that no other EPSPs or IPSPs contribute to P_v(t) for t ∈ [t* − T_end, t*]. In this way v cannot fire during the time interval [t* − T_end, t*) (even if the firing of u2 is delayed by some φ ∈ [0, L/2β]). Therefore in the case φ = 0 our assumption (equation 2.1) implies that neuron v will fire at time t*. We now consider what will change if the firing of u2 at time t2 is replaced by a slightly later firing at time t2 + φ, whereas the firing time of u1 and of the inhibition module remain unchanged. We will show that for any φ ∈ (0, L/2β] this delay will cause a somewhat delayed firing of v (see Fig. 7). Consider the time point t_φ, which is defined by the equation

w_{u1,v} · ε_E(t_φ − t̂_1) + w_{u2,v} · ε_E[t_φ − (t̂_2 + φ)] = Θ(0)   (2.4)
By equation 2.1 and conditions (3) and (4) of our basic assumptions we have for t_φ − t* ∈ [−L, L]
w_{u1,v} · ε_E(t_φ − t̂_1) = w_{u1,v} · ε_E(t* − t̂_1) − w_{u1,v} · s_down · (t_φ − t*)   (2.5)

and for φ, t_φ with t_φ − t* − φ ∈ [−L, L] we have that

w_{u2,v} · ε_E[t_φ − (t̂_2 + φ)] = w_{u2,v} · ε_E(t* − t̂_2) + w_{u2,v} · s_up · (t_φ − t* − φ)   (2.6)
Figure 8: Multiplication of a phase φ with β ∈ (0, 1).

These two equations in conjunction with 2.1, 2.2, and 2.4 imply that

t_φ − t* = β · φ.

It is obvious that for φ ∈ [0, L/2β] one has that β · φ, β · φ − φ ∈ [−L, L]. Furthermore, it is clear from our construction that v cannot fire during the time interval [t*, t* + β · φ). Therefore t_φ := t* + β · φ is in fact the firing time of v if u2 fires at time t2 + φ. Hence the described module carries out the operation MULTIPLY(β) in case that β > 1.

To carry out the operation MULTIPLY(β) for some arbitrarily given β ∈ (0, 1) we just change the delay Δ_{u1,v} in the previously described module so that t* − t̂_1 = σ1 (instead of t* − t̂_1 = σ2) (see Fig. 8). We choose weights w_{ui,v} > 0 so that equation 2.1 holds and

w_{u2,v} / (w_{u1,v} + w_{u2,v}) = β   (2.7)

As before, we consider the time point t_φ that is defined by equation 2.4. Then equation 2.6 holds, but instead of 2.5 we have

w_{u1,v} · ε_E(t_φ − t̂_1) = w_{u1,v} · ε_E(t* − t̂_1) + w_{u1,v} · s_up · (t_φ − t*).
The latter two equations in conjunction with 2.1, 2.4, and 2.7 imply that

t_φ − t* = β · φ.

Hence the described module carries out the operation MULTIPLY(β) for an arbitrarily given β ∈ (0, 1).
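For the case β > 1 the firing-time equation can be checked with a short numerical sketch. This is our own simplification, not part of the construction: near t* both response segments are taken to be exactly linear, s_up, s_down, and the normalization of w2 are made-up constants, and the weight condition is our reading of equation 2.2.

```python
# Numerical sketch: near t* the potential at v equals Theta(0) plus a falling
# linear term from u1 (slope -w1*s_down) and a rising linear term from u2
# (slope w2*s_up, delayed by phi). Solving P(t) = Theta(0) for the first
# crossing reproduces t_phi - t* = beta * phi.

def multiply_firing_delay(beta, phi, s_up=1.0, s_down=0.6):
    """Return t_phi - t* for beta > 1, with weights chosen so that
    w2*s_up / (w2*s_up - w1*s_down) = beta (our reading of the weight condition)."""
    w2 = 1.0                                   # arbitrary normalization
    w1 = (1.0 - 1.0 / beta) * w2 * s_up / s_down
    # 0 = -w1*s_down*(t - t*) + w2*s_up*(t - t* - phi), solved for t - t*:
    return w2 * s_up * phi / (w2 * s_up - w1 * s_down)

beta, phi = 3.0, 0.01
assert abs(multiply_firing_delay(beta, phi) - beta * phi) < 1e-9
```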
2.7 Simulation of a Stack with Unlimited Capacity by an SNN of Fixed Size. The simulation of a stack (also called pushdown store, or first in-last out list) is the most delicate part of the construction of N_TM(d), since it requires the construction of a module in which the lengths ℓ of the bit-strings (b_1, ..., b_ℓ) that are stored and manipulated are in general much larger than the number of neurons in this module (in fact, ℓ can be arbitrarily large). Of course N_TM(d) needs to have a component with this property, since otherwise the SNN N_TM(d) (which will consist of a fixed finite number of neurons) cannot simulate the computations of Turing machines that involve tape inscriptions of arbitrary finite length. The content (b_1, ..., b_ℓ) ∈ {0,1}* of a stack S (where b_1 is the symbol on top of the stack) will be stored in the form of the phase difference

φ_S = Σ_{i=1}^{ℓ} b_i · 2^{−i−c}
of a special oscillator O_S. More precisely, we assume that O_S fires with the same oscillation period π_PM as the pacemaker PM, but with a delay φ_S. The parameter c ∈ R+ is some arbitrary constant that is sufficiently large so that 2^{−c} ≤ min(L/2, π_PM). We will now describe the mechanisms for simulating the stack operations POP and PUSH on a bit string (b_1, ..., b_ℓ) that is stored in φ_S. The stack operation POP determines the value of the top-bit b_1, and then replaces the stack content (b_1, ..., b_ℓ) by (b_2, ..., b_ℓ). In an SNN one can determine the value of b_1 from φ_S by testing whether φ_S ≥ 2^{−1−c}. For that purpose one employs a module that carries out the operation COMPARE(≥ 2^{−1−c}) (see the preceding section). To change the phase difference φ_S from Σ_{i=1}^{ℓ} b_i · 2^{−i−c} to Σ_{i=1}^{ℓ−1} b_{i+1} · 2^{−i−c} one first subtracts b_1 · 2^{−1−c} from φ_S. For the case b_1 = 1 this can be carried out by directing an EPSP from O_S through a suitable delay module, by halting simultaneously the oscillation of O_S with the help of an inhibition module, and by restarting the oscillation of O_S with an EPSP from the considered delay module. Note that we can employ at this point a simple delay module as described in Section 2.2, because in the case b_1 = 1 the length of the desired shift of the phase difference does not depend on its current value. It remains to carry out a SHIFT-LEFT operation, which replaces the phase difference Σ_{i=2}^{ℓ} b_i · 2^{−i−c} by

2 · Σ_{i=2}^{ℓ} b_i · 2^{−i−c} = Σ_{i=1}^{ℓ−1} b_{i+1} · 2^{−i−c}.

This operation cannot be implemented by a delay module, since it has to shift the phase difference by an amount that depends on the values of ℓ and b_2, ..., b_ℓ. Instead, we have to employ a module that carries out the operation MULTIPLY(2) (see Section 2.6).
To simulate the stack operation PUSH one has to replace for a given b_0 ∈ {0,1} the current phase difference φ_S = Σ_{i=1}^{ℓ} b_i · 2^{−i−c} of the oscillator O_S by Σ_{i=1}^{ℓ+1} b_{i−1} · 2^{−i−c}. Our simulation of PUSH consists of two separate parts: a SHIFT-RIGHT operation that changes the current phase difference to Σ_{i=2}^{ℓ+1} b_{i−1} · 2^{−i−c}, and a subsequent ADD(γ) operation that adds γ := b_0 · 2^{−1−c} to this phase difference. Obviously ADD(γ) can be implemented in an analogous way as the subtraction of b_1 · 2^{−1−c} from φ_S in the previously described simulation of POP. Thus it just remains to simulate a SHIFT-RIGHT operation, i.e., to replace the phase difference φ_S = Σ_{i=1}^{ℓ} b_i · 2^{−i−c} of size ≤ L/2 by φ_S/2 = Σ_{i=2}^{ℓ+1} b_{i−1} · 2^{−i−c}. For that purpose we employ a module for the operation MULTIPLY(1/2), as constructed in the preceding section.
2.8 Simulation of an Arbitrary Fixed Turing Machine by an SNN. We will show in this section that the previously constructed modules suffice to construct for any given Turing machine M an SNN N_M (whose structure may depend on M) that can simulate M in real-time. According to the notion of a real-time computation (see Section 1) we assume that the given Turing machine M processes a sequence ((x(j), y(j)))_{j∈N} with x(j), y(j) ∈ {0,1}* in real-time. We assume that the inputs x(j) are presented to M on a read-only input tape, and the outputs y(j) are written by M on some write-only output tape. We will assume that the simulating SNN N_M receives each input x(j) ∈ {0,1}* in the form of a time difference φ between two input spikes, with φ = Σ_{i=1}^{ℓ} b_i · 2^{−i−c} for x(j) = (b_1, ..., b_ℓ). We will arrange that N_M delivers its outputs y(j) in the same form (as a time difference between two output spikes). It is easy to see that any Turing machine M, with any finite number d of two-way infinite read/write tapes, can be simulated in real-time by a similar machine which has 2d stacks, but no tapes (see, e.g., Hopcroft and Ullman 1979). We will call the latter type of machine also a Turing machine. In this simulation one uses two stacks for the simulation of each tape: one stack for simulating the part of the tape that lies to the left of the current position of the tape-head, and another stack for simulating the part of the tape to the right of the tape-head. In principle it would suffice to consider a Turing machine with 1 tape (or 2 stacks), since this type of Turing machine can simulate any other Turing machine (although not in real-time). However, it is known that various concrete problems (especially several pattern-matching problems) can be solved faster on a Turing machine that has more than one tape (see, e.g., Hopcroft and Ullman 1979; Maass 1985; and Maass et al. 1987).
Therefore, and because it does not cause any extra work, we simulate an arbitrary Turing machine M with any number k of stacks by an SNN N_M. At any computation step the Turing machine M may POP or PUSH a symbol on each of its k stacks. We assume for simplicity that the stack-alphabet of M is binary (i.e., M can push 0 or 1 on each stack, and pop
a binary symbol, or receive the signal "bottom-of-stack" if the stack is empty). Furthermore, we assume that the input for the computation of M is given as the initial content of the first one of the k stacks, and that the output of M consists of the final content of the last one of the k stacks (at the moment when the machine halts). If Q is the (finite) set of states of M, then after assigning a number in binary notation to each state in Q the transition function of M can be encoded by a function F_M : {0,1}^{⌈log|Q|⌉+k} → {0,1}^{⌈log|Q|⌉+k}. We assume here that the state of M indicates on which of the stacks a POP or PUSH has to be carried out. Thus to simulate the finite control of M by an SNN, it suffices to employ a module that can compute an arbitrary given function from {0,1}^{⌈log|Q|⌉+k} into itself. We assume here that the ⌈log|Q|⌉ + k input and output bits of this function are stored in a corresponding number of oscillators with two states (dormant/oscillating). According to Lupanov (1973), one can compute any function F : {0,1}^{⌈log|Q|⌉+k} → {0,1}^{⌈log|Q|⌉+k} on a feedforward threshold circuit with O(|Q|^{1/2} · 2^{k/2}) gates. In addition, Horne and Hush (1994) have shown that any such function F can be computed by a threshold circuit of depth 4 with O[|Q|^{1/2} · 2^{k/2} · (log|Q| + k)] gates, using only weights and thresholds from {−1, 0, 1}. Hence our previously described simulation of an arbitrary threshold circuit on an SNN in Section 2.5 allows us to simulate in N_M the finite control of M with a module of O[|Q|^{1/2} · 2^{k/2}] neurons (provided the SNN may use arbitrarily large weights). Furthermore, the quoted result by Horne and Hush in conjunction with our construction in Section 2.5 implies that with O[|Q|^{1/2} · 2^{k/2} · (log|Q| + k)] neurons one can implement in N_M the finite control of M in such a way that only very simple weights from [0, 1] are needed in N_M, and that the simulation of each computation step of M requires only O(1) "machine-cycles" of N_M. More precisely, each computation step of M is simulated by N_M in a time interval in which the pacemaker PM fires ≤ K times, where K is some absolute constant that is independent of |Q|, k, the length of the current input of M, and the number of the previously simulated computation steps of M. Apart from the finite control component, the SNN N_M consists of a module of O(1) neurons for each of the k stacks, and O(1) neurons that implement the pacemaker PM. In addition N_M uses O(log|Q| + k) neurons for other oscillators that serve as temporary registers for bits. Thus N_M consists altogether of at most O[|Q|^{1/2} · 2^{k/2} · (log|Q| + k)] neurons, and the simulation of any computation step of M involves at most O[|Q|^{1/2} · 2^{k/2} · (log|Q| + k)] firings of neurons in N_M. After N_M has simulated every computation step of M on the current input x(j) ∈ {0,1}*, it has generated on an oscillator O_S, which corresponds to the stack S on which M writes its output y(j) = (b_1, ..., b_ℓ), a phase difference φ_S = Σ_{i=1}^{ℓ} b_i · 2^{−i−c} with regard to the pacemaker PM. N_M outputs two spikes, where one is generated by PM and the other one by O_S, before receiving its next input. Since for fixed M the parameters |Q| and k can be viewed as constants,
N_M just uses O(1) spikes for the simulation of each computation step of M. Hence N_M simulates M in real-time.

Figure 9: Mechanism of the weight-to-phase transformation module.
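The tape-to-stacks reduction used above (Hopcroft and Ullman 1979) is standard; a minimal sketch of it looks as follows. The class, the method names, and the choice of 0 as the blank symbol are our own.

```python
# One two-way infinite tape simulated by two stacks: `left` holds the symbols
# to the left of the head, `right` holds the head symbol and everything to
# its right. Moving the head pops from one stack and pushes onto the other,
# so every tape step costs O(1) stack operations (real-time simulation).

class TwoStackTape:
    BLANK = 0

    def __init__(self):
        self.left, self.right = [], []   # stack tops are the list ends

    def read(self):
        return self.right[-1] if self.right else self.BLANK

    def write(self, symbol):
        if self.right:
            self.right.pop()
        self.right.append(symbol)

    def move_right(self):
        self.left.append(self.right.pop() if self.right else self.BLANK)

    def move_left(self):
        self.right.append(self.left.pop() if self.left else self.BLANK)

tape = TwoStackTape()
tape.write(1)
tape.move_right()
tape.write(1)
tape.move_left()
assert tape.read() == 1
```

Combined with the phase-coded stacks of Section 2.7, this is exactly why a fixed-size SNN can hold tape inscriptions of arbitrary finite length.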
2.9 Weight-to-Phase Transformation. At this point the only missing link for the construction of the desired SNN N_TM(d) is a module that allows us to generate from suitable weights of an SNN the encoding of arbitrarily long (even infinitely long) bit strings, which may, for example, represent the program of a Turing machine, or an infinitely long "lookup table." The weight-to-phase transformation module constructed here will be able to generate within a fixed number of "machine cycles" any given phase difference φ = Σ_{i=1}^{ℓ} b_i · 2^{−i−c} of an oscillator (for arbitrary ℓ ∈ N ∪ {∞} and b_i ∈ {0,1}) from suitable weights between 0 and 1. Furthermore, these weights can be chosen to be rational if ℓ ∈ N. This module will exploit effects of the firing mechanism of a neuron in an SNN that are closely related to those that we had used in Section 2.6 to multiply the phase of an oscillator with a constant factor. To allow a unique decoding of infinitely long bit sequences from phase differences φ we adopt the convention that b_i = 0 for infinitely many i ∈ N in case that ℓ = ∞. We consider the same configuration with neurons u1, u2, v, and an inhibition module as for MULTIPLY(β) in Section 2.6. However, instead of shifting the firing time of u2, we are now interested in the consequences of multiplying the weight on the edge from u1 to v with some factor w ∈ (0, 1) (see Fig. 9). We choose values for the delays so that for t̂_i := t_i + Δ_{ui,v} there exists some t* ≥ max(t̂_1, t̂_2) with t* − t̂_1 = σ2
and t* − t̂_2 = σ1. Furthermore, we choose positive weights w_{ui,v} so that w_{u2,v} · s_up = 2 · w_{u1,v} · s_down.

The module for COMPARE(> α) from Section 2.6 in combination with the preceding module for SUBTRACT
Figure 11: Mechanism of the module for SUBTRACT.
allows us to build a module for the test COMPARE, i.e., a module that decides for any two given phase differences φ_1, φ_2 ∈ [0, L/4] of two oscillators O_1 and O_2 with oscillation period π_PM whether φ_1 > φ_2. For that purpose one first transforms φ_1 with the help of a delay module to φ'_1 := φ_1 + L/4. It is then clear that φ'_1 ≥ φ_2, and the module for SUBTRACT can be employed to compute φ'_1 − φ_2 = φ_1 − φ_2 + L/4. With the help of a subsequent module for COMPARE(> L/4) we can then decide whether φ_1 − φ_2 + L/4 > L/4, i.e., whether φ_1 > φ_2. Of course one can also build directly a module for COMPARE by using a variation of the construction for COMPARE(> 0) in Section 2.6.
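The arithmetic of this COMPARE trick can be written out directly. The sketch below is our own: L is an assumed constant, chosen here as an exact binary fraction so that the floating-point arithmetic stays exact, and the function stands in for the delay, SUBTRACT, and COMPARE(> L/4) modules.

```python
# Arithmetic sketch of COMPARE for phases phi1, phi2 in [0, L/4]: shift phi1
# by L/4 (a delay module can do this), subtract, and test against L/4. The
# shift guarantees that the SUBTRACT module never sees a negative difference.

L = 0.5  # made-up value of the constant L (an exact binary fraction)

def compare(phi1, phi2):
    """Decide phi1 > phi2 without ever forming a negative phase difference."""
    phi1_shifted = phi1 + L / 4      # delay module: now phi1' >= phi2
    diff = phi1_shifted - phi2       # SUBTRACT module (result is >= 0)
    return diff > L / 4              # COMPARE(> L/4)

assert compare(0.125, 0.0)
assert not compare(0.0, 0.125)
assert not compare(0.125, 0.125)
```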
4 Variations of the Constructions for Related Models of Spiking Neurons

We have assumed for the constructions in the preceding two sections that the response and threshold functions are stereotyped, i.e., that apart from their individual delays Δ_{u,v} the functions ε_{u,v} and Θ_v all have the same shape. This assumption is convenient, but not really necessary for the preceding constructions. The same constructions can also be carried out if these functions are different for different edges (u, v) ∈ E and different v ∈ V. More precisely, it suffices to assume that the response functions ε_{u,v} are defined with the help of individual delays Δ_{u,v} and individual functions ε^E_{u,v} and ε^I_{u,v}, so that ε_{u,v}(x) = 0 for x ∈ [0, Δ_{u,v}] and ε_{u,v}(Δ_{u,v} + x) is given by ε^E_{u,v}(x) or ε^I_{u,v}(x), respectively.

But the module for COMPARE(> α) is of independent interest, since it shows in the context of Section 3 that discontinuous real-valued functions can also be computed on an SNN.

The implicit assumptions about the firing mechanism of neurons in the version of the SNN model from Section 1 ignore the well-known "reset" and "adaptation" phenomena of neurons. However, one can easily adjust the definition of the SNN model so that it also takes these features into account. To model a reset of a neuron at its moment of firing, one can adjust the definition of the set F_v of firing times of a neuron v by deleting (or modifying) in the definition of P_v(t) those EPSPs and IPSPs from presynaptic neurons u that had already arrived at v before the most recent firing of v. Adaptation of a neuron v refers to the observation that the firing-rate of a biological neuron may decline after a while even if the incoming
excitation [i.e., P_v(t)] remains at a constant high level (see for example Kandel et al. 1991). This effect can be reflected in the SNN model by replacing the term Θ_v(t − s) in the definition of the set F_v of firing times by a sum over Θ_v(t − s) for several recent firing times s ∈ F_v [and by assuming that Θ_v(x) returns only relatively slowly to its initial value].

We would like to point out that all of our constructions in Sections 2 and 3 are compatible with our above-mentioned changes in the SNN model for modeling the reset and adaptation of neurons. The reason for this is that we can arrange in the constructions of Sections 2 and 3 that all "relevant" firings of a neuron v are spaced so far apart that reset and adaptation of v have no effect on those critical firing times.

Regarding the simulation of threshold circuits by SNNs (see Section 2.5) we would like to point out that the corresponding SNN module can be constructed with fewer neurons if one makes further assumptions about the shape of EPSP and IPSP response functions. For example, one can simulate directly a threshold gate with weights α_i of different sign in a similar way as we have simulated monotone threshold gates in Section 2.5, provided that the EPSPs (modeling inputs with positive weights) and IPSPs (modeling inputs with negative weights) move linearly within the same time span from 0 to their extremal values.

Finally, we would like to point out that the class of piecewise constant functions (i.e., the class of step-functions) provides an example for a class of response and threshold functions that do not satisfy our basic assumptions from Section 2, but that can still be used to build for any Turing machine M an SNN N_M' that can simulate M (although not in real-time). We assume here that the response functions are piecewise constant (but not identically zero), and that the threshold functions are arbitrary functions (e.g., piecewise constant) that satisfy condition (1) of our basic assumptions. One can then build oscillators, as well as delay, inhibition, and synchronization modules, in the same way as in Section 2, and one can also simulate arbitrary threshold circuits in the same way. Furthermore one can use the phase difference between the pacemaker PM and an oscillator O with the same oscillation period π_PM to simulate a counter.
For that purpose one employs a delay module D with a suitable delay ρ > 0 (so that k · ρ = l · π_PM for k, l ∈ N implies that k = l = 0). One can then use the phase difference between O and PM to record how often the "spike" in O has been directed in the course of the computation through this delay module D. Hence one can store in the SNN an arbitrary natural number k, which can be incremented and decremented by suitable modules. To decide whether k = 0, one needs a module that can carry out a special case of the operation COMPARE. Such a module cannot be built in the same way as in Sections 2 and 3, but one can employ directly the "jump" in the piecewise constant response functions considered here to test whether two neurons fire exactly at the same time.
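The counter idea above can be sketched arithmetically. This is our own illustration: π_PM is normalized to 1 and ρ is set to an irrational value, so that k · ρ mod π_PM = 0 exactly when the counter k is 0; the zero test stands in for the exact-coincidence test described in the text.

```python
# Counter stored as a phase difference: each increment routes the spike of
# oscillator O through the delay module D (phase grows by RHO mod PI_PM).
# Since k*RHO = l*PI_PM only for k = l = 0, the phase is 0 iff the counter is 0.
from math import sqrt

PI_PM = 1.0            # normalized oscillation period (our choice)
RHO = sqrt(2) - 1.0    # delay incommensurable with PI_PM (our choice)

class PhaseCounter:
    def __init__(self):
        self.phase = 0.0

    def increment(self):
        self.phase = (self.phase + RHO) % PI_PM   # route the spike through D

    def decrement(self):
        self.phase = (self.phase - RHO) % PI_PM

    def is_zero(self, eps=1e-9):
        # stand-in for the test whether O and PM fire at exactly the same time
        return self.phase < eps or PI_PM - self.phase < eps

c = PhaseCounter()
for _ in range(3):
    c.increment()
assert not c.is_zero()
for _ in range(3):
    c.decrement()
assert c.is_zero()
```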
It is well known (see Hopcroft and Ullman 1979) that any Turing machine M can be simulated (although not in real-time) by a machine M' that has no tapes or stacks, but two counters. The preceding argument shows that such an M' (in fact, a machine with any finite number of counters) can be simulated in real-time by some finite SNN N_M' with piecewise constant response and threshold functions. The effect of the shape of postsynaptic potentials on the computational power of networks of spiking neurons is investigated more thoroughly in Maass and Ruf (1995). It is shown there that computations with single spikes in networks of spiking neurons become substantially slower if they cannot make use of increasing and decreasing linear segments of EPSPs.

5 Conclusion
We have analyzed the computational power of a simple formal model SNN for networks of spiking neurons. In particular we have shown that if the response and threshold functions of the SNN satisfy some rather weak basic assumptions (see Section 2), then one can build modules that can synchronize the spiking of different network parts, as well as modules that can multiply the phase difference between two oscillators with any given constant, and add, subtract, or compare the phase differences of different oscillators (see the constructions in Sections 2 and 3). With the help of these quite powerful computational operations an SNN can simulate in real-time for Boolean-valued input any Turing machine, and for real-valued input any N-RAM (a slightly weaker version of the model of Blum et al. 1989; see Section 3 of this article). On the side we would like to mention that these results also yield lower bounds for the VC-dimension of networks of spiking neurons, hence for the number of training examples needed for learning by such networks (see Maass 1994b, 1995c). One immediate consequence of this type is indicated in Corollary 2.2 of this article. The version of the model SNN with unlimited timing precision (i.e., T = R+ in the definition in Section 1) is not biologically realistic, insofar as it does not take the effects of noise into account. From that point of view our alternative version of this model with discrete firing times from {i · ρ : i ∈ N} for some ρ > 0 is preferable (since it allows us to represent an imprecise firing anywhere in the time interval (i · ρ − ε, i · ρ + ε) in the biological system by a "symbolic" firing at time i · ρ).
Therefore it is important to note that our results about SNNs with unlimited timing precision induce corresponding results for the computational power of SNNs with discrete firing times, as we have indicated in Corollary 2.5 (see Theorem 5 in Maass 1994b, as well as Maass 1995c, for further consequences of our results for SNNs with limited timing precision). In addition our constructions of SNN modules for the operations ADD, SUBTRACT, and MULTIPLY(β) on time differences between spikes appear to be quite
robust, in the sense that they provide approximate implementations of these operations on time differences between spikes in various models for real-valued computations in networks of spiking neurons with noise. We refer to Maass (1995d) for further results about the computational power of SNNs with noise. The results of this article have two interesting consequences. One is that in order to show that a network of spiking neurons can carry out some specific task (e.g., in pattern recognition or pattern segmentation, or solving some binding problem; see, e.g., von der Malsburg and Schneider 1986, or Gerstner et al. 1993) it now suffices to show that a threshold circuit, a finite automaton, a Turing machine, or an N-RAM (see Section 3) can carry out that task in an efficient manner. Furthermore the simulation results of this article allow us to relate the computational resources that are needed on the latter more convenient models (e.g., the required work space on a Turing machine) to the required resources of the SNN (e.g., the timing precision of the SNN; see Corollary 2.5). In other words, one may view N-RAMs and the other mentioned common computational models as "higher programming languages" for the construction of networks of spiking neurons. The real-time simulation methods of this article exhibit automatic methods for translating any program that is written in such a higher programming language into the construction of a corresponding SNN. In this way the "user" of an SNN may choose to ignore all worrisome implementation details on SNNs such as timing (potentially at the cost of some efficiency). Furthermore the matching upper bound result for N-RAMs (see Maass 1995b,c) shows that the corresponding "higher programming language" is able to exploit all computational abilities of SNNs.
Second, in combination with the corresponding upper bound results for SNNs with quite arbitrary response and threshold functions (and time-dependent weights) in Maass (1995b,c), the lower bounds of this article provide, for a large class of response and threshold functions, exact characterizations (up to real-time simulations) of the computational power of SNNs with real-valued inputs, and for SNNs with bounded timing precision. As a consequence of these results, one can then also relate the computational power of SNNs to that of recurrent analog neural nets with various activation functions (see Section 3), thereby throwing some light on the relationships between the computational power of models of neurons with spike coding (SNNs) and models of neurons with frequency coding (analog neural nets). Furthermore, the combination of these lower and upper bound results shows that extremely simple response and threshold functions (such as, for example, those in Fig. 2 in Section 2) are universal in the sense that with these functions an SNN can simulate in real-time any SNN that employs arbitrary piecewise linear response and threshold functions. Equivalence results of this type induce some structure in the "zoo" of response and threshold functions that are mathematically interesting or occur in biological neural systems,
and they allow us to focus on those aspects of these functions that are essential for the computational power of spiking neurons. Finally we would like to point out that since we have based all of our investigations on the rather fine notion of a real-time simulation (see Section 1), our results provide information not just about the relationships between the computational power of the previously mentioned models for neural networks, but also about their capability to execute learning algorithms (i.e., about their adaptive qualities).
Acknowledgments
I would like to thank Wulfram Gerstner, John G. Taylor, and three anonymous referees for helpful comments.
References
Abeles, M. 1991. Corticonics: Neural Circuits of the Cerebral Cortex. Cambridge University Press, Cambridge, England.

Aertsen, A., ed. 1993. Brain Theory: Spatio-Temporal Aspects of Brain Function. Elsevier, Amsterdam.

Blum, L., Shub, M., and Smale, S. 1989. On a theory of computation and complexity over the real numbers: NP-completeness, recursive functions and universal machines. Bull. Am. Math. Soc. 21(1), 1-46.

Buhmann, J., and Schulten, K. 1986. Associative recognition and storage in a model network of physiological neurons. Biol. Cybern. 54, 319-335.

Churchland, P. S., and Sejnowski, T. J. 1992. The Computational Brain. MIT Press, Cambridge, MA.

Crair, M. C., and Bialek, W. 1990. Non-Boltzmann dynamics in networks of spiking neurons. Advances in Neural Information Processing Systems, Vol. 2, pp. 109-116. Morgan Kaufmann, San Mateo, CA.

Gerstner, W. 1991. Associative memory in a network of "biological" neurons. Advances in Neural Information Processing Systems, Vol. 3, pp. 84-90. Morgan Kaufmann, San Mateo, CA.

Gerstner, W. 1995. Time structure of the activity in neural network models. Phys. Rev. E 51, 738-758.

Gerstner, W., and van Hemmen, J. L. 1994. How to describe neuronal activity: Spikes, rates, or assemblies? Advances in Neural Information Processing Systems, Vol. 6, pp. 463-470. Morgan Kaufmann, San Mateo, CA.

Gerstner, W., Ritz,
Figure 4: Increasing the firing frequency decreased the reconstruction error. Each bar represents the average rRMSE (± SE) for scatterplot reconstructions of 10 input signals that had the same (40 Hz) bandwidth but were rescaled to different amplitude ranges along the f/I curve. As the average firing rate increased (from ≈ 88 Hz on the left to ≈ 106 Hz on the right), the rRMSE decreased. Generally, the firing frequency could be predicted by the mean value (DC level) of the stimulus. However, the average firing rate of "middle" range signals was larger than that of the "full" (100.1 ± 0.3 Hz vs. 95.2 ± 0.8 Hz), and the rRMSE was smaller, even though they had the same DC bias. Numbers in parentheses below the amplitude ranges correspond to the average firing rate in Hz.

4.1 Other Spike Generators. It has been suggested, because the Hodgkin-Huxley spike generator has such a narrow dynamic range (Fig. 1) and a regular firing rate, that the reconstruction method described here may not be generally applicable. However, the same reconstruction technique (August and Levy 1994) has also been applied to spike trains from a retinal ganglion cell (RGC) model (Fohlmeister et al. 1990). The wider dynamic range of the RGC model was reflected in a larger coefficient of variance (CV) of the ISI histogram compared to the Hodgkin-Huxley model. For example, the CV of the RGC model, when stimulated with 50 Hz bandwidth gaussian noise, was 0.26. This was over twice the CV of the Hodgkin-Huxley spike generator when stimulated with 40 Hz (CV = 0.10) or 60 Hz (CV = 0.12) noise. Thus, increasing the spike generator's dynamic range and the irregularity of spiking does not invalidate our approach.

Simple Spike Train Decoder

79

Figure 5: Increasing stimulus bandwidth increased reconstruction error. (Legend: Sampling Theorem (EST); other nonlinear interpolation methods (E3, E5, E6); scatterplot method (Esc). Horizontal axis: input signal bandwidth, 20-80 Hz.) Signals were scaled to the same (135-435 nA) amplitude range, but were filtered with lowpass cutoffs of 20, 40, 60, or 80 Hz. That is, while average firing frequency remained constant (≈ 106 Hz), the stimulus bandwidth increased. The markedly increased errors for 60 and 80 Hz bandwidths indicate aliasing. Aliasing was most apparent in the reconstruction methods using nonlinear interpolation (solid lines). For the methods using linear interpolation (dashed lines), the increased slope of the rRMSE curve was less apparent.

80

David A. August and William B Levy

4.2 The Sampling Theorem. This paper has related information transmission to communication theory by relating the average firing frequency of a neuron to the Sampling Theorem. However, while conceptually similar to a classical Sampling Theorem reconstruction, the method here differed by its use of nonuniformly spaced samples, nonexact sample amplitudes, and linear interpolation. A comparison of several different reconstruction methods (Table 3) revealed how much these three error sources contribute to the total overall error. For the particular signals studied, linear interpolation was the most significant single factor in increasing error (Table 4). However, the exact ranking of these errors may change depending on the bandwidth and amplitude range of the input signals (data not shown).

Aliasing, as defined for sinc-function reconstructions (Couch 1987), is a type of reconstruction error caused by high-frequency signal components being "folded" back into lower frequencies, due to an inadequate sampling rate. The Nyquist rate is the sampling rate below which this folding occurs. To investigate the relevance of the Nyquist rate to neuronal communication, we examined the relationship between spike rate and stimulus bandwidth. In varying the firing frequency for a constant stimulus bandwidth (Fig. 4), reconstruction error decreased as firing frequency increased. In varying the input signal bandwidth while maintaining a constant firing frequency (Fig. 5), reconstruction error increased as this bandwidth increased. Thus, relative to communications theory, spike frequency seemed to be the appropriate analog of sampling frequency. Also, the importance of matching this sampling (or spike) frequency to the input signal bandwidth was clearly apparent.

In these aliasing studies, the reconstructions using linear interpolation degraded more gracefully than nonlinear interpolation as input bandwidth increased (Fig. 5). This observation may be biologically pertinent. Since neuronal signals are not strictly bandlimited at these low frequencies, aliasing will likely be present in vivo. Thus, it is notable that nature can combat aliasing by simplifying the interpolatory scheme from nonlinear to linear.

The relationship between cell firing and the Nyquist rate has implications for the experimental design of future studies that deliver a time-varying current injection to a neuron, record a spike train, and then "decode" this spike train into an estimate of the stimulus. While much work has gone into determining the appropriate decoding filter, the question of how the stimulus itself should be chosen remains open.
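The bandwidth/sampling-rate relationship discussed above can be illustrated numerically. The following sketch (illustrative signal and parameters, not the paper's simulation) samples a bandlimited signal at a fixed rate, standing in for the ≈ 106 Hz firing frequency, reconstructs it by linear interpolation, and shows that the relative error grows once the signal bandwidth exceeds half the sampling rate:

```python
import numpy as np

# Sketch of aliasing under linear-interpolation reconstruction. A random
# bandlimited signal is sampled at a fixed rate (a stand-in for the firing
# rate R) and reconstructed by linear interpolation; the error is much
# larger when the bandwidth W violates R > 2W. All values illustrative.
rng = np.random.default_rng(1)
t = np.linspace(0, 1, 2000)          # fine time grid, 1 second

def bandlimited(W, n=20):
    """Random sum of n sinusoids with frequencies below W Hz."""
    freqs = rng.uniform(0, W, n)
    phases = rng.uniform(0, 2 * np.pi, n)
    return sum(np.sin(2 * np.pi * f * t + p) for f, p in zip(freqs, phases))

def rrmse(W, rate):
    """Relative RMS error of linear-interpolation reconstruction."""
    x = bandlimited(W)
    ts = np.arange(0, 1, 1 / rate)               # uniform "spike" times
    xhat = np.interp(t, ts, np.interp(ts, t, x))  # sample, then interpolate
    return np.sqrt(np.mean((x - xhat) ** 2)) / x.std()

# 20 Hz bandwidth at 106 samples/s (R > 2W) reconstructs far better
# than 80 Hz bandwidth at the same rate (R < 2W).
assert rrmse(20, 106) < rrmse(80, 106)
```

This mirrors the comparison in Figure 5: holding the "firing rate" constant while raising the bandwidth past the Nyquist limit degrades the reconstruction.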
Several researchers have already emphasized the complexity of naturalistic stimuli (Field 1987; Ruderman and Bialek 1994). Here, we propose a specific guideline for neurophysiologists. Gaussian current injection signals should be scaled so that the DC bias and amplitude range produce firing at rates comparable to those observed in vivo. Next, the stimulus bandwidth should be limited to half this average firing rate or less, if a maximal capacity measurement is the goal. The relationship of the reconstruction method proposed here, the Sampling Theorem, and other decoding studies can be understood as follows. Bialek and colleagues have shown the theoretical optimality of linear decoding filters when the firing rate, R, becomes very small compared to the stimulus bandwidth, W (Bialek et al. 1993), and this has been confirmed experimentally by decoding sensory system spike trains with linear filters (Bialek et al. 1991; Warland et al. 1992; Theunissen 1993). Thus, linear decoding filters can be used quite successfully in the R < 2W range of neural dynamics. However, the Sampling Theorem suggests that the R > 2W range of dynamics may also be of interest. In this case, linear decoding
may no longer be optimal, and the experimenter faces the more difficult task of constructing nonlinear filters. The present study shows, empirically, that a very simple (triangular-shaped) nonlinear filter, equivalent to linear interpolation, can still produce high-quality reconstructions. This method should prove useful to experimenters interested in decoding more slowly varying stimuli from spike trains with higher firing rates.

4.3 Limitations of the Model. The model of information transmission by ISIs applies to neural systems with a PPF decay similar to the average firing rate, a relatively small conduction jitter, and a relatively large quantal content. We hypothesize that if these conditions are not met, then the spike train is transmitting a frequency code. First, our reconstruction method requires that the average ISIs for the postsynaptic cell lie along a range of moderate slope on the PPF-like decoding curve (Fig. 2). If ISIs fall predominantly along the nearly flat region, then the EPSPs would all be the same size, as is the case for linear filtering methods (Bialek et al. 1991). At the squid giant synapse, the first component of PPF decays over 5-10 msec (Charlton and Bittner 1978), which is suitably matched to the ≈ 100 Hz firing rate that has been used here. Similarly, PPF at spinal interneuron-motoneuron and at corticorubral synapses decays over 50 msec (Murakami et al. 1977; Kuno and Weakly 1972), which would be appropriate for cells firing at 10-20 Hz. Second, poor reconstructions could result from ISIs being distorted by a large axonal conduction jitter. Many neural systems have relatively little jitter. For example, in the frog sciatic nerve, jitter is ≈ 4 μsec (Rapoport and Horvath 1960), < 50 μsec in human motor axons (Salmi 1983; Stalberg et al. 1971), 100-200 μsec in various reflex arcs (Trontelj 1973; Trontelj and Trontelj 1978), < 50 μsec in the barn owl auditory system (Rose et al.
1967), < 40 μsec in the bat echolocation system (Simmons 1979), and < 1 μsec in the weakly electric fish (Bullock 1970). In fact, jitter as a noise source was already implicit in the reconstruction method here. Since the decoding function was fit to a scatterplot, the method must have been robust to at least the amount of scatter about this curve. For the 40 Hz signal shown in Figure 2, the widest scatter was approximately 200 μsec, which provides a lower bound on the maximum tolerable jitter. Third, the reconstruction technique will not work at synapses with low and variable quantal content [e.g., hippocampal region CA1 (Allen and Stevens 1994; Bekkers and Stevens 1990; Foster and McNaughton 1991; Hessler et al. 1993)], but is applicable where quantal content is high [e.g., squid giant synapse (Miledi 1967), frog neuromuscular junction (Martin 1955), and the climbing fiber synapses on cerebellar Purkinje cells (Llinas et al. 1969)]. Because we invoke a PPF-like effect for decoding, and because PPF is thought to be caused by changes in release probability, p, a fairly large number of release sites, n, would be required to detect the PPF-induced variability in np. Further, by emphasizing the importance of individual ISIs for carrying information, we assume reliable synaptic
transmission (e.g., np >> 1, and a high safety factor for spike invasion). Therefore, an explicit conclusion is that systems with low safety factors or low release probability and few release sites will not use ISI codes. Finally, we note that this study has approached neuronal information transmission differently than nature. We adjusted the stimulus bandwidth and amplitudes to match the spike generator. In nature, however, neurons would presumably co-evolve so that spike generators, interpolators, and dendritic filters matched. That is, we expect that evolution has discovered some simple filtering and interpolation schemes that avoid substantial information loss.
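The overall decoding scheme whose prerequisites are discussed in this section can be summarized in a few lines. The sketch below uses a hypothetical monotone ISI-to-amplitude function in place of the PPF-like curve that the paper fits to a scatterplot (Fig. 2); only the structure, one sample per ISI followed by linear interpolation, follows the text:

```python
import numpy as np

# Sketch of an ISI-based decoder: each interspike interval is mapped
# through a decoding curve to a stimulus-amplitude sample placed at the
# closing spike, and the estimate between samples is obtained by linear
# interpolation. The curve below is a hypothetical stand-in for the
# fitted PPF-like scatterplot function.

def decode_spike_train(spike_times, isi_to_amp, t_grid):
    spike_times = np.sort(spike_times)
    isis = np.diff(spike_times)
    amps = isi_to_amp(isis)              # one amplitude sample per ISI...
    t_samples = spike_times[1:]          # ...placed at the closing spike
    return np.interp(t_grid, t_samples, amps)   # linear interpolation

# Hypothetical monotone curve: shorter ISIs (higher instantaneous rate)
# map to larger stimulus amplitudes.
isi_to_amp = lambda isi: 1.0 / isi

spikes = np.array([0.000, 0.010, 0.018, 0.030, 0.045])  # sec, illustrative
t_grid = np.linspace(0.010, 0.045, 8)
estimate = decode_spike_train(spikes, isi_to_amp, t_grid)
assert estimate.shape == t_grid.shape
```

The triangular-filter equivalence mentioned above is visible here: linear interpolation between the per-ISI samples is exactly a convolution of those samples with a triangular kernel.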
Acknowledgments

This research was supported in part by NIH GM07267 and MH10702 to D.A.A., NIH MH00622 and MH48161 to W.B.L., and EPRI RP803008 to P. Papantoni-Kazakos, and by the Department of Neurosurgery, University of Virginia, Dr. John A. Jane, Chairman. The authors would like to thank Steve Wilson and Chris Fall for their constructive comments.
References

Agin, D. 1964. Hodgkin-Huxley equations: Logarithmic relation between membrane current and frequency of repetitive activity. Nature (London) 201, 625-626.

Allen, C., and Stevens, C. F. 1994. An evaluation of causes for unreliability of synaptic transmission. Proc. Natl. Acad. Sci. U.S.A. 91, 10380-10383.

August, D. A., and Levy, W. B. 1994. Information maintenance by retinal ganglion cell spikes. In The Neurobiology of Computation, J. M. Bower, ed., pp. 41-46. Kluwer, Norwell, MA.

Bekkers, J. M., and Stevens, C. F. 1990. Presynaptic mechanism for long-term potentiation in the hippocampus. Nature (London) 346, 724-729.

Bialek, W., and Rieke, F. 1992. Reliability and information transmission in spiking neurons. TINS 15(11), 428-434.

Bialek, W., Rieke, F., and de Ruyter van Steveninck, R. 1991. Reading a neural code. Science 252, 1854-1857.

Bialek, W., DeWeese, M., Rieke, F., and Warland, D. 1993. Bits and brains: Information flow in the nervous system. Physica A 200, 581-593.

Black, H. S. 1953. Modulation Theory. D. Van Nostrand, New York.

Bower, J. M., and Beeman, D. 1995. The Book of GENESIS. Springer-Verlag Telos, New York.

Bullock, T. H. 1970. The reliability of neurons. J. Gen. Phys. 55, 565-584.

Charlton, M. P., and Bittner, G. D. 1978. Facilitation of transmitter release at squid synapses. J. Gen. Physiol. 72, 471-486.

Couch, L. W. 1987. Digital and Analog Communication Systems. Macmillan, New York.
de Ruyter van Steveninck, R., and Bialek, W. 1988. Real-time performance of a movement-sensitive neuron in the blowfly visual system: Coding and information transfer in short spike sequences. Proc. R. Soc. London B 234, 379-414.

Field, D. J. 1987. Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. 4(12), 2379-2394.

Fohlmeister, J. F., Coleman, P. A., and Miller, R. F. 1990. Modeling the repetitive firing of retinal ganglion cells. Brain Res. 510, 343-345.

Foster, T. C., and McNaughton, B. L. 1991. Long-term enhancement of CA1 synaptic transmission is due to increased quantal size, not quantal content. Hippocampus 1(1), 79-91.

Hessler, N. A., Shirke, A. M., and Malinow, R. 1993. The probability of transmitter release at a mammalian central synapse. Nature (London) 366, 569-572.

Hodgkin, A. L., and Huxley, A. F. 1952. A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. 117, 500-544.

Jerri, A. J. 1977. The Shannon sampling theorem: Its various extensions and applications. A tutorial review. Proc. IEEE 65(11), 1565-1596.

Katz, B., and Miledi, R. 1968. The role of calcium in neuromuscular facilitation. J. Phys. 195, 481-492.

Kuno, M., and Weakly, J. N. 1972. Facilitation of monosynaptic excitatory synaptic potentials in spinal motoneurones evoked by internuncial impulses. J. Physiol. 224, 271-286.

Llinas, R., Bloedel, J. R., and Hillman, D. E. 1969. Functional characterization of neuronal circuitry of frog cerebellar cortex. J. Neurophys. 32(6), 847-870.

Marks, R. J. 1991. Introduction to Shannon Sampling and Interpolation Theory. Springer-Verlag, New York.

Martin, A. A. 1955. A further study of the statistical composition of the end-plate potential. J. Physiol. 130, 114-122.

Miledi, R. 1967. Spontaneous synaptic potentials and quantal release of transmitter in the stellate ganglion of the squid. J. Physiol. 192(2), 379-406.

Murakami, F., Tsukahara, N., and Fujito, Y. 1977. Properties of synaptic transmission of the newly formed cortico-rubral synapses after lesion of the nucleus interpositus of the cerebellum. Exp. Brain Res. 30, 245-258.

Nyquist, H. 1928. Certain topics in telegraph transmission theory. AIEE Trans. 47, 617-644.

Perkel, D. H., and Bullock, T. H. 1968. Neural coding. Neurosci. Res. Prog. Bull. 6(3), 227-348.

Rapoport, A., and Horvath, W. J. 1960. The theoretical channel capacity of a single neuron as determined by various coding schemes. Information Control 3, 335-350.

Rieke, F., Yamada, W., Moortgat, K., Lewis, E. R., and Bialek, W. 1992. Real time coding of complex sounds in the auditory nerve. Adv. Biosci. 83, 315-322.

Rieke, F., Warland, D., and Bialek, W. 1993. Coding efficiency and information rates in sensory neurons. Europhys. Lett. 22(2), 151-156.

Rose, J. E., Brugge, J. F., Anderson, D. J., and Hind, J. E. 1967. Phase-locked response to low-frequency tones in single auditory nerve fibers of the squirrel monkey. J. Neurophys. 30, 769-793.

Ruderman, D. L., and Bialek, W. 1994. Statistics of natural images: Scaling in the woods. In Advances in Neural Information Processing Systems, J. D. Cowan, G. Tesauro, and J. Alspector, eds., Vol. 6. Morgan Kaufmann, San Mateo, CA.

Sakuranaga, M., Ando, Y.-I., and Naka, K.-I. 1987. Dynamics of the ganglion cell response in the catfish and frog retinas. J. Gen. Phys. 90, 229-259.

Salmi, T. 1983. A duration matching method for the measurement of jitter in single fibre EMG. Electroencephalogr. Clin. Neurophysiol. 56, 515-520.

Shannon, C. E. 1949. Communications in the presence of noise. Proc. IRE 37, 10-21.

Simmons, J. A. 1979. Perception of echo phase information in bat sonar. Science 204, 1336-1338.

Stalberg, E., Ekstedt, J., and Broman, A. 1971. The electromyographic jitter in normal human muscles. Electroencephalogr. Clin. Neurophysiol. 31, 429-438.

Stein,

LINs) are represented. The simulations have proven that an excitatory element local to each glomerulus is necessary to account for the neural response patterns observed in intracellular recordings: we thus introduce a class of nonspiking excitatory localized interneurons that have dendritic arborizations (input and output synapses) restricted to one glomerulus. Inhibitory localized interneurons have a dense arborization (input and output synapses) in one glomerulus and sparse arborizations in all others; as we have shown before (Linster et al. 1994), these distributed synapses should be mainly output synapses to guarantee a lateral inhibition mechanism.
The synaptic coefficients of these lateral inhibitory connections decrease linearly with the distance between the point of summing of activation (the principal glomerulus of the LIN) and the glomeruli with which the interaction takes place; similarly, the transmission delays of these interactions increase with the distance of the interactions. Similarly to the LINs, a part of the ONs have dendrites invading only one glomerulus (Uni ON), whereas the others (Pluri ON) are pluriglomerular. The axons of both types of ONs project to various areas of the protocerebrum, especially onto the mushroom body interneurons (Fonta et al. 1993). In the model, only localized output neurons, which correspond to Uni ONs, are represented; they connect only to local interneurons, and do not receive direct input from receptor cells. Thus, each glomerulus integrates

1. input from one type of receptor cell,

2. local excitation provided by its local excitatory interneuron,

3. local inhibition provided by its associated inhibitory interneuron, and

4. lateral inhibition coming from neighboring glomeruli provided by inhibitory interneurons associated to the neighboring glomeruli.

Stimulation of the receptor cells by a subset of the 10 molecules triggers several phenomena:
1. due to the excitatory elements (which feed back onto each other) local to each glomerulus, an activated glomerulus tends to enhance the activation it receives from the receptor cells,

2. the local inhibitory elements are activated (with a certain delay) by the receptor cell activity and by the self-activation of the local excitatory elements, and
3. due to the lateral inhibitory connections, these tend to inhibit neighboring glomeruli.
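The four inputs integrated by each glomerulus, listed above, can be sketched as a single summation. The following fragment (parameter names and values hypothetical, not the paper's equations) includes the linear decrease of lateral-inhibition coefficients with inter-glomerular distance:

```python
import numpy as np

# Minimal sketch of the net input to one glomerulus g: receptor input,
# local excitation, local inhibition, and distance-weighted lateral
# inhibition from the other glomeruli. Weights are illustrative only.

def glomerulus_input(g, receptor, exc, inh, n_glom=15, w_lat=0.5):
    lateral = 0.0
    for j in range(n_glom):
        if j != g:
            dist = abs(j - g)
            # coefficients decrease linearly with distance, as in the text
            coeff = max(0.0, w_lat * (1 - dist / n_glom))
            lateral += coeff * inh[j]
    return receptor[g] + exc[g] - inh[g] - lateral

receptor = np.ones(15)        # input from one receptor cell type each
exc = 0.3 * np.ones(15)       # local excitatory interneuron activity
inh = 0.1 * np.ones(15)       # local inhibitory interneuron activity
total = glomerulus_input(7, receptor, exc, inh)
assert total < receptor[7] + exc[7]   # inhibition reduces the net drive
```

In the full model the transmission delays of the lateral connections would also grow with distance; the sketch keeps only the spatial weighting.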
These phenomena result in a competition between active glomeruli: during a number of sampling steps, the output activity of each glomerulus (represented by the firing probabilities of its associated output neurons) oscillates from high activity to low activity. Due to the competition provided by the lateral inhibition, the spatial activity pattern in the glomerular layer changes over time, and a stable activity map is reached eventually. A number of glomeruli "win" and stay active, whereas others "lose" and are silent. Figure 4A shows the evolution of the glomerular activities after odor presentation. For each sampling step of 2 msec, the average firing probabilities of the output neurons associated with each of the glomeruli are shown. The last pattern shows the stabilized activity map resulting from the competition between close glomeruli.

Model of Olfactory Sensory Memory in the Honeybee

99

Figure 3: Organization of the model olfactory circuitry. In the model, we introduce types of receptor cells with overlapping molecule (M) spectra; each receptor cell type has its maximal spiking probability P for the presence of one molecule i. The axons of the different receptor cell types project into distinct regions of the glomerular layer. All allowed connections (as described in the text) exist with the same probability, but with different connection strengths and transmission delays. The output of each glomerulus is represented by its associated output neurons. (For simulation parameters, see the Appendix.)
Christiane Linster and Claudine Masson
100
Figure 4: Odor processing in the model. (A) Stabilization of spatial activity patterns. For a number of sampling steps (2 msec), the average firing probabilities (ranging between 0 and 1) of the ONs associated to each glomerulus are shown. After stimulation all glomeruli are differentially activated. Lateral inhibition silences all ONs at step 2 on the presented diagram. During the next sampling steps, the competition between glomeruli due to the lateral inhibition and to the local excitation can be observed. Around step 9, the final activity pattern begins to emerge. The last pattern shows the stabilized activity pattern which results from the stimulation. (B) Evolution of the firing probabilities of individual output neurons after stimulation. As an illustration, the temporal evolution of the average firing probabilities (ranging from zero to one) of the ONs associated to some glomeruli are traced. After stabilization of the activity map, ONs are either silent or active. After stimulus offset, the activities recover their spontaneous activation level after ca. 15 msec. (Stimulus onset and offset are indicated by arrows.)
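The competition dynamics described above can be reproduced qualitatively in a few lines. The sketch below (illustrative update rule and coefficients, not the paper's equations) gives each glomerulus self-excitation and uniform lateral inhibition and iterates until a stable win/lose map emerges:

```python
import numpy as np

# Sketch of glomerular competition: each glomerulus amplifies its own
# activity through local excitation and is suppressed by the summed
# activity of the others; strongly driven glomeruli "win" (saturate)
# while weakly driven ones fall silent. Parameters are illustrative.

drive = np.linspace(0, 1, 15)   # differential receptor activation
a = drive.copy()
for step in range(50):
    lateral = a.sum() - a                       # inhibition from the others
    a = np.clip(drive + 1.5 * a - 0.2 * lateral, 0, 1)

# A stable activity map: here the 5 most strongly driven glomeruli
# stay active and the other 10 are silenced.
assert (a > 0.9).sum() == 5
assert (a < 0.1).sum() == 10
```

How many glomeruli win depends on the balance between the self-excitation and lateral-inhibition coefficients, which matches the statement below that this balance determines the average number of active neurons per stimulation pattern.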
The neural code read by the next layer (e.g., the mushroom bodies) of the olfactory network is represented by the across-fiber pattern of the activities of the output neurons. The activities of individual output neurons follow the general pattern described above: oscillation of the activity during a number of sampling steps until the activity "settles" down to a stable value [Fig. 4B shows the firing probabilities of some output neurons in response to the same stimulation as that used in Fig. 4A; Fig. 5a shows the output activity (action potentials and membrane potential) of several LINs and ONs in response to different odor stimuli]. A stable activity can either be a constant firing probability or a "stable" oscillation of the firing probability. An output neuron associated to a particular glomerulus may be active in response to a particular odor quality, and silent for others. The simulations suggest two main conclusions concerning the neural circuitry in the antennal lobe: (1) local excitation should be present in the glomeruli and (2) the particular, heterogeneous arborization pattern of the localized interneurons is closely related to their functional role: to provide lateral inhibition between glomeruli. To ensure the stability of the system (i.e., the network can be activated only by external input, not by its internal noise), a basic condition has to be observed: the sum of excitation and inhibition arriving at the excitatory interneuron from other interneurons is lower than its saturation threshold. After stabilization of an activity pattern in response to stimulation, a maximal signal-to-noise ratio can be obtained if the total input to activated interneurons exceeds their saturation threshold; in this case all activated interneurons fire at their maximal frequency (see the Appendix, Section 4 for derivation). For adjustment of parameters, we use two scales of observation:

- Activities of individual neurons can be compared to intracellular recordings (spontaneous activities, response latencies, average firing frequencies) (Sun et al. 1993); this permits us to adjust intrinsic parameters such as membrane time constants, spiking thresholds, and synaptic transmission delays (Fig. 5a).

- Statistical distributions of simple neural response patterns (excitation, inhibition, or no-response) in response to stimulations can be compared to quantitative descriptions of electrophysiological data (Sun et al. 1993); this permits us to adjust "wiring" parameters such as connection strengths and the decay of connection effects with respect to distance of signal transmission (see Fig. 5b for details).
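The two constraints on the excitatory interneurons stated above can be written as simple inequalities. The sketch below (hypothetical parameter names and values; the actual derivation is in the Appendix, Section 4) checks them for one set of coefficients:

```python
# Sketch of the two conditions on a glomerulus's excitatory interneuron:
# (1) stability: the summed recurrent input from other interneurons stays
#     below the saturation threshold, so the network cannot self-ignite
#     from internal noise; and
# (2) maximal signal-to-noise ratio: with external stimulation, the total
#     input exceeds the threshold, so activated interneurons fire at their
#     maximal frequency. All names and values are hypothetical.

def stable(w_exc_recurrent, w_inh, theta_sat):
    """Condition (1): no self-ignition without external input."""
    return w_exc_recurrent - w_inh < theta_sat

def saturates_when_driven(stimulus, w_exc_recurrent, w_inh, theta_sat):
    """Condition (2): driven interneurons reach saturation."""
    return stimulus + w_exc_recurrent - w_inh >= theta_sat

assert stable(w_exc_recurrent=0.8, w_inh=0.2, theta_sat=1.0)
assert saturates_when_driven(1.0, 0.8, 0.2, theta_sat=1.0)
```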
The phenomena described above arise for a large range of parameters (observing the conditions described in the Appendix, Section 4); however, the average number of neurons that is active or inhibited for each stimulation pattern is determined by the balance between local excitation and lateral inhibition in each glomerulus. These phenomena are described in more detail in Masson and Linster (1995) and Linster et al. (1994). In the simulations described here, the average values of the set of parameters
Figure 5a: Validation of parameters by comparison to experimental data. Statistical distribution of response patterns to olfactory stimulation in model neurons (A) and antennal lobe neurons (B). A: In the 10-dimensional odorant space, all combinations of binary odors (1024) were used. For each stimulus (625 msec simulation time), the membrane potentials of LINs and ONs associated to each of the 15 glomeruli were averaged. For LIN and ON populations spontaneous activity was averaged over 625 msec with no stimulation, and all response amplitudes were normalized with respect to the maximal response amplitude (excitation and inhibition considered separately) obtained for all stimulations. A significant excitatory response was detected when the increase of activation with respect to the spontaneous activity exceeded 10%; a significant inhibitory response was detected when the decrease of activation with respect to spontaneous activity exceeded 10%. The graph shows the average number of excitatory, inhibitory, and nonresponse patterns for both LINs and ONs for all stimulations. B: Percentage of nonresponse, excitation, and inhibition patterns recorded from two classes of antennal lobe neurons (Hetero LIN and ON) (Sun et al. 1993). These results represent a quantitative representation of data obtained by intracellular recordings of antennal lobe neurons, with 7 different stimulations (as described in Fig. 8) for each recording.
derived by comparison to the experimental data are constant; all real values are chosen in a random distribution (±10% around the average value). We will show in the next section how modulation of the lateral inhibition strength provides a simple and efficient scheme for sensory memory: the neural activities in the model may memorize the activity pattern even after stimulus offset for a short period of time.

Model of Olfactory Sensory Memory in the Honeybee

103

Figure 5b: Validation of parameters by comparison to experimental data. Individual neural activities of neurons in the model (A) and antennal lobe neurons (B). A: Individual LIN (1-2) and ON (1-2) activities in response to two different odor stimulations (O1 and O2). Stimulus onset and offset are indicated by arrows. Local interneurons are mainly activated (increased spiking frequency) by stimulation; their temporal response patterns vary with the stimulation. Output neurons are either activated or inhibited by stimulation. In the model, mean background activities (in absence of stimulation) are 12.5 spikes/sec for ONs and 4.2 spikes/sec for LINs. B: Intracellular recording from a Hetero LIN responding to pure components and their binary and ternary mixtures with varying response profiles (Sun et al. 1993). Mean background activities recorded are 5.9 spikes/sec for Hetero LINs and 12 spikes/sec for ONs.
3 Modeling Sensory Memory: A Problem of Modulation of Inhibition in the Network

In a foraging situation, a trace of the olfactory stimulus has to be established until either a positive (food is found) or a negative (no food is found) reinforcement stimulus is received. Here, we predict that this trace can be established in the antennal lobe by memorizing the neural activity pattern triggered by the stimulus even after stimulus offset. This memorization of the neural activity pattern is achieved, in the model, by modulation of the lateral inhibition strength between glomeruli, e.g., by perturbation of the local balance between excitation and inhibition:

1. Before stimulus offset, the neurons local to one glomerulus are either excited or inhibited.
Christiane Linster and Claudine Masson
2. After stimulus offset, due to the membrane time constant, each neuron memorizes its activity level for several milliseconds.

3. If during that period (or at any moment after stabilization of the activity pattern) the lateral inhibition strength is cut off or considerably decreased, the local excitatory elements "take over": in those glomeruli that have been activated by the stimulus the local excitation will enhance this activation, whereas those glomeruli that have been inhibited will stay silent.

4. The neural activity of all neurons local to one glomerulus is sustained after stimulus offset, due to the local excitatory elements, which feed back to each other.

5. When the lateral inhibition strength recovers its original value, its local effects become stronger than those of the excitatory elements and the global activity level goes back to its spontaneous level.

Two conditions on the choice of parameters are necessary for these phenomena to occur: (1) after stabilization of the activity map in response to an odor input, active excitatory elements should be driven into saturation (as explained above, the total input to these elements has to be higher than their saturation threshold), and (2) the coefficient of the positive feedback connection on excitatory elements has to be higher than their saturation threshold (plus the local inhibition) (see the Appendix, Section 4, for the derivation). This means that in the absence of lateral inhibition, the equation governing the evolution of the excitatory interneurons becomes unstable for those excitatory neurons that fire at their maximal spiking frequency and thus keeps them in saturation even in the absence of external input. Excitatory elements that are inhibited do not send a positive feedback onto themselves; they stay silent and cannot activate themselves.
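The five-step mechanism above can be illustrated with a toy simulation. This is a minimal sketch, not the paper's model: two competing glomerular units with self-excitation and mutual lateral inhibition, with weight and threshold values chosen loosely in the spirit of the Appendix parameters (all names and numbers here are illustrative assumptions).

```python
import numpy as np

def step(E, w_lat, w_self=6.0, ext=None, theta_max=4.0):
    """One synchronous update of two glomerular units; firing
    probability is clipped to [0, 1] (saturation at theta_max)."""
    ext = np.zeros_like(E) if ext is None else ext
    # each unit: self-excitation plus lateral inhibition from the other
    # unit (E[::-1] swaps the two activities)
    total = w_self * E + w_lat * E[::-1] + ext
    return np.clip(total / theta_max, 0.0, 1.0)

E = np.array([0.0, 0.0])
# stimulus phase: unit 0 driven by external input, lateral inhibition on
for _ in range(10):
    E = step(E, w_lat=-2.5, ext=np.array([5.0, 0.0]))
pattern_at_offset = E.copy()
# after stimulus offset, lateral inhibition cut to zero: the positive
# feedback sustains the winner, the inhibited unit stays silent
for _ in range(10):
    E = step(E, w_lat=0.0)
print(pattern_at_offset, E)
```

Running the sketch, the unit activated during stimulation stays saturated after offset while the other remains silent, which is the memorization effect described in steps 3 and 4.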
Figure 6 shows the activity of several output neurons due to a stimulus application (experience with stimulus A), and their evolution during sensory storage (experience with stimulus B). Right after stimulus application, all glomeruli are active, competing for a few sampling steps, until some glomeruli "win" and stay active, whereas others "lose." After stimulus offset, in normal conditions, the activity returns to its spontaneous level after 4-8 msec (experience A). In experience B, the value of the lateral inhibition strength is set to zero shortly after stimulus offset (2 msec), and begins to increase toward its original value after 24 msec. Due to the local excitatory elements, each glomerulus tends to enhance its activity: active glomeruli stay active, whereas glomeruli that have been inhibited by the glomerular competition stay silent. Figure 7 shows the evolution of all glomerular activities starting 4 msec (two sampling steps) before stimulus offset. In Figure 7A, no modulation of the lateral inhibition occurs, whereas in Figure 7B, the lateral inhibition strength is set to zero 2 msec after stimulus offset, and begins to slowly increase toward its original value after 24 msec.
Figure 6: Memorization of the neural activity pattern due to modulation of the lateral inhibition strength. The temporal evolution of the average firing activities (upper trace) and the membrane potentials of ONs (ON1-ON4) associated to four glomeruli are traced. In experience A, no modulation of the lateral inhibition is performed; after stimulus offset, the neural activity goes back to its spontaneous activity level. In experience B (the qualitative evolution of the lateral inhibition is shown below), the lateral inhibition strength is set to zero 2 msec after stimulus offset, and slowly increases toward its original value after 24 msec. The ON activity pattern is memorized while the lateral inhibition is low, and tends to disappear when the lateral inhibition increases. The activities return to the spontaneous activity level when the lateral inhibition recovers its original value. (Stimulus onset and offset are indicated by arrows.) In the model, the decrease of the lateral inhibition strength can be achieved by decreasing the synaptic efficacy of the localized inhibitory interneurons or by increasing their spiking threshold.
4 Discussion

The model that has been described makes a number of predictions concerning odor processing and sensory memory in the bee antennal lobe neural network. With respect to odor processing, intracellularly recorded responses to odor mixtures are in general very complex and difficult to interpret from the responses to single odor components (Sun et al. 1993; and Fig. 8). A
Christiane Liiister and Claudine Masson
106
Figure 7: Evolution of the glomerular activity pattern after stimulus offset with and without modulation of the lateral inhibition strength. The evolution of the average firing probabilities (each diagram represents 2 msec) of the ONs associated to each glomerulus is shown starting 4 msec before stimulus offset. (A) The activity pattern slowly disappears after stimulus offset. (B) The lateral inhibition strength is set to zero 2 msec after stimulus offset and starts to increase toward its original value after 24 msec. The neural activity pattern of the glomerular ONs is memorized after stimulus offset.
tendency to select particular odor-related information is expressed in the category of localized antennal lobe neurons, both LINs and ONs. In contrast, both global LINs and ONs are often more responsive to mixtures than to single components. This might indicate that the related localized glomeruli represent functional subunits that are particularly involved in the discrimination of some key features (Masson et al. 1993). In addition to single cell recordings, the study of the spatial distribution of odor-related activity evidenced by 2DG suggests that odor qualities with different biological meaning might be decoded according to separate spatial maps sharing a number of common processing areas (Masson et al. 1993; Nicolas et al. 1993). Our model suggests the decoding of the olfactory stimuli in spatial maps of activity spanning the whole glomerular layer; it allows us to understand the spatial activity distribution as a function of single cell responses. Recent data in the locust antennal lobe (Laurent and Davidowitz
Figure 8: Spike rate histograms before, during, and after stimulation of two antennal lobe output neurons. Antennal lobe neurons have been recorded intracellularly during olfactory stimulation of the antenna with three pure odors (HEP, 2-heptanone; GER, geraniol; ISO, isoamyl acetate) and their binary and ternary mixtures. Spike rates before (1 sec), during (1 sec), and after stimulation (1 sec) are expressed as a function of the spontaneous rate recorded before stimulation (which corresponds to 100%). (A) Pluriglomerular ON responding with increased spiking frequency to the ternary mixture of the three odorants but almost not to stimulation with single odorants and their binary mixtures; this neuron keeps the high spiking frequency due to the stimulation (G + H + I) for 1 sec after stimulus offset. (B) Uniglomerular ON responding with increased spiking frequency to GER, with decreased spiking frequency to HEP, and with various degrees of excitation to the binary and ternary mixtures. The response to GER exhibits a long-lasting excitation after stimulus offset, whereas the response to HEP exhibits a long-lasting inhibition. From Sun (1991).
1994) strongly suggest the representation of an odor by an assembly of coherently firing antennal lobe neurons. Because antennal lobe neurons are generally activated by several odors (in these experiments, complex food odors were used as olfactory stimuli), the assemblies that encode different odors can overlap. Interestingly, models of the vertebrate olfactory bulb (Li and Hopfield 1989; Li 1990; Erdi et al. 1993), which compares
to the antennal lobe, while implementing a different neural circuitry, predict the same type of odor processing in the glomerular layer. With respect to sensory memory, in the honeybee, cooling experiments combined with single trial learning have shown that cooling of the antennal lobe later than 2 min, the α-lobes of the mushroom bodies later than 3 min, and the calyces later than 5 min did not impair memory formation (see Menzel 1983, 1984 for review). This indicates that the early memory traces may be located in the antennal lobe, as proposed by the model. Furthermore, these results suggest a hierarchical transmission of the olfactory images, where the memory traces are established at different layers of processing at different times. Our model of sensory memory will allow us to explore a number of features concerning the transfer of the olfactory images from the antennal lobe and its sensory store to a more permanent associative memory device, presumably located in the mushroom bodies. The model suggests a uniform modulation of inhibition strength in the antennal lobe as a basis of sensory memory. This implicates the presence of neuromodulator circuits, which would be controlled by higher order brain centers. In our experiments, intracellular recordings have evidenced the existence of long-lasting excitation after stimulus offset in some antennal lobe neurons (Fonta et al. 1989 and Fig. 8). The precise conditions of occurrence of these phenomena, as well as their dependence on the presence of specific neuromodulators, will be investigated in a new set of experiments. The presence of feedback circuits between the mushroom bodies and the antennal lobe interneurons has been suggested before (Masson 1977; Erber 1981), as well as their importance for memory formation.
In addition, the localization of several neurotransmitters in the bee brain has been evidenced (for a review see Bicker 1993) by use of neurochemical tools; the functional study of these neurotransmitters is being undertaken (unpublished data). The modeling approach combined with new experiments will help us to elucidate the role of these neurotransmitters, allowing us in a unique way to integrate elements of knowledge coming from converging experimental and theoretical approaches.

Appendix: Implementation and Simulation Parameters
1. Neurons. The different neuron populations associated with each glomerulus are represented in the simulations by one unit (each unit is governed by one difference equation). All connection weights and transmission delays are chosen randomly around a mean value. In discrete time, the fluctuation of the membrane potential around the resting potential, due to input e_i(t) at its postsynaptic sites, is expressed as

v_i(t + Δt) = v_i(t) (1 − Δt/τ_i) + e_i(t)
where τ_i is the membrane time constant, Δt is the sampling interval, and e_i(t) is the total input to neuron i at time t. The firing probability P[x_i(t) = 1] that the state x_i(t) of neuron i at time t is 1 is given by a quasilinear function of the neuron membrane potential v_i(t) at time t, where the lower threshold Θ_min determines the amount of noise, and the upper threshold Θ_max determines the value of the membrane potential for which the maximal firing probability is reached:

P[x_i(t) = 1] = 0 if v_i(t) < Θ_min; (v_i(t) − Θ_min)/(Θ_max − Θ_min) if Θ_min ≤ v_i(t) ≤ Θ_max; 1 if v_i(t) > Θ_max
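As a sketch, the discrete-time membrane update and the quasilinear firing probability can be written out directly. The leaky-integrator form of the update is an assumption consistent with the text, and the numerical values (τ = 5 msec, Δt = 2 msec, Θ_min = −0.1, Θ_max = 4.0) are taken from the simulation parameters listed later in the Appendix.

```python
import numpy as np

def update_potential(v, e, tau, dt=2.0):
    """Membrane potential relaxes toward rest (0) with time constant
    tau (msec), driven by the total synaptic input e."""
    return v * (1.0 - dt / tau) + e

def firing_probability(v, theta_min=-0.1, theta_max=4.0):
    """Quasilinear: 0 below theta_min, 1 above theta_max, linear between."""
    return np.clip((v - theta_min) / (theta_max - theta_min), 0.0, 1.0)

v = 0.0
for _ in range(5):            # constant input, tau = 5 msec
    v = update_potential(v, e=1.0, tau=5.0)
p = firing_probability(v)
assert 0.0 <= p <= 1.0
```

With constant input the potential converges toward a steady state, and the firing probability tracks it linearly between the two thresholds.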
The value of the transmission delay associated with each synapse is chosen randomly around a given average value; it is meant to model all sources of delay, transduction, and deformation of the transmitted signal from the cell body or dendrodendritic terminal of neuron j to the receptor site of neuron i. The mean value of the delay distribution is longer for inhibition than for excitation: we thereby take into account approximately the fact that IPSCs usually have slower decay than EPSCs, and may accumulate to act later than actually applied.

2. Molecule Arrays and Receptor Cells. Odorants are represented in a ten-dimensional, discrete odorant space; a stimulation corresponds to a particular point in this space. R receptor cells are differentially sensitive to all M molecules: each receptor cell has a maximal (1.000) sensitivity to one molecule (center of the gaussian sensitivity curve); its sensitivity to surrounding molecules is given by a gaussian function with width 1. Each receptor cell projects onto a subset of N glomeruli with an afferent weight w_R.

3. Interneurons. Local excitatory elements are local to one glomerulus; they send excitatory input to all localized inhibitory interneurons and projection neurons associated to that glomerulus, and they receive input from receptor cells and from all inhibitory interneurons sending input (local and lateral) to their glomerulus:

e_j(t) = w_R R_j(t) + w_I I_j(t − τ_I) + Σ_{g≠j} w_{Ig} I_g(t − τ_{Ig}) + w_E E_j(t − τ_E)

where e_j is the input to the excitatory neuron associated to glomerulus j; R_j is the output of the receptor cell projecting to glomerulus j; I_j is the output of the localized inhibitory interneuron whose high branching pattern is in glomerulus j, w_I is its connection strength and τ_I is its transmission delay;
Christiane Linster and Claudine Masson
110
I_g are the outputs of the neighboring localized inhibitory interneurons associated to glomeruli g, w_{Ig} are their connection strengths and τ_{Ig} their transmission delays; E_j is a recurrent input of the excitatory element onto itself, w_E is its connection strength and τ_E is its transmission delay.

Localized inhibitory interneurons are associated with one glomerulus: they send lateral inhibitory input to inhibitory and excitatory interneurons in neighboring glomeruli as well as local inhibition to the neurons in their principal glomerulus; they receive input from the receptor cells projecting onto their principal glomerulus and from the local excitatory elements in that glomerulus, as well as lateral inhibition input from surrounding glomeruli:

i_j(t) = w_R R_j(t) + w_E E_j(t − τ_E) + Σ_{g≠j} w_{Ig} I_g(t − τ_{Ig})
where i_j(t) is the input to the localized inhibitory interneuron associated with glomerulus j.

Localized output neurons integrate synaptic activity from all interneurons with principal synapses in their associated glomerulus:

o_j(t) = w_I I_j(t − τ_I) + w_E E_j(t − τ_E)

where o_j(t) is the input to the projection neuron associated with glomerulus j. All membrane potentials and outputs are computed according to the equations given above; the values of the time constants and thresholds are given below.

4. Conditions on Parameters. The behavior of the model presented here is largely dominated by the excitatory interneurons and their positive feedback connections. The system is stable (it can be activated up to saturation only by exterior input from receptor cells and not by intrinsic noise) if, in the absence of external input, the total input to the excitatory elements is smaller than their saturation threshold (thus, their firing probability stays smaller than 1):
In the worst case, ∀ j, g: I_j(t) = I_g(t) = E_j(t) = 1.0 (maximal spiking frequency); the condition then reads

w_I + Σ_{g≠j} w_{Ig} + w_E < Θ_max
The signal-to-noise ratio after stabilization of an activity pattern in response to odor stimulation is large if all activated excitatory interneurons reach their saturation threshold; thus

e_j(t) = w_R R_j(t) + w_I I_j(t − τ_I) + Σ_{g≠j} w_{Ig} I_g(t − τ_{Ig}) + w_E E_j(t − τ_E) > Θ_max

E_j(t) = 1
[where E_j(t) is the output (firing probability) of neuron j at time t], and for inactive excitatory interneurons after stabilization:
e_j(t) = w_R R_j(t) + w_I I_j(t − τ_I) + Σ_{g≠j} w_{Ig} I_g(t − τ_{Ig}) + w_E E_j(t − τ_E) < Θ_min

E_j(t) = 0

for neurons that will be inhibited after stabilization. If, after stabilization of the activity pattern, lateral inhibition is set to zero, active excitatory interneurons will stay in saturation even after stimulus offset if

w_I I_j(t − τ_I) + w_E E_j(t − τ_E) > Θ_max
Inactive excitatory interneurons will stay inactive because E_j(t − τ_E) = 0 and the positive feedback loop is interrupted for these neurons. Thus, the removal of the lateral inhibition creates a positive feedback loop driving the system into saturation for those glomeruli that are activated by the exterior input. Recovery of the lateral inhibition reduces the positive feedback loop and drives the system back into its original, stable state.

5. Simulation Parameters. In all simulations described in the text, the following parameters have been used:
R = 5 receptor cells sensitive to M = 10 different types of molecules. Matrix of receptor cell sensitivities (receptors in rows, odorants in columns):

1.000 0.018 0.000 0.000 0.018 0.368 0.018 0.001 0.368 1.000
0.368 0.001 0.018 0.368 0.000 0.000 0.001 0.001 0.000 0.000
0.000 0.000 0.000 0.018 0.001 0.000 1.000 0.368 0.018 0.018
0.368 1.000 0.000 0.001 0.018 0.001 0.018 0.368 0.000 0.000
0.001 0.001 0.000 0.000 0.368 0.018 0.001 0.368 1.000 0.368
G = 15 glomeruli, thus each receptor cell projects onto N = 3 neighboring glomeruli with afferent connection strength:
w_R = 7.5 (±10%); w_E = 6.0 (±10%) is the connection strength of local excitatory elements, τ_E = 2 msec (±10%) their transmission delay; w_I = −1.0 (±10%) is the local inhibitory connection strength, τ_I = 2 msec (±10%) its transmission delay; w_Ig = −2.5 (±10%) is the maximal value of the lateral inhibition strength (to the two nearest neighboring glomeruli); this value decays linearly toward a minimal value of w_Ig = −0.5 as the distance between glomeruli g and j increases; τ_Ig = 6 msec is the shortest transmission delay (between two nearest neighboring glomeruli); the delay increases toward a maximal value of 15 msec as the distance between glomeruli g and j increases.
For all neurons, the values of the thresholds are Θ_min = −0.1 and Θ_max = 4.0; for excitatory local interneurons, Θ_min = 0.01. The membrane time constant is τ = 5 msec for inhibitory LINs and ONs and τ = 8 msec for excitatory LINs. Updating of all neurons is synchronous, with Δt = 2 msec.
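As a sketch, the two parameter conditions of Section 4 of the Appendix can be checked against the simulation values above. The linear discretization of the lateral-weight decay over a ring of 15 glomeruli (two neighbors at each distance, distances 1 to 7) is an assumption about how the decay is sampled, not a value stated in the text.

```python
w_E, w_I, theta_max = 6.0, -1.0, 4.0

def w_lat(d):
    # lateral inhibition weight at glomerular distance d: linear decay
    # from -2.5 at d = 1 to -0.5 at d = 7 (assumed discretization)
    return -2.5 + (d - 1) * (2.5 - 0.5) / 6.0

# ring of 15 glomeruli: two neighbors at each distance 1..7
lateral_sum = sum(2 * w_lat(d) for d in range(1, 8))

# Condition 1 (stability without external input): worst-case total
# input to an excitatory element stays below the saturation threshold
assert w_I + lateral_sum + w_E < theta_max

# Condition 2 (memorization): with lateral inhibition removed, positive
# feedback plus local inhibition must exceed the saturation threshold
assert w_E + w_I > theta_max
print("both conditions hold")
```

Both assertions pass with the listed values, since the lateral inhibition sum is strongly negative while w_E − |w_I| = 5.0 exceeds Θ_max = 4.0.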
Acknowledgments The authors are thankful to David Marsan for his inspiring ideas, which helped to start this project. They thank Jean-Pierre Nadal, Stefan Knerr, and Brigitte Quenet for constructive criticisms on the manuscript and G. Arnold, G. Dreyfus, J. Gascuel, and M. Kerszberg for valuable discussions.
References

Abbott, L. F. 1990. Modulation of function and gated learning in a network memory. Proc. Natl. Acad. Sci. U.S.A. 87, 9241-9245.
Akers, R. P., and Getz, W. M. 1993. Response of olfactory receptor neurons in honeybees to odorants and their binary mixtures. J. Comp. Physiol. A 173, 169-185.
Arnold, G., Masson, C., and Budharugsa, S. 1985. Comparative study of the antennal afferent pathway of the workerbee and the drone (Apis mellifera L.). Cell Tissue Res. 242, 593-605.
Bicker, G. 1993. Chemical architecture of antennal pathways mediating proboscis extension learning in the honeybee. Apidologie 24, 235-248.
Erber, J. 1981. Neural correlates of learning in the honeybee. TINS 4, 270-273.
Erber, J., Masuhr, Th., and Menzel, R. 1980. Localization of short-term memory in the brain of the bee, Apis mellifera. Physiol. Entomol. 5, 343-358.
Erdi, P., Grobler, T., Barna, G., and Kaski, K. 1993. Dynamics of the olfactory bulb: Bifurcations, learning and memory. Biol. Cybern. 69, 57-66.
Esslen, J., and Kaissling, K. E. 1976. Zahl und Verteilung antennaler Sensillen bei der Honigbiene. Zoomorphologie 83, 227-251.
Fonta, C., Sun, X. J., and Masson, C. 1991. Cellular analysis of odour integration in the honeybee antennal lobe. In Behavior and Physiology of Bees, L. J. Goodman and R. C. Fisher, eds., pp. 227-241. CAB International, Wallingford, UK.
Fonta, C., Sun, X., and Masson, C. 1993. Morphology and spatial distribution of bee antennal lobe interneurons responsive to odours. Chem. Senses 18(2), 101-119.
Gascuel, J., and Masson, C. 1991. Quantitative electron microscopic study of the antennal lobe in the honeybee. Tissue Cell 23, 341-355.
Hasselmo, M. E. 1993. Acetylcholine and learning in a cortical associative memory. Neural Comp. 5, 32-44.
Hasselmo, M. E., Anderson, B. P., and Bower, J. M. 1992. Cholinergic modulation of cortical associative memory function. J. Neurophysiol. 67(5), 1230-1246.
Kerszberg, M., and Masson, C. 1995. Signal induced selection among spontaneous activity patterns of bee's olfactory glomeruli. Biol. Cybern. 72, 487-495.
Laurent, G., and Davidowitz, H. 1994. Encoding of olfactory information with oscillating neural assemblies. Science 265, 1872-1875.
Li, Z. 1990. A model of olfactory adaptation and sensitivity in the olfactory bulb. Biol. Cybern. 62, 349-361.
Li, Z., and Hopfield, J. J. 1989. Modeling the olfactory bulb and its neural oscillatory processings. Biol. Cybern. 61, 379-392.
Linster, C., and Masson, C. 1994. Odor processing in the honeybee's antennal lobe glomeruli: Modelling sensory memory. In Computational Neural Systems, F. H. Eeckman, ed. Kluwer Academic Publishers, Boston (accepted for publication).
Linster, C., Marsan, D., Masson, C., and Kerszberg, M. 1994. Odor processing in the bee: A preliminary study of the role of central input to the antennal lobe. In Advances in Neural Information Processing Systems 6, J. D. Cowan, G. Tesauro, and J. Alspector, eds., pp. 527-534. Morgan Kaufmann, San Mateo, CA.
Malun, D. 1991a. Inventory and distribution of synapses of identified uniglomerular projection neurons in the antennal lobe of Periplaneta americana. J. Comp. Neurol. 305, 348-360.
Malun, D. 1991b. Synaptic relationships between GABA-immunoreactive neurons and an identified uniglomerular projection neuron in the antennal lobe of Periplaneta americana: A double labeling electron microscopic study. Histochemistry 96, 197-207.
Masson, C. 1977. Central olfactory pathways and plasticity of responses to odor stimuli in insects. In Olfaction and Taste VI, J. Le Magnen and P. MacLeod, eds., pp. 305-314. IRL, London.
Masson, C., and Linster, C. 1995. Towards a cognitive understanding of odor discrimination. Behav. Processes, Vol. 35 (in press).
Masson, C., and Linster, C. 1994. Towards a cognitive understanding of odor discrimination. Behav. Processes, special issue: Cognition and Evolution (in press).
Masson, C., Pham-Delegue, M. H., Fonta, C., Gascuel, J., Arnold, G., Nicolas, G., and Kerszberg, M. 1993. Recent advances in the concept of adaptation to natural odour signals in the honeybee Apis mellifera L. Apidologie 24, 169-194.
Menzel, R. 1983. Neurobiology of learning and memory: The honeybee as a model system. Naturwissenschaften 70, 504-511.
Menzel, R. 1984. Short-term memory in bees. In Primary Neural Substrates of Learning and Behavioral Change, D. L. Alkon and J. Farley, eds., pp. 259-274. Cambridge University Press.
Menzel, R., Michelsen, B., Rüffer, P., and Sugawa, M. 1988. Neuropharmacology of learning and memory in honey bees. In Modulation of Synaptic Transmission and Plasticity in Nervous Systems, G. Herting and H. C. Spatz, eds., pp. 333-350. NATO ASI Series H19.
Mercer, A., and Menzel, R. 1982. The effects of biogenic amines on conditioned and unconditioned responses to olfactory stimuli in the honeybee Apis mellifera. J. Comp. Physiol. 145, 363-368.
Michelsen, B. D. 1988. Catecholamines affect storage and retrieval of conditioned odour stimuli in honeybees. Comp. Biochem. Physiol. 91C, 479-482.
Mobbs, P. G. 1984. Neural networks in the mushroom bodies of the honeybee. J. Insect Physiol. 30(1), 43-58.
Nicolas, G., Arnold, G., Patte, F., and Masson, C. 1993. Regional distribution of [3H]2-deoxyglucose uptake in the worker honeybee antennal lobe. C. R. Acad. Sci. Paris 316, 1245-1249.
Sun, X. J. 1991. Caracterisation electrophysiologique et morphologique des neurones olfactifs du lobe antennaire de l'abeille, Apis mellifera. Thèse de Doctorat de l'Université de Paris-Sud, Centre d'Orsay, France.
Sun, X., Fonta, C., and Masson, C. 1993. Odour quality processing by bee antennal lobe neurons. Chem. Senses 18(4), 355-377.
Vareschi, E. 1971. Duftunterscheidung bei der Honigbiene. Einzelzell-Ableitungen und Verhaltensreaktion. Z. Vergl. Physiol. 75, 143-173.
Zipser, D. 1991. Recurrent network model of the neural mechanism of short-term memory activity. Neural Comp. 3, 179-193.
Received June 22, 1991, accepted March 15, 1995
Communicated by Richard Lippmann
A Spherical Basis Function Neural Network for Modeling Auditory Space

Rick L. Jenison
Kate Fissell
Department of Psychology, University of Wisconsin, Madison, WI 53706 USA

This paper describes a neural network for approximation problems on the sphere. The von Mises basis function is introduced, whose activation depends on polar rather than Cartesian input coordinates. The architecture of the von Mises Basis Function (VMBF) neural network is presented along with the corresponding gradient-descent learning rules. The VMBF neural network is used to solve a particular spherical problem of approximating acoustic parameters used to model perceptual auditory space. This model ultimately serves as a signal processing engine to synthesize a virtual auditory environment under headphone listening conditions. Advantages of the VMBF over standard planar Radial Basis Functions (RBFs) are discussed.

1 Introduction
Artificial neural networks and approximation techniques typically have been applied to problems conforming to an orthogonal Cartesian input space. In this paper we present a neural network operating on a problem in acoustics whose input space is best represented in spherical (or polar), rather than Cartesian coordinates. The neural network employs a novel basis function, the von Mises function, which is well adapted to spherical input, within the standard Radial Basis Function (RBF) architecture. The primary advantage of a basis on a sphere, rather than on a plane, is the natural constraint of periodicity and singularity at the poles. The RBF network architecture using a single layer of locally tuned units (basis functions) covering a multidimensional space is now well known (e.g., Broomhead and Lowe 1988; Moody and Darken 1988, 1989; Poggio and Girosi 1990). Gradient-descent learning rules akin to backpropagation that move and shape the basis functions to minimize the output error have also been proposed (Poggio and Girosi 1990; Hartman and Keeler 1991). Our network was constructed to synthesize a continuous map of acoustic parameters used to simulate the virtual experience of free-field spatial hearing under headphones. The actual signal processing details

Neural Computation 8, 115-128 (1996). © 1995 Massachusetts Institute of Technology
used to synthesize auditory space will for now be deferred, with the main focus being on the description of the spherical neural network and corresponding learning rules. It is anticipated that this neural network will have general application to functional approximation problems on the sphere, such as inverse kinematics of spherical mechanisms (Chiang 1988) and global weather prediction (Wahba and Wendelberger 1980; Ghil et al. 1981).

2 von Mises Basis Function (VMBF) Network
The network basis function is based on a spherical probability density function (p.d.f.) that has been used to model line directions distributed unimodally with rotational symmetry. The function is well known in the spherical inferential statistics literature and commonly referred to as either the von Mises-Arnold-Fisher or Fisher distribution [see Mardia (1972) or Fisher et al. (1987)]. The kernel form of the p.d.f. was first introduced by Arnold (1941) in his unpublished doctoral dissertation. Historically, the function has served as an assumed parametric distribution from which spherical data are sampled; but we are not aware of its use as a basis function in the approximation theory literature. The expression for the von Mises basis function, dropping the constant of proportionality and elevational weighting factor from the p.d.f., is

VM(θ, φ, α, β, κ) = e^{κ[sin φ sin β cos(θ − α) + cos φ cos β]}   (2.1)

where the input parameters correspond to a sample location in azimuth and elevation (θ, φ), a centroid in azimuth and elevation (α, β), and the concentration parameter κ. Application of the von Mises function requires an azimuthal range in radians from 0 to 2π and an elevational range from 0 to π. Any sample (θ, φ) on the sphere will induce an output from each VMBF proportional to the solid angle between the sample and the centroid of the VMBF (α, β). The azimuthal periodicity of the basis function is driven by the cos(θ − α) term, which will be maximal when θ = α. The (sin φ sin β) term modulates the azimuthal term in the elevational plane, hence the requirement that φ range from 0 to π. As the sample elevation or the centroid elevation approaches either pole (0 or π), the multiplicative effect of (sin φ sin β) progressively eliminates the contribution of azimuthal variation and the (cos φ cos β) term dominates. κ is a shape parameter called the concentration parameter, where the larger the value the narrower the function width after transformation by the expansive function e. While other spherical functions have been proposed for approximation on the sphere [e.g., thin-plate pseudosplines (Wahba 1981)], the VMBF serves as a convenient spherical analogue of the well-known multidimensional gaussian on a plane (see Fig. 1). It behaves in a similar fashion to the planar gaussian with the centroid corresponding
Figure 1: Three-dimensional rendering of two von Mises basis functions positioned on the unit sphere. The basis functions are free to move about the sphere, changing adaptively in position, width, and magnitude.
to the mean and 1/κ corresponding to the standard deviation. It differs from the thin-plate spline in that it has a parameter for controlling the width or concentration of the basis function, which allows the VMBF to focus resolution optimally. The von Mises basis function serves as the activation function of the hidden layer units conforming to the RBF architecture as shown in Figure 2. The output of the ith output node, f_i(θ, φ), when spherical coordinates are presented as input, is given by
f_i(θ, φ) = Σ_j w_ij VM(θ, φ, α_j, β_j, κ_j)   (2.2)
where VM(θ, φ, α_j, β_j, κ_j) is the output of the jth von Mises basis function and w_ij is the weight connecting the jth basis function with the ith output node.
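Equations 2.1 and 2.2 can be sketched directly in code. The vectorized layout and the omission of the bias unit shown in Figure 2 are simplifications for illustration, not the authors' implementation.

```python
import numpy as np

def vmbf(theta, phi, alpha, beta, kappa):
    """Von Mises basis function of equation 2.1 (constant of
    proportionality dropped). theta, alpha: azimuth in [0, 2*pi);
    phi, beta: elevation in [0, pi]; kappa: concentration."""
    return np.exp(kappa * (np.sin(phi) * np.sin(beta) * np.cos(theta - alpha)
                           + np.cos(phi) * np.cos(beta)))

def forward(theta, phi, alphas, betas, kappas, W):
    """Network output of equation 2.2; W[i, j] connects basis j to node i."""
    return W @ vmbf(theta, phi, alphas, betas, kappas)

# the activation peaks when the sample coincides with the centroid,
# and azimuth is periodic with period 2*pi
assert vmbf(1.0, 1.0, 1.0, 1.0, 5.0) > vmbf(1.5, 1.0, 1.0, 1.0, 5.0)
assert np.isclose(vmbf(0.1, 1.0, 1.0, 1.0, 5.0),
                  vmbf(0.1 + 2 * np.pi, 1.0, 1.0, 1.0, 5.0))

rng = np.random.default_rng(0)
alphas = rng.uniform(0, 2 * np.pi, 4)     # 4 basis functions on the sphere
betas = rng.uniform(0, np.pi, 4)
kappas = np.full(4, 3.0)
W = rng.normal(size=(2, 4))               # 2 output nodes
y = forward(0.5, 1.2, alphas, betas, kappas, W)
assert y.shape == (2,)
```

The two assertions exercise the properties the text emphasizes: a maximum at the centroid and azimuthal periodicity.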
Figure 2: Architecture of the von Mises Basis Function (VMBF) neural network.

3 Parameter Learning
To optimize the approximation function given a fixed number of basis functions we apply a gradient-descent method on an error function to update the parameters of the network. In our case we require the sum-of-squared-error to be minimized. This technique has been applied successfully to gaussian RBF neural networks (Moody and Darken 1989; Poggio and Girosi 1990; Hartman and Keeler 1991) and we derive the analogous equations for the von Mises basis here. The error function for the pth M-dimensional training pattern is
E_p = (1/2) Σ_{i=1}^{M} [t_{ip} − f_i(θ, φ; Ω_p)]²   (3.1)

where t_{ip} is the ith element of the pth training pattern. The notation of the network output f_i(θ, φ; Ω_p) includes Ω_p, which is a vector of the changing network parameters.
Parameter values are learned through successive presentation of input-output pairs using the well-known update rule

Ω_i^(new) = Ω_i^(old) + η ΔΩ_i   (3.3)
where η corresponds to the learning rate, which can also be updated as learning proceeds. Obtaining ΔΩ_i involves computing the error gradient by differentiating the error function with respect to each free parameter in the network. These derivations are relatively straightforward using
the chain rule and algebra; however, they are rather involved with respect to the bookkeeping of indices. For convenience Ω_i is omitted from the following derivations. The update expression for w_ij is just the Widrow-Hoff learning rule

Δw_ij = [t_i − f_i(θ, φ)] VM(θ, φ; α_j, β_j, κ_j)   (3.4)
The update expression for the azimuthal movement of the basis center can be derived in a similar fashion

Δα_j = κ_j sin φ sin β_j sin(θ − α_j) Σ_{i=1}^{M} {[t_i − f_i(θ, φ)] w_ij} VM(θ, φ; α_j, β_j, κ_j)   (3.5)
as can the update expression for elevational movement Δβ_j

Δβ_j = κ_j [sin φ cos β_j cos(θ − α_j) − cos φ sin β_j] Σ_{i=1}^{M} {[t_i − f_i(θ, φ)] w_ij} VM(θ, φ; α_j, β_j, κ_j)   (3.6)
and finally the concentration parameter

Δκ_j = [sin φ sin β_j cos(θ − α_j) + cos φ cos β_j] Σ_{i=1}^{M} {[t_i − f_i(θ, φ)] w_ij} VM(θ, φ; α_j, β_j, κ_j)   (3.7)
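The parameter updates above amount to one gradient-descent step obtained by pushing the output error back through the exponential via the chain rule. A minimal sketch, assuming the unnormalized von Mises form VM_j = exp(κ_j u_j) with u_j = sin φ sin β_j cos(θ − α_j) + cos φ cos β_j and a sum-of-squared-error criterion; the code and names are our own, not the authors' implementation:

```python
import numpy as np

def vmbf_gradient_step(theta, phi, t, alphas, betas, kappas, W, eta):
    """One gradient-descent update of all free VMBF parameters for a
    single training pattern (theta, phi) with target vector t, assuming
    VM_j = exp(kappa_j * u_j), u_j = sin(phi)sin(beta_j)cos(theta - alpha_j)
    + cos(phi)cos(beta_j), and E = 0.5 * sum_i (t_i - f_i)^2."""
    u = (np.sin(phi) * np.sin(betas) * np.cos(theta - alphas)
         + np.cos(phi) * np.cos(betas))
    vm = np.exp(kappas * u)                 # basis outputs, shape (J,)
    f = W @ vm                              # network outputs, shape (M,)
    err = t - f                             # output errors
    # back-propagated error shared by all basis-function parameters:
    # delta_j = sum_i [t_i - f_i] w_ij * VM_j
    delta = (err @ W) * vm
    # Widrow-Hoff rule for the output weights (-dE/dw_ij = err_i VM_j)
    W_new = W + eta * np.outer(err, vm)
    # chain rule through u_j for azimuth, elevation, and concentration
    du_dalpha = np.sin(phi) * np.sin(betas) * np.sin(theta - alphas)
    du_dbeta = (np.sin(phi) * np.cos(betas) * np.cos(theta - alphas)
                - np.cos(phi) * np.sin(betas))
    alphas_new = alphas + eta * delta * kappas * du_dalpha
    betas_new = betas + eta * delta * kappas * du_dbeta
    kappas_new = kappas + eta * delta * u
    return alphas_new, betas_new, kappas_new, W_new
```

With a small learning rate, one such step reduces the pattern error whenever the gradient is nonzero.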
The degree of improvement realized by the gradient-descent optimization of the parameter vector Ω over a direct matrix pseudoinverse solution of w_ij alone depends on the particular problem as well as the number of basis functions available. We have typically observed about a 2-fold improvement in the final root-mean-square (RMS) error for networks with a small number of basis functions, which will be discussed for our specific application in the following section.

4 Approximating Auditory Space
The VMBF neural network is ideally suited to the problem of learning a continuous mapping from spherical coordinates to acoustic parameters that specify sound source direction. The spatial position of a sound source in a listener's environment is specified by a number of factors related to the interaction of sound pressure waves with the pinna (external ear), head, and upper body. These interactions can be described mathematically in the form of a linear transformation, commonly referred to
as a "head-related transfer function," or HRTF. HRTFs are measured by sampling the air pressure variation in time near the eardrum as a function of known sound source location. Generally, HRTFs are visualized in the frequency (spectral) domain rather than in the Fourier-equivalent time domain, because it is a more meaningful way of characterizing acoustic processing by the auditory system.² Henceforth, an n-dimensional vector will represent the HRTF, where each element of the vector corresponds to a discrete sample in frequency. Figure 3 illustrates a typical spherical grid of speaker placements used to position sound sources, preferably within an anechoic environment. Studies of human sound localization most often use a coordinate system in degrees with the origin (0°, 0°) located at the intersection of the horizon and the medial sagittal plane directly in front of the listener. We adhere to this convention for the remaining discussion, aware that coordinates must be converted from degrees to radians under the range constraints imposed by the VMBF (equation 2.1). Enforcement of these constraints is accomplished by mapping −180° ≤ θ < +180° to 0 ≤ θ < 2π and −90° ≤ φ ≤ +90° to 0 ≤ φ ≤ π. HRTFs change in complicated, but systematic, ways as a function of sound direction relative to each ear. The HRTFs completely specify the sound source location because the measurements are made near the point where sound is transduced by the auditory system. The measurement and analysis techniques are well developed and have been detailed in the psychoacoustic literature (Wightman and Kistler 1989; Middlebrooks et al. 1989; Pralong and Carlile 1994). One practical use of these measurements is to create the sense of a sound coming from any "virtual" location under a headphone listening condition.
This is accomplished by convolving an HRTF with any digitally recorded sound, effectively simulating the actual spectral filtering of the individual's external ear and inducing an apparent location of the sound. Using the VMBF neural network to create the parameters for a virtual environment affords the ability to synthesize a set of HRTFs (for each ear) for any location in space, not just the spherical locations where measurements were obtained. Furthermore, the continuous modeled environment affords smooth auditory motion (sound trajectory). Others have recently applied standard regularization techniques to the problem of HRTF approximation (Chen et al. 1993), using a basis of planar thin-plate splines rather than spherical basis functions, and without benefit of gradient-descent learning. Their approximation performance is consistent with that of the standard class of planar RBF networks, hence subject to the distortions addressed by this paper. From both a practical as well as a theoretical standpoint, the measured HRTFs are of much higher dimensionality than necessary for modeling purposes. The redundancy inherent in the measurement can be statistically removed via the technique of principal component analysis (see Kistler and Wightman 1992).

²Vision, on the other hand, is more naturally considered in spatial terms due to the retinotopic projection.

Figure 3: Coordinate system for sampling head-related transfer functions (HRTFs). Measurements are recorded near the eardrum of a subject placed in the center of a spherical speaker array. Azimuth is denoted by θ (ranging periodically from −180° to +180°, or 0 to 2π) and elevation by φ (ranging from −90° to +90°, or 0 to π). The origin (0°, 0°) of the coordinate system is directly in front of the listener.

Each HRTF magnitude spectrum consists of 256 real-valued frequency components for each ear and each sound source location in space, hence, a 256-dimensional vector. From the set of all measurements of sound source locations, we are able to reduce the dimensionality of the frequency components to six principal components (for each ear and location) that account for 98% of the spectral variance. This linear operation allows the use of six output units in the network rather than 256 frequency components while losing only a negligible
amount of information. Reconstruction to the original dimensionality of the HRTFs is a straightforward linear inverse transformation.

5 Learning Auditory Space
450 HRTF measurements from a single human ear served as the database for training a VMBF neural network.³ These measurements, corresponding to discrete locations on the sphere, ranged from −170° to 180° in azimuth and −50° to 90° in elevation in 10° steps. At higher elevations the solid angle spanned is less, thus requiring fewer samples. The six principal components described above are the M elements of the training pattern t_p. The total number of training patterns was 400, with 50 random patterns reserved for test data (i.e., data fed forward through the network without parameter updating). The network parameters were initialized by positioning the basis functions uniformly on the input space with a small degree of relative overlap and solving the output weights w_ij with a single pseudoinverse. Iterative training then proceeded with the successive presentation of training patterns, evaluation of gradient-descent equations, and update of free parameters. Figure 4 shows the training and intermediate testing history for 2500 epochs for a VMBF network with 9 basis functions. The initial point of the learning curves corresponds to the RMS error immediately following the initialization of the VMBF network parameters. Improvement due to learned basis function positions and shapes is clearly evident in Figure 4. The RMS testing error is only slightly worse than that of the training data, which demonstrates reasonable generalization to novel data. Good generalization is desirable since the training data themselves are somewhat noisy. Figure 5 illustrates two views of the final positions and widths of the nine von Mises basis functions following gradient-descent learning. This particular network was trained on measurements taken from the right ear. Due to the acoustical interaction of the head with the radiating sound pressure wave, most of the variability, hence available information, in the HRTFs occurs when the sound source is on the same side of the head.
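The initialization described above (fixed, uniformly spread basis functions plus a single pseudoinverse solve for the output weights) can be sketched as follows. The von Mises form, the basis placements, and the random stand-in data are our own illustrative assumptions, not the authors' actual database or code:

```python
import numpy as np

# Hypothetical illustration: fix the basis-function positions, evaluate
# the basis matrix on all training directions, and solve the output
# weights in one shot via least squares (pseudoinverse).

rng = np.random.default_rng(0)
J, M, P = 9, 6, 400                      # basis functions, outputs, patterns

# training directions (radians) and stand-in PCA targets
thetas = rng.uniform(0.0, 2 * np.pi, P)
phis = rng.uniform(0.0, np.pi, P)
T = rng.standard_normal((P, M))

# uniformly spread basis centers with moderate overlap
alphas = np.linspace(0.0, 2 * np.pi, J, endpoint=False)
betas = np.full(J, np.pi / 2)
kappas = np.full(J, 2.0)

# basis matrix G[p, j] = VM_j(theta_p, phi_p), von Mises form assumed
G = np.exp(kappas * (np.sin(phis)[:, None] * np.sin(betas)
                     * np.cos(thetas[:, None] - alphas)
                     + np.cos(phis)[:, None] * np.cos(betas)))

# W solves G W^T ~= T in the least-squares (pseudoinverse) sense
W = np.linalg.lstsq(G, T, rcond=None)[0].T   # shape (M, J)
```

Gradient descent then refines W together with the basis positions and concentrations, which a pseudoinverse alone cannot adjust.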
This asymmetry is reflected in the final positioning of the basis functions as shown in Figure 5, and demonstrates the directness with which we can interpret the learned hidden unit weights (positions and widths). Planar gaussian RBF networks applied to this spherical problem generally perform as well as the VMBF network in regions near the center of the planar projected input space, but perform less well near the edges. As learning proceeds, the centers of the gaussian basis functions move inward due to the artificially absent data beyond the edges. The periodic VMBF network is immune to this bias due to its intrinsic lack of edges.

³HRTFs were obtained from Drs. Wightman and Kistler, Hearing Development Research Laboratory, University of Wisconsin-Madison.
Figure 4: Root-mean-squared error for training and testing data as a function of training epochs.
To demonstrate this immunity, 40 input-output pairs were systematically split off from the original database of 450 in the region near an arbitrarily defined edge (±180° azimuth) for use as novel probe data (in contrast with a distribution of randomly selected test data). The average magnitudes of the probe data set and the training data set were equalized. The top panel of Figure 6 shows the progressive decline in RMS error as training progresses for a gaussian RBF network and a VMBF network with the remaining 410 input-output pairs. Both networks have nine basis functions and the appropriate gradient-descent update rules. The bottom panel shows the initial decline in RMS error of the testing data. As the gaussian RBF network learns the training set, the RMS error of the probe data rises dramatically. In contrast, the VMBF network training generalizes well to the probe data due to its intrinsic periodicity. For this particular set of novel probe data, the gaussian RBF network must extrapolate beyond the artificial edge of the training data, while the VMBF network performs a spherically constrained interpolation. Figure 7 illustrates the principal component surfaces approximated by
Figure 5: Positions and relative widths of the von Mises basis functions from a 9-basis function VMBF network shown from two viewpoints: (a) 45° to the right of the median line and (b) directly overhead. The displayed width of each basis function is determined by a cross-sectional cut 25% down from the peak of each basis function, which by definition is located at the centroid.
Figure 6: Comparison of gaussian RBF network and VMBF network generalization to novel probe data selected in the region near an arbitrary edge (±180° azimuth). Gaussian RBF error is denoted by the fine line and VMBF error is denoted by the bold line.
126
Rick L. Jenison and Kate Fissell
a 25-basis function VMBF network.

Figure 7: Final approximation surfaces for a 25-basis function VMBF. (A, B) The database of HRTF principal components I (A) and II (B) as a function of direction (azimuth and elevation) in steps of 10°. The actual network contained 6 output units (i.e., four more in addition to the 2 shown) corresponding to the 6 principal components derived from the total 450 256-dimensional HRTFs. (C, D) The results of VMBF network training. The predicted principal components I (C) and II (D) are shown in increments of 5°.

Figures 7A and 7B show the known first and second principal components (I and II), respectively, as a function of azimuth and elevation location, which together account for 93% of the total variance of the measured 450 HRTFs. Note that measurements were not taken below −50° elevation due to technical constraints in obtaining those samples. Figures 7C and 7D show the principal components predicted by the VMBF network as a function of spherical input for the entire range of positions. The data are plotted on a two-dimensional Cartesian grid with edges, rather than on the more difficult to visualize spherical grid. Therefore, the graphic representation of samples near the poles will naturally distort in the fashion well known to cartographers. For example, all of the samples at −90° (south pole) are the same measurement regardless of azimuth, hence an isomorphic line of output from
the network should, and does, occur at this elevation. This singularity can be observed in Figures 7C and 7D. Due to the spherical topology of the basis function, this constraint is fundamental to the VMBF network; in contrast, the gaussian RBF network operating in Cartesian space would not enforce this constraint. Smoothing of the measured data as a consequence of the well-trained approximation function is also evident in Figures 7C and 7D.

6 Conclusions
The model of auditory space represents a good example of tailoring a particular neural network architecture (or basis function) to the appropriate input representation, in this case a spherical representation. Because of the periodic nature of the spherical basis, there are no edge effects that arise when using the multidimensional gaussian for approximation in Cartesian space. Well-behaved networks are obtained as a result of this spherical constraint. Work is ongoing to better characterize the hidden-layer subspace and learning dynamics. These algorithms currently serve as the foundation for ongoing research into implementing auditory virtual environments. We are particularly interested in how a human listener could be integrated into the network learning loop for individual tuning of an auditory space model trained on another person’s set of HRTFs, thereby eliminating the need for technically demanding acoustic measurements.
Acknowledgments This work was supported in part by the Wisconsin Alumni Research Foundation and the Office of Naval Research. We greatly appreciate the database of HRTFs provided by Fred Wightman and Doris Kistler.
References

Arnold, K. J. 1941. On spherical probability distributions. Unpublished Ph.D. thesis, Massachusetts Institute of Technology.
Broomhead, D. S., and Lowe, D. 1988. Multivariable functional interpolation and adaptive networks. Complex Syst. 2, 321-355.
Chen, J., Van Veen, B. D., and Hecox, K. E. 1993. Synthesis of 3D virtual auditory space via a spatial feature extraction and regularization model. In IEEE Virtual Reality Annu. Int. Symp., 188-193.
Chiang, C. H. 1988. Kinematics of Spherical Mechanisms. Cambridge University Press, Cambridge.
Fisher, N. I., Lewis, T., and Embleton, B. J. J. 1987. Statistical Analysis of Spherical Data. Cambridge University Press, Cambridge.
Ghil, M., Cohn, S., Tavantzis, J., Bube, K., and Isaacson, E. 1981. Applications of estimation theory to numerical weather prediction. In Dynamic Meteorology: Data Assimilation Methods, L. Bengtsson, M. Ghil, and E. Kallen, eds., pp. 139-284. Springer-Verlag, New York.
Hartman, E. J., and Keeler, J. D. 1991. Predicting the future: Advantages of semilocal units. Neural Comp. 3, 566-578.
Kistler, D. J., and Wightman, F. L. 1992. A model of head-related transfer functions based on principal component analysis and minimum phase reconstruction. J. Acoust. Soc. Am. 91, 1637-1647.
Mardia, K. V. 1972. Statistics of Directional Data. Academic Press, London.
Middlebrooks, J. C., Makous, J. C., and Green, D. M. 1989. Directional sensitivity of sound-pressure levels in the human ear canal. J. Acoust. Soc. Am. 86, 89-108.
Moody, J., and Darken, C. 1988. Learning with localized receptive fields. In Connectionist Models Summer School.
Moody, J., and Darken, C. J. 1989. Fast learning in networks of locally-tuned processing units. Neural Comp. 1, 281-294.
Poggio, T., and Girosi, F. 1990. Networks for approximation and learning. Proc. IEEE 78, 1481-1496.
Powell, M. J. D. 1987. Radial basis functions for multivariable interpolation: A review. In Algorithms for Approximation, J. C. Mason and M. G. Cox, eds., pp. 143-166. Clarendon Press, Oxford.
Pralong, D., and Carlile, S. 1994. Measuring the human head-related transfer functions: A novel method for the construction and calibration of a miniature "in-ear" recording system. J. Acoust. Soc. Am. 95, 3435-3444.
Wahba, G., and Wendelberger, J. 1980. Some new mathematical methods for variational objective analysis using splines and cross-validation. Monthly Weather Rev. 108, 1122-1145.
Wahba, G. 1981. Spline interpolation and smoothing on the sphere. SIAM J. Sci. Stat. Comput. 2, 5-16.
Wightman, F. L., and Kistler, D. J. 1989. Headphone simulation of free-field listening I: Stimulus synthesis. J. Acoust. Soc. Am. 85, 858-867.
Received July 29, 1994; accepted March 20, 1995.
Communicated by Steve Nowlan
On Convergence Properties of the EM Algorithm for Gaussian Mixtures

Lei Xu
Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139 USA, and Department of Computer Science, The Chinese University of Hong Kong, Hong Kong

Michael I. Jordan
Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139 USA

We build up the mathematical connection between the "Expectation-Maximization" (EM) algorithm and gradient-based approaches for maximum likelihood learning of finite gaussian mixtures. We show that the EM step in parameter space is obtained from the gradient via a projection matrix P, and we provide an explicit expression for the matrix. We then analyze the convergence of EM in terms of special properties of P and provide new results analyzing the effect that P has on the likelihood surface. Based on these mathematical results, we present a comparative discussion of the advantages and disadvantages of EM and other algorithms for the learning of gaussian mixture models.

1 Introduction
The "Expectation-Maximization" (EM) algorithm is a general technique for maximum likelihood (ML) or maximum a posteriori (MAP) estimation. The recent emphasis in the neural network literature on probabilistic models has led to increased interest in EM as a possible alternative to gradient-based methods for optimization. EM has been used for variations on the traditional theme of gaussian mixture modeling (Ghahramani and Jordan 1994; Nowlan 1991; Xu and Jordan 1993a,b; Tresp et al. 1994; Xu et al. 1994) and has also been used for novel chain-structured and tree-structured architectures (Bengio and Frasconi 1995; Jordan and Jacobs 1994). The empirical results reported in these papers suggest that EM has considerable promise as an optimization method for such architectures. Moreover, new theoretical results have been obtained that link EM to other topics in learning theory (Amari 1994; Jordan and Xu 1995; Neal and Hinton 1993; Xu and Jordan 1993c; Yuille et al. 1994). Despite these developments, there are grounds for caution about the promise of the EM algorithm.

Neural Computation 8, 129-151 (1996)
© 1995 Massachusetts Institute of Technology

One reason for caution comes from consideration of theoretical convergence rates, which show that EM is a first-order algorithm.¹ More precisely, there are two key results available in the statistical literature on the convergence of EM. First, it has been established that under mild conditions EM is guaranteed to converge toward a local maximum of the log likelihood l (Boyles 1983; Dempster et al. 1977; Redner and Walker 1984; Wu 1983). (Indeed the convergence is monotonic: l(Θ^(k+1)) ≥ l(Θ^(k)), where Θ^(k) is the value of the parameter vector Θ at iteration k.) Second, considering EM as a mapping Θ^(k+1) = M(Θ^(k)) with fixed point Θ* = M(Θ*), we have Θ^(k+1) − Θ* ≈ [∂M(Θ*)/∂Θ*](Θ^(k) − Θ*) when Θ^(k+1) is near Θ*, and thus
||Θ^(k+1) − Θ*|| ≤ ||∂M(Θ*)/∂Θ*|| ||Θ^(k) − Θ*||

with

||∂M(Θ*)/∂Θ*|| ≠ 0
almost surely. That is, EM is a first-order algorithm.

The first-order convergence of EM has been cited in the statistical literature as a major drawback. Redner and Walker (1984), in a widely cited article, argued that superlinear (quasi-Newton, method of scoring) and second-order (Newton) methods should generally be preferred to EM. They reported empirical results demonstrating the slow convergence of EM on a gaussian mixture model problem for which the mixture components were not well separated. These results did not include tests of competing algorithms, however. Moreover, even though the convergence toward the "optimal" parameter values was slow in these experiments, the convergence in likelihood was rapid. Indeed, Redner and Walker acknowledge that their results show that "even when the component populations in a mixture are poorly separated, the EM algorithm can be expected to produce in a very small number of iterations parameter values such that the mixture density determined by them reflects the sample data very well." In the context of the current literature on learning, in which the predictive aspect of data modeling is emphasized at the expense of the traditional Fisherian statistician's concern over the "true" values of parameters, such rapid convergence in likelihood is a major desideratum of a learning algorithm and undercuts the critique of EM as a "slow" algorithm.

¹For an iterative algorithm that converges to a solution Θ*, if there is a real number γ₀ and a constant integer k₀ such that for all k > k₀ we have

||Θ^(k+1) − Θ*|| ≤ q ||Θ^(k) − Θ*||^γ₀

with q being a positive constant independent of k, then we say that the algorithm has a convergence rate of order γ₀. In particular, an algorithm has first-order or linear convergence if γ₀ = 1, superlinear convergence if 1 < γ₀ < 2, and second-order or quadratic convergence if γ₀ = 2.
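The distinction the footnote draws can be illustrated numerically with a toy example of our own (not from the paper): a simple contraction mapping converges linearly, with error ratio approaching a constant q, while a Newton iteration converges quadratically, roughly squaring the error (doubling the number of correct digits) at every step.

```python
import math

# Both iterations converge to sqrt(2), but at different orders.
target = math.sqrt(2.0)

# Quadratic (second-order): Newton's method for f(x) = x^2 - 2
x = 3.0
newton_errs = []
for _ in range(5):
    x = 0.5 * (x + 2.0 / x)
    newton_errs.append(abs(x - target))

# Linear (first-order): a contraction with rate q = 0.5
y = 3.0
linear_errs = []
for _ in range(5):
    y = 0.5 * y + 0.5 * target
    linear_errs.append(abs(y - target))

# Linear convergence: successive error ratios settle at q = 0.5.
ratios = [linear_errs[k + 1] / linear_errs[k] for k in range(4)]
# Quadratic convergence: err_{k+1} / err_k^2 stays bounded, so the
# error is roughly squared at each step.
quad = [newton_errs[k + 1] / newton_errs[k] ** 2 for k in range(3)]
```

After five steps the linearly convergent iterate has gained only a few digits, while the Newton iterate is already accurate to machine precision.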
In the current paper, we provide a comparative analysis of EM and other optimization methods. We emphasize the comparison between EM and other first-order methods (gradient ascent, conjugate gradient methods), because these have tended to be the methods of choice in the neural network literature. However, we also compare EM to superlinear and second-order methods. We argue that EM has a number of advantages, including its naturalness at handling the probabilistic constraints of mixture problems and its guarantees of convergence. We also provide new results suggesting that under appropriate conditions EM may in fact approximate a superlinear method; this would explain some of the promising empirical results that have been obtained (Jordan and Jacobs 1994), and would further temper the critique of EM offered by Redner and Walker. The analysis in the current paper focuses on unsupervised learning; for related results in the supervised learning domain see Jordan and Xu (1995).

The remainder of the paper is organized as follows. We first briefly review the EM algorithm for gaussian mixtures. The second section establishes a connection between EM and the gradient of the log likelihood. We then present a comparative discussion of the advantages and disadvantages of various optimization algorithms in the gaussian mixture setting. We then present empirical results suggesting that EM regularizes the condition number of the effective Hessian. The fourth section presents a theoretical analysis of this empirical finding. The final section presents our conclusions.

2 The EM Algorithm for Gaussian Mixtures
We study the following probabilistic model:

P(x | Θ) = Σ_{j=1}^{K} α_j P(x | m_j, Σ_j)   (2.1)

and

P(x | m_j, Σ_j) = (2π)^{−d/2} |Σ_j|^{−1/2} exp{−(1/2)(x − m_j)^T Σ_j^{−1} (x − m_j)}
where α_j ≥ 0 and Σ_{j=1}^{K} α_j = 1, and d is the dimension of x. The parameter vector Θ consists of the mixing proportions α_j, the mean vectors m_j, and the covariance matrices Σ_j. Given K and given N independent, identically distributed samples {x^(t)}_{t=1}^{N}, we obtain the following log likelihood:²

l(Θ) = Σ_{t=1}^{N} log P(x^(t) | Θ) = Σ_{t=1}^{N} log Σ_{j=1}^{K} α_j P(x^(t) | m_j, Σ_j)   (2.2)
²Although we focus on maximum likelihood (ML) estimation in this paper, it is straightforward to apply our results to maximum a posteriori (MAP) estimation by multiplying the likelihood by a prior.
which can be optimized via the following iterative algorithm (see, e.g., Dempster et al. 1977):

α_j^(k+1) = (1/N) Σ_{t=1}^{N} h_j^(k)(t)

m_j^(k+1) = [Σ_{t=1}^{N} h_j^(k)(t) x^(t)] / [Σ_{t=1}^{N} h_j^(k)(t)]

Σ_j^(k+1) = [Σ_{t=1}^{N} h_j^(k)(t) (x^(t) − m_j^(k+1))(x^(t) − m_j^(k+1))^T] / [Σ_{t=1}^{N} h_j^(k)(t)]   (2.3)
where the posterior probabilities h_j^(k)(t) are defined as follows:

h_j^(k)(t) = α_j^(k) P(x^(t) | m_j^(k), Σ_j^(k)) / [Σ_{i=1}^{K} α_i^(k) P(x^(t) | m_i^(k), Σ_i^(k))]   (2.4)
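These are the standard EM updates for a gaussian mixture and can be implemented directly. The sketch below (our own code, not the authors') performs one E step, computing the posterior responsibilities h_j(t), followed by one M step, updating the mixing proportions, means, and covariances:

```python
import numpy as np

def em_step(X, alphas, means, covs):
    """One EM iteration for a gaussian mixture.
    E step: posterior responsibilities h[t, j];
    M step: updated mixing proportions, means, and covariances."""
    N, d = X.shape
    K = len(alphas)
    # component densities P(x_t | m_j, Sigma_j), shape (N, K)
    dens = np.empty((N, K))
    for j in range(K):
        diff = X - means[j]
        inv = np.linalg.inv(covs[j])
        det = np.linalg.det(covs[j])
        quad = np.einsum('nd,de,ne->n', diff, inv, diff)
        dens[:, j] = np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * det)
    # E step: responsibilities normalized over components
    h = alphas * dens
    h /= h.sum(axis=1, keepdims=True)
    # M step: weighted sufficient statistics
    Nj = h.sum(axis=0)
    new_alphas = Nj / N
    new_means = (h.T @ X) / Nj[:, None]
    new_covs = []
    for j in range(K):
        diff = X - new_means[j]
        new_covs.append((h[:, j, None] * diff).T @ diff / Nj[j])
    return new_alphas, new_means, np.array(new_covs), h
```

Iterating this step monotonically increases the log likelihood, with no step size or line search required.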
3 Connection between EM and Gradient Ascent
In the following theorem we establish a relationship between the gradient of the log likelihood and the step in parameter space taken by the EM algorithm. In particular we show that the EM step can be obtained by premultiplying the gradient by a positive definite matrix. We provide an explicit expression for the matrix.

Theorem 1. At each iteration of the EM algorithm equation 2.3, we have

A^(k+1) = A^(k) + P_A^(k) (∂l/∂A)|_{A^(k)}   (3.1)

m_j^(k+1) = m_j^(k) + P_{m_j}^(k) (∂l/∂m_j)|_{m_j^(k)}   (3.2)

vec[Σ_j^(k+1)] = vec[Σ_j^(k)] + P_{Σ_j}^(k) (∂l/∂vec[Σ_j])|_{Σ_j^(k)}   (3.3)

where

P_A^(k) = (1/N) {diag[α_1^(k), ..., α_K^(k)] − A^(k) (A^(k))^T}   (3.4)

P_{m_j}^(k) = Σ_j^(k) / [Σ_{t=1}^{N} h_j^(k)(t)]   (3.5)

P_{Σ_j}^(k) = (2 / Σ_{t=1}^{N} h_j^(k)(t)) (Σ_j^(k) ⊗ Σ_j^(k))   (3.6)
where A denotes the vector of mixing proportions [α_1, ..., α_K]^T, j indexes the mixture components (j = 1, ..., K), k denotes the iteration number, "vec[B]" denotes the vector obtained by stacking the column vectors of the matrix B, and "⊗" denotes the Kronecker product. Moreover, given the constraints Σ_{j=1}^{K} α_j^(k) = 1 and α_j^(k) ≥ 0, P_A^(k) is a positive definite matrix and the matrices P_{m_j}^(k) and P_{Σ_j}^(k) are positive definite with probability one for N sufficiently large.
The proof of this theorem can be found in the Appendix. Using the notation Θ = [m_1^T, ..., m_K^T, vec[Σ_1]^T, ..., vec[Σ_K]^T, A^T]^T and P(Θ) = diag[P_{m_1}, ..., P_{m_K}, P_{Σ_1}, ..., P_{Σ_K}, P_A], we can combine the three updates in Theorem 1 into a single equation:

Θ^(k+1) = Θ^(k) + P(Θ^(k)) (∂l/∂Θ)|_{Θ^(k)}   (3.7)

Under the conditions of Theorem 1, P(Θ^(k)) is a positive definite matrix with probability one. Recalling that for a positive definite matrix B we have (∂l/∂Θ)^T B (∂l/∂Θ) > 0, we have the following corollary:
Corollary 1. For each iteration of the EM algorithm given by equation 2.3, the search direction Θ^(k+1) − Θ^(k) has a positive projection on the gradient of l.

That is, the EM algorithm can be viewed as a variable metric gradient ascent algorithm for which the projection matrix P(Θ^(k)) changes at each iteration as a function of the current parameter value Θ^(k).

Our results extend earlier results due to Baum and Sell (1968), who studied recursive equations of the following form:

x^(k+1) = T(x^(k)),   T(x^(k)) = [T(x^(k))_1, ..., T(x^(k))_K]

where x_j^(k) ≥ 0, Σ_{j=1}^{K} x_j^(k) = 1, and where T is generated from a polynomial J in x_j^(k) having positive coefficients. They showed that the search direction of this recursive formula, i.e., T(x^(k)) − x^(k), has a positive projection on the gradient of J with respect to x^(k) (see also Levinson et al. 1983). It can be shown that Baum and Sell's recursive formula implies the EM update formula for A in a gaussian mixture. Thus, the first statement in Theorem 1 is a special case of Baum and Sell's earlier work. However, Baum and Sell's theorem is an existence theorem and does not provide an explicit expression for the matrix P_A that transforms the gradient direction into the EM direction. Our theorem provides such an explicit form for P_A. Moreover, we generalize Baum and Sell's results to handle the updates for m_j and Σ_j, and we provide explicit expressions for the positive definite transformation matrices P_{m_j} and P_{Σ_j} as well.
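The mixing-proportion part of Theorem 1 is easy to check numerically: for a small one-dimensional mixture, the EM step α^(k+1) − α^(k) should coincide with P_A times the gradient of the log likelihood, assuming P_A has the form (1/N)[diag(α) − A Aᵀ] asserted by the theorem. The data and names below are our own illustration:

```python
import numpy as np

# Numerical check that the EM step for the mixing proportions equals
# the gradient premultiplied by P_A = (1/N) * (diag(alpha) - alpha alpha^T).

rng = np.random.default_rng(1)
N, K = 200, 3
X = rng.standard_normal(N)                      # 1-D data for simplicity

alphas = np.array([0.5, 0.3, 0.2])
means = np.array([-1.0, 0.0, 1.0])
var = 1.0                                       # shared unit variance

# component densities and posterior responsibilities h[t, j]
dens = np.exp(-0.5 * (X[:, None] - means) ** 2 / var) / np.sqrt(2 * np.pi * var)
h = alphas * dens
h /= h.sum(axis=1, keepdims=True)

# EM step: new mixing proportions minus current ones
em_delta = h.mean(axis=0) - alphas

# gradient of the log likelihood with respect to the mixing proportions:
# dl/dalpha_j = sum_t dens_j(t) / sum_i alpha_i dens_i(t) = sum_t h[t, j]/alpha_j
grad = (h / alphas).sum(axis=0)

P_A = (np.diag(alphas) - np.outer(alphas, alphas)) / N
proj_delta = P_A @ grad                         # should match em_delta
```

The two directions agree to machine precision, which is exactly the content of equation 3.1 restricted to the mixing proportions.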
It is also worthwhile to compare the EM algorithm to other gradient-based optimization methods. Newton's method is obtained by premultiplying the gradient by the inverse of the Hessian of the log likelihood:

Θ^(k+1) = Θ^(k) − H(Θ^(k))^{−1} (∂l/∂Θ)|_{Θ^(k)}

Newton's method is the method of choice when it can be applied, but the algorithm is often difficult to use in practice. In particular, the algorithm can diverge when the Hessian becomes nearly singular; moreover, the computational costs of computing the inverse Hessian at each step can be considerable. An alternative is to approximate the inverse by a recursively updated matrix B^(k+1) = B^(k) + η ΔB^(k). Such a modification is called a quasi-Newton method. Conventional quasi-Newton methods are unconstrained optimization methods, however, and must be modified to be used in the mixture setting (where there are probabilistic constraints on the parameters). In addition, quasi-Newton methods generally require that a one-dimensional search be performed at each iteration to guarantee convergence. The EM algorithm can be viewed as a special form of quasi-Newton method in which the projection matrix P(Θ^(k)) in equation 3.7 plays the role of B^(k). As we discuss in the remainder of the paper, this particular matrix has a number of favorable properties that make EM particularly attractive for optimization in the mixture setting.
4 Constrained Optimization and General Convergence
An important property of the matrix P is that the EM step in parameter space automatically satisfies the probabilistic constraints of the mixture model in equation 2.1. The domain of Θ contains two regions that embody the probabilistic constraints: V_1 = {Θ : Σ_{j=1}^{K} α_j = 1} and V_2 = {Θ : α_j ≥ 0, Σ_j is positive definite}. For the EM algorithm the update for the mixing proportions α_j can be rewritten as follows:

α_j^(k+1) = (1/N) Σ_{t=1}^{N} α_j^(k) P(x^(t) | m_j^(k), Σ_j^(k)) / [Σ_{i=1}^{K} α_i^(k) P(x^(t) | m_i^(k), Σ_i^(k))]

It is obvious that the iteration stays within V_1. Similarly, the update for Σ_j can be rewritten as:

Σ_j^(k+1) = [Σ_{t=1}^{N} h_j^(k)(t) (x^(t) − m_j^(k+1))(x^(t) − m_j^(k+1))^T] / [Σ_{t=1}^{N} h_j^(k)(t)]

which stays within V_2 for N sufficiently large. Whereas EM automatically satisfies the probabilistic constraints of a mixture model, other optimization techniques generally require modification to satisfy the constraints. One approach is to modify each iterative
step to keep the parameters within the constrained domain. A number of such techniques have been developed, including feasible direction methods, active sets, gradient projection, reduced-gradient, and linearly constrained quasi-Newton methods. These constrained methods all incur extra computational costs to check and maintain the constraints and, moreover, the theoretical convergence rates for such constrained algorithms need not be the same as those for the corresponding unconstrained algorithms. A second approach is to transform the constrained optimization problem into an unconstrained problem before using the unconstrained method. This can be accomplished via penalty and barrier functions, Lagrangian terms, or reparameterization. Once again, the extra algorithmic machinery renders simple comparisons based on unconstrained convergence rates problematic. Moreover, it is not easy to meet the constraints on the covariance matrices in the mixture using such techniques.

A second appealing property of P(Θ^(k)) is that each iteration of EM is guaranteed to increase the likelihood (i.e., l(Θ^(k+1)) ≥ l(Θ^(k))). This monotonic convergence of the likelihood is achieved without step-size parameters or line searches. Other gradient-based optimization techniques, including gradient descent, quasi-Newton, and Newton's method, do not provide such a simple theoretical guarantee, even assuming that the constrained problem has been transformed into an unconstrained one. For gradient ascent, the step size η must be chosen to ensure that

||Θ^(k+1) − Θ^(k)|| / ||Θ^(k) − Θ^(k−1)|| ≤ ||I + ηH(Θ^(k−1))|| < 1.

This requires a one-dimensional line search or an optimization of η at each iteration, which requires extra computation that can slow down the convergence. An alternative is to fix η to a very small value, which generally makes ||I + ηH(Θ^(k−1))|| close to one and results in slow convergence.
For Newton's method, the iterate is usually required to be near a solution; otherwise the Hessian may be indefinite and the iteration may not converge. Levenberg-Marquardt methods handle the indefinite Hessian matrix problem; however, a one-dimensional optimization or other form of search is required to find a suitable scalar to add to the diagonal elements of the Hessian. Fisher scoring methods can also handle the indefinite Hessian matrix problem, but for nonquadratic nonlinear optimization Fisher scoring requires a step size η that obeys ‖I + ηBH(Θ^{(k−1)})‖ < 1, where B is the Fisher information matrix. Thus, problems similar to those of gradient ascent arise here as well. Finally, for quasi-Newton methods or conjugate gradient methods, a one-dimensional line search is required at each iteration. In summary, all of these gradient-based methods incur extra computational costs at each iteration, rendering simple comparisons based on local convergence rates unreliable. For large-scale problems, algorithms that change the parameters immediately after each data point ("on-line algorithms") are often significantly faster in practice than batch algorithms. The popularity of gradient descent algorithms for neural networks is due in part to the ease of obtaining on-line variants of gradient descent. It is worth noting that on-line
Lei Xu and Michael Jordan
variants of the EM algorithm can be derived (Neal and Hinton 1993; Titterington 1984), and this is a further factor that weighs in favor of EM as compared to conjugate gradient and Newton methods.

5 Convergence Rate Comparisons
In this section, we provide a comparative theoretical discussion of the local convergence rates of constrained gradient ascent and EM. For gradient ascent a local convergence result can be obtained by Taylor expanding the log likelihood around the maximum likelihood estimate Θ*. For sufficiently large k we have

Θ^{(k+1)} − Θ* ≈ [I + ηH(Θ*)](Θ^{(k)} − Θ*)  (5.1)

and

‖Θ^{(k+1)} − Θ*‖ ≤ r ‖Θ^{(k)} − Θ*‖  (5.2)

where H is the Hessian of l, η is the step size, and

r = max{|1 − ηλ_m[−H(Θ*)]|, |1 − ηλ_M[−H(Θ*)]|}

where λ_M[A] and λ_m[A] denote the largest and smallest eigenvalues of A, respectively. Smaller values of r correspond to faster convergence rates. To guarantee convergence, we require r < 1, or 0 < η < 2/λ_M[−H(Θ*)]. The minimum possible value of r is obtained when η = 1/λ_M[−H(Θ*)], with

r_min = 1 − λ_m[−H(Θ*)]/λ_M[−H(Θ*)] = 1 − 1/κ[H]

where κ[H] = λ_M[H]/λ_m[H] is the condition number of H. Larger values of the condition number correspond to slower convergence. When κ[H] = 1 we have r_min = 0, which corresponds to a superlinear rate of convergence. Indeed, Newton's method can be viewed as a method for obtaining a more desirable condition number: the inverse Hessian H^{-1} balances the Hessian H such that the resulting condition number is one. Effectively, Newton's method can be regarded as gradient ascent on a new function with an effective Hessian that is the identity matrix: H_eff = H^{-1}H = I. In practice, however, κ[H] is usually quite large. The larger κ[H] is, the more difficult it is to compute H^{-1} accurately. Hence it is difficult to balance the Hessian as desired. In addition, as we mentioned in the previous section, the Hessian varies from point to point in the parameter space, and at each iteration we need to recompute the inverse Hessian. Quasi-Newton methods approximate H(Θ^{(k)})^{-1} by a positive definite matrix B^{(k)} that is easy to compute.

The discussion thus far has treated unconstrained optimization. To compare gradient ascent with the EM algorithm on the constrained mixture estimation problem, we consider a gradient projection method:

Θ^{(k+1)} = Θ^{(k)} + ηΠ_k (∂l/∂Θ^{(k)})  (5.3)

where Π_k is the projection matrix that projects the gradient ∂l/∂Θ^{(k)} into D_1. This gradient projection iteration will remain in D_1 as long as the initial parameter vector is in D_1. To keep the iteration within D_2, we choose an initial Θ^{(0)} ∈ D_2 and keep η sufficiently small at each iteration.

Suppose that E = [e_1, …, e_m] is a set of independent unit basis vectors that spans the space D_1. In this basis, Θ^{(k)} and Π_k(∂l/∂Θ^{(k)}) become Θ̃^{(k)} = E^T Θ^{(k)} and ∂l/∂Θ̃^{(k)} = E^T(∂l/∂Θ^{(k)}), respectively, with ‖Θ̃^{(k)} − Θ̃*‖ = ‖Θ^{(k)} − Θ*‖. In this representation the gradient projection algorithm (equation 5.3) becomes simple gradient ascent: Θ̃^{(k+1)} = Θ̃^{(k)} + η(∂l/∂Θ̃^{(k)}). Moreover, equation 5.1 becomes ‖Θ^{(k+1)} − Θ*‖ ≤ ‖E^T[I + ηH(Θ*)]‖ ‖Θ^{(k)} − Θ*‖. As a result, the convergence rate is bounded by

r_1 = √(λ_M[E^T[I + 2ηH(Θ*) + η²H²(Θ*)]E])

Since H(Θ*) is negative definite, we obtain

r_1 ≤ √(1 + η²λ_M²[−H_c] − 2ηλ_m[−H_c])  (5.4)

In this equation H_c = E^T H(Θ*)E is the Hessian of l restricted to D_1. We see from this derivation that the convergence speed depends on κ[H_c] = λ_M[−H_c]/λ_m[−H_c]. When κ[H_c] = 1, we have

r_1 ≤ √(1 + η²λ_M²[−H_c] − 2ηλ_M[−H_c]) = |1 − ηλ_M[−H_c]|

which in principle can be made to equal zero if η is selected appropriately. In this case, a superlinear rate is obtained. Generally, however, κ[H_c] ≠ 1, with smaller values of κ[H_c] corresponding to faster convergence.

We now turn to an analysis of the EM algorithm. As we have seen, EM keeps the parameter vector within D_1 automatically. Thus, in the new basis the connection between EM and gradient ascent (cf. equation 3.7) becomes

Θ̃^{(k+1)} = Θ̃^{(k)} + E^T P(Θ^{(k)}) (∂l/∂Θ^{(k)})
The latter equation can be further manipulated to yield

r_2 ≤ √(1 + λ_M²[E^T PHE] − 2λ_m[−E^T PHE])  (5.5)

Thus we see that the convergence speed of EM depends on

κ[E^T PHE] = λ_M[E^T PHE]/λ_m[E^T PHE]

When κ[E^T PHE] = 1 and λ_M[−E^T PHE] = 1, we have

√(1 + λ_M²[E^T PHE] − 2λ_m[−E^T PHE]) = 1 − λ_M[−E^T PHE] = 0

In this case, a superlinear rate is obtained. We discuss the possibility of obtaining superlinear convergence with EM in more detail below.

These results show that the convergence of gradient ascent and EM both depend on the shape of the log likelihood as measured by the condition number. When κ[H] is near one, the configuration is quite regular, and the update direction points directly to the solution, yielding fast convergence. When κ[H] is very large, the l surface has an elongated shape, and the search along the update direction is a zigzag path, making convergence very slow. The key idea of Newton and quasi-Newton methods is to reshape the surface. The nearer it is to a ball shape (Newton's method achieves this shape in the ideal case), the better the convergence. Quasi-Newton methods aim to achieve an effective Hessian whose condition number is as close as possible to one. Interestingly, the results that we now present suggest that the projection matrix P for the EM algorithm also serves to effectively reshape the likelihood, yielding an effective condition number that tends to one. We first present empirical results that support this suggestion and then present a theoretical analysis.

We sampled 1000 points from a simple finite mixture model given by
p(x) = α_1 p_1(x) + α_2 p_2(x)

where

p_j(x) = (1/√(2πσ_j²)) exp[−(x − m_j)²/(2σ_j²)],  j = 1, 2

The parameter values were as follows: α_1 = 0.7170, α_2 = 0.2830, m_1 = −2, m_2 = 2, σ_1² = 1, σ_2² = 1. We ran both the EM algorithm and gradient ascent on the data. The initialization for each experiment was set randomly, but was the same for both the EM algorithm and the gradient algorithm. At each step of the simulation, we calculated the condition number of the Hessian (κ[H(Θ^{(k)})]), the condition number determining the rate of convergence of the gradient algorithm (κ[E^T H(Θ^{(k)})E]), and the condition number determining the rate of convergence of EM (κ[E^T P(Θ^{(k)})H(Θ^{(k)})E]). We also
Figure 1a: Experimental results for the estimation of the parameters of a two-component gaussian mixture. (a) The condition numbers as a function of the iteration number.

calculated the largest eigenvalues of the matrices H(Θ^{(k)}), E^T H(Θ^{(k)})E, and E^T P(Θ^{(k)})H(Θ^{(k)})E. The results are shown in Figure 1. As can be seen in Figure 1a, the condition numbers change rapidly in the vicinity of the 25th iteration. This is because the corresponding Hessian matrix is indefinite before the iteration enters the neighborhood of a solution. Afterward, the Hessians quickly become definite and the condition numbers converge.³ As shown in Figure 1b, the condition numbers converge toward the values κ[H(Θ^{(k)})] = 47.5, κ[E^T H(Θ^{(k)})E] = 33.5, and κ[E^T P(Θ^{(k)})H(Θ^{(k)})E] = 3.6. That is, the matrix P has greatly reduced the condition number, by factors of 13.2 and 9.3. This significantly improves the shape of l and speeds up the convergence.

³Interestingly, the EM algorithm converges soon afterward as well, showing that for this problem EM spends little time in the region of parameter space in which a local analysis is valid.
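The experimental setup can be sketched as follows. This is a minimal reimplementation of EM on the means of the two-component mixture above (the random seed, initialization, and iteration count are our own choices, not the paper's, and the mixing proportions and variances are held at their true values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample 1000 points from the two-component univariate mixture:
# alpha = (0.7170, 0.2830), means (-2, 2), unit variances.
alpha = np.array([0.7170, 0.2830])
m_true = np.array([-2.0, 2.0])
n = 1000
z = rng.choice(2, size=n, p=alpha)
x = rng.normal(m_true[z], 1.0)

def em_means(x, m, alpha, n_iter=100):
    """EM updates for the means only (mixing proportions and unit
    variances held fixed)."""
    for _ in range(n_iter):
        # E-step: posterior responsibilities h_j(t)
        logp = -0.5 * (x[:, None] - m[None, :]) ** 2 + np.log(alpha)
        h = np.exp(logp - logp.max(axis=1, keepdims=True))
        h /= h.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted means
        m = (h * x[:, None]).sum(axis=0) / h.sum(axis=0)
    return m

m_hat = em_means(x, m=np.array([-1.0, 1.0]), alpha=alpha)
print(m_hat)   # close to (-2, 2)
```

No step size or line search appears anywhere in the update, which is the point the text emphasizes.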
Figure 1b: (b) A zoomed version of (a) after discarding the first 25 iterations. The terminology "original, constrained, and EM-equivalent Hessians" refers to the matrices H, E^T H E, and E^T P H E, respectively.
We ran a second experiment in which the means of the component gaussians were m_1 = −1 and m_2 = 1. The results are similar to those shown in Figure 1. Since the distance between the two distributions is reduced by half, the shape of l becomes more irregular (Fig. 2). The condition number κ[H(Θ^{(k)})] increases to 352, κ[E^T H(Θ^{(k)})E] increases to 216, and κ[E^T P(Θ^{(k)})H(Θ^{(k)})E] increases to 61. We see once again a significant improvement in the case of EM, by factors of 5.8 and 3.5. Figure 3 shows that the matrix P has also reduced the largest eigenvalues of the Hessian from between 2000 and 3000 to around 1. This demonstrates clearly the stable convergence that is obtained via EM, without a line search or the need for external selection of a learning step size.

In the remainder of the paper we provide some theoretical analyses that attempt to shed some light on these empirical results. To illustrate the issues involved, consider a degenerate mixture problem in which
(Figure 2 legend: solid, the original Hessian; dashed, the constrained Hessian; dash-dot, the EM-equivalent Hessian.)
Figure 2: Experimental results for the estimation of the parameters of a two-component gaussian mixture (cf. Fig. 1). The separation of the gaussians is half the separation in Figure 1.
the mixture has a single component. (In this case α_1 = 1.) Let us furthermore assume that the covariance matrix is fixed (i.e., only the mean vector m is to be estimated). The Hessian with respect to the mean m is H = −NΣ^{-1}, and the EM projection matrix is P = Σ/N. For gradient ascent, we have κ[E^T HE] = κ[Σ^{-1}], which is larger than one whenever Σ ≠ cI. EM, on the other hand, achieves a condition number of one exactly (κ[E^T PHE] = κ[PH] = κ[I] = 1 and λ_M[−E^T PHE] = 1). Thus, EM and Newton's method are the same for this simple quadratic problem. For general nonquadratic optimization problems, Newton's method retains the quadratic assumption, yielding fast convergence but possible divergence. EM is a more conservative algorithm that retains the convergence guarantee but also maintains quasi-Newton behavior. We now analyze this behavior in more detail. We consider the special case of estimating the means in a gaussian mixture when the gaussians are well separated.
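The degenerate one-component case above can be verified numerically. A quick check, using an arbitrary (hypothetical) positive definite covariance:

```python
import numpy as np

# Degenerate one-component case: only the mean is estimated.  With N data
# points and fixed covariance Sigma, the Hessian is H = -N * inv(Sigma) and
# the EM "projection" matrix is P = Sigma / N, so PH = -I exactly.
N = 50
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

H = -N * np.linalg.inv(Sigma)
P = Sigma / N
PH = P @ H

print(np.allclose(PH, -np.eye(2)))     # True: the effective Hessian -PH is I
```

Here the EM step coincides with a Newton step, as the text states, independent of how ill-conditioned Sigma itself is.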
Figure 2: Continued.
Theorem 2. Consider the EM algorithm in equation 2.3, where the parameters α_j and Σ_j are assumed to be known. Assume that the K gaussian distributions are well separated, such that for sufficiently large k the posterior probabilities h_j^{(k)}(t) are approximately zero or one. For such k, the condition number associated with EM is approximately one, which is smaller than the condition number associated with gradient ascent. That is,

κ[E^T P(Θ^{(k)})H(Θ^{(k)})E] ≈ 1  (5.6)

κ[E^T P(Θ^{(k)})H(Θ^{(k)})E] < κ[E^T H(Θ^{(k)})E]  (5.7)

Furthermore, we also have

λ_M[−E^T P(Θ^{(k)})H(Θ^{(k)})E] ≈ 1  (5.8)
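The separation assumption in the theorem can be illustrated numerically: when the component means are far apart, the posterior probabilities h_j(t) are essentially zero or one, so the term h(1 − h) that the theorem neglects is tiny. This sketch uses unit variances and equal mixing proportions, with separations chosen by us for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = np.array([0.5, 0.5])

def mean_h_one_minus_h(separation):
    """Average of h(1-h) over a sample from a two-component, unit-variance
    univariate mixture whose means are `separation` apart."""
    m = np.array([-separation / 2, separation / 2])
    z = rng.choice(2, size=2000, p=alpha)
    x = rng.normal(m[z], 1.0)
    # posterior probabilities h_j(t)
    logp = -0.5 * (x[:, None] - m[None, :]) ** 2 + np.log(alpha)
    h = np.exp(logp - logp.max(axis=1, keepdims=True))
    h /= h.sum(axis=1, keepdims=True)
    return np.mean(h * (1 - h))

well_separated = mean_h_one_minus_h(8.0)
overlapping = mean_h_one_minus_h(1.0)
print(well_separated, overlapping)   # the first is orders of magnitude smaller
```

In the well-separated regime the neglected term is negligible, which is exactly the condition under which the theorem's approximation PH ≈ −I holds.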
Figure 3: The largest eigenvalues of the matrices H, E^T H E, and E^T P H E plotted as a function of the number of iterations. The plot in (a) is for the experiment in Figure 1; (b) is for the experiment reported in Figure 2.
Proof. From equations 2.1 and 2.2, the Hessian of l with respect to the means is H = [H_ij], with blocks given by  (5.9)
Figure 3: Continued.

with γ_j(x^{(t)}) = [δ_ij − h_j^{(k)}(t)] h_j^{(k)}(t). The projection matrix P is P^{(k)} = diag[P_11^{(k)}, …, P_KK^{(k)}], where

P_jj^{(k)} = Σ_j^{(k)} / Σ_{t=1}^N h_j^{(k)}(t)  (5.10)

Given that h_i^{(k)}(t)[1 − h_i^{(k)}(t)] is negligible for sufficiently large k [since the h_j^{(k)}(t) are approximately zero or one], the second term in equation 5.9 can be neglected, yielding H_ii ≈ −(Σ_i^{(k)})^{-1} Σ_{t=1}^N h_i^{(k)}(t) and H ≈ diag[H_11, …, H_KK]. This implies that PH ≈ −I and E^T PHE ≈ −I; thus κ[E^T PHE] ≈ 1 and λ_M[−E^T PHE] ≈ 1, whereas usually κ[E^T HE] > 1. □

This theorem, although restrictive in its assumptions, gives some indication as to why the projection matrix in the EM algorithm appears to
condition the Hessian, yielding improved convergence. In fact, we conjecture that equations 5.7 and 5.8 can be extended to apply more widely, in particular to the case of the full EM update in which the mixing proportions and covariances are estimated, and also, within limits, to cases in which the means are not well separated. To obtain an initial indication as to possible conditions that can be usefully imposed on the separation of the mixture components, we have studied the case in which the second term in equation 5.9 is neglected only for the H_ii and is retained for the H_ij components, where j ≠ i. Consider, for example, the case of a univariate mixture having two mixture components. For fixed mixing proportions and fixed covariances, the Hessian matrix (equation 5.9) becomes

H = [ h_11  h_12 ]
    [ h_21  h_22 ]

and the projection matrix (equation 5.10) becomes

P = [ −h̃_11^{-1}      0      ]
    [      0      −h̃_22^{-1} ]

where h̃_11 and h̃_22 denote the diagonal Hessian entries obtained when the second term is neglected. If H is negative definite (i.e., h_11 < 0, h_22 < 0, and h_11 h_22 − h_12 h_21 > 0), then we can show that the conclusions of equation 5.7 remain true, even for gaussians that are not necessarily well separated. The proof is achieved via the following lemma:

Lemma 1. Consider the positive definite matrix

Σ = [ σ_11  σ_12 ]
    [ σ_21  σ_22 ]

For the diagonal matrix B = diag[σ_11^{-1}, σ_22^{-1}], we have κ[BΣ] < κ[Σ].
Proof. The eigenvalues of Σ are the roots of (σ_11 − λ)(σ_22 − λ) − σ_21 σ_12 = 0, which gives

λ_M = (σ_11 + σ_22 + γ)/2
λ_m = (σ_11 + σ_22 − γ)/2

where γ = √[(σ_11 + σ_22)² − 4(σ_11 σ_22 − σ_21 σ_12)], and

κ[Σ] = (σ_11 + σ_22 + γ)/(σ_11 + σ_22 − γ)

The condition number κ[Σ] can be written as κ[Σ] = (1 + s)/(1 − s) = f(s), where s is defined as follows:

s = √[1 − 4(σ_11 σ_22 − σ_21 σ_12)/(σ_11 + σ_22)²]

Furthermore, the eigenvalues of BΣ are the roots of (1 − λ)(1 − λ) − (σ_21 σ_12)/(σ_11 σ_22) = 0, which gives λ_M = 1 + √[(σ_21 σ_12)/(σ_11 σ_22)] and λ_m = 1 − √[(σ_21 σ_12)/(σ_11 σ_22)]. Thus, defining r = √[(σ_21 σ_12)/(σ_11 σ_22)], we have

κ[BΣ] = (1 + r)/(1 − r) = f(r)

We now examine the quotient s/r:

s/r = (1/r) √[1 − 4(1 − r²) σ_11 σ_22 / (σ_11 + σ_22)²]

Given that (σ_11 + σ_22)²/(σ_11 σ_22) ≥ 4, we have s/r > (1/r)√[1 − (1 − r²)] = 1. That is, s > r. Since f(x) = (1 + x)/(1 − x) is a monotonically increasing function for 0 ≤ x < 1, we have f(s) > f(r). Therefore, κ[BΣ] < κ[Σ]. □
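Lemma 1 is easy to check numerically. The following sketch uses a hypothetical positive definite Σ of our own choosing:

```python
import numpy as np

# Numerical check of Lemma 1: for positive definite Sigma and
# B = diag[sigma_11^-1, sigma_22^-1], kappa[B Sigma] < kappa[Sigma].
Sigma = np.array([[3.0, 1.2],
                  [1.2, 1.0]])
assert np.all(np.linalg.eigvalsh(Sigma) > 0)   # Sigma is positive definite

def cond(A):
    """Condition number as the ratio of extreme eigenvalue magnitudes."""
    lam = np.abs(np.linalg.eigvals(A))
    return lam.max() / lam.min()

B = np.diag(1.0 / np.diag(Sigma))
print(cond(B @ Sigma), cond(Sigma))            # the first is smaller
```

Diagonal rescaling by B plays the same conditioning role here that the projection matrix P plays for the EM update.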
We think that it should be possible to generalize this lemma beyond the univariate, two-component case, thereby weakening the conditions on separability in Theorem 2 in a more general setting.

6 Conclusions
In this paper we have provided a comparative analysis of algorithms for the learning of gaussian mixtures. We have focused on the EM algorithm and have forged a link between EM and gradient methods via the projection matrix P. We have also analyzed the convergence of EM in terms of properties of the matrix P and the effect that P has on the likelihood surface.

EM has a number of properties that make it a particularly attractive algorithm for mixture models. It enjoys automatic satisfaction of probabilistic constraints, monotonic convergence without the need to set a learning rate, and low computational overhead. Although EM has the reputation of being a slow algorithm, we feel that in the mixture setting the slowness of EM has been overstated. Although EM can indeed converge slowly for problems in which the mixture components are not well separated, the Hessian is poorly conditioned for such problems and thus other gradient-based algorithms (including Newton's method) are also likely to perform poorly. Moreover, if one's concern is convergence in likelihood, then EM generally performs well even for these ill-conditioned problems. Indeed the algorithm provides a certain amount
of safety in such cases, despite the poor conditioning. It is also important to emphasize that the case of poorly separated mixture components can be viewed as a problem in model selection (too many mixture components are being included in the model), and should be handled by regularization techniques.

The fact that EM is a first-order algorithm certainly implies that EM is no panacea, but does not imply that EM has no advantages over gradient ascent or superlinear methods. First, it is important to appreciate that convergence rate results are generally obtained for unconstrained optimization, and are not necessarily indicative of performance on constrained optimization problems. Also, as we have demonstrated, there are conditions under which the condition number of the effective Hessian of the EM algorithm tends toward one, showing that EM can approximate a superlinear method. Finally, in cases of a poorly conditioned Hessian, superlinear convergence is not necessarily a virtue. In such cases many optimization schemes, including EM, essentially revert to gradient ascent.

We feel that EM will continue to play an important role in the development of learning systems that emphasize the predictive aspect of data modeling. EM has indeed played a critical role in the development of hidden Markov models (HMMs), an important example of predictive data modeling.⁴ EM generally converges rapidly in this setting. Similarly, in the case of hierarchical mixtures of experts, the empirical results on convergence in likelihood have been quite promising (Jordan and Jacobs 1994; Waterhouse and Robinson 1994). Finally, EM can play an important conceptual role as an organizing principle in the design of learning algorithms. Its role in this case is to focus attention on the "missing variables" in the problem. This clarifies the structure of the algorithm and invites comparisons with statistical physics, where missing variables often provide a powerful analytic tool (Yuille et al. 1994).
Appendix: Proof of Theorem 1

1. We begin by considering the EM update for the mixing proportions α_j. From equations 2.1 and 2.2, we have
⁴In most applications of HMMs, the "parameter estimation" process is employed solely to yield models with high likelihood; the parameters are not generally endowed with a particular meaning.
Premultiplying by P_A^{(k)}, we obtain

P_A^{(k)} (∂l/∂A)|_{Θ^{(k)}} = (1/N) Σ_{t=1}^N [h_1^{(k)}(t), …, h_K^{(k)}(t)]^T − A^{(k)}

The update formula for A in equation 2.3 can be rewritten as

A^{(k+1)} = (1/N) Σ_{t=1}^N [h_1^{(k)}(t), …, h_K^{(k)}(t)]^T

Combining the last two equations establishes the update rule for A (equation 2.4). Furthermore, for an arbitrary vector u, we have N u^T P_A^{(k)} u = u^T diag[α_1^{(k)}, …, α_K^{(k)}] u − (u^T A^{(k)})². By Jensen's inequality we have

u^T diag[α_1^{(k)}, …, α_K^{(k)}] u = Σ_{j=1}^K α_j^{(k)} u_j² ≥ (Σ_{j=1}^K α_j^{(k)} u_j)² = (u^T A^{(k)})²

Thus, u^T P_A^{(k)} u > 0 and P_A^{(k)} is positive definite given the constraints Σ_{j=1}^K α_j^{(k)} = 1 and α_j^{(k)} ≥ 0 for all j.

2. We now consider the EM update for the means m_j. It follows from equations 2.1 and 2.2 that

(∂l/∂m_j)|_{Θ^{(k)}} = (Σ_j^{(k)})^{-1} Σ_{t=1}^N h_j^{(k)}(t) (x^{(t)} − m_j^{(k)})
Premultiplying by P_m_j^{(k)} yields

P_m_j^{(k)} (∂l/∂m_j)|_{Θ^{(k)}} = [Σ_{t=1}^N h_j^{(k)}(t)]^{-1} Σ_{t=1}^N h_j^{(k)}(t) (x^{(t)} − m_j^{(k)}) = m_j^{(k+1)} − m_j^{(k)}

From equation 2.3, we have Σ_{t=1}^N h_j^{(k)}(t) > 0; moreover, Σ_j^{(k)} is positive definite with probability one assuming that N is large enough such that
the matrix is of full rank. Thus, it follows from equation 3.5 that P_m_j^{(k)} is positive definite with probability one.

3. Finally, we prove the third part of the theorem. It follows from equations 2.1 and 2.2 that
With this in mind, we rewrite the EM update formula for Σ_j^{(k)} in vectorized form. Utilizing the identity vec[ABC] = (C^T ⊗ A) vec[B], we obtain

vec[Σ_j^{(k+1)}] = vec[Σ_j^{(k)}] + P_Σ_j^{(k)} (∂l/∂vec[Σ_j])|_{Θ^{(k)}}

with

P_Σ_j^{(k)} = [2 / Σ_{t=1}^N h_j^{(k)}(t)] (Σ_j^{(k)} ⊗ Σ_j^{(k)})

Moreover, for an arbitrary matrix U, we have

vec[U]^T (Σ_j^{(k)} ⊗ Σ_j^{(k)}) vec[U] = tr(Σ_j^{(k)} U Σ_j^{(k)} U^T) ≥ 0

where equality holds only when Σ_j^{(k)} U = 0; this is impossible for U ≠ 0, since Σ_j^{(k)} is positive definite with probability one when N is sufficiently large. Thus it follows from equation 3.6 and h_j^{(k)}(t) > 0 that P_Σ_j^{(k)} is positive definite with probability one. □
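Two of the positivity claims in the proof can be checked numerically. The mixing proportions and covariance below are hypothetical values of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(2)

# (1) Part 1: N * P_A = diag(alpha) - alpha alpha^T.  By Jensen's inequality
# its quadratic form is nonnegative; it vanishes only along the all-ones
# direction, which the sum-to-one constraint excludes.
alpha = np.array([0.5, 0.3, 0.2])
M = np.diag(alpha) - np.outer(alpha, alpha)
eigs = np.linalg.eigvalsh(M)
print(eigs)                                   # all >= 0; null vector is (1,1,1)

# (2) Part 3: the identity vec[ABC] = (C^T kron A) vec[B], and the
# nonnegativity of the quadratic form for a positive definite Sigma.
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
vec = lambda A: A.reshape(-1, order="F")      # column-stacking vec operator

B = rng.normal(size=(2, 2))
assert np.allclose(vec(Sigma @ B @ Sigma), np.kron(Sigma.T, Sigma) @ vec(B))

U = rng.normal(size=(2, 2))
q = vec(U) @ np.kron(Sigma, Sigma) @ vec(U)
print(q, np.trace(Sigma @ U @ Sigma @ U.T))   # equal, and nonnegative
```

The eigenvalues of Σ ⊗ Σ are products of the eigenvalues of Σ, which is why positive definiteness of Σ carries over to the Kronecker product in the proof.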
Acknowledgments

This project was supported in part by a Ho Sin-Hang Education Endowment Foundation Grant, by the HK RGC Earmarked Grant CUHK250/94E, by a grant from the McDonnell-Pew Foundation, by a grant from ATR Human Information Processing Research Laboratories, by a grant from Siemens Corporation, by Grant IRI-9013991 from the National Science Foundation, and by Grant N00014-90-J-1942 from the Office of Naval Research. Michael I. Jordan is an NSF Presidential Young Investigator.

References

Amari, S. 1995. Information geometry of the EM and em algorithms for neural networks. Neural Networks 8(5) (in press).
Baum, L. E., and Sell, G. R. 1968. Growth transformations for functions on manifolds. Pacific J. Math. 27, 211-227.
Bengio, Y., and Frasconi, P. 1995. An input-output HMM architecture. In Advances in Neural Information Processing Systems 7, G. Tesauro, D. S. Touretzky, and J. Alspector, eds. MIT Press, Cambridge, MA.
Boyles, R. A. 1983. On the convergence of the EM algorithm. J. Royal Stat. Soc. B45(1), 47-50.
Dempster, A. P., Laird, N. M., and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Stat. Soc. B39, 1-38.
Ghahramani, Z., and Jordan, M. I. 1994. Function approximation via density estimation using the EM approach. In Advances in Neural Information Processing Systems 6, J. D. Cowan, G. Tesauro, and J. Alspector, eds., pp. 120-127. Morgan Kaufmann, San Mateo, CA.
Jordan, M. I., and Jacobs, R. A. 1994. Hierarchical mixtures of experts and the EM algorithm. Neural Comp. 6, 181-214.
Jordan, M. I., and Xu, L. 1995. Convergence results for the EM approach to mixtures-of-experts architectures. Neural Networks (in press).
Levinson, S. E., Rabiner, L. R., and Sondhi, M. M. 1983. An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. Bell Syst. Tech. J. 62, 1035-1072.
Neal, R. N., and Hinton, G. E. 1993. A New View of the EM Algorithm that Justifies Incremental and Other Variants. University of Toronto, Department of Computer Science preprint.
Nowlan, S. J. 1991. Soft Competitive Adaptation: Neural Network Learning Algorithms Based on Fitting Statistical Mixtures. Tech. Rep. CMU-CS-91-126, CMU, Pittsburgh, PA.
Redner, R. A., and Walker, H. F. 1984. Mixture densities, maximum likelihood, and the EM algorithm. SIAM Rev. 26, 195-239.
Titterington, D. M. 1984. Recursive parameter estimation using incomplete data. J. Royal Stat. Soc. B46, 257-267.
Tresp, V., Ahmad, S., and Neuneier, R. 1994. Training neural networks with deficient data. In Advances in Neural Information Processing Systems 6, J. D. Cowan, G. Tesauro, and J. Alspector, eds. Morgan Kaufmann, San Mateo, CA.
Waterhouse, S. R., and Robinson, A. J. 1994. Classification using hierarchical mixtures of experts. Proc. IEEE Workshop on Neural Networks for Signal Processing, pp. 177-186.
Wu, C. F. J. 1983. On the convergence properties of the EM algorithm. Ann. Stat. 11, 95-103.
Xu, L., and Jordan, M. I. 1993a. Unsupervised learning by EM algorithm based on finite mixture of Gaussians. Proc. WCNN'93, Portland, OR, II, 431-434.
Xu, L., and Jordan, M. I. 1993b. EM learning on a generalized finite mixture model for combining multiple classifiers. Proc. WCNN'93, Portland, OR, IV, 227-230.
Xu, L., and Jordan, M. I. 1993c. Theoretical and Experimental Studies of the EM Algorithm for Unsupervised Learning Based on Finite Gaussian Mixtures. MIT Computational Cognitive Science Tech. Rep. 9302, Dept. of Brain and Cognitive Science, MIT, Cambridge, MA.
Xu, L., Jordan, M. I., and Hinton, G. E. 1994. A modified gating network for the mixtures of experts architecture. Proc. WCNN'94, San Diego, 2, 405-410.
Yuille, A. L., Stolorz, P., and Utans, J. 1994. Statistical physics, mixtures of distributions and the EM algorithm. Neural Comp. 6, 334-340.
Received November 17, 1994; accepted March 28, 1995.
Communicated by Steve Nowlan and Richard Lippmann
A Comparison of Some Error Estimates
for Neural Network Models

Robert Tibshirani
Department of Preventive Medicine and Biostatistics and Department of Statistics, University of Toronto, Toronto, Ontario, Canada
We discuss a number of methods for estimating the standard error of predicted values from a multilayer perceptron. These methods include the delta method based on the Hessian, bootstrap estimators, and the "sandwich" estimator. The methods are described and compared in a number of examples. We find that the bootstrap methods perform best, partly because they capture variability due to the choice of starting weights.

1 Introduction

We consider a multilayer perceptron with one hidden layer and a linear output layer. See Lippmann (1989), Hinton (1989), and Hertz et al. (1991) for details and references. A perceptron is a nonlinear model for predicting a response y based on p measurements of predictors (or input patterns or features) x_1, x_2, ..., x_p. For convenience we assume that x_1 = 1. The model with H hidden units has the form

    y = \phi_0\Big( w_0 + \sum_{h=1}^{H} w_h \, \phi(\beta_h^T x) \Big) + \epsilon    (1.1)

where the errors \epsilon have mean zero and variance \sigma^2 and are independent across training cases. Since y is a continuous response variable, we take the output function \phi_0 to be the identity. The standard choice for the hidden layer output function \phi is the sigmoid

    \phi(x) = \frac{1}{1 + \exp(-x)}    (1.2)

Our training sample has n observations (x_1, y_1), ..., (x_n, y_n). Denote the ensemble of parameters (weights) by \theta = (w_0, w_1, ..., w_H, \beta_1, ..., \beta_H) and let y(x_i; \theta) be the predicted value for input x_i and parameter \theta. The total number of parameters is p \cdot H + 1 + H.
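As a concrete sketch (the code and variable names are ours, not from the paper), the model of equations 1.1 and 1.2 takes only a few lines of NumPy:

```python
import numpy as np

def sigmoid(z):
    # Hidden-layer output function, equation 1.2
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, w, beta):
    """Single-hidden-layer perceptron, equation 1.1 with identity output.

    x    : (p,) input vector with x[0] == 1 (the constant term)
    w    : (H + 1,) output weights, with w[0] the bias w_0
    beta : (H, p) hidden-layer weight vectors beta_1, ..., beta_H
    """
    hidden = sigmoid(beta @ x)       # (H,) hidden-unit activations
    return w[0] + w[1:] @ hidden     # linear output layer

# Total parameter count is p*H + 1 + H, as stated in the text:
p, H = 4, 3
assert p * H + 1 + H == 16
```

The parameter count p*H + 1 + H matches the text: H weight vectors of length p feeding the hidden units, plus H + 1 output-layer weights.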
Neural Computation 8, 152-163 (1996) © 1995 Massachusetts Institute of Technology
Estimation of \theta is usually carried out by minimization of \sum_i [y_i - y(x_i; \theta)]^2, with either early stopping or some form of regularization to prevent overfitting. Commonly used optimization techniques include backpropagation (gradient descent), conjugate gradients, and quasi-Newton (variable metric) methods. Since the dimension of \theta is usually quite large, search techniques requiring computation of the Hessian are usually impractical.

In this paper we focus on the problem of estimation of the standard error of the predicted values y(x_i; \hat{\theta}). A reference for these techniques is Efron and Tibshirani (1993), especially Chapter 21. One approach is through likelihood theory. If we assume that the errors in model 1.1 are distributed as N(0, \sigma^2), then the log-likelihood is

    \ell(\theta) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} [y_i - y(x_i; \theta)]^2 - \frac{n}{2} \log \sigma^2    (1.3)

We eliminate \sigma^2 by replacing it by \hat{\sigma}^2 = \sum_{i=1}^{n} [y_i - y(x_i; \theta)]^2 / n in \ell(\theta). The first and second derivatives have the form

    \frac{\partial \ell}{\partial \theta_j} = \frac{1}{\hat{\sigma}^2} \sum_{i=1}^{n} [y_i - y(x_i; \theta)] \frac{\partial y(x_i; \theta)}{\partial \theta_j},
    \qquad
    -\frac{\partial^2 \ell}{\partial \theta_j \partial \theta_k} = \frac{1}{\hat{\sigma}^2} \sum_{i=1}^{n} \Big\{ \frac{\partial y(x_i; \theta)}{\partial \theta_j} \frac{\partial y(x_i; \theta)}{\partial \theta_k} - [y_i - y(x_i; \theta)] \frac{\partial^2 y(x_i; \theta)}{\partial \theta_j \partial \theta_k} \Big\}    (1.4)

The exact form of these derivatives is simple to derive for a neural network, and we do not give them here. Because of the structure of the network, the only nonzero second derivative terms are those of the form \partial^2 y / \partial \beta_h \partial \beta_h^T and \partial^2 y / \partial w_h \partial \beta_h, and there are a total of H \cdot p^2 + H \cdot p such terms. Buntine and Weigend (1994) describe efficient methods for computing the Hessian. Let I equal -\partial^2 \ell / \partial \theta_k \partial \theta_l evaluated at \theta = \hat{\theta} (the negative Hessian or "observed information" matrix), and g_i = \partial y(x_i; \theta) / \partial \theta evaluated at \hat{\theta}. Then using a Taylor series approximation we obtain

    \hat{se}[y(x_i; \hat{\theta})] \approx [g_i^T \cdot I^{-1} \cdot g_i]^{1/2}    (1.5)

This is often called the delta method estimate of standard error (see Efron and Tibshirani 1993, Chapter 21). For computational simplicity, we can leave out the terms in 1.4 involving second derivatives. These are often small because the multipliers y_i - y(x_i; \theta) tend to be small. We will denote the resulting approximate information matrix by \tilde{I}. With weight decay induced by a penalty term \lambda \sum \theta_j^2, it might be preferable to use the Hessian of the regularized log-likelihood \ell(\theta) - \lambda \sum \theta_j^2. This simply replaces I by I + 2\lambda (added to the diagonal) in formula 1.5, and will tend to reduce the delta method standard error estimates. This is the approach taken in MacKay (1992).
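The delta method lends itself to a compact sketch. The helper below (all names are illustrative, not from the paper) computes the approximate version of equation 1.5 that drops the second-derivative terms, using finite differences as a stand-in for the analytic derivatives dy/d\theta; lam > 0 adds the weight-decay term 2\lambda to the diagonal:

```python
import numpy as np

def grad_fd(f, theta, eps=1e-6):
    """Central finite-difference gradient of f at theta (a stand-in for
    the analytic derivatives dy/dtheta, which are easy to code directly)."""
    g = np.zeros_like(theta)
    for j in range(len(theta)):
        tp, tm = theta.copy(), theta.copy()
        tp[j] += eps
        tm[j] -= eps
        g[j] = (f(tp) - f(tm)) / (2 * eps)
    return g

def delta_se(X, y, theta_hat, predict, lam=0.0):
    """Approximate delta-method standard errors (equation 1.5) using the
    approximate information matrix that ignores second derivatives."""
    resid = y - np.array([predict(x, theta_hat) for x in X])
    sigma2 = np.mean(resid ** 2)           # maximum likelihood sigma-hat^2
    # Rows of G are the per-case gradients g_i
    G = np.array([grad_fd(lambda t: predict(x, t), theta_hat) for x in X])
    I_approx = (G.T @ G) / sigma2 + 2 * lam * np.eye(len(theta_hat))
    I_inv = np.linalg.inv(I_approx)
    # se_i = sqrt(g_i' I^{-1} g_i) for every training case
    return np.sqrt(np.einsum("ij,jk,ik->i", G, I_inv, G))
```

For a model that is linear in its parameters, this reduces to the familiar least squares formula, which provides a quick sanity check of an implementation.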
2 The Sandwich Estimator
Like the information-based approach, the sandwich estimator has a closed form. Unlike the information, however, its derivation does not rely on model correctness, and hence it can potentially perform well under model misspecification. Let s_i = (s_{i1}, s_{i2}, ...) be the gradient vector of \ell for the ith observation:

    s_i = \frac{\partial \ell_i}{\partial \theta}, evaluated at \hat{\theta}    (2.1)

Then the sandwich estimator of variance of \hat{\theta} is defined by

    V_{sand} = I^{-1} \Big( \sum_{i=1}^{n} s_i s_i^T \Big) I^{-1}    (2.2)

To estimate the standard error of y(x_i; \hat{\theta}), we substitute V_{sand} for I^{-1} in equation 1.5:

    \hat{se}_{sand}[y(x_i; \hat{\theta})] = [g_i^T \cdot V_{sand} \cdot g_i]^{1/2}    (2.3)

Note that \sum_{i=1}^{n} s_i s_i^T estimates E(\sum_i s_i s_i^T). The idea behind the sandwich estimator is the following. If the model is specified correctly,

    E\Big( \sum_{i=1}^{n} s_i s_i^T \Big) = E(I)    (2.4)
Therefore V_{sand} \approx I^{-1} E(I) I^{-1} \approx I^{-1} if the model is correct. Suppose, however, that the expected value of y is modeled correctly but the errors have different variances. Then the sandwich estimator still provides a consistent estimate of variance, but equation 2.4 does not hold, and hence the inverse information is not consistent. Details may be found in Kent (1982) and Efron and Tibshirani (1993, Chapter 21).

3 Bootstrap Methods
A different approach to error estimation is based on the bootstrap. It works by creating many pseudoreplicates ("bootstrap samples") of the training set and then reestimating \theta on each bootstrap sample. There are two different ways of bootstrapping in regression settings. One can consider each training case as a sampling unit, and sample with replacement from the training set cases to create a bootstrap sample. This is often called the "bootstrap pairs" approach. On the other hand, one can consider the predictors as fixed, treat the model residuals y_i - \hat{y}_i as the sampling units, and create a bootstrap sample by adding residuals to the model fit \hat{y}_i. This is called the "bootstrap residual" approach. The details are given below:
Bootstrap pairs sampling algorithm

1. Generate B samples, each one of size n drawn with replacement from the n training observations {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}. Denote the bth sample by {(x_1^{*b}, y_1^{*b}), (x_2^{*b}, y_2^{*b}), ..., (x_n^{*b}, y_n^{*b})}.
2. For each bootstrap sample b = 1, ..., B, minimize \sum_{i=1}^{n} [y_i^{*b} - y(x_i^{*b}; \theta)]^2, giving \hat{\theta}^{*b}.
3. Estimate the standard error of the ith predicted value by

    \hat{se}_B[y(x_i)] = \Big\{ \frac{1}{B-1} \sum_{b=1}^{B} [y(x_i; \hat{\theta}^{*b}) - y(x_i; \cdot)]^2 \Big\}^{1/2}

where y(x_i; \cdot) = \sum_{b=1}^{B} y(x_i; \hat{\theta}^{*b}) / B.

Bootstrap residual sampling algorithm

1. Estimate \hat{\theta} from the training sample and let r_i = y_i - y(x_i; \hat{\theta}), i = 1, 2, ..., n.
2. Generate B samples, each one of size n drawn with replacement from r_1, r_2, ..., r_n. Denote the bth sample by r_1^{*b}, r_2^{*b}, ..., r_n^{*b}, and let y_i^{*b} = y(x_i; \hat{\theta}) + r_i^{*b}.
3. For each bootstrap sample b = 1, ..., B, minimize \sum_{i=1}^{n} [y_i^{*b} - y(x_i; \theta)]^2, giving \hat{\theta}^{*b}.
4. Estimate the standard error of the ith predicted value by the same formula as in step 3 of the pairs algorithm, where y(x_i; \cdot) = \sum_{b=1}^{B} y(x_i; \hat{\theta}^{*b}) / B.
Note that each method requires refitting of the model (retraining the network) B times. Typically B is in the range 20 \le B \le 200. In simple linear least squares regression, it can be shown that the information-based estimate (equation 1.5) and the bootstrap residual sampling estimate (as B \to \infty) both agree with the standard least squares formula \hat{\sigma} [x_i^T (X^T X)^{-1} x_i]^{1/2}, with X denoting the design matrix having rows x_i.

How do the two bootstrap approaches compare? The bootstrap residual procedure is model-based, and relies on the fact that the residuals y_i - \hat{y}_i are representative of the true model errors. If the model is either misspecified or overfit, the bootstrap pairs approach is more robust. On the other hand, the bootstrap pairs approach results in a different set of predictor values in each bootstrap sample, and in some settings this may be inappropriate. In some situations the set of predictor values is chosen by design, and we wish to condition on those values in our inference procedure. Such situations are fairly common in statistics (design of experiments) but probably less common in applications of neural networks.
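Both resampling schemes can be sketched in one routine. The function and argument names below are our own, and `fit` stands in for whatever training procedure (e.g., retraining the network from random starting weights) is used:

```python
import numpy as np

def bootstrap_se(X, y, fit, predict, method="pairs", B=20, rng=None):
    """Bootstrap standard errors of predicted values at the training inputs.

    fit(X, y) -> theta_hat; predict(X, theta) -> fitted values.
    method = "pairs" resamples training cases with replacement;
    method = "residual" resamples residuals around the original fit.
    B = 20 is the lower limit used in the paper's examples.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    preds = np.empty((B, n))
    theta0 = fit(X, y)
    resid = y - predict(X, theta0)
    for b in range(B):
        idx = rng.integers(0, n, size=n)        # sample with replacement
        if method == "pairs":
            theta_b = fit(X[idx], y[idx])
        else:                                    # bootstrap residuals
            y_star = predict(X, theta0) + resid[idx]
            theta_b = fit(X, y_star)
        preds[b] = predict(X, theta_b)
    # standard deviation over the B refits, as in the algorithms above
    return preds.std(axis=0, ddof=1)
```

Because each bootstrap refit can start from fresh random weights, this estimate automatically picks up the starting-weight variability that the closed-form methods miss.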
4 Examples
In the following examples we compare a number of different estimates of the standard error of predicted values. The methods are as follows:

1. Delta: the delta method (equation 1.5)
2. Delta_1: the approximate delta method, using the approximate information matrix \tilde{I} that ignores second derivatives
3. Delta_2: the delta method, adding the term 2\lambda (from the regularization penalty) to the diagonal of the Hessian
4. Sand: the sandwich estimator (equation 2.3)
5. Sand_1: the approximate sandwich estimator that uses \tilde{I} in place of I in equations 2.2 and 2.3
6. Bootp: bootstrapping pairs
7. Bootr: bootstrapping residuals
We used Brian Ripley's "nnet" S-language function for the fitting, which uses the BFGS variable metric optimizer, with the weight decay parameter set at 0.01. The optimizer is based on the Pascal code given in Nash (1979). Only B = 20 bootstrap replications were used. This is a lower limit on the number required in most bootstrap applications, but perhaps a reasonable number when fitting a complicated model like a neural network. In the simulation studies (Examples 2-5), we carried out 25 simulations of each experiment.

4.1 Example 1: Air Pollution Data. In this first example we illustrate the preceding techniques on 111 observations on air pollution, taken from Chambers and Hastie (1991). The goal is to predict ozone concentration from radiation, temperature, and wind speed. We fit a multilayer perceptron with one hidden layer of 3 hidden units, and a linear output layer. The various estimates of standard error, at five randomly chosen feature vectors, are shown in Table 1. Notice that the larger standard errors are given by the bootstrap methods in four of the five cases. As we will see in the simulations below, this is partly because the bootstrap captures the variability due to the choice of random starting weights. In this example, repeated training of the neural network with different starting weights resulted in an average standard error of 0.07 for the predicted values.

One potential source of bias in the delta method estimate is our use of the maximum likelihood estimate for \sigma^2, namely \hat{\sigma}^2 = \sum_{i=1}^{n} [y_i - y(x_i; \hat{\theta})]^2 / n. We could instead use an unbiased estimate of the form \tilde{\sigma}^2 = \sum_{i=1}^{n} [y_i - y(x_i; \hat{\theta})]^2 / (n - k), where k is an estimate of the number of effective parameters used by the network. However, in this example an upper bound for k is 4 \times 3 + 3 + 1 = 16, and hence \hat{\sigma} increases only by a factor of (111/95)^{1/2} = 1.08.

There is more information from the bootstrap process besides the estimated standard errors. Figure 1 shows boxplots of the predicted values
Table 1: Results for Example 1, Air Pollution Data: Standard Error Estimates at Five Randomly Chosen Feature Vectors.

                    Point
    Method     1     2     3     4     5
    Delta      0.15  0.13  0.24  0.38  0.20
    Delta_1    0.13  0.12  0.16  0.12  0.17
    Delta_2    0.13  0.12  0.17  0.24  0.17
    Sand       0.14  0.12  0.32  0.29  0.25
    Sand_1     0.10  0.11  0.21  0.12  0.20
    Bootp      0.28  0.26  0.56  0.23  0.25
    Bootr      0.19  0.24  0.24  0.23  0.24
at each of the 5 feature vectors. Each boxplot contains values from 50 bootstrap simulations. Notice for example point 3 in the bottom plot. Its predicted values are skewed upward, and so we are less sure about the upper range of the prediction than the lower range.

Figure 1: Boxplots of bootstrap replications for each of five randomly chosen feature vectors, from Example 1; one panel shows bootstrap residuals, the other bootstrap pairs. The bold dot in each box indicates the median, while the lower and upper edges are the 25th and 75th percentiles. The broken lines are the hinges, beyond which points are considered to be outliers.

4.2 Example 2: Fixed-X Sampling. In this example we define x_1, x_2, x_3, x_4 to be multivariate gaussian with mean zero, variance 1, and pairwise correlation 0.5. This predictor set was generated once and then fixed for all 25 of the simulations. We generated y as a fixed function of the predictors plus an error \epsilon, where \epsilon is gaussian with mean zero and standard deviation 0.7. This gave a signal-to-noise ratio of roughly 1.2. There were 100 observations in each training set. Note that this function could be modeled exactly by a sigmoid net with two hidden nodes and a linear output node.

The results are shown in Table 2. In the left half of the table, a perceptron with one hidden layer of 2 hidden units and a linear output was fit. In the right half, the perceptron had only one hidden unit in the hidden layer. Let \hat{s}_{ik} be the estimated standard deviation of y(x_i; \hat{\theta}) for the kth simulated sample. Then we define se_k = median_i(\hat{s}_{ik}), the median over the training cases of the estimated standard deviation of \hat{y}_i, for the kth simulated sample. Let s_i be the actual standard deviation of y(x_i; \hat{\theta}). The actual value of the median standard deviation med(s_i) is 0.86, as estimated over the 25 simulations. To measure the absolute error of the estimate over the training cases, we define e_k = median_i |s_i - \hat{s}_{ik}|.

In the left half of the table the two bootstrap methods are clearly superior to the other methods. The "Random weight se" of 0.39 is the standard error due solely to the choice of starting weights, estimated by fixing the data and retraining with different initial weights. This component of variance is missed by the first four methods. In the right half of the table, all of the estimates have similar average values of e_k. Surprisingly, the bootstrap residual method is closest on the average to the actual se, closer than the bootstrap pairs approach. This may be because the bootstrap pairs method varies the X values and hence inflates the variance compared to the fixed-X sampling variance.

Table 2: Results for Example 2, Fixed-X Sampling. See text for details.

                        Correct model            Incorrect model
                        (2 hidden units)         (1 hidden unit)
    Method              Mean (SD)   Mean (SD)    Mean (SD)   Mean (SD)
                        of se_k     of e_k       of se_k     of e_k
    Delta               0.39(.08)   0.39(.06)    0.41(.09)   0.15(.07)
    Delta_1             0.35(.07)   0.42(.07)    0.43(.09)   0.15(.08)
    Delta_2             0.36(.09)   0.41(.05)    0.41(.09)   0.15(.08)
    Sand                0.40(.06)   0.41(.04)    0.41(.08)   0.17(.05)
    Sand_1              0.39(.09)   0.42(.08)    0.47(.14)   0.18(.06)
    Bootp               0.93(.09)   0.17(.05)    0.72(.14)   0.22(.09)
    Bootr               0.74(.07)   0.15(.04)    0.58(.06)   0.16(.04)
    Actual SE           0.86(-)                  0.56(-)
    Random weight se    0.39(-)                  0.38(-)

4.3 Example 3: Random-X Sampling. The setup in this example is the same as in the last one, except that a new set of predictor values was generated for each simulation. The predictions were done at a fixed set of predictor values, however, to allow pooling across simulations. The bootstrap methods perform the best again; surprisingly, the bootstrap pairs method only does best in the "incorrect model" case. This is surprising because its resampling of the predictors matches the actual simulation sampling used in the example.

Table 3: Results for Example 3, Random-X Sampling. See text for details.

                        Correct model            Incorrect model
                        (2 hidden units)         (1 hidden unit)
    Method              Mean (SD)   Mean (SD)    Mean (SD)   Mean (SD)
                        of se_k     of e_k       of se_k     of e_k
    Delta               0.38(.08)   0.45(.05)    0.38(.07)   0.36(.09)
    Delta_1             0.35(.11)   0.48(.09)    0.34(.14)   0.38(.15)
    Delta_2             0.34(.08)   0.47(.09)    0.33(.14)   0.38(.15)
    Sand                0.39(.08)   0.47(.06)    0.39(.08)   0.34(.07)
    Sand_1              0.46(.17)   0.45(.08)    0.49(.24)   0.40(.13)
    Bootp               1.05(.11)   0.26(.06)    0.73(.12)   0.21(.06)
    Bootr               0.81(.09)   0.22(.05)    0.53(.14)   0.24(.04)
    Actual SE           0.87(-)                  0.76(-)
    Random weight se    0.39(-)                  0.38(-)

Table 4: Results for Example 4, Overfitting. See text for details.

    Method              Mean (SD) of se_k   Mean (SD) of e_k
    Bootp               2.20(.13)           0.61(.09)
    Bootr               1.23(.06)           0.62(.07)
    Actual SE           1.84(-)
    Random weight SE    0.52(-)
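The accuracy summaries se_k and e_k reported in the simulation tables are medians over training cases. A minimal sketch of how such summaries could be computed (the array names and shapes are our assumptions, not from the paper):

```python
import numpy as np

def summarize(s_hat, s_actual):
    """Accuracy summaries used in the simulation tables.

    s_hat    : (K, n) estimated SDs of y(x_i; theta_hat), one row per
               simulated sample k
    s_actual : (n,) actual SDs of the predicted values, estimated over
               the K simulations
    Returns (se_k, e_k): per-sample medians of the estimates, and of the
    absolute errors |s_i - s_hat_ik|.
    """
    se_k = np.median(s_hat, axis=1)
    e_k = np.median(np.abs(s_actual - s_hat), axis=1)
    return se_k, e_k

# A table entry "Mean (SD) of se_k" is then
# (se_k.mean(), se_k.std(ddof=1)), and similarly for e_k.
```

This makes explicit that a method can have a large se_k bias yet a small e_k, since the two medians answer different questions about the estimates.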
4.4 Example 4: Overfitting. In this example, the setup is the same as in the left-hand side of Table 3, except that the neural net was trained with 7 hidden units and no weight decay. Thus the model has 5 more units than is necessary, and with no weight decay, should overfit the training data. The results of the simulation experiment are shown in Table 4. We had difficulty in computing the inverse information matrix due to near singularities in the models, and hence report only the bootstrap results. As expected, the bootstrap residual method underestimates the true standard error because the overfitting has biased the residuals toward zero. In the extreme case, if we were to completely saturate the model, the residuals would all be zero and the resulting standard error estimate would also be zero. The bootstrap pairs method seems to capture the variation better, but suffers from excess variability across simulations.

4.5 Example 5: Averaging over Runs. The setup here is the same as in the left-hand side of Table 2, except that the training is done by averaging the predicted values over three runs with different random starting weights. The results are shown in Table 5. The bootstrap methods still perform the best, but by a lesser amount than before. The reason is that the variation due to the choice of random starting weights has been reduced by the averaging. Presumably, if we were to average over a larger number of runs, this variation would be further reduced.

5 Discussion
Table 5: Results for Example 5, Averaging over Runs. See text for details.

    Method              Mean (SD) of se_k   Mean (SD) of e_k
    Delta               0.37(.10)           0.24(.07)
    Delta_1             0.30(.10)           0.30(.07)
    Delta_2             0.35(.09)           0.25(.06)
    Sand                0.38(.06)           0.38(.05)
    Sand_1              0.52(.29)           0.52(.08)
    Bootp               0.68(.06)           0.16(.03)
    Bootr               0.57(.02)           0.13(.01)
    Actual SE           0.61(-)
    Random weight SE    0.25(-)

In the simulation experiments of this paper, we found that

- The bootstrap methods provided the most accurate estimates of the standard errors of predicted values.
- The nonsimulation methods (delta method, sandwich estimator) missed the substantial variability due to the random choice of starting values.
Of course the results found here may not generalize to all other applications of neural networks. For example, the nonsimulation approaches may work better with fitting methods that are less sensitive to the choice of starting weights. Larger training sets, and the use of gradient descent methods, will probably lead to fewer local minima and hence less dependence on the random starting weights than seen here. In addition, in very large problems the bootstrap approaches may require too much computing time to be useful. Note that the bootstrap methods illustrated here do not suffer from matrix inversion problems in overfit networks, and do not require the existence of derivatives.

It is important to note that an interval formed by taking, say, plus and minus 1.96 times a standard error estimate from this paper would be an approximate confidence interval for the mean of a predicted value. This differs from a prediction interval, which is an interval for a future realization of the process. A prediction interval is typically wider than a confidence interval, because it must account for the variance of the future realization. Such an interval can be produced by increasing the width of the confidence interval by an appropriate function of the noise variance \hat{\sigma}^2.
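As a sketch of that last point (a standard construction under model 1.1; the helper and its names are ours, not from the paper), the prediction interval widens the confidence interval by adding the noise variance to the squared standard error of the mean prediction:

```python
import math

def intervals(y_hat, se_mean, sigma2_hat, z=1.96):
    """Approximate 95% confidence and prediction intervals at one input.

    se_mean    : standard error of the predicted mean (any estimate above)
    sigma2_hat : estimated noise variance
    """
    half_conf = z * se_mean
    half_pred = z * math.sqrt(se_mean ** 2 + sigma2_hat)   # always wider
    return ((y_hat - half_conf, y_hat + half_conf),
            (y_hat - half_pred, y_hat + half_pred))
```

With any positive noise variance the prediction interval strictly contains the confidence interval, as the text notes.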
We have considered only regression problems here, but the methods generalize easily to classification problems. With k classes, one usually specifies k output units, each with a sigmoidal output function \phi, and minimizes either squared error or the multinomial log-likelihood (cross-entropy). The only nontrivial change occurs for the bootstrap residual method. There are no natural residuals for classification problems, and instead we proceed as follows. Suppose for simplicity that we have
two classes 0 and 1, and let \hat{p}(x_i) be the estimated probability that y equals one for feature vector x_i. We fix each x_i and generate Bernoulli random variables y_i^{*b} according to Prob(y_i^{*b} = 1) = \hat{p}(x_i), for i = 1, ..., n and b = 1, ..., B. Then we proceed as in steps 3 and 4 of the bootstrap residual sampling algorithm, using either squared error or cross-entropy in step 3. An application of this procedure is described in Baxt and White (1994).

A Bayesian approach to error estimation in neural networks may be found in Buntine and Weigend (1991) and MacKay (1992). Nix and Weigend (1994) propose a method for estimating the variance of the target, allowing it to vary as a function of the input features. LeBaron and Weigend (1994) propose a method similar to the bootstrap pairs approach that uses a test set to generate the predicted values. Leonard et al. (1992) describe an alternative approach to confidence interval estimation that can be applied to radial basis networks.
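A minimal sketch of this Bernoulli resampling scheme for the two-class case (the function names and the generic `fit`/`predict_prob` interface are our assumptions):

```python
import numpy as np

def bernoulli_bootstrap_preds(X, y, fit, predict_prob, B=20, rng=None):
    """Bootstrap 'residual' sampling for a two-class problem.

    There are no natural residuals, so each bootstrap response y_i^{*b}
    is drawn as Bernoulli with success probability p-hat(x_i) from the
    original fit, as described in the text.
    fit(X, y) -> theta; predict_prob(X, theta) -> estimated P(y=1 | x).
    """
    rng = np.random.default_rng() if rng is None else rng
    p_hat = predict_prob(X, fit(X, y))       # probabilities from original fit
    probs = np.empty((B, len(y)))
    for b in range(B):
        y_star = rng.binomial(1, p_hat)      # y_i^{*b} ~ Bernoulli(p-hat(x_i))
        probs[b] = predict_prob(X, fit(X, y_star))
    return probs   # SEs of p-hat(x_i) follow as probs.std(axis=0, ddof=1)
```

Each refit in the loop would in practice minimize squared error or cross-entropy, exactly as in step 3 of the bootstrap residual algorithm.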
Acknowledgments

The author thanks Richard Lippmann, Andreas Weigend, and two referees for their valuable comments, and acknowledges the Natural Sciences and Engineering Research Council of Canada for its support.
References

Baxt, W., and White, H. 1994. Bootstrapping Confidence Intervals for Clinical Input Variable Effects in a Network Trained to Identify the Presence of Acute Myocardial Infarction. Tech. Rep., University of California, San Diego.
Buntine, W., and Weigend, A. 1994. Computing second derivatives in feed-forward neural networks: A review. IEEE Trans. Neural Networks 5, 480-488.
Chambers, J., and Hastie, T. 1991. Statistical Models in S. Wadsworth/Brooks Cole, Pacific Grove, CA.
Efron, B., and Tibshirani, R. 1993. An Introduction to the Bootstrap. Chapman and Hall, London.
Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.
Hinton, G. 1989. Connectionist learning procedures. Artificial Intelligence 40, 185-234.
Kent, T. 1982. Robust properties of likelihood ratio tests. Biometrika 69, 19-27.
LeBaron, A., and Weigend, A. 1994. Evaluating neural network predictors by bootstrapping. In Proceedings of the International Conference on Neural Information Processing (ICONIP'94), pp. 1207-1212. Seoul, Korea.
Leonard, J., Kramer, M., and Ungar, L. 1992. A neural network architecture that computes its own reliability. Comp. Chem. Eng. 16, 819-835.
Lippmann, R. 1989. Pattern classification using neural networks. IEEE Commun. Mag. 11, 47-64.
MacKay, D. 1992. A practical Bayesian framework for backpropagation neural networks. Neural Comp. 4, 448-472.
Nash, J. 1979. Compact Numerical Methods for Computers. Halsted, New York.
Nix, D., and Weigend, A. 1994. Estimating the mean and variance of a target probability distribution. In Proceedings of the IJCNN, Orlando.
Received April 30, 1994; accepted March 15, 1995.
Communicated by Federico Girosi
Neural Networks for Optimal Approximation of Smooth and Analytic Functions
We prove that neural networks with a single hidden layer are capable of providing an optimal order of approximation for functions assumed to possess a given number of derivatives, if the activation function evaluated by each principal element satisfies certain technical conditions. Under these conditions, it is also possible to construct networks that provide a geometric order of approximation for analytic target functions. The permissible activation functions include the squashing function (1 + e^{-x})^{-1} as well as a variety of radial basis functions. Our proofs are constructive. The weights and thresholds of our networks are chosen independently of the target function; we give explicit formulas for the coefficients as simple, continuous, linear functionals of the target function.

1 Introduction

In recent years, there has been a great deal of research in the theory of approximation of real valued functions using artificial neural networks with one or more hidden layers, with each principal element (neuron) evaluating a sigmoidal or radial basis function (Barron 1993; Barron and Barron 1988; Broomhead and Lowe 1988; Cybenko 1989; Girosi et al. 1995; Hornik et al. 1989; Leshno et al. 1993; Moody and Darken 1989; Poggio and Girosi 1990; Poggio et al. 1993). A typical density result shows that a network can approximate an arbitrary function in a given function class to any degree of accuracy. Such theorems are proved for instance in Cybenko (1989) and Hornik et al. (1989) in the case of sigmoidal activation functions and in Park and Sandberg (1991) and Powell (1991) for radial basis functions. Very general theorems of this nature can be found in Leshno et al. (1993) and Mhaskar and Micchelli (1992). A related important problem is the complexity problem, i.e., to determine the number of neurons required to guarantee that all functions, assumed to belong to a certain function class, can be approximated within a prescribed accuracy, \epsilon. For example, the now classical result of Barron (1993) shows that if the function is assumed to satisfy certain conditions
Optimal Approximation of Functions
expressed in terms of its Fourier transform, and each of the neurons evaluates a sigmoidal activation function, then at most O(ε^{-2}) neurons are needed to achieve the order of approximation ε. An interesting aspect of this result is that the order of magnitude of the number of neurons is independent of the number of variables on which the function depends. Other bounds of this nature are obtained in Mhaskar and Micchelli (1994) when the activation function is not necessarily sigmoidal. A very common assumption about the function class is defined in terms of the number of derivatives that a function possesses. For example, one is interested in approximating all functions of s real variables having a continuous gradient. By a suitable normalization, one may assume that the gradient is bounded by 1. It is known (e.g., DeVore et al. 1989) that any reasonable approximation scheme to provide an approximation order ε for all functions in this class must depend upon at least Ω(ε^{-s}) parameters. In Mhaskar (1993), we showed how to construct networks with two hidden layers, each neuron evaluating a bounded sigmoidal function, to accomplish such an approximation order with O(ε^{-s}) neurons. Mhaskar and Micchelli (1995) have studied this problem in much greater detail. The best result known so far for networks with a single hidden layer is that O[ε^{-s-1} log(1/ε)] neurons are enough if the activation function is the squashing function 1/(1 + e^{-x}). In our work (Chui et al. 1995), we have shown that if s > 1 and the approximation is required to be "localized", then at least Ω[ε^{-s} log(1/ε)] neurons are necessary, even if different neurons may evaluate different activation functions. A detailed discussion of the notion of localized approximation is not relevant within the context of this paper; we refer the reader to Chui et al. (1995).
We made a conjecture in Mhaskar (1994) that with a sigmoidal activation function, the number of neurons necessary to provide the approximation order ε to all functions in this class, with or without localization, cannot be O(ε^{-s}). In this paper, we disprove this conjecture. We prove that if the activation function satisfies certain technical conditions then the optimal order of approximation for this class (and other similar classes) can be achieved with a neural network with a single hidden layer. Our results will be formulated for neural networks more general than the traditional networks evaluating a univariate activation function. In particular, our results will include estimates on the order of approximation by generalized regularization networks introduced in Girosi et al. (1995), Poggio and Girosi (1990), and Poggio et al. (1993). The precise definitions and results will be given in the next section. The proofs of all the new results in Section 2 will be given in Section 3.
H. N. Mhaskar
2 Main Results
Let 1 ≤ d ≤ s, n ≥ 1 be integers, f : R^s → R and φ : R^d → R. A generalized translation network with n neurons evaluates a function of the form Σ_{k=1}^n a_k φ(A_k(·) + b_k), where the weights A_k are d × s real matrices, the thresholds b_k ∈ R^d, and the coefficients a_k ∈ R (1 ≤ k ≤ n). The set of all such functions (with a fixed n) will be denoted by Π_{φ;n,s}. We are interested in approximating the target function f by elements of Π_{φ;n,s} on [-1,1]^s. In the case when d = 1, the class Π_{φ;n,s} denotes the outputs of the classical neural networks with one hidden layer consisting of n neurons, each evaluating the univariate activation function φ. In Girosi et al. (1995), Poggio and Girosi (1990), and Poggio et al. (1993), the authors have pointed out the importance of the study of the more general case considered here. They have demonstrated how such general networks arise naturally in applications such as image processing and graphics as solutions of certain extremal problems. Our approximations will not be constructed as in Girosi et al. (1995), Poggio and Girosi (1990), and Poggio et al. (1993) as solutions of extremal problems, but rather will be given explicitly. They will not provide the best approximation, but will nevertheless provide the optimal order of approximation. An additional advantage of our networks is that the weights A_k and the thresholds b_k will be determined independently of the target function f. We observe in this connection that the determination of these quantities is typically a major problem in most traditional training algorithms such as backpropagation. In fact, the only "training" required for our networks consists of evaluating the coefficients a_k. We give explicit formulas for these coefficients as linear combinations of the Fourier-Chebyshev coefficients of the target function.
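As a concrete illustration, the map computed by such a network can be sketched in a few lines. This is only an illustrative sketch: the random matrices, thresholds, coefficients, and the Gaussian choice of φ below are arbitrary, not the ones constructed in the paper.

```python
import numpy as np

# A generalized translation network computes
#   x -> sum_{k=1}^n a_k * phi(A_k x + b_k),
# with d x s weight matrices A_k, thresholds b_k in R^d, coefficients a_k in R.
def gtn(x, weights, thresholds, coeffs, phi):
    # x: (s,); weights: (n, d, s); thresholds: (n, d); coeffs: (n,)
    return sum(a * phi(A @ x + b) for A, b, a in zip(weights, thresholds, coeffs))

rng = np.random.default_rng(1)
s, d, n = 3, 2, 5
W = rng.normal(size=(n, d, s))
th = rng.normal(size=(n, d))
co = rng.normal(size=n)
phi = lambda u: np.exp(-(u @ u))   # phi : R^d -> R, a Gaussian (one admissible choice)
y = gtn(rng.normal(size=s), W, th, co, phi)
print(y)   # a single real number
```

The "training" described above would amount to computing the coefficients a_k by explicit linear functionals of f, leaving W and th fixed.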
Alternative formulas based on the values of the target function can also be given, but we do not present these alternative constructions here, since a good discussion of this issue would require us to elaborate upon some very technical background material. From a practical perspective, we observe that we are assuming that the target function can be sampled without noise at prescribed points. Our constructions are extremely simple, use no optimization, and avoid all the problems, for example, local minima, stability, etc., associated with the classical, optimization-based training paradigms such as backpropagation. We fully expect the constructions to be robust under noise, but have not developed any theory to deal with this question. First, we introduce some notations. If A ⊆ R^s is Lebesgue measurable, and f : A → R is a measurable function, we define the L^p(A) norms of f by

||f||_{p,A} := (∫_A |f(x)|^p dx)^{1/p} if 1 ≤ p < ∞, and ||f||_{∞,A} := ess sup_{x ∈ A} |f(x)|.    (2.1)
The class of all functions f for which ||f||_{p,A} < ∞ is denoted by L^p(A). It is customary (and in fact essential from a theoretical point of view) to adopt the convention that if two functions are equal almost everywhere in the measure-theoretic sense then they should be considered as equal elements of L^p(A). We make two notational simplifications. First, the symbol L^∞(A) will denote the class of continuous functions on A. In this paper, we have no occasion to consider discontinuous functions in what is normally denoted by L^∞(A), and using this symbol for the class of continuous functions will simplify the statements of our theorems. Second, when the set A = [-1,1]^s, we will not mention the set in the notation. Thus, ||f||_p will mean ||f||_{p,[-1,1]^s}, etc. We measure the degree of approximation of f by the expression

E_{φ;n,p}(f) := inf{||f - g||_p : g ∈ Π_{φ;n,s}}.    (2.2)
The quantity E_{φ;n,p}(f) denotes the theoretically minimal error that can be achieved in approximating the function f in the L^p norm by generalized translation networks with n neurons, each evaluating the activation function φ. The complexity problem is clearly equivalent to obtaining sharp estimates on E_{φ;n,p}(f). In theoretical investigations of the degree of approximation, one typically makes an a priori assumption that the target function f, although itself unknown, belongs to some known class of functions. In this paper, we are interested in the Sobolev classes, which we define as follows. Let r ≥ 1 be an integer and Q be a cube in R^s. The class W^r_{p,s}(Q) consists of all functions with r - 1 continuous partial derivatives on Q, which in turn can be expressed (almost everywhere on Q) as indefinite integrals of functions in L^p(Q). Alternatively, the class W^r_{p,s}(Q) consists of functions that have, at almost all points of Q, all partial derivatives up to order r such that all of these derivatives are in L^p(Q). The Sobolev norm of f ∈ W^r_{p,s}(Q) is defined by

||f||_{W^r_{p,s}(Q)} := Σ_{0 ≤ k ≤ r} ||D^k f||_{p,Q},    (2.3)

where for the multiinteger k = (k_1, ..., k_s) ∈ Z^s, 0 ≤ k ≤ r means that each component of k is nonnegative and does not exceed r, |k| := Σ_{j=1}^s |k_j|, and

D^k := ∂^{|k|} / (∂x_1^{k_1} ··· ∂x_s^{k_s}).
Again, W^r_s(Q) will denote the class of functions that have continuous derivatives of order r and lower. As before, if Q = [-1,1]^s, we will not mention it in the notation. Thus, we write W^r_{p,s} = W^r_{p,s}([-1,1]^s), etc. Since the target function itself is unknown, the quantity of interest is

E_{φ;n,p,r,s} := sup{E_{φ;n,p}(f) : ||f||_{W^r_{p,s}} ≤ 1}.    (2.4)
We observe that any function in W^r_{p,s} can be normalized so that ||f||_{W^r_{p,s}} ≤ 1. Hence, E_{φ;n,p,r,s} measures the "worst case" degree of approximation by generalized translation networks with n neurons under the assumption that f ∈ W^r_{p,s} and is properly normalized. Since any element of Π_{φ;n,s} depends upon (ds + d + 1)n parameters, the general results by DeVore et al. (1989) indicate that

E_{φ;n,p,r,s} ≥ c n^{-r/s}.    (2.5)

The general results in DeVore et al. (1989) are not exactly applicable here, since the definition of the degree of approximation does not preclude the possibility that the parameters involved in the approximation may be discontinuous functionals on the class in question. Therefore, equation 2.5 is only a conjecture, rather than a known fact. In our constructions below, the parameters are continuous functionals of the class, and hence, equation 2.5 is applicable, and shows that the networks provide an optimal order of approximation subject to the continuity requirement. In the sequel, we make the following convention regarding constants. The letters c, c_1, c_2, ... will denote positive constants which may depend upon p, r, s, and other explicitly indicated quantities. Their value may be different at different occurrences, even within a single formula. We now formulate our main theorem.
Theorem 2.1. Let 1 ≤ d ≤ s, r ≥ 1, n ≥ 1 be integers, 1 ≤ p ≤ ∞, and φ : R^d → R be infinitely many times continuously differentiable in some open sphere in R^d. We further assume that there exists b in this sphere such that

D^k φ(b) ≠ 0,    k ∈ Z^d, k ≥ 0.    (2.6)

Then there exist d × s matrices {A_j}_{j=1}^n with the following property. For any f ∈ W^r_{p,s}, there exist coefficients a_j(f) such that

||f - Σ_{j=1}^n a_j(f) φ(A_j(·) + b)||_p ≤ c n^{-r/s} ||f||_{W^r_{p,s}}.    (2.7)

The functionals a_j are continuous linear functionals on W^r_{p,s}. In particular,

E_{φ;n,p,r,s} ≤ c n^{-r/s}.    (2.8)
We observe that the condition equation 2.6 implies that φ is not a polynomial. For the function φ(x) := cos x_1 + cos x_2 (d = 2), we have D^{(1,1)}φ = 0. Thus, when d > 1, the assumption 2.6 is stronger than the assumption that φ is not a polynomial. We suspect that it is a stronger assumption also in the case when d = 1. Proposition 2.2 shows that equation 2.6 is nevertheless satisfied by a large class of functions. In light of the first part of this proposition, we doubt that in the case when d = 1, a nonpolynomial function that is infinitely many times differentiable but does not satisfy equation 2.6 would be of any practical interest whatever.
Proposition 2.2. Let d ≥ 1 be an integer and φ : R^d → R be infinitely many times continuously differentiable on an open sphere B. If equation 2.6 is not satisfied, i.e., at every point of B some derivative of φ is zero, then for every closed sphere U ⊂ B, there exist a multiinteger r ≥ 0, a sphere N ⊆ U, and functions h_{ij,N} of d - 1 real variables such that

φ(x) = Σ_{i=1}^d Σ_{j=1}^{r_i - 1} h_{ij,N}(x_1, ..., x_{i-1}, x_{i+1}, ..., x_d) x_i^j,    x ∈ N.    (2.9)

If d = 1 and φ is analytic in a (complex) neighborhood of some point in B but not a polynomial, then equation 2.6 is satisfied.
Some of the important examples where equation 2.6 is satisfied are the following, where for x ∈ R^d, we write ||x||^2 := Σ_{j=1}^d x_j^2:

d = 1:    φ(x) := (1 + e^{-x})^{-1}    (the squashing function),

d ≥ 1:    φ(x) := (1 + ||x||^2)^α,  α ∉ Z    (generalized multiquadrics),

d ≥ 1, q ∈ Z, q > d/2:    φ(x) := ||x||^{2q-d} log ||x||, d even; ||x||^{2q-d}, d odd    (thin plate splines),

and

d ≥ 1:    φ(x) := exp(-||x||^2)    (the Gaussian function).
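These example activations can be written down directly. The following sketch (the values of α, q, d, and the sample point are arbitrary choices for illustration) evaluates each of them:

```python
import numpy as np

squash = lambda x: 1.0 / (1.0 + np.exp(-x))                 # d = 1, squashing function
multiquadric = lambda x, alpha=0.5: (1.0 + x @ x) ** alpha  # d >= 1, alpha not an integer

def thin_plate(x, q=2, d=2):
    # ||x||^(2q-d) log ||x|| for d even, ||x||^(2q-d) for d odd (q in Z, q > d/2)
    r2 = x @ x
    return r2 ** (q - d / 2) * 0.5 * np.log(r2) if d % 2 == 0 else r2 ** (q - d / 2)

gaussian = lambda x: np.exp(-(x @ x))                       # d >= 1

x = np.array([0.6, 0.8])     # ||x|| = 1
print(squash(0.0))           # approx 0.5
print(multiquadric(x))       # approx sqrt(2)
print(thin_plate(x))         # approx 0 (log ||x|| = 0 at ||x|| = 1)
print(gaussian(x))           # approx exp(-1)
```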
If the target function is merely assumed to be in L^p rather than in W^r_{p,s}, the estimate of equation 2.7 leads to a similar estimate in terms of the modulus of smoothness of the function. This is a fairly standard argument in approximation theory, and does not add any new insight to the problem. Since a formulation of this result would require us to introduce a great deal more notation, we omit this apparent generalization. The idea behind the proof of Theorem 2.1 is simple. It is well known that for every integer m ≥ r, there exists a polynomial P_m(f) of coordinatewise degree not exceeding m such that for every f ∈ W^r_{p,s},

||f - P_m(f)||_p ≤ c m^{-r} ||f||_{W^r_{p,s}}.    (2.10)

Following Leshno et al. (1993) we express each monomial in P_m(f) in terms of a suitable derivative of φ. In turn, this derivative can be approximated by an appropriate divided difference, involving O(m^s) evaluations of φ. A careful bookkeeping then yields Theorem 2.1. If the target function f is analytic in the polyellipse

E_ρ := {z = (z_1, ..., z_s) ∈ C^s : |z_j + (z_j^2 - 1)^{1/2}| ≤ ρ, j = 1, ..., s}    (2.11)
for some ρ > 1 and 1 < ρ_1 < ρ, then for every integer m ≥ 1 there exists a polynomial L_m(f) (Siciak 1962) (different from the polynomials described above) with coordinatewise degree not exceeding m such that

||f - L_m(f)||_p ≤ c_{ρ,ρ_1} ρ_1^{-m} max_{z ∈ E_ρ} |f(z)|.    (2.12)

Approximating these polynomials by networks as above, we get the following Theorem 2.3.

Theorem 2.3. Let 1 ≤ d ≤ s, n ≥ 1 be integers, 1 < ρ_1 < ρ, 1 ≤ p ≤ ∞, and f be analytic in the polyellipse E_ρ. Further, let φ be as in Theorem 2.1. Then

E_{φ;n,p}(f) ≤ c_{ρ,ρ_1} ρ_1^{-n^{1/s}} max_{z ∈ E_ρ} |f(z)|.    (2.13)
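The geometric rate in Theorem 2.3 mirrors classical polynomial approximation of analytic functions. A quick numerical illustration of that classical fact (my own sketch, not from the paper): Chebyshev interpolation of e^x on [-1,1], whose uniform error decays geometrically in the degree.

```python
import numpy as np

f = np.exp
xs = np.linspace(-1.0, 1.0, 2001)
errs = []
for deg in (2, 4, 8, 16):
    # interpolate f at the deg+1 Chebyshev nodes of the first kind
    nodes = np.cos((2 * np.arange(deg + 1) + 1) * np.pi / (2 * (deg + 1)))
    coeffs = np.polynomial.chebyshev.chebfit(nodes, f(nodes), deg)
    errs.append(np.max(np.abs(f(xs) - np.polynomial.chebyshev.chebval(xs, coeffs))))
print(errs)  # uniform errors drop by orders of magnitude as the degree doubles
```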
It is possible to obtain some estimates on the degree of approximation under substantially weaker assumptions on φ than those assumed in Theorems 2.1 and 2.3. One strategy, as in Leshno et al. (1993), would be to take the convolution of φ with a suitable, infinitely many times continuously differentiable function, apply Theorem 2.1 (or Theorem 2.3) to the resulting function, and use a quadrature formula. We have not yet worked out the details of this argument, but it seems unlikely that these estimates would be optimal under the weak assumptions on φ. Using the ideas in the proof of Theorem 2.1, it is also possible to obtain estimates for simultaneous approximation of derivatives of the target function. This would follow from the corresponding theorems in the theory of trigonometric approximation (cf. Mhaskar and Micchelli 1995). Although the technical details in these generalizations are expected to be of some interest, we do not wish to pursue these ideas further in this paper.

3 Proofs
To prove Theorem 2.1, we first recall some well known facts from the theory of trigonometric approximation. These will be used to construct the polynomial operator in equation 2.10. The subspace of 2π-periodic functions in L^p([-π,π]^s) [respectively, W^r_{p,s}([-π,π]^s)] will be denoted by L^{p*} (respectively, W^{r*}_{p,s}). If g ∈ L^{p*}, its Fourier coefficients are defined by

ĝ(k) := (2π)^{-s} ∫_{[-π,π]^s} g(t) e^{-i k·t} dt,    k ∈ Z^s.    (3.1)

The partial sums of the Fourier series of g are defined by

s_m(g, t) := Σ_{-m ≤ k ≤ m} ĝ(k) e^{i k·t},    m ∈ Z^s, m ≥ 0, t ∈ [-π,π]^s.    (3.2)

For any ε > 0, there exists G_{k,m,ε} ∈ Π_{φ;(6m+1)^s,s} such that

||T_k - G_{k,m,ε}||_∞ ≤ ε.    (3.16)
The weights and thresholds of each G_{k,m,ε} may be chosen from a fixed set with cardinality not exceeding (6m + 1)^s.

Proof. First, we consider the case when d = 1. The point b in equation 2.6 is a real number in this case and accordingly, will be denoted by b. Let φ be infinitely many times continuously differentiable on [b - δ, b + δ]. For a multiinteger p = (p_1, ..., p_s) and x ∈ R^s, we write x^p := Π_{j=1}^s x_j^{p_j}, where 0^0 is interpreted as 1. From the formula

φ_p(w; x) := ∂^{|p|} / (∂w_1^{p_1} ··· ∂w_s^{p_s}) φ(w·x + b) = x^p φ^{(|p|)}(w·x + b)    (3.17)

we conclude that

φ_p(0; x) = x^p φ^{(|p|)}(b).    (3.18)
Following the ideas in Leshno et al. (1993) we now replace the partial derivative φ_p(0; x) by an appropriate divided difference. For multiintegers p and r, we write

(p r) := Π_{j=1}^s (p_j r_j).

For any h > 0, the network defined by the formula

Φ_{p,h}(x) := h^{-|p|} Σ_{0 ≤ r ≤ p} (-1)^{|p-r|} (p r) φ(h(r - p/2)·x + b)    (3.19)

is in Π_{φ;Π_{j=1}^s(p_j+1),s} and represents a divided difference for φ_p(0; x). Further, we have

||Φ_{p,h} - φ_p(0; ·)||_∞ ≤ M_{φ,m,s} h^2,    |h| ≤ δ/(3ms),    (3.20)
where M_{φ,m,s} is a positive constant depending only on the indicated variables. Now, we write T_k(x) := Σ_{0 ≤ p ≤ k} ã_{k,p} x^p, and choose h > 0 sufficiently small. Then equation 3.20 implies that the network G_{k,m,ε} defined by

G_{k,m,ε}(x) := Σ_{0 ≤ p ≤ k} ã_{k,p} (φ^{(|p|)}(b))^{-1} Φ_{p,h}(x),    x ∈ [-1,1]^s,    (3.21)

satisfies equation 3.16. For each k, the weights and thresholds in G_{k,m,ε} are chosen from the set

{(hr, b) : r ∈ Z^s, |r_j| ≤ 3m, 1 ≤ j ≤ s}.
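In the simplest case (s = d = 1, a single monomial) the divided-difference step can be checked numerically. The sketch below uses exp as a stand-in for φ (so φ^(p)(b) = e^b) and arbitrary values of x, b, p; the central difference in the weight w recovers x^p φ^(p)(b) with an error that shrinks like h^2:

```python
import numpy as np
from math import comb, exp

# Central difference of order p of w -> phi(w*x + b), taken at w = 0 with
# step h, approximates the p-th derivative x^p * phi^(p)(b) up to O(h^2).
phi = np.exp
x, b, p = 0.7, 0.3, 2
target = x**p * exp(b)        # x^p phi^(p)(b), since exp is its own derivative

def divided_difference(h):
    return sum((-1) ** (p - r) * comb(p, r) * phi((r - p / 2) * h * x + b)
               for r in range(p + 1)) / h**p

e1 = abs(divided_difference(0.10) - target)
e2 = abs(divided_difference(0.05) - target)
print(e1, e2)   # halving h cuts the error by roughly a factor of 4
```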
The cardinality of this set is (6m + 1)^s. Therefore, G_{k,m,ε} ∈ Π_{φ;(6m+1)^s,s}. Next, if d > 1, and b is as in equation 2.6, then we consider the univariate function

σ(x) := φ(x, b_2, ..., b_d).

The function σ satisfies all the hypotheses of Theorem 2.1, with b_1 in place of b in equation 2.6. Taking into account the fact that σ(w·x + b_1) = φ(A_w x + b), where A_w is the d × s matrix whose first row is w and whose remaining rows are zero, any network in Π_{σ;n,s} is also a network in Π_{φ;n,s}. Therefore, the case d = 1 implies the lemma also when d > 1. □
Proof of Theorem 2.1. Without loss of generality, we may assume that n ≥ 13^s. Let m ≥ 1 be the largest integer such that (12m + 1)^s ≤ n. We define P_m(f) = Σ_{0 ≤ k ≤ 2m} V_k(f) T_k as in equation 3.13. In view of equation 3.15, the network

N(f, x) := Σ_{0 ≤ k ≤ 2m} V_k(f) G_{k,2m,ε}(x)    (3.22)
U = ∪_{k ≥ 0} Z_k. Now, each Z_k is a closed set and U, being a closed sphere, is a complete metric space. Therefore, Baire's category theorem implies that for some multiinteger r ≥ 0, Z_r contains a nonempty interior. Hence, there exists an open sphere N ⊆ U such that D^r φ(x) = 0 for every x ∈ N. Equation 2.9 expresses φ as a solution of this differential equation on N. If d = 1, φ is analytic in a closed neighborhood U of some point x_0 ∈ B, and equation 2.6 is not satisfied, then we have proved that φ is equal to a polynomial on some interval contained in U. The identity theorem of complex analysis then shows that φ itself is a polynomial. □

4 Conclusions
We have constructed generalized translation networks with a single hidden layer that provide an optimal order of approximation for functions in Sobolev classes, similar to the order obtained in the classical polynomial approximation theory. If the target function is analytic, then it is possible to get a geometric rate of approximation, again similar to polynomial approximation. The weights and thresholds of our networks are chosen independently of the target function. We give explicit formulas for the coefficients, so that the "training" consists of calculating certain simple, continuous linear functionals of the target function. The activation function for the network is fairly general, but has to satisfy certain smoothness conditions. Among the activation functions for which our theorems are applicable are the squashing function, the Gaussian function, thin plate splines, and generalized multiquadric functions.
Acknowledgments

I wish to thank Professors F. Girosi and T. Poggio, MIT Artificial Intelligence Laboratories, for their kind encouragement in this research. The research was supported in part by National Science Foundation Grant DMS 9404513 and Air Force Office of Scientific Research Grant F49620-93-1-0150.
References

Barron, A. R. 1993. Universal approximation bounds for superposition of a sigmoidal function. IEEE Trans. Information Theory 39, 930-945.

Barron, A. R., and Barron, R. L. 1988. Statistical learning networks: A unified view. In Symposium on the Interface: Statistics and Computing Science, April, Reston, Virginia.
Broomhead, D. S., and Lowe, D. 1988. Multivariable functional interpolation and adaptive networks. Complex Syst. 2, 321-355.

Chui, C. K., Li, X., and Mhaskar, H. N. 1995. Some limitations on neural networks with one hidden layer. Submitted.

Cybenko, G. 1989. Approximation by superposition of sigmoidal functions. Math. Control, Signals Syst. 2, 303-314.

DeVore, R., Howard, R., and Micchelli, C. A. 1989. Optimal nonlinear approximation. Manuscripta Math. 63, 469-478.

Girosi, F., Jones, M., and Poggio, T. 1995. Regularization theory and neural networks architectures. Neural Comp. 7, 219-269.

Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366.

Leshno, M., Lin, V., Pinkus, A., and Schocken, S. 1993. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks 6, 861-867.

Mhaskar, H. N. 1993. Approximation properties of a multilayered feedforward artificial neural network. Adv. Comp. Math. 1, 61-80.

Mhaskar, H. N. 1994. Approximation of real functions using neural networks. In Proceedings of International Conference on Computational Mathematics, H. P. Dikshit and C. A. Micchelli, eds. World Scientific Press, New Delhi, India.

Mhaskar, H. N., and Micchelli, C. A. 1992. Approximation by superposition of a sigmoidal function and radial basis functions. Adv. Appl. Math. 13, 350-373.

Mhaskar, H. N., and Micchelli, C. A. 1994. Dimension independent bounds on the degree of approximation by neural networks. IBM J. Res. Dev. 38, 277-284.

Mhaskar, H. N., and Micchelli, C. A. 1995. Degree of approximation by neural and translation networks with a single hidden layer. Adv. Appl. Math. 16, 151-183.

Moody, J., and Darken, C. 1989. Fast learning in networks of locally tuned processing units. Neural Comp. 1(2), 282-294.

Park, J., and Sandberg, I. W. 1991. Universal approximation using radial basis function networks. Neural Comp. 3, 246-257.

Poggio, T., and Girosi, F. 1990. Networks for approximation and learning. Proc. IEEE 78(9).

Poggio, T., Girosi, F., and Jones, M. 1993. From regularization to radial, tensor, and additive splines. In Neural Networks for Signal Processing III, C. A. Kamm, G. M. Kuhn, B. Yoon, R. Chellappa, and S. Y. Kung, eds., pp. 3-10. IEEE, New York.

Powell, M. J. D. 1992. The theory of radial basis function approximation. In Advances in Numerical Analysis III: Wavelets, Subdivision Algorithms and Radial Basis Functions, W. A. Light, ed., pp. 105-210. Clarendon Press, Oxford.

Siciak, J. 1962. On some extremal functions and their applications in the theory of analytic functions of several complex variables. Trans. Am. Math. Soc. 105, 322-357.

Stein, E. M. 1970. Singular Integrals and Differentiability Properties of Functions. Princeton Univ. Press, Princeton.
Timan, A. F. 1963. Theory of Approximation of Functions of a Real Variable. Macmillan, New York.
Received January 23, 1995; accepted April 10, 1995.
Communicated by Michael Jordan
Equivalence of Linear Boltzmann Chains and Hidden Markov Models

David J. C. MacKay
Cavendish Laboratory, Madingley Road, Cambridge CB3 0HE, United Kingdom

Several authors have studied the relationship between hidden Markov models and "Boltzmann chains" with a linear or "time-sliced" architecture. Boltzmann chains model sequences of states by defining state-state transition energies instead of probabilities. In this note I demonstrate that under the simple condition that the state sequence has a mandatory end state, the probability distribution assigned by a strictly linear Boltzmann chain is identical to that assigned by a hidden Markov model.

Several authors have made a link between hidden Markov models for time series and energy-based models (Luttrell 1989; Williams 1990; Saul and Jordan 1995). Saul and Jordan (1995) discuss a linear Boltzmann chain model with state-state transition energies A_{ii'} (going from state i to state i') and symbol emission energies B_{ij}, under which the probability of an entire sequence {i_l, j_l}_1^L, given the length of the sequence, L, is
P({i_l, j_l}_1^L | π, A, B, L, H_BC) = (1/Z(π, A, B, L)) exp[-(π_{i_1} + Σ_{l=1}^{L-1} A_{i_l i_{l+1}} + Σ_{l=1}^{L} B_{i_l j_l})]    (1)

where Z(π, A, B, L) is the obvious normalizing constant. Here the symbol i runs over n discrete "hidden" states, and j runs over m visible states. In contrast, a hidden Markov model (HMM) assigns a probability distribution of the form

P({i_l, j_l}_1^L | π, a, b, L, H_HMM) = π_{i_1} b_{i_1 j_1} Π_{l=1}^{L-1} a_{i_l i_{l+1}} b_{i_{l+1} j_{l+1}}    (2)
where π_i is a prior probability vector for the initial state, a_{ii'} is a transition probability matrix, and b_{ij} is a matrix of emission probabilities satisfying, respectively:

Σ_i π_i = 1,    Σ_{i'} a_{ii'} = 1 ∀i,    and    Σ_j b_{ij} = 1 ∀i.    (3)
Neural Computation 8, 178-181 (1996) © 1995 Massachusetts Institute of Technology
Here again, the symbol i runs over an alphabet of n hidden states, and j runs over m visible states. While any HMM can be written as a linear Boltzmann chain by setting exp(-A_{ii'}) = a_{ii'}, exp(-B_{ij}) = b_{ij}, and exp(-π_i) = π_i, not all linear Boltzmann chains can be represented as HMMs (Saul and Jordan 1995). However, the difference between the two models is minimal. To be precise, if the final hidden state i_L of a linear Boltzmann chain is constrained to be a particular end state, then the distribution over sequences is identical to that of a hidden Markov model.

Proof. Start from the distribution (1) and consider the quantity in the exponent. The probability distribution over states {i_l, j_l}_1^L is unchanged if we subtract arbitrary constants μ, ν from this exponent. The distribution will also be unaffected if we add arbitrary terms β_{i_l} to every appearance of B_{i_l j_l}, provided we also subtract β_{i_l} from every term A_{i_l i_{l+1}}. And we may similarly add α_{i_{l+1}} to every term A_{i_l i_{l+1}} if we also subtract α_{i_{l+1}} from the following term A_{i_{l+1} i_{l+2}}. The probability distribution may therefore be rewritten unchanged (except for the normalizing constant) as
P({i_l, j_l}_1^L | π, A, B, L, H_BC) ∝ exp{-(π_{i_1} + α_{i_1} + μ) - Σ_{l=1}^{L-1} (A_{i_l i_{l+1}} - α_{i_l} - β_{i_l} + α_{i_{l+1}} + ν) - Σ_{l=1}^{L} (B_{i_l j_l} + β_{i_l}) + α_{i_L} + β_{i_L}}    (4)

where μ, ν, {α_i, β_i} are arbitrary quantities. This probability distribution has the form of an HMM (equation 2) if

1. the quantities π'_i = exp(-(π_i + α_i + μ)), a'_{ii'} = exp(α_i + β_i - A_{ii'} - α_{i'} - ν), and b'_{ij} = exp[-(B_{ij} + β_i)] satisfy the normalization conditions (3).

2. the trailing term α_{i_L} + β_{i_L} can be treated as a constant, which holds if we assume that i_L is fixed to a particular end state i_L = n, say (a commonly applied constraint in the HMM literature).
Does a solution over μ, ν, {α_i, β_i} of the normalization conditions (3) exist? Trivially, we find for β_i:

exp(β_i) = Σ_j exp(-B_{ij}).

The normalization condition that {α_i} and ν must satisfy is

Σ_{i'} exp(α_i + β_i - A_{ii'} - α_{i'} - ν) = 1    ∀i.

Rearranging, we obtain

Σ_{i'} [exp(β_i - A_{ii'})][exp(-α_{i'})] = exp(ν)[exp(-α_i)]    ∀i,
which can be recognized as an eigenvector/eigenvalue equation for the matrix M_{ii'} = [exp(β_i - A_{ii'})], with exp(ν) being the eigenvalue and [exp(-α_i)] being the eigenvector. This eigenproblem has a solution, by the Perron-Frobenius theorem (Seneta 1973, p. 1), which states that a positive matrix (i.e., one in which every element M_{ii'} is positive) has a positive eigenvector with positive eigenvalue. A solution for {α_i} and ν therefore exists. Finally, μ is given by

exp(μ) = Σ_i exp(-(π_i + α_i)).
This completes the proof. □
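The construction in the proof is directly executable. The following sketch (my own illustration, not from the note; the random energies and dimensions are arbitrary) converts a linear Boltzmann chain (π, A, B) into HMM parameters via the Perron-Frobenius eigenvector of M_{ii'} = exp(β_i - A_{ii'}), then checks both the normalization conditions and the equivalence of the two distributions on sequences sharing a fixed end state:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, L = 4, 3, 5
A = rng.normal(size=(n, n))    # transition energies A_{ii'}
B = rng.normal(size=(n, m))    # emission energies B_{ij}
pi_e = rng.normal(size=n)      # initial-state energies pi_i

beta = np.log(np.exp(-B).sum(axis=1))            # exp(beta_i) = sum_j exp(-B_ij)
b = np.exp(-(B + beta[:, None]))                 # emission probabilities
M = np.exp(beta[:, None] - A)                    # M_{ii'} = exp(beta_i - A_{ii'})
w, V = np.linalg.eig(M)
k = np.argmax(w.real)                            # Perron root of the positive matrix M
nu, v = np.log(w[k].real), np.abs(V[:, k].real)  # eigenvalue exp(nu), eigenvector exp(-alpha)
alpha = -np.log(v)
a = M * v[None, :] / (np.exp(nu) * v[:, None])   # a_{ii'} = exp(alpha_i+beta_i-A_{ii'}-alpha_{i'}-nu)
mu = np.log(np.exp(-(pi_e + alpha)).sum())       # exp(mu) = sum_i exp(-(pi_i + alpha_i))
p0 = np.exp(-(pi_e + alpha + mu))                # HMM initial-state probabilities

assert np.allclose(p0.sum(), 1) and np.allclose(a.sum(1), 1) and np.allclose(b.sum(1), 1)

def chain_logp(i, j):
    # unnormalized Boltzmann-chain log probability, equation (1)
    return -(pi_e[i[0]] + sum(A[i[l], i[l + 1]] for l in range(L - 1))
             + sum(B[i[l], j[l]] for l in range(L)))

def hmm_logp(i, j):
    # HMM log probability, equation (2)
    return (np.log(p0[i[0]]) + np.log(b[i[0], j[0]])
            + sum(np.log(a[i[l], i[l + 1]]) + np.log(b[i[l + 1], j[l + 1]])
                  for l in range(L - 1)))

# For sequences with the same (mandatory) end state, the two log probabilities
# differ only by a constant, so the distributions are identical.
end = n - 1
diffs = []
for _ in range(4):
    i = np.append(rng.integers(n, size=L - 1), end)
    j = rng.integers(m, size=L)
    diffs.append(hmm_logp(i, j) - chain_logp(i, j))
assert np.allclose(diffs, diffs[0])
```

The constant gap between the two log probabilities is -μ - (L-1)ν - α_{i_L} - β_{i_L}, which is fixed once i_L is pinned to the end state.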
The linear Boltzmann chain therefore can differ from an HMM only in having a pseudo-prior over its final state as well as a pseudo-prior over its initial state. However, the equivalence of linear Boltzmann chains to HMMs may prove fruitful in stimulating the development of new optimization methods for these models. And it may be found that Saul and Jordan's generalizations to Boltzmann chains with more complex architectures provide useful new modeling capabilities. The Boltzmann chain, and its relationship to HMMs, have also been studied by Luttrell (1989), who calls it the "Gibbs machine," and by Williams (1990), who calls it a "Boltzmann machine with a time-sliced architecture and Potts units." Luttrell discusses an alternative optimization algorithm to the decimation method suggested by Saul and Jordan, and notes that the Gibbs machine is only an improvement on the HMM when generalized to architectures with loops and other nontree structures. Williams also shows how to translate an HMM into a Boltzmann machine and notes that a generalized Boltzmann machine with a "componential" structure (similar to the "coupled parallel Boltzmann chains" of Saul and Jordan) has greater representational power than a single HMM of the same size.
Acknowledgments
I thank Radford Neal and Chris Williams for comments on the manuscript. This work is supported by a Royal Society research fellowship.

References

Luttrell, S. P. 1989. The Gibbs Machine Applied to Hidden Markov Model Problems. Part 1: Basic Theory. Tech. Rep. 99, SP4 Division, RSRE, Malvern, U.K.
Saul, L., and Jordan, M. 1995. Boltzmann chains and hidden Markov models. Adv. Neural Inform. Process. Syst. 7, 435-442.
Seneta, E. 1973. Non-Negative Matrices. John Wiley & Sons, New York.
Williams, C. K. I. 1990. Using deterministic Boltzmann machines to discriminate temporally distorted strings. Master's thesis, Department of Computer Science, University of Toronto. See also Williams, C. K. I., and Hinton, G. E. 1990. Mean field networks that learn to discriminate temporally distorted strings. In Proceedings of the 1990 Connectionist Models Summer School, D. S. Touretzky, J. L. Elman, T. J. Sejnowski, and G. E. Hinton, eds. Morgan Kaufmann, San Mateo, CA.
Received September 23, 1994; accepted April 10, 1995.
Communicated by Fernando Pineda
Diagrammatic Derivation of Gradient Algorithms for Neural Networks
Eric A. Wan and Françoise Beaufays

Deriving gradient algorithms for time-dependent neural network structures typically requires numerous chain rule expansions, diligent bookkeeping, and careful manipulation of terms. In this paper, we show how to derive such algorithms via a set of simple block diagram manipulation rules. The approach provides a common framework to derive popular algorithms including backpropagation and backpropagation-through-time without a single chain rule expansion. Additional examples are provided for a variety of complicated architectures to illustrate both the generality and the simplicity of the approach.

1 Introduction
Deriving the appropriate gradient descent algorithm for a new network architecture or system configuration normally involves brute force derivative calculations. For example, the celebrated backpropagation algorithm for training feedforward neural networks was derived by repeatedly applying chain rule expansions backward through the network (Rumelhart et al. 1986; Werbos 1974; Parker 1982). However, the actual implementation of backpropagation may be viewed as a simple reversal of signal flow through the network. Another popular algorithm, backpropagation-through-time for recurrent networks, can be derived by Euler-Lagrange or ordered derivative methods, and involves both a signal flow reversal and time reversal (Werbos 1992; Nguyen and Widrow 1989). For both of these algorithms, there is a reciprocal nature to the forward propagation of states and the backward propagation of gradient terms. Furthermore, both algorithms are efficient in the sense that calculations are order N, where N is the number of variable weights in the network. These properties are often attributed to the clever manner in which the algorithms
were derived for a specific network architecture. We will show, however, that these properties are universal to all network architectures and that the associated gradient algorithm may be formulated directly with virtually no effort. The approach consists of a simple diagrammatic method for construction of a reciprocal network that directly specifies the gradient derivation. This is in contrast to graphical methods, which simply illustrate the relationship between signal and gradient flow after derivation of the algorithm by an alternative method (Narendra and Parthasarathy 1990; Nerrand et al. 1993). The reciprocal networks are, in principle, identical to adjoint systems seen in N-stage optimal control problems (Bryson and Ho 1975). While adjoint methods have been applied to neural networks, such approaches have been restricted to specific architectures where the adjoint systems resulted from a disciplined Euler-Lagrange optimization technique (Parisini and Zoppoli 1994; Toomarian and Barhen 1992; Matsuoka 1991). Here we use a graphic approach for the complete derivation. We thus prefer the term "reciprocal" network, which further imposes certain topological constraints and is taken from the electrical network literature. The concepts detailed in this paper were developed in Wan (1993) and later presented in Wan and Beaufays (1994).¹

1.1 Network Adaptation and Error Gradient Propagation. In supervised learning, the goal is to find a set of network weights W that minimizes a cost function J = Σ_{k=1}^{K} L_k[d(k), y(k)], where k is used to specify a discrete time index (the actual order of presentation may be random or sequential), y(k) is the output of the network, d(k) is a desired response, and L_k is a generic error metric that may contain additional weight regularization terms. For illustrative purposes, we will work with the squared error metric, L_k = e(k)ᵀe(k), where e(k) is the error vector.
Optimization techniques invariably require calculation of the gradient vector ∂J/∂W(k). At the architectural level, a variable weight w_ij may be isolated between two points in a network with corresponding signals a_i(k) and a_j(k) [i.e., a_j(k) = w_ij a_i(k)]. Using the chain rule, we get²

∂J/∂w_ij(k) = δ_j(k) a_i(k)    (1.1)

where we define the error gradient δ_j(k) ≜ ∂J/∂a_j(k). The error gradient δ_j(k) depends on the entire topology of the network. Specifying the gradients necessitates finding an explicit formula for calculating the delta terms. Backpropagation, for example, is nothing more than an algorithm for generating these terms in a feedforward network. In the next section, we develop a simple nonalgebraic method for deriving the delta terms associated with any network architecture.

¹The method presented here is similar in spirit to Automatic Differentiation (Rall 1981; Griewank and Corliss 1991). Automatic Differentiation is a simple method for finding derivatives of functions and algorithms that can be represented by acyclic graphs. Our approach, however, applies to discrete-time systems with the possibility of feedback. In addition, we are concerned with diagrammatic derivations rather than computational rule-based implementations.
²In the general case of a variable parameter, we have a_j(k) = f[w_ij, a_i(k)], and equation 1.1 remains as δ_j(k) ∂a_j(k)/∂w_ij(k), where the partial term depends on the form of f.
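As a quick sanity check on the relation ∂J/∂w_ij = δ_j a_i, the following sketch isolates a single weight between a signal a_i and a tanh output neuron and compares the delta-based gradient against a finite difference (all numerical values are arbitrary illustrations, not taken from the paper):

```python
import math

# Toy path: a_i --(w_ij)--> a_j = w_ij * a_i --> y = tanh(a_j), J = (d - y)^2
a_i, w_ij, d = 0.7, -1.3, 0.25          # arbitrary illustrative numbers

def loss(w):
    y = math.tanh(w * a_i)
    return (d - y) ** 2

a_j = w_ij * a_i
y = math.tanh(a_j)
# delta_j = dJ/da_j by the chain rule: dJ/dy * dy/da_j
delta_j = -2.0 * (d - y) * (1.0 - y * y)
grad = delta_j * a_i                     # equation 1.1

eps = 1e-6
fd = (loss(w_ij + eps) - loss(w_ij - eps)) / (2 * eps)
assert abs(grad - fd) < 1e-6
```

The finite-difference estimate agrees with δ_j a_i to numerical precision, which is all equation 1.1 claims for a linear transmittance.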
2 Network Representation and Reciprocal Construction Rules
An arbitrary neural network can be represented as a block diagram whose building blocks are: summing junctions, branching points, univariate functions, multivariate functions, and time-delay operators. Only discrete-time systems are considered. A signal located within the network is labeled a_i(k). A synaptic weight, for example, may be thought of as a linear transmittance, which is a special case of a univariate function. The basic neuron is simply a sum of linear transmittances followed by a univariate sigmoid function. Networks can then be constructed from individual neurons, and may include additional functional blocks and time-delay operators that allow for buffering of signals and internal memory. This block diagram representation is really nothing more than the typical pictorial description of a neural network with a bit of added formalism. Directly from the block diagram we may construct the reciprocal network by reversing the flow direction in the original network, labeling all resulting signals δ_i(k), and performing the following operations:
1. Summing junctions are replaced with branching points.

2. Branching points are replaced with summing junctions.

3. Univariate functions are replaced with their derivative transmittances.

Explicitly, the scalar continuous function a_j(k) = f[a_i(k)] is replaced by δ_i(k) = f′[a_i(k)] δ_j(k), where f′[a_i(k)] ≜ ∂a_j(k)/∂a_i(k). Note this rule replaces a nonlinear function by a linear time-dependent transmittance. Special cases are:
• Weights: a_j = w_ij a_i, in which case δ_i = w_ij δ_j.

• Activation functions: a_j(k) = tanh[a_i(k)]. In this case, f′[a_i(k)] = 1 − a_j²(k).
4. Multivariate functions are replaced with their Jacobians.

A multivariate function maps a vector of input signals into a vector of output signals, a_out = F(a_in). In the transformed network, we have δ_in(k) = F′[a_in(k)] δ_out(k), where F′[a_in(k)] ≜ ∂a_out(k)/∂a_in(k) corresponds to a matrix of partial derivatives. For shorthand, F′[a_in(k)] will be written simply as F′(k). Clearly both summing junctions and univariate functions are special cases of multivariate functions. A multivariate function may also represent a product junction (for sigma-pi units) or even another multilayer network. For a multilayer network, the product F′[a_in(k)] δ_out(k) is found directly by backpropagating δ_out through the network.
5. Delay operators are replaced with advance operators.

A delay operator q⁻¹ performs a unit time delay on its argument: a_j(k) = q⁻¹a_i(k) = a_i(k − 1). In the reciprocal system, we form a unit time advance: δ_i(k) = q⁺¹δ_j(k) = δ_j(k + 1). The resulting system is thus noncausal. Actual implementation of the reciprocal network in a causal manner is addressed in specific examples.
6. Outputs become inputs.

By reversing the signal flow, output nodes a_o(k) = y_o(k) in the original network become input nodes in the reciprocal network. These inputs are then set at each time step to −2e_o(k). [For cost functions other than squared error, the input should be set to ∂L_k/∂y_o(k).]

These six rules allow direct construction of the reciprocal network from the original network.³ Note that there is a topological equivalence between the two networks. The order of computations in the reciprocal network is thus identical to the order of computations in the forward network. Whereas the original network corresponds to a nonlinear time-independent system (assuming the weights are fixed), the reciprocal network is a linear time-dependent system. The signals δ_i(k) that propagate through the reciprocal network correspond to the terms ∂J/∂a_i(k) necessary for gradient adaptation. Exact equations may then be "read out" directly from the reciprocal network, completing the derivation. A formal proof of the validity and generality of this method is presented in Appendix A.

3 Examples
3.1 Backpropagation. We start by rederiving standard backpropagation. Figure 1 shows a hidden neuron feeding other neurons and an output neuron in a multilayer network. For consistency with traditional notation, we have labeled the summing junction signal s_i^l rather than a_i, and added superscripts to denote the layer. In addition, since multilayer networks are static structures, we omit the time index k. The reciprocal network shown in Figure 2 is found by applying the construction rules of the previous section. From this figure, we may immediately write down the equations for calculating the delta terms:

δ_i^l = f′(s_i^l) Σ_j w_ij^{l+1} δ_j^{l+1}

These are precisely the equations describing standard backpropagation. In this case, there are no delay operators and δ_i = δ_i(k) ≜ ∂J_k/∂s_i(k).

³Clearly, this set of rules is not a minimal set; i.e., a summing junction can be considered a special case of a multivariate function. However, we choose this set for ease and clarity of construction.
Figure 1: Block diagram construction of a multilayer network.
Figure 2: Reciprocal multilayer network.
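The delta computations of this example are easy to verify numerically. The sketch below builds a small two-layer tanh network (weights and data are arbitrary illustrations), injects −2e at the output of the reciprocal network, propagates it back through the derivative transmittances 1 − tanh², and checks one resulting weight gradient against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))  # toy layer weights
x, d = rng.normal(size=2), np.array([0.5])                 # toy input and target

def forward(W1, W2, x):
    s1 = W1 @ x; a1 = np.tanh(s1)
    s2 = W2 @ a1; y = np.tanh(s2)
    return s1, a1, s2, y

s1, a1, s2, y = forward(W1, W2, x)
e = d - y

# Reciprocal network: inject -2e at the output and propagate backward
# through the derivative transmittances f'(s) = 1 - tanh(s)^2.
delta2 = (1 - y**2) * (-2 * e)          # output-layer deltas
delta1 = (1 - a1**2) * (W2.T @ delta2)  # hidden-layer deltas
gW1 = np.outer(delta1, x)               # dJ/dW1 = delta1 * (input signals)

# Finite-difference check on one weight of W1.
eps = 1e-6
Wp = W1.copy(); Wp[1, 0] += eps
Wm = W1.copy(); Wm[1, 0] -= eps
Jp = np.sum((d - forward(Wp, W2, x)[3])**2)
Jm = np.sum((d - forward(Wm, W2, x)[3])**2)
assert abs(gW1[1, 0] - (Jp - Jm) / (2 * eps)) < 1e-5
```

The backward pass is nothing more than the reciprocal network run once: branching points became sums (the `W2.T @` contraction) and each tanh became a linear time-dependent gain.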
The quantity ∂[eᵀ(k)e(k)]/∂s_i(k) thus corresponds to an instantaneous gradient. Readers familiar with neural networks have undoubtedly seen these diagrams before. What is new is the concept that the diagrams themselves may be used directly, completely circumventing all intermediate steps involving tedious algebra.

3.2 Backpropagation-Through-Time. For the next example, consider a network with output feedback (see Fig. 3) described by
y(k) = N[x(k), y(k − 1)]    (3.2)

where x(k) are external inputs, and y(k) represents the vector of outputs that form feedback connections. N is a multilayer neural network. If N has only one layer of neurons, every neuron output has a feedback connection to the input of every other neuron and the structure is referred to as a fully recurrent network (Williams and Zipser 1989). Typically, only a select set of the outputs have an actual desired response. The remaining outputs have no desired response (error equals zero) and are used for internal computation. Direct calculation of gradient terms using chain rule expansions is extremely complicated. A weight perturbation at a specified time step affects not only the output at future time steps, but future inputs as well. However, applying the reciprocal construction rules (see Fig. 3) we find immediately:

δ(k) = N′(k) δ(k + 1) − 2e(k)    (3.3)

Figure 3: Recurrent network and backpropagation-through-time.

These are precisely the equations describing backpropagation-through-time, which have been derived in the past using either ordered derivatives (Werbos 1974) or Euler-Lagrange techniques (Plumer 1993). The diagrammatic approach is by far the simplest and most direct method. Note that the causality constraints require these equations to be run backward in time. This implies a forward sweep of the system to generate the output states and internal activation values, followed by a backward sweep through the reciprocal network. Also from rule 4 in the previous section, the product N′(k) δ(k + 1) may be calculated directly by a standard backpropagation of δ(k + 1) through the network at time k.⁴

⁴Backpropagation-through-time is viewed as an off-line gradient descent algorithm in which weight updates are made after each presentation of an entire training sequence. An on-line version in which adaptation occurs at each time step is possible using an algorithm called real-time-backpropagation (Williams and Zipser 1989). The algorithm, however, is far more computationally expensive. The authors have presented a method based on flow graph interreciprocity to directly relate the two algorithms (Beaufays and Wan 1994a).
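The forward and backward sweeps just described can be sketched for a minimal scalar recurrent network (weights and data below are arbitrary illustrations, not an implementation from the paper). The backward sweep propagates deltas through the advance operator and the resulting weight gradient is checked against a finite difference:

```python
import math

# Scalar recurrent net y(k) = tanh(wx*x(k) + wy*y(k-1)); illustrative data.
xs = [0.5, -1.0, 0.3, 0.8]
ds = [0.2, -0.4, 0.1, 0.6]
wx, wy = 0.9, -0.7

def run(wx, wy):
    y, ys = 0.0, []
    for x in xs:
        y = math.tanh(wx * x + wy * y)
        ys.append(y)
    return ys

ys = run(wx, wy)

# Backward sweep: delta(k) = f'(s(k)) * (-2 e(k) + wy * delta(k+1)),
# run backward in time after the forward pass.
delta_next, g_wy = 0.0, 0.0
for k in reversed(range(len(xs))):
    e = ds[k] - ys[k]
    delta = (1 - ys[k] ** 2) * (-2 * e + wy * delta_next)
    y_prev = ys[k - 1] if k > 0 else 0.0
    g_wy += delta * y_prev           # accumulate dJ/dwy over the sequence
    delta_next = delta

# Finite-difference check on the total cost J = sum_k e(k)^2.
eps = 1e-6
def J(wy_):
    return sum((d - y) ** 2 for d, y in zip(ds, run(wx, wy_)))
fd = (J(wy + eps) - J(wy - eps)) / (2 * eps)
assert abs(g_wy - fd) < 1e-6
```

Note the off-line character: the entire sequence is swept forward (storing activations), then swept backward, exactly as the causality discussion above requires.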
Figure 4: Neural network control using nonlinear ARMA models.

3.2.2 Backpropagation-Through-Time and Neural Control Architectures. Backpropagation-through-time can be extended to a number of neural control architectures (Nguyen and Widrow 1989; Werbos 1992). A system may be configured using full-state feedback or more complicated ARMA (AutoRegressive Moving Average) models as illustrated in Figure 4. To adapt the weights of the controller, it is necessary to find the gradient terms that constitute the effective error for the neural network. Figure 5 illustrates how such terms may be directly acquired using the reciprocal network. A variety of other recurrent architectures may be considered, ranging from hierarchical networks to radial basis networks with feedback. In all cases, the diagrammatic approach provides a direct derivation of the gradient algorithm.

3.3 Cascaded Neural Networks. Let us now turn to an example of two cascaded neural networks (Fig. 6), which will further illustrate advantages of the diagrammatic approach. The inputs to the first network are samples from a time sequence x(k). Delayed outputs of the first network are fed to the second network. The cascaded networks are defined as
u(k) = N₁[W₁, x(k), x(k − 1), x(k − 2)]    (3.4)

y(k) = N₂[W₂, u(k), u(k − 1), u(k − 2)]    (3.5)
where W₁ and W₂ represent the weights parameterizing the networks, y(k) is the output, and u(k) the intermediate signal. Given a desired response for the output y of the second network, it is straightforward to use backpropagation for adapting the second network. It is not obvious, however, what the effective error should be for the first network.
Figure 5: Reciprocal network for control using nonlinear ARMA models.
Figure 6: Cascaded neural network filters and reciprocal counterpart.
From the reciprocal network also shown in Figure 6, we simply label the desired signals and write down the gradient relations:
δ_u(k) = δ₁(k) + δ₂(k + 1) + δ₃(k + 2)    (3.6)

with

[δ₁(k) δ₂(k) δ₃(k)] = −2eᵀ(k) N₂′[u(k)]    (3.7)
i.e., each δ_i(k) is found by backpropagation through the output network, and the δ_i's (after appropriate advance operations) are summed together. The gradient for the weights in the first network is thus given by

∂J_k/∂W₁ = δ_u(k) ∂u(k)/∂W₁    (3.8)

in which the product term is found by a single backpropagation with δ_u(k) acting as the error to the first network. Equations can be made causal by simply delaying the weight update for a few time steps. Clearly, extrapolating to an arbitrary number of taps is also straightforward. For comparison, let us consider the brute force derivative approach to finding the gradient. Using the chain rule, the instantaneous error gradient is evaluated as

∂J_k/∂W₁ = −2eᵀ(k) [∂y(k)/∂u(k) · ∂u(k)/∂W₁ + ∂y(k)/∂u(k − 1) · ∂u(k − 1)/∂W₁ + ∂y(k)/∂u(k − 2) · ∂u(k − 2)/∂W₁]    (3.9)

= δ₁(k) ∂u(k)/∂W₁ + δ₂(k) ∂u(k − 1)/∂W₁ + δ₃(k) ∂u(k − 2)/∂W₁    (3.10)

where we define δ_i(k) ≜ −2eᵀ(k) ∂y(k)/∂u(k − i + 1). Again, the δ_i terms are found simultaneously by a single backpropagation of the error through the second network. Each product δ_i(k)[∂u(k − i + 1)/∂W₁] is then found by backpropagation applied to the first network with δ_i(k) acting as the error. However, since the derivatives used in backpropagation are time-dependent, separate backpropagations are necessary for each δ_i(k). These equations, in fact, imply backpropagation through an unfolded structure and are equivalent to weight sharing (LeCun et al. 1989) as illustrated in Figure 7. In situations where there may be hundreds of taps in the second network, this algorithm is far less efficient than the one derived directly using reciprocal networks. Similar arguments can be used to derive an efficient on-line algorithm for adapting time-delay neural networks (Waibel et al. 1989).
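The sum of advanced deltas in equation 3.6 can be illustrated with a toy version of the cascade in which the second network is replaced by a simple linear tapped filter (a hypothetical stand-in for N₂, not the paper's network). The delta sum is then checked against a finite difference on the intermediate signal:

```python
import numpy as np

rng = np.random.default_rng(2)
K = 8
u = rng.normal(size=K)                 # intermediate signal (stand-in for N1's output)
d = rng.normal(size=K)                 # desired response
c = np.array([0.5, -0.3, 0.2])         # linear "second network": y(k) = sum_i c[i] u(k-i)

def outputs(u):
    return np.array([sum(c[i] * u[k - i] for i in range(3) if k - i >= 0)
                     for k in range(K)])

e = d - outputs(u)

# For this linear N2, delta_{i+1}(k) = -2 e(k) c[i]; equation 3.6 becomes
# delta_u(k) = delta_1(k) + delta_2(k+1) + delta_3(k+2).
def delta_u(k):
    return sum(-2 * e[k + i] * c[i] for i in range(3) if k + i < K)

# Finite-difference check of dJ/du(k0) at an interior time step.
k0, eps = 3, 1e-6
up = u.copy(); up[k0] += eps
um = u.copy(); um[k0] -= eps
fd = (np.sum((d - outputs(up))**2) - np.sum((d - outputs(um))**2)) / (2 * eps)
assert abs(delta_u(k0) - fd) < 1e-6
```

A single error evaluation yields all the advanced deltas at once, mirroring the efficiency argument above: no per-tap backpropagation is needed.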
Figure 7: Cascaded neural network filters unfolded-in-time.

3.4 Temporal Backpropagation. An extension of the feedforward network can be constructed by replacing all scalar weights with discrete-time linear filters to provide dynamic interconnectivity between neurons. Mathematically, a neuron i in layer l may be specified as

s_i^l(k) = Σ_j W_ij^l(q⁻¹) a_j^{l−1}(k)    (3.11)

a_i^l(k) = f[s_i^l(k)]    (3.12)

where a(k) are neuron output values, s(k) are summing junctions, f(·) are sigmoid functions, and W(q⁻¹) are synaptic filters.⁵ Three possible forms for W(q⁻¹) are

W(q⁻¹) = w₀    (Case I)
W(q⁻¹) = Σ_{n=0}^{M} w_n q⁻ⁿ    (Case II)
W(q⁻¹) = [Σ_{n=0}^{M} b_n q⁻ⁿ] / [1 − Σ_{n=1}^{N} a_n q⁻ⁿ]    (Case III)    (3.13)
⁵The time-domain operator q⁻¹ is used instead of the more common z-domain variable z⁻¹. The z notation would imply an actual transfer function, which does not apply in nonlinear systems.
Figure 8: Block diagram construction of an FIR network and corresponding reciprocal structure.
In Case I, the filter reduces to a scalar weight and we have the standard definition of a neuron for feedforward networks. Case II corresponds to a Finite Impulse Response (FIR) filter in which the synapse forms a weighted sum of past values of its input. Such networks have been utilized for a number of time-series and system identification problems (Wan 1993a,b,c). Case III represents the more general Infinite Impulse Response (IIR) filter, in which feedback is permitted. In all cases, coefficients are assumed to be adaptive. Figure 8 illustrates a network composed of FIR filter synapses realized as tap-delay lines. Deriving the gradient descent rule for adapting filter coefficients is quite formidable if we use a direct chain rule approach. However, using the construction rules described earlier, we may trivially form the reciprocal network also shown in Figure 8. By inspection we have

δ_i^l(k) = f′[s_i^l(k)] Σ_j W_ji^{l+1}(q⁺¹) δ_j^{l+1}(k)    (3.14)

Consideration of an output neuron at layer L yields δ_i^L(k) = −2e_i(k) f′[s_i^L(k)]. These equations define the algorithm known as temporal backpropagation (Wan 1993a,b). The algorithm may be viewed as a temporal generalization of backpropagation in which error gradients are propagated not by simply taking weighted sums, but by backward filtering. Note that in the reciprocal network, backpropagation is achieved through the reciprocal filters W(q⁺¹). Since this is a noncausal filter, it is necessary to introduce a delay of a few time steps to implement the on-line adaptation. In the IIR case, it is easy to verify that equation 3.14 for temporal backpropagation still applies with W(q⁺¹) representing a noncausal IIR filter. As with backpropagation-through-time, the network must be trained using a forward and backward sweep necessitating storage of all activation values at each step in time. Different realizations for the filters dictate how signals flow through the reciprocal structure, as illustrated in Figure 9.

Figure 9: IIR filter realizations: (a) controller canonical, (b) reciprocal observer canonical, (c) lattice, (d) reciprocal lattice.

In all cases, computations remain order N (this is in contrast with the order N² algorithms derived by Back and Tsoi (1991) using direct chain rule methods). Note that the poles of the forward IIR filters are reciprocal to the poles of the reciprocal filters. Stability monitoring can be made easier if we consider lattice realizations, in which case stability is guaranteed if the magnitude of each coefficient is less than 1. Regardless of the choice of the filter realization, reciprocal networks provide a simple unified approach for deriving a learning algorithm.⁶ The above examples allow us to extrapolate the following additional construction rule: Any linear subsystem H(q⁻¹) in the original network is transformed to H(q⁺¹) in the reciprocal system.

⁶In related work (Beaufays and Wan 1994b), the diagrammatic method was used to derive an algorithm to minimize the output power at each stage of an FIR lattice filter. This provides an adaptive lattice predictor used as a decorrelating preprocessor to a second adaptive filter. The new algorithm is more effective than the Griffiths algorithm (Griffiths 1977).

4 Summary
The previous examples served to illustrate the ease with which algorithms may be derived using the diagrammatic approach. One starts with a diagrammatic representation of the network of interest. A reciprocal network is then constructed by simply swapping summing junctions with branching points, continuous functions with derivative transmittances, and time delays with time advances. The final algorithm is read directly off the reciprocal network. No messy chain rules are needed. The approach provides a unified framework for formally deriving gradient algorithms for arbitrary network architectures, network configurations, and systems.
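As a numerical check on the temporal backpropagation recursion of Section 3.4, the sketch below builds a two-layer chain of scalar FIR synapses (tap values and data are arbitrary illustrations), propagates deltas by backward filtering through the advanced taps W(q⁺¹), and compares a first-layer coefficient gradient with a finite difference:

```python
import numpy as np

rng = np.random.default_rng(3)
K, T = 12, 3                       # sequence length, FIR taps per synapse
x = rng.normal(size=K)             # input sequence
d = rng.normal(size=K)             # desired response
v = rng.normal(size=T)             # layer-1 FIR synapse coefficients
w = rng.normal(size=T)             # layer-2 FIR synapse coefficients

def forward(v):
    s1 = np.array([sum(v[n] * x[k - n] for n in range(T) if k - n >= 0)
                   for k in range(K)])
    a = np.tanh(s1)
    s2 = np.array([sum(w[n] * a[k - n] for n in range(T) if k - n >= 0)
                   for k in range(K)])
    return a, np.tanh(s2)

a, y = forward(v)
e = d - y

# Output-layer deltas, then backward filtering through W(q^{+1}):
# delta1(k) collects ADVANCED deltas delta2(k+n) weighted by the taps.
delta2 = -2 * e * (1 - y**2)
delta1 = (1 - a**2) * np.array(
    [sum(w[n] * delta2[k + n] for n in range(T) if k + n < K)
     for k in range(K)])

g_v0 = sum(delta1[k] * x[k] for k in range(K))   # dJ/dv[0]

eps = 1e-6
def J(v_):
    _, y_ = forward(v_)
    return np.sum((d - y_)**2)
vp, vm = v.copy(), v.copy()
vp[0] += eps; vm[0] -= eps
assert abs(g_v0 - (J(vp) - J(vm)) / (2 * eps)) < 1e-5
```

The advanced-delta sum is the noncausal filter W(q⁺¹) in action; an on-line version would delay the update by a few steps, as noted in Section 3.4.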
Appendix A: Proof of Reciprocal Construction Rules

We show that the diagrammatic method constitutes a formal derivation for arbitrary network architectures. Intuitively, the chain rule applied to individual building blocks yields the reciprocal architecture. However, delay operators, which cannot be differentiated, as well as feedback, prevent a straightforward chain rule approach to the proof. Instead, we use a more rigorous approach that may be outlined as follows: (1) It is argued that a perturbation applied to a specific node in the network propagates through a derivative network that is topologically equivalent to the original network. (2) The derivative network is systematically unraveled in time to produce a linear time-independent flow graph. (3) Next, the principle of flow graph interreciprocity is invoked to reverse the signal flow through the unraveled network. (4) The reverse flow graph is raveled back in time to produce the reciprocal network. The input, originally corresponding to a perturbation, becomes an output providing the desired gradient. (5) By symmetry, it is argued that all signals in the reciprocal network correspond to proper gradient terms.

1. We will initially assume that only univariate functions exist within the network. This is by no means restrictive. It has been shown (Hornik et al. 1989; Cybenko 1989) that a feedforward network with two or more layers and a sufficient number of internal neurons can approximate any uniformly continuous multivariate function to an arbitrary accuracy. A feedforward network is, of course, composed of simple univariate functions and summing junctions. Thus any multivariate function in the overall network architecture is assumed to be well approximated using a univariate composition. We may completely specify the topology of a network by the set of equations
a_j(k) = Σ_i T_ji ∘ a_i(k)    (A.1)

T_ji ∈ {f(∘), q⁻¹}    (A.2)

where a_i(k) is the signal corresponding to the node a_i at time k. The sum is taken over all signals a_i(k) that connect to a_j(k), and T_ji is a transmittance operator corresponding to either a univariate function (e.g., sigmoid function, constant multiplicative weight) or a delay operator. (The symbol ∘ is used to remind us that T is an operator whose argument is a.) The signals a_i(k) may correspond to inputs (a_i ∈ x), outputs (a_i ∈ y), or internal signals to the network. Feedback of signals is permitted. Let us add to a specific node a* a perturbation Δa*(k) at time k. The perturbation propagates through the network resulting in effective perturbations Δa_i(k) for all nodes in the network. Through a continuous univariate function in which a_j(k) = f[a_i(k)] we have, to first order,

Δa_j(k) = f′[a_i(k)] Δa_i(k)    (A.3)

where it must be clearly understood that Δa_i(k) and Δa_j(k) are the perturbations directly resulting from the external perturbation Δa*(k). Through a delay operator, a_j(k) = q⁻¹a_i(k) = a_i(k − 1), we have

Δa_j(k) = Δa_i(k − 1)    (A.4)

Combining these two results with equation A.1 gives

Δa_j(k) = Σ_i T′_ji ∘ Δa_i(k)    (A.5)

where we define T′ ∈ {f′[a_i(k)], q⁻¹}. Note that f′[a_i(k)] is a linear time-dependent transmittance. Equation A.5 defines a derivative network that is topologically identical to the original network (i.e., one-to-one correspondence between signals and connections). Functions are simply replaced by their derivatives. This is a rather obvious result, and simply states that a perturbation propagates through the same connections and in the same direction as would normal signals.

Figure 10: (a) Time-dependent input/output system for the derivative network. (b) Same system with all delays drawn externally.

2. The derivative network may be considered a time-dependent system with input Δa*(k) and outputs Δy(k) as illustrated in Figure 10a. Imagine now redrawing the network such that all delay operators q⁻¹ are dragged outside the functional block (Fig. 10b). Equation A.5 still applies. Neither the definition nor the topology of the network has been changed. However, we may now remove the delay operators by cascading copies of the derivative network as illustrated in Figure 11. Each stage has a different set of transmittance values corresponding to the time step. The unraveling process stops at the final time K (K is allowed to approach ∞). Additionally, the outputs Δy(n) at each stage are multiplied by −2e(n)ᵀ and then summed over all stages to produce a single output ΔJ = Σ_n −2e(n)ᵀ Δy(n). By removing the delay operators, the time index k may now be treated as simply a labeling index. The unraveled structure is thus a time-independent linear flow graph. By linearity, all signals in the flow graph can be divided by Δa*(k) so that the input is now 1, and the output is ΔJ/Δa*(k). In the limit of small Δa*(k)
∂J/∂a*(k) = lim_{Δa*(k)→0} ΔJ/Δa*(k)    (A.6)

Figure 11: Flow graph corresponding to unraveled derivative network.

Since the system is causal, the partial of the error at time n with respect to a*(k) is zero for n < k. Thus

δ*(k) ≜ Σ_{n=1}^{K} ∂[e(n)ᵀe(n)]/∂a*(k) = Σ_{n=k}^{K} ∂[e(n)ᵀe(n)]/∂a*(k)    (A.7)

Figure 12: Transposed flow graph corresponding to unraveled derivative network.
The term δ*(k) is precisely what we were interested in finding. However, calculating all the δ_i(k) terms would require separately propagating signals through the unraveled network with an input of 1 at each location associated with a_i(k). The entire process would then have to be repeated at every time step.

3. Next, take the unraveled network (i.e., flow graph) and form its transpose. This is accomplished by reversing the signal flow direction, transposing the branch gains, replacing summing junctions by branching points and vice versa, and interchanging input and output nodes. The new flow graph is represented in Figure 12.
From the work by Tellegen (1952) and Bordewijk (1956), we know that transposed flow graphs are a particular case of interreciprocal graphs. This means that the output obtained in one graph, when exciting the input with a given signal, is the same as the output value of the transposed graph, when exciting its input by the same signal. In other words, the two graphs have identical transfer functions.⁷ Thus, if an input of 1.0 is distributed along the lower horizontal branch of the transposed graph, the final output will equal δ*(k). This δ*(k) is identical to the output of our original flow graph.

4. The transposed graph can now be raveled back in time to produce the reciprocal network. Since the direction of signal flow has been reversed, delay operators q⁻¹ become advance operators q⁺¹. The node a*(k) that was the original source of an input perturbation is now the output δ*(k) as desired. The outputs of the original network become inputs with value −2e(k). Summarizing the steps involved in finding the reciprocal network, we start with the original network, form the derivative network, unravel in time, transpose, and then ravel back up in time. These steps are accomplished directly by starting with the original network and simply swapping branching points and summing junctions (rules 1 and 2), functions f for derivatives f′ (rule 3), and q⁻¹'s for q⁺¹'s (rule 5).

5. Note that the selection of the specific node a*(k) is totally arbitrary. Had we started with any node a_i(k), we would still have arrived at the same result. In all cases, the input to the reciprocal network would still be −2e(k). Thus by symmetry, every signal in the reciprocal network provides δ_i(k) = ∂J/∂a_i(k). This is exactly what we set out to prove for the signal flow in the reciprocal network.
Finally, for multivariate functions, a_out(k) = F[a_in(k)], where by our earlier statements it was assumed the function was explicitly represented by a composition of summing junctions and univariate functions. F is restricted to be memoryless and thus cannot be composed of any delay operators. In the reciprocal network, a_in(k) becomes δ_in(k) and a_out(k) becomes δ_out(k). But

δ_in(k) = F′[a_in(k)] δ_out(k)

Thus any section within a network that contains no delays may be replaced by the multivariate function F(·), and the corresponding section in the reciprocal network is replaced with the Jacobian F′[a_in(k)]. This verifies rule 4. □

⁷This basic property, which was first presented in the context of electrical circuit analysis (Penfield et al. 1970), finds applications in a wide variety of engineering disciplines, such as the reciprocity of emitting and receiving antennas in electromagnetism (Ramo et al. 1984), and the duality between decimation-in-time and decimation-in-frequency formulations of the FFT algorithm in signal processing (Oppenheim and Schafer 1989). Flow graph interreciprocity was first applied to neural networks to relate real-time-backpropagation and backpropagation-through-time by Beaufays and Wan (1994a).
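The flow graph interreciprocity property used in step 3 is easy to verify for a generic acyclic linear flow graph: transposing every branch gain and swapping the roles of input and output leaves the transfer unchanged. A minimal sketch with an arbitrary random gain matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
# Strictly lower-triangular branch gains define an acyclic linear flow graph.
A = np.tril(rng.normal(size=(n, n)), k=-1)

# Node signals solve a = A a + u, so a = (I - A)^{-1} u.
def transfer(G, src, dst):
    u = np.zeros(n); u[src] = 1.0
    return np.linalg.solve(np.eye(n) - G, u)[dst]

# Interreciprocity: transpose the gains and swap input/output nodes.
fwd = transfer(A, 0, n - 1)
rev = transfer(A.T, n - 1, 0)
assert abs(fwd - rev) < 1e-8
```

Algebraically this is just [(I − A)⁻¹]_{ji} = [(I − Aᵀ)⁻¹]_{ij}, which is the scalar content of the Tellegen/Bordewijk result exploited by the proof.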
200
Eric A. Wan and Françoise Beaufays
Acknowledgments This work was funded in part by EPRI under Contract RP801013 and by NSF under Grants IRI 91-12531 and ECS-9410823.
References
Back, A., and Tsoi, A. 1991. FIR and IIR synapses, a new neural network architecture for time series modeling. Neural Comp. 3(3), 375-385.
Beaufays, F., and Wan, E. 1994a. Relating real-time backpropagation and backpropagation-through-time: An application of flow graph interreciprocity. Neural Comp. 6(2), 296-306.
Beaufays, F., and Wan, E. 1994b. An efficient first-order stochastic algorithm for lattice filters. ICANN'94 2, 1021-1024.
Bordewijk, J. 1956. Inter-reciprocity applied to electrical networks. Appl. Sci. Res. 6B, 1-74.
Bryson, A., and Ho, Y. 1975. Applied Optimal Control. Hemisphere, New York.
Cybenko, G. 1989. Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2(4).
Griewank, A., and Corliss, G., (eds.) 1991. Automatic differentiation of algorithms: Theory, implementation, and application. Proc. First SIAM Workshop on Automatic Differentiation, Breckenridge, Colorado.
Griffiths, L. 1977. A continuously adaptive filter implemented as a lattice structure. Proc. ICASSP, Hartford, CT, 683-686.
Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. 1989. Backpropagation applied to handwritten zip code recognition. Neural Comp. 1, 541-551.
Matsuoka, K. 1991. Learning of neural networks using their adjoint systems. Syst. Computers Jpn. 22(11), 31-41.
Narendra, K., and Parthasarathy, K. 1990. Identification and control of dynamic systems using neural networks. IEEE Trans. Neural Networks 1(1), 4-27.
Nerrand, O., Roussel-Ragot, P., Personnaz, L., Dreyfus, G., and Marcos, S. 1993. Neural networks and nonlinear adaptive filtering: Unifying concepts and new algorithms. Neural Comp. 5(2), 165-199.
Nguyen, D., and Widrow, B. 1989. The truck backer-upper: An example of self-learning in neural networks. Proc. Int. Joint Conf.
on Neural Networks, II, Washington, DC, 357-363.
Oppenheim, A., and Schafer, R. 1989. Digital Signal Processing. Prentice-Hall, Englewood Cliffs, NJ.
Parisini, T., and Zoppoli, R. 1994. Neural networks for feedback feedforward nonlinear control systems. IEEE Trans. Neural Networks 5(3), 436-439.
Parker, D. 1982. Learning-Logic. Invention Report S81-64, File 1, Office of Technology Licensing, Stanford University, October.
Derivation of Gradient Algorithms
201
Penfield, P., Spence, R., and Duinker, S. 1970. Tellegen's Theorem and Electrical Networks. MIT Press, Cambridge, MA.
Plumer, E. 1993. Optimal terminal control using feedforward neural networks. Ph.D. dissertation, Stanford University, Stanford, CA.
Rall, L. B. 1981. Automatic Differentiation: Techniques and Applications. Lecture Notes in Computer Science, Springer-Verlag, Berlin.
Ramo, S., Whinnery, J. R., and Van Duzer, T. 1984. Fields and Waves in Communication Electronics, 2nd Ed. John Wiley, New York.
Rumelhart, D. E., McClelland, J. L., and the PDP Research Group. 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1. MIT Press, Cambridge, MA.
Tellegen, B. D. H. 1952. A general network theorem, with applications. Philips Res. Rep. 7, 259-269.
Toomarian, N. B., and Barhen, J. 1992. Learning a trajectory using adjoint function and teacher forcing. Neural Networks 5(3), 473-484.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K. 1989. Phoneme recognition using time-delay neural networks. IEEE Trans. Acoustics, Speech, Signal Process. 37(3), 328-339.
Wan, E. 1993a. Finite impulse response neural networks with applications in time series prediction. Ph.D. dissertation, Stanford University, Stanford, CA.
Wan, E. 1993b. Time series prediction using a connectionist network with internal delay lines. In Time Series Prediction: Forecasting the Future and Understanding the Past, A. Weigend and N. Gershenfeld, eds. Addison-Wesley, Reading, MA.
Wan, E. 1993c. Modeling nonlinear dynamics with neural networks: Examples in time series prediction. Proc. Fifth Workshop on Neural Networks: Academic/Industrial/NASA/Defense, WNN93/FNN93, pp. 327-332, San Francisco.
Wan, E., and Beaufays, F. 1994. Network reciprocity: A simple approach to derive gradient algorithms for arbitrary neural network structures. WCNN'94, San Diego, CA, III, 87-93.
Werbos, P. 1974.
Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. thesis, Harvard University, Cambridge, MA.
Werbos, P. 1992. Neurocontrol and supervised learning: An overview and evaluation. In Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, chap. 3, D. White and D. Sofge, eds. Van Nostrand Reinhold, New York.
Williams, R., and Zipser, D. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Comp. 1, 270-280.
Received April 29, 1994; accepted May 16, 1995.
Communicated by David Wolpert
Does Extra Knowledge Necessarily Improve Generalization?
The generalization error is a widely used performance measure employed in the analysis of adaptive learning systems. This measure is generally critically dependent on the knowledge that the system is given about the problem it is trying to learn. In this paper we examine to what extent it is necessarily the case that an increase in the knowledge that the system has about the problem will reduce the generalization error. Using the standard definition of the generalization error, we present simple cases for which the intuitive idea of "reducivity" (that more knowledge will improve generalization) does not hold. Under a simple approximation, however, we find conditions under which "reducivity" is satisfied. Finally, we calculate the effect of a specific constraint on the generalization error of the linear perceptron, in which the signs of the weight components are fixed. This particular restriction results in a significant improvement in generalization performance.

1 Introduction
The employment of a priori knowledge in designing a learning machine is crucial to the machine's ability to generalize well. Given that knowledge affects the generalization ability, our aim in this paper is to address the following question: does more knowledge necessarily improve generalization? Intuitively, the answer to this question would seem to be "yes," depending, of course, on the definitions of knowledge and generalization. Nevertheless, this question phrases a possible desideratum, which itself could affect the design of learning machines. We formulate the problem in the language of learning from examples (see, e.g., Hertz et al. 1991). A training set of input/output pairs is generated by some teacher function, and the task is to find a student function whose outputs closely match the outputs of the teacher function on the training set. Constraints on the set of possible teacher functions that generate the training set are critical in narrowing down the search for a good student. Indeed, without any constraints it is an impossible task to find a student that generalizes to unseen examples (see, e.g., Wolpert 1992). A priori assumptions
Extra Knowledge and Generalization
203
are therefore made as to the form of the teacher; that is, restrictions are imposed on the space of teacher functions. Throughout this paper we assume that the spaces of the teacher and student functions are the same. The learning problem is then realizable in the sense that, among the student space, there is a student that will match perfectly the output of the teacher on all possible inputs. We denote the teacher/student space of functions by F(Φ), and a particular mapping as y = f(x, θ) for f ∈ F(Φ) and θ ∈ Φ, where the output is denoted by y, and the input by x. A particular mapping that a function performs is represented by the point θ in the parameter space Φ. We assume that a teacher θ⁰ generates the noise-free set of training data L = {x^μ, f(x^μ, θ⁰)}, where μ = 1..p indexes each element of the training set L. In the learning problem, one attempts to find a student f(x, θ) that matches the teacher f(x, θ⁰) on the training set.¹ To measure the extent to which the student has learned the teacher, an error measure ε(θ, θ⁰, x) is defined. The set of admissible students, represented by the parameter space θ ∈ Θ, is determined by the requirement of minimizing the error measure on all examples in the training set, and satisfying a priori constraints on the student. Hence Θ expresses all the information that the student has about the teacher.² In Section 2 we review briefly the definition of the generalization error, before formulating the original question more rigorously. We subsequently consider specific cases, beginning with the simplest possible: a one-dimensional version space. In Section 3, we analyze higher dimensions, using the linear perceptron as the function space F(Φ). In Section 4 we conclude with a summary of the main results of the paper and an outlook on further research.

2 General Theory
2.1 The Generalization Error. To measure how well the student performs on the p training examples, the training energy is formed, E_tr ∝ Σ_{μ=1}^p ε(θ, θ⁰, x^μ). The student is found by minimizing the training error with respect to the parameter θ, while also adhering to additional a priori constraints. This is typically achieved by stochastic gradient descent, resulting in a posttraining distribution of students, P(θ | L) ∝ P^pr(θ) exp(-E_tr/T), where the temperature, T, controls the randomness of the stochastic algorithm (see, e.g., Watkin et al. 1993). P^pr(θ) is the a priori constraint on the student. In the limit of zero T, the distribution of students becomes uniform over those that have zero training error and satisfy the a priori constraints; this space of student functions

¹Extra regularization conditions on the student, such as weight decay, will not be considered here.
²We briefly note that the assumption that the set of admissible functions is all that is known about the teacher function is found also in the PAC approach (see, e.g., Haussler 1994); in this paper, however, we address somewhat different issues.
204
David Barber and David Saad
is known as the version space (Watkin et al. 1993), which we denote by Θ.³ In Section 3.3, we present results for nonzero T, but for the rest of the paper, zero T is implied. To find the expected error that a student makes on a random example input, termed the generalization function, we average the error over the input distribution, P(x), giving ε_f(θ, θ⁰) = ∫ dx P(x) ε(θ, θ⁰, x).⁴ Hence ε_f(θ, θ⁰) measures the expected error, given that the teacher is θ⁰ and that the student is θ. As the student does not know the teacher, we assume that Θ expresses all the information that the student has about the teacher. The generalization error is then defined as the expected performance of a randomly selected student from Θ, given a randomly selected teacher from Θ,

ε_g(Θ) = ⟨⟨ε_f(θ, θ⁰)⟩_{θ∈Θ}⟩_{θ⁰∈Θ}
where ⟨·⟩_{θ∈Θ} and ⟨·⟩_{θ⁰∈Θ} represent averages over the version space Θ.⁵ We write ε_g(Θ) to emphasize that the generalization error is a function of the version space. Intuitively, one expects that any further restrictions or a priori assumptions, resulting in a smaller version space, must necessarily reduce the generalization error. To test this intuition, we make the following definition.
Definition. F(Θ') is an "error reduced" function space of F(Θ) if ε_g(Θ') < ε_g(Θ) for Θ' ⊂ Θ, and we say that "reducivity" holds. In this paper we examine which subsets Θ' of Θ are error reducing, according to the preceding definition. We mention briefly that one can also consider the generalization error for a fixed teacher, ε_g(θ⁰, Θ) = ⟨ε_f(θ, θ⁰)⟩_{θ∈Θ}, and check reducivity with the teacher assumed known. We show in a later section, however, that the main results of this paper also hold for ε_g(θ⁰, Θ), and concentrate accordingly on ε_g(Θ).

2.2 One-Dimensional Version Space. We begin with the simplest possible case of a one-dimensional version space, assuming that it can be parameterized by a connected interval on the real line, which we write, without loss of generality, as [0, a]. Furthermore, we assume that the generalization function can be written as ε_f(θ, θ⁰) = g(|θ - θ⁰|), for some function g(·).⁶ ε_g(Θ) is then simply ε_g(a) = ∫₀^a dθ P(θ) ∫₀^a dθ⁰ P(θ⁰) g(|θ - θ⁰|),

³The student distribution we consider is known also as exhaustive learning (see, e.g., Schwartz et al. 1990).
⁴An extension to the framework of this paper is to consider the off-training-set error (see, e.g., Wolpert 1992), in which the expected error of the student is calculated for test examples not included in the training set.
⁵In this joint average of ε_f(θ, θ⁰) over the version space, we assume independence of the student and the teacher: As the training set is fixed, we write P(θ⁰, θ | L) = P(θ | θ⁰, L)P(θ⁰ | L). With the assumption P(θ | θ⁰, L) = P(θ | L), we have that θ and θ⁰ are independently distributed over Θ.
⁶With this assumption as to the form of the generalization function we have in mind a larger class of error measures than the square error measure, ε(θ, θ⁰, x) =
where P(·) is the parameter space distribution. For a uniform distribution, P(θ) = P(θ⁰) = 1/a, and we can write

ε_g(a) = (2/a²) ∫₀^a dx (a - x) g(x)

for which the requirement of reducivity, i.e., dε_g(a)/da > 0, becomes

(2/a) ∫₀^a dx x g(x) > ∫₀^a dx g(x)

This is equivalent to

⟨g⟩_a > (2/a²) ∫₀^a dx x ⟨g⟩_x

where ⟨g⟩_x is the average value of g(·) over the interval [0, x]. For a monotonically increasing function, ⟨g⟩_a > ⟨g⟩_x (a > x), and thus reducivity holds for all monotonic increasing functions defined on the real line. Unfortunately, for higher dimensional cases, it is not generally possible to separate the dependence of the generalization function into a summation over the individual components of the parameter vector, i.e., ε_f(θ, θ⁰) ≠ Σ_{i=1}^n g(|θ_i - θ_i⁰|), where n is the dimension of the parameterization, and more complicated effects can appear. In the following sections we concentrate on the linear perceptron, beginning with an explicit example of a two-dimensional version space that violates the error reduction property.

3 The Linear Perceptron
For the noise-free linear perceptron, the inputs are represented by n-dimensional real vectors, x ∈ ℝⁿ, and the output is a single-valued real variable, y ∈ ℝ (see, e.g., Hertz et al. 1991). The inputs x are assumed drawn independently and identically from a zero-mean, unit-covariance-matrix gaussian distribution. The teacher outputs are given by f(x, w⁰) = w⁰·x/√n. Similarly, the student outputs are f(x, w) = w·x/√n. The error measure is taken to be proportional to the squared difference between the teacher and student outputs, ε(w, w⁰, x) = (w·x - w⁰·x)²/2n, which gives ε_f(w, w⁰) = (w - w⁰)²/2n. We also impose the additional a priori spherical constraint on both the student and teacher, w·w = w⁰·w⁰ = n. We proceed to analyze this model for a specific version space.

3.1 A Two-Dimensional Version Space. Let us consider the three-dimensional linear perceptron. A point on the surface of a three-dimensional sphere of radius r = √3 is given by the ordered pair (φ, θ), which represents the usual spherical polar coordinate parameterization.⁷

½[f(x, θ) - f(x, θ⁰)]², for which the assumption ε_f(θ, θ⁰) = g(|θ - θ⁰|) would hold only for the linear function f(x, θ) = xθ and g(s) = s².
⁷w₁ = r cos(φ) sin(θ), w₂ = r sin(φ) sin(θ), w₃ = r cos(θ), where r = √3 for the spherical normalization condition.
Figure 1: A sphere of radius √3. The shaded region represents the version space, Θ = {θ ∈ [0.4, 0.6], φ ∈ [0, 2π]}. Making Θ smaller by pushing the inner boundary toward the outer boundary does not result in a reduction in generalization error.
The generalization function is ε_f(w, w⁰) = 1 - w·w⁰/3. We write this expression in spherical coordinates and average over the version space given by Θ = {(φ, θ): φ ∈ [a, b], θ ∈ [c, d]} to give

ε_g(Θ) = 1 - (1/(d - c)²) {λ [cos(d) - cos(c)]² + [sin(d) - sin(c)]²}

where λ = 2[1 - cos(b - a)]/(b - a)². To violate reducivity we look for regions such that when we reduce the width of, for example, the interval [c, d], the generalization error increases. Without loss of generality, we search for regions for which ∂ε_g(Θ)/∂c > 0, and we plot one such region in Figure 1. To find such a region explicitly, we look for the boundary at which ∂ε_g(Θ)/∂c = 0, which defines Λ(c, d) through the condition λ = Λ(c, d), where

Λ(c, d) = (sin d - sin c)[sin c - sin d + (d - c) cos c] / {(cos d - cos c)[cos d - cos c + (d - c) sin c]}

In Figure 2a, we show how this relates to reducivity. In region (1), Λ varies between 0 and 1, and ∂ε_g(Θ)/∂c can be of either sign, depending on the value of λ; thus in region (1), reducivity depends critically on δ = b - a. For λ > Λ, ∂ε_g(Θ)/∂c < 0, and for λ < Λ, ∂ε_g(Θ)/∂c > 0. In both regions (2) and (3), Λ ∉ [0, 1] and, as λ ∈ [0, 1] (Fig. 2b), the sign of ∂ε_g(Θ)/∂c is fixed, independent of [a, b]. In fact, in regions (2) and (3), reducivity is guaranteed. In region (2), as δ decreases (i.e., [a, b] shrinks), ∂ε_g(Θ)/∂c becomes increasingly negative, whereas in region (3), for decreasing δ, ∂ε_g(Θ)/∂c becomes less negative. The boundary between regions (2) and (3) is given by the solution of cos d - cos c + (d - c) sin c = 0. Despite the simplicity of the example, the behavior of reducivity on the sphere is nontrivial. At this point, the reader may well conjecture that reducivity would be guaranteed for convex regions Θ and Θ' ⊂ Θ.⁸ Perhaps somewhat surprisingly, we demonstrate in the next section that convexity is not a sufficient condition for reducivity.
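The Figure 1 claim is easy to verify numerically. Using ε_f(w, w⁰) = 1 - w·w⁰/3 averaged by Monte Carlo over the version space (a sketch; the specific numbers are our own check, not from the text), narrowing the θ interval from [0.4, 0.6] to [0.5, 0.6] with φ ∈ [0, 2π] raises the generalization error:

```python
import numpy as np

def eps_g(a, b, c, d, samples=200_000, seed=0):
    # Monte Carlo: eps_g = 1 - <w>.<w0>/3 = 1 - |<w>|^2 / 3, with w a radius-sqrt(3)
    # vector and (phi, theta) uniform on [a, b] x [c, d].
    rng = np.random.default_rng(seed)
    phi = rng.uniform(a, b, samples)
    theta = rng.uniform(c, d, samples)
    w = np.sqrt(3) * np.array([np.cos(phi) * np.sin(theta),
                               np.sin(phi) * np.sin(theta),
                               np.cos(theta)])
    m = w.mean(axis=1)
    return 1.0 - m @ m / 3.0

wide   = eps_g(0.0, 2 * np.pi, 0.4, 0.6)   # version space of Figure 1
narrow = eps_g(0.0, 2 * np.pi, 0.5, 0.6)   # inner boundary pushed outward
assert narrow > wide                        # smaller version space, larger error
```

With the full φ circle, ⟨w⟩ points along the polar axis, and pushing the band away from the pole shrinks |⟨w⟩|, so the shrunken version space has the larger error (roughly 0.27 versus 0.23 here).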
3.2 Euclidean Approximation to the Version Space. For simplicity, we concentrate on version spaces small enough that the region can be considered Euclidean. For the linear perceptron described above, this corresponds to a region small enough that the curved surface of the hypersphere appears "flat." By writing w = c + W and w⁰ = c + W⁰, where c lies in the space Θ, we have ε_f = (W - W⁰)²/2n, and

ε_g(Θ) = (1/2n) ⟨(W - W⁰)²⟩_{W,W⁰∈Θ̃}

where Θ̃ is the approximately flat region on the sphere. As W and W⁰ are uncorrelated, this can be written in the form

ε_g(Θ) = (1/n) [⟨W²⟩_{W∈Θ̃} - ⟨W⟩²_{W∈Θ̃}]

We now consider an infinitesimal decrease in the space, Θ̃' = Θ̃ - Δ. For a uniform distribution over the space, and ignoring terms in Δ², we can write, with a slight abuse of notation,

n ε_g(Θ̃') = ⟨W²⟩_{Θ̃} + (A/S) [⟨W²⟩_{Θ̃} - ⟨W²⟩_Δ]   (3.1)
⁸In general, a region is convex if the geodesic connecting any two points lies wholly within the region itself.
Figure 2: The version space is the region on the sphere given by Θ = {(φ, θ), φ ∈ [a, b], θ ∈ [c, d]}. (a) In (1) reducivity depends on the region [a, b]. In (2) and (3) reducivity is guaranteed [∂ε_g(Θ)/∂c < 0]. In (2), as [a, b] shrinks, ∂ε_g(Θ)/∂c becomes more negative, and vice versa in region (3). The region c > d is unphysical. (b) The function Λ versus δ = b - a.
where A and S are the surface contents of Δ and Θ̃, respectively. In equation 3.1, we have assumed, without loss of generality, that ⟨W⟩_{W∈Θ̃} = 0, i.e., that the origin, c, is taken to be the centroid of Θ̃. Reducivity holds then for the condition

⟨W²⟩_Δ > ⟨W²⟩_{W∈Θ̃}   (3.2)
Figure 3: Counterexample used to show that convexity is not a sufficient condition for reducivity. We take the hypotenuse to have length 2. The cross marks the position of the teacher for the example of reducivity violation for a given teacher.
Note that this is a general condition, holding for any dimension. Using this, we can show that convexity (for the linear perceptron at least) is not a sufficient condition for reducivity to hold. To do this, we observe that equation 3.2 will not be satisfied for regions, Δ, sufficiently close to the centroid, since then the left-hand side of equation 3.2 will be small. This observation leads to the following two-dimensional counterexample. Let the convex region Θ̃ be the larger triangle as shown in Figure 3. By explicit calculation, one finds n ε_g(tri) = 2/9 for the marked angle γ = π/2. We now take Θ̃', a convex subset of Θ̃, to be the trapezium as shown, for which, in the limit h → 0, n ε_g(trap) = 1/3. Hence ε_g(Θ̃') > ε_g(Θ̃), demonstrating the insufficiency of convexity as a condition for reducivity.⁹ At this point we refer to Section 2.1 and note that we can readily find an example of a fixed teacher for which an increase in the student's knowledge results in an increase in ε_g(θ⁰, Θ). In the above trapezium/triangle example, consider a very flat triangle, for which γ tends to π. We take the teacher to be positioned at the cross marked in Figure 3, for which ε_g(×, tri) = 1/6. Taking again Θ̃' to be the infinitely thin trapezium, we have ε_g(×, trap) = 1/3, which is larger than ε_g(×, tri).
⁹Note that the "distance" measure, ε_f = (W - W⁰)²/2n, is not a metric (it does not satisfy the triangle inequality). For the metric ε_f = |W - W⁰|/2n, n ε_g(tri) = 0.29 and n ε_g(trap) = 0.32, such that reducivity is still violated, though not as severely.
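The triangle/trapezium counterexample can be checked by Monte Carlo under the Euclidean approximation, n ε_g = ⟨W²⟩ - ⟨W⟩² (a sketch; the vertex coordinates are our own choice of a right-angled triangle with hypotenuse 2, and the h → 0 trapezium is taken as the degenerate hypotenuse segment):

```python
import numpy as np
rng = np.random.default_rng(0)

def tri_samples(A, B, C, m):
    # Uniform samples in a triangle via the standard sqrt trick.
    r1 = np.sqrt(rng.random(m))[:, None]
    r2 = rng.random(m)[:, None]
    return (1 - r1) * A + r1 * (1 - r2) * B + r1 * r2 * C

def n_eps_g(pts):
    # Euclidean approximation: n * eps_g = <W^2> - |<W>|^2
    return (pts ** 2).sum(1).mean() - (pts.mean(0) ** 2).sum()

A, B, C = np.array([0., 1.]), np.array([-1., 0.]), np.array([1., 0.])
tri = n_eps_g(tri_samples(A, B, C, 300_000))          # exact value 2/9

# h -> 0: the trapezium degenerates to the hypotenuse segment [-1, 1]
seg = np.stack([rng.uniform(-1, 1, 300_000),
                np.zeros(300_000)], axis=1)
trap = n_eps_g(seg)                                   # exact value 1/3
assert trap > tri
```

The estimates land near 2/9 ≈ 0.222 and 1/3 ≈ 0.333, so removing area from the convex triangle has increased the generalization error, as claimed.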
The geometry of the above situation may appear somewhat pathological. Such nonreducive situations can, however, be constructed for essentially any version space Θ. In passing, we mention another example to help clarify the situation. For a two-dimensional ellipse with minor and major axes a and b, respectively, one readily finds ⟨W²⟩_ellipse = (a² + b²)/4. We see then that for a circle (b = a), all infinitesimal enlargements of the circle are "expansions" in the sense that they satisfy equation 3.2. For an ellipse (b > a) we can violate equation 3.2 by choosing the point on the perimeter about which we wish to expand to be close to the centroid (⟨W²⟩_Δ ≈ a²), with b > √3 a. We note that this violation of reducivity occurs for an eccentricity (b/a) that is not much larger than unity. In general, such nonexpansive enlargements can occur for the following reason: the centroid represents the best-guess student (within the Euclidean approximation); adding space as close as possible to this student increases the weight of the distribution of weight space close to this best guess, decreasing ε_g. By examining equation 3.1, we note that the greatest decrease in generalization error is to be found for a region Δ farthest away from the centroid of the set. This is in line with the intuitive notion that we can improve generalization most by increasing our knowledge about the teacher in those regions that contribute most to the generalization error. One way to obtain this knowledge is to choose an input x such that the reply from the teacher yields information about the teacher in the desired region; this is the concept of query learning (see, e.g., Sollich 1994). The previous arguments have been aimed at infinitesimal, local alterations to Θ̃, and we consider briefly an example of global enlargement. We envisage situations in which the boundary of Θ̃ can be expressed in a spherical coordinate system, r = r(φ, θ, ...), which is the case for convex regions.
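The ellipse example can likewise be checked numerically: the second moment of a uniform ellipse about its centroid is (a² + b²)/4, so an infinitesimal enlargement near the end of the minor axis (where ⟨W²⟩_Δ ≈ a²) violates condition 3.2 precisely when a² < (a² + b²)/4, i.e., b > √3 a. A sketch with assumed semi-axes a = 1, b = 2:

```python
import numpy as np
rng = np.random.default_rng(0)

a_ax, b_ax = 1.0, 2.0                        # minor and major semi-axes
x = rng.uniform(-a_ax, a_ax, 400_000)
y = rng.uniform(-b_ax, b_ax, 400_000)
inside = (x / a_ax) ** 2 + (y / b_ax) ** 2 <= 1.0   # rejection sampling
m2 = (x[inside] ** 2 + y[inside] ** 2).mean()        # ~ (a^2 + b^2)/4 = 1.25

# Enlarging near the minor-axis endpoint adds mass with <W^2> ~ a^2 = 1,
# below m2, so equation 3.2 fails here (b = 2 > sqrt(3) * a).
assert a_ax ** 2 < m2
```

Since √3 ≈ 1.73, this confirms the remark that the violation already occurs at an eccentricity not much larger than unity.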
The enlarged version space Θ̃' can then be defined by a new boundary, r' = λ(φ, θ, ...) r(φ, θ, ...), for some λ(φ, θ, ...) > 1. Assuming we can bound λ by some extremum values, λ_min < λ(φ, θ, ...) < λ_max, it is then a simple matter to form an inequality such that the generalization error of the larger version space is greater than the generalization error of the smaller. For an enlargement λ(φ, θ, ...) that preserves the origin as the centroid of both Θ̃ and Θ̃', we require for reducivity, in the two-dimensional case, λ_min² > λ_max: sufficient, but by no means necessary.
3.3 Sign-Constrained Weights. Up to now we have considered low-dimensional version spaces; here we calculate the generalization error of an infinitely large perceptron for a specific weight constraint. The sign of each weight is predetermined according to sgn(w_i) = p_i, where each p_i (i = 1..n) is either ±1. This constraint has been studied previously in the context of pattern storage for the Hopfield network, for which it was found that the capacity was half that without the sign constraints (Amit et al. 1989).
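At α = 0 (no training examples), the version space under the spherical-sign constraint is the positive orthant of the sphere, and a simple large-n argument gives ε_g = 1 - ⟨w⟩²/n → 1 - 2/π ≈ 0.36, consistent with the starting point of the spherical-sign curves in Figure 4. A Monte Carlo sketch (our own check, with an assumed dimension n = 500, not part of the paper's statistical mechanics calculation):

```python
import numpy as np
rng = np.random.default_rng(0)

n, pairs = 500, 2000

def sample(k):
    # Uniform samples on the sphere w.w = n restricted to the positive
    # orthant: take |gaussian| vectors and rescale to radius sqrt(n).
    g = np.abs(rng.standard_normal((k, n)))
    return np.sqrt(n) * g / np.linalg.norm(g, axis=1, keepdims=True)

w, w0 = sample(pairs), sample(pairs)
eps = np.mean(1.0 - (w * w0).sum(axis=1) / n)   # eps_g at alpha = 0
assert abs(eps - (1 - 2 / np.pi)) < 0.01
```

Without the sign constraint, the same experiment on the full sphere gives ⟨w⟩ = 0 and hence ε_g = 1, matching the spherical curves' starting point.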
Figure 4: Comparison of the generalization error for the spherical constraint and the spherical-sign constraint, as a function of α. The curves beginning at 1 for α = 0 are the spherical constraint; the spherical-sign curves begin at 1 - 2/π for α = 0.

By writing the output of the perceptron as y = Σ_i x_i sgn(w_i)|w_i|, where |·| is the modulus, and transforming the inputs according to x'_i = p_i x_i, the output can be written y = Σ_i x'_i |w_i|. As the input distribution is gaussian and hence symmetric, the analysis of the sign constraint is equivalent to that of constraining the weights to be positive. In addition, we retain the spherical constraint. The method of calculation is that of statistical mechanics, following closely the exposition given in Seung et al. (1992). This will enable us to obtain results for any temperature, and without recourse to the Euclidean approximation employed in Section 3.2. As is required in statistical mechanics calculations, we define the limit of the dimension of the perceptron such that the number of training patterns is proportional to the dimension of the perceptron, i.e., p = αn. A sketch of the calculation is given in the Appendix; as the calculation follows so closely that given by Seung et al. (1992), we refer the reader to that work, and point out only the major differences between our and their analyses. For the spherical constraint alone, the dimension of the version space (T = 0) reduces linearly with α, resulting in a linear reduction of the generalization error, ε_g = 1 - α, α ≤ 1. For the spherical-sign constraint, however, boundary effects result in a small deviation from linearity (Fig. 4). For T = 0 and α ≥ 1, the subspace of solutions collapses to a single point and ε_g = 0. Nonzero T results in an increase in generalization error,
affecting both the spherical and spherical-sign constraints similarly, such that for a given (α, T), ε_g

above 37°C, the mean firing rate decreases again, due to skipping and to the decrease in the number of spikes per burst (Fig. 4C). The curve in Figure 4C is similar to ones published in Bade et al. (1979), although most of their mean rate curves decrease over a slightly larger (although variable) 5-10 degree range. Agreement would be even better if we could adjust the
230
André Longtin and Karin Hinzer
[Figure 4 panels: curves plotted against TEMPERATURE (°C) over roughly 20-40°C; panel D plotted against FREQUENCY (Hz) over 0-10 Hz; noise levels labeled D = 0.0025 and D = 0.]
Figure 4: Statistics of simulated firing patterns as a function of temperature. (A) Number of spikes per burst; (B) burst frequency (BF; reciprocal of the interburst period), corresponding to the frequency of the subthreshold oscillation; (C) mean firing rate (reciprocal of the mean of the ISIH). The power spectrum of the spike trains is used to estimate BF. An example of such a spectrum for T = 35°C is shown in (D). The power spectra are obtained by averaging the spectra from 100 realizations of the same duration as those described in Figure 3 for the ISIHs. The alias-free spectra with a flat spectral window were obtained by convolving each delta-function spike with a sin(2πf_s t)/(2πf_s t) function, according to the method of French and Holden (1971).

parameters in a way that did not produce slightly higher numbers of spikes per burst; nevertheless, the quasilinear shape of the number of spikes per burst vs. T (Fig. 4A) agrees with the experiment. The noise is seen not to significantly affect the mean number of spikes per burst, except above 35°C and below 17.8°C. The burst frequency in Figure 4B was measured from the power spectrum of the spike trains (i.e., the sequence of spiking events, not the voltage time series), averaged over many realizations of our stochastic model. The method of French and Holden (1971) was used to generate spectrally flat, alias-free estimates of the power spectra. An example of such a power spectrum for T = 35°C is shown in Figure 4D. Note that it
Encoding in Mammalian Cold Receptors
231
is not accurate to use the ISIH to estimate the frequency of the pattern when bursting is present. The interval corresponding to the first mode is then always shorter than the mean period of the oscillation, since it lacks the contribution of the short intraburst ISIs. It is interesting to see how smoothly the frequency of the slow wave varies with temperature, as in the data of Braun et al. (1980), where it is also almost linear. This frequency probably conveys important information about temperature. Also, there is little difference between this curve and the one in the noiseless case, i.e., this frequency is very robust to noise. The mean amplitude of the slow wave increases with temperature, but decreases again at the high temperatures. This variation is small (Fig. 2), but together with the DC shifts due to the pump and the effects of the other thermosensitive parameters, it determines the different bursting and skipping patterns. We found that it is important to increase the ratio of G_Na to G_K with temperature. Doing so increases the number of spikes per burst. If it were kept constant, the active phase would shorten and drop out more quickly. In other words, increasing this ratio increases excitability, and slows down the progression through the sequence bursting-beating-skipping. In our model, the increase in the rate constants of the slow subsystem with temperature is responsible for the variation in the period of the slow wave, and thus of the bursting pattern (Fig. 4B). Increasing ρ by itself also decreases the period of the slow wave. In view of the excellent agreement between our model and the data, it is tempting to conclude, if indeed there are two slow variables, that ρ does have a Q₁₀ comparable to that of the gating variables. At cold temperatures, our model predicts that the bursting activity simply ceases (around T = 13°C), after the number of spikes per burst has declined from its value at T = 17°C.
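The point that BF should be read from the spike-train power spectrum rather than from the ISIH can be illustrated with a toy spike train (a sketch with assumed numbers, not the model's output): bursts of three spikes recur at 5 Hz, so the first ISIH mode sits at the 15-ms intraburst interval, yet the spectral peak recovers the 5 Hz burst frequency.

```python
import numpy as np

f0, dt, T = 5.0, 0.001, 20.0           # burst frequency (Hz), bin width (s), duration (s)
t = np.arange(0.0, T, dt)
train = np.zeros_like(t)
for onset in np.arange(0.0, T, 1.0 / f0):
    for j in range(3):                  # three spikes per burst, 15 ms apart
        idx = int(round((onset + 0.015 * j) / dt))
        if idx < len(train):
            train[idx] = 1.0

spec = np.abs(np.fft.rfft(train - train.mean())) ** 2
freqs = np.fft.rfftfreq(len(train), dt)
band = (freqs > 1.0) & (freqs < 20.0)
bf = freqs[band][np.argmax(spec[band])]   # spectral estimate of BF
assert abs(bf - f0) < 0.1                 # peak at ~5 Hz, the burst frequency
```

The spectrum has lines at multiples of 5 Hz, with the fundamental dominating, while the shortest ISI (15 ms) would suggest a far higher "frequency" if read from the ISIH.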
The bursting period of R15 and other pacemaker neurons has also been found to increase, as in our model, as the cell cools down (Moffett and Wachtel 1976), and bursting ceases when the temperature is too low. It appears that cessation of firing in our study is a result of decreased excitability of the fast dynamics, since the slow wave amplitude is still large. Some solutions appear to be chaotic, with a random number of spikes per burst. While the low temperature patterns are stable to noise, spikes are randomly deleted by the noise as T decreases below 17.8°C (with parameters varying according to equations 4.9-4.13). As mentioned in Section 1, 50% of fibers exhibit irregular bursts at low temperature, while the remaining ones are either silent or burst regularly throughout the low temperature range. Depending on whether the temperature is low or very low, our model can exhibit either regular or irregular patterns. It is likely that other effects also come into play, such as the deactivation of the Na-K pump at low temperature (Willis et al. 1974). Further investigation of this low temperature transition is warranted.
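The spectrum-based burst-frequency estimate discussed above can be sketched numerically. The following is a minimal illustration (not the authors' code): instead of the sinc convolution of French and Holden (1971), it evaluates the spectrum of a train of unit impulses exactly from the spike times, which is likewise alias-free, and contrasts the spectral peak with the reciprocal of the mean ISI, which is biased by the short intraburst intervals. All numbers are illustrative.

```python
import numpy as np

def burst_frequency(spike_times, freqs):
    """Estimate the burst frequency from the power spectrum of a spike
    train.  The spectrum of a sum of unit impulses at times t_k is
    evaluated exactly as |sum_k exp(-2*pi*i*f*t_k)|^2, so no sampling
    grid (and hence no aliasing) is involved."""
    phases = np.exp(-2j * np.pi * np.outer(freqs, spike_times))
    power = np.abs(phases.sum(axis=1)) ** 2
    return freqs[np.argmax(power)]

# Illustrative bursting train: 3 spikes per burst, interburst period 2 s.
T_burst = 2.0
spikes = np.concatenate([n * T_burst + np.array([0.0, 0.1, 0.2])
                         for n in range(50)])
freqs = np.linspace(0.05, 1.5, 2000)   # start above f = 0 to skip the DC peak
bf = burst_frequency(spikes, freqs)    # peak sits at 1/T_burst = 0.5 Hz

# The reciprocal of the mean ISI is a poor estimate of the slow-wave
# frequency: it is inflated by the short intraburst intervals.
mean_isi = np.diff(spikes).mean()
```

This mirrors the point made in the text: the ISIH-based estimate mixes intraburst and interburst intervals, while the spectral peak recovers the slow-wave frequency directly.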
Andre Longtin and Karin Hinzer
The effect of increasing extracellular Ca2+ on pacemaker cells can be very complicated, as it may affect the dynamics of other currents. But assuming the main effect is an increase in VCa (by the Nernst formula), the result is a hyperpolarization of the slow wave. This in turn decreases the number of spikes per burst, effectively converting a bursting cold fiber into a nonbursting one, as observed experimentally by Schafer et al. (1982). We have not investigated the potentially more significant effect of voltage screening due to this enhanced concentration.
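The assumed dependence of VCa on the extracellular concentration follows the Nernst formula. A minimal sketch, with hypothetical concentrations that are not taken from the model:

```python
import math

def nernst(z, c_out, c_in, T_celsius):
    """Nernst reversal potential (volts): V = (R*T)/(z*F) * ln(c_out/c_in)."""
    R, F = 8.314, 96485.0            # J/(mol*K), C/mol
    return (R * (T_celsius + 273.15)) / (z * F) * math.log(c_out / c_in)

# Hypothetical concentrations (not from the model): raising extracellular
# Ca2+ raises the reversal potential V_Ca, the effect assumed in the text.
v_ca_low  = nernst(z=2, c_out=2e-3,  c_in=1e-7, T_celsius=37.0)
v_ca_high = nernst(z=2, c_out=10e-3, c_in=1e-7, T_celsius=37.0)
```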
5.2 Mechanism of Skipping. In this section, we discuss the mechanism of skipping in our model. The nonlinear dynamics of the Plant model have been studied by Rinzel and Lee (1987) using a decomposition of the full equations into slow variables (s, Cai) underlying the slow wave and fast variables (V, h, n) governing the action potentials. The slow wave is an autonomous oscillation (independent of spikes), which typically appears at a Hopf bifurcation in the slow subsystem. The active phase of the burst begins when the slow wave reaches the threshold voltage for the activation of the fast inward currents. The rapid firing during the active phase of the burst corresponds to a limit cycle in the fast subsystem. Rinzel and Lee's study emphasized that the likely mechanism for slow wave bursting involves a homoclinic transition rather than a Hopf bifurcation. Near this transition the period of firing varies strongly, and is infinite at the bifurcation point; in contrast, the firing period is finite at a Hopf bifurcation. When the slow wave is suprathreshold, the fast dynamics "riding" this oscillation periodically visit their threshold. Depending on how much time they spend near threshold, i.e., on the rate at which the homoclinic curve is crossed, the duration of the ISIs can vary greatly. For the parameters chosen here, as well as in Plant (1981), the ISI increases during the active phase, which is also a property of the data in Figure 1. Next, we consider the case where the slow wave is subthreshold. Since this wave brings the fast subsystem periodically near the homoclinic curve, the probability of crossing the curve is also periodic. This periodically modulated probability underlies the skipping behavior. For the noise levels that produce multimodal ISIHs with reasonable widths, the phase preference is quite sharp, as the firing probability closely parallels the amplitude of the slow wave.
It is a fact, however, that if the slow wave did not exist, a precise time scale for the noise-induced firings would still exist, even in the absence of any deterministic time scale. Such a noise-induced time scale has been shown by Sigeti and Horsthemke (1989) for systems near a saddle-node bifurcation. The ISIH is then gamma-like, with a low-ISI cutoff and a well-defined peak. The slow wave here has the effect of introducing a modulation on this basic ISIH, i.e., it makes it multimodal. The details of this mechanism of skipping will be published elsewhere.
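The skipping mechanism described above — noise-driven escapes whose probability is periodically modulated by a subthreshold slow wave — can be illustrated with a generic leaky integrate-and-fire sketch (not the Plant model; all parameters are illustrative). The ISIs concentrate near integer multiples of the slow-wave period, and the mean ISI exceeds that period because cycles are skipped:

```python
import numpy as np

rng = np.random.default_rng(1)
dt, n_steps = 1e-3, 500_000              # 500 s of simulated time
T_slow, tau = 1.0, 0.1                   # slow-wave period (s), membrane time constant (s)
amp, thresh, D = 0.9, 1.0, 0.15          # peak drive (subthreshold), threshold, noise intensity

t_grid = dt * np.arange(n_steps)
drive = amp * (1 + np.sin(2 * np.pi * t_grid / T_slow)) / 2   # never reaches thresh
noise = np.sqrt(2 * D * dt) * rng.standard_normal(n_steps)

v, t_last, isis = 0.0, 0.0, []
for i in range(n_steps):
    v += dt * (drive[i] - v) / tau + noise[i]   # Euler-Maruyama step
    if v >= thresh:                             # noise-induced firing; reset
        isis.append(t_grid[i] - t_last)
        t_last, v = t_grid[i], 0.0
isis = np.array(isis)
```

Firings cluster near the peaks of the subthreshold wave, so the resulting ISIH is multimodal, as in the skipping patterns of Figure 3.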
If only the λs are varied with temperature (with a Q10 of 3), a decrease in the period and number of spikes per burst is still observed (not shown). As temperature increases, the first spike in a burst becomes significantly higher than the others; at the same time the amplitude of all the spikes decreases. Skipping then arises because only the first spike is close enough to the propagation threshold. Then, with noise, the first spike may or may not propagate during a given cycle of the slow wave. But this is a more complicated and less likely mechanism for skipping.

5.3 Spectral Properties of Solutions. The power spectra of bursting neurons can be quite intricate, as is clear from Figure 4D. It is well known that the power spectrum of a repetitively firing pattern, modeled by a train of Dirac δ-pulses of period T0,
x(t) = Σ_{n=-∞}^{+∞} δ(t - nT0)    (5.1)

is given by a set of delta functions at integer multiples of f0 = 1/T0:

S(f) = f0² Σ_{k=-∞}^{+∞} δ(f - kf0)    (5.2)
The highest peak in Figure 4D corresponds to the fundamental frequency of the noisy bursting solution. Its harmonics are visible, as expected for a periodic pulsed pattern (equation 5.2). Broad bumps are also seen. In the absence of noise, this and other bursting power spectra exhibit an even greater number of sharp peaks, again with broad bumps. This structure is similar to that seen in spectra of integral pulse frequency modulators (IPFM), for which Bayly (1968) has obtained an exact expression. These IPFMs are integrate-and-fire devices that, with constant input, fire at a precise frequency known as the carrier frequency fc. This carrier frequency is similar to the high frequency firing during the active phases, and corresponds to the large bump around 5 Hz. This bump is broad because the firing frequency varies during the active phase (see Section 5.2). When an IPFM is driven by a frequency fm < fc, the spike train resembles that of a bursting neuron. This modulation frequency and its harmonics appear, producing sidebands on the carrier peaks. These harmonics are similar to the fundamental peak and its harmonics in Figure 4D. The spectra of noisy bursting neurons from Plant's model thus share features with IPFM spectra, but are more intricate. One can calculate from these spectra a signal-to-noise ratio at the fundamental frequency, and the dependence of this ratio on temperature. Preliminary results indicate that this ratio is very high for temperatures below 37°C, but drops significantly when skipping sets in. The characterization of the model spectra and behavior of the signal-to-noise ratio, as well as comparisons to those estimated from the experimental data, will be reported elsewhere.
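The prediction of equations 5.1 and 5.2 — spectral lines only at f0 and its integer multiples — can be checked directly on a synthetic impulse train. A sketch with illustrative values (f0 = 5 Hz, matching the carrier bump mentioned above only coincidentally):

```python
import numpy as np

T0, n_spikes = 0.2, 200                  # impulse period (s), train length
f0 = 1.0 / T0                            # fundamental: 5 Hz
t_k = T0 * np.arange(n_spikes)           # periodic impulse times (equation 5.1)

freqs = np.linspace(0.5, 3.5 * f0, 4000)
# Exact spectrum of the finite train of unit impulses:
power = np.abs(np.exp(-2j * np.pi * np.outer(freqs, t_k)).sum(axis=1)) ** 2

# Per equation 5.2, strong lines should appear only at f0 and its harmonics.
peaks = freqs[power > 0.5 * power.max()]
```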
6 Sources of Pattern Variability
Section 4.3 describes a path through parameter space that yields the sequence of bursting to skipping patterns observed in Figure 1A as the temperature increases. Neighboring paths may or may not yield qualitatively similar results. It is important to understand what other dynamic behaviors exist near this path, because noise will allow the system to sample these behaviors. Exploring the vicinity of this path thus yields information on the sensitivity of the patterns to parameter variations. This in turn indicates how observable a pattern should be in the presence of additive or multiplicative noise. In other words, this exploration helps determine the "volume" of the path corresponding to the observed sequence. This section focuses on sensitivity to noise and parameter variations. Results may shed light on the origin of aperiodic firing for a given cell. Further, they may explain the variability in activity across fibers in the same preparation, and across preparations (since different cells may have different parameter values). It is known, for example, that other receptors, such as the cold fibers of the ampullae of Lorenzini, exhibit different sequences of bursting, beating, and skipping as temperature increases (Iggo and Iggo 1971; Braun et al. 1984a). But their basic ionic mechanisms may be quite similar to those of cat lingual cold receptors. If this is the case, their firing patterns may arise as temperature parametrizes a different path through parameter space (due to different pump activities, ionic concentrations, etc.).

6.1 Critical Slowing Down at High Temperature. We first focus on the influence of noise in the higher temperature range. As temperature increases in the noiseless model, the amplitude of the slow wave first increases, and then decreases for T > 35°C (this is barely visible in the simulations with noise of Fig. 2). At these higher temperatures, the slow wave becomes further hyperpolarized due to the Na/K pump.
This downward shift moves the slow dynamics closer to a Hopf bifurcation at which the slow wave disappears and the slow dynamics converge to a stable fixed point. Since the limit cycle disappears, this bifurcation is sometimes called a reverse Hopf bifurcation. This effect of Ip is illustrated in the left panels of Figure 5, where for simplicity we have fixed all other parameters to their values at T = 40°C. This Hopf bifurcation of the slow dynamics should be distinguished from the homoclinic bifurcation of the fast dynamics, at which the fast spiking arises (Section 5.2). The slow wave frequency varies slowly across the Hopf bifurcation. In Figure 5, the Hopf bifurcation occurs at Ip = -0.068. By comparison, at T = 25°C, we used Ip = 0.004 in Figure 2, a value well beyond that at which the Hopf bifurcation occurs for this temperature (Ip = -0.08). As can be seen with Ip = -0.0675, the decay time of the slow wave to its
Figure 5: Critical slowing down: increased effect of noise on the firing pattern at T = 40°C as Ip decreases. Left panels: deterministic case. The slow-wave amplitude decreases as Ip decreases. Ip = -0.0675 is just above the bifurcation point value (-0.068). Right panels: stochastic case with D = 0.0025. D and τc are the same in each plot. However, the slow wave is increasingly perturbed (the variance of the amplitude becomes larger than the mean amplitude) as the Hopf bifurcation is approached. Only the Ip = -0.04 case has spikes in the absence of noise.

asymptotic amplitude is quite long. In fact, it increases as the bifurcation point is approached, and is infinite at the bifurcation point itself. This loss of stability implies that noise has a greater influence on the solution as Ip decreases, even though the noise intensity is constant. This is shown in the corresponding stochastic simulations in the right panels of Figure 5. This apparent amplification of fluctuations as a bifurcation point is approached is known as "critical slowing down" (see, e.g., Horsthemke and Lefever 1984). The amplification of noise is most obvious for Ip = -0.0675. What this finding implies is that, even though the noise level is assumed constant, the effect of noise will be higher at high temperature. Consequently, the firing probability increases. For example, in going from Ip = -0.05 to Ip = -0.06, the slow wave has become slightly more hyperpolarized, and its amplitude has decreased. These two deterministic effects conspire to abolish all spiking. Nevertheless, spiking is still seen on some cycles at Ip = -0.06, because the "effective" noise intensity is now larger, due to the loss of stability. If the noise were sufficiently intense, skipping could arise even though Ip was below the value at which
the slow wave comes into existence. If Ip only shifted the slow wave downward without bringing the slow dynamics nearer to the Hopf bifurcation, the amplitude and stability of the slow wave, as well as the "effective noise intensity," would change only slightly. Thus, both noise and critical slowing down contribute to extending the encoding range, by allowing spikes to occur over a broader range of physiological parameters. This critical slowing down could occur at other kinds of bifurcations than the supercritical Hopf bifurcation present here, although the implications for encoding may then be different.

6.2 Period Doubling, Chaos, and Skipping. The model exhibits other dynamical behaviors than those discussed up to now. These behaviors occur for parameters in the vicinity of the path defined by equations 4.9-4.13. For example, period doublings leading to chaotic motion occur for the T = 20°C parameters as Ip increases slightly. Since it is difficult to visualize the bifurcations and chaotic motion from the full bursting solution, we have used a first return map representation of the ISIs (Fig. 6). The first panel, for Ip = 0.022, corresponds to the noiseless version of the firing pattern at T = 20°C. We see that it is in fact the first period-doubled solution of a fundamental bursting pattern occurring for Ip < 0.022. However, the presence of noise produces a pattern with a spectral peak centered on that of the fundamental solution (not shown). The chaotic motion at Ip = 0.03 is manifested in the variable number of spikes per burst from one cycle to the next. Further, there is the issue of multistability, a property found in other models (Chay and Kang 1987; Canavier et al. 1993). If multistability exists in our model, noise can perturb the dynamics from one kind of motion to another coexisting motion. Noise may thus cause a random sampling of different simple and complex patterns, along with their transients.
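The "critical slowing down" of Section 6.1 can be sketched with the linearized slow dynamics near the Hopf point: an Ornstein-Uhlenbeck process dx = -λx dt + sqrt(2D) dW whose relaxation rate λ shrinks to zero at the bifurcation, so that its stationary standard deviation sqrt(D/λ) grows even at fixed noise intensity D. A minimal simulation (illustrative, not the Plant equations):

```python
import numpy as np

def ou_std(lam, D, dt=1e-3, n_steps=600_000, seed=0):
    """Empirical stationary std of dx = -lam*x dt + sqrt(2*D) dW
    (theory: sqrt(D/lam)), via the Euler-Maruyama scheme."""
    rng = np.random.default_rng(seed)
    noise = np.sqrt(2 * D * dt) * rng.standard_normal(n_steps)
    x = np.empty(n_steps)
    x[0] = 0.0
    for i in range(1, n_steps):
        x[i] = x[i - 1] * (1.0 - lam * dt) + noise[i]
    return x[n_steps // 10:].std()       # discard the transient

D = 0.0025                               # fixed noise intensity, as in the simulations
far  = ou_std(lam=1.0, D=D)              # far from the bifurcation
near = ou_std(lam=0.1, D=D)              # relaxation rate 10x smaller
# Same D, yet the fluctuations are much larger near the bifurcation.
```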
There is also another kind of chaotic motion, occurring at a higher value of Ip, which can lead to skipping when a small amount of noise is present. This is illustrated in Figures 7 and 8. This chaotic motion is closely related to that studied in Section 7.2 below. As Ip increases, the number of spikes per burst and the amplitude of the slow wave decrease. At some point, the depolarization is not sufficient to cause spiking. Near this point, skipping can arise through stochastic forcing of the chaotic motion (Longtin 1995a). Due to the low amplitude of the slow wave, this form of skipping is not as sharply phase-locked as that at T = 40°C in Figure 3. This is seen by comparing the ISIHs of Figure 3 to the one in Figure 8. It is also apparent that short bursts are sometimes associated with this kind of skipping. Thus, depending on the precise balance of hyperpolarizing and depolarizing influences, skipping may be seen in a given preparation at lower temperatures than in Figure 1 (i.e., lower than 35°C or so). This may explain some of the differences between the firing patterns of cold-sensitive fibers of the ampullae of Lorenzini, boa warm receptors, and cat cold bursting and nonbursting receptors (Braun et al. 1984a).
Figure 6: First return maps of interspike intervals at three values of the pump current Ip in the absence of noise. The other parameters are those used for T = 20°C in Figure 2. As the successive ISIs vary widely, a connected log-log plot was used to represent the solutions. A period-doubling cascade occurs as Ip increases, with chaotic bursting when Ip = 0.03. These behaviors are also found at other temperatures.
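The first-return-map representation used in Figure 6 is straightforward to construct from an ISI sequence. A sketch with synthetic ISIs — a period-2 sequence, as after the first period doubling, versus an aperiodic surrogate generated by the chaotic logistic map (an illustrative stand-in, not the model's dynamics):

```python
import numpy as np

def return_map(isis):
    """Pairs (ISI_n, ISI_{n+1}) for a first-return-map plot (cf. Figure 6)."""
    isis = np.asarray(isis, dtype=float)
    return np.column_stack([isis[:-1], isis[1:]])

def n_distinct(points, decimals=6):
    """Number of distinct points visited by the map (after rounding)."""
    return len({tuple(np.round(p, decimals)) for p in points})

# A period-2 ISI sequence visits exactly two points of the map;
# the interval values are illustrative.
n_p2 = n_distinct(return_map([0.1, 1.9] * 100))

# An aperiodic surrogate (chaotic logistic map) fills out many
# distinct points instead, as in the Ip = 0.03 panel.
x, chaotic_isis = 0.4, []
for _ in range(200):
    x = 3.9 * x * (1.0 - x)
    chaotic_isis.append(0.1 + 2.0 * x)
n_chaos = n_distinct(return_map(chaotic_isis))
```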
6.3 Noise-Induced Bursting from a Beating State. Figure 9 presents another kind of firing pattern that may be relevant to the question of variability across preparations. This noise-induced bursting occurs for T = 40°C with a high value of Ip rather than the low one (-0.05) used in Figure 2. In the noiseless case, the slow wave amplitude and frequency decrease as Ip increases, while the duration of the active phase increases due to a growing asymmetry in the shape of the oscillation. At Ip = 0.039 the successive active phases merge, and high frequency beating (i.e., periodic firing) ensues. When D > 0 and Ip = 0.04, the slow wave that exists for Ip < 0.039 becomes "sampled" by the noise: bursting is induced
Figure 7: Bursting patterns at T = 25°C as Ip increases. When D = 0, the number of spikes per burst as well as the peak-to-peak amplitude of the slow wave are reduced as Ip increases. The solution with Ip = 0.06 and D = 0 has a very long period and is probably chaotic. At Ip = 0.06 and D = 0.0025, a skipping pattern appears.
by the noise. The mean interburst period decreases as D increases.
Figure 8: ISIHs for noise-induced skipping in the vicinity of chaotic motion (refer to Fig. 7) for T = 25°C and Ip = 0.06. Note that the distribution of intervals when D = 0 is continuous rather than singular: the solution is probably chaotic. The modes of this ISIH are considerably broader than those shown in Figure 3.
7 Deterministic Skipping and Gslow

In our model, temperature increases the rate of the activation kinetics of the slow inward current. In this section, the effect of also varying its maximal conductance Gx (Gslow in the following) is described. This can lead to a form of skipping without noise (deterministic skipping). The possibility that the skipping seen in cold receptors is of deterministic origin, and that the sequence in Figure 1A is mostly determined by Gslow, is discussed in the context of the recent model of bursting by Chay and Fan (1993). Studying the effect of Gslow also suggests possible mechanisms for the paradoxical cold response.
[Figure 9 panels: membrane potential (mV) vs. time (sec), with panel labels Ip = 0.03, D = 0.0025; Ip = 0.04, D = 0; Ip = 0.04, D = 1e-5; and Ip = 0.04, D = 0.0025.]
Figure 9: Noise-induced bursting at high Ip for T = 40°C. When D = 0, the duration of the active phase increases with Ip. At Ip = 0.039, the successive active phases merge. When D > 0 and Ip = 0.04, the slow wave that exists for Ip < 0.039 is sampled by the noise. Near this bifurcation, the noise sets the mean time scale of the bursting it induces. Variations in spike heights are a plotting artifact due to decimation of the large number of points required to represent a solution.

7.1 Thermosensitivity of Gslow. It has been reported that the slow inward current responsible for the negative slope resistance of pacemaker cells is dependent on temperature (Wilson and Wachtel 1974; Adams and Benson 1985). This means that its activation rate and/or its inactivation rate and/or its maximal conductance may vary with temperature. In our model, the kinetic rate of activation was given the same Q10 as that of the other activation variables (equation 4.6). This current was chosen as noninactivating, as discussed in Section 4.1. Its maximal conductance Gslow
(Gx in equation 4.2) was kept constant, as its variation with temperature has been considered secondary to those of GNa and GK, as discussed in Section 3. It has also been reported (Nobile et al. 1990) that the calcium currents in chick dorsal root ganglion neurons (containing the cell bodies of different kinds of sensory neurons) can have high Q10s. LVA-type channels have lower Q10s than those of the HVA type. The reported values for LVA are 1.7 for maximal conductance, 1.9 for activation, and 2.2 for inactivation. The channels gating the slow currents in cold receptors have been reported to have characteristics that are more of the LVA than the HVA type (Schafer et al. 1988). The firing patterns produced by our model are sensitive to Gslow. Varying this parameter along with the other parameters produces some correspondence with Figure 1A, especially if Gslow starts at a lower value. However, the range of correspondence is shortened. It is likely, if Gslow does indeed vary with temperature, that a more elaborate parameter variation scheme is required to reproduce the sequence. The sensitivity of the model to Gslow may then partly explain why fibers usually do not exhibit the whole gamut of firing behaviors shown in Figure 1A. Figure 10 shows the effect of increasing Gslow at T = 40°C. An increase from 0.01 to 0.011 produces a transition from skipping to bursting. This bursting is deterministic, since it occurs also when D = 0. A further increase in Gslow to 0.012 produces a merging of the active phases (as in Fig. 9), and high frequency beating ensues. It is tempting to draw an analogy between this renewed firing at high temperature and the paradoxical cold response (Hensel 1974). This response of cold fibers to warm temperatures normally occurs after cessation of firing. Increases in Gslow could then be involved in the increased mean firing rate after cessation around 45°C.
Comparison with data is not possible at present, as the temporal firing patterns of this paradoxical response have not been studied in detail (Hans Braun, personal communication). Another interesting finding is that our model can produce deterministic skipping for smaller values of Gslow. This is shown in Figure 11. It occurs over a narrow range of values of Gslow, and the skipping is very sharply phase-locked. Addition of noise to the dynamics produces an ISIH with a smoothly decaying envelope, as seen in the data. This model behavior, found at different temperatures, may account for some of the aperiodicity and response variability across fibers.
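The Q10 temperature scaling applied to the thermosensitive parameters (e.g., equation 4.6, and the LVA values quoted above) takes the standard form value(T) = value(T_ref) * Q10^((T - T_ref)/10). A minimal sketch, with an illustrative reference temperature:

```python
def q10_scale(value_ref, q10, T, T_ref=25.0):
    """Temperature scaling with coefficient Q10:
    value(T) = value(T_ref) * q10 ** ((T - T_ref) / 10)."""
    return value_ref * q10 ** ((T - T_ref) / 10.0)

# LVA activation Q10 of 1.9 (Nobile et al. 1990): a rate of 1.0 at the
# (illustrative) reference temperature of 25 degC scales to 1.9 at
# 35 degC and to 1/1.9 at 15 degC.
rate_35 = q10_scale(1.0, 1.9, 35.0)
rate_15 = q10_scale(1.0, 1.9, 15.0)
```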
7.2 The Bursting Model of Chay and Fan. The Chay and Fan (1993) (CF) model of bursting was motivated by the search for drug treatments of certain irregular activities in the brain. We have chosen to study this model because it points to other possible mechanisms for the transitions between firing patterns in Figure 1A. In particular, it suggests that chaotic dynamics may underlie the skipping behavior. This model has a slow
Figure 10: Effect of increasing the maximal conductance Gx of the slow current in equation 4.2 with temperature. Other parameters are as in Figure 2 for T = 40°C. Deterministic bursting, followed by high frequency beating, can be recovered from the skipping regime by increasing Gx. This behavior may contribute to the paradoxical cold effect.

inward current Islow given by

Islow = Gslow d f (V - Vslow)    (7.1)

The activation variable d and the inactivation variable f are voltage- and time-dependent:

dy/dt = [y∞(V) - y] / τy(V)    (7.2)

where y stands for either d or f. It also has Hodgkin-Huxley-type fast action potential dynamics:

(7.3)
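The first-order gating kinetics of equation 7.2 can be sketched by Euler integration at a clamped voltage. The steady-state curve y∞(V) below is a hypothetical Boltzmann form (the actual Chay-Fan voltage dependences are not reproduced here); the time constants τd = 1.0 and τf = 40.0 are those quoted in the text for the CF model:

```python
import numpy as np

def y_inf(V, V_half=-20.0, k=8.0):
    """Hypothetical Boltzmann steady-state activation curve; the actual
    Chay-Fan functions are not reproduced here."""
    return 1.0 / (1.0 + np.exp(-(V - V_half) / k))

def relax_gate(V, y0, tau, t_end, dt=0.01):
    """Euler integration of dy/dt = (y_inf(V) - y) / tau (equation 7.2)
    at a clamped voltage V."""
    y = y0
    for _ in range(int(t_end / dt)):
        y += dt * (y_inf(V) - y) / tau
    return y

# With tau_d = 1.0 and tau_f = 40.0, activation d equilibrates long
# before inactivation f at a clamped voltage:
d = relax_gate(V=0.0, y0=0.0, tau=1.0, t_end=5.0)
f = relax_gate(V=0.0, y0=0.0, tau=40.0, t_end=5.0)
```

This 40-fold separation of time scales is what lets the slowly developing inactivation terminate the active phase in the CF model.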
Figure 11: Deterministic skipping when the maximal conductance Gx in equation 4.2 is lowered from its value of 0.01 (used up to now) to 0.009548. Other parameters are as in Figure 2 for T = 35°C. Left: the skipping occurs in the absence of noise, near the onset of bursting. Right: deterministic skipping in the presence of noise (D = 0.0025).
where n and h are also governed by equation 7.2. The time constants are τm = 0.0, τh = 0.2, τn = 0.3, τd = 1.0, and τf = 40.0 (c = 1). Thus the activation of Islow is slower than the kinetics of n and h, but 40 times faster than the inactivation of Islow. The CF model, slightly modified from previous models studied by this group, is also of the slow wave bursting type, with only one dominantly slow variable, as opposed to two in Plant's model. To our knowledge, an analysis of the CF model in terms of fast and slow submanifolds and pseudo-steady states, as Rinzel and Lee (1987) have done for Plant's model, has not been published. When Ifast = 0, the five-dimensional CF model undergoes a Hopf bifurcation to a low-amplitude slow-wave oscillation as Gslow reaches a value near 10, and a reverse Hopf bifurcation when Gslow reaches a value near 16.5 (Chay and Fan 1993). This slow wave underlies the bursting pattern when Ifast ≠ 0. The fast dynamics and the activation kinetics of Islow in CF are similar to those in Plant's model. In the latter model, Islow does not inactivate. Rather, this current turns off when the voltage decreases due to the calcium-activated K+ current. Rinzel and Lee's (1987) analysis of Plant's model shows that significant qualitative changes in behavior are not expected if instead one assumes calcium inactivation of Islow. The CF model differs from these two alternatives in that the inactivation directly and
solely depends on voltage and time. However, the main and important difference between the CF model and that of Plant is that [Ca2+]i is not a state variable in CF. Thus the CF model applies to preparations in which [Ca2+]i is not thought to play an essential role in the genesis of bursting. Figure 8 of Chay and Fan (1993) shows a sequence of firing patterns for increasing Gslow over a small range. This sequence is similar to that seen in Figure 1A. While it is not surprising that other models of slow-wave bursting give rise to similar transitions, it is of great interest that this model can produce skipping without noise. Given that Gslow may be temperature dependent (Section 7.1), this raises the interesting possibility that the firing patterns of cold receptors are largely determined by variations in Gslow. While this has not been proposed as a primary mechanism in the literature on cold receptors (summarized in Section 3), we feel nevertheless that it should be seriously considered. This is further warranted by the fact that, although the literature on cold receptors strongly suggests that [Ca2+]i does play a role in bursting, its involvement has not been solidly confirmed. Consequently, it would be worthwhile to investigate this model in the context of what is known about mammalian cold receptors, i.e., by varying all of the putative thermosensitive parameters, and not just Gslow. We do not attempt a full analysis of the CF model using a parameter variation scheme as in Section 4.3. Rather, we consider the transition from beating to skipping, and compare the solutions and ISIHs to those in Figure 1. We have constructed the ISIHs for five values of Gslow in the range of interest, both without noise and with a moderate amount of noise (D = 10; the scaling is different in CF, hence the higher values of D). The results are shown in Figure 12. The main features of beating and skipping are visible in the ISIHs obtained with D = 0.
The skipping solutions appear indeed chaotic (not shown). Both beating and skipping are accompanied by significant bursting; this can probably be removed by parameter adjustment. There are, however, clear differences with the experimental data. The simulated ISIHs have more structure than those in Figure 1, and exhibit less phase-locking. The structure for D = 0 is due to the chaotic motion. It is partly smoothed out by noise (Fig. 12, right panels). However, some structure beyond that seen in the data still remains despite the presence of noise, such as the asymmetry and splitting of the first mode associated with the slow wave period. The reason why phase-locking decreases as Gslow increases in the CF model is that the slow wave amplitude is decreasing to zero. This decreased phase-locking is similar to that seen in our Figures 7-8 for noise-induced skipping near chaotic motion, at which the slow wave amplitude is small. In contrast, our model produces multimodal ISIHs with the proper structure and a good degree of phase-locking (the peaks are very clearly separated, as in the data). This is because the pump current shifts the slow wave downward, and noise-induced bursting occurs when the slow
Figure 12: Deterministic skipping in the slow wave bursting model of Chay and Fan (1993) as their maximal conductance Gslow varies over a small range. Left panels: D = 0. The progressive loss of spikes follows the decrease in amplitude of the slow wave. This is accompanied by loss of phase-locking. Most solutions in this range appear to be chaotic, producing multimodal ISIHs with a more complicated structure than those for the noise-induced skipping case (Fig. 3, T = 40°C). As Gslow increases above 16.0, the spikes disappear. Right panels: D = 2.5, τc = 0.01. The ratio of τc to the fastest time constant in CF is similar to that used for our stochastic simulations of Plant's model. The structure in the noiseless ISIHs is partially smoothed out by the noise.
wave amplitude is large. Further, for slight changes in the calcium concentration, the experimental ISIHs have many peaks (up to eight), and there is still sharp phase locking. It is difficult to see how chaotic skipping riding a low-amplitude slow wave, as in the CF model, could produce this effect. Our model can easily produce skipping ISIHs with many modes.
If Gslow is lowered below 14.0, the CF solutions go through an inverse period-adding sequence, in which a bursting solution bifurcates to another bursting solution with one less spike per burst. This behavior is different from the one seen in Figure 1A as temperature decreases from T = 30°C. Also, the proper variation of the bursting periods is not reproduced. It is expected that concomitant variation of the kinetic rates, especially those for Gslow, is necessary to produce proper variations in the slow wave period. This will perhaps also produce more sharply phase-locked deterministic skipping. We conclude that it would be very interesting to pursue the study of the CF model in the context of cold thermoreception. It is likely that other parameters have to be varied along with Gslow, and that noise has to be coupled to the dynamics, if this model is to agree with the data to the extent that our extended Plant model does. The appeal of the CF model lies in its conceptual simplicity compared to that of Plant's, since intracellular calcium dynamics are not present (both models nevertheless have five dynamical variables). It should be mentioned that CF also gives a paradoxical cold response as Gslow is increased past 17.0. The effect of noise on the CF model is further discussed at the end of Section 8.1.

8 Role of Noise in Skipping and Coding
Our study suggests that noise accounts for much of the aperiodicity observed in the firing patterns. However, chaotic bursting and skipping, as well as the effect of noise in the vicinity of bifurcations, cannot be ruled out, especially as bifurcations and chaotic motion are common features of models of bursting (Chay et al. 1995). Whatever the source of aperiodicity, the fact remains that this aperiodicity probably plays a role in the encoding of stimuli.

8.1 Subthreshold and Suprathreshold Skipping. The interaction of noise with a subthreshold slow wave can produce skipping. This form of skipping arises from noise-induced phase-locking, as in Figure 5 with Ip = -0.05. In our model, this occurs for T > 37°C. Skipping can also arise when noise perturbs a deterministic phase-locked pattern, as in Figure 5 with Ip = -0.04. In this case, the slow wave is suprathreshold, since firing occurs without noise. In our model, this occurs for T < 37°C. This form of skipping has also been found for the stochastic version of the FitzHugh-Nagumo neuron equations with periodic forcing by Longtin (1995b). This latter study shows that it is possible to experimentally distinguish between the two forms of skipping if the noise level can be varied, e.g., by using an external noise source as in Douglass et al. (1993). In the subthreshold case, increasing the noise will always cause the ISI probability to spread to lower multiples of the basic period, i.e., it will reduce skipping. This property of the noise-induced skipping occurs
Encoding in Mammalian Cold Receptors
because a larger noise reduces the escape time to the firing threshold. In the suprathreshold case, increasing D will first perturb the phase-locked pattern, with ISIs spreading out to the higher modes of the ISIH. Past a certain value of D, the ISIs will move back to the lower modes. For example, one way to obtain skipping similar to that seen at T = 40°C is to increase the noise intensity at T = 35°C (the basic periods will of course be different). While the period of the T = 35°C pattern will not change, there will be a spread of the probability to larger ISIs, characteristic of the suprathreshold case. Hence, the transition from bursting to skipping is not clear cut, in the sense that skipping does not necessarily imply a subthreshold oscillation. But it is clear that noise increases the range of parameters where firings can occur, and thus extends the encoding range. In the case of deterministic skipping studied in the CF model (Section 7.2), preliminary results indicate that the effect of noise is not systematic. For example, noise will slightly increase skipping for Gslow = 15.25 and 15.5, but not for the other values investigated. These results suggest that the method of distinguishing between different origins of skipping using noise can be extended to the deterministic skipping case.
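The escape-time picture of subthreshold skipping can be made concrete with a toy simulation. The following is a minimal stdlib-Python sketch, not the Plant-based receptor model of this paper: a leaky variable driven by a subthreshold sinusoidal "slow wave" plus white noise fires on threshold crossings, and the resulting ISIs span multiple slow-wave cycles. All parameter values are illustrative assumptions.

```python
import math
import random

random.seed(1)

# Toy threshold-crossing sketch (NOT the Plant-based cold-receptor model):
# a leaky variable v driven by a subthreshold slow wave plus white noise.
T0 = 10.0         # slow-wave period
dt = 0.01         # integration step
D = 0.01          # noise intensity (stationary variance of v without drive)
a, b = 0.3, 0.6   # slow-wave amplitude and bias; the peak of v stays below theta
theta = 1.0       # firing threshold

v, t = 0.0, 0.0
spike_times = []
for _ in range(int(2000 / dt)):               # 200 slow-wave cycles
    drive = b + a * math.sin(2.0 * math.pi * t / T0)
    v += (drive - v) * dt + math.sqrt(2.0 * D * dt) * random.gauss(0.0, 1.0)
    t += dt
    if v >= theta:                            # spike and reset
        spike_times.append(t)
        v = 0.0

isis = [t2 - t1 for t1, t2 in zip(spike_times, spike_times[1:])]
skipped = [isi for isi in isis if isi > 1.5 * T0]  # intervals spanning >1 cycle
```

In this subthreshold sketch, increasing D shortens the escape time near the slow-wave peaks and so shifts probability toward lower multiples of the period, as described above.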
8.2 Sensitivity to Stimuli and Noise. Wiederhold and Carpenter (1982) have suggested a role for regular firing patterns such as bursting from the point of view of sensory encoding. They argue that sensory cells might avoid regular spontaneous firing (such as bursting) if they are to encode stimuli at frequencies close to that of the regular activity. For example, if an auditory cell fired regularly (they fire very irregularly), a stimulus at a frequency near its mean spontaneous frequency could not easily change this mean rate; the encoding capability would be diminished. In contrast, cells with regular spontaneous firing (such as bursting) could encode weak stimuli through the modulatory effect of these stimuli on the regular activity. Thus, the stimuli would not have to exceed threshold to be encoded, since the cell is already biased into a suprathreshold region. Our study of mammalian cold receptors suggests an interesting expansion on this point. Weak temperature stimuli are readily encoded through their effect on the bursting period, the duration of the active phase, and the number of spikes per burst. The dynamic response (i.e., transients) further enhances this sensitivity (Braun et al. 1990). This sensitivity is also present at high temperature, even though the slow wave is subthreshold. This has been shown in recent theoretical (Longtin 1993; Chialvo and Apkarian 1993) and experimental (Douglas et al. 1993) studies of neurons driven by periodic forcing and noise. When such neurons are biased into their subthreshold regions, noise can enhance the expression of a small periodic signal through an effect known as "stochastic resonance." This occurs when the time scale of firing imposed by the
Andre Longtin and Karin Hinzer
stimulus becomes commensurate with the mean firing time in the absence of stimulus. Neurons that exhibit this effect also exhibit skipping. Further, characteristics of the ISIH, such as the rate of decay of the envelope, are very sensitive to parameters such as stimulus amplitude, frequency, and noise intensity (Longtin et al. 1994). The multimodal ISIHs in our model are also very sensitive to, e.g., the static temperature and the noise intensity, even though the "periodic signal" is endogenous rather than external. The noise helps express the frequency of the underlying slow wave when it is subthreshold. Thus, the sensitivity of regular activity can extend to skipping.

8.3 Deterministic versus Stochastic Coding. The issue of whether stochastic coding or temporal coding is used by the brain is a burning question, especially when it is addressed to cortical information processing (see Shadlen and Newsome 1994, for a current review; Usher et al. 1994). Trying to answer such questions requires that a precise meaning be ascribed to "precise timing" and "stochastic." In the case of cold receptors, our study suggests that the code combines deterministic and stochastic components. The precise timing of spikes is seen in the predictability of firing times that characterize cyclical patterns such as bursting and beating. However, there are fluctuations within these patterns in the exact times at which the firings occur. For example, the interspike intervals in a burst do not repeat exactly from one burst to the next; likewise, the number of spikes per burst and the time between bursts fluctuate. At high temperatures, the precise timing is seen in the persistence of the phase-locking to the slow wave, but the number of cycles between firings is random. The probability of firing may itself be part of the code, as suggested by Scheich et al. (1973) in the context of skipping cells known as "probability coders" in weakly electric fish.
Modeling of the next stages of processing of thermal information all the way up to the hypothalamus may be needed before a clear understanding of the interplay of deterministic and stochastic aspects of the code is achieved. If the skipping pattern is indeed relevant to the coding by cold receptors, then our model suggests that noise is an important component of this code. It endows the cold receptor with a continuous variation in firing pattern as temperature varies (this occurs also by smoothing out, e.g., period-doubled solutions as in Figs. 2 and 6). Without noise, there would be no skipping in our model over the 5-10°C range where it is measured. Clearly, too much noise would destroy the multimodal pattern. In the deterministic skipping case, an optimal amount of noise also appears to be needed to produce a smooth ISIH. Thus, this sense may have accommodated to an amount of noise that allows a sufficient dynamic range along with a reasonable signal-to-noise ratio. The precise sense in which noise could be used optimally by cold receptors will be
investigated elsewhere. Suffice it to say that this encoding scheme at higher temperatures is similar to that seen over a wide range of stimulus amplitudes in other thermal noise-limited senses such as the auditory system.

9 Conclusion

9.1 Summary of Results.

- Our model of mammalian cold receptors accounts for the main temporal and statistical features of the sequence of firing patterns shown in Figure 1A (Section 5.1). It is necessary to vary seven parameters concomitantly to obtain this agreement.

- The model incorporates the putative thermosensitive mechanisms discussed in the physiological literature on cold receptors (Section 3). Based on Plant's (1981) five-dimensional ionic model of bursting in the R15 pacemaker cell of Aplysia, the model provides a framework for the oscillatory theory of transduction by these receptors (Braun et al. 1980, 1984b). Skipping arises here through noise-induced firing from a subthreshold slow wave oscillation in the receptor.

- We have studied the variability (Section 1) of the firing patterns seen across fibers, and across preparations, by exploring behaviors of our model in the vicinity of the parameter path defined by equations 4.9-4.13. Section 6 reports our findings on noise-induced beating from a bursting state, on period-doubling sequences, and on noise-induced skipping from a chaotic low amplitude slow wave.

- Our assumption of constant noise intensity is compatible with the increasing importance of noise at higher temperatures. This is due to the loss of stability of the slow wave at high temperature (Section 6.1).

- Our model suggests that spikes drop out at higher temperatures as a consequence of hyperpolarization of the slow wave (Section 5.2). It is known that action potentials can also be quenched at high temperatures ("heat block": see Hodgkin and Katz 1949; Huxley 1959). This occurs when the rate of rise of the spike cannot keep up with the rates of change of the permeabilities that lead to recovery. Skipping does not appear to be a form of intermittent heat block.

- At very low temperatures, noise has an increased effect on pattern variability, as it affects the number of spikes dropping out of the bursting pattern (Section 5.1).
- The physiological evidence for our model is indirect as it derives from extracellular recordings. In view of the diversity and complexity of the ionic dynamics underlying the firing patterns of bursting cells (Adams and Benson 1985; Canavier et al. 1991; Chay et al. 1995), it would not be surprising that other models with different currents and/or temperature effects neglected here could also reproduce the data. Our model provides a framework from which to proceed for studying thermoreception. It can easily accommodate new physiological data as they become available.

- Our work provides a good starting point for studying the interaction of pacemaker dynamics with noise. Further, in the event that noise is at the origin of skipping, the discussion of the role of noise in Section 8 will likely survive the precise details of future improved ionic descriptions.

- In our model, the effect of temperature on the slow inward current was to increase the rate of the activation kinetics, as the literature on cold receptors suggests that variations in Gslow have a secondary importance. Incorporating variations of Gslow with temperature in our model requires more assumptions on the behavior of other parameters (Section 7.1).

- Our study discusses an attractive alternate mechanism for the sequence shown in Figure 1A, based on the results of Chay and Fan (1993) (Section 7.2). We have investigated the behavior of their model of slow wave bursting as Gslow increases to produce a transition from beating to skipping. The skipping ISIHs exhibit more structure than those in Figure 1. These chaotic solutions likely require stochastic forcing to produce smoother ISIHs. Also, the deterministic skipping is less phase-locked than the stochastic skipping in our model. A better assessment of their model would require a full study of its dynamics as many parameters are varied along with Gslow, following a scheme similar to that in Section 4.3. In our view, such a study would be of great interest.

- Deterministic skipping also occurs in our model at slightly smaller values of Gslow. The ISIHs are strongly phase-locked, and their envelope can be nonmonotonic. Noise makes the ISIH envelope monotone decreasing, similar to those in Figure 1.

- Paradoxical cold responses can be obtained in both our model and that of Chay and Fan (Section 7.1).
9.2 Future Work.

- An improvement to our model would include the increase with temperature of all the maximal conductances in equation 4.2. Other parameters may have to change also, such as the Q10s, or the temperature dependence of the pump. The precise form of these changes would have to be surmised from other preparations.

- An important next step is to model the dynamic responses, i.e., the transient responses to temperature changes. These are well documented, and may serve to validate models. This would require proper modeling of the transient behavior of the electrogenic pump currents.

- The patterns of Figure 1 were reproduced by increasing the global rate p of the intracellular calcium kinetics. Perhaps it is sufficient to only vary the rate of calcium sequestration. Preliminary results indicate, however, that this is not the case. One can also think of more detailed modeling of the intracellular calcium kinetics as in, e.g., Canavier et al. (1991) or Chay et al. (1995).

- One can test other hypotheses for noise that involve, e.g., changing D and τc with temperature. The inclusion of conductance fluctuations, e.g., as in Chay and Kang (1988), is an obvious first step. The effect of τc should also be investigated, as it can affect the correlations between the skipping events (Longtin et al. 1994).

- Temporal properties (such as correlations) of the experimental spike trains, near and in the skipping regime, should be compared with those of the simulated spike trains discussed in our paper. Such analyses could include return maps of ISIs and spectral analyses.

- Externally imposed noise could alter the skipping behavior of the receptor (Section 8.1). These changes could be compared to those predicted by models.

- Multistability (Chay and Kang 1987; Canavier et al. 1993; Chay and Fan 1993) may underlie some of the observed variability in cold receptors. It is worthwhile investigating this possibility.
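Several of these noise hypotheses involve the intensity D and correlation time τc of an exponentially correlated noise source. As a numerical aid, the following stdlib-Python sketch generates such noise with the exact one-step update, in the spirit of the Fox et al. (1988) algorithm cited in the references; the parameterization by the stationary standard deviation σ, and how σ relates to an intensity D, are conventions assumed here.

```python
import math
import random

random.seed(0)

def ou_noise(n_steps, dt, tau_c, sigma):
    """Exponentially correlated (Ornstein-Uhlenbeck) noise.

    Exact one-step update, valid for any dt:
        eta[k+1] = eta[k]*exp(-dt/tau_c) + sigma*sqrt(1 - exp(-2*dt/tau_c))*N(0, 1)
    sigma is the stationary standard deviation; relating it to a noise
    intensity D (e.g., sigma**2 = D/tau_c) is a convention assumed here.
    """
    rho = math.exp(-dt / tau_c)
    amp = sigma * math.sqrt(1.0 - rho * rho)
    eta = [random.gauss(0.0, sigma)]          # start in the stationary state
    for _ in range(n_steps - 1):
        eta.append(eta[-1] * rho + amp * random.gauss(0.0, 1.0))
    return eta

dt, tau_c, sigma = 0.05, 1.0, 0.5
eta = ou_noise(200000, dt, tau_c, sigma)

# Sample statistics: mean ~ 0, variance ~ sigma**2, and autocorrelation
# ~ exp(-1) after one correlation time.
mean = sum(eta) / len(eta)
var = sum((x - mean) ** 2 for x in eta) / len(eta)
lag = int(tau_c / dt)
ac = sum(eta[i] * eta[i + lag] for i in range(len(eta) - lag)) / ((len(eta) - lag) * var)
```

Changing tau_c at fixed sigma changes the correlations between threshold crossings without changing the noise amplitude, which is the kind of manipulation suggested above.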
Acknowledgments

This work was supported by NSERC Canada, and NIMH (USA) through Grant R01 MH47184-01. The authors wish to thank Leonard Maler, John Rinzel, Wendy Brandts, and Michael Guevara for useful discussions. We would like to thank an anonymous reviewer for suggesting the relevance to our study of the results of Chay and Fan (1993).

References

Adams, W. B., and Benson, J. A. 1985. The generation and modulation of endogenous rhythmicity in the Aplysia bursting pacemaker neurone R15. Prog. Biophys. Mol. Biol. 46, 1-49.
Bade, H., Braun, H. A., and Hensel, H. 1979. Parameters of the static burst discharge of lingual cold receptors in the cat. Pflügers Arch. 382, 1-5.
Barish, M. E., and Thompson, S. H. 1983. Calcium buffering and slow recovery kinetics of calcium-dependent outward current in molluscan neurones. J. Physiol. 337, 201-219.
Bayly, E. J. 1968. Spectral analysis of pulse frequency modulation in the nervous system. IEEE Trans. Bio-Med. Eng. 15, 257-265.
Braun, H. A., Bade, H., and Hensel, H. 1980. Static and dynamic discharge patterns of bursting cold fibers related to hypothetical receptor mechanisms. Pflügers Arch. 386, 1-9.
Braun, H. A., Schäfer, K., Wissing, H., and Hensel, H. 1984a. Periodic transduction processes in thermosensitive receptors. In Sensory Receptor Mechanisms, W. Hamann and A. Iggo, eds., pp. 147-156. World Scientific, Singapore.
Braun, H. A., Schäfer, K., and Wissing, H. 1984b. Theorien und Modelle zum Übertragungsverhalten thermosensitiver Rezeptoren. Funkt. Biol. Med. 3, 26-36.
Braun, H. A., Schäfer, K., and Wissing, H. 1990. Theories and models of temperature transduction. In Thermoreception and Temperature Regulation, J. Bligh and K. Voigt, eds., pp. 19-29. Springer-Verlag, Berlin.
Braun, H. A., Wissing, H., Schäfer, K., and Hirsch, M. C. 1994. Oscillation and noise determine signal transduction in shark multimodal sensory cells. Nature (London) 367, 270-273.
Canavier, C. C., Clark, J. W., and Byrne, J. H. 1991. Simulation of the bursting activity of neuron R15 in Aplysia: Role of ionic currents, calcium balance, and modulatory transmitters. J. Neurophysiol. 66, 2107-2124.
Canavier, C. C., Baxter, D. A., Clark, J. W., and Byrne, J. H. 1993. Nonlinear dynamics in a model neuron provide a novel mechanism for transient synaptic inputs to produce long-term alterations of postsynaptic activity. J. Neurophysiol. 69, 2252-2257.
Carpenter, D. O. 1967. Temperature effects on pacemaker generation, membrane potential, and critical firing threshold in Aplysia neurons. J. Gen. Physiol. 50, 1469-1484.
Carpenter, D. O. 1981. Ionic and metabolic bases of neuronal thermosensitivity. Fed. Proc. 40, 2808-2813.
Carpenter, D. O., and Alving, B. O. 1968. A contribution of an electrogenic Na+ pump to membrane potential in Aplysia neurons. J. Gen. Physiol. 52, 1-21.
Chay, T. R. 1983. Eyring rate theory in excitable membranes: Application to neuronal oscillations. J. Phys. Chem. 87, 2935-2940.
Chay, T. R. 1984. Abnormal discharges and chaos in a neuronal model system. Biol. Cybern. 50, 301-311.
Chay, T. R., and Kang, H. S. 1987. Multiple oscillatory states and chaos in the endogenous activity of excitable cells: Pancreatic β-cells as an example. In Chaos in Biological Systems, H. Degn, A. V. Holden, and L. F. Olsen, eds., pp. 173-181. Plenum, New York.
Chay, T. R., and Kang, H. S. 1988. Role of single-channel stochastic noise on bursting clusters of pancreatic β-cells. Biophys. J. 54, 427-435.
Chay, T. R., and Fan, Y. 1993. Evolution of periodic states and chaos in two types of neuronal models. In Chaos in Biology and Medicine, Proc. SPIE 2036, 100-114.
Chay, T. R., and Rinzel, J. 1985. Bursting, beating, and chaos in an excitable membrane model. Biophys. J. 47, 357-366.
Chay, T. R., Lee, Y. S., and Fan, Y. 1995. Bursting, spiking, chaos, fractals and universality in biological rhythms. Int. J. Bifurc. Chaos (in press).
Chialvo, D. R., and Apkarian, V. 1993. Modulated noisy biological dynamics: Three examples. J. Stat. Phys. 70, 375-391.
Clay, J. R. 1977. Monte Carlo simulation of membrane noise: An analysis of fluctuations in graded excitation of nerve membrane. J. Theor. Biol. 64, 671-680.
DeFelice, L. J. 1981. Introduction to Membrane Noise. Plenum, New York.
Douglass, J. K., Wilkens, L., Pantazelou, E., and Moss, F. 1993. Noise enhancement of information transfer in crayfish mechanoreceptors by stochastic resonance. Nature (London) 365, 337-340.
Dykes, R. W. 1975. Coding of steady and transient temperatures by cutaneous "cold" fibers serving the hand of monkeys. Brain Res. 98, 485-500.
Fox, R. F., Gatland, I. R., Roy, R., and Vemuri, G. 1988. Fast, accurate algorithm for numerical simulation of exponentially correlated colored noise. Phys. Rev. A 38, 5938-5940.
French, A. S., and Holden, A. V. 1971. Alias-free sampling of neuronal spike trains. Kybernetik 8, 165-171.
French, A. S., Holden, A. V., and Stein, R. B. 1972. The estimation of the frequency response function of a mechanoreceptor. Kybernetik 11, 15-23.
Gerstein, G., and Mandelbrot, B. 1964. Random walk models for the spike activity of a single neuron. Biophys. J. 4, 41-68.
Hensel, H. 1974. Thermoreceptors. Annu. Rev. Physiol. 36, 233-249.
Hindmarsh, J. L., and Rose, R. M. 1984. A model of neuronal bursting using three coupled first order differential equations. Proc. Roy. Soc. London B221, 87-102.
Hochmair-Desoyer, I. J., Hochmair, E. S., Motz, H., and Rattay, F. 1984. A model for the electrostimulation of the nervus acusticus. Neuroscience 13, 553-562.
Hodgkin, A. L., and Huxley, A. F. 1952. A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. (London) 117, 500-544.
Hodgkin, A. L., and Katz, B. 1949. The effect of temperature on the electrical activity of the giant axon of the squid. J. Physiol. (London) 109, 240-249.
Hodgkin, A. L., and Keynes, R. D. 1955. Active transport of cations in giant axons from Sepia and Loligo. J. Physiol. (London) 128, 28-60.
Horsthemke, W., and Lefever, R. 1984. Noise-Induced Transitions. Theory and Applications in Physics, Chemistry, and Biology. Springer Series in Synergetics, Vol. 15. Springer, Berlin.
Huxley, A. F. 1959. Ion movements during nerve activity. Ann. N.Y. Acad. Sci. 81, 221-246.
Iggo, A., and Iggo, B. J. 1971. Impulse coding in primate cutaneous thermoreceptors in dynamic thermal conditions. J. Physiol. (Paris) 63, 287-290.
Junge, D., and Stephens, C. L. 1973. Cyclic variation of potassium conductance in a burst-generating neurone in Aplysia. J. Physiol. (London) 235, 155-181.
Kramer, R. A., and Zucker, R. S. 1985. Calcium-induced inactivation of calcium current causes the inter-burst hyperpolarization of Aplysia bursting neurones. J. Physiol. 362, 131-160.
Läuger, P. 1991. Electrogenic Ion Pumps. Sinauer, New York.
Lecar, H., and Nossal, R. 1971. Theory of threshold fluctuations in nerves. I. Relationship between electrical noise and fluctuations in axon firing. Biophys. J. 11, 1049-1067.
Longtin, A. 1993. Stochastic resonance in neuron models. J. Stat. Phys. 70, 309-327.
Longtin, A. 1995a. Mechanisms of stochastic phase-locking. Chaos 5, 209-215.
Longtin, A. 1995b. Synchronization of the stochastic Fitzhugh-Nagumo equations to periodic forcing. Il Nuovo Cimento D (in press).
Longtin, A., Bulsara, A., Pierson, D., and Moss, F. 1994. Bistability and the dynamics of periodically forced sensory neurons. Biol. Cybern. 70, 569-578.
Moffett, S., and Wachtel, H. 1976. Correlations between temperature effects on behavior in Aplysia and firing patterns of identified neurons. Mar. Behav. Physiol. 4, 61-74.
Nobile, M., Carbone, E., Lux, H. D., and Zucker, H. 1990. Temperature sensitivity of Ca currents in chick sensory neurones. Pflügers Arch. 16, 227-244.
Poulos, D. A. 1981. Central processing of cutaneous temperature information. Fed. Proc. 40, 2825-2829.
Rinzel, J. 1987. A formal classification of bursting mechanisms in excitable systems. In Mathematical Topics in Population Biology, Morphogenesis and Neurosciences, E. Teramoto and M. Yamaguti, eds., Lecture Notes in Biomathematics Vol. 71, pp. 267-281. Springer, New York.
Rinzel, J., and Lee, Y. S. 1986. In Nonlinear Oscillations in Biology and Chemistry, H. G. Othmer, ed., Lecture Notes in Biomathematics Vol. 66, pp. 19-33. Springer-Verlag, Berlin.
Rinzel, J., and Lee, Y. S. 1987. Dissection of a model for neuronal parabolic bursting. J. Math. Biol. 25, 653-675.
Rose, J., Brugge, J., Anderson, D., and Hind, J. 1967. Phase-locked response to low frequency tones in single auditory fibers of the squirrel monkey. J. Neurophysiol. 30, 769-793.
Schäfer, K., Braun, H. A., and Hensel, H. 1982. Static and dynamic activity of cold receptors at various calcium levels. J. Neurophysiol. 47, 1017-1028.
Schäfer, K., Braun, H. A., and Rempe, L. 1988. Classification of a calcium conductance in cold receptors. Prog. Brain Res. 74, 29-36.
Schäfer, K., Braun, H. A., and Rempe, L. 1990. Mechanisms of sensory transduction in cold receptors. In Thermoreception and Temperature Regulation, J. Bligh and K. Voigt, eds., pp. 30-36. Springer-Verlag, Berlin.
Scheich, H., Bullock, T. H., and Hamstra, R. H., Jr. 1973. Coding properties of two classes of afferent nerve fibers: High-frequency electroreceptors in the electric fish, Eigenmannia. J. Neurophysiol. 36, 39-60.
Shadlen, M. N., and Newsome, W. T. 1994. Noise, neural codes and cortical organization. Curr. Opin. Neurobiol. 4, 569-579.
Sigeti, D., and Horsthemke, W. 1989. Pseudo-regular oscillations induced by external noise. J. Stat. Phys. 54, 1217-1222.
Spekreijse, H. 1969. Rectification in the goldfish retina: Analysis by sinusoidal and auxiliary stimulation. Vision Res. 9, 1461-1472.
Stevens, C. F. 1972. Inferences about membrane properties from electrical noise measurements. Biophys. J. 12, 1028-1047.
Usher, M., Stemmler, M., Koch, C., and Olami, Z. 1994. Network amplification of local fluctuations causes high spike rate variability, fractal patterns and oscillatory local field potentials. Neural Comp. 5, 795-836.
Wiederhold, M. L., and Carpenter, D. O. 1982. In Cellular Pacemakers. Vol. 2: Function in Normal and Diseased States, D. O. Carpenter, ed., pp. 27-58. Wiley-Interscience, New York.
Willis, J. A., Gaubatz, G. L., and Carpenter, D. O. 1974. The role of the electrogenic sodium pump in modulation of pacemaker discharge of Aplysia neurons. J. Cell. Physiol. 84, 463-471.
Wilson, W. A., and Wachtel, H. 1974. Negative resistance characteristic essential for the maintenance of slow oscillations in bursting neurons. Science 186, 932-934.
Received November 8, 1994; accepted June 14, 1995
Communicated by Peter Foldiak
NOTE
Associative Memory with Uncorrelated Inputs

Ronald Michaels
In hybrid learning schemes a layer of unsupervised learning is followed by supervised learning. In this situation a connection between two unsupervised learning algorithms, principal component analysis and decorrelation, and a supervised learning algorithm, associative memory, is shown. When associative memory is preceded by principal component analysis or decorrelation it is possible to take advantage of the lack of correlation among inputs to associative memory to show that correlation matrix memory is a least squares solution to the supervised learning problem.

1 Introduction
Hybrid learning schemes employ an unsupervised learning algorithm to transform raw input data into a more useful form. Unsupervised learning is then followed by supervised learning to learn some desired output. Several authors have published unsupervised learning algorithms for principal component analysis (Oja 1992), and Foldiak (1989) has published an algorithm for the decorrelation of input vectors. It is possible to take advantage of the special form of the output from these algorithms in the design of a supervised learning algorithm. In designing that algorithm it is interesting to consider the central point of Fuster (1995), "all memory is associative."

2 Discussion
As a starting point, consider the optimal linear associative mapping (OLAM) of Kohonen (1988). The idea of associative memory may be expressed in a matrix-vector equation:

M x_k = y_k,    k = 1, 2, ..., p    (2.1)

where x_k ∈ R^n is a zero mean key vector, y_k is the response vector, and M is the memory matrix. The {x_k} and {y_k} may be combined into matrices X and Y. Associative memory may then be expressed as
M X = Y    (2.2)

Neural Computation 8, 256-259 (1996)    © 1996 Massachusetts Institute of Technology
The least-squares error solution for M may be calculated using the pseudoinverse as shown in equation 2.3

M̂ = Y X⁺    (2.3)

where M̂ is the least-squares error solution for M. In Kohonen's discussion of the OLAM, it was assumed that p ≤ n. If the assumption is made that p > n then the definition of the pseudoinverse is

X⁺ = X^T (X X^T)^{-1}    (2.4)

and not X⁺ = (X^T X)^{-1} X^T as for the OLAM. If p > n then the {x_k} cannot be linearly independent and perfect recall of the associated {y_k} is not possible. Error minimization is the best that can be hoped for. Now assume that the principal components of the stream of data vectors {x_k} lie along the axes of R^n. This assumption can easily be realized by passing a stream of raw data vectors through a principal component algorithm such as the weighted subspace learning algorithm (Oja 1992) or a decorrelation algorithm (Foldiak 1989), the output of which is the stream of {x_k}. Then, given sufficiently large p, the autocorrelation matrix of the x_k, C, is diagonal with the eigenvalues of C lying along the diagonal. This may be summarized as follows:

(1/p) X X^T = C = diag(λ_1, ..., λ_n)    (2.5)

This leads to a simple expression for (X X^T)^{-1}:

(X X^T)^{-1} = (1/p) diag(1/λ_1, ..., 1/λ_n)    (2.6)
Now the pseudoinverse may be expressed as follows:

X⁺ = X^T (X X^T)^{-1} = (1/p) (ΔX)^T    (2.7)
Ronald Michaels
where each column of the resulting ΔX matrix represents one pattern vector divided elementwise by the eigenvalues of C. Since each eigenvalue represents the value of the variance of one of the elements of x_k, the eigenvalues are locally computable using the recursive method due to Oja (1983), or a variation thereof. Equation 2.3 above may now be written as
M̂ = (1/p) Y (ΔX)^T    (2.8)

which is nothing more than correlation matrix memory with variance normalization. In the case where input vectors have been decorrelated and have had all variances equalized and scaled to a value of 1.0 as a part of the unsupervised learning scheme (Foldiak 1992), then ΔX = X and equation 2.8 reduces to pure correlation matrix associative memory. Note that equation 2.8 is a least-squares error solution for M, but that it does not require the matrix multiplication and inversion of the pseudoinverse. The recursive form of equation 2.8 can be derived as follows. For p pairs of x_k and y_k vectors the solution is

M̂^p = (1/p) Y (ΔX)^T    (2.9)

For p + 1 pairs of x_k and y_k vectors the solution is

M̂^{p+1} = [1/(p+1)] [Y (ΔX)^T + y^{p+1} (Δx^{p+1})^T]    (2.10)

Equation 2.10 may be approximated as follows:

M̂^{p+1} ≈ [1/(p+1)] [p M̂^p + y^{p+1} (Δx^{p+1})^T]    (2.11)

Equation 2.11 may then be rewritten as follows:

M̂^{p+1} = M̂^p - [1/(p+1)] M̂^p + [1/(p+1)] y^{p+1} (Δx^{p+1})^T    (2.12)

For the recursive version of the algorithm [1/(p+1)] may be considered a gain factor, which may be represented by γ. It is assumed that the eigenvalues of C are recursively updated at each step by a separate algorithm. Equation 2.12 may now be rewritten in the following form:

M̂^{p+1} = M̂^p - γ M̂^p + γ [y^{p+1} (Δx^{p+1})^T]    (2.13)

In equation 2.13 the term -γM̂^p may be considered a "forgetting term." The γ y^{p+1} (Δx^{p+1})^T term may be considered a Hebbian learning term.
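The batch solution of equation 2.8 and the recursive form of equation 2.13 can be checked numerically. The following is a hypothetical stdlib-Python sketch: inputs are drawn i.i.d. with unit variance, so that C ≈ I and ΔX = X; the ground-truth mapping W, the dimensions, and the gain schedule γ = 1/(p+1) are illustrative choices, not prescriptions from this note.

```python
import random

random.seed(0)

n, m, p = 3, 2, 20000   # input dim, output dim, number of pattern pairs

# Hypothetical ground-truth linear mapping used to generate the responses.
W = [[0.5, -1.0, 2.0],
     [1.0, 0.3, -0.7]]

# Decorrelated, unit-variance inputs, so that C ~ I and delta-X = X.
X = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
Y = [[sum(W[i][j] * x[j] for j in range(n)) for i in range(m)] for x in X]

# Batch solution, equation 2.8:  M = (1/p) Y (delta-X)^T  (here delta-X = X).
M_batch = [[sum(Y[k][i] * X[k][j] for k in range(p)) / p for j in range(n)]
           for i in range(m)]

# Recursive solution, equation 2.13, with gain gamma = 1/(k+1):
# M <- M - gamma*M + gamma * y x^T  (forgetting term plus Hebbian term).
M_rec = [[0.0] * n for _ in range(m)]
for k in range(p):
    gamma = 1.0 / (k + 1)
    for i in range(m):
        for j in range(n):
            M_rec[i][j] += gamma * (Y[k][i] * X[k][j] - M_rec[i][j])
```

With γ = 1/(p+1) the recursion is exactly the running average of the Hebbian terms, so it coincides with the batch solution of equation 2.8; a small constant γ instead yields the forgetting behavior described above.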
Associative Memory with Uncorrelated Inputs
The convergence rate of the recursive associative memory algorithm depends upon, among other things, the convergence rate of the preceding recursive principal component analysis or decorrelation algorithm and the recursive algorithm used to estimate the variance of the elements of the {x_k}. It should be noted in passing that the above pseudoinverse technique is applicable to the Ho-Kashyap (Ho and Kashyap 1965) algorithm when that algorithm is preceded by principal component analysis.
3 Conclusions

In hybrid learning schemes a layer of unsupervised learning is followed by supervised learning. In this situation a connection between two unsupervised learning algorithms, principal component analysis and decorrelation, and a supervised learning algorithm, associative memory, has been shown. The output vectors of these unsupervised learning schemes have, in the limit, a diagonal autocorrelation matrix. This allows the pseudoinverse to be computed in a very simple manner using local, recursive computations. Using this computation, it has been shown that correlation matrix associative memory is a least-squares solution to the supervised learning problem.

References

Földiák, P. 1989. Adaptive network for optimal linear feature extraction. Proc. Int. Joint Conf. Neural Networks, 401-405.

Földiák, P. 1992. Models of Sensory Coding. Tech. Rep. CUED/F-INFENG/TR 91, Physiological Laboratory, University of Cambridge, January.

Fuster, J. M. 1995. Memory in the Cerebral Cortex. MIT Press, Cambridge, MA.

Ho, Y-C., and Kashyap, R. L. 1965. An algorithm for linear inequalities and its applications. IEEE Trans. Electronic Computers EC-14(5), 683-688. Reprinted in: Pattern Recognition, J. Sklansky, ed., pp. 49-54. Dowden, Hutchinson & Ross, Stroudsburg, PA.

Kohonen, T. 1988. Self-Organization and Associative Memory, 2nd ed. Springer Series in Information Sciences. Springer-Verlag, Berlin.

Oja, E. 1983. Subspace Methods of Pattern Recognition. Research Studies Press Ltd., Letchworth, Hertfordshire, England.

Oja, E. 1992. Principal components, minor components, and linear neural networks. Neural Networks 5(6), 927-935.
Received April 10, 1995; accepted June 8, 1995.
NOTE
Communicated by Jurgen Schmidhuber
Statistical Independence and Novelty Detection with Information Preserving Nonlinear Maps
According to Barlow (1989), feature extraction can be understood as finding a statistically independent representation of the probability distribution underlying the measured signals. The search for a statistically independent representation can be formulated by the criterion of minimal mutual information, which reduces to decorrelation in the case of gaussian distributions. If nongaussian distributions are to be considered, minimal mutual information is the appropriate generalization of the decorrelation used in linear Principal Component Analysis (PCA). We also generalize to nonlinear transformations by demanding only perfect transmission of information. This leads to a general class of nonlinear transformations, namely symplectic maps. Conservation of information allows us to consider only the statistics of single coordinates. The resulting factorial representation of the joint probability distribution gives a density estimation. We apply this concept to the real-world problem of electrical motor fault detection, treated as a novelty detection task.

1 Information Preserving Nonlinear Maps
Unless one has a priori knowledge about the environment, i.e., the distribution of the input signals, it can be difficult to find criteria for separating noise from useful information. To extract structure from the signals, one applies statistical decorrelating transformations to the input variables. To avoid a loss of information, these transformations have to preserve entropy. Following Shannon (1948), the entropy of a continuous distribution p(x), with x ∈ R^n, is defined as H(x) = −∫ p(x) ln p(x) dx. Continuous entropy is sensitive to scaling: scaling coordinates changes the amount of information (or entropy) of a distribution. More generally, for an arbitrary mapping y = f(x) on R^n, the condition det(∂f/∂x) = 1 yields H(y) = H(x) (Papoulis 1991), i.e., local conservation of volume guarantees constant entropy from the input x to the output y. To avoid spurious information generated by a transformation, we therefore consider volume-conserving maps, i.e., those with unit Jacobian determinant.

Neural Computation 8, 260-269 (1996)
© 1996 Massachusetts Institute of Technology

The goal of this paper is to present a special neural-network-like structure for building volume-preserving transformations. Two approaches may be used to achieve this goal. First, one may prestructure the neural network in such a way that volume preservation is guaranteed independent of the network weights (Deco and Brauer 1995; Deco and Schürmann 1995). Alternatively, weight constraints may be used to restrict the learning algorithms to volume-conserving network solutions. In this paper we present a new prestructuring technique that is based on symplectic geometry in even-dimensional spaces (n = 2m). The core of symplectic geometry is the idea that certain "area elements" are the analogue of "length" in standard Euclidean geometry (Siegel 1943). Transformations that preserve these area elements are referred to as symplectic. Symplectic transforms also preserve volume. However, the converse is not true, i.e., volume preservation is not sufficient for symplecticity. The advantage of symplectic transforms is the fact that they can be parameterized by arbitrary scalar functions s(z), z ∈ R^{2m}, due to the implicit representation¹

y = x + J ∇s((x + y)/2),   J = ( 0  I ; −I  0 )    (1.1)
where the I denotes the m-dimensional identity matrix, and x, y ∈ R^{2m}. Any nonreflecting symplectic transform {det[I − (∂f/∂x)] ≠ 0} can be generated by an appropriate function s, and the converse is also true: any twice differentiable scalar function, e.g., an arbitrary standard neural network, leads to a symplectic transform in equation 1.1. A discussion of the origin and significance of the structure of equation 1.1 is omitted here, since it gives little insight into the main issue of statistical independence; we use the representation (equation 1.1) from a pragmatic point of view. To obtain a set of symplectic transforms that is as general as possible, we use a one-hidden-layer neural network NN as a general function approximator (Hornik et al. 1989) for the generating function s:

s(z) = NN(z, w, W) = w · g(Wz)    (1.2)
where W denotes the input-hidden weight matrix, w the hidden-output weights, and g the activation function. Equation 1.1 has to be solved numerically. We use either fixed-point iteration or a homotopy-continuation method (Stoer and Bulirsch 1993).
¹This representation of symplectic maps is a special case of the generating function theory developed in full generality by Feng Kang and Qin Meng-zhao (1985). A proof of the representation (equation 1.1) and a discussion of its role for Hamiltonian systems can be found in Abraham and Marsden (1978) and Miesbach and Pesch (1992).
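A minimal numerical sketch of this construction (all names, sizes, and weight scales are illustrative, and the mean-point form of equation 1.1 is assumed): the implicit map is solved by fixed-point iteration, and local volume conservation is then checked through the Jacobian determinant.

```python
import numpy as np

# Sketch of the implicit symplectic map (equation 1.1) with a one-hidden-layer
# generating network s(z) = w . tanh(Wz) (equation 1.2). The mean-point form
# y = x + J grad_s((x + y)/2) is assumed; solved by fixed-point iteration.

rng = np.random.default_rng(0)
m = 2                       # half-dimension, n = 2m
n = 2 * m
hidden = 6
W = rng.normal(scale=0.3, size=(hidden, n))     # input-hidden weights
w = rng.normal(scale=0.3, size=hidden)          # hidden-output weights
J = np.block([[np.zeros((m, m)), np.eye(m)],
              [-np.eye(m), np.zeros((m, m))]])  # standard symplectic matrix

def grad_s(z):
    # gradient of s(z) = w . tanh(Wz):  W^T (w * (1 - tanh(Wz)^2))
    a = np.tanh(W @ z)
    return W.T @ (w * (1.0 - a ** 2))

def symplectic_map(x, iters=200):
    # fixed-point iteration  y <- x + J grad_s((x + y)/2)
    y = x.copy()
    for _ in range(iters):
        y = x + J @ grad_s((x + y) / 2.0)
    return y

# Numerical check of volume conservation: |det(dy/dx)| should be close to 1.
x0 = rng.normal(size=n)
y0 = symplectic_map(x0)
eps = 1e-6
jac = np.empty((n, n))
for i in range(n):
    dx = np.zeros(n)
    dx[i] = eps
    jac[:, i] = (symplectic_map(x0 + dx) - y0) / eps
print(abs(np.linalg.det(jac)))   # approximately 1 (volume conserving)
```

With small weights the fixed-point iteration is a contraction and converges quickly; larger weights would call for the homotopy-continuation alternative mentioned in the text.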
L. Parra, G. Deco, and S. Miesbach
2 Mutual Information and Statistical Independence
The components of a multidimensional random variable y ∈ R^n are said to be statistically independent if the joint probability distribution p(y) factorizes, i.e., p(y) = ∏_{i=1}^n p(y_i). Here, p(y_i) represents the distribution of the individual coordinate y_i, i = 1, ..., n, of the random variable y. Statistical independence can be measured in terms of the mutual information

I(y) = −H(y) + Σ_{i=1}^n H(y_i)    (2.1)

Zero mutual information indicates statistical independence. Here, H(y_i) = −∫ p(y_i) ln p(y_i) dy_i denotes the single coordinate entropies. In the case of gaussian distributions, linear decorrelation, i.e., diagonalizing the correlation matrix of the output y, has been proven to be equivalent to minimizing the mutual information (Papoulis 1991) and corresponds to the standard principal component analysis (PCA) method. However, for general distributions, decorrelation does not imply statistical independence of the coordinates. Starting from the principle of minimum mutual information, Deco and Brauer (1994) formulated criteria for decorrelation by means of higher-order cumulants. A similar approach, which considers the distance to the gaussian distribution (standardized mutual information) but restricts itself to linear transformations, was studied by Comon (1994). Redlich (1993) suggested the use of reversible cellular automata in the context of nonlinear statistical independence. Instead of preserving information, the invertibility of the map was considered. While invertibility indeed assures constant information when dealing with discrete variables, for continuous variables conservation of volume is necessary. In the case of binary outputs, maximum mutual information has been proposed instead (Schmidhuber 1992; Deco and Parra 1994). In the context of the blind separation problem, Bell and Sejnowski (1995) proposed a technique for the separation of continuous output coordinates with a single-layer perceptron. But the authors admit that the information maximization criterion they use does not necessarily lead to a statistically independent representation. In parallel, Nadal and Parga (1994) based this idea on a more rigorous discussion. In this paper, we make use of the more general principle of minimal mutual information (statistical independence) instead of the decorrelation used in PCA.
For the symplectic map, the identity H(x) = H(y) holds, and therefore we are left with the task of minimizing the sum of the single coordinate entropies (the second term on the right-hand side of equation 2.1). Since we are given only a set of data points, drawn according to the output distributions, this is still a difficult task. But fortunately, there is a feasible
upper bound for these entropies (Parra et al. 1995),

H(y_i) ≤ (1/2) ln[2πe ⟨(y_i − ⟨y_i⟩)²⟩]    (2.2)

where ⟨y_i⟩ = ∫ p(y_i) y_i dy_i. Using only the second-order moments for estimating the mutual information might be seen as a strong simplification. At the expense of computational efficiency, higher-order cumulants may be included to increase accuracy. An interesting property of equation 2.2 is that, if the transformation y = f(x) is flexible enough, this cost function will produce gaussian distributions at the output. Using a variational approach it can be shown that under the constraint of constant entropy a circular gaussian distribution minimizes the sum of variances in equation 2.2 (Parra et al. 1995). This will be useful for the density estimation addressed next. We will observe there some limitations of the continuous volume-conserving map in transforming arbitrary distributions into gaussians. The training of the network (equation 1.2) can be performed with standard gradient descent techniques. The gradient of the output coordinates with respect to the parameters of the map can be calculated by implicitly differentiating equation 1.1. This leads to a system of linear equations for the gradient. The overall computational complexity of the optimization algorithm is then O(n⁴) for each data point. This restricts this approach to low-dimensional spaces (in practice n ≤ 30).

3 Density Estimation and Novelty Detection
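The gaussian upper bound of equation 2.2 can be illustrated with a quick numerical check (an illustration, not from the paper): for a uniform variable, whose differential entropy is known in closed form, the bound holds with slack; it is tight only for gaussian coordinates.

```python
import numpy as np

# Gaussian upper bound on a single-coordinate entropy (equation 2.2):
# H(y_i) <= 0.5 * ln(2*pi*e*var(y_i)), with equality iff y_i is gaussian.
# Compare the bound against the exact entropy of a uniform variable.

rng = np.random.default_rng(0)
y = rng.uniform(0.0, 2.0, size=100_000)           # uniform on [0, 2]
bound = 0.5 * np.log(2 * np.pi * np.e * y.var())  # gaussian upper bound
exact = np.log(2.0)                               # H of uniform[a,b] = ln(b-a)
print(bound, exact)                               # bound exceeds exact entropy
```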
If one knows that a joint distribution factorizes, then the problem of finding an estimation of the joint probability p(x) in an n-dimensional space is reduced to the task of finding the one-dimensional probability distributions p(y_i). As stated before, the gaussian upper bound cost function favors gaussian distributions at the output, provided that the symplectic map is general enough to transform the given distribution. Figure 1 demonstrates this ability. If the training succeeds, we might estimate the distributions by the straightforward assumption of independent gaussian distributions at the output:

p̂(x) = ∏_{i=1}^n (2π σ_i²)^{−1/2} exp[−(y_i − ⟨y_i⟩)²/(2σ_i²)],   y = f(x)    (3.1)

Estimation then reduces to the measurement of the output variances σ_i². We now address the closely related task of novelty detection. Given a set of samples corresponding to a prior distribution, one has to decide whether or not a new sample corresponds to this distribution. Putting it into other words, the question is: "How probable is an observed new
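Under the factorial gaussian estimate of equation 3.1, the novelty criterion reduces to thresholding a squared distance in variance-normalized output coordinates, i.e., a hypersphere. A minimal sketch (the symplectic map itself is omitted here; the synthetic data, names, and the 99% threshold are illustrative assumptions):

```python
import numpy as np

# Hypersphere novelty criterion in the (assumed already factorized) output
# space: score a point by its squared distance in whitened coordinates and
# compare with a threshold calibrated on the "normal" training samples.

rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 5))            # "normal" output samples
mu, sigma = train.mean(0), train.std(0)

def novelty_score(y):
    # squared distance in whitened coordinates; large score => novel
    return np.sum(((y - mu) / sigma) ** 2)

threshold = np.quantile([novelty_score(y) for y in train], 0.99)
novel = rng.normal(loc=6.0, size=5)           # a point far from the cloud
print(novelty_score(novel) > threshold)       # flagged as novel
```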
Figure 1: Nonlinear correlated and nongaussian joint input distribution (left) is transformed into almost independent normal distributions (right). The input distribution was generated by mapping a one-dimensional exponential distribution with additive gaussian noise onto a circle. The cost function was reduced by 68% in 300 training steps. The "network" contained 6 parameters (w ∈ R^2 and W ∈ R^{2×2}).
sample according to what we have seen so far?" Given a certain decision threshold, novelty detection is based on the corresponding contour of the density of the data points previously seen. If the contour is required for an arbitrary threshold, we need the complete estimation of the density. As a solution to this problem we propose the presented symplectic factorization with the a posteriori gaussian density estimation (equation 3.1). The decision surface for the novelty detection is then just a hypersphere in the output of the symplectic map after reducing the mutual information according to the given sample set. Figure 2 demonstrates this idea. The symplectic map was trained to reduce mutual information on the samples +. The samples o are to be discriminated. The procedure transforms the output distribution to a gaussian distribution as closely as possible, in order to use a circular contour of the density as a decision boundary. As a side effect, volume conservation tends to separate regions not belonging to the training set from those corresponding to it; the former regions are mapped far away from the gap area. Obviously, taking a circular decision measure at the output distribution will give a fair solution. We show the performance of the proposed technique in Figure 3 (left) by the standard graph of misclassification and false-alarm rates. For this illustrative example we
Figure 2: +, training samples; o, test samples. Left: input signals; center: output signals of the trained symplectic map. The symplectic map partially transforms a bimodal training distribution into a unimodal distribution. The map used again 6 parameters. Ellipses indicate possible classification boundaries for the + samples. Right: rates of misclassification and false alarm. We used in both cases (input and output) an elliptical distance measure as the decision criterion for novelty, i.e., we classify as "normal" all points lying within an elliptical area around the center of the "normal" training set. All others are classified as "novel." The decreasing curves give the false-alarm rate, while the increasing curves denote the rate of missing the "novel" data points.

could also obtain good results with a simple gaussian mixture (Duda and Hart 1973) of two gaussian spots. This example also demonstrates one of the possible limitations of the technique as a general density estimation procedure. Perfect transformation into a single gaussian spot requires a singularity to map the two spots arbitrarily close together. Because of the property of local conservation of volume, vanishing distance in one direction implies unbounded stretching in the orthogonal direction, which will not be possible with a continuous map. More generally speaking, the combination of a continuous and volume-conserving map together with a unimodal distribution is best suited for distributions spread over connected regions rather than for disjoint distributions. For the novelty detection this behavior is clearly an advantage since it separates known distributions from unknown regions.
4 Motor Fault Detection
In this section, we show that the proposed concept of novelty detection provides encouraging results in a high dimensional real world problem. In motor fault detection, the task consists of noting early irregularities
Figure 3: Left: distribution of the first 2 principal components demonstrates a clear nonlinear dependency. Center: resulting distribution of the same first 2 components after reducing the redundancy in the first 10 components. Right: for any component higher than the 15th no pairwise dependency could be observed. Here, we arbitrarily plot components 50 and 100.
in electrical motors by monitoring the electrical current. The spectrum of the current is used as a feature vector. The motor failure detector is trained with data supplied by a healthy motor and should indicate if the motor is going to fail. Typically, one deals here with at least 100 and up to 1000 dimensions. Applying the outlined procedure to the complete feature vector is not manageable because of the high computational costs of our training procedure. On the other hand, it is hard to believe that 100 or more coordinates are altogether nonlinearly correlated. More likely, we expect most of the coordinates to be (if at all) only linearly correlated. Therefore, we first transform the spectrum with a linear PCA. We use 230 coordinates in the spectrum between 20 and 130 Hz. We observed that a few of the first principal components are nonlinearly correlated. No pairwise nonlinear structure could be observed between coordinates other than the first 10 or 15 principal components. We assume that all other principal components are uncorrelated, unimodal, and symmetrically distributed. They can be fairly well approximated by a normal distribution (see Fig. 3). We know that for normal distributions, linear decorrelation is the best that can be done to minimize mutual information. Therefore, we can assume that these lower principal components are statistically independent. We apply the symplectic factorization only to the first few components. Figure 3 shows how two of the first 10 principal components have been transformed by a 10-20-10 symplectic map trained with 800 samples (w ∈ R^20, W ∈ R^{20×10}). The net reduced the variance by 65% in 650 training steps.
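The preprocessing described above can be sketched as follows (a simplified reconstruction with synthetic data; all sizes and names are illustrative): variances are normalized, a linear PCA is applied, and only the leading components are reserved for the costly nonlinear symplectic stage.

```python
import numpy as np

# Sketch of the preprocessing pipeline: normalize variances, apply linear
# PCA, and reserve only the first few components for the (costly) nonlinear
# symplectic stage. Synthetic data stands in for the current spectra.

rng = np.random.default_rng(0)
spectra = rng.normal(size=(800, 230))                  # 800 spectra, 230 bins
normalized = (spectra - spectra.mean(0)) / spectra.std(0)
# PCA via SVD of the centered, normalized data
_, _, Vt = np.linalg.svd(normalized, full_matrices=False)
pcs = normalized @ Vt.T        # principal component scores (uncorrelated)
leading = pcs[:, :10]          # candidates for the nonlinear symplectic map
rest = pcs[:, 10:]             # assumed gaussian / statistically independent
```

By construction the PC scores are mutually uncorrelated in-sample; only the nonlinear dependencies among the leading components remain for the symplectic stage.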
Figure 4: Left: maximum measure on the 230-dimensional principal component space. Center: circular distance measure on the 10 symplectic-mapped first linear principal components. Right: combined symplectic and linear feature space: 10 symplectic-transformed first PCs and last 7 PCs of the normalized spectrum. The decreasing curves give the false-alarm rate. Each of the three increasing curves provides the rate of missing the fault for three different fault situations (—, no fault; · · ·, bearing race hole; - -, unbalance; broken rotor bar).

Now we use this result to classify "good" vs. "bad" motors, according to equation 3.1. Since the performance may vary for different types of faults, we plot the performance curve for the three failure modes occurring in our test data (unbalance, bearing race hole, and broken rotor bar). In Figure 4 we compare the performance with a maximum measure (max_i |y_i − ⟨y_i⟩|) on the complete 230-dimensional principal component space (left), but use only the gaussian estimates of the 10 nonlinearly transformed coordinates (center). Furthermore, we analyze to what extent a given coordinate separates the "good motor" and "bad motor" distributions by measuring the ratio of the corresponding variances. This analysis reveals that by including the low-variance linear normalized PCA coordinates the classification measure can be improved further. With "normalized" we express the fact that we normalize the variance before performing PCA. Best results were obtained by including between 5 and 20 low-variance PCA coordinates (see Fig. 4, right). One possible measurement of the quality of the classification technique is the decision error at the optimal decision threshold.
The proposed technique achieves a decision error of 10 ± 0.5%. This result is comparable with different approaches that have been applied to this problem at SCR,² including, among others, MLP (11%) and RBF (10%) autoassociators, nearest neighbor (18-32%) and hypersphere (37%) clustering, PCA (12%), or maximum measure (in roughly 2000 dimensions) (11%).

²Siemens Corporate Research, Inc., 755 College Road East, Princeton, NJ 08540
5 Conclusions

The factorization of a joint probability distribution has been formulated as a minimal mutual information criterion under the constraint of volume conservation. Volume conservation has been implemented by a general class of nonlinear transformations, the symplectic maps. A gaussian upper bound leads to a computationally efficient optimization technique and favors normal distributions at the output as optimal solutions. This, in turn, facilitates density estimation, and can be used particularly for novelty detection. The proposed technique has been applied successfully to the real-world problem of motor fault detection.
References

Abraham, R., and Marsden, J. 1978. Foundations of Mechanics. Benjamin-Cummings, London.

Barlow, H. 1989. Unsupervised learning. Neural Comp. 1(3), 295-311.

Bell, A. J., and Sejnowski, T. 1995. An information-maximization approach to blind separation and blind deconvolution. Neural Comp. 7(6), 1129-1159.

Comon, P. 1994. Independent component analysis, a new concept. Signal Processing 36, 287-314.

Deco, G., and Brauer, W. 1995. Nonlinear higher order statistical decorrelation by volume conserving neural architectures. Neural Networks (in press).

Deco, G., and Parra, L. 1996. Nonlinear feature extraction by redundancy reduction with stochastic neural networks. Neural Networks (to appear).

Deco, G., and Schürmann, B. 1995. Learning time series evolution by unsupervised extraction of correlations. Phys. Rev. E 51(2), 1780-1790.

Duda, R. O., and Hart, P. E. 1973. Pattern Classification and Scene Analysis. Wiley, New York.

Feng Kang, and Qin Meng-zhao. 1985. The symplectic methods for the computation of Hamiltonian equations. In Numerical Methods for Partial Differential Equations, Proceedings of a Conference held in Shanghai, 1987. Lecture Notes in Mathematics, Zhu You-lan and Guo Ben-yu, eds., Vol. 1297. Springer, Berlin.

Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward neural networks are universal approximators. Neural Networks 2, 359-366.

Miesbach, S., and Pesch, H. J. 1992. Symplectic phase flow approximation for the numerical integration of canonical systems. Numer. Math. 61, 501-521.

Nadal, J-P., and Parga, N. 1994. Non-linear neurons in the low noise limit: A factorial code maximizes information transfer. Network 5(4), 565-581.

Papoulis, A. 1991. Probability, Random Variables, and Stochastic Processes, 3rd ed. McGraw-Hill, New York.
Parra, L., Deco, G., and Miesbach, S. 1995. Redundancy reduction with information preserving nonlinear maps. Network 6, 61-72.

Redlich, A. N. 1993. Supervised factorial learning. Neural Comp. 5, 750-766.

Schmidhuber, J. 1992. Learning factorial codes by predictability minimization. Neural Comp. 4(6), 863-879.

Shannon, C. 1948. A mathematical theory of communication. Bell Syst. Tech. J. 27, 379-423.

Siegel, C. L. 1943. Symplectic geometry. Am. J. Math. 65, 1-86.

Stoer, J., and Bulirsch, R. 1993. Introduction to Numerical Analysis. Springer, New York.
Received July 25, 1994; accepted May 8, 1995.
Communicated by Robert Jacobs
Neural Network Models of Perceptual Learning of Angle Discrimination
We study neural network models of discriminating between stimuli with two similar angles, using the two-alternative forced choice (2AFC) paradigm. Two network architectures are investigated: a two-layer perceptron network and a gating network. In the two-layer network all hidden units contribute to the decision at all angles, while in the other architecture the gating units select, for each stimulus, the appropriate hidden units that will dominate the decision. We find that both architectures can perform the task reasonably well for all angles. Perceptual learning has been modeled by training the networks to perform the task, using unsupervised Hebb learning algorithms with pairs of stimuli at fixed angles θ and θ + δθ. Perceptual transfer is studied by measuring the performance of the network on stimuli with θ′ ≠ θ. The two-layer perceptron shows a partial transfer for angles that are within a distance σ from θ, where σ is the angular width of the input tuning curves. The change in performance due to learning is positive for angles close to θ, but for |θ − θ′| ≈ σ it is negative, i.e., its performance after training is worse than before. In contrast, negative transfer can be avoided in the gating network by limiting the effects of learning to hidden units that are optimized for angles that are close to the trained angle.

1 Introduction
The ability of animals and humans to carry out perceptual tasks, such as discrimination of two similar stimuli, improves with practice (Walk 1978). One of the most interesting features of this improvement is that it is stimulus selective. For instance, learning to discriminate between two gratings with given orientations or spatial frequencies does not lead to improvement for substantially different orientations or spatial frequencies (Fiorentini and Berardi 1981). Similarly, learning to determine the sign of the offset in a vertically oriented vernier stimulus does not improve the performance for a horizontally oriented vernier stimulus (Poggio et al. 1992). This limited transfer to different stimulus parameters
suggests that the learning is due to changes in early stages of the sensory pathway, where stimuli characterized by very different parameters are represented by different neurons. As the properties of the neurons in these early stages are relatively well known, especially in the visual cortex (Orban 1984), we can attempt to use this information to study possible neural mechanisms of perceptual learning in these systems. In this work we investigate neural network models of 2AFC discrimination of a pair of stimuli characterized by angles θ and θ + δθ. The parameter θ, which takes values from −π to π, represents, for instance, the direction of motion of a contour of a visual image. We assume that the performance is limited by neuronal noise, i.e., the ambiguity induced by the stochastic responses of the neurons to the stimuli. Thus, the level of performance will depend on the ratio between the stimulus separation δθ and the internal neuronal noise. For concreteness, we will assume Poisson statistics for the neuronal noise. The first issue we address is which network architecture is capable of performing the task. To assess the quality of the network performance we will compare it with the performance of a discriminator based on a maximum-likelihood (ML) decision, which will be described in Section 2. The performance of the simplest network model, i.e., a single-layer perceptron that performs a threshold-linear operation on its inputs, has been studied in Seung and Sompolinsky (1993). In that work it has been shown that a perceptron can reach the ML performance on a single stimulus parameter θ. However, the perceptron cannot yield the ML performance over a range of angles. A perceptron that makes the optimal decision for one angle yields suboptimal decisions for other angles. Moreover, we will show that any perceptron will yield the wrong answer more than 50% of the time, in some range of angles, regardless of the level of noise.
In the language of learning theory, the angle discrimination task over the whole range of angles is not realizable by a single-layer perceptron (see, e.g., Hertz et al. 1991; Sompolinsky and Barkai 1993). For this reason, and unless one assumes some mechanism of rapid modification of the synaptic weights, it is necessary to adopt a more complex network architecture for modeling this perceptual task. In this work we will consider two network architectures: a feedforward network with one layer of hidden units and a gating network. Both architectures can perform the task for the whole range of angles in the case of small noise. We have studied numerically the optimal average performance of these networks in the presence of noise. These results and their comparison with the ML performance will be presented in Section 3. The main focus of this work is on issues related to the learning of this task by the networks, particularly the phenomenon of perceptual transfer. We assume that the networks are trained to perform the discrimination task using Hebbian learning rules with examples of pairs of stimuli that have a fixed angle θ. The initial state of the networks is such that it yields a reasonably good uniform performance over the whole angular range. In
G. Mato and H. Sompolinsky
this case, perceptual transfer is defined as the change in the performance of the system by the training for values of θ that are different from the training angle. The models of perceptual learning in the two networks and their qualitatively different behavior with regard to perceptual transfer will be presented in Section 4. We find that the multilayer perceptron displays negative perceptual transfer, i.e., a worsening of the performance for angles different from the one that has been trained. In contrast, we find that the gating network does not display this phenomenon. In the last section we discuss the results and possible extensions of the models.

2 Maximum Likelihood Discrimination
We consider the task of discrimination of angles in two dimensions in the 2AFC paradigm. The stimuli are visual images that are characterized by an angle θ, −π ≤ θ ≤ +π. This angle could represent, for instance, the direction of motion of the image. Two stimuli, with angles θ and θ + δθ, respectively, are presented, one after the other, in one of two possible orders. The task of the system is to find out the order of presentation, e.g., the output should be +1 if the first stimulus was θ + δθ and −1 for the reverse order. We assume that the visual input is represented by the responses of N noisy, angle-tuned neurons. These responses are denoted by a vector r of integers r_j, j = 1, ..., N, where r_j denotes the number of spikes emitted by the jth neuron during a fixed period of time following the stimulus onset. The responses of the neurons to each stimulus are assumed to be independent random processes, with the following probability distribution:

P(r | θ) = ∏_{j=1}^N P(r_j | θ)    (2.1)
The maximum-likelihood (ML) procedure for discriminating between the two alternatives consists of evaluating PI = P(r 1 H bH)P(r’ 1 0) and P2 = P(r I H)P(r‘ I H hH), where r and r’ are the number of spikes emitted by the neurons during the presentation of the first and second stimuli, respectively. The ML decision is +1 if PI > P2 and -1 if P I < P2. In the limit of large population size, N, the probability of mistake of this rule is given by
ε = H(d′/√2)   (2.2)

where H(x) = (2π)^{−1/2} ∫_x^∞ e^{−t²/2} dt. The parameter d′ is the discriminability of the stimuli, and is equal to

d′ = δθ √I   (2.3)
Perceptual Learning of Angle Discrimination
where I is the Fisher information

I = ⟨ (∂ ln P(r | θ) / ∂θ)² ⟩   (2.4)
where the angular brackets ⟨…⟩ denote the average with respect to the probability distribution of equation 2.1. The Fisher information measures the total amount of information about the stimulus θ that is contained in the noisy response vector r. It can be shown that in the limit of large N, the ML procedure is optimal in the sense that it is unbiased and minimizes the square of the decision error (Seung and Sompolinsky 1993). We will consider responses r_j that consist of discrete events. They are assumed to be Poisson variables described by the following distribution,

P(r_j | θ) = [h_j(θ)]^{r_j} e^{−h_j(θ)} / r_j!   (2.5)

where we have denoted the mean (and the variance) of r_j by ⟨r_j⟩ = ⟨(δr_j)²⟩ = h_j(θ). The function h_j(θ) will be called the tuning curve of the jth input neuron. The maximum of h_j(θ), denoted by θ_j, will be called the preferred angle (PA) of the jth neuron. We assume that all the neurons have the same tuning curve but with different preferred angles, i.e., h_j(θ) = h(θ − θ_j), and also that the PAs are distributed uniformly, so that θ_j = 2πj/N − π, (j = 1, …, N). The difference between the maximum and minimum values of h(θ) will be denoted by n. In the following we will use the normalized tuning curve f(θ), defined by

f(θ) = h(θ)/n   (2.6)

so that the difference between the maximum and the minimum values of f(θ) is 1. The factor n is a measure of the magnitude of the mean response. The discriminability for the Poisson case in the large N limit is given by
d′ = δθ √(nNA)   (2.7)

and

A = (1/2π) ∫_{−π}^{π} dφ [f′(φ)]² / f(φ)   (2.8)

The discriminability d′ represents the signal-to-noise ratio of the system. The quantity A is the only factor in the signal-to-noise ratio that depends on the form of the tuning curve. The result, equation 2.2, is valid when d′ is of order 1, i.e., when δθ is of order 1/√(nN).
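The setup above can be sketched numerically. The block below is a minimal illustration, not the authors' code: since the tuning curve of equation 2.9 is not reproduced in this excerpt, a gaussian bump is assumed in its place, and all parameter values and function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
prefs = 2 * np.pi * np.arange(N) / N - np.pi   # preferred angles theta_j

def tuning(theta, a=1.0, n=50, fmin=0.01):
    # illustrative bell-shaped tuning curves h_j(theta) = h(theta - theta_j);
    # the paper's exact form (equation 2.9) is not reproduced here
    return n * (fmin + (1 - fmin) * np.exp(-0.5 * ((theta - prefs) / a) ** 2))

def poisson_loglik(r, theta):
    # log P(r | theta) for independent Poisson units, dropping the log r! terms
    h = tuning(theta)
    return np.sum(r * np.log(h) - h)

def ml_decide(r1, r2, theta, dtheta):
    # ML 2AFC rule: +1 iff P(r1|theta+dtheta) P(r2|theta) is the larger product
    p1 = poisson_loglik(r1, theta + dtheta) + poisson_loglik(r2, theta)
    p2 = poisson_loglik(r1, theta) + poisson_loglik(r2, theta + dtheta)
    return 1 if p1 > p2 else -1

theta, dtheta = 0.3, np.deg2rad(3.0)
trials = 500
correct = sum(
    ml_decide(rng.poisson(tuning(theta + dtheta)), rng.poisson(tuning(theta)),
              theta, dtheta) == 1
    for _ in range(trials)
)
print(correct / trials)  # fraction correct, well above chance for these parameters
```

With these illustrative parameters the empirical fraction correct is close to 1 − H(d′/√2) for d′ = δθ √(nNA).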
G. Mato and H. Sompolinsky
Figure 1: Normalized tuning curve f(θ), equation 2.9, with a = 1 (rad) and f_min = 0.01.
An example which we will use frequently in this paper is the following input tuning curve (shown in Fig. 1)

(2.9)

where a is the width of the tuning curve and f_min is the rate of spikes in the background divided by n. For this tuning curve, and assuming f_min ≪ 1, …

The angular brackets ⟨…⟩ denote an average with respect to the probability distribution of equation 2.5. To evaluate the performance of the perceptron in the limit of large N, we note that for large N, the field h generated by pairs of stimuli with a fixed angle θ has a gaussian distribution with mean value
⟨h(θ)⟩ = n δθ σ₀ Σ_{j=1}^{N} w_j f′_j(θ)   (3.4)

and variance

⟨δh²(θ)⟩ = 2n Σ_{j=1}^{N} w_j² f_j(θ)   (3.5)
Performing the average of equation 3.3, using the gaussian distribution of h, yields equation 2.2 but with a discriminability that is given by

d′(θ) = δθ √n Σ_j w_j f′_j(θ) / [Σ_j w_j² f_j(θ)]^{1/2}   (3.6)

where A is given as before by equation 2.8. The set of weights that minimizes the probability of error in discriminating stimuli with angle θ is found by maximizing the discriminability of equation 3.6, yielding

w_j(θ) = f′_j(θ) / f_j(θ)   (3.7)

Substituting these weights in equation 3.6 we find that the discriminability, and hence also the average error, of the optimal perceptron is the same as in ML. However, the performance of the perceptron cannot be optimal for more than a single stimulus angle. In particular, the discriminability of a perceptron that is optimized for an angle θ with respect to stimuli with an angle θ′ is given by

d′(θ′ | θ) = δθ √n Σ_j w_j(θ) f′_j(θ′) / [Σ_j w_j(θ)² f_j(θ′)]^{1/2}   (3.8)
where w_j(θ) are given by equation 3.7. Using this expression and equation 2.2, the average error of this perceptron can be evaluated. In Figure 2 we plot this error as a function of the stimulus angle θ′. The results show that not only is the performance for θ′ ≠ θ suboptimal, but there is a range of angles for which the probability of error is larger than 0.5, namely, the system is doing worse than random. Improving the signal-to-noise ratio, for instance by increasing δθ or increasing n, leads to an even larger probability of error. In fact, for any fixed set of weights, there is a range of angles for which the perceptron behaves worse than random, for all levels of noise. This can be seen from the fact that the integral of equation 3.4 over the whole range of θ is zero, implying that the output of the perceptron will necessarily differ from σ₀ for some range of θ. To realize this task we need to consider more complex networks.
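The wrong-sign effect can be checked directly from the mean field of equation 3.4. The sketch below is illustrative (a gaussian-bump tuning curve is assumed in place of equation 2.9, and parameter values are not from the paper): the weights of equation 3.7, optimal around θ = 0, give a positive mean field at the optimized angle but a negative one at some other angles.

```python
import numpy as np

a, n, fmin, N = 1.0, 50, 0.01, 50
prefs = 2 * np.pi * np.arange(N) / N - np.pi

def f(x):   # illustrative normalized tuning curve (assumed; equation 2.9 not reproduced)
    return fmin + (1 - fmin) * np.exp(-0.5 * (x / a) ** 2)

def fp(x):  # its derivative f'(x)
    return -(x / a ** 2) * (1 - fmin) * np.exp(-0.5 * (x / a) ** 2)

# optimal perceptron weights for discrimination around theta = 0, equation 3.7
w = fp(0.0 - prefs) / f(0.0 - prefs)

def mean_field(theta, dtheta):
    # mean of h = sum_j w_j (r_j - r'_j) for stimulus order (theta+dtheta, theta),
    # cf. equation 3.4 with sigma_0 = +1
    return n * dtheta * np.sum(w * fp(theta - prefs))

dtheta = np.deg2rad(3.0)
angles = np.linspace(-np.pi, np.pi, 181)
fields = np.array([mean_field(t, dtheta) for t in angles])
# correct (positive) sign at the optimized angle, wrong sign somewhere else
print(mean_field(0.0, dtheta) > 0, fields.min() < 0)
```

Because the angular integral of the mean field vanishes, a fixed set of weights cannot keep the correct sign for all θ, which is the point made in the text.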
Figure 2: Error probability vs. stimulus angle θ′ for the perceptron optimized for angle 0. N = 50, a = 1 (rad), f_min = 0.01, δθ = 3°, and n = 50.

3.2 Two-Layer Perceptron. We consider here a feedforward network consisting of an input layer with N units, a single hidden layer with M units, and a single output unit (see Fig. 3). The input units have the same tuning properties as in the previous section, namely their output is given by r_j − r′_j. Each hidden unit is a sigmoidal perceptron,

S_i = g(h_i)   (3.9)

where h_i is the internal field of the ith hidden unit,

h_i = Σ_{j=1}^{N} w_ij (r_j − r′_j)   (3.10)

and w_ij is the connection between input unit j and hidden unit i. The sigmoid function g(h) will be chosen for convenience as g(h) = tanh(h). The output of the system will be assumed to be

σ = sign( Σ_{i=1}^{M} S_i )   (3.11)
Figure 3: Architecture of the two-layer perceptron.

Note that we consider here a special two-layer network, in which all the weights from the hidden layer to the output unit are equal. The advantage of this restricted architecture is that it is easier to interpret its operation, since the system's decision consists of a majority vote of the hidden perceptrons, similar to the committee machine architecture (Sompolinsky and Barkai 1993). The two-layer perceptron described above can realize the angle discrimination task in the limit of small noise because it combines the signals from M perceptrons. In a region where one perceptron has the wrong output there will be others with the correct one that will cancel its effect. To see how the network can operate in the limit of small noise, it is sufficient to consider the mean values of the internal fields of the hidden units, equation 3.10, generated by a pair of stimuli with angle θ. For large N, they are equal to

⟨h_i(θ)⟩ = n δθ σ₀ Σ_{j=1}^{N} w_i(θ_j) f′_j(θ)   (3.12)
where the weights between the input units and the hidden units have been expressed as w_ij = w_i(θ_j). Thus, input units with w′_i(φ) < 0 provide a contribution with the correct sign, whereas those with w′_i(φ) > 0 provide a wrong signal. For a given hidden unit, the total contributions will yield the correct signal for some θ and incorrect ones for other values, as we have shown in the case of a single perceptron. However, here it is sufficient that for all θ,

σ₀ Σ_{i=1}^{M} g( ⟨h_i(θ)⟩ ) > 0   (3.13)
Figure 4: Example of a weight pattern for the two-layer perceptron that solves the discrimination task in the limit of low noise. Shown are the weights between the input units and two of the hidden units, vs. the PA of the input units. Each weight pattern is a piecewise linear function of the PA. The weights of different hidden units are translated versions of each other (see equation 3.14).

This can be achieved if each weight function w_i(φ) is such that its derivative is negative for most of the range of φ. In addition, the different functions w_i(φ) are arranged so that in the region where one of them has a positive derivative most of the others have a negative one, so that they can compensate the wrong signal. A simple example is provided by choosing uniformly displaced weight patterns for the hidden units,

w_i(φ) = W(φ − θ_i)   (3.14)

where θ_i = 2πi/M and the function W(φ) is a saw-tooth function (see Fig. 4). For this weight pattern, equation 3.12 yields
(3.15)
where f̄ = ∫ f(φ) dφ / 2π. This result assumes that f is not too wide, that M is sufficiently large, and that the hidden units are saturated. The minimal number of hidden units that is needed depends on the particular form of f. Of course there are many other solutions for performing the task in the limit of weak noise. In particular, different hidden units may have different profiles of input weights, and the input weight profile may have more than one narrow region of positive derivative. It is important to point out the role of the nonlinearity of the hidden units. The region of the weights with w′_i(φ) > 0 will give a wrong output. The absolute value of the field for this region will be larger than the one in the region with w′_i(φ) < 0, because the integral of the internal field from −π to π must be zero. But this effect is suppressed by the saturating nonlinearity of the hidden units, as was shown in the above example.

3.3 Optimal Performance of the Two-Layer Perceptron. Here we consider the performance of the two-layer perceptron in the presence of noise. The optimal network will be defined as the one that minimizes ε̄, the probability of error averaged over all angles,
ε̄ = (1/2π) ∫_{−π}^{π} ε(θ) dθ   (3.16)

where ε(θ) is given by equation 3.3. To find the optimal network, we use the following on-line gradient descent algorithm. In each iteration an angle θ is chosen at random with uniform distribution in [−π, π], and a pair of stimuli r and r′ is generated at random according to equation 2.5. The weights are updated according to the following stochastic gradient descent rule

w_ij^new = w_ij − η ∂E/∂w_ij   (3.17)

where w_ij^new are the updated weights. The energy E is a quadratic cost function for each example,

E = (1/2) (σ₀ − σ̄)²   (3.18)

where σ₀ is the correct output for the current example and, in order for the gradient of E to be well defined, we use in E a sigmoidal version of the network output, i.e., σ̄ = tanh(g Σ_i S_i). Here g is a gain parameter that is taken to be relatively large so that the final output neuron is close to saturation. This update rule is repeated for each new example, and the average error is monitored. The algorithm is stopped when the observed average error saturates. We would like to emphasize that the supervised rule, equation 3.17, was used to find the optimal two-layer perceptron for the discrimination task, but it will not be used to model the actual process of perceptual learning. The algorithm for perceptual learning will be introduced in Section 4.
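The on-line rule of equations 3.17–3.18 can be sketched as follows. This is an illustrative implementation, not the authors' code: the tuning-curve form, random seed, and iteration count are assumptions, and only the structure of the update is taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, n, a, fmin = 50, 11, 50, 1.0, 0.01
prefs = 2 * np.pi * np.arange(N) / N - np.pi

def rate(theta):  # illustrative tuning curves; equation 2.9 is not reproduced here
    return n * (fmin + (1 - fmin) * np.exp(-0.5 * ((theta - prefs) / a) ** 2))

W = rng.normal(0.0, 0.1, (M, N))   # hidden units must be initialized differently
g_gain, eta, dtheta = 1.0, 0.01, np.deg2rad(3.0)

def forward(x):
    S = np.tanh(W @ x)                   # hidden units, equations 3.9-3.10
    return S, np.tanh(g_gain * S.sum())  # smooth output used inside the cost

for _ in range(2000):
    theta = rng.uniform(-np.pi, np.pi)
    sigma0 = rng.choice([1, -1])         # +1 means order (theta+dtheta, theta)
    th1, th2 = (theta + dtheta, theta) if sigma0 == 1 else (theta, theta + dtheta)
    x = rng.poisson(rate(th1)) - rng.poisson(rate(th2))
    S, sig = forward(x)
    # gradient of E = (1/2)(sigma0 - sigma)^2 through both tanh layers (equation 3.17)
    back = -(sigma0 - sig) * (1 - sig ** 2) * g_gain
    W -= eta * np.outer(back * (1 - S ** 2), x)
print(np.abs(W).max())
```

Note that starting all hidden units from identical weights would trap the dynamics in the perceptron subspace, which is the point made about initialization below.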
Figure 5: Error probability for the optimal two-layer network with N = 50, M = 11, as a function of θ for δθ equal to 1°, 3°, and 18° (top to bottom).
In Figure 5 we show the performance of a network with N = 50 input units and M = 11 hidden units, obtained using the above minimization procedure, with η = 0.01 and g = 1. The number of training examples was P = 50,000, and they were recycled 500 times. The tuning curves have a = 1 (rad), f_min = 0.01, δθ = 3°, and n = 50. The figure displays the probability of error for stimuli with angle θ, as a function of θ, for three test values of δθ. The performance is better than random and improves with increasing the test value of δθ. In the limit of noiseless inputs the error goes to zero. The performance is relatively uniform in θ, except for large δθ, where the relative nonuniformity is enhanced by the fact that the mean error is extremely small. The performance of the two-layer perceptron was insensitive to the details of the minimization algorithm, such as changing the sigmoidal form of the output or changing the training parameters η, δθ, or n. However, it does depend on the number of hidden units. In Figure 6 we show the error probability averaged over all the angles as a function of 1/√M, with all other parameters held fixed, including the input size,
Figure 6: Average error as a function of 1/√M for N = 50 and n = 50. (a) The two-layer network. The weights have been calculated by iterating equation 3.17 with P = 50,000 stimuli, generated by the Poisson distribution, equation 2.5, with uniformly sampled θ, and δθ = 3°. The training set was recycled 500 times. The step size is η = 0.01. The initial values of the weights are chosen from a gaussian distribution. The average error is measured by averaging the network's error over a set of randomly sampled test stimuli with the same distribution as the training set. The result was further averaged over 5 realizations of the training algorithm; each corresponds to different samples of initial weights and stimuli. The line is a linear fit, with intercept at the origin, which equals 0.031. The error of ML is 0.018 (dotted line). (b) The gating network. Each hidden unit is an optimal perceptron. The parameters of the gating system are described in the text.
N = 50. The extrapolation to M → ∞ yields a minimum error of about 0.03. It is interesting to note that this asymptotic value is larger than the ML error, which for the parameters given above yields (via equations 2.7 and 2.2) ε_ML ≈ 0.018. Nevertheless, we have checked that the error of the network obeys the same scaling as the error of ML, in the sense that it depends on the parameters n, N, and δθ only through the signal-to-noise ratio.
Figure 7a: Optimal weights between the input units and the first hidden unit for a two-layer perceptron with N = 50, M = 11, n = 50, and a = 1 (rad). The weights are plotted as a function of the PA of the input units.

In Figure 7 we plot the weights between the input unit with PA θ and three of the hidden units. The weights corresponding to different hidden units are not the same, even though they receive the same input during the learning process. The reason for this is that if the weights for all the hidden units were the same, the system would be equivalent to a perceptron, which is not a minimum of the cost function (equation 3.18). However, to find the solution of Figure 7, the weights must be initialized with different values for different hidden units, because otherwise the system would always be constrained to the subset of identical weights for different hidden units. We can also observe from Figure 7 that different sets of weights have a similar shape. This shape is characterized by a negative derivative in two wide ranges of angles, and a steep positive derivative in two narrow intervals. This profile is qualitatively similar to the solutions we have discussed in the previous section for the zero-noise case. Nevertheless, it is interesting to note that the symmetry between the weights of different hidden units is not exact. To check whether this asymmetry is a
Figure 7b: Optimal weights between the input units and the second hidden unit for a two-layer perceptron with N = 50, M = 11, n = 50, and a = 1 (rad). The weights are plotted as a function of the PA of the input units.

consequence of the small number of hidden units, we have measured the asymmetry by evaluating the variation between the hidden units of the largest negative-derivative interval of their weights. The standard deviation of these fluctuations was found not to decrease significantly with increasing M, suggesting that even in the M → ∞ limit the symmetry between the hidden units is broken. The width of the input tuning curve, a, has an important effect on the structure of the weights. Using a = 1.2 (rad) instead of a = 1 we find that some of the hidden units have only one positive-derivative regime, as shown in Figure 8. As we increase a the proportion of hidden units with this profile increases, and for a = 1.5 all units have only one positive-derivative regime, qualitatively similar to the zero-noise example of Figure 4. Decreasing a to values significantly less than 1 yields units with three or more positive-derivative regimes (not shown). Summarizing, the weight profiles are a compromise between two factors: increasing the number of positive regions increases the absolute value of the derivative of the weights in the regions where the signal has the correct sign
Figure 7c: Optimal weights between the input units and the third hidden unit for a two-layer perceptron with N = 50, M = 11, n = 50, and a = 1 (rad). The weights are plotted as a function of the PA of the input units.

(keeping the absolute size of the weights constant). On the other hand, it increases the regions that contribute the wrong signal. As each of the narrow positive-derivative regions contributes a wrong signal from an angular region of size a (see equation 3.12), increasing a tends to reduce their number.

3.4 Gating Network. We have seen above that whereas discrimination of stimuli around a single angle is relatively simple and can be performed well by a single-layer perceptron, discrimination over a wide range of angles is a complicated task. The two-layer perceptron solves the problem in a distributed manner. One of the main characteristics of that system is the fact that the tuning of the hidden units is broad. For practically all angles the discrimination is performed by the summed outputs of all the hidden units. This distributed mode of operation has a significant consequence for the perceptual transfer properties of the system, as will be discussed in the following section. Here we present an
Figure 8: Optimal weights between the input units and one of the hidden units for a = 1.2 (rad). The other parameters are as in Figure 7.
alternative architecture for performing the discrimination task. The gating network consists of two types of units: an array of estimators and an array of local discriminators. The estimators identify roughly the angle θ of the stimuli and gate the discrimination network, i.e., decide which of the local discriminators will be assigned the discrimination task. Thus, the network simplifies the discrimination task by splitting it into discriminations over narrow angular regimes, which can be solved relatively easily. A similar architecture has recently attracted considerable interest (Jordan and Jacobs 1994). The architecture of the gating network is shown in Figure 9a. It consists of N input units and M hidden units with the same properties as in the two-layer perceptron above. However, here the M hidden units are local discriminators. They are assigned M angles, θ_i = 2πi/M, i = 1, …, M, which denote the range of operation of each discriminator.

Figure 9: (a) Architecture of the gating network. (b) Architecture of the gating system.

To implement the local discrimination, the output of the network is not given by equation 3.11 but by
σ = sign( Σ_{i=1}^{M} c_i S_i )   (3.19)

The M non-negative numbers c_i are the outputs of the gating network. The gating system is shown in Figure 9b. It consists of an array of M perceptrons that perform a weighted sum of the N inputs, r_j. Specifically, we choose

c_i = exp(x_i)   (3.20)
where the internal fields x_i are given by

x_i = β ( Σ_{j=1}^{N} J_ij r_j − t_i )   (3.21)

where J_ij are the weights from the jth input to the ith gating unit, t_i are thresholds, and β is a gain factor. We now have to determine the values of the discriminators' weights w_ij, the gating weights J_ij, and the thresholds t_i, appropriate for solving the angle discrimination task. Following our analysis above, we choose for the ith local discriminator the weights that are optimal for a perceptron discriminating around the angle θ_i, i.e.,

w_ij = f′(θ_i − θ_j) / f(θ_i − θ_j)   (3.22)
(see equation 3.7). The gating system's parameters should be such that for a stimulus angle near θ_i, c_i will be much bigger than c_j, j ≠ i, so that S_i will dominate the decision made by σ (equation 3.19). One way to achieve this is to demand that x_i be proportional to the log-likelihood that the inputs r were generated by a stimulus θ_i, which by Bayes' theorem reduces to

x_i = β ln P(r | θ_i) + C   (3.23)
where C is an arbitrary constant. In general, this result may not be achieved by a linear sum such as equation 3.21, and more complex gating units than single-layer perceptrons will be needed. However, in the special case of the Poisson distribution, equation 3.23 can be achieved in our architecture by choosing

J_ij = ln h(θ_i − θ_j),   t_i = Σ_{j=1}^{N} h(θ_i − θ_j)   (3.24)

Comparing with equation 2.5 we see that our choice is equivalent to

c_i ∝ [P(r | θ_i)]^β   (3.25)
The sharpness of the gating is determined by the gain parameter β. If β = 1, c_i is the probability that the input r has been generated by the angle θ_i. In the limit of large N the probability distribution has a width O(1/√N), and one c_i will be much larger than all the others, namely the one that corresponds to the angle θ_i with minimal distance from the input angle θ. On the other hand, if β = O(1/nN), the gating will be broad. All local discriminators with angles θ_i that differ from the input angle θ by an amount smaller than a will contribute to the decision. Thus, assuming large M, the width of the gating changes from O(2π/M) when β = O(1) (sharp gating) to O(a) when β = O(1/nN) (broad gating). In the presence of noise, the level of performance of the gating network depends
on parameters such as β, M, and a. In fact, from the above argument it follows that in the Poisson case, if we choose β = 1 and take the large M limit, the network performs ML estimation of θ, which is followed by ML discrimination. We thus expect that the network performance will approach the ML performance level in the limit of large M. This is indeed confirmed by our simulation results for the average error of this network, as shown in Figure 6b. Finally, it should be pointed out that unlike the two-layer perceptron, the gating network uses as inputs not only the differences in the responses, r_j − r′_j, but also the individual responses r_j. These two sets are used by two different portions of the gating network. The differences are used in the local discriminators, whereas the individual responses r_j are used by the units that compute the coefficients c_i. As we assume that all pairs of stimuli have small angular separations δθ, the precise choice of inputs to the gating units is not important. Instead of r_j, we could use r′_j or (r_j + r′_j)/2.
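A forward pass through the gating network (equations 3.19–3.22, with the Poisson-likelihood gating weights of equation 3.24) can be sketched as below. This is an illustrative reconstruction: the tuning curve is an assumed gaussian bump, angular wrap-around is ignored, and the numerical shift inside the exponential is an implementation detail added for stability (a common factor in all c_i does not change the sign of the decision).

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, n, a, fmin, beta = 50, 8, 50, 1.0, 0.01, 1.0
prefs = 2 * np.pi * np.arange(N) / N - np.pi
centers = 2 * np.pi * np.arange(M) / M - np.pi     # discriminator angles theta_i

def f(x):   # illustrative normalized tuning curve (wrap-around ignored)
    return fmin + (1 - fmin) * np.exp(-0.5 * (x / a) ** 2)

def fp(x):  # its derivative
    return -(x / a ** 2) * (1 - fmin) * np.exp(-0.5 * (x / a) ** 2)

D = centers[:, None] - prefs[None, :]   # theta_i - theta_j, shape (M, N)
Wd = fp(D) / f(D)                       # local-discriminator weights, equation 3.22
J = np.log(n * f(D))                    # gating weights from the Poisson log-likelihood
t = (n * f(D)).sum(axis=1)              # matching thresholds (equation 3.24)

def gate_output(r, rp):
    x = beta * (J @ r - t)              # gating fields, equation 3.21
    c = np.exp(x - x.max())             # equation 3.20; common shift leaves the sign unchanged
    S = np.tanh(Wd @ (r - rp))          # local discriminators
    return np.sign(np.sum(c * S))       # gated decision, equation 3.19

theta, dtheta = 0.5, np.deg2rad(3.0)
r = rng.poisson(n * f(theta + dtheta - prefs)).astype(float)
rp = rng.poisson(n * f(theta - prefs)).astype(float)
print(gate_output(r, rp))
```

With β = 1 the gating is essentially one-hot for these parameters, so the decision is carried by the discriminator whose θ_i is nearest the stimulus.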
4 Perceptual Learning
In this section we address the problem of perceptual learning for the systems described above. We consider a network that has a reasonable but suboptimal performance for all angles, and improve its ability to discriminate between stimuli in a narrow range about a particular angle θ. We then measure the perceptual transfer of learning by evaluating the system's probability of error for angles that differ from the trained angle θ. 4.1 The Two-Layer Perceptron
4.1.1 Initial State. We first implement the above program in the two-layer network. The initial state of the network is chosen by the gradient descent rule of equation 3.17, with stimuli that have a relatively large separation, i.e., δθ = 9°, and input angles distributed uniformly between −π and π. The number of learning iterations and the size of the learning step, η, are chosen so that at the end the system has a very small average discrimination error ε̄ (equation 3.16) for δθ = 9° (ε̄ = 0.031), but the error for δθ = 2° is high, ε̄ = 0.35. As the system has been trained for all the angles, the performance is roughly uniform. The weights of this initial condition are a "noisy" version of the optimal weights of Figure 7. 4.1.2 Learning Algorithm. We now train the system to improve its performance using training examples with δθ = 2° and θ = 0°. For this phase of training, which models the process of perceptual learning in a
psychophysical experiment, we do not use the algorithm of equation 3.17. The reason is that this update rule is a supervised rule, since it depends on the correct output signal, which is assumed to be provided with each input (see equation 3.18). Since perceptual learning is known to occur even without an external error signal (Fiorentini and Berardi 1981; Ball and Sekuler 1987), we use in this phase an unsupervised learning algorithm. Specifically, after generating randomly a pair of inputs r and r′ according to the distribution, equation 2.5, all the weights are incremented, using the following unsupervised Hebbian rule:

w′_ij = w_ij + γ (r_j − r′_j) S_i − η w_ij r_j   (4.1)
The first term is proportional to the product of the outputs of the presynaptic jth input unit and the ith hidden unit. The last term is a weight decay term, with a decay constant that depends on the input r_j. Here again one can replace r_j by r′_j or (r_j + r′_j)/2. The reason for choosing this input-dependent weight decay is to ensure that if we run this algorithm for a long time (with examples drawn with the same angle, say θ = 0°) it will converge to the optimal weights for this angle, w_ij ∝ f′_j(0°)/f_j(0°), provided γ and η are sufficiently small. This can be checked by setting w′_ij = w_ij in equation 4.1 and replacing the inputs by their average values. Note that in this asymptotic state the weights w_ij are independent of i, namely all the hidden units converge to the same perceptron, which is optimized for the training angle. It should be noted that our model uses supervised training for the initial state, and unsupervised learning for the perceptual learning stage. Since the initial state is presumably achieved by learning with large signal-to-noise ratio (in our model, large δθ), our scenario is consistent with the idea that when the psychophysical task is "easy" the system has an internal error signal (Weiss et al. 1993). 4.1.3 Transfer Curve. The performance of the network after training with δθ = 2°, as a function of the angle θ (the transfer curve), is shown in Figure 10. The probability of error increases as the distance from the trained angle 0° increases, reflecting the stimulus specificity of the learning. However, up to a distance of approximately a from the origin, the error is still smaller than the initial baseline. This implies that there is partial transfer of learning, the range of which is determined by the width of the input tuning curve. However, for larger distances from the origin, the error continues to grow and becomes larger than 0.5, indicating a performance that is worse than random.
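The unsupervised rule of equation 4.1 can be sketched as a short simulation loop. This is illustrative only: the tuning-curve form, the learning rates, and the iteration count are assumptions, and only the structure of the update (Hebbian term plus rate-dependent decay) comes from the text.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, n, a, fmin = 50, 11, 50, 1.0, 0.01
prefs = 2 * np.pi * np.arange(N) / N - np.pi
gamma, eta = 5e-5, 1.5e-5          # learning and decay rates (illustrative values)

def rate(theta):  # illustrative tuning curves; equation 2.9 is not reproduced here
    return n * (fmin + (1 - fmin) * np.exp(-0.5 * ((theta - prefs) / a) ** 2))

W = rng.normal(0.0, 0.1, (M, N))   # stands in for the pretrained initial state
dtheta = np.deg2rad(2.0)

for _ in range(5000):
    # unsupervised trials: pairs of stimuli around the trained angle theta = 0
    r = rng.poisson(rate(dtheta)).astype(float)
    rp = rng.poisson(rate(0.0)).astype(float)
    S = np.tanh(W @ (r - rp))
    # Hebbian term plus rate-dependent weight decay, equation 4.1
    W += gamma * np.outer(S, r - rp) - eta * W * r[None, :]
print(np.abs(W).max())
```

Setting the average update to zero reproduces the fixed point quoted in the text: γ δθ h′_j(0°) S_i = η w_ij h_j(0°), i.e., w_ij ∝ f′_j(0°)/f_j(0°), independent of i.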
Furthermore, if we test the performance on stimuli with a larger separation, the systematic negative bias increases, as shown in Figure 10. The appearance of a systematic error after the learning stage stems from the fact that during learning with a single stimulus angle, the whole system converges fast to the optimal perceptron for the training angle and
Figure 10: Error probability vs. θ for the two-layer network (same parameters as in Figure 5) after training the system around θ = 0 with the learning rule of equation 4.1. Dashed line: the initial performance for δθ = 2°. Solid line: performance after training, error measured with δθ = 2°. Dot-dashed line: performance after training, error measured with δθ = 9°. The parameters for the learning algorithm are δθ = 2°, P = 5,000, γ = 5 × …, and η = 1.5 × 10⁻⁵. Error evaluated by averaging over 50 realizations of input responses.

the ability to perform discrimination for very different angles is completely lost. One way to avoid this catastrophe is to assume that there is a constantly active internal "refreshing" mechanism that generates feedback error signals that prevent the development of a systematic negative bias. To model such a scenario, we have added to the unsupervised learning, equation 4.1, with a single angle, a low rate of supervised updates, equation 3.17, with examples generated by input angles uniformly distributed between −π and π. Consistent with our previous remarks, the supervised signals are generated with "easy" stimuli, i.e., with the same relatively large values of δθ that were used for reaching the initial state (in our case δθ = 9°). These infrequent updates have a negligible effect on the improvement for angles close to 0°, as the errors there are
Figure 11: Error probability vs. θ for the two-layer network (N = 50, M = 11, a = 1 (rad), and n = 50) using a mixture of unsupervised learning and a low rate of supervised updates. For details see text. Dashed line: initial performance. Solid line: performance after training, error measured with δθ = 2°. Dot-dashed line: after training, error measured with δθ = 9°. The other parameters are as in Figure 7. Average over 50 realizations.

small anyway, but they will have a strong effect at angles far away from 0°, where the unsupervised learning tends to generate a systematic error. As a result, this low-rate supervised uniform learning will prevent the full recruitment of all hidden units to the neighborhood of 0°. Instead, several hidden unit weights will remain relatively unchanged from their initial state. By tuning the relative rates of stimulus-specific learning and supervised uniformly distributed updates, we have found parameters that successfully avoid the systematic negative bias. An example is shown in Figure 11. Nevertheless, even in this case the level of error for intermediate angles is larger than it was before training. To conclude, we find that perceptual learning in the two-layer network results in an improvement of performance in the neighborhood of the trained angle, a phenomenon that we term positive perceptual transfer. The range of positive transfer is set by the tuning width of the input units, a. For angles far away from the trained angle the performance is the same as its initial level. On the other hand, at an intermediate range of angles, there is a negative perceptual transfer, meaning that the recruitment of units to improve the performance in the trained regime results in a worsening of the performance compared with its initial level.

4.2 The Gating Network. Our model for perceptual learning in the gating network assumes that the parameters of the gating units are fixed at their correct values, given by equations 3.20 and 3.24, and that the perceptual learning process affects only the input weights of the discriminators. This makes it possible to avoid the convergence of all the hidden units to the same optimal perceptron, because the gating variables c_i can also be used to confine the effect of the learning process. We thus use the following unsupervised Hebbian rule
w′_ij = w_ij + c_i [ γ (r_j − r′_j) S_i − η w_ij r_j ]   (4.2)
with examples from stimuli with θ = 0°. As in the previous case, the initial values of w_ij are determined by a supervised learning rule, similar to equation 3.17, with examples generated uniformly over the whole angular range. The cost function is similar to equation 3.18, but with σ̄ = tanh(g Σ_i c_i S_i). As we have mentioned in the previous section, in the limit of sharp gating, for any stimulus parameter θ there will be one c_i that is much larger than all the others. This is the one that corresponds to the angle θ_i with minimal distance from θ. As the derivative of the cost function with respect to w_ij contains a factor c_i, the weights of one of the hidden units will be updated by a much larger factor than all the others. The updating would eventually converge to the optimal perceptron for the angle θ_i. When different angles are presented during the training process, different hidden units are chosen. If the algorithm is applied for a long time, all the units would converge to the optimal perceptron for their corresponding angles θ_i and the performance would be optimal. To reach the desired initial condition (in which the performance is uniform but not optimal) we stop the supervised learning algorithm at a relatively early stage, so that the system does not reach the optimal performance. The weights in the initial condition are a mixture of the optimal perceptron for each angle and noise coming from the initial condition. In general, the outcome of the model depends on the network parameters. However, its transfer properties are easily understood in the limit of large N and sharp gating, i.e., β = O(1). In this case, essentially only the local discriminator with θ_i ≈ 0° will be affected by the learning stimulus angle θ. When a stimulus with |θ| > π/M is presented, the output will be determined by a hidden unit that has not been changed during the learning process. Consequently, the error for that angle will
G. Mato and H. Sompolinsky
Figure 12: Error probability vs. θ for the gating network, with N = 50, M = 11, a = 1 (rad), n = 50, β = 1, and the learning rule of equation 4.2. Dashed line: initial performance; solid line: performance after training; both measured with δθ = 2°. Average over 50 realizations.
be the same as before training. Thus, in this limit the range of any perceptual transfer is 2π/M. If M is large compared to 2π/a, then negative transfer will be avoided. An example of a transfer curve for the above model of perceptual learning in the gating network is shown in Figure 12. The results exhibit only positive transfer, in agreement with the above considerations. In general, suppression of negative transfer can also be achieved in the more realistic regime of broad gating [β = O(1/N)], provided that the angular tuning width of c_i is less than about a/2. On the other hand, if the gating width is greater than or equal to a, then negative transfer will be seen. In this work we have assumed for simplicity that the parameters of the gating units are fixed, whereas the local discriminators can change. We have verified that similar results are obtained if we also allow learning in the gating parameters.
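The confinement of learning by sharp gating can be illustrated with a small numerical sketch (hypothetical code: the unit counts, learning rate, Poisson inputs, and exponential gating profile are illustrative assumptions, not the model's actual parameters). Because the gating factors c_i multiply the Hebbian update, training at a single angle moves essentially only the weights of the discriminator tuned near that angle:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 11, 50                                  # discriminators, input units (illustrative)
theta_pref = np.linspace(0.0, 2.0 * np.pi, M, endpoint=False)  # preferred angles

def gating(theta, beta=20.0):
    # sharp gating: credit is concentrated on the discriminator whose
    # preferred angle is closest to the stimulus angle theta
    c = np.exp(beta * (np.cos(theta - theta_pref) - 1.0))
    return c / c.sum()

W = rng.normal(scale=0.01, size=(M, N))        # input weights of the discriminators
W0 = W.copy()

def hebbian_step(theta, S, y=1.0, eps=0.1, eta=0.05):
    # gated Hebbian update with weight decay, in the spirit of equation 4.2:
    # each discriminator's update is scaled by its gating credit c_i
    c = gating(theta)
    for i in range(M):
        W[i] += eps * c[i] * (y * S - eta * W[i] * y)

# train only at theta = 0; units tuned far from 0 barely change
for _ in range(200):
    S = rng.poisson(5.0, size=N).astype(float)  # noisy input activity pattern
    hebbian_step(0.0, S)

change = np.linalg.norm(W - W0, axis=1)
print(change.argmax())   # the unit with preferred angle 0 changes the most
```

In this toy setting the weight change of the discriminator tuned to the trained angle exceeds that of its neighbors by an order of magnitude, which is the mechanism invoked above for avoiding negative transfer.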
Perceptual Learning of Angle Discrimination
5 Discussion

We have addressed the problem of solving an angle discrimination task using simple neural network models. We have focused on limitations in performing the task that are induced by the noise in the neuronal representations of the stimuli. Under general plausible assumptions about the noise, the degree of difficulty of performing the task decreases upon increasing two experimentally controlled parameters: the angular separation between the pair of stimuli, δθ, and the amplitude of the response, n. The latter can be varied by, e.g., varying the duration of the stimulus, or its contrast. In addition, the signal-to-noise ratio increases with the number of input neurons, N, that represent the stimuli, neglecting correlations between the fluctuations in their responses. Our formal analysis utilized the limit of N → ∞. In practice, our analysis applies to N > 30, as has been verified by our numerical simulations. In this work we have focused on the 2AFC discrimination paradigm. Our first goal was to find simple feedforward networks that can perform the task for all angles. We have shown that although the single-layer perceptron can perform the task when the stimuli are concentrated around one angle, the task is unrealizable by a single-layer perceptron even in the limit of large signal-to-noise ratio. The discrimination problem can be thought of as classifying the N-dimensional inputs generated by the stimuli in the two possible temporal orders. Our result means that the input vectors of stimuli from all angles are not linearly separable (see, e.g., Hertz et al. 1991), even if the scatter induced by the noise is neglected. In principle, one way out of this problem is to assume that there is a fast learning mechanism by which, following the presentation of a stimulus, the perceptron weights can rapidly adapt to the angular neighborhood of the current stimulus.
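The dependence of discrimination difficulty on these parameters can be made concrete with the usual equal-variance gaussian signal-detection relation (an assumption of standard signal-detection theory; the paper's own equation 2.2 is not reproduced in this excerpt), in which the 2AFC error probability falls monotonically with the discriminability d′:

```python
import math

def two_afc_error(dprime):
    # standard equal-variance gaussian result for 2AFC:
    # P_error = Phi(-d'/sqrt(2)) = 0.5 * erfc(d'/2)
    return 0.5 * math.erfc(dprime / 2.0)

# the error falls monotonically as d' grows, from chance (0.5) at d' = 0
for dprime in (0.0, 0.5, 1.0, 2.0, 4.0):
    print(dprime, two_afc_error(dprime))
```

Any manipulation that raises d′ (larger δθ, stronger responses, more input neurons) therefore lowers the error along this single curve.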
For example, in a previous work (Seung and Sompolinsky 1993), it has been implicitly assumed that a fast adaptive mechanism exists that provides information about the rough range of angles of the stimulus, thereby avoiding the appearance of the negative d′ of equation 3.6. This amounts to using the absolute value of d′ in the evaluation of the perceptron error probability (equation 2.2). This approach prevents the appearance of systematic error, but yields a nonmonotonic transfer curve (see Figure 3 of Seung and Sompolinsky 1993). Furthermore, it can be shown that it will still predict negative transfer in some range of angles. This behavior is exhibited by the two-layer linear network, based on the population vector, studied in Seung and Sompolinsky (1993). In the present work we chose not to resort to ad hoc assumptions of fast adaptive mechanisms. Instead, we assume more complex network architectures that are capable of implementing the task without fast adaptation. We have shown that a multilayer perceptron network with one hidden layer can achieve reasonably good uniform performance. Interestingly, extrapolation of our results indicates that even in the limit when the number of hidden units, M, is large, the average error does not reach the
level achieved by the ML discriminator. At present, we are not aware of any theoretical result regarding the capability of two-layer perceptrons to approximate ML discrimination. However, it is quite possible that our result is due to the restricted range of M used in the extrapolation, or that it reflects a limitation inherent in the gradient descent algorithm, equation 3.17, we have used to find the optimal weights. In addition, we have used a restricted architecture of a committee machine (Sompolinsky and Barkai 1993), where all the output weights have the same fixed value. However, we have found similar results also in simulations of a fully modifiable two-layer network with a backpropagation algorithm (Hertz et al. 1991). The second feedforward architecture we have studied is that of a gating network, which consists of an array of M local discriminators, each optimized to discriminate in a narrow range of angles, and an array of M gating units that select for each stimulus the appropriate discriminator by estimating the angular range of the stimulus. We have shown that this network can approach the ML average performance in the limit of large M, but this result relied on our construction, equations 3.22 and 3.24, which is appropriate for the Poisson distribution. We do not know whether the gating network, with the simple architecture of Figure 9, is capable of achieving the ML performance for other distributions. Irrespective of the quantitative difference in their performance, the two-layer perceptron and the gating network differ qualitatively in their mode of operation. The two-layer network solves the problem in a distributed manner by arranging the weights of the different hidden units in such a way that errors made by one of them will be compensated by the others. The optimal arrangement of weight patterns that efficiently achieves this error correction depends on a.
For input tuning widths that are not too broad, a ≤ 1 (rad), the weight pattern has two maxima in a cycle (see Figure 7). This would show up as two maxima in the tuning curves of the hidden units. In contrast, the gating network operates by combining estimation with local discrimination. Hence the weight patterns of the discriminators are similar to that of a perceptron optimized for a single angle, and they are, therefore, less sensitive to the value of a. In fact, the discriminators are predicted to have a tuning curve with a single maximum irrespective of a. The performance of the two-layer network is similar to the ML discriminator in that for small δθ both require only the temporal difference in the input responses to the two stimuli, i.e., r − r′. Consequently, their performance depends on the parameters δθ, n, and N only through the combination δθ√(nN). In contrast, to perform the estimation, the gating network uses the response vector r in addition to the difference vector r − r′. Consequently, its performance may improve if n or N increases even if δθ is changed simultaneously, so that δθ√(nN) is unchanged. This prediction can be checked experimentally, by changing n and N. The above consideration holds in the case of large N, as was assumed throughout
this work. Angle estimation by small neuronal populations has recently been studied (Salinas and Abbott 1994). The second goal was to study how the proposed networks learn the task. In particular, we examined how learning in one angular neighborhood affects performance in other regimes. A central problem in this work is the question of negative transfer. This is defined as a decrease in the performance for angles different from the trained one, relative to the performance before training. Our paradigm for perceptual learning consists of choosing a network with an initial set of weights that yields good uniform performance for easy tasks, namely high signal-to-noise ratio, and then training it with inputs drawn from stimuli in one narrow angular range. For the training algorithm we choose simple unsupervised Hebbian rules (equations 4.1 and 4.2). The most striking feature in the resultant perceptual transfer curve of the two-layer perceptron is the appearance of negative transfer, namely, a range of angles for which the performance after learning was worse than before. This occurs for angles that are separated from the training angle by approximately a. The reason for this behavior is that the learning process affects almost all the hidden units equally, and tends to change their weights so that all of them will resemble a single-layer perceptron. This perceptron works fine around the trained angle but is inadequate for dissimilar ones. This effect was quite robust to variations in the learning algorithm, including the use of a supervised learning algorithm (again with a single angle). In contrast, negative transfer can be avoided in the gating network. This is because the same gating system that operates during the discrimination task can limit the effect of the training to those hidden units that are tuned to the trained angle, thereby leaving intact the performance for different values of θ.
In general, suppression of negative transfer in the gating network requires that the gating width be smaller than about a/2. This implies the existence of units with relatively sharp angular tuning. Thus, neurons with sharp tuning could be candidates for such a gating system. In addition, the sharp gating implies that the range of (positive) perceptual transfer is narrower than a. This should be contrasted with the two-layer perceptron, where the extent of perceptual transfer is roughly a. It would be interesting to study systematically the general interplay between the parameters a, M, and β. Such a study may shed further light on the general properties of gating networks (Jordan and Jacobs 1994). It would be interesting to test the possible existence of negative transfer in psychophysical experiments. In fact, in Ball and Sekuler (1987), Fahle and Edelman (1993), and Weiss et al. (1993) there is some evidence for this phenomenon, although it is observed for only some of the subjects that take part in the experiments. Further experimental studies could clarify the situation. In the present work, we assumed that the input representation of
angles is uniform, so that the system has an underlying rotational symmetry, which is broken only by the external stimulus. Thus, we have ignored the "oblique effect," which denotes the experimental finding that discrimination is better for vertical and horizontal directions than for oblique directions (Appelle 1972; Heeley and Timney 1988). It would be interesting to study the effect of this phenomenon on our model and results. Another class of models that has been introduced to study perceptual learning is based on local basis functions (Poggio et al. 1992; Weiss et al. 1993). These models also display a limited transfer to parameters very different from the trained one. At present it is not clear whether these models will display negative transfer in a certain regime of parameters. This may depend on the assumptions about the learning rule incorporated in the models. It would be interesting to study in more detail the perceptual transfer in these models and compare them with the results of the neural network models studied here. Finally, we note that in this work we focused on the discrimination of angle variables, which have the salient feature of periodicity. It would be interesting to extend our study to discrimination tasks involving other stimulus parameters, such as spatial frequency or velocity. The properties of the system in the general cases are expected to depend on the form of the underlying input tuning curves. In particular, we expect behavior similar to the present case if all the input tuning curves have a nonmonotonic bell shape. This may not be the case for systems with tuning curves that are monotonic functions of the stimulus parameter. Extending our study to general discrimination tasks will presumably involve the incorporation of a mixture of different types of tuning curves, as was done in the study of discrimination of stereo disparity (Lehky and Sejnowski 1990).

References

Appelle, S. 1972.
Perception and discrimination as a function of stimulus orientation: The "oblique effect" in man and animals. Psychol. Bull. 78, 266-278.
Ball, K., and Sekuler, R. 1987. Direction-specific improvement in motion discrimination. Vision Res. 27, 953-965.
Devos, M., and Orban, G. A. 1990. Modeling orientation discrimination at multiple reference orientations with a neural network. Neural Comp. 2, 152-161.
Fahle, M., and Edelman, S. 1993. Long-term learning in vernier acuity: Effects of stimulus orientation, range and of feedback. Vision Res. 33, 397-412.
Fiorentini, A., and Berardi, N. 1981. Learning in grating waveform discrimination: Specificity for orientation and spatial frequency. Vision Res. 21, 1119-1158.
Heeley, D. W., and Timney, B. 1988. Meridional anisotropies of orientation discrimination for sine wave gratings. Vision Res. 28, 337-344.
Hertz, J., Krogh, A., and Palmer, R. G. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Cambridge, MA.
Jordan, M. I., and Jacobs, R. A. 1994. Hierarchical mixtures of experts and the EM algorithm. Neural Comp. 6, 181-214.
Lehky, S. R., and Sejnowski, T. J. 1990. Neural model of stereoacuity and depth interpolation based on a distributed representation of stereo disparity. J. Neurosci. 10, 2281-2299.
Orban, G. A. 1984. Neuronal Operations in the Visual Cortex. Springer-Verlag, Berlin.
Poggio, T., Fahle, M., and Edelman, S. 1992. Fast perceptual learning in visual hyperacuity. Science 256, 1018-1021.
Salinas, E., and Abbott, L. F. 1994. Vector reconstruction from firing rates. J. Comput. Neurosci. 1, 89-107.
Seung, H. S., and Sompolinsky, H. 1993. Decoding of distributed neural codes. Proc. Natl. Acad. Sci. U.S.A. 90, 10749-10753.
Sompolinsky, H., and Barkai, N. 1993. Theory of learning from examples. Proc. IJCNN 93, Nagoya, Japan. Tutorial Volume, pp. 221-240.
Vogels, R., Spileers, W., and Orban, G. A. 1989. The response variability of striate cortical neurons in the behaving monkey. Exp. Brain Res. 77, 432-436.
Walk, R. D. 1978. Perceptual learning. In Handbook of Perception, E. C. Carterette and M. P. Friedman, eds., Vol. IX, pp. 257-298. Academic Press, New York.
Weiss, Y., Edelman, S., and Fahle, M. 1993. Models of perceptual learning in vernier hyperacuity. Neural Comput. 5, 695-718.
Received March 6, 1995; accepted July 18, 1995
Communicated by Sidney Lehky
Directional Filling-in
The filling-in theory of brightness perception has gained much attention recently owing to the success of vision models. However, the theory and its instantiations have suffered from incorrectly dealing with transitive brightness relations. This paper describes an advance in the filling-in theory that overcomes the problem. The advance is incorporated into the BCS/FCS neural network model, which allows it, for the first time, to account for all of Arend's test stimuli for assessing brightness perception models. The theory also suggests a new teleology for parallel ON- and OFF-channels.
1 Introduction
Light intensity reflected from a surface changes dramatically with changes in illumination, but the ratio of intensities (contrast) reflected from adjacent locations remains essentially constant. The visual system extracts the contrast ratio from the distribution of light hitting the retina by local differencing mechanisms of two types: on-center/off-surround detectors that respond maximally to a light spot surrounded by a dark annulus, and off-center/on-surround detectors that respond maximally to a dark spot surrounded by a lighter annulus. These two distinct populations appear at retinal ganglion cells that project to the visual cortex. Given that the information sent from the retina to the brain is primarily about local luminance and color contrasts rather than about extended areas, why do we experience object surfaces, rather than mere edges? One explanation is that information from the edges "flows" across the areas that correspond to uniform surfaces, filling them in with features such as color and brightness. Numerous examples of filling-in phenomena appear in the clinical literature: from retinal scotomas (Gerrits and Timmerman 1969), and from experimental work using stabilized images (Krauskopf 1963; Gerrits et al. 1966; Yarbus 1967). There is also a growing literature on the filling-in of texture information from human psychophysics (Ramachandran and Gregory 1991; Ramachandran et al. 1992) ... , θ_K} and the "true" or "best" value of θ_k is sought. This problem appears in many practical applications, e.g., speech recognition (Rabiner and Schafer 1988) and enzyme classification (Papanicolaou and Medeiros 1990). An extensive list of applications can be found in Hertz et al. (1991). In this paper we present an incremental credit assignment (ICRA) scheme that assigns credit to each source according to its predictive power. This approach yields a hierarchical architecture with a prediction level at the bottom and a decision level at the top.
We present a recurrent, hierarchical, modular neural network implementation of this architecture. A bank of local prediction modules is trained, each on data from a particular source S(θ_k). The prediction modules can be implemented by several different kinds of feedforward neural networks: sigmoid, linear, gaussian, polynomial, etc. The decision module is implemented by a recurrent gaussian network that combines the outputs of the prediction modules. The overall structure of the network is presented in Figure 1. We prove that the

Neural Computation 8, 357-372 (1996) © 1996 Massachusetts Institute of Technology
Vassilios Petridis and Athanasios Kehagias
Figure 1: The network architecture. Summation neurons are denoted by Σ, gaussian neurons by G, and identity neurons by I. The symbol - denotes weights determined by q_t^k. The block labeled DECISION MODULE implements equation 3.4.
credit functions converge with probability one to the correct values, namely, to one for the module with maximum predictive power and to zero for the remaining modules. Moreover, ICRA has an easy neural network implementation (using only adders and multipliers). ICRA has been inspired by classification based on the Bayesian posterior probabilities of the candidate sources, but the Bayesian connection is not necessary for developing ICRA or for proving its convergence properties. The idea of combining local models into a large modular network has recently become very popular. It is used for prediction as well as for classification of both static and dynamic (time series) patterns. Early examples of this idea are, for example, Farmer and Sidorowich (1988) and Moody (1989), where a time series prediction problem is solved by partitioning the input space into a number of regions and training a local predictor for each region; in every instance, the local predictor used is explicitly determined by past input values, hence it is not necessary to assign credit to each predictor. A later development is the combination of local models using a weighted sum; the weights can be interpreted as conditional probabilities or as credit functions. This is the approach taken in Jacobs et al. (1991), Jordan and Jacobs (1992), Neal (1991), and Nowlan (1990), where the terms local experts and probability mixtures are
Time Series Classification
used; the term committees appears in Schwarze and Hertz (1992), the term neural ensembles in Perrone and Cooper (1993), and so on. Our point of view is similar to that of the above papers, insofar as we also use local models (predictors) and credit functions. However, ICRA is a recursive scheme for online credit assignment, so that classification at a given time depends on past classifications. This is particularly appropriate for classification of dynamic patterns, such as time series, where the history of the signal must be taken into account. In contrast, the above-mentioned papers use offline credit assignment and are applied either to static problems or to "staticized" dynamic problems, where preprocessing is used to transform a time-evolving signal into a static feature vector (FFT or LPC coefficients, etc.). However, static feature vectors may not capture all the dynamic properties of a time series, especially in the case of source switching. On the other hand, while our method assumes that the classes to be used are given in advance, several of the above-mentioned papers present algorithms that discover an expedient partition of the source space. In fact, there are several neural algorithms that combine local models and adaptive partitioning (Ayestaran and Prager 1993; Baxt 1992; Jordan and Jacobs 1994; Kadirkamanathan and Niranjan 1992; Schwarze and Hertz 1992; Shadafan and Niranjan 1994). However, while such algorithms perform adaptive partitioning, they do not perform, as far as we know, adaptive classification, since they do not use classification results recursively. In short, our ICRA algorithm is applicable to problems of time series classification, where past classification results must be used for future classification, and classes are given in advance.

2 Bayesian Time Series Classification
A random variable Z that takes values in Θ = {θ_1, ..., θ_K} is introduced. The time series y_1, y_2, ... is produced by source S(Z). For instance, if Z = θ_1, then the time series y_1, y_2, ... is produced by S(θ_1). At every time t a decision rule produces an estimate of Z, denoted Ẑ_t. For instance, if at time t we believe that the time series y_1, ..., y_t has been produced by θ_1, then Ẑ_t = θ_1. Clearly, Ẑ_t may change with time, as more observations become available. The conditional posterior probability p_t^k for k = 1, 2, ..., K, t = 1, 2, ... is defined by
p_t^k ≜ Prob(Z = θ_k | y_t, ..., y_1).    (2.1)

Also, the prior probability p_0^k for k = 1, 2, ..., K is defined by

p_0^k ≜ Prob(Z = θ_k at t = 0).

p_0^k reflects our prior knowledge of the value of Z. In the absence of any prior information we can just assume all models to be equiprobable: p_0^k =
1/K for k = 1, 2, ..., K. p_t^k reflects our belief (after observing data y_1, ..., y_t) that the time series is produced by S(θ_k). We choose Ẑ_t = argmax_{θ_k ∈ Θ} p_t^k. In other words, at time t we consider that y_1, ..., y_t has been produced by source S(Ẑ_t), where Ẑ_t maximizes the posterior probability. So the classification problem has been reduced to computing p_t^k, t = 1, 2, ..., k = 1, 2, ..., K. This computation (Hilborn and Lainiotis 1969; Lainiotis and Plataniotis 1994) is based on Bayes' rule:

p_t^k = Prob(Z = θ_k | y_t, ..., y_1) = Prob(y_t, Z = θ_k | y_{t-1}, ..., y_1) / Prob(y_t | y_{t-1}, ..., y_1).    (2.2)
Also,

Prob(y_t, Z = θ_k | y_{t-1}, ..., y_1) = Prob(y_t | y_{t-1}, ..., y_1, Z = θ_k) · p_{t-1}^k.    (2.3)

Now equations 2.2 and 2.3 imply the following recursion for k = 1, 2, ..., K, t = 0, 1, 2, ...:

p_t^k = Prob(y_t | y_{t-1}, ..., y_1, Z = θ_k) p_{t-1}^k / Σ_{j=1}^K Prob(y_t | y_{t-1}, ..., y_1, Z = θ_j) p_{t-1}^j    (2.4)
and we need only (for each t and k) to compute Prob(y_t | y_{t-1}, ..., y_1, Z = θ_k). This probability depends on the form of the predictor; the predictors have a general parametric form f(·; θ_k), k = 1, ..., K:

ŷ_t^k = f(y_{t-1}, ..., y_{t-N}; θ_k).    (2.5)
Typically, f(·; θ_k) would be a feedforward (linear, sigmoid, gaussian, polynomial) neural network trained on data from source S(θ_k). This predictor approximates y_t when the time series is produced by S(θ_k). For k = 1, ..., K, t = 1, 2, ..., the prediction error e_t^k is defined by

e_t^k ≜ ŷ_t^k − y_t.    (2.6)
It is assumed that e_t^k is a white, gaussian noise process, with conditional probability of the form

Prob(e_t^k = e | y_{t-1}, ..., y_1, Z = θ_k) = C(θ_k) exp(−e²/2σ_k²).    (2.7)
It then follows immediately from equations 2.5, 2.6, and 2.7 that

Prob(y_t | y_{t-1}, ..., y_1, Z = θ_k) = C(θ_k) exp[−(y_t − ŷ_t^k)²/2σ_k²].    (2.8)

The probability assumption of equation 2.7 is arbitrary, but works well in practice, as will be seen in Section 5. The parameter σ_k² is the variance and C(θ_k) is a normalizing constant. Extensions for vector-valued y_t and e_t^k are obvious. The posterior probability p_t^k of source S(θ_k), k = 1, 2, ..., K, for time t = 1, 2, ... can be computed by means of the above equations. At time t the time series is classified to the source that maximizes the
posterior probability:

Ẑ_t = argmax_{θ_k ∈ Θ} p_t^k.    (2.9)

The recursion for p_t^k is obtained from equations 2.1, 2.4, 2.5, 2.8, and 2.9.

3 Incremental Credit Assignment Scheme
In this section we introduce an incremental credit assignment (ICRA) scheme to be used for time series classification. ICRA is motivated by the Bayesian scheme, but is simpler in implementation, requiring only adders and multipliers. In addition, ICRA classifies as well as, and sometimes better than, the Bayesian scheme, as will become obvious in Section 5. Finally, ICRA has desirable convergence properties that can be mathematically proved. Hence ICRA is an attractive alternative to Bayesian classification. To develop ICRA, start by defining

g(e) ≜ exp(−e²/2σ²).    (3.1)
Now consider the following difference equation

q_t^k = q_{t-1}^k + γ [ g(e_t^k) p_{t-1}^k / Σ_{j=1}^K g(e_t^j) p_{t-1}^j − q_{t-1}^k ]    (3.2)

with initial condition (k = 1, 2, ..., K)

q_0^k = 1/K.    (3.3)

It is clear that if the q_t^k converge, in equilibrium (q_t^k = q_{t-1}^k) we will have q_t^k = g(e_t^k) p_{t-1}^k / Σ_{j=1}^K g(e_t^j) p_{t-1}^j. Since the p_{t-1}^k in equation 3.2 are unknown, let us replace them by the q_{t-1}^k. After some rewriting, equation 3.2 becomes

q_t^k = q_{t-1}^k + γ q_{t-1}^k [ g(e_t^k) − Σ_{j=1}^K q_{t-1}^j g(e_t^j) ] / Σ_{j=1}^K q_{t-1}^j g(e_t^j).    (3.4)

Equation 3.4 is the important part of the ICRA scheme. Even though we have started with a Bayesian point of view, this can now be abandoned. We consider the q_t^k to be credit functions: the higher q_t^k gets, the more likely S(θ_k) is to be the "true" source. From equation 3.4 we see that the credit functions q_t^k are updated in an incremental manner, similar to a steepest descent procedure. At time t the time series is classified to source S(Ẑ_t), where
Ẑ_t = argmax_{θ_k ∈ Θ} q_t^k.    (3.5)
Vassilios Petridis and Athanasios Kehagias
362
Of course the use of equation 3.5 requires some justification; namely, we must prove that if the "true" or "best" source is S(θ_m), then q_t^m is greater than q_t^k, k ≠ m. This justification will be provided in the next section. Namely, we will prove that the q_t^k as given by equation 3.4 are convergent; in particular, the q_t^m associated with the source S(θ_m) of highest predictive power converges to one, while all other q_t^k converge to zero. Therefore the credit functions q_t^k can be used for classification. In summary, the ICRA scheme is based on equations 3.3, 2.5, 2.6, 3.1, 3.4, and 3.5, which can be implemented by the recurrent, hierarchical, modular network of Figure 1. The bottom, prediction level of the hierarchy consists of a bank of predictive modules, each one implementing a predictor of the form of equation 2.5 for a specific value θ_k. Typically these modules are feedforward neural networks (sigmoid, linear, gaussian, etc.). The top, decision level of the hierarchy consists of a module that implements equation 3.1; this module can be built from gaussian neurons. At this point we should emphasize that within this context the gaussian form g(e_t^k) ceases to be an assumption about the statistical properties of the error and becomes a matter of design regarding the credit assignment scheme. Also, we emphasize that ICRA can be implemented using only adders and multipliers; hence implementation is simpler than that of the Bayesian scheme. Finally, it should be mentioned that implementation of the ICRA scheme requires computation of equation 3.4 for k = 1, 2, ..., K, which obviously scales linearly with K, the number of classes. Hence, the time requirements of ICRA are O(K): to handle 100 classes takes only 10 times more than to handle 10 classes if the algorithm is implemented serially. It should also be noted that equation 3.4 is fully parallelizable (see also Fig. 1), resulting in O(1) (constant) execution time for a parallel implementation.
Memory requirements are also O(K), since only the current q_t^k need to be retained at every time step.¹
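As a concrete illustration, the whole scheme can be sketched in a few lines (hypothetical code: the two AR(1) sources, the gaussian credit function, and all parameter values are illustrative assumptions, not taken from the text). The loop combines the predictors of equation 2.5, the errors of equation 2.6, the credit update of equation 3.4, and the classification rule of equation 3.5; the work per time step is O(K), as noted above:

```python
import numpy as np

rng = np.random.default_rng(1)

# bank of K predictors (equation 2.5): here, two known AR(1) models
# y_t = theta_k * y_{t-1} + noise, one candidate parameter per source
thetas = np.array([0.3, 0.9])          # illustrative source parameters
K = len(thetas)

def g(e, sigma=1.0):
    # gaussian credit function, decreasing in |e| (cf. equation 3.1)
    return np.exp(-e**2 / (2.0 * sigma**2))

# generate a time series from the "true" source, theta_m = 0.9
y = [0.0]
for _ in range(400):
    y.append(0.9 * y[-1] + rng.normal(scale=0.5))

q = np.full(K, 1.0 / K)                # initial credits (equation 3.3)
gamma = 0.2
for t in range(1, len(y)):
    e = thetas * y[t - 1] - y[t]       # prediction errors (equation 2.6)
    gk = g(e)
    denom = q @ gk
    q = q + gamma * q * (gk - denom) / denom   # credit update (equation 3.4)

Z_hat = int(np.argmax(q))              # classification rule (equation 3.5)
print(Z_hat, q)
```

Because the true source produces the smallest prediction errors on average, its credit dominates after a few hundred steps, which is the behavior established formally in the next section.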
4 Convergence
We will now show that equation 3.4 has the following property: if θ_m is the "best" value of θ [i.e., source S(θ_m) best predicts the observed data], then q_t^m converges to 1 and q_t^k converges to 0 for k ≠ m. We start with the following lemma.
Lemma 1. For every t = 0, 1, 2, ..., Σ_{k=1}^K q_t^k = 1.

Proof. The proof is by induction. Supposing Σ_{k=1}^K q_{s-1}^k = 1, it will be shown that Σ_{k=1}^K q_s^k = 1 as well. Summing equation 3.4 over k (and using

¹The same time and memory requirements hold for the Bayesian classifier of Section 2.
the induction hypothesis):

Σ_{k=1}^K q_s^k = Σ_{k=1}^K q_{s-1}^k + γ [ Σ_{k=1}^K q_{s-1}^k g(e_s^k) − Σ_{k=1}^K q_{s-1}^k g(e_s^k) ] / Σ_{k=1}^K q_{s-1}^k g(e_s^k) = 1.    (4.1)

Since the proposition is true for t = 0, applying 4.1 repeatedly for s = 1, 2, ... proves the Lemma.
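The Lemma is easy to check numerically (a sketch assuming the normalized update of equation 3.4 as reconstructed above): whatever the sequence of credits g(e_t^k), the q_t^k remain a probability vector:

```python
import numpy as np

rng = np.random.default_rng(2)
K, gamma = 5, 0.3
q = np.full(K, 1.0 / K)                 # q_0^k = 1/K, so the sum starts at one
for t in range(1000):
    g = np.exp(-rng.normal(size=K)**2 / 2.0)   # arbitrary credits g(e_t^k)
    denom = q @ g
    q = q + gamma * q * (g - denom) / denom    # equation 3.4
    assert abs(q.sum() - 1.0) < 1e-9           # Lemma: the sum stays at one
print(q)
```

The update factor for each q_t^k is at least 1 − γ, so the credits also remain strictly positive, which is used in the convergence proof below.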
Now we can state and prove the following convergence theorem.

Theorem 1. Define a_k = E[g(e_t^k)], k = 1, ..., K. Suppose a_m is the unique maximum of a_1, ..., a_K. If q_0^m > 0, then q_t^m → 1 and q_t^k → 0 for k ≠ m, with probability 1.

Remarks. First, note that g(e_t^k) is a random variable, since it is a function of the error e_t^k. Assuming e_t^k to be stationary, a_k = E[g(e_t^k)], i.e., the expectation of g(e_t^k), is time independent. Since g(e) is a decreasing function of |e|, a large value of a_k implies good predictive performance. In this sense, a_k can be viewed as a prediction quality index and it is natural to consider as optimal the predictor m that has maximum a_m. In the course of the proof it will become clear that any function g(|e|) could be used, as long as g(|e|) is a decreasing function of |e|. The theorem can be generalized to the case where there is more than one predictor that achieves the maximum a_m; then the total posterior probability of all such predictors will converge to 1. The proof for that case is similar to the one presented here, and is omitted for economy of space. Finally, note that the credit functions q_t^k are random variables, as they depend on y_1, y_2, ..., y_t. Hence, the q_t^k converge in a stochastic sense, in this case with probability one.

Proof. For t = 0, 1, 2, ..., define F_t to be the sigma field generated by q_0^k and {e_s^k}_{s=0}^t, with k = 1, ..., K. Define F_∞ = ∪_{t=1}^∞ F_t. Now, q_t^k is F_t-measurable, for all k, t.² This is so because q_t^k is a function of e_t^1, ..., e_t^K and of q_{t-1}^1, ..., q_{t-1}^K. But q_{t-1}^1, ..., q_{t-1}^K are, in turn, functions of e_{t-1}^1, ..., e_{t-1}^K and of q_{t-2}^1, ..., q_{t-2}^K, and so on. In short, q_t^k is a function of e_1^1, ..., e_1^K, ..., e_t^1, ..., e_t^K. Hence it is clearly F_t-measurable. Also, for k = 1, 2, ..., K, t = 0, 1, 2, ..., define π_t^k = E(q_t^k). In 3.4 let k = m and take conditional expectations with respect to F_{t-1}. For all k and t we have E(q_{t-1}^k | F_{t-1}) = q_{t-1}^k and E[g(e_t^k) | F_{t-1}] = E[g(e_t^k)] = a_k.
In other words, g(e_t^k) is independent of F_{t-1}. This is so because we assumed the noise process to be
²A sigma field F generated by random variables u_1, u_2, . . . is defined to be the set of all sets of events dependent only on u_1, u_2, . . . . A random variable v is said to be F measurable if knowledge of u_1, u_2, . . . completely determines v; in other words, either v is one of u_1, u_2, . . . or it is a function of them: v(u_1, u_2, . . .). Note that the total number of u_1, u_2, . . . may be finite, countably infinite, or even uncountably infinite. For more details see Billingsley (1986).
Vassilios Petridis and Athanasios Kehagias
364
white; hence e_t^k is independent of e_s^l, l = 1, . . . , K, s = 1, . . . , t - 1. Finally, from Lemma 1, Σ_{k=1}^K q_{t-1}^k = 1, hence

E(q_t^m | F_{t-1}) >= q_{t-1}^m.    (4.2)
From equation 4.2 it follows that {q_t^m}_{t=0}^∞ is a submartingale. Since 0 <= E(|q_t^m|) <= 1, we can use the Submartingale Convergence Theorem and conclude that, with probability 1, the sequence {q_t^m}_{t=0}^∞ converges to some random variable, call it q^m, where q^m is F_∞ measurable. We have assumed that q_0^m > 0; from this, and equation 3.4, it follows that q_t^m > 0 for all t. From this it is easy to prove that the limit q^m > 0. Hence, convergence of q_t^m does not depend on the initial values q_0^k, k = 1, 2, . . . , K, as long as q_0^m is greater than zero. However, we still do not know whether the sequences {q_t^k}_{t=0}^∞, k != m, converge. Similarly, since q_t^m -> q^m, we can take expectations and obtain E(q_t^m) -> E(q^m) = π^m; but we do not know whether E(q_t^k) converges for k != m. However, since Σ_{k=1}^K q_t^k = 1 for all t, we have E(Σ_{k != m} q_t^k) = 1 - E(q_t^m) -> 1 - π^m. Now, if in equation 3.4 we set k = m and take the limit as t -> ∞, we obtain

(4.3)

Since q^m = lim_{t->∞} q_t^m > 0, equation 4.3 implies

(4.4)

The important point is that the quantity in curly brackets has a limit. Since q^m > 0, it can be cancelled on both sides of equation 4.4; then, taking expectations and using the Dominated Convergence Theorem³, we get

³The Dominated Convergence Theorem states that, under appropriate conditions, lim_{t->∞} E(x_t) = E(lim_{t->∞} x_t). See also Billingsley (1986).
Time Series Classification
(define a_l = max_{k != m} a_k and note that a_l < a_m)

lim_{t->∞} a_m(1 - π_t^m) <= a_l lim_{t->∞} Σ_{k != m} π_t^k, i.e., a_m(1 - π^m) <= a_l(1 - π^m).    (4.5)
From equation 4.5 it follows immediately that π^m = 1; otherwise we could cancel 1 - π^m from both sides of equation 4.5 and obtain a_m <= a_l, which is a contradiction. Hence 1 = π^m = lim_{t->∞} π_t^m, i.e., 1 = lim_{t->∞} E(q_t^m) = E(lim_{t->∞} q_t^m). Since lim_{t->∞} q_t^m <= 1, we must have lim_{t->∞} q_t^m = 1 with probability 1; it follows that lim_{t->∞} q_t^k = 0 for k != m, which completes the proof.

5 Examples

5.1 Logistic Classification. A logistic time series is produced by the following recursion (the source parameter is a):

x_{t+1} = a x_t (1 - x_t),    t = 1, 2, . . .
In the first set of experiments, a test time series has been generated by running a logistic with a = 3.8 for 182 time steps and then switching a to 3.6 and running the logistic for another 182 steps. Zero-mean white noise, uniformly distributed in the interval (-Δ/2, Δ/2], has been added to the data. We have used Δ = 0.00, 0.05, . . . , 0.50. We plot the time series (at noise level Δ = 0.2) in Figure 2. The task is to detect the active value of a. We use our ICRA scheme and compare it to the Bayesian scheme. In both cases we use the same type of predictor modules. Ten predictor modules (18-5-1 sigmoid feedforward neural networks) have been trained on logistics with a = 3.0, 3.1, . . . , 3.9, respectively. Average predictor training time was 2.5 min on a Sun Sparc IPC workstation. The σ parameter is the same for both classifiers; we take it equal to the experimentally computed standard deviation of the predictor error. For all prediction modules this is approximately equal to 0.25, so we have σ_1 = . . . = σ_10 = 0.25. A probability threshold parameter h = 0.01 is also used. For the ICRA method we also use γ = 0.99. Different values of γ do not affect classification performance, as long as they are not too low. In general, small values of σ and large values of γ result in faster update of the p_t^k and q_t^k (see equation 3.4), hence in faster response of the algorithm. Finally, it should be mentioned that the choice of p_0^k, q_0^k does not affect convergence, as remarked in the previous section. This conclusion was supported by our experiments: while we tried several values for p_0^k, q_0^k, classification performance was not affected. In the experiments reported here we have always used p_0^k = 1/K, q_0^k = 1/K.
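A compact sketch of this experimental setup follows, with the exact logistic maps standing in for the trained predictor modules and a gaussian credit reweighting standing in for the classifiers. The precise update equations (2.5 and 3.4) are not reproduced in this excerpt, so the reweighting-with-floor form and the smaller σ used below are illustrative assumptions, not the paper's exact scheme:

```python
import math
import random

def logistic(a, x):
    return a * x * (1 - x)

def make_test_series(delta, seed=0):
    # 182 steps generated with a = 3.8, then 182 with a = 3.6,
    # plus zero-mean uniform noise on (-delta/2, delta/2]
    rnd = random.Random(seed)
    xs, a = [0.37], 3.8
    for t in range(363):
        if t == 181:
            a = 3.6
        xs.append(logistic(a, xs[-1]))
    return [x + rnd.uniform(-delta / 2, delta / 2) for x in xs]

def classify(series, candidates, sigma=0.05, floor=0.01):
    # Recursive gaussian reweighting of per-source credits, started from
    # the uniform prior 1/K; the probability floor keeps every credit
    # alive so the scheme can re-adapt after a source transition.
    # (Illustrative stand-in for the credit updates of Sections 2-3.)
    K = len(candidates)
    q = [1.0 / K] * K
    labels = []
    for y_prev, y in zip(series, series[1:]):
        errs = [y - logistic(a, y_prev) for a in candidates]
        w = [qk * math.exp(-e * e / (2 * sigma ** 2)) for qk, e in zip(q, errs)]
        s = sum(w)
        q = [max(v / s, floor) for v in w]
        s = sum(q)
        q = [v / s for v in q]
        labels.append(candidates[max(range(K), key=lambda k: q[k])])
    return labels

candidates = [3.0 + 0.1 * k for k in range(10)]
series = make_test_series(0.0)      # noise-free case
labels = classify(series, candidates)
```

In the noise-free case the credit of the active source dominates within a few steps of each segment, mirroring the fast transitions reported for Figure 3.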
Figure 2: Plots of logistic time series: for t = 1, 2, . . . , 182 we have a = 3.8; for t = 183, . . . , 364 we have a = 3.6. Noise level is Δ = 0.2.

In Figure 3 the evolution of the q_t^k is plotted for a typical experiment. Classification to the true logistic takes very few time steps: at t = 2, q_t^9 > q_t^k, k != 9, and at t = 8 it reaches its steady-state value; then at t = 183 we have the a transition, and by time t = 189 we have q_t^7 > q_t^k, k != 7; at t = 194, q_t^7 has reached steady state (the whole transition takes 12 time steps). Location and width of the transition points of this experiment are typical; all the classification experiments we have run gave similar results. It should be emphasized that no training is required for the decision module; its online operation only requires computation of equations 2.5 and 3.4 for all 10 predictors (k = 1, . . . , 10). Classification of each time step requires 0.08 sec on a Sun Sparc IPC workstation. Classification performance is measured by determining the number of time steps for which a is correctly identified and dividing by 364, the total number of time steps. Thus we obtain two figures of merit: one for the Bayesian and one for the ICRA method. The results, for various noise levels Δ, are summarized in Figure 4. We see that in the noise-free case both schemes perform very well, the Bayesian scheme slightly outperforming the ICRA scheme. However, the ICRA scheme is more robust to higher noise levels.

In the second set of experiments we want to evaluate classification performance when the actual a parameter is not in our search set. To this end we train 10 linear predictors on a = 3.0, 3.1, . . . , 3.9 values. Training time per predictor was slightly over 1 sec on a Sun Sparc IPC workstation. Then we generate five 364-step test logistics with an a transition at step 182. The a transitions are 3.7 - δa to 3.9 + δa, where δa takes the values
Figure 3: Logistic classification for ten sources (a = 3.0, 3.1, 3.2, . . . , 3.9), t = 1, 2, . . . , 364. The solid line corresponds to q_t^9 (a = 3.8) and the dotted line corresponds to q_t^7 (a = 3.6). For k != 7, 9, the q_t^k go to zero very rapidly and are not discernible in the figure.
Figure 4: Figures of merit for logistic classification at various noise levels. Δ denotes the noise level. Here the ICRA figure of merit (respectively, Bayesian figure of merit) denotes the fraction of correctly classified time steps (out of a total of 364) by the ICRA (respectively, Bayesian) scheme. We observe that while in the noise-free case the Bayesian scheme performs slightly better than the ICRA scheme, ICRA is more robust to noise. (In all experiments we use h = 0.01, σ = 0.25, γ = 0.99.)
Figure 5: Logistic classification for a outside the search set. The ICRA figure of merit (respectively, Bayesian figure of merit) denotes the fraction of correctly classified time steps (out of a total of 364) by the ICRA (respectively, Bayesian) scheme. This is plotted against the difference δa. We observe that when the actual a values are in the search set (δa = 0.0) the Bayesian scheme is slightly better than the ICRA scheme. However, ICRA is more robust to increased δa. (In all experiments we use h = 0.01, σ = 0.25, γ = 0.99.)
0.00, 0.01, 0.02, 0.03, 0.04. Hence δa measures the difference between the a values on which we trained our search set and the actual a value that generates the test time series. Note that for δa = 0.05 we would get a = 3.65, exactly halfway between the search set a's 3.6 and 3.7. All the other parameters of the experiments are the same as in the previous paragraph. With the exception of the first case, the true values of a are not in our search set. The results of these experiments are summarized in Figure 5. Classification at a time step is considered to be correct when the time series is classified to the value of a in the search set that is closest to the true value of a. In other words, for all five time series correct classification should be a = 3.7 for the first 182 steps and a = 3.9 for the last 182 steps. In Figure 5 we plot the classification figure of merit vs. δa. We see that the ICRA scheme performs better than the Bayesian scheme: it is more robust to parameter variations. Of course, an additional conclusion of this experiment set is that classification can be successfully performed using linear predictors. Finally, let us note that classification of each time step requires 0.04 sec on a Sun Sparc IPC workstation.

5.2 Enzyme Classification. This experiment involves classification of the β-lactamase enzymes. The data and problem are described in Papanicolaou and Medeiros (1990); here we give a short overview. β-Lactamases determine resistance to β-lactam antibiotics. Classification of β-lactamases is a problem that has received considerable attention from biomedical researchers. A classification method presented in Papanicolaou and Medeiros (1990) uses an "inhibition" experiment. The β-lactamase enzyme causes hydrolysis of a chemical called nitrocefin, and the β-lactam slows hydrolysis down by inhibiting the action of the enzyme. In the following paragraphs we use the terms enzyme (in place of β-lactamase) and inhibitor (in place of β-lactam). For every enzyme/inhibitor pair an "inhibition profile" is obtained, which (for a given inhibitor) characterizes the enzyme. This method has a high classification success, but the following problem occurs: the properties of enzymes and inhibitors depend heavily on the conditions under which they are prepared, and this results in varying inhibition profiles for different preparations of the same enzyme/inhibitor pair. However, some dynamic properties of the profile remain invariant; in Papanicolaou and Medeiros (1990) it is reported that enzyme classification depends on the slope of the inhibition profile at various times during the experiment, as well as on the final concentration of nitrocefin. This information was used by a human operator, who classified the enzyme by combining the various characteristics of an inhibition profile. We use the ICRA and Bayesian schemes to automate the enzyme classification process. The inhibition profiles are used as input time series. Eight enzymes are classified. The data set of inhibition profiles is separated into a test set and a training set.⁴ We use two data sets, consisting of inhibition profiles for two different inhibitors and all eight enzymes. In Figure 6 we plot inhibition profiles for three enzymes from the training set and the same three enzymes from the test set. In all cases the same inhibitor has been used.
It is noted that for the same enzyme, the test profile can differ significantly from the training profile, for the reasons explained in the previous paragraph. For each enzyme a sixth-order linear predictor is trained on the corresponding inhibition profile from the training set. (These profiles are 40-min-long time series; each time step represents 0.5 min of real time.) This is the offline training phase, which takes less than 1 sec per predictor on a Sun Sparc IPC workstation. Mean square prediction error is approximately 0.05 for all profiles. Next, we choose an inhibition profile from the test set and proceed to determine the enzyme it corresponds to. Both the Bayesian and ICRA schemes are used; in Figure 7 we present the q_t^k evolution for a particular enzyme inhibition profile. In this task final classification uses the values q_40^1, q_40^2, . . . , q_40^8 (p_40^1, p_40^2, . . . , p_40^8, respectively). Classification performance of the Bayesian scheme is measured by c_p, the number of correctly classified enzymes (at time t = 40 min) divided by eight, the total number of enzymes. A similar number, c_q, is computed for the ICRA scheme. For the Bayesian

⁴We want to thank G. A. Papanicolaou for kindly permitting us to use the inhibition profile data.
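The sixth-order linear predictors can be sketched as a least-squares autoregressive fit. The paper does not spell out the fitting procedure, so the ordinary-least-squares form below is an assumption, and the logistic demo series is only there to exercise the code:

```python
def fit_ar(series, order=6):
    # Least-squares fit of x_t ~ sum_j w_j * x_{t-j} via the normal
    # equations (ordinary least squares is an assumption here; the
    # paper does not spell out the fitting procedure).
    rows = [series[t - order:t][::-1] for t in range(order, len(series))]
    targets = series[order:]
    n = order
    A = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    b = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(n)]
    # Gaussian elimination with partial pivoting on A w = b
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            fac = A[r][col] / A[col][col]
            for cc in range(col, n):
                A[r][cc] -= fac * A[col][cc]
            b[r] -= fac * b[col]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (b[r] - sum(A[r][cc] * w[cc] for cc in range(r + 1, n))) / A[r][r]
    return w

def predict(w, history):
    # one-step prediction from the most recent len(w) samples
    return sum(wj * x for wj, x in zip(w, reversed(history)))

# demo: fit a sixth-order predictor to a chaotic logistic series
demo = [0.37]
for _ in range(199):
    demo.append(3.8 * demo[-1] * (1 - demo[-1]))
w = fit_ar(demo)
```

The per-predictor cost is tiny (a 6 x 6 linear solve), which is consistent with the sub-second training times reported above.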
Figure 6: Enzyme inhibition profiles for enzymes 1, 2, and 3. 1a, 2a, and 3a are training data; 1b, 2b, and 3b are test data.

scheme we find c_p = 0.875, i.e., seven out of eight enzymes were correctly classified. For the ICRA scheme we find c_q = 1.000, i.e., all eight enzymes were correctly classified. Therefore, in this experiment the ICRA scheme classifies better than the Bayesian scheme. Classification of each time step requires 0.03 sec on a Sun Sparc IPC workstation.

6 Conclusions
We have presented ICRA, an incremental credit assignment scheme for time series classification. ICRA is implemented by a recurrent, hierarchical, modular neural network that consists of a decision module and a bank of predictive modules. The decision module implements a gaussian function g(e) (where e is the prediction error), but any function g(.) can be used, as long as it is a decreasing function of |e|. The predictive modules can be sigmoid, linear, gaussian, etc., feedforward networks. In fact, because of the competitive nature of the ICRA scheme, classification depends on relative, not absolute, predictive performance, making ICRA robust to noise and prediction error. We have proven that under mild conditions, ICRA converges to the correct result, i.e., it detects the time series source that best predicts the observed data. The ICRA classifier is recursive, appropriate for online time series classification, which must be updated at every time step, taking into account past classification as well as the dynamic behavior of the time series. ICRA is modular and parallelizable, which means that offline training (of the predictor modules) as well as online operation scale linearly with the number of classes handled. No online training is necessary. Hence, to train and classify 100
Figure 7: Enzyme classification. The dotted line corresponds to the credit function q_t^k of the correct enzyme. The solid line corresponds to the overlapping plots of the q_t^k of the remaining enzymes. Classification is based on the final values q_40^k.

logistics would take 10 times as long as to train and classify 10 logistics; in principle there is no limit to the number of classes that can be handled. Online operation time is O(K) (where K is the number of classes) for serial operation and O(1) for parallel operation, i.e., all per-step classification times reported in the previous section would be reduced by approximately 1/K if ICRA were implemented in parallel. The above paragraph summarizes the basic features of ICRA classification. These also hold for the Bayesian classifier of Section 2. However, the experiments of Section 5 indicate that ICRA classification is more accurate and robust than Bayesian classification. In addition, unlike the Bayesian classifier, the ICRA classifier can be implemented using only adders and multipliers; hence a simple and fast hardware implementation is possible. This is a further advantage over the Bayesian classification scheme, which requires a more complicated implementation. In short, the advantages listed in this and the previous paragraph make ICRA an attractive recursive method for time series classification problems, where past classification results must be used for future classification, and classes are given in advance.

References

Ayestaran, H. E., and Prager, R. W. 1993. The Logical Gates Growing Network. Tech. Rep. CUED F-INFENG TR 137, Cambridge University Engineering Dept.
Baxt, W. G. 1992. Improving the accuracy of an artificial neural network using multiple differently trained networks. Neural Comp. 4(5), 772-780.
Billingsley, P. 1986. Probability and Measure. John Wiley, New York.
Farmer, J. D., and Sidorowich, J. S. 1988. Exploiting Chaos to Predict the Future and Reduce Noise. Tech. Rep. LA-UR-88-901, Los Alamos National Laboratory.
Hertz, J., Krogh, A., and Palmer, R. G. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.
Hilborn, C. G., and Lainiotis, D. G. 1969. Optimal estimation in the presence of unknown parameters. IEEE Trans. Syst. Sci. Cybern. SSC-5, 38-43.
Jacobs, R. A., et al. 1991. Adaptive mixtures of local experts. Neural Comp. 3(1), 79-87.
Jordan, M. I., and Jacobs, R. A. 1992. Hierarchies of adaptive experts. In NIPS 4, J. Moody, S. Hansen, and R. Lippman, eds. Morgan Kaufmann, San Mateo, CA.
Jordan, M. I., and Jacobs, R. A. 1994. Hierarchical mixtures of experts and the EM algorithm. Neural Comp. 6(2), 181-214.
Kadirkamanathan, V., and Niranjan, M. 1992. Application of an architecturally dynamic neural network for speech pattern classification. Proc. Inst. Acoustics 14(6), 343-350.
Lainiotis, D. G., and Plataniotis, K. N. 1994. Adaptive dynamics neural network estimation. In Proc. ICNN, Vol. 6, pp. 4736-4745.
Moody, J. 1989. Fast Learning in Multi-Resolution Hierarchies. Tech. Rep. YALEU/DCS/RR-681, Dept. of Computer Science, Yale University.
Neal, R. M. 1991. Bayesian Mixture Modelling by Monte Carlo Simulation. Tech. Rep. CRG-TR-91-2, Dept. of Computer Science, University of Toronto.
Nowlan, S. J. 1990. Maximum likelihood competitive learning. In NIPS 2, D. Touretzky, ed. Morgan Kaufmann, San Mateo, CA.
Papanicolaou, G. A., and Medeiros, A. A. 1990. Discrimination of extended-spectrum β-lactamases by a novel nitrocefin competition assay. Antimicrob. Agents Chemother. 34(11), 2184-2192.
Perrone, M. P., and Cooper, L. N. 1993. When networks disagree: Ensemble methods for hybrid neural networks. In Neural Networks for Speech and Image Processing, R. J. Mammone, ed. Chapman-Hall, London.
Rabiner, L. R., and Schafer, R. W. 1988. Digital Processing of Speech Signals. Prentice-Hall, Englewood Cliffs, NJ.
Schwarze, H., and Hertz, J. 1992. Generalization in a large committee machine. Preprint, The Niels Bohr Institute.
Shadafan, R. S., and Niranjan, M. 1994. A dynamic neural network architecture by sequential partitioning of the input space. Neural Comp. 6(6), 1202-1222.

Received June 7, 1994; accepted June 14, 1995.
Communicated by Peter König
Temporal Segmentation in a Neural Dynamic System

David Horn
Irit Opher

School of Physics and Astronomy, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel Aviv University, Tel Aviv 69978, Israel
Oscillatory attractor neural networks can perform temporal segmentation, i.e., separate the joint inputs they receive, through the formation of staggered oscillations. This property, which may be basic to many perceptual functions, is investigated here in the context of a symmetric dynamic system. The fully segmented mode is one type of limit cycle that this system can develop. It can be sustained for only a limited number n of oscillators. This limitation to a small number of segments is a basic phenomenon in such systems. Within our model we can explain it in terms of the limited range of narrow subharmonic solutions of the single nonlinear oscillator. Moreover, this point of view allows us to understand the dominance of three leading amplitudes in solutions of partial segmentation, which are obtained for high n. The latter are also abundant when we replace the common input with a graded one, allowing for different inputs to different oscillators. Switching to an input with fluctuating components, we obtain segmentation dominance for small systems and quite irregular waveforms for large systems. 1 Introduction
Segmentation is a concept that is generally invoked when one discusses the way an image has to be separated into its components. This can be a general two-dimensional scene (see, e.g., Pentland 1989) or some very specific task such as recognizing overlapping handprinted characters (Martin 1993). Segmentation is a precursor of object recognition in many sensory modalities. In addition to vision we may mention auditory signal separation as in the cocktail party effect (von der Malsburg and Schneider 1986) and odor separation in the olfactory bulb (Hopfield and Gelperin 1989). It is tempting to assume that segmentation, as well as binding, is implemented via a temporal mechanism for tagging the individual memories in the mixed input. There are at least two good reasons for this approach: it makes it easy to deal with distributed as well as overlapping

Neural Computation 8, 373-389 (1996)
@ 1996 Massachusetts Institute of Technology
374
David Horn and Irit Opher
memories, and it is a natural mechanism when temporal structure exists in the input. Moreover, a temporal coding mechanism fits well within the cell assemblies approach, allowing for large variability. Single neurons need not be restricted to one feature; rather, they can participate in different feature-representing assemblies each time a new stimulus is presented. Binding is performed naturally by such a temporal mechanism. Each cell assembly that represents a feature is composed of synchronized oscillating neurons. This phase locking serves as a binding mechanism, defining (tagging) groups of correlated neurons. The biological evidence for binding through phase locking comes from the well known results of Eckhorn et al. (1988) and Gray and Singer (1989). For more discussions of binding and segmentation in this context see, e.g., Engel et al. (1991), König and Schillen (1991), Singer (1994), and König et al. (1995).

Temporal segmentation can be implemented in oscillatory networks using two very different mechanisms. One is to have each segment oscillate with a different frequency. Eckhorn (1994) has reported biological evidence for phase coupling of different frequency oscillations, corresponding to different elements of a visual scene. It remains then an open problem to understand the origin of the different frequency assignments. The alternative, in the case where the different segments use the same frequency, is to have well-defined phase lags between them. In nonlinear oscillatory systems this can mean staggered oscillations with each segment well separated from the other ones. This is the method that we employ, because we restrict ourselves to the case of parallel retrieval of individual memorized patterns that belong to the same neural network, where it is only natural to assume that one common frequency dominates the behavior. Its possible application to scene analysis was discussed by von der Malsburg and Buhmann (1992).
The implementation of temporal segmentation through staggered oscillations was demonstrated by Wang et al. (1990) and by Horn and Usher (1991). The way it works is that each one of the activities of the different memory patterns that are turned on by the input is dominant for only a short while, a fraction of the cycle of the whole system. This way the temporal overlap between any two memory activities is close to zero, so we can regard them as being well separated. This behavior is obtainable in these models because of the nonlinearity of their oscillations. Both models have a limited segmentation power, i.e., they can lead to staggered oscillations for only a small number of common inputs. Assuming all inputs are constant and of similar magnitude, it turns out in these models that for an input of more than approximately five memories the system will collapse. One may be tempted to speculate that this could be related to the psychophysical limit on attention and short-term memory (Miller 1956). Horn et al. (1992) have observed that when noise is added to constant inputs, the network can continue its staggered oscillations for very large numbers of objects. This comes about because noise fluctuations can
Temporal Segmentation in a Neural System
375
enhance momentarily the input of one of the Hebbian cell-assemblies, enabling it to overtake the other ones. Clearly a waveform displaying segmentation is a quite particular outcome of an oscillating neural system. Can we pinpoint the nonlinear property that enables its existence? Is there some way to quantify the importance of this mode of behavior? Can we devise a neuronal system in which it is the dominant mode? To study such questions we investigate a symmetric neural system, in which all possible waveforms can be classified by symmetry and followed through numerical simulations. This allows us to find the basin of attraction of segmentation, and compare it with those of other limit cycles. Moreover, by tracing the way this dynamic system operates, we find that the phenomenon of subharmonic oscillations, which can be obtained only in nonlinear systems, is responsible for segmentation. This leads to an understanding of the limitation on the number of segments that appear in this mode. Then, introducing symmetry breaking into the neuronal inputs, we find that segmentation modes play dominant roles, because of the lack of degeneracy in their waveforms. 2 Limit Cycles of a Symmetric Dynamic System
The emergence of the segmentation mode was investigated by von der Malsburg and Buhmann (1992) in a dynamic system of two oscillators with an inhibitory interneuron. In spite of the big parameter space that even such a simple problem possesses, segmentation is a very natural outcome. This could be different in systems with a larger number of oscillators, which we set out to investigate. To cut down the number of parameters we concentrate on a system in which each oscillator couples to itself and one common inhibitory unit. The oscillators are composed of excitatory neurons with dynamic thresholds that receive external inputs (Horn and Opher 1995):
du_i/dt = -u_i + m_i - a m̄ - θ_i + I_i    (2.1)

dθ_i/dt = b m_i - c θ_i    (2.2)

dv/dt = -g v - e m̄ + f Σ_i m_i    (2.3)
u_i denote the postsynaptic currents of the excitatory neurons, whose average firing rates are
m_i = (1 + e^{-β u_i})^{-1}    (2.4)
while v and m̄ are analogous quantities for the inhibitory neuron that induces competition between all excitatory ones:
m̄ = (1 + e^{-β v})^{-1}    (2.5)
Figure 1: Limit cycles of the n = 3 system. Parameters were a = 0.5, b = 0.4, c = 0.2, g = 0.1, e = 1.1, f = 0.3, β = 9. The different m_i are plotted vs. time after the system has reached stability. The time scale is arbitrary but is chosen to be the same in all figures. Each m_i is represented by a different symbol. The limit cycles are (a) fully synchronous, I = 0.8; (b) partial synchronous waveform, I = 0.4; (c) full segmentation, I = 0.4.
g and β are fixed parameters. The θ_i are dynamic thresholds that rise when their corresponding neurons i fire. They quench the active neurons and lead to oscillatory behavior. Note that the θ_i can also be interpreted as inhibitory linear neurons that pair up with the excitatory u_i to form nonlinear oscillators. The continuous neurons play a role analogous to the Hebbian cell assemblies in the model of Horn and Usher (1990). That model was expressed in terms of rate variables, and could be derived from an underlying picture of nonoverlapping cell assemblies. Here we resort to single neuronal units for simplicity of the analysis.
All neurons are assumed to be under the influence of a common input I_i = I. In this case the system is fully symmetric: it remains invariant under the interchange of any two excitatory neurons i <-> j. We will also assume that the common input is constant in time. In general, this dynamic system flows into a set of dynamic attractors. Thus, for 3 excitatory elements, one finds the following types of attractors: (a) common fixed point or common oscillatory mode; (b) two of the elements oscillate in phase and a third out of phase; and (c) staggered oscillations of all elements. The last type fits our understanding of temporal segmentation. Examples of all three types of limit cycles are shown in Figure 1. The parameters are fixed (as specified in Fig. 1) but for the value of I, which is chosen so as to obtain all types of behavior. It should be emphasized that the choice of parameters in these examples is quite arbitrary. The phenomena exist within a wide window of parameters. For example, dominance of modes b and c is obtained for 0.3 < I < 0.65 and 0.5 < a < 0.7. The solutions displayed in this figure, and throughout the present work, were obtained using the fourth-order Runge-Kutta method for integrating a set of differential equations with time steps of dt = 0.005. Smaller time steps led to the same results. By choosing a fixed set of parameters and varying over the initial conditions one can map the basins of attraction of the different limit cycles. An illustration of such a map is presented in Figure 2 where, for simplicity, we test a two-dimensional domain of initial conditions for parameters corresponding to Figure 1b and c. The resulting structure is complicated, as expected when dealing with a nonlinear system with strong dependence on initial conditions. We can see, for example, in the upper right corner of the map that there are bright spots on the boundaries between two other areas that represent full segmentation waveforms.
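The integration procedure just described can be sketched as a fourth-order Runge-Kutta loop over equations 2.1-2.5 with dt = 0.005. The right-hand sides and the parameter values (in particular g, e, and β) are reconstructed from Section 2 and the caption of Figure 1, so treat them as an illustrative reading of the model rather than the exact system:

```python
import math

# parameters as read from the caption of Figure 1 (g, e, beta are assumptions)
a, b, c, g, e, f, beta, I = 0.5, 0.4, 0.2, 0.1, 1.1, 0.3, 9.0, 0.4

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-beta * x))

def deriv(state, n):
    # Right-hand sides of equations 2.1-2.3 for n excitatory units;
    # state = (u_1..u_n, theta_1..theta_n, v).  The exact form of the
    # dv/dt terms is an illustrative reading, not a quotation.
    u, theta, v = state[:n], state[n:2 * n], state[2 * n]
    m = [sigmoid(ui) for ui in u]
    mbar = sigmoid(v)
    du = [-u[i] + m[i] - a * mbar - theta[i] + I for i in range(n)]
    dtheta = [b * m[i] - c * theta[i] for i in range(n)]
    dv = -g * v - e * mbar + f * sum(m)
    return du + dtheta + [dv]

def rk4_step(state, n, dt=0.005):
    k1 = deriv(state, n)
    k2 = deriv([s + 0.5 * dt * k for s, k in zip(state, k1)], n)
    k3 = deriv([s + 0.5 * dt * k for s, k in zip(state, k2)], n)
    k4 = deriv([s + dt * k for s, k in zip(state, k3)], n)
    return [s + dt / 6.0 * (s1 + 2 * s2 + 2 * s3 + s4)
            for s, s1, s2, s3, s4 in zip(state, k1, k2, k3, k4)]

n = 3
state = [0.3, -0.1, 0.2] + [0.0] * n + [0.09]   # arbitrary initial conditions
for _ in range(4000):                            # 20 time units at dt = 0.005
    state = rk4_step(state, n)
```

Mapping a basin of attraction then amounts to repeating this loop over a grid of initial conditions and labeling the limit cycle each run settles into.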
These are islands of waveforms of type b. The latter dominate other parts of this plane, as can be seen in the right lower corner and the left upper corner of the map. This complicated figure still conveys the message that each of the three waveforms of type b and the two waveforms of type c have roughly the same size of basin of attraction. To estimate the sizes of the basins of attraction of all waveforms we have chosen random initial conditions over the whole seven-dimensional space of initial conditions of the n = 3 problem, and checked to which waveforms the system converges. Our interest is focused on the segmentation limit cycles. We found that the probability of converging onto the two waveforms of type c is 0.45 for the set of parameters specified in Figure 1c. To perform such calculations one needs automatic means to recognize the limit cycles. This is where the symmetry of the problem becomes very useful. Starting from some initial condition the system flows into a limit cycle during a time period that is equivalent to several periods τ of the limit cycle, as shown in Figure 3. It turns out that τ is approximately the same for all limit cycles in this problem. Therefore, we wait for 20τ until
David Horn and Irit Opher
Figure 2: Basins of attraction of different limit cycles in the n = 3 problem. Parameters are the same as in cases (b) and (c) of Figure 1. Axes are initial conditions -0.6 < u2 < 0.6 and -0.6 < u3 < 0.6; the remaining initial conditions are held fixed, with Hi(0) = 0. I, II, and III denote the three possible partial synchronous waveforms (corresponding to the three possibilities of choosing 2 neurons out of 3). IV and V denote the two possible full segmentation modes (corresponding to the two permutations of full segmentation).
we test the obtained limit cycle (for another 20τ) and match the obtained result with one of the finite set of limit cycles that we know the system can flow into.

2.1 Larger n Values. Turning to larger n values we stick to the particular parameters specified in the examples of n = 3. n was defined as the number of excitatory neurons in our model. If instead of choosing a common input Ii = I for all neurons we present the input only to a subset, then all other excitatory neurons will remain inactive. Hence, for all practical purposes, n may be regarded as the number of excitatory neurons that are influenced by the common input. The symmetric model allows us to try to pinpoint the properties of the segmentation mode. In particular, we would like to understand how prominent it is. Using the same parameters for which the n = 3 problem has a segmentation probability of 0.45, we find for n = 2, 4, 5 the values 1, 0.20, and 0.56, respectively.
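The fixed-step fourth-order Runge-Kutta integration described above (dt = 0.005) is straightforward to sketch. In the code below the right-hand side f is only a placeholder for the model's equations 2.1-2.3, which are not reproduced in this excerpt, so the decaying-exponential check at the end is purely illustrative:

```python
import math

def rk4_step(f, y, t, dt):
    """One fourth-order Runge-Kutta step for the system dy/dt = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + dt / 2, [yi + dt / 2 * ki for yi, ki in zip(y, k1)])
    k3 = f(t + dt / 2, [yi + dt / 2 * ki for yi, ki in zip(y, k2)])
    k4 = f(t + dt, [yi + dt * ki for yi, ki in zip(y, k3)])
    return [yi + dt / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]

def integrate(f, y0, t0, t1, dt=0.005):
    """Fixed-step integration with the dt = 0.005 used in the text."""
    n_steps = round((t1 - t0) / dt)
    y, t = list(y0), t0
    for _ in range(n_steps):
        y = rk4_step(f, y, t, dt)
        t += dt
    return y

# Illustrative check on dy/dt = -y, whose exact solution is exp(-t).
y = integrate(lambda t, y: [-y[0]], [1.0], 0.0, 1.0)
```

Because the step is O(dt^4) accurate globally, halving dt leaves the result essentially unchanged, which is the "smaller time steps led to the same results" check quoted above.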
Temporal Segmentation in a Neural System
Figure 3: Transient motion into a limit cycle. Results for the three mi values are shown (a) as a function of time and (b) in a phase space with axes x = (m1 - m2)/2, y = (m1 + m2 - 2m3)/2√3. The time scale shown corresponds to (a) 20,000 iterations and (b) 40,000 iterations.
This calculation necessitates the employment of the procedure explained above. For n > 3 we just check whether the solution is of the full segmentation type, starting from random initial conditions. The accuracy of the calculation is limited only by the total number of trials employed, which was 10,000 for every n. The results depend on the parameters of the system. For example, the n = 4 segmentation probability increases to 0.80 for a = 0.65 and I = 0.5. In any case it seems that segmentation plays a quite dominant role for n = 4 and even 5. The latter is, however, the last n value for which a full segmentation mode is obtained. For n > 5 we find that some forms of effective segmentation often occur. This can happen in one of two ways (or a combination of both): (1) Degenerate segmentation, i.e., formation of clusters of amplitudes that move in unison in a segmentation pattern. An example for n = 6 is shown in Figure 4, where the amplitudes pair up to form three clusters. (2) Partial segmentation, where one finds leading amplitudes that display segmentation, and nonleading ones that have completely different behavior. An example for n = 8 is shown in Figure 5. In this example we find that the low amplitudes exhibit chaotic motion, yet the three large amplitudes follow a periodic segmentation course.
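The probabilities quoted in this section come from exactly this kind of Monte Carlo protocol: draw random initial conditions, integrate, classify the attractor reached, and repeat. A toy version of the procedure, with the bistable flow dx/dt = x - x^3 standing in for the full network (its two attractors x = +1 and x = -1 have equal basins, so the estimate should come out near 0.5), might look like:

```python
import random

def attractor(x0, dt=0.01, steps=2000):
    """Integrate dx/dt = x - x**3 by forward Euler and report which of
    the two fixed-point attractors (+1 or -1) the trajectory reaches."""
    x = x0
    for _ in range(steps):
        x += dt * (x - x ** 3)
    return 1 if x > 0 else -1

random.seed(0)
trials = 10_000  # same number of trials as quoted in the text
hits = sum(attractor(random.uniform(-1.0, 1.0)) == 1 for _ in range(trials))
basin_fraction = hits / trials  # symmetric basins, so expect about 0.5
```

With 10,000 trials the statistical error of such an estimate is about 0.5/sqrt(10,000) = 0.005, which is why probabilities in the text are quoted to two decimal places.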
Figure 4: Degenerate segmentation in the n = 6 problem: formation of clusters of pairs of patterns that vary synchronously. Each mi is represented by a different symbol.
3 Segmentation and Subharmonic Oscillations
In the system of equations that we study here, the interaction between all elements is provided by the inhibitory unit (equation 2.3). The individual excitatory unit i, described by equations 2.1 and 2.2, is influenced by all other units through the amplitude m of the inhibitory unit in equation 2.3. The behavior of each mi in any given waveform can therefore also be viewed as the response of equations 2.1 and 2.2 to a driving term am(t). All the waveforms that we encounter have an overall period τ that is roughly the same as that of the free oscillator (a = 0). In a segmentation mode m(t) oscillates with a period of τ/n. This can be seen in Figure 6, where we display m1(t) and m(t) for the n = 5 segmentation mode. If we think of m as the driving term then the phenomenon observed here is that of subharmonic oscillation, which is known to exist in nonlinear oscillating systems (Mandelstam and Papalexi 1932; Hayashi 1964). The subharmonic oscillation exhibited in Figure 6 is of a very special kind:
Figure 5: A quasiperiodic solution of the n = 8 problem that displays partial segmentation. The three large amplitudes form a segmented pattern, while the low amplitudes (lines without symbols) form a chaotic background.
The width of the subharmonic amplitude m1 is of the same order as that of the driving term, τ/n. Let us refer to it as a narrow subharmonic solution. This is of course necessary for the segmentation structure to occur, since the latter is constructed from amplitudes mi that are recurrences of m1 shifted by multiples of τ/n. This is the only way n nonoverlapping amplitudes can be accommodated in one cycle. Each of the mi corresponds to a particular choice of phase out of the n-fold degeneracy of 1/n ordered narrow subharmonic solutions. A stable linear system follows the frequency of the driving term. Only nonlinear systems exhibit different periodic solutions, including the subharmonic ones that are of interest to us. The nonlinear characteristics of the system determine the possible orders of the subharmonic oscillations. In particular, a subharmonic solution of order 1/ν is known to occur when one of the terms of the nonlinear function has the power ν. In our case the nonlinearity is that of a sigmoid function, which, in principle, contains all powers. To test this idea directly, we ran the system of
Figure 6: The variation of m1 (solid line) and m (the inhibitory amplitude, dotted line) in the n = 5 fully segmented limit cycle.

equations 2.1 and 2.2 with a constant + sinusoidal driving term replacing m. In other words, we investigated solutions of the set of equations

du1/dt = -u1 - a h(t) - b H1 + I   (3.1)

dH1/dt = b m1 - c H1   (3.2)
in which h(t) is chosen to be similar to m of Figure 6 but with a tunable frequency. We were able to generate narrow subharmonic solutions of 1/2 to 1/5. From 1/6 onward the subharmonic solutions were no longer narrow, i.e., the m1 amplitude generated by such a driving term has a width that is considerably larger than τ/n. This explains why full segmentation is limited in our system to n ≤ 5. Higher n values cannot sustain the narrow subharmonic solution needed to build up segmentation. It is interesting to find the stability of the subharmonic oscillations. We tested it in two ways. First we ran the system with variations of the frequency ω of the pure sinusoidal driving term and measured the window Δω for which subharmonic oscillations were obtained. The results show large ranges for the subharmonics 1/2 and 1/3, for which the relevant values are 0.32 and 0.11, respectively. The other subharmonic
solutions were obtained for considerably smaller frequency windows. The results for 1/4, 1/5, and 1/6 are 0.03, 0.015, and 0.013, respectively. Then we tested the stability of the subharmonic solution against the mixing of a lower frequency ω - δω with the frequency ω, which is the driving term of the nth subharmonic. Surprisingly the 1/2 solution turns out to be unstable, whereas the higher solutions of 1/3, 1/4, etc. have a range of stability of the order δω/ω ≈ 0.2. These results explain why, in the partial segmentation solutions observed for high n values, three segments of leading amplitudes are dominant. The small amplitudes add a varying background to the driving term created by the large amplitudes, the ones responsible for the subharmonic solution. The fact that the 1/3 subharmonic solution has a large frequency window and is stable against admixture of several frequencies is the reason for the dominance of structures like the one displayed in Figure 5.

4 Breaking the Input Symmetry
The system that we have considered so far was totally symmetric. The limit cycles can be viewed as spontaneously breaking this symmetry. Each waveform is invariant under some symmetry group that is a subgroup of the direct product of the permutation symmetry (under which the dynamic system, equations 2.1-2.3, is invariant) and time translation symmetry. This was investigated by us recently in some detail (Horn and Opher 1995) following a known methodology of dynamic systems. The segmentation mode has the important characteristic that no residual degeneracy is left in it. This is of particular importance when we run our system with different inputs to different units. In fact, it is the reason that segmentation modes become dominant under such circumstances. We introduce symmetry breaking by modifying the inputs to allow small deviations that, at this stage, are kept constant in time: Ii = I + εi. For small perturbations we find that the waveforms of the different limit cycles are only slightly modified and thus we can continue to use the specifications of the symmetric problem. Yet the basins of attraction change considerably. In particular, the basin of any mode that has previously displayed a residual degeneracy gets strongly reduced when the degeneracy in the input is removed. Let us return to the n = 3 case of Figures 1 and 2 to show what happens when the input symmetry is broken. Starting with I = 0.4 and letting ε1 grow, we find that the probability of flowing into a segmentation waveform increases rapidly, as displayed in Figure 7. It saturates at 0.8 because there are two attractors of type b whose basins of attraction shrink to zero. The remaining attractor of type b, whose basin stabilizes around a size of 0.2, is the one that has residual permutation symmetry between amplitudes 2 and 3, a symmetry that is still maintained by our symmetry breaking term.
Figure 7: Change of the size of the basin of attraction of the segmentation waveforms in the n = 3 problem for Ii = 0.4 + δi,1 ε1.

For higher values of n and various symmetry breaking patterns we have obtained effective segmentation modes of different types. These include partial segmentation modes with three major amplitudes, like the example shown in Figure 8. Other possibilities are variations of degenerate segmentation modes, in which the degeneracy of the amplitudes is lifted but the phase degeneracy is kept intact. Our general conclusion is that effective segmentation modes dominate when the input symmetry is broken.
5 Fluctuating Inputs
Horn et al. (1992) noticed that in the system that they studied, fluctuating inputs were able to support segmentation of a large number of oscillators. To test the generality of this phenomenon, we have subjected our system to such conditions. Our question is what happens when we replace the constant symmetry breaking terms of the previous section with fluctuating terms whose average is the same for all excitatory neurons.
Figure 8: Waveforms of an n = 8 system with graded inputs. The Ii were chosen as 0.6, 0.575, 0.55, 0.525, 0.5, 0.475, 0.45, 0.425.
A typical example of fluctuating inputs that we used is Ii = 0.4 + 0.1ηi, where ηi is a random variable between 0 and 1, which varies with a random temporal distribution whose average is of the order of τ/20. In spite of these variations, we obtain for n = 3 a very regular segmentation structure, hardly distinguishable from that shown in Figure 1c. Given the fluctuating input one cannot define limit cycles for such a system. Yet the stability of the full segmentation mode is striking. In fact we may say that this random perturbation changed the probability of flowing into segmentation modes from 0.45 to about 1. This is, however, true only for n = 3. Moving to n = 4 we find, for the same type of variable input, an approximate segmentation structure with about the same probability as for the constant common input. The resulting waveforms are no longer as regular as in the noiseless case. For the purpose of analysis we found it useful to move from inspection of waveforms to analysis of correlations, which we define through
cij = (1/T) ∫ mi(t) mj(t) dt,

where the integral is carried out over a time period T > τ after some transient time has elapsed. Regularity may emerge only if we allow for an integration time T that is much larger than the characteristic period. Integrating over long time scales we find in the case of n = 4 that the fact that the system of differential equations is symmetric on average reflects itself in an effective symmetry of the correlation matrix elements. Increasing the random component to 0.2 and 0.3, this symmetry gets spoiled and the regular structures disappear. For higher n values this symmetry breakdown occurs for lower values of the random component. The general emerging pattern can be described as irregular segmentation. An example is shown in Figure 9. Although we find the phenomenon of temporary dominance by a single amplitude, or a pair of amplitudes, all regularity of relative phases is lost. Only the n = 2 and n = 3 systems display the dynamic stability of regular segmentation, making it the dominant mode in the presence of fluctuating inputs.

Figure 9: Example of irregular segmentation behavior for the n = 5 problem with inputs of the type Ii = 0.4 + 0.3ηi.
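A numerical version of this correlation analysis might look as follows. The normalization used here, a plain time average of mi·mj over the post-transient window, is an assumption standing in for the paper's exact definition; the synthetic amplitudes are only there to exercise the function:

```python
import numpy as np

def correlations(m, transient, dt):
    """Correlation matrix c_ij: time average of m_i(t) * m_j(t),
    computed over the window after the transient has elapsed.
    m has shape (units, timesteps)."""
    window = m[:, int(round(transient / dt)):]
    return window @ window.T / window.shape[1]

# Three synthetic amplitudes: units 0 and 2 in phase, unit 1 in quadrature.
dt = 0.01
t = np.arange(0.0, 200.0, dt)
m = np.vstack([np.sin(t), np.cos(t), np.sin(t)])
c = correlations(m, transient=20.0, dt=dt)
```

In-phase pairs give a large positive entry (about 0.5 for unit sinusoids), quadrature pairs average to roughly zero, and the matrix is symmetric by construction, which is the "effective symmetry of the correlation matrix elements" referred to above.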
6 Discussion
The waveforms corresponding to a segmentation solution are sometimes referred to in the literature as rotating waves or ponies on a merry-go-round. An example of the latter can be found in Ermentrout (1992), who discusses a neural system composed of several excitatory neurons and one inhibitory one that, for some choice of parameters, exhibits such solutions. In his system, as well as in many other models of nonlinear oscillators that exhibit this phenomenon, the excitatory elements interact with one another in a fashion that corresponds to nearest neighbors on a ring. The rotating wave solution then fits the topology of the interaction matrix. The symmetric model that we have discussed has the topology of a star (with inhibition at the center), leading to many possible limit cycles. The segmentation mode is an extreme case of spontaneous breaking of the original symmetry. Yet it plays a special role. Its importance lies in the fact that the degeneracies between the different oscillators are lifted; hence it is stable against symmetry breaking of the input. Therefore we find that full segmentation in small systems, or partial segmentation in big systems, is favored as a limit cycle under these conditions. We view segmentation as an important oscillation mode because of the special cognitive role it may play in the analysis of mixed signals. Therefore it is important to understand the limit on the number of segments. In the system that we have discussed the limit is related to the inability to invoke narrow subharmonic solutions for n > 5. By studying the stability of subharmonic solutions we find the reason why usually three large amplitudes are involved in partial segmentation of high n systems. In the Introduction we raised the question of when segmentation is the favored waveform. We find that effective segmentation modes are preferred when we use graded inputs in our otherwise symmetric system.
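The subharmonic bookkeeping behind this argument, namely whether the response is locked at 1/n of the driving frequency, is easy to check spectrally. The sketch below is a toy detector run on a synthetic 1/3-subharmonic signal; it illustrates the idea and is not an analysis of the model itself:

```python
import numpy as np

def subharmonic_order(signal, dt, drive_freq, max_order=8):
    """Return the n for which the spectral power at drive_freq / n
    is largest, i.e., the order of the subharmonic response."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), dt)
    best, best_power = 1, -1.0
    for n in range(1, max_order + 1):
        idx = int(np.argmin(np.abs(freqs - drive_freq / n)))
        if spectrum[idx] > best_power:
            best, best_power = n, spectrum[idx]
    return best

# Synthetic response locked at one third of the driving frequency.
dt, f_drive = 0.01, 1.0
t = np.arange(0.0, 60.0, dt)
response = np.sin(2 * np.pi * (f_drive / 3) * t)
```

Applied to the model's m1(t), such a test would report the order n of the segmentation mode, since the response carries its power at the subharmonic frequency rather than at the drive.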
One may wonder what happens when the symmetry is broken in the interaction matrix, i.e., if we introduce in equation 2.1 nondiagonal interactions of the type

dui/dt = -ui + Σj Wij mj - am - bHi + Ii   (6.1)

For this type of symmetry breaking segmentation is not guaranteed. Nondiagonal excitatory interactions can lead to phase-locking. This is, after all, the procedure of forming cell assemblies. For particular cases, e.g., Wij = 0 if Wji > 0, ring-type interactions are formed that can lead to segmentation with a specified order, of the kind discussed at the beginning of this section. A combination of these two types of effects leads to degenerate segmentation. Yet it cannot be generally stated that segmentation is the dominant effect of symmetry breaking. Finally we have looked at the stability of segmentation in the case of fluctuating inputs. We find that only the cases n ≤ 3 exhibit stability and dominance of pure segmentation. Noisy inputs in high n systems lead to
irregular segmentation patterns, which may be sufficient for some types of signal separation processes, but no longer possess any regularity in the structure of the resulting waveforms. Dealing with nonlinear dynamic systems it is difficult to derive general results. It is therefore important to elucidate the dynamic behavior in a system that can be studied in detail and where the interesting features can be discerned by numerical analysis. This was our goal when we set out to study temporal segmentation in a symmetric neural system. Clearly many of our results are specific to our model. Nonetheless, we believe that the connection that we have made with subharmonic oscillations, a well known property of nonlinear oscillators, can serve as a qualitative understanding of the behavior of temporal segmentation. In particular, the effective limit of small numbers, allowing only a few segments to appear in temporal segmentation, seems to hold true in different oscillatory systems in which this phenomenon was investigated. The subharmonic explanation that can be provided in our model throws new light on the reason for this general observation.
References

Eckhorn, R. 1994. Oscillatory and non-oscillatory synchronization in the visual cortex and their possible roles in associations of visual features. Prog. Brain Res. 102, 405-426.
Eckhorn, R., et al. 1988. Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybern. 60, 121-130.
Engel, A. K., Konig, P., and Singer, W. 1991. Direct physiological evidence for scene segmentation by temporal coding. Proc. Natl. Acad. Sci. U.S.A. 88, 9136-9140.
Ermentrout, B. 1992. Complex dynamics in winner-take-all neural nets with slow inhibition. Neural Networks 5, 415-431.
Gray, C. M., and Singer, W. 1989. Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proc. Natl. Acad. Sci. U.S.A. 86, 1698-1702.
Hayashi, C. 1964. Nonlinear Oscillations in Physical Systems. Princeton University Press, Princeton, NJ.
Hopfield, J. F., and Gelperin, A. 1989. Differential conditioning to a compound stimulus and its components in the terrestrial mollusc Limax maximus. Behav. Neurosci. 103, 329-333.
Horn, D., and Opher, I. 1995. Dynamical symmetries and temporal segmentation. J. Nonlinear Sci. 5, 359-372.
Horn, D., and Usher, M. 1990. Excitatory-inhibitory networks with dynamical thresholds. Int. J. Neural Syst. 1, 249-257.
Horn, D., and Usher, M. 1991. Parallel activation of memories in an oscillatory neural network. Neural Comp. 3, 31-43.
Horn, D., Sagi, D., and Usher, M. 1992. Segmentation, binding and illusory conjunctions. Neural Comp. 3, 509-524.
Konig, P., and Schillen, T. B. 1991. Stimulus-dependent assembly formation of oscillatory responses: Synchronization and desynchronization. Neural Comp. 3, 155-178.
Konig, P., Engel, A. K., Roelfsema, P. R., and Singer, W. 1995. How precise is neural synchronization? Neural Comp. 7, 469-485.
Mandelstam, L., and Papalexi, N. 1932. Über Resonanzerscheinungen bei Frequenzteilung. Z. Phys. 73, 223-248.
Martin, G. L. 1993. Centered object integrated segmentation and recognition of overlapping handprinted characters. Neural Comp. 5, 419-429.
Miller, G. A. 1956. The magical number seven, plus or minus two: Some limits on our capacity of processing information. Psychol. Rev. 63, 81-97.
Pentland, A. 1989. Part segmentation for object recognition. Neural Comp. 1, 82-91.
Singer, W. 1994. Coherence of cortical functions. Int. Rev. Neurobiol. 37, 153-183.
von der Malsburg, C., and Buhmann, J. 1992. Sensory segmentation with coupled neural oscillators. Biol. Cybern. 67, 233-242.
von der Malsburg, C., and Schneider, W. 1986. A neural cocktail party processor. Biol. Cybern. 54, 29-40.
Wang, D., Buhmann, J., and von der Malsburg, C. 1990. Pattern segmentation in associative memory. Neural Comp. 2, 94-106.
Received December 5, 1994; accepted July 10, 1995.
Communicated by Erkki Oja
Circular Nodes in Neural Networks

Michael J. Kirby
Rick Miranda
Department of Mathematics, Colorado State University, Fort Collins, CO 80523 USA
In the usual construction of a neural network, the individual nodes store and transmit real numbers that lie in an interval on the real line; the values are often envisioned as amplitudes. In this article we present a design for a circular node, which is capable of storing and transmitting angular information. We develop the forward and backward propagation formulas for a network containing circular nodes. We show how the use of circular nodes may facilitate the characterization and parameterization of periodic phenomena in general. We describe applications to constructing circular self-maps, periodic compression, and one-dimensional manifold decomposition. We show that a circular node may be used to construct a homeomorphism between a trefoil knot in R^3 and a unit circle. We give an application with a network that encodes the dynamic system on the limit cycle of the Kuramoto-Sivashinsky equation. This is achieved by incorporating a circular node in the bottleneck layer of a three-hidden-layer bottleneck network architecture. Exploiting circular nodes systematically offers a neural network alternative to Fourier series decomposition in approximating periodic or almost periodic functions.

1 Introduction
A sigmoidal node of a standard feedforward neural network stores information as a real number in a bounded interval. Thinking geometrically, such a node is capable of encoding a point in the one-dimensional manifold I that is the open interval. Up to homeomorphism, there are only two one-dimensional manifolds: the open interval I and the circle S^1. The open interval is capable of and appropriate for encoding amplitude information; the circle is capable of and appropriate for encoding angular information. In this article we propose a "circular" node that is capable of encoding angular information and give one implementation of it that we have found expedient. We present both the forward propagation formulas and the backpropagation algorithm for a network containing circular nodes. We describe several applications in pattern analysis and the analysis of high-dimensional dynamic systems at the end of this article, culminating
Neural Computation 8, 390-402 (1996)
© 1996 Massachusetts Institute of Technology
in a neural network parameterization and compression of the limit cycle of the Kuramoto-Sivashinsky partial differential equation. This is an extension of the neural network description of the limit cycle for the Van der Pol oscillator, described in Kirby and Miranda (1994a). Another approach to the encoding of angular information in neural networks has been presented in Zemel et al. (1995). The reader may also be interested in the approach of Gislén et al. (1992) that uses rotor neurons to process data lying on an n-sphere.

2 Implementation of Circular Nodes
In our implementation, a circular node is actually a pair of coupled nodes, whose values are constrained to lie on the unit circle. To be specific, we suppose that the neural network has L layers, numbered from 0 to L - 1. In each layer there are nodes, which can be either of two types: circular or sigmoidal. The jth node in layer i will be denoted by N_j^(i). The number of nodes in layer i is N^(i), and for fixed i the nodes N_j^(i) are indexed from j = 0 to j = N^(i) - 1. At each node N_j^(i) the state value of the network is denoted by S_j^(i). The circular nodes occur in coupled pairs, i.e., if node N_j^(i) (node j in layer i) is part of a circular node, then there will be a coupled node N_{T(j)}^(i) [node T(j), also in layer i]. Thus there are an even number of such nodes, in coupled pairs; we assume that T[T(j)] = j. By construction the state values for coupled circular nodes will always satisfy the circular constraint
(S_j^(i))^2 + (S_{T(j)}^(i))^2 = 1   (2.1)

We think of each pair of such coupled nodes as a single "abstract circular" node, whose joint state value represents true angular information. Note that the use of two unconstrained sigmoidal nodes is not equivalent, and the resulting parameterization will fail to be circular. This implementation of a circular node fits naturally into other types of networks also (e.g., ones with linear and sigmoidal units, or ones with feedback); we examine the circular node in this simplified yet fundamental situation to isolate its unique features.
3 The Forward Propagation of the Network
In the architecture described above, each state value S_j^(i) in the network is determined from the state values S_k^(i-1) occurring in the previous layer. The inputs are the state values S_j^(0) at the nodes of the input layer 0. Each node N_j^(i) has associated to it weight constants w_jk^(i) and bias constants b_j^(i). For each node N_j^(i) with i ≥ 1 (that is, not at the input layer), we define a prestate value P_j^(i) by the formula

P_j^(i) = b_j^(i) + Σ_{k=0}^{N^(i-1) - 1} w_jk^(i-1) S_k^(i-1)   (3.1)

This first step is the same for all nodes, whether they are circular or sigmoidal. Note that the bias constants b_j^(i) are defined for all i = 1, ..., L - 1, while the weight constants w_jk^(i) are defined for all i = 0, ..., L - 2. The final state values are then determined by the prestate values using different formulas depending on the type of node:

S_j^(i) = P_j^(i) / [(P_j^(i))^2 + (P_{T(j)}^(i))^2]^{1/2}   if N_j^(i) is circular
S_j^(i) = σ(P_j^(i))   if N_j^(i) is sigmoidal   (3.2)

where σ(x) = 1/(1 + e^{-x}). Note that in the circular case, the state values automatically satisfy the circular constraint 2.1. It is useful to define the radial value: let

R_j^(i) = [(P_j^(i))^2 + (P_{T(j)}^(i))^2]^{1/2}

if N_j^(i) is circular. Note that R_j^(i) = R_{T(j)}^(i), and the state formula can then be rewritten as

S_j^(i) = P_j^(i) / R_j^(i)
in the circular case. This implementation of circular nodes requires only a relatively minor alteration of existing network architectures.

4 Error and Gradient Computations via Back-Propagation
Given a set of input state values S_j^(0) (which then determine the full set of state values using the forward propagation described above), one has a desired set of goal values G_j, as j runs from 0 to N^(L-1) - 1, i.e., over the output nodes. The total squared error E in the network for this state S (relative to this goal G) is the total squared error between the actual state values S_j^(L-1) at the output nodes and the goal values G_j there:

E = Σ_j (S_j^(L-1) - G_j)^2
If more than one state is used to form an average error, the gradient of the error simply sums over all these states; hence, to avoid using an index to indicate the state, we will develop the gradient computations for a single state only; the reader may then take the appropriate sum for an average error over many states. We now develop the formulas for the backpropagation algorithm, which computes the gradient of the error with respect to the weights and biases. Hence we compute the values
∂E/∂w_jk^(i) and ∂E/∂b_j^(i)

for all i, j, and k. Standard application of the chain rule for derivatives leads to the formulas 4.2 and 4.3, which express these gradients in terms of the derivatives ∂E/∂S_j^(i). It remains then to compute the derivatives ∂E/∂S_j^(i), which is done recursively; at the start of the recursion, we have

∂E/∂S_j^(L-1) = 2(S_j^(L-1) - G_j)   (4.4)
For other "lower" layer values, ∂E/∂S_k^(i) is obtained recursively from the derivatives ∂E/∂S_j^(i+1) in the next layer (formula 4.5). With these modifications the backpropagation algorithm proceeds as usual.
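The forward machinery of Section 3 is small enough to sketch directly. The following is a minimal illustration, not the authors' implementation: the affine prestate step, the sigmoid for ordinary nodes, and the radial normalization of equation 3.2 for a coupled circular pair:

```python
import math

def sigmoid(p):
    """State of an ordinary sigmoidal node."""
    return 1.0 / (1.0 + math.exp(-p))

def prestate(weights, states, bias):
    """Affine step P_j = b_j + sum_k w_jk * S_k feeding every node."""
    return bias + sum(w * s for w, s in zip(weights, states))

def circular_pair(p_j, p_tj):
    """States of a coupled circular pair: both prestates are divided by
    the radial value R, which places the pair on the unit circle."""
    r = math.hypot(p_j, p_tj)
    return p_j / r, p_tj / r

s1, s2 = circular_pair(3.0, 4.0)  # on the unit circle by construction
```

The pair (0.6, 0.8) returned for prestates (3, 4) satisfies the circular constraint 2.1 exactly, whatever the magnitude of the incoming prestates; only the angle survives the normalization, which is precisely the point of the construction.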
5 Applications

In this section we will briefly describe several applications of circular nodes in the architecture of a neural network. We begin with certain prototypical constructions, which the reader may easily modify for more elaborate applications.

5.1 Prototypical Uses for Circular Nodes.
Parameterization of Periodic Phenomena: A locus Γ in R^N is said to be periodic if it is the image of a circle. A parameterization of this periodic locus is a function from the circle S^1 to the locus Γ. A neural network designed to produce such a function would have a single circular node as the input layer, one or more hidden layers, and an output layer consisting of N sigmoidal nodes.

Circular Self-Maps: Mappings from the circle to itself are approximated by neural networks having a single circular node as the input layer, one or more hidden layers, and a single circular node as the output layer. A linear combination of networks that reproduce multiple-angle formulas would explicitly approximate any term of a finite Fourier polynomial, and would then be a direct generalization of nonharmonic decomposition; this can be compared with Lapedes and Farber (1987).

Periodic Compression: The converse to parameterization of a periodic locus is the idea of periodic compression, that is, taking a periodic locus Γ and mapping it to a circle (see, for example, Kirby and Miranda 1994b). Indeed, the locus Γ need not be a periodic locus; in this case a mapping from Γ to S^1 would simply be a "circular feature extraction"; one might view this as a lossy compression of (almost) periodic data.

Circular Remodeling of Boundary Value Problems: Suppose that a boundary value problem is defined on an irregular open locus in R^2 with an irregular boundary Γ that is homeomorphic to the circle. If the interior of Γ is star-shaped, the existence of a compression mapping G: Γ → S^1 may be extended to a mapping from the interior of Γ to the interior of the unit disc, and therefore used to transport the original differential equation to the interior of the unit disc, where solutions may be more easily computed via standard numerical techniques.

One-Dimensional Manifold Decomposition: The natural generalization of single feature extraction, whether sigmoidal or circular, is multiple feature extraction, where several features of a data set are to be captured simultaneously. If some of the features are amplitudes of various outputs, and others are periodic, our viewpoint is that this is mathematically best expressed as a mapping from the
pattern set Γ to I^n × (S^1)^k; the amplitude features are encoded in the n interval coordinates of the I^n factor, and the angular features are encoded in the k circular coordinates of the (S^1)^k factor. Such a mapping can be considered as a one-dimensional manifold decomposition of Γ.

5.2 Applications to Pattern Analysis. Let Γ ⊂ R^N be a data set that one wants to optimally compress. Linear methods (principal component analysis, or the Karhunen-Loève decomposition, see Devijver and Kittler 1982) discover an optimal ordered coordinate system in R^N such that Γ lies in the subspace defined by the first few coordinates; this works very well and is efficient if Γ fills up an open subset of a linear subspace. However, if Γ is a nonlinear subset of R^N, then one must have nonlinear coordinates to capture Γ completely and efficiently. Mathematically speaking, we are seeking a nonlinear function G: Γ → V, where V is an open subset of R^m, and the mapping function G is continuous and invertible (albeit nonlinear). Thus there will be an inverse mapping H: V → Γ, and a nonlinear representation of Γ may be thought of as giving inverse mappings G and H as above. The compressed manifold V may be more efficiently represented as lying in R^m × (S^1)^k; this affords the opportunity for efficient angular feature extraction.
5.2.1 Bottleneck Architectures. Circular nodes fit naturally into the bottleneck architecture (see, e.g., Rumelhart and McClelland 1986; Oja 1991; Kramer 1991; Demers and Cottrell 1992; Krischer et al. 1993; Doya and Selverston 1994), which permits a nonlinear principal component analysis. The use of circular nodes permits this type of architecture to discover both amplitude and angular features. The model for this type of network has three (or more) hidden layers consisting of nodes that may be of any of the types described here, i.e., linear, circular, or sigmoidal. This network is trained on the identity mapping using backpropagation; one of the interior layers acts as the bottleneck layer and provides the low-dimensional representation of the data. If one trains this network to reproduce the identity function on the data set Γ, the mapping G obtained from the input layer (of high dimension N) to the bottleneck layer (of low dimension m) will be an invertible and nonlinear modeling of the data set. Its inverse H is given by the mapping from the bottleneck layer to the output layer.

5.2.2 The Failure of the Sigmoidal. For truly periodic phenomena, one strictly needs a circle as a target for the compression map G; an interval (which is the natural output of a sigmoidal node) is not sufficient. This is because there is no invertible mapping from the circle to any interval I. Hence if Γ is periodic with a bijective parameterization mapping H : S^1 →
Michael J. Kirby and Rick Miranda
Γ, and one tries to construct a bijective compression map G : Γ → I, the composition G ∘ H : S^1 → I would be a bijective mapping and therefore could not be continuous. The corresponding network will not be able to reliably generalize. An example of this can be seen in Kramer (1991). A pattern set consisting of 100 points on the unit circle and the bottleneck architecture were used to construct a mapping from a circle to an interval. Kramer showed that this architecture trained quite accurately on the 100 data points on the circle. However, for the reasons mentioned above, it cannot perform well for the general point on the circle. To demonstrate this we first trained the 2-4-1-4-2 bottleneck network (using all sigmoidal nodes) as above to high precision (average error = 0.001) on 20 points on the circle, as shown in the upper graph of Figure 1. The removal of the discontinuity point on the circle leaves a pattern set that is topologically an interval, and therefore one can train such a network to arbitrarily high accuracy (using more hidden nodes if necessary) at the expense of missing the point of discontinuity. One clearly sees a discontinuity in the mapping functions when applied to the general point on the circle; this is illustrated in the lower graph of Figure 1. We remark that these results are a property of the topology of the problem and are not a consequence of training strategies or architecture parameters; the discontinuity in the problem is the underlying source of the generalization failure. The introduction of a single circular node at the bottleneck layer gives a continuous mapping function and removes the discontinuity, allowing the network to fully generalize to the entire circle. In general, when one uses a bottleneck layer consisting of a single circular node, one expects to find the best closed curve approximation to the data, i.e., the reconstructed data are necessarily a closed curve that is an image of the circle on the bottleneck.
For an example of the best closed curve approximation to the Lorenz attractor, see Kirby and Miranda (1994b).
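The obstruction is easy to exhibit numerically. The angle θ = atan2(y, x) ∈ (−π, π] is the natural interval coordinate on the circle, and any such coordinate must jump by nearly 2π somewhere along the loop; the stdlib-only sketch below (ours, for illustration) locates that jump:

```python
import math

# Walk once around the unit circle and record the interval coordinate
# theta = atan2(y, x) in (-pi, pi]; this is the best a single
# sigmoidal (interval-valued) bottleneck node could hope to output.
n = 100
thetas = []
for k in range(n + 1):
    t = 2 * math.pi * k / n            # parameter going once around S^1
    x, y = math.cos(t), math.sin(t)
    thetas.append(math.atan2(y, x))

# Largest gap between consecutive samples: close to 2*pi, i.e., a
# discontinuity, even though the curve itself is closed and smooth.
max_jump = max(abs(a - b) for a, b in zip(thetas[1:], thetas[:-1]))
print(max_jump)
```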
5.2.3 The Trefoil Knot. A beautiful example of the technique, which also illustrates the capability of unraveling topological complexity, is the case of pattern data lying on a trefoil knot K in R^3. This knot is also referred to as a (2,3)-torus knot; it is obtained by winding around a torus twice in one dimension while going around three times in the other. The parametric equations are given below:

x(t) = cos(2t) [2 + cos(3t)]
y(t) = sin(2t) [2 + cos(3t)]
z(t) = sin(3t)
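The parametric equations translate directly into code. The sketch below (our illustration) generates trefoil training points and checks that the curve closes up, i.e., that K really is the image of a circle:

```python
import math

def trefoil(t):
    """Point on the (2,3)-torus (trefoil) knot for parameter t."""
    r = 2 + math.cos(3 * t)            # distance from the torus axis
    return (r * math.cos(2 * t), r * math.sin(2 * t), math.sin(3 * t))

# 1000 training points, as in the experiment described in the text.
points = [trefoil(2 * math.pi * k / 1000) for k in range(1000)]

# The knot is the image of a circle: parameters 0 and 2*pi give the same point.
p0, p1 = trefoil(0.0), trefoil(2 * math.pi)
print(max(abs(a - b) for a, b in zip(p0, p1)))   # ~0 (closed curve)
```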
Figure 1: The output of a single sigmoidal bottleneck node as a function of angle for 20 trained points is represented by the symbol 0. The attempted generalization of this network trained on 20 points applied to 100 points on the circle is represented by the symbol +.
The knot itself is intrinsically homeomorphic to a circle; the embedding into R^3 is what makes it topologically interesting. A parameterization mapping H : S^1 → K gives such an embedding. The compression mapping G : K → S^1 may be considered as the "unknotting" of the trefoil knot. The architecture used in our experiment was a 3-15-1-15-3 bottleneck network, all of whose nodes were sigmoidal except for the single circular node in the middle (bottleneck) layer. We trained this network to approximate the identity mapping for 1000 points on the trefoil knot K. In Figure 2 we show the trefoil knot and the result of applying the parameterization mapping H to 400 points on the unit circle, to obtain points in R^3 that approximate the trefoil knot.
5.3 Uncovering a High-Dimensional Limit Cycle. The Kuramoto-Sivashinsky (K-S) equation
Figure 2: The trefoil knot in R^3 is represented by the solid line. The output of the network modeling the trefoil knot is shown as dots.
has become a benchmark for many theories on dissipative dynamic systems and global attractors. It exhibits low-dimensional complex dynamics including chaos, and it has been shown to possess an inertial manifold, i.e., there exists a smooth (C^1) low-dimensional description of its dynamics. Thus, in the spirit of the Lorenz equations, one expects that a small system of ordinary differential equations will model the PDE, and in fact low-dimensional approximations of the inertial manifold have been constructed for Neumann boundary conditions in Jolly et al. (1990) and for periodic boundary conditions in Armbruster et al. (1989). The geometry of the solutions has also been studied, and Karhunen-Loève-based low-dimensional approximations presented in Kirby and Armbruster (1992). In this section we investigate how a bottleneck neural network with a circular node can be used to study the dynamics of the K-S equation. To obtain data for the K-S equation (with periodic boundary conditions) we perform a numerical integration by means of a Galerkin approximation. This approach generates a system of ordinary differential equations by decomposing the velocity field via the expansion

u(x, t) = Σ_{n=−N}^{N} a_n(t) e^{inx}   (5.2)

Substituting 5.2 into the K-S equation 5.1 and exploiting the orthogonality of the complex exponentials, one is led to a system of ordinary differential equations in the Fourier coefficients a_n(t). Making use of the
Figure 3: The localized oscillatory pattern of the K-S equation for α = 84.25.
reality condition a_{−n} = ā_n and the fact that the a_0 term decouples from the system gives

ȧ_l = l^2 (α − 4 l^2) a_l + (α/2) Σ_{n=1}^{l−1} (l − n) n a_{l−n} a_n − α Σ_{n=1}^{N−l} (l + n) n a_{l+n} ā_n   (5.3)

where 2 ≤ l ≤ N − 1, while for l = 1 and l = N,

ȧ_1 = (α − 4) a_1 − α Σ_{n=1}^{N−1} (1 + n) n a_{1+n} ā_n
ȧ_N = N^2 (α − 4 N^2) a_N + (α/2) Σ_{n=1}^{N−1} (N − n) n a_{N−n} a_n

The system of equations for the Fourier coefficients is then integrated numerically. Thus the output of the simulation is the N complex Fourier coefficients [a_1(t), ..., a_N(t)], and this can be returned to the original coordinate system using 5.2. It is well established that the K-S equation undergoes a Hopf bifurcation at α = 83.75, resulting in a localized oscillatory pattern. We simulate equation 5.1 with α = 84.25 and N = 10, and the results are shown in Figure 3. Observe that the oscillations appear to be strictly localized in space, traveling up the tip of the higher hump. This is considered to be a forerunner of the spatiotemporal complexity encountered in spatially concentrated zones of turbulence. The origin of spatiotemporal complexity in moderate turbulence is one of the big open questions in turbulence research. Using local analysis techniques on the system 5.3 we can show that the solution is temporally periodic near the bifurcation point. Hence,
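The mode equations 5.3 can be sketched as a right-hand-side routine for numerical integration. The coefficients below follow equation 5.3 as reconstructed from partially legible text, so treat the sketch as indicative rather than as the authors' exact code:

```python
# Sketch of the Galerkin ODE right-hand side for the K-S Fourier modes,
# following the form of equation 5.3. Coefficients are reconstructed from
# partially legible text; treat them as indicative.

def ks_rhs(a, alpha):
    """a: list of complex modes [a_1, ..., a_N]; returns [da_1/dt, ..., da_N/dt]."""
    N = len(a)

    def mode(l):                       # a_l with 1-based index
        return a[l - 1]

    da = []
    for l in range(1, N + 1):
        # linear (instability/dissipation) term
        d = l * l * (alpha - 4 * l ** 2) * mode(l)
        # quadratic convolution of lower modes
        d += (alpha / 2) * sum((l - n) * n * mode(l - n) * mode(n)
                               for n in range(1, l))
        # coupling to higher modes, using the reality condition a_{-n} = conj(a_n)
        d -= alpha * sum((l + n) * n * mode(l + n) * mode(n).conjugate()
                         for n in range(1, N - l + 1))
        da.append(d)
    return da
```

The special cases quoted in the text for l = 1 and l = N follow from the same formula, since the corresponding sums are then empty.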
Figure 4: The output of the neural network corresponding to the first four complex Fourier coefficients.

while it lies in a 20-dimensional real vector space, it is a limit cycle Γ homeomorphic to S^1. For this reason it is appropriate to attempt to map the data to a circle, and in this example we apply the periodic compression architecture to uncover the periodic solution that lies in R^20. In our experiment we took N = 10, which was sufficient for numerical reasons, and the resulting 10 complex Fourier coefficients {a_n(t), n = 1, ..., 10} can be written as the 20-tuple Γ = [x_1(t), y_1(t), ..., x_10(t), y_10(t)] with a_n = x_n + i y_n. This time series can be viewed as a periodic data set in R^20. To construct a mapping from this data to the circle we utilize a 20-15-1-15-20 bottleneck neural network with a circular node in the middle bottleneck layer. After normalizing the data by the variance we trained this network to approximate the identity mapping for 1500 points on the high-dimensional cycle Γ. The network trained quickly to an average error of less than 10-', indicating the excellent degree of fit between the data set and the unit circle. Note that with this construction, the data originate in R^20, are funneled through the bottleneck layer into a circle S^1, and are reproduced by the second half of the network. In Figure 4 we present the output of the second half of the neural network for the first four Fourier coefficients a_1, a_2, a_3, a_4; this is the parameterization H : S^1 → Γ, restricted to these same four coefficients. We also remark that the technique of periodic compression can be used to study a sequence of Hopf, or cycle-producing, bifurcations. For
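The packing of the N = 10 complex coefficients into a 20-tuple, and the inverse needed to read the network's output back as Fourier coefficients, is a small helper (ours, for illustration):

```python
def to_real_vector(coeffs):
    """Pack complex Fourier coefficients [a_1, ..., a_N] into [x_1, y_1, ..., x_N, y_N]."""
    out = []
    for a in coeffs:
        out.extend([a.real, a.imag])
    return out

def to_complex_coeffs(vec):
    """Inverse of to_real_vector."""
    return [complex(x, y) for x, y in zip(vec[0::2], vec[1::2])]

a = [complex(n, -n) for n in range(1, 11)]   # 10 complex modes -> a point in R^20
v = to_real_vector(a)
print(len(v))                      # 20
print(to_complex_coeffs(v) == a)   # True (lossless round trip)
```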
example, at α = 87 the solution is the oscillating structure of the previous regime, which is now a traveling beating wave. This solution could be mapped to S^1 × S^1 and would be an example of the manifold decomposition outlined in Section 5.1. In addition, we observe that our analysis is not restricted to a local region about the bifurcation value. The parameterization techniques presented for the analysis of data of dynamic systems such as the K-S equation can also be used as the basis for approximate inertial manifold constructions. In particular, the inverse mappings G and H constructed by the neural network can also be used to transport the dynamical system or flow from the attractor embedded in the high-dimensional space to the model space built as the image of the system in the low-dimensional bottleneck layer. Therefore we not only have a much simpler model for the attractor, but also a potentially simpler model for the differential equation. This work has begun in Kirby and Miranda (1994a) and will be discussed in terms of the circular node in a forthcoming article (Kirby and Miranda 1996).
6 Summary
In this paper we present an implementation of a circular node for a neural network capable of storing and transmitting angular information. It can be used in conjunction with standard sigmoidal nodes to create networks utilizing both types of node. We present the equations for forward and backward propagation, which have the same general form as the standard implementation. Thus the incorporation of circular nodes and the training of networks utilizing them are not radically different from existing algorithms, and can be easily implemented. Finally, we present several examples of the use of circular nodes in pattern analysis. We close with a nontrivial reconstruction of the limit cycle of the Kuramoto-Sivashinsky equation, whose original geometry lies in a space of 20 dimensions. Our work suggests that the construction and implementation of other useful types of nodes are both feasible and potentially useful. For example, mappings to two-dimensional manifolds may be better served by the construction of nodes carrying spherical or multitoroidal information (Hundley et al. 1995).
Acknowledgments Research supported in part by the NSF under Grant no. ECS-9312092.
References

Armbruster, D., Guckenheimer, J., and Holmes, P. 1989. Kuramoto-Sivashinsky dynamics on the center-unstable manifold. SIAM J. Appl. Math. 49, 676.

Demers, D., and Cottrell, G. 1992. In Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., p. 582. Morgan Kaufmann, San Mateo, CA.

Devijver, P. A., and Kittler, J. 1982. Pattern Recognition: A Statistical Approach. Prentice-Hall, Englewood Cliffs, NJ.

Doya, K., and Selverston, A. I. 1994. Dimension reduction of biological neuron models by artificial neural networks. Neural Comp. 6, 696-717.

Gislén, L., Peterson, C., and Söderberg, B. 1992. Neural Comp. 4, 737-745.

Hertz, J., Krogh, A., and Palmer, R. G. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Reading, MA.

Hundley, D., Kirby, M., and Miranda, R. 1995. Spherical nodes in neural networks and applications. Artificial Neural Networks in Engineering (to appear).

Jolly, M. S., Kevrekidis, I. G., and Titi, E. S. 1990. Approximate inertial manifolds for the Kuramoto-Sivashinsky equation: Analysis and computations. Physica D 44, 38.

Kirby, M., and Armbruster, D. 1992. Reconstructing phase-space from PDE simulations. Z. Angew. Math. Phys. 43, 999-1022.

Kirby, M., and Miranda, R. 1994a. The nonlinear reduction of high-dimensional dynamical systems via neural networks. Phys. Rev. Lett. 72(12), 1822-1825.

Kirby, M., and Miranda, R. 1994b. The remodeling of chaotic dynamical systems. In Intelligent Engineering Systems Through Artificial Neural Networks, C. H. Dagli, B. R. Fernandez, J. Ghosh, and R. T. S. Kumara, eds., Vol. 4, pp. 831-836. Proceedings of the ANNIE '94 Conference, St. Louis, MO. ASME Press, New York.

Kirby, M., and Miranda, R. 1996. The reduction of dynamical systems (in preparation).

Kramer, M. A. 1991. Nonlinear principal component analysis using autoassociative neural networks. AIChE J. 37(2), 233-243.
Krischer, K., Rico-Martinez, R., Kevrekidis, I. G., Rotermund, H. H., Ertl, G., and Hudson, J. L. 1993. Model identification of a spatiotemporally varying catalytic reaction. AIChE J. 39(1), 89-98.

Lapedes, A. S., and Farber, R. 1987. Nonlinear Signal Processing Using Neural Networks: Prediction and System Modelling.

DA has been shown to achieve significant improvement over conventional approaches for the basic clustering problem. Moreover, the method has been generalized to address several important extensions of the basic problem (Rose et al. 1993; Buhmann and Kuhnel 1994; Miller and Rose 1994a; Miller et al. 1994). Recently, close ties were established between the DA approach and rate-distortion theory, yielding contributions to the fundamental information-theoretic problem of rate-distortion function computation and analysis (Rose 1994). In the present work, we seek to generalize the DA approach to address the problem in which structural constraints are imposed on the clustering solution. A typical structure is a tree, for which solutions can be represented by a decision tree diagram, as shown in Figure 1 for a binary tree of depth three. Here, the tree parameters, when used with an associated measure of distance, collectively specify a partitioning of the data space. For a tree of depth L and branching factor K, we denote the node parameters {s_j^l}, where l = 1, ..., L denotes the layer of the tree, and j = 0, ..., K^l − 1 denotes the node's position within the layer. There is an important distinction between the role of internal nodes, which specify a hierarchical partitioning, and the role of leaf nodes (layer L), which are the cluster representative vectors for the partition regions.
While this distinction will be further explained later, we emphasize the different roles of the leaf and nonleaf layers by introducing a special notation for the leaf parameter set, Y = {y_j} = {s_j^L}, and referring to these parameters as code vectors or cluster representatives. The internal-node parameters, which
Unsupervised Learning
Figure 1: A tree diagram for a balanced, binary tree of depth three.
are also typically vector-valued, will be referred to as test vectors. This terminology is borrowed from the vector quantization literature. Specializing the description to binary trees (K = 2), the test vectors {s_j^l} determine a sequence of nested half-space tests that, when traversed from root to leaf, lead to the partition regions of the data space. At each pair of sibling nodes (s_j^l, s_{j+1}^l), j even, an equation of the form d(x, s_j^l) = d(x, s_{j+1}^l) specifies a decision boundary. Here, d(·, ·) is a dissimilarity ("distance") measure. If d(·, ·) is the commonly used squared distance, then the boundaries are hyperplanes and the partition regions are convex cells. This description is trivially extended to trees of higher branching factors, where a Voronoi (nearest neighbor) partition is used by the set of siblings to partition the cell belonging to their parent. Tree-structured clustering has important applications in unsupervised learning, including vector quantization (Buzo et al. 1980) and numerical taxonomy (Sokal and Sneath 1963). Moreover, the clustering solution is relevant to supervised learning problems such as prototype-based statistical classifier design as well, since in practice class labels may not exist or obtaining them may be an expensive process. In these cases, clustering is an important design step and several methods have been applied (Lippmann 1989; Farrell et al. 1994). There are several advantages to the tree-structured architecture. For classification and regression, the structure can be efficiently pruned to search for a parsimonious model or for the minimum cost structure
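Classification with such a tree is a root-to-leaf walk that, at each level, picks the nearest test vector among the siblings. A minimal sketch (the data layout is our own illustrative choice, not the paper's):

```python
def classify(x, tree):
    """Walk a K-ary decision tree from root to leaf.

    tree: nested dict {'s': test/code vector, 'children': [subtrees]}.
    At each level the child whose vector is nearest to x (squared
    distance) wins, i.e., the siblings induce a Voronoi partition of
    the parent's cell. Returns the winning leaf's code vector.
    """
    def d2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    node = tree
    while node.get('children'):
        node = min(node['children'], key=lambda c: d2(x, c['s']))
    return node['s']

# Depth-2 binary tree over R^1: leaves (code vectors) at -3, -1, +1, +3.
tree = {'s': (0.0,), 'children': [
    {'s': (-2.0,), 'children': [{'s': (-3.0,)}, {'s': (-1.0,)}]},
    {'s': (2.0,),  'children': [{'s': (1.0,)},  {'s': (3.0,)}]},
]}
print(classify((-2.5,), tree))   # (-3.0,)
```

Only one Voronoi test per layer is evaluated, which is the source of the reduced search complexity discussed in the text.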
David Miller and Kenneth Rose
given a constraint on model complexity (Breiman et al. 1980; Chou et al. 1989). Another advantage relates to the complexity of statistical classification (known as the encoder search complexity in VQ). Unstructured prototype-based classifiers and vector quantizers may require an exhaustive search for the nearest prototype, which is impractical when the feature space and number of prototypes are large. Alternatively, there are methods for implementing efficient nearest prototype searches (e.g., Gersho and Gray 1992, Chapter 10), but these approaches may require substantial increases in memory storage, and the reduction in search is not guaranteed to be significant in general. The alternative that has been gaining increasing popularity with VQ researchers is to impose structural constraints on the design. The tree-structured classifiers we consider do not guarantee finding the nearest prototype, but they do achieve a substantial reduction in classification search. In the VQ context, this property makes tree-structured vector quantizers (TSVQs) a practical alternative to full search (unstructured) VQ, and, indeed, structurally constrained VQ has been intensively investigated in recent years (Gersho and Gray 1992, Chapter 12). While standard optimization methods for unstructured clustering are iterative descent procedures that guarantee convergence to a local optimum, approaches to tree-structured clustering such as the splitting algorithm (Hartigan 1975; Buzo et al. 1980; Riskin and Gray 1991) are greedy, optimizing a local cost to grow a tree one node at a time. The primary reason is that whereas an optimal partition design step is readily specified by the nearest neighbor rule in the unstructured case, in the tree-structured case an optimal partition is determined only by solving a formidable multiclass risk discrimination problem (Duda and Hart 1974).
Accordingly, standard methods grow trees in a greedy fashion, using heuristics both for determining the order of node splits (and hence the tree structure), and for performing the node splitting. While pruning can be used to find a good cost/complexity tradeoff (Chou et al. 1989), the solution quality of this method depends on the initial tree. Pruning does not move partition region boundaries to improve the solution, but rather removes boundaries from the solution. Alternatives to greedy methods were recently proposed for the general regression problem (Jordan and Jacobs 1994) and for the specific case of tree-structured clustering (Miller and Rose 1994b). Both the approach of Jordan and Jacobs and our own preliminary work are based on a probabilistic formulation. However, their method performs direct descent on a cost surface and thus may be sensitive to finding poor local minima, while our approach for the restricted problem is an annealing approach and attempts to avoid local minima. In this paper we fundamentally extend our preliminary work, both in its theoretical development and its practical application. First, the method is given a more sound theoretical justification based on the information-theoretic principle of minimum cross-entropy. This inference is
used to introduce a new paradigm for structured clustering within a probabilistic framework: the goal of approximating the optimal (unstructured) clustering solution while imposing the structural constraint. Second, it is recognized that the clustering solution can be improved by generalizing the class of distance measures used by the structured hierarchy. An example will be shown using the Mahalanobis distance (Duda and Hart 1974). Finally, and most importantly, we note that whereas our earlier work used annealing to optimize a tree of fixed structure, a more fundamental advantage of the approach relates to phase transitions in the solution process and their connection with growth in the tree model. In particular, it is recognized here that our effective tree model "grows" by bifurcations, which occur while minimizing a free energy cost at decreasing "temperature" scales. Thus, our method naturally provides an estimate of the model size and structure at each temperature. In the sequel, we will first generalize the probabilistic inference used by DA for the unstructured clustering problem to include prior probabilities.

2 A Tree-Structured Clustering Method

2.1 Minimum Cross Entropy Inference. Even if one is interested in a "hard" (i.e., nonfuzzy) clustering solution, still it may be useful within the context of an optimization method to consider points associated in probability with clusters. In the basic DA method (Rose et al. 1992), no underlying assumptions were made about the data distribution. Accordingly, the principle of maximum entropy (Jaynes 1989) was invoked to obtain probabilities of association {P[x ∈ C_j]} between data points and cluster representatives. More concretely, the probabilistic inference was obtained by posing for each data point
max { − Σ_j P[x ∈ C_j] log P[x ∈ C_j] }

subject to

⟨E_x⟩ = Σ_j P[x ∈ C_j] d(x, y_j)

The solution is the Gibbs distribution

P[x ∈ C_j] = e^{−β d(x, y_j)} / Σ_k e^{−β d(x, y_k)}   (2.2)
where the denominator is a partition function from statistical physics and β is a Lagrange multiplier. Now suppose that there is prior knowledge relating points and clusters, stated in the form of probabilities {Q[x ∈ C_j]}. The natural generalization of maximum entropy inference to include a prior is the principle
of minimum cross entropy (Kullback 1969). In Shore and Johnson (1980) it was shown that this principle is the consistent principle of inference given new information. Accordingly, we pose the problem

min Σ_j P[x ∈ C_j] log ( P[x ∈ C_j] / Q[x ∈ C_j] )   (2.3)

subject to

⟨E_x⟩ = Σ_j P[x ∈ C_j] d(x, y_j)
The solution is the so-called "tilted" distribution:

P[x ∈ C_j] = Q[x ∈ C_j] e^{−β d(x, y_j)} / Σ_k Q[x ∈ C_k] e^{−β d(x, y_k)}   (2.4)

The Lagrange multiplier β determines the value of ⟨E_x⟩. It can also be interpreted as an inverse "temperature" influencing the degree of fuzziness of the distribution. For uniform {Q[·]}, these associations revert to the maximum entropy associations of equation 2.2, as we expect. We note further that for β = 0 we obtain P[x ∈ C_j] = Q[x ∈ C_j], i.e., we give full weight to the prior and ignore the clustering distortion. At the other limit of β → ∞ we minimize distortion and totally ignore the prior, except concerning classifications that are precluded by the prior, i.e., for which Q[x ∈ C_j] = 0, and hence P[x ∈ C_j] = 0. This restriction of the distribution through the prior will later provide a tool for imposing constraints on the clustering solution within an annealing optimization framework. The partition function associated with a single datum is the denominator of equation 2.4. The total partition function over the entire training set is the product

Z = Π_x Σ_j Q[x ∈ C_j] e^{−β d(x, y_j)}   (2.5)
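The tilted distribution of equation 2.4 and its two limits are easy to verify numerically; the sketch below is ours, with made-up prior and distortions (note that the last cluster is precluded by the prior and stays precluded at every β):

```python
import math

def tilted(prior, dists, beta):
    """Tilted distribution: P_j proportional to Q_j * exp(-beta * d_j) (cf. eq. 2.4)."""
    w = [q * math.exp(-beta * d) for q, d in zip(prior, dists)]
    z = sum(w)                      # per-datum partition function (denominator of 2.4)
    return [v / z for v in w]

Q = [0.5, 0.3, 0.2, 0.0]            # prior associations; last cluster precluded
d = [4.0, 1.0, 2.0, 0.5]            # distortions d(x, y_j)

print(tilted(Q, d, 0.0))            # beta = 0: reverts to the prior Q
print(tilted(Q, d, 50.0))           # large beta: mass on j = 1, the minimum
                                    # distortion among clusters with Q_j > 0
```

Notice that cluster 3 has the smallest distortion but zero prior, so it never receives mass, which is exactly the constraint mechanism described in the text.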
Correspondingly, the free energy in the physical analogy is

F = −(1/β) log Z = −(1/β) Σ_x log Σ_j Q[x ∈ C_j] e^{−β d(x, y_j)}   (2.6)

This function is a generalization of the effective cost minimized by the basic DA method. The free energy is the quantity minimized at isothermal equilibrium by simulated annealing, and it can also be viewed as the log likelihood associated with a mixture model, as discussed in Jordan and Jacobs (1994). For {Q[·]} → {0, 1} or for β → ∞, the free energy is equivalent to a hard clustering distortion. Minimization of this cost with respect to the cluster representatives can be realized for a given prior by an annealing approach, as in the original DA method (Rose et al. 1992),
wherein F is minimized starting from high temperature (small β) and the solution is tracked while the temperature is lowered. The annealing process is useful for avoiding local optima of the cost. The condition for optimizing the free energy at any temperature is

∂F/∂y_i = 0,   ∀i   (2.7)

or the centroid rule

Σ_x P[x ∈ C_i] ∂d(x, y_i)/∂y_i = 0,   ∀i   (2.8)
For the squared distance measure, we may write

y_i = Σ_x P[x ∈ C_i] x / Σ_x P[x ∈ C_i]   (2.9)
which can be iterated until a fixed point is reached at each temperature. Equation 2.9 connects the method with statistical approaches, since these iterations are a special instance of the expectation/maximization (EM) algorithm (Dempster et al. 1977; see also Yuille et al. 1994). Of course, there may be methods that are more efficient than fixed point iterations for minimizing F at each temperature. While this introduction of a prior within the DA framework may have application to supervised learning,¹ in this paper we will focus on an unsupervised learning problem (tree-structured clustering) and demonstrate that the inclusion of a prior is especially useful in this context, as it allows explicit quantification of the dependencies between the leaf and nonleaf layers.

2.2 Tree-Structured Formulation. We now relate the previous section's results to structurally constrained clustering. For clarity's sake, we first consider a simple two-layer binary tree and then show that our framework is easily extended to trees of any breadth and depth. Thus, we have a single internal layer with two test vectors s_0 and s_1 (note that we drop the redundant superscript l = 1), and a leaf layer consisting of four code vectors y_0, y_1, y_2, and y_3. Assuming, for simplicity, that squared distance is used both for classification in the internal layer and for measuring the clustering cost at the leaves, the nonleaf decision boundary d(x, s_0) = d(x, s_1) is a hyperplane, dividing the feature space into two regions, which are then further subdivided by the leaf layer.
¹A special case where {Q[x ∈ C_j]} are interpreted as probabilistic class labels for the training data, arising from uncertainties in a supervised learning process, is related to the work in Buhmann and Kuhnel (1993) and Oehler and Gray (1993), where a supervising term was introduced within the clustering cost based on knowledge of class labels for the data. Although our approach could potentially make a contribution to this application, supervised learning is not addressed in this paper.

A probabilistic generalization of this hard boundary (justifiable with the
432
David Miller and Kenneth Rose
maximum entropy principle) is the Gibbs distribution

P_H[x ∈ S_0] = e^{−γ d(x, s_0)} / ( e^{−γ d(x, s_0)} + e^{−γ d(x, s_1)} )   (2.10)
where γ is a positive scale parameter, controlling the fuzziness of the distribution, and where the cell S_0 is the set of all points classified to test vector s_0, that is, to node 0 in the internal layer. For s_0 ≠ s_1 and γ → ∞, equation 2.10 reduces to a hard hyperplane decision function. In a similar fashion, trees of larger depth can be probabilistically generalized via products of Gibbs distributions. For some internal node s with corresponding cell S we write the recursion formula:

P_H[x ∈ S] = P_H[x ∈ parent(S)] · e^{−γ d(x, s)} / Σ_{s′ ∈ siblings(s)} e^{−γ d(x, s′)}   (2.11)
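The recursion 2.11 divides a parent's probability mass among its children with a Gibbs distribution over the children's test vectors; a small illustrative sketch (ours):

```python
import math

def child_probs(x, parent_prob, child_vectors, gamma):
    """Divide a parent's classification probability among its children via a
    Gibbs distribution over the children's test vectors (cf. eq. 2.11)."""
    def d2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    w = [math.exp(-gamma * d2(x, s)) for s in child_vectors]
    z = sum(w)
    return [parent_prob * v / z for v in w]

x = (0.4,)
# Root has probability 1; two internal test vectors at -1 and +1.
p_left, p_right = child_probs(x, 1.0, [(-1.0,), (1.0,)], gamma=2.0)
print(p_left + p_right)    # the children split the parent's mass: sums to 1
# As gamma grows, the split hardens toward the nearer test vector (+1):
hl, hr = child_probs(x, 1.0, [(-1.0,), (1.0,)], gamma=50.0)
print(hr)                  # close to 1
```

Applying this level by level yields the product of Gibbs distributions described in the text, and the γ → ∞ limit recovers a hard decision tree.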
which applies the Gibbs distribution to divide the probability of classification to the parent node among the siblings. It is easy to see that the corresponding closed-form expression is a product of Gibbs distributions, which, at the limit γ → ∞, specifies a sequence of nested decisions for a hard decision tree. One strategy for imposing the structural constraint within a probabilistic clustering framework is to view {P_H[·]} as a prior that influences the formation of leaf representative/data association probabilities. Denote the distribution at the leaves {P_L[x ∈ C_j]}, where partition region C_j refers to leaf j represented by code vector y_j. The leaf associations should be chosen to "agree" with the prior to the parent layer while satisfying an average clustering distortion constraint. Accordingly, as in the previous section, we pose

min Σ_j P_L[x ∈ C_j] log ( P_L[x ∈ C_j] / ( (1/K) P_H[x ∈ parent(C_j)] ) )   (2.12)

subject to

⟨E_x⟩ = Σ_j P_L[x ∈ C_j] d(x, y_j)   (2.13)

Here, the prior is equally split between its K children nodes at the leaves, a choice justified by the principle of maximum entropy. The solution is the tilted distribution

P_L[x ∈ C_j] = P_H[x ∈ parent(C_j)] e^{−β d(x, y_j)} / Σ_k P_H[x ∈ parent(C_k)] e^{−β d(x, y_k)}   (2.14)

(the common factor 1/K cancels in the normalization).
and the corresponding free energy is

F_T = −(1/β) Σ_x log Σ_j (1/K) P_H[x ∈ parent(C_j)] e^{−β d(x, y_j)}   (2.15)
The parameterization of {P_H[·]} guarantees that as γ → ∞ and β → ∞, {P_L[·]} determines a tree-structured partition of the feature space. Moreover, at these limits F_T reduces to the tree-structured clustering distortion. The free energy is considered both in our own previous work (Miller and Rose 1994b) and in Jordan and Jacobs (1994). There, it was proposed to maximize the log likelihood (negative of the free energy) with respect to all model parameters. For tree-structured clustering, this would mean optimizing F_T with respect to {s_j^l}, γ, and {y_j}. While this approach is appealing because of its general applicability to classification and regression and because of its connection with the EM algorithm, it is not consistent with an annealing approach, wherein β must control the average distortion and data "scale" of the solution. In particular, note that this strategy allows the necessary optimality condition for a test vector to be satisfied at any temperature β by choosing the parameters so as to make the probabilities {P_H[·]} "hard."² Hardening of {P_H[·]} imposes severe restrictions on {P_L[·]} and hence on the extent to which β controls the data scale of the solution. Alternatively, the method we propose retains β as a computational temperature, controlling the degree of "fuzziness" in the solution and, as will be discussed, the effective model size. Whereas the conventional splitting algorithm treats all nodes as if they were leaves by placing nonleaf (test) vectors as well as the leaf (code) vectors at region centroids, we view the code vectors {y_j} and test vectors {s_j^l} as complementary sets of variables. The role of the leaves is to minimize the clustering distortion by placing code vectors {y_j} at region centroids, while the role of the internal nodes (hierarchy) is to classify to the leaf layer so as to minimize clustering distortion at the leaves; in general this will not be achieved by placing test vectors {s_j^l} at region centroids.
Clearly, the leaf and nonleaf objectives are intertwined, but we "decouple" them in the optimization by alternating between the two complementary subproblems, namely, optimize the leaf nodes given a fixed hierarchy, and optimize the internal nodes given the fixed leaf layer.
²Consider the two-layer binary tree. A necessary condition for optimizing F_T with respect to the root test vector s_0 is ∂F_T/∂s_0 = 0, which is satisfied when {P_H[·]} becomes "hard," i.e., {P_H[·]} → {0, 1}, due to the dependence of P_L on P_H as given in equation 2.14. The hardening can be achieved by sending the test vectors or γ to infinity. This undesired phenomenon has been confirmed experimentally.
David Miller and Kenneth Rose
2.2.1 Optimization of the Leaves. The dependence of the leaves on the hierarchy is naturally built into F_T through the prior {P_H[·]}. Thus, given {P_H[·]}, our method directly minimizes the free energy with respect to {y_j} at each β. A necessary optimality condition is

∑_x P_L[x ∈ C_j] ∇_{y_j} d(x, y_j) = 0, ∀j    (2.17)

The minimization can be implemented via gradient descent or, for the squared distance measure, through fixed point iterations (the centroid rule).
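For the squared distance measure, the centroid rule admits a compact sketch; the array names and the use of precomputed leaf membership probabilities are illustrative assumptions, not the paper's notation:

```python
import numpy as np

def update_leaves(X, PL):
    """Fixed-point (centroid-rule) update of the leaf code vectors for the
    squared distance measure: each y_j becomes the P_L-weighted centroid
    of the data it probabilistically 'owns'."""
    mass = PL.sum(axis=0)                 # total membership probability per leaf
    return (PL.T @ X) / mass[:, None]     # weighted centroids, one row per leaf

# Toy usage: four points, two leaves, soft memberships P_L[x in C_j].
X = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
PL = np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]])
Y = update_leaves(X, PL)
```

Iterating this update with fixed memberships reproduces the usual weighted-centroid fixed point; in the full method the memberships themselves are recomputed at each step.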
2.2.2 Optimization of the Hierarchy. In a corresponding fashion, the test vectors {s_j^l} and scale parameter γ should be chosen to produce a distribution for the hierarchy that "agrees with" a (given) distribution over the leaves. Moreover, this optimization should retain β as a computational temperature for the method. Accordingly, we introduce a new paradigm into the problem that essentially states: approximate the optimal (unstructured) clustering solution while imposing the structural constraint. Given fixed (leaf) code vectors, the optimal hard partition is the (unstructured) nearest-neighbor partition induced by the leaves. Within the probabilistic framework, given the "temperature" β, the corresponding ideal distribution is the maximum entropy distribution

P_I[x ∈ C_j] = e^{−βd(x, y_j)} / ∑_m e^{−βd(x, y_m)}    (2.18)

The ideal probability of classification to any internal node S is thus given by

P_I[x ∈ S] = ∑_{j: C_j ∈ descendants(S)} P_I[x ∈ C_j]    (2.19)

The hierarchical parameters {s_j^l} and γ should be chosen so that the distribution to the parent layer of the leaves {P_H[x ∈ S_j^{L−1}]} "agrees with" the ideal distribution {P_I[x ∈ S_j^{L−1}]} as nearly as possible. To achieve this objective, we again appeal to cross entropy as a measure of distance between distributions³ and pose

min_{γ, {s_j^l}} D({P_I[x ∈ S_j^{L−1}]} || {P_H[x ∈ S_j^{L−1}]})    (2.20)
³D(·||·) is an asymmetric measure of distance between distributions; thus some justification of the choice D(P_I||P_H) rather than D(P_H||P_I) is due. We view P_I as the ideal (i.e., the "true") distribution to be approximated by the model-constrained distribution P_H. Thus we average the "log likelihood ratio" with respect to P_I: D = ∑ P_I [log P_I − log P_H]; see Cover and Thomas (1991) for more details on this interpretation of D(·||·). This choice also leads to a simpler result in our case.
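Equations 2.18 and 2.19 can be sketched directly; this is a hedged illustration for the squared distance measure, with hypothetical helper names:

```python
import numpy as np

def ideal_leaf_distribution(X, Y, beta):
    """Maximum entropy (Gibbs) distribution over the leaves (equation 2.18),
    here for the squared distance measure."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)  # (M, J) squared distances
    logits = -beta * d2
    logits -= logits.max(axis=1, keepdims=True)              # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def ideal_node_probability(PI, descendant_leaves):
    """Equation 2.19: P_I[x in S] sums P_I over the leaves descending from S."""
    return PI[:, descendant_leaves].sum(axis=1)

# Toy usage: three 1D points, four leaves; one internal node owns leaves 0 and 1.
X = np.array([[0.0], [1.0], [4.0]])
Y = np.array([[0.0], [1.0], [3.0], [5.0]])
PI = ideal_leaf_distribution(X, Y, beta=1.0)
left = ideal_node_probability(PI, [0, 1])
```

By construction the node probabilities of the two children of any internal node sum to the parent's probability, so the ideal distribution is consistent across layers.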
Unsupervised Learning
For P_H[·] recursively defined in equation 2.11, one can easily show that equation 2.20 is equivalent to a minimization problem involving a sum of cross entropies over each nonleaf layer, i.e.,

min_{γ, {s_j^l}} ∑_{l=1}^{L−1} D({P_I[x ∈ S_j^l]} || {P_H[x ∈ S_j^l]})    (2.21)

This problem can be solved by a gradient descent technique. The partial derivatives with respect to any test vector s (with partition cell S) and the scale parameter γ are given in equations 2.22 and 2.23, respectively.
Note that equation 2.22 has the same form as the gradient of the free energy (see footnote 2), but replaces {P_L[·]} with {P_I[·]}. This simple rule can be interpreted as a probabilistic, batch version of the Perceptron learning rule, optimizing the test vectors of the hierarchy so that {P_H[·]} approximates {P_I[·]} (Miller and Rose 1994b). Effectively, we have introduced a supervised learning paradigm within an unsupervised problem setting. In a similar fashion, γ is modified to match the average distance [variance for d(·) the squared distance] based on {P_H[·]} with that of {P_I[·]}. Similar gradient rules can also be specified for matrix parameters if Mahalanobis distance is used for classification in nonleaf layers. While the Perceptron does not converge for nonseparable classes, the cross-entropy minimization should converge so long as β is finite.
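The per-layer objective being minimized can be sketched as follows; this is an illustration of the cross-entropy term in equation 2.21 for a single layer, not the paper's gradient expressions (equations 2.22 and 2.23, which are not reproduced here):

```python
import numpy as np

def layer_cross_entropy(PI, PH, eps=1e-12):
    """Data-summed KL divergence D(P_I || P_H) for one nonleaf layer.
    Equation 2.21 sums such terms over all nonleaf layers; eps guards log(0)."""
    return float(np.sum(PI * (np.log(PI + eps) - np.log(PH + eps))))

# Toy usage: two data points, a two-node layer.
PI = np.array([[0.8, 0.2], [0.3, 0.7]])
PH = np.array([[0.6, 0.4], [0.4, 0.6]])
D = layer_cross_entropy(PI, PH)
```

Gradient descent on the test vectors and on γ would then differentiate this objective, which is zero exactly when the model distribution matches the ideal one.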
2.2.3 Algorithm Summary. Our annealing approach optimizes the parameters of a tree of fixed structure: balanced, with specified depth and branching factor. The method involves alternating minimizations with respect to the leaves and the hierarchy at a sequence of increasing β, starting from small β. At each β, the optimization consists of four iterated steps: (1) given a fixed {P_H[·]}, choose {y_j} to minimize F_T from equation 2.15; (2) compute the ideal distributions {P_I[·]} based on the new leaves using equation 2.19; (3) optimize the hierarchical parameters γ and {s_j^l} to agree with {P_I[·]} in the cross-entropy sense of equation 2.21; and (4) compute {P_H[·]} using equation 2.11. These steps are iterated until a convergence condition is satisfied. Then β is increased. As the "temperature" decreases, the distributions begin to "harden," and upon termination, our method specifies a hard, tree-structured solution through {s_j^l} and {y_j}. The algorithm is listed in pseudo-code in Table 1. Several algorithm steps, including initialization and the termination condition, will be explained in the next section. While this study does not address theoretical issues of convergence, in practice, based on numerous simulations, we have found that convergence is always achieved at a given temperature.

2.3 Growing by Phase Transitions. Even though the approach we have described optimizes parameters of a tree of specified size and structure, this choice does not restrict the effective tree size and structure produced at a given β. At high temperature (β = 0), there is a unique minimum of F_T, with all cluster representatives at the global centroid of the data set. The corresponding distributions {P_I[·]} are uniform, and thus the optimal {P_H[·]} that minimizes the cross-entropy is also uniform, achieved by choosing all the test vectors to be nondistinct. Thus, at small β the tree effectively collapses to a single node, justifying the initialization of all node vectors to the global centroid (see Table 1). The global centroid is a solution for any β. However, as β is increased, at some critical value this solution changes from a minimum to a local maximum or a saddle point of F_T. Essentially, the increased emphasis given to minimizing clustering distortion for increasing β prompts splits of the representatives and growth in the effective model size. To break the symmetry of the global centroid solution, small perturbation vectors w are added to the representatives to promote splits (see Table 1). At special, critical values of β, perturbations will grow into actual splits, while at all other β, nondistinct representatives that have been pushed apart by perturbations will return to their nondistinct state. These splits at critical β can be interpreted via the physical analogy as phase transitions (bifurcations) in the annealing process. Conditions for bifurcation have been derived for the unstructured DA method (Rose et al. 1992; Rose 1994), as well as for the elastic net method for the traveling salesman problem (Durbin et al. 1989).
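The annealing schedule can be illustrated with a deliberately simplified, unstructured DA loop; the hierarchical steps (2) and (3) and the tree prior are omitted, and all names and constants here are illustrative choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def anneal_clusters(X, J, beta0, alpha=0.05, eps=1e-4, beta_max=1e3):
    """Deterministic-annealing loop in the spirit of Section 2.2.3:
    converge at each beta, then set beta <- (1 + alpha) * beta and perturb
    the representatives so that splits can occur at critical beta."""
    Y = np.repeat(X.mean(axis=0, keepdims=True), J, axis=0)  # start at global centroid
    beta = beta0
    while beta < beta_max:
        prev = np.inf
        for _ in range(200):                                 # converge at this beta
            d2 = ((X[:, None] - Y[None]) ** 2).sum(axis=2)
            logits = -beta * d2
            logits -= logits.max(axis=1, keepdims=True)
            P = np.exp(logits)
            P /= P.sum(axis=1, keepdims=True)                # Gibbs memberships
            Y = (P.T @ X) / P.sum(axis=0)[:, None]           # centroid rule
            F = float((P * d2).sum())                        # distortion as proxy cost
            if abs(prev - F) <= eps * max(F, eps):
                break
            prev = F
        beta *= 1 + alpha
        Y += 1e-3 * rng.standard_normal(Y.shape)             # perturb to enable splits
    return Y

# Toy usage: two well-separated gaussian blobs; the representatives split
# during annealing and settle near the blob centers.
X = np.vstack([rng.normal(-3, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
Y = anneal_clusters(X, J=2, beta0=0.05)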
In the Appendix, a condition is derived for splits in the tree-structured annealing process. The first bifurcation is initiated along the principal data axis and at a value of β inversely proportional to the "spread" (the maximum eigenvalue) along this direction (see the Appendix). The initial β for our algorithm in Table 1 is thus chosen to be smaller than this critical value. Subsequent bifurcations occur in a similar fashion, dependent on the data "owned" probabilistically by the representatives undergoing the split. Thus, the annealing process generates a sequence of solutions of increasing effective size and finer scale. As β → ∞, the free energy cost becomes the clustering distortion, which can always be decreased by increasing the model size, and so in the limit of low temperature the amount of splitting is limited only by the number of representatives assumed by the system. This choice may depend upon practical requirements (e.g., model complexity issues, bandwidth for data compression applications) or on cluster validity measures. While our method does not yield insights into the correct model size for the hard clustering problem, for any finite β (and hence a probabilistic solution), there is a "correct" model size:
Table 1: Pseudo-Code for the Tree-Structured DA Method.^a

Given: a data set X = {x} of size M, the balanced tree depth L and branching factor K, a target leaf size N_target, an annealing parameter α, and a threshold ε.

Calculate the data global centroid μ = (1/M) ∑_x x and the principal eigenvalue λ_max of the matrix R = ∑_x (x − μ)(x − μ)^T.
Set β below the critical value 1/(2λ_max).
Set s_j^l ← μ, for j = 0, ..., K^l − 1, l = 1, ..., L − 1.
Set y_j ← μ, for j = 0, ..., K^L − 1.
do {
    do {
        {y_j} ← argmin_{{y_j}} F_T (equation 2.15)
        Compute the ideal distributions {P_I[·]} (equation 2.19)
        {γ, {s_j^l}} ← argmin_{γ, {s_j^l}} D({P_I[x ∈ S_j^l]} || {P_H[x ∈ S_j^l]})
        Compute P_H[x ∈ S_j^l], ∀l, j, using the recursion of equation 2.11.
    } while ΔF_T/F_T > ε
    β ← (1 + α)β
    Calculate the effective number of leaves (number of distinct elements in {y_j}): N = |{y_j}|.
    Calculate the entropy H = −(1/M) ∑_x ∑_j P_L[x ∈ C_j] log P_L[x ∈ C_j].
    if (N < N_target) perturb the leaf layer: y_j ← y_j + w_j, ∀j, where w_j is a small perturbation, ||w_j||₂ ≤ ε.
} while (N < N_target) or (H > ε)

^aThis algorithm optimizes a tree-structured model of fixed size, but the effective tree size grows by bifurcations during the annealing process. In our simulations we used α = 0.05 and ε = 0.0001. The algorithm is not sensitive to variations in these values so long as they are sufficiently small. Issues of how to select these values so as to optimize the tradeoffs between performance and computation remain open.
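The effective-size and entropy bookkeeping of Table 1 can be sketched as follows; the tolerance used to decide when representatives are "distinct" is an implementation choice not specified in the paper:

```python
import numpy as np

def effective_size(Y, tol=1e-2):
    """Effective model size: the number of distinct leaf representatives,
    merging vectors closer than tol."""
    distinct = []
    for y in Y:
        if all(np.linalg.norm(y - z) > tol for z in distinct):
            distinct.append(y)
    return len(distinct)

def leaf_entropy(PL):
    """Average leaf-membership entropy H; near zero means the solution is 'hard'."""
    P = np.clip(PL, 1e-12, 1.0)
    return float(-(P * np.log(P)).sum() / len(P))

# Toy usage: four leaves occupying two distinct locations.
Y = np.array([[0.0, 0.0], [0.001, 0.0], [5.0, 5.0], [5.0, 5.001]])
N = effective_size(Y)
```

Annealing terminates when N reaches the target and the leaf entropy falls below the threshold, i.e., when the solution has hardened.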
that associated with the solution that achieves the global minimum of the free energy. As long as the tree "thrown into" the optimization is sufficiently large, our method can in principle estimate the optimal tree at each β, with the effective tree size simply the number of distinct representatives in the solution.⁴ In practice we cannot claim that the optimal model will be found, as our optimization method is not guaranteed to find the global minimum. Still, the effective tree size does grow by bifurcations at critical temperatures in the solution process, and thus our method does provide an estimate of the tree size and structure at each β. By disallowing splits of parameters (by not introducing perturbations) once a specified effective tree size (N_target leaves) has been reached, our algorithm can be used to search for the optimal tree-structured solution (of a priori unknown structure) with a given number of representatives. The algorithm of Table 1 determines the number of distinct representatives at each temperature and terminates when the target tree size has been reached and when the solution is sufficiently "hard" (i.e., when the entropy at the leaves is very low). In Figure 2, we present an example showing an "evolution" of growth in the tree model for increasing β. The data set is a gaussian mixture with eight components. We optimized a balanced binary tree of depth four (sixteen leaves) and annealed starting from β below the initial critical value. The figures show the converged solutions after critical β of the annealing process have been reached. Here, the effective model size grows from two leaves in Figure 2a to eight leaves in Figure 2g. Note that the partition regions "separating" the leaf representatives are probabilistic: the lines are equiprobable contours with membership probability of p = 0.33, except for Figures 2a and 2g, for which p = 0.5. Figure 2a shows the solution after the initial split from the global centroid.
The solution, with 16 leaves, has two distinct leaf representatives, with the eight left subtree leaf vectors all at one location and the eight right subtree leaf vectors all at another location. The associated tree structure is shown to the right of the figure. In Figure 2b, the left subtree has undergone bifurcation, leading to a tree structure with three leaves. The subsequent figures and associated tree diagrams show how the effective tree grows for increasing β. The hard clustering solution of Figure 2g was obtained by fixing the model size (disallowing bifurcations) when the target size of eight leaves was achieved and then annealing to low temperature. The tree diagrams emphasize the fundamental difference between the "hierarchy" of tree-structured solutions generated by our method and the "hierarchy" of unstructured solutions generated by the original DA method (Rose et al. 1992): as the tree-structured clustering model grows, so does a corresponding decision tree structure. For the "hard" clustering solution, the decision tree specifies an efficient classification search, which is of practical importance in vector quantization.

⁴A dependence of the clustering solution on the multiplicity of overlapping representatives within a distinct cluster was noted in Rose et al. (1992) and also referred to as cluster degeneracy in Buhmann and Kuhnel (1994). In Rose et al. (1993) this weakness was eliminated by the introduction of mass variables. However, for the binary-tree constrained design this addition is unnecessary, since the tree structure limits the types of bifurcations that can occur. Nevertheless, this problem theoretically exists for nonbinary trees, although in practical experiments we have not encountered it. Adopting a "mass-constrained" modification for the tree-structured problem is beyond the scope of this paper.

Figure 2a-c: A hierarchy of tree-structured solutions generated by the annealing method for increasing β. The source is a gaussian mixture with eight components. The figures show the converged solutions at distinct β at which the effective tree size has grown by bifurcation. The computed code vectors are barely visible, denoted by "o"s. To the right of each figure is the associated effective tree structure. The lines in the figure are equiprobable contours with membership probability of p = 0.33 in a given partition region, except for a and g, for which p = 0.5. "H" denotes the highest level decision boundary in the "hard" solution of g.

Figure 2d-g: Continued.

Figure 3: An example involving a mixture of four isotropic gaussian components and the two layer binary tree solution found via (a) splitting, with E = 0.9, and (b) tree-structured DA, with E = 0.5. In each figure "X" denotes a true mixture component center, "o" denotes a cluster representative found by the method, and "H" denotes the highest level decision boundary of the solution.
3 Results
Here we compare the performance of our method with both tree-structured and unstructured clustering methods, in both the pattern recognition and data compression contexts. For pattern recognition some simple illustrative examples are used to demonstrate in a fundamental way the improvements achievable by our method. In Figures 3-6 the data were generated by randomly sampling from 2D gaussian mixture distributions with isotropic components. In all the figures, "X"s are used to denote mixture component centers and "o"s to denote computed cluster representatives. Moreover, "H" denotes the highest level decision boundary of the solution. The cost function is the sum of squared distances. For the DA method, the annealing parameter α was chosen to be 0.05 and ε was set to 10⁻⁴.

Figure 4: An example involving a mixture of eight gaussian components. (a) The splitting solution with E = 0.73. (b) The tree-structured DA solution with E = 0.50.

Figure 5: An example involving a mixture of four isotropic gaussian components and the two layer binary tree solutions. (a) The solution found by tree-structured DA using squared distance for the hierarchical layer, with E = 0.43. (b) The tree-structured DA solution using Mahalanobis distance for the hierarchical layer, with E = 0.33. Note that in this case the highest level decision boundary is a hyperbola.

Figure 6: A gaussian mixture example with sixteen components. (a) The best basic Isodata solution, based on 30 random initializations within the data set, with E = 0.51. (b) A typical basic Isodata solution, with E = 0.72. (c) The unbalanced tree-structured DA solution with maximal depth of five and E = 0.49.
The example of Figure 3 shows that even for one of the simplest possible trees (two layer, binary) and data sets (clusters along a line), the splitting algorithm fails to adequately discriminate mixture components. Here in Figure 3a, placing test vectors at region centroids leads to the suboptimal boundary ("H"), which separates three clusters from one in the first layer. By contrast, our approach places the test vectors so as to achieve a (visually apparent) optimal solution separating all mixture components, with the quality also reflected in a much smaller sum of squared distances cost. Similar performance gains are achieved by our approach in Figure 4. For the example of Figure 5, Mahalanobis distance is used in the nonleaf layer to improve the clustering result, demonstrating that optimal structured solutions need not have convex decision boundaries. In Figure 6, our method is compared with the unstructured basic Isodata algorithm. For this example, despite its structural handicap, our approach achieves a better solution than the best result of basic Isodata based on numerous initializations (30) within the data. While these examples are all fairly simple to aid visual assessment, we have performed extensive testing of our method on a variety of data sources (with data dimensions ranging from 2 to 8) and have found it to be successful in separating complex data distributions with numerous, overlapping mixture components. We have found that our method always outperforms greedy tree-structured methods, including those that use optimal pruning, and very often outperforms unstructured methods such as basic Isodata (which tends to get trapped in nonglobal optima) as well, while reducing the search complexity of the resulting classifier. We have also tested our method in comparison with splitting (Buzo et al. 1980) and with the generalized Lloyd algorithm (GLA) (Linde et al. 1980) for vector quantization of Gauss-Markov sources.
For this problem our approach bridges a significant portion of the performance gap between splitting for tree-structured design and GLA for unstructured design. Some performance results are shown in Table 2 for the case of four-dimensional data vectors.
4 Conclusions
In this paper, we have extended the deterministic annealing approach to address the problem of structurally constrained clustering. Whereas the original approach was developed using the principle of maximum entropy, the new method is based on minimum cross-entropy inference, which is a convenient formalism for expressing the joint objectives of enforcing a tree-structured solution and approximating the optimal (unstructured) solution. The annealing approach is useful in two important respects. First, it is helpful for avoiding local optima of the cost, allowing the method to achieve substantial performance gains over other tree-structured methods. Second, and most importantly, annealing leads
Table 2: Vector Quantization Performance for the Unconstrained Generalized Lloyd Algorithm and Tree-Structured DA on 4D Gaussian and First-Order Gauss-Markov Sources with Correlation Coefficient ρ.^a

ρ     R      GLA    TREE-DA
0.0   1.0    0.26   0.09
0.0   1.25   0.83   0.29
0.0   1.5    0.94   0.40
0.5   1.0    0.41   0.18
0.5   1.25   0.68   0.40
0.5   1.5    0.99   0.46

^aThe performance measure is gain (in dB) over the standard splitting algorithm for TSVQ design. Note that a significant portion of the performance gap between splitting for TSVQ design and GLA for unstructured VQ design is recouped by the tree-structured DA method.
to phase transitions in the process and consequential growth in the model size and tree structure. Here, we emphasized that tree growth in our approach is a natural consequence of the optimization of the free energy cost and allows automatic model order and tree structure estimation at each temperature scale. One outgrowth of our method is a probabilistic, batch generalization of the Perceptron algorithm and its connection with minimizing cross-entropy. Another is a basic modification of the clustering problem to incorporate a prior. One clear direction that is beyond the scope of this work is to directly address the more general class of supervised learning problems that includes (nonlinear) piecewise regression and statistical classifier design to minimize probability of error. The tree-structured DA method does not directly address these problems, since clustering is just a special case of regression. However, our approach does successfully combine within an optimization framework important elements of these problems, including test vector variables whose primary role is classification, with "representative" variables chosen to minimize the cost. Thus, our approach does provide some impetus for tackling these more general problems. We wish to draw the reader's attention to more recent work extending the DA method to address supervised learning problems including statistical classification (Miller et al. 1995).
Appendix: A Necessary Condition for Bifurcation

Here, we derive the general condition for bifurcation in our tree-structured process, assuming squared distance measures the cost at the leaves. To simplify the sequel, we define the following sets: the representatives Y = {y_j}, the perturbations T = {w_j}, the perturbed representatives Y_p = {y_j + εw_j}, and the data X = {x}. Here, x, y_j, and w_j are elements of the same real vector space and ε is a scalar. From the theory of the calculus of variations, necessary conditions satisfied at a minimum of the free energy F_T(Y) are

(d/dε) F_T(Y_p) |_{ε=0} = 0    (A.1)

and

(d²/dε²) F_T(Y_p) |_{ε=0} ≥ 0    (A.2)
Both of these conditions must be satisfied for all permissible perturbations T. The critical β_c at which a phase transition is initiated is a temperature that marks the transition between a solution with positive second derivative for all perturbations T and a solution with zero second derivative for some perturbation. To examine the conditions more closely, we first write out equation A.1 explicitly,

∑_x ∑_j P_L[x ∈ C_j] w_j^T (x − y_j) = 0    (A.3)

and equation A.2,

∑_j (∑_x P_L[x ∈ C_j]) w_j^T [I − 2βΣ_j] w_j + 2β ∑_x [∑_j P_L[x ∈ C_j] w_j^T (x − y_j)]² ≥ 0    (A.4)

where we have identified the covariance matrix associated with the C_j partition cell:

Σ_j = ∑_x P_L[x ∈ C_j] (x − y_j)(x − y_j)^T / ∑_x P_L[x ∈ C_j]
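Given these cell covariances, the bifurcation test derived in this appendix (loss of positive definiteness of [I − 2βΣ_j], equivalently β ≥ 1/(2λ_max)) can be sketched numerically; the function name is illustrative:

```python
import numpy as np

def ripe_for_split(Sigmas, beta):
    """Flag each cell whose matrix [I - 2*beta*Sigma_j] is no longer positive
    definite, i.e. beta >= 1 / (2 * lambda_max(Sigma_j))."""
    flags = []
    for S in Sigmas:
        lam_max = np.linalg.eigvalsh(S)[-1]   # eigvalsh returns ascending eigenvalues
        flags.append(bool(beta >= 1.0 / (2.0 * lam_max)))
    return flags

# Toy usage: lambda_max = 2 gives a critical beta of 0.25.
Sigma = np.array([[2.0, 0.0], [0.0, 0.5]])
flags_low = ripe_for_split([Sigma], beta=0.2)
flags_high = ripe_for_split([Sigma], beta=0.3)
```

Applied to the covariance of the whole data set at the global centroid, the same test gives the initial critical temperature used to choose the starting β in Table 1.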
Now we wish to show that equation A.4 is positive for all perturbations if and only if the matrices {[I − 2βΣ_j]} are positive definite. The "if" part is trivial, since the second term in equation A.4 is always nonnegative. To show the "only if" part, suppose that for a pair of nondistinct leaf code vectors satisfying y_l = y_m (and also Σ_l = Σ_m), the corresponding matrix [I − 2βΣ_l] = [I − 2βΣ_m] is not positive definite. It thus has a nonpositive
eigenvalue with corresponding eigenvector u. Choose the perturbation with w_l = u, w_m = −u, and w_j = 0, ∀j ≠ l, m. Clearly the first term in equation A.4 is nonpositive and the second term is zero. We have thus shown that bifurcation occurs when some of the matrices {[I − 2βΣ_j]} stop being positive definite. The critical β is, therefore, β_c = 1/(2λ_max), where λ_max is the largest eigenvalue of Σ_l. Moreover, the split is initiated along the axis of the principal eigenvector of Σ_l.

Acknowledgments
This work was supported in part by the National Science Foundation under Grant NCR-9314335, the University of California MICRO program, DSP Group, Inc., Echo Speech Corporation, Moseley Associates, Rockwell International Corporation, Speech Technology Labs, Texas Instruments, Inc., and Qualcomm, Inc.

References

Ball, G., and Hall, D. 1967. A clustering technique for summarizing multivariate data. Behav. Sci. 12, 153-155.
Bezdek, J. C. 1980. A convergence theorem for the fuzzy ISODATA clustering algorithms. IEEE Trans. Patt. Anal. Mach. Intell. PAMI-2, 1-8.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. 1980. Classification and Regression Trees. The Wadsworth Statistics/Probability Series, Belmont, CA.
Buhmann, J., and Kuhnel, H. 1993. Complexity optimized data clustering by competitive neural networks. Neural Comp. 5, 75-88.
Buhmann, J., and Kuhnel, H. 1994. Vector quantization with complexity costs. IEEE Trans. Inform. Theory 39, 1133-1145.
Buzo, A., Gray, Jr., A. H., Gray, R. M., and Markel, J. D. 1980. Speech coding based on vector quantization. IEEE Trans. Acoust., Speech, Sig. Proc. 28, 562-574.
Chou, P., Lookabaugh, T., and Gray, R. M. 1989. Optimal pruning with applications to tree-structured source coding and modeling. IEEE Trans. Inform. Theory 35, 299-315.
Cover, T. M., and Thomas, J. A. 1991. Elements of Information Theory. John Wiley, New York.
Dempster, A., Laird, N., and Rubin, D. 1977. Maximum-likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc., Ser. B 39, 1-38.
Duda, R. O., and Hart, P. E. 1974. Pattern Classification and Scene Analysis. Wiley-Interscience, New York.
Dunn, J. C. 1974. A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. J. Cybern. 3, 32-57.
Durbin, R., and Willshaw, D. 1987. An analogue approach to the travelling salesman problem using an elastic net method. Nature (London) 326, 689-691.
Durbin, R., Szeliski, R., and Yuille, A. 1989. An analysis of the elastic net approach to the travelling salesman problem. Neural Comp. 1, 348-358.
Farrell, K. R., Mammone, R. J., and Assaleh, K. T. 1994. Speaker recognition using neural networks and conventional classifiers. IEEE Trans. Speech Audio Proc. 2, 194-205.
Geiger, D., and Girosi, F. 1991. Parallel and deterministic algorithms from MRFs: Surface reconstruction. IEEE Trans. Patt. Anal. Mach. Intell. 13, 401-412.
Geman, S., and Geman, D. 1984. Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Trans. Patt. Anal. Mach. Intell. PAMI-6, 721-741.
Gersho, A., and Gray, R. M. 1992. Vector Quantization and Signal Compression. Kluwer Academic Publishers, Boston, MA.
Hartigan, J. A. 1975. Clustering Algorithms. John Wiley, New York.
Jain, A. K., and Dubes, R. C. 1988. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, NJ.
Jaynes, E. T. 1989. Information theory and statistical mechanics. In Papers on Probability, Statistics and Statistical Physics, R. D. Rosenkrantz, ed. Kluwer Academic Publishers, Dordrecht, The Netherlands. (Reprint of the original 1957 papers in Phys. Rev.)
Jordan, M. I., and Jacobs, R. A. 1994. Hierarchical mixtures of experts and the EM algorithm. Neural Comp. 6, 181-214.
Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. 1983. Optimization by simulated annealing. Science 220, 671-680.
Kullback, S. 1969. Information Theory and Statistics. Dover, New York.
Linde, Y., Buzo, A., and Gray, R. M. 1980. An algorithm for vector quantizer design. IEEE Trans. Commun. COM-28, 84-95.
Lippmann, R. P. 1989. Pattern classification using neural networks. IEEE Commun. Mag. 47-64.
Lloyd, S. P. 1982. Least squares quantization in PCM. IEEE Trans. Inform. Theory IT-28, 129-137. (Reprint of the 1957 paper.)
MacQueen, J. 1967. Some methods for classification and analysis of multivariate observations. Proc. Fifth Berkeley Symp. Math. Stat. Prob. I, 281-297.
Miller, D., and Rose, K. 1994a. Combined source-channel vector quantization using deterministic annealing. IEEE Trans. Commun. 42, 347-356.
Miller, D., and Rose, K. 1994b. A non-greedy approach to tree-structured clustering. Patt. Rec. Lett. 15, 683-690.
Miller, D., Rose, K., and Chou, P. A. 1994. Deterministic annealing for trellis quantizer and HMM design using Baum-Welch re-estimation. IEEE Int. Conf. Acoust. Speech Sig. Proc., Adelaide, Australia.
Miller, D., Rao, A., Rose, K., and Gersho, A. 1995. An information-theoretic learning algorithm for neural-network classification. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, eds., Advances in Neural Information Processing 8. MIT Press, Cambridge, MA.
Oehler, K., and Gray, R. M. 1993. Combining image classification and image compression using vector quantization. Proc. IEEE Data Comp. Conf. 2-11.
Riskin, E. A., and Gray, R. M. 1991. A greedy growing algorithm for the design of variable rate vector quantizers. IEEE Trans. Sig. Proc. 39, 2500-2507.
Rose, K. 1994. A mapping approach to rate-distortion computation and analysis. IEEE Trans. Inform. Theory 40, 1939-1952.
Rose, K., Gurewitz, E., and Fox, G. C. 1990. Statistical mechanics and phase transitions in clustering. Phys. Rev. Lett. 65, 945-948.
Rose, K., Gurewitz, E., and Fox, G. C. 1992. Vector quantization by deterministic annealing. IEEE Trans. Inform. Theory 38, 1249-1258.
Rose, K., Gurewitz, E., and Fox, G. C. 1993. Constrained clustering as an optimization method. IEEE Trans. Patt. Anal. Mach. Intell. 15, 785-794.
Shore, J. E., and Johnson, R. W. 1980. Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Trans. Inform. Theory IT-26, 26-37.
Simic, P. D. 1990. Statistical mechanics as the underlying theory of elastic and neural optimization. Network 1, 89-103.
Sokal, R., and Sneath, P. 1963. Principles of Numerical Taxonomy. W. H. Freeman, San Francisco.
Yuille, A. L. 1990. Generalized deformable models, statistical physics, and matching problems. Neural Comp. 2, 1-24.
Yuille, A. L., Stolorz, P., and Utans, J. 1994. Statistical physics, mixtures of distributions, and the EM algorithm. Neural Comp. 6, 334-340.
Received September 30, 1994; accepted June 8, 1995.
Communicated by Nicol Schraudolph
The Interchangeability of Learning Rate and Gain in Backpropagation Neural Networks

Georg Thimm, Perry Moerland, and Emile Fiesler
IDIAP, CH-1920 Martigny, Switzerland

The backpropagation algorithm is widely used for training multilayer neural networks. In this publication the gain of its activation function(s) is investigated. Specifically, it is proven that changing the gain of the activation function is equivalent to changing the learning rate and the weights. This simplifies the backpropagation learning rule by eliminating one of its parameters. The theorem can be extended to hold for some well-known variations on the backpropagation algorithm, such as using a momentum term, flat spot elimination, or adaptive gain. Furthermore, it is successfully applied to compensate for the nonstandard gain of optical sigmoids for optical neural networks.

1 Introduction
When using the backpropagation algorithm¹ to train a multilayer neural network, one is free to choose parameters like the initial weight distribution, learning rate, activation function, network topology, and gain of the activation function. A common choice for the activation function φ of a neuron in a multilayer neural network is the logistic or sigmoid function:

φ(x) = γ / (1 + e^(−βx))   (1.1)

which has a range (0, γ). Alternative choices for φ are a hyperbolic tangent, γ tanh(βx), yielding output values in the range (−γ, γ), and a gaussian function γ e^(−(βx)²) with range (0, γ]. The parameter β is called the gain, and γβ the steepness (slope) of the activation function.² The effect of changing the gain of an activation function is illustrated in Figure 1: the gain scales the activation function in the direction of the horizontal axis.

¹See, for example, Rumelhart et al. (1986).
²Note that gain and steepness are identical for activation functions with γ = 1 (Saxena and Fiesler 1995). The term temperature is sometimes used as a synonym for the reciprocal of gain.
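The horizontal scaling described above can be checked numerically; the following is a minimal sketch (not from the paper), using the logistic of equation 1.1 with γ = 1: a gain-β logistic is exactly a gain-1 logistic evaluated at βx.

```python
import math

def logistic(x, gain=1.0, gamma=1.0):
    """Logistic activation of equation 1.1: gamma / (1 + exp(-gain * x))."""
    return gamma / (1.0 + math.exp(-gain * x))

# The gain scales the function along the horizontal axis (cf. Figure 1):
beta = 4.0
for x in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    assert abs(logistic(x, gain=beta) - logistic(beta * x, gain=1.0)) < 1e-12
```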
Neural Computation 8, 451-460 (1996)
@ 1996 Massachusetts Institute of Technology
Figure 1: A logistic and a gaussian function of gain one (solid lines) and their four times steeper counterparts (dotted lines).

This publication proves that a relationship between gain, learning rate, and weights in backpropagation neural networks exists; this is followed by the implications of this relationship for variations of the backpropagation algorithm. Finally, a direct application of the relationship to the implementation of neural networks with optical activation functions with a nonstandard gain is presented. Several other authors hypothesized about the existence of a relationship between the gain of the activation function and the weights (Codrington and Tenorio 1994; Wessels and Barnard 1992),³ or between the gain and the learning rate (Kruschke and Movellan 1991; Mundie and Massengill 1992; Zurada 1992; Brown et al. 1993; Brown and Harris 1994). Specifically, Zurada writes: "... leads to the conclusion that using activation functions with large [gain] λ may yield results similar as in the case of large learning constant η. It thus seems advisable to keep λ at a standard value of 1, and to control learning speed using solely the learning constant η."

2 The Relationship between the Gain of the Activation Function, the Learning Rate, and the Weights

The theorem below gives a precise relationship between gain, initial weights, and learning rate for two backpropagation neural networks with the same topology and where corresponding neurons have activation

³Wessels and Barnard (1992) use a weight initialization method that scales the initial weight range according to the gain of the activation function.
Table 1: The Relationship between Activation Function, Gain, Weights, and Learning Rate.

            Activation function   Gain      Learning rate   Weights
Network M   φ(x) = φ̄(βx)          β         η               w
Network N   φ̄(x)                  β̄ = 1     η̄ = β²η         w̄ = βw
functions φ and φ̄ (indices are omitted where corresponding functions or variables have the same index):

φ(x) = φ̄(βx)   (2.1)

This means that corresponding neurons in the two networks have the same activation function, except that the activation functions of M have gain β and those of N have gain 1. A proof of Theorem 1 and a detailed description of the backpropagation algorithm can be found in the appendix.

Theorem 1. Two neural networks M and N of identical topology whose activation function φ, gain β, learning rate η, and weights w are related to each other as given in Table 1 are equivalent under the on-line backpropagation algorithm; that is, when presented the same pattern set in the same order, their outputs are identical.

An increase of the gain with a factor β can therefore be compensated for by dividing the initial weights by β and the learning rate by β².

3 Extensions and Applications of Theorem 1
Many variations of the standard backpropagation algorithm are in use. A list of common variations and their interdependence with Theorem 1 is presented here. The corresponding proofs are omitted, as they are analogous to the proof of Theorem 1.

Momentum (Rumelhart et al. 1986): Theorem 1 holds when both networks have identical momentum terms.

Batch or off-line learning: Theorem 1 holds without modification of the network parameters when off-line learning is used.

Flat spot elimination (Fahlman 1988): Theorem 1 holds if the constant c (0.1 in Fahlman's paper) added to the derivative in network N equals c/β.

Weight discretization with multiple thresholding of the real-valued weights (Fiesler et al. 1996) requires an adaptation of the threshold values for the weight discretization. If d and d̄ are the discretization functions applied on the weights, Theorem 1 holds if ∀x : βd(x) = d̄(βx).
Adaptive gain (Plaut et al. 1986; Bachmann 1990; Kruschke and Movellan 1991): A change β → β + Δβ of the gain can be expressed as a change of the learning rate from β²η to (β + Δβ)²η, and of the weights from βw to (β + Δβ)w, without changing the gain of the activation functions.

The use of steeper activation functions for decreasing the convergence time (Izui and Pentland 1990; Cho et al. 1991) is equivalent to using a higher learning rate and a bigger weight range according to Theorem 1.

Approaching hard limiting thresholds by increasing the gain of the activation functions (Corwin et al. 1994; Yu et al. 1994) is similar to multiplying the weights with a constant greater than one. In the final stage of the training process the activation functions can be replaced by a threshold if this does not cause a degradation in performance.

A major problem in using optical activation functions is their nonstandard gain (Saxena and Fiesler 1995). In Figure 2, an optical activation function and its approximation by a shifted sigmoid, with a gain of approximately 1/24, are depicted. Note that the domain of the optical activation function is restricted to positive values due to constraints imposed by the optical implementation. Using this optical activation function in a backpropagation algorithm with a normal learning rate, say 0.3, and a normal initial weight interval, say [−0.5, 0.5], leads to very slow convergence. Theorem 1 explains why: this choice of parameters corresponds to having an activation function of gain one and a small learning rate of (1/24)² × 0.3 ≈ 0.00052. Theorem 1 also gives a way to overcome this problem: choose a learning rate of 24² × 0.3 = 172.8 and an initial weight interval of [−(24 × 0.5), 24 × 0.5] = [−12, 12]. The neural network using these adapted parameters performed well.

4 Conclusions
The gain of the activation function and the learning rate in backpropagation neural networks are exchangeable. More precisely, there exists a well-defined relationship between the gain, the learning rate, and the initial weights. This relationship is presented as a theorem that is accompanied by a detailed, yet easy to understand, proof. The theorem also holds for several variations of the backpropagation algorithm, like the use of momentum terms, adaptive gain algorithms, and the weight discretization method of Fiesler et al. A direct application area of the theorem is analog neural network hardware implementations, where the possible activation functions are limited by the available components. One can compensate for their nonstandard gain by modifying the learning rate and initial weights according to the theorem to optimize the performance of the neural network.
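The compensation just summarized can be spelled out numerically; this trivial sketch (not from the paper) restates the Section 3 example of an optical sigmoid with gain approximately 1/24:

```python
# Compensating a nonstandard activation gain per Theorem 1:
# dividing the gain by some factor is undone by multiplying the
# learning rate by that factor squared and the weights by the factor.
beta = 1.0 / 24.0          # gain of the (approximated) optical sigmoid
eta = 0.3                  # "normal" learning rate for a gain-1 sigmoid
w_max = 0.5                # "normal" initial weight bound

eta_adjusted = eta / beta ** 2     # 24^2 * 0.3 = 172.8
w_max_adjusted = w_max / beta      # 24 * 0.5  = 12.0

assert abs(eta_adjusted - 172.8) < 1e-9
assert abs(w_max_adjusted - 12.0) < 1e-9
```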
Figure 2: An optical activation function (solid line) and its approximation by a shifted sigmoid (dotted line). 5 Appendix
Before proving Theorem 1, a generalization of the on-line backpropagation learning rule (Rumelhart et al. 1986) is described, in which every neuron has its own (local) learning rate and activation function. The standard case of a unique learning rate and activation function corresponds to all local learning rates and activation functions being equal for the whole network. The following notation and nomenclature is used (Fiesler 1994): a (multilayer) neural network can have an arbitrary number of layers, denoted by L. The number of neurons in layer ℓ (1 ≤ ℓ ≤ L) is denoted by N_ℓ, and the neurons in layer ℓ are numbered from 1 up to N_ℓ. Layer 1 is the input layer and layer L is the output layer of the network. Adjacent layers are fully interlayer connected. The weight from neuron i to neuron j in the next layer ℓ is denoted by w_{ℓ,i,j}. The activation value of this neuron is indicated as a_{ℓ,j} (j > 0), and t_j denotes the target pattern value for output neuron j. To simplify the notation the convention is used that a_{ℓ−1,0} = 1, so that the bias (or offset) is w_{ℓ,0,j}. The backpropagation algorithm is described by the following five steps:

1. Initialization: Weights and biases are initialized with random values.⁴

⁴See Thimm and Fiesler (1996) for an in-depth study of weight initialization.
2. Pattern presentation: An input pattern, which is used to initialize the activation values of the neurons in the input layer, and its corresponding target pattern are presented.

3. Forward propagation: During this phase, the activation values of the input neurons are propagated layer-wise through the network. The new activation value of neuron j in layer ℓ (2 ≤ ℓ ≤ L) is

a_{ℓ,j} = 1 if j = 0;  a_{ℓ,j} = φ_{ℓ,j}(net_{ℓ,j}) otherwise   (5.1)

where the input of a neuron j, not in the input layer, is defined as

net_{ℓ,j} = Σ_{i=0}^{N_{ℓ−1}} w_{ℓ,i,j} a_{ℓ−1,i}   (5.2)

and φ_{ℓ,j} is a differentiable activation function, for example, the logistic function (equation 1.1).

4. Backward propagation: For each neuron an error signal δ is calculated, starting at the output layer and then propagating it back through the network:

δ_{L,j} = φ′_{L,j}(net_{L,j}) (t_j − a_{L,j})   for the output layer L
δ_{ℓ,j} = φ′_{ℓ,j}(net_{ℓ,j}) Σ_k δ_{ℓ+1,k} w_{ℓ+1,j,k}   for layers 2 ≤ ℓ ≤ L − 1   (5.3)

After the calculation of all error signals, the weights and biases are updated:

w_{ℓ,i,j} := w_{ℓ,i,j} + η_{ℓ,j} δ_{ℓ,j} a_{ℓ−1,i}   (5.4)

where η_{ℓ,j} denotes the learning rate of neuron j in layer ℓ.

5. Convergence test: If there is no convergence, go to Pattern presentation.

The notation introduced in the formulation of the backpropagation algorithm permits the proper formulation of the proof of Theorem 1. To simplify the notation, the vector of incoming weights of neuron j is denoted by w_{ℓ,j} and the vector of activation values of layer ℓ by a_ℓ. Now, using this notation, equation 5.2 can be rewritten as

net_{ℓ,j} = w_{ℓ,j} · a_{ℓ−1}   (5.5)

where "·" is the inner product operator. The variables of network N (the network with activation functions with gain one) are overlined; for example: n̄et_{ℓ,j} = w̄_{ℓ,j} · ā_{ℓ−1}. The proof of Theorem 1 is separated into two parts. Lemma 1 deals with the forward propagation: the networks have the same output for the same input pattern. Lemma 2 deals with the backward propagation: the conditions for Lemma 1 still hold after a training step.
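The five steps above can be rendered compactly in code. The sketch below is not from the paper: it assumes plain Python, logistic activations (whose derivative is a(1 − a)), and one global learning rate standing in for the per-neuron rates η_{ℓ,j} of the generalized rule.

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init(layers):
    # Step 1 (initialization): random weights; w[l][j][i], where i == 0 is the bias.
    return [[[random.uniform(-0.5, 0.5) for _ in range(layers[l - 1] + 1)]
             for _ in range(layers[l])] for l in range(1, len(layers))]

def forward(w, x):
    # Step 3 (forward propagation): eqs. 5.1-5.2, layer by layer.
    acts = [list(x)]
    for layer in w:
        prev = [1.0] + acts[-1]                  # a_{l-1,0} = 1 carries the bias
        acts.append([sigmoid(sum(wi * ai for wi, ai in zip(wj, prev)))
                     for wj in layer])
    return acts

def backward(w, acts, target, eta):
    # Step 4 (backward propagation): error signals (eq. 5.3), output layer first;
    # a*(1-a) is the logistic derivative phi'(net).
    deltas = [[a * (1 - a) * (t - a) for a, t in zip(acts[-1], target)]]
    for l in range(len(w) - 1, 0, -1):
        deltas.insert(0, [a * (1 - a) * sum(d * w[l][k][j + 1]
                                            for k, d in enumerate(deltas[0]))
                          for j, a in enumerate(acts[l])])
    # ... then the weight update (eq. 5.4).
    for l, layer in enumerate(w):
        prev = [1.0] + acts[l]
        for j, wj in enumerate(layer):
            for i in range(len(wj)):
                wj[i] += eta * deltas[l][j] * prev[i]

# Steps 2 and 5: present a pattern repeatedly; the error should shrink.
random.seed(0)
w = init([2, 2, 1])
pattern, target = [0.3, 0.9], [0.8]
err0 = (forward(w, pattern)[-1][0] - target[0]) ** 2
for _ in range(200):
    backward(w, forward(w, pattern), target, eta=0.5)
assert (forward(w, pattern)[-1][0] - target[0]) ** 2 < err0
```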
Lemma 1. Two networks M and N, satisfying the preconditions given in Theorem 1, have the same activation values for corresponding neurons, that is, a_ℓ = ā_ℓ for all ℓ, if a_1 = ā_1 is forward propagated.

Proof. By induction on the number of layers, starting at the input layer.

Induction base: The activation values of the input layer neurons of the networks M and N are identical, since the input patterns are identical (a_1 = ā_1).

Induction step: For neuron j, not in the input layer:

a_{ℓ,j} = ā_{ℓ,j}
⇔ φ_{ℓ,j}(net_{ℓ,j}) = φ̄_{ℓ,j}(n̄et_{ℓ,j})   using equation 5.1; trivially fulfilled for j = 0
⇔ φ̄_{ℓ,j}(β_{ℓ,j} net_{ℓ,j}) = φ̄_{ℓ,j}(n̄et_{ℓ,j})   using equation 2.1: φ_{ℓ,j}(x) = φ̄_{ℓ,j}(β_{ℓ,j} x)
⇐ β_{ℓ,j} net_{ℓ,j} = n̄et_{ℓ,j}
⇔ β_{ℓ,j}(w_{ℓ,j} · a_{ℓ−1}) = w̄_{ℓ,j} · ā_{ℓ−1}   using equation 5.5
⇔ β_{ℓ,j}(w_{ℓ,j} · a_{ℓ−1}) = (β_{ℓ,j} w_{ℓ,j}) · ā_{ℓ−1}   on account of w̄_{ℓ,j} = β_{ℓ,j} w_{ℓ,j}

which is true on account of the induction hypothesis that the activation values in the lower layers are identical. Note that in the course of the proof it has also been shown that β_{ℓ,j} net_{ℓ,j} = n̄et_{ℓ,j}. □

In the proof of Lemma 1 the property w̄_{ℓ,j} = β_{ℓ,j} w_{ℓ,j} is used. Since the backward propagation changes the weights, it has to be shown that this property is an invariant of the backward propagation step.
Lemma 2. Consider networks M and N, with ā_{ℓ,j} = a_{ℓ,j} and n̄et_{ℓ,j} = β_{ℓ,j} net_{ℓ,j} (for all ℓ and j); then, for all j and ℓ,

w̄_{ℓ,j} = β_{ℓ,j} w_{ℓ,j}   (5.6)

is invariant under the backward propagation step (if the same input and target patterns are propagated through the networks).

Proof. Let Δw_{ℓ,j} denote the weight change η_{ℓ,j} δ_{ℓ,j} a_{ℓ−1}. One observes that equation 5.6 holds if and only if Δw̄_{ℓ,j} = β_{ℓ,j} Δw_{ℓ,j} (for all j and ℓ). Manipulating this expression:

Δw̄_{ℓ,j} = β_{ℓ,j} Δw_{ℓ,j}
⇔ η̄_{ℓ,j} δ̄_{ℓ,j} ā_{ℓ−1} = β_{ℓ,j} η_{ℓ,j} δ_{ℓ,j} a_{ℓ−1}   definition of Δw_{ℓ,j}
⇔ β_{ℓ,j}² η_{ℓ,j} δ̄_{ℓ,j} = β_{ℓ,j} η_{ℓ,j} δ_{ℓ,j}   since η̄_{ℓ,j} = β_{ℓ,j}² η_{ℓ,j} and ā_ℓ = a_ℓ

hence, it needs to be shown that β_{ℓ,j} δ̄_{ℓ,j} = δ_{ℓ,j}, which is done by an induction on the number of layers, starting at the output layer.
Induction base: For an output neuron j, using equation 5.3 and the derivative relation φ′_{ℓ,j}(x) = β_{ℓ,j} φ̄′_{ℓ,j}(β_{ℓ,j} x) (which follows from equation 2.1):

β_{L,j} δ̄_{L,j} = β_{L,j} φ̄′_{L,j}(n̄et_{L,j}) (t_j − ā_{L,j}) = φ′_{L,j}(net_{L,j}) (t_j − a_{L,j}) = δ_{L,j}

Induction step: For a neuron j not in the output layer (ℓ < L):

β_{ℓ,j} δ̄_{ℓ,j} = β_{ℓ,j} φ̄′_{ℓ,j}(n̄et_{ℓ,j}) Σ_k δ̄_{ℓ+1,k} w̄_{ℓ+1,j,k} = φ′_{ℓ,j}(net_{ℓ,j}) Σ_k (δ_{ℓ+1,k} / β_{ℓ+1,k}) (β_{ℓ+1,k} w_{ℓ+1,j,k}) = δ_{ℓ,j}

An induction over the number of pattern presentations, using these two lemmas, concludes the proof of Theorem 1. □
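Theorem 1 can also be checked numerically. The sketch below is not from the paper: it runs one on-line backpropagation step on a hypothetical 1-2-1 logistic network without biases, once as network M (gain β) and once as network N (gain 1, weights βw, learning rate β²η), and verifies that the outputs coincide and that the correspondence w̄ = βw survives the update.

```python
import math

def sigma(x):                          # gain-1 logistic
    return 1.0 / (1.0 + math.exp(-x))

def train_step(w_h, w_o, beta, eta, x, t):
    """One on-line backprop step for a 1-2-1 net whose units use the
    logistic with gain beta, i.e. phi(x) = sigma(beta * x).
    Returns the pre-update output and the updated weights."""
    net_h = [w * x for w in w_h]
    a_h = [sigma(beta * n) for n in net_h]            # hidden activations
    net_o = sum(w * a for w, a in zip(w_o, a_h))
    a_o = sigma(beta * net_o)                         # network output
    # phi'(net) = beta * a * (1 - a), since a = sigma(beta * net)
    d_o = beta * a_o * (1 - a_o) * (t - a_o)
    new_w_o = [w + eta * d_o * a for w, a in zip(w_o, a_h)]
    d_h = [beta * a * (1 - a) * d_o * w for a, w in zip(a_h, w_o)]
    new_w_h = [w + eta * d * x for w, d in zip(w_h, d_h)]
    return a_o, new_w_h, new_w_o

beta, eta = 4.0, 0.3                   # arbitrary example values
w_h, w_o = [0.2, -0.4], [0.5, 0.1]
x, t = 0.7, 1.0

# Network M: gain beta, weights w, learning rate eta.
out_m, wh_m, wo_m = train_step(w_h, w_o, beta, eta, x, t)
# Network N: gain 1, weights beta*w, learning rate beta^2 * eta (Table 1).
out_n, wh_n, wo_n = train_step([beta * w for w in w_h],
                               [beta * w for w in w_o],
                               1.0, beta ** 2 * eta, x, t)

assert abs(out_m - out_n) < 1e-12                 # identical outputs
# The invariant w_N = beta * w_M survives the update (Lemma 2):
assert all(abs(a - beta * b) < 1e-12 for a, b in zip(wh_n, wh_m))
assert all(abs(a - beta * b) < 1e-12 for a, b in zip(wo_n, wo_m))
```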
Addendum. For completeness the authors would like to include the reference to a letter in Neural Networks (Jia et al. 1994) that was brought to their attention after the submission of this paper to Neural Computation, in which a similar theorem is presented, albeit without proof or applications. The theorem includes momentum and is related to Izui and Pentland (1990) and Sperduti and Starita (1993) by the authors.
References
Bachmann, C. M. 1990. Learning and generalization in neural networks. Ph.D. thesis, Department of Physics, Brown University, Providence, RI.
Brown, M., An, P. C., Harris, C. J., and Wang, H. 1993. How biased is your multi-layer perceptron? World Congress on Neural Networks 3, 507-511.
Brown, M., and Harris, C. 1994. Neurofuzzy Adaptive Modelling and Control. Prentice Hall International Series in Systems and Control Engineering, M. J. Grimble, ed. Prentice-Hall, Englewood Cliffs, NJ.
Cho, T.-H., Conners, R. W., and Araman, P. A. 1991. Fast back-propagation learning using steep activation functions and automatic weight reinitialization. Proc. 1991 IEEE Int. Conf. Systems, Man, Cybernetics: Decision Aiding for Complex Systems 3, 1587-1592.
Codrington, C., and Tenorio, M. 1994. Adaptive gain networks. Proc. IEEE Int. Conf. Neural Networks (ICNN94) 1, 339-344.
Corwin, E., Logar, A., and Oldham, W. 1994. An iterative method for training multilayer networks with threshold functions. IEEE Trans. Neural Networks 5(3), 507-508.
Fahlman, S. E. 1988. An Empirical Study of Learning Speed in Backpropagation Networks. Tech. Rep. CMU-CS-88-162, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.
Fiesler, E. 1994. Neural network classification and formalization. Computer Standards & Interfaces, special issue on Neural Network Standardization, J. Fulcher, ed., Vol. 16, No. 3, pp. 231-239. North-Holland/Elsevier Science Publishers, Amsterdam, The Netherlands.
Fiesler, E., Choudry, A., and Caulfield, H. J. 1996. A universal weight discretization method for multi-layer neural networks. IEEE Transactions on Systems, Man, and Cybernetics (conditionally accepted for publication). See also Fiesler, E., Choudry, A., and Caulfield, H. J. 1990. A weight discretization paradigm for optical neural networks. Proc. Int. Congr. Optical Sci. Eng., SPIE 1281, 164-173.
Izui, Y., and Pentland, A. 1990. Analysis of neural networks with redundancy. Neural Comp. 2, 226-238.
Jia, Q., Hagiwara, K., Toda, N., and Usui, S. 1994. Equivalence relation between the backpropagation learning process of an FNN and that of an FNNG. Neural Networks 7(2), 411.
Kruschke, J. K., and Movellan, J. R. 1991. Benefits of gain: Speeded learning and minimal hidden layers in backpropagation networks. IEEE Trans. Syst. Man Cybern. 21(1), 273-280.
Mundie, D. B., and Massengill, L. W. 1992. Threshold non-linearity effects on weight-decay tolerance in analog neural networks. Proc. Int. Joint Conf. Neural Networks (IJCNN92) 2, 583-587.
Plaut, D., Nowlan, S., and Hinton, G. 1986. Experiments on Learning by Back Propagation. Tech. Rep. CMU-CS-86-126, Carnegie Mellon University, Pittsburgh, PA.
Rumelhart, D., Hinton, G., and Williams, R. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, pp. 318-362. MIT Press, Cambridge, MA.
Saxena, I., and Fiesler, E. 1995. Adaptive multilayer optical neural network with optical thresholding. Optical Engineering, special issue on Optics in Switzerland, P. Rastogi, ed., Vol. 34(8), pp. 2435-2440.
Sperduti, A., and Starita, A. 1993. Speed up learning and network optimization with extended backpropagation. Neural Networks 6, 365-383.
Thimm, G., and Fiesler, E. 1996. Weight initialization for high order and multilayer perceptrons. IEEE Trans. Neural Networks (conditionally accepted for
publication). See also Thimm, G., and Fiesler, E. 1994. Weight initialization for high order and multilayer perceptrons. In Proceedings of the '94 SIPAR Workshop on Parallel and Distributed Computing, M. Aguilar, ed., pp. 91-94. Institute of Informatics, University Pérolles, Chemin du Musée 3, Fribourg, Switzerland. SI Group for Parallel Systems.
Wessels, L. F. A., and Barnard, E. 1992. Avoiding false local minima by proper initialization of connections. IEEE Trans. Neural Networks 3, 899-905.
Yu, X., Loh, N., and Miller, W. 1994. Training hard-limiting neurons using back-propagation algorithm by updating steepness factors. Proc. IEEE Int. Conf. Neural Networks (ICNN94) 1, 526-530.
Zurada, J. M. 1992. Introduction to Artificial Neural Systems. West Publishing Company, St. Paul, MN.
Received November 29, 1994, accepted March 13, 1995
ARTICLE
Communicated by Chris Bishop and Fernando Pineda
A Smoothing Regularizer for Feedforward and Recurrent Neural Networks

Lizhong Wu, John Moody
Computer Science Department, Oregon Graduate Institute, Portland, OR 97291-1000 USA

We derive a smoothing regularizer for dynamic network models by requiring robustness in prediction performance to perturbations of the training data. The regularizer can be viewed as a generalization of the first-order Tikhonov stabilizer to dynamic models. For two-layer networks with recurrent connections described by

Y(t) = f[W Y(t − τ) + V X(t)],   Z(t) = U Y(t)

the training criterion with the regularizer is

D = (1/N) Σ_{t=1}^{N} {Z(t) − F[Θ, I(t)]}² + λ ρ_τ²(Θ)

where Θ = {U, V, W} is the network parameter set, Z(t) are the targets, I(t) = {X(s), s = 1, 2, ..., t} represents the current and all historical input information, N is the size of the training data set, ρ_τ²(Θ) is the regularizer, and λ is a regularization parameter. The closed-form expression for the regularizer for time-lagged recurrent networks is derived below, where ‖·‖ is the Euclidean matrix norm and γ is a factor that depends upon the maximal value of the first derivatives of the internal unit activations f( ). Simplifications of the regularizer are obtained for simultaneous recurrent nets (τ → 0), two-layer feedforward nets, and one-layer linear nets. We have successfully tested this regularizer in a number of case studies and found that it performs better than standard quadratic weight decay.

1 Introduction
One technique for preventing a neural network from overfitting noisy data is to add a regularizer to the error function being minimized.

Neural Computation 8, 461-489 (1996)
© 1996 Massachusetts Institute of Technology

Regularizers typically smooth the fit to noisy data.¹ Well-established techniques include ridge regression (see Hoerl and Kennard 1970a,b) and, more generally, spline smoothing functions or Tikhonov stabilizers that penalize the mth-order squared derivatives of the function being fit, as in Tikhonov and Arsenin (1977), Eubank (1988), Hastie and Tibshirani (1990), and Wahba (1990). These methods have recently been extended to networks of radial basis functions (Powell 1987; Poggio and Girosi 1990; Girosi et al. 1995), and several heuristic approaches have been developed for sigmoidal neural networks, for example, quadratic weight decay (Plaut et al. 1986), weight elimination (Rumelhart 1986; Scalettar and Zee 1988; Chauvin 1990; Weigend et al. 1990), soft weight sharing (Nowlan and Hinton 1992), and curvature-driven smoothing (Bishop 1993).² Quadratic weight decay (which is equivalent to ridge regression) and weight elimination are frequently used "on-line" during stochastic gradient learning. The other regularizers listed above are not generally used with on-line algorithms, but rather with "batch" or deterministic optimization methods. All previous studies on regularization have concentrated on feedforward neural networks. To our knowledge, recurrent learning with regularization has not been reported before. In Section 2 of this paper, we develop a smoothing regularizer for general dynamic models, which is derived by considering perturbations of the training data. We demonstrate that this regularizer corresponds to a dynamic generalization of the well-known first-order Tikhonov stabilizer. We then present a closed-form expression for our regularizer for two-layer feedforward and recurrent neural networks, with standard weight decay being a special case. In Section 3, we evaluate our regularizer's performance on a number of applications, including regression with feedforward and recurrent neural networks and predicting the U.S. Index of Industrial Production.
The advantage of our regularizer is demonstrated by comparing it to standard weight decay in both feedforward and recurrent modeling. Finally, we discuss several related questions and conclude the paper in Section 4.
2 Smoothing Regularization
2.1 Prediction Error for Perturbed Data Sets. Consider a training data set {P : Z(t), X(t)}, where the targets Z(t) are assumed to be generated by an unknown dynamic system F*[I(t)] and an unobserved noise process:

Z(t) = F*[I(t)] + ε*(t)   with I(t) = {X(s), s = 1, 2, ..., t}   (2.1)

Here, I(t) is the information set containing both current and past inputs X(s), and the ε*(t) are independent random noise variables with zero mean and variance σ*². Consider next a dynamic network model Ẑ(t) = F[Θ, I(t)] to be trained on data set P, where Θ represents a set of network parameters, and F( ) is a network transfer function, which is assumed to be nonlinear and dynamic. We assume that F( ) has good approximation capabilities, such that F[Θ_P, I(t)] ≈ F*[I(t)] for learnable parameters Θ_P. Our goal is to derive a smoothing regularizer for a network trained on the actual data set P that in effect optimizes the expected network performance (prediction risk) on perturbed test data sets of form {Q : Z̃(t), X̃(t)}. The elements of Q are related to the elements of P via small random perturbations ε_z(t) and ε_x(t), so that

Z̃(t) = Z(t) + ε_z(t)   (2.2)
X̃(t) = X(t) + ε_x(t)   (2.3)

¹Other techniques to prevent overfitting include early stopping of training [which can be viewed as having an effect similar to weight decay (Sjöberg and Ljung 1992, 1995)] and using prior knowledge in the form of hints (see Abu-Mostafa 1995; Tresp et al. 1993, and references therein). Smoothing regularizers can be viewed as a special class of hints.
²Two additional papers related to ours, but dealing only with feedforward networks, came to our attention or were written after our work was completed. These are Bishop (1995) and Leen (1995). Also, Moody and Rognvaldsson (1995) have recently proposed two new classes of smoothing regularizers for feedforward nets.
The ε_z(t) and ε_x(t) have zero mean and variances σ_z² and σ_x², respectively. The training and test errors for the data sets P and Q are

D_P = (1/N) Σ_{t=1}^{N} {Z(t) − F[Θ_P, I(t)]}²   (2.4)

D_Q = (1/N) Σ_{t=1}^{N} {Z̃(t) − F[Θ_P, Ĩ(t)]}²   (2.5)

where Θ_P denotes the network parameters obtained by training on data set P, and Ĩ(t) = {X̃(s), s = 1, 2, ..., t} is the perturbed information set of Q. With this notation, our goal is to minimize the expected value of D_Q, while training on D_P. Consider the prediction error for the perturbed data point at time t:

d̃(t) = {Z̃(t) − F[Θ_P, Ĩ(t)]}²   (2.6)
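To make the definitions above concrete, the following small sketch (entirely hypothetical: a memoryless linear model standing in for F, gaussian perturbations) computes D_P and D_Q for one synthetic data set:

```python
import random
random.seed(1)

N = 200
X = [random.uniform(-1, 1) for _ in range(N)]
Z = [2.0 * x + random.gauss(0, 0.1) for x in X]   # unknown system plus noise

theta_p = 2.0                                     # stand-in "trained" parameter
def F(theta, x):                                  # memoryless model for brevity
    return theta * x

# Perturbed test set Q (eqs. 2.2 and 2.3):
Zq = [z + random.gauss(0, 0.02) for z in Z]
Xq = [x + random.gauss(0, 0.05) for x in X]

D_P = sum((z - F(theta_p, x)) ** 2 for z, x in zip(Z, X)) / N     # eq. 2.4
D_Q = sum((z - F(theta_p, x)) ** 2 for z, x in zip(Zq, Xq)) / N   # eq. 2.5

assert D_P > 0 and D_Q > 0
```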
Assuming that ε_z(t) is uncorrelated with {Z(t) − F[Θ_P, Ĩ(t)]} and averaging over the exemplars of data sets P and Q, equation 2.7 becomes equation 2.8. The third term of equation 2.8, Σ_{t=1}^{N} [ε_z(t)]², is independent of the weights, so it can be neglected during the learning process. The fourth term in equation 2.8 is the cross-covariance between {Z(t) − F[Θ_P, I(t)]} and {F[Θ_P, I(t)] − F[Θ_P, Ĩ(t)]}. We argue in Appendix A that this term can also be neglected.
2.2 The Dynamic Smoothing Regularizer and Tikhonov Correspondence. The above analysis shows that the expected test error D_Q can be minimized by minimizing the objective function D given in equation 2.9. In equation 2.9, the second term is the time average of the squared disturbance ‖Z̃(t) − Z(t)‖² of the trained network output due to the input perturbation ‖Ĩ(t) − I(t)‖². Minimizing this term demands that small changes in the input variables yield correspondingly small changes in the output. This is the standard smoothness prior, namely that if nothing else is known about the function to be approximated, a good option is to assume a high degree of smoothness. Without knowing the correct functional form of the dynamic system F* or using such prior assumptions, the data-fitting problem is ill-posed. It is straightforward to see that the second term in equation 2.9 corresponds to the standard first-order Tikhonov stabilizer.³ Expanding to first order in the input perturbations ε_x, the expectation value of this term becomes equation 2.10.

³Bishop (1995) has independently made this observation for the case of feedforward networks.
If the dynamics are trivial, so that the mapping F* has no recurrence, then equation 2.11 holds and equation 2.10 reduces to equation 2.12, the usual first-order Tikhonov stabilizer weighted by the empirical distribution. In equations 2.12 and 2.10, σ_x² plays the role of a regularization parameter λ that determines the compromise between the degree of smoothness of the solution and its fit to the noisy training data. This is the usual bias/variance tradeoff (see Geman et al. 1992). A reasonable choice for the value of σ_x² is to set it proportional to the average squared nearest-neighbor distance in the input data. For normalized input data (e.g., where each variable has zero mean and unit variance), one can estimate the average squared nearest-neighbor distance as

σ_x² ≈ K N^(−2/D)   (2.13)

where D is the intrinsic dimension of the input data (less than or equal to the number of input variables) and K is a geometric factor (of order unity in low dimensions). To summarize this section, the training objective function D of equation 2.9 can be written in approximate form as equation 2.14,
where the second term is a dynamic generalization of the first-order Tikhonov stabilizer.

2.3 Form of the Proposed Smoothing Regularizer for Two-Layer Networks. Consider a general, two-layer, nonlinear, dynamic network with recurrent connections on the internal layer⁴ as described by
Y(t) = f[W Y(t − τ) + V X(t)]
Z(t) = U Y(t)   (2.15)

where X(t), Y(t), and Z(t) are, respectively, the network input vector, the hidden output vector, and the network output; Θ = {U, V, W} contains the output, input, and recurrent connections of the network; f( ) is the vector-valued nonlinear transfer function of the hidden units; and τ is

⁴Our derivation can easily be extended to other network structures.
a time delay in the feedback connections of the hidden layer, which is predefined by the user and is not changed during learning. τ can be zero, a fraction, or an integer, but we are interested in the cases with a small τ.⁵ When τ = 1, our model is a recurrent network as described by Elman (1990) and Rumelhart et al. (1986) (see Fig. 17 on p. 355). When τ is equal to some fraction smaller than one, the network evolves 1/τ times within each input time interval.⁶ When τ decreases and approaches zero, our model is the same as the network studied by Pineda (1989), and earlier, widely studied recurrent networks⁷ (see, for example, Grossberg 1969; Amari 1972; Sejnowski 1977; Hopfield 1984). In Pineda (1989), τ was referred to as the network relaxation time scale. Werbos (1992) distinguished the recurrent networks with zero τ and nonzero τ by calling them simultaneous recurrent networks and time-lagged recurrent networks, respectively. We show in Appendix B that minimizing the second term of equation 2.9 can be achieved by smoothing the output response to an input perturbation at every time step, and we have

‖Z̃(t) − Z(t)‖² ≤ ρ_τ²(Θ_P) ‖X̃(t) − X(t)‖²   for t = 1, 2, ..., N   (2.16)

We call ρ_τ²(Θ_P) the output sensitivity of the trained network Θ_P to an input perturbation. ρ_τ²(Θ_P) is determined by the network parameters only and is independent of the time variable t. We obtain our new regularizer by training directly on the expected prediction error for perturbed data sets Q. Based on the analysis leading to equations 2.9 and 2.16, the training criterion thus becomes

D = (1/N) Σ_{t=1}^{N} {Z(t) − F[Θ, I(t)]}² + λ ρ_τ²(Θ)   (2.17)

As in equation 2.14, the coefficient λ in equation 2.17 is a regularization parameter that measures the degree of input perturbation ‖Ĩ(t) − I(t)‖². Note that the subscript P has been dropped from Θ, since D is now the training objective function for any set of weights. Also note in comparing equation 2.17 to equation 2.14 that the sum over the past history indexed by s no longer appears, and that a trivial factor (1/N) Σ_{t=1}^{N} 1 = 1 has been dropped. These simplifications are due to our minimizing the zero-memory response at each time step during training, as described after equation B.23 in Appendix B.

⁵When the time delay τ exceeds some critical value, a recurrent network becomes unstable and lies in oscillatory modes. See, for example, Marcus and Westervelt (1989).
⁶When τ is a fraction smaller than one, the hidden node's function can be described by Y(t + kτ − 1) = f{W Y[t + (k − 1)τ − 1] + V X(t)} for k = 1, 2, ..., 1/τ. The input X(t) is kept fixed during the above evolution.
⁷These were called additive networks.
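A forward pass of the model in equation 2.15 can be sketched as follows. This is a minimal rendering, not from the paper, assuming τ = 1 (an Elman-style time-lagged recurrent net), a logistic f, a zero initial state, and hypothetical weight matrices:

```python
import math

def f(v):                                  # hidden-unit nonlinearity (logistic)
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def matvec(M, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

def run(U, V, W, xs):
    """Forward pass of Y(t) = f[W Y(t-1) + V X(t)], Z(t) = U Y(t)."""
    y = [0.0] * len(W)                     # Y(0): zero initial hidden state
    zs = []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(W, y), matvec(V, x))]
        y = f(pre)                         # hidden output vector Y(t)
        zs.append(matvec(U, y))            # network output Z(t)
    return zs

# Hypothetical 1-input, 2-hidden, 1-output network:
U = [[0.5, -0.3]]
V = [[0.8], [-0.6]]
W = [[0.1, 0.2], [0.0, 0.3]]               # small recurrent weights
zs = run(U, V, W, [[0.5], [1.0], [-0.5]])
assert len(zs) == 3 and all(len(z) == 1 for z in zs)
```

Small recurrent weights are chosen so that the stability condition γ‖W‖ < 1 of equation 2.20 holds and input perturbations are damped out over time.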
Smoothing Regularizer for Recurrent Networks
The algebraic form for ρ_τ(Φ), as derived in Appendix B, is (2.18) for time-lagged recurrent networks (τ > 0). Here, || · || denotes the Euclidean matrix norm.⁸ The factor γ depends upon the maximal value of the first derivatives of the activation functions of the hidden units and is given by (2.19), where j is the index of hidden units and o_j(t) is the input to the jth unit. In general, γ ≤ 1.⁹ To ensure stability, and that the effects of small input perturbations are damped out, it is required that

γ||W|| < 1   (2.20)
The regularizer of equation 2.18 can be deduced for simultaneous recurrent networks in the limit τ → 0 by (2.21)
If the network is feedforward, then W = 0 and τ = 0, and equations 2.18 and 2.21 become

ρ(Φ) = γ ||U|| ||V||   (2.22)
Moreover, if there is no hidden layer and the inputs are directly connected to the outputs via U, the network is an ordinary linear model, and we obtain

ρ(Φ) = ||U||   (2.23)
which is standard quadratic weight decay (Plaut et al. 1986), as used in ridge regression (see Hoerl and Kennard 1970a,b). The regularizer (equation 2.22 for feedforward networks and equation 2.18 for recurrent networks) was obtained by requiring smoothness of the network output with respect to perturbations of the data. We therefore refer to it as

⁸The Euclidean norm of a real M × N matrix A is ||A|| = [tr(AᵀA)]^{1/2} = (Σ_{i,j} a_ij²)^{1/2}, where Aᵀ is the transpose of A and a_ij is an element of A.

⁹For instance, f′(x) = [1 − f(x)]f(x) if f(x) = 1/(1 + e^{−x}). In this case, γ = max|f′(x)| = 1/4 at x = 0. If |x| is much larger than 0, then f(x) operates in its asymptotic region, and |f′(x)| will be far less than 1/4. In fact, γ is exponentially small in this case.
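The feedforward special case is simple enough to compute directly. The sketch below is ours, not the authors' code; it assumes the logistic-sigmoid bound γ = 1/4 of footnote 9 and the Euclidean (Frobenius) matrix norm of footnote 8.

```python
import numpy as np

# Sketch of the feedforward special case (equation 2.22): with W = 0 and
# tau = 0, the smoothing regularizer reduces to rho(Phi) = gamma*||U||*||V||.
def frobenius(A):
    """Euclidean matrix norm of footnote 8: sqrt(sum of squared elements)."""
    return np.sqrt(np.sum(np.asarray(A) ** 2))

def smoothing_regularizer_ff(U, V, gamma=0.25):
    """gamma = 1/4 is the logistic-sigmoid bound of footnote 9."""
    return gamma * frobenius(U) * frobenius(V)

def is_stable(W, gamma=0.25):
    """Stability condition of equation 2.20: gamma * ||W|| < 1."""
    return gamma * frobenius(W) < 1.0

rng = np.random.default_rng(1)
U = rng.standard_normal((1, 2))   # hidden-to-output weights
V = rng.standard_normal((2, 1))   # input-to-hidden weights
print(smoothing_regularizer_ff(U, V) > 0.0)  # True
```

Note how, unlike quadratic weight decay, the penalty couples the norms of U and V multiplicatively.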
Lizhong Wu and John Moody
a smoothing regularizer. Several approaches can be applied to estimate the regularization parameter λ, as in Eubank (1988), Hastie and Tibshirani (1990), and Wahba (1990). We will not discuss this subject in this paper. After including a regularization term in training, the weight update equation becomes
ΔΦ = −η ∇_Φ D = −η {∇_Φ D_P + λ ∇_Φ[ρ_τ²(Φ)]}   (2.24)

where η is a learning rate. With our smoothing regularizer, ∇_Φ[ρ_τ²(Φ)] is computed as (2.25) (2.26)
where u_ij, v_ij, and w_ij are the elements of U, V, and W, respectively. For simultaneous recurrent networks, equation 2.27 becomes (2.28). When standard weight decay is used, the regularizer for equation 2.15 is (2.29). The corresponding update equations for this case are (2.30), (2.31), and

∂ρ²/∂w_ij = 2 w_ij   (2.32)
In contrast to our smoothing regularizer, quadratic weight decay treats all network weights identically: it makes no distinction between recurrent and input/output weights, and it takes no account of interactions between weight values. In the next section, we evaluate the new regularizer in a number of tests. In each case, we compare networks trained with the smoothing regularizer to those trained with standard weight decay.
3 Empirical Tests
In this section, we demonstrate the efficacy of our smoothing regularizer via three empirical tests. The first two tests are on regression with synthetic data, and the third test is on predicting the monthly U.S. Index of Industrial Production.

3.1 Regression with Feedforward Networks. We form a set of data generated by a predefined function G. The data are contaminated by some degree of zero-mean gaussian noise before being used for training. Our task is to train the networks so that they estimate G. We will first study function estimation with feedforward networks, and then extend it to the case with recurrent networks.
3.1.1 Data. The data in this test are synthesized with the function (3.1), where x is uniformly distributed within [−10, 10], ε is normally distributed with zero mean and variance σ², and a and b are two constants. In our test, we set a = 1 and b = 5.
3.1.2 Model. The model we have used for the above data is a two-hidden-unit, feedforward network with sigmoidal functions at the hidden units and a linear function at a single output unit. It can be described as (3.2). The model overall has 7 weight parameters.

3.1.3 Performance Measure. The criterion used to evaluate model performance is the true mean squared error (MSE) minus the noise variance σ²:

(3.3)

where G(x) is the noiseless source function, as shown in the first part of equation 3.1; Z(x) is the network response function, as given by equation 3.2; and p(x) is the probability density of x. In this experiment, p(x) is uniform within [−x₀, x₀] and x₀ = 10.
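The performance measure of equation 3.3 can be estimated by Monte Carlo sampling. The sketch below is ours: since equation 3.1 is not reproduced here, the source function G is a hypothetical stand-in, while the network form follows equation 3.2 (two sigmoidal hidden units, one linear output, 7 weight parameters).

```python
import numpy as np

def G(x, a=1.0, b=5.0):
    """Hypothetical stand-in for the source function of equation 3.1."""
    return np.sin(a * x) / (1.0 + (x / b) ** 2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def Z(x, params):
    """Network of equation 3.2: 2 sigmoidal hidden units + linear output."""
    a1, b1, a2, b2, c1, c2, c0 = params   # the 7 weight parameters
    return c1 * sigmoid(a1 * x + b1) + c2 * sigmoid(a2 * x + b2) + c0

def true_mse(params, x0=10.0, n=100_000, seed=0):
    """Monte Carlo estimate of the integral in equation 3.3,
    with p(x) uniform on [-x0, x0]."""
    x = np.random.default_rng(seed).uniform(-x0, x0, n)
    return np.mean((G(x) - Z(x, params)) ** 2)

err = true_mse(np.zeros(7))
print(err >= 0.0)  # True
```

Because the noise variance σ² is subtracted analytically in equation 3.3, only the noiseless G appears in the estimate.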
Table 1: Comparison of the Performances (as Measured by Equation 3.3) between the Feedforward Networks Trained with the Smoothing Regularizer and Those Trained with Standard Weight Decay for the Function Estimation.ᵃ

Number of          Noise      With standard     With smoothing
training patterns  variance   weight decay      regularizer
11                 0.1        0.037 ± 0.011     0.020 ± 0.003
11                 0.5        0.137 ± 0.003     0.076 ± 0.028
11                 1.0        0.151 ± 0.000     0.117 ± 0.011
21                 0.1        0.014 ± 0.003     0.010 ± 0.000
21                 0.5        0.061 ± 0.004     0.048 ± 0.042
21                 1.0        0.097 ± 0.005     0.068 ± 0.009
41                 0.1        0.011 ± 0.000     0.008 ± 0.000
41                 0.5        0.038 ± 0.001     0.028 ± 0.000
41                 1.0        0.066 ± 0.001     0.050 ± 0.000

ᵃThe results shown are the mean and the standard deviation over 10 models with different initial weights.
3.1.4 Results. Comparisons between the smoothing regularizer and standard weight decay are listed in Table 1. The performance comparisons are made for a number of cases. The numbers of training patterns are 11, 21, and 41; the noise variances are 0.1, 0.5, and 1.0. To observe the effect of the regularization parameters, we did not use their estimated values. Instead, the regularization parameters for both the smoothing regularizer and standard weight decay were varied from 0 to 0.1 with step-size 0.001. Figure 1 shows the downsampled training and test errors versus the regularization parameters. The performances in Table 1 are the optimal results over all these regularization parameters. This gives the best potential result each network can obtain. Unlike our other tests in real-world applications, neither early stopping nor validation was applied in this test. Each network was trained over 5000 epochs. It was found that, for all networks, the training error did not decrease significantly after 5000 training epochs. Under these conditions, the task of preventing the network from overtraining or overfitting depends entirely on the regularization. We believe that such results will more directly reflect, and more precisely compare, the efficacy of different regularizers. Table 1 shows that the potential predictive errors with the smoothing regularizer are smaller than those with standard weight decay. Figure 2 gives an example and compares the approximation functions obtained with standard weight decay and our smoothing regularizer to the true function. We can see that the function obtained with our smoothing regularizer is obviously closer to the true function than that obtained with standard weight decay.
Figure 1: Training (upper panel) and test (lower panel) errors versus regularization parameters. Networks trained with ordinary weight decay are plotted by "+," and those trained with smoothing regularizers are plotted by "*." Ten different networks are shown for each case. The curves are the median errors of these 10 networks.
Figure 2: Comparing the estimated function obtained with our smoothing regularizer (dashed curve) and that with standard weight decay (dotted curve) to the true function (solid curve). The "+" symbols plot 21 training patterns that are uniformly distributed along the x axis and normally distributed along the f(x) direction with noise variance 1. The model is a two-node, feedforward network.
3.2 Regression With Recurrent Networks.
3.2.1 Data. For recurrent network modeling, we synthesized a data sequence of N samples with the following dynamic function:

(3.4) (3.5) (3.6) (3.7)

where t = 0, 1, ..., N − 1, ε(t) is normally distributed with zero mean and variance σ², and a = 1 and b = 5 are two constants. Two dummy variables, y₁(t) and y₂(t), evolve from their previous values. In our test, we initialize y₁(0) = y₂(0) = 0.
Table 2: Comparison between the Recurrent Networks Trained with the Smoothing Regularizer and Those Trained with Standard Weight Decay for the Function Estimation Task.ᵃ

Number of          Noise      With standard     With smoothing
training patterns  variance   weight decay      regularizer
11                 0.1        0.035 ± 0.006     0.020 ± 0.002
11                 0.5        0.123 ± 0.008     0.067 ± 0.007
11                 1.0        0.151 ± 0.000     0.111 ± 0.015
21                 0.1        0.014 ± 0.001     0.009 ± 0.000
21                 0.5        0.058 ± 0.002     0.037 ± 0.001
21                 1.0        0.095 ± 0.004     0.071 ± 0.001
41                 0.1        0.009 ± 0.000     0.006 ± 0.000
41                 0.5        0.032 ± 0.001     0.024 ± 0.005
41                 1.0        0.057 ± 0.001     0.039 ± 0.016

ᵃThe results shown are averaged over 10 different initial weights.
3.2.2 Model. The model we have used for the above data is a two-hidden-unit, recurrent network with sigmoidal functions at the hidden units and a linear function at a single output unit. The output of each hidden unit is time-delayed and fed back to the other hidden unit's input. It can be described as (3.8) (3.9) (3.10), where y₁(t) and y₂(t) correspond to the two hidden-unit outputs. The model overall has 9 weight parameters.
3.2.3 Results. The performance measure is the same as equation 3.3 in the feedforward case. Table 2 lists the performances of the recurrent networks trained with standard weight decay and those trained with our smoothing regularizer. The table shows the results for the best value of the regularization parameter. It again shows that, in all cases, the smoothing regularizer outperforms standard weight decay. For all networks listed in Table 2, the numbers of training patterns are 11, 21, and 41; the noise variances are 0.1, 0.5, and 1.0; and the regularization parameters for both the smoothing regularizer and standard weight decay are varied from 0 to 0.1 with step-size 0.001.
3.3 Predicting the U.S. Index of Industrial Production.
3.3.1 Data. The Index of Industrial Production (IP) is one of the key measures of economic activity. It is computed and published monthly. Our task is to predict the 1-month rate of change of the index from January 1980 to December 1989 for models trained on data from January 1950 to December 1979. The exogenous inputs we have used include 8 time series, such as the index of leading indicators, housing starts, the money supply M2, and the S&P 500 Index. These 8 series are also recorded monthly. In previous studies by Moody et al. (1993), with the same training and test data sets, the normalized prediction errors of the 1-month rate of change were 0.81 with the neuz neural network simulator and 0.75 with the proj neural network simulator.¹⁰
3.3.2 Model. We have simulated feedforward and recurrent neural network models. Both models consist of two layers. There are 9 input units in the recurrent model, which receive the 8 exogenous series and the previous month's IP index change. We set the time-delay length in the recurrent connections to τ = 1. The feedforward model is constructed with 36 input units, which receive 4 time-delayed versions of each input series. The time-delay lengths are 1, 3, 6, and 12, respectively. The activation functions of the hidden units in both the feedforward and recurrent models are tanh functions. The number of hidden units varies from 2 to 6. Each model has one linear output unit.

3.3.3 Training. We have divided the data from January 1950 to December 1979 into four nonoverlapping subsets. One subset consists of 70% of the original data, and each of the other three subsets consists of 10% of the original data. The larger subset is used as training data, and the three smaller subsets are used as validation data. These three validation data sets are, respectively, used for determining early stopping of training, selecting the regularization parameter, and selecting the number of hidden units. We have formed 10 random training-validation partitions. For each training-validation partition, three networks with different initial weight parameters are trained. Therefore, our prediction committee is formed by 30 networks. The committee error is the average of the errors of all committee members. All networks in the committee are trained simultaneously and

¹⁰The neuz networks were trained using stochastic gradient descent, early stopping via a validation set, and the PCP regularization method proposed by Levin et al. (1994). The proj networks were trained using the Levenburg-Marquardt algorithm, and network pruning after training was accomplished via the methods described in Moody and Utans (1992). The internal layer nonlinearities for the neuz networks were sigmoidal, while some of the proj networks included quadratic nonlinearities as described in Moody and Yarvin (1992).
Table 3: Normalized Prediction Errors for the 1-Month Rate of Return on the U.S. Index of Industrial Production (Jan. 1980-Dec. 1989).ᵃ

Model                  Regularizer   Mean ± Std      Median  Max    Min    Committee
Recurrent networks     Smoothing     0.646 ± 0.008   0.647   0.657  0.632  0.639
                       Weight decay  0.734 ± 0.018   0.737   0.767  0.704  0.734
Feedforward networks   Smoothing     0.700 ± 0.023   0.707   0.729  0.654  0.693
                       Weight decay  0.745 ± 0.043   0.748   0.805  0.676  0.731

ᵃEach result is based on 30 networks.
stopped at the same time based on the committee error of a validation set. The value of the regularization parameter and the number of hidden units are determined by minimizing the committee error on separate validation sets.
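The committee construction and its stopping criterion can be sketched as below. This is our own minimal illustration (the paper does not publish code, and the exact patience rule is our assumption); the committee prediction is the simple average over members, as used for the committee results in Table 3.

```python
import numpy as np

def committee_error(member_preds, target):
    """member_preds: (n_members, n_samples) array of member predictions.
    The committee prediction is their simple average."""
    committee_pred = member_preds.mean(axis=0)
    return np.mean((committee_pred - target) ** 2)

def should_stop(val_error_history, patience=5):
    """Stop all members together when the committee validation error has
    not improved for `patience` consecutive epochs (a common heuristic;
    the paper does not specify its exact stopping rule)."""
    if len(val_error_history) <= patience:
        return False
    best_so_far = min(val_error_history[:-patience])
    return min(val_error_history[-patience:]) >= best_so_far

# Example: identical member predictions that match the target give zero error.
preds = np.ones((30, 12))   # 30 committee members, 12 validation samples
print(committee_error(preds, np.ones(12)))  # 0.0
```

Separate validation subsets would then be used to pick the regularization parameter and the number of hidden units by the same committee-error criterion.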
3.3.4 Results. Table 3 lists the results over the test data set. The performance measure is the normalized prediction error as used in Moody et al. (1993), which is defined as

(3.11)

where S(t) stands for the observations, Q represents the test data set, and S̄ is the mean of S(t) over the training data set. This measure evaluates prediction accuracy by comparison to a trivial predictor that uses the mean of the training data as its prediction. Table 3 also compares the out-of-sample performance of recurrent networks and feedforward networks trained with our smoothing regularizer to that of networks trained with standard weight decay. The results are based on 30 networks. As shown, the smoothing regularizer again outperforms standard weight decay with 95% confidence (by t-test) for both recurrent networks and feedforward networks. We also list the median, maximal, and minimal prediction errors over the 30 predictors. The last column gives the committee results, which are based on the simple average of the 30 network predictions. We see that the median, maximal, and minimal values and the committee results obtained with the smoothing regularizer are all smaller than those obtained with standard weight decay, in both recurrent and feedforward network models. Figure 3 plots the changes of prediction errors with the regularization parameter in recurrent neural network modeling. As shown by the figure, the prediction error over the training data set increases with the regularization parameter, while the prediction errors over the validation and test data
sets first decrease and then increase with the regularization parameter. The optimal regularization parameter with the least validation error is 0.8 with our smoothing regularizer and 0.03 with standard weight decay. In all cases, we have found that the regularization parameters should be larger than zero to achieve optimal prediction performance. This confirms the necessity of regularization during training in addition to early stopped training. We have observed and compared the weight histograms of the networks trained with our smoothing regularizer and those trained with standard weight decay. As demonstrated in Figure 4, although the distribution has heavy tails, most weight parameters in the networks with the smoothing regularizer are concentrated on small values. Its distribution is more like a symmetric α-stable (SαS)¹¹ distribution than a gaussian distribution. This is also consistent with the soft weight-sharing approach proposed by Nowlan and Hinton (1992), in which a gaussian mixture is used to model the weight distribution. We thus believe that our smoothing regularizer provides a more effective means to differentiate "essential" large weights from "irrelevant" small weights than does standard weight decay.
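One quick way to quantify the heavy-tailed versus gaussian distinction drawn from the weight histograms is excess kurtosis: leptokurtic (sharply peaked, heavy-tailed) distributions have positive excess kurtosis, while a gaussian has roughly zero. The diagnostic below is ours, not from the paper.

```python
import numpy as np

def excess_kurtosis(w):
    """Sample excess kurtosis: E[((w - mean)/std)^4] - 3.
    ~0 for gaussian weights; clearly positive for leptokurtic ones."""
    w = np.asarray(w, dtype=float)
    m, s = w.mean(), w.std()
    return np.mean(((w - m) / s) ** 4) - 3.0

rng = np.random.default_rng(0)
print(excess_kurtosis(rng.normal(size=200_000)) < 0.5)   # True (gaussian-like)
print(excess_kurtosis(rng.laplace(size=200_000)) > 2.0)  # True (heavy-tailed)
```

Applied to the pooled weights of trained networks, a clearly positive value would support the leptokurtic reading of Figure 4.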
3.3.5 With and Without Early Stopping of Training. The results shown in Table 3 and Figures 3 and 4 are for networks trained with both regularization and early stopping. From Figure 3, we see that the prediction performance is far worse than the optimum if the network is trained with just early stopping and no regularization (λ = 0). Another case is that the network is trained with regularization but without early stopping. We compare the performance of networks trained with regularization and early stopping to that of networks trained with regularization but without early stopping in Table 4. For the latter networks, the 10% of the data originally used for early stopping are now used in training. All other training conditions are the same in both cases. From the table, we see that the performance of networks without early stopping is slightly worse than that of networks trained with regularization and early stopping simultaneously. However, the difference is small in terms of the median or committee errors, even though the deviation of the prediction errors has increased.

3.3.6 Stability Analysis. In Section 2, we found that equation 2.20 (i.e., γ||W|| < 1) must hold to ensure stability and that the effects of small input perturbations are damped out. Figure 5 shows the value of γ||W|| for the trained networks. The networks trained with the optimal regularization parameter satisfy the inequality, while networks trained with regularization parameters much larger or smaller than the optimal value do not satisfy the stability requirement.

¹¹See, for example, Shao and Nikias (1993).
Figure 3: Regularization parameter vs. normalized prediction errors for the task of predicting the U.S. Index of Industrial Production. The example given is for a recurrent network trained with either the smoothing regularizer (upper panel) or standard weight decay (lower panel). For the smoothing regularizer, the optimal regularization parameter that leads to the least validation error is 0.8 corresponding to a test error of 0.646. For standard weight decay, the optimal regularization parameter is 0.03 corresponding to a test error of 0.734. The new regularizer thus yields a 12% reduction of test error relative to that obtained using quadratic weight decay.
Figure 4: Comparison of the weight histograms between the recurrent networks trained with our smoothing regularizer and those with standard weight decay. Each histogram summarizes 30 networks trained on the IP task. The smoothing regularizer yields a symmetric α-stable (leptokurtic) distribution of weights (large peak near zero and long tails), whereas quadratic weight decay produces a distribution that is closer to a gaussian. The smoothing regularizer thus distinguishes more sharply between "essential" (large) weights and "nonessential" (near-zero-valued) weights. The near-zero-valued weights can be pruned.
4 Concluding Remarks and Discussion
Regularization in learning can prevent a network from overtraining. Several techniques have been developed in recent years, but all of them are specialized for feedforward networks. To the best of our knowledge, a regularizer for a recurrent network has not been reported previously. We have developed a smoothing regularizer for recurrent neural networks that captures the dependencies of the input, output, and feedback weight values on each other. The regularizer covers both simultaneous
Table 4: Comparison of Prediction Performance of the Networks Trained with and without Early Stopping of Training.ᵃ

Training                 Mean ± Std      Median  Max    Min    Committee
With early stopping      0.646 ± 0.008   0.647   0.657  0.632  0.639
Without early stopping   0.681 ± 0.057   0.664   0.938  0.643  0.657

ᵃResults given in the table are the normalized prediction errors for the IP task, as in Table 3. All results are based on 30 recurrent networks. Whether trained with early stopping or not, the networks are trained with the smoothing regularizer.
Figure 5: Regularization parameter vs. γ||W|| of trained recurrent networks. The networks are trained to predict the U.S. Index of Industrial Production. For each regularization parameter, 30 networks have been trained. Each network has four hidden units. The smoothing regularizer and early stopping are both used during learning. From Figure 3, we know that the optimal regularization parameter for these networks is 0.8. This figure plots the mean values of γ||W|| for these 30 networks, with the error bars indicating the maximal and minimal values. As shown, the networks with the optimal regularization parameter have γ||W|| < 1. This confirms the networks' stability, in the sense that the network response to any input perturbation will be smooth.
and time-lagged recurrent networks, with feedforward networks and single-layer, linear networks as special cases. Our smoothing regularizer for linear networks has the same form as standard weight decay. The regularizer depends only on the network parameters and can easily be used. A series of empirical tests has demonstrated the efficacy of this regularizer and its superior performance relative to standard quadratic weight decay. Empirical results show that the smoothing regularizer yields a symmetric α-stable (SαS) weight distribution, whereas standard quadratic weight decay produces a normal distribution. We therefore believe that our smoothing regularizer provides a more reasonable constraint during training than standard weight decay does. Our regularizer keeps "essential" weights large as needed and, at the same time, makes "nonessential" weights assume values near zero. We conclude with several additional comments. As described in equation 2.19, to bound the first derivatives of the activation functions in the hidden units, we have used their maximal value, without considering different nonlinearities for different nodes and ignoring their changes with time. We have extended our smoothing regularizer to take these factors into account. Due to the page limit, we cannot include these extensions in this paper. In the simulations conducted in this paper, we have fully searched over the regularization parameter values by using a validation data set. This helps us observe the effect of the regularization parameter, but it is time consuming if the network and the training data set are large. Stability is another big issue for recurrent neural networks. There is a large literature on this topic, for example, Hirsch (1989) and Kuan et al. (1994). In our derivation of the regularizer, we have found that γ||W|| < 1 must hold to ensure that the effects of small input perturbations are damped out.
This inequality can be used for diagnosing the stability of trained networks, as shown in Figure 5. It can also be appended to our training criterion, equation 2.17, as an additional constraint. Werbos and Titus (1978) proposed the following cost function for their pure robust model
(4.1)
In the model, w was, in fact, a weight parameter in a feedback connection from the output to the input, but it was predefined in the range 0 < w < 1 and kept fixed after being defined. Werbos and Titus's new cost function actually had a similar effect to our smoothing regularizer. As they claimed, the biggest advantage of their new cost function was its ability to shift smoothly in different environments.
Appendix A: Neglecting the Cross-Covariance

We neglect the cross-covariance term in equation 2.8,

(2/N) Σ_{t=1}^{N} {Z(t) − F[Φ_P, Î(t)]} {F[Φ_P, Î(t)] − F[Φ_P, I(t)]}   (A.1)
for two reasons: first, its expectation value will be small, and second, its value can be rigorously bounded with no qualitative change in our proposed training objective function in equation 2.9. Noting that the target noise is uncorrelated with the input perturbations, and assuming that model bias can be neglected, the expectation value of equation A.1 taken over possible training sets will be small: (A.2). Note here that Φ_P are the weights obtained after training. In addition, the expectation value of equation A.1 taken over the input perturbations will be zero to first order in the perturbations: (A.3)
Of course, many nonlinear dynamic systems have positive Lyapunov exponents, so the second- and higher-order effects in these cases cannot be ignored. Although its expectation value will be small, the cross-covariance term in equation 2.8 can be rigorously bounded. Using the inequality 2ab ≤ a² + b², we obtain

(2/N) Σ_{t=1}^{N} {Z(t) − F[Φ_P, Î(t)]} {F[Φ_P, Î(t)] − F[Φ_P, I(t)]}
  ≤ (1/N) Σ_{t=1}^{N} {Z(t) − F[Φ_P, Î(t)]}² + (1/N) Σ_{t=1}^{N} {F[Φ_P, Î(t)] − F[Φ_P, I(t)]}²
  = D̂_P + (1/N) Σ_{t=1}^{N} {F[Φ_P, Î(t)] − F[Φ_P, I(t)]}²   (A.4)
Minimizing the first term D̂_P and the second term (1/N) Σ_{t=1}^{N} {F[Φ_P, Î(t)] − F[Φ_P, I(t)]}² in equation 2.8 during training will thus automatically decrease the effect of the cross-covariance term. Using this bound, instead of the small-expectation-value approximation, will in effect multiply the first two terms in equation 2.8 by a factor of 2. However, this amounts to an irrelevant scaling factor and can be dropped. Thus, our proposed training objective function, equation 2.9, remains unchanged.
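The bound in equation A.4 rests on the elementary inequality 2ab ≤ a² + b², which holds pointwise and therefore under summation. A quick numerical check (ours, with random sequences standing in for the two residual terms):

```python
import numpy as np

# For a_t = Z(t) - F[Phi, I_hat(t)] and b_t = F[Phi, I_hat(t)] - F[Phi, I(t)],
# the pointwise inequality 2*a*b <= a^2 + b^2 gives
#   (2/N) sum a_t b_t  <=  (1/N) sum a_t^2 + (1/N) sum b_t^2,
# which is exactly the bound of equation A.4.
rng = np.random.default_rng(0)
N = 1000
a = rng.standard_normal(N)   # stand-in for the first residual sequence
b = rng.standard_normal(N)   # stand-in for the second residual sequence

cross = (2.0 / N) * np.sum(a * b)
bound = np.mean(a ** 2) + np.mean(b ** 2)
print(cross <= bound)  # True
```

The inequality is tight only when a_t = b_t for every t, which is why using the bound merely rescales the first two terms of equation 2.8.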
Appendix B: Output Sensitivity of a Trained Network to Its Input Perturbation

For a recurrent network of the form given by equation 2.15,

Y(t) = f[WY(t − τ) + VX(t)],   Z(t) = UY(t)   (B.1)

this appendix studies the output perturbation

σ_Z²(t) = ||Ẑ(t) − Z(t)||²   (B.2)

in response to an input perturbation

σ_X²(t) = ||X̂(t) − X(t)||²   (B.3)
The output perturbation will depend on the weight parameter matrices U, V, and W. The sizes of U, V, and W are N_o × N_h, N_h × N_i, and N_h × N_h, where the numbers of output, hidden, and input units are N_o, N_h, and N_i, respectively. By expressing the inputs to the hidden units as an N_h-dimensional column vector

O(t) = [o_1(t), ..., o_{N_h}(t)]ᵀ = WY(t − τ) + VX(t)   (B.4)

and using the mean value theorem, we get

f[Ô(t)] − f[O(t)] = f′[O*(t)] [Ô(t) − O(t)]   (B.5)

where f[O(t)] = (f_1[o_1(t)], ..., f_{N_h}[o_{N_h}(t)])ᵀ and f′[O*(t)] is a diagonal matrix with elements {f′[O*(t)]}_jj = f_j′[o_j*(t)]. Here f_j′(·) is the first derivative of f_j(·), and min[ô_j(t), o_j(t)] ≤ o_j*(t) ≤ max[ô_j(t), o_j(t)].
With these definitions, the perturbation Ô(t) − O(t) evolves as

d[Ô(t) − O(t)]/dt = (1/τ) {W [f[Ô(t)] − f[O(t)]] − [Ô(t) − O(t)] + V [X̂(t + τ) − X(t + τ)]}   (B.12)

We get

dσ_O²(t)/dt = (2/τ) {[Ô(t) − O(t)]ᵀ W [f[Ô(t)] − f[O(t)]] − ||Ô(t) − O(t)||² + [Ô(t) − O(t)]ᵀ V [X̂(t + τ) − X(t + τ)]}   (B.13)
Using the mean value theorem and Schwarz's inequality again, we obtain the following bounds:

||[Ô(t) − O(t)]ᵀ W {f[Ô(t)] − f[O(t)]}|| ≤ γ ||W|| σ_O²(t)   (B.14)

for the first term on the right-hand side of equation B.13, and

||[Ô(t) − O(t)]ᵀ V [X̂(t + τ) − X(t + τ)]|| ≤ σ_O(t) ||V|| σ_X(t + τ)   (B.15)

for the third term of equation B.13.
During the evolution process of the network, the input perturbation σ_X(t) is assumed to be constant or to change more slowly than σ_O(t). This is true when τ is small.¹⁵ Therefore, σ_X(t) is replaced by σ_X in

¹⁴We can obtain O(t + τ) = WY(t) + VX(t + τ) and Y(t) = f[O(t)] by substituting the following approximation into equation B.9:

dO(t)/dt ≈ [O(t + τ) − O(t)] / τ

Here, we assume that τ is small. Note that such a dynamic function has also been used to describe the evolution process of recurrent networks by other researchers, for example, Pineda (1988) and Pearlmutter (1989).

¹⁵See footnote 6 for justification.
the following derivation. With equations B.14 and B.15, equation B.13 becomes

dσ_O²(t)/dt ≤ (2/τ) {(γ||W|| − 1) σ_O²(t) + ||V|| σ_O(t) σ_X}   (B.16)

or

dσ_O(t)/dt ≤ (1/τ) {(γ||W|| − 1) σ_O(t) + ||V|| σ_X}   (B.17)

due to σ_O(t) > 0. For notational clarity, define

a = γ||W|| − 1   and   b = ||V||   (B.18)

so that equation B.17 becomes

dσ_O(t)/dt ≤ [a σ_O(t) + b σ_X] / τ   (B.19)

Integration of equation B.19 from t − 1 to t yields the solutions (B.20) and (B.21).
One sees that σ_O(t) depends on the current input perturbation σ_X as well as its previous value σ_O(t − 1).

then v_{π_{i+1}}(t) = 0 at T_{i+1} < t < T_i. But this contradicts Theorem 1. As a result, T_i < T_{i+1} for all i = 1, 2, ..., N − 2. In addition to Theorem 4, we conclude that 0 < T_1 < T_2 < ... < T_{N−1} = T_N < ∞. (b) Since v_{π_{i+1}}(t + 1) − v_{π_i}(t + 1) = (1 + ε)ᵗ [v_{π_{i+1}}(0) − v_{π_i}(0)] for all t > 0, v_{π_i}(T_i) = 0 implies that v_{π_{i+1}}(T_i) = 0. By definition, T_i = T_{i+1}. Hence the proof is completed. □

In the next section, we proceed to derive the solution of the Maxnet using the above corollary.
Maxnet Dynamics
3 Network Dynamics
The dynamics of the network during 0 < t < T_1 is given by

v_N(t + 1) = A_N v_N(t)   (3.1)

where v_N(t) = [v_{π_1}(t), v_{π_2}(t), ..., v_{π_N}(t)]ᵀ and A_N is an N × N matrix with diagonal elements equal to 1 and off-diagonal elements equal to −ε. During 0 < t < T_1, the network may therefore be regarded as a linear discrete-time, time-invariant system. The solution of this system can then be obtained by evaluating the eigenvalues and eigensubspaces of A_N.
Lemma 1. The eigenvalues of A_N are [1 − (N − 1)ε] and (1 + ε). The corresponding eigensubspaces of [1 − (N − 1)ε] and (1 + ε) are M_N and M_N^⊥, respectively, where

M_N = {v ∈ Rᴺ | v = c e_{1N}, c ∈ R}   and   M_N^⊥ = {v ∈ Rᴺ | vᵀ e_{1N} = 0}

with e_{1N} = (1/√N, ..., 1/√N)ᵀ.

Proof. Let x_i = 1/√N for all i, i.e., x ∈ M_N. Then

(A_N x)_i = x_i − ε Σ_{j≠i} x_j = [1 − (N − 1)ε] x_i

so [1 − (N − 1)ε] is an eigenvalue of A_N. Next consider w = v − (vᵀ e_{1N}) e_{1N}, so that w ∈ M_N^⊥. Then (3.3)

(A_N w)_i = (1 + ε) w_i, i.e., A_N w = (1 + ε)w. Hence the other eigenvalue is (1 + ε). Moreover, we can replace w by any of the vectors

(1/√2, −1/√2, 0, ..., 0)ᵀ, (1/√2, 0, −1/√2, ..., 0)ᵀ, ...
John P. F. Sum and Peter K. S. Tam
Therefore, it can be concluded that [1 − (N − 1)ε] and (1 + ε) are the only eigenvalues of A_N, because dim(M_N) + dim(M_N^⊥) = N. □
With the aid of Lemma 1, the solution of the dynamics equation 3.1 can be written as

v_N(t + 1) = [1 − (N − 1)ε] [v_Nᵀ(t) e_{1N}] e_{1N} + (1 + ε) {v_N(t) − [v_Nᵀ(t) e_{1N}] e_{1N}}   (3.4)

That is to say, for all i = 1, 2, ..., N,

v_{π_i}(t) = [1 − (N − 1)ε]ᵗ ⟨v_N(0)⟩ + (1 + ε)ᵗ [v_{π_i}(0) − ⟨v_N(0)⟩]   (3.5)

for all t < T_1, where ⟨v_N(0)⟩ denotes the mean of the components of v_N(0). This is the exact solution of the Maxnet in the time interval 0 ≤ t ≤ T_1. Furthermore, the settling time of the π_1 neuron is given by (3.6)
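The closed-form solution implied by Lemma 1 (the mean component of the state shrinks by a factor 1 − (N − 1)ε per step, while the zero-mean component grows by 1 + ε) can be verified directly against iteration of the linear map. This check is our own sketch; it stays in the linear regime, before any activation is clipped at zero.

```python
import numpy as np

# Before any activation reaches zero, iterating v(t+1) = A_N v(t) equals
#   v_i(t) = [1-(N-1)*eps]^t * mean(v(0)) + (1+eps)^t * (v_i(0) - mean(v(0)))
N, eps, T = 5, 0.05, 4
A = (1 + eps) * np.eye(N) - eps * np.ones((N, N))  # diag 1, off-diag -eps

rng = np.random.default_rng(0)
v0 = rng.uniform(0.5, 1.0, N)

v = v0.copy()
for _ in range(T):
    v = A @ v                                       # direct iteration

m = v0.mean()
closed = (1 - (N - 1) * eps) ** T * m + (1 + eps) ** T * (v0 - m)
print(np.allclose(v, closed))  # True
```

The decomposition works because the all-ones direction and its orthogonal complement are exactly the two eigensubspaces of A_N identified in Lemma 1.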
Once v_{π_1} reaches zero, the corresponding output will also be zero. After T_1, the network dynamics can be modeled in a lower-dimensional space. There are two cases to be considered: (1) v_{π_2}(T_1) = 0 and (2) v_{π_2}(T_1) > 0. For case (1), we can simply skip the time interval T_1 ≤ t < T_2 and proceed to consider the dynamics of the network in the time interval T_2 ≤ t < T_3. In case (2), we can denote

v_{N−1}(t) = [v_{π_2}(t), v_{π_3}(t), ..., v_{π_N}(t)]ᵀ

and consider the dynamics as

v_{N−1}(t + 1) = A_{N−1} v_{N−1}(t)
Since A N - ] is defined in the same way as AN except that the dimension is N - 1, we can follow the same principle applied to the derivation of equation 3.5 and Lemma 1 to deduce that UT,(f)
for all i
==
[1 - (N- 2)f](‘-T’) (VN-I(TI)) (1 + f ) ( f - r l )[vn,(TI) - ( v N - I ( T l ) ) ]
+
= 2.3, . . .
N and
(3.7)
for all T_1 ≤ t < T_2. Here ⟨v_{N−1}(T_1)⟩ = (1/(N−1)) Σ_{i=2}^{N} v_{π_i}(T_1). Repeating the same procedure, we can derive the general solution of the Maxnet for all time t ≥ 0. Denoting

v_{N−k}(t) = [v_{π_{k+1}}(t), v_{π_{k+2}}(t), ..., v_{π_N}(t)]^T,

the general solution of the network at time T_k ≤ t < T_{k+1} is given by

v_{π_i}(t) = [1 − (N−k−1)ε]^{(t−T_k)} ⟨v_{N−k}(T_k)⟩ + (1+ε)^{(t−T_k)} [v_{π_i}(T_k) − ⟨v_{N−k}(T_k)⟩]   (3.8)

for all i = k+1, k+2, ..., N and

v_{π_i}(t) = 0   (3.9)

for all i = 1, 2, ..., k, where ⟨v_{N−k}(T_k)⟩ = (1/(N−k)) Σ_{i=k+1}^{N} v_{π_i}(T_k).
Besides, the settling times for the π_2, ..., π_{N−1} neurons can be obtained recursively by

T_k = T_{k−1} + ⌈ ln( ⟨v_{N−k+1}(T_{k−1})⟩ / (⟨v_{N−k+1}(T_{k−1})⟩ − v_{π_k}(T_{k−1})) ) / ln( (1+ε) / (1 − (N−k)ε) ) ⌉.

Since v_{π_N}(t+1) = v_{π_N}(t) whenever t ≥ T_{N−1}, the network response time is given by T_{N−1}, where T_0 = 0 and ⌈x⌉ is the smallest integer that is just greater than x.

4 Geometric Interpretation
Koutroumbas and Kalouptsidis (1994) present a brief geometric interpretation for two dynamic properties of the Maxnet: (1) once the initial state v_N is located on the hyperplane that bisects the angles between the reference axes, the limit vector will be the null vector, and (2) otherwise, the
limit point will be on the axis that corresponds to the node π_N. Essentially, these properties can be easily visualized from equations 3.8, 3.9, and Lemma 1. To simplify the discussion, we describe the case of two neurons, but the interpretation can be extended to N neurons. From Lemma 1, it is observed that the component of v_2 that is parallel to e_{12} will decrease at a rate (1 − ε), while the component perpendicular to e_{12} will increase at a rate of (1 + ε). Figure 2 shows three situations, indicated by x, y, and z. Here x_1 and y_1 are the components of x and y that are parallel to e, whereas x_2 and y_2 are the components perpendicular to e. The lengths of the arrows indicate the corresponding rates of change. Consider x, which lies in region A: the magnitude of the change of x along e is εx_1, which is larger than εx_2. The resultant change of x points toward the v_2 axis. Similarly, the resultant change of x will point toward the v_1 axis if x is located in the other A region. In region B, y_1 ≥ y_2, with equality only when y lies on the axis v_1. Therefore, the change of y along the direction of e is also larger than that along the direction perpendicular to e. The resultant change of y is again toward one of the axes. Consider z, which is on the line of e; its resultant change points toward (0,0). In summary, if v_1(0) > v_2(0) [v_2(0) > v_1(0)], then the limit point will be on the v_1 (v_2) axis. If v_1(0) = v_2(0), then the limit point will be (0,0).

Figure 2: The geometric interpretation of the dynamics of the Maxnet. x, y, and z correspond to three initial conditions located in three regions: A, B, and the line along e.
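The winner-take-all behavior just described can also be checked numerically. The following sketch is ours, not the authors' code: the helper name `maxnet` and the parameter values are assumptions, and it uses the standard rectifying Maxnet update v_i ← max(0, v_i − ε Σ_{j≠i} v_j), which the piecewise-linear analysis above describes between settling times (with 0 < ε < 1/(N−1)).

```python
import numpy as np

def maxnet(v0, eps):
    """Iterate v_i <- max(0, v_i - eps * sum_{j != i} v_j) until a single
    neuron stays active; return (winner index, number of iterations)."""
    v = np.asarray(v0, dtype=float)
    t = 0
    while np.count_nonzero(v > 0) > 1:
        # v.sum() - v is the total activity of all other neurons
        v = np.maximum(0.0, v - eps * (v.sum() - v))
        t += 1
    return int(np.argmax(v)), t

winner, steps = maxnet([0.3, 0.9, 0.5, 0.7], eps=0.05)  # eps < 1/(N-1)
assert winner == 1 and steps > 0  # the largest initial value survives
```

As the analysis predicts, the common-mode component shrinks at rate 1 − (N−1)ε while deviations from the mean grow at rate 1 + ε, so the smallest entries are clipped to zero one by one until only the maximum remains.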
5 Conclusion

In this paper, we have derived the complete solution of the Maxnet. This solution provides an alternative approach to understanding the properties of the Maxnet. Besides, the exact response time is also deduced, as long as v_{π_N}(0) ≠ v_{π_{N−1}}(0). Since our derivation of the solution is based on the method of eigensubspace analysis, the geometric interpretation of the network dynamics can be described rigorously. Such a technique can be readily adapted to the analysis of similar WTA networks such as Imax (Yen and Chang 1992), Gemnet (Yang et al. 1995), and Selectron (Yen et al. 1994).
Acknowledgment We would like to thank an anonymous referee for valuable comments.
References

Dempsey, G. L., and McVey, E. S. 1993. Circuit implementation of a peak detector neural network. IEEE Trans. Circuits Systems-II 40(9), 585-591.
Floreen, P. 1991. The convergence of Hamming memory networks. IEEE Trans. Neural Networks 2(4), 449-457.
Gee, A. H., et al. 1993. An analytical framework for optimizing neural networks. Neural Networks 6(1), 79-97.
Kosko, B. 1992. Neural Networks and Fuzzy Systems. Prentice-Hall, Englewood Cliffs, NJ.
Koutroumbas, K., and Kalouptsidis, N. 1994. Qualitative analysis of the parallel and asynchronous modes of the Hamming network. IEEE Trans. Neural Networks 5(3), 380-391.
Lippmann, R. 1987. An introduction to computing with neural nets. IEEE ASSP Mag. 4, 4-22.
Pao, Y. 1989. Adaptive Pattern Recognition and Neural Networks. Addison-Wesley, Reading, MA.
Perfetti, R. 1990. Winner-take-all circuit for neurocomputing applications. IEE Proc. Part G 137(5), 353-359.
Yang, J., et al. 1995. A general mean-based iterative winner-take-all neural network. IEEE Trans. Neural Networks 6(1), 14-24.
Yen, J.-C., and Chang, S. 1992. Improved winner-take-all neural network. Electronics Lett. 28(7), 662-664.
Yen, J.-C., et al. 1994. A new winners-take-all architecture in artificial neural networks. IEEE Trans. Neural Networks 5(5), 838-843.
Received January 3, 1995; accepted July 10, 1995.
Communicated by Alain Destexhe
NOTE
Optimizing Synaptic Conductance Calculation for Network Simulations

William W. Lytton
Department of Neurology, University of Wisconsin, Wm. S. Middleton VA Hospital, 1300 University Ave., MSC 1720, Madison, WI 53706 USA

High computational requirements in realistic neuronal network simulations have led to attempts to realize implementation efficiencies while maintaining as much realism as possible. Since the number of synapses in a network will generally far exceed the number of neurons, simulation of synaptic activation may be a large proportion of total processing time. We present a consolidating algorithm based on a recent biophysically inspired simplified Markov model of the synapse. Use of a single lumped state variable to represent a large number of converging synaptic inputs results in substantial speed-ups.

1 Introduction
The computational demands of a single synapse in realistic neural simulations can equal the cost of several neuronal units in an artificial neural network. In particular, Markov models of synaptic activation are dynamic systems that may have 10 or more state variables. An alternative, the classical "alpha function" model (Rall 1967), is computationally cheap but lacks obvious biophysical correlates at the channel level (Destexhe, Mainen, and Sejnowski 1994b). Recently, a middle ground has been developed that preserves some major aspects of a biophysically realistic full Markov model at considerably less computational cost (Wang and Rinzel 1992; Destexhe et al. 1994a,b). Destexhe and co-workers demonstrated a minimal two-state model with a fundamental biophysical verisimilitude that used a simple implementation practical for network use. We will call this the "DMS model" after the authors' initials. Individual synapses in neuronal networks are generally represented as distinct entities that alter a conductance in the postsynaptic neuron after detecting some signal, typically voltage or calcium crossing a threshold, in the presynaptic neuron. This representation is widely used in the synaptic packages available with the major realistic neural simulators. All of the synapses of a single type are doing identical, potentially redundant, calculations, albeit at slightly different times. Srinivasan and Chiel (1993) previously demonstrated how multiple alpha functions could be consolidated by representing their summation in an iterated closed form. Neural Computation 8, 501-509 (1996)
© 1996 Massachusetts Institute of Technology
We present a similar consolidating algorithm that allows an efficient implementation of large numbers of DMS synapses. Rather than treating each synapse individually, we will lump all of the synapses of a given type (e.g., GABA_A or AMPA) converging onto a single compartment of one model neuron. These will then be represented by consolidated state variables and a single synaptic conductance and synaptic current. In the following, "single synapse" or "individual synapse" is used to describe the basic two-state DMS model. "Complex synapse" describes the lumped synapse model. Lower-case state variables and conductance (r, g) will be used for the former and upper-case (R, G) for the latter.

2 The DMS Algorithm

The first-order kinetic scheme was introduced by Destexhe et al. (1994a). The notation has been slightly modified for simplicity of description of the subsequent algorithms:

r_closed ⇌ r_open, with forward rate α and backward rate β.   (2.1)

This model of a ligand-gated ion channel is comparable to the standard Hodgkin-Huxley parameterization for voltage-sensitive ion channels. The difference is that the α and β parameters are not functions of voltage. Instead, α is taken as a simple function of transmitter concentration:

α = ᾱ C.   (2.2)

Transmitter concentration C is assumed to be given by a square wave of amplitude 1 and duration Cdur; β is a constant. Following the Hodgkin-Huxley notation, the kinetic scheme can be expressed as a first-order differential equation that solves for r in terms of R_∞ and τ_R in the usual way:

τ_R ṙ = R_∞ − r   (2.3a)

R_∞ = α/(α + β),  τ_R = 1/(α + β).   (2.3b)

The update rule derived from the analytic solution for a single time step Δt is

r ← R_∞(1 − e^{−Δt/τ_R}) + r e^{−Δt/τ_R}.   (2.4)
Synaptic Conductance for Network Simulations
503
(Note that this rule defines r in terms of itself, connoting the update step that would be used in a software implementation.) The full rule is needed only when transmitter is present, since when C = 0, R_∞ = 0 and τ_R = 1/β. Equation 2.4 can be split into two update rules using r = r_on or r = r_off depending on the presence or absence of transmitter C:

r_on ← R_∞(1 − e^{−Δt/τ_R}) + r_on e^{−Δt/τ_R}   (C > 0)   (2.5a)

r_off ← r_off e^{−βΔt}   (C = 0)   (2.5b)

Note that r_on and r_off are not the same as r_open and r_closed from equation 2.1 but are both components of r. Synaptic conductance g_syn and current i_syn are defined in the usual way:

g_syn = ḡ r   (2.6a)

i_syn = g_syn (V − E_syn).   (2.6b)
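As a concrete illustration, the two-branch update of equations 2.5a and 2.5b can be sketched as follows. This is a sketch under assumed parameter values (the rate constants and time step are ours, not the paper's):

```python
import math

# Illustrative rate constants (1/ms), not values from the paper
ALPHA, BETA = 2.0, 0.5
R_INF = ALPHA / (ALPHA + BETA)  # steady-state open fraction while C > 0
TAU_R = 1.0 / (ALPHA + BETA)    # time constant while C > 0

def step(r, dt, transmitter_present):
    """Advance the open fraction r by one time step dt (ms)."""
    if transmitter_present:                     # eq. 2.5a
        decay = math.exp(-dt / TAU_R)
        return R_INF * (1.0 - decay) + r * decay
    return r * math.exp(-BETA * dt)             # eq. 2.5b

r = 0.0
for _ in range(100):            # 1 ms transmitter pulse, dt = 0.01 ms
    r = step(r, 0.01, True)
peak = r                        # r has risen toward R_INF
for _ in range(100):            # transmitter removed: pure decay
    r = step(r, 0.01, False)
assert 0.0 < r < peak <= R_INF
```

Because each branch is the exact solution of a linear first-order equation, composing small steps gives the same result as one large step over the same interval, which is what makes the lazy "projection" of Section 3.2 possible.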
3 Summing DMS Synapses
Summing synaptic activations makes it possible to maintain and update two rather than N state variables for the N synapses of a single type converging onto a given cell. An added advantage of this method is that it permits us to maintain a single queue of spike arrival times (now more accurately a heap) instead of N queues. The former improvement results in saving CPU time and the latter in saving both time and memory.

3.1 Separate Summations Required for R_on and R_off. We cannot use the single update rule of equation 2.4, since this would require summing over different R_∞s and τ_Rs depending on the presence or absence of transmitter at the ith synapse. However, the two rules 2.5a and 2.5b have known factors that can be precalculated and brought out from under the summation. Again splitting r into r_on and r_off, we then simply sum across N_on and N_off synapses, respectively, where the N total synapses have been partitioned depending on their status:

Σ_{i=1}^{N_on} r_on,i ← Σ_{i=1}^{N_on} [R_∞(1 − e^{−Δt/τ_R}) + r_on,i e^{−Δt/τ_R}]   (3.1a)

Σ_{i=1}^{N_off} r_off,i ← Σ_{i=1}^{N_off} r_off,i e^{−βΔt}   (3.1b)

Using R_on = Σ_{i=1}^{N_on} r_on,i and R_off = Σ_{i=1}^{N_off} r_off,i, and noting that Σ_{i=1}^{N_on} R_∞ = N_on R_∞, we can simplify 3.1a and 3.1b to produce update rules for unsubscripted Rs:

R_on ← N_on R_∞ (1 − e^{−Δt/τ_R}) + R_on e^{−Δt/τ_R}   (3.2a)

R_off ← R_off e^{−βΔt}   (3.2b)
This form is identical to the single synapse update rules (2.5a and 2.5b) except that the forcing function for R_on has been multiplied by N_on, the number of active synapses. These two update rules 3.2a and 3.2b form a compact two-step inner loop that the complex synapse executes at every time step.

3.2 Modifying R_on and R_off When Individual Synapses Go on or off. Updating the summed synaptic state variables on each time step saves the computational cost associated with updating individual r_i s. However, since R_on and R_off are complementary state variables that follow the respective rise and decay of multiple single synapses, these single-synapse r_i s are still needed. When a single synapse changes state from off to on, R_on must be augmented by the corresponding r_i and R_off decremented. Conversely, when an individual synapse turns from on to off, R_on must be decremented and R_off augmented by the appropriate amount. In addition, the value of N_on in equation 3.2a must be incremented by 1 (off→on) or decremented by 1 (on→off). Keeping track of individual r_i values is easily done. Since these state variables are independent of voltage, they can be projected out in time from the last (opposite-direction) transition (Destexhe et al. 1994a):

r_i ← R_∞(1 − e^{−Cdur/τ_R}) + r_i e^{−Cdur/τ_R}   (turning off)   (3.3a)

r_i ← r_i e^{−β·ISI}   (turning on)   (3.3b)
These are identical to equations 2.5a and 2.5b except that the time interval Δt has been replaced by Cdur (duration of transmitter release) in the first case and by ISI, the interspike interval, in the second. Note that while Cdur will remain constant during a simulation, ISI = t − (t_0 + Cdur), where t_0 is the time of last activation. Thus, ISI is the interval from the end of synaptic activation to the beginning of the next activation for the same individual synapse.

3.3 Handling Different Maximal Conductances. We now have a way of calculating the state variable R = R_on + R_off. We can calculate a conductance G from this as we did in equation 2.6a or else calculate the components of G = G_on + G_off individually:

G = ḡ (R_on + R_off) = ḡ R   (3.4a)

G_on = ḡ R_on   (3.4b)

G_off = ḡ R_off   (3.4c)

The foregoing analysis assumes that the individual synapses all have identical conductances. This will generally not be the case. To handle
varying g_i s, we need to expand 3.4b and 3.4c in the same manner as previously (cf. equations 3.1a and 3.1b):

G_on = Σ_{i=1}^{N_on} g_i r_on,i,  G_off = Σ_{i=1}^{N_off} g_i r_off,i.

We divide through by ḡ and change variables to create a new state variable r′_i = (g_i/ḡ) r_i. If we now redefine R_on = Σ_{i=1}^{N_on} r′_on,i, R_off = Σ_{i=1}^{N_off} r′_off,i, and N_on = Σ_i (g_i/ḡ), we arrive back at equations 3.2a and 3.2b. The change in the definition of N_on is the only one that affects the implementation. Previously N_on = Σ_i 1.0, since each individual synapse had identical magnitude 1. Now, instead of incrementing or decrementing by 1 when turning an individual synapse on or off, we simply add or subtract the appropriate g_i/ḡ. Making our new state variable a proportion rather than a conductance is done for convenience and to maintain the Hodgkin-Huxley tradition of dimensionless state variables. The new state variable is described by a slight variation in equation 2.1. In the usual convention, r_closed = 1 − r_open. With this modification, r′_closed = (g_i/ḡ) − r′_open. The dimensionless state variable is also useful for managing simulations. With ḡ treated as a simulation-wide global parameter, equation 3.4a gives the user the ability to globally alter the strength of a particular neurotransmitter by reducing the corresponding ḡ. This is analogous to the common experimental practice of dumping transmitter antagonists into the bath in vitro or giving antagonists systemically in vivo.
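The consolidation scheme of Sections 3.1-3.3 can be sketched as follows. This is a simplified stand-alone illustration, not the NEURON implementation: the class name, method names, and all parameter values are assumptions, and the transition bookkeeping presumes that transmitter stays on for exactly Cdur.

```python
import math

# Illustrative parameters (assumptions, not the paper's values)
ALPHA, BETA, CDUR, GBAR = 2.0, 0.5, 1.0, 1.0
R_INF = ALPHA / (ALPHA + BETA)
TAU_R = 1.0 / (ALPHA + BETA)

class ComplexSynapse:
    def __init__(self, g):
        self.w = [gi / GBAR for gi in g]   # g_i / gbar for each synapse
        self.r = [0.0] * len(g)            # r'_i at its last transition
        self.R_on = self.R_off = self.N_on = 0.0

    def step(self, dt):
        """Two-step inner loop: equations 3.2a and 3.2b."""
        d = math.exp(-dt / TAU_R)
        self.R_on = self.N_on * R_INF * (1.0 - d) + self.R_on * d
        self.R_off *= math.exp(-BETA * dt)

    def turn_on(self, i, isi):
        self.r[i] *= math.exp(-BETA * isi)     # project across ISI (3.3b)
        self.R_off -= self.r[i]                # move r'_i between pools
        self.R_on += self.r[i]
        self.N_on += self.w[i]                 # weighted count (Sec. 3.3)

    def turn_off(self, i):
        d = math.exp(-CDUR / TAU_R)            # project across Cdur (3.3a)
        self.r[i] = self.w[i] * R_INF * (1.0 - d) + self.r[i] * d
        self.R_on -= self.r[i]
        self.R_off += self.r[i]
        self.N_on -= self.w[i]

    def conductance(self):
        return GBAR * (self.R_on + self.R_off) # equation 3.4a

syn = ComplexSynapse([1.0])
syn.turn_on(0, isi=100.0)
for _ in range(100):                           # transmitter on for Cdur
    syn.step(0.01)
syn.turn_off(0)
assert abs(syn.R_on) < 1e-6                    # on-pool now empty
assert abs(syn.R_off - R_INF * (1 - math.exp(-CDUR / TAU_R))) < 1e-6
```

Because the pooled update 3.2a is exactly the sum of the individual weighted updates, the projected r′_i recovered at a transition matches its contribution to the pool, so moving it between R_on and R_off keeps the lumped state consistent regardless of how many synapses are active.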
4 Maintaining a Single Queue
Simulating delay is necessary because most simulations do not include axons. Therefore the delay encompasses both the time taken for an action potential to proceed down the axon (axonal delay) as well as the typically shorter time required for transmitter to diffuse across the synaptic cleft and bind. Handling delays requires maintenance of a queue, a data structure that always disgorges its oldest element (first-in, first-out). Typically, the time of a presynaptic activation is added to the appropriate delay and then stored on a queue. When this time is reached in the simulation, the item is removed from the queue, and the postsynaptic element is activated. Since many individual synapses are now maintained as a single complex synapse, it is natural to consider maintaining a single queue instead of N queues. The queue must now store not only the times of synaptic activation but also an index indicating the specific individual synapse.
4.1 Managing the Queue from the Presynaptic Side. In the direct object-oriented approach to the synapse, the synapse manages its own initiation by constantly checking the presynaptic cell for a signal, generally the passage of voltage above a predetermined threshold. The consolidation of signal management in a single queue would require an array of such presynaptic pointers. The alternative, maintaining a presynaptic array of postsynaptic pointers, is far more efficient: access across the pointer is required only when spikes occur instead of on each time step (Bower and Beeman 1994). Such forward pointers are particularly important in implementations on multiprocessor computers where pointer access between different CPUs is relatively slow (Niebur et al. 1991). Using forward pointers, the queue receives its input from a structure associated with the presynaptic neuron. When triggered, this structure writes a time stamp equal to current time plus the appropriate delay. Presynaptic identity is also written in the form of an index. The queue is read postsynaptically when time reaches the value of the next queue time. The postsynaptic mechanism is then altered by moving the corresponding r_i from R_off to R_on. Because a complex synapse has a single Cdur, the queue can serve double duty and signal not only the initiation of the synapse but also its termination. For this reason, the queue time is not removed but is instead incremented by Cdur. The queue is implemented with two heads: the first head gives the time for initiating another synapse while the second head gives the time for terminating one. Each time is associated with an index that indicates exactly which individual synapse is being started or stopped.
4.2 A Heap Qua Queue Handles Differing Synaptic Delays. Individual synapses may have different delays. If these synapses share the same queue, an individual synapse with a relatively long delay could activate presynaptically shortly before one with a relatively short delay. This would put a later time on the queue in front of an earlier time. The synapse associated with the earlier time would be activated only after the later time was removed from the queue. To avoid this problem, a heap is used in lieu of a queue. Items in a heap are maintained in numerical order. A traditional heap implementation involves a binary tree (Knuth 1973). In the present case, items arrive out of order relatively rarely and are usually not very far out of order, making a binary tree unnecessary. Instead, the item is checked when it arrives, with the appropriate heap location readily found by forward search when needed. Further consolidation is possible by creating a single master heap for all synapses of a given type. Each heap entry must then include not only an index for the presynaptic mechanism but also one for the postsynaptic mechanism. The algorithm must take account of postsynaptic mechanism number as well as time in maintaining heap order.
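The event-ordering problem of Sections 4.1 and 4.2 can be sketched with a standard binary heap. This is a simplified stand-alone variant, not the paper's forward-search implementation: for clarity we re-push an explicit "off" event on delivery rather than using the two-head scheme, and all names and the entry layout are ours.

```python
import heapq

CDUR = 1.0   # transmitter duration shared by the complex synapse (assumed)
events = []  # heap of (event_time, synapse_index, kind), ordered by time

def spike(t_now, syn_index, delay):
    """Presynaptic trigger: schedule an 'on' event after the delay."""
    heapq.heappush(events, (t_now + delay, syn_index, "on"))

def deliver(t_now):
    """Pop and return all events due at or before t_now, in time order."""
    due = []
    while events and events[0][0] <= t_now:
        t, i, kind = heapq.heappop(events)
        if kind == "on":
            # double duty: the same structure signals termination Cdur later
            heapq.heappush(events, (t + CDUR, i, "off"))
        due.append((t, i, kind))
    return due

spike(0.0, 0, delay=2.0)   # long axonal delay, scheduled first
spike(0.5, 1, delay=0.5)   # short delay, due earlier despite later push
assert [(i, k) for _, i, k in deliver(1.5)] == [(1, "on")]
assert [(i, k) for _, i, k in deliver(2.5)] == [(0, "on"), (1, "off")]
```

The two `spike` calls reproduce the ordering hazard described above: the long-delay synapse is pushed first but must be delivered second, which a plain first-in, first-out queue would get wrong and a heap handles automatically.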
5 Simulation Results
Benchmark simulations were run in NEURON (Hines 1993) on Sun SPARC10s under SunOS 4.1.3 and Intel Pentiums under Linux. Figure 1 shows results comparing individual single synapse evaluation with the complex summated synapse. In Figure 1B the summation of DMS state variables (Σ r_i, dashed line) is compared to the single R_on calculated using the present algorithm (solid line). The lines do not coincide because the complex synapse algorithm includes weighting for the different g_i s. Figure 1C compares conductances for the two schemes. Benchmarking demonstrated a 3-fold speed-up with the current algorithm. Extending the simulation to 200 fully interconnected mutually excitatory neurons receiving similar input and spiking at approximately 12 Hz demonstrated a 45-fold speed-up. A more complex simulation with 225 excitatory and inhibitory neurons was also benchmarked. Individual neurons had two compartments and 8-9 voltage- and/or calcium-sensitive conductances. Connectivity was nearly complete with a total of 50,400 synapses and the average firing rate was approximately 20 Hz. No attempt was made to optimize either the old or new synapse model by determining ideal queue lengths; instead, conservative values were used. With the synapse model presented here, core memory usage was reduced 38%, from 8 to 5 Mb. CPU time was reduced by 96%, from 38 hr 50 min to 1 hr 35 min.

6 Discussion
Simulations of neuronal networks quickly fall victim to the perils of combinatorics. While the calculation time required for simulating individual neurons increases proportionally with the number of neurons n, the number of synapses can increase up to n² depending on convergence. Specifically, the number of synapses S can be given either by the product of convergence and number of postsynaptic cells, S = C · Post, or by the product of divergence and number of presynaptic cells, S = D · Pre. Percent convergence, C/Pre, expressed in terms of number of synapses, is S/(Post · Pre). This is equal to percent divergence: D/Post = S/(Pre · Post) (Traub et al. 1987). Calling this term percent connectivity (p_ij), a network with N cell types and n_j cells of each type will have number of synapses S given by

S = Σ_{i=1}^{N} Σ_{j=1}^{N} p_ij n_i n_j.
Depending on the complexity of the single neuron model chosen, time spent in synaptic computations can readily outrun the time spent modeling the neurons themselves. This will be particularly true in parallel implementations if pointers are not managed carefully, as noted above.
Figure 1: Comparison of the DMS model synapse (dashed line) with the complex synapse (solid line). Randomized synaptic input was used to drive both synapse models. Individual ḡ values ranged systematically from 0.1 to 5 μS while delays ranged from 0 to 25 msec. (A) Six of the 50 presynaptic spike trains used as input to the two synapse models. The bottom 5 traces are the most rapidly spiking and the top 1 trace is the least rapidly spiking cell. Spike trains were produced with a Poisson generator using the gen.mod presynaptic spike generator written by Zach Mainen. (B) Comparison of summed state variables for the two models: Σ r_i (dashed line) vs. R (solid line). The former is dimensionless while the latter is in μS. (C) Comparison of summed conductance (in μS) for the two models: Σ r_i g_i (dashed line) vs. R·ḡ (solid line). The curve for the complex synapse is identical to that in B since ḡ = 1. Although not apparent here, the superposition is imperfect due to time-step round-off differences between the two implementations.
The consolidated implementation presented here extends the value of the original DMS synapses in reducing this computational load.
Acknowledgments
I would like to thank Jack Wathey, Mike Hines, and Alain Destexhe for many helpful virtual discussions and the two anonymous reviewers for useful comments and corrections. Scott Deyo ran some of the benchmarks. This work was done using the NEURON simulator with support from NINDS and the Veterans Administration.
References

Bower, J., and Beeman, D. 1994. The Book of Genesis. Springer-Verlag, New York.
Destexhe, A., Mainen, Z. F., and Sejnowski, T. J. 1994a. An efficient method for computing synaptic conductances based on a kinetic model of receptor binding. Neural Comp. 6, 14-18.
Destexhe, A., Mainen, Z. F., and Sejnowski, T. J. 1994b. Synthesis of models for excitable membranes, synaptic transmission and neuromodulation using a common kinetic formalism. J. Comp. Neurosci. 1, 195-230.
Hines, M. 1993. NEURON—A program for simulation of nerve equations. In Neural Systems: Analysis and Modeling, F. H. Eeckman, ed., pp. 127-136. Kluwer Academic Press, Boston, MA.
Knuth, D. 1973. The Art of Computer Programming, Vol. 3: Sorting and Searching. Addison-Wesley, New York.
Niebur, E., Kammen, D. M., Koch, C., Ruderman, D., and Schuster, H. G. 1991. Phase coupling in two-dimensional networks of interacting oscillators. In Neural Information Processing Systems 3, R. P. Lippmann, J. E. Moody, and D. S. Touretzky, eds., pp. 123-129. Morgan Kaufmann, San Mateo, CA.
Rall, W. 1967. Distinguishing theoretical synaptic potentials computed for different somadendritic distributions of synaptic inputs. J. Neurophys. 30, 1138-1168.
Srinivasan, R., and Chiel, H. J. 1993. Fast calculation of synaptic conductances. Neural Comp. 5, 200-204.
Traub, R. D., Miles, R., and Wong, R. K. S. 1987. Models of synchronized hippocampal bursts in the presence of inhibition. I. Single population events. J. Neurophys. 58, 739-751.
Wang, X. J., and Rinzel, J. 1992. Alternating and synchronous rhythms in reciprocally inhibitory model neurons. Neural Comp. 4, 84-97.
Received May 26, 1995; accepted August 15, 1995
Communicated by Laurence Abbott
Parameter Extraction from Population Codes: A Critical Assessment

Herman P. Snippe*
University of Stirling, Department of Psychology, Stirling FK9 4LA, Scotland, U.K.
In perceptual systems, a stimulus parameter can be extracted by determining the center-of-gravity of the response profile of a population of neural sensors. Likewise at the motor end of a neural system, center-of-gravity decoding, also known as vector decoding, generates a movement direction from the neural activation profile. We evaluate these schemes from a statistical perspective, by comparing their statistical variance with the minimum variance possible for an unbiased parameter extraction from the noisy neuronal ensemble activation profile. Center-of-gravity decoding can be statistically optimal. This is the case for regular arrays of sensors with gaussian tuning profiles that have an output described by Poisson statistics, and for arrays of sensors with a sinusoidal tuning profile for the (angular) parameter estimated. However, there are also many cases in which center-of-gravity decoding is highly inefficient. This includes the important case where sensor positions are very irregular. Finally, we study the robustness of center-of-gravity decoding against response nonlinearities at different stages of an information processing hierarchy. We conclude that, in neural systems, instead of representing a parameter explicitly, it is safer to leave the parameter coded implicitly in a neuronal ensemble activation profile. 1 Introduction
Structure can be coded in many ways. We briefly discuss three coding systems for a parameter a. More possibilities, e.g., based on the temporal characteristics of the neural response, certainly exist (e.g., Geisler et al. 1991; Hopfield 1995; Konig et al. 1995; Middlebrooks et al. 1994; Oram and Perrett 1992). 1. Make a sensor that has a response strength R that grows monotonically with a: R = R(a). In the most simple case the function R(a) *Present address: University of Groningen, Department of Biophysics, Nijenborgh 4, 9747 AG Groningen, The Netherlands.
Neural Computation 8, 511-529 (1996)
@ 1996 Massachusetts Institute of Technology
is the identity: R = a. We call this intensity coding for the parameter a (Fig. 1a). 2. Divide parameter space into a large number of discrete cells, and make a sensor for each of these cells. Then a is coded by the identity of the responding sensor (Fig. 1b). It is ubiquitous in computers, e.g., the pixellation of an image. 3. Use sensors with graded and overlapping sensitivity profiles. The coarse value of the parameter a is now carried by the identity of the responding sensors, but for a precise determination of a we have to invoke the relative intensity with which each of these sensors responds. This is called a population coding for the parameter a (Fig. 1c). Population codes are ubiquitous in biological systems. For instance, in the visual system, local pattern spatial frequency and orientation, motion speed and direction, and binocular disparity are all strongly believed to be population coded according to our definition. Many more examples exist in other modalities, for instance the coding of the location of a sound source using neurons selectively tuned for different values of interaural time and intensity differences (e.g., Konishi 1991, 1993), the coding of target distance and speed in bat echolocation using neurons tuned for echo delay and Doppler shift (e.g., Altes 1989; Olsen 1992; Olsen and Suga 1991a,b), and the coding of olfactory and gustatory stimuli (e.g., Rolls 1989). Population coding is not restricted to sensory modalities. It has been shown that neurons in motor cortex have graded and overlapping activity profiles for (intended) movement direction (Georgopoulos et al. 1988; see Lee et al. 1988; McIlwain 1991 for examples in the superior colliculus). Also it is conceivable that population coding is used in cognition (Poggio and Girosi 1990; Young and Yamane 1992). Because of the ubiquity of population coding in neural systems, it is important to know its capacities.
What is the precision with which the parameter a is present in the neural ensemble response profile? Are there simple ways to extract a from the neuronal activation profile? This extraction of a is not trivial, since a is implicit only in the ensemble of sensor responses. A solution to the parameter extraction problem is to use the center of gravity of the neuronal activation profile as the estimate of a (Baldi and Heiligenberg 1988; Lee et al. 1988; Zohary 1992). Denoting the parameter tuning of the nth neuron in the ensemble by a_n, and the actual response of this neuron by R_n, the center-of-gravity (CG) estimate â_CG of a is

â_CG = Σ_n a_n R_n / Σ_n R_n   (1.1)
Note that equation 1.1 makes intuitive sense. Each neuron promotes its tuning label, and is allowed to do so in proportion to its response to the actual stimulus. The denominator in 1.1 is a normalization factor that
Figure 1: Three possibilities to code the value of an environmental parameter a. (a) Intensity coding; the intensity of response of a single sensor suffices to code the value of a. (b) Labeled line coding; the identity of the responding sensor codes the value of a. For a precise evaluation of a many sensors are needed. (c) Population coding by sensors with graded and overlapping sensitivity profiles. This represents a balance between the extremes of intensity coding and labeled line coding. A value of a indicated by the arrow would yield a strong response from sensor 4, a moderate response from sensors 3 and 5, and no response from the other sensors in the array. Thus, the value of a is coded partly by the identity of the responding sensors, and partly by the intensity of their response.
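The center-of-gravity rule of equation 1.1 can be sketched in a few lines of NumPy; the array layout, tuning width, and stimulus value below are hypothetical, not taken from the text:

```python
import numpy as np

def center_of_gravity(labels, responses):
    """Center-of-gravity (vector) decoding, equation 1.1: each sensor
    'votes' for its tuning label a_n in proportion to its response R_n;
    the denominator normalizes out overall stimulus intensity."""
    labels = np.asarray(labels, dtype=float)
    responses = np.asarray(responses, dtype=float)
    return np.sum(labels * responses) / np.sum(responses)

# Regular array of 11 sensors with gaussian tuning centred at 0, 1, ..., 10
labels = np.arange(11)
a_true, sigma = 4.3, 1.5
responses = np.exp(-(a_true - labels) ** 2 / (2 * sigma ** 2))
a_cg = center_of_gravity(labels, responses)  # close to 4.3 for noise-free responses
```

For noise-free responses on a dense regular array the estimate is essentially unbiased, which is the situation analyzed by Baldi and Heiligenberg (1988); the questions below concern its variance once the responses R_n are noisy.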
deconfounds influences due to stimulus intensity. Center-of-gravity decoding can be easily extended to a multidimensional parameter a (Georgopoulos et al. 1988; Lee et al. 1988; McIlwain 1991; Salinas and Abbott 1994; Sanger 1994; Zohary 1992), or reformulated as an extraction of a population vector (Seung and Sompolinsky 1993; Vogels 1990). For clarity we restrict our discussion to the one-dimensional form of equation 1.1. Previous studies have looked in detail into systematic deviations of â_CG from the true value a (e.g., Baldi and Heiligenberg 1988; Sanger 1994). These deviations are very small provided that the sensors sample the parameter space sufficiently densely (Baldi and Heiligenberg 1988), and that the distribution of the sensor locations a_n is sufficiently homogeneous (see Sanger 1994 for a more precise statement). Thus for many systems systematic errors in the center-of-gravity estimate will tend to be small. However, in real neural systems the sensor responses are noisy. Hence the sensor responses R_n in equation 1.1 should be treated as random variables. From statistics we know that the fidelity of an estimate as in equation 1.1 is measured not only by its systematic deviations (bias), but also by its random deviations (variance). Of the possible unbiased estimators of a, the one that has the lowest variance is to be preferred. In this paper we compare the variance of the center-of-gravity estimate with that of the statistically optimal, minimum variance unbiased estimate based upon the same channel-coded data. Our analytical results complement simulation studies by Salinas and Abbott (1994). For additional analytical results see Seung and Sompolinsky (1993). Equation 1.1 has usually been studied under the assumption of system linearity. However, nonlinearities in neural information processing abound (e.g., Abbott 1994; Douglas and Martin 1991).
Thus it is crucial that a proposed parameter extraction scheme is robust under nonlinearities at different stages in the information processing hierarchy. In section 5 we evaluate this robustness for the CG estimate 1.1.

2 Efficiency of CG Estimation Is Low for Sharply Tuned Sensors Perturbed by Background Noise
In this section we study the model of Baldi and Heiligenberg (1988): a regular array of 2M + 1 sensors with unit spacing between consecutive sensors. The parameter tuning profiles of the sensors are gaussian with width σ; for the nth sensor

Q_n(a) = exp[−(a − n)²/(2σ²)]   (2.1)
Contrary to the treatment by Baldi and Heiligenberg, we assume that the actual sensor outputs R_n are noisy:

R_n = Q_n(a) + W_n   (2.2)
Throughout this paper, we assume that the noise W is uncorrelated between sensors (see Snippe and Koenderink 1992b for a treatment of noise correlations). In this section we also assume that the noise is gaussian, and that the noise variance is independent of the (expected) sensor response Q_n. We call this situation background noise. Although the statistics of real neural noise are closer to Poisson (Softky and Koch 1992), a study of the effects of background noise is nevertheless relevant when sensor response (2.2) is a modulation superimposed on a spontaneous (noisy) neural activity level, as is the case for retinal ganglion cells (Robson and Troy 1987). The extraction of the stimulus parameter a from the ensemble response {R_n} can be formulated as a problem in statistical estimation theory (Deutsch 1965): generate an estimator f that operates on the ensemble response {R_n} to yield an estimate â of the actual value of the parameter a:

â = f(R_{−M}, . . . , R_M)   (2.3)
Generally, the quality of an estimator is measured using two quantities:

1. Its bias, representing any systematic deviations of â from a.
2. Its variance, representing the random errors in â.
Baldi and Heiligenberg (1988) show that the center-of-gravity estimation scheme 1.1 is virtually bias free (when sensor tuning width is not much smaller than the spacing between consecutive sensors). However, to fully assess the statistical quality of the estimate 1.1, we have to compare its variance with the statistically optimal, unbiased minimum variance estimator. In statistics there is a well-known lower bound on the variance Var(â) of any unbiased estimate of a. This Cramér-Rao bound is given by (Deutsch 1965; Paradiso 1988)

Var(â) ≥ 1 / E[(∂ ln p({R_n} | a)/∂a)²]   (2.4)

in which E[. . .] is the statistical expectation operation, and p({R_n} | a) is the probability distribution function of the neuronal ensemble response. Using the gaussian nature of the noise statistics W_n in 2.2, with noise variance η², one easily calculates

ln p({R_n} | a) = const − Σ_n [R_n − Q_n(a)]²/(2η²)   (2.5)

∂ ln p({R_n} | a)/∂a = Σ_n [R_n − Q_n(a)] Q'_n(a)/η²   (2.6)
The prime on Q'_n(a) indicates a differentiation with respect to a. Thus the Cramér-Rao bound 2.4 is

Var(â) ≥ η² / Σ_n [Q'_n(a)]²   (2.7)

Substituting the explicit expression 2.1 for Q_n(a) in 2.7, and replacing the summation by an integration (an excellent approximation for σ ≥ 1, Baldi and Heiligenberg 1988), it is easy to evaluate

Σ_n [Q'_n(a)]² ≈ (1/σ⁴) ∫ x² exp(−x²/σ²) dx = √π/(2σ)   (2.8)
Hence the Cramér-Rao bound for our model system is

Var(â) ≥ 2η²σ/√π   (2.9)
Note that the minimum attainable variance increases with the sensor tuning width σ (Seung and Sompolinsky 1993; Snippe and Koenderink 1992a). Thus to attain a precise estimate of a, it is preferable to have sensors with small σ, i.e., with sharp tuning. For the present model system, the Cramér-Rao bound is attained for the maximum likelihood estimator (Snippe and Koenderink 1992a; Snippe 1996). Thus we use the right-hand side of equation 2.9 as a benchmark against which to compare the variance of the center-of-gravity estimate 1.1. We now proceed to calculate this variance. We concentrate on the variance produced by the numerator Σ_n a_n R_n in 1.1, and neglect the contribution to the variance due to the normalization Σ_n R_n, which we treat as noise free. It can be shown that this does not affect our conclusions. Thus

Var(â_CG) = Σ_n a_n² Var(R_n) / [Σ_n Q_n(a)]²   (2.10)
For our regular sensor array a_n = n, and Var(R_n) = η² for all n; thus the numerator in 2.10 equals

η² Σ_{n=−M}^{M} n² ≈ (2/3) η² M³   (2.11)

Assuming that σ ≥ 1, and that a is well within the sensor tuning range [−M . . . M], the denominator in 2.10 equals

[Σ_n Q_n(a)]² ≈ [∫ exp(−(x − a)²/(2σ²)) dx]² = 2πσ²   (2.12)

Thus equation 2.10 yields

Var(â_CG) ≈ M³η²/(3πσ²)   (2.13)
Contrary to the variance of the maximum likelihood estimator, the variance of the center-of-gravity estimate decreases as a function of sensor tuning width σ. Hence, for background noise, it is advantageous to use broadly tuned sensors in a center-of-gravity estimation scheme (Seung and Sompolinsky 1993). Baldi and Heiligenberg (1988) arrived at a similar conclusion for the systematic deviations (bias) of the center-of-gravity estimate. Using the Cramér-Rao bound as a benchmark, the center-of-gravity estimate has an efficiency

ε = (2η²σ/√π) / (M³η²/(3πσ²)) = 6√π (σ/M)³   (2.14)
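A small Monte Carlo check of equations 2.9, 2.13, and 2.14 can be run in a few lines; the parameter values below (M = 20, σ = 2, noise s.d. η = 0.05) are made up for illustration, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
M, sigma, eta, a_true, trials = 20, 2.0, 0.05, 3.0, 4000
n = np.arange(-M, M + 1)
Q = np.exp(-(a_true - n) ** 2 / (2 * sigma ** 2))  # gaussian tuning, equation 2.1

# Cramer-Rao bound for background (additive gaussian) noise, equation 2.9
crb = 2 * eta ** 2 * sigma / np.sqrt(np.pi)

# Monte Carlo variance of the center-of-gravity estimate 1.1 under equation 2.2
R = Q + eta * rng.standard_normal((trials, n.size))
a_cg = (R @ n) / R.sum(axis=1)
var_cg = a_cg.var()

# Efficiency as in equation 2.14: roughly 6*sqrt(pi)*(sigma/M)**3, about 0.01 here
efficiency = crb / var_cg
```

With M = 10σ, the CG variance comes out roughly a hundredfold above the bound, matching the efficiency of about 0.01 quoted in the text.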
If the sensor array extent 2M is large relative to the sensor tuning width, the statistical efficiency of the center-of-gravity estimate 1.1 is low. For instance, if M = 10σ, equation 2.14 yields an efficiency of about 0.01, i.e., the variance of the center-of-gravity estimate is about 100 times the variance of the statistically most efficient estimate. Why is center-of-gravity estimation so dramatically inefficient under the conditions studied in this section? The reason is the combination of two circumstances: 1. The sensor array extent is large relative to the sensor tuning width. Thus most sensors in the array are not at all responsive to a stimulus with a specific value for the parameter a. 2. Sensor noise is response independent. Thus sensors that are not at all stimulated do generate noise. The inefficiency of the center-of-gravity (CG) estimate is due to the fact that in equation 1.1 contributions of all sensors are summed (in proportion to their position label in the array). This includes the contributions of sensors that show little response to the stimulus, but that do contribute noise. Note that these sensors also contribute in the OLE (optimum linear estimation) scheme of Salinas and Abbott (1994), since for the regular
sensor array studied in this section, OLE is essentially identical to the CG estimate 1.1 (with the denominator in 1.1 replaced by a suitable normalization constant). This behavior of CG/OLE decoding is opposed to that of the maximum likelihood (ML) estimator. In an ML estimation scheme the sensor outputs enter the decision with a weight equal to their differential sensitivity to small variations of the parameter a around its actual value (Snippe and Koenderink 1992a; Snippe 1996). Since sensors with tuning labels a_n that are far removed from a remain unresponsive to the stimulus under small variations of a around its actual value, these sensors do not enter the ML decision variable. Hence, contrary to center-of-gravity (and OLE) decoding, ML estimation is not affected by noise in the unresponsive sensors.

3 Efficiency of CG Estimation Is High for Poisson Noise or Broadly Tuned Sensors
In the previous section we showed that center-of-gravity estimation is highly inefficient for ensembles of sharply tuned sensors that are perturbed by response independent noise. Here we show that if either of these circumstances is relaxed, the center-of-gravity estimate 1.1 can actually be ideal, i.e., have 100% efficiency. 3.1 Poisson Noise. First we study a case in which sensor response noise variance vanishes for unstimulated sensors. This is true if sensor noise is governed by the Poisson distribution:
P(R_n | a) = exp[−Q_n(a)] Q_n(a)^{R_n} / R_n!   (3.1)

In this expression we assume that the actual sensor response R_n for the nth sensor is a whole number, e.g., the number of spikes generated by the nth sensor in the observation interval. The Poisson distribution is relevant, since its variance equals its mean Q_n(a), approximately true for cortical neurons that show little spontaneous activity (Dean 1981; Softky and Koch 1992; Tolhurst et al. 1983; Vogels et al. 1989). We now calculate the ML estimate for the parameter a when sensor response noise is governed by 3.1. As is well known, the ML estimate can be obtained by differentiating the logarithm of the response probability distribution function with respect to the parameter of interest:

∂/∂a ln P({R_n} | a) = ∂/∂a Σ_n [−ln R_n! + R_n ln Q_n(a) − Q_n(a)]
= Σ_n [R_n Q'_n(a)/Q_n(a) − Q'_n(a)]   (3.2)
Again we assume that the sensor tuning profiles are gaussian, and that the sensor array is regular (we will study nonregular arrays in the next section). Then the ratio Q'_n(a)/Q_n(a) equals (n − a)/σ². When the sensor array is sufficiently dense, Σ_n Q'_n(a) equals zero (Baldi and Heiligenberg 1988), and equation 3.2 reduces to

Σ_n R_n (n − a)/σ²   (3.3)

Setting this expression to zero yields the ML estimator â_ML of a

â_ML = Σ_n n R_n / Σ_n R_n   (3.4)
This result is identical to the center-of-gravity estimate 1.1 for a regular sensor array. Since the ML estimator 3.4 is essentially unbiased (Baldi and Heiligenberg 1988), it is the minimum variance estimate (Deutsch 1965). Thus under the circumstances studied in this subsection:

- a regular array,
- of sensors with gaussian tuning functions,
- with output noise governed by Poisson statistics,

center-of-gravity estimation is optimal from a statistical point of view. How closely does center-of-gravity estimation still approximate ideal when these conditions are relaxed?
1. Provided that Σ_n Q'_n(a) is still zero, the regularity of the sensor array is not essential, since the ML estimate, as evaluated from equation 3.2, is identical to equation 1.1. In general, however, as we show in the next section, sensor array irregularities will lead to a reduction of the efficiency of the center-of-gravity estimate. 2. Small amounts of background noise injections can have a large influence on the structure of the ML estimator (Geisler 1984). However, if the Poisson component of the noise is still the dominant contribution to the variance of the center-of-gravity estimate, it will remain close to ideal. How much background noise can be tolerated depends on the size of the sensor array, i.e., the parameter M/σ from equation 2.14. Large arrays tolerate less background noise. 3. The derivation presented works only for gaussian tuning functions. The impairment in the efficiency of the center-of-gravity estimate depends critically on the behavior of the tuning function in its skirts, with shallow skirts yielding large impairments. For instance, for Cauchy tuning functions, Q_n(a) = 1/[(a − n)² + b²], which have very shallow skirts, we calculate that the efficiency of the center-of-gravity estimator approaches zero for a large sensor array (M/b >> 1).
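The Poisson-noise case of section 3.1 is easy to verify numerically. In the sketch below (hypothetical values: peak count R_m = 25, σ = 2, M = 15) the center-of-gravity estimate 3.4 attains the Cramér-Rao bound to within sampling error:

```python
import numpy as np

rng = np.random.default_rng(1)
M, sigma, r_max, a_true, trials = 15, 2.0, 25.0, 0.7, 5000
n = np.arange(-M, M + 1)
Q = r_max * np.exp(-(a_true - n) ** 2 / (2 * sigma ** 2))   # mean spike counts

R = rng.poisson(Q, size=(trials, n.size))     # Poisson responses, equation 3.1
a_cg = (R @ n) / R.sum(axis=1)                # CG estimate = ML estimate, equation 3.4

# Cramer-Rao bound for Poisson noise: 1 / sum_n [Q'_n(a)^2 / Q_n(a)]
Qp = Q * (n - a_true) / sigma ** 2
crb = 1.0 / np.sum(Qp ** 2 / Q)

efficiency = crb / a_cg.var()   # close to 1: CG decoding is optimal here
```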
3.2 Broadly Tuned Sensors. In Section 2 we showed that for sensors perturbed by background noise, the statistical efficiency of center-of-gravity estimation is low when the sensors are sharply tuned, i.e., when M/σ >> 1. A careful evaluation of the integrals in Section 2 shows that when the sensors are broadly tuned (M/σ < 1), the efficiency of population coding approximates 1. This is a general result that is not restricted to gaussian tuning functions (Seung and Sompolinsky 1993). For instance, if the tuning functions are cosinusoidal, i.e., if sensor output is described by a projection of a stimulus vector on a sensor sensitivity vector, vector decoding (a variant of the center-of-gravity scheme 1.1) can be shown to be statistically optimal for a sufficiently homogeneous distribution of sensor sensitivity vectors (Salinas and Abbott 1994; Sanger 1994).

4 Sensor Position Irregularities: Another Noise Source for Center-of-Gravity Estimation

Up to now we have studied regular sensor arrays. Although near-perfect regular arrays exist in biological systems (e.g., the human foveal photoreceptor array, Hirsch and Miller 1987; the cricket cercal system, Miller et al. 1991), deviations from regularity often occur. How do such irregularities affect the performance of the center-of-gravity estimation scheme? To study this question, we analyze a system in which the actual sensor tuning locations a_n are perturbed from the positions ā_n = n they would have in a regular array. We describe the perturbations using a gaussian probability density function:

A(a_n) = [1/√(2πs²)] exp[−(a_n − n)²/(2s²)]   (4.1)

The (root-mean-square) perturbation size s describes the degree of irregularity of the sensor array. For s small relative to the average spacing between consecutive sensors (s << 1) the array is close to regular, whereas for s >> 1 the array is very irregular. We assume that the sensors are identical with respect to tuning width, response strength, and noise variance. This is not very realistic biologically, but actually such variations between sensors would have effects very similar to the effects of location perturbations studied here. See Vogels (1990) and Zohary (1992) for simulation studies of these effects. We evaluate the impairment of the population coding estimate 1.1 due to the sensor location perturbations for sensors with outputs perturbed by Poisson noise (equation 3.1). Note that by using the actual (perturbed) sensor locations a_n in 1.1 instead of their mean ā_n = n, we assume that the estimator fully knows the sensor array irregularities (Bossomaier et al. 1985; Theunissen and Miller 1991). See Ahumada (1991) for learning models that accomplish this.
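The effect of position jitter can be previewed by simulation before the analysis. The sketch below uses made-up values (jitter size s = 0.6, peak count 25); even though the decoder knows the perturbed labels a_n, the estimate's variance grows with s:

```python
import numpy as np

rng = np.random.default_rng(2)
M, sigma, r_max, trials = 15, 2.0, 25.0, 5000
n = np.arange(-M, M + 1)

def cg_variance(s):
    """Variance of the CG estimate 1.1 when sensor positions are jittered by
    gaussian perturbations of rms size s (equation 4.1); the decoder uses
    the actual perturbed labels a_n, i.e. it fully knows the irregularity."""
    a_n = n + s * rng.standard_normal((trials, n.size))
    Q = r_max * np.exp(-a_n ** 2 / (2 * sigma ** 2))   # true parameter a = 0
    R = rng.poisson(Q)                                 # Poisson response noise
    return ((a_n * R).sum(axis=1) / R.sum(axis=1)).var()

var_regular = cg_variance(0.0)     # response noise only
var_irregular = cg_variance(0.6)   # position jitter adds an extra variance term
```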
We rewrite equation 1.1 as

â_CG − a = Σ_n (a_n − a) R_n / Σ_n R_n   (4.2)

This shows that the actual value a of the environmental parameter is not critical for our results; hence we lose little generality by assuming a = 0, which we do for convenience. For reasons of symmetry it is then obvious that the numerator and denominator in the right-hand side of equation 4.2 are statistically independent, and that the statistical expectation of the numerator equals zero. Thus, the variance of the center-of-gravity estimate 4.2 is

Var(â_CG) = E[(Σ_n a_n R_n)²] / (E[Σ_n R_n])²   (4.3)

Note that expectation operations E[. . .] are to be evaluated over the sensor response distribution P(R_n | a) of equation 3.1 and over the sensor position label distribution A(a_n) of equation 4.1. Schematically:

E[. . .] = E_{A(a_n)} E_{P(R_n | a)} [. . .]   (4.4)
In the Appendix we show that the variance factor in 4.3 can be rewritten as

E[(Σ_n a_n R_n)²] = E[Σ_n a_n² Q_n] + E[(Σ_n a_n Q_n)²]   (4.5)

The first term in the right-hand side of equation 4.5 generates the part of the variance of â_CG due to the variance of sensor response. This is easy to see if we evaluate the variance operation in equation 4.3 while keeping the sensors fixed at positions a_n, instead of describing them using a probability model as in equation 4.1. For such fixed (albeit irregular) sensor positions, equation 4.3 yields

Var(â_CG) = Σ_n a_n² Q_n / (Σ_n Q_n)²   (4.6)

Thus only the contribution of the first term in the right-hand side of equation 4.5 occurs if we fix the sensor array irregularity and take into account only the variance due to the noise in the sensor outputs. Therefore we interpret this as the contribution to the variance of the center-of-gravity estimate due to the noise in the sensor outputs. What, then, is the second term in equation 4.5? Though a fixed, irregular array generates only the variance of the first term of equation 4.5, it will, in general, also generate a bias in the center-of-gravity estimate of a. The second term in equation 4.5 is simply the mean squared bias in the estimate of a due to the
irregularity of the sensor array. In this section we interpret this term as a variance due to the randomness in the sensor positions. We assume, as in equation 2.1, that the sensor tuning profiles Q_n are gaussian, but with an extra free parameter R_m that represents the sensor sensitivity to the stimulus; thus

Q_n(a) = R_m exp[−(a − a_n)²/(2σ²)]   (4.7)

For this Q_n the remaining expectation operations over sensor position in equation 4.5 can be calculated explicitly. When replacing the summations over n by integrations, the calculations are straightforward (though lengthy), and yield, to leading order in s,

Var(â_CG) ≈ σ/(√(2π) R_m) + 3s²/(8√π σ)   (4.8)

The term σ/(√(2π) R_m) is the contribution to the variance due to the noise in the sensor responses; the remaining term is due to the randomness in the sensor positions. To judge the relative contributions of these two terms, we need a realistic estimate of R_m. An estimate R_m = 25 follows from observed spiking frequencies in well-stimulated cortical neurons (circa 100 spikes/sec) and a reaction/integration time of 250 msec (Vogels 1990). A similar estimate for R_m would follow from results on the actual noise variance of cortical neurons when we realize that, for Poisson noise, R_m is the (quadratic) signal-to-noise ratio of the most vigorously responding neurons in our model (Tolhurst et al. 1983; Vogels et al. 1989). For a perfectly regular sensor array (s = 0), the position term in 4.8 equals 0, and the sole contribution to the error in â_CG is due to the sensors' response noise. For s > 0, there is a contribution due to sensor array irregularities, quadratic in s for small s.
The term 23,iZR,,1 is the contribution to the variance due to the noise in the sensor responses; the remaining term is due to the randomness in the sensor positions. To judge the relative contributions of these two terms, we need a realistic estimate of R,,,. An estimate R,,, = 25 follows from observed spiking frequencies in well stimulated cortical neurons (circa 100 spikes/sec) and a reaction/integration time of 250 msec (Vogels 1990). A similar estimate for R,, would follow from results on the actual noise variance of cortical neurons when we realize that R,, = R$JRmaX is the (quadratic) signal-to-noise ratio of the most vigorously responding neurons in our model (Tolhurst et nl. 1983; Vogels et 01. 1989). For a perfectly regular sensor array (s = O), the factor in curly brackets in 4.8 equals 0, and the sole contribution to the error in i 7 is~ due ~ to the sensors' response noise. For s > 0, there is a contribution due to sensor array irregularities, quadratic in s for s maxf 0 for 1 = 1.. . . , m. The pseudodimension of F is the length m of the largest shattered sequence. For pattern classification problems, we typically consider a class of (0.1 }-valued functions obtained by thresholding a class of real-valued functions. Define the threshold function, ‘FI : R + {O,l}, as ‘FI(0)= 1 if and only if N 2 0. If F is a class of real-valued functions, let X ( F ) denote the set {Xu): f E F } . The Vapnik-Chervonenkis dimension of a class F of real-valued functions defined on X is the size of the largest sequence of points that can be classified arbitrarily by X ( F ) , ~
VCdim(F) = max{m : ∃x ∈ X^m, H(F) shatters ((x_1, 1/2), . . . , (x_m, 1/2))}

Clearly, VCdim(F) ≤ dim_P(F). The function classes considered in this note can be indexed using a real vector θ of parameters. Let Θ and X be the parameter and input spaces, respectively, and let f : Θ × X → R. The function f defines a parameterized class of real-valued functions defined on X, {f(θ, ·) : θ ∈ Θ}. We also denote this function class by f.

VC Dimension of Two-Layer Neural Networks

Definition 1. A two-layer sigmoid network with n inputs, W weights, and a single real-valued output is described by the function f_S : R^W × X → R, where X ⊆ R^n,

f_S(θ, x) = a_0 + Σ_{i=1}^{k} a_i / (1 + e^{−(b_i·x + b_{i0})})

with a_i ∈ R, b_i = (b_{i1}, . . . , b_{in}) ∈ R^n, and θ = (a_0, . . . , a_k, b_{10}, . . . , b_{kn}) ∈ R^W. (For x, y ∈ R^n, x·y = Σ_{i=1}^{n} x_i y_i.) In this case, W = kn + 2k + 1. A radial basis function (RBF) network is described by the function

f_RBF(θ, x) = a_0 + Σ_{i=1}^{k} a_i e^{−‖x − c_i‖²}

with a_i ∈ R, c_i = (c_{i1}, . . . , c_{in}) ∈ R^n, and θ = (a_0, . . . , a_k, c_{11}, . . . , c_{kn}) ∈ R^W. (‖x‖² = x·x.) Here, W = kn + k + 1.
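Definition 1 translates directly into code. The sketch below implements both network functions and their weight counts; the array shapes and the example sizes k = 3, n = 2 are illustrative choices, not from the text:

```python
import numpy as np

def f_sigmoid(a, b, x):
    """Two-layer sigmoid network of Definition 1:
    f_S(theta, x) = a_0 + sum_i a_i / (1 + exp(-(b_i . x + b_i0))).
    a has shape (k+1,); b has shape (k, n+1), with b[:, 0] the biases b_i0."""
    z = b[:, 0] + b[:, 1:] @ x
    return a[0] + a[1:] @ (1.0 / (1.0 + np.exp(-z)))

def f_rbf(a, c, x):
    """RBF network of Definition 1:
    f_RBF(theta, x) = a_0 + sum_i a_i * exp(-||x - c_i||^2).
    a has shape (k+1,); c has shape (k, n)."""
    return a[0] + a[1:] @ np.exp(-((x - c) ** 2).sum(axis=1))

k, n = 3, 2
W_sigmoid = k * n + 2 * k + 1   # weight count of the sigmoid network
W_rbf = k * n + k + 1           # weight count of the RBF network
```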
Theorem 2. Let X = {−D, . . . , D}^n for some positive integer D. For the sigmoid and RBF networks f_S, f_RBF : R^W × X → R, we have

VCdim(f_S) ≤ dim_P(f_S) < 2W log_2(24eWD)
VCdim(f_RBF) ≤ dim_P(f_RBF) < 4W log_2(24eWD)

The proof of Theorem 2 follows from the simple observation that the function classes f_S and f_RBF can be expressed as a polynomial in some transformed set of parameters when the inputs are integers. We can then use an upper bound from Goldberg and Jerrum (1993) on the VC-dimension of such a function class.

Proof. Consider the function f_S defined in Definition 1. For any θ ∈ Θ, x ∈ X and r ∈ R, let

f_S'[θ, (x, r)] = [f_S(θ, x) − r] Π_{i=1}^{k} (1 + e^{−(b_i·x + b_{i0})})
Clearly, f_S'[θ, (x, r)] always has the same sign as f_S(θ, x) − r, since the denominators in f_S(θ, x) are always positive, so dim_P(f_S) ≤ VCdim(f_S'). But f_S'[θ, (x, r)] is polynomial in θ' = (a_0, . . . , a_k, e^{b_10}, . . . , e^{b_kn}), with degree no more than 2Dnk + k + Dn + 1 < 3DW. Theorem 2.2 in Goldberg and Jerrum (1993) implies that VCdim(f_S') < 2W log_2(24eWD). Similarly, f_RBF(θ, x) − r has the same sign as f_RBF'[θ, (x, r)], where

f_RBF'[θ, (x, r)] = [f_RBF(θ, x) − r] Π_{i=1}^{k} Π_{j=1}^{n} e^{2c_{ij}D}

= (a_0 − r) Π_{i=1}^{k} Π_{j=1}^{n} e^{2c_{ij}D} + · · ·
Again, f_RBF'[θ, (x, r)] is polynomial in θ' = (a_0, . . . , a_k, e^{2c_11}, . . . , e^{2c_kn}, e^{−c_11²}, . . . , e^{−c_kn²})
Peter L. Bartlett and Robert C. Williamson
with degree less than 3DW. As above, dim_P(f_RBF)
0.02), the neural-network function attains a quasilinear form and the generalization performance deteriorates. To verify the predicted mechanisms by which the weight noise improves generalization, we examined the hidden-layer activations and derivatives. With no noise injection, it was found that all 15 hidden units contribute to the network output. At T = 1.25 x, on average only five hidden units out of 15 had nonnegligible contributions. When the noise level was increased to T = 0.01, the number of active hidden units was further reduced to three. This result stands in contrast to training with input noise, where most of the hidden units contribute to the network output. We further found that those active hidden units had activations close to either −1 or 1. The foregoing observations are in agreement with our predictions that weight noise reduces the number of hidden units and encourages small derivatives at sigmoidal units.
6.1.3 Langevin Noise. The neural-network function obtained using Brownian dynamics with a gradually decreasing T is shown as the dashed curve in Figure 4. The temperature (T) is decreased exponentially with the number of iterations during training. The solid curve represents the neural-network function obtained using the conjugate-gradient algorithm. We see that the Langevin noise has indeed no regularization effect, although it is more effective in finding the global minimum of E(w), as manifested by the perfect fit of the training set (shown by the dots). The training errors and generalization errors for the various neural-network functions are given in Table 1.
6.2 A Classification Problem. Our second example is a two-class classification problem. Through this example, we demonstrate how the input noise and weight noise affect the decision boundaries and how they improve a neural network's generalization performance for classification problems. The classification problem under consideration is defined by two overlapping bivariate normal distributions. Let N(μ, Σ) be a bivariate normal distribution with mean μ and covariance matrix Σ. The joint probability
Adding Noise During Backpropagation
Figure 4: Training with Langevin noise. The dashed and solid curves represent the neural-network functions obtained using Brownian dynamics and the conjugate-gradient algorithm, respectively. Note that the neural-network function obtained using the Brownian dynamics (the dashed curve) fits the training set perfectly. This suggests that the dashed curve is at the global minimum of E(w) whereas the solid curve is at a local minimum.

that x will be found and that it belongs to the first class, Class A, is given by

P(A, x) = (1/2) N(μ_A, Σ)   (6.3)

The joint probability that x will be found and that it belongs to the second class, Class B, is given by
P(B, x) = (1/2) N(μ_B, Σ)   (6.4)
The means and covariances appearing in equation 6.3 and equation 6.4 are given by

μ_A = (1, −1),   μ_B = (−1, 1),   and   Σ = I   (6.5-6.7)
where I is the two-dimensional identity matrix. Assuming the cost of misclassifying Class A is the same as that of misclassifying Class B, one can derive the optimal classification rule for a two-class classification problem (e.g., Pao 1989): classify x as

Class A   if P(A, x) > P(B, x)
Class B   otherwise   (6.8)
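For this particular pair of distributions, rule 6.8 reduces to comparing x1 with x2, and its error rate can be checked by simulation (a sketch; the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
mu_a, mu_b = np.array([1.0, -1.0]), np.array([-1.0, 1.0])
n_per_class = 100_000

xa = rng.standard_normal((n_per_class, 2)) + mu_a   # Class A samples
xb = rng.standard_normal((n_per_class, 2)) + mu_b   # Class B samples

def classify_a(x):
    """Rule 6.8 with equal priors and Sigma = I: choose Class A when x is
    closer to mu_a, i.e. on the x1 > x2 side of the ideal boundary x1 = x2."""
    return x[:, 0] > x[:, 1]

# Overall error rate of the ideal rule, approximating the Bayes error
# Phi(-sqrt(2)), about 8%, quoted in the footnote to Table 2
error = (~classify_a(xa)).mean() / 2 + classify_a(xb).mean() / 2
```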
Guozhong An
The decision boundary implied by equation 6.8 is thus described by the equation P(A, x) = P(B, x). Using equations 6.3 and 6.4, we find that the ideal boundary for the present problem is described by the equation x1 = x2. A total of 40 training examples were generated, 20 from each class. We assigned a desired output of −0.9 to Class A and 0.9 to Class B. The architecture of the neural network was chosen to be N2-10-1. Both the hidden and the output units are assigned a tanh(x) transfer function. The neural network had 40 parameters (weights). With the chosen architecture, the number of weights in the neural network matches the number of training examples. Training was performed using the conjugate-gradient algorithm in all cases. The decision boundaries of the neural network are taken to be f(x, w) = 0 for the present classification problem. Gaussian noise with zero mean was used in all cases. 6.2.1 Data Noise. In agreement with our results on the regression problem and with our theoretical analysis, the input noise also makes the neural-network function smoother for classification problems. With an appropriately chosen noise strength T, the neural-network decision boundary comes close to the ideal one. As a consequence, it is found that the generalization performance is significantly improved. The results of training without noise injection are presented in Figure 5. The neural-network function f(x, w) is plotted in Figure 5a and the training set and the decision boundary are shown in Figure 5b. The diamonds and the stars represent the training examples from Class A and Class B, respectively. The ideal and the neural-network classification boundaries are represented by the thin and thick lines, respectively. The figure shows that the neural network adapts so much to the training set that it classifies the training set perfectly. It does so by forming complex boundaries, however.
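The input-noise procedure discussed here amounts to perturbing each training input with fresh zero-mean gaussian noise of variance T at every presentation. A generic sketch (the function and parameter names are invented; grad_fn stands for any per-example loss gradient):

```python
import numpy as np

rng = np.random.default_rng(4)

def train_with_input_noise(x, y, grad_fn, w, T, epochs=100, lr=0.01):
    """Stochastic gradient descent with input-noise injection: a fresh
    zero-mean gaussian perturbation of variance T is added to the input
    before each gradient evaluation."""
    for _ in range(epochs):
        for i in range(len(x)):
            x_noisy = x[i] + np.sqrt(T) * rng.standard_normal(np.shape(x[i]))
            w = w - lr * grad_fn(w, x_noisy, y[i])
    return w
```

For a linear model this procedure is known to act like ridge-style shrinkage of the fitted weights, which is one way to see the smoothing (regularization) effect analyzed in the text.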
Poor generalization is evident, since the neural-network decision boundary differs considerably from the ideal one, as can be seen from Figure 5b. The results of training with input noise (T = 0.125) are presented in Figure 6. It is clear from a comparison of Figures 5 and 6 that input noise smoothes the neural-network function. Input noise causes the decision boundaries to be less well adapted to the training set data and to be closer to the ideal one. Improved generalization is apparent. To quantify the generalization performance, we measured the classification error on a large test set (10,000 data points) that was generated in the same way as the training set. The generalization performance of this training with input noise is summarized in Table 2. As can be seen from the table, training with noise reduces the misclassification rate from 27 to 9%. 6.2.2 Weight Noise. We found, in agreement with previously reported results (Murray and Edwards 1994), that weight noise improves the generalization performance for classification problems. The lowest misclas-
Figure 5: Training without noise injection for a classification problem: (a) the neural-network function; (b) the classification boundaries and the training data. The thick line and the thin line represent, respectively, the neural-network classification boundary and the ideal classification boundary. The diamonds and stars denote the training examples from Class A and Class B, respectively. Note that the neural network makes a perfect classification on the training set by forming complex classification boundaries.
Table 2: Generalization Performance of Training with Input Noise for the Classification Problem^a

Noise level T    Misclassification on training set (%)    Misclassification on test set (%)
0                 0                                        27
0.02              3                                        17
0.125            10                                         9

^a Bayes classification error for this problem is 8%. Note that, with a noise strength T = 0.125, the generalization performance differs from the optimal classification by only 1%.
sification rate that we found on the test set was 11%, which is comparable to the 9% misclassification rate that was achieved using input noise (Tables 2 and 3). For T > 0.018, the network function on one side of the classification boundary had an almost constant -1 value, which changed abruptly to a constant 1 value on the other side of the boundary. This is in line with our prediction that weight noise forces a small derivative |h'(l)| at the output unit, which in turn leads to flat network outputs. In contrast to the regression problem, we found that even with noise addition, all 10 hidden units contribute to the network output. This finding is also in line with the theory, since penalizing large derivatives at the sigmoidal units weakens the effect of penalizing large hidden-layer activations. For the present problem, we may conclude that the smoothing effect, i.e., the small-derivative effect, is the primary mechanism in improving generalization. To verify this, we applied noise only to the output bias, since output-bias noise generates a penalty P = T Σ_μ [h''(l)(y^μ - f^μ) + h'(l)²]/N (cf. equation 4.10), which favors small |h'(l)| without affecting the hidden activations. As shown in Table 3, applying noise only to the output bias indeed improved the generalization as much as applying noise to all the weights. The best generalization was achieved at a temperature of T_o = 0.125.

7 Conclusions
In this article, we have studied three types of noise injection: data noise, weight noise, and Langevin noise. A rigorous analysis was presented of how the various types of noise affect the learning cost function. The noise-induced penalty functions were computed and analyzed in the weak-noise limit, and their properties were related to the generalization performance of training with noise. Experiments were performed on a regression and a classification problem to illustrate and to compare with the formal analysis.
Figure 6: Same as Figure 5 except that the neural network is trained with input noise (T = 0.125): (a) the neural-network function; (b) the classification boundaries and the training set. Note that the neural-network function is smoother than that of Figure 5 and that the decision boundary is closer to the ideal one.

Input noise is shown to add two penalty terms to the standard error function. One is identical to a smoothing term as found in regularization theory, while a second term depends on the fitting residuals. The
Table 3: Generalization Performance of Training with Weight Noise for the Classification Problem^a

Noise type     Noise level        Misclassification        Misclassification
               T_o      T_h       on training set (%)      on test set (%)
Output bias    0.005    0         0                        23.3
               0.125    0         7.5                      11.1
All weights    0.011    0.011     0                        24
               0.018    0.018     7.5                      10.8
               0.045    0.045     7.5                      11

^a T_o and T_h denote the temperature at the output and the hidden layer, respectively.
main effect of the input-noise-induced penalty terms is to constrain the neural network to be less sensitive to variations in the input data. This smoothing effect can be beneficial to the generalization performance. In contrast to input noise, output noise results only in a constant term in the cost function and hence does not improve the generalization performance. We showed that weight noise induces changes in the cost function that are formally similar to those of input noise: the penalty functions induced by input noise and by weight noise would be identical if differentiations with respect to the inputs and differentiations with respect to the weights were interchanged. Despite this formal similarity, we argued that weight noise has a different effect on the generalization performance than input noise: it constrains the neural network to be less sensitive to variations in the weights instead of variations in the inputs. We further demonstrated that weight noise encourages sigmoidal units to operate close to saturation and discourages both large weights in the output layer and large activations in the hidden layer. On both test problems, input noise significantly reduces the generalization error. On our classification problem, weight noise improves the generalization as much as input noise; in the test-case regression problem, however, weight noise improves the generalization substantially less than input noise. Owing to the limited scale of the test cases, this may not be generally true. We argued that training with annealed Langevin noise results in a global minimization similar to that of simulated annealing. Training with annealed Langevin noise therefore has no regularization effect, although it can be effective in finding the global minimum, as demonstrated by our experiment.
Appendix: Noise Injection During Normal Backpropagation Training

In this appendix, we show how the analysis presented in Section 3 can be extended to normal backpropagation training. Define a vector that represents the whole training set by Z_0 = (z^1, z^2, ..., z^N), and make the dependency of E(w) on the training set explicit by denoting it E(Z_0, w) = E(w). Let Z be a vector in the same space as Z_0. The standard error E(w) can again be written in the form of equation 3.3, with the correspondence of u to Z, of e(u, w) to E(Z, w), and g(Z) = δ(Z − Z_0). By applying the stochastic gradient-descent algorithm to this form of E(w), we rederive the backpropagation training algorithm. Therefore, both on-line backpropagation and normal backpropagation can be treated as special cases of the stochastic gradient-descent algorithm. Consider now a set of N data points represented by Z = (Z_1, Z_2, ..., Z_N), where Z_μ denotes a data point that we have until now denoted by z^μ. In the vector space of Z, the training set is represented by the single vector Z_0 = (z^1, z^2, ..., z^N). Recall that backpropagation training is described by the following weight update equation:

w_{t+1} = w_t − η_t ∇_w E(Z_0, w)   (A.1)
This training algorithm, although deterministic in nature, may be treated as a special case of the stochastic gradient-descent algorithm by interpreting Z_0 as a realization of the following distribution:

g(Z) = δ(Z − Z_0)

Bearing the above in mind, it is clear that with the data noise injection procedure detailed in equation 2.5, the training algorithm is now described by

w_{t+1} = w_t − η_t ∇_w E(Z, w)   (A.2)

where Z is randomly drawn from the density

g̃(Z) = ∏_{μ=1}^N p(Z_μ − z^μ)   (A.3)

Therefore, the cost function that is minimized by equation A.2 has the form

C(w) = ∫ E(Z, w) g̃(Z) dZ   (A.4)

Substituting equations 2.2 and A.3 into equation A.4 and simplifying the integration over Z, we have

C(w) = (1/N) ∑_{μ=1}^N ∫ e(Z_μ, w) p(Z_μ − z^μ) dZ_μ   (A.5)
Renaming the dummy integration variable Z_μ in equation A.5 to z, we obtain

C(w) = (1/N) ∑_{μ=1}^N ∫ e(z, w) p(z − z^μ) dz   (A.6)

This is exactly the cost function C(w) given by equation 3.5, with g(z) given by equation 3.1.
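As a quick numerical sanity check of the smoothed cost A.6 (not part of the original derivation), consider a toy error e(z, w) = (y − wx)² with Gaussian noise added to the input x: averaging the noisy error over many draws should approach the closed-form smoothed cost (y − wx)² + T²w², i.e., the standard error plus a noise-induced penalty term. The concrete model and numbers here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x, y, w, T = 1.5, 2.0, 0.8, 0.3   # one data point, one weight, noise strength T

# Monte Carlo estimate of the smoothed cost: E_eps[(y - w(x + eps))^2]
eps = rng.normal(0.0, T, size=2_000_000)
monte_carlo = np.mean((y - w * (x + eps)) ** 2)

# Closed form: the cross term vanishes because E[eps] = 0, leaving a
# weight-decay-like penalty T^2 w^2 on top of the noise-free error.
analytic = (y - w * x) ** 2 + (T * w) ** 2
```

The two quantities agree to within Monte Carlo error, which is the one-data-point version of the statement that equation A.6 equals the standard cost plus a noise-induced penalty.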
Acknowledgments
I thank Wim Schinkel for many helpful suggestions that have led to numerous improvements in the presentation of this work.

References

Arfken, G. 1985. Mathematical Methods for Physicists. Academic Press, New York.
Bishop, C. M. 1995. Training with noise is equivalent to Tikhonov regularization. Neural Comp. 7, 108-116.
Bottou, L. 1991. Stochastic gradient learning in neural networks. NEURO-NIMES'91, EC2, Nanterre, France.
Chauvin, Y. 1989. A back-propagation algorithm with optimal use of hidden units. In Advances in Neural Information Processing Systems 1, D. Touretzky, ed., pp. 519-526. Morgan Kaufmann, San Mateo, CA.
Clay, R., and Sequin, C. 1992. Fault tolerance training improves generalization and robustness. Proc. Int. Joint Conf. Neural Networks, IEEE Neural Council, Baltimore, I-769-774.
Drucker, H., and Le Cun, Y. 1992. Improving generalization performance using double back-propagation. IEEE Trans. Neural Networks 3, 991-997.
Geman, S., and Hwang, C. 1986. Diffusion for global optimization. SIAM J. Control Optim. 25, 1031-1043.
Gillespie, D. 1992. Markov Processes. Academic Press, London.
Guillerm, T., and Cotter, N. 1991. A diffusion process for global optimization in neural networks. Proc. Int. Joint Conf. Neural Networks, I-335.
Guyon, I., Vapnik, V., Boser, B., Bottou, L., and Solla, S. A. 1992. Structural risk minimization for character recognition. In Advances in Neural Information Processing Systems 4 (NIPS 91), J. Moody et al., eds., pp. 471-479. Morgan Kaufmann, San Mateo, CA.
Hanson, S. J. 1990. A stochastic version of the delta rule. Physica D 42, 265-272.
Hertz, J., Krogh, A., and Thorbergsson, G. 1989. Phase transitions in simple learning. J. Phys. A: Math. Gen. 22, 2133-2150.
Hinton, G. E. 1986. Learning distributed representations of concepts. Proc. Eighth Annu. Conf. Cog. Sci. Soc., Amherst, 1-12.
Hinton, G. E., and van Camp, D. 1993. Keeping neural networks simple by minimizing the description length of the weights. Sixth ACM Conf. Comp. Learning Theory, Santa Cruz, 5-13.
Holmstrom, L., and Koistinen, P. 1992. Using additive noise in back-propagation training. IEEE Trans. Neural Networks 3, 24-38.
Kendall, G. D., and Hall, T. J. 1993. Optimal network construction by minimum description length. Neural Comp. 5, 210-212.
Kendall, M., and Stuart, A. 1977. The Advanced Theory of Statistics, Vol. 1, 4th ed. Charles Griffin, London.
Kirkpatrick, S., Gelatt, C., and Vecchi, M. 1983. Optimization by simulated annealing. Science 220, 671-680.
Krogh, A., and Hertz, J. A. 1992. A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems 4 (NIPS 91), J. E. Moody et al., eds., pp. 950-957. Morgan Kaufmann, San Mateo, CA.
Kushner, H. 1987. Asymptotic global behavior for stochastic approximation and diffusions with slowly decreasing noise effects: Global minimization via Monte Carlo. SIAM J. Appl. Math. 47, 169-185.
MacKay, D. J. C. 1992. Bayesian interpolation. Neural Comp. 4, 415-447.
Matsuoka, K. 1992. Noise injection into inputs in back-propagation learning. IEEE Trans. Sys. Man Cybern. 22, 436-440.
Murray, A. F., and Edwards, P. J. 1993. Synaptic weight noise during multilayer perceptron training: Fault tolerance and training improvements. IEEE Trans. Neural Networks 4, 722-725.
Murray, A. F., and Edwards, P. J. 1994. Enhanced MLP performance and fault tolerance resulting from synaptic weight noise during training. IEEE Trans. Neural Networks 5, 792-802.
Neuralware. 1991. Reference Guide: NeuralWorks Professional II/Plus and NeuralWorks Explorer. Neuralware, Inc., Pittsburgh.
Nowlan, S. J., and Hinton, G. E. 1992. Simplifying neural networks by soft weight-sharing. Neural Comp. 4, 473-493.
Pao, Y. H. 1989. Adaptive Pattern Recognition and Neural Networks, chap. 2, pp. 25-35. Addison-Wesley, Reading, MA.
Poggio, T., and Girosi, F. 1990. Networks for approximation and learning. Proc. IEEE 78, 1481-1497.
Reed, R., Oh, S., and Marks, R. J., II. 1992. Regularization using jittered training data. Proc. Int. Joint Conf. Neural Networks, IEEE Neural Council, Baltimore, III-147.
Rissanen, J. 1989. Stochastic Complexity in Statistical Inquiry. World Scientific, Singapore.
Rognvaldsson, T. 1994. On Langevin updating in multilayer perceptrons. Neural Comp. 6, 916-926.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning representations by back-propagating errors. Nature (London) 323, 533-536.
Seung, H., Sompolinsky, H., and Tishby, N. 1992. Statistical mechanics of learning from examples. Phys. Rev. A 45, 6056-6091.
Sietsma, J., and Dow, R. 1988. Neural network pruning: why and how. Proc. IEEE Int. Conf. Neural Networks I, 325-333.
Sjoberg, J., and Ljung, L. 1992. Overtraining, regularization, and searching for minimum in neural networks. Proc. Symp. Adaptive Systems Control Signal Process., Grenoble, France.
van Kampen, N. G. 1981. Stochastic Processes in Physics and Chemistry. North-Holland, Amsterdam.
Weigend, A., Rumelhart, D., and Huberman, B. 1991. Generalization by weight-elimination with application to forecasting. In Advances in Neural Information Processing Systems 3 (NIPS 90), R. P. Lippmann, J. E. Moody, and D. S. Touretzky, eds., pp. 875-882. Morgan Kaufmann, San Mateo, CA.
White, H. 1989. Learning in artificial neural networks: A statistical perspective. Neural Comp. 1, 425-464.
Received January 10, 1995; accepted July 14, 1995.
Communicated by Don R. Hush
ARTICLE
Stable Encoding of Large Finite-State Automata in Recurrent Neural Networks with Sigmoid Discriminants

Christian W. Omlin
NEC Research Institute, 4 Independence Way, Princeton, NJ 08540 USA

C. Lee Giles
NEC Research Institute, 4 Independence Way, Princeton, NJ 08540 USA, and UMIACS, University of Maryland, College Park, MD 20742 USA
We propose an algorithm for encoding deterministic finite-state automata (DFAs) in second-order recurrent neural networks with sigmoidal discriminant functions, and we prove that the languages accepted by the constructed network and the DFA are identical. The desired finite-state network dynamics is achieved by programming a small subset of all weights. A worst-case analysis reveals a relationship between the weight strength and the maximum allowed network size that guarantees finite-state behavior of the constructed network. We illustrate the method by encoding random DFAs with 10, 100, and 1000 states. While the theory predicts that the weight strength scales with the DFA size, we find empirically that the weight strength is almost constant for all the random DFAs. These results can be explained by noting that the generated DFAs represent average cases. We empirically demonstrate the existence of extreme DFAs for which the weight strength scales with DFA size.

1 Introduction
It is possible to train recurrent neural networks to behave like deterministic finite-state automata (Elman 1990; Frasconi et al. 1991; Giles et al. 1992; Pollack 1991; Servan-Schreiber et al. 1991; Watrous and Kuhn 1992). The internal representation of learned DFA states can deteriorate due to the dynamic nature of recurrent networks, making predictions about the generalization performance of trained recurrent networks difficult (Zeng et al. 1993). Methods for constructing DFAs in recurrent networks with hard-limiting discriminant functions have been proposed (Alon et al. 1991; Horne and Hush 1994; Minsky 1967); methods for constructing networks with sigmoidal and radial-basis discriminant functions have been discussed (Frasconi et al. 1993; Gori et al. 1994; Giles and Omlin 1993). We prove that recurrent networks with continuous sigmoidal discriminant functions can be constructed such that the encoded finite-state dynamics remains stable indefinitely. Notice that we do not claim that such a stable representation can be learned.

Neural Computation 8, 675-696 (1996)
© 1996 Massachusetts Institute of Technology

2 Encoding DFA Dynamics

2.1 Finite State Automata. A deterministic finite-state automaton (DFA) is an acceptor for a regular language L(M). Formally, a DFA M is a 5-tuple M = (Σ, Q, R, F, δ), where Σ = {a_1, ..., a_m} is the alphabet of the language L, Q = {q_1, ..., q_n} is a set of states, R ∈ Q is the start state, F ⊆ Q is a set of accepting states, and δ : Q × Σ → Q defines state transitions in M. A string is accepted by the DFA M if an accepting state is reached; otherwise, the string is rejected.
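As a concrete illustration of the definition above (a hypothetical example, not one taken from the paper), here is a two-state DFA over Σ = {a, b} that accepts exactly the strings containing an even number of a's:

```python
# DFA M = (Sigma, Q, R, F, delta) accepting strings with an even number of a's.
SIGMA = {"a", "b"}
Q = {"q0", "q1"}            # q0: even number of a's seen so far, q1: odd
START = "q0"
ACCEPTING = {"q0"}
DELTA = {
    ("q0", "a"): "q1", ("q0", "b"): "q0",
    ("q1", "a"): "q0", ("q1", "b"): "q1",
}

def accepts(string):
    """Run the DFA on the string; accept iff it halts in an accepting state."""
    q = START
    for symbol in string:
        q = DELTA[(q, symbol)]
    return q in ACCEPTING
```

The empty string is accepted (zero is even), while any string with an odd count of a's ends in the rejecting state q1.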
2.2 Recurrent Network. We implement DFAs in discrete-time recurrent networks with second-order weights W_ijk. The continuous network dynamics are described by the following equation:

S_i^{t+1} = h(b_i + Σ_{j,k} W_ijk S_j^t I_k^t)   (2.1)

where b_i is the bias associated with hidden recurrent state neurons S_i, and I_k denotes the input neurons. The product S_j^t I_k^t directly corresponds to the state transition δ(q_j, a_k) = q_i. After a string has been processed, the output of a designated neuron S_0 decides whether the network accepts or rejects the string: the network accepts a given string if the value of the output neuron S_0 at the end of the string is greater than 0.5; otherwise, the network rejects the string. For the remainder of this paper, we assume a one-hot encoding for input symbols a_k, i.e., I_k^t ∈ {0, 1}.

2.3 Encoding Algorithm. The encoding algorithm achieves a nearly orthonormal internal representation of the desired DFA dynamics; it constructs a network with n + 1 recurrent state neurons (including the output neuron) and m input neurons from a DFA with n states and m input symbols. There is a one-to-one correspondence between state neurons S_i and DFA states q_i. For each DFA state transition δ(q_j, a_k) = q_i, we set W_ijk to a large positive value +H. The self-connection W_jjk is set to −H (i.e., neuron S_j changes its state from high to low), except for state transitions δ(q_j, a_k) = q_j (self-loops), where W_jjk is set to +H (i.e., the state of neuron S_j remains high). Furthermore, if state q_i is an accepting state, then we program the weight W_0jk to +H; otherwise, we set W_0jk to −H. We set the bias terms b_i of all state neurons S_i to −H/2. For each DFA state transition, at most three weights of the network have to be programmed. The initial state S^0 of the network is S^0 = (S_0^0, 1, 0, 0, ..., 0). The value of the response neuron S_0^0 is 0 if the DFA's initial state q_0 is a rejecting state
and 1 otherwise. All weights that are not set to −H, −H/2, or +H are set to zero. The question this paper addresses is whether the value of H can be chosen such that the finite-state dynamics in a recurrent network remains indefinitely stable.

3 Analysis
We prove the stability of DFA encodings in recurrent neural networks for strings of arbitrary length. Due to space limitations, we give only the proofs of the theorems that establish our results; for proofs of auxiliary lemmas, see Omlin and Giles (1994).

3.1 Fixed Point Analysis for Sigmoidal Discriminant Function. Recall that the recurrent network changes its state according to equation 2.1. Our DFA encoding algorithm yields a special form of that equation describing the dynamics of a constructed network:

S_j^{t+1} = h(x_j, H) = 1 / (1 + e^{(H/2)(1 − 2x_j)})   (3.1)

The bias term −H/2 is common to all state neurons; Hx_j is the weighted sum feeding into neuron S_j^{t+1}. Under certain conditions, the discriminant function h(·) has fixed points that allow a stable internal representation of DFA states.

3.2 Network Dynamics as Iterated Functions. When a network processes a string, the state neurons go through a sequence of state changes. The network state at time t + 1 is computed from the network state at time t, its current input, and its weights. Since the discriminant function h(·) is the same for all state neurons, these network state changes can be represented as iterations of h(·) for each state neuron. A network will correctly classify strings of arbitrary length only if its internal DFA state representation remains sufficiently stable. Stability can be guaranteed only if the neurons are shown to operate near their saturation regions for sufficiently high gain of the sigmoidal discriminant function h(·). One way to achieve stability is thus to show that the iteration of the discriminant function h(·) converges toward its fixed points in these regions, i.e., points for which h(x, H) = x. This observation will be the basis for a quantitative analysis, which establishes bounds on the network size and the weight strength H that guarantee a stable internal representation for arbitrary DFAs.
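The convergence just described is easy to observe numerically. In the following minimal sketch (with H = 8 as an assumed gain, and ignoring residual inputs so that x is a single unweighted signal), iterating h(x, H) = 1/(1 + e^{(H/2)(1−2x)}) from the initial values 0 and 1 drives the signal into the two saturation regions:

```python
import math

def h(x, H):
    """Sigmoidal discriminant with the common bias -H/2 built in."""
    return 1.0 / (1.0 + math.exp((H / 2.0) * (1.0 - 2.0 * x)))

def iterate(x, H, steps=100):
    # Repeatedly apply h; near a stable fixed point the map is a contraction.
    for _ in range(steps):
        x = h(x, H)
    return x

H = 8.0
low = iterate(0.0, H)     # low signal converges to the stable low fixed point
high = iterate(1.0, H)    # high signal converges to the stable high fixed point
```

For H = 8 the two limits are approximately 0.02 and 0.98, i.e., low signals stay low and high signals stay high, which is exactly the stability property the analysis below makes quantitative.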
Given a stable DFA encoding where neurons operate near their saturated regions, each neuron can send two kinds of signals to other neurons:

1. High signals: If neuron S_i represents the current DFA state q_i, then S_i^t will be high (S_i^t: high).

2. Low signals: Neurons S_j^t that do not represent the current DFA state have a low output (S_j^t: low).
Recall that the arguments of the discriminant function h(x, H) are the sum s of unweighted signals and the weight strength H. We now expand the term s to account for the different kinds of signals that are present in a neural DFA. From the DFA encoding algorithm, we can derive four different types of neuron state changes:

low → high:   (3.3, 3.4)
high → high:  (3.5)
high → low:   (3.6)
low → low:    (3.7, 3.8)

where

C_jk = {S_l | W_ljk = H, l ≠ i, l ≠ j}   (3.9)

The inputs I_k^t are not shown explicitly, since we assume that each input symbol is assigned a separate input neuron in a one-hot encoding. The DFA state transitions corresponding to these types of neuron state changes are shown in Figure 1. The signals S_j^t and S_i^t represent the principal contributions to the neuron S_i^{t+1} that are responsible for driving the output of neuron S_i^{t+1} low or
Figure 1: Neuron state changes and corresponding DFA state transitions: (a)-(f) illustrate the DFA state transitions corresponding to all possible state changes of neuron S_i; the DFA state(s) participating in the current transition are marked with t and t + 1. (a) low → high (with self-loop on q_i), (b) low → high (with self-loop on q_j), (c) high → high (necessarily a self-loop on q_i), (d) high → low (necessarily no self-loop on q_i), (e) low → low (with self-loop on q_i), (f) low → low (no self-loop on q_i). Notice that even though state q_i is neither the source nor the target of the current state transition in cases (e) and (f), the corresponding state neuron S_i still receives residual inputs from state neurons S_{l_1}, ..., S_{l_r}.
high. All other terms are the residual contributions to the input of neuron S_i^{t+1}. The term ΣS_l^t contributes to the total input of state neuron S_i^{t+1} if there are other transitions δ(q_l, a_k) = q_i in the DFA from which the recurrent network is constructed. Since there is a one-to-one correspondence between state neurons and DFA states, there will always be a negative contribution −S_i^t for the current DFA state transition δ(q_j, a_k) = q_i (assuming q_i ≠ q_j), i.e., only S_i^t can drive the signal S_i^{t+1} low. The above equations account for all possible contributions to the net input of all state neurons, because the encoding algorithm constructs a sparse recurrent network. For a worst-case analysis, it suffices to investigate the cases of minimum and maximum neuron inputs for high and low signals, respectively. Equations 3.3-3.9 condense to the following two equations:

low → low:   (3.10)
low → high:  (3.11)

We now define a new function h̃^t(x_i, H) that takes the residual inputs into consideration. Let Δx_i^t denote the residual neuron inputs to neuron S_i^{t+1}. Then the function h̃^t(x_i, H) is recursively defined as

h̃^0(x_i, H) = x_i^0,   h̃^{t+1}(x_i, H) = h(h̃^t(x_i, H) + Δx_i^t, H)   (3.12)

The initial values for low and high signals are x_i^0 = 0 and x_i^0 = 1, respectively. The magnitude of the residual inputs Δx_i^t depends on the coupling between recurrent state neurons. Neurons that are connected to a large number of other neurons will receive a larger residual input than neurons that are connected to only a few other neurons. Consider the neuron S_{i0} that receives residual input from the largest number r of neurons, i.e., Δx_i^t ≤ Δx_{i0}^t. To show network stability, it suffices to assume the worst case in which all neurons receive the same residual input Δx_{i0}^t at a given time index t. This assumption is valid since the initial value for all neurons, except the neuron corresponding to the DFA's start state, is 0.

3.3 Fixed Point Analysis for Sigmoidal Discriminant Function. In order to guarantee that low signals remain low, we have to give a tight upper bound for low signals that remains valid for an arbitrary number of time steps.
Figure 2: Fixed points of the sigmoidal discriminant function: Shown are the graphs of the function f(x, r) = 1/(1 + e^{(H/2)(1−2rx)}) (dashed graphs) for H = 8 and r = {1, 2, 4, 10}, and the graphs of the function p(x, u) = 1/(1 + e^{(H/2)(1−2(x−u))}) (dotted graphs) for H = 8 and u = {0.0, 0.1, 0.4, 0.9}. Their intersection with the function y = x shows the existence and location of fixed points. In this example, f(x, r) has three fixed points for r = {1, 2} but only one fixed point for r = {4, 10}, and p(x, u) has three fixed points for u = {0.0, 0.1} but only one fixed point for u = {0.4, 0.9}.

Lemma 3.3.1. The low signals are bounded from above by the fixed point φ_f^− of the function f:

f^0 = 0,   f^{t+1} = h(r · f^t)   (3.13)

i.e., we have Δx_i^{t+1} = r · f^t, since x_i^0 = 0 for low signals in equation 3.12. This lemma can easily be proven by induction on t. It is easy to see that the function to be iterated in equation 3.13 is

f(x, r) = 1 / (1 + e^{(H/2)(1 − 2rx)})

The graphs of this function are shown in Figure 2 for different values of the parameter r. The function f(x, r) has some desirable properties (Omlin and Giles 1995):
Lemma 3.3.2. For any H > 0, the function f(x, r) has at least one fixed point φ_f.

Lemma 3.3.3. There exists a value H_0(r) such that for any H > H_0(r), f(x, r) has three fixed points 0 < φ_f^− < φ_f^0 < φ_f^+ < 1.
Lemma 3.3.4. If f(x, r) has three fixed points φ_f^−, φ_f^0, and φ_f^+, then

lim_{t→∞} f^t = φ_f^−  if f^0 < φ_f^0;   lim_{t→∞} f^t = φ_f^+  if f^0 > φ_f^0   (3.14)

The above lemma can be shown by defining an appropriate Lyapunov function L and showing that L has minima at φ_f^− and φ_f^+ and that f^t converges toward one of these minima. Notice that the fixed point φ_f^0 is unstable.

Lemma 3.3.5. Let f^0, f^1, f^2, ... denote the sequence computed by successive iteration of the function f. Then we have f^0 < f^1 < ... < φ_f, where φ_f is one of the stable fixed points of f(x, r).

With these properties, we can quantify the value H_0(r) such that for any H > H_0(r), f(x, r) has three fixed points. The low and high fixed points φ_f^− and φ_f^+ will be the bounds for low and high signals, respectively. The larger r, the larger H must be chosen in order to guarantee the existence of three fixed points. If H is not chosen sufficiently large, then f^t converges to a unique fixed point 0.5 < φ_f < 1. The following lemma expresses a quantitative condition that guarantees the existence of three fixed points:

Lemma 3.3.6. The function f(x, r) = 1/(1 + e^{(H/2)(1−2rx)}) has three fixed points 0 < φ_f^− < φ_f^0 < φ_f^+ < 1 if H is chosen such that

H > H_0(r) = 2(1 + (1 − x) log((1 − x)/x)) / (1 − x)

where x satisfies the equation

r = 1 / (2x(1 + (1 − x) log((1 − x)/x)))

The contour plots in Figure 3 show the relationship between H and x for various values of r. If H is chosen such that H > H_0(r), then three fixed points exist; otherwise, only a single fixed point exists. The number and location of the fixed points depend on the values of r and H. Thus, we write φ_f^−(r, H), φ_f^0(r, H), and φ_f^+(r, H) to denote the stable low, the unstable, and the stable high fixed point, respectively. We will use φ as a generic name for any fixed point of a function f. Similarly, we can quantify high signals in a constructed network:
Figure 3: Existence of fixed points: The contour plots of the function h(x, r) = x (dotted graphs) show the relationship between H and x for various values of r. If H is chosen such that H > H_0(r) (solid graph), then a line parallel to the x-axis intersects the surface satisfying h(x, r) = x in three points, which are the fixed points of h(x, r).
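Lemma 3.3.6 can also be checked numerically: the quantity 2x(1 + (1 − x) log((1 − x)/x)) increases from 0 to 1 on (0, 0.5], so the equation for x can be solved by bisection and H_0(r) evaluated directly. The following small sketch (bisection tolerance and iteration count are implementation choices) reproduces the fixed-point counts of Figure 2 for H = 8, i.e., H_0(1), H_0(2) < 8 < H_0(4), H_0(10):

```python
import math

def _d(x):
    # 2x(1 + (1 - x) log((1 - x)/x)); increases from 0 to 1 on (0, 0.5]
    return 2.0 * x * (1.0 + (1.0 - x) * math.log((1.0 - x) / x))

def h0(r):
    """Critical weight strength H_0(r): three fixed points exist iff H > H_0(r)."""
    target = 1.0 / r
    lo, hi = 1e-12, 0.5
    for _ in range(200):            # bisection for x with _d(x) = 1/r
        mid = 0.5 * (lo + hi)
        if _d(mid) < target:
            lo = mid
        else:
            hi = mid
    x = 0.5 * (lo + hi)
    return 2.0 * (1.0 + (1.0 - x) * math.log((1.0 - x) / x)) / (1.0 - x)
```

For r = 1 the solution is x = 0.5, giving H_0(1) = 4; the critical gain then grows with r, which is the quantitative form of "the larger r, the larger H must be chosen."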
Lemma 3.3.7. The high signals are bounded from below by the fixed point φ_g^+ of the function

g^0 = 1,   g^{t+1} = h(g^t − f^t, H)   (3.15)

This lemma is easily proven by induction if we assume the worst case for neuron state transitions low → high, where the neuron receives no residual inputs that would strengthen the high signal. Notice that the above recurrence relation couples f^t and g^t, which makes it difficult, if not impossible, to find a function g(x, r) that, when iterated, gives the same values as g^t. However, we can bound the sequence g^0, g^1, g^2, ... from below by a recursively defined function p^t, i.e., ∀t: p^t ≤ g^t, which decouples g^t from f^t.
Lemma 3.3.8. Let φ_f denote the fixed point of the recursive function f, i.e., lim_{t→∞} f^t = φ_f. Then the recursively defined function p^t   (3.16)

has the property that for all t: p^t ≤ g^t.
This can be proven by induction on t. The graph of the function p(x, u) for some values of u is shown in Figure 2. Lemmas 3.3.2 through 3.3.5 also apply to the function p(x, u).
Lemma 3.3.9. Let the function p(x, u) have two stable fixed points and let for all t: p^t ≤ g^t. Then the function g(x, r) also has two stable fixed points.

Since we have decoupled the iterated function g^t from the iterated function f^t by introducing the iterated function p^t, we can apply the same technique for finding conditions for the existence of fixed points of p(x, u) as in the case of f^t. In fact, the function that, when iterated, generates the sequence p^0, p^1, p^2, ... is defined by   (3.17)
with

H' = H(1 + 2φ_f^-)   (3.18)
Since we can iteratively compute the value of φ_f^- for given parameters H and r, we can repeat the original argument with H' and r' in place of H and r to find the conditions under which p(x, u), and thus g(x, r), have three fixed points. This results in the following lemma:
Lemma 3.3.10. The function p(x, φ_f^-) = 1/(1 + e^{(H'/2)(1 - 2x)}) has three fixed points 0 < φ_p^- < φ_p^0 < φ_p^+ < 1 if H is chosen such that H > H_0^+(r), where x satisfies the equation

1/(1 + 2φ_f^-) = 1 / (2x(1 + (1 - x) log((1 - x)/x)))
3.4 Network Stability. We now define stability of recurrent networks constructed from DFAs:

Definition 3.4.1. An encoding of DFA states in a second-order recurrent neural network is called stable if all the low signals are less than φ_f^0(r, H), and all the high signals are greater than φ_g^0(r, H).
Consider equation (3.10). In order for the low signal to remain less than φ_f^0, the argument of h(·) must be less than h^{-1}(φ_f^0) for all values of t. Thus, we require the following invariant property of the residual inputs:

-H/2 + H r φ_f^- < h^{-1}(φ_f^0)   (3.19)

where we assumed that all low signals have the same value and that their maximum value is the fixed point φ_f^-. This assumption is justified since the output of all state neurons with low values is initialized to zero. A similar analysis can be carried out for state transitions of equation (3.11). The following inequality must be satisfied for stable high signals:
-H/2 + H φ_g^+ - H φ_f^- > h^{-1}(φ_g^0)   (3.20)
where we assumed that there is only one DFA transition δ(q_j, a_k) = q_i for chosen q_i and a_k, and thus the corresponding sum of residual inputs is 0. Solving inequalities (3.19) and (3.20) for φ_f^- and φ_g^+, respectively, we obtain conditions under which a constructed recurrent network implements a given DFA. These conditions are expressed in the following theorem:

Theorem 3.4.1. For some given DFA M with n states and m input symbols, let r denote the maximum number of transitions to any state over all input symbols of M. Then, a sparse recurrent neural network with n + 1 sigmoidal state neurons and m input neurons can be constructed from M such that the internal state representation remains stable if the following three conditions are satisfied:   (3.21)

(3) H > max(H_0^-(r), H_0^+(r))

Furthermore, the constructed network has at most 3mn second-order weights with alphabet Σ_w = {-H, 0, +H}, n + 1 biases with alphabet Σ_b = {-H/2}, and maximum fan-out 3m.

The number of weights and the maximum fan-out follow directly from the DFA encoding algorithm. Stable encoding of DFA states is a necessary condition for a neural network to implement a given DFA. The network must also correctly classify all strings. The conditions for correct string classification are expressed in the following corollary:
Corollary 3.1. Let L(M_DFA) denote the regular language accepted by a DFA M with n states, and let L(M_RNN) be the language accepted by the recurrent network constructed from M. Then, we have L(M_RNN) = L(M_DFA) if

(2) H > max(H_0^-(r), H_0^+(r))

Proof. For the case of an ungrammatical string, the input to the response neuron S_0 must satisfy the following condition:

-H/2 - H φ_g^+ + (n - 1) H φ_f^- < -1/2   (3.22)

where we have made the usual simplification about the convergence of the outputs to the fixed points φ_f^- and φ_g^+. Furthermore, we assume that the state q_i of the state transition δ(q_j, a_k) = q_i is the only rejecting state; then the output neuron's residual inputs from all other state neurons are positive, weakening the intended low signal for the network's output neuron. Notice that the output neuron is the only neuron that can be forced toward a low signal by neurons other than itself. A similar condition can be formulated for grammatical strings:

-H/2 + H φ_g^+ - (n - 1) H φ_f^- > 1/2   (3.23)

The above two inequalities can be simplified into a single inequality:
-2 H φ_g^+ + 2(n - 1) H φ_f^- < 0   (3.24)
Observing that φ_f^- + φ_g^+ < 2 and solving for φ_f^-, we get the following condition for the correct output of a network:

φ_f^- < 2/n   (3.25)

Thus we have the following conditions for stable low signals and correct string classification:   (3.26)

We observe that choosing φ_f^- < 1/(2n) thus implies the condition for stable low signals in partially recurrent networks. Substituting 1/(2n) for φ_f^- in inequality (3.26) yields condition (1) of the corollary.
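The encoding algorithm behind Theorem 3.4.1 can be sketched as follows. The programmed weights take values in {-H, 0, +H} with biases -H/2 as in the theorem, while the self-reset rule for non-self-transitions and the arg-max readout are our simplifications (the response neuron is omitted):

```python
import itertools
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def encode_dfa(n, m, delta, H):
    # Second-order weights W[i][j][k] in {-H, 0, +H}; delta maps (state, symbol) -> state.
    W = [[[0.0] * m for _ in range(n)] for _ in range(n)]
    for (j, k), i in delta.items():
        W[i][j][k] = +H          # drive the target state neuron high
        if i != j:
            W[j][j][k] = -H      # reset the source neuron when the transition leaves it
    return W

def run(W, n, start, string, H):
    # State update S_i <- sigmoid(-H/2 + sum_j W[i][j][k] * S_j) for input symbol k.
    S = [0.0] * n
    S[start] = 1.0
    for k in string:
        S = [sigmoid(-H / 2.0 + sum(W[i][j][k] * S[j] for j in range(n)))
             for i in range(n)]
    return max(range(n), key=lambda i: S[i])   # arg-max readout of the one-hot code

# Example: the two-state parity DFA (state 0 = even number of 1s).
delta = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
H = 7.0
W = encode_dfa(2, 2, delta, H)
for bits in itertools.product([0, 1], repeat=8):
    q = 0
    for b in bits:
        q = delta[(q, b)]
    assert run(W, 2, 0, bits, H) == q
```

With H = 7, above the stability threshold, the high signal stays near 1 and the low signals near 0 for all 8-symbol strings, so the network state tracks the DFA state exactly.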
Figure 4: Randomly generated 100-state DFA: The minimal DFA has 100 states and alphabet Σ = {0, 1}. State 1 is the start state. States with and without double circles are accepting and rejecting states, respectively.

4 Experiments

4.1 Simulation Results. To empirically validate our analysis, we constructed networks from randomly generated DFAs with 10, 100, and 1000 states. For each of the three DFAs, we randomly generated different test sets, each consisting of 1000 strings of length 10, 100, and 1000, respectively. The randomly generated, minimized 100-state DFA with alphabet Σ = {0, 1} that we encoded into a recurrent network with 101 state neurons is shown in Figure 4. The networks' generalization performance on these test sets for rule strengths H = 0.0, 0.1, 0.2, ..., 7.0 is shown in Figures 5-7. A misclassification of these long strings indicates a network's failure to maintain the stable finite-state dynamics that was encoded. However, we observe that the networks can implement stable DFAs, as indicated by the perfect generalization performance for some
Figure 5: Performance of 10-state DFA: The network classification performance on three randomly generated data sets consisting of 1000 strings of length 10, 100, and 1000, respectively, as a function of the rule strength H (in 0.1 increments) is shown. The network achieves perfect classification on the strings of length 1000 for H > 6.0.

choice of the rule strength H and chosen test set. Thus, we have empirical evidence that supports our analysis. All three networks achieve perfect generalization for all three test sets for approximately the same value of H. Apparently, the network size plays an insignificant role in determining for which value of H stability of the internal DFA representation is reached, at least across the considered 3 orders of magnitude of network sizes.
4.2 Discussion. In our simulations, few neurons ever exceeded or fell below the fixed points φ^- and φ^+, respectively. Furthermore, the network has a built-in reset mechanism that allows low and high signals to be strengthened. Low signals S_i^t are strengthened to h(0, H) when there exists no state transition δ(·, a_k) = q_i. In that case, the neuron S_i^t receives no inputs from any of the other neurons; its output becomes less than φ^- since h(0, H) < φ^- for H > 4. Similarly, high signals S_i^t get
Figure 6: Performance of 100-state DFA: The network classification performance on three randomly generated data sets consisting of 1000 strings of length 10, 100, and 1000, respectively, as a function of the rule strength H (in 0.1 increments) is shown. The network achieves perfect classification on the strings of length 1000 for H > 6.2.

strengthened if either low signals feeding into neuron S_i on a current state transition δ({q_j}, a_k) = q_i have been strengthened during the previous time step, or when the number of positive residual inputs to neuron S_i compensates for a weak high signal from neurons {q_j}. Thus only a small number of neurons will have S_i > φ^- or S_i < φ^+. For the majority of neurons we have S_i ≤ φ^- and S_i ≥ φ^+. Since constructed networks are able to regenerate their internal signals, and since typical DFAs do not have the worst case properties assumed in this analysis, the conditions guaranteeing stable low and high signals are generally much too strong for some given DFA.

5 Scaling Issues

5.1 Preliminaries. The worst case analysis in Section 3 supports the following predictions about the implementation of arbitrary DFAs:
Figure 7: Performance of 1000-state DFA: The network classification performances on three randomly generated data sets consisting of 1000 strings of length 10, 100, and 1000, respectively, as a function of the rule strength H (in 0.1 increments). The network achieves perfect classification on the strings of length 1000 for H > 6.1.

1. Neural DFAs can be constructed that are stable for arbitrary string length for a finite value of the weight strength H.

2. For most neural DFA implementations, network stability is achieved for values of H that are smaller than the values required by the conditions in Theorem 3.4.1.

3. The value of H scales with the DFA size, i.e., the larger the DFA and thus the network, the larger H will be for guaranteed stability.
Predictions (1) and (2) are supported by our experiments. However, when we compare the values of H in the above experiments for DFAs of different sizes, we find that H ≈ 6 for all three DFAs. This observation seems inconsistent with the theory. The reason for this inconsistency lies in the assumption of a worst case for the analysis, whereas the DFAs we implemented represent average cases. For the construction of the randomly generated 100-state DFA we found correct classification of strings of length 1000 for H = 6.3. This value corresponds to a DFA whose states
have "average" indegree r = 1.5. [The magic value 6 also seems to occur for networks that are trained. Consider a neuron S_i; then, the weight that causes transitions between dynamical attractors often has a value ≈ 6 (Tino 1994).] However, there exist DFAs that exhibit the scaling behavior that is predicted by the theory. We will briefly discuss such DFAs. That discussion will be followed by an analysis of the condition for stable DFA encodings for asymptotically large DFAs.

Figure 8: Scaling weight strength: An accepting state q_p in 10 randomly generated 100-state DFAs was selected. The number of states q_j for which δ(q_j, 0) = q_p was gradually increased in increments of 5% of all DFA states. The graph shows the minimum value of H for correct classification of 100 strings of length 100. H increases up to p = 75%; for p > 75%, the DFA becomes degenerate, causing H to decrease again.

5.2 DFA States with Large Indegree. We can approximate the worst case analysis by considering an extreme case of a DFA:
(1) Select an arbitrary DFA state q_p; (2) select a fraction p of states q_j and set δ(q_j, a_k) = q_p.
(3) For low values of p, a constructed network behaves similarly to a randomly generated DFA. (4) As the number of states q_j for which δ(q_j, a_k) = q_p increases, the behavior gradually moves toward the worst case analysis, where one neuron receives a large number of residual inputs for a designated input symbol a_k.

We constructed a network from a randomly generated DFA M_0 with 100 states and two input symbols. We derived DFAs M_{p_1}, M_{p_2}, ..., M_{p_N}, where the fraction of DFA states q_j with δ(q_j, a_k) = q_p increased from M_{p_i} to M_{p_{i+1}} by Δp; for our experiments, we chose Δp = 0.05. Obviously, the languages L(M_{p_i}) change for different values of p_i. The graph in Figure 8 shows, for 10 randomly generated DFAs with 100 states, the minimum weight strength H necessary to correctly classify 100 strings of length 100 (a new data set was randomly generated for each DFA) as a function of p in 5% increments. We observe that H generally increases with increasing values of p; in all cases, the hint strength H sharply declines for some percentage value p. As the number of connections +H to a single state neuron S_i increases, the number of residual inputs that can cause unstable internal DFA representation and incorrect classification decreases. Let us assume that the extreme DFA state q_p is an accepting state. Then, the input to the output neuron S_0^{t+1} is   (5.1)

For correct classification, the net input must be larger than 0.5. As the value of p increases, the number of terms in the first and second sum increases and decreases, respectively. Thus, smaller values of H lead to correct string classification. A similar argument can be made if q_p is a rejecting state. We observe that there are two runs where outliers occur, i.e., H_{p_i} > H_{p_{i+1}} even though we have p_i < p_{i+1}. Since the value H_p depends on the randomly generated DFA, the choice of q_p, and the test set, we can expect such uncharacteristic behavior to occur in some cases.

5.3 Asymptotic Case Analysis.
We are interested in finding an expression for the average number of residual inputs to a neuron in large DFAs. Since we are dealing with a second-order network architecture, disjoint parts of the network participate in the computation of the next state for any given input symbol. Thus, we can limit our analysis to DFAs with a single input symbol. Consider a DFA M and its underlying graph G(V, E), whose vertices V and directed edges E are the DFA states Q and state transitions δ, respectively. We assume that G(V, E) is randomly generated: for any given vertex v_i, a directed edge e_i is drawn to another vertex v_j with equal probability 1/n for all vertices of G. The number of directed edges entering
any given vertex v_i from other vertices v_m is the number of residual inputs state neuron S_i receives from other state neurons S_m. Thus we need to compute only the expected number of incoming edges ("in-degree") for a DFA generated according to the above probability distribution. The probability p(d = k) for a vertex to have in-degree k follows a binomial distribution; thus, the average in-degree is given by the expected value of k. For n → ∞ and λ = np, where p is the probability that an event occurs (in our case we have p = 1/n and thus λ = 1) and p → 0, the binomial distribution asymptotically converges toward the Poisson distribution:

E{d} = Σ_{k=1}^∞ k e^{-λ} λ^k / k! = λ

With λ = 1, we conclude E{d} = 1.
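This expected in-degree is easy to confirm by simulation; the functional-graph sampler below is an illustrative sketch (graph size and trial counts are our choices):

```python
import random
from collections import Counter
from math import exp, factorial

def indegree_distribution(n, trials, seed=0):
    # random functional graph: each of n vertices draws one outgoing edge uniformly
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        indeg = [0] * n
        for _ in range(n):
            indeg[rng.randrange(n)] += 1
        counts.update(indeg)          # tally the observed in-degree values
    total = n * trials
    return {k: c / total for k, c in counts.items()}

dist = indegree_distribution(n=1000, trials=200)
mean = sum(k * p for k, p in dist.items())
assert abs(mean - 1.0) < 1e-9         # exactly n edges over n vertices: E{d} = 1
for k in range(4):                    # empirical law is close to Poisson(1): e^-1 / k!
    assert abs(dist.get(k, 0.0) - exp(-1) / factorial(k)) < 0.02
```

The mean in-degree is exactly 1 by conservation (n edges distributed over n vertices), while the shape of the distribution approaches Poisson(1) as n grows, as claimed.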
The lemmas of section 3.3 simplify for the case r = 1 as follows (Omlin and Giles 1995):

Lemma 5.3.1. For 0 < H < 4, h(x, H) has the unique fixed point φ^0 = 0.5. Furthermore, h^t(x, H) converges to φ^0 for any choice of a start value x^0.

Lemma 5.3.2. For H ≥ 4, h(x, H) has three fixed points φ^- < φ^0 = 0.5 < φ^+.

Lemma 5.3.3. For x
) peaks at lower (f, ω) for smaller S^2. In the multiscale representation, we further model (Fig. 2) the terms in equation 2.10 that do not depend on (x, t); they are summed into variables ψ^±. (The subscript n for the neuron and the superscript a for the scale are omitted for clarity.) The binocular RFs in a cortical cell are (Li and Atick 1994b; Li 1995) K_l(x, t) = K_sum(x, t) + K_diff(x, t) for the left and K_r(x, t) = K_sum(x, t) - K_diff(x, t) for the right eye:

K_eye(x, t) = ∫_0^∞ df ∫_0^∞ dω [K_eye^+(f, ω) cos(fx - ωt + ψ_eye^+) + K_eye^-(f, ω) cos(fx + ωt + ψ_eye^-)]   (3.3, 3.4)

with eye = l, r. Here K_eye^+(f, ω) and K_eye^-(f, ω) are the monocular sensitivities to stimuli of opposite motion directions. The questions are: what
Bhaskar DasGupta and Georg Schnitger
([z] denotes the bit sum of the binary string z; i.e., [z] = Σ_i z_i.) We obtain the following result.

Theorem 1.1. A binary threshold network accepting SQ_n must have size at least Ω(log n). But SQ_n can be computed by a σ-net with two gates.
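For reference, the language itself is trivial to decide by direct arithmetic; the point of the paper is the circuit size, not this computation. We take SQ_n to be the comparison [x]^2 ≥ [y] (the equality variant is treated in Section 3):

```python
def bit_sum(z):
    # [z] = sum of the bits of z
    return sum(z)

def sq(x, y):
    # SQ_n: x has n bits, y has n^2 bits; accept iff [x]^2 >= [y]
    n = len(x)
    assert len(y) == n * n
    return bit_sum(x) ** 2 >= bit_sum(y)

assert sq([1, 1, 0], [1, 1, 1, 1, 0, 0, 0, 0, 0])        # 2^2 >= 4
assert not sq([1, 0, 0], [1, 1, 0, 0, 0, 0, 0, 0, 0])    # 1^2 < 2
```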
In fact, we give a generalized upper bound in Theorem 2.1, where γ-nets with two gates are constructed for a large class of functions γ. The lower bound of Theorem 1.1 is "almost" tight, since it is possible to design a binary threshold net of size O(log n · log log n · log log log n) that accepts SQ_n. The proof of Theorem 1.1 uses techniques of circuit theory. We refer the reader to Wegener (1987) and Boppana and Sipser (1990) for a detailed account of circuit theory and restrict ourselves to a few comments. A circuit corresponds (using the notation of this paper) to a Γ-net, where Γ is a class of Boolean functions and where functions in Γ are assigned to the vertices of the net-architecture. {AND, OR, NOT}-circuits are perhaps the most prominent circuit class. One of the main tasks of circuit theory is to derive lower bounds for the size and/or depth of circuits computing specific Boolean functions. Little progress has been made in deriving lower bounds for {AND, OR, NOT}-circuits of bounded fan-in. (The fan-in of a circuit is the maximum, over all vertices, of the number of immediate predecessors of a vertex.) For instance, no specific function is known that requires superlinear size. The situation improves considerably if {AND, OR, NOT}-circuits of unbounded fan-in (and small depth) are considered: Razborov (1987) gives exponential lower bounds for the size of circuits computing the parity of n bits in bounded depth. However, as already mentioned above, threshold circuits (or threshold networks in our notation) of bounded depth have an impressive computing power and, perhaps not surprisingly, not even superlinear lower bounds on the size are known. In Wegener (1991) sublinear lower bounds on the size of threshold circuits are given.
There the notion of sensitivity is introduced: a Boolean function f of n variables is called k-sensitive if no setting of n - k variables to arbitrary (zero or one) values transforms f into a constant function of the remaining k free variables. (For example, the parity function of n variables is k-sensitive for any k with 1 ≤ k ≤ n.) We face the problem that SQ_n is not k-sensitive even for large values of k; for instance, if we set all x-bits to 1 and one y-bit to 0, then SQ_n becomes a constant function of the remaining free variables. Also, intermediate forms of sensitivity (i.e., k-sensitivity in which at least a constant fraction of both the x-bits and the y-bits are set) have to be ruled out: setting half of the x-bits to 0 and setting half of the y-bits to 1 again reduces SQ_n to a constant function of the remaining free variables. Therefore, Wegener's lower bound for sensitive functions (Wegener 1991) does not apply. Our lower bound proof does proceed by trying to examine the given circuit gate by gate. But we were not successful in trivializing each gate (i.e.,
Analog vs. Discrete Neural Networks
by setting input bits to appropriate zero/one values, we were unable to guarantee that a considered gate computes a constant function of the remaining free input bits). Instead we construct a subdomain of the input space that allows us to trivialize threshold gates while not trivializing SQ_n.

The rest of the paper is organized as follows. In Section 2 we show that SQ_n can be computed by γ-nets with two gates, where γ is any real-valued activation function at least three times continuously differentiable in some small neighborhood. In Section 3 we prove that any binary threshold network accepting SQ_n must have size at least Ω(log n). A preliminary version of this result appeared in DasGupta and Schnitger (1993).

2 Computing SQ_n by γ-nets
We say that a function γ: R → R has the Q-property if and only if there exist real numbers a and δ > 0 such that (a) γ(x) is at least 3 times continuously differentiable in the interval [a - δ, a + δ] and (b) γ''(a) ≠ 0. Notice that the standard sigmoid σ(x) = 1/(1 + e^{-x}) has the Q-property. Next we show that SQ_n can be computed with relatively large separation by small γ-nets with small weights, provided γ has the Q-property.

Theorem 2.1. Assume that γ has the Q-property. Then there is a γ-net with two gates that accepts SQ_n with separation Ω(1). Moreover, all weights are bounded in absolute value by a polynomial in n.

The proof of Theorem 2.1 utilizes the Q-property to extract square polynomials from γ. In particular, we approximate the quadratic polynomial [x]^2 - [y] with small error. Finally, the function SQ_n is computed with Ω(1)-separation by comparing the approximated polynomial with a suitable threshold value.
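The extraction of a square from a single γ-gate can be checked numerically for the standard sigmoid; the expansion point a and the scaling L below are illustrative choices, not the values fixed in the proof:

```python
import math

def sigma(x):
    return 1.0 / (1.0 + math.exp(-x))

def dsigma(x):
    s = sigma(x)
    return s * (1.0 - s)                      # sigma'

def d2sigma(x):
    s = sigma(x)
    return s * (1.0 - s) * (1.0 - 2.0 * s)    # sigma''

a = 1.0          # expansion point with sigma''(a) != 0
L = 10 ** 5      # scaling; the Taylor truncation error shrinks like 1/L

def approx_square(u):
    # one sigma-gate plus linear corrections recovers u^2 (second-order Taylor term)
    z = u / L
    return 2.0 * (sigma(a + z) - sigma(a) - dsigma(a) * z) / d2sigma(a) * L * L

for u in range(33):          # u plays the role of the bit sum [x] for n = 32
    assert abs(approx_square(u) - u * u) < 0.25
```

The residual is the cubic Taylor remainder, of order u^3/L, so a polynomially large L suffices to approximate [x]^2 within 1/4, which is exactly the separation budget used in the proof.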
Proof of Theorem 2.1. Since γ is at least 3 times continuously differentiable in I = [a - δ, a + δ], we obtain

where r(z) = γ'''(θ_z)/6 · z^3 (for z ∈ [-δ, δ] and some θ_z ∈ I). Moreover, by continuity, there is a constant Max with |γ'''(u)| ≤ Max for all u ∈ I. We set

L = max { n^3 · 3Max/|γ''(a)| , n/δ }
Since 0 ≤ [x] ≤ n, we obtain |[x]/L| ≤ |n/L| ≤ δ and thus

or, equivalently,
Also, since |γ'''(θ_z)| ≤ Max, we obtain the bound
The γ-net accepting SQ_n consists of a first neuron computing u(x) = γ(a + [x]/L). The second neuron, the output neuron, computes the weighted sum

As a consequence of equation 2.1, the output neuron approximates [x]^2 - [y] with error at most 1/4. Thus,
Thus, setting t_C = -1/2 in Definition 1.2, it follows that our γ-net C accepts SQ_n with separation at least 1/4. The weight bound follows, since L = O(n^3).

3 A Lower Bound for "Unary Squaring"
We have to show the following result.

Theorem 3.1. Any binary threshold network accepting SQ_n must have size at least Ω(log n).

Let SQ_n^= denote the language

SQ_n^= = {(x, y): x ∈ {0,1}^n, y ∈ {0,1}^{n^2}, and [x]^2 = [y]}

Proposition 3.1. Assume that there exists a binary threshold network of size t_n accepting SQ_n. Then there exists a binary threshold network of size t_n + t_{n+1} + 1 accepting SQ_n^=.
Proof. Since there exists a binary threshold network of size t_n accepting SQ_n, there also exists a binary threshold network of size t_n accepting the complement of SQ_n. We show how to compute the language

SQ_n^≤ = {(x, y): x ∈ {0,1}^n, y ∈ {0,1}^{n^2}, and [x]^2 ≤ [y]}

Consider the binary threshold network for SQ_{n+1}, with binary inputs x_1, ..., x_{n+1} and y_1, ..., y_{(n+1)^2}. We set x_{n+1} = 0, y_{n^2+1} = 1, and y_{n^2+2} = ... = y_{(n+1)^2} = 0. With those bits fixed, the threshold network for SQ_{n+1} accepts the input (x_1, x_2, ..., x_n, y_1, y_2, ..., y_{n^2}) if and only if (Σ_{i=1}^n x_i)^2 ≥ (Σ_i y_i) + 1. But this is equivalent to (Σ_i x_i)^2 > Σ_i y_i. Hence, size t_{n+1} threshold circuits can compute SQ_n^≤. But note that SQ_n^= = SQ_n ∧ SQ_n^≤. Hence, SQ_n^= can be computed with t_n + t_{n+1} + 1 threshold gates.
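The padding argument can be checked exhaustively for tiny n (with SQ_n read as the comparison [x]^2 ≥ [y]):

```python
import itertools

def sq(x, y):
    # SQ_n: accept iff [x]^2 >= [y]
    return sum(x) ** 2 >= sum(y)

n = 2
for x in itertools.product([0, 1], repeat=n):
    for y in itertools.product([0, 1], repeat=n * n):
        # pad as in the proof: x_{n+1} = 0, y_{n^2+1} = 1, remaining new y-bits = 0
        xx = list(x) + [0]
        yy = list(y) + [1] + [0] * ((n + 1) ** 2 - n * n - 1)
        assert sq(xx, yy) == (sum(x) ** 2 > sum(y))
```

Fixing the extra bits shifts [y] by exactly 1, and over the integers [x]^2 ≥ [y] + 1 is the same as the strict inequality, which is what the proof exploits.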
Thus it suffices to show that any binary threshold network accepting SQ_n^= must have size at least Ω(log n). Let us assume that C_s is a binary threshold network with s gates accepting SQ_n^=. Our approach will be to successively trivialize (i.e., partially fix the outcomes of) the gates of C_s by fixing appropriate bits of the input (x, y). The process of trivialization starts with source gates, continues with gates all of whose immediate predecessors have been trivialized, and finally terminates with the sink gate of C_s. Let us assume that the process of trivialization has reached gate g. Moreover, assume that the bits x_{k+1}, ..., x_n and y_{l+1}, ..., y_{n^2} have been fixed with (x_{k+1}, ..., x_n) = u and (y_{l+1}, ..., y_{n^2}) = v. Determine α with l = 2αk + k' and set
Here λ is the so-called "regularization parameter" and Φ[f] is a functional that measures the smoothness of the functions involved. The choice of an optimal λ is an interesting question in regularization techniques, and typically cross-validation or other heuristic schemes are used.

3. Structural risk minimization (Vapnik 1982) is another method to achieve a trade-off between network complexity (corresponding to n in our case) and fit to the data. However, it does not guarantee that the architecture selected will be the one with minimal parameterization. In fact, it would be of some interest to develop a sequential growing scheme. Such a technique would at any stage perform a sequential hypothesis test. It would then decide whether to ask for more data, add one more node, or simply stop and output the function it has as its ε-good hypothesis. In such a process, one might even incorporate active learning (Angluin

Radial Basis Functions
1988; Niyogi 1995) so that if the algorithm asks for more data, then it might even specify a region in the input domain from where it would like to see these data.

4. It should be noted here that we have assumed that the empirical risk Σ_{i=1}^l [y_i - f(x_i)]^2 can be minimized over the class H_n and the function f_{n,l} be effectively computed. While this might be fine in principle, in practice only a locally optimal solution to the minimization problem is found (typically using some gradient descent scheme). The computational complexity of obtaining even an approximate solution to the minimization problem is an interesting one, and results from computer science (Judd 1988; Blum and Rivest 1988) suggest that it might in general be NP-hard.

4.3.2 Connections with Other Results.

1. In the neural network and computational learning theory communities, results have been obtained pertaining to the issues of generalization and learnability. Some theoretical work has been done (Baum and Haussler 1989; Haussler 1992; Ji and Psaltis 1992) in characterizing the sample complexity of finite-sized networks. Of these, it is worthwhile to mention again the work of Haussler (1992), from which this paper derives much inspiration. He obtains bounds for a fixed hypothesis space, i.e., a fixed finite network architecture. Here we deal with families of hypothesis spaces, using richer and richer hypothesis spaces as more and more data become available. Others (Levin et al. 1990) attempt to characterize the generalization abilities of feedforward networks using theoretical formalizations from statistical mechanics. Yet others (Botros and Atkeson 1991; Moody 1992; Cohn and Tesauro 1991; Rumelhart et al. 1991) attempt to obtain empirical bounds on generalization abilities.

2. This is an attempt to obtain rate-of-convergence bounds in the spirit of Barron's work (1994), but using a different approach.
We have chosen to combine theorems from approximation theory (which gives us the O(1/n) term in the rate) and uniform convergence theory (which gives us the other part). Note that at this moment, our rate of convergence is worse than Barron's. In particular, he obtains a rate of convergence of O{1/n + [nk ln(l)]/l}. Further, he has a different set of assumptions on the class of functions (corresponding to our F). Finally, the approximation scheme is a class of networks with sigmoidal units as opposed to radial basis units, and a different proof technique is used.

3. It is worthwhile to refer to the article of Geman et al. (1992) in this journal, which discusses the bias-variance dilemma. Using our notation, the integrated squared bias is defined as B = ||f_0 - E_{D_l}[f_{n,l}]||^2, and the integrated variance is V = E_{D_l}[||E_{D_l}[f_{n,l}] - f_{n,l}||^2], where E_{D_l} stands for the expected value over all possible data sets of size l. Geman et al. (1992) show that the generalization error averaged over D_l can be decomposed as B + V. They show that as the number of parameters increases, the bias
Partha Niyogi and Federico Girosi
of the estimator decreases and the variance increases for a fixed size of the data set. From an intuitive point of view, the bias B plays the role of the approximation error ||f_0 - f_n||^2, although their relationship is not clear. In fact, the average estimator E_{D_l}[f_{n,l}] differs from f_n and need not even belong to H_n. The variance V is related to the average estimation error, and it can be shown that both of them are bounded by the quantity E_{D_l}||f_n - f_{n,l}||^2. Finding the right bias-variance trade-off is very similar in spirit to finding the trade-off between network complexity and data complexity.

4. Given the class of radial basis functions we are using, a natural comparison arises with kernel regression (Krzyzak 1986; Devroye 1981) and results on the convergence of kernel estimators. It should be pointed out that, unlike our scheme, gaussian-kernel regressors require the variance of the gaussian to go to zero as a function of the data. Further, the number of kernels is always equal to the number of data points, and the issue of a trade-off between the two is not explored to the same degree.

5. In our statement of the problem, we discussed how pattern classification could be treated as a special case of regression. In this case the function f_0 corresponds to the Bayes a posteriori decision function. Researchers (Richard and Lippman 1991; Hampshire and Pearlmutter 1990; Gish 1990) in the neural network community have observed that a network trained on a least square error criterion and used for pattern classification was in effect computing the Bayes decision function. This paper provides a rigorous proof of the conditions under which this is the case.

4.4 Empirical Results. The main thrust of this paper is to provide some insight into how overfitting can be studied in classes of feedforward networks and the general laws that govern overfitting phenomena in such networks.
How closely do "real" function learning problems obey the general principles embodied in the theorem described earlier? We do not attempt to provide an extensive answer to this question, but just to satisfy the reader's curiosity, we now describe some empirical results.
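Before the experiments, the exact additivity of the empirical bias-variance decomposition discussed in remark 3 above can be verified with a small Monte Carlo run; the target, noise level, and the deliberately rigid constant-model estimator are all illustrative choices:

```python
import math
import random

rng = random.Random(0)
f0 = math.sin
l, M = 10, 500                                     # dataset size, number of datasets
test_x = [i * 6.0 / 50 - 3.0 for i in range(51)]   # evaluation grid on [-3, 3]

# The "estimator" is the constant model c = mean(y), refitted on each dataset D_l.
fits = []
for _ in range(M):
    xs = [rng.uniform(-3.0, 3.0) for _ in range(l)]
    ys = [f0(x) + rng.gauss(0.0, 0.3) for x in xs]
    c = sum(ys) / l
    fits.append([c] * len(test_x))

avg = [sum(f[i] for f in fits) / M for i in range(len(test_x))]
B = sum((f0(x) - a) ** 2 for x, a in zip(test_x, avg)) / len(test_x)
V = sum((f[i] - avg[i]) ** 2 for f in fits for i in range(len(test_x))) / (M * len(test_x))
err = sum((f0(test_x[i]) - f[i]) ** 2 for f in fits for i in range(len(test_x))) / (M * len(test_x))

assert abs(err - (B + V)) < 1e-9   # the decomposition is an exact empirical identity
assert B > V                        # a rigid model: bias dominates variance
```

A model with many parameters would show the opposite regime, small B and large V, which is the trade-off the section discusses.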
4.4.1 The Experiment. The target function, a k-dimensional function, was assumed to have the following form, which ensures that the assumptions of Theorem 3.1 are satisfied:   (4.2)

Here Σ is a diagonal matrix with (Σ)_{ij} = σ_i^2 δ_{ij}. The parameters {σ_i, w_i, c_i} were chosen at random in the following ranges: σ_i ∈ [1.7, 2.3], w_i ∈ [-2, 2]^k, c_i ∈ [-1, 1], [0, 2π], and N ∈ [3, 20]. Training sets of different sizes, ranging from l = 30 to l = 500, were randomly generated in the k-dimensional cube [-π, π]^k, and an independent test set of 2000 examples
Figure 6: The generalization error is plotted as a function of the number of nodes of an RBF network (3.1) trained on 100 data points of a function of the type (4.2) in 2 dimensions. For each number of parameters, 10 results, corresponding to 10 different local minima, are reported. The continuous lines above the experimental data represent the bound a/n + b[(nk ln(nl) - ln δ)/l]^{1/2} of eq. (3.5), in which the parameters a and b have been estimated empirically, and δ = 10^{-1}.
was chosen to estimate the generalization error. Gaussian RBF networks (as in Theorem 3.1) with different numbers of hidden units, ranging from n = 1 to n = 300, were trained using a gradient descent scheme. Each training session was repeated 10 times with random initialization, because of the problem of local minima. We did experiments in 2, 4, 6, and 8 dimensions. In all cases the qualitative behavior of the experimental results followed the theoretical predictions. In Figures 6 and 7 we report the experimental results for a two- and a six-dimensional case, respectively. We found, in general, that although overfitting occurs as expected, it has a tendency to occur at a larger number of parameters than expected. We attribute that to the presence of local minima, which have the effect of restricting the hypothesis space, suggesting that the "effective" number of parameters (Moody 1991) is much smaller than the total number of parameters. We believe that extensive experimentation is needed to compare the deviation between theory and practice, and the problem of local minima
Partha Niyogi and Federico Girosi
838
Figure 7: Everything is as in Figure 6, but here the dimensionality is 6 and the number of data points is 150. As before, the parameters a and b have been estimated empirically, and δ = 10⁻². Notice that this time the curve passes through some of the data points. However, we recall that the bound indicated by the curve holds under the assumption that the global minimum has been found, and that the data points represent different local minima. Clearly in the figure the curve bounds the best of the local minima.
should be seriously addressed. This is well beyond the scope of the current article, and further research on the matter is planned.

5 Conclusion
For the task of learning some unknown function from labeled examples, where we have multiple hypothesis classes of varying complexity, choosing the class of the right complexity and the appropriate hypothesis within that class poses an interesting problem. We have provided an analysis of the situation and the issues involved, and in particular have tried to show how the hypothesis complexity, the sample complexity, and the generalization error are related. We proved a theorem for a special set of hypothesis classes, the radial basis function networks, and we bounded the generalization error for certain function learning tasks in terms of the number of parameters and the number of examples. This is equivalent to obtaining a bound on the rate at which the number of parameters
must grow with respect to the number of examples for convergence to take place. Thus we use richer and richer hypothesis spaces as more and more data become available. We also see that there is a trade-off between hypothesis complexity and generalization error for a certain fixed amount of data, and our result allows us a principled way of choosing an appropriate hypothesis complexity (network architecture). The choice of an appropriate model for empirical data is a problem of long-standing interest in statistics, and we provide connections between our work and other work in the field.

Appendix: A Useful Decomposition of the Expected Risk

We now show that the regression function f₀ defined in equation 2.2 minimizes the expected risk I[f]. By adding and subtracting the regression function f₀, we see that

I[f] = E[(f₀(x) − f(x))²] + E[(y − f₀(x))²] + 2E[(y − f₀(x))(f₀(x) − f(x))]
By definition of the regression function f₀(x), the cross-product in the last equation is easily seen to be zero, and therefore

I[f] = E[(f₀(x) − f(x))²] + I[f₀]
Clearly, the minimum of I[f] is achieved when the first term is minimum, that is, when f(x) = f₀(x). In the case in which the data come from randomly sampling a function f in the presence of additive noise, I[f₀] = σ², where σ² is the variance of the noise. When data are noisy, therefore, even in the most favorable case we cannot expect the expected risk to be smaller than the variance of the noise.

Acknowledgments

We are grateful to V. Vapnik, T. Poggio, and B. Caprile for useful discussions and suggestions. We also wish to thank N. T. Chan for kindly providing the code for the numerical simulations.

References

Angluin, D. 1988. Queries and concept learning. Mach. Learn. 2, 319-342.
Barron, A. 1993. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inform. Theory 39(3), 930-945.
Barron, A. 1994. Approximation and estimation bounds for artificial neural networks. Mach. Learn. 14, 115-133.
Barron, A., and Cover, T. 1991. Minimum complexity density estimation. IEEE Trans. Inform. Theory 37(4).
Baum, E. B., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1, 151-160.
Blum, A., and Rivest, R. L. 1988. Training a three-neuron neural net is NP-complete. In Proceedings of the 1988 Workshop on Computational Learning Theory, pp. 9-18. Morgan Kaufmann, San Mateo, CA.
Botros, S., and Atkeson, C. G. 1991. Generalization properties of Radial Basis Functions. In Advances in Neural Information Processing Systems 3, R. Lippmann, J. Moody, and D. Touretzky, eds., pp. 707-713. Morgan Kaufmann, San Mateo, CA.
Cohn, D., and Tesauro, G. 1991. Can neural networks do better than the VC bounds? In Advances in Neural Information Processing Systems 3, R. Lippmann, J. Moody, and D. Touretzky, eds., pp. 911-917. Morgan Kaufmann, San Mateo, CA.
Craven, P., and Wahba, G. 1979. Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross validation. Numer. Math. 31, 377-403.
Cybenko, G. 1989. Approximation by superposition of a sigmoidal function. Math. Control Syst. Signals 2(4), 303-314.
Devroye, L. 1981. On the almost everywhere convergence of nonparametric regression function estimates. Ann. Statist. 9, 1310-1319.
Dudley, R. M. 1987. Universal Donsker classes and metric entropy. Ann. Prob. 14(4), 1306-1326.
Dudley, R. M. 1989. Real Analysis and Probability. Mathematics Series. Wadsworth and Brooks/Cole, Pacific Grove, CA.
Efron, B. 1982. The Jackknife, the Bootstrap, and Other Resampling Plans. SIAM, Philadelphia.
Geman, S., Bienenstock, E., and Doursat, R. 1992. Neural networks and the bias/variance dilemma. Neural Comp. 4, 1-58.
Girosi, F. 1994. Regularization theory, Radial Basis Functions and networks. In From Statistics to Neural Networks. Theory and Pattern Recognition Applications, V. Cherkassky, J. H. Friedman, and H. Wechsler, eds. Subseries F, Computer and Systems Sciences. Springer-Verlag, Berlin.
Girosi, F., and Anzellotti, G. 1993. Rates of convergence for Radial Basis Functions and neural networks. In Artificial Neural Networks for Speech and Vision, R. J. Mammone, ed., pp. 97-113. Chapman & Hall, London.
Girosi, F., Jones, M., and Poggio, T. 1995. Regularization theory and neural networks architectures. Neural Comp. 7, 219-269.
Gish, H. 1990. A probabilistic approach to the understanding and training of neural network classifiers. In Proceedings of ICASSP-90, pp. 1361-1365, Albuquerque, NM.
Hampshire, J. B. II, and Pearlmutter, B. A. 1990. Equivalence proofs for multilayer perceptron classifiers and the Bayesian discriminant function. In Proceedings of the 1990 Connectionist Models Summer School, J. Elman, D. Touretzky, and G. Hinton, eds. Morgan Kaufmann, San Mateo, CA.
Haussler, D. 1992. Decision-theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation 100(1), 78-150.
Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366.
Ji, C., and Psaltis, D. 1992. The VC dimension versus the statistical capacity of multilayer networks. In Advances in Neural Information Processing Systems 4, S. J. Hanson, J. Moody, and R. P. Lippmann, eds. Morgan Kaufmann, San Mateo, CA.
Jones, L. K. 1992. A simple lemma on greedy approximation in Hilbert space and convergence rates for Projection Pursuit Regression and neural network training. Ann. Statist. 20(1), 608-613.
Judd, S. 1988. Neural network design and complexity of learning. Ph.D. thesis, University of Massachusetts, Amherst, MA.
Krzyzak, A. 1986. The rates of convergence of kernel regression estimates and classification rules. IEEE Trans. Inform. Theory IT-32(5), 668-679.
Levin, E., Tishby, N., and Solla, S. A. 1990. A statistical approach to learning and generalization in layered neural networks. Proc. IEEE 78(10), 1568-1574.
Lippmann, R. P. 1987. An introduction to computing with neural nets. IEEE ASSP Mag. April, 4-22.
Lorentz, G. G. 1986. Approximation of Functions. Chelsea Publishing, New York.
Mhaskar, H. N. 1993. Approximation properties of a multilayered feedforward artificial neural network. Adv. Comput. Math. 1, 61-80.
Mhaskar, H. N., and Micchelli, C. A. 1992. Approximation by superposition of a sigmoidal function. Adv. Appl. Math. 13, 350-373.
Moody, J. 1991. The effective number of parameters: An analysis of generalization and regularization in non-linear learning systems. In Advances in Neural Information Processing Systems 4, S. J. Hanson, J. Moody, and R. P. Lippmann, eds., pp. 847-854. Morgan Kaufmann, San Mateo, CA.
Moody, J., and Darken, C. 1989. Fast learning in networks of locally-tuned processing units. Neural Comp. 1(2), 281-294.
Niyogi, P. 1995. The informational complexity of learning from examples. Ph.D. thesis, MIT, Cambridge, MA.
Niyogi, P., and Girosi, F. 1994. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. A.I. Memo 1467, Massachusetts Institute of Technology. URL ftp://publications.ai.mit.edu/ai-publications/1000-1499/AIM-1467.ps.Z.
Poggio, T., and Girosi, F. 1990. Networks for approximation and learning. Proc. IEEE 78(9).
Pollard, D. 1984. Convergence of Stochastic Processes. Springer-Verlag, Berlin.
Powell, M. J. D. 1992. The theory of radial basis functions approximation in 1990. In Advances in Numerical Analysis Volume II: Wavelets, Subdivision Algorithms and Radial Basis Functions, W. A. Light, ed., pp. 105-210. Oxford University Press, Oxford, England.
Richard, M. D., and Lippmann, R. P. 1991. Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Comp. 3, 461-483.
Rissanen, J. 1989. Stochastic Complexity in Statistical Inquiry. World Scientific, Singapore.
Stein, E. M. 1970. Singular Integrals and Differentiability Properties of Functions. Princeton University Press, Princeton, NJ.
Tikhonov, A. N. 1963. Solution of incorrectly formulated problems and the regularization method. Soviet Math. Dokl. 4, 1035-1038.
Vapnik, V. N. 1982. Estimation of Dependencies Based on Empirical Data. Springer-Verlag, Berlin.
Vapnik, V. N., and Chervonenkis, A. Y. 1971. On the uniform convergence of relative frequencies of events to their probabilities. Th. Prob. Appl. 17(2), 264-280.
Vapnik, V. N., and Chervonenkis, A. Y. 1991. The necessary and sufficient conditions for consistency in the empirical risk minimization method. Patt. Recogn. Image Analysis 1(3), 283-305.
Wahba, G. 1990. Spline Models for Observational Data, Series in Applied Mathematics, Vol. 59. SIAM, Philadelphia.
Weigend, A. S., Rumelhart, D. E., and Huberman, B. A. 1991. Generalization by weight elimination with applications to forecasting. In Advances in Neural Information Processing Systems 3, R. Lippmann, J. Moody, and D. Touretzky, eds. Morgan Kaufmann, San Mateo, CA.
White, H. 1990. Connectionist nonparametric regression: Multilayer perceptrons can learn arbitrary mappings. Neural Networks 3, 535-549.
Received September 15, 1995; accepted November 2, 1995.
Communicated by Andreas Weigend and Chris Bishop
Using Neural Networks to Model Conditional Multivariate Densities Peter M. Williams School of Cognitive and Computing Sciences, University of Sussex, Falmer, Brighton BN1 9QH, England
Neural network outputs are interpreted as parameters of statistical distributions. This allows us to fit conditional distributions in which the parameters depend on the inputs to the network. We exploit this in modeling multivariate data, including the univariate case, in which there may be input-dependent (e.g., time-dependent) correlations between output components. This provides a novel way of modeling conditional correlation that extends existing techniques for determining input-dependent (local) error bars.

1 Introduction
Neural networks provide a way of modeling the statistical relationship between an independent variable X and a dependent variable Y. For example, X could be financial data up to a certain time and Y could be a future stock index, exchange rate, option price, etc. Alternatively, X could represent geophysical features of a prospect and Y could represent mineralization at a certain depth. In general, X and Y can be vectors of continuous or discrete quantities. Suppose that the conditional distribution of Y belongs to a family of distributions characterized by a finite set of parameters that are functions of conditioning values of X. These functions, which in general will be nonlinear, can then be modeled by a neural network. For discrete distributions this approach is exemplified in the softmax rule (Bridle 1990). The use of network outputs to set the parameters of a density model forms the basis of the competing local experts model of Jacobs et al. (1991). The idea of using neural networks to return the complete conditional density of the output is also found in Ghahramani and Jordan (1994), for example. Bishop (1994) gives a systematic exposition of this approach, in particular for the case of gaussian mixtures. The case of a single kernel is treated independently by Weigend and Nix (1994) and Nix and Weigend (1995), where the output from an auxiliary variance unit is used to set local time-dependent error bars for time-series predictions. The purpose of the present paper is to extend these techniques to the

Neural Computation 8, 843-854 (1996)
© 1996 Massachusetts Institute of Technology
844
Peter M. Williams
case of multivariate data where the conditional covariance matrix may be nondiagonal.

2 Multivariate Data
The conditional distribution of the n-dimensional quantity Y given X = x is assumed to be described by the multivariate gaussian density

P(y | x) = (2π)^(−n/2) |Σ|^(−1/2) exp{ −(1/2)(y − μ)ᵀ Σ⁻¹ (y − μ) }   (2.1)
where μ(x) is the vector of conditional means and Σ(x) is the conditional covariance matrix. Both μ and Σ are understood to be functions of x in a way that depends on the outputs of a neural network when the conditioning vector x is given as input. It is assumed that the network has linear output units and that μ and Σ are determined by the activations of these units. We now discuss the link between network outputs and the components of μ and Σ. The mean presents no problem. The network will be required to have n output units whose activations, {zᵢ^μ} say, are related to the n components of μ by

μᵢ = zᵢ^μ,   i = 1, …, n   (2.2)
These units compute the components of the mean directly. It is less obvious how to represent the covariance matrix. Being symmetric, Σ has at most n(n + 1)/2 independent entries, but it must also be positive definite.¹ The problem is to parameterize the class of symmetric positive definite matrices in such a way that (1) the parameters can freely assume any real values, (2) the determinant is a simple expression of the parameters, and (3) the correspondence is bijective. To solve this problem we recall the Cholesky factorization of a symmetric positive definite matrix as AᵀA, where A is upper triangular with strictly positive diagonal elements. The square root of the determinant of AᵀA is the product of the diagonal elements of A. Conversely, if A is any upper triangular matrix with strictly positive diagonal entries, AᵀA is symmetric positive definite and the correspondence is unique.² Applying this factorization to the inverse covariance matrix when n = 4, for example, gives Σ⁻¹ = AᵀA, where A = (αᵢⱼ) is upper triangular with diagonal entries α₁₁, α₂₂, α₃₃, α₄₄,

¹We restrict to the proper case where Σ is invertible.
²The diagonal entries of A are the square roots of the pivots under gaussian elimination (Horn and Johnson 1985; Golub and Loan 1989). Note that every positive definite matrix is invertible, the inverse of a positive definite matrix is positive definite, and every symmetric positive definite matrix is the covariance matrix of some multivariate gaussian.
Conditional Multivariate Densities
845
with

|Σ|^(−1/2) = α₁₁ α₂₂ α₃₃ α₄₄
To represent the matrix A we stipulate that the network is provided with an additional set of dispersion output units whose activations {zᵢ^σ} and {zᵢⱼ^σ} are related to the elements of A by

αᵢᵢ = exp zᵢ^σ,   i = 1, …, n   (2.3)
αᵢⱼ = zᵢⱼ^σ,   i = 1, …, n − 1,  j = 2, …, n,  i < j   (2.4)
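The parameterization in equations 2.2–2.4 is easy to make concrete. The following is a minimal numpy sketch (the function name and array layout are our own, not from the paper) that maps n(n + 3)/2 free output activations to a valid mean vector and covariance matrix:

```python
import numpy as np

def outputs_to_gaussian(z_mu, z_diag, z_off):
    """Map n(n+3)/2 output activations to (mu, Sigma, A).

    z_mu   : n activations for the mean (equation 2.2)
    z_diag : n activations giving A's diagonal as exp(z) (equation 2.3)
    z_off  : n(n-1)/2 activations for A's strict upper triangle (equation 2.4)
    """
    n = len(z_mu)
    A = np.zeros((n, n))
    A[np.diag_indices(n)] = np.exp(z_diag)   # strictly positive diagonal
    A[np.triu_indices(n, k=1)] = z_off       # unconstrained off-diagonal entries
    Sigma_inv = A.T @ A                      # symmetric positive definite by construction
    return np.asarray(z_mu, dtype=float), np.linalg.inv(Sigma_inv), A

# With all outputs zero we recover zero mean and unit covariance, as noted below.
mu, Sigma, A = outputs_to_gaussian(np.zeros(3), np.zeros(3), np.zeros(3))
```

Note that |Σ|^(−1/2) is just the product of the diagonal of A, so the log determinant needed in the likelihood is available without any matrix factorization.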
In this way n network outputs are needed for the mean (equation 2.2), another n for the positive diagonal entries (equation 2.3), and n(n − 1)/2 for the off-diagonal entries (equation 2.4), making n(n + 3)/2 in all.³ Note that Σ can be recovered by inverting Σ⁻¹, which is easy to compute now that Σ⁻¹ is known as the product AᵀA of lower and upper triangular matrices (Press et al. 1992, Ch. 2). Note also that if X is a vector of independent standard normal deviates, then Y = A⁻¹X is gaussian with covariance matrix Σ. This can be used for generating efficient Monte Carlo simulations. The use of the exponential in equation 2.3 ensures that the diagonal entries are always positive, so that every possible network output vector corresponds to a unique multivariate gaussian. When all network outputs vanish, for example, the corresponding gaussian has zero mean and unit covariance matrix. This representation also prevents variances going to zero, especially in the presence of suitable regularization. The particular choice of the exponential can be distinguished in a Bayesian framework by consideration of uninformative priors for scale parameters, assuming network outputs z^σ have uniform distributions (Nowlan and Hinton 1992b).

3 Likelihood
Suppose N pairs of corresponding observations {(x_p, y_p) : p = 1, …, N} have been made on X and Y. The negative conditional log likelihood of the data is assumed to factorize as

E = Σ_{p=1}^N E_p   (3.1)
where from equation 2.1 the negative log likelihood of an individual observation is
E,
=
;1% IC,l
+ ;(Yp
-
/%I Tc,
-1
(YP - PP)
(3.2)
³Network output activations are likely to be stored in a one-dimensional structure in most implementations. It is left to the reader how to manage the two-dimensional indexing.
apart from a constant.⁴ Maximum likelihood estimation would seek network weights w that minimize E. Whatever form of estimation is used, with or without some form of regularization, the gradient of equation 3.1 with respect to network weights is of interest. Concentrating on equation 3.2 and omitting the subscript p, we define ∆ = y − μ and u = A∆. The negative log likelihood for an individual observation is then, up to the same constant,

E = (1/2) uᵀu − Σᵢ zᵢ^σ

and partial derivatives with respect to network outputs are easily seen to be

∂E/∂zᵢ^μ = −(Aᵀu)ᵢ,   i = 1, …, n
∂E/∂zᵢ^σ = αᵢᵢ uᵢ ∆ᵢ − 1,   i = 1, …, n
∂E/∂zᵢⱼ^σ = uᵢ ∆ⱼ,   i = 1, …, n − 1,  j = 2, …, n,  i < j
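These derivative expressions can be checked against finite differences. The following self-contained sketch uses the Cholesky parameterization of Section 2 with arbitrary data and our own helper names (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
y = rng.standard_normal(n)

# Free "activations": z_mu (means), z_d (log-diagonal of A), z_o (upper triangle of A).
z_mu = rng.standard_normal(n)
z_d = rng.standard_normal(n)
z_o = rng.standard_normal(n * (n - 1) // 2)

def make_A(z_d, z_o):
    A = np.zeros((n, n))
    A[np.diag_indices(n)] = np.exp(z_d)
    A[np.triu_indices(n, k=1)] = z_o
    return A

def nll(z_mu, z_d, z_o):
    # E = -sum_i z_d[i] + (1/2) ||A (y - mu)||^2, i.e. equation 3.2 up to a constant.
    u = make_A(z_d, z_o) @ (y - z_mu)
    return -z_d.sum() + 0.5 * u @ u

# Analytic gradients from the expressions above.
A = make_A(z_d, z_o)
delta = y - z_mu
u = A @ delta
g_mu = -(A.T @ u)                      # dE/dz_mu
g_d = np.exp(z_d) * u * delta - 1.0    # dE/dz_d
iu, ju = np.triu_indices(n, k=1)
g_o = u[iu] * delta[ju]                # dE/dz_o

def num_grad(f, z, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(z)
    for i in range(z.size):
        e = np.zeros_like(z); e[i] = eps
        g[i] = (f(z + e) - f(z - e)) / (2 * eps)
    return g

assert np.allclose(g_mu, num_grad(lambda z: nll(z, z_d, z_o), z_mu), atol=1e-5)
assert np.allclose(g_d, num_grad(lambda z: nll(z_mu, z, z_o), z_d), atol=1e-5)
assert np.allclose(g_o, num_grad(lambda z: nll(z_mu, z_d, z), z_o), atol=1e-5)
```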
These expressions can be used with backpropagation to calculate ∇E with respect to network weights.

3.1 Regularization. Since neural networks are universal approximators, care is needed to match the complexity of the model to the information content of the data. Overfitting would take a particularly extreme form in the present case if the model were to fit a gaussian with arbitrarily small variance to one or more data points. It is therefore important to ensure appropriately limited variation over the training set of the modeled covariance matrix. This can be achieved by suitably limiting the number and sizes of the weights in the network. The general technique used below is described in Williams (1995), although other methods are also applicable (Nowlan and Hinton 1992a; Bishop 1993).

3.2 Constant Dispersion. It is interesting to consider the special case in which the network weights attached to the dispersion output units vanish. This would be appropriate if the noise distribution were constant over the whole training set. In that case one would expect an adequate regularizer to detect this feature of the dataset and set the dispersion

⁴It will not be investigated under what assumptions this factorization over observations is justified. It is sufficient that the observation pairs are jointly independent, but this is not necessary (see Section 4.2).
output weights to zero. However this case may arise, the activations {zᵢ^σ} and {zᵢⱼ^σ} are then independent of network inputs and determined just by the biases on the corresponding output units. It can then be shown that, at any local minimum of E as a function of weights and biases, the dispersion output biases must assume values such that the inverse of AᵀA is given by

Σ̂ = (AᵀA)⁻¹ = (1/N) Σ_{p=1}^N (y_p − μ_p)(y_p − μ_p)ᵀ   (3.3)

where μ_p is the conditional mean for input x_p as computed by the network at this local minimum.⁵ Substituting Σ̂ for each Σ_p in equations 3.1 and 3.2 leads to

E = (1/2) N log |Σ̂| + constant   (3.4)

as the expression for the negative log likelihood, permitting dispersion output units to be dispensed with. In the case of univariate data, or more generally of uncorrelated multivariate data, equation 3.4 can also be obtained by integrating out the diagonal elements of the covariance matrix using an uninformative prior (Buntine and Weigend 1991; Williams 1995). The present approach, however, is more flexible in allowing dispersion to vary over the input domain and, even in the case of constant dispersion for multivariate data, more efficient than tackling equation 3.4 directly.

4 Examples
We now illustrate these ideas, applied first to synthetic data and then to empirically generated time series data. We begin with computer-generated data for which the generating distribution is known.

4.1 Synthetic Data. Weigend and Nix (1994) discuss univariate data (n = 1) drawn from normal distributions N(μ, σ) with means

μ(x) = sin(2.5x) sin(1.5x)

and variances

σ²(x) = 0.01 + 0.25 [1 − sin(2.5x)]²

⁵The proof follows the lines of the usual treatment of maximum likelihood estimators of parameters of multivariate normal distributions, together with their invariance under invertible reparameterizations (Anderson 1958). It should be noted, however, that equation 3.3 is only the maximum likelihood estimate of the covariance matrix if the μ_p are themselves maximum likelihood estimates of the means, which depends on the style of regularization in force. Nonetheless, since maximum likelihood estimators of variance are biased, this raises the possibility of bias in the present estimators. This issue, especially in the case of larger dimensional data and smaller samples, will be the subject of further investigation.
Figure 1: Training set for the univariate case showing the random distribution of training data around the mean μ(x) = sin(2.5x) sin(1.5x) with variance σ²(x) = 0.01 + 0.25 [1 − sin(2.5x)]² for 0 ≤ x ≤ π.
One thousand training examples were generated using this example with x drawn randomly from a uniform distribution on [0, π]. The training set is shown in Figure 1. Results are shown in Figure 2. These were obtained using a simple fully connected 3-layer network with 1 input unit, 10 hidden units, and 2 output units. Networks were trained using the optimization and regularization algorithms of Williams (1991, 1995), which pruned the network to 6 hidden units with 23 remaining nonzero weights and biases. Weigend and Nix, in fact, propose a more complex architecture and training regime. This seems not to be needed by present methods, which fit both first and second moments together and appear to give improved results.⁶
⁶To investigate variability between local minima, 20 similar networks were trained and the results averaged. For the mean this gives μ = ⟨μ_k⟩ and for the variance σ² = ⟨σ_k²⟩ + {⟨μ_k²⟩ − ⟨μ_k⟩²}, where μ_k(x) and σ_k²(x) are the mean and variance for the kth network, k = 1, …, 20, and ⟨μ_k⟩ is the average of the means, etc. The results for μ(x) and σ(x) for the mixture are indistinguishable at this scale from those shown in Figure 2. This form of averaging corresponds to rudimentary integration of the predictive distribution over weight space (Buntine and Weigend 1991; Neal 1995).
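The univariate training set described above is easy to regenerate. A minimal sketch (the random seed and variable names are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def mu(x):
    # Conditional mean from the text: sin(2.5x) sin(1.5x).
    return np.sin(2.5 * x) * np.sin(1.5 * x)

def var(x):
    # Input-dependent noise variance: 0.01 + 0.25 [1 - sin(2.5x)]^2.
    return 0.01 + 0.25 * (1.0 - np.sin(2.5 * x)) ** 2

# 1000 inputs drawn uniformly on [0, pi]; targets carry heteroscedastic noise.
x = rng.uniform(0.0, np.pi, size=1000)
y = mu(x) + np.sqrt(var(x)) * rng.standard_normal(1000)
```

The variance floor of 0.01 keeps the noise strictly positive everywhere, which is what makes the exponential variance parameterization of Section 2 well matched to this target.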
Figure 2: Neural network fit for univariate data using a 3-layer network.

Continuing this example we consider data drawn from the bivariate normal distribution (n = 2) with mean (μ₁, μ₂) and covariance matrix

( σ₁²     ρσ₁σ₂ )
( ρσ₁σ₂   σ₂²   )

where the means are given by

μ₁(x) = sin(2.5x) sin(1.5x)
μ₂(x) = cos(3.5x) cos(0.5x)

the variances by

σ₁²(x) = 0.01 + 0.25 [1 − sin(2.5x)]²
σ₂²(x) = 0.01 + 0.25 [1 − cos(3.5x)]²

and the correlation coefficient by

ρ(x) = sin(2.5x) cos(0.5x)
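Samples from this input-dependent bivariate gaussian can be drawn with a triangular factor, mirroring the remark in Section 2 that a Cholesky factor converts independent normal deviates into correlated ones. A sketch (helper names and seed are ours):

```python
import numpy as np

rng = np.random.default_rng(2)

def params(x):
    """Input-dependent mean vector and covariance matrix from the expressions above."""
    mu = np.array([np.sin(2.5 * x) * np.sin(1.5 * x),
                   np.cos(3.5 * x) * np.cos(0.5 * x)])
    s1 = np.sqrt(0.01 + 0.25 * (1 - np.sin(2.5 * x)) ** 2)
    s2 = np.sqrt(0.01 + 0.25 * (1 - np.cos(3.5 * x)) ** 2)
    rho = np.sin(2.5 * x) * np.cos(0.5 * x)
    Sigma = np.array([[s1 * s1, rho * s1 * s2],
                      [rho * s1 * s2, s2 * s2]])
    return mu, Sigma

def sample(x):
    mu, Sigma = params(x)
    L = np.linalg.cholesky(Sigma)           # Sigma = L L^T, valid since |rho| < 1 here
    return mu + L @ rng.standard_normal(2)  # correlated gaussian draw

# 3000 examples with x uniform on [0, pi], as in the experiment.
data = np.array([sample(x) for x in rng.uniform(0.0, np.pi, 3000)])
```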
Three thousand training examples were generated with x randomly distributed over [0, π].⁷ These were modeled using a fully connected 3-layer network with 1 input unit, 20 hidden units, and 5 output units (2 for the means and 3 for the inverse covariance matrix). As an effect of regularization, these were pruned to 12 hidden units with 62 nonzero weights and biases. Results are shown in Figure 3. These show a reasonable fit for most of the interval.

⁷Specifically, y₁, y₂ were generated as y₁ = μ₁ + σ₁ …
E = (1/N) Σ_{n=1}^N |y_n − f(x_n)|   (1.3)
where the objective function E is the mean absolute error over the training samples. Both techniques train a feedforward network until a reasonable solution has been obtained, then prune an element in terms of its saliency or relevance, followed by retraining. This requires the total change in error E to be calculated for every unit or weight in the network, which is a computationally expensive task. A better objective function for the estimation of the relevance of a unit in the function space is the L₂-norm of the distance between the neural network mapping f and the underlying problem F, given by

‖f − F‖_{L₂}
This objective function requires units with finite L₂-norm. A radial basis function network f described by a linear combination of gaussian radial basis units φ_k,

f(x) = Σ_{k=1}^K α_k φ_k(x)   (1.6)

with K units, satisfies this requirement.
Although the underlying function F is unknown, and hence it is impossible to calculate E, an approximation of ρ_k in terms of the network f with and without the unit k may be obtained by directly considering the relevance of a unit in the function space. From the geometric representation of a K-unit network shown in Figure 1, a good estimate of the relevance of unit k is stated as

ρ_k = ‖f − f*‖ = |α_k| ‖φ_k‖ sin θ_k   (1.7)
Christophe Molina and Mahesan Niranjan
857
Figure 1: Geometric illustration of GaRBF network f (with unit k) and f′ (without unit k) in a three-dimensional function space H_K. f* represents the orthogonal projection of network f in the subspace H_{K−1} containing f′.
where f* is the orthogonal projection of f onto the subspace H_{K−1} and θ_k the angle between unit k and the subspace H_{K−1}. The interest of using the projection f* of the network, instead of the network f′ (f without unit k), is that it takes into account the best network with K − 1 units that may be obtained if the network is retrained to absorb the loss of information due to the pruning of unit k. Moreover, equation 1.7 has a useful property that makes it suitable for growing and pruning: the relevance ρ_k does not need to be computed over the whole training set. The approximation of the relevance of a unit given above was proposed in Kadirkamanathan and Niranjan (1993) for growing GaRBF networks in sequential learning and has led to a theoretical foundation for the RAN network. In this paper, we show how the above work can be extended to provide automatic and sequential pruning with replacement for resource allocating networks having a limited number of units (LRAN). The lack of a pruning rule for a network having a limited number of units may leave it with insufficient resources and no way of dealing with the need for additional units. Units that are induced by noise in the data, during the early stages of the algorithm, are a waste of resources. Too large a network can lead to overtraining. Finally, in any hardware implementation, the resources are going to be finite. This paper is organized as follows: In Section 2 we give a brief review of the F-projection geometric growth criterion and the simplifications leading to the RAN algorithm. In Section 3 we describe how the framework naturally extends to a pruning scheme. Section 4 gives an experimental evaluation of the pruning scheme on the prediction of the laser time series of the Santa Fe competition (Weigend and Gershenfeld 1993).
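The relevance of equation 1.7 can be illustrated numerically by discretizing the units on a dense grid, so that L₂ norms and angles reduce to ordinary vector operations; the angle θ_k then comes from the residual of a least-squares projection onto the remaining units. A sketch with arbitrary example units (not from the paper):

```python
import numpy as np

# Represent each gaussian unit by its values on a dense grid, so that
# L2 norms and angles reduce to ordinary vector operations.
x = np.linspace(0.0, np.pi, 400)
centers, widths = [0.5, 1.5, 2.5], [0.4, 0.5, 0.3]   # arbitrary example units
alphas = np.array([1.0, -0.8, 0.5])                  # arbitrary amplitudes
Phi = np.stack([np.exp(-(x - m) ** 2 / s ** 2) for m, s in zip(centers, widths)])

def relevance(k):
    """rho_k = |alpha_k| * ||phi_k|| * sin(theta_k), where theta_k is the
    angle between unit k and the span of the remaining units."""
    phi = Phi[k]
    others = np.delete(Phi, k, axis=0)
    # Orthogonal projection of phi_k onto span(others) via least squares.
    coef, *_ = np.linalg.lstsq(others.T, phi, rcond=None)
    resid = phi - others.T @ coef
    sin_theta = np.linalg.norm(resid) / np.linalg.norm(phi)
    return abs(alphas[k]) * np.linalg.norm(phi) * sin_theta

rho = np.array([relevance(k) for k in range(3)])
least = int(np.argmin(rho))   # candidate for pruning
```

A unit that is nearly a linear combination of the others has sin θ_k close to zero, so it can be removed (and absorbed by retraining) with little loss of information, which is exactly the intuition behind using f* rather than f′.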
Pruning on LRAN by F-Projections
858
2 The F-Projection Growth Criterion
Starting from a sequential function estimation approach, Kadirkamanathan and Niranjan (1993) arrived at a theoretical foundation for growing RANs when a new observation (x_n, y_n) occurs, in which the addition of a new GaRBF φ_{K+1} to a K-unit network was governed by ρ_k in equation 1.7 exceeding a given threshold ε,
ρ_{K+1} = |α_{K+1}| ‖φ_{K+1}‖ sin θ_{K+1} > ε   (2.1)
Under the equality constraint f(x_n) = y_n and a smoothness constraint that considers the underlying function F as smooth, the new GaRBF values are assigned as follows:

α_{K+1} = y_n − f(x_n)   (2.2)
μ_{K+1} = x_n   (2.3)
σ_{K+1} = λ inf_{k=1,…,K} ‖x_n − μ_k‖   (2.4)

where λ is an overlap factor that provides smoothness to the network in terms of the minimal distance from the center of the new unit to the rest of the units. The norm of φ_{K+1} depends only on the width σ_{K+1}, which can be considered as constant for each new observation. Therefore, equation 2.1 depends only on the parameters α_{K+1} and θ_{K+1}, and the criterion to increment the complexity of the network can be split into two parameters both exceeding threshold values,

|α_{K+1}| > a_min   and   θ_{K+1} > θ_min   (2.5)
Because of the difficulty of evaluating the angle θ_{K+1}, Platt and Kadirkamanathan propose an approximation equivalent to the minimal distance between the input x_n and the GaRBF centers μ_k, k = 1, ..., K, expressed as

    inf_{k=1,...,K} ||x_n - μ_k|| > ε_n    (2.6)

where ε_n decreases exponentially (Platt 1991) from the upper bound ε_0 until it reaches a lower bound ε_min,

    ε_n = max[ε_min, ε_0 · exp(-n/τ)]    (2.7)

and τ is a decay constant. Hence, a new GaRBF at step n is added for the observation (x_n, y_n) if the criteria given in equation 2.5 are satisfied. When the observation (x_n, y_n) does not satisfy the novelty criteria, the LMS algorithm or the extended Kalman filter (EKF) algorithm is used to adapt the output layer coefficients α_k and the GaRBF centers μ_k.
Christophe Molina and Mahesan Niranjan
The F-projection growing technique on RANs is then stated as follows:

    for each observation (x_n, y_n):
        ε_n = max[ε_min, ε_0 · exp(-n/τ)]
        if |y_n - f(x_n)| > α_min and inf_{k=1,...,K} ||x_n - μ_k|| > ε_n then
            Allocate a new unit with
                α_{K+1} = y_n - f(x_n),
                μ_{K+1} = x_n,
                σ_{K+1} = λ · inf_{k=1,...,K} ||x_n - μ_k||
            Update the number of units, K = K + 1
        else
            Adapt the network coefficients (and covariance matrices) by LMS (or EKF)
        end if
    end for
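The growth step just described can be sketched in code. This is a minimal illustration, not the authors' implementation: the dict-based unit representation and all threshold values (α_min, ε_0, ε_min, τ, the overlap factor) are assumptions chosen for the example.

```python
import numpy as np

def gaussian_rbf(x, center, width):
    """Value of a Gaussian radial basis function at input x."""
    return np.exp(-np.sum((x - center) ** 2) / width ** 2)

def ran_growth_step(x, y, units, n, alpha_min=0.02, eps0=1.0,
                    eps_min=0.07, tau=50.0, overlap=0.87):
    """One growth step following the novelty criteria sketched above.

    `units` is a list of dicts with keys 'alpha', 'mu', 'sigma'; all
    threshold values here are illustrative, not taken from the paper.
    Returns True if a new unit was allocated; otherwise the caller
    should adapt the existing coefficients by LMS or EKF.
    """
    f_x = sum(u['alpha'] * gaussian_rbf(x, u['mu'], u['sigma']) for u in units)
    error = y - f_x
    # Distance to the nearest existing center (the angle approximation)
    d_min = min((np.linalg.norm(x - u['mu']) for u in units), default=np.inf)
    # Exponentially decaying distance threshold
    eps_n = max(eps_min, eps0 * np.exp(-n / tau))
    if abs(error) > alpha_min and d_min > eps_n:
        # First unit: no neighbor distance exists, so fall back to eps0
        sigma = overlap * d_min if np.isfinite(d_min) else eps0
        units.append({'alpha': error, 'mu': np.asarray(x, dtype=float),
                      'sigma': sigma})
        return True
    return False
```

With an empty network, the first sufficiently large prediction error always allocates a unit; thereafter the decaying threshold ε_n makes allocation progressively harder, as intended by the sequential scheme.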
3 Pruning Limited Resource Allocating Networks (LRAN)
Pruning techniques have been designed to be implemented on systems with an unlimited or large number of units [see Reed (1993) for a survey of such techniques]. The pruning process removes a large number of units, leaving only a reduced subset to be employed in the final task. This is feasible for software networks, but may be unrealistic for hardware networks where the number of units is severely limited. Before developing the F-projection pruning criterion, we place the RAN in a more realistic and practical context in which the network may grow to contain a maximum number of units K_max. This is the case in hardware networks, for example, for which the training algorithm is wired and performed on-line. The principle of F-projection views pruning as a problem of providing an input-output mapping in the function space with a minimal loss of information. In this context, the pruning technique assists in making the best use of the available units, and a unit is pruned only when a relevant observation arrives and no free unit is available to store this information in the network. Thus, pruning can be viewed as a generalization of growing in which the total information contained in the network is maximized by pruning the least relevant unit and replacing it by another, more relevant unit. The reuse of units was first suggested and implemented by Anderson (1993) in a modified RAN architecture employing reinforcement learning. The advantage of our approach is that it remains in Platt's RAN framework, which is globally justified by the F-projection principle.

The relevance ρ_k of a unit k, among the K_max units of the network, has been stated in equation 1.7 in terms of its amplitude, L2-norm, and angle with the K-unit network. Because only ρ_k is required to compare the relevance of unit k to the relevance of the other units of the network, the norm ||φ_k|| may be replaced by a power of the width, σ_k^p, in equation 1.7 and, consequently, ρ_k may be expressed as

    ρ_k = |α_k| · σ_k^p · inf_{j≠k} ||μ_k - μ_j||    (3.1)
where the approximation of the angle given in equation 2.6 is taken into account. The comparison between the relevance of the least relevant unit j and that of the candidate unit φ_n arising from a new observation (x_n, y_n) is performed in the same K_max-dimensional function space. Hence, the relevance ρ_n of the candidate unit is stated as

    ρ_n = |α_n| · σ_n^p · d_n    (3.2)

with the amplitude, center, width, and distance d_n assigned as follows:

    α_n = y_n - f(x_n)    (3.3)
    μ_n = x_n    (3.4)
    σ_n = λ · d_n,  d_n = inf_{k≠j} ||x_n - μ_k||    (3.5)

where the least relevant unit φ_j has been temporarily pruned and replaced by the candidate unit. The F-projection pruning technique may therefore be stated as follows:

    for each observation (x_n, y_n):
        when no more units remain free:
            Compute the relevance of each unit and keep the least relevant unit φ_j and its relevance ρ_j
            Compute the relevance ρ_n of the candidate unit
            if ρ_n > ρ_j then
                Replace φ_j by φ_n: α_j = α_n, μ_j = μ_n, σ_j = σ_n
            else
                Adapt the network coefficients (and covariance matrices) by LMS (or EKF)
            end if
        end when
    end for
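The relevance comparison and replacement test above can be sketched as follows. This is a minimal illustration under stated assumptions: the dict-based unit representation, the parameter names, and the choice p = 1 are not from the paper.

```python
import numpy as np

def relevance(k, units, p=1):
    """Relevance of unit k: |alpha_k| * sigma_k**p * d_k, where d_k is the
    distance from unit k's center to its nearest neighboring center."""
    d_k = min(np.linalg.norm(units[k]['mu'] - u['mu'])
              for j, u in enumerate(units) if j != k)
    return abs(units[k]['alpha']) * units[k]['sigma'] ** p * d_k

def prune_with_replacement(x, y, f_x, units, overlap=0.87, p=1):
    """One pruning-with-replacement step; illustrative sketch.

    `f_x` is the current network output at x. Returns True if the least
    relevant unit was replaced by the candidate built from (x, y)."""
    # Find the least relevant existing unit
    j = min(range(len(units)), key=lambda k: relevance(k, units, p))
    rho_j = relevance(j, units, p)
    # Candidate unit: unit j is temporarily removed before measuring distance
    others = [u for k, u in enumerate(units) if k != j]
    d_n = min(np.linalg.norm(x - u['mu']) for u in others)
    alpha_n, sigma_n = y - f_x, overlap * d_n
    rho_n = abs(alpha_n) * sigma_n ** p * d_n
    if rho_n > rho_j:
        units[j] = {'alpha': alpha_n, 'mu': np.asarray(x, dtype=float),
                    'sigma': sigma_n}
        return True
    return False
```

A unit with near-zero amplitude (such as the randomly initialized units of the final LRAN algorithm) has near-zero relevance and is therefore the first to be replaced when informative observations arrive.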
Although both techniques may be studied separately, pruning with replacement implicitly contains the growing technique. This is easily explained as follows. Suppose that the network initially contains the final limited number of units. These units may be randomly initialized with very low amplitudes and narrow widths (which is equivalent to having no units at the beginning of learning). Once new observations arrive, these random units are pruned and replaced by more relevant units according to the pruning technique. It is evident that no growing is necessary at this stage, and that the network behavior will not change significantly compared with the previous growing algorithm. Moreover, the use of the artificial thresholds ε_min and α_min for the distance between units and the minimal unit allocation error, respectively, is no longer necessary. The final F-projection algorithm for LRANs may then be stated as follows:

- Allocate K_max units at random positions with zero amplitude and small width.
- Execute the F-projection pruning with replacement algorithm.
3.1 F-Projection Pruning with Replacement on a Toy Problem. Figure 2 shows graphically how F-projection pruning with replacement works for a synthetic problem as new observations arrive. The test consisted of the regression of a one-dimensional curve generated by the equation

    y = cos(2π · x) · exp(-x²/0.23),  x ∈ [-0.75, 0.75]    (3.6)
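Sampling this curve gives the training set used in the experiment. A minimal sketch follows; the random seed is arbitrary, and the Gaussian width 0.23 follows our reading of the printed equation.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

# 200 random observations of the toy target: uniform inputs on [-0.75, 0.75]
x = rng.uniform(-0.75, 0.75, size=200)
y = np.cos(2 * np.pi * x) * np.exp(-x ** 2 / 0.23)
```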
A set of 200 randomly generated observations {x, y} was presented to an LRAN network containing 5 units. The figures on the left show the target function, the approximation, and the new observation. The right-hand figures illustrate the contributions made by each of the five units. Table 1 provides numerical results for the amplitude, center, width, distance, and value of the relevance ρ_k as defined in equation 3.1 for each unit when pruning with replacement is performed. The relevance ρ_n of the new observation (Obs) is calculated using equation 3.2. In Figure 2a-d, units 1, 5, 2, and 4, respectively, are pruned and replaced by the new observations. Since gradient descent training occurs after each pruning and replacement step, the contributions of the units have slightly different shapes and positions from one figure to the next.

4 Experimental Results
The F-projection growing and pruning technique was tested on the laser data of the Santa Fe competition. These data consist of clean laser intensity data collected from a laboratory experiment (Weigend and Gershenfeld 1993).
Figure 2: Pruning with replacement for a 5-unit LRAN during the approximation of the function given in equation 3.6. Figures on the left show the original function (dash-dot), its LRAN approximation (solid), and the newly arrived observations (circles) at different stages. Figures on the right show the five units (numbered) and their positions. Units 1, 5, 2, and 4 are pruned and replaced by the new observation in turn. Numerical information is given in Table 1.
Table 1: Numerical Results for the Pruning with Replacement of a 5-Unit LRAN as Shown in Figure 2*

    Figure  Unit (k)  Amplitude (α_k)  Center (μ_k)  Width (σ_k)  Distance (d_k)  Relevance (ρ_k)  Prune and replace
    a       1           0.849            0.080         0.006        0.050           0.000            Yes
            2           0.984            0.030         0.044        0.050           0.002            -
            3          -0.358            0.430         0.304        0.230           0.025            -
            4           0.142            0.660         0.200        0.230           0.007            -
            5          -0.020           -0.710         0.687        0.740           0.010            -
            Obs        -0.222           -0.530         0.157        0.180           0.006
    b       1          -0.222           -0.530         0.157        0.180           0.006            -
            2           0.984            0.030         0.044        0.462           0.020            -
            3          -0.353            0.492         0.304        0.186           0.020            -
            4           0.106            0.678         0.200        0.186           0.004            -
            5          -0.007           -0.710         0.687        0.180           0.001            Yes
            Obs         0.624            0.150         0.104        0.120           0.008
    c       1          -0.222           -0.530         0.157        0.585           0.020            -
            2           0.985            0.055         0.044        0.066           0.003            Yes
            3          -0.400            0.485         0.304        0.199           0.024            -
            4           0.097            0.684         0.200        0.199           0.004            -
            5           0.666            0.121         0.104        0.066           0.005            -
            Obs         0.859            0.000         0.106        0.121           0.011
    d       1          -0.222           -0.530         0.157        0.530           0.018            -
            2           0.859            0.000         0.106        0.121           0.011            -
            3          -0.400            0.485         0.304        0.199           0.024            -
            4           0.097            0.684         0.200        0.199           0.004            Yes
            5           0.666            0.121         0.104        0.121           0.008            -
            Obs         0.227            0.640         0.135        0.155           0.005

*The relevances of units and of new observations are calculated according to equations 3.1 and 3.2, respectively, and the least relevant units are pruned and replaced by the new observations.
Although deterministic, its behavior is chaotic, as shown in Figure 3. The laser data given for the competition consisted of 1000 observations {y_n}, n = 1, ..., 1000, and the goal was to predict 100 observations ahead at five different times t = (1000, 2180, 3870, 4000, 5180) in the time series. Observations were coded into just 8 bits, between the values 3 and 255. No information was given about the dynamics of the laser data, nor about its embedding dimension and variable dependencies, also known as the time series representation. We used an algorithm based on geometric techniques to determine the embedding dimension of the laser data (Molina and Niranjan 1995). According to our algorithm, the most relevant past observations for the prediction of the next turned out to be the following 27 observations: {1, 2, 5, 7, 9, 11, 13, ..., 18, 21, 23, 25, 27, 34, 38, 41, 44, 49, 52, 61, 68, 69, 79, 95}.
Figure 3: The original 1000 laser observations provided in the Santa Fe time series prediction competition.
The task of prediction may be achieved using three different techniques. The first consists of using a one-step-ahead predictor from observation (y_t) and recursively predicting observation (y_{t+1}) until (y_{t+100}); in this case, estimated observations are used to construct input vectors. The second technique consists of directly predicting observation (y_{t+100}) from available data. Finally, a combination of both can be used (Sauer 1993). In our experiments, we implemented a recursive predictor, which turned out to be of sufficient quality for our prediction purposes. We used an LRAN network with 200 units that were initially randomly distributed in the input space. Their amplitudes α_k were initialized with zero values and their widths σ_k with the value 10^-1. The overlap factor λ was fixed at 0.87. During learning, the same gradient descent technique (LMS) used in Platt (1991) was employed to adapt the LRAN coefficients. The learning rate was fixed at 0.02. Training data were randomized and presented to the network for 20 epochs. Figure 4 shows the results obtained by LRAN under the above configuration for the one-step prediction of the laser data at 5 different starting points. The results clearly show that each prediction remains close to its corresponding observation. Numerical results are given in Table 2. Figure 5 shows the same portions of the laser time series, this time predicted using a recursive 100-step-ahead LRAN predictor. The results correspond closely to the original time series. Differences generally arose due to a small drift in the prediction at each step with respect to the target observation, particularly in the first portion, where the predictor completely misled the prediction at approximately time 1050. Despite this, the results still match the targets closely.
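The recursive scheme described above (feeding one-step predictions back in as inputs) can be sketched generically. This is an illustrative helper, not the authors' code: `predict_one_step` stands for any trained one-step model such as the LRAN, and `lags` for the selected lag indices.

```python
import numpy as np

def recursive_forecast(history, predict_one_step, lags, steps=100):
    """Iterate a one-step predictor to forecast `steps` values ahead.

    `predict_one_step` maps a vector of lagged observations to the next
    value; `lags` are the lag indices used to build that vector, counted
    back from the current time. Estimated values are appended to the
    series and reused as inputs for subsequent steps.
    """
    series = list(history)
    for _ in range(steps):
        x = np.array([series[-lag] for lag in lags])
        series.append(float(predict_one_step(x)))
    return series[len(history):]
```

Because each predicted value re-enters the input vector, small per-step errors can accumulate into the drift discussed above.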
Figure 4: LRAN one-step prediction (continuous curves) at different starting points of the laser time series. The dashed curves correspond to the true time series.
Figure 5: LRAN 100 recursive-step prediction (continuous curves) at different starting points of the laser time series. The dashed curves correspond to the true time series.
Table 2: NMSE Obtained by LRAN (1-Step Ahead and 100-Steps Ahead), Delay Coordinate Embedding (Sauer 1993), and Internal Delay Lines (Wan 1993) on the Prediction of Laser Data at Different Starting Points

    Starting point  NMSE one-step ahead  NMSE 100-steps ahead  NMSE Sauer  NMSE Wan
    1000            0.0102               0.2515                0.0770      0.0270
    2180            0.0082               0.0093                0.1740      0.0650
    3870            0.0119               0.1237                0.1830      0.4870
    4000            0.0072               0.0736                0.0060      0.0230
    5180            0.0119               0.0228                0.2540      0.1600
    Total           0.0494               0.4809                0.5510      0.7620
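Assuming the usual Santa Fe convention, the NMSE figures in Table 2 normalize the mean squared error by the variance of the true segment, so a constant predictor at the segment mean scores 1.0. A minimal sketch:

```python
import numpy as np

def nmse(y_true, y_pred):
    """Normalized mean squared error: MSE divided by the variance of the
    target segment (the convention used in the Santa Fe competition)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2) / np.var(y_true))
```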
5 Conclusion

We have presented an algorithm for pruning and replacing units in limited resource allocating networks. The pruning algorithm is based on the F-projection principle of sequential function estimation, and generalizes previous work on growing RANs (Kadirkamanathan and Niranjan 1993). The performance of the growing and pruning techniques for LRAN has been illustrated using the laser time series of the Santa Fe competition, showing that a neural network solution based on the reallocation of units predicts the time series as successfully as the approaches used by the winners of the Santa Fe competition.
Acknowledgment

The authors wish to thank Thomas Niesler and two anonymous referees for their valuable comments and suggestions. This research was supported by EPSRC grant No. GR/H16759.
References
Anderson, C. W. 1993. Q-learning with hidden-unit restarting. In Advances in NIPS, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., pp. 81-88.
Kadirkamanathan, V., and Niranjan, M. 1993. A function estimation approach to sequential learning with neural networks. Neural Comp. 5(6), 954-975.
LeCun, Y., Denker, J. S., and Solla, S. A. 1990. Optimal brain damage. In Advances in NIPS, D. S. Touretzky, ed., No. 2, pp. 598-605.
Molina, C., and Niranjan, M. 1995. Finding the Embedding Dimension of Time Series by Geometrical Techniques. Tech. Rep. CUED/F-INFENG/TR.221, Cambridge University Engineering Department, Cambridge, England.
Mozer, M. C., and Smolensky, P. 1989. Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Advances in NIPS, D. S. Touretzky, ed., No. 1, pp. 107-115.
Platt, J. C. 1991. A resource allocating network for function interpolation. Neural Comp. 3(2), 213-225.